The Impact of Asynchronous GPGPU Behaviors on Stochastic Simulation

Author(s):  
John C. Steuben ◽  
Cameron J. Turner

This work examines the effect of one key aspect of General Purpose Graphics Processing Unit (GPGPU) computing on the realism and fidelity of stochastic simulations. In particular, it is shown that the asynchronous nature of GPGPU computing can be leveraged to produce greater fidelity and realism than conventional computing methods when applied to probabilistic or stochastic simulations. The argument has two parts: 1) asynchronous behaviors are essential for high computational throughput on GPGPU devices, which allows more rigorous sampling and, in turn, a deeper understanding of the underlying stochastic processes; 2) asynchronous GPGPU computing can eliminate the “global clock” present in simulations and thereby potentially produce a better representation of the underlying process. The paper also gives a working introduction to GPGPU computing and its applications in the field of stochastic simulation, and surveys a range of literature on these simulations to provide context. A demonstration of synchronous versus asynchronous algorithms for robot swarm path planning is used to illustrate this discussion. Several notes on the limitations of GPGPU computing in this field are also made, along with remarks on the future development of GPGPU-accelerated stochastic simulations.
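
To make the synchronous/asynchronous distinction concrete, here is a minimal CUDA sketch of a toy random-walk "swarm": the synchronous variant uses one kernel launch per time step, so the launch boundary acts as the global clock, while the asynchronous variant lets each thread advance its agent freely with no inter-step barrier. This is an illustrative assumption, not the paper's robot-swarm planner; all names and sizes are invented for the sketch.

```cuda
// Hedged sketch: toy 1-D random-walk "swarm" contrasting a globally clocked
// (synchronous) update with a free-running (asynchronous) update.
#include <curand_kernel.h>

__global__ void initRng(curandState *states, unsigned long long seed, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, i, 0, &states[i]);
}

// Synchronous style: one kernel launch per time step; the launch boundary
// acts as the "global clock" that keeps every agent in lockstep.
__global__ void stepAllAgentsOnce(float *pos, curandState *states, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos[i] += curand_normal(&states[i]);
}

// Asynchronous style: each thread advances its agent through all of its steps
// with no barrier between steps, so agents' local clocks drift relative to
// one another, as they would on real asynchronous hardware.
__global__ void stepAllAgentsFreeRunning(float *pos, curandState *states,
                                         int n, int steps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float p = pos[i];
    for (int s = 0; s < steps; ++s) p += curand_normal(&states[i]);
    pos[i] = p;
}

int main() {
    const int n = 1 << 16, steps = 1000, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    float *pos;  curandState *rng;
    cudaMalloc(&pos, n * sizeof(float));
    cudaMemset(pos, 0, n * sizeof(float));
    cudaMalloc(&rng, n * sizeof(curandState));
    initRng<<<blocks, threads>>>(rng, 1234ULL, n);

    // Synchronous: a global clock tick per launch.
    for (int s = 0; s < steps; ++s)
        stepAllAgentsOnce<<<blocks, threads>>>(pos, rng, n);

    // Asynchronous: a single launch, no inter-step barrier.
    stepAllAgentsFreeRunning<<<blocks, threads>>>(pos, rng, n, steps);

    cudaDeviceSynchronize();
    cudaFree(pos); cudaFree(rng);
    return 0;
}
```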

2013 ◽  
Vol 753-755 ◽  
pp. 2731-2735
Author(s):  
Wei Cao ◽  
Zheng Hua Wang ◽  
Chuan Fu Xu

The graphics processing unit (GPU) has evolved from a configurable graphics processor into a powerful engine for high-performance computing. In this paper, we describe the graphics pipeline of the GPU and introduce the history and evolution of GPU architecture. We also provide a summary of software environments used on the GPU, from graphics APIs to non-graphics APIs. Finally, we present GPU computing in computational fluid dynamics applications, including GPGPU computing for Navier-Stokes methods and for the Lattice Boltzmann method.
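
As a hint of what "GPGPU computing for CFD" looks like in practice, the following minimal CUDA sketch shows a 2-D Jacobi-style stencil update, the basic memory-access pattern that structured-grid Navier-Stokes solvers (and, with more per-cell state, Lattice Boltzmann streaming and collision steps) map onto the GPU. The grid size and kernel name are illustrative assumptions, not code from any solver discussed here.

```cuda
// Hedged sketch: a 2-D Jacobi-style stencil update, one cell per thread.
#define NX 512
#define NY 512

__global__ void jacobiStep(const float *in, float *out) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < NX - 1 && y > 0 && y < NY - 1) {
        int i = y * NX + x;
        // Four-point average; boundary cells are left fixed.
        out[i] = 0.25f * (in[i - 1] + in[i + 1] + in[i - NX] + in[i + NX]);
    }
}

int main() {
    float *a, *b;
    cudaMalloc(&a, NX * NY * sizeof(float));
    cudaMalloc(&b, NX * NY * sizeof(float));
    cudaMemset(a, 0, NX * NY * sizeof(float));
    cudaMemset(b, 0, NX * NY * sizeof(float));
    dim3 threads(16, 16), blocks(NX / 16, NY / 16);
    for (int it = 0; it < 100; ++it) {        // ping-pong the two buffers
        jacobiStep<<<blocks, threads>>>(a, b);
        jacobiStep<<<blocks, threads>>>(b, a);
    }
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b);
    return 0;
}
```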


2017 ◽  
Vol 27 (03n04) ◽  
pp. 1750006 ◽  
Author(s):  
Farhad Merchant ◽  
Anupam Chattopadhyay ◽  
Soumyendu Raha ◽  
S. K. Nandy ◽  
Ranjani Narayan

Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate the performance of those applications. Performance in such tuned packages is attained by tuning several algorithmic and architectural parameters, such as the number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, the sizes of the memories in the memory hierarchy of the underlying platform, the memory bandwidth, and the structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture on performance tuning of BLAS and LAPACK. We present a theoretical analysis of the pipeline depth of different floating-point operations, such as the multiplier, adder, square root, and divider, followed by a characterization of BLAS and LAPACK to determine several parameters required by the theoretical framework for deciding the optimum pipeline depth of the floating-point operations. A simple design of a Processing Element (PE) is presented, and it is shown that the PE outperforms the most recent custom realizations of BLAS and LAPACK by 1.1x to 1.5x in GFlops/W and 1.9x to 2.1x in GFlops/mm². Compared to multicore, General Purpose Graphics Processing Unit (GPGPU), Field Programmable Gate Array (FPGA), and ClearSpeed CSX700 platforms, a performance improvement of 1.8x to 80x is reported for the PE.
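
For orientation, the standard textbook trade-off behind choosing a pipeline depth can be written as follows; this is a generic relation under stated assumptions, not the paper's exact theoretical framework. Here N is the number of independent operations streamed through the unit, t the total combinational delay of the operation (e.g., an FP multiply), and c the per-stage latch overhead.

```latex
% Hedged sketch: the classical pipeline-depth trade-off, not the paper's
% exact framework. N, t, and c are defined in the lead-in above.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
For a $d$-stage pipeline, the time to complete $N$ independent operations is
\[
  T(d) = (N + d - 1)\left(\frac{t}{d} + c\right),
\]
and setting $\partial T/\partial d = 0$ gives the throughput-optimal depth
\[
  d^{*} = \sqrt{\frac{(N-1)\,t}{c}} .
\]
Deeper pipelines therefore pay off only while enough independent operations
(e.g., from the dependency DAG of a BLAS routine) are available to keep the
stages full.
\end{document}
```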


2019 ◽  
Vol 23 (2) ◽  
pp. 1505-1516 ◽  
Author(s):  
Mohammad Hossein Shafiabadi ◽  
Hossein Pedram ◽  
Midia Reshadi ◽  
Akram Reza

Information ◽  
2020 ◽  
Vol 11 (4) ◽  
pp. 193 ◽  
Author(s):  
Sebastian Raschka ◽  
Joshua Patterson ◽  
Corey Nolet

Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lie the tools and methods that are driving it, from processing the massive volumes of data generated each day to learning from them and taking useful action. Deep neural networks, along with advancements in classical machine learning and scalable general-purpose graphics processing unit (GPU) computing, have become critical components of artificial intelligence, enabling many of these astounding breakthroughs and lowering the barrier to adoption. Python continues to be the preferred language for scientific computing, data science, and machine learning, boosting both performance and productivity by enabling the use of low-level libraries and clean high-level APIs. This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it. We cover widely used libraries and concepts, collected together for holistic comparison, with the goal of educating the reader and driving the field of Python machine learning forward.


2011 ◽  
Vol 21 (01) ◽  
pp. 31-47 ◽  
Author(s):  
NOEL LOPES ◽  
BERNARDETE RIBEIRO

The Graphics Processing Unit (GPU), originally designed for rendering graphics and difficult to program for other tasks, has since evolved into a device suitable for general-purpose computations. As a result, graphics hardware has become progressively more attractive, yielding unprecedented performance at relatively low cost. It is thus the ideal candidate to accelerate a wide variety of data-parallel tasks in many fields, such as Machine Learning (ML). As problems become more and more demanding, parallel implementations of learning algorithms are crucial for useful applications. In particular, implementing Neural Networks (NNs) on GPUs can significantly reduce the long training times of the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms, and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performance compared to implementations on traditional hardware, owing to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU-based Autonomous Training System (ATS) is developed, which aims at automatically finding high-quality NN-based solutions for a given problem.
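
As an illustration of the kernel shapes involved, here is a minimal CUDA sketch of the two core operations of GPU back-propagation: a forward pass with one thread per neuron and a gradient-descent weight update with one thread per weight. Layer sizes and names are assumptions made for the sketch; these are not the BP/MBP kernels described in the paper.

```cuda
// Hedged sketch: the two kernel shapes at the heart of GPU back-propagation.

// Forward pass for one fully connected layer: out[j] = sigmoid(b[j] + W x).
__global__ void forwardLayer(const float *W, const float *bias,
                             const float *in, float *out,
                             int nIn, int nOut) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= nOut) return;
    float sum = bias[j];
    for (int i = 0; i < nIn; ++i)
        sum += W[j * nIn + i] * in[i];
    out[j] = 1.0f / (1.0f + expf(-sum));      // sigmoid activation
}

// Gradient-descent update: one thread per weight; delta[j] holds the
// back-propagated error term for output neuron j.
__global__ void updateWeights(float *W, const float *delta,
                              const float *in, float lr,
                              int nIn, int nOut) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= nIn * nOut) return;
    int j = k / nIn, i = k % nIn;
    W[k] -= lr * delta[j] * in[i];
}

int main() {
    const int nIn = 784, nOut = 128, threads = 256;
    float *W, *b, *x, *y, *delta;
    cudaMalloc(&W, nIn * nOut * sizeof(float));
    cudaMalloc(&b, nOut * sizeof(float));
    cudaMalloc(&x, nIn * sizeof(float));
    cudaMalloc(&y, nOut * sizeof(float));
    cudaMalloc(&delta, nOut * sizeof(float));
    cudaMemset(W, 0, nIn * nOut * sizeof(float));
    cudaMemset(b, 0, nOut * sizeof(float));
    cudaMemset(x, 0, nIn * sizeof(float));
    cudaMemset(delta, 0, nOut * sizeof(float));
    forwardLayer<<<(nOut + threads - 1) / threads, threads>>>(W, b, x, y,
                                                              nIn, nOut);
    updateWeights<<<(nIn * nOut + threads - 1) / threads, threads>>>(
        W, delta, x, 0.1f, nIn, nOut);
    cudaDeviceSynchronize();
    cudaFree(W); cudaFree(b); cudaFree(x); cudaFree(y); cudaFree(delta);
    return 0;
}
```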


Author(s):  
Jucele França de Alencar Vasconcellos ◽  
Edson Norberto Cáceres ◽  
Henrique Mongelli ◽  
Siang Wun Song ◽  
Frank Dehne ◽  
...  

Computing a spanning tree (ST) and a minimum ST (MST) of a graph are fundamental problems in graph theory and arise as subproblems in many applications. In this article, we propose parallel algorithms for these problems. One of the steps of previous parallel MST algorithms relies on the heavy use of parallel list ranking which, though efficient in theory, is very time-consuming in practice. Using a different approach based on graph decomposition, we devised new parallel algorithms that do not make use of the list ranking procedure. We proved that our algorithms are correct and that, for a graph G = (V, E) with |V| = n and |E| = m, they can be executed on a Bulk Synchronous Parallel/Coarse Grained Multicomputer (BSP/CGM) model with p processors using O(log p) communication rounds with O((m+n)/p) computation time for each round. To show that our algorithms perform well on real parallel machines, we have implemented them on a graphics processing unit. The obtained speedups are competitive and show that the BSP/CGM model is suitable for designing general-purpose parallel algorithms.
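
For a flavor of how such graph algorithms map onto the GPU, the following hedged CUDA sketch shows the "minimum-weight outgoing edge per component" selection that Borůvka-style parallel ST/MST algorithms repeat until one component remains, using a 64-bit atomicMin over packed (weight, edge id) values. The edge-list layout and names are assumptions; this is not the article's BSP/CGM decomposition algorithm.

```cuda
// Hedged sketch: per-component minimum outgoing edge selection, a standard
// building block of Boruvka-style parallel ST/MST algorithms. Requires
// compute capability >= 3.5 for 64-bit atomicMin.
#include <cstdint>

// comp[v] is the current component id of vertex v; edge e is (u[e], v[e], w[e]).
// best[c] packs (weight << 32 | edge id), so atomicMin picks the lightest edge.
__global__ void minOutgoingEdge(const int *u, const int *v, const uint32_t *w,
                                const int *comp, unsigned long long *best,
                                int nEdges) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nEdges) return;
    int cu = comp[u[e]], cv = comp[v[e]];
    if (cu == cv) return;                       // internal edge, skip
    unsigned long long packed =
        ((unsigned long long)w[e] << 32) | (unsigned long long)e;
    atomicMin(&best[cu], packed);               // candidate for u's component
    atomicMin(&best[cv], packed);               // candidate for v's component
}

int main() {
    // Tiny 4-vertex example: edges (0-1,w=5), (1-2,w=3), (2-3,w=8), (3-0,w=4).
    const int nE = 4, nV = 4;
    int hu[nE] = {0, 1, 2, 3}, hv[nE] = {1, 2, 3, 0};
    uint32_t hw[nE] = {5, 3, 8, 4};
    int hcomp[nV] = {0, 1, 2, 3};               // every vertex its own component
    int *u, *v, *comp; uint32_t *w; unsigned long long *best;
    cudaMalloc(&u, sizeof hu);  cudaMemcpy(u, hu, sizeof hu, cudaMemcpyHostToDevice);
    cudaMalloc(&v, sizeof hv);  cudaMemcpy(v, hv, sizeof hv, cudaMemcpyHostToDevice);
    cudaMalloc(&w, sizeof hw);  cudaMemcpy(w, hw, sizeof hw, cudaMemcpyHostToDevice);
    cudaMalloc(&comp, sizeof hcomp);
    cudaMemcpy(comp, hcomp, sizeof hcomp, cudaMemcpyHostToDevice);
    cudaMalloc(&best, nV * sizeof(unsigned long long));
    cudaMemset(best, 0xFF, nV * sizeof(unsigned long long));  // ULLONG_MAX sentinel
    minOutgoingEdge<<<1, 256>>>(u, v, w, comp, best, nE);
    cudaDeviceSynchronize();
    cudaFree(u); cudaFree(v); cudaFree(w); cudaFree(comp); cudaFree(best);
    return 0;
}
```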


Author(s):  
Driss En-Nejjary ◽  
Francois Pinet ◽  
Myoung-Ah Kang

The acquisition of geo-referenced data has recently made a huge technological leap forward in the field of information systems. Optimizing the processing of such data is a real issue, and different research works have proposed multi-core approaches to analyze large geo-referenced datasets. In this article, different methods based on general-purpose computing on the graphics processing unit (GPGPU) are modelled and compared to parallelize overlapping aggregations of raster sequences. Our methods are tested on a sequence of rasters representing the evolution of temperature over time for the same region. Each raster corresponds to a different data acquisition time period, and each geo-referenced raster cell is associated with a temperature value. This article proposes optimized methods to calculate the average temperature of the region for all possible raster subsequences of a determined length, i.e., to calculate overlapping aggregated data summaries. In these aggregations, the same subsets of values are aggregated several times. This type of aggregation can be useful in different environmental data analyses, e.g., to pre-calculate all the average temperatures in a database. The article highlights a significant increase in performance, showing that GPGPU parallel processing enabled us to run the aggregations more than 50 times faster than the sequential method when the data transfer cost is included, and more than 200 times faster when it is not.
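
Here is a minimal CUDA sketch of the overlapping-window idea, under assumptions about the data layout (it is not the article's exact method): once each raster has been reduced to a single regional sum, a prefix sum over those per-raster sums lets each length-L subsequence average be produced by one thread with two reads, so the shared subsets are never re-aggregated.

```cuda
// Hedged sketch: overlapping window averages over a raster time series via
// per-raster reduction plus a prefix sum. Names and layout are assumptions.
#include <vector>

// Each thread folds one cell into its raster's regional sum.
// (atomicAdd on double requires compute capability >= 6.0.)
__global__ void rasterSums(const double *cells, double *sums,
                           int nRasters, long long cellsPerRaster) {
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i >= (long long)nRasters * cellsPerRaster) return;
    atomicAdd(&sums[i / cellsPerRaster], cells[i]);
}

// One thread per window: the average of rasters [i, i+L) is two prefix reads.
__global__ void windowAverages(const double *prefix, double *avg,
                               int nRasters, int L, long long cellsPerRaster) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > nRasters - L) return;
    avg[i] = (prefix[i + L] - prefix[i]) / ((double)L * cellsPerRaster);
}

int main() {
    const int nR = 365, L = 30;                 // e.g., daily rasters, 30-day windows
    const long long cpr = 256 * 256;            // cells per raster
    double *dCells, *dSums, *dPrefix, *dAvg;
    cudaMalloc(&dCells, nR * cpr * sizeof(double));
    cudaMemset(dCells, 0, nR * cpr * sizeof(double));
    cudaMalloc(&dSums, nR * sizeof(double));
    cudaMemset(dSums, 0, nR * sizeof(double));
    long long total = (long long)nR * cpr;
    rasterSums<<<(int)((total + 255) / 256), 256>>>(dCells, dSums, nR, cpr);

    // Exclusive prefix sum on the host: prefix[0] = 0, prefix[k] = sum of
    // the first k per-raster sums.
    std::vector<double> sums(nR), prefix(nR + 1, 0.0);
    cudaMemcpy(sums.data(), dSums, nR * sizeof(double), cudaMemcpyDeviceToHost);
    for (int i = 0; i < nR; ++i) prefix[i + 1] = prefix[i] + sums[i];

    cudaMalloc(&dPrefix, (nR + 1) * sizeof(double));
    cudaMemcpy(dPrefix, prefix.data(), (nR + 1) * sizeof(double),
               cudaMemcpyHostToDevice);
    cudaMalloc(&dAvg, (nR - L + 1) * sizeof(double));
    windowAverages<<<(nR - L + 256) / 256, 256>>>(dPrefix, dAvg, nR, L, cpr);
    cudaDeviceSynchronize();
    cudaFree(dCells); cudaFree(dSums); cudaFree(dPrefix); cudaFree(dAvg);
    return 0;
}
```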


2017 ◽  
Author(s):  
Richard Wilton ◽  
Xin Li ◽  
Andrew P. Feinberg ◽  
Alexander S. Szalay

The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can potentially be addressed by appropriate software-engineering and algorithmic improvements. One strategy is to integrate this additional programming logic into the read-alignment implementation in such a way that the software becomes amenable to optimizations that lead to both higher speed and greater sensitivity than can be achieved without this integration.

We have evaluated this approach using Arioc, a short-read aligner that uses GPU (general-purpose graphics processing unit) hardware to accelerate computationally expensive programming logic. We integrated the BS-seq computational logic into both GPU and CPU code throughout the Arioc implementation. We then carried out a read-by-read comparison of Arioc's reported alignments with the alignments reported by the most widely used BS-seq read aligners. With simulated reads, Arioc's accuracy is equal to or better than that of the other read aligners we evaluated. With human sequencing reads, Arioc's throughput is at least 10 times faster than existing BS-seq aligners across a wide range of sensitivity settings.

The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.
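
The extra logic that BS-seq alignment adds is, at its core, in-silico base conversion (unmethylated C sequences as T, with G-to-A conversion used for the opposite strand). The following standalone CUDA kernel is a hedged sketch of just that conversion step; it is not how Arioc integrates the logic into its pipeline, and the buffer layout and names are assumptions.

```cuda
// Hedged sketch: in-silico C->T conversion of read bases before seed
// matching, the core of the extra BS-seq alignment logic.
__global__ void convertCtoT(char *bases, long long n) {
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i >= n) return;
    char b = bases[i];
    if (b == 'C' || b == 'c') bases[i] = 'T';   // unmethylated C reads as T
}

int main() {
    const char h[] = "ACGTCCGA";
    char *d; long long n = sizeof h - 1;
    cudaMalloc(&d, n);
    cudaMemcpy(d, h, n, cudaMemcpyHostToDevice);
    convertCtoT<<<1, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```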

