Installation to Production of a Large-Scale General Purpose Graphics Processing Unit (GPGPU) Cluster at the U.S. Army Research Laboratory: Thufir

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.

Practical Implementation of Prestack Kirchhoff Time Migration on a General Purpose Graphics Processing Unit

Acta Geophysica ◽

10.1515/acgeo-2016-0033 ◽

2016 ◽

Vol 64 (4) ◽

pp. 1051-1063 ◽

Cited By ~ 2

Author(s):

Guofeng Liu ◽

Chun Li

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Practical Implementation ◽

Processing Unit ◽

Time Migration ◽

Graphics Processing

Comprehensive regression-based model to predict performance of general-purpose graphics processing unit

Cluster Computing ◽

10.1007/s10586-019-03011-2 ◽

2019 ◽

Vol 23 (2) ◽

pp. 1505-1516 ◽

Cited By ~ 2

Author(s):

Mohammad Hossein Shafiabadi ◽

Hossein Pedram ◽

Midia Reshadi ◽

Akram Reza

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Graphics Processing

AN EVALUATION OF MULTIPLE FEED-FORWARD NETWORKS ON GPUs

International Journal of Neural Systems ◽

10.1142/s0129065711002638 ◽

2011 ◽

Vol 21 (01) ◽

pp. 31-47 ◽

Cited By ~ 14

Author(s):

NOEL LOPES ◽

BERNARDETE RIBEIRO

Keyword(s):

Graphics Processing Unit ◽

Parallel Implementation ◽

Low Cost ◽

Back Propagation ◽

General Purpose ◽

Training System ◽

Graphics Hardware ◽

Processing Unit ◽

Data Parallel ◽

Graphics Processing

The Graphics Processing Unit (GPU), originally designed for rendering graphics and difficult to program for other tasks, has since evolved into a device suitable for general-purpose computations. As a result, graphics hardware has become progressively more attractive, yielding unprecedented performance at a relatively low cost. Thus, it is the ideal candidate to accelerate a wide variety of data parallel tasks in many fields, such as Machine Learning (ML). As problems become more and more demanding, parallel implementations of learning algorithms are crucial for a useful application. In particular, the implementation of Neural Networks (NNs) on GPUs can significantly reduce the long training times during the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms, and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performance compared to the implementation on traditional hardware, due to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU-based Autonomous Training System (ATS) is developed, which aims at automatically finding high-quality NN-based solutions for a given problem.

General purpose graphics-processing-unit implementation of cosmological domain wall network evolution

Physical Review E ◽

10.1103/physreve.96.043310 ◽

2017 ◽

Vol 96 (4) ◽

Cited By ~ 5

Author(s):

J. R. C. C. C. Correia ◽

C. J. A. P. Martins

Keyword(s):

Domain Wall ◽

Graphics Processing Unit ◽

General Purpose ◽

Network Evolution ◽

Processing Unit ◽

Graphics Processing

Splotch

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016652713 ◽

2016 ◽

Vol 31 (6) ◽

pp. 550-563

Author(s):

Timothy Dykes ◽

Claudio Gheller ◽

Marzia Rivi ◽

Mel Krokos

Keyword(s):

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Processing Unit ◽

Xeon Phi ◽

The Many ◽

Many Core ◽

Performance Results ◽

Graphics Processing ◽

Performance Computing

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogeneous high-performance computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel’s coprocessor based upon the new many integrated core architecture. We discuss steps taken to offload data to the coprocessor and algorithmic modifications to aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi. Finally, we compare performance against results achieved with the Graphics Processing Unit (GPU) based implementation of Splotch.

GPU-accelerated alignment of bisulfite-treated short-read sequences

10.1101/175729 ◽

2017 ◽

Author(s):

Richard Wilton ◽

Xin Li ◽

Andrew P. Feinberg ◽

Alexander S. Szalay

Keyword(s):

Dna Sequences ◽

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Short Read ◽

Wide Range ◽

Programming Logic ◽

Short Read Aligner ◽

Graphics Processing ◽

Better Than

The alignment of bisulfite-treated DNA sequences (BS-seq reads) to a large genome involves a significant computational burden beyond that required to align non-bisulfite-treated reads. In the analysis of BS-seq data, this can present an important performance bottleneck that can potentially be addressed by appropriate software-engineering and algorithmic improvements. One strategy is to integrate this additional programming logic into the read-alignment implementation in a way that the software becomes amenable to optimizations that lead to both higher speed and greater sensitivity than can be achieved without this integration. We have evaluated this approach using Arioc, a short-read aligner that uses GPU (general-purpose graphics processing unit) hardware to accelerate computationally-expensive programming logic. We integrated the BS-seq computational logic into both GPU and CPU code throughout the Arioc implementation. We then carried out a read-by-read comparison of Arioc's reported alignments with the alignments reported by the most widely used BS-seq read aligners. With simulated reads, Arioc's accuracy is equal to or better than the other read aligners we evaluated. With human sequencing reads, Arioc's throughput is at least 10 times faster than existing BS-seq aligners across a wide range of sensitivity settings. The Arioc software is available at https://github.com/RWilton/Arioc. It is released under a BSD open-source license.

Analysis of Fast Fourier Transformations algorithm for CUDA Architecture

Lietuvos matematikos rinkinys ◽

10.15388/lmr.b.2012.46 ◽

2012 ◽

Vol 53 ◽

Author(s):

Beatričė Andziulienė ◽

Evaldas Žulkas ◽

Audrius Kuprinavičius

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Fast Fourier Transformation ◽

Processing Unit ◽

Data Allocation ◽

Analysis Method ◽

Central Processing ◽

Execution Speed ◽

Cuda Architecture ◽

Graphics Processing

In this work, a Fast Fourier transformation algorithm for general-purpose graphics processing unit (GPGPU) processing is discussed. The algorithm structure and the performance of its individual stages were analysed. Using a performance analysis method, the possibilities for algorithm distribution and data allocation were determined, depending on the execution speed of the algorithm stages and the algorithm structure. The ratio between CPU and GPU execution during Fast Fourier transform signal processing was determined using computer-generated data with frequency. When adapting CPU code for CUDA execution, it does not become more complex, even if the stream processor parallelization and data transferring algorithm stages are considered. But central processing unit serial execution).
