cuGWAM: Genome-wide Association Multifactor Dimensionality reduction using CUDA-enabled high-performance graphics processing unit

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.

Download Full-text

Embedded GPU Implementation for High-Performance Ultrasound Imaging

Electronics ◽

10.3390/electronics10080884 ◽

2021 ◽

Vol 10 (8) ◽

pp. 884

Author(s):

Stefano Rossi ◽

Enrico Boni

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Digital Signal ◽

Processing Unit ◽

Embedded Computing ◽

Field Programmable ◽

Peripheral Component Interconnect ◽

Programmable Gate Arrays ◽

Graphics Processing ◽

Signal Processors

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.

Download Full-text

High-Performance, Graphics Processing Unit-Accelerated Fock Build Algorithm

Journal of Chemical Theory and Computation ◽

10.1021/acs.jctc.0c00768 ◽

2020 ◽

Vol 16 (12) ◽

pp. 7232-7238

Author(s):

Giuseppe M. J. Barca ◽

Jorge L. Galvez-Vallejo ◽

David L. Poole ◽

Alistair P. Rendell ◽

Mark S. Gordon

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Processing Unit ◽

Graphics Processing

Download Full-text

Ballooning Graphics Memory Space in Full GPU Virtualization Environments

Scientific Programming ◽

10.1155/2019/5240956 ◽

2019 ◽

Vol 2019 ◽

pp. 1-11

Author(s):

Younghun Park ◽

Minwoo Gu ◽

Sungyong Park

Keyword(s):

High Performance ◽

Virtual Machines ◽

Graphics Processing Unit ◽

Performance Degradation ◽

Processing Unit ◽

Memory Space ◽

Memory Size ◽

Memory Sharing ◽

Gpu Virtualization ◽

Graphics Processing

Advances in virtualization technology have enabled multiple virtual machines (VMs) to share resources in a physical machine (PM). With the widespread use of graphics-intensive applications, such as two-dimensional (2D) or 3D rendering, many graphics processing unit (GPU) virtualization solutions have been proposed to provide high-performance GPU services in a virtualized environment. Although elasticity is one of the major benefits in this environment, the allocation of GPU memory is still static in the sense that after the GPU memory is allocated to a VM, it is not possible to change the memory size at runtime. This causes underutilization of GPU memory or performance degradation of a GPU application due to the lack of GPU memory when an application requires a large amount of GPU memory. In this paper, we propose a GPU memory ballooning solution called gBalloon that dynamically adjusts the GPU memory size at runtime according to the GPU memory requirement of each VM and the GPU memory sharing overhead. The gBalloon extends the GPU memory size of a VM by detecting performance degradation due to the lack of GPU memory. The gBalloon also reduces the GPU memory size when the overcommitted or underutilized GPU memory of a VM creates additional overhead for the GPU context switch or the CPU load due to GPU memory sharing among the VMs. We implemented the gBalloon by modifying the gVirt, a full GPU virtualization solution for Intel’s integrated GPUs. Benchmarking results show that the gBalloon dynamically adjusts the GPU memory size at runtime, which improves the performance by up to 8% against the gVirt with 384 MB of high global graphics memory and 32% against the gVirt with 1024 MB of high global graphics memory.

Download Full-text

GPU empowered pipelines for calculating genome-wide kinship matrices with ultra-high dimensional genetic variants and facilitating 1D and 2D GWAS

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqz009 ◽

2019 ◽

Vol 2 (1) ◽

Author(s):

Wenchao Zhang ◽

Xinbin Dai ◽

Shizhong Xu ◽

Patrick X Zhao

Keyword(s):

Genetic Variants ◽

High Performance ◽

Genome Wide Association Study ◽

Graphics Processing Unit ◽

High Dimensional ◽

Processing Unit ◽

Kinship Matrix ◽

Matrix Operations ◽

Genome Wide ◽

Matrix Calculation

Abstract Genome-wide association study (GWAS) is a powerful approach that has revolutionized the field of quantitative genetics. Two-dimensional GWAS that accounts for epistatic genetic effects needs to consider the effects of marker pairs, thus quadratic genetic variants, compared to one-dimensional GWAS that accounts for individual genetic variants. Calculating genome-wide kinship matrices in GWAS that account for relationships among individuals represented by ultra-high dimensional genetic variants is computationally challenging. Fortunately, kinship matrix calculation involves pure matrix operations and the algorithms can be parallelized, particular on graphics processing unit (GPU)-empowered high-performance computing (HPC) architectures. We have devised a new method and two pipelines: KMC1D and KMC2D for kinship matrix calculation with high-dimensional genetic variants, respectively, facilitating 1D and 2D GWAS analyses. We first divide the ultra-high-dimensional markers and marker pairs into successive blocks. We then calculate the kinship matrix for each block and merge together the block-wise kinship matrices to form the genome-wide kinship matrix. All the matrix operations have been parallelized using GPU kernels on our NVIDIA GPU-accelerated server platform. The performance analyses show that the calculation speed of KMC1D and KMC2D can be accelerated by 100–400 times over the conventional CPU-based computing.

Download Full-text

Exploring Graphics Processing Unit (GPU) Resource Sharing Efficiency for High Performance Computing

Computers ◽

10.3390/computers2040176 ◽

2013 ◽

Vol 2 (4) ◽

pp. 176-214 ◽

Cited By ~ 5

Author(s):

Teng Li ◽

Vikram Narayana ◽

Tarek El-Ghazawi

Keyword(s):

High Performance Computing ◽

Resource Sharing ◽

High Performance ◽

Graphics Processing Unit ◽

Processing Unit ◽

Graphics Processing ◽

Performance Computing

Download Full-text

Multifactor dimensionality reduction for graphics processing units enables genome-wide testing of epistasis in sporadic ALS

Bioinformatics ◽

10.1093/bioinformatics/btq009 ◽

2010 ◽

Vol 26 (5) ◽

pp. 694-695 ◽

Cited By ~ 48

Author(s):

Casey S. Greene ◽

Nicholas A. Sinnott-Armstrong ◽

Daniel S. Himmelstein ◽

Paul J. Park ◽

Jason H. Moore ◽

...

Keyword(s):

Dimensionality Reduction ◽

Multifactor Dimensionality Reduction ◽

Graphics Processing Units ◽

Genome Wide ◽

Sporadic Als ◽

Graphics Processing ◽

Wide Testing

Download Full-text

Low-power and high-performance design of OpenGL ES 2.0 graphics processing unit for mobile applications

2015 IEEE International Conference on Digital Signal Processing (DSP) ◽

10.1109/icdsp.2015.7251840 ◽

2015 ◽

Cited By ~ 3

Author(s):

Shen-Fu Hsiao ◽

Shang-Yu Li ◽

Kai-Hsiang Tsao

Keyword(s):

Low Power ◽

Mobile Applications ◽

High Performance ◽

Graphics Processing Unit ◽

Processing Unit ◽

Opengl Es ◽

Graphics Processing ◽

High Performance Design

Download Full-text

Splotch

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016652713 ◽

2016 ◽

Vol 31 (6) ◽

pp. 550-563

Author(s):

Timothy Dykes ◽

Claudio Gheller ◽

Marzia Rivi ◽

Mel Krokos

Keyword(s):

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Processing Unit ◽

Xeon Phi ◽

The Many ◽

Many Core ◽

Performance Results ◽

Graphics Processing ◽

Performance Computing

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogenous high-performance computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel’s coprocessor based upon the new many integrated core architecture. We discuss steps taken to offload data to the coprocessor and algorithmic modifications to aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi. Finally we compare performance against results achieved with the Graphics Processing Unit (GPU) based implementation of Splotch.

Download Full-text