GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform

International Journal of Antennas and Propagation ◽

10.1155/2014/321081 ◽

2014 ◽

Vol 2014 ◽

pp. 1-8 ◽

Cited By ~ 2

Author(s):

Ronglin Jiang ◽

Shugang Jiang ◽

Yu Zhang ◽

Ying Xu ◽

Lei Xu ◽

...

Keyword(s):

Message Passing ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Problem Size ◽

Central Processing ◽

Execution Speed ◽

Speedup Ratio ◽

Electromagnetic Calculations ◽

Graphics Processing

This paper introduces a (finite difference time domain) FDTD code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small. However, this code can enlarge the maximum problem size by 25% without reducing the performance of traditional pure GPU code. Finally, using this code, a microstrip antenna array with16×18elements is calculated and the radiation patterns are compared with the ones of MoM. Results show that there is a well agreement between them.

Download Full-text

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.

Download Full-text

Analysis of Fast Fourier Transformations algorithm for CUDA Architecture

Lietuvos matematikos rinkinys ◽

10.15388/lmr.b.2012.46 ◽

2012 ◽

Vol 53 ◽

Author(s):

Beatričė Andziulienė ◽

Evaldas Žulkas ◽

Audrius Kuprinavičius

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Fast Fourier Transformation ◽

Processing Unit ◽

Data Allocation ◽

Analysis Method ◽

Central Processing ◽

Execution Speed ◽

Cuda Architecture ◽

Graphics Processing

In this work Fast Fourier transformation algorithm for general purpose graphics processing unit processing (GPGPU) is discussed. Algorithm structure and individual stages performance were analysed. With performance analysis method algorithm distribution and data allocation possibilities were determined, depending on algorithm stages execution speed and algorithm structure. Ratio between CPU and GPU execution during Fast Fourier transform signal processing was determined using computer-generated data with frequency. When adopting CPU code for CUDA execution, it not becomes more complex, even if stream procesor parallelization and data transfering algorith stages are considered. But central processing unit serial execution).

Download Full-text

GPU Computation and Platforms

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Emerging Research Surrounding Power Consumption and Performance Issues in Utility Computing ◽

10.4018/978-1-4666-8853-7.ch007 ◽

2016 ◽

pp. 136-174

Author(s):

K. Bhargavi ◽

Sathish Babu B.

Keyword(s):

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

Gpu Computing ◽

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Computing Platforms ◽

Computationally Intensive ◽

Graphics Processing

The GPUs (Graphics Processing Unit) were mainly used to speed up computation intensive high performance computing applications. There are several tools and technologies available to perform general purpose computationally intensive application. This chapter primarily discusses about GPU parallelism, applications, probable challenges and also highlights some of the GPU computing platforms, which includes CUDA, OpenCL (Open Computing Language), OpenMPC (Open MP extended for CUDA), MPI (Message Passing Interface), OpenACC (Open Accelerator), DirectCompute, and C++ AMP (C++ Accelerated Massive Parallelism). Each of these platforms is discussed briefly along with their advantages and disadvantages.

Download Full-text

GRAPHICS PROCESSING UNIT CLUSTER ACCELERATED MONTE CARLO SIMULATION OF PHOTON TRANSPORT IN MULTI-LAYERED TISSUES

Journal of Innovative Optical Health Sciences ◽

10.1142/s1793545812500046 ◽

2012 ◽

Vol 05 (02) ◽

pp. 1250004 ◽

Cited By ~ 5

Author(s):

CHAO JIANG ◽

HENG HE ◽

PENGCHENG LI ◽

QINGMING LUO

Keyword(s):

Monte Carlo Simulation ◽

Monte Carlo ◽

Message Passing ◽

Message Passing Interface ◽

Local Area Network ◽

Graphics Processing Unit ◽

Area Network ◽

Processing Unit ◽

Photon Transport ◽

Graphics Processing

We present a graphics processing unit (GPU) cluster-based Monte Carlo simulation of photon transport in multi-layered tissues. The cluster is composed of multiple computing nodes in a local area network where each node is a personal computer equipped with one or several GPU(s) for parallel computing. In this study, the MPI (Message Passing Interface), the OpenMP (Open Multi-Processing) and the CUDA (Compute Unified Device Architecture) technologies are employed to develop the program. It is demonstrated that this designing runs roughly N times faster than that using single GPU when the GPUs within the cluster are of the same type, where N is the total number of the GPUs within the cluster.

Download Full-text

Numerical simulation of flattened heat pipe with double heat sources for CPU and GPU cooling application in laptop computers

Journal of Computational Design and Engineering ◽

10.1093/jcde/qwaa091 ◽

2020 ◽

Author(s):

Wisoot Sanhan ◽

Kambiz Vafai ◽

Niti Kammuang-Lue ◽

Pradit Terdtoon ◽

Phrut Sakulchangsatjatai

Keyword(s):

Experimental Data ◽

Heat Pipe ◽

Graphics Processing Unit ◽

Processing Unit ◽

Heat Sources ◽

Final Thickness ◽

Laptop Computers ◽

Central Processing ◽

Graphics Processing ◽

Good Agreement

Abstract An investigation of the effect of the thermal performance of the flattened heat pipe on its double heat sources acting as central processing unit and graphics processing unit in laptop computers is presented in this work. A finite element method is used for predicting the flattening effect of the heat pipe. The cylindrical heat pipe with a diameter of 6 mm and the total length of 200 mm is flattened into three final thicknesses of 2, 3, and 4 mm. The heat pipe is placed under a horizontal configuration and heated with heater 1 and heater 2, 40 W in combination. The numerical model shows good agreement compared with the experimental data with the standard deviation of 1.85%. The results also show that flattening the cylindrical heat pipe to 66.7 and 41.7% of its original diameter could reduce its normalized thermal resistance by 5.2%. The optimized final thickness or the best design final thickness for the heat pipe is found to be 2.5 mm.

Download Full-text

HPC simulations of brownout: A noninteracting particles dynamic model

The International Journal of High Performance Computing Applications ◽

10.1177/1094342020905971 ◽

2020 ◽

Vol 34 (3) ◽

pp. 267-281

Author(s):

Roberto Porcù ◽

Edie Miglio ◽

Nicola Parolini ◽

Mattia Penati ◽

Noemi Vergopolan

Keyword(s):

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Time Integration ◽

Computational Cost ◽

Aircraft Design ◽

Euler Method ◽

Processing Unit ◽

Integration Algorithm

Helicopters can experience brownout when flying close to a dusty surface. The uplifting of dust in the air can remarkably restrict the pilot’s visibility area. Consequently, a brownout can disorient the pilot and lead to the helicopter collision against the ground. Given its risks, brownout has become a high-priority problem for civil and military operations. Proper helicopter design is thus critical, as it has a strong influence over the shape and density of the cloud of dust that forms when brownout occurs. A way forward to improve aircraft design against brownout is the use of particle simulations. For simulations to be accurate and comparable to the real phenomenon, billions of particles are required. However, using a large number of particles, serial simulations can be slow and too computationally expensive to be performed. In this work, we investigate an message passing interface (MPI) + graphics processing unit (multi-GPU) approach to simulate brownout. In specific, we use a semi-implicit Euler method to consider the particle dynamics in a Lagrangian way, and we adopt a precomputed aerodynamic field. Here, we do not include particle–particle collisions in the model; this allows for independent trajectories and effective model parallelization. To support our methodology, we provide a speedup analysis of the parallelization concerning the serial and pure-MPI simulations. The results show (i) very high speedups of the MPI + multi-GPU implementation with respect to the serial and pure-MPI ones, (ii) excellent weak and strong scalability properties of the implemented time-integration algorithm, and (iii) the possibility to run realistic simulations of brownout with billions of particles at a relatively small computational cost. This work paves the way toward more realistic brownout simulations, and it highlights the potential of high-performance computing for aiding and advancing aircraft design for brownout mitigation.

Download Full-text

ALGORITHM OF SKELETON-BASED STATIC HAND GESTURE RECOGNITION

Vestnik komp iuternykh i informatsionnykh tekhnologii ◽

10.14489/vkit.2020.05.pp.013-022 ◽

2020 ◽

pp. 13-22

Author(s):

D. A. Kalina ◽

R. V. Golovanov ◽

D. V. Vorotnev

Keyword(s):

Gesture Recognition ◽

Graphics Processing Unit ◽

Recognition System ◽

Machine Learning Algorithms ◽

Support Vector ◽

Processing Unit ◽

The Novel ◽

Central Processing ◽

Graphics Processing ◽

Artificial Network

We present the monocamera approach of static hand gestures recognition based on skeletonization. The problem of creating skeleton of the human’s hand, as well as body, became solvable a few years ago after inventing so called convolutional pose machines – the novel architecture of artificial neural network. Our solution uses such kind of pretrained convolutional artificial network for extracting hand joints keypoints with further skeleton reconstruction. In this work we also propose special skeleton descriptor with proving its stability and distinguishability in terms of classification. We considered a few widespread machine learning algorithms to build and verify different classifiers. The quality of the classifier’s recognition is estimated using the wellknown Accuracy metric, which identified that classical SVM (Support Vector Machines) with radial basis kernel gives the best results. The testing of the whole system was conducted using public databases containing about 3000 of test images for more than 10 types of gestures. The results of a comparative analysis of the proposed system with existing approaches are demonstrated. It is shown that our gesture recognition system provides better quality in comparison with existing solutions. The performance of the proposed system was estimated for two configurations of standard personal computer: with CPU (Central Processing Unit) only and with GPU (Graphics Processing Unit) in addition where the latest one provides realtime processing with up to 60 frames per second. Thus we demonstrate that the proposed approach can find an application in the practice.

Download Full-text

GPU Accelerated Reconstruction in Compton Scattering Tomography Using Matrix Compression

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.519-520.102 ◽

2014 ◽

Vol 519-520 ◽

pp. 102-107

Author(s):

Yu Fei Yu ◽

Bin Yan ◽

Biao Wang ◽

Lei Li ◽

Yu Han ◽

...

Keyword(s):

Compton Scattering ◽

Graphics Processing Unit ◽

Sparse Matrix ◽

Reconstruction Algorithm ◽

Processing Unit ◽

Matrix Vector Multiplication ◽

Speedup Ratio ◽

Parallel Features ◽

Graphics Processing ◽

Matrix Vector

An acceleration strategy for TV-ADM reconstruction algorithm in Compton scattering tomography (CST) is proposed. By analyzing the sparse characteristic of CST projection matrixes, firstly, the sparse matrix vector CSR format and ELL format are used to store them, which greatly reduce the memory consumption. Then, a Sparse Matrix Vector multiplication (SpMV) method is utilized to accelerate the projector and back projector process. Finally, based on the parallel features, the TV-ADM is computed with Graphics Processing Unit (GPU). Numerical experiments show that the TV-ADM with the presented acceleration strategy could achieve a 96 times speedup ratio and 224 times memory compression ratio without precision loss.

Download Full-text

Graphics processing unit implementation of the F-statistic for continuous gravitational wave searches

Classical and Quantum Gravity ◽

10.1088/1361-6382/ac4616 ◽

2021 ◽

Author(s):

Liam Dunn ◽

Patrick Clearwater ◽

Andrew Melatos ◽

Karl Wette

Keyword(s):

Gravitational Wave ◽

Graphics Processing Units ◽

Graphics Processing Unit ◽

Computational Cost ◽

Processing Unit ◽

Central Processing ◽

Long Baseline ◽

Using Data ◽

Graphics Processing ◽

Gpu Implementation

Abstract The F-statistic is a detection statistic used widely in searches for continuous gravitational waves with terrestrial, long-baseline interferometers. A new implementation of the F-statistic is presented which accelerates the existing "resampling" algorithm using graphics processing units (GPUs). The new implementation runs between 10 and 100 times faster than the existing implementation on central processing units without sacrificing numerical accuracy. The utility of the GPU implementation is demonstrated on a pilot narrowband search for four newly discovered millisecond pulsars in the globular cluster Omega Centauri using data from the second Laser Interferometer Gravitational-Wave Observatory observing run. The computational cost is 17:2 GPU-hours using the new implementation, compared to 1092 core-hours with the existing implementation.

Download Full-text

Paralelização do Algoritmo Floyd-Warshall usando GPU

10.5753/wscad.2013.16769 ◽

2013 ◽

Author(s):

Roussian R. A. Gaioso ◽

Walid A. R. Jradi ◽

Lauro C. M. de Paula ◽

Wanderley De S. Alencar ◽

Wellington S. Martins ◽

...

Keyword(s):

Graphics Processing Unit ◽

Central Processing Unit ◽

Processing Unit ◽

Central Processing ◽

Graphics Processing

Este artigo apresenta uma implementação paralela baseada em Graphics Processing Unit (GPU) para o problema da identiﬁcação dos caminhos mínimos entre todos os pares de vértices em um grafo. A implementação é baseada no algoritmo Floyd-Warshall e tira o máximo proveito da arquitetura multithreaded das GPUs atuais. Nossa solução reduz a comunicação entre a Central Processing Unit (CPU) e a GPU, melhora a utilização dos Streaming Multiprocessors (SMs) e faz um uso intensivo de acesso aglutinado em memória para otimizar o acesso de dados do grafo. A vantagem da implementação proposta é demonstrada por vários grafos gerados aleatoriamente utilizando a ferramenta GTgraph. Grafos contendo milhares de vértices foram gerados e utilizados nos experimentos. Os resultados mostraram um excelente desempenho em diversos grafos, alcançando ganhos de até 149x, quando comparado com uma implementação sequencial, e superando implementações tradicionais por um fator de quase quatro vezes. Nossos resultados conﬁrmam que implementações baseadas em GPU podem ser viáveis mesmo para algoritmos de grafos cujo acessos à memória e distribuição de trabalho são irregulares e causam dependência de dados.

Download Full-text