Method of parallelization of loops for grid calculation problems on GPU accelerators

2017 ◽  
pp. 059-066
Author(s):  
А.Yu. Doroshenko ◽  
◽  
O.G. Beketov ◽  

The formal parallelizing transformation of a nest of calculation loop for SIMD architecture devices, particularly for graphics processing units applying CUDA technology and heterogeneous clusters is developed. Procedure of transition from sequential to parallel algorithm is described and illustrated. Serialization of data is applied to optimize processing of large volumes of data. The advantage of the suggested method is its applicability for transformation of data which volumes exceed the memory of operating device. The experiment is conducted to demonstrate feasibility of the proposed approach. Technique presented in the provides the basis for further practical implementation of the automated system for parallelizing of nested loops.

2013 ◽  
Vol 2013 ◽  
pp. 1-15 ◽  
Author(s):  
Carlos Couder-Castañeda ◽  
Carlos Ortiz-Alemán ◽  
Mauricio Gabriel Orozco-del-Castillo ◽  
Mauricio Nava-Flores

An implementation with the CUDA technology in a single and in several graphics processing units (GPUs) is presented for the calculation of the forward modeling of gravitational fields from a tridimensional volumetric ensemble composed by unitary prisms of constant density. We compared the performance results obtained with the GPUs against a previous version coded in OpenMP with MPI, and we analyzed the results on both platforms. Today, the use of GPUs represents a breakthrough in parallel computing, which has led to the development of several applications with various applications. Nevertheless, in some applications the decomposition of the tasks is not trivial, as can be appreciated in this paper. Unlike a trivial decomposition of the domain, we proposed to decompose the problem by sets of prisms and use different memory spaces per processing CUDA core, avoiding the performance decay as a result of the constant calls to kernels functions which would be needed in a parallelization by observations points. The design and implementation created are the main contributions of this work, because the parallelization scheme implemented is not trivial. The performance results obtained are comparable to those of a small processing cluster.


2015 ◽  
Vol 61 (4) ◽  
pp. 357-363 ◽  
Author(s):  
Karolina Szczepankiewicz ◽  
Mateusz Malanowski ◽  
Michał Szczepankiewicz

Abstract In the paper an implementation of signal processing chain for a passive radar is presented. The passive radar which was developed at the Warsaw University of Technology, uses FM radio and DVB-T television transmitters as ”illuminators of opportunity”. As the computational load associated with passive radar processing is very high, NVIDIA CUDA technology has been employed for effective implementation using parallel processing. The paper contains the description of the algorithms implementation and the performance results analysis.


In this chapter, some examples of application of the developed software tools for design, generation, transformation, and optimization of programs for multicore processors and graphics processing units are considered. In particular, the algebra-algorithmic-integrated toolkit for design and synthesis of programs (IDS) and the rewriting rules system TermWare.NET are applied for design and parallelization of programs for multicore central processing units. The developed algebra-dynamic models and the rewriting rules toolkit are used for parallelization and optimization of programs for NVIDIA GPUs supporting the CUDA technology. The TuningGenie framework is applied for parallel program auto-tuning: optimization of sorting, Brownian motion simulation, and meteorological forecasting programs to a target platform. The parallelization of Fortran programs using the rewriting rules technique on sample problems in the field of quantum chemistry is examined.


Author(s):  
А.К. Новиков ◽  
C.П. Копысов ◽  
Н.С. Недожогин

Исследуются возможности ускорения предобусловленных методов бисопряженных градиентов (BiCGStab, Bi-Conjugate Gradient Stabilized) с предобусловливателем на основе аппроксимации обращения матрицы по формуле Шермана-Моррисона. Рассмотрена новая форма параллельного алгоритма, использующая матрично-векторные произведения при формирования матриц предобусловливателя. Показана эффективность распараллеливания наиболее ресурсоемких операций этого предобусловливателя на графических процессорах. Acceleration of preconditioned bi-conjugate gradient stabilized (BiCGStab) methods with preconditioners based on the matrix approximation by the Sherman-Morrison inversion formula is studied. A new form of the parallel algorithm using matrix-vector products to generate preconditioning matrices is proposed. A parallelization efficiency of the most resource-intensive operations of such preconditioners on multi-core central and graphics processing units (CPUs and GPUs) is shown.


2020 ◽  
pp. 353-385
Author(s):  
James Reinders ◽  
Ben Ashbaugh ◽  
James Brodman ◽  
Michael Kinsner ◽  
John Pennycook ◽  
...  

Abstract Over the last few decades, Graphics Processing Units (GPUs) have evolved from specialized hardware devices capable of drawing images on a screen to general-purpose devices capable of executing complex parallel kernels. Nowadays, nearly every computer includes a GPU alongside a traditional CPU, and many programs may be accelerated by offloading part of a parallel algorithm from the CPU to the GPU.


Sign in / Sign up

Export Citation Format

Share Document