Granular Dynamics Simulation on Multiple GPUs Using Domain Decomposition

Author(s):  
Hammad Mazhar ◽  
Andrew Seidl ◽  
Rebecca Shotwell ◽  
Marco B. Quadrelli ◽  
Dan Negrut ◽  
...  

This paper describes the software infrastructure needed to enable massive multi-body simulation using multiple GPUs. Utilizing a domain decomposition approach, a large system made up of billions of bodies can be split into self-contained subdomains which are then transferred to different GPUs and solved in parallel. Parallelism is enabled on multiple levels, first on the CPU through OpenMP and secondly on the GPU through NVIDIA CUDA (Compute Unified Device Architecture). This heterogeneous software infrastructure can be extended to networks of computers using MPI (Message Passing Interface) as each subdomain is self-contained. This paper will discuss the implementation of the spatial subdivision algorithm used for subdomain creation along with the algorithms used for collision detection and constraint solution.
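
The spatial subdivision step described above can be sketched in a few lines. This is a hypothetical illustration (not the paper's CUDA implementation, which handles collision detection and constraint solution as well): each body is binned by position into a uniform grid of subdomains, and each resulting bucket is a self-contained unit that could be dispatched to its own GPU. All names and the uniform-grid assumption are illustrative.

```python
def subdomain_of(pos, origin, cell, dims):
    """Map a 3D position to a flat subdomain index on a dims[0] x dims[1] x dims[2] grid."""
    idx = [max(0, min(int((pos[k] - origin[k]) / cell[k]), dims[k] - 1))
           for k in range(3)]                      # clamp bodies to the domain
    return (idx[2] * dims[1] + idx[1]) * dims[0] + idx[0]

def decompose(bodies, origin, cell, dims):
    """Group body indices into self-contained subdomains by spatial binning."""
    buckets = {}
    for i, pos in enumerate(bodies):
        buckets.setdefault(subdomain_of(pos, origin, cell, dims), []).append(i)
    return buckets
```

In the multi-GPU setting each bucket would carry its bodies (plus any shared-boundary state) to a device; here the sketch stops at the partitioning itself.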

Author(s):  
Yong Zhao ◽  
Chin Hoe Tai

The development and validation of a parallel unstructured non-nested multigrid method for simulation of unsteady incompressible viscous flow is presented. The Navier-Stokes solver is based on the artificial compressibility method (ACM) [10] and a higher-order characteristics-based finite-volume scheme [8] on unstructured multigrids. Unsteady flow is calculated with an implicit dual time stepping scheme. The parallelization of the solver is achieved by a multigrid domain decomposition approach (MG-DD), using the Single Program Multiple Data (SPMD) programming paradigm and Message-Passing Interface (MPI) for communication of data. The parallel codes using single grids and multigrids are used to simulate steady and unsteady incompressible viscous flows over a circular cylinder for validation and performance evaluation purposes. Speedups and parallel efficiencies obtained by both the parallel single-grid and multigrid solvers are reasonably good for both test cases, using up to 32 processors on the SGI Origin 2000. A maximum speedup of 12 could be achieved on 16 processors for the unsteady flow. The parallel results obtained agree well with those of serial solvers and with numerical solutions obtained by other researchers, as well as experimental measurements.
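
The implicit dual time stepping mentioned in the abstract can be sketched generically: each physical time step is converged by an inner pseudo-time iteration that drives the unsteady residual to zero. The residual and pseudo-time update below are placeholders, not the paper's characteristics-based finite-volume scheme or its artificial compressibility formulation.

```python
def dual_time_step(q, dt, pseudo_step, residual, tol=1e-10, max_inner=100):
    """Advance state q by one physical step dt, converging in pseudo-time."""
    q_old = q
    for _ in range(max_inner):
        r = residual(q, q_old, dt)      # unsteady residual, incl. the dq/dt term
        if abs(r) < tol:
            break
        q = pseudo_step(q, r)           # pseudo-time update driving r toward 0
    return q
```

For example, for the implicit backward-Euler step of dq/dt = -q, the residual is (q - q_old)/dt + q and a relaxation update q - tau*r converges the inner iteration to q_old/(1 + dt).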


2020 ◽  
Author(s):  
Stiw Herrera ◽  
Weber Ribeiro ◽  
Thiago Teixeira ◽  
André Carneiro ◽  
Frederico Cabral ◽  
...  

Oil and gas simulations require high-performance computing techniques to cope with their large memory footprint and the high computational cost of the underlying numerical methods. A domain decomposition technique was applied to a three-dimensional oil reservoir, where MPI (Message Passing Interface) allowed the creation of one-, two-, and three-dimensional topologies, so that each subdivision of the reservoir could be solved in its own MPI process. A performance study of these domain decomposition strategies was carried out on 20 computational nodes of the SDumont supercomputer, using the Cascade Lake architecture.
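
The block partitioning behind such one-, two-, and three-dimensional topologies can be sketched without MPI itself. The logic below mirrors what `MPI_Dims_create` / `MPI_Cart_coords` provide: for each axis, a rank's coordinate determines which slab of cells it owns, with remainder cells spread over the first ranks. Grid sizes and rank counts are illustrative.

```python
def local_extent(n, p, coord):
    """(start, size) of the cells owned along one axis by process `coord` of `p`, for n cells."""
    base, rem = divmod(n, p)
    size = base + (1 if coord < rem else 0)   # first `rem` ranks take one extra cell
    start = coord * base + min(coord, rem)
    return start, size

def subdomain(grid, dims, coords):
    """Per-axis (start, size) of one rank's block of the global grid."""
    return [local_extent(n, p, c) for n, p, c in zip(grid, dims, coords)]
```

A 1D topology is the special case `dims = (p, 1, 1)`; the same helper covers the 2D and 3D decompositions of the reservoir.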


2013 ◽  
Vol 30 (7) ◽  
pp. 1382-1397 ◽  
Author(s):  
Yunheng Wang ◽  
Youngsun Jung ◽  
Timothy A. Supinie ◽  
Ming Xue

Abstract A hybrid parallel scheme for the ensemble square root filter (EnSRF) suitable for parallel assimilation of multiscale observations, including those from dense observational networks such as those of radar, is developed based on the domain decomposition strategy. The scheme handles internode communication through a message passing interface (MPI) and the communication within shared-memory nodes via Open Multiprocessing (OpenMP) threads. It also supports pure MPI and pure OpenMP modes. The parallel framework can accommodate high-volume remote-sensed radar (or satellite) observations as well as conventional observations that usually have larger covariance localization radii. The performance of the parallel algorithm has been tested with simulated and real radar data. The parallel program shows good scalability in pure MPI and hybrid MPI–OpenMP modes, while pure OpenMP runs exhibit limited scalability on a symmetric shared-memory system. It is found that in MPI mode, better parallel performance is achieved with domain decomposition configurations in which the leading dimension of the state variable arrays is larger, because this configuration allows for more efficient memory access. Given a fixed amount of computing resources, the hybrid parallel mode is preferred to pure MPI mode on supercomputers with nodes containing shared-memory cores. The overall performance is also affected by factors such as the cache size, memory bandwidth, and the networking topology. Tests with a real data case with a large number of radars confirm that the parallel data assimilation can be done on a multicore supercomputer with a significant speedup compared to the serial data assimilation algorithm.
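
The abstract's finding about the leading dimension can be illustrated with a toy decomposition chooser. Assuming state arrays whose first dimension is contiguous in memory, the heuristic below enumerates 2D rank grids for a fixed process count and picks the one whose local tiles keep the longest leading dimension; the grid and rank counts are illustrative, not the EnSRF configuration.

```python
def candidate_decomps(nprocs):
    """All 2D rank grids (px, py) with px * py == nprocs."""
    return [(px, nprocs // px) for px in range(1, nprocs + 1) if nprocs % px == 0]

def best_decomp(nx, ny, nprocs):
    """Prefer the rank grid whose local tile has the largest leading dimension nx // px."""
    return max(candidate_decomps(nprocs), key=lambda d: nx // d[0])
```

Unsurprisingly the heuristic cuts along the non-leading dimension whenever it can, which matches the abstract's observation that such configurations allow more efficient memory access.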


2014 ◽  
Vol 16 (3) ◽  
pp. 599-611 ◽  
Author(s):  
George M. Petrov ◽  
Jack Davis

Abstract The implicit 2D3V particle-in-cell (PIC) code developed to study the interaction of ultrashort pulse lasers with matter [G. M. Petrov and J. Davis, Computer Phys. Comm. 179, 868 (2008); Phys. Plasmas 18, 073102 (2011)] has been parallelized using MPI (Message Passing Interface). The parallelization strategy is optimized for a small number of computer cores, up to about 64. Details on the algorithm implementation are given with emphasis on code optimization by overlapping computations with communications. Performance evaluation for 1D domain decomposition has been made on a small Linux cluster with 64 computer cores for two typical regimes of PIC operation: “particle dominated”, for which the bulk of the computation time is spent on pushing particles, and “field dominated”, for which computing the fields is prevalent. For a small number of computer cores, less than 32, the MPI implementation offers a significant numerical speed-up. In the “particle dominated” regime it is close to the maximum theoretical one, while in the “field dominated” regime it is about 75–80% of the maximum speed-up. For a number of cores exceeding 32, performance degradation takes place as a result of the adopted 1D domain decomposition. The code parallelization will allow future implementation of atomic physics and extension to three dimensions.
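
The overlap of computation with communication that the abstract emphasizes follows a standard pattern, sketched schematically below: post the ghost-cell exchange first, advance the interior cells (which need no ghost data) while the exchange is in flight, then finish the exchange and update the boundary cells. Communication is mocked here; real code would use `MPI_Isend`/`MPI_Irecv` plus `MPI_Wait`, and the in-place update is a stand-in for the actual field solve.

```python
def step(field, exchange_start, exchange_finish, update):
    """One 1D-decomposed step with one ghost cell per side, overlapping comms."""
    g = 1
    req = exchange_start(field)                  # nonblocking halo exchange
    for i in range(2 * g, len(field) - 2 * g):
        field[i] = update(field, i)              # interior: no ghost data needed
    exchange_finish(req, field)                  # wait for halos to arrive
    for i in (g, len(field) - g - 1):
        field[i] = update(field, i)              # boundary cells use fresh halos
    return field
```

The payoff is exactly the regime dependence the abstract reports: the more time spent in the overlapped interior work (particle pushing), the better the communication cost is hidden.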


2012 ◽  
Vol 49 (6) ◽  
pp. 709-723 ◽  
Author(s):  
V. Visseq ◽  
A. Martin ◽  
D. Iceta ◽  
E. Azema ◽  
D. Dureisseix ◽  
...  

2012 ◽  
Vol 2012 ◽  
pp. 1-8 ◽  
Author(s):  
Carlos Delgado ◽  
Josefa Gómez ◽  
Abdelhamid Tayebi ◽  
Iván González ◽  
Felipe Cátedra

We present an efficient method for the analysis of different objects that may contain a complex feeding system and a reflector structure. The approach is based on a domain decomposition technique that divides the geometry into several parts to minimize the vast computational resources required when applying a full wave method. This technique is also parallelized by using the Message Passing Interface to minimize the memory and time requirements of the simulation. A reflectarray analysis serves as an example of the proposed approach.
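
The divide-and-combine idea can be sketched abstractly: geometry parts (feed chain, reflector surface, and so on) are distributed over MPI ranks, analyzed independently, and their far-field contributions superposed, ignoring here the inter-domain coupling that the full method accounts for. The rank-assignment policy and the part solvers are illustrative placeholders.

```python
def assign_parts(n_parts, n_ranks):
    """Round-robin mapping of geometry subdomains to MPI ranks."""
    return {r: list(range(r, n_parts, n_ranks)) for r in range(n_ranks)}

def total_far_field(part_fields, angle):
    """Superpose the far-field contributions of the independently solved parts."""
    return sum(f(angle) for f in part_fields)
```
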


2013 ◽  
Vol 53 (1) ◽  
pp. 147-158 ◽  
Author(s):  
JunYoung Kwak ◽  
TaeYoung Chun ◽  
SangJoon Shin ◽  
Olivier A. Bauchau

2020 ◽  
Vol 15 ◽  
Author(s):  
Weiwen Zhang ◽  
Long Wang ◽  
Theint Theint Aye ◽  
Juniarto Samsudin ◽  
Yongqing Zhu

Background: Genotype imputation as a service enables researchers to estimate genotypes from haplotyped data without performing whole genome sequencing. However, genotype imputation is computation intensive, so it remains a challenge to satisfy the high performance requirements of genome-wide association studies (GWAS).
Objective: In this paper, we propose a high performance computing solution for genotype imputation on supercomputers to enhance its execution performance.
Method: We design and implement a multi-level parallelization that includes job-level, process-level and thread-level parallelization, enabled by job scheduling management, the Message Passing Interface (MPI) and OpenMP, respectively. It involves job distribution, chunk partition and execution, parallelized iteration for imputation, and data concatenation. This multi-level design lets us exploit multi-machine, multi-core architectures to improve the performance of genotype imputation.
Results: Experimental results show that the proposed method outperforms a Hadoop-based implementation of genotype imputation. Further experiments on supercomputers show that it significantly shortens execution time.
Conclusion: The proposed multi-level parallelization, when deployed as an imputation service, will help bioinformatics researchers in Singapore conduct genotype imputation and enhance association studies.
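
The chunk-partition step in the method can be sketched as follows: a chromosome region is cut into fixed-size windows with flanking overlap, so that separate MPI processes can impute chunks independently before the results are concatenated. The window and overlap sizes are illustrative, not those of the deployed service.

```python
def make_chunks(start, end, chunk_size, overlap):
    """(lo, hi) windows covering [start, end), each padded by `overlap` positions."""
    chunks = []
    pos = start
    while pos < end:
        chunks.append((max(start, pos - overlap),
                       min(end, pos + chunk_size + overlap)))
        pos += chunk_size
    return chunks
```

During concatenation only each chunk's core region (without the padding) would be kept, so that every position is reported exactly once.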


Energies ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 2284
Author(s):  
Krzysztof Przystupa ◽  
Mykola Beshley ◽  
Olena Hordiichuk-Bublivska ◽  
Marian Kyryk ◽  
Halyna Beshley ◽  
...  

Analyzing large volumes of user data to determine preferences and, on that basis, recommend new products is an important problem: depending on the correctness and timeliness of the recommendations, significant profits or losses can result. This analysis of companies' service-user data is carried out by dedicated recommendation systems. However, with a large number of users the data to be processed become very large, which complicates the operation of such systems. For efficient data analysis in commercial systems, the Singular Value Decomposition (SVD) method can perform intelligent analysis of the information, and for large amounts of processed information we propose to use distributed systems. This approach reduces the time needed to process the data and deliver recommendations to users. For the experimental study, we implemented the distributed SVD method using Message Passing Interface, Hadoop and Spark technologies, and observed reduced data processing times with the distributed systems compared to non-distributed ones.
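
The core SVD computation being distributed here can be sketched in its single-process form: factor the user-item rating matrix and rebuild a low-rank approximation whose entries serve as predicted preferences. The distributed MPI/Hadoop/Spark variants in the article partition this computation across machines; the function name and rank choice below are illustrative.

```python
import numpy as np

def predict_ratings(R, k):
    """Rank-k SVD reconstruction of the rating matrix R as predicted preferences."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # Keep the k strongest latent factors; the product scales U's columns by s.
    return U[:, :k] * s[:k] @ Vt[:k, :]
```

For a matrix that is already (close to) rank k, the reconstruction reproduces the known ratings, while its entries at unobserved positions act as the recommendations.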


1996 ◽  
Vol 22 (6) ◽  
pp. 789-828 ◽  
Author(s):  
William Gropp ◽  
Ewing Lusk ◽  
Nathan Doss ◽  
Anthony Skjellum
