High Performance Interconnects for Massively Parallel Systems

2016 ◽  
pp. 33-56 ◽  
Author(s):  
Martin Schreiber ◽  
Pedro S Peixoto ◽  
Terry Haut ◽  
Beth Wingate

This paper presents, discusses and analyses a massively parallel-in-time solver for linear oscillatory partial differential equations, which is a key numerical component for evolving weather, ocean, climate and seismic models. The time parallelization in this solver allows us to use significantly more computing resources than parallelization-in-space methods can exploit, and results in a correspondingly reduced wall-clock time. One of the major difficulties of achieving Exascale performance for weather prediction is that strong scaling (the parallel performance for a fixed problem size with an increasing number of processors) saturates. A main avenue to circumvent this problem is to introduce new numerical techniques that take advantage of time parallelism. In this paper, we use a time-parallel approximation that retains the frequency information of oscillatory problems. This approximation is based on (a) reformulating the original problem into a large set of independent terms and (b) solving each of these terms independently of the others, which can now be accomplished on a large number of high-performance computing resources. Our experiments were run on up to 3586 cores for problem sizes at which the parallelization-in-space scalability is already limited on a single node. With the parallelization-in-time approach, we reduce the time-to-solution by factors of 118.3× for spectral methods and 1503.0× for finite-difference methods. A developed and calibrated performance model gives the scalability limitations a priori for this new approach and allows us to extrapolate the performance of the method towards large-scale systems. This work has the potential to serve as a basic building block of parallelization-in-time approaches, with possible major implications in applied areas that model oscillation-dominated problems.
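The idea of turning one time step into many independent terms can be illustrated with a small sketch. The abstract does not give the authors' formulation, so the example below swaps in a different, well-known technique as a stand-in: the Cauchy integral for the matrix exponential, discretized with the trapezoidal rule on a circle. Each quadrature node yields one shifted linear solve that does not depend on any other node. The 2×2 test oscillator, the contour radius, the number of terms, and all function names are assumptions chosen for illustration.

```python
import cmath
import math

def solve_2x2(m, rhs):
    """Solve the 2x2 complex system m @ x = rhs by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d * rhs[0] - b * rhs[1]) / det,
            (a * rhs[1] - c * rhs[0]) / det)

def one_term(A, u0, z):
    """One independent term: e^z * z * (z*I - A)^{-1} u0."""
    shifted = ((z - A[0][0], -A[0][1]),
               (-A[1][0], z - A[1][1]))
    x = solve_2x2(shifted, u0)
    weight = cmath.exp(z) * z
    return (weight * x[0], weight * x[1])

def exp_apply(A, u0, radius=3.0, n_terms=32):
    """Approximate exp(A) @ u0 as an average of n_terms independent solves.

    Assumes every eigenvalue of A lies well inside the circle |z| = radius.
    """
    nodes = [radius * cmath.exp(2j * math.pi * k / n_terms)
             for k in range(n_terms)]
    # Each term depends only on its own node, so in a parallel setting
    # every term could be computed by a different process.
    terms = [one_term(A, u0, z) for z in nodes]
    return tuple(sum(t[i] for t in terms).real / n_terms for i in range(2))

# Hypothetical test problem: u' = A u with a purely oscillatory A.
# The exact solution after one time unit is a rotation by 1 radian.
A = ((0.0, -1.0), (1.0, 0.0))
u1 = exp_apply(A, (1.0, 0.0))
print(u1, (math.cos(1.0), math.sin(1.0)))
```

In this sketch the only communication between terms is the final sum, which is what makes the formulation attractive when parallelism in space is already exhausted.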


Author(s):  
Pravin Jagtap ◽  
Rupesh Nasre ◽  
V. S. Sanapala ◽  
B. S. V. Patnaik

Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the backbone for fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Particle-based methods typically require efficient particle search-and-locate algorithms, as the continuous creation of neighbor particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on multi-core Central Processing Units (CPU) as well as massively parallel General Purpose Graphical Processing Units (GP-GPU). Parallelizing SPH faces several challenges, such as scalability of the neighbor search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, ensuring optimum use of computational resources, etc. While addressing some of these challenges, performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency, occupancy, etc. are analyzed in detail. The OpenMP and Compute Unified Device Architecture[Formula: see text] parallel programming models have been used for parallel computing on Intel Xeon[Formula: see text] E5-[Formula: see text] multi-core CPU and NVIDIA Quadro M[Formula: see text] and NVIDIA Tesla p[Formula: see text] massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for the validation. The work addresses the key concern of how to identify a suitable architecture for mesh-less methods, which require a heavy workload of neighbor search and of evaluating local force fields from neighbor interactions.
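The neighbor search that dominates the cost of particle methods is often organized around a uniform grid of cells. The abstract does not say which search algorithm the authors use, so the following is only a minimal serial sketch of one common approach (a cell list in 2-D), with a brute-force search kept alongside as a reference. The particle positions, the search radius `h`, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
import math

def cell_of(point, h):
    """Integer grid cell containing a 2-D point, for cells of side h."""
    return (math.floor(point[0] / h), math.floor(point[1] / h))

def neighbors_grid(points, h):
    """Neighbor lists via a cell list: only the 3x3 block of cells around
    each particle is searched, since any particle within distance h must
    lie in one of those cells."""
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[cell_of(p, h)].append(idx)

    result = []
    for i, p in enumerate(points):
        cx, cy = cell_of(p, h)
        found = []
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in cells.get((cx + dx, cy + dy), ()):
                q = points[j]
                if j != i and math.hypot(p[0] - q[0], p[1] - q[1]) <= h:
                    found.append(j)
        result.append(sorted(found))
    return result

def neighbors_brute(points, h):
    """Reference O(N^2) search used to check the cell-list result."""
    return [sorted(j for j, q in enumerate(points)
                   if j != i and math.hypot(p[0] - q[0], p[1] - q[1]) <= h)
            for i, p in enumerate(points)]

# Hypothetical particle positions and search radius.
points = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9), (0.3, 0.1), (0.95, 0.85)]
h = 0.25
print(neighbors_grid(points, h))
```

On a GPU the same structure is usually built with sorting and prefix sums rather than a hash map, which is where concerns such as coalesced memory access and thread divergence come into play; that part is not shown here.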


2014 ◽  
Author(s):  
Mehdi Gilaki ◽  
Ilya Avdeev

In this study, we have investigated the feasibility of using the commercial explicit finite element code LS-DYNA on a massively parallel supercomputing cluster for accurate modeling of structural impact on battery cells. Physical and numerical lateral impact tests have been conducted on cylindrical cells using a flat rigid drop cart in a custom-built drop test apparatus. The main component of a cylindrical cell, the jellyroll, is a layered spiral structure that consists of thin layers of electrodes and separator. Two numerical approaches were considered: (1) a homogenized model of the cell and (2) a heterogeneous (full) 3-D cell model. In the first approach, the jellyroll was considered as a homogeneous material with an effective stress-strain curve obtained through experiments. In the second approach, individual layers of anode, cathode and separator were accounted for, leading to an extremely complex and computationally expensive finite element model. To overcome the limitations of desktop computers, high-performance computing (HPC) techniques on an HPC cluster were needed to obtain the results of transient simulations in a reasonable solution time. We have compared two HPC methods for this model: shared memory parallel processing (SMP) and massively parallel processing (MPP). Both the homogeneous and the heterogeneous models were run in parallel on different numbers of computational nodes and cores, and the performance of these models was compared. This work brings us one step closer to accurate modeling of structural impact on an entire battery pack that consists of thousands of cells.


2012 ◽  
Vol 2012 ◽  
pp. 1-10 ◽  
Author(s):  
Christoph Starke ◽  
Vasco Grossmann ◽  
Lars Wienbrandt ◽  
Sven Koschnicke ◽  
John Carstens ◽  
...  

The hardware structure of a processing element used for optimization of an investment strategy for financial markets is presented. It is shown how this processing element can be instantiated multiple times on the massively parallel FPGA machine RIVYERA. This leads to a speedup by a factor of about 17,000 in comparison to a single high-performance PC, while saving more than 99% of the consumed energy. Furthermore, it is shown for a particular security and different time periods that the optimized investment strategy outperforms a buy-and-hold strategy by between 2 and 14 percent.


1995 ◽  
Vol 4 (1) ◽  
pp. 1-21 ◽  
Author(s):  
Matthew O'Keefe ◽  
Terence Parr ◽  
B. Kevin Edgar ◽  
Steve Anderson ◽  
Paul Woodward ◽  
...  

Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how application codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish for MPPs what a vectorizable style has accomplished for vector machines, by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code.
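The self-similar property described above can be sketched with a one-dimensional stencil. This is not Fortran-P and does not reproduce the authors' translation tools; it is a hedged Python illustration in which the same `smooth` routine is applied once to the whole domain and once to each subdomain padded with one ghost cell per side. The stencil, the data, and the function names are assumptions.

```python
def smooth(cells):
    """Three-point average of the interior cells.

    The first and last entries act as boundary (ghost) values and are
    read but not updated, so the result has len(cells) - 2 entries.
    """
    return [(cells[i - 1] + cells[i] + cells[i + 1]) / 3.0
            for i in range(1, len(cells) - 1)]

def smooth_by_subdomains(cells, n_sub):
    """Apply the same routine to each subdomain plus one ghost cell per side.

    In a distributed setting the ghost values would come from neighboring
    processors; here they are simply read from the global list.
    """
    interior = len(cells) - 2
    out = []
    for s in range(n_sub):
        lo = 1 + s * interior // n_sub        # first interior index owned
        hi = 1 + (s + 1) * interior // n_sub  # one past the last owned index
        if hi > lo:
            out.extend(smooth(cells[lo - 1:hi + 1]))
    return out

# Hypothetical data: 10 interior cells with one boundary cell at each end.
data = [float(i * i % 7) for i in range(12)]
print(smooth_by_subdomains(data, 3) == smooth(data))  # prints True
```

Because the subdomain call is the global call on a smaller padded array, a translator only has to insert the ghost-cell exchange; the numerical kernel itself stays unchanged.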


2017 ◽  
Author(s):  
Etienne St-Onge ◽  
Benoit Scherrer ◽  
Simon Warfield

The Insight Toolkit (ITK) utilizes a generic design for image processing filters that allows many developers to rapidly implement new algorithms. While ITK filters benefit from a platform-independent and versatile multithreading capability, the current implementation does not easily achieve high performance. First, ITK relies on a static decomposition of the image into subsets of equal size, which is highly inefficient when the computational complexity varies between subsets (unbalanced workloads). Second, the current domain decomposition can only subdivide the input domain along a single dimension (typically the slice dimension in a 3-D volume), which leaves threads idle on massively parallel compute systems whenever the number of threads is larger than the size of this dimension. We previously presented a new itk::TBBImageToImageFilter class that replaced the static task decomposition with a dynamic task decomposition for improved workload balancing, in which job scheduling was optimized using the Intel® Threading Building Blocks (TBB) library. In this work, we propose a new multidimensional dynamic image decomposition approach that allows decomposition over an arbitrary number of dimensions. This new generic multithreading capability, combined with the TBB dynamic task scheduler, substantially improves multithreading performance when using massively parallel processors.
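The limitation of single-dimension splitting can be made concrete with a short sketch. It does not use ITK or TBB; it is a plain Python illustration, under assumed names and an assumed volume size, of how splitting along several axes produces more work items than splitting along the slice axis alone. A standard-library thread pool stands in for a dynamic scheduler, handing the next block to whichever worker becomes free.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def split_axis(length, parts):
    """Split range(length) into at most `parts` contiguous, non-empty
    (start, stop) pieces."""
    parts = min(parts, length)
    return [(k * length // parts, (k + 1) * length // parts)
            for k in range(parts)]

def decompose(shape, parts_per_axis):
    """Blocks formed by splitting every axis independently."""
    axes = [split_axis(n, p) for n, p in zip(shape, parts_per_axis)]
    return list(product(*axes))

def block_volume(block):
    """Stand-in for per-block work: count the voxels in the block."""
    volume = 1
    for start, stop in block:
        volume *= stop - start
    return volume

# Hypothetical volume with only 4 slices along the first axis.
shape = (4, 256, 256)
slabs = decompose(shape, (64, 1, 1))   # single-axis split: capped at 4 chunks
blocks = decompose(shape, (4, 4, 4))   # multi-axis split: 64 chunks

# The pool assigns blocks to workers as they become idle.
with ThreadPoolExecutor(max_workers=8) as pool:
    voxels = sum(pool.map(block_volume, blocks))

print(len(slabs), len(blocks), voxels == 4 * 256 * 256)  # prints 4 64 True
```

With 64 requested threads the slab decomposition can keep at most 4 busy, while the block decomposition offers a work item for each thread and leaves room for the scheduler to balance uneven blocks.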

