High Performance Interconnects for Massively Parallel Systems

This paper presents, discusses and analyses a massively parallel-in-time solver for linear oscillatory partial differential equations, which is a key numerical component for evolving weather, ocean, climate and seismic models. The time parallelization in this solver allows us to use significantly more computing resources than parallelization-in-space methods can exploit, and results in a correspondingly reduced wall-clock time. One of the major difficulties in achieving Exascale performance for weather prediction is that strong scaling (the parallel performance for a fixed problem size with an increasing number of processors) saturates. A main avenue to circumvent this problem is to introduce new numerical techniques that take advantage of time parallelism. In this paper, we use a time-parallel approximation that retains the frequency information of oscillatory problems. This approximation is based on (a) reformulating the original problem into a large set of independent terms and (b) solving each of these terms independently of the others, which can now be accomplished on a large number of high-performance computing resources. Our experiments use up to 3586 cores for problem sizes at which parallelization-in-space already stops scaling on a single node. With the parallelization-in-time approach, we reduce the time-to-solution by factors of 118.3× for spectral methods and 1503.0× for finite-difference methods. A developed and calibrated performance model gives the scalability limitations a priori for this new approach and allows us to extrapolate the performance of the method towards large-scale systems. This work has the potential to serve as a basic building block of parallelization-in-time approaches, with possible major implications in applied areas that model oscillation-dominated problems.
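The saturation of strong scaling mentioned in this abstract can be illustrated with Amdahl's law. The sketch below is not taken from the paper: the function name `amdahl_speedup` and the 95% parallel fraction are assumptions chosen only for illustration, and the paper's own calibrated performance model is not reproduced here.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup for a fixed problem size under Amdahl's law.

    parallel_fraction: share of the run time that can be parallelized (0..1).
    workers: number of processors applied to the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical 95% parallel fraction: the speedup flattens out
    # below 1 / 0.05 = 20 no matter how many workers are added.
    for n in (1, 16, 256, 4096):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

Under this simple model, adding processors to a fixed-size problem cannot push the speedup past the reciprocal of the serial fraction, which is why an extra dimension of parallelism such as time is attractive.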

TTN: A High Performance Hierarchical Interconnection Network for Massively Parallel Computers

IEICE Transactions on Information and Systems ◽

10.1587/transinf.e92.d.1062 ◽

2009 ◽

Vol E92-D (5) ◽

pp. 1062-1078 ◽

Cited By ~ 14

Author(s):

M.M. Hafizur RAHMAN ◽

Yasushi INOGUCHI ◽

Yukinori SATO ◽

Susumu HORIGUCHI

Keyword(s):

High Performance ◽

Interconnection Network ◽

Parallel Computers ◽

Massively Parallel ◽

Massively Parallel Computers

Efficient parallelization of SPH algorithm on modern multi-core CPUs and massively parallel GPUs

International Journal of Modeling Simulation and Scientific Computing ◽

10.1142/s1793962321500549 ◽

2021 ◽

pp. 2150054

Author(s):

Pravin Jagtap ◽

Rupesh Nasre ◽

V. S. Sanapala ◽

B. S. V. Patnaik

Keyword(s):

High Performance ◽

Performance Metrics ◽

Computational Simulation ◽

Massively Parallel ◽

Benchmark Problems ◽

Processing Unit ◽

Central Processing ◽

Neighbor Search ◽

Computational Performance ◽

SPH Algorithm

Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the backbone for fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Typically, particle-based methods require efficient particle search-and-locate algorithms, as the continuous creation of neighbor particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on a multi-core Central Processing Unit (CPU) as well as on massively parallel General Purpose Graphical Processing Units (GP-GPU). Parallelizing SPH faces several challenges, such as scalability of the neighbor search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, and ensuring optimum use of computational resources. While addressing some of these challenges, a detailed analysis of performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency and occupancy is carried out. The OpenMP and Compute Unified Device Architecture (CUDA) parallel programming models have been used for parallel computing on Intel Xeon E5-[Formula: see text] multi-core CPU and NVIDIA Quadro M[Formula: see text] and NVIDIA Tesla P[Formula: see text] massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for the validation. The key concern of how to identify a suitable architecture for mesh-less methods, which essentially require a heavy workload of neighbor search and evaluation of local force fields from neighbor interactions, is addressed.
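The neighbor search that this abstract identifies as the expensive step is commonly organized with a uniform grid (cell list). The abstract does not say which search scheme the authors use, so the following is only a minimal serial 2-D sketch of that general technique. The names `build_cell_list` and `neighbors_within` and the use of the smoothing length `h` as the cell size are assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def build_cell_list(points, h):
    """Bin 2-D points into square cells of side h, keyed by integer cell index."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(math.floor(x / h), math.floor(y / h))].append(idx)
    return cells


def neighbors_within(points, h):
    """Return, for each point, the indices of all other points within distance h.

    Only the 3 x 3 block of cells around a point is scanned, instead of
    comparing every pair of points.
    """
    cells = build_cell_list(points, h)
    result = {i: [] for i in range(len(points))}
    for i, (x, y) in enumerate(points):
        cx, cy = math.floor(x / h), math.floor(y / h)
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in cells.get((cx + dx, cy + dy), ()):
                if j == i:
                    continue
                px, py = points[j]
                if (px - x) ** 2 + (py - y) ** 2 <= h * h:
                    result[i].append(j)
    return result


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (2.0, 2.0), (0.9, 0.9)]
    print(neighbors_within(pts, 1.0))
```

Because every particle inspects only nearby cells, the per-particle work stays roughly constant for evenly distributed particles, and each particle's loop is independent of the others, which is what makes this step a natural candidate for multi-core and GPU parallelization.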

Comparing High-Performance Computing Techniques for Modeling Structural Impact on Battery Cells

Volume 6A: Energy ◽

10.1115/imece2014-39271 ◽

2014 ◽

Author(s):

Mehdi Gilaki ◽

Ilya Avdeev

Keyword(s):

Finite Element ◽

Parallel Processing ◽

High Performance Computing ◽

High Performance ◽

Strain Curve ◽

Thin Layers ◽

Massively Parallel ◽

Structural Impact ◽

Performance Computing ◽

Battery Cells

In this study, we have investigated the feasibility of using the commercial explicit finite element code LS-DYNA on a massively parallel super-computing cluster for accurate modeling of structural impact on battery cells. Physical and numerical lateral impact tests have been conducted on cylindrical cells using a flat rigid drop cart in a custom-built drop test apparatus. The main component of a cylindrical cell, the jellyroll, is a layered spiral structure which consists of thin layers of electrodes and separator. Two numerical approaches were considered: (1) a homogenized model of the cell and (2) a heterogeneous (full) 3-D cell model. In the first approach, the jellyroll was considered as a homogeneous material with an effective stress-strain curve obtained through experiments. In the second model, individual layers of anode, cathode and separator were accounted for, leading to an extremely complex and computationally expensive finite element model. To overcome the limitations of desktop computers, high-performance computing (HPC) techniques on an HPC cluster were needed in order to obtain the results of transient simulations in a reasonable solution time. We have compared two HPC methods used for this model: shared memory parallel processing (SMP) and massively parallel processing (MPP). Both the homogeneous and the heterogeneous models were considered for parallel simulations utilizing different numbers of computational nodes and cores, and the performance of these models was compared. This work brings us one step closer to accurate modeling of structural impact on an entire battery pack that consists of thousands of cells.

Optimizing Investment Strategies with the Reconfigurable Hardware Platform RIVYERA

International Journal of Reconfigurable Computing ◽

10.1155/2012/646984 ◽

2012 ◽

Vol 2012 ◽

pp. 1-10 ◽

Cited By ~ 2

Author(s):

Christoph Starke ◽

Vasco Grossmann ◽

Lars Wienbrandt ◽

Sven Koschnicke ◽

John Carstens ◽

...

Keyword(s):

Financial Markets ◽

High Performance ◽

Investment Strategy ◽

Reconfigurable Hardware ◽

Massively Parallel ◽

Processing Element ◽

Investment Strategies ◽

Hardware Platform ◽

Time Periods ◽

Hardware Structure

The hardware structure of a processing element used for optimization of an investment strategy for financial markets is presented. It is shown how this processing element can be implemented multiple times on the massively parallel FPGA-machine RIVYERA. This leads to a speedup by a factor of about 17,000 in comparison to one single high-performance PC, while saving more than 99% of the consumed energy. Furthermore, it is shown for a special security and different time periods that the optimized investment strategy delivers an outperformance between 2 and 14 percent in relation to a buy and hold strategy.

The Fortran-P Translator: Towards Automatic Translation of Fortran 77 Programs for Massively Parallel Processors

Scientific Programming ◽

10.1155/1995/278064 ◽

1995 ◽

Vol 4 (1) ◽

pp. 1-21 ◽

Cited By ~ 3

Author(s):

Matthew O'Keefe ◽

Terence Parr ◽

B. Kevin Edgar ◽

Steve Anderson ◽

Paul Woodward ◽

...

Keyword(s):

High Performance ◽

Parallel Machines ◽

Parallel Processors ◽

Massively Parallel ◽

Automatic Translation ◽

Efficient Code ◽

Self Similar ◽

User Friendly ◽

Application Codes ◽

Fortran 77

Massively parallel processors (MPPs) hold the promise of extremely high performance that, if realized, could be used to study problems of unprecedented size and complexity. One of the primary stumbling blocks to this promise has been the lack of tools to translate application codes to MPP form. In this article we show how application codes written in a subset of Fortran 77, called Fortran-P, can be translated to achieve good performance on several massively parallel machines. This subset can express codes that are self-similar, where the algorithm applied to the global data domain is also applied to each subdomain. We have found many codes that match the Fortran-P programming style and have converted them using our tools. We believe a self-similar coding style will accomplish what a vectorizable style has accomplished for vector machines by allowing the construction of robust, user-friendly, automatic translation systems that increase programmer productivity and generate fast, efficient code for MPPs.
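The self-similar property described in this abstract, where the routine applied to the global domain is also applied to each subdomain, can be illustrated with a small stencil. The sketch below is written in Python rather than Fortran-P and is not derived from the translator itself: the functions `smooth` and `smooth_by_subdomains`, the 3-point average, and the one-cell halo are all assumptions chosen for illustration.

```python
def smooth(values):
    """3-point average over interior cells.

    The first and last entries are treated as boundary (halo) cells,
    so the result has len(values) - 2 entries.
    """
    return [
        (values[i - 1] + values[i] + values[i + 1]) / 3.0
        for i in range(1, len(values) - 1)
    ]


def smooth_by_subdomains(values, parts):
    """Apply the same `smooth` routine to each subdomain plus its halo cells."""
    interior = len(values) - 2
    base, extra = divmod(interior, parts)
    out, start = [], 1
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        # Subdomain cells plus one neighbouring cell on each side.
        sub = values[start - 1 : start + size + 1]
        out.extend(smooth(sub))
        start += size
    return out


if __name__ == "__main__":
    data = [float(i * i) for i in range(12)]
    print(smooth_by_subdomains(data, 3) == smooth(data))
```

On a distributed machine the halo cells would be filled by communication with neighbouring subdomains; here they are simply sliced from the global list.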

Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) ◽

10.1109/ipdps.2017.94 ◽

2017 ◽

Cited By ~ 10

Author(s):

Benjamin Klenk ◽

Holger Froening ◽

Hans Eberle ◽

Larry Dennison

Keyword(s):

Message Passing ◽

High Performance ◽

Massively Parallel

Efficient multithreading for manycore processor: Multidimensional domain decomposition using Intel® TBB

The Insight Journal ◽

10.54294/73dn1l ◽

2017 ◽

Author(s):

Etienne St-Onge ◽

Benoit Scherrer ◽

Simon Warfield

Keyword(s):

Domain Decomposition ◽

High Performance ◽

Job Scheduling ◽

Building Blocks ◽

Massively Parallel ◽

Decomposition Approach ◽

Task Decomposition ◽

Dynamic Task ◽

Current Implementation ◽

Generic Design

The Insight Toolkit (ITK) utilizes a generic design for image processing filters that allows many developers to rapidly implement new algorithms. While ITK filters benefit from a platform-independent and versatile multithreading capability, the current implementation does not easily achieve high performance. First, ITK relies on a static decomposition of the image into subsets of equal size, which is highly inefficient when the computational complexity varies between subsets (unbalanced workloads). Second, the current domain decomposition can subdivide the input domain only along a single dimension (typically the slice dimension in a 3-D volume), which under-utilizes threads on massively parallel compute systems whenever the number of threads exceeds the size of this dimension. We previously presented a new itk::TBBImageToImageFilter class that replaced the static task decomposition with a dynamic task decomposition for improved workload balancing, in which the job scheduling task was optimized using the Intel® Threading Building Blocks (TBB) library. In this work, we propose a new multidimensional dynamic image decomposition approach that allows decomposition over an arbitrary number of dimensions. This new generic multithreading capability, combined with the TBB dynamic task scheduler, substantially improves multithreading performance when using massively parallel processors.
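The limitation of single-dimension splitting described in this abstract can be seen by counting the blocks each scheme produces. The sketch below is plain Python, not ITK or TBB code, and covers only the static splitting step; the dynamic scheduling the authors delegate to TBB is not modeled. The helper names `split_axis` and `decompose` and the example volume shape are assumptions made for illustration.

```python
from itertools import product


def split_axis(length, parts):
    """Split range(length) into at most `parts` contiguous (start, stop) chunks."""
    parts = max(1, min(parts, length))
    base, extra = divmod(length, parts)
    chunks, start = [], 0
    for k in range(parts):
        stop = start + base + (1 if k < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def decompose(shape, parts_per_axis):
    """Return blocks as tuples of per-axis (start, stop) ranges."""
    per_axis = [split_axis(n, p) for n, p in zip(shape, parts_per_axis)]
    return list(product(*per_axis))


if __name__ == "__main__":
    shape = (4, 256, 256)  # hypothetical volume with only 4 slices
    one_dim = decompose(shape, (64, 1, 1))   # split along the slice axis only
    multi_dim = decompose(shape, (4, 4, 4))  # split along all three axes
    print(len(one_dim), len(multi_dim))
```

With 64 threads available, the slice-only split yields no more blocks than there are slices, while the multidimensional split yields one block per thread; a dynamic scheduler can then hand out such blocks as threads become free.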
