Minimal Aggregated Shared Memory Messaging on Distributed Memory Supercomputers

Author(s):  
Benjamin F. Jamroz ◽  
John M. Dennis
2021 ◽  
Vol 26 ◽  
pp. 1-67
Author(s):  
Patrick Dinklage ◽  
Jonas Ellert ◽  
Johannes Fischer ◽  
Florian Kurpicz ◽  
Marvin Löbel

We present new sequential and parallel algorithms for wavelet tree construction based on a new bottom-up technique. This technique makes use of the structure of the wavelet trees—refining the characters represented in a node of the tree with increasing depth—in an opposite way, by first computing the leaves (most refined), and then propagating this information upwards to the root of the tree. We first describe new sequential algorithms, both in RAM and external memory. Based on these results, we adapt these algorithms to parallel computers, where we address both shared memory and distributed memory settings. In practice, all our algorithms outperform previous ones in both time and memory efficiency, because we can compute all auxiliary information solely based on the information we obtained from computing the leaves. Most of our algorithms are also adapted to the wavelet matrix, a variant that is particularly suited for large alphabets.
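
For readers new to the data structure, here is a minimal sketch (in C++, with made-up names, and not the authors' algorithm) of the conventional top-down, levelwise construction; the paper's contribution is to build the same levels bottom-up, starting from the leaves.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Textbook levelwise wavelet tree over the effective alphabet: level k stores,
// for every symbol, its k-th most significant bit, with symbols stably
// partitioned by the bits of the levels above.
std::vector<std::vector<bool>> build_wavelet_tree(const std::string& text) {
    // Map characters to their ranks in the effective alphabet.
    std::string sigma(text);
    std::sort(sigma.begin(), sigma.end());
    sigma.erase(std::unique(sigma.begin(), sigma.end()), sigma.end());
    std::vector<uint8_t> s;
    for (char c : text)
        s.push_back(std::lower_bound(sigma.begin(), sigma.end(), c) - sigma.begin());

    unsigned levels = 1;
    while ((1u << levels) < sigma.size()) ++levels;

    std::vector<std::vector<bool>> bv(levels);
    for (unsigned k = 0; k < levels; ++k) {
        std::vector<uint8_t> zeros, ones;
        for (uint8_t c : s) {
            bool bit = (c >> (levels - 1 - k)) & 1;  // k-th most significant bit
            bv[k].push_back(bit);
            (bit ? ones : zeros).push_back(c);
        }
        zeros.insert(zeros.end(), ones.begin(), ones.end());
        s.swap(zeros);  // the stable partition becomes the order for the next level
    }
    return bv;
}

int main() {
    for (const auto& level : build_wavelet_tree("abracadabra")) {
        for (bool b : level) std::cout << b;
        std::cout << '\n';
    }
}
```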


2020 ◽  
Vol 30 (3) ◽  
pp. 28-33 ◽  
Author(s):  
S. A. Pryadko ◽  
A. Yu. Troshin ◽  
V. D. Kozlov ◽  
A. E. Ivanov

The article describes various options for speeding up calculations on computer systems. These options are closely tied to the architecture of the systems in question. The objective of this paper is to provide the information needed to choose a suitable approach for accelerating the solution of a computational problem. The main features of the following models are described: programming for systems with shared memory, programming for systems with distributed memory, and programming for graphics accelerators (video cards). The basic concepts, principles, advantages, and disadvantages of each of the considered programming models are described. All of the programming standards described in the article can be used on both Linux and Windows operating systems. The required libraries are available and compatible with the C/C++ programming language. The article concludes with recommendations on the use of a particular technology, depending on the type of problem to be solved.
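
As a minimal illustration of the first of these models, the generic C++/OpenMP sketch below (not taken from the article) sums an array with a shared-memory parallel reduction; the distributed-memory and GPU models would instead be expressed with MPI and CUDA, respectively.

```cpp
#include <omp.h>

#include <iostream>
#include <vector>

// Shared-memory model: all threads see the same array and split one loop;
// the reduction clause combines their partial sums without explicit locking.
int main() {
    const long n = 1 << 20;
    std::vector<double> a(n, 1.0);

    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i)
        sum += a[i];

    std::cout << omp_get_max_threads() << " threads available, sum = "
              << sum << '\n';
}
```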


2005 ◽  
Vol 18 (2) ◽  
pp. 219-224
Author(s):  
Emina Milovanovic ◽  
Natalija Stojanovic

Because many universities lack the funds to purchase expensive parallel computers, cost-effective alternatives are needed to teach students about parallel processing. Free software is available to support the three major paradigms of parallel computing. Parallaxis is a sophisticated SIMD simulator which runs on a variety of platforms. The jBACI shared-memory simulator supports the MIMD model of computing with a common shared memory. PVM and MPI allow students to treat a network of workstations as a message-passing MIMD multicomputer with distributed memory. Each of these software tools can be used in a variety of courses to give students experience with parallel algorithms.
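
By way of a concrete (and generic, not course-specific) example of the message-passing MIMD model that PVM and MPI expose, the MPI sketch below gives every process its own private value and combines the values with an explicit collective call.

```cpp
#include <mpi.h>

#include <iostream>

// Distributed-memory model: each process owns its data; results are combined
// only through explicit messages (here, a single collective reduction).
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = rank + 1.0;  // stand-in for this process's share of the work
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::cout << "sum over " << size << " processes = " << total << '\n';
    MPI_Finalize();
}
```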


Algorithms ◽  
2021 ◽  
Vol 14 (12) ◽  
pp. 342
Author(s):  
Alessandro Varsi ◽  
Simon Maskell ◽  
Paul G. Spirakis

Resampling is a well-known statistical algorithm that is commonly applied in the context of Particle Filters (PFs) in order to perform state estimation for non-linear non-Gaussian dynamic models. As the models become more complex and accurate, the run-time of PF applications grows accordingly. Parallel computing can help to address this. However, resampling (and, hence, PFs as well) necessarily involves a bottleneck, the redistribution step, which is notoriously challenging to parallelize using textbook parallel computing techniques. A state-of-the-art redistribution takes O((log₂ N)²) computations on Distributed Memory (DM) architectures, which most supercomputers adopt, whereas redistribution can be performed in O(log₂ N) on Shared Memory (SM) architectures, such as GPUs or mainstream CPUs. In this paper, we propose a novel parallel redistribution for DM that achieves O(log₂ N) time complexity. We also present empirical results indicating that our novel approach outperforms the O((log₂ N)²) approach.
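
For context, the sketch below is the textbook serial form of the redistribution step: given the number of copies that resampling assigns to each particle, it writes out the resampled population in O(N) time. It is not the authors' O(log₂ N) distributed-memory algorithm, and the particle values and counts are invented for illustration.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Serial redistribution: duplicate each particle according to its copy count,
// producing a resampled population of the same size N.
std::vector<double> redistribute(const std::vector<double>& particles,
                                 const std::vector<std::size_t>& copies) {
    std::vector<double> out;
    out.reserve(particles.size());
    for (std::size_t i = 0; i < particles.size(); ++i)
        for (std::size_t c = 0; c < copies[i]; ++c)
            out.push_back(particles[i]);
    return out;
}

int main() {
    std::vector<double> particles = {0.1, 0.7, 1.3, 2.0};
    std::vector<std::size_t> copies = {0, 2, 1, 1};  // counts sum to N = 4
    for (double p : redistribute(particles, copies)) std::cout << p << ' ';
    std::cout << '\n';
}
```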


Author(s):  
L. Giraud

This note presents some experiments on different clusters of SMPs, where distributed- and shared-memory parallel programming paradigms can be naturally combined. Although the platforms exhibit the same macroscopic memory organization, their individual overall performance turns out to depend closely on the ability of the hardware to exploit the local shared memory within the nodes efficiently. In that context, a cache-blocking strategy is important not only for getting good performance out of each individual processor but, above all, out of the overall computing node, since locally shared memory can become a severe bottleneck. On a very simple benchmark, representative of many large simulation codes, we show through numerical experiments that mixing the two programming models yields attractive speed-ups that compete with a pure distributed-memory approach. This opens promising perspectives for smoothly migrating large industrial codes, developed for distributed-memory vector computers with a moderate number of processors, to clusters of SMPs, an emerging platform for intensive scientific computing.
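
As an illustration of the cache-blocking idea, the hypothetical kernel below (a stand-in, not the benchmark used in the note) transposes a matrix tile by tile inside an OpenMP region, so each thread mostly reuses data that fits in its cache instead of streaming through the node's shared memory.

```cpp
#include <omp.h>

#include <algorithm>
#include <iostream>
#include <vector>

// Cache-blocked transpose: every thread works on small B x B tiles, which
// keeps its working set in cache and eases pressure on the shared memory bus.
void blocked_transpose(const std::vector<double>& a, std::vector<double>& at,
                       std::size_t n, std::size_t B = 64) {
#pragma omp parallel for collapse(2)
    for (std::size_t ib = 0; ib < n; ib += B)
        for (std::size_t jb = 0; jb < n; jb += B)
            for (std::size_t i = ib; i < std::min(ib + B, n); ++i)
                for (std::size_t j = jb; j < std::min(jb + B, n); ++j)
                    at[j * n + i] = a[i * n + j];
}

int main() {
    const std::size_t n = 1024;
    std::vector<double> a(n * n, 1.0), at(n * n, 0.0);
    blocked_transpose(a, at, n);
    std::cout << at[0] << '\n';
}
```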


2015 ◽  
Vol 48 (30) ◽  
pp. 221-226 ◽  
Author(s):  
Shuangshuang Jin ◽  
Yousu Chen ◽  
Di Wu ◽  
Ruisheng Diao ◽  
Zhenyu (Henry) Huang

2002 ◽  
Vol 35 (3) ◽  
pp. 374-376 ◽  
Author(s):  
Jason Rappleye ◽  
Martins Innus ◽  
Charles M. Weeks ◽  
Russ Miller

The computer program SnB implements a direct-methods algorithm, known as Shake-and-Bake, which optimizes trial structures consisting of randomly positioned atoms. Although large Shake-and-Bake applications require significant amounts of computing time, the algorithm can be easily implemented in parallel in order to decrease the real time required to achieve a solution. By using a master–worker model, SnB version 2.2 is amenable to all of the prevalent modern parallel-computing platforms, including (i) shared-memory multiprocessor machines, such as the SGI Origin2000, (ii) distributed-memory multiprocessor machines, such as the IBM SP, and (iii) collections of workstations, including Beowulf clusters. A linear speedup in the processing of a fixed number of trial structures can be obtained on each of these platforms.
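
The skeleton below illustrates such a master–worker layout in MPI; it is hypothetical and not SnB source code. Rank 0 hands out trial indices on demand, so faster workers simply receive more trials, which is consistent with a near-linear speedup over a fixed pool of trials.

```cpp
#include <mpi.h>

#include <iostream>

// Hypothetical master-worker skeleton: rank 0 distributes trial indices on
// request; each worker "processes" a trial, returns a result, and asks for
// the next one until it receives the STOP sentinel.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int n_trials = 100, STOP = -1;

    if (rank == 0) {  // master
        int next = 0;
        for (int stopped = 0; stopped < size - 1; ) {
            MPI_Status st;
            int result;
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
            int task = (next < n_trials) ? next++ : STOP;
            if (task == STOP) ++stopped;
            MPI_Send(&task, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
        }
        std::cout << "all " << n_trials << " trials assigned\n";
    } else {          // worker
        int task = 0, result = 0;
        while (true) {
            MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  // request work
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (task == STOP) break;
            result = task;  // stand-in for refining one trial structure
        }
    }
    MPI_Finalize();
}
```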


2000 ◽  
Vol 8 (1) ◽  
pp. 5-12 ◽  
Author(s):  
John Michalakes

Beginning with the March 1998 release of the Penn State University/NCAR Mesoscale Model (MM5), and continuing through eight subsequent releases up to the present, the official version has run on distributed-memory (DM) parallel computers. Source translation and runtime library support minimize the impact of parallelization on the original model source code, with the result that the majority of code is line-for-line identical with the original version. Parallel performance and scaling are equivalent to earlier, hand-parallelized versions; the modifications have no effect when the code is compiled and run without the DM option. Supported computers include the IBM SP, Cray T3E, Fujitsu VPP, Compaq Alpha clusters, and clusters of PCs (so-called Beowulf clusters). The approach also is compatible with shared-memory parallel directives, allowing distributed-memory/shared-memory hybrid parallelization on distributed-memory clusters of symmetric multiprocessors.
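
The fragment below is a generic hybrid sketch, not MM5 code: MPI decomposes a global field across nodes (distributed memory), while OpenMP threads share each node's memory to process the local portion, mirroring the hybrid mode described above.

```cpp
#include <mpi.h>
#include <omp.h>

#include <iostream>
#include <vector>

// Hybrid parallelization: one MPI rank per node owns a slice of the domain,
// and OpenMP threads within the node share that slice.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long global = 1 << 22;
    const long local = global / size;  // this rank's slice (assumes divisibility)
    std::vector<double> field(local, 1.0);

    double local_sum = 0.0;
#pragma omp parallel for reduction(+ : local_sum)  // shared memory within the node
    for (long i = 0; i < local; ++i)
        local_sum += field[i];

    double global_sum = 0.0;  // explicit message across nodes
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::cout << "global sum = " << global_sum << '\n';
    MPI_Finalize();
}
```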


2000 ◽  
Vol 8 (3) ◽  
pp. 163-181 ◽  
Author(s):  
John Bircsak ◽  
Peter Craig ◽  
RaeLyn Crowell ◽  
Zarka Cvetanovic ◽  
Jonathan Harris ◽  
...  

This paper describes extensions to OpenMP that implement data placement features needed for NUMA architectures. OpenMP is a collection of compiler directives and library routines used to write portable parallel programs for shared-memory architectures. Writing efficient parallel programs for NUMA architectures, which have characteristics of both shared-memory and distributed-memory architectures, requires that a programmer control the placement of data in memory and the placement of computations that operate on that data. Optimal performance is obtained when computations occur on processors that have fast access to the data needed by those computations. OpenMP, designed for shared-memory architectures, does not by itself address these issues. The extensions to OpenMP Fortran presented here have been mainly taken from High Performance Fortran. The paper describes some of the techniques that the Compaq Fortran compiler uses to generate efficient code based on these extensions. It also describes some additional compiler optimizations, and concludes with some preliminary results.
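
The paper's extensions are HPF-style directives for Compaq Fortran; as a language-neutral illustration of the same data-placement concern, the C++/OpenMP sketch below uses the common first-touch policy instead (a standard alternative technique, not the paper's mechanism): data is initialized with the same static schedule later used for computation, so pages land in the memory local to the threads that access them.

```cpp
#include <iostream>
#include <memory>

// First-touch placement on NUMA: the thread that first writes a page gets it
// allocated in its local memory, so initializing and computing with the same
// static schedule keeps most accesses node-local.
int main() {
    const long n = 1 << 24;
    std::unique_ptr<double[]> x(new double[n]);  // uninitialized, pages not yet placed
    std::unique_ptr<double[]> y(new double[n]);

#pragma omp parallel for schedule(static)        // placement happens here
    for (long i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

#pragma omp parallel for schedule(static)        // same schedule: local accesses
    for (long i = 0; i < n; ++i) y[i] += 3.0 * x[i];

    std::cout << y[0] << '\n';
}
```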

