Minimal Aggregated Shared Memory Messaging on Distributed Memory Supercomputers

Practical Wavelet Tree Construction

Journal of Experimental Algorithmics ◽

10.1145/3457197 ◽

2021 ◽

Vol 26 ◽

pp. 1-67

Author(s):

Patrick Dinklage ◽

Jonas Ellert ◽

Johannes Fischer ◽

Florian Kurpicz ◽

Marvin Löbel

Keyword(s):

Parallel Algorithms ◽

Shared Memory ◽

Distributed Memory ◽

Auxiliary Information ◽

Parallel Computers ◽

External Memory ◽

Sequential Algorithms ◽

Bottom Up ◽

Memory Efficiency ◽

Tree Construction

We present new sequential and parallel algorithms for wavelet tree construction based on a new bottom-up technique. This technique makes use of the structure of the wavelet trees—refining the characters represented in a node of the tree with increasing depth—in an opposite way, by first computing the leaves (most refined), and then propagating this information upwards to the root of the tree. We first describe new sequential algorithms, both in RAM and external memory. Based on these results, we adapt these algorithms to parallel computers, where we address both shared memory and distributed memory settings. In practice, all our algorithms outperform previous ones in both time and memory efficiency, because we can compute all auxiliary information solely based on the information we obtained from computing the leaves. Most of our algorithms are also adapted to the wavelet matrix , a variant that is particularly suited for large alphabets.

Download Full-text

Parallel programming technologies on computer complexes

Radio Industry (Russia) ◽

10.21778/2413-9599-2020-30-3-28-33 ◽

2020 ◽

Vol 30 (3) ◽

pp. 28-33 ◽

Cited By ~ 1

Author(s):

S. A. Pryadko ◽

A. Yu. Troshin ◽

V. D. Kozlov ◽

A. E. Ivanov

Keyword(s):

Parallel Programming ◽

Shared Memory ◽

Programming Language ◽

Operating Systems ◽

Distributed Memory ◽

Programming Models ◽

Writing Programs ◽

Advantages And Disadvantages ◽

C Programming Language ◽

C Programming

The article describes various options for speeding up calculations on computer systems. These features are closely related to the architecture of these complexes. The objective of this paper is to provide necessary information when selecting the capability for the speeding process of solving the computation problem. The main features implemented using the following models are described: programming in systems with shared memory, programming in systems with distributed memory, and programming on graphics accelerators (video cards). The basic concept, principles, advantages, and disadvantages of each of the considered programming models are described. All standards for writing programs described in the article can be used both on Linux and Windows operating systems. The required libraries are available and compatible with the C/C++ programming language. The article concludes with recommendations on the use of a particular technology, depending on the type of task to be solved.

Download Full-text

Teaching tools for parallel processing

Facta universitatis - series Electronics and Energetics ◽

10.2298/fuee0502219m ◽

2005 ◽

Vol 18 (2) ◽

pp. 219-224

Author(s):

Emina Milovanovic ◽

Natalija Stojanovic

Keyword(s):

Parallel Computing ◽

Parallel Processing ◽

Shared Memory ◽

Message Passing ◽

Distributed Memory ◽

Cost Effective ◽

Parallel Computers ◽

Free Software ◽

Teaching Tools ◽

Network Of Workstations

Because many universities lack the funds to purchase expensive parallel computers, cost effective alternatives are needed to teach students about parallel processing. Free software is available to support the three major paradigms of parallel computing. Parallaxis is a sophisticated SIMD simulator which runs on a variety of platforms.jBACI shared memory simulator supports the MIMD model of computing with a common shared memory. PVM and MPI allow students to treat a network of workstations as a message passing MIMD multicomputer with distributed memory. Each of this software tools can be used in a variety of courses to give students experience with parallel algorithms.

Download Full-text

An O(log2N) Fully-Balanced Resampling Algorithm for Particle Filters on Distributed Memory Architectures

Algorithms ◽

10.3390/a14120342 ◽

2021 ◽

Vol 14 (12) ◽

pp. 342

Author(s):

Alessandro Varsi ◽

Simon Maskell ◽

Paul G. Spirakis

Keyword(s):

Parallel Computing ◽

Shared Memory ◽

Time Complexity ◽

Distributed Memory ◽

Particle Filters ◽

Dynamic Models ◽

State Of The Art ◽

Novel Approach ◽

Non Gaussian ◽

Memory Architectures

Resampling is a well-known statistical algorithm that is commonly applied in the context of Particle Filters (PFs) in order to perform state estimation for non-linear non-Gaussian dynamic models. As the models become more complex and accurate, the run-time of PF applications becomes increasingly slow. Parallel computing can help to address this. However, resampling (and, hence, PFs as well) necessarily involves a bottleneck, the redistribution step, which is notoriously challenging to parallelize if using textbook parallel computing techniques. A state-of-the-art redistribution takes O((log2N)2) computations on Distributed Memory (DM) architectures, which most supercomputers adopt, whereas redistribution can be performed in O(log2N) on Shared Memory (SM) architectures, such as GPU or mainstream CPUs. In this paper, we propose a novel parallel redistribution for DM that achieves an O(log2N) time complexity. We also present empirical results that indicate that our novel approach outperforms the O((log2N)2) approach.

Download Full-text

Vienna Fortran – A Fortran Language Extension for Distributed Memory Multiprocessors* *The work described in this paper is being carried out as part of the research project “Virtual Shared Memory for Multiprocessor Systems with Distributed Memory” funded by the Austrian Research Foundation (FWF) under the grant number P7576-TEC and the ESPRIT project “An Automatic Parallelization System for Genesis” funded by the Austrian Ministry for Science and Research (BMWF). This research was also supported by the National Aeronautics and Space Administration under NASA contract NAS1-18605 while the authors were in residence at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23666. The authors assume all responsibility for the contents of the paper.

Advances in Parallel Computing - Languages, Compilers and Run-Time Environments for Distributed Memory Machines ◽

10.1016/b978-0-444-88712-2.50007-x ◽

1992 ◽

pp. 39-62 ◽

Cited By ~ 5

Author(s):

Barbara M. Chapman ◽

Piyush Mehrotra ◽

Hans P. Zima

Keyword(s):

Shared Memory ◽

Distributed Memory ◽

Automatic Parallelization ◽

Research Center ◽

Multiprocessor Systems ◽

Language Extension ◽

Virtual Shared Memory ◽

Fortran Language ◽

Langley Research Center ◽

Science And Research

Download Full-text

Combining Shared and Distributed Memory Programming Models on Clusters of Symmetric Multiprocessors: Some Basic Promising Experiments

The International Journal of High Performance Computing Applications ◽

10.1177/109434200201600405 ◽

2002 ◽

Vol 16 (4) ◽

pp. 425-430 ◽

Cited By ~ 3

Author(s):

L. Giraud

Keyword(s):

Shared Memory ◽

Numerical Experiments ◽

Distributed Memory ◽

Programming Models ◽

Symmetric Multiprocessors ◽

Moderate Number ◽

Overall Performance ◽

Computing Node ◽

Vector Computers ◽

Programming Paradigms

This note presents some experiments on different clusters of SMPs, where both distributed and shared memory parallel programming paradigms can be naturally combined. Although the platforms exhibit the same macroscopic memory organization, it appears that their individual overall performance is closely dependent on the ability of their hardware to efficiently exploit the local shared memory within the nodes. In that context, cache blocking strategy appears to be very important not only to get good performance out of each individual processor but mainly good performance out of the overall computing node since sharing memory locally might become a severe bottleneck. On a very simple benchmark, representative of many large simulation codes, we show through numerical experiments that mixing the two programming models enables us to get attractive speed-ups that compete with a pure distributed memory approach. This opens promising perspectives for smoothly moving large industrial codes developed on distributed vector computers with a moderate number of processors on these emerging platforms for intensive scientific computing that are the clusters of SMPs.

Download Full-text

Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

IFAC-PapersOnLine ◽

10.1016/j.ifacol.2015.12.381 ◽

2015 ◽

Vol 48 (30) ◽

pp. 221-226 ◽

Cited By ~ 1

Author(s):

Shuangshuang Jin ◽

Yousu Chen ◽

Di Wu ◽

Ruisheng Diao ◽

Zhenyu (Henry) Huang

Keyword(s):

Shared Memory ◽

Dynamic Simulation ◽

Distributed Memory

Download Full-text

SnBversion 2.2: an example of crystallographic multiprocessing

Journal of Applied Crystallography ◽

10.1107/s0021889802005782 ◽

2002 ◽

Vol 35 (3) ◽

pp. 374-376 ◽

Cited By ~ 24

Author(s):

Jason Rappleye ◽

Martins Innus ◽

Charles M. Weeks ◽

Russ Miller

Keyword(s):

Shared Memory ◽

Distributed Memory ◽

Computing Time ◽

Direct Methods ◽

Fixed Number ◽

Linear Speedup ◽

Beowulf Clusters ◽

Shared Memory Multiprocessor ◽

Computing Platforms ◽

Time Required

The computer programSnBimplements a direct-methods algorithm, known asShake-and-Bake, which optimizes trial structures consisting of randomly positioned atoms. Although largeShake-and-Bakeapplications require significant amounts of computing time, the algorithm can be easily implemented in parallel in order to decrease the real time required to achieve a solution. By using a master–worker model,SnBversion 2.2 is amenable to all of the prevalent modern parallel-computing platforms, including (i) shared-memory multiprocessor machines, such as the SGI Origin2000, (ii) distributed-memory multiprocessor machines, such as the IBM SP, and (iii) collections of workstations, including Beowulf clusters. A linear speedup in the processing of a fixed number of trial structures can be obtained on each of these platforms.

Download Full-text

The Same-Source Parallel MM5

Scientific Programming ◽

10.1155/2000/712795 ◽

2000 ◽

Vol 8 (1) ◽

pp. 5-12 ◽

Cited By ~ 6

Author(s):

John Michalakes

Keyword(s):

Shared Memory ◽

Distributed Memory ◽

Mesoscale Model ◽

State University ◽

Parallel Performance ◽

Hybrid Parallelization ◽

Beowulf Clusters ◽

Penn State ◽

Library Support ◽

The Impact

Beginning with the March 1998 release of the Penn State University/NCAR Mesoscale Model (MM5), and continuing through eight subsequent releases up to the present, the official version has run on distributed -memory (DM) parallel computers. Source translation and runtime library support minimize the impact of parallelization on the original model source code, with the result that the majority of code is line-for-line identical with the original version. Parallel performance and scaling are equivalent to earlier, hand-parallelized versions; the modifications have no effect when the code is compiled and run without the DM option. Supported computers include the IBM SP, Cray T3E, Fujitsu VPP, Compaq Alpha clusters, and clusters of PCs (so-called Beowulf clusters). The approach also is compatible with shared-memory parallel directives, allowing distributed-memory/shared-memory hybrid parallelization on distributed-memory clusters of symmetric multiprocessors.

Download Full-text

Extending OpenMP for NUMA Machines

Scientific Programming ◽

10.1155/2000/464182 ◽

2000 ◽

Vol 8 (3) ◽

pp. 163-181 ◽

Cited By ~ 16

Author(s):

John Bircsak ◽

Peter Craig ◽

RaeLyn Crowell ◽

Zarka Cvetanovic ◽

Jonathan Harris ◽

...

Keyword(s):

Shared Memory ◽

High Performance ◽

Distributed Memory ◽

Parallel Programs ◽

Compiler Optimizations ◽

High Performance Fortran ◽

Efficient Code ◽

Memory Architectures ◽

Shared Memory Architectures ◽

Fast Access

This paper describes extensions to OpenMP that implement data placement features needed for NUMA architectures. OpenMP is a collection of compiler directives and library routines used to write portable parallel programs for shared-memory architectures. Writing efficient parallel programs for NUMA architectures, which have characteristics of both shared-memory and distributed-memory architectures, requires that a programmer control the placement of data in memory and the placement of computations that operate on that data. Optimal performance is obtained when computations occur on processors that have fast access to the data needed by those computations. OpenMP -- designed for shared-memory architectures -- does not by itself address these issues. The extensions to OpenMP Fortran presented here have been mainly taken from High Performance Fortran. The paper describes some of the techniques that the Compaq Fortran compiler uses to generate efficient code based on these extensions. It also describes some additional compiler optimizations, and concludes with some preliminary results.

Download Full-text