Extending OpenMP for NUMA Machines

John Bircsak; Peter Craig; RaeLyn Crowell; Zarka Cvetanovic; Jonathan Harris; C. Alexander Nelson; Carl D. Offner

doi:10.1155/2000/464182


Scientific Programming ◽

10.1155/2000/464182 ◽

2000 ◽

Vol 8 (3) ◽

pp. 163-181 ◽

Cited By ~ 16

Author(s):

John Bircsak ◽

Peter Craig ◽

RaeLyn Crowell ◽

Zarka Cvetanovic ◽

Jonathan Harris ◽

...

Keyword(s):

Shared Memory ◽

High Performance ◽

Distributed Memory ◽

Parallel Programs ◽

Compiler Optimizations ◽

High Performance Fortran ◽

Efficient Code ◽

Memory Architectures ◽

Shared Memory Architectures ◽

Fast Access

This paper describes extensions to OpenMP that implement data placement features needed for NUMA architectures. OpenMP is a collection of compiler directives and library routines used to write portable parallel programs for shared-memory architectures. Writing efficient parallel programs for NUMA architectures, which have characteristics of both shared-memory and distributed-memory architectures, requires that a programmer control the placement of data in memory and the placement of computations that operate on that data. Optimal performance is obtained when computations occur on processors that have fast access to the data needed by those computations. OpenMP -- designed for shared-memory architectures -- does not by itself address these issues. The extensions to OpenMP Fortran presented here have been mainly taken from High Performance Fortran. The paper describes some of the techniques that the Compaq Fortran compiler uses to generate efficient code based on these extensions. It also describes some additional compiler optimizations, and concludes with some preliminary results.
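The abstract does not spell out the directives themselves, so the following is only a minimal sketch of the underlying idea, assuming an HPF-style BLOCK distribution and an owner-computes rule. The function names (`block_owner`, `local_range`, `owner_computes_scale`) are hypothetical and are not taken from the paper or from the Compaq Fortran compiler.

```python
import math


def block_owner(i, n, p):
    """Processor that owns element i of an n-element array split into p blocks.

    Assumes an HPF-style BLOCK distribution with block size ceil(n / p).
    """
    block = math.ceil(n / p)
    return i // block


def local_range(rank, n, p):
    """Indices owned by `rank` under the same BLOCK distribution (may be empty)."""
    block = math.ceil(n / p)
    lo = min(rank * block, n)
    hi = min(lo + block, n)
    return range(lo, hi)


def owner_computes_scale(b, p, factor=2.0):
    """Serial stand-in for an owner-computes loop: each rank writes only its own block."""
    n = len(b)
    a = [0.0] * n
    for rank in range(p):               # on a real machine these run concurrently
        for i in local_range(rank, n, p):
            a[i] = factor * b[i]        # operands live in the memory close to `rank`
    return a
```

With 10 elements on 4 processors the block size is 3, so processor 3 owns only the last element. Placing each iteration on the owner of the element it writes is what keeps most memory accesses local.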


Parallel Array Classes and Lightweight Sharing Mechanisms

Scientific Programming ◽

10.1155/1993/393409 ◽

1993 ◽

Vol 2 (4) ◽

pp. 203-216

Author(s):

Steve W. Otto

Keyword(s):

Finite Element Method ◽

Shared Memory ◽

Message Passing ◽

Distributed Memory ◽

Programming Model ◽

Memory Usage ◽

Particle In Cell ◽

Parallel Array ◽

Memory Architectures ◽

Shared Memory Architectures

We discuss a set of parallel array classes, MetaMP, for distributed-memory architectures. The classes are implemented in C++ and interface to the PVM or Intel NX message-passing systems. An array class implements a partitioned array as a set of objects distributed across the nodes – a "collective" object. Object methods hide the low-level message-passing and implement meaningful array operations. These include transparent guard strips (or sharing regions) that support finite-difference stencils, reductions and multibroadcasts for support of pivoting and row operations, and interpolation/contraction operations for support of multigrid algorithms. The concept of guard strips is generalized to an object implementation of lightweight sharing mechanisms for finite element method (FEM) and particle-in-cell (PIC) algorithms. The sharing is accomplished through the mechanism of weak memory coherence and can be efficiently implemented. The price of the efficient implementation is memory usage and the need to explicitly specify the coherence operations. An intriguing feature of this programming model is that it maps well to both distributed-memory and shared-memory architectures.
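The guard-strip idea for finite-difference stencils can be illustrated with a small serial sketch. This is not the MetaMP C++ API: the function names are hypothetical, plain list copies stand in for PVM or NX messages, and a fixed boundary value of 0.0 is an assumption made for the example.

```python
def split_with_guards(data, parts):
    """Split data into `parts` contiguous chunks, each padded with one guard cell per side.

    Assumes len(data) is large enough that every chunk is non-empty.
    """
    size = -(-len(data) // parts)       # ceiling division
    chunks = [data[k * size:(k + 1) * size] for k in range(parts)]
    return [[0.0] + list(c) + [0.0] for c in chunks]


def exchange_guards(chunks):
    """Fill guard cells from neighbouring chunks (stand-in for message passing)."""
    for k in range(len(chunks) - 1):
        left, right = chunks[k], chunks[k + 1]
        left[-1] = right[1]             # right neighbour's first interior value
        right[0] = left[-2]             # left neighbour's last interior value


def smooth_local(chunk):
    """Three-point average over the interior cells, reading the guards at the edges."""
    return [(chunk[i - 1] + chunk[i] + chunk[i + 1]) / 3.0
            for i in range(1, len(chunk) - 1)]
```

After `exchange_guards`, each chunk can apply the stencil without further communication, which is the point of hiding the exchange behind an array-class method.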


A comparison of message passing and shared memory architectures for data parallel programs

Proceedings of the 21st International Symposium on Computer Architecture ◽

10.1109/isca.1994.288158 ◽

2002 ◽

Cited By ~ 7

Author(s):

A.C. Klaiber ◽

H.M. Levy

Keyword(s):

Shared Memory ◽

Message Passing ◽

Parallel Programs ◽

Data Parallel ◽

Memory Architectures ◽

Shared Memory Architectures


A comparison of message passing and shared memory architectures for data parallel programs

ACM SIGARCH Computer Architecture News ◽

10.1145/192007.192020 ◽

1994 ◽

Vol 22 (2) ◽

pp. 94-105 ◽

Cited By ~ 3

Author(s):

A. C. Klaiber ◽

H. M. Levy

Keyword(s):

Shared Memory ◽

Message Passing ◽

Parallel Programs ◽

Data Parallel ◽

Memory Architectures ◽

Shared Memory Architectures


Trojan: a high-performance simulator for shared memory architectures

Proceedings of the 29th Annual Simulation Symposium ◽

10.1109/simsym.1996.492151 ◽

2002 ◽

Cited By ~ 5

Author(s):

D. Park ◽

R.H. Saavedra

Keyword(s):

Shared Memory ◽

High Performance ◽

Memory Architectures ◽

Shared Memory Architectures


Parallel simulation

10.1093/oso/9780198803195.003.0007 ◽

2017 ◽

Author(s):

Michael P. Allen ◽

Dominic J. Tildesley

Keyword(s):

Shared Memory ◽

Message Passing ◽

High Performance ◽

Distributed Memory ◽

Nested Loops ◽

Code Domain ◽

Basic Approaches ◽

Effective Use ◽

Memory Architectures ◽

Performance Computing

Parallelization is essential for the effective use of modern high-performance computing facilities. This chapter summarizes some of the basic approaches that are commonly used in molecular simulation programs. The underlying shared-memory and distributed-memory architectures are explained. The concept of program threads and their use in parallelizing nested loops on a shared memory machine is described. Parallel tempering using message passing on a distributed memory machine is discussed and illustrated with an example code. Domain decomposition, and the implementation of constraints on parallel computers, are also explained.
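The chapter's own example code is not reproduced in this abstract. As a rough serial sketch of the replica-exchange step in parallel tempering, the standard Metropolis swap criterion can be written as follows; the function names are hypothetical, and the energies stand in for full configurations that a message-passing code would exchange between processes.

```python
import math
import random


def swap_probability(beta_i, beta_j, energy_i, energy_j):
    """Metropolis acceptance probability for exchanging two replicas.

    Uses min(1, exp((beta_i - beta_j) * (energy_i - energy_j))).
    """
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_neighbour_swaps(betas, energies, rng):
    """Try one swap between each pair of neighbouring replicas.

    Returns a new list of energies; each entry stands in for a configuration.
    """
    energies = list(energies)
    for k in range(len(betas) - 1):
        p = swap_probability(betas[k], betas[k + 1], energies[k], energies[k + 1])
        if rng.random() < p:
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
    return energies


if __name__ == "__main__":
    rng = random.Random(0)              # fixed seed, purely for repeatability
    print(attempt_neighbour_swaps([1.0, 0.8, 0.6], [-5.0, -7.0, -6.0], rng))
```

A swap that moves the lower-energy configuration to the colder replica (larger beta) is always accepted; the reverse move is accepted with an exponentially small probability.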


Design Space Optimization of Shared Memory Architecture in Accelerator-rich Systems

ACM Transactions on Design Automation of Electronic Systems ◽

10.1145/3446001 ◽

2021 ◽

Vol 26 (4) ◽

pp. 1-31

Author(s):

Mitali Sinha ◽

Gade Sri Harsha ◽

Pramit Bhattacharyya ◽

Sujay Deb

Keyword(s):

Shared Memory ◽

High Performance ◽

Design Space ◽

Data Allocation ◽

Memory Architecture ◽

Shared Data ◽

Memory Sharing ◽

Network Contention ◽

Memory Architectures ◽

Shared Memory Architectures

Shared memory architectures, as opposed to private-only memories, offer a viable way to meet the ever-increasing memory requirements of multi-accelerator systems and to achieve high performance under stringent area and energy constraints. However, impulsive memory sharing degrades performance because of network contention and the latency of accessing shared memory. We propose the Accelerator Shared Memory (ASM) framework to provide an optimal private/shared memory configuration and shared data allocation under a system’s resource and network constraints. Evaluations show that ASM provides up to 34.35% and 31.34% improvement in performance and energy, respectively, over baseline systems.


Compiling High Performance Fortran for distributed-memory architectures

Parallel Computing ◽

10.1016/s0167-8191(99)00074-5 ◽

1999 ◽

Vol 25 (13-14) ◽

pp. 1785-1825 ◽

Cited By ~ 9

Author(s):

Siegfried Benkner ◽

Hans Zima

Keyword(s):

High Performance ◽

Distributed Memory ◽

High Performance Fortran ◽

Memory Architectures


Coding Productivity in MapReduce Applications for Distributed and Shared Memory Architectures

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194015710096 ◽

2015 ◽

Vol 25 (09n10) ◽

pp. 1739-1741

Author(s):

Daniel Adornes ◽

Dalvan Griebler ◽

Cleverson Ledur ◽

Luiz Gustavo Fernandes

Keyword(s):

Shared Memory ◽

Code Generation ◽

High Performance ◽

Domain Specific ◽

Significant Performance ◽

Memory Architectures ◽

Programming Interfaces ◽

Shared Memory Architectures ◽

Performance Computing

MapReduce was originally proposed as a suitable and efficient approach for analyzing and processing large amounts of data. Since then, many researchers have contributed MapReduce implementations for distributed and shared memory architectures. Nevertheless, different architectural levels require different optimization strategies in order to achieve high-performance computing. Such strategies in turn have led to very different MapReduce programming interfaces across these efforts. This paper presents some research notes on coding productivity when developing MapReduce applications for distributed and shared memory architectures. As a case study, we introduce our current research on a unified MapReduce domain-specific language with code generation for Hadoop and Phoenix++, which has achieved a coding productivity increase ranging from 41.84% up to 94.71% without significant performance losses (below 3%) compared to those frameworks.


HPF Library and Compiler Support for Halos in Data Parallel Irregular Computations

Parallel Processing Letters ◽

10.1142/s0129626400000196 ◽

2000 ◽

Vol 10 (02n03) ◽

pp. 189-200 ◽

Cited By ~ 1

Author(s):

Thomas Brandes

Keyword(s):

Message Passing ◽

High Performance ◽

Parallel Programs ◽

Address Space ◽

Compiler Support ◽

Data Parallel ◽

High Performance Fortran ◽

Non Local ◽

Performance Results ◽

Memory Architectures

On distributed memory architectures, data parallel compilers emulate the global address space by distributing the data onto the processors according to the user's mapping directives and by automatically generating explicit inter-processor communication. A shadow is additional local memory allocated so that a processor can also keep the non-local values of the data that it accesses or defines. While shadow edges are already well studied for structured grids, this paper focuses on their use for applications with unstructured grids, where updates on the shadow edges involve unstructured communication with complex communication schedules. The use of shadow edges is considered for High Performance Fortran (HPF) as the de facto standard language for writing data parallel programs in Fortran. A library with an HPF binding provides explicit control of unstructured shadows and their communication schedules, also called halos. This halo library allows writing HPF programs with performance close to that of hand-coded message-passing versions, while freeing the user from the burden of calculating shadow sizes and communication schedules and of exchanging data with explicit message-passing commands. In certain situations, the HPF compiler can create and use halos automatically. This paper shows the advantages and also the limits of this approach. The halo library and automatic support for halos have been implemented within the ADAPTOR HPF compilation system. The performance results verify the effectiveness of the chosen approach.
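To make the notion of a halo on an unstructured grid concrete, here is a minimal sketch that derives, from an edge list and a node-to-processor mapping, which non-local nodes each processor must receive and from whom. It does not use the ADAPTOR halo library or its HPF binding; `build_halo_schedule` and its data layout are hypothetical.

```python
from collections import defaultdict


def build_halo_schedule(edges, owner):
    """Compute a receive schedule for an unstructured grid.

    edges: iterable of (a, b) node pairs.
    owner: list where owner[n] is the processor holding node n.
    Returns {p: {q: [nodes p must receive from q]}}.
    """
    recv = defaultdict(lambda: defaultdict(set))
    for a, b in edges:
        pa, pb = owner[a], owner[b]
        if pa != pb:                    # the edge crosses a partition boundary
            recv[pa][pb].add(b)         # pa needs a shadow copy of b
            recv[pb][pa].add(a)         # pb needs a shadow copy of a
    return {p: {q: sorted(nodes) for q, nodes in sources.items()}
            for p, sources in recv.items()}


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
    owner = [0, 0, 1, 1]
    print(build_halo_schedule(edges, owner))
```

The length of each list gives the shadow size for that neighbour, and the send schedule is the same table read in the other direction. Because the schedule depends only on the grid connectivity and the mapping, it can be computed once and reused for every update.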


Realising a concurrent object-based programming model on parallel virtual shared memory architectures

Programming Models for Massively Parallel Computers ◽

10.1109/pmmpc.1995.504345 ◽

2002 ◽

Cited By ~ 1

Author(s):

M. Fisher ◽

J. Keane

Keyword(s):

Shared Memory ◽

Programming Model ◽

Object Based ◽

Virtual Shared Memory ◽

Memory Architectures ◽

Shared Memory Architectures
