Extending OpenMP for NUMA Machines

2000 ◽  
Vol 8 (3) ◽  
pp. 163-181 ◽  
Author(s):  
John Bircsak ◽  
Peter Craig ◽  
RaeLyn Crowell ◽  
Zarka Cvetanovic ◽  
Jonathan Harris ◽  
...  

This paper describes extensions to OpenMP that implement data placement features needed for NUMA architectures. OpenMP is a collection of compiler directives and library routines used to write portable parallel programs for shared-memory architectures. Writing efficient parallel programs for NUMA architectures, which have characteristics of both shared-memory and distributed-memory architectures, requires that a programmer control the placement of data in memory and the placement of computations that operate on that data. Optimal performance is obtained when computations occur on processors that have fast access to the data needed by those computations. OpenMP -- designed for shared-memory architectures -- does not by itself address these issues. The extensions to OpenMP Fortran presented here have been mainly taken from High Performance Fortran. The paper describes some of the techniques that the Compaq Fortran compiler uses to generate efficient code based on these extensions. It also describes some additional compiler optimizations, and concludes with some preliminary results.
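The placement idea in this abstract can be illustrated with a small sketch. It assumes an HPF-style BLOCK distribution and an "owner computes" assignment of loop iterations. The function names `block_owner` and `owner_computes` are hypothetical, chosen for illustration only; they are not the paper's directives or the Compaq compiler's implementation.

```python
def block_owner(i: int, n: int, p: int) -> int:
    """Return the processor that owns element i (0-based) of an n-element
    array split into contiguous blocks over p processors.

    Assumption: block size is ceil(n / p), as in an HPF-style BLOCK
    distribution; the last processor may hold fewer elements.
    """
    block = (n + p - 1) // p  # integer ceiling of n / p
    return i // block


def owner_computes(n: int, p: int) -> dict:
    """Assign each loop iteration i to the processor that owns a[i],
    so every computation runs where its data is placed."""
    work = {proc: [] for proc in range(p)}
    for i in range(n):
        work[block_owner(i, n, p)].append(i)
    return work
```

For example, `owner_computes(10, 4)` assigns iterations 0 to 2 to processor 0 and leaves only iteration 9 for processor 3, which shows the load imbalance a BLOCK distribution can produce when the array length is not a multiple of the processor count.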

1993 ◽  
Vol 2 (4) ◽  
pp. 203-216
Author(s):  
Steve W. Otto

We discuss a set of parallel array classes, MetaMP, for distributed-memory architectures. The classes are implemented in C++ and interface to the PVM or Intel NX message-passing systems. An array class implements a partitioned array as a set of objects distributed across the nodes – a "collective" object. Object methods hide the low-level message-passing and implement meaningful array operations. These include transparent guard strips (or sharing regions) that support finite-difference stencils, reductions and multibroadcasts for support of pivoting and row operations, and interpolation/contraction operations for support of multigrid algorithms. The concept of guard strips is generalized to an object implementation of lightweight sharing mechanisms for finite element method (FEM) and particle-in-cell (PIC) algorithms. The sharing is accomplished through the mechanism of weak memory coherence and can be efficiently implemented. The price of the efficient implementation is memory usage and the need to explicitly specify the coherence operations. An intriguing feature of this programming model is that it maps well to both distributed-memory and shared-memory architectures.
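The guard-strip idea can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the MetaMP classes: plain Python lists stand in for the distributed objects, a list copy stands in for message passing, and the names `partition`, `exchange_guards`, `smooth`, and `gather` are hypothetical.

```python
def partition(data, parts):
    """Split data into `parts` contiguous chunks, each padded with one
    guard cell on either side. Assumes len(data) is divisible by parts."""
    size = len(data) // parts
    return [[0.0] + data[k * size:(k + 1) * size] + [0.0] for k in range(parts)]


def exchange_guards(chunks):
    """Fill each guard cell with the neighbouring chunk's edge value.
    On a real distributed machine this copy would be a message."""
    for k, chunk in enumerate(chunks):
        if k > 0:
            chunk[0] = chunks[k - 1][-2]   # last interior cell of left neighbour
        if k < len(chunks) - 1:
            chunk[-1] = chunks[k + 1][1]   # first interior cell of right neighbour


def smooth(chunks):
    """Apply a 3-point averaging stencil to the interior cells of every
    chunk. Guard cells supply the off-chunk neighbours; the outermost
    guards stay 0.0 and act as a fixed boundary."""
    result = []
    for chunk in chunks:
        new = chunk[:]
        for i in range(1, len(chunk) - 1):
            new[i] = (chunk[i - 1] + chunk[i] + chunk[i + 1]) / 3.0
        result.append(new)
    return result


def gather(chunks):
    """Concatenate the interior cells of all chunks, dropping the guards."""
    return [x for chunk in chunks for x in chunk[1:-1]]
```

Calling `exchange_guards` before each `smooth` step is the explicit coherence operation: the stencil then reads only chunk-local memory, at the cost of the extra guard storage.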


Author(s):  
Michael P. Allen ◽  
Dominic J. Tildesley

Parallelization is essential for the effective use of modern high-performance computing facilities. This chapter summarizes some of the basic approaches that are commonly used in molecular simulation programs. The underlying shared-memory and distributed-memory architectures are explained. The concept of program threads and their use in parallelizing nested loops on a shared-memory machine is described. Parallel tempering using message passing on a distributed-memory machine is discussed and illustrated with an example code. Domain decomposition and the implementation of constraints on parallel computers are also explained.
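As an illustration of the parallel tempering step mentioned here, the sketch below uses the standard replica-exchange acceptance rule, min(1, exp[(β_i − β_j)(E_i − E_j)]). It is not the chapter's example code: the replicas live in one process rather than on separate message-passing ranks, and the names `swap_probability` and `attempt_swaps` are hypothetical.

```python
import math
import random


def swap_probability(beta_i, beta_j, energy_i, energy_j):
    """Metropolis acceptance probability for exchanging the configurations
    of two replicas at inverse temperatures beta_i and beta_j."""
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_swaps(betas, energies, configs, rng):
    """Try to exchange each pair of neighbouring replicas in turn.

    `energies` and `configs` are modified in place. In a message-passing
    code each replica would sit on its own process, and only the energies
    (and, on acceptance, the configurations or temperatures) would be sent.
    """
    for k in range(len(betas) - 1):
        p = swap_probability(betas[k], betas[k + 1], energies[k], energies[k + 1])
        if rng.random() < p:
            configs[k], configs[k + 1] = configs[k + 1], configs[k]
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
```

A caller would pass a seeded generator such as `random.Random(1)` so that runs are reproducible.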


2021 ◽  
Vol 26 (4) ◽  
pp. 1-31
Author(s):  
Mitali Sinha ◽  
Gade Sri Harsha ◽  
Pramit Bhattacharyya ◽  
Sujay Deb

Shared-memory architectures, as opposed to private-only memories, provide a viable alternative for meeting the ever-increasing memory requirements of multi-accelerator systems and achieving high performance under stringent area and energy constraints. However, impulsive memory sharing degrades performance because of network contention and the latency of accessing shared memory. We propose the Accelerator Shared Memory (ASM) framework to provide an optimal private/shared memory configuration and shared data allocation under a system’s resource and network constraints. Evaluations show ASM provides up to 34.35% and 31.34% improvement in performance and energy, respectively, over baseline systems.


2015 ◽  
Vol 25 (09n10) ◽  
pp. 1739-1741
Author(s):  
Daniel Adornes ◽  
Dalvan Griebler ◽  
Cleverson Ledur ◽  
Luiz Gustavo Fernandes

MapReduce was originally proposed as a suitable and efficient approach for analyzing and processing large amounts of data. Since then, many research efforts have contributed MapReduce implementations for distributed and shared-memory architectures. Nevertheless, different architectural levels require different optimization strategies in order to achieve high-performance computing. Such strategies have in turn led to very different MapReduce programming interfaces across these efforts. This paper presents some research notes on coding productivity when developing MapReduce applications for distributed and shared-memory architectures. As a case study, we introduce our current research on a unified MapReduce domain-specific language with code generation for Hadoop and Phoenix++, which has achieved a coding productivity increase of 41.84% to 94.71% without significant performance losses (below 3%) compared to those frameworks.
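For readers unfamiliar with the programming model, the following is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the example. It is not the authors' domain-specific language and does not use the Hadoop or Phoenix++ interfaces; every function name here is hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in the line."""
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

Only `word_mapper` and `sum_reducer` are application code; the other functions stand in for what a framework supplies, which is where distributed and shared-memory implementations differ.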


2000 ◽  
Vol 10 (02n03) ◽  
pp. 189-200 ◽  
Author(s):  
THOMAS BRANDES

On distributed-memory architectures, data-parallel compilers emulate the global address space by distributing the data onto the processors according to the user's mapping directives and by automatically generating explicit inter-processor communication. A shadow is additional local memory allocated on a processor to hold the non-local data values that this processor accesses or defines. While shadow edges are already well studied for structured grids, this paper focuses on their use for applications with unstructured grids, where updates on the shadow edges involve unstructured communication with complex communication schedules. The use of shadow edges is considered for High Performance Fortran (HPF) as the de facto standard language for writing data-parallel programs in Fortran. A library with an HPF binding provides explicit control of unstructured shadows and their communication schedules, also called halos. This halo library allows writing HPF programs with performance close to that of hand-coded message-passing versions, while freeing the user from the burden of calculating shadow sizes and communication schedules and of exchanging data with explicit message-passing commands. In certain situations, the HPF compiler can create and use halos automatically. This paper shows the advantages and also the limits of this approach. The halo library and automatic support for halos have been implemented within the ADAPTOR HPF compilation system. The performance results verify the effectiveness of the chosen approach.
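To make the notion of a communication schedule for an unstructured grid concrete, the sketch below derives one from a node-ownership array and an edge list. It is an illustration under stated assumptions, not the ADAPTOR halo library or its HPF binding; the function name `build_halo_schedule` and the data layout are hypothetical.

```python
def build_halo_schedule(owner, edges):
    """Compute which non-local nodes each processor needs and who sends them.

    owner: list where owner[n] is the processor that owns grid node n.
    edges: list of (a, b) node pairs of the unstructured grid.

    Returns a dict mapping (source_proc, dest_proc) to the sorted list of
    nodes that source_proc must send to fill dest_proc's shadow area.
    """
    halo = {}  # processor -> set of non-local nodes it reads
    for a, b in edges:
        pa, pb = owner[a], owner[b]
        if pa != pb:
            # An edge crossing a partition boundary makes each endpoint
            # a shadow value on the other endpoint's processor.
            halo.setdefault(pa, set()).add(b)
            halo.setdefault(pb, set()).add(a)

    schedule = {}
    for dest, nodes in halo.items():
        for node in sorted(nodes):
            schedule.setdefault((owner[node], dest), []).append(node)
    return schedule
```

Because the schedule depends only on the grid connectivity and the data mapping, it can be computed once and reused for every shadow update as long as neither changes.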

