Comparative Evaluation and Case Studies of Shared-Memory and Data-Parallel Execution Patterns

Xiaodong Zhang; Lin Sun

doi:10.1155/1999/468372

Scientific Programming ◽

10.1155/1999/468372 ◽

1999 ◽

Vol 7 (1) ◽

pp. 1-19

Author(s):

Xiaodong Zhang ◽

Lin Sun

Keyword(s):

Linear System ◽

Shared Memory ◽

Interconnection Networks ◽

Parallel Execution ◽

Parallel Model ◽

Scientific Applications ◽

Data Parallel ◽

High Level ◽

Access Patterns ◽

Structured Program

Shared‐memory and data‐parallel programming models are two important paradigms for scientific applications. Both models provide high‐level program abstractions and simple, uniform views of network structures. These common features significantly simplify program coding and debugging for scientific applications. However, the underlying execution and overhead patterns differ significantly between the two models because of their programming constraints and because of the different and complex structures of the interconnection networks and systems that support them. We performed this experimental study to present implications and comparisons of execution patterns on two commercial architectures. We implemented a standard electromagnetic simulation program (EM) and a linear system solver using the shared‐memory model on the KSR‐1 and the data‐parallel model on the CM‐5. Our objectives are to examine the execution pattern changes required for an implementation transformation between the two models; to study memory access patterns; to address scalability issues; and to investigate the relative costs, advantages, and disadvantages of using the two models for scientific computations. Our results indicate that the EM program tends to become computation‐intensive in the KSR‐1 shared‐memory system, and memory‐demanding in the CM‐5 data‐parallel system, when the systems and the problems are scaled. The EM program, a highly data‐parallel program, performed extremely well, and the linear system solver, a highly control‐structured program, suffered significantly in the data‐parallel model on the CM‐5. Our study provides further evidence that matching the execution patterns of algorithms to parallel architectures achieves better performance.

Shared memory multiprocessor support for functional array processing in SAC

Journal of Functional Programming ◽

10.1017/s0956796805005538 ◽

2005 ◽

Vol 15 (3) ◽

pp. 353-401 ◽

Cited By ~ 29

Author(s):

CLEMENS GRELCK

Keyword(s):

Shared Memory ◽

Array Processing ◽

Numerical Data ◽

Parallel Execution ◽

Real Performance ◽

Execution Model ◽

Series Of Experiments ◽

High Level ◽

Performance Gains ◽

The Impact

Classical application domains of parallel computing are dominated by processing large arrays of numerical data. Whereas most functional languages focus on lists and trees rather than on arrays, SAC is tailor-made in design and in implementation for efficient high-level array processing. Advanced compiler optimizations yield performance levels that are often competitive with low-level imperative implementations. Based on SAC, we develop compilation techniques and runtime system support for the compiler-directed parallel execution of high-level functional array processing code on shared memory architectures. Competitive sequential performance gives us the opportunity to exploit the conceptual advantages of the functional paradigm for achieving real performance gains with respect to existing imperative implementations, not only in comparison with uniprocessor runtimes. While the design of SAC facilitates parallelization, the particular challenge of high sequential performance is that realization of satisfying speedups through parallelization becomes substantially more difficult. We present an initial compilation scheme and multi-threaded execution model, which we step-wise refine to reduce organizational overhead and to improve parallel performance. We close with a detailed analysis of the impact of certain design decisions on runtime performance, based on a series of experiments.

SAC — FROM HIGH-LEVEL PROGRAMMING WITH ARRAYS TO EFFICIENT PARALLEL EXECUTION

Parallel Processing Letters ◽

10.1142/s0129626403001379 ◽

2003 ◽

Vol 13 (03) ◽

pp. 401-412 ◽

Cited By ~ 15

Author(s):

CLEMENS GRELCK ◽

SVEN-BODO SCHOLZ

Keyword(s):

Shared Memory ◽

Parallel Execution ◽

3 Dimensional ◽

Shape Invariant ◽

Fixed Set ◽

High Level ◽

Successive Over Relaxation ◽

Compilation Techniques ◽

Processing Language

SAC is a purely functional array processing language designed with numerical applications in mind. It supports generic, high-level program specifications in the style of APL. However, rather than providing a fixed set of built-in array operations, SAC provides means to specify such operations in the language itself in a way that still allows their application to arrays of any rank and size. This paper illustrates the major steps in compiling generic, rank- and shape-invariant SAC specifications into efficiently executable multithreaded code for parallel execution on shared memory multiprocessors. The effectiveness of the compilation techniques is demonstrated by means of a small case study on the PDE1 benchmark, which implements 3-dimensional red/black successive over-relaxation. Comparisons with HPF and ZPL show that despite the genericity of code, SAC achieves highly competitive runtime performance characteristics.
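For readers unfamiliar with the kernel behind the PDE1 benchmark, the following is a minimal sketch of 3-dimensional red/black successive over-relaxation, written in Python with NumPy rather than SAC. It is not taken from the paper: the function name `red_black_sor`, the Poisson model problem, the fixed boundary values, and the default relaxation factor are all assumptions made for illustration.

```python
import numpy as np

def red_black_sor(u, f, h, omega=1.5, sweeps=100):
    """Run red/black SOR sweeps for -laplace(u) = f on a 3-D grid.

    Illustrative sketch only: u holds the initial guess with boundary
    values that stay fixed, f is the right-hand side, h the grid spacing.
    """
    u = np.array(u, dtype=float)  # work on a copy
    i, j, k = np.indices(u.shape)
    interior = np.zeros(u.shape, dtype=bool)
    interior[1:-1, 1:-1, 1:-1] = True
    # Colour interior points by the parity of i + j + k. Points of one
    # colour only have neighbours of the other colour, so each half-sweep
    # can update all of its points independently (in parallel).
    colours = [interior & ((i + j + k) % 2 == c) for c in (0, 1)]
    for _ in range(sweeps):
        for mask in colours:
            nb = np.zeros_like(u)
            # Sum of the six axis neighbours, using the current values
            # (so the second colour sees the first colour's updates).
            nb[1:-1, 1:-1, 1:-1] = (
                u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1]
                + u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1]
                + u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]
            )
            gauss_seidel = (nb + h * h * f) / 6.0
            u[mask] = (1.0 - omega) * u[mask] + omega * gauss_seidel[mask]
    return u
```

The colouring is what makes the method attractive for shared-memory execution: within a half-sweep no updated point reads another point of the same colour, so the work can be split across threads without synchronisation inside the half-sweep.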

Machine Learning–enabled Scalable Performance Prediction of Scientific Codes

ACM Transactions on Modeling and Computer Simulation ◽

10.1145/3450264 ◽

2021 ◽

Vol 31 (2) ◽

pp. 1-28

Author(s):

Gopinath Chennupati ◽

Nandakishore Santhi ◽

Phill Romero ◽

Stephan Eidenbenz

Keyword(s):

Machine Learning ◽

Performance Prediction ◽

Prediction Models ◽

Radiation Transport ◽

Discrete Event ◽

Basic Block ◽

Distribution Models ◽

Scientific Application ◽

High Level ◽

Access Patterns

Hardware architectures become increasingly complex as the compute capabilities grow to exascale. We present the Analytical Memory Model with Pipelines (AMMP) of the Performance Prediction Toolkit (PPT). PPT-AMMP takes high-level source code and hardware architecture parameters as input and predicts the runtime of that code on the target hardware platform, which is defined in the input parameters. PPT-AMMP transforms the code to an (architecture-independent) intermediate representation, then (i) analyzes the basic block structure of the code, (ii) processes architecture-independent virtual memory access patterns that it uses to build memory reuse distance distribution models for each basic block, and (iii) runs detailed basic-block level simulations to determine hardware pipeline usage. PPT-AMMP uses machine learning and regression techniques to build the prediction models based on small instances of the input code, then integrates these models into a higher-order discrete-event simulation model of PPT running on the Simian PDES engine. We validate PPT-AMMP on four standard computational physics benchmarks and present a use case of hardware parameter sensitivity analysis to identify bottleneck hardware resources on different code inputs. We further extend PPT-AMMP to predict the performance of a scientific application code, namely, the radiation transport mini-app SNAP. To this end, we analyze multi-variate regression models that accurately predict the reuse profiles and the basic block counts. We validate predicted SNAP runtimes against actual measured times.
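The reuse distance mentioned in the abstract can be illustrated with a small Python sketch. This is the textbook LRU stack-distance definition (the number of distinct addresses touched between two accesses to the same address), not PPT-AMMP's implementation; the function names `reuse_distances` and `reuse_profile` are hypothetical.

```python
from collections import Counter

def reuse_distances(trace):
    """Return the reuse distance of every access in a memory trace.

    The distance is the number of distinct addresses accessed since the
    previous access to the same address; first-time accesses get None.
    """
    stack = []  # LRU stack, most recently used address at the end
    distances = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Everything above pos was touched since the last access.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)  # cold (first) access
        stack.append(addr)
    return distances

def reuse_profile(trace):
    """Histogram of reuse distances, i.e. an empirical distribution."""
    return Counter(reuse_distances(trace))

print(reuse_distances(list("abcabb")))  # [None, None, None, 2, 2, 0]
```

A histogram of this kind is architecture-independent: the same profile can be compared against different cache sizes, since an access with reuse distance d hits in a fully associative LRU cache holding more than d lines.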

Retargeting sequential image-processing programs for data parallel execution

IEEE Transactions on Software Engineering ◽

10.1109/tse.2005.26 ◽

2005 ◽

Vol 31 (2) ◽

pp. 116-136 ◽

Cited By ~ 6

Author(s):

L.B. Baumstark ◽

L.M. Wills

Keyword(s):

Image Processing ◽

Parallel Execution ◽

Data Parallel ◽

Sequential Image

Parallel Nonnegative Matrix Factorization via Newton Iteration

Parallel Processing Letters ◽

10.1142/s0129626416500146 ◽

2016 ◽

Vol 26 (03) ◽

pp. 1650014 ◽

Cited By ~ 3

Author(s):

Markus Flatz ◽

Marián Vajteršic

Keyword(s):

Shared Memory ◽

Matrix Factorization ◽

Message Passing ◽

Nonnegative Matrix Factorization ◽

Nonnegative Matrix ◽

Newton Iteration ◽

Parallel Execution ◽

KKT Conditions ◽

Nonnegative Matrices ◽

First Order

The goal of Nonnegative Matrix Factorization (NMF) is to represent a large nonnegative matrix in an approximate way as a product of two significantly smaller nonnegative matrices. This paper shows in detail how an NMF algorithm based on Newton iteration can be derived using the general Karush-Kuhn-Tucker (KKT) conditions for first-order optimality. This algorithm is suited for parallel execution on systems with shared memory and also with message passing. Both versions were implemented and tested, delivering satisfactory speedup results.
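To make the factorization problem concrete, here is a minimal Python sketch that approximates a nonnegative matrix A by a product W H of two smaller nonnegative matrices. Note that it uses the well-known Lee and Seung multiplicative update rules, not the Newton-iteration algorithm derived in the paper; the function name `nmf_multiplicative` and all parameter defaults are assumptions for illustration.

```python
import numpy as np

def nmf_multiplicative(A, rank, iters=200, seed=0, eps=1e-12):
    """Approximate A (m x n, nonnegative) as W @ H with W, H >= 0.

    Illustrative sketch using multiplicative updates for the Frobenius
    norm objective; this is a substitute technique, not the paper's
    Newton-based method.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        # Each update multiplies by a nonnegative ratio, so the factors
        # stay nonnegative without any explicit projection step.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

def frobenius_error(A, W, H):
    """Frobenius norm of the residual A - W @ H."""
    return np.linalg.norm(A - W @ H)
```

The dominant cost in each iteration is dense matrix multiplication, which is why algorithms of this family map naturally onto both shared-memory and message-passing systems.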

Introducing Shared Memory in Hypercube Architectures using Multistage interconnection Networks

TENCON '91. Region 10 International Conference on EC3-Energy, Computer, Communication and Control Systems ◽

10.1109/tencon.1991.729645 ◽

2005 ◽

Cited By ~ 2

Author(s):

V. Tiruveedhula ◽

J.S. Bedi

Keyword(s):

Shared Memory ◽

Interconnection Networks ◽

Multistage Interconnection Networks

An object-oriented approach to the implementation of a high-level data parallel language

Lecture Notes in Computer Science - Scientific Computing in Object-Oriented Parallel Environments ◽

10.1007/3-540-63827-x_49 ◽

1997 ◽

pp. 97-104

Author(s):

Matthias Besch ◽

Hua Bi ◽

Gerd Heber ◽

Matthias Kessler ◽

Matthias Wilhelmi

Keyword(s):

Object Oriented ◽

Parallel Language ◽

Data Parallel ◽

Level Data ◽

Object Oriented Approach ◽

High Level ◽

Oriented Approach

Using Heuristic Value Prediction and Dynamic Task Granularity Resizing to Improve Software Speculation

The Scientific World JOURNAL ◽

10.1155/2014/478013 ◽

2014 ◽

Vol 2014 ◽

pp. 1-18

Author(s):

Fan Xu ◽

Li Shen ◽

Zhiying Wang ◽

Bo Su ◽

Hui Guo ◽

...

Keyword(s):

High Efficiency ◽

Low Cost ◽

Parallel Execution ◽

Analytic Model ◽

Parallel Model ◽

Conventional Model ◽

Value Prediction ◽

Dynamic Task ◽

Control Overhead ◽

Key Factor

Exploiting potential thread-level parallelism (TLP) is becoming the key factor in improving the performance of programs on multicore or many-core systems. Among the various kinds of parallel execution models, the software-based speculative parallel model has become a research focus due to its low cost, high efficiency, flexibility, and scalability. The performance of the guest program under the software-based speculative parallel execution model is closely related to the speculation accuracy, the control overhead, and the rollback overhead of the model. In this paper, we first analyze the conventional speculative parallel model and present an analytic model of its expected overall overhead, then optimize the conventional model based on the analytic model, and finally propose a novel speculative parallel model named HEUSPEC. The HEUSPEC model includes three key techniques, namely, heuristic value prediction, value-based correctness checking, and dynamic task granularity resizing. We have implemented the runtime system of the model in ANSI C. The experimental results show that the speedup of the HEUSPEC model reaches 2.20 on average (15% higher than the conventional model) when the speculative depth is equal to 3, and 4.51 on average (12% higher than the conventional model) when the speculative depth is equal to 7. In addition, it shows good scalability and lower memory cost.

Comparing task and data parallel execution schemes for the DIIRK method

Lecture Notes in Computer Science - Euro-Par'96 Parallel Processing ◽

10.1007/bfb0024684 ◽

1996 ◽

pp. 52-61 ◽

Cited By ~ 3

Author(s):

Thomas Rauber ◽

Gudula Rünger

Keyword(s):

Parallel Execution ◽

Data Parallel

PAEAN: Portable and scalable runtime support for parallel Haskell dialects

Journal of Functional Programming ◽

10.1017/s0956796816000010 ◽

2016 ◽

Vol 26 ◽

Cited By ~ 1

Author(s):

JOST BERTHOLD ◽

HANS-WOLFGANG LOIDL ◽

KEVIN HAMMOND

Keyword(s):

Shared Memory ◽

High Performance ◽

Parallel Machines ◽

State Of The Art ◽

Computing Systems ◽

Programming Abstraction ◽

Work Distribution ◽

High Level ◽

Parallelism Model ◽

Performance Computing

Over time, several competing approaches to parallel Haskell programming have emerged. Different approaches support parallelism at various scales, ranging from small multicores to massively parallel high-performance computing systems. They also provide varying degrees of control, ranging from completely implicit approaches to ones providing full programmer control. Most current designs assume a shared memory model at the programmer, implementation and hardware levels. This is, however, becoming increasingly divorced from the reality at the hardware level. It also imposes significant unwanted runtime overheads in the form of garbage collection synchronisation, etc. What is needed is an easy way to abstract over the implementation and hardware levels, while presenting a simple parallelism model to the programmer. The PArallEl shAred Nothing runtime system design aims to provide a portable and high-level shared-nothing implementation platform for parallel Haskell dialects. It abstracts over major issues such as work distribution and data serialisation, consolidating existing, successful designs into a single framework. It also provides an optional virtual shared-memory programming abstraction for (possibly) shared-nothing parallel machines, such as modern multicore/manycore architectures or cluster/cloud computing systems. It builds on, unifies and extends existing well-developed support for shared-memory parallelism that is provided by the widely used GHC Haskell compiler. This paper summarises the state of the art in shared-nothing parallel Haskell implementations, introduces the PArallEl shAred Nothing abstractions, shows how they can be used to implement three distinct parallel Haskell dialects, and demonstrates that good scalability can be obtained on recent parallel machines.
