Interprocessor communication speed and performance in distributed-memory parallel processors

1989 ◽  
Vol 17 (3) ◽  
pp. 315-324 ◽  
Author(s):  
M. Annaratone ◽  
C. Pommerell ◽  
R. Rühl
1993 ◽  
Vol 83 (5) ◽  
pp. 1345-1354
Author(s):  
Qingbo Liao ◽  
George A. McMechan

Abstract Two pseudo-spectral implementations of 2-D viscoacoustic modeling are developed in a distributed-memory multi-processor computing environment. The first involves simultaneous computation of the response of one model to many source locations and, as it requires no interprocessor communication, is perfectly parallel. The second involves computation of the response, to one source, of a large model that is distributed across all processors. In the latter, local, rather than global, Fourier transforms are used to minimize interprocessor communication and to eliminate the need for matrix transposition. In both algorithms, absorbing boundaries are defined as zones of decreased Q within the model, and so require no extra computation. An empirical method of determining sets of relaxation times for a broad range of Q values eliminates the need for iterative fitting of Q-frequency curves.
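The pseudo-spectral method referred to above evaluates spatial derivatives in the wavenumber domain via the FFT. A minimal 1-D sketch of that kernel (in Python/NumPy rather than the authors' implementation; the grid size, spacing, and test field are illustrative assumptions):

```python
import numpy as np

# Minimal sketch (not the authors' code) of the core pseudo-spectral step:
# spatial derivatives are computed in the wavenumber domain via the FFT.
# The 1-D grid, spacing, and Gaussian test field are assumptions.

n = 256                  # grid points
dx = 10.0                # grid spacing (m, assumed)
x = np.arange(n) * dx
sigma = 200.0
field = np.exp(-((x - x.mean()) / sigma) ** 2)   # smooth test field

k = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)        # angular wavenumbers
deriv = np.real(np.fft.ifft(1j * k * np.fft.fft(field)))

# On this smooth, effectively periodic field the spectral derivative
# matches the analytic derivative of the Gaussian to near machine precision.
analytic = field * (-2.0 * (x - x.mean()) / sigma ** 2)
err = np.max(np.abs(deriv - analytic))
```

In the distributed version described above, the transforms are applied locally per processor rather than globally, which is what avoids the matrix transposition; this sketch shows only the single-processor kernel.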


2001 ◽  
Vol 12 (03) ◽  
pp. 285-306 ◽  
Author(s):  
NORIYUKI FUJIMOTO ◽  
TOMOKI BABA ◽  
TAKASHI HASHIMOTO ◽  
KENICHI HAGIHARA

In this paper, we report a performance gap between a schedule with small makespan on the task scheduling model and the corresponding parallel program on distributed-memory parallel machines. The main reason for the gap is the software overhead of interprocessor communication. As a result, the speedup ratios of schedules on the model do not approximate well those of parallel programs on the machines. The purpose of this paper is to obtain a task scheduling algorithm that generates a schedule with small makespan that approximates the corresponding parallel program well. For this purpose, we propose algorithm BCSH, which generates only bulk synchronous schedules. In such schedules, no-communication phases and communication phases alternate. All interprocessor communications occur only in the latter phases, so the corresponding parallel programs can easily exploit the message packaging technique. Packaging reduces the many per-message software overheads from a source processor to the same destination processor to roughly a single overhead, and improves the performance of a parallel program significantly. Finally, we show some experimental results on the performance gaps of BCSH, Kruatrachue's algorithm DSH, and Ahmad et al.'s algorithm ECPFD. The schedules produced by DSH and ECPFD are known for their small makespans, but message packaging cannot be applied effectively to the corresponding programs. The results show that a bulk synchronous schedule with small makespan has two advantages: the gap is small, and the corresponding program is a high-performance parallel program.
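The message-packaging benefit described above can be illustrated with a toy cost model. This is a hedged sketch, not algorithm BCSH itself; the overhead and per-byte costs are invented for illustration:

```python
# Toy cost model (not algorithm BCSH): in a bulk synchronous schedule,
# messages generated during a no-communication phase that share a
# destination can be packaged into one physical message, so the
# per-message software overhead is paid once per destination.
# OVERHEAD and PER_BYTE are assumed, arbitrary cost units.

from collections import defaultdict

OVERHEAD = 100   # software overhead per physical message (assumed)
PER_BYTE = 1     # transfer cost per byte (assumed)

def cost_unpacked(messages):
    """Each (dest, nbytes) message pays its own software overhead."""
    return sum(OVERHEAD + PER_BYTE * nbytes for _, nbytes in messages)

def cost_packed(messages):
    """Messages to the same destination are packaged into one send."""
    by_dest = defaultdict(int)
    for dest, nbytes in messages:
        by_dest[dest] += nbytes
    return sum(OVERHEAD + PER_BYTE * total for total in by_dest.values())

# One communication phase: 8 small messages, only 2 distinct destinations.
phase = [(1, 16), (1, 16), (1, 16), (1, 16),
         (2, 16), (2, 16), (2, 16), (2, 16)]
print(cost_unpacked(phase))   # 8 overheads: 8*100 + 128 = 928
print(cost_packed(phase))     # 2 overheads: 2*100 + 128 = 328
```

Because DSH and ECPFD interleave communication with computation, their messages cannot be batched this way, which is the gap the abstract measures.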


2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Hari Radhakrishnan ◽  
Damian W. I. Rouson ◽  
Karla Morris ◽  
Sameer Shende ◽  
Stavros C. Kassinos

This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code-modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared-memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the performance bottleneck was our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel binary-tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure; Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
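The speedup from replacing a sequential summation with a binary tree comes from cutting the chain of dependent communication steps from P-1 to about log2(P). A hedged sketch of the reduction pattern (in Python rather than the authors' Fortran coarray code; the 32 partial sums stand in for per-image values):

```python
# Illustrative sketch (not the authors' coarray implementation): a
# pairwise binary-tree summation needs only ceil(log2 P) dependent
# rounds, versus P-1 dependent steps for a sequential collective sum.

import math

def tree_reduce(values):
    """Pairwise binary-tree summation; returns (total, dependent rounds)."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # Each round halves the number of partial sums; the additions
        # within a round are independent and can run concurrently.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)] + \
               (vals[-1:] if len(vals) % 2 else [])
        rounds += 1
    return vals[0], rounds

partials = list(range(32))        # one partial sum per image (assumed)
total, rounds = tree_reduce(partials)
print(total, rounds)              # 496 5  (5 = log2 32, vs 31 sequential steps)
```

A compiler-provided collective such as the one the Cray compiler supplies can implement the same pattern below the language level, which is why the authors expect similar results once other compilers support the proposed collective procedures.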

