Interprocessor communication speed and performance in distributed-memory parallel processors

1989 ◽  
Vol 17 (3) ◽  
pp. 315-324 ◽  
Author(s):  
M. Annaratone ◽  
C. Pommerell ◽  
R. Rühl
1993 ◽  
Vol 83 (5) ◽  
pp. 1345-1354
Author(s):  
Qingbo Liao ◽  
George A. McMechan

Abstract Two pseudo-spectral implementations of 2-D viscoacoustic modeling are developed in a distributed-memory multi-processor computing environment. The first involves simultaneous computation of the response of one model to many source locations and, as it requires no interprocessor communication, is perfectly parallel. The second involves computation of the response, to one source, of a large model that is distributed across all processors. In the latter, local, rather than global, Fourier transforms are used to minimize interprocessor communication and to eliminate the need for matrix transposition. In both algorithms, absorbing boundaries are defined as zones of decreased Q within the model, and so require no extra computation. An empirical method of determining sets of relaxation times for a broad range of Q values eliminates the need for iterative fitting of Q-frequency curves.
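The pseudo-spectral method referred to above evaluates spatial derivatives in the wavenumber domain via the FFT. A minimal 1-D sketch of that kernel (in Python/NumPy rather than the authors' implementation; the grid size, spacing, and test field are illustrative assumptions):

```python
import numpy as np

# Minimal sketch (not the authors' code) of the core pseudo-spectral step:
# spatial derivatives are computed in the wavenumber domain via the FFT.
# The 1-D grid, spacing, and Gaussian test field are assumptions.

n = 256                  # grid points
dx = 10.0                # grid spacing (m, assumed)
x = np.arange(n) * dx
sigma = 200.0
field = np.exp(-((x - x.mean()) / sigma) ** 2)   # smooth test field

k = 2.0 * np.pi * np.fft.fftfreq(n, d=dx)        # angular wavenumbers
deriv = np.real(np.fft.ifft(1j * k * np.fft.fft(field)))

# On this smooth, effectively periodic field the spectral derivative
# matches the analytic derivative of the Gaussian to near machine precision.
analytic = field * (-2.0 * (x - x.mean()) / sigma ** 2)
err = np.max(np.abs(deriv - analytic))
```

In the distributed version described above, the transforms are applied locally per processor rather than globally, which is what avoids the matrix transposition; this sketch shows only the single-processor kernel.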


2001 ◽  
Vol 12 (03) ◽  
pp. 285-306 ◽  
Author(s):  
NORIYUKI FUJIMOTO ◽  
TOMOKI BABA ◽  
TAKASHI HASHIMOTO ◽  
KENICHI HAGIHARA

In this paper, we report a performance gap between a schedule with small makespan on the task scheduling model and the corresponding parallel program on distributed-memory parallel machines. The main reason for the gap is the software overhead of interprocessor communication. As a result, the speedup ratios of schedules on the model do not approximate well those of parallel programs on the machines. The purpose of this paper is to obtain a task scheduling algorithm that generates a schedule with small makespan that approximates the corresponding parallel program well. For this purpose, we propose algorithm BCSH, which generates only bulk synchronous schedules. In such schedules, no-communication phases and communication phases alternate. All interprocessor communications occur only in the latter phases, so the corresponding parallel programs can easily exploit the message packaging technique. Packaging reduces the many per-message software overheads from a source processor to the same destination processor to roughly a single overhead, and improves the performance of a parallel program significantly. Finally, we show some experimental results on the performance gaps of BCSH, Kruatrachue's algorithm DSH, and Ahmad et al.'s algorithm ECPFD. The schedules produced by DSH and ECPFD are known for their small makespans, but message packaging cannot be applied effectively to the corresponding programs. The results show that a bulk synchronous schedule with small makespan has two advantages: the gap is small, and the corresponding program is a high-performance parallel program.
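The message-packaging benefit described above can be illustrated with a toy cost model. This is a hedged sketch, not algorithm BCSH itself; the overhead and per-byte costs are invented for illustration:

```python
# Toy cost model (not algorithm BCSH): in a bulk synchronous schedule,
# messages generated during a no-communication phase that share a
# destination can be packaged into one physical message, so the
# per-message software overhead is paid once per destination.
# OVERHEAD and PER_BYTE are assumed, arbitrary cost units.

from collections import defaultdict

OVERHEAD = 100   # software overhead per physical message (assumed)
PER_BYTE = 1     # transfer cost per byte (assumed)

def cost_unpacked(messages):
    """Each (dest, nbytes) message pays its own software overhead."""
    return sum(OVERHEAD + PER_BYTE * nbytes for _, nbytes in messages)

def cost_packed(messages):
    """Messages to the same destination are packaged into one send."""
    by_dest = defaultdict(int)
    for dest, nbytes in messages:
        by_dest[dest] += nbytes
    return sum(OVERHEAD + PER_BYTE * total for total in by_dest.values())

# One communication phase: 8 small messages, only 2 distinct destinations.
phase = [(1, 16), (1, 16), (1, 16), (1, 16),
         (2, 16), (2, 16), (2, 16), (2, 16)]
print(cost_unpacked(phase))   # 8 overheads: 8*100 + 128 = 928
print(cost_packed(phase))     # 2 overheads: 2*100 + 128 = 328
```

Because DSH and ECPFD interleave communication with computation, their messages cannot be batched this way, which is the gap the abstract measures.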


2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Hari Radhakrishnan ◽  
Damian W. I. Rouson ◽  
Karla Morris ◽  
Sameer Shende ◽  
Stavros C. Kassinos

This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code-modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared-memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the performance bottleneck was our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel binary-tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure; Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
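The speedup from replacing a sequential summation with a binary tree comes from cutting the chain of dependent communication steps from P-1 to about log2(P). A hedged sketch of the reduction pattern (in Python rather than the authors' Fortran coarray code; the 32 partial sums stand in for per-image values):

```python
# Illustrative sketch (not the authors' coarray implementation): a
# pairwise binary-tree summation needs only ceil(log2 P) dependent
# rounds, versus P-1 dependent steps for a sequential collective sum.

import math

def tree_reduce(values):
    """Pairwise binary-tree summation; returns (total, dependent rounds)."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # Each round halves the number of partial sums; the additions
        # within a round are independent and can run concurrently.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)] + \
               (vals[-1:] if len(vals) % 2 else [])
        rounds += 1
    return vals[0], rounds

partials = list(range(32))        # one partial sum per image (assumed)
total, rounds = tree_reduce(partials)
print(total, rounds)              # 496 5  (5 = log2 32, vs 31 sequential steps)
```

A compiler-provided collective such as the one the Cray compiler supplies can implement the same pattern below the language level, which is why the authors expect similar results once other compilers support the proposed collective procedures.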

