OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries

2003 ◽  
Vol 11 (2) ◽  
pp. 95-104 ◽  
Author(s):  
C. Addison ◽  
Y. Ren ◽  
M. van Waveren

Dense linear algebra libraries need to cope efficiently with a range of input problem sizes and shapes. This inherently means that parallel implementations have to exploit parallelism wherever it is present. While OpenMP allows relatively fine-grain parallelism to be exploited in a shared memory environment, it currently lacks features that make it easy to partition computation over multiple array indices or to overlap sequential and parallel computations. The inherently flexible nature of shared memory paradigms such as OpenMP poses other difficulties when it becomes necessary to optimise performance across successive parallel library calls. Notions borrowed from distributed memory paradigms, such as explicit data distributions, help address some of these problems, but the focus on data rather than work distribution appears misplaced in an SMP context.
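The multi-index partitioning gap this 2003 abstract describes was later addressed by OpenMP's collapse clause, added in OpenMP 3.0. Below is a minimal sketch, not taken from the paper, showing how collapse distributes a two-index loop nest over threads; the kernel and its names are illustrative only.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define M 512
#define N 512

/* Hypothetical kernel: scales a dense matrix. The collapse clause
 * (OpenMP 3.0+) partitions work over both array indices at once,
 * the feature the 2003 paper notes was missing at the time. */
void scale_matrix(double *a, int m, int n, double alpha) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] *= alpha;
}

int main(void) {
    double *a = malloc(sizeof(double) * M * N);
    for (int k = 0; k < M * N; k++) a[k] = 1.0;
    scale_matrix(a, M, N, 2.0);
    printf("a[0] = %f, max threads = %d\n", a[0], omp_get_max_threads());
    free(a);
    return 0;
}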

2019 ◽  
Vol 29 (2) ◽  
pp. 407-419 ◽
Author(s):  
Beata Bylina ◽  
Jarosław Bylina

Abstract The aim of this paper is to investigate dense linear algebra algorithms on shared memory multicore architectures. The design and implementation of a parallel tiled WZ factorization algorithm that can fully exploit such architectures are presented. Three parallel implementations of the algorithm are studied. The first relies only on multithreaded BLAS (basic linear algebra subprograms) operations. The second, in addition to BLAS operations, employs OpenMP loop-level parallelism. The third, in addition to BLAS operations, employs the OpenMP task directive with the depend clause. We report the computational performance and the speedup of the parallel tiled WZ factorization algorithm on shared memory multicore architectures for dense square diagonally dominant matrices. We then compare our parallel implementations with the respective LU factorization from a vendor-implemented LAPACK library, and we also analyze the numerical accuracy. Two of our implementations achieve nearly the maximal theoretical speedup implied by Amdahl's law.
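As a hedged illustration of the third implementation's approach, the sketch below shows the generic OpenMP task-with-dependences pattern for a tiled factorization. The tile kernels and the dependence structure are hypothetical stand-ins, not the paper's actual WZ kernels; they only demonstrate how depend clauses let the runtime overlap independent tile updates.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NT 4    /* number of tiles (illustrative) */
#define TS 64   /* tile size in doubles (illustrative) */

/* Stub tile kernels: placeholders for the paper's WZ tile operations. */
static void factor_tile(double *t)                  { t[0] += 1.0; }
static void update_tile(double *t, const double *f) { t[0] += f[0]; }

static void tiled_factor(double **tiles, int nt) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; k++) {
        /* Factor tile k; tasks that read it must wait for this one. */
        #pragma omp task depend(inout: tiles[k][0])
        factor_tile(tiles[k]);
        for (int j = k + 1; j < nt; j++) {
            /* Each update depends on the factored tile k and its own
             * tile j, so independent updates run concurrently. */
            #pragma omp task depend(in: tiles[k][0]) depend(inout: tiles[j][0])
            update_tile(tiles[j], tiles[k]);
        }
    }   /* the parallel region's implicit barrier waits for all tasks */
}

int main(void) {
    double *tiles[NT];
    for (int k = 0; k < NT; k++) tiles[k] = calloc(TS, sizeof(double));
    tiled_factor(tiles, NT);
    printf("tiles[%d][0] = %f\n", NT - 1, tiles[NT - 1][0]);
    for (int k = 0; k < NT; k++) free(tiles[k]);
    return 0;
}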


2006 ◽  
Vol 16 (02) ◽  
pp. 261-280 ◽  
Author(s):  
Nathan Thomas ◽  
Steven Saunders ◽  
Tim Smith ◽  
Gabriel Tanase ◽  
Lawrence Rauchwerger

ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user-transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP and MPI implementations, and in combinations of the two. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, an SGI Origin 3800, an IBM Regatta, and an IBM RS/6000 SP cluster.
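The aggregation of outgoing communication mentioned in the abstract can be illustrated with a generic sketch. ARMI's real API is not shown in the abstract, so the buffer type and function names below are hypothetical; the sketch only demonstrates the underlying idea of batching many small messages into a single MPI send to coarsen communication granularity.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define AGG_CAP 4096   /* aggregation buffer capacity (illustrative) */

/* Hypothetical aggregation buffer, not ARMI's actual data structure. */
typedef struct {
    char buf[AGG_CAP];
    int  used;
    int  dest;
} agg_buffer;

static void agg_flush(agg_buffer *a) {
    if (a->used > 0) {
        MPI_Send(a->buf, a->used, MPI_BYTE, a->dest, 0, MPI_COMM_WORLD);
        a->used = 0;
    }
}

/* Queue a small message; a real send happens only when the buffer fills. */
static void agg_send(agg_buffer *a, const void *msg, int len) {
    if (a->used + len > AGG_CAP)
        agg_flush(a);                 /* buffer full: ship the batch */
    memcpy(a->buf + a->used, msg, len);
    a->used += len;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        agg_buffer a = { .used = 0, .dest = 1 };
        for (int i = 0; i < 1000; i++)
            agg_send(&a, &i, sizeof i);  /* many small messages, few sends */
        agg_flush(&a);
        MPI_Send(NULL, 0, MPI_BYTE, 1, 1, MPI_COMM_WORLD); /* end marker */
    } else if (rank == 1) {
        char buf[AGG_CAP];
        MPI_Status st;
        int len;
        for (;;) {
            MPI_Recv(buf, AGG_CAP, MPI_BYTE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == 1) break;
            MPI_Get_count(&st, MPI_BYTE, &len);
            printf("received batch of %d bytes\n", len);
        }
    }
    MPI_Finalize();
    return 0;
}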


2013 ◽  
Vol 18 ◽  
pp. 1282-1291 ◽  
Author(s):  
Bryan Marker ◽  
Don Batory ◽  
Robert van de Geijn

2021 ◽  
Vol 26 ◽  
pp. 1-67 ◽
Author(s):  
Patrick Dinklage ◽  
Jonas Ellert ◽  
Johannes Fischer ◽  
Florian Kurpicz ◽  
Marvin Löbel

We present new sequential and parallel algorithms for wavelet tree construction based on a new bottom-up technique. This technique uses the structure of wavelet trees (each node refines the characters it represents as depth increases) in the opposite direction: it first computes the leaves (the most refined level) and then propagates this information upwards to the root of the tree. We first describe new sequential algorithms, both in RAM and in external memory. Based on these results, we adapt the algorithms to parallel computers, addressing both shared memory and distributed memory settings. In practice, all our algorithms outperform previous ones in both time and memory efficiency, because all auxiliary information can be computed solely from the information obtained while computing the leaves. Most of our algorithms are also adapted to the wavelet matrix, a variant that is particularly suited for large alphabets.
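Below is a minimal sketch of the bottom-up idea under simplifying assumptions (power-of-two alphabet, bits stored one per byte, function and variable names invented for illustration): the leaf histogram is computed once, coarser node sizes are derived from it, and their prefix sums place every level's bits. The paper's actual algorithms are substantially more engineered.

#include <stdio.h>
#include <stdlib.h>

#define LEVELS 3
#define SIGMA  (1 << LEVELS)   /* alphabet size, assumed a power of two */

/* Build all level bit vectors of a wavelet tree from the leaf histogram.
 * bv[l][i] holds the i-th bit of level l, one bit per byte for clarity. */
void build_levels(const unsigned char *text, int n, unsigned char **bv) {
    int hist[SIGMA] = {0};            /* leaf counts, computed once */
    for (int i = 0; i < n; i++) hist[text[i]]++;

    for (int l = LEVELS - 1; l >= 0; l--) {
        int groups = 1 << l;          /* nodes on level l */
        int off[SIGMA];               /* start offset of each node */
        int sum = 0, span = SIGMA >> l;
        /* derive node sizes on level l by summing leaf counts, then
         * prefix-sum them into starting offsets */
        for (int g = 0; g < groups; g++) {
            int cnt = 0;
            for (int c = g * span; c < (g + 1) * span; c++) cnt += hist[c];
            off[g] = sum;
            sum += cnt;
        }
        /* one stable pass over the text fills level l's bit vector */
        for (int i = 0; i < n; i++) {
            int g   = text[i] >> (LEVELS - l);          /* node index  */
            int bit = (text[i] >> (LEVELS - 1 - l)) & 1; /* stored bit */
            bv[l][off[g]++] = (unsigned char)bit;
        }
    }
}

int main(void) {
    const unsigned char text[] = {0, 7, 3, 5, 1, 6, 2, 4};
    int n = (int)sizeof text;
    unsigned char *bv[LEVELS];
    for (int l = 0; l < LEVELS; l++) bv[l] = malloc((size_t)n);
    build_levels(text, n, bv);
    for (int l = 0; l < LEVELS; l++) {
        printf("level %d: ", l);
        for (int i = 0; i < n; i++) printf("%d", bv[l][i]);
        printf("\n");
        free(bv[l]);
    }
    return 0;
}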


2010 ◽  
Vol 45 (5) ◽  
pp. 345-346 ◽  
Author(s):  
Aparna Chandramowlishwaran ◽  
Kathleen Knobe ◽  
Richard Vuduc

2002 ◽  
Vol 28 (2) ◽  
pp. 155-185 ◽  
Author(s):  
Olivier Beaumont ◽  
Arnaud Legrand ◽  
Fabrice Rastello ◽  
Yves Robert

2018 ◽  
Vol 106 (11) ◽  
pp. 2040-2055 ◽  
Author(s):  
Jack Dongarra ◽  
Mark Gates ◽  
Jakub Kurzak ◽  
Piotr Luszczek ◽  
Yaohung M. Tsai
