OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries

2003 ◽  
Vol 11 (2) ◽  
pp. 95-104 ◽  
Author(s):  
C. Addison ◽  
Y. Ren ◽  
M. van Waveren

Dense linear algebra libraries need to cope efficiently with a range of input problem sizes and shapes. This inherently means that parallel implementations have to exploit parallelism wherever it is present. While OpenMP allows relatively fine-grain parallelism to be exploited in a shared memory environment, it currently lacks features that make it easy to partition computation over multiple array indices or to overlap sequential and parallel computations. The inherently flexible nature of shared memory paradigms such as OpenMP poses other difficulties when it becomes necessary to optimise performance across successive parallel library calls. Notions borrowed from distributed memory paradigms, such as explicit data distributions, help address some of these problems, but the focus on data rather than work distribution appears misplaced in an SMP context.
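The multi-index partitioning gap this 2003 abstract describes was later addressed by OpenMP's collapse clause, added in OpenMP 3.0. Below is a minimal sketch, not taken from the paper, showing how collapse distributes a two-index loop nest over threads; the kernel and its names are illustrative only.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define M 512
#define N 512

/* Hypothetical kernel: scales a dense matrix. The collapse clause
 * (OpenMP 3.0+) partitions work over both array indices at once,
 * the feature the 2003 paper notes was missing at the time. */
void scale_matrix(double *a, int m, int n, double alpha) {
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] *= alpha;
}

int main(void) {
    double *a = malloc(sizeof(double) * M * N);
    for (int k = 0; k < M * N; k++) a[k] = 1.0;
    scale_matrix(a, M, N, 2.0);
    printf("a[0] = %f, max threads = %d\n", a[0], omp_get_max_threads());
    free(a);
    return 0;
}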

2019 ◽  
Vol 29 (2) ◽  
pp. 407-419 ◽
Author(s):  
Beata Bylina ◽  
Jarosław Bylina

Abstract The aim of this paper is to investigate dense linear algebra algorithms on shared memory multicore architectures. The design and implementation of a parallel tiled WZ factorization algorithm that can fully exploit such architectures are presented. Three parallel implementations of the algorithm are studied. The first relies only on multithreaded BLAS (basic linear algebra subprograms) operations. The second, in addition to BLAS operations, employs OpenMP loop-level parallelism. The third, in addition to BLAS operations, employs the OpenMP task directive with the depend clause. We report the computational performance and the speedup of the parallel tiled WZ factorization algorithm on shared memory multicore architectures for dense square diagonally dominant matrices. We then compare our parallel implementations with the respective LU factorization from a vendor-implemented LAPACK library, and we also analyze the numerical accuracy. Two of our implementations achieve nearly the maximal theoretical speedup implied by Amdahl's law.
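As a hedged illustration of the third implementation's approach, the sketch below shows the generic OpenMP task-with-dependences pattern for a tiled factorization. The tile kernels and the dependence structure are hypothetical stand-ins, not the paper's actual WZ kernels; they only demonstrate how depend clauses let the runtime overlap independent tile updates.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NT 4    /* number of tiles (illustrative) */
#define TS 64   /* tile size in doubles (illustrative) */

/* Stub tile kernels: placeholders for the paper's WZ tile operations. */
static void factor_tile(double *t)                  { t[0] += 1.0; }
static void update_tile(double *t, const double *f) { t[0] += f[0]; }

static void tiled_factor(double **tiles, int nt) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; k++) {
        /* Factor tile k; tasks that read it must wait for this one. */
        #pragma omp task depend(inout: tiles[k][0])
        factor_tile(tiles[k]);
        for (int j = k + 1; j < nt; j++) {
            /* Each update depends on the factored tile k and its own
             * tile j, so independent updates run concurrently. */
            #pragma omp task depend(in: tiles[k][0]) depend(inout: tiles[j][0])
            update_tile(tiles[j], tiles[k]);
        }
    }   /* the parallel region's implicit barrier waits for all tasks */
}

int main(void) {
    double *tiles[NT];
    for (int k = 0; k < NT; k++) tiles[k] = calloc(TS, sizeof(double));
    tiled_factor(tiles, NT);
    printf("tiles[%d][0] = %f\n", NT - 1, tiles[NT - 1][0]);
    for (int k = 0; k < NT; k++) free(tiles[k]);
    return 0;
}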


2006 ◽  
Vol 16 (02) ◽  
pp. 261-280 ◽  
Author(s):  
Nathan Thomas ◽  
Steven Saunders ◽  
Tim Smith ◽  
Gabriel Tanase ◽  
Lawrence Rauchwerger

ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user-transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP and MPI implementations, and in combinations of the two. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, an SGI Origin 3800, an IBM Regatta, and an IBM RS/6000 SP cluster.
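The aggregation of outgoing communication mentioned in the abstract can be illustrated with a generic sketch. ARMI's real API is not shown in the abstract, so the buffer type and function names below are hypothetical; the sketch only demonstrates the underlying idea of batching many small messages into a single MPI send to coarsen communication granularity.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define AGG_CAP 4096   /* aggregation buffer capacity (illustrative) */

/* Hypothetical aggregation buffer, not ARMI's actual data structure. */
typedef struct {
    char buf[AGG_CAP];
    int  used;
    int  dest;
} agg_buffer;

static void agg_flush(agg_buffer *a) {
    if (a->used > 0) {
        MPI_Send(a->buf, a->used, MPI_BYTE, a->dest, 0, MPI_COMM_WORLD);
        a->used = 0;
    }
}

/* Queue a small message; a real send happens only when the buffer fills. */
static void agg_send(agg_buffer *a, const void *msg, int len) {
    if (a->used + len > AGG_CAP)
        agg_flush(a);                 /* buffer full: ship the batch */
    memcpy(a->buf + a->used, msg, len);
    a->used += len;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        agg_buffer a = { .used = 0, .dest = 1 };
        for (int i = 0; i < 1000; i++)
            agg_send(&a, &i, sizeof i);  /* many small messages, few sends */
        agg_flush(&a);
        MPI_Send(NULL, 0, MPI_BYTE, 1, 1, MPI_COMM_WORLD); /* end marker */
    } else if (rank == 1) {
        char buf[AGG_CAP];
        MPI_Status st;
        int len;
        for (;;) {
            MPI_Recv(buf, AGG_CAP, MPI_BYTE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == 1) break;
            MPI_Get_count(&st, MPI_BYTE, &len);
            printf("received batch of %d bytes\n", len);
        }
    }
    MPI_Finalize();
    return 0;
}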


2013 ◽  
Vol 18 ◽  
pp. 1282-1291 ◽  
Author(s):  
Bryan Marker ◽  
Don Batory ◽  
Robert van de Geijn

2021 ◽  
Vol 26 ◽  
pp. 1-67 ◽
Author(s):  
Patrick Dinklage ◽  
Jonas Ellert ◽  
Johannes Fischer ◽  
Florian Kurpicz ◽  
Marvin Löbel

We present new sequential and parallel algorithms for wavelet tree construction based on a new bottom-up technique. This technique uses the structure of wavelet trees (each node refines the characters it represents as depth increases) in the opposite direction: it first computes the leaves (the most refined level) and then propagates this information upwards to the root of the tree. We first describe new sequential algorithms, both in RAM and in external memory. Based on these results, we adapt the algorithms to parallel computers, addressing both shared memory and distributed memory settings. In practice, all our algorithms outperform previous ones in both time and memory efficiency, because all auxiliary information can be computed solely from the information obtained while computing the leaves. Most of our algorithms are also adapted to the wavelet matrix, a variant that is particularly suited for large alphabets.
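Below is a minimal sketch of the bottom-up idea under simplifying assumptions (power-of-two alphabet, bits stored one per byte, function and variable names invented for illustration): the leaf histogram is computed once, coarser node sizes are derived from it, and their prefix sums place every level's bits. The paper's actual algorithms are substantially more engineered.

#include <stdio.h>
#include <stdlib.h>

#define LEVELS 3
#define SIGMA  (1 << LEVELS)   /* alphabet size, assumed a power of two */

/* Build all level bit vectors of a wavelet tree from the leaf histogram.
 * bv[l][i] holds the i-th bit of level l, one bit per byte for clarity. */
void build_levels(const unsigned char *text, int n, unsigned char **bv) {
    int hist[SIGMA] = {0};            /* leaf counts, computed once */
    for (int i = 0; i < n; i++) hist[text[i]]++;

    for (int l = LEVELS - 1; l >= 0; l--) {
        int groups = 1 << l;          /* nodes on level l */
        int off[SIGMA];               /* start offset of each node */
        int sum = 0, span = SIGMA >> l;
        /* derive node sizes on level l by summing leaf counts, then
         * prefix-sum them into starting offsets */
        for (int g = 0; g < groups; g++) {
            int cnt = 0;
            for (int c = g * span; c < (g + 1) * span; c++) cnt += hist[c];
            off[g] = sum;
            sum += cnt;
        }
        /* one stable pass over the text fills level l's bit vector */
        for (int i = 0; i < n; i++) {
            int g   = text[i] >> (LEVELS - l);          /* node index  */
            int bit = (text[i] >> (LEVELS - 1 - l)) & 1; /* stored bit */
            bv[l][off[g]++] = (unsigned char)bit;
        }
    }
}

int main(void) {
    const unsigned char text[] = {0, 7, 3, 5, 1, 6, 2, 4};
    int n = (int)sizeof text;
    unsigned char *bv[LEVELS];
    for (int l = 0; l < LEVELS; l++) bv[l] = malloc((size_t)n);
    build_levels(text, n, bv);
    for (int l = 0; l < LEVELS; l++) {
        printf("level %d: ", l);
        for (int i = 0; i < n; i++) printf("%d", bv[l][i]);
        printf("\n");
        free(bv[l]);
    }
    return 0;
}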


2010 ◽  
Vol 45 (5) ◽  
pp. 345-346 ◽  
Author(s):  
Aparna Chandramowlishwaran ◽  
Kathleen Knobe ◽  
Richard Vuduc

2002 ◽  
Vol 28 (2) ◽  
pp. 155-185 ◽  
Author(s):  
Olivier Beaumont ◽  
Arnaud Legrand ◽  
Fabrice Rastello ◽  
Yves Robert

2018 ◽  
Vol 106 (11) ◽  
pp. 2040-2055 ◽  
Author(s):  
Jack Dongarra ◽  
Mark Gates ◽  
Jakub Kurzak ◽  
Piotr Luszczek ◽  
Yaohung M. Tsai
