Management of Deep Memory Hierarchies – Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Computations

SIAM Review ◽  
2004 ◽  
Vol 46 (1) ◽  
pp. 3-45 ◽  
Author(s):  
Erik Elmroth ◽  
Fred Gustavson ◽  
Isak Jonsson ◽  
Bo Kågström

2009 ◽  
Vol 21 (18) ◽  
pp. 2457-2477 ◽  
Author(s):  
Sergio Barrachina ◽  
Maribel Castillo ◽  
Francisco D. Igual ◽  
Rafael Mayo ◽  
Enrique S. Quintana-Ortí ◽  
...  

2016 ◽  
Vol 26 (02) ◽  
pp. 1650007 ◽  
Author(s):  
Jing Wu ◽  
Joseph Jaja

In this paper, we show that matrix computations on heterogeneous platforms can achieve native GPU performance on very large data sizes, up to the capacity of the CPU memory. More specifically, we present a dense matrix multiplication strategy for a heterogeneous platform, tailored to the case when the input is too large to fit in device memory, which achieves near-peak GPU performance. Our strategy builds CUDA-stream-based software pipelines that effectively overlap PCIe data transfers with kernel executions. As a result, we achieve over 1 and 2 TFLOPS on a single node using 1 and 2 GPUs, respectively.


1993 ◽  
Vol 04 (01) ◽  
pp. 65-83 ◽  
Author(s):  
SERGE PETITON ◽  
YOUCEF SAAD ◽  
KESHENG WU ◽  
WILLIAM FERNG

This paper presents a preliminary experimental study of the performance of basic sparse matrix computations on the CM-5. We concentrate on examining various ways of performing general sparse matrix–vector operations and the basic primitives on which they are based. We compare various data structures for storing sparse matrices and their corresponding matrix–vector operations. Both SPMD and data-parallel modes are examined, and the two modes are compared.


2011 ◽  
Vol 24 (12) ◽  
pp. 1317-1333 ◽  
Author(s):  
Bryan Marker ◽  
Ernie Chan ◽  
Jack Poulson ◽  
Robert van de Geijn ◽  
Rob F. Van der Wijngaart ◽  
...  

Author(s):  
Rob H. Bisseling

This chapter discusses parallel dense matrix computations, in particular the solution of linear systems by LU decomposition with partial row pivoting. It first presents a general Cartesian scheme for the distribution of matrices. Based on BSP cost analysis, the square cyclic distribution is proposed as particularly suitable for matrix computations such as LU decomposition and Gaussian elimination. The chapter introduces two-phase broadcasting of vectors, which is a useful collective-communication method for sending copies of matrix rows or columns to a group of processors. It also discusses how to achieve high performance by delaying rank-1 matrix updates to create a multiple-rank update, which can be carried out by multiplying tall-and-skinny matrices in a cache-friendly manner. The high-performance parallel LU decomposition is tested on a top-ranking supercomputer, and its performance is analysed with respect to computation, communication, and synchronization.

