Management of Deep Memory Hierarchies – Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Computations

SIAM Review ◽  
2004 ◽  
Vol 46 (1) ◽  
pp. 3-45 ◽  
Author(s):  
Erik Elmroth ◽  
Fred Gustavson ◽  
Isak Jonsson ◽  
Bo Kågström

2009 ◽  
Vol 21 (18) ◽  
pp. 2457-2477 ◽  
Author(s):  
Sergio Barrachina ◽  
Maribel Castillo ◽  
Francisco D. Igual ◽  
Rafael Mayo ◽  
Enrique S. Quintana-Ortí ◽  
...  

2016 ◽  
Vol 26 (02) ◽  
pp. 1650007 ◽  
Author(s):  
Jing Wu ◽  
Joseph Jaja

In this paper, we show that matrix computations on heterogeneous platforms can achieve native GPU performance on very large data sizes, up to the capacity of the CPU memory. More specifically, we present a dense matrix multiplication strategy for a heterogeneous platform, tailored to the case when the input is too large to fit in device memory, which achieves near-peak GPU performance. Our strategy builds CUDA-stream-based software pipelines that effectively overlap PCIe data transfers with kernel executions. As a result, we achieve over 1 and 2 TFLOPS on a single node using 1 and 2 GPUs, respectively.


1993 ◽  
Vol 04 (01) ◽  
pp. 65-83 ◽  
Author(s):  
SERGE PETITON ◽  
YOUCEF SAAD ◽  
KESHENG WU ◽  
WILLIAM FERNG

This paper presents a preliminary experimental study of the performance of basic sparse matrix computations on the CM-5. We concentrate on examining various ways of performing general sparse matrix–vector operations and the basic primitives on which they are based. We compare various data structures for storing sparse matrices and their corresponding matrix–vector operations. Both SPMD and data-parallel modes are examined, and the two modes are compared.


2011 ◽  
Vol 24 (12) ◽  
pp. 1317-1333 ◽  
Author(s):  
Bryan Marker ◽  
Ernie Chan ◽  
Jack Poulson ◽  
Robert van de Geijn ◽  
Rob F. Van der Wijngaart ◽  
...  

Author(s):  
Rob H. Bisseling

This chapter discusses parallel dense matrix computations, in particular the solution of linear systems by LU decomposition with partial row pivoting. It first presents a general Cartesian scheme for the distribution of matrices. Based on BSP cost analysis, the square cyclic distribution is proposed as particularly suitable for matrix computations such as LU decomposition and Gaussian elimination. The chapter introduces two-phase broadcasting of vectors, which is a useful collective-communication method for sending copies of matrix rows or columns to a group of processors. It also discusses how to achieve high performance by delaying rank-1 matrix updates to create a multiple-rank update, which can be carried out by multiplying tall-and-skinny matrices in a cache-friendly manner. The high-performance parallel LU decomposition is tested on a top-ranking supercomputer, and its performance is analysed with respect to computation, communication, and synchronization.

