Specializing Compiler Optimizations through Programmable Composition for Dense Matrix Computations

Author(s):  
Qing Yi ◽  
Qian Wang ◽  
Huimin Cui
2009 ◽  
Vol 21 (18) ◽  
pp. 2457-2477 ◽  
Author(s):  
Sergio Barrachina ◽  
Maribel Castillo ◽  
Francisco D. Igual ◽  
Rafael Mayo ◽  
Enrique S. Quintana-Ortí ◽  
...  

2016 ◽  
Vol 26 (02) ◽  
pp. 1650007 ◽  
Author(s):  
Jing Wu ◽  
Joseph Jaja

In this paper, we illustrate the possibility of developing strategies for carrying out matrix computations on heterogeneous platforms that achieve native GPU performance on very large data sizes, up to the capacity of the CPU memory. More specifically, we present a dense matrix multiplication strategy on a heterogeneous platform, tailored for the case when the input is too large to fit in device memory, that achieves near-peak GPU performance. Our strategy involves the development of CUDA stream-based software pipelines that effectively overlap PCIe data transfers with kernel executions. As a result, we achieve over 1 and 2 TFLOPS of performance on a single node using 1 and 2 GPUs, respectively.
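The stream-pipelining idea described in this abstract can be sketched in host code roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes column-major storage, that the A operand fits on the device while B and C are streamed in column panels, and that the panel width, the two-stream double buffering, and the helper name streamed_sgemm are all illustrative choices.

    // Sketch: out-of-core SGEMM C = A*B with CUDA streams overlapping
    // PCIe transfers and cuBLAS kernels. Column-major storage assumed;
    // A is resident on the GPU, B and C are streamed in column panels.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void streamed_sgemm(const float* hA, const float* hB, float* hC,
                        int M, int K, int N, int TILE) {
        const int NSTREAMS = 2;                    // double buffering
        cublasHandle_t handle;  cublasCreate(&handle);
        float *dA, *dB[NSTREAMS], *dC[NSTREAMS];
        cudaMalloc(&dA, sizeof(float) * M * K);    // A stays on the device
        cudaStream_t s[NSTREAMS];
        for (int i = 0; i < NSTREAMS; ++i) {
            cudaStreamCreate(&s[i]);
            cudaMalloc(&dB[i], sizeof(float) * K * TILE);
            cudaMalloc(&dC[i], sizeof(float) * M * TILE);
        }
        cudaMemcpy(dA, hA, sizeof(float) * M * K, cudaMemcpyHostToDevice);

        const float alpha = 1.f, beta = 0.f;
        for (int j = 0, t = 0; j < N; j += TILE, t = (t + 1) % NSTREAMS) {
            int cols = (j + TILE <= N) ? TILE : N - j;
            // Stage 1: copy the next column panel of B while other panels compute.
            cudaMemcpyAsync(dB[t], hB + (size_t)j * K, sizeof(float) * K * cols,
                            cudaMemcpyHostToDevice, s[t]);
            // Stage 2: GEMM on this panel, issued into the same stream.
            cublasSetStream(handle, s[t]);
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, cols, K,
                        &alpha, dA, M, dB[t], K, &beta, dC[t], M);
            // Stage 3: copy the finished panel of C back to the host.
            cudaMemcpyAsync(hC + (size_t)j * M, dC[t], sizeof(float) * M * cols,
                            cudaMemcpyDeviceToHost, s[t]);
        }
        for (int i = 0; i < NSTREAMS; ++i) cudaStreamSynchronize(s[i]);
        cublasDestroy(handle);
        // Device buffers and streams would be freed here in complete code.
    }

For the asynchronous copies to actually overlap with the cuBLAS kernels, the host buffers hB and hC would need to be pinned (for example allocated with cudaMallocHost); operations issued into the same stream serialize, which is what keeps each double-buffered panel consistent.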


2011 ◽  
Vol 24 (12) ◽  
pp. 1317-1333 ◽  
Author(s):  
Bryan Marker ◽  
Ernie Chan ◽  
Jack Poulson ◽  
Robert Geijn ◽  
Rob F. Van der Wijngaart ◽  
...  

Author(s):  
Rob H. Bisseling

This chapter discusses parallel dense matrix computations, in particular the solution of linear systems by LU decomposition with partial row pivoting. It first presents a general Cartesian scheme for the distribution of matrices. Based on BSP cost analysis, the square cyclic distribution is proposed as particularly suitable for matrix computations such as LU decomposition and Gaussian elimination. The chapter introduces two-phase broadcasting of vectors, which is a useful collective-communication method for sending copies of matrix rows or columns to a group of processors. It also discusses how to achieve high performance by delaying rank-1 matrix updates to create a multiple-rank update, which can be carried out by multiplying tall-and-skinny matrices in a cache-friendly manner. The high-performance parallel LU decomposition is tested on a top-ranking supercomputer, and its performance is analysed with respect to computation, communication, and synchronization.
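The delayed-update idea in this chapter can be illustrated with a small sketch. The code below is hypothetical and not the chapter's BSP program: it shows the square cyclic owner mapping and a multiple-rank update in which k postponed rank-1 updates are applied in a single pass over the trailing matrix, as a product of a tall-and-skinny block with a short-and-wide block.

    #include <vector>
    using Matrix = std::vector<std::vector<double>>;

    // Square cyclic distribution: entry (i, j) is owned by processor
    // (i mod q, j mod q) on a q x q grid, q = sqrt(p); returned as a rank.
    int owner(int i, int j, int q) { return (i % q) * q + (j % q); }

    // Multiple-rank update: k delayed rank-1 updates A22 -= l_s * u_s^T
    // applied at once as A22 -= L21 * U12, where L21 is m x k (tall and
    // skinny) and U12 is k x n (short and wide). A22 is m x n.
    void multiple_rank_update(Matrix& A22, const Matrix& L21, const Matrix& U12) {
        int m = A22.size(), n = A22[0].size(), k = U12.size();
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int s = 0; s < k; ++s)     // accumulate all k rank-1 contributions
                    sum += L21[i][s] * U12[s][j];
                A22[i][j] -= sum;                // one write per trailing-matrix entry
            }
    }

Reading and writing the trailing submatrix once instead of k times is what makes the multiple-rank variant cache-friendly; in practice the inner products would be carried out by an optimized matrix-matrix multiplication routine rather than the plain loops shown here.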


2001 ◽  
Vol 9 (1) ◽  
pp. 51-60 ◽  
Author(s):  
Jack Dongarra ◽  
Victor Eijkhout ◽  
Piotr Łuszczek

This paper describes a recursive method for the LU factorization of sparse matrices. The recursive formulation of common linear algebra codes has proven very successful in dense matrix computations. An extension of the recursive technique to sparse matrices is presented. Performance results given here show that the recursive approach may perform comparably to leading software packages for sparse matrix factorization in terms of execution time, memory usage, and error estimates of the solution.
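The dense recursive formulation that the paper builds on can be sketched as follows. This is an illustrative version only, without pivoting and with plain loops in place of BLAS calls; it is not the paper's sparse code.

    // Recursive dense LU without pivoting: factor the leading block,
    // form the off-diagonal blocks by triangular solves, update the
    // Schur complement, and recurse on it. A is n x n, row-major with
    // leading dimension lda; the unit-lower L and upper U overwrite A.
    void recursive_lu(double* A, int n, int lda) {
        if (n == 1) return;                      // 1x1 base case: nothing to do
        int n1 = n / 2, n2 = n - n1;
        double* A11 = A;
        double* A12 = A + n1;
        double* A21 = A + n1 * lda;
        double* A22 = A + n1 * lda + n1;

        recursive_lu(A11, n1, lda);              // A11 = L11 * U11

        // U12 = L11^{-1} * A12 (forward substitution, L11 has unit diagonal)
        for (int j = 0; j < n2; ++j)
            for (int i = 1; i < n1; ++i)
                for (int k = 0; k < i; ++k)
                    A12[i * lda + j] -= A11[i * lda + k] * A12[k * lda + j];

        // L21 = A21 * U11^{-1} (solve against the upper triangular U11)
        for (int i = 0; i < n2; ++i)
            for (int j = 0; j < n1; ++j) {
                for (int k = 0; k < j; ++k)
                    A21[i * lda + j] -= A21[i * lda + k] * A11[k * lda + j];
                A21[i * lda + j] /= A11[j * lda + j];
            }

        // Schur complement: A22 -= L21 * U12, then factor it recursively
        for (int i = 0; i < n2; ++i)
            for (int j = 0; j < n2; ++j)
                for (int k = 0; k < n1; ++k)
                    A22[i * lda + j] -= A21[i * lda + k] * A12[k * lda + j];

        recursive_lu(A22, n2, lda);
    }

The recursion automatically turns most of the work into ever-larger matrix-matrix operations, which is where the dense recursive codes obtain their performance and what makes the formulation attractive to extend to the sparse case.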

