A Novel CSR-Based Sparse Matrix-Vector Multiplication on GPUs

2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Guixia He ◽  
Jiaquan Gao

Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV kernels on graphics processing units (GPUs), such as CSR-scalar and CSR-vector, usually perform poorly due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU, called PCSR. PCSR involves two kernels and accesses the CSR arrays in a fully coalesced manner by introducing an intermediate array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR consistently outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines of the vendor-tuned CUSPARSE library, and is comparable with the recently proposed CSR-based algorithm CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that PCSR achieves good performance and high parallel efficiency, whether or not the communication between GPUs is taken into account.
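
As an illustration of the coalescing problem this abstract describes, here is a minimal CUDA sketch contrasting the classic CSR-scalar kernel with a coalesced products pass in the spirit of PCSR's first kernel. All names are illustrative, and the second kernel is a sketch of the general idea, not the paper's actual code.

```cuda
#include <cuda_runtime.h>

// CSR-scalar SpMV: one thread per row. Threads in a warp read val[]/col[]
// at unrelated offsets (row_ptr[row] differs per thread), so global loads
// are rarely coalesced -- the deficiency PCSR targets.
__global__ void spmv_csr_scalar(int n_rows, const int *row_ptr,
                                const int *col, const double *val,
                                const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col[j]];
        y[row] = sum;
    }
}

// Hypothetical first pass in the spirit of PCSR: consecutive threads read
// consecutive elements of val[]/col[] (fully coalesced) and write partial
// products to an intermediate array; a second kernel would then reduce
// each row's products into y using row_ptr.
__global__ void spmv_products_coalesced(int nnz, const int *col,
                                        const double *val, const double *x,
                                        double *prod) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nnz)
        prod[i] = val[i] * x[col[i]];   // stride-1 access to val[] and col[]
}
```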

2016 ◽  
Vol 2016 ◽  
pp. 1-14 ◽  
Author(s):  
Jiaquan Gao ◽  
Panpan Qi ◽  
Guixia He

Sparse matrix-vector multiplication (SpMV) is an important operation in computational science and needs to be accelerated because it often represents the dominant cost in widely used iterative methods and eigenvalue problems. We achieve this objective by proposing a novel SpMV algorithm based on the compressed sparse row (CSR) format on the GPU. Our method dynamically assigns a different number of rows to each thread block and executes a different optimized implementation for each block depending on how many rows it contains. Accesses to the CSR arrays are fully coalesced, and the GPU's DRAM bandwidth is efficiently utilized by loading data into shared memory, which alleviates the bottleneck of many existing CSR-based algorithms (i.e., CSR-scalar and CSR-vector). Test results on C2050 and K20c GPUs show that our method outperforms the perfect-CSR algorithm (PCSR) that inspired our work, the vendor-tuned CUSPARSE V6.5 and CUSP V0.5.1 libraries, and three popular algorithms: clSpMV, CSR5, and CSR-Adaptive.
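
A hedged CUDA sketch of the block-level strategy described above: assume the host precomputes block_rows[] so that each thread block owns a row range whose nonzeros fit in shared memory; the block then stages its slice of val[] and col[] with coalesced loads before summing per row. The names and the fixed SMEM_NNZ capacity are illustrative assumptions, not the paper's kernels.

```cuda
#include <cuda_runtime.h>

#define SMEM_NNZ 1024   // assumed per-block nonzero capacity

__global__ void spmv_csr_block(const int *block_rows, const int *row_ptr,
                               const int *col, const double *val,
                               const double *x, double *y) {
    __shared__ double s_val[SMEM_NNZ];
    __shared__ int    s_col[SMEM_NNZ];

    int row_begin = block_rows[blockIdx.x];
    int row_end   = block_rows[blockIdx.x + 1];
    int nz_begin  = row_ptr[row_begin];
    int nz_count  = row_ptr[row_end] - nz_begin;  // host guarantees <= SMEM_NNZ

    // Coalesced, cooperative load of this block's nonzeros into shared memory.
    for (int i = threadIdx.x; i < nz_count; i += blockDim.x) {
        s_val[i] = val[nz_begin + i];
        s_col[i] = col[nz_begin + i];
    }
    __syncthreads();

    // One thread per row; rows in such a block are short by construction.
    for (int r = row_begin + threadIdx.x; r < row_end; r += blockDim.x) {
        double sum = 0.0;
        for (int j = row_ptr[r] - nz_begin; j < row_ptr[r + 1] - nz_begin; ++j)
            sum += s_val[j] * x[s_col[j]];
        y[r] = sum;
    }
}
```

In this sketch the number of rows per block varies, which is where a per-block choice among several optimized implementations (as the abstract describes) would plug in.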


10.14311/1029 ◽  
2008 ◽  
Vol 48 (4) ◽  
Author(s):  
I. Šimeček

Sparse matrix-vector multiplication (SpM×V for short) is one of the most common subroutines in numerical linear algebra. The problem is that the memory access patterns during SpM×V are irregular, and utilization of the cache can suffer from low spatial or temporal locality. Existing approaches to improving the performance of SpM×V are based on matrix reordering and register blocking. These matrix transformations are designed to handle randomly occurring dense blocks in a sparse matrix, and their efficiency depends strongly on the presence of suitable blocks. The overhead of reorganizing a matrix from one format to another is often of the order of tens of executions of SpM×V. For this reason, such a reorganization pays off only if the same matrix A is multiplied by multiple different vectors, e.g., in iterative linear solvers. This paper introduces an unusual approach to accelerating SpM×V that can be combined with other acceleration approaches and consists of three steps: (1) dividing matrix A into non-empty regions; (2) choosing an efficient way to traverse these regions, in other words, an efficient ordering of the partial multiplications; (3) choosing the optimal storage format for each region. These three steps are tightly coupled. The first step divides the whole matrix into smaller parts (regions) that can fit in the cache. The second step improves locality during multiplication through better utilization of distant references. The last step maximizes the machine computation performance of the partial multiplication for each region. In this paper, we describe all three steps in more detail, including fast and inexpensive algorithms for each. Our measurements prove that our approach gives a significant speedup for almost all matrices arising from various technical areas.
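
A host-side C++ sketch of the region idea in steps (1) and (2): traverse the matrix in vertical strips narrow enough that the slice of x each strip touches stays cache-resident, improving the temporal locality of the x accesses. The strip width is an illustrative assumption, column indices are assumed sorted within each row, and the paper's step (3), a per-region storage format, is not shown.

```cuda
#include <vector>

void spmv_strips(int n_rows, int n_cols, int strip_w,
                 const int *row_ptr, const int *col, const double *val,
                 const double *x, double *y) {
    std::vector<int> cur(n_rows);                 // per-row cursor into CSR
    for (int r = 0; r < n_rows; ++r) { y[r] = 0.0; cur[r] = row_ptr[r]; }

    for (int c0 = 0; c0 < n_cols; c0 += strip_w) {   // one region strip
        int c1 = c0 + strip_w;                       // x[c0..c1) stays in cache
        for (int r = 0; r < n_rows; ++r) {           // partial multiplications
            int j = cur[r];
            while (j < row_ptr[r + 1] && col[j] < c1) {
                y[r] += val[j] * x[col[j]];
                ++j;
            }
            cur[r] = j;
        }
    }
}
```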


1998 ◽  
Vol. 2 ◽
Author(s):  
Giovanni Manzini

In this paper we consider the problem of computing the product y = Ax on a local memory machine, where A is a random n×n sparse matrix with Θ(n) nonzero elements. To study the average-case communication cost of this problem, we introduce four different probability measures on the set of sparse matrices. We prove that on most local memory machines with p processors, this computation requires Ω((n/p) log p) time on average. We prove that the same lower bound also holds, in the worst case, for matrices with only 2n or 3n nonzero elements.


Author(s):  
Rob H. Bisseling

This chapter introduces irregular algorithms and presents the example of parallel sparse matrix-vector multiplication (SpMV), which is the central operation in iterative linear system solvers. The irregular sparsity pattern of the matrix does not change during the multiplication, which may be repeated many times. This justifies putting a lot of effort into finding a good data distribution. The Mondriaan distribution of a sparse matrix is a useful non-Cartesian distribution that can be found by hypergraph-based partitioning. The Mondriaan package implements such a partitioning and also the newer medium-grain partitioning method. The chapter analyses the special cases of random sparse matrices and Laplacian matrices. It uses performance profiles and geometric means to compare different partitioning methods. Furthermore, it presents the hybrid-BSP model and a hybrid-BSP SpMV, which are aimed at hybrid distributed/shared-memory architectures. The parallel SpMV can be incorporated in applications, ranging from PageRank computation to artificial neural networks.
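
A schematic host-side C++ sketch of the classic fan-out / local-multiply / fan-in structure that such a data distribution feeds into, shown from one processor's point of view. The struct and all names are illustrative assumptions, not the Mondriaan package API, and the communication steps are stood in for by direct array reads.

```cuda
#include <vector>

struct LocalPart {
    std::vector<int> rows, cols;        // global row/column indices used here
    std::vector<int> row_ptr, col;      // local CSR over the owned nonzeros
    std::vector<double> val;
};

// One processor's share of a BSP-style SpMV: gather the needed x entries,
// multiply locally, then (not shown) send each partial sum to the owner of
// its row for the fan-in.
void bsp_spmv_local(const LocalPart &p, const double *x_global,
                    std::vector<double> &y_partial) {
    // (1) Fan-out: fetch the x entries this processor's columns touch.
    std::vector<double> x_local(p.cols.size());
    for (int k = 0; k < (int)p.cols.size(); ++k)
        x_local[k] = x_global[p.cols[k]];   // stands in for communication
    // (2) Local SpMV on the owned nonzeros (local column indices).
    y_partial.assign(p.rows.size(), 0.0);
    for (int r = 0; r < (int)p.rows.size(); ++r)
        for (int j = p.row_ptr[r]; j < p.row_ptr[r + 1]; ++j)
            y_partial[r] += p.val[j] * x_local[p.col[j]];
    // (3) Fan-in: y_partial[r] would now be sent to the owner of row r.
}
```

The communication volume of steps (1) and (3) is exactly what hypergraph-based partitioning, as in Mondriaan, tries to minimize.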


2016 ◽  
Vol 26 (04) ◽  
pp. 1640001
Author(s):  
Jiaquan Gao ◽  
Yuanshen Zhou ◽  
Kesong Wu

Accelerating sparse matrix-vector multiplication (SpMV) on graphics processing units (GPUs) has attracted considerable attention recently. We observe that on a specific multiple-GPU platform, SpMV performance can usually be greatly improved when a matrix is partitioned into several blocks according to a predetermined rule and each block is assigned to a GPU with an appropriate storage format. This motivates us to propose a novel multi-GPU parallel SpMV optimization model. Our model involves two stages. In the first stage, a simple rule is defined to divide any given matrix among multiple GPUs, and a performance model, which is independent of the problem and dependent only on the device resources, is proposed to accurately predict the execution time of SpMV kernels. Using these models, in the second stage we construct an optimal multi-GPU parallel SpMV algorithm that is automatically and rapidly generated for the platform for any given problem. Because the performance model is problem-independent, it needs to be constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.
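
A hedged host-side C++ sketch of the two-stage idea: a device-calibrated model predicts kernel time per storage format from simple matrix statistics, and each partitioned block is assigned the format with the smallest prediction. The linear model, the format set, and all names are illustrative assumptions, not the paper's actual performance model.

```cuda
enum Format { CSR = 0, ELL = 1, HYB = 2, N_FORMATS = 3 };

struct DeviceModel {             // calibrated once per GPU type
    double per_nnz[N_FORMATS];   // seconds per nonzero
    double per_row[N_FORMATS];   // seconds per row
    double fixed[N_FORMATS];     // kernel launch overhead
};

// Predicted SpMV time for one matrix block in a given format.
double predict_time(const DeviceModel &m, Format f, long nnz, long rows) {
    return m.fixed[f] + m.per_nnz[f] * nnz + m.per_row[f] * rows;
}

// Pick the format with the smallest predicted time for this block.
Format pick_format(const DeviceModel &m, long nnz, long rows) {
    Format best = CSR;
    for (int f = 1; f < N_FORMATS; ++f)
        if (predict_time(m, (Format)f, nnz, rows) <
            predict_time(m, best, nnz, rows))
            best = (Format)f;
    return best;
}
```

Because the coefficients depend only on the device, calibrating DeviceModel once per GPU type suffices for all subsequent problems, which is the point the abstract makes.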


10.14311/826 ◽  
2006 ◽  
Vol 46 (3) ◽  
Author(s):  
I. Šimeček

Sparse matrix-vector multiplication (SpM×V for short) is an important building block in algorithms for solving sparse systems of linear equations, e.g., in FEM. Due to matrix sparsity, the memory access patterns are irregular and utilization of the cache can suffer from low spatial or temporal locality. Approaches to improving the performance of SpM×V are based on matrix reordering and register blocking [1, 2], sometimes combined with software pipelining [3]. Due to its overhead, register blocking achieves good speedups only for a large number of executions of SpM×V with the same matrix A. We have investigated the impact of two simple software transformation techniques (software pipelining and loop unrolling) on the performance of SpM×V, and have compared it with several implementation modifications aimed at reducing computational and memory complexity and improving spatial locality. We investigate the performance gains of these modifications on four CPU platforms.
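
A host-side C++ sketch of one of the studied transformations: 4-way unrolling of the CSR inner loop with multiple accumulators, which shortens the dependency chain and gives the compiler room to software-pipeline the loads. This is illustrative code, not the paper's exact variant.

```cuda
void spmv_csr_unroll4(int n_rows, const int *row_ptr, const int *col,
                      const double *val, const double *x, double *y) {
    for (int r = 0; r < n_rows; ++r) {
        int j = row_ptr[r], end = row_ptr[r + 1];
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (; j + 4 <= end; j += 4) {      // unrolled body, 4 accumulators
            s0 += val[j]     * x[col[j]];
            s1 += val[j + 1] * x[col[j + 1]];
            s2 += val[j + 2] * x[col[j + 2]];
            s3 += val[j + 3] * x[col[j + 3]];
        }
        for (; j < end; ++j)                // remainder loop
            s0 += val[j] * x[col[j]];
        y[r] = (s0 + s1) + (s2 + s3);
    }
}
```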


2021 ◽  
Vol 2021 ◽  
pp. 1-17
Author(s):  
Wenpeng Ma ◽  
Yiwen Hu ◽  
Wu Yuan ◽  
Xiazhen Liu

Solving triangular systems is a key building block of the preconditioned GMRES algorithm. Inexact preconditioning is attractive on accelerators because of its high degree of parallelism. In this paper, we propose and implement an iterative, inexact block triangular solve on multiple GPUs based on PETSc's framework. In addition, by developing a distributed block sparse matrix-vector multiplication procedure and optimizing the vector operations, we form a multi-GPU-enabled preconditioned GMRES with the block Jacobi preconditioner. In the implementation, the GPU-Direct technique is employed to avoid host-device memory copies. For performance comparison, we also investigate preconditioning steps built on PETSc's native structures and on the cuSPARSE library. The experiments show that the developed GMRES with inexact preconditioning on 8 GPUs achieves up to a 4.4x speedup over a CPU-only implementation with exact preconditioning using 8 MPI processes.
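
A hedged CUDA sketch of the inexact idea: instead of an exact (inherently sequential) forward substitution, run a few Jacobi sweeps x_{k+1} = D^{-1}(b - (L - D) x_k) on the lower-triangular factor L, so every row updates in parallel. The names and the host-side sweep loop are illustrative assumptions, not the paper's PETSc code.

```cuda
#include <cuda_runtime.h>

// One Jacobi sweep for a lower-triangular CSR matrix L: each thread
// computes x_new[r] = (b[r] - sum_{c != r} L[r][c] * x_old[c]) / L[r][r].
__global__ void jacobi_sweep_lower(int n, const int *row_ptr, const int *col,
                                   const double *val, const double *b,
                                   const double *x_old, double *x_new) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n) return;
    double diag = 1.0, s = b[r];
    for (int j = row_ptr[r]; j < row_ptr[r + 1]; ++j) {
        if (col[j] == r) diag = val[j];          // diagonal entry
        else             s  -= val[j] * x_old[col[j]];
    }
    x_new[r] = s / diag;
}

// Host side (sketch): ping-pong x_old/x_new for a small fixed number of
// sweeps; a few sweeps give an inexact but highly parallel solve.
```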


2014 ◽  
Vol 22 (1) ◽  
pp. 1-19 ◽  
Author(s):  
Valeria Cardellini ◽  
Salvatore Filippone ◽  
Damian W.I. Rouson

We apply object-oriented software design patterns to develop code for scientific software involving sparse matrices. Design patterns arise when multiple independent developments produce similar designs that converge onto a generic solution. We demonstrate how to use design patterns to implement an interface for sparse matrix computations on NVIDIA GPUs, starting from PSBLAS, an existing sparse matrix library, and from existing sets of GPU kernels for sparse matrices. We also compare the throughput of the PSBLAS sparse matrix-vector multiplication on two GPU-equipped platforms with that of a CPU-only PSBLAS implementation. Our experiments show encouraging results for GPU versus CPU execution in double precision, with speedups of up to 35.35 on an NVIDIA GTX 285 relative to an AMD Athlon 7750, and up to 10.15 on an NVIDIA Tesla C2050 relative to an Intel Xeon X5650.
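
A hedged C++ sketch of the kind of pattern-based interface the abstract describes: client code programs against an abstract sparse-matrix type, while concrete backends (CPU CSR here, a GPU CSR analogously) supply the storage and the SpMV kernel. Class and method names are illustrative, not PSBLAS's actual API.

```cuda
// Abstract product: solvers depend only on this interface.
struct SparseMatrix {
    virtual ~SparseMatrix() {}
    virtual void spmv(const double *x, double *y) const = 0;  // y = A*x
};

// Concrete CPU backend over externally owned CSR arrays.
struct CsrHost : SparseMatrix {
    int n_rows; const int *row_ptr, *col; const double *val;
    CsrHost(int n, const int *rp, const int *c, const double *v)
        : n_rows(n), row_ptr(rp), col(c), val(v) {}
    void spmv(const double *x, double *y) const override {
        for (int r = 0; r < n_rows; ++r) {
            double s = 0.0;
            for (int j = row_ptr[r]; j < row_ptr[r + 1]; ++j)
                s += val[j] * x[col[j]];
            y[r] = s;
        }
    }
};

// A CsrDevice subclass would hold device pointers and launch a CUDA kernel
// in spmv(); solvers written against SparseMatrix& run unchanged on either.
```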


2010 ◽  
Vol 46 (8) ◽  
pp. 2982-2985 ◽  
Author(s):  
Maryam Mehri Dehnavi ◽  
David M. Fernandez ◽  
Dennis Giannacopoulos
