An Effective Approach for Implementing Sparse Matrix-Vector Multiplication on Graphics Processing Units

Compression and load balancing for efficient sparse matrix‐vector product on multicore processors and graphics processing units

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.6515 ◽

2021 ◽

Author(s):

José I. Aliaga ◽

Hartwig Anzt ◽

Thomas Grützmacher ◽

Enrique S. Quintana‐Ortí ◽

Andrés E. Tomás

Keyword(s):

Load Balancing ◽

Graphics Processing Units ◽

Sparse Matrix ◽

Multicore Processors ◽

Vector Product ◽

Graphics Processing ◽

Matrix Vector

Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019886628 ◽

2019 ◽

Vol 34 (1) ◽

pp. 66-80 ◽

Cited By ~ 1

Author(s):

Akrem Benatia ◽

Weixing Ji ◽

Yizhuo Wang ◽

Feng Shi

Keyword(s):

Graphics Processing Units ◽

Sparse Matrix ◽

Sparse Matrices ◽

Heterogeneous Systems ◽

Input Matrix ◽

Heterogeneous Platforms ◽

Mapping Algorithm ◽

Matrix Vector Multiplication ◽

Graphics Processing ◽

Matrix Partitioning

The sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most existing studies dedicated to improving this kernel have targeted just one type of processing unit, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the recent, rapidly emerging CPU-GPU heterogeneous platforms. To take full advantage of these heterogeneous systems, the input sparse matrix has to be partitioned across the available processing units. The partitioning problem is made more challenging by the existence of many sparse formats whose performance depends on both the sparsity of the input matrix and the hardware used. Thus, the best performance depends not only on how the input sparse matrix is partitioned but also on which sparse format is used for each partition. To address this challenge, we propose in this article a new CPU-GPU heterogeneous method for computing the SpMV kernel that combines different sparse formats to achieve better performance and better utilization of CPU-GPU heterogeneous platforms. The proposed solution horizontally partitions the input matrix into multiple block-rows and predicts their best sparse formats using machine learning-based performance models. A mapping algorithm is then used to assign the block-rows to the CPU and GPU(s) available in the system. Our experimental results using real-world large unstructured sparse matrices on two different machines show a noticeable performance improvement.
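
The horizontal block-row partitioning described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the matrix is assumed to be stored in CSR, the function names and the fixed `rows_per_block` parameter are hypothetical, and the format prediction and CPU/GPU mapping steps are omitted entirely.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Multiply a CSR matrix by a dense vector x."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def split_block_rows(row_ptr, col_idx, values, rows_per_block):
    """Cut a CSR matrix horizontally into block-rows, each itself in CSR form."""
    n_rows = len(row_ptr) - 1
    blocks = []
    for start in range(0, n_rows, rows_per_block):
        end = min(start + rows_per_block, n_rows)
        lo, hi = row_ptr[start], row_ptr[end]
        # Re-base the row pointers so each block is a standalone CSR matrix.
        local_ptr = [p - lo for p in row_ptr[start:end + 1]]
        blocks.append((start, local_ptr, col_idx[lo:hi], values[lo:hi]))
    return blocks


def partitioned_spmv(row_ptr, col_idx, values, x, rows_per_block):
    """Compute y = A @ x one block-row at a time and stitch the pieces together."""
    y = [0.0] * (len(row_ptr) - 1)
    blocks = split_block_rows(row_ptr, col_idx, values, rows_per_block)
    for start, ptr, cols, vals in blocks:
        # Hypothetical hook: a heterogeneous scheduler could send each block
        # to a different device here; this sketch runs everything serially.
        part = csr_spmv(ptr, cols, vals, x)
        y[start:start + len(part)] = part
    return y


# 4x4 example matrix (an assumption for illustration):
# [[1, 0, 2, 0],
#  [0, 3, 0, 0],
#  [4, 0, 5, 6],
#  [0, 0, 0, 7]]
ROW_PTR = [0, 2, 3, 6, 7]
COL_IDX = [0, 2, 1, 0, 2, 3, 3]
VALUES = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
X = [1.0, 1.0, 1.0, 1.0]
```

Because each block-row owns a disjoint slice of the output vector, the blocks can be processed independently and no reduction across blocks is needed; `partitioned_spmv(ROW_PTR, COL_IDX, VALUES, X, 2)` gives the same result as a single `csr_spmv` call on the whole matrix.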

GPU Accelerated Reconstruction in Compton Scattering Tomography Using Matrix Compression

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.519-520.102 ◽

2014 ◽

Vol 519-520 ◽

pp. 102-107

Author(s):

Yu Fei Yu ◽

Bin Yan ◽

Biao Wang ◽

Lei Li ◽

Yu Han ◽

...

Keyword(s):

Compton Scattering ◽

Graphics Processing Unit ◽

Sparse Matrix ◽

Reconstruction Algorithm ◽

Processing Unit ◽

Matrix Vector Multiplication ◽

Speedup Ratio ◽

Parallel Features ◽

Graphics Processing ◽

Matrix Vector

An acceleration strategy for the TV-ADM reconstruction algorithm in Compton scattering tomography (CST) is proposed. First, by analyzing the sparsity of the CST projection matrices, the CSR and ELL sparse matrix formats are used to store them, which greatly reduces memory consumption. Then, a sparse matrix-vector multiplication (SpMV) method is used to accelerate the projection and back-projection processes. Finally, exploiting its parallel features, TV-ADM is computed on a graphics processing unit (GPU). Numerical experiments show that TV-ADM with the presented acceleration strategy can achieve a 96-fold speedup and a 224-fold memory compression ratio without loss of precision.
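
To make the two storage formats named in this abstract concrete, the sketch below converts a small CSR matrix to ELL and multiplies it by a vector. It is an illustration only, assuming plain Python lists and hypothetical function names; the example matrix is invented, and nothing here reflects the paper's GPU kernels or its projection matrices.

```python
def csr_to_ell(row_ptr, col_idx, values, pad_col=0):
    """Convert CSR to ELL: every row is padded to the length of the longest row."""
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[r + 1] - row_ptr[r] for r in range(n_rows)), default=0)
    ell_cols, ell_vals = [], []
    for r in range(n_rows):
        lo, hi = row_ptr[r], row_ptr[r + 1]
        pad = width - (hi - lo)
        # Padding entries carry the value 0.0, so they do not change the product.
        ell_cols.append(col_idx[lo:hi] + [pad_col] * pad)
        ell_vals.append(values[lo:hi] + [0.0] * pad)
    return ell_cols, ell_vals


def ell_spmv(ell_cols, ell_vals, x):
    """Multiply an ELL matrix by a dense vector; every row does the same work."""
    return [
        sum(v * x[c] for c, v in zip(cols, vals))
        for cols, vals in zip(ell_cols, ell_vals)
    ]


# 3x4 example matrix (an assumption for illustration):
# [[5, 0, 0, 1],
#  [0, 2, 0, 0],
#  [0, 0, 3, 4]]
ROW_PTR = [0, 2, 3, 5]
COL_IDX = [0, 3, 1, 2, 3]
VALUES = [5.0, 1.0, 2.0, 3.0, 4.0]

ELL_COLS, ELL_VALS = csr_to_ell(ROW_PTR, COL_IDX, VALUES)
Y = ell_spmv(ELL_COLS, ELL_VALS, [1.0, 2.0, 3.0, 4.0])
```

CSR stores only the nonzeros plus one pointer per row, while ELL trades some padding for a regular, fixed-width layout; which one is smaller or faster depends on how uneven the row lengths are.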

Multiple-precision matrix-vector multiplication on graphics processing units

Program systems theory and applications ◽

10.25209/2079-3316-2020-11-3-33-59 ◽

2020 ◽

Vol 11 (3) ◽

pp. 33-59 ◽

Cited By ~ 1

Author(s):

Константин Сергеевич Исупов ◽

Владимир Сергеевич Князьков

Keyword(s):

Graphics Processing Units ◽

Precision Matrix ◽

Matrix Vector Multiplication ◽

Multiple Precision ◽

Graphics Processing ◽

Matrix Vector

We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, the element-wise operations on multiple-precision vectors and matrices are split into parts, each of which is executed by a separate CUDA kernel. This eliminates branch divergence and allows fuller utilization of GPU resources. An efficient data structure for storing multiple-precision arrays ensures that the accesses of parallel threads to GPU global memory are coalesced into transactions. A rounding error analysis is performed for the proposed GEMV implementation, and accuracy bounds are obtained. Experimental results are presented that show the high efficiency of the developed implementation compared with existing multiple-precision software packages for GPUs.
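
The residue number system (RNS) idea behind this abstract can be illustrated with a much simplified sketch. It assumes non-negative integers only and an arbitrarily chosen set of pairwise coprime moduli; the paper itself works with multiple-precision floating-point numbers in CUDA, which this example does not attempt to reproduce. All names are hypothetical.

```python
from math import prod

# Pairwise coprime moduli (an assumption for illustration).
# Results are exact as long as they stay below prod(MODULI).
MODULI = (251, 253, 255, 256)


def to_rns(n):
    """Represent a non-negative integer by its residues."""
    return tuple(n % m for m in MODULI)


def rns_add(a, b):
    """Digit-wise addition: no carries propagate between residues."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Digit-wise multiplication: each residue is independent of the others."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues):
    """Reconstruct the integer with the Chinese remainder theorem."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # modular inverse (Python 3.8+)
    return total % big_m


def rns_gemv(matrix, vector):
    """Compute matrix @ vector with every multiply-add carried out in RNS."""
    vec = [to_rns(v) for v in vector]
    result = []
    for row in matrix:
        acc = to_rns(0)
        for a, v in zip(row, vec):
            acc = rns_add(acc, rns_mul(to_rns(a), v))
        result.append(from_rns(acc))
    return result
```

Because each residue is updated independently, the per-modulus work maps naturally onto parallel threads, which is the property that makes RNS attractive on GPUs. For example, `rns_gemv([[1, 2], [3, 4]], [5, 6])` reconstructs the ordinary integer product.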

A Novel Multi-GPU Parallel Optimization Model for The Sparse Matrix-Vector Multiplication

Parallel Processing Letters ◽

10.1142/s0129626416400016 ◽

2016 ◽

Vol 26 (04) ◽

pp. 1640001

Author(s):

Jiaquan Gao ◽

Yuanshen Zhou ◽

Kesong Wu

Keyword(s):

Optimization Model ◽

Graphics Processing Units ◽

High Efficiency ◽

Sparse Matrix ◽

Performance Model ◽

Parallel Optimization ◽

Multiple GPUs ◽

Matrix Vector Multiplication ◽

Storage Format ◽

Matrix Vector

Accelerating sparse matrix-vector multiplication (SpMV) on graphics processing units (GPUs) has attracted considerable attention recently. We observe that on a specific multiple-GPU platform, SpMV performance can usually be greatly improved when a matrix is partitioned into several blocks according to a predetermined rule and each block is assigned to a GPU with an appropriate storage format. This motivates us to propose a novel multi-GPU parallel SpMV optimization model. Our model involves two stages. In the first stage, a simple rule is defined to divide any given matrix among multiple GPUs, and then a performance model, which is independent of the problem and dependent on the resources of the devices, is proposed to accurately predict the execution time of SpMV kernels. Using these models, we construct in the second stage an optimal multi-GPU parallel SpMV algorithm that is automatically and rapidly generated for the platform for any problem. Because our model for SpMV is general, independent of the problem, and dependent only on the resources of the devices, it is constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.

A new sparse matrix vector multiplication graphics processing unit algorithm designed for finite element problems

International Journal for Numerical Methods in Engineering ◽

10.1002/nme.4865 ◽

2015 ◽

Vol 102 (12) ◽

pp. 1784-1814 ◽

Cited By ~ 15

Author(s):

J. Wong ◽

E. Kuhl ◽

E. Darve

Keyword(s):

Finite Element ◽

Graphics Processing Unit ◽

Sparse Matrix ◽

Processing Unit ◽

Matrix Vector Multiplication ◽

Graphics Processing ◽

Matrix Vector

Fast sparse matrix-vector multiplication on graphics processing unit for finite element analysis

2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems ◽

10.1109/hpcc.2012.193 ◽

2012 ◽

Cited By ~ 11

Author(s):

Abal-Kassim Cheik Ahamed ◽

Frederic Magoules

Keyword(s):

Finite Element Analysis ◽

Finite Element ◽

Graphics Processing Unit ◽

Sparse Matrix ◽

Processing Unit ◽

Element Analysis ◽

Matrix Vector Multiplication ◽

Graphics Processing ◽

Matrix Vector

SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs

Applied Sciences ◽

10.3390/app9050947 ◽

2019 ◽

Vol 9 (5) ◽

pp. 947 ◽

Cited By ~ 9

Author(s):

Thaha Muhammed ◽

Rashid Mehmood ◽

Aiiad Albeshri ◽

Iyad Katib

Keyword(s):

Load Balancing ◽

Graphics Processing Units ◽

Sparse Matrix ◽

Memory Access ◽

Group Matrix ◽

The Matrix ◽

Novel Method ◽

Coalesced Memory ◽

Graphics Processing ◽

Matrix Vector

Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to "speed" in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments and adaptively schedule the segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix by the number of nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads used to execute each segment is chosen adaptively. Dynamic Parallelism, available in Nvidia GPUs, is used to execute the group containing the segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. SURAA therefore minimizes the adverse effects of npr variance by distributing the load uniformly using equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools, delivering a 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
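
The row-sorting and Freedman–Diaconis steps mentioned in this abstract can be sketched roughly as below. This is one plausible reading, not the SURAA implementation: the abstract does not say how the rule is applied, so the sketch assumes CSR row pointers as input, uses the standard bin width h = 2 * IQR / n^(1/3), and bins rows by their npr value. The quartile method and all function names are assumptions.

```python
import statistics


def nonzeros_per_row(row_ptr):
    """Number of nonzeros per row (npr) read off the CSR row pointers."""
    return [row_ptr[r + 1] - row_ptr[r] for r in range(len(row_ptr) - 1)]


def freedman_diaconis_width(data):
    """Bin width h = 2 * IQR / n**(1/3); needs at least two data points."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2.0 * (q3 - q1) / len(data) ** (1.0 / 3.0)


def segment_rows(row_ptr):
    """Sort rows by npr, then group them into bins of Freedman-Diaconis width."""
    npr = nonzeros_per_row(row_ptr)
    order = sorted(range(len(npr)), key=lambda r: npr[r])
    width = freedman_diaconis_width(npr)
    if width <= 0:
        # Degenerate case: the interquartile range is zero, keep one segment.
        return [order]
    base = npr[order[0]]
    segments = {}
    for r in order:
        bin_id = int((npr[r] - base) // width)
        segments.setdefault(bin_id, []).append(r)
    return [segments[k] for k in sorted(segments)]


# Row pointers for a matrix whose rows hold 1, 8, 2, 1, 9, 2, 1 and 30
# nonzeros (an invented example with one very long row).
ROW_PTR = [0, 1, 9, 11, 12, 21, 23, 24, 54]
SEGMENTS = segment_rows(ROW_PTR)
```

Each returned segment holds row indices with similar npr, so a scheduler could pick a different thread count or kernel per segment; how segments are then merged into groups and launched on streams is outside the scope of this sketch.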

Iterative sparse matrix-vector multiplication for accelerating the block Wiedemann algorithm over GF(2) on multi-graphics processing unit systems

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.2896 ◽

2012 ◽

Vol 25 (4) ◽

pp. 586-603 ◽

Cited By ~ 4

Author(s):

Bertil Schmidt ◽

Hans Aribowo ◽

Hoang-Vu Dang

Keyword(s):

Graphics Processing Unit ◽

Sparse Matrix ◽

Processing Unit ◽

Matrix Vector Multiplication ◽

Graphics Processing ◽

Matrix Vector

A novel multi-graphics processing unit parallel optimization framework for the sparse matrix-vector multiplication

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.3936 ◽

2016 ◽

Vol 29 (5) ◽

pp. e3936 ◽

Cited By ~ 10

Author(s):

Jiaquan Gao ◽

Yu Wang ◽

Jun Wang

Keyword(s):

Graphics Processing Unit ◽

Sparse Matrix ◽

Parallel Optimization ◽

Processing Unit ◽

Optimization Framework ◽

Matrix Vector Multiplication ◽

Graphics Processing ◽

Matrix Vector
