Optimizing non-coalesced memory access for irregular applications with GPU computing

2020 ◽  
Vol 21 (9) ◽  
pp. 1285-1301 ◽  
Author(s):  
Ran Zheng ◽  
Yuan-dong Liu ◽  
Hai Jin
2019 ◽  
Vol 9 (5) ◽  
pp. 947 ◽  
Author(s):  
Thaha Muhammed ◽  
Rashid Mehmood ◽  
Aiiad Albeshri ◽  
Iyad Katib

Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (Arabic for "speed"), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments and adaptively schedule the segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix by the number of nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads used to execute each segment is chosen adaptively. Dynamic Parallelism, available in Nvidia GPUs, is used to execute the group containing the segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. SURAA thus minimizes the adverse effects of npr variance by distributing the load uniformly across equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools, delivering a 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and that it opens new avenues for further improving SpMV performance.
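
The segmentation step described in this abstract can be illustrated with a small host-side sketch. This is not the authors' implementation: it assumes a CSR row-pointer array as input, reads "segments" as bins of sorted rows whose width in npr follows the Freedman–Diaconis rule (h = 2 · IQR / n^(1/3)), and uses made-up thresholds (`small`, `large`) to form the three groups, since the abstract does not state the actual cut-offs. All function names are hypothetical.

```python
from statistics import mean, quantiles


def nonzeros_per_row(row_ptr):
    """npr of each row, derived from a CSR row-pointer array."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


def freedman_diaconis_width(values):
    """Freedman-Diaconis bin width: h = 2 * IQR / n^(1/3)."""
    if len(values) < 2:
        return 1.0
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    width = 2.0 * (q3 - q1) / (len(values) ** (1.0 / 3.0))
    # Fall back to a unit width when the IQR is zero (all rows alike).
    return width if width > 0 else 1.0


def build_segments(row_ptr):
    """Sort rows by npr, then bin them so each segment holds rows of similar npr."""
    npr = nonzeros_per_row(row_ptr)
    order = sorted(range(len(npr)), key=lambda r: npr[r])
    width = freedman_diaconis_width(npr)
    lowest = npr[order[0]]
    bins = {}
    for row in order:
        bins.setdefault(int((npr[row] - lowest) // width), []).append(row)
    return [bins[b] for b in sorted(bins)]


def group_segments(segments, npr, small=4, large=32):
    """Assign each segment to one of three groups by its mean npr.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    groups = {"small": [], "medium": [], "large": []}
    for seg in segments:
        m = mean(npr[r] for r in seg)
        key = "small" if m <= small else "large" if m > large else "medium"
        groups[key].append(seg)
    return groups


if __name__ == "__main__":
    row_ptr = [0, 1, 2, 4, 6, 9, 12, 52, 102]  # 8 rows, two of them very dense
    segments = build_segments(row_ptr)
    print(group_segments(segments, nonzeros_per_row(row_ptr)))
```

In a GPU setting, each group would then be mapped to a kernel type suited to its row lengths, with the "large" group being the candidate for the dynamic-parallelism path mentioned in the abstract; that mapping is outside the scope of this sketch.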


1997 ◽  
Author(s):  
David O'Hallaron ◽  
Jonathan R. Shewchuk ◽  
Thomas Gross

2013 ◽  
Vol 41 (3) ◽  
pp. 380-391 ◽  
Author(s):  
Young Hoon Son ◽  
O. Seongil ◽  
Yuhwan Ro ◽  
Jae W. Lee ◽  
Jung Ho Ahn

Author(s):  
Soumya Ranjan Nayak ◽  
S Sivakumar ◽  
Akash Kumar Bhoi ◽  
Gyoo-Soo Chae ◽  
Pradeep Kumar Mallick

Graphics processing units (GPUs) have gained popularity among researchers in the field of decision making and knowledge discovery systems. However, most earlier studies are limited in GPU memory utilization, computational time, and accuracy. The main contribution of this paper is a novel algorithm called the Mixed Mode Database Miner (MMDBM) classifier, which applies multithreading to a large number of attributes. The proposed method uses the quick sort algorithm in GPU parallel computing to overcome the limitations of the state of the art. It applies a dynamic rule generation approach to construct the decision tree from the predicted rules. Moreover, the implementation results are compared for both SLIQ and MMDBM, using Java and GPU, together with the computed acceleration ratio on the BP dataset. The primary objective of this work is to improve performance while reducing processing time. The results are also analyzed using various thread counts in GPU mining on eight datasets from the UCI Machine Learning Repository. The proposed MMDBM algorithm has been validated on these eight datasets with an accuracy of 91.3% on diabetes, 89.1% on breast cancer, 96.6% on iris, 89.9% on labor, 95.4% on vote, 89.5% on credit card, 78.7% on supermarket, and 78.7% on BP, while also taking less computational time on the given datasets. The outcome of this work will help the research community develop more effective multithreaded GPU solutions in GPU mining that handle large datasets in minimal processing time. Therefore, this can be considered a more reliable and precise method for GPU computing.


Author(s):  
Aleix Roca Nonell ◽  
Balazs Gerofi ◽  
Leonardo Bautista-Gomez ◽  
Dominique Martinet ◽  
Vicenç Beltran Querol ◽  
...  
