Optimizing non-coalesced memory access for irregular applications with GPU computing

Ran Zheng; Yuan-dong Liu; Hai Jin

doi:10.1631/fitee.1900262

A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access

IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences ◽

10.1587/transfun.e100.a.1188 ◽

2017 ◽

Vol E100.A (5) ◽

pp. 1188-1196

Author(s):

Heungseop AHN ◽

Seungwon CHOI

Keyword(s):

Memory Access ◽

Turbo Decoder ◽

Coalesced Memory


Cascaded DMA Controller for Speedup of Indirect Memory Access in Irregular Applications

2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms (IA3) ◽

10.1109/ia349570.2019.00017 ◽

2019 ◽

Author(s):

Tomoya Kashimata ◽

Toshiaki Kitamura ◽

Keiji Kimura ◽

Hironori Kasahara

Keyword(s):

Memory Access ◽

Irregular Applications


SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs

Applied Sciences ◽

10.3390/app9050947 ◽

2019 ◽

Vol 9 (5) ◽

pp. 947 ◽

Cited By ~ 9

Author(s):

Thaha Muhammed ◽

Rashid Mehmood ◽

Aiiad Albeshri ◽

Iyad Katib

Keyword(s):

Load Balancing ◽

Graphics Processing Units ◽

Sparse Matrix ◽

Memory Access ◽

Group Matrix ◽

The Matrix ◽

Novel Method ◽

Coalesced Memory ◽

Graphics Processing ◽

Matrix Vector

Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to "speed" in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments and adaptively schedule the segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix by the number of nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads to execute each segment is adaptively chosen. Dynamic Parallelism, available in NVIDIA GPUs, is utilized to execute the group containing segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the npr variance by uniformly distributing the load using equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools by delivering 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
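The row-sorting and segmentation step described in this abstract can be illustrated with a small host-side sketch. This is not the SURAA implementation. It assumes the matrix is stored in CSR form, and it assumes that the Freedman–Diaconis bin width (2 · IQR / n^(1/3)) is applied directly to the nonzeros-per-row values to cut the sorted rows into segments, which the abstract does not spell out. The function names are hypothetical, and the later steps (grouping segments by mean nonzeros per row and launching kernels on streams) are not covered.

```python
from statistics import quantiles


def nonzeros_per_row(row_ptr):
    """Nonzeros per row of a CSR matrix, read off its row-pointer array."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


def freedman_diaconis_width(values):
    """Freedman-Diaconis bin width: 2 * IQR / n^(1/3)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return 2.0 * (q3 - q1) / (len(values) ** (1.0 / 3.0))


def segment_rows(row_ptr):
    """Sort rows by nonzeros per row, then bin them into segments.

    Assumption (not taken from the paper): a row lands in the bin
    floor((npr - min_npr) / width), and empty bins are dropped.
    """
    npr = nonzeros_per_row(row_ptr)
    order = sorted(range(len(npr)), key=lambda r: npr[r])  # stable sort
    width = freedman_diaconis_width(npr)
    if width <= 0:
        # All rows look alike to the rule: keep a single segment.
        return [order]
    lowest = npr[order[0]]
    segments = {}
    for r in order:
        segments.setdefault(int((npr[r] - lowest) // width), []).append(r)
    return [segments[k] for k in sorted(segments)]


# Toy CSR row pointer for 8 rows with 3, 1, 20, 2, 1, 8, 2, 3 nonzeros.
row_ptr = [0, 3, 4, 24, 26, 27, 35, 37, 40]
for seg in segment_rows(row_ptr):
    print(seg)
```

With this toy input, the short rows fall into one segment while the two long rows each get their own segment. This is the kind of separation that would let a scheduler assign more threads to long rows.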


Analyzing and Improving Memory Access Patterns of Large Irregular Applications on NUMA Machines

2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP) ◽

10.1109/pdp.2016.37 ◽

2016 ◽

Cited By ~ 2

Author(s):

Artur Mariano ◽

Matthias Diener ◽

Christian Bischof ◽

Philippe O. A. Navaux

Keyword(s):

Memory Access ◽

Irregular Applications ◽

Access Patterns


Architectural Implications of a Family of Irregular Applications

10.21236/ada339206 ◽

1997 ◽

Cited By ~ 1

Author(s):

David O'Hallaron ◽

Jonathan R. Shewchuk ◽

Thomas Gross

Keyword(s):

Irregular Applications


Fine-Grained Management of Thread Blocks for Irregular Applications

2019 IEEE 37th International Conference on Computer Design (ICCD) ◽

10.1109/iccd46524.2019.00042 ◽

2019 ◽

Author(s):

Jonathan Beaumont ◽

Trevor Mudge

Keyword(s):

Fine Grained ◽

Irregular Applications


Reducing memory access latency with asymmetric DRAM bank organizations

ACM SIGARCH Computer Architecture News ◽

10.1145/2508148.2485955 ◽

2013 ◽

Vol 41 (3) ◽

pp. 380-391 ◽

Cited By ~ 7

Author(s):

Young Hoon Son ◽

O. Seongil ◽

Yuhwan Ro ◽

Jae W. Lee ◽

Jung Ho Ahn

Keyword(s):

Memory Access ◽

Access Latency


ROMANet: Fine-Grained Reuse-Driven Off-Chip Memory Access Management and Data Organization for Deep Neural Network Accelerators

IEEE Transactions on Very Large Scale Integration (VLSI) Systems ◽

10.1109/tvlsi.2021.3060509 ◽

2021 ◽

pp. 1-14

Author(s):

Rachmad Vidya Wicaksana Putra ◽

Muhammad Abdullah Hanif ◽

Muhammad Shafique

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Memory Access ◽

Data Organization ◽

Access Management ◽

Fine Grained


Mixed-mode database miner classifier: Parallel computation of graphical processing unit mining

International Journal of Electrical Engineering Education ◽

10.1177/0020720920988494 ◽

2021 ◽

pp. 002072092098849

Author(s):

Soumya Ranjan Nayak ◽

S Sivakumar ◽

Akash Kumar Bhoi ◽

Gyoo-Soo Chae ◽

Pradeep Kumar Mallick

Keyword(s):

Credit Card ◽

Mixed Mode ◽

Processing Time ◽

Gpu Computing ◽

Graphical Processing Unit ◽

Computational Time ◽

Processing Unit ◽

Large Set ◽

Minimal Processing ◽

Graphical Processing

The graphical processing unit (GPU) has gained popularity among researchers in the field of decision making and knowledge discovery systems. However, most earlier studies have limitations in GPU memory utilization, computational time, and accuracy. The main contribution of this paper is a novel algorithm called the Mixed Mode Database Miner (MMDBM) classifier, which implements multithreading concepts on a large number of attributes. The proposed method uses the quick sort algorithm in GPU parallel computing to overcome the limitations of the state of the art. This method applies the dynamic rule generation approach for constructing the decision tree based on the predicted rules. Moreover, the implementation results of both SLIQ and MMDBM are compared using Java and GPU, with the acceleration ratio computed on the BP dataset. The primary objective of this work is to improve performance with less processing time. The results are also analyzed using various numbers of threads in GPU mining on eight different datasets from the UCI Machine Learning Repository. The proposed MMDBM algorithm has been validated on these eight datasets, with accuracy of 91.3% in diabetes, 89.1% in breast cancer, 96.6% in iris, 89.9% in labor, 95.4% in vote, 89.5% in credit card, 78.7% in supermarket, and 78.7% in BP; it also takes less computational time for the given datasets. The outcome of this work will help the research community develop more effective multithreaded GPU solutions in GPU mining that handle large sets of data in minimal processing time. Therefore, this can be considered a more reliable and precise method for GPU computing.


On the Applicability of PEBS based Online Memory Access Tracking for Heterogeneous Memory Management at Scale

Proceedings of the Workshop on Memory Centric High Performance Computing ◽

10.1145/3286475.3286477 ◽

2018 ◽

Author(s):

Aleix Roca Nonell ◽

Balazs Gerofi ◽

Leonardo Bautista-Gomez ◽

Dominique Martinet ◽

Vicenç Beltran Querol ◽

...

Keyword(s):

Memory Management ◽

Memory Access
