CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube

Cheng Qian; Bruce Childers; Libo Huang; Hui Guo; Zhiying Wang

doi:10.3390/electronics7110307

CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube

Electronics ◽

10.3390/electronics7110307 ◽

2018 ◽

Vol 7 (11) ◽

pp. 307 ◽

Cited By ~ 1

Author(s):

Cheng Qian ◽

Bruce Childers ◽

Libo Huang ◽

Hui Guo ◽

Zhiying Wang

Keyword(s):

Comprehensive Evaluation ◽

Random Access ◽

Large Data ◽

Main Memory ◽

Graph Representation ◽

Data Movement ◽

Access Latency ◽

Hybrid Memory ◽

Graph Traversal ◽

Compressed Sparse Row

Graph traversal is widely used in map routing, social network analysis, causal discovery and many more applications. Because it is a memory-bound process, graph traversal puts significant pressure on the memory subsystem. Due to poor spatial locality and the increasing size of today’s datasets, graph traversal consumes an ever-larger part of application execution time. One way to mitigate this cost is memory prefetching, which issues requests from the processor to the memory in anticipation of needing certain data. However, traditional prefetching does not work well for graph traversal due to data dependencies, the parallel nature of graphs and the need to move vast amounts of data from memory to the caches. In this paper, we propose a compressed sparse row representation-based graph accelerator on the Hybrid Memory Cube (HMC), called CGAcc. CGAcc combines Compressed Sparse Row (CSR) graph representation with in-memory prefetching and processing to improve the performance of graph traversal. Our approach integrates the prefetching and processing in the logic layer of a 3D stacked Dynamic Random-Access Memory (DRAM) architecture, based on Micron’s HMC. We selected HMC to implement CGAcc because it provides high bandwidth and low access latency. Furthermore, this device has multiple DRAM layers connected to internal logic to control memory access and perform rudimentary computation. Using the CSR representation, CGAcc deploys prefetchers in the HMC to exploit the short transaction latency between the logic and DRAM layers. By doing this, it can also avoid large data movement costs. At runtime, CGAcc pipelines the prefetching to fetch data from DRAM arrays to improve memory-level parallelism. To further reduce the access latency, several optimized internal caches are also introduced to hold the prefetched data to be Processed In-Memory (PIM). A comprehensive evaluation shows the effectiveness of CGAcc. Experimental results showed that, compared to a conventional HMC main memory equipped with a stream prefetcher, CGAcc achieved an average 3.51× speedup with moderate hardware cost.
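
To make the data dependencies mentioned in the abstract concrete, the following is a minimal software sketch of breadth-first search over a CSR graph. It is not the CGAcc hardware design; the function names `build_csr` and `bfs_csr` and the array names `offsets` and `neighbors` are illustrative assumptions, not taken from the paper.

```python
from collections import deque


def build_csr(num_vertices, edges):
    """Build CSR arrays (offsets, neighbors) from directed (src, dst) edges."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:
        offsets[src + 1] += 1          # count out-degree of each vertex
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]   # prefix sum turns counts into offsets
    neighbors = [0] * len(edges)
    cursor = offsets[:-1].copy()       # next free slot for each vertex
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors


def bfs_csr(offsets, neighbors, source):
    """Return the BFS depth of every vertex (-1 if unreachable)."""
    depth = [-1] * (len(offsets) - 1)
    depth[source] = 0
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        # Chain of dependent accesses: offsets[v] -> neighbors[i] -> depth[u].
        # Each step needs the result of the previous one.
        for i in range(offsets[v], offsets[v + 1]):
            u = neighbors[i]
            if depth[u] == -1:
                depth[u] = depth[v] + 1
                frontier.append(u)
    return depth


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
offsets, neighbors = build_csr(6, edges)
depths = bfs_csr(offsets, neighbors, 0)
```

The inner loop shows why a conventional stream prefetcher helps little here: the address of each `depth[u]` access is only known after `neighbors[i]` has been loaded, which in turn depends on `offsets[v]`.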


Implications of NVM Based Storage on Memory Subsystem Management

Applied Sciences ◽

10.3390/app10030999 ◽

2020 ◽

Vol 10 (3) ◽

pp. 999

Author(s):

Hyokyung Bahn ◽

Kyungwoon Cho

Keyword(s):

Random Access ◽

Disk Drive ◽

Main Memory ◽

Memory Storage ◽

Storage Device ◽

Storage Devices ◽

Large Memory ◽

Memory Subsystems ◽

Non Volatile Memory ◽

Management Techniques

Recently, non-volatile memory (NVM) has advanced as a fast storage medium, and legacy memory subsystems optimized for DRAM (dynamic random access memory) and HDD (hard disk drive) hierarchies need to be revisited. In this article, we explore the memory subsystems that use NVM as an underlying storage device and discuss the challenges and implications of such systems. As storage performance becomes close to DRAM performance, existing memory configurations and I/O (input/output) mechanisms should be reassessed. This article explores the performance of systems with NVM based storage emulated by the RAMDisk under various configurations. Through our measurement study, we make the following findings. (1) We can decrease the main memory size without performance penalties when NVM storage is adopted instead of HDD. (2) For buffer caching to be effective, judicious management techniques like admission control are necessary. (3) Prefetching is not effective in NVM storage. (4) The effect of synchronous I/O and direct I/O in NVM storage is less significant than that in HDD storage. (5) Performance degradation due to the contention of multi-threads is less severe in NVM based storage than in HDD. Based on these observations, we discuss a new PC configuration consisting of small memory and fast storage in comparison with a traditional PC consisting of large memory and slow storage. We show that this new memory-storage configuration can be an alternative solution for ever-growing memory demands and the limited density of DRAM memory. We anticipate that our results will provide directions in system software development in the presence of ever-faster storage devices.


ASSESSMENT OF LINUX' DATA PATH IMPLEMENTATIONS FOR DOWNLOAD AND STREAMING

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194007003343 ◽

2007 ◽

Vol 17 (04) ◽

pp. 465-481 ◽

Cited By ~ 1

Author(s):

PÅL HALVORSEN ◽

TOM ANDERS DALSENG ◽

CARSTEN GRIWODZ

Keyword(s):

Operating Systems ◽

Comprehensive Evaluation ◽

Streaming Data ◽

High Rate ◽

Data Path ◽

Power Budget ◽

Data Movement ◽

Technological Advances ◽

Reduced Power Consumption ◽

Streaming Systems

Distributed multimedia streaming systems are increasingly popular due to technological advances, and numerous streaming services are available today. On servers or proxy caches, there is a huge scaling challenge in supporting thousands of concurrent users that request delivery of high-rate, time-dependent data like audio and video, because this requires transfers of large amounts of data through several sub-systems within a streaming node. Unnecessary copy operations in the data path can therefore contribute significantly to the resource consumption of streaming operations. Despite previous research, off-the-shelf operating systems have only limited support for data paths that have been optimized for streaming. Additionally, system call overhead has grown with newer operating system editions, adding to the cost of data movement. Frequently, it is argued that these issues can be ignored because of the continuing growth of CPU speeds. However, such an argument fails to take problems of modern streaming systems into account. The dissipation of heat generated by disks and high-end CPUs is a major problem of data centers, which would be alleviated if less power-hungry CPUs could be used. The power budget of mobile devices, which are increasingly used for streaming as well, is tight, and reduced power consumption is an important issue. In this paper, we prove that these operations consume a large amount of resources, and we therefore revisit the data movement problem and provide a comprehensive evaluation of possible streaming data I/O paths in the Linux 2.6 kernel. We have implemented and evaluated several enhanced mechanisms and show how to provide support for more efficient memory usage and reduction of user/kernel space switches for content download and streaming applications. In particular, we are able to reduce the CPU usage by approximately 27% compared to the best approach without kernel modifications, by removing copy operations and system calls for a streaming scenario in which RTP headers must be added to stored data for sequence numbers and timing.


An Energy-Efficient DRAM Cache Architecture for Mobile Platforms With PCM-Based Main Memory

ACM Transactions on Embedded Computing Systems ◽

10.1145/3451995 ◽

2022 ◽

Vol 21 (1) ◽

pp. 1-22

Author(s):

Dongsuk Shin ◽

Hakbeom Jang ◽

Kiseok Oh ◽

Jae W. Lee

Keyword(s):

Energy Consumption ◽

Main Memory ◽

Battery Life ◽

Mobile Platforms ◽

Total Energy Consumption ◽

Efficient Manner ◽

Hybrid Memory ◽

Spatial Locality ◽

Cache Architecture ◽

Energy Delay Product

A long battery life is a first-class design objective for mobile devices, and main memory accounts for a major portion of total energy consumption. Moreover, the energy consumption from memory is expected to increase further with ever-growing demands for bandwidth and capacity. A hybrid memory system with both DRAM and PCM can be an attractive solution to provide additional capacity and reduce standby energy. Although it provides much greater density than DRAM, PCM has longer access latency and limited write endurance, which make it challenging to architect as main memory. To address this challenge, this article introduces CAMP, a novel DRAM cache architecture for mobile platforms with PCM-based main memory. A DRAM cache in this environment is required to filter most of the writes to PCM to increase its lifetime, and to deliver the highest efficiency even for the relatively small DRAM cache that mobile platforms can afford. To address this, CAMP divides DRAM space into two regions: a page cache for exploiting spatial locality in a bandwidth-efficient manner and a dirty block buffer for maximally filtering writes. CAMP improves the performance and energy-delay-product by 29.2% and 45.2%, respectively, over the baseline PCM-oblivious DRAM cache, while increasing PCM lifetime by 2.7×. CAMP also improves the performance and energy-delay-product by 29.3% and 41.5%, respectively, over the state-of-the-art design with dirty block buffer, while increasing PCM lifetime by 2.5×.


A Detailed Study on Classification Algorithms in Big Data

Big Data Analytics for Sustainable Computing - Advances in Data Mining and Database Management ◽

10.4018/978-1-5225-9750-6.ch002 ◽

2020 ◽

pp. 30-46

Author(s):

Saranya N. ◽

Saravana Selvam

Keyword(s):

Big Data ◽

Random Forest ◽

Linear Regression ◽

Comprehensive Evaluation ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Classification Methods ◽

Computing Science ◽

Data Collections

After an era of managing data collection difficulties, these days the issue has turned into the problem of how to process these vast amounts of information. Scientists, as well as researchers, think that today, probably the most essential topic in computing science is Big Data. Big Data is used to describe the huge volume of data that could exist in any structure. This makes it difficult for standard approaches to mine the best possible data from such large data sets. Classification in Big Data is a procedure of summarizing data sets based on various examples. There are distinct classification frameworks which help us to classify data collections. A few methods that are discussed in the chapter are Multi-Layer Perceptron, Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The target of this chapter is to provide a comprehensive evaluation of classification methods that are commonly utilized.


Graph representation learning: a survey

APSIPA Transactions on Signal and Information Processing ◽

10.1017/atsip.2020.13 ◽

2020 ◽

Vol 9 ◽

Author(s):

Fenxiao Chen ◽

Yun-Cheng Wang ◽

Bin Wang ◽

C.-C. Jay Kuo

Keyword(s):

Graph Embedding ◽

Large Data ◽

Representation Learning ◽

Graph Representation ◽

Data Sets ◽

Graph Data ◽

Graph Properties ◽

Wide Range ◽

Regular Lattices ◽

Low Dimensional

Research on graph representation learning has received great attention in recent years since most data in real-world applications come in the form of graphs. High-dimensional graph data are often in irregular forms. They are more difficult to analyze than image/video/audio data defined on regular lattices. Various graph embedding techniques have been developed to convert the raw graph data into a low-dimensional vector representation while preserving the intrinsic graph properties. In this review, we first explain the graph embedding task and its challenges. Next, we review a wide range of graph embedding techniques with insights. Then, we evaluate several state-of-the-art methods against small and large data sets and compare their performance. Finally, potential applications and future directions are presented.


Packed Compressed Sparse Row: A Dynamic Graph Representation

2018 IEEE High Performance extreme Computing Conference (HPEC) ◽

10.1109/hpec.2018.8547566 ◽

2018 ◽

Cited By ~ 2

Author(s):

Brian Wheatman ◽

Helen Xu

Keyword(s):

Graph Representation ◽

Dynamic Graph ◽

Compressed Sparse Row


Q-Selector-Based Prefetching Method for DRAM/NVM Hybrid Main Memory System

Electronics ◽

10.3390/electronics9122158 ◽

2020 ◽

Vol 9 (12) ◽

pp. 2158

Author(s):

Jeong-Geun Kim ◽

Shin-Dug Kim ◽

Su-Kyung Yoon

Keyword(s):

Performance Status ◽

Random Access ◽

Memory Systems ◽

Memory System ◽

Main Memory ◽

Learning Method ◽

Q Learning ◽

Data Intensive ◽

Big Data Applications ◽

Non Volatile Memory

This research designs a Q-selector-based prefetching method for a dynamic random-access memory (DRAM)/phase-change memory (PCM) hybrid main memory system, targeting memory-intensive big data applications that generate irregular memory access streams. Specifically, the proposed method fully exploits the advantages of two-level hybrid memory systems, constructed as DRAM devices and non-volatile memory (NVM) devices. The Q-selector-based prefetching method is based on the Q-learning method, one of the reinforcement learning algorithms, which determines a near-optimal prefetcher for an application’s current running phase. For this, our model analyzes real-time performance status to set the criteria for the Q-learning method. We evaluate the Q-selector-based prefetching method with workloads from data mining and data-intensive benchmark applications, PARSEC-3.0 and graphBIG. Our evaluation results show that the system achieves approximately 31% performance improvement and increases the hit ratio of the DRAM-cache layer by 46% on average compared to a PCM-only main memory system. In addition, it outperforms the state-of-the-art access map pattern matching (AMPM) prefetcher, with a 14.3% reduction in execution time and a 12.89% improvement in CPI.
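
The idea of using Q-learning to pick a prefetcher per running phase can be sketched with a small tabular learner. This is only an illustration under stated assumptions: the class name `QSelector`, the prefetcher labels, the single-state encoding, and the reward values are all hypothetical and are not described in the abstract.

```python
import random


class QSelector:
    """Tabular Q-learning over (phase state, prefetcher choice) pairs."""

    def __init__(self, num_states, actions, alpha=0.5, gamma=0.9,
                 epsilon=0.1, seed=0):
        self.actions = list(actions)   # hypothetical prefetcher labels
        self.alpha = alpha             # learning rate
        self.gamma = gamma             # discount factor
        self.epsilon = epsilon         # exploration probability
        self.q = [[0.0] * len(self.actions) for _ in range(num_states)]
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy choice of an action index for the given state."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.actions))
        row = self.q[state]
        return row.index(max(row))

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update toward reward + gamma * max Q(next)."""
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])


# Assumed setup: one phase state, two candidate prefetchers, and a reward
# standing in for an observed performance gain.
selector = QSelector(num_states=1, actions=["stride", "stream"], epsilon=0.0)
selector.update(state=0, action=1, reward=1.0, next_state=0)
best = selector.actions[selector.choose(0)]
```

In a real system the state and reward would come from measured performance status, as the abstract indicates; how the paper encodes them is not specified here.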


Application-Oriented Data Migration to Accelerate In-Memory Database on Hybrid Memory

Micromachines ◽

10.3390/mi13010052 ◽

2021 ◽

Vol 13 (1) ◽

pp. 52

Author(s):

Wenze Zhao ◽

Yajuan Du ◽

Mingzhe Zhang ◽

Mingyang Liu ◽

Kailun Jin ◽

...

Keyword(s):

Random Access ◽

Data Access ◽

Database Systems ◽

Data Migration ◽

Memory Architecture ◽

Hybrid Memory ◽

High Bandwidth ◽

Data Objects ◽

Stacked Memory

With the advantage of faster data access than traditional disks, in-memory database systems, such as Redis and Memcached, have been widely applied in data centers and embedded systems. The performance of an in-memory database greatly depends on the access speed of memory. To meet the requirements of high bandwidth and low energy, die-stacked memory (e.g., High Bandwidth Memory (HBM)) has been developed to extend the channel number and width. However, the capacity of die-stacked memory is limited due to the interposer challenge. Thus, hybrid memory systems combining traditional Dynamic Random Access Memory (DRAM) and die-stacked memory have emerged. Existing works have proposed placing and managing data on hybrid memory architectures from the hardware perspective. This paper considers managing in-memory database data in hybrid memory from the application perspective. We first perform a preliminary study on the hotness distribution of client requests on Redis. From the results, we observe that most requests target a small portion of the data objects in the in-memory database. Then, we propose Application-oriented Data Migration, called ADM, to accelerate in-memory databases on hybrid memory. We design a hotness management method and two migration policies to migrate data into or out of HBM. We take Redis under comprehensive benchmarks as a case study for the proposed method. The experimental results verify that our proposed method can effectively improve performance and reduce energy consumption compared with the existing Redis database.
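
The observation that most requests hit a small portion of objects suggests a simple way to picture hotness-driven migration. The sketch below is an assumption-laden illustration, not the ADM method: the class `HotnessTracker`, its top-k policy, and the `fast_capacity` parameter are hypothetical, and the paper's actual hotness management and two migration policies are not detailed in the abstract.

```python
from collections import Counter


class HotnessTracker:
    """Count accesses per key and keep the hottest keys in a fast tier."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity  # number of objects the fast tier holds
        self.hits = Counter()               # access count per key
        self.fast_tier = set()              # keys currently in the fast tier

    def record_access(self, key):
        self.hits[key] += 1

    def rebalance(self):
        """Return (promoted, demoted) key sets after picking the top-k keys."""
        hottest = {k for k, _ in self.hits.most_common(self.fast_capacity)}
        promoted = hottest - self.fast_tier   # move into the fast tier
        demoted = self.fast_tier - hottest    # move back to the slow tier
        self.fast_tier = hottest
        return promoted, demoted


tracker = HotnessTracker(fast_capacity=2)
for key, count in (("a", 5), ("b", 3), ("c", 1)):
    for _ in range(count):
        tracker.record_access(key)
first = tracker.rebalance()

for _ in range(10):
    tracker.record_access("c")
second = tracker.rebalance()
```

A counting scheme like this never forgets old accesses; a practical policy would likely decay counts over time so that formerly hot objects can cool down.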


Power-Time Exploration Tools for NMP-Enabled Systems

Electronics ◽

10.3390/electronics8101096 ◽

2019 ◽

Vol 8 (10) ◽

pp. 1096

Author(s):

Chae Eun Rhee ◽

Seung-Won Park ◽

Jungwoo Choi ◽

Hyunmin Jung ◽

Hyuk-Jae Lee

Keyword(s):

Memory Performance ◽

Main Memory ◽

Design Decision ◽

Evaluation Tool ◽

Memory Processing ◽

Hybrid Memory ◽

System A ◽

Trade Offs ◽

High Bandwidth ◽

Effective Use

Recently, dramatic improvements in memory performance have been in high demand for data-demanding application services such as deep learning, big data, and immersive videos. To this end, throughput-oriented memories such as high bandwidth memory (HBM) and hybrid memory cube (HMC) have been introduced to provide high bandwidth. Various research efforts have been conducted toward their effective use. Among them, near-memory processing (NMP) is a concept that makes better use of bandwidth and power by placing computation logic near the memory. In the NMP-enabled system, a processor hierarchy consisting of hosts and NMPs is formed based on the distance from the main memory. In this paper, an evaluation tool is proposed to obtain the optimal design decision considering the power-time trade-off in the processor hierarchy. Every time the operating condition and constraints change, the decision of task-level offloading is dynamically made. For a realistic NMP-enabled system environment, the relationship among HBM, host, and NMP should be carefully considered. Hosts and NMPs are almost hidden from each other and the communications between them are extremely limited. In the simulation results, popular benchmarks and a machine learning application are used to demonstrate power-time trade-offs depending on applications and system conditions.


An Energy-Efficient and Fast Scheme for Hybrid Storage Class Memory in an AIoT Terminal System

Electronics ◽

10.3390/electronics9061013 ◽

2020 ◽

Vol 9 (6) ◽

pp. 1013

Author(s):

Hao Sun ◽

Lan Chen ◽

Xiaoran Hao ◽

Chenji Liu ◽

Mao Ni

Keyword(s):

Energy Consumption ◽

Data Storage ◽

File System ◽

Random Access ◽

Memory System ◽

Main Memory ◽

Fast Mode ◽

Hybrid Storage ◽

Storage Class Memory ◽

The Impact

Conventional main memory can no longer meet the requirements of low energy consumption and massive data storage in an artificial intelligence Internet of Things (AIoT) system. Moreover, the efficiency is decreased due to the swapping of data between the main memory and storage. This paper presents a hybrid storage class memory system to reduce the energy consumption and optimize IO performance. Phase change memory (PCM) brings the advantages of low static power and a large capacity to a hybrid memory system. In order to avoid the impact of poor write performance in PCM, a migration scheme implemented in the memory controller is proposed. By counting the write times and row buffer miss times in PCM simultaneously, the write-intensive data can be selected and migrated from PCM to dynamic random-access memory (DRAM) efficiently, which improves the performance of hybrid storage class memory. In addition, a fast mode with a tmpfs-based, in-memory file system is applied to hybrid storage class memory to reduce the number of data movements between memory and external storage. Experimental results show that the proposed system can reduce energy consumption by 46.2% on average compared with the traditional DRAM-only system. The fast mode increases the IO performance of the system by more than 30 times compared with the common ext3 file system.
