scholarly journals CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube

Electronics ◽  
2018 ◽  
Vol 7 (11) ◽  
pp. 307 ◽  
Author(s):  
Cheng Qian ◽  
Bruce Childers ◽  
Libo Huang ◽  
Hui Guo ◽  
Zhiying Wang

Graph traversal is widely used in map routing, social network analysis, causal discovery and many more applications. Because it is a memory-bound process, graph traversal puts significant pressure on the memory subsystem. Due to poor spatial locality and the increasing size of today’s datasets, graph traversal consumes an ever-larger part of application execution time. One way to mitigate this cost is memory prefetching, which issues requests from the processor to the memory in anticipation of needing certain data. However, traditional prefetching does not work well for graph traversal due to data dependencies, the parallel nature of graphs and the need to move vast amounts of data from memory to the caches. In this paper, we propose a compressed sparse row representation-based graph accelerator on the Hybrid Memory Cube (HMC), called CGAcc. CGAcc combines Compressed Sparse Row (CSR) graph representation with in-memory prefetching and processing to improve the performance of graph traversal. Our approach integrates the prefetching and processing in the logic layer of a 3D stacked Dynamic Random-Access Memory (DRAM) architecture, based on Micron’s HMC. We selected HMC to implement CGAcc because it can provide quite high bandwidth and low access latency. Furthermore, this device has multiple DRAM layers connected to internal logic to control memory access and perform rudimentary computation. Using the CSR representation, CGAcc deploys prefetchers in the HMC to exploit the short transaction latency between the logic and DRAM layers. By doing this, it can also avoid large data movement costs. In the runtime, CGAcc pipelines the prefetching to fetch data from DRAM arrays to improve memory-level parallelism. To further reduce the access latency, several optimized internal caches are also introduced to hold the prefetched data to be Processed In-Memory (PIM). A comprehensive evaluation shows the effectiveness of CGAcc. Experimental results showed that, compared to a conventional HMC main memory equipped with a stream prefetcher, CGAcc achieved an average 3.51× speedup with moderate hardware cost.

2020 ◽  
Vol 10 (3) ◽  
pp. 999
Author(s):  
Hyokyung Bahn ◽  
Kyungwoon Cho

Recently, non-volatile memory (NVM) has advanced as a fast storage medium, and legacy memory subsystems optimized for DRAM (dynamic random access memory) and HDD (hard disk drive) hierarchies need to be revisited. In this article, we explore the memory subsystems that use NVM as an underlying storage device and discuss the challenges and implications of such systems. As storage performance becomes close to DRAM performance, existing memory configurations and I/O (input/output) mechanisms should be reassessed. This article explores the performance of systems with NVM based storage emulated by the RAMDisk under various configurations. Through our measurement study, we make the following findings. (1) We can decrease the main memory size without performance penalties when NVM storage is adopted instead of HDD. (2) For buffer caching to be effective, judicious management techniques like admission control are necessary. (3) Prefetching is not effective in NVM storage. (4) The effect of synchronous I/O and direct I/O in NVM storage is less significant than that in HDD storage. (5) Performance degradation due to the contention of multi-threads is less severe in NVM based storage than in HDD. Based on these observations, we discuss a new PC configuration consisting of small memory and fast storage in comparison with a traditional PC consisting of large memory and slow storage. We show that this new memory-storage configuration can be an alternative solution for ever-growing memory demands and the limited density of DRAM memory. We anticipate that our results will provide directions in system software development in the presence of ever-faster storage devices.


Author(s):  
PÅL HALVORSEN ◽  
TOM ANDERS DALSENG ◽  
CARSTEN GRIWODZ

Distributed multimedia streaming systems are increasingly popular due to technological advances, and numerous streaming services are available today. On servers or proxy caches, there is a huge scaling challenge in supporting thousands of concurrent users that request delivery of high-rate, time-dependent data like audio and video, because this requires transfers of large amounts of data through several sub-systems within a streaming node. Unnecessary copy operations in the data path can therefore contribute significantly to the resource consumption of streaming operations. Despite previous research, off-the-shelf operating systems have only limited support for data paths that have been optimized for streaming. Additionally, system call overhead has grown with newer operating systems editions, adding to the cost of data movement. Frequently, it is argued that these issues can be ignored because of the continuing growth of CPU speeds. However, such an argument fails to take problems of modern streaming systems into account. The dissipation of heat generated by disks and high-end CPUs is a major problem of data centers, which would be alleviated if less power-hungry CPUs could be used. The power budget of mobile devices, which are increasingly used for streaming as well, is tight, and reduced power consumption an important issue. In this paper, we prove that these operations consume a large amount of resources, and we therefore revisit the data movement problem and provide a comprehensive evaluation of possible streaming data I/O paths in the Linux 2.6 kernel. We have implemented and evaluated several enhanced mechanisms and show how to provide support for more efficient memory usage and reduction of user/kernel space switches for content download and streaming applications. In particular, we are able to reduce the CPU usage by approximately 27% compared to the best approach without kernel modifications, by removing copy operations and system calls for a streaming scenario in which RTP headers must be added to stored data for sequence numbers and timing.


2022 ◽  
Vol 21 (1) ◽  
pp. 1-22
Author(s):  
Dongsuk Shin ◽  
Hakbeom Jang ◽  
Kiseok Oh ◽  
Jae W. Lee

A long battery life is a first-class design objective for mobile devices, and main memory accounts for a major portion of total energy consumption. Moreover, the energy consumption from memory is expected to increase further with ever-growing demands for bandwidth and capacity. A hybrid memory system with both DRAM and PCM can be an attractive solution to provide additional capacity and reduce standby energy. Although providing much greater density than DRAM, PCM has longer access latency and limited write endurance to make it challenging to architect it for main memory. To address this challenge, this article introduces CAMP, a novel DRAM c ache a rchitecture for m obile platforms with P CM-based main memory. A DRAM cache in this environment is required to filter most of the writes to PCM to increase its lifetime, and deliver highest efficiency even for a relatively small-sized DRAM cache that mobile platforms can afford. To address this CAMP divides DRAM space into two regions: a page cache for exploiting spatial locality in a bandwidth-efficient manner and a dirty block buffer for maximally filtering writes. CAMP improves the performance and energy-delay-product by 29.2% and 45.2%, respectively, over the baseline PCM-oblivious DRAM cache, while increasing PCM lifetime by 2.7×. And CAMP also improves the performance and energy-delay-product by 29.3% and 41.5%, respectively, over the state-of-the-art design with dirty block buffer, while increasing PCM lifetime by 2.5×.


Author(s):  
Saranya N. ◽  
Saravana Selvam

After an era of managing data collection difficulties, these days the issue has turned into the problem of how to process these vast amounts of information. Scientists, as well as researchers, think that today, probably the most essential topic in computing science is Big Data. Big Data is used to clarify the huge volume of data that could exist in any structure. This makes it difficult for standard controlling approaches for mining the best possible data through such large data sets. Classification in Big Data is a procedure of summing up data sets dependent on various examples. There are distinctive classification frameworks which help us to classify data collections. A few methods that discussed in the chapter are Multi-Layer Perception Linear Regression, C4.5, CART, J48, SVM, ID3, Random Forest, and KNN. The target of this chapter is to provide a comprehensive evaluation of classification methods that are in effect commonly utilized.


Author(s):  
Fenxiao Chen ◽  
Yun-Cheng Wang ◽  
Bin Wang ◽  
C.-C. Jay Kuo

Abstract Research on graph representation learning has received great attention in recent years since most data in real-world applications come in the form of graphs. High-dimensional graph data are often in irregular forms. They are more difficult to analyze than image/video/audio data defined on regular lattices. Various graph embedding techniques have been developed to convert the raw graph data into a low-dimensional vector representation while preserving the intrinsic graph properties. In this review, we first explain the graph embedding task and its challenges. Next, we review a wide range of graph embedding techniques with insights. Then, we evaluate several stat-of-the-art methods against small and large data sets and compare their performance. Finally, potential applications and future directions are presented.


Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2158
Author(s):  
Jeong-Geun Kim ◽  
Shin-Dug Kim ◽  
Su-Kyung Yoon

This research is to design a Q-selector-based prefetching method for a dynamic random-access memory (DRAM)/ Phase-change memory (PCM)hybrid main memory system for memory-intensive big data applications generating irregular memory accessing streams. Specifically, the proposed method fully exploits the advantages of two-level hybrid memory systems, constructed as DRAM devices and non-volatile memory (NVM) devices. The Q-selector-based prefetching method is based on the Q-learning method, one of the reinforcement learning algorithms, which determines a near-optimal prefetcher for an application’s current running phase. For this, our model analyzes real-time performance status to set the criteria for the Q-learning method. We evaluate the Q-selector-based prefetching method with workloads from data mining and data-intensive benchmark applications, PARSEC-3.0 and graphBIG. Our evaluation results show that the system achieves approximately 31% performance improvement and increases the hit ratio of the DRAM-cache layer by 46% on average compared to a PCM-only main memory system. In addition, it achieves better performance results compared to the state-of-the-art prefetcher, access map pattern matching (AMPM) prefetcher, by 14.3% reduction of execution time and 12.89% of better CPI enhancement.


Micromachines ◽  
2021 ◽  
Vol 13 (1) ◽  
pp. 52
Author(s):  
Wenze Zhao ◽  
Yajuan Du ◽  
Mingzhe Zhang ◽  
Mingyang Liu ◽  
Kailun Jin ◽  
...  

With the advantage of faster data access than traditional disks, in-memory database systems, such as Redis and Memcached, have been widely applied in data centers and embedded systems. The performance of in-memory database greatly depends on the access speed of memory. With the requirement of high bandwidth and low energy, die-stacked memory (e.g., High Bandwidth Memory (HBM)) has been developed to extend the channel number and width. However, the capacity of die-stacked memory is limited due to the interposer challenge. Thus, hybrid memory system with traditional Dynamic Random Access Memory (DRAM) and die-stacked memory emerges. Existing works have proposed to place and manage data on hybrid memory architecture in the view of hardware. This paper considers to manage in-memory database data in hybrid memory in the view of application. We first perform a preliminary study on the hotness distribution of client requests on Redis. From the results, we observe that most requests happen on a small portion of data objects in in-memory database. Then, we propose the Application-oriented Data Migration called ADM to accelerate in-memory database on hybrid memory. We design a hotness management method and two migration policies to migrate data into or out of HBM. We take Redis under comprehensive benchmarks as a case study for the proposed method. Through the experimental results, it is verified that our proposed method can effectively gain performance improvement and reduce energy consumption compared with existing Redis database.


Electronics ◽  
2019 ◽  
Vol 8 (10) ◽  
pp. 1096
Author(s):  
Chae Eun Rhee ◽  
Seung-Won Park ◽  
Jungwoo Choi ◽  
Hyunmin Jung ◽  
Hyuk-Jae Lee

Recently, dramatic improvements in memory performance have been highly required for data demanding application services such as deep learning, big data, and immersive videos. To this end, the throughput-oriented memory such as high bandwidth memory (HBM) and hybrid memory cube (HMC) has been introduced to provide a high bandwidth. For its effective use, various research efforts have been conducted. Among them, the near-memory-processing (NMP) is a concept that utilizes bandwidth and power consumption by placing computation logic near the memory. In the NMP-enabled system, a processor hierarchy consisting of hosts and NMPs is formed based on the distance from the main memory. In this paper, an evaluation tool is proposed to obtain the optimal design decision considering the power-time trade-off in the processor hierarchy. Every time the operating condition and constraints change, the decision of task-level offloading is dynamically made. For the realistic NMP-enabled system environment, the relationship among HBM, host, and NMP should be carefully considered. Hosts and NMPs are almost hidden from each other and the communications between them are extremely limited. In the simulation results, popular benchmarks and a machine learning application are used to demonstrate power-time trade-offs depending on applications and system conditions.


Electronics ◽  
2020 ◽  
Vol 9 (6) ◽  
pp. 1013
Author(s):  
Hao Sun ◽  
Lan Chen ◽  
Xiaoran Hao ◽  
Chenji Liu ◽  
Mao Ni

Conventional main memory can no longer meet the requirements of low energy consumption and massive data storage in an artificial intelligence Internet of Things (AIoT) system. Moreover, the efficiency is decreased due to the swapping of data between the main memory and storage. This paper presents a hybrid storage class memory system to reduce the energy consumption and optimize IO performance. Phase change memory (PCM) brings the advantages of low static power and a large capacity to a hybrid memory system. In order to avoid the impact of poor write performance in PCM, a migration scheme implemented in the memory controller is proposed. By counting the write times and row buffer miss times in PCM simultaneously, the write-intensive data can be selected and migrated from PCM to dynamic random-access memory (DRAM) efficiently, which improves the performance of hybrid storage class memory. In addition, a fast mode with a tmpfs-based, in-memory file system is applied to hybrid storage class memory to reduce the number of data movements between memory and external storage. Experimental results show that the proposed system can reduce energy consumption by 46.2% on average compared with the traditional DRAM-only system. The fast mode increases the IO performance of the system by more than 30 times compared with the common ext3 file system.


Sign in / Sign up

Export Citation Format

Share Document