Automatic Sublining for Efficient Sparse Memory Accesses

Wim Heirman; Stijn Eyerman; Kristof Du Bois; Ibrahim Hur

doi:10.1145/3452141

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3452141 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1-23

Author(s):

Wim Heirman ◽

Stijn Eyerman ◽

Kristof Du Bois ◽

Ibrahim Hur

Keyword(s):

Dynamic Environment ◽

Large Data ◽

Main Memory ◽

Single Element ◽

Graph Analytics ◽

Available Bandwidth ◽

Processor Architectures ◽

Spatial Locality ◽

Potential Impact ◽

Memory Accesses

Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity make caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (subline access), can solve these issues. Deciding which accesses to handle as sparse accesses and which as regular cached accesses is a challenging task with a large potential impact on performance. Treating sparse accesses as regular accesses reduces performance, but so does not caching accesses that do have locality, because this significantly increases their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal. We propose the Instruction Spatial Locality Estimator (ISLE), a hardware detector that finds instructions that access isolated words in a sea of unused data. These sparse accesses are dynamically converted into uncached subline accesses, while regular accesses remain cached. ISLE does not require modifying source code or binaries, and adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads, and show that ISLE outperforms a baseline without subline accesses, manual sublining, and prior work on detecting sparse accesses.
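The idea of classifying instructions by how much of each fetched line they actually use can be illustrated with a small offline sketch. This is not the ISLE hardware design, which the abstract does not detail; the 64 B line size, the threshold of 2 words, and the function names are assumptions made only for illustration.

```python
from collections import defaultdict

LINE_BYTES = 64  # assumed cache line size (not stated in the abstract)
WORD_BYTES = 8   # element size, matching the 8 B example in the abstract


def words_used_per_line(trace):
    """Return {pc: mean number of distinct words touched per cache line}.

    trace is an iterable of (pc, byte_address) pairs.
    """
    touched = defaultdict(lambda: defaultdict(set))  # pc -> line -> word indices
    for pc, addr in trace:
        line = addr // LINE_BYTES
        word = (addr % LINE_BYTES) // WORD_BYTES
        touched[pc][line].add(word)
    return {
        pc: sum(len(words) for words in lines.values()) / len(lines)
        for pc, lines in touched.items()
    }


def classify(trace, threshold=2.0):
    """Label each instruction 'sparse' or 'regular' (threshold is hypothetical)."""
    return {
        pc: "sparse" if used < threshold else "regular"
        for pc, used in words_used_per_line(trace).items()
    }
```

An instruction that streams through an array touches all eight words of every line and is labelled regular, while one that jumps between lines touches about one word per line and is labelled sparse. A hardware detector would have to make this decision online with bounded state, which this sketch does not attempt.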

Efficient local locking for massively multithreaded in-memory hash-based operators

The VLDB Journal ◽

10.1007/s00778-020-00642-5 ◽

2021 ◽

Author(s):

Bashar Romanous ◽

Skyler Windh ◽

Ildar Absalyamov ◽

Prerna Budhkar ◽

Robert Halstead ◽

...

Keyword(s):

Relational Databases ◽

Aggregation Operators ◽

Main Memory ◽

Paradigm Shifts ◽

Multithreaded Processors ◽

Cache Hierarchies ◽

Processor Architectures ◽

Spatial Locality ◽

Content Addressable Memories ◽

Multi Core Processor

The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
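For readers unfamiliar with the two operators, the following is a minimal software sketch of a hash join and a hash-based group-by aggregation. It shows only the textbook build/probe and accumulate patterns on a single thread; it does not model the FPGA pipelines, multithreading, or CAM-based locking described in the abstract, and the function names and row layout are assumptions.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows):
    """Join two lists of (key, payload) rows on key.

    Build phase: hash every build row by key.
    Probe phase: look up each probe row and emit one result per match.
    """
    table = defaultdict(list)
    for key, payload in build_rows:
        table[key].append(payload)
    result = []
    for key, payload in probe_rows:
        for match in table.get(key, ()):
            result.append((key, match, payload))
    return result


def group_by_sum(rows):
    """Sum the values of (key, value) rows per key using a hash table."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)
```

Each probe or update lands on an effectively random bucket, which is the source of the poor spatial locality the abstract refers to.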

Gretch

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3439803 ◽

2021 ◽

Vol 18 (2) ◽

pp. 1-25

Author(s):

Anirudh Mohan Kaushik ◽

Gennady Pekhimenko ◽

Hiren Patel

Keyword(s):

High Performance ◽

Instruction Scheduling ◽

Graph Representation ◽

Specific Information ◽

Graph Analytics ◽

Spatial Locality ◽

Effective Operation ◽

Important Challenge ◽

Temporal And Spatial ◽

Memory Accesses

Data-dependent memory accesses (DDAs) pose an important challenge for high-performance graph analytics (GA). This is because such memory accesses do not exhibit enough temporal and spatial locality, resulting in low cache performance. Prior efforts that focused on improving the performance of DDAs for GA are not applicable across various GA frameworks. This is because (1) they only focus on one particular graph representation, and (2) they require workload changes to communicate specific information to the hardware for their effective operation. In this work, we propose a hardware-only solution to improving the performance of DDAs for GA across multiple GA frameworks. We present a hardware prefetcher for GA called Gretch that addresses the above limitations. An important observation we make is that identifying certain DDAs without hardware-software communication is sensitive to the instruction scheduling. A key contribution of this work is a hardware mechanism that activates Gretch to identify DDAs when using either in-order or out-of-order instruction scheduling. Our evaluation shows that Gretch provides an average speedup of 38% over no prefetching and 25% over a conventional stride prefetcher, and outperforms prior DDA prefetchers by 22%, with only a 1% increase in power consumption when executed on different GA workloads and frameworks.

An Energy-Efficient DRAM Cache Architecture for Mobile Platforms With PCM-Based Main Memory

ACM Transactions on Embedded Computing Systems ◽

10.1145/3451995 ◽

2022 ◽

Vol 21 (1) ◽

pp. 1-22

Author(s):

Dongsuk Shin ◽

Hakbeom Jang ◽

Kiseok Oh ◽

Jae W. Lee

Keyword(s):

Energy Consumption ◽

Main Memory ◽

Battery Life ◽

Mobile Platforms ◽

Total Energy Consumption ◽

Efficient Manner ◽

Hybrid Memory ◽

Spatial Locality ◽

Cache Architecture ◽

Energy Delay Product

A long battery life is a first-class design objective for mobile devices, and main memory accounts for a major portion of total energy consumption. Moreover, the energy consumption from memory is expected to increase further with ever-growing demands for bandwidth and capacity. A hybrid memory system with both DRAM and PCM can be an attractive solution to provide additional capacity and reduce standby energy. Although PCM provides much greater density than DRAM, it has longer access latency and limited write endurance, which make it challenging to architect as main memory. To address this challenge, this article introduces CAMP, a novel DRAM cache architecture for mobile platforms with PCM-based main memory. A DRAM cache in this environment is required to filter most of the writes to PCM to increase its lifetime, and to deliver the highest efficiency even with the relatively small DRAM cache that mobile platforms can afford. To address this, CAMP divides DRAM space into two regions: a page cache for exploiting spatial locality in a bandwidth-efficient manner and a dirty block buffer for maximally filtering writes. CAMP improves the performance and energy-delay-product by 29.2% and 45.2%, respectively, over the baseline PCM-oblivious DRAM cache, while increasing PCM lifetime by 2.7×. CAMP also improves the performance and energy-delay-product by 29.3% and 41.5%, respectively, over the state-of-the-art design with a dirty block buffer, while increasing PCM lifetime by 2.5×.

Request, Coalesce, Serve, and Forget: Miss-Optimized Memory Systems for Bandwidth-Bound Cache-Unfriendly Applications on FPGAs

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3466823 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-33

Author(s):

Mikhail Asiatici ◽

Paolo Ienne

Keyword(s):

Large Scale ◽

Sparse Matrix ◽

Memory Systems ◽

Graph Analytics ◽

Matrix Vector Multiplication ◽

Area Reduction ◽

Cache Line ◽

Speed Up ◽

Memory Accesses ◽

On Chip

Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs due to their short, irregular memory accesses, which result in low cache hit rates. Nonblocking caches reduce the bandwidth required by misses by requesting each cache line only once, even when there are multiple misses corresponding to it. However, such a reuse mechanism is traditionally implemented using an associative lookup. This limits the number of misses that are considered for reuse to a few tens, at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. This brings the same bandwidth advantage as a larger cache for a fraction of the area budget, because outstanding misses do not need a data array, and it can significantly speed up irregular memory-bound latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with 24× area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.
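The bandwidth saving from requesting each cache line only once can be shown with a short sketch. It is a software illustration under simplifying assumptions, not the authors' pipeline: a Python dict stands in for the cuckoo hash tables, no memory response arrives while misses are being recorded, and the 64 B line size and function name are hypothetical.

```python
def coalesce_misses(miss_addresses, line_bytes=64):
    """Count memory requests when misses to the same line are merged.

    Returns (requests_sent, outstanding), where outstanding maps each
    pending line to the byte addresses waiting for it. The dict is only
    a stand-in for a hardware miss table.
    """
    outstanding = {}
    requests_sent = 0
    for addr in miss_addresses:
        line = addr // line_bytes
        if line not in outstanding:
            # First miss to this line: issue one request to memory.
            requests_sent += 1
            outstanding[line] = []
        # Later misses to the same line only record a waiter.
        outstanding[line].append(addr)
    return requests_sent, outstanding
```

With six misses that fall on three distinct lines, only three requests reach memory. The more outstanding misses the table can hold, the more later misses can be merged, which is why the abstract stresses tracking thousands of misses instead of a few tens.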

Significant Impact of Improved Machine Learning Algorithm in The Processes of Large Data Sets

International Journal of Scientific Research in Computer Science Engineering and Information Technology ◽

10.32628/cseit206133 ◽

2020 ◽

pp. 458-467

Author(s):

Virendra Tiwari ◽

Balendra Garg ◽

Uday Prakash Sharma

Keyword(s):

Machine Learning ◽

Learning Algorithm ◽

Learning Algorithms ◽

Dynamic Environment ◽

Large Data ◽

Machine Learning Algorithms ◽

Streaming Data ◽

Machine Learning Techniques ◽

Machine Learning Algorithm ◽

Learning Mechanisms

Machine learning algorithms are capable of managing multi-dimensional data in a dynamic environment. Despite these vital features, some challenges remain. Machine learning algorithms still require additional mechanisms or procedures to predict a large number of new classes while managing privacy. These deficiencies show that the reliable use of a machine learning algorithm relies on human experts, because raw data may complicate the learning process and produce inaccurate results. Interpreting outcomes with expertise in machine learning mechanisms is therefore a significant challenge. Machine learning techniques suffer from issues of high dimensionality, adaptability, distributed computing, scalability, streaming data, and duplicity. The main issue of machine learning algorithms is their vulnerability when managing errors. Furthermore, machine learning techniques are also found to lack variability. This paper studies how the computational complexity of machine learning algorithms can be reduced by finding how to make predictions using an improved algorithm.

Limiting CPU Frequency Scaling Considering Main Memory Accesses

KIISE Transactions on Computing Practices ◽

10.5626/ktcp.2014.20.9.483 ◽

2014 ◽

Vol 20 (9) ◽

pp. 483-491 ◽

Cited By ~ 2

Author(s):

Moonju Park

Keyword(s):

Main Memory ◽

Frequency Scaling ◽

Memory Accesses

Main-Memory Hash Joins on Modern Processor Architectures

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2014.2313874 ◽

2015 ◽

Vol 27 (7) ◽

pp. 1754-1766 ◽

Cited By ~ 24

Author(s):

Cagri Balkesen ◽

Jens Teubner ◽

Gustavo Alonso ◽

M. Tamer Özsu

Keyword(s):

Main Memory ◽

Processor Architectures

CuSP

ACM SIGOPS Operating Systems Review ◽

10.1145/3469379.3469385 ◽

2021 ◽

Vol 55 (1) ◽

pp. 47-60

Author(s):

Loc Hoang ◽

Roshan Dathathri ◽

Gurbinder Gill ◽

Keshav Pingali

Keyword(s):

Single Machine ◽

Distributed Memory ◽

The State ◽

Main Memory ◽

Input Graph ◽

Large Graphs ◽

Graph Analytics ◽

Graph Partitions ◽

High Level ◽

Edge Partitioning

Graph analytics systems must analyze graphs with billions of vertices and edges, which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the graph among the machines in the cluster. Existing graph analytics systems use a built-in partitioner that incorporates a particular partitioning policy, but the best policy is dependent on the algorithm, input graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone graph partitioners are available, but they too implement only a few policies. CuSP is a fast streaming edge partitioning framework that permits users to specify the desired partitioning policy at a high level of abstraction and quickly generates high-quality graph partitions. For example, it can partition wdc12, the largest publicly available web-crawl graph with 4 billion vertices and 129 billion edges, in under 2 minutes for clusters with 128 machines. Our experiments show that it can produce quality partitions 6× faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.
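The notion of a user-specified policy applied to a stream of edges can be sketched as follows. This is not CuSP's interface, which the abstract does not describe; the function names, the policy signature, and the modulo-based assignments are assumptions chosen to keep the example small.

```python
def partition_edges(edge_stream, num_parts, policy):
    """Assign each (src, dst) edge to a partition in a single pass.

    policy is any function (src, dst, num_parts) -> partition index.
    """
    parts = [[] for _ in range(num_parts)]
    for src, dst in edge_stream:
        parts[policy(src, dst, num_parts)].append((src, dst))
    return parts


def by_source(src, dst, num_parts):
    """Hypothetical policy: keep all outgoing edges of a vertex together."""
    return src % num_parts


def by_destination(src, dst, num_parts):
    """Hypothetical policy: keep all incoming edges of a vertex together."""
    return dst % num_parts
```

Swapping `by_source` for `by_destination` changes where every edge lands without touching the streaming loop, which is the kind of separation between policy and mechanism the abstract argues for.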

Graph Management Systems: A Qualitative Survey

APTIKOM Journal on Computer Science and Information Technologies ◽

10.11591/aptikom.j.csit.129 ◽

2018 ◽

Vol 3 (2) ◽

pp. 66-76

Author(s):

Maurizio Nolé ◽

Carlo Sartiani

Keyword(s):

Large Scale ◽

Large Data ◽

Database Systems ◽

Query Languages ◽

Graph Database ◽

Graph Analytics ◽

Qualitative Survey ◽

Mobile Phone Networks ◽

Graph Management ◽

Management Issues

In recent years, many real-world applications have been modeled by graph structures (e.g., social networks, mobile phone networks, web graphs, etc.), and many systems have been developed to manage, query, and analyze these datasets. These systems can be divided into specialized graph database systems and large-scale graph analytics systems. The former consider end-to-end data management issues, including storage representations, transactions, and query languages, whereas the latter focus on processing specific tasks over large data graphs. In this paper we provide an overview of several graph database systems and graph processing systems, with the aim of assisting the reader in identifying the best-suited solution for her application scenario.

Exploring Efficient Architectures on Remote In-Memory NVM over RDMA

ACM Transactions on Embedded Computing Systems ◽

10.1145/3477004 ◽

2021 ◽

Vol 20 (5s) ◽

pp. 1-20

Author(s):

Qingfeng Zhuge ◽

Hao Zhang ◽

Edwin Hsing-Mean Sha ◽

Rui Xu ◽

Jun Liu ◽

...

Keyword(s):

High Performance ◽

File System ◽

File Systems ◽

Data Access ◽

Main Memory ◽

Memory Modules ◽

Significant Performance ◽

Architectural Structures ◽

Memory Accesses ◽

Careful Design

Efficiently accessing remote file data remains a challenging problem for data processing systems. Developments in non-volatile dual in-line memory modules (NVDIMMs), in-memory file systems, and RDMA networks provide new opportunities for solving the problem of remote data access. A general understanding of NVDIMMs, such as Intel Optane DC Persistent Memory (DCPM), is that they expand main memory capacity at the cost of performance several times lower than that of DRAM. With the in-depth exploration presented in this paper, however, we show that the potential of NVDIMMs for high-performance, remote in-memory accesses can be revealed through careful design. We explore multiple architectural structures for accessing remote NVDIMMs in a real system using Optane DCPM, and compare the performance of the various structures. Experiments show significant performance gaps among different ways of using NVDIMMs as memory address space accessible through an RDMA interface. Furthermore, we design and implement a prototype of a user-level, in-memory file system, RIMFS, in the device DAX mode on Optane DCPM. By comparing against the DAX-supported Linux file system, Ext4-DAX, we show that the performance of remote reads on RIMFS over RDMA is on average 11.44× higher than that on a remote Ext4-DAX. The experimental results also show that the performance of remote accesses on RIMFS is maintained on a heavily loaded data server with CPU utilization as high as 90%, while the performance of remote reads on Ext4-DAX is reduced by 49.3%, and the performance of local reads on Ext4-DAX is reduced even more, by 90.1%. The performance comparisons of writes exhibit the same trends.
