Polymorphic Memory: A Hybrid Approach for Utilizing On-Chip Memory in Manycore Systems

Electronics ◽

10.3390/electronics9122061 ◽

2020 ◽

Vol 9 (12) ◽

pp. 2061

Author(s):

Seung-Ho Lim ◽

Hyunchul Seok ◽

Ki-Woong Park

Keyword(s):

High Performance ◽

Hybrid Approach ◽

Random Access ◽

Previous Method ◽

Main Memory ◽

Memory Architecture ◽

Memory Hierarchies ◽

Dynamic Memory ◽

High Bandwidth ◽

On Chip

The key challenges of manycore systems are the large memory capacity and high bandwidth required to run many applications. Three-dimensional integrated on-chip memory is a promising candidate for addressing these challenges. The advent of on-chip memory has provided new opportunities to rethink traditional memory hierarchies and their management. In this study, we propose polymorphic memory, a hybrid approach to using on-chip memory. In contrast to previous studies, we use the on-chip memory as both a main memory (called M1 memory) and a Dynamic Random Access Memory (DRAM) cache (called M2 cache). The main memory consists of M1 memory and a conventional DRAM memory called M2 memory. To achieve high performance when running many applications on this memory architecture, we propose management techniques for the main memory with M1 and M2 memories and for polymorphic memory with dynamic memory allocations for many applications in a manycore system. The first technique moves frequently accessed pages to M1 memory via hardware monitoring in a memory controller. The second partitions M1 memory to mitigate contention among many processes. Finally, we propose a method to use M2 cache between a conventional last-level cache and M2 memory, and we determine the best cache size for improving the performance with polymorphic memory. The proposed schemes are evaluated with the SPEC CPU2006 benchmark, and the experimental results show that the proposed approaches improve performance under various workloads of the benchmark. The performance evaluation confirms that the average performance improvement of polymorphic memory is 21.7%, with 0.026 standard deviation for the normalized results, compared to the previous method of using on-chip memory as a last-level cache.
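
The first technique above (moving frequently accessed pages to M1 memory) can be illustrated with a small sketch. The abstract does not state the selection policy, so the epoch-based "top-k by access count" rule, the function names, and the capacity parameter below are all assumptions made for illustration, not the authors' mechanism.

```python
from collections import Counter

def select_m1_pages(access_trace, m1_capacity_pages):
    """Pick the pages that should live in M1 for the next epoch.

    Assumption: the memory controller counts accesses per page during an
    epoch and M1 holds the most frequently accessed pages.
    """
    counts = Counter(access_trace)
    # Highest count first; ties broken by page number so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {page for page, _ in ranked[:m1_capacity_pages]}

def plan_migration(current_m1, access_trace, m1_capacity_pages):
    """Return (pages to promote into M1, pages to demote to M2)."""
    target = select_m1_pages(access_trace, m1_capacity_pages)
    return target - current_m1, current_m1 - target

# Hypothetical epoch: page 4 is accessed 4 times, page 1 three times.
trace = [1, 1, 1, 2, 2, 3, 4, 4, 4, 4]
promote, demote = plan_migration({2, 4}, trace, m1_capacity_pages=2)
```

In this toy epoch, page 1 would be promoted and page 2 demoted, while page 4 stays in M1. A real controller would also have to weigh the cost of each migration, which this sketch ignores.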

Application-Oriented Data Migration to Accelerate In-Memory Database on Hybrid Memory

Micromachines ◽

10.3390/mi13010052 ◽

2021 ◽

Vol 13 (1) ◽

pp. 52

Author(s):

Wenze Zhao ◽

Yajuan Du ◽

Mingzhe Zhang ◽

Mingyang Liu ◽

Kailun Jin ◽

...

Keyword(s):

Random Access ◽

Data Access ◽

Database Systems ◽

Data Migration ◽

Memory Architecture ◽

Hybrid Memory ◽

High Bandwidth ◽

Data Objects ◽

Stacked Memory

With the advantage of faster data access than traditional disks, in-memory database systems, such as Redis and Memcached, have been widely applied in data centers and embedded systems. The performance of an in-memory database greatly depends on the access speed of memory. To meet the requirements of high bandwidth and low energy, die-stacked memory (e.g., High Bandwidth Memory (HBM)) has been developed to extend the channel number and width. However, the capacity of die-stacked memory is limited due to the interposer challenge. Thus, hybrid memory systems that combine traditional Dynamic Random Access Memory (DRAM) with die-stacked memory have emerged. Existing works place and manage data on hybrid memory architectures from the hardware point of view. This paper considers managing in-memory database data in hybrid memory from the application point of view. We first perform a preliminary study on the hotness distribution of client requests on Redis. From the results, we observe that most requests target a small portion of the data objects in the in-memory database. Then, we propose Application-oriented Data Migration (ADM) to accelerate in-memory databases on hybrid memory. We design a hotness management method and two migration policies to migrate data into or out of HBM. We take Redis under comprehensive benchmarks as a case study for the proposed method. The experimental results verify that the proposed method improves performance and reduces energy consumption compared with the existing Redis database.
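
The hotness observation in this abstract (most requests hit a small portion of data objects) is the kind of measurement that can be sketched in a few lines. The function name, the request log, and the chosen key share below are hypothetical; the abstract does not describe how the authors measured hotness.

```python
from collections import Counter
import math

def request_share_of_hottest(requests, key_share):
    """Fraction of all requests that target the hottest `key_share` of keys.

    `requests` is a sequence of key names, one entry per client request.
    `key_share` is a fraction in (0, 1], e.g. 0.25 for the hottest 25% of keys.
    """
    if not requests:
        raise ValueError("requests must not be empty")
    counts = Counter(requests)
    # Always consider at least one key, even for a very small key_share.
    n_hot = max(1, math.ceil(len(counts) * key_share))
    hottest = sorted(counts.values(), reverse=True)[:n_hot]
    return sum(hottest) / len(requests)

# Hypothetical log: 4 distinct keys, key "a" receives 6 of 10 requests.
log = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
share = request_share_of_hottest(log, key_share=0.25)
```

A skewed result like this is what would motivate keeping only the hot objects in the capacity-limited HBM.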

Exploiting Data Compression for Adaptive Block Placement in Hybrid Caches

Electronics ◽

10.3390/electronics11020240 ◽

2022 ◽

Vol 11 (2) ◽

pp. 240

Author(s):

Beomjun Kim ◽

Yongtae Kim ◽

Prashant Nair ◽

Seokin Hong

Keyword(s):

Random Access ◽

Spin Transfer Torque ◽

Main Memory ◽

Spin Transfer ◽

Access Memory ◽

Low Leakage ◽

Block Placement ◽

On Chip ◽

Hybrid Caches ◽

Cache Block

STT-RAM (Spin-Transfer Torque Random Access Memory) appears to be a viable alternative to SRAM-based on-chip caches. Due to its high density and low leakage power, STT-RAM can be used to build massive capacity last-level caches (LLC). Unfortunately, STT-RAM has a much longer write latency and a much greater write energy than SRAM. Researchers developed hybrid caches made up of SRAM and STT-RAM regions to cope with these challenges. In order to store as many write-intensive blocks in the SRAM region as possible in hybrid caches, an intelligent block placement policy is essential. This paper proposes an adaptive block placement framework for hybrid caches that incorporates metadata embedding (ADAM). When a cache block is evicted from the LLC, ADAM embeds metadata (i.e., write intensity) into the block. Metadata embedded in the cache block are then extracted and used to determine the block’s write intensity when it is fetched from main memory. Our research demonstrates that ADAM can enhance performance by 26% (on average) when compared to a baseline block placement scheme.
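
The evict-then-fetch flow described for ADAM can be pictured with a toy model. Everything concrete here is an assumption for illustration: the 2-bit saturating write counter, the threshold, the dictionary standing in for main memory, and the default placement for blocks with no metadata. The abstract does not specify how the metadata is encoded or stored.

```python
from dataclasses import dataclass

MAX_INTENSITY = 3              # hypothetical 2-bit saturating value
WRITE_INTENSIVE_THRESHOLD = 2  # hypothetical cut-off for SRAM placement

@dataclass
class Block:
    data: bytes
    writes_in_llc: int = 0     # writes observed while resident in the LLC

def evict(block, main_memory, addr):
    """On LLC eviction, store the data together with its write intensity."""
    intensity = min(block.writes_in_llc, MAX_INTENSITY)
    main_memory[addr] = (block.data, intensity)

def fetch(main_memory, addr):
    """On a fetch, read back the metadata and choose the hybrid-cache region."""
    data, intensity = main_memory[addr]
    region = "SRAM" if intensity >= WRITE_INTENSIVE_THRESHOLD else "STT-RAM"
    return Block(data), region

memory = {}
evict(Block(b"hot", writes_in_llc=5), memory, addr=0x10)
evict(Block(b"cold", writes_in_llc=0), memory, addr=0x20)
_, hot_region = fetch(memory, 0x10)    # write-intensive block goes to SRAM
_, cold_region = fetch(memory, 0x20)   # read-mostly block goes to STT-RAM
```

The point of the sketch is only the information flow: write behaviour observed during one LLC residency travels with the block and guides placement on the next one.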

Cache-emulated register file: An integrated on-chip memory architecture for high performance GPGPUs

2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) ◽

10.1109/micro.2016.7783717 ◽

2016 ◽

Cited By ~ 5

Author(s):

Naifeng Jing ◽

Jianfei Wang ◽

Fengfeng Fan ◽

Wenkang Yu ◽

Li Jiang ◽

...

Keyword(s):

High Performance ◽

Memory Architecture ◽

On Chip

Energy-Efficient Non-Von Neumann Computing Architecture Supporting Multiple Computing Paradigms for Logic and Binarized Neural Networks

Journal of Low Power Electronics and Applications ◽

10.3390/jlpea11030029 ◽

2021 ◽

Vol 11 (3) ◽

pp. 29

Author(s):

Tommaso Zanotti ◽

Francesco Maria Puglisi ◽

Paolo Pavan

Keyword(s):

Neural Networks ◽

High Performance ◽

Random Access ◽

Resistive Random Access Memory ◽

Hardware Accelerators ◽

Memory Architecture ◽

Ultra Low Power ◽

Von Neumann ◽

Non Volatile Memory ◽

Volatile Memory

Different in-memory computing paradigms enabled by emerging non-volatile memory technologies are promising solutions for the development of ultra-low-power hardware for edge computing. Among these, SIMPLY, a smart logic-in-memory architecture, provides high reconfigurability and enables the in-memory computation of both logic operations and binarized neural network (BNN) inference. However, operation-specific hardware accelerators can deliver better performance for a particular task, such as the analog computation of the multiply and accumulate operation for BNN inference, but they lack reconfigurability. A solution that provides the flexibility of SIMPLY while also achieving the high performance of BNN-specific analog hardware accelerators is still missing. In this work, we propose a novel in-memory architecture based on 1T1R crossbar arrays, which enables the coexistence on the same crossbar array of both the SIMPLY computing paradigm and the analog acceleration of the multiply and accumulate operation for BNN inference. We also highlight the main design tradeoffs and opportunities enabled by different emerging non-volatile memory technologies. Finally, by using a physics-based Resistive Random Access Memory (RRAM) compact model calibrated on data from the literature, we show that the proposed architecture improves the energy delay product by >10^3 times when performing a BNN inference task with respect to a SIMPLY implementation.
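
For readers unfamiliar with the multiply and accumulate operation in BNN inference, the sketch below shows the standard digital formulation: with inputs and weights restricted to -1 and +1, a dot product reduces to an XNOR followed by a bit count. This is a generic identity, not the analog crossbar scheme proposed in the paper; the bit encoding (1 for +1, 0 for -1) and the function names are assumptions for illustration.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer, element i at bit i."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def bnn_dot(x_bits, w_bits, n):
    """Dot product of two n-element +1/-1 vectors given as packed bits.

    XNOR marks the positions where input and weight agree (product +1).
    With p agreements out of n, the sum is p - (n - p) = 2p - n.
    """
    mask = (1 << n) - 1
    agreements = bin(~(x_bits ^ w_bits) & mask).count("1")
    return 2 * agreements - n

x = [1, -1, 1, 1]
w = [1, 1, -1, 1]
result = bnn_dot(pack(x), pack(w), len(x))
reference = sum(a * b for a, b in zip(x, w))  # plain arithmetic check
```

Both `result` and `reference` evaluate the same sum, which is why BNN hardware can replace multipliers with simple logic or, as in analog accelerators, with current summation.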

Design and Development of a Modified AXI Based BIST Technique for Memory Architectures

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d4446.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 8023-8029

Keyword(s):

Fault Detection ◽

Random Access ◽

Memory Systems ◽

Memory Architecture ◽

Proposed Model ◽

Overall Performance ◽

Self Test ◽

On Chip ◽

The Cost ◽

Memory Architectures

Memory testing and fault detection are an important phase in testing hardware devices. They improve the overall performance of the system and prevent runtime failures in the devices. Built-In Self-Test (BIST) is a hardware memory test architecture deployed in many System-on-Chip devices to enable fault detection. This technique reduces the cost and time needed to test memory systems. However, different BIST modules are needed to detect faults in different memories, which increases design complexity. To overcome these shortcomings, it is essential to develop an Advanced eXtensible Interface (AXI) based self-test memory architecture with Block Random Access Memory (BRAM), using March algorithms, to achieve parallel read and write capability. The proposed model reduces the dynamic power and the clock cycles needed for simulation compared to existing techniques.
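
The abstract mentions March algorithms without naming a variant. As a rough illustration of what such a test does, the sketch below runs March C-, one well-known variant chosen here as an assumption, against a simulated bit-wide memory. The `FaultyMemory` class and its stuck-at fault model are hypothetical test scaffolding, not part of the paper's AXI design.

```python
class FaultyMemory:
    """Bit-wide memory model with optional stuck-at faults (addr -> forced value)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}

    def write(self, addr, value):
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])

def march_c_minus(mem, size):
    """Run March C- and return the sorted list of failing addresses."""
    failures = set()
    up = range(size)
    down = range(size - 1, -1, -1)

    def element(order, expect, write):
        # One march element: optional read-and-check, then optional write.
        for addr in order:
            if expect is not None and mem.read(addr) != expect:
                failures.add(addr)
            if write is not None:
                mem.write(addr, write)

    element(up, None, 0)    # any order: w0
    element(up, 0, 1)       # ascending:  r0, w1
    element(up, 1, 0)       # ascending:  r1, w0
    element(down, 0, 1)     # descending: r0, w1
    element(down, 1, 0)     # descending: r1, w0
    element(up, 0, None)    # any order: r0
    return sorted(failures)

good = march_c_minus(FaultyMemory(8), 8)
bad = march_c_minus(FaultyMemory(8, stuck_at={3: 1}), 8)
```

Here `good` is an empty list and `bad` reports address 3. A hardware BIST controller performs the same read and write sequence with a small state machine instead of software loops.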

In-DRAM Cache Management for Low Latency and Low Power 3D-Stacked DRAMs

Micromachines ◽

10.3390/mi10020124 ◽

2019 ◽

Vol 10 (2) ◽

pp. 124 ◽

Cited By ~ 2

Author(s):

Ho Shin ◽

Eui-Young Chung

Keyword(s):

Low Power ◽

High Performance ◽

High Capacity ◽

Random Access ◽

Computing System ◽

Access Time ◽

Low Latency ◽

Cache Management ◽

Long Latency ◽

High Bandwidth

Recently, 3D-stacked dynamic random access memory (DRAM) has become a promising solution for ultra-high capacity and high-bandwidth memory implementations. However, like typical 2D DRAMs, it suffers from the memory wall problem due to long latency. Although there are various cache management techniques and latency hiding schemes to reduce DRAM access time, in a high-performance system using high-capacity 3D-stacked DRAM, it is ultimately essential to reduce the latency of the DRAM itself. To solve this problem, various asymmetric in-DRAM cache structures have recently been proposed, which are more attractive for high-capacity DRAMs because they can be implemented at a lower cost in 3D-stacked DRAMs. However, most research focuses on the architecture of the in-DRAM cache itself and pays little attention to proper management methods. In this paper, we propose two new management algorithms for in-DRAM caches to achieve a low-latency and low-power 3D-stacked DRAM device. Through computing system simulation, we demonstrate an improvement in energy delay product of up to 67%.

Double Pumping Low Power Technique for Coarse-Grained Reconfigurable Architecture

International Journal of Electrical and Electronics Research ◽

10.37391/ijeer.040103 ◽

2016 ◽

Vol 4 (1) ◽

pp. 10-15

Author(s):

S. Munaf ◽

Dr. A. Bharathi ◽

Dr. A. N. Jayanthi

Keyword(s):

High Performance ◽

Random Access ◽

Power Reduction ◽

Coarse Grained ◽

Reconfigurable Architectures ◽

Memory Unit ◽

Memory Architecture ◽

Access Memory ◽

Processing Elements ◽

Data Bus

Coarse-grained reconfigurable architectures (CGRAs) require many processing elements (PEs) and a configuration memory unit (configuration cache) for reconfiguration of the PE array. Although this architecture is meant for high performance and flexibility, power reduction is crucial for a CGRA to be a more competitive and reliable processing core in embedded systems. We propose a DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory) architecture to reduce the power overhead caused by reconfiguration. The power reduction is achieved by exploiting characteristics of DDR SDRAM such as double pumping of the data bus and an I/O buffer between the memory and the data bus. All modules have been designed at the behavioral level in VHDL and simulated in the Xilinx ISE navigator.

Systems-on-Chip with Strong Ordering

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3428153 ◽

2021 ◽

Vol 18 (1) ◽

pp. 1-27

Author(s):

Sooraj Puthoor ◽

Mikko H. Lipasti

Keyword(s):

High Performance ◽

Geometric Mean ◽

Memory Consistency ◽

Memory Hierarchies ◽

Data Parallel ◽

Consistency Model ◽

Heterogeneous Processing ◽

Systems On Chip ◽

Memory Consistency Model ◽

On Chip

Sequential consistency (SC) is the most intuitive memory consistency model and the easiest for programmers and hardware designers to reason about. However, the strict memory ordering restrictions imposed by SC make it less attractive from a performance standpoint. Additionally, prior high-performance SC implementations required complex hardware structures to support speculation and recovery. In this article, we introduce the lockstep SC consistency model (LSC), a new memory model based on SC but carefully defined to accommodate the data parallel lockstep execution paradigm of GPUs. We also describe an efficient LSC implementation for an APU system-on-chip (SoC) and show that our implementation performs close to the baseline relaxed model. Evaluation of our implementation shows that the geometric mean performance cost for lockstep SC is just 0.76% for GPU execution and 6.11% for the entire APU SoC compared to a baseline with a weaker memory consistency model. Adoption of LSC in future APU and SoC designs will reduce the burden on programmers trying to write correct parallel programs, while also simplifying the implementation and verification of systems with heterogeneous processing elements and complex memory hierarchies.

Efficient Instruction and Data Caching for High Performance Embedded Processors

Jornada de Jóvenes Investigadores del I3A ◽

10.26754/jji-i3a.201201788 ◽

1970 ◽

pp. 9

Author(s):

A. Ferrerón Labari ◽

D. Suárez Gracia ◽

V. Viñals Yúfera

Keyword(s):

Embedded Systems ◽

Power Consumption ◽

Low Power ◽

Interconnection Networks ◽

High Performance ◽

Critical Issue ◽

Content Management ◽

Structure Design ◽

Portable Devices ◽

On Chip

In recent years, embedded systems have evolved to offer capabilities previously found only in high-performance systems. Portable devices already have on-chip multiprocessors (such as PowerPC 476FP or ARM Cortex A9 MP), usually multi-threaded, and a powerful multi-level on-chip cache memory hierarchy. As most of these systems are battery-powered, power consumption becomes a critical issue. Achieving high performance and low power consumption is a highly complex challenge for which some proposals have already been made. Suarez et al. proposed a new on-chip cache hierarchy, the LP-NUCA (Low Power NUCA), which reduces access latency by taking advantage of NUCA (Non-Uniform Cache Architecture) properties. The key points are decoupling the functionality and utilizing three specialized on-chip networks. This structure has proved efficient for data hierarchies, achieving good performance and reducing energy consumption. On the other hand, instruction caches have different requirements and characteristics than data caches, which conflict with the requirements of low-power embedded systems, especially in SMT (simultaneous multi-threading) environments. We want to study the benefits of utilizing small tiled caches for the instruction hierarchy, so we propose a new design, ID-LP-NUCAs. Thus, we need to completely re-evaluate our previous design in terms of structure design, interconnection networks (including topologies, flow control and routing), content management (with special interest in hardware/software content allocation policies), and structure sharing. In CMP (chip multiprocessor) environments with parallel workloads, coherence plays an important role and must be taken into consideration.
