Monolithically Integrating Non-Volatile Main Memory over the Last-Level Cache

2021 ◽  
Vol 18 (4) ◽  
pp. 1-26
Author(s):  
Candace Walden ◽  
Devesh Singh ◽  
Meenatchi Jagasivamani ◽  
Shang Li ◽  
Luyi Kang ◽  
...  

Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU’s die. This article investigates such monolithically integrated CPU–main memory chips. We exploit non-volatile memories employing 3D crosspoint subarrays, such as resistive RAM (ReRAM), and integrate them over the CPU’s last-level cache (LLC). The regular structure of cache arrays enables co-design of the LLC and ReRAM main memory for area efficiency. We also develop a streamlined LLC/main memory interface that employs a single shared internal interconnect for both the cache and main memory arrays, and uses a unified controller to service both LLC and main memory requests. We apply our monolithic design ideas to a many-core CPU by integrating 3D ReRAM over each core’s LLC slice. We find that co-design of the LLC and ReRAM saves 27% of the total LLC–main memory area at the expense of slight increases in delay and energy. The streamlined LLC/main memory interface saves an additional 12% in area. Our simulation results show monolithic integration of CPU and main memory improves performance by 5.3× and 1.7× over HBM2 DRAM for several graph and streaming kernels, respectively. It also reduces the memory system’s energy by 6.0× and 1.7×, respectively. Moreover, we show that the area savings of co-design permits the CPU to have 23% more cores and main memory, and that streamlining the LLC/main memory interface incurs a small 4% performance penalty.
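The unified-controller idea can be illustrated with a toy model (a hypothetical sketch, not the paper's actual design): one controller serializes both LLC and ReRAM main-memory requests over a single shared internal interconnect. The latency constants are invented for illustration only.

```python
from collections import deque

class UnifiedController:
    """Toy model of the streamlined interface: one shared port services
    both LLC and main-memory (ReRAM) requests in order.
    Latency constants are assumptions, not figures from the paper."""
    LLC_LATENCY = 4      # cycles (assumed)
    RERAM_LATENCY = 40   # cycles (assumed)

    def __init__(self):
        self.queue = deque()
        self.now = 0

    def submit(self, kind):
        """kind is 'llc' for a cache access or 'reram' for a memory access."""
        self.queue.append(kind)

    def drain(self):
        """Service queued requests in order on the single shared
        interconnect; return the completion cycle of each request."""
        done = []
        for kind in self.queue:
            self.now += self.LLC_LATENCY if kind == "llc" else self.RERAM_LATENCY
            done.append(self.now)
        self.queue.clear()
        return done

ctrl = UnifiedController()
for k in ["llc", "reram", "llc"]:
    ctrl.submit(k)
print(ctrl.drain())  # [4, 44, 48]
```

The model captures the tradeoff the abstract quantifies: sharing one interconnect saves area but lets slow memory accesses delay cache requests queued behind them.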

Author(s):  
Mike Bailey ◽  
Matt Clothier ◽  
Nick Gebbie

As engineering design becomes more and more complex, we predict that the field will look to immersive environments as a way to create more natural interactions with design ideas. But helmets are bulky and awkward. A better solution for immersive design is a partial dome. Originally the exclusive domain of flight simulators, dome projection is now being brought to the masses through less expensive dome displays, and its immersiveness makes it a unique design and display experience. A fisheye lens is needed for the projector to display across the nearly 180° of the dome, which necessarily distorts the graphics displayed through it. The trick is to "pre-distort" the graphics in the opposite direction before sending them to the projector. This paper describes the use of the OpenGL Shading Language (GLSL) to perform this non-linear dome distortion transformation on the GPU. This makes developing dome-ready interactive graphics code barely different from developing monitor-only graphics code, with little runtime performance penalty. The shader code is given along with real examples from our work with San Diego's Reuben H. Fleet Science Center.
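The paper's code is a GLSL shader; as a language-neutral sketch of the same angular-fisheye mapping (function name and the 180° default field of view are our assumptions), the following Python function maps an eye-space point to pre-distorted normalized coordinates, where radius on screen is proportional to the angle off the view axis:

```python
import math

def fisheye_predistort(x, y, z, fov_deg=180.0):
    """Angular-fisheye pre-distortion: map an eye-space point (camera
    looking down -z) to normalized device coordinates on the unit disc.
    A vertex shader would apply the same math per vertex."""
    d = math.sqrt(x * x + y * y + z * z)
    if d == 0.0:
        return 0.0, 0.0
    theta = math.acos(-z / d)              # angle from the forward (-z) axis
    r = theta / math.radians(fov_deg / 2)  # normalized fisheye radius
    phi = math.atan2(y, x)                 # azimuth around the view axis
    return r * math.cos(phi), r * math.sin(phi)
```

A point straight ahead lands at the dome's center, and a point 45° off-axis lands halfway to the rim of a 180° dome.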


Author(s):  
Udai Shanker ◽  
Abhay N. Singh ◽  
Abhinav Anand ◽  
Saurabh Agrawal

This chapter proposes the Shadow Sensitive SWIFT commit protocol for Distributed Real Time Database Systems (DRTDBS), in which only an abort-dependent cohort whose deadline lies beyond a specific value (Tshadow_creation_time) can fork off a replica of itself, called a shadow, whenever it borrows the dirty value of a data item. Two new dependencies are defined: a Commit-on-Termination external dependency between the final commit operations of the lender and the shadow of its borrower, and a Begin-on-Abort internal dependency between the shadow of the borrower and the borrower itself. If a serious problem arises in the commitment of the lender, the borrower is aborted and its execution resumed from its shadow, which sends a YES-VOTE message piggybacked with the new result to its coordinator; the abort dependency created between lender and borrower due to the update-read conflict is reversed to a commit dependency between shadow and lender with a read-update conflict, and the commit operation is governed by the Commit-on-Termination dependency. The performance of Shadow Sensitive SWIFT is compared with the shadow PROMPT, SWIFT, and DSS-SWIFT commit protocols (Haritsa, Ramamritham, & Gupta, 2000; Shanker, Misra, & Sarje, 2006; Shanker, Misra, Sarje, & Shisondia, 2006) for both main-memory-resident and disk-resident databases, with and without communication delay. Simulation results show that the proposed protocol improves system performance by up to 5% in terms of transaction miss percentage.


Author(s):  
Krishnamachar Sreenivasan

The stochastic heat conduction differential equation, in spite of its complexity, allows stationary solutions valid over a certain range of the variables characterizing heat flow in multi-processor cores. The heat conduction equation is recast to account for the anisotropy of a many-core multi-processor, in which the heat generated at a given location depends on whether that location is a cache, processor, bus controller, or memory controller; within the core, generated heat depends on the hit rate, processor utilization, cache organization, and the technology used. The thermal conductivity of and heat generation in the core are treated as stochastic variables, and the influence of workloads, hitherto unrecognized, is explicitly accounted for in determining the temperature distribution and its variation with processor clock frequency. Relationships derived from first principles indicate that the rise in temperature with processor frequency for an OLTP workload is not as catastrophic as some industry brochures predict. A general framework for heat conduction in an orthotropic rectangular slab (representing a many-core processor) with stochastic values of thermal conductivity and heat generation is developed; the theoretical trend is validated against published data for OLTP workloads to obtain the temperature at the core surface as a function of clock frequency for the deterministic case. Openly available data from the Transaction Processing Performance Council (TPC), drawn from controlled, closely audited experiments with TPC-C workloads during the period 2000–2011, were analyzed to determine the relation between throughput, clock frequency, main memory size, number of cores, and power consumed. Operating systems, compilers, linkers, processor architectures, caches, main memory, and storage sizes changed drastically during this period, not to mention that hyper-threading was unknown in 2000.
This analysis yields equations for throughput and power consumed, which for the specific case of 64 processors with 32 GB of main memory and a million users become W = 1075·f^0.22. For the isotropic case, the temperature difference at the surface may be expressed for the case under study as ΔT = 71.1·f^0.22. This demonstrates that chip temperature for OLTP workloads does not rise to catastrophic values as frequency increases; the behavior differs for other types of workloads.
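A quick numerical check of what these fitted power-law equations imply (function names are ours; the constants are the abstract's):

```python
def power_watts(f_ghz):
    """Fitted power for the 64-processor, 32 GB, million-user case:
    W = 1075 * f**0.22 (constants taken from the abstract)."""
    return 1075.0 * f_ghz ** 0.22

def delta_t(f_ghz):
    """Surface temperature rise for the isotropic case:
    dT = 71.1 * f**0.22."""
    return 71.1 * f_ghz ** 0.22

# Doubling the clock raises dT only by a factor of 2**0.22 (about 1.16x),
# which is the abstract's point: heating grows mildly with frequency.
ratio = delta_t(4.0) / delta_t(2.0)
print(ratio)
```

The exponent 0.22 makes both quantities nearly flat in frequency, so even large clock increases produce modest temperature rises.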


Author(s):  
M.L. Zymbler

A discord is a refinement of the concept of an anomalous subsequence of a time series (a subsequence substantially dissimilar from all the others). The discord discovery problem arises in a wide range of application areas involving time series: medicine, economics, climate modeling, etc. In this paper we propose a new parallel discord discovery algorithm for many-core accelerators for the case when the input data fit in main memory. The algorithm exploits the ability to compute the Euclidean distances between subsequences of the time series independently. It consists of two stages: precomputation and discovery. At the precomputation stage, we construct auxiliary matrix data structures that enable parallelization and efficient vectorization of the computations on an accelerator. At the discovery stage, the algorithm searches for the discord using the constructed structures. The algorithm has been implemented for accelerators of the Intel MIC (Many Integrated Core) and NVIDIA GPU architectures, parallelized with the OpenMP and OpenACC programming technologies, respectively. Results of computational experiments confirm the scalability of the proposed algorithm.
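The independent distance computations the algorithm parallelizes can be sketched serially. Below is a minimal brute-force discord search under the standard nearest-non-self-match definition (an O(n²) illustration of the distance structure, not the paper's optimized accelerator algorithm):

```python
import math

def discord_bruteforce(ts, m):
    """Find the discord of length m in time series ts: the subsequence
    whose Euclidean distance to its nearest non-overlapping neighbor is
    largest. Every (i, j) distance is independent, which is what the
    parallel algorithm exploits."""
    n = len(ts) - m + 1
    best_idx, best_dist = -1, -1.0
    for i in range(n):
        nearest = math.inf
        for j in range(n):
            if abs(i - j) < m:          # skip trivial (overlapping) matches
                continue
            nearest = min(nearest, math.dist(ts[i:i + m], ts[j:j + m]))
        if nearest > best_dist:
            best_dist, best_idx = nearest, i
    return best_idx, best_dist
```

On a flat series with a single spike, the subsequence containing the spike is returned as the discord, since it lies farthest from its nearest non-overlapping match.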


Author(s):  
Sih-Sian Wu ◽  
Kanwen Wang ◽  
Manoj P. D. Sai ◽  
Tsung-Yi Ho ◽  
Mingbin Yu ◽  
...  

2014 ◽  
Vol 24 (06) ◽  
pp. 1450076
Author(s):  
Xiaojuan Sun ◽  
Yanhong Zheng

We investigate the effects of a time-periodic intercoupling strength on the spiking regularity of a clustered neuronal network. Inside the network, each cluster has the same regular structure, with each node modeled by a stochastic Hodgkin–Huxley neuron. From the simulation results, we find that there exist optimal frequencies of the time-periodic intercoupling strength at which the spiking regularity of the network becomes higher. Considering the amplitude of the time-periodic intercoupling strength, we find that at higher noise levels the network exhibits more regular spiking activity at some intermediate amplitude, while at lower noise levels the spiking regularity decreases as the amplitude increases. These results are independent of the number of clusters.


2019 ◽  
Vol 26 (4) ◽  
pp. 209-221

Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional databases. Apriori is a classical frequent itemset mining algorithm, which makes iterative passes over the database, combining generation of candidate itemsets from the frequent itemsets found in the previous iteration with pruning of clearly infrequent itemsets. The Dynamic Itemset Counting (DIC) algorithm is a variation of Apriori that tries to reduce the number of passes over the transactional database while keeping the number of itemsets counted in a pass relatively low. In this paper, we address the problem of accelerating DIC on the Intel Xeon Phi many-core system for the case when the transactional database fits in main memory. Intel Xeon Phi provides a large number of small compute cores with vector processing units. The paper presents a parallel implementation of DIC based on OpenMP technology and thread-level parallelism. We exploit a bit-based internal layout for transactions and itemsets. This technique reduces the memory needed to store the transactional database, reduces the support count to logical bitwise operations, and allows that step to be vectorized. Experimental evaluation on the Intel Xeon CPU and the Intel Xeon Phi coprocessor with large synthetic and real databases showed good performance and scalability of the proposed algorithm.
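The bit-based layout can be sketched as follows (a minimal Python illustration of the idea, not the paper's OpenMP implementation; names are ours): each item maps to a bitmask over the transactions, and support counting reduces to a bitwise AND of the itemset's columns followed by a popcount — exactly the step that vectorizes well.

```python
def build_columns(transactions, items):
    """One bitmask per item: bit t is set iff transaction t contains
    the item. This is the bit-based internal layout of the database."""
    cols = {}
    for item in items:
        bits = 0
        for t, txn in enumerate(transactions):
            if item in txn:
                bits |= 1 << t
        cols[item] = bits
    return cols

def support(cols, itemset):
    """Support of a non-empty itemset = popcount of the AND of its
    items' bit columns (each surviving bit is one containing transaction)."""
    items = list(itemset)
    acc = cols[items[0]]
    for item in items[1:]:
        acc &= cols[item]
    return bin(acc).count("1")

transactions = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
cols = build_columns(transactions, ["a", "b", "c"])
print(support(cols, ["a", "b"]))  # 2: transactions 0 and 3 contain both
```

A machine-word (or SIMD-register) implementation applies the same AND/popcount per word, which is why the layout both shrinks the database and vectorizes the support count.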

