An efficient distributed memory interface for many-core platform with 3D stacked DRAM

Author(s):  
Igor Loi ◽  
Luca Benini
2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Hari Radhakrishnan ◽  
Damian W. I. Rouson ◽  
Karla Morris ◽  
Sameer Shende ◽  
Stavros C. Kassinos

This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the performance bottleneck was our implementation of a collective, sequential summation procedure. We improved the scalability and achieved nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure; the Intel compiler provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
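The binary-tree replacement for the sequential summation can be sketched as follows. This is a minimal illustration in Python rather than Fortran coarrays, showing only the log2(n)-round pairing pattern over per-image partial sums; the function name and halving schedule are illustrative assumptions, not the authors' code:

```python
def tree_sum(values):
    """Binary-tree reduction over a list of per-"image" partial sums.

    Each round, image i adds the partial sum held by its partner at
    distance `step`; after ceil(log2(n)) rounds, image 0 holds the total.
    """
    vals = list(values)  # one partial sum per simulated image
    n = len(vals)
    step = 1
    while step < n:
        # Images at even multiples of 2*step pull from their partner.
        for i in range(0, n - step, 2 * step):
            vals[i] += vals[i + step]
        step *= 2
    return vals[0]

print(tree_sum(range(1, 9)))  # → 36, i.e. 1 + 2 + ... + 8
```

In the coarray version each round of the loop corresponds to one remote get plus a synchronization, so the number of sequential communication steps drops from n−1 to roughly log2(n), which is consistent with the near-linear speedup reported above.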


2021 ◽  
Vol 18 (4) ◽  
pp. 1-26
Author(s):  
Candace Walden ◽  
Devesh Singh ◽  
Meenatchi Jagasivamani ◽  
Shang Li ◽  
Luyi Kang ◽  
...  

Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU’s die. This article investigates such monolithically integrated CPU–main memory chips. We exploit non-volatile memories employing 3D crosspoint subarrays, such as resistive RAM (ReRAM), and integrate them over the CPU’s last-level cache (LLC). The regular structure of cache arrays enables co-design of the LLC and ReRAM main memory for area efficiency. We also develop a streamlined LLC/main memory interface that employs a single shared internal interconnect for both the cache and main memory arrays, and uses a unified controller to service both LLC and main memory requests. We apply our monolithic design ideas to a many-core CPU by integrating 3D ReRAM over each core’s LLC slice. We find that co-design of the LLC and ReRAM saves 27% of the total LLC–main memory area at the expense of slight increases in delay and energy. The streamlined LLC/main memory interface saves an additional 12% in area. Our simulation results show monolithic integration of CPU and main memory improves performance by 5.3× and 1.7× over HBM2 DRAM for several graph and streaming kernels, respectively. It also reduces the memory system’s energy by 6.0× and 1.7×, respectively. Moreover, we show that the area savings of co-design permit the CPU to have 23% more cores and main memory, and that streamlining the LLC/main memory interface incurs a small 4% performance penalty.
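The unified-controller idea above can be sketched as a toy software model. This is purely a hedged illustration of the single-queue, single-datapath concept, not the paper's hardware design; the class and method names, and the in-order drain policy, are assumptions made for the sketch:

```python
from collections import deque


class UnifiedController:
    """Toy model of one controller servicing both request kinds.

    Instead of separate cache and main-memory controllers, LLC and
    main-memory requests enter one shared queue and are drained over
    the same (simulated) internal interconnect.
    """

    def __init__(self):
        self.queue = deque()

    def submit(self, kind, addr):
        """Enqueue a request; kind is 'llc' or 'mem'."""
        assert kind in ("llc", "mem")
        self.queue.append((kind, addr))

    def drain(self):
        """Service all queued requests in arrival order."""
        served = []
        while self.queue:
            kind, addr = self.queue.popleft()
            served.append((kind, addr))  # same datapath for both kinds
        return served
```

Because both request kinds share one queue and one datapath, a cache request can wait behind a main-memory request, which is one intuition for the small 4% performance penalty the streamlined interface trades for its 12% area saving.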


2014 ◽  
Vol E97.C (4) ◽  
pp. 360-368
Author(s):  
Takashi MIYAMORI ◽  
Hui XU ◽  
Hiroyuki USUI ◽  
Soichiro HOSODA ◽  
Toru SANO ◽  
...  
2010 ◽  
Vol 33 (10) ◽  
pp. 1777-1787 ◽  
Author(s):  
Wei-Zhi XU ◽  
Feng-Long SONG ◽  
Zhi-Yong LIU ◽  
Dong-Rui FAN ◽  
Lei YU ◽  
...  