An efficient distributed memory interface for many-core platform with 3D stacked DRAM

Author(s):  
Igor Loi ◽  
Luca Benini
2015 ◽  
Vol 2015 ◽  
pp. 1-12
Author(s):  
Hari Radhakrishnan ◽  
Damian W. I. Rouson ◽  
Karla Morris ◽  
Sameer Shende ◽  
Stavros C. Kassinos

This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were done using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the performance bottleneck was our implementation of a collective, sequential summation procedure. We improved the scalability and achieved nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure; the Intel compiler provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.
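The binary-tree replacement for the sequential summation can be sketched as follows. This is a minimal illustration in Python rather than Fortran coarrays, showing only the log2(n)-round pairing pattern over per-image partial sums; the function name and halving schedule are illustrative assumptions, not the authors' code:

```python
def tree_sum(values):
    """Binary-tree reduction over a list of per-"image" partial sums.

    Each round, image i adds the partial sum held by its partner at
    distance `step`; after ceil(log2(n)) rounds, image 0 holds the total.
    """
    vals = list(values)  # one partial sum per simulated image
    n = len(vals)
    step = 1
    while step < n:
        # Images at even multiples of 2*step pull from their partner.
        for i in range(0, n - step, 2 * step):
            vals[i] += vals[i + step]
        step *= 2
    return vals[0]

print(tree_sum(range(1, 9)))  # → 36, i.e. 1 + 2 + ... + 8
```

In the coarray version each round of the loop corresponds to one remote get plus a synchronization, so the number of sequential communication steps drops from n−1 to roughly log2(n), which is consistent with the near-linear speedup reported above.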


2021 ◽  
Vol 18 (4) ◽  
pp. 1-26
Author(s):  
Candace Walden ◽  
Devesh Singh ◽  
Meenatchi Jagasivamani ◽  
Shang Li ◽  
Luyi Kang ◽  
...  

Many emerging non-volatile memories are compatible with CMOS logic, potentially enabling their integration into a CPU’s die. This article investigates such monolithically integrated CPU–main memory chips. We exploit non-volatile memories employing 3D crosspoint subarrays, such as resistive RAM (ReRAM), and integrate them over the CPU’s last-level cache (LLC). The regular structure of cache arrays enables co-design of the LLC and ReRAM main memory for area efficiency. We also develop a streamlined LLC/main memory interface that employs a single shared internal interconnect for both the cache and main memory arrays, and uses a unified controller to service both LLC and main memory requests. We apply our monolithic design ideas to a many-core CPU by integrating 3D ReRAM over each core’s LLC slice. We find that co-design of the LLC and ReRAM saves 27% of the total LLC–main memory area at the expense of slight increases in delay and energy. The streamlined LLC/main memory interface saves an additional 12% in area. Our simulation results show monolithic integration of CPU and main memory improves performance by 5.3× and 1.7× over HBM2 DRAM for several graph and streaming kernels, respectively. It also reduces the memory system’s energy by 6.0× and 1.7×, respectively. Moreover, we show that the area savings of co-design permit the CPU to have 23% more cores and main memory, and that streamlining the LLC/main memory interface incurs a small 4% performance penalty.
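The unified-controller idea above can be sketched as a toy software model. This is purely a hedged illustration of the single-queue, single-datapath concept, not the paper's hardware design; the class and method names, and the in-order drain policy, are assumptions made for the sketch:

```python
from collections import deque


class UnifiedController:
    """Toy model of one controller servicing both request kinds.

    Instead of separate cache and main-memory controllers, LLC and
    main-memory requests enter one shared queue and are drained over
    the same (simulated) internal interconnect.
    """

    def __init__(self):
        self.queue = deque()

    def submit(self, kind, addr):
        """Enqueue a request; kind is 'llc' or 'mem'."""
        assert kind in ("llc", "mem")
        self.queue.append((kind, addr))

    def drain(self):
        """Service all queued requests in arrival order."""
        served = []
        while self.queue:
            kind, addr = self.queue.popleft()
            served.append((kind, addr))  # same datapath for both kinds
        return served
```

Because both request kinds share one queue and one datapath, a cache request can wait behind a main-memory request, which is one intuition for the small 4% performance penalty the streamlined interface trades for its 12% area saving.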


2014 ◽  
Vol E97.C (4) ◽  
pp. 360-368
Author(s):  
Takashi MIYAMORI ◽  
Hui XU ◽  
Hiroyuki USUI ◽  
Soichiro HOSODA ◽  
Toru SANO ◽  
...  
2010 ◽  
Vol 33 (10) ◽  
pp. 1777-1787 ◽  
Author(s):  
Wei-Zhi XU ◽  
Feng-Long SONG ◽  
Zhi-Yong LIU ◽  
Dong-Rui FAN ◽  
Lei YU ◽  
...  