Partially shared cache and adaptive replacement algorithm for NoC-based many-core systems

Shared-Cache Simulation for Multi-core System with LRU2-MRU Collaborative Cache Replacement Algorithm

2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing ◽

10.1109/snpd.2012.112 ◽

2012 ◽

Author(s):

Shan Ding ◽

Shiya Lui ◽

Yuanyuan Li

Keyword(s):

Cache Replacement ◽

Core System ◽

Shared Cache ◽

Replacement Algorithm ◽

Cache Simulation

Efficient Address Mapping of Shared Cache for On-Chip Many-Core Architecture

Euro-Par 2010 - Parallel Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-642-15277-1_27 ◽

2010 ◽

pp. 280-291

Author(s):

Fenglong Song ◽

Dongrui Fan ◽

Zhiyong Liu ◽

Junchao Zhang ◽

Lei Yu ◽

...

Keyword(s):

Shared Cache ◽

Address Mapping ◽

On Chip ◽

Many Core

An Improved Multi-core Shared Cache Replacement Algorithm

2012 11th International Symposium on Distributed Computing and Applications to Business, Engineering & Science ◽

10.1109/dcabes.2012.39 ◽

2012 ◽

Cited By ~ 3

Author(s):

Fang Juan ◽

Li Chengyan

Keyword(s):

Cache Replacement ◽

Shared Cache ◽

Replacement Algorithm

High Latency and Contention on Shared L2-Cache for Many-Core Architectures

Parallel Processing Letters ◽

10.1142/s0129626411000096 ◽

2011 ◽

Vol 21 (01) ◽

pp. 85-106 ◽

Cited By ~ 2

Author(s):

Marco A. Z. Alves ◽

Henrique C. Freitas ◽

Philippe O. A. Navaux

Keyword(s):

Execution Time ◽

Chip Multiprocessor ◽

Cache Memory ◽

Shared Cache ◽

Shared Caches ◽

L2 Cache ◽

Low Performance ◽

Many Core

Several studies point out the benefits of a shared L2 cache, but other properties of shared caches must also be considered to fully understand the bottlenecks of chip multiprocessors (CMPs). Our paper evaluates and explains shared cache bottlenecks, which matter increasingly with the rise of many-core processors. The results of our simulations with 32 cores show low performance when L2 cache memory is shared between 2 or 4 cores. In these two cases, increased L2 cache latency and contention are the main causes of the longer execution time.

Autonomic Diffusive Load Balancing on Many-Core Architecture Using Simulated Annealing

IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences ◽

10.1587/transfun.e100.a.1640 ◽

2017 ◽

Vol E100.A (8) ◽

pp. 1640-1649

Author(s):

Hyunjik SONG ◽

Kiyoung CHOI

Keyword(s):

Simulated Annealing ◽

Load Balancing ◽

Many Core

Architecture and Evaluation of Low Power Many-Core SoC with Two 32-Core Clusters

IEICE Transactions on Electronics ◽

10.1587/transele.e97.c.360 ◽

2014 ◽

Vol E97.C (4) ◽

pp. 360-368

Author(s):

Takashi MIYAMORI ◽

Hui XU ◽

Hiroyuki USUI ◽

Soichiro HOSODA ◽

Toru SANO ◽

...

Keyword(s):

Low Power ◽

Many Core

On Synchronization and Evaluation Method of Chipped Many-Core Processor

Chinese Journal of Computers ◽

10.3724/sp.j.1016.2010.01777 ◽

2010 ◽

Vol 33 (10) ◽

pp. 1777-1787 ◽

Cited By ~ 1

Author(s):

Wei-Zhi XU ◽

Feng-Long SONG ◽

Zhi-Yong LIU ◽

Dong-Rui FAN ◽

Lei YU ◽

...

Keyword(s):

Evaluation Method ◽

Many Core

Using Write Mask to Support Hybrid Write-Back and Write-Through Cache Policy on Many-Core Architectures

Chinese Journal of Computers ◽

10.3724/sp.j.1016.2008.01918 ◽

2009 ◽

Vol 31 (11) ◽

pp. 1918-1928 ◽

Cited By ~ 3

Author(s):

Wei LIN ◽

Xiao-Chun YE ◽

Feng-Long SONG ◽

Hao ZHANG

Keyword(s):

Many Core

Research on highly parallel embedded control system design and implementation method

Impact ◽

10.21820/23987073.2019.10.44 ◽

2019 ◽

Vol 2019 (10) ◽

pp. 44-46

Author(s):

Masato Edahiro ◽

Masaki Gondo

Keyword(s):

Computer Architecture ◽

Intelligent Systems ◽

Large Scale ◽

General Purpose ◽

Heterogeneous Structure ◽

Single Chip ◽

Powertrain Control ◽

Processing Power ◽

Hardware Description ◽

Many Core

The pace of technological advancement is ever-increasing, and intelligent systems, such as those found in robots and vehicles, have become larger and more complex. These intelligent systems have a heterogeneous structure, comprising a mixture of modules such as artificial intelligence (AI) and powertrain control modules that facilitate large-scale numerical calculation and real-time periodic processing functions. Information technology expert Professor Masato Edahiro, from the Graduate School of Informatics at Nagoya University in Japan, explains that concurrent advances in semiconductor research have led to the miniaturisation of semiconductors, allowing a greater number of processors to be mounted on a single chip and increasing potential processing power. 'In addition to general-purpose processors such as CPUs, a mixture of multiple types of accelerators such as GPGPU and FPGA has evolved, producing a more complex and heterogeneous computer architecture,' he says. Edahiro and his partners have been working on the eMBP, a model-based parallelizer (MBP) that offers a mapping system as an efficient way of automatically generating parallel code for multi- and many-core systems. Once the hardware description is written, eMBP can bridge the gap between software and hardware, giving hardware vendors an efficient ecosystem and eliminating the need for software vendors to adapt code for their particular platforms.

MILC Code Performance on High End CPU and GPU Supercomputer Clusters

EPJ Web of Conferences ◽

10.1051/epjconf/201817502009 ◽

2018 ◽

Vol 175 ◽

pp. 02009

Author(s):

Carleton DeTar ◽

Steven Gottlieb ◽

Ruizi Li ◽

Doug Toussaint

Keyword(s):

Conjugate Gradient ◽

Memory Hierarchy ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Code Performance ◽

Recent Developments ◽

Knights Landing ◽

Many Core ◽

Intel Xeon

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
