An analysis of on-chip interconnection networks for large-scale chip multiprocessors

In recent years, embedded systems have evolved to offer capabilities previously found only in high-performance systems. Portable devices already include on-chip multiprocessors (such as the PowerPC 476FP or ARM Cortex-A9 MP), usually multi-threaded, and a powerful multi-level on-chip cache hierarchy. As most of these systems are battery-powered, power consumption becomes a critical issue. Achieving both high performance and low power consumption is a complex challenge, for which some proposals have already been made. Suarez et al. proposed a new on-chip cache hierarchy, LP-NUCA (Low Power NUCA), which reduces access latency by taking advantage of NUCA (Non-Uniform Cache Architecture) properties. The key points are decoupling the functionality and using three specialized on-chip networks. This structure has proved efficient for data hierarchies, achieving good performance while reducing energy consumption. Instruction caches, on the other hand, have different requirements and characteristics from data caches, which conflict with the requirements of low-power embedded systems, especially in SMT (simultaneous multi-threading) environments. We want to study the benefits of using small tiled caches for the instruction hierarchy, so we propose a new design, ID-LP-NUCAs. We therefore need to completely re-evaluate our previous design in terms of structure design, interconnection networks (including topologies, flow control, and routing), content management (with special interest in hardware/software content allocation policies), and structure sharing. In CMP (chip multiprocessor) environments with parallel workloads, coherence plays an important role and must be taken into consideration.


A study on optimally co-scheduling jobs of different lengths on chip multiprocessors

Proceedings of the 6th ACM conference on Computing frontiers - CF '09 ◽

10.1145/1531743.1531752 ◽

2009 ◽

Cited By ~ 23

Author(s):

Kai Tian ◽

Yunlian Jiang ◽

Xipeng Shen

Keyword(s):

Chip Multiprocessors ◽

On Chip


Fast and Cycle-Accurate Emulation of Large-Scale Networks-on-Chip Using a Single FPGA

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3151758 ◽

2017 ◽

Vol 10 (4) ◽

pp. 1-27 ◽

Cited By ~ 4

Author(s):

Thiem Van Chu ◽

Shimpei Sato ◽

Kenji Kise

Keyword(s):

Large Scale ◽

Networks On Chip ◽

On Chip ◽

Large Scale Networks


Embedded RAIDs-on-chip for bus-based chip-multiprocessors

ACM Transactions on Embedded Computing Systems ◽

10.1145/2533316 ◽

2014 ◽

Vol 13 (4) ◽

pp. 1-36 ◽

Cited By ~ 2

Author(s):

Luis Angel D. Bathen ◽

Nikil D. Dutt

Keyword(s):

Chip Multiprocessors ◽

On Chip


Nanolaser-based emulators of spin Hamiltonians

Nanophotonics ◽

10.1515/nanoph-2020-0230 ◽

2020 ◽

Vol 9 (13) ◽

pp. 4193-4198 ◽

Cited By ~ 2

Author(s):

Midya Parto ◽

William E. Hayenga ◽

Alireza Marandi ◽

Demetrios N. Christodoulides ◽

Mercedeh Khajavikhan

Keyword(s):

Large Scale ◽

Optimization Problems ◽

Exchange Interactions ◽

Np Hard ◽

Spin Models ◽

Geometric Frustration ◽

Spin Hamiltonians ◽

Hard Problems ◽

On Chip ◽

Classical Spin Models

Finding the solution to a large category of optimization problems, known as the NP-hard class, requires an exponentially increasing solution time using conventional computers. Lately, there have been intense efforts to develop alternative computational methods capable of addressing such tasks. In this regard, spin Hamiltonians, which originally arose in describing exchange interactions in magnetic materials, have recently been pursued as a powerful computational tool. Along these lines, it has been shown that solving NP-hard problems can be effectively mapped into finding the ground state of certain types of classical spin models. Here, we show that arrays of metallic nanolasers provide an ultra-compact, on-chip platform capable of implementing spin models, including the classical Ising and XY Hamiltonians. Various regimes of behavior, including ferromagnetic and antiferromagnetic ordering as well as geometric frustration, are observed in these structures. Our work paves the way towards nanoscale spin emulators that enable efficient modeling of large-scale complex networks.
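
As a rough illustration of what "finding the ground state of a classical spin model" means, the sketch below brute-forces a tiny Ising system in software. It is not the nanolaser method from the abstract. The sign convention E = -Σ J_ij s_i s_j, the function names `ising_energy` and `ground_states`, and the three-spin triangle are all assumptions chosen for the example.

```python
from itertools import product

def ising_energy(spins, couplings):
    """Energy E = -sum over pairs of J_ij * s_i * s_j (assumed sign convention)."""
    return -sum(J * spins[i] * spins[j] for (i, j), J in couplings.items())

def ground_states(n, couplings):
    """Exhaustively search all 2**n spin configurations for the minimum energy."""
    best_energy, best_states = None, []
    for spins in product((-1, 1), repeat=n):
        energy = ising_energy(spins, couplings)
        if best_energy is None or energy < best_energy:
            best_energy, best_states = energy, [spins]
        elif energy == best_energy:
            best_states.append(spins)
    return best_energy, best_states

# Hypothetical 3-spin triangle with equal couplings on every edge.
ferro = {(0, 1): 1, (1, 2): 1, (0, 2): 1}         # J > 0: spins prefer to align
antiferro = {(0, 1): -1, (1, 2): -1, (0, 2): -1}  # J < 0: spins prefer to anti-align

ferro_energy, ferro_states = ground_states(3, ferro)
af_energy, af_states = ground_states(3, antiferro)
```

With ferromagnetic couplings the two fully aligned configurations are the only ground states. With antiferromagnetic couplings on a triangle, no configuration can satisfy all three bonds, so the minimum is shared by six configurations; this is a small instance of geometric frustration. The exhaustive loop visits 2^n configurations, which is the exponential cost that motivates alternative solvers.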


Simba

Communications of the ACM ◽

10.1145/3460227 ◽

2021 ◽

Vol 64 (6) ◽

pp. 107-116

Author(s):

Yakun Sophia Shao ◽

Jason Clemons ◽

Rangharajan Venkatesan ◽

Brian Zimmer ◽

Matthew Fojtik ◽

...

Keyword(s):

Deep Learning ◽

Large Scale ◽

Data Locality ◽

Coarse Grained ◽

Batch Size ◽

Peak Performance ◽

Large Scale Systems ◽

High Area ◽

On Chip ◽

And Storage

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
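
The throughput and latency figures in the last sentence can be cross-checked with simple arithmetic. The sketch below assumes that, at batch size one, only one image is in flight at a time, so latency is the reciprocal of throughput; the helper name `batch_latency_ms` is hypothetical.

```python
def batch_latency_ms(images_per_second, batch_size=1):
    """Latency of one batch in milliseconds, assuming batches run back to back
    with no overlap (an assumption for this check, not a claim about Simba)."""
    return 1000.0 * batch_size / images_per_second

# Figures quoted in the abstract: 1988 images/s at batch size one.
latency = batch_latency_ms(1988, batch_size=1)
```

Under that assumption, 1988 images/s corresponds to roughly 0.503 ms per image, consistent with the 0.50 ms latency quoted in the abstract.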


Task and Communication Allocation for Real-time Tasks to Networks-on-Chip Multiprocessors

2020 Second International Conference on Embedded & Distributed Systems (EDiS) ◽

10.1109/edis49545.2020.9296446 ◽

2020 ◽

Author(s):

Chawki Benchehida ◽

Mohammed Kamel Benhaoua ◽

Houssam-Eddine Zahaf ◽

Giuseppe Lipari

Keyword(s):

Real Time ◽

Chip Multiprocessors ◽

Networks On Chip ◽

On Chip
