Programming many-core architectures - a case study: dense matrix computations on the Intel single-chip cloud computer processor

Bryan Marker; Ernie Chan; Jack Poulson; Robert van de Geijn; Rob F. Van der Wijngaart; Timothy G. Mattson; Theodore E. Kubaska

doi:10.1002/cpe.1832

Research on highly parallel embedded control system design and implementation method

Impact ◽

10.21820/23987073.2019.10.44 ◽

2019 ◽

Vol 2019 (10) ◽

pp. 44-46

Author(s):

Masato Edahiro ◽

Masaki Gondo

Keyword(s):

Computer Architecture ◽

Intelligent Systems ◽

Large Scale ◽

General Purpose ◽

Heterogeneous Structure ◽

Single Chip ◽

Powertrain Control ◽

Processing Power ◽

Hardware Description ◽

Many Core

The pace of technological advancement is ever-increasing, and intelligent systems, such as those found in robots and vehicles, have become larger and more complex. These intelligent systems have a heterogeneous structure, comprising a mixture of modules such as artificial intelligence (AI) and powertrain control modules that facilitate large-scale numerical calculation and real-time periodic processing functions. Information technology expert Professor Masato Edahiro, from the Graduate School of Informatics at Nagoya University in Japan, explains that concurrent advances in semiconductor research have led to the miniaturisation of semiconductors, allowing a greater number of processors to be mounted on a single chip and increasing potential processing power. 'In addition to general-purpose processors such as CPUs, a mixture of multiple types of accelerators such as GPGPU and FPGA has evolved, producing a more complex and heterogeneous computer architecture,' he says. Edahiro and his partners have been working on the eMBP, a model-based parallelizer (MBP) that offers a mapping system as an efficient way of automatically generating parallel code for multi- and many-core systems. Once the hardware description is written, eMBP can bridge the gap between software and hardware, so that hardware vendors gain an efficient ecosystem and software vendors no longer need to adapt code for their particular platforms.

A Survey of Software-Defined Networks-on-Chip: Motivations, Challenges and Opportunities

Micromachines ◽

10.3390/mi12020183 ◽

2021 ◽

Vol 12 (2) ◽

pp. 183

Author(s):

Jose Ricardo Gomez-Rodriguez ◽

Remberto Sandoval-Arechiga ◽

Salvador Ibarra-Delgado ◽

Viktor Ivan Rodriguez-Abdala ◽

Jose Luis Vazquez-Avila ◽

...

Keyword(s):

Single Chip ◽

Synthesis Time ◽

Networks On Chip ◽

Data Dependencies ◽

Layered Architecture ◽

Systems On Chip ◽

Challenges And Opportunities ◽

Computing Platforms ◽

On Chip ◽

Many Core

Current computing platforms encourage the integration of thousands of processing cores, and their interconnections, into a single chip. Mobile smartphones, IoT, embedded devices, desktops, and data centers use Many-Core Systems-on-Chip (SoCs) to exploit their compute power and parallelism to meet dynamic workload requirements. Networks-on-Chip (NoCs) provide scalable connectivity for diverse applications with distinct traffic patterns and data dependencies. However, when the system executes various applications in traditional NoCs (optimized and fixed at synthesis time), the mismatch between the interconnection and the different applications' requirements limits performance. In the literature, NoC designs have embraced the Software-Defined Networking (SDN) strategy to evolve into an adaptable interconnection solution for future chips. However, the works surveyed implement a partial Software-Defined Network-on-Chip (SDNoC) approach, leaving aside the SDN layered architecture that brings interoperability in conventional networking. This paper explores the SDNoC literature and classifies it according to the desired SDN features that each work presents. We then describe the challenges and opportunities detected from the literature survey. Moreover, we explain the motivation for an SDNoC approach, and we present both SDN and SDNoC concepts and architectures. We observe that works in the literature employ an incomplete layered SDNoC approach. This fact creates various fertile areas in the SDNoC architecture where researchers may contribute to Many-Core SoC designs.

Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+

Journal of Computer Science and Technology ◽

10.1007/s11390-020-0741-6 ◽

2021 ◽

Vol 36 (1) ◽

pp. 33-43

Author(s):

Jian-Bin Fang ◽

Xiang-Ke Liao ◽

Chun Huang ◽

De-Zun Dong

Keyword(s):

Performance Evaluation ◽

Many Core

Scientific Programming with High Performance Fortran: A Case Study Using the xHPF Compiler

Scientific Programming ◽

10.1155/1997/528513 ◽

1997 ◽

Vol 6 (1) ◽

pp. 127-152

Author(s):

Eric De Sturler ◽

Volker Strumpen

Keyword(s):

High Performance ◽

Parallel Implementation ◽

Gaussian Elimination ◽

Primary Objective ◽

Matrix Product ◽

Dense Matrix ◽

High Performance Fortran ◽

Partial Pivoting ◽

Intel Paragon

Recently, the first commercial High Performance Fortran (HPF) subset compilers have appeared. This article reports on our experiences with the xHPF compiler of Applied Parallel Research, version 1.2, for the Intel Paragon. At this stage, we do not expect very high performance from our HPF programs, even though performance will eventually be of paramount importance for the acceptance of HPF. Instead, our primary objective is to study how to convert large Fortran 77 (F77) programs to HPF such that the compiler generates reasonably efficient parallel code. We report on a case study that identifies several problems when parallelizing code with HPF; most of these problems affect current HPF compiler technology in general, although some are specific to the xHPF compiler. We discuss our solutions from the perspective of the scientific programmer, and present timing results on the Intel Paragon. The case study comprises three programs of different complexity with respect to parallelization. We use the dense matrix-matrix product to show that the distribution of arrays and the order of nested loops significantly influence the performance of the parallel program. We use Gaussian elimination with partial pivoting to study the parallelization strategy of the compiler. There are various ways to structure this algorithm for a particular data distribution. This example shows how much effort may be demanded from the programmer to support the compiler in generating an efficient parallel implementation. Finally, we use a small application to show that the more complicated structure of a larger program may introduce problems for the parallelization, even though all subroutines of the application are easy to parallelize by themselves. The application consists of a finite volume discretization on a structured grid and a nested iterative solver.
Our case study shows that it is possible to obtain reasonably efficient parallel programs with xHPF, although the compiler needs substantial support from the programmer.

Optimized System-Level Design Methods for NoC-Based Many Core Embedded Systems

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Handbook of Research on Embedded Systems Design ◽

10.4018/978-1-4666-6194-3.ch007 ◽

2014 ◽

pp. 150-179

Author(s):

Haoyuan Ying ◽

Klaus Hofmann ◽

Thomas Hollstein

Keyword(s):

Embedded Systems ◽

Design Optimization ◽

Optimization Methods ◽

Design Methods ◽

System Level ◽

System Level Design ◽

Level Design ◽

On Chip ◽

Many Core

Due to the growing demand for high performance and low power in embedded systems, many-core architectures have been proposed as the most suitable solutions. As the design focus of many-core embedded systems shifts from computation-centric to communication-centric, Network-on-Chip (NoC) is one of the best interconnect techniques for such architectures because of its scalability and high communication bandwidth. Formalized and optimized system-level design methods for NoC-based many-core embedded systems are desired to improve system performance and to reduce power consumption. In order to understand the design optimization methods in depth, this chapter demonstrates a case study of optimizing many-core embedded systems based on 3-Dimensional (3D) NoC with irregular vertical link distribution topology through task mapping, core placement, routing, and topology generation. Results of cycle-accurate simulation experiments demonstrate the validity and efficiency of the design methods. For the case study configuration, applying the design optimization methods saves up to 60% of the vertical links while maintaining system efficiency, compared with 3D NoCs that have full vertical link connection.

Algorithm-oriented design of efficient many-core architectures applied to dense matrix multiplication

Analog Integrated Circuits and Signal Processing ◽

10.1007/s10470-014-0441-7 ◽

2014 ◽

Vol 82 (1) ◽

pp. 147-158

Author(s):

Wilson M. José ◽

Ana Rita Silva ◽

Mário P. Véstias ◽

Horácio C. Neto

Keyword(s):

Matrix Multiplication ◽

Dense Matrix ◽

Many Core

Long pipelines in single-chip digital signal processors-concepts and case study

IEEE Transactions on Circuits and Systems ◽

10.1109/31.101307 ◽

1991 ◽

Vol 38 (1) ◽

pp. 100-108 ◽

Cited By ~ 7

Author(s):

R. Ernst

Keyword(s):

Digital Signal ◽

Digital Signal Processors ◽

Single Chip ◽

Signal Processors

The ROSACE case study: From Simulink specification to multi/many-core execution

2014 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS) ◽

10.1109/rtas.2014.6926012 ◽

2014 ◽

Cited By ~ 28

Author(s):

Claire Pagetti ◽

David Saussie ◽

Romain Gratia ◽

Eric Noulard ◽

Pierre Siron

Keyword(s):

Many Core

Performance evaluation of many‐core systems: case study with TILEPro64

IET Computers & Digital Techniques ◽

10.1049/iet-cdt.2012.0101 ◽

2013 ◽

Vol 7 (4) ◽

pp. 143-154

Author(s):

Han‐Yee Kim ◽

Young‐Hwan Kim ◽

HeonChang Yu ◽

Taeweon Suh

Keyword(s):

Performance Evaluation ◽

Many Core

Many-core needs fine-grained scheduling: A case study of query processing on Intel Xeon Phi processors

Journal of Parallel and Distributed Computing ◽

10.1016/j.jpdc.2017.09.005 ◽

2018 ◽

Vol 120 ◽

pp. 395-404 ◽

Cited By ~ 3

Author(s):

Xuntao Cheng ◽

Bingsheng He ◽

Mian Lu ◽

Chiew Tong Lau

Keyword(s):

Query Processing ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Fine Grained ◽

Many Core ◽

Intel Xeon
