Comparative Performance Evaluation of Modern Heterogeneous High-Performance Computing Systems CPUs

Electronics ◽  
2020 ◽  
Vol 9 (6) ◽  
pp. 1035
Author(s):  
Aleksei Sorokin ◽  
Sergey Malkovsky ◽  
Georgiy Tsoy ◽  
Alexander Zatsarinnyy ◽  
Konstantin Volovich

The study presents a comparison of computing systems based on IBM POWER8, IBM POWER9, and Intel Xeon Platinum 8160 processors running parallel applications. Memory subsystem bandwidth was studied, parallel programming technologies were compared, and the operating modes and capabilities of simultaneous multithreading (SMT) were analyzed. Performance of the studied systems on parallel applications based on OpenMP and MPI was evaluated using the NAS Parallel Benchmarks. The experimental results led to the conclusion that the IBM POWER8 and Intel Xeon Platinum 8160 systems have almost the same maximum memory bandwidth but require a different number of threads to utilize it efficiently. The IBM POWER9 system has the highest maximum bandwidth, which can be attributed to its larger number of memory channels per socket. Based on the numerical experiments, recommendations are given on how hardware of this grade can be applied to various scientific problems, including the optimal choice of processor architecture for high-performance hybrid computing platforms.

Author(s):  
Masahiro Nakao ◽  
Hitoshi Murai ◽  
Hidetoshi Iwashita ◽  
Taisuke Boku ◽  
Mitsuhisa Sato

To improve productivity when developing parallel applications on high-performance computing systems, the XcalableMP PGAS language has been proposed. XcalableMP supports both typical directive-based parallelization under the “global-view memory model” and flexible coarray-based parallelization under the “local-view memory model”. The goal of the present paper is to clarify XcalableMP’s productivity and performance. To do so, we implement and evaluate the HPC Challenge benchmarks, namely EP STREAM Triad, High-Performance Linpack, Global fast Fourier transform, and RandomAccess, on the K computer using up to 16,384 compute nodes and on a generic cluster system using up to 128 compute nodes. We found that the benchmarks were easier to implement in XcalableMP than in MPI. Moreover, most of the performance results with XcalableMP were almost the same as those with MPI.


2019 ◽  
Vol 29 (3) ◽  
pp. 33-40
Author(s):  
A. E. Ometov ◽  
A. A. Vinogradov ◽  
A. S. Vorobiev

The article describes experiments carried out during post-silicon verification of the Elbrus-8CB microprocessor, one of the important stages of the verification process that largely determines whether high-performance computing systems consisting of several microprocessors of this series can be built. The interprocessor communication channels of the Elbrus-8CB were investigated, and hypotheses were put forward about the reasons for their low operating speed. Experiments conducted to validate these hypotheses are described, with intermediate conclusions drawn from their results. The built-in testing mechanism of the CEI-6G and PCIe 2.0 physical layers is described, along with its operating modes and testing algorithm. Several studies were carried out to ensure the correctness of this testing mechanism, which led to modifications of the initial testing method. Final conclusions are drawn about the reasons for the incorrect operation of the interprocessor links, and recommendations are given for improving the attenuation characteristics of the high-speed communication signals and their interference immunity. The relevance of this study for the production of modern high-performance computing systems is evident not only in designers' growing interest in this problem, but also in the tightening requirements imposed by physical-layer manufacturers.


Author(s):  
Вл.В. Воеводин

One of the fundamental problems of high-performance computing is the necessity of carefully matching the algorithmic structure of parallel programs with the features of a particular computer architecture. The performance capabilities of modern computers are significant; however, a computer's efficiency drastically decreases if such a matching is not achieved at even one stage of solving a problem.
The AlgoWiki project is based on the fact that the properties of algorithms themselves do not depend on computing systems, existing or future. In other words, a detailed description of the machine-independent properties of an algorithm needs to be written only once; after that, it can be reused many times when implementing the algorithm in various hardware/software environments. Equally important is the project's machine-dependent part, devoted to describing the peculiarities of algorithm implementation on particular hardware/software platforms. The main result of the project, to which this article is devoted, is AlgoWiki, an open encyclopedia of the properties of algorithms and the peculiarities of their implementation on various computing systems. The ability to reveal, describe, analyze, and interpret the properties of algorithms will be in wide demand within a few years, both for top-range exaflops supercomputer systems and for all other computing platforms, from servers to mobile devices.


2018 ◽  
Author(s):  
Istvan Z. Reguly ◽  
Devaraj Gopinathan ◽  
Joakim H. Beck ◽  
Michael B. Giles ◽  
Serge Guillas ◽  
...  

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and its implementation: a finite-volume non-linear shallow water equations (NSWE) solver built on the OP2 domain-specific language for unstructured-mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: CPUs, the Intel Xeon Phi, and GPUs. This is achieved by keeping the scientific code separate from the various parallel implementations, enabling easy maintainability. The solver has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 delivers productivity to its users, as well as performance and portability across a number of platforms.


Author(s):  
Nikolay Kondratyuk ◽  
Vsevolod Nikolskiy ◽  
Daniil Pavlov ◽  
Vladimir Stegailov

Classical molecular dynamics (MD) calculations account for a significant share of the utilization time of high-performance computing systems. The efficiency of such calculations depends on an interplay of software and hardware that is nowadays moving toward hybrid GPU-based technologies. Several well-developed open-source MD codes targeting GPUs differ both in their data-management capabilities and in performance. In this work, we analyze the performance of the LAMMPS, GROMACS, and OpenMM MD packages with different GPU backends on Nvidia Volta and AMD Vega20 GPUs. We consider the efficiency of solving two identical MD models (generic for materials science and for biomolecular studies) using different software and hardware combinations. We describe our experience porting the CUDA backend of LAMMPS to ROCm HIP, which shows considerable benefits for AMD GPUs compared to the OpenCL backend.

