Comparative Performance Evaluation of Modern Heterogeneous High-Performance Computing Systems CPUs

Electronics ◽  
2020 ◽  
Vol 9 (6) ◽  
pp. 1035
Author(s):  
Aleksei Sorokin ◽  
Sergey Malkovsky ◽  
Georgiy Tsoy ◽  
Alexander Zatsarinnyy ◽  
Konstantin Volovich

The study presents a comparison of computing systems based on IBM POWER8, IBM POWER9, and Intel Xeon Platinum 8160 processors running parallel applications. Memory subsystem bandwidth was studied, parallel programming technologies were compared, and the operating modes and capabilities of simultaneous multithreading (SMT) were analyzed. Performance of the studied systems on parallel applications based on OpenMP and MPI was evaluated using the NAS Parallel Benchmarks. The experimental results led to the conclusion that the IBM POWER8 and Intel Xeon Platinum 8160 systems have almost the same maximum memory bandwidth but require a different number of threads to utilize it efficiently. The IBM POWER9 system has the highest maximum bandwidth, which can be attributed to its larger number of memory channels per socket. Based on the numerical experiments, recommendations are given on how hardware of this grade can be applied to various scientific problems, including the optimal choice of processor architecture for high-performance hybrid computing platforms.

Author(s):  
Masahiro Nakao ◽  
Hitoshi Murai ◽  
Hidetoshi Iwashita ◽  
Taisuke Boku ◽  
Mitsuhisa Sato

To improve productivity when developing parallel applications on high-performance computing systems, the XcalableMP PGAS language has been proposed. XcalableMP supports both typical directive-based parallelization under the “global-view memory model” and flexible coarray-based parallelization under the “local-view memory model”. The goal of the present paper is to clarify XcalableMP’s productivity and performance. To do so, we implement and evaluate the HPC Challenge benchmarks, namely EP STREAM Triad, High-Performance Linpack, Global fast Fourier transform, and RandomAccess, on the K computer using up to 16,384 compute nodes and on a generic cluster system using up to 128 compute nodes. We found that the benchmarks were easier to implement in XcalableMP than in MPI. Moreover, most of the performance results with XcalableMP were almost the same as those with MPI.


2019 ◽  
Vol 29 (3) ◽  
pp. 33-40
Author(s):  
A. E. Ometov ◽  
A. A. Vinogradov ◽  
A. S. Vorobiev

The article describes experiments carried out during post-silicon verification of the Elbrus-8CB microprocessor, one of the important stages of the verification process that largely determines whether high-performance computing systems consisting of several microprocessors of this series can be built. The interprocessor communication channels of the Elbrus-8CB were investigated, and hypotheses were put forward about the reasons for their low operating speed. Experiments conducted to validate these hypotheses are described, with intermediate conclusions drawn from their results. The built-in testing mechanism of the CEI-6G and PCIe 2.0 physical layers is described, along with its operating modes and testing algorithm. Several studies were carried out to ensure the correctness of this testing mechanism, which led to modifications of the initial testing method. Final conclusions are drawn about the reasons for the incorrect operation of the interprocessor links, and recommendations are given for improving the attenuation characteristics of the high-speed communication signals and their interference immunity. The relevance of this study for the production of modern high-performance computing systems is evident not only in designers' growing interest in this problem, but also in the tightening requirements imposed by physical-layer manufacturers.


Author(s):  
Вл.В. Воеводин

One of the fundamental problems of high-performance computing is the necessity of carefully matching the algorithmic structure of parallel programs with the features of a particular computer architecture. The performance capabilities of modern computers are significant; however, a computer's efficiency drastically decreases if such a matching is not achieved at even one stage of solving a problem.
The AlgoWiki project is based on the fact that the properties of algorithms themselves do not depend on computing systems, existing or future. In other words, a detailed description of the machine-independent properties of an algorithm needs to be written only once; after that, it can be reused many times when implementing the algorithm in various hardware/software environments. Equally important is the project's machine-dependent part, devoted to describing the peculiarities of algorithm implementation on particular hardware/software platforms. The main result of the project, to which this article is devoted, is AlgoWiki, an open encyclopedia of the properties of algorithms and the peculiarities of their implementation on various computing systems. The ability to reveal, describe, analyze, and interpret the properties of algorithms will be in wide demand within a few years, both for top-range exaflops supercomputer systems and for all other computing platforms, from servers to mobile devices.


2018 ◽  
Author(s):  
Istvan Z. Reguly ◽  
Devaraj Gopinathan ◽  
Joakim H. Beck ◽  
Michael B. Giles ◽  
Serge Guillas ◽  
...  

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and its implementation: a finite-volume non-linear shallow water equations (NSWE) solver built on the OP2 domain-specific language for unstructured-mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: CPUs, the Intel Xeon Phi, and GPUs. This is achieved by keeping the scientific code separate from the various parallel implementations, enabling easy maintainability. The solver has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 delivers productivity to its users, as well as performance and portability across a number of platforms.


Author(s):  
Nikolay Kondratyuk ◽  
Vsevolod Nikolskiy ◽  
Daniil Pavlov ◽  
Vladimir Stegailov

Classical molecular dynamics (MD) calculations account for a significant share of the utilization time of high-performance computing systems. The efficiency of such calculations depends on an interplay of software and hardware that is nowadays moving toward hybrid GPU-based technologies. Several well-developed open-source MD codes targeting GPUs differ both in their data-management capabilities and in performance. In this work, we analyze the performance of the LAMMPS, GROMACS, and OpenMM MD packages with different GPU backends on Nvidia Volta and AMD Vega20 GPUs. We consider the efficiency of solving two identical MD models (generic for materials science and for biomolecular studies) using different software and hardware combinations. We describe our experience porting the CUDA backend of LAMMPS to ROCm HIP, which shows considerable benefits for AMD GPUs compared to the OpenCL backend.

