Replicated Computational Results (RCR) Report for “Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software”

Sarah Osborn

doi:10.1145/3446000

ACM Transactions on Mathematical Software ◽

10.1145/3446000 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-4

Author(s):

Sarah Osborn

Keyword(s):

Linear Algebra ◽

High Performance ◽

Numerical Linear Algebra ◽

Practical Implementation ◽

Test Problems ◽

And Performance ◽

Nvidia Gpu ◽

Gpu Architectures ◽

Performance Results ◽

Conjugate Gradient Solver

The article by Flegar et al. titled “Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software” presents a novel, practical implementation of an adaptive precision block-Jacobi preconditioner. Performance results using state-of-the-art GPU architectures for the block-Jacobi preconditioner generation and application demonstrate the practical usability of the method, compared to a traditional full-precision block-Jacobi preconditioner. A production-ready implementation is provided in the Ginkgo numerical linear algebra library. In this report, the Ginkgo library is reinstalled and performance results are generated to perform a comparison to the original results when using Ginkgo’s Conjugate Gradient solver with either the full or the adaptive precision block-Jacobi preconditioner for a suite of test problems on an NVIDIA GPU accelerator. After completing this process, the published results are deemed reproducible.
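
The comparison described above (a Conjugate Gradient solver with either a full or an adaptive precision block-Jacobi preconditioner) can be pictured with a small NumPy sketch. This is not Ginkgo's implementation or API. The function names, the `cond_limit` threshold, and the rule "store a block in float32 when its condition number is small" are assumptions made for illustration only.

```python
import numpy as np

def build_block_jacobi(A, block_size, cond_limit=1e4):
    """Invert each diagonal block; keep well-conditioned blocks in float32.

    cond_limit is a hypothetical threshold chosen for this sketch.
    """
    n = A.shape[0]
    blocks = []
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        block = A[start:stop, start:stop]
        inverse = np.linalg.inv(block)
        # Assumed rule: lower the storage precision only when the block is
        # well conditioned, so little accuracy is lost.
        if np.linalg.cond(block) < cond_limit:
            inverse = inverse.astype(np.float32)
        blocks.append((start, stop, inverse))
    return blocks

def apply_block_jacobi(blocks, r):
    """Apply the preconditioner block by block, computing in float64."""
    z = np.empty_like(r)
    for start, stop, inverse in blocks:
        z[start:stop] = inverse.astype(np.float64) @ r[start:stop]
    return z

def preconditioned_cg(A, b, blocks, tol=1e-10, max_iter=500):
    """Textbook preconditioned Conjugate Gradient for an SPD matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_block_jacobi(blocks, r)
    p = z.copy()
    rz = r @ z
    for iteration in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, iteration
        z = apply_block_jacobi(blocks, r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Small SPD test problem: 1D Laplacian stencil (2 on the diagonal, -1 beside it).
n = 64
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
blocks = build_block_jacobi(A, block_size=4)
x, iterations = preconditioned_cg(A, b, blocks)
```

Passing `cond_limit=0.0` keeps every block in float64, which gives a full-precision baseline to compare iteration counts against.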

A survey of power and energy efficient techniques for high performance numerical linear algebra operations

Parallel Computing ◽

10.1016/j.parco.2014.09.001 ◽

2014 ◽

Vol 40 (10) ◽

pp. 559-573 ◽

Cited By ~ 19

Author(s):

Li Tan ◽

Shashank Kothapalli ◽

Longxiang Chen ◽

Omar Hussaini ◽

Ryan Bissiri ◽

...

Keyword(s):

Linear Algebra ◽

Energy Efficient ◽

High Performance ◽

Numerical Linear Algebra ◽

Power And Energy

Efficient Graph Component Labeling on Hybrid CPU and GPU Platforms

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.596.276 ◽

2014 ◽

Vol 596 ◽

pp. 276-279

Author(s):

Xiao Hui Pan

Keyword(s):

High Performance ◽

General Purpose ◽

Gpu Programming ◽

Data Parallel ◽

Graphical Processing Units ◽

Architectural Features ◽

Graph Coloring Problem ◽

Graphical Processing ◽

And Performance ◽

Performance Results

Graph component labeling, which is a subset of the general graph coloring problem, is a computationally expensive operation in many important applications and simulations. A number of data-parallel algorithmic variations to the component labeling problem are possible and we explore their use with general purpose graphical processing units (GPGPUs) and with the CUDA GPU programming language. We discuss implementation issues and performance results on CPUs and GPUs using CUDA. We evaluated our system with real-world graphs. We show how to consider different architectural features of the GPU and the host CPUs and achieve high performance.
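
One common data-parallel formulation of component labeling is minimum-label propagation: every node starts with its own index as its label, and each edge repeatedly pushes the smaller endpoint label to both endpoints until nothing changes. The sketch below uses NumPy on the CPU as a stand-in for a CUDA kernel. It is not taken from the paper, and the function name `label_components` is hypothetical.

```python
import numpy as np

def label_components(num_nodes, edges):
    """Label each node with the smallest node index in its component."""
    edges = np.asarray(edges, dtype=np.int64).reshape(-1, 2)
    src, dst = edges[:, 0], edges[:, 1]
    labels = np.arange(num_nodes, dtype=np.int64)
    while True:
        new_labels = labels.copy()
        # All edges are processed at once; np.minimum.at handles nodes that
        # appear in several edges, much like an atomic min on a GPU.
        np.minimum.at(new_labels, src, labels[dst])
        np.minimum.at(new_labels, dst, labels[src])
        if np.array_equal(new_labels, labels):
            return labels
        labels = new_labels

# Two components {0, 1, 2} and {3, 4}, plus the isolated node 5.
labels = label_components(6, [(0, 1), (1, 2), (3, 4)])
```

Nodes that end up with the same label belong to the same component. The number of sweeps grows with the graph diameter, which is one reason several algorithmic variations are worth comparing.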

Numerical Linear Algebra for High-Performance Computers

10.1137/1.9780898719611 ◽

1998 ◽

Cited By ~ 201

Author(s):

Jack J. Dongarra ◽

Iain S. Duff ◽

Danny C. Sorensen ◽

Henk A. van der Vorst

Keyword(s):

Linear Algebra ◽

High Performance ◽

Numerical Linear Algebra ◽

High Performance Computers

Implementation and evaluation of the HPC challenge benchmark in the XcalableMP PGAS language

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017698214 ◽

2017 ◽

Vol 33 (1) ◽

pp. 110-123 ◽

Cited By ~ 5

Author(s):

Masahiro Nakao ◽

Hitoshi Murai ◽

Hidetoshi Iwashita ◽

Taisuke Boku ◽

Mitsuhisa Sato

Keyword(s):

High Performance Computing ◽

High Performance ◽

Parallel Applications ◽

Memory Model ◽

Computing Systems ◽

Local View ◽

And Performance ◽

Performance Results ◽

Performance Computing ◽

Do So

To improve productivity for developing parallel applications on high performance computing systems, the XcalableMP PGAS language has been proposed. XcalableMP supports both a typical parallelization under the “global-view memory model”, which uses directives, and a flexible parallelization under the “local-view memory model”, which uses coarray features. The goal of the present paper is to clarify XcalableMP’s productivity and performance. To do so, we implement and evaluate the high performance computing challenge benchmark, namely, EP STREAM Triad, High Performance Linpack, Global fast Fourier transform, and RandomAccess, on the K computer using up to 16,384 compute nodes and on a generic cluster system using up to 128 compute nodes. We found that we could implement the benchmarks more easily using XcalableMP than using MPI. Moreover, most of the performance results using XcalableMP were almost the same as those using MPI.

HCGrid: A convolution-based gridding framework for radio astronomy in hybrid computing environments

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/staa3800 ◽

2020 ◽

Author(s):

Hao Wang ◽

Ce Yu ◽

Bo Zhang ◽

Jian Xiao ◽

Qi Luo

Keyword(s):

High Performance ◽

Optimal Parameter ◽

Computing Time ◽

Reduction Process ◽

Hybrid Computing ◽

Convolution Process ◽

Computing Environments ◽

And Performance ◽

Gpu Architectures ◽

Key Steps

The gridding operation, which maps non-uniform data samples onto a uniformly distributed grid, is one of the key steps in the radio astronomical data reduction process. One of the main bottlenecks of gridding is poor computing performance, and a typical solution is to implement it on multi-core CPU platforms. Although this approach usually achieves good results, in many cases the performance of gridding is still restricted by the limitations of the CPU, since the main workload of gridding is a combination of a large number of single-instruction, multiple-data-stream operations, which suit GPU implementations better than CPU ones. To meet the challenge of massive data gridding for modern large single-dish radio telescopes, e.g. the Five-hundred-meter Aperture Spherical radio Telescope (FAST), and inspired by existing multi-core CPU gridding algorithms such as Cygrid, we present an easy-to-install, high-performance, open-source convolutional gridding framework, HCGrid, for CPU-GPU heterogeneous platforms. It optimises data search by employing multi-threading on the CPU, and accelerates the convolution process by utilising the massive parallelism of the GPU. To make HCGrid a more adaptive solution, we also propose strategies for thread organisation and coarsening, as well as optimal parameter settings under various GPU architectures. A thorough analysis of computing time and performance gain with several GPU parallel optimisation strategies shows that it can lead to excellent performance in hybrid computing environments.
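
As a rough picture of convolution-based gridding, the sketch below spreads each non-uniform sample onto nearby cells of a regular grid with a truncated Gaussian kernel and then normalises by the accumulated weights. It is a plain NumPy illustration, not HCGrid's code. The kernel shape, the weight normalisation, and the names `grid_samples`, `sigma`, and `support` are assumptions.

```python
import numpy as np

def grid_samples(x, y, values, grid_x, grid_y, sigma, support):
    """Map scattered samples onto a regular grid by kernel-weighted averaging."""
    gx, gy = np.meshgrid(grid_x, grid_y, indexing="ij")
    weighted_sum = np.zeros(gx.shape)
    weight_sum = np.zeros(gx.shape)
    for xs, ys, v in zip(x, y, values):
        dist2 = (gx - xs) ** 2 + (gy - ys) ** 2
        # Truncated Gaussian: cells farther than `support` get zero weight.
        w = np.where(dist2 <= support ** 2,
                     np.exp(-dist2 / (2.0 * sigma ** 2)), 0.0)
        weighted_sum += w * v
        weight_sum += w
    # Cells that no sample reaches stay NaN.
    return np.divide(weighted_sum, weight_sum,
                     out=np.full(gx.shape, np.nan), where=weight_sum > 0)

# Two samples with the same value on a 5 x 5 grid over the unit square.
axis = np.linspace(0.0, 1.0, 5)
gridded = grid_samples([0.2, 0.8], [0.5, 0.5], [3.0, 3.0],
                       axis, axis, sigma=0.2, support=0.5)
```

The loop over samples is the part a GPU version would parallelise, and deciding which samples fall within `support` of each cell corresponds to the data search step mentioned in the abstract.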

The Matrix Template Library: A Generic Programming Approach to High Performance Numerical Linear Algebra

Computing in Object-Oriented Parallel Environments - Lecture Notes in Computer Science ◽

10.1007/3-540-49372-7_6 ◽

1998 ◽

pp. 59-70 ◽

Cited By ~ 23

Author(s):

Jeremy G. Siek ◽

Andrew Lumsdaine

Keyword(s):

Linear Algebra ◽

High Performance ◽

Numerical Linear Algebra ◽

Programming Approach ◽

Generic Programming ◽

The Matrix ◽

Template Library

A Modern Framework for Portable High-Performance Numerical Linear Algebra

Advances in Software Tools for Scientific Computing - Lecture Notes in Computational Science and Engineering ◽

10.1007/978-3-642-57172-5_1 ◽

2000 ◽

pp. 1-55 ◽

Cited By ~ 10

Author(s):

Jeremy Siek ◽

Andrew Lumsdaine

Keyword(s):

Linear Algebra ◽

High Performance ◽

Numerical Linear Algebra

Linnea

ACM Transactions on Mathematical Software ◽

10.1145/3446632 ◽

2021 ◽

Vol 47 (3) ◽

pp. 1-26

Author(s):

Henrik Barthels ◽

Christos Psarras ◽

Paolo Bientinesi

Keyword(s):

Linear Algebra ◽

High Performance ◽

Search Algorithm ◽

Test Problems ◽

Code Generator ◽

Matrix Computations ◽

Significant Performance ◽

High Level ◽

Almost All ◽

Performance Computing

The translation of linear algebra computations into efficient sequences of library calls is a non-trivial task that requires expertise in both linear algebra and high-performance computing. Almost all high-level languages and libraries for matrix computations (e.g., Matlab, Eigen) internally use optimized kernels such as those provided by BLAS and LAPACK; however, their translation algorithms are often too simplistic and thus lead to a suboptimal use of said kernels, resulting in significant performance losses. To combine the productivity offered by high-level languages, and the performance of low-level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem; as output, it returns an efficient sequence of calls to high-performance kernels. Linnea uses a custom best-first search algorithm to find a first solution in less than a second, and increasingly better solutions when given more time. In 125 test problems, the code generated by Linnea almost always outperforms Matlab, Julia, Eigen, and Armadillo, with speedups up to and exceeding 10×.
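
A small example shows why the mapping from an expression to kernel calls matters. Evaluating `A B v` left to right triggers a matrix-matrix product, while evaluating it right to left needs only two matrix-vector products. The sketch below is not Linnea's search algorithm. It only compares the two orders, using the usual `2*m*k*n` flop estimate for an `m x k` by `k x n` product, and `matmul_flops` is a hypothetical helper.

```python
import numpy as np

def matmul_flops(m, k, n):
    """Approximate flop count of an (m x k) @ (k x n) product."""
    return 2 * m * k * n

n = 200
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
v = rng.standard_normal(n)

# Naive order: one matrix-matrix product, then one matrix-vector product.
left_to_right = (A @ B) @ v
cost_left = matmul_flops(n, n, n) + matmul_flops(n, n, 1)

# Better order: two matrix-vector products.
right_to_left = A @ (B @ v)
cost_right = 2 * matmul_flops(n, n, 1)
```

Both orders give the same result up to rounding, but the estimated cost differs by a factor of roughly `n / 2`.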

Coordinated Energy Management in Heterogeneous Processors

Scientific Programming ◽

10.1155/2014/210762 ◽

2014 ◽

Vol 22 (2) ◽

pp. 93-108 ◽

Cited By ~ 4

Author(s):

Indrani Paul ◽

Vignesh Ravi ◽

Srilatha Manne ◽

Manish Arora ◽

Sudhakar Yalamanchili

Keyword(s):

Performance Management ◽

Energy Management ◽

Performance Optimization ◽

High Performance ◽

Average Energy ◽

Management Algorithm ◽

Tightly Coupled ◽

Heterogeneous Processor ◽

And Performance ◽

Gpu Architectures

This paper examines energy management in a heterogeneous processor consisting of an integrated CPU–GPU for high-performance computing (HPC) applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need for coordinating energy management across distinct core types – a new and less understood problem. We examine the intra-node CPU–GPU frequency sensitivity of HPC applications on tightly coupled CPU–GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU–GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves measured average energy-delay squared (ED²) product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.
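
The energy-delay squared product multiplies the energy consumed by the square of the execution time, so a slowdown is penalised more heavily than an equal relative rise in energy. The sketch below computes the metric and the relative improvement between two runs. The function names and all numbers are made up for illustration and are not measurements from the paper.

```python
def energy_delay_squared(energy_joules, delay_seconds):
    """Energy-delay squared product: E * D^2 (lower is better)."""
    return energy_joules * delay_seconds ** 2

def relative_improvement(baseline, managed):
    """Fraction by which `managed` reduces a lower-is-better metric."""
    return 1.0 - managed / baseline

# Hypothetical runs: the managed run is 2% slower but uses 30% less energy.
baseline_ed2 = energy_delay_squared(energy_joules=100.0, delay_seconds=10.0)
managed_ed2 = energy_delay_squared(energy_joules=70.0, delay_seconds=10.2)
improvement = relative_improvement(baseline_ed2, managed_ed2)
```

With these made-up inputs the 30% energy saving shrinks to about a 27% improvement in the metric, because the 2% slowdown enters squared.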

High-Performance Algorithms for Numerical Linear Algebra

The Art of High Performance Computing for Computational Science, Vol. 1 ◽

10.1007/978-981-13-6194-4_7 ◽

2019 ◽

pp. 113-136

Author(s):

Yusaku Yamamoto

Keyword(s):

Linear Algebra ◽

High Performance ◽

Numerical Linear Algebra
