HPGraph: High-Performance Graph Analytics with Productivity on the GPU

High-Level Parallel Ant Colony Optimization with Algorithmic Skeletons

International Journal of Parallel Programming ◽

10.1007/s10766-021-00714-1 ◽

2021 ◽

Author(s):

Breno A. de Melo Menezes ◽

Nina Herrmann ◽

Herbert Kuchen ◽

Fernando Buarque de Lima Neto

Keyword(s):

Ant Colony Optimization ◽

High Performance ◽

Optimization Problems ◽

Programming Model ◽

Parallel Implementation ◽

Ant Colony ◽

Algorithmic Skeletons ◽

Low Level ◽

Programming Patterns ◽

High Level

Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that are later converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with execution times similar to those of low-level implementations.
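The map, fold and zip patterns named in this abstract can be illustrated with a minimal sequential sketch in Python. This is not Musket code and not the authors' implementation: the helper names `skel_map`, `skel_zip`, `skel_fold` and `tour_length`, and the small distance matrix, are hypothetical and chosen only to show how an ACO tour length could be expressed through such patterns.

```python
from functools import reduce

# Hypothetical sequential stand-ins for the skeletons; a skeleton framework
# would generate parallel code for these patterns instead.
def skel_map(f, xs):
    return [f(x) for x in xs]

def skel_zip(f, xs, ys):
    return [f(x, y) for x, y in zip(xs, ys)]

def skel_fold(f, init, xs):
    return reduce(f, xs, init)

def tour_length(tour, dist):
    # Pair each city with its successor, wrapping around to close the tour.
    successors = tour[1:] + tour[:1]
    legs = skel_zip(lambda a, b: dist[a][b], tour, successors)
    return skel_fold(lambda acc, d: acc + d, 0, legs)

# Illustrative symmetric distance matrix for four cities (assumed data).
dist = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
tours = [[0, 1, 2, 3], [0, 2, 1, 3]]

# One tour per ant; map evaluates every ant's tour independently.
lengths = skel_map(lambda t: tour_length(t, dist), tours)
print(lengths)
```

Because each pattern has no hidden dependencies between elements, a code generator is free to run the map over ants and the zip over tour legs in parallel.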

Machine Learning to Design an Auto-tuning System for the Best Compressed Format Detection for Parallel Sparse Computations

Parallel Processing Letters ◽

10.1142/s0129626421500195 ◽

2021 ◽

Author(s):

Olfa Hamdi-Larbi ◽

Ichrak Mehrez ◽

Thomas Dufaud

Keyword(s):

Machine Learning ◽

Numerical Method ◽

High Performance ◽

Programming Model ◽

Learning Algorithm ◽

Sparse Matrix ◽

Sparse Matrices ◽

Matrix Compression ◽

Target Architecture ◽

Parallel Programming Model

Many applications in scientific computing process very large sparse matrices on parallel architectures. The work presented in this paper is part of a project whose general aim is to develop an auto-tuning system that selects the best matrix compression format in the context of high-performance computing. The target smart system can automatically select the best compression format for a given sparse matrix, a numerical method processing this matrix, a parallel programming model and a target architecture. Hence, this paper describes the design and implementation of the proposed concept. We consider a case study consisting of a numerical method reduced to the sparse matrix-vector product (SpMV), some compression formats, data parallelism as the programming model, and a distributed multi-core platform as the target architecture. This study allows extracting a set of important novel metrics and parameters related to the considered programming model. Our metrics are used as input to a machine-learning algorithm to predict the best matrix compression format. An experimental study targeting a distributed multi-core platform and processing random and real-world matrices shows that our system can improve the accuracy of the machine learning by up to 7% on average.
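To make the SpMV kernel concrete, here is a minimal sketch of the product over a matrix stored in compressed sparse row (CSR) form. CSR is an assumption: the abstract does not name the compression formats it studies, and the function name `spmv_csr` and the sample matrix are illustrative only.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each nonzero
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y

# Assumed example matrix:
# [[4, 0, 0],
#  [0, 0, 2],
#  [1, 0, 3]]
values = [4.0, 2.0, 1.0, 3.0]
col_idx = [0, 2, 0, 2]
row_ptr = [0, 1, 2, 4]

y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(y)
```

Rows are independent of one another, which is why the outer loop is the natural unit to distribute in a data-parallel setting; how well that works depends on how nonzeros are spread across rows, which is one reason the choice of format matters.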

Apache Nemo: A Framework for Optimizing Distributed Data Processing

ACM Transactions on Computer Systems ◽

10.1145/3468144 ◽

2020 ◽

Vol 38 (3-4) ◽

pp. 1-31

Author(s):

Won Wook Song ◽

Youngseok Yang ◽

Jeongyoon Eo ◽

Jangho Seo ◽

Joo Yeon Kim ◽

...

Keyword(s):

Data Processing ◽

High Performance ◽

Programming Model ◽

Compiler Optimization ◽

Ease Of Use ◽

Distributed Data ◽

Performance Improvements ◽

Distributed Data Processing ◽

Fine Control ◽

High Level

Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.

Computing Gamma Calculus on Computer Cluster

Computer Engineering ◽

10.4018/978-1-61350-456-7.ch813 ◽

2012 ◽

pp. 2016-2026

Author(s):

Hong Lin ◽

Jeremy Kemp ◽

Padraic Gilbert

Keyword(s):

Gpu Computing ◽

Programming Model ◽

Large Data ◽

General Purpose ◽

Tuple Space ◽

Data Set ◽

Cuda Architecture ◽

High Level ◽

General Purpose Gpu ◽

Grid Cluster

Gamma Calculus is an inherently parallel, high-level programming model, which allows simple programming molecules to interact, creating a complex system with a minimum of coding. Programs modeled in Gamma calculus were written on top of IBM’s TSpaces middleware, which is Java-based and uses a “Tuple Space” based model for communication, similar to that in Gamma. A parser was written in C++ to translate the Gamma syntax. This was implemented on UHD’s grid cluster (grid.uhd.edu), and in an effort to increase performance and scalability, existing Gamma programs are being transferred to Nvidia’s CUDA architecture. General Purpose GPU computing is well suited to run Gamma programs, as GPUs excel at running the same operation on a large data set, potentially offering a large speedup.
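The idea of molecules interacting in a shared pool can be sketched as multiset rewriting: a reaction fires on any pair of elements that satisfies a condition, and the program ends when no pair can react. The sketch below is a sequential illustration under that assumption, not the authors' TSpaces or CUDA implementation; the function name `gamma` and the maximum-finding reaction are hypothetical examples.

```python
def gamma(multiset, condition, action):
    """Repeatedly replace a reacting pair with its products until stable."""
    pool = list(multiset)
    changed = True
    while changed:
        changed = False
        for i in range(len(pool)):
            for j in range(len(pool)):
                if i != j and condition(pool[i], pool[j]):
                    products = action(pool[i], pool[j])
                    # Remove the two reactants and add the reaction products.
                    pool = [m for k, m in enumerate(pool) if k not in (i, j)]
                    pool += list(products)
                    changed = True
                    break
            if changed:
                break
    return pool

# Reaction: any pair (x, y) with x >= y is replaced by x alone,
# so the pool shrinks until only the maximum remains.
result = gamma([3, 9, 4, 9, 1], lambda x, y: x >= y, lambda x, y: [x])
print(result)
```

In this sketch pairs are tried one at a time; since reactions on disjoint pairs do not interfere, many of them could fire simultaneously, which is the property that makes the model a candidate for GPU execution.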

Grafs: declarative graph analytics

Proceedings of the ACM on Programming Languages ◽

10.1145/3473588 ◽

2021 ◽

Vol 5 (ICFP) ◽

pp. 1-32

Author(s):

Farzin Houshmand ◽

Mohsen Lesani ◽

Keval Vora

Keyword(s):

High Performance ◽

Large Scale ◽

Kernel Functions ◽

Runtime Systems ◽

Graph Processing ◽

Large Graphs ◽

Graph Analytics ◽

Efficient Code ◽

High Level ◽

Abstract Interface

Graph analytics elicits insights from large graphs to inform critical decisions for business, safety and security. Several large-scale graph processing frameworks feature efficient runtime systems; however, they often provide programming models that are low-level and subtly different from each other. Therefore, end users can find the implementation, and especially the optimization, of graph analytics error-prone and time-consuming. This paper regards the abstract interface of the graph processing frameworks as the instruction set for graph analytics, and presents Grafs, a high-level declarative specification language for graph analytics and a synthesizer that automatically generates efficient code for five high-performance graph processing frameworks. It features novel semantics-preserving fusion transformations that optimize the specifications and reduce them to three primitives: reduction over paths, mapping over vertices and reduction over vertices. Reductions over paths are commonly calculated based on push or pull models that iteratively apply kernel functions at the vertices. This paper presents conditions, parametric in terms of the kernel functions, for the correctness and termination of the iterative models, and uses these conditions as specifications to automatically synthesize the kernel functions. Experimental results show that the generated code matches or outperforms handwritten code, and that fusion accelerates execution.
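As an illustration of a pull-based reduction over paths, the sketch below computes single-source shortest distances by letting every vertex repeatedly pull candidate values from its in-neighbours until nothing changes. It is a hand-written example, not code generated by Grafs; the function name `pull_shortest_paths`, the edge representation and the sample graph are assumptions, and termination is only argued here for non-negative edge weights.

```python
import math

def pull_shortest_paths(in_edges, source, num_vertices):
    """in_edges maps a vertex v to a list of (u, weight) pairs for edges u -> v."""
    dist = [math.inf] * num_vertices
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        new_dist = list(dist)
        for v in range(num_vertices):
            # Kernel at vertex v: pull from each in-neighbour and keep the minimum.
            for u, w in in_edges.get(v, []):
                candidate = dist[u] + w
                if candidate < new_dist[v]:
                    new_dist[v] = candidate
                    changed = True
        dist = new_dist
    return dist

# Assumed example graph: 0->1 (4), 0->2 (1), 2->1 (2), 1->3 (5).
in_edges = {1: [(0, 4), (2, 2)], 2: [(0, 1)], 3: [(1, 5)]}
distances = pull_shortest_paths(in_edges, source=0, num_vertices=4)
print(distances)
```

Each round reads only the previous round's values, so the per-vertex kernels within a round are independent of one another.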

PERFORMANCE EVALUATION OF BLAS ON THE TRIDENT PROCESSOR

Parallel Processing Letters ◽

10.1142/s0129626405002325 ◽

2005 ◽

Vol 15 (04) ◽

pp. 407-414

Author(s):

MOSTAFA I. SOLIMAN ◽

STANISLAV G. SEDUKHIN

Keyword(s):

High Performance ◽

Programming Model ◽

Parallel Applications ◽

Instruction Set ◽

Code Size ◽

Data Parallel ◽

Fine Grain ◽

Multi Level ◽

High Level ◽

Programming Interface

Different subtasks of an application usually have different computational, memory, and I/O requirements that result in different needs for computer capabilities. Thus, a more appropriate approach for achieving both high performance and a simple programming model is to design a processor with a multi-level instruction set architecture (ISA). This leads to high performance and minimum executable code size. Since the fundamental data structures for a wide variety of existing applications are scalar, vector, and matrix, our research processor, Trident, has a three-level ISA executed on zero-, one-, and two-dimensional arrays of data. These levels are used to express a great amount of fine-grain data parallelism to the processor, instead of extracting it dynamically with complicated logic or statically with compilers. This reduces the design complexity and provides a high-level programming interface to the hardware. In this paper, the performance of the Trident processor is evaluated on BLAS, which represents the kernel operations of many data-parallel applications. We show that the Trident processor proportionally reduces the number of clock cycles per floating-point operation by increasing the number of execution datapaths.

The Challenge of Providing a High-Level Programming Model for High-Performance Computing

High-Performance Computing ◽

10.1002/0471732710.ch2 ◽

2006 ◽

pp. 21-49 ◽

Cited By ~ 1

Author(s):

Barbara Chapman

Keyword(s):

High Performance Computing ◽

High Performance ◽

Programming Model ◽

High Level ◽

Performance Computing

Computing Gamma Calculus on Computer Cluster

Knowledge and Technology Adoption, Diffusion, and Transfer - Advances in Knowledge Acquisition, Transfer, and Management ◽

10.4018/978-1-4666-1752-0.ch020 ◽

2012 ◽

pp. 275-286

Author(s):

Hong Lin ◽

Jeremy Kemp ◽

Padraic Gilbert

Keyword(s):

Gpu Computing ◽

Programming Model ◽

Large Data ◽

General Purpose ◽

Tuple Space ◽

Data Set ◽

Cuda Architecture ◽

High Level ◽

General Purpose Gpu ◽

Grid Cluster

Gamma Calculus is an inherently parallel, high-level programming model, which allows simple programming molecules to interact, creating a complex system with a minimum of coding. Programs modeled in Gamma calculus were written on top of IBM’s TSpaces middleware, which is Java-based and uses a “Tuple Space” based model for communication, similar to that in Gamma. A parser was written in C++ to translate the Gamma syntax. This was implemented on UHD’s grid cluster (grid.uhd.edu), and in an effort to increase performance and scalability, existing Gamma programs are being transferred to Nvidia’s CUDA architecture. General Purpose GPU computing is well suited to run Gamma programs, as GPUs excel at running the same operation on a large data set, potentially offering a large speedup.

Computing Gamma Calculus on Computer Cluster

International Journal of Technology Diffusion ◽

10.4018/jtd.2010100104 ◽

2010 ◽

Vol 1 (4) ◽

pp. 42-52 ◽

Cited By ~ 7

Author(s):

Hong Lin ◽

Jeremy Kemp ◽

Padraic Gilbert

Keyword(s):

Gpu Computing ◽

Programming Model ◽

Large Data ◽

General Purpose ◽

Tuple Space ◽

Data Set ◽

Computer Cluster ◽

Cuda Architecture ◽

High Level ◽

General Purpose Gpu

Gamma Calculus is an inherently parallel, high-level programming model, which allows simple programming molecules to interact, creating a complex system with a minimum of coding. Programs modeled in Gamma calculus were written on top of IBM’s TSpaces middleware, which is Java-based and uses a “Tuple Space” based model for communication, similar to that in Gamma. A parser was written in C++ to translate the Gamma syntax. This was implemented on UHD’s grid cluster (grid.uhd.edu), and in an effort to increase performance and scalability, existing Gamma programs are being transferred to Nvidia’s CUDA architecture. General Purpose GPU computing is well suited to run Gamma programs, as GPUs excel at running the same operation on a large data set, potentially offering a large speedup.

Energy-efficient algebra kernels in FPGA for High Performance Computing

Journal of Computer Science and Technology ◽

10.24215/16666038.21.e09 ◽

2021 ◽

Vol 21 (2) ◽

pp. e09

Author(s):

Federico Favaro ◽

Ernesto Dufrechou ◽

Pablo Ezzatti ◽

Juan Pablo Oliver

Keyword(s):

High Performance Computing ◽

Energy Efficient ◽

High Performance ◽

Programming Model ◽

Sparse Matrix ◽

Matrix Multiplication ◽

Numerical Linear Algebra ◽

Fpga Design ◽

Matrix Vector Multiplication ◽

Performance Computing

The dissemination of multi-core architectures and the later irruption of massively parallel devices led to a revolution in High-Performance Computing (HPC) platforms in the last decades. As a result, Field-Programmable Gate Arrays (FPGAs) are re-emerging as a versatile and more energy-efficient alternative to other platforms. Traditional FPGA design implies using low-level Hardware Description Languages (HDL) such as VHDL or Verilog, which follow an entirely different programming model than standard software languages, and their use requires specialized knowledge of the underlying hardware. In recent years, manufacturers have started to make big efforts to provide High-Level Synthesis (HLS) tools, in order to allow a greater adoption of FPGAs in the HPC community. Our work studies the use of multi-core hardware and different FPGAs to address Numerical Linear Algebra (NLA) kernels such as the general matrix multiplication (GEMM) and the sparse matrix-vector multiplication (SpMV). Specifically, we compare the behavior of fine-tuned kernels on a multi-core CPU and HLS implementations on FPGAs. We perform the experimental evaluation of our implementations on a low-end and a cutting-edge FPGA platform, in terms of runtime and energy consumption, and compare the results against the Intel MKL library on CPU.
