scholarly journals HPGraph: High-Performance Graph Analytics with Productivity on the GPU

2018 ◽  
Vol 2018 ◽  
pp. 1-11
Author(s):  
Haoduo Yang ◽  
Huayou Su ◽  
Qiang Lan ◽  
Mei Wen ◽  
Chunyuan Zhang

The growing use of graph in many fields has sparked a broad interest in developing high-level graph analytics programs. Existing GPU implementations have limited performance with compromising on productivity. HPGraph, our high-performance bulk-synchronous graph analytics framework based on the GPU, provides an abstraction focused on mapping vertex programs to generalized sparse matrix operations on GPU as the backend. HPGraph strikes a balance between performance and productivity by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model for users to implement various graph algorithms with relatively little effort. We evaluate the performance of HPGraph for four graph primitives (BFS, SSSP, PageRank, and TC). Our experiments show that HPGraph matches or even exceeds the performance of high-performance GPU graph libraries such as MapGraph, nvGraph, and Gunrock. HPGraph also runs significantly faster than advanced CPU graph libraries.

Author(s):  
Breno A. de Melo Menezes ◽  
Nina Herrmann ◽  
Herbert Kuchen ◽  
Fernando Buarque de Lima Neto

AbstractParallel implementations of swarm intelligence algorithms such as the ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that later on will be converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high performance code with similar execution times when compared to low-level implementations.


Author(s):  
Olfa Hamdi-Larbi ◽  
Ichrak Mehrez ◽  
Thomas Dufaud

Many applications in scientific computing process very large sparse matrices on parallel architectures. The presented work in this paper is a part of a project where our general aim is to develop an auto-tuner system for the selection of the best matrix compression format in the context of high-performance computing. The target smart system can automatically select the best compression format for a given sparse matrix, a numerical method processing this matrix, a parallel programming model and a target architecture. Hence, this paper describes the design and implementation of the proposed concept. We consider a case study consisting of a numerical method reduced to the sparse matrix vector product (SpMV), some compression formats, the data parallel as a programming model and, a distributed multi-core platform as a target architecture. This study allows extracting a set of important novel metrics and parameters which are relative to the considered programming model. Our metrics are used as input to a machine-learning algorithm to predict the best matrix compression format. An experimental study targeting a distributed multi-core platform and processing random and real-world matrices shows that our system can improve in average up to 7% the accuracy of the machine learning.


2020 ◽  
Vol 38 (3-4) ◽  
pp. 1-31
Author(s):  
Won Wook Song ◽  
Youngseok Yang ◽  
Jeongyoon Eo ◽  
Jangho Seo ◽  
Joo Yeon Kim ◽  
...  

Optimizing scheduling and communication of distributed data processing for resource and data characteristics is crucial for achieving high performance. Existing approaches to such optimizations largely fall into two categories. First, distributed runtimes provide low-level policy interfaces to apply the optimizations, but do not ensure the maintenance of correct application semantics and thus often require significant effort to use. Second, policy interfaces that extend a high-level application programming model ensure correctness, but do not provide sufficient fine control. We describe Apache Nemo, an optimization framework for distributed dataflow processing that provides fine control for high performance and also ensures correctness for ease of use. We combine several techniques to achieve this, including an intermediate representation of dataflow, compiler optimization passes, and runtime extensions. Our evaluation results show that Nemo enables composable and reusable optimizations that bring performance improvements on par with existing specialized runtimes tailored for a specific deployment scenario. Apache Nemo is open-sourced at https://nemo.apache.org as an Apache incubator project.


2012 ◽  
pp. 2016-2026
Author(s):  
Hong Lin ◽  
Jeremy Kemp ◽  
Padraic Gilbert

Gamma Calculus is an inherently parallel, high-level programming model, which allows simple programming molecules to interact, creating a complex system with minimum of coding. Gamma calculus modeled programs were written on top of IBM’s TSpaces middleware, which is Java-based and uses a “Tuple Space” based model for communication, similar to that in Gamma. A parser was written in C++ to translate the Gamma syntax. This was implemented on UHD’s grid cluster (grid.uhd.edu), and in an effort to increase performance and scalability, existing Gamma programs are being transferred to Nvidia’s CUDA architecture. General Purpose GPU computing is well suited to run Gamma programs, as GPU’s excel at running the same operation on a large data set, potentially offering a large speedup.


2021 ◽  
Vol 5 (ICFP) ◽  
pp. 1-32
Author(s):  
Farzin Houshmand ◽  
Mohsen Lesani ◽  
Keval Vora

Graph analytics elicits insights from large graphs to inform critical decisions for business, safety and security. Several large-scale graph processing frameworks feature efficient runtime systems; however, they often provide programming models that are low-level and subtly different from each other. Therefore, end users can find implementation and specially optimization of graph analytics error-prone and time-consuming. This paper regards the abstract interface of the graph processing frameworks as the instruction set for graph analytics, and presents Grafs, a high-level declarative specification language for graph analytics and a synthesizer that automatically generates efficient code for five high-performance graph processing frameworks. It features novel semantics-preserving fusion transformations that optimize the specifications and reduce them to three primitives: reduction over paths, mapping over vertices and reduction over vertices. Reductions over paths are commonly calculated based on push or pull models that iteratively apply kernel functions at the vertices. This paper presents conditions, parametric in terms of the kernel functions, for the correctness and termination of the iterative models, and uses these conditions as specifications to automatically synthesize the kernel functions. Experimental results show that the generated code matches or outperforms handwritten code, and that fusion accelerates execution.


2005 ◽  
Vol 15 (04) ◽  
pp. 407-414
Author(s):  
MOSTAFA I. SOLIMAN ◽  
STANISLAV G. SEDUKHIN

Different subtasks of an application usually have different computational, memory, and I/O requirements that result in different needs for computer capabilities. Thus, the more appropriate approach for both high performance and simple programming model is designing a processor having multi-level instruction set architecture (ISA). This leads to high performance and minimum executable code size. Since the fundamental data structures for a wide variety of existing applications are scalar, vector, and matrix, our research Trident processor has three-level ISA executed on zero-, one-, and two-dimensional arrays of data. These levels are used to express a great amount of fine-grain data parallelism to a processor instead of the dynamical extraction by a complicated logic or statically with compilers. This reduces the design complexity and provides high-level programming interface to hardware. In this paper, the performance of Trident processor is evaluated on BLAS, which represent the kernel operations of many data parallel applications. We show that Trident processor proportionally reduces the number of clock cycles per floating-point operation by increasing the number of execution datapaths.


Author(s):  
Hong Lin ◽  
Jeremy Kemp ◽  
Padraic Gilbert

Gamma Calculus is an inherently parallel, high-level programming model, which allows simple programming molecules to interact, creating a complex system with minimum of coding. Gamma calculus modeled programs were written on top of IBM’s TSpaces middleware, which is Java-based and uses a “Tuple Space” based model for communication, similar to that in Gamma. A parser was written in C++ to translate the Gamma syntax. This was implemented on UHD’s grid cluster (grid.uhd.edu), and in an effort to increase performance and scalability, existing Gamma programs are being transferred to Nvidia’s CUDA architecture. General Purpose GPU computing is well suited to run Gamma programs, as GPU’s excel at running the same operation on a large data set, potentially offering a large speedup.


2010 ◽  
Vol 1 (4) ◽  
pp. 42-52 ◽  
Author(s):  
Hong Lin ◽  
Jeremy Kemp ◽  
Padraic Gilbert

Gamma Calculus is an inherently parallel, high-level programming model, which allows simple programming molecules to interact, creating a complex system with minimum of coding. Gamma calculus modeled programs were written on top of IBM’s TSpaces middleware, which is Java-based and uses a “Tuple Space” based model for communication, similar to that in Gamma. A parser was written in C++ to translate the Gamma syntax. This was implemented on UHD’s grid cluster (grid.uhd.edu), and in an effort to increase performance and scalability, existing Gamma programs are being transferred to Nvidia’s CUDA architecture. General Purpose GPU computing is well suited to run Gamma programs, as GPU’s excel at running the same operation on a large data set, potentially offering a large speedup.


2021 ◽  
Vol 21 (2) ◽  
pp. e09
Author(s):  
Federico Favaro ◽  
Ernesto Dufrechou ◽  
Pablo Ezzatti ◽  
Juan Pablo Oliver

The dissemination of multi-core architectures and the later irruption of massively parallel devices, led to a revolution in High-Performance Computing (HPC) platforms in the last decades. As a result, Field-Programmable Gate Arrays (FPGAs) are re-emerging as a versatile and more energy-efficient alternative to other platforms. Traditional FPGA design implies using low-level Hardware Description Languages (HDL) such as VHDL or Verilog, which follow an entirely different programming model than standard software languages, and their use requires specialized knowledge of the underlying hardware. In the last years, manufacturers started to make big efforts to provide High-Level Synthesis (HLS) tools, in order to allow a grater adoption of FPGAs in the HPC community.Our work studies the use of multi-core hardware and different FPGAs to address Numerical Linear Algebra (NLA) kernels such as the general matrix multiplication GEMM and the sparse matrix-vector multiplication SpMV. Specifically, we compare the behavior of fine-tuned kernels in a multi-core CPU processor and HLS implementations on FPGAs. We perform the experimental evaluation of our implementations on a low-end and a cutting-edge FPGA platform, in terms of runtime and energy consumption, and compare the results against the Intel MKL library in CPU.  


Sign in / Sign up

Export Citation Format

Share Document