GraphPEG

Yashuai Lü; Hui Guo; Libo Huang; Qi Yu; Li Shen; Nong Xiao; Zhiying Wang

doi:10.1145/3450440

GraphPEG

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3450440 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1-24

Author(s):

Yashuai Lü ◽

Hui Guo ◽

Libo Huang ◽

Qi Yu ◽

Li Shen ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Graph Algorithm ◽

Graph Processing ◽

Graph Traversal ◽

Fine Grain ◽

Large Scale Data ◽

Load Imbalance ◽

Work Distribution ◽

Level Parallelism

Due to massive thread-level parallelism, GPUs have become an attractive platform for accelerating large-scale data parallel computations, such as graph processing. However, achieving high performance for graph processing with GPUs is non-trivial. Processing graphs on GPUs introduces several problems, such as load imbalance, low utilization of hardware unit, and memory divergence. Although previous work has proposed several software strategies to optimize graph processing on GPUs, there are several issues beyond the capability of software techniques to address. In this article, we present GraphPEG, a graph processing engine for efficient graph processing on GPUs. Inspired by the observation that many graph algorithms have a common pattern on graph traversal, GraphPEG improves the performance of graph processing by coupling automatic edge gathering with fine-grain work distribution. GraphPEG can also adapt to various input graph datasets and simplify the software design of graph processing with hardware-assisted graph traversal. Simulation results show that, in comparison with two representative highly efficient GPU graph processing software framework Gunrock and SEP-Graph, GraphPEG improves graph processing throughput by 2.8× and 2.5× on average, and up to 7.3× and 7.0× for six graph algorithm benchmarks on six graph datasets, with marginal hardware cost.

Download Full-text

IMSM: An Interval Migration Based Approach for Skew Mitigation in MapReduce

Recent Patents on Computer Science ◽

10.2174/2213275912666190405141745 ◽

2019 ◽

Vol 12 ◽

Author(s):

Balraj Singh ◽

Harsh K Verma

Keyword(s):

Load Balance ◽

Completion Time ◽

High Performance ◽

Large Scale ◽

Research Work ◽

Novel Technique ◽

Large Scale Data ◽

Load Imbalance ◽

Performance Computing ◽

Scale Data

Background: Extreme growth of data necessitates the need for high-performance computing. MapReduce is among the most sought-after platform for processing large-scale data. Research work and analysis of the existing system has revealed its performance bottlenecks and areas of the concern. MapReduce suffers extremely from the problem of skew and load imbalance on its processing nodes. Objective: This paper proposes a novel technique for MapReduce to lower the skew on Map tasks and improve the load balance. It reduces the execution time of job by lowering the completion time of the slowest task. Method:Proposed method performs one-time settlement of load balancing among the Map tasks by analyzing the expected completion time of the Map tasks and redistributes the load. It uses intervals to migrate the overloaded or slows tasks and append them on the under loaded tasks or free slots. Result:Experiments reveal an improvement of up to 1.3x by implementing the proposed strategy and comparing it with the relevant techniques using different datasets. Conclusion:Significant improvement is observed in the performance as a result of lower completion time of a job. Proposed technique exhibits reduced amount of skew and a uniform distribution of load among Map nodes.

Download Full-text

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining ◽

10.1145/3219819.3219927 ◽

2018 ◽

Cited By ~ 2

Author(s):

Alex Gittens ◽

Kai Rothauge ◽

Shusen Wang ◽

Michael W. Mahoney ◽

Lisa Gerhardt ◽

...

Keyword(s):

Data Analysis ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Large Scale Data ◽

Performance Computing ◽

Scale Data

Download Full-text

Enabling low latency at large-scale data center and high-performance computing interconnect networks using fine-grained all-optical switching technology

2017 International Conference on Optical Network Design and Modeling (ONDM) ◽

10.23919/ondm.2017.7958532 ◽

2017 ◽

Cited By ~ 3

Author(s):

Nan Hua ◽

Zhizhen Zhong ◽

Xiaoping Zheng

Keyword(s):

Data Center ◽

High Performance ◽

Optical Switching ◽

Large Scale ◽

Fine Grained ◽

Large Scale Data ◽

All Optical ◽

Performance Computing ◽

All Optical Switching ◽

Scale Data

Download Full-text

Integrating Web service and grid enabling technologies to provide desktop access to high-performance cluster-based components for large-scale data services

36th Annual Simulation Symposium, 2003. ◽

10.1109/simsym.2003.1192810 ◽

2003 ◽

Cited By ~ 8

Author(s):

V.P. Holmes ◽

W.R. Johnson ◽

D.J. Miller

Keyword(s):

Web Service ◽

High Performance ◽

Large Scale ◽

Data Services ◽

Large Scale Data ◽

Enabling Technologies ◽

Scale Data

Download Full-text

A look back on 30 years of the Gordon Bell Prize

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017738610 ◽

2017 ◽

Vol 31 (6) ◽

pp. 469-484 ◽

Cited By ~ 3

Author(s):

Gordon Bell ◽

David H Bailey ◽

Jack Dongarra ◽

Alan H Karp ◽

Kevin Walsh

Keyword(s):

Parallel Computing ◽

High Performance ◽

Large Scale ◽

Peak Performance ◽

Outstanding Achievement ◽

Computing Machinery ◽

Large Scale Data ◽

The Us ◽

The Impact ◽

Performance Computing

The Gordon Bell Prize is awarded each year by the Association for Computing Machinery to recognize outstanding achievement in high-performance computing (HPC). The purpose of the award is to track the progress of parallel computing with particular emphasis on rewarding innovation in applying HPC to applications in science, engineering, and large-scale data analytics. Prizes may be awarded for peak performance or special achievements in scalability and time-to-solution on important science and engineering problems. Financial support for the US$10,000 award is provided through an endowment by Gordon Bell, a pioneer in high-performance and parallel computing. This article examines the evolution of the Gordon Bell Prize and the impact it has had on the field.

Download Full-text

Enabling Large-Scale Biomedical Analysis in the Cloud

BioMed Research International ◽

10.1155/2013/185679 ◽

2013 ◽

Vol 2013 ◽

pp. 1-6 ◽

Cited By ~ 10

Author(s):

Ying-Chih Lin ◽

Chin-Sheng Yu ◽

Yen-Jen Lin

Keyword(s):

High Performance ◽

Large Scale ◽

Computing System ◽

Biomedical Data ◽

Data Intensive Computing ◽

Biomedical Analysis ◽

Data Intensive ◽

Large Scale Data ◽

Performance Computing ◽

Scale Data

Recent progress in high-throughput instrumentations has led to an astonishing growth in both volume and complexity of biomedical data collected from various sources. The planet-size data brings serious challenges to the storage and computing technologies. Cloud computing is an alternative to crack the nut because it gives concurrent consideration to enable storage and high-performance computing on large-scale data. This work briefly introduces the data intensive computing system and summarizes existing cloud-based resources in bioinformatics. These developments and applications would facilitate biomedical research to make the vast amount of diversification data meaningful and usable.

Download Full-text

Data Stewardship Practices for Earth Observation Transient and Optimized Analysis Platforms

10.5194/egusphere-egu21-13177 ◽

2021 ◽

Author(s):

Kaylin Bugbee ◽

Rahul Ramachandran ◽

Ge Peng ◽

Aaron Kaulfus

Keyword(s):

High Performance ◽

Large Scale ◽

Scientific Research ◽

Open Science ◽

Heterogeneous Data ◽

Earth Observation ◽

Large Scale Data ◽

Data Stewardship ◽

Uniform Manner ◽

Analysis Platform

Access to valuable scientific research data is becoming increasingly more open, attracting a growing user community of scientists, decision makers and innovators. While these data are more openly available, accessibility continues to remain an issue due to the large volumes of complex, heterogeneous data that are available for analysis. This emerging accessibility issue is driving the development of specialized software stacks to instantiate new analysis platforms that enable users to quickly and efficiently work with large volumes of data. These platforms, typically found on the cloud or in a high performance computing environment, are optimized for large-scale data analysis. These platforms can be transient in nature, with a defined life span and a focus on improved capabilities as opposed to serving as an archive of record.&#160;&#160;While these transient, optimized platforms are not held to the same stewardship standards as a traditional archive, data must still be managed in a standardized and uniform manner throughout the platform. Valuable scientific research is conducted in these platforms, making these platforms subject to open science principles such as reproducibility and accessibility. In this presentation, we examine the differences between various data stewardship models and describe where transient optimized platforms fit within those models. We then describe in more detail a data and information governance framework for Earth Observation transient optimized analysis platforms. We will end our presentation by sharing our experiences of developing such a framework for the Multi-Mission Algorithm and Analysis Platform (MAAP).

Download Full-text

Grafs: declarative graph analytics

Proceedings of the ACM on Programming Languages ◽

10.1145/3473588 ◽

2021 ◽

Vol 5 (ICFP) ◽

pp. 1-32

Author(s):

Farzin Houshmand ◽

Mohsen Lesani ◽

Keval Vora

Keyword(s):

High Performance ◽

Large Scale ◽

Kernel Functions ◽

Runtime Systems ◽

Graph Processing ◽

Large Graphs ◽

Graph Analytics ◽

Efficient Code ◽

High Level ◽

Abstract Interface

Graph analytics elicits insights from large graphs to inform critical decisions for business, safety and security. Several large-scale graph processing frameworks feature efficient runtime systems; however, they often provide programming models that are low-level and subtly different from each other. Therefore, end users can find implementation and specially optimization of graph analytics error-prone and time-consuming. This paper regards the abstract interface of the graph processing frameworks as the instruction set for graph analytics, and presents Grafs, a high-level declarative specification language for graph analytics and a synthesizer that automatically generates efficient code for five high-performance graph processing frameworks. It features novel semantics-preserving fusion transformations that optimize the specifications and reduce them to three primitives: reduction over paths, mapping over vertices and reduction over vertices. Reductions over paths are commonly calculated based on push or pull models that iteratively apply kernel functions at the vertices. This paper presents conditions, parametric in terms of the kernel functions, for the correctness and termination of the iterative models, and uses these conditions as specifications to automatically synthesize the kernel functions. Experimental results show that the generated code matches or outperforms handwritten code, and that fusion accelerates execution.

Download Full-text

SW-LZMA: Parallel Implementation of LZMA Based on SW26010 Many-Core Processor

Wireless Communications and Mobile Computing ◽

10.1155/2021/4486494 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Bingzheng Li ◽

Jinchen Xu ◽

Zijing Liu

Keyword(s):

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Cluster Systems ◽

Large Scale Data ◽

Many Core ◽

High Performance Computing Cluster ◽

Performance Computing ◽

Scale Data ◽

Computing Cluster

With the development of high-performance computing and big data applications, the scale of data transmitted, stored, and processed by high-performance computing cluster systems is increasing explosively. Efficient compression of large-scale data and reducing the space required for data storage and transmission is one of the keys to improving the performance of high-performance computing cluster systems. In this paper, we present SW-LZMA, a parallel design and optimization of LZMA based on the Sunway 26010 heterogeneous many-core processor. Combined with the characteristics of SW26010 processors, we analyse the storage space requirements, memory access characteristics, and hotspot functions of the LZMA algorithm and implement the thread-level parallelism of the LZMA algorithm based on Athread interface. Furthermore, we make a fine-grained layout of LDM address space to achieve DMA double buffer cyclic sliding window algorithm, which optimizes the performance of SW-LZMA. The experimental results show that compared with the serial baseline implementation of LZMA, the parallel LZMA algorithm obtains a maximum speedup ratio of 4.1 times using the Silesia corpus benchmark, while on the large-scale data set, speedup is 5.3 times.

Download Full-text

Developing a prototype of high-performance graph-processing framework for NEC SX–Aurora TSUBASA vector architecture

Numerical Methods and Programming (Vychislitel'nye Metody i Programmirovanie) ◽

10.26089/nummet.v21r325 ◽

2020 ◽

pp. 290-305

Author(s):

И.В. Афанасьев

Keyword(s):

Graph Algorithms ◽

High Performance ◽

Graph Algorithm ◽

Efficient Implementation ◽

Graph Processing ◽

Irregular Structure ◽

Vector Systems ◽

Order Of Magnitude ◽

Vector Graph ◽

Processing Framework

В данной статье описан подход к созданию прототипа графового фреймворка VGL (Vector Graph Library), нацеленного на эффективную реализацию графовых алгоритмов для современной векторной архитектуры NEC SX–Aurora TSUBASA. Современные векторные системы позволяют значительно ускорять приложения, интенсивно использующие подсистему памяти, подклассом которых являются графовые алгоритмы. Однако подходы к эффективной реализации графовых алгоритмов для векторных систем на сегодняшний день исследованы крайне слабо: вследствие сильно нерегулярной структуры графов реального мира, эффективно задействовать векторные особенности целевых платформ затруднительно. В работе показано, что разработанные на основе предложенного фреймворка VGL реализации графовых алгоритмов не уступают в производительности оптимизированным “вручную” аналогам за счет инкапсуляции большого числа оптимизаций графовых алгоритмов, характерных для векторных систем. Вместе с этим предложенный фреймворк позволяет значительно упростить процесс разработки графовых алгоритмов для векторных систем, на порядок сокращая объем кода реализуемых алгоритмов и скрывая от пользователя особенности программирования систем данного класса. This article describes a prototype of graph-processing framework VGL (Vector Graph Library), aimed at the efficient implementation of graph algorithms for the modern NEC SX–Aurora TSUBASA vector architecture. Present day vector systems can significantly speed up various memory-intensive applications, including graph algorithms. However, approaches to the efficient implementation of graph algorithms for vector systems have been studied extremely poorly as of today: due to the highly irregular structure of real-world graphs, it is difficult to effectively use vector features of target platforms. This paper shows that the implementations of graph algorithms developed on the basis of the proposed VGL framework show the performance comparable to their manually optimized versions due to the encapsulation of a large number of graph algorithm optimizations typical for vector systems. At the same time, the proposed framework makes it possible to significantly simplify the process of developing graph algorithms for vector systems, by an order of magnitude reducing the amount of code for implemented algorithms and hiding the programming features of systems of this class from the user.

Download Full-text