Parallelisation of equation-based simulation programs on heterogeneous computing systems

2018 ◽  
Vol 4 ◽  
pp. e160 ◽  
Author(s):  
Dragan D. Nikolić

Numerical solutions of equation-based simulations require computationally intensive tasks such as evaluation of model equations, linear algebra operations and solution of systems of linear equations. The focus in this work is on parallel evaluation of model equations on shared-memory systems such as general-purpose processors (multi-core CPUs and many-core devices), streaming processors (Graphics Processing Units and Field Programmable Gate Arrays) and heterogeneous systems. The current approaches to evaluation of model equations are reviewed and their capabilities and shortcomings analysed. Since stream computing differs from traditional computing in that the system processes a sequential stream of elements, equations must be transformed into a data structure suitable for both computing models. Postfix-notation expression stacks are recognised as a platform- and programming-language-independent method to describe, store in computer memory and evaluate general systems of differential and algebraic equations of any size. Each mathematical operation and its operands are described by a specially designed data structure, and every equation is transformed into an array of these structures (a Compute Stack). Compute Stacks are evaluated by a stack machine using a Last In First Out (LIFO) queue. The stack machine is implemented in the DAE Tools modelling software in the C99 language using two Application Programming Interfaces (APIs)/frameworks for parallelism. The Open Multi-Processing (OpenMP) API is used for parallelisation on general-purpose processors, and the Open Computing Language (OpenCL) framework is used for parallelisation on streaming processors and heterogeneous systems. The performance of the sequential Compute Stack approach is compared to a direct C++ implementation and to the previous approach based on evaluation trees. The new approach is 45% slower than the C++ implementation and more than five times faster than the previous one.
The OpenMP and OpenCL implementations are tested on three medium-scale models using a multi-core CPU, a discrete GPU, an integrated GPU and heterogeneous computing setups. Execution times are compared and analysed and the advantages of the OpenCL implementation running on a discrete GPU and heterogeneous systems are discussed. It is found that the evaluation of model equations using the parallel OpenCL implementation running on a discrete GPU is up to twelve times faster than the sequential version while the overall simulation speed-up gained is more than three times.
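The Compute Stack idea described above can be illustrated with a minimal sketch (an illustration of the general postfix technique, not the DAE Tools C99 implementation): each equation is flattened into an array of operation records and evaluated by a tiny stack machine using a LIFO working stack.

```python
# Minimal sketch of a Compute Stack: each equation is a postfix array of
# (opcode, argument) items; a stack machine evaluates it with a LIFO stack.
from enum import Enum

class Op(Enum):
    CONST = 0   # push a constant value
    VAR = 1     # push the value of the variable with the given index
    ADD = 2
    SUB = 3
    MUL = 4

def evaluate(compute_stack, variables):
    """Evaluate one equation given as a list of (Op, arg) items."""
    stack = []                       # Last-In-First-Out working stack
    for op, arg in compute_stack:
        if op is Op.CONST:
            stack.append(arg)
        elif op is Op.VAR:
            stack.append(variables[arg])
        else:                        # binary operation: pop two, push one
            rhs = stack.pop()
            lhs = stack.pop()
            if op is Op.ADD:
                stack.append(lhs + rhs)
            elif op is Op.SUB:
                stack.append(lhs - rhs)
            elif op is Op.MUL:
                stack.append(lhs * rhs)
    return stack.pop()

# The equation 2*x0 + x1 in postfix form:
eq = [(Op.CONST, 2.0), (Op.VAR, 0), (Op.MUL, None), (Op.VAR, 1), (Op.ADD, None)]
print(evaluate(eq, [3.0, 4.0]))      # 2*3 + 4 = 10.0
```

Because the same flat array of records can be walked sequentially by a CPU thread or by one OpenCL work-item per equation, the representation suits both traditional and stream computing.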

2016 ◽  
Vol 850 ◽  
pp. 129-135
Author(s):  
Buğra Şimşek ◽  
Nursel Akçam

This study presents a parallelisation of the Hamming Distance algorithm, which is used for iris comparison in iris recognition systems, for heterogeneous systems that can include Central Processing Units (CPUs), Graphics Processing Units (GPUs), Digital Signal Processing (DSP) boards, Field Programmable Gate Arrays (FPGAs) and other mobile platforms, using OpenCL. OpenCL allows the same code to run on CPUs, GPUs, FPGAs and DSP boards. Heterogeneous computing refers to systems that include different kinds of devices (CPUs, GPUs, FPGAs and other accelerators); for suitable algorithms, it gains performance or reduces power on these OpenCL-supported devices. In this study, the Hamming Distance algorithm has been coded in C++ as sequential code and parallelised with OpenCL using a method we designed. Our OpenCL code has been executed on an Nvidia GT430 GPU and an Intel Xeon 5650 processor. The OpenCL implementation demonstrates a speed-up of up to 87 times through parallelisation. Our study also differs from other work on accelerating iris matching in that it achieves heterogeneous computing by using OpenCL.
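The kernel being parallelised is conceptually simple; a sequential sketch of iris-code comparison by Hamming distance (an illustration, not the authors' OpenCL kernel) is:

```python
# Toy sketch of Hamming-distance iris comparison: two binary iris codes
# are XORed and the fraction of differing bits is the distance
# (0.0 = identical codes, 1.0 = complementary codes).
def hamming_distance(code_a: bytes, code_b: bytes) -> float:
    assert len(code_a) == len(code_b)
    differing = sum(bin(a ^ b).count("1") for a, b in zip(code_a, code_b))
    return differing / (8 * len(code_a))

print(hamming_distance(b"\xff\x00", b"\xff\x00"))  # 0.0
print(hamming_distance(b"\x0f", b"\xf0"))          # 1.0
```

The per-byte XOR/popcount operations are independent, which is why the comparison maps naturally onto one OpenCL work-item per chunk followed by a reduction.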


2014 ◽  
Vol 23 (08) ◽  
pp. 1430002 ◽  
Author(s):  
SPARSH MITTAL

Initially introduced as special-purpose accelerators for graphics applications, graphics processing units (GPUs) have now emerged as general-purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as the unique architecture of GPUs and the rise of CPU–GPU heterogeneous computing, demand effective management of caches to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey several architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges of cache management in GPUs. The aim of this paper is to provide readers with insights into cache management techniques for GPUs and to motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.


2019 ◽  
Vol 5 ◽  
pp. e227
Author(s):  
Volodymyr B. Kopei ◽  
Oleh R. Onysko ◽  
Vitalii G. Panchuk

Typically, component-oriented acausal hybrid modelling of complex dynamic systems is implemented with specialised modelling languages; a well-known example is the Modelica language. The specialised nature and the complexity of implementing and learning such languages somewhat limit their development and wide use by developers who know only general-purpose languages. This paper suggests the principle of developing a Modelica-like system, simple to understand and modify, based on the general-purpose programming language Python. The principle consists in the following: (1) Python classes are used to describe components and their systems; (2) the declarative symbolic tools of SymPy are used to describe component behaviour by difference or differential equations; (3) the solution procedure uses a function initially created with the SymPy lambdify function and computes unknown values at the current step from known values at the previous step; (4) Python imperative constructs are used for simple event handling; (5) external solvers of differential-algebraic equations can optionally be applied via the Assimulo interface; (6) the SymPy package allows arbitrary manipulation of model equations, code generation and symbolic solution of some equations. A basic set of mechanical components (1D translational "mass", "spring-damper" and "force") is developed. Models of a sucker rod string are developed and simulated using these components. Comparison of the sucker rod string simulation results with practical dynamometer cards and with Modelica results verifies the adequacy of the models. The proposed approach simplifies the understanding, modification and improvement of the system and its adaptation for other purposes, makes it available to a much larger community, and simplifies integration into third-party software.
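Point (3) above, the lambdify-then-step procedure, can be sketched as follows (a minimal illustration with assumed symbol names, not the authors' component classes; the mass-spring system and explicit Euler stepping are chosen here for brevity):

```python
# Component behaviour is written as SymPy expressions, compiled to a fast
# numeric function with lambdify, and stepped explicitly: unknowns at the
# current step are computed from known values at the previous step.
import sympy as sp

x_prev, v_prev, dt = sp.symbols("x_prev v_prev dt")
k, m = 1.0, 1.0                        # spring stiffness and mass

# difference equations of a 1D translational mass-spring (explicit Euler)
x_next = x_prev + dt * v_prev
v_next = v_prev + dt * (-k / m * x_prev)
step = sp.lambdify((x_prev, v_prev, dt), (x_next, v_next))

x, v = 1.0, 0.0                        # initial displacement and velocity
for _ in range(1000):                  # integrate to t = 10
    x, v = step(x, v, 0.01)
print(x, v)  # roughly cos(10) and -sin(10), with a small explicit-Euler drift
```

Because the equations stay symbolic until lambdify is called, they remain open to the manipulations mentioned in point (6): substitution, simplification, code generation, or handing the system to an external DAE solver instead.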


Author(s):  
Javier García-Blas ◽  
Christopher Brown

High-Level Heterogeneous and Hierarchical Parallel Systems (HLPGPU) aims to bring together researchers and practitioners to present new results and ongoing work on those aspects of high-level programming relevant, or specific to general-purpose computing on graphics processing units (GPGPUs) and new architectures. The 2016 HLPGPU symposium was an event co-located with the HiPEAC conference in Prague, Czech Republic. HLPGPU is targeted at high-level parallel techniques, including programming models, libraries and languages, algorithmic skeletons, refactoring tools and techniques for parallel patterns, tools and systems to aid parallel programming, heterogeneous computing, timing analysis and statistical performance models.


2014 ◽  
Vol 39 (4) ◽  
pp. 233-248 ◽  
Author(s):  
Milosz Ciznicki ◽  
Krzysztof Kurowski ◽  
Jan Węglarz

Heterogeneous many-core computing resources are increasingly popular among users due to their improved performance over homogeneous systems. Many developers have realised that heterogeneous systems, e.g. a combination of a shared-memory multi-core CPU machine with massively parallel Graphics Processing Units (GPUs), can provide significant performance opportunities to a wide range of applications. However, the best overall performance can only be achieved if application tasks are efficiently assigned in time to the different types of processing units, taking into account their specific resource requirements. Additionally, one should note that although available heterogeneous resources have been designed as general-purpose units, many have built-in features that accelerate specific application operations. In other words, the same algorithm or application functionality can be implemented as a different task for the CPU or the GPU; nevertheless, from the perspective of various evaluation criteria, e.g. total execution time or energy consumption, we may observe completely different results. Since tasks can be scheduled and managed in many alternative ways on many-core CPUs and GPUs, with a huge impact on overall computing resource performance, new and improved resource management techniques are needed. In this paper we discuss results achieved during experimental performance studies of selected task scheduling methods in heterogeneous computing systems. Additionally, we present a new architecture for a resource allocation and task scheduling library which provides a generic application programming interface at the operating system level for improving scheduling policies, taking into account the diversity of tasks and the characteristics of heterogeneous computing resources.
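The core scheduling problem can be illustrated with a toy greedy earliest-finish-time heuristic (an illustration of the problem setting, not the library described above): each independent task has a different cost on each processor type and is assigned to whichever unit would finish it soonest.

```python
# Greedy earliest-finish-time assignment of independent tasks to
# heterogeneous processing units with per-unit execution costs.
def schedule(task_costs, units):
    """task_costs: list of {unit_name: execution_time}; units: unit names."""
    finish = {u: 0.0 for u in units}       # time at which each unit is free
    assignment = []
    for costs in task_costs:
        # pick the unit with the earliest finish time for this task
        best = min(units, key=lambda u: finish[u] + costs[u])
        finish[best] += costs[best]
        assignment.append(best)
    return assignment, max(finish.values())  # mapping and makespan

tasks = [{"cpu": 4.0, "gpu": 1.0},         # data-parallel: GPU-friendly
         {"cpu": 2.0, "gpu": 5.0},         # branchy: CPU-friendly
         {"cpu": 4.0, "gpu": 1.0}]
print(schedule(tasks, ["cpu", "gpu"]))     # (['gpu', 'cpu', 'gpu'], 2.0)
```

Swapping the cost dictionaries for measured energy figures turns the same heuristic into an energy-aware scheduler, which is exactly why the choice of evaluation criterion changes the resulting assignment.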


Author(s):  
Mayank Bhura ◽  
Pranav H. Deshpande ◽  
K. Chandrasekaran

Usage of General Purpose Graphics Processing Units (GPGPUs) in high-performance computing is increasing as heterogeneous systems continue to become dominant. CUDA has been the programming environment for nearly all such NVIDIA GPU based GPGPU applications; however, the framework runs only on NVIDIA GPUs, and utilising other available computing devices requires reimplementation in another framework. OpenCL provides a vendor-neutral and open programming environment, with implementations available for CPUs, GPUs and other types of accelerators; OpenCL can thus be regarded as a write-once, run-anywhere framework. Despite this, both frameworks have their own pros and cons. This chapter presents a comparison of the performance of the CUDA and OpenCL frameworks, using an algorithm that finds the sum of all possible triple products on a list of integers, implemented on GPUs.
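A sequential reference version of the benchmark kernel helps make the comparison concrete (an assumption on our part: "sum of all possible triple products" is read here as the sum over every 3-element combination of the list):

```python
# Sequential reference for the benchmark: sum the product of every
# 3-element combination of a list of integers.
from itertools import combinations

def sum_triple_products(values):
    return sum(a * b * c for a, b, c in combinations(values, 3))

# 1*2*3 + 1*2*4 + 1*3*4 + 2*3*4 = 6 + 8 + 12 + 24 = 50
print(sum_triple_products([1, 2, 3, 4]))   # 50
```

Under this reading the quantity is the third elementary symmetric polynomial of the list, and the O(n³) enumeration is an embarrassingly parallel reduction, which makes it a natural stress test for both CUDA and OpenCL.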


2017 ◽  
Vol 2017 ◽  
pp. 1-19 ◽  
Author(s):  
Gabriel A. León-Paredes ◽  
Liliana I. Barbosa-Santillán ◽  
Juan J. Sánchez-Escobar

Latent Semantic Analysis (LSA) is a method that allows us to automatically index and retrieve information from a set of objects by reducing the term-by-document matrix using the Singular Value Decomposition (SVD) technique. However, LSA has a high computational cost when analysing large amounts of information. The goals of this work are (i) to improve the execution time of the semantic space construction, dimensionality reduction and information retrieval stages of LSA using heterogeneous systems and (ii) to evaluate the accuracy and recall of the information retrieval stage. We present a heterogeneous Latent Semantic Analysis (hLSA) system developed using a General-Purpose computing on Graphics Processing Units (GPGPU) architecture, which solves large numeric problems faster through the thousands of concurrent threads on the multiple CUDA cores of GPUs, and a multi-CPU architecture, which solves large text problems faster through a multiprocessing environment. We execute the hLSA system with documents from the PubMed Central (PMC) database. The results of the experiments show that the acceleration reached by the hLSA system for large matrices with one hundred and fifty thousand million values is around eight times that of the standard LSA version, with an accuracy of 88% and a recall of 100%.
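The LSA pipeline being accelerated can be sketched in a few lines (a toy illustration with made-up data, not the hLSA system itself): build a term-by-document matrix, reduce it with a truncated SVD, and rank documents against a query by cosine similarity in the reduced space.

```python
# Toy LSA pipeline: term-by-document matrix -> truncated SVD -> retrieval.
import numpy as np

# toy term-by-document matrix (rows: terms, columns: documents)
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])

k = 2                                   # reduced dimensionality
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = (np.diag(s[:k]) @ Vt[:k]).T    # documents in the k-dim space

query = np.array([1., 1., 0., 0.])      # query as a term vector
q_k = query @ U[:, :k]                  # fold the query into the same space

sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(np.argsort(-sims))                # documents ranked by similarity
```

The expensive stages at scale are exactly the two dense linear-algebra steps, the matrix construction and the SVD, which is what motivates offloading them to GPGPU and multi-CPU back ends.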


2021 ◽  
Vol 5 (1) ◽  
pp. 8
Author(s):  
Cundi Han ◽  
Yiming Chen ◽  
Da-Yan Liu ◽  
Driss Boutat

This paper applies a numerical method of polynomial function approximation to the numerical analysis of a variable fractional order viscoelastic rotating beam. First, the governing equation of the viscoelastic rotating beam is established based on the variable fractional model of the viscoelastic material. Second, shifted Bernstein polynomials and Legendre polynomials are used as basis functions to approximate the governing equation, and the original equation is converted to matrix product form. Based on the collocation method, the matrix equation is further transformed into algebraic equations, and numerical solutions of the governing equation are obtained directly in the time domain. Finally, the efficiency of the proposed algorithm is demonstrated by analysing the numerical solutions for the displacement of the rotating beam under different loads.
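The basis-function idea underlying the method can be illustrated with the classical Bernstein approximation on [0, 1] (a minimal sketch of the polynomial basis only, not the paper's variable-order fractional scheme):

```python
# Degree-n Bernstein approximation of f on [0, 1]:
#   B_n f(x) = sum_i f(i/n) * C(n, i) * x**i * (1 - x)**(n - i),
# which converges to f as n grows.
from math import comb

def bernstein_approx(f, n, x):
    return sum(f(i / n) * comb(n, i) * x**i * (1 - x)**(n - i)
               for i in range(n + 1))

f = lambda x: x * x
print(abs(bernstein_approx(f, 200, 0.5) - f(0.5)))  # small error, 0.25/200
```

In the paper's scheme the unknown displacement is expanded in such a basis, so that substituting the expansion at a set of collocation points turns the governing equation into an algebraic system for the expansion coefficients.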


2011 ◽  
Vol 28 (1) ◽  
pp. 1-14 ◽  
Author(s):  
W. van Straten ◽  
M. Bailes

dspsr is a high-performance, open-source, object-oriented digital signal processing software library and application suite for use in radio pulsar astronomy. Written primarily in C++, the library implements an extensive range of modular algorithms that can optionally exploit both multi-core processors and general-purpose graphics processing units. After over a decade of research and development, dspsr is now stable and in widespread use in the community. This paper presents a detailed description of its functionality, justification of major design decisions, analysis of phase-coherent dispersion removal algorithms, and a demonstration of performance on some contemporary microprocessor architectures.
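The principle behind phase-coherent dispersion removal can be shown with a toy round-trip (illustrative numbers and an assumed quadratic chirp, not dspsr's actual dedispersion kernel): the dispersive medium applies a frequency-dependent phase to the voltage signal, and multiplying the spectrum by the inverse chirp undoes it exactly, up to numerical precision.

```python
# Toy coherent dedispersion: apply a frequency-dependent phase in the
# Fourier domain, then remove it by multiplying by the inverse chirp.
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(4096)            # baseband voltage samples

freqs = np.fft.fftfreq(signal.size)           # normalised frequencies
phase = 2e4 * freqs**2                        # assumed quadratic chirp phase
disperse = np.exp(1j * phase)                 # unit-modulus transfer function

dispersed = np.fft.ifft(np.fft.fft(signal) * disperse)
recovered = np.fft.ifft(np.fft.fft(dispersed) / disperse)

print(np.allclose(recovered.real, signal))    # True: phase fully removed
```

Because the correction is applied to the voltages before detection rather than to detected power, the intra-channel dispersion smearing is removed coherently, which is the defining advantage of this class of algorithms.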

