Design and implementation of multiple-precision BLAS Level 1 functions for graphics processing units

2020 ◽  
Vol 140 ◽  
pp. 25-36 ◽  
Author(s):  
Konstantin Isupov ◽  
Vladimir Knyazkov ◽  
Alexander Kuvaev

2013 ◽  
Vol 2013 ◽  
pp. 1-15 ◽  
Author(s):  
Carlos Couder-Castañeda ◽  
Carlos Ortiz-Alemán ◽  
Mauricio Gabriel Orozco-del-Castillo ◽  
Mauricio Nava-Flores

A CUDA implementation on single and multiple graphics processing units (GPUs) is presented for forward modeling of gravitational fields from a three-dimensional volumetric ensemble composed of unitary prisms of constant density. We compared the performance obtained on the GPUs against a previous version coded in OpenMP with MPI, and analyzed the results on both platforms. Today, GPUs represent a breakthrough in parallel computing and have enabled applications in many domains. Nevertheless, in some applications the decomposition into tasks is not trivial, as this paper shows. Instead of a straightforward domain decomposition, we propose to decompose the problem by sets of prisms and to use a separate memory space per CUDA processing core, avoiding the performance decay that constant kernel invocations would cause in a parallelization by observation points. The design and implementation are the main contributions of this work, since the parallelization scheme is non-trivial. The performance obtained is comparable to that of a small processing cluster.


2020 ◽  
Author(s):  
Konstantin Isupov ◽  
Vladimir Knyazkov

The binary32 and binary64 floating-point formats provide good performance on current hardware, but also introduce a rounding error in almost every arithmetic operation. Consequently, the accumulation of rounding errors in large computations can cause accuracy issues. One way to prevent these issues is to use multiple-precision floating-point arithmetic. This preprint, submitted to Russian Supercomputing Days 2020, presents a new library of basic linear algebra operations with multiple precision for graphics processing units. The library is written in CUDA C/C++ and uses the residue number system to represent multiple-precision significands of floating-point numbers. The supported data types, memory layout, and main features of the library are considered. Experimental results are presented showing the performance of the library.


2020 ◽  
Vol 11 (3) ◽  
pp. 33-59 ◽  
Author(s):  
Konstantin Sergeevich Isupov ◽  
Vladimir Sergeevich Knyazkov

We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, component-wise operations with multiple-precision vectors and matrices are split into parts, each of which is executed by a separate CUDA kernel. This eliminates branching of the execution logic and allows fuller utilization of GPU resources. An efficient data structure for storing multiple-precision arrays ensures that accesses of parallel threads to the GPU's global memory are coalesced into transactions. A rounding error analysis of the proposed GEMV implementation is performed and accuracy bounds are derived. Experimental results are presented that show the high efficiency of the developed implementation compared to existing multiple-precision software packages for GPUs.


2020 ◽  
Vol 11 (3) ◽  
pp. 61-84 ◽  
Author(s):  
Konstantin Isupov ◽  
Vladimir Knyazkov

We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 of the BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations with multiple-precision vectors and matrices consist of several parts, each of which is calculated by a separate CUDA kernel. This feature eliminates branch divergence when performing sequential parts of multiple-precision operations and allows full utilization of the GPU's resources. An efficient data structure for storing arrays with multiple-precision entries provides a coalesced access pattern to the GPU global memory. We have performed a rounding error analysis and derived error bounds for the proposed GEMV implementation. Experimental results show the high efficiency of the proposed solution compared to existing high-precision packages for GPUs.


Author(s):  
Kunjan Aggarwal ◽  
Mainak Chaudhuri

Data analysis and classification play a major role in understanding various real-life phenomena. Clustering helps analyze data with little or no prior knowledge about it. K-means clustering is a popular clustering algorithm with applications in computer vision, data mining, data visualization, etc. Due to continuously increasing data volumes, parallel computing is necessary to overcome the computational challenges involved in K-means clustering. We present the design and implementation of the K-means clustering algorithm on widely available graphics processing units (GPUs), which have the hardware architecture required to meet these parallelism needs. We analyze the scalability of our proposed methods as the number and dimensionality of data points, as well as the number of clusters, increase. We also compare our results with the current best available GPU implementations and a 24-way threaded parallel CPU implementation. We achieve a consistent speedup of 6.5x over the parallel CPU implementation.

