A Fast CT Reconstruction Scheme for a General Multi-Core PC

International Journal of Biomedical Imaging ◽

10.1155/2007/29160 ◽

2007 ◽

Vol 2007 ◽

pp. 1-9 ◽

Cited By ~ 16

Author(s):

Kai Zeng ◽

Erwei Bai ◽

Ge Wang

Keyword(s):

Data Exchange ◽

Graphics Processing Unit ◽

Computational Cost ◽

Processing Unit ◽

Ct Reconstruction ◽

Reconstruction Process ◽

Multiple Data ◽

Simd Processing ◽

Specialized Hardware ◽

Graphics Processing

High computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA), which we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and inflexible. The graphics processing unit (GPU) in a current graphics card can only reconstruct images at reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. The proposed scheme integrates several techniques, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and the Intel C++ compiler. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved severalfold using the latest quad-core processors.
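
Two of the techniques named in this abstract, geometric symmetry and multithreaded computation, can be illustrated with a small sketch. It is not the authors' implementation: it assumes a parallel-beam geometry, omits the filtering step and SIMD, and uses only the point symmetry that pixel (-x, -y) projects to detector position -t when pixel (x, y) projects to t. The function names (`sample`, `backproject_naive`, `backproject_symmetric`) and the `workers` parameter are hypothetical. In CPython the thread pool only shows how rows could be partitioned; it does not give a real speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def sample(view, pos):
    """Linearly interpolate one projection at fractional detector index pos."""
    i = math.floor(pos)
    if i < 0 or i >= len(view) - 1:
        return 0.0  # ray falls outside the detector
    f = pos - i
    return (1.0 - f) * view[i] + f * view[i + 1]


def backproject_naive(sino, angles, size):
    """Reference version: every pixel computes its own detector position."""
    c = (size - 1) / 2.0
    dc = (len(sino[0]) - 1) / 2.0
    trig = [(math.cos(a), math.sin(a)) for a in angles]
    img = [[0.0] * size for _ in range(size)]
    for iy in range(size):
        for ix in range(size):
            x, y = ix - c, iy - c
            img[iy][ix] = sum(sample(v, dc + x * cs + y * sn)
                              for v, (cs, sn) in zip(sino, trig))
    return img


def _half_rows(sino, trig, size, rows):
    """Backproject the given rows and their point-mirrored partners."""
    c = (size - 1) / 2.0
    dc = (len(sino[0]) - 1) / 2.0
    out = []
    for iy in rows:
        row, mirror = [0.0] * size, [0.0] * size
        for ix in range(size):
            x, y = ix - c, iy - c
            acc = acc_m = 0.0
            for v, (cs, sn) in zip(sino, trig):
                t = x * cs + y * sn          # computed once per pixel pair
                acc += sample(v, dc + t)     # pixel (x, y)
                acc_m += sample(v, dc - t)   # pixel (-x, -y) reuses t
            row[ix] = acc
            mirror[size - 1 - ix] = acc_m
        out.append((iy, row, mirror))
    return out


def backproject_symmetric(sino, angles, size, workers=4):
    """Split the upper half of the image across threads; mirror the rest."""
    trig = [(math.cos(a), math.sin(a)) for a in angles]
    half = list(range((size + 1) // 2))
    chunks = [half[k::workers] for k in range(workers)]
    img = [None] * size
    with ThreadPoolExecutor(max_workers=workers) as pool:
        jobs = pool.map(lambda rows: _half_rows(sino, trig, size, rows), chunks)
        for part in jobs:
            for iy, row, mirror in part:
                img[size - 1 - iy] = mirror
                img[iy] = row  # for odd sizes the middle row is its own mirror
    return img
```

The symmetric version evaluates the detector coordinate for only half of the pixels, which is the kind of saving a symmetry-based scheme aims for; a compiled implementation would combine this with SIMD loads and true parallel threads.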

Finite element method completely implemented for graphic processor units using parallel algorithm libraries

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017694703 ◽

2017 ◽

Vol 33 (1) ◽

pp. 53-66 ◽

Cited By ~ 1

Author(s):

Franz Pichler ◽

Gundolf Haase

Keyword(s):

Finite Element ◽

Graphics Processing Unit ◽

Computational Cost ◽

Processing Unit ◽

Time Step ◽

Device Architecture ◽

Transient Problems ◽

Speed Up ◽

Automotive Batteries ◽

Graphics Processing

A finite element code is developed in which all of the computationally expensive steps are performed on a graphics processing unit via the THRUST and PARALUTION libraries. The code focuses on the simulation of transient problems, where the computations repeated at every time step dominate the computational cost. It is used to solve partial and ordinary differential equations as they arise in thermal-runaway simulations of automotive batteries. The speed-up obtained by utilizing the graphics processing unit for every critical step is compared against the single-core and multi-threaded solutions, which are also supported by the chosen libraries. In this way, a high total speed-up on the graphics processing unit is achieved without the need to program a single classical Compute Unified Device Architecture (CUDA) kernel.
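
The idea of building a transient solver purely from library primitives can be sketched as follows. This is an illustration only, not the paper's code: it assumes a 1D heat equation with linear elements, a lumped mass matrix, insulated ends, and explicit Euler time stepping, and it uses Python's `map` and `functools.reduce` as stand-ins for element-wise and reduction primitives such as Thrust's `transform` and `reduce`. All function names are hypothetical.

```python
from functools import reduce
from operator import add


def stiffness_apply(u, h):
    """Apply the 1D linear-element stiffness matrix to u (insulated ends)."""
    n = len(u)

    def row(i):
        left = u[i] - u[i - 1] if i > 0 else 0.0
        right = u[i] - u[i + 1] if i < n - 1 else 0.0
        return (left + right) / h

    return list(map(row, range(n)))  # element-wise "transform"


def lumped_mass(n, h):
    """Diagonal (lumped) mass matrix for a uniform mesh with spacing h."""
    return [h / 2 if i in (0, n - 1) else h for i in range(n)]


def step(u, mass, h, dt):
    """One explicit Euler step of M du/dt = -K u, built from map only."""
    ku = stiffness_apply(u, h)
    return list(map(lambda ui, mi, ki: ui - dt * ki / mi, u, mass, ku))


def total_heat(u, mass):
    """Reduction: sum of mass-weighted nodal values."""
    return reduce(add, map(lambda ui, mi: ui * mi, u, mass), 0.0)


def run(u0, h, dt, steps):
    """Repeat the same per-step primitives, as in a transient simulation."""
    mass = lumped_mass(len(u0), h)
    u = list(u0)
    for _ in range(steps):
        u = step(u, mass, h, dt)
    return u
```

Every time step consists of the same few element-wise operations, which is why moving exactly those operations to the accelerator accounts for most of the run time in a transient problem. The explicit scheme here needs `dt <= h*h/2` for stability; an implicit scheme would instead call an iterative solver from a library each step.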

Graphics processing unit implementation of the F-statistic for continuous gravitational wave searches

Classical and Quantum Gravity ◽

10.1088/1361-6382/ac4616 ◽

2021 ◽

Author(s):

Liam Dunn ◽

Patrick Clearwater ◽

Andrew Melatos ◽

Karl Wette

Keyword(s):

Gravitational Wave ◽

Graphics Processing Units ◽

Graphics Processing Unit ◽

Computational Cost ◽

Processing Unit ◽

Central Processing ◽

Long Baseline ◽

Using Data ◽

Graphics Processing ◽

Gpu Implementation

The F-statistic is a detection statistic used widely in searches for continuous gravitational waves with terrestrial, long-baseline interferometers. A new implementation of the F-statistic is presented which accelerates the existing "resampling" algorithm using graphics processing units (GPUs). The new implementation runs between 10 and 100 times faster than the existing implementation on central processing units without sacrificing numerical accuracy. The utility of the GPU implementation is demonstrated on a pilot narrowband search for four newly discovered millisecond pulsars in the globular cluster Omega Centauri using data from the second Laser Interferometer Gravitational-Wave Observatory observing run. The computational cost is 17.2 GPU-hours using the new implementation, compared to 1092 core-hours with the existing implementation.
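
As a quick consistency check on the figures quoted in this abstract, the ratio of the two costs can be computed directly. The sketch assumes the GPU cost reads 17.2 GPU-hours and treats one GPU-hour and one CPU core-hour as comparable resource units, which is a simplification; the variable names are illustrative.

```python
gpu_hours = 17.2     # cost with the GPU implementation (from the abstract)
core_hours = 1092.0  # cost with the existing CPU implementation

# Resource ratio: CPU core-hours spent per GPU-hour for the same search.
ratio = core_hours / gpu_hours
print(f"{ratio:.1f}")  # prints 63.5
```

The result falls inside the quoted range of 10 to 100, so the pilot-search cost figures agree with the stated speed-up.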

A multiple-data-based efficient global optimization algorithm and its parallel implementation for automotive body design

Advances in Mechanical Engineering ◽

10.1177/1687814018794341 ◽

2018 ◽

Vol 10 (8) ◽

pp. 168781401879434 ◽

Cited By ~ 1

Author(s):

Bing Xu ◽

Yong Cai

Keyword(s):

Global Optimization ◽

Optimization Algorithm ◽

Graphics Processing Unit ◽

Optimization Method ◽

Efficient Global Optimization ◽

Processing Unit ◽

Global Optimization Algorithm ◽

Multiple Data ◽

Practical Engineering ◽

Graphics Processing

The purpose of this article is to improve the convergence efficiency of the traditional efficient global optimization method. Furthermore, we apply a graphics processing unit–based parallel computing method to improve the computing efficiency of the efficient global optimization method for both mathematical and practical engineering problems. First, we propose a multiple-data-based efficient global optimization algorithm instead of the multiple-surrogates-based efficient global optimization algorithm. Second, a novel graphics processing unit–based general-purpose computing technology is adopted to accelerate our multiple-data-based efficient global optimization algorithm. Third, a hybrid parallel computing approach using OpenMP and Compute Unified Device Architecture (CUDA) is adopted to further improve the solution efficiency of forward problems in practical applications. This is accomplished by integrating the graphics processing unit–based finite element method numerical analysis system into the optimization software. The numerical results show that, for the same problem and the same number of optimization iterations, the optimal result of the multiple-data-based efficient global optimization algorithm is consistently better than that of the multiple-surrogates-based efficient global optimization algorithm. In addition, the graphics processing unit–based parallel simulation system helps reduce the calculation time for practical engineering problems. The multiple-data-based efficient global optimization method performs stably on both high-order mathematical functions and large-scale nonlinear practical engineering optimization problems. An added benefit is that computational time and accuracy are no longer obstacles.
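
For readers unfamiliar with efficient global optimization, the sketch below shows the expected-improvement criterion used by the classical method to choose the next sample point. It is the textbook single-surrogate form for minimization, not the multiple-data variant proposed in the article, and it assumes the surrogate already supplies a predicted mean and standard deviation at each candidate. The names `expected_improvement` and `pick_next` are hypothetical.

```python
import math


def expected_improvement(mu, sigma, f_best):
    """Expected improvement over f_best for a Gaussian prediction (minimization)."""
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (f_best - mu) * cdf + sigma * pdf


def pick_next(candidates, f_best):
    """Return the label of the candidate (label, mu, sigma) with the largest EI."""
    best = max(candidates, key=lambda c: expected_improvement(c[1], c[2], f_best))
    return best[0]


# Usage: an uncertain candidate can beat a certain one with a similar mean.
choice = pick_next([("a", 1.0, 0.0), ("b", 1.2, 1.0)], f_best=1.0)
```

Each evaluation of the criterion is cheap, while each new sample requires a full forward simulation, which is the part the article offloads to the graphics processing unit.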

Fast τ-p transforms by chirp modulation

Geophysics ◽

10.1190/geo2018-0380.1 ◽

2019 ◽

Vol 84 (1) ◽

pp. A13-A17 ◽

Cited By ~ 1

Author(s):

Fredrik Andersson ◽

Johan Robertsson

Keyword(s):

Fourier Transform ◽

Computational Complexity ◽

Fourier Transforms ◽

Graphics Processing Unit ◽

Computational Cost ◽

Processing Unit ◽

Chirp Modulation ◽

Fourier Sums ◽

Graphics Processing ◽

Direct Implementation

We have developed simple, fast, and accurate algorithms for the linear Radon (τ-p) transform and its inverse. The algorithms have an [Formula: see text] computational complexity in contrast to the [Formula: see text] cost of a direct implementation in 2D and an [Formula: see text] computational complexity compared to the [Formula: see text] cost of a direct implementation in 3D. The methods use Bluestein’s algorithm to evaluate discrete nonstandard Fourier sums, and they need, apart from the fast Fourier transform (FFT), only multiplication of chirp functions and their Fourier transforms. The computational cost and accuracy are thus reduced to those inherited from the FFT. Fully working algorithms can be implemented in a couple of lines of code. Moreover, we find that efficient graphics processing unit (GPU) implementations could achieve processing speeds of approximately [Formula: see text], implying that the algorithms are I/O bound rather than compute bound.
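
The chirp factorization behind Bluestein's algorithm can be sketched in a few lines. The example evaluates the nonstandard Fourier sum X_k = sum over n of x[n] exp(-2πi α n k) for an arbitrary real step `alpha`, using the identity nk = (n² + k² - (k - n)²)/2 to turn the sum into a convolution with a chirp, which is then computed by FFT. It is a minimal illustration and not the authors' τ-p code; the names `fft`, `chirp_sum`, `direct_sum`, and `alpha` are hypothetical, and the small recursive radix-2 FFT is included only to keep the block self-contained.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def chirp_sum(x, alpha, m):
    """Return X_k = sum_n x[n] exp(-2 pi i alpha n k) for k = 0..m-1."""
    n = len(x)
    size = 1
    while size < n + m - 1:  # room for a linear convolution
        size *= 2

    def chirp(idx):
        return cmath.exp(-1j * cmath.pi * alpha * idx * idx)

    # Pre-multiply the input by a chirp and zero-pad.
    a = [x[idx] * chirp(idx) for idx in range(n)] + [0j] * (size - n)
    # Conjugate chirp kernel, stored with wrap-around for negative lags.
    b = [0j] * size
    for lag in range(-(n - 1), m):
        b[lag % size] = chirp(lag).conjugate()
    # Circular convolution through the FFT.
    fa, fb = fft(a), fft(b)
    conv = fft([p * q for p, q in zip(fa, fb)], inverse=True)
    # Post-multiply by the chirp and undo the FFT scaling.
    return [chirp(k) * conv[k] / size for k in range(m)]


def direct_sum(x, alpha, m):
    """Direct evaluation of the same sum, for checking."""
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * alpha * n * k)
                for n in range(len(x)))
            for k in range(m)]
```

Apart from the FFTs, the only work is multiplication by chirps, which matches the abstract's description of what the method needs.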

Performance Analysis of Thread Block Schedulers in GPGPU and Its Implications

Applied Sciences ◽

10.3390/app10249121 ◽

2020 ◽

Vol 10 (24) ◽

pp. 9121

Author(s):

KyungWoon Cho ◽

Hyokyung Bahn

Keyword(s):

Block Scheduling ◽

Modular Forms ◽

Graphics Processing Unit ◽

Round Robin ◽

General Purpose ◽

Processing Unit ◽

Thread Block ◽

Computing Unit ◽

Specialized Hardware ◽

Graphics Processing

A GPGPU (General-Purpose Graphics Processing Unit) consists of hardware resources that can execute tens of thousands of threads simultaneously. In practice, however, the parallelism is limited because resources are allocated in base units called thread blocks, which are not managed judiciously in current GPGPU systems. To schedule threads in a GPGPU, a specialized hardware scheduler allocates thread blocks to computing units called SMs (Streaming Multiprocessors) in a Round-Robin manner. Although scheduling in hardware is simple and fast, we observe that Round-Robin scheduling is not efficient in GPGPU, as it does not consider the workload characteristics of threads and the resource balance among SMs. In this article, we present a new thread block scheduling model that is capable of analyzing and quantifying the performance of thread block scheduling. We implement our model as a GPGPU scheduling simulator and show that the conventional thread block scheduling provided in GPGPU hardware does not perform well as the workload becomes heavy. Specifically, we observe that the performance degradation of Round-Robin can be eliminated by adopting DFA (Depth First Allocation), which is simple but scalable. Moreover, because our simulator is built from modular components based on the framework and we make it publicly available to other researchers, various scheduling policies can be incorporated into it for evaluating the performance of GPGPU schedulers.
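
The difference between the two allocation orders can be made concrete with a toy example. The abstract does not define Depth First Allocation precisely, so the sketch assumes it means filling one SM to capacity before moving to the next; it models only where the first wave of thread blocks lands and says nothing about performance. All function and parameter names are hypothetical and unrelated to the authors' simulator.

```python
def _check(num_blocks, num_sms, slots_per_sm):
    """Only the first wave of blocks is modelled; reject anything larger."""
    if num_blocks > num_sms * slots_per_sm:
        raise ValueError("more blocks than available slots")


def round_robin(num_blocks, num_sms, slots_per_sm):
    """Block b goes to SM b mod num_sms (cycle through all SMs)."""
    _check(num_blocks, num_sms, slots_per_sm)
    return [b % num_sms for b in range(num_blocks)]


def depth_first(num_blocks, num_sms, slots_per_sm):
    """Assumed DFA: fill each SM's slots completely before using the next SM."""
    _check(num_blocks, num_sms, slots_per_sm)
    return [b // slots_per_sm for b in range(num_blocks)]


def occupancy(assignment, num_sms):
    """Count how many blocks each SM received."""
    counts = [0] * num_sms
    for sm in assignment:
        counts[sm] += 1
    return counts


# Usage: four blocks, three SMs, two slots per SM.
rr = round_robin(4, 3, 2)   # [0, 1, 2, 0]
dfa = depth_first(4, 3, 2)  # [0, 0, 1, 1]
```

Round-Robin spreads consecutive blocks across SMs, while the assumed depth-first order keeps them together; a simulator such as the one described would then evaluate how each placement interacts with the workload.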

Numerical Simulation of Multi-Layer-Liquid Sloshing by Multiphase MPS-GPU Method

Volume 8: CFD and FSI ◽

10.1115/omae2020-18086 ◽

2020 ◽

Author(s):

Xiao Wen ◽

Xiang Chen ◽

Decheng Wan

Keyword(s):

Present Method ◽

Graphics Processing Unit ◽

Computational Cost ◽

Three Dimensional ◽

Liquid Sloshing ◽

Processing Unit ◽

Mps Method ◽

Acceleration Technique ◽

Moving Particle ◽

Graphics Processing

In this paper, a new multiphase MPS-GPU method is proposed by combining the moving particle semi-implicit (MPS) method with the Graphics Processing Unit (GPU) acceleration technique. The new method not only inherits the advantage of the MPS method in capturing complex interface deformations, but also overcomes the huge computational cost of three-dimensional MPS simulation. With this method, both two-layer-liquid and three-layer-liquid sloshing problems are simulated three-dimensionally on the GPU device, with more than one million particles included. In the simulations, the sloshing patterns of each liquid layer under different external excitations are accurately captured. From the interface elevations and impact pressures calculated by the present method, it is found that an obvious discrepancy exists between the deformations of the free surface and the phase interfaces. The results obtained by the multiphase MPS-GPU method are then compared with experimental data and other numerical results in the open literature, and good agreement is achieved, which validates the accuracy and applicability of the present method in three-dimensional simulations of multi-layer-liquid sloshing flows.
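
The basic building block of the MPS method, the kernel-weighted particle number density, can be sketched as follows. The example uses the standard single-phase weight function w(r) = r_e/r - 1 for 0 < r < r_e and is not the multiphase formulation of the paper. It loops over all particle pairs, which is only practical for small examples; the names `kernel`, `number_density`, and `r_e` are illustrative.

```python
import math


def kernel(r, r_e):
    """Standard MPS weight: r_e / r - 1 inside the support radius, else 0."""
    if r <= 0.0 or r >= r_e:
        return 0.0
    return r_e / r - 1.0


def number_density(particles, i, r_e):
    """Sum of kernel weights between particle i and every other particle."""
    xi = particles[i]
    return sum(kernel(math.dist(xi, xj), r_e)
               for j, xj in enumerate(particles) if j != i)


# Usage: three particles on a line with unit spacing.
pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
n_mid = number_density(pts, 1, r_e=2.1)  # two neighbours at distance 1
n_end = number_density(pts, 0, r_e=2.1)  # neighbours at distances 1 and 2
```

Because every particle needs such a neighbour sum at each time step, the cost grows quickly with particle count, which is the motivation for evaluating it in parallel on a GPU together with a neighbour-search structure.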

Fast iterative solvers for large compressed-sparse row linear systems on graphics processing unit

Pollack Periodica ◽

10.1556/pollack.10.2015.1.1 ◽

2015 ◽

Vol 10 (1) ◽

pp. 3-18 ◽

Cited By ~ 1

Author(s):

Frédéric Magoulès ◽

Abal-Kassim Cheik Ahamed ◽

Roman Putanowicz

Keyword(s):

Linear Systems ◽

Graphics Processing Unit ◽

Iterative Solvers ◽

Processing Unit ◽

Compressed Sparse Row ◽

Graphics Processing

Performance Analysis and Optimization of Graphics Processing Unit

SSRN Electronic Journal ◽

10.2139/ssrn.3350249 ◽

2019 ◽

Author(s):

Lokendra Singh Umrao ◽

Jay Prakash Pandey

Keyword(s):

Performance Analysis ◽

Graphics Processing Unit ◽

Processing Unit ◽

Graphics Processing

Implementing wide baseline matching algorithms on a graphics processing unit.

10.2172/921737 ◽

2007 ◽

Author(s):

Fredrick H. Rothganger ◽

Kurt W. Larson ◽

Antonio Ignacio Gonzales ◽

Daniel S. Myers

Keyword(s):

Graphics Processing Unit ◽

Processing Unit ◽

Wide Baseline Matching ◽

Graphics Processing

Two Decades of 4D-QSAR: A Dying Art or Staging a Comeback?

International Journal of Molecular Sciences ◽

10.3390/ijms22105212 ◽

2021 ◽

Vol 22 (10) ◽

pp. 5212

Author(s):

Andrzej Bak

Keyword(s):

Molecular Conformation ◽

Graphics Processing Unit ◽

Processing Unit ◽

Diverse Range ◽

Current State ◽

Gpu Clusters ◽

Pharmacophore Hypothesis ◽

Rising Power ◽

Graphics Processing ◽

Ligand Conformation

A key question confronting computational chemists concerns the preferable ligand geometry that fits complementarily into the receptor pocket. Typically, the postulated ‘bioactive’ 3D ligand conformation is constructed as a ‘sophisticated guess’ (unnecessarily geometry-optimized) mirroring the pharmacophore hypothesis, sometimes based on an erroneous prerequisite. Hence, the 4D-QSAR scheme and its ‘dialects’ have been practically implemented as a higher level of model abstraction that allows the examination of multiple molecular conformations, orientations and protonation representations. Nearly a quarter of a century has passed since the eminent work of Hopfinger appeared on the stage; therefore, the natural question arises whether the 4D-QSAR approach is still appealing to the scientific community. With no intention to be comprehensive, a review of the current state of the art in the field of receptor-independent (RI) and receptor-dependent (RD) 4D-QSAR methodology is provided, with a brief examination of the ‘mainstream’ algorithms. In fact, a myriad of 4D-QSAR methods have been implemented and applied practically to a diverse range of molecules. It seems that the 4D-QSAR approach has been experiencing a promising renaissance of interest that might be fuelled by the rising power of graphics processing unit (GPU) clusters applied to full-atom MD-based simulations of protein-ligand complexes.
