CPU AND GPU (CUDA) TEMPLATE MATCHING COMPARISON

2014 ◽  
Vol 6 (2) ◽  
pp. 129-133
Author(s):  
Evaldas Borcovas ◽  
Gintautas Daunys

Image processing, computer vision, and other complex optical-information-processing algorithms require large computing resources, and it is often desired to execute them in real time. Such requirements are hard to fulfil with a single CPU. NVIDIA's CUDA technology enables the programmer to use the GPU resources of the computer. The current research was made with an Intel Pentium Dual-Core T4500 2.3 GHz processor with 4 GB DDR3 RAM (CPU I) and an NVIDIA GeForce GT320M CUDA-compatible graphics card (GPU I), and with an Intel Core i5-2500K 3.3 GHz processor with 4 GB DDR3 RAM (CPU II) and an NVIDIA GeForce GTX 560 CUDA-compatible graphics card (GPU II). The CUDA-compatible libraries OpenCV 2.1 and OpenCV 2.4.0 were used for the testing. The main tests were made with the standard MatchTemplate function from the OpenCV libraries. The algorithm uses a main image and a template; the influence of both factors was tested. The main image and the template were resized, and the algorithm's computing time and performance in Gtpix/s were measured. According to the results, GPU computing on the hardware mentioned above is up to 24 times faster when processing a large amount of data. When the images are small, the performance of the CPU and the GPU does not differ significantly. The choice of template size influences the computing time on the CPU. The difference in computing time between the two GPUs can be explained by the number of cores they have: in our study, the faster GPU had 16 times as many cores, and its computations ran 16 times faster.
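The core operation the abstract benchmarks, OpenCV's MatchTemplate, slides a template over a main image and scores each position. As a minimal illustrative sketch (not the authors' code, and NumPy rather than OpenCV), the normalized cross-correlation variant of that score can be written as:

```python
import numpy as np

def match_template_ncc(image, template):
    """Slide `template` over `image`, scoring each position with
    zero-mean normalized cross-correlation (peak 1.0 at a perfect match).
    This is the O(W*H*w*h) CPU formulation that makes the GPU speedup
    reported in the abstract plausible for large images."""
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    out = np.zeros((ih - th + 1, iw - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.sqrt((p * p).sum() * (t * t).sum())
            out[y, x] = (p * t).sum() / denom if denom > 0 else 0.0
    return out
```

Each output pixel is independent of the others, which is exactly why the workload maps well onto thousands of GPU threads.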

2011 ◽  
Vol 21 (2) ◽  
pp. 131
Author(s):  
Hoang Van Hue ◽  
Nguyen Thi Thanh Ha ◽  
Pham Khac Hung

Molecular dynamics (MD) simulation is proven to be an important tool to study the structure as well as the physical properties of materials at the atomic level. However, it requires a huge computing time and hence limits the ability to treat large-scale simulations. In this paper we present a solution to speed up MD simulation using CUDA (Compute Unified Device Architecture) technology. We used the GeForce GTS 250 card with CUDA version 2.3. The simulation is implemented for Lennard-Jones systems with periodic boundary conditions consisting of 1024, 2048, 4096 and 8192 atoms. The calculation shows that the computing time depends on the size of the system and can be reduced by a factor of up to 37. This result indicates the possibility of constructing a large MD model with up to 10^5 atoms on a usual PC.
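The pairwise force evaluation dominates the cost of a Lennard-Jones MD step and is what a CUDA port parallelizes. A minimal NumPy sketch of that kernel with minimum-image periodic boundaries (illustrative only; not the paper's code):

```python
import numpy as np

def lj_forces(pos, box, eps=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces with the minimum-image convention.
    pos: (n, 3) positions; box: (3,) periodic box lengths.
    F_ij = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) / r^2 * r_vec."""
    n = pos.shape[0]
    f = np.zeros_like(pos)
    for i in range(n - 1):
        d = pos[i] - pos[i + 1:]          # displacements to all later atoms
        d -= box * np.round(d / box)      # wrap into the nearest image
        r2 = (d * d).sum(axis=1)
        inv6 = (sigma**2 / r2) ** 3       # (sigma/r)^6
        fmag = 24.0 * eps * (2.0 * inv6**2 - inv6) / r2
        fij = fmag[:, None] * d
        f[i] += fij.sum(axis=0)           # Newton's third law in both updates
        f[i + 1:] -= fij
    return f
```

On a GPU each thread would typically own one atom's row of this pair loop; the O(n^2) structure is why the reported speedup grows with system size.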


2017 ◽  
Vol 14 (1) ◽  
pp. 789-795
Author(s):  
V Saveetha ◽  
S Sophia

Parallel data clustering aims to use algorithms and methods to extract knowledge from large databases in reasonable time using high-performance architectures. The computational challenge that cluster analysis faces due to the increasing volume of data can be overcome by exploiting the power of these architectures. Recent developments in the parallel power of the Graphics Processing Unit enable low-cost, high-performance solutions for general-purpose applications. The Compute Unified Device Architecture programming model provides application programming interface methods to handle data efficiently on the Graphics Processing Unit for iterative clustering algorithms such as K-Means. Existing GPU-based K-Means algorithms focus heavily on improving the speedup of the algorithm and fall short in handling the high time spent on transferring data between the Central Processing Unit and the Graphics Processing Unit. A competent K-Means algorithm is proposed in this paper to lessen the transfer time by introducing a novel approach to checking the convergence of the algorithm and by utilizing pinned memory for direct access. This algorithm outperforms the other algorithms by maximizing parallelism and utilizing the memory features. The relative speedups and the validity measure for the proposed algorithm are higher when compared with K-Means on the Graphics Processing Unit and K-Means using a flag on the Graphics Processing Unit. Thus the proposed approach proves that communication overhead can be reduced in K-Means clustering.
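The convergence-check idea in the abstract is that, per iteration, only a single scalar flag needs to cross the device-to-host boundary, rather than the full centroid array. A CPU-side sketch of K-Means with such a check (illustrative, with naive first-k initialization; not the proposed algorithm itself):

```python
import numpy as np

def kmeans(data, k, max_iter=100, tol=1e-6):
    """Lloyd's K-Means with an explicit scalar convergence test.
    Naive initialization: the first k points serve as seed centroids."""
    centroids = data[:k].astype(float).copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # assignment step: nearest centroid per point (GPU-parallel per point)
        dist = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # update step: recompute means, keeping empty clusters in place
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        # a single scalar decides convergence -- on a GPU this flag is the
        # only value that must be copied back to the host each iteration
        moved = np.abs(new - centroids).max()
        centroids = new
        if moved < tol:
            break
    return centroids, labels
```

Pinned (page-locked) host memory, the abstract's other lever, speeds up exactly these residual host/device copies because the driver can DMA directly from it.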


2018 ◽  
Vol 14 (4) ◽  
Author(s):  
G.B. Praveen ◽  
Anita Agrawal ◽  
Shrey Pareek ◽  
Amalin Prince

Abstract Magnetic resonance imaging (MRI) is a widely used imaging modality for evaluating brain disorders. MRI generates huge volumes of data, consisting of a sequence of scans taken at different instances of time. As the presence of brain disorders has to be evaluated on all magnetic resonance (MR) sequences, manual brain disorder detection becomes a tedious process and is prone to inter- and intra-rater errors. A technique for detecting abnormalities in brain MRI using template matching is proposed. Bias field correction is performed on the volumetric scans using the N4ITK filter, followed by volumetric registration. Normalized cross-correlation template matching is used for image registration, taking the rotation and scaling operations into account. A template of the abnormality is selected and then matched in the volumetric scans; if it is found, the corresponding image is retrieved. Post-processing of the retrieved images is performed by a thresholding operation, and the coordinates and area of the abnormality are reported. The experiments are carried out on the glioma dataset obtained from the Brain Tumor Segmentation Challenge 2013 database (BRATS 2013), which consists of MR scans of 30 real glioma patients and 50 simulated glioma patients. The NVIDIA Compute Unified Device Architecture framework is employed in this paper, and it is found that detection using the graphics processing unit is almost four times faster than using the central processing unit alone. The average Dice and Jaccard coefficients over a wide range of trials are found to be 0.91 and 0.83, respectively.
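The Dice and Jaccard coefficients reported at the end are standard overlap measures between a predicted abnormality mask and the ground truth. A small self-contained sketch of both (generic definitions, not the paper's evaluation code):

```python
import numpy as np

def dice_jaccard(pred, truth):
    """Overlap metrics between two binary masks.
    Dice = 2|A∩B| / (|A|+|B|);  Jaccard = |A∩B| / |A∪B|."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    dice = 2.0 * inter / (pred.sum() + truth.sum())
    jaccard = inter / union
    return dice, jaccard
```

The two are monotonically related (Jaccard = Dice / (2 - Dice)), which is consistent with the reported pair 0.91 and 0.83.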


2021 ◽  
Author(s):  
Hongjie Zheng ◽  
Hanyu Chang ◽  
Yongqiang Yuan ◽  
Qingyun Wang ◽  
Yuhao Li ◽  
...  

Global navigation satellite systems (GNSS) have been playing an indispensable role in providing positioning, navigation and timing (PNT) services to global users. Over the past few years, GNSS have developed rapidly, with abundant networks, modern constellations, and multi-frequency observations. To take full advantage of multi-constellation and multi-frequency GNSS, several new mathematical models have been developed, such as multi-frequency ambiguity resolution (AR) and uncombined data processing with raw observations. In addition, new GNSS products, including the uncalibrated phase delay (UPD), the observable signal bias (OSB), and the integer recovery clock (IRC), have been generated and provided by analysis centers to support advanced GNSS applications.

However, the increasing number of GNSS observations poses a great challenge to the fast generation of multi-constellation and multi-frequency products. In this study, we propose an efficient solution to realize fast updating of multi-GNSS real-time products by making full use of advanced computing techniques. Firstly, instead of the traditional vector operations, the "level-3 operations" (matrix by matrix) of the Basic Linear Algebra Subprograms (BLAS) are used as much as possible in the least-squares (LSQ) processing, which improves efficiency thanks to central processing unit (CPU) optimization and faster memory data transmission. Furthermore, most steps of the multi-GNSS data processing are transformed from serial mode to parallel mode to take advantage of the multi-core CPU architecture and graphics processing unit (GPU) computing resources. Moreover, we choose the OpenBLAS library for matrix computation, as it performs well in parallel environments.

The proposed method is then validated on a 3.30 GHz AMD CPU with 6 cores. The results demonstrate that the proposed method can substantially improve the processing efficiency of multi-GNSS product generation. For the precise orbit determination (POD) solution with 150 ground stations and 128 satellites (GPS/BDS/Galileo/GLONASS/QZSS) in ionosphere-free (IF) mode, the processing time can be shortened from 50 to 10 minutes, which guarantees the hourly updating of multi-GNSS ultra-rapid orbit products. The processing time of uncombined POD can also be reduced by about 80%. Meanwhile, the multi-GNSS real-time clock products can easily be generated at a 5-second or even higher sampling rate. In addition, the processing efficiency of the UPD and OSB products can also be increased by 4-6 times.
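The "level-3 instead of vector operations" point can be made concrete with the least-squares normal equations: forming A^T A as one matrix-matrix product (a single GEMM) versus accumulating one rank-1 update per observation. A NumPy sketch of the contrast (illustrative only; the study's actual LSQ processing is far larger):

```python
import numpy as np

def normal_equations_level3(A, y):
    """Form and solve the normal equations with one matrix-matrix
    product -- the BLAS level-3 (GEMM) formulation."""
    N = A.T @ A            # single level-3 call, cache-friendly
    b = A.T @ y
    return np.linalg.solve(N, b)

def normal_equations_level1(A, y):
    """Same result, accumulated observation by observation with
    rank-1 updates -- the vector-oriented (level-1/2) formulation."""
    n = A.shape[1]
    N = np.zeros((n, n))
    b = np.zeros(n)
    for i in range(A.shape[0]):
        N += np.outer(A[i], A[i])   # one small update per observation
        b += A[i] * y[i]
    return np.linalg.solve(N, b)
```

Both produce identical estimates; the level-3 form wins because one large GEMM reuses data in cache and vector units far better than many tiny updates, which is the efficiency argument the abstract makes for BLAS/OpenBLAS.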


2016 ◽  
Vol 6 (1) ◽  
pp. 79-90
Author(s):  
Łukasz Syrocki ◽  
Grzegorz Pestka

Abstract A ready-to-use set of functions is provided that facilitates solving the generalized eigenvalue problem for symmetric matrices, in order to calculate eigenvalues and eigenvectors efficiently using NVIDIA's Compute Unified Device Architecture (CUDA) technology. An integral part of CUDA is a high-level programming environment that enables tracking of both the code executed on the Central Processing Unit and the code executed on the Graphics Processing Unit. The presented matrix structures allow analysis of the advantages of using graphics processors in such calculations.
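The problem the abstract targets is A v = λ B v for symmetric A and symmetric positive-definite B. The standard reduction, via a Cholesky factorization of B, turns it into an ordinary symmetric eigenproblem; a NumPy sketch of that reduction (generic textbook method, not the authors' CUDA functions):

```python
import numpy as np

def sym_generalized_eig(A, B):
    """Solve A v = lambda B v for symmetric A and SPD B.
    With B = L L^T, set C = L^{-1} A L^{-T}; then C y = lambda y
    is a standard symmetric problem, and v = L^{-T} y."""
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)        # fine for a sketch; solve_triangular in production
    C = Linv @ A @ Linv.T          # symmetric reduced matrix
    w, Y = np.linalg.eigh(C)
    V = Linv.T @ Y                 # back-transform the eigenvectors
    return w, V
```

On a GPU, the Cholesky factorization, the triangular solves, and the tridiagonalization inside the symmetric eigensolver are the dense kernels that benefit from CUDA.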


2013 ◽  
Vol 680 ◽  
pp. 509-514
Author(s):  
Zhi Long Wu ◽  
Zhi Jie Wang

The final objective of retinal simulation is to construct an artificial computer retina to replace the biological retina and compensate for the vision of visually impaired people. Due to the complexity of the retinal structure and the great number of bipolar cells and ganglion cells in the retina (exceeding tens of millions), both the speed and the accuracy of retinal simulations to date are at a low level. In this paper we present a method for simulating the inner plexiform layer of the retina based on a Compute Unified Device Architecture (CUDA) parallel algorithm, to achieve maximum utilization of the CPU and the Graphics Processing Unit (GPU) and to improve the speed and accuracy of the retina simulation.


2015 ◽  
Vol 8 (9) ◽  
pp. 2815-2827 ◽  
Author(s):  
S. Xu ◽  
X. Huang ◽  
L.-Y. Oey ◽  
F. Xu ◽  
H. Fu ◽  
...  

Abstract. Graphics processing units (GPUs) are an attractive solution for many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) for GPUs. Specifically, we first convert the model from its original Fortran form to new Compute Unified Device Architecture C (CUDA-C) code; then we optimize the code on each of the GPUs, the communications between the GPUs, and the I/O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, while reducing the energy consumption by a factor of 6.8.


2021 ◽  
Vol 249 ◽  
pp. 06003
Author(s):  
François Nader ◽  
Patrick Pizette ◽  
Nicolin Govender ◽  
Daniel N. Wilke ◽  
Jean-François Ferellec

The Discrete Element Method has proven to be an efficient way to model the response of engineering structures involving granular materials under various loading conditions. However, the computational cost of the simulations increases rapidly as the number of particles and the complexity of the particle shapes increase. An affordable solution to render such problems computationally tractable is to use graphics processing units (GPUs) for computing. Modern GPUs offer up to 10,496 compute cores, which allows for far greater parallelisation than the 32 cores offered by high-end Central Processing Units (CPUs). This study outlines the application of BlazeDEM-GPU, using an RTX 2080Ti GPU (4352 cores), to investigate the influence of particle-shape modelling on the lateral pull behaviour of granular ballast systems used in railway applications. The idea is to validate the model and show the benefits of simulating non-spherical shapes in future large-scale tests. An algorithm is presented that generates the shape of the ballast from scans of real grains, using polyhedral shape approximations of varying degrees of complexity. The particle size is modelled to scale. A preliminary investigation of the effect of grain shape is conducted, in which a sleeper lateral pull test is carried out on a sample of spherical grains and a sample of cubic grains. Preliminary results show that elementary polyhedral shape representations (cubic) recreate some of the characteristic responses of the lateral pull test, such as stick/slip phenomena and force-chain distributions, which looks promising for future work on railway simulations. These responses cannot be recreated with simple spherical grains unless heuristics are added, which requires additional calibration and approximations. The significant reduction in computing time when simulating non-spherical grains also implies that larger granular systems can be investigated.
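The baseline against which the polyhedral grains are compared is the classic sphere-sphere DEM contact, where a normal force proportional to the overlap pushes touching grains apart. A minimal NumPy sketch of that contact model (illustrative only; BlazeDEM-GPU's actual contact resolution for polyhedra is substantially more involved, and the spring stiffness k here is an arbitrary placeholder):

```python
import numpy as np

def contact_forces(centers, radii, k=1e4):
    """Linear-spring normal contact forces between spherical DEM grains.
    centers: (n, 3) sphere centers; radii: (n,) sphere radii.
    Force magnitude = k * overlap, directed along the contact normal."""
    n = len(centers)
    f = np.zeros_like(centers)
    for i in range(n):
        for j in range(i + 1, n):
            d = centers[i] - centers[j]
            dist = np.linalg.norm(d)
            overlap = radii[i] + radii[j] - dist
            if overlap > 0:                 # grains interpenetrate => contact
                normal = d / dist
                f[i] += k * overlap * normal
                f[j] -= k * overlap * normal
    return f
```

Every pair test is independent, which is what makes DEM contact detection a natural fit for thousands of GPU cores; polyhedral grains replace the one-line overlap test with face/edge/vertex intersection logic, hence their higher per-contact cost.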

