Effective Implementation of Edge-Preserving Filtering on CPU Microarchitectures

2018 ◽  
Vol 8 (10) ◽  
pp. 1985 ◽  
Author(s):  
Yoshihiro Maeda ◽  
Norishige Fukushima ◽  
Hiroshi Matsuo

In this paper, we propose acceleration methods for edge-preserving filtering. The filters inherently produce denormalized numbers, which are defined in IEEE Standard 754. Processing denormalized numbers has a higher computational cost than processing normal numbers; thus, the computational performance of edge-preserving filtering is severely diminished. We propose approaches that prevent the occurrence of denormalized numbers for acceleration. Moreover, we verify an effective vectorization of edge-preserving filtering across changes in central processing unit (CPU) microarchitectures by carefully treating kernel weights. The experimental results show that the proposed methods are up to five times faster than the straightforward implementations of bilateral filtering and non-local means filtering, while the filters maintain high accuracy. In addition, we show effective vectorization for each CPU microarchitecture. Our implementation of the bilateral filter is up to 14 times faster than that of OpenCV. The proposed methods and vectorization are practical for real-time tasks such as image editing.
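
A minimal NumPy sketch of the kind of weight clamping the abstract describes (function and variable names are illustrative, not the paper's code): Gaussian range weights decay doubly exponentially, so values below the smallest normal float32 are flushed to zero before the weighted accumulation ever sees a denormal.

```python
import numpy as np

# Smallest positive *normal* float32; anything below it is denormalized.
FLT_MIN = np.finfo(np.float32).tiny

def range_weights(diff, sigma):
    """Bilateral-filter range weights with denormals flushed to zero.

    exp(-x^2 / 2s^2) underflows into denormalized numbers for large
    |diff|, which is slow on many CPUs, so weights below the smallest
    normal float32 are clamped to exactly zero before accumulation.
    """
    w = np.exp(-(diff.astype(np.float32) ** 2) / np.float32(2 * sigma * sigma))
    w[w < FLT_MIN] = np.float32(0.0)  # flush subnormal weights
    return w
```

On x86 CPUs a similar effect can also be obtained globally by enabling the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits in the MXCSR register.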

Author(s):  
Kamireddy Rasool Reddy ◽  
Madhava Rao Ch ◽  
Nagi Reddy Kalikiri

Denoising, the process of eliminating noise from a noisy image, is one of the important aspects of image processing applications. In most cases, noise accumulates at edges, so preventing noise at edges is a prominent problem. Numerous edge-preserving approaches are available to reduce noise at edges; the Gaussian filter, the bilateral filter, and non-local means filtering are popular choices, but with these approaches the denoised image suffers from blurring. To overcome these problems, this article proposes Gaussian/bilateral filtering (G/BF) with a wavelet thresholding approach for better image denoising. The performance of the proposed work is compared with edge-preserving filter algorithms such as the bilateral filter and the non-local means filter in terms of objective quality assessment. The simulation results show that the proposed method is superior to both the bilateral filter and the non-local means filter.
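
A minimal sketch of a two-stage pipeline in the spirit described above, assuming OpenCV (cv2) and PyWavelets (pywt); the filter parameters, the 'db4' wavelet, and the use of soft thresholding on a single decomposition level are illustrative assumptions, not the article's exact settings.

```python
import cv2
import numpy as np
import pywt

def gbf_wavelet_denoise(noisy, sigma_est):
    """Edge-preserving smoothing followed by wavelet soft-thresholding.

    noisy     : 8-bit grayscale image
    sigma_est : estimated noise standard deviation
    """
    # Stage 1: bilateral filter smooths noise while preserving edges.
    smoothed = cv2.bilateralFilter(noisy, d=9, sigmaColor=75, sigmaSpace=75)
    # Stage 2: soft-threshold the detail coefficients of a one-level DWT.
    cA, (cH, cV, cD) = pywt.dwt2(smoothed.astype(np.float32), 'db4')
    thr = sigma_est * np.sqrt(2.0 * np.log(smoothed.size))  # universal threshold
    cH, cV, cD = (pywt.threshold(c, thr, mode='soft') for c in (cH, cV, cD))
    out = pywt.idwt2((cA, (cH, cV, cD)), 'db4')
    return np.clip(out, 0, 255).astype(np.uint8)
```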


2021 ◽  
Vol 119 ◽  
pp. 07002
Author(s):  
Youness Rtal ◽  
Abdelkader Hadjoudja

Graphics Processing Units (GPUs) are microprocessors mounted on graphics cards and dedicated to displaying and manipulating graphics data; today they equip all modern graphics cards. Within a few years, these microprocessors have become potent tools for massively parallel computing. They serve in many fields, such as image processing, video and audio encoding and decoding, and the solution of physical systems with one or more unknowns. Their advantages are faster processing and lower energy consumption than a central processing unit (CPU). In this paper, we define and implement the Lagrange polynomial interpolation method on GPU and CPU to calculate the sodium density at different temperatures Ti, using the NVIDIA CUDA C parallel programming model, which can increase computational performance by harnessing the power of the GPU. The objective of this study is to compare the performance of the Lagrange interpolation method on CPU and GPU processors and to assess the efficiency of GPUs for parallel computing.
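
For orientation, a minimal NumPy sketch of the Lagrange basis evaluation being ported; evaluating each query point independently is exactly the parallelism a CUDA C kernel exploits. The temperature/density table is a hypothetical placeholder, and the numerically more stable barycentric form is ignored for clarity.

```python
import numpy as np  # swapping in `import cupy as np` would run the same code on a GPU

def lagrange_eval(x_nodes, y_nodes, x_query):
    """Evaluate the Lagrange interpolating polynomial at x_query."""
    x_nodes = np.asarray(x_nodes, dtype=np.float64)
    x_query = np.asarray(x_query, dtype=np.float64)
    result = np.zeros_like(x_query)
    for j in range(x_nodes.size):
        others = np.delete(x_nodes, j)
        # j-th basis polynomial: L_j(x) = prod_k (x - x_k) / (x_j - x_k)
        basis = np.prod(x_query[:, None] - others[None, :], axis=1)
        basis /= np.prod(x_nodes[j] - others)
        result += y_nodes[j] * basis
    return result

# Hypothetical tabulated points: density (g/cm^3) vs. temperature (K).
T = np.array([300.0, 400.0, 500.0, 600.0])
rho = np.array([0.927, 0.904, 0.880, 0.856])
print(lagrange_eval(T, rho, np.array([450.0, 550.0])))
```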


Author(s):  
Pravin Jagtap ◽  
Rupesh Nasre ◽  
V. S. Sanapala ◽  
B. S. V. Patnaik

Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the backbone for fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Typically, particle-based methods necessitate efficient particle search-and-locate algorithms, as the continual rebuilding of neighbor-particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on a multi-core Central Processing Unit (CPU) as well as on massively parallel General-Purpose Graphics Processing Units (GP-GPUs). Parallelizing SPH faces several challenges, such as scalability of the neighbor search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, and ensuring optimum use of computational resources. While addressing some of these challenges, performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency, and occupancy are analyzed in detail. The OpenMP and Compute Unified Device Architecture (CUDA) parallel programming models are used for parallel computing on an Intel Xeon E5-series multi-core CPU and on NVIDIA Quadro M-series and NVIDIA Tesla P-series massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for validation. The key concern of identifying a suitable architecture for mesh-less methods, which essentially require a heavy workload of neighbor search and evaluation of local force fields from neighbor interactions, is addressed.
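
Since neighbor-list construction dominates SPH cost, a minimal sketch of a fixed-radius search using SciPy's k-d tree shows the step being parallelized (a uniform cell list is the more common GPU-side choice); the smoothing-length factor and names are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_neighbor_lists(positions, h, support=2.0):
    """Fixed-radius neighbor lists for SPH particles.

    positions : (N, 3) particle coordinates
    h         : smoothing length
    support   : kernel support radius in units of h (2h assumed here)
    """
    tree = cKDTree(positions)
    # For each particle, indices of all particles within the kernel support.
    return tree.query_ball_point(positions, r=support * h)

# Example: 10k particles in a unit box with h = 0.02.
pts = np.random.rand(10_000, 3)
neighbors = build_neighbor_lists(pts, h=0.02)
print(len(neighbors[0]), "neighbors of particle 0")
```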


2018 ◽  
Vol 27 (3) ◽  
pp. 1462-1474 ◽  
Author(s):  
Christina Karam ◽  
Keigo Hirakawa

Author(s):  
Liam Dunn ◽  
Patrick Clearwater ◽  
Andrew Melatos ◽  
Karl Wette

Abstract The F-statistic is a detection statistic used widely in searches for continuous gravitational waves with terrestrial, long-baseline interferometers. A new implementation of the F-statistic is presented which accelerates the existing "resampling" algorithm using graphics processing units (GPUs). The new implementation runs between 10 and 100 times faster than the existing implementation on central processing units, without sacrificing numerical accuracy. The utility of the GPU implementation is demonstrated on a pilot narrowband search for four newly discovered millisecond pulsars in the globular cluster Omega Centauri, using data from the second Laser Interferometer Gravitational-Wave Observatory observing run. The computational cost is 17.2 GPU-hours using the new implementation, compared to 1092 core-hours with the existing implementation.
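
For orientation, a heavily simplified sketch of the per-bin combination step that follows the resampling FFTs, in the standard Jaranowski-Królak-Schutz form; the antenna-pattern integrals A, B, C and the arrays Fa, Fb stand in for quantities the real pipeline computes from barycentre-resampled, antenna-weighted data, and normalization constants are assumed folded into Fa and Fb. The expensive part the GPU accelerates is the pair of long FFTs producing Fa and Fb, not this final arithmetic.

```python
import numpy as np

def twoF(Fa, Fb, A, B, C):
    """2F detection statistic per frequency bin (JKS form, sketch only).

    Fa, Fb : complex arrays from the resampling FFTs (placeholders here)
    A, B, C: antenna-pattern integrals over the observation
    """
    D = A * B - C * C
    return 2.0 * (B * np.abs(Fa) ** 2 + A * np.abs(Fb) ** 2
                  - 2.0 * C * (Fa * np.conj(Fb)).real) / D
```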


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0255397
Author(s):  
Moritz Knolle ◽  
Georgios Kaissis ◽  
Friederike Jungmann ◽  
Sebastian Ziegelmayer ◽  
Daniel Sasse ◽  
...  

The success of deep learning in recent years has arguably been driven by the availability of large datasets for training powerful predictive algorithms. In medical applications, however, the sensitive nature of the data limits the collection and exchange of large-scale datasets. Privacy-preserving and collaborative learning systems can enable the successful application of machine learning in medicine. However, collaborative protocols such as federated learning require the frequent transfer of parameter updates over a network. To enable the deployment of such protocols to a wide range of systems with varying computational performance, efficient deep learning architectures for resource-constrained environments are required. Here we present MoNet, a small, highly optimized neural-network-based segmentation algorithm leveraging efficient multi-scale image features. MoNet is a shallow, U-Net-like architecture based on repeated, dilated convolutions with decreasing dilation rates. We apply and test our architecture on the challenging clinical tasks of pancreatic segmentation in computed tomography (CT) images and brain tumor segmentation in magnetic resonance imaging (MRI) data. We assess our model's segmentation performance and demonstrate that it is on par with the compared architectures while generalizing better out of sample, outperforming larger architectures on an independent validation set while using significantly fewer parameters. We furthermore confirm the suitability of our architecture for federated learning applications by demonstrating a substantial reduction in serialized model storage requirements, a surrogate for network data transfer. Finally, we evaluate MoNet's inference latency on the central processing unit (CPU) to determine its utility in environments without access to graphics processing units. Our implementation is publicly available as free and open-source software.
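
A minimal PyTorch sketch of the core idea named above, repeated dilated convolutions with decreasing dilation rates; the channel width, dilation schedule, and block depth here are illustrative assumptions, not MoNet's published configuration.

```python
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Stacked 3x3 convolutions with decreasing dilation rates.

    Early layers see a wide receptive field (large dilation); later
    layers refine locally (dilation 1), yielding multi-scale features
    from very few parameters.
    """
    def __init__(self, channels, dilations=(8, 4, 2, 1)):
        super().__init__()
        layers = []
        for d in dilations:
            layers += [
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=d, dilation=d),  # padding=d keeps H, W fixed
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)
```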


Sensors ◽  
2019 ◽  
Vol 19 (9) ◽  
pp. 2124 ◽  
Author(s):  
Yingzhong Tian ◽  
Xining Liu ◽  
Long Li ◽  
Wenbin Wang

Iterative closest point (ICP) is a method commonly used to perform scan matching and registration. Although the algorithm is simple and robust, it is computationally expensive, which is a crucial challenge in real-time applications such as the simultaneous localization and mapping (SLAM) problem. For these reasons, this paper presents a new method for accelerating ICP with an assisted intensity. Unlike conventional ICP, the proposed method reduces the computational cost and avoids divergence. An initial transformation guess is computed with the assisted intensity for the relative rigid-body transformation, and a target function is proposed to determine the best initial guess based on statistics of the spatial distances and intensity residuals. The method also reduces the number of iterations: Anderson acceleration is utilized to increase the iteration speed, converging better than the Picard iteration procedure. The proposed algorithm runs in real time on a single central processing unit (CPU) thread, so it is suitable for robots with limited computational resources. To validate the novelty, the proposed method is evaluated on the SEMANTIC3D.NET benchmark dataset; the comparative results show better accuracy and robustness than conventional ICP methods.
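
As a baseline for comparison, a minimal point-to-point ICP iteration (k-d tree correspondence search plus Kabsch/SVD alignment); the paper's intensity-assisted initial guess and Anderson-accelerated update are deliberately omitted from this sketch.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(src, dst, dst_tree):
    """One point-to-point ICP iteration: match, then Kabsch alignment.

    src, dst : (N, 3) and (M, 3) point clouds; dst_tree = cKDTree(dst)
    Returns the rigid transform (R, t) aligning src to its matches.
    """
    _, idx = dst_tree.query(src)           # nearest-neighbor correspondences
    matched = dst[idx]
    mu_s, mu_d = src.mean(axis=0), matched.mean(axis=0)
    H = (src - mu_s).T @ (matched - mu_d)  # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t
```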


2021 ◽  
Vol 13 (11) ◽  
pp. 2107
Author(s):  
Shiyu Wu ◽  
Zhichao Xu ◽  
Feng Wang ◽  
Dongkai Yang ◽  
Gongjian Guo

Global Navigation Satellite System Reflectometry Bistatic Synthetic Aperture Radar (GNSS-R BSAR) is becoming increasingly important in remote sensing because of its low power, low mass, low cost, and real-time global coverage capability. The Back Projection Algorithm (BPA) is usually selected as the GNSS-R BSAR imaging algorithm because it can process echo signals from complex geometric configurations; however, its huge computational cost is a challenge for application in GNSS-R BSAR. Graphics Processing Units (GPUs) provide an efficient computing platform for GNSS-R BSAR processing. In this paper, a GPU-accelerated BPA for GNSS-R BSAR is proposed to improve imaging efficiency, together with a matching pre-processing program that synchronizes the direct and echo signals to improve imaging quality. To handle the hundreds of gigabytes of data collected over a long synthetic aperture in fixed-station mode, a stream processing structure is used, which solves the problem of limited GPU memory. To improve imaging efficiency, the imaging task is divided into pre-processing and back projection, performed on the Central Processing Unit (CPU) and GPU respectively, and a pixel-oriented parallel processing method in back projection avoids the memory access conflicts that an excessive data volume would otherwise cause. The improved BPA with a long synthetic aperture time is verified through simulation and experiments on the GPS L5 signal. The results show that the proposed accelerating solution takes approximately 128.04 s, 156 times faster than the pure CPU framework, to produce a 600 m × 600 m image with an 1800 s synthetic aperture time, while retaining the same imaging quality as the existing processing solution.
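
A minimal, geometry-simplified sketch of the pixel-oriented back projection loop: each output pixel independently accumulates the phase-corrected, range-interpolated echo over slow time, which is why a one-thread-per-pixel GPU mapping avoids write conflicts. Array names, the bistatic range model, and all parameters are illustrative assumptions, not the paper's processing chain.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def backproject(echo, t_fast, tx_pos, rx_pos, pixels, fc):
    """Pixel-parallel bistatic back projection (sketch).

    echo   : (n_slow, n_fast) range-compressed echo matrix
    t_fast : (n_fast,) fast-time axis of each pulse
    tx_pos, rx_pos : (n_slow, 3) transmitter/receiver positions per pulse
    pixels : (n_pix, 3) scene grid points
    fc     : carrier frequency in Hz
    """
    image = np.zeros(len(pixels), dtype=np.complex128)
    for k in range(echo.shape[0]):  # slow time; pixels are vectorized
        # Bistatic delay: transmitter-to-pixel plus pixel-to-receiver.
        delay = (np.linalg.norm(pixels - tx_pos[k], axis=1)
                 + np.linalg.norm(pixels - rx_pos[k], axis=1)) / C
        # Linear range interpolation of the complex echo at each delay.
        sample = (np.interp(delay, t_fast, echo[k].real)
                  + 1j * np.interp(delay, t_fast, echo[k].imag))
        image += sample * np.exp(2j * np.pi * fc * delay)  # phase correction
    return image
```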


2021 ◽  
Vol 249 ◽  
pp. 06003
Author(s):  
François Nader ◽  
Patrick Pizette ◽  
Nicolin Govender ◽  
Daniel N. Wilke ◽  
Jean-François Ferellec

The use of the Discrete Element Method (DEM) to model engineering structures built on granular materials has proven to be an efficient way to predict their response under various behaviour conditions. However, the computational cost of the simulations increases rapidly as the number of particles and the complexity of particle shapes increase. An affordable solution to render problems computationally tractable is to use graphical processing units (GPUs) for computing. Modern GPUs offer up to 10496 compute cores, which allows for far greater parallelisation than the 32 cores offered by high-end Central Processing Unit (CPU) compute. This study outlines the application of BlazeDEM-GPU, using an RTX 2080Ti GPU (4352 cores), to investigate the influence of particle-shape modelling on the lateral pull behaviour of granular ballast systems used in railway applications. The idea is to validate the model and show the benefits of simulating non-spherical shapes in future large-scale tests. An algorithm that generates ballast shapes from real grain scans, using polyhedral shape approximations of varying degrees of complexity, is presented; the particle size is modelled to scale. A preliminary investigation of the effect of grain shape is conducted, in which a sleeper lateral pull test is carried out on a sample of spherical grains and a sample of cubic grains. Preliminary results show that elementary polyhedral shape representations (cubic) recreate some of the characteristic responses of the lateral pull test, such as stick/slip phenomena and force-chain distributions, which looks promising for future work on railway simulations. These responses cannot be recreated with simple spherical grains unless heuristics are added, which requires additional calibration and approximations. The significant reduction in time when using non-spherical grains also implies that larger granular systems can be investigated.
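
A minimal sketch of building polyhedral shape approximations of varying complexity from a grain scan, using a convex hull plus a naive vertex-decimation heuristic; this illustrates the idea only and is not BlazeDEM-GPU's actual shape-generation algorithm.

```python
import numpy as np
from scipy.spatial import ConvexHull

def polyhedral_grain(scan_points, max_vertices):
    """Approximate a scanned ballast grain by a convex polyhedron.

    scan_points  : (N, 3) surface points from the grain scan
    max_vertices : complexity budget for the DEM shape
    """
    hull = ConvexHull(scan_points)
    verts = scan_points[hull.vertices]
    if len(verts) > max_vertices:
        # Naive decimation heuristic (illustrative only): keep the
        # vertices farthest from the centroid, then re-hull the survivors.
        centroid = verts.mean(axis=0)
        order = np.argsort(-np.linalg.norm(verts - centroid, axis=1))
        verts = verts[order[:max_vertices]]
        verts = verts[ConvexHull(verts).vertices]
    return verts
```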

