Effective Implementation of Edge-Preserving Filtering on CPU Microarchitectures

2018 ◽  
Vol 8 (10) ◽  
pp. 1985 ◽  
Author(s):  
Yoshihiro Maeda ◽  
Norishige Fukushima ◽  
Hiroshi Matsuo

In this paper, we propose acceleration methods for edge-preserving filtering. The filters inherently produce denormalized numbers, which are defined in IEEE Standard 754. Processing denormalized numbers has a higher computational cost than processing normal numbers; thus, the computational performance of edge-preserving filtering is severely diminished. We propose approaches that prevent the occurrence of denormalized numbers for acceleration. Moreover, we verify an effective vectorization of edge-preserving filtering across changes in central processing unit (CPU) microarchitectures by carefully treating kernel weights. The experimental results show that the proposed methods are up to five times faster than the straightforward implementations of bilateral filtering and non-local means filtering, while the filters maintain high accuracy. In addition, we show effective vectorization for each CPU microarchitecture. Our implementation of the bilateral filter is up to 14 times faster than that of OpenCV. The proposed methods and vectorization are practical for real-time tasks such as image editing.
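
A minimal NumPy sketch of the kind of weight clamping the abstract describes (function and variable names are illustrative, not the paper's code): Gaussian range weights decay doubly exponentially, so values below the smallest normal float32 are flushed to zero before the weighted accumulation ever sees a denormal.

```python
import numpy as np

# Smallest positive *normal* float32; anything below it is denormalized.
FLT_MIN = np.finfo(np.float32).tiny

def range_weights(diff, sigma):
    """Bilateral-filter range weights with denormals flushed to zero.

    exp(-x^2 / 2s^2) underflows into denormalized numbers for large
    |diff|, which is slow on many CPUs, so weights below the smallest
    normal float32 are clamped to exactly zero before accumulation.
    """
    w = np.exp(-(diff.astype(np.float32) ** 2) / np.float32(2 * sigma * sigma))
    w[w < FLT_MIN] = np.float32(0.0)  # flush subnormal weights
    return w
```

On x86 CPUs a similar effect can also be obtained globally by enabling the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits in the MXCSR register.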

Author(s):  
Kamireddy Rasool Reddy ◽  
Madhava Rao Ch ◽  
Nagi Reddy Kalikiri

Denoising, the process of eliminating noise from a noisy image, is one of the important aspects of image processing applications. In most cases, noise accumulates at edges, so preventing noise at edges is a prominent problem. Numerous edge-preserving approaches are available to reduce noise at edges; the Gaussian filter, the bilateral filter, and non-local means filtering are popular choices, but with these approaches the denoised image suffers from blurring. To overcome these problems, this article proposes Gaussian/bilateral filtering (G/BF) with a wavelet thresholding approach for better image denoising. The performance of the proposed work is compared with edge-preserving filter algorithms such as the bilateral filter and the non-local means filter in terms of objective quality assessment. The simulation results show that the proposed method is superior to both the bilateral filter and the non-local means filter.
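
A minimal sketch of a two-stage pipeline in the spirit described above, assuming OpenCV (cv2) and PyWavelets (pywt); the filter parameters, the 'db4' wavelet, and the use of soft thresholding on a single decomposition level are illustrative assumptions, not the article's exact settings.

```python
import cv2
import numpy as np
import pywt

def gbf_wavelet_denoise(noisy, sigma_est):
    """Edge-preserving smoothing followed by wavelet soft-thresholding.

    noisy     : 8-bit grayscale image
    sigma_est : estimated noise standard deviation
    """
    # Stage 1: bilateral filter smooths noise while preserving edges.
    smoothed = cv2.bilateralFilter(noisy, d=9, sigmaColor=75, sigmaSpace=75)
    # Stage 2: soft-threshold the detail coefficients of a one-level DWT.
    cA, (cH, cV, cD) = pywt.dwt2(smoothed.astype(np.float32), 'db4')
    thr = sigma_est * np.sqrt(2.0 * np.log(smoothed.size))  # universal threshold
    cH, cV, cD = (pywt.threshold(c, thr, mode='soft') for c in (cH, cV, cD))
    out = pywt.idwt2((cA, (cH, cV, cD)), 'db4')
    return np.clip(out, 0, 255).astype(np.uint8)
```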


2021 ◽  
Vol 119 ◽  
pp. 07002
Author(s):  
Youness Rtal ◽  
Abdelkader Hadjoudja

Graphics Processing Units (GPUs) are microprocessors mounted on graphics cards and dedicated to displaying and manipulating graphics data; today they equip all modern graphics cards. Within a few years, these microprocessors have become potent tools for massively parallel computing. They serve in many fields, such as image processing, video and audio encoding and decoding, and the solution of physical systems with one or more unknowns. Their advantages are faster processing and lower energy consumption than a central processing unit (CPU). In this paper, we define and implement the Lagrange polynomial interpolation method on GPU and CPU to calculate the sodium density at different temperatures Ti, using the NVIDIA CUDA C parallel programming model, which can increase computational performance by harnessing the power of the GPU. The objective of this study is to compare the performance of the Lagrange interpolation method on CPU and GPU processors and to assess the efficiency of GPUs for parallel computing.
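
For orientation, a minimal NumPy sketch of the Lagrange basis evaluation being ported; evaluating each query point independently is exactly the parallelism a CUDA C kernel exploits. The temperature/density table is a hypothetical placeholder, and the numerically more stable barycentric form is ignored for clarity.

```python
import numpy as np  # swapping in `import cupy as np` would run the same code on a GPU

def lagrange_eval(x_nodes, y_nodes, x_query):
    """Evaluate the Lagrange interpolating polynomial at x_query."""
    x_nodes = np.asarray(x_nodes, dtype=np.float64)
    x_query = np.asarray(x_query, dtype=np.float64)
    result = np.zeros_like(x_query)
    for j in range(x_nodes.size):
        others = np.delete(x_nodes, j)
        # j-th basis polynomial: L_j(x) = prod_k (x - x_k) / (x_j - x_k)
        basis = np.prod(x_query[:, None] - others[None, :], axis=1)
        basis /= np.prod(x_nodes[j] - others)
        result += y_nodes[j] * basis
    return result

# Hypothetical tabulated points: density (g/cm^3) vs. temperature (K).
T = np.array([300.0, 400.0, 500.0, 600.0])
rho = np.array([0.927, 0.904, 0.880, 0.856])
print(lagrange_eval(T, rho, np.array([450.0, 550.0])))
```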


Author(s):  
Pravin Jagtap ◽  
Rupesh Nasre ◽  
V. S. Sanapala ◽  
B. S. V. Patnaik

Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the backbone for fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Typically, particle-based methods necessitate efficient particle search-and-locate algorithms, as the continual rebuilding of neighbor-particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on a multi-core Central Processing Unit (CPU) as well as on massively parallel General-Purpose Graphics Processing Units (GP-GPUs). Parallelizing SPH faces several challenges, such as scalability of the neighbor search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, and ensuring optimum use of computational resources. While addressing some of these challenges, performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency, and occupancy are analyzed in detail. The OpenMP and Compute Unified Device Architecture (CUDA) parallel programming models are used for parallel computing on an Intel Xeon E5-series multi-core CPU and on NVIDIA Quadro M-series and NVIDIA Tesla P-series massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for validation. The key concern of identifying a suitable architecture for mesh-less methods, which essentially require a heavy workload of neighbor search and evaluation of local force fields from neighbor interactions, is addressed.
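
Since neighbor-list construction dominates SPH cost, a minimal sketch of a fixed-radius search using SciPy's k-d tree shows the step being parallelized (a uniform cell list is the more common GPU-side choice); the smoothing-length factor and names are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_neighbor_lists(positions, h, support=2.0):
    """Fixed-radius neighbor lists for SPH particles.

    positions : (N, 3) particle coordinates
    h         : smoothing length
    support   : kernel support radius in units of h (2h assumed here)
    """
    tree = cKDTree(positions)
    # For each particle, indices of all particles within the kernel support.
    return tree.query_ball_point(positions, r=support * h)

# Example: 10k particles in a unit box with h = 0.02.
pts = np.random.rand(10_000, 3)
neighbors = build_neighbor_lists(pts, h=0.02)
print(len(neighbors[0]), "neighbors of particle 0")
```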


2018 ◽  
Vol 27 (3) ◽  
pp. 1462-1474 ◽  
Author(s):  
Christina Karam ◽  
Keigo Hirakawa

Author(s):  
Liam Dunn ◽  
Patrick Clearwater ◽  
Andrew Melatos ◽  
Karl Wette

Abstract The F-statistic is a detection statistic used widely in searches for continuous gravitational waves with terrestrial, long-baseline interferometers. A new implementation of the F-statistic is presented which accelerates the existing "resampling" algorithm using graphics processing units (GPUs). The new implementation runs between 10 and 100 times faster than the existing implementation on central processing units, without sacrificing numerical accuracy. The utility of the GPU implementation is demonstrated on a pilot narrowband search for four newly discovered millisecond pulsars in the globular cluster Omega Centauri, using data from the second Laser Interferometer Gravitational-Wave Observatory observing run. The computational cost is 17.2 GPU-hours using the new implementation, compared to 1092 core-hours with the existing implementation.
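
For orientation, a heavily simplified sketch of the per-bin combination step that follows the resampling FFTs, in the standard Jaranowski-Królak-Schutz form; the antenna-pattern integrals A, B, C and the arrays Fa, Fb stand in for quantities the real pipeline computes from barycentre-resampled, antenna-weighted data, and normalization constants are assumed folded into Fa and Fb. The expensive part the GPU accelerates is the pair of long FFTs producing Fa and Fb, not this final arithmetic.

```python
import numpy as np

def twoF(Fa, Fb, A, B, C):
    """2F detection statistic per frequency bin (JKS form, sketch only).

    Fa, Fb : complex arrays from the resampling FFTs (placeholders here)
    A, B, C: antenna-pattern integrals over the observation
    """
    D = A * B - C * C
    return 2.0 * (B * np.abs(Fa) ** 2 + A * np.abs(Fb) ** 2
                  - 2.0 * C * (Fa * np.conj(Fb)).real) / D
```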


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0255397
Author(s):  
Moritz Knolle ◽  
Georgios Kaissis ◽  
Friederike Jungmann ◽  
Sebastian Ziegelmayer ◽  
Daniel Sasse ◽  
...  

The success of deep learning in recent years has arguably been driven by the availability of large datasets for training powerful predictive algorithms. In medical applications, however, the sensitive nature of the data limits the collection and exchange of large-scale datasets. Privacy-preserving and collaborative learning systems can enable the successful application of machine learning in medicine. However, collaborative protocols such as federated learning require the frequent transfer of parameter updates over a network. To enable the deployment of such protocols to a wide range of systems with varying computational performance, efficient deep learning architectures for resource-constrained environments are required. Here we present MoNet, a small, highly optimized neural-network-based segmentation algorithm leveraging efficient multi-scale image features. MoNet is a shallow, U-Net-like architecture based on repeated, dilated convolutions with decreasing dilation rates. We apply and test our architecture on the challenging clinical tasks of pancreatic segmentation in computed tomography (CT) images and brain tumor segmentation in magnetic resonance imaging (MRI) data. We assess our model's segmentation performance and demonstrate that it is on par with the compared architectures while generalizing better out of sample, outperforming larger architectures on an independent validation set while using significantly fewer parameters. We furthermore confirm the suitability of our architecture for federated learning applications by demonstrating a substantial reduction in serialized model storage requirements, a surrogate for network data transfer. Finally, we evaluate MoNet's inference latency on the central processing unit (CPU) to determine its utility in environments without access to graphics processing units. Our implementation is publicly available as free and open-source software.
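
A minimal PyTorch sketch of the core idea named above, repeated dilated convolutions with decreasing dilation rates; the channel width, dilation schedule, and block depth here are illustrative assumptions, not MoNet's published configuration.

```python
import torch.nn as nn

class DilatedBlock(nn.Module):
    """Stacked 3x3 convolutions with decreasing dilation rates.

    Early layers see a wide receptive field (large dilation); later
    layers refine locally (dilation 1), yielding multi-scale features
    from very few parameters.
    """
    def __init__(self, channels, dilations=(8, 4, 2, 1)):
        super().__init__()
        layers = []
        for d in dilations:
            layers += [
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=d, dilation=d),  # padding=d keeps H, W fixed
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)
```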


Sensors ◽  
2019 ◽  
Vol 19 (9) ◽  
pp. 2124 ◽  
Author(s):  
Yingzhong Tian ◽  
Xining Liu ◽  
Long Li ◽  
Wenbin Wang

Iterative closest point (ICP) is a method commonly used to perform scan matching and registration. Although the algorithm is simple and robust, it is computationally expensive, which is a crucial challenge in real-time applications such as the simultaneous localization and mapping (SLAM) problem. For these reasons, this paper presents a new method for accelerating ICP with an assisted intensity. Unlike conventional ICP, the proposed method reduces the computational cost and avoids divergence. An initial transformation guess is computed with the assisted intensity for the relative rigid-body transformation, and a target function is proposed to determine the best initial guess based on statistics of the spatial distances and intensity residuals. The method also reduces the number of iterations: Anderson acceleration is utilized to increase the iteration speed, converging better than the Picard iteration procedure. The proposed algorithm runs in real time on a single central processing unit (CPU) thread, so it is suitable for robots with limited computational resources. To validate the novelty, the proposed method is evaluated on the SEMANTIC3D.NET benchmark dataset; the comparative results show better accuracy and robustness than conventional ICP methods.
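
As a baseline for comparison, a minimal point-to-point ICP iteration (k-d tree correspondence search plus Kabsch/SVD alignment); the paper's intensity-assisted initial guess and Anderson-accelerated update are deliberately omitted from this sketch.

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(src, dst, dst_tree):
    """One point-to-point ICP iteration: match, then Kabsch alignment.

    src, dst : (N, 3) and (M, 3) point clouds; dst_tree = cKDTree(dst)
    Returns the rigid transform (R, t) aligning src to its matches.
    """
    _, idx = dst_tree.query(src)           # nearest-neighbor correspondences
    matched = dst[idx]
    mu_s, mu_d = src.mean(axis=0), matched.mean(axis=0)
    H = (src - mu_s).T @ (matched - mu_d)  # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:               # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t
```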


2021 ◽  
Vol 13 (11) ◽  
pp. 2107
Author(s):  
Shiyu Wu ◽  
Zhichao Xu ◽  
Feng Wang ◽  
Dongkai Yang ◽  
Gongjian Guo

Global Navigation Satellite System Reflectometry Bistatic Synthetic Aperture Radar (GNSS-R BSAR) is becoming increasingly important in remote sensing because of its low power, low mass, low cost, and real-time global coverage capability. The Back Projection Algorithm (BPA) is usually selected as the GNSS-R BSAR imaging algorithm because it can process echo signals from complex geometric configurations; however, its huge computational cost is a challenge for application in GNSS-R BSAR. Graphics Processing Units (GPUs) provide an efficient computing platform for GNSS-R BSAR processing. In this paper, a GPU-accelerated BPA for GNSS-R BSAR is proposed to improve imaging efficiency, together with a matching pre-processing program that synchronizes the direct and echo signals to improve imaging quality. To handle the hundreds of gigabytes of data collected over a long synthetic aperture in fixed-station mode, a stream processing structure is used, which solves the problem of limited GPU memory. To improve imaging efficiency, the imaging task is divided into pre-processing and back projection, performed on the Central Processing Unit (CPU) and GPU respectively, and a pixel-oriented parallel processing method in back projection avoids the memory access conflicts that an excessive data volume would otherwise cause. The improved BPA with a long synthetic aperture time is verified through simulation and experiments on the GPS L5 signal. The results show that the proposed accelerating solution takes approximately 128.04 s, 156 times faster than the pure CPU framework, to produce a 600 m × 600 m image with an 1800 s synthetic aperture time, while retaining the same imaging quality as the existing processing solution.
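
A minimal, geometry-simplified sketch of the pixel-oriented back projection loop: each output pixel independently accumulates the phase-corrected, range-interpolated echo over slow time, which is why a one-thread-per-pixel GPU mapping avoids write conflicts. Array names, the bistatic range model, and all parameters are illustrative assumptions, not the paper's processing chain.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def backproject(echo, t_fast, tx_pos, rx_pos, pixels, fc):
    """Pixel-parallel bistatic back projection (sketch).

    echo   : (n_slow, n_fast) range-compressed echo matrix
    t_fast : (n_fast,) fast-time axis of each pulse
    tx_pos, rx_pos : (n_slow, 3) transmitter/receiver positions per pulse
    pixels : (n_pix, 3) scene grid points
    fc     : carrier frequency in Hz
    """
    image = np.zeros(len(pixels), dtype=np.complex128)
    for k in range(echo.shape[0]):  # slow time; pixels are vectorized
        # Bistatic delay: transmitter-to-pixel plus pixel-to-receiver.
        delay = (np.linalg.norm(pixels - tx_pos[k], axis=1)
                 + np.linalg.norm(pixels - rx_pos[k], axis=1)) / C
        # Linear range interpolation of the complex echo at each delay.
        sample = (np.interp(delay, t_fast, echo[k].real)
                  + 1j * np.interp(delay, t_fast, echo[k].imag))
        image += sample * np.exp(2j * np.pi * fc * delay)  # phase correction
    return image
```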


2021 ◽  
Vol 249 ◽  
pp. 06003
Author(s):  
François Nader ◽  
Patrick Pizette ◽  
Nicolin Govender ◽  
Daniel N. Wilke ◽  
Jean-François Ferellec

The use of the Discrete Element Method (DEM) to model engineering structures built on granular materials has proven to be an efficient way to predict their response under various behaviour conditions. However, the computational cost of the simulations increases rapidly as the number of particles and the complexity of particle shapes increase. An affordable solution to render problems computationally tractable is to use graphical processing units (GPUs) for computing. Modern GPUs offer up to 10496 compute cores, which allows for far greater parallelisation than the 32 cores offered by high-end Central Processing Unit (CPU) compute. This study outlines the application of BlazeDEM-GPU, using an RTX 2080Ti GPU (4352 cores), to investigate the influence of particle-shape modelling on the lateral pull behaviour of granular ballast systems used in railway applications. The idea is to validate the model and show the benefits of simulating non-spherical shapes in future large-scale tests. An algorithm that generates ballast shapes from real grain scans, using polyhedral shape approximations of varying degrees of complexity, is presented; the particle size is modelled to scale. A preliminary investigation of the effect of grain shape is conducted, in which a sleeper lateral pull test is carried out on a sample of spherical grains and a sample of cubic grains. Preliminary results show that elementary polyhedral shape representations (cubic) recreate some of the characteristic responses of the lateral pull test, such as stick/slip phenomena and force-chain distributions, which looks promising for future work on railway simulations. These responses cannot be recreated with simple spherical grains unless heuristics are added, which requires additional calibration and approximations. The significant reduction in time when using non-spherical grains also implies that larger granular systems can be investigated.
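
A minimal sketch of building polyhedral shape approximations of varying complexity from a grain scan, using a convex hull plus a naive vertex-decimation heuristic; this illustrates the idea only and is not BlazeDEM-GPU's actual shape-generation algorithm.

```python
import numpy as np
from scipy.spatial import ConvexHull

def polyhedral_grain(scan_points, max_vertices):
    """Approximate a scanned ballast grain by a convex polyhedron.

    scan_points  : (N, 3) surface points from the grain scan
    max_vertices : complexity budget for the DEM shape
    """
    hull = ConvexHull(scan_points)
    verts = scan_points[hull.vertices]
    if len(verts) > max_vertices:
        # Naive decimation heuristic (illustrative only): keep the
        # vertices farthest from the centroid, then re-hull the survivors.
        centroid = verts.mean(axis=0)
        order = np.argsort(-np.linalg.norm(verts - centroid, axis=1))
        verts = verts[order[:max_vertices]]
        verts = verts[ConvexHull(verts).vertices]
    return verts
```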

