Clustered Cell Parallelization for GPU Computing of Silicon Anisotropic Etching Simulation

Author(s):  
Jianhua Li ◽  
Jingyuan Chen ◽  
Yan Wang ◽  
Jianhua Huang

The parallelization of silicon anisotropic etching simulation with the cellular automata (CA) model on graphics processing units (GPUs) is challenging, because the number of computational tasks in an etching simulation changes dynamically and the existing parallel CA mechanisms do not map well onto GPU computation. In this paper, an improved CA model, called the clustered cell model, is proposed for GPU-based etching simulation. The model consists of clustered cells, each of which manages a scalable number of atoms. In this model, only the etching and state updates for the atoms on the etching surface and their unexposed neighbors are performed at each CA time step, whereas the clustered cells are reclassified at a longer time step. With this model, a crystal cell parallelization method is given, where clustered cells are allocated to threads on GPUs in the simulation. With optimizations in the spatial and temporal aspects as well as a proper granularity, this method provides a faster process simulation. The proposed simulation method is implemented with the Compute Unified Device Architecture (CUDA) application programming interface. Several computational experiments are conducted to analyze the efficiency of the method.
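The idea of updating only surface atoms each step, and reclassifying clusters less often, can be illustrated with a small sketch. This is not the authors' model: the 2D grid, the rule that every exposed atom is removed in one step, and the names `etch_surface` and `active_clusters` are all assumptions made for illustration, and the code runs serially in Python rather than on a GPU.

```python
def neighbours(cell, rows, cols):
    """Yield the in-grid 4-neighbours of a (row, col) cell."""
    r, c = cell
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield (nr, nc)


def etch_surface(solid, surface, rows, cols):
    """One CA step that touches only the surface atoms.

    solid   -- set of (row, col) atoms not yet etched
    surface -- subset of `solid` currently exposed to the etchant
    Toy rule (an assumption): every exposed atom is removed, and its
    remaining solid neighbours become the new surface.
    """
    new_solid = solid - surface
    new_surface = set()
    for cell in surface:
        for nb in neighbours(cell, rows, cols):
            if nb in new_solid:
                new_surface.add(nb)
    return new_solid, new_surface


def active_clusters(surface, size):
    """Reclassify clusters: a size x size cluster is active if it holds a surface atom."""
    return {(r // size, c // size) for r, c in surface}
```

In a scheme like the one the abstract describes, `active_clusters` would be called only every few steps, and only the active clusters would be handed to GPU threads.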

Author(s):  
Jianhua Li ◽  
Yan Wang ◽  
Jingyuan Chen ◽  
Li Yan

Silicon anisotropic etching simulation, based on a geometric model or a cellular automata (CA) model, is highly time-consuming. In this paper, we propose two parallelization methods for the simulation of the silicon anisotropic etching process with CA models on graphics processing units (GPUs). One is the direct parallelization of the serial CA algorithm, and the other uses a spatial parallelization strategy where each crystal unit cell is allocated to a thread on the GPU. The proposed simulation methods are implemented with the Compute Unified Device Architecture (CUDA) application programming interface. Several computational experiments are conducted to analyze the efficiency of the methods.
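A spatial parallelization of this kind relies on each cell being updated independently from the previous state. The sketch below shows that structure in plain Python with a double buffer. The removal rule (a cell is etched when fewer than `min_neighbors` of its four neighbours are silicon) and the boundary handling are hypothetical simplifications, not the paper's etching rule.

```python
def etch_step(grid, min_neighbors=3):
    """One synchronous CA step on a 2D grid of 1 (silicon) / 0 (etched).

    Assumed toy rule: a silicon cell is removed when fewer than
    `min_neighbors` of its 4 neighbours are silicon. The region above
    row 0 is treated as etchant; other out-of-grid cells count as silicon.
    """
    rows, cols = len(grid), len(grid[0])

    def occupied(r, c):
        if r < 0:
            return 0  # etchant above the top surface
        if r >= rows or c < 0 or c >= cols:
            return 1  # bulk silicon beyond the other borders
        return grid[r][c]

    new_grid = [row[:] for row in grid]  # double buffer: read old, write new
    for r in range(rows):                # on a GPU, each (r, c) would map to one thread
        for c in range(cols):
            if grid[r][c] == 1:
                n = (occupied(r - 1, c) + occupied(r + 1, c)
                     + occupied(r, c - 1) + occupied(r, c + 1))
                if n < min_neighbors:
                    new_grid[r][c] = 0
    return new_grid
```

Because every cell reads only the old grid and writes only its own entry in the new one, the two loops carry no dependencies between iterations, which is what makes a one-thread-per-cell mapping possible.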


2021 ◽  
Author(s):  
Daiki Ishii ◽  
Masatomo Inui ◽  
Nobuyuki Umezu

By using the cutter location (CL) surface, fast and stable computation of the cutter path for machining complicated molds and dies can be realized. State-of-the-art graphics processing units (GPUs) are equipped with special hardware named ray tracing (RT) cores dedicated to image processing (called ray tracing) for 3D computer graphics. Using RT cores, it is possible to quickly compute the intersection points between a set of straight lines and polygons. In this paper, we propose a novel CL surface computation method using the RT core. The RT core was originally designed to accelerate 3D computer graphics processing. For the development of software using RT cores, it is necessary to use the OptiX application programming interface (API) library for computer graphics. We demonstrate how to use the OptiX API in the development of software for CL surface computations. Computational experiments were carried out, and it was confirmed that it is possible to obtain the CL surface based on a very high-resolution Z-map several times faster than the depth buffer-based method, which has been considered to be the fastest to date.


2011 ◽  
Vol 19 (4) ◽  
pp. 185-197 ◽  
Author(s):  
Marek Blazewicz ◽  
Steven R. Brandt ◽  
Michal Kierzynka ◽  
Krzysztof Kurowski ◽  
Bogdan Ludwiczak ◽  
...  

With the recent advent of new heterogeneous computing architectures there is still a lack of parallel problem solving environments that can help scientists use hybrid supercomputers easily and efficiently. Many scientific simulations that use structured grids to solve partial differential equations in fact rely on stencil computations. Stencil computations have become crucial in solving many challenging problems in various domains, e.g., engineering or physics. Although many parallel stencil computing approaches have been proposed, in most cases they solve only particular problems. As a result, scientists struggle when it comes to implementing a new stencil-based simulation, especially on high performance hybrid supercomputers. In response to this need we extend our previous work on a parallel programming framework for CUDA – CaCUDA that now supports OpenCL. We present CaKernel – a tool that simplifies the development of parallel scientific applications on hybrid systems. CaKernel is built on the highly scalable and portable Cactus framework. In the CaKernel framework, Cactus manages the inter-process communication via MPI while CaKernel manages the code running on Graphics Processing Units (GPUs) and interactions between them. As a non-trivial test case we have developed a 3D CFD code to demonstrate the performance and scalability of the automatically generated code.
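For readers unfamiliar with the term, a stencil computation updates each grid point from a fixed pattern of neighbouring points. The sketch below is a generic 5-point Jacobi sweep in plain Python. It does not use CaKernel or Cactus, and the function name `jacobi_step` is an assumption for illustration only.

```python
def jacobi_step(u):
    """One 5-point stencil sweep over the interior of a 2D grid.

    Each interior point becomes the average of its four neighbours;
    boundary values are left unchanged. Reads come from `u` and writes
    go to a copy, so every point can be updated independently.
    """
    rows, cols = len(u), len(u[0])
    new_u = [row[:] for row in u]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new_u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
    return new_u
```

The same access pattern is what a GPU kernel would execute per thread; frameworks of the kind described above aim to generate that kernel and the boundary exchange automatically.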


2020 ◽  
Author(s):  
Ryan N Gutenkunst

Extracting insight from population genetic data often demands computationally intensive modeling. dadi is a popular program for fitting models of demographic history and natural selection to such data. Here, I show that running dadi on a Graphics Processing Unit (GPU) can speed computation by orders of magnitude compared to the CPU implementation, with minimal user burden. This speed increase enables the analysis of more complex models, which motivated the extension of dadi to four- and five-population models. Remarkably, dadi performs almost as well on inexpensive consumer-grade GPUs as on expensive server-grade GPUs. GPU computing thus offers large and accessible benefits to the community of dadi users. This functionality is available in dadi version 2.1.0.


2018 ◽  
Author(s):  
Upendra Adhikari ◽  
Barmak Mostofian ◽  
Jeremy Copperman ◽  
Andrew Petersen ◽  
Daniel M. Zuckerman

Despite the development of massively parallel computing hardware including inexpensive graphics processing units (GPUs), it has remained infeasible to simulate the folding of atomistic proteins at room temperature using conventional molecular dynamics (MD) beyond the µs scale. Here we report the folding of atomistic, implicitly solvated protein systems with folding times τf ranging from ∼100 µs to ∼1 s using the weighted ensemble (WE) strategy in combination with GPU computing. Starting from an initial structure or set of structures, WE organizes an ensemble of GPU-accelerated MD trajectory segments via intermittent pruning and replication events to generate statistically unbiased estimates of rate constants for rare events such as folding; no biasing forces are used. Although the variance among atomistic WE folding runs is significant, multiple independent runs are used to reduce and quantify statistical uncertainty. Folding times are estimated directly from WE probability flux and from history-augmented Markov analysis of the WE data. Three systems were examined: NTL9 at low solvent viscosity (yielding τf = 0.8–9.0 µs), NTL9 at water-like viscosity (τf = 0.2–1.9 ms), and Protein G at low viscosity (τf = 3.3–200 ms). In all cases the folding time, uncertainty, and ensemble properties could be estimated from WE simulation; for Protein G, this characterization required significantly less overall computing than would be required to observe a single folding event with conventional MD simulations. Our results suggest that the use and calibration of force fields and solvent models for precise estimation of kinetic quantities is becoming feasible.


2011 ◽  
Vol 19 (4) ◽  
pp. 199-212 ◽  
Author(s):  
Gaurav ◽  
Steven F. Wojtkiewicz

Graphics processing units (GPUs) are rapidly emerging as a more economical and highly competitive alternative to CPU-based parallel computing. As the degree of software control of GPUs has increased, many researchers have explored their use in non-gaming applications. Recent studies have shown that GPUs consistently outperform their best corresponding CPU-based parallel computing alternatives in single-instruction multiple-data (SIMD) strategies. This study explores the use of GPUs for uncertainty quantification in computational mechanics. Five types of analysis procedures that are frequently utilized for uncertainty quantification of mechanical and dynamical systems have been considered and their GPU implementations have been developed. The numerical examples presented in this study show that considerable gains in computational efficiency can be obtained for these procedures. It is expected that the GPU implementations presented in this study will serve as initial bases for further developments in the use of GPUs in the field of uncertainty quantification and will (i) aid the understanding of the performance constraints on the relevant GPU kernels and (ii) provide some guidance regarding the computational and the data structures to be utilized in these novel GPU implementations.


2021 ◽  
Vol 4 ◽  
pp. 16-22
Author(s):  
Mykola Semylitko ◽  
Gennadii Malaschonok

The SVD (Singular Value Decomposition) algorithm is used in recommendation systems, machine learning, image processing, and various algorithms that work with very large matrices and Big Data. Given the peculiarities of this algorithm, it can be performed on the large number of computing threads that only video cards provide. CUDA is a parallel computing platform and application programming interface model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit for general purpose processing – an approach termed GPGPU (general-purpose computing on graphics processing units). The GPU provides much higher instruction throughput and memory bandwidth than the CPU within a similar price and power envelope. Many applications leverage these higher capabilities to run faster on the GPU than on the CPU. Other computing devices, like FPGAs, are also very energy efficient, but they offer much less programming flexibility than GPUs. The developed modification uses the CUDA architecture, which is intended for a large number of simultaneous calculations and allows matrices of very large sizes to be processed quickly. The algorithm of parallel SVD for a tridiagonal matrix based on the Givens rotation provides a high accuracy of calculations. The algorithm also has a number of optimizations for memory access and multiplication that can significantly reduce the computation time by discarding empty iterations. This article proposes an approach that will reduce the computation time and, consequently, resources and costs. The developed algorithm can be used through a simple and convenient API in C++ and Java, and will be improved by using dynamic parallelism or parallelization of multiplication operations. The obtained results can also be used by other developers for comparison, as all conditions of the research are described in detail and the code is freely available.
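The Givens rotation mentioned in the abstract is a 2x2 plane rotation chosen to zero one matrix entry. The sketch below shows only that primitive in plain Python. It is not the authors' parallel SVD, and the helper names `givens` and `apply_givens` are assumptions for illustration.

```python
import math


def givens(a, b):
    """Return (c, s) so that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def apply_givens(matrix, i, k, c, s):
    """Rotate rows i and k of `matrix` in place by the rotation (c, s)."""
    for j in range(len(matrix[i])):
        x, y = matrix[i][j], matrix[k][j]
        matrix[i][j] = c * x + s * y
        matrix[k][j] = -s * x + c * y
```

Each rotation touches only two rows, and the columns within those rows are updated independently, which is one reason rotation-based algorithms are a candidate for many-thread hardware.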


Author(s):  
F. Cabarle ◽  
H. Adorna ◽  
M. A. Martínez-del-Amor

In this paper, the authors discuss the simulation of a P system variant known as Spiking Neural P systems (SNP systems), using Graphics Processing Units (GPUs). GPUs are well suited for highly parallel computations because of their intentional and massively parallel architecture. General purpose GPU computing has seen the use of GPUs for computationally intensive applications, not just in graphics and video processing. P systems, including SNP systems, are maximally parallel computing models taking inspiration from the functioning and dynamics of a living cell. In particular, SNP systems take inspiration from a type of cell known as a neuron. The nature of SNP systems allowed for their representation as matrices, which is an elegant step toward their simulation on GPUs. In this paper, the simulation algorithms, design considerations, and implementation are presented. Finally, simulation results, observations, and analyses using a simple but non-trivial SNP system as an example are discussed, including recommendations for future work.


2014 ◽  
Vol 39 (4) ◽  
pp. 233-248 ◽  
Author(s):  
Milosz Ciznicki ◽  
Krzysztof Kurowski ◽  
Jan Węglarz

Heterogeneous many-core computing resources are increasingly popular among users due to their improved performance over homogeneous systems. Many developers have realized that heterogeneous systems, e.g. a combination of a shared memory multi-core CPU machine with massively parallel Graphics Processing Units (GPUs), can provide significant performance opportunities to a wide range of applications. However, the best overall performance can only be achieved if application tasks are efficiently assigned to different types of processor units in time, taking into account their specific resource requirements. Additionally, one should note that available heterogeneous resources have been designed as general purpose units, however, with many built-in features accelerating specific application operations. In other words, the same algorithm or application functionality can be implemented as a different task for CPU or GPU. Nevertheless, from the perspective of various evaluation criteria, e.g. the total execution time or energy consumption, we may observe completely different results. Therefore, as tasks can be scheduled and managed in many alternative ways on both many-core CPUs or GPUs and consequently have a huge impact on the overall computing resources performance, there is a need for new and improved resource management techniques. In this paper we discuss results achieved during experimental performance studies of selected task scheduling methods in heterogeneous computing systems. Additionally, we present a new architecture for a resource allocation and task scheduling library which provides a generic application programming interface at the operating system level for improving scheduling policies, taking into account the diversity of tasks and the characteristics of heterogeneous computing resources.


Author(s):  
Weihang Zhu

This paper presents a GPU-based parallel Population Based Incremental Learning (PBIL) algorithm with a local search on bound constrained optimization problems. The genotype of an entire population is evolved in PBIL, which was derived from Genetic Algorithms. The Graphics Processing Unit (GPU) is an emerging technology for desktop parallel computing. In this research, the classical PBIL is adapted to the data-parallel GPU computing platform. The global optimal search of the PBIL is enhanced by a local Pattern Search method. The hybrid PBIL method is implemented in the GPU environment, and compared to a similar implementation in the common computing environment with a Central Processing Unit (CPU). Computational results indicate that the GPU-accelerated PBIL method is effective and faster than the corresponding CPU implementation.
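The core of PBIL is a probability vector that is sampled to produce a population and then nudged toward the best sample. The sketch below is the classical binary form in plain Python, without the local Pattern Search step and without any GPU code. The function name `pbil`, its parameters and their default values are assumptions for illustration, not the paper's settings.

```python
import random


def pbil(fitness, n_bits, pop_size=20, lr=0.1, generations=50, seed=0):
    """Maximise `fitness` over bit lists of length `n_bits` with basic PBIL.

    Returns (best individual, its fitness, final probability vector).
    """
    rng = random.Random(seed)
    p = [0.5] * n_bits  # probability that each bit is 1
    best, best_fit = None, float("-inf")
    for _ in range(generations):
        # Sampling is independent per individual, hence data-parallel.
        pop = [[1 if rng.random() < pi else 0 for pi in p]
               for _ in range(pop_size)]
        gen_best = max(pop, key=fitness)
        gen_fit = fitness(gen_best)
        if gen_fit > best_fit:
            best, best_fit = gen_best, gen_fit
        # Shift the probability vector toward the generation's best sample.
        p = [(1 - lr) * pi + lr * bit for pi, bit in zip(p, gen_best)]
    return best, best_fit, p
```

A call such as `pbil(sum, 8)` maximises the number of ones in an 8-bit string. For bound constrained continuous problems, as in the abstract, the bits would additionally need to be decoded into real values within the bounds.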

