Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00124 ◽

2015 ◽

Vol 3 ◽

pp. 87-100

Author(s):

Hua He ◽

Jimmy Lin ◽

Adam Lopez

Keyword(s):

Hierarchical Models ◽

Graphics Processing Units ◽

General Purpose ◽

Practical Applications ◽

On Demand ◽

Mt Evaluation ◽

Order Of Magnitude ◽

Evaluation Dataset ◽

The Cost ◽

Graphics Processing

Grammars for machine translation can be materialized on demand by finding source phrases in an indexed parallel corpus and extracting their translations. This approach is limited in practical applications by the computational expense of online lookup and extraction. For phrase-based models, recent work has shown that on-demand grammar extraction can be greatly accelerated by parallelization on general purpose graphics processing units (GPUs), but these algorithms do not work for hierarchical models, which require matching patterns that contain gaps. We address this limitation by presenting a novel GPU algorithm for on-demand hierarchical grammar extraction that is at least an order of magnitude faster than a comparable CPU algorithm when processing large batches of sentences. In terms of end-to-end translation, with decoding on the CPU, we increase throughput by roughly two thirds on a standard MT evaluation dataset. The GPU necessary to achieve these improvements increases the cost of a server by about a third. We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.
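The cost-benefit claim in this abstract can be checked with simple arithmetic. The sketch below treats "roughly two thirds" and "about a third" as exact fractions, which is an assumption made only for illustration; the function name is hypothetical and does not come from the paper.

```python
from fractions import Fraction


def throughput_per_cost_ratio(throughput_gain, cost_increase):
    """Return new throughput-per-cost relative to the baseline (baseline = 1)."""
    return (1 + throughput_gain) / (1 + cost_increase)


# Assumed exact values standing in for "roughly two thirds" and "about a third".
ratio = throughput_per_cost_ratio(Fraction(2, 3), Fraction(1, 3))
print(ratio)  # 5/4
```

Under these assumed values, throughput per unit of server cost improves by about a quarter, which is consistent with the abstract's description of the GPU as an attractive proposition for high-throughput settings.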

A High Granularity Approach to Network Packet Processing for Latency-Tolerant Applications with CUDA (Corvyd)

Avances en Ciencias e Ingeniería ◽

10.18272/aci.v13i2.2142 ◽

2021 ◽

Vol 13 (2) ◽

pp. 7

Author(s):

Maria Pantoja

Keyword(s):

Graphics Processing Units ◽

General Purpose ◽

Packet Processing ◽

Maximum Throughput ◽

Intrusion Prevention ◽

Detection Systems ◽

Enterprise Level ◽

Specialized Hardware ◽

The Cost ◽

Graphics Processing

Currently, practical network packet processing used for Intrusion Detection Systems/Intrusion Prevention Systems (IDS/IPS) tends to belong to one of two disjoint categories: software-only implementations running on general-purpose CPUs, or highly specialized network hardware implementations using ASICs or FPGAs for the most common functions and general-purpose CPUs for the rest. These approaches try to maximize performance and minimize cost, but neither system, when implemented effectively, is affordable to any clients except those at the well-funded enterprise level. In this paper, we aim to improve the performance of affordable network packet processing in heterogeneous systems with consumer Graphics Processing Unit (GPU) hardware by optimizing latency-tolerant packet processing operations, notably IDS, to obtain the maximum throughput required by such systems in networks sophisticated enough to demand a dedicated IDS/IPS system, but not enough to justify the high cost of cutting-edge specialized hardware. In particular, this project investigated increasing the granularity of OSI layer-based packet batching over that of previous batching approaches. We demonstrate that highly granular GPU-enabled packet processing is generally impractical, compared with existing methods, by implementing our own solution that we call Corvyd, a heterogeneous real-time packet processing engine.

Faster and cheaper: how graphics processing units on spot-market instances minimize turnaround time and budget

Interpretation ◽

10.1190/int-2020-0094.1 ◽

2020 ◽

pp. 1-67

Author(s):

Nicholas T. Okita ◽

Tiago A. Coimbra

Keyword(s):

Cloud Computing ◽

Graphics Processing Units ◽

High Performance ◽

Imaging Techniques ◽

Spot Market ◽

Turnaround Time ◽

On Demand ◽

Order Of Magnitude ◽

Graphics Processing ◽

Price Differences

Cloud computing is enabling users to instantiate and access high-performance computing clusters quickly. However, without proper knowledge of the type of application and the nature of the instances, it can become quite expensive. The objective is to show that adequately choosing the instances provides fast execution, which, in turn, leads to a low execution price under the pay-as-you-go model of cloud computing. We used graphics processing unit instances on the spot market to execute a seismic-dataset interpolation job and compared their performance to regular on-demand CPU instances. Furthermore, we explored how scaling could also improve execution times at small price differences. The experiments have shown that, by using an instance with eight accelerators on the spot market, we obtain up to a three-hundred-fold speed-up compared to the on-demand CPU options, while being one hundred times cheaper. Finally, our results have shown that seismic-imaging processing can be sped up by an order of magnitude on a low budget, resulting in faster and cheaper turnaround and enabling possible new imaging techniques.
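Under a pay-as-you-go model, the price of a job is the hourly rate multiplied by the runtime, so the speed-up and the cost reduction together imply a ratio of hourly prices. The sketch below works through that relationship. It assumes the "three hundred times" and "one hundred times" figures come from the same pair of instances, which the abstract does not state, and the dollar amounts and function names are hypothetical.

```python
def job_cost(hourly_price, runtime_hours):
    """Pay-as-you-go cost: hourly rate times hours used."""
    return hourly_price * runtime_hours


def implied_price_ratio(speedup, cost_reduction):
    """Hourly price of the fast instance relative to the slow one."""
    return speedup / cost_reduction


# Hypothetical prices and runtimes, chosen only to match the quoted ratios.
cpu_cost = job_cost(hourly_price=1.0, runtime_hours=300.0)
gpu_cost = job_cost(hourly_price=3.0, runtime_hours=1.0)

print(cpu_cost / gpu_cost)            # 100.0
print(implied_price_ratio(300, 100))  # 3.0
```

In this illustration, an instance that costs three times as much per hour still yields a job that is one hundred times cheaper, because the runtime shrinks by a factor of three hundred.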

DSPSR: Digital Signal Processing Software for Pulsar Astronomy

Publications of the Astronomical Society of Australia ◽

10.1071/as10021 ◽

2011 ◽

Vol 28 (1) ◽

pp. 1-14 ◽

Cited By ~ 172

Author(s):

W. van Straten ◽

M. Bailes

Keyword(s):

Signal Processing ◽

Digital Signal Processing ◽

Graphics Processing Units ◽

High Performance ◽

Digital Signal ◽

General Purpose ◽

Design Decisions ◽

Extensive Range ◽

Processing Software ◽

Graphics Processing

dspsr is a high-performance, open-source, object-oriented, digital signal processing software library and application suite for use in radio pulsar astronomy. Written primarily in C++, the library implements an extensive range of modular algorithms that can optionally exploit both multiple-core processors and general-purpose graphics processing units. After over a decade of research and development, dspsr is now stable and in widespread use in the community. This paper presents a detailed description of its functionality, justification of major design decisions, analysis of phase-coherent dispersion removal algorithms, and demonstration of performance on some contemporary microprocessor architectures.

Accelerating reaction–diffusion simulations with general-purpose graphics processing units

Bioinformatics ◽

10.1093/bioinformatics/btq622 ◽

2010 ◽

Vol 27 (2) ◽

pp. 288-290 ◽

Cited By ~ 30

Author(s):

Matthias Vigelius ◽

Aidan Lane ◽

Bernd Meyer

Keyword(s):

Graphics Processing Units ◽

Reaction Diffusion ◽

General Purpose ◽

Graphics Processing

More Faster Self-Organizing Maps by General Purpose on Graphics Processing Units

Advances in Intelligent Systems and Computing - Soft Computing in Machine Learning ◽

10.1007/978-3-319-05533-6_5 ◽

2014 ◽

pp. 41-51

Author(s):

Shinji Kawakami ◽

Keiji Kamei

Keyword(s):

Graphics Processing Units ◽

General Purpose ◽

Self Organizing Maps ◽

Graphics Processing ◽

Self Organizing

A Representation of Membrane Computing with a Clustering Algorithm on the Graphical Processing Unit

Processes ◽

10.3390/pr8091199 ◽

2020 ◽

Vol 8 (9) ◽

pp. 1199

Author(s):

Ravie Chandren Muniyandi ◽

Ali Maroosi

Keyword(s):

Graphics Processing Units ◽

Clustering Algorithm ◽

Hamiltonian Path ◽

Fold Increase ◽

General Purpose ◽

Processing Unit ◽

Thread Block ◽

Hard Problems ◽

Graphical Processing ◽

Graphics Processing

Long-timescale simulations of biological processes such as photosynthesis, or attempts to solve NP-hard problems such as traveling salesman, knapsack, Hamiltonian path, and satisfiability using membrane systems, can take hours or days without appropriate parallelization. Graphics processing units (GPUs) provide a massively parallel mechanism for general-purpose computation. Previous studies mapped one membrane to one thread block on the GPU. This is disadvantageous because, when the number of objects in each membrane is small, the number of active threads is also small, which decreases performance. In addition, when each membrane is assigned to one thread block, communication between membranes requires communication between thread blocks, which is a time-consuming process. Previous approaches have also not addressed the issue of GPU occupancy. This study presents a classification algorithm that manages dependent objects and membranes based on the communication rate associated with the defined weighted network and assigns them to sub-matrices. Thus, dependent objects and membranes are allocated to the same threads and thread blocks, thereby decreasing communication between threads and thread blocks and allowing GPUs to maintain the highest occupancy possible. The experimental results indicate that for 48 objects per membrane, the algorithm facilitates a 93-fold increase in processing speed compared to a 1.6-fold increase with previous algorithms.
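The drawback of mapping one membrane to one thread block can be illustrated with a rough count of busy threads. The sketch below is not the paper's algorithm. It assumes one thread per object and a hypothetical block size of 256 threads, and it only compares one membrane per block against naively packing several membranes into a block.

```python
def active_fraction(objects_per_membrane, threads_per_block, membranes_per_block=1):
    """Fraction of a block's threads that have an object to work on."""
    busy = objects_per_membrane * membranes_per_block
    if busy > threads_per_block:
        raise ValueError("membranes do not fit in one block")
    return busy / threads_per_block


def max_membranes_per_block(objects_per_membrane, threads_per_block):
    """How many whole membranes fit in one block at one thread per object."""
    return threads_per_block // objects_per_membrane


# 48 objects per membrane is the case quoted in the abstract;
# the block size of 256 is an assumption for illustration.
one_per_block = active_fraction(48, 256)
packed = active_fraction(48, 256, max_membranes_per_block(48, 256))

print(one_per_block)  # 0.1875
print(packed)         # 0.9375
```

With these assumed numbers, a single 48-object membrane keeps under a fifth of the block busy, while five membranes sharing a block keep most of it busy. The study goes further by choosing which membranes share a block according to how much they communicate.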

Accelerating in-memory transaction processing using general purpose graphics processing units

Future Generation Computer Systems ◽

10.1016/j.future.2019.03.034 ◽

2019 ◽

Vol 97 ◽

pp. 836-848

Author(s):

Lan Gao ◽

Yunlong Xu ◽

Rui Wang ◽

Hailong Yang ◽

Zhongzhi Luan ◽

...

Keyword(s):

Graphics Processing Units ◽

Transaction Processing ◽

General Purpose ◽

Graphics Processing

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units - GPGPU-6

10.1145/2458523 ◽

2013 ◽

Keyword(s):

Graphics Processing Units ◽

General Purpose ◽

General Purpose Processor ◽

Graphics Processing

Accelerated FDPS: Algorithms to use accelerators with FDPS

Publications of the Astronomical Society of Japan ◽

10.1093/pasj/psz133 ◽

2020 ◽

Vol 72 (1) ◽

Cited By ~ 2

Author(s):

Masaki Iwasawa ◽

Daisuke Namekata ◽

Keigo Nitadori ◽

Kentaro Nomura ◽

Long Wang ◽

...

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

General Purpose ◽

Performance Model ◽

Performance Tuning ◽

Data Types ◽

Interaction Function ◽

Current Implementation ◽

And Performance ◽

Graphics Processing

We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the degree of parallelism available in the interaction function and that of the hardware. We have modified the interface of the user-provided interaction functions so that accelerators are used more efficiently. We also implemented new techniques which reduce the amount of work on the CPU side and the amount of communication between the CPU and accelerators. We have measured the performance of N-body simulations on a system with an NVIDIA Volta GPGPU using FDPS, and the achieved performance is around 27% of the theoretical peak limit. We have constructed a detailed performance model, and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.

Simulation of Fire with a Gas Kinetic Scheme on Distributed GPGPU Architectures

Computation ◽

10.3390/computation8020050 ◽

2020 ◽

Vol 8 (2) ◽

pp. 50

Author(s):

Stephan Lenz ◽

Martin Geier ◽

Manfred Krafczyk

Keyword(s):

Graphics Processing Units ◽

Parallel Implementation ◽

Kinetic Scheme ◽

General Purpose ◽

Massively Parallel ◽

Small Scale ◽

Fire Dynamics ◽

Linear Interaction ◽

Turbulent Natural Convection ◽

Order Of Magnitude

The simulation of fire is a challenging task due to its occurrence on multiple space-time scales and the non-linear interaction of multiple physical processes. Current state-of-the-art software such as the Fire Dynamics Simulator (FDS) implements most of the required physics, yet a significant drawback of this implementation is its limited scalability on modern massively parallel hardware. The current paper presents a massively parallel implementation of a Gas Kinetic Scheme (GKS) on General Purpose Graphics Processing Units (GPGPUs) as a potential alternative modeling and simulation approach. The implementation is validated for turbulent natural convection against experimental data. Subsequently, it is validated for two simulations of fire plumes, including a small-scale table top setup and a fire on the scale of a few meters. We show that the present GKS achieves comparable accuracy to the results obtained by FDS. Yet, due to the parallel efficiency on dedicated hardware, our GKS implementation delivers a reduction of wall-clock times of more than an order of magnitude. This paper demonstrates the potential of explicit local schemes in massively parallel environments for the simulation of fire.
