A High Performance Parallel Ranking SVM with OpenCL on Multi-core and Many-core Platforms

Huming Zhu; Pei Li; Peng Zhang; Zheng Luo

doi:10.4018/ijghpc.2019010102

International Journal of Grid and High Performance Computing ◽

10.4018/ijghpc.2019010102 ◽

2019 ◽

Vol 11 (1) ◽

pp. 17-28 ◽

Cited By ~ 1

Author(s):

Huming Zhu ◽

Pei Li ◽

Peng Zhang ◽

Zheng Luo

Keyword(s):

Parallel Programming ◽

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Learning To Rank ◽

Support Vector ◽

Processing Unit ◽

Ranking Svm ◽

Ranking Problems ◽

Many Core

A ranking support vector machine (RSVM) is a typical pairwise learning-to-rank method that is effective on ranking problems. However, the training speed of RSVMs is not satisfactory, especially on large-scale ranking problems. In recent years, many-core processing units (graphics processing units (GPUs) and Many Integrated Core (MIC) devices) and multi-core processing units have shown great advantages in parallel computing. With this hardware support, parallel programming has developed rapidly. Open Computing Language (OpenCL) and Open Multi-Processing (OpenMP) are two popular parallel programming interfaces. The authors present two high-performance parallel implementations of RSVM: an OpenCL version implemented on multi-core and many-core platforms, and an OpenMP version implemented on a multi-core platform. The experimental results show that the OpenCL version of the parallel RSVM achieved considerable speedup on the Intel MIC 7110P, NVIDIA Tesla K20M, and Intel Xeon E5-2692v2, and that it also shows good portability.
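To make the "pairwise" idea concrete, here is a minimal sketch of a linear ranking SVM objective: an L2 regulariser plus a hinge loss over every pair of items with different relevance labels. It is an illustration only and is not taken from the paper; the function name `pairwise_hinge_loss`, the parameter `c`, and the toy data are assumptions.

```python
from itertools import combinations


def pairwise_hinge_loss(w, features, relevance, c=1.0):
    """Regularised pairwise hinge loss of a linear scoring model (hypothetical helper)."""
    # Score each item with the linear model w . x
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in features]
    loss = 0.0
    # Each pair is independent of the others, so this loop is the kind of
    # work that can be spread across many cores.
    for i, j in combinations(range(len(features)), 2):
        if relevance[i] == relevance[j]:
            continue  # ties impose no ordering constraint
        hi, lo = (i, j) if relevance[i] > relevance[j] else (j, i)
        # Penalise pairs where the more relevant item does not win by a margin of 1
        loss += max(0.0, 1.0 - (scores[hi] - scores[lo]))
    regulariser = 0.5 * sum(wi * wi for wi in w)
    return regulariser + c * loss


# Toy data (assumed): three items, one informative feature
features = [[2.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
relevance = [2, 0, 1]
print(pairwise_hinge_loss([1.0, 0.0], features, relevance))
print(pairwise_hinge_loss([0.0, 0.0], features, relevance))
```

The number of pairs grows quadratically with the number of items, which is one reason training becomes slow on large datasets. How the authors partition this work across OpenCL or OpenMP threads is not described in the abstract.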

Splotch

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016652713 ◽

2016 ◽

Vol 31 (6) ◽

pp. 550-563

Author(s):

Timothy Dykes ◽

Claudio Gheller ◽

Marzia Rivi ◽

Mel Krokos

Keyword(s):

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Processing Unit ◽

Xeon Phi ◽

The Many ◽

Many Core ◽

Performance Results ◽

Graphics Processing ◽

Performance Computing

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogeneous high-performance computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel’s coprocessor based upon the new many integrated core architecture. We discuss steps taken to offload data to the coprocessor and algorithmic modifications to aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi devices. Finally, we compare performance against results achieved with the Graphics Processing Unit (GPU) based implementation of Splotch.

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.
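For readers unfamiliar with the Roofline model mentioned above, the sketch below shows its basic bound: attainable throughput is the smaller of the peak compute rate and memory bandwidth multiplied by arithmetic intensity. The function name and all hardware numbers are hypothetical and are not taken from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gb_s, intensity_flops_per_byte):
    """Upper bound on throughput under the basic Roofline model."""
    # Memory-bound kernels are capped by bandwidth * intensity,
    # compute-bound kernels by the peak floating-point rate.
    return min(peak_gflops, bandwidth_gb_s * intensity_flops_per_byte)


# Hypothetical device: 1000 GFLOP/s peak, 200 GB/s memory bandwidth
low_intensity_bound = roofline_gflops(1000.0, 200.0, 0.5)    # memory-bound kernel
high_intensity_bound = roofline_gflops(1000.0, 200.0, 10.0)  # compute-bound kernel
print(low_intensity_bound, high_intensity_bound)
```

Measured performance is then typically reported as a fraction of this bound on each platform, which gives one way to compare how well a single code base performs across different architectures.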

Prediction of Residual Stresses in a Multipass Pipe Weld by a Novel 3D Finite Element Approach

Volume 6B: Materials and Fabrication ◽

10.1115/pvp2018-85044 ◽

2018 ◽

Cited By ~ 1

Author(s):

Hui Huang ◽

Jian Chen ◽

Blair Carlson ◽

Hui-Ping Wang ◽

Paul Crooker ◽

...

Keyword(s):

Finite Element ◽

Residual Stresses ◽

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Computational Cost ◽

Three Dimensional ◽

Processing Unit ◽

Girth Welds ◽

Welding Processes

Due to the enormous computational cost, current residual stress simulations of multipass girth welds are mostly performed using two-dimensional (2D) axisymmetric models. A 2D model can provide only a limited estimate of the residual stresses because it assumes an axisymmetric distribution. In this study, a highly efficient thermal-mechanical finite element code for three-dimensional (3D) models has been developed based on high performance Graphics Processing Unit (GPU) computers. Our code is further accelerated by considering the unique physics associated with welding processes, which are characterized by steep temperature gradients and a moving arc heat source. It is capable of modeling large-scale welding problems that cannot be easily handled by existing commercial simulation tools. To demonstrate its accuracy and efficiency, our code was compared with a commercial software package by simulating a 3D multi-pass girth weld model with over 1 million elements. Our code achieved solution accuracy comparable to the commercial one at less than one hundredth of the computational cost. Moreover, the three-dimensional analysis demonstrated a more realistic stress distribution that is not axisymmetric in the hoop direction.

Granular layEr Simulator: Design and Multi-GPU Simulation of the Cerebellar Granular Layer

Frontiers in Computational Neuroscience ◽

10.3389/fncom.2021.630795 ◽

2021 ◽

Vol 15 ◽

Author(s):

Giordana Florimbi ◽

Emanuele Torti ◽

Stefano Masoli ◽

Egidio D'Angelo ◽

Francesco Leporati

Keyword(s):

High Performance ◽

Large Scale ◽

Granular Layer ◽

Graphics Processing Unit ◽

Mossy Fibers ◽

Processing Unit ◽

Large Network ◽

Processing Times ◽

3D Space ◽

High Level

In modern computational modeling, neuroscientists need to reproduce long-lasting activity of large-scale networks, where neurons are described by highly complex mathematical models. These aspects strongly increase the computational load of the simulations, which can be performed efficiently by exploiting parallel systems to reduce processing times. Graphics Processing Unit (GPU) devices meet this need by providing High Performance Computing on the desktop. In this work, the authors describe the development of a novel Granular layEr Simulator implemented on a multi-GPU system, capable of reconstructing the cerebellar granular layer in a 3D space and reproducing its neuronal activity. The reconstruction is characterized by a high level of novelty and realism, considering axonal/dendritic field geometries oriented in the 3D space and following convergence/divergence rates provided in the literature. Neurons are modeled using Hodgkin and Huxley representations. The network is validated by reproducing typical behaviors that are well documented in the literature, such as the center-surround organization. The reconstruction of a network with a volume of 600 × 150 × 1,200 μm³ and 432,000 granules, 972 Golgi cells, 32,399 glomeruli, and 4,051 mossy fibers takes 235 s on an Intel i9 processor. Reproducing 10 s of activity takes only 4.34 and 3.37 h on a single-GPU and a multi-GPU desktop system (with one or two NVIDIA RTX 2080 GPUs, respectively). Moreover, the code takes only 3.52 and 2.44 h when run on one or two NVIDIA V100 GPUs, respectively. The speedups reached (up to ~38× in the single-GPU version and ~55× in the multi-GPU version) clearly demonstrate that GPU technology is highly suitable for realistic large network simulations.
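The timings quoted above also allow a quick check of how much a second GPU helps. The sketch below only divides the single-GPU time by the dual-GPU time for each card; the variable and function names are assumptions. The baseline behind the ~38× and ~55× figures is not stated in this abstract, so those are not recomputed here.

```python
# Hours needed to reproduce 10 s of activity, as reported in the abstract,
# keyed by GPU model and then by number of GPUs used.
hours = {
    "RTX 2080": {1: 4.34, 2: 3.37},
    "V100": {1: 3.52, 2: 2.44},
}


def dual_gpu_gain(times):
    """Ratio of single-GPU time to dual-GPU time (hypothetical helper)."""
    return times[1] / times[2]


gains = {gpu: round(dual_gpu_gain(times), 2) for gpu, times in hours.items()}
print(gains)
```

With these numbers the second GPU gives a gain of roughly 1.29× on the RTX 2080 system and 1.44× on the V100 system, so scaling from one to two GPUs is well short of linear in both cases.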

A Coupled Hydrologic–Hydraulic Model (XAJ–HiPIMS) for Flood Simulation

Water ◽

10.3390/w12051288 ◽

2020 ◽

Vol 12 (5) ◽

pp. 1288

Author(s):

Yueling Wang ◽

Xiaoliu Yang

Keyword(s):

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Hydrologic Model ◽

Hydraulic Model ◽

Small Scale ◽

Processing Unit ◽

Rainfall Runoff ◽

Graphics Processing ◽

The Impact

To protect ecologies and the environment by preventing floods, analysis of the impact of climate change on water requires a tool capable of considering rainfall-runoff processes on a small scale, for example, 10 m. As has been shown previously, hydrologic models are good at simulating rainfall-runoff processes on a large scale, e.g., over several hundred km², while hydraulic models are more advantageous for applications on smaller scales. To take advantage of both types of model, this paper coupled a hydrologic model, the Xinanjiang model (XAJ), with a hydraulic model, the Graphics Processing Unit (GPU)-accelerated high-performance integrated hydraulic modelling system (HiPIMS). The study was completed in the Misai basin (797 km²), located in Zhejiang Province, China. The coupled XAJ–HiPIMS model was validated against observed flood events. The simulated results agree well with the data observed at the basin outlet. The study shows that a coupled hydrologic and hydraulic model is capable of providing flood information on a small scale for a large basin and shows the potential of the research.

A Parallel-Computing Approach for Vector Road-Network Matching Using GPU Architecture

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi7120472 ◽

2018 ◽

Vol 7 (12) ◽

pp. 472 ◽

Cited By ~ 1

Author(s):

Bo Wan ◽

Lin Yang ◽

Shunping Zhou ◽

Run Wang ◽

Dezhi Wang ◽

...

Keyword(s):

Road Network ◽

Large Scale ◽

Graphics Processing Unit ◽

Road Networks ◽

Processing Unit ◽

Data Partition ◽

Matching Method ◽

The Road ◽

Central Processing ◽

Relaxation Matching

The road-network matching method is an effective tool for map integration, fusion, and update. Due to the complexity of road networks in the real world, matching methods often contain a series of complicated processes to identify homonymous roads and deal with their intricate relationships. However, traditional road-network matching algorithms, which are mainly central processing unit (CPU)-based approaches, may run into performance bottlenecks when facing big data. We developed a particle-swarm optimization (PSO)-based parallel road-network matching method on a graphics processing unit (GPU). Based on the characteristics of the two main stages (similarity computation and matching-relationship identification), data-partition and task-partition strategies were utilized, respectively, to make full use of GPU threads. Experiments were conducted on datasets with 14 different scales. Results indicate that the parallel PSO-based matching algorithm (PSOM) could correctly identify most matching relationships with an average accuracy of 84.44%, which was at the same level as the accuracy of a benchmark, the probability-relaxation-matching (PRM) method. The PSOM approach significantly reduced the road-network matching time on large amounts of data in comparison with the PRM method. This paper provides a common parallel algorithm framework for road-network matching algorithms and contributes to the integration and update of large-scale road networks.

Realtime cerebellum: A large-scale spiking network model of the cerebellum that runs in realtime using a graphics processing unit

Neural Networks ◽

10.1016/j.neunet.2013.01.019 ◽

2013 ◽

Vol 47 ◽

pp. 103-111 ◽

Cited By ~ 47

Author(s):

Tadashi Yamazaki ◽

Jun Igarashi

Keyword(s):

Network Model ◽

Large Scale ◽

Graphics Processing Unit ◽

Processing Unit ◽

Spiking Network ◽

Graphics Processing

Embedded GPU Implementation for High-Performance Ultrasound Imaging

Electronics ◽

10.3390/electronics10080884 ◽

2021 ◽

Vol 10 (8) ◽

pp. 884

Author(s):

Stefano Rossi ◽

Enrico Boni

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Digital Signal ◽

Processing Unit ◽

Embedded Computing ◽

Field Programmable ◽

Peripheral Component Interconnect ◽

Programmable Gate Arrays ◽

Graphics Processing ◽

Signal Processors

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.

Using Unreal Engine to Visualize a Cosmological Volume

Universe ◽

10.3390/universe6100168 ◽

2020 ◽

Vol 6 (10) ◽

pp. 168

Author(s):

Christopher Marsden ◽

Francesco Shankar

Keyword(s):

Real Time ◽

Large Scale ◽

Graphics Processing Unit ◽

Large Scale Structure ◽

Two Dimensions ◽

Scale Structure ◽

Sloan Digital Sky Survey ◽

Processing Unit ◽

Large Scale Universe ◽

Time Projection

In this work we present “Astera”, a cosmological visualization tool that renders a mock universe in real time using Unreal Engine 4. The large scale structure of the cosmic web is hard to visualize in two dimensions, and a 3D real time projection of this distribution allows for an unprecedented view of the large scale universe, with visually accurate galaxies placed in a dynamic 3D world. The underlying data are based on empirical relations assigned using results from N-Body dark matter simulations, and are matched to galaxies with similar morphologies and sizes, images of which are extracted from the Sloan Digital Sky Survey. Within Unreal Engine 4, galaxy images are transformed into textures and dynamic materials (with appropriate transparency) that are applied to static mesh objects with appropriate sizes and locations. To ensure excellent performance, these static meshes are “instanced” to utilize the full capabilities of a graphics processing unit. Additional components include a dynamic system for representing accelerated-time active galactic nuclei. The end result is a visually realistic large scale universe that can be explored by a user in real time, with accurate large scale structure. Astera is not yet ready for public release, but we are exploring options to make different versions of the code available for both research and outreach applications.

BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text

Bioinformatics ◽

10.1093/bioinformatics/btaa837 ◽

2020 ◽

Author(s):

Ronghui You ◽

Yuxuan Liu ◽

Hiroshi Mamitsuka ◽

Shanfeng Zhu

Keyword(s):

Full Text ◽

High Performance ◽

Large Scale ◽

Learning Strategy ◽

Learning To Rank ◽

Representation Learning ◽

Supplementary Information ◽

Medical Subject Headings ◽

The Difference ◽

Contextual Representation

Motivation: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture only some pre-defined sections in full text, and (iii) ignores the whole MEDLINE database. Results: We propose a computationally lighter, full-text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which lets BERTMeSH capture deep semantics of full text, and (ii) a transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only, with no full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. Also, prediction of 20 K test articles needed 5 min by BERTMeSH, while it took more than 10 h by FullMeSH, demonstrating the computational efficiency of BERTMeSH. Supplementary information: Supplementary data are available at Bioinformatics online.
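The Micro F-measure reported above pools true positives, false positives, and false negatives over all articles before computing precision and recall. The sketch below shows that standard calculation for multi-label output; the function name `micro_f1` and the toy label sets are assumptions, and the authors' evaluation code may differ in detail.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F-measure over per-article label sets (hypothetical helper)."""
    tp = fp = fn = 0
    for gold_labels, predicted_labels in zip(gold, predicted):
        g, p = set(gold_labels), set(predicted_labels)
        tp += len(g & p)   # labels correctly predicted
        fp += len(p - g)   # predicted but not in the gold annotation
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (assumed data): two articles with placeholder headings
gold = [{"A", "B"}, {"C"}]
predicted = [{"A"}, {"C", "D"}]
print(micro_f1(gold, predicted))
```

Because counts are pooled across articles, frequently assigned headings weigh more heavily in this score than rare ones.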
