A High Performance Parallel Ranking SVM with OpenCL on Multi-core and Many-core Platforms

2019 ◽  
Vol 11 (1) ◽  
pp. 17-28 ◽  
Author(s):  
Huming Zhu ◽  
Pei Li ◽  
Peng Zhang ◽  
Zheng Luo

A ranking support vector machine (RSVM) is a typical pairwise method of learning to rank, which is effective in ranking problems. However, the training speed of RSVMs are not satisfactory, especially when solving large-scale data ranking problems. Recent years, many-core processing units (graphics processing unit (GPU), Many Integrated Core (MIC)) and multi-core processing units have exhibited huge superiority in the parallel computing domain. With the support of hardware, parallel programming develops rapidly. Open Computing Language (OpenCL) and Open Multi-Processing (OpenMP) are two of popular parallel programming interfaces. The authors present two high-performance parallel implementations of RSVM, an OpenCL version implemented on multi-core and many-core platforms, and an OpenMP version implemented on multi-core platform. The experimental results show that the OpenCL version parallel RSVM achieved considerable speedup on Intel MIC 7110P, NVIDIA Tesla K20M and Intel Xeon E5-2692v2, and it also shows good portability.

Author(s):  
Timothy Dykes ◽  
Claudio Gheller ◽  
Marzia Rivi ◽  
Mel Krokos

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogenous high-performance computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel’s coprocessor based upon the new many integrated core architecture. We discuss steps taken to offload data to the coprocessor and algorithmic modifications to aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi. Finally we compare performance against results achieved with the Graphics Processing Unit (GPU) based implementation of Splotch.


Author(s):  
Alan Gray ◽  
Kevin Stratford

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.


Author(s):  
Hui Huang ◽  
Jian Chen ◽  
Blair Carlson ◽  
Hui-Ping Wang ◽  
Paul Crooker ◽  
...  

Due to enormous computation cost, current residual stress simulation of multipass girth welds are mostly performed using two-dimensional (2D) axisymmetric models. The 2D model can only provide limited estimation on the residual stresses by assuming its axisymmetric distribution. In this study, a highly efficient thermal-mechanical finite element code for three dimensional (3D) model has been developed based on high performance Graphics Processing Unit (GPU) computers. Our code is further accelerated by considering the unique physics associated with welding processes that are characterized by steep temperature gradient and a moving arc heat source. It is capable of modeling large-scale welding problems that cannot be easily handled by the existing commercial simulation tools. To demonstrate the accuracy and efficiency, our code was compared with a commercial software by simulating a 3D multi-pass girth weld model with over 1 million elements. Our code achieved comparable solution accuracy with respect to the commercial one but with over 100 times saving on computational cost. Moreover, the three-dimensional analysis demonstrated more realistic stress distribution that is not axisymmetric in hoop direction.


2021 ◽  
Vol 15 ◽  
Author(s):  
Giordana Florimbi ◽  
Emanuele Torti ◽  
Stefano Masoli ◽  
Egidio D'Angelo ◽  
Francesco Leporati

In modern computational modeling, neuroscientists need to reproduce long-lasting activity of large-scale networks, where neurons are described by highly complex mathematical models. These aspects strongly increase the computational load of the simulations, which can be efficiently performed by exploiting parallel systems to reduce the processing times. Graphics Processing Unit (GPU) devices meet this need providing on desktop High Performance Computing. In this work, authors describe a novel Granular layEr Simulator development implemented on a multi-GPU system capable of reconstructing the cerebellar granular layer in a 3D space and reproducing its neuronal activity. The reconstruction is characterized by a high level of novelty and realism considering axonal/dendritic field geometries, oriented in the 3D space, and following convergence/divergence rates provided in literature. Neurons are modeled using Hodgkin and Huxley representations. The network is validated by reproducing typical behaviors which are well-documented in the literature, such as the center-surround organization. The reconstruction of a network, whose volume is 600 × 150 × 1,200 μm3 with 432,000 granules, 972 Golgi cells, 32,399 glomeruli, and 4,051 mossy fibers, takes 235 s on an Intel i9 processor. The 10 s activity reproduction takes only 4.34 and 3.37 h exploiting a single and multi-GPU desktop system (with one or two NVIDIA RTX 2080 GPU, respectively). Moreover, the code takes only 3.52 and 2.44 h if run on one or two NVIDIA V100 GPU, respectively. The relevant speedups reached (up to ~38× in the single-GPU version, and ~55× in the multi-GPU) clearly demonstrate that the GPU technology is highly suitable for realistic large network simulations.


Water ◽  
2020 ◽  
Vol 12 (5) ◽  
pp. 1288
Author(s):  
Yueling Wang ◽  
Xiaoliu Yang

To protect ecologies and the environment by preventing floods, analysis of the impact of climate change on water requires a tool capable of considering the rainfall-runoff processes on a small scale, for example, 10 m. As has been shown previously, hydrologic models are good at simulating rainfall-runoff processes on a large scale, e.g., over several hundred km2, while hydraulic models are more advantageous for applications on smaller scales. In order to take advantages of these two types of models, this paper coupled a hydrologic model, the Xinanjing model (XAJ), with a hydraulic model, the Graphics Processing Unit (GPU)-accelerated high-performance integrated hydraulic modelling system (HiPIMS). The study was completed in the Misai basin (797 km2), located in Zhejiang Province, China. The coupled XAJ–HiPIMS model was validated against observed flood events. The simulated results agree well with the data observed at the basin outlet. The study proves that a coupled hydrologic and hydraulic model is capable of providing flood information on a small scale for a large basin and shows the potential of the research.


2018 ◽  
Vol 7 (12) ◽  
pp. 472 ◽  
Author(s):  
Bo Wan ◽  
Lin Yang ◽  
Shunping Zhou ◽  
Run Wang ◽  
Dezhi Wang ◽  
...  

The road-network matching method is an effective tool for map integration, fusion, and update. Due to the complexity of road networks in the real world, matching methods often contain a series of complicated processes to identify homonymous roads and deal with their intricate relationship. However, traditional road-network matching algorithms, which are mainly central processing unit (CPU)-based approaches, may have performance bottleneck problems when facing big data. We developed a particle-swarm optimization (PSO)-based parallel road-network matching method on graphics-processing unit (GPU). Based on the characteristics of the two main stages (similarity computation and matching-relationship identification), data-partition and task-partition strategies were utilized, respectively, to fully use GPU threads. Experiments were conducted on datasets with 14 different scales. Results indicate that the parallel PSO-based matching algorithm (PSOM) could correctly identify most matching relationships with an average accuracy of 84.44%, which was at the same level as the accuracy of a benchmark—the probability-relaxation-matching (PRM) method. The PSOM approach significantly reduced the road-network matching time in dealing with large amounts of data in comparison with the PRM method. This paper provides a common parallel algorithm framework for road-network matching algorithms and contributes to integration and update of large-scale road-networks.


Electronics ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 884
Author(s):  
Stefano Rossi ◽  
Enrico Boni

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.


Universe ◽  
2020 ◽  
Vol 6 (10) ◽  
pp. 168
Author(s):  
Christopher Marsden ◽  
Francesco Shankar

In this work we present “Astera’’, a cosmological visualization tool that renders a mock universe in real time using Unreal Engine 4. The large scale structure of the cosmic web is hard to visualize in two dimensions, and a 3D real time projection of this distribution allows for an unprecedented view of the large scale universe, with visually accurate galaxies placed in a dynamic 3D world. The underlying data are based on empirical relations assigned using results from N-Body dark matter simulations, and are matched to galaxies with similar morphologies and sizes, images of which are extracted from the Sloan Digital Sky Survey. Within Unreal Engine 4, galaxy images are transformed into textures and dynamic materials (with appropriate transparency) that are applied to static mesh objects with appropriate sizes and locations. To ensure excellent performance, these static meshes are “instanced’’ to utilize the full capabilities of a graphics processing unit. Additional components include a dynamic system for representing accelerated-time active galactic nuclei. The end result is a visually realistic large scale universe that can be explored by a user in real time, with accurate large scale structure. Astera is not yet ready for public release, but we are exploring options to make different versions of the code available for both research and outreach applications.


Author(s):  
Ronghui You ◽  
Yuxuan Liu ◽  
Hiroshi Mamitsuka ◽  
Shanfeng Zhu

Abstract Motivation With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture some pre-defined sections only in full text and (iii) ignores the whole MEDLINE database. Results We propose a computationally lighter, full text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which makes BERTMeSH capture deep semantics of full text. (ii) A transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only and no full text) in MEDLINE, to take advantages of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH with the difference being statistically significant. Also prediction of 20 K test articles needed 5 min by BERTMeSH, while it took more than 10 h by FullMeSH, proving the computational efficiency of BERTMeSH. Supplementary information Supplementary data are available at Bioinformatics online


Sign in / Sign up

Export Citation Format

Share Document