BEAGLE 3: Improved Performance, Scaling, and Usability for a High-Performance Computing Library for Statistical Phylogenetics

Daniel L Ayres; Michael P Cummings; Guy Baele; Aaron E Darling; Paul O Lewis; David L Swofford; John P Huelsenbeck; Philippe Lemey; Andrew Rambaut; Marc A Suchard

doi:10.1093/sysbio/syz020


Systematic Biology ◽

10.1093/sysbio/syz020 ◽

2019 ◽

Vol 68 (6) ◽

pp. 1052-1061 ◽

Cited By ~ 41

Author(s):

Daniel L Ayres ◽

Michael P Cummings ◽

Guy Baele ◽

Aaron E Darling ◽

Paul O Lewis ◽

...

Keyword(s):

High Performance ◽

Evolutionary Model ◽

Data Partitioning ◽

Partial Likelihood ◽

Processing Unit ◽

Data Set ◽

Diverse Range ◽

Central Processing ◽

Automated Method ◽

Software Packages

Abstract BEAGLE is a high-performance likelihood-calculation library for phylogenetic inference. The BEAGLE library defines a simple, but flexible, application programming interface (API), and includes a collection of efficient implementations for calculation under a variety of evolutionary models on different hardware devices. The library has been integrated into recent versions of popular phylogenetics software packages including BEAST and MrBayes and has been widely used across a diverse range of evolutionary studies. Here, we present BEAGLE 3 with new parallel implementations, increased performance for challenging data sets, improved scalability, and better usability. We have added new OpenCL and central processing unit-threaded implementations to the library, allowing the effective utilization of a wider range of modern hardware. Further, we have extended the API and library to support concurrent computation of independent partial likelihood arrays, for increased performance of nucleotide-model analyses with greater flexibility of data partitioning. For better scalability and usability, we have improved how phylogenetic software packages use BEAGLE in multi-GPU (graphics processing unit) and cluster environments, and introduced an automated method to select the fastest device given the data set, evolutionary model, and hardware. For application developers who wish to integrate the library, we also have developed an online tutorial. To evaluate the effect of the improvements, we ran a variety of benchmarks on state-of-the-art hardware. For a partitioned exemplar analysis, we observe run-time performance improvements as high as 5.9-fold over our previous GPU implementation. BEAGLE 3 is free, open-source software licensed under the Lesser GPL and available at https://beagle-dev.github.io.
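To make the term "partial likelihood array" concrete, the following is a minimal sketch of how the partial likelihoods at one internal tree node can be combined from its two children for a single nucleotide site (Felsenstein's pruning step). It is plain Python under the Jukes-Cantor model and is not the BEAGLE API; the function names `jc69_matrix` and `combine_partials` are hypothetical and chosen only for illustration.

```python
import math

def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for branch length t
    (expected substitutions per site), as a 4x4 nested list."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def combine_partials(left, t_left, right, t_right):
    """Partial likelihoods at a parent node for one site, given the
    partials of its two children and the two branch lengths."""
    p_l, p_r = jc69_matrix(t_left), jc69_matrix(t_right)
    parent = []
    for s in range(4):  # state at the parent: A, C, G, T
        from_left = sum(p_l[s][x] * left[x] for x in range(4))
        from_right = sum(p_r[s][x] * right[x] for x in range(4))
        parent.append(from_left * from_right)
    return parent

# Tips observed as A and C, encoded as indicator vectors.
tip_a = [1.0, 0.0, 0.0, 0.0]
tip_c = [0.0, 1.0, 0.0, 0.0]
partials = combine_partials(tip_a, 0.1, tip_c, 0.1)

# Site likelihood at the root with equal base frequencies.
site_likelihood = sum(0.25 * p for p in partials)
```

Each site and each node is independent of the others at the same tree depth, which is the kind of work that parallel hardware can process concurrently.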


Efficient parallelization of SPH algorithm on modern multi-core CPUs and massively parallel GPUs

International Journal of Modeling Simulation and Scientific Computing ◽

10.1142/s1793962321500549 ◽

2021 ◽

pp. 2150054

Author(s):

Pravin Jagtap ◽

Rupesh Nasre ◽

V. S. Sanapala ◽

B. S. V. Patnaik

Keyword(s):

High Performance ◽

Performance Metrics ◽

Computational Simulation ◽

Massively Parallel ◽

Benchmark Problems ◽

Processing Unit ◽

Central Processing ◽

Neighbor Search ◽

Computational Performance ◽

Sph Algorithm

Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the backbone for fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Particle-based methods typically require efficient particle search-and-locate algorithms, because the continuous creation of neighbor particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on multi-core Central Processing Units (CPUs) as well as massively parallel General Purpose Graphical Processing Units (GP-GPUs). Parallelizing SPH faces several challenges, such as scalability of the neighbor search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, and ensuring optimum use of computational resources. While addressing some of these challenges, a detailed analysis of performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency, and occupancy is presented. The OpenMP and Compute Unified Device Architecture[Formula: see text] parallel programming models have been used for parallel computing on Intel Xeon[Formula: see text] E5-[Formula: see text] multi-core CPU and NVIDIA Quadro M[Formula: see text] and NVIDIA Tesla p[Formula: see text] massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for the validation. The key concern of how to identify a suitable architecture for mesh-less methods, which essentially require a heavy workload of neighbor search and evaluation of local force fields from neighbor interactions, is addressed.
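The neighbor search mentioned above is commonly organized with a uniform grid (cell list), so that each particle is compared only against particles in adjacent cells. The sketch below illustrates that general idea in serial Python for 2D points; it is an assumption-laden illustration rather than the authors' OpenMP or CUDA implementation, and the name `neighbor_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import product

def neighbor_pairs(positions, h):
    """Return sorted index pairs (i, j), i < j, of 2D particles whose
    distance is at most h, using a uniform grid with cell size h."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        cells[(int(x // h), int(y // h))].append(idx)

    pairs = set()
    for (cx, cy), members in cells.items():
        # Only the 3x3 block of cells around (cx, cy) can hold neighbors.
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in cells.get((cx + dx, cy + dy), ()):
                for i in members:
                    if i < j:
                        xi, yi = positions[i]
                        xj, yj = positions[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= h * h:
                            pairs.add((i, j))
    return sorted(pairs)

points = [(0.0, 0.0), (0.5, 0.0), (2.0, 2.0), (2.4, 2.3), (0.9, 0.9)]
pairs = neighbor_pairs(points, 1.0)
```

On a GPU, the per-cell loops are typically mapped to threads, which is where concerns such as thread divergence and coalesced memory access arise.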


SeisNoise.jl: Ambient Seismic Noise Cross Correlation on the CPU and GPU in Julia

Seismological Research Letters ◽

10.1785/0220200192 ◽

2020 ◽

Vol 92 (1) ◽

pp. 517-527

Author(s):

Timothy Clements ◽

Marine A. Denolle

Keyword(s):

Seismic Noise ◽

High Performance ◽

Cross Correlation ◽

Graphic Processing Unit ◽

Ambient Seismic Noise ◽

Processing Unit ◽

Central Processing ◽

And Performance ◽

Noise Cross Correlation ◽

Performance Computing

Abstract We introduce SeisNoise.jl, a library for high-performance ambient seismic noise cross correlation, written entirely in the computing language Julia. Julia is a new language, with syntax and a learning curve similar to MATLAB (see Data and Resources), R, or Python and performance close to Fortran or C. SeisNoise.jl is compatible with high-performance computing resources, using both the central processing unit and the graphic processing unit. SeisNoise.jl is a modular toolbox, giving researchers common tools and data structures to design custom ambient seismic cross-correlation workflows in Julia.
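As a rough illustration of what cross correlation of two noise records computes, the sketch below evaluates a time-domain cross-correlation over a range of lags and picks the lag with the largest value. It is written in Python rather than Julia, it does not use or imitate the SeisNoise.jl API, and the function names `cross_correlate` and `best_lag` are hypothetical; production codes usually work in the frequency domain instead.

```python
def cross_correlate(a, b, max_lag):
    """Return {lag: sum_t a[t] * b[t + lag]} for lags in
    -max_lag..max_lag, skipping samples that fall outside b."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for t in range(len(a)):
            u = t + lag
            if 0 <= u < len(b):
                total += a[t] * b[u]
        result[lag] = total
    return result

def best_lag(a, b, max_lag):
    """Lag at which the cross-correlation is largest."""
    cc = cross_correlate(a, b, max_lag)
    return max(cc, key=cc.get)

# Station B records the same wiggle as station A, two samples later.
station_a = [0.0, 1.0, 0.0, -1.0, 2.0, 0.0, 0.0, 0.0]
station_b = [0.0, 0.0, 0.0, 1.0, 0.0, -1.0, 2.0, 0.0]
lag = best_lag(station_a, station_b, 4)
```

The lag of the correlation peak corresponds to the travel-time difference between the two stations, in samples.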


An improved real-time object proposals generation method based on local binary pattern

International Journal of Advanced Robotic Systems ◽

10.1177/1729881417724679 ◽

2017 ◽

Vol 14 (4) ◽

pp. 172988141772467 ◽

Cited By ~ 1

Author(s):

Yanting Jiang ◽

Jia Yan ◽

Ci’en Fan ◽

Wenxuan Shi ◽

Dexiang Deng

Keyword(s):

Local Binary Pattern ◽

Sliding Window ◽

High Accuracy ◽

Processing Unit ◽

Data Set ◽

Central Processing ◽

Lighting Conditions ◽

Occluded Objects ◽

Object Proposals ◽

Short Time

Generating a group of category-independent object proposals in an image within a very short time is an effective approach to accelerating traditional sliding window search, and it has been widely used in the preprocessing step of object recognition. In this article, we propose a novel object proposal generation method that produces an ordered set of candidate windows covering most object instances. By combining gradient and local binary pattern features, our approach achieves better performance than BING in finding occluded objects and objects in dim lighting conditions. In experiments on the challenging PASCAL VOC 2007 data set, we show that our approach is significantly more accurate than BING. In particular, using 2000 proposals, we achieve a 97.6% object detection rate and 69.3% mean average best overlap. Moreover, our proposed method is very efficient and takes only about 0.006 s per image on a laptop central processing unit. The detection speed and high accuracy of the proposed method mean that it can be applied to recognizing specific objects in robot vision.
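For readers unfamiliar with the local binary pattern (LBP) feature named above, the sketch below computes the basic 8-neighbour LBP code of one pixel. This is the textbook operator only; how the authors combine it with gradient features is not described in the abstract and is not reproduced here. The function name `lbp_code` and the bit ordering are assumptions made for illustration.

```python
def lbp_code(image, r, c):
    """Basic 8-neighbour local binary pattern of the pixel at (r, c).
    A neighbour contributes a 1 bit when it is >= the centre value.
    Bits are assigned clockwise starting at the top-left neighbour."""
    center = image[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

patch = [
    [5, 9, 1],
    [4, 6, 7],
    [2, 3, 8],
]
code = lbp_code(patch, 1, 1)
```

Because the code depends only on whether neighbours are brighter or darker than the centre, it is unchanged by uniform brightness shifts, which is one reason LBP is often used under poor lighting.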


Controllers: An abstraction to ease the use of hardware accelerators

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017702962 ◽

2017 ◽

Vol 32 (6) ◽

pp. 838-853 ◽

Cited By ~ 4

Author(s):

Ana Moreton–Fernandez ◽

Hector Ortega–Arranz ◽

Arturo Gonzalez–Escribano

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Abstract Entity ◽

Hardware Accelerators ◽

Processing Unit ◽

Central Processing ◽

Computing Platforms ◽

Graphics Processing ◽

Performance Computing ◽

Selection Of

Nowadays the use of hardware accelerators, such as graphics processing units or XeonPhi coprocessors, is key to solving computationally costly problems that require high performance computing. However, programming solutions for efficient deployment on these kinds of devices is a very complex task that relies on the manual management of memory transfers and configuration parameters. The programmer has to carry out a deep study of the particular data that needs to be computed at each moment, across different computing platforms, while also considering architectural details. We introduce the controller concept as an abstract entity that allows the programmer to easily manage the communications and kernel launching details on hardware accelerators in a transparent way. This model also provides the possibility of defining and launching central processing unit kernels in multi-core processors with the same abstraction and methodology used for the accelerators. It internally combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model simplifies the selection of values for several configuration parameters that can be chosen when a kernel is launched. This is done through a qualitative characterization process of the kernel code to be executed. Finally, we present the implementation of the controller model in a prototype library, together with its application in several case studies. Its use has led to reductions in development and porting costs, with significantly low overheads in execution times when compared to manually programmed and optimized solutions that directly use CUDA and OpenMP.


Comparative evaluation of performance and scalability of convolutional neural network implementations on a multisystem HPC architecture

Journal of Physics Conference Series ◽

10.1088/1742-6596/2062/1/012008 ◽

2021 ◽

Vol 2062 (1) ◽

pp. 012008

Author(s):

Sunil Pandey ◽

Naresh Kumar Nagwani ◽

Shrish Verma

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

High Performance ◽

Processing Unit ◽

Neural Network Training ◽

Central Processing ◽

Network Training ◽

Hot Rolled ◽

Hot Rolled Steel ◽

Performance Computing

Abstract The convolutional neural network training algorithm has been implemented for a central processing unit based high performance multisystem architecture machine. The multisystem, or multicomputer, is a parallel machine model which is essentially an abstraction of distributed memory parallel machines. In practice, this model corresponds to high performance computing clusters. The proposed implementation of the convolutional neural network training algorithm is based on modeling the convolutional neural network as a computational pipeline. The various functions or tasks of the convolutional neural network pipeline have been mapped onto the multiple nodes of a central processing unit based high performance computing cluster for task parallelism. The pipeline implementation provides a first level performance gain through pipeline parallelism. Further performance gains are obtained by distributing the convolutional neural network training onto the different nodes of the compute cluster. The two gains are multiplicative. In this work, the authors have carried out a comparative evaluation of the computational performance and scalability of this pipeline implementation of convolutional neural network training against a distributed neural network software program which is based on conventional multi-model training and makes use of a centralized server. The dataset considered for this work is the North Eastern University’s hot rolled steel strip surface defects imaging dataset. In both cases, the convolutional neural networks have been trained to classify the different defects on hot rolled steel strips on the basis of the input image. One hundred images corresponding to each class of defects have been used for the training in order to keep the training times manageable. The hyperparameters of both convolutional neural networks were kept identical and the programs were run on the same computational cluster to enable a fair comparison. Both convolutional neural network implementations have been observed to train to nearly 80% training accuracy in 200 epochs. In effect, therefore, the comparison is on the time taken to complete the training epochs.


BLVector: Fast BLAST-Like Algorithm for Manycore CPU With Vectorization

Frontiers in Genetics ◽

10.3389/fgene.2021.618659 ◽

2021 ◽

Vol 12 ◽

Author(s):

Sergio Gálvez ◽

Federico Agostini ◽

Javier Caselli ◽

Pilar Hernandez ◽

Gabriel Dorado

Keyword(s):

Amino Acids ◽

Execution Time ◽

High Performance ◽

Protein Sequences ◽

Processing Unit ◽

Central Processing ◽

Real Scenario ◽

High Level ◽

Performance Computing ◽

Comprehensive Study

New High-Performance Computing architectures have recently been developed for commercial central processing units (CPUs). Yet, that has not improved the execution time of widely used bioinformatics applications, like BLAST+. This is due to a lack of optimization between the bases of the existing algorithms and the internals of the hardware that would allow taking full advantage of the available CPU cores. To exploit the new architectures, algorithms must be revised and redesigned, and usually rewritten from scratch. BLVector adapts the high-level concepts of BLAST+ to the x86 architectures with AVX-512, to harness their capabilities. A comprehensive study has been carried out to optimize the approach, with a significant reduction in execution time. BLVector reduces the execution time of BLAST+ when aligning up to mid-size protein sequences (∼750 amino acids). The gain in real scenario cases is 3.2-fold. When applied to longer proteins, BLVector consumes more time than BLAST+, but retrieves a much larger set of results. BLVector and BLAST+ are fine-tuned heuristics. Therefore, the relevant results returned by both are the same, although they behave differently, especially when performing alignments with low scores. Hence, they can be considered complementary bioinformatics tools.


High-performance computing in water resources hydrodynamics

Journal of Hydroinformatics ◽

10.2166/hydro.2020.163 ◽

2020 ◽

Vol 22 (5) ◽

pp. 1217-1235 ◽

Cited By ~ 3

Author(s):

M. Morales-Hernández ◽

M. B. Sharif ◽

S. Gangrade ◽

T. T. Dullo ◽

S.-C. Kao ◽

...

Keyword(s):

Water Resources ◽

High Performance Computing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Test Case ◽

Processing Unit ◽

Central Processing ◽

Graphics Processing ◽

Performance Computing

Abstract This work presents a vision of future water resources hydrodynamics codes that can fully utilize the strengths of modern high-performance computing (HPC). Advances in computing power, formerly driven by the improvement of central processing units, now focus on parallel computing and, in particular, the use of graphics processing units (GPUs). However, this shift to a parallel framework requires refactoring the code to make efficient use of the data and may even require changing the nature of the algorithm that solves the system of equations. These concepts, along with other features such as the precision of the computations, dry region management, and input/output data, are analyzed in this paper. A 2D multi-GPU flood code applied to a large-scale test case is used to corroborate our statements and ascertain the new challenges for the next-generation parallel water resources codes.


Development of efficient GPU parallelization of WRF Yonsei University planetary boundary layer scheme

Geoscientific Model Development Discussions ◽

10.5194/gmdd-7-8031-2014 ◽

2014 ◽

Vol 7 (6) ◽

pp. 8031-8077

Author(s):

M. Huang ◽

J. Mielikainen ◽

B. Huang ◽

H. Chen ◽

H.-L. A. Huang ◽

...

Keyword(s):

Boundary Layer ◽

Planetary Boundary Layer ◽

Mixed Boundary ◽

Wrf Model ◽

Evolutionary Model ◽

Processing Unit ◽

Central Processing ◽

Atmospheric Column ◽

Pbl Scheme ◽

Weather Research

Abstract. The planetary boundary layer (PBL) is the lowest part of the atmosphere, and its character is directly affected by its contact with the underlying planetary surface. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transport in the whole atmospheric column. It determines the flux profiles within the well-mixed boundary layer and the more stable layer above. It thus provides an evolutionary model of atmospheric temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. For such purposes, several PBL models have been proposed and employed in the weather research and forecasting (WRF) model, of which the Yonsei University (YSU) scheme is one. To expedite weather research and prediction, we have put tremendous effort into developing an accelerated implementation of the entire WRF model using Graphics Processing Unit (GPU) massively parallel computing architecture whilst maintaining its accuracy as compared to its CPU-based implementation. This paper presents our efficient GPU-based design of the WRF YSU PBL scheme. Using one NVIDIA Tesla K40 GPU, the GPU-based YSU PBL scheme achieves a speedup of 193× with respect to its Central Processing Unit (CPU) counterpart running on one CPU core, whereas the speedup for one CPU socket (4 cores) with respect to one CPU core is only 3.5×. The speedup rises to 360× with respect to one CPU core when two K40 GPUs are used.
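The speedup figures quoted above can be turned into scaling efficiencies with simple arithmetic. The sketch below does this using only the numbers stated in the abstract; the helper name `parallel_efficiency` is hypothetical, and "efficiency" here means speedup divided by the number of processing units, which is one common convention rather than a metric the authors report.

```python
def parallel_efficiency(speedup, units):
    """Speedup divided by the number of processing units used."""
    return speedup / units

# One CPU socket: 4 cores give 3.5x over a single core.
cpu_socket_eff = parallel_efficiency(3.5, 4)

# Two GPUs give 360x and one GPU gives 193x, both relative to one CPU core,
# so the ratio is the scaling from one GPU to two.
two_gpu_scaling = 360 / 193
two_gpu_eff = parallel_efficiency(two_gpu_scaling, 2)

print(f"4-core efficiency: {cpu_socket_eff:.3f}")
print(f"1 -> 2 GPU scaling: {two_gpu_scaling:.2f}x, efficiency {two_gpu_eff:.3f}")
```

Read this way, doubling the number of GPUs yields somewhat less than a doubling in speed, which is typical once data transfer and synchronization costs are included.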


The potential of graphical processing units to solve hydraulic network equations

Journal of Hydroinformatics ◽

10.2166/hydro.2011.023 ◽

2011 ◽

Vol 14 (3) ◽

pp. 603-612 ◽

Cited By ~ 8

Author(s):

P. A. Crous ◽

J. E. van Zyl ◽

Y. Roodt

Keyword(s):

Conjugate Gradient ◽

General Purpose ◽

Gradient Algorithm ◽

Processing Unit ◽

Distribution Models ◽

Data Set ◽

Central Processing ◽

Graphical Processing Units ◽

Hydraulic Network ◽

Graphical Processing

The Engineering discipline has relied on computers to perform numerical calculations in many of its sub-disciplines over the last decades. The advent of graphical processing units (GPUs), parallel stream processors, has the potential to speed up generic simulations that facilitate engineering applications aside from traditional computer graphics applications, using GPGPU (general purpose programming on the GPU). The potential benefits of exploiting the GPU for general purpose computation require the program to be highly arithmetic intensive and also data independent. This paper looks at the specific application of the Conjugate Gradient method used in hydraulic network solvers on the GPU and compares the results to conventional central processing unit (CPU) implementations. The results indicate that the GPU becomes more efficient as the data set size increases. However, with the current hardware and the implementation of the Conjugate Gradient algorithm, the application of stream processing to hydraulic network solvers is only faster and more efficient for exceptionally large water distribution models, which are seldom found in practice.
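For reference, the Conjugate Gradient method discussed above can be sketched in a few lines for a small symmetric positive-definite system. This is a plain serial Python illustration of the standard algorithm, not the authors' GPU or CPU implementation; the helper names `dot`, `matvec`, and `conjugate_gradient` are chosen here for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, x):
    return [dot(row, x) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A
    (nested lists), starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for x = 0
    p = list(r)            # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
solution = conjugate_gradient(A, b)
```

The matrix-vector product and the dot products dominate the cost of each iteration and are data-parallel, which is why the method is a natural candidate for GPU execution once the system is large enough to amortize transfer overheads.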


CryoProtect: A Web Server for Classifying Antifreeze Proteins from Nonantifreeze Proteins

Journal of Chemistry ◽

10.1155/2017/9861752 ◽

2017 ◽

Vol 2017 ◽

pp. 1-15 ◽

Cited By ~ 18

Author(s):

Reny Pratiwi ◽

Aijaz Ahmad Malik ◽

Nalini Schaduangrat ◽

Virapong Prachayasittikul ◽

Jarl E. S. Wikberg ◽

...

Keyword(s):

Amino Acids ◽

Physicochemical Properties ◽

Predictive Model ◽

High Performance ◽

Web Server ◽

Principal Component ◽

Large Set ◽

Data Set ◽

Diverse Range ◽

Ice Binding

Antifreeze protein (AFP) is an ice-binding protein that protects organisms from freezing in extremely cold environments. AFPs are found across a diverse range of species and, therefore, differ significantly in their structures. As there are no consensus sequences available for determining the ice-binding domain of AFPs, the prediction and characterization of AFPs from their sequence is a challenging task. This study addresses this issue by predicting AFPs directly from sequence on a large set of 478 AFPs and 9,139 non-AFPs using machine learning (e.g., random forest) as a function of interpretable features (e.g., amino acid composition, dipeptide composition, and physicochemical properties). Furthermore, AFPs were characterized using propensity scores and important physicochemical properties via statistical and principal component analysis. The predictive model afforded high performance with an accuracy of 88.28%, and results revealed that AFPs are likely to be composed of hydrophobic amino acids as well as amino acids with hydroxyl and sulfhydryl side chains. The predictive model is provided as a free, publicly available web server called CryoProtect for classifying a query protein sequence as being either AFP or non-AFP. The data set and source code for reproducing the results are provided on GitHub.
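Amino acid composition, one of the interpretable features named above, is simply the fraction of each of the 20 standard residues in a sequence. The sketch below shows one straightforward way to compute it; it is an illustration only, not the CryoProtect source code, and the function name `amino_acid_composition` and the example sequence are made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(sequence):
    """Fraction of each standard amino acid in a protein sequence.
    Characters outside the 20 standard residues are ignored."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for ch in sequence.upper():
        if ch in counts:
            counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

features = amino_acid_composition("AACD")
```

The resulting 20-value vector has a fixed length regardless of sequence length, which makes it convenient as input to classifiers such as a random forest.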
