Cache Locality-Centric Parallel String Matching on Many-Core Accelerator Chips

2015, Vol. 2015, pp. 1-20
Author(s):  
Nhat-Phuong Tran ◽  
Myungho Lee ◽  
Dong Hoon Choi

The Aho-Corasick (AC) algorithm is a multiple-pattern string matching algorithm commonly used in computer and network security and in bioinformatics, among many other fields. In order to meet the highly demanding computational requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high-performance parallelization of AC on many-core accelerator chips such as the Graphics Processing Unit (GPU) from Nvidia and the Intel Xeon Phi. Our parallelization approach significantly improves the cache locality of AC by partitioning a given set of string patterns into multiple smaller pattern sets in a space-efficient way. Using the multiple pattern sets, intensive pattern matching operations are conducted concurrently against the whole input text. Compared with previous approaches, which partition the input data amongst multiple threads instead of partitioning the pattern set, our approach significantly improves performance. Experimental results show that our approach yields up to 2.73 times speedup on the Nvidia K20 GPU and 2.00 times speedup on the Intel Xeon Phi compared with the previous approach. Our parallel implementation delivers up to 693 Gbps of throughput on the K20.
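The pattern-set partitioning idea can be illustrated with a minimal plain-Python sketch (our own illustration, not the authors' GPU code; `partitioned_search` and the round-robin split are illustrative choices): each of the k smaller pattern sets gets its own, more cache-compact automaton, and every automaton scans the whole input, which is the work the paper distributes across threads.

```python
from collections import deque

def build_ac(patterns):
    # Aho-Corasick automaton: a trie plus BFS-computed failure links.
    trie = [{}]      # trie[s] maps a character to the next state
    out = [set()]    # out[s]: patterns recognized on reaching state s
    fail = [0]
    for p in patterns:
        s = 0
        for c in p:
            if c not in trie[s]:
                trie.append({})
                out.append(set())
                fail.append(0)
                trie[s][c] = len(trie) - 1
            s = trie[s][c]
        out[s].add(p)
    queue = deque(trie[0].values())
    while queue:
        s = queue.popleft()
        for c, t in trie[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in trie[f]:
                f = fail[f]
            fail[t] = trie[f].get(c, 0)
            out[t] |= out[fail[t]]
    return trie, fail, out

def ac_search(text, automaton):
    trie, fail, out = automaton
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in trie[s]:
            s = fail[s]
        s = trie[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits

def partitioned_search(text, patterns, k):
    # The cache-locality trick: split the pattern set into k smaller sets,
    # build one compact automaton per set, and run every automaton over the
    # whole input (concurrently, in the paper's GPU/Xeon Phi setting).
    parts = [patterns[i::k] for i in range(k)]
    automata = [build_ac(p) for p in parts if p]
    return sorted(hit for a in automata for hit in ac_search(text, a))
```

Each smaller automaton touches far fewer states per input character, which is what improves cache behavior when many threads scan in parallel.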

Author(s):  
Timothy Dykes ◽  
Claudio Gheller ◽  
Marzia Rivi ◽  
Mel Krokos

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogeneous high-performance computing environments for increased throughput and efficiency. We focus on porting and optimizing Splotch, a scalable visualization algorithm, to the Xeon Phi, Intel's coprocessor based on the Many Integrated Core (MIC) architecture. We discuss the steps taken to offload data to the coprocessor, along with algorithmic modifications that aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi coprocessors. Finally, we compare performance against results achieved with the Graphics Processing Unit (GPU) based implementation of Splotch.


2019, Vol. 26 (4), pp. 209-221

Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional databases. Apriori is a classical frequent itemset mining algorithm, which employs iterative passes over the database, combined with the generation of candidate itemsets from the frequent itemsets found in the previous iteration and the pruning of clearly infrequent itemsets. The Dynamic Itemset Counting (DIC) algorithm is a variation of Apriori that tries to reduce the number of passes made over a transactional database while keeping the number of itemsets counted in a pass relatively low. In this paper, we address the problem of accelerating DIC on the Intel Xeon Phi many-core system for the case when the transactional database fits in main memory. The Intel Xeon Phi provides a large number of small compute cores with vector processing units. The paper presents a parallel implementation of DIC based on OpenMP technology and thread-level parallelism. We exploit a bit-based internal layout for transactions and itemsets. This technique reduces the memory space needed to store the transactional database, simplifies the support count via logical bitwise operations, and allows that step to be vectorized. Experimental evaluation on Intel Xeon CPU and Intel Xeon Phi coprocessor platforms with large synthetic and real databases shows good performance and scalability of the proposed algorithm.
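The bit-based layout can be sketched in a few lines of Python (a simplified illustration, not the paper's OpenMP code): each item becomes a bitmap over transactions, and a support count reduces to a bitwise AND followed by a popcount, which is the operation the paper vectorizes.

```python
def build_bitmaps(transactions, n_items):
    # Column-oriented bit layout: bitmaps[item] has bit t set iff
    # transaction t contains the item.
    bitmaps = [0] * n_items
    for t, tx in enumerate(transactions):
        for item in tx:
            bitmaps[item] |= 1 << t
    return bitmaps

def support(itemset, bitmaps):
    # Support count of an itemset = popcount of the bitwise AND of its
    # members' bitmaps; AND and popcount map naturally onto vector units.
    acc = bitmaps[itemset[0]]
    for item in itemset[1:]:
        acc &= bitmaps[item]
    return bin(acc).count("1")
```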


2014, Vol. 17 (1)
Author(s):  
Myriam Kurtz ◽  
Francisco J. Esteban ◽  
Pilar Hernández ◽  
Juan Antonio Caballero ◽  
Antonio Guevara ◽  
...  

We compare the performance of the many-core Tile64 and the multi-core x86 Xeon architectures on bioinformatics workloads. For benchmarking we used MC64-NW/SW, a pairwise alignment algorithm we previously developed to align nucleic acid (DNA and RNA) and peptide (protein) sequences; it is an enhanced, parallel implementation of the Needleman-Wunsch and Smith-Waterman algorithms. We ported MC64-NW/SW (originally developed for the Tile64 processor) to the x86 architecture (Intel Xeon Quad Core and Intel i7 Quad Core processors) with excellent results. Hence, the evolution of x86-based architectures towards coprocessors like the Xeon Phi should bring significant performance improvements for bioinformatics.
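For reference, the Needleman-Wunsch recurrence that MC64-NW/SW parallelizes can be sketched as follows (a minimal sequential Python version with illustrative scoring parameters, not the Tile64 code):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    # Needleman-Wunsch global alignment score via dynamic programming:
    # dp[i][j] is the best score aligning a[:i] against b[:j].
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap          # align a[:i] against gaps
    for j in range(1, cols):
        dp[0][j] = j * gap          # align b[:j] against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,                 # substitution/match
                           dp[i - 1][j] + gap,   # gap in b
                           dp[i][j - 1] + gap)   # gap in a
    return dp[-1][-1]
```

The anti-diagonals of the dp matrix are independent, which is the parallelism such implementations exploit across cores.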


Author(s):  
E.V. Ivanova ◽  
L.B. Sokolinsky

A database coprocessor for high-performance cluster computing systems with many-core accelerators is described. The coprocessor uses distributed columnar indexes with interval fragmentation. Its operation is considered using the example of natural join processing. The parallel decomposition of the natural join operator is performed using the distributed columnar indexes. The proposed approach allows one to execute relational operators on computing clusters without massive data exchange. The results of computational experiments on Intel Xeon Phi coprocessors confirm the efficiency of the developed methods and algorithms.
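The interval-fragmentation idea can be sketched as follows (a simplified single-process Python illustration of the distributed design; the function names and fragment-boundary scheme are ours): rows are routed to fragments by value interval, so fragment k of R joins only against fragment k of S and no cross-fragment data exchange is needed.

```python
from bisect import bisect_right

def fragment_index(column, bounds):
    # Columnar index (value, row_id) fragmented by value intervals:
    # fragment k holds values v with bounds[k-1] <= v < bounds[k].
    frags = [[] for _ in range(len(bounds) + 1)]
    for row_id, v in enumerate(column):
        frags[bisect_right(bounds, v)].append((v, row_id))
    return frags

def natural_join(col_r, col_s, bounds):
    # Matching fragments join independently (one per node in the real
    # system), with no exchange across fragments.
    out = []
    for fr, fs in zip(fragment_index(col_r, bounds),
                      fragment_index(col_s, bounds)):
        lookup = {}
        for v, i in fr:
            lookup.setdefault(v, []).append(i)
        for v, j in fs:
            for i in lookup.get(v, []):
                out.append((i, j, v))   # (row in R, row in S, join value)
    return out
```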


10.29007/qm7h, 2019
Author(s):  
Shane Sawyer ◽  
Mitchel Horton ◽  
Chad Burdyshaw ◽  
Glenn Brook ◽  
Bhanu Rekapalli

The near-exponential growth in sequence data available to bioinformaticists, and the emergence of new fields of biological research, continue to fuel an incessant need for increases in sequence alignment performance. Today, more than ever before, bioinformatics researchers have access to a wide variety of HPC architectures, including high-core-count Intel Xeon processors and the many-core Intel Xeon Phi. In this work, we present the implementation of a distributed, NCBI-compliant BLAST+ (C++ toolkit) code targeted at multi- and many-core clusters, such as those containing the Intel Xeon Phi line of products. The solution is robust: distributed BLAST runs can use the CPU only, the Xeon Phi processor or coprocessor, or both, by utilizing the CPU or Xeon Phi processor plus a Xeon Phi coprocessor. The distributed implementation employs static load balancing, fault tolerance, and contention-aware I/O. Our implementation, HPC-BLAST, maintains greater than 90% weak-scaling efficiency on up to 160 Xeon Phi (Knights Landing) nodes. The source code and instructions are available under the Apache License, Version 2.0 at https://github.com/UTennessee-JICS/HPC-BLAST.
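The abstract does not detail HPC-BLAST's static load-balancing scheme, but a common approach it may resemble is a greedy longest-processing-time assignment of queries to ranks, sketched here in Python (purely illustrative; balancing total residues rather than query counts):

```python
def static_partition(query_lengths, n_ranks):
    # Greedy LPT heuristic: sort queries by length descending and assign
    # each to the currently lightest rank, fixing the schedule up front
    # (static: no work stealing at run time).
    loads = [0] * n_ranks
    assign = [[] for _ in range(n_ranks)]
    for qid, length in sorted(enumerate(query_lengths), key=lambda x: -x[1]):
        r = loads.index(min(loads))
        assign[r].append(qid)
        loads[r] += length
    return assign, loads
```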


2018, Vol. 175, pp. 02009
Author(s):  
Carleton DeTar ◽  
Steven Gottlieb ◽  
Ruizi Li ◽  
Doug Toussaint

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider the performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and the gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
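The staggered conjugate gradient solver mentioned above is an instance of the standard CG iteration for a symmetric positive-definite linear system; a minimal dense Python sketch (purely illustrative, nothing like the optimized QPhiX/QUDA kernels, which work on sparse lattice operators) looks like this:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    # Solve A x = b for symmetric positive-definite A (dense list-of-lists)
    # with the conjugate gradient iteration.
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x, with x = 0
    p = r[:]                      # initial search direction
    rs = sum(v * v for v in r)
    if rs == 0:
        return x
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:          # converged: squared residual small enough
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

In lattice QCD codes the matrix-vector product `Ap` is replaced by an application of the (sparse) staggered Dirac operator, which dominates the run time and is the target of the vectorization work.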


2018, Vol. 11 (11), pp. 4621-4635
Author(s):  
Istvan Z. Reguly ◽  
Daniel Giles ◽  
Devaraj Gopinathan ◽  
Laure Quivy ◽  
Joakim H. Beck ◽  
...  

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and its implementation: a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, enabling easy maintainability. The code has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. Its scalability is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 demonstrates an ability to deliver productivity as well as performance and portability to its users across a number of platforms.


2015, Vol. 2015, pp. 1-14
Author(s):  
Xinmin Tian ◽  
Hideki Saito ◽  
Serguei V. Preis ◽  
Eric N. Garcia ◽  
Sergey S. Kozhukhov ◽  
...  

Efficiently exploiting SIMD vector units is one of the most important aspects of achieving high performance in application code running on Intel Xeon Phi coprocessors. In this paper, we present several effective SIMD vectorization techniques, such as less-than-full-vector loop vectorization, Intel MIC-specific alignment optimization, and 2D vectorization of small-matrix transpose/multiplication, implemented in the Intel C/C++ and Fortran production compilers for Intel Xeon Phi coprocessors. A set of workloads from several application domains is employed to study the performance of our SIMD vectorization techniques. The performance results show gains of up to 12.5x on the Intel Xeon Phi coprocessor. We also demonstrate a 2000x speedup from the seamless integration of SIMD vectorization and parallelization.
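The "less-than-full-vector" technique can be illustrated with a pure-Python emulation (the real transformation is performed by the compiler on SIMD registers; this sketch only mirrors its control structure): the trip count is split into full W-wide chunks, and the tail is handled by one masked vector iteration instead of a scalar remainder loop.

```python
def saxpy_vectorized(a, x, y, W=8):
    # Emulates compiler-style loop vectorization of out = a*x + y:
    # a main loop over full W-wide chunks, then ONE masked
    # less-than-full-vector iteration for the remaining n % W elements.
    n = len(x)
    out = [0.0] * n
    main = (n // W) * W
    for i in range(0, main, W):   # full-width "vector" iterations
        out[i:i + W] = [a * u + v for u, v in zip(x[i:i + W], y[i:i + W])]
    if main < n:                  # masked remainder iteration
        rem = n - main
        mask = [lane < rem for lane in range(W)]
        xv = list(x[main:]) + [0.0] * (W - rem)   # padded to full width
        yv = list(y[main:]) + [0.0] * (W - rem)
        res = [a * u + v for u, v in zip(xv, yv)]
        out[main:] = [r for r, m in zip(res, mask) if m]
    return out
```

On real hardware the mask keeps the padded lanes from being stored, so the tail costs one vector iteration regardless of the remainder length.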

