Cache Locality-Centric Parallel String Matching on Many-Core Accelerator Chips

2015, Vol. 2015, pp. 1-20
Author(s):  
Nhat-Phuong Tran ◽  
Myungho Lee ◽  
Dong Hoon Choi

The Aho-Corasick (AC) algorithm is a multiple-pattern string matching algorithm commonly used in computer and network security and in bioinformatics, among many other fields. In order to meet the highly demanding computational requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high-performance parallelization of AC on many-core accelerator chips such as the Graphics Processing Unit (GPU) from Nvidia and the Intel Xeon Phi. Our parallelization approach significantly improves the cache locality of AC by partitioning a given set of string patterns into multiple smaller pattern sets in a space-efficient way. Using the multiple pattern sets, intensive pattern matching operations are conducted concurrently against the whole input text. Compared with previous approaches, which partition the input data amongst multiple threads instead of partitioning the pattern set, our approach significantly improves performance. Experimental results show that our approach yields up to 2.73 times speedup on the Nvidia K20 GPU and 2.00 times speedup on the Intel Xeon Phi compared with the previous approach. Our parallel implementation delivers up to 693 Gbps of throughput on the K20.
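The pattern-set partitioning idea can be illustrated with a minimal plain-Python sketch (our own illustration, not the authors' GPU code; `partitioned_search` and the round-robin split are illustrative choices): each of the k smaller pattern sets gets its own, more cache-compact automaton, and every automaton scans the whole input, which is the work the paper distributes across threads.

```python
from collections import deque

def build_ac(patterns):
    # Aho-Corasick automaton: a trie plus BFS-computed failure links.
    trie = [{}]      # trie[s] maps a character to the next state
    out = [set()]    # out[s]: patterns recognized on reaching state s
    fail = [0]
    for p in patterns:
        s = 0
        for c in p:
            if c not in trie[s]:
                trie.append({})
                out.append(set())
                fail.append(0)
                trie[s][c] = len(trie) - 1
            s = trie[s][c]
        out[s].add(p)
    queue = deque(trie[0].values())
    while queue:
        s = queue.popleft()
        for c, t in trie[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in trie[f]:
                f = fail[f]
            fail[t] = trie[f].get(c, 0)
            out[t] |= out[fail[t]]
    return trie, fail, out

def ac_search(text, automaton):
    trie, fail, out = automaton
    s, hits = 0, []
    for i, c in enumerate(text):
        while s and c not in trie[s]:
            s = fail[s]
        s = trie[s].get(c, 0)
        for p in out[s]:
            hits.append((i - len(p) + 1, p))
    return hits

def partitioned_search(text, patterns, k):
    # The cache-locality trick: split the pattern set into k smaller sets,
    # build one compact automaton per set, and run every automaton over the
    # whole input (concurrently, in the paper's GPU/Xeon Phi setting).
    parts = [patterns[i::k] for i in range(k)]
    automata = [build_ac(p) for p in parts if p]
    return sorted(hit for a in automata for hit in ac_search(text, a))
```

Each smaller automaton touches far fewer states per input character, which is what improves cache behavior when many threads scan in parallel.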

Author(s):  
Timothy Dykes ◽  
Claudio Gheller ◽  
Marzia Rivi ◽  
Mel Krokos

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogeneous high-performance computing environments for increased throughput and efficiency. We focus on porting and optimizing Splotch, a scalable visualization algorithm, to the Xeon Phi, Intel's coprocessor based on the Many Integrated Core (MIC) architecture. We discuss the steps taken to offload data to the coprocessor, along with algorithmic modifications that aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi coprocessors. Finally, we compare performance against results achieved with the Graphics Processing Unit (GPU) based implementation of Splotch.


2019, Vol. 26 (4), pp. 209-221

Frequent itemset mining leads to the discovery of associations and correlations among items in large transactional databases. Apriori is a classical frequent itemset mining algorithm, which employs iterative passes over the database, combined with the generation of candidate itemsets from the frequent itemsets found in the previous iteration and the pruning of clearly infrequent itemsets. The Dynamic Itemset Counting (DIC) algorithm is a variation of Apriori that tries to reduce the number of passes made over a transactional database while keeping the number of itemsets counted in a pass relatively low. In this paper, we address the problem of accelerating DIC on the Intel Xeon Phi many-core system for the case when the transactional database fits in main memory. The Intel Xeon Phi provides a large number of small compute cores with vector processing units. The paper presents a parallel implementation of DIC based on OpenMP technology and thread-level parallelism. We exploit a bit-based internal layout for transactions and itemsets. This technique reduces the memory space needed to store the transactional database, simplifies the support count via logical bitwise operations, and allows that step to be vectorized. Experimental evaluation on Intel Xeon CPU and Intel Xeon Phi coprocessor platforms with large synthetic and real databases shows good performance and scalability of the proposed algorithm.
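The bit-based layout can be sketched in a few lines of Python (a simplified illustration, not the paper's OpenMP code): each item becomes a bitmap over transactions, and a support count reduces to a bitwise AND followed by a popcount, which is the operation the paper vectorizes.

```python
def build_bitmaps(transactions, n_items):
    # Column-oriented bit layout: bitmaps[item] has bit t set iff
    # transaction t contains the item.
    bitmaps = [0] * n_items
    for t, tx in enumerate(transactions):
        for item in tx:
            bitmaps[item] |= 1 << t
    return bitmaps

def support(itemset, bitmaps):
    # Support count of an itemset = popcount of the bitwise AND of its
    # members' bitmaps; AND and popcount map naturally onto vector units.
    acc = bitmaps[itemset[0]]
    for item in itemset[1:]:
        acc &= bitmaps[item]
    return bin(acc).count("1")
```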


2014, Vol. 17 (1)
Author(s):  
Myriam Kurtz ◽  
Francisco J. Esteban ◽  
Pilar Hernández ◽  
Juan Antonio Caballero ◽  
Antonio Guevara ◽  
...  

We compare the performance of the many-core Tile64 and the multi-core x86 Xeon architectures on bioinformatics workloads. For benchmarking we used MC64-NW/SW, a pairwise alignment algorithm we previously developed to align nucleic acid (DNA and RNA) and peptide (protein) sequences; it is an enhanced, parallel implementation of the Needleman-Wunsch and Smith-Waterman algorithms. We ported MC64-NW/SW (originally developed for the Tile64 processor) to the x86 architecture (Intel Xeon Quad Core and Intel i7 Quad Core processors) with excellent results. Hence, the evolution of x86-based architectures towards coprocessors like the Xeon Phi should bring significant performance improvements for bioinformatics.
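For reference, the Needleman-Wunsch recurrence that MC64-NW/SW parallelizes can be sketched as follows (a minimal sequential Python version with illustrative scoring parameters, not the Tile64 code):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    # Needleman-Wunsch global alignment score via dynamic programming:
    # dp[i][j] is the best score aligning a[:i] against b[:j].
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap          # align a[:i] against gaps
    for j in range(1, cols):
        dp[0][j] = j * gap          # align b[:j] against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag,                 # substitution/match
                           dp[i - 1][j] + gap,   # gap in b
                           dp[i][j - 1] + gap)   # gap in a
    return dp[-1][-1]
```

The anti-diagonals of the dp matrix are independent, which is the parallelism such implementations exploit across cores.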


Author(s):  
E.V. Ivanova ◽  
L.B. Sokolinsky

A database coprocessor for high-performance cluster computing systems with many-core accelerators is described. The coprocessor uses distributed columnar indexes with interval fragmentation. Its operation is considered using the example of natural join processing. The parallel decomposition of the natural join operator is performed using the distributed columnar indexes. The proposed approach allows one to execute relational operators on computing clusters without massive data exchange. The results of computational experiments on Intel Xeon Phi coprocessors confirm the efficiency of the developed methods and algorithms.
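The interval-fragmentation idea can be sketched as follows (a simplified single-process Python illustration of the distributed design; the function names and fragment-boundary scheme are ours): rows are routed to fragments by value interval, so fragment k of R joins only against fragment k of S and no cross-fragment data exchange is needed.

```python
from bisect import bisect_right

def fragment_index(column, bounds):
    # Columnar index (value, row_id) fragmented by value intervals:
    # fragment k holds values v with bounds[k-1] <= v < bounds[k].
    frags = [[] for _ in range(len(bounds) + 1)]
    for row_id, v in enumerate(column):
        frags[bisect_right(bounds, v)].append((v, row_id))
    return frags

def natural_join(col_r, col_s, bounds):
    # Matching fragments join independently (one per node in the real
    # system), with no exchange across fragments.
    out = []
    for fr, fs in zip(fragment_index(col_r, bounds),
                      fragment_index(col_s, bounds)):
        lookup = {}
        for v, i in fr:
            lookup.setdefault(v, []).append(i)
        for v, j in fs:
            for i in lookup.get(v, []):
                out.append((i, j, v))   # (row in R, row in S, join value)
    return out
```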


10.29007/qm7h, 2019
Author(s):  
Shane Sawyer ◽  
Mitchel Horton ◽  
Chad Burdyshaw ◽  
Glenn Brook ◽  
Bhanu Rekapalli

The near-exponential growth in sequence data available to bioinformaticists, and the emergence of new fields of biological research, continue to fuel an incessant need for increases in sequence alignment performance. Today, more than ever before, bioinformatics researchers have access to a wide variety of HPC architectures, including high-core-count Intel Xeon processors and the many-core Intel Xeon Phi. In this work, we present the implementation of a distributed, NCBI-compliant BLAST+ (C++ toolkit) code targeted at multi- and many-core clusters, such as those containing the Intel Xeon Phi line of products. The solution is robust: distributed BLAST runs can use the CPU only, the Xeon Phi processor or coprocessor, or both, by utilizing the CPU or Xeon Phi processor plus a Xeon Phi coprocessor. The distributed implementation employs static load balancing, fault tolerance, and contention-aware I/O. Our implementation, HPC-BLAST, maintains greater than 90% weak-scaling efficiency on up to 160 Xeon Phi (Knights Landing) nodes. The source code and instructions are available under the Apache License, Version 2.0 at https://github.com/UTennessee-JICS/HPC-BLAST.
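The abstract does not detail HPC-BLAST's static load-balancing scheme, but a common approach it may resemble is a greedy longest-processing-time assignment of queries to ranks, sketched here in Python (purely illustrative; balancing total residues rather than query counts):

```python
def static_partition(query_lengths, n_ranks):
    # Greedy LPT heuristic: sort queries by length descending and assign
    # each to the currently lightest rank, fixing the schedule up front
    # (static: no work stealing at run time).
    loads = [0] * n_ranks
    assign = [[] for _ in range(n_ranks)]
    for qid, length in sorted(enumerate(query_lengths), key=lambda x: -x[1]):
        r = loads.index(min(loads))
        assign[r].append(qid)
        loads[r] += length
    return assign, loads
```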


2018, Vol. 175, pp. 02009
Author(s):  
Carleton DeTar ◽  
Steven Gottlieb ◽  
Ruizi Li ◽  
Doug Toussaint

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider the performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and the gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
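The staggered conjugate gradient solver mentioned above is an instance of the standard CG iteration for a symmetric positive-definite linear system; a minimal dense Python sketch (purely illustrative, nothing like the optimized QPhiX/QUDA kernels, which work on sparse lattice operators) looks like this:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    # Solve A x = b for symmetric positive-definite A (dense list-of-lists)
    # with the conjugate gradient iteration.
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x, with x = 0
    p = r[:]                      # initial search direction
    rs = sum(v * v for v in r)
    if rs == 0:
        return x
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:          # converged: squared residual small enough
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

In lattice QCD codes the matrix-vector product `Ap` is replaced by an application of the (sparse) staggered Dirac operator, which dominates the run time and is the target of the vectorization work.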


2018, Vol. 11 (11), pp. 4621-4635
Author(s):  
Istvan Z. Reguly ◽  
Daniel Giles ◽  
Devaraj Gopinathan ◽  
Laure Quivy ◽  
Joakim H. Beck ◽  
...  

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and its implementation: a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, enabling easy maintainability. The code has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. Its scalability is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 demonstrates an ability to deliver productivity as well as performance and portability to its users across a number of platforms.


2015, Vol. 2015, pp. 1-14
Author(s):  
Xinmin Tian ◽  
Hideki Saito ◽  
Serguei V. Preis ◽  
Eric N. Garcia ◽  
Sergey S. Kozhukhov ◽  
...  

Efficiently exploiting SIMD vector units is one of the most important aspects of achieving high performance in application code running on Intel Xeon Phi coprocessors. In this paper, we present several effective SIMD vectorization techniques, such as less-than-full-vector loop vectorization, Intel MIC-specific alignment optimization, and 2D vectorization of small-matrix transpose/multiplication, implemented in the Intel C/C++ and Fortran production compilers for Intel Xeon Phi coprocessors. A set of workloads from several application domains is employed to study the performance of our SIMD vectorization techniques. The performance results show gains of up to 12.5x on the Intel Xeon Phi coprocessor. We also demonstrate a 2000x speedup from the seamless integration of SIMD vectorization and parallelization.
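The "less-than-full-vector" technique can be illustrated with a pure-Python emulation (the real transformation is performed by the compiler on SIMD registers; this sketch only mirrors its control structure): the trip count is split into full W-wide chunks, and the tail is handled by one masked vector iteration instead of a scalar remainder loop.

```python
def saxpy_vectorized(a, x, y, W=8):
    # Emulates compiler-style loop vectorization of out = a*x + y:
    # a main loop over full W-wide chunks, then ONE masked
    # less-than-full-vector iteration for the remaining n % W elements.
    n = len(x)
    out = [0.0] * n
    main = (n // W) * W
    for i in range(0, main, W):   # full-width "vector" iterations
        out[i:i + W] = [a * u + v for u, v in zip(x[i:i + W], y[i:i + W])]
    if main < n:                  # masked remainder iteration
        rem = n - main
        mask = [lane < rem for lane in range(W)]
        xv = list(x[main:]) + [0.0] * (W - rem)   # padded to full width
        yv = list(y[main:]) + [0.0] * (W - rem)
        res = [a * u + v for u, v in zip(xv, yv)]
        out[main:] = [r for r, m in zip(res, mask) if m]
    return out
```

On real hardware the mask keeps the padded lanes from being stored, so the tail costs one vector iteration regardless of the remainder length.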

