Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments

PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3486 ◽  
Author(s):  
Won Cheol Yim ◽  
John C. Cushman

Bioinformatics is currently faced with very large-scale data sets that lead to computational jobs, especially sequence similarity searches, that can take absurdly long times to run. For example, the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST and BLAST+) suite, which is by far the most widely used tool for rapid similarity searching among nucleic acid or amino acid sequences, is highly central processing unit (CPU) intensive. While the BLAST suite of programs performs searches very rapidly, it has the potential to be accelerated. In recent years, distributed computing environments have become more widely accessible and used due to the increasing availability of high-performance computing (HPC) systems. Therefore, simple solutions for data parallelization are needed to expedite BLAST and other sequence analysis tools. However, existing software for parallel sequence similarity searches often requires extensive computational experience and skill on the part of the user. To accelerate BLAST and other sequence analysis tools, Divide and Conquer BLAST (DCBLAST) was developed to perform NCBI BLAST searches within a cluster, grid, or HPC environment by using a query sequence distribution approach. Scaling from 1 to 256 CPU cores resulted in significant improvements in processing speed. Thus, DCBLAST dramatically accelerates the execution of BLAST searches using a simple, accessible, robust, and parallel approach. DCBLAST works across multiple nodes automatically, and it overcomes the speed limitation of single-node BLAST programs. DCBLAST can be used on any HPC system, can take advantage of hundreds of nodes, and has no output limitations. This freely available tool simplifies distributed computation pipelines to facilitate the rapid discovery of sequence similarities between very large data sets.
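The query sequence distribution approach mentioned in the abstract above can be illustrated with a minimal Python sketch. It is not DCBLAST's actual code: the function names `parse_fasta` and `split_queries` are hypothetical, and the greedy length-balancing rule is an assumption, since the abstract does not say how queries are divided.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_queries(records: List[Record], n_chunks: int) -> List[List[Record]]:
    """Assign records to n_chunks bins, balancing total sequence length.

    Assumption: longest-first greedy assignment to the lightest bin.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    bins: List[List[Record]] = [[] for _ in range(n_chunks)]
    loads = [0] * n_chunks
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = loads.index(min(loads))  # currently lightest bin
        bins[i].append(rec)
        loads[i] += len(rec[1])
    return bins
```

In a pipeline of this kind, each chunk would typically be written to its own query file, searched as an independent job against the same database, and the per-chunk outputs concatenated afterwards.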

2010 ◽  
Vol 38 ◽  
pp. 1-48 ◽  
Author(s):  
S. Katrenko ◽  
P. W. Adriaans ◽  
M. Van Someren

This paper discusses the problem of marrying structural similarity with semantic relatedness for Information Extraction from text. Aiming at accurate recognition of relations, we introduce local alignment kernels and explore various possibilities of using them for this task. We give a definition of a local alignment (LA) kernel based on the Smith-Waterman score as a sequence similarity measure and proceed with a range of possibilities for computing similarity between elements of sequences. We show how distributional similarity measures obtained from unlabeled data can be incorporated into the learning task as semantic knowledge. Our experiments suggest that the LA kernel yields promising results on various biomedical corpora, outperforming two baselines by a large margin. An additional series of experiments was conducted on the data sets of seven general relation types, where the performance of the LA kernel is comparable to the current state-of-the-art results.
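The Smith-Waterman score that the abstract above names as its sequence similarity measure can be sketched as follows. This is only the underlying local alignment score, not the LA kernel itself. The linear gap penalty, the default scores, and the optional `sim` callback (a hypothetical hook where an element-level similarity such as a distributional measure could be plugged in) are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def smith_waterman_score(
    a: Sequence,
    b: Sequence,
    match: float = 2,
    mismatch: float = -1,
    gap: float = -1,
    sim: Optional[Callable[[object, object], float]] = None,
) -> float:
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the DP table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if sim is not None:
                s = sim(x, y)
            else:
                s = match if x == y else mismatch
            cell = max(
                0,                 # start a new local alignment
                prev[j - 1] + s,   # align x with y
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Because the function accepts any sequence, it works on token lists as well as strings, for example `smith_waterman_score(["the", "protein", "binds"], ["protein", "binds", "dna"])`.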


Author(s):  
Tizian Schulz ◽  
Roland Wittler ◽  
Sven Rahmann ◽  
Faraz Hach ◽  
Jens Stoye

Motivation: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. Results: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. Availability: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. Supplementary information: Supplementary data are available at Bioinformatics online.
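The colored de Bruijn graph representation named in the abstract above can be illustrated with a toy Python sketch. It is not the authors' implementation: it is uncompacted, ignores reverse complements, and the function names `colored_kmers` and `successors` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def colored_kmers(genomes: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map every k-mer to the set of genome names (colors) containing it."""
    colors: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


def successors(kmer: str, colors: Dict[str, Set[str]]) -> List[str]:
    """Return k-mers in the graph that overlap kmer by its last k-1 bases."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in colors]
```

With two hypothetical genomes `{"g1": "ACGTA", "g2": "ACGTC"}` and `k = 3`, the shared k-mers `ACG` and `CGT` carry both colors, and the graph branches after `CGT` into `GTA` (only `g1`) and `GTC` (only `g2`). A compacted graph would merge the non-branching path `ACG`, `CGT` into a single node.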


2021 ◽  
Vol 8 (2) ◽  
pp. 169-180
Author(s):  
Mark Lin ◽  
Periklis Papadopoulos

Computational methods such as Computational Fluid Dynamics (CFD) traditionally yield a single output: a single number, much like the result of a theoretical hand calculation. However, this paper shows that computational methods have inherent uncertainty, which can also be reported statistically. In numerical computation, because many factors affect the data collected, the data can be quoted in terms of standard deviations (error bars) along with a mean value to make data comparison meaningful. In cases where two data sets are obscured by uncertainty, the two data sets are said to be indistinguishable. A sample CFD problem pertaining to external aerodynamics is copied and run on 29 identical computers in a university computer lab. The expectation is that all 29 runs should return exactly the same result; unfortunately, in a few cases the result turns out to be different. This is attributed to the parallelization scheme, which partitions the mesh to run in parallel on multiple cores of the computer. The distribution of the computational load is hardware-driven, depending on the available resources of each computer at the time. Details such as load-balancing among multiple Central Processing Unit (CPU) cores using the Message Passing Interface (MPI) are transparent to the user. A software algorithm such as METIS or JOSTLE is used to automatically divide the load between different processors. As such, the user has no control over the outcome of the CFD calculation even when the same problem is computed. Because of this, numerical uncertainty arises from parallel (multicore) computing. One way to resolve this issue is to compute problems using a single core, without mesh repartitioning. However, as this paper demonstrates, even this is not straightforward. Keywords: numerical uncertainty, parallelization, load-balancing, automotive aerodynamics
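Reporting repeated runs as a mean with a standard deviation, as the abstract above proposes, can be sketched in Python. The overlap rule used in `indistinguishable` is an assumed reading of "obscured by uncertainty", and the function names and the `n_sigma` parameter are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of repeated run outputs."""
    return mean(samples), stdev(samples)


def indistinguishable(
    a: Sequence[float], b: Sequence[float], n_sigma: float = 1.0
) -> bool:
    """True when the error bars of the two data sets overlap.

    Assumption: error bars are mean +/- n_sigma * standard deviation.
    """
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return abs(mean_a - mean_b) <= n_sigma * (sd_a + sd_b)
```

For example, two sets of made-up outputs such as `[0.300, 0.302, 0.298]` and `[0.301, 0.303, 0.299]` differ in mean by less than their combined error bars, so the sketch treats them as indistinguishable.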


2005 ◽  
Vol 14 (05) ◽  
pp. 811-826 ◽  
Author(s):  
OZGUR OZTURK ◽  
HAKAN FERHATOSMANOGLU

We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine properties of these functions. We experimentally compared their (a) approximation quality for k-Nearest Neighbor (k-NN) queries and both (b) pruning ability and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e. Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of transformed vectors and distance functions. Promising results from the experiments on real biosequence data sets are presented.
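The idea of transforming subsequences into numerical vectors and comparing them in the transformed domain can be sketched for 2-grams. This is not the paper's FD2 or WD2 definition, which the abstract does not give: the plain L1 distance and the names `two_gram_vector` and `l1_distance` are assumptions for illustration.

```python
from itertools import product
from typing import List


def two_gram_vector(seq: str, alphabet: str = "ACGT") -> List[int]:
    """Count occurrences of every 2-gram over the alphabet in seq."""
    index = {a + b: i for i, (a, b) in enumerate(product(alphabet, repeat=2))}
    vec = [0] * len(index)
    for i in range(len(seq) - 1):
        gram = seq[i:i + 2]
        if gram in index:  # skip 2-grams with symbols outside the alphabet
            vec[index[gram]] += 1
    return vec


def l1_distance(u: List[int], v: List[int]) -> int:
    """Sum of absolute differences between two count vectors."""
    return sum(abs(x - y) for x, y in zip(u, v))
```

Fixed-length vectors like these (16 dimensions for DNA) are the kind of object that a multi-dimensional index such as an R-tree can store and search.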


2020 ◽  
Author(s):  
Tizian Schulz ◽  
Roland Wittler ◽  
Sven Rahmann ◽  
Faraz Hach ◽  
Jens Stoye

Motivation: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. Results: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome.


2020 ◽  
Author(s):  
Roudati jannah

Computer hardware (Indonesian: perangkat keras komputer) is the part of a computer system that can be touched and seen physically and that acts to carry out instructions from software. Hardware plays a role in the overall performance of a computer system. In principle, a computer system always has input hardware (input device system), processing hardware (processing/central processing unit), output hardware (output device system), optional additional devices (peripherals), and data storage (storage device system/external memory).


2020 ◽  
Author(s):  
Ika Milia wahyunu Siregar

IT is developing very rapidly around the world, from software to hardware. Technology now dominates most of the planet. Because technology develops so quickly, we as users can fall behind on new technology if we do not keep our knowledge up to date. That can leave us easily tempted and deceived by technology advertisements without considering their downsides. As computer users, we should know about the components of a computer. A computer is a set of electronic machines consisting of millions of components that work together and form a neat and precise working system. This system is then used to carry out work automatically, based on the instructions (program) given to it. The term computer hardware refers to objects that can physically be held, moved, and seen. The Central Processing System/Central Processing Unit (CPU) is a type of hardware that serves as the place where data is processed; it can also be described as the brain of all processing activities such as calculating, sorting, searching, writing, reading, and so on.


2020 ◽  
Author(s):  
Intan khadijah simatupang

A computer is a set of electronic machines consisting of millions of components that work together and form a neat and precise working system. This system is then used to carry out work automatically, based on the instructions (program) given to it. The term computer hardware refers to objects that can physically be held, moved, and seen. Computer software is a collection of instructions (programs/procedures) for carrying out work automatically by processing the collection of instructions (data) given. In principle, a computer system always has input hardware (input device system), processing hardware (processing/central processing unit), output hardware (output device system), optional additional devices (peripherals), and data storage (storage device system/external memory).


2020 ◽  
Author(s):  
Siti Kumala Dewi

Computer hardware (Indonesian: perangkat keras komputer) is the part of a computer system that can be touched and seen physically and that acts to carry out instructions from software. Hardware plays a role in the overall performance of a computer system. By function, hardware is divided into: (1) the input hardware system (Input Device System); (2) the processing system (Central Processing System/Central Processing Unit (CPU)); (3) the output hardware system (Output Device System); and (4) the additional hardware system (Peripheral/Accessories Device System).

