Incremental BLAST: incremental addition of new sequence databases through e-value correction

Mapping Intimacies ◽

10.1101/476218 ◽

2018 ◽

Cited By ~ 1

Author(s):

Sajal Dash ◽

Sarthok Rahman ◽

Heather M. Hines ◽

Wu-chun Feng

Keyword(s):

Sequence Similarity ◽

Computational Cost ◽

Supplementary Information ◽

Local Alignment ◽

Arbitrary Sequence ◽

Search Results ◽

Blast Search ◽

Link Type ◽

Ncbi Blast ◽

Incremental Addition

Abstract. Motivation: Search results from local alignment search tools use statistical parameters that are sensitive to the size of the database. NCBI BLAST, for example, reports important matches using similarity scores and expect values (e-values) calculated against the database size. Over the course of an investigation, the database grows and the best matches may change. To update the results of a sequence similarity search and find the best hits, bioinformaticians must rerun the BLAST search against the entire database; the time, money, and computational resources spent on the earlier search cannot be recovered. Results: We develop an efficient way to redeem spent BLAST search effort by introducing Incremental BLAST. This tool reuses previous BLAST search results: it conducts new searches on only the incremental part of the database, recomputes statistical metrics such as e-values, and combines the two sets of results to produce updated results. We develop statistics for correcting the e-values of any BLAST result against any arbitrary sequence database. The experimental results and accuracy analysis demonstrate that Incremental BLAST can provide search results identical to NCBI BLAST at a significantly reduced computational cost. We present three case studies to showcase different use cases where Incremental BLAST can make biological discovery more efficient at a reduced cost. This tool can be used to update sequence BLAST searches during the course of genomic and transcriptomic projects, such as re-annotation projects, and to conduct incremental addition of taxon-specific sequences to a BLAST database. Incremental BLAST performs (1 + δ)/δ times faster than NCBI BLAST for a fraction δ of database growth. Availability: Incremental BLAST is available at https://bitbucket.org/sajal000/[email protected] Supplementary information: Supplementary data are available at https://bitbucket.org/sajal000/incremental-blast
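
The merge step described in this abstract can be illustrated with a minimal sketch. It assumes the textbook Karlin-Altschul relation, in which an e-value scales linearly with database length, so a hit found against one part of the database can be rescaled to the combined size. The names `Hit`, `rescale_evalue`, and `merge_hits` are hypothetical and are not taken from Incremental BLAST; the tool's own correction statistics (for example, any effective-length adjustment) are not reproduced here.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Hit:
    """A single search hit (hypothetical record, not the tool's own format)."""
    subject_id: str
    bit_score: float
    evalue: float


def rescale_evalue(evalue: float, searched_len: int, combined_len: int) -> float:
    """Rescale an e-value from the searched database length to the combined one.

    Assumes E is proportional to database length (E = K*m*n*exp(-lambda*S))
    and ignores effective-length corrections.
    """
    return evalue * (combined_len / searched_len)


def merge_hits(old_hits, old_len, delta_hits, delta_len, max_hits=10):
    """Combine hits from the old database and the newly added part."""
    combined_len = old_len + delta_len
    merged = [
        replace(h, evalue=rescale_evalue(h.evalue, old_len, combined_len))
        for h in old_hits
    ]
    merged += [
        replace(h, evalue=rescale_evalue(h.evalue, delta_len, combined_len))
        for h in delta_hits
    ]
    # Smaller e-value ranks first; ties are broken by higher bit score.
    merged.sort(key=lambda h: (h.evalue, -h.bit_score))
    return merged[:max_hits]
```

Only the e-values change in this sketch; the alignment scores themselves do not depend on database size, which is why the earlier search results can be reused.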

iBLAST: Incremental BLAST of new sequences via automated e-value correction

PLoS ONE ◽

10.1371/journal.pone.0249410 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0249410

Author(s):

Sajal Dash ◽

Sarthok Rasique Rahman ◽

Heather M. Hines ◽

Wu-chun Feng

Keyword(s):

Computational Cost ◽

Genomic Research ◽

Local Alignment ◽

Search Results ◽

Blast Search ◽

Database Size ◽

Biological Discovery ◽

Computational Resources ◽

Ncbi Blast

Search results from local alignment search tools use statistical scores that are sensitive to the size of the database to report the quality of the result. For example, NCBI BLAST reports the best matches using similarity scores and expect values (i.e., e-values) calculated against the database size. Because genomics data grow rapidly, sequence databases expand continuously over the course of a genomic research investigation as new sequences are added. As a consequence, the results (e.g., best hits) and associated statistics (e.g., e-values) for a specific set of queries may change over the course of the investigation. Thus, to update the results of a previously conducted BLAST search and find the best matches in an updated database, scientists must currently rerun the BLAST search against the entire updated database, which wastes the execution time, money, and computational resources already spent. To address this issue, we devise a novel and efficient method to redeem past BLAST searches by introducing iBLAST. iBLAST leverages previous BLAST search results to conduct the same query search on only the incremental (i.e., newly added) part of the database, recomputes the associated critical statistics such as e-values, and combines these results to produce updated search results. Our experimental results and fidelity analyses show that iBLAST delivers search results that are identical to NCBI BLAST at a substantially reduced computational cost, i.e., iBLAST performs (1 + δ)/δ times faster than NCBI BLAST, where δ represents the fraction of database growth. We then present three different use cases to demonstrate that iBLAST can enable efficient biological discovery at a much faster speed with a substantially reduced computational cost.
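
The speedup factor quoted in this abstract can be recovered with a short worked calculation. This is a sketch under two assumptions not stated in the abstract: search time is proportional to the number of database residues searched (with constant c), and the cost of recomputing e-values and merging results is negligible. Here N is the original database size and δN the newly added part.

```latex
\[
\text{speedup}
  = \frac{T_{\text{full}}}{T_{\text{incr}}}
  = \frac{c\,(1+\delta)\,N}{c\,\delta\,N}
  = \frac{1+\delta}{\delta},
\qquad
\delta = 0.1 \;\Rightarrow\; \frac{1.1}{0.1} = 11 .
\]
```

Under these assumptions, the smaller the growth fraction δ, the larger the gain from searching only the added part.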

Detecting High Scoring Local Alignments in Pangenome Graphs

Bioinformatics ◽

10.1093/bioinformatics/btab077 ◽

2021 ◽

Author(s):

Tizian Schulz ◽

Roland Wittler ◽

Sven Rahmann ◽

Faraz Hach ◽

Jens Stoye

Keyword(s):

Sequence Similarity ◽

Query Sequence ◽

Heuristic Method ◽

Supplementary Information ◽

De Bruijn Graph ◽

Local Alignment ◽

Memory Usage ◽

Sequence Comparisons ◽

De Bruijn Graphs ◽

De Bruijn

Abstract. Motivation: The growing number of individual genomes sequenced per species motivates the use of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. Results: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. Availability: Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. Supplementary information: Supplementary data are available at Bioinformatics online.
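
For readers unfamiliar with the BLAST-style statistics this abstract refers to, the following minimal sketch shows how an exponential score tail turns into an e-value and a p-value. It uses the standard Karlin-Altschul relations for normalized bit scores (E = m·n·2^(-S') and P = 1 − e^(−E)); it does not reproduce the paper's own parameter estimation for pangenome graphs, and the function names are hypothetical.

```python
import math


def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value for a normalized bit score S': E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(evalue: float) -> float:
    """Probability of at least one chance hit with this score or better.

    P = 1 - exp(-E); expm1 keeps precision when E is very small.
    """
    return -math.expm1(-evalue)
```

Each additional bit of score halves the e-value, which is what allows a score threshold to separate likely homology from spurious alignments.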

A high-precision hybrid algorithm for predicting eukaryotic protein subcellular localization

10.1101/620179 ◽

2019 ◽

Author(s):

Dahan Zhang ◽

Haiyun Huang ◽

Xiaogang Bai ◽

Xiaodong Fang ◽

Yi Zhang

Keyword(s):

Subcellular Localization ◽

Subcellular Location ◽

Supplementary Information ◽

Local Alignment ◽

Protein Subcellular Localization ◽

Eukaryotic Protein ◽

Link Type ◽

Fisher Discriminant ◽

Average Accuracy ◽

Search Tool

Abstract. Motivation: Subcellular location plays an essential role in protein synthesis, transport, and secretion, so determining it is an important step in understanding the mechanisms of trait-related proteins. Generally, homology methods provide reliable results when E-values are small. For proteins that do not share significant homologous domains with known proteins, we must resort to pattern recognition algorithms (SVM, Fisher discriminant, KNN, random forest, etc.). However, satisfying results are seldom obtained. Results: Here, a novel hybrid method, “Basic Local Alignment Search Tool+Smith-Waterman+Needleman-Wunsch” or BLAST+SWNW, has been obtained by integrating a loosened E-value Basic Local Alignment Search Tool (BLAST) with the Smith-Waterman (SW) and Needleman-Wunsch (NW) algorithms, and this method has been introduced to predict protein subcellular localization in eukaryotes. When tested on Dataset I and Dataset II, BLAST+SWNW showed an average accuracy of 97.18% and 99.60%, respectively, surpassing the performance of other algorithms in predicting eukaryotic protein subcellular localization. Availability and Implementation: BLAST+SWNW is an open source collaborative initiative available in the GitHub repository (https://github.com/ZHANGDAHAN/BLAST-SWNW-for-SLP or http://202.206.64.158:80/link/72016CAC26E4298B3B7E0EAF42288935). Supplementary Information: Supplementary data are available at PLOS Computational Biology online.
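
As background for the Smith-Waterman component named in this abstract, here is a minimal sketch of the local alignment score recurrence. It is not the BLAST+SWNW implementation: the scoring values (match, mismatch, linear gap penalty) are illustrative assumptions, whereas protein alignment in practice typically uses a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # first column is always 0 for local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere (local, not global).
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

The clamp at zero is the only difference from the Needleman-Wunsch (global) recurrence, which instead initializes the borders with gap penalties and reads the score from the final cell.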

CovRadar: Continuously tracking and filtering SARS-CoV-2 mutations for molecular surveillance

10.1101/2021.02.03.429146 ◽

2021 ◽

Author(s):

Alice Wittig ◽

Fábio Miranda ◽

Martin Hölzer ◽

Tom Altenburg ◽

Jakub M. Bartoszewicz ◽

...

Keyword(s):

Web Application ◽

Phylogenetic Trees ◽

Spike Protein ◽

Supplementary Information ◽

Local Alignment ◽

Multiple Sequence ◽

Molecular Surveillance ◽

Consensus Sequences ◽

Link Type ◽

Viral Sequences

Abstract. Summary: The ongoing pandemic caused by SARS-CoV-2 emphasizes the importance of molecular surveillance to understand the evolution of the virus and to monitor and plan the epidemiological responses. Quick analysis, easy visualization and convenient filtering of the latest viral sequences are essential for this purpose. We present CovRadar, a tool for molecular surveillance of the Corona spike protein. The spike protein contains the receptor binding domain (RBD) that is used as a target for most vaccine candidates. CovRadar consists of a workflow pipeline and a web application that enable the analysis and visualization of over 1 million sequences. First, CovRadar extracts the regions of interest using local alignment, then builds a multiple sequence alignment, infers variants, consensus sequences and phylogenetic trees and finally presents the results in an interactive PDF-like app, making reporting fast, easy and flexible. Availability and implementation: CovRadar is freely accessible at https://covradar.net, and its open-source code is available at https://gitlab.com/dacs-hpi/covradar. Supplementary information: Supplementary data are available at Bioinformatics online.

LAMPA, LArge Multidomain Protein Annotator, and its application to RNA virus polyproteins

Bioinformatics ◽

10.1093/bioinformatics/btaa065 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2731-2739 ◽

Cited By ~ 1

Author(s):

Anastasia A Gulyaeva ◽

Andrey I Sigorskih ◽

Elena S Ocheredko ◽

Dmitry V Samborskiy ◽

Alexander E Gorbalenya

Keyword(s):

Rna Virus ◽

Sequence Similarity ◽

Statistical Significance ◽

R Package ◽

Similarity Score ◽

Supplementary Information ◽

Accurate Estimation ◽

Local Alignment ◽

Multidomain Proteins ◽

Multidomain Protein

Abstract. Motivation: To facilitate accurate estimation of statistical significance of sequence similarity in profile–profile searches, queries should ideally correspond to protein domains. For multidomain proteins, using domains as queries depends on delineation of domain borders, which may be unknown. Thus, whole proteins are commonly used as queries, which complicates establishing homology for similarities close to cutoff levels of statistical significance. Results: In this article, we describe an iterative approach, called LAMPA, LArge Multidomain Protein Annotator, that resolves the above conundrum by gradual expansion of hit coverage of multidomain proteins through re-evaluating statistical significance of hit similarity using ever smaller queries defined at each iteration. LAMPA employs TMHMM and HHsearch for recognition of transmembrane regions and homology, respectively. We used the Pfam database for annotating 2985 multidomain proteins (polyproteins) composed of >1000 amino acid residues, which dominate proteomes of RNA viruses. Under strict cutoffs, LAMPA outperformed HHsearch-mediated runs using intact polyproteins as queries by three measures: number of and coverage by identified homologous regions, and number of hit Pfam profiles. Compared to HHsearch, LAMPA identified 507 extra homologous regions in 14.4% of polyproteins. This Pfam-based annotation of RNA virus polyproteins by LAMPA was also superior to RefSeq expert annotation by two measures, region number and annotated length, for 69.3% of RNA virus polyprotein entries. We rationalized the obtained results based on the dependence of HHsearch hit statistical significance for local alignment similarity scores on the lengths and diversities of query-target pairs in computational experiments. Availability and implementation: The LAMPA 1.0.0 R package is available on GitHub (https://github.com/Gorbalenya-Lab/LAMPA). Supplementary information: Supplementary data are available at Bioinformatics online.

A Similarity Searching System for Biological Phenotype Images Using Deep Convolutional Encoder-decoder Architecture

Current Bioinformatics ◽

10.2174/1574893614666190204150109 ◽

2019 ◽

Vol 14 (7) ◽

pp. 628-639 ◽

Cited By ~ 10

Author(s):

Bizhi Wu ◽

Hangxiao Zhang ◽

Limei Lin ◽

Huiyuan Wang ◽

Yubang Gao ◽

...

Keyword(s):

Neural Network ◽

Retrieval System ◽

Sequence Similarity ◽

Local Alignment ◽

Similarity Searching ◽

Loss Of Function ◽

Biological Images ◽

The Neural Network ◽

Convolutional Autoencoder ◽

Biological Phenotype

Background: The BLAST (Basic Local Alignment Search Tool) algorithm has been widely used for sequence similarity searching. Analogously, public phenotype images should be efficiently retrievable using biological images as queries, so that phenotypes with high similarity can be identified. Despite the accumulation of genotype-phenotype-mapping data, a system for searching for similar phenotypes is not available because of the bottleneck of image processing. Objective: In this study, we focus on the identification of phenotypic images similar to a query by searching the biological phenotype database, including information about loss-of-function and gain-of-function. Methods: We propose a deep convolutional autoencoder architecture to segment the biological phenotypic images and develop a phenotype retrieval system to enable a better understanding of genotype–phenotype correlation. Results: This study shows how a deep convolutional autoencoder architecture can be trained on images from biological phenotypes to achieve state-of-the-art performance in a phenotypic image retrieval system. Conclusion: Taken together, the phenotype analysis system can provide further information on the correlation between genotype and phenotype. Additionally, the neural network model of image segmentation and the phenotype retrieval system are equally suitable for any species that has enough phenotype images to train the neural network.

Phyllobacterium loti sp. nov. isolated from nodules of Lotus corniculatus

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.052993-0 ◽

2014 ◽

Vol 64 (Pt_3) ◽

pp. 781-786 ◽

Cited By ~ 33

Author(s):

Maximo Sánchez ◽

Martha-Helena Ramírez-Bahena ◽

Alvaro Peix ◽

María J. Lorite ◽

Juan Sanjuán ◽

...

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Sequence Similarity ◽

Carbon Sources ◽

Lotus Corniculatus ◽

Culture Conditions ◽

Rrna Gene ◽

Content Type ◽

Link Type ◽

The 16S Rrna Gene

Strain S658T was isolated from a Lotus corniculatus nodule in a soil sample obtained in Uruguay. Phylogenetic analysis of the 16S rRNA gene and atpD gene showed that this strain clustered within the genus Phyllobacterium. The closest related species was, in both cases, Phyllobacterium trifolii PETP02T with 99.8 % sequence similarity in the 16S rRNA gene and 96.1 % in the atpD gene. The 16S rRNA gene contains an insert at the beginning of the sequence that has no similarities with other inserts present in the same gene in described rhizobial species. Ubiquinone Q-10 was the only quinone detected. Strain S658T differed from its closest relatives through its growth in diverse culture conditions and in the assimilation of several carbon sources. It was not able to reproduce nodules in Lotus corniculatus. The results of DNA–DNA hybridization, phenotypic tests and fatty acid analyses confirmed that this strain should be classified as a representative of a novel species of the genus Phyllobacterium, for which the name Phyllobacterium loti sp. nov. is proposed. The type strain is S658T (= LMG 27289T = CECT 8230T).

Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins

Scientific Reports ◽

10.1038/s41598-021-81063-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Dimitri Boeckaerts ◽

Michiel Stock ◽

Bjorn Criel ◽

Hans Gerstmans ◽

Bernard De Baets ◽

...

Keyword(s):

Machine Learning ◽

Predictive Model ◽

Receptor Binding ◽

Bacterial Infections ◽

Sequence Data ◽

Sequence Similarity ◽

Area Under The Curve ◽

Local Alignment ◽

Search Tool ◽

Different Levels

Abstract. Nowadays, bacteriophages are increasingly considered as an alternative treatment for a variety of bacterial infections in cases where classical antibiotics have become ineffective. However, characterizing the host specificity of phages remains a labor- and time-intensive process. In order to alleviate this burden, we have developed a new machine-learning-based pipeline to predict bacteriophage hosts based on annotated receptor-binding protein (RBP) sequence data. We focus on predicting bacterial hosts from the ESKAPE group, Escherichia coli, Salmonella enterica and Clostridium difficile. We compare the performance of our predictive model with that of the widely used Basic Local Alignment Search Tool (BLAST). Our best-performing predictive model reaches Precision-Recall Area Under the Curve (PR-AUC) scores between 73.6 and 93.8% for different levels of sequence similarity in the collected data. Our model reaches a performance comparable to that of BLASTp when sequence similarity in the data is high and starts outperforming BLASTp when sequence similarity drops below 75%. Therefore, our machine learning methods can be especially useful in settings in which sequence similarity to other known sequences is low. Predicting the hosts of novel metagenomic RBP sequences could extend our toolbox to tune the host spectrum of phages or phage tail-like bacteriocins by swapping RBPs.
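
The PR-AUC metric reported in this abstract can be illustrated with a small sketch. Average precision is used here as one common estimator of the area under the precision-recall curve; the abstract does not say which estimator the authors used, so this is an assumption, and the function name is hypothetical. Tied scores are not treated specially.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = positive) and predicted scores."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank predictions from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            # Precision at the rank where this positive is retrieved.
            precision_sum += true_pos / rank
    return precision_sum / total_pos
```

Unlike ROC-AUC, this score depends on the proportion of positives, which makes it informative when the host classes are imbalanced.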

Faecalicoccus acidiformans gen. nov., sp. nov., isolated from the chicken caecum, and reclassification of Streptococcus pleomorphus (Barnes et al. 1977), Eubacterium biforme (Eggerth 1935) and Eubacterium cylindroides (Cato et al. 1974) as Faecalicoccus pleomorphus comb. nov., Holdemanella biformis gen. nov., comb. nov. and Faecalitalea cylindroides gen. nov., comb. nov., respectively, within the family Erysipelotrichaceae

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.064626-0 ◽

2014 ◽

Vol 64 (Pt_11) ◽

pp. 3877-3884 ◽

Cited By ~ 50

Author(s):

Celine De Maesschalck ◽

Filip Van Immerseel ◽

Venessa Eeckhaut ◽

Siegrid De Baere ◽

Margo Cnockaert ◽

...

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Type Species ◽

Gene Sequence ◽

Sequence Similarity ◽

16S Rrna Gene Sequence ◽

Rrna Gene ◽

Rrna Gene Sequence ◽

Content Type ◽

Link Type

Strains LMG 27428T and LMG 27427 were isolated from the caecal content of a chicken and produced butyric, lactic and formic acids as major metabolic end products. The genomic DNA G+C contents of strains LMG 27428T and LMG 27427 were 40.4 and 38.8 mol%, respectively. On the basis of 16S rRNA gene sequence similarity, both strains were most closely related to the generically misclassified Streptococcus pleomorphus ATCC 29734T. Strain LMG 27428T could be distinguished from S. pleomorphus ATCC 29734T based on production of more lactic acid and less formic acid in M2GSC medium, a higher DNA G+C content and the absence of activities of acid phosphatase and leucine, arginine, leucyl glycine, pyroglutamic acid, glycine and histidine arylamidases, while strain LMG 27427 was biochemically indistinguishable from S. pleomorphus ATCC 29734T. The novel genus Faecalicoccus gen. nov. within the family Erysipelotrichaceae is proposed to accommodate strains LMG 27428T and LMG 27427. Strain LMG 27428T (= DSM 26963T) is the type strain of Faecalicoccus acidiformans sp. nov., and strain LMG 27427 (= DSM 26962) is a strain of Faecalicoccus pleomorphus comb. nov. (type strain LMG 17756T = ATCC 29734T = DSM 20574T). Furthermore, the nearest phylogenetic neighbours of the genus Faecalicoccus are the generically misclassified Eubacterium cylindroides DSM 3983T (94.4 % 16S rRNA gene sequence similarity to strain LMG 27428T) and Eubacterium biforme DSM 3989T (92.7 % 16S rRNA gene sequence similarity to strain LMG 27428T). We present genotypic and phenotypic data that allow the differentiation of each of these taxa and propose to reclassify these generically misnamed species of the genus Eubacterium formally as Faecalitalea cylindroides gen. nov., comb. nov. and Holdemanella biformis gen. nov., comb. nov., respectively. The type strain of Faecalitalea cylindroides is DSM 3983T = ATCC 27803T = JCM 10261T and that of Holdemanella biformis is DSM 3989T = ATCC 27806T = CCUG 28091T.

Amphritea ceti sp. nov., isolated from faeces of Beluga whale (Delphinapterus leucas)

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijs.0.067405-0 ◽

2014 ◽

Vol 64 (Pt_12) ◽

pp. 4068-4072 ◽

Cited By ~ 7

Author(s):

Young-Ok Kim ◽

Sooyeon Park ◽

Doo Nam Kim ◽

Bo-Hye Nam ◽

Sung-Min Won ◽

...

Keyword(s):

Phylogenetic Trees ◽

Sequence Similarity ◽

Beluga Whale ◽

Delphinapterus Leucas ◽

Rrna Gene ◽

Phenotypic Properties ◽

Content Type ◽

Link Type ◽

Beluga Whale Delphinapterus Leucas ◽

Type Strains

A Gram-stain-negative, aerobic, non-spore-forming, non-flagellated and rod-shaped or ovoid bacterial strain, designated RA1T, was isolated from faeces collected from Beluga whale (Delphinapterus leucas) in Yeosu aquarium, South Korea. Strain RA1T grew optimally at 25 °C, at pH 7.0–8.0 and in the presence of 2.0 % (w/v) NaCl. Neighbour-joining, maximum-likelihood and maximum-parsimony phylogenetic trees based on 16S rRNA gene sequences revealed that strain RA1T joins the cluster comprising the type strains of three species of the genus Amphritea, with which it exhibited 95.8–96.0 % sequence similarity. Sequence similarities to the type strains of other recognized species were less than 94.3 %. Strain RA1T contained Q-8 as the predominant ubiquinone and summed feature 3 (C16 : 1ω7c and/or C16 : 1ω6c), C18 : 1ω7c and C16 : 0 as the major fatty acids. The major polar lipids of strain RA1T were phosphatidylethanolamine, phosphatidylglycerol, two unidentified lipids and one unidentified aminolipid. The DNA G+C content of strain RA1T was 47.4 mol%. The differential phenotypic properties, together with the phylogenetic distinctiveness, revealed that strain RA1T is separated from other species of the genus Amphritea. On the basis of the data presented, strain RA1T is considered to represent a novel species of the genus Amphritea, for which the name Amphritea ceti sp. nov. is proposed. The type strain is RA1T (= KCTC 42154T = NBRC 110551T).
