BLAST-based validation of metagenomic sequence assignments

2017 ◽  
Author(s):  
Adam L. Bazinet ◽  
Brian D. Ondov ◽  
Daniel D. Sommer ◽  
Shashikala Ratnayake

Abstract When performing bioforensic casework, it is important to be able to reliably detect the presence of a particular organism in a metagenomic sample, even if the organism is only present in a trace amount. For this task, it is common to use a sequence classification program that determines the taxonomic affiliation of individual sequence reads by comparing them to reference database sequences. As metagenomic data sets often consist of millions or billions of reads that need to be compared to reference databases containing millions of sequences, such sequence classification programs typically use search heuristics and databases with reduced sequence diversity to speed up the analysis, which can lead to incorrect assignments. Thus, in a bioforensic setting where correct assignments are paramount, assignments of interest made by “first-pass” classifiers should be confirmed using the most precise methods and comprehensive databases available. In this study we present a BLAST-based method for validating the assignments made by less precise sequence classification programs, with optimal parameters for filtering of BLAST results determined via simulation of sequence reads from genomes of interest, and we apply the method to the detection of four pathogenic organisms. The software implementing the method is open source and freely available.

PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e4892 ◽  
Author(s):  
Adam L. Bazinet ◽  
Brian D. Ondov ◽  
Daniel D. Sommer ◽  
Shashikala Ratnayake

When performing bioforensic casework, it is important to be able to reliably detect the presence of a particular organism in a metagenomic sample, even if the organism is only present in a trace amount. For this task, it is common to use a sequence classification program that determines the taxonomic affiliation of individual sequence reads by comparing them to reference database sequences. As metagenomic data sets often consist of millions or billions of reads that need to be compared to reference databases containing millions of sequences, such sequence classification programs typically use search heuristics and databases with reduced sequence diversity to speed up the analysis, which can lead to incorrect assignments. Thus, in a bioforensic setting where correct assignments are paramount, assignments of interest made by “first-pass” classifiers should be confirmed using the most precise methods and comprehensive databases available. In this study we present a BLAST-based method for validating the assignments made by less precise sequence classification programs, with optimal parameters for filtering of BLAST results determined via simulation of sequence reads from genomes of interest, and we apply the method to the detection of four pathogenic organisms. The software implementing the method is open source and freely available.


2009 ◽  
Vol 76 (2) ◽  
pp. 609-617 ◽  
Author(s):  
Vanessa A. Varaljay ◽  
Erinn C. Howard ◽  
Shulei Sun ◽  
Mary Ann Moran

ABSTRACT In silico design and testing of environmental primer pairs with metagenomic data are beneficial for capturing a greater proportion of the natural sequence heterogeneity in microbial functional genes, as well as for understanding limitations of existing primer sets that were designed from more restricted sequence data. PCR primer pairs targeting 10 environmental clades and subclades of the dimethylsulfoniopropionate (DMSP) demethylase protein, DmdA, were designed using an iterative bioinformatic approach that took advantage of thousands of dmdA sequences captured in marine metagenomic data sets. Using the bioinformatically optimized primers, dmdA genes were amplified from composite free-living coastal bacterioplankton DNA (from 38 samples over 5 years and two locations) and sequenced using 454 technology. An average of 6,400 amplicons per primer pair represented more than 700 clusters of environmental dmdA sequences across all primers, with clusters defined conservatively at >90% nucleotide sequence identity (∼95% amino acid identity). Degenerate and inosine-based primers did not perform better than specific primer pairs in determining dmdA richness and sometimes captured a lower degree of richness of sequences from the same DNA sample. A comparison of dmdA sequences in free-living versus particle-associated bacteria in southeastern U.S. coastal waters showed that sequence richness in some dmdA subgroups differed significantly between size fractions, though most gene clusters were shared (52 to 91%) and most sequences were affiliated with the shared clusters (∼90%). The availability of metagenomic sequence data has significantly enhanced the design of quantitative PCR primer pairs for this key functional gene, providing robust access to the capabilities and activities of DMSP demethylating bacteria in situ.


2016 ◽  
Author(s):  
Shea N Gardner ◽  
Sasha K Ames ◽  
Maya B Gokhale ◽  
Tom R Slezak ◽  
Jonathan Allen

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library) which can be run with 24 GB of DRAM memory, an amount available on many clusters, or with 16 GB DRAM plus a 24 GB low cost commodity flash drive (NVRAM), a cost effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.


mSystems ◽  
2018 ◽  
Vol 3 (3) ◽  
Author(s):  
Luis M. Rodriguez-R ◽  
Santosh Gunturu ◽  
James M. Tiedje ◽  
James R. Cole ◽  
Konstantinos T. Konstantinidis

ABSTRACT Estimations of microbial community diversity based on metagenomic data sets are affected, often to an unknown degree, by biases derived from insufficient coverage and reference database-dependent estimations of diversity. For instance, the completeness of reference databases cannot be generally estimated since it depends on the extant diversity sampled to date, which, with the exception of a few habitats such as the human gut, remains severely undersampled. Further, estimation of the degree of coverage of a microbial community by a metagenomic data set is prohibitively time-consuming for large data sets, and coverage values may not be directly comparable between data sets obtained with different sequencing technologies. Here, we extend Nonpareil, a database-independent tool for the estimation of coverage in metagenomic data sets, to a high-performance computing implementation that scales up to hundreds of cores and includes, in addition, a k-mer-based estimation as sensitive as the original alignment-based version but about three hundred times as fast. Further, we propose a metric of sequence diversity (N_d) derived directly from Nonpareil curves that correlates well with alpha diversity assessed by traditional metrics. We use this metric in different experiments demonstrating the correlation with the Shannon index estimated on 16S rRNA gene profiles and show that N_d additionally reveals seasonal patterns in marine samples that are not captured by the Shannon index and more precise rankings of the magnitude of diversity of microbial communities in different habitats. Therefore, the new version of Nonpareil, called Nonpareil 3, advances the toolbox for metagenomic analyses of microbiomes.
IMPORTANCE Estimation of the coverage provided by a metagenomic data set, i.e., what fraction of the microbial community was sampled by DNA sequencing, represents an essential first step of every culture-independent genomic study that aims to robustly assess the sequence diversity present in a sample. However, estimation of coverage remains elusive because of several technical limitations associated with high computational requirements and limiting statistical approaches to quantify diversity. Here we described Nonpareil 3, a new bioinformatics algorithm that circumvents several of these limitations and thus can facilitate culture-independent studies in clinical or environmental settings, independent of the sequencing platform employed. In addition, we present a new metric of sequence diversity based on rarefied coverage and demonstrate its use in communities from diverse ecosystems.


2021 ◽  
Vol 4 ◽  
Author(s):  
François Keck ◽  
Florian Altermatt

Reference databases of sequences that have been taxonomically assigned are a key element for DNA-based identification of organisms. Accurate and complete reference databases are necessary to associate a correct taxonomic name to the sequences obtained in studies using metabarcoding. Today many research projects using DNA metabarcoding include the development of a custom reference database, often derived from large repositories like GenBank. At the same time, many projects are focussing on the development of ready-to-use databases validated by experts and targeting specific markers and taxonomic groups. While mainstream tools such as spreadsheet software may be suitable to manage small databases, they quickly become insufficient when the amount of data increases and validation operations become more complex. There is a clear need for user-friendly and powerful tools to manipulate biological sequences and manage reference databases. The R language, which is free software and has already been adopted by many researchers to perform their analyses, is highly suitable for developing such tools. In this talk, we will outline the approach we recommend to handle small- to middle-sized reference databases, which currently still make up the majority of projects. We will advocate that a simple tabular approach where each sequence constitutes an observation may be the most adequate. While such a single table may be less flexible and less optimized than relational databases or more complex data structures, it is easy to maintain and allows the direct use of modern dataframe-centric tools. We will specifically present and discuss two R packages that can be used jointly to make reference database development more accessible and more reproducible. First, we will briefly introduce bioseq (Keck 2020), which is dedicated to biological sequence manipulation and analysis. The package implements classes and functions to make analyses of complex datasets including DNA, RNA or protein sequences as simple as possible. The strength of bioseq is to provide standard and more advanced functions to perform low-level operations through a simple and consistent programming interface. Then we will present refdb, which has been developed as an environment for semi-automatic and assisted construction of reference databases. The refdb package is a reference database manager offering a set of powerful functions to import, organize, clean, filter, audit and export the data. We will outline how these two packages together can speed up reference database generation and handling, and contribute to standardization and repeatability in metabarcoding studies.


Sensors ◽  
2021 ◽  
Vol 21 (5) ◽  
pp. 1736
Author(s):  
Zengchong Yang ◽  
Xiucheng Liu ◽  
Bin Wu ◽  
Ren Liu

Previous studies on Lamb wave touchscreen (LWT) were carried out based on the assumption that the unknown touch had parameters consistent with the acoustic fingerprints in the reference database. The adaptability of LWT to the variations in touch force and touch area was investigated in this study for the first time. The automatic collection of the databases of acoustic fingerprints was realized with an experimental prototype of LWT employing three pairs of transmitter–receivers. The self-adaptive updated weight coefficient of the used transmitter–receiver pairs was employed to successfully improve the accuracy of the localization model established based on a learning method. The performance of the improved method in locating single- and two-touch actions with the reference database of different parameters was carefully evaluated. The robustness of the LWT to the variation of the touch force varied with the touch area. Moreover, it was feasible to locate touch actions of large area with reference databases of small touch areas as long as the unknown touch and the reference databases met the condition of equivalent averaged stress.


GigaScience ◽  
2021 ◽  
Vol 10 (2) ◽  
Author(s):  
Guilhem Sempéré ◽  
Adrien Pétel ◽  
Magsen Abbé ◽  
Pierre Lefeuvre ◽  
Philippe Roumagnac ◽  
...  

Abstract Background Efficiently managing large, heterogeneous data in a structured yet flexible way is a challenge to research laboratories working with genomic data. Specifically regarding both shotgun- and metabarcoding-based metagenomics, while online reference databases and user-friendly tools exist for running various types of analyses (e.g., Qiime, Mothur, Megan, IMG/VR, Anvi'o, Qiita, MetaVir), scientists lack comprehensive software for easily building scalable, searchable, online data repositories on which they can rely during their ongoing research. Results metaXplor is a scalable, distributable, fully web-interfaced application for managing, sharing, and exploring metagenomic data. Being based on a flexible NoSQL data model, it has few constraints regarding dataset contents and thus proves useful for handling outputs from both shotgun and metabarcoding techniques. By supporting incremental data feeding and providing means to combine filters on all imported fields, it allows for exhaustive content browsing, as well as rapid narrowing to find specific records. The application also features various interactive data visualization tools, ways to query contents by BLASTing external sequences, and an integrated pipeline to enrich assignments with phylogenetic placements. The project home page provides the URL of a live instance allowing users to test the system on public data. Conclusion metaXplor allows efficient management and exploration of metagenomic data. Its availability as a set of Docker containers, making it easy to deploy on academic servers, on the cloud, or even on personal computers, will facilitate its adoption.


PLoS ONE ◽  
2011 ◽  
Vol 6 (11) ◽  
pp. e25353 ◽  
Author(s):  
Peng Jia ◽  
Liming Xuan ◽  
Lei Liu ◽  
Chaochun Wei

2014 ◽  
Vol 104 (10) ◽  
pp. 1125-1129 ◽  
Author(s):  
A. H. Stobbe ◽  
W. L. Schneider ◽  
P. R. Hoyt ◽  
U. Melcher

Next generation sequencing (NGS) is not used commonly in diagnostics, in part due to the large amount of time and computational power needed to identify the taxonomic origin of each sequence in a NGS data set. By using the unassembled NGS data sets as the target for searches, pathogen-specific sequences, termed e-probes, could be used as queries to enable detection of specific viruses or organisms in plant sample metagenomes. This method, designated e-probe diagnostic nucleic acid assay, first tested with mock sequence databases, was tested with NGS data sets generated from plants infected with a DNA (Bean golden yellow mosaic virus, BGYMV) or an RNA (Plum pox virus, PPV) virus. In addition, the ability to detect and differentiate among strains of a single virus species, PPV, was examined by using probe sets that were specific to strains. The use of probe sets for multiple viruses determined that one sample was dually infected with BGYMV and Bean golden mosaic virus.


F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 726
Author(s):  
Mike W.C. Thang ◽  
Xin-Yi Chua ◽  
Gareth Price ◽  
Dominique Gorse ◽  
Matt A. Field

Metagenomic sequencing is an increasingly common tool in environmental and biomedical sciences. While software for detailing the composition of microbial communities using 16S rRNA marker genes is relatively mature, increasingly researchers are interested in identifying changes exhibited within microbial communities under differing environmental conditions. In order to gain maximum value from metagenomic sequence data we must improve the existing analysis environment by providing accessible and scalable computational workflows able to generate reproducible results. Here we describe a complete end-to-end open-source metagenomics workflow running within Galaxy for 16S differential abundance analysis. The workflow accepts 454 or Illumina sequence data (either overlapping or non-overlapping paired end reads) and outputs lists of the operational taxonomic units (OTUs) exhibiting the greatest change under differing conditions. A range of analysis steps and graphing options are available, giving users a high level of control over their data and analyses. Additionally, users are able to input complex sample-specific metadata information which can be incorporated into differential analysis and used for grouping / colouring within graphs. Detailed tutorials containing sample data and existing workflows are available for three different input types: overlapping and non-overlapping read pairs as well as for pre-generated Biological Observation Matrix (BIOM) files. Using the Galaxy platform we developed MetaDEGalaxy, a complete metagenomics differential abundance analysis workflow. MetaDEGalaxy is designed for bench scientists working with 16S data who are interested in comparative metagenomics. MetaDEGalaxy builds on momentum within the wider Galaxy metagenomics community with the hope that more tools will be added as existing methods mature.

