ensembleTax: an R package for determinations of ensemble taxonomic assignments of phylogenetically-informative marker gene sequences

metagenomeFeatures: An R package for working with 16S rRNA reference databases and marker-gene survey feature data

10.1101/339812 ◽

2018 ◽

Cited By ~ 1

Author(s):

Nathan D. Olson ◽

Nidhi Shah ◽

Jayaram Kancherla ◽

Justin Wagner ◽

Joseph N. Paulson ◽

...

Keyword(s):

16S rRNA ◽

Marker Gene ◽

R Package ◽

Bioconductor Package ◽

rRNA Sequence ◽

16S rRNA Sequence ◽

Crucial Step ◽

Reference Databases ◽

Database Comparison ◽

Sequence Databases

Abstract: We developed the metagenomeFeatures R Bioconductor package along with annotation packages for the three primary 16S rRNA databases (Greengenes, RDP, and SILVA) to facilitate working with 16S rRNA sequence databases and marker-gene survey feature data. The metagenomeFeatures package defines two classes, MgDb for working with 16S rRNA sequence databases, and mgFeatures for working with marker-gene survey feature data. The associated annotation packages provide a consistent interface to the different 16S rRNA databases, facilitating database comparison and exploration. The mgFeatures class represents a crucial step in the development of a common data structure for working with 16S marker-gene survey data in R. Availability: https://bioconductor.org/packages/release/bioc/html/metagenomeFeatures.html

Fast and sensitive taxonomic assignment to metagenomic contigs

10.1101/2020.11.27.401018 ◽

2020 ◽

Author(s):

M. Mirdita ◽

M. Steinegger ◽

F. Breitwieser ◽

J. Söding ◽

E. Levy Karin

Keyword(s):

Taxonomic Assignment ◽

Weighted Voting ◽

Protein Fragments ◽

Extraction Step ◽

Open Source Software Package ◽

Domains Of Life ◽

Reference Databases ◽

Taxonomic Assignments ◽

Free Open Source ◽

Taxonomic Annotation

Summary: MMseqs2 taxonomy is a new tool to assign taxonomic labels to metagenomic contigs. It extracts all possible protein fragments from each contig, quickly retains those that can contribute to taxonomic annotation, assigns them with robust labels and determines the contig’s taxonomic identity by weighted voting. Its fragment extraction step is suitable for the analysis of all domains of life. MMseqs2 taxonomy is 2-18x faster than state-of-the-art tools and also contains new modules for creating and manipulating taxonomic reference databases as well as reporting and visualizing taxonomic assignments. Availability: MMseqs2 taxonomy is part of the MMseqs2 free open-source software package available for Linux, macOS and Windows at https://mmseqs.com.
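
The weighted-voting idea in this summary can be illustrated with a minimal Python sketch. It is not the MMseqs2 implementation: the lineage tuples, the fragment weights, and the 0.5 majority cutoff are assumptions chosen for illustration. Each fragment votes for its whole lineage, and the contig receives the most specific taxon whose accumulated weight exceeds the cutoff fraction of the total.

```python
from collections import defaultdict


def weighted_vote(fragment_hits, cutoff=0.5):
    """Return the most specific lineage prefix backed by more than `cutoff` of the weight.

    fragment_hits: list of (lineage, weight) pairs; lineage is a tuple ordered
    from the broadest rank to the most specific one (illustrative format).
    cutoff: assumed majority fraction; values of 0.5 or more keep the winner unique.
    """
    total = sum(weight for _, weight in fragment_hits)
    if total <= 0:
        return ()
    support = defaultdict(float)
    for lineage, weight in fragment_hits:
        # A vote for a taxon also counts for each of its ancestors.
        for depth in range(1, len(lineage) + 1):
            support[lineage[:depth]] += weight
    passing = [prefix for prefix, w in support.items() if w / total > cutoff]
    # The longest passing prefix is the most specific supported taxon.
    return max(passing, key=len, default=())


# Hypothetical fragment labels and weights for one contig.
hits = [
    (("Bacteria", "Proteobacteria", "Escherichia"), 3.0),
    (("Bacteria", "Proteobacteria", "Salmonella"), 2.0),
    (("Bacteria", "Firmicutes", "Bacillus"), 1.0),
]
contig_label = weighted_vote(hits)
```

In this example no genus holds a strict majority of the weight, so the vote falls back to the phylum shared by most fragments.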

Phytool, a ShinyApp to homogenise taxonomy of freshwater microalgae from DNA barcodes and microscopic observations

Metabarcoding and Metagenomics ◽

10.3897/mbmg.5.74096 ◽

2021 ◽

Vol 5 ◽

Author(s):

Alexis Canino ◽

Agnès Bouchez ◽

Christophe Laplace-Treyture ◽

Isabelle Domaizon ◽

Frédéric Rimet

Keyword(s):

Reference Database ◽

Taxonomic Assignment ◽

Freshwater Microalgae ◽

Web Based ◽

Continuous Increase ◽

Freshwater Phytoplankton ◽

Reference Databases ◽

Data Tables ◽

Taxonomic Studies ◽

User Friendly

Methods for biomonitoring of freshwater phytoplankton are evolving rapidly with eDNA-based methods, offering great complementarity with microscopy. Metabarcoding approaches have been more commonly used over the last years, with a continuous increase in the amount of data generated. Depending on the researchers and the way they assigned barcodes to species (bioinformatic pipelines and molecular reference databases), the taxonomic assignment obtained for HTS DNA reads might vary. This is also true for traditional taxonomic studies by microscopy with regular adjustments of the classification and taxonomy. For those reasons (leading to non-homogeneous taxonomies), gap-analyses and comparisons between studies become even more challenging and the curation processes to find potential consensus names are time-consuming. Here, we present a web-based application (Phytool), developed with ShinyApp (Rstudio), that aims to make the harmonisation of taxonomy easier and in a more efficient way, using a complete and up-to-date taxonomy reference database for freshwater microalgae. Phytool allows users to homogenise and update freshwater phytoplankton taxonomical names from sequence files and data tables directly uploaded in the application. It also gathers barcodes from curated references in a user-friendly way in which it is possible to search for specific organisms. All the data provided are downloadable with the possibility to apply filters in order to select only the required taxa and fields (e.g. specific taxonomic ranks). The main goal is to make accessible to a broad range of users the connection between microscopy and molecular biology and taxonomy through different ready-to-use functions. This study estimates that only 25% of species of freshwater phytoplankton in Phytobs are associated with a barcode. We plead for an increased effort to enrich reference databases by coupling taxonomy and molecular methods. Phytool should make this crucial work more efficient. 
The application is available at https://caninuzzo.shinyapps.io/phytool_v1/
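
The name-harmonisation step described in this abstract can be sketched in Python as a lookup against a synonym table. This is only an illustration, not Phytool's code: the function name, the table layout, and the placeholder taxon names are all hypothetical stand-ins for a curated reference taxonomy.

```python
def harmonise_names(names, synonyms):
    """Map each reported name to its accepted name.

    synonyms: dict of outdated or variant name -> accepted name
    (hypothetical stand-in for a curated reference taxonomy).
    Returns (harmonised names, names absent from the reference).
    """
    accepted = set(synonyms.values())
    harmonised, unknown = [], []
    for raw in names:
        name = " ".join(raw.split())  # collapse stray whitespace
        if name in synonyms:
            harmonised.append(synonyms[name])
        elif name in accepted:
            harmonised.append(name)
        else:
            # Keep the name unchanged but flag it for manual curation.
            harmonised.append(name)
            unknown.append(name)
    return harmonised, unknown


# Placeholder names, not real taxa.
synonym_table = {"Oldgenus alpha": "Newgenus alpha"}
reported = ["Oldgenus  alpha", "Newgenus alpha", "Othergenus beta"]
harmonised, unknown = harmonise_names(reported, synonym_table)
```

Names missing from the reference are returned separately so that they can be curated by hand rather than silently dropped.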

Optimization of 16S amplicon analysis using mock communities: implications for estimating community diversity

10.7287/peerj.preprints.2196 ◽

2016 ◽

Author(s):

Andrew Krohn ◽

Bo Stevens ◽

Adam Robbins-Pianka ◽

Matthew Belus ◽

Gerard J Allan ◽

...

Keyword(s):

DNA Sequences ◽

Marker Gene ◽

Community Diversity ◽

Sequencing Data ◽

Mock Community ◽

Taxonomic Assignment ◽

Data Set ◽

Environmental Diversity ◽

Quality Filtering ◽

Mock Communities

The diversity of complex microbial communities can be rapidly assessed by high-throughput DNA sequencing of marker gene (e.g., 16S) PCR amplicon pools, often yielding many thousands of DNA sequences per sample. However, analysis of such community amplicon sequencing data requires multiple computational steps which affect the outcome of a final data set. Here we use mock communities to describe the effects of parameter adjustments for raw sequence quality filtering, picking operational taxonomic units (OTUs), taxonomic assignment, and OTU table filtering as implemented in the popular microbial ecology analysis package, QIIME 1.9.1. We demonstrate a workflow optimization based upon this exploration, which we also apply to environmental samples. We found that quality filtering of raw data and filtering of OTU tables had large effects on observed OTU diversity. While all taxonomy assignment programs performed with similar accuracy, an appropriate choice of similarity threshold for defining OTUs depended on the method used for OTU picking. Our “default” analysis in QIIME overestimated mock community OTU diversity by at least a factor of ten. Our optimized analysis correctly characterized mock community taxonomic composition and improved the OTU diversity estimate, reducing overestimation to a factor of about two. Though observed relative abundances of mock community member taxa were approximately correct, most were still represented by multiple OTUs. Low-frequency OTUs conspecific to constituent mock community taxa were characterized by multiple substitution and indel errors and the presence of a low-quality base call resulting in sequence truncation during quality filtering. Low-quality base calls were observed at “G” positions most of the time, and were also associated with a preceding “TTT” trinucleotide motif. Environmental diversity estimates were reduced by about 40% from 2508 to 1533 OTUs when comparing output from the default and optimized workflows. 
We attribute this reduction in observed diversity to the removal of erroneous sequences from the data set. Our results indicate that both strict quality filtering of raw sequencing data and careful filtering of raw OTU tables are important steps for accurately estimating microbial community diversity.
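
As a rough illustration of OTU table filtering, the Python sketch below drops OTUs whose total read count falls under a fraction of all reads. It is a sketch under stated assumptions, not the QIIME 1.9.1 procedure used in the study: the table layout (OTU id mapped to per-sample counts) and the cutoff values are hypothetical.

```python
def filter_otu_table(table, min_fraction=0.00005):
    """Drop OTUs whose total count is below a fraction of all reads.

    table: dict mapping OTU id -> list of per-sample read counts.
    min_fraction: hypothetical cutoff, as a fraction of total reads in the table.
    """
    total_reads = sum(sum(counts) for counts in table.values())
    if total_reads == 0:
        return {}
    return {
        otu: counts
        for otu, counts in table.items()
        if sum(counts) / total_reads >= min_fraction
    }


# Made-up counts for three OTUs across two samples.
example_table = {"otu1": [900, 950], "otu2": [80, 60], "otu3": [1, 0]}
filtered = filter_otu_table(example_table, min_fraction=0.001)

# Relative reduction reported in the abstract: 2508 -> 1533 OTUs.
reduction = (2508 - 1533) / 2508
```

The last line reproduces the arithmetic behind the reported figure: (2508 - 1533) / 2508 is roughly 0.39, consistent with "about 40%".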

PEMA v2: addressing metabarcoding bioinformatics analysis challenges

ARPHA Conference Abstracts ◽

10.3897/aca.4.e64902 ◽

2021 ◽

Vol 4 ◽

Author(s):

Haris Zafeiropoulos ◽

Christina Pavloudi ◽

Evangelos Pafilis

Keyword(s):

High Performance ◽

Bioinformatics Analysis ◽

Marker Gene ◽

Environmental DNA ◽

Third Party ◽

Reference Database ◽

Marker Genes ◽

Specific Reference ◽

Taxonomic Assignment ◽

Internal Joint

Environmental DNA (eDNA) and metabarcoding have launched a new era in bio- and eco-assessment over the last years (Ruppert et al. 2019). The simultaneous identification, at the lowest taxonomic level possible, of a mixture of taxa from a great range of samples is now feasible; thus, the number of eDNA metabarcoding studies has increased radically (Deiner and 2017). While the experimental part of eDNA metabarcoding can be rather challenging depending on the special characteristics of the different studies, computational issues are considered to be its major bottlenecks. Among the latter, the bioinformatics analysis of metabarcoding data and especially the taxonomy assignment of the sequences are fundamental challenges. Many steps are required to obtain taxonomically assigned matrices from raw data. For most of these, a plethora of tools are available. However, each tool's execution parameters need to be tailored to reflect each experiment's idiosyncrasy; thus, tuning bioinformatics analysis has proved itself fundamental (Kamenova 2020). The computation capacity of high-performance computing systems (HPC) is frequently required for such analyses. On top of that, the imperfect completeness and correctness of the reference taxonomy databases is another important issue (Loos et al. 2020). Based on third-party tools, we have developed the Pipeline for Environmental Metabarcoding Analysis (PEMA), an HPC-centered, containerized assembly of key metabarcoding analysis tools. PEMA combines state-of-the-art technologies and algorithms with an easy to get-set-use framework, allowing researchers to tune thoroughly each study thanks to roll-back checkpoints and on-demand partial pipeline execution features (Zafeiropoulos 2020). Once PEMA was released, there were two main pitfalls soon to be highlighted by users. PEMA supported 4 marker genes and was bounded by specific reference databases.
In this new version of PEMA the analysis of any marker gene is now available since a new feature was added, allowing classifiers to be trained on a user-provided reference database and then used for taxonomic assignment. Fig. 1 shows the taxonomy assignment related PEMA modules; all those out of the dashed box have been developed for this new PEMA release. As shown, the RDPClassifier has been trained with Midori reference 2 and has been added as an option, classifying not only metazoans but sequences from all taxonomic groups of Eukaryotes for the case of the COI marker gene. A PEMA documentation site is now also available. PEMA.v2 containers are available via the DockerHub and SingularityHub as well as through the Elixir Greece AAI Service. It has also been selected to be part of the LifeWatch ERIC Internal Joint Initiative for the analysis of ARMS data and soon will be available through the Tesseract VRE.

metagenomeFeatures: an R package for working with 16S rRNA reference databases and marker-gene survey feature data

Bioinformatics ◽

10.1093/bioinformatics/btz136 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3870-3872 ◽

Cited By ~ 1

Author(s):

Nathan D Olson ◽

Nidhi Shah ◽

Jayaram Kancherla ◽

Justin Wagner ◽

Joseph N Paulson ◽

...

Keyword(s):

16S rRNA ◽

Marker Gene ◽

R Package ◽

Supplementary Information ◽

Bioconductor Package ◽

rRNA Sequence ◽

16S rRNA Sequence ◽

Reference Databases ◽

Supplementary Material ◽

Database Comparison

Summary: We developed the metagenomeFeatures R Bioconductor package along with annotation packages for three 16S rRNA databases (Greengenes, RDP and SILVA) to facilitate working with 16S rRNA databases and marker-gene survey feature data. The metagenomeFeatures package defines two classes, MgDb for working with 16S rRNA sequence databases, and mgFeatures for marker-gene survey feature data. The associated annotation packages provide a consistent interface to the different databases, facilitating database comparison and exploration. The mgFeatures class represents a crucial step in the development of a common data structure for working with 16S marker-gene survey data in R. Availability and implementation: https://bioconductor.org/packages/release/bioc/html/metagenomeFeatures.html. Supplementary information: Supplementary material is available at Bioinformatics online.

Optimization of 16S amplicon analysis using mock communities: implications for estimating community diversity

10.7287/peerj.preprints.2196v3 ◽

2016 ◽

Cited By ~ 1

Author(s):

Andrew Krohn ◽

Bo Stevens ◽

Adam Robbins-Pianka ◽

Matthew Belus ◽

Gerard J Allan ◽

...

Keyword(s):

DNA Sequences ◽

Marker Gene ◽

Community Diversity ◽

Sequencing Data ◽

Mock Community ◽

Taxonomic Assignment ◽

Data Set ◽

Environmental Diversity ◽

Quality Filtering ◽

Mock Communities

The diversity of complex microbial communities can be rapidly assessed by high-throughput DNA sequencing of marker gene (e.g., 16S) PCR amplicon pools, often yielding many thousands of DNA sequences per sample. However, analysis of such community amplicon sequencing data requires multiple computational steps which affect the outcome of a final data set. Here we use mock communities to describe the effects of parameter adjustments for raw sequence quality filtering, picking operational taxonomic units (OTUs), taxonomic assignment, and OTU table filtering as implemented in the popular microbial ecology analysis package, QIIME 1.9.1. We demonstrate a workflow optimization based upon this exploration, which we also apply to environmental samples. We found that quality filtering of raw data and filtering of OTU tables had large effects on observed OTU diversity. While all taxonomy assignment programs performed with similar accuracy, an appropriate choice of similarity threshold for defining OTUs depended on the method used for OTU picking. Our “default” analysis in QIIME overestimated mock community OTU diversity by at least a factor of ten. Our optimized analysis correctly characterized mock community taxonomic composition and improved the OTU diversity estimate, reducing overestimation to a factor of about two. Though observed relative abundances of mock community member taxa were approximately correct, most were still represented by multiple OTUs. Low-frequency OTUs conspecific to constituent mock community taxa were characterized by multiple substitution and indel errors and the presence of a low-quality base call resulting in sequence truncation during quality filtering. Low-quality base calls were observed at “G” positions most of the time, and were also associated with a preceding “TTT” trinucleotide motif. Environmental diversity estimates were reduced by about 40% from 2508 to 1533 OTUs when comparing output from the default and optimized workflows. 
We attribute this reduction in observed diversity to the removal of erroneous sequences from the data set. Our results indicate that both strict quality filtering of raw sequencing data and careful filtering of raw OTU tables are important steps for accurately estimating microbial community diversity.

Exact sequence variants should replace operational taxonomic units in marker gene data analysis

10.1101/113597 ◽

2017 ◽

Cited By ~ 7

Author(s):

Benjamin J Callahan ◽

Paul J McMurdie ◽

Susan P Holmes

Keyword(s):

De Novo ◽

Marker Gene ◽

Taxonomic Resolution ◽

Reference Database ◽

Sequence Variants ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

The Status ◽

Reference Databases ◽

Gene Data

Abstract: Recent advances have made it possible to analyze high-throughput marker-gene sequencing data without resorting to the customary construction of molecular operational taxonomic units (OTUs): clusters of sequencing reads that differ by less than a fixed dissimilarity threshold. New methods control errors sufficiently that sequence variants (SVs) can be resolved exactly, down to the level of single-nucleotide differences over the sequenced gene region. The benefits of finer taxonomic resolution are immediately apparent, and arguments for SV methods have focused on their improved resolution. Less obvious, but we believe more important, are the broad benefits deriving from the status of SVs as consistent labels with intrinsic biological meaning identified independently from a reference database. Here we discuss how those features grant SVs the combined advantages of closed-reference OTUs (including computational costs that scale linearly with study size, simple merging between independently processed datasets, and forward prediction) and of de novo OTUs (including accurate diversity measurement and applicability to communities lacking deep coverage in reference databases). We argue that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that SVs should replace OTUs as the standard unit of marker gene analysis and reporting.
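
The contrast between threshold-based OTUs and exact sequence variants can be made concrete with a small Python sketch. It rests on strong simplifying assumptions that are not part of the paper: reads are taken to be error-free and of equal length, dissimilarity is a plain Hamming fraction, and the greedy centroid clustering with a 3% threshold is a generic stand-in rather than any specific OTU picker. Real SV methods also include an error-control (denoising) step that is omitted here.

```python
from collections import Counter


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.03):
    """Greedy clustering: a sequence joins the first centroid closer than threshold."""
    centroids = []
    labels = {}
    # Visit the most abundant sequences first so that they become centroids.
    for seq, _ in Counter(reads).most_common():
        for centroid in centroids:
            if hamming_fraction(seq, centroid) < threshold:
                labels[seq] = centroid
                break
        else:
            centroids.append(seq)
            labels[seq] = seq
    return labels


def exact_variants(reads):
    """Exact variants: every distinct sequence is its own label, with its count."""
    return Counter(reads)


# Two 40-nt sequences that differ at a single position (2.5% dissimilarity).
variant_a = "ACGT" * 10
variant_b = variant_a[:-1] + "A"
reads = [variant_a] * 5 + [variant_b] * 3

otu_labels = greedy_otus(reads)
sv_counts = exact_variants(reads)
```

Here the two variants collapse into a single OTU, while the exact-variant view keeps both. Because the variant label is the sequence itself, it stays the same across independently processed datasets.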

RESCRIPt: Reproducible sequence taxonomy reference database management

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009581 ◽

2021 ◽

Vol 17 (11) ◽

pp. e1009581

Author(s):

Michael S. Robeson ◽

Devon R. O’Rourke ◽

Benjamin D. Kaehler ◽

Michal Ziemski ◽

Matthew R. Dillon ◽

...

Keyword(s):

Nucleotide Sequence ◽

Marker Gene ◽

Environmental DNA ◽

Reference Sequence ◽

Genome Comparison ◽

Reference Database ◽

Reference Databases ◽

NCBI RefSeq ◽

Quality Filtering ◽

Metagenome Sequencing

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.

RESCRIPt: Reproducible sequence taxonomy reference database management for the masses

10.1101/2020.10.05.326504 ◽

2020 ◽

Cited By ~ 1

Author(s):

Michael S. Robeson ◽

Devon R. O’Rourke ◽

Benjamin D. Kaehler ◽

Michal Ziemski ◽

Matthew R. Dillon ◽

...

Keyword(s):

Nucleotide Sequence ◽

Marker Gene ◽

Environmental DNA ◽

Reference Sequence ◽

Genome Comparison ◽

Reference Database ◽

Reference Databases ◽

Quality Filtering ◽

Metagenome Sequencing ◽

The Masses

Background: Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a software package for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. Results: To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA, and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. Conclusions: RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications.
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.
