Optimization of 16S amplicon analysis using mock communities: implications for estimating community diversity

10.7287/peerj.preprints.2196v3 ◽

2016 ◽

Cited By ~ 1

Author(s):

Andrew Krohn ◽

Bo Stevens ◽

Adam Robbins-Pianka ◽

Matthew Belus ◽

Gerard J Allan ◽

...

Keyword(s):

DNA Sequences ◽

Marker Gene ◽

Community Diversity ◽

Sequencing Data ◽

Mock Community ◽

Taxonomic Assignment ◽

Data Set ◽

Environmental Diversity ◽

Quality Filtering ◽

Mock Communities

The diversity of complex microbial communities can be rapidly assessed by high-throughput DNA sequencing of marker gene (e.g., 16S) PCR amplicon pools, often yielding many thousands of DNA sequences per sample. However, analysis of such community amplicon sequencing data requires multiple computational steps which affect the outcome of a final data set. Here we use mock communities to describe the effects of parameter adjustments for raw sequence quality filtering, picking operational taxonomic units (OTUs), taxonomic assignment, and OTU table filtering as implemented in the popular microbial ecology analysis package, QIIME 1.9.1. We demonstrate a workflow optimization based upon this exploration, which we also apply to environmental samples. We found that quality filtering of raw data and filtering of OTU tables had large effects on observed OTU diversity. While all taxonomy assignment programs performed with similar accuracy, an appropriate choice of similarity threshold for defining OTUs depended on the method used for OTU picking. Our “default” analysis in QIIME overestimated mock community OTU diversity by at least a factor of ten. Our optimized analysis correctly characterized mock community taxonomic composition and improved the OTU diversity estimate, reducing overestimation to a factor of about two. Though observed relative abundances of mock community member taxa were approximately correct, most were still represented by multiple OTUs. Low-frequency OTUs conspecific to constituent mock community taxa were characterized by multiple substitution and indel errors and the presence of a low-quality base call resulting in sequence truncation during quality filtering. Low-quality base calls were observed at “G” positions most of the time, and were also associated with a preceding “TTT” trinucleotide motif. Environmental diversity estimates were reduced by about 40% from 2508 to 1533 OTUs when comparing output from the default and optimized workflows. 
We attribute this reduction in observed diversity to the removal of erroneous sequences from the data set. Our results indicate that both strict quality filtering of raw sequencing data and careful filtering of raw OTU tables are important steps for accurately estimating microbial community diversity.
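
The reported reduction and the general idea of OTU table filtering can be sketched in Python. This is a minimal illustration, not the authors' QIIME 1.9.1 workflow: the function names, the toy table, and the 0.5% minimum-abundance threshold are assumptions made for the example. Only the OTU counts 2508 and 1533 come from the abstract.

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def filter_low_abundance_otus(otu_counts, min_fraction):
    """Drop OTUs whose share of all reads is below `min_fraction`.

    `otu_counts` maps an OTU id to its total read count (hypothetical toy table).
    """
    total = sum(otu_counts.values())
    return {otu: n for otu, n in otu_counts.items() if n / total >= min_fraction}


# OTU counts reported in the abstract for the environmental samples.
reduction = percent_reduction(2508, 1533)
print(f"{reduction:.1f}% fewer OTUs")  # 38.9% fewer OTUs

# Hypothetical table: two abundant taxa plus two low-frequency OTUs
# of the kind that sequencing errors can produce.
toy_table = {"otu_a": 6000, "otu_b": 3980, "otu_c": 15, "otu_d": 5}
filtered = filter_low_abundance_otus(toy_table, min_fraction=0.005)
print(sorted(filtered))  # ['otu_a', 'otu_b']
```

The computed 38.9% matches the "about 40%" stated in the abstract. In the toy table, filtering halves the OTU count while discarding only 20 of 10,000 reads, which mirrors why table filtering can change observed diversity so strongly.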

Optimization of 16S amplicon analysis using mock communities: implications for estimating community diversity

10.7287/peerj.preprints.2196v2 ◽

2016 ◽

Cited By ~ 1

Author(s):

Andrew Krohn ◽

Bo Stevens ◽

Adam Robbins-Pianka ◽

Matthew Belus ◽

Gerard J Allan ◽

...

Keyword(s):

Amplicon Sequencing ◽

Community Diversity ◽

Accurate Estimation ◽

Marker Genes ◽

Sequencing Data ◽

Mock Community ◽

Data Set ◽

Environmental Diversity ◽

Quality Filtering ◽

Mock Communities

Diversity of complex microbial communities can be rapidly assessed by community amplicon sequencing of marker genes (e.g., 16S), often yielding many thousands of DNA sequences per sample. However, analysis of community amplicon sequencing data requires multiple computational steps which affect the outcome of a final data set. Here we use mock communities to describe the effects of parameter adjustments for raw sequence quality filtering, picking operational taxonomic units (OTUs), taxonomic assignment, and OTU table filtering as implemented in QIIME 1.9.1. We demonstrate a workflow optimization based upon this exploration which we also apply to environmental samples. We found that quality filtering of raw data and filtering of OTU tables had large effects on observed OTU diversity. While all taxonomy assigners performed with similar accuracy, an appropriate choice of similarity threshold for defining OTUs depended on the method used for OTU picking. Our “default” analysis in QIIME overestimated mock community diversity by at least a factor of ten, compared to the optimized analysis which correctly characterized the taxonomic composition of the mock communities while still overestimating OTU diversity by about a factor of two. Though observed relative abundances of mock community member taxa were approximately correct, most were still represented by multiple OTUs. Low-frequency OTUs conspecific to constituent mock community taxa were characterized by multiple substitution and indel errors and the presence of a low quality base call resulting in sequence truncation during quality filtering. Low quality base calls were observed at “G” positions most of the time, and were also associated with a preceding “TTT” trinucleotide motif. Environmental diversity estimates were reduced by about 40% from 2508 to 1533 OTUs when comparing output from the default and optimized workflows. 
We attribute this reduction in observed diversity to the removal of erroneous sequences from the data set. Our results indicate that both strict quality filtering of raw sequencing data and careful filtering of raw OTU tables are important steps for accurate estimation of microbial community diversity.

Optimization of 16S amplicon analysis using mock communities: implications for estimating community diversity

10.7287/peerj.preprints.2196v1 ◽

2016 ◽

Author(s):

Andrew Krohn ◽

Bo Stevens ◽

Adam Robbins-Pianka ◽

Matthew Belus ◽

Gerard J Allan ◽

...

Keyword(s):

Amplicon Sequencing ◽

Community Diversity ◽

Accurate Estimation ◽

Marker Genes ◽

Sequencing Data ◽

Mock Community ◽

Data Set ◽

Environmental Diversity ◽

Quality Filtering ◽

Mock Communities

Diversity of complex microbial communities can be rapidly assessed by community amplicon sequencing of marker genes (e.g., 16S), often yielding many thousands of DNA sequences per sample. However, analysis of community amplicon sequencing data requires multiple computational steps which affect the outcome of a final data set. Here we use mock communities to describe the effects of parameter adjustments for raw sequence quality filtering, picking operational taxonomic units (OTUs), taxonomic assignment, and OTU table filtering as implemented in QIIME 1.9.1. We demonstrate a workflow optimization based upon this exploration which we also apply to environmental samples. We found that quality filtering of raw data and filtering of OTU tables had large effects on observed OTU diversity. While all taxonomy assigners performed with similar accuracy, an appropriate choice of similarity threshold for defining OTUs depended on the method used for OTU picking. Our “default” analysis in QIIME overestimated mock community diversity by at least a factor of ten, compared to the optimized analysis which correctly characterized the taxonomic composition of the mock communities while still overestimating OTU diversity by about a factor of two. Though observed relative abundances of mock community member taxa were approximately correct, most were still represented by multiple OTUs. Low-frequency OTUs conspecific to constituent mock community taxa were characterized by multiple substitution and indel errors and the presence of a low quality base call resulting in sequence truncation during quality filtering. Low quality base calls were observed at “G” positions most of the time, and were also associated with a preceding “TTT” trinucleotide motif. Environmental diversity estimates were reduced by about 40% from 2508 to 1533 OTUs when comparing output from the default and optimized workflows. 
We attribute this reduction in observed diversity to the removal of erroneous sequences from the data set. Our results indicate that both strict quality filtering of raw sequencing data and careful filtering of raw OTU tables are important steps for accurate estimation of microbial community diversity.

mockrobiota: a Public Resource for Microbiome Bioinformatics Benchmarking

mSystems ◽

10.1128/msystems.00062-16 ◽

2016 ◽

Vol 1 (5) ◽

Cited By ~ 48

Author(s):

Nicholas A. Bokulich ◽

Jai Ram Rideout ◽

William G. Mercurio ◽

Arron Shiffer ◽

Benjamin Wolfe ◽

...

Keyword(s):

Marker Gene ◽

Microbial Community Analysis ◽

Mock Community ◽

Data Set ◽

Public Resource ◽

Source Data ◽

Community Data ◽

Dynamic Resource ◽

And Training ◽

Mock Communities

ABSTRACT Mock communities are an important tool for validating, optimizing, and comparing bioinformatics methods for microbial community analysis. We present mockrobiota, a public resource for sharing, validating, and documenting mock community data resources, available at http://caporaso-lab.github.io/mockrobiota/ . The materials contained in mockrobiota include data set and sample metadata, expected composition data (taxonomy or gene annotations or reference sequences for mock community members), and links to raw data (e.g., raw sequence data) for each mock community data set. mockrobiota does not supply physical sample materials directly, but the data set metadata included for each mock community indicate whether physical sample materials are available. At the time of this writing, mockrobiota contains 11 mock community data sets with known species compositions, including bacterial, archaeal, and eukaryotic mock communities, analyzed by high-throughput marker gene sequencing. IMPORTANCE The availability of standard and public mock community data will facilitate ongoing method optimizations, comparisons across studies that share source data, and greater transparency and access and eliminate redundancy. These are also valuable resources for bioinformatics teaching and training. This dynamic resource is intended to expand and evolve to meet the changing needs of the omics community.

Accurate Estimation of Fungal Diversity and Abundance through Improved Lineage-Specific Primers Optimized for Illumina Amplicon Sequencing

Applied and Environmental Microbiology ◽

10.1128/aem.02576-16 ◽

2016 ◽

Vol 82 (24) ◽

pp. 7217-7226 ◽

Cited By ~ 118

Author(s):

D. Lee Taylor ◽

William A. Walters ◽

Niall J. Lennon ◽

James Bochicchio ◽

Andrew Krohn ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Large Subunit ◽

Marker Gene ◽

Soil Samples ◽

Gene Copy ◽

Accurate Estimation ◽

rRNA Gene ◽

Data Set ◽

Mock Communities

ABSTRACT While high-throughput sequencing methods are revolutionizing fungal ecology, recovering accurate estimates of species richness and abundance has proven elusive. We sought to design internal transcribed spacer (ITS) primers and an Illumina protocol that would maximize coverage of the kingdom Fungi while minimizing nontarget eukaryotes. We inspected alignments of the 5.8S and large subunit (LSU) ribosomal genes and evaluated potential primers using PrimerProspector. We tested the resulting primers using tiered-abundance mock communities and five previously characterized soil samples. We recovered operational taxonomic units (OTUs) belonging to all 8 members in both mock communities, despite DNA abundances spanning 3 orders of magnitude. The expected and observed read counts were strongly correlated (r = 0.94 to 0.97). However, several taxa were consistently over- or underrepresented, likely due to variation in rRNA gene copy numbers. The Illumina data resulted in clustering of soil samples identical to that obtained with Sanger sequence clone library data using different primers. Furthermore, the two methods produced distance matrices with a Mantel correlation of 0.92. Nonfungal sequences comprised less than 0.5% of the soil data set, with most attributable to vascular plants. Our results suggest that high-throughput methods can produce fairly accurate estimates of fungal abundances in complex communities. Further improvements might be achieved through corrections for rRNA copy number and utilization of standardized mock communities. IMPORTANCE Fungi play numerous important roles in the environment. Improvements in sequencing methods are providing revolutionary insights into fungal biodiversity, yet accurate estimates of the number of fungal species (i.e., richness) and their relative abundances in an environmental sample (e.g., soil, roots, water, etc.) remain difficult to obtain. We present improved methods for high-throughput Illumina sequencing of the species-diagnostic fungal ribosomal marker gene that improve the accuracy of richness and abundance estimates. The improvements include new PCR primers and library preparation, validation using a known mock community, and bioinformatic parameter tuning.
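
The correlation between expected and observed read counts mentioned in the abstract can be illustrated with a short Python sketch. Everything here is an assumption for illustration: the abundance values are invented, and the log10 transform is one plausible way to handle counts spanning several orders of magnitude. The abstract does not say how the authors computed r.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical tiered mock community: 8 members, abundances spanning
# 3 orders of magnitude (made-up numbers, not data from the paper).
expected = [1000, 1000, 100, 100, 10, 10, 1, 1]
observed = [1400, 700, 150, 60, 12, 9, 2, 1]

# Compare on a log scale so the most abundant members do not dominate.
r = pearson_r([math.log10(v) for v in expected],
              [math.log10(v) for v in observed])
print(f"r = {r:.2f}")
```

A high r shows that relative ranks and magnitudes are preserved, but it can coexist with consistent over- or underrepresentation of individual taxa, as the abstract notes for rRNA gene copy number variation.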

Accuracy of microbial community diversity estimated by closed- and open-reference OTUs

PeerJ ◽

10.7717/peerj.3889 ◽

2017 ◽

Vol 5 ◽

pp. e3889 ◽

Cited By ~ 69

Author(s):

Robert C. Edgar

Keyword(s):

Ribosomal RNA ◽

De Novo ◽

Community Diversity ◽

Reference Database ◽

Mock Community ◽

Variable Regions ◽

Operational Taxonomic Units ◽

Sequencing Technologies ◽

Generation Sequencing ◽

Mock Communities

Next-generation sequencing of 16S ribosomal RNA is widely used to survey microbial communities. Sequences are typically assigned to Operational Taxonomic Units (OTUs). Closed- and open-reference OTU assignment matches reads to a reference database at 97% identity (closed), then clusters unmatched reads using a de novo method (open). Implementations of these methods in the QIIME package were tested on several mock community datasets with 20 strains using different sequencing technologies and primers. Richness (number of reported OTUs) was often greatly exaggerated, with hundreds or thousands of OTUs generated on Illumina datasets. Between-sample diversity was also found to be highly exaggerated in many cases, with weighted Jaccard distances between identical mock samples often close to one, indicating very low similarity. Non-overlapping hyper-variable regions in 70% of species were assigned to different OTUs. On mock communities with Illumina V4 reads, 56% to 88% of predicted genus names were false positives. Biological inferences obtained using these methods are therefore not reliable.
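
To make the between-sample result concrete, the following Python sketch computes a weighted Jaccard distance using the common min/max definition (one minus the sum of per-OTU minima divided by the sum of per-OTU maxima). This is an assumption about the formula, which the abstract does not spell out, and the sample tables are invented for illustration.

```python
def weighted_jaccard_distance(sample_a, sample_b):
    """Weighted Jaccard distance between two OTU count tables.

    Each sample maps an OTU id to an abundance. Returns a value in [0, 1]:
    0 for identical samples, 1 for samples with no OTU in common.
    """
    otus = set(sample_a) | set(sample_b)
    shared = sum(min(sample_a.get(o, 0), sample_b.get(o, 0)) for o in otus)
    union = sum(max(sample_a.get(o, 0), sample_b.get(o, 0)) for o in otus)
    if union == 0:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - shared / union


# Two replicates of the same hypothetical community, clustered consistently.
rep1 = {"otu1": 50, "otu2": 50}
rep2 = {"otu1": 50, "otu2": 50}
print(weighted_jaccard_distance(rep1, rep2))  # 0.0

# Same community, but the second strain was assigned to a different OTU id
# in the second replicate, so the replicates look far apart.
rep2_split = {"otu1": 50, "otu3": 50}
print(round(weighted_jaccard_distance(rep1, rep2_split), 3))  # 0.667
```

The second case shows how inconsistent OTU assignment alone can push the distance between identical samples toward one, even when the underlying composition has not changed.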

SMaSH: A scalable, general marker gene identification framework for single-cell RNA sequencing and Spatial Transcriptomics

10.1101/2021.04.08.438978 ◽

2021 ◽

Author(s):

Michael E Nelson ◽

Simone G Riva ◽

Ann Cvejic

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Marker Gene ◽

Marker Genes ◽

Sequencing Data ◽

Computational Framework ◽

Data Set ◽

Spatially Resolved ◽

Single Cell RNA Sequencing ◽

The Given

Spatial transcriptomics is revolutionising the study of single-cell RNA and tissue-wide cell heterogeneity, but few robust methods exist for connecting spatially resolved cells to so-called marker genes from single-cell RNA sequencing, which generate significant insight gleaned from spatial methods. Here we present SMaSH, a general computational framework for extracting key marker genes from single-cell RNA sequencing data for spatial transcriptomics approaches. SMaSH extracts robust and biologically well-motivated marker genes, which characterise the given data set better than existing and limited computational approaches for global marker gene calculation.

BTW – Bioinformatics Through Windows: an easy-to-install package to analyze marker gene data

10.7287/peerj.preprints.26581v1 ◽

2018 ◽

Author(s):

Daniel Morais ◽

Luiz Roesch ◽

Marc Redmile-Gordon ◽

Fausto Santos ◽

Petr Baldrian ◽

...

Keyword(s):

Next Generation Sequencing ◽

Operating Systems ◽

Marker Gene ◽

Command Line ◽

Taxonomic Assignment ◽

New Challenges ◽

Quality Filtering ◽

Next Generation Sequencing (NGS) ◽

Gene Data ◽

Generation Sequencing

Recent advances in Next-Generation Sequencing (NGS) make comparative analyses of the composition and diversity of whole microbial communities possible at far greater depth than ever before. This brings new challenges, such as an increased dependence on computation to process these huge datasets. The demand on system resources usually requires migrating from Windows to Linux-based operating systems and prior familiarity with command-line interfaces. To overcome this barrier, we developed Bioinformatics Through Windows (BTW), a fully automated and easy-to-install package as well as a complete, easy-to-follow pipeline for microbial metataxonomic analysis operating in the Windows Subsystem for Linux (WSL). BTW combines several open-access tools for processing marker gene data, including 16S rRNA, bringing the user from raw sequencing reads to diversity-related conclusions. It includes data quality filtering, clustering, taxonomic assignment and further statistical analyses, directly in WSL, avoiding the need to migrate from Windows to Linux. BTW is expected to boost the use of NGS amplicon data by facilitating rapid access to bioinformatics tools for Windows users. BTW is a Bash script and is available on GitHub ( https://github.com/vpylro/BTW ). The package is freely available for noncommercial users.

BTW – Bioinformatics Through Windows: an easy-to-install package to analyze marker gene data

10.7287/peerj.preprints.26581 ◽

2018 ◽

Author(s):

Daniel Morais ◽

Luiz Roesch ◽

Marc Redmile-Gordon ◽

Fausto Santos ◽

Petr Baldrian ◽

...

Keyword(s):

Next Generation Sequencing ◽

Operating Systems ◽

Marker Gene ◽

Command Line ◽

Taxonomic Assignment ◽

New Challenges ◽

Quality Filtering ◽

Next Generation Sequencing (NGS) ◽

Gene Data ◽

Generation Sequencing

Recent advances in Next-Generation Sequencing (NGS) make comparative analyses of the composition and diversity of whole microbial communities possible at far greater depth than ever before. This brings new challenges, such as an increased dependence on computation to process these huge datasets. The demand on system resources usually requires migrating from Windows to Linux-based operating systems and prior familiarity with command-line interfaces. To overcome this barrier, we developed Bioinformatics Through Windows (BTW), a fully automated and easy-to-install package as well as a complete, easy-to-follow pipeline for microbial metataxonomic analysis operating in the Windows Subsystem for Linux (WSL). BTW combines several open-access tools for processing marker gene data, including 16S rRNA, bringing the user from raw sequencing reads to diversity-related conclusions. It includes data quality filtering, clustering, taxonomic assignment and further statistical analyses, directly in WSL, avoiding the need to migrate from Windows to Linux. BTW is expected to boost the use of NGS amplicon data by facilitating rapid access to bioinformatics tools for Windows users. BTW is a Bash script and is available on GitHub ( https://github.com/vpylro/BTW ). The package is freely available for noncommercial users.

CoMA – an intuitive and user-friendly pipeline for amplicon-sequencing data analysis

PLoS ONE ◽

10.1371/journal.pone.0243241 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0243241

Author(s):

Sebastian Hupfauf ◽

Mohammad Etemadi ◽

Marina Fernández-Delgado Juárez ◽

María Gómez-Brandón ◽

Heribert Insam ◽

...

Keyword(s):

Operating System ◽

Data Analysis ◽

Amplicon Sequencing ◽

Sequencing Data ◽

Taxonomic Assignment ◽

Benchmark Test ◽

Next Generation Sequencing (NGS) ◽

User Friendly ◽

NGS Data ◽

Mock Communities

In recent years, there has been a veritable boost in next-generation sequencing (NGS) of gene amplicons in biological and medical studies. Huge amounts of data are produced and need to be analyzed adequately. Various online and offline analysis tools are available; however, most of them require extensive expertise in computer science or bioinformatics, and often a Linux-based operating system. Here, we introduce “CoMA–Comparative Microbiome Analysis” as a free and intuitive analysis pipeline for amplicon-sequencing data, compatible with any common operating system. Moreover, the tool offers various useful services including data pre-processing, quality checking, clustering to operational taxonomic units (OTUs), taxonomic assignment, data post-processing, data visualization, and statistical appraisal. The workflow results in highly esthetic and publication-ready graphics, as well as output files in standardized formats (e.g. tab-delimited OTU-table, BIOM, NEWICK tree) that can be used for more sophisticated analyses. The CoMA output was validated by a benchmark test, using three mock communities with different sample characteristics (primer set, amplicon length, diversity). The performance was compared with that of Mothur, QIIME and QIIME2-DADA2, popular packages for NGS data analysis. Furthermore, the functionality of CoMA is demonstrated on a practical example, investigating microbial communities from three different soils (grassland, forest, swamp). All tools performed well in the benchmark test and were able to reveal the majority of all genera in the mock communities. For the soil samples as well, the results of CoMA were congruent with those of the other pipelines, in particular for the key microbial players.
