Exact sequence variants should replace operational taxonomic units in marker gene data analysis

Mapping Intimacies ◽

10.1101/113597 ◽

2017 ◽

Cited By ~ 7

Author(s):

Benjamin J Callahan ◽

Paul J McMurdie ◽

Susan P Holmes

Keyword(s):

De Novo ◽

Marker Gene ◽

Taxonomic Resolution ◽

Reference Database ◽

Sequence Variants ◽

Sequencing Data ◽

Operational Taxonomic Units ◽

The Status ◽

Reference Databases ◽

Gene Data

Abstract Recent advances have made it possible to analyze high-throughput marker-gene sequencing data without resorting to the customary construction of molecular operational taxonomic units (OTUs): clusters of sequencing reads that differ by less than a fixed dissimilarity threshold. New methods control errors sufficiently that sequence variants (SVs) can be resolved exactly, down to the level of single-nucleotide differences over the sequenced gene region. The benefits of finer taxonomic resolution are immediately apparent, and arguments for SV methods have focused on their improved resolution. Less obvious, but we believe more important, are the broad benefits deriving from the status of SVs as consistent labels with intrinsic biological meaning identified independently from a reference database. Here we discuss how those features grant SVs the combined advantages of closed-reference OTUs (including computational costs that scale linearly with study size, simple merging between independently processed datasets, and forward prediction) and of de novo OTUs (including accurate diversity measurement and applicability to communities lacking deep coverage in reference databases). We argue that the improvements in reusability, reproducibility and comprehensiveness are sufficiently great that SVs should replace OTUs as the standard unit of marker gene analysis and reporting.
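
The "simple merging between independently processed datasets" mentioned in the abstract can be illustrated with a minimal sketch. It assumes each study stores its counts as a dictionary keyed by the exact sequence string; the function name `merge_sv_tables` and the toy sequences are hypothetical and do not come from the paper.

```python
from collections import Counter


def merge_sv_tables(table_a, table_b):
    """Merge two {sample: {sequence: count}} tables from separate studies.

    Because the exact sequence is itself the label, the two tables can be
    joined directly, with no re-clustering and no reference database.
    """
    merged = {}
    for table in (table_a, table_b):
        for sample, counts in table.items():
            if sample in merged:
                raise ValueError(f"duplicate sample name: {sample}")
            merged[sample] = Counter(counts)
    # The shared feature set is the union of all observed sequences.
    features = sorted({seq for counts in merged.values() for seq in counts})
    # A Counter returns 0 for sequences that a sample does not contain.
    matrix = {s: [c[seq] for seq in features] for s, c in merged.items()}
    return features, matrix


# Toy, made-up data for two independently processed studies.
study_1 = {"s1": {"ACGTAC": 10, "ACGTAT": 3}}
study_2 = {"s2": {"ACGTAT": 7, "TTGCAA": 2}}
features, matrix = merge_sv_tables(study_1, study_2)
```

De novo OTU labels, by contrast, depend on the dataset in which they were clustered, so a join of this kind would not be meaningful without reprocessing the pooled reads.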

Strategies for the bioinformatic treatment of high-throughput sequencing data for diatom studies and bioassessment

ARPHA Conference Abstracts ◽

10.3897/aca.4.e65047 ◽

2021 ◽

Vol 4 ◽

Author(s):

Kálmán Tapolczai ◽

François Keck ◽

Valentin Vasselon ◽

Géza Selmeczy ◽

Maria Kahlert ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Sequence Similarity ◽

Test Sample ◽

Taxonomic Resolution ◽

Training Dataset ◽

Sequence Variants ◽

Clustering Methods ◽

Ecological Studies ◽

Sequencing Data

Diatom biomonitoring and ecological studies can greatly benefit from DNA metabarcoding compared to conventional microscopical analysis by potentially providing more reliable and accurate data in a cost- and time-efficient way. A conventional strategy for the bioinformatic treatment of sequencing data involves the clustering of quality filtered sequences into Operational Taxonomic Units (OTUs) based on a global sequence similarity, and their assignment to taxonomy using a reference library. Then, the obtained species lists of the successfully assigned taxa are used for subsequent analyses or quality index calculation. However, the high diversity of bioinformatic methods and parameters makes inter-study comparison difficult, especially because OTUs are specific to a given study. Clustering sequences into OTUs aims to reduce the biasing effect of sequencing artefacts and to reach an approximate species level delimitation at the price of potentially grouping together sequences with different ecology. A similar bias occurs when sequences that differ from each other by their ecological preference are assigned to the same taxon. The incompleteness of reference libraries can further introduce a bias by not taking into account unassigned sequences, thus losing the ecological information they possess. In order to overcome these biases, our studies tested new approaches on de novo developed diatom indices based on periphytic samples collected from streams in France and Hungary. Index development was performed with the leave-one-out cross validation (LOOCV) technique by building a model on a training dataset containing n-1 samples and testing it on the remaining test sample. Test values were correlated with a reference environmental gradient. The model was based on the calculation of optimum and tolerance of taxonomic units along the reference gradient and a modified Zelinka-Marvan diatom index equation.
Taxonomic units tested in the studies were morphospecies, OTUs (95% similarity threshold), Individual Sequence Units (ISUs, via minimal bioinformatic quality filtering) and Exact Sequence Variants (ESVs, via DADA2 denoising algorithm). The “clustering-free” approach (ISU- and ESV-based indices) performed better than the OTU-based one, providing a fine taxonomic resolution where ecological differences between genetically close sequence variants could be detected. Thus, these indices are better suited to a standardized and comparable routine bioassessment. The “taxonomy-free” approach revealed the ecological preferences for those molecular taxonomic units (ISUs/ESVs) that otherwise either (i) would have been assigned to the same taxa due to genetic similarity, or (ii) would not have been recognized because of their absence from the reference libraries. However, we also found that taxonomic information cannot be neglected in ecological studies when the presence of organisms under particular environmental conditions is to be explained or interpreted, e.g., via the traits they possess. New types of clustering methods are welcome in the future of biomonitoring where the delimitation of taxonomic units should be refined based on a higher emphasis on their ecology rather than on morphological or genetic criteria.
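
The leave-one-out procedure described in this abstract can be sketched as follows. This is a simplified illustration, not the authors' model. It assumes that the optimum of each taxonomic unit is its abundance-weighted mean position on the gradient, and it uses the basic Zelinka-Marvan form (an abundance-weighted mean of optima) with the tolerance-based indicator weight fixed to 1. The "modified" equation of the study is not given in the abstract. All function names and the toy data are hypothetical.

```python
def optima(samples, gradient):
    """Abundance-weighted mean gradient value for each taxonomic unit."""
    num, den = {}, {}
    for counts, g in zip(samples, gradient):
        for unit, abundance in counts.items():
            num[unit] = num.get(unit, 0.0) + abundance * g
            den[unit] = den.get(unit, 0.0) + abundance
    return {u: num[u] / den[u] for u in num if den[u] > 0}


def index_value(counts, opt):
    """Abundance-weighted mean optimum of the units present in one sample.

    Units absent from the training data have no optimum and are skipped.
    """
    known = [(a, opt[u]) for u, a in counts.items() if u in opt]
    total = sum(a for a, _ in known)
    if total == 0:
        return None
    return sum(a * s for a, s in known) / total


def loocv(samples, gradient):
    """Train on n-1 samples and predict the index of the held-out sample."""
    predictions = []
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_g = gradient[:i] + gradient[i + 1:]
        predictions.append(index_value(samples[i], optima(train, train_g)))
    return predictions


# Toy, made-up data: three samples along a reference gradient.
samples = [{"A": 10}, {"A": 5, "B": 5}, {"B": 10}]
gradient = [1.0, 2.0, 4.0]
predictions = loocv(samples, gradient)
```

The held-out predictions would then be correlated with the reference gradient. The keys of each sample can be morphospecies, OTUs, ISUs or ESVs, which is what makes the comparison between unit types possible.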

Exact sequence variants should replace operational taxonomic units in marker-gene data analysis

The ISME Journal ◽

10.1038/ismej.2017.119 ◽

2017 ◽

Vol 11 (12) ◽

pp. 2639-2643 ◽

Cited By ~ 723

Author(s):

Benjamin J Callahan ◽

Paul J McMurdie ◽

Susan P Holmes

Keyword(s):

Data Analysis ◽

Exact Sequence ◽

Marker Gene ◽

Sequence Variants ◽

Operational Taxonomic Units ◽

Gene Data

Synthetic Sequencing Standards: A Guide to Database Choice for Rumen Microbiota Amplicon Sequencing Analysis

Frontiers in Microbiology ◽

10.3389/fmicb.2020.606825 ◽

2020 ◽

Vol 11 ◽

Author(s):

Paul E. Smith ◽

Sinead M. Waters ◽

Ruth Gómez Expósito ◽

Hauke Smidt ◽

Ciara A. Carberry ◽

...

Keyword(s):

High Throughput Sequencing ◽

Cost Effective ◽

Amplicon Sequencing ◽

Gas Production ◽

Reference Database ◽

Specific Reference ◽

Sequencing Analysis ◽

Sequencing Data ◽

Rumen Microbiota ◽

Reference Databases

Our understanding of complex microbial communities, such as those residing in the rumen, has drastically advanced through the use of high throughput sequencing (HTS) technologies. Indeed, with the use of barcoded amplicon sequencing, it is now cost effective and computationally feasible to identify individual rumen microbial genera associated with ruminant livestock nutrition, genetics, performance and greenhouse gas production. However, across all disciplines of microbial ecology, there is currently little reporting of the use of internal controls for validating HTS results. Furthermore, there is little consensus on the most appropriate reference database for analyzing rumen microbiota amplicon sequencing data. Therefore, in this study, a synthetic rumen-specific sequencing standard was used to assess the effects of database choice on results obtained from rumen microbial amplicon sequencing. Four DADA2 reference training sets (RDP, SILVA, GTDB, and RefSeq + RDP) were compared to assess their ability to correctly classify sequences included in the rumen-specific sequencing standard. In addition, two thresholds of phylogenetic bootstrapping, 50 and 80, were applied to investigate the effect of increasing stringency. Sequence classification differences were apparent amongst the databases. For example, the classification of Clostridium differed between all databases, thus highlighting the need for a consistent approach to nomenclature amongst different reference databases. It is hoped the effect of database on taxonomic classification observed in this study will encourage research groups across various microbial disciplines to develop and routinely use their own microbiome-specific reference standard to validate analysis pipelines and database choice.
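
The effect of raising the bootstrap threshold from 50 to 80 can be illustrated with a small sketch. It assumes that a classifier returns one bootstrap support value per rank and that an assignment is cut at the first rank whose support falls below the threshold. This is an illustration of the general idea, not the DADA2 implementation, and the lineage and support values are made up.

```python
def truncate_assignment(lineage, supports, min_boot):
    """Keep ranks from the top down until support drops below min_boot."""
    kept = []
    for name, support in zip(lineage, supports):
        if support < min_boot:
            break
        kept.append(name)
    return kept


# Hypothetical lineage (kingdom to genus) with made-up bootstrap supports.
lineage = ["Bacteria", "Firmicutes", "Clostridia",
           "Clostridiales", "Clostridiaceae", "Clostridium"]
supports = [100, 98, 90, 85, 70, 55]

lenient = truncate_assignment(lineage, supports, min_boot=50)
strict = truncate_assignment(lineage, supports, min_boot=80)
```

With these example values the lenient threshold keeps the genus while the strict threshold stops at the order, which shows how stringency trades taxonomic depth for confidence.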

Genotyping and De Novo Discovery of Allelic Variants at the Brassicaceae Self-Incompatibility Locus from Short-Read Sequencing Data

Molecular Biology and Evolution ◽

10.1093/molbev/msz258 ◽

2019 ◽

Vol 37 (4) ◽

pp. 1193-1201 ◽

Cited By ~ 2

Author(s):

Mathieu Genete ◽

Vincent Castric ◽

Xavier Vekemans

Keyword(s):

De Novo ◽

Balancing Selection ◽

Methodological Approach ◽

Natural Populations ◽

Sequence Divergence ◽

Reference Database ◽

Self Incompatibility ◽

Sequencing Data ◽

Allelic Series ◽

S Alleles

Abstract Plant self-incompatibility (SI) is a genetic system that prevents selfing and enforces outcrossing. Because of strong balancing selection, the genes encoding SI are predicted to maintain extraordinarily high levels of polymorphism, both in terms of the number of functionally distinct S-alleles that segregate in SI species and in terms of their nucleotide sequence divergence. However, because of these two combined features, documenting polymorphism of these genes also presents important methodological challenges that have so far largely prevented the comprehensive analysis of complete allelic series in natural populations, and also precluded the obtention of complete genic sequences for many S-alleles. Here, we develop a powerful methodological approach based on a computationally optimized comparison of short Illumina sequencing reads from genomic DNA to a database of known nucleotide sequences of the extracellular domain of SRK (eSRK). By examining mapping patterns along the reference sequences, we obtain highly reliable predictions of S-genotypes from individuals collected from natural populations of Arabidopsis halleri. Furthermore, using a de novo assembly approach of the filtered short reads, we obtain full-length sequences of eSRK even when the initial sequence in the database was only partial, and we discover putative new SRK alleles that were not initially present in the database. When including those new alleles in the reference database, we were able to resolve the complete diploid SI genotypes of all individuals. Beyond the specific case of Brassicaceae S-alleles, our approach can be readily applied to other polymorphic loci, given reference allelic sequences are available.

Consistent, comprehensive and computationally efficient OTU definitions

10.7287/peerj.preprints.411 ◽

2014 ◽

Author(s):

Jai Ram Rideout ◽

Yan He ◽

Jose Antonio Navas-Molina ◽

William A Walters ◽

Luke K Ursell ◽

...

Keyword(s):

16S Rrna ◽

De Novo ◽

Sequence Data ◽

Marker Gene ◽

Community Analysis ◽

Microbial Community Analysis ◽

Reference Database ◽

Data Sets ◽

Computationally Efficient ◽

Sequencing Platforms

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that would be required by “classic” open-reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.

Accuracy of microbial community diversity estimated by closed- and open-reference OTUs

PeerJ ◽

10.7717/peerj.3889 ◽

2017 ◽

Vol 5 ◽

pp. e3889 ◽

Cited By ~ 69

Author(s):

Robert C. Edgar

Keyword(s):

Ribosomal Rna ◽

De Novo ◽

Community Diversity ◽

Reference Database ◽

Mock Community ◽

Variable Regions ◽

Operational Taxonomic Units ◽

Sequencing Technologies ◽

Generation Sequencing ◽

Mock Communities

Next-generation sequencing of 16S ribosomal RNA is widely used to survey microbial communities. Sequences are typically assigned to Operational Taxonomic Units (OTUs). Closed- and open-reference OTU assignment matches reads to a reference database at 97% identity (closed), then clusters unmatched reads using a de novo method (open). Implementations of these methods in the QIIME package were tested on several mock community datasets with 20 strains using different sequencing technologies and primers. Richness (number of reported OTUs) was often greatly exaggerated, with hundreds or thousands of OTUs generated on Illumina datasets. Between-sample diversity was also found to be highly exaggerated in many cases, with weighted Jaccard distances between identical mock samples often close to one, indicating very low similarity. Non-overlapping hyper-variable regions in 70% of species were assigned to different OTUs. On mock communities with Illumina V4 reads, 56% to 88% of predicted genus names were false positives. Biological inferences obtained using these methods are therefore not reliable.
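
For readers unfamiliar with the between-sample measure cited above, here is a minimal sketch of one common definition of the weighted Jaccard distance: one minus the ratio of summed minimum abundances to summed maximum abundances. It is an assumption that this matches the exact formula used in the paper. The function name and toy counts are hypothetical.

```python
def weighted_jaccard_distance(x, y):
    """Weighted Jaccard distance between two {otu: abundance} samples.

    Returns 0.0 for identical samples and 1.0 for samples sharing no OTU.
    """
    keys = set(x) | set(y)
    shared = sum(min(x.get(k, 0), y.get(k, 0)) for k in keys)
    total = sum(max(x.get(k, 0), y.get(k, 0)) for k in keys)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - shared / total


# Toy, made-up counts.
same = weighted_jaccard_distance({"otu1": 5, "otu2": 5}, {"otu1": 5, "otu2": 5})
disjoint = weighted_jaccard_distance({"otu1": 5}, {"otu2": 5})
partial = weighted_jaccard_distance({"a": 2, "b": 2}, {"a": 1, "b": 3})
```

Under this definition, two replicates of the same mock community should give a distance near zero, so values close to one indicate that spurious OTUs dominate the comparison.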

Testing the potential of a ribosomal 16S marker for DNA metabarcoding of insects

PeerJ ◽

10.7717/peerj.1966 ◽

2016 ◽

Vol 4 ◽

pp. e1966 ◽

Cited By ~ 66

Author(s):

Vasco Elbrecht ◽

Pierre Taberlet ◽

Tony Dejean ◽

Alice Valentini ◽

Philippe Usseglio-Polatera ◽

...

Keyword(s):

Species Level ◽

Coi Gene ◽

Taxonomic Resolution ◽

Reference Database ◽

Bias Estimation ◽

Knowing That ◽

Dna Metabarcoding ◽

Local Reference ◽

Comprehensive Survey ◽

Reference Databases

Cytochrome c oxidase I (COI) is a powerful marker for DNA barcoding of animals, with good taxonomic resolution and a large reference database. However, when used for DNA metabarcoding, estimation of taxa abundances and species detection are limited due to primer bias caused by highly variable primer binding sites across the COI gene. Therefore, we explored the ability of the 16S ribosomal DNA gene as an alternative metabarcoding marker for species level assessments. Ten bulk samples, each containing equal amounts of tissue from 52 freshwater invertebrate taxa, were sequenced with the Illumina NextSeq 500 system. The 16S primers amplified three more insect species than the Folmer COI primers and amplified more equally, probably due to decreased primer bias. Estimation of biomass might be less biased with 16S than with COI, although variation in read abundances of two orders of magnitude is still observed. According to these results, the marker choice depends on the scientific question. If the goal is to obtain a taxonomic identification at the species level, then COI is more appropriate due to established reference databases and known taxonomic resolution of this marker, knowing that a greater proportion of insects will be missed using COI Folmer primers. If the goal is to obtain a more comprehensive survey, the 16S marker, which requires building a local reference database, or optimised degenerated COI primers could be more appropriate.

Broadscale Ecological Patterns Are Robust to Use of Exact Sequence Variants versus Operational Taxonomic Units

mSphere ◽

10.1128/msphere.00148-18 ◽

2018 ◽

Vol 3 (4) ◽

Cited By ~ 52

Author(s):

Sydney I. Glassman ◽

Jennifer B. H. Martiny

Keyword(s):

Exact Sequence ◽

Sequence Similarity ◽

Marker Gene ◽

Large Field ◽

Sequence Variants ◽

Validity Of Results ◽

Data Set ◽

Operational Taxonomic Units ◽

Sequencing Technologies ◽

Β Diversity

ABSTRACT Recent discussion focuses on the best method for delineating microbial taxa, based on either exact sequence variants (ESVs) or traditional operational taxonomic units (OTUs) of marker gene sequences. We sought to test if the binning approach (ESVs versus 97% OTUs) affected the ecological conclusions of a large field study. The data set included sequences targeting all bacteria (16S rRNA) and fungi (internal transcribed spacer [ITS]), across multiple environments diverging markedly in abiotic conditions, over three collection times. Despite quantitative differences in microbial richness, we found that all α and β diversity metrics were highly positively correlated (r > 0.90) between samples analyzed with both approaches. Moreover, the community composition of the dominant taxa did not vary between approaches. Consequently, statistical inferences were nearly indistinguishable. Furthermore, ESVs only moderately increased the genetic resolution of fungal and bacterial diversity (1.3 and 2.1 times OTU richness, respectively). We conclude that for broadscale (e.g., all bacteria or all fungi) α and β diversity analyses, ESV or OTU methods will often reveal similar ecological results. Thus, while there are good reasons to employ ESVs, we need not question the validity of results based on OTUs.
IMPORTANCE Microbial ecologists have made exceptional improvements in our understanding of microbiomes in the last decade due to breakthroughs in sequencing technologies. These advances have wide-ranging implications for fields ranging from agriculture to human health. Due to limitations in databases, the majority of microbial ecology studies use a binning approach to approximate taxonomy based on DNA sequence similarity. There remains extensive debate on the best way to bin and approximate this taxonomy. Here we examine two popular approaches using a large field-based data set examining both bacteria and fungi and conclude that there are not major differences in the ecological outcomes. Thus, it appears that standard microbial community analyses are not overly sensitive to the particulars of binning approaches.
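
The kind of comparison reported in this abstract, a correlation between the same diversity metric computed under the two binning approaches, can be sketched as below. The Pearson coefficient is assumed here as the correlation measure, and the per-sample richness values are invented for illustration only. They are not data from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical richness per sample under each approach (made-up numbers).
otu_richness = [100, 150, 200, 250]
esv_richness = [210, 300, 430, 520]
r = pearson_r(otu_richness, esv_richness)
```

In this toy example ESV richness is roughly twice OTU richness in every sample, yet the two series are almost perfectly correlated. That is the pattern the abstract describes: absolute values differ, but the ranking of samples is preserved.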

Improved Metagenomic Taxonomic Profiling Using a Curated Core Gene-Based Bacterial Database Reveals Unrecognized Species in the Genus Streptococcus

Pathogens ◽

10.3390/pathogens9030204 ◽

2020 ◽

Vol 9 (3) ◽

pp. 204 ◽

Cited By ~ 2

Author(s):

Mauricio Chalita ◽

Sung-min Ha ◽

Yeong Ouk Kim ◽

Hyun-Seok Oh ◽

Seok-Hwan Yoon ◽

...

Keyword(s):

Marker Gene ◽

Core Gene ◽

Taxonomic Resolution ◽

Reference Database ◽

Chronic Obstructive ◽

Accurate Identification ◽

Obstructive Pulmonary Disease ◽

Shotgun Metagenomics ◽

Taxonomic Profiling ◽

Metagenomic Sample

Shotgun metagenomics is of great importance in order to understand the composition of the microbial community associated with a sample and the potential impact it may exert on its host. For clinical metagenomics, one of the initial challenges is the accurate identification of a pathogen of interest and ability to single out that pathogen within a complex community of microorganisms. However, in the absence of an accurate identification of those microorganisms, any conclusion or diagnosis may be erroneous, especially when comparing distinct groups of individuals. When comparing a shotgun metagenomic sample against a reference genome sequence database, the classification itself is dependent on the contents of the database. Focusing on the genus Streptococcus, we built four synthetic metagenomic samples and demonstrated that shotgun taxonomic profiling using the bacterial core genes as the reference database performed better in both taxonomic profiling and relative abundance prediction than that based on the marker gene reference database included in MetaPhlAn2. Additionally, by classifying sputum samples of patients suffering from chronic obstructive pulmonary disease, we showed that adding genomes of genomospecies to a reference database offers higher taxonomic resolution for taxonomic profiling. Finally, we show how our genomospecies database is able to identify correctly a clinical stool sample from a patient with a streptococcal infection, proving that genomospecies provide better taxonomic coverage for metagenomic analyses.

A new oomycete metabarcoding method using the rps10 gene

10.1101/2021.09.22.460084 ◽

2021 ◽

Author(s):

Zachary S. L. Foster ◽

Felipe E Albornoz ◽

Valerie J Fieland ◽

Meredith M Larsen ◽

Frank Andrew Jones ◽

...

Keyword(s):

Environmental Samples ◽

Illumina Miseq ◽

Taxonomic Resolution ◽

Reference Database ◽

Mock Community ◽

Operational Taxonomic Units ◽

Wide Range ◽

Dna Metabarcoding ◽

Improved Methods

Oomycetes are a group of eukaryotes related to brown algae and diatoms, many of which cause diseases in plants and animals. Improved methods are needed for rapid and accurate characterization of oomycete communities using DNA metabarcoding. We have identified the mitochondrial 40S ribosomal protein S10 gene (rps10) as a locus useful for oomycete metabarcoding and provide primers predicted to amplify all oomycetes based on available reference sequences from a wide range of taxa. We evaluated its utility relative to a popular barcode, the internal transcribed spacer 1 (ITS1), by sequencing environmental samples and a mock community using Illumina MiSeq. Amplified sequence variants (ASVs) and operational taxonomic units (OTUs) were identified per community. Both the sequence and predicted taxonomy of ASVs and OTUs were compared to the known composition of the mock community. Both rps10 and ITS yielded ASVs with sequences matching 21 of the 24 species in the mock community and matching all 24 when allowing for a 1 bp difference. Taxonomic classifications of ASVs included 23 members of the mock community for rps10 and 17 for ITS1. Sequencing results for the environmental samples suggest the proposed rps10 locus results in substantially less amplification of non-target organisms than the ITS1 method. The amplified rps10 region also has higher taxonomic resolution than ITS1, allowing for greater discrimination of closely related species. We present a new website with a searchable rps10 reference database for species identification and all protocols needed for oomycete metabarcoding. The rps10 barcode and methods described herein provide an effective tool for metabarcoding oomycetes using short-read sequencing.
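
The mock-community check described above (exact matches versus matches "allowing for a 1 bp difference") can be sketched as follows. The sketch assumes that a 1 bp difference means a single substitution between sequences of equal length. Insertions and deletions are not handled. The function names, species labels and sequences are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def matched_species(asvs, references, max_mismatches=0):
    """Species whose reference sequence is matched by at least one ASV."""
    found = set()
    for species, ref in references.items():
        for asv in asvs:
            d = hamming(asv, ref)
            if d is not None and d <= max_mismatches:
                found.add(species)
                break
    return found


# Toy, made-up mock community and ASVs.
references = {"sp1": "ACGTACGT", "sp2": "TTGCAAGC", "sp3": "GGGGCCCC"}
asvs = ["ACGTACGT", "TTGCAAGG", "AAAATTTT"]

exact = matched_species(asvs, references)
within_one = matched_species(asvs, references, max_mismatches=1)
```

Relaxing the tolerance from zero to one mismatch recovers additional community members here, in the same way that the abstract reports a rise from 21 to 24 matched species.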
