Consistent, comprehensive and computationally efficient OTU definitions

10.7287/peerj.preprints.411v2 ◽

2014 ◽

Author(s):

Jai Ram Rideout ◽

Yan He ◽

Jose Antonio Navas-Molina ◽

William A Walters ◽

Luke K Ursell ◽

...

Keyword(s):

16S rRNA ◽

De Novo ◽

Sequence Data ◽

Marker Gene ◽

Community Analysis ◽

Microbial Community Analysis ◽

Reference Database ◽

Data Sets ◽

Computationally Efficient ◽

Sequencing Platforms

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than one-fifth of the time that “classic” open-reference OTU picking would require. Through comparisons on three well-studied datasets, we show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
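
The abstract does not list the individual steps, so the following is only a toy sketch of how a subsampled open-reference workflow is commonly described: closed-reference picking against the database, de novo clustering of a random subsample of the failures to build new reference sequences, closed-reference picking of all failures against those new references, and a final de novo pass on whatever remains. The `identity` function, the greedy clustering, the OTU name prefixes, and every parameter name are assumptions made for illustration; this is not QIIME's or uclust's implementation.

```python
import random


def identity(a, b):
    """Fraction of matching positions; a crude stand-in for a real aligner."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest


def closed_reference(reads, refs, threshold):
    """Assign each read to the first reference it matches.

    Reads are handled independently, so this step could be split across workers.
    Returns (hits by reference id, list of reads that matched nothing).
    """
    hits, failures = {}, []
    for read in reads:
        for ref_id, ref_seq in refs.items():
            if identity(read, ref_seq) >= threshold:
                hits.setdefault(ref_id, []).append(read)
                break
        else:
            failures.append(read)
    return hits, failures


def de_novo(reads, threshold, prefix):
    """Greedy centroid clustering; inherently serial because centroids accumulate."""
    centroids, clusters = {}, {}
    for read in reads:
        for cid, seq in centroids.items():
            if identity(read, seq) >= threshold:
                clusters[cid].append(read)
                break
        else:
            cid = f"{prefix}{len(centroids)}"
            centroids[cid] = read
            clusters[cid] = [read]
    return centroids, clusters


def subsampled_open_reference(reads, refs, threshold=0.97, fraction=0.1, seed=0):
    """Toy four-step workflow; `fraction` (0 to 1) is the share of failures subsampled."""
    rng = random.Random(seed)
    # Step 1: closed-reference picking against the reference database.
    otus, failures = closed_reference(reads, refs, threshold)
    # Step 2: cluster a random subsample of the failures de novo; centroids become new references.
    k = max(1, int(len(failures) * fraction)) if failures else 0
    new_refs, _ = de_novo(rng.sample(failures, k), threshold, "New.Ref")
    # Step 3: closed-reference picking of all failures against the new references.
    new_hits, remaining = closed_reference(failures, new_refs, threshold)
    # Step 4: de novo clustering of the reads that still match nothing.
    _, cleanup = de_novo(remaining, threshold, "New.Cleanup")
    for clusters in (new_hits, cleanup):
        for otu_id, members in clusters.items():
            otus.setdefault(otu_id, []).extend(members)
    return otus
```

In this sketch only steps 2 and 4 are serial, and step 2 sees just a subsample, which is the intuition behind the runtime claim in the abstract.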

Consistent, comprehensive and computationally efficient OTU definitions

10.7287/peerj.preprints.411v1 ◽

2014 ◽

Author(s):

Jai Ram Rideout ◽

Yan He ◽

Jose Antonio Navas-Molina ◽

William A Walters ◽

Luke K Ursell ◽

...

Keyword(s):

16S rRNA ◽

De Novo ◽

Sequence Data ◽

Marker Gene ◽

Community Analysis ◽

Microbial Community Analysis ◽

Reference Database ◽

Computationally Efficient ◽

Highly Correlated ◽

Sequencing Platforms

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because parts of our algorithm can be run in parallel, it makes open-reference OTU picking tractable on massive amplicon sequence data sets. We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “legacy” open-reference OTU picking, where less of the process can be parallelized, through comparisons on three well-studied datasets. We therefore recommend that subsampled open-reference OTU picking always be applied in favor of “legacy” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters.

CD-HIT-OTU-MiSeq, an Improved Approach for Clustering and Analyzing Paired End MiSeq 16S rRNA Sequences

10.1101/153783 ◽

2017 ◽

Cited By ~ 3

Author(s):

Weizhong Li ◽

Yuanyuan Chang

Keyword(s):

16S rRNA ◽

High Speed ◽

De Novo ◽

Sequence Data ◽

Illumina MiSeq ◽

Poor Quality ◽

Reference Database ◽

rRNA Gene ◽

Variable Regions ◽

Novel Approach

In recent years, Illumina MiSeq sequencers have replaced pyrosequencing platforms and become dominant in 16S rRNA sequencing. One unique feature of MiSeq technology, compared with pyrosequencing, is its paired-end (PE) reads, each of which can be sequenced to 250-300 bases to cover multiple variable regions of the 16S rRNA gene. However, the PE reads need to be assembled into a single contig at the beginning of the analysis. Although many methods are capable of assembling PE reads into contigs, a large portion of PE reads cannot be accurately assembled because of the poor quality at the 3’ ends of both reads in the overlapping region. As a result, many sequences are discarded in the analysis. In this study, we developed a novel approach for clustering and annotating MiSeq-based 16S sequence data, CD-HIT-OTU-MiSeq. This new approach has four distinct novel features. (1) The package can cluster PE reads without joining them into contigs. (2) Users can choose a high-quality portion of the PE reads for analysis (e.g. the first 200 / 150 bases from forward / reverse reads), according to the base quality profile. (3) We implemented a tool that can splice out the target region (e.g. V3-V4) from a full-length 16S reference database into PE sequences. CD-HIT-OTU-MiSeq can cluster the spliced PE reference database together with samples, so Operational Taxonomic Units (OTUs) can be derived and annotated concurrently. (4) Chimeric sequences are effectively identified through a de novo approach. The package offers high speed and high accuracy. The software package is freely available as an open source package and is distributed along with CD-HIT from http://cd-hit.org. Within the CD-HIT package, CD-HIT-OTU-MiSeq is located in the usecase folder.
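
Feature (2) above, keeping only a leading high-quality portion of each read in a pair, can be illustrated with a minimal sketch. The function names are hypothetical, and dropping pairs in which either read is shorter than its cutoff is an assumption made here for simplicity; the abstract does not say how CD-HIT-OTU-MiSeq treats short reads.

```python
def trim_pair(forward, reverse, fwd_len=200, rev_len=150):
    """Keep only the leading bases of each read; the pair is not joined into a contig."""
    return forward[:fwd_len], reverse[:rev_len]


def trim_pairs(pairs, fwd_len=200, rev_len=150):
    """Trim every (forward, reverse) pair.

    Assumption for this sketch: a pair is dropped when either read is shorter
    than its cutoff, so that all retained pairs have uniform lengths.
    """
    kept = []
    for forward, reverse in pairs:
        if len(forward) >= fwd_len and len(reverse) >= rev_len:
            kept.append(trim_pair(forward, reverse, fwd_len, rev_len))
    return kept
```

The defaults of 200 and 150 mirror the example cutoffs quoted in the abstract; in practice they would be chosen from the run's base quality profile.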

CCMetagen: comprehensive and accurate identification of eukaryotes and prokaryotes in metagenomic data

10.1101/641332 ◽

2019 ◽

Cited By ~ 2

Author(s):

Vanessa R. Marcelino ◽

Philip T.L.C. Clausen ◽

Jan P. Buchmann ◽

Michelle Wille ◽

Jonathan R. Iredell ◽

...

Keyword(s):

High Throughput Sequencing ◽

Sequence Data ◽

Fold Increase ◽

Community Analysis ◽

Biological Data ◽

Microbial Community Analysis ◽

Metagenomic Data ◽

Reference Database ◽

Comprehensive Overview ◽

Accurate Identification

High-throughput sequencing of DNA and RNA from environmental and host-associated samples (metagenomics and metatranscriptomics) is a powerful tool to assess which organisms are present in a sample. Taxonomic identification software usually aligns individual short sequence reads to a reference database, sometimes containing taxa with complete genomes only. This is a challenging task given that different species can share identical sequence regions and complete genome sequences are only available for a fraction of organisms. A recently developed approach to map sequence reads to reference databases involves weighing all high-scoring read mappings to the database as a whole to produce better-informed alignments. We used this novel concept in read mapping to develop a highly accurate metagenomic classification pipeline named CCMetagen. Using simulated fungal and bacterial metagenomes, we demonstrate that CCMetagen substantially outperforms other commonly used metagenome classifiers, attaining a 3- to 1580-fold increase in precision and a 2- to 922-fold increase in F1 scores for species-level classifications when compared to Kraken2, Centrifuge and KrakenUniq. CCMetagen is sufficiently fast and memory efficient to use the entire NCBI nucleotide collection (nt) as reference, enabling the assessment of species with incomplete genome sequence data from all biological kingdoms. Our pipeline efficiently produced a comprehensive overview of the microbiome of two biological data sets, including both eukaryotes and prokaryotes. CCMetagen is user-friendly and the results can be easily integrated into microbial community analysis software for streamlined and automated microbiome studies.
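
For readers unfamiliar with the metrics quoted above, precision and F1 can be computed from counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the function name and the convention of returning 0.0 when a denominator is zero are choices made for this illustration, not details taken from the CCMetagen evaluation.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard classification metrics from raw counts.

    tp: correctly reported taxa, fp: reported but absent, fn: present but missed.
    Returns 0.0 for any metric whose denominator would be zero (a convention
    chosen for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A "fold increase" in either metric is then simply the ratio of one classifier's score to another's on the same simulated metagenome.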

Assessing 16S marker gene survey data analysis methods using mixtures of human stool sample DNA extracts

10.1101/400226 ◽

2018 ◽

Author(s):

Nathan D Olson ◽

M. Senthil Kumar ◽

Shan Li ◽

Stephanie Hao ◽

Winston Timp ◽

...

Keyword(s):

16S rRNA ◽

De Novo ◽

Marker Gene ◽

Real Data ◽

Qualitative Assessment ◽

Data Sets ◽

Sequencing Data ◽

Differential Abundance ◽

Analysis Methods ◽

Downstream Analysis

Background: Analysis of 16S rRNA marker-gene surveys, used to characterize prokaryotic microbial communities, may be performed by numerous bioinformatic pipelines and downstream analysis methods. However, there is limited guidance on how to decide between methods; appropriate data sets and statistics for assessing these methods are needed. We developed a mixture dataset with real data complexity and an expected value for assessing 16S rRNA bioinformatic pipelines and downstream analysis methods. We generated an assessment dataset using a two-sample titration mixture design. The sequencing data were processed using multiple bioinformatic pipelines: i) DADA2, a sequence inference method; ii) Mothur, a de novo clustering method; and iii) QIIME with open-reference clustering. The mixture dataset was used to qualitatively and quantitatively assess count tables generated using the pipelines. Results: The qualitative assessment was used to evaluate features only present in unmixed samples and titrations. The abundance of Mothur and QIIME features specific to unmixed samples and titrations was explained by sampling alone. However, for DADA2, over a third of the unmixed-sample- and titration-specific feature abundance could not be explained by sampling alone. The quantitative assessment evaluated pipeline performance by comparing observed to expected relative and differential abundance values. Overall, the observed relative abundance and differential abundance values were consistent with the expected values, although outlier features were observed across all pipelines. Conclusions: Using a novel mixture dataset and assessment methods, we quantitatively and qualitatively evaluated count tables generated using three bioinformatic pipelines. The dataset and methods developed for this study will serve as a valuable community resource for assessing 16S rRNA marker-gene survey bioinformatic methods.
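
The abstract does not give the formula behind its "expected value", so the following is only a sketch of the general idea of a two-sample mixture. It assumes that `theta` is the proportion of the mixed material contributed by the first unmixed sample, in which case a feature's expected relative abundance is a weighted average of its abundances in the two unmixed samples. The function and parameter names are hypothetical.

```python
def expected_relative_abundance(q_first, q_second, theta):
    """Expected relative abundance of a feature in a two-sample mixture.

    q_first, q_second: the feature's relative abundance in each unmixed sample.
    theta: assumed proportion of the mixture that comes from the first sample (0 to 1).
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must be between 0 and 1")
    return theta * q_first + (1.0 - theta) * q_second
```

Under this assumption, a feature found only in the first sample at 40% relative abundance would be expected at 10% in a mixture with `theta = 0.25`, which gives a reference point for comparing the values a pipeline actually reports.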

mockrobiota: a public resource for microbiome bioinformatics benchmarking

10.7287/peerj.preprints.2065v1 ◽

2016 ◽

Author(s):

Nicholas A Bokulich ◽

Jai Ram Rideout ◽

William G Mercurio ◽

Benjamin Wolfe ◽

Corinne F Maurice ◽

...

Keyword(s):

Sequence Data ◽

Marker Gene ◽

Community Analysis ◽

Microbial Community Analysis ◽

Community Members ◽

Mock Community ◽

Public Resource ◽

Source Data ◽

Community Data ◽

Mock Communities

Mock communities are an important tool for validating, optimizing, and comparing bioinformatics methods for microbial community analysis. We present mockrobiota, a public resource for sharing, validating, and documenting mock community data resources, available at https://github.com/caporaso-lab/mockrobiota. The materials contained in mockrobiota include dataset and sample metadata, expected composition data, which are annotated based on one or more reference taxonomies, links to raw data (e.g., raw sequence data) for each mock community dataset, and optional reference sequences for mock community members. mockrobiota does not supply physical sample materials directly, but the dataset metadata included for each mock community indicate whether physical sample materials are available (and associated contact information). At the time of this writing, mockrobiota contains 11 mock community datasets with known species compositions (including bacterial, archaeal, and eukaryotic mock communities), analyzed by high-throughput marker-gene sequencing. The availability of standard, public mock community data will facilitate ongoing methods optimizations; comparisons across studies that share source data; greater transparency and access; and eliminate redundancy. This dynamic resource is intended to expand and evolve to meet the changing needs of the ‘omics community.

HashSeq: a Simple, Scalable, and Conservative De Novo Variant Caller for 16S rRNA Gene Data Sets

mSystems ◽

10.1128/msystems.00697-21 ◽

2021 ◽

Author(s):

Farnaz Fouladi ◽

Jacqueline B. Young ◽

Anthony A. Fodor

Keyword(s):

16S rRNA ◽

16S rRNA Gene ◽

De Novo ◽

Sequence Data ◽

Variant Calling ◽

rRNA Gene ◽

Sequence Variants ◽

Data Sets ◽

Single Nucleotide ◽

Gene Data

Recent bioinformatics development has enabled the detection of sequence variants with a high resolution of only one single-nucleotide difference in 16S rRNA gene sequence data. Despite this progress, there are several limitations that can be associated with variant calling pipelines, such as producing a large number of low-abundance sequence variants which need to be filtered out with arbitrary thresholds in downstream analyses or having a slow runtime.

Advanced computational algorithms for microbial community analysis using massive 16S rRNA sequence data

Nucleic Acids Research ◽

10.1093/nar/gkq872 ◽

2010 ◽

Vol 38 (22) ◽

pp. e205-e205 ◽

Cited By ~ 35

Author(s):

Yijun Sun ◽

Yunpeng Cai ◽

Volker Mai ◽

William Farmerie ◽

Fahong Yu ◽

...

Keyword(s):

Microbial Community ◽

16S rRNA ◽

Sequence Data ◽

Community Analysis ◽

Microbial Community Analysis ◽

Computational Algorithms ◽

rRNA Sequence ◽

16S rRNA Sequence

An extended single-index multiplexed 16S rRNA sequencing for microbial community analysis on MiSeq illumina platforms

Journal of Basic Microbiology ◽

10.1002/jobm.201500420 ◽

2015 ◽

Vol 56 (3) ◽

pp. 321-326 ◽

Cited By ~ 46

Author(s):

Hooman Derakhshani ◽

Hein Min Tun ◽

Ehsan Khafipour

Keyword(s):

Microbial Community ◽

16S rRNA ◽

Community Analysis ◽

16S rRNA Sequencing ◽

Microbial Community Analysis ◽

Single Index ◽

rRNA Sequencing
