scholarly journals Accuracy of microbial community diversity estimated by closed- and open-reference OTUs

PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3889 ◽  
Author(s):  
Robert C. Edgar

Next-generation sequencing of 16S ribosomal RNA is widely used to survey microbial communities. Sequences are typically assigned to Operational Taxonomic Units (OTUs). Closed- and open-reference OTU assignment matches reads to a reference database at 97% identity (closed), then clusters unmatched reads using a de novo method (open). Implementations of these methods in the QIIME package were tested on several mock community datasets with 20 strains using different sequencing technologies and primers. Richness (number of reported OTUs) was often greatly exaggerated, with hundreds or thousands of OTUs generated on Illumina datasets. Between-sample diversity was also found to be highly exaggerated in many cases, with weighted Jaccard distances between identical mock samples often close to one, indicating very low similarity. Non-overlapping hyper-variable regions in 70% of species were assigned to different OTUs. On mock communities with Illumina V4 reads, 56% to 88% of predicted genus names were false positives. Biological inferences obtained using these methods are therefore not reliable.

2020 ◽  
Vol 15 (1) ◽  
pp. 2-16
Author(s):  
Yuwen Luo ◽  
Xingyu Liao ◽  
Fang-Xiang Wu ◽  
Jianxin Wang

Transcriptome assembly plays a critical role in studying biological properties and examining the expression levels of genomes in specific cells. It is also the basis of many downstream analyses. With the increase of speed and the decrease in cost, massive sequencing data continues to accumulate. A large number of assembly strategies based on different computational methods and experiments have been developed. How to efficiently perform transcriptome assembly with high sensitivity and accuracy becomes a key issue. In this work, the issues with transcriptome assembly are explored based on different sequencing technologies. Specifically, transcriptome assemblies with next-generation sequencing reads are divided into reference-based assemblies and de novo assemblies. The examples of different species are used to illustrate that long reads produced by the third-generation sequencing technologies can cover fulllength transcripts without assemblies. In addition, different transcriptome assemblies using the Hybrid-seq methods and other tools are also summarized. Finally, we discuss the future directions of transcriptome assemblies.


Author(s):  
Andrew Krohn ◽  
Bo Stevens ◽  
Adam Robbins-Pianka ◽  
Matthew Belus ◽  
Gerard J Allan ◽  
...  

Diversity of complex microbial communities can be rapidly assessed by community amplicon sequencing of marker genes (e.g., 16S), often yielding many thousands of DNA sequences per sample. However, analysis of community amplicon sequencing data requires multiple computational steps which affect the outcome of a final data set. Here we use mock communities to describe the effects of parameter adjustments for raw sequence quality filtering, picking operational taxonomic units (OTUs), taxonomic assignment, and OTU table filtering as implemented in QIIME 1.9.1. We demonstrate a workflow optimization based upon this exploration which we also apply to environmental samples. We found that quality filtering of raw data and filtering of OTU tables had large effects on observed OTU diversity. While all taxonomy assigners performed with similar accuracy, an appropriate choice of similarity threshold for defining OTUs depended on the method used for OTU picking. Our “default” analysis in QIIME overestimated mock community diversity by at least a factor of ten, compared to the optimized analysis which correctly characterized the taxonomic composition of the mock communities while still overestimating OTU diversity by about a factor of two. Though observed relative abundances of mock community member taxa were approximately correct, most were still represented by multiple OTUs. Low-frequency OTUs conspecific to constituent mock community taxa were characterized by multiple substitution and indel errors and the presence of a low quality base call resulting in sequence truncation during quality filtering. Low quality base calls were observed at “G” positions most of the time, and were also associated with a preceding “TTT” trinucleotide motif. Environmental diversity estimates were reduced by about 40% from 2508 to 1533 OTUs when comparing output from the default and optimized workflows. We attribute this reduction in observed diversity to the removal of erroneous sequences from the data set. Our results indicate that both strict quality filtering of raw sequencing data and careful filtering of raw OTU tables are important steps for accurate estimation of microbial community diversity.


2017 ◽  
Author(s):  
Weizhong Li ◽  
Yuanyuan Chang

AbstractIn recent years, Illumina MiSeq sequencers replaced pyrosequencing platforms and became dominant in 16S rRNA sequencing. One unique feature of MiSeq technology, compared with Pyrosequencing, is the Paired End (PE) reads, with each read can be sequenced to 250-300 bases to cover multiple variable regions on the 16S rRNA gene. However, the PE reads need to be assembled into a single contig at the beginning of the analysis. Although there are many methods capable of assembling PE reads into contigs, a big portion of PE reads can not be accurately assembled because the poor quality at the 3’ ends of both PE reads in the overlapping region. This causes that many sequences are discarded in the analysis. In this study, we developed a novel approach for clustering and annotation MiSeq-based 16S sequence data, CD-HIT-OTU-MiSeq. This new approach has four distinct novel features. (1) The package can clustering PE reads without joining them into contigs. (2) Users can choose a high quality portion of the PE reads for analysis (e.g. first 200 / 150 bases from forward / reverse reads), according to base quality profile. (3) We implemented a tool that can splice out the target region (e.g. V3-V4) from a full-length 16S reference database into the PE sequences. CD-HIT-OTU-MiSeq can cluster the spliced PE reference database together with samples, so we can derive Operational Taxonomic Units (OTUs) and annotate these OTUs concurrently. (4) Chimeric sequences are effectively identified through de novo approach. The package offers high speed and high accuracy. The software package is freely available as open source package and is distributed along with CD-HIT from http://cd-hit.org. Within the CD-HIT package, CD-HIT-OTU-MiSeq is within the usecase folder.


2020 ◽  
Author(s):  
Zack Saud ◽  
Alexandra M. Kortsinoglou ◽  
Vassili N. Kouvelis ◽  
Tariq M. Butt

Abstract More accurate and complete reference genomes have improved understanding of gene function, biology, and evolutionary mechanisms. Hybrid genome assembly approaches leverage benefits of both long, relatively error-prone reads from third-generation sequencing technologies and short, accurate reads from second-generation sequencing technologies, to produce more accurate and contiguous de novo genome assemblies in comparison to using either technology independently. In this study, we present a novel hybrid assembly pipeline that allowed for both mitogenome de novo assembly and telomere length de novo assembly of all 7 chromosomes of the model entomopathogenic fungus, Metarhizium brunneum. The improved assembly allowed for better ab initio gene prediction and a more BUSCO complete proteome set has been generated in comparison to the eight current NCBI reference Metarhizium spp. genomes. Remarkably, we note that including the mitogenome in ab initio gene prediction training improved overall gene prediction. The assembly was further validated by comparing contig assembly agreement across various assemblers, assessing the assembly performance of each tool. Genomic synteny and orthologous protein clusters were compared between Metarhizium brunneum and three other Hypocreales species with complete genomes, identifying core proteins, and listing orthologous protein clusters shared uniquely between the two entomopathogenic fungal species, so as to further facilitate the understanding of molecular mechanisms underpinning fungal-insect pathogenesis. The novel assembly pipeline may be used for other haploid fungal species, facilitating the need to produce high-quality reference fungal genomes, leading to better understanding of fungal genomic evolution, chromosome structuring and gene regulation.


2018 ◽  
Author(s):  
Robert C. Edgar

AbstractNext-generation amplicon sequencing is widely used for surveying biological diversity in applications such as microbial metagenomics, immune system repertoire analysis and targeted tumor sequencing of cancer-associated genes. In such studies, assignment of reads to incorrect samples (cross-talk) is a well-documented problem that is rarely considered in practice. Here, I describe UNCROSS2, an algorithm designed to detect and filter cross-talk in OTU tables generated by next-generation sequencing of the 16S ribosomal RNA gene. On eight published datasets, cross-talk rates are estimated to range from 0.4% to 1.5% mis-assigned reads. On a mock community test, UNCROSS2 identifies spurious counts due to cross-talk with sensitivity ∼80% to 90% and error rate from ∼1% to ∼20%, but it is not clear whether the accuracy of the algorithm is sufficient to decisively improve diversity rates in practice.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Zack Saud ◽  
Alexandra M. Kortsinoglou ◽  
Vassili N. Kouvelis ◽  
Tariq M. Butt

Abstract Background More accurate and complete reference genomes have improved understanding of gene function, biology, and evolutionary mechanisms. Hybrid genome assembly approaches leverage benefits of both long, relatively error-prone reads from third-generation sequencing technologies and short, accurate reads from second-generation sequencing technologies, to produce more accurate and contiguous de novo genome assemblies in comparison to using either technology independently. In this study, we present a novel hybrid assembly pipeline that allowed for both mitogenome de novo assembly and telomere length de novo assembly of all 7 chromosomes of the model entomopathogenic fungus, Metarhizium brunneum. Results The improved assembly allowed for better ab initio gene prediction and a more BUSCO complete proteome set has been generated in comparison to the eight current NCBI reference Metarhizium spp. genomes. Remarkably, we note that including the mitogenome in ab initio gene prediction training improved overall gene prediction. The assembly was further validated by comparing contig assembly agreement across various assemblers, assessing the assembly performance of each tool. Genomic synteny and orthologous protein clusters were compared between Metarhizium brunneum and three other Hypocreales species with complete genomes, identifying core proteins, and listing orthologous protein clusters shared uniquely between the two entomopathogenic fungal species, so as to further facilitate the understanding of molecular mechanisms underpinning fungal-insect pathogenesis. Conclusions The novel assembly pipeline may be used for other haploid fungal species, facilitating the need to produce high-quality reference fungal genomes, leading to better understanding of fungal genomic evolution, chromosome structuring and gene regulation.


2021 ◽  
Author(s):  
Zachary S. L. Foster ◽  
Felipe E Albornoz ◽  
Valerie J Fieland ◽  
Meredith M Larsen ◽  
Frank Andrew Jones ◽  
...  

Oomycetes are a group of eukaryotes related to brown algae and diatoms, many of which cause diseases in plants and animals. Improved methods are needed for rapid and accurate characterization of oomycete communities using DNA metabarcoding. We have identified the mitochondrial 40S ribosomal protein S10 gene (rps10) as a locus useful for oomycete metabarcoding and provide primers predicted to amplify all oomycetes based on available reference sequences from a wide range of taxa. We evaluated its utility relative to a popular barcode, the internal transcribed spacer 1 (ITS1), by sequencing environmental samples and a mock community using Illumina MiSeq. Amplified sequence variants (ASVs) and operational taxonomic units (OTUs) were identified per community. Both the sequence and predicted taxonomy of ASVs and OTUs were compared to the known composition of the mock community. Both rps10 and ITS yielded ASVs with sequences matching 21 of the 24 species in the mock community and matching all 24 when allowing for a 1 bp difference. Taxonomic classifications of ASVs included 23 members of the mock community for rps10 and 17 for ITS1. Sequencing results for the environmental samples suggest the proposed rps10 locus results in substantially less amplification of non-target organisms than the ITS1 method. The amplified rps10 region also has higher taxonomic resolution than ITS1, allowing for greater discrimination of closely related species. We present a new website with a searchable rps10 reference database for species identification and all protocols needed for oomycete metabarcoding. The rps10 barcode and methods described herein provide an effective tool for metabarcoding oomycetes using short-read sequencing.


2020 ◽  
Author(s):  
Zack Saud ◽  
Alexandra M. Kortsinoglou ◽  
Vassili N. Kouvelis ◽  
Tariq M. Butt

Abstract Background More accurate and complete reference genomes have improved understanding of gene function, biology, and evolutionary mechanisms. Hybrid genome assembly approaches leverage benefits of both long, relatively error-prone reads from third-generation sequencing technologies and short, accurate reads from second-generation sequencing technologies, to produce more accurate and contiguous de novo genome assemblies in comparison to using either technology independently. In this study, we present a novel hybrid assembly pipeline that allowed for both mitogenome de novo assembly and telomere length de novo assembly of all 7 chromosomes of the model entomopathogenic fungus, Metarhizium brunneum . Results The improved assembly allowed for better ab initio gene prediction and a more BUSCO complete proteome set has been generated in comparison to the eight current NCBI reference Metarhizium spp. genomes. Remarkably, we note that including the mitogenome in ab initio gene prediction training improved overall gene prediction. The assembly was further validated by comparing contig assembly agreement across various assemblers, assessing the assembly performance of each tool. Genomic synteny and orthologous protein clusters were compared between Metarhizium brunneum and three other Hypocreales species with complete genomes, identifying core proteins, and listing orthologous protein clusters shared uniquely between the two entomopathogenic fungal species, so as to further facilitate the understanding of molecular mechanisms underpinning fungal-insect pathogenesis. Conclusions The novel assembly pipeline may be used for other haploid fungal species, facilitating the need to produce high-quality reference fungal genomes, leading to better understanding of fungal genomic evolution, chromosome structuring and gene regulation.


Sign in / Sign up

Export Citation Format

Share Document