Contamination as a major factor in poor Illumina assembly of microbial isolate genomes

Mapping Intimacies ◽

10.1101/081885 ◽

2016 ◽

Cited By ~ 5

Author(s):

Haeyoung Jeong ◽

Jae-Goo Pan ◽

Seung-Hwan Park

Keyword(s):

Illumina Sequencing ◽

De Novo ◽

Repetitive Sequences ◽

Low Frequency ◽

Read Depth ◽

16S Rrna Genes ◽

Rrna Genes ◽

Sequencing Error ◽

Sequencing Data ◽

Long Reads

ABSTRACT The nonhybrid hierarchical assembly of PacBio long reads is becoming the most preferred method for obtaining genomes for microbial isolates. On the other hand, among massive numbers of Illumina sequencing reads produced, there is a slim chance of re-evaluating failed microbial genome assembly (high contig number, large total contig size, and/or the presence of low-depth contigs). We generated Illumina-type test datasets with various levels of sequencing error, pretreatment (trimming and error correction), repetitive sequences, contamination, and ploidy from both simulated and real sequencing data and applied k-mer abundance analysis to quickly detect possible diagnostic signatures of poor assemblies. Contamination was the only factor leading to poor assemblies for the test dataset derived from haploid microbial genomes, resulting in an extraordinary peak within low-frequency k-mer range. When thirteen Illumina sequencing reads of microbes belonging to genera Bacillus or Paenibacillus from a single multiplexed run were subjected to a k-mer abundance analysis, all three samples leading to poor assemblies showed peculiar patterns of contamination. Read depth distribution along the contig length indicated that all problematic assemblies suffered from too many contigs with low average read coverage, where 1% to 15% of total reads were mapped to low-coverage contigs. We found that subsampling or filtering out reads having rare k-mers could efficiently remove low-level contaminants and greatly improve the de novo assemblies. An analysis of 16S rRNA genes recruited from reads or contigs and the application of read classification tools originally designed for metagenome analyses can help identify the source of a contamination.
The unexpected presence of proteobacterial reads across multiple samples, which had no relevance to our lab environment, implies that such prevalent contamination might have occurred after the DNA preparation step, probably at the place where sequencing service was provided.
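
The k-mer abundance analysis and rare-k-mer read filtering described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy reads, and the `min_abundance` cutoff are all assumptions made for illustration, and reverse-complement (canonical) k-mers are ignored for brevity.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-length substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def abundance_spectrum(kmer_counts):
    """Map each abundance to the number of distinct k-mers seen that often."""
    return Counter(kmer_counts.values())

def drop_reads_with_rare_kmers(reads, kmer_counts, k, min_abundance):
    """Keep only reads whose k-mers all occur at least min_abundance times."""
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if kmers and all(kmer_counts[km] >= min_abundance for km in kmers):
            kept.append(read)
    return kept

# Toy data (hypothetical): ten copies of a "genome" read plus one
# low-level "contaminant" read that shares no 4-mers with it.
reads = ["ACGTACGGTC"] * 10 + ["TTTTGGGGCC"]
counts = count_kmers(reads, k=4)
spectrum = abundance_spectrum(counts)
filtered = drop_reads_with_rare_kmers(reads, counts, k=4, min_abundance=3)

for abundance in sorted(spectrum):
    print(abundance, spectrum[abundance])
```

In this toy spectrum the contaminant shows up as a separate group of k-mers at abundance 1, well below the main group at abundance 10, which is the kind of low-frequency signal the abstract refers to. On real data, sequencing errors also produce rare k-mers, so a cutoff would need to be chosen with care.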

Ultra-accurate microbial amplicon sequencing with synthetic long reads

Microbiome ◽

10.1186/s40168-021-01072-3 ◽

2021 ◽

Vol 9 (1) ◽

Author(s):

Benjamin J. Callahan ◽

Dmitry Grinevich ◽

Siddhartha Thakur ◽

Michael A. Balamotis ◽

Tuval Ben Yehezkel

Keyword(s):

Microbial Community ◽

16S Rrna ◽

Amplicon Sequencing ◽

Species Level ◽

Full Length ◽

16S Rrna Genes ◽

Rrna Genes ◽

Strain Identification ◽

Long Reads ◽

Long Read

Abstract Background: Out of the many pathogenic bacterial species that are known, only a fraction are readily identifiable directly from a complex microbial community using standard next generation DNA sequencing. Long-read sequencing offers the potential to identify a wider range of species and to differentiate between strains within a species, but attaining sufficient accuracy in complex metagenomes remains a challenge. Methods: Here, we describe and analytically validate LoopSeq, a commercially available synthetic long-read (SLR) sequencing technology that generates highly accurate long reads from standard short reads. Results: LoopSeq reads are sufficiently long and accurate to identify microbial genes and species directly from complex samples. LoopSeq perfectly recovered the full diversity of 16S rRNA genes from known strains in a synthetic microbial community. Full-length LoopSeq reads had a per-base error rate of 0.005%, which exceeds the accuracy reported for other long-read sequencing technologies. 18S-ITS and genomic sequencing of fungal and bacterial isolates confirmed that LoopSeq sequencing maintains that accuracy for reads up to 6 kb in length. LoopSeq full-length 16S rRNA reads could accurately classify organisms down to the species level in rinsate from retail meat samples, and could differentiate strains within species identified by the CDC as potential foodborne pathogens. Conclusions: The order-of-magnitude improvement in length and accuracy over standard Illumina amplicon sequencing achieved with LoopSeq enables accurate species-level and strain identification from complex- to low-biomass microbiome samples. The ability to generate accurate and long microbiome sequencing reads using standard short read sequencers will accelerate the building of quality microbial sequence databases and removes a significant hurdle on the path to precision microbial genomics.

SLR-superscaffolder: a de novo scaffolding tool for synthetic long reads using a top-to-bottom scheme

BMC Bioinformatics ◽

10.1186/s12859-021-04081-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Lidong Guo ◽

Mengyang Xu ◽

Wenchao Wang ◽

Shengqiang Gu ◽

Xia Zhao ◽

...

Keyword(s):

High Efficiency ◽

De Novo ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Draft Assembly ◽

Screening Algorithm ◽

Long Reads ◽

Hybrid Genome ◽

Genomics Research ◽

Negative Effect

Abstract Background: Synthetic long reads (SLR) with long-range co-barcoding information are now widely applied in genomics research. Although several tools have been developed for each specific SLR technique, a robust standalone scaffolder with high efficiency is warranted for hybrid genome assembly. Results: In this work, we developed a standalone scaffolding tool, SLR-superscaffolder, to link together contigs in draft assemblies using co-barcoding and paired-end read information. Our top-to-bottom scheme first builds a global scaffold graph based on Jaccard Similarity to determine the order and orientation of contigs, and then locally improves the scaffolds with the aid of paired-end information. We also exploited a screening algorithm to reduce the negative effect of misassembled contigs in the input assembly. We applied SLR-superscaffolder to a human single tube long fragment read sequencing dataset and increased the scaffold NG50 of its corresponding draft assembly 1349 fold. Moreover, benchmarking on different input contigs showed that this approach overall outperformed existing SLR scaffolders, providing longer contiguity and fewer misassemblies, especially for short contigs assembled by next-generation sequencing data. The open-source code of SLR-superscaffolder is available at https://github.com/BGI-Qingdao/SLR-superscaffolder. Conclusions: SLR-superscaffolder can dramatically improve the contiguity of a draft assembly by integrating a hybrid assembly strategy.
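
The Jaccard similarity mentioned in this abstract can be sketched on barcode sets. This is only an illustration of the similarity measure, not the SLR-superscaffolder implementation: the contig names, barcode labels, and the `best_neighbor` helper are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def best_neighbor(name, barcode_sets):
    """Return the other contig whose barcode set is most similar to `name`."""
    others = [c for c in barcode_sets if c != name]
    return max(others, key=lambda c: jaccard(barcode_sets[name], barcode_sets[c]))

# Hypothetical barcode sets: the barcodes of reads mapped to each contig.
contig_barcodes = {
    "contig1": {"bc1", "bc2", "bc3", "bc4"},
    "contig2": {"bc3", "bc4", "bc5"},
    "contig3": {"bc9"},
}

similarity = jaccard(contig_barcodes["contig1"], contig_barcodes["contig2"])
neighbor = best_neighbor("contig1", contig_barcodes)
print(similarity, neighbor)
```

Contigs that originate from the same long fragment tend to share barcodes, so a high score suggests proximity. Determining order and orientation, as the abstract describes, requires more than this pairwise score.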

Computational Approaches for Transcriptome Assembly Based on Sequencing Technologies

Current Bioinformatics ◽

10.2174/1574893614666190410155603 ◽

2020 ◽

Vol 15 (1) ◽

pp. 2-16

Author(s):

Yuwen Luo ◽

Xingyu Liao ◽

Fang-Xiang Wu ◽

Jianxin Wang

Keyword(s):

De Novo ◽

Transcriptome Assembly ◽

Critical Role ◽

High Sensitivity ◽

Biological Properties ◽

Sequencing Data ◽

Sequencing Technologies ◽

Long Reads ◽

Massive Sequencing ◽

Generation Sequencing

Transcriptome assembly plays a critical role in studying biological properties and examining the expression levels of genomes in specific cells. It is also the basis of many downstream analyses. With the increase of speed and the decrease in cost, massive sequencing data continues to accumulate. A large number of assembly strategies based on different computational methods and experiments have been developed. How to efficiently perform transcriptome assembly with high sensitivity and accuracy becomes a key issue. In this work, the issues with transcriptome assembly are explored based on different sequencing technologies. Specifically, transcriptome assemblies with next-generation sequencing reads are divided into reference-based assemblies and de novo assemblies. The examples of different species are used to illustrate that long reads produced by the third-generation sequencing technologies can cover full-length transcripts without assemblies. In addition, different transcriptome assemblies using the Hybrid-seq methods and other tools are also summarized. Finally, we discuss the future directions of transcriptome assemblies.

Defining the Core Citrus Leaf- and Root-Associated Microbiota: Factors Associated with Community Structure and Implications for Managing Huanglongbing (Citrus Greening) Disease

Applied and Environmental Microbiology ◽

10.1128/aem.00210-17 ◽

2017 ◽

Vol 83 (11) ◽

Cited By ~ 22

Author(s):

Ryan A. Blaustein ◽

Graciela L. Lorca ◽

Julie L. Meyer ◽

Claudio F. Gonzalez ◽

Max Teplitski

Keyword(s):

Community Structure ◽

Microbial Community ◽

Microbial Communities ◽

Illumina Sequencing ◽

Symptom Severity ◽

16S Rrna Genes ◽

Rrna Genes ◽

Content Type ◽

The Core ◽

Symptom Progression

ABSTRACT Stable associations between plants and microbes are critical to promoting host health and productivity. The objective of this work was to test the hypothesis that restructuring of the core microbiota may be associated with the progression of huanglongbing (HLB), the devastating citrus disease caused by Liberibacter asiaticus, Liberibacter americanus, and Liberibacter africanus. The microbial communities of leaves (n = 94) and roots (n = 79) from citrus trees that varied by HLB symptom severity, cultivar, location, and season/time were characterized with Illumina sequencing of 16S rRNA genes. The taxonomically rich communities contained abundant core members (i.e., detected in at least 95% of the respective leaf or root samples), some overrepresented site-specific members, and a diverse community of low-abundance variable taxa. The composition and diversity of the leaf and root microbiota were strongly associated with HLB symptom severity and location; there was also an association with host cultivar. The relative abundance of Liberibacter spp. among leaf microbiota positively correlated with HLB symptom severity and negatively correlated with alpha diversity, suggesting that community diversity decreases as symptoms progress. Network analysis of the microbial community time series identified a mutually exclusive relationship between Liberibacter spp. and members of the Burkholderiaceae, Micromonosporaceae, and Xanthomonadaceae. This work confirmed several previously described plant disease-associated bacteria, as well as identified new potential implications for biological control.
Our findings advance the understanding of (i) plant microbiota selection across multiple variables and (ii) changes in (core) community structure that may be a precondition to disease establishment and/or may be associated with symptom progression. IMPORTANCE This study provides a comprehensive overview of the core microbial community within the microbiomes of plant hosts that vary in extent of disease symptom progression. With 16S Illumina sequencing analyses, we not only confirmed previously described bacterial associations with plant health (e.g., potentially beneficial bacteria) but also identified new associations and potential interactions between certain bacteria and an economically important phytopathogen. The importance of core taxa within broader plant-associated microbial communities is discussed.

Diversity and Partitioning of Bacterial Populations within the Accessory Nidamental Gland of the Squid Euprymna scolopes

Applied and Environmental Microbiology ◽

10.1128/aem.07437-11 ◽

2012 ◽

Vol 78 (12) ◽

pp. 4200-4208 ◽

Cited By ~ 47

Author(s):

Andrew J. Collins ◽

Brenna A. LaBarre ◽

Brian S. Wong Won ◽

Monica V. Shah ◽

Steven Heng ◽

...

Keyword(s):

16S Rrna ◽

Bacterial Consortium ◽

16S Rrna Genes ◽

Microbial Consortia ◽

Rrna Genes ◽

Euprymna Scolopes ◽

Sequencing Data ◽

Content Type ◽

Exact Function ◽

Nidamental Gland

ABSTRACT Microbial consortia confer important benefits to animal and plant hosts, and model associations are necessary to examine these types of host/microbe interactions. The accessory nidamental gland (ANG) is a female reproductive organ found among cephalopod mollusks that contains a consortium of bacteria, the exact function of which is unknown. To begin to understand the role of this organ, the bacterial consortium was characterized in the Hawaiian bobtail squid, Euprymna scolopes, a well-studied model organism for symbiosis research. Transmission electron microscopy (TEM) analysis of the ANG revealed dense bacterial assemblages of rod- and coccus-shaped cells segregated by morphology into separate, epithelium-lined tubules. The host epithelium was morphologically heterogeneous, containing ciliated and nonciliated cells with various brush border thicknesses. Hemocytes of the host's innate immune system were also found in close proximity to the bacteria within the tubules. A census of 16S rRNA genes suggested that Rhodobacterales, Rhizobiales, and Verrucomicrobia bacteria were prevalent, with members of the genus Phaeobacter dominating the consortium. Analysis of 454-shotgun sequencing data confirmed the presence of members of these taxa and revealed members of a fourth, Flavobacteria of the Bacteroidetes phylum. 16S rRNA fluorescent in situ hybridization (FISH) revealed that many ANG tubules were dominated by members of specific taxa, namely, Rhodobacterales, Verrucomicrobia, or Cytophaga-Flavobacteria-Bacteroidetes, suggesting symbiont partitioning to specific host tubules. In addition, FISH revealed that bacteria, including Phaeobacter species from the ANG, are likely deposited into the jelly coat of freshly laid eggs. This report establishes the ANG of the invertebrate E. scolopes as a model to examine interactions between a bacterial consortium and its host.

Strategy and Performance Evaluation of Low-Frequency Variant Calling for SARS-CoV-2 Using Targeted Deep Illumina Sequencing

Frontiers in Microbiology ◽

10.3389/fmicb.2021.747458 ◽

2021 ◽

Vol 12 ◽

Author(s):

Laura A. E. Van Poelvoorde ◽

Thomas Delcourt ◽

Wim Coucke ◽

Philippe Herman ◽

Sigrid C. J. De Keersmaecker ◽

...

Keyword(s):

Genome Sequence ◽

Illumina Sequencing ◽

Low Frequency ◽

Lower Sensitivity ◽

Wild Type ◽

Sequencing Data ◽

Allelic Frequencies ◽

Diagnostic Samples ◽

Wastewater Samples ◽

Detection And Quantification

The ongoing COVID-19 pandemic, caused by SARS-CoV-2, constitutes a tremendous global health issue. Continuous monitoring of the virus has become a cornerstone to make rational decisions on implementing societal and sanitary measures to curtail the virus spread. Additionally, emerging SARS-CoV-2 variants have increased the need for genomic surveillance to detect particular strains because of their potentially increased transmissibility, pathogenicity and immune escape. Targeted SARS-CoV-2 sequencing of diagnostic and wastewater samples has been explored as an epidemiological surveillance method for the competent authorities. Currently, only the consensus genome sequence of the most abundant strain is taken into consideration for analysis, but multiple variant strains are now circulating in the population. Consequently, in diagnostic samples, potential co-infection(s) by several different variants can occur or quasispecies can develop during an infection in an individual. In wastewater samples, multiple variant strains will often be simultaneously present. Currently, quality criteria are mainly available for constructing the consensus genome sequence, and some guidelines exist for the detection of co-infections and quasispecies in diagnostic samples. The performance of detection and quantification of low-frequency variants using whole genome sequencing (WGS) of SARS-CoV-2 remains largely unknown. Here, we evaluated the detection and quantification of mutations present at low abundances using the mutations defining the SARS-CoV-2 lineage B.1.1.7 (alpha variant) as a case study. Real sequencing data were in silico modified by introducing mutations of interest into raw wild-type sequencing data, or by mixing wild-type and mutant raw sequencing data, to construct mixed samples subjected to WGS using a tiling amplicon-based targeted metagenomics approach and Illumina sequencing. 
As anticipated, higher variation and lower sensitivity were observed at lower coverages and allelic frequencies. We found that detection of all low-frequency variants at an abundance of 10, 5, 3, and 1%, requires at least a sequencing coverage of 250, 500, 1500, and 10,000×, respectively. Although increasing variability of estimated allelic frequencies at decreasing coverages and lower allelic frequencies was observed, its impact on reliable quantification was limited. This study provides a highly sensitive low-frequency variant detection approach, which is publicly available at https://galaxy.sciensano.be, and specific recommendations for minimum sequencing coverages to detect clade-defining mutations at certain allelic frequencies. This approach will be useful to detect and quantify low-frequency variants in both diagnostic (e.g., co-infections and quasispecies) and wastewater [e.g., multiple variants of concern (VOCs)] samples.
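
The coverage recommendations in this abstract can be put in perspective with simple arithmetic: at a given coverage and allelic frequency, the expected number of reads carrying the variant is their product. The sketch below only restates the abstract's four (frequency, coverage) pairs. The function name is hypothetical, and the calculation ignores sequencing error and amplification bias, so it is not the authors' evaluation method.

```python
# (allelic frequency in percent, minimum coverage) pairs quoted in the abstract
requirements = [(10, 250), (5, 500), (3, 1500), (1, 10000)]

def expected_variant_reads(coverage, frequency_percent):
    """Expected number of reads supporting a variant at one position."""
    return coverage * frequency_percent / 100

for freq, cov in requirements:
    reads = expected_variant_reads(cov, freq)
    print(f"{freq}% at {cov}x: about {reads:.0f} variant-supporting reads")
```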

Genome size and identification of abundant repetitive sequences in Vallisneria spinulosa

PeerJ ◽

10.7717/peerj.3982 ◽

2017 ◽

Vol 5 ◽

pp. e3982 ◽

Cited By ~ 3

Author(s):

RuiJuan Feng ◽

Xin Wang ◽

Min Tao ◽

Guanchao Du ◽

Qishuo Wang

Keyword(s):

Genome Size ◽

Aquatic Plant ◽

Nuclear Dna ◽

De Novo ◽

Repetitive Sequences ◽

Nuclear Dna Content ◽

Ltr Retrotransposons ◽

Sequencing Data ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Vallisneria spinulosa is a freshwater aquatic plant of ecological and economic importance. However, there is limited cytogenetic and genomics information on Vallisneria. In this study, we measured the nuclear DNA content of Vallisneria spinulosa by flow cytometry, performed a de novo assembly, and annotated repetitive sequences by using a combination of next-generation sequencing (NGS) and bioinformatics tools. The genome size of Vallisneria spinulosa is approximately 3,595 Mbp, in which nearly 60% of the genome consists of repetitive sequences. The majority of the repetitive sequences are LTR-retrotransposons comprising 43% of the genome. Although the amount of sequencing data used in this study was not sufficient for a whole-genome assembly, it could generate an overview of representative elements in the genome. These results will lay a new foundation for further studies on various species that belong to the Vallisneria genus.

Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads

10.1101/345983 ◽

2018 ◽

Cited By ~ 2

Author(s):

Huilong Du ◽

Chengzhi Liang

Keyword(s):

Single Molecule ◽

High Efficiency ◽

Reference Genome ◽

Repetitive Sequences ◽

Sequencing Data ◽

High Quality ◽

Single Molecule Sequencing ◽

Genome Maps ◽

Long Reads ◽

Novel Method

Abstract Due to the large number of repetitive sequences in complex eukaryotic genomes, fragmented and incompletely assembled genomes lose value as reference sequences, often due to short contigs that cannot be anchored or mispositioned onto chromosomes. Here we report a novel method Highly Efficient Repeat Assembly (HERA), which includes a new concept called a connection graph as well as algorithms for constructing the graph. HERA resolves repeats at high efficiency with single-molecule sequencing data, and enables the assembly of chromosome-scale contigs by further integrating genome maps and Hi-C data. We tested HERA with the genomes of rice R498, maize B73, human HX1 and Tartary buckwheat Pinku1. HERA can correctly assemble most of the tandemly repetitive sequences in rice using single-molecule sequencing data only. Using the same maize and human sequencing data published by Jiao et al. (2017) and Shi et al. (2016), respectively, we dramatically improved on the sequence contiguity compared with the published assemblies, increasing the contig N50 from 1.3 Mb to 61.2 Mb in maize B73 assembly and from 8.3 Mb to 54.4 Mb in human HX1 assembly with HERA. We provided a high-quality maize reference genome with 96.9% of the gaps filled (only 76 gaps left) and several incorrectly positioned sequences fixed compared with the B73 RefGen_v4 assembly. Comparisons between the HERA assembly of HX1 and the human GRCh38 reference genome showed that many gaps in GRCh38 could be filled, and that GRCh38 contained some potential errors that could be fixed. We assembled the Pinku1 genome into 12 scaffolds with a contig N50 size of 27.85 Mb. HERA serves as a new genome assembly/phasing method to generate high quality sequences for complex genomes and as a curation tool to improve the contiguity and completeness of existing reference genomes, including the correction of assembly errors in repetitive regions.
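
The contig N50 figures quoted in this abstract follow the standard definition: the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembly length. A minimal sketch, with a hypothetical function name and made-up contig lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")

# Hypothetical contig lengths (arbitrary units), total 290, half 145.
example = [80, 70, 50, 40, 30, 20]
result = n50(example)
print(result)
```

Here the two longest contigs (80 + 70 = 150) already cover half of the assembly, so the N50 is 70. Merging contigs into fewer, longer ones raises this value, which is why it is used as a contiguity measure.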

Comparative Analysis of Bacterial and Archaeal Population Structure by Illumina Sequencing of 16S rRNA Genes in Three Municipal Anaerobic Sludge Digesters

10.21203/rs.3.rs-60183/v1 ◽

2020 ◽

Author(s):

Munawwar Ali Khan ◽

Shams Tabrez Khan ◽

Milred Cedric Sequeira ◽

Sultan Mohammad Faheem

Keyword(s):

Wastewater Treatment ◽

16S Rrna ◽

Microbial Communities ◽

Illumina Sequencing ◽

Full Scale ◽

16S Rrna Genes ◽

Rrna Genes ◽

Anaerobic Digesters ◽

Fermentative Bacteria ◽

Better Regulation

Abstract Understanding the microbial communities in anaerobic digesters is important for better regulation, operation, and sustainable management of the sludge produced at various stages of wastewater treatment processes. Microbial communities in the anaerobic digesters of the Gulf region, where climatic conditions and other factors may affect the incoming feed, have not been documented. Archaeal and bacterial communities of three full-scale anaerobic digesters, namely AD1, AD3, and AD5, were analyzed by Illumina sequencing of 16S rRNA genes. Among bacteria, the most abundant genus was the fermentative Acetobacteroides (Blvii28). Other predominant bacterial genera in the digesters included thermophilic bacteria (Fervidobacterium and Coprothermobacter) and halophilic bacteria like Haloterrigena and Sediminibacter. This can be correlated with the climatic conditions in Dubai, where the bacteria in the original feed may be thermophilic or halophilic, as much of the water used in the country is desalinated seawater. Propionic acid-producing bacteria like Paludibacter and propionate-oxidizing bacteria like W5 were also dominant groups and were found in all the digesters. The predominant Archaea were mainly members of the phyla Euryarchaeota and Crenarchaeota belonging to the genera Methanocorpusculum, Metallosphaera, Methanocella, and Methanococcus. The high abundance of the hydrogenotrophic archaeon Methanocorpusculum (more than 50% of total Archaea) matches the high abundance of Acetobacteroides (Blvii28) and Fervidobacterium, which ferment organic substrates to acetate and H2. Coprothermobacter, which is known to improve protein degradation by establishing a syntrophy with hydrogenotrophic archaea, was also one of the dominant genera in the digesters.
This study, for the first time, contributes to an in-depth understanding of the phylogenetic diversity of a microbial community of three full-scale anaerobic digesters of a municipal wastewater treatment plant in Dubai, UAE.

Inclusion of Oxford Nanopore long reads improves all microbial and phage metagenome-assembled genomes from a complex aquifer system

10.1101/2019.12.18.880807 ◽

2019 ◽

Cited By ~ 1

Author(s):

Will A. Overholt ◽

Martin Hölzer ◽

Patricia Geesink ◽

Celia Diezel ◽

Manja Marz ◽

...

Keyword(s):

Cost Benefit Analysis ◽

Hybrid Approach ◽

Cost Benefit ◽

16S Rrna Genes ◽

Rrna Genes ◽

Aquifer System ◽

Sequencing Platform ◽

Long Reads ◽

Oxford Nanopore ◽

Hybrid Assemblies

Abstract Assembling microbial and phage genomes from metagenomes is a powerful and appealing method to understand structure-function relationships in complex environments. In order to compare the recovery of genomes from microorganisms and their phages from groundwater, we generated shotgun metagenomes with Illumina sequencing accompanied by long reads derived from the Oxford Nanopore sequencing platform. Assembly and metagenome-assembled genome (MAG) metrics for both microbes and viruses were determined from Illumina-only assemblies and a hybrid assembly approach. Strikingly, the hybrid approach more than doubled the number of mid to high-quality MAGs (> 50% completion, < 10% redundancy), generated nearly four-fold more phage genomes, and improved all associated genome metrics relative to the Illumina-only method. The hybrid assemblies yielded MAGs that were on average 7.8% more complete, with 133 fewer contigs and a 14 kbp greater N50. Furthermore, the longer contigs from the hybrid approach generated microbial MAGs that had a higher proportion of rRNA genes. We demonstrate this usefulness by linking microbial MAGs containing 16S rRNA genes with an extensive amplicon dataset. This work provides quantitative data to inform a cost-benefit analysis on the decision to supplement shotgun metagenomic projects with long reads towards the goal of recovering genomes from environmentally abundant groups.
