scholarly journals Improving recovery of member genomes from enrichment reactor microbial communities using MinION–based long read metagenomics

2018 ◽  
Author(s):  
Krithika Arumugam ◽  
Irina Bessarab ◽  
Xianghui Liu ◽  
Gayathri Natarajan ◽  
Daniela I. Drautz–Moses ◽  
...  

AbstractNew long read sequencing technologies offer huge potential for effective recovery of complete, closed genomes. While much progress has been made on cultured isolates, the ability of these methods to recover genomes of member taxa in complex microbial communities is less clear. Here we examine the ability of long read data to recover genomes from enrichment reactor metagenomes. Such modified communities offer a moderate level of complexity compared to the source communities and so are realistic, yet tractable, systems to use for this problem. We sampled an enrichment bioreactor designed to target anaerobic ammonium-oxidising bacteria (AnAOB) and sequenced genomic DNA using both short read (Illumina 301bp PE) and long read data (MinION Mk1B) from the same extraction aliquot. The community contained 23 members, of which 16 had genome bins defined from an assembly of the short read data. Two distinct AnAOB species from genus Candidatus Brocadia were present and had complete genomes, of which one was the most abundant member species in the community. We can recover a 4Mb genome, in 2 contigs, of long read assembled sequence that is unambiguously associated with the most abundant AnAOB member genome. We conclude that obtaining near closed, complete genomes of members of low-medium microbial communities using MinION long read sequence is feasible.

2021 ◽  
Author(s):  
Yu-Hsiang Chen ◽  
Pei-Wen Chiang ◽  
Denis Yu Rogozin ◽  
Andrey Georgievich Degermendzhy ◽  
Hsiu-Hui Chiu ◽  
...  

Background: Most of Earth's bacteria have yet to be cultivated. The metabolic and functional potentials of these uncultivated microorganisms thus remain mysterious, and the metagenome-assembled genome (MAG) approach is the most robust method for uncovering these potentials. However, MAGs discovered by conventional metagenomic assembly and binning methods are usually highly fragmented genomes with heterogeneous sequence contamination, and this affects the accuracy and sensitivity of genomic analyses. Though the maturation of long-read sequencing technologies provides a good opportunity to fix the problem of highly fragmented MAGs as mentioned above, the method's error-prone nature causes severe problems of long-read-alone metagenomics. Hence, methods are urgently needed to retrieve MAGs by a combination of both long- and short-read technologies to advance genome-centric metagenomics. Results: In this study, we combined Illumina and Nanopore data to develop a new workflow to reconstruct 233 MAGs-six novel bacterial orders, 20 families, 66 genera, and 154 species-from Lake Shunet, a secluded meromictic lake in Siberia. Those new MAGs were underrepresented or undetectable in other MAGs studies using metagenomes from human or other common organisms or habitats. Using this newly developed workflow and strategy, the average N50 of reconstructed MAGs greatly increased 10-40-fold compared to when the conventional Illumina assembly and binning method were used. More importantly, six complete MAGs were recovered from our datasets, five of which belong to novel species. We used these as examples to demonstrate many novel and intriguing genomic characteristics discovered in these newly complete genomes and proved the importance of high-quality complete MAGs in microbial genomics and metagenomics studies. Conclusions: The results show that it is feasible to apply our workflow with a few additional long reads to recover numerous complete and high-quality MAGs from short-read metagenomes of high microbial diversity environment samples. The unique features we identified from five complete genomes highlight the robustness of this method in genome-centric metagenomic research. The recovery of 154 novel species MAGs from a rarely explored lake greatly expands the current bacterial genome encyclopedia and broadens our knowledge by adding new genomic characteristics of bacteria. It demonstrates a strong need to recover MAGs from diverse unexplored habitats in the search for microbial dark matter.


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Krithika Arumugam ◽  
Irina Bessarab ◽  
Mindia A. S. Haryono ◽  
Xianghui Liu ◽  
Rogelio E. Zuniga–Montanez ◽  
...  

AbstractNew long read sequencing technologies offer huge potential for effective recovery of complete, closed genomes from complex microbial communities. Using long read data (ONT MinION) obtained from an ensemble of activated sludge enrichment bioreactors we recover 22 closed or complete genomes of community members, including several species known to play key functional roles in wastewater bioprocesses, specifically microbes known to exhibit the polyphosphate- and glycogen-accumulating organism phenotypes (namely Candidatus Accumulibacter and Dechloromonas, and Micropruina, Defluviicoccus and Candidatus Contendobacter, respectively), and filamentous bacteria (Thiothrix) associated with the formation and stability of activated sludge flocs. Additionally we demonstrate the recovery of close to 100 circularised plasmids, phages and small microbial genomes from these microbial communities using long read assembled sequence. We describe methods for validating long read assembled genomes using their counterpart short read metagenome-assembled genomes, and assess the influence of different correction procedures on genome quality and predicted gene quality. Our findings establish the feasibility of performing long read metagenome-assembled genome recovery for both chromosomal and non-chromosomal replicons, and demonstrate the value of parallel sampling of moderately complex enrichment communities to obtaining high quality reference genomes of key functional species relevant for wastewater bioprocesses.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Chong Chu ◽  
Rebeca Borges-Monroy ◽  
Vinayak V. Viswanadham ◽  
Soohyun Lee ◽  
Heng Li ◽  
...  

AbstractTransposable elements (TEs) help shape the structure and function of the human genome. When inserted into some locations, TEs may disrupt gene regulation and cause diseases. Here, we present xTea (x-Transposable element analyzer), a tool for identifying TE insertions in whole-genome sequencing data. Whereas existing methods are mostly designed for short-read data, xTea can be applied to both short-read and long-read data. Our analysis shows that xTea outperforms other short read-based methods for both germline and somatic TE insertion discovery. With long-read data, we created a catalogue of polymorphic insertions with full assembly and annotation of insertional sequences for various types of retroelements, including pseudogenes and endogenous retroviruses. Notably, we find that individual genomes have an average of nine groups of full-length L1s in centromeres, suggesting that centromeres and other highly repetitive regions such as telomeres are a significant yet unexplored source of active L1s. xTea is available at https://github.com/parklab/xTea.


2017 ◽  
Author(s):  
Mircea Cretu Stancu ◽  
Markus J. van Roosmalen ◽  
Ivo Renkens ◽  
Marleen Nieboer ◽  
Sjors Middelkamp ◽  
...  

AbstractStructural genomic variants form a common type of genetic alteration underlying human genetic disease and phenotypic variation. Despite major improvements in genome sequencing technology and data analysis, the detection of structural variants still poses challenges, particularly when variants are of high complexity. Emerging long-read single-molecule sequencing technologies provide new opportunities for detection of structural variants. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the ONT MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline - NanoSV - to efficiently map genomic structural variants (SVs) from the long-read data. We demonstrate that the nanopore data are superior to corresponding short-read data with regard to detection of de novo rearrangements originating from complex chromothripsis events in the patients. Additionally, genome-wide surveillance of SVs, revealed 3,253 (33%) novel variants that were missed in short-read data of the same sample, the majority of which are duplications < 200bp in size. Long sequencing reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs and SNVs. We employed read-based phasing to show that all de novo chromothripsis breakpoints occurred on paternal chromosomes and we resolved the long-range structure of the chromothripsis. This work demonstrates the value of long-read sequencing for screening whole genomes of patients for complex structural variants.


2021 ◽  
Vol 10 (46) ◽  
Author(s):  
Kentaro Miyazaki ◽  
Natsuko Tokito

Complete genome resequencing was conducted for Thermus thermophilus strain TMY by hybrid assembly of Oxford Nanopore Technologies long-read and MGI short-read data. Errors in the previously reported genome sequence determined by PacBio technology alone were corrected, allowing for high-quality comparative genomic analysis of closely related T. thermophilus genomes.


2020 ◽  
Author(s):  
Lauren Coombe ◽  
Vladimir Nikolić ◽  
Justin Chu ◽  
Inanc Birol ◽  
René L. Warren

AbstractSummaryThe ability to generate high-quality genome sequences is cornerstone to modern biological research. Even with recent advancements in sequencing technologies, many genome assemblies are still not achieving reference-grade. Here, we introduce ntJoin, a tool that leverages structural synteny between a draft assembly and reference sequence(s) to contiguate and correct the former with respect to the latter. Instead of alignments, ntJoin uses a lightweight mapping approach based on a graph data structure generated from ordered minimizer sketches. The tool can be used in a variety of different applications, including improving a draft assembly with a reference-grade genome, a short read assembly with a draft long read assembly, and a draft assembly with an assembly from a closely-related species. When scaffolding a human short read assembly using the reference human genome or a long read assembly, ntJoin improves the NGA50 length 23- and 13-fold, respectively, in under 13 m, using less than 11 GB of RAM. Compared to existing reference-guided assemblers, ntJoin generates highly contiguous assemblies faster and using less memory.Availability and implementationntJoin is written in C++ and Python, and is freely available at https://github.com/bcgsc/[email protected]


2018 ◽  
Author(s):  
Mark T. W. Ebbert ◽  
Stefan Farrugia ◽  
Jonathon Sens ◽  
Karen Jansen-West ◽  
Tania F. Gendron ◽  
...  

AbstractBackground: Many neurodegenerative diseases are caused by nucleotide repeat expansions, but most expansions, like the C9orf72 ‘GGGGCC’ (G4C2) repeat that causes approximately 5-7% of all amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD) cases, are too long to sequence using short-read sequencing technologies. It is unclear whether long-read sequencing technologies can traverse these long, challenging repeat expansions. Here, we demonstrate that two long-read sequencing technologies, Pacific Biosciences’ (PacBio) and Oxford Nanopore Technologies’ (ONT), can sequence through disease-causing repeats cloned into plasmids, including the FTD/ALS-causing G4C2 repeat expansion. We also report the first long-read sequencing data characterizing the C9orf72 G4C2 repeat expansion at the nucleotide level in two symptomatic expansion carriers using PacBio whole-genome sequencing and a no-amplification (No-Amp) targeted approach based on CRISPR/Cas9.Results: Both the PacBio and ONT platforms successfully sequenced through the repeat expansions in plasmids. Throughput on the MinlON was a challenge for whole-genome sequencing; we were unable to attain reads covering the human C9orf72 repeat expansion using 15 flow cells. We obtained 8x coverage across the C9orf72 locus using the PacBio Sequel, accurately reporting the unexpanded allele at eight repeats, and reading through the entire expansion with 1324 repeats (7941 nucleotides). Using the No-Amp targeted approach, we attained >800x coverage and were able to identify the unexpanded allele, closely estimate expansion size, and assess nucleotide content in a single experiment. We estimate the individual’s repeat region was >99% G4C2 content, though we cannot rule out small interruptions.Conclusions: Our findings indicate that long-read sequencing is well suited to characterizing known repeat expansions, and for discovering new disease-causing, disease-modifying, or risk-modifying repeat expansions that have gone undetected with conventional short-read sequencing. The PacBio No-Amp targeted approach may have future potential in clinical and genetic counseling environments. Larger and deeper long-read sequencing studies in C9orf72 expansion carriers will be important to determine heterogeneity and whether the repeats are interrupted by non-G4C2 content, potentially mitigating or modifying disease course or age of onset, as interruptions are known to do in other repeat-expansion disorders. These results have broad implications across all diseases where the genetic etiology remains unclear.


2019 ◽  
Author(s):  
Mark T. W. Ebbert ◽  
Tanner D. Jensen ◽  
Karen Jansen-West ◽  
Jonathon P. Sens ◽  
Joseph S. Reddy ◽  
...  

AbstractBackgroundThe human genome contains ‘dark’ gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions that are ‘dark by depth’ (few mappable reads) and others that are ‘camouflaged’ (ambiguous alignment), and we assess how well long-read technologies resolve these regions. We further present an algorithm to resolve most camouflaged regions (including in short-read data) and apply it to the Alzheimer’s Disease Sequencing Project (ADSP; 13142 samples), as a proof of principle.ResultsBased on standard whole-genome lllumina sequencing data, we identified 37873 dark regions in 5857 gene bodies (3635 protein-coding) from pathways important to human health, development, and reproduction. Of the 5857 gene bodies, 494 (8.4%) were 100% dark (142 protein-coding) and 2046 (34.9%) were ≥5% dark (628 protein-coding). Exactly 2757 dark regions were in protein-coding exons (CDS) across 744 genes. Long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduced dark CDS regions to approximately 45.1%, 33.3%, and 18.2% respectively. Applying our algorithm to the ADSP, we rescued 4622 exonic variants from 501 camouflaged genes, including a rare, ten-nucleotide frameshift deletion in CR1, a top Alzheimer’s disease gene, found in only five ADSP cases and zero controls.ConclusionsWhile we could not formally assess the CR1 frameshift mutation in Alzheimer’s disease (insufficient sample-size), we believe it merits investigating in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.


2021 ◽  
Author(s):  
Peipei Wang ◽  
Fanrui Meng ◽  
Bethany M. Moore ◽  
Shin-Han Shiu

Abstract Background: Availability of plant genome sequences has led to significant advances. However, with few exceptions, the great majority of existing genome assemblies are derived from short read sequencing technologies with highly uneven read coverages indicative of sequencing and assembly issues that could significantly impact any downstream analysis of plant genomes. In tomato for example, 0.6% (5.1 Mb) and 9.7% (79.6 Mb) of short-read based assembly had significantly higher and lower coverage compared to background, respectively.Results: To understand what the causes may be for such uneven coverage, we first established machine learning models capable of predicting genomic regions with variable coverages and found that high coverage regions tend to have higher simple sequence repeat and tandem gene densities compared to background regions. To determine if the high coverage regions were misassembled, we examined a recently available tomato long-read based assembly and found that 27.8% (1.41 Mb) of high coverage regions were potentially misassembled of duplicate sequences, compared to 1.4% in background regions. In addition, using a predictive model that can distinguish correctly and incorrectly assembled high coverage regions, we found that misassembled, high coverage regions tend to be flanked by simple sequence repeats, pseudogenes, and transposon elements. Conclusions: Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when using short reads.


Sign in / Sign up

Export Citation Format

Share Document