scholarly journals Assembly of Long Error-Prone Reads Using Repeat Graphs

2018 ◽  
Author(s):  
Mikhail Kolmogorov ◽  
Jeffrey Yuan ◽  
Yu Lin ◽  
Pavel. A. Pevzner

ABSTRACTThe problem of genome assembly is ultimately linked to the problem of the characterization of all repeat families in a genome as a repeat graph. The key reason the de Bruijn graph emerged as a popular short read assembly approach is because it offered an elegant representation of all repeats in a genome that reveals their mosaic structure. However, most algorithms for assembling long error-prone reads use an alternative overlap-layout-consensus (OLC) approach that does not provide a repeat characterization. We present the Flye algorithm for constructing the A-Bruijn (assembly) graph from long error-prone reads, that, in contrast to the k-mer-based de Bruijn graph, assembles genomes using an alignment-based A-Bruijn graph. In difference from existing assemblers, Flye does not attempt to construct accurate contigs (at least at the initial assembly stage) but instead simply generates arbitrary paths in the (unknown) assembly graph and further constructs an assembly graph from these paths. Counter-intuitively, this fast but seemingly reckless approach results in the same graph as the assembly graph constructed from accurate contigs. Flye constructs (overlapping) contigs with possible assembly errors at the initial stage, combines them into an accurate assembly graph, resolves repeats in the assembly graph using small variations between various repeat instances that were left unresolved during the initial assembly stage, constructs a new, less tangled assembly graph based on resolved repeats, and finally outputs accurate contigs as paths in this graph. We benchmark Flye against several state-of-the-art Single Molecule Sequencing assemblers and demonstrate that it generates better or comparable assemblies for all analyzed datasets.

2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Kingshuk Mukherjee ◽  
Massimiliano Rossi ◽  
Leena Salmela ◽  
Christina Boucher

AbstractGenome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics’ Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas Testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) only successfully ran on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper.


2016 ◽  
Author(s):  
Amit Gur ◽  
Edward Buckler ◽  
Joseph Burger ◽  
Yaakov Tadmor ◽  
Iftach Klapp

Project objectives: 1) Characterization of variation for yield heterosis in melon using Half-Diallele (HDA) design. 2) Development and implementation of image-based yield phenotyping in melon. 3) Characterization of genetic, epigenetic and transcriptional variation across 25 founder lines and selected hybrids. The epigentic part of this objective was modified during the course of the project: instead of characterization of chromatin structure in a single melon line through genome-wide mapping of nucleosomes using MNase-seq approach, we took advantage of rapid advancements in single-molecule sequencing and shifted the focus to Nanoporelong-read sequencing of all 25 founder lines. This analysis provides invaluable information on genome-wide structural variation across our diversity 4) Integrated analyses and development of prediction models Agricultural heterosis relates to hybrids that outperform their inbred parents for yield. First generation (F1) hybrids are produced in many crop species and it is estimated that heterosis increases yield by 15-30% globally. Melon (Cucumismelo) is an economically important species of The Cucurbitaceae family and is among the most important fleshy fruits for fresh consumption Worldwide. The major goal of this project was to explore the patterns and magnitude of yield heterosis in melon and link it to whole genome sequence variation. A core subset of 25 diverse lines was selected from the Newe-Yaar melon diversity panel for whole-genome re-sequencing (WGS) and test-crosses, to produce structured half-diallele design of 300 F1 hybrids (MelHDA25). Yield variation was measured in replicated yield trials at the whole-plant and at the rootstock levels (through a common-scion grafted experiments), across the F1s and parental lines. As part of this project we also developed an algorithmic pipeline for detection and yield estimation of melons from aerial-images, towards future implementation of such high throughput, cost-effective method for remote yield evaluation in open-field melons. We found extensive, highly heritable root-derived yield variation across the diallele population that was characterized by prominent best-parent heterosis (BPH), where hybrids rootstocks outperformed their parents by 38% and 56 % under optimal irrigation and drought- stress, respectively. Through integration of the genotypic data (~4,000,000 SNPs) and yield analyses we show that root-derived hybrids yield is independent of parental genetic distance. However, we mapped novel root-derived yield QTLs through genome-wide association (GWA) analysis and a multi-QTLs model explained more than 45% of the hybrids yield variation, providing a potential route for marker-assisted hybrid rootstock breeding. Four selected hybrid rootstocks are further studied under multiple scion varieties and their validated positive effect on yield performance is now leading to ongoing evaluation of their commercial potential. On the genomic level, this project resulted in 3 layers of data: 1) whole-genome short-read Illumina sequencing (30X) of the 25 founder lines provided us with 25 genome alignments and high-density melon HapMap that is already shown to be an effective resource for QTL annotation and candidate gene analysis in melon. 2) fast advancements in long-read single-molecule sequencing allowed us to shift focus towards this technology and generate ~50X Nanoporesequencing of the 25 founders which in combination with the short-read data now enable de novo assembly of the 25 genomes that will soon lead to construction of the first melon pan-genome. 3) Transcriptomic (3' RNA-Seq) analysis of several selected hybrids and their parents provide preliminary information on differentially expressed genes that can be further used to explain the root-derived yield variation. Taken together, this project expanded our view on yield heterosis in melon with novel specific insights on root-derived yield heterosis. To our knowledge, thus far this is the largest systematic genetic analysis of rootstock effects on yield heterosis in cucurbits or any other crop plant, and our results are now translated into potential breeding applications. The genomic resources that were developed as part of this project are putting melon in the forefront of genomic research and will continue to be useful tool for the cucurbits community in years to come. 


2019 ◽  
Vol 35 (18) ◽  
pp. 3250-3256 ◽  
Author(s):  
Kingshuk Mukherjee ◽  
Bahar Alipanahi ◽  
Tamer Kahveci ◽  
Leena Salmela ◽  
Christina Boucher

Abstract Motivation Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have been mainly used in post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single molecule optical maps—called Rmaps, or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data: optical maps and sequence data which will facilitate the usage of optical maps in the sequence assembly step itself. Results We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data. Availability and implementation The software for aligning optical maps to de Bruijn graph, omGraph is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/omGraph. Supplementary information Supplementary data are available at Bioinformatics online.


2016 ◽  
Author(s):  
Yu Lin ◽  
Jeffrey Yuan ◽  
Mikhail Kolmogorov ◽  
Max W. Shen ◽  
Pavel A. Pevzner

AbstractThe recent breakthroughs in assembling long error-prone reads (such as reads generated by Single Molecule Real Time technology) were based on the overlap-layout-consensus approach and did not utilize the strengths of the alternative de Bruijn graph approach to genome assembly. Moreover, these studies often assume that applications of the de Bruijn graph approach are limited to short and accurate reads and that the overlap-layout-consensus approach is the only practical paradigm for assembling long error-prone reads. Below we show how to generalize de Bruijn graphs to assemble long error-prone reads and describe the ABruijn assembler, which results in more accurate genome reconstructions than the existing state-of-the-art algorithms.


2021 ◽  
Author(s):  
Kingshuk Mukherjee ◽  
Massimiliano Rossi ◽  
Leena Salmela ◽  
Christina Boucher

Abstract Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there exists very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary method that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as Rmapper, and compare its performance against the assembler of Valouev et al. (2006) and Solve by Bionano Genomics on data from three genomes - E. coli, human, and climbing perch fish (Anabas Testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al.(2006) only successfully ran on E. coli. Moreover, on the human genome Rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, RMAPPER is written in C++ and is publicly available under GNU General Public License at https://github.com/kingufl/Rmapper.


2016 ◽  
Author(s):  
Diego D. Cambuy ◽  
Felipe H. Coutinho ◽  
Bas E. Dutilh

AbstractIn modern-day metagenomics, there is an increasing need for robust taxonomic annotation of long DNA sequences from unknown micro-organisms. Long metagenomic sequences may be derived from assembly of short-read metagenomes, or from long-read single molecule sequencing. Here we introduce CAT, a pipeline for robust taxonomic classification of long DNA sequences. We show that CAT correctly classifies contigs at different taxonomic levels, even in simulated metagenomic datasets that are very distantly related from the sequences in the database. CAT is implemented in Python and the required scripts can be freely downloaded from Github.


2017 ◽  
Author(s):  
Roye Rozov ◽  
Gil Goldshlager ◽  
Eran Halperin ◽  
Ron Shamir

AbstractMotivationWe present Faucet, a 2-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage is increased.ResultsFaucet pairs the de Bruijn graph obtained from the reads with additional meta-data derived from them. We show these metadata - coverage counts collected at junction k-mers and connections bridging between junction pairs - contain most salient information needed for assembly, and demonstrate they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Faucet’s resource use and assembly quality to state of the art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched performance of other assemblers optimizing resource efficiency - namely, Minia and LightAssembler. However, on metagenomes tested, Faucet’s outputs had 14-110% higher mean NGA50 lengths compared to Minia, and 2-11-fold higher mean NGA50 lengths compared to LightAssembler, the only other streaming assembler available.AvailabilityFaucet is available at https://github.com/Shamir-Lab/[email protected],[email protected] information:Supplementary data are available at Bioinformatics online.


Sign in / Sign up

Export Citation Format

Share Document