Non Hybrid Long Read Consensus Using Local De Bruijn Graph Assembly

10.1101/106252 ◽

2017 ◽

Cited By ~ 10

Author(s):

German Tischler ◽

Eugene W. Myers

Keyword(s):

Second Generation ◽

Hybrid Methods ◽

Low Noise ◽

De Bruijn Graph ◽

Third Generation ◽

Sequencing Errors ◽

Long Reads ◽

Long Read ◽

Second Generation Sequencing ◽

De Bruijn

Abstract While second generation sequencing led to a vast increase in sequenced data, the shorter reads that came with it made assembly a much harder task, and for some regions impossible with only short read data. This changed again with the advent of third generation long read sequencers. The length of the long reads allows a much better resolution of repetitive regions; their high error rate, however, is a major challenge. Using the data successfully requires removing most of the sequencing errors. The first hybrid correction methods used low noise second generation data to correct third generation data, but this approach has issues when it is unclear where to place the short reads due to repeats, and also because second generation sequencers fail to sequence some regions that third generation sequencers can cover. Later, non hybrid methods appeared. We present a new method for non hybrid long read error correction based on De Bruijn graph assembly of short windows of long reads, with subsequent combination of these corrected windows into corrected long reads. Our experiments show that this method yields a better correction than other state of the art non hybrid correction approaches.
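The idea of assembling one short window with a local De Bruijn graph can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `window_consensus`, the `min_count` threshold, and the greedy heaviest-edge walk are all assumptions chosen for illustration. The sketch counts k-mers over the read fragments that cover a window, discards rare k-mers as likely errors, and walks the remaining graph from the most common starting k-mer.

```python
from collections import Counter, defaultdict


def window_consensus(fragments, k, min_count=2, max_len=1000):
    """Greedy consensus of one window (hypothetical helper, illustration only)."""
    # Count every k-mer across all fragments that cover the window.
    counts = Counter(
        frag[i:i + k] for frag in fragments for i in range(len(frag) - k + 1)
    )
    # Keep only k-mers seen often enough to be trusted (assumed threshold).
    solid = {kmer: c for kmer, c in counts.items() if c >= min_count}

    # De Bruijn edges: (k-1)-mer prefix -> list of (count, k-mer).
    out_edges = defaultdict(list)
    for kmer, c in solid.items():
        out_edges[kmer[:-1]].append((c, kmer))

    # Start from the most frequent solid k-mer that begins a fragment.
    starts = Counter(frag[:k] for frag in fragments if frag[:k] in solid)
    if not starts:
        return ""
    current = starts.most_common(1)[0][0]

    consensus = current
    seen = {current}
    while len(consensus) < max_len:
        # Follow the heaviest unvisited outgoing edge, if any.
        candidates = [e for e in out_edges.get(current[1:], []) if e[1] not in seen]
        if not candidates:
            break
        _, current = max(candidates)
        seen.add(current)
        consensus += current[-1]
    return consensus
```

For example, given three copies of `ACGTTGCA` and one erroneous copy `ACGTAGCA` with `k=4`, the k-mers introduced by the error occur only once and are filtered out, so the walk spells the majority sequence. A real corrector would also need to handle both strands, branching caused by repeats, and the stitching of adjacent windows.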

Evaluating approaches to find exon chains based on long reads

10.1101/066241 ◽

2016 ◽

Author(s):

Anna Kuosmanen ◽

Veli Mäkinen

Keyword(s):

Second Generation ◽

Simulated Data ◽

Error Rates ◽

Third Generation ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Long Read ◽

Second Generation Sequencing ◽

Generation Sequencing

Abstract Motivation Transcript prediction can be modelled as a graph problem where exons are modelled as nodes and reads spanning two or more exons are modelled as exon chains. PacBio third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions. Results We survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity / precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph on which the long read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy. Availability The simulated data and in-house scripts used for this article are available at http://cs.helsinki.fi/u/aekuosma/exon_chain_evaluation_publish.tar.gz.

Iterative de Bruijn graph assemblers for second-generation sequencing reads

10.5353/th_b5053405 ◽

2012 ◽

Author(s):

Yu Peng

Keyword(s):

Second Generation ◽

De Bruijn Graph ◽

Second Generation Sequencing ◽

De Bruijn ◽

Generation Sequencing

cloudSPAdes: assembly of synthetic long reads using de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz349 ◽

2019 ◽

Vol 35 (14) ◽

pp. i61-i70 ◽

Cited By ~ 4

Author(s):

Ivan Tolstoganov ◽

Anton Bankevich ◽

Zhoutao Chen ◽

Pavel A Pevzner

Keyword(s):

Narrow Range ◽

State Of The Art ◽

Supplementary Information ◽

De Bruijn Graph ◽

Hybrid Assembly ◽

De Bruijn Graphs ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

New Applications

Abstract Motivation The recently developed barcoding-based synthetic long read (SLR) technologies have already found many applications in genome assembly and analysis. However, although some new barcoding protocols are emerging and the range of SLR applications is being expanded, the existing SLR assemblers are optimized for a narrow range of parameters and are not easily extendable to new barcoding technologies and new applications such as metagenomics or hybrid assembly. Results We describe the algorithmic challenge of the SLR assembly and present a cloudSPAdes algorithm for SLR assembly that is based on analyzing the de Bruijn graph of SLRs. We benchmarked cloudSPAdes across various barcoding technologies/applications and demonstrated that it improves on the state-of-the-art SLR assemblers in accuracy and speed. Availability and implementation Source code and installation manual for cloudSPAdes are available at https://github.com/ablab/spades/releases/tag/cloudspades-paper. Supplementary Information Supplementary data are available at Bioinformatics online.

FLAS: fast and high-throughput algorithm for PacBio long-read self-correction

Bioinformatics ◽

10.1093/bioinformatics/btz206 ◽

2019 ◽

Vol 35 (20) ◽

pp. 3953-3960 ◽

Cited By ~ 3

Author(s):

Ergude Bao ◽

Fei Xie ◽

Changjin Song ◽

Dandan Song

Keyword(s):

High Throughput ◽

The Self ◽

Supplementary Information ◽

Third Generation ◽

Performance Tests ◽

Sequencing Errors ◽

The Third ◽

Fast Speed ◽

Long Reads ◽

Long Read

Abstract Motivation The third generation PacBio long reads have greatly facilitated sequencing projects with very large read lengths, but they contain about 15% sequencing errors and need error correction. For the projects with long reads only, it is challenging to make correction with fast speed, and also challenging to correct a sufficient amount of read bases, i.e. to achieve high-throughput self-correction. MECAT is currently among the fastest self-correction algorithms, but its throughput is relatively small (Xiao et al., 2017). Results Here, we introduce FLAS, a wrapper algorithm of MECAT, to achieve high-throughput long-read self-correction while keeping MECAT’s fast speed. FLAS finds additional alignments from MECAT prealigned long reads to improve the correction throughput, and removes misalignments for accuracy. In addition, FLAS also uses the corrected long-read regions to correct the uncorrected ones to further improve the throughput. In our performance tests on Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana and human long reads, FLAS can achieve 22.0–50.6% larger throughput than MECAT. FLAS is 2–13× faster compared to the self-correction algorithms other than MECAT, and its throughput is also 9.8–281.8% larger. The FLAS corrected long reads can be assembled into contigs of 13.1–29.8% larger N50 sizes than MECAT. Availability and implementation The FLAS software can be downloaded for free from this site: https://github.com/baoe/flas. Supplementary information Supplementary data are available at Bioinformatics online.
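The N50 statistic used above to compare assemblies has a standard definition: it is the contig length L such that contigs of length at least L together cover at least half of the total assembled bases. A small sketch of that calculation follows; the function name `n50` is chosen here for illustration and is not taken from FLAS or MECAT.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

With contig lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest contigs already cover 11 bases, so the N50 is 5.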

A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

BMC Genomics ◽

10.1186/s12864-019-6286-9 ◽

2019 ◽

Vol 20 (S11) ◽

Author(s):

Arghya Kusum Das ◽

Sayan Goswami ◽

Kisung Lee ◽

Seung-Jong Park

Keyword(s):

Error Correction ◽

Error Rates ◽

De Bruijn Graph ◽

Correction Algorithm ◽

Short Read ◽

Short Reads ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

Error Correction Algorithm

Abstract Background Long-read sequencing has shown promise in overcoming the short length limitations of second-generation sequencing by providing more complete assembly. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to the short reads. Methods In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high throughput Illumina short-read sequences to rectify the PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low and high coverage regions, followed by a majority voting to rectify each substituted error base. Results ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation results demonstrate that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset with the reference genome, proving its accuracy. Conclusion ParLECH can scale to over terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
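The widest path (maximum min-coverage path) named above is a standard graph problem: among all paths between two nodes, pick the one whose weakest edge is as strong as possible. The sketch below solves it with a Dijkstra-style search on a small dictionary graph. It is not ParLECH's distributed implementation; the function name `widest_path`, the graph representation, and the assumption that every edge coverage is a positive number are illustrative choices.

```python
import heapq


def widest_path(graph, source, target):
    """Return (bottleneck, path) maximizing the minimum edge coverage.

    graph maps node -> {neighbor: coverage}; coverages are assumed positive.
    Returns (0, []) if target is unreachable.
    """
    best = {source: float("inf")}  # best bottleneck found so far per node
    parent = {source: None}
    heap = [(-float("inf"), source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == target:
            break
        if width < best.get(node, 0):
            continue  # stale heap entry
        for nxt, coverage in graph.get(node, {}).items():
            cand = min(width, coverage)  # bottleneck if we extend through this edge
            if cand > best.get(nxt, 0):
                best[nxt] = cand
                parent[nxt] = node
                heapq.heappush(heap, (-cand, nxt))
    if target not in best:
        return 0, []
    # Rebuild the path by following parent links back to the source.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]
```

In a graph with edges s→a (5), a→t (8), s→b (9) and b→t (3), the route through `b` starts on the heaviest edge but is limited by the coverage-3 edge, so the widest path is s→a→t with bottleneck 5.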

Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz102 ◽

2019 ◽

Vol 36 (5) ◽

pp. 1374-1381 ◽

Cited By ~ 9

Author(s):

Antoine Limasset ◽

Jean-François Flot ◽

Pierre Peterlongo

Keyword(s):

Supplementary Information ◽

De Bruijn Graph ◽

Sequence Information ◽

Short Read ◽

De Bruijn Graphs ◽

Short Reads ◽

Sequencing Errors ◽

Long Read ◽

De Bruijn ◽

Read Accuracy

Abstract Motivation Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere suites of k-mers, without taking into account their full-length sequence information. Results We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond. Availability and implementation The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package. Supplementary information Supplementary data are available at Bioinformatics online.
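The k-mer abundance filter described above can be sketched in a few lines. This is only the first filtering step in simplified form, not Bcool itself: it ignores reverse complements, graph compaction, and unitig abundance, and the names `solid_kmers`, `weak_positions`, and `min_abundance` are assumptions made for illustration.

```python
from collections import Counter


def solid_kmers(reads, k, min_abundance):
    """Return the set of k-mers occurring at least min_abundance times."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, c in counts.items() if c >= min_abundance}


def weak_positions(read, solid, k):
    """Start positions in read whose k-mer was filtered out as low abundance."""
    return [i for i in range(len(read) - k + 1) if read[i:i + k] not in solid]
```

With reads `ACGTAC`, `ACGTAC` and `ACGAAC`, `k=3` and `min_abundance=2`, the three k-mers overlapping the substituted base in the third read occur only once, so `weak_positions` reports start positions 1, 2 and 3 for that read and nothing for the other two. A corrector would then try to replace such weak stretches using the filtered graph.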

TALC: Transcript-level Aware Long Read Correction

10.1101/2020.01.10.901728 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lucile Broseus ◽

Aubin Thomas ◽

Andrew J. Oldfield ◽

Dany Severac ◽

Emeric Dubois ◽

...

Keyword(s):

Transcriptome Sequencing ◽

Transcript Level ◽

De Bruijn Graph ◽

Rna Seq ◽

Sequencing Data ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

Rna Transcript

Abstract Motivation Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous “hybrid correction” algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. Results We have created a novel reference-free algorithm called TALC (Transcription Aware Long Read Correction) which models changes in RNA expression and isoform representation in a weighted De-Bruijn graph to correct long reads from transcriptome studies. We show that transcription aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology. Availability and Implementation TALC is implemented in C++ and available at https://gitlab.igh.cnrs.fr/lbroseus/[email protected]

Enhanced Recovery of Microbial Genes and Genomes From a Marine Water Column Using Long-Read Metagenomics

Frontiers in Microbiology ◽

10.3389/fmicb.2021.708782 ◽

2021 ◽

Vol 12 ◽

Author(s):

Jose M. Haro-Moreno ◽

Mario López-Pérez ◽

Francisco Rodriguez-Valera

Keyword(s):

Water Column ◽

Error Rate ◽

Population Genomics ◽

Enhanced Recovery ◽

Third Generation ◽

Long Reads ◽

Flexible Genome ◽

Long Read ◽

Second Generation Sequencing ◽

Generation Sequencing

Third-generation sequencing has penetrated little in metagenomics due to the high error rate and dependence for assembly on short-read designed bioinformatics. However, second-generation sequencing metagenomics (mostly Illumina) suffers from limitations, particularly in the assembly of microbes with high microdiversity and retrieval of the flexible (adaptive) fraction of prokaryotic genomes. Here, we have used a third-generation technique to study the metagenome of a well-known marine sample from the mixed epipelagic water column of the winter Mediterranean. We have compared PacBio Sequel II with the classical approach using Illumina Nextseq short reads followed by assembly to study the metagenome. Long reads allow for efficient direct retrieval of complete genes avoiding the bias of the assembly step. Besides, the application of long reads on metagenomic assembly allows for the reconstruction of much more complete metagenome-assembled genomes (MAGs), particularly from microbes with high microdiversity such as Pelagibacterales. The flexible genome of reconstructed MAGs was much more complete containing many adaptive genes (some with biotechnological potential). PacBio Sequel II CCS appears particularly suitable for cellular metagenomics due to its low error rate. For most applications of metagenomics, from community structure analysis to ecosystem functioning, long reads should be applied whenever possible. Specifically, for in silico screening of biotechnologically useful genes, or population genomics, long-read metagenomics appears presently as a very fruitful approach and can be analyzed from raw reads before a computationally demanding (and potentially artifactual) assembly step.

Hybrid correction of highly noisy Oxford Nanopore long reads using a variable-order de Bruijn graph

10.1101/238808 ◽

2017 ◽

Cited By ~ 3

Author(s):

Pierre Morisse ◽

Thierry Lecroq ◽

Arnaud Lefebvre

Keyword(s):

Error Correction ◽

Error Rate ◽

De Bruijn Graph ◽

Variable Order ◽

Short Reads ◽

Pacific Biosciences ◽

Long Reads ◽

Oxford Nanopore ◽

Long Read ◽

De Bruijn

Abstract Motivation The recent rise of long read sequencing technologies such as Pacific Biosciences and Oxford Nanopore makes it possible to solve assembly problems for larger and more complex genomes than short read technologies allowed. However, these long reads are very noisy, reaching an error rate of around 10 to 15% for Pacific Biosciences, and up to 30% for Oxford Nanopore. The error correction problem has been tackled by either self-correcting the long reads, or using complementary short reads in a hybrid approach, but most methods only focus on Pacific Biosciences data, and do not apply to Oxford Nanopore reads. Moreover, even though recent chemistries from Oxford Nanopore promise to lower the error rate below 15%, it is still higher in practice, and correcting such noisy long reads remains an issue. Results We present HG-CoLoR, a hybrid error correction method that focuses on a seed-and-extend approach based on the alignment of the short reads to the long reads, followed by the traversal of a variable-order de Bruijn graph, built from the short reads. Our experiments show that HG-CoLoR manages to efficiently correct Oxford Nanopore long reads that display an error rate as high as 44%. When compared to other state-of-the-art long read error correction methods able to deal with Oxford Nanopore data, our experiments also show that HG-CoLoR provides the best trade-off between runtime and quality of the results, and is the only method able to efficiently scale to eukaryotic genomes. Availability and implementation HG-CoLoR is implemented in C++, supported on Linux platforms and freely available at https://github.com/morispi/HG-CoLoR Contact: [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Toward perfect reads: short reads correction via mapping on compacted de Bruijn graphs

10.1101/558395 ◽

2019 ◽

Cited By ~ 3

Author(s):

Antoine Limasset ◽

Jean-François Flot ◽

Pierre Peterlongo

Keyword(s):

Large Data ◽

De Bruijn Graph ◽

Data Sets ◽

Short Read ◽

De Bruijn Graphs ◽

Short Reads ◽

Sequencing Errors ◽

Long Read ◽

De Bruijn ◽

Read Accuracy

Abstract Motivations Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large data sets or consider reads as mere suites of k-mers, without taking into account their full-length read information. Results We propose a new method to correct short reads using de Bruijn graphs, and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond. Availability and Implementation The implementation is open source and available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package. Contact Antoine Limasset [email protected] & Jean-François Flot [email protected] & Pierre Peterlongo [email protected]
