Evaluating approaches to find exon chains based on long reads

2016 ◽  
Author(s):  
Anna Kuosmanen ◽  
Veli Mäkinen

Abstract Motivation Transcript prediction can be modelled as a graph problem where exons are modelled as nodes and reads spanning two or more exons are modelled as exon chains. PacBio third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions. Results We survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity/precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph on which the long read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy. Availability The simulated data and in-house scripts used for this article are available at http://cs.helsinki.fi/u/aekuosma/exon_chain_evaluation_publish.tar.gz.
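To make the notion of an exon chain concrete, the following is a minimal sketch of projecting one aligned long read onto the exons of a splicing graph. It is an illustration only, not the pipeline evaluated in the article: the function name `exon_chain`, the half-open interval representation, and the `min_overlap` parameter are all assumptions introduced here.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end)


def exon_chain(exons: List[Interval],
               read_blocks: List[Interval],
               min_overlap: int = 1) -> List[int]:
    """Return the indices of the exons that the aligned read blocks overlap.

    `exons` and `read_blocks` are assumed to be sorted by start coordinate.
    An exon joins the chain when a block overlaps it by at least
    `min_overlap` bases (a hypothetical filter against alignment noise).
    """
    chain: List[int] = []
    for b_start, b_end in read_blocks:
        for idx, (e_start, e_end) in enumerate(exons):
            overlap = min(b_end, e_end) - max(b_start, e_start)
            # Skip exons that are too weakly supported or already last in the chain.
            if overlap >= min_overlap and (not chain or chain[-1] != idx):
                chain.append(idx)
    return chain


exons = [(100, 200), (300, 400), (500, 600)]
full_read = [(150, 200), (300, 400), (500, 560)]
skipping_read = [(150, 200), (500, 560)]
print(exon_chain(exons, full_read))      # chain through all three exons
print(exon_chain(exons, skipping_read))  # chain that skips the middle exon
```

Raising `min_overlap` is one simple way to keep a few misaligned bases near a splice site from adding a spurious exon to the chain, which is the failure mode the abstract describes.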

2019 ◽  
Author(s):  
Laura H. Tung ◽  
Mingfu Shao ◽  
Carl Kingsford

Abstract Third-generation sequencing technologies benefit transcriptome analysis by generating longer sequencing reads. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and the sequencing length limit of the platform. This drives a need for long read transcript assembly. We quantify the benefit that can be achieved by using a transcript assembler on long reads. Adding long-read-specific algorithms, we evolved Scallop to make Scallop-LR, a long-read transcript assembler, to handle the computational challenges arising from long read lengths and high error rates. Analyzing 26 SRA PacBio datasets using Scallop-LR, Iso-Seq Analysis, and StringTie, we quantified the amount by which assembly improved Iso-Seq results. Through combined evaluation methods, we found that Scallop-LR identifies 2100–4000 more (for 18 human datasets) or 1100–2200 more (for eight mouse datasets) known transcripts than Iso-Seq Analysis, which does not do assembly. Further, Scallop-LR finds 2.4–4.4 times more potentially novel isoforms than Iso-Seq Analysis for the human and mouse datasets. StringTie also identifies more transcripts than Iso-Seq Analysis. Adding long-read-specific optimizations in Scallop-LR increases the numbers of predicted known transcripts and potentially novel isoforms for the human transcriptome compared to several recent short-read assemblers (e.g. StringTie). Our findings indicate that transcript assembly by Scallop-LR can reveal a more complete human transcriptome.


Author(s):  
Ehsan Haghshenas ◽  
Hossein Asghari ◽  
Jens Stoye ◽  
Cedric Chauve ◽  
Faraz Hach

Abstract Third generation sequencing technologies from platforms such as Oxford Nanopore Technologies and Pacific Biosciences have paved the way for building more contiguous assemblies and complete reconstruction of genomes. The larger effective length of the reads generated with these technologies has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive while faster methods are not as accurate. Therefore, there is still an unmet need for tools that are both fast and accurate for reconstructing small and large genomes. Despite the recent advances in third generation sequencing, researchers tend to generate second generation reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler which uses both second and third generation sequencing reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on all the samples compared to other tested assemblers. Furthermore, the generated assemblies in terms of contiguity and accuracy are on par with the other tools on most of the samples. Availability HASLR is an open source tool available at https://github.com/vpc-ccg/haslr.


Author(s):  
Pierre Morisse ◽  
Thierry Lecroq ◽  
Arnaud Lefebvre

Abstract The third generation sequencing technologies Pacific Biosciences and Oxford Nanopore Technologies were made available in 2011 and 2014, respectively. In contrast with second generation sequencing technologies such as Illumina, these new technologies allow the sequencing of long reads of tens to hundreds of kbp. These so-called long reads are particularly promising, and are especially expected to solve various problems such as contig and haplotype assembly or scaffolding. However, these reads are also much more error prone than second generation reads, and display error rates reaching 10 to 30%, depending on the sequencing technology and the version of the chemistry. Moreover, these errors are mainly insertions and deletions, whereas most errors in Illumina reads are substitutions. As a result, long reads require efficient error correction, and a plethora of error correction tools, directly targeted at these reads, were developed in the past nine years. These methods can adopt a hybrid approach, using complementary short reads to perform correction, or a self-correction approach, only making use of the information contained in the long-read sequences. Both approaches make use of various strategies such as multiple sequence alignment, de Bruijn graphs, or hidden Markov models, or even combine different strategies. In this paper, we describe a complete survey of long-read error correction, reviewing all the methodologies and tools available to date, for both hybrid and self-correction. Moreover, the characteristics of the long reads, such as sequencing depth, length, error rate, or even sequencing technology, can have an impact on how well a given tool or strategy performs, and can thus drastically reduce the correction quality. We thus also present an in-depth benchmark of available long-read error correction tools, on a wide variety of datasets, composed of both simulated and real data, with various error rates, coverages, and read lengths, ranging from small bacterial to large mammal genomes.


2020 ◽  
Vol 36 (12) ◽  
pp. 3669-3679 ◽  
Author(s):  
Can Firtina ◽  
Jeremie S Kim ◽  
Mohammed Alser ◽  
Damla Senol Cali ◽  
A Ercument Cicek ◽  
...  

Abstract Motivation Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. Results We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. 
Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Availability and implementation Source code is available at https://github.com/CMU-SAFARI/Apollo. Supplementary information Supplementary data are available at Bioinformatics online.
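The decoding step named in the abstract can be illustrated with a generic Viterbi implementation. This is a toy sketch on a two-state hidden Markov model, not Apollo's profile HMM: the state names, probabilities, and the `viterbi` function are assumptions chosen only to show how the most likely hidden state path is recovered from a trained model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path, computed in log space.

    All probabilities are assumed to be strictly positive so that
    math.log is defined.
    """
    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
        for s in states
    }]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)

    # Trace back from the best final state to the start.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical toy model: two compositional states emitting bases.
states = ["AT_rich", "GC_rich"]
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
           "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
emit_p = {"AT_rich": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
          "GC_rich": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

decoded = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
print(decoded)
```

In a polishing setting the hidden states would instead correspond to match, insertion, and deletion positions along the assembly, with the parameters estimated from read-to-assembly alignments as the abstract describes.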


2020 ◽  
Vol 17 (12) ◽  
pp. 5205-5209
Author(s):  
Ali Elbialy ◽  
M. A. El-Dosuky ◽  
Ibrahim M. El-Henawy

Third generation sequencing (TGS) produces long reads but with relatively high error rates. The quality of TGS data, and how to deal with its errors, is an active research topic. This paper combines and investigates three quality-related metrics: basecalling accuracy, Phred quality scores, and GC content. For basecalling accuracy, a deep neural network is adopted. The measured loss does not exceed 5.42.
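Two of the metrics named above have standard definitions that are easy to state in code. The sketch below uses the usual Phred relation Q = -10 log10(p) and assumes the common FASTQ encoding with an ASCII offset of 33; the function names are illustrative and are not taken from the paper.

```python
from typing import List


def decode_quality(quality_string: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (offset 33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(p) to get the per-base error probability."""
    return 10 ** (-q / 10)


def gc_content(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


scores = decode_quality("I5!")
print(scores)
print([phred_to_error_prob(q) for q in scores])
print(gc_content("GGCCAATT"))
```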


2020 ◽  
Vol 71 (18) ◽  
pp. 5313-5322 ◽  
Author(s):  
Kathryn Dumschott ◽  
Maximilian H-W Schmidt ◽  
Harmeet Singh Chawla ◽  
Rod Snowdon ◽  
Björn Usadel

Abstract DNA sequencing was dominated by Sanger’s chain termination method until the mid-2000s, when it was progressively supplanted by new sequencing technologies that can generate much larger quantities of data in a shorter time. At the forefront of these developments, long-read sequencing technologies (third-generation sequencing) can produce reads that are several kilobases in length. This greatly improves the accuracy of genome assemblies by spanning the highly repetitive segments that cause difficulty for second-generation short-read technologies. Third-generation sequencing is especially appealing for plant genomes, which can be extremely large with long stretches of highly repetitive DNA. Until recently, the low basecalling accuracy of third-generation technologies meant that accurate genome assembly required expensive, high-coverage sequencing followed by computational analysis to correct for errors. However, today’s long-read technologies are more accurate and less expensive, making them the method of choice for the assembly of complex genomes. Oxford Nanopore Technologies (ONT), a third-generation platform for the sequencing of native DNA strands, is particularly suitable for the generation of high-quality assemblies of highly repetitive plant genomes. Here we discuss the benefits of ONT, especially for the plant science community, and describe the issues that remain to be addressed when using ONT for plant genome sequencing.


GigaScience ◽  
2020 ◽  
Vol 9 (10) ◽  
Author(s):  
Davide Bolognini ◽  
Alberto Magi ◽  
Vladimir Benes ◽  
Jan O Korbel ◽  
Tobias Rausch

Abstract Background Tandem repeat sequences are widespread in the human genome, and their expansions cause multiple repeat-mediated disorders. Genome-wide discovery approaches are needed to fully elucidate their roles in health and disease, but resolving tandem repeat variation accurately remains a challenging task. While traditional mapping-based approaches using short-read data have severe limitations in the size and type of tandem repeats they can resolve, recent third-generation sequencing technologies exhibit substantially higher sequencing error rates, which complicates repeat resolution. Results We developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in sequencing data without prior knowledge of their motifs or locations and resolve repeat multiplicity and period size in a haplotype-specific manner. The tool includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees. Conclusions TRiCoLOR demonstrates excellent performance and improved sensitivity and specificity compared with alternative tools on synthetic data. For real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting its suitability to identify tandem repeat variation in personal genomes.
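The terms "period size" and "multiplicity" can be illustrated on an error-free sequence. The sketch below finds the shortest motif whose exact repetition reproduces the whole sequence; it is a deliberately simplified illustration that handles perfect repeats only and is not the method implemented in TRiCoLOR. The function name `smallest_period` is an assumption.

```python
from typing import Tuple


def smallest_period(sequence: str) -> Tuple[str, int]:
    """Return (motif, multiplicity) for the shortest exact repeat unit.

    A sequence that is not a perfect repeat is reported as a single copy
    of itself.
    """
    n = len(sequence)
    for period in range(1, n + 1):
        # Only periods that divide the length can tile the sequence exactly.
        if n % period == 0 and sequence == sequence[:period] * (n // period):
            return sequence[:period], n // period
    return sequence, 1  # reached only for the empty string


print(smallest_period("CAGCAGCAG"))
print(smallest_period("ACGT"))
```

Real long reads contain insertions, deletions, and substitutions, so a practical tool has to tolerate imperfect copies rather than require the exact tiling checked here.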


2021 ◽  
Vol 12 ◽  
Author(s):  
Jose M. Haro-Moreno ◽  
Mario López-Pérez ◽  
Francisco Rodriguez-Valera

Third-generation sequencing has made little headway in metagenomics because of its high error rate and because assembly depends on bioinformatics tools designed for short reads. However, second-generation sequencing metagenomics (mostly Illumina) suffers from limitations, particularly in the assembly of microbes with high microdiversity and retrieval of the flexible (adaptive) fraction of prokaryotic genomes. Here, we have used a third-generation technique to study the metagenome of a well-known marine sample from the mixed epipelagic water column of the winter Mediterranean. We have compared PacBio Sequel II with the classical approach using Illumina NextSeq short reads followed by assembly to study the metagenome. Long reads allow for efficient direct retrieval of complete genes, avoiding the bias of the assembly step. In addition, the application of long reads to metagenomic assembly allows for the reconstruction of much more complete metagenome-assembled genomes (MAGs), particularly from microbes with high microdiversity such as Pelagibacterales. The flexible genome of reconstructed MAGs was much more complete and contained many adaptive genes (some with biotechnological potential). PacBio Sequel II CCS appears particularly suitable for cellular metagenomics due to its low error rate. For most applications of metagenomics, from community structure analysis to ecosystem functioning, long reads should be applied whenever possible. Specifically, for in silico screening of biotechnologically useful genes, or population genomics, long-read metagenomics currently appears to be a very fruitful approach, and the data can be analyzed from raw reads before a computationally demanding (and potentially artifactual) assembly step.


2017 ◽  
Author(s):  
German Tischler ◽  
Eugene W. Myers

Abstract While second generation sequencing led to a vast increase in sequenced data, the shorter reads which came with it made assembly a much harder task, and for some regions impossible with only short read data. This changed again with the advent of third generation long read sequencers. The length of the long reads allows a much better resolution of repetitive regions; their high error rate, however, is a major challenge. Using the data successfully requires removing most of the sequencing errors. The first hybrid correction methods used low noise second generation data to correct third generation data, but this approach has issues when it is unclear where to place the short reads due to repeats, and also because second generation sequencers fail to sequence some regions which third generation sequencers work on. Non-hybrid methods appeared later. We present a new method for non-hybrid long read error correction based on De Bruijn graph assembly of short windows of long reads, with subsequent combination of these corrected windows into corrected long reads. Our experiments show that this method yields a better correction than other state-of-the-art non-hybrid correction approaches.
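The idea of assembling a short window with a De Bruijn graph can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: it takes several noisy copies of the same window, keeps only k-mers seen at least `min_count` times, and follows the graph while the path is unambiguous. The function names, the value of `k`, and the threshold are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set


def solid_kmers(windows: Iterable[str], k: int, min_count: int) -> Set[str]:
    """Keep k-mers that occur at least `min_count` times across the windows."""
    counts = Counter(
        w[i:i + k] for w in windows for i in range(len(w) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}


def de_bruijn(kmers: Iterable[str]) -> Dict[str, Set[str]]:
    """Build a De Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph: Dict[str, Set[str]], start: str) -> str:
    """Spell the sequence along the path from `start` while it is unambiguous."""
    sequence, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        sequence += nxt[-1]
        seen.add(nxt)
        node = nxt
    return sequence


# Four copies of one window; the third carries a single substitution error.
windows = ["ACGTTGCA", "ACGTTGCA", "ACGATGCA", "ACGTTGCA"]
graph = de_bruijn(solid_kmers(windows, k=4, min_count=2))
consensus = walk(graph, "ACG")
print(consensus)
```

Filtering by k-mer frequency removes the branch created by the erroneous copy, so the remaining path spells the window shared by the majority of the reads.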


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Juliane C Dohm ◽  
Philipp Peters ◽  
Nancy Stralis-Pavese ◽  
Heinz Himmelbauer

Abstract Third-generation sequencing technologies provided by Pacific Biosciences and Oxford Nanopore Technologies generate read lengths on the scale of kilobase pairs. However, these reads display high error rates, and correction steps are necessary to realize their great potential in genomics and transcriptomics. Here, we compare properties of PacBio and Nanopore data and assess correction methods by Canu, MARVEL and proovread in various combinations. We found total error rates of around 13% in the raw datasets. PacBio reads showed a high rate of insertions (around 8%) whereas Nanopore reads showed similar rates for substitutions, insertions and deletions of around 4% each. In data from both technologies the errors were uniformly distributed along reads apart from noisy 5′ ends, and homopolymers appeared among the most over-represented k-mers relative to a reference. Consensus correction using read overlaps reduced error rates to about 1% when using Canu or MARVEL after patching. The lowest error rate in Nanopore data (0.45%) was achieved by applying proovread on MARVEL-patched data including Illumina short reads, and the lowest error rate in PacBio data (0.42%) was the result of Canu correction with minimap2 alignment after patching. Our study provides valuable insights and benchmarks regarding long-read data and correction methods.
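A breakdown of errors into substitutions, insertions, and deletions can be computed from a pairwise alignment. The sketch below assumes the alignment is given as two equal-length gapped strings and divides each count by the alignment length; the study may define its denominators differently, so both the function name `error_profile` and that normalization are assumptions.

```python
from typing import Dict


def error_profile(ref_aln: str, read_aln: str) -> Dict[str, float]:
    """Per-type error rates from a gapped pairwise alignment.

    A gap ('-') in the reference row marks an insertion in the read; a gap
    in the read row marks a deletion. Rates are counts divided by the
    number of alignment columns (an assumed normalization).
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned strings must have equal length")
    counts = {"substitution": 0, "insertion": 0, "deletion": 0}
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-":
            counts["insertion"] += 1
        elif read_base == "-":
            counts["deletion"] += 1
        elif ref_base != read_base:
            counts["substitution"] += 1
    columns = len(ref_aln)
    return {kind: n / columns for kind, n in counts.items()}


profile = error_profile("ACGT-ACGT", "ACCTTAC-T")
print(profile)
print(sum(profile.values()))  # total error rate for this toy alignment
```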

