Evaluation of tools for long read RNA-seq splice-aware alignment

10.1101/126656 ◽

2017 ◽

Cited By ~ 1

Author(s):

Krešimir Križanović ◽

Amina Echchiki ◽

Julien Roux ◽

Mile Šikić

Keyword(s):

High Throughput Sequencing ◽

Genetic Research ◽

Error Rates ◽

Rna Seq ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Gapped Alignment ◽

Long Read ◽

Gene Expression Levels

Abstract

Motivation: High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used in various fields, such as genetic research and diagnostics. The advent of third-generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies pose new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long PacBio or even ONT MinION reads.

Results: The tools were tested on synthetic and real datasets from the PacBio and ONT MinION technologies, and both alignment quality and resource usage were compared across tools. The effect of error correction of long reads was explored, both using self-correction and correction with an external short-read dataset. A tool was developed for evaluating RNA-seq alignment results. This tool can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts. Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced good results overall. We further show that alignment accuracy can be improved using error-corrected reads.

Availability: https://github.com/kkrizanovic/[email protected]

Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads

10.1101/632703 ◽

2019 ◽

Cited By ~ 1

Author(s):

Laura H. Tung ◽

Mingfu Shao ◽

Carl Kingsford

Keyword(s):

Single Molecule ◽

Error Rates ◽

Human Transcriptome ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Long Read ◽

Transcript Assembly ◽

Novel Isoforms ◽

Generation Sequencing

Abstract: Third-generation sequencing technologies benefit transcriptome analysis by generating longer sequencing reads. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and the sequencing length limit of the platform. This drives a need for long read transcript assembly. We quantify the benefit that can be achieved by using a transcript assembler on long reads. Adding long-read-specific algorithms, we evolved Scallop to make Scallop-LR, a long-read transcript assembler, to handle the computational challenges arising from long read lengths and high error rates. Analyzing 26 SRA PacBio datasets using Scallop-LR, Iso-Seq Analysis, and StringTie, we quantified the amount by which assembly improved Iso-Seq results. Through combined evaluation methods, we found that Scallop-LR identifies 2100–4000 more (for 18 human datasets) or 1100–2200 more (for eight mouse datasets) known transcripts than Iso-Seq Analysis, which does not do assembly. Further, Scallop-LR finds 2.4–4.4 times more potentially novel isoforms than Iso-Seq Analysis for the human and mouse datasets. StringTie also identifies more transcripts than Iso-Seq Analysis. Adding long-read-specific optimizations in Scallop-LR increases the numbers of predicted known transcripts and potentially novel isoforms for the human transcriptome compared to several recent short-read assemblers (e.g. StringTie). Our findings indicate that transcript assembly by Scallop-LR can reveal a more complete human transcriptome.

Evaluating approaches to find exon chains based on long reads

10.1101/066241 ◽

2016 ◽

Author(s):

Anna Kuosmanen ◽

Veli Mäkinen

Keyword(s):

Second Generation ◽

Simulated Data ◽

Error Rates ◽

Third Generation ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Long Read ◽

Second Generation Sequencing ◽

Generation Sequencing

Abstract

Motivation: Transcript prediction can be modelled as a graph problem where exons are modelled as nodes and reads spanning two or more exons are modelled as exon chains. PacBio third-generation sequencing technology produces significantly longer reads than earlier second-generation sequencing technologies, which gives valuable information about longer exon chains in a graph. However, with the high error rates of third-generation sequencing, aligning long reads correctly around the splice sites is a challenging task. Incorrect alignments lead to spurious nodes and arcs in the graph, which in turn lead to incorrect transcript predictions.

Results: We survey several approaches to find the exon chains corresponding to long reads in a splicing graph, and experimentally study the performance of these methods using simulated data to allow for sensitivity / precision analysis. Our experiments show that short reads from second-generation sequencing can be used to significantly improve exon chain correctness either by error-correcting the long reads before splicing graph creation, or by using them to create a splicing graph on which the long read alignments are then projected. We also study the memory and time consumption of various modules, and show that accurate exon chains lead to significantly increased transcript prediction accuracy.

Availability: The simulated data and in-house scripts used for this article are available at http://cs.helsinki.fi/u/aekuosma/exon_chain_evaluation_publish.tar.gz.
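To make the graph formulation concrete, here is a minimal Python sketch of turning one read's aligned blocks into an exon chain and the arcs it supports. It is not taken from the authors' scripts: the function names `exon_chain` and `chain_arcs`, the half-open coordinate convention, and the `min_overlap` parameter are all assumptions made for illustration, and the exon list is assumed to be sorted and non-overlapping.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end)


def exon_chain(blocks: List[Interval], exons: List[Interval],
               min_overlap: int = 1) -> List[int]:
    """Return the indices of exons (graph nodes) touched by a read's
    aligned blocks, in genomic order, without consecutive repeats.

    Hypothetical helper: `exons` is assumed sorted and non-overlapping.
    """
    chain: List[int] = []
    for b_start, b_end in sorted(blocks):
        for idx, (e_start, e_end) in enumerate(exons):
            overlap = min(b_end, e_end) - max(b_start, e_start)
            if overlap >= min_overlap and (not chain or chain[-1] != idx):
                chain.append(idx)
    return chain


def chain_arcs(chain: List[int]) -> List[Tuple[int, int]]:
    """Arcs (exon-to-exon links) implied by consecutive nodes in a chain."""
    return list(zip(chain, chain[1:]))


# Three exons and one long read whose alignment has three blocks.
exons = [(100, 200), (300, 400), (500, 600)]
blocks = [(120, 200), (300, 395), (502, 580)]

chain = exon_chain(blocks, exons)
arcs = chain_arcs(chain)
print(chain, arcs)
```

A block that is misplaced by a few bases around a splice site would either miss its exon or touch a neighbouring one, which is how alignment errors turn into the spurious nodes and arcs described in the abstract.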

Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa075 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Cheng He ◽

Guifang Lin ◽

Hairong Wei ◽

Haibao Tang ◽

Frank F White ◽

...

Keyword(s):

Copy Number ◽

Error Rates ◽

Genome Sequences ◽

Short Reads ◽

Sequencing Technologies ◽

Insertion And Deletion ◽

Novel Approach ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

Abstract: Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics make it possible to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.
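The comparison described above can be sketched in a few lines of Python. The abstract does not state the scoring formula, so the log2 ratio used in `abundance_difference` below is an illustrative assumption, as are the function names and the `depth` parameter (the read count expected for a single-copy k-mer). The sketch also ignores reverse complements, which a real k-mer counter would normally merge.

```python
import math
from collections import Counter
from typing import Iterable


def count_kmers(seqs: Iterable[str], k: int) -> Counter:
    """Count every k-mer occurring in a collection of sequences."""
    counts: Counter = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def abundance_difference(read_count: int, asm_copies: int, depth: float) -> float:
    """Illustrative score (assumed, not quoted from the source).

    Returns 0 when the read abundance matches the assembly copy number,
    a negative value when the assembly contains a k-mer the reads do not
    support, and a positive value when the reads support more copies
    than the assembly has.
    """
    return math.log2((read_count + depth) / (depth * (asm_copies + 1)))


depth = 30  # assumed read count of a single-copy k-mer

print(count_kmers(["ACGTA"], 3))
print(abundance_difference(30, 1, depth))  # reads agree with one assembly copy
print(abundance_difference(0, 1, depth))   # k-mer in assembly, absent from reads
print(abundance_difference(30, 0, depth))  # k-mer in reads, missing from assembly
```

Under this assumed score, k-mers near zero are consistent between reads and assembly, while strongly negative values flag candidate base errors in the assembly.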

Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus

10.1101/259150 ◽

2018 ◽

Author(s):

Andrew J. Page ◽

Jacqueline A. Keane

Keyword(s):

Error Rates ◽

Multi Locus Sequence Typing ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Standard Tool ◽

Long Read ◽

Sequence Types ◽

Very High

Abstract: Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics where time is critical. Seven-gene multi-locus sequence typing is a standard tool for broadly classifying samples into sequence types, which in many cases makes it possible to rule a sample in or out of an outbreak, or to infer general characteristics of a bacterial strain. Long read sequencing technologies, such as those from PacBio or Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short read sequencing technologies, which require many hours or days. However, the error rates of raw uncorrected long read data are very high. We present Krocus, which can predict a sequence type directly from uncorrected long reads, and which was designed to consume read data as it is produced, providing results in minutes. It is the only tool which can do this from uncorrected long reads. We tested Krocus on over 600 samples sequenced using long read sequencing technologies from PacBio and Oxford Nanopore. It provides sequence types on average within 90 seconds, with a sensitivity of 94% and specificity of 97%, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.
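For readers less familiar with the two reported metrics, the short Python sketch below shows how sensitivity and specificity are computed from a confusion matrix. The counts are hypothetical values chosen only so that the result matches the reported percentages; they are not the study's data, and the function name is an assumption.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts, not taken from the paper.
sens, spec = sensitivity_specificity(tp=94, fp=3, tn=97, fn=6)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```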

TALC: Transcript-level Aware Long Read Correction

10.1101/2020.01.10.901728 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lucile Broseus ◽

Aubin Thomas ◽

Andrew J. Oldfield ◽

Dany Severac ◽

Emeric Dubois ◽

...

Keyword(s):

Transcriptome Sequencing ◽

Transcript Level ◽

De Bruijn Graph ◽

Rna Seq ◽

Sequencing Data ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

Rna Transcript

Abstract

Motivation: Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous “hybrid correction” algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data.

Results: We have created a novel reference-free algorithm called TALC (Transcription Aware Long Read Correction) which models changes in RNA expression and isoform representation in a weighted De-Bruijn graph to correct long reads from transcriptome studies. We show that transcription aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology.

Availability and Implementation: TALC is implemented in C++ and available at https://gitlab.igh.cnrs.fr/lbroseus/[email protected]

Long-read amplicon denoising

10.1101/383794 ◽

2018 ◽

Cited By ~ 3

Author(s):

Venkatesh Kumar ◽

Thomas Vollbrecht ◽

Mark Chernyshev ◽

Sanjay Mohan ◽

Brian Hanst ◽

...

Keyword(s):

Ground Truth ◽

Amplicon Sequencing ◽

Error Rates ◽

Sequencing Error ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Medium Length ◽

Long Reads ◽

Long Read ◽

Error Profiles

Long-read next generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies. Called “amplicon denoising”, this problem has been extensively studied for short-read sequencing technologies, but current solutions do not appear to generalize well to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium length reads (here ~2.6kb) and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower.

On one real dataset with ground truth, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method.

Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD) are implemented purely in the Julia scientific computing language, and are hereby released along with a complete toolkit of functions that allow long-read amplicon sequence analysis pipelines to be constructed in pure Julia. Further, we make available a webserver to dramatically simplify the processing of long-read PacBio sequences.

Scalable long read self-correction and assembly polishing with multiple sequence alignment

Scientific Reports ◽

10.1038/s41598-020-80757-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Pierre Morisse ◽

Camille Marchet ◽

Antoine Limasset ◽

Thierry Lecroq ◽

Arnaud Lefebvre

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Correction Method ◽

Error Rates ◽

Multiple Sequence ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Human Dataset

Abstract: Third-generation sequencing technologies make it possible to sequence long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long read analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state of the art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and makes it possible to process a full human dataset, containing reads reaching up to 1.5 Mbp, in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature for correcting raw assemblies. Our experiments show that CONSENT is 2-38x faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource-intensive than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.

Estimating Assembly Base Errors Using K-mer Abundance Difference (KAD) Between Short Reads and Genome Assembled Sequences

10.1101/2020.03.17.994566 ◽

2020 ◽

Cited By ~ 1

Author(s):

Cheng He ◽

Guifang Lin ◽

Hairong Wei ◽

Haibao Tang ◽

Frank F White ◽

...

Keyword(s):

Copy Number ◽

Error Rates ◽

Genome Sequences ◽

Short Reads ◽

Sequencing Technologies ◽

Insertion And Deletion ◽

Novel Approach ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

Abstract: Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as K-mer Abundance Difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics make it possible to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Therefore, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.

Two-pass alignment using machine-learning-filtered splice junctions increases the accuracy of intron detection in long-read RNA sequencing

10.1101/2020.05.27.118679 ◽

2020 ◽

Cited By ~ 1

Author(s):

Matthew T. Parker ◽

Katarzyna Knop ◽

Geoffrey J. Barton ◽

Gordon G. Simpson

Keyword(s):

Machine Learning ◽

Transcriptome Assembly ◽

Error Rates ◽

Sequence Information ◽

Sequencing Technologies ◽

Alternative Processing ◽

Spliced Alignment ◽

Long Reads ◽

Long Read ◽

Splice Junctions

Abstract: Transcription of eukaryotic genomes involves complex alternative processing of RNAs. Sequencing of full-length RNAs using long reads reveals the true complexity of processing. However, the relatively high error rates of long-read sequencing technologies can reduce the accuracy of intron identification. Here we apply alignment metrics and machine-learning-derived sequence information to filter spurious splice junctions from long read alignments and use the remaining junctions to guide realignment in a two-pass approach. This method, available in the software package 2passtools (https://github.com/bartongroup/2passtools), improves the accuracy of spliced alignment and transcriptome assembly for species both with and without existing high-quality annotations.
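The filtering step between the two alignment passes can be illustrated with a small Python sketch. It deliberately swaps in a simple rule (minimum read support plus a canonical GT-AG intron motif) for the alignment metrics and machine-learning model described in the abstract, and it does not use the 2passtools API. The function name `filter_junctions`, its parameters, and the input layout are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Junction = Tuple[int, int]  # (intron start, intron end) on one chromosome


def filter_junctions(read_junctions: List[List[Junction]],
                     motifs: Dict[Junction, Tuple[str, str]],
                     min_support: int = 2) -> Set[Junction]:
    """Keep junctions seen in at least `min_support` first-pass alignments
    whose intron starts with GT and ends with AG.

    Simplified stand-in for a learned filter; thresholds are arbitrary.
    """
    support = Counter(j for junctions in read_junctions for j in junctions)
    return {
        j for j, n in support.items()
        if n >= min_support and motifs.get(j) == ("GT", "AG")
    }


# First-pass junctions from three reads; one read has a shifted donor site.
first_pass = [
    [(200, 300)],
    [(200, 300), (400, 500)],
    [(203, 300)],
]
motifs = {
    (200, 300): ("GT", "AG"),
    (400, 500): ("GT", "AG"),
    (203, 300): ("CA", "TT"),
}

kept = filter_junctions(first_pass, motifs)
print(kept)
```

In a two-pass workflow, the retained set would then be supplied to the aligner as guide junctions for the second pass, so that reads like the third one can be realigned to the supported splice site.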

Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus

PeerJ ◽

10.7717/peerj.5233 ◽

2018 ◽

Vol 6 ◽

pp. e5233 ◽

Cited By ~ 6

Author(s):

Andrew J. Page ◽

Jacqueline A. Keane

Keyword(s):

Error Rates ◽

Multi Locus Sequence Typing ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Standard Tool ◽

Sample Data ◽

Long Read ◽

Sequence Types ◽

Very High

Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics where time is critical. Seven-gene multi-locus sequence typing is a standard tool for broadly classifying samples into sequence types (STs), which in many cases makes it possible to rule a sample out of an outbreak, or to infer general characteristics of a bacterial strain. Long-read sequencing technologies, such as those from Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short-read sequencing technologies, which require many hours or days. However, the error rates of raw uncorrected long read data are very high. We present Krocus, which can predict an ST directly from uncorrected long reads, and which was designed to consume read data as it is produced, providing results in minutes. It is the only tool which can do this from uncorrected long reads. We tested Krocus on over 700 isolates sequenced using long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore. It provides STs for isolates on average within 90 s, with a sensitivity of 94% and specificity of 97% on real sample data, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.
