Amplicon Sequencing in the Era of Highly-Accurate Long Reads

Long-read amplicon denoising

10.1101/383794 ◽

2018 ◽

Cited By ~ 3

Author(s):

Venkatesh Kumar ◽

Thomas Vollbrecht ◽

Mark Chernyshev ◽

Sanjay Mohan ◽

Brian Hanst ◽

...

Keyword(s):

Ground Truth ◽

Amplicon Sequencing ◽

Error Rates ◽

Sequencing Error ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Medium Length ◽

Long Reads ◽

Long Read ◽

Error Profiles

Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies. Called “amplicon denoising”, this problem has been extensively studied for short-read sequencing technologies, but current solutions do not appear to generalize well to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium-length reads (here ~2.6 kb) and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On one real dataset with ground truth, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method. Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD) are implemented purely in the Julia scientific computing language, and are hereby released along with a complete toolkit of functions that allow long-read amplicon sequence analysis pipelines to be constructed in pure Julia. Further, we make available a webserver to dramatically simplify the processing of long-read PacBio sequences.


Long-read amplicon denoising

Nucleic Acids Research ◽

10.1093/nar/gkz657 ◽

2019 ◽

Vol 47 (18) ◽

pp. e104-e104 ◽

Cited By ~ 12

Author(s):

Venkatesh Kumar ◽

Thomas Vollbrecht ◽

Mark Chernyshev ◽

Sanjay Mohan ◽

Brian Hanst ◽

...

Keyword(s):

Ground Truth ◽

Amplicon Sequencing ◽

Error Rates ◽

Sequencing Error ◽

Single Nucleotide ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Error Profiles ◽

Virus Community

Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies from PacBio reads. Called ‘amplicon denoising’, this problem has been extensively studied for short-read sequencing technologies, but current solutions do not always successfully generalize to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium-length reads and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On two Mock Virus Community datasets with ground truth, each sequenced on a different PacBio instrument, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method, successfully discriminating templates that differ by just a single nucleotide. Julia implementations of Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD), and a webserver interface, are freely available.
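To make the denoising problem concrete, the following is a minimal sketch of a naive abundance-based baseline: count identical reads, drop sequences seen fewer than a threshold number of times, and renormalise the frequencies of the rest. This is not FAD or RAD, whose algorithms the abstract does not describe; the function name `naive_denoise` and the `min_count` parameter are assumptions made for illustration.

```python
from collections import Counter

def naive_denoise(reads, min_count=2):
    """Hypothetical baseline, not FAD/RAD: keep sequences observed at least
    `min_count` times and return their renormalised frequencies."""
    counts = Counter(reads)
    kept = {seq: n for seq, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {seq: n / total for seq, n in kept.items()}

# Toy input: two true variants differing by one nucleotide, plus one
# singleton read standing in for a sequencing error.
reads = ["ACGT"] * 6 + ["ACGA"] * 3 + ["ACGG"]
variants = naive_denoise(reads)
```

A baseline like this depends on error-free reads recurring exactly. With long reads and high indel error rates, exact duplicates become rare, which is consistent with the abstract's remark that short-read solutions do not always generalize.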


Factorial estimating assembly base errors using k-mer abundance difference (KAD) between short reads and genome assembled sequences

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa075 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Cheng He ◽

Guifang Lin ◽

Hairong Wei ◽

Haibao Tang ◽

Frank F White ◽

...

Keyword(s):

Copy Number ◽

Error Rates ◽

Genome Sequences ◽

Short Reads ◽

Sequencing Technologies ◽

Insertion And Deletion ◽

Novel Approach ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

Genome sequences provide genomic maps with single-base resolution for exploring genetic content. Sequencing technologies, particularly long reads, have revolutionized genome assembly by producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics make it possible to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.
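The comparison described above, inferred copy number from short reads against observed copy number in the assembly, can be sketched as follows. This is a simplified illustration and not the published KAD metric: it uses a plain difference, takes the per-copy read depth as a given `depth` argument, and ignores reverse complements. The function names are hypothetical.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every k-mer across a list of sequences (forward strand only)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def copy_number_gap(reads, assembly, k, depth):
    """Simplified stand-in for a KAD-style comparison (an assumption, not the
    paper's formula): inferred copies (read count / depth) minus the copies
    observed in the assembly, for every k-mer seen in either source."""
    read_c = kmer_counts(reads, k)
    asm_c = kmer_counts([assembly], k)
    return {km: read_c[km] / depth - asm_c[km] for km in set(read_c) | set(asm_c)}

# Toy data: the assembly ends in A where the reads support T.
gap = copy_number_gap(reads=["ACGT", "ACGT"], assembly="ACGA", k=3, depth=2)
```

In this toy case a gap near zero marks a k-mer on which reads and assembly agree, a negative gap marks a k-mer present in the assembly but unsupported by reads (a candidate base error), and a positive gap marks read-supported sequence missing from the assembly.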


CaBagE: A Cas9-based Background Elimination strategy for targeted, long-read DNA sequencing

PLoS ONE ◽

10.1371/journal.pone.0241253 ◽

2021 ◽

Vol 16 (4) ◽

pp. e0241253

Author(s):

Amelia D. Wallace ◽

Thomas A. Sasani ◽

Jordan Swanier ◽

Brooke L. Gates ◽

Jeff Greenland ◽

...

Keyword(s):

Dna Sequencing ◽

Tandem Repeats ◽

Short Read ◽

Background Elimination ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Repeat Expansions ◽

Sequencing Platforms ◽

Als Patients

A substantial fraction of the human genome is difficult to interrogate with short-read DNA sequencing technologies due to paralogy, complex haplotype structures, or tandem repeats. Long-read sequencing technologies, such as Oxford Nanopore’s MinION, enable direct measurement of complex loci without introducing many of the biases inherent to short-read methods, though they suffer from relatively lower throughput. This limitation has motivated recent efforts to develop amplification-free strategies to target and enrich loci of interest for subsequent sequencing with long reads. Here, we present CaBagE, a method for target enrichment that is efficient and useful for sequencing large, structurally complex targets. The CaBagE method leverages the stable binding of Cas9 to its DNA target to protect desired fragments from digestion with exonuclease. Enriched DNA fragments are then sequenced with Oxford Nanopore’s MinION long-read sequencing technology. Enrichment with CaBagE resulted in a median of 116X coverage (range 39–416) of target loci when tested on five genomic targets ranging from 4-20kb in length using healthy donor DNA. Four cancer gene targets were enriched in a single reaction and multiplexed on a single MinION flow cell. We further demonstrate the utility of CaBagE in two ALS patients with C9orf72 short tandem repeat expansions to produce genotype estimates commensurate with genotypes derived from repeat-primed PCR for each individual. With CaBagE there is a physical enrichment of on-target DNA in a given sample prior to sequencing. This feature allows adaptability across sequencing platforms and potential use as an enrichment strategy for applications beyond sequencing. CaBagE is a rapid enrichment method that can illuminate regions of the ‘hidden genome’ underlying human disease.


Quantifying the Benefit Offered by Transcript Assembly on Single-Molecule Long Reads

10.1101/632703 ◽

2019 ◽

Cited By ~ 1

Author(s):

Laura H. Tung ◽

Mingfu Shao ◽

Carl Kingsford

Keyword(s):

Single Molecule ◽

Error Rates ◽

Human Transcriptome ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Long Read ◽

Transcript Assembly ◽

Novel Isoforms ◽

Generation Sequencing

Third-generation sequencing technologies benefit transcriptome analysis by generating longer sequencing reads. However, not all single-molecule long reads represent full transcripts due to incomplete cDNA synthesis and the sequencing length limit of the platform. This drives a need for long read transcript assembly. We quantify the benefit that can be achieved by using a transcript assembler on long reads. Adding long-read-specific algorithms, we evolved Scallop to make Scallop-LR, a long-read transcript assembler, to handle the computational challenges arising from long read lengths and high error rates. Analyzing 26 SRA PacBio datasets using Scallop-LR, Iso-Seq Analysis, and StringTie, we quantified the amount by which assembly improved Iso-Seq results. Through combined evaluation methods, we found that Scallop-LR identifies 2100–4000 more (for 18 human datasets) or 1100–2200 more (for eight mouse datasets) known transcripts than Iso-Seq Analysis, which does not do assembly. Further, Scallop-LR finds 2.4–4.4 times more potentially novel isoforms than Iso-Seq Analysis for the human and mouse datasets. StringTie also identifies more transcripts than Iso-Seq Analysis. Adding long-read-specific optimizations in Scallop-LR increases the numbers of predicted known transcripts and potentially novel isoforms for the human transcriptome compared to several recent short-read assemblers (e.g. StringTie). Our findings indicate that transcript assembly by Scallop-LR can reveal a more complete human transcriptome.


Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus

10.1101/259150 ◽

2018 ◽

Author(s):

Andrew J. Page ◽

Jacqueline A. Keane

Keyword(s):

Error Rates ◽

Multi Locus Sequence Typing ◽

Short Read Sequencing ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Standard Tool ◽

Long Read ◽

Sequence Types ◽

Very High

Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics where time is critical. Seven-gene multi-locus sequence typing is a standard tool for broadly classifying samples into sequence types, which in many cases allows a sample to be ruled in or out of an outbreak, or general characteristics of a bacterial strain to be inferred. Long read sequencing technologies, such as those from PacBio or Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short read sequencing technologies, which require many hours or days. However, the error rates of raw uncorrected long read data are very high. We present Krocus, which can predict a sequence type directly from uncorrected long reads, and which was designed to consume read data as it is produced, providing results in minutes. It is the only tool which can do this from uncorrected long reads. We tested Krocus on over 600 samples sequenced using long read sequencing technologies from PacBio and Oxford Nanopore. It provides sequence types on average within 90 seconds, with a sensitivity of 94% and specificity of 97%, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.


Scalable long read self-correction and assembly polishing with multiple sequence alignment

Scientific Reports ◽

10.1038/s41598-020-80757-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Pierre Morisse ◽

Camille Marchet ◽

Antoine Limasset ◽

Thierry Lecroq ◽

Arnaud Lefebvre

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Correction Method ◽

Error Rates ◽

Multiple Sequence ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

Human Dataset

Third-generation sequencing technologies allow the sequencing of long reads of tens of kbp, which are expected to solve various problems. However, they display high error rates, currently capped around 10%. Self-correction is thus regularly used in long-read analysis projects. We introduce CONSENT, a new self-correction method that relies both on multiple sequence alignment and local de Bruijn graphs. To ensure scalability, multiple sequence alignment computation benefits from a new and efficient segmentation strategy, allowing a massive speedup. CONSENT compares well to the state-of-the-art, and performs better on real Oxford Nanopore data. Specifically, CONSENT is the only method that efficiently scales to ultra-long reads, and allows a full human dataset, containing reads reaching up to 1.5 Mbp, to be processed in 10 days. Moreover, our experiments show that error correction with CONSENT improves the quality of Flye assemblies. Additionally, CONSENT implements a polishing feature that can correct raw assemblies. Our experiments show that CONSENT is 2-38x faster than other polishing tools, while providing comparable results. Furthermore, we show that, on a human dataset, assembling the raw data and polishing the assembly is less resource consuming than correcting and then assembling the reads, while providing better results. CONSENT is available at https://github.com/morispi/CONSENT.
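For readers unfamiliar with the term, a local de Bruijn graph can be sketched in a few lines: collect the k-mers from a small window of overlapping read segments, keep those seen often enough to be trusted, and link each k-mer's prefix to its suffix. This is a generic illustration, not CONSENT's implementation; the function name, the `min_count` threshold, and the toy window are assumptions.

```python
from collections import Counter, defaultdict

def local_de_bruijn(window_seqs, k, min_count=2):
    """Generic sketch (not CONSENT's code): build a de Bruijn graph from the
    k-mers that occur at least `min_count` times in a window of sequences.
    Nodes are (k-1)-mers; each kept k-mer adds an edge prefix -> suffix."""
    counts = Counter(
        s[i:i + k] for s in window_seqs for i in range(len(s) - k + 1)
    )
    graph = defaultdict(set)
    for kmer, n in counts.items():
        if n >= min_count:
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)

# Toy window: two segments agree, the third carries a substitution (G -> C).
window = ["ACGTA", "ACGTA", "ACCTA"]
graph = local_de_bruijn(window, k=3)
```

Because the k-mers introduced by the erroneous segment occur only once, they fall below the threshold, and the remaining path AC -> CG -> GT -> TA spells the majority sequence.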


Estimating Assembly Base Errors Using K-mer Abundance Difference (KAD) Between Short Reads and Genome Assembled Sequences

10.1101/2020.03.17.994566 ◽

2020 ◽

Cited By ~ 1

Author(s):

Cheng He ◽

Guifang Lin ◽

Hairong Wei ◽

Haibao Tang ◽

Frank F White ◽

...

Keyword(s):

Copy Number ◽

Error Rates ◽

Genome Sequences ◽

Short Reads ◽

Sequencing Technologies ◽

Insertion And Deletion ◽

Novel Approach ◽

Long Reads ◽

Long Read ◽

Genome Assemblies

Genome sequences provide genomic maps with single-base resolution for exploring genetic content. Sequencing technologies, particularly long reads, have revolutionized genome assembly by producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue is common, but few tools are dedicated to computing error rates or determining error locations. In this study, we developed a novel approach, referred to as K-mer Abundance Difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics make it possible to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Therefore, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public use.


Two-pass alignment using machine-learning-filtered splice junctions increases the accuracy of intron detection in long-read RNA sequencing

10.1101/2020.05.27.118679 ◽

2020 ◽

Cited By ~ 1

Author(s):

Matthew T. Parker ◽

Katarzyna Knop ◽

Geoffrey J. Barton ◽

Gordon G. Simpson

Keyword(s):

Machine Learning ◽

Transcriptome Assembly ◽

Error Rates ◽

Sequence Information ◽

Sequencing Technologies ◽

Alternative Processing ◽

Spliced Alignment ◽

Long Reads ◽

Long Read ◽

Splice Junctions

Transcription of eukaryotic genomes involves complex alternative processing of RNAs. Sequencing of full-length RNAs using long reads reveals the true complexity of processing. However, the relatively high error rates of long-read sequencing technologies can reduce the accuracy of intron identification. Here we apply alignment metrics and machine-learning-derived sequence information to filter spurious splice junctions from long read alignments and use the remaining junctions to guide realignment in a two-pass approach. This method, available in the software package 2passtools (https://github.com/bartongroup/2passtools), improves the accuracy of spliced alignment and transcriptome assembly for species both with and without existing high-quality annotations.


Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus

PeerJ ◽

10.7717/peerj.5233 ◽

2018 ◽

Vol 6 ◽

pp. e5233 ◽

Cited By ~ 6

Author(s):

Andrew J. Page ◽

Jacqueline A. Keane

Keyword(s):

Error Rates ◽

Multi Locus Sequence Typing ◽

Sequencing Technologies ◽

Long Reads ◽

Oxford Nanopore ◽

Standard Tool ◽

Sample Data ◽

Long Read ◽

Sequence Types ◽

Very High

Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics where time is critical. Seven-gene multi-locus sequence typing is a standard tool for broadly classifying samples into sequence types (STs), which in many cases allows a sample to be ruled out of an outbreak, or general characteristics of a bacterial strain to be inferred. Long-read sequencing technologies, such as those from Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short-read sequencing technologies, which require many hours or days. However, the error rates of raw uncorrected long read data are very high. We present Krocus, which can predict an ST directly from uncorrected long reads, and which was designed to consume read data as it is produced, providing results in minutes. It is the only tool which can do this from uncorrected long reads. We tested Krocus on over 700 isolates sequenced using long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore. It provides STs for isolates on average within 90 s, with a sensitivity of 94% and specificity of 97% on real sample data, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.
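As a reminder of what the two reported figures measure, the sketch below computes sensitivity and specificity from confusion-matrix counts using the standard definitions. The counts in the example are invented for illustration and are not taken from the Krocus evaluation; how the paper assigns true and false calls is not stated in the abstract.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were called correctly: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual negatives that were rejected correctly: TN / (TN + FP)."""
    return tn / (tn + fp)

# Illustrative counts only (an assumption, not data from the paper).
sens = sensitivity(tp=8, fn=2)
spec = specificity(tn=9, fp=1)
```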
