deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index

Mapping Intimacies ◽

10.1101/612176 ◽

2019 ◽

Cited By ~ 1

Author(s):

Bo Liu ◽

Yadong Liu ◽

Junyi Li ◽

Hongzhe Guo ◽

Tianyi Zang ◽

...

Keyword(s):

Rna Sequencing ◽

De Bruijn Graph ◽

Rna Seq ◽

Read Alignment ◽

Sequencing Errors ◽

Long Read ◽

Technical Issues ◽

Reference Sequences ◽

De Bruijn ◽

Gene Structures

AbstractLong-read RNA sequencing (RNA-seq) is promising to transcriptomics studies, however, the alignment of long RNA-seq reads is still non-trivial due to high sequencing errors and complicated gene structures. Herein, we propose deSALT, a tailored two-pass alignment approach, which constructs graph-based alignment skeletons to infer exons and uses them to generate spliced reference sequences to produce refined alignments. deSALT addresses several difficult technical issues, such as small exons and sequencing errors, which breakthroughs the bottlenecks of long RNA-seq read alignment. Benchmarks demonstrate that deSALT has a greater ability to produce accurate and homogeneous full-length alignments. deSALT is available at: https://github.com/hitbc/deSALT.

Download Full-text

deSALT: fast and accurate long transcriptomic read alignment with de Bruijn graph-based index

Genome Biology ◽

10.1186/s13059-019-1895-9 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 5

Author(s):

Bo Liu ◽

Yadong Liu ◽

Junyi Li ◽

Hongzhe Guo ◽

Tianyi Zang ◽

...

Keyword(s):

Rna Sequencing ◽

De Bruijn Graph ◽

Rna Seq ◽

Read Alignment ◽

Sequencing Errors ◽

Long Read ◽

Technical Issues ◽

Reference Sequences ◽

De Bruijn ◽

Gene Structures

AbstractThe alignment of long-read RNA sequencing reads is non-trivial due to high sequencing errors and complicated gene structures. We propose deSALT, a tailored two-pass alignment approach, which constructs graph-based alignment skeletons to infer exons and uses them to generate spliced reference sequences to produce refined alignments. deSALT addresses several difficult technical issues, such as small exons and sequencing errors, which break through bottlenecks of long RNA-seq read alignment. Benchmarks demonstrate that deSALT has a greater ability to produce accurate and homogeneous full-length alignments. deSALT is available at: https://github.com/hitbc/deSALT.

Download Full-text

Read correction for non-uniform coverages

10.1101/673624 ◽

2019 ◽

Author(s):

Camille Marchet ◽

Yoann Dufresne ◽

Antoine Limasset

Keyword(s):

Rare Variants ◽

Data Availability ◽

Metagenomic Data ◽

De Bruijn Graph ◽

Local Alignment ◽

Rna Seq ◽

Sequencing Errors ◽

Mapping Strategy ◽

De Bruijn ◽

Generation Sequencing

AbstractNext generation sequencing produces large volumes of short sequences with broad applications. The noise due to sequencing errors led to the development of several correction methods. The main correction paradigm expects a high (from 30-40X) uniform coverage to correctly infer a reference set of subsequences from the reads, that are used for correction. In practice, most accurate methods use k-mer spectrum techniques to obtain a set of reference k-mers. However, when correcting NGS datasets that present an uneven coverage, such as RNA-seq data, this paradigm tends to mistake rare variants for errors. It may therefore discard or alter them using highly covered sequences, which leads to an information loss and may introduce bias. In this paper we present two new contributions in order to cope with this situation.First, we show that starting from non-uniform sequencing coverages, a De Bruijn graph can be cleaned from most errors while preserving biological variability. Second, we demonstrate that reads can be efficiently corrected via local alignment on the cleaned De Bruijn graph paths. We implemented the described method in a tool dubbed BCT and evaluated its results on RNA-seq and metagenomic data. We show that the graph cleaning strategy combined with the mapping strategy leads to save more rare k-mers, resulting in a more conservative correction than previous methods. BCT is also capable to better take advantage of the signal of high depth datasets. We suggest that BCT, being scalable to large metagenomic datasets as well as correcting shallow single cell RNA-seq data, can be a general corrector for non-uniform data. Availability: BCT is open source and available at github.com/Malfoy/BCT under the Affero GPL License.

Download Full-text

Toward perfect reads: self-correction of short reads via mapping on de Bruijn graphs

Bioinformatics ◽

10.1093/bioinformatics/btz102 ◽

2019 ◽

Vol 36 (5) ◽

pp. 1374-1381 ◽

Cited By ~ 9

Author(s):

Antoine Limasset ◽

Jean-François Flot ◽

Pierre Peterlongo

Keyword(s):

Supplementary Information ◽

De Bruijn Graph ◽

Sequence Information ◽

Short Read ◽

De Bruijn Graphs ◽

Short Reads ◽

Sequencing Errors ◽

Long Read ◽

De Bruijn ◽

Read Accuracy

Abstract Motivation Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere suites of k-mers, without taking into account their full-length sequence information. Results We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond. Availability and implementation The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

TALC: Transcript-level Aware Long Read Correction

10.1101/2020.01.10.901728 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lucile Broseus ◽

Aubin Thomas ◽

Andrew J. Oldfield ◽

Dany Severac ◽

Emeric Dubois ◽

...

Keyword(s):

Transcriptome Sequencing ◽

Transcript Level ◽

De Bruijn Graph ◽

Rna Seq ◽

Sequencing Data ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

Rna Transcript

ABSTRACTMotivationLong-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous “hybrid correction” algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data.ResultsWe have created a novel reference-free algorithm called TALC (Transcription Aware Long Read Correction) which models changes in RNA expression and isoform representation in a weighted De-Bruijn graph to correct long reads from transcriptome studies. We show that transcription aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology.Availability and ImplementationTALC is implemented in C++ and available at https://gitlab.igh.cnrs.fr/lbroseus/[email protected]

Download Full-text

The antibody repertoire of colorectal cancer

10.1101/176768 ◽

2017 ◽

Author(s):

Seong Won Cha ◽

Stefano Bonissone ◽

Seungjin Na ◽

Pavel A. Pevzner ◽

Vineet Bafna

Keyword(s):

Immune Response ◽

Single Amino Acid ◽

De Bruijn Graph ◽

Rna Seq ◽

Protein Databases ◽

Sequencing Errors ◽

Database Construction ◽

De Bruijn ◽

Occurrence Pattern ◽

Standard Protein

Immunotherapy is becoming increasingly important in the fight against cancers, utilizing and manipulating the body's immune response to treat tumors. Understanding the immune repertoire - the collection of immunological proteins - of treated and untreated cells is possible at the genomic, but technically difficult at the protein level. Standard protein databases do not include the highly divergent sequences of somatic rearranged immunoglobulin genes, and may lead to missed identifications in a mass spectrometry search. We introduce a novel proteogenomic approach, AbScan, to identify these highly variable antibody peptides, by developing a customized antibody database construction method using RNA-seq reads aligned to immunoglobulin (Ig) genes. AbScan starts by filtering transcript (RNA-seq) reads that match the template for Ig genes. The retained reads are used to construct a repertoire graph using the 'split' de Bruijn graph: a graph structure that improves upon the standard de Bruijn graph to capture the high diversity of Ig genes in a compact manner. AbScan corrects for sequencing errors, and converts the graph to a format suitable for searching with MS/MS search tools. We used AbScan to create an antibody database from 90 RNA-seq colorectal tumor samples. Next, we used proteogenomics analysis to search MS/MS spectra of matched colorectal samples from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) against the AbScan generated database. AbScan identified 1,940 distinct antibody peptides. Correlating with previously identified Single Amino-Acid Variants (SAAVs) in the tumor samples, we identified 163 pairs (antibody peptide, SAAV) with significant co-occurrence pattern in the 90 samples. The presence of co-expressed antibody and mutated peptides was correlated with survival time of the individuals. Our results suggest that AbScan (https://github.com/csw407/AbScan.git) is an effective tool for a proteomic exploration of the immune response in cancers.

Download Full-text

Non Hybrid Long Read Consensus Using Local De Bruijn Graph Assembly

10.1101/106252 ◽

2017 ◽

Cited By ~ 10

Author(s):

German Tischler ◽

Eugene W. Myers

Keyword(s):

Second Generation ◽

Hybrid Methods ◽

Low Noise ◽

De Bruijn Graph ◽

Third Generation ◽

Sequencing Errors ◽

Long Reads ◽

Long Read ◽

Second Generation Sequencing ◽

De Bruijn

AbstractWhile second generation sequencing led to a vast increase in sequenced data, the shorter reads which came with it made assembly a much harder task and for some regions impossible with only short read data. This changed again with the advent of third generation long read sequencers. The length of the long reads allows a much better resolution of repetitive regions, their high error rate however is a major challenge. Using the data successfully requires to remove most of the sequencing errors. The first hybrid correction methods used low noise second generation data to correct third generation data, but this approach has issues when it is unclear where to place the short reads due to repeats and also because second generation sequencers fail to sequence some regions which third generation sequencers work on. Later non hybrid methods appeared. We present a new method for non hybrid long read error correction based on De Bruijn graph assembly of short windows of long reads with subsequent combination of these correct windows to corrected long reads. Our experiments show that this method yields a better correction than other state of the art non hybrid correction approaches.

Download Full-text

Toward perfect reads: short reads correction via mapping on compacted de Bruijn graphs

10.1101/558395 ◽

2019 ◽

Cited By ~ 3

Author(s):

Antoine Limasset ◽

Jean-François Flot ◽

Pierre Peterlongo

Keyword(s):

Large Data ◽

De Bruijn Graph ◽

Data Sets ◽

Short Read ◽

De Bruijn Graphs ◽

Short Reads ◽

Sequencing Errors ◽

Long Read ◽

De Bruijn ◽

Read Accuracy

AbstractMotivationsShort-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large data sets or consider reads as mere suites of k-mers, without taking into account their full-length read information.ResultsWe propose a new method to correct short reads using de Bruijn graphs, and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond.Availability and ImplementationThe implementation is open source and available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package.ContactAntoine Limasset [email protected] & Jean-François Flot [email protected] & Pierre Peterlongo [email protected]

Download Full-text

Accurate determination of node and arc multiplicities in de bruijn graphs using conditional random fields

BMC Bioinformatics ◽

10.1186/s12859-020-03740-x ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Aranka Steyaert ◽

Pieter Audenaert ◽

Jan Fostier

Keyword(s):

Genomic Sequence ◽

Conditional Random Field ◽

Accurate Determination ◽

Next Generation Sequencing Data ◽

De Bruijn Graph ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

Expectation Maximisation ◽

De Bruijn

Abstract Background De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. k+1-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage, however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detoxunder the GNU AGPL v3.0 license.

Download Full-text

REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

10.1101/2020.03.29.014159 ◽

2020 ◽

Cited By ~ 4

Author(s):

Camille Marchet ◽

Zamin Iqbal ◽

Daniel Gautheret ◽

Mikael Salson ◽

Rayan Chikhi

Keyword(s):

Large Datasets ◽

Computational Method ◽

De Bruijn Graph ◽

Rna Seq ◽

Indexing Methods ◽

Link Type ◽

De Bruijn

AbstractMotivationIn this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets.ResultsWe used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of 4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances.Availabilityhttps://github.com/kamimrcht/[email protected]

Download Full-text

RNA sequencing data: hitchhiker's guide to expression analysis

10.7287/peerj.preprints.27283 ◽

2018 ◽

Author(s):

Koen Van Den Berge ◽

Katharina Hembach ◽

Charlotte Soneson ◽

Simone Tiberi ◽

Lieven Clement ◽

...

Keyword(s):

Gene Expression ◽

Rna Sequencing ◽

Large Scale ◽

Science Studies ◽

Rna Seq ◽

Sequencing Data ◽

Data Types ◽

The Past ◽

Long Read ◽

Statistical Approaches

Gene expression is the fundamental level at which the result of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.

Download Full-text