Read correction for non-uniform coverages

2019
Author(s): Camille Marchet, Yoann Dufresne, Antoine Limasset

Abstract Next-generation sequencing produces large volumes of short sequences with broad applications. The noise due to sequencing errors has led to the development of several correction methods. The main correction paradigm expects a high (30-40X or more), uniform coverage in order to correctly infer, from the reads, a reference set of subsequences that are then used for correction. In practice, the most accurate methods use k-mer spectrum techniques to obtain a set of reference k-mers. However, when correcting NGS datasets with uneven coverage, such as RNA-seq data, this paradigm tends to mistake rare variants for errors. It may therefore discard them or alter them using highly covered sequences, which leads to information loss and may introduce bias. In this paper we present two new contributions to cope with this situation. First, we show that, starting from non-uniform sequencing coverage, a De Bruijn graph can be cleaned of most errors while preserving biological variability. Second, we demonstrate that reads can be efficiently corrected via local alignment on the paths of the cleaned De Bruijn graph. We implemented the described method in a tool dubbed BCT and evaluated its results on RNA-seq and metagenomic data. We show that combining the graph cleaning strategy with the mapping strategy preserves more rare k-mers, resulting in a more conservative correction than previous methods. BCT is also better able to take advantage of the signal in high-depth datasets. We suggest that BCT, which scales to large metagenomic datasets and also corrects shallow single-cell RNA-seq data, can be a general corrector for non-uniform data. Availability: BCT is open source and available at github.com/Malfoy/BCT under the Affero GPL License.
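The failure mode described above, where a k-mer spectrum threshold treats a rare variant as an error, can be illustrated with a toy example. This is a minimal sketch and not BCT's code: the reads, the value of k, the threshold, and the helper names `count_kmers` and `solid_kmers` are all assumptions chosen for illustration, and only the forward strand is considered.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def solid_kmers(counts, min_abundance):
    """Keep k-mers seen at least `min_abundance` times (the reference set)."""
    return {kmer for kmer, n in counts.items() if n >= min_abundance}

# Toy data (hypothetical): a highly covered sequence seen 10 times and a
# rare variant seen once, differing by one base (G -> T at 0-based index 4).
abundant = "ACGTGCA"
rare_variant = "ACGTTCA"
reads = [abundant] * 10 + [rare_variant]

counts = count_kmers(reads, k=4)
solid = solid_kmers(counts, min_abundance=2)

# k-mers that exist only because of the variant fall below the threshold,
# so a spectrum-based corrector would consider them erroneous.
variant_kmers = {rare_variant[i:i + 4] for i in range(len(rare_variant) - 3)}
lost = variant_kmers - solid
print(sorted(lost))  # ['CGTT', 'GTTC', 'TTCA']
```

With uniform coverage a single global threshold separates errors from true k-mers reasonably well; here the variant is real but its three specific k-mers are indistinguishable from errors by abundance alone.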

2017
Author(s): Seong Won Cha, Stefano Bonissone, Seungjin Na, Pavel A. Pevzner, Vineet Bafna

Immunotherapy is becoming increasingly important in the fight against cancers, utilizing and manipulating the body's immune response to treat tumors. Understanding the immune repertoire (the collection of immunological proteins) of treated and untreated cells is possible at the genomic level, but technically difficult at the protein level. Standard protein databases do not include the highly divergent sequences of somatically rearranged immunoglobulin genes, which may lead to missed identifications in a mass spectrometry search. We introduce a novel proteogenomic approach, AbScan, to identify these highly variable antibody peptides by developing a customized antibody database construction method using RNA-seq reads aligned to immunoglobulin (Ig) genes. AbScan starts by filtering transcript (RNA-seq) reads that match the template for Ig genes. The retained reads are used to construct a repertoire graph using the 'split' de Bruijn graph: a graph structure that improves upon the standard de Bruijn graph to capture the high diversity of Ig genes in a compact manner. AbScan corrects for sequencing errors and converts the graph to a format suitable for searching with MS/MS search tools. We used AbScan to create an antibody database from 90 RNA-seq colorectal tumor samples. Next, we used proteogenomic analysis to search MS/MS spectra of matched colorectal samples from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) against the AbScan-generated database. AbScan identified 1,940 distinct antibody peptides. Correlating these with previously identified Single Amino-Acid Variants (SAAVs) in the tumor samples, we identified 163 pairs (antibody peptide, SAAV) with a significant co-occurrence pattern in the 90 samples. The presence of co-expressed antibody and mutated peptides was correlated with survival time of the individuals. Our results suggest that AbScan (https://github.com/csw407/AbScan.git) is an effective tool for a proteomic exploration of the immune response in cancers.


2019, Vol 20 (1)
Author(s): Bo Liu, Yadong Liu, Junyi Li, Hongzhe Guo, Tianyi Zang, ...

Abstract The alignment of long-read RNA sequencing reads is non-trivial due to high sequencing errors and complicated gene structures. We propose deSALT, a tailored two-pass alignment approach, which constructs graph-based alignment skeletons to infer exons and uses them to generate spliced reference sequences to produce refined alignments. deSALT addresses several difficult technical issues, such as small exons and sequencing errors, which break through bottlenecks of long RNA-seq read alignment. Benchmarks demonstrate that deSALT has a greater ability to produce accurate and homogeneous full-length alignments. deSALT is available at: https://github.com/hitbc/deSALT.
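The idea of pooling first-pass alignments to infer exons before a second, refined pass can be sketched as an interval-merging step. This is only an illustration under stated assumptions, not deSALT's algorithm: the skeletons below are hypothetical lists of half-open genomic intervals, and `merge_intervals` is an ordinary interval union rather than the graph-based procedure the abstract refers to.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open [start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous block: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

# Hypothetical first-pass skeletons: one list of aligned blocks per read.
# Block boundaries are noisy because the reads contain sequencing errors.
skeletons = [
    [(100, 200), (500, 520), (900, 1000)],
    [(120, 210), (505, 520)],
    [(95, 180), (910, 1005)],
]

# Pool the blocks of all reads and merge them into candidate exon regions.
blocks = [block for read in skeletons for block in read]
exons = merge_intervals(blocks)
print(exons)  # [(95, 210), (500, 520), (900, 1005)]

# A spliced reference would concatenate these regions; here we only
# compute its length.
spliced_reference_length = sum(end - start for start, end in exons)
print(spliced_reference_length)  # 240
```

Pooling evidence across reads is what lets a short exon such as the 20-base region above survive even though any single noisy read might align it poorly.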


2019
Author(s): Bo Liu, Yadong Liu, Junyi Li, Hongzhe Guo, Tianyi Zang, ...

Abstract Long-read RNA sequencing (RNA-seq) is promising for transcriptomics studies; however, the alignment of long RNA-seq reads is still non-trivial due to high sequencing errors and complicated gene structures. Herein, we propose deSALT, a tailored two-pass alignment approach, which constructs graph-based alignment skeletons to infer exons and uses them to generate spliced reference sequences to produce refined alignments. deSALT addresses several difficult technical issues, such as small exons and sequencing errors, thereby breaking through the bottlenecks of long RNA-seq read alignment. Benchmarks demonstrate that deSALT has a greater ability to produce accurate and homogeneous full-length alignments. deSALT is available at: https://github.com/hitbc/deSALT.


2020, Vol 21 (1)
Author(s): Aranka Steyaert, Pieter Audenaert, Jan Fostier

Abstract Background De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
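To see why looking at each node or arc in isolation can be ambiguous, consider the following sketch. It is not the CRF model from the abstract: it only contrasts a naive per-element estimate (coverage divided by average coverage, rounded) with a simple consistency check against neighbouring arcs. The coverage values, the average coverage of 30, and both function names are assumptions made for illustration.

```python
def naive_multiplicity(coverage, avg_coverage):
    """Nearest-integer estimate using only the element's own coverage."""
    return int(coverage / avg_coverage + 0.5)

def violates_conservation(node_mult, in_arc_mults, out_arc_mults):
    """True if a node's multiplicity disagrees with the total multiplicity
    of its incoming arcs or of its outgoing arcs."""
    return node_mult != sum(in_arc_mults) or node_mult != sum(out_arc_mults)

avg = 30.0            # assumed average coverage of a multiplicity-1 element
node_cov = 44.0       # ambiguous: between 1x (30) and 2x (60)
in_arc_covs = [31.0, 28.0]
out_arc_covs = [58.0]

node_m = naive_multiplicity(node_cov, avg)
in_m = [naive_multiplicity(c, avg) for c in in_arc_covs]
out_m = [naive_multiplicity(c, avg) for c in out_arc_covs]

print(node_m, in_m, out_m)                          # 1 [1, 1] [2]
print(violates_conservation(node_m, in_m, out_m))   # True

# The surrounding arcs agree on a total of 2, which suggests the node's
# low coverage is a local fluctuation rather than evidence for multiplicity 1.
print(violates_conservation(2, in_m, out_m))        # False
```

A model that scores node and arc assignments jointly can resolve such cases in favour of the consistent assignment, which is the motivation the abstract gives for combining local coverage with neighbourhood information.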


Author(s): Camille Marchet, Zamin Iqbal, Daniel Gautheret, Mikael Salson, Rayan Chikhi

Abstract Motivation In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. Results We used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of 4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. Availability https://github.com/kamimrcht/[email protected]
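What it means to record k-mer abundances across a collection of datasets can be shown with a naive in-memory index. This is a toy sketch and not REINDEER's data structure: it stores one count vector per k-mer in a dictionary and then groups k-mers whose vectors are identical, which is a strong simplification of monotigs (the abundances here must match exactly, and graph adjacency is ignored). The datasets and function names are hypothetical.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """All k-mers of a sequence, in order (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(datasets, k):
    """Map each k-mer to a tuple of its counts, one entry per dataset."""
    per_dataset = [
        Counter(km for read in reads for km in kmers(read, k))
        for reads in datasets
    ]
    all_kmers = set().union(*per_dataset)
    return {km: tuple(c[km] for c in per_dataset) for km in all_kmers}

def group_by_abundance(index):
    """Group k-mers that share exactly the same count vector."""
    groups = defaultdict(set)
    for km, vector in index.items():
        groups[vector].add(km)
    return dict(groups)

# Two tiny hypothetical datasets.
datasets = [
    ["ACGTA", "ACGTA"],   # dataset 0
    ["ACGTA", "CGTAC"],   # dataset 1
]
index = build_index(datasets, k=4)
groups = group_by_abundance(index)

print(index["ACGT"])    # (2, 1)
print(index["GTAC"])    # (0, 1): absent from dataset 0
print("TTTT" in index)  # False: exact absence query
```

Storing one vector per group rather than per k-mer is where the space saving would come from when many neighbouring k-mers share the same abundance profile.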


2021
Author(s): Gabriella Rustici, Eleanor Williams, Mitra Barzine, Alvis Brazma, Roger Bumgarner, ...

Abstract Over the last two decades, molecular biology has been changed by the introduction of high-throughput technologies. Data sharing requirements have prompted the establishment of persistent data archives. A standardized approach for recording and managing these data was first proposed in the Minimal Information About a Microarray Experiment (MIAME) guidelines. The Minimal Information about a high throughput nucleotide Sequencing Experiment (MINSEQE) proposal was introduced in 2008 as a logical extension of the guidelines to next-generation sequencing (NGS) technologies used for transcriptome analysis. We present a historical snapshot of the data-sharing situation focusing on transcriptomics data from both microarray and RNA-sequencing experiments published between 2009 and 2013, a period during which RNA-seq studies became increasingly popular for transcriptome analysis. We assess how much data from RNA-seq based experiments is actually available in persistent data archives, compared to data derived from microarray based experiments, and evaluate how these types of data differ. Based on this analysis, we provide recommendations to improve RNA-seq data availability, reusability, and reproducibility.


2020, Vol 21 (1)
Author(s): Ming-Feng Hsieh, Chin Lung Lu, Chuan Yi Tang

Abstract Background Next-generation sequencing technologies revolutionized genomics by producing high-throughput reads at low cost, and this progress has prompted the recent development of de novo assemblers. Multiple assembly methods based on the de Bruijn graph have been shown to be efficient for Illumina reads. However, the sequencing errors generated by the sequencer complicate de novo assembly and affect the quality of downstream genomic research. Results In this paper, we develop a de Bruijn assembler, called Clover (clustering-oriented de novo assembler), that utilizes a novel k-mer clustering approach from the overlap-layout-consensus concept to deal with the sequencing errors generated by the Illumina platform. We further evaluate Clover’s performance against several de Bruijn graph assemblers (ABySS, SOAPdenovo, SPAdes and Velvet), overlap-layout-consensus assemblers (Bambus2, CABOG and MSR-CA) and a string graph assembler (SGA) on three datasets (Staphylococcus aureus, Rhodobacter sphaeroides and human chromosome 14). The results show that Clover achieves a superior assembly quality in terms of corrected N50 and E-size while remaining competitive in run time with all of these assemblers except SOAPdenovo. In addition, Clover was involved in the sequencing projects of the bacterial genomes Acinetobacter baumannii TYTH-1 and Morganella morganii KT. Conclusions The novel clustering-based approach of Clover, which integrates the flexibility of the overlap-layout-consensus approach and the efficiency of the de Bruijn graph method, has high potential for de novo assembly. Clover is freely available as open source software from https://oz.nthu.edu.tw/~d9562563/src.html.
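For readers unfamiliar with the assembly metric mentioned above, the following sketch computes a plain N50 as it is commonly defined: the length of the contig at which the cumulative length, taken from the longest contig downwards, first reaches half of the total assembly length. The contig lengths are made up, and the "corrected" variant used in the evaluation (which, as commonly described, first breaks contigs at misassemblies) is not implemented here.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half
    of the total assembled length. Returns 0 for an empty assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Hypothetical assembly: total length 290, half is 145.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths))  # 70, since 80 + 70 = 150 >= 145
```

Because N50 rewards long contigs regardless of correctness, evaluations such as the one described above usually pair it with an error-aware variant.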


Author(s): Borja Freire, Susana Ladra, Jose R Paramá, Leena Salmela

Abstract Motivation RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are based either on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. Results We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to that of PEHaplo, but viaDBG is also able to retrieve low-abundance quasispecies, which are often missed by PEHaplo. Availability and implementation viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. Supplementary information Supplementary data are available at Bioinformatics online.


2019, Vol 36 (5), pp. 1374-1381
Author(s): Antoine Limasset, Jean-François Flot, Pierre Peterlongo

Abstract Motivation Short-read accuracy is important for downstream analyses such as genome assembly and hybrid long-read correction. Despite much work on short-read correction, present-day correctors either do not scale well on large datasets or consider reads as mere sequences of k-mers, without taking into account their full-length sequence information. Results We propose a new method to correct short reads using de Bruijn graphs and implement it as a tool called Bcool. As a first step, Bcool constructs a compacted de Bruijn graph from the reads. This graph is filtered on the basis of k-mer abundance and then of unitig abundance, thereby removing most sequencing errors. The cleaned graph is then used as a reference on which the reads are mapped to correct them. We show that this approach yields more accurate reads than k-mer-spectrum correctors while being scalable to human-size genomic datasets and beyond. Availability and implementation The implementation is open source, available at http://github.com/Malfoy/BCOOL under the Affero GPL license and as a Bioconda package. Supplementary information Supplementary data are available at Bioinformatics online.
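The two filtering stages described above (k-mer abundance, then unitig abundance) can be sketched on a toy dataset. This is an illustrative sketch under simplifying assumptions, not Bcool's implementation: it works on the forward strand only, ignores isolated cycles, uses mean k-mer count as the unitig abundance, and the thresholds, reads, and all function names are hypothetical. The read-mapping step is not shown.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def successors(kmer, kmers):
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]

def predecessors(kmer, kmers):
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]

def compact(kmers):
    """Merge non-branching paths of k-mers into unitigs (lists of k-mers).
    Isolated cycles are ignored in this sketch."""
    def is_internal(km):
        # km continues a unitig if its single predecessor has no other successor.
        preds = predecessors(km, kmers)
        return len(preds) == 1 and len(successors(preds[0], kmers)) == 1

    unitigs, visited = [], set()
    for start in sorted(kmers):
        if start in visited or is_internal(start):
            continue
        path, cur = [start], start
        visited.add(start)
        while True:
            succs = successors(cur, kmers)
            if len(succs) != 1:
                break
            nxt = succs[0]
            if nxt in visited or len(predecessors(nxt, kmers)) != 1:
                break
            path.append(nxt)
            visited.add(nxt)
            cur = nxt
        unitigs.append(path)
    return unitigs

def spell(path):
    """Sequence spelled by a path of overlapping k-mers."""
    return path[0] + "".join(km[-1] for km in path[1:])

def clean_graph(reads, k, min_kmer, min_unitig):
    counts = count_kmers(reads, k)
    # Stage 1: drop k-mers below the k-mer abundance threshold.
    kept = {km for km, n in counts.items() if n >= min_kmer}
    # Stage 2: drop unitigs whose mean k-mer abundance is too low.
    solid = [p for p in compact(kept)
             if sum(counts[km] for km in p) / len(p) >= min_unitig]
    # Re-compact what is left and return the unitig sequences.
    kept = {km for p in solid for km in p}
    return sorted(spell(p) for p in compact(kept))

truth = "ACGGTCATG"
reads = [truth] * 10 + ["ACGGTCCTG"] * 2 + ["ACGGTCATT"]
cleaned = clean_graph(reads, k=4, min_kmer=2, min_unitig=5)
print(cleaned)  # ['ACGGTCATG']
```

In this example the error seen once is removed by the k-mer filter, while the error seen twice passes it and is only removed at the unitig stage, which is why the second filter can afford a higher threshold than the first.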

