scholarly journals An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

2018 ◽  
Author(s):  
Fatemeh Almodaresi ◽  
Prashant Pandey ◽  
Michael Ferdman ◽  
Rob Johnson ◽  
Rob Patro

AbstractThe colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure.In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes — patterns of color occurrence — present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e. samples or references) grows to thousands of experiments.We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved more than 11× better compression compared to RRR.

2020 ◽  
Vol 27 (4) ◽  
pp. 485-499 ◽  
Author(s):  
Fatemeh Almodaresi ◽  
Prashant Pandey ◽  
Michael Ferdman ◽  
Rob Johnson ◽  
Rob Patro

2014 ◽  
Author(s):  
Rebecca R Murphy ◽  
Jared M O'Connell ◽  
Anthony J Cox ◽  
Ole B Schulz-Trieglaff

Scaffolding errors and incorrect traversals of the de Bruijn graph during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub; a tutorial and user documentation are also available.


2014 ◽  
Author(s):  
Rebecca R Murphy ◽  
Jared M O'Connell ◽  
Anthony J Cox ◽  
Ole B Schulz-Trieglaff

Scaffolding errors and incorrect traversals of the de Bruijn graph during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub; a tutorial and user documentation are also available.


2021 ◽  
Author(s):  
Lifu Song ◽  
Feng Geng ◽  
Ziyi Song ◽  
Bing-Zhi Li ◽  
Ying-Jin Yuan

Abstract Data storage in DNA, which store information in polymers, is a potential technology with high density and long-term features. However, the indels, strand rearrangements, and strand breaks that emerged during synthesis, amplification, sequencing, and storage of DNA molecules need to be handled. Here, we report a de Bruijn graph-based, greedy path search algorithm (DBG-GPS), which can efficiently handle all these issues by efficient reconstruction of the DNA strands. DBG-GPS achieves accurate data recovery with low-quality, deep error-prone PCR products, and accelerated aged DNA samples (solution, 70℃ for two weeks). The robustness of DBG-GPS was verified with 100 times of multiple retrievals using PCR products with massive unspecific amplifications. Moreover, DBG-GPS shows linear decoding complexity and more than 100 times faster than the multiple alignment-based methods, indicating a suitable solution for large-scale data storage.


2021 ◽  
Author(s):  
Fatemeh Almodaresi ◽  
Jamshed Khan ◽  
Sergey Madaminov ◽  
Prashant Pandey ◽  
Michael Ferdman ◽  
...  

AbstractMotivationIn the past few years, researchers have proposed numerous indexing schemes for searching large databases of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes—Mantis, VariMerge, and Bifrost—that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data.ResultsIn this paper, we show how to build a scalable and updatable exact sequence-search index. Specifically, we extend Mantis using the Bentley-Saxe transformation to support efficient updates. We demonstrate Mantis’s scalability by constructing an index of ≈ 40K samples from SRA by adding samples one at a time to an initial index of 10K samples.Compared to VariMerge and Bifrost, Mantis is more efficient in terms of index-construction time and memory, query time and memory, and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support general search queries we require). Mantis indexes were about 2.5× smaller than Bifrost’s indexes and about half as big as VariMerge’s indexes.AvailabilityThe updatable Mantis implementation is available at https://github.com/splatlab/mantis/tree/[email protected] informationSupplementary data are available online.


2020 ◽  
Author(s):  
Jamshed Khan ◽  
Rob Patro

AbstractMotivationThe construction of the compacted de Bruijn graph from a large collection of reference genomes is a task of increasing interest in genomic analyses. For example, compacted colored reference de Bruijn graphs are increasingly used as sequence indices for the purposes of alignment of short and long reads. Also, as we sequence and assemble a greater diversity of individual genomes, the compacted colored de Bruijn graph can be used as the basis for methods aiming to perform comparative genomic analyses on these genomes. While algorithms have been developed to construct the compacted colored de Bruijn graph from reference sequences, there is still room for improvement, especially in the memory and the runtime performance as the number and the scale of the genomes over which the de Bruijn graph is built grow.ResultsWe introduce a new algorithm, implemented in the tool Cuttlefish, to construct the colored compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel modeling scheme of the de Bruijn graph vertices as finite-state automata, and constrains the state-space for the automata to enable tracking of their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that the algorithm scales much better than existing approaches, especially as the number and scale of the input references grow. For example, on a typical shared-memory machine, Cuttlefish constructed the compacted graph for 100 human genomes in less than 7 hours, using ~29 GB of memory; no other tested tool successfully completed this task on the testing hardware. We also applied Cuttlefish on 11 diverse conifer plant genomes, and the compacted graph was constructed in under 11 hours, using ~84 GB of memory, while the only other tested tool able to complete this compaction on our hardware took more than 16 hours and ~289 GB of memory.AvailabilityCuttlefish is written in C++14, and is available under an open source license at https://github.com/COMBINE-lab/[email protected]


Author(s):  
Hongzhe Guo ◽  
Yilei Fu ◽  
Yan Gao ◽  
Junyi Li ◽  
Yadong Wang ◽  
...  

2017 ◽  
Author(s):  
Fatemeh Almodaresi ◽  
Prashant Pandey ◽  
Rob Patro

AbstractThe colored de Bruijn graph— a variant of the de Bruijn graph which associates each edge (i.e., k-mer) with some set of colors — is an increasingly important combinatorial structure in computational biology. Iqbal et al. demonstrated the utility of this structure for representing and assembling a collection (pop-ulation) of genomes, and showed how it can be used to accurately detect genetic variants. Muggli et al. introduced VARI, a representation of the colored de Bruijn graph that adopts the BOSS representation for the de Bruijn graph topology and achieves considerable savings in space over Cortex, albeit with some sacrifice in speed. The memory-efficient representation of VARI allows the colored de Bruijn graph to be constructed and analyzed for large datasets, beyond what is possible with Cortex.In this paper, we introduce Rainbowfish, a succinct representation of the color information of the colored de Bruijn graph that reduces the space usage even further. Our representation also uses BOSS to represent the de Bruijn graph, but decomposes the color sets based on an equivalence relation and exploits the inherent skewness in the distribution of these color sets. The Rainbowfish representation is compressed based on the 0th-order entropy of the color sets, which can lead to a significant reduction in the space required to store the relevant information for each edge. In practice, Rainbowfish achieves up to a 20 × improvement in space over VARI. Rainbowfish is written in C++11 and is available at https://github.com/COMBINE-lab/rainbowfish.


Sign in / Sign up

Export Citation Format

Share Document