An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

10.1101/464222 ◽

2018 ◽

Cited By ~ 7

Author(s):

Fatemeh Almodaresi ◽

Prashant Pandey ◽

Michael Ferdman ◽

Rob Johnson ◽

Rob Patro

Keyword(s):

Large Scale ◽

Population Level ◽

Reference Sequence ◽

De Bruijn Graph ◽

High Dimensional ◽

Level Variation ◽

Color Information ◽

Sequence Search ◽

De Bruijn ◽

Scale Sequence

Abstract. The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes (patterns of color occurrence) present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows to thousands of experiments. We apply this encoding in the context of two different applications: the implicit cdbg used for a large-scale sequence search index, Mantis, and the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved more than 11× better compression compared to RRR.
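The idea of encoding each color class relative to a similar, neighboring class can be illustrated with a toy sketch. This is not the authors' implementation: color classes are modeled as integer bitmasks (one bit per sample), and the `candidates` map is a hypothetical stand-in for the neighbors that a search over de Bruijn graph adjacency would supply. The function names `encode_colors` and `decode_color` are likewise made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

Encoded = List[Tuple[Optional[int], int]]


def popcount(x: int) -> int:
    """Number of set bits, i.e. samples present in a color bitmask."""
    return bin(x).count("1")


def encode_colors(classes: List[int], candidates: Dict[int, List[int]]) -> Encoded:
    """Store each color class as (parent index, XOR delta).

    candidates[i] lists earlier classes (index < i) assumed to be adjacent to
    class i in the graph; only those are tried as parents. A class is stored
    verbatim (parent None) when no candidate yields a sparser delta.
    """
    encoded: Encoded = []
    for i, cls in enumerate(classes):
        best_parent, best_delta = None, cls
        for j in candidates.get(i, []):
            if j >= i:
                raise ValueError("parents must precede their children")
            delta = cls ^ classes[j]
            if popcount(delta) < popcount(best_delta):
                best_parent, best_delta = j, delta
        encoded.append((best_parent, best_delta))
    return encoded


def decode_color(encoded: Encoded, i: int) -> int:
    """Recover class i by XOR-ing deltas along the chain of parents."""
    parent, result = encoded[i]
    while parent is not None:
        parent, delta = encoded[parent]
        result ^= delta
    return result
```

With the four 6-sample classes `[0b111100, 0b111101, 0b011101, 0b000011]` and candidates `{1: [0], 2: [1], 3: [2]}`, the second and third classes are each stored as a single differing bit, while the fourth is kept verbatim because its only candidate is not closer than the raw vector. Decoding walks the parent chain, so lookups cost more as chains grow; that space/time trade-off is a property of this sketch and says nothing about the published system.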


An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search

Journal of Computational Biology ◽

10.1089/cmb.2019.0322 ◽

2020 ◽

Vol 27 (4) ◽

pp. 485-499 ◽

Cited By ~ 1

Author(s):

Fatemeh Almodaresi ◽

Prashant Pandey ◽

Michael Ferdman ◽

Rob Johnson ◽

Rob Patro

Keyword(s):

Graph Search ◽

De Bruijn Graph ◽

High Dimensional ◽

Exact Representation ◽

Color Information ◽

De Bruijn


NxRepair: Error correction in de novo sequence assembly using Nextera mate pairs

10.7287/peerj.preprints.747v1 ◽

2014 ◽

Author(s):

Rebecca R Murphy ◽

Jared M O'Connell ◽

Anthony J Cox ◽

Ole B Schulz-Trieglaff

Keyword(s):

Error Correction ◽

Large Scale ◽

De Novo ◽

Reference Sequence ◽

De Bruijn Graph ◽

Sequencing Data ◽

Additional Information ◽

Mate Pair ◽

De Bruijn ◽

De Novo Sequence Assembly

Scaffolding errors and incorrect traversals of the de Bruijn graph during de novo assembly can result in large-scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors without the use of a reference sequence, resulting in quantitative improvements in assembly quality. NxRepair can be downloaded from GitHub; a tutorial and user documentation are also available.


NxRepair: Error correction in de novo sequence assembly using Nextera mate pairs

10.7287/peerj.preprints.747 ◽

2014 ◽

Author(s):

Rebecca R Murphy ◽

Jared M O'Connell ◽

Anthony J Cox ◽

Ole B Schulz-Trieglaff

Keyword(s):

Error Correction ◽

Large Scale ◽

De Novo ◽

Reference Sequence ◽

De Bruijn Graph ◽

Sequencing Data ◽

Additional Information ◽

Mate Pair ◽

De Bruijn ◽

De Novo Sequence Assembly


An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

Lecture Notes in Computer Science - Research in Computational Molecular Biology ◽

10.1007/978-3-030-17083-7_1 ◽

2019 ◽

pp. 1-18 ◽

Cited By ~ 3

Author(s):

Fatemeh Almodaresi ◽

Prashant Pandey ◽

Michael Ferdman ◽

Rob Johnson ◽

Rob Patro

Keyword(s):

Graph Search ◽

De Bruijn Graph ◽

High Dimensional ◽

Exact Representation ◽

Color Information ◽

De Bruijn


Robust data storage in DNA by de Bruijn graph-based decoding

10.21203/rs.3.rs-382900/v1 ◽

2021 ◽

Author(s):

Lifu Song ◽

Feng Geng ◽

Ziyi Song ◽

Bing-Zhi Li ◽

Ying-Jin Yuan

Keyword(s):

Data Storage ◽

Large Scale ◽

Search Algorithm ◽

De Bruijn Graph ◽

Large Scale Data ◽

Dna Strands ◽

Pcr Products ◽

Path Search ◽

De Bruijn ◽

Linear Decoding

Abstract. Data storage in DNA, which stores information in polymers, is a promising technology with high density and long-term stability. However, the indels, strand rearrangements, and strand breaks that emerge during synthesis, amplification, sequencing, and storage of DNA molecules need to be handled. Here, we report a de Bruijn graph-based greedy path search algorithm (DBG-GPS), which handles all of these issues by efficiently reconstructing the DNA strands. DBG-GPS achieves accurate data recovery with low-quality, deep error-prone PCR products and with accelerated-aged DNA samples (in solution, 70℃ for two weeks). The robustness of DBG-GPS was verified with 100 repeated retrievals using PCR products with massive unspecific amplifications. Moreover, DBG-GPS shows linear decoding complexity and is more than 100 times faster than multiple alignment-based methods, indicating a suitable solution for large-scale data storage.
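A greedy path search over a de Bruijn graph can be sketched as follows. The abstract does not spell out how DBG-GPS chooses between competing paths, so the selection rule here (follow the most frequently observed k-mer, ignoring k-mers seen fewer than `min_count` times) is an assumption made purely for illustration, as are the function names and the idea that a known `seed` k-mer marks the start of the strand.

```python
from collections import Counter
from typing import Iterable


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count every k-mer across all reads (fragments and noisy copies)."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_reconstruct(counts: Counter, seed: str, length: int,
                       min_count: int = 2) -> str:
    """Extend `seed` one base at a time along the best-supported k-mer.

    Assumes k = len(seed) >= 2. Stops early if no successor k-mer reaches
    `min_count`, which is how rare error k-mers are filtered out here.
    """
    k = len(seed)
    strand = seed
    while len(strand) < length:
        suffix = strand[-(k - 1):]
        options = [(counts[suffix + base], base)
                   for base in "ACGT" if counts[suffix + base] >= min_count]
        if not options:
            break
        strand += max(options)[1]  # highest count wins; ties go to the later base
    return strand


reads = [
    "ACGTTGCA",      # broken strand: prefix only
    "GTTGCATGGA",    # broken strand: suffix only
    "ACGTTGCATGGA",  # one intact copy
    "TTGCATGGA",     # another fragment
    "ACGTAGCA",      # fragment carrying a substitution error
]
counts = count_kmers(reads, k=4)
recovered = greedy_reconstruct(counts, seed="ACGT", length=12)
```

Each extension step inspects at most four candidate k-mers, so the work grows linearly with strand length, which matches the linear decoding complexity the abstract reports at a high level. No single read needs to be intact, because the strand is rebuilt from overlapping k-mers rather than from whole reads.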


An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using LSM Trees

10.1101/2021.02.05.429839 ◽

2021 ◽

Author(s):

Fatemeh Almodaresi ◽

Jamshed Khan ◽

Sergey Madaminov ◽

Prashant Pandey ◽

Michael Ferdman ◽

...

Keyword(s):

Large Scale ◽

Supplementary Information ◽

De Bruijn Graph ◽

Sequencing Data ◽

Construction Time ◽

Graph Representations ◽

Sequence Search ◽

General Search ◽

Colored De Bruijn Graph ◽

Search Index

Abstract. Motivation: In the past few years, researchers have proposed numerous indexing schemes for searching large databases of raw sequencing experiments. Most of these proposed indexes are approximate (i.e., with one-sided errors) in order to save space. Recently, researchers have published exact indexes (Mantis, VariMerge, and Bifrost) that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data. Results: In this paper, we show how to build a scalable and updatable exact sequence-search index. Specifically, we extend Mantis using the Bentley-Saxe transformation to support efficient updates. We demonstrate Mantis's scalability by constructing an index of ≈40K samples from SRA by adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Mantis is more efficient in terms of index-construction time and memory, query time and memory, and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support the general search queries we require). Mantis indexes were about 2.5× smaller than Bifrost's indexes and about half as big as VariMerge's indexes. Availability: The updatable Mantis implementation is available at https://github.com/splatlab/mantis/tree/[email protected] Supplementary information: Supplementary data are available online.
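The Bentley-Saxe transformation (also called the logarithmic method) turns a static structure into an insertable one by keeping several static structures of sizes 1, 2, 4, ... and rebuilding them like carries in a binary counter. The sketch below applies it to a deliberately simple static index, a dictionary from k-mer to sample ids. It is not Mantis code; `build_static`, `LogMethodIndex`, and the sample data are hypothetical names and values chosen for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Sample = Tuple[str, Set[str]]        # (sample id, its k-mers)
StaticIndex = Dict[str, Set[str]]    # k-mer -> ids of samples containing it


def build_static(samples: List[Sample]) -> StaticIndex:
    """Build an immutable-by-convention index from scratch."""
    index: StaticIndex = {}
    for sample_id, kmers in samples:
        for kmer in kmers:
            index.setdefault(kmer, set()).add(sample_id)
    return index


class LogMethodIndex:
    """Level i is either empty or a static index over exactly 2**i samples."""

    def __init__(self) -> None:
        self.levels: List[Optional[Tuple[List[Sample], StaticIndex]]] = []

    def insert(self, sample: Sample) -> None:
        carry = [sample]
        level = 0
        # Cascade like binary addition: merge full levels into the carry.
        while level < len(self.levels) and self.levels[level] is not None:
            carry.extend(self.levels[level][0])
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(None)
        self.levels[level] = (carry, build_static(carry))

    def query(self, kmer: str) -> Set[str]:
        """Union the answers of every non-empty level."""
        hits: Set[str] = set()
        for entry in self.levels:
            if entry is not None:
                hits |= entry[1].get(kmer, set())
        return hits


index = LogMethodIndex()
for n, kmers in enumerate([{"ACG", "CGT"}, {"CGT"}, {"GTA"}, {"ACG"}, {"CGT", "GTA"}]):
    index.insert((f"s{n}", kmers))
```

After five insertions the occupied levels hold 1 and 4 samples (binary 101). Each sample takes part in at most a logarithmic number of rebuilds, and a query consults at most a logarithmic number of static indexes.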


Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

10.1101/2020.10.21.349605 ◽

2020 ◽

Author(s):

Jamshed Khan ◽

Rob Patro

Keyword(s):

Large Scale ◽

De Bruijn Graph ◽

Comparative Genomic ◽

De Bruijn Graphs ◽

Long Reads ◽

Genomic Analyses ◽

Finite State ◽

De Bruijn ◽

Colored De Bruijn Graph ◽

Memory Compaction

Abstract. Motivation: The construction of the compacted de Bruijn graph from a large collection of reference genomes is a task of increasing interest in genomic analyses. For example, compacted colored reference de Bruijn graphs are increasingly used as sequence indices for the purposes of alignment of short and long reads. Also, as we sequence and assemble a greater diversity of individual genomes, the compacted colored de Bruijn graph can be used as the basis for methods aiming to perform comparative genomic analyses on these genomes. While algorithms have been developed to construct the compacted colored de Bruijn graph from reference sequences, there is still room for improvement, especially in the memory and the runtime performance as the number and the scale of the genomes over which the de Bruijn graph is built grow. Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the colored compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel modeling scheme of the de Bruijn graph vertices as finite-state automata, and constrains the state-space for the automata to enable tracking of their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that the algorithm scales much better than existing approaches, especially as the number and scale of the input references grow. For example, on a typical shared-memory machine, Cuttlefish constructed the compacted graph for 100 human genomes in less than 7 hours, using ~29 GB of memory; no other tested tool successfully completed this task on the testing hardware. We also applied Cuttlefish on 11 diverse conifer plant genomes, and the compacted graph was constructed in under 11 hours, using ~84 GB of memory, while the only other tested tool able to complete this compaction on our hardware took more than 16 hours and ~289 GB of memory. Availability: Cuttlefish is written in C++14, and is available under an open source license at https://github.com/COMBINE-lab/[email protected]
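Treating each vertex as a small automaton can be illustrated with a simplified sketch. Each k-mer keeps one state per side: no neighbor seen yet (`None`), exactly one distinct neighboring base, or several (`"*"`). Maximal non-branching paths are then read off from those states. This is an illustration under strong assumptions, not Cuttlefish's actual state encoding: it ignores reverse complements, skips isolated cycles, and stores states in an ordinary dictionary rather than a compact bit-packed table. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

State = Tuple[Optional[str], Optional[str]]  # (incoming side, outgoing side)


def step(state: Optional[str], symbol: str) -> str:
    """Automaton transition: None -> symbol, same symbol stays, else '*'."""
    if state is None or state == symbol:
        return symbol
    return "*"


def build_states(seqs: List[str], k: int) -> Dict[str, State]:
    """Scan the sequences once, updating both side-states of every k-mer."""
    states: Dict[str, State] = {}
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            ins, outs = states.get(kmer, (None, None))
            if i > 0:
                ins = step(ins, seq[i - 1])
            if i + k < len(seq):
                outs = step(outs, seq[i + k])
            states[kmer] = (ins, outs)
    return states


def unitigs(states: Dict[str, State]) -> List[str]:
    """Spell maximal paths whose internal edges are unique on both ends."""
    def pred(v: str) -> Optional[str]:
        s = states[v][0]
        return s + v[:-1] if s not in (None, "*") else None

    def succ(v: str) -> Optional[str]:
        s = states[v][1]
        return v[1:] + s if s not in (None, "*") else None

    paths = []
    for v in states:
        p = pred(v)
        if p is not None and succ(p) == v:
            continue  # v sits inside or at the end of a path started elsewhere
        path, cur = v, v
        while True:
            nxt = succ(cur)
            if nxt is None or pred(nxt) != cur:
                break
            path += nxt[-1]
            cur = nxt
        paths.append(path)
    return paths


states = build_states(["GAACGTT", "CTACGGA"], k=3)
compacted = unitigs(states)
```

In this example the shared k-mer `ACG` sees two different bases on each side, so both of its states become `"*"` and it stands alone, while the four flanking chains collapse into one string each. Because a side-state takes only a handful of values, it can be stored in a few bits per vertex, which is the intuition behind the low memory usage described in the abstract.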


deGSM: memory scalable construction of large scale de Bruijn Graph

IEEE/ACM Transactions on Computational Biology and Bioinformatics ◽

10.1109/tcbb.2019.2913932 ◽

2019 ◽

pp. 1-1 ◽

Cited By ~ 5

Author(s):

Hongzhe Guo ◽

Yilei Fu ◽

Yan Gao ◽

Junyi Li ◽

Yadong Wang ◽

...

Keyword(s):

Large Scale ◽

De Bruijn Graph ◽

De Bruijn


Redundant De Bruijn graph based location and routing for large-scale peer-to-peer system

2010 IEEE International Conference on Progress in Informatics and Computing ◽

10.1109/pic.2010.5687576 ◽

2010 ◽

Author(s):

Jinyan Chen ◽

Yi Zhang

Keyword(s):

Large Scale ◽

Peer To Peer ◽

De Bruijn Graph ◽

De Bruijn


Rainbowfish: A Succinct Colored de Bruijn Graph Representation

10.1101/138016 ◽

2017 ◽

Cited By ~ 18

Author(s):

Fatemeh Almodaresi ◽

Prashant Pandey ◽

Rob Patro

Keyword(s):

Relevant Information ◽

Large Datasets ◽

Graph Representation ◽

Combinatorial Structure ◽

De Bruijn Graph ◽

Color Information ◽

Efficient Representation ◽

De Bruijn ◽

Colored De Bruijn Graph ◽

Memory Efficient

Abstract. The colored de Bruijn graph, a variant of the de Bruijn graph which associates each edge (i.e., k-mer) with some set of colors, is an increasingly important combinatorial structure in computational biology. Iqbal et al. demonstrated the utility of this structure for representing and assembling a collection (population) of genomes, and showed how it can be used to accurately detect genetic variants. Muggli et al. introduced VARI, a representation of the colored de Bruijn graph that adopts the BOSS representation for the de Bruijn graph topology and achieves considerable savings in space over Cortex, albeit with some sacrifice in speed. The memory-efficient representation of VARI allows the colored de Bruijn graph to be constructed and analyzed for large datasets, beyond what is possible with Cortex. In this paper, we introduce Rainbowfish, a succinct representation of the color information of the colored de Bruijn graph that reduces the space usage even further. Our representation also uses BOSS to represent the de Bruijn graph, but decomposes the color sets based on an equivalence relation and exploits the inherent skewness in the distribution of these color sets. The Rainbowfish representation is compressed based on the 0th-order entropy of the color sets, which can lead to a significant reduction in the space required to store the relevant information for each edge. In practice, Rainbowfish achieves up to a 20× improvement in space over VARI. Rainbowfish is written in C++11 and is available at https://github.com/COMBINE-lab/rainbowfish.
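The decomposition into color classes and the role of 0th-order entropy can be made concrete with a small sketch. Edges with identical color sets are grouped into one class, classes are numbered from most to least frequent, and the entropy of the resulting labels gives a lower bound on what any per-label 0th-order code needs. The function names and the toy data are illustrative assumptions; Rainbowfish's actual on-disk layout is not reproduced here.

```python
from collections import Counter
from math import log2
from typing import Dict, FrozenSet, List, Set, Tuple


def color_classes(
    edge_colors: Dict[str, Set[int]]
) -> Tuple[List[FrozenSet[int]], Dict[str, int]]:
    """Group edges by identical color set; frequent classes get small ids."""
    freq = Counter(frozenset(colors) for colors in edge_colors.values())
    ranked = sorted(freq, key=lambda cls: (-freq[cls], sorted(cls)))
    class_id = {cls: i for i, cls in enumerate(ranked)}
    labels = {edge: class_id[frozenset(colors)]
              for edge, colors in edge_colors.items()}
    return ranked, labels


def entropy_bits(labels: Dict[str, int]) -> float:
    """Total 0th-order entropy (in bits) of the per-edge class labels."""
    n = len(labels)
    freq = Counter(labels.values())
    return -sum(f * log2(f / n) for f in freq.values())


# Toy input: 8 edges over 3 samples with a skewed color distribution.
edge_colors = {
    "e0": {0, 1, 2}, "e1": {0, 1, 2}, "e2": {0, 1, 2}, "e3": {0, 1, 2},
    "e4": {0}, "e5": {0},
    "e6": {1, 2},
    "e7": {2},
}
ranked, labels = color_classes(edge_colors)
total_bits = entropy_bits(labels)
```

Here a plain bit matrix needs 8 edges × 3 samples = 24 bits, whereas the label entropy is 1.75 bits per edge, or 14 bits in total, plus the one-time cost of storing the four distinct color sets. The saving grows with the skew of the class distribution, which is the property the abstract says Rainbowfish exploits.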
