An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search

Fatemeh Almodaresi; Prashant Pandey; Michael Ferdman; Rob Johnson; Rob Patro

doi:10.1089/cmb.2019.0322

An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

Lecture Notes in Computer Science - Research in Computational Molecular Biology, 2019, pp. 1-18

doi:10.1007/978-3-030-17083-7_1

Cited By ~ 3

Author(s): Fatemeh Almodaresi; Prashant Pandey; Michael Ferdman; Rob Johnson; Rob Patro

Keyword(s): Graph Search; De Bruijn Graph; High Dimensional; Exact Representation; Color Information; De Bruijn

An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

doi:10.1101/464222

2018

Cited By ~ 7

Author(s): Fatemeh Almodaresi; Prashant Pandey; Michael Ferdman; Rob Johnson; Rob Patro

Keyword(s): Large Scale; Population Level; Reference Sequence; De Bruijn Graph; High Dimensional; Level Variation; Color Information; Sequence Search; De Bruijn; Scale Sequence

Abstract: The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas of genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure.

In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes (patterns of color occurrence) present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows to thousands of experiments.

We apply this encoding in the context of two different applications: the implicit cdbg used for a large-scale sequence search index, Mantis, and the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of the representation of the color information. In our experiment on 10,000 samples, we achieved more than 11× better compression compared to RRR.
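
One way to picture the hierarchical encoding described in this abstract is as delta encoding over a tree of color classes: each class stores only the colors in which it differs from a similar "parent" class. The sketch below is an illustration under stated assumptions, not the authors' implementation. It assumes color classes are bit vectors held as Python integers and that a parent has already been chosen for each class (the abstract says candidate neighbors are found through the dbg; here the `parent` map is simply given). The names `hamming_delta`, `encode`, and `decode` are hypothetical.

```python
from typing import Dict, List, Optional


def hamming_delta(a: int, b: int) -> List[int]:
    """Return the sorted bit positions at which color vectors a and b differ."""
    diff = a ^ b
    positions = []
    pos = 0
    while diff:
        if diff & 1:
            positions.append(pos)
        diff >>= 1
        pos += 1
    return positions


def encode(classes: Dict[int, int],
           parent: Dict[int, Optional[int]]) -> Dict[int, List[int]]:
    """Store each color class as the bit flips relative to its parent.

    A parent of None means "relative to the empty (all-zero) color class".
    The parent map is assumed to form a tree (no cycles).
    """
    deltas = {}
    for cid, vec in classes.items():
        base = 0 if parent[cid] is None else classes[parent[cid]]
        deltas[cid] = hamming_delta(vec, base)
    return deltas


def decode(cid: int,
           deltas: Dict[int, List[int]],
           parent: Dict[int, Optional[int]]) -> int:
    """Rebuild a color vector by applying the flips on the path to the root."""
    vec = 0
    node: Optional[int] = cid
    while node is not None:
        for pos in deltas[node]:
            vec ^= 1 << pos
        node = parent[node]
    return vec


# Toy input: three color classes over four samples (bit i = sample i).
classes = {0: 0b1011, 1: 0b1010, 2: 0b1110}
# Hypothetical parent choice: class 1 hangs off class 0, class 2 off class 1.
parent = {0: None, 1: 0, 2: 1}

deltas = encode(classes, parent)
stored = sum(len(d) for d in deltas.values())
raw = sum(bin(v).count("1") for v in classes.values())
```

In this toy input the deltas hold 5 positions in total, against 8 set bits when each class is stored on its own. The saving depends entirely on how similar each class is to its chosen parent, which is why finding close neighbors matters.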

deBGR: an efficient and near-exact representation of the weighted de Bruijn graph

Bioinformatics, 2017, Vol 33 (14), pp. i133-i141

doi:10.1093/bioinformatics/btx261

Cited By ~ 17

Author(s): Prashant Pandey; Michael A Bender; Rob Johnson; Rob Patro

Keyword(s): De Bruijn Graph; Exact Representation; De Bruijn

Rainbowfish: A Succinct Colored de Bruijn Graph Representation

doi:10.1101/138016

2017

Cited By ~ 18

Author(s): Fatemeh Almodaresi; Prashant Pandey; Rob Patro

Keyword(s): Relevant Information; Large Datasets; Graph Representation; Combinatorial Structure; De Bruijn Graph; Color Information; Efficient Representation; De Bruijn; Colored De Bruijn Graph; Memory Efficient

Abstract: The colored de Bruijn graph, a variant of the de Bruijn graph which associates each edge (i.e., k-mer) with some set of colors, is an increasingly important combinatorial structure in computational biology. Iqbal et al. demonstrated the utility of this structure for representing and assembling a collection (population) of genomes, and showed how it can be used to accurately detect genetic variants. Muggli et al. introduced VARI, a representation of the colored de Bruijn graph that adopts the BOSS representation for the de Bruijn graph topology and achieves considerable savings in space over Cortex, albeit with some sacrifice in speed. The memory-efficient representation of VARI allows the colored de Bruijn graph to be constructed and analyzed for large datasets, beyond what is possible with Cortex.

In this paper, we introduce Rainbowfish, a succinct representation of the color information of the colored de Bruijn graph that reduces the space usage even further. Our representation also uses BOSS to represent the de Bruijn graph, but decomposes the color sets based on an equivalence relation and exploits the inherent skewness in the distribution of these color sets. The Rainbowfish representation is compressed based on the 0th-order entropy of the color sets, which can lead to a significant reduction in the space required to store the relevant information for each edge. In practice, Rainbowfish achieves up to a 20× improvement in space over VARI. Rainbowfish is written in C++11 and is available at https://github.com/COMBINE-lab/rainbowfish.
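
The idea of grouping k-mers by identical color set and exploiting the skew in how often each set occurs can be illustrated with a small sketch. It is written in Python for brevity and is not Rainbowfish's C++ code or its succinct on-disk layout. It assumes the input is a plain dictionary from k-mer to a set of sample indices; the function names `build_color_classes` and `zeroth_order_entropy` are hypothetical.

```python
import math
from collections import Counter


def build_color_classes(kmer_colors):
    """Group k-mers by identical color set and rank the sets by frequency.

    Returns (ranked, labels, counts):
      ranked: distinct color sets, most frequent first
      labels: k-mer -> index of its color set in `ranked`
      counts: color set -> number of k-mers carrying it
    """
    counts = Counter(frozenset(c) for c in kmer_colors.values())
    # Most frequent sets get the smallest ids; ties are broken by the sorted
    # members of the set so that the ranking is deterministic.
    ranked = sorted(counts, key=lambda cs: (-counts[cs], sorted(cs)))
    class_id = {cs: i for i, cs in enumerate(ranked)}
    labels = {kmer: class_id[frozenset(c)] for kmer, c in kmer_colors.items()}
    return ranked, labels, counts


def zeroth_order_entropy(counts):
    """Bits per k-mer needed, at best, to name its color class (0th-order)."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Toy input: four k-mers over three samples (0, 1, 2).
kmer_colors = {
    "ACG": {0, 1},
    "CGT": {0, 1},
    "GTA": {0, 1},
    "TAC": {2},
}

ranked, labels, counts = build_color_classes(kmer_colors)
entropy = zeroth_order_entropy(counts)
```

Here three of the four k-mers share the color set {0, 1}, so that set receives id 0. The more skewed the distribution of color sets, the lower the 0th-order entropy and the fewer bits per k-mer an entropy-based code needs.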

Improving the efficiency of de Bruijn graph construction using compact universal hitting sets

Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 2021

doi:10.1145/3459930.3469520

Author(s): Yael Ben-Ari; Dan Flomin; Lianrong Pu; Yaron Orenstein; Ron Shamir

Keyword(s): De Bruijn Graph; Hitting Sets; De Bruijn

Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph

Algorithms for Molecular Biology, 2021, Vol 16 (1)

doi:10.1186/s13015-021-00182-9

Author(s): Kingshuk Mukherjee; Massimiliano Rossi; Leena Salmela; Christina Boucher

Keyword(s): Single Molecule; De Bruijn Graph; Anabas Testudineus; E Coli; Genome Wide; A Genome; De Bruijn; Optical Maps; Definition Of; Numeric Representation

Abstract: Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single-molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly available non-proprietary method for assembly and one proprietary software package that is available via an executable. Furthermore, the publicly available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm and is therefore unable to scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph-based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method ran successfully on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) ran successfully only on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory, and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper.
