Rainbowfish: A Succinct Colored de Bruijn Graph Representation

10.1101/138016 ◽

2017 ◽

Cited By ~ 18

Author(s):

Fatemeh Almodaresi ◽

Prashant Pandey ◽

Rob Patro

Keyword(s):

Relevant Information ◽

Large Datasets ◽

Graph Representation ◽

Combinatorial Structure ◽

De Bruijn Graph ◽

Color Information ◽

Efficient Representation ◽

De Bruijn ◽

Colored De Bruijn Graph ◽

Memory Efficient

Abstract. The colored de Bruijn graph, a variant of the de Bruijn graph which associates each edge (i.e., k-mer) with some set of colors, is an increasingly important combinatorial structure in computational biology. Iqbal et al. demonstrated the utility of this structure for representing and assembling a collection (population) of genomes, and showed how it can be used to accurately detect genetic variants. Muggli et al. introduced VARI, a representation of the colored de Bruijn graph that adopts the BOSS representation for the de Bruijn graph topology and achieves considerable savings in space over Cortex, albeit with some sacrifice in speed. The memory-efficient representation of VARI allows the colored de Bruijn graph to be constructed and analyzed for large datasets, beyond what is possible with Cortex. In this paper, we introduce Rainbowfish, a succinct representation of the color information of the colored de Bruijn graph that reduces the space usage even further. Our representation also uses BOSS to represent the de Bruijn graph, but decomposes the color sets based on an equivalence relation and exploits the inherent skewness in the distribution of these color sets. The Rainbowfish representation is compressed based on the 0th-order entropy of the color sets, which can lead to a significant reduction in the space required to store the relevant information for each edge. In practice, Rainbowfish achieves up to a 20× improvement in space over VARI. Rainbowfish is written in C++11 and is available at https://github.com/COMBINE-lab/rainbowfish.
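
The idea of grouping edges by color set and exploiting a skewed distribution can be illustrated with a small sketch. This is not the Rainbowfish implementation: it uses plain Python dictionaries instead of BOSS and succinct bit vectors, and the names `color_classes`, `zeroth_order_entropy`, and the toy `edge_colors` input are assumptions made for illustration only. It shows how edges with identical color sets fall into one equivalence class, how classes can be labeled in order of decreasing frequency, and how the 0th-order entropy of the class distribution gives a lower bound (in bits per edge) for a code that treats each edge's label independently.

```python
from collections import Counter
from math import log2


def color_classes(edge_colors):
    """Assign one integer label per distinct color set.

    Edges with equal color sets share a label. Labels are handed out in
    order of decreasing frequency, so the most common color set gets 0.
    """
    counts = Counter(frozenset(colors) for colors in edge_colors.values())
    # Ties are broken by the sorted color list so the labeling is deterministic.
    ordered = sorted(counts, key=lambda s: (-counts[s], sorted(s)))
    label_of = {color_set: i for i, color_set in enumerate(ordered)}
    labels = {edge: label_of[frozenset(c)] for edge, c in edge_colors.items()}
    return labels, ordered, counts


def zeroth_order_entropy(counts):
    """Empirical 0th-order entropy (bits per edge) of the class distribution."""
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())


# Hypothetical toy input: k-mer (edge) -> set of sample identifiers (colors).
edge_colors = {
    "ACGT": {0, 1, 2},
    "CGTA": {0, 1, 2},
    "GTAC": {0, 1, 2},
    "TACG": {0},
}
labels, classes, counts = color_classes(edge_colors)
entropy = zeroth_order_entropy(counts)
```

In this toy input, three of the four edges share one color set, so only two classes are stored and the entropy falls below one bit per edge. The more skewed the distribution, the more such a scheme saves relative to storing a full color bit vector for every edge.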

Building Large Updatable Colored de Bruijn Graphs via Merging

10.1101/229641 ◽

2017 ◽

Cited By ~ 2

Author(s):

Martin D Muggli ◽

Bahar Alipanahi ◽

Christina Boucher

Keyword(s):

Bloom Filter ◽

Metagenomic Data ◽

De Bruijn Graph ◽

De Bruijn Graphs ◽

Working Space ◽

Fold Reduction ◽

De Bruijn ◽

Colored De Bruijn Graph ◽

Public Datasets ◽

Memory Efficient

MOTIVATION: There exist several massive genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph have been developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets because both scalability and the ability to update the construction are needed. RESULTS: We create a method for constructing and updating the colored de Bruijn graph on a very large dataset by partitioning the data into smaller subsets, building the colored de Bruijn graph for each subset using an FM-index-based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this paper. We refer to the resulting method as VariMerge. We validate our approach, and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8,000 strains. Lastly, we compare VariMerge to other competing methods (including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method by Almodaresi (2019), and Multi-BRWT) and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16,000 strains in a manner that allows additional samples to be added. Competing methods either did not scale to a dataset this large or cannot allow for additions without reconstruction. AVAILABILITY: Our software is available under GPLv3 at https://github.com/cosmo-team/cosmo/tree/VARI-merge.
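
The partition-then-merge workflow described above can be sketched at a purely logical level. The following is an illustration only, not VariMerge: it merges ordinary Python dictionaries rather than succinct FM-index-based representations, so it shows what the merged graph contains, not how to build it in small working space. The function name `merge_colored_kmers` and the toy inputs are assumptions. Each input maps a k-mer to the set of color indices (samples) that contain it, and the colors of the second graph are shifted so that they do not collide with those of the first.

```python
def merge_colored_kmers(graph_a, n_colors_a, graph_b):
    """Merge two k-mer -> color-set maps built from disjoint sample subsets.

    Colors in graph_b are renumbered to start at n_colors_a, so every sample
    keeps a unique color index in the merged graph.
    """
    merged = {kmer: set(colors) for kmer, colors in graph_a.items()}
    for kmer, colors in graph_b.items():
        shifted = {c + n_colors_a for c in colors}
        merged.setdefault(kmer, set()).update(shifted)
    return merged


# Hypothetical toy partitions: graph_a covers samples 0 and 1, graph_b one sample.
graph_a = {"ACG": {0}, "CGT": {0, 1}}
graph_b = {"CGT": {0}, "GTA": {0}}
merged = merge_colored_kmers(graph_a, 2, graph_b)
```

Adding a new batch of samples then amounts to building a small graph for the batch and merging it into the existing one, which is the update pattern the abstract motivates.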

A Pseudo de Bruijn Graph Representation for Discretization Orders for Distance Geometry

Bioinformatics and Biomedical Engineering - Lecture Notes in Computer Science ◽

10.1007/978-3-319-16483-0_50 ◽

2015 ◽

pp. 514-523 ◽

Cited By ~ 6

Author(s):

Antonio Mucherino

Keyword(s):

Distance Geometry ◽

Graph Representation ◽

De Bruijn Graph ◽

De Bruijn ◽

Discretization Orders

Somatic variant analysis of linked-reads sequencing data with Lancet

Bioinformatics ◽

10.1093/bioinformatics/btaa888 ◽

2020 ◽

Author(s):

Rajeeva Musunuri ◽

Kanika Arora ◽

André Corvelo ◽

Minita Shah ◽

Jennifer Shelton ◽

...

Keyword(s):

Supplementary Information ◽

De Bruijn Graph ◽

Haplotype Structure ◽

Sequencing Data ◽

Somatic Variant ◽

Local Assembly ◽

De Bruijn ◽

Variant Analysis ◽

Colored De Bruijn Graph ◽

Commercial Research

Abstract. Summary: We present a new version of the popular somatic variant caller, Lancet, that supports the analysis of linked-reads sequencing data. By seamlessly integrating barcodes and haplotype read assignments within the colored De Bruijn graph local-assembly framework, Lancet computes a barcode-aware coverage and identifies variants that disagree with the local haplotype structure. Availability and implementation: Lancet is implemented in C++ and available for academic and non-commercial research purposes as an open-source package at https://github.com/nygenome/lancet. Supplementary information: Supplementary data are available at Bioinformatics online.

Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Algorithms for Molecular Biology ◽

10.1186/1748-7188-8-22 ◽

2013 ◽

Vol 8 (1) ◽

Cited By ~ 151

Author(s):

Rayan Chikhi ◽

Guillaume Rizk

Keyword(s):

Bloom Filter ◽

Graph Representation ◽

De Bruijn Graph ◽

De Bruijn

REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

10.1101/2020.03.29.014159 ◽

2020 ◽

Cited By ~ 4

Author(s):

Camille Marchet ◽

Zamin Iqbal ◽

Daniel Gautheret ◽

Mikael Salson ◽

Rayan Chikhi

Keyword(s):

Large Datasets ◽

Computational Method ◽

De Bruijn Graph ◽

Rna Seq ◽

Indexing Methods ◽

Link Type ◽

De Bruijn

Abstract. Motivation: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. Results: We used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of 4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. Availability: https://github.com/kamimrcht/[email protected]
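
The notion of recording one abundance per dataset for each k-mer, and of grouping k-mers that behave alike, can be shown with a toy sketch. It is not REINDEER's data structure: real monotigs are built on the compacted de Bruijn graph, whereas this sketch simply groups k-mers whose abundance vectors are identical. The function names and the toy reads are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer occurrence in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_index(datasets, k):
    """Map each k-mer to a tuple holding its count in every dataset."""
    per_dataset = [count_kmers(reads, k) for reads in datasets]
    kmers = set().union(*per_dataset)
    # Counter returns 0 for k-mers that are absent from a dataset.
    return {kmer: tuple(c[kmer] for c in per_dataset) for kmer in kmers}


def group_by_vector(index):
    """Group k-mers that share exactly the same abundance vector."""
    groups = defaultdict(list)
    for kmer, vector in index.items():
        groups[vector].append(kmer)
    return {vector: sorted(kmers) for vector, kmers in groups.items()}


# Hypothetical toy input: two datasets, each a list of reads.
datasets = [["ACGT"], ["ACGT", "GTAC"]]
index = abundance_index(datasets, k=3)
groups = group_by_vector(index)
```

A presence/absence query is then a lookup in `index` followed by a check for non-zero entries, and storing one vector per group rather than per k-mer is where the space saving in this simplified picture comes from.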

A space and time-efficient index for the compacted colored de Bruijn graph

10.1101/191874 ◽

2017 ◽

Cited By ~ 3

Author(s):

Fatemeh Almodaresi ◽

Hirak Sarkar ◽

Rob Patro

Keyword(s):

Data Structure ◽

Pattern Search ◽

De Bruijn Graph ◽

Existing Structures ◽

Link Type ◽

Reference Information ◽

De Bruijn ◽

Colored De Bruijn Graph ◽

Asymptotically Efficient ◽

Fast Access

Abstract. We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing, carefully organizing our data structure, and making use of succinct representations where applicable, our data structure provides practically fast k-mer lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme built on the same underlying representation, which provides the ability to trade off k-mer query speed for a reduction in the de Bruijn graph index size. We believe this representation strikes a desirable balance between speed and space usage, and it will allow for fast search on large reference sequences. Pufferfish is developed in C++11, is open source (GPL v3), and is available at https://github.com/COMBINE-lab/Pufferfish. The scripts used to generate the results in this manuscript are available at https://github.com/COMBINE-lab/pufferfish_experiments.

Memory efficient assembly of human genome

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720015500080 ◽

2015 ◽

Vol 13 (02) ◽

pp. 1550008

Author(s):

Farhad Hormozdiari ◽

Eleazar Eskin

Keyword(s):

High Throughput Sequencing ◽

Driving Forces ◽

Linear Time ◽

Real Data ◽

Chromosome 17 ◽

De Bruijn Graph ◽

Detection Problem ◽

Data Set ◽

De Bruijn ◽

Memory Efficient

The ability to detect the genetic variations between two individuals is an essential component of genetic studies. In these studies, obtaining the genome sequence of both individuals is the first step toward solving the variation detection problem. The emergence of high-throughput sequencing (HTS) technology has made DNA sequencing practical, and it is widely used by diagnosticians to increase their knowledge about the causal factors in genetic-related diseases. As HTS advances, more data are generated every day than scientists can process. Genome assembly is one of the existing methods to tackle the variation detection problem. The de Bruijn graph formulation of the assembly problem is widely used in the field. Furthermore, it is the only method which can assemble any genome in linear time. However, it requires an enormous amount of memory in order to assemble any mammalian-size genome. The high demand for sequencing more individuals and the urge to assemble them are the driving forces for a memory-efficient assembler. In this work, we propose a novel method which builds the de Bruijn graph while consuming less memory. Our proposed method can reduce the memory usage by 37% compared to the existing methods. In addition, we used a real data set (chromosome 17 of A/J strain) to illustrate the performance of our method.

A space and time-efficient index for the compacted colored de Bruijn graph

Bioinformatics ◽

10.1093/bioinformatics/bty292 ◽

2018 ◽

Vol 34 (13) ◽

pp. i169-i177 ◽

Cited By ~ 21

Author(s):

Fatemeh Almodaresi ◽

Hirak Sarkar ◽

Avi Srivastava ◽

Rob Patro

Keyword(s):

De Bruijn Graph ◽

Space And Time ◽

De Bruijn ◽

Colored De Bruijn Graph

Somatic variant analysis of linked-reads sequencing data with Lancet

10.1101/2020.07.04.158063 ◽

2020 ◽

Author(s):

Rajeeva Musunuri ◽

Kanika Arora ◽

André Corvelo ◽

Minita Shah ◽

Jennifer Shelton ◽

...

Keyword(s):

Open Source ◽

De Bruijn Graph ◽

Haplotype Structure ◽

Sequencing Data ◽

Somatic Variant ◽

Local Assembly ◽

De Bruijn ◽

Variant Analysis ◽

Colored De Bruijn Graph ◽

Commercial Research

Abstract. Summary: We present a new version of the popular somatic variant caller, Lancet, that supports the analysis of linked-reads sequencing data. By seamlessly integrating barcodes and haplotype read assignments within the colored De Bruijn graph local-assembly framework, Lancet computes a barcode-aware coverage and identifies variants that disagree with the local haplotype structure. Availability and implementation: Lancet is implemented in C++ and is available for academic and non-commercial research purposes as an open-source package at https://github.com/nygenome/[email protected]

Recoloring the Colored de Bruijn Graph

String Processing and Information Retrieval - Lecture Notes in Computer Science ◽

10.1007/978-3-030-00479-8_1 ◽

2018 ◽

pp. 1-11 ◽

Cited By ~ 6

Author(s):

Bahar Alipanahi ◽

Alan Kuhnle ◽

Christina Boucher

Keyword(s):

De Bruijn Graph ◽

De Bruijn ◽

Colored De Bruijn Graph
