Colored de Bruijn Graph: Latest Research Papers

A tri-tuple coordinate system derived for fast and accurate analysis of the colored de Bruijn graph-based pangenomes

BMC Bioinformatics ◽ 10.1186/s12859-021-04149-w ◽ 2021 ◽ Vol 22 (1)

Author(s): Jindan Guo ◽ Erli Pang ◽ Hongtao Song ◽ Kui Lin

Keyword(s): Coordinate System ◽ Directed Acyclic Graph ◽ Rapid Development ◽ Structural Complexity ◽ Genomic Diversity ◽ De Bruijn Graph ◽ Accurate Analysis ◽ Acyclic Graph ◽ Small Indels ◽ Colored De Bruijn Graph

Abstract. Background: With the rapid development of accurate sequencing and assembly technologies, an increasing number of high-quality chromosome-level and haplotype-resolved genome assemblies have been produced, which creates great opportunities for computational pangenomics. Although genome graphs are among the most useful models for pangenome representation, their structural complexity makes it difficult to present genome information as intuitively as a linear reference genome does. Thus, efficiently and accurately analyzing the spatial structure of the genome graph and coordinating the information remains a substantial challenge. Results: We developed a new method, a colored superbubble (cSupB), that can overcome the complexity of graphs and organize a set of species- or population-specific haplotype sequences of interest. Based on this model, we propose a tri-tuple coordinate system that combines an offset value, topological structure and sample information. Additionally, cSupB provides a novel method that utilizes complete topological information and efficiently detects small indels (< 50 bp) for highly similar samples, which can be validated by simulated datasets. Moreover, we demonstrated that cSupB can adapt to the complex cycle structure. Conclusions: Although the solution is made suitable for increasingly complex genome graphs by relaxing the constraint, the directed acyclic graph, the motif cSupB and the cSupB method can be extended to any colored directed acyclic graph. We anticipate that our method will facilitate the analysis of individual haplotype variants and population genomic diversity. We have developed a C++ program implementing our method, available at https://github.com/eggleader/cSupB.

An Incrementally Updatable and Scalable System for Large-Scale Sequence Search using LSM Trees

10.1101/2021.02.05.429839 ◽ 2021

Author(s): Fatemeh Almodaresi ◽ Jamshed Khan ◽ Sergey Madaminov ◽ Prashant Pandey ◽ Michael Ferdman ◽ ...

Keyword(s): Large Scale ◽ Supplementary Information ◽ De Bruijn Graph ◽ Sequencing Data ◽ Construction Time ◽ Graph Representations ◽ Sequence Search ◽ General Search ◽ Colored De Bruijn Graph ◽ Search Index

Abstract. Motivation: In the past few years, researchers have proposed numerous indexing schemes for searching large databases of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes (Mantis, VariMerge, and Bifrost) that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data. Results: In this paper, we show how to build a scalable and updatable exact sequence-search index. Specifically, we extend Mantis using the Bentley-Saxe transformation to support efficient updates. We demonstrate Mantis’s scalability by constructing an index of ≈ 40K samples from SRA by adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Mantis is more efficient in terms of index-construction time and memory, query time and memory, and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support the general search queries we require). Mantis indexes were about 2.5× smaller than Bifrost’s indexes and about half as big as VariMerge’s indexes. Availability: The updatable Mantis implementation is available at https://github.com/splatlab/mantis/tree/[email protected] Supplementary information: Supplementary data are available online.
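
The Bentley-Saxe transformation named in this abstract (also known as the logarithmic method) turns a static, rebuild-only structure into one that supports insertions. The following is a minimal sketch of that idea only, assuming a toy static index (a sorted Python list of strings) in place of a real Mantis index; the class `LogarithmicIndex` and its methods are hypothetical names for illustration and are not part of Mantis.

```python
from bisect import bisect_left
from typing import List, Optional


class LogarithmicIndex:
    """Toy Bentley-Saxe (logarithmic method) wrapper around a static index.

    The "static index" here is just an immutable sorted list; this is an
    assumption for illustration, not how Mantis stores its data.
    """

    def __init__(self) -> None:
        # levels[i] is either None or a sorted list of exactly 2**i items.
        self.levels: List[Optional[List[str]]] = []

    def insert(self, item: str) -> None:
        # Works like incrementing a binary counter: full levels are merged
        # into a carry that moves up until an empty level is found.
        carry = [item]
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # "Rebuild" a static structure from the two equal-sized blocks.
            carry = sorted(carry + self.levels[i])
            self.levels[i] = None
            i += 1

    def contains(self, item: str) -> bool:
        # A query consults every non-empty level (at most O(log n) of them).
        for level in self.levels:
            if level is not None:
                j = bisect_left(level, item)
                if j < len(level) and level[j] == item:
                    return True
        return False
```

After five insertions the structure holds one block of size 1 and one of size 4, mirroring the binary representation of 5. Each item takes part in at most a logarithmic number of rebuilds, which is what makes repeated insertion affordable when the underlying structure can only be built in bulk.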

Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

10.1101/2020.10.21.349605 ◽ 2020

Author(s): Jamshed Khan ◽ Rob Patro

Keyword(s): Large Scale ◽ De Bruijn Graph ◽ Comparative Genomic ◽ De Bruijn Graphs ◽ Long Reads ◽ Genomic Analyses ◽ Finite State ◽ De Bruijn ◽ Colored De Bruijn Graph ◽ Memory Compaction

Abstract. Motivation: The construction of the compacted de Bruijn graph from a large collection of reference genomes is a task of increasing interest in genomic analyses. For example, compacted colored reference de Bruijn graphs are increasingly used as sequence indices for the purposes of alignment of short and long reads. Also, as we sequence and assemble a greater diversity of individual genomes, the compacted colored de Bruijn graph can be used as the basis for methods aiming to perform comparative genomic analyses on these genomes. While algorithms have been developed to construct the compacted colored de Bruijn graph from reference sequences, there is still room for improvement, especially in the memory and the runtime performance as the number and the scale of the genomes over which the de Bruijn graph is built grow. Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the colored compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel modeling scheme of the de Bruijn graph vertices as finite-state automata, and constrains the state-space for the automata to enable tracking of their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that the algorithm scales much better than existing approaches, especially as the number and scale of the input references grow. For example, on a typical shared-memory machine, Cuttlefish constructed the compacted graph for 100 human genomes in less than 7 hours, using ~29 GB of memory; no other tested tool successfully completed this task on the testing hardware. We also applied Cuttlefish on 11 diverse conifer plant genomes, and the compacted graph was constructed in under 11 hours, using ~84 GB of memory, while the only other tested tool able to complete this compaction on our hardware took more than 16 hours and ~289 GB of memory. Availability: Cuttlefish is written in C++14, and is available under an open source license at https://github.com/COMBINE-lab/[email protected]
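
For orientation, the compaction task described in this abstract (collapsing each maximal non-branching path of k-mers into a single sequence, often called a unitig) can be sketched naively as follows. This is not Cuttlefish's automaton-based algorithm and makes no attempt at its memory efficiency. It assumes a forward-strand-only, node-centric graph in which an edge exists whenever two stored k-mers overlap by k-1 characters, and it ignores reverse complements. All function names are hypothetical.

```python
from typing import List, Set

ALPHABET = "ACGT"


def kmers_of(seq: str, k: int) -> Set[str]:
    """All k-mers of a sequence (forward strand only, a simplification)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def successors(kmer: str, kmers: Set[str]) -> List[str]:
    return [kmer[1:] + b for b in ALPHABET if kmer[1:] + b in kmers]


def predecessors(kmer: str, kmers: Set[str]) -> List[str]:
    return [b + kmer[:-1] for b in ALPHABET if b + kmer[:-1] in kmers]


def compact_unitigs(kmers: Set[str]) -> List[str]:
    """Naively merge maximal non-branching paths into unitig strings."""
    visited: Set[str] = set()
    unitigs: List[str] = []

    def starts_unitig(kmer: str) -> bool:
        # A k-mer starts a unitig unless it has exactly one predecessor
        # and that predecessor has exactly one successor.
        preds = predecessors(kmer, kmers)
        return len(preds) != 1 or len(successors(preds[0], kmers)) != 1

    def extend(kmer: str) -> str:
        path, cur = kmer, kmer
        visited.add(kmer)
        while True:
            succs = successors(cur, kmers)
            if len(succs) != 1:
                break
            nxt = succs[0]
            if nxt in visited or len(predecessors(nxt, kmers)) != 1:
                break
            path += nxt[-1]
            visited.add(nxt)
            cur = nxt
        return path

    for kmer in sorted(kmers):
        if kmer not in visited and starts_unitig(kmer):
            unitigs.append(extend(kmer))
    for kmer in sorted(kmers):  # leftovers lie on isolated cycles
        if kmer not in visited:
            unitigs.append(extend(kmer))
    return sorted(unitigs)
```

For the toy inputs `ACGTT` and `ACGAA` with k = 3, the shared k-mer `ACG` branches into two paths, so the sketch yields the three unitigs `ACG`, `CGAA` and `CGTT`. Holding every k-mer in an in-memory set, as done here, is exactly the cost that the tools in this listing are designed to avoid.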

Somatic variant analysis of linked-reads sequencing data with Lancet

Bioinformatics ◽ 10.1093/bioinformatics/btaa888 ◽ 2020

Author(s): Rajeeva Musunuri ◽ Kanika Arora ◽ André Corvelo ◽ Minita Shah ◽ Jennifer Shelton ◽ ...

Keyword(s): Supplementary Information ◽ De Bruijn Graph ◽ Haplotype Structure ◽ Sequencing Data ◽ Somatic Variant ◽ Local Assembly ◽ De Bruijn ◽ Variant Analysis ◽ Colored De Bruijn Graph ◽ Commercial Research

Abstract. Summary: We present a new version of the popular somatic variant caller, Lancet, that supports the analysis of linked-reads sequencing data. By seamlessly integrating barcodes and haplotype read assignments within the colored De Bruijn graph local-assembly framework, Lancet computes a barcode-aware coverage and identifies variants that disagree with the local haplotype structure. Availability and implementation: Lancet is implemented in C++ and available for academic and non-commercial research purposes as an open-source package at https://github.com/nygenome/lancet. Supplementary information: Supplementary data are available at Bioinformatics online.

Detecting High Scoring Local Alignments in Pangenome Graphs

10.1101/2020.09.03.280958 ◽ 2020

Author(s): Tizian Schulz ◽ Roland Wittler ◽ Sven Rahmann ◽ Faraz Hach ◽ Jens Stoye

Keyword(s): Sequence Similarity ◽ Query Sequence ◽ Heuristic Method ◽ De Bruijn Graph ◽ Local Alignment ◽ Memory Usage ◽ Sequence Comparisons ◽ De Bruijn Graphs ◽ De Bruijn ◽ Colored De Bruijn Graph

Abstract. Motivation: Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. Results: We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome.

Somatic variant analysis of linked-reads sequencing data with Lancet

10.1101/2020.07.04.158063 ◽ 2020

Author(s): Rajeeva Musunuri ◽ Kanika Arora ◽ André Corvelo ◽ Minita Shah ◽ Jennifer Shelton ◽ ...

Keyword(s): Open Source ◽ De Bruijn Graph ◽ Haplotype Structure ◽ Sequencing Data ◽ Somatic Variant ◽ Local Assembly ◽ De Bruijn ◽ Variant Analysis ◽ Colored De Bruijn Graph ◽ Commercial Research

Abstract. Summary: We present a new version of the popular somatic variant caller, Lancet, that supports the analysis of linked-reads sequencing data. By seamlessly integrating barcodes and haplotype read assignments within the colored De Bruijn graph local-assembly framework, Lancet computes a barcode-aware coverage and identifies variants that disagree with the local haplotype structure. Availability and implementation: Lancet is implemented in C++ and is available for academic and non-commercial research purposes as an open-source package at https://github.com/nygenome/[email protected]

Metagenome SNP calling via read-colored de Bruijn graphs

Bioinformatics ◽ 10.1093/bioinformatics/btaa081 ◽ 2020 ◽ Cited By ~ 1

Author(s): Bahar Alipanahi ◽ Martin D Muggli ◽ Musa Jundi ◽ Noelle R Noyes ◽ Christina Boucher

Keyword(s): Pathogenic Bacteria ◽ Read Length ◽ Supplementary Information ◽ De Bruijn Graph ◽ Nucleotide Polymorphisms ◽ Chromosomal Dna ◽ Shotgun Metagenomics ◽ De Bruijn Graphs ◽ De Bruijn ◽ Colored De Bruijn Graph

Abstract. Motivation: Metagenomics refers to the study of complex samples containing the genetic content of multiple individual organisms and, thus, has been used to elucidate the microbiome and resistome of a complex sample. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be effectively used to ‘fingerprint’ specific organisms and genes within the microbiome and resistome and trace their movement across various samples. However, to effectively use these SNPs for this traceability, a scalable and accurate metagenomics SNP caller is needed. Moreover, such an SNP caller should not be reliant on reference genomes since 95% of microbial species are unculturable, making the determination of a reference genome extremely challenging. In this article, we address this need. Results: We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari is able to identify SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%) as the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation, which span up to 97.8% of genes in datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets. Availability and implementation: Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari. Supplementary information: Supplementary data are available at Bioinformatics online.

Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs

10.1101/695338 ◽ 2019 ◽ Cited By ~ 14

Author(s): Guillaume Holley ◽ Páll Melsted

Keyword(s): High Throughput Sequencing ◽ Genomic Analysis ◽ Main Memory ◽ De Bruijn Graph ◽ Direct Construction ◽ De Bruijn Graphs ◽ Wide Range ◽ User Data ◽ De Bruijn ◽ Colored De Bruijn Graph

Abstract. Motivation: De Bruijn graphs are the core data structure for a wide range of assemblers and genome analysis software processing High Throughput Sequencing datasets. For population genomic analysis, the colored de Bruijn graph is often used in order to take advantage of the massive sets of sequenced genomes available for each species. However, memory consumption of tools based on the de Bruijn graph is often prohibitive, due to the high number of vertices, edges or colors in the graph. In order to process large and complex genomes, most short-read assemblers based on the de Bruijn graph paradigm reduce the assembly complexity and memory usage by first compacting all maximal non-branching paths of the graph into single vertices. Yet, de Bruijn graph compaction is challenging as it requires the uncompacted de Bruijn graph to be available in memory. Results: We present a new parallel and memory efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted de Bruijn graph. Bifrost features a broad range of functions such as sequence querying, storage of user data alongside vertices and graph editing that automatically preserve the compaction property. Bifrost makes full use of the dynamic index efficiency and proposes a graph coloring method efficiently mapping each k-mer of the graph to the set of genomes in which it occurs. Experimental results show that our algorithm is competitive with state-of-the-art de Bruijn graph compaction and coloring tools. Bifrost was able to build the colored and compacted de Bruijn graph of about 118,000 Salmonella genomes on a mid-class server in about 4 days using 103 GB of main memory. Availability: https://github.com/pmelsted/bifrost available with a BSD-2 [email protected]
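
The coloring idea stated in this abstract, mapping each k-mer to the set of genomes in which it occurs, can be illustrated with a small sketch. It assumes a plain in-memory dictionary, forward-strand k-mers only, and hypothetical function names; it does not use or reflect Bifrost's actual data structures or API.

```python
from collections import defaultdict
from typing import Dict, Set


def build_colored_kmers(genomes: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map every k-mer to the names ("colors") of the genomes containing it."""
    colors: Dict[str, Set[str]] = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)


def colors_of_query(query: str, index: Dict[str, Set[str]], k: int) -> Set[str]:
    """Genomes that contain every k-mer of the query (empty set if none)."""
    result = None
    for i in range(len(query) - k + 1):
        kmer_colors = index.get(query[i:i + k], set())
        result = set(kmer_colors) if result is None else result & kmer_colors
    return result or set()


# Toy data, invented for illustration only.
toy_index = build_colored_kmers({"g1": "ACGTT", "g2": "ACGAA"}, k=3)
```

In this toy index the k-mer `ACG` carries both colors, while `CGT` and `GTT` belong only to `g1`. A dictionary of Python sets grows quickly with the number of genomes, which is why practical tools store color sets in compressed form.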

Building large updatable colored de Bruijn graphs via merging

Bioinformatics ◽ 10.1093/bioinformatics/btz350 ◽ 2019 ◽ Vol 35 (14) ◽ pp. i51-i60 ◽ Cited By ~ 11

Author(s): Martin D Muggli ◽ Bahar Alipanahi ◽ Christina Boucher

Keyword(s): Bloom Filter ◽ Supplementary Information ◽ Metagenomic Data ◽ De Bruijn Graph ◽ De Bruijn Graphs ◽ Working Space ◽ Fold Reduction ◽ De Bruijn ◽ Colored De Bruijn Graph ◽ Public Datasets

Abstract. Motivation: There exist several large genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph were developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets, as both scalability and the ability to update the construction are needed. Results: We create a method for constructing the colored de Bruijn graph for large datasets that is based on partitioning the data into smaller datasets, building the colored de Bruijn graph using an FM-index-based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this article. We refer to the resulting method as VariMerge. This construction method also allows the graph to be updated with new data. We validate our approach and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8000 strains. Lastly, we compare VariMerge to other competing methods (including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method of Almodaresi et al. and Multi-BRWT) and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16 000 strains in a manner that allows it to be updated. Competing methods either did not scale to a dataset this large or do not allow for additions without reconstruction. Availability and implementation: VariMerge is available at https://github.com/cosmo-team/cosmo/tree/VARI-merge under GPLv3 license. Supplementary information: Supplementary data are available at Bioinformatics online.

An Index for Sequencing Reads Based on the Colored de Bruijn Graph

String Processing and Information Retrieval - Lecture Notes in Computer Science ◽ 10.1007/978-3-030-32686-9_22 ◽ 2019 ◽ pp. 304-321

Author(s): Diego Díaz-Domínguez

Keyword(s): De Bruijn Graph ◽ De Bruijn ◽ Colored De Bruijn Graph
