deGSM: memory scalable construction of large scale de Bruijn Graph

Mapping Intimacies ◽

10.1101/388454 ◽

2018 ◽

Cited By ~ 2

Author(s):

Hongzhe Guo ◽

Yilei Fu ◽

Yan Gao ◽

Junyi Li ◽

Yadong Wang ◽

...

Keyword(s):

Genome Sequence ◽

Large Scale ◽

High Throughput Sequencing ◽

De Novo ◽

Rapid Development ◽

Main Idea ◽

Supplementary Information ◽

De Bruijn Graph ◽

External Sorting ◽

De Bruijn

Abstract Motivation De Bruijn graph, a fundamental data structure to represent and organize genome sequence, plays important roles in various kinds of sequence analysis tasks such as de novo assembly, high-throughput sequencing (HTS) read alignment, pan-genome analysis, metagenomics analysis, HTS read correction, etc. With the rapid development of HTS data and the ever-increasing number of assembled genomes, there is a high demand to construct de Bruijn graphs for sequences up to the tera-base-pair level. This is non-trivial since the graph to be constructed could be very large, consisting of hundreds of billions of vertices and edges. Existing approaches may have unaffordable memory footprints when handling such a large de Bruijn graph. Moreover, the construction approach must handle very large datasets efficiently, even in a relatively small RAM space. Results We propose a lightweight parallel de Bruijn graph construction approach, de Bruijn Graph Constructor in Scalable Memory (deGSM). The main idea of deGSM is to efficiently construct the Burrows-Wheeler Transformation (BWT) of the unipaths of the de Bruijn graph in constant RAM space and transform the BWT into the original unitigs. It is mainly implemented by a fast parallel external sorting of k-mers, which allows only a part of the k-mers to be kept in RAM through a novel organization of the k-mers. The experimental results demonstrate that, with just a commonly used machine, deGSM is able to handle very large genome sequence(s), e.g., the contigs (305 Gbp) and scaffolds (1.1 Tbp) recorded in the GenBank database and the Picea abies HTS dataset (9.7 Tbp). Moreover, deGSM also has faster or comparable construction speed compared with state-of-the-art approaches. With its high scalability and efficiency, deGSM has enormous potential in many large scale genomics studies. Availability https://github.com/hitbc/[email protected] (YW) and [email protected] (BL) Supplementary information Supplementary data are available online.
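
The external-sorting idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is a generic external merge sort over k-mers, not deGSM's actual implementation: the abstract does not describe deGSM's k-mer organization or its BWT construction, so the function names (`sorted_distinct_kmers`, `de_bruijn_edges`), the text-file run format, and the `chunk_size` parameter are all assumptions made for illustration. Reverse complements are ignored.

```python
import heapq
import os
from itertools import groupby


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def _flush_run(chunk, tmpdir, index):
    """Sort one in-memory chunk and write it to disk as a sorted run."""
    path = os.path.join(tmpdir, f"run{index}.txt")
    with open(path, "w") as fh:
        fh.writelines(kmer + "\n" for kmer in sorted(chunk))
    return path


def sorted_distinct_kmers(seqs, k, chunk_size, tmpdir):
    """Return the distinct k-mers of seqs in sorted order.

    While runs are created, at most chunk_size k-mers are held in RAM.
    The final result is returned as a list only to keep this toy short;
    a real tool would stream the merged output instead.
    """
    runs, chunk = [], []
    for seq in seqs:
        for kmer in kmers(seq, k):
            chunk.append(kmer)
            if len(chunk) >= chunk_size:
                runs.append(_flush_run(chunk, tmpdir, len(runs)))
                chunk = []
    if chunk:
        runs.append(_flush_run(chunk, tmpdir, len(runs)))

    handles = [open(path) for path in runs]
    try:
        # k-way merge of the sorted runs; groupby drops duplicate k-mers.
        streams = [(line.rstrip("\n") for line in fh) for fh in handles]
        return [kmer for kmer, _ in groupby(heapq.merge(*streams))]
    finally:
        for fh in handles:
            fh.close()


def de_bruijn_edges(sorted_kmers):
    """Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    return [(kmer[:-1], kmer[1:]) for kmer in sorted_kmers]
```

Calling `sorted_distinct_kmers(["ACGTACG"], 3, 2, some_temp_dir)` splits the five k-mers into three sorted runs on disk and merges them back into the four distinct k-mers. The design point is that peak memory during run creation depends on `chunk_size` rather than on the input size.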

Inference of viral quasispecies with a paired de Bruijn graph

Bioinformatics ◽

10.1093/bioinformatics/btaa782 ◽

2020 ◽

Author(s):

Borja Freire ◽

Susana Ladra ◽

Jose R Paramá ◽

Leena Salmela

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

De Bruijn Graph ◽

Viral Quasispecies ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

High Throughput Sequencing Data ◽

De Bruijn

Abstract Motivation RNA viruses exhibit a high mutation rate and thus they exist in infected cells as a population of closely related strains called viral quasispecies. The viral quasispecies assembly problem asks to characterize the quasispecies present in a sample from high-throughput sequencing data. We study the de novo version of the problem, where reference sequences of the quasispecies are not available. Current methods for assembling viral quasispecies are either based on overlap graphs or on de Bruijn graphs. Overlap graph-based methods tend to be accurate but slow, whereas de Bruijn graph-based methods are fast but less accurate. Results We present viaDBG, which is a fast and accurate de Bruijn graph-based tool for de novo assembly of viral quasispecies. We first iteratively correct sequencing errors in the reads, which allows us to use large k-mers in the de Bruijn graph. To incorporate the paired-end information in the graph, we also adapt the paired de Bruijn graph for viral quasispecies assembly. These features enable the use of long-range information in contig construction without compromising the speed of de Bruijn graph-based approaches. Our experimental results show that viaDBG is both accurate and fast, whereas previous methods are either fast or accurate but not both. In particular, viaDBG has comparable or better accuracy than SAVAGE, while being at least nine times faster. Furthermore, the speed of viaDBG is comparable to PEHaplo, but viaDBG is also able to retrieve low-abundance quasispecies, which are often missed by PEHaplo. Availability and implementation viaDBG is implemented in C++ and it is publicly available at https://bitbucket.org/bfreirec1/viadbg. All datasets used in this article are publicly available at https://bitbucket.org/bfreirec1/data-viadbg/. Supplementary information Supplementary data are available at Bioinformatics online.

Succinct Dynamic de Bruijn Graphs

Bioinformatics ◽

10.1093/bioinformatics/btaa546 ◽

2020 ◽

Author(s):

Bahar Alipanahi ◽

Alan Kuhnle ◽

Simon J Puglisi ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Data Structures ◽

Large Scale ◽

High Throughput Sequencing ◽

Supplementary Information ◽

De Bruijn Graph ◽

Sequencing Data ◽

Efficient Manner ◽

De Bruijn Graphs ◽

High Throughput Sequencing Data ◽

De Bruijn

Abstract Motivation The de Bruijn graph is one of the fundamental data structures for analysis of high throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time- efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction e.g. add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work in building the graph in an efficient and mutable manner. Hence, most space efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. Results In this paper we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g. greater than 15 billion k-mers). Competing dynamic methods e.g., FDBG (Crawford et al., 2018) cannot be constructed on large scale datasets, or cannot support both addition and deletion e.g., BiFrost (Holley and Melsted, 2019). Availability DynamicBOSS is publicly available at https://github.com/baharpan/dynboss. Supplementary information Supplementary data are available at Bioinformatics online.

A parallel computational framework for ultra-large-scale sequence clustering analysis

Bioinformatics ◽

10.1093/bioinformatics/bty617 ◽

2018 ◽

Vol 35 (3) ◽

pp. 380-388 ◽

Cited By ~ 2

Author(s):

Wei Zheng ◽

Qi Mao ◽

Robert J Genco ◽

Jean Wactawski-Wende ◽

Michael Buck ◽

...

Keyword(s):

Parallel Computing ◽

High Performance ◽

Large Scale ◽

De Novo ◽

Rapid Development ◽

Operational Taxonomic Unit ◽

Supplementary Information ◽

Computational Framework ◽

Speed Up ◽

Scale Sequence

Abstract Motivation The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step to be performed in sequence analysis. However, existing methods scale poorly with respect to the unprecedented growth of input data size. As high-performance computing systems are becoming widely accessible, it is highly desired that a clustering method can easily scale to handle large-scale sequence datasets by leveraging the power of parallel computing. Results In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. The proposed framework was implemented on Apache Spark, which allows for easy and efficient utilization of parallel computing resources. Experiments performed on various datasets demonstrated that SLAD can significantly speed up a number of popular de novo OTU picking methods and meanwhile maintains the same level of accuracy. In particular, the experiment on the Earth Microbiome Project dataset (∼2.2B reads, 437 GB) demonstrated the excellent scalability of the proposed method. Availability and implementation Open-source software for the proposed method is freely available at https://www.acsu.buffalo.edu/~yijunsun/lab/SLAD.html. Supplementary information Supplementary data are available at Bioinformatics online.

Faucet: streaming de novo assembly graph construction

10.1101/125658 ◽

2017 ◽

Author(s):

Roye Rozov ◽

Gil Goldshlager ◽

Eran Halperin ◽

Ron Shamir

Keyword(s):

Resource Use ◽

De Novo ◽

State Of The Art ◽

Supplementary Information ◽

De Bruijn Graph ◽

Assembly Quality ◽

Metagenome Assembly ◽

Streaming Algorithm ◽

Supplementary Material ◽

De Bruijn

Abstract Motivation We present Faucet, a 2-pass streaming algorithm for assembly graph construction. Faucet builds an assembly graph incrementally as each read is processed. Thus, reads need not be stored locally, as they can be processed while downloading data and then discarded. We demonstrate this functionality by performing streaming graph assembly of publicly available data, and observe that the ratio of disk use to raw data size decreases as coverage is increased. Results Faucet pairs the de Bruijn graph obtained from the reads with additional metadata derived from them. We show these metadata - coverage counts collected at junction k-mers and connections bridging between junction pairs - contain most salient information needed for assembly, and demonstrate they enable cleaning of metagenome assembly graphs, greatly improving contiguity while maintaining accuracy. We compared Faucet’s resource use and assembly quality to state of the art metagenome assemblers, as well as leading resource-efficient genome assemblers. Faucet used orders of magnitude less time and disk space than the specialized metagenome assemblers MetaSPAdes and Megahit, while also improving on their memory use; this broadly matched performance of other assemblers optimizing resource efficiency - namely, Minia and LightAssembler. However, on metagenomes tested, Faucet’s outputs had 14-110% higher mean NGA50 lengths compared to Minia, and 2-11-fold higher mean NGA50 lengths compared to LightAssembler, the only other streaming assembler available. Availability Faucet is available at https://github.com/Shamir-Lab/[email protected],[email protected] information: Supplementary data are available at Bioinformatics online.

BubbleGun: Enumerating Bubbles and Superbubbles in Genome Graphs

10.1101/2021.03.23.436631 ◽

2021 ◽

Author(s):

Fawaz Dabbaghie ◽

Jana Ebler ◽

Tobias Marschall

Keyword(s):

De Novo ◽

General Purpose ◽

Supplementary Information ◽

De Bruijn Graph ◽

De Bruijn Graphs ◽

Third Generation Sequencing ◽

Human Sample ◽

Fast Development ◽

De Bruijn ◽

Generation Sequencing

Abstract Motivation With the fast development of third generation sequencing machines, de novo genome assembly is becoming routine even for larger genomes. Graph-based representations of genomes arise both as part of the assembly process and in the context of pangenomes representing a population. In both cases, polymorphic loci lead to bubble structures in such graphs. Detecting bubbles is hence an important task when working with genomic variants in the context of genome graphs. Results Here, we present a fast general-purpose tool, called BubbleGun, for detecting bubbles and superbubbles in genome graphs. Furthermore, BubbleGun detects and outputs runs of linearly connected bubbles and superbubbles, which we call bubble chains. We showcase its utility on de Bruijn graphs and compare our results to vg’s snarl detection. We show that BubbleGun is considerably faster than vg, especially in bigger graphs, where it reports all bubbles in less than 30 minutes on a human sample de Bruijn graph of around 2 million nodes. Availability BubbleGun is available and documented at https://github.com/fawaz-dabbaghieh/bubble_gun under MIT [email protected] or [email protected] information Supplementary data are available at Bioinformatics online.
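
To make the notion of a bubble concrete, here is a minimal Python sketch that finds only the simplest case: a source node with exactly two successors, each with a single predecessor and a single successor, that rejoin at a common sink. It is not BubbleGun's algorithm or API. The function name `find_simple_bubbles`, the edge-list input, and the restriction to two-branch bubbles are assumptions made for illustration; superbubbles and bubble chains are not handled.

```python
def find_simple_bubbles(edges):
    """Return (source, (branch_a, branch_b), sink) for each simple bubble.

    edges is an iterable of (u, v) pairs describing a directed graph
    whose node labels can be sorted (e.g. strings).
    """
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)

    bubbles = []
    for source in sorted(succ):
        if len(succ[source]) != 2:
            continue  # a simple bubble opens with exactly two branches
        a, b = sorted(succ[source])
        # Each branch node must have one way in and one way out.
        if any(len(pred.get(x, ())) != 1 or len(succ.get(x, ())) != 1
               for x in (a, b)):
            continue
        (sink_a,) = succ[a]
        (sink_b,) = succ[b]
        if sink_a == sink_b and sink_a != source:
            bubbles.append((source, (a, b), sink_a))
    return bubbles
```

For the edge list `[("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]` the two branches `a` and `b` leave `s` and rejoin at `t`, which is the pattern a single polymorphic site typically produces in a genome graph.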

NxRepair: Error correction in de novo sequence assembly using Nextera mate pairs

10.7287/peerj.preprints.747v1 ◽

2014 ◽

Author(s):

Rebecca R Murphy ◽

Jared M O'Connell ◽

Anthony J Cox ◽

Ole B Schulz-Trieglaff

Keyword(s):

Error Correction ◽

Large Scale ◽

De Novo ◽

Reference Sequence ◽

De Bruijn Graph ◽

Sequencing Data ◽

Additional Information ◽

Mate Pair ◽

De Bruijn ◽

De Novo Sequence Assembly

Scaffolding errors and incorrect traversals of the de Bruijn graph during de novo assembly can result in large scale misassemblies in draft genomes. Nextera mate pair sequencing data provide additional information to resolve assembly ambiguities during scaffolding. Here, we introduce NxRepair, an open source toolkit for error correction in de novo assemblies that uses Nextera mate pair libraries to identify and correct large-scale errors. We show that NxRepair can identify and correct large scaffolding errors, without use of a reference sequence, resulting in quantitative improvements in the assembly quality. NxRepair can be downloaded from GitHub; a tutorial and user documentation are also available.

Succinct Dynamic de Bruijn Graphs

10.1101/2020.04.01.018481 ◽

2020 ◽

Cited By ~ 1

Author(s):

Bahar Alipanahi ◽

Alan Kuhnle ◽

Simon J. Puglisi ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Data Structures ◽

Large Scale ◽

High Throughput Sequencing ◽

De Bruijn Graph ◽

Sequencing Data ◽

Efficient Manner ◽

De Bruijn Graphs ◽

High Throughput Sequencing Data ◽

Efficient Data ◽

De Bruijn

Abstract Motivation The de Bruijn graph is one of the fundamental data structures for analysis of high throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction e.g. add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work in building the graph in an efficient and mutable manner. Hence, most space efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. Results In this paper we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g. greater than 15 billion k-mers). Competing dynamic methods e.g., FDBG (Crawford et al., 2018) cannot be constructed on large scale datasets, or cannot support both addition and deletion e.g., BiFrost (Holley and Melsted, 2019). Availability DynamicBOSS is publicly available at https://github.com/baharpan/[email protected]

NxRepair: Error correction in de novo sequence assembly using Nextera mate pairs

10.7287/peerj.preprints.747 ◽

2014 ◽

Author(s):

Rebecca R Murphy ◽

Jared M O'Connell ◽

Anthony J Cox ◽

Ole B Schulz-Trieglaff

Keyword(s):

Error Correction ◽

Large Scale ◽

De Novo ◽

Reference Sequence ◽

De Bruijn Graph ◽

Sequencing Data ◽

Additional Information ◽

Mate Pair ◽

De Bruijn ◽

De Novo Sequence Assembly

Quick and efficient approach to develop genomic resources in orphan species: Application in Lavandula angustifolia

PLoS ONE ◽

10.1371/journal.pone.0243853 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0243853

Author(s):

Berline Fopa Fomeju ◽

Dominique Brunel ◽

Aurélie Bérard ◽

Jean-Baptiste Rivoal ◽

Philippe Gallois ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Rapid Development ◽

Genetic Distances ◽

Lavandula Angustifolia ◽

Distance Analysis ◽

Alternative Medicines ◽

Dna And Rna ◽

Snp Development ◽

High Level

Next-Generation Sequencing (NGS) technologies, by reducing the cost and increasing the throughput of sequencing, have opened doors to generate genomic data in a range of previously poorly studied species. In this study, we propose a method for the rapid development of large-scale molecular resources for orphan species. We studied as an example the true lavender (Lavandula angustifolia Mill.), a perennial sub-shrub plant native to the Mediterranean region whose essential oil has numerous applications in cosmetics, pharmaceuticals, and alternative medicines. The heterozygous clone “Maillette” was used as a reference for DNA and RNA sequencing. We first built a reference Unigene, composed of coding sequences, by de novo RNA-seq assembly. Then, we reconstructed the complete gene sequences (with introns and exons) using a Unigene-guided DNA-seq assembly approach. This aimed to maximize the possibilities of finding polymorphism between genetically close individuals despite the lack of a reference genome. Finally, we used these resources for SNP mining within a collection of 16 commercial lavender clones and tested the SNPs within the scope of a genetic distance analysis. We obtained a cleaned reference of 8,030 functionally in silico annotated genes. We found 359K polymorphic sites and observed a high SNP frequency (mean of 1 SNP per 90 bp) and a high level of heterozygosity (more than 60% of heterozygous SNPs per genotype). Overall, we found similar genetic distances between pairs of clones, which is probably related to the out-crossing nature of the species and the restricted area of cultivation. The proposed method is transferable to other orphan species, requires little bioinformatics resources and can be realized within a year. This is also the first reported large-scale SNP development on Lavandula angustifolia. All the genomics resources developed herein are publicly available and provide a rich pool of molecular resources to explore and exploit lavender genetic diversity in breeding programs.

Gene Annotation and Transcriptome Delineation on a De Novo Genome Assembly for the Reference Leishmania major Friedlin Strain

Genes ◽

10.3390/genes12091359 ◽

2021 ◽

Vol 12 (9) ◽

pp. 1359

Author(s):

Esther Camacho ◽

Sandra González-de la Fuente ◽

Jose C. Solana ◽

Alberto Rastrojo ◽

Fernando Carrasco-Ramiro ◽

...

Keyword(s):

Genome Sequence ◽

Genome Assembly ◽

Molecular Mechanisms ◽

High Throughput Sequencing ◽

Leishmania Major ◽

De Novo ◽

Gene Annotation ◽

Leishmania Species ◽

De Novo Genome Assembly ◽

Sequencing Platforms

Leishmania major is the main causative agent of cutaneous leishmaniasis in humans. The Friedlin strain of this species (LmjF) was chosen when a multi-laboratory consortium undertook the objective of deciphering the first genome sequence for a parasite of the genus Leishmania. The objective was successfully attained in 2005, and this represented a milestone for Leishmania molecular biology studies around the world. Although the LmjF genome sequence was obtained following a shotgun strategy and using classical Sanger sequencing, the results were excellent, and this genome assembly served as the reference for subsequent genome assemblies in other Leishmania species. Here, we present a new assembly for the genome of this strain (named LMJFC for clarity), generated by the combination of two high throughput sequencing platforms, Illumina short-read sequencing and PacBio Single Molecule Real-Time (SMRT) sequencing, which provides long-read sequences. Apart from resolving uncertain nucleotide positions, several genomic regions were reorganized and a more precise composition of tandemly repeated gene loci was attained. Additionally, the genome annotation was improved by adding 542 genes and defining more accurate coding sequences for around two hundred genes, based on the transcriptome delimitation also carried out in this work. As a result, we are providing gene models (including untranslated regions and introns) for 11,238 genes. Genomic information ultimately determines the biology of every organism; therefore, our understanding of molecular mechanisms will depend on the availability of precise genome sequences and accurate gene annotations. In this regard, this work is providing an improved genome sequence and updated transcriptome annotations for the reference L. major Friedlin strain.
