Scalable multiple whole-genome alignment and locally collinear block construction with SibeliaZ

Nature Communications ◽

10.1038/s41467-020-19777-8 ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Ilia Minkin ◽

Paul Medvedev

Keyword(s):

Single Machine ◽

De Bruijn Graph ◽

Genome Alignment ◽

Whole Genome ◽

Reconstruction Algorithms ◽

De Bruijn Graphs ◽

Significant Step ◽

De Bruijn ◽

Whole Genome Alignment ◽

Computational Resources

Abstract:

Multiple whole-genome alignment is a challenging problem in bioinformatics. Despite many successes, current methods are not able to keep up with the growing number, length, and complexity of assembled genomes, especially when computational resources are limited. Approaches based on compacted de Bruijn graphs to identify and extend anchors into locally collinear blocks have potential for scalability, but current methods do not scale to mammalian genomes. We present an algorithm, SibeliaZ-LCB, for identifying collinear blocks in closely related genomes based on analysis of the de Bruijn graph. We further incorporate this into a multiple whole-genome alignment pipeline called SibeliaZ. SibeliaZ shows run-time improvements over other methods while maintaining accuracy. On sixteen recently assembled strains of mice, SibeliaZ runs in under 16 hours on a single machine, while other tools did not run to completion for eight mice within a week. SibeliaZ makes a significant step towards improving scalability of multiple whole-genome alignment and collinear block reconstruction algorithms on a single machine.
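To make the idea of anchors concrete, the following is a minimal Python sketch that finds k-mers shared by every input genome, which is the kind of seed that graph-based methods extend into collinear blocks. It is not the SibeliaZ-LCB algorithm: the function names, the toy sequences, and the choice to ignore reverse complements are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield (position, k-mer) pairs for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def shared_anchors(genomes, k):
    """Map each k-mer present in every genome to its positions per genome.

    Hypothetical helper: forward strand only, no graph compaction.
    """
    occurrences = defaultdict(lambda: defaultdict(list))
    for name, seq in genomes.items():
        for pos, kmer in kmers(seq, k):
            occurrences[kmer][name].append(pos)
    return {
        kmer: dict(hits)
        for kmer, hits in occurrences.items()
        if len(hits) == len(genomes)
    }


# Toy input (invented for this example, not real genomic data).
toy_genomes = {"g1": "ACGTACGGA", "g2": "TTACGGAC"}
anchors = shared_anchors(toy_genomes, k=4)
for kmer, hits in sorted(anchors.items()):
    print(kmer, hits)
```

In this toy case the shared k-mers occur at a constant offset between the two sequences, which is the collinearity signal that a block-construction step would look for.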

Panaconda: Application of pan-synteny graph models to genome content analysis

10.1101/215988 ◽

2017 ◽

Cited By ~ 1

Author(s):

Andrew S. Warren ◽

James J. Davis ◽

Alice R. Wattam ◽

Dustin Machi ◽

João C. Setubal ◽

...

Keyword(s):

Gene Families ◽

De Bruijn Graph ◽

Genome Alignment ◽

Whole Genome ◽

Sequence Comparisons ◽

Multiple Sequence ◽

Link Type ◽

De Bruijn ◽

Similarities And Differences ◽

Whole Genome Alignment

Abstract:

Motivation: Whole-genome alignment and pan-genome analysis are useful tools for understanding the similarities and differences of many genomes in an evolutionary context. Here we introduce the concept of pan-synteny graphs, an analysis method that combines elements of both to represent conservation and change of multiple prokaryotic genomes at an architectural level. Pan-synteny graphs represent a reference-free approach for the comparison of many genomes and allow for the identification of synteny, insertion, deletion, replacement, inversion, recombination, missed assembly joins, evolutionary hotspots, and reference-based scaffolding.

Results: We present an algorithm for creating whole-genome multiple sequence comparisons and a model for representing the similarities and differences among sequences as a graph of syntenic gene families. As part of the pan-synteny graph creation, we first create a de Bruijn graph. Instead of the alphabet of nucleotides commonly used in genome assembly, we use an alphabet of gene families. This de Bruijn graph is then processed to create the pan-synteny graph. Our approach is novel in that it explicitly controls how regions from the same sequence and genome are aligned and generates a graph in which all sequences are fully represented as paths. This method harnesses previous computation involved in protein family calculation to speed up the creation of whole-genome alignment for many genomes. We provide the software suite Panaconda for the calculation of pan-synteny graphs given annotation input, and an implementation of methods for their layout and visualization.

Availability: Panaconda is available at https://github.com/aswarren/pangenome_graphs and datasets used in examples are available at https://github.com/aswarren/pangenome_examples

Contact: Andrew Warren [email protected]
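The Results above describe a de Bruijn graph built over an alphabet of gene families rather than nucleotides. A minimal Python sketch of that first step might look as follows. It is not Panaconda's implementation: the function name, the family identifiers, and the simplification of ignoring gene orientation are assumptions, and the later processing into a pan-synteny graph is not shown.

```python
from collections import defaultdict


def family_de_bruijn(genomes, k):
    """Build a de Bruijn graph whose symbols are gene-family IDs.

    Nodes are k-tuples of family IDs. An edge joins two consecutive
    k-tuples and records which genomes traverse it.
    Hypothetical helper: gene orientation is ignored.
    """
    edges = defaultdict(set)
    for name, families in genomes.items():
        for i in range(len(families) - k):
            src = tuple(families[i:i + k])
            dst = tuple(families[i + 1:i + k + 1])
            edges[(src, dst)].add(name)
    return edges


# Toy annotation: each genome is an ordered list of gene-family IDs
# (invented identifiers).
toy = {
    "genomeA": ["fam1", "fam2", "fam3", "fam4"],
    "genomeB": ["fam1", "fam2", "fam3", "fam5"],
}
graph = family_de_bruijn(toy, k=2)
for (src, dst), members in sorted(graph.items()):
    print(src, "->", dst, sorted(members))
```

Edges traversed by both genomes correspond to conserved gene order, while edges carried by a single genome mark where the genomes diverge.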

Discrimination of hospital isolates of Acinetobacter baumannii using repeated sequences and whole genome alignment differential analysis

Journal of Applied Genetics ◽

10.1007/s13353-021-00640-5 ◽

2021 ◽

Author(s):

Roman Kotłowski ◽

Alicja Nowak-Zaleska ◽

Grzegorz Węgrzyn

Keyword(s):

Acinetobacter Baumannii ◽

Time Frame ◽

Repeated Sequences ◽

Hospital Environment ◽

Genome Alignment ◽

Whole Genome ◽

Differential Analysis ◽

Gene Encoding ◽

Resistance Patterns ◽

Whole Genome Alignment

Abstract:

An optimized method for bacterial strain differentiation, based on a combination of Repeated Sequences and Whole Genome Alignment Differential Analysis (RS&WGADA), is presented in this report. In this analysis, 51 multidrug-resistant Acinetobacter baumannii strains from one hospital environment and patients from 14 hospital wards were classified on the basis of polymorphisms of repeated sequences located in the CRISPR region, variation in the gene encoding the EmrA-homologue of E. coli, and antibiotic resistance patterns, in combination with three newly identified polymorphic regions in the genomes of A. baumannii clinical isolates. Differential analysis of two similarity matrices between different genotypes and resistance patterns made it possible to distinguish three significant correlations (p < 0.05) between a 172 bp DNA insertion combined with resistance to chloramphenicol and gentamycin. Interestingly, 45 and 55 bp DNA insertions within the CRISPR region were identified, and combined during analyses with resistance/susceptibility to trimethoprim/sulfamethoxazole. Moreover, 184 or 1374 bp DNA length polymorphisms in the genomic region located upstream of the GTP cyclohydrolase I gene, associated mainly with imipenem susceptibility, were identified. In addition, considerable nucleotide polymorphism of the gene encoding the gamma/tau subunit of DNA polymerase III, an enzyme crucial for bacterial DNA replication, was discovered. The differentiation analysis performed using the above-described approach allowed us to monitor the distribution of A. baumannii isolates in different wards of the hospital over a time frame of several years, indicating that the optimized method may be useful in hospital epidemiological studies, particularly in identifying the source of primary infections.

MUM&Co: accurate detection of all SV types through whole-genome alignment

Bioinformatics ◽

10.1093/bioinformatics/btaa115 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3242-3243 ◽

Cited By ~ 2

Author(s):

Samuel O’Donnell ◽

Gilles Fischer

Keyword(s):

De Novo ◽

Supplementary Information ◽

Genome Alignment ◽

Whole Genome ◽

Structural Variations ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Human Genomes ◽

Whole Genome Alignment ◽

Primary Output

Abstract:

Summary: MUM&Co is a single bash script to detect structural variations (SVs) utilizing whole-genome alignment (WGA). Using MUMmer’s nucmer alignment, MUM&Co can detect insertions, deletions, tandem duplications, inversions and translocations greater than 50 bp. Its versatility depends upon the WGA and therefore benefits from contiguous de-novo assemblies generated by third generation sequencing technologies. Benchmarked against five WGA SV-calling tools, MUM&Co outperforms all tools on simulated SVs in yeast, plant and human genomes and performs similarly in two real human datasets. Additionally, MUM&Co is particularly unique in its ability to find inversions in both simulated and real datasets. Lastly, MUM&Co’s primary output is an intuitive tabulated file containing a list of SVs with only necessary genomic details.

Availability and implementation: https://github.com/SAMtoBAM/MUMandCo.

Supplementary information: Supplementary data are available at Bioinformatics online.

Whole Genome Alignment with BLAST on Grid Environment

The Sixth IEEE International Conference on Computer and Information Technology (CIT'06) ◽

10.1109/cit.2006.196 ◽

2006 ◽

Author(s):

Min-sung Kim ◽

Choong-hyun Sun ◽

Jin-ki Kim ◽

Gwan-su Yi

Keyword(s):

Genome Alignment ◽

Whole Genome ◽

Grid Environment ◽

Whole Genome Alignment

Efficient Algorithms for Optimizing Whole Genome Alignment with Noise

Algorithms and Computation - Lecture Notes in Computer Science ◽

10.1007/978-3-540-24587-2_38 ◽

2003 ◽

pp. 364-374 ◽

Cited By ~ 1

Author(s):

T. W. Lam ◽

N. Lu ◽

H. F. Ting ◽

Prudence W. H. Wong ◽

S. M. Yiu

Keyword(s):

Efficient Algorithms ◽

Genome Alignment ◽

Whole Genome ◽

Whole Genome Alignment

Whole Genome Alignment

Statistics for Bioinformatics ◽

10.1016/b978-1-78548-216-8.50008-7 ◽

2016 ◽

pp. 75-86

Author(s):

Julie Dawn Thompson

Keyword(s):

Genome Alignment ◽

Whole Genome ◽

Whole Genome Alignment

ALLOWING MISMATCHES IN ANCHORS FOR WHOLE GENOME ALIGNMENT: GENERATION AND EFFECTIVENESS

Proceedings of the 3rd Asia-Pacific Bioinformatics Conference ◽

10.1142/9781860947322_0001 ◽

2005 ◽

Author(s):

SM YIU ◽

PY CHAN ◽

TW LAM ◽

WK SUNG ◽

HF TING ◽

...

Keyword(s):

Genome Alignment ◽

Whole Genome ◽

Whole Genome Alignment

Scalable Genome Assembly through Parallel de Bruijn Graph Construction for Multiple k-mers

Scientific Reports ◽

10.1038/s41598-019-51284-9 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 1

Author(s):

Kanak Mahadik ◽

Christopher Wright ◽

Milind Kulkarni ◽

Saurabh Bagchi ◽

Somali Chaterji

Keyword(s):

De Novo ◽

De Bruijn Graph ◽

High Quality ◽

De Bruijn Graphs ◽

Sequencing Technologies ◽

De Bruijn ◽

Similar Accuracy ◽

Valued Graph ◽

Assembly Algorithms ◽

Level Parallelism

Abstract:

Remarkable advancements in high-throughput gene sequencing technologies have led to an exponential growth in the number of sequenced genomes. However, the unavailability of highly parallel and scalable de novo assembly algorithms has hindered biologists attempting to swiftly assemble high-quality complex genomes. Popular de Bruijn graph assemblers, such as IDBA-UD, generate high-quality assemblies by iterating over a set of k-values used in the construction of de Bruijn graphs (DBG). However, this process of sequentially iterating from small to large k-values slows down the process of assembly. In this paper, we propose ScalaDBG, which transforms this sequential process by building DBGs for each distinct k-value in parallel. We develop an innovative mechanism to “patch” a higher k-valued graph with contigs generated from a lower k-valued graph. Moreover, ScalaDBG leverages multi-level parallelism, by both scaling up on all cores of a node, and scaling out to multiple nodes simultaneously. We demonstrate that ScalaDBG completes assembling the genome faster than IDBA-UD, but with similar accuracy on a variety of datasets (6.8X faster for one of the most complex genomes in our dataset).
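The key observation in this abstract is that graph construction for different k-values can proceed independently. The Python sketch below illustrates only that independence, using a thread pool to build one toy de Bruijn graph per k-value. It is not ScalaDBG: the function names and reads are invented, the patching step is not shown, and Python threads are used here purely to show the structure, so no real speedup should be expected from this version.

```python
from concurrent.futures import ThreadPoolExecutor


def build_dbg(reads, k):
    """Return the edge set of a de Bruijn graph: (k-mer, next k-mer) pairs."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    return edges


def build_all(reads, k_values):
    """Build one graph per k-value; each build is independent of the others."""
    with ThreadPoolExecutor() as pool:
        graphs = pool.map(lambda k: build_dbg(reads, k), k_values)
        # Consume the results inside the with-block, in k_values order.
        return dict(zip(k_values, graphs))


# Toy reads (invented for this example).
toy_reads = ["ACGTAC", "CGTACG"]
graphs = build_all(toy_reads, [3, 4])
for k, edges in graphs.items():
    print(k, len(edges))
```

A process pool or a multi-node scheduler would be the natural replacement for the thread pool when actual parallel speedup is the goal.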

Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields

BMC Bioinformatics ◽

10.1186/s12859-020-03740-x ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Aranka Steyaert ◽

Pieter Audenaert ◽

Jan Fostier

Keyword(s):

Genomic Sequence ◽

Conditional Random Field ◽

Accurate Determination ◽

Next Generation Sequencing Data ◽

De Bruijn Graph ◽

Sequencing Data ◽

De Bruijn Graphs ◽

Sequencing Errors ◽

Expectation Maximisation ◽

De Bruijn

Abstract:

Background: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence, also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data.

Results: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner.

Conclusions: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than existing methods. A C++11 implementation is available at https://github.com/biointec/detox under the GNU AGPL v3.0 license.
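The definition of multiplicity given in the Background can be illustrated directly when the genomic sequence is known. The Python sketch below counts how often each k-mer (node) and each (k+1)-mer (arc) occurs in a toy sequence. This is the ground truth that the paper's CRF model tries to infer from read coverage, not the inference method itself; the function name and the sequence are assumptions, and only the forward strand is considered.

```python
from collections import Counter


def multiplicities(genome, k):
    """Count true node (k-mer) and arc ((k+1)-mer) multiplicities.

    Hypothetical helper: forward strand only, linear sequence.
    """
    nodes = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    arcs = Counter(genome[i:i + k + 1] for i in range(len(genome) - k))
    return nodes, arcs


# Toy sequence with one repeated 3-mer (invented for this example).
nodes, arcs = multiplicities("ACGACGT", k=3)
print(nodes)
print(arcs)
```

A k-mer observed in the reads but absent from these counts would have multiplicity 0, which is how a sequencing error appears under this definition.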
