Efficient Merging of Genome Profile Alignments

Mapping Intimacies ◽

10.1101/309047 ◽

2018 ◽

Author(s):

André Hennig ◽

Kay Nieselt

Keyword(s):

Data Structure ◽

Parallel Computation ◽

Large Scale ◽

Divide And Conquer ◽

Data Sets ◽

Whole Genome ◽

Multiple Sequence ◽

Construction Methods ◽

Current Implementation ◽

Whole Genomes

Motivation: Whole-genome alignment methods show insufficient scalability towards the generation of large-scale whole-genome alignments (WGAs). Profile alignment-based approaches revolutionized the field of multiple sequence alignment construction by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which makes the profile-based extension of several whole genomes challenging. Currently, none of the available methods offers the possibility to align or extend WGA profiles.

Results: Here, we present GPA, an approach that aligns the profiles of WGAs and is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined into a full WGA with a divide and conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profile alignment. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve (Darling et al., 2010) and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial data sets of up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool.

Availability: GPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of WGAs.
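The bidirectional mapping between sequence and alignment coordinates mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the SuperGenome implementation (GPA itself is written in Java): the class name `CoordinateMap`, the helper `transfer`, the 0-based positions and the single gapped row per genome are all assumptions made for illustration, and genomic rearrangements, which the real data structure has to handle, are ignored here.

```python
from bisect import bisect_left


class CoordinateMap:
    """Hypothetical two-way map between ungapped sequence positions
    and alignment columns for one gapped alignment row (0-based)."""

    def __init__(self, aligned_row, gap="-"):
        # seq_to_col[i] is the alignment column that holds sequence position i.
        self.seq_to_col = [col for col, ch in enumerate(aligned_row) if ch != gap]

    def to_alignment(self, seq_pos):
        """Sequence position -> alignment column."""
        return self.seq_to_col[seq_pos]

    def to_sequence(self, col):
        """Alignment column -> sequence position, or None if the column is a gap."""
        i = bisect_left(self.seq_to_col, col)
        if i < len(self.seq_to_col) and self.seq_to_col[i] == col:
            return i
        return None


def transfer(seq_pos, source, target):
    """Map a position in one genome to the aligned position in another
    by going through the shared alignment coordinate system."""
    return target.to_sequence(source.to_alignment(seq_pos))


genome_a = CoordinateMap("AC--GT")
genome_b = CoordinateMap("ACTTG-")
print(genome_a.to_alignment(2))          # column of the third base of genome A
print(transfer(2, genome_a, genome_b))   # matching position in genome B
print(transfer(3, genome_a, genome_b))   # None: genome B has a gap there
```

Storing only the sorted list of occupied columns keeps the forward lookup constant-time and the reverse lookup logarithmic; a production structure would presumably store this per alignment block rather than per whole genome.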


Efficient merging of genome profile alignments

Bioinformatics ◽

10.1093/bioinformatics/btz377 ◽

2019 ◽

Vol 35 (14) ◽

pp. i71-i80

Author(s):

André Hennig ◽

Kay Nieselt

Keyword(s):

Data Structure ◽

Parallel Computation ◽

Large Scale ◽

Divide And Conquer ◽

Supplementary Information ◽

Whole Genome ◽

Multiple Sequence ◽

Genome Profile ◽

Construction Methods ◽

Profile Alignment

Motivation: Whole-genome alignment (WGA) methods show insufficient scalability toward the generation of large-scale WGAs. Profile alignment-based approaches revolutionized the field of multiple sequence alignment construction by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which makes the profile-based extension of several whole genomes challenging. Currently, none of the available methods offers the possibility to align or extend WGA profiles.

Results: Here, we present genome profile alignment (GPA), an approach that aligns the profiles of WGAs and that is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined into a full WGA with a divide and conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profile alignment. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial datasets of up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool.

Availability and implementation: GPA is freely available at https://lambda.informatik.uni-tuebingen.de/gitlab/ahennig/GPA. GPA is implemented in Java, uses progressiveMauve and offers a parallel computation of WGAs.

Supplementary information: Supplementary data are available at Bioinformatics online.
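The divide and conquer merging along a guide tree can be sketched as a scheduling problem: merges in different subtrees do not depend on each other, so they could run in parallel. The Python sketch below only derives such a schedule. The nested-tuple tree encoding and the function name `merge_schedule` are assumptions for illustration; the leaves may stand for single genomes or for sets already aligned by a whole-genome aligner, and the actual profile merge step is deliberately left out.

```python
def merge_schedule(tree):
    """Group the merges of a binary guide tree into rounds.

    A leaf is a name (str); an inner node is a (left, right) tuple.
    Merges in the same round lie in disjoint subtrees and depend only
    on earlier rounds, so they can be computed independently.
    """
    rounds = []

    def visit(node):
        # Returns (names below this node, height of this node).
        if isinstance(node, str):
            return (node,), 0
        (left, left_height), (right, right_height) = visit(node[0]), visit(node[1])
        height = max(left_height, right_height) + 1
        while len(rounds) < height:
            rounds.append([])
        rounds[height - 1].append((left, right))
        return left + right, height

    visit(tree)
    return rounds


guide_tree = (("A", "B"), ("C", "D"))
for number, merges in enumerate(merge_schedule(guide_tree), start=1):
    print(number, merges)
```

For the four-leaf tree above, the first round holds the two independent merges (A with B, C with D) and the second round holds the final merge of the two resulting profiles.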


A Phylogenomic Supertree of Birds

Diversity ◽

10.3390/d11070109 ◽

2019 ◽

Vol 11 (7) ◽

pp. 109 ◽

Cited By ~ 17

Author(s):

Rebecca T. Kimball ◽

Carl H. Oliveros ◽

Ning Wang ◽

Noor D. White ◽

F. Keith Barker ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bird Species ◽

Divide And Conquer ◽

Clear Understanding ◽

Whole Genome ◽

Efficient Manner ◽

Sequence Capture ◽

Branch Lengths ◽

Supertree Methods

It has long been appreciated that analyses of genomic data (e.g., whole genome sequencing or sequence capture) have the potential to reveal the tree of life, but it remains challenging to move from sequence data to a clear understanding of evolutionary history, in part due to the computational challenges of phylogenetic estimation using genome-scale data. Supertree methods solve that challenge because they facilitate a divide-and-conquer approach for large-scale phylogeny inference by integrating smaller subtrees in a computationally efficient manner. Here, we combined information from sequence capture and whole-genome phylogenies using supertree methods. However, the available phylogenomic trees had limited overlap so we used taxon-rich (but not phylogenomic) megaphylogenies to weave them together. This allowed us to construct a phylogenomic supertree, with support values, that included 707 bird species (~7% of avian species diversity). We estimated branch lengths using mitochondrial sequence data and we used these branch lengths to estimate divergence times. Our time-calibrated supertree supports radiation of all three major avian clades (Palaeognathae, Galloanseres, and Neoaves) near the Cretaceous-Paleogene (K-Pg) boundary. The approach we used will permit the continued addition of taxa to this supertree as new phylogenomic data are published, and it could be applied to other taxa as well.


Whole-Genome Sequencing for Routine Pathogen Surveillance in Public Health: a Population Snapshot of Invasive Staphylococcus aureus in Europe

mBio ◽

10.1128/mbio.00444-16 ◽

2016 ◽

Vol 7 (3) ◽

Cited By ~ 123

Author(s):

David M. Aanensen ◽

Edward J. Feil ◽

Matthew T. G. Holden ◽

Janina Dordel ◽

Corin A. Yeats ◽

...

Keyword(s):

Public Health ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Bacterial Pathogens ◽

Epidemiological Surveillance ◽

Data Sets ◽

Whole Genome ◽

Bioinformatic Tools ◽

Road Map

Abstract: The implementation of routine whole-genome sequencing (WGS) promises to transform our ability to monitor the emergence and spread of bacterial pathogens. Here we combined WGS data from 308 invasive Staphylococcus aureus isolates corresponding to a pan-European population snapshot, with epidemiological and resistance data. Geospatial visualization of the data is made possible by a generic software tool designed for public health purposes that is available at the project URL (http://www.microreact.org/project/EkUvg9uY?tt=rc). Our analysis demonstrates that high-risk clones can be identified on the basis of population level properties such as clonal relatedness, abundance, and spatial structuring and by inferring virulence and resistance properties on the basis of gene content. We also show that in silico predictions of antibiotic resistance profiles are at least as reliable as phenotypic testing. We argue that this work provides a comprehensive road map illustrating the three vital components for future molecular epidemiological surveillance: (i) large-scale structured surveys, (ii) WGS, and (iii) community-oriented database infrastructure and analysis tools.

Importance: The spread of antibiotic-resistant bacteria is a public health emergency of global concern, threatening medical intervention at every level of health care delivery. Several recent studies have demonstrated the promise of routine whole-genome sequencing (WGS) of bacterial pathogens for epidemiological surveillance, outbreak detection, and infection control. However, as this technology becomes more widely adopted, the key challenges of generating representative national and international data sets and the development of bioinformatic tools to manage and interpret the data become increasingly pertinent. This study provides a road map for the integration of WGS data into routine pathogen surveillance. We emphasize the importance of large-scale routine surveys to provide the population context for more targeted or localized investigation and the development of open-access bioinformatic tools to provide the means to combine and compare independently generated data with publicly available data sets.


Mandrake: visualising microbial population structure by embedding millions of genomes into a low-dimensional representation

10.1101/2021.10.28.466232 ◽

2021 ◽

Author(s):

John A Lees ◽

Gerry Tonkin-Hill ◽

Zhirong Yang ◽

Jukka Corander

Keyword(s):

Population Structure ◽

Large Scale ◽

Population Genomics ◽

Bacterial Species ◽

Population Based ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Dimensional Reduction Method ◽

Low Dimensional

In less than a decade, population genomics of microbes has progressed from the effort of sequencing dozens of strains to thousands, or even tens of thousands of strains in a single study. There are now hundreds of thousands of genomes available even for a single bacterial species and the number of genomes is expected to continue to increase at an accelerated pace given the advances in sequencing technology and widespread genomic surveillance initiatives. This explosion of data calls for innovative methods to enable rapid exploration of the structure of a population based on different data modalities, such as multiple sequence alignments, assemblies and estimates of gene content across different genomes. Here we present Mandrake, an efficient implementation of a dimensional reduction method tailored for the needs of large-scale population genomics. Mandrake is capable of visualising population structure from millions of whole genomes and we illustrate its usefulness with several data sets representing major pathogens. Our method is freely available both as an analysis pipeline (https://github.com/johnlees/mandrake) and as a browser-based interactive application (https://gtonkinhill.github.io/mandrake-web/).


Compression with unified and accessible byte blocks to enhance management and analyses of UKBB-scale genotypes

10.21203/rs.3.rs-944936/v1 ◽

2021 ◽

Author(s):

Miaoxin Li ◽

Liubin Zhang ◽

Yangyang Yuan ◽

Wenjie Peng ◽

Bin Tang ◽

...

Keyword(s):

Data Structure ◽

Large Scale ◽

Fundamental Problem ◽

Population Based ◽

Whole Genome ◽

Storage Space ◽

Problem Of Time ◽

Conventional Analysis ◽

Time Overhead ◽

Future Population

Whole-genome sequencing projects of millions of persons produce enormous genotype data sets, entailing a huge memory burden and time overhead during computation. Here, we introduce Genotype Blocking Compressor (GBC), a method for rapidly compressing large-scale genotypes into a fast-accessible and highly parallelizable format. We demonstrate that GBC has a competitive compression ratio to help save storage space. Furthermore, GBC is the fastest method to access and manage compressed large-scale genotype files (sorting, merging, splitting, etc.). Our results indicate that GBC can help resolve the fundamental problem of time- and space-consuming computation with large-scale genotypes, and conventional analysis would be substantially enhanced if integrated with GBC to access genotypes. Therefore, GBC's advanced data structure and algorithms will accelerate future population-based biomedical research involving big genomics data.


An Overview of Multiple Sequence Alignments and Cloud Computing in Bioinformatics

ISRN Biomathematics ◽

10.1155/2013/615630 ◽

2013 ◽

Vol 2013 ◽

pp. 1-14 ◽

Cited By ~ 28

Author(s):

Jurate Daugelaite ◽

Aisling O' Driscoll ◽

Roy D. Sleator

Keyword(s):

Cloud Computing ◽

Large Scale ◽

Sequence Data ◽

Cloud Base ◽

Data Sets ◽

Next Generation ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Computing Technologies

Multiple sequence alignment (MSA) of DNA, RNA, and protein sequences is one of the most essential techniques in the fields of molecular biology, computational biology, and bioinformatics. Next-generation sequencing technologies are changing the biology landscape, flooding the databases with massive amounts of raw sequence data. MSA of ever-increasing sequence data sets is becoming a significant bottleneck. In order to realise the promise of MSA for large-scale sequence data sets, it is necessary for existing MSA algorithms to be run in a parallelised fashion with the sequence data distributed over a computing cluster or server farm. Combining MSA algorithms with cloud computing technologies is therefore likely to improve the speed, quality, and capability for MSA to handle large numbers of sequences. In this review, multiple sequence alignments are discussed, with a specific focus on the ClustalW and Clustal Omega algorithms. Cloud computing technologies and concepts are outlined, and the next generation of cloud-based MSA algorithms is introduced.


Whole-Genome Assessment of Clinical Acinetobacter baumannii Isolates Uncovers Potentially Novel Factors Influencing Carbapenem Resistance

Frontiers in Microbiology ◽

10.3389/fmicb.2021.714284 ◽

2021 ◽

Vol 12 ◽

Author(s):

Kiran Javkar ◽

Hugh Rand ◽

Maria Hoffmann ◽

Yan Luo ◽

Saul Sarria ◽

...

Keyword(s):

Acinetobacter Baumannii ◽

Resistance Genes ◽

Large Scale ◽

Carbapenem Resistance ◽

Future Research ◽

Whole Genome ◽

Pairwise Interactions ◽

Antimicrobial Resistance Genes ◽

Whole Genomes ◽

Imipenem Resistance

Carbapenems—one of the important last-line antibiotics for the treatment of gram-negative infections—are becoming ineffective for treating Acinetobacter baumannii infections. Studies have identified multiple genes (and mechanisms) responsible for carbapenem resistance. In some A. baumannii strains, the presence/absence of putative resistance genes is not consistent with their resistance phenotype—indicating the genomic factors underlying carbapenem resistance in A. baumannii are not fully understood. Here, we describe a large-scale whole-genome genotype-phenotype association study with 349 A. baumannii isolates that extends beyond the presence/absence of individual antimicrobial resistance genes and includes the genomic positions and pairwise interactions of genes. Ten known resistance genes exhibited statistically significant associations with resistance to imipenem, a type of carbapenem: blaOXA-23, qacEdelta1, sul1, mphE, msrE, ant(3”)-II, aacC1, yafP, aphA6, and xerD. A review of the strains without any of these 10 genes uncovered a clade of isolates with diverse imipenem resistance phenotypes. Finer resolution evaluation of this clade revealed the presence of a 38.6 kbp conserved chromosomal region found exclusively in imipenem-susceptible isolates. This region appears to host several HTH-type DNA binding transcriptional regulators and transporter genes. Imipenem-susceptible isolates from this clade also carried two mutually exclusive plasmids that contain genes previously known to be specific to imipenem-susceptible isolates. Our analysis demonstrates the utility of using whole genomes for genotype-phenotype correlations in the context of antibiotic resistance and provides several new hypotheses for future research.


DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins

Bioinformatics ◽

10.1093/bioinformatics/btz863 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2105-2112 ◽

Cited By ~ 14

Author(s):

Chengxin Zhang ◽

Wei Zheng ◽

S M Mortuza ◽

Yang Li ◽

Yang Zhang

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Secondary Structure Prediction ◽

Supplementary Information ◽

Structure Identification ◽

Whole Genome ◽

Multiple Sequence ◽

Contact Prediction ◽

Homologous Sequences

Motivation: The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved.

Results: We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was used to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which increased the accuracy of long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. All these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library.

Availability and implementation: https://zhanglab.ccmb.med.umich.edu/DeepMSA/.

Supplementary information: Supplementary data are available at Bioinformatics online.
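For readers unfamiliar with the Q3 measure named in the abstract: it is the fraction of residues whose three-state secondary structure (helix H, strand E, coil C) is predicted correctly. The short Python sketch below shows the calculation; the function name `q3_accuracy` and the example strings are made up for illustration and are not taken from the DeepMSA benchmark.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues with a correctly predicted three-state
    secondary structure label (H = helix, E = strand, C = coil)."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("label strings must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical five-residue example: four of five labels agree.
print(q3_accuracy("HHEEC", "HHECC"))
```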


Repair of Voids in Multi-Labeled Triangular Mesh

Applied Sciences ◽

10.3390/app11199275 ◽

2021 ◽

Vol 11 (19) ◽

pp. 9275

Author(s):

Deyun Zhong ◽

Benyu Li ◽

Tiandong Shi ◽

Zhaopeng Li ◽

Liguan Wang ◽

...

Keyword(s):

Graph Theory ◽

Data Structure ◽

Large Scale ◽

Triangular Mesh ◽

Experimental Results ◽

Data Sets ◽

Topological Correctness ◽

And Performance

In this paper, we propose a novel mesh repairing method for repairing voids between several meshes to ensure a desired topological correctness. The input to our method is several closed and manifold meshes without labels. The basic idea of the method is to search for and repair voids based on a multi-labeled mesh data structure and ideas from graph theory. We propose judgment rules for voids between the input meshes and a void repairing method based on the specified model priorities. The method consists of three steps: (a) converting the input meshes into a multi-labeled mesh; (b) searching for quasi-voids using the breadth-first search algorithm and determining true voids via the judgment rules; (c) repairing voids by modifying mesh labels. The method can repair the voids accurately, and only a few invalid triangular facets are removed. In general, the method can repair meshes with one hundred thousand facets in approximately one second on very modest hardware. Moreover, it can be easily extended to process large-scale polygon models with millions of polygons. The experimental results on several data sets show the reliability and performance of the void repairing method based on the multi-labeled triangular mesh.


Whole-genome microsynteny-based phylogeny of angiosperms

Nature Communications ◽

10.1038/s41467-021-23665-0 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Tao Zhao ◽

Arthur Zwaenepoel ◽

Jia-Yu Xue ◽

Shu-Min Kao ◽

Zhen Li ◽

...

Keyword(s):

Large Scale ◽

Phylogenetic Inference ◽

Data Sets ◽

Whole Genome ◽

Data Set ◽

Genome Data ◽

Structural Differences ◽

Evolution Dynamics ◽

Plant Families ◽

Early Diverging Eudicots

Plant genomes vary greatly in size, organization, and architecture. Such structural differences may be highly relevant for inference of genome evolution dynamics and phylogeny. Indeed, microsynteny (the conservation of local gene content and order) is recognized as a valuable source of phylogenetic information, but its use for the inference of large phylogenies has been limited. Here, by combining synteny network analysis, matrix representation, and maximum likelihood phylogenetic inference, we provide a way to reconstruct phylogenies based on microsynteny information. Both simulations and use of empirical data sets show our method to be accurate, consistent, and widely applicable. As an example, we focus on the analysis of a large-scale whole-genome data set for angiosperms, including more than 120 available high-quality genomes, representing more than 50 different plant families and 30 orders. Our ‘microsynteny-based’ tree is largely congruent with phylogenies proposed based on more traditional sequence alignment-based methods and current phylogenetic classifications but differs for some long-contested and controversial relationships. For instance, our synteny-based tree finds Vitales as early diverging eudicots, Saxifragales within superasterids, and magnoliids as sister to monocots. We discuss how synteny-based phylogenetic inference can complement traditional methods and could provide additional insights into some long-standing controversial phylogenetic relationships.
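One plausible reading of the "matrix representation" step is a binary presence/absence matrix: one row per genome, one column per synteny cluster, which a maximum likelihood program for binary characters could then analyse. The abstract does not spell out the encoding, so the Python sketch below is an assumption throughout, including the function name `presence_absence_matrix` and the toy clusters.

```python
def presence_absence_matrix(clusters, genomes):
    """Encode each genome as a string of 0/1 characters, one character
    per synteny cluster: '1' if the genome has a member in that cluster."""
    return {
        genome: "".join("1" if genome in members else "0" for members in clusters)
        for genome in genomes
    }


# Toy input: each cluster is the set of genomes sharing one microsynteny block.
clusters = [{"A", "B"}, {"B", "C"}, {"A", "B", "C"}]
for genome, row in presence_absence_matrix(clusters, ["A", "B", "C"]).items():
    print(genome, row)
```

Each row can be written out as one sequence of a binary alignment, which is the usual input form for character-based tree inference.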
