Species tree-aware simultaneous reconstruction of gene and domain evolution

10.1101/336453 ◽

2018 ◽

Cited By ~ 1

Author(s):

Sayyed Auwn Muhammad ◽

Bengt Sennblad ◽

Jens Lagergren

Keyword(s):

Sequence Alignment ◽

Gene Tree ◽

Phylogenetic Reconstruction ◽

Gene Families ◽

Species Tree ◽

Biological Data ◽

Sequence Evolution ◽

Multiple Sequence ◽

Alignment Algorithms ◽

Tandem Duplications

Most genes are composed of multiple domains, with a common evolutionary history, that typically perform a specific function in the resulting protein. As witnessed by many studies of key gene families, it is important to understand how domains have been duplicated, lost, transferred between genes, and rearranged. Analogously to the case of evolutionary events affecting entire genes, these domain events have large consequences for phylogenetic reconstruction and, in addition, they create considerable obstacles for gene sequence alignment algorithms, a prerequisite for phylogenetic reconstruction. We introduce the DomainDLRS model, a hierarchical, generative probabilistic model containing three levels corresponding to species, genes, and domains, respectively. From a dated species tree, a gene tree is generated according to the DL model, which is a birth-death model generalized to occur in a dated tree. Then, from the dated gene tree, a pre-specified number of dated domain trees are generated using the DL model and the molecular clock is relaxed, effectively converting edge times to edge lengths. Finally, for each domain tree and its lengths, domain sequences are generated for the leaves based on a selected model of sequence evolution. For this model, we present an MCMC-based inference framework called DomainDLRS that takes a dated species tree together with a multiple sequence alignment for each domain family as input and outputs an estimated posterior distribution over reconciled gene and domain trees. By requiring aligned domains rather than genes, our framework evades the problem of aligning full-length genes that have been exposed to domain duplications, in particular non-tandem domain duplications. We show that DomainDLRS performs better than MrBayes on synthetic data and that it outperforms MrBayes on biological data. We analyse several zinc-finger genes and show that most domain duplications have been tandem duplications, some involving two or more domains, but non-tandem duplications have also been common.
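The birth-death (duplication-loss) process that the abstract builds on can be illustrated with a small simulation. The sketch below is not the DomainDLRS implementation; it only simulates, under assumed constant rates, how many lineages descend from a single lineage along one dated edge. The function name `simulate_lineage_count` and its parameters are hypothetical.

```python
import random


def simulate_lineage_count(duration, dup_rate, loss_rate, rng):
    """Gillespie-style simulation of a linear birth-death process on one edge.

    Starts with a single lineage at the top of the edge and returns the
    number of lineages present after `duration` time units. Each lineage
    duplicates at rate `dup_rate` and is lost at rate `loss_rate`
    (illustrative parameters, not taken from the paper).
    """
    count = 1
    elapsed = 0.0
    while count > 0:
        total_rate = count * (dup_rate + loss_rate)
        if total_rate == 0.0:
            break  # no events can happen
        # Waiting time to the next event among all current lineages.
        elapsed += rng.expovariate(total_rate)
        if elapsed >= duration:
            break  # the next event would fall past the end of the edge
        # Choose the event type in proportion to its rate.
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            count += 1
        else:
            count -= 1
    return count


if __name__ == "__main__":
    rng = random.Random(1)
    counts = [simulate_lineage_count(1.0, 1.0, 0.5, rng) for _ in range(10)]
    print(counts)
```

In a full hierarchical model this step would be repeated along every edge of the dated species tree to produce a gene tree, and again along the gene tree to produce domain trees; the sketch covers only the single-edge case.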

SHOOT: phylogenetic gene search and ortholog inference

10.1101/2021.09.01.458564 ◽

2021 ◽

Author(s):

David Emms ◽

Steven Kelly

Keyword(s):

Phylogenetic Analysis ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Trees ◽

Query Sequence ◽

Gene Tree ◽

Biological Research ◽

Gene Sequences ◽

Multiple Sequence ◽

Gene Search

Determining the evolutionary relationships between gene sequences is fundamental to comparative biological research. However, conducting such analyses requires a high degree of technical proficiency in several computational tools including gene family construction, multiple sequence alignment, and phylogenetic inference. Here we present SHOOT, an easy-to-use phylogenetic search engine for fast and accurate phylogenetic analysis of biological sequences. SHOOT searches a user-provided query sequence against a database of phylogenetic trees of gene sequences (gene trees) and returns a gene tree with the given query sequence correctly grafted within it. We show that SHOOT can perform this search and placement with comparable speed to a conventional BLAST search. We demonstrate that SHOOT phylogenetic placements are as accurate as conventional multiple sequence alignment and maximum likelihood tree inference approaches. We further show that SHOOT can be used to identify orthologs with equivalent accuracy to conventional orthology inference methods. In summary, SHOOT is an accurate and fast tool for complete phylogenetic analysis of novel query sequences. An easy-to-use webserver is available online at www.shoot.bio.

Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers

2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society ◽

10.1109/iembs.2011.6090208 ◽

2011 ◽

Cited By ~ 6

Author(s):

Philip C. Church ◽

Andrzej Goscinski ◽

Kathryn Holt ◽

Michael Inouye ◽

Amol Ghoting ◽

...

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Distributed Memory ◽

Multiple Sequence ◽

Alignment Algorithms

A Survey of the State-of-the-Art Parallel Multiple Sequence Alignment Algorithms on Multicore Systems

International Journal of Computer Applications ◽

10.5120/ijca2018917658 ◽

2018 ◽

Vol 182 (12) ◽

pp. 1-9 ◽

Cited By ~ 2

Author(s):

Sara Shehab ◽

Sameh Abdulah ◽

Arabi E.

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

State Of The Art ◽

The State ◽

Multicore Systems ◽

Multiple Sequence ◽

Alignment Algorithms

Efficient Multiple Sequences Alignment Algorithm Generation via Components Assembly Under PAR Framework

Frontiers in Genetics ◽

10.3389/fgene.2020.628175 ◽

2021 ◽

Vol 11 ◽

Author(s):

Haipeng Shi ◽

Haihe Shi ◽

Shenghua Xu

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Sequence Similarity ◽

Alignment Algorithm ◽

Pairwise Sequence Alignment ◽

Multiple Sequence ◽

Sequence Alignment Algorithm ◽

Alignment Algorithms ◽

Sequence Similarity Analysis ◽

High Level

As a key algorithm in bioinformatics, the sequence alignment algorithm is widely used in sequence similarity analysis and genome sequence database search. Existing research focuses mainly on specific steps of the algorithm or on specific problems, and lacks a high-level, abstract domain algorithm framework. Multiple sequence alignment algorithms are more complex, redundant, and difficult to understand, so it is not easy for users to select the appropriate algorithm, and computing errors may occur. Based on our pairwise sequence alignment algorithm component library and the convenient software platform PAR, a few expansion domain components are developed for the multiple sequence alignment application domain. With them, a specific multiple sequence alignment algorithm can be designed and its corresponding program (in C++, Java, or Python) generated efficiently, which improves both the development efficiency of complex algorithms and the accuracy of sequence alignment calculation. A star alignment algorithm is designed and generated to demonstrate the development process.

EXACT SOLUTIONS FOR SPECIES TREE INFERENCE FROM DISCORDANT GENE TREES

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013420055 ◽

2013 ◽

Vol 11 (05) ◽

pp. 1342005 ◽

Cited By ~ 16

Author(s):

WEN-CHIEH CHANG ◽

PAWEŁ GÓRECKI ◽

OLIVER EULENSTEIN

Keyword(s):

Exact Solutions ◽

Gene Tree ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Data Sets ◽

Gene Trees ◽

Species Trees ◽

Worst Case

Phylogenetic analysis has to overcome the grand challenge of inferring accurate species trees from evolutionary histories of gene families (gene trees) that are discordant with the species tree along whose branches they have evolved. Two well-studied approaches to cope with this challenge are to solve either biologically informed gene tree parsimony (GTP) problems under gene duplication, gene loss, and deep coalescence, or the classic RF supertree problem that does not rely on any biological model. Despite the potential of these problems to infer credible species trees, they are NP-hard. Therefore, these problems are addressed by heuristics that typically lack any provable accuracy and precision. We describe fast dynamic programming algorithms that solve the GTP problems and the RF supertree problem exactly, and demonstrate that our algorithms can solve instances with data sets consisting of as many as 22 taxa. Extensions of our algorithms can also report the number of all optimal species trees, as well as the trees themselves. To better assess the quality of the resulting species trees that best fit the given gene trees, we also compute the worst case species trees, their numbers, and optimization score for each of the computational problems. Finally, we demonstrate the performance of our exact algorithms using empirical and simulated data sets, and analyze the quality of heuristic solutions for the studied problems by contrasting them with our exact solutions.
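The RF (Robinson-Foulds) measure underlying the RF supertree problem can be illustrated for a single pair of trees. The sketch below is not the paper's dynamic programming algorithm; it only computes the RF distance between two trees on the same leaf set, counting the non-trivial bipartitions present in one tree but not the other. The nested-tuple tree encoding and all function names are assumptions made for illustration.

```python
def leaf_sets(tree, out):
    """Return the leaf set below `tree`; append every subtree's leaf set to `out`.

    A tree is either a leaf name (str) or a tuple of child subtrees.
    """
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Collect the non-trivial bipartitions (splits) induced by the tree's edges."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)  # fixed leaf used to pick a canonical side
    splits = set()
    for clade in clades:
        # Represent each split by the side that does not contain the anchor.
        side = all_leaves - clade if anchor in clade else clade
        # Keep only splits with at least two leaves on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    print(rf_distance(t1, t2))
```

Because splits are canonicalized relative to a fixed leaf, two different rootings of the same unrooted tree give a distance of zero.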

Phylogenomic Analyses Of 2,786 Genes In 158 Lineages Support a Root of The Eukaryotic Tree of Life Between Opisthokonts (Animals, Fungi and Their Microbial Relatives) and All Other Lineages

10.1101/2021.02.26.433005 ◽

2021 ◽

Author(s):

Mario A Ceron Romero ◽

Miguel M Fonseca ◽

Leonardo de Oliveira Martins ◽

David Posada ◽

Laura A Katz

Keyword(s):

Tree Species ◽

High Throughput Sequencing ◽

Single Gene ◽

Gene Tree ◽

Gene Families ◽

Species Tree ◽

Tree Of Life ◽

Gene Duplications ◽

Tree Reconciliation ◽

Phylogenomic Analyses

Advances in phylogenetics and high throughput sequencing have allowed the reconstruction of deep phylogenetic relationships in the evolution of eukaryotes. Yet, the root of the eukaryotic tree of life remains elusive. The most popular hypothesis in textbooks and reviews is a root between Unikonta (Opisthokonta + Amoebozoa) and Bikonta (all other eukaryotes), which emerged from analyses of a single gene fusion. Subsequent highly cited studies based on concatenation of genes supported this hypothesis with some variations or proposed a root within Excavata. However, concatenation of genes neither considers phylogenetically informative events (i.e. gene duplications and losses), nor provides an estimate of the root. A more recent study using gene tree / species tree reconciliation methods suggested the root lies between Opisthokonta and all other eukaryotes, but included only 59 taxa and 20 genes. Here we apply a gene tree / species tree reconciliation approach to a gene-rich and taxon-rich dataset (i.e. 2,786 gene families from two sets of 158 diverse eukaryotic lineages) to assess the root, and we iterate each analysis 100 times to quantify tree space uncertainty. We estimate a root between Fungi and all other eukaryotes, or between Opisthokonta and all other eukaryotes, and reject alternative roots from the literature. Based on further analysis of genome size we propose Opisthokonta + others as the most likely root.

Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1369.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 771-776

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Evolutionary Relationship ◽

Biological Data ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequencing Technologies ◽

Benchmark Database ◽

Execution Speed ◽

Protein Multiple Sequence Alignment

Protein multiple sequence alignment (MSA) is the process of aligning more than two protein sequences to establish an evolutionary relationship between them. In protein MSA, the biological sequences are aligned so as to identify maximum similarities. As sequencing technologies become more sophisticated, the volume of biological data generated is increasing at an enormous rate. This growth poses a challenge to existing MSA methods: as data volume increases, computational complexity also increases and processing speed decreases. The accuracy of MSA is also critically important, as many bioinformatics inferences depend on its output. This paper elaborates on the existing state-of-the-art methods of protein MSA and compares four leading methods, namely MAFFT, Clustal Omega, MUSCLE and ProbCons, on speed and accuracy. BAliBASE version 3.0 (a repository of manually refined multiple sequence alignments) is used as the benchmark database, and the accuracy of the alignment methods is computed using two widely used criteria, the sum-of-pairs score (SPscore) and the total column score (TCscore). We also recorded the execution time of each method in order to compute the execution speed.
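The two accuracy criteria named above can be sketched in a few lines. This is a simplified illustration, not the BAliBASE scoring program: it assumes the test and reference alignments contain the same sequences in the same row order, uses `-` for gaps, scores only reference columns with at least two residues, and counts a column as correct only if it is reproduced exactly. The function names are hypothetical.

```python
from itertools import combinations


def column_residues(alignment):
    """For each column, map row index -> index of the residue in that row.

    Gap positions ('-') are omitted from the mapping.
    """
    counters = [0] * len(alignment)
    columns = []
    for col in range(len(alignment[0])):
        entry = {}
        for row, seq in enumerate(alignment):
            if seq[col] != "-":
                entry[row] = counters[row]
                counters[row] += 1
        columns.append(entry)
    return columns


def aligned_pairs(columns):
    """Set of residue pairs (row1, idx1, row2, idx2) placed in the same column."""
    pairs = set()
    for entry in columns:
        for (r1, i1), (r2, i2) in combinations(sorted(entry.items()), 2):
            pairs.add((r1, i1, r2, i2))
    return pairs


def sp_and_tc(test, reference):
    """Return (SP score, TC score) of `test` measured against `reference`.

    Assumes the reference aligns at least one pair of residues.
    """
    ref_cols = column_residues(reference)
    test_cols = column_residues(test)
    ref_pairs = aligned_pairs(ref_cols)
    # SP: fraction of reference residue pairs also aligned in the test.
    sp = len(ref_pairs & aligned_pairs(test_cols)) / len(ref_pairs)
    # TC: fraction of scored reference columns reproduced exactly in the test.
    scored = [frozenset(c.items()) for c in ref_cols if len(c) >= 2]
    test_set = {frozenset(c.items()) for c in test_cols}
    tc = sum(col in test_set for col in scored) / len(scored)
    return sp, tc


if __name__ == "__main__":
    reference = ["ACG-T", "A-GCT"]
    test = ["ACG-T", "AG-CT"]
    print(sp_and_tc(test, reference))
```

Both scores lie between 0 and 1, and an alignment scored against itself gives 1.0 for each.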

Species Tree Estimation from Genome-wide Data with Guenomu

10.1101/023861 ◽

2015 ◽

Author(s):

Leonardo de Oliveira Martins ◽

David Posada

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Gene Families ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Multiple Sources ◽

Reconstruction Methods ◽

Tree Topologies

The history of particular genes and that of the species that carry them can differ for several reasons. In particular, gene trees and species trees can truly differ due to well-known evolutionary processes such as gene duplication and loss, lateral gene transfer, or incomplete lineage sorting. Different species tree reconstruction methods have been developed to take this incongruence into account; these can be broadly divided into supertree and supermatrix approaches. Here, we introduce a new Bayesian hierarchical model, which we have recently developed and implemented in the program Guenomu, that considers multiple sources of gene tree/species tree disagreement. Guenomu takes as input the posterior distributions of unrooted gene tree topologies for multiple gene families in order to estimate the posterior distribution of rooted species tree topologies.
