Species tree-aware simultaneous reconstruction of gene and domain evolution

2018 ◽  
Author(s):  
Sayyed Auwn Muhammad ◽  
Bengt Sennblad ◽  
Jens Lagergren

Most genes are composed of multiple domains, with a common evolutionary history, that typically perform a specific function in the resulting protein. As witnessed by many studies of key gene families, it is important to understand how domains have been duplicated, lost, transferred between genes, and rearranged. Analogously to the case of evolutionary events affecting entire genes, these domain events have large consequences for phylogenetic reconstruction and, in addition, they create considerable obstacles for gene sequence alignment algorithms, a prerequisite for phylogenetic reconstruction. We introduce the DomainDLRS model, a hierarchical, generative probabilistic model containing three levels corresponding to species, genes, and domains, respectively. From a dated species tree, a gene tree is generated according to the DL model, which is a birth-death model generalized to occur in a dated tree. Then, from the dated gene tree, a pre-specified number of dated domain trees are generated using the DL model and the molecular clock is relaxed, effectively converting edge times to edge lengths. Finally, for each domain tree and its lengths, domain sequences are generated for the leaves based on a selected model of sequence evolution. For this model, we present an MCMC-based inference framework called DomainDLRS that takes a dated species tree together with a multiple sequence alignment for each domain family as input and outputs an estimated posterior distribution over reconciled gene and domain trees. By requiring aligned domains rather than genes, our framework evades the problem of aligning full-length genes that have been exposed to domain duplications, in particular non-tandem domain duplications. We show that DomainDLRS outperforms MrBayes on both synthetic and biological data. We analyse several zinc-finger genes and show that most domain duplications have been tandem duplications, some involving two or more domains, but non-tandem duplications have also been common.
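The birth-death (DL) step in the abstract above can be illustrated with a minimal sketch. It simulates a plain linear birth-death process along a single dated edge and is not the DomainDLRS implementation; the function name, the parameter names, and the rates are all assumptions made for illustration.

```python
import random


def simulate_dl_on_edge(duration, birth_rate, death_rate, rng, start_lineages=1):
    """Return how many lineages survive to the end of one dated edge.

    Hypothetical helper: each lineage duplicates at `birth_rate` and is
    lost at `death_rate`, independently, for `duration` time units.
    """
    n = start_lineages
    t = 0.0
    while n > 0:
        total_rate = n * (birth_rate + death_rate)
        if total_rate == 0:
            break  # no events can occur
        # Waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(total_rate)
        if t >= duration:
            break  # the edge ends before the next event
        if rng.random() < birth_rate / (birth_rate + death_rate):
            n += 1  # duplication
        else:
            n -= 1  # loss
    return n


rng = random.Random(0)
survivors = [simulate_dl_on_edge(1.0, 0.5, 0.3, rng) for _ in range(5)]
```

A full generative model would run such a process on every edge of the dated species tree and record the resulting tree, not only the lineage counts.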

2021 ◽  
Author(s):  
David Emms ◽  
Steven Kelly

Determining the evolutionary relationships between gene sequences is fundamental to comparative biological research. However, conducting such analyses requires a high degree of technical proficiency in several computational tools including gene family construction, multiple sequence alignment, and phylogenetic inference. Here we present SHOOT, an easy-to-use phylogenetic search engine for fast and accurate phylogenetic analysis of biological sequences. SHOOT searches a user-provided query sequence against a database of phylogenetic trees of gene sequences (gene trees) and returns a gene tree with the given query sequence correctly grafted within it. We show that SHOOT can perform this search and placement with comparable speed to a conventional BLAST search. We demonstrate that SHOOT phylogenetic placements are as accurate as conventional multiple sequence alignment and maximum likelihood tree inference approaches. We further show that SHOOT can be used to identify orthologs with equivalent accuracy to conventional orthology inference methods. In summary, SHOOT is an accurate and fast tool for complete phylogenetic analysis of novel query sequences. An easy-to-use web server is available online at www.shoot.bio.


2021 ◽  
Vol 11 ◽  
Author(s):  
Haipeng Shi ◽  
Haihe Shi ◽  
Shenghua Xu

As a key algorithm in bioinformatics, sequence alignment is widely used in sequence similarity analysis and genome sequence database search. Existing research focuses mainly on specific steps of the algorithm or on specific problems, and lacks a high-level, abstract domain algorithm framework. Multiple sequence alignment algorithms are more complex, redundant, and difficult to understand; users cannot easily select the appropriate algorithm, and computing errors may occur. Based on our pairwise sequence alignment algorithm component library and the software platform PAR, we develop a few extension domain components for the multiple sequence alignment application domain. With these, a specific multiple sequence alignment algorithm can be designed and its corresponding program (in C++, Java, or Python) generated efficiently, which improves both the development efficiency of complex algorithms and the accuracy of sequence alignment calculation. A star alignment algorithm is designed and generated to demonstrate the development process.
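For readers unfamiliar with star alignment, the sketch below shows its usual first step: choosing the center sequence as the one with the highest total pairwise alignment score. This is a textbook version and not code generated by the PAR platform; the function names and the scoring values (match 1, mismatch -1, gap -2) are assumptions.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def choose_center(seqs):
    """Return the index of the sequence with the best summed pairwise score."""
    totals = [
        sum(nw_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    return max(range(len(seqs)), key=totals.__getitem__)


center = choose_center(["ACGT", "ACGA", "TTTT"])
```

In a complete star alignment, every other sequence would then be aligned pairwise to the center and the pairwise alignments merged, with gaps propagated to all rows.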


2013 ◽  
Vol 11 (05) ◽  
pp. 1342005 ◽  
Author(s):  
WEN-CHIEH CHANG ◽  
PAWEŁ GÓRECKI ◽  
OLIVER EULENSTEIN

Phylogenetic analysis has to overcome the grand challenge of inferring accurate species trees from evolutionary histories of gene families (gene trees) that are discordant with the species tree along whose branches they have evolved. Two well-studied approaches to cope with this challenge are to solve either biologically informed gene tree parsimony (GTP) problems under gene duplication, gene loss, and deep coalescence, or the classic RF supertree problem that does not rely on any biological model. Despite the potential of these problems to infer credible species trees, they are NP-hard. Therefore, these problems are addressed by heuristics that typically lack any provable accuracy and precision. We describe fast dynamic programming algorithms that solve the GTP problems and the RF supertree problem exactly, and demonstrate that our algorithms can solve instances with data sets consisting of as many as 22 taxa. Extensions of our algorithms can also report the number of all optimal species trees, as well as the trees themselves. To better assess the quality of the resulting species trees that best fit the given gene trees, we also compute the worst-case species trees, their numbers, and optimization score for each of the computational problems. Finally, we demonstrate the performance of our exact algorithms using empirical and simulated data sets, and analyze the quality of heuristic solutions for the studied problems by contrasting them with our exact solutions.
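The RF (Robinson-Foulds) distance that underlies the RF supertree problem can be sketched in a few lines. The version below is a rooted, cluster-based form for two trees over the same leaf set, written as nested tuples; the tree encoding and the function names are assumptions for illustration, and this is not the dynamic programming algorithm described in the abstract.

```python
def clusters(tree):
    """Collect the leaf sets of all internal nodes except the root.

    A tree is a leaf name (str) or a tuple of subtrees.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    root_leaves = walk(tree)
    found.discard(root_leaves)  # the root cluster is shared by every tree
    return found


def rf_distance(t1, t2):
    """Count the clusters that appear in exactly one of the two trees."""
    return len(clusters(t1) ^ clusters(t2))


balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
distance = rf_distance(balanced, caterpillar)
```

An RF supertree method would search for the species tree that minimizes the sum of such distances to all input gene trees, which is the NP-hard part.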


2021 ◽  
Author(s):  
Mario A Ceron Romero ◽  
Miguel M Fonseca ◽  
Leonardo de Oliveira Martins ◽  
David Posada ◽  
Laura A Katz

Advances in phylogenetics and high-throughput sequencing have allowed the reconstruction of deep phylogenetic relationships in the evolution of eukaryotes. Yet, the root of the eukaryotic tree of life remains elusive. The most popular hypothesis in textbooks and reviews is a root between Unikonta (Opisthokonta + Amoebozoa) and Bikonta (all other eukaryotes), which emerged from analyses of a single gene fusion. Subsequent highly cited studies based on concatenation of genes supported this hypothesis with some variations or proposed a root within Excavata. However, concatenation of genes neither considers phylogenetically informative events (i.e. gene duplications and losses), nor provides an estimate of the root. A more recent study using gene tree / species tree reconciliation methods suggested the root lies between Opisthokonta and all other eukaryotes, but included only 59 taxa and 20 genes. Here we apply a gene tree / species tree reconciliation approach to a gene-rich and taxon-rich dataset (i.e. 2,786 gene families from two sets of 158 diverse eukaryotic lineages) to assess the root, and we iterate each analysis 100 times to quantify tree space uncertainty. We estimate a root between Fungi and all other eukaryotes, or between Opisthokonta and all other eukaryotes, and reject alternative roots from the literature. Based on further analysis of genome size we propose Opisthokonta + others as the most likely root.


Protein multiple sequence alignment (MSA) is the process of aligning more than two protein sequences to establish an evolutionary relationship between them. In protein MSA, the biological sequences are aligned so as to identify maximum similarity. As sequencing technologies become more sophisticated, the volume of biological data generated is increasing at an enormous rate. This growth challenges existing MSA methods, because computational complexity increases and processing speed decreases as the data volume grows. The accuracy of MSA is also critically important, as many bioinformatics inferences depend on the output of MSA. This paper reviews the existing state-of-the-art methods of protein MSA and compares four leading methods, namely MAFFT, Clustal Omega, MUSCLE and ProbCons, on speed and accuracy. BAliBASE version 3.0 (a repository of manually refined multiple sequence alignments) is used as the benchmark database, and the accuracy of the alignment methods is computed using two widely used criteria, the sum-of-pairs score (SPscore) and the total column score (TCscore). We also recorded the execution time of each method in order to compare execution speed.
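The two accuracy criteria named above can be made concrete with a simplified sketch. Here the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and the total column score is the fraction of reference columns reproduced exactly. The function names are assumptions, rows are assumed to be in the same sequence order in both alignments, and BAliBASE's own scoring program may differ in details such as restricting the comparison to core blocks.

```python
from itertools import combinations


def columns(alignment):
    """Turn gapped rows into columns of per-sequence residue indices."""
    counters = [0] * len(alignment)
    cols = []
    for chars in zip(*alignment):
        col = []
        for row, ch in enumerate(chars):
            if ch == "-":
                col.append(None)  # gap: no residue from this sequence
            else:
                col.append(counters[row])
                counters[row] += 1
        cols.append(tuple(col))
    return cols


def residue_pairs(cols):
    """Set of all residue pairs that share a column."""
    pairs = set()
    for col in cols:
        present = [(row, idx) for row, idx in enumerate(col) if idx is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs recovered by the test alignment."""
    ref_pairs = residue_pairs(columns(reference))
    return len(ref_pairs & residue_pairs(columns(test))) / len(ref_pairs)


def tc_score(test, reference):
    """Fraction of reference columns reproduced exactly by the test alignment."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
sp, tc = sp_score(test, reference), tc_score(test, reference)
```

Both scores range from 0 to 1, and the column score is the stricter of the two because a single misplaced residue invalidates its whole column.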


2015 ◽  
Author(s):  
Leonardo de Oliveira Martins ◽  
David Posada

The history of particular genes and that of the species that carry them can differ for several reasons. In particular, gene trees and species trees can truly differ due to well-known evolutionary processes such as gene duplication and loss, lateral gene transfer, or incomplete lineage sorting. Different species tree reconstruction methods have been developed to take this incongruence into account, and they can be broadly divided into supertree and supermatrix approaches. Here, we introduce a new Bayesian hierarchical model, which we have recently developed and implemented in the program Guenomu, that considers multiple sources of gene tree/species tree disagreement. Guenomu takes as input the posterior distributions of unrooted gene tree topologies for multiple gene families in order to estimate the posterior distribution of rooted species tree topologies.

