Predicting the Evolution of Syntenies—An Algorithmic Review

Algorithms ◽

10.3390/a14050152 ◽

2021 ◽

Vol 14 (5) ◽

pp. 152

Author(s):

Nadia El-Mabrouk

Keyword(s):

Single Gene ◽

Gene Tree ◽

Evolutionary Model ◽

Gene Families ◽

Gene Content ◽

Gene Trees ◽

Gene Insertion ◽

Insertion And Deletion ◽

Genomic Regions ◽

Identical Gene

Syntenies are genomic segments of consecutive genes identified by a certain conservation in gene content and order. The notion of conservation varies from one definition to another: the most constrained definitions require identical gene content and gene order, while more relaxed ones require only a certain similarity in gene content, not necessarily in the same order. Regardless of how they are identified, the goal is to characterize homologous genomic regions, i.e., regions deriving from a common ancestral region, reflecting a certain gene co-evolution that can shed light on important functional properties. In addition to identifying them, it is also necessary to infer the evolutionary history that has led from the ancestral segment to the extant ones. In this field, most algorithmic studies address the problem of inferring rearrangement scenarios explaining the disruption in gene order between segments with the same gene content, some of them extending the evolutionary model to gene insertion and deletion. However, syntenies also evolve through other events that modify their gene content, such as duplications, losses, or horizontal gene transfers, i.e., the movement of genes from one species to another. Although the reconciliation approach between a gene tree and a species tree addresses the problem of inferring such events for single-gene families, little effort has been dedicated to generalizing it to segmental events and to syntenies. This paper reviews some of the main algorithmic methods for inferring ancestral syntenies and focuses on those integrating both gene orders and gene trees.
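As an illustration of the single-gene reconciliation idea mentioned in the abstract, the following is a minimal sketch of the classical LCA mapping between a binary gene tree and a species tree, counting the nodes it labels as duplications. It is not taken from the review itself: the tree encoding (nested tuples, a child-to-parent dictionary), the function names `lca` and `reconcile`, and the toy trees are all assumptions made for this example, and losses and transfers are not handled.

```python
def lca(parent, a, b):
    """Lowest common ancestor of species a and b; the tree is a child -> parent dict."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)  # the root has no entry, so the walk ends with None
    while b not in ancestors:
        b = parent[b]
    return b


def reconcile(gene_tree, leaf_species, parent):
    """Return (mapped species, duplication count) for a binary gene tree.

    gene_tree is either a leaf name (str) or a pair (left, right).
    """
    if isinstance(gene_tree, str):
        return leaf_species[gene_tree], 0
    left, right = gene_tree
    s_left, d_left = reconcile(left, leaf_species, parent)
    s_right, d_right = reconcile(right, leaf_species, parent)
    s = lca(parent, s_left, s_right)
    # Under the LCA mapping, a gene-tree node is labeled a duplication when
    # it maps to the same species-tree node as one of its children.
    is_dup = s in (s_left, s_right)
    return s, d_left + d_right + int(is_dup)


# Hypothetical species tree ((A,B)AB,C)ABC, stored as child -> parent.
parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC"}
# Hypothetical gene family with two copies in species A.
leaf_species = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
gene_tree = ((("a1", "b1"), "a2"), "c1")

root_species, duplications = reconcile(gene_tree, leaf_species, parent)
print(root_species, duplications)
```

In this toy family the node joining `(a1, b1)` with `a2` maps to the same ancestral species as its left child, so it is the only node counted as a duplication. Extending such node-by-node labeling to segmental events that affect several adjacent genes at once is the generalization the abstract describes as largely open.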

Genomic Characterization and Curation of UCEs Improves Species Tree Reconstruction

Systematic Biology ◽

10.1093/sysbio/syaa063 ◽

2020 ◽

Author(s):

Matthew H Van Dam ◽

James B Henderson ◽

Lauren Esposito ◽

Michelle Trautwein

Keyword(s):

Marine Invertebrates ◽

Phylogenetic Analyses ◽

Single Gene ◽

Gene Tree ◽

Single Copy ◽

Species Tree ◽

Bootstrap Support ◽

Gene Trees ◽

Species Trees ◽

Tree Reconstruction

Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes does not require prior knowledge of genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here, we characterized UCEs from 11 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated four different sets of UCE markers by genomic category from five different studies including birds, mammals, fish, Hymenoptera (ants, wasps, and bees), and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by two or more UCEs, corresponding to nonoverlapping segments of a single gene. We considered these UCEs to be nonindependent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging cogenic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all data sets, apparently driven by the increase in locus length. Additionally, we conducted simulations and found that gene trees generated from merged UCEs were more accurate than those generated by unmerged UCEs. As locus length improves gene tree accuracy, this modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses. [Anchored hybrid enrichment; ants; ASTRAL; bait capture; carangimorph; Coleoptera; conserved nonexonic elements; exon capture; gene tree; Hymenoptera; mammal; phylogenomic markers; songbird; species tree; ultraconserved elements; weevils.]
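The merging step described in this abstract can be pictured with a small sketch: loci assigned to the same gene are concatenated taxon by taxon, and a taxon missing from one locus is padded with gaps. This is only an illustration under assumed inputs, not the authors' pipeline; the function name `merge_cogenic_loci`, the dictionary layout, and the locus and gene identifiers are hypothetical.

```python
from collections import defaultdict


def merge_cogenic_loci(alignments, locus_gene):
    """Concatenate per-locus alignments that fall within the same gene.

    alignments: locus id -> {taxon: aligned sequence}
    locus_gene: locus id -> gene id; loci absent from this map stay unmerged
    """
    groups = defaultdict(list)
    for locus in sorted(alignments):
        groups[locus_gene.get(locus, locus)].append(locus)

    merged = {}
    for gene, loci in groups.items():
        taxa = sorted({taxon for locus in loci for taxon in alignments[locus]})
        merged[gene] = {}
        for taxon in taxa:
            parts = []
            for locus in loci:
                aln = alignments[locus]
                length = len(next(iter(aln.values())))
                # Pad with gaps when this taxon was not captured for the locus.
                parts.append(aln.get(taxon, "-" * length))
            merged[gene][taxon] = "".join(parts)
    return merged


# Hypothetical toy data: uce1 and uce2 lie in the same gene, uce3 does not.
alignments = {
    "uce1": {"t1": "ACGT", "t2": "ACGA"},
    "uce2": {"t1": "GG"},
    "uce3": {"t1": "TT", "t2": "TA"},
}
locus_gene = {"uce1": "geneX", "uce2": "geneX"}
merged = merge_cogenic_loci(alignments, locus_gene)
print(merged)
```

The merged `geneX` alignment is longer than either of its component loci, which is the property the abstract credits for the improved bootstrap support.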

The impact of incongruence and exogenous gene fragments on estimates of the eukaryote root

10.1101/2021.04.08.438903 ◽

2021 ◽

Author(s):

Caesar Al Jewari ◽

Sandra L Baldauf

Keyword(s):

Phylogenetic Analyses ◽

Single Gene ◽

Evolutionary Model ◽

Bootstrap Support ◽

Gene Trees ◽

Full Data ◽

Data Set ◽

Gene Level ◽

The Impact ◽

Exogenous Gene

Phylogenomics uses multiple genetic loci to reconstruct evolutionary trees, under the stipulation that all combined loci share a common phylogenetic history, i.e., they are congruent. Congruence is primarily evaluated via single-gene trees, but these trees invariably lack sufficient signal to resolve deep nodes, making it difficult to assess congruence at these levels. Two methods were developed to systematically assess congruence in multi-locus data. Protocol_1 uses gene jackknifing to measure deviation from a central mean to identify taxon-specific incongruencies in the form of persistent outliers. Protocol_2 assesses congruence at the sub-gene level using a sliding window. Both protocols were tested on a controversial data set of 76 mitochondrial proteins previously used in various combinations to assess the eukaryote root. Protocol_1 showed a concentration of outliers in under-sampled taxa, including the pivotal taxon Discoba. Further analysis of Discoba using Protocol_2 detected a surprising number of apparently exogenous gene fragments, some of which overlap with Protocol_1 outliers and others that do not. Phylogenetic analyses of the full data using the static LG-gamma evolutionary model support a neozoan-excavate root for eukaryotes (Discoba sister), which rises to 99-100% bootstrap support with data masked according to either Protocol_1 or Protocol_2. In contrast, site-heterogeneous (mixture) models perform inconsistently with these data, yielding all three possible roots depending on presence/absence/type of masking and/or extent of missing data. The neozoan-excavate root places Amorphea (including animals and fungi) and Diaphoretickes (including plants) as more closely related to each other than either is to Discoba (Jakobida, Heterolobosea, and Euglenozoa), regardless of the presence/absence of additional taxa.

Aequatus: An open-source homology browser

10.1101/055632 ◽

2016 ◽

Cited By ~ 1

Author(s):

Anil S. Thanki ◽

Nicola Soranzo ◽

Javier Herrero ◽

Wilfried Haerty ◽

Robert P. Davey

Keyword(s):

Open Source ◽

Structural Changes ◽

Gene Tree ◽

Purifying Selection ◽

Gene Families ◽

Gene Trees ◽

Ancestral Gene ◽

Link Type ◽

The Galaxy ◽

Duplication Events

Background: Phylogenetic information inferred from the study of homologous genes helps us to understand the evolution of genes and gene families, including the identification of ancestral gene duplication events as well as regions under positive or purifying selection within lineages. Gene family and orthogroup characterisation enables the identification of syntenic blocks, which can then be visualised with various tools. Unfortunately, currently available tools display only an overview of syntenic regions as a whole, limited to the gene level, and none provide further details about structural changes within genes, such as the conservation of ancestral exon boundaries amongst multiple genomes. Findings: We present Aequatus, a standalone web-based tool that provides an in-depth view of gene structure across gene families, with various options to render and filter visualisations. It relies on pre-calculated alignment and gene feature information typically held in, but not limited to, the Ensembl Compara and Core databases. We also offer Aequatus.js, a reusable JavaScript module that fulfils the visualisation aspects of Aequatus, available within the Galaxy web platform as a visualisation plugin, which can be used to visualise gene trees generated by the GeneSeqToFamily workflow. Availability: Aequatus is an open-source tool freely available to download under the MIT license at https://github.com/TGAC/Aequatus. A demo server is available at http://aequatus.earlham.ac.uk/. A publicly available instance of the GeneSeqToFamily workflow to generate gene tree information and visualise it using Aequatus is available on the Galaxy EU server at https://[email protected] and [email protected]

EXACT SOLUTIONS FOR SPECIES TREE INFERENCE FROM DISCORDANT GENE TREES

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720013420055 ◽

2013 ◽

Vol 11 (05) ◽

pp. 1342005 ◽

Cited By ~ 16

Author(s):

WEN-CHIEH CHANG ◽

PAWEŁ GÓRECKI ◽

OLIVER EULENSTEIN

Keyword(s):

Exact Solutions ◽

Gene Tree ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Data Sets ◽

Gene Trees ◽

Species Trees ◽

Worst Case

Phylogenetic analysis has to overcome the grand challenge of inferring accurate species trees from evolutionary histories of gene families (gene trees) that are discordant with the species tree along whose branches they have evolved. Two well-studied approaches to cope with this challenge are to solve either biologically informed gene tree parsimony (GTP) problems under gene duplication, gene loss, and deep coalescence, or the classic RF supertree problem that does not rely on any biological model. Despite the potential of these problems to infer credible species trees, they are NP-hard. Therefore, these problems are addressed by heuristics that typically lack any provable accuracy and precision. We describe fast dynamic programming algorithms that solve the GTP problems and the RF supertree problem exactly, and demonstrate that our algorithms can solve instances with data sets consisting of as many as 22 taxa. Extensions of our algorithms can also report the number of all optimal species trees, as well as the trees themselves. To better assess the quality of the resulting species trees that best fit the given gene trees, we also compute the worst-case species trees, their numbers, and optimization score for each of the computational problems. Finally, we demonstrate the performance of our exact algorithms using empirical and simulated data sets, and analyze the quality of heuristic solutions for the studied problems by contrasting them with our exact solutions.

Phylogenomic Analyses Of 2,786 Genes In 158 Lineages Support a Root of The Eukaryotic Tree of Life Between Opisthokonts (Animals, Fungi and Their Microbial Relatives) and All Other Lineages

10.1101/2021.02.26.433005 ◽

2021 ◽

Author(s):

Mario A Ceron Romero ◽

Miguel M Fonseca ◽

Leonardo de Oliveira Martins ◽

David Posada ◽

Laura A Katz

Keyword(s):

Tree Species ◽

High Throughput Sequencing ◽

Single Gene ◽

Gene Tree ◽

Gene Families ◽

Species Tree ◽

Tree Of Life ◽

Gene Duplications ◽

Tree Reconciliation ◽

Phylogenomic Analyses

Advances in phylogenetics and high-throughput sequencing have allowed the reconstruction of deep phylogenetic relationships in the evolution of eukaryotes. Yet, the root of the eukaryotic tree of life remains elusive. The most popular hypothesis in textbooks and reviews is a root between Unikonta (Opisthokonta + Amoebozoa) and Bikonta (all other eukaryotes), which emerged from analyses of a single gene fusion. Subsequent highly cited studies based on concatenation of genes supported this hypothesis with some variations or proposed a root within Excavata. However, concatenation of genes neither considers phylogenetically informative events (i.e., gene duplications and losses) nor provides an estimate of the root. A more recent study using gene tree/species tree reconciliation methods suggested that the root lies between Opisthokonta and all other eukaryotes, but it included only 59 taxa and 20 genes. Here we apply a gene tree/species tree reconciliation approach to a gene-rich and taxon-rich dataset (i.e., 2,786 gene families from two sets of 158 diverse eukaryotic lineages) to assess the root, and we iterate each analysis 100 times to quantify tree space uncertainty. We estimate a root between Fungi and all other eukaryotes, or between Opisthokonta and all other eukaryotes, and reject alternative roots from the literature. Based on further analysis of genome size, we propose Opisthokonta + others as the most likely root.

Species Tree Estimation from Genome-wide Data with Guenomu

10.1101/023861 ◽

2015 ◽

Author(s):

Leonardo de Oliveira Martins ◽

David Posada

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Gene Families ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Multiple Sources ◽

Reconstruction Methods ◽

Tree Topologies

The history of particular genes and that of the species that carry them can differ for several reasons. In particular, gene trees and species trees can truly differ due to well-known evolutionary processes like gene duplication and loss, lateral gene transfer, or incomplete lineage sorting. Different species tree reconstruction methods have been developed to take this incongruence into account; they can be broadly divided into supertree and supermatrix approaches. Here, we introduce a new Bayesian hierarchical model, implemented in the program Guenomu, that considers multiple sources of gene tree/species tree disagreement. Guenomu takes as input the posterior distributions of unrooted gene tree topologies for multiple gene families in order to estimate the posterior distribution of rooted species tree topologies.

Endosymbiotic gene transfer from prokaryotic pangenomes: Inherited chimerism in eukaryotes

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1421385112 ◽

2015 ◽

Vol 112 (33) ◽

pp. 10139-10146 ◽

Cited By ~ 61

Author(s):

Chuan Ku ◽

Shijulal Nelson-Sathi ◽

Mayo Roettger ◽

Sriram Garg ◽

Einat Hazkani-Covo ◽

...

Keyword(s):

Gene Transfer ◽

Single Gene ◽

Gene Tree ◽

Eukaryotic Cell ◽

Specific Gene ◽

Gene Trees ◽

Endosymbiotic Gene Transfer ◽

Evolutionary Innovation ◽

Eukaryotic Genes ◽

A Genome

Endosymbiotic theory in eukaryotic-cell evolution rests upon a foundation of three cornerstone partners—the plastid (a cyanobacterium), the mitochondrion (a proteobacterium), and its host (an archaeon)—and carries a corollary that, over time, the majority of genes once present in the organelle genomes were relinquished to the chromosomes of the host (endosymbiotic gene transfer). However, notwithstanding eukaryote-specific gene inventions, single-gene phylogenies have never traced eukaryotic genes to three single prokaryotic sources, an issue that hinges crucially upon factors influencing phylogenetic inference. In the age of genomes, single-gene trees, once used to test the predictions of endosymbiotic theory, now spawn new theories that stand to eventually replace endosymbiotic theory with descriptive, gene tree-based variants featuring supernumerary symbionts: prokaryotic partners distinct from the cornerstone trio and whose existence is inferred solely from single-gene trees. We reason that the endosymbiotic ancestors of mitochondria and chloroplasts brought into the eukaryotic—and plant and algal—lineage a genome-sized sample of genes from the proteobacterial and cyanobacterial pangenomes of their respective day and that, even if molecular phylogeny were artifact-free, sampling prokaryotic pangenomes through endosymbiotic gene transfer would lead to inherited chimerism. Recombination in prokaryotes (transduction, conjugation, transformation) differs from recombination in eukaryotes (sex). Prokaryotic recombination leads to pangenomes, and eukaryotic recombination leads to vertical inheritance. Viewed from the perspective of endosymbiotic theory, the critical transition at the eukaryote origin that allowed escape from Muller’s ratchet—the origin of eukaryotic recombination, or sex—might have required surprisingly little evolutionary innovation.

SimPhy: Phylogenomic Simulation of Gene, Locus and Species Trees

10.1101/021709 ◽

2015 ◽

Cited By ~ 2

Author(s):

Diego Mallo ◽

Leonardo de Oliveira Martins ◽

David Posada

Keyword(s):

Incomplete Lineage Sorting ◽

A Priori ◽

Gene Tree ◽

Gene Families ◽

Rate Variation ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Sequence Alignments ◽

Large Trees

We present here a fast and flexible software–SimPhy–for the simulation of multiple gene families evolving under incomplete lineage sorting, gene duplication and loss, horizontal gene transfer—all three potentially leading to the species tree/gene tree discordance—and gene conversion. SimPhy implements a hierarchical phylogenetic model in which the evolution of species, locus and gene trees is governed by global and local parameters (e.g., genome-wide, species-specific, locus-specific), that can be fixed or be sampled from a priori statistical distributions. SimPhy also incorporates comprehensive models of substitution rate variation among lineages (uncorrelated relaxed clocks) and the capability of simulating partitioned nucleotide, codon and protein multilocus sequence alignments under a plethora of substitution models using the program INDELible. We validate SimPhy's output using theoretical expectations and other programs, and show that it scales extremely well with complex models and/or large trees, being an order of magnitude faster than the most similar program (DLCoal-Sim). In addition, we demonstrate how SimPhy can be useful to understand interactions among different evolutionary processes, conducting a simulation study to characterize the systematic overestimation of the duplication time when using standard reconciliation methods. SimPhy is available at https://github.com/adamallo/SimPhy, where users can find the source code, pre-compiled executables, a detailed manual and example cases.

Non-parametric correction of estimated gene trees using TRACTION

Algorithms for Molecular Biology ◽

10.1186/s13015-019-0161-8 ◽

2020 ◽

Vol 15 (1) ◽

Author(s):

Sarah Christensen ◽

Erin K. Molloy ◽

Pranjal Vachaspati ◽

Ananya Yammanuru ◽

Tandy Warnow

Keyword(s):

Incomplete Lineage Sorting ◽

Phylogenetic Signal ◽

Estimation Error ◽

Single Gene ◽

Gene Tree ◽

Species Tree ◽

Computational Techniques ◽

Gene Trees ◽

Species Trees ◽

Sequencing Data

Motivation: Estimated gene trees are often inaccurate, due to insufficient phylogenetic signal in the single gene alignment, among other causes. Gene tree correction aims to improve the accuracy of an estimated gene tree by using computational techniques along with auxiliary information, such as a reference species tree or sequencing data. However, gene trees and species trees can differ as a result of gene duplication and loss (GDL), incomplete lineage sorting (ILS), and other biological processes. Thus gene tree correction methods need to take estimation error as well as gene tree heterogeneity into account. Many prior gene tree correction methods have been developed for the case where GDL is present. Results: Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to ILS and/or horizontal gene transfer (HGT). We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-optimal tree refinement and completion (RF-OTRC) Problem, which seeks a refinement and completion of a singly-labeled gene tree with respect to a given singly-labeled species tree so as to minimize the Robinson-Foulds (RF) distance. Our extensive simulation study on 68,000 estimated gene trees shows that TRACTION matches or improves on the accuracy of well-established methods from the GDL literature when HGT and ILS are both present, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. We also show that a naive generalization of the RF-OTRC problem to multi-labeled trees is possible, but can produce misleading results where gene tree heterogeneity is due to GDL.
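The objective that RF-OTRC minimizes is the Robinson-Foulds distance, which for two singly-labeled trees on the same leaf set counts the non-trivial bipartitions present in one tree but not the other. The sketch below computes only that distance; it does not implement TRACTION's refinement and completion steps. The nested-tuple tree encoding, the function names, and the example trees are assumptions made for illustration.

```python
def bipartitions(tree, all_leaves):
    """Non-trivial bipartitions of a tree given as nested tuples of leaf names."""
    anchor = min(all_leaves)  # fixed reference leaf used to normalize each split
    found = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        # Represent each split by the side that excludes the anchor leaf, so
        # the same split is encoded identically wherever the tree is rooted.
        side = all_leaves - below if anchor in below else below
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return below

    leaves_below(tree)
    return found


def rf_distance(tree_a, tree_b, leaves):
    """Number of bipartitions found in exactly one of the two trees."""
    leaves = frozenset(leaves)
    return len(bipartitions(tree_a, leaves) ^ bipartitions(tree_b, leaves))


# Hypothetical five-leaf trees that differ in one internal edge.
leaves = ["a", "b", "c", "d", "e"]
species_tree = ((("a", "b"), "c"), ("d", "e"))
gene_tree = ((("a", "c"), "b"), ("d", "e"))

print(rf_distance(species_tree, gene_tree, leaves))
print(rf_distance(species_tree, species_tree, leaves))
```

Because each split is stored as the side that excludes a fixed anchor leaf, the rooting implied by the nested tuples does not affect the result, which matches the unrooted definition of the distance.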

Mirage: A phylogenetic mixture model to reconstruct gene-content evolutionary history using a realistic evolutionary rate model

10.1101/2020.10.09.333286 ◽

2020 ◽

Author(s):

Tsukasa Fukunaga ◽

Wataru Iwasaki

Keyword(s):

Evolutionary History ◽

Evolutionary Rate ◽

Simulation Analysis ◽

Evolutionary Model ◽

Gene Families ◽

Gene Content ◽

Gene Gain ◽

Rate Model ◽

Gain Loss ◽

Loss Rates

Reconstruction of gene-content evolutionary history is an essential approach for understanding how complex biological systems have been organized. However, the existing gene-content evolutionary models cannot formulate complex and heterogeneous gene gain/loss processes, which reflect diverse evolutionary events and greatly depend on gene families. In this study, we developed Mirage (MIxture model with a Realistic evolutionary rate model for Ancestral Genome Estimation), which allows different gene families to have flexible gene gain/loss rates, but reasonably limits the number of parameters to be estimated by the expectation-maximization algorithm. Simulation analysis showed that Mirage can accurately estimate complex and heterogeneous gene gain/loss rates and reconstruct gene-content evolutionary history. Application to empirical datasets demonstrated that our evolutionary model better fits genome data from various taxonomic groups than other models. Using Mirage, we revealed that metabolic function-related gene families displayed frequent gene gains and losses in all taxa investigated. The source code of Mirage is freely available at https://github.com/fukunagatsu/Mirage.
