Maximum Likelihood Estimation of Species Trees from Gene Trees in the Presence of Ancestral Population Structure

Hillary Koch; Michael DeGiorgio

doi:10.1093/gbe/evaa022

Maximum Likelihood Estimation of Species Trees from Gene Trees in the Presence of Ancestral Population Structure

Genome Biology and Evolution ◽

10.1093/gbe/evaa022 ◽

2020 ◽

Vol 12 (2) ◽

pp. 3977-3995 ◽

Cited By ~ 1

Author(s):

Hillary Koch ◽

Michael DeGiorgio

Keyword(s):

Population Structure ◽

Maximum Likelihood ◽

Gene Tree ◽

Species Tree ◽

Ease Of Use ◽

Likelihood Method ◽

Ancestral Population ◽

Gene Trees ◽

Species Trees ◽

Data Set

Abstract Though large multilocus genomic data sets have led to overall improvements in phylogenetic inference, they have posed the new challenge of addressing conflicting signals across the genome. In particular, ancestral population structure, which has been uncovered in a number of diverse species, can skew gene tree frequencies, thereby hindering the performance of species tree estimators. Here we develop a novel maximum likelihood method, termed TASTI (Taxa with Ancestral structure Species Tree Inference), that can infer phylogenies under such scenarios, and find that it has increasing accuracy with increasing numbers of input gene trees, contrasting with the relatively poor performances of methods not tailored for ancestral structure. Moreover, we propose a supertree approach that allows TASTI to scale computationally with increasing numbers of input taxa. We use genetic simulations to assess TASTI’s performance in the three- and four-taxon settings and demonstrate the application of TASTI on a six-species Afrotropical mosquito data set. Finally, we have implemented TASTI in an open-source software package for ease of use by the scientific community.

Download Full-text

Maximum likelihood estimation of species trees from gene trees in the presence of ancestral population structure

10.1101/700161 ◽

2019 ◽

Author(s):

Hillary Koch ◽

Michael DeGiorgio

Keyword(s):

Population Structure ◽

Maximum Likelihood ◽

Gene Tree ◽

Likelihood Estimation ◽

Ease Of Use ◽

Likelihood Method ◽

Ancestral Population ◽

Gene Trees ◽

Species Trees ◽

Open Source Software Package

AbstractThough large multilocus genomic datasets have led to overall improvements in phylogenetic inference, they have posed the new challenge of addressing conflicting signals across the genome. In particular, ancestral population structure, which has been uncovered in a number of diverse species, can skew gene tree frequencies, thereby hindering the performance of species tree estimators. Here we develop a novel maximum likelihood method, termed TASTI, that can infer phylogenies under such scenarios, and find that it has increasing accuracy with increasing numbers of input gene trees, contrasting with the relatively poor performances of methods not tailored for ancestral structure. Moreover, we propose a supertree approach that allows TASTI to scale computationally with increasing numbers of input taxa. We use genetic simulations to assess TASTI’s performance in the four-taxon setting, and demonstrate the application of TASTI on a six-species Afrotropical mosquito dataset. Finally, we have implemented TASTI in an open-source software package for ease of use by the scientific community.

Download Full-text

Terraces in Gene Tree Reconciliation-Based Species Tree Inference

10.1101/2020.04.17.047092 ◽

2020 ◽

Author(s):

Michael J. Sanderson ◽

Michelle M. McMahon ◽

Mike Steel

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Solution Space ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Data Set ◽

Tree Reconciliation ◽

The Impact

AbstractTerraces in phylogenetic tree space are sets of trees with identical optimality scores for a given data set, arising from missing data. These were first described for multilocus phylogenetic data sets in the context of maximum parsimony inference and maximum likelihood inference under certain model assumptions. Here we show how the mathematical properties that lead to terraces extend to gene tree - species tree problems in which the gene trees are incomplete. Inference of species trees from either sets of gene family trees subject to duplication and loss, or allele trees subject to incomplete lineage sorting, can exhibit terraces in their solution space. First, we show conditions that lead to a new kind of terrace, which stems from subtree operations that appear in reconciliation problems for incomplete trees. Then we characterize when terraces of both types can occur when the optimality criterion for tree search is based on duplication, loss or deep coalescence scores. Finally, we examine the impact of assumptions about the causes of losses: whether they are due to imperfect sampling or true evolutionary deletion.

Download Full-text

Bayes Estimation of Species Divergence Times and Ancestral Population Sizes Using DNA Sequences From Multiple Loci

Genetics ◽

10.1093/genetics/164.4.1645 ◽

2003 ◽

Vol 164 (4) ◽

pp. 1645-1656 ◽

Cited By ~ 3

Author(s):

Bruce Rannala ◽

Ziheng Yang

Keyword(s):

Dna Sequences ◽

Gene Tree ◽

Species Tree ◽

Bayes Estimation ◽

Ancestral Population ◽

Divergence Times ◽

Gene Trees ◽

Species Divergence ◽

Multiple Loci ◽

Population Sizes

Abstract The effective population sizes of ancestral as well as modern species are important parameters in models of population genetics and human evolution. The commonly used method for estimating ancestral population sizes, based on counting mismatches between the species tree and the inferred gene trees, is highly biased as it ignores uncertainties in gene tree reconstruction. In this article, we develop a Bayes method for simultaneous estimation of the species divergence times and current and ancestral population sizes. The method uses DNA sequence data from multiple loci and extracts information about conflicts among gene tree topologies and coalescent times to estimate ancestral population sizes. The topology of the species tree is assumed known. A Markov chain Monte Carlo algorithm is implemented to integrate over uncertain gene trees and branch lengths (or coalescence times) at each locus as well as species divergence times. The method can handle any species tree and allows different numbers of sequences at different loci. We apply the method to published noncoding DNA sequences from the human and the great apes. There are strong correlations between posterior estimates of speciation times and ancestral population sizes. With the use of an informative prior for the human-chimpanzee divergence date, the population size of the common ancestor of the two species is estimated to be ∼20,000, with a 95% credibility interval (8000, 40,000). Our estimates, however, are affected by model assumptions as well as data quality. We suggest that reliable estimates have yet to await more data and more realistic models.

Download Full-text

How to Tackle Phylogenetic Discordance in Recent and Rapidly Radiating Groups? Developing a Workflow Using Loricaria (Asteraceae) as an Example

Frontiers in Plant Science ◽

10.3389/fpls.2021.765719 ◽

2022 ◽

Vol 12 ◽

Author(s):

Martha Kandziora ◽

Petr Sklenář ◽

Filip Kolář ◽

Roswitha Schmickl

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Secondary Contact ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Phylogenetic Discordance ◽

Gene Tree Discordance ◽

High Degree

A major challenge in phylogenetics and -genomics is to resolve young rapidly radiating groups. The fast succession of species increases the probability of incomplete lineage sorting (ILS), and different topologies of the gene trees are expected, leading to gene tree discordance, i.e., not all gene trees represent the species tree. Phylogenetic discordance is common in phylogenomic datasets, and apart from ILS, additional sources include hybridization, whole-genome duplication, and methodological artifacts. Despite a high degree of gene tree discordance, species trees are often well supported and the sources of discordance are not further addressed in phylogenomic studies, which can eventually lead to incorrect phylogenetic hypotheses, especially in rapidly radiating groups. We chose the high-Andean Asteraceae genus Loricaria to shed light on the potential sources of phylogenetic discordance and generated a phylogenetic hypothesis. By accounting for paralogy during gene tree inference, we generated a species tree based on hundreds of nuclear loci, using Hyb-Seq, and a plastome phylogeny obtained from off-target reads during target enrichment. We observed a high degree of gene tree discordance, which we found implausible at first sight, because the genus did not show evidence of hybridization in previous studies. We used various phylogenomic analyses (trees and networks) as well as the D-statistics to test for ILS and hybridization, which we developed into a workflow on how to tackle phylogenetic discordance in recent radiations. We found strong evidence for ILS and hybridization within the genus Loricaria. Low genetic differentiation was evident between species located in different Andean cordilleras, which could be indicative of substantial introgression between populations, promoted during Pleistocene glaciations, when alpine habitats shifted creating opportunities for secondary contact and hybridization.

Download Full-text

SpeciesRax: A tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss.

10.1101/2021.03.29.437460 ◽

2021 ◽

Author(s):

Benoit Morel ◽

Paul Schade ◽

Sarah Lutteropp ◽

Tom A. Williams ◽

Gergely J. Szöllösi ◽

...

Keyword(s):

Maximum Likelihood ◽

Gene Family ◽

Gene Families ◽

Species Tree ◽

Likelihood Method ◽

Gene Trees ◽

Informative Signal ◽

Tree Inference ◽

Species Tree Inference ◽

Family Trees

Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modelling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated datasets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large datasets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising $188$ species from $31612$ gene families in one hour using $40$ cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.

Download Full-text

Efficient Maximum-Likelihood Inference For The Isolation-With-Initial-Migration Model With Potentially Asymmetric Gene Flow

10.1101/052894 ◽

2016 ◽

Author(s):

Rui J. Costa ◽

Hilde Wilkinson-Herbots

Keyword(s):

Gene Flow ◽

Maximum Likelihood ◽

Dna Sequences ◽

Computing Time ◽

Likelihood Method ◽

Ancestral Population ◽

Parameter Estimates ◽

Fast Method ◽

Data Set ◽

Im Model

AbstractThe isolation-with-migration (IM) model is commonly used to make inferences about gene flow during speciation, using polymorphism data. However, Becquet and Przeworski (2009) report that the parameter estimates obtained by fitting the IM model are very sensitive to the model's assumptions (including the assumption of constant gene flow until the present). This paper is concerned with the isolation-with-initial-migration (IIM) model of Wilkinson-Herbots (2012), which drops precisely this assumption. In the IIM model, one ancestral population divides into two descendant subpopulations, between which there is an initial period of gene flow and a subsequent period of isolation. We derive a very fast method of fitting an extended version of the IIM model, which also allows for asymmetric gene flow and unequal population sizes. This is a maximum-likelihood method, applicable to data on the number of segregating sites between pairs of DNA sequences from a large number of independent loci. In addition to obtaining parameter estimates, our method can also be used to distinguish between alternative models representing different evolutionary scenarios, by means of likelihood ratio tests. We illustrate the procedure on pairs of Drosophila sequences from approximately 30,000 loci. The computing time needed to fit the most complex version of the model to this data set is only a couple of minutes. The R code to fit the IIM model can be found in the supplementary files of this paper.

Download Full-text

Disentangling Sources of Gene Tree Discordance in Phylogenomic Data Sets: Testing Ancient Hybridizations in Amaranthaceae s.l

Systematic Biology ◽

10.1093/sysbio/syaa066 ◽

2020 ◽

Cited By ~ 1

Author(s):

Diego F Morales-Briones ◽

Gudrun Kadereit ◽

Delphine T Tefarikis ◽

Michael J Moore ◽

Stephen A Smith ◽

...

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Species Tree ◽

Data Sets ◽

Species Trees ◽

Lineage Sorting ◽

Data Set ◽

Gene Tree Discordance

Abstract Gene tree discordance in large genomic data sets can be caused by evolutionary processes such as incomplete lineage sorting and hybridization, as well as model violation, and errors in data processing, orthology inference, and gene tree estimation. Species tree methods that identify and accommodate all sources of conflict are not available, but a combination of multiple approaches can help tease apart alternative sources of conflict. Here, using a phylotranscriptomic analysis in combination with reference genomes, we test a hypothesis of ancient hybridization events within the plant family Amaranthaceae s.l. that was previously supported by morphological, ecological, and Sanger-based molecular data. The data set included seven genomes and 88 transcriptomes, 17 generated for this study. We examined gene-tree discordance using coalescent-based species trees and network inference, gene tree discordance analyses, site pattern tests of introgression, topology tests, synteny analyses, and simulations. We found that a combination of processes might have generated the high levels of gene tree discordance in the backbone of Amaranthaceae s.l. Furthermore, we found evidence that three consecutive short internal branches produce anomalous trees contributing to the discordance. Overall, our results suggest that Amaranthaceae s.l. might be a product of an ancient and rapid lineage diversification, and remains, and probably will remain, unresolved. This work highlights the potential problems of identifiability associated with the sources of gene tree discordance including, in particular, phylogenetic network methods. Our results also demonstrate the importance of thoroughly testing for multiple sources of conflict in phylogenomic analyses, especially in the context of ancient, rapid radiations. We provide several recommendations for exploring conflicting signals in such situations. [Amaranthaceae; gene tree discordance; hybridization; incomplete lineage sorting; phylogenomics; species network; species tree; transcriptomics.]

Download Full-text

Genomic Characterization and Curation of UCEs Improves Species Tree Reconstruction

Systematic Biology ◽

10.1093/sysbio/syaa063 ◽

2020 ◽

Author(s):

Matthew H Van Dam ◽

James B Henderson ◽

Lauren Esposito ◽

Michelle Trautwein

Keyword(s):

Marine Invertebrates ◽

Phylogenetic Analyses ◽

Single Gene ◽

Gene Tree ◽

Single Copy ◽

Species Tree ◽

Bootstrap Support ◽

Gene Trees ◽

Species Trees ◽

Tree Reconstruction

Abstract Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes does not require prior knowledge of genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here, we characterized UCEs from 11 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated four different sets of UCE markers by genomic category from five different studies including: birds, mammals, fish, Hymenoptera (ants, wasps, and bees), and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by two or more UCEs, corresponding to nonoverlapping segments of a single gene. We considered these UCEs to be nonindependent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging cogenic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all data sets apparently driven by the increase in loci length. Additionally, we conducted simulations and found that gene trees generated from merged UCEs were more accurate than those generated by unmerged UCEs. As loci length improves gene tree accuracy, this modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses. [Anchored hybrid enrichment; ants; ASTRAL; bait capture; carangimorph; Coleoptera; conserved nonexonic elements; exon capture; gene tree; Hymenoptera; mammal; phylogenomic markers; songbird; species tree; ultraconserved elements; weevils.]

Download Full-text

Phylogenomic terraces: presence and implication in species tree estimation from gene trees

10.1101/2020.04.19.048843 ◽

2020 ◽

Author(s):

Ishrat Tanzila Farah ◽

Md Muktadirul Islam ◽

Kazi Tasnim Zinat ◽

Atif Hasan Rahman ◽

Md Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Deep Coalescence ◽

Tree Estimation ◽

Tree Space ◽

Multiple Species

AbstractSpecies tree estimation from multi-locus dataset is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed which estimate gene trees and then combine the gene trees to estimate a species tree by optimizing various optimization scores. In this study, we have formalized the concept of “phylogenomic terraces” in the species tree space, where multiple species trees with distinct topologies may have exactly the same optimization score (quartet score, extra lineage score, etc.) with respect to a collection of gene trees. We investigated the presence and implication of terraces in species tree estimation from multi-locus data by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. Our experiments, on a collection of dataset simulated under ILS, indicate that MDC-based methods may achieve competitive or identical quartet consistency score as MQC but could be significantly worse than MQC in terms of tree accuracy – demonstrating the presence and affect of phylogenomic terraces. This is the first known study that formalizes the concept of phylogenomic terraces in the context of species tree estimation from multi-locus data, and reports the presence and implications of terraces in species tree estimation under ILS.

Download Full-text

Overtraining often results in topologically incorrect species trees with maximum likelihood methods

10.1101/140780 ◽

2017 ◽

Author(s):

Damien M. de Vienne ◽

Fran Supek ◽

Toni Gabaldon

Keyword(s):

Maximum Likelihood ◽

Species Tree ◽

Small Sample ◽

Optimization Process ◽

Evolutionary Relationships ◽

Gene Trees ◽

Species Trees ◽

Tree Reconstruction ◽

Reference Tree ◽

Gold Standard Reference

AbstractBackgroundOvertraining occurs when an optimization process is applied for too many steps, leading to a model describing noise in addition to the signal present in the data. This effect may affect typical approaches for species tree reconstruction that use maximum likelihood optimization procedures on a small sample of concatenated genes. In this context, overtraining may result in trees better describing the specific evolutionary history of the sampled genes rather than the sought evolutionary relationships among the species.ResultsUsing a cross-validation-like approach on real and simulated datasets we showed that overtraining occurs in a significant fraction of cases, leading to species trees that are more distant from a gold-standard reference tree than a previously considered (and rejected) solution in the optimization process. However, we show that the shape of the likelihood curve is informative of the optimal stopping point. As expected, overtraining is aggravated in smaller gene samples and in datasets with increased levels of topological variation among gene trees, but occurs also in controlled, simulated scenarios where a common underlying topology is enforced.ConclusionsOvertraining is frequent in species tree reconstruction and leads to a final tree that is worse in describing the evolutionary relationships of the species under study than an earlier (and rejected) solution encountered during the likelihood optimization process. This result should help develop specific methods for species tree reconstruction in the future, and may improve our understanding of the complexity of tree likelihood landscapes.

Download Full-text