Accounting for Uncertainty in Gene Tree Estimation: Summary-Coalescent Species Tree Inference in a Challenging Radiation of Australian Lizards

Mapping Intimacies ◽

10.1101/056085 ◽

2016 ◽

Author(s):

Mozes P.K. Blom ◽

Jason G. Bragg ◽

Sally Potter ◽

Craig Moritz

Keyword(s):

Phylogenetic Signal ◽

Gene Tree ◽

Phylogenetic Inference ◽

Species Tree ◽

Gene Trees ◽

Tree Inference ◽

Tree Estimation ◽

Tree Resolution ◽

The Impact ◽

Species Tree Inference

Abstract Accurate gene tree inference is an important aspect of species tree estimation in a summary-coalescent framework. Yet, in empirical studies, inferred gene trees differ in accuracy due to stochastic variation in phylogenetic signal between targeted loci. Empiricists should therefore examine the consistency of species tree inference, while accounting for the observed heterogeneity in gene tree resolution of phylogenomic datasets. Here, we assess the impact of gene tree estimation error on summary-coalescent species tree inference by screening ~2000 exonic loci based on gene tree resolution prior to phylogenetic inference. We focus on a phylogenetically challenging radiation of Australian lizards (genus Cryptoblepharus, Scincidae) and explore effects on topology and support. We identify a well-supported topology based on all loci and find that a relatively small number of high-resolution gene trees can be sufficient to converge on the same topology. Adding gene trees with decreasing resolution produced a generally consistent topology, and increased confidence for specific bipartitions that were poorly supported when using a small number of informative loci. This corroborates coalescent-based simulation studies that have highlighted the need for a large number of loci to confidently resolve challenging relationships and refutes the notion that low-resolution gene trees introduce phylogenetic noise. Further, our study also highlights the value of quantifying changes in nodal support across locus subsets of increasing size (but decreasing gene tree resolution). Such detailed analyses can reveal anomalous fluctuations in support at some nodes, suggesting the possibility of model violation. By characterizing the heterogeneity in phylogenetic signal among loci, we can account for uncertainty in gene tree estimation and assess its effect on the consistency of the species tree estimate. We suggest that the evaluation of gene tree resolution should be incorporated in the analysis of empirical phylogenomic datasets. This will ultimately increase our confidence in species tree estimation using summary-coalescent methods and enable us to exploit genomic data for phylogenetic inference.
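The screening idea in this abstract (rank loci by gene tree resolution, then analyse subsets of increasing size but decreasing resolution) can be sketched roughly as follows. This is only an illustration: the abstract does not say how resolution was measured, so the use of per-branch support values, the threshold of 70, the locus names, and the function names are all assumptions.

```python
from typing import Dict, List


def resolution(supports: List[float], threshold: float = 70.0) -> float:
    """Fraction of internal branches whose support meets the (assumed) threshold."""
    if not supports:
        return 0.0
    return sum(s >= threshold for s in supports) / len(supports)


def nested_subsets(loci: Dict[str, List[float]], step: int) -> List[List[str]]:
    """Rank loci from best to worst resolved, then return cumulative subsets."""
    ranked = sorted(loci, key=lambda name: resolution(loci[name]), reverse=True)
    # Each subset adds the next `step` loci, so later subsets are larger
    # but include gene trees with lower resolution.
    return [ranked[:n] for n in range(step, len(ranked) + step, step)]


# Hypothetical support values for four made-up loci.
loci = {
    "locus_a": [95, 88, 40],
    "locus_b": [30, 20, 10],
    "locus_c": [99, 97, 91],
    "locus_d": [75, 60, 50],
}
subsets = nested_subsets(loci, step=2)
```

Each subset would then be passed to a summary-coalescent method, and nodal support compared across subsets.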


Phylogenetic conflicts, combinability, and deep phylogenomics in plants

10.1101/371930 ◽

2018 ◽

Cited By ~ 1

Author(s):

Stephen A. Smith ◽

Nathanael Walker-Hale ◽

Joseph F. Walker ◽

Joseph W. Brown

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Signal ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Data Filtering ◽

Tree Inference ◽

Tree Methods ◽

Inference Methods ◽

Species Tree Inference

Abstract Studies have demonstrated that pervasive gene tree conflict underlies several important phylogenetic relationships where different species tree methods produce conflicting results. Here, we present a means of dissecting the phylogenetic signal for alternative resolutions within a dataset in order to resolve recalcitrant relationships and, importantly, identify what the dataset is unable to resolve. These procedures extend upon methods for isolating conflict and concordance involving specific candidate relationships and can be used to identify systematic error and disambiguate sources of conflict among species tree inference methods. We demonstrate these on a large phylogenomic plant dataset. Our results support the placement of Amborella as sister to the remaining extant angiosperms, Gnetales as sister to pines, and the monophyly of extant gymnosperms. Several other contentious relationships, including the resolution of relationships within the bryophytes and the eudicots, remain uncertain given the low number of supporting gene trees. To address whether concatenation of filtered genes amplified phylogenetic signal for relationships, we implemented a combinatorial heuristic to test combinability of genes. We found that nested conflicts limited the ability of data filtering methods to fully ameliorate conflicting signal amongst gene trees. These analyses confirmed that the underlying conflicting signal does not support broad concatenation of genes. Our approach provides a means of dissecting a specific dataset to address deep phylogenetic relationships while also identifying the inferential boundaries of the dataset.


Accounting for Uncertainty in Gene Tree Estimation: Summary-Coalescent Species Tree Inference in a Challenging Radiation of Australian Lizards

Systematic Biology ◽

10.1093/sysbio/syw089 ◽

2016 ◽

pp. syw089 ◽

Cited By ~ 6

Author(s):

Mozes P. K. Blom ◽

Jason G. Bragg ◽

Sally Potter ◽

Craig Moritz

Keyword(s):

Gene Tree ◽

Species Tree ◽

Tree Inference ◽

Tree Estimation ◽

Species Tree Inference


Phylogenetic Conflicts, Combinability, and Deep Phylogenomics in Plants

Systematic Biology ◽

10.1093/sysbio/syz078 ◽

2019 ◽

Vol 69 (3) ◽

pp. 579-592 ◽

Cited By ~ 6

Author(s):

Stephen A Smith ◽

Nathanael Walker-Hale ◽

Joseph F Walker ◽

Joseph W Brown

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Signal ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Data Set ◽

Tree Inference ◽

Tree Methods ◽

Plant Data ◽

Inference Methods

Abstract Studies have demonstrated that pervasive gene tree conflict underlies several important phylogenetic relationships where different species tree methods produce conflicting results. Here, we present a means of dissecting the phylogenetic signal for alternative resolutions within a data set in order to resolve recalcitrant relationships and, importantly, identify what the data set is unable to resolve. These procedures extend upon methods for isolating conflict and concordance involving specific candidate relationships and can be used to identify systematic error and disambiguate sources of conflict among species tree inference methods. We demonstrate these on a large phylogenomic plant data set. Our results support the placement of Amborella as sister to the remaining extant angiosperms, Gnetales as sister to pines, and the monophyly of extant gymnosperms. Several other contentious relationships, including the resolution of relationships within the bryophytes and the eudicots, remain uncertain given the low number of supporting gene trees. To address whether concatenation of filtered genes amplified phylogenetic signal for relationships, we implemented a combinatorial heuristic to test combinability of genes. We found that nested conflicts limited the ability of data filtering methods to fully ameliorate conflicting signal amongst gene trees. These analyses confirmed that the underlying conflicting signal does not support broad concatenation of genes. Our approach provides a means of dissecting a specific data set to address deep phylogenetic relationships while also identifying the inferential boundaries of the data set. [Angiosperms; coalescent; gene-tree conflict; genomics; phylogenetics; phylogenomics.]
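One way to picture "isolating conflict and concordance involving specific candidate relationships" is to count, for a candidate bipartition, how many gene trees contain it, how many contain a bipartition incompatible with it, and how many are uninformative. The sketch below is not the authors' procedure; it assumes every gene tree has the same taxon set and is given as a set of bipartitions, and the taxon labels and function names are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Split = FrozenSet[str]


def conflicts(split: Split, other: Split, taxa: FrozenSet[str]) -> bool:
    """Two bipartitions conflict when all four side-intersections are non-empty."""
    a, b = split, taxa - split
    c, d = other, taxa - other
    return all((a & c, a & d, b & c, b & d))


def tally(candidate: Split, gene_trees: List[Set[Split]],
          taxa: FrozenSet[str]) -> Tuple[int, int, int]:
    """Return (supporting, conflicting, uninformative) gene tree counts."""
    support = conflict = 0
    for splits in gene_trees:
        if candidate in splits or (taxa - candidate) in splits:
            support += 1
        elif any(conflicts(candidate, s, taxa) for s in splits):
            conflict += 1
    return support, conflict, len(gene_trees) - support - conflict


# Hypothetical five-taxon example with placeholder labels A..E.
taxa = frozenset("ABCDE")
candidate = frozenset("AB")
gene_trees = [
    {frozenset("AB"), frozenset("CD")},  # contains the candidate
    {frozenset("AC"), frozenset("DE")},  # AC|BDE is incompatible with AB|CDE
    {frozenset("CD")},                   # unresolved for the candidate
]
counts = tally(candidate, gene_trees, taxa)
```

A low supporting count relative to the conflicting and uninformative counts would correspond to the situation the abstract describes as a relationship the dataset cannot resolve.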


Impact of Ghost Introgression on Coalescent-based Species Tree Inference and Estimation of Divergence Time

10.1101/2022.01.11.475787 ◽

2022 ◽

Author(s):

XiaoXu Pang ◽

Da-Yong Zhang

Keyword(s):

Incomplete Lineage Sorting ◽

Divergence Time ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Multispecies Coalescent ◽

Tree Inference ◽

Tree Methods ◽

The Impact

The species studied in any evolutionary investigation generally constitute a very small proportion of all the species currently existing or that have gone extinct. It is therefore likely that introgression, which is widespread across the tree of life, involves "ghosts," i.e., unsampled, unknown, or extinct lineages. However, the impact of ghost introgression on estimations of species trees has been rarely studied and is thus poorly understood. In this study, we use mathematical analysis and simulations to examine the robustness of species tree methods based on a multispecies coalescent model under gene flow sourcing from an extant or ghost lineage. We found that very low levels of extant or ghost introgression can result in anomalous gene trees (AGTs) on three-taxon rooted trees if accompanied by strong incomplete lineage sorting (ILS). In contrast, even massive introgression, with more than half of the recipient genome descending from the donor lineage, may not necessarily lead to AGTs. In cases involving an ingroup lineage (defined as one that diverged no earlier than the most basal species under investigation) acting as the donor of introgression, the time of root divergence among the investigated species was either underestimated or remained unaffected, but for the cases of outgroup ghost lineages acting as donors, the divergence time was generally overestimated. Under many conditions of ingroup introgression, the stronger the ILS was, the higher was the accuracy of estimating the time of root divergence, although the topology of the species tree is more prone to be biased by the effect of introgression.
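As background for the anomalous gene tree result, the standard multispecies coalescent without gene flow gives simple gene tree probabilities for a rooted three-taxon species tree ((A,B),C) with internal branch length t in coalescent units: the concordant topology has probability 1 - (2/3)e^(-t) and each of the two discordant topologies has probability (1/3)e^(-t). The short sketch below only evaluates these textbook expressions; it does not implement the introgression model analysed in the study, and the function name is hypothetical.

```python
import math
from typing import Dict


def three_taxon_probs(t: float) -> Dict[str, float]:
    """Gene tree topology probabilities under the MSC with no gene flow.

    t is the internal branch length of the species tree in coalescent units.
    """
    discordant = math.exp(-t) / 3.0
    return {
        "concordant": 1.0 - 2.0 * discordant,
        "discordant_each": discordant,
    }


# With t = 0 all three topologies are equally likely; for any t > 0 the
# concordant topology is the most probable one.
short_branch = three_taxon_probs(0.0)
longer_branch = three_taxon_probs(1.0)
```

Because the concordant topology is never less probable than a discordant one in this baseline, an anomalous gene tree on three taxa requires an additional process such as the introgression the abstract considers.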


SpeciesRax: A tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss.

10.1101/2021.03.29.437460 ◽

2021 ◽

Author(s):

Benoit Morel ◽

Paul Schade ◽

Sarah Lutteropp ◽

Tom A. Williams ◽

Gergely J. Szöllösi ◽

...

Keyword(s):

Maximum Likelihood ◽

Gene Family ◽

Gene Families ◽

Species Tree ◽

Likelihood Method ◽

Gene Trees ◽

Informative Signal ◽

Tree Inference ◽

Species Tree Inference ◽

Family Trees

Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modelling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated datasets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large datasets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31612 gene families in one hour using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.


Gene tree correction for reconciliation and species tree inference: Complexity and algorithms

Journal of Discrete Algorithms ◽

10.1016/j.jda.2013.06.001 ◽

2014 ◽

Vol 25 ◽

pp. 51-65 ◽

Cited By ~ 15

Author(s):

Riccardo Dondi ◽

Nadia El-Mabrouk ◽

Krister M. Swenson

Keyword(s):

Gene Tree ◽

Species Tree ◽

Tree Inference ◽

Species Tree Inference


Revisiting the phylogeny of Zoanthidea (Cnidaria: Anthozoa): staggered alignment of hypervariable sequences improves species tree inference

10.1101/161117 ◽

2017 ◽

Author(s):

Timothy D. Swain

Keyword(s):

Sequence Alignment ◽

Language Translation ◽

Phylogenetic Inference ◽

Species Tree ◽

Gene Trees ◽

Species Discovery ◽

Conserved Genes ◽

Tree Inference ◽

Data Content ◽

Conserved Gene

Abstract The recent rapid proliferation of novel taxon identification in the Zoanthidea has been accompanied by a parallel propagation of gene trees as a tool of species discovery, but not a corresponding increase in our understanding of phylogeny. This disparity is caused by the trade-off between the capabilities of automated DNA sequence alignment and data content of genes applied to phylogenetic inference in this group. Conserved genes or segments are easily aligned across the order, but produce poorly resolved trees; hypervariable genes or segments contain the evolutionary signal necessary for resolution and robust support, but sequence alignment is daunting. Staggered alignments are a form of phylogeny-informed sequence alignment composed of a mosaic of local and universal regions that allow phylogenetic inference to be applied to all nucleotides from both hypervariable and conserved gene segments. Comparisons between species tree phylogenies inferred from all data (staggered alignment) and hypervariable-excluded data (standard alignment) demonstrate improved confidence and greater topological agreement with other sources of data for the complete-data tree. This novel phylogeny is the most comprehensive to date (in terms of taxa and data) and can serve as an expandable tool for evolutionary hypothesis testing in the Zoanthidea.

Resumen (originally a Spanish language translation by Lisbeth O. Swain, DePaul University, Chicago, Illinois, 60604, USA; given here in English) Although the recent and accelerated proliferation of taxon identification in Zoanthidea has been accompanied by a parallel propagation of gene trees as a tool in species discovery, there has been no corresponding expansion of our knowledge of phylogeny. This disparity is caused by the trade-off between the capability of automated deoxyribonucleic acid (DNA) sequence alignments and the information contained in the gene data applied to phylogenetic inference methods in this group of Zoanthidea. Conserved gene regions or segments are easily aligned within the order; however, they produce gene trees with very poor results. Moreover, although hypervariable gene regions or segments contain the evolutionary signals needed to support the robust and complete construction of phylogenetic trees, these genes produce daunting sequence alignments. Staggered sequence alignments are a form of phylogeny-informed alignment composed of a mosaic of local and universal regions that allow phylogenetic inference to be applied to all nucleotides of hypervariable regions and of conserved genes or segments. Comparisons between species trees inferred from the staggered-alignment data and from the hypervariable-excluded data (standard alignment) demonstrate improved confidence and greater topological agreement with other sources of data for the tree built from the more complete data. This new staggered phylogeny is one of the most comprehensive to date (in terms of taxa and data) and can serve as an expandable tool for testing evolutionary hypotheses in the Zoanthidea.


Terraces in Gene Tree Reconciliation-Based Species Tree Inference

10.1101/2020.04.17.047092 ◽

2020 ◽

Author(s):

Michael J. Sanderson ◽

Michelle M. McMahon ◽

Mike Steel

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Solution Space ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Data Set ◽

Tree Reconciliation ◽

The Impact

Abstract Terraces in phylogenetic tree space are sets of trees with identical optimality scores for a given data set, arising from missing data. These were first described for multilocus phylogenetic data sets in the context of maximum parsimony inference and maximum likelihood inference under certain model assumptions. Here we show how the mathematical properties that lead to terraces extend to gene tree - species tree problems in which the gene trees are incomplete. Inference of species trees from either sets of gene family trees subject to duplication and loss, or allele trees subject to incomplete lineage sorting, can exhibit terraces in their solution space. First, we show conditions that lead to a new kind of terrace, which stems from subtree operations that appear in reconciliation problems for incomplete trees. Then we characterize when terraces of both types can occur when the optimality criterion for tree search is based on duplication, loss or deep coalescence scores. Finally, we examine the impact of assumptions about the causes of losses: whether they are due to imperfect sampling or true evolutionary deletion.


Using all gene families vastly expands data available for phylogenomic inference in primates

10.1101/2021.09.22.461252 ◽

2021 ◽

Author(s):

Megan L Smith ◽

Dan Vanderpool ◽

Matthew W. Hahn

Keyword(s):

Branch Length ◽

Gene Families ◽

Phylogenetic Inference ◽

Single Copy ◽

Decomposition Methods ◽

Species Tree ◽

Primate Species ◽

Tree Inference ◽

Inference Methods ◽

Species Tree Inference

Traditionally, single-copy orthologs have been the gold standard in phylogenomics. Most phylogenomic studies identify putative single-copy orthologs by using clustering approaches and retaining families with a single sequence from each species. However, this approach can severely limit the amount of data available by excluding larger families. Recent methodological advances have suggested several ways to include data from larger families. For instance, tree-based decomposition methods facilitate the extraction of orthologs from large families. Additionally, several popular methods for species tree inference appear to be robust to the inclusion of paralogs, and hence could use all of the data from larger families. Here, we explore the effects of using all families for phylogenetic inference using genomes from 26 primate species. We compare single-copy families, orthologs extracted using tree-based decomposition approaches, and all families with all data (i.e., including orthologs and paralogs). We explore several species tree inference methods, finding that across all nodes of the tree except one, identical trees are returned across nearly all datasets and methods. As in previous studies, the relationships among Platyrrhini remain contentious; however, the tree inference methods matter more than the dataset used. We also assess the effects of each dataset on branch length estimates, measures of phylogenetic uncertainty and concordance, and in detecting introgression. Our results demonstrate that using data from larger gene families drastically increases the number of genes available for phylogenetic inference and leads to consistent estimates of branch lengths, nodal certainty and concordance, and inferences of introgression.
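The conventional filter described at the start of this abstract, retaining only families with a single sequence from each species, can be sketched as below. This is a toy illustration rather than the study's pipeline; the family and species names are hypothetical, and a family is represented simply as the list of species its sequences come from.

```python
from collections import Counter
from typing import Dict, List, Set


def is_single_copy(gene_species: List[str], all_species: Set[str]) -> bool:
    """True when every species contributes exactly one sequence to the family."""
    counts = Counter(gene_species)
    return set(counts) == all_species and all(n == 1 for n in counts.values())


# Hypothetical gene families, each listed by the species of its member sequences.
families: Dict[str, List[str]] = {
    "fam1": ["sp1", "sp2", "sp3"],         # one copy per species
    "fam2": ["sp1", "sp1", "sp2", "sp3"],  # sp1 has a paralog
    "fam3": ["sp1", "sp2"],                # sp3 is missing
}
species = {"sp1", "sp2", "sp3"}
single_copy = [name for name, members in families.items()
               if is_single_copy(members, species)]
```

In this toy case two of the three families are discarded, which is the kind of data loss the abstract says larger-family approaches aim to avoid.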


Phylogenomic terraces: presence and implication in species tree estimation from gene trees

10.1101/2020.04.19.048843 ◽

2020 ◽

Author(s):

Ishrat Tanzila Farah ◽

Md Muktadirul Islam ◽

Kazi Tasnim Zinat ◽

Atif Hasan Rahman ◽

Md Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Deep Coalescence ◽

Tree Estimation ◽

Tree Space ◽

Multiple Species

Abstract Species tree estimation from multi-locus datasets is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed which estimate gene trees and then combine the gene trees to estimate a species tree by optimizing various optimization criteria. In this study, we have formalized the concept of “phylogenomic terraces” in the species tree space, where multiple species trees with distinct topologies may have exactly the same optimization score (quartet score, extra lineage score, etc.) with respect to a collection of gene trees. We investigated the presence and implication of terraces in species tree estimation from multi-locus data by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. Our experiments, on a collection of datasets simulated under ILS, indicate that MDC-based methods may achieve quartet consistency scores competitive with or identical to those of MQC-based methods but could be significantly worse in terms of tree accuracy, demonstrating the presence and effect of phylogenomic terraces. This is the first known study that formalizes the concept of phylogenomic terraces in the context of species tree estimation from multi-locus data, and reports the presence and implications of terraces in species tree estimation under ILS.
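A small example may help make the idea of distinct species trees with the same quartet score concrete. The sketch below represents trees as nested tuples, treats them as unrooted, and scores a species tree by the number of gene tree quartets it displays. The tree encoding, the function names, and the toy data (one gene tree that lacks taxon E) are assumptions made for illustration, not the formalization or software used in the study.

```python
from itertools import combinations


def clades(tree):
    """Return (leaf set, list of clade leaf sets) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)
    return leaves, found


def quartets(tree):
    """Resolved quartets ab|cd displayed by the tree, treated as unrooted."""
    leaves, found = clades(tree)
    result = set()
    for clade in found:
        rest = leaves - clade  # the other side of the split defined by this clade
        for pair_in in combinations(sorted(clade), 2):
            for pair_out in combinations(sorted(rest), 2):
                result.add(frozenset([frozenset(pair_in), frozenset(pair_out)]))
    return result


def quartet_score(species_tree, gene_trees):
    """Number of gene tree quartets that the species tree also displays."""
    displayed = quartets(species_tree)
    return sum(len(quartets(g) & displayed) for g in gene_trees)


# One hypothetical gene tree on four taxa; taxon E is not sampled in it.
gene_trees = [(("A", "B"), ("C", "D"))]
species_1 = ((("A", "B"), "E"), ("C", "D"))
species_2 = (("A", ("B", "E")), ("C", "D"))
score_1 = quartet_score(species_1, gene_trees)
score_2 = quartet_score(species_2, gene_trees)
```

Here the two species trees differ in where E attaches, yet both display the only gene tree quartet AB|CD, so the quartet score alone cannot tell them apart.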
