PoMo: An Allele Frequency-based Approach for Species Tree Estimation

10.1101/016360 ◽

2015 ◽

Author(s):

Nicola De Maio ◽

Dominik Schrempf ◽

Carolin Kosiol

Keyword(s):

Allele Frequency ◽

Phylogenetic Trees ◽

Incomplete Lineage Sorting ◽

Species Tree ◽

Efficient Estimation ◽

Species Variation ◽

Species Trees ◽

Lineage Sorting ◽

Genome Wide ◽

Tree Estimation

Incomplete lineage sorting can cause incongruencies of the overall species-level phylogenetic tree with the phylogenetic trees for individual genes or genomic segments. If these incongruencies are not accounted for, it is possible to incur several biases in species tree estimation. Here, we present a simple maximum likelihood approach that accounts for ancestral variation and incomplete lineage sorting. We use a POlymorphisms-aware phylogenetic MOdel (PoMo) that we have recently shown to efficiently estimate mutation rates and fixation biases from within and between-species variation data. We extend this model to perform efficient estimation of species trees. We test the performance of PoMo in several different scenarios of incomplete lineage sorting using simulations and compare it with existing methods both in accuracy and computational speed. In contrast to other approaches, our model does not use coalescent theory but is allele-frequency based. We show that PoMo is well suited for genome-wide species tree estimation and that on such data it is more accurate than previous approaches.

Phylogenomic terraces: presence and implication in species tree estimation from gene trees

10.1101/2020.04.19.048843 ◽

2020 ◽

Author(s):

Ishrat Tanzila Farah ◽

Md Muktadirul Islam ◽

Kazi Tasnim Zinat ◽

Atif Hasan Rahman ◽

Md Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Deep Coalescence ◽

Tree Estimation ◽

Tree Space ◽

Multiple Species

Species tree estimation from multi-locus datasets is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed that estimate gene trees and then combine them into a species tree by optimizing various scores. In this study, we have formalized the concept of “phylogenomic terraces” in the species tree space, where multiple species trees with distinct topologies may have exactly the same optimization score (quartet score, extra lineage score, etc.) with respect to a collection of gene trees. We investigated the presence and implications of terraces in species tree estimation from multi-locus data by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. Our experiments, on a collection of datasets simulated under ILS, indicate that MDC-based methods may achieve quartet consistency scores competitive with or identical to those of MQC-based methods but could be significantly worse in terms of tree accuracy, demonstrating the presence and effect of phylogenomic terraces. This is the first known study that formalizes the concept of phylogenomic terraces in the context of species tree estimation from multi-locus data, and reports the presence and implications of terraces in species tree estimation under ILS.
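To make the idea of a terrace concrete, the following is a minimal sketch of a quartet score, assuming trees are written as nested Python tuples of taxon names. The function names (`leaf_sets`, `quartet_topology`, `quartet_score`) and the toy gene trees are hypothetical illustrations, not the authors' implementation. The score counts how many gene-tree quartets a candidate species tree displays, and the toy input shows two distinct topologies receiving the same score.

```python
from itertools import combinations

def leaf_sets(tree):
    """Return (leaves, clades) for a tree given as nested tuples of taxon names."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades += child_clades
    clades.append(leaves)
    return leaves, clades

def quartet_topology(clades, quartet):
    """Return the split ab|cd induced on four taxa, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clades:
        side = clade & q
        if len(side) == 2:
            return frozenset([side, q - side])
    return None

def quartet_score(species_tree, gene_trees):
    """Count gene-tree quartets that the species tree also displays."""
    taxa, s_clades = leaf_sets(species_tree)
    score = 0
    for gene_tree in gene_trees:
        g_taxa, g_clades = leaf_sets(gene_tree)
        for quartet in combinations(sorted(taxa & g_taxa), 4):
            topo = quartet_topology(g_clades, quartet)
            if topo is not None and topo == quartet_topology(s_clades, quartet):
                score += 1
    return score

# Hypothetical toy input: two conflicting gene trees on four taxa.
genes = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))]
s1 = (("A", "B"), ("C", "D"))
s2 = (("A", "C"), ("B", "D"))
s3 = (("A", "D"), ("B", "C"))
print(quartet_score(s1, genes), quartet_score(s2, genes), quartet_score(s3, genes))
```

Here `s1` and `s2` have different topologies but tie on quartet score, which is the situation the abstract calls a terrace. Real analyses involve many more taxa and gene trees, and dedicated tools rather than this brute-force enumeration.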

To include or not to include: The impact of gene filtering on species tree estimation methods

10.1101/149120 ◽

2017 ◽

Cited By ~ 1

Author(s):

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Missing Data ◽

Incomplete Lineage Sorting ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Species Trees ◽

Lineage Sorting ◽

Gene Filtering ◽

Tree Estimation ◽

The Impact

Species tree estimation from loci sampled from multiple genomes is now common, but is challenged by heterogeneity across the genome due to multiple processes, such as gene duplication and loss, horizontal gene transfer, and incomplete lineage sorting. Although methods for estimating species trees have been developed that address gene tree heterogeneity due to incomplete lineage sorting, many of these methods operate by combining estimated gene trees and are hence vulnerable to gene tree quality. There is also the added concern that missing data, which are frequently encountered in genome-scale datasets, will impact species tree estimation. Our study addresses the impact of gene filtering on species trees inferred from multi-gene datasets. We address these questions using a large and heterogeneous collection of simulated datasets both with and without missing data. We compare several established coalescent-based methods (ASTRAL, ASTRID, MP-EST, and SVDquartets within PAUP*) as well as unpartitioned concatenation using maximum likelihood (RAxML). Our study shows that gene tree error and missing data impact all methods (and some methods degrade more than others), but the degree of incomplete lineage sorting and gene tree estimation error impacts the absolute and relative performance of methods as well as their response to gene filtering strategies. We find that filtering genes based on the degree of missing data is either neutral or else reduces the accuracy of all five methods examined, and so is not recommended. Filtering genes based on gene tree estimation error shows somewhat different trends. Under low levels of incomplete lineage sorting, removing genes with high gene tree estimation error can improve the accuracy of summary methods, but only if not too many genes are removed. Otherwise, filtering genes tends to increase error, especially under high levels of incomplete lineage sorting.
Hence, while filtering genes based on missing data is not recommended, there are conditions under which removing high error gene trees can improve species tree estimation. This study provides insights into prior studies and suggests approaches for analyzing phylogenomic datasets.

Reversible Polymorphism-Aware Phylogenetic Models and their Application to Tree Inference

10.1101/048496 ◽

2016 ◽

Author(s):

Dominik Schrempf ◽

Bui Quang Minh ◽

Nicola De Maio ◽

Arndt von Haeseler ◽

Carolin Kosiol

Keyword(s):

Large Scale ◽

Incomplete Lineage Sorting ◽

Species Tree ◽

Species Trees ◽

Lineage Sorting ◽

Substitution Models ◽

Tree Inference ◽

Genome Wide Data ◽

Tree Estimation ◽

Dna Substitution

We present a reversible Polymorphism-Aware Phylogenetic Model (revPoMo) for species tree estimation from genome-wide data. revPoMo enables the reconstruction of large-scale species trees for many within-species samples. It expands the alphabet of DNA substitution models to include polymorphic states, thereby naturally accounting for incomplete lineage sorting. We implemented revPoMo in the maximum likelihood software IQ-TREE. A simulation study and an application to great ape data show that the runtimes of our approach and standard substitution models are comparable but that revPoMo has much better accuracy in estimating trees, divergence times and mutation rates. The advantage of revPoMo is that an increase of sample size per species improves estimates but does not increase runtime. Therefore, revPoMo is a valuable tool with several applications, from speciation dating to species tree reconstruction.
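The expanded alphabet can be illustrated with a small sketch. It assumes a virtual population of size `n` (a hypothetical parameter name; the abstract does not state a value) and biallelic polymorphic states, so the alphabet holds 4 fixed states plus 6(n - 1) polymorphic ones. The function name `pomo_states` is an illustration and not part of IQ-TREE.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"

def pomo_states(n):
    """Enumerate PoMo-style states for an assumed virtual population size n >= 2.

    Each state is a tuple of (allele, count) pairs whose counts sum to n.
    """
    # Fixed states: all n virtual individuals carry the same nucleotide.
    fixed = [((base, n),) for base in NUCLEOTIDES]
    # Polymorphic states: two alleles present, with i and n - i copies.
    polymorphic = [
        ((a, i), (b, n - i))
        for a, b in combinations(NUCLEOTIDES, 2)
        for i in range(1, n)
    ]
    return fixed + polymorphic

states = pomo_states(10)
print(len(states))  # 4 + 6 * 9 = 58
```

Because the state count depends only on `n` and not on how many individuals are sampled per species, this is consistent with the abstract's remark that larger samples do not increase runtime.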

wQFM: Statistically Consistent Genome-scale Species Tree Estimation from Weighted Quartets

10.1101/2020.11.30.403352 ◽

2020 ◽

Author(s):

Mahim Mahbub ◽

Zahin Wahab ◽

Rezwana Reaz ◽

M. Saifur Rahman ◽

Md. Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Accurate Method ◽

Species Tree ◽

Estimation Methods ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation ◽

Source Form

Motivation: Species tree estimation from genes sampled from throughout the whole genome is complicated by gene tree-species tree discordance. Incomplete lineage sorting (ILS) is one of the most frequent causes of this discordance, where alleles can coexist in populations for periods that may span several speciation events. Quartet-based summary methods for estimating species trees from a collection of gene trees are becoming popular due to their high accuracy and statistical guarantees under ILS. Generating quartets with appropriate weights, where weights correspond to the relative importance of quartets, and subsequently amalgamating the weighted quartets to infer a single coherent species tree allows for a statistically consistent way of estimating species trees. However, handling weighted quartets is challenging. Results: We propose wQFM, a highly accurate method for species tree estimation from multi-locus data, by extending the quartet FM (QFM) algorithm to a weighted setting. wQFM was assessed on a collection of simulated and real biological datasets, including the avian phylogenomic dataset, which is one of the largest phylogenomic datasets to date. We compared wQFM with wQMC, which is the best alternative method for weighted quartet amalgamation, and with ASTRAL, which is one of the most accurate and widely used coalescent-based species tree estimation methods. Our results suggest that wQFM matches or improves upon the accuracy of wQMC and ASTRAL. Availability: wQFM is available in open source form at https://github.com/Mahim1997/wQFM-2020.

ASTRID: Accurate Species TRees from Internode Distances

10.1101/023036 ◽

2015 ◽

Cited By ~ 1

Author(s):

Pranjal Vachaspati ◽

Tandy Warnow

Keyword(s):

Good Accuracy ◽

Incomplete Lineage Sorting ◽

Current Method ◽

Species Tree ◽

Estimation Methods ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation ◽

Source Form

Background: Incomplete lineage sorting (ILS), modelled by the multi-species coalescent (MSC), is known to create discordance between gene trees and species trees, and lead to inaccurate species tree estimations unless appropriate methods are used to estimate the species tree. While many statistically consistent methods have been developed to estimate the species tree in the presence of ILS, only ASTRAL-2 and NJst have been shown to have good accuracy on large datasets. Yet, NJst is generally slower and less accurate than ASTRAL-2, and cannot run on some datasets. Results: We have redesigned NJst to enable it to run on all datasets, and we have expanded its design space so that it can be used with different distance-based tree estimation methods. The resultant method, ASTRID, is statistically consistent under the MSC model, and has accuracy that is competitive with ASTRAL-2. Furthermore, ASTRID is much faster than ASTRAL-2, completing in minutes on some datasets for which ASTRAL-2 used hours. Conclusions: ASTRID is a new coalescent-based method for species tree estimation that is competitive with the best current method in terms of accuracy, while being much faster. ASTRID is available in open source form on github.
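The "internode distances" in the title can be sketched as follows, assuming gene trees are nested Python tuples and that the distance between two taxa is the number of internal nodes on the path joining them. The helper names and the toy trees are hypothetical. This only builds the averaged distance matrix; the distance-based tree estimation step that would consume it is omitted.

```python
from itertools import combinations

def leaf_paths(tree, path=()):
    """Map each leaf to the tuple of child indices leading to it from the top node."""
    if isinstance(tree, str):
        return {tree: path}
    paths = {}
    for i, child in enumerate(tree):
        paths.update(leaf_paths(child, path + (i,)))
    return paths

def internode_distances(tree):
    """Number of internal nodes between every pair of leaves in one tree."""
    paths = leaf_paths(tree)
    dist = {}
    for a, b in combinations(sorted(paths), 2):
        pa, pb = paths[a], paths[b]
        shared = 0
        while shared < min(len(pa), len(pb)) and pa[shared] == pb[shared]:
            shared += 1
        edges = (len(pa) - shared) + (len(pb) - shared)
        dist[(a, b)] = edges - 1  # internal nodes = edges on the path minus one
    return dist

def average_internode_distances(gene_trees):
    """Average each pairwise distance over the gene trees containing both taxa."""
    totals, counts = {}, {}
    for tree in gene_trees:
        for pair, d in internode_distances(tree).items():
            totals[pair] = totals.get(pair, 0) + d
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

# Hypothetical toy input: unrooted quartets written with a trifurcating top node.
genes = [(("A", "B"), "C", "D"), (("A", "C"), "B", "D")]
print(average_internode_distances(genes))
```

Averaging only over trees that contain both taxa is one simple way to tolerate missing taxa. It is an assumption of this sketch, not a description of how ASTRID handles them.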

ILS-Aware Analyses of Retroelement Insertions in the Anomaly Zone

10.1101/2020.09.29.319038 ◽

2020 ◽

Author(s):

Erin K. Molloy ◽

John Gatesy ◽

Mark S. Springer

Keyword(s):

Dna Sequences ◽

Incomplete Lineage Sorting ◽

Species Tree ◽

Species Trees ◽

Lineage Sorting ◽

Short Branch ◽

Multispecies Coalescent ◽

Branch Lengths ◽

Tree Estimation ◽

Infinite Sites Model

A major shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting (ILS). Coalescence methods explicitly address this problem, but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescence methods, retroelement insertions have emerged as powerful phylogenomic markers for species tree estimation. We show that two recently proposed methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the species tree under the multispecies coalescent model, with retroelement insertions following a neutral infinite sites model of mutation. The accuracy of these and other methods for inferring species trees with retroelements has not been assessed in simulation studies. We simulate retroelements for four different species trees, including three with short branch lengths in the anomaly zone, and assess the performance of eight different methods for recovering the correct species tree. We also examine whether ASTRAL_BP recovers accurate internal branch lengths for internodes of various lengths (in coalescent units). Our results indicate that two recently proposed ILS-aware methods, ASTRAL_BP and SDPquartets, as well as the newly proposed ASTRID_BP, always recover the correct species tree on data sets with large numbers of retroelements even when there are extremely short species-tree branches in the anomaly zone. Dollo parsimony performed almost as well as these ILS-aware methods. By contrast, unordered parsimony, polymorphism parsimony, and MDC recovered the correct species tree in the case of a pectinate tree with four ingroup taxa in the anomaly zone, but failed to recover the correct tree in more complex anomaly-zone situations with additional lineages impacted by extensive incomplete lineage sorting.
Camin-Sokal parsimony always reconstructed an incorrect tree in the anomaly zone. ASTRAL_BP accurately estimated branch lengths when internal branches were very short as in anomaly zone situations, but branch lengths were upwardly biased by more than 35% when species tree branches were longer. We derive a mathematical correction for these distortions, assuming the expected number of new retroelement insertions per generation is constant across the species tree. We also show that short branches do not need to be corrected even when this assumption does not hold; therefore, the branch lengths estimates produced by ASTRAL_BP may provide insight into whether an estimated species tree is in the anomaly zone.

STELAR: A statistically consistent coalescent-based species tree estimation method by maximizing triplet consistency

10.1101/594911 ◽

2019 ◽

Author(s):

Mazharul Islam ◽

Kowshika Sarker ◽

Trisha Das ◽

Rezwana Reaz ◽

Md. Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Estimation Method ◽

Species Tree ◽

Mcmc Methods ◽

Consistent Estimate ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation

Background: Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, estimating a species tree from a collection of gene trees can be complicated by gene tree incongruence resulting from incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent process. Maximum likelihood and Bayesian MCMC methods can potentially result in accurate trees, but they do not scale well to large datasets. Results: We present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new fast and highly accurate statistically consistent coalescent-based method for estimating species trees from a collection of gene trees. We formalized the constrained triplet consensus (CTC) problem and showed that the solution to the CTC problem is a statistically consistent estimate of the species tree under the multi-species coalescent (MSC) model. STELAR is an efficient dynamic programming based solution to the CTC problem which is highly accurate and scalable. We evaluated the accuracy of STELAR in comparison with SuperTriplets, which is an alternative fast and highly accurate triplet-based supertree method, and with MP-EST and ASTRAL, two of the most popular and accurate coalescent-based methods. Experimental results suggest that STELAR matches the accuracy of ASTRAL and improves on MP-EST and SuperTriplets. Conclusions: Theoretical and empirical results (on both simulated and real biological datasets) suggest that STELAR is a valuable technique for species tree estimation from gene tree distributions.

Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer

10.1101/023168 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ruth Davidson ◽

Pranjal Vachaspati ◽

Siavash Mirarab ◽

Tandy Warnow

Keyword(s):

Maximum Likelihood ◽

Gene Transfer ◽

Horizontal Gene Transfer ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Estimation Methods ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation

Background: Species tree estimation is challenged by gene tree heterogeneity resulting from biological processes such as duplication and loss, hybridization, incomplete lineage sorting (ILS), and horizontal gene transfer (HGT). Mathematical theory about reconstructing species trees in the presence of HGT alone or ILS alone suggests that quartet-based species tree methods (known to be statistically consistent under ILS, or under bounded amounts of HGT) might be effective techniques for estimating species trees when both HGT and ILS are present. Results: We evaluated several publicly available coalescent-based methods and concatenation under maximum likelihood on simulated datasets with moderate ILS and varying levels of HGT. Our study shows that two quartet-based species tree estimation methods (ASTRAL-2 and weighted Quartets MaxCut) are both highly accurate, even on datasets with high rates of HGT. In contrast, although NJst and concatenation using maximum likelihood are highly accurate under low HGT, they are less robust to high HGT rates. Conclusion: Our study shows that quartet-based species-tree estimation methods can be highly accurate under the presence of both HGT and ILS. The study suggests the possibility that some quartet-based methods might be statistically consistent under phylogenomic models of gene tree heterogeneity with both HGT and ILS. Keywords: phylogenomics; HGT; ILS; summary methods; concatenation

Why concatenation fails in the anomaly zone

10.1101/116509 ◽

2017 ◽

Cited By ~ 2

Author(s):

Fábio K. Mendes ◽

Matthew W. Hahn

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Estimation Methods ◽

Species Trees ◽

Lineage Sorting ◽

Common Gene ◽

Tree Estimation ◽

Genome Scale ◽

Future Work

Genome-scale sequencing has been of great benefit in recovering species trees, but has not provided final answers. Despite the rapid accumulation of molecular sequences, resolving short and deep branches of the tree of life has remained a challenge, and has prompted the development of new strategies that can make the best use of available data. One such strategy, the concatenation of gene alignments, can be successful when coupled with many tree estimation methods, but has also been shown to fail when there are high levels of incomplete lineage sorting. Here, we focus on the failure of likelihood-based methods in retrieving a rooted, asymmetric four-taxon species tree from concatenated data when the species tree is in or near the anomaly zone, a region of parameter space where the most common gene tree does not match the species tree because of incomplete lineage sorting. First, we use coalescent theory to prove that most informative sites will support the species tree in the anomaly zone, and that as a consequence maximum-parsimony succeeds in recovering the species tree from concatenated data. We further show that maximum-likelihood tree estimation from concatenated data fails both inside and outside the anomaly zone, and that this failure is unconnected to the frequency of the most common gene tree. We provide support for a hypothesis that likelihood-based methods fail in and near the anomaly zone because discordant sites on the species tree have a lower likelihood than those that are discordant on alternative topologies. Our results confirm and extend previous reports of the failure and success of likelihood- and parsimony-based methods, and highlight avenues for future work improving the performance of methods aimed at recovering species trees.
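For the rooted, asymmetric four-taxon case, the anomaly zone has a commonly cited closed-form boundary from the coalescent literature (it is not stated in this abstract). The sketch below assumes that boundary, with `x` the deeper internal branch and `y` its descendant internal branch, both in coalescent units; the function names are hypothetical.

```python
import math

def anomaly_zone_limit(x):
    """Boundary a(x) for the descendant internal branch, given parent branch x > 0.

    Assumed form: a(x) = log(2/3 + (3 e^{2x} - 2) / (18 (e^{3x} - e^{2x}))).
    """
    e2, e3 = math.exp(2 * x), math.exp(3 * x)
    return math.log(2.0 / 3.0 + (3.0 * e2 - 2.0) / (18.0 * (e3 - e2)))

def in_anomaly_zone(x, y):
    """True if the descendant branch y is shorter than the boundary a(x)."""
    return y < anomaly_zone_limit(x)

print(in_anomaly_zone(0.1, 0.1))  # two short internal branches
print(in_anomaly_zone(0.5, 0.1))  # longer parent branch
```

Under this boundary, the limit becomes negative once `x` exceeds roughly 0.27 coalescent units, so no value of `y` places the tree in the anomaly zone beyond that point.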

The Perfect Storm: Gene Tree Estimation Error, Incomplete Lineage Sorting, and Ancient Gene Flow Explain the Most Recalcitrant Ancient Angiosperm Clade, Malpighiales

Systematic Biology ◽

10.1093/sysbio/syaa083 ◽

2020 ◽

Author(s):

Liming Cai ◽

Zhenxiang Xi ◽

Emily Moriarty Lemmon ◽

Alan R Lemmon ◽

Austin Mast ◽

...

Keyword(s):

Gene Flow ◽

Incomplete Lineage Sorting ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Flowering Plant ◽

Estimation Methods ◽

Lineage Sorting ◽

Tree Estimation ◽

Perfect Storm

The genomic revolution offers renewed hope of resolving rapid radiations in the Tree of Life. The development of the multispecies coalescent (MSC) model and improved gene tree estimation methods can better accommodate gene tree heterogeneity caused by incomplete lineage sorting (ILS) and gene tree estimation error stemming from the short internal branches. However, the relative influence of these factors in species tree inference is not well understood. Using anchored hybrid enrichment, we generated a data set including 423 single-copy loci from 64 taxa representing 39 families to infer the species tree of the flowering plant order Malpighiales. This order includes nine of the top ten most unstable nodes in angiosperms, which have been hypothesized to arise from the rapid radiation during the Cretaceous. Here, we show that coalescent-based methods do not resolve the backbone of Malpighiales and concatenation methods yield inconsistent estimations, providing evidence that gene tree heterogeneity is high in this clade. Despite high levels of ILS and gene tree estimation error, our simulations demonstrate that these two factors alone are insufficient to explain the lack of resolution in this order. To explore this further, we examined triplet frequencies among empirical gene trees and discovered some of them deviated significantly from those attributed to ILS and estimation error, suggesting gene flow as an additional and previously unappreciated phenomenon promoting gene tree variation in Malpighiales. Finally, we applied a novel method to quantify the relative contribution of these three primary sources of gene tree heterogeneity and demonstrated that ILS, gene tree estimation error, and gene flow contributed to 10.0%, 34.8%, and 21.4% of the variation, respectively.
Together, our results suggest that a perfect storm of factors likely influence this lack of resolution, and further indicate that recalcitrant phylogenetic relationships like the backbone of Malpighiales may be better represented as phylogenetic networks. Thus, reducing such groups solely to existing models that adhere strictly to bifurcating trees greatly oversimplifies reality, and obscures our ability to more clearly discern the process of evolution.
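The triplet-frequency reasoning rests on a standard multispecies coalescent result: for three taxa, ILS alone makes the two discordant gene tree topologies equally likely. The sketch below computes those expected probabilities for an internal branch of length `t` in coalescent units. The function name and dictionary keys are hypothetical, and this is not the authors' test procedure.

```python
import math

def triplet_probabilities(t):
    """Expected rooted-triplet gene tree probabilities under the MSC.

    t is the internal branch length of the three-taxon species tree,
    in coalescent units (t >= 0).
    """
    discordant = math.exp(-t) / 3.0
    return {
        "concordant": 1.0 - 2.0 * discordant,
        "discordant_1": discordant,
        "discordant_2": discordant,
    }

for t in (0.0, 0.5, 2.0):
    print(t, triplet_probabilities(t))
```

A marked asymmetry between the two discordant frequencies in real gene trees is therefore not expected from ILS alone, which is why such deviations are often read as a hint of gene flow.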