Detecting the anomaly zone in species trees and evidence for a misleading signal in higher-level skink phylogeny (Squamata: Scincidae).

Mapping Intimacies ◽

10.1101/012096 ◽

2014 ◽

Cited By ~ 3

Author(s):

Charles W Linkem ◽

Vladimir N. Minin ◽

Adam D Leache

Keyword(s):

Gene Tree ◽

Species Tree ◽

Ancestral Population ◽

Real Problem ◽

Species Trees ◽

Phylogenetic Studies ◽

New Information ◽

Population Sizes ◽

Tree Methods ◽

Tree Estimation

The anomaly zone presents a major challenge to the accurate resolution of many parts of the Tree of Life. The anomaly zone is defined by the presence of a gene tree topology that is more probable than the true species tree. This discrepancy can result from consecutive rapid speciation events in the species tree. Similar to the problem of long-branch attraction, including more data (loci) will only reinforce the support for the incorrect species tree. Empirical phylogenetic studies often implement coalescent based species tree methods to avoid the anomaly zone, but to this point these studies have not had a method for providing any direct evidence that the species tree is actually in the anomaly zone. In this study, we use 16 species of lizards in the family Scincidae to investigate whether nodes that are difficult to resolve are located within the anomaly zone. We analyze new phylogenomic data (429 loci), using both concatenation and coalescent based species tree estimation, to locate conflicting topological signal. We then use the unifying principle of the anomaly zone, together with estimates of ancestral population sizes and species persistence times, to determine whether the observed phylogenetic conflict is a result of the anomaly zone. We identify at least three regions of the Scindidae phylogeny that provide demographic signatures consistent with the anomaly zone, and this new information helps reconcile the phylogenetic conflict in previously published studies on these lizards. The anomaly zone presents a real problem in phylogenetics, and our new framework for identifying anomalous relationships will help empiricists leverage their resources appropriately for overcoming this challenge.

Download Full-text

Complexity of the simplest species tree problem

Molecular Biology and Evolution ◽

10.1093/molbev/msab009 ◽

2021 ◽

Author(s):

Tianqi Zhu ◽

Ziheng Yang

Keyword(s):

Maximum Likelihood ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Ancestral Population ◽

Statistical Efficiency ◽

Multispecies Coalescent ◽

Full Likelihood ◽

Tree Methods ◽

Tree Estimation

Abstract The multispecies coalescent (MSC) model provides a natural framework for species tree estimation accounting for gene-tree conflicts. While a number of species tree methods under the MSC have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods including concatenation, two-step, independent-sites maximum likelihood (ISML) and maximum likelihood (ML). We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case major differences exist among the methods. Fulllikelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes while these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full likelihood methods of species tree estimation.

Download Full-text

Bayes Estimation of Species Divergence Times and Ancestral Population Sizes Using DNA Sequences From Multiple Loci

Genetics ◽

10.1093/genetics/164.4.1645 ◽

2003 ◽

Vol 164 (4) ◽

pp. 1645-1656 ◽

Cited By ~ 3

Author(s):

Bruce Rannala ◽

Ziheng Yang

Keyword(s):

Dna Sequences ◽

Gene Tree ◽

Species Tree ◽

Bayes Estimation ◽

Ancestral Population ◽

Divergence Times ◽

Gene Trees ◽

Species Divergence ◽

Multiple Loci ◽

Population Sizes

Abstract The effective population sizes of ancestral as well as modern species are important parameters in models of population genetics and human evolution. The commonly used method for estimating ancestral population sizes, based on counting mismatches between the species tree and the inferred gene trees, is highly biased as it ignores uncertainties in gene tree reconstruction. In this article, we develop a Bayes method for simultaneous estimation of the species divergence times and current and ancestral population sizes. The method uses DNA sequence data from multiple loci and extracts information about conflicts among gene tree topologies and coalescent times to estimate ancestral population sizes. The topology of the species tree is assumed known. A Markov chain Monte Carlo algorithm is implemented to integrate over uncertain gene trees and branch lengths (or coalescence times) at each locus as well as species divergence times. The method can handle any species tree and allows different numbers of sequences at different loci. We apply the method to published noncoding DNA sequences from the human and the great apes. There are strong correlations between posterior estimates of speciation times and ancestral population sizes. With the use of an informative prior for the human-chimpanzee divergence date, the population size of the common ancestor of the two species is estimated to be ∼20,000, with a 95% credibility interval (8000, 40,000). Our estimates, however, are affected by model assumptions as well as data quality. We suggest that reliable estimates have yet to await more data and more realistic models.

Download Full-text

Phylogenomic terraces: presence and implication in species tree estimation from gene trees

10.1101/2020.04.19.048843 ◽

2020 ◽

Author(s):

Ishrat Tanzila Farah ◽

Md Muktadirul Islam ◽

Kazi Tasnim Zinat ◽

Atif Hasan Rahman ◽

Md Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Deep Coalescence ◽

Tree Estimation ◽

Tree Space ◽

Multiple Species

AbstractSpecies tree estimation from multi-locus dataset is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed which estimate gene trees and then combine the gene trees to estimate a species tree by optimizing various optimization scores. In this study, we have formalized the concept of “phylogenomic terraces” in the species tree space, where multiple species trees with distinct topologies may have exactly the same optimization score (quartet score, extra lineage score, etc.) with respect to a collection of gene trees. We investigated the presence and implication of terraces in species tree estimation from multi-locus data by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. Our experiments, on a collection of dataset simulated under ILS, indicate that MDC-based methods may achieve competitive or identical quartet consistency score as MQC but could be significantly worse than MQC in terms of tree accuracy – demonstrating the presence and affect of phylogenomic terraces. This is the first known study that formalizes the concept of phylogenomic terraces in the context of species tree estimation from multi-locus data, and reports the presence and implications of terraces in species tree estimation under ILS.

Download Full-text

To include or not to include: The impact of gene filtering on species tree estimation methods

10.1101/149120 ◽

2017 ◽

Cited By ~ 1

Author(s):

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Missing Data ◽

Incomplete Lineage Sorting ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Species Trees ◽

Lineage Sorting ◽

Gene Filtering ◽

Tree Estimation ◽

The Impact

AbstractSpecies tree estimation from loci sampled from multiple genomes is now common, but is challenged by the heterogeneity across the genome due to multiple processes, such as gene duplication and loss, horizontal gene transfer, and incomplete lineage sorting. Although methods for estimating species trees have been developed that address gene tree heterogeneity due to incomplete lineage sorting, many of these methods operate by combining estimated gene trees and are hence vulnerable to gene tree quality. There is also the added concern that missing data, which is frequently encountered in genome-scale datasets, will impact species tree estimation.Our study addresses the impact of gene filtering on species trees inferred from multi-gene datasets. We address these questions using a large and heterogeneous collection of simulated datasets both with and without missing data. We compare several established coalescent-based methods (ASTRAL, ASTRID, MP-EST, and SVDquartets within PAUP*) as well as unpartitioned concatenation using maximum likelihood (RAxML).Our study shows that gene tree error and missing data impact all methods (and some methods degrade more than others), but the degree of incomplete lineage sorting and gene tree estimation error impacts the absolute and relative performance of methods as well as their response to gene filtering strategies. We find that filtering genes based on the degree of missing data is either neutral or else reduces the accuracy of all five methods examined, and so is not recommended. Filtering genes based on gene tree estimation error shows somewhat different trends. Under low levels of incomplete lineage sorting, removing genes with high gene tree estimation error can improve the accuracy of summary methods, but only if not too many genes are removed. Otherwise, filtering genes tends to increase error, especially under high levels of incomplete lineage sorting. Hence, while filtering genes based on missing data is not recommended, there are conditions under which removing high error gene trees can improve species tree estimation. This study provides insights into prior studies and suggests approaches for analyzing phylogenomic datasets.

Download Full-text

wQFM: Statistically Consistent Genome-scale Species Tree Estimation from Weighted Quartets

10.1101/2020.11.30.403352 ◽

2020 ◽

Author(s):

Mahim Mahbub ◽

Zahin Wahab ◽

Rezwana Reaz ◽

M. Saifur Rahman ◽

Md. Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Accurate Method ◽

Species Tree ◽

Estimation Methods ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation ◽

Source Form

AbstractMotivationSpecies tree estimation from genes sampled from throughout the whole genome is complicated due to the gene tree-species tree discordance. Incomplete lineage sorting (ILS) is one of the most frequent causes for this discordance, where alleles can coexist in populations for periods that may span several speciation events. Quartet-based summary methods for estimating species trees from a collection of gene trees are becoming popular due to their high accuracy and statistical guarantee under ILS. Generating quartets with appropriate weights, where weights correspond to the relative importance of quartets, and subsequently amalgamating the weighted quartets to infer a single coherent species tree allows for a statistically consistent way of estimating species trees. However, handling weighted quartets is challenging.ResultsWe propose wQFM, a highly accurate method for species tree estimation from multi-locus data, by extending the quartet FM (QFM) algorithm to a weighted setting. wQFM was assessed on a collection of simulated and real biological datasets, including the avian phylogenomic dataset which is one of the largest phylogenomic datasets to date. We compared wQFM with wQMC, which is the best alternate method for weighted quartet amalgamation, and with ASTRAL, which is one of the most accurate and widely used coalescent-based species tree estimation methods. Our results suggest that wQFM matches or improves upon the accuracy of wQMC and ASTRAL.AvailabilitywQFM is available in open source form at https://github.com/Mahim1997/wQFM-2020.

Download Full-text

On the Robustness to Gene Tree Estimation Error (or lack thereof) of Coalescent-Based Species Tree Methods

Systematic Biology ◽

10.1093/sysbio/syv016 ◽

2015 ◽

Vol 64 (4) ◽

pp. 663-676 ◽

Cited By ~ 85

Author(s):

Sebastien Roch ◽

Tandy Warnow

Keyword(s):

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Tree Methods ◽

Tree Estimation

Download Full-text

STELAR: A statistically consistent coalescent-based species tree estimation method by maximizing triplet consistency

10.1101/594911 ◽

2019 ◽

Author(s):

Mazharul Islam ◽

Kowshika Sarker ◽

Trisha Das ◽

Rezwana Reaz ◽

Md. Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Estimation Method ◽

Species Tree ◽

Mcmc Methods ◽

Consistent Estimate ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation

AbstractBackgroundSpecies tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, estimating a species tree from a collection of gene trees can be complicated due to the presence of gene tree incongruence resulting from incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent process. Maximum likelihood and Bayesian MCMC methods can potentially result in accurate trees, but they do not scale well to large datasets.ResultsWe present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new fast and highly accurate statistically consistent coalescent-based method for estimating species trees from a collection of gene trees. We formalized the constrained triplet consensus (CTC) problem and showed that the solution to the CTC problem is a statistically consistent estimate of the species tree under the multi-species coalescent (MSC) model. STELAR is an efficient dynamic programming based solution to the CTC problem which is highly accurate and scalable. We evaluated the accuracy of STELAR in comparison with SuperTriplets, which is an alternate fast and highly accurate triplet-based supertree method, and with MP-EST and ASTRAL – two of the most popular and accurate coalescent-based methods. Experimental results suggest that STELAR matches the accuracy of ASTRAL and improves on MP-EST and SuperTriplets.ConclusionsTheoretical and empirical results (on both simulated and real biological datasets) suggest that STELAR is a valuable technique for species tree estimation from gene tree distributions.

Download Full-text

Maximum Likelihood Estimation of Species Trees from Gene Trees in the Presence of Ancestral Population Structure

Genome Biology and Evolution ◽

10.1093/gbe/evaa022 ◽

2020 ◽

Vol 12 (2) ◽

pp. 3977-3995 ◽

Cited By ~ 1

Author(s):

Hillary Koch ◽

Michael DeGiorgio

Keyword(s):

Population Structure ◽

Maximum Likelihood ◽

Gene Tree ◽

Species Tree ◽

Ease Of Use ◽

Likelihood Method ◽

Ancestral Population ◽

Gene Trees ◽

Species Trees ◽

Data Set

Abstract Though large multilocus genomic data sets have led to overall improvements in phylogenetic inference, they have posed the new challenge of addressing conflicting signals across the genome. In particular, ancestral population structure, which has been uncovered in a number of diverse species, can skew gene tree frequencies, thereby hindering the performance of species tree estimators. Here we develop a novel maximum likelihood method, termed TASTI (Taxa with Ancestral structure Species Tree Inference), that can infer phylogenies under such scenarios, and find that it has increasing accuracy with increasing numbers of input gene trees, contrasting with the relatively poor performances of methods not tailored for ancestral structure. Moreover, we propose a supertree approach that allows TASTI to scale computationally with increasing numbers of input taxa. We use genetic simulations to assess TASTI’s performance in the three- and four-taxon settings and demonstrate the application of TASTI on a six-species Afrotropical mosquito data set. Finally, we have implemented TASTI in an open-source software package for ease of use by the scientific community.

Download Full-text

Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer

10.1101/023168 ◽

2015 ◽

Cited By ~ 1

Author(s):

Ruth Davidson ◽

Pranjal Vachaspati ◽

Siavash Mirarab ◽

Tandy Warnow

Keyword(s):

Maximum Likelihood ◽

Gene Transfer ◽

Horizontal Gene Transfer ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Estimation Methods ◽

Species Trees ◽

Lineage Sorting ◽

Tree Estimation

Background: Species tree estimation is challenged by gene tree heterogeneity resulting from biological processes such as duplication and loss, hybridization, incomplete lineage sorting (ILS), and horizontal gene transfer (HGT). Mathematical theory about reconstructing species trees in the presence of HGT alone or ILS alone suggests that quartet-based species tree methods (known to be statistically consistent under ILS, or under bounded amounts of HGT) might be effective techniques for estimating species trees when both HGT and ILS are present. Results: We evaluated several publicly available coalescent-based methods and concatenation under maximum likelihood on simulated datasets with moderate ILS and varying levels of HGT. Our study shows that two quartet-based species tree estimation methods (ASTRAL-2 and weighted Quartets MaxCut) are both highly accurate, even on datasets with high rates of HGT. In contrast, although NJst and concatenation using maximum likelihood are highly accurate under low HGT, they are less robust to high HGT rates. Conclusion: Our study shows that quartet-based species-tree estimation methods can be highly accurate under the presence of both HGT and ILS. The study suggests the possibility that some quartet-based methods might be statistically consistent under phylogenomic models of gene tree heterogeneity with both HGT and ILS. Keywords: phylogenomics; HGT; ILS; summary methods; concatenation

Download Full-text

Why concatenation fails in the anomaly zone

10.1101/116509 ◽

2017 ◽

Cited By ~ 2

Author(s):

Fábio K. Mendes ◽

Matthew W. Hahn

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Estimation Methods ◽

Species Trees ◽

Lineage Sorting ◽

Common Gene ◽

Tree Estimation ◽

Genome Scale ◽

Future Work

AbstrctGenome-scale sequencing has been of great benefit in recovering species trees, but has not provided final answers. Despite the rapid accumulation of molecular sequences, resolving short and deep branches of the tree of life has remained a challenge, and has prompted the development of new strategies that can make the best use of available data. One such strategy – the concatenation of gene alignments – can be successful when coupled with many tree estimation methods, but has also been shown to fail when there are high levels of incomplete lineage sorting. Here, we focus on the failure of likelihood-based methods in retrieving a rooted, asymmetric four-taxon species tree from concatenated data when the species tree is in or near the anomaly zone – a region of parameter space where the most common gene tree does not match the species tree because of incomplete lineage sorting. First, we use coalescent theory to prove that most informative sites will support the species tree in the anomaly zone, and that as a consequence maximum-parsimony succeeds in recovering the species tree from concatenated data. We further show that maximum-likelihood tree estimation from concatenated data fails both inside and outside the anomaly zone, and that this failure is unconnected to the frequency of the most common gene tree. We provide support for a hypothesis that likelihood-based methods fail in and near the anomaly zone because discordant sites on the species tree have a lower likelihood than those that are discordant on alternative topologies. Our results confirm and extend previous reports of the failure and success of likelihood- and parsimony-based methods, and highlight avenues for future work improving the performance of methods aimed at recovering species tree.

Download Full-text