The Impact of Missing Data on Species Tree Estimation

Abstract Species tree estimation from loci sampled from multiple genomes is now common, but is challenged by the heterogeneity across the genome due to multiple processes, such as gene duplication and loss, horizontal gene transfer, and incomplete lineage sorting. Although methods for estimating species trees have been developed that address gene tree heterogeneity due to incomplete lineage sorting, many of these methods operate by combining estimated gene trees and are hence vulnerable to gene tree quality. There is also the added concern that missing data, which is frequently encountered in genome-scale datasets, will impact species tree estimation. Our study addresses the impact of gene filtering on species trees inferred from multi-gene datasets. We address these questions using a large and heterogeneous collection of simulated datasets both with and without missing data. We compare several established coalescent-based methods (ASTRAL, ASTRID, MP-EST, and SVDquartets within PAUP*) as well as unpartitioned concatenation using maximum likelihood (RAxML). Our study shows that gene tree error and missing data impact all methods (and some methods degrade more than others), but the degree of incomplete lineage sorting and gene tree estimation error impacts the absolute and relative performance of methods as well as their response to gene filtering strategies. We find that filtering genes based on the degree of missing data is either neutral or else reduces the accuracy of all five methods examined, and so is not recommended. Filtering genes based on gene tree estimation error shows somewhat different trends. Under low levels of incomplete lineage sorting, removing genes with high gene tree estimation error can improve the accuracy of summary methods, but only if not too many genes are removed. Otherwise, filtering genes tends to increase error, especially under high levels of incomplete lineage sorting.
Hence, while filtering genes based on missing data is not recommended, there are conditions under which removing high error gene trees can improve species tree estimation. This study provides insights into prior studies and suggests approaches for analyzing phylogenomic datasets.
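
As a rough illustration of what "filtering genes based on the degree of missing data" can mean in practice, the sketch below keeps only genes whose fraction of absent taxa does not exceed a threshold. The function name, the 0.5 threshold, and the dictionary layout are assumptions made for this example; they are not the filtering procedure used in the study.

```python
def filter_genes_by_missing_data(gene_taxa, all_taxa, max_missing=0.5):
    """Return names of genes whose fraction of missing taxa is <= max_missing.

    gene_taxa: dict mapping gene name -> set of taxa present in that gene
    all_taxa:  set of all taxa in the dataset
    (Hypothetical data layout, chosen only for illustration.)
    """
    if not all_taxa:
        raise ValueError("all_taxa must be non-empty")
    kept = []
    for gene, taxa in gene_taxa.items():
        # Fraction of the dataset's taxa that this gene lacks.
        missing_fraction = 1 - len(taxa & all_taxa) / len(all_taxa)
        if missing_fraction <= max_missing:
            kept.append(gene)
    return kept


all_taxa = {"A", "B", "C", "D"}
gene_taxa = {
    "gene1": {"A", "B", "C", "D"},  # no taxa missing
    "gene2": {"A", "B"},            # half of the taxa missing
    "gene3": {"A"},                 # three of four taxa missing
}
kept = filter_genes_by_missing_data(gene_taxa, all_taxa, max_missing=0.5)
```

With these toy inputs, `gene3` is the only gene dropped. The abstract's finding is that this kind of filter was neutral or harmful for the methods tested, so the sketch shows the operation being evaluated, not a recommended step.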

Do Alignment and Trimming Methods Matter for Phylogenomic (UCE) Analyses?

Systematic Biology ◽

10.1093/sysbio/syaa064 ◽

2020 ◽

Cited By ~ 1

Author(s):

Daniel M Portik ◽

John J Wiens

Keyword(s):

Missing Data ◽

Molecular Phylogenetics ◽

Species Tree ◽

Sequence Length ◽

Data Sets ◽

Full Data ◽

Tree Methods ◽

Phylogenomic Analyses ◽

Alignment Errors ◽

The Impact

Abstract Alignment is a crucial issue in molecular phylogenetics because different alignment methods can potentially yield very different topologies for individual genes. But it is unclear if the choice of alignment methods remains important in phylogenomic analyses, which incorporate data from hundreds or thousands of genes. For example, problematic biases in alignment might be multiplied across many loci, whereas alignment errors in individual genes might become irrelevant. The issue of alignment trimming (i.e., removing poorly aligned regions or missing data from individual genes) is also poorly explored. Here, we test the impact of 12 different combinations of alignment and trimming methods on phylogenomic analyses. We compare these methods using published phylogenomic data from ultraconserved elements (UCEs) from squamate reptiles (lizards and snakes), birds, and tetrapods. We compare the properties of alignments generated by different alignment and trimming methods (e.g., length, informative sites, missing data). We also test whether these data sets can recover well-established clades when analyzed with concatenated (RAxML) and species-tree methods (ASTRAL-III), using the full data (~5000 loci) and subsampled data sets (10% and 1% of loci). We show that different alignment and trimming methods can significantly impact various aspects of phylogenomic data sets (e.g., length, informative sites). However, these different methods generally had little impact on the recovery and support values for well-established clades, even across very different numbers of loci. Nevertheless, our results suggest several “best practices” for alignment and trimming. Intriguingly, the choice of phylogenetic methods impacted the phylogenetic results most strongly, with concatenated analyses recovering significantly more well-established clades (with stronger support) than the species-tree analyses.
[Alignment; concatenated analysis; phylogenomics; sequence length heterogeneity; species-tree analysis; trimming]

To Include or Not to Include: The Impact of Gene Filtering on Species Tree Estimation Methods

Systematic Biology ◽

10.1093/sysbio/syx077 ◽

2017 ◽

Vol 67 (2) ◽

pp. 285-303 ◽

Cited By ~ 64

Author(s):

Erin K Molloy ◽

Tandy Warnow

Keyword(s):

Species Tree ◽

Estimation Methods ◽

Gene Filtering ◽

Tree Estimation ◽

The Impact

Effects of missing data on species tree estimation under the coalescent

Molecular Phylogenetics and Evolution ◽

10.1016/j.ympev.2013.06.004 ◽

2013 ◽

Vol 69 (3) ◽

pp. 1057-1062 ◽

Cited By ~ 49

Author(s):

Rasmus Hovmöller ◽

L. Lacey Knowles ◽

Laura S. Kubatko

Keyword(s):

Missing Data ◽

Species Tree ◽

Tree Estimation

The performance of coalescent-based species tree estimation methods under models of missing data

BMC Genomics ◽

10.1186/s12864-018-4619-8 ◽

2018 ◽

Vol 19 (S5) ◽

Cited By ~ 20

Author(s):

Michael Nute ◽

Jed Chou ◽

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Missing Data ◽

Species Tree ◽

Estimation Methods ◽

Tree Estimation

Twisted trees and inconsistency of tree estimation when gaps are treated as missing data – The impact of model mis-specification in distance corrections

Molecular Phylogenetics and Evolution ◽

10.1016/j.ympev.2015.07.027 ◽

2015 ◽

Vol 93 ◽

pp. 289-295 ◽

Cited By ~ 9

Author(s):

Emily Jane McTavish ◽

Mike Steel ◽

Mark T. Holder

Keyword(s):

Missing Data ◽

Tree Estimation ◽

The Impact

Quantifying the impact of an inference model in Bayesian phylogenetics

10.1101/2019.12.17.879098 ◽

2019 ◽

Cited By ~ 1

Author(s):

Richèl J.C. Bilderbeek ◽

Giovanni Laudanno ◽

Rampal S. Etienne

Keyword(s):

Phylogenetic Trees ◽

R Package ◽

Species Tree ◽

Joint Estimation ◽

List Type ◽

Inference Model ◽

Bayesian Phylogenetics ◽

Character Sequences ◽

Tree Estimation ◽

The Impact

Summary Phylogenetic trees are currently routinely reconstructed from an alignment of character sequences (usually nucleotide sequences). Bayesian tools, such as MrBayes, RevBayes and BEAST2, have gained much popularity over the last decade, as they allow joint estimation of the posterior distribution of the phylogenetic trees and the parameters of the underlying inference model. An important ingredient of these Bayesian approaches is the species tree prior. In principle, the Bayesian framework allows for comparing different tree priors, which may elucidate the macroevolutionary processes underlying the species tree. In practice, however, only macroevolutionary models that allow for fast computation of the prior probability are used. The question is how accurate the tree estimation is when the real macroevolutionary processes are substantially different from those assumed in the tree prior. Here we present pirouette, a free and open-source R package that assesses the inference error made by Bayesian phylogenetics for a given macroevolutionary diversification model. pirouette makes use of BEAST2, but its philosophy applies to any Bayesian phylogenetic inference tool. We describe pirouette’s usage, providing full examples in which we interrogate a model for its power to describe another. Last, we discuss the results obtained by the examples and their interpretation.

QT-GILD: Quartet Based Gene Tree Imputation Using Deep Learning Improves Phylogenomic Analyses Despite Missing Data

10.1101/2021.11.03.467204 ◽

2021 ◽

Author(s):

Sazan Mahbub ◽

Shashata Sawmya ◽

Arpita Saha ◽

Rezwana Reaz ◽

M. Sohel Rahman ◽

...

Keyword(s):

Deep Learning ◽

Missing Data ◽

Language Processing ◽

Estimation Error ◽

Gene Tree ◽

Experimental Studies ◽

Species Tree ◽

Dramatic Improvement ◽

Gene Trees ◽

Tree Estimation

Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, for a combination of reasons (ranging from sampling biases to more biological causes, as in gene birth and loss), gene trees are often incomplete, meaning that not all species of interest have a common set of genes. Incomplete gene trees can potentially impact the accuracy of phylogenomic inference. We, for the first time, introduce the problem of imputing the quartet distribution induced by a set of incomplete gene trees, which involves adding the missing quartets back to the quartet distribution. We present QT-GILD, an automated and specially tailored unsupervised deep learning technique, accompanied by cues from natural language processing (NLP), which learns the quartet distribution in a given set of incomplete gene trees and generates a complete set of quartets accordingly. QT-GILD is a general-purpose technique needing no explicit modeling of the subject system or reasons for missing data or gene tree heterogeneity. Experimental studies on a collection of simulated and empirical data sets suggest that QT-GILD can effectively impute the quartet distribution, which results in a dramatic improvement in the species tree accuracy. Remarkably, QT-GILD not only imputes the missing quartets but it can also account for gene tree estimation error. Therefore, QT-GILD advances the state-of-the-art in species tree estimation from gene trees in the face of missing data. QT-GILD is freely available in open source form at https://github.com/pythonLoader/QT-GILD .
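
To make the notion of "missing quartets" concrete: a four-taxon subset can only be observed in a gene tree that contains all four taxa. The sketch below counts, for every quartet of taxa, how many gene trees cover it, and lists the quartets that no gene tree covers. It only illustrates the gap in the input; it is not QT-GILD's imputation method, and the function and variable names are hypothetical.

```python
from itertools import combinations


def quartet_coverage(gene_taxon_sets, all_taxa):
    """Count how many gene trees contain all four taxa of each quartet.

    gene_taxon_sets: iterable of taxon sets, one per (possibly incomplete) gene tree
    all_taxa:        iterable of all taxa of interest
    """
    coverage = {}
    for quartet in combinations(sorted(all_taxa), 4):
        members = set(quartet)
        # A gene tree can induce this quartet only if it has all four taxa.
        coverage[quartet] = sum(
            1 for taxa in gene_taxon_sets if members <= set(taxa)
        )
    return coverage


all_taxa = ["A", "B", "C", "D", "E"]
gene_taxon_sets = [
    {"A", "B", "C", "D"},  # lacks E
    {"B", "C", "D", "E"},  # lacks A
    {"A", "B", "C", "D"},  # lacks E
]
coverage = quartet_coverage(gene_taxon_sets, all_taxa)
uncovered = [q for q, n in coverage.items() if n == 0]
```

In this toy input no gene tree contains both A and E, so every quartet that includes both is uncovered. These are the entries of the quartet distribution that an imputation method would need to fill in.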

Accounting for Uncertainty in Gene Tree Estimation: Summary-Coalescent Species Tree Inference in a Challenging Radiation of Australian Lizards

10.1101/056085 ◽

2016 ◽

Author(s):

Mozes P.K. Blom ◽

Jason G. Bragg ◽

Sally Potter ◽

Craig Moritz

Keyword(s):

Phylogenetic Signal ◽

Gene Tree ◽

Phylogenetic Inference ◽

Species Tree ◽

Gene Trees ◽

Tree Inference ◽

Tree Estimation ◽

Tree Resolution ◽

The Impact ◽

Species Tree Inference

Abstract Accurate gene tree inference is an important aspect of species tree estimation in a summary-coalescent framework. Yet, in empirical studies, inferred gene trees differ in accuracy due to stochastic variation in phylogenetic signal between targeted loci. Empiricists should therefore examine the consistency of species tree inference, while accounting for the observed heterogeneity in gene tree resolution of phylogenomic datasets. Here, we assess the impact of gene tree estimation error on summary-coalescent species tree inference by screening ~2000 exonic loci based on gene tree resolution prior to phylogenetic inference. We focus on a phylogenetically challenging radiation of Australian lizards (genus Cryptoblepharus, Scincidae) and explore effects on topology and support. We identify a well-supported topology based on all loci and find that a relatively small number of high-resolution gene trees can be sufficient to converge on the same topology. Adding gene trees with decreasing resolution produced a generally consistent topology, and increased confidence for specific bipartitions that were poorly supported when using a small number of informative loci. This corroborates coalescent-based simulation studies that have highlighted the need for a large number of loci to confidently resolve challenging relationships and refutes the notion that low-resolution gene trees introduce phylogenetic noise. Further, our study also highlights the value of quantifying changes in nodal support across locus subsets of increasing size (but decreasing gene tree resolution). Such detailed analyses can reveal anomalous fluctuations in support at some nodes, suggesting the possibility of model violation. By characterizing the heterogeneity in phylogenetic signal among loci, we can account for uncertainty in gene tree estimation and assess its effect on the consistency of the species tree estimate.
We suggest that the evaluation of gene tree resolution should be incorporated in the analysis of empirical phylogenomic datasets. This will ultimately increase our confidence in species tree estimation using summary-coalescent methods and enable us to exploit genomic data for phylogenetic inference.
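
The screening idea described above (locus subsets of increasing size but decreasing gene tree resolution) can be sketched as follows. Resolution is taken here as the fraction of the n - 3 internal branches of a fully resolved unrooted tree on n taxa that are present in the gene tree; this definition, the function names, and the toy values are assumptions for illustration and may differ from the measure used in the study.

```python
def resolution(n_taxa, n_internal_branches):
    """Fraction of the n_taxa - 3 possible internal branches that are resolved."""
    if n_taxa < 4:
        raise ValueError("need at least 4 taxa for an internal branch")
    return n_internal_branches / (n_taxa - 3)


def nested_subsets(loci, sizes):
    """Rank loci by decreasing resolution and return the top-k names for each k.

    loci:  dict mapping locus name -> resolution score
    sizes: iterable of subset sizes k
    """
    ranked = sorted(loci, key=lambda name: loci[name], reverse=True)
    return {k: ranked[:k] for k in sizes}


loci = {
    "locus1": resolution(10, 7),  # fully resolved: 7 of 7 internal branches
    "locus2": resolution(10, 3),
    "locus3": resolution(10, 5),
}
subsets = nested_subsets(loci, sizes=[1, 2, 3])
```

Each subset would then be passed to a summary-coalescent method, and nodal support compared across subset sizes. That downstream step is not shown here.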

Correction to: The performance of coalescent-based species tree estimation methods under models of missing data

BMC Genomics ◽

10.1186/s12864-020-6540-1 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Michael Nute ◽

Jed Chou ◽

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Missing Data ◽

Species Tree ◽

Estimation Methods ◽

Tree Estimation
