Gene tree quality affects empirical coalescent branch length estimation

ModelTeller: Model Selection for Optimal Phylogenetic Reconstruction Using Machine Learning

Molecular Biology and Evolution ◽

10.1093/molbev/msaa154 ◽

2020 ◽

Vol 37 (11) ◽

pp. 3338-3352

Author(s):

Shiran Abadi ◽

Oren Avram ◽

Saharon Rosset ◽

Tal Pupko ◽

Itay Mayrose

Keyword(s):

Machine Learning ◽

Model Selection ◽

Selection Criteria ◽

Sequence Data ◽

Phylogenetic Reconstruction ◽

Branch Length ◽

Model Parameters ◽

Model Selection Criteria ◽

Learning Framework ◽

Length Estimation

Abstract Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. Although model selection is regarded as a fundamental step in phylogenetics, existing methods for this task consume computational resources for long processing time, they are not always feasible, and sometimes depend on preliminary assumptions which do not hold for sequence data. Moreover, although these methods are dedicated to revealing the processes that underlie the sequence data, they do not always produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as using existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework, optimized to predict the most accurate nucleotide substitution model for branch-length estimation. We demonstrate that ModelTeller leads to more accurate branch-length inference than current model selection criteria on data sets simulated under realistic processes. ModelTeller relies on a readily implemented machine-learning model and thus the prediction according to features extracted from the sequence data results in a substantial decrease in running time compared with existing strategies. By harnessing the machine-learning framework, we distinguish between features that mostly contribute to branch-length optimization, concerning the extent of sequence divergence, and features that are related to estimates of the model parameters that are important for the selection made by current criteria.
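As an illustration of the statistical criteria this abstract refers to, the following minimal sketch scores candidate substitution models with AIC and BIC and picks the lowest score. It is not ModelTeller's method. The function names, the `FREE_PARAMS` table, and the log-likelihood values are hypothetical and chosen only for illustration. Branch-length parameters are left out of the parameter count on the assumption that all models are fitted to the same topology, in which case they add the same constant to every score.

```python
import math

# Free substitution-model parameters, excluding branch lengths
# (assumption: every model is fitted on the same fixed topology).
FREE_PARAMS = {"JC": 0, "HKY": 4, "GTR": 8, "GTR+I+G": 10}


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n_sites):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n_sites) - 2 * log_lik


def select_model(log_liks, n_sites, criterion="aic"):
    """Return (best_model, scores); the lowest score wins."""
    scores = {}
    for name, log_lik in log_liks.items():
        k = FREE_PARAMS[name]
        if criterion == "aic":
            scores[name] = aic(log_lik, k)
        else:
            scores[name] = bic(log_lik, k, n_sites)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical maximized log-likelihoods, invented for this example.
hypothetical_log_liks = {
    "JC": -5230.0,
    "HKY": -5105.0,
    "GTR": -5100.0,
    "GTR+I+G": -5098.5,
}

best_aic, _ = select_model(hypothetical_log_liks, n_sites=1000, criterion="aic")
best_bic, _ = select_model(hypothetical_log_liks, n_sites=1000, criterion="bic")
```

With these invented numbers the two criteria disagree: BIC penalizes extra parameters more heavily at 1000 sites and prefers the simpler model.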


Branch-length Estimation

Dictionary of Bioinformatics and Computational Biology ◽

10.1002/9780471650126.dob0079.pub2 ◽

2004 ◽

Author(s):

Sudhir Kumar ◽

Alan Filipski

Keyword(s):

Branch Length ◽

Length Estimation


Branch length estimation and divergence dating: estimates of error in Bayesian and maximum likelihood frameworks

BMC Evolutionary Biology ◽

10.1186/1471-2148-10-5 ◽

2010 ◽

Vol 10 (1) ◽

pp. 5 ◽

Cited By ~ 33

Author(s):

Rachel S Schwartz ◽

Rachel L Mueller

Keyword(s):

Maximum Likelihood ◽

Branch Length ◽

Length Estimation ◽

Divergence Dating


Integrated Likelihood for Phylogenomics under a No-Common-Mechanism Model

10.1101/500520 ◽

2018 ◽

Cited By ~ 1

Author(s):

Hunter Tidwell ◽

Luay Nakhleh

Keyword(s):

Sequence Data ◽

Gene Tree ◽

Branch Length ◽

Biological Data ◽

Common Mechanism ◽

Sequence Evolution ◽

Gene Trees ◽

Species Trees ◽

Integrated Likelihood ◽

Species Phylogeny

The availability of genome-wide sequence data from a large number of species as well as data from multiple individuals within a species has ushered in the era of phylogenomics. In this era, species phylogeny inference is based on models of sequence evolution on gene trees as well as models of gene tree evolution within the branches of species phylogenies. Parsimony, likelihood, Bayesian, and distance methods have been introduced for species phylogeny inference based on such models. All methods, except for the parsimony ones, assume a common mechanism across all loci as captured by a single value of each branch length of the species phylogeny. In this paper, we propose a "no common mechanism" (NCM) model, where every gene tree evolves according to its own parameters of the species phylogeny. An analogous model was proposed and explored, both mathematically and experimentally, for sites, or characters, in a sequence alignment in the context of the classical phylogeny problem. For example, a famous equivalence between the maximum parsimony and maximum likelihood phylogeny estimates was established under certain NCM models by Tuffley and Steel. Here we derive an analytically integrated likelihood of both species trees and networks given the gene trees of multiple loci under an NCM model. We demonstrate the performance of inference under this integrated likelihood on both simulated and biological data. The model presented here will afford opportunities for exploring connections among various methods for estimating species phylogenies from multiple, independent loci.


Accurate Branch Length Estimation in Partitioned Bayesian Analyses Requires Accommodation of Among-Partition Rate Variation and Attention to Branch Length Priors

Systematic Biology ◽

10.1080/10635150601087641 ◽

2006 ◽

Vol 55 (6) ◽

pp. 993-1003 ◽

Cited By ~ 113

Author(s):

David C. Marshall ◽

Chris Simon ◽

Thomas R. Buckley

Keyword(s):

Branch Length ◽

Rate Variation ◽

Bayesian Analyses ◽

Length Estimation


Branch-length estimation bias misleads molecular dating for a vertebrate mitochondrial phylogeny

Gene ◽

10.1016/j.gene.2008.08.017 ◽

2009 ◽

Vol 441 (1-2) ◽

pp. 132-140 ◽

Cited By ~ 74

Author(s):

Matthew J. Phillips

Keyword(s):

Branch Length ◽

Molecular Dating ◽

Estimation Bias ◽

Length Estimation ◽

Mitochondrial Phylogeny


The impacts of drift and selection on genomic evolution in insects

PeerJ ◽

10.7717/peerj.3241 ◽

2017 ◽

Vol 5 ◽

pp. e3241 ◽

Cited By ~ 3

Author(s):

K. Jun Tong ◽

Sebastián Duchêne ◽

Nathan Lo ◽

Simon Y.W. Ho

Keyword(s):

Gene Tree ◽

Evolutionary Process ◽

Branch Length ◽

Rate Variation ◽

Gene Trees ◽

Genomic Evolution ◽

Genomic Dataset ◽

Phylogenetic Methods ◽

Strong Negative Selection ◽

Shed Light

Genomes evolve through a combination of mutation, drift, and selection, all of which act heterogeneously across genes and lineages. This leads to differences in branch-length patterns among gene trees. Genes that yield trees with the same branch-length patterns can be grouped together into clusters. Here, we propose a novel phylogenetic approach to explain the factors that influence the number and distribution of these gene-tree clusters. We apply our method to a genomic dataset from insects, an ancient and diverse group of organisms. We find some evidence that when drift is the dominant evolutionary process, each cluster tends to contain a large number of fast-evolving genes. In contrast, strong negative selection leads to many distinct clusters, each of which contains only a few slow-evolving genes. Our work, although preliminary in nature, illustrates the use of phylogenetic methods to shed light on the factors driving rate variation in genomic evolution.


Linking Branch Lengths Across Loci Provides the Best Fit for Phylogenetic Inference

10.1101/467449 ◽

2018 ◽

Cited By ~ 2

Author(s):

David A. Duchêne ◽

K. Jun Tong ◽

Charles S. P. Foster ◽

Sebastián Duchêne ◽

Robert Lanfear ◽

...

Keyword(s):

Gene Tree ◽

Branch Length ◽

Phylogenetic Inference ◽

Data Sets ◽

Length Variation ◽

Gene Trees ◽

Data Set ◽

Branch Lengths ◽

Future Work ◽

Best Fit

Abstract Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. Appropriately modelling this heterogeneity is important for reliable phylogenetic inference. One modelling approach in statistical phylogenetics is to apply independent models of molecular evolution to different groups of sites, where the groups are usually defined by locus, codon position, or combinations of the two. The potential impacts of partitioning data for the assignment of substitution models are well appreciated. Meanwhile, the treatment of branch lengths has received far less attention. In this study, we examined the effects of linking and unlinking branch-length parameters across loci. By analysing a range of empirical data sets, we find that the best-fitting model for phylogenetic inference is consistently one in which branch lengths are proportionally linked: gene trees have the same pattern of branch-length variation, but with varying absolute tree lengths. This model provided a substantially better fit than those that either assumed identical branch lengths across gene trees or that allowed each gene tree to have its own distinct set of branch lengths. Using simulations, we show that the fit of the three different models of branch lengths varies with the length of the sequence alignment and with the number of taxa in the data set. Our findings suggest that a model with proportionally linked branch lengths across loci is likely to provide the best fit under the conditions that are most commonly seen in practice. In future work, improvements in fit might be afforded by models with levels of complexity intermediate to proportional and free branch lengths. The results of our study have implications for model selection, computational efficiency, and experimental design in phylogenomics.
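To make the difference between the three branch-length treatments concrete, the sketch below counts the branch-length parameters each one estimates. It assumes all loci share a single unrooted, fully resolved topology, which has 2n - 3 branches for n taxa. The function name and the scheme labels are hypothetical and are not taken from the paper.

```python
def branch_length_param_counts(n_taxa, n_loci):
    """Count branch-length parameters under three linking schemes.

    Assumes one shared unrooted, fully resolved topology across loci.
    """
    if n_taxa < 3 or n_loci < 1:
        raise ValueError("need at least 3 taxa and 1 locus")
    branches = 2 * n_taxa - 3  # branches in an unrooted binary tree
    return {
        # identical branch lengths across all gene trees
        "linked": branches,
        # one shared set of lengths plus a rate multiplier per extra locus
        "proportional": branches + (n_loci - 1),
        # every gene tree has its own distinct set of lengths
        "unlinked": branches * n_loci,
    }


counts = branch_length_param_counts(n_taxa=10, n_loci=5)
```

For 10 taxa and 5 loci this gives 17, 21, and 85 parameters respectively, which shows why the proportional scheme sits much closer to the fully linked one in complexity.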


Mitochondrial Genes and Mammalian Phylogenies: Increasing the Reliability of Branch Length Estimation

Molecular Biology and Evolution ◽

10.1093/oxfordjournals.molbev.a026302 ◽

2000 ◽

Vol 17 (2) ◽

pp. 224-234 ◽

Cited By ~ 18

Author(s):

Patrice Showers Corneli ◽

Ryk H. Ward

Keyword(s):

Branch Length ◽

Mitochondrial Genes ◽

Length Estimation


ModelTeller: model selection for optimal phylogenetic reconstruction using machine learning

10.1101/2020.01.09.899906 ◽

2020 ◽

Author(s):

Shiran Abadi ◽

Oren Avram ◽

Saharon Rosset ◽

Tal Pupko ◽

Itay Mayrose

Keyword(s):

Machine Learning ◽

Model Selection ◽

Selection Criteria ◽

Sequence Data ◽

Phylogenetic Reconstruction ◽

Branch Length ◽

Complex Model ◽

Estimation Accuracy ◽

Model Selection Criteria ◽

Length Estimation

Abstract Statistical criteria have long been the standard for selecting the best model for phylogenetic reconstruction and downstream statistical inference. While model selection is regarded as a fundamental step in phylogenetics, existing methods for this task consume computational resources for long processing time, they are not always feasible, and sometimes depend on preliminary assumptions which do not hold for sequence data. Moreover, while these methods are dedicated to revealing the processes that underlie the sequence data, in most cases they do not produce the most accurate trees. Notably, phylogeny reconstruction consists of two related tasks, topology reconstruction and branch-length estimation. It was previously shown that in many cases the most complex model, GTR+I+G, leads to topologies that are as accurate as using existing model selection criteria, but overestimates branch lengths. Here, we present ModelTeller, a computational methodology for phylogenetic model selection, devised within the machine-learning framework, optimized to predict the most accurate model for branch-length estimation accuracy. ModelTeller relies on a readily implemented machine-learning model and thus the prediction according to features extracted from the sequence data results in a substantial decrease in running time compared to existing strategies. We show that on datasets simulated under simple homogeneous substitution models ModelTeller leads to branch-length estimation that is as accurate as the statistical model selection criteria. We then demonstrate that ModelTeller outperforms these criteria when more intricate patterns – that aim at mimicking realistic processes – are considered.
