Linking Branch Lengths Across Loci Provides the Best Fit for Phylogenetic Inference

Mapping Intimacies ◽

10.1101/467449 ◽

2018 ◽

Cited By ~ 2

Author(s):

David A. Duchêne ◽

K. Jun Tong ◽

Charles S. P. Foster ◽

Sebastián Duchêne ◽

Robert Lanfear ◽

...

Keyword(s):

Gene Tree ◽

Branch Length ◽

Phylogenetic Inference ◽

Data Sets ◽

Length Variation ◽

Gene Trees ◽

Data Set ◽

Branch Lengths ◽

Future Work ◽

Best Fit

Abstract Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. Appropriately modelling this heterogeneity is important for reliable phylogenetic inference. One modelling approach in statistical phylogenetics is to apply independent models of molecular evolution to different groups of sites, where the groups are usually defined by locus, codon position, or combinations of the two. The potential impacts of partitioning data for the assignment of substitution models are well appreciated. Meanwhile, the treatment of branch lengths has received far less attention. In this study, we examined the effects of linking and unlinking branch-length parameters across loci. By analysing a range of empirical data sets, we find that the best-fitting model for phylogenetic inference is consistently one in which branch lengths are proportionally linked: gene trees have the same pattern of branch-length variation, but with varying absolute tree lengths. This model provided a substantially better fit than those that either assumed identical branch lengths across gene trees or that allowed each gene tree to have its own distinct set of branch lengths. Using simulations, we show that the fit of the three different models of branch lengths varies with the length of the sequence alignment and with the number of taxa in the data set. Our findings suggest that a model with proportionally linked branch lengths across loci is likely to provide the best fit under the conditions that are most commonly seen in practice. In future work, improvements in fit might be afforded by models with levels of complexity intermediate to proportional and free branch lengths. The results of our study have implications for model selection, computational efficiency, and experimental design in phylogenomics.
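
To make the difference in complexity between the three branch-length models concrete, the following minimal sketch counts their free branch-length parameters. It is an illustration, not the authors' code. It assumes a single fully resolved unrooted topology shared by all loci (so 2n - 3 branches for n taxa) and counts the proportional model as one shared set of branch lengths plus one tree-length scaler per locus, with one scaler fixed for identifiability. The function name `branch_length_params` is hypothetical.

```python
def branch_length_params(n_taxa: int, n_loci: int) -> dict:
    """Count free branch-length parameters under three linking schemes.

    Assumption: all loci share one fully resolved unrooted topology,
    which has 2 * n_taxa - 3 branches.
    """
    if n_taxa < 3 or n_loci < 1:
        raise ValueError("need at least 3 taxa and 1 locus")
    branches = 2 * n_taxa - 3
    return {
        # identical branch lengths across all gene trees
        "linked": branches,
        # shared branch-length pattern, one rate scaler per locus
        # (one scaler is fixed, hence n_loci - 1 extra parameters)
        "proportional": branches + (n_loci - 1),
        # every gene tree has its own distinct set of branch lengths
        "unlinked": branches * n_loci,
    }


counts = branch_length_params(n_taxa=20, n_loci=100)
for model, k in counts.items():
    print(f"{model:>12}: {k} branch-length parameters")
```

Under these assumptions the unlinked model grows with the product of loci and branches, while the proportional model adds only one parameter per extra locus, which is one way to see why its fit can depend on alignment length and on the number of taxa.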

Linking Branch Lengths across Sets of Loci Provides the Highest Statistical Support for Phylogenetic Inference

Molecular Biology and Evolution ◽

10.1093/molbev/msz291 ◽

2019 ◽

Vol 37 (4) ◽

pp. 1202-1210 ◽

Cited By ~ 6

Author(s):

David A Duchêne ◽

K Jun Tong ◽

Charles S P Foster ◽

Sebastián Duchêne ◽

Robert Lanfear ◽

...

Keyword(s):

Empirical Data ◽

Sequence Data ◽

Branch Length ◽

Data Sets ◽

Gene Trees ◽

Data Set ◽

Branch Lengths ◽

Statistical Support ◽

Distinct Branch ◽

Consistent Support

Abstract Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. In phylogenetics, the potential impacts of partitioning sequence data for the assignment of substitution models are well appreciated. In contrast, the treatment of branch lengths has received far less attention. In this study, we examined the effects of linking and unlinking branch-length parameters across loci or subsets of loci. By analyzing a range of empirical data sets, we find consistent support for a model in which branch lengths are proportionate between subsets of loci: gene trees share the same pattern of branch lengths, but form subsets that vary in their overall tree lengths. These models had substantially better statistical support than models that assume identical branch lengths across gene trees, or those in which genes form subsets with distinct branch-length patterns. We show using simulations and empirical data that the complexity of the branch-length model with the highest support depends on the length of the sequence alignment and on the numbers of taxa and loci in the data set. Our findings suggest that models in which branch lengths are proportionate between subsets have the highest statistical support under the conditions that are most commonly seen in practice. The results of our study have implications for model selection, computational efficiency, and experimental design in phylogenomics.

Phylogenetic signal is associated with the degree of variation in root-to-tip distances

10.1101/2020.01.28.923805 ◽

2020 ◽

Author(s):

Mezzalina Vankan ◽

Simon Y.W. Ho ◽

Carolina Pardo-Diaz ◽

David A. Duchêne

Keyword(s):

Sequence Data ◽

Phylogenetic Signal ◽

Gene Tree ◽

Phylogenetic Inference ◽

Genomic Region ◽

Published Data ◽

Data Sets ◽

Gene Trees ◽

Species Trees ◽

Data Set

Abstract The phylogenetic information contained in sequence data is partly determined by the overall rate of nucleotide substitution in the genomic region in question. However, phylogenetic signal is affected by various other factors, such as heterogeneity in substitution rates across lineages. These factors might be able to predict the phylogenetic accuracy of any given gene in a data set. We examined the association between the accuracy of phylogenetic inference across genes and several characteristics of branch lengths in phylogenomic data. In a large number of published data sets, we found that the accuracy of phylogenetic inference from genes was consistently associated with their mean statistical branch support and variation in their gene tree root-to-tip distances, but not with tree length and stemminess. Therefore, a signal of constant evolutionary rates across lineages appears to be beneficial for phylogenetic inference. Identifying the causes of variation in root-to-tip lengths in gene trees also offers a potential way forward to increase congruence in the signal across genes and improve estimates of species trees from phylogenomic data sets.
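
As an illustration of what "variation in root-to-tip distances" can mean in practice, the sketch below computes the coefficient of variation of root-to-tip distances on a small rooted tree. The choice of the coefficient of variation and the tuple-based tree encoding are assumptions made for this example; the abstract does not state which statistic or software the authors used.

```python
from statistics import mean, pstdev

# Hypothetical tree encoding: node = (branch_length, children),
# where a tip has an empty list of children.

def root_to_tip(node, dist=0.0):
    """Yield the summed branch length from the root to every tip."""
    length, children = node
    dist += length
    if not children:
        yield dist
    for child in children:
        yield from root_to_tip(child, dist)


def root_to_tip_cv(tree):
    """Coefficient of variation of root-to-tip distances (0 = clock-like)."""
    distances = list(root_to_tip(tree))
    return pstdev(distances) / mean(distances)


# Every tip is 0.3 substitutions/site from the root.
clocklike = (0.0, [(0.1, [(0.2, []), (0.2, [])]), (0.3, [])])
# One lineage has an elevated rate (tip at 0.7 instead of 0.3).
skewed = (0.0, [(0.1, [(0.2, []), (0.6, [])]), (0.3, [])])

print(root_to_tip_cv(clocklike), root_to_tip_cv(skewed))
```

A value near zero indicates near-constant rates across lineages; larger values indicate stronger among-lineage rate heterogeneity. The measure needs a rooted tree, so an unrooted gene tree would first have to be rooted, for example with an outgroup.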

Detecting destabilizing species in the phylogenetic backbone of Potentilla (Rosaceae) using low-copy nuclear markers

AoB Plants ◽

10.1093/aobpla/plaa017 ◽

2020 ◽

Vol 12 (3) ◽

Cited By ~ 1

Author(s):

Nannie L Persson ◽

Ingrid Toresen ◽

Heidi Lie Andersen ◽

Jenny E E Smedmark ◽

Torsten Eriksson

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Data Sets ◽

Gene Trees ◽

Lineage Sorting ◽

Nuclear Markers ◽

Data Set ◽

Multispecies Coalescent ◽

Single Marker ◽

Named Groups

Abstract The genus Potentilla (Rosaceae) has been subjected to several phylogenetic studies, but resolving its evolutionary history has proven challenging. Previous analyses recovered six, informally named, groups: the Argentea, Ivesioid, Fragarioides, Reptans, Alba and Anserina clades, but the relationships among some of these clades differ between data sets. The Reptans clade, which includes the type species of Potentilla, has been noticed to shift position between plastid and nuclear ribosomal data sets. We studied this incongruence by analysing four low-copy nuclear markers, in addition to chloroplast and nuclear ribosomal data, with a set of Bayesian phylogenetic and Multispecies Coalescent (MSC) analyses. A selective taxon removal strategy demonstrated that the included representatives from the Fragarioides clade, P. dickinsii and P. fragarioides, were the main sources of the instability seen in the trees. The Fragarioides species showed different relationships in each gene tree, and were only supported as a monophyletic group in a single marker when the Reptans clade was excluded from the analysis. The incongruences could not be explained by allopolyploidy, but rather by homoploid hybridization, incomplete lineage sorting or taxon sampling effects. When P. dickinsii and P. fragarioides were removed from the data set, a fully resolved, supported backbone phylogeny of Potentilla was obtained in the MSC analysis. Additionally, indications of autopolyploid origins of the Reptans and Ivesioid clades were discovered in the low-copy gene trees.

Partitioned Gene-Tree Analyses and Gene-Based Topology Testing Help Resolve Incongruence in a Phylogenomic Study of Host-Specialist Bees (Apidae: Eucerinae)

Molecular Biology and Evolution ◽

10.1093/molbev/msaa277 ◽

2020 ◽

Author(s):

Felipe V Freitas ◽

Michael G Branstetter ◽

Terry Griswold ◽

Eduardo A B Almeida

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Data Sets ◽

Gene Trees ◽

Lineage Sorting ◽

Data Set ◽

Analytical Strategy ◽

Analytical Approaches ◽

Phylogenomic Study

Abstract Incongruence among phylogenetic results has become a common occurrence in analyses of genome-scale data sets. Incongruence originates from uncertainty in underlying evolutionary processes (e.g., incomplete lineage sorting) and from difficulties in determining the best analytical approaches for each situation. To overcome these difficulties, more studies are needed that identify incongruences and demonstrate practical ways to confidently resolve them. Here, we present results of a phylogenomic study based on the analysis of 197 taxa and 2,526 ultraconserved element (UCE) loci. We investigate evolutionary relationships of Eucerinae, a diverse subfamily of apid bees (relatives of honey bees and bumble bees) with >1,200 species. We sampled representatives of all tribes within the group and >80% of genera, including two mysterious South American genera, Chilimalopsis and Teratognatha. Initial analysis of the UCE data revealed two conflicting hypotheses for relationships among tribes. To resolve the incongruence, we tested concatenation and species tree approaches and used a variety of additional strategies including locus filtering, partitioned gene-tree searches, and gene-based topological tests. We show that within-locus partitioning improves gene tree and subsequent species-tree estimation, and that this approach confidently resolves the incongruence observed in our data set. After exploring our proposed analytical strategy on eucerine bees, we validated its efficacy to resolve hard phylogenetic problems by implementing it on a published UCE data set of Adephaga (Insecta: Coleoptera). Our results provide a robust phylogenetic hypothesis for Eucerinae and demonstrate a practical strategy for resolving incongruence in other phylogenomic data sets.

Predicting the Impact of Describing New Species on Phylogenetic Patterns

Integrative Organismal Biology ◽

10.1093/iob/obz028 ◽

2019 ◽

Vol 1 (1) ◽

Cited By ~ 1

Author(s):

D C Blackburn ◽

G Giribet ◽

D E Soltis ◽

E L Stanley

Keyword(s):

New Species ◽

Phylogenetic Trees ◽

Branch Length ◽

Length Variation ◽

Tree Shape ◽

Branch Lengths ◽

Taxonomic History ◽

Ecological Patterns ◽

The Impact ◽

Incomplete Sampling

Abstract Although our inventory of Earth’s biodiversity remains incomplete, we still require analyses using the Tree of Life to understand evolutionary and ecological patterns. Because incomplete sampling may bias our inferences, we must evaluate how future additions of newly discovered species might impact analyses performed today. We describe an approach that uses taxonomic history and phylogenetic trees to characterize the impact of past species discoveries on phylogenetic knowledge using patterns of branch-length variation, tree shape, and phylogenetic diversity. This provides a framework for assessing the relative completeness of taxonomic knowledge of lineages within a phylogeny. To demonstrate this approach, we use recent large phylogenies for amphibians, reptiles, flowering plants, and invertebrates. Well-known clades exhibit a decline in the mean and range of branch lengths that are added each year as new species are described. With increased taxonomic knowledge over time, deep lineages of well-known clades become known such that most recently described new species are added close to the tips of the tree, reflecting changing tree shape over the course of taxonomic history. The same analyses reveal other clades to be candidates for future discoveries that could dramatically impact our phylogenetic knowledge. Our work reveals that species are often added non-randomly to the phylogeny over multiyear time-scales in a predictable pattern of taxonomic maturation. Our results suggest that we can make informed predictions about how new species will be added across the phylogeny of a given clade, thus providing a framework for accommodating unsampled undescribed species in evolutionary analyses.

Disentangling Sources of Gene Tree Discordance in Phylogenomic Data Sets: Testing Ancient Hybridizations in Amaranthaceae s.l

Systematic Biology ◽

10.1093/sysbio/syaa066 ◽

2020 ◽

Cited By ~ 1

Author(s):

Diego F Morales-Briones ◽

Gudrun Kadereit ◽

Delphine T Tefarikis ◽

Michael J Moore ◽

Stephen A Smith ◽

...

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Species Tree ◽

Data Sets ◽

Species Trees ◽

Lineage Sorting ◽

Data Set ◽

Gene Tree Discordance

Abstract Gene tree discordance in large genomic data sets can be caused by evolutionary processes such as incomplete lineage sorting and hybridization, as well as model violation, and errors in data processing, orthology inference, and gene tree estimation. Species tree methods that identify and accommodate all sources of conflict are not available, but a combination of multiple approaches can help tease apart alternative sources of conflict. Here, using a phylotranscriptomic analysis in combination with reference genomes, we test a hypothesis of ancient hybridization events within the plant family Amaranthaceae s.l. that was previously supported by morphological, ecological, and Sanger-based molecular data. The data set included seven genomes and 88 transcriptomes, 17 generated for this study. We examined gene-tree discordance using coalescent-based species trees and network inference, gene tree discordance analyses, site pattern tests of introgression, topology tests, synteny analyses, and simulations. We found that a combination of processes might have generated the high levels of gene tree discordance in the backbone of Amaranthaceae s.l. Furthermore, we found evidence that three consecutive short internal branches produce anomalous trees contributing to the discordance. Overall, our results suggest that Amaranthaceae s.l. might be a product of an ancient and rapid lineage diversification, and remains, and probably will remain, unresolved. This work highlights the potential problems of identifiability associated with the sources of gene tree discordance including, in particular, phylogenetic network methods. Our results also demonstrate the importance of thoroughly testing for multiple sources of conflict in phylogenomic analyses, especially in the context of ancient, rapid radiations. We provide several recommendations for exploring conflicting signals in such situations. 
[Amaranthaceae; gene tree discordance; hybridization; incomplete lineage sorting; phylogenomics; species network; species tree; transcriptomics.]

A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms

BMC Medical Informatics and Decision Making ◽

10.1186/s12911-019-1014-6 ◽

2020 ◽

Vol 20 (1) ◽

Cited By ~ 8

Author(s):

André M. Carrington ◽

Paul W. Fieguth ◽

Hammad Qazi ◽

Andreas Holzinger ◽

Helen H. Chen ◽

...

Keyword(s):

Roc Curve ◽

Diagnostic Testing ◽

Real Life ◽

Imbalanced Data ◽

Machine Learning Algorithms ◽

Data Sets ◽

Data Set ◽

Partial Auc ◽

C Statistic ◽

Future Work

Abstract Background: In classification and diagnostic testing, the receiver-operator characteristic (ROC) plot and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of error: false positives and false negatives. Only part of the ROC curve and AUC are informative, however, when they are used with imbalanced data. Hence, alternatives to the AUC have been proposed, such as the partial AUC and the area under the precision-recall curve. However, these alternatives cannot be as fully interpreted as the AUC, in part because they ignore some information about actual negatives. Methods: We derive and propose a new concordant partial AUC and a new partial c statistic for ROC data, as foundational measures and methods to help understand and explain parts of the ROC plot and AUC. Our partial measures are continuous and discrete versions of the same measure, are derived from the AUC and c statistic respectively, are validated as equal to each other, and validated as equal in summation to whole measures where expected. Our partial measures are tested for validity on a classic ROC example from Fawcett, a variation thereof, and two real-life benchmark data sets in breast cancer: the Wisconsin and Ljubljana data sets. Interpretation of an example is then provided. Results: Results show the expected equalities between our new partial measures and the existing whole measures. The example interpretation illustrates the need for our newly derived partial measures. Conclusions: The concordant partial area under the ROC curve was proposed and, unlike previous partial measure alternatives, it maintains the characteristics of the AUC. The first partial c statistic for ROC plots was also proposed as an unbiased interpretation for part of an ROC curve. The expected equalities among and between our newly derived partial measures and their existing full measure counterparts are confirmed. These measures may be used with any data set but this paper focuses on imbalanced data with low prevalence. Future work: Future work with our proposed measures may demonstrate their value for imbalanced data with high prevalence, compare them to other measures not based on areas, and combine them with other ROC measures and techniques.
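
The abstract builds on the fact that the whole AUC (a continuous area) and the whole c statistic (a discrete count of concordant pairs) are the same quantity. The sketch below checks that equality on made-up scores. It covers only the whole measures; it does not implement the authors' concordant partial AUC or partial c statistic, and the function names and example scores are hypothetical.

```python
def c_statistic(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    concordant = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos_scores) * len(neg_scores))


def roc_auc(pos_scores, neg_scores):
    """Trapezoidal area under the empirical ROC curve."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # (FPR, TPR) at a threshold above every score
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative classifier scores (invented for this example).
pos = [0.9, 0.8, 0.6, 0.55]
neg = [0.7, 0.6, 0.3, 0.1]
print(c_statistic(pos, neg), roc_auc(pos, neg))
```

Both functions give the same value for these scores, including the tied pair at 0.6, which the trapezoid rule and the half-credit convention treat identically.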

Computing the Internode Certainty and related measures from partial gene trees

10.1101/022053 ◽

2015 ◽

Cited By ~ 2

Author(s):

Kassian Kobert ◽

Leonidas Salichos ◽

Antonis Rokas ◽

Alexandros Stamatakis

Keyword(s):

Empirical Data ◽

Data Sets ◽

Gene Trees ◽

Data Set ◽

Reference Tree ◽

Full Species

Abstract We present, implement, and evaluate an approach to calculate the internode certainty and tree certainty on a given reference tree from a collection of partial gene trees. Previously, the calculation of these values was only possible from a collection of gene trees with exactly the same taxon set as the reference tree. An application to sets of partial gene trees requires mathematical corrections in the internode certainty and tree certainty calculations. We implement our methods in RAxML and test them on empirical data sets. These tests imply that the inclusion of partial trees does matter. However, in order to provide meaningful measurements, any data set should also include trees containing the full species set.
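
For orientation, the sketch below shows the basic internode certainty calculation for the full-taxon-set case that this work extends. It uses the entropy-based definition over the reference bipartition and its most prevalent conflicting bipartition, and it assumes the reference bipartition is the more frequent of the two. It does not include the corrections for partial gene trees described in the abstract, it is not RAxML code, and the function name and counts are hypothetical.

```python
import math


def internode_certainty(ref_count: int, conflict_count: int) -> float:
    """Internode certainty from the number of gene trees supporting the
    reference bipartition and its most prevalent conflicting bipartition.

    IC = 1 + p1 * log2(p1) + p2 * log2(p2), with p_i the relative frequencies.
    """
    total = ref_count + conflict_count
    if total == 0:
        raise ValueError("no gene trees contain either bipartition")
    ic = 1.0
    for count in (ref_count, conflict_count):
        if count > 0:  # the limit of p * log2(p) as p -> 0 is 0
            p = count / total
            ic += p * math.log2(p)
    return ic


print(internode_certainty(90, 10))   # strong support, IC close to 1
print(internode_certainty(50, 50))   # maximal conflict, IC of 0
```

With partial gene trees, a tree that lacks some taxa cannot be assigned unambiguously to one bipartition of the reference tree, which is why the simple counts used here need the corrections the authors describe.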

Terraces in Gene Tree Reconciliation-Based Species Tree Inference

10.1101/2020.04.17.047092 ◽

2020 ◽

Author(s):

Michael J. Sanderson ◽

Michelle M. McMahon ◽

Mike Steel

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Solution Space ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Data Set ◽

Tree Reconciliation ◽

The Impact

Abstract Terraces in phylogenetic tree space are sets of trees with identical optimality scores for a given data set, arising from missing data. These were first described for multilocus phylogenetic data sets in the context of maximum parsimony inference and maximum likelihood inference under certain model assumptions. Here we show how the mathematical properties that lead to terraces extend to gene tree - species tree problems in which the gene trees are incomplete. Inference of species trees from either sets of gene family trees subject to duplication and loss, or allele trees subject to incomplete lineage sorting, can exhibit terraces in their solution space. First, we show conditions that lead to a new kind of terrace, which stems from subtree operations that appear in reconciliation problems for incomplete trees. Then we characterize when terraces of both types can occur when the optimality criterion for tree search is based on duplication, loss or deep coalescence scores. Finally, we examine the impact of assumptions about the causes of losses: whether they are due to imperfect sampling or true evolutionary deletion.

The Multispecies Coalescent Model Outperforms Concatenation across Diverse Phylogenomic Data Sets

10.1101/860809 ◽

2019 ◽

Author(s):

Xiaodong Jian ◽

Scott V. Edwards ◽

Liang Liu

Keyword(s):

Data Analysis ◽

Model Validation ◽

Bayesian Model ◽

Model Comparison ◽

Phylogenetic Inference ◽

Data Sets ◽

Gene Trees ◽

Substitution Model ◽

Multispecies Coalescent ◽

Substitution Models

Abstract A statistical framework of model comparison and model validation is essential to resolving the debates over concatenation and coalescent models in phylogenomic data analysis. A set of statistical tests are here applied and developed to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across the tree of life. Tests for substitution models and the concatenation assumption of topologically concordant gene trees suggest that a poor fit of substitution models (44% of loci rejecting the substitution model) and concatenation models (38% of loci rejecting the hypothesis of topologically congruent gene trees) is widespread. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across 6 major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than those rejecting the substitution and concatenation models, and Bayesian model comparison strongly favors the MSC over concatenation across all data sets. Species tree inference suggests that loci rejecting the MSC have little effect on species tree estimation. Due to computational constraints, the Bayesian model validation and comparison analyses were conducted on the reduced data sets. A complete analysis of phylogenomic data requires the development of efficient algorithms for phylogenetic inference. Nevertheless, the concatenation assumption of congruent gene trees rarely holds for phylogenomic data with more than 10 loci. Thus, for large phylogenomic data sets, model comparison analyses are expected to consistently and more strongly favor the coalescent model over the concatenation model. Our analysis reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference.