Computing the Internode Certainty and related measures from partial gene trees

Mapping Intimacies ◽

10.1101/022053 ◽

2015 ◽

Cited By ~ 2

Author(s):

Kassian Kobert ◽

Leonidas Salichos ◽

Antonis Rokas ◽

Alexandros Stamatakis

Keyword(s):

Empirical Data ◽

Data Sets ◽

Gene Trees ◽

Data Set ◽

Reference Tree ◽

Full Species

Abstract We present, implement, and evaluate an approach to calculate the internode certainty and tree certainty on a given reference tree from a collection of partial gene trees. Previously, the calculation of these values was only possible from a collection of gene trees with exactly the same taxon set as the reference tree. An application to sets of partial gene trees requires mathematical corrections in the internode certainty and tree certainty calculations. We implement our methods in RAxML and test them on empirical data sets. These tests imply that the inclusion of partial trees does matter. However, in order to provide meaningful measurements, any data set should also include trees containing the full species set.
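
As a point of orientation, the sketch below computes internode certainty and tree certainty for the complete-tree case only, using the standard two-bipartition entropy form introduced by Salichos and Rokas (2013). It does not implement the corrections for partial gene trees that the abstract describes; the function names and the input layout (a pair of bipartition frequencies per internode) are assumptions made for illustration, not RAxML's interface.

```python
import math


def internode_certainty(freq_reference, freq_conflict):
    """IC for one internode from the frequency of the reference bipartition
    and the frequency of its most prevalent conflicting bipartition.

    Assumes all gene trees carry the full taxon set (no partial-tree
    correction is applied here).
    """
    total = freq_reference + freq_conflict
    if total <= 0:
        raise ValueError("at least one bipartition must be observed")
    ic = 1.0  # log2(2): the maximum entropy for two competing bipartitions
    for freq in (freq_reference, freq_conflict):
        p = freq / total
        if p > 0:  # treat 0 * log2(0) as 0
            ic += p * math.log2(p)
    return ic


def tree_certainty(frequency_pairs):
    """Tree certainty: the sum of IC values over all internodes."""
    return sum(internode_certainty(ref, con) for ref, con in frequency_pairs)
```

With no conflict the value is 1, and with two equally frequent bipartitions it is 0, which is why values near 0 are read as strong incongruence at that internode.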

Linking Branch Lengths across Sets of Loci Provides the Highest Statistical Support for Phylogenetic Inference

Molecular Biology and Evolution ◽

10.1093/molbev/msz291 ◽

2019 ◽

Vol 37 (4) ◽

pp. 1202-1210 ◽

Cited By ~ 6

Author(s):

David A Duchêne ◽

K Jun Tong ◽

Charles S P Foster ◽

Sebastián Duchêne ◽

Robert Lanfear ◽

...

Keyword(s):

Empirical Data ◽

Sequence Data ◽

Branch Length ◽

Data Sets ◽

Gene Trees ◽

Data Set ◽

Branch Lengths ◽

Statistical Support ◽

Distinct Branch ◽

Consistent Support

Abstract Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. In phylogenetics, the potential impacts of partitioning sequence data for the assignment of substitution models are well appreciated. In contrast, the treatment of branch lengths has received far less attention. In this study, we examined the effects of linking and unlinking branch-length parameters across loci or subsets of loci. By analyzing a range of empirical data sets, we find consistent support for a model in which branch lengths are proportionate between subsets of loci: gene trees share the same pattern of branch lengths, but form subsets that vary in their overall tree lengths. These models had substantially better statistical support than models that assume identical branch lengths across gene trees, or those in which genes form subsets with distinct branch-length patterns. We show using simulations and empirical data that the complexity of the branch-length model with the highest support depends on the length of the sequence alignment and on the numbers of taxa and loci in the data set. Our findings suggest that models in which branch lengths are proportionate between subsets have the highest statistical support under the conditions that are most commonly seen in practice. The results of our study have implications for model selection, computational efficiency, and experimental design in phylogenomics.
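
The trade-off between the three branch-length treatments can be made concrete by counting free branch-length parameters. The sketch below assumes an unrooted, fully resolved tree (2n - 3 branches for n taxa) and one branch-length set per locus; it ignores the grouping of loci into subsets that the study also examines. The function name and the labels "linked", "proportional" and "unlinked" are illustrative choices, not the authors' terminology or code.

```python
def branch_length_parameters(n_taxa, n_loci):
    """Count free branch-length parameters under three linking schemes.

    Assumes an unrooted binary tree shared by all loci.
    """
    if n_taxa < 3 or n_loci < 1:
        raise ValueError("need at least 3 taxa and 1 locus")
    branches = 2 * n_taxa - 3
    return {
        # one set of branch lengths shared by every locus
        "linked": branches,
        # one shared set plus a rate scaler per locus (one scaler is fixed)
        "proportional": branches + n_loci - 1,
        # an independent set of branch lengths for every locus
        "unlinked": branches * n_loci,
    }
```

For 10 taxa and 5 loci this gives 17, 21 and 85 parameters respectively, which illustrates why the proportional scheme adds little complexity relative to fully unlinked branch lengths.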

Detecting destabilizing species in the phylogenetic backbone of Potentilla (Rosaceae) using low-copy nuclear markers

AoB Plants ◽

10.1093/aobpla/plaa017 ◽

2020 ◽

Vol 12 (3) ◽

Cited By ~ 1

Author(s):

Nannie L Persson ◽

Ingrid Toresen ◽

Heidi Lie Andersen ◽

Jenny E E Smedmark ◽

Torsten Eriksson

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Data Sets ◽

Gene Trees ◽

Lineage Sorting ◽

Nuclear Markers ◽

Data Set ◽

Multispecies Coalescent ◽

Single Marker ◽

Named Groups

Abstract The genus Potentilla (Rosaceae) has been subjected to several phylogenetic studies, but resolving its evolutionary history has proven challenging. Previous analyses recovered six, informally named, groups: the Argentea, Ivesioid, Fragarioides, Reptans, Alba and Anserina clades, but the relationships among some of these clades differ between data sets. The Reptans clade, which includes the type species of Potentilla, has been noticed to shift position between plastid and nuclear ribosomal data sets. We studied this incongruence by analysing four low-copy nuclear markers, in addition to chloroplast and nuclear ribosomal data, with a set of Bayesian phylogenetic and Multispecies Coalescent (MSC) analyses. A selective taxon removal strategy demonstrated that the included representatives from the Fragarioides clade, P. dickinsii and P. fragarioides, were the main sources of the instability seen in the trees. The Fragarioides species showed different relationships in each gene tree, and were only supported as a monophyletic group in a single marker when the Reptans clade was excluded from the analysis. The incongruences could not be explained by allopolyploidy, but rather by homoploid hybridization, incomplete lineage sorting or taxon sampling effects. When P. dickinsii and P. fragarioides were removed from the data set, a fully resolved, supported backbone phylogeny of Potentilla was obtained in the MSC analysis. Additionally, indications of autopolyploid origins of the Reptans and Ivesioid clades were discovered in the low-copy gene trees.

Partitioned Gene-Tree Analyses and Gene-Based Topology Testing Help Resolve Incongruence in a Phylogenomic Study of Host-Specialist Bees (Apidae: Eucerinae)

Molecular Biology and Evolution ◽

10.1093/molbev/msaa277 ◽

2020 ◽

Author(s):

Felipe V Freitas ◽

Michael G Branstetter ◽

Terry Griswold ◽

Eduardo A B Almeida

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Data Sets ◽

Gene Trees ◽

Lineage Sorting ◽

Data Set ◽

Analytical Strategy ◽

Analytical Approaches ◽

Phylogenomic Study

Abstract Incongruence among phylogenetic results has become a common occurrence in analyses of genome-scale data sets. Incongruence originates from uncertainty in underlying evolutionary processes (e.g., incomplete lineage sorting) and from difficulties in determining the best analytical approaches for each situation. To overcome these difficulties, more studies are needed that identify incongruences and demonstrate practical ways to confidently resolve them. Here, we present results of a phylogenomic study based on the analysis of 197 taxa and 2,526 ultraconserved element (UCE) loci. We investigate evolutionary relationships of Eucerinae, a diverse subfamily of apid bees (relatives of honey bees and bumble bees) with >1,200 species. We sampled representatives of all tribes within the group and >80% of genera, including two mysterious South American genera, Chilimalopsis and Teratognatha. Initial analysis of the UCE data revealed two conflicting hypotheses for relationships among tribes. To resolve the incongruence, we tested concatenation and species tree approaches and used a variety of additional strategies including locus filtering, partitioned gene-tree searches, and gene-based topological tests. We show that within-locus partitioning improves gene tree and subsequent species-tree estimation, and that this approach confidently resolves the incongruence observed in our data set. After exploring our proposed analytical strategy on eucerine bees, we validated its efficacy to resolve hard phylogenetic problems by implementing it on a published UCE data set of Adephaga (Insecta: Coleoptera). Our results provide a robust phylogenetic hypothesis for Eucerinae and demonstrate a practical strategy for resolving incongruence in other phylogenomic data sets.

Synthesizing large-scale species trees using the strict consensus approach

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720017400029 ◽

2017 ◽

Vol 15 (03) ◽

pp. 1740002 ◽

Cited By ~ 2

Author(s):

Jucheol Moon ◽

Oliver Eulenstein

Keyword(s):

Comparative Study ◽

Empirical Data ◽

Large Scale ◽

Efficient Algorithms ◽

Data Sets ◽

Gene Trees ◽

Species Trees ◽

Consensus Approach ◽

Specific Objective ◽

Standard Tool

Abstract Supertree problems are a standard tool for synthesizing large-scale species trees from a given collection of gene trees under some problem-specific objective. Unfortunately, these problems are typically NP-hard, and often remain so when their instances are restricted to rooted gene trees sampled from the same species. While a class of restricted supertree problems has been effectively addressed by the parameterized strict consensus approach, in practice, most gene trees are unrooted and sampled from different species. Here, we overcome this stringent limitation by describing efficient algorithms that are adopting the strict consensus approach to also handle unrestricted supertree problems. Finally, we demonstrate the performance of our algorithms in a comparative study with classic supertree heuristics using simulated and empirical data sets.

New Methods to Calculate Concordance Factors for Phylogenomic Datasets

Molecular Biology and Evolution ◽

10.1093/molbev/msaa106 ◽

2020 ◽

Vol 37 (9) ◽

pp. 2727-2733 ◽

Cited By ~ 17

Author(s):

Bui Quang Minh ◽

Matthew W Hahn ◽

Robert Lanfear

Keyword(s):

Full Description ◽

Data Sets ◽

The Novel ◽

Gene Trees ◽

Reference Tree ◽

New Methods ◽

Genealogical Concordance ◽

Branch Support ◽

Two Measures ◽

Wide Usage

Abstract We implement two measures for quantifying genealogical concordance in phylogenomic data sets: the gene concordance factor (gCF) and the novel site concordance factor (sCF). For every branch of a reference tree, gCF is defined as the percentage of “decisive” gene trees containing that branch. This measure is already in wide usage, but here we introduce a package that calculates it while accounting for variable taxon coverage among gene trees. sCF is a new measure defined as the percentage of decisive sites supporting a branch in the reference tree. gCF and sCF complement classical measures of branch support in phylogenetics by providing a full description of underlying disagreement among loci and sites. An easy to use implementation and tutorial is freely available in the IQ-TREE software package (http://www.iqtree.org/doc/Concordance-Factor, last accessed May 13, 2020).
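
The gCF definition above can be sketched in a few lines. This is a simplified illustration, not the IQ-TREE implementation: it assumes the reference branch is described by the four clades around it (two on each side), treats a gene tree as "decisive" when it contains at least one taxon from each of those clades, and represents each gene tree as a hypothetical pair of its taxon set and its bipartitions (each given as one side of the split).

```python
def gene_concordance_factor(clades, gene_trees):
    """Percentage of decisive gene trees that contain the reference branch.

    clades: four taxon sets (a, b, c, d); the branch separates a+b from c+d.
    gene_trees: iterable of (taxa, splits), where each split is one side
                of a bipartition in that gene tree.
    Returns None when no gene tree is decisive for the branch.
    """
    a, b, c, d = (frozenset(clade) for clade in clades)
    decisive = 0
    concordant = 0
    for taxa, splits in gene_trees:
        taxa = frozenset(taxa)
        # decisive: every clade around the branch is represented
        if not all(taxa & clade for clade in (a, b, c, d)):
            continue
        decisive += 1
        # the reference branch restricted to the taxa this gene tree has
        left = (a | b) & taxa
        right = taxa - left
        observed = {frozenset(side) for side in splits}
        if left in observed or right in observed:
            concordant += 1
    if decisive == 0:
        return None
    return 100.0 * concordant / decisive
```

Restricting the reference branch to each gene tree's own taxa is what lets the measure account for variable taxon coverage: trees that lack an entire clade are simply excluded from the denominator.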

Quartet-Based Computations of Internode Certainty Provide Robust Measures of Phylogenetic Incongruence

Systematic Biology ◽

10.1093/sysbio/syz058 ◽

2019 ◽

Vol 69 (2) ◽

pp. 308-324 ◽

Cited By ~ 7

Author(s):

Xiaofan Zhou ◽

Sarah Lutteropp ◽

Lucas Czech ◽

Alexandros Stamatakis ◽

Moritz Von Looz ◽

...

Keyword(s):

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Simulated Data ◽

Data Sets ◽

Gene Trees ◽

Data Set ◽

Statistical Confidence ◽

Branch Support ◽

Robust Measures ◽

Genome Scale

Abstract Incongruence, or topological conflict, is prevalent in genome-scale data sets. Internode certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internal branch among a set of phylogenetic trees and complement regular branch support measures (e.g., bootstrap, posterior probability) that instead assess the statistical confidence of inference. Since most phylogenomic studies contain data partitions (e.g., genes) with missing taxa and IC scores stem from the frequencies of bipartitions (or splits) on a set of trees, IC score calculation typically requires adjusting the frequencies of bipartitions from these partial gene trees. However, when the proportion of missing taxa is high, the scores yielded by current approaches that adjust bipartition frequencies in partial gene trees differ substantially from each other and tend to be overestimates. To overcome these issues, we developed three new IC measures based on the frequencies of quartets, which naturally apply to both complete and partial trees. Comparison of our new quartet-based measures to previous bipartition-based measures on simulated data shows that: (1) on complete data sets, both quartet-based and bipartition-based measures yield very similar IC scores; (2) IC scores of quartet-based measures on a given data set with and without missing taxa are more similar than the scores of bipartition-based measures; and (3) quartet-based measures are more robust to the absence of phylogenetic signal and errors in phylogenetic inference than bipartition-based measures. Additionally, the analysis of an empirical mammalian phylogenomic data set using our quartet-based measures reveals the presence of substantial levels of incongruence for numerous internal branches. An efficient open-source implementation of these quartet-based measures is freely available in the program QuartetScores (https://github.com/lutteropp/QuartetScores).

Multiple-Choice Tests: Polytomous IRT Models Misestimate Item Information

The Spanish Journal of Psychology ◽

10.1017/sjp.2014.95 ◽

2014 ◽

Vol 17 ◽

Cited By ~ 2

Author(s):

Miguel A. García-Pérez

Keyword(s):

Empirical Data ◽

Choice Model ◽

Multiple Choice ◽

Data Sets ◽

Data Set ◽

Information Functions ◽

Multiple Choice Tests ◽

Test Information ◽

Choice Tests ◽

Polytomous Models

Abstract Likert-type items and polytomous models are preferred over yes–no items and dichotomous models for the measurement of attitudes, because a broader range of response categories provides superior item and test information functions. Yet, for ability assessment with multiple-choice tests, the dichotomous three-parameter logistic model (3PLM) is often chosen. Because multiple-choice responses are polytomous before they are categorized as correct or incorrect, a polytomous characterization might render more efficient tests. Early studies suggested that the nominal response model (NRM) is advantageous in this respect. We investigate the reasons for those results and the outcomes of a polytomous characterization based on the multiple-choice model (MCM). An empirical data set is used to compare polytomous (NRM and MCM) and dichotomous (3PLM) characterizations of a test. The results revealed superior item and test information functions from polytomous models. Yet, close inspection suggests that these outcomes are artifactual and two simulation studies confirmed this point. These studies revealed a structural inadequacy of the NRM for multiple-choice items and that the MCM characterization outperforms the 3PLM characterization only when distractor endorsement frequencies vary non-monotonically with ability, although this feature is rarely observed in empirical data sets.

Phylogenetic signal is associated with the degree of variation in root-to-tip distances

10.1101/2020.01.28.923805 ◽

2020 ◽

Author(s):

Mezzalina Vankan ◽

Simon Y.W. Ho ◽

Carolina Pardo-Diaz ◽

David A. Duchêne

Keyword(s):

Sequence Data ◽

Phylogenetic Signal ◽

Gene Tree ◽

Phylogenetic Inference ◽

Genomic Region ◽

Published Data ◽

Data Sets ◽

Gene Trees ◽

Species Trees ◽

Data Set

Abstract The phylogenetic information contained in sequence data is partly determined by the overall rate of nucleotide substitution in the genomic region in question. However, phylogenetic signal is affected by various other factors, such as heterogeneity in substitution rates across lineages. These factors might be able to predict the phylogenetic accuracy of any given gene in a data set. We examined the association between the accuracy of phylogenetic inference across genes and several characteristics of branch lengths in phylogenomic data. In a large number of published data sets, we found that the accuracy of phylogenetic inference from genes was consistently associated with their mean statistical branch support and variation in their gene tree root-to-tip distances, but not with tree length and stemminess. Therefore, a signal of constant evolutionary rates across lineages appears to be beneficial for phylogenetic inference. Identifying the causes of variation in root-to-tip lengths in gene trees also offers a potential way forward to increase congruence in the signal across genes and improve estimates of species trees from phylogenomic data sets.

Linking Branch Lengths Across Loci Provides the Best Fit for Phylogenetic Inference

10.1101/467449 ◽

2018 ◽

Cited By ~ 2

Author(s):

David A. Duchêne ◽

K. Jun Tong ◽

Charles S. P. Foster ◽

Sebastián Duchêne ◽

Robert Lanfear ◽

...

Keyword(s):

Gene Tree ◽

Branch Length ◽

Phylogenetic Inference ◽

Data Sets ◽

Length Variation ◽

Gene Trees ◽

Data Set ◽

Branch Lengths ◽

Future Work ◽

Best Fit

Abstract Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. Appropriately modelling this heterogeneity is important for reliable phylogenetic inference. One modelling approach in statistical phylogenetics is to apply independent models of molecular evolution to different groups of sites, where the groups are usually defined by locus, codon position, or combinations of the two. The potential impacts of partitioning data for the assignment of substitution models are well appreciated. Meanwhile, the treatment of branch lengths has received far less attention. In this study, we examined the effects of linking and unlinking branch-length parameters across loci. By analysing a range of empirical data sets, we find that the best-fitting model for phylogenetic inference is consistently one in which branch lengths are proportionally linked: gene trees have the same pattern of branch-length variation, but with varying absolute tree lengths. This model provided a substantially better fit than those that either assumed identical branch lengths across gene trees or that allowed each gene tree to have its own distinct set of branch lengths. Using simulations, we show that the fit of the three different models of branch lengths varies with the length of the sequence alignment and with the number of taxa in the data set. Our findings suggest that a model with proportionally linked branch lengths across loci is likely to provide the best fit under the conditions that are most commonly seen in practice. In future work, improvements in fit might be afforded by models with levels of complexity intermediate to proportional and free branch lengths. The results of our study have implications for model selection, computational efficiency, and experimental design in phylogenomics.

The social wasp Vespula germanica (Fabricius) (Hymenoptera: Vespidae) population dynamics in England over 39 years.

The Entomologist's Monthly Magazine ◽

10.31184/m00138908.1542.3906 ◽

2018 ◽

Vol 154 (2) ◽

pp. 149-155

Author(s):

Michael Archer

Keyword(s):

Population Dynamics ◽

Population Dynamic ◽

Ecological Factors ◽

Social Wasp ◽

Data Sets ◽

Data Set ◽

Vespula Germanica ◽

The Social ◽

Minimum Number ◽

Suction Traps

1. Yearly records of worker Vespula germanica (Fabricius) taken in suction traps at Silwood Park (28 years) and at Rothamsted Research (39 years) are examined. 2. Using the autocorrelation function (ACF), a significant negative 1-year lag followed by a lesser non-significant positive 2-year lag was found in all, or parts of, each data set, indicating an underlying population dynamic of a 2-year cycle with a damped waveform. 3. The minimum number of years before the 2-year cycle with damped waveform was shown varied between 17 and 26, or was not found in some data sets. 4. Ecological factors delaying or preventing the occurrence of the 2-year cycle are considered.
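
The pattern described in point 2 (a negative lag-1 and a positive lag-2 autocorrelation) can be illustrated with a minimal sample autocorrelation function. The abstract does not state which estimator or significance test was used, so the usual biased sample estimator is assumed here, and the function name is illustrative.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a yearly count series at the given lag.

    Uses the common estimator that divides the lagged cross-products by
    the total sum of squared deviations from the series mean.
    """
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        raise ValueError("series has no variation")
    numerator = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return numerator / denominator
```

For a made-up series that alternates between high and low years, such as `[10, 2, 9, 3, 8, 4, 7, 5]`, the lag-1 value is negative and the lag-2 value is positive, which is the signature of a 2-year cycle. A common rule of thumb compares each value against ±1.96/√n for approximate significance.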
