Quartet-based computations of internode certainty provide accurate and robust measures of phylogenetic incongruence

Mapping Intimacies ◽

10.1101/168526 ◽

2017 ◽

Cited By ~ 8

Author(s):

Xiaofan Zhou ◽

Sarah Lutteropp ◽

Lucas Czech ◽

Alexandros Stamatakis ◽

Moritz von Looz ◽

...

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Simulated Data ◽

Data Sets ◽

Gene Trees ◽

Branch Support ◽

Robust Measures ◽

Genome Scale ◽

Scale Data

Abstract Incongruence, or topological conflict, is prevalent in genome-scale data sets but relatively few measures have been developed to quantify it. Internode Certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internode (or internal branch) among a set of phylogenetic trees and complement regular branch support statistics in assessing the confidence of the inferred phylogenetic relationships. Since most phylogenomic studies contain data partitions (e.g., genes) with missing taxa and IC scores stem from the frequencies of bipartitions (or splits) on a set of trees, the calculation of IC scores requires adjusting the frequencies of bipartitions from these partial gene trees. However, when the proportion of missing data is high, current approaches that adjust bipartition frequencies in partial gene trees tend to overestimate IC scores and alternative adjustment approaches differ substantially from each other in their scores. To overcome these issues, we developed three new measures for calculating internode certainty that are based on the frequencies of quartets, which naturally apply to both comprehensive and partial trees. Our comparison of these new quartet-based measures to previous bipartition-based measures on simulated data shows that: 1) on comprehensive trees, both types of measures yield highly similar IC scores; 2) on partial trees, quartet-based measures generate more accurate IC scores; and 3) quartet-based measures are more robust to the absence of phylogenetic signal and errors in the phylogenetic relationships to be assessed. Additionally, analysis of 15 empirical phylogenomic data sets using our quartet-based measures suggests that numerous relationships remain unresolved despite the availability of genome-scale data. Finally, we provide an efficient open-source implementation of these quartet-based measures in the program QuartetScores, which is freely available at https://github.com/algomaus/QuartetScores.


Quartet-Based Computations of Internode Certainty Provide Robust Measures of Phylogenetic Incongruence

Systematic Biology ◽

10.1093/sysbio/syz058 ◽

2019 ◽

Vol 69 (2) ◽

pp. 308-324 ◽

Cited By ~ 7

Author(s):

Xiaofan Zhou ◽

Sarah Lutteropp ◽

Lucas Czech ◽

Alexandros Stamatakis ◽

Moritz von Looz ◽

...

Keyword(s):

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Simulated Data ◽

Data Sets ◽

Gene Trees ◽

Data Set ◽

Statistical Confidence ◽

Branch Support ◽

Robust Measures ◽

Genome Scale

Abstract Incongruence, or topological conflict, is prevalent in genome-scale data sets. Internode certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internal branch among a set of phylogenetic trees and complement regular branch support measures (e.g., bootstrap, posterior probability) that instead assess the statistical confidence of inference. Since most phylogenomic studies contain data partitions (e.g., genes) with missing taxa and IC scores stem from the frequencies of bipartitions (or splits) on a set of trees, IC score calculation typically requires adjusting the frequencies of bipartitions from these partial gene trees. However, when the proportion of missing taxa is high, the scores yielded by current approaches that adjust bipartition frequencies in partial gene trees differ substantially from each other and tend to be overestimates. To overcome these issues, we developed three new IC measures based on the frequencies of quartets, which naturally apply to both complete and partial trees. Comparison of our new quartet-based measures to previous bipartition-based measures on simulated data shows that: (1) on complete data sets, both quartet-based and bipartition-based measures yield very similar IC scores; (2) IC scores of quartet-based measures on a given data set with and without missing taxa are more similar than the scores of bipartition-based measures; and (3) quartet-based measures are more robust to the absence of phylogenetic signal and errors in phylogenetic inference than bipartition-based measures. Additionally, the analysis of an empirical mammalian phylogenomic data set using our quartet-based measures reveals the presence of substantial levels of incongruence for numerous internal branches. An efficient open-source implementation of these quartet-based measures is freely available in the program QuartetScores (https://github.com/lutteropp/QuartetScores).
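As a rough illustration of how an entropy-style certainty score can be derived from topology frequencies, the sketch below computes one minus the normalized Shannon entropy of a set of counts. With two conflicting bipartitions the logarithm base is 2; with the three possible topologies of a quartet it is 3. The function names, the sign convention for a non-dominant reference topology, and the example counts are assumptions made for illustration. This is not the QuartetScores API, and it does not reproduce any of the three measures defined in the paper.

```python
import math
from typing import Sequence


def entropy_certainty(counts: Sequence[float]) -> float:
    """Return 1 minus the normalized Shannon entropy of the counts.

    The logarithm base equals the number of alternatives, so the score is
    1.0 when a single alternative has all the support and about 0.0 when
    all alternatives are equally supported.
    """
    k = len(counts)
    total = sum(counts)
    if k < 2 or total <= 0:
        raise ValueError("need at least two alternatives and a positive total")
    score = 1.0
    for c in counts:
        if c > 0:  # a zero count contributes nothing (limit of p*log p is 0)
            p = c / total
            score += p * math.log(p, k)
    return score


def signed_quartet_certainty(ref: float, alt1: float, alt2: float) -> float:
    """Certainty for one quartet given counts of its three topologies.

    Assumed convention (hypothetical): the score is negated when the
    reference topology is not the most frequent of the three.
    """
    score = entropy_certainty([ref, alt1, alt2])
    return score if ref >= max(alt1, alt2) else -score


if __name__ == "__main__":
    # Two conflicting bipartitions seen in 75 and 25 gene trees (made-up counts).
    print(round(entropy_certainty([75, 25]), 3))
    # One quartet whose reference topology is in the minority (made-up counts).
    print(round(signed_quartet_certainty(2, 8, 0), 3))
```

Because each quartet involves only four taxa, a gene tree contributes counts only for the quartets whose taxa it actually contains, which is why this style of score needs no frequency adjustment for partial trees.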


Congruence and Conflict in the Higher-Level Phylogenetics of Squamate Reptiles: An Expanded Phylogenomic Perspective

Systematic Biology ◽

10.1093/sysbio/syaa054 ◽

2020 ◽

Cited By ~ 1

Author(s):

Sonal Singhal ◽

Timothy J Colston ◽

Maggie R Grundler ◽

Stephen A Smith ◽

Gabriel C Costa ◽

...

Keyword(s):

Evolutionary History ◽

Gene Tree ◽

Data Sets ◽

Gene Trees ◽

Squamate Reptiles ◽

Statistical Confidence ◽

Target Capture ◽

Anchored Hybrid Enrichment ◽

Genome Scale ◽

Scale Data

Abstract Genome-scale data have the potential to clarify phylogenetic relationships across the tree of life but have also revealed extensive gene tree conflict. This seeming paradox, whereby larger data sets both increase statistical confidence and uncover significant discordance, suggests that understanding sources of conflict is important for accurate reconstruction of evolutionary history. We explore this paradox in squamate reptiles, the vertebrate clade comprising lizards, snakes, and amphisbaenians. We collected an average of 5103 loci for 91 species of squamates that span higher-level diversity within the clade, which we augmented with publicly available sequences for an additional 17 taxa. Using a locus-by-locus approach, we evaluated support for alternative topologies at 17 contentious nodes in the phylogeny. We identified shared properties of conflicting loci, finding that rate and compositional heterogeneity drives discordance between gene trees and species tree and that conflicting loci rarely overlap across contentious nodes. Finally, by comparing our tests of nodal conflict to previous phylogenomic studies, we confidently resolve 9 of the 17 problematic nodes. We suggest this locus-by-locus and node-by-node approach can build consensus on which topological resolutions remain uncertain in phylogenomic studies of other contentious groups. [Anchored hybrid enrichment (AHE); gene tree conflict; molecular evolution; phylogenomic concordance; target capture; ultraconserved elements (UCE).]


HyDe: a Python Package for Genome-Scale Hybridization Detection

10.1101/188037 ◽

2017 ◽

Cited By ~ 2

Author(s):

Paul D. Blischak ◽

Julia Chifman ◽

Andrea D. Wolfe ◽

Laura S. Kubatko

Keyword(s):

Gene Flow ◽

Simulated Data ◽

Data Sets ◽

Gene Trees ◽

Link Type ◽

Phylogenetic Invariants ◽

Genome Scale ◽

The Relationship ◽

Python Package ◽

Hybridization Detection

Abstract The analysis of hybridization and gene flow among closely related taxa is a common goal for researchers studying speciation and phylogeography. Many methods for hybridization detection use simple site pattern frequencies from observed genomic data and compare them to null models that predict an absence of gene flow. The theory underlying the detection of hybridization using these site pattern probabilities exploits the relationship between the coalescent process for gene trees within population trees and the process of mutation along the branches of the gene trees. For certain models, site patterns are predicted to occur in equal frequency (i.e., their difference is 0), producing a set of functions called phylogenetic invariants. In this paper we introduce HyDe, a software package for detecting hybridization using phylogenetic invariants arising under the coalescent model with hybridization. HyDe is written in Python, and can be used interactively or through the command line using pre-packaged scripts. We demonstrate the use of HyDe on simulated data, as well as on two empirical data sets from the literature. We focus in particular on identifying individual hybrids within population samples and on distinguishing between hybrid speciation and gene flow. HyDe is freely available as an open source Python package under the GNU GPL v3 on both GitHub (https://github.com/pblischak/HyDe) and the Python Package Index (PyPI: https://pypi.python.org/pypi/phyde).


Using target enrichment sequencing to study the higher-level phylogeny of the largest lichen-forming fungi family: Parmeliaceae (Ascomycota)

IMA Fungus ◽

10.1186/s43008-020-00051-x ◽

2020 ◽

Vol 11 (1) ◽

Author(s):

Felix Grewe ◽

Claudio Ametrano ◽

Todd J. Widhelm ◽

Steven Leavitt ◽

Isabel Distefano ◽

...

Keyword(s):

Data Sets ◽

Target Enrichment ◽

Data Set ◽

Reduced Genome ◽

Genome Data ◽

Worldwide Distribution ◽

Phylogenetic Studies ◽

Genome Scale ◽

Scale Data ◽

Promising Avenue

Abstract Parmeliaceae is the largest family of lichen-forming fungi with a worldwide distribution. We used a target enrichment data set and a qualitative selection method for 250 out of 350 genes to infer the phylogeny of the major clades in this family including 81 taxa, with both subfamilies and all seven major clades previously recognized in the subfamily Parmelioideae. The reduced genome-scale data set was analyzed using concatenated-based Bayesian inference and two different Maximum Likelihood analyses, and a coalescent-based species tree method. The resulting topology was strongly supported with the majority of nodes being fully supported in all three concatenated-based analyses. The two subfamilies and each of the seven major clades in Parmelioideae were strongly supported as monophyletic. In addition, most backbone relationships in the topology were recovered with high nodal support. The genus Parmotrema was found to be polyphyletic and consequently, it is suggested to accept the genus Crespoa to accommodate the species previously placed in Parmotrema subgen. Crespoa. This study demonstrates the power of reduced genome-scale data sets to resolve phylogenetic relationships with high support. Due to lower costs, target enrichment methods provide a promising avenue for phylogenetic studies including larger taxonomic/specimen sampling than whole genome data would allow.


A Novel Approach to Investigate the Effect of Tree Reconstruction Artifacts in Single-Gene Analysis Clarifies Opsin Evolution in Nonbilaterian Metazoans

Genome Biology and Evolution ◽

10.1093/gbe/evaa015 ◽

2020 ◽

Vol 12 (2) ◽

pp. 3906-3916 ◽

Cited By ~ 1

Author(s):

James F Fleming ◽

Roberto Feuda ◽

Nicholas W Roberts ◽

Davide Pisani

Keyword(s):

Phylogenetic Signal ◽

Single Gene ◽

G Protein Coupled Receptors ◽

Data Sets ◽

Gene Trees ◽

Tree Reconstruction ◽

Phylogenetic Information ◽

Novel Approach ◽

Gene Data ◽

G Protein Coupled

Abstract Our ability to correctly reconstruct a phylogenetic tree is strongly affected by both systematic errors and the amount of phylogenetic signal in the data. Current approaches to tackle tree reconstruction artifacts, such as the use of parameter-rich models, do not translate readily to single-gene alignments. This, coupled with the limited amount of phylogenetic information contained in single-gene alignments, makes gene trees particularly difficult to reconstruct. Opsin phylogeny illustrates this problem clearly. Opsins are G-protein coupled receptors utilized in photoreceptive processes across Metazoa and their protein sequences are roughly 300 amino acids long. A number of incongruent opsin phylogenies have been published and opsin evolution remains poorly understood. Here, we present a novel approach, the canary sequence approach, to investigate and potentially circumvent errors in single-gene phylogenies. First, we demonstrate our approach using two well-understood cases of long-branch attraction in single-gene data sets, and simulations. After that, we apply our approach to a large collection of well-characterized opsins to clarify the relationships of the three main opsin subfamilies.


Evaluating Phylogenetic Informativeness as a Predictor of Phylogenetic Signal for Metazoan, Fungal, and Mammalian Phylogenomic Data Sets

BioMed Research International ◽

10.1155/2013/621604 ◽

2013 ◽

Vol 2013 ◽

pp. 1-14 ◽

Cited By ~ 14

Author(s):

Francesc López-Giráldez ◽

Andrew H. Moeller ◽

Jeffrey P. Townsend

Keyword(s):

Phylogenetic Signal ◽

Simulated Data ◽

Quantitative Measure ◽

Data Sets ◽

Phylogenetic Informativeness ◽

Phylogenetic Resolution ◽

Taxonomic Groups ◽

Diverse Groups ◽

Simulated Data Sets ◽

Selection Of

Phylogenetic research is often stymied by selection of a marker that leads to poor phylogenetic resolution despite considerable cost and effort. Profiles of phylogenetic informativeness provide a quantitative measure for prioritizing gene sampling to resolve branching order in a particular epoch. To evaluate the utility of these profiles, we analyzed phylogenomic data sets from metazoans, fungi, and mammals, thus encompassing diverse time scales and taxonomic groups. We also evaluated the utility of profiles created based on simulated data sets. We found that genes selected via their informativeness dramatically outperformed haphazard sampling of markers. Furthermore, our analyses demonstrate that the original phylogenetic informativeness method can be extended to trees with more than four taxa. Thus, although the method currently predicts phylogenetic signal without specifically accounting for the misleading effects of stochastic noise, it is robust to the effects of homoplasy. The phylogenetic informativeness rankings obtained will allow other researchers to select advantageous genes for future studies within these clades, maximizing return on effort and investment. Genes identified might also yield efficient experimental designs for phylogenetic inference for many sister clades and outgroup taxa that are closely related to the diverse groups of organisms analyzed.


Chloroplast genomes elucidate diversity, phylogeny, and taxonomy of Pulsatilla (Ranunculaceae)

Scientific Reports ◽

10.1038/s41598-020-76699-7 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Qiu-jie Li ◽

Na Su ◽

Ling Zhang ◽

Ru-chang Tong ◽

Xiao-hui Zhang ◽

...

Keyword(s):

Molecular Markers ◽

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Data Sets ◽

Nucleotide Polymorphisms ◽

Single Nucleotide ◽

Variable Regions ◽

Chloroplast Genomes ◽

Medicinal Value ◽

Cp Genome

Abstract Pulsatilla (Ranunculaceae) consists of about 40 species, many of which have horticultural and/or medicinal value. However, it is difficult to recognize and identify wild Pulsatilla species. Universal molecular markers have been used to identify these species, but they provided insufficient phylogenetic signal. Here, we compared the complete chloroplast genomes of seven Pulsatilla species. The chloroplast genomes of Pulsatilla were very similar, ranging in length from 161,501 to 162,669 bp. Eight highly variable regions and potential sources of molecular markers such as simple sequence repeats, large repeat sequences, and single nucleotide polymorphisms were identified, which are valuable for studies of infra- and inter-specific genetic diversity. The SNP number differentiating any two Pulsatilla chloroplast genomes ranged from 112 to 1214, and provided sufficient data for species delimitation. Phylogenetic trees based on different data sets were consistent with one another, although the IR and SSC regions and the barcode combination rbcL + matK + trnH-psbA produced slightly different results. Phylogenetic relationships within Pulsatilla were confidently resolved using the complete cp genome sequences. Overall, this study provides plentiful chloroplast genomic resources, which will be helpful to identify members of this taxonomically challenging group in further investigation.


Intragenic Conflict in Phylogenomic Data Sets

Molecular Biology and Evolution ◽

10.1093/molbev/msaa170 ◽

2020 ◽

Vol 37 (11) ◽

pp. 3380-3388

Author(s):

Stephen A Smith ◽

Nathanael Walker-Hale ◽

Joseph F Walker

Keyword(s):

Data Analysis ◽

Empirical Data ◽

Evolutionary History ◽

Phylogenetic Signal ◽

Phylogenetic Analyses ◽

Simulated Data ◽

Data Sets ◽

Alignment Error ◽

Biological Processes ◽

Simple Method

Abstract Most phylogenetic analyses assume that a single evolutionary history underlies one gene. However, both biological processes and errors can cause intragenic conflict. The extent to which this conflict is present in empirical data sets is not well documented, but if common, could have far-reaching implications for phylogenetic analyses. We examined several large phylogenomic data sets from diverse taxa using a fast and simple method to identify well-supported intragenic conflict. We found conflict to be highly variable between data sets, from 1% to >92% of genes investigated. We analyzed four exemplar genes in detail and analyzed simulated data under several scenarios. Our results suggest that alignment error may be one major source of conflict, but other conflicts remain unexplained and may represent biological signal or other errors. Whether as part of data analysis pipelines or to explore biological processes, analyses of within-gene phylogenetic signal should become common.


JTK_CYCLE: An Efficient Nonparametric Algorithm for Detecting Rhythmic Components in Genome-Scale Data Sets

Journal of Biological Rhythms ◽

10.1177/0748730410379711 ◽

2010 ◽

Vol 25 (5) ◽

pp. 372-380 ◽

Cited By ~ 485

Author(s):

Michael E. Hughes ◽

John B. Hogenesch ◽

Karl Kornacker

Keyword(s):

Data Sets ◽

Nonparametric Algorithm ◽

Genome Scale ◽

Scale Data


Phylogenetic conflicts, combinability, and deep phylogenomics in plants

10.1101/371930 ◽

2018 ◽

Cited By ~ 1

Author(s):

Stephen A. Smith ◽

Nathanael Walker-Hale ◽

Joseph F. Walker ◽

Joseph W. Brown

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Signal ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Data Filtering ◽

Tree Inference ◽

Tree Methods ◽

Inference Methods ◽

Species Tree Inference

Abstract Studies have demonstrated that pervasive gene tree conflict underlies several important phylogenetic relationships where different species tree methods produce conflicting results. Here, we present a means of dissecting the phylogenetic signal for alternative resolutions within a dataset in order to resolve recalcitrant relationships and, importantly, identify what the dataset is unable to resolve. These procedures extend upon methods for isolating conflict and concordance involving specific candidate relationships and can be used to identify systematic error and disambiguate sources of conflict among species tree inference methods. We demonstrate these on a large phylogenomic plant dataset. Our results support the placement of Amborella as sister to the remaining extant angiosperms, Gnetales as sister to pines, and the monophyly of extant gymnosperms. Several other contentious relationships, including the resolution of relationships within the bryophytes and the eudicots, remain uncertain given the low number of supporting gene trees. To address whether concatenation of filtered genes amplified phylogenetic signal for relationships, we implemented a combinatorial heuristic to test combinability of genes. We found that nested conflicts limited the ability of data filtering methods to fully ameliorate conflicting signal amongst gene trees. These analyses confirmed that the underlying conflicting signal does not support broad concatenation of genes. Our approach provides a means of dissecting a specific dataset to address deep phylogenetic relationships while also identifying the inferential boundaries of the dataset.
