Intragenic Conflict in Phylogenomic Data Sets

Molecular Biology and Evolution ◽

10.1093/molbev/msaa170 ◽

2020 ◽

Vol 37 (11) ◽

pp. 3380-3388

Author(s):

Stephen A Smith ◽

Nathanael Walker-Hale ◽

Joseph F Walker

Keyword(s):

Data Analysis ◽

Empirical Data ◽

Evolutionary History ◽

Phylogenetic Signal ◽

Phylogenetic Analyses ◽

Simulated Data ◽

Data Sets ◽

Alignment Error ◽

Biological Processes ◽

Simple Method

Abstract Most phylogenetic analyses assume that a single evolutionary history underlies one gene. However, both biological processes and errors can cause intragenic conflict. The extent to which this conflict is present in empirical data sets is not well documented, but if common, could have far-reaching implications for phylogenetic analyses. We examined several large phylogenomic data sets from diverse taxa using a fast and simple method to identify well-supported intragenic conflict. We found conflict to be highly variable between data sets, from 1% to >92% of genes investigated. We analyzed four exemplar genes in detail and analyzed simulated data under several scenarios. Our results suggest that alignment error may be one major source of conflict, but other conflicts remain unexplained and may represent biological signal or other errors. Whether as part of data analysis pipelines or to explore biological processes, analyses of within-gene phylogenetic signal should become common.
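
The abstract does not describe how conflict is detected, so the following is only a minimal illustration of what it means for two parts of a gene to conflict, not the authors' method. It assumes each gene segment yields a tree, and that a branch in each tree is summarized as a bipartition (split) of the taxa. Two splits of the same taxon set are incompatible when all four pairwise intersections of their sides are non-empty. The function name and the toy taxa are hypothetical.

```python
def splits_conflict(side_a, side_c, taxa):
    """Return True if two bipartitions of `taxa` cannot occur in the same tree.

    side_a and side_c each name one side of a split; the other side is
    whatever remains of `taxa`. The splits conflict exactly when all four
    intersections between their sides are non-empty.
    """
    all_taxa = frozenset(taxa)
    a = frozenset(side_a)
    c = frozenset(side_c)
    b = all_taxa - a  # complement of the first split
    d = all_taxa - c  # complement of the second split
    return all(bool(x & y) for x in (a, b) for y in (c, d))


# Hypothetical example: segment 1 supports (A,B), segment 2 supports (A,C).
taxa = ["A", "B", "C", "D", "E"]
print(splits_conflict({"A", "B"}, {"A", "C"}, taxa))
print(splits_conflict({"A", "B"}, {"A", "B", "C"}, taxa))
```

In practice a conflict would only count as well supported if both splits also passed some support threshold, which this sketch leaves out.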

Intragenic conflict in phylogenomic datasets

10.1101/2020.04.13.038133 ◽

2020 ◽

Author(s):

Stephen A. Smith ◽

Nathanael Walker-Hale ◽

Joseph F. Walker

Keyword(s):

Data Analysis ◽

Evolutionary History ◽

Phylogenetic Analyses ◽

Simulated Data ◽

Tree Of Life ◽

Alignment Error ◽

Biological Processes ◽

Simple Method

Abstract Most phylogenetic analyses assume that a single evolutionary history underlies one gene. However, both biological processes and errors in dataset assembly can violate this assumption, causing intragenic conflict. The extent to which this conflict is present in empirical datasets is not well documented. However, if common, it would have far-reaching implications for phylogenetic analyses. Here, we examined several large phylogenomic datasets from diverse taxa using a fast and simple method to identify well-supported intragenic conflict. We found conflict to be highly variable between datasets, from 1% to more than 92% of genes investigated. To better characterize patterns of conflict, we analyzed four genes with no obvious data assembly errors in more detail. Analyses on simulated data highlighted that alignment error may be one major source of conflict. Whether as part of data analysis pipelines or to explore potentially compelling biological processes acting within genes, analyses of within-gene signal should become common. The method presented here provides a relatively fast means for identifying conflicts that is agnostic to the generating process. Datasets identified with high intragenic conflict may either have significant errors in dataset assembly or represent conflict generated by biological processes. Conflicts that are the result of error should be identified and discarded or corrected. For those conflicts that are the result of biological processes, these analyses contribute to the growing consensus that, similar to genomes, genes themselves may exhibit multiple conflicting evolutionary histories across the tree of life.

Evaluating Phylogenetic Informativeness as a Predictor of Phylogenetic Signal for Metazoan, Fungal, and Mammalian Phylogenomic Data Sets

BioMed Research International ◽

10.1155/2013/621604 ◽

2013 ◽

Vol 2013 ◽

pp. 1-14 ◽

Cited By ~ 14

Author(s):

Francesc López-Giráldez ◽

Andrew H. Moeller ◽

Jeffrey P. Townsend

Keyword(s):

Phylogenetic Signal ◽

Simulated Data ◽

Quantitative Measure ◽

Data Sets ◽

Phylogenetic Informativeness ◽

Phylogenetic Resolution ◽

Taxonomic Groups ◽

Diverse Groups ◽

Simulated Data Sets ◽

Selection Of

Phylogenetic research is often stymied by selection of a marker that leads to poor phylogenetic resolution despite considerable cost and effort. Profiles of phylogenetic informativeness provide a quantitative measure for prioritizing gene sampling to resolve branching order in a particular epoch. To evaluate the utility of these profiles, we analyzed phylogenomic data sets from metazoans, fungi, and mammals, thus encompassing diverse time scales and taxonomic groups. We also evaluated the utility of profiles created based on simulated data sets. We found that genes selected via their informativeness dramatically outperformed haphazard sampling of markers. Furthermore, our analyses demonstrate that the original phylogenetic informativeness method can be extended to trees with more than four taxa. Thus, although the method currently predicts phylogenetic signal without specifically accounting for the misleading effects of stochastic noise, it is robust to the effects of homoplasy. The phylogenetic informativeness rankings obtained will allow other researchers to select advantageous genes for future studies within these clades, maximizing return on effort and investment. Genes identified might also yield efficient experimental designs for phylogenetic inference for many sister clades and outgroup taxa that are closely related to the diverse groups of organisms analyzed.
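
The abstract refers to informativeness profiles without giving the formula. A minimal sketch, assuming the per-site expression usually attributed to the original phylogenetic informativeness method, rho(T; lambda) = 16 * lambda^2 * T * exp(-4 * lambda * T), where lambda is a site's substitution rate and T is time before present. The site rates and time points below are hypothetical and serve only to show the shape of the calculation.

```python
import math


def site_informativeness(rate, time):
    """Per-site informativeness at a given time under the assumed formula."""
    return 16.0 * rate ** 2 * time * math.exp(-4.0 * rate * time)


def informativeness_profile(site_rates, times):
    """Sum per-site informativeness over all sites of a gene at each time."""
    return [sum(site_informativeness(r, t) for r in site_rates) for t in times]


def peak_time(rate):
    """Time at which a single site's curve peaks (derivative in T is zero)."""
    return 1.0 / (4.0 * rate)


# Hypothetical gene with three sites evolving at different rates.
rates = [0.05, 0.25, 1.0]
times = [0.1, 0.5, 1.0, 2.0]
for t, value in zip(times, informativeness_profile(rates, times)):
    print(f"T={t:.1f}  informativeness={value:.4f}")
```

Under this expression a fast site peaks at a recent time and decays quickly, while a slow site peaks deeper in the tree, which is why a profile can be used to match markers to the epoch of interest.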

Transcriptomics provides a robust framework for the relationships of the major clades of cladobranch sea slugs (Mollusca, Gastropoda, Heterobranchia), but fails to resolve the position of the enigmatic genus Embletonia

BMC Ecology and Evolution ◽

10.1186/s12862-021-01944-0 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Dario Karmeinski ◽

Karen Meusemann ◽

Jessica A. Goodheart ◽

Michael Schroedl ◽

Alexander Martynov ◽

...

Keyword(s):

Phylogenetic Relationships ◽

Evolutionary History ◽

Phylogenetic Signal ◽

Set Covering ◽

Phylogenetic Position ◽

Data Sets ◽

Transcriptome Data ◽

Data Set ◽

Sea Slugs ◽

History Of

Abstract

Background: The soft-bodied cladobranch sea slugs represent roughly half of the biodiversity of marine nudibranch molluscs on the planet. Despite their global distribution from shallow waters to the deep sea, from tropical into polar seas, and their important role in marine ecosystems and for humans (as targets for drug discovery), the evolutionary history of cladobranch sea slugs is not yet fully understood.

Results: To expand current knowledge of the phylogenetic relationships, we generated new transcriptome data for 19 species of cladobranch sea slugs and two additional outgroup taxa (Berthella plumula and Polycera quadrilineata). We complemented our taxon sampling with previously published transcriptome data, resulting in a final data set covering 56 species from all but one accepted cladobranch superfamilies. We assembled all transcriptomes using six different assemblers, selecting those assemblies that provided the largest amount of potentially phylogenetically informative sites. Quality-driven compilation of data sets resulted in four different supermatrices: two with full coverage of genes per species (446 and 335 single-copy protein-coding genes, respectively) and two with a less stringent coverage (667 genes with 98.9% partition coverage and 1767 genes with 86% partition coverage, respectively). We used these supermatrices to infer statistically robust maximum-likelihood trees. All analyses, irrespective of the data set, indicate maximal statistical support for all major splits and phylogenetic relationships at the family level. Besides the questionable position of Noumeaella rubrofasciata, rendering the Facelinidae as polyphyletic, the only notable discordance between the inferred trees is the position of Embletonia pulchra. Extensive testing using Four-cluster Likelihood Mapping, Approximately Unbiased tests, and Quartet Scores revealed that its position is not due to any informative phylogenetic signal, but caused by confounding signal.

Conclusions: Our data matrices and the inferred trees can serve as a solid foundation for future work on the taxonomy and evolutionary history of Cladobranchia. The placement of E. pulchra, however, proves challenging, even with large data sets and various optimization strategies. Moreover, quartet mapping results show that confounding signal present in the data is sufficient to explain the inferred position of E. pulchra, again leaving its phylogenetic position as an enigma.

Phylogenetic signal in extinction selectivity in Devonian terebratulide brachiopods

Paleobiology ◽

10.1666/14006 ◽

2014 ◽

Vol 40 (4) ◽

pp. 675-692 ◽

Cited By ~ 17

Author(s):

Paul G. Harnik ◽

Paul C. Fitzgerald ◽

Jonathan L. Payne ◽

Sandra J. Carlson

Keyword(s):

Body Size ◽

Fossil Record ◽

Evolutionary History ◽

Phylogenetic Signal ◽

Phylogenetic Analyses ◽

Positive Association ◽

Range Size ◽

Biological Traits ◽

Significant Positive Association ◽

Extinction Selectivity

Determining which biological traits affect taxonomic durations is critical for explaining macroevolutionary patterns. Two approaches are commonly used to investigate the associations between traits and durations and/or extinction and origination rates: analyses of taxonomic occurrence patterns in the fossil record and comparative phylogenetic analyses, predominantly of extant taxa. By capitalizing upon the empirical record of past extinctions, paleontological data avoid some of the limitations of existing methods for inferring extinction and origination rates from molecular phylogenies. However, most paleontological studies of extinction selectivity have ignored phylogenetic relationships because there is a dearth of phylogenetic hypotheses for diverse non-vertebrate higher taxa in the fossil record. This omission inflates the degrees of freedom in statistical analyses and leaves open the possibility that observed associations are indirect, reflecting shared evolutionary history rather than the direct influence of particular traits on durations. Here we investigate global patterns of extinction selectivity in Devonian terebratulide brachiopods and compare the results of taxonomic vs. phylogenetic approaches. Regression models that assume independence among taxa provide support for a positive association between geographic range size and genus duration but do not indicate an association between body size and genus duration. Brownian motion models of trait evolution identify significant similarities in body size, range size, and duration among closely related terebratulide genera. We use phylogenetic regression to account for shared evolutionary history and find support for a significant positive association between range size and duration among terebratulides that is also phylogenetically structured. 
The estimated range size–duration relationship is moderately weaker in the phylogenetic analysis due to the down-weighting of closely related genera that were both broadly distributed and long lived; however, this change in slope is not statistically significant. These results provide evidence for the phylogenetic conservatism of organismal and emergent traits, yet also the general phylogenetic independence of the relationship between range size and duration.

Simulated data sets for single molecule kinetics: some limitations and complications of data analysis

European Biophysics Journal ◽

10.1007/s00249-006-0067-5 ◽

2006 ◽

Vol 35 (8) ◽

pp. 633-645 ◽

Cited By ~ 8

Author(s):

Jue Shi ◽

Ari Gafni ◽

Duncan Steel

Keyword(s):

Data Analysis ◽

Single Molecule ◽

Simulated Data ◽

Data Sets ◽

Simulated Data Sets

Different methods for niche and fitness differences computation offer contrasting explanations of species coexistence

10.1101/2021.09.28.462166 ◽

2021 ◽

Author(s):

Juerg W Spaak ◽

Po-Ju Ke ◽

Andrew W Letten ◽

Frederik De Laender

Keyword(s):

Empirical Data ◽

Species Coexistence ◽

Plant Traits ◽

Simulated Data ◽

Data Sets ◽

Phylogenetic Distance ◽

Simultaneous Application ◽

Coexistence Theory ◽

Niche Differences ◽

Better Than

In modern coexistence theory, species coexistence can arise either via stabilizing mechanisms that increase niche differences or via equalizing mechanisms that reduce fitness differences. Having a common currency for interpreting these mechanisms is essential for synthesizing knowledge across different studies and systems. Several methods for quantifying niche and fitness differences exist, but it remains unknown to what extent these methods agree on the reasons why species coexist. Here, we apply four common methods to quantify niche and fitness differences to one simulated and two empirical data sets. We ask if different methods result in different insights into what drives species coexistence. We find that different methods disagree on the effects of resource supply rates (simulated data), and of plant traits or phylogenetic distance (empirical data), on niche and fitness differences. More specifically, these methods often do not agree better than expected by chance. We argue for (1) a better understanding of what connects and sets apart different methods, and (2) the simultaneous application of multiple methods to gain more complete insight into why species coexist.
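
The abstract does not name the four methods it compares. As a concrete example of what quantifying niche and fitness differences can look like, here is a minimal sketch of one widely used definition for two-species Lotka-Volterra competition, which may or may not be among those four. It assumes interaction coefficients a_ij (per-capita effect of species j on species i) and equal intrinsic growth rates, so that niche overlap is rho = sqrt(a12 * a21 / (a11 * a22)) and the fitness ratio is sqrt(a11 * a12 / (a22 * a21)). All coefficient values are hypothetical.

```python
import math


def niche_overlap(a11, a12, a21, a22):
    """Niche overlap rho; the niche difference is 1 - rho."""
    return math.sqrt((a12 * a21) / (a11 * a22))


def fitness_ratio(a11, a12, a21, a22):
    """Fitness of species 2 relative to species 1 (equal growth rates assumed)."""
    return math.sqrt((a11 * a12) / (a22 * a21))


def coexist(a11, a12, a21, a22):
    """Coexistence is predicted when rho < fitness ratio < 1 / rho."""
    rho = niche_overlap(a11, a12, a21, a22)
    kappa = fitness_ratio(a11, a12, a21, a22)
    return rho < kappa < 1.0 / rho


# Hypothetical community: each species limits itself more than its competitor.
print(niche_overlap(1.0, 0.5, 0.5, 1.0))
print(coexist(1.0, 0.5, 0.5, 1.0))
# Hypothetical community with complete niche overlap and unequal fitness.
print(coexist(1.0, 0.5, 2.0, 1.0))
```

Other definitions partition the same dynamics differently, which is the kind of disagreement between methods that the abstract reports.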

Coping With Plenitude: A Computational Approach to Selecting the Right Algorithm

Sociological Methods & Research ◽

10.1177/00491241211031273 ◽

2021 ◽

pp. 004912412110312

Author(s):

Ramina Sotoudeh ◽

Paul DiMaggio

Keyword(s):

Empirical Data ◽

Simulated Data ◽

Computational Approach ◽

Data Sets ◽

Class Analysis ◽

Free Lunch ◽

The Right ◽

No Free Lunch ◽

Do So ◽

Attitude Surveys

Sociologists increasingly face choices among competing algorithms that represent reasonable approaches to the same task, with little guidance in choosing among them. We develop a strategy that uses simulated data to identify the conditions under which different methods perform well and applies what is learned from the simulations to predict which method will perform best on never-before-seen empirical data sets. We apply this strategy to a class of methods that group respondents to attitude surveys according to whether they share construals of a given domain. This allows us to identify the relative strengths and weaknesses of the methods we consider, including relational class analysis, correlational class analysis, and eight other such variants. Results support the “no free lunch” view that researchers should abandon the quest for one best algorithm in favor of matching algorithms to kinds of data for which each is most appropriate and provide direction on how to do so.

Quartet-based computations of internode certainty provide accurate and robust measures of phylogenetic incongruence

10.1101/168526 ◽

2017 ◽

Cited By ~ 8

Author(s):

Xiaofan Zhou ◽

Sarah Lutteropp ◽

Lucas Czech ◽

Alexandros Stamatakis ◽

Moritz von Looz ◽

...

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Simulated Data ◽

Data Sets ◽

Gene Trees ◽

Branch Support ◽

Robust Measures ◽

Genome Scale ◽

Scale Data

Abstract Incongruence, or topological conflict, is prevalent in genome-scale data sets, but relatively few measures have been developed to quantify it. Internode Certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internode (or internal branch) among a set of phylogenetic trees and complement regular branch support statistics in assessing the confidence of the inferred phylogenetic relationships. Since most phylogenomic studies contain data partitions (e.g., genes) with missing taxa and IC scores stem from the frequencies of bipartitions (or splits) on a set of trees, the calculation of IC scores requires adjusting the frequencies of bipartitions from these partial gene trees. However, when the proportion of missing data is high, current approaches that adjust bipartition frequencies in partial gene trees tend to overestimate IC scores, and alternative adjustment approaches differ substantially from each other in their scores. To overcome these issues, we developed three new measures for calculating internode certainty that are based on the frequencies of quartets, which naturally apply to both comprehensive and partial trees. Our comparison of these new quartet-based measures to previous bipartition-based measures on simulated data shows that: 1) on comprehensive trees, both types of measures yield highly similar IC scores; 2) on partial trees, quartet-based measures generate more accurate IC scores; and 3) quartet-based measures are more robust to the absence of phylogenetic signal and errors in the phylogenetic relationships to be assessed. Additionally, analysis of 15 empirical phylogenomic data sets using our quartet-based measures suggests that numerous relationships remain unresolved despite the availability of genome-scale data. 
Finally, we provide an efficient open-source implementation of these quartet-based measures in the program QuartetScores, which is freely available at https://github.com/algomaus/QuartetScores.
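
For orientation, the bipartition-based IC that the abstract contrasts with its quartet-based measures can be sketched as follows. This is a minimal illustration, assuming the commonly cited definition in which the frequencies of a bipartition and of its most prevalent conflicting bipartition are normalized to sum to one and IC = 1 + p1 * log2(p1) + p2 * log2(p2). It is not the QuartetScores implementation, it ignores the adjustments for partial gene trees discussed above, and the counts are hypothetical.

```python
import math


def internode_certainty(count_main, count_conflict):
    """Bipartition-based IC for one internal branch.

    count_main: number of trees containing the bipartition of interest.
    count_conflict: number of trees containing its most prevalent
    conflicting bipartition.
    Returns 1.0 when there is no conflict and 0.0 when both are equally common.
    """
    total = count_main + count_conflict
    if total == 0:
        raise ValueError("at least one tree must contain either bipartition")
    ic = 1.0
    for count in (count_main, count_conflict):
        p = count / total
        if p > 0:  # the limit of p * log2(p) as p goes to 0 is 0
            ic += p * math.log2(p)
    return ic


# Hypothetical counts out of a set of gene trees.
for main, conflict in [(100, 0), (75, 25), (50, 50)]:
    print(main, conflict, round(internode_certainty(main, conflict), 4))
```

Because the score depends on counts of bipartitions across trees, any gene tree that lacks some of the relevant taxa has to be handled by an adjustment, which is the weakness the quartet-based measures are meant to address.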

Quartet-Based Computations of Internode Certainty Provide Robust Measures of Phylogenetic Incongruence

Systematic Biology ◽

10.1093/sysbio/syz058 ◽

2019 ◽

Vol 69 (2) ◽

pp. 308-324 ◽

Cited By ~ 7

Author(s):

Xiaofan Zhou ◽

Sarah Lutteropp ◽

Lucas Czech ◽

Alexandros Stamatakis ◽

Moritz von Looz ◽

...

Keyword(s):

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Simulated Data ◽

Data Sets ◽

Gene Trees ◽

Data Set ◽

Statistical Confidence ◽

Branch Support ◽

Robust Measures ◽

Genome Scale

Abstract Incongruence, or topological conflict, is prevalent in genome-scale data sets. Internode certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internal branch among a set of phylogenetic trees and complement regular branch support measures (e.g., bootstrap, posterior probability) that instead assess the statistical confidence of inference. Since most phylogenomic studies contain data partitions (e.g., genes) with missing taxa and IC scores stem from the frequencies of bipartitions (or splits) on a set of trees, IC score calculation typically requires adjusting the frequencies of bipartitions from these partial gene trees. However, when the proportion of missing taxa is high, the scores yielded by current approaches that adjust bipartition frequencies in partial gene trees differ substantially from each other and tend to be overestimates. To overcome these issues, we developed three new IC measures based on the frequencies of quartets, which naturally apply to both complete and partial trees. Comparison of our new quartet-based measures to previous bipartition-based measures on simulated data shows that: (1) on complete data sets, both quartet-based and bipartition-based measures yield very similar IC scores; (2) IC scores of quartet-based measures on a given data set with and without missing taxa are more similar than the scores of bipartition-based measures; and (3) quartet-based measures are more robust to the absence of phylogenetic signal and errors in phylogenetic inference than bipartition-based measures. Additionally, the analysis of an empirical mammalian phylogenomic data set using our quartet-based measures reveals the presence of substantial levels of incongruence for numerous internal branches. An efficient open-source implementation of these quartet-based measures is freely available in the program QuartetScores (https://github.com/lutteropp/QuartetScores).

Using simulated data sets to compare data analysis techniques used for software cost modelling

IEE Proceedings - Software ◽

10.1049/ip-sen:20010621 ◽

2001 ◽

Vol 148 (6) ◽

pp. 165 ◽

Cited By ~ 25

Author(s):

L. Pickard ◽

B. Kitchenham ◽

S.J. Linkman

Keyword(s):

Data Analysis ◽

Simulated Data ◽

Data Sets ◽

Software Cost ◽

Analysis Techniques ◽

Cost Modelling ◽

Simulated Data Sets
