Distribution of gene tree histories under the coalescent model with gene flow

10.1101/023937 ◽

2015 ◽

Author(s):

Yuan Tian ◽

Laura Kubatko

Keyword(s):

Gene Flow ◽

Maximum Likelihood ◽

Gene Tree ◽

Species Tree ◽

Tree Topology ◽

Effective Population ◽

Sequence Alignments ◽

Data Set ◽

Coalescent Model ◽

Population Sizes

We propose a coalescent model for three species that allows gene flow between both pairs of sister populations. The model is designed to analyze multilocus genomic sequence alignments, with one sequence sampled from each of the three species. The model is formulated using a Markov chain representation, which allows use of matrix exponentiation to compute analytical expressions for the probability density of gene tree genealogies. The gene tree history distribution as well as the gene tree topology distribution under this coalescent model with gene flow are then calculated via numerical integration. We analyze the model to compare the distributions of gene tree topologies and gene tree histories for species trees with differing effective population sizes and gene flow rates. Our results suggest conditions under which the species tree and associated parameters are not identifiable from the gene tree topology distribution when gene flow is present, but indicate that the gene tree history distribution may identify the species tree and associated parameters. Thus, the gene tree history distribution can be used to infer parameters such as the ancestral effective population sizes and the rates of gene flow in a maximum likelihood (ML) framework. We conduct computer simulations to evaluate the performance of our method in estimating these parameters, and we apply our method to an Afrotropical mosquito data set (Fontaine et al., 2015) to demonstrate the usefulness of our method for the analysis of empirical data. Key words: coalescent, gene flow, migration, hybridization, gene tree, topology, history, maximum likelihood, speciation.
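The Markov chain idea in this abstract can be illustrated with a much smaller chain than the three-species model the authors describe. The sketch below is a toy example, not the paper's method: it assumes two lineages sampled from two populations, a symmetric per-lineage migration rate `m`, a coalescence rate of 1 when both lineages share a population, and time measured in coalescent units. The state labels, the rate matrix, and the function names are all assumptions made for illustration. Exponentiating the rate matrix gives the probability of each state at time `t`.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=30):
    """Approximate exp(Q * t) by scaling and squaring with a Taylor series."""
    n = len(q)
    norm = max(sum(abs(x) for x in row) for row in q) * t
    # Halve the matrix s times so the Taylor series converges quickly.
    s = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    a = [[x * t / 2 ** s for x in row] for row in q]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term holds (Q t / 2^s)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + x for r, x in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    # Undo the scaling by squaring s times.
    for _ in range(s):
        result = mat_mul(result, result)
    return result


def transition_matrix(m, t):
    """Transition probabilities for the assumed three-state toy chain.

    State 0: lineages in different populations.
    State 1: lineages in the same population.
    State 2: lineages have coalesced (absorbing).
    """
    q = [
        [-2.0 * m, 2.0 * m, 0.0],
        [2.0 * m, -(2.0 * m + 1.0), 1.0],
        [0.0, 0.0, 0.0],
    ]
    return mat_exp(q, t)


if __name__ == "__main__":
    p = transition_matrix(m=0.5, t=1.0)
    print("P(coalesced by t=1 | started in different populations) =", p[0][2])
```

With `m = 0` the lineages in different populations can never coalesce, and lineages that start together coalesce with probability `1 - exp(-t)`, which gives a quick sanity check on the exponentiation.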

ipcoal: An interactive Python package for simulating and analyzing genealogies and sequences on a species tree or network

10.1101/2020.01.15.908236 ◽

2020 ◽

Cited By ~ 1

Author(s):

Patrick F. McKenzie ◽

Deren A. R. Eaton

Keyword(s):

Sequence Data ◽

Gene Tree ◽

Species Tree ◽

Effective Population ◽

Demographic Models ◽

Link Type ◽

Tree Inference ◽

Population Sizes ◽

Visualization Tools ◽

Python Package

Abstract Summary: ipcoal is a free and open source Python package for simulating and analyzing genealogies and sequences. It automates the task of describing complex demographic models (e.g., with divergence times, effective population sizes, migration events) to the msprime coalescent simulator by parsing a user-supplied species tree or network. Genealogies, sequences, and metadata are returned in tabular format allowing for easy downstream analyses. ipcoal includes phylogenetic inference tools to automate gene tree inference from simulated sequence data, and visualization tools for analyzing results and verifying model accuracy. The ipcoal package is a powerful tool for posterior predictive data analysis, for methods validation, and for teaching coalescent methods in an interactive and visual environment. Availability and implementation: Source code is available from the GitHub repository (https://github.com/pmckenz1/ipcoal/) and is distributed for packaged installation with conda. Complete documentation and interactive notebooks prepared for teaching purposes are available at https://ipcoal.readthedocs.io/.

Phylogenomic assessment of the role of hybridization and introgression in trait evolution

PLoS Genetics ◽

10.1371/journal.pgen.1009701 ◽

2021 ◽

Vol 17 (8) ◽

pp. e1009701

Author(s):

Yaxuan Wang ◽

Zhen Cao ◽

Huw A. Ogilvie ◽

Luay Nakhleh

Keyword(s):

Gene Flow ◽

Gene Tree ◽

Species Tree ◽

Biological Data ◽

Gene Trees ◽

Data Set ◽

Trait Evolution ◽

Branching Patterns ◽

Hybridization And Introgression

Trait evolution among a set of species—a central theme in evolutionary biology—has long been understood and analyzed with respect to a species tree. However, the field of phylogenomics, which has been propelled by advances in sequencing technologies, has ushered in the era of species/gene tree incongruence and, consequently, a more nuanced understanding of trait evolution. For a trait whose states are incongruent with the branching patterns in the species tree, the same state could have arisen independently in different species (homoplasy) or followed the branching patterns of gene trees, incongruent with the species tree (hemiplasy). Another evolutionary process whose extent and significance are better revealed by phylogenomic studies is gene flow between different species. In this work, we present a phylogenomic method for assessing the role of hybridization and introgression in the evolution of polymorphic or monomorphic binary traits. We apply the method to simulated evolutionary scenarios to demonstrate the interplay between the parameters of the evolutionary history and the role of introgression in a binary trait’s evolution (which we call xenoplasy). Very importantly, we demonstrate, including on a biological data set, that inferring a species tree and using it for trait evolution analysis in the presence of gene flow could lead to misleading hypotheses about trait evolution.

ipcoal: an interactive Python package for simulating and analyzing genealogies and sequences on a species tree or network

Bioinformatics ◽

10.1093/bioinformatics/btaa486 ◽

2020 ◽

Vol 36 (14) ◽

pp. 4193-4196

Author(s):

Patrick F McKenzie ◽

Deren A R Eaton

Keyword(s):

Sequence Data ◽

Gene Tree ◽

Species Tree ◽

Effective Population ◽

Demographic Models ◽

Simulated Sequence ◽

Tree Inference ◽

Population Sizes ◽

Visualization Tools ◽

Python Package

Abstract Summary: ipcoal is a free and open source Python package for simulating and analyzing genealogies and sequences. It automates the task of describing complex demographic models (e.g. with divergence times, effective population sizes, migration events) to the msprime coalescent simulator by parsing a user-supplied species tree or network. Genealogies, sequences and metadata are returned in tabular format allowing for easy downstream analyses. ipcoal includes phylogenetic inference tools to automate gene tree inference from simulated sequence data, and visualization tools for analyzing results and verifying model accuracy. The ipcoal package is a powerful tool for posterior predictive data analysis, for methods validation and for teaching coalescent methods in an interactive and visual environment. Availability and implementation: Source code is available from the GitHub repository (https://github.com/pmckenz1/ipcoal/) and is distributed for packaged installation with conda. Complete documentation and interactive notebooks prepared for teaching purposes, including an empirical example, are available at https://ipcoal.readthedocs.io/.

Maximum Likelihood Estimation of Species Trees from Gene Trees in the Presence of Ancestral Population Structure

Genome Biology and Evolution ◽

10.1093/gbe/evaa022 ◽

2020 ◽

Vol 12 (2) ◽

pp. 3977-3995 ◽

Cited By ~ 1

Author(s):

Hillary Koch ◽

Michael DeGiorgio

Keyword(s):

Population Structure ◽

Maximum Likelihood ◽

Gene Tree ◽

Species Tree ◽

Ease Of Use ◽

Likelihood Method ◽

Ancestral Population ◽

Gene Trees ◽

Species Trees ◽

Data Set

Abstract Though large multilocus genomic data sets have led to overall improvements in phylogenetic inference, they have posed the new challenge of addressing conflicting signals across the genome. In particular, ancestral population structure, which has been uncovered in a number of diverse species, can skew gene tree frequencies, thereby hindering the performance of species tree estimators. Here we develop a novel maximum likelihood method, termed TASTI (Taxa with Ancestral structure Species Tree Inference), that can infer phylogenies under such scenarios, and find that it has increasing accuracy with increasing numbers of input gene trees, contrasting with the relatively poor performances of methods not tailored for ancestral structure. Moreover, we propose a supertree approach that allows TASTI to scale computationally with increasing numbers of input taxa. We use genetic simulations to assess TASTI’s performance in the three- and four-taxon settings and demonstrate the application of TASTI on a six-species Afrotropical mosquito data set. Finally, we have implemented TASTI in an open-source software package for ease of use by the scientific community.

The Perfect Storm: Gene Tree Estimation Error, Incomplete Lineage Sorting, and Ancient Gene Flow Explain the Most Recalcitrant Ancient Angiosperm Clade, Malpighiales

Systematic Biology ◽

10.1093/sysbio/syaa083 ◽

2020 ◽

Author(s):

Liming Cai ◽

Zhenxiang Xi ◽

Emily Moriarty Lemmon ◽

Alan R Lemmon ◽

Austin Mast ◽

...

Keyword(s):

Gene Flow ◽

Incomplete Lineage Sorting ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Flowering Plant ◽

Estimation Methods ◽

Lineage Sorting ◽

Tree Estimation ◽

Perfect Storm

Abstract The genomic revolution offers renewed hope of resolving rapid radiations in the Tree of Life. The development of the multispecies coalescent (MSC) model and improved gene tree estimation methods can better accommodate gene tree heterogeneity caused by incomplete lineage sorting (ILS) and gene tree estimation error stemming from the short internal branches. However, the relative influence of these factors in species tree inference is not well understood. Using anchored hybrid enrichment, we generated a data set including 423 single-copy loci from 64 taxa representing 39 families to infer the species tree of the flowering plant order Malpighiales. This order includes nine of the top ten most unstable nodes in angiosperms, which have been hypothesized to arise from the rapid radiation during the Cretaceous. Here, we show that coalescent-based methods do not resolve the backbone of Malpighiales and concatenation methods yield inconsistent estimations, providing evidence that gene tree heterogeneity is high in this clade. Despite high levels of ILS and gene tree estimation error, our simulations demonstrate that these two factors alone are insufficient to explain the lack of resolution in this order. To explore this further, we examined triplet frequencies among empirical gene trees and discovered some of them deviated significantly from those attributed to ILS and estimation error, suggesting gene flow as an additional and previously unappreciated phenomenon promoting gene tree variation in Malpighiales. Finally, we applied a novel method to quantify the relative contribution of these three primary sources of gene tree heterogeneity and demonstrated that ILS, gene tree estimation error, and gene flow contributed to 10.0%, 34.8%, and 21.4% of the variation, respectively. Together, our results suggest that a perfect storm of factors likely influence this lack of resolution, and further indicate that recalcitrant phylogenetic relationships like the backbone of Malpighiales may be better represented as phylogenetic networks. Thus, reducing such groups solely to existing models that adhere strictly to bifurcating trees greatly oversimplifies reality, and obscures our ability to more clearly discern the process of evolution.
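The triplet-frequency reasoning in this abstract rests on a standard property of the multispecies coalescent: with ILS alone, the two discordant topologies for any three taxa are expected at equal frequency, so a strong imbalance between them points to another process such as gene flow. The sketch below is a generic one-degree-of-freedom chi-square check of that symmetry, not the procedure used in the paper (which also accounts for estimation error). The function name and the example counts are hypothetical.

```python
import math


def minor_triplet_asymmetry(n_minor_1, n_minor_2):
    """Test whether two discordant triplet topologies are equally frequent.

    Returns (chi-square statistic, p-value) for a 1-df test against the
    equal-frequency expectation that holds under ILS alone.
    """
    total = n_minor_1 + n_minor_2
    if total == 0:
        # No discordant gene trees: nothing to test.
        return 0.0, 1.0
    stat = (n_minor_1 - n_minor_2) ** 2 / total
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical counts of the two discordant topologies for one triplet.
    stat, p = minor_triplet_asymmetry(80, 20)
    print(f"chi-square = {stat:.2f}, p = {p:.3g}")
```

In practice many triplets are tested at once, so some correction for multiple comparisons would be needed before interpreting individual p-values.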

Bayes Estimation of Species Divergence Times and Ancestral Population Sizes Using DNA Sequences From Multiple Loci

Genetics ◽

10.1093/genetics/164.4.1645 ◽

2003 ◽

Vol 164 (4) ◽

pp. 1645-1656 ◽

Cited By ~ 3

Author(s):

Bruce Rannala ◽

Ziheng Yang

Keyword(s):

Dna Sequences ◽

Gene Tree ◽

Species Tree ◽

Bayes Estimation ◽

Ancestral Population ◽

Divergence Times ◽

Gene Trees ◽

Species Divergence ◽

Multiple Loci ◽

Population Sizes

Abstract The effective population sizes of ancestral as well as modern species are important parameters in models of population genetics and human evolution. The commonly used method for estimating ancestral population sizes, based on counting mismatches between the species tree and the inferred gene trees, is highly biased as it ignores uncertainties in gene tree reconstruction. In this article, we develop a Bayes method for simultaneous estimation of the species divergence times and current and ancestral population sizes. The method uses DNA sequence data from multiple loci and extracts information about conflicts among gene tree topologies and coalescent times to estimate ancestral population sizes. The topology of the species tree is assumed known. A Markov chain Monte Carlo algorithm is implemented to integrate over uncertain gene trees and branch lengths (or coalescence times) at each locus as well as species divergence times. The method can handle any species tree and allows different numbers of sequences at different loci. We apply the method to published noncoding DNA sequences from the human and the great apes. There are strong correlations between posterior estimates of speciation times and ancestral population sizes. With the use of an informative prior for the human-chimpanzee divergence date, the population size of the common ancestor of the two species is estimated to be ∼20,000, with a 95% credibility interval (8000, 40,000). Our estimates, however, are affected by model assumptions as well as data quality. We suggest that reliable estimates have yet to await more data and more realistic models.

GeneRax: A tool for species tree-aware maximum likelihood based gene family tree inference under gene duplication, transfer, and loss

10.1101/779066 ◽

2019 ◽

Cited By ~ 3

Author(s):

Benoit Morel ◽

Alexey M. Kozlov ◽

Alexandros Stamatakis ◽

Gergely J. Szöllősi

Keyword(s):

Maximum Likelihood ◽

Phylogenetic Trees ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Species Tree ◽

Homologous Gene ◽

Sequence Alignments ◽

Full Likelihood ◽

True Tree

Abstract Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges species tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data pre-processing (e.g., computing bootstrap trees), and rely on approximations and heuristics that limit the degree of tree space exploration. Here we present GeneRax, the first maximum likelihood species tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared to competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson-Foulds distance. On empirical datasets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1099 Cyanobacteria families in eight minutes on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax.
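The relative Robinson-Foulds distance mentioned in this abstract measures how many groupings two trees disagree on, scaled to lie between 0 and 1. The sketch below is a simplified rooted, clade-based variant written for illustration; it is not GeneRax code, and the benchmark in the paper may use the unrooted (bipartition) form. Trees are represented as nested tuples of leaf names, which is an assumption of this example.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaves.

    A tree is a leaf name (str) or a tuple of subtrees.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    # The root clade contains every leaf and is shared by all trees.
    found.discard(all_leaves)
    return found


def relative_rf(tree_a, tree_b):
    """Symmetric clade difference divided by the total number of clades."""
    a, b = clades(tree_a), clades(tree_b)
    total = len(a) + len(b)
    return len(a ^ b) / total if total else 0.0


if __name__ == "__main__":
    true_tree = ((("A", "B"), "C"), "D")
    inferred = ((("A", "C"), "B"), "D")
    print(relative_rf(true_tree, inferred))
```

A value of 0 means the two trees contain exactly the same clades, and 1 means they share none beyond the root.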

Efficient Maximum-Likelihood Inference For The Isolation-With-Initial-Migration Model With Potentially Asymmetric Gene Flow

10.1101/052894 ◽

2016 ◽

Author(s):

Rui J. Costa ◽

Hilde Wilkinson-Herbots

Keyword(s):

Gene Flow ◽

Maximum Likelihood ◽

Dna Sequences ◽

Computing Time ◽

Likelihood Method ◽

Ancestral Population ◽

Parameter Estimates ◽

Fast Method ◽

Data Set ◽

Im Model

Abstract The isolation-with-migration (IM) model is commonly used to make inferences about gene flow during speciation, using polymorphism data. However, Becquet and Przeworski (2009) report that the parameter estimates obtained by fitting the IM model are very sensitive to the model's assumptions (including the assumption of constant gene flow until the present). This paper is concerned with the isolation-with-initial-migration (IIM) model of Wilkinson-Herbots (2012), which drops precisely this assumption. In the IIM model, one ancestral population divides into two descendant subpopulations, between which there is an initial period of gene flow and a subsequent period of isolation. We derive a very fast method of fitting an extended version of the IIM model, which also allows for asymmetric gene flow and unequal population sizes. This is a maximum-likelihood method, applicable to data on the number of segregating sites between pairs of DNA sequences from a large number of independent loci. In addition to obtaining parameter estimates, our method can also be used to distinguish between alternative models representing different evolutionary scenarios, by means of likelihood ratio tests. We illustrate the procedure on pairs of Drosophila sequences from approximately 30,000 loci. The computing time needed to fit the most complex version of the model to this data set is only a couple of minutes. The R code to fit the IIM model can be found in the supplementary files of this paper.

Deep-Time Demographic Inference Suggests Ecological Release as Driver of Neoavian Adaptive Radiation

Diversity ◽

10.3390/d12040164 ◽

2020 ◽

Vol 12 (4) ◽

pp. 164 ◽

Cited By ~ 2

Author(s):

Peter Houde ◽

Edward L. Braun ◽

Lawrence Zhou

Keyword(s):

Evolutionary Biology ◽

Gene Tree ◽

Difficult Problem ◽

Gene Trees ◽

Effective Population ◽

Deep Time ◽

Ecological Release ◽

Population Sizes ◽

Demographic Inference ◽

Gene Tree Discordance

Assessing the applicability of theory to major adaptive radiations in deep time represents an extremely difficult problem in evolutionary biology. Neoaves, which includes 95% of living birds, is believed to have undergone a period of rapid diversification roughly coincident with the Cretaceous–Paleogene (K-Pg) boundary. We investigate whether basal neoavian lineages experienced an ecological release in response to ecological opportunity, as evidenced by density compensation. We estimated effective population sizes (Ne) of basal neoavian lineages by combining coalescent branch lengths (CBLs) and the numbers of generations between successive divergences. We used a modified version of Accurate Species TRee Algorithm (ASTRAL) to estimate CBLs directly from insertion–deletion (indel) data, as well as from gene trees using DNA sequence and/or indel data. We found that some divergences near the K-Pg boundary involved unexpectedly high gene tree discordance relative to the estimated number of generations between speciation events. The simplest explanation for this result is an increase in Ne, despite the caveats discussed herein. It appears that at least some early neoavian lineages, similar to the ancestor of the clade comprising doves, mesites, and sandgrouse, experienced ecological release near the time of the K-Pg mass extinction.
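The link between discordance, coalescent branch lengths, and Ne described in this abstract can be made concrete with two standard relationships, given here as a sketch rather than as the authors' pipeline. For diploids, a branch of `g` generations has length `g / (2 * Ne)` in coalescent units, and for an internal branch of length `x` the expected fraction of gene tree quartets (or triplets) matching the species tree is `1 - (2/3) * exp(-x)`. The function names and the numbers in the usage example are hypothetical.

```python
import math


def cbl_from_concordance(p_concordant):
    """Invert p = 1 - (2/3) * exp(-x) to get a branch length x in coalescent units."""
    if not 1.0 / 3.0 < p_concordant < 1.0:
        raise ValueError("concordance must lie strictly between 1/3 and 1")
    return -math.log(1.5 * (1.0 - p_concordant))


def ne_from_cbl(generations, cbl):
    """Diploid effective population size implied by a branch of known duration."""
    if cbl <= 0:
        raise ValueError("coalescent branch length must be positive")
    return generations / (2.0 * cbl)


if __name__ == "__main__":
    # Hypothetical branch: 60% of gene trees concordant over 500,000 generations.
    x = cbl_from_concordance(0.60)
    print("coalescent branch length:", x)
    print("implied Ne:", ne_from_cbl(500_000, x))
```

For a fixed number of generations, more discordance gives a shorter coalescent branch length and therefore a larger implied Ne, which is the direction of inference the abstract describes.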

Disentangling Sources of Gene Tree Discordance in Phylogenomic Data Sets: Testing Ancient Hybridizations in Amaranthaceae s.l

Systematic Biology ◽

10.1093/sysbio/syaa066 ◽

2020 ◽

Cited By ~ 1

Author(s):

Diego F Morales-Briones ◽

Gudrun Kadereit ◽

Delphine T Tefarikis ◽

Michael J Moore ◽

Stephen A Smith ◽

...

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Species Tree ◽

Data Sets ◽

Species Trees ◽

Lineage Sorting ◽

Data Set ◽

Gene Tree Discordance

Abstract Gene tree discordance in large genomic data sets can be caused by evolutionary processes such as incomplete lineage sorting and hybridization, as well as model violation, and errors in data processing, orthology inference, and gene tree estimation. Species tree methods that identify and accommodate all sources of conflict are not available, but a combination of multiple approaches can help tease apart alternative sources of conflict. Here, using a phylotranscriptomic analysis in combination with reference genomes, we test a hypothesis of ancient hybridization events within the plant family Amaranthaceae s.l. that was previously supported by morphological, ecological, and Sanger-based molecular data. The data set included seven genomes and 88 transcriptomes, 17 generated for this study. We examined gene-tree discordance using coalescent-based species trees and network inference, gene tree discordance analyses, site pattern tests of introgression, topology tests, synteny analyses, and simulations. We found that a combination of processes might have generated the high levels of gene tree discordance in the backbone of Amaranthaceae s.l. Furthermore, we found evidence that three consecutive short internal branches produce anomalous trees contributing to the discordance. Overall, our results suggest that Amaranthaceae s.l. might be a product of an ancient and rapid lineage diversification, and remains, and probably will remain, unresolved. This work highlights the potential problems of identifiability associated with the sources of gene tree discordance including, in particular, phylogenetic network methods. Our results also demonstrate the importance of thoroughly testing for multiple sources of conflict in phylogenomic analyses, especially in the context of ancient, rapid radiations. We provide several recommendations for exploring conflicting signals in such situations. 
[Amaranthaceae; gene tree discordance; hybridization; incomplete lineage sorting; phylogenomics; species network; species tree; transcriptomics.]
