FastMulRFS: Fast and accurate species tree estimation under generic gene duplication and loss models

Mapping Intimacies ◽

10.1101/835553 ◽

2019 ◽

Cited By ~ 2

Author(s):

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Species Tree ◽

Biological Research ◽

Generic Model ◽

Species Trees ◽

Basic Part ◽

Heterogeneous Datasets ◽

Gene Duplication And Loss ◽

Tree Estimation ◽

Scalable Methods

Abstract. Motivation: Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone; thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed. Results: We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods. Availability: FastMulRFS is available on GitHub (https://github.com/ekmolloy/fastmulrfs).


FastMulRFS: fast and accurate species tree estimation under generic gene duplication and loss models

Bioinformatics ◽

10.1093/bioinformatics/btaa444 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i57-i65 ◽

Cited By ~ 3

Author(s):

Erin K Molloy ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Species Tree ◽

Supplementary Information ◽

Biological Research ◽

Generic Model ◽

Species Trees ◽

Basic Part ◽

Gene Duplication And Loss ◽

Tree Estimation ◽

Scalable Methods

Abstract. Motivation: Species tree estimation is a basic part of biological research but can be challenging because of gene duplication and loss (GDL), which results in genes that can appear more than once in a given genome. All common approaches in phylogenomic studies either reduce available data or are error-prone; thus, scalable methods that do not discard data and have high accuracy on large heterogeneous datasets are needed. Results: We present FastMulRFS, a polynomial-time method for estimating species trees without knowledge of orthology. We prove that FastMulRFS is statistically consistent under a generic model of GDL when adversarial GDL does not occur. Our extensive simulation study shows that FastMulRFS matches the accuracy of MulRF (which tries to solve the same optimization problem) and has better accuracy than prior methods, including ASTRAL-multi (the only method to date that has been proven statistically consistent under GDL), while being much faster than both methods. Availability and implementation: FastMulRFS is available on GitHub (https://github.com/ekmolloy/fastmulrfs). Supplementary information: Supplementary data are available at Bioinformatics online.


Comparing Methods for Species Tree Estimation With Gene Duplication and Loss

10.1101/2021.02.05.429947 ◽

2021 ◽

Author(s):

James Willson ◽

Mrinmoy Saha Roddur ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Data Bank ◽

Species Tree ◽

Biological Research ◽

Gene Trees ◽

Species Trees ◽

Tree Inference ◽

Multiple Copies ◽

Gene Duplication And Loss ◽

Tree Estimation

Abstract: Species tree inference from gene trees is an important part of biological research. One confounding factor in estimating species trees is gene duplication and loss, which can lead to gene trees with multiple copies of the same gene. In recent years, several new methods have been developed to address this problem, and they have substantially improved on earlier methods; however, the best-performing methods (ASTRAL-Pro, ASTRID-multi, and FastMulRFS) have not yet been directly compared. In this study, we compare ASTRAL-Pro, ASTRID-multi, and FastMulRFS under a wide variety of conditions. Our study shows that while all three have very good accuracy, nearly the same under many conditions, ASTRAL-Pro and ASTRID-multi are more reliably accurate than FastMulRFS, and ASTRID-multi is often faster than ASTRAL-Pro. The datasets generated for this study are freely available in the Illinois Data Bank at https://databank.illinois.edu/datasets/IDB-2418574


Multispecies Coalescent: Theory and Applications in Phylogenetics

Annual Review of Ecology Evolution and Systematics ◽

10.1146/annurev-ecolsys-012121-095340 ◽

2021 ◽

Vol 52 (1) ◽

Author(s):

Siavash Mirarab ◽

Luay Nakhleh ◽

Tandy Warnow

Keyword(s):

Incomplete Lineage Sorting ◽

Species Tree ◽

Phylogenetic Networks ◽

Biological Research ◽

Annual Review ◽

Publication Date ◽

Gene Trees ◽

Species Trees ◽

Basic Part ◽

Tree Estimation

Species tree estimation is a basic part of many biological research projects, ranging from answering basic evolutionary questions (e.g., how did a group of species adapt to their environments?) to addressing questions in functional biology. Yet, species tree estimation is very challenging, due to processes such as incomplete lineage sorting, gene duplication and loss, horizontal gene transfer, and hybridization, which can make gene trees differ from each other and from the overall evolutionary history of the species. Over the last 10–20 years, there has been tremendous growth in methods and mathematical theory for estimating species trees and phylogenetic networks, and some of these methods are now in wide use. In this survey, we provide an overview of the current state of the art, identify the limitations of existing methods and theory, and propose additional research problems and directions. Expected final online publication date for the Annual Review of Ecology, Evolution, and Systematics, Volume 52 is November 2021. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.


Disjoint Tree Mergers for Large-Scale Maximum Likelihood Tree Estimation

Algorithms ◽

10.3390/a14050148 ◽

2021 ◽

Vol 14 (5) ◽

pp. 148

Author(s):

Minhyuk Park ◽

Paul Zaharias ◽

Tandy Warnow

Keyword(s):

Maximum Likelihood ◽

Gene Tree ◽

Input Sequence ◽

Species Tree ◽

Estimation Methods ◽

Biological Research ◽

Species Trees ◽

Basic Part ◽

Large Trees ◽

Tree Estimation

The estimation of phylogenetic trees for individual genes or multi-locus datasets is a basic part of considerable biological research. In order to enable large trees to be computed, Disjoint Tree Mergers (DTMs) have been developed; these methods operate by dividing the input sequence dataset into disjoint sets, constructing trees on each subset, and then combining the subset trees (using auxiliary information) into a tree on the full dataset. DTMs have been used to advantage for multi-locus species tree estimation, enabling highly accurate species trees at reduced computational effort, compared to leading species tree estimation methods. Here, we evaluate the feasibility of using DTMs to improve the scalability of maximum likelihood (ML) gene tree estimation to large numbers of input sequences. Our study shows distinct differences between the three selected ML codes—RAxML-NG, IQ-TREE 2, and FastTree 2—and shows that good DTM pipeline design can provide advantages over these ML codes on large datasets.


Comparing Methods for Species Tree Estimation with Gene Duplication and Loss

Algorithms for Computational Biology - Lecture Notes in Computer Science ◽

10.1007/978-3-030-74432-8_8 ◽

2021 ◽

pp. 106-117

Author(s):

James Willson ◽

Mrinmoy Saha Roddur ◽

Tandy Warnow

Keyword(s):

Gene Duplication ◽

Species Tree ◽

Gene Duplication And Loss ◽

Tree Estimation


Polynomial-Time Statistical Estimation of Species Trees under Gene Duplication and Loss

10.1101/821439 ◽

2019 ◽

Cited By ~ 3

Author(s):

Brandon Legried ◽

Erin K. Molloy ◽

Tandy Warnow ◽

Sébastien Roch

Keyword(s):

Gene Duplication ◽

Polynomial Time ◽

Incomplete Lineage Sorting ◽

Data Bank ◽

Polynomial Time Algorithm ◽

Species Tree ◽

Species Trees ◽

Lineage Sorting ◽

Biological Studies ◽

Gene Duplication And Loss

Abstract: Phylogenomics, the estimation of species trees from multilocus datasets, is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this paper, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL. All scripts and datasets used in this study are available on the Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.


Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss

Journal of Computational Biology ◽

10.1089/cmb.2020.0424 ◽

2020 ◽

Author(s):

Brandon Legried ◽

Erin K. Molloy ◽

Tandy Warnow ◽

Sébastien Roch

Keyword(s):

Gene Duplication ◽

Polynomial Time ◽

Statistical Estimation ◽

Species Trees ◽

Gene Duplication And Loss


Inference of Ancient Whole-Genome Duplications and the Evolution of Gene Duplication and Loss Rates

Molecular Biology and Evolution ◽

10.1093/molbev/msz088 ◽

2019 ◽

Vol 36 (7) ◽

pp. 1384-1404 ◽

Cited By ~ 16

Author(s):

Arthur Zwaenepoel ◽

Yves Van de Peer

Keyword(s):

Maximum Likelihood ◽

Gene Duplication ◽

Gene Tree ◽

Probabilistic Approach ◽

Species Tree ◽

Rate Variation ◽

Whole Genome ◽

Tree Reconciliation ◽

Gene Duplication And Loss ◽

Loss Rates

Abstract: Gene tree–species tree reconciliation methods have been employed for studying ancient whole-genome duplication (WGD) events across the eukaryotic tree of life. Most approaches have relied on using maximum likelihood trees and the maximum parsimony reconciliation thereof to count duplication events on specific branches of interest in a reference species tree. Such approaches do not account for uncertainty in the gene tree and reconciliation, or do so only heuristically. The effects of these simplifications on the inference of ancient WGDs are unclear. In particular, the effects of variation in gene duplication and loss rates across the species tree have not been considered. Here, we developed a full probabilistic approach for phylogenomic reconciliation-based WGD inference, accounting for both gene tree and reconciliation uncertainty using a method based on the principle of amalgamated likelihood estimation. The model and methods are implemented in a maximum likelihood and Bayesian setting and account for variation of duplication and loss rates across the species tree, using methods inspired by phylogenetic divergence time estimation. We applied our newly developed framework to ancient WGDs in land plants and investigated the effects of duplication and loss rate variation on reconciliation-based and gene-count-based assessment of these earlier proposed WGDs.


Phylogenomic terraces: presence and implication in species tree estimation from gene trees

10.1101/2020.04.19.048843 ◽

2020 ◽

Author(s):

Ishrat Tanzila Farah ◽

Md Muktadirul Islam ◽

Kazi Tasnim Zinat ◽

Atif Hasan Rahman ◽

Md Shamsuzzoha Bayzid

Keyword(s):

Incomplete Lineage Sorting ◽

Gene Tree ◽

Species Tree ◽

Gene Trees ◽

Species Trees ◽

Lineage Sorting ◽

Deep Coalescence ◽

Tree Estimation ◽

Tree Space ◽

Multiple Species

Abstract: Species tree estimation from multi-locus datasets is extremely challenging, especially in the presence of gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Summary methods have been developed that estimate gene trees and then combine the gene trees to estimate a species tree by optimizing various scores. In this study, we have formalized the concept of “phylogenomic terraces” in the species tree space, where multiple species trees with distinct topologies may have exactly the same optimization score (quartet score, extra lineage score, etc.) with respect to a collection of gene trees. We investigated the presence and implications of terraces in species tree estimation from multi-locus data by taking ILS into account. We analyzed two of the most popular ILS-aware optimization criteria: maximize quartet consistency (MQC) and minimize deep coalescence (MDC). Methods based on MQC are provably statistically consistent, whereas MDC is not a consistent criterion for species tree estimation. Our experiments, on a collection of datasets simulated under ILS, indicate that MDC-based methods may achieve quartet consistency scores competitive with or identical to those of MQC-based methods but could be significantly worse in terms of tree accuracy, demonstrating the presence and effect of phylogenomic terraces. This is the first known study that formalizes the concept of phylogenomic terraces in the context of species tree estimation from multi-locus data, and reports the presence and implications of terraces in species tree estimation under ILS.
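To make the idea of tied optimization scores concrete, the following is a minimal sketch of a quartet score, not the authors' implementation. It assumes binary trees written as nested Python tuples of taxon names and singly-labeled gene trees; the function names (`clades`, `quartet_topology`, `quartet_score`) are hypothetical. The score counts, over all gene trees, the four-taxon subsets on which the gene tree and the candidate species tree induce the same unrooted topology.

```python
from itertools import combinations

def clades(tree):
    """Return (leaf set of tree, list of leaf sets of all internal nodes)."""
    if isinstance(tree, str):          # a leaf is just a taxon name
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def quartet_topology(clade_list, four):
    """Unrooted topology on four taxa as a set of two pairs, or None if unresolved."""
    four = frozenset(four)
    for clade in clade_list:
        pair = clade & four
        if len(pair) == 2:             # this clade separates the taxa 2 vs 2
            return frozenset([pair, four - pair])
    return None

def quartet_score(species_tree, gene_trees):
    """Count gene-tree quartets that agree with the candidate species tree."""
    taxa, species_clades = clades(species_tree)
    score = 0
    for gene_tree in gene_trees:
        gene_taxa, gene_clades = clades(gene_tree)
        for four in combinations(sorted(taxa & gene_taxa), 4):
            topology = quartet_topology(species_clades, four)
            if topology is not None and topology == quartet_topology(gene_clades, four):
                score += 1
    return score

# Two conflicting gene trees on four taxa (toy data, assumed for illustration).
gene_trees = [(("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))]
candidate_1 = (("A", "B"), ("C", "D"))
candidate_2 = (("A", "C"), ("B", "D"))
candidate_3 = (("A", "D"), ("B", "C"))
for candidate in (candidate_1, candidate_2, candidate_3):
    print(candidate, quartet_score(candidate, gene_trees))
```

In this toy input, the first two candidates have different topologies but the same quartet score, which is the kind of tie the abstract describes; real terraces involve many more taxa and gene trees.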


To include or not to include: The impact of gene filtering on species tree estimation methods

10.1101/149120 ◽

2017 ◽

Cited By ~ 1

Author(s):

Erin K. Molloy ◽

Tandy Warnow

Keyword(s):

Missing Data ◽

Incomplete Lineage Sorting ◽

Estimation Error ◽

Gene Tree ◽

Species Tree ◽

Species Trees ◽

Lineage Sorting ◽

Gene Filtering ◽

Tree Estimation ◽

The Impact

Abstract: Species tree estimation from loci sampled from multiple genomes is now common, but is challenged by the heterogeneity across the genome due to multiple processes, such as gene duplication and loss, horizontal gene transfer, and incomplete lineage sorting. Although methods for estimating species trees have been developed that address gene tree heterogeneity due to incomplete lineage sorting, many of these methods operate by combining estimated gene trees and are hence vulnerable to gene tree quality. There is also the added concern that missing data, which is frequently encountered in genome-scale datasets, will impact species tree estimation. Our study addresses the impact of gene filtering on species trees inferred from multi-gene datasets. We address these questions using a large and heterogeneous collection of simulated datasets both with and without missing data. We compare several established coalescent-based methods (ASTRAL, ASTRID, MP-EST, and SVDquartets within PAUP*) as well as unpartitioned concatenation using maximum likelihood (RAxML). Our study shows that gene tree error and missing data impact all methods (and some methods degrade more than others), but the degree of incomplete lineage sorting and gene tree estimation error impacts the absolute and relative performance of methods as well as their response to gene filtering strategies. We find that filtering genes based on the degree of missing data is either neutral or else reduces the accuracy of all five methods examined, and so is not recommended. Filtering genes based on gene tree estimation error shows somewhat different trends. Under low levels of incomplete lineage sorting, removing genes with high gene tree estimation error can improve the accuracy of summary methods, but only if not too many genes are removed. Otherwise, filtering genes tends to increase error, especially under high levels of incomplete lineage sorting. Hence, while filtering genes based on missing data is not recommended, there are conditions under which removing high-error gene trees can improve species tree estimation. This study provides insights into prior studies and suggests approaches for analyzing phylogenomic datasets.
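For readers unfamiliar with the filtering step being evaluated, here is a minimal sketch of filtering genes by their degree of missing data. It is an illustration only, not the study's pipeline: the function name, the input layout (a mapping from gene name to the set of taxa sampled for that gene), and the threshold value are all assumptions.

```python
def filter_genes_by_missing_data(gene_taxa, all_taxa, max_missing_fraction):
    """Keep genes whose fraction of absent taxa is at most max_missing_fraction."""
    all_taxa = set(all_taxa)
    kept = []
    for gene, present in gene_taxa.items():
        # Fraction of the full taxon set that this gene does not cover.
        missing = 1 - len(set(present) & all_taxa) / len(all_taxa)
        if missing <= max_missing_fraction:
            kept.append(gene)
    return kept

# Toy input (assumed): three genes sampled from four taxa.
all_taxa = {"A", "B", "C", "D"}
gene_taxa = {
    "gene1": {"A", "B", "C", "D"},  # no taxa missing
    "gene2": {"A", "B"},            # half of the taxa missing
    "gene3": {"A", "B", "C"},       # a quarter of the taxa missing
}
kept_genes = filter_genes_by_missing_data(gene_taxa, all_taxa, 0.25)
print(kept_genes)
```

Note that the abstract reports this kind of filter as neutral or harmful to accuracy, so the sketch shows what the step does rather than recommending it.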
