Empirical Performance of Tree-based Inference of Phylogenetic Networks

Mapping Intimacies ◽

10.1101/693986 ◽

2019 ◽

Cited By ~ 1

Author(s):

Zhen Cao ◽

Luay Nakhleh

Keyword(s):

Network Inference ◽

Computational Cost ◽

Phylogenetic Network ◽

Large Data ◽

Phylogenetic Networks ◽

Divide And Conquer ◽

Maximum Likelihood Estimates ◽

Tree Inference ◽

Backbone Tree ◽

Network Space

Abstract Phylogenetic networks extend the phylogenetic tree structure and allow for modeling vertical and horizontal evolution in a single framework. Statistical inference of phylogenetic networks is prohibitive and currently limited to small networks. An approach that could significantly improve phylogenetic network space exploration is based on first inferring an evolutionary tree of the species under consideration, and then augmenting the tree into a network by adding a set of “horizontal” edges to better fit the data. In this paper, we study the performance of such an approach on networks generated under a birth-hybridization model and explore its feasibility as an alternative to approaches that search the phylogenetic network space directly (without relying on a fixed underlying tree). We find that the concatenation method does poorly at obtaining a “backbone” tree that could be augmented into the correct network, whereas the popular species tree inference method ASTRAL does significantly better at such a task. We then evaluated the tree-to-network augmentation phase under the minimizing deep coalescence and pseudo-likelihood criteria. We find that even though this is a much faster approach than the direct search of the network space, the accuracy is much poorer, even when the backbone tree is a good starting tree. Our results show that tree-based inference of phylogenetic networks could yield very poor results. As exploration of the network space directly in search of maximum likelihood estimates or a representative sample of the posterior is very expensive, significant improvements to the computational complexity of phylogenetic network inference are imperative if analyses of large data sets are to be performed. We show that a recently developed divide-and-conquer approach significantly outperforms tree-based inference in terms of accuracy, albeit still at a higher computational cost.

A Divide-and-Conquer Method for Scalable Phylogenetic Network Inference from Multi-locus Data

10.1101/587725 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jiafan Zhu ◽

Xinhao Liu ◽

Huw A. Ogilvie ◽

Luay K. Nakhleh

Keyword(s):

Large Scale ◽

Network Inference ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Divide And Conquer ◽

Lineage Sorting ◽

Step Method ◽

Sequence Alignments

Abstract Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting (ILS). However, these methods can only handle a small number of loci from a handful of genomes. In this paper, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological data sets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet).

TREEasy: an automated workflow to infer gene trees, species trees, and phylogenetic networks from multilocus data

10.1101/706390 ◽

2019 ◽

Cited By ~ 1

Author(s):

Yafei Mao ◽

Siqing Hou ◽

Evan P. Economo

Keyword(s):

Tree Species ◽

Network Inference ◽

Gene Tree ◽

Phylogenetic Network ◽

Handling Time ◽

Phylogenetic Networks ◽

Recent Analysis ◽

Gene Trees ◽

Species Trees ◽

Tree Inference

Abstract Multilocus genomic datasets can be used to infer a rich set of information about the evolutionary history of a lineage, including gene trees, species trees, and phylogenetic networks. However, user-friendly tools to run such integrated analyses are lacking, and workflows often require tedious reformatting and handling time to shepherd data through a series of individual programs. Here, we present TREEasy, a tool written in Python that performs automated sequence alignment (with MAFFT), gene tree inference (with IQ-Tree), species tree inference from concatenated data (with IQ-Tree), species tree inference from gene trees (with ASTRAL, MP-EST, and STELLS2), and phylogenetic network inference (with SNaQ and PhyloNet). The tool requires only FASTA files and nine parameters as inputs. It can be run from the command line or through a Graphical User Interface (GUI). As examples, we reproduced a recent analysis of staghorn coral evolution, and performed a new analysis on the evolution of the WGD clade of yeast. The latter revealed novel inferences that were not identified by previous analyses. TREEasy represents a reliable and simple tool to accelerate research in systematic biology (https://github.com/MaoYafei/TREEasy).

A divide-and-conquer method for scalable phylogenetic network inference from multilocus data

Bioinformatics ◽

10.1093/bioinformatics/btz359 ◽

2019 ◽

Vol 35 (14) ◽

pp. i370-i378 ◽

Cited By ~ 5

Author(s):

Jiafan Zhu ◽

Xinhao Liu ◽

Huw A Ogilvie ◽

Luay K Nakhleh

Keyword(s):

Large Scale ◽

Network Inference ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Phylogenetic Networks ◽

Divide And Conquer ◽

Supplementary Information ◽

Lineage Sorting ◽

Step Method ◽

Sequence Alignments

Abstract Motivation: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. Results: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. Availability and implementation: We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet). Supplementary information: Supplementary data are available at Bioinformatics online.
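The abstract above mentions a Hitting Set formulation solved by a simple heuristic, without giving details. As a rough illustration only, the following sketch shows the textbook greedy heuristic for the generic Hitting Set problem (pick a small set of elements that intersects every input set). The function name `greedy_hitting_set` and the toy input are hypothetical; this is not the heuristic implemented in PhyloNet, and how the paper maps trinets and taxon subsets onto this problem is not reproduced here.

```python
from typing import Hashable, Iterable, List, Set


def greedy_hitting_set(sets: Iterable[Set[Hashable]]) -> List[Hashable]:
    """Greedily choose elements until every input set contains a chosen one.

    Generic textbook heuristic (an assumption for illustration), not the
    paper's algorithm. Raises ValueError if an input set is empty, because
    an empty set can never be hit.
    """
    remaining = [set(s) for s in sets]
    if any(not s for s in remaining):
        raise ValueError("an empty set cannot be hit")

    chosen: List[Hashable] = []
    while remaining:
        # Count how many still-unhit sets each element would hit.
        counts = {}
        for s in remaining:
            for element in s:
                counts[element] = counts.get(element, 0) + 1
        # Take the element hitting the most sets; break ties by repr
        # so the result is deterministic.
        best = min(counts, key=lambda e: (-counts[e], repr(e)))
        chosen.append(best)
        # Drop every set that is now hit.
        remaining = [s for s in remaining if best not in s]
    return chosen


# Toy usage: three overlapping sets of hypothetical taxon labels.
example = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
hitting = greedy_hitting_set(example)
```

The greedy choice gives no optimality guarantee in general, which is consistent with the abstract describing its own approach only as a heuristic.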

Assessing the fit of the multi-species network coalescent to multi-locus data

Bioinformatics ◽

10.1093/bioinformatics/btaa863 ◽

2020 ◽

Author(s):

Ruoyi Cai ◽

Cécile Ané

Keyword(s):

Model Selection ◽

Goodness Of Fit ◽

Network Inference ◽

Phylogenetic Network ◽

Phylogenetic Networks ◽

Supplementary Information ◽

Goodness Of Fit Test ◽

Full Likelihood ◽

Genome Wide ◽

Inference Methods

Abstract Motivation: With growing genome-wide molecular datasets from next-generation sequencing, phylogenetic networks can be estimated using a variety of approaches. These phylogenetic networks include events like hybridization, gene flow or horizontal gene transfer explicitly. However, the most accurate network inference methods are computationally heavy. Methods that scale to larger datasets do not calculate a full likelihood, such that traditional likelihood-based tools for model selection are not applicable to decide how many past hybridization events best fit the data. We propose here a goodness-of-fit test to quantify the fit between data observed from genome-wide multi-locus data, and patterns expected under the multi-species coalescent model on a candidate phylogenetic network. Results: We identified weaknesses in the previously proposed TICR test, and proposed corrections. The performance of our new test was validated by simulations on real-world phylogenetic networks. Our test provides one of the first rigorous tools for model selection, to select the adequate network complexity for the data at hand. The test can also work for identifying poorly inferred areas on a network. Availability and implementation: Software for the goodness-of-fit test is available as a Julia package at https://github.com/cecileane/QuartetNetworkGoodnessFit.jl. Supplementary information: Supplementary data are available at Bioinformatics online.

THE NET-HMM APPROACH: PHYLOGENETIC NETWORK INFERENCE BY COMBINING MAXIMUM LIKELIHOOD AND HIDDEN MARKOV MODELS

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972000900428x ◽

2009 ◽

Vol 07 (04) ◽

pp. 625-644 ◽

Cited By ~ 2

Author(s):

SAGI SNIR ◽

TAMIR TULLER

Keyword(s):

Network Inference ◽

Markov Models ◽

Hidden Markov ◽

Bacterial Genome ◽

Genetic Material ◽

Phylogenetic Network ◽

Significance Test ◽

Amino Acid Sequences ◽

Phylogenetic Networks ◽

Significant Mechanism

Horizontal gene transfer (HGT) is the event of transferring genetic material from one lineage in the evolutionary tree to a different lineage. HGT plays a major role in bacterial genome diversification and is a significant mechanism by which bacteria develop resistance to antibiotics. Although the prevailing assumption is of complete HGT, cases of partial HGT (also called chimeric HGT), where only part of a gene is horizontally transferred, have also been reported, albeit less frequently. In this work we suggest a new probabilistic model, the NET-HMM, for analyzing and modeling phylogenetic networks. This new model captures the biologically realistic assumption that neighboring sites of DNA or amino acid sequences are not independent, which increases the accuracy of the inference. The model describes the phylogenetic network as a Hidden Markov Model (HMM), where each hidden state is related to one of the network's trees. One of the advantages of the NET-HMM is its ability to infer partial HGT as well as complete HGT. We describe the properties of the NET-HMM, devise efficient algorithms for solving a set of problems related to it, and implement them in software. We also provide a novel complementary significance test for evaluating the fit of a model (NET-HMM) to a given dataset. Using NET-HMM, we are able to answer interesting biological questions, such as inferring the length of partial HGTs and the affected nucleotides in the genomic sequences, as well as inferring the exact location of HGT events along the tree branches. These advantages are demonstrated through the analysis of synthetic inputs and three different biological inputs.
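To make the idea of hidden states corresponding to trees more concrete, here is a minimal sketch of standard Viterbi decoding over alignment sites, where each hidden state is the index of a tree. It is a generic HMM illustration under loud assumptions: the per-site log-likelihoods under each tree are taken as precomputed inputs, the initial distribution is uniform, and a single hypothetical parameter `stay_prob` governs transitions. None of these choices is taken from the NET-HMM paper, whose actual algorithms and parameterization are not described in the abstract.

```python
import math
from typing import List, Sequence


def viterbi_tree_path(site_loglik: Sequence[Sequence[float]],
                      stay_prob: float) -> List[int]:
    """Most probable sequence of tree indices along the sites.

    site_loglik[i][k] is the (assumed precomputed) log-likelihood of site i
    under tree k. With probability stay_prob the next site keeps the same
    tree; otherwise it switches uniformly to one of the other trees.
    Requires 0 < stay_prob < 1 when there is more than one tree.
    """
    n_trees = len(site_loglik[0])
    if n_trees == 1:
        return [0] * len(site_loglik)

    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_trees - 1))

    # Uniform initial distribution over trees (an assumption).
    score = [math.log(1.0 / n_trees) + ll for ll in site_loglik[0]]
    backpointers = []

    for row in site_loglik[1:]:
        new_score, pointers = [], []
        for k in range(n_trees):
            # Best previous tree for ending at tree k on this site.
            best_prev = max(
                range(n_trees),
                key=lambda j: score[j] + (log_stay if j == k else log_switch),
            )
            trans = log_stay if best_prev == k else log_switch
            new_score.append(score[best_prev] + trans + row[k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_trees), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy usage: the first two sites favour tree 0, the last two favour tree 1.
toy = [[-1.0, -5.0], [-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]]
path = viterbi_tree_path(toy, stay_prob=0.9)
```

In this toy input the decoded path changes tree between the second and third site, which is the kind of breakpoint a partial transfer would produce under such a model.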

NetRAX: Accurate and Fast Maximum Likelihood Phylogenetic Network Inference

10.1101/2021.08.30.458194 ◽

2021 ◽

Author(s):

Sarah Lutteropp ◽

Céline Scornavacca ◽

Alexey M. Kozlov ◽

Benoit Morel ◽

Alexandros Stamatakis

Keyword(s):

Maximum Likelihood ◽

Network Inference ◽

Likelihood Function ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Phylogenetic Networks ◽

Small Data ◽

Lineage Sorting ◽

Sequence Alignments ◽

Multiple Sequence

Abstract Phylogenetic networks are used to represent non-treelike evolutionary scenarios. Current, actively developed approaches for phylogenetic network inference jointly account for non-treelike evolution and incomplete lineage sorting (ILS). Unfortunately, this induces a very high computational complexity. Hence, current tools can only analyze small data sets. We present NetRAX, a tool for maximum likelihood inference of phylogenetic networks in the absence of incomplete lineage sorting. Our tool leverages state-of-the-art methods for efficiently computing the phylogenetic likelihood function on trees, and extends them to phylogenetic networks via the notion of “displayed trees”. NetRAX can infer maximum likelihood phylogenetic networks from partitioned multiple sequence alignments and returns the inferred networks in Extended Newick format. On simulated data, our results show a very low relative difference in BIC score and a near-zero unrooted softwired cluster distance to the true, simulated networks. With NetRAX, a network inference on a partitioned alignment with 8,000 sites, 30 taxa, and 3 reticulations completes within a few minutes on a standard laptop. Our implementation is available under the GNU General Public License v3.0 at https://github.com/lutteropp/NetRAX.
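The notion of “displayed trees” can be illustrated with a small sketch: a displayed tree is obtained by keeping exactly one incoming edge at each reticulation node, so a network with r reticulations has up to 2^r of them when every reticulation has two parents. The representation below (a dictionary from each node to its list of parents) and the function name `displayed_trees` are assumptions for illustration. This is not NetRAX code, it does not suppress the resulting degree-two nodes, and it does not compute any likelihood.

```python
from itertools import product
from typing import Dict, Iterator, List


def displayed_trees(parents: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one child -> parent map per choice of parent at each reticulation.

    `parents` maps every non-root node to its parents; a node with more than
    one parent is treated as a reticulation (hypothetical representation).
    """
    reticulations = sorted(n for n, ps in parents.items() if len(ps) > 1)
    for choice in product(*(parents[r] for r in reticulations)):
        # Tree nodes keep their single parent.
        tree = {n: ps[0] for n, ps in parents.items() if len(ps) == 1}
        # Each reticulation keeps only the parent selected in this choice.
        tree.update(zip(reticulations, choice))
        yield tree


# Toy network: root -> u, v; u -> A, h; v -> h, C; h -> B (h is a reticulation).
network = {
    "u": ["root"], "v": ["root"],
    "A": ["u"], "C": ["v"],
    "h": ["u", "v"],
    "B": ["h"],
}
trees = list(displayed_trees(network))
```

For the toy network the two displayed trees differ only in whether `h` (and hence leaf `B`) hangs below `u` or below `v`.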

Applicability of several rooted phylogenetic network algorithms for representing the evolutionary history of SARS-CoV-2

BMC Ecology and Evolution ◽

10.1186/s12862-021-01946-y ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Rosanne Wallin ◽

Leo van Iersel ◽

Steven Kelk ◽

Leen Stougie

Keyword(s):

Phylogenetic Trees ◽

Evolutionary History ◽

Network Inference ◽

Phylogenetic Network ◽

Phylogenetic Networks ◽

Network Algorithms ◽

Running Time ◽

Inference Algorithms ◽

History Of ◽

The Impact

Abstract Background: Rooted phylogenetic networks are used to display complex evolutionary history involving so-called reticulation events, such as genetic recombination. Various methods have been developed to construct such networks, using for example a multiple sequence alignment or multiple phylogenetic trees as input data. Coronaviruses are known to recombine frequently, but rooted phylogenetic networks have not yet been used extensively to describe their evolutionary history. Here, we created a workflow to compare the evolutionary history of SARS-CoV-2 with other SARS-like viruses using several rooted phylogenetic network inference algorithms. This workflow includes filtering noise from sets of phylogenetic trees by contracting edges based on branch length and bootstrap support, followed by resolution of multifurcations. We explored the running times of the network inference algorithms, the impact of filtering on the properties of the produced networks, and attempted to derive biological insights regarding the evolution of SARS-CoV-2 from them. Results: The network inference algorithms are capable of constructing rooted phylogenetic networks for coronavirus data, although running-time limitations require restricting such datasets to a relatively small number of taxa. Filtering generally reduces the number of reticulations in the produced networks and increases their temporal consistency. Taxon bat-SL-CoVZC45 emerges as a major and structural source of discordance in the dataset. The tested algorithms often indicate that SARS-CoV-2/RaTG13 is a tree-like clade, with possibly some reticulate activity further back in their history. A smaller number of constructed networks posit SARS-CoV-2 as a possible recombinant, although this might be a methodological artefact arising from the interaction of bat-SL-CoVZC45 discordance and the optimization criteria used. Conclusion: Our results demonstrate that as part of a wider workflow and with careful attention paid to running time, rooted phylogenetic network algorithms are capable of producing plausible networks from coronavirus data. These networks partly corroborate existing theories about SARS-CoV-2, and partly produce new avenues for exploration regarding the location and significance of reticulate activity within the wider group of SARS-like viruses. Our workflow may serve as a model for pipelines in which phylogenetic network algorithms can be used to analyse different datasets and test different hypotheses.
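The filtering step described in this abstract (contracting edges based on branch length and bootstrap support) can be sketched as follows. The `Node` class, the function name `contract_weak_edges`, and the thresholds are hypothetical and are not taken from the authors' workflow. One design choice is also an assumption: when an internal edge is contracted, its length is added to the edges of the children so that root-to-leaf distances are preserved. The subsequent resolution of multifurcations is not shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str = ""
    length: float = 0.0               # length of the edge above this node
    support: Optional[float] = None   # bootstrap support of that edge, if any
    children: List["Node"] = field(default_factory=list)


def contract_weak_edges(node: Node, min_length: float, min_support: float) -> Node:
    """Contract internal edges that are too short or too weakly supported.

    The tree is modified in place and returned. Leaf edges are never
    contracted. Contracting an edge attaches the child's children directly
    to `node`, creating a multifurcation.
    """
    new_children: List[Node] = []
    for child in node.children:
        child = contract_weak_edges(child, min_length, min_support)
        is_internal = bool(child.children)
        too_short = child.length < min_length
        low_support = child.support is not None and child.support < min_support
        if is_internal and (too_short or low_support):
            for grandchild in child.children:
                # Keep root-to-leaf distances unchanged (illustrative choice).
                grandchild.length += child.length
                new_children.append(grandchild)
        else:
            new_children.append(child)
    node.children = new_children
    return node


# Toy tree ((A:1,B:1)50:0.5,C:2): the internal edge has bootstrap support 50.
root = Node(children=[
    Node(length=0.5, support=50.0, children=[Node("A", 1.0), Node("B", 1.0)]),
    Node("C", 2.0),
])
filtered = contract_weak_edges(root, min_length=0.01, min_support=70.0)
```

With a support threshold of 70, the weakly supported clade (A,B) is collapsed and the root becomes a three-way multifurcation.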

Maximum Parsimony Inference of Phylogenetic Networks in the Presence of Polyploid Complexes

10.1101/2020.09.28.317651 ◽

2020 ◽

Author(s):

Zhi Yan ◽

Zhen Cao ◽

Yushu Liu ◽

Luay Nakhleh

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Data Sets ◽

Gene Trees ◽

Polyploid Species ◽

Lineage Sorting ◽

Work Done

Abstract Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids and so were parsimony methods that could handle polyploidy more generally yet while excluding processes such as incomplete lineage sorting. In this paper, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene trees as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. The method could also infer trees, thus potentially distinguishing between auto- and allo-polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating given phylogenetic networks are implemented and publicly available in the PhyloNet software package.

On the inference of complex phylogenetic networks by Markov Chain Monte-Carlo

10.1101/2020.10.07.329425 ◽

2020 ◽

Author(s):

Charles-Elie Rabier ◽

Vincent Berry ◽

Jean-Christophe Glaszmann ◽

Fabio Pardi ◽

Celine Scornavacca

Keyword(s):

Network Inference ◽

Crop Improvement ◽

Phylogenetic Network ◽

Genomic Data ◽

Phylogenetic Networks ◽

Data Set ◽

Coalescent Model ◽

Multispecies Coalescent ◽

Model Complex ◽

Hybridization And Introgression

Abstract For various species, high quality sequences and complete genomes are nowadays available for many individuals. This makes data analysis challenging, as methods need not only to be accurate, but also time efficient given the tremendous amount of data to process. In this article, we introduce an efficient method to infer the evolutionary history of individuals under the multispecies coalescent model in networks (MSNC). Phylogenetic networks are an extension of phylogenetic trees that can contain reticulate nodes, which allow modeling complex biological events such as horizontal gene transfer, hybridization, introgression and recombination. We present a novel way to compute the likelihood of biallelic markers sampled along genomes whose evolution involved such events. This likelihood computation is at the heart of a Bayesian network inference method called SnappNet, as it extends the Snapp method [1], which infers evolutionary trees under the multispecies coalescent model, to networks. SnappNet is available as a package of the well-known BEAST 2 software. Recently, the MCMCBiMarkers method [2] also extended Snapp to networks. Both methods take biallelic markers as input, rely on the same model of evolution and sample networks in a Bayesian framework, though using different methods for computing priors. However, SnappNet relies on algorithms that are exponentially more time-efficient on non-trivial networks. Using extensive simulations, we compare the performance of SnappNet and MCMCBiMarkers. We show that both methods have similar abilities to recover simple networks, but SnappNet is more accurate than MCMCBiMarkers on more complex network scenarios. Also, on complex networks, SnappNet is found to be much faster than MCMCBiMarkers in terms of time required for the likelihood computation. We finally illustrate SnappNet's performance on a rice data set. SnappNet infers a scenario that is compatible with simpler schemes proposed so far and provides additional understanding of rice evolution.

Author summary Nowadays, to make the best use of the vast amount of genomic data at our disposal, there is a real need for methods able to model complex biological mechanisms such as hybridization and introgression. Understanding such mechanisms can help geneticists to elaborate strategies in crop improvement that may help reduce poverty and deal with climate change. However, reconstructing such evolution scenarios is challenging. Indeed, the inference of phylogenetic networks, which explicitly model reticulation events such as hybridization and introgression, requires high computational resources. Thus, on large data sets, biologists generally deduce reticulation events indirectly using species tree inference tools. In this context, we present a new Bayesian method, called SnappNet, dedicated to phylogenetic network inference. Our method is competitive in terms of execution speed with respect to its competitors. This speed gain enables us to consider more complex evolution scenarios during Bayesian analyses. When applied to rice genomic data, SnappNet suggested a new evolution scenario, compatible with the existing ones: it posits cAus as the result of an early combination between the Indica and Japonica lineages, followed by a later combination between the cAus and Japonica lineages to derive cBasmati. This accounts for the well-documented wide hybrid compatibility of cAus.

Maximum Parsimony Inference of Phylogenetic Networks in the Presence of Polyploid Complexes

Systematic Biology ◽

10.1093/sysbio/syab081 ◽

2021 ◽

Author(s):

Zhi Yan ◽

Zhen Cao ◽

Yushu Liu ◽

Huw A Ogilvie ◽

Luay Nakhleh

Keyword(s):

Network Inference ◽

Incomplete Lineage Sorting ◽

Gene Tree ◽

Phylogenetic Network ◽

Biological Data ◽

Phylogenetic Networks ◽

Data Sets ◽

Polyploid Species ◽

Lineage Sorting ◽

Work Done

Abstract Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids and so were parsimony methods that could handle polyploidy more generally yet while excluding processes such as incomplete lineage sorting. In this paper, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene tree topologies as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating evolutionary hypotheses in the form of phylogenetic networks are implemented and publicly available in the PhyloNet software package.
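As background for the "minimizes deep coalescences" criterion, the sketch below counts extra lineages for the simplest case only: a rooted species tree (not a network), one sampled allele per species, and gene-tree leaves named after their species. Under those assumptions, the number of lineages leaving the edge above a species-tree cluster equals the number of maximal gene-tree clades contained in that cluster, and the extra lineages on that edge are that number minus one. Trees are written as nested tuples with string leaves; this representation and the function names are hypothetical. The sketch does not handle reticulations or polyploidy and is not the PhyloNet implementation.

```python
from typing import FrozenSet, Union

Tree = Union[str, tuple]  # a leaf name, or a tuple of subtrees


def leaf_set(tree: Tree) -> FrozenSet[str]:
    """Set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def count_maximal_clades(gene_tree: Tree, cluster: FrozenSet[str]) -> int:
    """Number of maximal gene-tree clades whose leaves all lie in `cluster`."""
    if leaf_set(gene_tree) <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(count_maximal_clades(child, cluster) for child in gene_tree)


def extra_lineages(species_tree: Tree, gene_tree: Tree) -> int:
    """Total extra lineages over all non-root species-tree edges.

    Assumes both trees have exactly the same leaf names (one allele per
    species); other inputs are outside the scope of this sketch.
    """
    total = 0

    def visit(node: Tree, is_root: bool) -> None:
        nonlocal total
        if not is_root:
            total += count_maximal_clades(gene_tree, leaf_set(node)) - 1
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(species_tree, True)
    return total


species = (("A", "B"), "C")
concordant = extra_lineages(species, (("A", "B"), "C"))
discordant = extra_lineages(species, (("A", "C"), "B"))
```

A concordant gene tree needs no extra lineages, while the discordant one forces the A and B lineages to both survive past the (A,B) edge, costing one extra lineage.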
