Efficient detection of repeating sites to accelerate phylogenetic likelihood calculations

2016 ◽  
Author(s):  
Kassian Kobert ◽  
Alexandros Stamatakis ◽  
Tomáš Flouri

The phylogenetic likelihood function is the major computational bottleneck in several applications of evolutionary biology such as phylogenetic inference, species delimitation, model selection and divergence times estimation. Given the alignment, a tree and the evolutionary model parameters, the likelihood function computes the conditional likelihood vectors for every node of the tree. Vector entries for which all input data are identical result in redundant likelihood operations which, in turn, yield identical conditional values. Such operations can be omitted for improving run-time and, using appropriate data structures, reducing memory usage. We present a fast, novel method for identifying and omitting such redundant operations in phylogenetic likelihood calculations, and assess the performance improvement and memory saving attained by our method. Using empirical and simulated data sets, we show that a prototype implementation of our method yields up to 10-fold speedups and uses up to 78% less memory than one of the fastest and most highly tuned implementations of the phylogenetic likelihood function currently available. Our method is generic and can seamlessly be integrated into any phylogenetic likelihood implementation.
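
A minimal sketch of how repeated sites might be detected is given here for illustration. It is not the authors' algorithm, which the abstract does not spell out. It assumes a hypothetical tree encoding (a tip name, or a pair of subtrees) and walks the tree bottom-up, giving each alignment site an integer class per node so that two sites share a class exactly when all tip states below that node are identical.

```python
def site_classes(tree, alignment):
    """Return (classes, n_classes) for the subtree `tree`.

    `tree` is either a tip name (str) or a pair (left, right) of subtrees;
    this encoding is an assumption made for the sketch.
    `alignment` maps tip names to equal-length sequences.
    classes[i] == classes[j] exactly when sites i and j show identical
    states at every tip below this node.
    """
    if isinstance(tree, str):
        # At a tip, the class of a site is determined by its character.
        ids = {}
        classes = [ids.setdefault(ch, len(ids)) for ch in alignment[tree]]
        return classes, len(ids)

    left, _ = site_classes(tree[0], alignment)
    right, _ = site_classes(tree[1], alignment)

    # At an inner node, two sites repeat only if they repeat in both children.
    ids = {}
    classes = [ids.setdefault(pair, len(ids)) for pair in zip(left, right)]
    return classes, len(ids)
```

With such classes, a likelihood implementation would only need to compute one conditional likelihood entry per class at each node and reuse it for the other sites in that class.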

2021 ◽  
Vol 2021 ◽  
pp. 1-27
Author(s):  
Awad A. Bakery ◽  
Wael Zakaria ◽  
OM Kalthum S. K. Mohamed

The generalized Gamma model has been applied in a variety of research fields, including reliability engineering and lifetime analysis. However, the model is unbounded from above, whereas data in many applications have bounded support. This paper presents a new five-parameter bounded generalized Gamma model, with the bounded Weibull model with four parameters, the bounded Gamma model with four parameters, the bounded generalized Gaussian model with three parameters, the bounded exponential model with three parameters, and the bounded Rayleigh model with two parameters as special cases. This approach, which utilizes a bounded support, allows for a great deal of versatility in fitting various shapes of observed data. Numerous properties of the proposed distribution have been deduced, including explicit expressions for the moments, quantiles, mode, moment generating function, mean variance, mean residual lifespan, entropies, skewness, kurtosis, hazard function, survival function, rth order statistic, and median distributions. The distribution has hazard rates that are monotonically increasing or decreasing, bathtub-shaped, or upside-down bathtub-shaped. We use the Newton-Raphson approach to approximate the model parameters that maximize the log-likelihood function, and some of the parameters have a closed iterative structure. Six real data sets and six simulated data sets were tested to demonstrate how the proposed model works in practice. We illustrate why the model is more stable and less affected by sample size. Additionally, the suggested model is very accurate for wavelet histogram fitting of images and sounds.
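
The Newton-Raphson step used for likelihood maximization can be illustrated with a small sketch. It deliberately uses a plain exponential model rather than the five-parameter bounded generalized Gamma model of the paper, because the exponential maximum likelihood estimate has a closed form (n divided by the sum of the data) against which the iteration can be checked. The function name and defaults are hypothetical.

```python
def newton_raphson_exponential_rate(data, tol=1e-12, max_iter=100):
    """Maximize the exponential log-likelihood n*log(rate) - rate*sum(data).

    Assumes all observations are strictly positive.
    """
    n = len(data)
    total = sum(data)
    # Start at or below the optimum n / total, where the iteration converges.
    rate = 1.0 / max(data)
    for _ in range(max_iter):
        score = n / rate - total        # first derivative of the log-likelihood
        hessian = -n / rate ** 2        # second derivative of the log-likelihood
        step = score / hessian
        rate -= step                    # Newton-Raphson update
        if abs(step) < tol:
            break
    return rate
```

For a multi-parameter model the same update applies with the score vector and Hessian matrix in place of the two scalar derivatives.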


2021 ◽  
pp. gr.273631.120
Author(s):  
Xinhao Liu ◽  
Huw A Ogilvie ◽  
Luay Nakhleh

Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. Coalescent hidden Markov model (coalHMM) methods are a promising avenue for the analysis of large genomic alignments, which are increasingly common, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and through a divide-and-conquer approach with more taxa. Using a simulated data set resembling a human-chimp-gorilla scenario, we show that our method has accuracy comparable to or better than that of previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation it is flexible enough to enable future implementations of all kinds of population models.


2015 ◽  
Vol 11 (A29A) ◽  
pp. 205-207
Author(s):  
Philip C. Gregory

Abstract A new apodized Keplerian model is proposed for the analysis of precision radial velocity (RV) data to model both planetary and stellar activity (SA) induced RV signals. A symmetrical Gaussian apodization function with unknown width and center can distinguish planetary signals from SA signals on the basis of the width of the apodization function. The general model for m apodized Keplerian signals also includes a linear regression term between RV and the stellar activity diagnostic ln(R'hk), as well as an extra Gaussian noise term with unknown standard deviation. The model parameters are explored using a Bayesian fusion MCMC code. A differential version of the Generalized Lomb-Scargle periodogram provides an additional way of distinguishing SA signals and helps guide the choice of new periods. Sample results are reported for a recent international RV blind challenge which included multiple state-of-the-art simulated data sets supported by a variety of stellar activity diagnostics.
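
The effect of a Gaussian apodization window can be sketched as follows. This is an illustration under stated assumptions, not the paper's model: it replaces the full Keplerian orbit with a circular-orbit sinusoid, and the parameter names and the exact form of the window are hypothetical, since the abstract does not give the parametrization.

```python
import math

def apodized_circular_rv(t, amplitude, period, phase, center, width):
    """Radial velocity of one apodized signal at time t.

    Simplification: a sinusoid stands in for a full Keplerian orbit.
    The Gaussian window equals 1 at t == center and decays with `width`.
    """
    window = math.exp(-((t - center) ** 2) / (2.0 * width ** 2))
    return amplitude * window * math.cos(2.0 * math.pi * t / period + phase)
```

A very wide window leaves the signal essentially unapodized across the data span, while a narrow one confines it to a short time interval, which is the property the fitted width is meant to expose.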


2014 ◽  
Author(s):  
Ramon Diaz-Uriarte

Cancer progression is caused by the sequential accumulation of mutations, but not all orders of accumulation of mutations are equally likely. When the fixation of some mutations depends on the presence of previous ones, identifying restrictions in the order of accumulation of mutations can lead to the discovery of therapeutic targets and diagnostic markers. Using simulated data sets, I conducted a comprehensive comparison of the performance of all available methods to identify these restrictions from cross-sectional data. In contrast to previous work, I embedded restrictions within evolutionary models of tumor progression that included passengers (mutations not responsible for the development of cancer, known to be very common). This allowed me to assess the effects of having to filter out passengers, of sampling schemes, and of deviations from order restrictions. Poor choices of method, filtering, and sampling lead to large errors in all performance metrics. Having to filter passengers led to decreased performance, especially because true restrictions were missed. Overall, the best method for identifying order restrictions was Oncogenetic Trees, a fast and easy to use method that, although unable to recover dependencies of mutations on more than one mutation, showed good performance in most scenarios, superior to Conjunctive Bayesian Networks and Progression Networks. Single cell sampling provided no advantage, but sampling in the final stages of the disease vs. sampling at different stages had severe effects. Evolutionary model and deviations from order restrictions had major, and sometimes counterintuitive, interactions with other factors that affected performance. This paper provides practical recommendations for using these methods with experimental data. Moreover, it shows that it is both possible and necessary to embed assumptions about order restrictions and the nature of driver status within evolutionary models of cancer progression to evaluate the performance of inferential approaches.


2020 ◽  
Vol 69 (5) ◽  
pp. 973-986 ◽  
Author(s):  
Joëlle Barido-Sottani ◽  
Timothy G Vaughan ◽  
Tanja Stadler

Abstract Heterogeneous populations can lead to important differences in birth and death rates across a phylogeny. Taking this heterogeneity into account is necessary to obtain accurate estimates of the underlying population dynamics. We present a new multitype birth–death model (MTBD) that can estimate lineage-specific birth and death rates. This corresponds to estimating lineage-dependent speciation and extinction rates for species phylogenies, and lineage-dependent transmission and recovery rates for pathogen transmission trees. In contrast with previous models, we do not presume to know the trait driving the rate differences, nor do we prohibit the same rates from appearing in different parts of the phylogeny. Using simulated data sets, we show that the MTBD model can reliably infer the presence of multiple evolutionary regimes, their positions in the tree, and the birth and death rates associated with each. We also present a reanalysis of two empirical data sets and compare the results obtained by MTBD and by the existing software BAMM. We compare two implementations of the model, one exact and one approximate (assuming that no rate changes occur in the extinct parts of the tree), and show that the approximation only slightly affects results. The MTBD model is implemented as a package in the Bayesian inference software BEAST 2 and allows joint inference of the phylogeny and the model parameters. [Birth–death; lineage-specific rates; multi-type model.]


2021 ◽  
Author(s):  
Gah-Yi Ban ◽  
N. Bora Keskin

We consider a seller who can dynamically adjust the price of a product at the individual customer level, by utilizing information about customers’ characteristics encoded as a d-dimensional feature vector. We assume a personalized demand model, parameters of which depend on s out of the d features. The seller initially does not know the relationship between the customer features and the product demand but learns this through sales observations over a selling horizon of T periods. We prove that the seller’s expected regret, that is, the revenue loss against a clairvoyant who knows the underlying demand relationship, is at least of order [Formula: see text] under any admissible policy. We then design a near-optimal pricing policy for a semiclairvoyant seller (who knows which s of the d features are in the demand model) who achieves an expected regret of order [Formula: see text]. We extend this policy to a more realistic setting, where the seller does not know the true demand predictors, and show that this policy has an expected regret of order [Formula: see text], which is also near-optimal. Finally, we test our theory on simulated data and on a data set from an online auto loan company in the United States. On both data sets, our experimentation-based pricing policy is superior to intuitive and/or widely-practiced customized pricing methods, such as myopic pricing and segment-then-optimize policies. Furthermore, our policy improves upon the loan company’s historical pricing decisions by 47% in expected revenue over a six-month period. This paper was accepted by Noah Gans, stochastic models and simulation.


2020 ◽  
Vol 37 (6) ◽  
pp. 1832-1842 ◽  
Author(s):  
Mandev S Gill ◽  
Philippe Lemey ◽  
Marc A Suchard ◽  
Andrew Rambaut ◽  
Guy Baele

Abstract Reconstructing pathogen dynamics from genetic data as they become available during an outbreak or epidemic represents an important statistical scenario in which observations arrive sequentially in time and one is interested in performing inference in an “online” fashion. Widely used Bayesian phylogenetic inference packages are not set up for this purpose, generally requiring one to recompute trees and evolutionary model parameters de novo when new data arrive. To accommodate increasing data flow in a Bayesian phylogenetic framework, we introduce a methodology to efficiently update the posterior distribution with newly available genetic data. Our procedure is implemented in the BEAST 1.10 software package, and relies on a distance-based measure to insert new taxa into the current estimate of the phylogeny and imputes plausible values for new model parameters to accommodate growing dimensionality. This augmentation creates informed starting values and re-uses optimally tuned transition kernels for posterior exploration of growing data sets, reducing the time necessary to converge to target posterior distributions. We apply our framework to data from the recent West African Ebola virus epidemic and demonstrate a considerable reduction in time required to obtain posterior estimates at different time points of the outbreak. Beyond epidemic monitoring, this framework easily finds other applications within the phylogenetics community, where changes in the data—in terms of alignment changes, sequence addition or removal—present common scenarios that can benefit from online inference.


Author(s):  
Wassila Nissas ◽  
Soufiane Gasmi

In the reliability literature, maintenance efficiency is usually treated as a fixed value. Since repairable systems are subject to different degrees and types of repair, it is more appropriate to treat maintenance efficiency as a random variable. This paper is devoted to the statistical study of a general hybrid model for repairable systems working under imperfect maintenance. For both failure improvement and virtual age reduction of the system, maintenance efficiency is assumed to be random and to follow an exponential distribution. The likelihood function of this model is provided, and the model parameters are estimated by the maximum likelihood procedure. The results obtained were tested and applied to simulated and real data sets. Confidence intervals were constructed with the bias-corrected and accelerated bootstrap method.
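
The bias-corrected and accelerated (BCa) bootstrap mentioned above can be sketched for a generic statistic. This is a textbook-style illustration using only the Python standard library, not the authors' code, and it is applied to a simple sample statistic rather than to the parameters of the hybrid maintenance model. The function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist, mean

def bca_interval(data, statistic=mean, alpha=0.05, n_resamples=2000, seed=0):
    """BCa bootstrap confidence interval for `statistic` of a list `data`.

    Assumes the data are not constant, so that the bootstrap and jackknife
    estimates actually vary.
    """
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)
    boot = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_resamples))

    # Bias correction: shift of the bootstrap distribution relative to theta_hat.
    below = sum(b < theta_hat for b in boot)
    z0 = NormalDist().inv_cdf(below / n_resamples)

    # Acceleration, estimated from jackknife (leave-one-out) values.
    jack = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den

    def adjusted_quantile(p):
        # Map the nominal level p to the BCa-adjusted bootstrap quantile.
        z = NormalDist().inv_cdf(p)
        adj = NormalDist().cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))
        index = min(n_resamples - 1, max(0, int(adj * n_resamples)))
        return boot[index]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)
```

For model parameters, `statistic` would be replaced by a function that refits the model to each resample and returns the estimate of interest.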


2021 ◽  
Author(s):  
Arthur Zwaenepoel ◽  
Yves Van de Peer

Abstract Phylogenetic models of gene family evolution based on birth-death processes (BDPs) provide an awkward fit to comparative genomic data sets. A central assumption of these models is the constant per-gene loss rate in any particular family. Because of the possibility of partial functional redundancy among gene family members, gene loss dynamics are however likely to be dependent on the number of genes in a family, and different variations of commonly employed BDP models indeed suggest this is the case. We propose a simple two-type branching process model to better approximate the stochastic evolution of gene families by gene duplication and loss and perform Bayesian statistical inference of model parameters in a phylogenetic context. We evaluate the statistical methods using simulated data sets and apply the model to gene family data for Drosophila, yeasts and primates, providing new quantitative insights in the long-term maintenance of duplicated genes.
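
The constant per-gene rate assumption of the standard BDP can be made concrete with a short simulation sketch. It simulates the single-type linear birth-death process that the abstract describes as the baseline, not the proposed two-type model, whose details are not given here. The function and parameter names are hypothetical.

```python
import random

def simulate_gene_count(n0, dup_rate, loss_rate, t_max, seed=None):
    """Simulate gene family size after time t_max under a linear BDP.

    Every gene duplicates at rate `dup_rate` and is lost at rate `loss_rate`,
    independently of family size. Assumes dup_rate + loss_rate > 0.
    """
    rng = random.Random(seed)
    n, t = n0, 0.0
    while n > 0:
        # Waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(n * (dup_rate + loss_rate))
        if t > t_max:
            break
        # Choose duplication or loss in proportion to the two rates.
        if rng.random() < dup_rate / (dup_rate + loss_rate):
            n += 1
        else:
            n -= 1
    return n
```

A model with size-dependent loss would replace the constant `loss_rate` with a function of `n`, which is the kind of departure from the standard BDP that the abstract motivates.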


2019 ◽  
Vol 4 (2) ◽  
pp. 108-123 ◽  
Author(s):  
Andrew M Ritchie ◽  
Simon Y W Ho

Abstract Bayesian phylogenetic methods derived from evolutionary biology can be used to reconstruct the history of human languages using databases of cognate words. These analyses have produced exciting results regarding the origins and dispersal of linguistic and cultural groups through prehistory. Bayesian lexical dating requires the specification of priors on all model parameters. This includes the use of a prior on divergence times, often combined with a prior on tree topology and referred to as a tree prior. Violation of the underlying assumptions of the tree prior can lead to an erroneous estimate of the timescale of language evolution. To investigate these impacts, we tested the sensitivity of Bayesian dating to the tree prior in analyses of four lexical data sets. Our results show that estimates of the origin times of language families are robust to the choice of tree prior for lexical data, though less so than when Bayesian phylogenetic methods are used to analyse genetic data sets. We also used the relative fit of speciation and coalescent tree priors to determine the ability of speciation models to describe language diversification at four different taxonomic levels. We found that speciation priors were preferred over a constant-size coalescent prior regardless of taxonomic scale. However, data sets with narrower taxonomic and geographic sampling exhibited a poorer fit to ideal birth–death model expectations. Our results encourage further investigation into the nature of language diversification at different sampling scales.

