scholarly journals Universal Sample Size Invariant Measures for Uncertainty Quantification in Density Estimation

Entropy ◽  
2019 ◽  
Vol 21 (11) ◽  
pp. 1120 ◽  
Author(s):  
Jenny Farmer ◽  
Zach Merino ◽  
Alexander Gray ◽  
Donald Jacobs

Previously, we developed a high throughput non-parametric maximum entropy method (PLOS ONE, 13(5): e0196937, 2018) that employs a log-likelihood scoring function to characterize uncertainty in trial probability density estimates through a scaled quantile residual (SQR). The SQR for the true probability density has universal sample size invariant properties equivalent to sampled uniform random data (SURD). Alternative scoring functions are considered that include the Anderson-Darling test. Scoring function effectiveness is evaluated using receiver operator characteristics to quantify efficacy in discriminating SURD from decoy-SURD, and by comparing overall performance characteristics during density estimation across a diverse test set of known probability distributions.

Author(s):  
Jun Pei ◽  
Zheng Zheng ◽  
Hyunji Kim ◽  
Lin Song ◽  
Sarah Walworth ◽  
...  

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relevant importance for each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance for each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF6) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept and, the resultant RF models were tested on CASF-2013.5 In a comparison of the performance of our RF models against 29 scoring functions, we found our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificial designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which share the same peak positions with GARF but have fixed peak heights. The results of accuracy comparison from RF models based on the scrambled, uniform, and original GARF potential clearly showed that the peak positions in the GARF potential are important while the well depths are not. <br>


2021 ◽  
Vol 13 (1) ◽  
Author(s):  
Xujun Zhang ◽  
Chao Shen ◽  
Xueying Guo ◽  
Zhe Wang ◽  
Gaoqi Weng ◽  
...  

AbstractVirtual screening (VS) based on molecular docking has emerged as one of the mainstream technologies of drug discovery due to its low cost and high efficiency. However, the scoring functions (SFs) implemented in most docking programs are not always accurate enough and how to improve their prediction accuracy is still a big challenge. Here, we propose an integrated platform called ASFP, a web server for the development of customized SFs for structure-based VS. There are three main modules in ASFP: (1) the descriptor generation module that can generate up to 3437 descriptors for the modelling of protein–ligand interactions; (2) the AI-based SF construction module that can establish target-specific SFs based on the pre-generated descriptors through three machine learning (ML) techniques; (3) the online prediction module that provides some well-constructed target-specific SFs for VS and an additional generic SF for binding affinity prediction. Our methodology has been validated on several benchmark datasets. The target-specific SFs can achieve an average ROC AUC of 0.973 towards 32 targets and the generic SF can achieve the Pearson correlation coefficient of 0.81 on the PDBbind version 2016 core set. To sum up, the ASFP server is a powerful tool for structure-based VS.


Author(s):  
V. S. Huzurbazar

Let f(x, αi) be the probability density function of a distribution depending on n parameters αi(i = 1,2, …, n). Then following Jeffreys(1) we shall say that the parameters αi are orthogonal if


2020 ◽  
Vol 0 (0) ◽  
Author(s):  
Zahra Amini Farsani ◽  
Volker J. Schmid

AbstractCo-localization analysis is a popular method for quantitative analysis in fluorescence microscopy imaging. The localization of marked proteins in the cell nucleus allows a deep insight into biological processes in the nucleus. Several metrics have been developed for measuring the co-localization of two markers, however, they depend on subjective thresholding of background and the assumption of linearity. We propose a robust method to estimate the bivariate distribution function of two color channels. From this, we can quantify their co- or anti-colocalization. The proposed method is a combination of the Maximum Entropy Method (MEM) and a Gaussian Copula, which we call the Maximum Entropy Copula (MEC). This new method can measure the spatial and nonlinear correlation of signals to determine the marker colocalization in fluorescence microscopy images. The proposed method is compared with MEM for bivariate probability distributions. The new colocalization metric is validated on simulated and real data. The results show that MEC can determine co- and anti-colocalization even in high background settings. MEC can, therefore, be used as a robust tool for colocalization analysis.


2021 ◽  
Vol 5 (1) ◽  
pp. 1-11
Author(s):  
Vitthal Anwat ◽  
Pramodkumar Hire ◽  
Uttam Pawar ◽  
Rajendra Gunjal

Flood Frequency Analysis (FFA) method was introduced by Fuller in 1914 to understand the magnitude and frequency of floods. The present study is carried out using the two most widely accepted probability distributions for FFA in the world namely, Gumbel Extreme Value type I (GEVI) and Log Pearson type III (LP-III). The Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) methods were used to select the most suitable probability distribution at sites in the Damanganga Basin. Moreover, discharges were estimated for various return periods using GEVI and LP-III. The recurrence interval of the largest peak flood on record (Qmax) is 107 years (at Nanipalsan) and 146 years (at Ozarkhed) as per LP-III. Flood Frequency Curves (FFC) specifies that LP-III is the best-fitted probability distribution for FFA of the Damanganga Basin. Therefore, estimated discharges and return periods by LP-III probability distribution are more reliable and can be used for designing hydraulic structures.


2021 ◽  
Vol 118 (40) ◽  
pp. e2025782118
Author(s):  
Wei-Chia Chen ◽  
Juannan Zhou ◽  
Jason M. Sheltzer ◽  
Justin B. Kinney ◽  
David M. McCandlish

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5′ splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.


Sign in / Sign up

Export Citation Format

Share Document