Universal Sample Size Invariant Measures for Uncertainty Quantification in Density Estimation

Jenny Farmer; Zach Merino; Alexander Gray; Donald Jacobs

doi:10.3390/e21111120

Entropy ◽

10.3390/e21111120 ◽

2019 ◽

Vol 21 (11) ◽

pp. 1120 ◽

Cited By ~ 1

Author(s):

Jenny Farmer ◽

Zach Merino ◽

Alexander Gray ◽

Donald Jacobs

Keyword(s):

Sample Size ◽

Probability Density ◽

Density Estimation ◽

Probability Distributions ◽

Scoring Function ◽

Entropy Method ◽

Scoring Functions ◽

Trial Probability ◽

Anderson Darling ◽

Invariant Properties

Previously, we developed a high throughput non-parametric maximum entropy method (PLOS ONE, 13(5): e0196937, 2018) that employs a log-likelihood scoring function to characterize uncertainty in trial probability density estimates through a scaled quantile residual (SQR). The SQR for the true probability density has universal sample size invariant properties equivalent to sampled uniform random data (SURD). Alternative scoring functions are considered that include the Anderson-Darling test. Scoring function effectiveness is evaluated using receiver operator characteristics to quantify efficacy in discriminating SURD from decoy-SURD, and by comparing overall performance characteristics during density estimation across a diverse test set of known probability distributions.
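
The scaled quantile residual described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the SQR is defined as sqrt(N + 2) * (u_(k) - k / (N + 1)), where u_(k) is the k-th smallest value of the trial CDF evaluated on the sample. That scaling follows from the mean and variance of uniform order statistics and is an assumption here, since the abstract does not state the formula. The function names and the exponential test data are hypothetical.

```python
import math
import random


def scaled_quantile_residuals(u_values):
    """Return sqrt(N + 2) * (u_(k) - k / (N + 1)) for the sorted u values.

    Assumed definition: for uniform data the k-th order statistic has mean
    k / (N + 1) and variance mu * (1 - mu) / (N + 2), so this scaling removes
    the dependence on the sample size N.
    """
    u_sorted = sorted(u_values)
    n = len(u_sorted)
    scale = math.sqrt(n + 2)
    return [scale * (u - k / (n + 1)) for k, u in enumerate(u_sorted, start=1)]


def sqr_for_trial_cdf(sample, trial_cdf):
    """Map a sample through a trial CDF, then compute its SQR."""
    return scaled_quantile_residuals([trial_cdf(x) for x in sample])


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data drawn from an exponential distribution with rate 1.
    sample = [rng.expovariate(1.0) for _ in range(1000)]
    trials = {
        "true density (rate 1)": lambda x: 1.0 - math.exp(-x),
        "wrong density (rate 2)": lambda x: 1.0 - math.exp(-2.0 * x),
    }
    for name, cdf in trials.items():
        sqr = sqr_for_trial_cdf(sample, cdf)
        print(name, "max |SQR| =", max(abs(r) for r in sqr))
```

Under this assumed definition, a trial density close to the truth should give residuals that stay small at any sample size, while a poor trial density gives residuals that grow with N.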

The Generalized Cross Entropy Method, with Applications to Probability Density Estimation

Methodology And Computing In Applied Probability ◽

10.1007/s11009-009-9133-7 ◽

2009 ◽

Vol 13 (1) ◽

pp. 1-27 ◽

Cited By ~ 23

Author(s):

Zdravko I. Botev ◽

Dirk P. Kroese

Keyword(s):

Probability Density ◽

Density Estimation ◽

Entropy Method ◽

Cross Entropy ◽

Probability Density Estimation ◽

Cross Entropy Method ◽

Generalized Cross Entropy

Random Forest Refinement of Pairwise Potentials for Protein-ligand Decoy Detection

10.26434/chemrxiv.8047820.v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jun Pei ◽

Zheng Zheng ◽

Hyunji Kim ◽

Lin Song ◽

Sarah Walworth ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Probability Function ◽

Pair Potential ◽

Scoring Function ◽

Stable Structure ◽

Scoring Functions ◽

Atom Pair ◽

Data Set ◽

Atom Pairs

An accurate scoring function is expected to correctly select the most stable structure from a set of pose candidates. One can hypothesize that a scoring function’s ability to identify the most stable structure might be improved by emphasizing the most relevant atom pairwise interactions. However, it is hard to evaluate the relative importance of each atom pair using traditional means. With the introduction of machine learning methods, it has become possible to determine the relative importance of each atom pair present in a scoring function. In this work, we use the Random Forest (RF) method to refine a pair potential developed by our laboratory (GARF) by identifying relevant atom pairs that optimize the performance of the potential on our given task. Our goal is to construct a machine learning (ML) model that can accurately differentiate the native ligand binding pose from candidate poses using a potential refined by RF optimization. We successfully constructed RF models on an unbalanced data set with the ‘comparison’ concept, and the resultant RF models were tested on CASF-2013. In a comparison of the performance of our RF models against 29 scoring functions, we found that our models outperformed the other scoring functions in predicting the native pose. In addition, we used two artificially designed potential models to address the importance of the GARF potential in the RF models: (1) a scrambled probability function set, which was obtained by mixing up atom pairs and probability functions in GARF, and (2) a uniform probability function set, which shares the same peak positions with GARF but has fixed peak heights. The accuracy comparison among RF models based on the scrambled, uniform, and original GARF potentials clearly showed that the peak positions in the GARF potential are important while the well depths are not.

ASFP (Artificial Intelligence based Scoring Function Platform): a web server for the development of customized scoring functions

Journal of Cheminformatics ◽

10.1186/s13321-021-00486-3 ◽

2021 ◽

Vol 13 (1) ◽

Author(s):

Xujun Zhang ◽

Chao Shen ◽

Xueying Guo ◽

Zhe Wang ◽

Gaoqi Weng ◽

...

Keyword(s):

High Efficiency ◽

Low Cost ◽

Pearson Correlation ◽

Scoring Function ◽

Web Server ◽

Scoring Functions ◽

Protein Ligand Interactions ◽

Prediction Module ◽

Ligand Interactions ◽

Benchmark Datasets

Virtual screening (VS) based on molecular docking has emerged as one of the mainstream technologies of drug discovery due to its low cost and high efficiency. However, the scoring functions (SFs) implemented in most docking programs are not always accurate enough, and how to improve their prediction accuracy is still a big challenge. Here, we propose an integrated platform called ASFP, a web server for the development of customized SFs for structure-based VS. There are three main modules in ASFP: (1) the descriptor generation module that can generate up to 3437 descriptors for the modelling of protein–ligand interactions; (2) the AI-based SF construction module that can establish target-specific SFs based on the pre-generated descriptors through three machine learning (ML) techniques; (3) the online prediction module that provides some well-constructed target-specific SFs for VS and an additional generic SF for binding affinity prediction. Our methodology has been validated on several benchmark datasets. The target-specific SFs can achieve an average ROC AUC of 0.973 towards 32 targets and the generic SF can achieve the Pearson correlation coefficient of 0.81 on the PDBbind version 2016 core set. To sum up, the ASFP server is a powerful tool for structure-based VS.
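
The two figures of merit quoted above, ROC AUC and the Pearson correlation coefficient, can be computed as in the following sketch. It uses the standard textbook definitions and the Python standard library only. The function names are hypothetical and nothing here reflects how the ASFP server computes its metrics.

```python
import math


def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive outscores a negative.

    Labels are 1 (active) or 0 (decoy). Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

The pairwise form of the AUC is quadratic in the number of compounds. It is chosen here for clarity, and a rank-based formula would be preferable for large screening libraries.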

Probability distributions and orthogonal parameters

Mathematical Proceedings of the Cambridge Philosophical Society ◽

10.1017/s0305004100025743 ◽

1950 ◽

Vol 46 (2) ◽

pp. 281-284 ◽

Cited By ~ 14

Author(s):

V. S. Huzurbazar

Keyword(s):

Probability Density Function ◽

Probability Density ◽

Density Function ◽

Probability Distributions ◽

Orthogonal Parameters ◽

Image Position

Let f(x, αi) be the probability density function of a distribution depending on n parameters αi (i = 1, 2, …, n). Then following Jeffreys (1) we shall say that the parameters αi are orthogonal if
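
The excerpt ends before the condition itself. For orientation only, orthogonality of parameters in Jeffreys' sense is commonly stated as the vanishing of the off-diagonal elements of the Fisher information matrix. The form below is that common statement, given as an assumption, and its notation may differ from the paper's.

```latex
\[
  \mathrm{E}\!\left[
    \frac{\partial^{2} \log f(x, \alpha)}{\partial \alpha_i \, \partial \alpha_j}
  \right] = 0
  \qquad (i \neq j;\ i, j = 1, 2, \ldots, n),
\]
```

Here the expectation is taken with respect to f(x, α) itself.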

Application of Kernel Density Estimation to Impact Probability Density Determination for Risk Analysis

48th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition ◽

10.2514/6.2010-1541 ◽

2010 ◽

Cited By ~ 2

Author(s):

Erik Larson ◽

George Lloyd

Keyword(s):

Risk Analysis ◽

Probability Density ◽

Density Estimation ◽

Kernel Density Estimation ◽

Kernel Density ◽

Density Determination

An Incipient Fault diagnosis Methodology Using Local Mahalanobis Distance: Detection process based on Empirical Probability Density Estimation

Signal Processing ◽

10.1016/j.sigpro.2021.108308 ◽

2021 ◽

pp. 108308

Author(s):

Junjie Yang ◽

Claude Delpha

Keyword(s):

Fault Diagnosis ◽

Probability Density ◽

Density Estimation ◽

Mahalanobis Distance ◽

Probability Density Estimation ◽

Detection Process ◽

Empirical Probability ◽

Incipient Fault ◽

Incipient Fault Diagnosis

TIME SERIES RESIDUALS WITH APPLICATION TO PROBABILITY DENSITY ESTIMATION

Journal of Time Series Analysis ◽

10.1111/j.1467-9892.1987.tb00445.x ◽

1987 ◽

Vol 8 (3) ◽

pp. 329-344 ◽

Cited By ~ 22

Author(s):

P. M. Robinson

Keyword(s):

Time Series ◽

Probability Density ◽

Density Estimation ◽

Probability Density Estimation

Co-localization analysis in fluorescence microscopy via maximum entropy copula

The International Journal of Biostatistics ◽

10.1515/ijb-2019-0019 ◽

2020 ◽

Vol 0 (0) ◽

Author(s):

Zahra Amini Farsani ◽

Volker J. Schmid

Keyword(s):

Fluorescence Microscopy ◽

Maximum Entropy ◽

Probability Distributions ◽

Real Data ◽

Bivariate Distribution ◽

Entropy Method ◽

Gaussian Copula ◽

Microscopy Imaging ◽

High Background ◽

Localization Analysis

Co-localization analysis is a popular method for quantitative analysis in fluorescence microscopy imaging. The localization of marked proteins in the cell nucleus allows a deep insight into biological processes in the nucleus. Several metrics have been developed for measuring the co-localization of two markers; however, they depend on subjective thresholding of background and the assumption of linearity. We propose a robust method to estimate the bivariate distribution function of two color channels. From this, we can quantify their co- or anti-colocalization. The proposed method is a combination of the Maximum Entropy Method (MEM) and a Gaussian Copula, which we call the Maximum Entropy Copula (MEC). This new method can measure the spatial and nonlinear correlation of signals to determine the marker colocalization in fluorescence microscopy images. The proposed method is compared with MEM for bivariate probability distributions. The new colocalization metric is validated on simulated and real data. The results show that MEC can determine co- and anti-colocalization even in high background settings. MEC can, therefore, be used as a robust tool for colocalization analysis.

Analysis of Magnitude and Frequency of Floods in the Damanganga Basin: Western India

Hydrospatial Analysis ◽

10.21523/gcj3.2021050101 ◽

2021 ◽

Vol 5 (1) ◽

pp. 1-11

Author(s):

Vitthal Anwat ◽

Pramodkumar Hire ◽

Uttam Pawar ◽

Rajendra Gunjal

Keyword(s):

Probability Distribution ◽

Probability Distributions ◽

Flood Frequency ◽

Flood Frequency Analysis ◽

Western India ◽

Type I ◽

Return Periods ◽

Pearson Type ◽

Kolmogorov Smirnov ◽

Anderson Darling

The Flood Frequency Analysis (FFA) method was introduced by Fuller in 1914 to understand the magnitude and frequency of floods. The present study is carried out using the two most widely accepted probability distributions for FFA in the world, namely Gumbel Extreme Value type I (GEVI) and Log Pearson type III (LP-III). The Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) methods were used to select the most suitable probability distribution at sites in the Damanganga Basin. Moreover, discharges were estimated for various return periods using GEVI and LP-III. The recurrence interval of the largest peak flood on record (Qmax) is 107 years (at Nanipalsan) and 146 years (at Ozarkhed) as per LP-III. The Flood Frequency Curves (FFC) indicate that LP-III is the best-fitted probability distribution for FFA of the Damanganga Basin. Therefore, discharges and return periods estimated by the LP-III probability distribution are more reliable and can be used for designing hydraulic structures.
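
The goodness-of-fit comparison described above can be illustrated with a short sketch. It computes the Kolmogorov-Smirnov and Anderson-Darling statistics for a Gumbel (type I) fit and converts a discharge into a return period with T = 1 / (1 - F(x)). The method-of-moments fit, the function names, and the synthetic peak discharges are assumptions made for illustration. They are not the study's data or procedure, and LP-III is omitted for brevity.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def gumbel_cdf(x, loc, scale):
    """CDF of the Gumbel (extreme value type I) distribution for maxima."""
    return math.exp(-math.exp(-(x - loc) / scale))


def fit_gumbel_moments(sample):
    """Method-of-moments estimates (loc, scale); an assumed fitting choice."""
    n = len(sample)
    mean = sum(sample) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    scale = std * math.sqrt(6.0) / math.pi
    return mean - EULER_GAMMA * scale, scale


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and the fitted CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def anderson_darling_statistic(sample, cdf):
    """A^2 = -n - (1/n) * sum (2i - 1) [ln F(x_i) + ln(1 - F(x_{n+1-i}))]."""
    xs = sorted(sample)
    n = len(xs)
    eps = 1e-12  # keep the logarithms finite at the extremes
    f = [min(max(cdf(x), eps), 1.0 - eps) for x in xs]
    total = sum(
        (2 * i - 1) * (math.log(f[i - 1]) + math.log(1.0 - f[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


def return_period(x, cdf):
    """Average recurrence interval, in years, of an annual peak exceeding x."""
    return 1.0 / (1.0 - cdf(x))


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic annual peak discharges (illustrative values, not real data).
    peaks = []
    for _ in range(40):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        peaks.append(500.0 - 200.0 * math.log(-math.log(u)))
    loc, scale = fit_gumbel_moments(peaks)
    fitted = lambda x: gumbel_cdf(x, loc, scale)
    print("KS D   =", ks_statistic(peaks, fitted))
    print("AD A^2 =", anderson_darling_statistic(peaks, fitted))
    print("Return period of largest peak =", return_period(max(peaks), fitted))
```

Smaller KS and AD statistics indicate a closer fit, so repeating the calculation for each candidate distribution and comparing the values mirrors the selection step in the abstract. The AD statistic weights the tails more heavily than KS, which matters for flood extremes.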

Field-theoretic density estimation for biological sequence space with applications to 5′ splice site diversity and aneuploidy in cancer

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2025782118 ◽

2021 ◽

Vol 118 (40) ◽

pp. e2025782118

Author(s):

Wei-Chia Chen ◽

Juannan Zhou ◽

Jason M. Sheltzer ◽

Justin B. Kinney ◽

David M. McCandlish

Keyword(s):

Probability Distribution ◽

Maximum Entropy ◽

Density Estimation ◽

Sequence Space ◽

Chromosomal Abnormalities ◽

Fundamental Problem ◽

Probability Distributions ◽

Biological Sequence ◽

Point Estimates ◽

Site Diversity

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5′ splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.
