Bayesian Inference for Correlations in the Presence of Measurement Error and Estimation Uncertainty

2017
Vol 3 (1)
Author(s):
Dora Matzke
Alexander Ly
Ravi Selker
Wouter D. Weeda
Benjamin Scheibehenne
...  

Whenever parameter estimates are uncertain or observations are contaminated by measurement error, the Pearson correlation coefficient can severely underestimate the true strength of an association. Various approaches exist for inferring the correlation in the presence of estimation uncertainty and measurement error, but none are routinely applied in psychological research. Here we focus on a Bayesian hierarchical model proposed by Behseta, Berdyyeva, Olson, and Kass (2009) that allows researchers to infer the underlying correlation between error-contaminated observations. We show that this approach may also be applied to obtain the underlying correlation between uncertain parameter estimates, as well as the correlation between uncertain parameter estimates and noisy observations. We illustrate the Bayesian modeling of correlations with two empirical data sets; in each data set, we first infer the posterior distribution of the underlying correlation and then compute Bayes factors to quantify the evidence that the data provide for the presence of an association.
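
To make the hierarchical idea concrete, the following is a minimal sketch (not the authors' implementation, and with made-up data and standard errors): each noisy observation is modelled as scattering around a latent true score, and the correlation of interest is a parameter of the latent bivariate distribution. It assumes the PyMC library is available.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
n = 40
x_se = np.full(n, 0.4)                  # known standard errors of the x measurements
y_se = np.full(n, 0.5)                  # known standard errors of the y measurements
latent_true = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=n)
x_obs = latent_true[:, 0] + rng.normal(0.0, x_se)
y_obs = latent_true[:, 1] + rng.normal(0.0, y_se)

with pm.Model():
    mu = pm.Normal("mu", 0.0, 5.0, shape=2)
    sigma = pm.HalfNormal("sigma", 2.0, shape=2)
    rho = pm.Uniform("rho", -1.0, 1.0)              # the underlying correlation
    row1 = pm.math.stack([sigma[0] ** 2, rho * sigma[0] * sigma[1]])
    row2 = pm.math.stack([rho * sigma[0] * sigma[1], sigma[1] ** 2])
    cov = pm.math.stack([row1, row2])
    latent = pm.MvNormal("latent", mu=mu, cov=cov, shape=(n, 2))
    # measurement model: each observation is its latent value plus known noise
    pm.Normal("x", mu=latent[:, 0], sigma=x_se, observed=x_obs)
    pm.Normal("y", mu=latent[:, 1], sigma=y_se, observed=y_obs)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)

print(float(idata.posterior["rho"].mean()))
```

Because the known error variances enter the likelihood explicitly, the posterior for rho is centred near the latent correlation rather than the attenuated sample correlation of the noisy observations.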

Author(s):  
Victor H Aguiar ◽  
Nail Kashaev

Abstract: A long-standing question about consumer behaviour is whether individuals’ observed purchase decisions satisfy the revealed preference (RP) axioms of the utility maximization theory (UMT). Researchers using survey or experimental panel data sets on prices and consumption to answer this question face the well-known problem of measurement error. We show that ignoring measurement error in the RP approach may lead to overrejection of the UMT. To solve this problem, we propose a new statistical RP framework for consumption panel data sets that allows for testing the UMT in the presence of measurement error. Our test is applicable to all consumer models that can be characterized by their first-order conditions. Our approach is non-parametric, allows for unrestricted heterogeneity in preferences and requires only a centring condition on measurement error. We develop two applications that provide new evidence about the UMT. First, we find support in a survey data set for the dynamic and time-consistent UMT in single-individual households, in the presence of nonclassical measurement error in consumption. In the second application, we cannot reject the static UMT in a widely used experimental data set in which measurement error in prices is assumed to be the result of price misperception due to the experimental design. The first finding stands in contrast to the conclusions drawn from the deterministic RP test of Browning (1989, International Economic Review, 979–992). The second finding reverses the conclusions drawn from the deterministic RP test of Afriat (1967, International Economic Review, 8, 67–77) and Varian (1982, Econometrica, 945–973).
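
For contrast with the statistical framework proposed here, the deterministic benchmark (an Afriat/Varian-style GARP check that ignores measurement error) can be sketched as follows; the arrays are hypothetical and the function is only an illustration of the classical test.

```python
import numpy as np

def satisfies_garp(prices, quantities):
    """Deterministic GARP check for T observed price-quantity pairs.

    prices, quantities: arrays of shape (T, n_goods).
    Bundle i is directly revealed preferred to bundle j if p_i . x_i >= p_i . x_j.
    """
    exp_cross = prices @ quantities.T                 # exp_cross[i, j] = p_i . x_j
    own = np.diag(exp_cross)                          # p_i . x_i
    R = own[:, None] >= exp_cross                     # direct revealed preference
    closure = R.copy()                                # transitive closure (Warshall)
    for k in range(len(prices)):
        closure |= closure[:, [k]] & closure[[k], :]
    strict = own[:, None] > exp_cross                 # strictly revealed preferred
    # GARP: if i is (possibly indirectly) preferred to j, then j must not be
    # strictly directly revealed preferred to i
    return not np.any(closure & strict.T)

# toy usage with made-up data
prices = np.array([[1.0, 2.0], [2.0, 1.0]])
quantities = np.array([[3.0, 1.0], [1.0, 3.0]])
print(satisfies_garp(prices, quantities))
```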


2015
Vol 2015
pp. 1-12
Author(s):
Mohammed Alguraibawi
Habshah Midi
A. H. M. Rahmatullah Imon

Identification of high leverage points is crucial because they are responsible for inaccurate predictions and invalid inferential statements, as they have a large impact on the computed values of various estimates. It is essential to classify high leverage points into good and bad leverage points because only the bad leverage points have an undue effect on the parameter estimates. It is now evident that when a group of high leverage points is present in a data set, existing robust diagnostic plots fail to classify them correctly. This problem is due to the masking and swamping effects. In this paper, we propose a new robust diagnostic plot that correctly classifies good and bad leverage points by reducing both masking and swamping effects. The formulation of the proposed plot is based on the Modified Generalized Studentized Residuals. We investigate the performance of the proposed method in a Monte Carlo simulation study and on some well-known data sets. The results indicate that the proposed method improves the rate of detection of bad leverage points and also reduces swamping and masking effects.
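
For readers unfamiliar with the ingredients, the sketch below computes the classical quantities that leverage diagnostics build on (hat values and internally studentized residuals) and applies a common rule-of-thumb classification; it is not the authors' Modified Generalized Studentized Residuals procedure, which replaces these classical estimates with robust ones to avoid masking and swamping.

```python
import numpy as np

def leverage_and_studentized_residuals(X, y):
    """Classical OLS leverages (hat values) and internally studentized residuals.

    X: (n, p) design matrix including an intercept column; y: (n,) response.
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
    h = np.diag(H)                              # leverages
    resid = y - H @ y
    s2 = resid @ resid / (n - p)                # residual variance estimate
    t = resid / np.sqrt(s2 * (1.0 - h))         # studentized residuals
    return h, t

# toy data with one point moved far out in x-space (a potential leverage point)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 29), [8.0]])
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, 30)
X = np.column_stack([np.ones(30), x])
h, t = leverage_and_studentized_residuals(X, y)

cutoff_h = 2 * X.shape[1] / len(y)              # common rule of thumb for leverage
# high leverage with a small residual behaves like a "good" leverage point;
# high leverage combined with a large residual flags a "bad" leverage point
for i in np.where(h > cutoff_h)[0]:
    kind = "bad" if abs(t[i]) > 2.5 else "good"
    print(f"obs {i}: leverage={h[i]:.2f}, studentized residual={t[i]:.2f} -> {kind}")
```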


2010
Vol 35 (2)
pp. 194-214
Author(s):
Jing Cao
S. Lynne Stokes
Song Zhang

We develop a Bayesian hierarchical model for the analysis of ordinal data from multirater ranking studies. The model for a rater’s score includes four latent factors: one is a latent item trait determining the true order of items and the other three are the rater’s performance characteristics, including bias, discrimination, and measurement error in the ratings. The proposed approach aims at three goals. First, three Bayesian estimators are introduced to estimate the ranks of items. They all show a substantial improvement over the widely used score sums by using the information on the variable skill of the raters. Second, rater performance can be compared based on rater bias, discrimination, and measurement error. Third, a simulation-based decision-theoretic approach is described to determine the number of raters to employ. A simulation study and an analysis based on a grant review data set are presented.
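
A minimal sketch of this kind of rater model is given below, treating scores as continuous for brevity (the paper's model is for ordinal ranking data) and assuming PyMC; all names and simulated values are illustrative only.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(3)
n_items, n_raters = 20, 5
theta_true = rng.normal(0, 1, n_items)                              # latent item trait
scores = (0.2 * np.arange(n_raters)                                 # rater bias
          + np.outer(theta_true, np.linspace(0.5, 1.5, n_raters))   # rater discrimination
          + rng.normal(0, 0.4, (n_items, n_raters)))                # rater measurement error

item_idx, rater_idx = np.indices((n_items, n_raters))

with pm.Model():
    theta = pm.Normal("theta", 0.0, 1.0, shape=n_items)         # true item trait
    bias = pm.Normal("bias", 0.0, 1.0, shape=n_raters)          # rater bias
    disc = pm.HalfNormal("disc", 1.0, shape=n_raters)           # rater discrimination
    noise = pm.HalfNormal("noise", 1.0, shape=n_raters)         # rater measurement error
    mu = bias[rater_idx.ravel()] + disc[rater_idx.ravel()] * theta[item_idx.ravel()]
    pm.Normal("score", mu=mu, sigma=noise[rater_idx.ravel()], observed=scores.ravel())
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=3)

# rank items by the posterior mean of the latent trait
posterior_rank = np.argsort(-idata.posterior["theta"].mean(("chain", "draw")).values)
print(posterior_rank)
```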


2006
Vol 30 (1)
pp. 83
Author(s):
Ronald Donato
Jeffrey Richardson

Diagnosis-based risk adjustment is increasingly seen as an important tool for establishing capitation payments and for evaluating the appropriateness and efficiency of the services provided, and it has become an important area of research for many countries contemplating health system reform. This paper examines the application of a risk-adjustment method, extensively validated in the United States and known as diagnostic cost groups (DCG), to a large Australian hospital inpatient data set. The data set encompassed hospital inpatient diagnoses and inpatient expenditure for the entire metropolitan population residing in the state of New South Wales. The DCG model was able to explain 34% of individual-level variation in concurrent expenditure and 5.2% in subsequent-year expenditure, which is comparable to US studies using inpatient-only data. The stability and internal consistency of the parameter estimates for both the concurrent and prospective models indicate that the DCG methodology has face validity in its application to NSW health data sets. Modelling and simulations were conducted that demonstrate the policy applications and significance of risk-adjustment models in the Australian context. This study demonstrates the feasibility of using large individual-level data sets for diagnosis-based risk-adjustment research in Australia. The results suggest that a research agenda should be established to broaden the options for health system reform.
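
As a rough illustration of how such explained variation is typically quantified, the snippet below regresses synthetic individual-level expenditure on diagnostic-group indicators and reports concurrent and prospective R-squared values; the groups and costs are entirely made up and do not represent the NSW data or the DCG grouper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_groups = 5000, 20
group = rng.integers(0, n_groups, n)                   # hypothetical diagnostic-group assignment
group_cost = rng.gamma(2.0, 1500.0, n_groups)          # hypothetical mean cost per group
cost_now = rng.gamma(2.0, group_cost[group] / 2.0)     # concurrent-year expenditure
cost_next = rng.gamma(2.0, (0.3 * group_cost[group] + 2000.0) / 2.0)  # next-year expenditure

def r_squared(y, group, n_groups):
    """R^2 of a regression of y on group indicator variables (i.e. group means)."""
    fitted = np.array([y[group == g].mean() for g in range(n_groups)])[group]
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print("concurrent  R^2:", round(r_squared(cost_now, group, n_groups), 3))
print("prospective R^2:", round(r_squared(cost_next, group, n_groups), 3))
```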


2001
Vol 73 (2)
pp. 229-240
Author(s):
H. N. Kadarmideen
R. Rekaya
D. Gianola

Abstract: A Bayesian threshold-liability model with Markov chain Monte Carlo techniques was used to infer genetic parameters for clinical mastitis records collected on Holstein-Friesian cows by one of the United Kingdom’s national recording schemes. Four data sets were created to investigate the effect of data sampling methods on genetic parameter estimates for first and multi-lactation cows, separately. The data sets were: (1) cows with complete first lactations only (8671 cows); (2) all cows, with first lactations whether complete or incomplete (10 967 cows); (3) cows with complete multi-lactations (32 948 records); and (4) all cows with multiple lactations whether complete or incomplete (44 268 records). A Gaussian mixed linear model with sire effects was adopted for liability. Explanatory variables included in the model varied for each data set. Analyses were conducted using Gibbs sampling and estimates were on the liability scale. Posterior means of heritability for clinical mastitis were higher for first lactations (0·11 and 0·10 for data sets 1 and 2, respectively) than for multiple lactations (0·09 and 0·07, for data sets 3 and 4, respectively). For multiple lactations, estimates of permanent environmental variance were higher for complete than incomplete lactations. Repeatability was 0·21 and 0·17 for data sets 3 and 4, respectively. This suggests the existence of effects, other than additive genetic effects, on susceptibility to mastitis that are common to all lactations. In first or multi-lactation data sets, heritability was proportionately 0·10 to 0·19 lower for data sets with all records (in which case the models had days in milk as a covariate) than for data with only complete lactation records (models without days in milk as a covariate). This suggests an effect of data sampling on genetic parameter estimates. The regression of liability on days in milk differed from zero, indicating that the probability of mastitis is higher for longer lactations, as expected. Results also indicated that a regression on days in milk should be included in a model for genetic evaluation of sires for mastitis resistance based on records in progress.
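
On the liability scale of a probit threshold sire model, the residual variance is fixed at 1 and the sire variance represents one quarter of the additive genetic variance, so heritability can be computed directly from the posterior draws of the variance components. The sketch below uses hypothetical draws in place of actual Gibbs-sampler output; the numbers are not from the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
# hypothetical posterior draws of the variance components from a Gibbs sampler
sigma2_sire = rng.gamma(4.0, 0.008, 4000)   # between-sire variance on the liability scale
sigma2_pe = rng.gamma(4.0, 0.02, 4000)      # permanent-environment variance (multi-lactation model)
sigma2_e = np.ones(4000)                    # residual variance fixed at 1 under a probit link

# in a sire model the sire variance is one quarter of the additive genetic variance
h2 = 4.0 * sigma2_sire / (sigma2_sire + sigma2_pe + sigma2_e)
print("posterior mean heritability:", round(h2.mean(), 3))
print("95% credible interval:", np.round(np.percentile(h2, [2.5, 97.5]), 3))
```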


2016
Vol 73 (8)
pp. 2104-2111
Author(s):
Patrick J. Sullivan
Lars G. Rudstam

Abstract: A Bayesian hierarchical model was applied to acoustic backscattering data collected on Mysis relicta (opossum shrimp) populations in Lake Ontario in 2005 to estimate the combined uncertainty in mean density estimates, as well as the individual contributions to that uncertainty from the various information sources involved in the calculation, including calibration, target strength determination, threshold specification, and survey sampling design. Traditional estimation approaches often take into account only the variability associated with the survey design, while assuming that all other intermediate parameter estimates used in the calculations are fixed and known. Unfortunately, unaccounted-for variation in the steps leading up to the global density estimate may contribute substantially to the uncertainty of density estimates. While other studies have used sensitivity analyses to demonstrate the degree to which uncertainty in the various input parameters can influence estimates, including the uncertainty directly, as demonstrated here using a Bayesian hierarchical approach, allows a more transparent representation of the true uncertainty and of the mechanisms needed for its reduction. A Bayesian analysis of the mysid data examined here indicates that increasing the sample size of the biological collections used in the target strength regression is a more direct and practical way of reducing the overall variation in mean density estimates than a comparable increase in the number of transects surveyed. Doubling the number of target strength net tow samples resulted in a 23% reduction in variance, relative to an 11% reduction from doubling the number of survey transects. This is an important difference, as doubling the number of survey transects would add five days to the survey, whereas doubling the number of net tows would add only one day. Although these results are specific to this particular data set, the method described is general.
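
A toy Monte Carlo illustration of this trade-off is sketched below, using the standard echo-integration identity (density proportional to 10^((Sv - TS)/10)) with made-up standard errors chosen so that the target-strength regression dominates, as it did for the mysid survey; the numbers are not the paper's.

```python
import numpy as np

rng = np.random.default_rng(6)
M = 50_000

def density_var(n_tows, n_transects):
    """Monte Carlo variance of mean density when both the target-strength (TS)
    regression and the transect mean of volume backscatter (Sv) are uncertain."""
    ts = rng.normal(-77.0, 2.0 / np.sqrt(n_tows), M)        # dB, mean target strength
    sv = rng.normal(-72.0, 1.0 / np.sqrt(n_transects), M)   # dB, mean volume backscatter
    density = 10.0 ** ((sv - ts) / 10.0)                    # individuals per m^3
    return density.var()

base = density_var(n_tows=20, n_transects=20)
print("double net tows:  %.0f%% variance reduction" % (100 * (1 - density_var(40, 20) / base)))
print("double transects: %.0f%% variance reduction" % (100 * (1 - density_var(20, 40) / base)))
```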


2020
Vol 36 (1)
pp. 89-115
Author(s):
Harvey Goldstein
Natalie Shlomo

Abstract: The requirement to anonymise data sets that are to be released for secondary analysis should be balanced by the need to allow their analysis to provide efficient and consistent parameter estimates. The proposal in this article is to integrate the process of anonymisation and data analysis. The first stage uses the addition of random noise with known distributional properties to some or all variables in a released (already pseudonymised) data set, in which the values of some identifying and sensitive variables for data subjects of interest are also available to an external ‘attacker’ who wishes to identify those data subjects in order to interrogate their records in the data set. The second stage of the analysis consists of specifying the model of interest so that parameter estimation accounts for the added noise. Where the characteristics of the noise are made available to the analyst by the data provider, we propose a new method that allows a valid analysis. This is formally a measurement error model and we describe a Bayesian MCMC algorithm that recovers consistent estimates of the true model parameters. A new method for handling categorical data is presented. The article shows how an appropriate noise distribution can be determined.
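
The measurement-error logic can be illustrated with a simple method-of-moments correction (the paper itself proposes a Bayesian MCMC algorithm, which this sketch does not reproduce): regression on the noise-perturbed covariate is attenuated towards zero, and knowing the added noise variance allows the slope to be recovered. All values below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
x = rng.normal(0.0, 1.0, n)                    # true (pre-anonymisation) covariate
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)    # outcome, released without noise
tau2 = 0.5                                     # noise variance disclosed by the data provider
x_released = x + rng.normal(0.0, np.sqrt(tau2), n)   # anonymised covariate

# naive slope is attenuated by the reliability factor var(x) / (var(x) + tau2)
beta_naive = np.cov(x_released, y)[0, 1] / np.var(x_released, ddof=1)
reliability = (np.var(x_released, ddof=1) - tau2) / np.var(x_released, ddof=1)
beta_corrected = beta_naive / reliability

print("naive slope:    ", round(beta_naive, 3))      # roughly 2 * 1 / (1 + 0.5)
print("corrected slope:", round(beta_corrected, 3))  # roughly 2
```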


Geophysics
2006
Vol 71 (6)
pp. O77-O88
Author(s):
Zhangshuan Hou
Yoram Rubin
G. Michael Hoversten
Don Vasco
Jinsong Chen

A stochastic joint-inversion approach for estimating reservoir-fluid saturations and porosity is proposed. The approach couples seismic amplitude variation with angle (AVA) and marine controlled-source electromagnetic (CSEM) forward models into a Bayesian framework, which allows for integration of complementary information. To obtain minimally subjective prior probabilities required for the Bayesian approach, the principle of minimum relative entropy (MRE) is employed. Instead of single-value estimates provided by deterministic methods, the approach gives a probability distribution for any unknown parameter of interest, such as reservoir-fluid saturations or porosity at various locations. The distribution means, modes, and confidence intervals can be calculated, providing a more complete understanding of the uncertainty in the parameter estimates. The approach is demonstrated using synthetic and field data sets. Results show that joint inversion using seismic and EM data gives better estimates of reservoir parameters than estimates from either geophysical data set used in isolation. Moreover, a more informative prior leads to much narrower predictive intervals of the target parameters, with mean values of the posterior distributions closer to logged values.
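
The benefit of combining likelihoods can be seen in a toy grid-based version of the idea, sketched below with invented linear stand-ins for the AVA and CSEM forward models and a flat prior; it is only meant to show that the joint posterior over porosity and water saturation is narrower than the posterior from one data type alone.

```python
import numpy as np

# toy linear forward models standing in for the AVA and CSEM responses
def seismic_forward(porosity, saturation):
    return 2.0 - 1.5 * porosity + 0.3 * saturation      # an impedance-like quantity

def em_forward(porosity, saturation):
    return 0.5 + 0.2 * porosity - 1.0 * saturation      # a log-resistivity-like quantity

# grid over the two reservoir parameters
phi, sw = np.meshgrid(np.linspace(0.05, 0.40, 200), np.linspace(0.0, 1.0, 200))

# synthetic observations generated from "true" values (phi = 0.25, sw = 0.3) plus noise
rng = np.random.default_rng(8)
d_seis = seismic_forward(0.25, 0.3) + rng.normal(0, 0.05)
d_em = em_forward(0.25, 0.3) + rng.normal(0, 0.05)

def log_like(pred, obs, sigma):
    return -0.5 * ((pred - obs) / sigma) ** 2

prior = np.ones_like(phi)                               # flat prior for the toy example
post_seis = prior * np.exp(log_like(seismic_forward(phi, sw), d_seis, 0.05))
post_joint = post_seis * np.exp(log_like(em_forward(phi, sw), d_em, 0.05))

for name, post in [("seismic only", post_seis), ("joint", post_joint)]:
    p = post / post.sum()
    sw_sd = np.sqrt(np.sum(p * sw ** 2) - np.sum(p * sw) ** 2)
    print(f"{name:12s} posterior std of water saturation: {sw_sd:.3f}")
```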


2019
Author(s):
Martin Papenberg
Gunnar W. Klau

Numerous applications in psychological research require that a pool of elements is partitioned into multiple parts. While many applications seek groups that are well-separated, i.e., dissimilar from each other, others require the different groups to be as similar as possible. Examples include the assignment of students to parallel courses, assembling stimulus sets in experimental psychology, splitting achievement tests into parts of equal difficulty, and dividing a data set for cross-validation. We present anticlust, an easy-to-use and free software package for solving these problems fast and in an automated manner. The package anticlust is an open source extension to the R programming language and implements the methodology of anticlustering. Anticlustering divides elements into similar parts, ensuring similarity between groups by enforcing heterogeneity within groups. Thus, anticlustering is the direct reversal of cluster analysis, which aims to maximize homogeneity within groups and dissimilarity between groups. Our package anticlust implements two anticlustering criteria, reversing the clustering methods k-means and cluster editing, respectively. In a simulation study, we show that anticlustering returns excellent results and outperforms alternative approaches like random assignment and matching. In three example applications, we illustrate how to apply anticlust to real data sets. We demonstrate how to assign experimental stimuli to equivalent sets based on norming data, how to divide a large data set for cross-validation, and how to split a test into parts of equal item difficulty and discrimination.
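
The core idea can be illustrated with a small exchange heuristic in Python (a toy sketch of k-means-style anticlustering, not the anticlust package's interface): start from a random assignment and accept swaps that bring the group centroids closer together, i.e., make the groups more similar to each other.

```python
import numpy as np

def anticluster(X, n_groups, n_iter=2000, seed=0):
    """Toy exchange heuristic: accept swaps between groups whenever they reduce
    the spread of the group centroids, making the groups more similar."""
    rng = np.random.default_rng(seed)
    reps = int(np.ceil(len(X) / n_groups))
    labels = np.repeat(np.arange(n_groups), reps)[: len(X)]
    rng.shuffle(labels)

    def centroid_spread(lab):
        centroids = np.array([X[lab == g].mean(axis=0) for g in range(n_groups)])
        return np.sum((centroids - centroids.mean(axis=0)) ** 2)

    best = centroid_spread(labels)
    for _ in range(n_iter):
        i, j = rng.integers(0, len(X), size=2)
        if labels[i] == labels[j]:
            continue
        labels[i], labels[j] = labels[j], labels[i]
        new = centroid_spread(labels)
        if new < best:
            best = new                                   # keep the improving swap
        else:
            labels[i], labels[j] = labels[j], labels[i]  # undo
    return labels

# toy usage: split 30 stimuli described by two norming variables into 3 similar sets
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
groups = anticluster(X, 3)
print(np.array([X[groups == g].mean(axis=0) for g in range(3)]).round(2))
```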


2019
Author(s):
Peishan Huang
Simon K. S. Chu
Henrique N. Frizzo
Morgan P. Connolly
Ryan W. Caster
...  

Abstract: Engineering proteins to enhance thermal stability is a widely utilized approach for creating industrially relevant biocatalysts. Computational tools that guide these engineering efforts remain an active area of research, with new data sets and algorithms continuing to be developed. To aid in these efforts, we report an expansion of our previously published data set of mutants of a β-glucosidase to include measures of both TM and ΔΔG, complementing the previously reported measures of T50 and kinetic constants (kcat and KM). For a set of 51 mutants, we found that T50 and TM are moderately correlated, with a Pearson correlation coefficient (PCC) of 0.58, indicating that the two methods capture different physical features. The performance of five computational stability-prediction tools is also evaluated on the 51-mutant data set; none are found to be strong predictors of the observed changes in T50, TM, or ΔΔG. Furthermore, the ability of the five algorithms to predict the production of isolatable soluble protein is examined, which reveals that Rosetta ΔΔG, ELASPIC, and DeepDDG are capable of predicting whether a mutant can be produced and isolated as a soluble protein. These results further highlight the need for new algorithms for predicting modest, yet important, changes in thermal stability, as well as a new utility for current algorithms in prescreening designs for the production of soluble mutants.
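
As a small illustration of the reported correlation analysis, the snippet below computes a Pearson correlation coefficient with scipy.stats.pearsonr on synthetic T50 and TM values; the numbers are invented and only stand in for the 51-mutant panel.

```python
import numpy as np
from scipy.stats import pearsonr

# hypothetical T50 and TM values for a panel of 51 mutants (synthetic, for illustration)
rng = np.random.default_rng(9)
t50 = rng.normal(45.0, 3.0, 51)                            # half-inactivation temperature, deg C
tm = 0.6 * (t50 - 45.0) + 48.0 + rng.normal(0, 2.5, 51)    # melting temperature, deg C

pcc, p_value = pearsonr(t50, tm)
print(f"Pearson correlation between T50 and TM: r = {pcc:.2f} (p = {p_value:.3g})")
```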

