Robust Estimation of the Mean and Covariance Matrix from Data with Missing Values

Author(s):  
Roderick J. A. Little
2012 ◽  
Vol 01 (04) ◽  
pp. 1250013 ◽  
Author(s):  
Ioana Dumitriu ◽  
Elliot Paquette

We study the global fluctuations for linear statistics of the form $\sum_{i=1}^{n} f(\lambda_i)$ as $n \to \infty$, for $C^1$ functions $f$, and $\lambda_1, \ldots, \lambda_n$ being the eigenvalues of a (general) β-Jacobi ensemble. The fluctuation from the mean, $\sum_{i=1}^{n} f(\lambda_i) - \mathbb{E}\,\sum_{i=1}^{n} f(\lambda_i)$, turns out to be given asymptotically by a Gaussian process. We compute the covariance matrix for the process and show that it is diagonalized by a shifted Chebyshev polynomial basis; in addition, we analyze the deviation from the predicted mean for polynomial test functions, and we obtain a law of large numbers.
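
Purely as an illustration of the objects in this abstract (not code from the paper), a minimal Monte-Carlo sketch for the classical case β = 1 follows, where the Jacobi ensemble can be realized from two Wishart matrices. The dimensions n, m1, m2, the test function f, and the helper name jacobi_eigenvalues are all invented for the example.

```python
# Sketch: centered linear statistic of a beta = 1 (real MANOVA) Jacobi ensemble.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

def jacobi_eigenvalues(n, m1, m2, rng):
    """Eigenvalues of A (A+B)^{-1} with A, B Wishart: beta = 1 Jacobi ensemble."""
    X = rng.standard_normal((m1, n))
    Y = rng.standard_normal((m2, n))
    A, B = X.T @ X, Y.T @ Y
    # Generalized symmetric-definite eigenproblem A v = lambda (A+B) v;
    # its eigenvalues lie in [0, 1].
    return eigh(A, A + B, eigvals_only=True)

n, m1, m2, reps = 50, 120, 150, 400
f = lambda lam: lam ** 2                 # a smooth (C^1) test function
stats = np.array([f(jacobi_eigenvalues(n, m1, m2, rng)).sum() for _ in range(reps)])
print("variance of centered linear statistic:", (stats - stats.mean()).var())
```

The general-β ensembles of the theorem would instead require the Killip–Nenciu tridiagonal model; the sketch only makes the abstract's point visible, namely that the centered linear statistic has O(1) fluctuations with no $1/\sqrt{n}$ normalization.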


2014 ◽  
Vol 3 (1) ◽  
Author(s):  
Mark J. van der Laan ◽  
Alexander R. Luedtke ◽  
Iván Díaz

Abstract Young, Hernán, and Robins consider the mean outcome under a dynamic intervention that may rely on the natural value of treatment. They first identify this value with a statistical target parameter, and then show that this statistical target parameter can also be identified with a causal parameter which gives the mean outcome under a stochastic intervention. The authors then describe estimation strategies for these quantities. Here we augment the authors’ insightful discussion by sharing our experiences in situations where two causal questions lead to the same statistical estimand, or the newer problem that arises in the study of data-adaptive parameters, where two statistical estimands can lead to the same estimation problem. Given a statistical estimation problem, we encourage others to always use a robust estimation framework where the data generating distribution truly belongs to the statistical model. We close with a discussion of a framework which has these properties.


Author(s):  
Byron C. Jaeger ◽  
Ryan Cantor ◽  
Venkata Sthanam ◽  
Rongbing Xie ◽  
James K. Kirklin ◽  
...  

Background: Risk prediction models play an important role in clinical decision making. When developing risk prediction models, practitioners often impute missing values to the mean. We evaluated the impact of applying other strategies to impute missing values on the prognostic accuracy of downstream risk prediction models, that is, models fitted to the imputed data. A secondary objective was to compare the accuracy of imputation methods based on artificially induced missing values. To complete these objectives, we used data from the Interagency Registry for Mechanically Assisted Circulatory Support. Methods: We applied 12 imputation strategies in combination with 2 different modeling strategies for mortality and transplant risk prediction following surgery to receive mechanical circulatory support. Model performance was evaluated using Monte-Carlo cross-validation and measured based on outcomes 6 months following surgery using the scaled Brier score, concordance index, and calibration error. We used Bayesian hierarchical models to compare model performance. Results: Multiple imputation with random forests emerged as a robust strategy to impute missing values, increasing model concordance by 0.0030 (25th–75th percentile: 0.0008–0.0052) compared with imputation to the mean for mortality risk prediction using a downstream proportional hazards model. The posterior probability that single and multiple imputation using random forests would improve concordance versus mean imputation was 0.464 and >0.999, respectively. Conclusions: Selecting an effective strategy to impute missing values, such as random forests, and applying multiple imputation can improve the prognostic accuracy of downstream risk prediction models.
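
For readers who want to reproduce the general idea, here is a minimal sketch of multiple imputation with random forests using scikit-learn's IterativeImputer. This is an assumed tooling choice, not the authors' INTERMACS pipeline; the toy data, the number of imputations M, and all hyperparameters are illustrative.

```python
# Sketch: multiple imputation with random forests (assumed tooling, not the
# authors' pipeline). Each pass uses a different seed, giving M completed datasets.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
X[rng.random(X.shape) < 0.15] = np.nan   # knock out ~15% of values at random

M = 5                                     # number of imputations
completed = []
for m in range(M):
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=m),
        max_iter=10,
        random_state=m,
    )
    completed.append(imputer.fit_transform(X))

# Downstream step (omitted): fit one risk model, e.g. a proportional hazards
# model, per completed dataset and pool the M sets of predictions.
```

Setting M = 1 gives the single-imputation variant, which the abstract found no better than mean imputation (posterior probability 0.464); the concordance gains came from the multiple-imputation variant.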


2008 ◽  
Vol 21 (24) ◽  
pp. 6710-6723 ◽  
Author(s):  
Jason E. Smerdon ◽  
Alexey Kaplan ◽  
Diana Chang

Abstract The regularized expectation maximization (RegEM) method has been used in recent studies to derive climate field reconstructions of Northern Hemisphere temperatures during the last millennium. Original pseudoproxy experiments that tested RegEM [with ridge regression regularization (RegEM-Ridge)] standardized the input data in a way that improved the performance of the reconstruction method, but included data from the reconstruction interval for estimates of the mean and standard deviation of the climate field—information that is not available in real-world reconstruction problems. When standardizations are confined to the calibration interval only, pseudoproxy reconstructions performed with RegEM-Ridge suffer from warm biases and variance losses. Only cursory explanations of this so-called standardization sensitivity of RegEM-Ridge have been published, but they have suggested that the selection of the regularization (ridge) parameter by means of minimizing the generalized cross validation (GCV) function is the source of the effect. The origin of the standardization sensitivity is more thoroughly investigated herein and is shown not to be associated with the selection of the ridge parameter; sets of derived reconstructions reveal that GCV-selected ridge parameters are minimally different for reconstructions standardized either over both the reconstruction and calibration interval or over the calibration interval only. While GCV may select ridge parameters that are different from those that precisely minimize the error in pseudoproxy reconstructions, RegEM reconstructions performed with truly optimized ridge parameters are not significantly different from those that use GCV-selected ridge parameters. The true source of the standardization sensitivity is attributable to the inclusion or exclusion of additional information provided by the reconstruction interval, namely, the mean and standard deviation fields computed for the complete modeled dataset. These fields are significantly different from those for the calibration period alone because of the violation of a standard EM assumption that missing values are missing at random in typical paleoreconstruction problems; climate data are predominantly missing in the preinstrumental period when the mean climate was significantly colder than the mean of the instrumental period. The origin of the standardization sensitivity therefore is not associated specifically with RegEM-Ridge, and more recent attempts to regularize the EM algorithm using truncated total least squares could theoretically also be susceptible to the problems affecting RegEM-Ridge. Nevertheless, the principal failure of RegEM-Ridge arises because of a poor initial estimate of the mean field, and therefore leaves open the possibility that alternative methods may perform better.
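
The mechanism blamed here, calibration-only statistics misrepresenting a colder and mostly missing early period, can be demonstrated in a few lines. The following is a toy sketch, not the RegEM code; the trend, noise level, and 1850 cutoff are invented for illustration.

```python
# Sketch: when missingness is concentrated in a colder early period, the
# calibration-only mean differs systematically from the full-period mean.
import numpy as np

rng = np.random.default_rng(2)
years = np.arange(1000, 2000)
# Synthetic "climate field" (one grid point): warming trend plus noise.
climate = 0.001 * (years - 1000) + rng.normal(0.0, 0.2, years.size)

# Values are missing predominantly before the instrumental era (not at random).
observed = climate.copy()
observed[years < 1850] = np.nan

full_mean = climate.mean()                # whole reconstruction interval
calib_mean = np.nanmean(observed)         # calibration (instrumental) period only
print(f"full-period mean {full_mean:+.3f} vs calibration-only mean {calib_mean:+.3f}")
# The calibration-only mean is systematically warmer, so standardizing with it
# gives a biased initial estimate of the colder, mostly missing early period.
```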


2012 ◽  
Vol 62 (1) ◽  
Author(s):  
Lubomír Kubáček

Abstract In certain settings the mean response is modeled by a linear model with a large number of parameters. Sometimes it is desirable to reduce the number of parameters prior to conducting the experiment and prior to the actual statistical analysis. Essentially, this means formulating a simpler approximation to the original “ideal” model. The goal is to find conditions on the model matrix and the covariance matrix under which the reduction does not substantially affect the data fit. Here we develop such conditions for the regular linear model, both without and with linear constraints. We emphasize that these conditions are independent of the observed data.
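
To make the flavor of such conditions concrete, here is a schematic generalized-least-squares statement in the Frisch–Waugh spirit; this is an assumption about the type of result, not the paper's exact theorem or notation.

```latex
% Schematic setup (assumed notation): full model with a nuisance block X_2
% versus the reduced model that drops it.
\[
  Y = X_1\beta_1 + X_2\beta_2 + \varepsilon,
  \qquad \operatorname{cov}(\varepsilon) = \Sigma,
  \qquad \text{reduced: } Y = X_1\beta_1 + \varepsilon .
\]
% One classical sufficient condition for the reduction to be harmless is
% Sigma-orthogonality of the two design blocks,
\[
  X_1^{\top}\Sigma^{-1}X_2 = 0 ,
\]
% under which the BLUE of beta_1 is the same in both models. Note that the
% condition involves only X_1, X_2, and Sigma -- never the observed data.
```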

