Missing data imputation in multivariate t distribution with unknown degrees of freedom using expectation maximization algorithm and its stochastic variants

2020, Vol 15 (3), pp. 263-272
Author(s): Paul Kimani Kinyanjui, Cox Lwaka Tamba, Luke Akong’o Orawo, Justin Obwoge Okenye

Missing data are a common problem in research. They may arise from data omission, non-response, death of respondents, or recording errors, among other causes. It is therefore important to find an appropriate imputation technique to fill in the missing values. In this study, the Expectation Maximization (EM) algorithm and two of its stochastic variants, stochastic EM (SEM) and Monte Carlo EM (MCEM), are employed for missing data imputation and parameter estimation in the multivariate t distribution with unknown degrees of freedom. The imputation efficiencies of the three methods are then compared using the mean square error (MSE) criterion. SEM yields the lowest MSE, making it the most efficient method for data imputation when the data follow the multivariate t distribution. The algorithm’s stochastic nature enables it to avoid saddle points and reach the global maximum, which ultimately increases its efficiency. The EM and MCEM techniques yield very similar results: with large sample draws in its E-step, MCEM gives more or less the same results as the deterministic EM. In parameter estimation, the estimates from EM and MCEM are relatively close to the maximum likelihood (ML) estimates of the simulated data. This is not the case for SEM, owing to the random nature of the algorithm.
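As a rough illustration of the deterministic EM variant described above, the sketch below imputes NaN-coded missing values under a multivariate t model. It is not the authors' code: the function name `em_impute_mvt` is hypothetical, the degrees of freedom `nu` are held fixed for brevity (the study treats them as unknown and estimates them), and every row is assumed to have at least one observed value.

```python
import numpy as np

def em_impute_mvt(X, nu=5.0, n_iter=50):
    """EM imputation for multivariate t data with NaN-coded missing values.

    Simplifying assumption: the degrees of freedom `nu` are fixed, not estimated.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    # Start from column means and a diagonal scale matrix.
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    X_hat = np.where(miss, mu, X)

    for _ in range(n_iter):
        w = np.empty(n)            # expected latent weights, one per row
        C_sum = np.zeros((p, p))   # summed conditional covariances of missing parts
        for i in range(n):
            m, o = miss[i], ~miss[i]
            S_oo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
            r = X[i, o] - mu[o]
            delta = r @ S_oo_inv @ r               # Mahalanobis distance on observed part
            w[i] = (nu + o.sum()) / (nu + delta)   # E-step: downweight outlying rows
            if m.any():
                B = sigma[np.ix_(m, o)] @ S_oo_inv
                X_hat[i, m] = mu[m] + B @ r        # E-step: conditional mean imputation
                C_sum[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
        # M-step: weighted mean and scale matrix.
        mu = (w[:, None] * X_hat).sum(axis=0) / w.sum()
        D = X_hat - mu
        sigma = ((w[:, None] * D).T @ D + C_sum) / n
    return X_hat, mu, sigma

# Synthetic demo: simulate t-distributed data, delete one entry in 60 rows, impute.
rng = np.random.default_rng(0)
n, p, nu = 300, 3, 5.0
true_sigma = np.array([[1.0, 0.8, 0.6], [0.8, 1.0, 0.7], [0.6, 0.7, 1.0]])
z = rng.multivariate_normal(np.zeros(p), true_sigma, size=n)
u = rng.chisquare(nu, size=n) / nu
X_full = z / np.sqrt(u)[:, None]
X_miss = X_full.copy()
rows = rng.choice(n, size=60, replace=False)
cols = rng.integers(0, p, size=60)
X_miss[rows, cols] = np.nan

X_hat, mu_hat, sigma_hat = em_impute_mvt(X_miss, nu=nu)
mask = np.isnan(X_miss)
mse = np.mean((X_hat[mask] - X_full[mask]) ** 2)  # MSE criterion on imputed cells
print(f"imputation MSE: {mse:.3f}")
```

SEM and MCEM would replace the conditional-mean step with one or many random draws from the conditional distribution of the missing values; that part is not shown here.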

2018
Author(s): Seyed Mahmood Taghavi-Shahri, Alessandro Fassò, Behzad Mahaki, Heresh Amini

Land use regression (LUR) has been widely applied in epidemiologic research for exposure assessment. In this study, for the first time, we aimed to develop a spatiotemporal LUR model using Distributed Space Time Expectation Maximization (D-STEM). This spatiotemporal LUR model was examined with daily particulate matter ≤ 2.5 μm (PM2.5) within the megacity of Tehran, the capital of Iran. Moreover, D-STEM missing data imputation was compared with mean substitution at each monitoring station, which is equivalent to ignoring the missing data and is common in LUR studies that employ regulatory monitoring stations’ data. Missing data made up 28% of the total number of observations in Tehran in 2015. The annual mean PM2.5 concentration was 33 μg/m³. The spatiotemporal R-squared of the final D-STEM daily LUR model was 78%, and the leave-one-out cross-validation (LOOCV) R-squared was 66%. The spatial R-squared and LOOCV R-squared were 89% and 72%, respectively. The temporal R-squared and LOOCV R-squared were 99.5% and 99.3%, respectively. The mean absolute error of missing data imputation decreased by 26% when the final D-STEM LUR model was used instead of mean substitution. This study demonstrates the competence of the D-STEM software in spatiotemporal missing data imputation, estimation of temporal trend, and mapping of small-scale (20 × 20 meters) within-city spatial variations in the LUR context. The estimated PM2.5 concentration maps could be used in future studies on short- and/or long-term health effects. Overall, we suggest using D-STEM capabilities in the growing number of LUR studies that employ data from regulatory network monitoring stations.

Highlights
- First Land Use Regression using D-STEM, a recently introduced statistical software
- Assess D-STEM in spatiotemporal modeling, mapping, and missing data imputation
- Estimate high resolution (20×20 m) daily maps for exposure assessment in a megacity
- Provide both short- and long-term exposure assessment for epidemiological studies
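The mean-substitution baseline mentioned in the abstract above can be sketched as follows. This is only an illustration on made-up numbers, not the study's data or the D-STEM model: the function names and the tiny array are hypothetical, and the error is measured on values that were deliberately held out.

```python
import numpy as np

def station_mean_substitution(pm):
    """Fill NaNs in a (stations x days) array with each station's own mean."""
    pm = np.asarray(pm, dtype=float)
    station_means = np.nanmean(pm, axis=1, keepdims=True)
    return np.where(np.isnan(pm), station_means, pm)

def mean_absolute_error(truth, estimate, mask):
    """Mean absolute error over the cells selected by `mask`."""
    return float(np.mean(np.abs(truth[mask] - estimate[mask])))

# Made-up daily concentrations for two stations over four days.
truth = np.array([[30.0, 34.0, 38.0, 34.0],
                  [20.0, 22.0, 24.0, 22.0]])
# Cells held out to mimic missing observations.
held_out = np.array([[False, False, True, False],
                     [True, False, False, False]])
observed = np.where(held_out, np.nan, truth)

filled = station_mean_substitution(observed)
mae = mean_absolute_error(truth, filled, held_out)
print(mae)
```

A model-based imputation would be scored the same way, by passing its estimates to `mean_absolute_error` with the same held-out mask.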


Author(s): A. M. Kshirsagar

Suppose the components $x_1, x_2, \ldots, x_k$ of a vector $x$ have a non-singular multivariate normal distribution with a null vector of means and variance-covariance matrix $\Sigma = \sigma^2 R$, where the matrix $R = [\rho_{ij}]$ (with $\rho_{ii} = 1$) is known in certain cases but $\sigma^2$ is unknown. If $s^2$ is an estimate of $\sigma^2$ based on $f$ degrees of freedom and is distributed independently of $x$, the distribution of the vector $t = x/s$ is known as the multivariate t-distribution. This distribution was first obtained by Dunnett and Sobel (6) and independently by Cornish (3). Dunnett, Sobel and Bechhofer (2) have discussed some practical applications of this distribution. Cornish (3) obtained this distribution while considering the pre-treatment to be given to certain types of replicated experiments. The distribution possesses some useful properties that make it suitable as a basis for exact tests of significance in various problems, and Dunnett and Sobel (6), by providing tables of the probability integral, have taken the first step towards its use in practice. In a later paper (4), Cornish considered the sampling distribution of statistics derived from the multivariate t-distribution and used it to obtain the well-known ((7), (8)) distribution of the sample regression coefficient of one variate with respect to another, when both have a bivariate normal distribution.
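The construction $t = x/s$ described above can be simulated directly. The sketch below is illustrative only: the function name `sample_multivariate_t` and its parameters are hypothetical, and $s^2$ is drawn as $\sigma^2 \chi^2_f / f$, which is the usual form of a variance estimate on $f$ degrees of freedom that is independent of $x$.

```python
import numpy as np

def sample_multivariate_t(R, f, size, sigma=1.0, rng=None):
    """Draw `size` vectors t = x / s with x ~ N(0, sigma^2 R) and
    s^2 an independent estimate of sigma^2 on f degrees of freedom."""
    rng = np.random.default_rng(rng)
    R = np.asarray(R, dtype=float)
    k = R.shape[0]
    x = rng.multivariate_normal(np.zeros(k), sigma**2 * R, size=size)
    s = sigma * np.sqrt(rng.chisquare(f, size=size) / f)
    return x / s[:, None]  # the unknown sigma cancels in the ratio

# Demo: bivariate case with correlation 0.5 and f = 10.
R = np.array([[1.0, 0.5], [0.5, 1.0]])
t = sample_multivariate_t(R, f=10, size=100_000, rng=0)
print(t[:, 0].var(), np.corrcoef(t.T)[0, 1])
```

Each component is then marginally Student t with $f$ degrees of freedom, so for $f > 2$ its variance is $f/(f-2)$, and the correlation between components stays at $\rho_{ij}$.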

