Identification of a Suitable Missing Data Handling Method for Lifelogs-based Wellness Index Estimation with Panel Data (Preprint)

2020 ◽  
Author(s):  
Ki-Hun Kim ◽  
Kwang-Jae Kim

BACKGROUND A lifelogs-based wellness index (LWI) is a function to calculate wellness scores from health behavior lifelogs such as daily walking steps and sleep time collected through smartphones. A wellness score intuitively shows a user of a smart wellness service the overall condition of health behaviors. LWI development includes LWI estimation (i.e., estimating coefficients in LWI with data). A panel data set of health behavior lifelogs allows LWI estimation to control for variables unobserved in LWI and hence to be less biased. Such panel data sets are likely to have missing data due to various random events of daily life (e.g., smart devices stop collecting data when their batteries run out). Missing data can introduce biases into LWI coefficients. Thus, the choice of an appropriate missing data handling method is important to reduce biases in LWI estimation with a panel data set of health behavior lifelogs. However, relevant studies are scarce in the literature. OBJECTIVE This research aims to identify a suitable missing data handling method for LWI estimation with panel data. Six representative missing data handling methods (i.e., listwise deletion (LD), mean imputation, Expectation-Maximization (EM) based multiple imputation, Predictive-Mean Matching (PMM) based multiple imputation, k-Nearest Neighbors (k-NN) based imputation, and Low-rank Approximation (LA) based imputation) are comparatively evaluated through the simulation of an existing LWI development case. METHODS A panel data set of health behavior lifelogs collected in the existing LWI development case was transformed into a reference data set. 200 simulated data sets were generated by randomly introducing missing data to the reference data set at each missingness proportion from 1% to 80%. The six methods were applied to transform the simulated data sets into complete data sets by handling missing data. 
Coefficients in a linear LWI were estimated with each complete data set, following the case. Coefficient biases of the six methods were calculated by comparing the estimated coefficient values with reference values estimated with the reference data set. RESULTS Based on the coefficient biases, the superior methods changed according to the missingness proportion: LA based imputation, PMM based multiple imputation, and EM based multiple imputation for 1% to 30% missingness proportions; LA based imputation and PMM based multiple imputation for 31% to 60%; and only LA based imputation for over 60%. CONCLUSIONS LA based imputation was superior among the six methods regardless of the missingness proportion. This superiority is generalizable to other panel data sets of health behavior lifelogs because existing works have verified their low-rank nature, for which LA based imputation works well. This result will guide missing data handling to reduce coefficient biases in new development cases of linear LWIs with panel data.

10.2196/20597 ◽  
2020 ◽  
Vol 8 (12) ◽  
pp. e20597
Author(s):  
Ki-Hun Kim ◽  
Kwang-Jae Kim

Background A lifelogs-based wellness index (LWI) is a function for calculating wellness scores based on health behavior lifelogs (eg, daily walking steps and sleep times collected via a smartwatch). A wellness score intuitively shows the users of smart wellness services the overall condition of their health behaviors. LWI development includes estimation (ie, estimating coefficients in LWI with data). A panel data set comprising health behavior lifelogs allows LWI estimation to control for unobserved variables, thereby resulting in less bias. However, these data sets typically have missing data due to events that occur in daily life (eg, smart devices stop collecting data when batteries are depleted), which can introduce biases into LWI coefficients. Thus, the appropriate choice of method to handle missing data is important for reducing biases in LWI estimations with panel data. However, there is a lack of research in this area. Objective This study aims to identify a suitable missing-data handling method for LWI estimation with panel data. Methods Listwise deletion, mean imputation, expectation maximization–based multiple imputation, predictive-mean matching–based multiple imputation, k-nearest neighbors–based imputation, and low-rank approximation–based imputation were comparatively evaluated by simulating an existing case of LWI development. A panel data set comprising health behavior lifelogs of 41 college students over 4 weeks was transformed into a reference data set without any missing data. Then, 200 simulated data sets were generated by randomly introducing missing data at proportions from 1% to 80%. The missing-data handling methods were each applied to transform the simulated data sets into complete data sets, and coefficients in a linear LWI were estimated for each complete data set. For each proportion for each method, a bias measure was calculated by comparing the estimated coefficient values with values estimated from the reference data set. 
Results Methods performed differently depending on the proportion of missing data. For 1% to 30% proportions, low-rank approximation–based imputation, predictive-mean matching–based multiple imputation, and expectation maximization–based multiple imputation were superior. For 31% to 60% proportions, low-rank approximation–based imputation and predictive-mean matching–based multiple imputation performed best. For over 60% proportions, only low-rank approximation–based imputation performed acceptably. Conclusions Low-rank approximation–based imputation was the best of the 6 data-handling methods regardless of the proportion of missing data. This superiority is generalizable to other panel data sets comprising health behavior lifelogs given their verified low-rank nature, for which low-rank approximation–based imputation is known to perform effectively. This result will guide missing-data handling in reducing coefficient biases in new development cases of linear LWIs with panel data.
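
The evaluation design described above (introduce missing values at random, apply a handling method, re-estimate the coefficients, compare with the reference estimate) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' code: the data are synthetic, the model has a single predictor, missingness is placed in the outcome column only, and just listwise deletion and mean imputation are shown. All function names are hypothetical.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def introduce_missing(values, proportion, rng):
    """Replace a random share of the values with None (completely at random)."""
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def listwise_deletion(xs, ys):
    """Drop every record whose outcome is missing."""
    kept = [(x, y) for x, y in zip(xs, ys) if y is not None]
    return [x for x, _ in kept], [y for _, y in kept]


def mean_imputation(xs, ys):
    """Fill each missing outcome with the mean of the observed outcomes."""
    observed_mean = statistics.fmean(y for y in ys if y is not None)
    return list(xs), [observed_mean if y is None else y for y in ys]


def mean_abs_bias(method, xs, ys, proportion, n_sims, rng):
    """Average absolute gap between the re-estimated and reference slopes."""
    reference = ols_slope(xs, ys)
    gaps = []
    for _ in range(n_sims):
        ys_missing = introduce_missing(ys, proportion, rng)
        cx, cy = method(xs, ys_missing)
        gaps.append(abs(ols_slope(cx, cy) - reference))
    return statistics.fmean(gaps)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for a reference data set (assumed, not from the study).
    xs = [rng.gauss(8, 2) for _ in range(200)]
    ys = [50 + 3 * x + rng.gauss(0, 5) for x in xs]
    for proportion in (0.1, 0.3, 0.6):
        for method in (listwise_deletion, mean_imputation):
            bias = mean_abs_bias(method, xs, ys, proportion, 200, rng)
            print(f"{proportion:.0%} {method.__name__}: {bias:.3f}")
```

In this toy setting, mean imputation of the outcome shrinks the estimated slope roughly in proportion to the share of missing values, which is one reason a bias measure computed per method and per proportion is informative.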


2020 ◽  
Author(s):  
Kiran Mahat ◽  
Andrew Mitchell ◽  
Tshelthrim Zangpo

Abstract We report the first detection of Fall Armyworm (FAW), Spodoptera frugiperda (Smith, 1797), in Bhutan. FAW feeds on more than 300 plant species and is a serious pest of many. It has been spreading through Africa since 2016 and Asia since 2018. In Bhutan, this species was first detected in maize fields in the western part of the country in September 2019 and subsequently found infesting maize crops in southern parts of the country in December 2019 and April 2020. Using morphological and molecular techniques, this study confirms the presence of the first invading populations of S. frugiperda in Bhutan. We present an updated reference DNA barcode data set for FAW comprising 374 sequences, which can be used to reliably identify this serious pest species, and discuss some of the reasons why such compiled reference data sets are necessary, despite the public availability of the underlying data. We also report on a second armyworm species, the Northern Armyworm, Mythimna separata (Walker, 1865), in rice, maize and other crops in eighteen districts of Bhutan.


Author(s):  
Victor H Aguiar ◽  
Nail Kashaev

Abstract A long-standing question about consumer behaviour is whether individuals’ observed purchase decisions satisfy the revealed preference (RP) axioms of the utility maximization theory (UMT). Researchers using survey or experimental panel data sets on prices and consumption to answer this question face the well-known problem of measurement error. We show that ignoring measurement error in the RP approach may lead to overrejection of the UMT. To solve this problem, we propose a new statistical RP framework for consumption panel data sets that allows for testing the UMT in the presence of measurement error. Our test is applicable to all consumer models that can be characterized by their first-order conditions. Our approach is non-parametric, allows for unrestricted heterogeneity in preferences and requires only a centring condition on measurement error. We develop two applications that provide new evidence about the UMT. First, we find support in a survey data set for the dynamic and time-consistent UMT in single-individual households, in the presence of nonclassical measurement error in consumption. In the second application, we cannot reject the static UMT in a widely used experimental data set in which measurement error in prices is assumed to be the result of price misperception due to the experimental design. The first finding stands in contrast to the conclusions drawn from the deterministic RP test of Browning (1989, International Economic Review, 979–992). The second finding reverses the conclusions drawn from the deterministic RP test of Afriat (1967, International Economic Review, 8, 67–77) and Varian (1982, Econometrica, 945–973).


SAGE Open ◽  
2016 ◽  
Vol 6 (4) ◽  
pp. 215824401666822 ◽  
Author(s):  
Simon Grund ◽  
Oliver Lüdtke ◽  
Alexander Robitzsch

The treatment of missing data can be difficult in multilevel research because state-of-the-art procedures such as multiple imputation (MI) may require advanced statistical knowledge or a high degree of familiarity with certain statistical software. In the missing data literature, pan has been recommended for MI of multilevel data. In this article, we provide an introduction to MI of multilevel missing data using the R package pan, and we discuss its possibilities and limitations in accommodating typical questions in multilevel research. To make pan more accessible to applied researchers, we make use of the mitml package, which provides a user-friendly interface to the pan package and several tools for managing and analyzing multiply imputed data sets. We illustrate the use of pan and mitml with two empirical examples that represent common applications of multilevel models, and we discuss how these procedures may be used in conjunction with other software.
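
Analyzing multiply imputed data sets usually ends with pooling the per-data-set estimates using Rubin's rules. The sketch below shows that pooling step for a single parameter. It is a generic illustration in Python, not the interface of pan or mitml; the function name and the example inputs are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates: the point estimate obtained from each imputed data set
    variances: the squared standard error obtained from each imputed data set
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average sampling variance
    between = statistics.variance(estimates)   # variance across imputations (m - 1 denominator)
    total = within + (1 + 1 / m) * between     # total variance of the pooled estimate
    return pooled_estimate, math.sqrt(total)


if __name__ == "__main__":
    # Made-up estimates and squared standard errors from five imputations.
    est, se = pool_rubin([0.42, 0.47, 0.39, 0.45, 0.44],
                         [0.010, 0.011, 0.009, 0.010, 0.012])
    print(f"pooled estimate {est:.3f}, pooled SE {se:.3f}")
```

The between-imputation term is what distinguishes multiple imputation from single imputation: it carries the extra uncertainty caused by the missing values into the standard error.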


2015 ◽  
Vol 5 (2) ◽  
pp. 137-148 ◽  
Author(s):  
Jeremy N.V Miles ◽  
Priscillia Hunt

Purpose – In applied psychology research settings, such as criminal psychology, missing data are to be expected. Missing data can cause problems with both biased estimates and lack of statistical power. The paper aims to discuss these issues. Design/methodology/approach – Recently, sophisticated methods for appropriately dealing with missing data, so as to minimize bias and to maximize power, have been developed. In this paper the authors use an artificial data set to demonstrate the problems that can arise with missing data, and make naïve attempts to handle data sets where some data are missing. Findings – With the artificial data set, and a data set comprising the results of a survey investigating prices paid for recreational and medical marijuana, the authors demonstrate the use of multiple imputation and maximum likelihood estimation for obtaining appropriate estimates and standard errors when data are missing. Originality/value – Missing data are ubiquitous in applied research. This paper demonstrates that techniques for handling missing data are accessible and should be employed by researchers.


2021 ◽  
Vol 906 (1) ◽  
pp. 012091
Author(s):  
Petr Kalvoda ◽  
Jakub Nosek ◽  
Petra Kalvodova

Abstract Mobile mapping systems (MMS) have become widely used in standard geodetic tasks in recent years. The paper focuses on the influence of the number and configuration of control points (CPs) on mobile laser scanning accuracy. The mobile laser scanning (MLS) data was acquired by MMS RIEGL VMX-450. The resulting point cloud was compared with two different reference data sets. The first reference data set consisted of a high-accuracy test point field (TPF) measured by a Trimble R8s GNSS system and a Trimble S8 HP total station. The second reference data set was a point cloud from terrestrial laser scanning (TLS) using two Faro Focus3D X 130 laser scanners. The coordinates of both reference data sets were determined with significantly higher accuracy than the coordinates of the tested MLS point cloud. The accuracy testing is based on coordinate differences between the reference data set and the tested MLS point cloud. A minimum of 6–7 CPs is required in our scanned area (based on MLS trajectory length) to achieve the declared relative accuracy of trajectory positioning according to the RIEGL datasheet. We tested two types of ground control point (GCP) configurations for 7 GCPs, using TPF reference data. The first type is a trajectory-based CPs configuration, and the second is a geometry-based CPs configuration. The accuracy differences of the MLS point clouds with trajectory-based CPs configuration and geometry-based CPs configuration are not statistically significant. From a practical perspective, a geometry-based CPs configuration is more advantageous in a nonlinear type of urban area such as ours. The following analyses are performed on geometry-based CPs configuration variants. We tested the influence of changing the location of two CPs from ground to roof. The effect of the vertical configuration of the CPs on the accuracy of the tested MLS point cloud has not been demonstrated. 
The effect of the number of control points on the accuracy of the MLS point cloud was also tested. In the overall statistics using TPF, the accuracy increases significantly as the number of GCPs increases up to 6. This number corresponds to a requirement of the manufacturer. Although further increasing the number of CPs does not significantly increase the global accuracy, local accuracy improves as the number of CPs increases up to 10 (average spacing 50 m) according to the comparison with the TLS reference point cloud. The accuracy test of the MLS point cloud was divided into the horizontal accuracy test on the façade data subset and the vertical accuracy test on the road data subset using the TLS reference point cloud. The results of this paper can help improve the efficiency and accuracy of the mobile mapping process in geodetic practice.


2021 ◽  
Author(s):  
Petya Kindalova ◽  
Ioannis Kosmidis ◽  
Thomas E. Nichols

Abstract Objectives White matter lesions are a very common finding on MRI in older adults and their presence increases the risk of stroke and dementia. Accurate and computationally efficient modelling methods are necessary to map the association of lesion incidence with risk factors, such as hypertension. However, there is no consensus in the brain mapping literature whether a voxel-wise modelling approach is better for binary lesion data than a more computationally intensive spatial modelling approach that accounts for voxel dependence. Methods We review three regression approaches for modelling binary lesion masks including mass-univariate probit regression modelling with either maximum likelihood estimates, or mean bias-reduced estimates, and spatial Bayesian modelling, where the regression coefficients have a conditional autoregressive model prior to account for local spatial dependence. We design a novel simulation framework of artificial lesion maps to compare the three alternative lesion mapping methods. The age effect on lesion probability estimated from a reference data set (13,680 individuals from the UK Biobank) is used to simulate a realistic voxel-wise distribution of lesions across age. To mimic the real features of lesion masks, we suggest matching brain lesion summaries (total lesion volume, average lesion size and lesion count) across the reference data set and the simulated data sets. Thus, we allow for a fair comparison between the modelling approaches, under a realistic simulation setting. Results Our findings suggest that bias-reduced estimates for voxel-wise binary-response generalized linear models (GLMs) overcome the drawbacks of infinite and biased maximum likelihood estimates and scale well for large data sets because voxel-wise estimation can be performed in parallel across voxels. 
Contrary to the assumption of spatial dependence being key in lesion mapping, our results show that voxel-wise bias-reduction and spatial modelling result in largely similar estimates. Conclusion Bias-reduced estimates for voxel-wise GLMs are not only accurate but also computationally efficient, which will become increasingly important as more biobank-scale neuroimaging data sets become available.


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not specific to any particular scientific domain; it arises in economics, sociology, education, behavioural sciences and medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis, and the typical approach to missing data is simply to delete the incomplete records. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state-of-the-art technique for handling missing data is multiple imputation. In the paper, some selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Then, missing values were imputed with the use of MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest rates of imputation errors for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.
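
The NRMSE used as the accuracy measure can be computed as follows. The abstract does not state which normalisation was applied, so this sketch assumes one common choice: the mean squared error over the artificially removed entries divided by the variance of their true values. The function name is hypothetical.

```python
import math
import statistics


def nrmse(true_values, imputed_values):
    """Normalised root mean square error over the artificially removed entries.

    Assumes normalisation by the (population) variance of the true values;
    other definitions divide by the range or the mean instead.
    """
    mse = statistics.fmean(
        (t - i) ** 2 for t, i in zip(true_values, imputed_values)
    )
    return math.sqrt(mse / statistics.pvariance(true_values))


if __name__ == "__main__":
    truth = [1.0, 2.0, 3.0, 4.0]
    print(nrmse(truth, [1.1, 1.9, 3.2, 3.8]))  # close imputations
    print(nrmse(truth, [2.5, 2.5, 2.5, 2.5]))  # every entry imputed with the mean
```

Under this normalisation, a value near 0 indicates imputations close to the true values, while imputing every entry with the mean of the true values gives a value of 1, which makes that a natural baseline.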


2017 ◽  
Vol 4 (3) ◽  
pp. 205316801771979 ◽  
Author(s):  
Joseph Wright ◽  
Erica Frantz

This paper re-examines the findings from a recently published study on hydrocarbon rents and autocratic survival by Lucas and Richter (LR hereafter). LR introduce a new data set on hydrocarbon rents and use it to examine the link between oil income and autocratic survival. Employing a placebo test, we show that the authors’ strategy for dealing with missingness in the new hydrocarbon rents data set – filling in missing data with zeros – creates bias in the reported estimates of interest. Addressing missingness with multiple imputation shows that the LR findings linking oil rents to democratization do not hold. Instead, we find that hydrocarbon rents reduce the chances of transition to a new dictatorship, consistent with the conclusions of Wright et al.

