Data-adaptive Fellegi-Sunter Model for Probabilistic Record Linkage: Incorporating Missing Data and Field Selection (Preprint)

2021 ◽  
Author(s):  
Xiaochun Li ◽  
Huiping Xu ◽  
Shaun Grannis

Objectives: We address the real-world challenges of missing data and matching-field selection in linking medical records by evaluating the extent to which incorporating the missing-at-random (MAR) assumption in the Fellegi-Sunter model, and using data-driven selected fields, improves patient matching accuracy in real-world use cases. Methods: We adapted the Fellegi-Sunter model to accommodate missing data under the MAR assumption and compared the adaptation to the common strategy of treating missing values as disagreement, with matching fields either specified by experts or selected by data-driven methods. We used four use cases, each containing a random sample of record pairs whose match status was ascertained by manual review. The use cases, representing real-world clinical and public health scenarios, included health information exchange (HIE) record deduplication, linkage of public health registry records to HIE, linkage of Social Security Death Master File records to HIE, and deduplication of newborn screening records. Matching performance was evaluated using sensitivity, specificity, positive predictive value, negative predictive value, and F-score. Results: Incorporating the MAR assumption in the Fellegi-Sunter model maintained or improved F-scores whether matching fields were expert-specified or selected by data-driven methods. Combining the MAR assumption with data-driven fields optimized F-scores in all four use cases. Conclusions: MAR is a reasonable assumption in real-world record linkage applications: it maintains or improves F-scores regardless of whether matching fields are expert-specified or data-driven. Data-driven field selection coupled with MAR achieves the best overall performance, which can be especially useful in privacy-preserving record linkage.
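The missing-at-random adaptation can be illustrated with a minimal Fellegi-Sunter scoring sketch: under MAR, a field whose comparison is missing contributes no evidence to the match weight, whereas the common alternative scores it as a disagreement. The field names and m/u probabilities below are illustrative assumptions, not values from the study:

```python
import math

# Hypothetical m- and u-probabilities per matching field (illustrative values,
# not from the paper): m = P(agree | true match), u = P(agree | non-match).
M_PROBS = {"last_name": 0.95, "birth_date": 0.90, "zip": 0.85}
U_PROBS = {"last_name": 0.01, "birth_date": 0.005, "zip": 0.05}

def fs_score(comparison, treat_missing_as_disagreement=False):
    """Fellegi-Sunter log2 match weight for one record pair.

    `comparison` maps field -> True (agree), False (disagree), None (missing).
    Under the MAR adaptation a missing comparison contributes no evidence
    (weight 0); the common alternative scores it as disagreement.
    """
    score = 0.0
    for field, agrees in comparison.items():
        m, u = M_PROBS[field], U_PROBS[field]
        if agrees is None:
            if treat_missing_as_disagreement:
                agrees = False
            else:
                continue  # MAR: skip the field entirely
        if agrees:
            score += math.log2(m / u)       # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score

pair = {"last_name": True, "birth_date": None, "zip": True}
mar_weight = fs_score(pair)
disagree_weight = fs_score(pair, treat_missing_as_disagreement=True)
```

Because a missing birth date is penalized as a disagreement in the second call, the same pair scores lower there than under MAR, which is the effect the abstract reports on at scale.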

Author(s):  
Caio Ribeiro ◽  
Alex A. Freitas

Longitudinal datasets from human ageing studies usually have a high volume of missing data, and one way to handle missing values is to replace them with estimations. However, there are many methods for estimating missing values, and no single method is best for all datasets. In this article, we propose a data-driven missing value imputation approach that performs a feature-wise selection of the best imputation method, using known information in the dataset to rank the five methods we selected by their estimation error rates. We evaluated the proposed approach in two sets of experiments: a classifier-independent scenario, where we compared the applicability and error rates of each imputation method; and a classifier-dependent scenario, where we compared the predictive accuracy of Random Forest classifiers generated with datasets prepared using each imputation method, against a baseline approach of doing no imputation (letting the classification algorithm handle the missing values internally). Based on the results of both sets of experiments, we conclude that the proposed data-driven missing value imputation approach generally results in models with more accurate estimations of missing data and better-performing classifiers on longitudinal datasets of human ageing. We also observed that imputation methods devised specifically for longitudinal data produced very accurate estimations. This reinforces the idea that using the temporal information intrinsic to longitudinal data is a worthwhile endeavour for machine learning applications, and that it can be achieved through the proposed data-driven approach.
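The feature-wise selection idea can be sketched as follows: hide a fraction of the known values in each feature, score each candidate imputation method by its error on those hidden entries, and keep the lowest-error method per feature. The two candidate methods here are a deliberately small illustrative set, not the five methods evaluated in the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy longitudinal-style matrix: rows = subjects, columns = features.
X = rng.normal(loc=[0.0, 5.0, 10.0], scale=[1.0, 0.1, 3.0], size=(200, 3))

# Candidate imputation methods (a small illustrative set).
METHODS = {
    "mean": lambda col: np.nanmean(col),
    "median": lambda col: np.nanmedian(col),
}

def best_method_per_feature(X, mask_fraction=0.2):
    """Rank candidate methods per feature by the error they make when
    predicting known values that are hidden on purpose."""
    best = []
    for j in range(X.shape[1]):
        col = X[:, j]
        known = np.flatnonzero(~np.isnan(col))
        hidden = rng.choice(known, size=int(len(known) * mask_fraction),
                            replace=False)
        masked = col.copy()
        masked[hidden] = np.nan
        # Mean absolute error of each method on the hidden entries.
        errors = {name: np.mean(np.abs(col[hidden] - fn(masked)))
                  for name, fn in METHODS.items()}
        best.append(min(errors, key=errors.get))
    return best

choices = best_method_per_feature(X)
```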


Author(s):  
Thelma Dede Baddoo ◽  
Zhijia Li ◽  
Samuel Nii Odai ◽  
Kenneth Rodolphe Chabi Boni ◽  
Isaac Kwesi Nooni ◽  
...  

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies of missing data imputation on real-world datasets that investigate how to ascertain the accuracy of imputation algorithms are lacking. This study investigated the necessary complexity of missing data reconstruction schemes for a real-world single-station streamflow observation to obtain results relevant for its further use. The investigation applied schemes spanning from univariate algorithms to multiple imputation methods suited to multivariate data, taking time as an explicit variable. The accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but the ones that provide the best results are usually time- and computationally intensive. Multiple imputation algorithms that consider the surrounding observed values and/or can capture the characteristics of the data provide similar results to the univariate algorithms and, in some cases, perform better without the added time and computational downsides when time is taken as an explicit variable. Furthermore, the LEM is especially useful when the missing data fall in specific portions of the dataset or where very large gaps of 'missingness' occur. Finally, proper handling of missing values in real-world hydroclimatic datasets depends on extensive study of the particular dataset to be imputed.
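The contrast between a time-aware univariate scheme and a naive one can be sketched on a synthetic series with one contiguous gap, where an error computed only over the gap (in the spirit of the localized error measurement) separates the two approaches clearly. The signal and gap below are invented for illustration, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic daily "streamflow": smooth seasonal signal plus noise.
t = np.arange(365, dtype=float)
flow = 50 + 30 * np.sin(2 * np.pi * t / 365) + rng.normal(0, 1, t.size)

# Knock out a contiguous gap, the hard case a localized error exposes.
gap = slice(100, 130)
observed = flow.copy()
observed[gap] = np.nan
missing = np.isnan(observed)

# Univariate scheme that uses time explicitly: linear interpolation over t.
interp = observed.copy()
interp[missing] = np.interp(t[missing], t[~missing], observed[~missing])

# Naive scheme that ignores time: fill with the overall mean.
mean_fill = observed.copy()
mean_fill[missing] = np.nanmean(observed)

# Localized error over the gap only, rather than over the whole series.
lem_interp = np.mean(np.abs(interp[gap] - flow[gap]))
lem_mean = np.mean(np.abs(mean_fill[gap] - flow[gap]))
```

A total error over all 365 days would dilute the gap's contribution; scoring the gap alone makes the time-aware scheme's advantage visible.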


2022 ◽  
Vol 16 (4) ◽  
pp. 1-24
Author(s):  
Kui Yu ◽  
Yajing Yang ◽  
Wei Ding

Causal feature selection aims at learning the Markov blanket (MB) of a class variable for feature selection. The MB of a class variable implies the local causal structure among the class variable and its MB, and all other features are probabilistically independent of the class variable conditional on its MB. This enables causal feature selection to identify potentially causal features for building robust and physically meaningful prediction models. Missing data, ubiquitous in many real-world applications, remain an open research problem in causal feature selection due to their technical complexity. In this article, we propose a novel multiple-imputation MB (MimMB) framework for causal feature selection with missing data. MimMB integrates Data Imputation with MB Learning in a unified framework so that the two key components reinforce each other: MB Learning enables Data Imputation in a potentially causal feature space for accurate imputation, while accurate Data Imputation in turn helps MB Learning identify a reliable MB of the class variable. We further design an enhanced kNN estimator for imputing missing values and use it to instantiate MimMB. In our comprehensive experimental evaluation on synthetic and real-world datasets, our new approach effectively learns the MB of a given variable in a Bayesian network and outperforms rival algorithms.
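The kNN imputation component can be sketched with scikit-learn's plain KNNImputer; this is only an illustrative stand-in for the enhanced kNN estimator, and none of the MB-learning machinery of MimMB is reproduced. The toy data are constructed so that neighbours found among the predictive features are informative:

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(2)

# Toy data where feature 2 depends on features 0 and 1, so neighbours
# located in that (loosely "causal") subspace predict the missing values.
X = rng.normal(size=(300, 2))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 300)
data = np.column_stack([X, y])

truth = data.copy()
holes = rng.choice(300, size=30, replace=False)
data[holes, 2] = np.nan  # punch holes in the dependent feature

# Plain kNN imputation: distances over observed features find neighbours,
# whose feature-2 values are averaged to fill each hole.
imputed = KNNImputer(n_neighbors=5).fit_transform(data)

knn_err = np.mean(np.abs(imputed[holes, 2] - truth[holes, 2]))
mean_err = np.mean(np.abs(np.nanmean(data[:, 2]) - truth[holes, 2]))
```

Because the neighbours are found in the informative subspace, kNN imputation beats a column-mean fill here, which is the intuition behind coupling imputation with MB learning.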


Author(s):  
Juheng Zhang ◽  
Xiaoping Liu ◽  
Xiao-Bai Li

We study strategically missing data problems in predictive analytics with regression. In many real-world situations, such as financial reporting, college admission, job application, and marketing advertisement, data providers often conceal certain information on purpose in order to gain a favorable outcome. It is important for the decision-maker to have a mechanism to deal with such strategic behaviors. We propose a novel approach to handle strategically missing data in regression prediction. The proposed method derives imputation values of strategically missing data based on the Support Vector Regression models. It provides incentives for the data providers to disclose their true information. We show that with the proposed method imputation errors for the missing values are minimized under some reasonable conditions. An experimental study on real-world data demonstrates the effectiveness of the proposed approach.
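The core imputation step can be sketched as fitting a Support Vector Regression on the records that disclosed the field and predicting the concealed values; the incentive-compatibility analysis that is the paper's contribution is not reproduced here. The setting, field names, and data below are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)

# Toy setting: applicants may conceal `score`, which relates to two
# disclosed attributes (names are hypothetical).
disclosed = rng.uniform(0, 1, size=(200, 2))
score = 3 * disclosed[:, 0] - disclosed[:, 1] + rng.normal(0, 0.05, 200)

concealed = rng.choice(200, size=40, replace=False)
reported = score.copy()
reported[concealed] = np.nan  # strategically withheld values

# Fit an SVR on records that disclosed the field, then impute the rest.
observed = ~np.isnan(reported)
model = SVR(kernel="rbf", C=10.0).fit(disclosed[observed], reported[observed])
reported[concealed] = model.predict(disclosed[concealed])

svr_err = np.mean(np.abs(reported[concealed] - score[concealed]))
```

The design intuition is that if concealed values are imputed from disclosed attributes rather than treated favourably, withholding information stops paying off for the data provider.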


2015 ◽  
Vol 2015 ◽  
pp. 1-14 ◽  
Author(s):  
Jaemun Sim ◽  
Jonathan Sangyun Lee ◽  
Ohbyung Kwon

In a ubiquitous environment, high-accuracy data analysis is essential because it affects real-world decision-making. However, in the real world, user-related data from information systems are often missing due to users’ concerns about privacy or lack of obligation to provide complete data. This data incompleteness can impair the accuracy of data analysis using classification algorithms, which can degrade the value of the data. Many studies have attempted to overcome these data incompleteness issues and to improve the quality of data analysis using classification algorithms. The performance of classification algorithms may be affected by the characteristics and patterns of the missing data, such as the ratio of missing data to complete data. We perform a concrete causal analysis of differences in performance of classification algorithms based on various factors. The characteristics of missing values, datasets, and imputation methods are examined. We also propose imputation and classification algorithms appropriate to different datasets and circumstances.


Author(s):  
Brett K Beaulieu-Jones ◽  
Daniel R Lavage ◽  
John W Snyder ◽  
Jason H Moore ◽  
Sarah A Pendergrass ◽  
...  

BACKGROUND Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results. OBJECTIVE The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered. METHODS We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling). RESULTS Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation. CONCLUSIONS The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.
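Simulating a missingness mechanism on complete cases and scoring an imputation method against the held-back truth can be sketched as follows, here for MCAR only, using scikit-learn's IterativeImputer (which is modelled on MICE, though it is not the exact software evaluated in the study, and the data are synthetic rather than EHR laboratory results):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)

# Stand-in for a complete-case set: three correlated "lab measures".
n = 400
base = rng.normal(size=n)
labs = np.column_stack([
    base + rng.normal(0, 0.3, n),
    2 * base + rng.normal(0, 0.3, n),
    -base + rng.normal(0, 0.3, n),
])
truth = labs.copy()

# Simulate MCAR: each value of column 0 is dropped with a fixed
# probability, independently of everything (MAR/MNAR are omitted here).
mcar = rng.random(n) < 0.2
labs[mcar, 0] = np.nan

# Chained-equations style imputation, scored against the held-back truth.
mice_like = IterativeImputer(random_state=0).fit_transform(labs)
err = np.mean(np.abs(mice_like[mcar, 0] - truth[mcar, 0]))
baseline = np.mean(np.abs(np.nanmean(labs[:, 0]) - truth[mcar, 0]))
```

Because the columns are correlated, the chained-equations imputer recovers the dropped values with much lower error than a column-mean fill, mirroring the kind of comparison the study runs across 12 methods and 4 mechanisms.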


2020 ◽  
Vol 4 (Supplement_1) ◽  
pp. 509-509
Author(s):  
Peiyi Lu ◽  
Mack Shelley

Abstract Studies using data from longitudinal health surveys of older adults usually assume the data are missing completely at random (MCAR) or missing at random (MAR), and subsequent analyses therefore use multiple imputation or likelihood-based methods to handle missing data. However, little existing research actually examines whether the data meet the MCAR/MAR assumptions before performing data analyses. This study first summarized the commonly used statistical methods for testing the missing data mechanism and discussed their conditions of application. Then, using two-wave longitudinal data from the Health and Retirement Study (HRS; wave 2014-2015 and wave 2016-2017; N=18,747), this study applied different approaches to test the missing data mechanism of several demographic and health variables. These approaches included Little's test, the logistic regression method, nonparametric tests, the false discovery rate, and others. Results indicated the data did not meet the MCAR assumption even though they had a very low rate of missing values. Demographic variables provided good auxiliary information for health variables. Health measures (e.g., self-reported health, activities of daily living, depressive symptoms) met the MAR assumption. Older respondents could drop out or die during the longitudinal survey, but attrition did not significantly affect the MAR assumption. Our findings support the MAR assumption for the demographic and health variables in the HRS, and therefore provide statistical justification for HRS researchers to use imputation or likelihood-based methods to deal with missing data. However, researchers are strongly encouraged to test the missing data mechanism of the specific variables/data they choose when using a new dataset.


Author(s):  
Daniele Bottigliengo ◽  
Giulia Lorenzoni ◽  
Honoria Ocagli ◽  
Matteo Martinato ◽  
Paola Berchialla ◽  
...  

(1) Background: Propensity score methods have gained popularity in non-interventional clinical studies. As often occurs in observational datasets, some values in baseline covariates are missing for some patients. The present study aims to compare the performance of popular statistical methods for dealing with missing data in propensity score analysis. (2) Methods: Methods that account for missing data during the estimation process and methods based on the imputation of missing values, such as multiple imputation, were considered. The methods were applied to the dataset of an ongoing prospective registry for the treatment of unprotected left main coronary artery disease. Performance was assessed in terms of the overall balance of baseline covariates. (3) Results: Methods that explicitly deal with missing data were superior to classical complete case analysis. The best balance was observed when propensity scores were estimated with a method that accounts for missing data using a stochastic approximation of the expectation-maximization algorithm. (4) Conclusions: If a missing at random mechanism is plausible, methods that use the missing data to estimate the propensity score, or that impute them, should be preferred. Sensitivity analyses are encouraged to evaluate the implications of the methods used to handle missing data and estimate the propensity score.


2020 ◽  
Author(s):  
Pietro Di Lena ◽  
Claudia Sala ◽  
Andrea Prodi ◽  
Christine Nardini

Abstract Background: High-throughput technologies enable the cost-effective collection and analysis of DNA methylation data throughout the human genome. This naturally entails missing value management, which can complicate the analysis of the data. Several general and specific imputation methods are suitable for DNA methylation data. However, there are no detailed studies of their performance under different missing data mechanisms ((completely) at random or not) and different representations of DNA methylation levels (β- and M-value). Results: We present an extensive analysis of the imputation performance of seven imputation methods on simulated missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) methylation data. We further consider imputation performance on the two popular representations of methylation levels, β- and M-values. Overall, β-values enable better imputation performance than M-values. Imputation accuracy is lower for mid-range β-values, while it is generally higher for values at the extremes of the β-value range. The distribution of MAR values is on average denser in the mid-range than the expected β-value distribution; as a consequence, MAR values are on average harder to impute. Conclusions: The results of the analysis provide guidelines for the most suitable imputation approaches for DNA methylation data under different representations of DNA methylation levels and different missing data mechanisms.
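The two representations compared above are deterministic transforms of each other; a minimal sketch of the standard conversion, M = log2(β / (1 − β)), shows why mid-range β-values sit near M = 0 while extreme β-values map to large |M|:

```python
import math

def beta_to_m(beta):
    """M-value from a methylation beta-value: M = log2(beta / (1 - beta))."""
    return math.log2(beta / (1 - beta))

def m_to_beta(m):
    """Inverse transform: beta = 2^M / (2^M + 1)."""
    return 2 ** m / (2 ** m + 1)

# Mid-range beta (the hardest region to impute, per the abstract) maps to
# M near 0; an extreme beta maps to a large-magnitude M.
mid = beta_to_m(0.5)
high = beta_to_m(0.9)
```

The logit stretch around the extremes is one intuition for why imputation error profiles differ between the two representations.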


2011 ◽  
Vol 26 (S2) ◽  
pp. 572-572
Author(s):  
N. Resseguier ◽  
H. Verdoux ◽  
F. Clavel-Chapelon ◽  
X. Paoletti

Introduction: The CES-D scale is commonly used to assess depressive symptoms (DS) in large population-based studies. Missing values in items of the scale may create biases. Objectives: To explore reasons for not completing items of the CES-D scale and to perform a sensitivity analysis of the prevalence of DS to assess the impact of different missing data hypotheses. Methods: 71,412 women included in the French E3N cohort returned a questionnaire containing the CES-D scale in 2005; 45% presented at least one missing value in the scale. An interview study was carried out on a random sample of 204 participants to examine the different hypotheses for the missing value mechanism. The prevalence of DS was estimated according to different methods for handling missing values: complete cases analysis, single imputation, and multiple imputation under MAR (missing at random) and MNAR (missing not at random) assumptions. Results: The interviews showed that participants were not embarrassed to fill in questions about DS. Potential reasons for nonresponse were identified. MAR and MNAR hypotheses remained plausible and were explored. Among complete responders, the prevalence of DS was 26.1%. After multiple imputation under the MAR assumption, it was 28.6%, 29.8% and 31.7% among women presenting up to 4, up to 10 and up to 20 missing values, respectively. The estimates were robust after applying various MNAR scenarios in the sensitivity analysis. Conclusions: The CES-D scale can easily be used to assess DS in large cohorts. Multiple imputation under the MAR assumption allows missing values to be handled reliably.

