scholarly journals A Safe-Region Imputation Method for Handling Medical Data with Missing Values

Symmetry ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 1792
Author(s):  
Shu-Fen Huang ◽  
Ching-Hsue Cheng

Medical data usually have missing values; hence, imputation methods have become an important issue. In previous studies, many imputation methods based on variable data had a multivariate normal distribution, such as expectation-maximization and regression-based imputation. These assumptions may lead to deviations in the results, which sometimes create a bottleneck. In addition, directly deleting instances with missing values may have several problems, such as losing important data, producing invalid research samples, and leading to research deviations. Therefore, this study proposed a safe-region imputation method for handling medical data with missing values; we also built a medical prediction model and compared the removed missing values with imputation methods in terms of the generated rules, accuracy, and AUC. First, this study used the kNN imputation, multiple imputation, and the proposed imputation to impute the missing data and then applied four attribute selection methods to select the important attributes. Then, we used the decision tree (C4.5), random forest, REP tree, and LMT classifier to generate the rules, accuracy, and AUC for comparison. Because there were four datasets with imbalanced classes (asymmetric classes), the AUC was an important criterion. In the experiment, we collected four open medical datasets from UCI and one international stroke trial dataset. The results show that the proposed safe-region imputation is better than the listing imputation methods and after imputing offers better results than directly deleting instances with missing values in the number of rules, accuracy, and AUC. These results will provide a reference for medical stakeholders.

2021 ◽  
Author(s):  
Ching-Hsue Cheng ◽  
Shu-Fen Huang

Abstract Nowadays, people pay increasing attention to health, and the integrity of medical records has been put into focus. Recently, medical data imputation has become a very active field because medical data usually have missing values. Many imputation methods have been proposed, but many model-based imputation methods such as expectation-maximization and regression-based imputation based on the variables data have a multivariate normal distribution, which assumption can lead to biased results. Sometimes this becomes a bottleneck, such as computationally more complex than model-free methods. Furthermore, directly remove instances with missing values, this approach has several problems, and it is possible to lose the important data, produce ineffective research samples, and cause research deviations, and so on. Therefore, this study proposes a novel clustering-based purity and distance imputation method to improve the handling of missing values. In the experiment, we collected eight different medical datasets to compare the proposed imputation methods with the listed imputation methods with regard to the results of different situations. In imputation measures, the area under the curve (AUC) is used to evaluate the performance of the imbalanced class datasets in MAR and MCAR experiments, and accuracy is applied to measure its performance of the balanced class in MNAR experiment. Finally, the root-mean-square error (RMSE) is also used to compare the proposed and the listing imputation methods. In addition, this study utilized the elbow method and the average silhouette method to find the optimal number of clusters for all datasets. Results showed that the proposed imputation method could improve imputation performance in the accuracy, AUC, and RMSE of different missing degrees and missing types.


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Pooja Rani ◽  
Rajneesh Kumar ◽  
Anurag Jain

PurposeDecision support systems developed using machine learning classifiers have become a valuable tool in predicting various diseases. However, the performance of these systems is adversely affected by the missing values in medical datasets. Imputation methods are used to predict these missing values. In this paper, a new imputation method called hybrid imputation optimized by the classifier (HIOC) is proposed to predict missing values efficiently.Design/methodology/approachThe proposed HIOC is developed by using a classifier to combine multivariate imputation by chained equations (MICE), K nearest neighbor (KNN), mean and mode imputation methods in an optimum way. Performance of HIOC has been compared to MICE, KNN, and mean and mode methods. Four classifiers support vector machine (SVM), naive Bayes (NB), random forest (RF) and decision tree (DT) have been used to evaluate the performance of imputation methods.FindingsThe results show that HIOC performed efficiently even with a high rate of missing values. It had reduced root mean square error (RMSE) up to 17.32% in the heart disease dataset and 34.73% in the breast cancer dataset. Correct prediction of missing values improved the accuracy of the classifiers in predicting diseases. It increased classification accuracy up to 18.61% in the heart disease dataset and 6.20% in the breast cancer dataset.Originality/valueThe proposed HIOC is a new hybrid imputation method that can efficiently predict missing values in any medical dataset.


2010 ◽  
Vol 6 (3) ◽  
pp. 1-10 ◽  
Author(s):  
Shichao Zhang

In this paper, the author designs an efficient method for imputing iteratively missing target values with semi-parametric kernel regression imputation, known as the semi-parametric iterative imputation algorithm (SIIA). While there is little prior knowledge on the datasets, the proposed iterative imputation method, which impute each missing value several times until the algorithms converges in each model, utilize a substantially useful amount of information. Additionally, this information includes occurrences involving missing values as well as capturing the real dataset distribution easier than the parametric or nonparametric imputation techniques. Experimental results show that the author’s imputation methods outperform the existing methods in terms of imputation accuracy, in particular in the situation with high missing ratio.


Author(s):  
V. Jinubala ◽  
P. Jeyakumar

Data Mining is an emerging research field in the analysis of agricultural data. In fact the most important problem in extracting knowledge from the agriculture data is the missing values of the attributes in the selected data set. If such deficiencies are there in the selected data set then it needs to be cleaned during preprocessing of the data in order to obtain a functional data. The main objective of this paper is to analyse the effectiveness of the various imputation methods in producing a complete data set that can be more useful for applying data mining techniques and presented a comparative analysis of the imputation methods for handling missing values. The pest data set of rice crop collected throughout Maharashtra state under Crop Pest Surveillance and Advisory Project (CROPSAP) during 2009-2013 was used for analysis. The different methodologies like Deleting of rows, Mean & Median, Linear regression and Predictive Mean Matching were analysed for Imputation of Missing values. The comparative analysis shows that Predictive Mean Matching Methodology was better than other methods and effective for imputation of missing values in large data set.


2019 ◽  
Vol 6 (339) ◽  
pp. 73-98
Author(s):  
Małgorzata Aleksandra Misztal

The problem of incomplete data and its implications for drawing valid conclusions from statistical analyses is not related to any particular scientific domain, it arises in economics, sociology, education, behavioural sciences or medicine. Almost all standard statistical methods presume that every object has information on every variable to be included in the analysis and the typical approach to missing data is simply to delete them. However, this leads to ineffective and biased analysis results and is not recommended in the literature. The state of the art technique for handling missing data is multiple imputation. In the paper, some selected multiple imputation methods were taken into account. Special attention was paid to using principal components analysis (PCA) as an imputation method. The goal of the study was to assess the quality of PCA‑based imputations as compared to two other multiple imputation techniques: multivariate imputation by chained equations (MICE) and missForest. The comparison was made by artificially simulating different proportions (10–50%) and mechanisms of missing data using 10 complete data sets from the UCI repository of machine learning databases. Then, missing values were imputed with the use of MICE, missForest and the PCA‑based method (MIPCA). The normalised root mean square error (NRMSE) was calculated as a measure of imputation accuracy. On the basis of the conducted analyses, missForest can be recommended as a multiple imputation method providing the lowest rates of imputation errors for all types of missingness. PCA‑based imputation does not perform well in terms of accuracy.


2019 ◽  
Author(s):  
Tabea Kossen ◽  
Michelle Livne ◽  
Vince I Madai ◽  
Ivana Galinovic ◽  
Dietmar Frey ◽  
...  

AbstractBackground and purposeHandling missing values is a prevalent challenge in the analysis of clinical data. The rise of data-driven models demands an efficient use of the available data. Methods to impute missing values are thus crucial. Here, we developed a publicly available framework to test different imputation methods and compared their impact in a typical stroke clinical dataset as a use case.MethodsA clinical dataset based on the 1000Plus stroke study with 380 completed-entries patients was used. 13 common clinical parameters including numerical and categorical values were selected. Missing values in a missing-at-random (MAR) and missing-completely-at-random (MCAR) fashion from 0% to 60% were simulated and consequently imputed using the mean, hot-deck, multiple imputation by chained equations, expectation maximization method and listwise deletion. The performance was assessed by the root mean squared error, the absolute bias and the performance of a linear model for discharge mRS prediction.ResultsListwise deletion was the worst performing method and started to be significantly worse than any imputation method from 2% (MAR) and 3% (MCAR) missing values on. The underlying missing value mechanism seemed to have a crucial influence on the identified best performing imputation method. Consequently no single imputation method outperformed all others. A significant performance drop of the linear model started from 11% (MAR+MCAR) and 18% (MCAR) missing values.ConclusionsIn the presented case study of a typical clinical stroke dataset we confirmed that listwise deletion should be avoided for dealing with missing values. Our findings indicate that the underlying missing value mechanism and other dataset characteristics strongly influence the best choice of imputation method. For future studies with similar data structure, we thus suggest to use the developed framework in this study to select the most suitable imputation method for a given dataset prior to analysis.


2018 ◽  
Vol 15 (2) ◽  
pp. 347-368 ◽  
Author(s):  
Won-Seok Hwang ◽  
Shaoyu Li ◽  
Sang-Wook Kim ◽  
Kichun Lee

Existing recommendation methods suffer from the data sparsity problem which means that most of users have rated only a very small number of items, often resulting in low recommendation accuracy. In addition, for cold start users evaluating only few items, rating predictions with the methods also produce low accuracy. To address these problems, we propose a novel data imputation method that effectively substitutes missing ratings with probable values (i.e., imputed values). Our method successfully improves accuracy of recommendation methods from the following three aspects: (1) exploiting a trust network, (2) imputing only a part of missing ratings, and (3) applying them to any recommendation methods. Our method employs a bidirectional connection structure within a distance level for finding reliable users in exploiting a trust network as useful information. In addition, our method imputes only some missing ratings, called fillable ratings, whose imputed values are expected to be accurate with a sufficient level of confidence. Moreover, our imputation method is independent of, thus applicable to, any recommendation methods that may include application-specific ones and the most accurate one in each domain. We conduct experiments on three real-life datasets which arise from Epinions and Ciao. Our experimental results demonstrate that our method has recommendation accuracy better than existing recommendation methods equipped with imputation methods or trust networks, especially for cold start users.


2013 ◽  
Vol 11 (7) ◽  
pp. 2779-2786
Author(s):  
Rahul Singhai

One relevant problem in data preprocessing is the presence of missing data that leads the poor quality of patterns, extracted after mining. Imputation is one of the widely used procedures that replace the missing values in a data set by some probable values. The advantage of this approach is that the missing data treatment is independent of the learning algorithm used. This allows the user to select the most suitable imputation method for each situation. This paper analyzes the various imputation methods proposed in the field of statistics with respect to data mining. A comparative analysis of three different imputation approaches which can be used to impute missing attribute values in data mining are given that shows the most promising method. An artificial input data (of numeric type) file of 1000 records is used to investigate the performance of these methods. For testing the significance of these methods Z-test approach were used.


2020 ◽  
Author(s):  
Kruttika Dabke ◽  
Simion Kreimer ◽  
Michelle R. Jones ◽  
Sarah J. Parker

AbstractMissing values in proteomic data sets have real consequences on downstream data analysis and reproducibility. Although several imputation methods exist to handle missing values, there is no single imputation method that is best suited for a diverse range of data sets and no clear strategy exists for evaluating imputation methods for large-scale DIA-MS data sets, especially at different levels of protein quantification. To navigate through the different imputation strategies available in the literature, we have established a workflow to assess imputation methods on large-scale label-free DIA-MS data sets. We used two distinct DIA-MS data sets with real missing values to evaluate eight different imputation methods with multiple parameters at different levels of protein quantification; dilution series data set and an independent data set with actual experimental samples. We found that imputation methods based on local structures within the data, like local least squares (LLS) and random forest (RF), worked well in our dilution series data set whereas, imputation methods based on global structures within the data, like BPCA performed well in our independent data set. We also found that imputation at the most basic level of protein quantification – fragment level-improved accuracy and number of proteins quantified. Overall, this study indicates that the most suitable imputation method depends on the overall structure and correlations of proteins within the data set and can be identified with the workflow presented here.


Sign in / Sign up

Export Citation Format

Share Document