Attribute-Associated Neuron Modeling and Missing Value Imputation for Incomplete Data

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Xiaochen Lai ◽  
Jinchong Zhu ◽  
Liyong Zhang ◽  
Zheng Zhang ◽  
Wei Lu

The imputation of missing values is an important topic in incomplete data analysis. Based on the auto-associative neural network (AANN), this paper performs regression modeling for incomplete data and imputes missing values. Since the AANN can efficiently estimate missing values under multiple missingness patterns, we introduce incomplete records into the modeling process and propose an attribute cross fitting model (ACFM) based on the AANN. ACFM reconstructs the path of data transmission between output and input neurons and optimizes the model parameters using the training errors on observed data, thereby improving its ability to fit the relations between attributes of incomplete data. In addition, to address the problem of incomplete model input, this paper proposes a model training scheme that treats missing values as variables and updates them iteratively together with the model parameters. This method of local learning and global approximation increases the precision of model fitting and the imputation accuracy for missing values. Finally, experiments on several datasets verify the effectiveness of the proposed method.
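The "missing values as variables, updated iteratively with model parameters" idea can be illustrated with a minimal NumPy sketch. This is not the authors' ACFM implementation: the linear zero-diagonal self-mapping (a crude stand-in for the auto-associative network), the toy data, the learning rates, and the missingness pattern are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy incomplete data: three attributes with a linear relation to recover.
X = rng.normal(size=(200, 3))
X[:, 2] = 0.5 * X[:, 0] + 0.5 * X[:, 1]
mask = rng.random(X.shape) < 0.1              # ~10% of entries missing
X_obs = np.where(mask, np.nan, X)

# Stand-in for the auto-associative network: a linear self-mapping W whose
# diagonal is kept at zero, so each attribute is fitted from the others.
X_hat = np.where(mask, np.nanmean(X_obs, axis=0), X_obs)  # init: column means
W = rng.normal(scale=0.1, size=(3, 3))
np.fill_diagonal(W, 0.0)

def recon_loss(Xh, W):
    return 0.5 * ((Xh @ W - Xh) ** 2).sum() / len(Xh)

loss_start = recon_loss(X_hat, W)
lr_w, lr_x = 0.1, 20.0
for _ in range(500):
    R = X_hat @ W - X_hat                     # reconstruction residual
    W -= lr_w * (X_hat.T @ R) / len(X_hat)    # update model parameters
    np.fill_diagonal(W, 0.0)                  # no self-connections
    grad_X = (R @ W.T - R) / len(X_hat)       # gradient w.r.t. the data matrix
    X_hat[mask] -= lr_x * grad_X[mask]        # only missing entries are variables

loss_end = recon_loss(X_hat, W)
imput_err = np.abs(X_hat[mask] - X[mask]).mean()
```

Only the entries under `mask` move; observed values stay fixed, so the existing data's training errors drive both the model parameters and the missing-value variables.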

2014 ◽  
Vol 39 (2) ◽  
pp. 107-127 ◽  
Author(s):  
Artur Matyja ◽  
Krzysztof Siminski

Abstract Missing values are not uncommon in real data sets. The algorithms and methods used for the analysis of complete data sets cannot always be applied to data with missing values. To use the existing methods for complete data, missing-value data sets must be preprocessed. The alternative solution is to create new algorithms dedicated to missing-value data sets. The objective of our research is to compare the preprocessing techniques with the specialised algorithms and to find their most advantageous usage.
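The "preprocess, then reuse complete-data methods" route that the study compares against dedicated algorithms can be sketched briefly (assuming scikit-learn; the synthetic data and the choice of mean imputation plus k-means are illustrative, not the paper's experimental setup):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.15] = np.nan     # inject missing values

# Preprocessing route: impute first, then apply any complete-data algorithm.
X_imp = SimpleImputer(strategy="mean").fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_imp)
```

A specialised algorithm would instead operate on `X` directly, for example by restricting distance computations to the attributes both records have observed.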


2017 ◽  
Author(s):  
Runmin Wei ◽  
Jingye Wang ◽  
Mingming Su ◽  
Erik Jia ◽  
Tianlu Chen ◽  
...  

Abstract Introduction: Missing values are widespread in mass-spectrometry (MS) based metabolomics data. Various methods have been applied for handling them, but the choice of method can significantly affect subsequent data analyses and interpretations. By definition, there are three types of missing values: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Objectives: The aim of this study was to comprehensively compare common imputation methods for the different types of missing values, using two separate metabolomics data sets (977 and 198 serum samples, respectively), and to propose a strategy for dealing with missing values in metabolomics studies. Methods: The imputation methods included zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC). Normalized root mean squared error (NRMSE) and the NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy for MCAR/MAR and MNAR, respectively. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes sum of squared errors were used to evaluate the overall sample distribution. Student's t-test followed by Pearson correlation analysis was conducted to evaluate the effect of imputation on univariate statistical analysis. Results: Our findings demonstrated that RF imputation performed best for MCAR/MAR and that QRILC was favored for MNAR. Conclusion: Combined with the "modified 80% rule", we propose a comprehensive strategy and have developed a publicly accessible web tool for missing value imputation in metabolomics data.
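The NRMSE criterion used for the MCAR/MAR evaluation is easy to reproduce on synthetic data. The sketch below assumes scikit-learn; the correlated toy matrix and the mean-vs-kNN comparison are illustrative rather than the study's actual data or full method list.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(2)
# Toy matrix with correlated features, loosely mimicking a metabolite panel.
latent = rng.normal(size=(120, 1))
X_true = latent + 0.3 * rng.normal(size=(120, 6))

mask = rng.random(X_true.shape) < 0.2       # MCAR: remove entries at random
X_miss = np.where(mask, np.nan, X_true)

def nrmse(X_imp):
    # RMSE over the removed entries, normalized by their standard deviation.
    rmse = np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2))
    return rmse / np.std(X_true[mask])

nrmse_mean = nrmse(SimpleImputer(strategy="mean").fit_transform(X_miss))
nrmse_knn = nrmse(KNNImputer(n_neighbors=5).fit_transform(X_miss))
```

Because the features share a latent signal, a correlation-exploiting method such as kNN should score a lower NRMSE than mean imputation under MCAR.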


Author(s):  
Jesmeen Mohd Zebaral Hoque ◽  
Jakir Hossen ◽  
Shohel Sayeed ◽  
Chy. Mohammed Tawsif K. ◽  
Jaya Ganesan ◽  
...  

Recently, the healthcare industry has started generating large volumes of data. If hospitals can employ these data, they can predict outcomes and provide better treatment at early stages at low cost. Data analytics (DA) is used to make correct decisions through proper analysis and prediction. However, inappropriate data may lead to flawed analyses and thus yield unacceptable conclusions. Hence, transforming the improper data in a data set into useful data is essential. Machine learning (ML) techniques were used to overcome the issues caused by incomplete data. A new architecture, automatic missing value imputation (AMVI), was developed to predict missing values in the data set, incorporating data sampling and feature selection. Four prediction models (logistic regression, support vector machine (SVM), AdaBoost, and random forest) were selected from well-known classification algorithms. The complete AMVI architecture was evaluated using a structured data set obtained from the UCI repository, achieving accuracy of around 90%. Cross-validation also confirmed that the trained ML model is suitable and not overfitted. The trained model is developed from the data set itself and is not dependent on a specific environment; it trains and selects the best-performing model depending on the data available.
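A pipeline in the spirit of the description above, impute, train a candidate classifier, then cross-validate to check for overfitting, can be sketched as follows. This assumes scikit-learn and substitutes a built-in UCI-derived dataset with artificially injected missingness and a kNN imputer; none of these choices are claimed to match the AMVI architecture's actual components.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
X, y = load_breast_cancer(return_X_y=True)   # UCI-derived structured data set
X[rng.random(X.shape) < 0.05] = np.nan       # simulate incomplete records

# Impute inside the pipeline so each CV fold fits its own imputer; the
# cross-validated score then reflects performance on unseen data.
pipe = make_pipeline(KNNImputer(n_neighbors=5),
                     RandomForestClassifier(n_estimators=100, random_state=0))
scores = cross_val_score(pipe, X, y, cv=5)
mean_acc = scores.mean()
```

Placing the imputer inside the pipeline matters: imputing once on the full data before splitting would leak test-fold statistics into training.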


2020 ◽  
Author(s):  
Sundreen Asad Kamal ◽  
Changchang Yin ◽  
Buyue Qian ◽  
Ping Zhang

Background: The availability of massive amounts of data enables clinical predictive tasks, and deep learning methods have achieved promising performance on them. However, most existing methods suffer from three limitations: (i) real-value events contain many missing values; many methods impute the missing values and then train their models on the imputed values, which may introduce imputation bias and makes the models' performance highly dependent on imputation accuracy; (ii) many existing studies take only Boolean-value medical events (e.g., diagnosis codes) as inputs and ignore real-value medical events (e.g., lab tests and vital signs), which are more important for acute disease (e.g., sepsis) and mortality prediction; (iii) existing interpretable models can illustrate which medical events contribute to the output but cannot give the contributions of patterns among medical events. Methods: In this study, we propose a novel interpretable Pattern Attention model with Value Embedding (PAVE) to predict the risks of certain diseases. PAVE takes the embeddings of various medical events, their values, and their occurring times as inputs, and leverages a self-attention mechanism to attend to meaningful patterns among medical events for risk prediction tasks. Because only the observed values are embedded into vectors, we do not need to impute the missing values and thus avoid imputation bias. Moreover, the self-attention mechanism aids model interpretability: the proposed model can output which patterns cause high risks. Results: We conduct sepsis onset prediction and mortality prediction experiments on the publicly available MIMIC-III dataset and our proprietary EHR dataset. The experimental results show that PAVE outperforms existing models. Moreover, by analyzing the self-attention weights, our model outputs meaningful medical event patterns related to mortality.
Conclusions: PAVE learns effective medical event representations by incorporating values and occurring times, which improves risk prediction performance. Moreover, the presented self-attention mechanism not only captures patients' health state information but also outputs the contributions of various medical event patterns, paving the way for interpretable clinical risk predictions. Availability: The code for this paper is available at: https://github.com/yinchangchang/PAVE.
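The core mechanism in the Methods, embedding only observed (event, value, time) triples and attending over them, can be sketched in plain NumPy. This is a single-head toy with random stand-in embeddings and weights, not PAVE's trained components.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_events = 8, 5

# One patient: only the observed medical events are embedded, so absent
# (missing) lab values never enter the model and need no imputation.
event_emb = rng.normal(size=(n_events, d))
value_emb = 0.1 * rng.normal(size=(n_events, d))
time_emb = 0.1 * rng.normal(size=(n_events, d))
H = event_emb + value_emb + time_emb          # value- and time-aware inputs

# Single-head self-attention over the observed events.
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
Q, K, V = H @ Wq, H @ Wk, H @ Wv
scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)       # softmax: rows sum to 1
out = attn @ V                                # pattern-aware representations
```

The normalized `attn` matrix is what makes such a model inspectable: large weights indicate which event patterns a prediction attends to.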


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Heru Nugroho ◽  
Nugraha Priya Utama ◽  
Kridanto Surendro

Abstract Missing values are one of the factors that often cause incomplete data in almost all studies, even those that are well designed and controlled. They can also decrease a study's statistical power or result in inaccurate estimations and conclusions. Hence, data normalization and missing value handling are considered major problems in the data pre-processing stage, while classification algorithms are adopted to handle numerical features. In cases where the observed data contain outliers, the estimated results for missing values are sometimes unreliable or even differ greatly from the true values. Therefore, this study proposes combining normalization and outlier removal before imputing missing values with the class center-based firefly algorithm method (ON + C3FA). Moreover, some standard imputation techniques, mean, random value, regression, multiple imputation, KNN imputation, and decision tree (DT)-based missing value imputation, were used as comparisons for the proposed method. Experimental results on the sonar dataset showed the effect of normalization and outlier removal on the methods. With the proposed method (ON + C3FA), AUC, accuracy, F1-score, precision, recall, and AUC-PR were 0.972, 0.906, 0.906, 0.908, 0.906, and 0.61, respectively. The results showed that combining normalization and outlier removal in C3-FA (ON + C3FA) is an efficient technique for recovering actual data when handling missing values, and it also outperformed methods from previous studies, with r and RMSE values of 0.935 and 0.02. Meanwhile, the Dks value obtained with this technique was 0.04, indicating that it can maintain the values' distribution accuracy.
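The effect of normalizing and removing outliers before estimating a missing value can be seen in a small NumPy sketch. Min-max scaling and an IQR fence are used here as plausible stand-ins for the paper's pre-processing; the C3FA imputer itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(5)
# 95 well-behaved observations plus 5 injected outliers.
x = np.concatenate([rng.normal(10, 2, size=95), [60.0, 70.0, -40.0, 80.0, 90.0]])

# Min-max normalization, then IQR-based outlier removal.
x_norm = (x - x.min()) / (x.max() - x.min())
q1, q3 = np.percentile(x_norm, [25, 75])
fence = 1.5 * (q3 - q1)
keep = (x_norm >= q1 - fence) & (x_norm <= q3 + fence)
x_clean = x_norm[keep]

# A mean/center-style estimate for a missing entry is far less distorted
# once the injected outliers are gone.
estimate_raw = x_norm.mean()
estimate_clean = x_clean.mean()
typical = np.median(x_norm)                  # robust reference value
```

With the outliers removed, the estimate sits much closer to the bulk of the data, which is exactly the failure mode the abstract describes for imputation on outlier-contaminated data.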


2021 ◽  
Author(s):  
Hansle Gwon ◽  
Imjin Ahn ◽  
Yunha Kim ◽  
Hee Jun Kang ◽  
Hyeram Seo ◽  
...  

BACKGROUND When using machine learning in the real world, the missing value problem is the first problem encountered. Methods to impute missing values include statistical methods such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods such as the multilayer perceptron, k-nearest neighbors, and decision trees. OBJECTIVE The objective of this study was to impute numeric medical data such as physical and laboratory data. We aimed to effectively impute data using a progressive method called self-training in the medical field, where training data are scarce. METHODS In this paper, we propose a self-training method that gradually increases the available data. Models trained with complete data predict the missing values in incomplete data. Among the incomplete data, records whose missing values are validly predicted are incorporated into the complete data. Using a predicted value as the actual value is called pseudolabeling. This process is repeated until a stopping condition is satisfied. The most important part of this process is how to evaluate the accuracy of the pseudolabels, which can be done by observing the effect of the pseudolabeled data on the performance of the model. RESULTS In self-training using random forest (RF), the mean squared error was up to 12% lower than with pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was statistically confirmed. In the Friedman test performed on MICE and RF, self-training showed a P value between .003 and .02. A Wilcoxon signed-rank test performed on mean imputation showed the lowest possible P value, 3.05e-5, in all situations. CONCLUSIONS Self-training showed significant results in comparing predicted and actual values, but it needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.
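The loop described in the Methods, fit on complete rows, pseudolabel the most confidently predicted missing values, and absorb them into the training set, can be sketched with a random forest whose per-tree spread serves as the confidence score. The synthetic data, the 40% acceptance quantile, and the spread criterion are illustrative choices, not the paper's exact evaluation rule.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 3))
y_full = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=300)

missing = rng.random(300) < 0.5              # half the target column is missing
y = np.where(missing, np.nan, y_full)

for _ in range(5):
    known = ~np.isnan(y)
    if known.all():
        break
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[known], y[known])               # train on currently complete data
    idx = np.flatnonzero(np.isnan(y))
    per_tree = np.stack([t.predict(X[idx]) for t in rf.estimators_])
    pred, spread = per_tree.mean(axis=0), per_tree.std(axis=0)
    confident = spread <= np.quantile(spread, 0.4)   # accept the steadiest 40%
    y[idx[confident]] = pred[confident]      # pseudolabels join the training set

filled = missing & ~np.isnan(y)
err = np.abs(y[filled] - y_full[filled]).mean()
```

As the abstract notes, the crux is the pseudolabel acceptance rule; here tree-ensemble disagreement is one simple proxy for prediction validity.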


2021 ◽  
Author(s):  
Heru Nugroho ◽  
Nugraha Priya Utama ◽  
Kridanto Surendro

Abstract Missing data are one of the factors that often cause incomplete data in research. Data normalization and missing value handling are considered major problems in the data pre-processing stage, while classification algorithms are adopted to handle numerical features. Furthermore, in cases where the observed data contain outliers, the estimated results for missing values are sometimes unreliable or even differ greatly from the true values. This study proposes combining normalization and outlier removal before imputing missing values using several methods: mean, random value, regression, multiple imputation, KNN, and C3-FA. Experimental results on the sonar dataset show the effect of normalization and outlier removal on these imputation methods. The proposed C3-FA method produced accuracy, F1-score, precision, and recall values of 0.906, 0.906, 0.908, and 0.906, respectively. Based on the KNN classifier evaluation results, this outperformed the other five methods. Meanwhile, the RMSE, Dks, and r values obtained from combining normalization and outlier removal in the C3-FA method were 0.02, 0.04, and 0.935, respectively. This shows that the proposed method is able to reproduce the real values of the data (prediction accuracy) and to maintain the distribution of the values (distribution accuracy).


10.2196/30824 ◽  
2021 ◽  
Vol 7 (10) ◽  
pp. e30824
Author(s):  
Hansle Gwon ◽  
Imjin Ahn ◽  
Yunha Kim ◽  
Hee Jun Kang ◽  
Hyeram Seo ◽  
...  

Background When using machine learning in the real world, the missing value problem is the first problem encountered. Methods to impute missing values include statistical methods such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods such as the multilayer perceptron, k-nearest neighbors, and decision trees. Objective The objective of this study was to impute numeric medical data such as physical and laboratory data. We aimed to effectively impute data using a progressive method called self-training in the medical field, where training data are scarce. Methods In this paper, we propose a self-training method that gradually increases the available data. Models trained with complete data predict the missing values in incomplete data. Among the incomplete data, records whose missing values are validly predicted are incorporated into the complete data. Using a predicted value as the actual value is called pseudolabeling. This process is repeated until a stopping condition is satisfied. The most important part of this process is how to evaluate the accuracy of the pseudolabels, which can be done by observing the effect of the pseudolabeled data on the performance of the model. Results In self-training using random forest (RF), the mean squared error was up to 12% lower than with pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was statistically confirmed. In the Friedman test performed on MICE and RF, self-training showed a P value between .003 and .02. A Wilcoxon signed-rank test performed on mean imputation showed the lowest possible P value, 3.05e-5, in all situations. Conclusions Self-training showed significant results in comparing predicted and actual values, but it needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.


PLoS ONE ◽  
2020 ◽  
Vol 15 (12) ◽  
pp. e0243487
Author(s):  
Michael Lenz ◽  
Andreas Schulz ◽  
Thomas Koeck ◽  
Steffen Rapp ◽  
Markus Nagler ◽  
...  

Targeted proteomics utilizing antibody-based proximity extension assays provides sensitive and highly specific quantification of plasma protein levels. Multivariate analysis of these data is hampered by frequent missing values (random or left-censored), calling for imputation approaches. While appropriate missing-value imputation methods exist, benchmarks of their performance on targeted proteomics data are lacking. Here, we assessed the performance of two methods for imputing values missing completely at random: the previously top-benchmarked 'missForest' and the recently published 'GSimp' method. Evaluation was accomplished by comparing imputed with remeasured relative concentrations of 91 inflammation-related circulating proteins in 86 samples from a cohort of 645 patients with venous thromboembolism. The median Pearson correlation between imputed and remeasured protein expression values was 69.0% for missForest and 71.6% for GSimp (p = 5.8e-4). Imputation with missForest resulted in a stronger reduction of variance than GSimp (median relative variance of 25.3% vs. 68.6%, p = 2.4e-16) and an undesired larger bias in downstream analyses. Irrespective of the imputation method used, the 91 imputed proteins revealed large variations in imputation accuracy, driven by differences in signal-to-noise ratio and information overlap between proteins. In summary, GSimp outperformed missForest, while both methods show good overall imputation accuracy with large variations between proteins.
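The study's per-protein evaluation, correlating imputed with re-measured values, can be mimicked on synthetic data. kNN imputation stands in here for missForest/GSimp (both R packages), and the latent-signal toy only illustrates why signal-to-noise and information overlap drive per-protein accuracy.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(7)
n, p = 86, 10
# Shared latent signal plus protein-specific noise: information overlap
# between "proteins" falls as the noise level rises across columns.
latent = rng.normal(size=(n, 1))
noise_sd = np.linspace(0.2, 2.0, p)
X_true = latent + noise_sd * rng.normal(size=(n, p))

mask = rng.random((n, p)) < 0.2              # values missing completely at random
X_imp = KNNImputer(n_neighbors=5).fit_transform(np.where(mask, np.nan, X_true))

# Per-protein Pearson r between imputed and true values at the masked entries.
r = np.array([np.corrcoef(X_imp[mask[:, j], j], X_true[mask[:, j], j])[0, 1]
              for j in range(p)])
```

Noisier columns impute worse, which is the between-protein variation in accuracy that the benchmark reports.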


2020 ◽  
Vol 20 (S11) ◽  
Author(s):  
Sundreen Asad Kamal ◽  
Changchang Yin ◽  
Buyue Qian ◽  
Ping Zhang

Abstract Background The availability of massive amounts of data enables clinical predictive tasks, and deep learning methods have achieved promising performance on them. However, most existing methods suffer from three limitations: (1) real-value events contain many missing values; many methods impute the missing values and then train their models on the imputed values, which may introduce imputation bias and makes the models' performance highly dependent on imputation accuracy; (2) many existing studies take only Boolean-value medical events (e.g., diagnosis codes) as inputs and ignore real-value medical events (e.g., lab tests and vital signs), which are more important for acute disease (e.g., sepsis) and mortality prediction; (3) existing interpretable models can illustrate which medical events contribute to the output but cannot give the contributions of patterns among medical events. Methods In this study, we propose a novel interpretable Pattern Attention model with Value Embedding (PAVE) to predict the risks of certain diseases. PAVE takes the embeddings of various medical events, their values, and their occurring times as inputs, and leverages a self-attention mechanism to attend to meaningful patterns among medical events for risk prediction tasks. Because only the observed values are embedded into vectors, we do not need to impute the missing values and thus avoid imputation bias. Moreover, the self-attention mechanism aids model interpretability: the proposed model can output which patterns cause high risks. Results We conduct sepsis onset prediction and mortality prediction experiments on the publicly available MIMIC-III dataset and our proprietary EHR dataset. The experimental results show that PAVE outperforms existing models. Moreover, by analyzing the self-attention weights, our model outputs meaningful medical event patterns related to mortality.
Conclusions PAVE learns effective medical event representations by incorporating values and occurring times, which improves risk prediction performance. Moreover, the presented self-attention mechanism not only captures patients' health state information but also outputs the contributions of various medical event patterns, paving the way for interpretable clinical risk predictions. Availability The code for this paper is available at: https://github.com/yinchangchang/PAVE.

