scholarly journals Data imputation using a trust network for recommendation via matrix factorization

2018 ◽  
Vol 15 (2) ◽  
pp. 347-368 ◽  
Author(s):  
Won-Seok Hwang ◽  
Shaoyu Li ◽  
Sang-Wook Kim ◽  
Kichun Lee

Existing recommendation methods suffer from the data sparsity problem which means that most of users have rated only a very small number of items, often resulting in low recommendation accuracy. In addition, for cold start users evaluating only few items, rating predictions with the methods also produce low accuracy. To address these problems, we propose a novel data imputation method that effectively substitutes missing ratings with probable values (i.e., imputed values). Our method successfully improves accuracy of recommendation methods from the following three aspects: (1) exploiting a trust network, (2) imputing only a part of missing ratings, and (3) applying them to any recommendation methods. Our method employs a bidirectional connection structure within a distance level for finding reliable users in exploiting a trust network as useful information. In addition, our method imputes only some missing ratings, called fillable ratings, whose imputed values are expected to be accurate with a sufficient level of confidence. Moreover, our imputation method is independent of, thus applicable to, any recommendation methods that may include application-specific ones and the most accurate one in each domain. We conduct experiments on three real-life datasets which arise from Epinions and Ciao. Our experimental results demonstrate that our method has recommendation accuracy better than existing recommendation methods equipped with imputation methods or trust networks, especially for cold start users.

Symmetry ◽  
2020 ◽  
Vol 12 (11) ◽  
pp. 1792
Author(s):  
Shu-Fen Huang ◽  
Ching-Hsue Cheng

Medical data usually have missing values; hence, imputation methods have become an important issue. In previous studies, many imputation methods based on variable data had a multivariate normal distribution, such as expectation-maximization and regression-based imputation. These assumptions may lead to deviations in the results, which sometimes create a bottleneck. In addition, directly deleting instances with missing values may have several problems, such as losing important data, producing invalid research samples, and leading to research deviations. Therefore, this study proposed a safe-region imputation method for handling medical data with missing values; we also built a medical prediction model and compared the removed missing values with imputation methods in terms of the generated rules, accuracy, and AUC. First, this study used the kNN imputation, multiple imputation, and the proposed imputation to impute the missing data and then applied four attribute selection methods to select the important attributes. Then, we used the decision tree (C4.5), random forest, REP tree, and LMT classifier to generate the rules, accuracy, and AUC for comparison. Because there were four datasets with imbalanced classes (asymmetric classes), the AUC was an important criterion. In the experiment, we collected four open medical datasets from UCI and one international stroke trial dataset. The results show that the proposed safe-region imputation is better than the listing imputation methods and after imputing offers better results than directly deleting instances with missing values in the number of rules, accuracy, and AUC. These results will provide a reference for medical stakeholders.


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5947
Author(s):  
Liang Zhang

Building operation data are important for monitoring, analysis, modeling, and control of building energy systems. However, missing data is one of the major data quality issues, making data imputation techniques become increasingly important. There are two key research gaps for missing sensor data imputation in buildings: the lack of customized and automated imputation methodology, and the difficulty of the validation of data imputation methods. In this paper, a framework is developed to address these two gaps. First, a validation data generation module is developed based on pattern recognition to create a validation dataset to quantify the performance of data imputation methods. Second, a pool of data imputation methods is tested under the validation dataset to find an optimal single imputation method for each sensor, which is termed as an ensemble method. The method can reflect the specific mechanism and randomness of missing data from each sensor. The effectiveness of the framework is demonstrated by 18 sensors from a real campus building. The overall accuracy of data imputation for those sensors improves by 18.2% on average compared with the best single data imputation method.


2021 ◽  
Vol 7 ◽  
pp. e619
Author(s):  
Khaled M. Fouad ◽  
Mahmoud M. Ismail ◽  
Ahmad Taher Azar ◽  
Mona M. Arafa

The real-world data analysis and processing using data mining techniques often are facing observations that contain missing values. The main challenge of mining datasets is the existence of missing values. The missing values in a dataset should be imputed using the imputation method to improve the data mining methods’ accuracy and performance. There are existing techniques that use k-nearest neighbors algorithm for imputing the missing values but determining the appropriate k value can be a challenging task. There are other existing imputation techniques that are based on hard clustering algorithms. When records are not well-separated, as in the case of missing data, hard clustering provides a poor description tool in many cases. In general, the imputation depending on similar records is more accurate than the imputation depending on the entire dataset's records. Improving the similarity among records can result in improving the imputation performance. This paper proposes two numerical missing data imputation methods. A hybrid missing data imputation method is initially proposed, called KI, that incorporates k-nearest neighbors and iterative imputation algorithms. The best set of nearest neighbors for each missing record is discovered through the records similarity by using the k-nearest neighbors algorithm (kNN). To improve the similarity, a suitable k value is estimated automatically for the kNN. The iterative imputation method is then used to impute the missing values of the incomplete records by using the global correlation structure among the selected records. An enhanced hybrid missing data imputation method is then proposed, called FCKI, which is an extension of KI. It integrates fuzzy c-means, k-nearest neighbors, and iterative imputation algorithms to impute the missing data in a dataset. The fuzzy c-means algorithm is selected because the records can belong to multiple clusters at the same time. This can lead to further improvement for similarity. FCKI searches a cluster, instead of the whole dataset, to find the best k-nearest neighbors. It applies two levels of similarity to achieve a higher imputation accuracy. The performance of the proposed imputation techniques is assessed by using fifteen datasets with variant missing ratios for three types of missing data; MCAR, MAR, MNAR. These different missing data types are generated in this work. The datasets with different sizes are used in this paper to validate the model. Therefore, proposed imputation techniques are compared with other missing data imputation methods by means of three measures; the root mean square error (RMSE), the normalized root mean square error (NRMSE), and the mean absolute error (MAE). The results show that the proposed methods achieve better imputation accuracy and require significantly less time than other missing data imputation methods.


Author(s):  
Mehmet S. Aktaş ◽  
Sinan Kaplan ◽  
Hasan Abacı ◽  
Oya Kalipsiz ◽  
Utku Ketenci ◽  
...  

Missing data is a common problem for data clustering quality. Most real-life datasets have missing data, which in turn has some effect on clustering tasks. This chapter investigates the appropriate data treatment methods for varying missing data scarcity distributions including gamma, Gaussian, and beta distributions. The analyzed data imputation methods include mean, hot-deck, regression, k-nearest neighbor, expectation maximization, and multiple imputation. To reveal the proper methods to deal with missing data, data mining tasks such as clustering is utilized for evaluation. With the experimental studies, this chapter identifies the correlation between missing data imputation methods and missing data distributions for clustering tasks. The results of the experiments indicated that expectation maximization and k-nearest neighbor methods provide best results for varying missing data scarcity distributions.


Missing data imputation is essential task becauseremoving all records with missing values will discard useful information from other attributes. This paper estimates the performanceof prediction for autism dataset with imputed missing values. Statistical imputation methods like mean, imputation with zero or constant and machine learning imputation methods like K-nearest neighbour chained Equation methods were compared with the proposed deep learning imputation method. The predictions of patients with autistic spectrum disorder were measured using support vector machine for imputed dataset. Among the imputation methods, Deeplearningalgorithm outperformed statistical and machine learning imputation methods. The same is validated using significant difference in p values revealed using Friedman’s test


2020 ◽  
Vol 42 ◽  
pp. e37
Author(s):  
Elisandra Lucia Moro Stochero ◽  
Luciane Flores Jacobi ◽  
Alessandro Dal’Col Lúcio

Unanticipated events often occur in the development of an experiment, especially in field experiments, often causing data loss. Through the present study, we sought to verify whether there is a difference in the result of the analysis of variance (ANOVA) test for unbalanced data in the Completely Randomized Design (DIC) when the inclusion of data obtained from the data imputation technique was performed of Predictive Average. Experiments were simulated in the DIC with 5 treatments and 10 repetitions, generating complete databases. From each database, 10% of the plots were removed and after the imputation method was applied, comparing the ANOVA results in each step. The imputation yielded acceptable results, but not better than those obtained when performing the specific ANOVA test for unbalanced data.


2021 ◽  
Author(s):  
Ching-Hsue Cheng ◽  
Shu-Fen Huang

Abstract Nowadays, people pay increasing attention to health, and the integrity of medical records has been put into focus. Recently, medical data imputation has become a very active field because medical data usually have missing values. Many imputation methods have been proposed, but many model-based imputation methods such as expectation-maximization and regression-based imputation based on the variables data have a multivariate normal distribution, which assumption can lead to biased results. Sometimes this becomes a bottleneck, such as computationally more complex than model-free methods. Furthermore, directly remove instances with missing values, this approach has several problems, and it is possible to lose the important data, produce ineffective research samples, and cause research deviations, and so on. Therefore, this study proposes a novel clustering-based purity and distance imputation method to improve the handling of missing values. In the experiment, we collected eight different medical datasets to compare the proposed imputation methods with the listed imputation methods with regard to the results of different situations. In imputation measures, the area under the curve (AUC) is used to evaluate the performance of the imbalanced class datasets in MAR and MCAR experiments, and accuracy is applied to measure its performance of the balanced class in MNAR experiment. Finally, the root-mean-square error (RMSE) is also used to compare the proposed and the listing imputation methods. In addition, this study utilized the elbow method and the average silhouette method to find the optimal number of clusters for all datasets. Results showed that the proposed imputation method could improve imputation performance in the accuracy, AUC, and RMSE of different missing degrees and missing types.


2021 ◽  
Vol 29 (2) ◽  
Author(s):  
Nurul Azifah Mohd Pauzi ◽  
Yap Bee Wah ◽  
Sayang Mohd Deni ◽  
Siti Khatijah Nor Abdul Rahim ◽  
Suhartono

High quality data is essential in every field of research for valid research findings. The presence of missing data in a dataset is common and occurs for a variety of reasons such as incomplete responses, equipment malfunction and data entry error. Single and multiple data imputation methods have been developed for data imputation of missing values. This study investigated the performance of single imputation using mean and multiple imputation method using Multivariate Imputation by Chained Equations (MICE) via a simulation study. The MCAR which means missing completely at random were generated randomly for ten levels of missing rates (proportion of missing data): 5% to 50% for different sample sizes. Mean Square Error (MSE) was used to evaluate the performance of the imputation methods. Data imputation method depends on data types. Mean imputation is commonly used to impute missing values for continuous variable while MICE method can handle both continuous and categorical variables. The simulation results indicate that group mean imputation (GMI) performed better compared to overall mean imputation (OMI) and MICE with lowest value of MSE for all sample sizes and missing rates. The MSE of OMI, GMI, and MICE increases when missing rate increases. The MICE method has the lowest performance (i.e. highest MSE) when percentage of missing rates is more than 15%. Overall, GMI is more superior compared to OMI and MICE for all missing rates and sample size for MCAR mechanism. An application to a real dataset confirmed the findings of the simulation results. The findings of this study can provide knowledge to researchers and practitioners on which imputation method is more suitable when the data involves missing data.


Sign in / Sign up

Export Citation Format

Share Document