A Comparative Study on TIBA Imputation Methods in FCMdd-Based Linear Clustering with Relational Data

2011 ◽  
Vol 2011 ◽  
pp. 1-10 ◽  
Author(s):  
Takeshi Yamamoto ◽  
Katsuhiro Honda ◽  
Akira Notsu ◽  
Hidetomo Ichihashi

Relational fuzzy clustering has been developed for extracting intrinsic cluster structures from relational data and was extended to a linear fuzzy clustering model based on the Fuzzy c-Medoids (FCMdd) concept, in which a Fuzzy c-Means (FCM)-like iterative algorithm is performed by defining linear cluster prototypes using two representative medoids for each linear prototype. In this paper, the FCMdd-type linear clustering model is further modified to handle incomplete data including missing values, and the applicability of several imputation methods is compared. Numerical experiments demonstrate that some pre-imputation strategies contribute to properly selecting the representative medoids of each cluster.
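The pre-imputation idea can be sketched minimally: fill the missing dissimilarities first, then run the FCMdd medoid-selection step on the completed matrix. The mean-fill strategy and the toy matrix below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def mean_impute(D):
    """Fill missing dissimilarities (NaN) with the mean of the observed
    off-diagonal entries. One conceivable pre-imputation strategy."""
    D = D.astype(float).copy()
    mask = np.isnan(D)
    off_diag = ~np.eye(len(D), dtype=bool)
    D[mask] = np.nanmean(D[off_diag])
    np.fill_diagonal(D, 0.0)
    return D

def select_medoid(D, members):
    """FCMdd-style step: the medoid of a (crisp) cluster is the member
    object minimizing the summed dissimilarity to the other members."""
    sub = D[np.ix_(members, members)]
    return members[int(np.argmin(sub.sum(axis=1)))]

# Toy relational (dissimilarity) matrix with two missing entries.
D = np.array([[0.0,    1.0, 2.0, np.nan],
              [1.0,    0.0, 1.0, 5.0],
              [2.0,    1.0, 0.0, 4.0],
              [np.nan, 5.0, 4.0, 0.0]])
Dc = mean_impute(D)
m = select_medoid(Dc, [0, 1, 2, 3])  # medoid chosen on the completed matrix
```

With the missing entries filled in first, the medoid selection is well defined over all objects, which is the point the pre-imputation strategies address.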

Author(s):  
Katsuhiro Honda ◽  
Yoshihito Nakamura ◽  
Hidetomo Ichihashi

This paper proposes the simultaneous application of homogeneity analysis and fuzzy clustering to incomplete data. Taking into account the similarity between the loss function of homogeneity analysis and the least squares criterion of principal component analysis, we define a new objective function in a formulation similar to linear fuzzy clustering with missing values. Numerical experiments demonstrate the feasibility of the proposed method.


Author(s):  
Takeshi Yamamoto ◽  
Katsuhiro Honda ◽  
Akira Notsu ◽  
Hidetomo Ichihashi

Relational data is common in many real-world applications. Linear fuzzy clustering models have been extended to handle relational data based on the Fuzzy c-Medoids (FCMdd) framework. In this paper, with the goal of handling non-Euclidean data, the β-spread transformation of relational data matrices used in Non-Euclidean-type Relational Fuzzy (NERF) c-Means is applied before FCMdd-type linear cluster extraction. The β-spread transformation modifies the data elements so that the clustering criterion, the distances between objects and linear prototypes, avoids negative values. In numerical experiments, typical features of the proposed approach are demonstrated not only on artificially generated data but also in a document classification task with a document-keyword co-occurrence relation.
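The β-spread transformation itself is simple: it adds β to every off-diagonal entry of the dissimilarity matrix while leaving the zero diagonal intact, as in NERF c-Means. A minimal sketch (the β value here is arbitrary; NERF c-Means derives the smallest sufficient β from the data during iteration):

```python
import numpy as np

def beta_spread(D, beta):
    """NERF-style beta-spread: add beta to every off-diagonal entry of
    the relational (dissimilarity) matrix; the diagonal stays zero."""
    n = len(D)
    return D + beta * (np.ones((n, n)) - np.eye(n))

D = np.array([[0.0, 1.0],
              [1.0, 0.0]])
Db = beta_spread(D, 0.5)  # off-diagonal entries shift from 1.0 to 1.5
```

Shifting all off-diagonal dissimilarities upward by a sufficiently large β is what prevents the negative prototype-to-object distances that non-Euclidean relational matrices can otherwise produce.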


Author(s):  
Yuchi Kanzawa

In this paper, an entropy-regularized fuzzy clustering approach for non-Euclidean relational data and indefinite kernel data, which has not previously been discussed, is developed. This matters because relational data and kernel data are not always Euclidean and positive semi-definite, respectively. It is shown theoretically that the entropy-regularized approach can be applied to both non-Euclidean relational data and indefinite kernel data without using a β-spread transformation, and that two other options make the clustering results crisp for both data types. These results contrast with those of the standard approach. Numerical experiments verify the theoretical results and compare the clustering accuracy of three entropy-regularized approaches for non-Euclidean relational data and three for indefinite kernel data.
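The core of an entropy-regularized approach is its membership update, in which memberships are proportional to exp(−d/λ) rather than the inverse-distance form of standard fuzzy c-means. A generic sketch of that update, not Kanzawa's full relational or kernel algorithm:

```python
import numpy as np

def entropy_memberships(d, lam):
    """Entropy-regularized membership update: u_ik proportional to
    exp(-d_ik / lam), row-normalized so each object's memberships sum
    to 1. Small lam approaches crisp assignment; large lam flattens
    memberships toward uniform."""
    w = np.exp(-np.asarray(d, float) / lam)
    return w / w.sum(axis=1, keepdims=True)

# Rows: objects; columns: squared distances to two cluster prototypes.
d = np.array([[0.1, 2.0],
              [2.0, 0.1]])
U = entropy_memberships(d, 0.5)  # each object favors its nearer cluster
```

Because the update never divides by a distance, it remains well defined even when a non-Euclidean matrix yields awkward distance values, which is one intuition behind skipping the β-spread step.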


2021 ◽  
Vol 25 (4) ◽  
pp. 825-846
Author(s):  
Ahmad Jaffar Khan ◽  
Basit Raza ◽  
Ahmad Raza Shahid ◽  
Yogan Jaya Kumar ◽  
Muhammad Faheem ◽  
...  

Almost all real-world datasets contain missing values. Classification of data with missing values can adversely affect classifier performance if not handled correctly. A common approach to classification with incomplete data is imputation, which transforms incomplete data with missing values into complete data. Single imputation methods are mostly less accurate than multiple imputation methods, which are often computationally much more expensive. This study proposes an imputed feature selected bagging (IFBag) method, which uses multiple imputation, feature selection, and a bagging ensemble learning approach to construct a number of base classifiers that classify new incomplete instances without any need for imputation in the testing phase. In bagging, the data is resampled multiple times with replacement, which introduces diversity into the data and thus yields more accurate classifiers. The experimental results show the proposed IFBag method is considerably faster and achieves 97.26% accuracy for classification with incomplete data compared to commonly used methods.
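The IFBag pipeline can be sketched schematically: build several stochastically imputed copies of the training data, bootstrap-resample each, fit one base classifier per bag, and combine predictions by majority vote, with incomplete test instances classified on their observed features only. Every concrete choice below (noisy mean imputation standing in for proper multiple imputation, a nearest-centroid base learner, and no feature-selection step) is an illustrative assumption, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_once(X, rng):
    """One stochastic mean imputation: fill NaNs with the column mean plus
    Gaussian noise, so repeated calls yield distinct completed datasets."""
    X = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        miss = np.isnan(col)
        col[miss] = np.nanmean(col) + rng.normal(0, np.nanstd(col), miss.sum())
    return X

def fit_centroids(X, y):
    """Base learner: per-class centroids (deliberately simple)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_incomplete(centroids, x):
    """Classify an incomplete instance using only its observed features,
    so no imputation is needed at test time."""
    obs = ~np.isnan(x)
    return min(centroids, key=lambda c: np.sum((centroids[c][obs] - x[obs]) ** 2))

def ifbag_predict(X, y, x_new, n_bags=5):
    votes = []
    for _ in range(n_bags):
        Xi = impute_once(X, rng)
        idx = rng.integers(0, len(Xi), len(Xi))  # bootstrap: with replacement
        votes.append(predict_incomplete(fit_centroids(Xi[idx], y[idx]), x_new))
    return max(set(votes), key=votes.count)      # majority vote

X = np.array([[0.0, 0.1], [0.1, np.nan], [5.0, 5.1], [np.nan, 5.0]])
y = np.array([0, 0, 1, 1])
pred = ifbag_predict(X, y, np.array([np.nan, 5.2]))  # incomplete test instance
```

The point of the construction is the last line: the test instance still contains a missing value, yet each base classifier can score it on the observed dimensions alone.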


2020 ◽  
Vol 2020 ◽  
pp. 1-11
Author(s):  
Kamran Mehrabani-Zeinabad ◽  
Marziyeh Doostfatemeh ◽  
Seyyed Mohammad Taghi Ayatollahi

Missing data is one of the most important causes of reduced classification accuracy. Many real datasets suffer from missing values, especially in the medical sciences. Imputation is a common way to deal with incomplete datasets. Various imputation methods can be applied, and the choice of the best method depends on dataset conditions such as sample size, missing percentage, and missing mechanism. Therefore, the better solution is to classify incomplete datasets without imputation and without any loss of information. The structure of the Bayesian Additive Regression Trees (BART) model is improved with the Missingness Incorporated in Attributes (MIA) approach to address its inefficiency in handling missingness. The implementation of MIA within BART is named BART.m. As the abilities of BART.m in classifying incomplete datasets had not been investigated, this simulation-based study aimed to provide such a resource. The results indicate that BART.m can be used even for datasets with 90% missing values and, more importantly, that it diagnoses irrelevant variables and removes them on its own. BART.m outperforms common models for classification with incomplete data in terms of accuracy and computational time. Based on these properties, BART.m is a high-accuracy model for classifying incomplete datasets that avoids extra assumptions and preprocessing steps.
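The MIA idea can be approximated at the data level: duplicate each feature that has missing entries so that, at any tree split, missing cases can be routed to either branch. A simplified illustration of the concept, not the BART.m machinery itself:

```python
import numpy as np

def mia_augment(X):
    """Missingness Incorporated in Attributes, approximated at the data
    level: each feature with missing entries is duplicated, once with NaN
    mapped below all observed values (missing goes left at any split) and
    once above (missing goes right), so a standard tree learner can choose
    the better routing. An illustrative sketch only."""
    cols = []
    for j in range(X.shape[1]):
        col = X[:, j].astype(float)
        if np.isnan(col).any():
            lo, hi = np.nanmin(col), np.nanmax(col)
            cols.append(np.where(np.isnan(col), lo - 1.0, col))  # missing -> left
            cols.append(np.where(np.isnan(col), hi + 1.0, col))  # missing -> right
        else:
            cols.append(col)
    return np.column_stack(cols)

X = np.array([[1.0,    7.0],
              [np.nan, 8.0],
              [3.0,    9.0]])
Xa = mia_augment(X)  # feature 0 becomes two columns; feature 1 stays one
```

This is why MIA-style models need no imputation step: missingness becomes part of the split search itself rather than something to be filled in beforehand.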


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Liang Jin ◽  
Yingtao Bi ◽  
Chenqi Hu ◽  
Jun Qu ◽  
Shichen Shen ◽  
...  

The presence of missing values (MVs) in label-free quantitative proteomics greatly reduces the completeness of data. Imputation has been widely utilized to handle MVs, and selection of the proper method is critical for the accuracy and reliability of imputation. Here we present a comparative study that evaluates the performance of seven popular imputation methods on a large-scale benchmark dataset and an immune cell dataset. Simulated MVs were incorporated into the complete part of each dataset with different combinations of MV rates and missing-not-at-random (MNAR) rates. Normalized root mean square error (NRMSE) was applied to evaluate the accuracy of protein abundances and intergroup protein ratios after imputation. Detection of true positives (TPs) and the false altered-protein discovery rate (FADR) between groups were also compared using the benchmark dataset. Furthermore, the accuracy of handling real MVs was assessed by comparing enriched pathways and signature genes of cell activation after imputing the immune cell dataset. We observed that the accuracy of imputation is primarily affected by the MNAR rate rather than the MV rate, and that downstream analysis can be largely impacted by the selection of imputation methods. A random forest-based imputation method consistently outperformed other popular methods by achieving the lowest NRMSE, a high number of TPs with an average FADR < 5%, and the best detection of relevant pathways and signature genes, highlighting it as the most suitable method for label-free proteomics.
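The NRMSE evaluation metric can be sketched directly; note that normalization conventions vary across papers, and the standard-deviation normalization below is one common choice, assumed here for illustration:

```python
import numpy as np

def nrmse(true, imputed):
    """Normalized root mean square error between true and imputed values,
    normalized by the standard deviation of the true values (one common
    convention among several). Lower is better; 0 means perfect recovery."""
    true, imputed = np.asarray(true, float), np.asarray(imputed, float)
    return float(np.sqrt(np.mean((true - imputed) ** 2)) / np.std(true))

truth = np.array([1.0, 2.0, 3.0, 4.0])
val = nrmse(truth, truth)  # perfect imputation scores 0
```

In a benchmark like the one described, the `true` values are the entries that were masked out to simulate MVs, and `imputed` is what each method fills back in.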


2017 ◽  
Vol 10 (04) ◽  
pp. 773-779
Author(s):  
V.B. Kamble ◽  
S.N. Deshmukh

The presence of missing values in a dataset makes data analysis difficult in data mining tasks. In this work, a student dataset containing marks in four different subjects at an engineering college is used. Mean, mode, and median imputation are applied to deal with the challenges of incomplete data. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are computed on the dataset for the proposed method and for the mean, mode, and median imputation methods, and classification accuracy is also measured for the proposed method combined with each imputation technique. Experimental observation shows that MSE and RMSE gradually decrease as the database size increases under the proposed method, whereas they gradually increase with database size under the simple imputation techniques. Accuracy also increases as the database size grows.
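The three simple imputation strategies and an RMSE comparison can be sketched on a toy marks column (the data below is invented for illustration):

```python
import numpy as np

def impute(col, strategy):
    """Fill NaNs in a 1-D array of marks with the column mean, median,
    or mode (most frequent observed value)."""
    col = col.astype(float).copy()
    obs = col[~np.isnan(col)]
    if strategy == "mean":
        fill = obs.mean()
    elif strategy == "median":
        fill = np.median(obs)
    else:  # mode
        vals, counts = np.unique(obs, return_counts=True)
        fill = vals[np.argmax(counts)]
    col[np.isnan(col)] = fill
    return col

def rmse(true, est):
    return float(np.sqrt(np.mean((np.asarray(true) - np.asarray(est)) ** 2)))

marks = np.array([60.0, 70.0, 70.0, np.nan, 90.0])  # one missing mark
truth = np.array([60.0, 70.0, 70.0, 80.0, 90.0])    # ground truth for scoring
errors = {s: rmse(truth, impute(marks, s)) for s in ("mean", "median", "mode")}
```

Comparing the entries of `errors` against a ground-truth column is the same style of MSE/RMSE evaluation the abstract describes, just at toy scale.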


Author(s):  
Katsuhiro Honda ◽  
Takeshi Yamamoto ◽  
Akira Notsu ◽  
Hidetomo Ichihashi

Visualization is a fundamental approach for revealing intrinsic structures in multidimensional observations. This paper considers visualization of non-Euclidean relational data by extracting local linear substructures. In order to extract robust linear clusters, an FCMdd-based linear fuzzy clustering model is applied in conjunction with a robust measure from alternative c-means. Non-Euclidean data matrices are handled with the β-spread transformation in a manner similar to that of NERF c-Means. In several experiments, robust feature maps derived by the robust clustering model are compared with feature maps given by the conventional clustering model and Multi-Dimensional Scaling (MDS).


2021 ◽  
Author(s):  
Heru Nugroho ◽  
Nugraha Priya Utama ◽  
Kridanto Surendro

Missing data is one of the factors that often causes incomplete data in research. Data normalization and missing value handling are considered major problems in the data pre-processing stage, while classification algorithms are adopted to handle numerical features. Furthermore, when the observed data contains outliers, the estimated missing values are sometimes unreliable, or even differ greatly from the true values. This study proposes combining normalization and outlier removal before imputing missing values with several methods: mean, random value, regression, multiple imputation, KNN, and C3-FA. Experimental results on the sonar dataset show the effect of normalization and outlier removal on these imputation methods. With the proposed C3-FA method, this produced accuracy, F1-score, precision, and recall values of 0.906, 0.906, 0.908, and 0.906, respectively. Based on the KNN classifier evaluation, these values outperformed the other five methods. Meanwhile, the RMSE, Dks, and r values obtained by combining normalization and outlier removal with the C3-FA method were 0.02, 0.04, and 0.935, respectively. This shows that the proposed method is able to reproduce the real values of the data (prediction accuracy) and maintain the distribution of the values (distribution accuracy).
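The preprocessing order described, normalize, then remove outliers, then impute, can be sketched with min-max normalization, IQR-based outlier removal, and a simple mean imputer standing in for the paper's methods (all three concrete choices are assumptions for illustration):

```python
import numpy as np

def minmax_normalize(x):
    """Min-max normalization to [0, 1] over the observed (non-NaN) entries."""
    lo, hi = np.nanmin(x), np.nanmax(x)
    return (x - lo) / (hi - lo)

def drop_outliers(x, k=1.5):
    """Mark IQR outliers as NaN so the imputer re-estimates them
    (one plausible reading of 'outlier removal before imputation')."""
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    x = x.copy()
    x[(x < q1 - k * iqr) | (x > q3 + k * iqr)] = np.nan
    return x

def mean_impute(x):
    x = x.copy()
    x[np.isnan(x)] = np.nanmean(x)
    return x

raw = np.array([0.2, 0.3, np.nan, 0.25, 9.0])  # 9.0 is an outlier
clean = mean_impute(drop_outliers(minmax_normalize(raw)))
```

Turning outliers into missing values and letting the imputer re-estimate them keeps the fill-in step from being dragged toward extreme observations, which is the motivation the abstract gives for the combined preprocessing.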

