A General Framework for Mixed and Incomplete Data Clustering Based on Swarm Intelligence Algorithms

Yenny Villuendas-Rey; Eley Barroso-Cubas; Oscar Camacho-Nieto; Cornelio Yáñez-Márquez

doi:10.3390/math9070786

Mathematics ◽

10.3390/math9070786 ◽

2021 ◽

Vol 9 (7) ◽

pp. 786

Author(s):

Yenny Villuendas-Rey ◽

Eley Barroso-Cubas ◽

Oscar Camacho-Nieto ◽

Cornelio Yáñez-Márquez

Keyword(s):

Swarm Intelligence ◽

Data Clustering ◽

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Bat Algorithm ◽

Hybrid Features ◽

Bee Colony ◽

Learning Tasks ◽

Clustering Data

Swarm intelligence has emerged as an active field for solving numerous machine-learning tasks. In this paper, we address the problem of clustering data with missing values, where the patterns are described by mixed (or hybrid) features. We introduce a generic modification to three swarm intelligence algorithms (Artificial Bee Colony, Firefly Algorithm, and Novel Bat Algorithm). We experimentally determine adequate parameter values for these three modified algorithms in order to apply them to the clustering task. We also provide an unbiased comparison among several metaheuristic-based clustering algorithms, concluding that the clusters obtained by our proposals are highly representative of the “natural structure” of the data.
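Any clustering of mixed and incomplete data needs a dissimilarity that tolerates categorical features and missing values. The abstract does not say which function the authors use, so the sketch below swaps in the well-known Heterogeneous Euclidean-Overlap Metric (HEOM) purely as an illustration. The function name `heom`, the use of `None` for a missing value, and the `ranges` argument are assumptions made for this example.

```python
from typing import Optional, Sequence, Union

Value = Optional[Union[float, str]]  # None marks a missing value (assumption)


def heom(a: Sequence[Value], b: Sequence[Value],
         ranges: Sequence[Optional[float]]) -> float:
    """HEOM-style dissimilarity between two mixed, possibly incomplete patterns.

    ranges[i] is (max - min) of numeric feature i, or None if feature i
    is categorical. This is an illustrative stand-in, not the paper's function.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if x is None or y is None:
            d = 1.0                      # missing value: maximal per-feature distance
        elif r is None:
            d = 0.0 if x == y else 1.0   # categorical: overlap (match / mismatch)
        else:
            d = abs(x - y) / r if r > 0 else 0.0  # numeric: range-normalized difference
        total += d * d
    return total ** 0.5


if __name__ == "__main__":
    p = [1.0, "red", None]
    q = [3.0, "red", 5.0]
    print(heom(p, q, ranges=[4.0, None, 10.0]))
```

A swarm-based clusterer could then score a candidate set of cluster representatives by summing each pattern's dissimilarity to its nearest representative; that fitness choice is also an assumption here, not something stated in the abstract.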

Experiments on Clustering Algorithms for Mixed and Incomplete Data

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.b2551.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 4778-4784

Keyword(s):

Machine Learning ◽

Experimental Study ◽

Incomplete Data ◽

Clustering Algorithms ◽

Cluster Validation ◽

Clustering Data

Clustering mixed and incomplete data has been a frequent research goal in recent years because such data commonly appear in soft-science problems. However, there is a lack of studies evaluating the performance of clustering algorithms on this kind of data. In this paper we present an experimental study of the performance of seven clustering algorithms, each of which uses one of these techniques: partitional, hierarchical, or metaheuristic. All the methods were run on 15 databases from the UCI Machine Learning Repository that have mixed and incomplete data descriptions. In external cluster validation using the Entropy and V-Measure indices, the metaheuristic algorithms showed the best results. Thus, we recommend metaheuristic-based clustering algorithms for clustering data with mixed and incomplete descriptions.
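For readers unfamiliar with the two external indices, the following minimal sketch computes them from ground-truth classes and cluster labels using the standard definitions (natural logarithms). Treating the Entropy index as the conditional entropy of classes given clusters is an assumption about the study's exact variant; the function names are illustrative.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given); with target=classes and given=clusters this is the
    size-weighted class entropy inside the clusters (lower is better)."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness (higher is better)."""
    h_c, h_k = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


if __name__ == "__main__":
    classes = [0, 0, 1, 1]
    print(v_measure(classes, [1, 1, 0, 0]))  # same partition, labels renamed
    print(v_measure(classes, [0, 1, 0, 1]))  # clusters independent of classes
```

Both indices depend only on the contingency between classes and clusters, so renaming cluster labels does not change the score.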

A COMPARISON OF CLUSTERING BY IMPUTATION AND SPECIAL CLUSTERING ALGORITHMS ON THE REAL INCOMPLETE DATA

Jurnal Ilmu Komputer dan Informasi ◽

10.21609/jiki.v13i2.818 ◽

2020 ◽

Vol 13 (2) ◽

pp. 65-75

Author(s):

Ridho Ananda ◽

Atika Ratna Dewi ◽

Nurlaili Nurlaili

Keyword(s):

Expectation Maximization ◽

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Distance Estimation ◽

Soft Constraints ◽

Fuzzy C Means ◽

Environmental Performance Index ◽

Silhouette Index ◽

Value Decomposition

The existence of missing values seriously inhibits the clustering process. To overcome this, researchers have proposed two kinds of solutions: imputation and special clustering algorithms. This paper compares the clustering results obtained with both on incomplete data. The k-means algorithm was applied to the imputed data. The algorithms used were distribution-free multiple imputation (DFMI), Gabriel eigen (GE), expectation maximization-singular value decomposition (EM-SVD), biplot imputation (BI), four modified fuzzy c-means (FCM) algorithms, k-means soft constraints (KSC), distance estimation strategy fuzzy c-means (DESFCM), and k-means soft constraints imputed-observed (KSC-IO). The data used were the 2018 Environmental Performance Index (EPI) and simulated data. The optimal clustering of the 2018 EPI data was chosen based on the Silhouette index, whose capability had previously been tested on the simulated dataset. The results showed that the Silhouette index is well able to validate clustering results on incomplete datasets, and that the optimal clustering of the 2018 EPI dataset was obtained by k-means using BI, where the Silhouette index and time complexity were 0.613 and 0.063, respectively. Based on these results, k-means with BI is suggested for clustering analysis of the 2018 EPI dataset.
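The Silhouette index used above to choose the best clustering can be sketched as follows. This is the standard definition for complete numeric data with Euclidean distance, so it assumes missing values have already been imputed; it is not the authors' code, and the function name is illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette(points, labels):
    """Mean silhouette width over all points; requires at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    scores = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items() if lab != labels[i]
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(silhouette(pts, [0, 0, 1, 1]))  # compact, well-separated clusters
    print(silhouette(pts, [0, 1, 0, 1]))  # clusters that cut across the groups
```

Values close to 1 indicate compact, well-separated clusters; values near or below 0 indicate overlapping or misassigned points.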

Robust K-Median and K-Means Clustering Algorithms for Incomplete Data

Mathematical Problems in Engineering ◽

10.1155/2016/4321928 ◽

2016 ◽

Vol 2016 ◽

pp. 1-8 ◽

Cited By ~ 6

Author(s):

Jinhua Li ◽

Shiji Song ◽

Yuli Zhang ◽

Zhen Zhou

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Interval Data ◽

Accurate Estimation ◽

Data Sets ◽

Clustering Methods ◽

Estimation Errors ◽

Feature Values ◽

Time And Space Complexity

Incomplete data with missing feature values are prevalent in clustering problems. Traditional clustering methods first estimate the missing values by imputation and then apply the classical clustering algorithms for complete data, such as K-median and K-means. However, in practice, it is often hard to obtain accurate estimation of the missing values, which deteriorates the performance of clustering. To enhance the robustness of clustering algorithms, this paper represents the missing values by interval data and introduces the concept of robust cluster objective function. A minimax robust optimization (RO) formulation is presented to provide clustering results, which are insensitive to estimation errors. To solve the proposed RO problem, we propose robust K-median and K-means clustering algorithms with low time and space complexity. Comparisons and analysis of experimental results on both artificially generated and real-world incomplete data sets validate the robustness and effectiveness of the proposed algorithms.
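To make the minimax idea concrete, the sketch below shows only a worst-case assignment step, under the assumption that each pattern is given as a box of per-feature intervals `[lo, hi]` (an observed value has `lo == hi`) and that squared Euclidean distance is used. It is an illustration of the principle, not the paper's algorithm; the function names are hypothetical.

```python
def worst_case_sq_dist(lo, hi, center):
    """Largest squared Euclidean distance from `center` to any point in the
    box [lo, hi]. The squared distance is convex and separable, so the
    maximum is reached at an interval endpoint in every feature."""
    return sum(max((c - l) ** 2, (c - h) ** 2) for l, h, c in zip(lo, hi, center))


def robust_assign(lo, hi, centers):
    """Index of the center with the smallest worst-case distance (minimax)."""
    return min(range(len(centers)),
               key=lambda k: worst_case_sq_dist(lo, hi, centers[k]))


if __name__ == "__main__":
    # Second feature is missing and represented by the interval [0, 4].
    lo, hi = [1.0, 0.0], [1.0, 4.0]
    centers = [(1.0, 2.0), (1.0, 5.0)]
    print(worst_case_sq_dist(lo, hi, centers[0]))
    print(robust_assign(lo, hi, centers))
```

Because the assignment minimizes the worst case over the whole interval, it does not change when the true missing value moves anywhere inside the interval, which is the sense in which the result is insensitive to estimation errors.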

Comparison of Selected Swarm Intelligence Algorithms in Student Courses Recommendation Application

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194014500041 ◽

2014 ◽

Vol 24 (01) ◽

pp. 91-109 ◽

Cited By ~ 4

Author(s):

Janusz Sobecki

Keyword(s):

Particle Swarm Optimization ◽

Swarm Intelligence ◽

Optimization Problems ◽

Bat Algorithm ◽

Problem Space ◽

Swarm Optimization ◽

Bee Colony ◽

Grade Prediction ◽

Bee Colony Optimization ◽

Swarm Intelligence Algorithm

This paper presents a comparison of several swarm intelligence algorithms applied to the recommendation of student courses. Swarm intelligence algorithms are nowadays successfully used in many areas, especially in optimization problems. To apply a swarm intelligence algorithm in a recommender system, a special representation of the problem space is necessary. Here we compare the grade-prediction efficiency of several evolutionary algorithms: Ant Colony Optimization (ACO), Particle Swarm Optimization (PSO), Intelligent Weed Optimization (IWO), Bee Colony Optimization (BCO), and Bat Algorithm (BA).

Lookahead selective sampling for incomplete data

International Journal of Applied Mathematics and Computer Science ◽

10.1515/amcs-2016-0062 ◽

2016 ◽

Vol 26 (4) ◽

pp. 871-884 ◽

Cited By ~ 1

Author(s):

Loai Abdallah ◽

Ilan Shimshoni

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Mean Shift ◽

Ensemble Clustering ◽

Selective Sampling ◽

Mean Shift Clustering ◽

Sampling Algorithms ◽

Instance Space ◽

Incomplete Datasets

Missing values in data are common in real-world applications. There are several methods that deal with this problem. In this paper we present lookahead selective sampling (LSS) algorithms for datasets with missing values. We developed two versions of selective sampling. The first one integrates a distance function that can measure the similarity between pairs of incomplete points within the framework of the LSS algorithm. The second algorithm uses ensemble clustering in order to represent the data in a cluster matrix without missing values and then runs the LSS algorithm based on the ensemble clustering instance space (LSS-EC). To construct the cluster matrix, we use the k-means and mean shift clustering algorithms, specially modified to deal with incomplete datasets. We tested our algorithms on six standard numerical datasets from different fields. On these datasets we simulated missing values and compared the performance of the LSS and LSS-EC algorithms for incomplete data to two other basic methods. Our experiments show that the suggested selective sampling algorithms outperform the other methods.

PANTSA Influence in grouping Mixed and Incomplete Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b6534.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 579-583

Keyword(s):

Experimental Evidence ◽

Data Clustering ◽

Incomplete Data ◽

Clustering Algorithms ◽

Numerical Data ◽

Clustering Methods ◽

High Quality ◽

Before And After

Obtaining high-quality groups and processing mixed and incomplete data (DMI) are still open problems in data clustering. Recently, a method that improves the results obtained by clustering algorithms, PAntSA, was proposed, but it was designed and tested only for numerical data. For this reason, this paper analyzes the influence of applying PAntSA on the performance of DMI restricted clustering algorithms. To this end, the results of different algorithms are compared before and after applying PAntSA. The comparisons provide experimental evidence that the PAntSA algorithm improves the quality of the groups obtained by traditional DMI clustering methods.

A Robust Fuzzy Approach For Gene Expression Data Clustering

10.21203/rs.3.rs-547452/v1 ◽

2021 ◽

Author(s):

Meskat Jahan ◽

Mahmudul Hasan

Keyword(s):

Gene Expression ◽

Data Clustering ◽

Missing Values ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Expression Data ◽

Mining Method ◽

Cluster Number ◽

Gene Expression Data Clustering ◽

Parameter Dependent

In the big data era, clustering is one of the most popular data mining methods. The majority of clustering algorithms have complications such as automatic cluster number determination, poor clustering precision, inconsistent clustering across datasets, and parameter dependence. A new fuzzy autonomous clustering solution, the Meskat-Mahmudul (MM) clustering algorithm, is proposed to overcome the complexity of parameter-free automatic cluster number determination and to improve clustering accuracy. The MM clustering algorithm finds the exact number of clusters based on the Average Silhouette method in multivariate mixed-attribute datasets, including real-time gene expression datasets, and deals with missing values, noise, and outliers. The MM Extended K-Means (MMK) clustering algorithm is an enhancement of the K-Means algorithm that provides automatic cluster discovery and runtime cluster placement. Several validation methods are used to evaluate the clusters and certify optimum cluster partitioning and perfection. Several datasets are used to compare the performance of the proposed algorithms with other algorithms in terms of time complexity and clustering efficiency. Finally, the MM and MMK clustering algorithms are found to be superior to conventional algorithms.

Data clustering algorithms based on Swarm Intelligence

2011 3rd International Conference on Electronics Computer Technology ◽

10.1109/icectech.2011.5941931 ◽

2011 ◽

Cited By ~ 7

Author(s):

Pankaj K. Bharne ◽

V. S. Gulhane ◽

Shweta K. Yewale

Keyword(s):

Swarm Intelligence ◽

Data Clustering ◽

Clustering Algorithms

Role of Swarm Intelligence based Algorithms and their Applications for optimization in Software Reliability

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b2953.078219 ◽

2019 ◽

Vol 8 (2) ◽

pp. 3323-3327

Keyword(s):

Swarm Intelligence ◽

Software Reliability ◽

Bat Algorithm ◽

Cuckoo Search ◽

Particle Swarm Optimizer ◽

Quality Performance ◽

Reliability Models ◽

Bee Colony ◽

Software Reliability Models

Software has many features, such as functionality, maintainability, serviceability, usability, quality, and performance. Reliability is an essential characteristic of software that contributes to its quality. Software reliability is a great concern for software producers as well as users. With this concern in mind, hundreds of software reliability models have been developed over the last four decades. This paper evaluates different swarm intelligence-based algorithms for optimization in software reliability. A number of swarm intelligence-based algorithms have already been used to improve the efficiency of software reliability. Among them are the ant colony optimizer (ACO), particle swarm optimizer (PSO), artificial bee colony optimizer (ABC), bat algorithm, fish swarm algorithm, cuckoo search, and bird flock algorithm. Still, many swarm intelligence-based algorithms have not yet been used in this area. This paper investigates some known swarm intelligence-based algorithms and their applications for optimizing software reliability.

Integrated Algorithm for Unsupervised Data Clustering Problems in Data Mining

Journal of Southwest Jiaotong University ◽

10.35741/issn.0258-2724.54.5.40 ◽

2019 ◽

Vol 54 (5) ◽

Author(s):

Nibras Othman Abdul Wahid ◽

Saif Aamer Fadhil ◽

Noor Abbood Jasim

Keyword(s):

Data Mining ◽

Data Clustering ◽

Clustering Algorithms ◽

Genetic Operators ◽

Way Of Life ◽

Fundamental Parameters ◽

Benchmark Datasets ◽

Key Issues ◽

Clustering Data ◽

Lion Optimization Algorithm

Unsupervised data clustering is one of the most valuable tools and an informative task in data mining; it seeks to define homogeneous groups of objects based on similarity and is used in many applications. Clustering data is one of the key issues in data mining and has attracted much attention. One of the best-known clustering algorithms is K-means, which has been successfully applied to many problems. To improve the quality of K-means, researchers have proposed hybridizing it with optimization algorithms. In this paper, a heuristic algorithm, the Lion Optimization Algorithm (LOA), and the Genetic Algorithm (GA) were adapted for K-means data clustering by modifying the fundamental parameters of LOA, which is a nature-inspired algorithm. The special lifestyle of lions and their cooperation characteristics were the main motivation for the development of this optimization algorithm. The GA is used when it is necessary to reallocate the clusters using the genetic operators, crossover and mutation. The results of the analysis of this algorithm reflect the capability of this approach in clustering analysis on a number of benchmark datasets from the UCI Machine Learning Repository.
