Towards Expert-Inspired Automatic Criterion to Cut a Dendrogram for Real-Industrial Applications

2021 ◽  
Author(s):  
Shikha Suman ◽  
Ashutosh Karna ◽  
Karina Gibert

Hierarchical clustering is one of the preferred choices for understanding the underlying structure of a dataset and defining typologies, with multiple real-life applications. Among existing clustering algorithms, the hierarchical family is one of the most popular because, unlike methods such as k-means, it reveals the inner structure of the dataset and yields the number of clusters as an output. The granularity of the final clustering can be adjusted to the goals of the analysis. In a hierarchical method, the number of clusters is determined by analyzing the resulting dendrogram. Experts have criteria for visually inspecting the dendrogram and determining the number of clusters, and finding automatic criteria that imitate experts in this task is still an open problem. This dependence on the expert to cut the tree is a limitation in real applications such as Industry 4.0 and additive manufacturing. This paper analyses several cluster validity indexes in the context of determining the suitable number of clusters in hierarchical clustering. A new Cluster Validity Index (CVI) is proposed that properly captures the implicit criteria experts use when analyzing dendrograms. The proposal has been applied to a range of datasets and validated against expert ground truth, outperforming the state of the art while significantly reducing the computational cost.
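The abstract above describes cutting a dendrogram automatically by scoring candidate partitions with a validity index. A minimal Python sketch of that pattern, assuming a naive single-linkage agglomeration and a Dunn-style score as a stand-in for the paper's proposed CVI (the actual index is not given in the abstract):

```python
import math

def single_linkage(points, k):
    # Naive agglomeration: repeatedly merge the two closest clusters until k remain.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

def dunn_like(clusters):
    # Higher is better: smallest cross-cluster gap over largest cluster diameter.
    gap = min(math.dist(a, b)
              for i in range(len(clusters)) for j in range(i + 1, len(clusters))
              for a in clusters[i] for b in clusters[j])
    diam = max(math.dist(a, b) for c in clusters for a in c for b in c)
    return gap / diam

points = [(0, 0), (0.2, 0.1), (0.1, 0.3),
          (5, 5), (5.2, 5.1), (4.9, 5.3)]
best_k = max(range(2, 5), key=lambda k: dunn_like(single_linkage(points, k)))
print(best_k)  # the two tight groups make k = 2 the best cut
```

Scanning cut levels this way replaces the expert's visual inspection with a single maximization over candidate partitions.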

2018 ◽  
Author(s):  
N. Nidheesh ◽  
K.A. Abdul Nazeer ◽  
P.M. Ameer

Cancer subtype discovery from omics data requires techniques to estimate the number of natural clusters in the data. Automatically estimating the number of clusters is a challenging problem in machine learning. Using clustering algorithms together with internal cluster validity indexes is a popular method of estimating the number of clusters in biomolecular data. We propose a hierarchical agglomerative clustering algorithm, named SilHAC, which can automatically estimate the number of natural clusters and find the associated clustering solution. SilHAC is parameterless. We also present two hybrids of SilHAC with Spectral Clustering and K-Means, respectively, as components. SilHAC and the hybrids found reasonable estimates for the number of clusters and the associated clustering solution when applied to a collection of cancer gene expression datasets. The proposed methods are better alternatives to the 'clustering algorithm - internal cluster validity index' pipelines for estimating the number of natural clusters.
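SilHAC's internals are not spelled out in the abstract; as an illustration of the silhouette-style criterion its name suggests, here is a minimal Python sketch of the mean silhouette coefficient an agglomerative method could track while merging (the coefficient itself is standard; its role inside SilHAC is an assumption):

```python
import math

def mean_silhouette(clusters):
    # For each point: a = mean distance to its own cluster, b = mean distance
    # to the nearest other cluster; silhouette = (b - a) / max(a, b).
    score, n = 0.0, 0
    for ci, c in enumerate(clusters):
        for p in c:
            a = (sum(math.dist(p, q) for q in c) / (len(c) - 1)) if len(c) > 1 else 0.0
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            if max(a, b) > 0:
                score += (b - a) / max(a, b)
            n += 1
    return score / n

tight = [[(0, 0), (0, 1)], [(9, 9), (9, 10)]]
print(round(mean_silhouette(tight), 3))  # close to 1 for well-separated clusters
```

A parameterless estimator can then return the merge level whose partition maximizes this coefficient.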


2017 ◽  
Vol 26 (3) ◽  
pp. 483-503 ◽  
Author(s):  
Vijay Kumar ◽  
Jitender Kumar Chhabra ◽  
Dinesh Kumar

Finding the optimal number of clusters and the appropriate partitioning of a given dataset are the two major challenges in clustering, and cluster validity indices are used for both. In this paper, seven widely used cluster validity indices, namely the DB, PS, I, XB, FS, K, and SV indices, are developed based on line symmetry distance measures. These indices measure the line symmetry present in a partitioning of the dataset and can detect clusters of any shape or size, as long as the clusters possess the property of line symmetry. The performance of these indices is evaluated on three clustering algorithms: K-means, fuzzy C-means, and modified harmony search-based clustering (MHSC). The efficacy of the symmetry-based validity indices is demonstrated on six artificial and six real-life datasets, with the number of clusters varying from 2 to $\sqrt n$, where n is the total number of data points in the dataset. The experimental results reveal that incorporating line symmetry-based distance improves the ability of these existing validity indices to find the appropriate number of clusters. The indices are compared with the point-symmetric and original versions of the same seven validity indices. The results also demonstrate that the MHSC technique performs better than other well-known clustering techniques. For the real-life datasets, an analysis-of-variance statistical analysis is also performed.


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Lopamudra Dey ◽  
Sanjay Chakraborty

The significance and applications of clustering span various fields. Clustering is an unsupervised process in data mining, which is why proper evaluation of the results and measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indexes are used to solve different types of problems, and index selection depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm, analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on real-life air pollution, wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. The paper also describes the nature of the clusters and compares the performance of these clustering algorithms according to the validity assessment, identifying which algorithm is most desirable for producing compact clusters on these particular real-life datasets. It examines the behaviour of these clustering algorithms with respect to validation indexes and presents the evaluation results in mathematical and graphical form.
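The intercluster and intracluster measures the paper analyses are not defined in the abstract; the Python sketch below uses one common pair of definitions as an assumption: mean within-cluster pairwise distance for compactness (lower is better) and minimum cross-cluster distance for separation (higher is better).

```python
import math

def intracluster(clusters):
    # Mean distance between points that share a cluster (compactness).
    total, count = 0.0, 0
    for c in clusters:
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                total += math.dist(c[i], c[j])
                count += 1
    return total / count if count else 0.0

def intercluster(clusters):
    # Minimum distance between points of different clusters (separation).
    return min(math.dist(a, b)
               for x in range(len(clusters)) for y in range(x + 1, len(clusters))
               for a in clusters[x] for b in clusters[y])

clusters = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
print(intracluster(clusters))  # 1.0
print(intercluster(clusters))  # ≈ 13.45, the gap between (0,1) and (10,10)
```

A partition produced by any of the listed algorithms can be scored with both measures and the algorithms ranked accordingly.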


2010 ◽  
Vol 43 (10) ◽  
pp. 3364-3373 ◽  
Author(s):  
Ibai Gurrutxaga ◽  
Iñaki Albisua ◽  
Olatz Arbelaitz ◽  
José I. Martín ◽  
Javier Muguerza ◽  
...  

2021 ◽  
Vol 5 (1) ◽  
pp. 7-12
Author(s):  
Salnan Ratih Asrriningtias

One strategy for Batik MSMEs to remain competitive is to examine customer characteristics. To make the characteristics of customer buying behavior easier to see, customers need to be grouped by similarity of characteristics using fuzzy clustering. One of the parameters that must be determined at the start of a fuzzy clustering method is the number of clusters. Increasing the number of clusters does not guarantee the best performance, but the right number of clusters greatly affects the performance of fuzzy clustering. To obtain the optimal number of clusters, the clustering result for each candidate number of clusters can be measured with a cluster validity index. Among several types of cluster validity index, NPC gave the best value. The optimal number of clusters obtained by the validity index is 2, and this number of clusters yields a grouping with a small variance value.
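A sketch of the NPC (normalized partition coefficient) index named above, assuming the standard definition NPC = (c·PC − 1)/(c − 1), where PC is the partition coefficient of the fuzzy membership matrix; the paper's exact formulation may differ. NPC is near 1 for crisp, well-separated memberships and near 0 for maximally ambiguous ones, so the candidate cluster count with the highest NPC is chosen.

```python
def npc(U):
    # U: fuzzy membership matrix, rows = data points, columns = clusters;
    # each row sums to 1.
    n, c = len(U), len(U[0])
    pc = sum(u * u for row in U for u in row) / n   # partition coefficient
    return (c * pc - 1) / (c - 1)                   # rescaled to [0, 1]

crisp = [[1.0, 0.0], [0.0, 1.0]]   # perfectly separated memberships
fuzzy = [[0.5, 0.5], [0.5, 0.5]]   # maximally ambiguous memberships
print(npc(crisp))  # 1.0
print(npc(fuzzy))  # 0.0
```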


Author(s):  
Zitai Chen ◽  
Chuan Chen ◽  
Zibin Zheng ◽  
Yi Zhu

Clustering on multilayer networks has been shown to be a promising approach to enhance accuracy. Various multilayer network clustering algorithms assume all networks derive from a latent clustering structure and jointly learn the compatible and complementary information from different networks to excavate one shared underlying structure. However, such an assumption conflicts with many emerging real-life applications due to the existence of noisy/irrelevant networks. To address this issue, we propose Centroid-based Multilayer Network Clustering (CMNC), a novel approach which can divide irrelevant relationships into different network groups and uncover the cluster structure in each group simultaneously. The multilayer networks are represented within a unified tensor framework that simultaneously captures multiple types of relationships between a set of entities. By imposing the rank-(Lr,Lr,1) block term decomposition with nonnegativity, we obtain well-founded interpretations of the multiple clustering results based on graph cut theory. Numerically, we transform this tensor decomposition problem into an unconstrained optimization and thus can solve it efficiently under the nonlinear least squares (NLS) framework. Extensive experimental results on synthetic and real-world datasets show the effectiveness and robustness of our method against noise and irrelevant data.


Sensors ◽  
2020 ◽  
Vol 20 (11) ◽  
pp. 3210
Author(s):  
Sana Qaiyum ◽  
Izzatdin Aziz ◽  
Mohd Hilmi Hasan ◽  
Asif Irshad Khan ◽  
Abdulmohsen Almalawi

Data streams create new challenges for fuzzy clustering algorithms, specifically Interval Type-2 Fuzzy C-Means (IT2FCM). One problem associated with IT2FCM is that it tends to be sensitive to initialization conditions and therefore fails to return global optima. This problem has been addressed by optimizing IT2FCM with an Ant Colony Optimization approach. However, IT2FCM-ACO obtains clusters for the whole dataset, which is not suitable for clustering large streaming datasets that arrive continuously and evolve with time; the clusters generated must also evolve with time. Additionally, the incoming data may not fit in memory all at once because of its size. Therefore, to meet the challenges of a large data stream environment, we propose improving IT2FCM-ACO to generate clusters incrementally. The proposed algorithm determines appropriate cluster centers on a certain percentage of the available data, then combines the obtained cluster centroids with new incoming data points to generate another set of cluster centers; the process continues until all the data have been scanned. Previous data points are released from memory, which reduces time and space complexity. The proposed incremental method thus produces data partitions comparable to IT2FCM-ACO. Its performance is evaluated on large real-life datasets. The results obtained from several fuzzy cluster validity index measures show the enhanced performance of the proposed method over other clustering algorithms. The proposed algorithm also improves the run time and produces excellent speed-ups for all datasets.
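The incremental scheme described above, clustering one portion of the data and then combining the resulting centroids with the next batch of points, can be sketched as follows. Plain weighted k-means stands in for IT2FCM-ACO here, and the chunking and weighting details are assumptions:

```python
import math

def weighted_kmeans(points, weights, k, iters=20):
    centers = [points[i] for i in range(k)]          # deterministic init
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[j].append((p, w))
        for j, g in enumerate(groups):
            if g:
                tw = sum(w for _, w in g)
                centers[j] = tuple(sum(p[d] * w for p, w in g) / tw
                                   for d in range(len(centers[j])))
    # Each center's mass = total weight it absorbed; carried to the next chunk.
    masses = [sum(w for _, w in g) for g in groups]
    return centers, masses

def incremental_cluster(chunks, k):
    centers, masses = weighted_kmeans(chunks[0], [1.0] * len(chunks[0]), k)
    for chunk in chunks[1:]:
        pts = centers + list(chunk)                  # old centroids + new data
        wts = masses + [1.0] * len(chunk)            # old mass + unit weights
        centers, masses = weighted_kmeans(pts, wts, k)
    return centers

chunks = [[(0, 0), (0.1, 0), (9, 9), (9.1, 9)],
          [(0, 0.2), (9, 9.2)]]
print(sorted(incremental_cluster(chunks, k=2)))
```

Only k weighted centroids survive between chunks, so memory stays constant regardless of stream length.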


2013 ◽  
Vol 3 (4) ◽  
pp. 1-14 ◽  
Author(s):  
S. Sampath ◽  
B. Ramya

Cluster analysis is a branch of data mining which plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers identify natural subgroups in a dataset. Different types of clustering algorithms are available in the literature, the most popular being k-means clustering. Even though k-means clustering is widely used, applying it requires knowing the number of clusters present in the given dataset, and several solutions are available in the literature to overcome this limitation. The k-means method creates a disjoint and exhaustive partition of the dataset; however, in some situations one can come across objects that belong to more than one cluster. In this paper, a clustering algorithm is proposed that produces rough clusters automatically, without requiring the user to supply the number of clusters as input. The efficiency of the algorithm in detecting the number of clusters has been studied with the help of some real-life datasets. Further, a nonparametric statistical analysis of the experimental results has been carried out to assess the efficiency of the proposed algorithm in automatic detection of the number of clusters, using a rough version of the Davies-Bouldin index.


Author(s):  
Fatma Ozge Ozkok ◽  
Mete Celik

A time series is a set of sequential data points in time order, and the sizes and dimensions of time series datasets are increasing day by day. Clustering is an unsupervised data mining technique that groups objects based on their similarities. It is used to analyze various datasets, such as finance, climate, and bioinformatics datasets. k-means is one of the most used clustering algorithms. However, it is challenging to determine the value of its k parameter, the number of clusters. One of the most common ways to determine the number of clusters is through cluster validity indexes. Several internal and external validity indexes are used to find suitable cluster numbers based on the characteristics of the datasets. In this study, we propose a hybrid validity index to determine the value of the k parameter of the k-means algorithm. The proposed hybrid validity index comprises four internal validity indexes: the Dunn, Silhouette, C, and Davies–Bouldin indexes. The proposed method was applied to nine real-life finance and benchmark time series datasets. The financial dataset was obtained from Yahoo Finance and consists of daily closing data of stocks; the other eight benchmark datasets were obtained from the UCR time series classification archive. Experimental results showed that the proposed hybrid validity index is promising for finding the suitable number of clusters for clustering time series datasets, compared with the other indexes.
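The abstract does not give the rule for fusing the four indexes, so the sketch below combines just two of them (Dunn and Davies–Bouldin, both standard) into a single "higher is better" score to illustrate the hybrid-index idea; the fusion rule is an assumption:

```python
import math

def centroid(c):
    return tuple(sum(x) / len(c) for x in zip(*c))

def davies_bouldin(clusters):
    # Lower is better: average over clusters of the worst
    # (spread_i + spread_j) / centroid-gap ratio.
    cents = [centroid(c) for c in clusters]
    spread = [sum(math.dist(p, cents[i]) for p in c) / len(c)
              for i, c in enumerate(clusters)]
    k = len(clusters)
    return sum(max((spread[i] + spread[j]) / math.dist(cents[i], cents[j])
                   for j in range(k) if j != i)
               for i in range(k)) / k

def dunn(clusters):
    # Higher is better: minimum cross-cluster gap over maximum cluster diameter.
    gap = min(math.dist(a, b)
              for i in range(len(clusters)) for j in range(i + 1, len(clusters))
              for a in clusters[i] for b in clusters[j])
    diam = max(math.dist(a, b) for c in clusters for a in c for b in c)
    return gap / diam

def hybrid(clusters):
    # Assumed fusion: reward high Dunn, penalize high Davies-Bouldin.
    return dunn(clusters) / (1.0 + davies_bouldin(clusters))

good = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]   # well separated
bad = [[(0, 0), (10, 10)], [(0, 1), (10, 11)]]    # clusters straddle both groups
print(hybrid(good) > hybrid(bad))  # True
```

Evaluating the fused score across candidate values of k and keeping the maximizer gives the hybrid-index estimate of the cluster count.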

