Boundary Matching and Interior Connectivity-Based Cluster Validity Anlysis

Qi Li; Shihong Yue; Yaru Wang; Mingliang Ding; Jia Li; Zeying Wang

doi:10.3390/app10041337

Applied Sciences ◽

10.3390/app10041337 ◽

2020 ◽

Vol 10 (4) ◽

pp. 1337 ◽

Cited By ~ 2

Author(s):

Qi Li ◽

Shihong Yue ◽

Yaru Wang ◽

Mingliang Ding ◽

Jia Li ◽

...

Keyword(s):

Clustering Analysis ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Cluster Validity ◽

Validity Index ◽

Boundary Points ◽

Validity Indices ◽

Boundary Matching ◽

Interior Points

The evaluation of clustering results plays an important role in clustering analysis. However, existing validity indices are in practice limited to a specific clustering algorithm, clustering parameter, or assumption. In this paper, we propose a novel validity index that addresses these problems using two complementary measures: boundary points matching and interior points connectivity. First, after any clustering algorithm is run on a dataset, we extract all boundary points of the dataset and of its partitioned clusters using a nonparametric metric, and compute the boundary points matching measure. Second, we measure the interior points connectivity of both the dataset and all the partitioned clusters. The proposed validity index can compare clustering results obtained on the same dataset by different clustering algorithms, which existing validity indices cannot do. Experimental results demonstrate that the proposed validity index can evaluate clustering results obtained by an arbitrary clustering algorithm and find the optimal clustering parameters.

Analysis of Incremental Cluster Validity for Big Data Applications

International Journal of Uncertainty Fuzziness and Knowledge-Based Systems ◽

10.1142/s0218488518400111 ◽

2018 ◽

Vol 26 (Suppl. 2) ◽

pp. 47-62 ◽

Cited By ~ 2

Author(s):

Omar A. Ibrahim ◽

Yiqing Wang ◽

James M. Keller

Keyword(s):

Clustering Algorithm ◽

Distance Measure ◽

Dimensional Space ◽

Clustering Algorithms ◽

Internal Validity ◽

Streaming Data ◽

Cluster Validity ◽

Cluster Validity Index ◽

Online Clustering ◽

Validity Indices

Online clustering has attracted attention due to the explosion of ubiquitous continuous sensing. Streaming clustering algorithms need to look for new structures and adapt as the data evolves, so that outliers are detected and newly emerging clusters are formed automatically. The performance of a streaming clustering algorithm needs to be monitored over time to understand the behavior of the streaming data in terms of newly emerging clusters and the number of outlier data points. Small datasets with 2 or 3 dimensions can be monitored by plotting the clustering results as the data evolves. However, as the size and dimensionality of streaming data increase, plotting the clustering result becomes infeasible. Therefore, incremental internal cluster validity indices (iCVIs) can be applied to monitor the performance of an online clustering algorithm. In this paper, we study the internal incremental Davies-Bouldin (iDB) cluster validity index in the context of big streaming data analysis. We also study the effect of a large number of samples on the values of the iCVI (iDB). Finally, we propose a way to project streaming data into a lower-dimensional space for cases where the distance measure does not perform as expected in the high-dimensional space.
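
For readers unfamiliar with the Davies-Bouldin index named in this abstract, the following is a minimal sketch of its standard batch form in plain Python. It is not the incremental iDB variant studied in the paper, whose update rules are not given here. The function names `centroid` and `davies_bouldin` are hypothetical, and the sketch assumes at least two non-empty clusters with distinct centroids. Lower values indicate more compact, better separated clusters.

```python
from math import dist


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def davies_bouldin(clusters):
    """Batch Davies-Bouldin index.

    clusters: list of clusters, each a non-empty list of coordinate tuples.
    Assumes at least two clusters with distinct centroids.
    """
    centers = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean distance from its members to its centroid.
    scatter = [
        sum(dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k
```

An incremental version would maintain the centroids and scatters as running statistics instead of recomputing them from all points, which is what makes it suitable for monitoring streaming data.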

Study on Information Technology with MADS — A New Data Clustering Algorithm Based on Ant Colony Optimization

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.859.572 ◽

2013 ◽

Vol 859 ◽

pp. 572-576 ◽

Cited By ~ 1

Author(s):

Yong Li Liu

Keyword(s):

Information Technology ◽

Ant Colony Optimization ◽

Data Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Ant Colony ◽

Cluster Validity ◽

Cluster Validity Index ◽

Validity Index ◽

Data Set

In the field of information technology, data clustering algorithms are widely used. In this paper, we propose a new data clustering algorithm, named MADS, which is based on ant colony optimization. MADS can automatically find clusters, depending on a few parameters that are not directly related to the data set. In addition, our method uses some existing techniques, such as the density concept and a cluster validity index (the DB-index). The experimental results verify that MADS is able to discover clusters with varying shapes and is effective when applied to image segmentation.

Role of Cluster Validity Indices in Delineation of Precipitation Regions

Water ◽

10.3390/w12051372 ◽

2020 ◽

Vol 12 (5) ◽

pp. 1372

Author(s):

Nikhil Bhatia ◽

Jency M. Sojan ◽

Slobodan Simonovic ◽

Roshan Srivastav

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Optimal Number ◽

Ratio Test ◽

Cluster Validity ◽

Number Of Clusters ◽

Cluster Validity Indices ◽

Validity Indices ◽

Point Data ◽

Optimal Number Of Clusters

The delineation of precipitation regions aims to identify homogeneous zones in which the characteristics of the process are statistically similar. The regionalization process has three main components: (i) delineation of regions using clustering algorithms, (ii) determination of the optimal number of regions using cluster validity indices (CVIs), and (iii) validation of the regions for homogeneity using the L-moments ratio test. The identification of the optimal number of clusters significantly affects the homogeneity of the regions. The objective of this study is to investigate the performance of various CVIs in identifying the optimal number of clusters, that is, the number that maximizes the homogeneity of the precipitation regions. The k-means clustering algorithm is adopted to delineate the regions using location-based attributes for two large areas of Canada, namely, the Prairies and the Great Lakes-St Lawrence lowlands (GL-SL) region. The seasonal precipitation data for 55 years (1951–2005) are derived from high-resolution ANUSPLIN gridded point data for Canada. The results indicate that the optimal number of clusters and the regional homogeneity depend on the CVI adopted. Among the 42 cluster indices considered, 15 outperform the others in identifying homogeneous precipitation regions. The Dunn, Det_ratio, and $\mathrm{Trace}(W^{-1}B)$ indices are found to be the best for all seasons in both regions.
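
As an illustration of one of the indices named above, the following is a minimal sketch of the Dunn index in plain Python. It assumes one common formulation: the smallest distance between points in different clusters divided by the largest within-cluster diameter. The paper may use a different variant. The function name `dunn_index` is hypothetical, and the sketch assumes at least two clusters and at least one cluster containing two distinct points. Larger values indicate better separated, more compact clusters.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Dunn index for a list of clusters of coordinate tuples.

    Assumes at least two clusters, and at least one cluster with two
    distinct points so that the largest diameter is non-zero.
    """
    # Smallest distance between two points that lie in different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between two points in the same cluster (diameter).
    max_diameter = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter
```

To choose the number of clusters with such an index, one would typically cluster the data for each candidate k and keep the k with the highest Dunn value.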

Determining the Correct Number of Clusters in the CT Image Segmentation

Journal of Medical Imaging and Health Informatics ◽

10.1166/jmihi.2020.3199 ◽

2020 ◽

Vol 10 (11) ◽

pp. 2675-2680

Author(s):

Qi Li ◽

Shihong Yue ◽

Mingliang Ding ◽

Jia Li ◽

Zeying Wang

Keyword(s):

Image Segmentation ◽

Clustering Analysis ◽

Clustering Algorithm ◽

Detection System ◽

Ct Image ◽

Cluster Number ◽

Cluster Validity Index ◽

Validity Index ◽

Validity Indices ◽

Synthetic Datasets

Clustering algorithms play an essential role in CT image segmentation, and the cluster validity index is an essential component of clustering analysis. Many validity indices are used for assessing clustering results, that is, for determining the optimal cluster number. However, the existing validity indices are often ineffective for datasets that contain irregularly shaped clusters or are corrupted by noise. This study aims to define a novel validity index that is unaffected by the shapes of the clusters and by noise in the investigated datasets. A chain-based distance, different from the original Euclidean distance, is defined first; then, by a multidimensional scaling (MDS) transformation, all points are mapped into a new data space. After compactness and separation are evaluated twice on the datasets, a novel validity index is proposed. Many synthetic datasets and several typical CT images were used to validate the proposed validity index. Experimental results validate the proposed index and show that it is applicable to datasets with arbitrarily shaped clusters and noise corruption, which is helpful in clustering analysis and computer-aided detection systems.

Enhanced fuzzy clustering algorithm and cluster validity index for human perception

Expert Systems with Applications ◽

10.1016/j.eswa.2012.05.049 ◽

2013 ◽

Vol 40 (3) ◽

pp. 929-937 ◽

Cited By ~ 11

Author(s):

M. Bahar Başkır ◽

I. Burhan Türkşen

Keyword(s):

Fuzzy Clustering ◽

Clustering Algorithm ◽

Human Perception ◽

Cluster Validity ◽

Cluster Validity Index ◽

Validity Index ◽

Fuzzy Clustering Algorithm

An improved OPTICS clustering algorithm for discovering clusters with uneven densities

Intelligent Data Analysis ◽

10.3233/ida-205497 ◽

2021 ◽

Vol 25 (6) ◽

pp. 1453-1471

Author(s):

Chunhua Tang ◽

Han Wang ◽

Zhiwen Wang ◽

Xiangkun Zeng ◽

Huaran Yan ◽

...

Keyword(s):

Time Complexity ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

Clustering Algorithms ◽

Substantial Improvement ◽

Experimental Results ◽

High Time ◽

Parameter Setting ◽

K Nearest Neighbor ◽

Density Based Clustering

Most density-based clustering algorithms suffer from difficult parameter setting, high time complexity, poor noise recognition, and weak clustering of datasets with uneven density. To solve these problems, this paper proposes the FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), which is a substantial improvement on OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) in the Augmented Cluster-Ordering generated by OPTICS and uses the reachability-distance of the DP as the neighborhood radius eps of its corresponding cluster. It overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance to the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity and outperforms other algorithms in parameter setting and noise recognition.

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

Canonical PSO Based K-Means Clustering Approach for Real Datasets

International Scholarly Research Notices ◽

10.1155/2014/414013 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11 ◽

Cited By ~ 1

Author(s):

Lopamudra Dey ◽

Sanjay Chakraborty

Keyword(s):

Data Mining ◽

Air Pollution ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Real Life ◽

Cluster Validity ◽

Validity Assessment ◽

Different Types ◽

Clustering Approach ◽

Validity Measure

The significance and application of clustering are spread over various fields. Clustering is an unsupervised process in data mining, which is why the proper evaluation of the results and the measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indices are used to solve different types of problems, and the selection of indices depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm, analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on a real-time air pollution database and on wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performance of these clustering algorithms according to the validity assessment. It also identifies which of these algorithms is most desirable for producing proper compact clusters on these particular real-life datasets. It deals with the behaviour of these clustering algorithms with respect to validation indices and presents the results of their evaluation in mathematical and graphical forms.

A Data Distribution View of Clustering Algorithms

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch059 ◽

2011 ◽

pp. 374-381 ◽

Cited By ~ 1

Author(s):

Junjie Wu ◽

Jian Chen ◽

Hui Xiong

Keyword(s):

Data Mining ◽

Cluster Analysis ◽

Clustering Analysis ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Distribution ◽

Point Of View ◽

Group Method ◽

Data Sets ◽

Distribution Point

Cluster analysis (Jain & Dubes, 1988) provides insight into the data by dividing the objects into groups (clusters), such that objects in a cluster are more similar to each other than to objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and Unweighted Pair Group Method with Arithmetic Mean (UPGMA), have been well established. A recent research focus in clustering analysis is to understand the strengths and weaknesses of various clustering algorithms with respect to data factors. Indeed, people have identified some data characteristics that may strongly affect clustering analysis, including high dimensionality and sparseness, large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is expected to reveal whether and how the data distributions can affect the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions: 1. What are the systematic differences between the distributions of the resultant clusters produced by different clustering algorithms? 2. How can the distribution of the “true” cluster sizes affect the performance of clustering algorithms? 3. How should one choose an appropriate clustering algorithm in practice? The answers to these questions can guide us toward a better understanding and use of clustering methods. This is noteworthy, since 1) in theory, people have seldom realized that there are strong relationships between the clustering algorithms and the cluster size distributions, and 2) in practice, how to choose an appropriate clustering algorithm is still a challenging task, especially after an algorithm boom in the data mining area. This chapter thus makes an initial attempt to fill this void.
To this end, we carefully select two widely used categories of clustering algorithms, i.e., K-means and Agglomerative Hierarchical Clustering (AHC), as the representative algorithms for illustration. In the chapter, we first show that K-means tends to generate the clusters with a relatively uniform distribution on the cluster sizes. Then we demonstrate that UPGMA, one of the robust AHC methods, acts in an opposite way to K-means; that is, UPGMA tends to generate the clusters with high variation on the cluster sizes. Indeed, the experimental results indicate that the variations of the resultant cluster sizes by K-means and UPGMA, measured by the Coefficient of Variation (CV), are in the specific intervals, say [0.3, 1.0] and [1.0, 2.5] respectively. Finally, we put together K-means and UPGMA for a further comparison, and propose some rules for the better choice of the clustering schemes from the data distribution point of view.
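
The Coefficient of Variation (CV) used above to compare cluster-size distributions can be sketched in a few lines of Python. This is an illustrative sketch, not the chapter's code: the function name `cluster_size_cv` is hypothetical, and it assumes the CV is the sample standard deviation of the cluster sizes divided by their mean, computed from a flat list of cluster labels with at least two distinct clusters.

```python
from collections import Counter
from statistics import mean, stdev


def cluster_size_cv(labels):
    """Coefficient of variation of cluster sizes.

    labels: one cluster label per object. Assumes at least two clusters,
    because the sample standard deviation needs two or more values.
    """
    # Count how many objects fall into each cluster.
    sizes = list(Counter(labels).values())
    # Sample standard deviation of the sizes divided by their mean.
    return stdev(sizes) / mean(sizes)
```

A CV near 0 means the clusters are of nearly equal size, while a larger CV means the sizes vary widely, which is the contrast the chapter draws between K-means and UPGMA.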

A New Incremental Cluster Validity Index for Streaming Clustering Analysis

2019 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE) ◽

10.1109/fuzz-ieee.2019.8858900 ◽

2019 ◽

Author(s):

Omar A. Ibrahim ◽

James M. Keller ◽

Mihail Popescu

Keyword(s):

Clustering Analysis ◽

Cluster Validity ◽

Cluster Validity Index ◽

Validity Index
