Boundary Matching and Interior Connectivity-Based Cluster Validity Analysis

2020 ◽  
Vol 10 (4) ◽  
pp. 1337 ◽  
Author(s):  
Qi Li ◽  
Shihong Yue ◽  
Yaru Wang ◽  
Mingliang Ding ◽  
Jia Li ◽  
...  

The evaluation of clustering results plays an important role in clustering analysis. However, existing validity indices are in practice tied to a specific clustering algorithm, clustering parameter, or assumption. In this paper, we propose a novel validity index that addresses these limitations using two complementary measures: boundary point matching and interior point connectivity. First, after any clustering algorithm has been run on a dataset, we extract all boundary points of the dataset and of its partitioned clusters using a nonparametric metric, and compute the boundary point matching measure from them. Second, we measure the interior point connectivity of both the dataset and all the partitioned clusters. The proposed validity index can compare clustering results obtained on the same dataset by different clustering algorithms, which existing validity indices cannot do. Experimental results demonstrate that the proposed validity index can evaluate clustering results obtained by an arbitrary clustering algorithm and can find the optimal clustering parameters.

Author(s):  
Omar A. Ibrahim ◽  
Yiqing Wang ◽  
James M. Keller

Online clustering has attracted attention due to the explosion of ubiquitous continuous sensing. Streaming clustering algorithms need to look for new structures and adapt as the data evolves, so that outliers are detected and newly emerging clusters are formed automatically. The performance of a streaming clustering algorithm needs to be monitored over time to understand the behavior of the streaming data in terms of newly emerging clusters and the number of outlier data points. Small datasets with 2 or 3 dimensions can be monitored by plotting the clustering results as the data evolves. However, as the size and dimensionality of streaming data increase, plotting the clustering result becomes infeasible. Therefore, incremental internal cluster validity indices (iCVIs) can be applied to monitor the performance of an online clustering algorithm. In this paper, we study the incremental internal Davies-Bouldin (iDB) cluster validity index in the context of big streaming data analysis. We also study the effect of a large number of samples on the values of the iCVI (iDB). Finally, we propose a way to project streaming data into a lower-dimensional space for cases where the distance measure does not perform as expected in the high-dimensional space.
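For orientation, the following is a minimal sketch of the standard batch Davies-Bouldin index, not the incremental iDB variant studied in the paper. The function names and the toy data are illustrative assumptions. Scatter is taken as the mean Euclidean distance to the cluster centroid, and lower values indicate more compact, better separated clusters.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points (tuples)."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def davies_bouldin(clusters):
    """Batch Davies-Bouldin index for a list of clusters.

    Each cluster is a non-empty list of coordinate tuples.
    Assumes at least two clusters with distinct centroids.
    """
    k = len(clusters)
    if k < 2:
        raise ValueError("need at least two clusters")
    centers = [centroid(c) for c in clusters]
    # Scatter: mean distance from each point to its cluster centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    total = 0.0
    for i in range(k):
        # Worst-case similarity ratio between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Hypothetical toy data: same cluster shapes, different separation.
well_separated = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
close_together = [[(0, 0), (0, 2)], [(3, 0), (3, 2)]]
```

On this toy data the well separated partition gets the lower score. An incremental version would update the centroids and scatter values per arriving sample instead of recomputing them from all points.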


2013 ◽  
Vol 859 ◽  
pp. 572-576 ◽  
Author(s):  
Yong Li Liu

In the field of information technology, data clustering algorithms are widely used. In this paper, we propose a new data clustering algorithm, named MADS, which is based on ant colony optimization. MADS can automatically find clusters, depending on only a few parameters that are not directly related to the data set. In addition, our method makes use of some existing techniques, such as the density concept and a cluster validity index (DB-index). The experimental results verify that MADS is able to discover clusters with varying shapes and is effective when applied to image segmentation.


Water ◽  
2020 ◽  
Vol 12 (5) ◽  
pp. 1372
Author(s):  
Nikhil Bhatia ◽  
Jency M. Sojan ◽  
Slobodon Simonovic ◽  
Roshan Srivastav

The delineation of precipitation regions aims to identify homogeneous zones in which the characteristics of the process are statistically similar. The regionalization process has three main components: (i) delineation of regions using clustering algorithms, (ii) determination of the optimal number of regions using cluster validity indices (CVIs), and (iii) validation of the regions for homogeneity using the L-moments ratio test. The identification of the optimal number of clusters significantly affects the homogeneity of the regions. The objective of this study is to investigate the performance of various CVIs in identifying the optimal number of clusters, that is, the number that maximizes the homogeneity of the precipitation regions. The k-means clustering algorithm is adopted to delineate the regions using location-based attributes for two large areas of Canada, namely the Prairies and the Great Lakes-St Lawrence lowlands (GL-SL) region. The seasonal precipitation data for 55 years (1951–2005) is derived from high-resolution ANUSPLIN gridded point data for Canada. The results indicate that the optimal number of clusters and the regional homogeneity depend on the CVI adopted. Among the 42 cluster indices considered, 15 outperform the others in identifying homogeneous precipitation regions. The Dunn, Det_ratio, and $\mathrm{Trace}(W^{-1}B)$ indices are found to be the best for all seasons in both regions.
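As an illustration of step (ii), the sketch below computes one common variant of the Dunn index (smallest distance between points of different clusters divided by the largest cluster diameter) and picks the candidate number of clusters with the highest value. The function names and toy partitions are assumptions for illustration; the study itself compares 42 indices on k-means results, which this sketch does not reproduce.

```python
import math
from itertools import combinations, product


def dunn_index(clusters):
    """Dunn index: minimum inter-cluster distance / maximum cluster diameter.

    Each cluster is a non-empty list of coordinate tuples. Higher is better.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    # Smallest distance between any two points in different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p, q in product(a, b)
    )
    # Largest distance between any two points in the same cluster.
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0:
        raise ValueError("all clusters have zero diameter")
    return min_separation / max_diameter


def best_k(partitions):
    """Return the k whose partition maximizes the Dunn index.

    partitions maps a candidate k to the list of clusters obtained for it.
    """
    return max(partitions, key=lambda k: dunn_index(partitions[k]))


# Hypothetical candidate partitions of the same four points.
candidates = {
    2: [[(0, 0), (0, 1)], [(3, 0), (3, 1)]],
    3: [[(0, 0)], [(0, 1)], [(3, 0), (3, 1)]],
}
```

In this toy case, splitting the left pair into singletons lowers the minimum separation without reducing the maximum diameter, so the two-cluster partition is preferred.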


2020 ◽  
Vol 10 (11) ◽  
pp. 2675-2680
Author(s):  
Qi Li ◽  
Shihong Yue ◽  
Mingliang Ding ◽  
Jia Li ◽  
Zeying Wang

Clustering algorithms play an essential role in CT image segmentation, and the cluster validity index is an essential component of clustering analysis. Many validity indices are used for assessing clustering results, that is, for determining the optimal cluster number. However, the existing validity indices are often ineffective for datasets that contain irregularly shaped clusters or are corrupted by noise. This study aims to define a novel validity index that is affected neither by the shapes of the clusters nor by noise in the investigated datasets. A chain-based distance, different from the original Euclidean distance, is defined first; then, by a multidimensional scaling (MDS) transformation, all points are mapped into a new data space. After compactness and separation are each evaluated twice on the datasets, a novel validity index is proposed. Many synthetic datasets and several typical CT images were used to validate the proposed validity index. Experimental results validate the proposed index and show that it is applicable to datasets with arbitrarily shaped clusters and noise corruption, which is helpful in clustering analysis and computer-aided detection systems.


2021 ◽  
Vol 25 (6) ◽  
pp. 1453-1471
Author(s):  
Chunhua Tang ◽  
Han Wang ◽  
Zhiwen Wang ◽  
Xiangkun Zeng ◽  
Huaran Yan ◽  
...  

Most density-based clustering algorithms suffer from difficult parameter setting, high time complexity, poor noise recognition, and weak clustering of datasets with uneven density. To solve these problems, this paper proposes the FOP-OPTICS algorithm (Finding of the Ordering Peaks Based on OPTICS), which is a substantial improvement of OPTICS (Ordering Points To Identify the Clustering Structure). The proposed algorithm finds the demarcation point (DP) in the Augmented Cluster-Ordering generated by OPTICS and uses the reachability-distance of the DP as the neighborhood radius eps of its corresponding cluster. It thereby overcomes the weakness of most algorithms in clustering datasets with uneven densities. By computing the distance to the k-nearest neighbor of each point, it reduces the time complexity of OPTICS; by calculating density-mutation points within the clusters, it can efficiently recognize noise. The experimental results show that FOP-OPTICS has the lowest time complexity and outperforms the other algorithms in parameter setting and noise recognition.
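The k-nearest-neighbor distance mentioned above can be sketched as follows. This is a brute-force illustration of the quantity only, under the assumption of Euclidean distance; it is not the FOP-OPTICS procedure, and the function name and sample points are hypothetical. A practical implementation would typically use a spatial index rather than comparing all pairs.

```python
import math


def k_distances(points, k):
    """Distance from each point to its k-th nearest other point.

    points is a list of coordinate tuples; requires 1 <= k < len(points).
    Brute force: every point is compared with every other point.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    result = []
    for i, p in enumerate(points):
        # Sorted distances to all other points; index k - 1 is the k-th nearest.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


# Hypothetical one-dimensional sample with increasingly sparse points.
sample = [(0,), (1,), (3,), (7,)]
```

Points in sparse regions get larger k-distances than points in dense regions, which is why this quantity is a common starting point for estimating a neighborhood radius.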


2021 ◽  
Vol 8 (10) ◽  
pp. 43-50
Author(s):  
Truong et al. ◽  

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), which is a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Lopamudra Dey ◽  
Sanjay Chakraborty

Clustering is a significant technique whose applications are spread over various fields. Because clustering is an unsupervised process in data mining, the proper evaluation of its results and the measurement of the compactness and separability of the clusters are important issues. The procedure of evaluating the results of a clustering algorithm is known as cluster validity measurement. Different types of indices are used to solve different types of problems, and the selection of indices depends on the kind of data available. This paper first proposes a Canonical PSO based K-means clustering algorithm, then analyses some important clustering indices (intercluster, intracluster), and then evaluates the effects of those indices on a real-time air pollution database and on wholesale customer, wine, and vehicle datasets using typical K-means, Canonical PSO based K-means, simple PSO based K-means, DBSCAN, and Hierarchical clustering algorithms. This paper also describes the nature of the clusters and finally compares the performance of these clustering algorithms according to the validity assessment. It also identifies which of these algorithms is most desirable for producing properly compact clusters on these particular real-life datasets. In essence, it deals with the behaviour of these clustering algorithms with respect to validation indices and presents the evaluation results in mathematical and graphical forms.


Author(s):  
Junjie Wu ◽  
Jian Chen ◽  
Hui Xiong

Cluster analysis (Jain & Dubes, 1988) provides insight into the data by dividing the objects into groups (clusters), such that objects in a cluster are more similar to each other than to objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and Unweighted Pair Group Method with Arithmetic Mean (UPGMA), are well established. A recent research focus in clustering analysis is to understand the strengths and weaknesses of various clustering algorithms with respect to data factors. Indeed, people have identified some data characteristics that may strongly affect clustering analysis, including high dimensionality and sparseness, large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is needed to reveal whether and how the data distributions affect the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions: 1. What are the systematic differences between the distributions of the resultant clusters produced by different clustering algorithms? 2. How does the distribution of the “true” cluster sizes affect the performance of clustering algorithms? 3. How should an appropriate clustering algorithm be chosen in practice? The answers to these questions can guide us toward a better understanding and use of clustering methods. This is noteworthy, since 1) in theory, people have seldom realized that there are strong relationships between clustering algorithms and cluster size distributions, and 2) in practice, choosing an appropriate clustering algorithm is still a challenging task, especially after the boom of algorithms in the data mining area. This chapter thus makes an initial attempt to fill this void.
To this end, we carefully select two widely used categories of clustering algorithms, i.e., K-means and Agglomerative Hierarchical Clustering (AHC), as the representative algorithms for illustration. In the chapter, we first show that K-means tends to generate the clusters with a relatively uniform distribution on the cluster sizes. Then we demonstrate that UPGMA, one of the robust AHC methods, acts in an opposite way to K-means; that is, UPGMA tends to generate the clusters with high variation on the cluster sizes. Indeed, the experimental results indicate that the variations of the resultant cluster sizes by K-means and UPGMA, measured by the Coefficient of Variation (CV), are in the specific intervals, say [0.3, 1.0] and [1.0, 2.5] respectively. Finally, we put together K-means and UPGMA for a further comparison, and propose some rules for the better choice of the clustering schemes from the data distribution point of view.
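The Coefficient of Variation of the cluster sizes can be computed as in the sketch below. It assumes the sample standard deviation (divisor n − 1), which may differ from the chapter's exact convention, and the function name and label vectors are hypothetical.

```python
import statistics
from collections import Counter


def cluster_size_cv(labels):
    """Coefficient of Variation (std / mean) of the cluster sizes.

    labels is a sequence with one cluster label per object.
    Assumption: the sample standard deviation (n - 1 divisor) is used.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        raise ValueError("need at least two clusters")
    return statistics.stdev(sizes) / statistics.mean(sizes)


# Hypothetical label vectors: balanced and unbalanced cluster sizes.
balanced = [0] * 20 + [1] * 20 + [2] * 20
unbalanced = [0] * 10 + [1] * 20 + [2] * 30
```

A value of 0 means all clusters have the same size, and larger values mean more variation in size, which is the property the chapter uses to contrast K-means with UPGMA.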

