A Novel Hierarchical Clustering Approach Based on Universal Gravitation

Mathematical Problems in Engineering ◽

10.1155/2020/6748056 ◽

2020 ◽

Vol 2020 ◽

pp. 1-15

Author(s):

Peng Zhang ◽

Kun She

Keyword(s):

Hierarchical Clustering ◽

Clustering Analysis ◽

Gravitational Force ◽

Clustering Algorithms ◽

Influence Coefficient ◽

Data Sets ◽

Universal Gravitation ◽

Real World Data ◽

Gravitational Influence ◽

Clustering Approach

The goal of clustering analysis is to group a set of data points into several clusters based on similarity or distance. In many traditional clustering algorithms, the similarity or distance is a scalar. A vector, such as the data gravitational force, carries more information than a scalar and can be used in clustering analysis to improve clustering performance. This paper therefore proposes a three-stage hierarchical clustering approach called GHC, which exploits the vector nature of the data gravitational force, inspired by the law of universal gravitation. In the first stage, a sparse gravitational graph is constructed from the top k data gravitations between each data point and its neighbors in the local region. In the second stage, the sparse graph is partitioned into many subgraphs using the gravitational influence coefficient. In the last stage, the final clustering result is obtained by iteratively merging these subgraphs with a new linkage criterion. Experiments on synthetic and real-world data sets show that the GHC algorithm achieves better performance than the existing clustering algorithms it was compared against.
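The first stage described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the force formula, so the sketch assumes unit masses and a Newton-style magnitude m_i * m_j / d^2, and it searches all points rather than a local region. The name `gravitational_graph` and its parameters are hypothetical.

```python
import math


def gravitational_graph(points, k=2, masses=None):
    """Return {i: [(j, force_vector), ...]} with the k strongest pulls on each point.

    Assumptions (not from the paper): unit mass per point unless `masses`
    is given, and force magnitude m_i * m_j / d**2 directed from i to j.
    """
    n = len(points)
    masses = masses or [1.0] * n
    graph = {}
    for i, p in enumerate(points):
        forces = []
        for j, q in enumerate(points):
            if i == j:
                continue
            diff = [b - a for a, b in zip(p, q)]
            dist = math.sqrt(sum(d * d for d in diff))
            if dist == 0.0:
                continue  # coincident points: direction is undefined, so skip
            magnitude = masses[i] * masses[j] / dist ** 2
            # Force vector = magnitude times the unit vector from point i to point j.
            vector = tuple(magnitude * d / dist for d in diff)
            forces.append((magnitude, j, vector))
        # Keep only the k strongest forces, which makes the graph sparse.
        forces.sort(key=lambda t: t[0], reverse=True)
        graph[i] = [(j, vec) for _, j, vec in forces[:k]]
    return graph


if __name__ == "__main__":
    demo = gravitational_graph([(0, 0), (1, 0), (5, 0), (6, 0)], k=1)
    for node, edges in demo.items():
        print(node, edges)
```

Keeping the force as a vector rather than only its magnitude preserves direction, which is the extra information the abstract attributes to the gravitational approach.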

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome this shortcoming, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on real data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

A Data Distribution View of Clustering Algorithms

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch059 ◽

2011 ◽

pp. 374-381 ◽

Cited By ~ 1

Author(s):

Junjie Wu ◽

Jian Chen ◽

Hui Xiong

Keyword(s):

Data Mining ◽

Cluster Analysis ◽

Clustering Analysis ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Distribution ◽

Point Of View ◽

Group Method ◽

Data Sets ◽

Distribution Point

Cluster analysis (Jain & Dubes, 1988) provides insight into the data by dividing the objects into groups (clusters), such that objects in a cluster are more similar to each other than to objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), are well established. A recent research focus in clustering analysis is to understand the strengths and weaknesses of various clustering algorithms with respect to data factors. Indeed, people have identified some data characteristics that may strongly affect clustering analysis, including high dimensionality and sparseness, large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is needed to reveal whether and how data distributions affect the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions: 1. What are the systematic differences between the distributions of the resultant clusters produced by different clustering algorithms? 2. How can the distribution of the “true” cluster sizes affect the performance of clustering algorithms? 3. How should an appropriate clustering algorithm be chosen in practice? The answers to these questions can guide us toward a better understanding and use of clustering methods. This is noteworthy, since 1) in theory, people have seldom realized that there are strong relationships between clustering algorithms and cluster size distributions, and 2) in practice, choosing an appropriate clustering algorithm is still a challenging task, especially after an algorithm boom in the data mining area. This chapter thus makes an initial attempt to fill this void.
To this end, we carefully select two widely used categories of clustering algorithms, i.e., K-means and Agglomerative Hierarchical Clustering (AHC), as the representative algorithms for illustration. In the chapter, we first show that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. Then we demonstrate that UPGMA, one of the robust AHC methods, acts in the opposite way to K-means; that is, UPGMA tends to generate clusters with high variation in cluster sizes. Indeed, the experimental results indicate that the variations of the resultant cluster sizes by K-means and UPGMA, measured by the Coefficient of Variation (CV), fall in specific intervals, say [0.3, 1.0] and [1.0, 2.5] respectively. Finally, we put K-means and UPGMA together for a further comparison, and propose some rules for a better choice of clustering scheme from the data distribution point of view.
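The Coefficient of Variation of cluster sizes mentioned above is the standard deviation of the cluster sizes divided by their mean. A minimal sketch follows; the function name `cluster_size_cv` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the abstract does not say which variant the chapter uses.

```python
from collections import Counter
import statistics


def cluster_size_cv(labels):
    """Coefficient of variation (std / mean) of the cluster sizes in `labels`.

    Assumption: sample standard deviation; the source abstract does not
    specify sample versus population.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        raise ValueError("need at least two clusters to measure variation")
    return statistics.stdev(sizes) / statistics.mean(sizes)


if __name__ == "__main__":
    balanced = ["a"] * 5 + ["b"] * 5          # equal cluster sizes
    skewed = [0] * 1 + [1] * 1 + [2] * 10     # one dominant cluster
    print(cluster_size_cv(balanced))
    print(cluster_size_cv(skewed))
```

A CV near 0 means nearly equal cluster sizes, while a larger CV means the sizes vary widely, which is the contrast the chapter draws between K-means and UPGMA.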

Emergent community agglomeration from data set geometry

10.1101/109587 ◽

2017 ◽

Author(s):

Chenchao Zhao ◽

Jun S. Song

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithms ◽

Feature Space ◽

Curse Of Dimensionality ◽

Dimensional Euclidean Space ◽

Dynamical Process ◽

Data Sets ◽

Learning Methods ◽

Data Set ◽

Automatic Grouping

In the statistical learning language, samples are snapshots of random vectors drawn from some unknown distribution. Such vectors usually reside in a high-dimensional Euclidean space, and thus, the "curse of dimensionality" often undermines the power of learning methods, including community detection and clustering algorithms, that rely on Euclidean geometry. This paper presents the idea of effective dissimilarity transformation (EDT) on empirical dissimilarity hyperspheres and studies its effects using synthetic and gene expression data sets. Iterating the EDT turns a static data distribution into a dynamical process purely driven by the empirical data set geometry and adaptively ameliorates the curse of dimensionality, partly through changing the topology of a Euclidean feature space into a compact hypersphere. The EDT often improves the performance of hierarchical clustering via the automatic grouping information emerging from global interactions of data points. The EDT is not restricted to hierarchical clustering, and other learning methods based on pairwise dissimilarity should also benefit from the many desirable properties of EDT.

Clustering, factor discovery and optimal transport

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iaaa040 ◽

2020 ◽

Author(s):

Hongkang Yang ◽

Esteban G Tabak

Keyword(s):

Latent Variables ◽

Optimal Transport ◽

Clustering Algorithms ◽

Data Sets ◽

Affine Transformations ◽

Real World Data ◽

Continuous Version ◽

Clustering Problem ◽

Latent Space ◽

Transport Maps

The clustering problem, and more generally latent factor discovery or latent space inference, is formulated in terms of the Wasserstein barycenter problem from optimal transport. The objective proposed is the maximization of the variability attributable to class, further characterized as the minimization of the variance of the Wasserstein barycenter. Existing theory, which constrains the transport maps to rigid translations, is extended to affine transformations. The resulting non-parametric clustering algorithms include $k$-means as a special case and exhibit more robust performance. A continuous version of these algorithms discovers continuous latent variables and generalizes principal curves. The strength of these algorithms is demonstrated by tests on both artificial and real-world data sets.
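The notion of "variability attributable to class" can be illustrated with the classical variance decomposition that underlies k-means, which the abstract names as a special case. The sketch below is not the Wasserstein barycenter formulation of the paper; it only shows, for one-dimensional data and a hypothetical helper `variance_split`, that total variance equals between-class variance plus within-class variance, so maximizing the first term is the same as minimizing the second.

```python
def variance_split(values, labels):
    """Return (between, within, total) population variances for 1-D data.

    Illustrates total = between + within; this is the classical k-means
    identity, used here as an assumption-laden stand-in for the paper's
    optimal transport objective.
    """
    n = len(values)
    grand_mean = sum(values) / n
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    # Variability explained by class membership (spread of the class means).
    between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    ) / n
    # Variability left inside each class (spread around each class mean).
    within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    ) / n
    total = sum((v - grand_mean) ** 2 for v in values) / n
    return between, within, total


if __name__ == "__main__":
    print(variance_split([0, 2, 10, 12], [0, 0, 1, 1]))
```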

A Novel Scalable Signature Based Subspace Clustering Approach for Big Data

International Journal of Information Technology and Web Engineering ◽

10.4018/ijitwe.2019040103 ◽

2019 ◽

Vol 14 (2) ◽

pp. 41-51 ◽

Cited By ~ 1

Author(s):

T. Gayathri ◽

D. Lalitha Bhaskari

Keyword(s):

Big Data ◽

Data Management ◽

Clustering Algorithms ◽

Synthetic Data ◽

Subspace Clustering ◽

Distance Measures ◽

Data Sets ◽

Management Tools ◽

Clustering Approach ◽

Different Dimensions

“Big data”, as the name suggests, is a collection of large and complicated data sets that are usually hard to process with on-hand data management tools or other conventional processing applications. This article presents a scalable signature-based subspace clustering approach that avoids identifying redundant clusters. Various distance measures are used in experiments that validate the performance of the proposed algorithm. Also, for the same purpose of validation, the synthetic data sets that are chosen have different dimensions, and their size will be distributed when opened with Weka. The F1 quality measure and the runtime on these synthetic data sets are computed. The performance of the proposed algorithm is compared with other existing clustering algorithms such as CLIQUE, INSCY and SUBCLU.

Scalable Recursive Top-Down Hierarchical Clustering Approach with Implicit Model Selection for Textual Data Sets

2010 Workshops on Database and Expert Systems Applications ◽

10.1109/dexa.2010.25 ◽

2010 ◽

Cited By ~ 5

Author(s):

Markus Muhr ◽

Vedran Sabol ◽

Michael Granitzer

Keyword(s):

Model Selection ◽

Hierarchical Clustering ◽

Data Sets ◽

Top Down ◽

Implicit Model ◽

Textual Data ◽

Clustering Approach ◽

Selection For

A Three-Level Optimization Model for Nonlinearly Separable Clustering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5719 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3211-3218

Author(s):

Liang Bai ◽

Jiye Liang

Keyword(s):

Optimization Model ◽

Clustering Algorithms ◽

Complex Structure ◽

Computational Cost ◽

Real Data ◽

Data Sets ◽

Real World Data ◽

Clustering Problem ◽

Efficiency And Effectiveness ◽

Clustering Problems

Due to the complex structure of real-world data, nonlinearly separable clustering is one of the most popular and widely studied clustering problems. Currently, various types of algorithms, such as kernel k-means, spectral clustering and density clustering, have been developed to solve this problem. However, it is difficult for them to balance the efficiency and effectiveness of clustering, which limits their real applications. To address this deficiency, we propose a three-level optimization model for nonlinearly separable clustering which divides the clustering problem into three sub-problems: a linearly separable clustering on the object set, a nonlinearly separable clustering on the cluster set and an ensemble clustering on the partition set. An iterative algorithm is proposed to solve the optimization problem. The proposed algorithm can effectively recognize nonlinearly separable clusters at low computational cost. The performance of this algorithm has been studied on synthetic and real data sets. Comparisons with other nonlinearly separable clustering algorithms illustrate the efficiency and effectiveness of the proposed algorithm.

Comparative Study of Document Clustering Algorithms

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.11.20816 ◽

2018 ◽

Vol 7 (4.11) ◽

pp. 246

Author(s):

N. M. Ariff ◽

M. A. A. Bakar ◽

M. I. Rahmad

Keyword(s):

Data Mining ◽

Hierarchical Clustering ◽

Clustering Analysis ◽

Clustering Algorithms ◽

Document Clustering ◽

Text Clustering ◽

Data Mining Technique ◽

Mining Technique ◽

Meaningful Result ◽

Different Types

Text clustering is a data mining technique that is becoming more important in present studies. Document clustering makes use of text clustering to divide documents according to their topics. The choice of words in document clustering is important to ensure that the documents can be classified correctly. Three different clustering methods, namely hierarchical clustering, k-means and k-medoids, are used and compared in this study in order to identify the method that produces the best result in document clustering. The three methods are applied to 60 sports articles involving four different types of sports. The k-medoids clustering produced the worst result, while k-means clustering was found to be more sensitive to general words. Therefore, hierarchical clustering is deemed the more stable method for producing a meaningful result in document clustering analysis.

Simplifying functional network representation and interpretation through causality clustering

Scientific Reports ◽

10.1038/s41598-021-94797-y ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Massimiliano Zanin

Keyword(s):

Complex System ◽

Human Brain ◽

Real World ◽

Functional Network ◽

Functional Networks ◽

Data Sets ◽

Real World Data ◽

Inherent Difficulty ◽

Information Dynamics ◽

Clustering Approach

Functional networks, i.e. networks representing the interactions between the elements of a complex system and reconstructed from the observed elements’ dynamics, are becoming a fundamental tool to unravel the structures created by the movement of information in systems like the human brain. They also present drawbacks, one of the most important being the inherent difficulty in representing and interpreting the resulting structures for large numbers of nodes and links. I here propose a causality clustering approach, based on grouping nodes into clusters according to their similarity in the overall information dynamics, the latter being measured by a causality metric. The whole system can then arbitrarily be simplified, with nodes being grouped in e.g. sources, brokers and sinks of information. The advantages and limitations of the proposed approach are discussed using a set of synthetic and real-world data sets, the latter representing two neuroscience and technological problems.

A COMPARATIVE ANALYSIS OF K-MEANS AND HIERARCHICAL CLUSTERING

EPRA International Journal of Multidisciplinary Research (IJMR) ◽

10.36713/epra8308 ◽

2021 ◽

pp. 412-418

Author(s):

Aastha Gupta ◽

Himanshu Sharma ◽

Anas Akhtar

Keyword(s):

Data Mining ◽

Hierarchical Clustering ◽

Clustering Algorithms ◽

Analytical Techniques ◽

Data Sets ◽

Clustering Methods ◽

Data Set ◽

Advantages And Disadvantages ◽

The Many ◽

Data Elements

Clustering is the process of arranging comparable data elements into groups. Clustering analysis is one of the most frequently used data mining analytical techniques; the clustering algorithm’s strategy has a direct influence on the clustering results. This study examines the many types of algorithms, such as k-means clustering algorithms, and compares and contrasts their advantages and disadvantages. This paper also highlights concerns with clustering algorithms, such as time complexity and accuracy, in order to give better outcomes in a variety of environments. The outcomes are described in terms of big datasets. The focus of this study is on clustering algorithms with the WEKA data mining tool. Clustering is the process of dividing a big data set into small groups or clusters. Clustering is an unsupervised approach that may be used to analyze big datasets with many characteristics. It’s a data-modeling technique that provides a clear image of your data. Two clustering methods, k-means and hierarchical clustering, are explained in this survey and analyzed using the WEKA tool on different data sets. KEYWORDS: data clustering, weka, k-means, hierarchical clustering
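The hierarchical side of this comparison can be sketched in a few lines. The example below is a naive bottom-up (agglomerative) clustering with average linkage, written in plain Python for illustration only; it is not the WEKA implementation used in the study, and the function name `agglomerative` and the choice of average linkage are assumptions.

```python
import math
from itertools import combinations


def agglomerative(points, n_clusters):
    """Naive average-linkage agglomerative clustering.

    Starts with one cluster per point and repeatedly merges the two
    closest clusters until `n_clusters` remain. Returns lists of point
    indices. Intended for small inputs only: every merge step recomputes
    all pairwise linkages.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average Euclidean distance over all cross-cluster point pairs.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # combinations yields a < b, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(agglomerative(pts, 2))
```

Unlike k-means, this procedure needs no initial centroids and yields a full merge hierarchy if run down to one cluster, but its repeated pairwise distance computation is one reason time complexity is a concern on big data sets.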
