A COMPARATIVE ANALYSIS OF K-MEANS AND HIERARCHICAL CLUSTERING

Big Data Clustering And Its Applications Examination

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.b1466.0982s1119 ◽

2019 ◽

Vol 8 (2S11) ◽

pp. 3687-3693

Keyword(s):

Data Mining ◽

Big Data ◽

Data Clustering ◽

Clustering Algorithms ◽

Large Data ◽

Data Sets ◽

Clustering Methods ◽

Time Saving ◽

Data Set ◽

The Many

Clustering is a data mining process in which a data set is divided into sub-classes. It is essential in classification, grouping, exploratory pattern analysis, image segmentation, and decision making. Big data can be described as very large data sets that are examined computationally to reveal patterns and associations, particularly those relating to human behavior and interactions. Big data is essential for many organisations, but in some cases it is very complex to store and time-consuming to process. One way of overcoming these issues is to develop clustering methods, although these also suffer from high complexity. Data mining is a technique for extracting useful information, but data mining models cannot be applied directly to big data because of its inherent complexity. The main aim here is to give an overview of the categories of data clustering for big data and to describe some of the related work. This survey concentrates on research into clustering algorithms that operate on the elements of big data. It also gives a short overview of clustering algorithms grouped as partitioning, hierarchical, grid-based, and model-based. Clustering is a major data mining task and is used for analyzing big data. The problems of applying clustering to big data, and the new issues we face with big data, are also discussed.

Performance Comparison with Hierarchical and Partitional Clustering Methods

WSEAS TRANSACTIONS ON COMMUNICATIONS ◽

10.37394/23204.2021.20.23 ◽

2021 ◽

Vol 20 ◽

pp. 177-184

Author(s):

Ozer Ozdemir ◽

Simgenur Cerman

Keyword(s):

Data Mining ◽

Hierarchical Clustering ◽

Clustering Algorithms ◽

Performance Comparison ◽

Hierarchical Partitioning ◽

Clustering Methods ◽

Data Set ◽

Partitional Clustering ◽

Statistical Programming ◽

Using Data

In data mining, clustering is one of the most commonly used techniques. Clustering can be carried out with different kinds of algorithms, such as hierarchical, partitioning, grid-based, density-based, and graph-based algorithms. This study first explains the concept of data mining, then describes the aims and application areas of data mining, and then explains theoretically the clustering algorithms used in data mining. Within the scope of the study, partitional and hierarchical clustering algorithms are applied to the "Mall Customers" data set taken from Kaggle, with the aim of separating customers into clusters according to their features. In the clusters obtained by the partitional clustering algorithms, the similarity within a cluster is maximal and the similarity between clusters is minimal. Hierarchical clustering algorithms are based on successively merging similar items or, conversely, splitting dissimilar ones. The partitional clustering algorithms used are k-means and PAM; the hierarchical clustering algorithms used are AGNES and DIANA. The R statistical programming language was used to apply the algorithms. At the end of the study, the clustering algorithms were run on the data set and the results were interpreted.
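The contrast between a partitional and a hierarchical method described in this abstract can be illustrated with a minimal sketch. It is not the study's code: the study used R, whereas this sketch uses plain Python; the `customers` values are hypothetical and are not taken from the Mall Customers data; seeding k-means with the first k points and using single linkage for the AGNES-style merge are assumptions made only to keep the example small and deterministic. PAM and DIANA are not shown.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iters=100):
    """Lloyd's k-means. Seeding with the first k points is an assumption
    made for determinism, not something the abstract specifies."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels


def agnes(points, k):
    """Bottom-up (AGNES-style) merging until k clusters remain.
    Single linkage is an assumed choice of linkage."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(points[i], points[j]) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the two closest clusters
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


# Hypothetical (annual income, spending score) pairs for illustration only.
customers = [(15, 20), (17, 25), (16, 22), (80, 85), (85, 90), (82, 88)]
print(kmeans(customers, 2))  # [0, 0, 0, 1, 1, 1]
print(agnes(customers, 2))   # [0, 0, 0, 1, 1, 1]
```

On well-separated toy data like this, both approaches recover the same two groups; differences between the methods tend to appear on data with overlapping or elongated clusters.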

Improved minimum-minimum roughness algorithm for clustering categorical data

International Journal of ADVANCED AND APPLIED SCIENCES ◽

10.21833/ijaas.2021.10.006 ◽

2021 ◽

Vol 8 (10) ◽

pp. 43-50

Author(s):

Truong et al. ◽

Keyword(s):

Machine Learning ◽

Data Mining ◽

Hierarchical Clustering ◽

Categorical Data ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Experimental Results ◽

Data Sets ◽

Top Down ◽

Hierarchical Clustering Algorithm

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), which is a top-down hierarchical clustering algorithm and can handle the uncertainty in clustering categorical data. However, MMR tends to choose the category with less value leaf node with more objects, leading to undesirable clustering results. To overcome such shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.

Emergent community agglomeration from data set geometry

10.1101/109587 ◽

2017 ◽

Author(s):

Chenchao Zhao ◽

Jun S. Song

Keyword(s):

Hierarchical Clustering ◽

Clustering Algorithms ◽

Feature Space ◽

Curse Of Dimensionality ◽

Dimensional Euclidean Space ◽

Dynamical Process ◽

Data Sets ◽

Learning Methods ◽

Data Set ◽

Automatic Grouping

In the statistical learning language, samples are snapshots of random vectors drawn from some unknown distribution. Such vectors usually reside in a high-dimensional Euclidean space, and thus, the "curse of dimensionality" often undermines the power of learning methods, including community detection and clustering algorithms, that rely on Euclidean geometry. This paper presents the idea of effective dissimilarity transformation (EDT) on empirical dissimilarity hyperspheres and studies its effects using synthetic and gene expression data sets. Iterating the EDT turns a static data distribution into a dynamical process purely driven by the empirical data set geometry and adaptively ameliorates the curse of dimensionality, partly through changing the topology of a Euclidean feature space into a compact hypersphere. The EDT often improves the performance of hierarchical clustering via the automatic grouping information emerging from global interactions of data points. The EDT is not restricted to hierarchical clustering, and other learning methods based on pairwise dissimilarity should also benefit from the many desirable properties of EDT.

Time series event correlation with DTW and Hierarchical Clustering methods

10.7287/peerj.preprints.27959 ◽

2019 ◽

Author(s):

Srishti Mishra ◽

Zohair Shafi ◽

Santanu Pathak

Keyword(s):

Time Series ◽

Hierarchical Clustering ◽

Time Series Data ◽

Series Data ◽

Data Sets ◽

Multiple Time ◽

Clustering Methods ◽

Event Correlation ◽

Data Set ◽

Causation Analysis

Data-driven decision making is becoming an increasingly important aspect of successful business execution. More and more organizations are moving towards making informed decisions based on the data that they generate. Most of this data is temporal, that is, time series data. Analyzing time series data sets effectively and quickly is a challenge. The most interesting and valuable part of such analysis is generating insights on correlation and causation across multiple time series data sets. This paper looks at methods that can be used to analyze such data sets and gain useful insights from them, primarily in the form of correlation and causation analysis. It focuses on two methods for doing so, a two-sample test with Dynamic Time Warping and hierarchical clustering, and looks at how the results returned from both can be used to gain a better understanding of the data. Moreover, the methods are meant to work with any data set, regardless of its subject domain and idiosyncrasies; that is, the approach is data-agnostic.
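For readers unfamiliar with Dynamic Time Warping, the following is a minimal sketch of the classic dynamic-programming DTW distance between two numeric series. It is not the paper's implementation: the function name `dtw_distance`, the absolute difference as local cost, and the absence of any warping-window constraint are all assumptions, since the abstract does not specify them.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as the
    local cost (an assumed choice) and no window constraint."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance in a only, in b only, or in both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The same shape sampled at a different pace aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
# A constant offset of 1 over three aligned steps costs 3.
print(dtw_distance([0, 0, 0], [1, 1, 1]))     # 3.0
```

A matrix of such pairwise distances between series is one possible input to a hierarchical clustering routine, which is one way the two techniques named in the abstract could be combined.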

The Modeling and Simulation of Data Clustering Algorithms in Data Mining with Big Data

Journal of Industrial Integration and Management ◽

10.1142/s2424862218500173 ◽

2019 ◽

Vol 04 (01) ◽

pp. 1850017 ◽

Cited By ~ 3

Author(s):

Weiru Chen ◽

Jared Oliverio ◽

Jin Ho Kim ◽

Jiayue Shen

Keyword(s):

Data Mining ◽

Big Data ◽

Data Reduction ◽

Data Clustering ◽

Clustering Algorithms ◽

High Volume ◽

Clustering Methods ◽

Data Set ◽

Processing Methods ◽

Integration Data

Big Data is a popular cutting-edge technology nowadays. Techniques and algorithms are expanding in different areas including engineering, biomedical, and business. Due to the high-volume and complexity of Big Data, it is necessary to conduct data pre-processing methods when data mining. The pre-processing methods include data cleaning, data integration, data reduction, and data transformation. Data clustering is the most important step of data reduction. With data clustering, mining on the reduced data set should be more efficient yet produce quality analytical results. This paper presents the different data clustering methods and related algorithms for data mining with Big Data. Data clustering can increase the efficiency and accuracy of data mining.

HIERARCHICAL CLUSTERING FOR COMPLEX DATA

International Journal of Artificial Intelligence Tools ◽

10.1142/s0218213005002399 ◽

2005 ◽

Vol 14 (05) ◽

pp. 791-809 ◽

Cited By ~ 7

Author(s):

LATIFUR KHAN ◽

FENG LUO

Keyword(s):

Gene Expression ◽

Hierarchical Clustering ◽

Clustering Algorithms ◽

Hierarchical Level ◽

Data Sets ◽

Complex Data ◽

Agglomerative Clustering ◽

Levels Of Abstraction ◽

Data Set ◽

Self Organizing

In this paper we introduce a new tree-structured self-organizing neural network called a dynamical growing self-organizing tree (DGSOT). The DGSOT algorithm constructs a hierarchy from top to bottom by division. At each hierarchical level, the DGSOT optimizes the number of clusters, from which the proper hierarchical structure of the underlying data set can be found. We propose a K-level up distribution (KLD) mechanism. The KLD scheme increases the scope for data distribution in the hierarchy, which allows data mis-clustered in the early stages to be re-evaluated at a later stage, increasing the accuracy of the final clustering result. The DGSOT algorithm, combined with the KLD mechanism, overcomes the drawbacks of traditional hierarchical clustering algorithms (e.g., hierarchical agglomerative clustering). The DGSOT algorithm has been tested on two benchmark data sets, including a complex gene expression data set, and we observe that our algorithm extracts patterns at different levels of abstraction. Furthermore, our approach is useful for recognizing features in complex gene expression data. These results can easily be displayed as a dendrogram for visualization.

Graph Theoretic Techniques for Clustering and Biclustering gene expression data.

International Journal of Computer and Communication Technology ◽

10.47893/ijcct.2012.1136 ◽

2012 ◽

pp. 173-181

Author(s):

Prangyaparamita Mohapatra ◽

Tripti Swarnkar

Keyword(s):

Gene Expression ◽

Data Mining ◽

Gene Expression Data ◽

Biological Networks ◽

Clustering Algorithms ◽

Expression Data ◽

Microarray Technology ◽

Clustering Methods ◽

Experimental Conditions ◽

Data Set

DNA microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes during biological processes and across collections of related samples. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven useful for identifying biologically relevant groups of genes and samples. A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results of the application of standard clustering methods to genes are limited. These limited results are imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the gene expression matrix have been proposed to date. This simultaneous clustering, usually designated by biclustering, seeks to find submatrices that are subgroups of genes and subgroups of columns, where the genes exhibit highly correlated activities for every condition. 
Algorithms of this type have also been proposed and used in other fields, such as information retrieval and data mining. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches.

Time series event correlation with DTW and Hierarchical Clustering methods

10.7287/peerj.preprints.27959v1 ◽

2019 ◽

Cited By ~ 1

Author(s):

Srishti Mishra ◽

Zohair Shafi ◽

Santanu Pathak

Keyword(s):

Time Series ◽

Hierarchical Clustering ◽

Time Series Data ◽

Series Data ◽

Data Sets ◽

Multiple Time ◽

Clustering Methods ◽

Event Correlation ◽

Data Set ◽

Causation Analysis

Data-driven decision making is becoming an increasingly important aspect of successful business execution. More and more organizations are moving towards making informed decisions based on the data that they generate. Most of this data is temporal, that is, time series data. Analyzing time series data sets effectively and quickly is a challenge. The most interesting and valuable part of such analysis is generating insights on correlation and causation across multiple time series data sets. This paper looks at methods that can be used to analyze such data sets and gain useful insights from them, primarily in the form of correlation and causation analysis. It focuses on two methods for doing so, a two-sample test with Dynamic Time Warping and hierarchical clustering, and looks at how the results returned from both can be used to gain a better understanding of the data. Moreover, the methods are meant to work with any data set, regardless of its subject domain and idiosyncrasies; that is, the approach is data-agnostic.

A New semi-supervised clustering for incomplete data

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189744 ◽

2021 ◽

pp. 1-13

Author(s):

Sonia Goel ◽

Meena Tushir

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Complete Data ◽

Unlabeled Data ◽

Misclassification Rate ◽

Data Sets ◽

Clustering Methods ◽

Data Set ◽

Supervised Clustering

A semi-supervised clustering technique partitions unlabeled data based on prior knowledge from labeled data. Most semi-supervised clustering algorithms exist only for the clustering of complete data, i.e., data sets with no missing features. In this paper, an effort has been made to check the effectiveness of semi-supervised clustering when applied to incomplete data sets. The novelty of this approach is that it considers the missing features along with the available knowledge (labels) of the data set. The linear interpolation imputation technique initially imputes the missing features of the data set, thus completing the data set. Semi-supervised clustering is then applied to this complete data set, and the missing features are regularly updated within the clustering process. In the proposed work, the labeled percentages used are 30, 40, 50, and 60% of the total data. The data is further altered by arbitrarily eliminating certain features of its components, which makes the data incomplete with partial labeling. The proposed algorithm utilizes both labeled and unlabeled data, along with certain missing values in the data. The proposed algorithm is evaluated using three performance indices, namely the misclassification rate, random index metric, and error rate. Despite the additional missing features, the proposed algorithm has been successfully implemented on real data sets and showed better or competitive results compared with well-known standard semi-supervised clustering methods.
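The initial imputation step mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say along which axis or ordering the interpolation is applied, so the sketch interpolates along a single sequence of values for one feature; the function name `interpolate_missing`, the use of `None` for a missing value, and copying the nearest observed value at the edges are all hypothetical choices. The later in-clustering updates of missing features are not shown.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between the nearest
    observed neighbors. Leading and trailing gaps copy the nearest
    observed value (an assumed edge rule)."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]   # gap at the start
        elif right is None:
            filled[i] = values[left]    # gap at the end
        else:
            # Fraction of the way from the left neighbor to the right one.
            t = (i - left) / (right - left)
            filled[i] = values[left] + t * (values[right] - values[left])
    return filled


# One feature observed at the ends only; the three gaps fall on a straight line.
print(interpolate_missing([0.0, None, None, None, 4.0]))  # [0.0, 1.0, 2.0, 3.0, 4.0]
```

Once every feature has been completed this way, any clustering routine that expects complete data can be run on the result.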
