Incremental Hierarchical Clustering for Data Insertion and Its Evaluation

2020 ◽  
Vol 8 (2) ◽  
pp. 1-22
Author(s):  
Kakeru Narita ◽  
Teruhisa Hochin ◽  
Yoshihiro Hayashi ◽  
Hiroki Nomiya

Clustering is employed in various fields. However, conventional methods do not account for changing data: if the data change, the entire dataset must be re-clustered. This article proposes a clustering method that updates the result of a hierarchical clustering method without re-clustering when a point is inserted. It defines the center and the radius of a cluster and determines the cluster into which a point should be inserted. The insertion location is determined by similarity, following the conventional clustering method. The article also introduces the concept of outliers and considers creating a new cluster as a result of an insertion. A cluster is divided when examination shows it to be multimodal. In addition, when the number of clusters increases, previously inserted data points are re-inserted to update the result. The experimental results demonstrate that the execution time of the proposed method is significantly shorter than that of the conventional method, while clustering accuracy is comparable for some data.
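As a rough illustration of the insertion step described above, the sketch below absorbs a new point into the nearest cluster when it falls within an outlier threshold of that cluster's radius, and otherwise starts a new cluster. The data structure, the `outlier_factor` parameter, and the radius update are illustrative assumptions, not the paper's exact definitions (in particular, a singleton cluster here has radius 0, which a full implementation would need to handle).

```python
import numpy as np

def insert_point(clusters, point, outlier_factor=1.5):
    """Insert a point into the nearest cluster, or start a new one.

    `clusters` is a list of dicts with keys "points", "center", and
    "radius"; these names are illustrative, not the paper's.
    """
    best, best_dist = None, np.inf
    for c in clusters:
        d = np.linalg.norm(point - c["center"])
        if d < best_dist:
            best, best_dist = c, d
    # Treat the point as an outlier (new cluster) if it falls well
    # outside the nearest cluster's radius.
    if best is None or best_dist > outlier_factor * best["radius"]:
        clusters.append({"points": point[None, :],
                         "center": point.astype(float),
                         "radius": 0.0})
        return clusters
    # Otherwise absorb it and update the center and radius incrementally.
    best["points"] = np.vstack([best["points"], point])
    best["center"] = best["points"].mean(axis=0)
    best["radius"] = np.linalg.norm(best["points"] - best["center"], axis=1).max()
    return clusters
```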

2012 ◽  
Vol 532-533 ◽  
pp. 1373-1377 ◽  
Author(s):  
Ai Ping Deng ◽  
Ben Xiao ◽  
Hui Yong Yuan

To address two disadvantages of the K-means algorithm, namely having to specify the number of clusters in advance and sensitivity to the selection of initial cluster centers, an improved K-means algorithm is proposed in which the cluster centers and the number of clusters change dynamically. The new algorithm determines the cluster centers by calculating the density of data points and shared-nearest-neighbor similarity, and controls the number of clustering categories using the average shared-nearest-neighbor self-similarity. Experimental results on the Iris test data set show that the algorithm can select cluster centers and distinguish between different types of clusters efficiently.
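The shared-nearest-neighbor similarity this abstract relies on can be sketched as follows: two points are similar to the extent that their k-nearest-neighbor lists overlap. The function name and the choice of Euclidean distance are assumptions for illustration.

```python
import numpy as np

def snn_similarity(X, k=3):
    """Shared-nearest-neighbor similarity: the number of k-nearest
    neighbors two points have in common."""
    n = len(X)
    # Pairwise distances, then each point's k nearest neighbors
    # (index 0 of the argsort is the point itself, so skip it).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [set(np.argsort(dists[i])[1:k + 1]) for i in range(n)]
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            sim[i, j] = len(neighbors[i] & neighbors[j])
    return sim
```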


Corpora ◽  
2008 ◽  
Vol 3 (1) ◽  
pp. 59-81 ◽  
Author(s):  
Stefan Th. Gries ◽  
Martin Hilpert

In this paper, we introduce a data-driven bottom-up clustering method for the identification of stages in diachronic corpus data that differ from each other quantitatively. Much like regular approaches to hierarchical clustering, it is based on identifying and merging the most cohesive groups of data points, but, unlike regular approaches to clustering, it allows for the merging of temporally adjacent data, thus, in effect, preserving the chronological order. We exemplify the method with two case studies, one on verbal complementation of shall, the other on the development of the perfect in English.
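The adjacency-preserving merge the authors describe can be sketched as a standard agglomerative loop restricted to neighboring groups, so chronological order is never broken. The distance measure here (absolute difference of group means) is a placeholder for whatever cohesion measure the method actually uses.

```python
def adjacency_constrained_clustering(series, n_clusters=2):
    """Bottom-up clustering that only ever merges temporally adjacent
    groups, preserving the chronological order of the data."""
    clusters = [[x] for x in series]  # start with one period per cluster
    while len(clusters) > n_clusters:
        # Find the pair of *neighboring* clusters whose means are closest.
        means = [sum(c) / len(c) for c in clusters]
        gaps = [abs(means[i + 1] - means[i]) for i in range(len(means) - 1)]
        i = gaps.index(min(gaps))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]  # merge the pair
    return clusters
```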


2013 ◽  
Vol 457-458 ◽  
pp. 919-925
Author(s):  
Yu Hua Liu ◽  
Cui Xu ◽  
Ke Xu ◽  
Jian Zhi Jin

By analyzing k-means, we find that the traditional algorithm suffers from several shortcomings: it requires the user to specify the number of clusters k in advance, is sensitive to the initial cluster centers, is sensitive to noise and isolated data, applies only to globular clusters, and is easily trapped in a local solution. The improved algorithm uses the potential of the data to find center data and eliminate noise data. It decomposes big or extended clusters into several small clusters, then merges adjacent small clusters into a big cluster using the information provided by the Safety Area. Experimental results demonstrate that the improved k-means algorithm can determine the number of clusters, distinguish irregular clusters to a certain extent, decrease the dependence on the initial cluster centers, eliminate the effects of noise data, and achieve better clustering accuracy.
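The idea of using a data potential to pick centers can be sketched in the style of subtractive clustering: dense points get high potential, and each chosen center damps its neighborhood so the next center lands elsewhere. The Gaussian kernel and `sigma` below are illustrative stand-ins for the paper's potential function.

```python
import numpy as np

def potential_centers(X, n_centers=2, sigma=1.0):
    """Pick cluster centers by a kernel-density "potential" and damp
    the neighborhood of each chosen center before picking the next."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    potential = np.exp(-d2 / sigma ** 2).sum(axis=1)
    centers = []
    for _ in range(n_centers):
        i = int(np.argmax(potential))
        centers.append(i)
        # Subtract the chosen center's influence from every point.
        potential = potential - potential[i] * np.exp(-d2[i] / sigma ** 2)
    return centers
```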


Author(s):  
Yonghua Zhu ◽  
Xiaofeng Zhu ◽  
Wei Zheng

Although multi-view clustering can use more information than single-view clustering, existing multi-view clustering methods still have issues to be addressed, such as initialization sensitivity, the specification of the number of clusters, and the influence of outliers. In this paper, we propose a robust multi-view clustering method to address these issues. Specifically, we first propose a multi-view sum-of-square error estimation to make the initialization easy and simple, and use a sum-of-norm regularization to automatically learn the number of clusters from the data distribution. We further employ robust estimators constructed by half-quadratic theory to avoid the influence of outliers when estimating both the sum-of-square error and the number of clusters. Experimental results on both synthetic and real datasets demonstrate that our method outperforms the state-of-the-art methods.



2011 ◽  
Vol 1 (3) ◽  
pp. 1-14 ◽  
Author(s):  
Wan Maseri Binti Wan Mohd ◽  
A.H. Beg ◽  
Tutut Herawan ◽  
A. Noraziah ◽  
K. F. Rabbi

K-means is an unsupervised partitioning clustering algorithm. It is popular and widely used for its simplicity and speed. K-means clustering produces a number of separate, flat (non-hierarchical) clusters and is suitable for generating globular clusters. The main drawback of the k-means algorithm is that the user must specify the number of clusters in advance. This paper presents an improved version of the K-means algorithm that automatically generates an initial number of clusters (k) and a new approach to defining initial centroids for an effective and efficient clustering process. The underlying mechanism has been analyzed and tested experimentally. The experimental results show that the number of iterations is reduced by 50% and that the run time is lower and nearly constant, depending on the maximum distance between data points rather than on how many data points there are.
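One plausible reading of deriving k from the maximum distance of data points is a farthest-point loop that keeps adding centroids until every point lies within a fraction of the data's diameter of some centroid. The `threshold` value and the farthest-point rule below are assumptions, not the paper's exact procedure.

```python
import numpy as np

def auto_k_centroids(X, threshold=0.5):
    """Grow the centroid set until every point is within `threshold`
    times the maximum pairwise distance of some centroid, so k is
    derived from the data rather than supplied by the user."""
    max_dist = max(np.linalg.norm(a - b) for a in X for b in X)
    centroids = [X[0]]
    while True:
        # Distance from each point to its nearest current centroid.
        d = np.array([min(np.linalg.norm(x - c) for c in centroids) for x in X])
        if d.max() <= threshold * max_dist:
            return np.array(centroids)
        centroids.append(X[int(d.argmax())])  # farthest point becomes a centroid
```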


2020 ◽  
Vol 34 (05) ◽  
pp. 8360-8367
Author(s):  
Ting-En Lin ◽  
Hua Xu ◽  
Hanlei Zhang

Identifying new user intents is an essential task in dialogue systems. However, it is hard to obtain satisfying clustering results, since the definition of intents is strongly guided by prior knowledge. Existing methods incorporate prior knowledge through intensive feature engineering, which not only leads to overfitting but also makes the results sensitive to the number of clusters. In this paper, we propose constrained deep adaptive clustering with cluster refinement (CDAC+), an end-to-end clustering method that naturally incorporates pairwise constraints as prior knowledge to guide the clustering process. Moreover, we refine the clusters by forcing the model to learn from high-confidence assignments. After eliminating low-confidence assignments, our approach is surprisingly insensitive to the number of clusters. Experimental results on three benchmark datasets show that our method yields significant improvements over strong baselines.
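The cluster-refinement step, learning only from high-confidence assignments, can be sketched as a simple filter over soft cluster probabilities; the threshold value below is an assumption, not the paper's setting.

```python
def high_confidence_labels(probs, threshold=0.9):
    """Keep only assignments the model is confident about: samples
    whose top cluster probability reaches `threshold` get a
    pseudo-label, the rest are dropped (returned as -1)."""
    labels = []
    for p in probs:
        top = max(p)
        labels.append(p.index(top) if top >= threshold else -1)
    return labels
```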


Author(s):  
Ana Belén Ramos-Guajardo

A new clustering method for random intervals that are measured in the same units over the same group of individuals is provided. It takes into account the similarity degree between the expected values of the random intervals that can be analyzed by means of a two-sample similarity bootstrap test. Thus, the expectations of each pair of random intervals are compared through that test and a p-value matrix is finally obtained. The suggested clustering algorithm considers such a matrix where each p-value can be seen at the same time as a kind of similarity between the random intervals. The algorithm is iterative and includes an objective stopping criterion that leads to statistically similar clusters that are different from each other. Some simulations to show the empirical performance of the proposal are developed and the approach is applied to two real-life situations.
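The iterative procedure over the p-value matrix can be sketched as a greedy merge that treats each p-value as a similarity and stops once all remaining cluster pairs differ significantly at level alpha. The single-linkage choice here is an illustrative simplification, not the paper's exact criterion.

```python
def pvalue_clustering(p_matrix, alpha=0.05):
    """Group items whose pairwise p-values exceed `alpha` (i.e., whose
    expectations are not significantly different), merging the most
    similar pair first and stopping when no pair remains above alpha."""
    clusters = [[i] for i in range(len(p_matrix))]
    while len(clusters) > 1:
        # Similarity between clusters: the largest p-value across pairs.
        best, bi, bj = -1.0, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                p = max(p_matrix[a][b] for a in clusters[i] for b in clusters[j])
                if p > best:
                    best, bi, bj = p, i, j
        if best <= alpha:  # remaining clusters differ significantly: stop
            break
        clusters[bi] += clusters[bj]
        del clusters[bj]
    return clusters
```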


2021 ◽  
Vol 11 (15) ◽  
pp. 7169
Author(s):  
Mohamed Allouche ◽  
Tarek Frikha ◽  
Mihai Mitrea ◽  
Gérard Memmi ◽  
Faten Chaabane

To bridge the current gap between Blockchain expectations and their intensive computation constraints, the present paper advances a lightweight processing solution, based on a load-balancing architecture, compatible with lightweight/embedded processing paradigms. In this way, the execution of complex operations is securely delegated to an off-chain general-purpose computing machine while the intimate Blockchain operations are kept on-chain. The illustrations correspond to an on-chain Tezos configuration and to a multiprocessor ARM embedded platform (integrated into a Raspberry Pi). The performance is assessed in terms of security, execution time, and CPU consumption on a visual document fingerprinting task. It is thus demonstrated that the advanced solution makes it possible for a computation-intensive application to be deployed under severely constrained computation and memory resources, as set by a Raspberry Pi 3. The experimental results show that up to nine Tezos nodes can be deployed on a single Raspberry Pi 3 and that the limitation derives not from the memory but from the computation resources. The execution time with a limited number of fingerprints is 40% higher than with a classical PC solution (value computed with 95% relative error lower than 5%).


Author(s):  
Poonam Rani ◽  
MPS Bhatia ◽  
Devendra K Tayal

The paper presents an intelligent approach for comparing social networks through a cone model using the fuzzy k-medoids clustering method. It makes use of a geometrical three-dimensional conical model that compactly represents users' experience views. It uses both the static and the dynamic parameters of social networks. We propose an algorithm that investigates which social network is more fruitful. For the experimental results, the proposed approach is applied to data collected from students at different universities through Google Forms, where students rate their experience of using different social networks on different scales.
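For context, the membership update at the heart of fuzzy k-medoids (independent of the paper's cone model, which is not reconstructed here) assigns each point a degree of belonging to every medoid that falls off with distance:

```python
def fuzzy_memberships(dists, m=2.0):
    """Standard fuzzy membership update: membership of point j in
    medoid i's cluster is inversely related to distance, with
    fuzzifier m. `dists[j][i]` is the distance from point j to
    medoid i; each row of the result sums to 1."""
    memberships = []
    for row in dists:
        u = []
        for d_i in row:
            if d_i == 0.0:
                # A point sitting on a medoid belongs fully to it.
                u = [1.0 if d == 0.0 else 0.0 for d in row]
                break
            u.append(1.0 / sum((d_i / d_k) ** (2.0 / (m - 1.0)) for d_k in row))
        memberships.append(u)
    return memberships
```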

