Subspace Clustering for High-Dimensional Data Using Cluster Structure Similarity

2018 ◽  
Vol 14 (3) ◽  
pp. 38-55 ◽  
Author(s):  
Kavan Fatehi ◽  
Mohsen Rezvani ◽  
Mansoor Fateh ◽  
Mohammad-Reza Pajoohan

Because of the curse of dimensionality in high-dimensional data, a significant amount of recent research has addressed subspace clustering, which aims to discover clusters embedded in any possible combination of attributes. The main goal of subspace clustering algorithms is to find all clusters in all subspaces. Previous approaches have mostly generated redundant subspace clusters, which reduces clustering accuracy and increases running time. This article proposes a bottom-up density-based approach in which cluster structure serves as a similarity measure for generating the optimal subspaces, which in turn raises the accuracy of subspace clustering. The algorithm discovers similar subspaces by comparing their cluster structure, combines them, and clusters the data again in the resulting subspaces. Finally, the algorithm determines all the subspaces and finds all clusters within them. Experiments on various synthetic and real datasets show that the proposed approach is significantly better in quality and runtime than the state of the art in clustering high-dimensional data.
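The abstract does not define its cluster-structure similarity measure. As a rough illustration only, the sketch below compares the clusterings found in different one-dimensional subspaces with the Rand index (a swapped-in, standard agreement measure, not the authors' measure) and lists the subspace pairs that would be candidates for merging. The names `rand_index` and `similar_subspaces`, the threshold of 0.8, and the toy labels are all assumptions.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair agrees when both clusterings put the two points together,
    or both put them apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / len(pairs)


def similar_subspaces(clusterings, threshold=0.8):
    """Return subspace pairs whose cluster structures agree at least `threshold`."""
    names = sorted(clusterings)
    return [
        (p, q)
        for p, q in combinations(names, 2)
        if rand_index(clusterings[p], clusterings[q]) >= threshold
    ]


# Hypothetical cluster labels of six points in three 1-D subspaces.
clusterings = {
    "x1": [0, 0, 0, 1, 1, 1],
    "x2": [1, 1, 1, 0, 0, 0],  # same partition as x1, different label names
    "x3": [0, 1, 0, 1, 0, 1],
}
candidates = similar_subspaces(clusterings)
```

Here only `x1` and `x2` induce the same partition, so they form the single candidate pair; in a bottom-up scheme such a pair would be combined into a two-dimensional subspace and clustered again.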

Author(s):  
Parul Agarwal ◽  
Shikha Mehta

Subspace clustering approaches cluster high-dimensional data in different subspaces; that is, they group the data using different relevant subsets of dimensions. This technique is effective because distance measures become ineffective in high-dimensional spaces. This chapter presents SUBSPACE_DE, a novel evolutionary approach to bottom-up subspace clustering that is scalable to high-dimensional data. SUBSPACE_DE uses a self-adaptive DBSCAN algorithm to cluster the data instances of each attribute and of the maximal subspaces. The self-adaptive DBSCAN algorithm accepts its input from a differential evolution algorithm. SUBSPACE_DE is tested on 14 datasets, both real and synthetic, and compared with 11 existing subspace clustering algorithms using evaluation metrics such as F1_Measure and accuracy. In a success rate ratio ranking, the proposed algorithm performs considerably better in both accuracy and F1_Measure. SUBSPACE_DE also shows potential scalability on high-dimensional datasets.
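To make the per-attribute clustering step concrete, here is a minimal sketch of plain DBSCAN applied to the values of a single attribute. It is not SUBSPACE_DE: the self-adaptive part, in which differential evolution supplies the DBSCAN input, is omitted, and `eps` and `min_pts` are fixed by hand. The function name `dbscan_1d` and the sample values are assumptions.

```python
def dbscan_1d(values, eps, min_pts):
    """Label 1-D values with DBSCAN; -1 marks noise.

    A point is a core point when at least `min_pts` values (itself
    included) lie within `eps` of it.
    """
    n = len(values)
    # Brute-force neighbourhoods; fine for a small illustration.
    neighbors = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # expand only through core points
    return labels


# Hypothetical values of one attribute: two dense groups and one outlier.
attribute = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 9.0]
labels = dbscan_1d(attribute, eps=0.15, min_pts=2)
```

With these hand-picked parameters the two dense groups receive labels 0 and 1, and the isolated value 9.0 is marked as noise.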


Entropy ◽  
2019 ◽  
Vol 21 (9) ◽  
pp. 906
Author(s):  
Muhammad Azhar ◽  
Mark Junjie Li ◽  
Joshua Zhexue Huang

Data classification is an important research topic in the field of data mining. With the rapid development of social media sites and IoT devices, data have grown tremendously in volume and complexity, producing many large and complex high-dimensional data sets. Classifying such high-dimensional complex data with a large number of classes has been a great challenge for current state-of-the-art methods. This paper presents a novel, hierarchical, gamma mixture model-based unsupervised method for classifying high-dimensional data with a large number of classes. In this method, we first partition the features of the data set into feature strata by using k-means. Then, a set of subspace data sets is generated from the feature strata by using the stratified subspace sampling method. After that, the GMM Tree algorithm is used to identify the number of clusters and the initial clusters in each subspace data set, and these initial cluster centers are passed to k-means to generate base subspace clustering results. Then, the subspace clustering results are integrated into an object cluster association (OCA) matrix by using the link-based method. The ensemble clustering result is generated from the OCA matrix by the k-means algorithm, with the number of clusters identified by the GMM Tree algorithm. After the ensemble clustering result is produced, the purity of each cluster is computed and the dominant class label is assigned to it. A new object is classified by computing the distance between the object and the center of each cluster in the classifier; the object receives the class label of the cluster with the shortest distance. A series of experiments were conducted on twelve synthetic and eight real-world data sets with different numbers of classes, features, and objects. The experimental results show that the new method outperforms other state-of-the-art techniques in classifying data on most of the data sets.
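The last two steps of the pipeline (labeling each cluster with its dominant class after computing purity, then classifying a new object by its nearest cluster center) can be sketched as follows. This is a simplified illustration under assumptions: the clustering itself is taken as given, Euclidean distance is assumed, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def label_clusters(cluster_ids, class_labels):
    """Map each cluster id to (dominant class label, purity)."""
    members = {}
    for cluster, label in zip(cluster_ids, class_labels):
        members.setdefault(cluster, []).append(label)
    result = {}
    for cluster, labels in members.items():
        dominant, count = Counter(labels).most_common(1)[0]
        result[cluster] = (dominant, count / len(labels))
    return result


def classify(point, centers, cluster_labels):
    """Return the class label of the cluster whose center is nearest to `point`."""
    nearest = min(centers, key=lambda c: math.dist(point, centers[c]))
    return cluster_labels[nearest][0]


# Hypothetical ensemble clustering result for five labeled objects.
cluster_ids = [0, 0, 0, 1, 1]
class_labels = ["a", "a", "b", "b", "b"]
cluster_labels = label_clusters(cluster_ids, class_labels)

# Hypothetical cluster centers in a 2-D feature space.
centers = {0: (0.0, 0.0), 1: (4.0, 4.0)}
prediction = classify((1.0, 0.5), centers, cluster_labels)
```

Cluster 0 is labeled "a" with purity 2/3 and cluster 1 is labeled "b" with purity 1, so the new object at (1.0, 0.5), being closer to the first center, is classified as "a".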


2021 ◽  
Vol 7 ◽  
pp. e477
Author(s):  
Amalia Villa ◽  
Abhijith Mundanad Narayanan ◽  
Sabine Van Huffel ◽  
Alexander Bertrand ◽  
Carolina Varon

Feature selection techniques are very useful approaches for dimensionality reduction in data analysis. They provide interpretable results by reducing the dimensions of the data to a subset of the original set of features. When the data lack annotations, unsupervised feature selectors are required for their analysis. Several algorithms for this aim exist in the literature, but despite their wide applicability, they can be inaccessible or cumbersome to use, mainly due to the need for tuning non-intuitive parameters and their high computational demands. In this work, a publicly available, ready-to-use unsupervised feature selector is proposed, with results comparable to the state of the art at a much lower computational cost. The suggested approach belongs to the methods known as spectral feature selectors. These methods generally consist of two stages: manifold learning and subset selection. In the first stage, the underlying structures in the high-dimensional data are extracted, while in the second stage a subset of the features is selected to replicate these structures. This paper makes two contributions to this field, one for each of the stages involved. In the manifold learning stage, the effect of non-linearities in the data is explored by means of a radial basis function (RBF) kernel, for which an alternative way of estimating the kernel parameter is presented for cases with high-dimensional data. Additionally, a backwards greedy approach based on the least-squares utility metric is proposed for the subset selection stage. The combination of these new ingredients results in the utility metric for unsupervised feature selection (U2FS) algorithm. The proposed U2FS algorithm succeeds in selecting the correct features in a simulation environment. In addition, the performance of the method on benchmark datasets is comparable to the state of the art, while requiring less computational time. Moreover, unlike the state of the art, U2FS does not require any tuning of parameters.
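For readers unfamiliar with the RBF kernel used in the manifold learning stage, the sketch below builds the kernel matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for a small set of points. This common parametrization is assumed here; the abstract does not state the exact form, and the paper's alternative estimate of the kernel parameter is not reproduced, so `sigma` is simply chosen by hand. The function name `rbf_kernel` is hypothetical.

```python
import math


def rbf_kernel(points, sigma):
    """Return the RBF (Gaussian) kernel matrix of `points` as nested lists."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")

    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return [
        [math.exp(-squared_distance(p, q) / (2 * sigma**2)) for q in points]
        for p in points
    ]


# Hypothetical 2-D samples; sigma = 1.0 is an arbitrary hand-picked value.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = rbf_kernel(samples, sigma=1.0)
```

The matrix is symmetric with ones on the diagonal, and its entries decay towards zero as points move apart; a smaller `sigma` makes that decay faster, which is why the choice of the kernel parameter matters in high-dimensional data.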


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multi-objective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting the dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase, the detection of dense regions, on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover the clusters in all subspaces with high quality, and its efficiency exceeds that of PROCLUS.


2017 ◽  
Vol 7 (2) ◽  
Author(s):  
Marco Gaboardi ◽  
Emilio Jesús Gallego Arias ◽  
Justin Hsu ◽  
Aaron Roth ◽  
Zhiwei Steven Wu

We present a practical, differentially private algorithm for answering a large number of queries on high dimensional datasets. Like all algorithms for this task, ours necessarily has worst-case complexity exponential in the dimension of the data. However, our algorithm packages the computationally hard step into a concisely defined integer program, which can be solved non-privately using standard solvers. We prove accuracy and privacy theorems for our algorithm, and then demonstrate experimentally that our algorithm performs well in practice. For example, our algorithm can efficiently and accurately answer millions of queries on the Netflix dataset, which has over 17,000 attributes; this is an improvement on the state of the art by multiple orders of magnitude.


2015 ◽  
Vol 23 (3) ◽  
pp. 303-313 ◽  
Author(s):  
Lianli Gao ◽  
Jingkuan Song ◽  
Xingyi Liu ◽  
Junming Shao ◽  
Jiajun Liu ◽  
...  
