Study on the Influence of Diversity and Quality in Entropy Based Collaborative Clustering

Jérémie Sublime; Guénaël Cabanes; Basarab Matei

doi:10.3390/e21100951

Study on the Influence of Diversity and Quality in Entropy Based Collaborative Clustering

Entropy ◽

10.3390/e21100951 ◽

2019 ◽

Vol 21 (10) ◽

pp. 951

Author(s):

Jérémie Sublime ◽

Guénaël Cabanes ◽

Basarab Matei

Keyword(s):

Clustering Algorithms ◽

Mathematical Optimization ◽

Data Sets ◽

Distributed Data ◽

Clustering Methods ◽

Local Structures ◽

Collaborative Clustering ◽

Privacy Constraints ◽

The Stability ◽

The One

The aim of collaborative clustering is to enhance the performance of clustering algorithms by enabling them to work together and exchange information to tackle difficult data sets. The fundamental concept of collaboration is that clustering algorithms operate locally but collaborate by exchanging information about the local structures found by each algorithm. This kind of collaborative learning can benefit a wide range of tasks, including multi-view clustering, clustering of distributed data with privacy constraints, multi-expert clustering and multi-scale analysis. Within this context, the main difficulty of collaborative clustering is to determine how to weight the influence of the different clustering methods with the goal of maximizing the final results and minimizing the risk of negative collaborations, in which the results are worse after collaboration than before. In this paper, we study how the quality and diversity of the different collaborators, as well as the stability of the partitions, can influence the final results. We propose both a theoretical analysis based on mathematical optimization and a second study based on empirical results. Our findings show that, on the one hand, in the absence of a clear criterion to optimize, a low-diversity pool of solutions with high stability is the best option to ensure good performance. On the other hand, if there is a known criterion to maximize, it is best to rely on a higher-diversity pool of solutions with high quality on that criterion. While our approach focuses on entropy-based collaborative clustering, we believe that most of our results could be extended to other collaborative algorithms.
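To make the notion of diversity between collaborators concrete, here is a minimal sketch of one common entropy-based way to compare two partitions of the same objects: the variation of information, H(A) + H(B) - 2 I(A; B). This is an illustrative assumption, not necessarily the measure used in the paper; the function names `entropy`, `mutual_information` and `diversity` are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a partition given as a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two partitions of the same objects."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def diversity(a, b):
    """Variation of information (assumed diversity measure, not the paper's).

    Equals zero when the two partitions are identical up to relabeling.
    """
    return entropy(a) + entropy(b) - 2.0 * mutual_information(a, b)


# Two collaborators that agree up to a renaming of clusters: no diversity.
same = diversity([0, 0, 1, 1], [1, 1, 0, 0])
# Two collaborators whose partitions are statistically independent.
independent = diversity([0, 0, 1, 1], [0, 1, 0, 1])
```

Under this sketch, a pool of collaborators with small pairwise values would count as low diversity, and one with large pairwise values as high diversity.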


A systematic performance evaluation of clustering methods for single-cell RNA-seq data

F1000Research ◽

10.12688/f1000research.15666.1 ◽

2018 ◽

Vol 7 ◽

pp. 1141 ◽

Cited By ~ 50

Author(s):

Angelo Duò ◽

Mark D. Robinson ◽

Charlotte Soneson

Keyword(s):

Performance Evaluation ◽

Single Cell ◽

Clustering Algorithms ◽

General Purpose ◽

Data Sets ◽

Consensus Clustering ◽

Rna Seq ◽

Clustering Methods ◽

Run Time ◽

The Stability

Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 12 clustering algorithms, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using 9 publicly available scRNA-seq data sets as well as three simulations with varying degrees of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the performance of the clustering algorithms themselves. We evaluated the ability to recover known subpopulations, the stability and the run time of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. The R scripts providing an extensible framework for the evaluation of new methods and data sets are available on GitHub (https://github.com/markrobinsonuzh/scRNAseq_clustering_comparison).


Robust K-Median and K-Means Clustering Algorithms for Incomplete Data

Mathematical Problems in Engineering ◽

10.1155/2016/4321928 ◽

2016 ◽

Vol 2016 ◽

pp. 1-8 ◽

Cited By ~ 6

Author(s):

Jinhua Li ◽

Shiji Song ◽

Yuli Zhang ◽

Zhen Zhou

Keyword(s):

Incomplete Data ◽

Missing Values ◽

Clustering Algorithms ◽

Interval Data ◽

Accurate Estimation ◽

Data Sets ◽

Clustering Methods ◽

Estimation Errors ◽

Feature Values ◽

Time And Space Complexity

Incomplete data with missing feature values are prevalent in clustering problems. Traditional clustering methods first estimate the missing values by imputation and then apply the classical clustering algorithms for complete data, such as K-median and K-means. However, in practice, it is often hard to obtain accurate estimation of the missing values, which deteriorates the performance of clustering. To enhance the robustness of clustering algorithms, this paper represents the missing values by interval data and introduces the concept of robust cluster objective function. A minimax robust optimization (RO) formulation is presented to provide clustering results, which are insensitive to estimation errors. To solve the proposed RO problem, we propose robust K-median and K-means clustering algorithms with low time and space complexity. Comparisons and analysis of experimental results on both artificially generated and real-world incomplete data sets validate the robustness and effectiveness of the proposed algorithms.


SOFTWARE ARCHITECTURE DECOMPOSITION USING ATTRIBUTES

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194007003410 ◽

2007 ◽

Vol 17 (05) ◽

pp. 599-613 ◽

Cited By ~ 8

Author(s):

CHUNG-HORNG LUNG ◽

XIA XU ◽

MARZIA ZAMAN

Keyword(s):

Architectural Design ◽

Clustering Algorithms ◽

Group Method ◽

Clustering Methods ◽

System Decomposition ◽

Software Artifacts ◽

Pair Group ◽

The One ◽

Past Experiences ◽

Final System

Software architectural design has an enormous effect on downstream software artifacts. Decomposition of function for the final system is one of the critical steps in software architectural design. The process of decomposition is typically conducted by designers based on their intuition and past experiences, which may not be robust sometimes. This paper presents a study of applying the clustering technique to support system decomposition based on requirements and their attributes. The approach can support the architectural design process by grouping closely related requirements to form a subsystem or module. In this paper, we demonstrate our experiments in applying the approach to an industrial communication protocol software system and comparing several clustering algorithms. The result obtained from WPGMA (weighted pair-group method using arithmetic averages) shows closer resemblance than other clustering methods to the one developed by the designer.
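The WPGMA rule mentioned in the abstract above can be sketched in a few lines: when two clusters are merged, the distance from the new cluster to any other cluster is the plain average of the two old distances, regardless of cluster sizes. The requirement names and dissimilarity values below are made up for illustration and do not come from the paper; `wpgma` is a hypothetical helper.

```python
def wpgma(dist, labels):
    """Agglomerative clustering with the WPGMA update rule.

    dist: symmetric matrix (list of lists) of pairwise dissimilarities.
    labels: one name per row of dist.
    Returns the merges as (cluster_a, cluster_b, height) tuples, in order.
    """
    # Active clusters keyed by id; d holds dissimilarities between active ids.
    clusters = {i: (labels[i],) for i in range(len(labels))}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair of active clusters.
        (i, j), height = min(d.items(), key=lambda kv: kv[1])
        merges.append((clusters[i], clusters[j], height))
        for k in clusters:
            if k not in (i, j):
                dik = d[(min(i, k), max(i, k))]
                djk = d[(min(j, k), max(j, k))]
                # WPGMA: unweighted mean of the two old distances.
                d[(k, next_id)] = (dik + djk) / 2.0
        # Drop every entry that involves one of the merged clusters.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Hypothetical dissimilarities between four requirements R1..R4.
names = ["R1", "R2", "R3", "R4"]
matrix = [
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 5.0, 9.0],
    [6.0, 5.0, 0.0, 4.0],
    [10.0, 9.0, 4.0, 0.0],
]
merges = wpgma(matrix, names)
```

Here R1 and R2 merge first, then R3 and R4, and the two resulting groups would correspond to two candidate subsystems. For real data, SciPy's `scipy.cluster.hierarchy.linkage` with `method="weighted"` provides the same linkage rule.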


Robust clustering and interpretation of scRNA-seq data using reference component analysis

10.1101/2021.02.16.431527 ◽

2021 ◽

Author(s):

Florian Schmidt ◽

Bobby Ranjan ◽

Quy Xiao Xuan Lin ◽

Vaidehi Krishnan ◽

Ignasius Joanito ◽

...

Keyword(s):

Single Cell ◽

De Novo ◽

Clustering Algorithms ◽

Cell Types ◽

Unsupervised Clustering ◽

Data Sets ◽

Clustering Methods ◽

Robust Clustering ◽

Supervised Clustering ◽

Downstream Analysis

Motivation: The transcriptomic diversity of the hundreds of cell types in the human body can be analysed in unprecedented detail using single cell (SC) technologies. Though clustering of cellular transcriptomes is the default technique for defining cell types and subtypes, single cell clustering can be strongly influenced by technical variation. In fact, the prevalent unsupervised clustering algorithms can cluster cells by technical, rather than biological, variation. Results: Compared to de novo (unsupervised) clustering methods, we demonstrate using multiple benchmarks that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects. To leverage the advantages of supervised clustering, we present RCA2, a new, scalable, and broadly applicable version of our RCA algorithm. RCA2 provides a user-friendly framework for supervised clustering and downstream analysis of large scRNA-seq data sets. RCA2 can be seamlessly incorporated into existing algorithmic pipelines. It incorporates various new reference panels for human and mouse, supports generation of custom panels and uses efficient graph-based clustering and sparse data structures to ensure scalability. We demonstrate the applicability of RCA2 on SC data from human bone marrow, healthy PBMCs and PBMCs from COVID-19 patients. Importantly, RCA2 facilitates cell-type-specific QC, which we show is essential for accurate clustering of SC data from heterogeneous tissues. In the era of cohort-scale SC analysis, supervised clustering methods such as RCA2 will facilitate unified analysis of diverse SC datasets. Availability: RCA2 is implemented in R and is available at github.com/prabhakarlab/RCAv2


A COMPARATIVE ANALYSIS OF K-MEANS AND HIERARCHICAL CLUSTERING

EPRA International Journal of Multidisciplinary Research (IJMR) ◽

10.36713/epra8308 ◽

2021 ◽

pp. 412-418

Author(s):

Aastha Gupta ◽

Himanshu Sharma ◽

Anas Akhtar

Keyword(s):

Data Mining ◽

Hierarchical Clustering ◽

Clustering Algorithms ◽

Analytical Techniques ◽

Data Sets ◽

Clustering Methods ◽

Data Set ◽

Advantages And Disadvantages ◽

The Many ◽

Data Elements

Clustering is the process of arranging comparable data elements into groups. Clustering analysis is one of the most frequently used analytical techniques in data mining, and the clustering algorithm’s strategy has a direct influence on the clustering results. This study examines several types of algorithms, such as k-means clustering algorithms, and compares and contrasts their advantages and disadvantages. This paper also highlights concerns with clustering algorithms, such as time complexity and accuracy, in order to give better outcomes in a variety of environments. The outcomes are described in terms of big datasets. The focus of this study is on clustering algorithms with the WEKA data mining tool. Clustering is the process of dividing a big data set into small groups or clusters. Clustering is an unsupervised approach that may be used to analyze big datasets with many characteristics. It’s a data-modeling technique that provides a clear image of your data. Two clustering methods, k-means and hierarchical clustering, are explained in this survey and analyzed using the WEKA tool on different data sets. KEYWORDS: data clustering, weka, k-means, hierarchical clustering


Topographic Mapping of Large Dissimilarity Data Sets

Neural Computation ◽

10.1162/neco_a_00012 ◽

2010 ◽

Vol 22 (9) ◽

pp. 2229-2284 ◽

Cited By ~ 63

Author(s):

Barbara Hammer ◽

Alexander Hasenfuss

Keyword(s):

Linear Time ◽

Clustering Algorithms ◽

Topographic Maps ◽

Data Sets ◽

Self Organizing Map ◽

Clustering Methods ◽

Neighborhood Structure ◽

Proximity Data ◽

Dissimilarity Data ◽

Relational Clustering

Topographic maps such as the self-organizing map (SOM) or neural gas (NG) constitute powerful data mining techniques that allow simultaneously clustering data and inferring their topological structure, such that additional features, for example, browsing, become available. Both methods have been introduced for vectorial data sets; they require a classical feature encoding of information. Often data are available in the form of pairwise distances only, such as arise from a kernel matrix, a graph, or some general dissimilarity measure. In such cases, NG and SOM cannot be applied directly. In this article, we introduce relational topographic maps as an extension of relational clustering algorithms, which offer prototype-based representations of dissimilarity data, to incorporate neighborhood structure. These methods are equivalent to the standard (vectorial) techniques if a Euclidean embedding exists, while preventing the need to explicitly compute such an embedding. Extending these techniques for the general case of non-Euclidean dissimilarities makes possible an interpretation of relational clustering as clustering in pseudo-Euclidean space. We compare the methods to well-known clustering methods for proximity data based on deterministic annealing and discuss how far convergence can be guaranteed in the general case. Relational clustering is quadratic in the number of data points, which makes the algorithms infeasible for huge data sets. We propose an approximate patch version of relational clustering that runs in linear time. The effectiveness of the methods is demonstrated in a number of examples.


A Comprehensive Study on the Importance of the Elbow and the Silhouette Metrics in Cluster Count Prediction for Partition Cluster Models

Revista Gestão Inovação e Tecnologias ◽

10.47059/revistageintec.v11i4.2408 ◽

2021 ◽

Vol 11 (4) ◽

pp. 3792-3806

Author(s):

A.A. Abdulnassar ◽

Latha R. Nair

Keyword(s):

Clustering Algorithms ◽

Large Data ◽

Cluster Models ◽

Data Repository ◽

Data Sets ◽

Analysis Tool ◽

Clustering Methods ◽

K Value ◽

Statistical Analysis Tool ◽

Partition Clustering

Proper selection of the cluster count gives better clustering results in partition models. Partition clustering methods are simple as well as efficient. K-means and its modified versions are very efficient cluster models, but their results are very sensitive to the chosen K value. Partition clustering algorithms are more suitable in applications where the data are arranged in a uniform manner. This work aims to evaluate the importance of the assigned cluster count in improving the efficiency of partition clustering algorithms, using two well-known statistical methods, the Elbow method and the Silhouette method. The performance of the Silhouette method and the Elbow method is compared on different data sets from the UCI data repository. The values obtained using these methods are compared with the cluster performance results obtained using the statistical analysis tool Weka on the selected data sets. Performance was evaluated on cluster efficiency for small and large data sets by varying the cluster count values. Similar results were obtained from the three methods: the Elbow method, the Silhouette method and clustering by Weka. It was also observed that clustering efficiency drops quickly for small changes in cluster count when the cluster count is small.
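The two criteria compared in the abstract above can be illustrated on a toy example. The sketch below is an assumption-laden illustration in plain Python, not the authors' Weka-based procedure: it runs a small deterministic 1-D k-means (the helper `kmeans_1d` and its initialization are hypothetical), then reports the within-cluster sum of squares used by the Elbow method and the mean silhouette score for each candidate cluster count.

```python
def kmeans_1d(points, k, iters=100):
    """Lloyd's k-means on 1-D data with a deterministic start (illustrative only)."""
    pts = sorted(points)
    # Spread the initial centers across the sorted data.
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def wcss(points, labels, centers):
    """Within-cluster sum of squares, the quantity plotted by the Elbow method."""
    return sum((p - centers[l]) ** 2 for p, l in zip(points, labels))


def mean_silhouette(points, labels):
    """Mean silhouette score; points in singleton clusters score 0."""
    scores = []
    for i, (p, l) in enumerate(zip(points, labels)):
        same = [abs(p - q) for j, (q, m) in enumerate(zip(points, labels))
                if m == l and j != i]
        if not same:
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance to own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(abs(p - q) for q, m in zip(points, labels) if m == other)
            / labels.count(other)
            for other in set(labels) if other != l
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data with three visibly separated groups (made up for illustration).
points = [1.0, 1.5, 2.0, 5.0, 5.5, 6.0, 11.0, 11.5, 12.0]
wcss_by_k, sil_by_k = {}, {}
for k in range(2, 5):
    labels, centers = kmeans_1d(points, k)
    wcss_by_k[k] = wcss(points, labels, centers)
    sil_by_k[k] = mean_silhouette(points, labels)
best_k = max(sil_by_k, key=sil_by_k.get)
```

On this toy data the within-cluster sum of squares falls sharply from k = 2 to k = 3 and barely changes afterwards, which is the "elbow", and the mean silhouette score is also highest at k = 3, so the two criteria agree here.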


Survey of Clustering

International Journal of Information Retrieval Research ◽

10.4018/ijirr.2013040101 ◽

2013 ◽

Vol 3 (2) ◽

pp. 1-29 ◽

Cited By ~ 3

Author(s):

Raymond Greenlaw ◽

Sanpawat Kantabutra

Keyword(s):

Parallel Computation ◽

Clustering Algorithms ◽

Data Sets ◽

Clustering Methods ◽

Top Down ◽

Research Directions ◽

History Of ◽

Representative Points ◽

Parallel Clustering ◽

Extensive List

This article is a survey into clustering applications and algorithms. A number of important well-known clustering methods are discussed. The authors present a brief history of the development of the field of clustering, discuss various types of clustering, and mention some of the current research directions in the field of clustering. More specifically, top-down and bottom-up hierarchical clustering are described. Additionally, K-Means and K-Medians clustering algorithms are also shown. The concept of representative points is introduced and the technique of discovering them is presented. Immense data sets in clustering often necessitate parallel computation. The authors discuss issues involving parallel clustering as well. Clustering deals with a large number of experimental results. The authors provide references to these works throughout the article. A table for comparing various clustering methods is given in the end. The authors give a summary and an extensive list of references, including some of the latest works in the field, to conclude the article.


Clustering Algorithm for Arbitrary Data Sets

Encyclopedia of Artificial Intelligence ◽

10.4018/978-1-59904-849-9.ch046 ◽

2011 ◽

pp. 297-303

Author(s):

Yu-Chen Song ◽

Hai-Dong Meng

Keyword(s):

Arbitrary Shape ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Interaction Networks ◽

Climate Indices ◽

Data Sets ◽

Clustering Methods ◽

Efficient Manner ◽

Protein Protein Interaction ◽

Clustering Data

Clustering analysis is an intrinsic component of numerous applications, including pattern recognition, life sciences, image processing, web data analysis, earth sciences, and climate research. As an example, consider the biology domain. In any living cell that undergoes a biological process, different subsets of its genes are expressed in different stages of the process. To facilitate a deeper understanding of these processes, a clustering algorithm was developed (Ben-Dor, Shamir, & Yakhini, 1999) that enabled detailed analysis of gene expression data. Recent advances in proteomics technologies, such as two-hybrid, phage display and mass spectrometry, have enabled the creation of detailed maps of biomolecular interaction networks. To further understanding in this area, a clustering mechanism that detects densely connected regions in large protein-protein interaction networks that may represent molecular complexes was constructed (Bader & Hogue, 2003). In the interpretation of remote sensing images, clustering algorithms (Sander, Ester, Kriegel, & Xu, 1998) have been employed to recognize and understand the content of such images. In the management of web directories, document annotation is an important task. Given a predefined taxonomy, the objective is to identify a category related to the content of an unclassified document. Self-Organizing Maps have been harnessed to influence the learning process with knowledge encoded within a taxonomy (Adami, Avesani, & Sona, 2005). Earth scientists are interested in discovering areas of the ocean that have a demonstrable effect on climatic events on land, and the SNN clustering technique (Ertöz, Steinbach, & Kumar, 2002) is one example of a technique that has been adopted in this domain. Also, scientists have developed climate indices, which are time series that summarize the behavior of selected regions of the Earth’s oceans and atmosphere. Clustering techniques have proved crucial in the production of climate indices (Steinbach, Tan, Kumar, Klooster, & Potter, 2003). In many application domains, clusters of data are of arbitrary shape, size and density, and the number of clusters is unknown. In such scenarios, traditional clustering algorithms, including partitioning methods, hierarchical methods, density-based methods and grid-based methods, cannot identify clusters efficiently or accurately. Obviously, this is a critical limitation. In the following sections, a number of clustering methods are presented and discussed, after which the design of an algorithm based on Density and Density-reachable (CADD) is presented. CADD seeks to remedy some of the deficiencies of classical clustering approaches by robustly clustering data that is of arbitrary shape, size, and density in an effective and efficient manner.


Introduction to Clustering

Dynamic and Advanced Data Mining for Progressing Technological Development ◽

10.4018/978-1-60566-908-3.ch010 ◽

2010 ◽

pp. 224-254

Author(s):

Raymond Greenlaw ◽

Sanpawat Kantabutra

Keyword(s):

Clustering Algorithms ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Clustering Methods ◽

Research Directions ◽

History Of ◽

Representative Points ◽

Parallel Clustering ◽

Extensive List

This chapter provides the reader with an introduction to clustering algorithms and applications. A number of important well-known clustering methods are surveyed. The authors present a brief history of the development of the field of clustering, discuss various types of clustering, and mention some of the current research directions in the field of clustering. Algorithms are described for top-down and bottom-up hierarchical clustering, as are algorithms for K-Means clustering and for K-Medians clustering. The technique of representative points is also presented. Given the large data sets involved with clustering, the need to apply parallel computing to clustering arises, so they discuss issues related to parallel clustering as well. Throughout the chapter references are provided to works that contain a large number of experimental results. A comparison of the various clustering methods is given in tabular format. They conclude the chapter with a summary and an extensive list of references.
