Impact of similarity metrics on single-cell RNA-seq data clustering

Taiyun Kim; Irene Rui Chen; Yingxin Lin; Andy Yi-Yang Wang; Jean Yee Hwa Yang; Pengyi Yang

doi:10.1093/bib/bby076

Briefings in Bioinformatics ◽

2018 ◽

Vol 20 (6) ◽

pp. 2316-2326 ◽

Cited By ~ 27

Author(s):

Taiyun Kim ◽

Irene Rui Chen ◽

Yingxin Lin ◽

Andy Yi-Yang Wang ◽

Jean Yee Hwa Yang ◽

...

Keyword(s):

Single Cell ◽

Data Clustering ◽

Euclidean Distance ◽

Clustering Algorithm ◽

High Throughput Sequencing ◽

Clustering Algorithms ◽

Critical Role ◽

Similarity Metrics ◽

Pearson’s Correlation

Abstract Advances in high-throughput sequencing on single-cell gene expressions [single-cell RNA sequencing (scRNA-seq)] have enabled transcriptome profiling on individual cells from complex samples. A common goal in scRNA-seq data analysis is to discover and characterise cell types, typically through clustering methods. The quality of the clustering therefore plays a critical role in biological discovery. While numerous clustering algorithms have been proposed for scRNA-seq data, fundamentally they all rely on a similarity metric for categorising individual cells. Although several studies have compared the performance of various clustering algorithms for scRNA-seq data, currently there is no benchmark of different similarity metrics and their influence on scRNA-seq data clustering. Here, we compared a panel of similarity metrics on clustering a collection of annotated scRNA-seq datasets. Within each dataset, a stratified subsampling procedure was applied and an array of evaluation measures was employed to assess the similarity metrics. This produced a highly reliable and reproducible consensus on their performance assessment. Overall, we found that correlation-based metrics (e.g. Pearson’s correlation) outperformed distance-based metrics (e.g. Euclidean distance). To test if the use of correlation-based metrics can benefit the recently published clustering techniques for scRNA-seq data, we modified a state-of-the-art kernel-based clustering algorithm (SIMLR) using Pearson’s correlation as a similarity measure and found significant performance improvement over Euclidean distance on scRNA-seq data clustering. These findings demonstrate the importance of similarity metrics in clustering scRNA-seq data and highlight Pearson’s correlation as a favourable choice. Further comparison on different scRNA-seq library preparation protocols suggests that they may also affect clustering performance. Finally, the benchmarking framework is available at http://www.maths.usyd.edu.au/u/SMS/bioinformatics/software.html.
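
As a rough illustration of why the choice between a correlation-based and a distance-based metric can matter, the sketch below compares Euclidean distance with a Pearson correlation distance (one minus the correlation) on three made-up expression profiles. It is not the authors' benchmarking framework: the function names and the toy values are assumptions chosen for this example, and it only demonstrates the general property that Pearson's correlation is unchanged when a profile is rescaled.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Dissimilarity derived from Pearson's correlation (0 = same pattern)."""
    return 1.0 - pearson_correlation(x, y)


# Hypothetical toy profiles (made-up values, four genes per cell).
cell_a = [1.0, 5.0, 2.0, 8.0]
cell_b = [3.0, 15.0, 6.0, 24.0]  # same pattern as cell_a, three times the magnitude
cell_c = [8.0, 2.0, 5.0, 1.0]    # different pattern, magnitude similar to cell_a

for name, other in (("cell_b", cell_b), ("cell_c", cell_c)):
    print(
        name,
        "euclidean:", round(euclidean_distance(cell_a, other), 3),
        "correlation distance:", round(correlation_distance(cell_a, other), 3),
    )
```

In this toy case Euclidean distance places cell_a closer to cell_c, while the correlation distance places it closer to cell_b, whose pattern it shares. Real scRNA-seq data are far noisier, so this shows only the mechanism, not the benchmark result.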

Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data

BMC Bioinformatics ◽

10.1186/s12859-020-03797-8 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Chunxiang Wang ◽

Xin Gao ◽

Juntao Liu

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Data Preprocessing ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Preprocessing Method ◽

Cell Clustering ◽

Cell Gene Expression

Abstract Background: Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. Results: We designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. Conclusion: The SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering of SC3. It is expected to play a crucial role in the related studies of single-cell clustering, such as the studies of human complex diseases and discoveries of new cell types.

A Research Roadmap of Big Data Clustering Algorithms for Future Internet of Things

International Journal of Organizational and Collective Intelligence ◽

10.4018/ijoci.2019040102 ◽

2019 ◽

Vol 9 (2) ◽

pp. 16-30 ◽

Cited By ~ 1

Author(s):

Hind Bangui ◽

Mouzhi Ge ◽

Barbora Buhnova

Keyword(s):

Big Data ◽

Internet Of Things ◽

Mobile Networks ◽

Data Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Future Internet ◽

Research Challenges ◽

Initial Stage ◽

Big Data Technologies

Due to the massive increase in data across Internet of Things (IoT) domains such as healthcare IoT and Smart City IoT, Big Data technologies have emerged as critical tools for analyzing IoT data. Among the Big Data technologies, data clustering is one of the essential approaches for processing IoT data. However, how to select a suitable clustering algorithm for IoT data is still unclear. Furthermore, since Big Data technologies are still at an initial stage in different IoT domains, it is valuable to propose and structure the research challenges between Big Data and IoT. Therefore, this article starts by reviewing and comparing the data clustering algorithms that can be applied to IoT datasets, and then extends the discussion to a broader IoT context such as IoT dynamics and IoT mobile networks. Finally, this article identifies a set of research challenges that form a research roadmap for Big Data research in IoT domains. The proposed research roadmap aims at bridging the research gaps between Big Data and various IoT contexts.

Single-cell RNA-seq data clustering: A survey with performance comparison study

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720020400053 ◽

2020 ◽

Vol 18 (04) ◽

pp. 2040005

Author(s):

Ruiyi Li ◽

Jihong Guan ◽

Shuigeng Zhou

Keyword(s):

Single Cell ◽

Data Clustering ◽

Performance Metrics ◽

Clustering Algorithms ◽

Cell Types ◽

Performance Comparison ◽

Cellular Heterogeneity ◽

Clustering Methods ◽

Multiple Perspectives ◽

Underlying Mechanisms

Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly helped the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets, and were compared with only some of their counterparts, usually using different performance metrics. Consequently, an accurate and complete picture of their merits and demerits is lacking, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first review the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state-of-the-art scRNA-seq data clustering algorithms, and collect 12 publicly available real scRNA-seq datasets from existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods are very diverse in performance. Even the top-performing algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to explore more stable, accurate, and efficient clustering algorithms for scRNA-seq data.

Research on Parallel DBSCAN Algorithm Design Based on MapReduce

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.301-303.1133 ◽

2011 ◽

Vol 301-303 ◽

pp. 1133-1138 ◽

Cited By ~ 17

Author(s):

Yan Xiang Fu ◽

Wei Zhong Zhao ◽

Hui Fang Ma

Keyword(s):

Data Clustering ◽

Large Scale ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Algorithm Design ◽

Document Retrieval ◽

Commodity Hardware ◽

Dbscan Clustering ◽

Dbscan Algorithm ◽

Parallel Clustering

Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by technological progress makes clustering very large-scale data a challenging task. To deal with this problem, more researchers are trying to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm scales well and efficiently processes large datasets on commodity hardware.

Scalable Clustering with Supervised Linkage Methods

10.1101/2021.08.01.454697 ◽

2021 ◽

Author(s):

James Anibal ◽

Alexandre Day ◽

Erol Bahadiroglu ◽

Liam O'Neill ◽

Long Phan ◽

...

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Biomedical Sciences ◽

New Approach ◽

Scalable Clustering ◽

Linkage Methods ◽

Density Clustering ◽

Cell Data ◽

Different Levels

Data clustering plays a significant role in biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g. T cells), whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for immense datasets, achieving a considerable improvement in computational efficiency on large datasets compared to existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single cell classification in a scalable, tunable and rapid manner. We make HAL-x publicly available at: https://pypi.org/project/hal-x/

Spectral Clustering Based on Sparse Representation

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.556-562.3822 ◽

2014 ◽

Vol 556-562 ◽

pp. 3822-3826

Author(s):

Chen Xiao Hu ◽

Xian Chun Zou

Keyword(s):

Sparse Representation ◽

Spectral Clustering ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Similarity Metrics ◽

Distance Metrics ◽

Information Propagation ◽

Discriminative Ability ◽

Two Samples ◽

Better Than

Spectral clustering is an efficient clustering algorithm based on information propagation between neighborhood nodes. Its performance depends largely on the distance metric, so it is possible to boost its performance by adopting a more reliable distance metric. Given the advantages of sparse representation in discriminative ability, robustness to noise, and more faithful measurement of the similarity between two samples, we propose a spectral clustering algorithm based on sparse representation. The experimental study on several datasets shows that the proposed algorithm performs better than the sparse clustering algorithms based on other similarity metrics.

ADOFL: Multi-Kernel-Based Adaptive Directive Operative Fractional Lion Optimisation Algorithm for Data Clustering

Journal of Intelligent Systems ◽

10.1515/jisys-2016-0175 ◽

2018 ◽

Vol 27 (3) ◽

pp. 317-329 ◽

Cited By ~ 1

Author(s):

Satish Chander ◽

P. Vijaya ◽

Praveen Dhyani

Keyword(s):

Data Clustering ◽

Clustering Algorithm ◽

Fitness Function ◽

Clustering Algorithms ◽

Previous Method ◽

Optimisation Algorithm ◽

Lion Algorithm ◽

Similar Cluster ◽

Centroid Point ◽

Data Objects

Abstract The growth of databases in fields such as medicine, business, education and marketing is colossal because of developments in information technology. Knowledge discovery from such large, opaque databases is a tedious task. Data mining is one of the promising solutions, and clustering is one of its applications. The clustering process groups related data objects into the same cluster and dissimilar objects into different clusters. The literature presents many clustering algorithms for data clustering. Optimisation-based clustering is one of the recently developed approaches, which discovers the optimal clusters based on an objective function. In our previous method, the direct operative fractional lion optimisation algorithm was proposed for data clustering. In this paper, we design a new clustering algorithm called the adaptive decisive operative fractional lion (ADOFL) optimisation algorithm, based on a multi-kernel function. Moreover, a new fitness function called the multi-kernel WL index is proposed for selecting the best centroid point for clustering. The proposed ADOFL algorithm is evaluated on two benchmark datasets, Iris and Wine. Its performance is validated against existing clustering algorithms such as the particle swarm clustering (PSC) algorithm, the modified PSC algorithm, the lion algorithm, the fractional lion algorithm, and DOFL. The results show that the proposed method obtains a maximum clustering accuracy of 79.51 in data clustering.

RECURSIVE HIERARCHICAL CLUSTERING FOR HYPERSPECTRAL IMAGES

ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences ◽

10.5194/isprs-archives-xliii-b3-2020-461-2020 ◽

2020 ◽

Vol XLIII-B3-2020 ◽

pp. 461-465

Author(s):

S. May

Keyword(s):

Hierarchical Clustering ◽

Euclidean Distance ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Hyperspectral Images ◽

High Dimensionality ◽

Illumination Change ◽

Clustering Techniques ◽

Semantic Class ◽

Spectral Angle

Abstract. Partition-based clustering techniques are widely used in data mining and also to analyze hyperspectral images. Unsupervised clustering depends only on the data, without any external knowledge. It creates a complete partition of the image into many classes, so sparse labeled samples may be used to label each cluster, which simplifies the supervised step. Each clustering algorithm has its own advantages and drawbacks (initialization, training complexity). In this paper we propose a recursive hierarchical clustering based on standard clustering strategies such as K-Means or Fuzzy-C-Means. The recursive hierarchical approach reduces the algorithm complexity, making it possible to process a large number of input pixels and to produce a clustering with a high number of clusters. Moreover, in hyperspectral images, a classical question concerns the high dimensionality and the distance that should be used. Classical clustering algorithms usually use the Euclidean distance to compute the distance between samples and centroids. We propose to implement the spectral angle distance instead and evaluate its performance. It better fits the pixel spectra and is less sensitive to illumination change or spectrum variability inside a semantic class. Different scenes are processed with this method in order to demonstrate its potential.
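
The contrast between the Euclidean distance and the spectral angle distance can be sketched in a few lines. The example below is only an illustration under stated assumptions: the spectra are made-up values, illumination change is modelled as a simple multiplicative scaling of the whole spectrum (a simplification not taken from the paper), and the function names are hypothetical. The spectral angle is computed as the arccosine of the normalised dot product of two spectra.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def spectral_angle(x, y):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)


# Hypothetical toy spectra (made-up reflectance values, four bands).
pixel = [0.2, 0.4, 0.6, 0.3]
brighter = [2.0 * v for v in pixel]  # same material, assumed twice the illumination
other = [0.6, 0.3, 0.2, 0.4]         # different spectral shape

for name, spectrum in (("brighter", brighter), ("other", other)):
    print(
        name,
        "euclidean:", round(euclidean_distance(pixel, spectrum), 3),
        "spectral angle:", round(spectral_angle(pixel, spectrum), 3),
    )
```

Under the scaling assumption, the spectral angle between the pixel and its brighter copy is essentially zero, whereas the Euclidean distance treats the brighter copy as farther away than the differently shaped spectrum.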

Automated Digital Image Clustering Algorithm Based on Colour Distance and IDBSCAN

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.b7078.129219 ◽

2019 ◽

Vol 9 (2) ◽

pp. 2717-2722

Keyword(s):

Data Clustering ◽

Clustering Algorithm ◽

Digital Images ◽

Clustering Algorithms ◽

Image Clustering ◽

Time Consumption ◽

Consumption Rates ◽

Efficient Data ◽

Audio Video ◽

Colour Distance

Data clustering is essential for data-analytics-based applications. Though data clustering algorithms are abundant in the literature, there is always room for more efficient ones, because of the uncontrollable growth of data and its utilization. Data clustering may consider any data format, such as text, images, audio and video. Given the increasing use of digital images, this work presents a data clustering algorithm for digital images, which is based on colour distance and the Improvised DBSCAN (IDBSCAN) algorithm. The proposed IDBSCAN eliminates the tedious process of setting initial parameters such as ε and minpts by setting them automatically. The performance of the proposed work is analysed in terms of clustering accuracy, precision, recall, F-measure and time consumption rates. The proposed work outperforms the existing approaches with reasonable time consumption.

An Empirical Study of Cluster-Based MOEA/D Bare Bones PSO for Data Clustering

Algorithms ◽

10.3390/a14110338 ◽

2021 ◽

Vol 14 (11) ◽

pp. 338

Author(s):

Daphne Teck Ching Lai ◽

Yuji Sato

Keyword(s):

Data Clustering ◽

Clustering Algorithm ◽

Selection Process ◽

Clustering Algorithms ◽

Future Research ◽

Overlapping Clusters ◽

Hierarchical Topology ◽

Objective Space ◽

Solution Selection ◽

Bare Bones

Previously, cluster-based multi- or many-objective function techniques were proposed to reduce the Pareto set. Recently, researchers proposed such techniques to find better solutions in the objective space to solve engineering problems. In this work, we applied a cluster-based approach for solution selection in a multiobjective evolutionary algorithm based on decomposition with bare bones particle swarm optimization for data clustering and investigated its clustering performance. In our previous work, we found that MOEA/D with BBPSO performed the best on 10 datasets. Here, we extend this work by applying a cluster-based approach tested on 13 UCI datasets. We compared it with six multiobjective evolutionary clustering algorithms from the existing literature and ten from our previous work. The proposed technique was found to perform well on datasets with highly overlapping clusters, such as CMC and Sonar. So far, we found only one work that used cluster-based MOEA for clustering data, the hierarchical topology multiobjective clustering algorithm. All other cluster-based MOEAs found were used to solve problems other than data clustering. By clustering Pareto solutions and evaluating new candidates against the found cluster representatives, local search is introduced in the solution selection process within the objective space, which can be effective on datasets with highly overlapping clusters. This is an added layer of search control in the objective space. The results are found to be promising, prompting different areas of future research which are discussed, including the study of its effects with an increasing number of clusters as well as with other objective functions.
