Impact of similarity metrics on single-cell RNA-seq data clustering

2018 ◽  
Vol 20 (6) ◽  
pp. 2316-2326 ◽  
Author(s):  
Taiyun Kim ◽  
Irene Rui Chen ◽  
Yingxin Lin ◽  
Andy Yi-Yang Wang ◽  
Jean Yee Hwa Yang ◽  
...  

Abstract Advances in high-throughput sequencing on single-cell gene expressions [single-cell RNA sequencing (scRNA-seq)] have enabled transcriptome profiling on individual cells from complex samples. A common goal in scRNA-seq data analysis is to discover and characterise cell types, typically through clustering methods. The quality of the clustering therefore plays a critical role in biological discovery. While numerous clustering algorithms have been proposed for scRNA-seq data, fundamentally they all rely on a similarity metric for categorising individual cells. Although several studies have compared the performance of various clustering algorithms for scRNA-seq data, currently there is no benchmark of different similarity metrics and their influence on scRNA-seq data clustering. Here, we compared a panel of similarity metrics on clustering a collection of annotated scRNA-seq datasets. Within each dataset, a stratified subsampling procedure was applied and an array of evaluation measures was employed to assess the similarity metrics. This produced a highly reliable and reproducible consensus on their performance assessment. Overall, we found that correlation-based metrics (e.g. Pearson’s correlation) outperformed distance-based metrics (e.g. Euclidean distance). To test if the use of correlation-based metrics can benefit the recently published clustering techniques for scRNA-seq data, we modified a state-of-the-art kernel-based clustering algorithm (SIMLR) using Pearson’s correlation as a similarity measure and found significant performance improvement over Euclidean distance on scRNA-seq data clustering. These findings demonstrate the importance of similarity metrics in clustering scRNA-seq data and highlight Pearson’s correlation as a favourable choice. Further comparison on different scRNA-seq library preparation protocols suggests that they may also affect clustering performance. 
Finally, the benchmarking framework is available at http://www.maths.usyd.edu.au/u/SMS/bioinformatics/software.html.
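
The contrast between correlation-based and distance-based metrics can be illustrated with a minimal sketch. It is not the authors' benchmarking framework: the function names and the toy expression profiles are hypothetical, and the example shows only one property that may matter, namely that Pearson's correlation ignores a uniform scaling of a profile while Euclidean distance does not.

```python
import numpy as np

def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles."""
    xc = x - x.mean()
    yc = y - y.mean()
    return 1.0 - float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def euclidean_distance(x, y):
    """Return the Euclidean distance between two expression profiles."""
    return float(np.linalg.norm(x - y))

# Hypothetical toy profiles over five genes (not data from the study).
cell_a = np.array([1.0, 4.0, 2.0, 8.0, 3.0])
cell_b = 3.0 * cell_a                         # same pattern as cell_a, scaled up
cell_c = np.array([8.0, 1.0, 7.0, 2.0, 6.0])  # different pattern, similar magnitude

d_pearson_ab = pearson_distance(cell_a, cell_b)
d_pearson_ac = pearson_distance(cell_a, cell_c)
d_euclid_ab = euclidean_distance(cell_a, cell_b)
d_euclid_ac = euclidean_distance(cell_a, cell_c)

print(f"Pearson distance   a-b: {d_pearson_ab:.3f}   a-c: {d_pearson_ac:.3f}")
print(f"Euclidean distance a-b: {d_euclid_ab:.3f}   a-c: {d_euclid_ac:.3f}")
```

In this toy case the Pearson-based distance places cell_a next to its scaled copy cell_b, whereas Euclidean distance places cell_a closer to cell_c, whose pattern is different.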

2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Chunxiang Wang ◽  
Xin Gao ◽  
Juntao Liu

Abstract Background: Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. Results: We designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single-cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. Conclusion: The SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering of SC3. It is expected to play a crucial role in the related studies of single-cell clustering, such as the studies of human complex diseases and discoveries of new cell types.


Author(s):  
Hind Bangui ◽  
Mouzhi Ge ◽  
Barbora Buhnova

Due to the massive increase in data across Internet of Things (IoT) domains such as healthcare IoT and Smart City IoT, Big Data technologies have emerged as critical tools for analyzing IoT data. Among the Big Data technologies, data clustering is one of the essential approaches to processing IoT data. However, how to select a suitable clustering algorithm for IoT data is still unclear. Furthermore, since Big Data technologies are still at an early stage in many IoT domains, it is valuable to identify and structure the research challenges between Big Data and IoT. Therefore, this article starts by reviewing and comparing the data clustering algorithms that can be applied to IoT datasets, and then extends the discussion to a broader IoT context such as IoT dynamics and IoT mobile networks. Finally, this article identifies a set of research challenges that form a research roadmap for Big Data research in IoT domains. The proposed research roadmap aims at bridging the research gaps between Big Data and various IoT contexts.


2020 ◽  
Vol 18 (04) ◽  
pp. 2040005
Author(s):  
Ruiyi Li ◽  
Jihong Guan ◽  
Shuigeng Zhou

Clustering analysis has been widely applied to single-cell RNA-sequencing (scRNA-seq) data to discover cell types and cell states. Algorithms developed in recent years have greatly helped the understanding of cellular heterogeneity and the underlying mechanisms of biological processes. However, these algorithms often use different techniques, were evaluated on different datasets, and were compared with only some of their counterparts, usually using different performance metrics. Consequently, an accurate and complete picture of their merits and demerits is lacking, which makes it difficult for users to select proper algorithms for analyzing their data. To fill this gap, we first review the major existing scRNA-seq data clustering methods, and then conduct a comprehensive performance comparison among them from multiple perspectives. We consider 13 state-of-the-art scRNA-seq data clustering algorithms, and collect 12 publicly available real scRNA-seq datasets from existing works to evaluate and compare these algorithms. Our comparative study shows that the existing methods are very diverse in performance. Even the top-performing algorithms do not perform well on all datasets, especially those with complex structures. This suggests that further research is required to explore more stable, accurate, and efficient clustering algorithms for scRNA-seq data.


2011 ◽  
Vol 301-303 ◽  
pp. 1133-1138 ◽  
Author(s):  
Yan Xiang Fu ◽  
Wei Zhong Zhao ◽  
Hui Fang Ma

Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volume of information produced by the progress of technology makes clustering of very large-scale data a challenging task. To deal with this problem, more researchers are trying to design efficient parallel clustering algorithms. In this paper, we propose a parallel DBSCAN clustering algorithm based on Hadoop, which is a simple yet powerful parallel programming platform. The experimental results demonstrate that the proposed algorithm can scale well and efficiently process large datasets on commodity hardware.


2021 ◽  
Author(s):  
James Anibal ◽  
Alexandre Day ◽  
Erol Bahadiroglu ◽  
Liam O'Neill ◽  
Long Phan ◽  
...  

Data clustering plays a significant role in biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g. T cells), whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for immense datasets, achieving a considerable improvement in computational efficiency on large datasets compared to existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single cell classification in a scalable, tunable and rapid manner. We make HAL-x publicly available at: https://pypi.org/project/hal-x/


2014 ◽  
Vol 556-562 ◽  
pp. 3822-3826
Author(s):  
Chen Xiao Hu ◽  
Xian Chun Zou

Spectral clustering is an efficient clustering algorithm based on information propagation between neighbouring nodes. Its performance is largely dependent on the distance metric, so it is possible to boost its performance by adopting a more reliable distance metric. Given that sparse representation is discriminative, robust to noise and measures the similarity between two samples more faithfully, we propose a spectral clustering algorithm based on sparse representation. The experimental study on several datasets shows that the proposed algorithm performs better than spectral clustering algorithms based on other similarity metrics.


2018 ◽  
Vol 27 (3) ◽  
pp. 317-329 ◽  
Author(s):  
Satish Chander ◽  
P. Vijaya ◽  
Praveen Dhyani

Abstract The growth of databases in fields such as medicine, business, education and marketing is colossal because of developments in information technology. Knowledge discovery from such large databases is a tedious task. Data mining is one of the promising solutions, and clustering is one of its applications. The clustering process groups related data objects into the same cluster and dissimilar objects into different clusters. The literature presents many algorithms for data clustering. Optimisation-based clustering is one of the recently developed approaches, which discovers the optimal clusters based on an objective function. In our previous work, the direct operative fractional lion (DOFL) optimisation algorithm was proposed for data clustering. In this paper, we design a new clustering algorithm called the adaptive decisive operative fractional lion (ADOFL) optimisation algorithm, based on a multi-kernel function. Moreover, a new fitness function called the multi-kernel WL index is proposed for selecting the best centroid point for clustering. The proposed ADOFL algorithm is evaluated on two benchmark datasets, Iris and Wine. Its performance is validated against existing clustering algorithms such as the particle swarm clustering (PSC) algorithm, the modified PSC algorithm, the lion algorithm, the fractional lion algorithm, and DOFL. The results show that the proposed method obtains a maximum clustering accuracy of 79.51 in data clustering.


Author(s):  
S. May

Abstract. Partition-based clustering techniques are widely used in data mining and also to analyze hyperspectral images. Unsupervised clustering depends only on the data, without any external knowledge, and creates a complete partition of the image into many classes. Sparse labeled samples can then be used to label each cluster, which simplifies the supervised step. Each clustering algorithm has its own advantages and drawbacks (initialization, training complexity). In this paper, we propose a recursive hierarchical clustering based on standard clustering strategies such as K-Means or Fuzzy-C-Means. The recursive hierarchical approach reduces the algorithm complexity, making it possible to process large numbers of input pixels and to produce a clustering with a high number of clusters. Moreover, in hyperspectral images, a classical question concerns the high dimensionality and the distance that should be used. Classical clustering algorithms usually use the Euclidean distance to compute the distance between samples and centroids. We propose to use the spectral angle distance instead and evaluate its performance. It better fits the pixel spectra and is less sensitive to illumination change or spectral variability inside a semantic class. Different scenes are processed with this method in order to demonstrate its potential.
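
The difference between the spectral angle and the Euclidean distance can be sketched in a few lines. This is an illustration only, not the paper's implementation: the helper name and the four-band toy spectra are hypothetical, and an illumination change is modelled, as a simplifying assumption, by a uniform scaling of the spectrum.

```python
import numpy as np

def spectral_angle(x, y):
    """Return the angle in radians between two spectra treated as vectors."""
    cos = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical four-band spectra (not data from the paper).
pixel = np.array([0.10, 0.30, 0.50, 0.40])
shaded = 0.5 * pixel                         # same material, half the illumination
other = np.array([0.50, 0.40, 0.10, 0.30])   # different spectral shape

angle_shaded = spectral_angle(pixel, shaded)
angle_other = spectral_angle(pixel, other)
euclid_shaded = float(np.linalg.norm(pixel - shaded))
euclid_other = float(np.linalg.norm(pixel - other))

print(f"spectral angle  shaded: {angle_shaded:.4f}   other: {angle_other:.4f}")
print(f"Euclidean       shaded: {euclid_shaded:.4f}   other: {euclid_other:.4f}")
```

Under this assumption the angle between the pixel and its shaded copy is essentially zero, while the Euclidean distance between them is clearly non-zero.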


Data clustering is indispensable for crucial data-analytics applications. Though data clustering algorithms are abundant in the literature, there is always room for more efficient ones, owing to the uncontrolled growth of data and its utilization. Data clustering may be applied to any data format, such as text, images, audio and video. Given the increasing use of digital images, this work presents a data clustering algorithm for digital images, which is based on colour distance and an Improvised DBSCAN (IDBSCAN) algorithm. The proposed IDBSCAN removes the tedious process of setting the initial parameters ε and minPts by setting them automatically. The performance of the proposed work is analysed in terms of clustering accuracy, precision, recall, F-measure and time consumption. The proposed work outperforms the existing approaches with reasonable time consumption.
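
To make the role of the two parameters concrete, here is a minimal sketch of standard DBSCAN with ε (`eps`) and minPts (`min_pts`) set by hand. It is not IDBSCAN: the abstract does not describe how the parameters are chosen automatically, so that step is not reproduced. The toy points and parameter values are hypothetical, and plain Euclidean distance stands in for the colour distance.

```python
import numpy as np

NOISE = -1

def dbscan(points, eps, min_pts):
    """Standard DBSCAN; returns one label per point, with -1 meaning noise.

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps of it.
    """
    n = len(points)
    # Full pairwise distance matrix: acceptable for small inputs only.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    labels = [None] * n              # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = NOISE        # may be relabelled later as a border point
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])   # j is also core: keep expanding
    return labels

# Hypothetical toy data: two tight groups and one isolated point.
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10],
                   [50, 50]], dtype=float)
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

With these hand-picked values the two groups form separate clusters and the isolated point is marked as noise. Different values of `eps` or `min_pts` can change the result, which is the sensitivity that automatic parameter selection aims to remove.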


Algorithms ◽  
2021 ◽  
Vol 14 (11) ◽  
pp. 338
Author(s):  
Daphne Teck Ching Lai ◽  
Yuji Sato

Previously, cluster-based multi- or many-objective function techniques were proposed to reduce the Pareto set. Recently, researchers proposed such techniques to find better solutions in the objective space to solve engineering problems. In this work, we applied a cluster-based approach for solution selection in a multiobjective evolutionary algorithm based on decomposition (MOEA/D) with bare bones particle swarm optimization (BBPSO) for data clustering and investigated its clustering performance. In our previous work, we found that MOEA/D with BBPSO performed the best on 10 datasets. Here, we extend this work by applying a cluster-based approach tested on 13 UCI datasets. We compared it with six multiobjective evolutionary clustering algorithms from the existing literature and ten from our previous work. The proposed technique was found to perform well on datasets with highly overlapping clusters, such as CMC and Sonar. So far, we have found only one work that used a cluster-based MOEA for clustering data, the hierarchical topology multiobjective clustering algorithm. All other cluster-based MOEAs found were used to solve problems other than data clustering. By clustering Pareto solutions and evaluating new candidates against the found cluster representatives, local search is introduced in the solution selection process within the objective space, which can be effective on datasets with highly overlapping clusters. This is an added layer of search control in the objective space. The results are promising and prompt several areas of future research, which are discussed, including the study of its effects with an increasing number of clusters as well as with other objective functions.

