PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

Mapping Intimacies ◽

10.1101/765628 ◽

2019 ◽

Author(s):

Shobana V. Stassen ◽

Dickson M. D. Siu ◽

Kelvin C. M. Lee ◽

Joshua W. K. Ho ◽

Hayden K. H. So ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Clustering Algorithm ◽

Single Cells ◽

Clustering Algorithms ◽

Cell Mass ◽

Cellular Heterogeneity ◽

Phenotypic Data ◽

Data Set ◽

Cell Data

Abstract. Motivation: New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results: We introduce a highly scalable graph-based clustering algorithm, PARC (phenotyping by accelerated refined community-partitioning), for ultralarge-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without sub-sampling of cells, including Phenograph, FlowSOM, and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell data set of 1.1M cells within 13 minutes, compared to >2 hours for the next fastest graph-clustering algorithm, Phenograph. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation: https://github.com/ShobiStassen/PARC


PARC: ultrafast and accurate clustering of phenotypic data of millions of single cells

Bioinformatics ◽

10.1093/bioinformatics/btaa042 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2778-2786 ◽

Cited By ~ 5

Author(s):

Shobana V Stassen ◽

Dickson M D Siu ◽

Kelvin C M Lee ◽

Joshua W K Ho ◽

Hayden K H So ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Clustering Algorithm ◽

Single Cells ◽

Clustering Algorithms ◽

Cellular Heterogeneity ◽

Supplementary Information ◽

Phenotypic Data ◽

Scalable Algorithm ◽

Cell Data

Abstract. Motivation: New single-cell technologies continue to fuel the explosive growth in the scale of heterogeneous single-cell data. However, existing computational methods are inadequately scalable to large datasets and therefore cannot uncover the complex cellular heterogeneity. Results: We introduce a highly scalable graph-based clustering algorithm, PARC (Phenotyping by Accelerated Refined Community-partitioning), for large-scale, high-dimensional single-cell data (>1 million cells). Using large single-cell flow and mass cytometry, RNA-seq and imaging-based biophysical data, we demonstrate that PARC consistently outperforms state-of-the-art clustering algorithms without subsampling of cells, including Phenograph, FlowSOM and Flock, in terms of both speed and ability to robustly detect rare cell populations. For example, PARC can cluster a single-cell dataset of 1.1 million cells within 13 min, compared with >2 h for the next fastest graph-clustering algorithm. Our work presents a scalable algorithm to cope with increasingly large-scale single-cell analysis. Availability and implementation: https://github.com/ShobiStassen/PARC. Supplementary information: Supplementary data are available at Bioinformatics online.


Ultrafast clustering of single-cell flow cytometry data using FlowGrid

10.1101/394189 ◽

2018 ◽

Author(s):

Xiaoxin Ye ◽

Joshua W. K. Ho

Keyword(s):

Cell Surface ◽

Single Cell ◽

Clustering Algorithm ◽

Single Cells ◽

Clustering Algorithms ◽

Large Data ◽

Population Heterogeneity ◽

Multidimensional Data ◽

Protein Markers ◽

Data Set

Abstract. Flow cytometry is a popular technology for quantitative single-cell profiling of cell surface markers. It enables expression measurement of tens of cell surface protein markers in millions of single cells. It is a powerful tool for discovering cell sub-populations and quantifying cell population heterogeneity. Traditionally, scientists use manual gating to identify cell types, but the process is subjective and is not effective for large multidimensional data. Many clustering algorithms have been developed to analyse these data, but most of them are not scalable to very large data sets with more than ten million cells. Here, we present a new clustering algorithm that combines the advantages of the density-based clustering algorithm DBSCAN with the scalability of grid-based clustering. This new clustering algorithm is implemented in Python as an open source package, FlowGrid. FlowGrid is memory efficient and scales linearly with respect to the number of cells. We have evaluated the performance of FlowGrid against other state-of-the-art clustering programs and found that FlowGrid produces similar clustering results in substantially less time. For example, FlowGrid is able to complete a clustering task on a data set of 23.6 million cells in less than 12 seconds, while other algorithms take more than 500 seconds or fail with an error. FlowGrid is an ultrafast clustering algorithm for large single-cell flow cytometry data. The source code is available at https://github.com/VCCRI/FlowGrid.
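The general idea of combining a grid with density-based clustering can be illustrated with a minimal sketch. This is not the FlowGrid implementation: the function name `grid_density_cluster` and the parameters `bin_width` and `min_count` are assumptions made for illustration. Points are binned into grid cells in a single pass, cells holding enough points are treated as dense, and adjacent dense cells are merged into one cluster.

```python
from collections import defaultdict, deque
from itertools import product


def grid_density_cluster(points, bin_width=1.0, min_count=3):
    """Label points by joining adjacent dense grid cells (illustrative only)."""
    # Assign every point to an integer grid cell: one pass, linear in the number of points.
    bins = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(int(x // bin_width) for x in p)
        bins[key].append(idx)

    # A cell is "dense" when it holds at least min_count points (hypothetical rule).
    dense = {key for key, members in bins.items() if len(members) >= min_count}

    labels = [-1] * len(points)  # -1 marks points in sparse cells (treated as noise)
    visited = set()
    cluster_id = 0
    for start in sorted(dense):
        if start in visited:
            continue
        # Breadth-first search over neighbouring dense cells.
        queue = deque([start])
        visited.add(start)
        while queue:
            cell = queue.popleft()
            for idx in bins[cell]:
                labels[idx] = cluster_id
            # Checks all 3**d surrounding cells, so this sketch only suits low dimensions.
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbour = tuple(c + o for c, o in zip(cell, offset))
                if neighbour in dense and neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
        cluster_id += 1
    return labels
```

The clustering step then works on cells rather than on individual points, which is one plausible reason a grid-based approach can scale to tens of millions of cells.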


CIM-seq

10.21203/rs.3.pex-1365/v1 ◽

2021 ◽

Author(s):

Nathanael Andrews ◽

Martin Enge

Keyword(s):

Single Cell ◽

Single Cells ◽

Likelihood Estimation ◽

Cell Types ◽

Data Sets ◽

Target Tissue ◽

Data Set ◽

Rnaseq Data ◽

The Given ◽

Cell Data

Abstract. CIM-seq is a tool for deconvoluting RNA-seq data from cell multiplets (clusters of two or more cells) in order to identify physically interacting cells in a given tissue. The method requires two RNA-seq data sets from the same tissue: one of single cells to be used as a reference, and one of cell multiplets to be deconvoluted. CIM-seq is compatible with both droplet-based sequencing methods, such as Chromium Single Cell 3′ Kits from 10x Genomics, and plate-based methods, such as Smartseq2. The pipeline consists of three parts: 1) dissociation of the target tissue, FACS sorting of single cells and multiplets, and conventional scRNA-seq; 2) feature selection and clustering of cell types in the single-cell data set, generating a blueprint of transcriptional profiles in the given tissue; 3) computational deconvolution of multiplets through maximum likelihood estimation (MLE) to determine the most likely cell type constituents of each multiplet.


Scalable Clustering with Supervised Linkage Methods

10.1101/2021.08.01.454697 ◽

2021 ◽

Author(s):

James Anibal ◽

Alexandre Day ◽

Erol Bahadiroglu ◽

Liam O'Neill ◽

Long Phan ◽

...

Keyword(s):

Single Cell ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Biomedical Sciences ◽

New Approach ◽

Scalable Clustering ◽

Linkage Methods ◽

Density Clustering ◽

Cell Data ◽

Different Levels

Data clustering plays a significant role in biomedical sciences, particularly in single-cell data analysis. Researchers use clustering algorithms to group individual cells into populations that can be evaluated across different levels of disease progression, drug response, and other clinical statuses. In many cases, multiple sets of clusters must be generated to assess varying levels of cluster specificity. For example, there are many subtypes of leukocytes (e.g. T cells), whose individual preponderance and phenotype must be assessed for statistical/functional significance. In this report, we introduce a novel hierarchical density clustering algorithm (HAL-x) that uses supervised linkage methods to build a cluster hierarchy on raw single-cell data. With this new approach, HAL-x can quickly predict multiple sets of labels for immense datasets, achieving a considerable improvement in computational efficiency on large datasets compared to existing methods. We also show that cell clusters generated by HAL-x yield near-perfect F1-scores when classifying different clinical statuses based on single-cell profiles. Our hierarchical density clustering algorithm achieves high accuracy in single cell classification in a scalable, tunable and rapid manner. We make HAL-x publicly available at: https://pypi.org/project/hal-x/


Supervised dimensionality reduction for exploration of single-cell data by Hybrid Subset Selection - Linear Discriminant Analysis

10.1101/2022.01.06.475279 ◽

2022 ◽

Author(s):

Meelad Amouzgar ◽

David R Glass ◽

Reema Baskar ◽

Inna Averbukh ◽

Samuel C Kimmey ◽

...

Keyword(s):

Discriminant Analysis ◽

Dimensionality Reduction ◽

Linear Discriminant Analysis ◽

Single Cell ◽

Cell Mass ◽

Subset Selection ◽

Cellular Heterogeneity ◽

Linear Discriminant ◽

Original Dataset ◽

Cell Data

Single-cell technologies generate large, high-dimensional datasets encompassing a diversity of omics. Dimensionality reduction enables visualization of data by representing cells in two-dimensional plots that capture the structure and heterogeneity of the original dataset. Visualizations contribute to human understanding of data and are useful for guiding both quantitative and qualitative analysis of cellular relationships. Existing algorithms are typically unsupervised, utilizing only measured features to generate manifolds, disregarding known biological labels such as cell type or experimental timepoint. Here, we repurpose the classification algorithm, linear discriminant analysis (LDA), for supervised dimensionality reduction of single-cell data. LDA identifies linear combinations of predictors that optimally separate a priori classes, enabling users to tailor visualizations to separate specific aspects of cellular heterogeneity. We implement feature selection by hybrid subset selection (HSS) and demonstrate that this flexible, computationally-efficient approach generates non-stochastic, interpretable axes amenable to diverse biological processes, such as differentiation over time and cell cycle. We benchmark HSS-LDA against several popular dimensionality reduction algorithms and illustrate its utility and versatility for exploration of single-cell mass cytometry, transcriptomics and chromatin accessibility data.


Deep soft K-means clustering with self-training for single-cell RNA sequence data

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa039 ◽

2020 ◽

Vol 2 (2) ◽

Cited By ~ 2

Author(s):

Liang Chen ◽

Weinan Wang ◽

Yuyao Zhai ◽

Minghua Deng

Keyword(s):

Deep Learning ◽

Single Cell ◽

Large Scale ◽

Sequence Data ◽

Dimensional Space ◽

Expression Profiles ◽

Single Cells ◽

Clustering Algorithms ◽

Training Procedure ◽

Latent Space

Abstract. Single-cell RNA sequencing (scRNA-seq) allows researchers to study cell heterogeneity at the cellular level. A crucial step in analyzing scRNA-seq data is to cluster cells into subpopulations to facilitate subsequent downstream analysis. However, frequent dropout events and the increasing size of scRNA-seq data make clustering such high-dimensional, sparse and massive transcriptional expression profiles challenging. Although some existing deep learning-based clustering algorithms for single cells combine dimensionality reduction with clustering, they either ignore the distance and affinity constraints between similar cells or make additional latent space assumptions, such as a Gaussian mixture distribution, and so fail to learn a cluster-friendly low-dimensional space. Therefore, in this paper, we use a deep learning technique, a denoising autoencoder, to characterize scRNA-seq data and propose a soft self-training K-means algorithm to cluster the cell population in the learned latent space. The self-training procedure can effectively aggregate similar cells and pursue a more cluster-friendly latent space. Our method, called ‘scziDesk’, alternately performs data compression, data reconstruction and soft clustering, and the results exhibit excellent compatibility and robustness on both simulated and real data. Moreover, our proposed method scales well with the number of cells on large-scale datasets.
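As a rough illustration of what "soft" clustering means here, the sketch below computes soft membership weights of one latent-space point over a set of cluster centres. It is an assumption-laden example, not the scziDesk method: the function name `soft_assign`, the softmax-of-negative-squared-distance weighting, and the `temperature` parameter are all chosen for illustration and are not taken from the abstract.

```python
import math


def soft_assign(point, centres, temperature=1.0):
    """Return membership weights of a point over the centres (they sum to 1)."""
    # Score each centre by negative squared Euclidean distance (hypothetical choice).
    scores = [
        -sum((x - c) ** 2 for x, c in zip(point, centre)) / temperature
        for centre in centres
    ]
    top = max(scores)  # subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Unlike hard K-means, every point contributes to every centre in proportion to its weight, so cells near a cluster boundary are not forced into a single cluster.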


Parallel Implementation of Improved K-Means Based on a Cloud Platform

Information Technology And Control ◽

10.5755/j01.itc.48.4.23881 ◽

2019 ◽

Vol 48 (4) ◽

pp. 673-681

Author(s):

Shufen Zhang ◽

Zhiyu Liu ◽

Xuebin Chen ◽

Changyin Luo

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Programming Model ◽

Parallel Implementation ◽

Clustering Algorithms ◽

Data Set ◽

Large Scale Data ◽

Sample Density ◽

Scale Data ◽

Selection Of

To address the difficulty that the traditional K-Means clustering algorithm has with large-scale data sets, a Hadoop K-Means (referred to as HKM) clustering algorithm is proposed. First, the algorithm uses sample density to eliminate the effects of noise points in the data set. Second, it optimizes the selection of the initial center points using the max-min distance idea. Finally, it uses the MapReduce programming model to parallelize the computation. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results, but also solves the scalability problems that traditional clustering algorithms encounter with large-scale data.
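The max-min distance idea for choosing initial centers can be sketched as follows. This is a minimal single-machine illustration, not the HKM algorithm: the density-based noise removal and the MapReduce parallelization are omitted, and the function name `max_min_init` and the choice of the first point as the starting center are assumptions.

```python
import math


def max_min_init(points, k):
    """Pick k initial centers; each new one is farthest from its nearest chosen center."""
    centres = [points[0]]  # assumption: seed with the first point
    # nearest[i] holds the distance from points[i] to its closest center so far.
    nearest = [math.dist(p, centres[0]) for p in points]
    while len(centres) < k:
        # The next center is the point whose nearest-center distance is largest.
        best = max(range(len(points)), key=lambda i: nearest[i])
        centres.append(points[best])
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[best]))
    return centres
```

Spreading the initial centers apart in this way tends to avoid starting several centers inside the same cluster, which is a common failure of purely random initialization.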


dropClust: Efficient clustering of ultra-large scRNA-seq data

10.1101/170308 ◽

2017 ◽

Cited By ~ 2

Author(s):

Debajyoti Sinha ◽

Akhilesh Kumar ◽

Himanshu Kumar ◽

Sanghamitra Bandyopadhyay ◽

Debarka Sengupta

Keyword(s):

Single Cell ◽

Large Scale ◽

Best Practice ◽

Clustering Algorithm ◽

Nearest Neighbor ◽

De Novo ◽

Single Cells ◽

Nearest Neighbor Search ◽

Locality Sensitive Hashing ◽

Clustering Methods

Abstract. Droplet-based single-cell transcriptomics has recently enabled parallel screening of tens of thousands of single cells. Clustering methods that scale for such high-dimensional data without compromising accuracy are scarce. We exploit Locality Sensitive Hashing, an approximate nearest neighbor search technique, to develop a de novo clustering algorithm for large-scale single-cell data. On a number of real datasets, dropClust outperformed the existing best practice methods in terms of execution time, clustering accuracy and detectability of minor cell sub-types.
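To show how Locality Sensitive Hashing supports approximate nearest neighbor search, here is a minimal sketch of one common LSH family, random-hyperplane hashing for cosine similarity. It is a generic illustration and not dropClust's implementation; the function names `make_hyperplanes`, `signature` and `build_buckets` and the parameters `n_bits` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(v * w for v, w in zip(vec, plane)) >= 0 else 0
        for plane in planes
    )


def build_buckets(vectors, planes):
    """Group vector indices by signature; bucket-mates are candidate neighbors."""
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        buckets[signature(vec, planes)].append(idx)
    return buckets
```

Vectors pointing in similar directions tend to share a signature, so a neighbor query only needs to compare against vectors in the same bucket instead of the whole data set.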


Benchmarking PSM identification tools for single cell proteomics

10.1101/2021.08.17.456676 ◽

2021 ◽

Author(s):

Daisha Van Der Watt ◽

Hannah Boekweg ◽

Thy Truong ◽

Amanda J Guise ◽

Edward D Plowey ◽

...

Keyword(s):

Machine Learning ◽

Single Cell ◽

Single Cells ◽

Peptide Identification ◽

Machine Learning Algorithms ◽

Cellular Heterogeneity ◽

Proteomics Data ◽

Improve Performance ◽

False Discovery ◽

Cell Data

Abstract. Single-cell proteomics is an emerging sub-field within proteomics with the potential to revolutionize our understanding of cellular heterogeneity and interactions. Recent efforts have largely focused on technological advancements in sample preparation, chromatography and instrumentation to enable measuring proteins present in these ultra-limited samples. Although advancements in data acquisition have rapidly improved our ability to analyze single cells, the software pipelines used in data analysis were originally written for traditional bulk samples and their performance on single-cell data has not been investigated. We benchmarked five popular peptide identification tools on single-cell proteomics data. We found that MetaMorpheus achieved the greatest number of peptide spectrum matches at a 1% false discovery rate. Depending on the tool, we also found that post-processing machine learning can improve spectrum identification results by up to ∼40%. Although rescoring leads to a greater number of peptide spectrum matches, these new results are typically generated by third-party tools and cannot be used by the primary pipeline for quantification. Exploration of novel metrics for machine learning algorithms will continue to improve performance.


A Quantitative Single-Cell Proteomics Approach to Characterize an Acute Myeloid Leukemia Hierarchy

10.1101/745679 ◽

2019 ◽

Cited By ~ 6

Author(s):

Erwin M. Schoof ◽

Nicolas Rapin ◽

Simonas Savickas ◽

Coline Gentil ◽

Eric Lechman ◽

...

Keyword(s):

Single Cell ◽

Single Cell Analysis ◽

Single Cells ◽

Cell Mass ◽

Science Research ◽

Cell Types ◽

Cellular Heterogeneity ◽

Significant Shift ◽

Cell Analysis ◽

Computational Pipeline

Abstract. In recent years, cellular life science research has experienced a significant shift, moving away from bulk cell interrogation towards single-cell analysis. It is only through single-cell analysis that a complete understanding of cellular heterogeneity, and of the interplay between various cell types that are fundamental to specific biological phenotypes, can be achieved. Single-cell assays at the protein level have been predominantly limited to targeted, antibody-based methods. Here, however, we present an experimental and computational pipeline that establishes a comprehensive single-cell mass spectrometry-based proteomics workflow. By exploiting a leukemia culture system containing functionally defined leukemic stem cells, progenitors and terminally differentiated blasts, we demonstrate that our workflow is able to explore the cellular heterogeneity within this aberrant developmental hierarchy. We show that our approach is capable of quantifying hundreds of proteins across hundreds of single cells using limited instrument time. Furthermore, we developed a computational pipeline (SCeptre) that effectively clusters the data and permits the extraction of cell-specific proteins and functional pathways. This proof-of-concept work lays the foundation for future global single-cell proteomics studies.
