Cluster Headache: Comparing Clustering Tools for 10X Single Cell Sequencing Data


10.1101/203752 ◽

2017 ◽

Cited By ~ 4

Author(s):

Saskia Freytag ◽

Ingrid Lonnstedt ◽

Milica Ng ◽

Melanie Bahlo

Keyword(s):

Single Cell ◽

Mononuclear Cells ◽

Large Degree ◽

Rna Seq ◽

Clustering Methods ◽

Ribosomal Protein Genes ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Type Assignment ◽

Peripheral Mononuclear Cells

Abstract: The commercially available 10X Genomics protocol to generate droplet-based single cell RNA-seq (scRNA-seq) data is enjoying growing popularity among researchers. Fundamental to the analysis of such scRNA-seq data is the ability to cluster similar or same cells into non-overlapping groups. Many competing methods have been proposed for this task, but there is currently little guidance with regards to which method offers most accuracy. Answering this question is complicated by the fact that 10X Genomics data lack cell labels that would allow a direct performance evaluation. Thus in this review, we focused on comparing clustering solutions of a dozen methods for three datasets on human peripheral mononuclear cells generated with the 10X Genomics technology. While clustering solutions appeared robust, we found that solutions produced by different methods have little in common with each other. They also failed to replicate cell type assignment generated with supervised labeling approaches. Furthermore, we demonstrate that all clustering methods tested clustered cells to a large degree according to the amount of genes coding for ribosomal protein genes in each cell.


K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

10.1101/723833 ◽

2019 ◽

Author(s):

Christina Huan Shi ◽

Kevin Y. Yip

Keyword(s):

Single Cell ◽

State Of The Art ◽

Rna Seq ◽

Sequencing Data ◽

Memory Consumption ◽

Analysis Pipeline ◽

Cell Clusters ◽

Single Cell Sequencing ◽

Sequencing Errors ◽

Full Analysis

Abstract: K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.
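To make the idea of k-mer counting with removal of error-induced k-mers concrete, here is a minimal Python sketch. It is not CQF-deNoise: it uses a plain in-memory `Counter` instead of a compact data structure, and it drops rare k-mers once at the end rather than dynamically during counting. The function names, the toy reads, and the `min_count` threshold are all illustrative assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_rare_kmers(counts, min_count):
    """Keep only k-mers seen at least min_count times.

    Assumption for illustration: k-mers created by a sequencing error
    are rare, so a simple count threshold removes most of them.
    """
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy input: three identical reads plus one read with a single error (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=4)
filtered = drop_rare_kmers(counts, min_count=2)
print(len(counts), "distinct k-mers before filtering,", len(filtered), "after")
```

In this toy case a single erroneous base adds three k-mers that each occur once, which shows why false k-mers inflate memory use when they are kept.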


Cobolt: integrative analysis of multimodal single-cell sequencing data

Genome Biology ◽

10.1186/s13059-021-02556-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Boying Gong ◽

Yun Zhou ◽

Elizabeth Purdom

Keyword(s):

Gene Expression ◽

Single Cell ◽

Chromatin Accessibility ◽

Integrative Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Multiple Datasets ◽

Novel Method ◽

Sequencing Platforms

Abstract: A growing number of single-cell sequencing platforms enable joint profiling of multiple omics from the same cells. We present Cobolt, a novel method that not only allows for analyzing the data from joint-modality platforms, but provides a coherent framework for the integration of multiple datasets measured on different modalities. We demonstrate its performance on multi-modality data of gene expression and chromatin accessibility and illustrate the integration abilities of Cobolt by jointly analyzing this multi-modality data with single-cell RNA-seq and ATAC-seq datasets.


Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data

F1000Research ◽

10.12688/f1000research.15809.2 ◽

2018 ◽

Vol 7 ◽

pp. 1297 ◽

Cited By ~ 14

Author(s):

Saskia Freytag ◽

Luyi Tian ◽

Ingrid Lönnstedt ◽

Milica Ng ◽

Melanie Bahlo

Keyword(s):

Single Cell ◽

Peripheral Blood Mononuclear Cells ◽

Gold Standard ◽

Mononuclear Cells ◽

Rna Seq ◽

Sequencing Data ◽

Peripheral Blood Mononuclear ◽

Silver Standard ◽

Single Cell Rna Sequencing ◽

Blood Mononuclear Cells

Background: The commercially available 10x Genomics protocol to generate droplet-based single cell RNA-seq (scRNA-seq) data is enjoying growing popularity among researchers. Fundamental to the analysis of such scRNA-seq data is the ability to cluster similar or same cells into non-overlapping groups. Many competing methods have been proposed for this task, but there is currently little guidance with regards to which method to use. Methods: Here we use one gold standard 10x Genomics dataset, generated from the mixture of three cell lines, as well as multiple silver standard 10x Genomics datasets generated from peripheral blood mononuclear cells to examine not only the accuracy but also running time and robustness of a dozen methods. Results: We found that Seurat outperformed other methods, although performance seems to be dependent on many factors, including the complexity of the studied system. Furthermore, we found that solutions produced by different methods have little in common with each other. Conclusions: In light of this we conclude that the choice of clustering tool crucially determines interpretation of scRNA-seq data generated by 10x Genomics. Hence practitioners and consumers should remain vigilant about the outcome of 10x Genomics scRNA-seq analysis.
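One way to quantify how much two clustering solutions have in common, or how closely a solution matches gold-standard labels, is the adjusted Rand index (ARI). The abstract does not say which measure the authors used, so the metric choice here is an assumption. The sketch below is a self-contained Python implementation of the standard pair-counting formula, and the function name is illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells.

    1.0 means identical partitions (up to renaming of clusters);
    values near 0 mean agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two cells")

    # Pairs of cells that share a cluster in both, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Happens only when both partitions are a single cluster
        # or both put every cell in its own cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labels from two tools for the same six cells.
tool_1 = ["T", "T", "B", "B", "NK", "NK"]
tool_2 = [0, 0, 1, 1, 1, 2]
print(adjusted_rand_index(tool_1, tool_2))
```

Because the index depends only on which cells are grouped together, it can compare tools that name or number their clusters differently.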


Phenotype-guided subpopulation identification from single-cell sequencing data

10.1101/2020.06.05.137240 ◽

2020 ◽

Author(s):

Duanchen Sun ◽

Xiangnan Guan ◽

Amy E. Moran ◽

David Z. Qian ◽

Pepper Schedin ◽

...

Keyword(s):

Lung Cancer ◽

Single Cell ◽

Clinical Information ◽

Single Step ◽

Cell Subpopulation ◽

Clustering Methods ◽

Sequencing Data ◽

Single Cell Sequencing ◽

Cell Subpopulations ◽

Cell Data

Abstract: Single-cell sequencing yields novel discoveries by distinguishing cell types, states and lineages within the context of heterogeneous tissues. However, interpreting complex single-cell data from highly heterogeneous cell populations remains challenging. Currently, most existing single-cell data analyses focus on cell type clusters defined by unsupervised clustering methods, which cannot directly link cell clusters with specific biological and clinical phenotypes. Here we present Scissor, a novel approach that utilizes disease phenotypes to identify cell subpopulations from single-cell data that most highly correlate with a given phenotype. This “phenotype-to-cell within a single step” strategy enables the utilization of a large amount of clinical information that has been collected for bulk assays to identify the most highly phenotype-associated cell subpopulations. When applied to a lung cancer single-cell RNA-seq (scRNA-seq) dataset, Scissor identified a subset of cells exhibiting high hypoxia activities, which predicted worse survival outcomes in lung cancer patients. Furthermore, in a melanoma scRNA-seq dataset, Scissor discerned a T cell subpopulation with low PDCD1/CTLA4 and high TCF7 expressions, which is associated with a favorable immunotherapy response. Thus, Scissor provides a novel framework to identify the biologically and clinically relevant cell subpopulations from single-cell assays by leveraging the wealth of phenotypes and bulk-omics datasets.


Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data

Nature Communications ◽

10.1038/s41467-021-22008-3 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Tian Tian ◽

Jie Zhang ◽

Xiang Lin ◽

Zhi Wei ◽

Hakon Hakonarson

Keyword(s):

Single Cell ◽

Domain Knowledge ◽

Ad Hoc ◽

A Priori ◽

Unsupervised Clustering ◽

Rna Seq ◽

Clustering Methods ◽

Cell Type ◽

Type Assignment ◽

Deep Embedding

Abstract: Clustering is a critical step in single cell-based studies. Most existing methods support unsupervised clustering without the a priori exploitation of any domain knowledge. When confronted by the high dimensionality and pervasive dropout events of scRNA-Seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters, which complicates cell type assignment. In such cases, the only recourse is for the user to manually and repeatedly tweak clustering parameters until acceptable clusters are found. Consequently, the path to obtaining biologically meaningful clusters can be ad hoc and laborious. Here we report a principled clustering method named scDCC that integrates domain knowledge into the clustering step. Experiments on various scRNA-seq datasets from thousands to tens of thousands of cells show that scDCC can significantly improve clustering performance, facilitating the interpretability of clusters and downstream analyses, such as cell type assignment.


Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data

F1000Research ◽

10.12688/f1000research.15809.1 ◽

2018 ◽

Vol 7 ◽

pp. 1297 ◽

Cited By ~ 58

Author(s):

Saskia Freytag ◽

Luyi Tian ◽

Ingrid Lönnstedt ◽

Milica Ng ◽

Melanie Bahlo

Keyword(s):

Single Cell ◽

Peripheral Blood Mononuclear Cells ◽

Gold Standard ◽

Mononuclear Cells ◽

Rna Seq ◽

Sequencing Data ◽

Peripheral Blood Mononuclear ◽

Silver Standard ◽

Single Cell Rna Sequencing ◽

Blood Mononuclear Cells

Background: The commercially available 10x Genomics protocol to generate droplet-based single-cell RNA-seq (scRNA-seq) data is enjoying growing popularity among researchers. Fundamental to the analysis of such scRNA-seq data is the ability to cluster similar or same cells into non-overlapping groups. Many competing methods have been proposed for this task, but there is currently little guidance with regards to which method to use. Methods: Here we use one gold standard 10x Genomics dataset, generated from the mixture of three cell lines, as well as three silver standard 10x Genomics datasets generated from peripheral blood mononuclear cells to examine not only the accuracy but also robustness of a dozen methods. Results: We found that some methods, including Seurat and Cell Ranger, outperform other methods, although performance seems to be dependent on the complexity of the studied system. Furthermore, we found that solutions produced by different methods have little in common with each other. Conclusions: In light of this, we conclude that the choice of clustering tool crucially determines interpretation of scRNA-seq data generated by 10x Genomics. Hence practitioners and consumers should remain vigilant about the outcome of 10x Genomics scRNA-seq analysis.


Analyses of metastasis-associated genes in IDH wild-type glioma

BMC Cancer ◽

10.1186/s12885-020-07628-0 ◽

2020 ◽

Vol 20 (1) ◽

Author(s):

Xiaozhi Li ◽

Yutong Meng

Keyword(s):

Single Cell ◽

Signaling Pathway ◽

Prognostic Model ◽

Cox Regression ◽

Metastatic Tumor ◽

Cellular Protein ◽

Signal Pathways ◽

Clustering Methods ◽

Sequencing Data ◽

Single Cell Sequencing

Background: Glioma is the most common malignant tumor of the brain. The existence of metastatic tumor cells is an important cause of recurrence even after radical glioma resection. Methods: Single-cell sequencing data and high-throughput data were downloaded from GEO database and TCGA/CGGA database. By means of PCA and tSNE clustering methods, metastasis-associated genes in glioma were identified. GSEA explored possible biological functions that these metastasis-associated genes may participate in. Univariate and multivariate Cox regression were used to construct a prognostic model. Results: Glioma metastatic cells and metastasis-associated genes were identified. The prognostic model based on metastasis-associated genes had good sensitivity and specificity for the prognosis of glioma. These genes may be involved in signal pathways such as cellular protein catabolic process, p53 signaling pathway, transcriptional misregulation in cancer and JAK-STAT signaling pathway. Conclusion: This study explored glioma metastasis-associated genes through single-cell sequencing data mining, and aimed to identify prognostic metastasis-associated signatures for glioma and may provide potential targets for further cancer research.


Single-Cell Sequencing of Peripheral Mononuclear Cells Reveals Distinct Immune Response Landscapes of COVID-19 and Influenza Patients

Immunity ◽

10.1016/j.immuni.2020.07.009 ◽

2020 ◽

Vol 53 (3) ◽

pp. 685-696.e3 ◽

Cited By ~ 22

Author(s):

Linnan Zhu ◽

Penghui Yang ◽

Yingze Zhao ◽

Zhenkun Zhuang ◽

Zhifeng Wang ◽

...

Keyword(s):

Immune Response ◽

Single Cell ◽

Mononuclear Cells ◽

Single Cell Sequencing ◽

Peripheral Mononuclear Cells


Straightforward clustering of single-cell RNA-Seq data with t-SNE and DBSCAN

10.1101/770388 ◽

2019 ◽

Cited By ~ 3

Author(s):

Florian Wagner

Keyword(s):

Single Cell ◽

Mononuclear Cells ◽

Gene Selection ◽

Real Data ◽

Cell Types ◽

Fine Tuning ◽

Rna Seq ◽

Clustering Methods ◽

Cell Type ◽

Peripheral Blood Mononuclear

Abstract: Clustering of cells by cell type is arguably the most common and repetitive task encountered during the analysis of single-cell RNA-Seq data. However, as popular clustering methods operate largely independently of visualization techniques, the fine-tuning of clustering parameters can be unintuitive and time-consuming. Here, I propose Galapagos, a simple and effective clustering workflow based on t-SNE and DBSCAN that does not require a gene selection step. In practice, Galapagos only involves the fine-tuning of two parameters, which is straightforward, as clustering is performed directly on the t-SNE visualization results. Using peripheral blood mononuclear cells as a model tissue, I validate the effectiveness of Galapagos in different ways. First, I show that Galapagos generates clusters corresponding to all main cell types present. Then, I demonstrate that the t-SNE results are robust to parameter choices and initialization points. Next, I employ a simulation approach to show that clustering with Galapagos is accurate and robust to the high levels of technical noise present. Finally, to demonstrate Galapagos’ accuracy on real data, I compare clustering results to true cell type identities established using CITE-Seq data. In this context, I also provide an example of the primary limitation of Galapagos, namely the difficulty to resolve related cell types in cases where t-SNE fails to clearly separate the cells. Galapagos helps to make clustering scRNA-Seq data more intuitive and reproducible, and can be implemented in most programming languages with only a few lines of code.
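The clustering half of such a workflow can be sketched in plain Python. This is not the Galapagos code: it assumes the t-SNE step has already produced one 2-D coordinate per cell, and it implements textbook DBSCAN with a brute-force neighbor search. The names `eps` and `min_samples` follow common DBSCAN usage; whether they are the two parameters the abstract refers to is an assumption.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force search; the point itself counts as its own neighbor.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


# Hypothetical 2-D embedding: two tight groups of cells and one stray cell.
embedding = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11),
             (5, 5)]
print(dbscan(embedding, eps=1.5, min_samples=3))
```

Clustering directly on the 2-D embedding means every cluster boundary can be checked by eye on the same plot, which is the intuition the abstract describes. It also means cell types that t-SNE does not separate cannot be separated by this step.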


TMIC-36. ALDH1A2 AS A NOVEL PUTATIVE MARKER OF MACROPHAGE DIFFERENTIATION IN GBM

Neuro-Oncology ◽

10.1093/neuonc/noz175.1070 ◽

2019 ◽

Vol 21 (Supplement_6) ◽

pp. vi255-vi255

Author(s):

Stephanie Sanders ◽

Denise Herpai ◽

Lance Miller ◽

Waldemar Debinski

Keyword(s):

Single Cell ◽

Magnetic Beads ◽

Mononuclear Cells ◽

Gelatin Zymography ◽

Growth Factor Signaling ◽

Sequencing Data ◽

Expression Array ◽

Single Cell Sequencing ◽

Metabolism Regulation ◽

Putative Marker

Abstract: Tumor-associated macrophages (TAM) are abundant in glioblastoma (GBM), composing up to 30% of the total tumor mass. However, their genotype/phenotype and exact role in tumor progression and immune suppression are not fully understood. Macrophages are believed to be polarized along a spectrum spanning from M0, M1, and M2 states. M1 TAMs are proinflammatory while M2 TAMs are associated with anti-inflammatory, pro-tumor responses. For example, using gelatin zymography we saw an increase in MMP 2 and 9 activities in co-culture of GBM cells and M2 polarized macrophages. In search for factors determining M2 type TAMs, we found Aldehyde Dehydrogenase 1 family A2 (ALDH1A2) to be highly over-expressed in M2 polarized THP1 cells using a gene expression array. We also performed single cell sequencing on CD45+ cells isolated from GBM tumors using antibody-conjugated magnetic beads. Monocytic/macrophage cells were found in 2 clusters and ALDH1A2 expression was increased in 2.1% of CD68+/CD163+ cells in Cluster 1 and 2.7% of CD68+/CD163+ cells in Cluster 2. Notably, ALDH1A2 was not detected in peripheral blood mononuclear cells. Analysis of the single cell sequencing data has led to identification of several genes whose expression is increased in ALDH1A2+ cells compared to ALDH1A2- cells. These genes are associated with lipid metabolism, regulation of neoplastic transformation, and insulin growth factor signaling. To further validate these results, we co-stained GBM tumor sections for CD163 and ALDH1A2 and observed that ALDH1A2 is co-localized in M2 macrophages. We have noticed a propensity of ALDH1A2+ cells to be associated with tumor neovasculature. Being that ALDH1A2 is the main enzyme in retinoic acid (RA) synthesis, it is plausible that this could represent another possible function of these enzyme-positive TAMs. Taken together these data suggest a role for ALDH1A2 as a novel putative marker of a subset of M2 TAM phenotype and function in GBM.
