Artificial-cell-type aware cell-type classification in CITE-seq

Qiuyu Lian; Hongyi Xin; Jianzhu Ma; Liza Konnikova; Wei Chen; Jin Gu; Kong Chen

doi:10.1093/bioinformatics/btaa467

Artificial-cell-type aware cell-type classification in CITE-seq

Bioinformatics ◽ 10.1093/bioinformatics/btaa467 ◽ 2020 ◽ Vol 36 (Supplement_1) ◽ pp. i542-i550 ◽ Cited By ~ 1

Author(s): Qiuyu Lian ◽ Hongyi Xin ◽ Jianzhu Ma ◽ Liza Konnikova ◽ Wei Chen ◽ ...

Keyword(s): Cell Surface ◽ Single Cell ◽ Domain Knowledge ◽ Cell Types ◽ Surface Marker ◽ Supplementary Information ◽ Clustering Methods ◽ Cell Type ◽ Artificial Cell ◽ Marker Proteins

Abstract

Motivation: Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) couples the measurement of surface marker proteins with simultaneous sequencing of mRNA at the single-cell level, which brings accurate cell surface phenotyping to single-cell transcriptomics. Unfortunately, multiplets in CITE-seq datasets create artificial cell types (ACT) and complicate the automation of cell surface phenotyping.

Results: We propose CITE-sort, an artificial-cell-type aware surface marker clustering method for CITE-seq. CITE-sort is aware of, and robust to, multiplet-induced ACT. We benchmarked CITE-sort with real and simulated CITE-seq datasets and compared CITE-sort against canonical clustering methods. We show that CITE-sort produces the best clustering performance across the board. CITE-sort not only accurately identifies real biological cell types (BCT) but also consistently and reliably separates multiplet-induced artificial-cell-type droplet clusters from real BCT droplet clusters. In addition, CITE-sort organizes its clustering process with a binary tree, which facilitates easy interpretation and verification of its clustering result and simplifies cell-type annotation with domain knowledge in CITE-seq.

Availability and implementation: http://github.com/QiuyuLian/CITE-sort.

Supplementary information: Supplementary data are available at Bioinformatics online.
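The abstract above says that multiplets create artificial cell types. A minimal sketch of that idea follows; it is not the CITE-sort algorithm. The marker signatures, the `CELL_TYPES` table and both function names are hypothetical and chosen only for illustration. The sketch assumes that a droplet reads positive for a surface marker whenever any cell captured in it expresses that marker, so a doublet of two different cell types can show a marker combination that no single biological cell type has.

```python
from itertools import combinations

# Hypothetical marker signatures (True = marker expressed); not taken from the paper.
CELL_TYPES = {
    "CD4 T": {"CD3": True, "CD4": True, "CD8": False, "CD19": False},
    "CD8 T": {"CD3": True, "CD4": False, "CD8": True, "CD19": False},
    "B": {"CD3": False, "CD4": False, "CD8": False, "CD19": True},
}
MARKERS = ("CD3", "CD4", "CD8", "CD19")


def droplet_signature(cells):
    """Assumed rule: a droplet is positive for a marker if any captured cell expresses it."""
    return tuple(any(CELL_TYPES[cell][marker] for cell in cells) for marker in MARKERS)


def artificial_cell_types():
    """Return doublet signatures that match no single biological cell type."""
    singlet_signatures = {droplet_signature((name,)) for name in CELL_TYPES}
    artificial = {}
    for pair in combinations(CELL_TYPES, 2):
        signature = droplet_signature(pair)
        if signature not in singlet_signatures:
            artificial[pair] = signature
    return artificial


if __name__ == "__main__":
    for pair, signature in artificial_cell_types().items():
        positive = [m for m, is_on in zip(MARKERS, signature) if is_on]
        print(" + ".join(pair), "->", "+ ".join(positive) + "+")
```

In this toy setting every heterotypic doublet yields a signature absent from the singlet table, which is the kind of cluster a surface-marker clustering method would need to tell apart from real cell types.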


Artificial-Cell-Type Aware Cell Type Classification in CITE-seq

10.1101/2020.01.31.928010 ◽ 2020

Author(s): Qiuyu Lian ◽ Hongyi Xin ◽ Jianzhu Ma ◽ Liza Konnikova ◽ Wei Chen ◽ ...

Keyword(s): Cell Surface ◽ Single Cell ◽ Domain Knowledge ◽ Cell Types ◽ Surface Marker ◽ Biological Cell ◽ Clustering Methods ◽ Cell Type ◽ Artificial Cell ◽ Marker Proteins

Abstract

Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) couples the measurement of surface marker proteins with simultaneous sequencing of mRNA at the single-cell level, which brings accurate cell surface phenotyping to single-cell transcriptomics. Unfortunately, multiplets in CITE-seq datasets create artificial cell types and complicate the automation of cell surface phenotyping. We propose CITE-sort, an artificial-cell-type aware surface marker clustering method for CITE-seq. CITE-sort is aware of, and robust to, multiplet-induced artificial cell types. We benchmarked CITE-sort with real and simulated CITE-seq datasets and compared CITE-sort against canonical clustering methods. We show that CITE-sort produces the best clustering performance across the board. CITE-sort not only accurately identifies real biological cell types but also consistently and reliably separates multiplet-induced artificial-cell-type droplet clusters from real biological-cell-type droplet clusters. In addition, CITE-sort organizes its clustering process with a binary tree, which facilitates easy interpretation and verification of its clustering result and simplifies cell type annotation with domain knowledge in CITE-seq.


Ensemble learning for classifying single-cell data and projection across reference atlases

Bioinformatics ◽ 10.1093/bioinformatics/btaa137 ◽ 2020 ◽ Vol 36 (11) ◽ pp. 3585-3587

Author(s): Lin Wang ◽ Francisca Catalan ◽ Karin Shamardani ◽ Husam Babikir ◽ Aaron Diaz

Keyword(s): Single Cell ◽ Cell Types ◽ Status Quo ◽ Supplementary Information ◽ Published Data ◽ Supplementary Data ◽ Cell Type ◽ Low Sensitivity ◽ Project Data ◽ Cell Data

Abstract

Summary: Single-cell data are being generated at an accelerating pace. How best to project data across single-cell atlases is an open problem. We developed a boosted learner that overcomes the greatest challenge with status quo classifiers: low sensitivity, especially when dealing with rare cell types. By comparing novel and published data from distinct scRNA-seq modalities that were acquired from the same tissues, we show that this approach preserves cell-type labels when mapping across diverse platforms.

Availability and implementation: https://github.com/diazlab/ELSA

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.


SinNLRR: a robust subspace clustering method for cell type detection by non-negative and low-rank representation

Bioinformatics ◽ 10.1093/bioinformatics/btz139 ◽ 2019 ◽ Vol 35 (19) ◽ pp. 3642-3650 ◽ Cited By ~ 13

Author(s): Ruiqing Zheng ◽ Min Li ◽ Zhenlan Liang ◽ Fang-Xiang Wu ◽ Yi Pan ◽ ...

Keyword(s): Single Cell ◽ Low Rank ◽ Supplementary Information ◽ Similarity Matrix ◽ Similarity Learning ◽ Clustering Methods ◽ Cell Type ◽ Gene Markers ◽ Adaptive Penalty ◽ New Perspective

Abstract

Motivation: The development of single-cell RNA-sequencing (scRNA-seq) provides a new perspective for studying biological problems at the single-cell level. One of the key issues in scRNA-seq analysis is to resolve the heterogeneity and diversity of cells, that is, to cluster the cells into several groups. However, many existing clustering methods were designed to analyze bulk RNA-seq data, so new scRNA-seq clustering methods are urgently needed. Moreover, the high noise in scRNA-seq data poses many challenges for computational methods.

Results: In this study, we propose a novel scRNA-seq cell type detection method based on similarity learning, called SinNLRR. The method is motivated by the self-expression of cells within the same group. Specifically, we impose a non-negative and low-rank structure on the similarity matrix. We apply the alternating direction method of multipliers to solve the optimization problem and propose an adaptive penalty selection method to avoid sensitivity to the parameters. The learned similarity matrix can be combined with spectral clustering, t-distributed stochastic neighbor embedding for visualization and the Laplace score for prioritizing gene markers. Compared with other scRNA-seq clustering methods, our method achieves more robust and accurate results on different datasets.

Availability and implementation: Our MATLAB implementation of SinNLRR is available at https://github.com/zrq0123/SinNLRR.

Supplementary information: Supplementary data are available at Bioinformatics online.


Transcriptomes of major renal collecting-duct cell types in mouse identified by single-cell RNA-Seq

10.1101/183376 ◽ 2017

Author(s): Lihe Chen ◽ Jae Wook Lee ◽ Chung-Lin Chou ◽ Anilkumar Nair ◽ Maria Agustina Battistone ◽ ...

Keyword(s): Cell Surface ◽ Single Cell ◽ Collecting Duct ◽ Extracellular Fluid ◽ Cell Types ◽ Fluid Composition ◽ Intercalated Cells ◽ RNA Seq ◽ Cell Type ◽ Renal Collecting Duct

ABSTRACT

Prior RNA sequencing (RNA-Seq) studies have identified complete transcriptomes for most renal epithelial cell types. The exceptions are the cell types that make up the renal collecting duct, namely intercalated cells (ICs) and principal cells (PCs), which account for only a small fraction of the kidney mass but play critical physiological roles in the regulation of blood pressure, extracellular fluid volume and extracellular fluid composition. To enrich these cell types, we used fluorescence-activated cell sorting (FACS) that employed well-established lectin cell surface markers for PCs and type B ICs, as well as a newly identified cell surface marker for type A ICs, viz. c-Kit. Single-cell RNA-Seq using the IC- and PC-enriched populations as input enabled identification of complete transcriptomes of A-ICs, B-ICs and PCs. The data were used to create a freely accessible online gene-expression database for collecting duct cells. This database allowed identification of genes that are selectively expressed in each cell type, including cell-surface receptors, transcription factors, transporters and secreted proteins. The analysis also identified a small fraction of hybrid cells expressing both aquaporin-2 and either anion exchanger 1 or pendrin transcripts. In many cases, mRNAs for receptors and their ligands were identified in different cells (e.g. Notch2 chiefly in PCs vs Jag1 chiefly in ICs), suggesting signaling crosstalk among the three cell types. The identified patterns of gene expression among the three types of collecting duct cells provide a foundation for understanding physiological regulation and pathophysiology in the renal collecting duct.

SIGNIFICANCE STATEMENT

A long-term goal in mammalian biology is to identify the genes expressed in every cell type of the body. In kidney, the expressed genes (“transcriptome”) of all epithelial cell types have already been identified with the exception of the cells that make up the renal collecting duct, which is responsible for regulation of blood pressure and body fluid composition. Here, a technique called "single-cell RNA-Seq" was used in mouse to identify transcriptomes for the major collecting-duct cell types: type A intercalated cells, type B intercalated cells and principal cells. The information was used to create a publicly accessible online resource. The data allowed identification of genes that are selectively expressed in each cell type, which is informative for a cell-level understanding of physiology and pathophysiology.


ACTINN: automated identification of cell types in single cell RNA sequencing

Bioinformatics ◽ 10.1093/bioinformatics/btz592 ◽ 2019 ◽ Cited By ~ 7

Author(s): Feiyang Ma ◽ Matteo Pellegrini

Keyword(s): Neural Network ◽ Single Cell ◽ RNA Sequencing ◽ Immune Cell ◽ Cell Types ◽ Mouse Cell ◽ Supplementary Information ◽ Cell Type ◽ Human T Cell ◽ Single Cell RNA Sequencing

Abstract

Motivation: Cell type identification is one of the major goals in single cell RNA sequencing (scRNA-seq). Current methods for assigning cell types typically involve unsupervised clustering and the identification of signature genes in each cluster, followed by a manual lookup of these genes in the literature and databases to assign cell types. However, these approaches have several limitations, such as unwanted sources of variation that influence clustering and a lack of canonical markers for certain cell types. Here, we present ACTINN (Automated Cell Type Identification using Neural Networks), which employs a neural network with three hidden layers, trains on datasets with predefined cell types and predicts cell types for other datasets based on the trained parameters.

Results: We trained the neural network on a mouse cell type atlas (Tabula Muris Atlas) and a human immune cell dataset, and used it to predict cell types for mouse leukocytes, human PBMCs and human T cell subtypes. The results showed that our neural network is fast and accurate, and should therefore be a useful tool to complement existing scRNA-seq pipelines.

Availability and implementation: The code and datasets are available at https://figshare.com/articles/ACTINN/8967116. A tutorial is available at https://github.com/mafeiyang/ACTINN. All code is implemented in Python.

Supplementary information: Supplementary data are available at Bioinformatics online.


Sample demultiplexing, multiplet detection, experiment planning and novel cell type verification in single cell sequencing

10.1101/828483 ◽ 2019 ◽ Cited By ~ 2

Author(s): Hongyi Xin ◽ Qi Yan ◽ Yale Jiang ◽ Qiuyu Lian ◽ Jiadi Luo ◽ ...

Keyword(s): Single Cell ◽ State Of The Art ◽ Cell Types ◽ Gaussian Mixture ◽ Experiment Planning ◽ Cell Type ◽ Single Cell Sequencing ◽ Artificial Cell ◽ Real Cell ◽ Downstream Analysis

Abstract

Identifying and removing multiplets from downstream analysis is essential to improve the scalability and reliability of single cell RNA sequencing (scRNA-seq). High multiplet rates create artificial cell types in the dataset. Sample barcoding, including the cell hashing technology and the MULTI-seq technology, enables analytical identification of a fraction of multiplets in a scRNA-seq dataset.

We propose a Gaussian-mixture-model-based multiplet identification method, GMM-Demux. GMM-Demux accurately identifies and removes the sample-barcoding-detectable multiplets and estimates the percentage of sample-barcoding-undetectable multiplets in the remaining dataset. GMM-Demux describes the droplet formation process with an augmented binomial probabilistic model, and uses the model to authenticate cell types discovered from a scRNA-seq dataset.

We conducted two cell-hashing experiments, collected a public cell-hashing dataset, and generated a simulated cell-hashing dataset. We compared the classification result of GMM-Demux against a state-of-the-art heuristic-based classifier. We show that GMM-Demux is more accurate, more stable, reduces the error rate by up to 69×, and is capable of reliably recognizing 9 multiplet-induced fake cell types and 8 real cell types in a PBMC scRNA-seq dataset.
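To illustrate the Gaussian-mixture idea mentioned in the abstract above, here is a minimal sketch and not the GMM-Demux implementation. It assumes each sample barcode (hashtag) has one normalized value per droplet, fits a two-component one-dimensional Gaussian mixture per hashtag with plain EM, and then applies an assumed decision rule: a droplet positive for exactly one hashtag is assigned to that sample, and a droplet positive for two or more is called a multiplet. All function names, the initialization and the 0.5 posterior threshold are hypothetical choices for illustration.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_high(x, params):
    """Posterior probability that x belongs to the high (tagged) component."""
    means, stds, weights = params
    low = weights[0] * normal_pdf(x, means[0], stds[0])
    high = weights[1] * normal_pdf(x, means[1], stds[1])
    total = low + high
    return high / total if total > 0 else 0.5


def fit_two_gaussians(values, iterations=50):
    """Fit a 1-D two-component Gaussian mixture with EM (low = background, high = tagged)."""
    lo, hi = min(values), max(values)
    spread = max((hi - lo) / 4.0, 1e-3)
    params = ([lo, hi], [spread, spread], [0.5, 0.5])
    for _ in range(iterations):
        resp = [posterior_high(v, params) for v in values]  # E-step
        means, stds, weights = [], [], []
        for k in (0, 1):  # M-step, one component at a time
            r = [p if k == 1 else 1.0 - p for p in resp]
            n_k = max(sum(r), 1e-9)
            mean = sum(ri * v for ri, v in zip(r, values)) / n_k
            var = sum(ri * (v - mean) ** 2 for ri, v in zip(r, values)) / n_k
            means.append(mean)
            stds.append(max(math.sqrt(var), 1e-3))  # floor keeps the density finite
            weights.append(n_k / len(values))
        params = (means, stds, weights)
    return params


def classify_droplets(hashtag_values):
    """Assumed rule: 0 positive hashtags = negative, 1 = that sample, 2 or more = multiplet."""
    names = list(hashtag_values)
    fits = {name: fit_two_gaussians(hashtag_values[name]) for name in names}
    calls = []
    for i in range(len(hashtag_values[names[0]])):
        positive = [n for n in names if posterior_high(hashtag_values[n][i], fits[n]) > 0.5]
        if not positive:
            calls.append("negative")
        elif len(positive) == 1:
            calls.append(positive[0])
        else:
            calls.append("multiplet")
    return calls
```

A rule of this kind can only flag multiplets whose cells come from different samples. Multiplets formed within one sample carry a single barcode, which matches the abstract's distinction between sample-barcoding-detectable and sample-barcoding-undetectable multiplets.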


Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data

Nature Communications ◽ 10.1038/s41467-021-22008-3 ◽ 2021 ◽ Vol 12 (1)

Author(s): Tian Tian ◽ Jie Zhang ◽ Xiang Lin ◽ Zhi Wei ◽ Hakon Hakonarson

Keyword(s): Single Cell ◽ Domain Knowledge ◽ Ad Hoc ◽ A Priori ◽ Unsupervised Clustering ◽ RNA Seq ◽ Clustering Methods ◽ Cell Type ◽ Type Assignment ◽ Deep Embedding

Abstract

Clustering is a critical step in single cell-based studies. Most existing methods support unsupervised clustering without the a priori exploitation of any domain knowledge. When confronted by the high dimensionality and pervasive dropout events of scRNA-Seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters, which complicates cell type assignment. In such cases, the only recourse is for the user to manually and repeatedly tweak clustering parameters until acceptable clusters are found. Consequently, the path to obtaining biologically meaningful clusters can be ad hoc and laborious. Here we report a principled clustering method named scDCC that integrates domain knowledge into the clustering step. Experiments on various scRNA-seq datasets from thousands to tens of thousands of cells show that scDCC can significantly improve clustering performance, facilitating the interpretability of clusters and downstream analyses, such as cell type assignment.


scMRA: A robust deep learning method to annotate scRNA-seq data with multiple reference datasets

Bioinformatics ◽ 10.1093/bioinformatics/btab700 ◽ 2021

Author(s): Musu Yuan ◽ Liang Chen ◽ Minghua Deng

Keyword(s): Deep Learning ◽ Single Cell ◽ Reference Data ◽ Cell Types ◽ Cellular Heterogeneity ◽ Supplementary Information ◽ Batch Effects ◽ Cell Type ◽ Convolutional Network ◽ Multiple Reference

Abstract

Motivation: Single-cell RNA-seq (scRNA-seq) has been widely used to resolve cellular heterogeneity. After collecting scRNA-seq data, the natural next step is to integrate the accumulated data to achieve a common ontology of cell types and states. Thus, an effective and efficient cell-type identification method is urgently needed. Meanwhile, high-quality reference data remain a necessity for precise annotation. However, such tailored reference data are always lacking in practice. To address this, we aggregated multiple datasets into a meta-dataset on which annotation is conducted. Existing supervised or semi-supervised annotation methods suffer from batch effects caused by different sequencing platforms, and the severity of these effects increases with multiple reference datasets.

Results: Herein, a robust deep-learning-based single-cell Multiple Reference Annotator (scMRA) is introduced. In scMRA, a knowledge graph is constructed to represent the characteristics of cell types in different datasets, and a graph convolutional network (GCN) serves as a discriminator based on this graph. scMRA preserves intra-cell-type closeness and the relative position of cell types across datasets. scMRA is remarkably powerful at transferring knowledge from multiple reference datasets to the unlabeled target domain, thereby gaining an advantage over other state-of-the-art annotation methods in multi-reference data experiments. Furthermore, scMRA can remove batch effects. To the best of our knowledge, this is the first attempt to use multiple insufficient reference datasets to annotate target data, and it is, comparatively, the best annotation method for multiple scRNA-seq datasets.

Availability: An implementation of scMRA is available from https://github.com/ddb-qiwang/scMRA-torch

Supplementary information: Supplementary data are available at Bioinformatics online.


scTIM: seeking cell-type-indicative marker from single cell RNA-seq data by consensus optimization

Bioinformatics ◽ 10.1093/bioinformatics/btz936 ◽ 2019 ◽ Vol 36 (8) ◽ pp. 2474-2485 ◽ Cited By ~ 2

Author(s): Zhanying Feng ◽ Xianwen Ren ◽ Yuan Fang ◽ Yining Yin ◽ Chutian Huang ◽ ...

Keyword(s): Single Cell ◽ Large Scale ◽ Cell Types ◽ Mouse Cell ◽ Supplementary Information ◽ RNA Seq ◽ Cell Type ◽ Robust Solution ◽ Development Trajectory ◽ Consensus Optimization

Abstract

Motivation: Single cell RNA-seq data offer new resources and resolution for studying cell type identity and its conversion. However, data analyses are challenged by noise, sparsity and poor annotation at single cell resolution. Detecting cell-type-indicative markers is a promising way to help denoising, clustering and cell type annotation.

Results: We developed a new method, scTIM, to reveal cell-type-indicative markers. scTIM is based on a multi-objective optimization framework that simultaneously maximizes gene specificity by considering the gene–cell relationship, maximizes the genes' ability to reconstruct the cell–cell relationship and minimizes gene redundancy by considering the gene–gene relationship. Furthermore, consensus optimization is introduced to obtain a robust solution. Experimental results on three diverse single cell RNA-seq datasets show scTIM’s advantages in identifying cell types (clustering), annotating cell types and reconstructing cell development trajectories. Applying scTIM to the large-scale mouse cell atlas data identifies critical markers for 15 tissues as a ‘mouse cell marker atlas’, which allows us to investigate the identities of different tissues and subtle cell types within a tissue. scTIM will serve as a useful method for single cell RNA-seq data mining.

Availability and implementation: scTIM is freely available at https://github.com/Frank-Orwell/scTIM.

Supplementary information: Supplementary data are available at Bioinformatics online.


Straightforward clustering of single-cell RNA-Seq data with t-SNE and DBSCAN

10.1101/770388 ◽ 2019 ◽ Cited By ~ 3

Author(s): Florian Wagner

Keyword(s): Single Cell ◽ Mononuclear Cells ◽ Gene Selection ◽ Real Data ◽ Cell Types ◽ Fine Tuning ◽ RNA Seq ◽ Clustering Methods ◽ Cell Type ◽ Peripheral Blood Mononuclear

Abstract

Clustering of cells by cell type is arguably the most common and repetitive task encountered during the analysis of single-cell RNA-Seq data. However, as popular clustering methods operate largely independently of visualization techniques, the fine-tuning of clustering parameters can be unintuitive and time-consuming. Here, I propose Galapagos, a simple and effective clustering workflow based on t-SNE and DBSCAN that does not require a gene selection step. In practice, Galapagos only involves the fine-tuning of two parameters, which is straightforward, as clustering is performed directly on the t-SNE visualization results. Using peripheral blood mononuclear cells as a model tissue, I validate the effectiveness of Galapagos in different ways. First, I show that Galapagos generates clusters corresponding to all main cell types present. Then, I demonstrate that the t-SNE results are robust to parameter choices and initialization points. Next, I employ a simulation approach to show that clustering with Galapagos is accurate and robust to the high levels of technical noise present. Finally, to demonstrate Galapagos’ accuracy on real data, I compare clustering results to true cell type identities established using CITE-Seq data. In this context, I also provide an example of the primary limitation of Galapagos, namely the difficulty of resolving related cell types in cases where t-SNE fails to clearly separate the cells. Galapagos helps to make clustering scRNA-Seq data more intuitive and reproducible, and can be implemented in most programming languages with only a few lines of code.
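The clustering step described in the abstract above can be sketched as follows. This is a minimal, generic DBSCAN in plain Python and not the Galapagos code. It assumes the 2-D points already come from a t-SNE embedding computed elsewhere, and it assumes that the two parameters to tune correspond to DBSCAN's neighborhood radius (`eps`) and minimum neighbor count (`min_samples`); the abstract does not name them.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each 2-D point with a cluster index, or NOISE if it belongs to no cluster."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force radius query; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # core point: keep growing the cluster
                queue.extend(reachable)
    return labels


if __name__ == "__main__":
    # Stand-in for t-SNE coordinates: two tight groups and one isolated point.
    embedding = [(0, 0), (0, 1), (1, 0), (1, 1),
                 (10, 10), (10, 11), (11, 10), (11, 11),
                 (5, 20)]
    print(dbscan(embedding, eps=1.5, min_samples=3))
```

Because the clusters are computed on the same coordinates that are plotted, a change in `eps` or `min_samples` can be checked directly against the scatter plot, which is the intuition the abstract appeals to.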
