SMNN: Batch Effect Correction for Single-cell RNA-seq data via Supervised Mutual Nearest Neighbor Detection

10.1101/672261 ◽

2019 ◽

Cited By ~ 1

Author(s):

Yuchen Yang ◽

Gang Li ◽

Huijun Qian ◽

Kirk C. Wilhelmsen ◽

Yin Shen ◽

...

Keyword(s):

Single Cell ◽

Nearest Neighbor ◽

State Of The Art ◽

Nearest Neighbors ◽

Cell Types ◽

Batch Effect ◽

Batch Effects ◽

Cell Type ◽

Label Information ◽

Cell Type Specific

Abstract: Batch effect correction has been recognized to be indispensable when integrating single-cell RNA sequencing (scRNA-seq) data from multiple batches. State-of-the-art methods ignore single-cell cluster label information, but such information can improve the effectiveness of batch effect correction, particularly under realistic scenarios where biological differences are not orthogonal to batch effects. To address this issue, we propose SMNN for batch effect correction of scRNA-seq data via supervised mutual nearest neighbor detection. Our extensive evaluations in simulated and real datasets show that SMNN provides improved merging within the corresponding cell types across batches, leading to reduced differentiation across batches over MNN, Seurat v3, and LIGER. Furthermore, SMNN retains more cell type-specific features, partially manifested by differentially expressed genes identified between cell types after SMNN correction being biologically more relevant, with precision improving by up to 841%.

Key Points:

Batch effect correction has been recognized to be critical when integrating scRNA-seq data from multiple batches due to systematic differences in time points, generating laboratory and/or handling technician(s), experimental protocol, and/or sequencing platform.

Existing batch effect correction methods that leverage information from mutual nearest neighbors across batches (for example, implemented in SC3 or Seurat) ignore cell type information and suffer from potentially mismatching single cells from different cell types across batches, which would lead to undesired correction results, especially under the scenario where variation from batch effects is non-negligible compared with biological effects.

To address this critical issue, here we present SMNN, a supervised machine learning method that first takes cluster/cell-type label information from users or inferred from scRNA-seq clustering, and then searches mutual nearest neighbors within each cell type instead of global searching.

Our SMNN method shows clear advantages over three state-of-the-art batch effect correction methods and can better mix cells of the same cell type across batches and more effectively recover cell-type specific features, in both simulations and real datasets.
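
The central idea in this abstract, restricting the mutual nearest neighbor search to cells that share a label, can be illustrated with a small sketch. This is not the SMNN package's implementation: the function names, the use of Euclidean distance, and the default `k` are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def k_nearest(query, targets, k):
    """Return the indices of the k targets closest to `query` (Euclidean)."""
    order = sorted(range(len(targets)), key=lambda j: math.dist(query, targets[j]))
    return set(order[:k])


def mutual_nearest_pairs(batch_a, batch_b, k):
    """Pairs (i, j) where b[j] is among a[i]'s k nearest and vice versa."""
    a_to_b = [k_nearest(cell, batch_b, k) for cell in batch_a]
    b_to_a = [k_nearest(cell, batch_a, k) for cell in batch_b]
    return [(i, j) for i, nbrs in enumerate(a_to_b) for j in sorted(nbrs) if i in b_to_a[j]]


def supervised_mnn_pairs(batch_a, labels_a, batch_b, labels_b, k=1):
    """Run the mutual nearest neighbor search separately within each shared label."""
    idx_a, idx_b = defaultdict(list), defaultdict(list)
    for i, lab in enumerate(labels_a):
        idx_a[lab].append(i)
    for j, lab in enumerate(labels_b):
        idx_b[lab].append(j)

    pairs = []
    # Only labels present in both batches can contribute pairs.
    for lab in sorted(idx_a.keys() & idx_b.keys()):
        sub_a = [batch_a[i] for i in idx_a[lab]]
        sub_b = [batch_b[j] for j in idx_b[lab]]
        for i, j in mutual_nearest_pairs(sub_a, sub_b, k):
            # Map positions within the label subset back to batch-wide indices.
            pairs.append((idx_a[lab][i], idx_b[lab][j]))
    return pairs
```

Calling `mutual_nearest_pairs` on the full batches gives the unsupervised (global) search, which can pair cells of different types when the batch shift is large relative to the distance between cell types; `supervised_mnn_pairs` rules those pairs out by construction.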

SMNN: batch effect correction for single-cell RNA-seq data via supervised mutual nearest neighbor detection

Briefings in Bioinformatics ◽

10.1093/bib/bbaa097 ◽

2020 ◽

Cited By ~ 1

Author(s):

Yuchen Yang ◽

Gang Li ◽

Huijun Qian ◽

Kirk C Wilhelmsen ◽

Yin Shen ◽

...

Keyword(s):

Single Cell ◽

Nearest Neighbor ◽

State Of The Art ◽

Cell Types ◽

Batch Effect ◽

Rna Seq ◽

Cluster Label ◽

Label Information ◽

Cell Type Specific ◽

Biological Differences

JIND: Joint Integration and Discrimination for Automated Single-Cell Annotation

10.1101/2020.10.06.327601 ◽

2020 ◽

Author(s):

Mohit Goyal ◽

Guillermo Serrano ◽

Ilan Shomorony ◽

Mikel Hernaez ◽

Idoia Ochoa

Keyword(s):

Single Cell ◽

Cell Types ◽

Marker Genes ◽

Specific Marker ◽

Rna Seq ◽

Batch Effects ◽

Cell Type ◽

Latent Space ◽

Cell Type Specific ◽

Low Dimensional

Abstract: Single-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell-types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell-types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need to retrain the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable, being more than five or six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.
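
The rejection step mentioned in this abstract (cell-type-specific confidence thresholds) can be sketched as below. How JIND learns its thresholds is not described here, so the sketch simply takes them as given; the function name, the `"Unassigned"` label, and the example numbers are hypothetical and not part of the JIND API.

```python
def classify_with_rejection(probabilities, thresholds, reject_label="Unassigned"):
    """Return the most probable cell type, or `reject_label` if it is not confident enough.

    probabilities: dict mapping cell type -> predicted probability for one cell
    thresholds:    dict mapping cell type -> minimum probability required to accept
    """
    best_type = max(probabilities, key=probabilities.get)
    if probabilities[best_type] >= thresholds[best_type]:
        return best_type
    return reject_label


# Hypothetical thresholds: a stricter cut-off for a type that is easily confused.
example_thresholds = {"T cell": 0.6, "NK cell": 0.9}
confident_call = classify_with_rejection({"T cell": 0.7, "NK cell": 0.3}, example_thresholds)
rejected_call = classify_with_rejection({"T cell": 0.2, "NK cell": 0.8}, example_thresholds)
```

Because each cell type has its own threshold, a type that is easily confused with a neighbor can demand higher confidence than a well-separated one.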

Supervised Adversarial Alignment of Single-Cell RNA-seq Data

10.1101/2020.01.06.896621 ◽

2020 ◽

Author(s):

Songwei Ge ◽

Haohan Wang ◽

Amir Alavi ◽

Eric Xing ◽

Ziv Bar-Joseph

Keyword(s):

Single Cell ◽

Cell Types ◽

Rna Seq ◽

Batch Effects ◽

Cell Type ◽

Reduced Representation ◽

Type Assignment ◽

Cell Type Specific ◽

Reduced Dimension ◽

Adversarial Model

Abstract: Dimensionality reduction is an important first step in the analysis of single-cell RNA-seq (scRNA-seq) data. In addition to enabling the visualization of the profiled cells, such representations are used by many downstream analysis methods ranging from pseudo-time reconstruction to clustering to alignment of scRNA-seq data from different experiments, platforms, and labs. Both supervised and unsupervised methods have been proposed to reduce the dimension of scRNA-seq data. However, all methods to date are sensitive to batch effects. When batches correlate with cell types, as is often the case, their impact can lead to representations that are batch specific rather than cell type specific. To overcome this, we developed a domain adversarial neural network model for learning a reduced dimension representation of scRNA-seq data. The adversarial model tries to simultaneously optimize two objectives. The first is the accuracy of cell type assignment and the second is the inability to distinguish the batch (domain). We tested the method by using the resulting representation to align several different datasets. As we show, by overcoming batch effects our method was able to correctly separate cell types, improving on several prior methods suggested for this task. Analysis of the top features used by the network indicates that by taking the batch impact into account, the reduced representation is much better able to focus on key genes for each cell type.
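
In domain-adversarial training, two objectives of this kind are commonly combined into a single saddle-point problem. The formulation below is the generic one from the domain-adversarial literature and may differ from the exact loss used by the authors. All symbols are assumptions introduced here: $\theta_f$ for the encoder that produces the reduced representation, $\theta_y$ for the cell type classifier, $\theta_d$ for the batch discriminator, $L_y$ and $L_d$ for their respective classification losses, and $\lambda > 0$ for the trade-off weight.

```latex
E(\theta_f, \theta_y, \theta_d) = L_y(\theta_f, \theta_y) - \lambda \, L_d(\theta_f, \theta_d)

(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d),
\qquad
\hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
```

Minimizing $E$ over the encoder lowers the cell type loss while raising the batch loss, so the representation stays informative about cell type but uninformative about batch; the discriminator, maximizing $E$, keeps trying to recover the batch.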

BATMAN: fast and accurate integration of single-cell RNA-Seq datasets via minimum-weight matching

10.1101/2020.01.22.915629 ◽

2020 ◽

Author(s):

Igor Mandric ◽

Brian L. Hill ◽

Malika K. Freund ◽

Michael Thompson ◽

Eran Halperin

Keyword(s):

Single Cell ◽

Integration Method ◽

State Of The Art ◽

Minimum Weight ◽

Rna Seq ◽

Batch Effects ◽

Cell Type ◽

Multiple Datasets ◽

Efficient Integration ◽

Cell Type Specific

Abstract: Single-cell RNA-Sequencing (scRNA-Seq) is a set of technologies used to profile gene expression at the level of individual cells. Although the throughput of scRNA-Seq experiments is steadily growing in terms of the number of cells, large datasets are not yet commonly used due to prohibitively high costs. Integrating multiple datasets into one can improve power in scRNA-Seq experiments, and efficient integration is very important for downstream analyses such as identifying cell-type-specific eQTLs. State-of-the-art scRNA-Seq integration methods are based on the mutual nearest neighbors paradigm and fail to both correct for batch effects and maintain the local structure of the datasets. In this paper, we propose a novel scRNA-Seq dataset integration method called BATMAN (BATch integration via minimum-weight MAtchiNg). Across multiple simulations and real datasets, we show that our method significantly outperforms state-of-the-art tools with respect to existing metrics for batch effects by up to 80% while retaining cell-to-cell relationships. BATMAN is available at https://github.com/mandricigor/batman.
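
The abstract names minimum-weight matching as the basis of the method without describing the algorithm. As a reminder of what such a matching is, the sketch below pairs every cell in one batch with a distinct cell in another so that the total Euclidean distance is minimal. It is a brute-force illustration for tiny inputs, not BATMAN's algorithm; the function name and distance choice are assumptions.

```python
import itertools
import math


def min_weight_matching(batch_a, batch_b):
    """Match every point in batch_a to a distinct point in batch_b at minimum total distance.

    Exhaustive search over assignments, so only suitable for very small inputs.
    Returns (pairs, total_cost) where pairs is a list of (index_in_a, index_in_b).
    """
    if len(batch_a) > len(batch_b):
        raise ValueError("batch_a must not be larger than batch_b")

    best_cost, best_assignment = math.inf, ()
    for assignment in itertools.permutations(range(len(batch_b)), len(batch_a)):
        cost = sum(math.dist(batch_a[i], batch_b[j]) for i, j in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return list(enumerate(best_assignment)), best_cost
```

Unlike a mutual nearest neighbor search, a matching forces each cell to be used at most once, so a single cell cannot serve as the neighbor of many others. Real datasets would need a polynomial-time assignment solver in place of the exhaustive search.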

iSMNN: Batch Effect Correction for Single-cell RNA-seq data via Iterative Supervised Mutual Nearest Neighbor Refinement

10.1101/2020.11.09.375659 ◽

2020 ◽

Author(s):

Yuchen Yang ◽

Gang Li ◽

Yifang Xie ◽

Li Wang ◽

Yingxi Yang ◽

...

Keyword(s):

Single Cell ◽

Nearest Neighbor ◽

Biological Function ◽

State Of The Art ◽

Cell Types ◽

Batch Effect ◽

Iterative Refinement ◽

Rna Seq ◽

Medical Studies ◽

Cell Level

Abstract: Batch effect correction is an essential step in the integrative analysis of multiple single-cell RNA-seq (scRNA-seq) datasets. One state-of-the-art strategy for batch effect correction is via unsupervised or supervised detection of mutual nearest neighbors (MNNs). However, both kinds of methods detect MNNs across batches only on top of uncorrected data, where the large batch effect may affect the MNN search. To address this issue, we presented iSMNN, a batch effect correction approach via iterative supervised MNN refinement across data after correction. Our benchmarking on both simulation and real datasets showed the advantages of the iterative refinement of MNNs on the performance of correction. Compared to the popular methods MNNcorrect and Seurat v3, our iSMNN is able to better mix the cells of the same cell type across batches. In addition, iSMNN can also facilitate the identification of DEGs relevant to the biological function of certain cell types. These results indicated that iSMNN will be a valuable method for integrating multiple scRNA-seq datasets that can facilitate biological and medical studies at the single-cell level.

Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data

10.21203/rs.3.rs-151085/v1 ◽

2021 ◽

Author(s):

Yifan Zhao ◽

Huiyu Cai ◽

Zuobai Zhang ◽

Jian Tang ◽

Yue Li

Keyword(s):

Single Cell ◽

Topic Model ◽

Cell Types ◽

Gene Signature ◽

Batch Effects ◽

Cell Type ◽

Transcriptomic Data ◽

Variational Autoencoder ◽

Single Cell Rna Sequencing ◽

Cell Type Specific

Abstract: The advent of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies. However, integrative analysis of scRNA-seq data remains a challenge largely due to batch effects. We present single-cell Embedded Topic Model (scETM), an unsupervised deep generative model that recapitulates known cell types by inferring the latent cell topic mixtures via a variational autoencoder. scETM is scalable to over 10^6 cells and enables effective knowledge transfer across datasets. scETM also offers high interpretability and allows the incorporation of prior pathway knowledge into the gene embeddings. The scETM-inferred topics show enrichment in cell-type-specific and disease-related pathways.

SSBER: removing batch effect for single-cell RNA sequencing data

BMC Bioinformatics ◽

10.1186/s12859-021-04165-w ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yin Zhang ◽

Fei Wang

Keyword(s):

Single Cell ◽

Cell Types ◽

Batch Effect ◽

Batch Effects ◽

Cell Type ◽

Sequencing Data ◽

Cell Type Composition ◽

Type Composition ◽

Downstream Analysis ◽

Sequencing Platforms

Abstract:

Background: With the continued maturation of sequencing technology, different laboratories and different sequencing platforms have generated a large amount of single-cell transcriptome sequencing data for the same or different tissues. Due to batch effects and the high dimensionality of scRNA-seq data, downstream analysis often faces challenges. Although a number of algorithms and tools have been proposed for removing batch effects, the current mainstream algorithms overcorrect the data when the cell type composition varies greatly between batches.

Results: In this paper, we propose a novel method named SSBER, which uses biological prior knowledge to guide the correction, aiming to solve the problem of poor batch-effect correction when the cell type composition differs greatly between batches.

Conclusions: SSBER effectively solves the above problems and outperforms other algorithms when the cell type structure among batches or the distribution of cell populations varies considerably, or when some similar cell types exist across batches.

Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data

10.1101/2021.01.13.426593 ◽

2021 ◽

Author(s):

Yifan Zhao ◽

Huiyu Cai ◽

Zuobai Zhang ◽

Jian Tang ◽

Yue Li

Keyword(s):

Single Cell ◽

Topic Model ◽

Cell Types ◽

Gene Signature ◽

Batch Effects ◽

Cell Type ◽

Transcriptomic Data ◽

Variational Autoencoder ◽

Single Cell Rna Sequencing ◽

Cell Type Specific

Abstract: The advent of single-cell RNA sequencing (scRNA-seq) technologies has revolutionized transcriptomic studies. However, integrative analysis of scRNA-seq data remains a challenge largely due to batch effects. We present single-cell Embedded Topic Model (scETM), an unsupervised deep generative model that recapitulates known cell types by inferring the latent cell topic mixtures via a variational autoencoder. scETM is scalable to over 10^6 cells and enables effective knowledge transfer across datasets. scETM also offers high interpretability and allows the incorporation of prior pathway knowledge into the gene embeddings. The scETM-inferred topics show enrichment in cell-type-specific and disease-related pathways.

CellMixS: quantifying and visualizing batch effects in single-cell RNA-seq data

Life Science Alliance ◽

10.26508/lsa.202001004 ◽

2021 ◽

Vol 4 (6) ◽

pp. e202001004

Author(s):

Almut Lütge ◽

Joanna Zyprych-Walczak ◽

Urszula Brykczynska Kunzmann ◽

Helena L Crowell ◽

Daniela Calini ◽

...

Keyword(s):

Single Cell ◽

Cell Types ◽

Rna Seq ◽

Batch Effects ◽

Cell Type ◽

Cell Type Specificity ◽

Distance Distributions ◽

A Cell ◽

Cell Type Specific ◽

Synthetic Datasets

A key challenge in single-cell RNA-sequencing (scRNA-seq) data analysis is batch effects that can obscure the biological signal of interest. Although there are various tools and methods to correct for batch effects, their performance can vary. Therefore, it is important to understand how batch effects manifest in order to adjust for them. Here, we systematically explore batch effects across various scRNA-seq datasets according to magnitude, cell type specificity, and complexity. We developed a cell-specific mixing score (cms) that quantifies mixing of cells from multiple batches. By considering distance distributions, the score is able to detect local batch bias as well as differentiate between unbalanced batches and systematic differences between cells of the same cell type. We compare metrics in scRNA-seq data using real and synthetic datasets. Although these metrics target the same question and are used interchangeably, we find differences in scalability, sensitivity, and ability to handle differentially abundant cell types. We find that cell-specific metrics outperform cell type–specific and global metrics and recommend them for both method benchmarks and batch exploration.

CellO: Comprehensive and hierarchical cell type classification of human cells with the Cell Ontology

10.1101/634097 ◽

2019 ◽

Cited By ~ 1

Author(s):

Matthew N. Bernstein ◽

Zhongjie Ma ◽

Michael Gleicher ◽

Colin N. Dewey

Keyword(s):

Single Cell ◽

Web Application ◽

Cell Types ◽

Rna Seq ◽

Cell Type ◽

Training Set ◽

Sequence Read Archive ◽

Cell Ontology ◽

Cell Type Specific ◽

Type Classification

Summary: Cell type annotation is a fundamental task in the analysis of single-cell RNA-sequencing data. In this work, we present CellO, a machine learning-based tool for annotating human RNA-seq data with the Cell Ontology. CellO enables accurate and standardized cell type classification by considering the rich hierarchical structure of known cell types, a source of prior knowledge that is not utilized by existing methods. Furthermore, CellO comes pre-trained on a novel, comprehensive dataset of human, healthy, untreated primary samples in the Sequence Read Archive, which to the best of our knowledge, is the most diverse curated collection of primary cell data to date. CellO’s comprehensive training set enables it to run out-of-the-box on diverse cell types and achieves superior or competitive performance when compared to existing state-of-the-art methods. Lastly, CellO’s linear models are easily interpreted, thereby enabling exploration of cell type-specific expression signatures across the ontology. To this end, we also present the CellO Viewer: a web application for exploring CellO’s models across the ontology.

Highlights:

We present CellO, a tool for hierarchically classifying cell type from single-cell RNA-seq data against the graph-structured Cell Ontology.

CellO is pre-trained on a comprehensive dataset comprising nearly all bulk RNA-seq primary cell samples in the Sequence Read Archive.

CellO achieves superior or comparable performance with existing methods while featuring a more comprehensive pre-packaged training set.

CellO is built with easily interpretable models which we expose through a novel web application, the CellO Viewer, for exploring cell type-specific signatures across the Cell Ontology.
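
One consequence of classifying against a graph-structured ontology is that a cell assigned to a term is implicitly assigned to every ancestor of that term. The sketch below shows that propagation on a small hand-written fragment. It is not CellO's API or its classification algorithm; the function names and the toy parent map are hypothetical, and the fragment is not loaded from the Cell Ontology.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_labels(predicted_terms, parents):
    """Close a set of predicted terms under the ancestor relation."""
    closed = set(predicted_terms)
    for term in predicted_terms:
        closed |= ancestors(term, parents)
    return closed


# Toy fragment for illustration only: child term -> list of parent terms.
TOY_PARENTS = {
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": ["cell"],
}

t_cell_labels = propagate_labels({"T cell"}, TOY_PARENTS)
```

Working with ancestor-closed label sets lets a classifier fall back to a broader term, such as "lymphocyte", when the evidence does not support a specific one.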
