scVAE: Variational auto-encoders for single-cell gene expression data

Mapping Intimacies ◽

10.1101/318295 ◽

2018 ◽

Cited By ~ 27

Author(s):

Christopher Heje Grønbech ◽

Maximillian Fornitz Vording ◽

Pascal Timshel ◽

Casper Kaae Sønderby ◽

Tune Hannes Pers ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Count Data ◽

Model Comparison ◽

A Priori ◽

Cell Types ◽

Generative Models ◽

Data Sets ◽

Likelihood Functions ◽

Cell Gene Expression

Abstract: Motivation. Models for analysing and making relevant biological inferences from massive amounts of complex single-cell transcriptomic data typically require several individual data-processing steps, each with their own set of hyperparameter choices. With deep generative models one can work directly with count data, make likelihood-based model comparison, learn a latent representation of the cells and capture more of the variability in different cell populations. Results. We propose a novel method based on variational auto-encoders (VAEs) for analysis of single-cell RNA sequencing (scRNA-seq) data. It avoids data preprocessing by using raw count data as input and can robustly estimate the expected gene expression levels and a latent representation for each cell. We tested several count likelihood functions and a variant of the VAE that has a priori clustering in the latent space. We show for several scRNA-seq data sets that our method outperforms recently proposed scRNA-seq methods in clustering cells and that the resulting clusters reflect cell types. Availability and implementation. Our method, called scVAE, is implemented in Python using the TensorFlow machine-learning library, and it is freely available at https://github.com/scvae/scvae.


scVAE: variational auto-encoders for single-cell gene expression data

Bioinformatics ◽

10.1093/bioinformatics/btaa293 ◽

2020 ◽

Vol 36 (16) ◽

pp. 4415-4422 ◽

Cited By ~ 10

Author(s):

Christopher Heje Grønbech ◽

Maximillian Fornitz Vording ◽

Pascal N Timshel ◽

Casper Kaae Sønderby ◽

Tune H Pers ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Count Data ◽

Model Comparison ◽

A Priori ◽

Cell Types ◽

Generative Models ◽

Supplementary Information ◽

Likelihood Functions ◽

Cell Gene Expression

Abstract: Motivation. Models for analysing and making relevant biological inferences from massive amounts of complex single-cell transcriptomic data typically require several individual data-processing steps, each with their own set of hyperparameter choices. With deep generative models one can work directly with count data, make likelihood-based model comparison, learn a latent representation of the cells and capture more of the variability in different cell populations. Results. We propose a novel method based on variational auto-encoders (VAEs) for analysis of single-cell RNA sequencing (scRNA-seq) data. It avoids data preprocessing by using raw count data as input and can robustly estimate the expected gene expression levels and a latent representation for each cell. We tested several count likelihood functions and a variant of the VAE that has a priori clustering in the latent space. We show for several scRNA-seq datasets that our method outperforms recently proposed scRNA-seq methods in clustering cells and that the resulting clusters reflect cell types. Availability and implementation. Our method, called scVAE, is implemented in Python using the TensorFlow machine-learning library, and it is freely available at https://github.com/scvae/scvae. Supplementary information. Supplementary data are available at Bioinformatics online.
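To illustrate what "likelihood-based model comparison" on raw counts can look like, here is a minimal sketch that scores the same counts under a Poisson and a negative binomial likelihood. It is not taken from the scVAE code base: the function names, the toy counts and the dispersion value are all assumptions made for illustration, and scVAE itself learns the likelihood parameters with a neural network rather than fixing them by hand.

```python
import math

def poisson_logpmf(k, rate):
    """log P(K = k) for a Poisson distribution with mean `rate`."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def negative_binomial_logpmf(k, mean, dispersion):
    """log P(K = k) for a negative binomial parameterised by its mean and an
    inverse-dispersion parameter (variance = mean + mean**2 / dispersion)."""
    r = dispersion
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mean))
        + k * math.log(mean / (r + mean))
    )

def total_log_likelihood(counts, logpmf, **params):
    """Sum of per-observation log-likelihoods under one model."""
    return sum(logpmf(k, **params) for k in counts)

# Hypothetical, overdispersed raw counts for one gene across ten cells.
counts = [0, 0, 1, 0, 7, 0, 2, 0, 12, 0]
mean = sum(counts) / len(counts)

poisson_ll = total_log_likelihood(counts, poisson_logpmf, rate=mean)
# The dispersion value 0.3 is an arbitrary illustrative choice, not a fitted one.
nb_ll = total_log_likelihood(
    counts, negative_binomial_logpmf, mean=mean, dispersion=0.3
)

print(f"Poisson log-likelihood:           {poisson_ll:.2f}")
print(f"Negative binomial log-likelihood: {nb_ll:.2f}")
```

On these toy counts the negative binomial scores higher because it can absorb the excess zeros and the occasional large count. A real comparison would fit both models and evaluate them on held-out cells.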


bigSCale: An Analytical Framework for Big-Scale Single-Cell Data

10.1101/197244 ◽

2017 ◽

Cited By ~ 4

Author(s):

Giovanni Iacono ◽

Elisabetta Mereu ◽

Amy Guillaumet-Adkins ◽

Roser Corominas ◽

Ivon Cuscó ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Single Cells ◽

A Priori ◽

Large Data ◽

Cell Types ◽

Analytical Framework ◽

Marker Genes ◽

Data Sets ◽

Neuronal Progenitor

Abstract: Single-cell RNA sequencing has significantly deepened our insights into complex tissues, and the latest techniques are capable of processing tens of thousands of cells simultaneously. With bigSCale, we provide an analytical framework that scales to millions of cells, addressing the challenges of future large datasets. Unlike previous methods, bigSCale does not constrain data to fit an a priori-defined distribution and instead uses an accurate numerical model of noise. We evaluated the performance of bigSCale using a biological model of aberrant gene expression in patient-derived neuronal progenitor cells and simulated datasets, which underlined its speed and accuracy in differential expression analysis. We further applied bigSCale to analyze 1.3 million cells from the mouse developing forebrain. Herein, we identified rare populations, such as Reelin-positive Cajal-Retzius neurons, for which we determined a previously unrecognized heterogeneity associated with distinct differentiation stages, spatial organization and cellular function. Together, bigSCale presents a perfect solution to address future challenges of large single-cell datasets.
Extended Abstract: Single-cell RNA sequencing (scRNAseq) has significantly deepened our insights into complex tissues by providing high-resolution phenotypes for individual cells. Recent microfluidic-based methods are scalable to tens of thousands of cells, enabling unbiased sampling and comprehensive characterization without prior knowledge. Increasing cell numbers, however, generate extremely large datasets, which extend processing time and challenge computing resources. Current scRNAseq analysis tools are not designed to analyze datasets larger than thousands of cells and often lack the sensitivity and specificity to identify marker genes for cell populations or experimental conditions.
With bigSCale, we provide an analytical framework for the sensitive detection of population markers and differentially expressed genes that scales to millions of single cells. Unlike other methods that use simple or mixture probabilistic models with negative binomial, gamma or Poisson distributions to handle the noise and sparsity of scRNAseq data, bigSCale does not constrain the data to fit an a priori-defined distribution. Instead, bigSCale uses large sample sizes to estimate a highly accurate and comprehensive numerical model of noise and gene expression. The framework further includes modules for differential expression (DE) analysis, cell clustering and population marker identification. Moreover, a directed convolution strategy allows processing of extremely large data sets while preserving the transcript information from individual cells.
We evaluate the performance of bigSCale using a biological model for reduced or elevated gene expression levels. Specifically, we perform scRNAseq of 1,920 patient-derived neuronal progenitor cells from Williams-Beuren and 7q11.23 microduplication syndrome patients, harboring a deletion or duplication of 7q11.23, respectively. The affected region contains 28 genes whose transcriptional levels vary in line with their allele frequency. BigSCale detects expression changes with respect to cells from a healthy donor and outperforms other methods for single-cell DE analysis in sensitivity. Simulated data sets underline the performance of bigSCale in DE analysis, as it is faster, more sensitive and more specific than other methods. The probabilistic model of cell distances within bigSCale is further suitable for unsupervised clustering and the identification of cell types and subpopulations. Using bigSCale, we identify all major cell types of the somatosensory cortex and hippocampus by analyzing 3,005 cells from adult mouse brains.
Remarkably, we increase the number of cell population-specific marker genes 4- to 6-fold compared to the original analysis and, moreover, define markers of higher-order cell types. These include CD90 (Thy1), a neuronal surface receptor, potentially suitable for isolating intact neurons from complex brain samples.
To test its applicability to large data sets, we apply bigSCale to scRNAseq data from 1.3 million cells derived from the pallium of the mouse developing forebrain (E18, 10x Genomics). Our directed down-sampling strategy accumulates transcript counts from cells with similar transcriptional profiles into index cell transcriptomes, thereby defining cellular clusters with improved resolution. Accordingly, index cell clusters provide a rich resource of marker genes for the main brain cell types and less frequent subpopulations. Our analysis of rare populations includes poorly characterized developmental cell types, such as neuron progenitors from the subventricular zone and neocortical Reelin-positive neurons known as Cajal-Retzius (CR) cells. The latter represent a transient population which regulates the laminar formation of the developing neocortex and whose malfunctioning causes major neurodevelopmental disorders like autism or schizophrenia. Most importantly, index cell clusters can be deconvoluted to the individual cell level for targeted analysis of populations of interest. Through decomposition of Reelin-positive neurons, we determined a previously unrecognized heterogeneity among CR cells, which we could associate with distinct differentiation stages as well as spatial and functional differences in the developing mouse brain. Specifically, subtypes of CR cells identified by bigSCale express different compositions of NMDA, AMPA and glycine receptor subunits, pointing to subpopulations with distinct membrane properties.
Furthermore, we found Cxcl12, a chemokine secreted by the meninges and regulating the tangential migration of CR cells, to be also expressed in CR cells located in the marginal zone of the neocortex, indicating a self-regulated migration capacity.
Together, bigSCale presents a perfect solution for the processing and analysis of scRNAseq data from millions of single cells. Its speed and sensitivity make it suitable to address the future challenges of large single-cell data sets.


GiniClust: detecting rare cell types from single-cell gene expression data with Gini index

Genome Biology ◽

10.1186/s13059-016-1010-4 ◽

2016 ◽

Vol 17 (1) ◽

Cited By ~ 126

Author(s):

Lan Jiang ◽

Huidong Chen ◽

Luca Pinello ◽

Guo-Cheng Yuan

Keyword(s):

Gene Expression ◽

Single Cell ◽

Gene Expression Data ◽

Gini Index ◽

Cell Types ◽

Expression Data ◽

Cell Gene Expression ◽

Cell Gene


Enhancing droplet-based single-nucleus RNA-seq resolution using the semi-supervised machine learning classifier DIEM

10.1101/786285 ◽

2019 ◽

Cited By ~ 4

Author(s):

Marcus Alvarez ◽

Elior Rahmani ◽

Brandon Jew ◽

Kristina M. Garske ◽

Zong Miao ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Supervised Machine Learning ◽

Data Sets ◽

Rna Seq ◽

Novel Approach ◽

Single Nucleus ◽

Downstream Analysis

Abstract: Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. Contrary to single-cell RNA-seq (scRNA-seq), we observe that snRNA-seq is commonly subject to contamination by high amounts of extranuclear background RNA, which can lead to identification of spurious cell types in downstream clustering analyses if overlooked. We present a novel approach to remove debris-contaminated droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: 1) human differentiating preadipocytes in vitro, 2) fresh mouse brain tissue, and 3) human frozen adipose tissue (AT) from six individuals. All three data sets showed various degrees of extranuclear RNA contamination. We observed that existing methods failed to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq data, we also successfully applied DIEM to single-cell data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data fast and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.
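The expectation-maximization idea mentioned in the abstract can be sketched on a much simpler stand-in problem. The example below fits a two-component Poisson mixture to per-droplet total counts, with one component playing the role of debris and the other of nuclei. This is a deliberate simplification and not DIEM's actual model, which is fitted to gene expression distributions. The function name `fit_two_component_poisson` and the toy counts are hypothetical.

```python
import math

def poisson_logpmf(k, rate):
    """log P(K = k) for a Poisson distribution with mean `rate`."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def fit_two_component_poisson(counts, n_iter=50):
    """Fit a two-component Poisson mixture with EM (illustrative sketch)."""
    # Deterministic initialisation: one low-rate and one high-rate component.
    rates = [min(counts) + 0.5, max(counts) + 0.5]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each droplet,
        # computed in log space for numerical stability.
        resp = []
        for k in counts:
            log_joint = [
                math.log(w) + poisson_logpmf(k, r)
                for w, r in zip(weights, rates)
            ]
            top = max(log_joint)
            unnorm = [math.exp(v - top) for v in log_joint]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate mixing weights and rates from responsibilities.
        for j in range(2):
            mass = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = mass / len(counts)
            weighted_sum = sum(r[j] * k for r, k in zip(resp, counts))
            rates[j] = max(weighted_sum / mass, 1e-12)
    return weights, rates, resp

# Hypothetical total counts per droplet: six debris-like, five nucleus-like.
umi_totals = [3, 5, 4, 6, 2, 5, 48, 52, 50, 55, 47]
weights, rates, resp = fit_two_component_poisson(umi_totals)

# Component 0 is the low-count ("debris") component in this sketch.
debris_droplets = [i for i, r in enumerate(resp) if r[0] > 0.5]
print("estimated rates:", rates)
print("droplets flagged as debris:", debris_droplets)
```

Droplets whose posterior mass falls mostly on the low-rate component are flagged for removal. Filtering on a posterior probability instead of a hard count threshold is the part of this sketch that carries over to likelihood-based debris removal in general.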


Inferring Relevant Cell Types for Complex Traits by Using Single-Cell Gene Expression

The American Journal of Human Genetics ◽

10.1016/j.ajhg.2017.09.009 ◽

2017 ◽

Vol 101 (5) ◽

pp. 686-699 ◽

Cited By ~ 45

Author(s):

Diego Calderon ◽

Anand Bhaskar ◽

David A. Knowles ◽

David Golan ◽

Towfique Raj ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Complex Traits ◽

Cell Types ◽

Cell Gene Expression ◽

Cell Gene


Probabilistic Harmonization and Annotation of Single-cell Transcriptomics Data with Deep Generative Models

10.1101/532895 ◽

2019 ◽

Cited By ~ 14

Author(s):

Chenling Xu ◽

Romain Lopez ◽

Edouard Mehlman ◽

Jeffrey Regier ◽

Michael I. Jordan ◽

...

Keyword(s):

Single Cell ◽

Probabilistic Approach ◽

Cell Types ◽

Generative Models ◽

Marker Genes ◽

Data Sets ◽

Data Set ◽

Cell State ◽

Transcriptomics Data ◽

Single Data

Abstract: As single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward to compare gene expression levels across data sets or to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations, for instance when only one data set in a cohort is annotated or when only a few cells in a single data set can be labeled using marker genes. We demonstrate that scVI and scANVI compare favorably to the existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that, unlike existing methods, scVI and scANVI represent the integrated datasets with a single generative model that can be directly used for any probabilistic decision-making task, using differential expression as our case study. scVI and scANVI are available as open source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.


In situ electro-sequencing in three-dimensional tissues

10.1101/2021.04.22.440941 ◽

2021 ◽

Author(s):

Qiang Li ◽

Zuwan Lin ◽

Ren Liu ◽

Xin Tang ◽

Jiahao Huang ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Three Dimensional ◽

Cell Types ◽

Developmental Trajectory ◽

Single Cell Level ◽

Cell Level ◽

Cell Gene Expression ◽

Cell Gene

Abstract: Pairwise mapping of single-cell gene expression and electrophysiology in intact three-dimensional (3D) tissues is crucial for studying electrogenic organs (e.g., brain and heart) [1–5]. Here, we introduce in situ electro-sequencing (electro-seq), combining soft bioelectronics with in situ RNA sequencing to stably map millisecond-timescale cellular electrophysiology and simultaneously profile a large number of genes at single-cell level across 3D tissues. We applied in situ electro-seq to 3D human induced pluripotent stem cell-derived cardiomyocyte (hiPSC-CM) patches, precisely registering the CM gene expression with electrophysiology at single-cell level, enabling multimodal in situ analysis. Such multimodal data integration substantially improved the dissection of cell types and the reconstruction of developmental trajectory from spatially heterogeneous tissues. Using machine learning (ML)-based cross-modal analysis, in situ electro-seq identified the gene-to-electrophysiology relationship over the time course of cardiac maturation. Further leveraging such a relationship to train a coupled autoencoder, we demonstrated the prediction of single-cell gene expression profile evolution using long-term electrical measurement from the same cardiac patch or 3D millimeter-scale cardiac organoids. As exemplified by cardiac tissue maturation, in situ electro-seq will be broadly applicable to create spatiotemporal multimodal maps and predictive models in electrogenic organs, allowing discovery of cell types and gene programs responsible for electrophysiological function and dysfunction.


Sampling from Disentangled Representations of Single-Cell Data Using Generative Adversarial Networks

10.1101/2021.01.15.426872 ◽

2021 ◽

Author(s):

Hengshi Yu ◽

Joshua D. Welch

Keyword(s):

Gene Expression ◽

Single Cell ◽

Gene Expression Data ◽

Generative Models ◽

Generative Adversarial Networks ◽

Expression Data ◽

Gene Expression Response ◽

Adversarial Networks ◽

Cell Gene Expression ◽

Cell Gene

Abstract: Deep generative models, including variational autoencoders (VAEs) and generative adversarial networks (GANs), have achieved remarkable successes in generating and manipulating high-dimensional images. VAEs excel at learning disentangled image representations, while GANs excel at generating realistic images. Here, we systematically assess disentanglement and generation performance on single-cell gene expression data and find that these strengths and weaknesses of VAEs and GANs apply to single-cell gene expression data in a similar way. We also develop MichiGAN, a novel neural network that combines the strengths of VAEs and GANs to sample from disentangled representations without sacrificing data generation quality. We learn disentangled representations of two large single-cell RNA-seq datasets [13, 68] and use MichiGAN to sample from these representations. MichiGAN allows us to manipulate semantically distinct aspects of cellular identity and predict single-cell gene expression response to drug treatment.


Publisher Correction: scDesign2: a transparent simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured

Genome Biology ◽

10.1186/s13059-021-02394-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Tianyi Sun ◽

Dongyuan Song ◽

Wei Vivian Li ◽

Jingyi Jessica Li

Keyword(s):

Gene Expression ◽

Single Cell ◽

Count Data ◽

High Fidelity ◽

Cell Gene Expression ◽

Cell Gene


MichiGAN: sampling from disentangled representations of single-cell data using generative adversarial networks

Genome Biology ◽

10.1186/s13059-021-02373-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Hengshi Yu ◽

Joshua D. Welch

Keyword(s):

Gene Expression ◽

Single Cell ◽

Generative Models ◽

Generative Adversarial Networks ◽

Data Generation ◽

Gene Expression Response ◽

Adversarial Networks ◽

Cell Gene Expression ◽

Expression Response ◽

Cell Gene

Abstract: Deep generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs) generate and manipulate high-dimensional images. We systematically assess the complementary strengths and weaknesses of these models on single-cell gene expression data. We also develop MichiGAN, a novel neural network that combines the strengths of VAEs and GANs to sample from disentangled representations without sacrificing data generation quality. We learn disentangled representations of three large single-cell RNA-seq datasets and use MichiGAN to sample from these representations. MichiGAN allows us to manipulate semantically distinct aspects of cellular identity and predict single-cell gene expression response to drug treatment.
