scholarly journals Single cell RNA-seq denoising using a deep count autoencoder

2018 ◽  
Author(s):  
Gökcen Eraslan ◽  
Lukas M. Simon ◽  
Maria Mircea ◽  
Nikola S. Mueller ◽  
Fabian J. Theis

AbstractSingle-cell RNA sequencing (scRNA-seq) has enabled researchers to study gene expression at a cellular resolution. However, noise due to amplification and dropout may obstruct analyses, so scalable denoising methods for increasingly large but sparse scRNAseq data are needed. We propose a deep count autoencoder network (DCA) to denoise scRNA-seq datasets. DCA takes the count distribution, overdispersion and sparsity of the data into account using a zero-inflated negative binomial noise model, and nonlinear gene-gene or gene-dispersion interactions are captured. Our method scales linearly with the number of cells and can therefore be applied to datasets of millions of cells. We demonstrate that DCA denoising improves a diverse set of typical scRNA-seq data analyses using simulated and real datasets. DCA outperforms existing methods for data imputation in quality and speed, enhancing biological discovery.

2017 ◽  
Author(s):  
Zhun Miao ◽  
Ke Deng ◽  
Xiaowo Wang ◽  
Xuegong Zhang

AbstractSummaryThe excessive amount of zeros in single-cell RNA-seq data include “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package DEsingle which employed Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy.Availability and ImplementationThe R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration [email protected] informationSupplementary data are available at bioRxiv online.


2019 ◽  
Vol 116 (20) ◽  
pp. 9775-9784 ◽  
Author(s):  
Yingxin Lin ◽  
Shila Ghazanfar ◽  
Kevin Y. X. Wang ◽  
Johann A. Gagnon-Bartsch ◽  
Kitty K. Lo ◽  
...  

Concerted examination of multiple collections of single-cell RNA sequencing (RNA-seq) data promises further biological insights that cannot be uncovered with individual datasets. Here we present scMerge, an algorithm that integrates multiple single-cell RNA-seq datasets using factor analysis of stably expressed genes and pseudoreplicates across datasets. Using a large collection of public datasets, we benchmark scMerge against published methods and demonstrate that it consistently provides improved cell type separation by removing unwanted factors; scMerge can also enhance biological discovery through robust data integration, which we show through the inference of development trajectory in a liver dataset collection.


2020 ◽  
Author(s):  
Julia Eve Olivieri ◽  
Roozbeh Dehghannasiri ◽  
Julia Salzman

AbstractTo date, the field of single-cell genomics has viewed robust splicing analysis as completely out of reach in droplet-based platforms, preventing biological discovery of single-cell regulated splicing. Here, we introduce a novel, robust, and computationally efficient statistical method, the Splicing Z Score (SZS), to detect differential alternative splicing in single cell RNA-Seq technologies including 10x Chromium. We applied the SZS to primary human cells to discover new regulated, cell type-specific splicing patterns. Illustrating the power of the SZS method, splicing of a small set of genes has high predictive power for tissue compartment in the human lung, and the SZS identifies un-annotated, conserved splicing regulation in the human spermatogenesis. The SZS is a method that can rapidly identify regulated splicing events from single cell data and prioritize genes predicted to have functionally significant splicing programs.


2016 ◽  
Author(s):  
Vasilis Ntranos ◽  
Govinda M. Kamath ◽  
Jesse Zhang ◽  
Lior Pachter ◽  
David N. Tse

Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling which limit their scope and generality. We propose a novel method that departs from standard analysis pipelines, comparing and clustering cells based not on their transcript or gene quantifications but on their transcript-compatibility read counts. In re-analysis of two landmark yet disparate single-cell RNA-Seq datasets, we show that our method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays.


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Xin Wang ◽  
Jane Frederick ◽  
Hongbin Wang ◽  
Sheng Hui ◽  
Vadim Backman ◽  
...  

Abstract The transcriptional plasticity of cancer cells promotes intercellular heterogeneity in response to anticancer drugs and facilitates the generation of subpopulation surviving cells. Characterizing single-cell transcriptional heterogeneity after drug treatments can provide mechanistic insights into drug efficacy. Here, we used single-cell RNA-seq to examine transcriptomic profiles of cancer cells treated with paclitaxel, celecoxib and the combination of the two drugs. By normalizing the expression of endogenous genes to spike-in molecules, we found that cellular mRNA abundance shows dynamic regulation after drug treatment. Using a random forest model, we identified gene signatures classifying single cells into three states: transcriptional repression, amplification and control-like. Treatment with paclitaxel or celecoxib alone generally repressed gene transcription across single cells. Interestingly, the drug combination resulted in transcriptional amplification and hyperactivation of mitochondrial oxidative phosphorylation pathway linking to enhanced cell killing efficiency. Finally, we identified a regulatory module enriched with metabolism and inflammation-related genes activated in a subpopulation of paclitaxel-treated cells, the expression of which predicted paclitaxel efficacy across cancer cell lines and in vivo patient samples. Our study highlights the dynamic global transcriptional activity driving single-cell heterogeneity during drug response and emphasizes the importance of adding spike-in molecules to study gene expression regulation using single-cell RNA-seq.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Jan Lause ◽  
Philipp Berens ◽  
Dmitry Kobak

Abstract Background Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing. Results We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth. Conclusions We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.


2019 ◽  
Author(s):  
Christoph Hafemeister ◽  
Rahul Satija

AbstractSingle-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from ’regularized negative binomial regression’, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation, and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R packagesctransform, with a direct interface to our single-cell toolkitSeurat.


2018 ◽  
Author(s):  
Nikolaos Papadopoulos ◽  
R. Gonzalo Parra ◽  
Johannes Söding

BackgroundSingle-cell RNA sequencing (scRNA-seq) is an enabling technology for the study of cellular differentiation and heterogeneity. From snapshots of the transcriptomic profiles of differentiating single cells, the cellular lineage tree that leads from a progenitor population to multiple types of differentiated cells can be derived. The underlying lineage trees of most published datasets are linear or have a single branchpoint, but many studies with more complex lineage trees will soon become available. To test and further develop tools for lineage tree reconstruction, we need test datasets with known trees.ResultsPROSSTT can simulate scRNA-seq datasets for differentiation processes with lineage trees of any desired complexity, noise level, noise model, and size. PROSSTT also provides scripts to quantify the quality of predicted lineage trees.Availabilityhttps://github.com/soedinglab/[email protected]


2020 ◽  
Author(s):  
Yuansong Zeng ◽  
Xiang Zhou ◽  
Jiahua Rao ◽  
Yutong Lu ◽  
Yuedong Yang

AbstractRecent advances in single-cell RNA sequencing (scRNA-seq) technologies provide a great opportunity to study gene expression at cellular resolution, and the scRNA-seq data has been routinely conducted to unfold cell heterogeneity and diversity. A critical step for the scRNA-seq analyses is to cluster the same type of cells, and many methods have been developed for cell clustering. However, existing clustering methods are limited to extract the representations from expression data of individual cells, while ignoring the high-order structural relations between cells. Here, we proposed a new method (GraphSCC) to cluster cells based on scRNA-seq data by accounting structural relations between cells through a graph convolutional network. The representation learned from the graph convolutional network, together with another representation output from a denoising autoencoder network, are optimized by a dual self-supervised module for better cell clustering. Extensive experiments indicate that GraphSCC model outperforms state-of-the-art methods in various evaluation metrics on both simulated and real datasets. Further visualizations show that GraphSCC provides representations for better intra-cluster compactness and inter-cluster separability.


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Christoph Hafemeister ◽  
Rahul Satija

AbstractSingle-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package , with a direct interface to our single-cell toolkit .


Sign in / Sign up

Export Citation Format

Share Document