Molecular Cross-Validation for Single-Cell RNA-seq

Mapping Intimacies ◽

10.1101/786269 ◽

2019 ◽

Cited By ~ 7

Author(s):

Joshua Batson ◽

Loïc Royer ◽

James Webber

Keyword(s):

Single Cell ◽

Cross Validation ◽

Individual Cell ◽

Principal Component ◽

Ground Truth ◽

Rna Seq ◽

Optimal Parameters ◽

Denoising Method ◽

Data Driven Approach ◽

Cell Data

Single-cell RNA sequencing enables researchers to study the gene expression of individual cells. However, in high-throughput methods the portrait of each individual cell is noisy, representing only thousands of the hundreds of thousands of mRNA molecules originally present. While many methods for denoising single-cell data have been proposed, a principled procedure for selecting and calibrating the best method for a given dataset has been lacking. We present “molecular cross-validation,” a statistically principled and data-driven approach for estimating the accuracy of any denoising method without the need for ground truth. We validate this approach for three denoising methods (principal component analysis, network diffusion, and a deep autoencoder) on a dataset of deeply sequenced neurons. We show that molecular cross-validation correctly selects the optimal parameters for each method and identifies the best method for the dataset.
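The idea of scoring a denoiser without ground truth can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each cell's observed molecules are split at random into two groups, that the denoiser is fit on one group, and that the other group serves as the held-out target. The function names, the toy shrinkage denoiser, the squared-error loss, and the simulated data are all hypothetical choices made for illustration.

```python
import random

def split_counts(counts, p=0.5, seed=0):
    """Randomly assign each observed molecule to one of two groups.

    Each count x becomes (a, x - a) with a ~ Binomial(x, p), so the two
    matrices share no molecules and always sum back to the original.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in counts:
        kept = [sum(rng.random() < p for _ in range(x)) for x in row]
        first.append(kept)
        second.append([x - a for x, a in zip(row, kept)])
    return first, second

def shrink_to_gene_mean(matrix, weight):
    """Toy denoiser (an assumption): blend each value with its gene's mean."""
    n_cells = len(matrix)
    means = [sum(col) / n_cells for col in zip(*matrix)]
    return [[(1 - weight) * x + weight * m for x, m in zip(row, means)]
            for row in matrix]

def mcv_loss(counts, denoise, param, seed=0):
    """Mean squared error between the denoised first half and the held-out half."""
    first, second = split_counts(counts, p=0.5, seed=seed)
    denoised = denoise(first, param)
    errors = [(d - s) ** 2
              for d_row, s_row in zip(denoised, second)
              for d, s in zip(d_row, s_row)]
    return sum(errors) / len(errors)

def select_param(counts, denoise, grid, seed=0):
    """Return the grid value with the lowest held-out loss."""
    return min(grid, key=lambda param: mcv_loss(counts, denoise, param, seed))

# Simulated counts: 50 cells x 5 genes, all drawn from the same distribution.
data_rng = random.Random(1)
counts = [[sum(data_rng.random() < 0.5 for _ in range(20)) for _ in range(5)]
          for _ in range(50)]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best = select_param(counts, shrink_to_gene_mean, grid)
print("selected shrinkage weight:", best)
```

Because the simulated cells are statistically identical, heavier shrinkage toward the gene mean should score better on the held-out molecules than no denoising at all. The same loop could wrap any denoiser that takes a count matrix and a tuning parameter.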

Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model

Genome Biology ◽

10.1186/s13059-019-1861-6 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 35

Author(s):

F. William Townes ◽

Stephanie C. Hicks ◽

Martin J. Aryee ◽

Rafael A. Irizarry

Keyword(s):

Feature Selection ◽

Dimension Reduction ◽

Single Cell ◽

Current Practice ◽

Principal Component ◽

Ground Truth ◽

Rna Seq ◽

Normal Distributions ◽

Multinomial Sampling ◽

Negative Controls

Abstract: Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
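Feature selection using deviance can be sketched as follows. The abstract does not give the formula, so this example assumes a per-gene binomial deviance against a null model in which every cell expresses the gene at the same proportion of its total counts. The function names and the toy count matrix are hypothetical.

```python
import math

def xlogy_ratio(x, expected):
    """x * log(x / expected), with the convention 0 * log(0) = 0."""
    return x * math.log(x / expected) if x > 0 else 0.0

def binomial_deviance(counts):
    """Per-gene deviance from a constant-proportion null model.

    counts is a list of cells, each a list of UMI counts per gene.
    A gene whose share of each cell's total is constant scores zero.
    """
    cell_totals = [sum(row) for row in counts]
    grand_total = sum(cell_totals)
    deviances = []
    for gene_counts in zip(*counts):
        pi = sum(gene_counts) / grand_total  # pooled proportion for this gene
        d = 0.0
        for y, n in zip(gene_counts, cell_totals):
            d += xlogy_ratio(y, n * pi)
            d += xlogy_ratio(n - y, n * (1 - pi))
        deviances.append(2 * d)
    return deviances

# Toy data: gene 0 is constant across cells, genes 1 and 2 switch on and off.
counts = [
    [10, 0, 10],
    [10, 10, 0],
    [10, 0, 10],
    [10, 10, 0],
]
deviances = binomial_deviance(counts)
# Rank genes from most to least informative under this criterion.
ranked = sorted(range(len(deviances)), key=lambda j: deviances[j], reverse=True)
print(deviances, ranked)
```

Genes with high deviance fit the constant-expression null poorly and are the candidates to keep. Here the constant gene is ranked last.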

Single-cell identity definition using random forests and recursive feature elimination

10.1101/2020.08.03.233650 ◽

2020 ◽

Author(s):

Madeline Park ◽

Sevahn Vorperian ◽

Sheng Wang ◽

Angela Oliveira Pisco

Keyword(s):

Single Cell ◽

Cross Validation ◽

Random Forest Classifier ◽

Recursive Feature Elimination ◽

Rna Seq ◽

Feature Importance ◽

Analysis Workflow ◽

Necessary And Sufficient ◽

Cell Data ◽

Python Package

Abstract: Single-cell RNA sequencing (scRNA-seq) enables the detailed examination of a cell’s underlying regulatory networks and the molecular factors contributing to its identity. We developed scRFE with the goal of generating interpretable gene lists that can accurately distinguish observations (single cells) by their features (genes) given a metadata category of the dataset. scRFE is an algorithm that combines the classical random forest classifier with recursive feature elimination and cross validation to find the features necessary and sufficient to classify cells in a single-cell RNA-seq dataset by ranking feature importance. It is implemented as a Python package compatible with Scanpy, enabling its seamless integration into any single-cell data analysis workflow that aims at identifying minimal transcriptional programs relevant to describing metadata features of the dataset. We applied scRFE to the Tabula Muris Senis and reproduced established aging patterns and transcription factor reprogramming protocols, highlighting the biological value of scRFE’s learned features.

Author summary: scRFE is a Python package that combines a random forest classifier with recursive feature elimination and cross validation to find the features necessary and sufficient to classify cells in a single-cell RNA-seq dataset by ranking feature importance. scRFE was designed to enable straightforward integration as part of any single-cell data analysis workflow that aims at identifying minimal transcriptional programs relevant to describing metadata features of the dataset.

Feature Selection and Dimension Reduction for Single Cell RNA-Seq based on a Multinomial Model

10.1101/574574 ◽

2019 ◽

Cited By ~ 29

Author(s):

F. William Townes ◽

Stephanie C. Hicks ◽

Martin J. Aryee ◽

Rafael A. Irizarry

Keyword(s):

Feature Selection ◽

Dimension Reduction ◽

Single Cell ◽

Current Practice ◽

Principal Component ◽

Ground Truth ◽

Rna Seq ◽

Normal Distributions ◽

Multinomial Sampling ◽

Negative Controls

Abstract: Single cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero-inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform current practice in a downstream clustering assessment using ground-truth datasets.

EpiScanpy: integrated single-cell epigenomic analysis

10.1101/648097 ◽

2019 ◽

Cited By ~ 4

Author(s):

Anna Danese ◽

Maria L. Richter ◽

David S. Fischer ◽

Fabian J. Theis ◽

Maria Colomé-Tatché

Keyword(s):

Dna Methylation ◽

Single Cell ◽

Large Scale ◽

Feature Space ◽

Rna Seq ◽

Computational Framework ◽

Learning Techniques ◽

Multiple Feature ◽

The Many ◽

Cell Data

Abstract: Epigenetic single-cell measurements reveal a layer of regulatory information not accessible to single-cell transcriptomics; however, single-cell omics analysis tools mainly focus on gene expression data. To address this issue, we present epiScanpy, a computational framework for the analysis of single-cell DNA methylation and single-cell ATAC-seq data. EpiScanpy makes the many existing RNA-seq workflows from scanpy available to large-scale single-cell data from other omics modalities. We introduce and compare multiple feature space constructions for epigenetic data and show the feasibility of common clustering, dimension reduction and trajectory learning techniques. We benchmark epiScanpy by interrogating different single-cell mouse brain atlases of DNA methylation, ATAC-seq and transcriptomics. We find that differentially methylated and differentially open markers between cell clusters enrich transcriptome-based cell type labels by orthogonal epigenetic information.

Truncated Robust Principal Component Analysis and Noise Reduction for Single Cell RNA-seq Data

Bioinformatics Research and Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-319-94968-0_32 ◽

2018 ◽

pp. 335-346

Author(s):

Krzysztof Gogolewski ◽

Maciej Sykulski ◽

Neo Christopher Chung ◽

Anna Gambin

Keyword(s):

Principal Component Analysis ◽

Noise Reduction ◽

Single Cell ◽

Principal Component ◽

Component Analysis ◽

Rna Seq ◽

Robust Principal Component Analysis

The SZS is an efficient statistical method to identify regulated splicing events in droplet-based RNA sequencing

10.1101/2020.11.10.377572 ◽

2020 ◽

Author(s):

Julia Eve Olivieri ◽

Roozbeh Dehghannasiri ◽

Julia Salzman

Keyword(s):

Single Cell ◽

Statistical Method ◽

Rna Seq ◽

Computationally Efficient ◽

Small Set ◽

Biological Discovery ◽

Cell Type Specific ◽

Human Spermatogenesis ◽

Splicing Patterns ◽

Cell Data

Abstract: To date, the field of single-cell genomics has viewed robust splicing analysis as completely out of reach in droplet-based platforms, preventing biological discovery of single-cell regulated splicing. Here, we introduce a novel, robust, and computationally efficient statistical method, the Splicing Z Score (SZS), to detect differential alternative splicing in single-cell RNA-Seq technologies including 10x Chromium. We applied the SZS to primary human cells to discover new regulated, cell type-specific splicing patterns. Illustrating the power of the SZS method, splicing of a small set of genes has high predictive power for tissue compartment in the human lung, and the SZS identifies unannotated, conserved splicing regulation in human spermatogenesis. The SZS is a method that can rapidly identify regulated splicing events from single-cell data and prioritize genes predicted to have functionally significant splicing programs.

SCOUT: Single-cell outlier analysis in cancer

10.1101/2020.03.25.007518 ◽

2020 ◽

Author(s):

Giovana Ravizzoni Onzi ◽

Juliano Luiz Faccioni ◽

Alvaro G. Alvarado ◽

Paula Andreghetto Bracco ◽

Harley I. Kornblum ◽

...

Keyword(s):

Data Analysis ◽

Single Cell ◽

Biological Markers ◽

Rna Seq ◽

Outlier Analysis ◽

Mass Cytometry ◽

Wide Range ◽

Cell Data

Outliers are often ignored or even removed from data analysis. In cancer, however, single outlier cells can be of major importance, since they have uncommon characteristics that may confer the capacity to invade, metastasize, or resist therapy. Here we present the Single-Cell OUTlier analysis (SCOUT), a resource for single-cell data analysis focusing on outlier cells, and the SCOUT Selector (SCOUTS), an application to systematically apply SCOUT to a dataset over a wide range of biological markers. Using publicly available datasets of cancer samples obtained from mass cytometry and single-cell RNA-seq platforms, outlier cells for the expression of proteins or RNAs were identified and compared to their non-outlier counterparts among different samples. Our results show that analyzing single-cell data using SCOUT can uncover key information not easily observed in the analysis of the whole population.

Accurate denoising of single-cell RNA-Seq data using unbiased principal component analysis

10.1101/655365 ◽

2019 ◽

Cited By ~ 11

Author(s):

Florian Wagner ◽

Dalia Barkley ◽

Itai Yanai

Keyword(s):

Principal Component Analysis ◽

Single Cell ◽

Simulated Data ◽

Principal Component ◽

Cell Aggregation ◽

Component Analysis ◽

Rna Seq ◽

Highly Expressed Genes ◽

Cell Subpopulations ◽

Aggregation Step

Abstract: Single-cell RNA-Seq measurements are commonly affected by high levels of technical noise, posing challenges for data analysis and visualization. A diverse array of methods has been proposed to computationally remove noise by sharing information across similar cells or genes; however, their respective accuracies have been difficult to establish. Here, we propose a simple denoising strategy based on principal component analysis (PCA). We show that while PCA performed on raw data is biased towards highly expressed genes, this bias can be mitigated with a cell aggregation step, allowing the recovery of denoised expression values for both highly and lowly expressed genes. We benchmark our resulting ENHANCE algorithm and three previously described methods on simulated data that closely mimic real datasets, showing that ENHANCE provides the best overall denoising accuracy, recovering modules of co-expressed genes and cell subpopulations. Implementations of our algorithm are available at https://github.com/yanailab/enhance.

scDIOR: single cell RNA-seq data IO software

BMC Bioinformatics ◽

10.1186/s12859-021-04528-3 ◽

2022 ◽

Vol 23 (1) ◽

Author(s):

Huijian Feng ◽

Lihui Lin ◽

Jiekai Chen

Keyword(s):

Single Cell ◽

Programming Languages ◽

Large Scale ◽

Developmental Trajectories ◽

Rapid Development ◽

Data Transformation ◽

Rna Seq ◽

Data Types ◽

User Friendly ◽

Cell Data

Abstract

Background: Single-cell RNA sequencing is becoming a powerful tool to identify cell states, reconstruct developmental trajectories, and deconvolute spatial expression. The rapid development of computational methods promotes insight into heterogeneous single-cell data. An increasing number of tools have been provided for biological analysts, and two programming languages, R and Python, are widely used among researchers. R and Python are complementary, as many methods are implemented specifically in one or the other. However, the use of different platforms creates data sharing and transformation problems, especially among Scanpy, Seurat, and SingleCellExperiment. Currently, there is no efficient and user-friendly software to perform data transformation of single-cell omics between platforms, so users spend excessive time on data input and output (IO), significantly reducing the efficiency of data analysis.

Results: We developed scDIOR for single-cell data transformation between the R and Python platforms based on Hierarchical Data Format Version 5 (HDF5). We have created a data IO ecosystem between three R packages (Seurat, SingleCellExperiment, Monocle) and a Python package (Scanpy). Importantly, scDIOR accommodates a variety of data types across programming languages and platforms in an ultrafast way, including single-cell RNA-seq and spatially resolved transcriptomics data, using only a few lines of code in an IDE or command-line interface. For large-scale datasets, users can partially load the needed information, e.g., cell annotation without the gene expression matrices. scDIOR connects the analytical tasks of different platforms, which makes it easy to compare the performance of algorithms between them.

Conclusions: scDIOR contains two modules, dior in R and diopy in Python. scDIOR is a versatile and user-friendly tool that implements single-cell data transformation between R and Python rapidly and stably. The software is freely accessible at https://github.com/JiekaiLab/scDIOR.

Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

Genome Biology ◽

10.1186/s13059-021-02451-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jan Lause ◽

Philipp Berens ◽

Dmitry Kobak

Keyword(s):

Single Cell ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Ground Truth ◽

Downstream Processing ◽

Negative Control ◽

Parameter Estimates ◽

Rna Seq ◽

Pearson Residuals ◽

Rank One

Abstract

Background: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing.

Results: We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth.

Conclusions: We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.
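A minimal sketch of how Pearson residuals with an analytic (closed-form) expected value might be computed is given below. The abstract does not state the formula, so the details are assumptions: the expected count for a cell and gene is taken as the rank-one product of the cell's total and the gene's pooled fraction, the variance is negative binomial with a single shared overdispersion parameter, and the default `theta=100.0` is an illustrative value only. The function name and toy data are hypothetical.

```python
import math

def analytic_pearson_residuals(counts, theta=100.0):
    """Pearson residuals under a rank-one null model (illustrative sketch).

    mu[c][g] = cell_total[c] * gene_total[g] / grand_total
    residual = (x - mu) / sqrt(mu + mu**2 / theta)

    theta is a shared overdispersion parameter (an assumed default);
    as theta grows, the variance approaches the Poisson variance mu.
    """
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(cell_totals)
    residuals = []
    for row, n_c in zip(counts, cell_totals):
        out = []
        for x, g_total in zip(row, gene_totals):
            mu = n_c * g_total / grand_total
            variance = mu + mu * mu / theta
            # A gene with no counts anywhere has mu = 0; report 0 for it.
            out.append((x - mu) / math.sqrt(variance) if variance > 0 else 0.0)
        residuals.append(out)
    return residuals

# Toy data: gene 0 matches the null model exactly, genes 1 and 2 do not.
counts = [
    [10, 0, 10],
    [10, 10, 0],
    [10, 0, 10],
    [10, 10, 0],
]
residuals = analytic_pearson_residuals(counts)
for row in residuals:
    print([round(r, 3) for r in row])
```

In this toy matrix the first gene has zero residuals in every cell because its counts match the expected values exactly, while the other two genes alternate between positive and negative residuals. Genes could then be ranked by the variance of their residuals across cells.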
