Feature Selection and Dimension Reduction for Single Cell RNA-Seq based on a Multinomial Model

10.1101/574574 ◽

2019 ◽

Cited By ~ 29

Author(s):

F. William Townes ◽

Stephanie C. Hicks ◽

Martin J. Aryee ◽

Rafael A. Irizarry

Keyword(s):

Feature Selection ◽

Dimension Reduction ◽

Single Cell ◽

Current Practice ◽

Principal Component ◽

Ground Truth ◽

Rna Seq ◽

Normal Distributions ◽

Multinomial Sampling ◽

Negative Controls

Single cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero-inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform current practice in a downstream clustering assessment using ground-truth datasets.


Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model

Genome Biology ◽

10.1186/s13059-019-1861-6 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 35

Author(s):

F. William Townes ◽

Stephanie C. Hicks ◽

Martin J. Aryee ◽

Rafael A. Irizarry

Keyword(s):

Feature Selection ◽

Dimension Reduction ◽

Single Cell ◽

Current Practice ◽

Principal Component ◽

Ground Truth ◽

Rna Seq ◽

Normal Distributions ◽

Multinomial Sampling ◽

Negative Controls

Single-cell RNA-Seq (scRNA-Seq) profiles gene expression of individual cells. Recent scRNA-Seq datasets have incorporated unique molecular identifiers (UMIs). Using negative controls, we show UMI counts follow multinomial sampling with no zero inflation. Current normalization procedures such as log of counts per million and feature selection by highly variable genes produce false variability in dimension reduction. We propose simple multinomial methods, including generalized principal component analysis (GLM-PCA) for non-normal distributions, and feature selection using deviance. These methods outperform the current practice in a downstream clustering assessment using ground truth datasets.
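
As a rough illustration of feature selection using deviance, the sketch below scores each gene by how far its counts depart from a constant-proportion model across cells. It assumes a binomial approximation to the multinomial model, with one pooled proportion per gene; the function names and the cells-by-genes layout are hypothetical choices for this example, not taken from the abstract.

```python
import numpy as np

def binomial_deviance(counts):
    """Per-gene binomial deviance for a cells x genes UMI count matrix.

    Null model (an assumption of this sketch): every gene makes up the
    same proportion of total UMIs in every cell.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)              # total UMIs per cell
    pi = counts.sum(axis=0, keepdims=True) / n.sum()   # pooled proportion per gene
    mu = n * pi                                        # expected counts under the null

    def a_log_a_over_b(a, b):
        # a * log(a / b), using the convention 0 * log(0) = 0
        out = np.zeros_like(a)
        mask = a > 0
        out[mask] = a[mask] * np.log(a[mask] / b[mask])
        return out

    observed_term = a_log_a_over_b(counts, mu)
    remainder_term = a_log_a_over_b(n - counts, n - mu)
    return 2.0 * (observed_term + remainder_term).sum(axis=0)

def top_deviance_genes(counts, k):
    """Indices of the k genes with the largest deviance."""
    return np.argsort(-binomial_deviance(counts))[:k]
```

Genes whose share of the total is the same in every cell score near zero, so ranking by this score favours genes that vary between cells.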


Author Correction: Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model

Genome Biology ◽

10.1186/s13059-020-02109-w ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

F. William Townes ◽

Stephanie C. Hicks ◽

Martin J. Aryee ◽

Rafael A. Irizarry

Keyword(s):

Feature Selection ◽

Dimension Reduction ◽

Single Cell ◽

Multinomial Model ◽

Rna Seq


Molecular Cross-Validation for Single-Cell RNA-seq

10.1101/786269 ◽

2019 ◽

Cited By ~ 7

Author(s):

Joshua Batson ◽

Loïc Royer ◽

James Webber

Keyword(s):

Single Cell ◽

Cross Validation ◽

Individual Cell ◽

Principal Component ◽

Ground Truth ◽

Rna Seq ◽

Optimal Parameters ◽

Denoising Method ◽

Data Driven Approach ◽

Cell Data

Single-cell RNA sequencing enables researchers to study the gene expression of individual cells. However, in high-throughput methods the portrait of each individual cell is noisy, representing thousands of the hundreds of thousands of mRNA molecules originally present. While many methods for denoising single-cell data have been proposed, a principled procedure for selecting and calibrating the best method for a given dataset has been lacking. We present “molecular cross-validation,” a statistically principled and data-driven approach for estimating the accuracy of any denoising method without the need for ground-truth. We validate this approach for three denoising methods—principal component analysis, network diffusion, and a deep autoencoder—on a dataset of deeply-sequenced neurons. We show that molecular cross-validation correctly selects the optimal parameters for each method and identifies the best method for the dataset.
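
One way to picture the idea is sketched below, under stated assumptions: each cell's molecules are split at random into two independent count matrices, a denoiser is fitted to one half, and its output is scored against the other half. The binomial split, the truncated-SVD denoiser, and the mean squared error loss are simplifications chosen for this example; the function names are hypothetical, and any calibration steps of the published method are omitted.

```python
import numpy as np

def split_molecules(counts, rng, p=0.5):
    """Randomly assign each observed molecule to one of two sets."""
    counts = np.asarray(counts, dtype=np.int64)
    first = rng.binomial(counts, p)        # molecules kept in the first set
    return first, counts - first           # the remainder forms the second set

def truncated_svd_denoise(x, rank):
    """Rank-limited reconstruction, standing in for a PCA denoiser."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def mcv_loss(counts, ranks, seed=0):
    """Held-out squared error for each candidate rank."""
    rng = np.random.default_rng(seed)
    train, held_out = split_molecules(counts, rng)
    losses = {}
    for k in ranks:
        denoised = truncated_svd_denoise(train.astype(float), k)
        # With an even split the two halves share the same expectation,
        # so the denoised training half can be compared to the held-out half.
        losses[k] = float(np.mean((denoised - held_out) ** 2))
    return losses
```

The rank with the smallest held-out loss would be the one selected; no ground-truth expression values enter the calculation.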


corral: Single-cell RNA-seq dimension reduction, batch integration, and visualization with correspondence analysis

10.1101/2021.11.24.469874 ◽

2021 ◽

Author(s):

Lauren L Hsu ◽

Aedin C Culhane

Keyword(s):

Dimension Reduction ◽

Single Cell ◽

Correspondence Analysis ◽

Principal Component ◽

Rna Seq ◽

Log Transformation ◽

Pearson Residuals ◽

Effective Dimension Reduction ◽

Chi Squared ◽

Table Analysis

Effective dimension reduction is an essential step in analysis of single cell RNA-seq (scRNAseq) count data, which are high-dimensional, sparse, and noisy. Principal component analysis (PCA) is widely used in analytical pipelines, and since PCA requires continuous data, it is often coupled with log-transformation in scRNAseq applications. However, log-transformation of scRNAseq counts distorts the data and can obscure meaningful variation. We describe correspondence analysis (CA) for dimension reduction of scRNAseq data, which is a performant alternative to PCA. Designed for use with counts, CA is based on decomposition of a chi-squared residual matrix and does not require log-transformation of scRNAseq counts. We extend beyond standard CA (decomposition of Pearson residuals computed on the contingency table) and propose variations of CA, including an alternative chi-squared statistic, that address overdispersion and high sparsity in scRNAseq data. The performance of five variations of CA and standard CA is benchmarked on 10 datasets and compared to glmPCA. CA variations are fast and scalable, and they outperform standard CA and glmPCA, computing embeddings with better or comparable clustering accuracy in 8 out of 9 datasets. Of the variations we considered, CA using the Freeman-Tukey chi-squared residual was most performant overall in scRNAseq data. Our analyses also showed that variance stabilizing transformations applied in conjunction with standard CA (using Pearson residuals) and the use of power deflation smoothing both improve performance in downstream clustering tasks, as compared to standard CA alone. CA has advantages including visual illustration of associations between genes and cell populations in a 'CA biplot' and easy extension to multi-table analysis enabling integrative dimension reduction.
We introduce corralm, a CA-based method for multi-table batch integration of scRNAseq data in shared latent space, and we propose a new approach for assessing batch integration. We implement CA for scRNAseq in the corral R/Bioconductor package (https://www.bioconductor.org/packages/corral) that interfaces directly with widely used single cell classes in Bioconductor, allowing for easy integration into scRNAseq pipelines.
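
Standard CA, described above as decomposition of Pearson residuals computed on the contingency table, can be sketched in a few lines of NumPy. This is a minimal illustration and not the corral package's implementation; the function name, the choice of rows as the embedded dimension, and the requirement that no row or column sums to zero are assumptions of the example.

```python
import numpy as np

def correspondence_analysis(counts, n_components=2):
    """Standard CA via SVD of the Pearson (chi-squared) residual matrix.

    Assumes every row and every column of `counts` has a nonzero sum.
    Returns row coordinates and the leading singular values.
    """
    x = np.asarray(counts, dtype=float)
    p = x / x.sum()                          # table as joint proportions
    r = p.sum(axis=1, keepdims=True)         # row masses
    c = p.sum(axis=0, keepdims=True)         # column masses
    expected = r @ c                         # proportions under independence
    resid = (p - expected) / np.sqrt(expected)
    u, s, vt = np.linalg.svd(resid, full_matrices=False)
    return u[:, :n_components] * s[:n_components], s[:n_components]
```

No log-transformation is applied at any point; the counts enter only through the observed and expected proportions.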


Visualizing Single-Cell RNA-seq Data with Semisupervised Principal Component Analysis

International Journal of Molecular Sciences ◽

10.3390/ijms21165797 ◽

2020 ◽

Vol 21 (16) ◽

pp. 5797

Author(s):

Zhenqiu Liu

Keyword(s):

Principal Component Analysis ◽

Dimension Reduction ◽

Single Cell ◽

Optimal Solution ◽

Principal Component ◽

Component Analysis ◽

Biological Information ◽

Rna Seq ◽

Computationally Efficient ◽

Leibler Divergence

Single-cell RNA-seq (scRNA-seq) is a powerful tool for analyzing heterogeneous and functionally diverse cell populations. Visualizing scRNA-seq data can help us effectively extract meaningful biological information and identify novel cell subtypes. Currently, the most popular methods for scRNA-seq visualization are principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). While PCA is an unsupervised dimension reduction technique, t-SNE incorporates cluster information into pairwise probability, and then maximizes the Kullback–Leibler divergence. Uniform Manifold Approximation and Projection (UMAP) is another recently developed visualization method similar to t-SNE. However, one limitation of UMAP and t-SNE is that they can only capture the local structure of the data; the global structure of the data is not faithfully preserved. In this manuscript, we propose a semisupervised principal component analysis (ssPCA) approach for scRNA-seq visualization. The proposed approach incorporates cluster labels into dimension reduction and discovers principal components that maximize both data variance and cluster dependence. ssPCA must have cluster labels as its input. Therefore, it is most useful for visualizing clusters produced by scRNA-seq clustering software. Our experiments with simulated and real scRNA-seq data demonstrate that ssPCA is able to preserve both local and global structures of the data, and to uncover the transitions and progressions in the data, if they exist. In addition, ssPCA is convex and has a global optimal solution. It is also robust and computationally efficient, making it viable for scRNA-seq cluster visualization.


Truncated Robust Principal Component Analysis and Noise Reduction for Single Cell RNA-seq Data

Bioinformatics Research and Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-319-94968-0_32 ◽

2018 ◽

pp. 335-346

Author(s):

Krzysztof Gogolewski ◽

Maciej Sykulski ◽

Neo Christopher Chung ◽

Anna Gambin

Keyword(s):

Principal Component Analysis ◽

Noise Reduction ◽

Single Cell ◽

Principal Component ◽

Component Analysis ◽

Rna Seq ◽

Robust Principal Component Analysis


So you think you can PLS-DA?

BMC Bioinformatics ◽

10.1186/s12859-019-3310-7 ◽

2020 ◽

Vol 21 (S1) ◽

Author(s):

Daniel Ruiz-Perez ◽

Haibin Guan ◽

Purnima Madhivanan ◽

Kalai Mathee ◽

Giri Narasimhan

Keyword(s):

Feature Selection ◽

Signal To Noise Ratio ◽

Synthetic Data ◽

Principal Component ◽

Ground Truth ◽

Close Relative ◽

Data Set ◽

Series Of Experiments ◽

Feature Selector ◽

Class Labels

Background: Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance to its close relative from which it was initially invented, namely Principal Component Analysis (PCA). Results: We demonstrate that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a feature selector. In some cases, it outperforms PLS-DA, which is made aware of the class labels in its input. Our experiments range from looking at the signal-to-noise ratio in the feature selection task, to considering many practical distributions and models encountered when analyzing bioinformatics and clinical data. Other methods were also evaluated. Finally, we analyzed an interesting data set from 396 vaginal microbiome samples where the ground truth for the feature selection was available. All the 3D figures shown in this paper as well as the supplementary ones can be viewed interactively at http://biorg.cs.fiu.edu/plsda. Conclusions: Our results highlighted the strengths and weaknesses of PLS-DA in comparison with PCA for different underlying data models.


Accurate denoising of single-cell RNA-Seq data using unbiased principal component analysis

10.1101/655365 ◽

2019 ◽

Cited By ~ 11

Author(s):

Florian Wagner ◽

Dalia Barkley ◽

Itai Yanai

Keyword(s):

Principal Component Analysis ◽

Single Cell ◽

Simulated Data ◽

Principal Component ◽

Cell Aggregation ◽

Component Analysis ◽

Rna Seq ◽

Highly Expressed Genes ◽

Cell Subpopulations ◽

Aggregation Step

Single-cell RNA-Seq measurements are commonly affected by high levels of technical noise, posing challenges for data analysis and visualization. A diverse array of methods has been proposed to computationally remove noise by sharing information across similar cells or genes; however, their respective accuracies have been difficult to establish. Here, we propose a simple denoising strategy based on principal component analysis (PCA). We show that while PCA performed on raw data is biased towards highly expressed genes, this bias can be mitigated with a cell aggregation step, allowing the recovery of denoised expression values for both highly and lowly expressed genes. We benchmark our resulting ENHANCE algorithm and three previously described methods on simulated data that closely mimic real datasets, showing that ENHANCE provides the best overall denoising accuracy, recovering modules of co-expressed genes and cell subpopulations. Implementations of our algorithm are available at https://github.com/yanailab/enhance.


Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data

Genome Biology ◽

10.1186/s13059-021-02451-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jan Lause ◽

Philipp Berens ◽

Dmitry Kobak

Keyword(s):

Single Cell ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Ground Truth ◽

Downstream Processing ◽

Negative Control ◽

Parameter Estimates ◽

Rna Seq ◽

Pearson Residuals ◽

Rank One

Background: Standard preprocessing of single-cell RNA-seq UMI data includes normalization by sequencing depth to remove this technical variability, and nonlinear transformation to stabilize the variance across genes with different expression levels. Instead, two recent papers propose to use statistical count models for these tasks: Hafemeister and Satija (Genome Biol 20:296, 2019) recommend using Pearson residuals from negative binomial regression, while Townes et al. (Genome Biol 20:295, 2019) recommend fitting a generalized PCA model. Here, we investigate the connection between these approaches theoretically and empirically, and compare their effects on downstream processing. Results: We show that the model of Hafemeister and Satija produces noisy parameter estimates because it is overspecified, which is why the original paper employs post hoc smoothing. When specified more parsimoniously, it has a simple analytic solution equivalent to the rank-one Poisson GLM-PCA of Townes et al. Further, our analysis indicates that per-gene overdispersion estimates in Hafemeister and Satija are biased, and that the data are in fact consistent with the overdispersion parameter being independent of gene expression. We then use negative control data without biological variability to estimate the technical overdispersion of UMI counts, and find that across several different experimental protocols, the data are close to Poisson and suggest very moderate overdispersion. Finally, we perform a benchmark to compare the performance of Pearson residuals, variance-stabilizing transformations, and GLM-PCA on scRNA-seq datasets with known ground truth. Conclusions: We demonstrate that analytic Pearson residuals strongly outperform other methods for identifying biologically variable genes, and capture more of the biologically meaningful variation when used for dimensionality reduction.
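
The "simple analytic solution" can be illustrated with a short sketch, assuming the expected count for each cell and gene is the rank-one product of cell total and gene total divided by the grand total, with a negative binomial variance. The overdispersion value `theta=100`, the clipping at the square root of the number of cells, and the function name are assumptions of this example rather than statements from the abstract.

```python
import numpy as np

def analytic_pearson_residuals(counts, theta=100.0):
    """Pearson residuals of a cells x genes UMI matrix under a rank-one model.

    Assumes every gene (column) has a nonzero total. Pass theta=np.inf
    to obtain the Poisson case.
    """
    x = np.asarray(counts, dtype=float)
    cell_totals = x.sum(axis=1, keepdims=True)
    gene_totals = x.sum(axis=0, keepdims=True)
    mu = cell_totals * gene_totals / x.sum()       # rank-one expected counts
    z = (x - mu) / np.sqrt(mu + mu ** 2 / theta)   # negative binomial variance
    bound = np.sqrt(x.shape[0])                    # assumed clipping threshold
    return np.clip(z, -bound, bound)
```

Under this sketch, a gene's variance of residuals across cells could serve as a score for biologically variable genes, and the residual matrix could be passed to ordinary PCA.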


CellBench: R/Bioconductor software for comparing single-cell RNA-seq analysis methods

Bioinformatics ◽

10.1093/bioinformatics/btz889 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2288-2290 ◽

Cited By ~ 3

Author(s):

Shian Su ◽

Luyi Tian ◽

Xueyi Dong ◽

Peter F Hickey ◽

Saskia Freytag ◽

...

Keyword(s):

Single Cell ◽

Ad Hoc ◽

Performance Metrics ◽

Single Cell Analysis ◽

Ground Truth ◽

Bioinformatic Analysis ◽

Rna Seq ◽

Effective Manner ◽

Cell Gene Expression ◽

The Many

Motivation: Bioinformatic analysis of single-cell gene expression data is a rapidly evolving field. Hundreds of bespoke methods have been developed in the past few years to deal with various aspects of single-cell analysis, and consensus on the most appropriate methods to use under different settings is still emerging. Benchmarking the many methods is therefore of critical importance, and since analysis of single-cell data usually involves multi-step pipelines, effective evaluation of pipelines involving different combinations of methods is required. Current benchmarks of single-cell methods are mostly implemented with ad-hoc code that is often difficult to reproduce or extend, and exhaustive manual coding of many combinations is infeasible in most instances. Therefore, new software is needed to manage pipeline benchmarking. Results: The CellBench R software facilitates method comparisons in either a task-centric or combinatorial way to allow pipelines of methods to be evaluated in an effective manner. CellBench automatically runs combinations of methods, provides facilities for measuring running time and delivers output in tabular form which is highly compatible with tidyverse R packages for summary and visualization. Our software has enabled comprehensive benchmarking of single-cell RNA-seq normalization, imputation, clustering, trajectory analysis and data integration methods using various performance metrics obtained from data with available ground truth. CellBench is also amenable to benchmarking other bioinformatics analysis tasks. Availability and implementation: Available from https://bioconductor.org/packages/CellBench.
