Probabilistic Count Matrix Factorization for Single Cell Expression Data Analysis

10.1101/211938 ◽

2017 ◽

Cited By ~ 4

Author(s):

G. Durif ◽

L. Modolo ◽

J. E. Mold ◽

S. Lambert-Lacroix ◽

F. Picard

Keyword(s):

Data Analysis ◽

Single Cell ◽

Matrix Factorization ◽

Factor Model ◽

R Package ◽

Data Representation ◽

Population Diversity ◽

Expression Data ◽

Statistical Point ◽

Cell Expression

Abstract. Motivation: The development of high-throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. In addition, the cell-to-cell variability is high, with a low proportion of cells expressing the same genes at the same time/level. These emerging patterns are very challenging from a statistical point of view, especially when the goal is a summarized view of single-cell expression data. PCA is a powerful tool for high-dimensional data representation: it searches for latent directions that capture the most variability in the data. Unfortunately, classical PCA is based on Euclidean distance and projections that work poorly in the presence of over-dispersed count data with dropout events, such as single-cell expression data. Results: We propose a probabilistic Count Matrix Factorization (pCMF) approach for single-cell expression data analysis that relies on a sparse Gamma-Poisson factor model. This hierarchical model is inferred using a variational EM algorithm. It is able to jointly build a low-dimensional representation of cells and genes. We show how this probabilistic framework induces a geometry that is suitable for single-cell data visualization, and produces a compression of the data that is very powerful for clustering purposes. Our method is compared against other standard representation methods such as t-SNE, and we illustrate its performance for the representation of single-cell expression (scRNA-seq) data. Availability: Our work is implemented in the pCMF R package.

Probabilistic count matrix factorization for single cell expression data analysis

Bioinformatics ◽

10.1093/bioinformatics/btz177 ◽

2019 ◽

Vol 35 (20) ◽

pp. 4011-4019 ◽

Cited By ~ 7

Author(s):

Ghislain Durif ◽

Laurent Modolo ◽

Jeff E Mold ◽

Sophie Lambert-Lacroix ◽

Franck Picard

Keyword(s):

Data Analysis ◽

Single Cell ◽

Matrix Factorization ◽

Principal Component ◽

Data Representation ◽

Population Diversity ◽

Supplementary Information ◽

Expression Data ◽

Statistical Point ◽

Cell Expression

Abstract. Motivation: The development of high-throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. In addition, the cell-to-cell variability is high, with a low proportion of cells expressing the same genes at the same time/level. These emerging patterns are very challenging from a statistical point of view, especially when the goal is a summarized view of single-cell expression data. Principal component analysis (PCA) is a powerful tool for high-dimensional data representation: it searches for latent directions that capture the most variability in the data. Unfortunately, classical PCA is based on Euclidean distance and projections that work poorly in the presence of over-dispersed count data with dropout events, such as single-cell expression data. Results: We propose a probabilistic Count Matrix Factorization (pCMF) approach for single-cell expression data analysis that relies on a sparse Gamma-Poisson factor model. This hierarchical model is inferred using a variational EM algorithm. It is able to jointly build a low-dimensional representation of cells and genes. We show how this probabilistic framework induces a geometry that is suitable for single-cell data visualization, and produces a compression of the data that is very powerful for clustering purposes. Our method is compared against other standard representation methods such as t-SNE, and we illustrate its performance for the representation of single-cell expression data. Availability and implementation: Our work is implemented in the pCMF R package (https://github.com/gdurif/pCMF). Supplementary information: Supplementary data are available at Bioinformatics online.
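
The Gamma-Poisson factor model named in this abstract can be illustrated with a small simulation. The sketch below is only a generic reading of that model and is not taken from the pCMF package: cell factors U and gene factors V are drawn from Gamma priors, and each count is Poisson with rate equal to the inner product of the corresponding factor rows. The function names, the Gamma hyperparameters, and the Bernoulli dropout mask are assumptions made for illustration; the variational EM inference and the sparsity prior mentioned in the abstract are not reproduced.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(n_cells, n_genes, n_factors, dropout_prob=0.0, seed=0):
    """Simulate a cells-by-genes count matrix from a Gamma-Poisson factor model."""
    rng = random.Random(seed)
    # Assumed hyperparameters: Gamma(shape=2, scale=0.5) for both factor matrices.
    U = [[rng.gammavariate(2.0, 0.5) for _ in range(n_factors)] for _ in range(n_cells)]
    V = [[rng.gammavariate(2.0, 0.5) for _ in range(n_factors)] for _ in range(n_genes)]
    X = []
    for i in range(n_cells):
        row = []
        for j in range(n_genes):
            # Poisson rate for cell i and gene j is the inner product of their factors.
            rate = sum(U[i][k] * V[j][k] for k in range(n_factors))
            # Assumed dropout mechanism: each entry is zeroed with a fixed probability.
            dropped = rng.random() < dropout_prob
            row.append(0 if dropped else sample_poisson(rate, rng))
        X.append(row)
    return U, V, X


if __name__ == "__main__":
    U, V, X = simulate_counts(n_cells=5, n_genes=4, n_factors=2, dropout_prob=0.3, seed=1)
    for row in X:
        print(row)
```

In this reading, the rows of U give the low-dimensional representation of cells and the rows of V the representation of genes.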

Improved dropClust R package with integrative analysis support for scRNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz823 ◽

2019 ◽

Author(s):

Debajyoti Sinha ◽

Pradyumn Sinha ◽

Ritwik Saha ◽

Sanghamitra Bandyopadhyay ◽

Debarka Sengupta

Keyword(s):

Single Cell ◽

Large Scale ◽

R Package ◽

Integrative Analysis ◽

Locality Sensitive Hashing ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Speed Up ◽

Cell Expression

Abstract. Summary: DropClust leverages Locality Sensitive Hashing (LSH) to speed up clustering of large-scale single-cell expression data. Here we present the improved dropClust, a complete R package that is fast, interoperable and minimally resource intensive. The new dropClust features a novel batch effect removal algorithm that allows integrative analysis of single-cell RNA-seq (scRNA-seq) datasets. Availability and implementation: dropClust is freely available at https://github.com/debsin/dropClust as an R package. A lightweight online version of dropClust is available at https://debsinha.shinyapps.io/dropClust/. Supplementary information: Supplementary data are available at Bioinformatics online.
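
To make the idea of Locality Sensitive Hashing concrete, here is a minimal sketch of one common LSH family, random-hyperplane hashing, where vectors pointing in similar directions tend to receive the same bit signature and so land in the same bucket. This is not dropClust's implementation; the abstract does not say which LSH scheme the package uses, and all function names below are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplane normals with standard normal coordinates."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def bucket_cells(cells, hyperplanes):
    """Group cell indices by signature; candidates for neighbours share a bucket."""
    buckets = defaultdict(list)
    for index, vector in enumerate(cells):
        buckets[signature(vector, hyperplanes)].append(index)
    return dict(buckets)


if __name__ == "__main__":
    planes = make_hyperplanes(n_bits=8, dim=3, seed=1)
    cells = [[1.0, 2.0, 0.5], [2.0, 4.0, 1.0], [-1.0, -2.0, -0.5]]
    print(bucket_cells(cells, planes))
```

The speed-up comes from comparing a cell only against the other cells in its bucket rather than against the whole dataset.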

A zero-inflated non-negative matrix factorization for the deconvolution of mixed signals of biological data

The International Journal of Biostatistics ◽

10.1515/ijb-2020-0039 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Yixin Kong ◽

Ariangela Kozik ◽

Cindy H. Nakatsu ◽

Yava L. Jones-Hall ◽

Hyonho Chun

Keyword(s):

Matrix Factorization ◽

Factor Model ◽

R Package ◽

Biological Data ◽

Superior Performance ◽

Sequencing Data ◽

Fecal Microbiome ◽

Brain Gene Expression ◽

Cell Transcriptome ◽

Non Negative Matrix Factorization

Abstract. A latent factor model for count data is widely used to deconvolute mixed signals in biological data, as exemplified by sequencing data for transcriptome or microbiome studies. Due to the availability of pure samples such as single-cell transcriptome data, the accuracy of the estimates could be much improved. However, the advantage quickly disappears in the presence of excessive zeros. To correctly account for this phenomenon in both mixed and pure samples, we propose a zero-inflated non-negative matrix factorization and derive an effective multiplicative parameter updating rule. In simulation studies, our method yielded the smallest bias. We applied our approach to brain gene expression as well as fecal microbiome datasets, illustrating the superior performance of the approach. Our method is implemented as a publicly available R package, iNMF.
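
For readers unfamiliar with multiplicative updating rules, the sketch below shows the classic Lee and Seung updates for plain non-negative matrix factorization under squared error. It is not the zero-inflated rule derived in the paper, which the abstract does not spell out; it only illustrates the general mechanism in which each factor is rescaled elementwise by a ratio of non-negative terms, so non-negativity is preserved automatically. All function names are hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def squared_error(X, W, H):
    """Sum of squared differences between X and the product W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))


def nmf(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), elementwise; eps guards against division by zero.
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), elementwise, using the updated H.
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


if __name__ == "__main__":
    X = [[5, 0, 3], [4, 0, 2], [0, 6, 0], [0, 7, 1]]
    W, H = nmf(X, rank=2)
    print(round(squared_error(X, W, H), 4))
```

A zero-inflated variant would additionally model which zeros are structural rather than sampled, which this plain version does not attempt.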

Discovery of rare cells from voluminous single cell expression data

Nature Communications ◽

10.1038/s41467-018-07234-6 ◽

2018 ◽

Vol 9 (1) ◽

Cited By ~ 20

Author(s):

Aashi Jindal ◽

Prashant Gupta ◽

Jayadeva ◽

Debarka Sengupta

Keyword(s):

Single Cell ◽

Expression Data ◽

Rare Cells ◽

Cell Expression

ArrayExpress update – from bulk to single-cell expression data

Nucleic Acids Research ◽

10.1093/nar/gky964 ◽

2018 ◽

Vol 47 (D1) ◽

pp. D711-D715 ◽

Cited By ~ 115

Author(s):

Awais Athar ◽

Anja Füllgrabe ◽

Nancy George ◽

Haider Iqbal ◽

Laura Huerta ◽

...

Keyword(s):

Single Cell ◽

Expression Data ◽

Cell Expression

ESCO: single cell expression simulation incorporating gene co-expression

10.1101/2020.10.20.347211 ◽

2020 ◽

Author(s):

Jinjin Tian ◽

Jiebiao Wang ◽

Kathryn Roeder

Keyword(s):

Single Cell ◽

R Package ◽

Brain Cell ◽

Gene Interactions ◽

Cell Type ◽

Imputation Methods ◽

Biological Interest ◽

A Cell ◽

Cell Expression ◽

Cell Data

Abstract. Motivation: Gene-gene co-expression networks (GCN) are of biological interest for the useful information they provide for understanding gene-gene interactions. The advent of single cell RNA-sequencing allows us to examine more subtle gene co-expression occurring within a cell type. Many imputation and denoising methods have been developed to deal with the technical challenges observed in single cell data; meanwhile, several simulators have been developed for benchmarking and assessing these methods. Most of these simulators, however, either do not incorporate gene co-expression or generate co-expression in an inconvenient manner. Results: Therefore, with the focus on gene co-expression, we propose a new simulator, ESCO, which adopts the idea of the copula to impose gene co-expression, while preserving the highlights of available simulators, which perform well for simulation of gene expression marginally. Using ESCO, we assess the performance of imputation methods on GCN recovery and find that imputation generally helps GCN recovery when the data are not too sparse, and the ensemble imputation method works best among leading methods. In contrast, imputation fails to help in the presence of an excessive fraction of zero counts, where simple data aggregating methods are a better choice. These findings are further verified with mouse and human brain cell data. Availability: The ESCO implementation is available as R package SplatterESCO (https://github.com/JINJINT/SplatterESCO).
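
The copula idea mentioned in this abstract separates the dependence between genes from each gene's marginal count distribution. The sketch below illustrates it for a single pair of genes, assuming a Gaussian copula and Poisson marginals; the abstract confirms neither choice, and this is not the SplatterESCO code. Correlated standard normals are mapped to uniforms through the normal CDF and then to counts through each gene's quantile function. Function and parameter names are hypothetical.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, rate, max_count=1000):
    """Smallest k with P(Poisson(rate) <= k) >= u, capped at max_count."""
    k = 0
    pmf = math.exp(-rate)
    cdf = pmf
    while cdf < u and k < max_count:
        k += 1
        pmf *= rate / k
        cdf += pmf
    return k


def simulate_gene_pair(n_cells, rho, rate_a, rate_b, seed=0):
    """Simulate counts for two genes whose latent normals have correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    counts = []
    for _ in range(n_cells):
        # Step 1: correlated standard normals (the Gaussian copula's latent layer).
        z_a = rng.gauss(0.0, 1.0)
        z_b = rho * z_a + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Step 2: map to uniforms, which keeps the dependence but drops the normal margins.
        u_a, u_b = std_normal.cdf(z_a), std_normal.cdf(z_b)
        # Step 3: push each uniform through the desired marginal's quantile function.
        counts.append((poisson_quantile(u_a, rate_a), poisson_quantile(u_b, rate_b)))
    return counts


if __name__ == "__main__":
    for pair in simulate_gene_pair(n_cells=5, rho=0.8, rate_a=3.0, rate_b=10.0, seed=1):
        print(pair)
```

Because the marginals enter only in the last step, any existing marginal simulator could in principle be combined with a dependence structure chosen separately.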

Identifying signaling genes in spatial single-cell expression data

Bioinformatics ◽

10.1093/bioinformatics/btaa769 ◽

2020 ◽

Author(s):

Dongshunyi Li ◽

Jun Ding ◽

Ziv Bar-Joseph

Keyword(s):

Single Cell ◽

Cell Interactions ◽

Excitatory Neuron ◽

Expression Data ◽

Mixture Of Experts ◽

Prediction Of Response ◽

Technological Advances ◽

A Cell ◽

Cell Expression ◽

Cell Cell

Abstract. Motivation: Recent technological advances enable the profiling of spatial single-cell expression data. Such data present a unique opportunity to study cell–cell interactions and the signaling genes that mediate them. However, most current methods for the analysis of these data focus on unsupervised descriptive modeling, making it hard to identify key signaling genes and quantitatively assess their impact. Results: We developed a Mixture of Experts for Spatial Signaling genes Identification (MESSI) method to identify active signaling genes within and between cells. The mixture of experts strategy enables MESSI to subdivide cells into subtypes. MESSI relies on multi-task learning using information from neighboring cells to improve the prediction of response genes within a cell. Applying the method to three spatial single-cell expression datasets, we show that MESSI accurately predicts the levels of response genes, improves upon prior methods, and provides useful biological insights about key signaling genes and subtypes of excitatory neuron cells. Availability and implementation: MESSI is available at: https://github.com/doraadong/MESSI

Machine learning of stem cell identities from single-cell expression data via regulatory network archetypes

10.1101/208470 ◽

2017 ◽

Cited By ~ 2

Author(s):

Patrick S Stumpf ◽

Ben D MacArthur

Keyword(s):

Machine Learning ◽

Stem Cells ◽

Stem Cell ◽

Single Cell ◽

Regulatory Network ◽

Cell Biology ◽

Embryonic Stem ◽

Network Activity ◽

Expression Data ◽

Cell Expression

Abstract. The molecular regulatory network underlying stem cell pluripotency has been intensively studied, and we now have a reliable ensemble model for the ‘average’ pluripotent cell. However, evidence of significant cell-to-cell variability suggests that the activity of this network varies within individual stem cells, leading to differential processing of environmental signals and variability in cell fates. Here, we adapt a method originally designed for face recognition to infer regulatory network patterns within individual cells from single-cell expression data. Using this method we identify three distinct network configurations in cultured mouse embryonic stem cells – corresponding to naïve and formative pluripotent states and an early primitive endoderm state – and associate these configurations with particular combinations of regulatory network activity archetypes that govern different aspects of the cell’s response to environmental stimuli, cell cycle status and core information processing circuitry. These results show how variability in cell identities arises naturally from alterations in underlying regulatory network dynamics and demonstrate how methods from machine learning may be used to better understand single cell biology, and the collective dynamics of cell communities.

Subpopulation identification for single-cell RNA-sequencing data using functional data analysis

10.1101/760413 ◽

2019 ◽

Author(s):

Kyungmin Ahn ◽

Hironobu Fujiwara

Keyword(s):

Gene Expression ◽

Data Analysis ◽

Single Cell ◽

Gene Expression Data ◽

Functional Data Analysis ◽

Functional Data ◽

Clustering Algorithms ◽

Expression Data ◽

Clustering Methods ◽

Single Cell Rna Sequencing

Abstract. Background: In single-cell RNA-sequencing (scRNA-seq) data analysis, a number of statistical tools in multivariate data analysis (MDA) have been developed to help analyze the gene expression data. This MDA approach typically focuses on genes as discrete genomic units and ignores the dependency between the data components. In this paper, we propose a functional data analysis (FDA) approach on scRNA-seq data whereby we consider each cell as a single function. To avoid a large number of dropouts (zero or near-zero values) and reduce the high dimensionality of the data, we first perform a principal component analysis (PCA) and assign PCs to be the amplitude of the function. Then we use the index of PCs directly from PCA for the phase components. This approach allows us to apply FDA clustering methods to scRNA-seq data analysis. Results: To demonstrate the robustness of our method, we apply several existing FDA clustering algorithms to the gene expression data to improve the accuracy of the classification of the cell types against the conventional clustering methods in MDA. As a result, the FDA clustering algorithms achieve superior accuracy on simulated data as well as real data such as human and mouse scRNA-seq data. Conclusions: This new statistical technique enhances the classification performance and ultimately improves the understanding of stochastic biological processes. This new framework provides an essentially different scRNA-seq data analytical approach, which can complement conventional MDA methods. It can be truly effective when current MDA methods cannot detect or uncover the hidden functional nature of the gene expression dynamics.

Analysis of Single-Cell Expression Data of Liver Sinusoidal Endothelial Cells Reveals Strong Variability of F8 Expression Associated with Specific Expression Profile

10.1055/s-0040-1721596 ◽

2020 ◽

Author(s):

Muhammad Ahmer Jamil ◽

Osman El-Maarri

Keyword(s):

Endothelial Cells ◽

Single Cell ◽

Expression Profile ◽

Expression Data ◽

Liver Sinusoidal Endothelial Cells ◽

Sinusoidal Endothelial Cells ◽

Specific Expression ◽

Cell Expression
