netNMF-sc: Leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis

2019 ◽  
Author(s):  
Rebecca Elyanow ◽  
Bianca Dumitrascu ◽  
Barbara E. Engelhardt ◽  
Benjamin J. Raphael

Abstract
Motivation: Single-cell RNA-sequencing (scRNA-seq) enables high-throughput measurement of RNA expression in individual cells. Due to technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard analysis methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells, leveraging the observation that cells generally occupy a small number of RNA expression states.
Results: We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc combines network-regularized non-negative matrix factorization with a procedure for handling zero inflation in transcript count matrices. The matrix factorization results in a low-dimensional representation of the transcript count matrix, which imputes gene abundance for both zero and non-zero entries and can be used to cluster cells. The network regularization leverages prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be close in the low-dimensional representation. We show that netNMF-sc outperforms existing methods on simulated and real scRNA-seq data, with increasing advantage at higher dropout rates (e.g. above 60%). Furthermore, we show that the results from netNMF-sc, including estimation of gene-gene covariance, are robust to choice of network, with more representative networks leading to greater performance gains.
Availability: netNMF-sc is available at github.com/raphael-group/.
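
The idea of network-regularized non-negative matrix factorization can be illustrated with a minimal sketch. It assumes the common graph-regularized objective ||X - WH||_F^2 + lam * tr(W^T L W), where L = D - A is the Laplacian of a gene-gene network A, and it uses multiplicative updates. This is not the authors' implementation: the function names, the parameter `lam`, the update scheme, and the toy data are all illustrative assumptions, and the zero-inflation handling described in the abstract is omitted.

```python
import numpy as np

def graph_regularized_nmf(X, A, k, lam=1.0, n_iter=200, seed=0, eps=1e-10):
    """Factor X (genes x cells) as W @ H with a network penalty on the gene factors W.

    A is a symmetric, non-negative gene-gene adjacency matrix (assumed input).
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    W = rng.random((n_genes, k)) + 0.1
    H = rng.random((k, n_cells)) + 0.1
    D = np.diag(A.sum(axis=1))  # degree matrix, so that L = D - A
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        W *= (X @ H.T + lam * (A @ W)) / (W @ (H @ H.T) + lam * (D @ W) + eps)
        H *= (W.T @ X) / ((W.T @ W) @ H + eps)
    return W, H

def objective(X, A, W, H, lam):
    """Reconstruction error plus the Laplacian penalty tr(W^T L W)."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.norm(X - W @ H) ** 2 + lam * np.trace(W.T @ L @ W)

# Toy data: 30 genes x 40 cells with a random sparse gene-gene network.
rng = np.random.default_rng(1)
X = rng.poisson(rng.random((30, 3)) @ rng.random((3, 40)) * 5).astype(float)
upper = np.triu((rng.random((30, 30)) < 0.1).astype(float), 1)
A = upper + upper.T

W1, H1 = graph_regularized_nmf(X, A, k=3, lam=1.0, n_iter=1)
W, H = graph_regularized_nmf(X, A, k=3, lam=1.0, n_iter=200)
X_imputed = W @ H  # dense low-rank estimate, including entries that were zero in X
```

The penalty term is small when genes connected in A have similar rows in W, which is one way to encourage interacting genes to be close in the low-dimensional representation.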

2020 ◽  
Author(s):  
Jinjin Tian ◽  
Jiebiao Wang ◽  
Kathryn Roeder

Abstract
Motivation: Gene-gene co-expression networks (GCN) are of biological interest for the useful information they provide for understanding gene-gene interactions. The advent of single-cell RNA-sequencing allows us to examine more subtle gene co-expression occurring within a cell type. Many imputation and denoising methods have been developed to deal with the technical challenges observed in single-cell data; meanwhile, several simulators have been developed for benchmarking and assessing these methods. Most of these simulators, however, either do not incorporate gene co-expression or generate co-expression in an inconvenient manner.
Results: Therefore, with the focus on gene co-expression, we propose a new simulator, ESCO, which adopts the idea of the copula to impose gene co-expression, while preserving the strengths of available simulators, which perform well at simulating marginal gene expression. Using ESCO, we assess the performance of imputation methods on GCN recovery and find that imputation generally helps GCN recovery when the data are not too sparse, and the ensemble imputation method works best among leading methods. In contrast, imputation fails to help in the presence of an excessive fraction of zero counts, where simple data aggregating methods are a better choice. These findings are further verified with mouse and human brain cell data.
Availability: The ESCO implementation is available as R package SplatterESCO (https://github.com/JINJINT/SplatterESCO).
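
The copula idea can be sketched as follows: draw correlated Gaussian variables, map them to uniforms with the normal CDF, then push each uniform through the quantile function of the desired marginal count distribution. The sketch below assumes a Gaussian copula with Poisson marginals for simplicity. The function names, the choice of marginal, and all parameter values are illustrative assumptions and are not taken from ESCO.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal CDF applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def poisson_quantile(u, mean, max_count=500):
    """Smallest count k whose Poisson CDF is at least u (table lookup up to max_count)."""
    ks = np.arange(max_count + 1)
    log_fact = np.array([math.lgamma(k + 1.0) for k in ks])
    cdf = np.cumsum(np.exp(ks * math.log(mean) - mean - log_fact))
    return np.minimum(np.searchsorted(cdf, u), max_count)

def simulate_coexpressed_counts(corr, means, n_cells, seed=0):
    """Cells x genes count matrix with Poisson marginals and copula-induced dependence."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(means)), corr, size=n_cells)
    u = normal_cdf(z)  # each column is uniform on (0, 1) but columns stay dependent
    return np.column_stack(
        [poisson_quantile(u[:, j], m) for j, m in enumerate(means)]
    )

corr = np.array([[1.0, 0.8], [0.8, 1.0]])  # assumed latent gene-gene correlation
counts = simulate_coexpressed_counts(corr, means=[5.0, 20.0], n_cells=5000)
observed_corr = np.corrcoef(counts.T)[0, 1]
```

Because the marginals and the dependence structure are specified separately, the per-gene expression model can be changed without touching the co-expression step.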


2021 ◽  
Author(s):  
Tara Chari ◽  
Joeyta Banerjee ◽  
Lior Pachter

Dimensionality reduction is standard practice for filtering noise and identifying relevant dimensions in large-scale data analyses. In biology, single-cell expression studies almost always begin with reduction to two or three dimensions to produce 'all-in-one' visuals of the data that are amenable to the human eye, and these are subsequently used for qualitative and quantitative analysis of cell relationships. However, there is little theoretical support for this practice. We examine the theoretical and practical implications of low-dimensional embedding of single-cell data, and find extensive distortions incurred on the global and local properties of biological patterns relative to the high-dimensional, ambient space. In lieu of this, we propose semi-supervised dimension reduction to higher dimension, and show that such targeted reduction guided by the metadata associated with single-cell experiments provides useful latent space representations for hypothesis-driven biological discovery.


2019 ◽  
Vol 35 (20) ◽  
pp. 4011-4019 ◽  
Author(s):  
Ghislain Durif ◽  
Laurent Modolo ◽  
Jeff E Mold ◽  
Sophie Lambert-Lacroix ◽  
Franck Picard

Abstract
Motivation: The development of high-throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. In addition, the cell-to-cell variability is high, with a low proportion of cells expressing the same genes at the same time/level. These emerging patterns are very challenging from a statistical point of view, especially when summarizing single-cell expression data. Principal component analysis (PCA) is a powerful tool for high-dimensional data representation that searches for the latent directions capturing the most variability in the data. Unfortunately, classical PCA is based on Euclidean distance and projections that work poorly in the presence of over-dispersed count data with dropout events, such as single-cell expression data.
Results: We propose a probabilistic Count Matrix Factorization (pCMF) approach for single-cell expression data analysis that relies on a sparse Gamma-Poisson factor model. This hierarchical model is inferred using a variational EM algorithm. It jointly builds a low-dimensional representation of cells and genes. We show how this probabilistic framework induces a geometry that is suitable for single-cell data visualization, and produces a compression of the data that is very powerful for clustering purposes. Our method is compared against other standard representation methods such as t-SNE, and we illustrate its performance for the representation of single-cell expression data.
Availability and implementation: Our work is implemented in the pCMF R-package (https://github.com/gdurif/pCMF).
Supplementary information: Supplementary data are available at Bioinformatics online.
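
To make the Gamma-Poisson factor model concrete, the following sketch samples from a simplified generative process: Gamma-distributed cell and gene factors, Poisson counts whose rates are the factor products, and independent dropout. It only illustrates the kind of data such a model describes. It is not the pCMF package, it omits the sparsity prior and the variational EM inference, and every name and hyperparameter is an assumption.

```python
import numpy as np

def simulate_gamma_poisson(n_cells=200, n_genes=50, k=3, dropout=0.3, seed=0):
    """Sample a cells x genes count matrix from a Gamma-Poisson factor model."""
    rng = np.random.default_rng(seed)
    U = rng.gamma(shape=2.0, scale=1.0, size=(n_cells, k))  # cell factors
    V = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, k))  # gene factors
    rate = U @ V.T                                          # Poisson rate per entry
    counts = rng.poisson(rate)
    keep = rng.random((n_cells, n_genes)) >= dropout        # False marks a dropout event
    return counts * keep, U, V

X_clean, _, _ = simulate_gamma_poisson(dropout=0.0)
X_sparse, U, V = simulate_gamma_poisson(dropout=0.5)
zero_fraction_clean = np.mean(X_clean == 0)
zero_fraction_sparse = np.mean(X_sparse == 0)
```

Mixing over Gamma-distributed rates makes the counts over-dispersed relative to a plain Poisson, which is the property that motivates replacing Euclidean PCA with a count-based factorization.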


2015 ◽  
Vol 2015 ◽  
pp. 1-9 ◽  
Author(s):  
Wen Dai ◽  
Xi Liu ◽  
Yibo Gao ◽  
Lin Chen ◽  
Jianglong Song ◽  
...  

There has been rising interest in the discovery of novel drug indications because of the high cost of introducing new drugs. Many computational techniques have been proposed to detect potential drug-disease associations based on the creation of explicit profiles of drugs and diseases, while little research takes advantage of the immense accumulation of interaction data. In this work, we propose a matrix factorization model based on known drug-disease associations to predict novel drug indications. In addition, genomic space is also integrated into our framework. The introduction of genomic space, which includes drug-gene interactions, disease-gene interactions, and gene-gene interactions, is aimed at providing molecular biological information for prediction of drug-disease associations. The rationale lies in our belief that an association between a drug and a disease has its evidence in the interactome network of genes. Experiments show that the integration of genomic space is indeed effective. Drugs, diseases, and genes are described with feature vectors of the same dimension, which are retrieved from the interaction data. Then a matrix factorization model is set up to quantify the association between drugs and diseases. Finally, we use the matrix factorization model to predict novel indications for drugs.
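
One way to read this setup is as a joint factorization in which drugs, diseases, and genes share a latent space of the same dimension. The sketch below fits such a model by plain gradient descent on four squared-error terms, one for each interaction matrix. It is an illustrative assumption about how the pieces could fit together, not the model from the paper: the loss, the optimizer, the function names, and the random toy data are all hypothetical.

```python
import numpy as np

def joint_loss(R, S, T, N, P, Q, G):
    """Sum of squared errors over all four interaction matrices."""
    return (np.sum((R - P @ Q.T) ** 2) + np.sum((S - P @ G.T) ** 2)
            + np.sum((T - Q @ G.T) ** 2) + np.sum((N - G @ G.T) ** 2))

def factorize(R, S, T, N, k=3, lr=0.005, n_iter=1000, seed=0):
    """R: drug-disease, S: drug-gene, T: disease-gene, N: symmetric gene-gene."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.random((R.shape[0], k))  # drug feature vectors
    Q = 0.1 * rng.random((R.shape[1], k))  # disease feature vectors
    G = 0.1 * rng.random((N.shape[0], k))  # gene feature vectors
    initial = joint_loss(R, S, T, N, P, Q, G)
    for _ in range(n_iter):
        E_R, E_S = R - P @ Q.T, S - P @ G.T
        E_T, E_N = T - Q @ G.T, N - G @ G.T
        grad_P = -2 * (E_R @ Q + E_S @ G)
        grad_Q = -2 * (E_R.T @ P + E_T @ G)
        grad_G = -2 * (E_S.T @ P + E_T.T @ Q + (E_N + E_N.T) @ G)
        P, Q, G = P - lr * grad_P, Q - lr * grad_Q, G - lr * grad_G
    return P, Q, G, initial

# Toy binary interaction data: 6 drugs, 5 diseases, 8 genes.
rng = np.random.default_rng(2)
R = (rng.random((6, 5)) < 0.3).astype(float)
S = (rng.random((6, 8)) < 0.3).astype(float)
T = (rng.random((5, 8)) < 0.3).astype(float)
upper = np.triu((rng.random((8, 8)) < 0.3).astype(float), 1)
N = upper + upper.T

P, Q, G, initial_loss = factorize(R, S, T, N)
final_loss = joint_loss(R, S, T, N, P, Q, G)
scores = P @ Q.T  # higher score suggests a more plausible drug-disease association
```

Candidate new indications would then be the unobserved drug-disease pairs with the highest entries in `scores`.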


2021 ◽  
Vol 3 (4) ◽  
Author(s):  
Gerard A Bouland ◽  
Ahmed Mahfouz ◽  
Marcel J T Reinders

Abstract Single-cell RNA sequencing data is characterized by a large number of zero counts, yet there is growing evidence that these zeros reflect biological variation rather than technical artifacts. We propose to use binarized expression profiles to identify the effects of biological variation in single-cell RNA sequencing data. Using 16 publicly available and simulated datasets, we show that a binarized representation of single-cell expression data accurately represents biological variation and reveals the relative abundance of transcripts more robustly than counts.
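
Binarization itself is a one-line transformation: every count above zero becomes 1. The sketch below applies it to a small made-up count matrix and computes the per-gene detection rate, i.e. the fraction of cells in which each gene is observed. The matrix values and variable names are illustrative only.

```python
import numpy as np

# Toy count matrix: 3 cells (rows) x 4 genes (columns).
counts = np.array([
    [0, 3, 0, 12],
    [5, 0, 0, 1],
    [0, 0, 0, 7],
])

binary = (counts > 0).astype(np.int8)  # 1 if the gene was detected in the cell
detection_rate = binary.mean(axis=0)   # fraction of cells expressing each gene
```

The binarized profile discards count magnitude and keeps only presence or absence, so the zeros themselves become the signal rather than something to be imputed away.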


2021 ◽  
Author(s):  
Shreya Mishra ◽  
Smriti Chawla ◽  
Neetesh Pandey ◽  
Debarka SenGupta ◽  
Vibhor Kumar

Abstract The true benefits of large datasets of single-cell epigenome and transcriptome profiles can be realized only when they are searchable, so that individual unannotated cells can be annotated. Matching a single-cell epigenome profile to a large pool of reference cells remains a challenge and is largely unexplored. Here, we introduce scEpiSearch, which enables a user to query single-cell open-chromatin read-count matrices for comparison against a large pool of single-cell expression and open-chromatin profiles from human and mouse cells (∼ 3.5 million cells). Besides providing accurate search in a short time and scalable visualization of results for multiple query cells, scEpiSearch also provides a low-dimensional representation of single-cell open-chromatin profiles. It outperformed many other methods in terms of correct low-dimensional embedding of single-cell open-chromatin profiles originating from different platforms and species. Here we show how scEpiSearch is unique in providing several facilities to assist researchers in the analysis of single-cell open-chromatin profiles to infer cellular state, lineage, potency and representative genes.


Algorithms ◽  
2019 ◽  
Vol 12 (3) ◽  
pp. 62 ◽  
Author(s):  
Zhonglin Ye ◽  
Haixing Zhao ◽  
Ke Zhang ◽  
Yu Zhu

Network representation learning is a key research field in network data mining. In this paper, we propose a novel multi-view network representation algorithm (MVNR), which embeds multi-scale relations of network vertices into the low-dimensional representation space. In contrast to existing approaches, MVNR explicitly encodes higher-order information using k-step networks. In addition, we introduce the matrix forest index as a kind of network feature, which can be applied to balance the representation weights of different network views. We also examine the relationship between MVNR and several well-known methods, including DeepWalk, node2vec and GraRep. We conduct our experiments on several real-world citation datasets and demonstrate that MVNR outperforms some new approaches using neural matrix factorization. Specifically, we demonstrate the efficiency of MVNR on network classification, visualization and link prediction tasks.
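
The two ingredients named here can be computed directly on a small graph. The sketch below assumes that a k-step network is represented by the k-th power of the row-normalized adjacency matrix (the convention used by GraRep) and that the matrix forest index is (I + L)^{-1}, with L the graph Laplacian. How MVNR combines these quantities is not stated in the abstract, so the code only computes them; the function names and the example graph are illustrative.

```python
import numpy as np

def k_step_transitions(A, max_k):
    """Return [P, P^2, ..., P^max_k] for the row-normalized adjacency matrix P."""
    P = A / A.sum(axis=1, keepdims=True)
    views, Pk = [], np.eye(len(A))
    for _ in range(max_k):
        Pk = Pk @ P  # probability of reaching each vertex in exactly k steps
        views.append(Pk.copy())
    return views

def matrix_forest_index(A):
    """Matrix forest index (I + L)^{-1}, where L = D - A is the graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(np.eye(len(A)) + L)

# Path graph on four vertices: 0 - 1 - 2 - 3.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

views = k_step_transitions(A, max_k=3)
forest = matrix_forest_index(A)
```

In this example `views[1][0, 2]` is 0.5: starting from vertex 0, a walker must move to vertex 1 and then chooses vertex 2 with probability one half. Each k-step matrix captures a different scale of vertex relations, which is what makes them usable as separate views.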


2021 ◽  
Vol 25 (2) ◽  
pp. 339-357
Author(s):  
Guowang Du ◽  
Lihua Zhou ◽  
Kevin Lü ◽  
Haiyan Ding

Multi-view clustering aims to group similar samples into the same clusters and dissimilar samples into different clusters by integrating heterogeneous information from multi-view data. Non-negative matrix factorization (NMF) has been widely applied to multi-view clustering owing to its interpretability. However, most NMF-based algorithms only factorize multi-view data based on the shallow structure, neglecting complex hierarchical and heterogeneous information in multi-view data. In this paper, we propose a deep multiple non-negative matrix factorization (DMNMF) framework based on AutoEncoder for multi-view clustering. DMNMF consists of multiple Encoder Components and Decoder Components with deep structures. Each pair of Encoder Component and Decoder Component is used to hierarchically factorize the input data from one view to capture its hierarchical information, and all Encoder and Decoder Components are integrated at an abstract level to learn a common low-dimensional representation that combines the heterogeneous information across multi-view data. Furthermore, graph regularizers are introduced to preserve the local geometric information of each view. To optimize the proposed framework, an iterative updating scheme is developed. In addition, the corresponding algorithm, called MVC-DMNMF, is proposed and implemented. Extensive experiments on six benchmark datasets have been conducted, and the experimental results demonstrate the superior performance of our proposed MVC-DMNMF for multi-view clustering compared to other baseline algorithms.


2017 ◽  
Author(s):  
G. Durif ◽  
L. Modolo ◽  
J. E. Mold ◽  
S. Lambert-Lacroix ◽  
F. Picard

Abstract
Motivation: The development of high-throughput single-cell sequencing technologies now allows the investigation of the population diversity of cellular transcriptomes. The expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. In addition, the cell-to-cell variability is high, with a low proportion of cells expressing the same genes at the same time/level. These emerging patterns are very challenging from a statistical point of view, especially when summarizing single-cell expression data. PCA is a powerful tool for high-dimensional data representation that searches for the latent directions capturing the most variability in the data. Unfortunately, classical PCA is based on Euclidean distance and projections that work poorly in the presence of over-dispersed count data with dropout events, such as single-cell expression data.
Results: We propose a probabilistic Count Matrix Factorization (pCMF) approach for single-cell expression data analysis that relies on a sparse Gamma-Poisson factor model. This hierarchical model is inferred using a variational EM algorithm. It jointly builds a low-dimensional representation of cells and genes. We show how this probabilistic framework induces a geometry that is suitable for single-cell data visualization, and produces a compression of the data that is very powerful for clustering purposes. Our method is compared against other standard representation methods such as t-SNE, and we illustrate its performance for the representation of single-cell expression (scRNA-seq) data.
Availability: Our work is implemented in the pCMF R-package.


2020 ◽  
Vol 30 (2) ◽  
pp. 195-204 ◽  
Author(s):  
Rebecca Elyanow ◽  
Bianca Dumitrascu ◽  
Barbara E. Engelhardt ◽  
Benjamin J. Raphael
