ZINB-WaVE: A general and flexible method for signal extraction from single-cell RNA-seq data

2017 ◽  
Author(s):  
Davide Risso ◽  
Fanny Perraudeau ◽  
Svetlana Gribkova ◽  
Sandrine Dudoit ◽  
Jean-Philippe Vert

Abstract: Single-cell RNA sequencing (scRNA-seq) is a powerful high-throughput technique that enables researchers to measure genome-wide transcription levels at the resolution of single cells. Because of the low amount of RNA present in a single cell, some genes may fail to be detected even though they are expressed; these genes are usually referred to as dropouts. Here, we present a general and flexible zero-inflated negative binomial model (ZINB-WaVE), which leads to low-dimensional representations of the data that account for zero inflation (dropouts), over-dispersion, and the count nature of the data. We demonstrate, with simulated and real data, that the model and its associated estimation procedure are able to give a more stable and accurate low-dimensional representation of the data than principal component analysis (PCA) and zero-inflated factor analysis (ZIFA), without the need for a preliminary normalization step.
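To make the three ingredients named in this abstract concrete (zero inflation, over-dispersion, count data), here is a minimal sketch of a generic zero-inflated negative binomial probability mass function. It is not the ZINB-WaVE estimation procedure; the function names `nb_pmf` and `zinb_pmf` and the parameter names `mu` (mean), `theta` (inverse dispersion) and `pi` (dropout probability) are illustrative assumptions.

```python
import math


def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu > 0 and inverse dispersion theta > 0."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, theta, pi):
    """Mixture of a point mass at zero (weight pi) and a negative binomial."""
    point_mass = pi if y == 0 else 0.0
    return point_mass + (1.0 - pi) * nb_pmf(y, mu, theta)


if __name__ == "__main__":
    # Hypothetical parameters: the extra mass at zero models dropouts.
    print("P(y = 0) under NB:  ", nb_pmf(0, mu=2.0, theta=1.5))
    print("P(y = 0) under ZINB:", zinb_pmf(0, mu=2.0, theta=1.5, pi=0.3))
```

With `pi = 0` the mixture reduces to the plain negative binomial, so the zero-inflation component can be read as the extra probability of observing a zero beyond what the count distribution alone predicts.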

2017 ◽  
Author(s):  
Zhun Miao ◽  
Ke Deng ◽  
Xiaowo Wang ◽  
Xuegong Zhang

Abstract: Summary: The excessive number of zeros in single-cell RNA-seq data includes “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package, DEsingle, which employs a zero-inflated negative binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy. Availability and Implementation: The R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration. Contact: [email protected] Supplementary information: Supplementary data are available at bioRxiv online.


2019 ◽  
Author(s):  
Christian H. Holland ◽  
Jovan Tanevski ◽  
Jan Gleixner ◽  
Manu P. Kumar ◽  
Elisabetta Mereu ◽  
...  

Abstract: Many tools have been developed to extract functional and mechanistic insight from bulk transcriptome profiling data. With the advent of single-cell RNA sequencing (scRNA-seq), it is in principle possible to do such an analysis for single cells. However, scRNA-seq data has characteristics such as drop-out events, low library sizes and a comparatively large number of samples/cells. It is thus not clear if functional genomics tools established for bulk sequencing can be applied to scRNA-seq in a meaningful way. To address this question, we performed benchmark studies on in silico and in vitro single-cell RNA-seq data. We included the bulk-RNA tools PROGENy, GO enrichment and DoRothEA that estimate pathway and transcription factor (TF) activities, respectively, and compared them against the tools AUCell and metaVIPER, designed for scRNA-seq. For the in silico study we simulated single cells from TF/pathway perturbation bulk RNA-seq experiments. Our simulation strategy guarantees that the information of the original perturbation is preserved while resembling the characteristics of scRNA-seq data. We complemented the in silico data with in vitro scRNA-seq data upon CRISPR-mediated knock-out. Our benchmarks on both the simulated and real data revealed comparable performance to the original bulk data. Additionally, we showed that the TF and pathway activities preserve cell-type specific variability by analysing a mixture sample sequenced with 13 different scRNA-seq protocols. Our analyses suggest that bulk functional genomics tools can be applied to scRNA-seq data, outperforming dedicated single cell tools. Furthermore, we provide a benchmark for further methods development by the community.


2016 ◽  
Author(s):  
Peijie Lin ◽  
Michael Troup ◽  
Joshua W. K. Ho

Most existing dimensionality reduction and clustering packages for single-cell RNA-Seq (scRNA-Seq) data deal with dropouts by heavy modelling and computational machinery. Here we introduce CIDR (Clustering through Imputation and Dimensionality Reduction), an ultrafast algorithm which uses a novel yet very simple ‘implicit imputation’ approach to alleviate the impact of dropouts in scRNA-Seq data in a principled manner. Using a range of simulated and real data, we have shown that CIDR improves the standard principal component analysis and outperforms the state-of-the-art methods, namely t-SNE, ZIFA and RaceID, in terms of clustering accuracy. CIDR typically completes within seconds for processing a data set of hundreds of cells, and minutes for a data set of thousands of cells. CIDR can be downloaded at https://github.org/VCCRI/CIDR.


2020 ◽  
Vol 36 (15) ◽  
pp. 4291-4295
Author(s):  
Philipp Angerer ◽  
David S Fischer ◽  
Fabian J Theis ◽  
Antonio Scialdone ◽  
Carsten Marr

Abstract: Motivation: Dimensionality reduction is a key step in the analysis of single-cell RNA-sequencing data. It produces a low-dimensional embedding for visualization and as a calculation base for downstream analysis. Nonlinear techniques are most suitable to handle the intrinsic complexity of large, heterogeneous single-cell data. However, with no linear relation between gene and embedding coordinate, there is no way to extract the identity of genes driving any cell’s position in the low-dimensional embedding, making it difficult to characterize the underlying biological processes. Results: In this article, we introduce the concepts of local and global gene relevance to compute an equivalent of principal component analysis loadings for non-linear low-dimensional embeddings. Global gene relevance identifies drivers of the overall embedding, while local gene relevance identifies those of a defined sub-region. We apply our method to single-cell RNA-seq datasets from different experimental protocols and to different low-dimensional embedding techniques. This shows our method’s versatility to identify key genes for a variety of biological processes. Availability and implementation: To ensure reproducibility and ease of use, our method is released as part of destiny 3.0, a popular R package for building diffusion maps from single-cell transcriptomic data. It is readily available through Bioconductor. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 7 (8) ◽  
pp. eabe3610
Author(s):  
Conor J. Kearney ◽  
Stephin J. Vervoort ◽  
Kelly M. Ramsbottom ◽  
Izabela Todorovski ◽  
Emily J. Lelliott ◽  
...  

Multimodal single-cell RNA sequencing enables the precise mapping of transcriptional and phenotypic features of cellular differentiation states but does not allow for simultaneous integration of critical posttranslational modification data. Here, we describe SUrface-protein Glycan And RNA-seq (SUGAR-seq), a method that enables detection and analysis of N-linked glycosylation, extracellular epitopes, and the transcriptome at the single-cell level. Integrated SUGAR-seq and glycoproteome analysis identified tumor-infiltrating T cells with unique surface glycan properties that report their epigenetic and functional state.


2019 ◽  
Vol 116 (13) ◽  
pp. 5979-5984 ◽  
Author(s):  
Yahui Ji ◽  
Dongyuan Qi ◽  
Linmei Li ◽  
Haoran Su ◽  
Xiaojie Li ◽  
...  

Extracellular vesicles (EVs) are important intercellular mediators regulating health and diseases. Conventional methods for EV surface marker profiling, which were based on population measurements, masked the cell-to-cell heterogeneity in the quantity and phenotypes of EV secretion. Herein, by using spatially patterned antibody barcodes, we realized multiplexed profiling of single-cell EV secretion from more than 1,000 single cells simultaneously. Applying this platform to profile human oral squamous cell carcinoma (OSCC) cell lines led to a deep understanding of previously undifferentiated single-cell heterogeneity underlying EV secretion. Notably, we observed that the decrement of certain EV phenotypes (e.g., CD63+ EV) was associated with the invasive feature of both OSCC cell lines and primary OSCC cells. We also realized multiplexed detection of EV secretion and cytokine secretion simultaneously from the same single cells to investigate the multidimensional spectrum of cellular communications, from which we resolved tiered functional subgroups with distinct secretion profiles by visualized clustering and principal component analysis. In particular, we found that different cell subgroups dominated EV secretion and cytokine secretion. The technology introduced here enables a comprehensive evaluation of EV secretion heterogeneity at single-cell level, which may become an indispensable tool to complement current single-cell analysis and EV research.


2021 ◽  
Author(s):  
Qing Xie ◽  
Chengong Han ◽  
Victor Jin ◽  
Shili Lin

Single cell Hi-C techniques enable one to study cell to cell variability in chromatin interactions. However, single cell Hi-C (scHi-C) data suffer severely from sparsity, that is, the existence of excess zeros due to insufficient sequencing depth. Complicating things further is the fact that not all zeros are created equal, as some are due to loci truly not interacting because of the underlying biological mechanism (structural zeros), whereas others are indeed due to insufficient sequencing depth (sampling zeros), especially for loci that interact infrequently. Differentiating between structural zeros and sampling zeros is important since correct inference would improve downstream analyses such as clustering and discovery of subtypes. Nevertheless, distinguishing between these two types of zeros has received little attention in the single cell Hi-C literature, where the issue of sparsity has been addressed mainly as a data quality improvement problem. To fill this gap, in this paper, we propose HiCImpute, a Bayesian hierarchical model that goes beyond data quality improvement by also identifying observed zeros that are in fact structural zeros. HiCImpute takes spatial dependencies of scHi-C 2D data structure into account while also borrowing information from similar single cells and bulk data, when such are available. Through an extensive set of analyses of synthetic and real data, we demonstrate the ability of HiCImpute for identifying structural zeros with high sensitivity, and for accurate imputation of dropout values in sampling zeros. Downstream analyses using data improved from HiCImpute yielded much more accurate clustering of cell types compared to using observed data or data improved by several comparison methods. Most significantly, HiCImpute-improved data has led to the identification of subtypes within each of the excitatory neuronal cells of L4 and L5 in the prefrontal cortex.


Blood ◽  
2019 ◽  
Vol 134 (Supplement_1) ◽  
pp. 2520-2520
Author(s):  
Parashar Dhapola ◽  
Mikael Sommarin ◽  
Mohamed Eldeeb ◽  
Amol Ugale ◽  
David Bryder ◽  
...  

Single-cell transcriptomics (scRNA-Seq) has accelerated the investigation of hematopoietic differentiation. Based on scRNA-Seq data, more refined models of lineage determination in stem- and progenitor cells are now available. Despite such advances, characterizing leukemic cells using single-cell approaches remains challenging. The conventional strategies of scRNA-Seq analysis map all cells on the same low dimensional space using approaches like t-SNE and UMAP. However, when used for comparing normal and leukemic cells, such methods are often inadequate as the transcriptome of the leukemic cells has systematically diverged, resulting in irrelevant separation of leukemic subpopulations from their healthy counterpart. Here, we have developed a new computational approach bundled into a tool called Nabo (nabo.readthedocs.io) that has the capacity to directly compare cells that are otherwise unalignable. First, Nabo creates a shared nearest neighbor graph of the reference population, and the heterogeneity of this population is subsequently defined by performing clustering on the graph and calculating a low dimensional representation using t-SNE or UMAP. Nabo then calculates the similarity of incoming cells from a target population to each cell in the reference graph using a modified Canberra metric. The reference cells with higher similarity to the target cells obtain higher mapping scores. The built-in classifier is used to assign each target cell a reference cluster identity. We tested Nabo's accuracy on control datasets and found that Nabo's performance in terms of accuracy and robustness of projection is comparable to state-of-the-art methods. Moreover, Nabo is a generalized domain adaptation algorithm and hence can perform classification of target cells that are arbitrarily dissimilar to reference cells.
Nabo could identify the cell-identity of sorted CD19+ B cells, CD14+ monocytes and CD56+ by projecting these unlabeled cells onto labelled peripheral blood mononuclear cells with an average specificity higher than 0.98. The general applicability of Nabo was demonstrated by successfully integrating pancreatic cells, sequenced in three different studies using different sequencing chemistries with comparable or better accuracy than existing methods. Also, it was conclusively demonstrated that Nabo can predict the identity of human HSPC subpopulations to the same accuracy as can be achieved by established cell-surface markers. Having Nabo at hand, we aimed to uncover the heterogeneity of hematopoietic cells from different stages of AML. Nabo showed that AML cells lacked the heterogeneity of normal CD34+ cells and were devoid of cells with HSC gene signature. A large patient-to-patient variability was found where leukemic cells mapped to distinct stages of myeloid progenitors. To ask whether this variability could reflect differences in leukemia-initiating cell identity, we induced leukemia in murine granulocyte-monocyte-lymphoid progenitors (GMLPs) using an inducible model for MLL-ENL-driven AML. On projection, more than 70% of MLL-ENL-activated cells mapped to a distinct Flt3+ subpopulation present within healthy GMLPs. Statistical validity of this projection was verified using two novel null models for testing cell projections: 1) ablated node model, wherein the mapping strength of target cells is evaluated after removal of high mapping score source nodes, and 2) high entropy features model, which rules out the background noise effect. By separating Flt3+ and Flt3- cells prior to activation of the fusion gene and performing in vitro replating assays, we could demonstrate that Flt3+ GMLPs contained 3-4 fold more leukemia-initiating cells (1/1.34 cells) than Flt3- GMLPs (1/4.89 cells), indicating that leukemia-initiating cells within GMLPs express Flt3.
Taken together, Nabo represents a robust cell projection strategy for relevant analysis of scRNA-Seq data that permits an interpretable inference of cross-population relationships. Nabo is designed to compare disparate cellular populations by using the heterogeneity of one population as a point of reference allowing for cell-type specification even following perturbations that have resulted in large molecular changes to the cells of interest. As such, Nabo has critical implications for the delineation of leukemia heterogeneity and the identification of leukemia-initiating cell populations. Disclosures: No relevant conflicts of interest to declare.
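The similarity step described in this abstract can be illustrated with the standard Canberra distance. This is only a sketch: the abstract says Nabo uses a modified Canberra metric whose details are not given here, so the code below shows the unmodified textbook form, and the helper `nearest_reference` and the toy expression vectors are hypothetical.

```python
def canberra(x, y):
    """Standard Canberra distance; coordinates where both values are 0 contribute 0."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def nearest_reference(target, reference):
    """Index of the reference cell with the smallest Canberra distance to target."""
    return min(range(len(reference)), key=lambda i: canberra(target, reference[i]))


if __name__ == "__main__":
    # Toy expression vectors (hypothetical values), one row per reference cell.
    reference_cells = [[0, 4, 0], [4, 0, 1]]
    target_cell = [5, 0, 1]
    print("Closest reference cell:", nearest_reference(target_cell, reference_cells))
```

Because each coordinate is normalized by the sum of its magnitudes, the distance gives lowly expressed genes a weight comparable to highly expressed ones, which is one plausible reason a Canberra-type metric suits sparse count data.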


2019 ◽  
Author(s):  
Ning Wang ◽  
Andrew E. Teschendorff

Abstract: Inferring the activity of transcription factors in single cells is a key task to improve our understanding of development and complex genetic diseases. This task is, however, challenging due to the relatively large dropout rate and noisy nature of single-cell RNA-Seq data. Here we present a novel statistical inference framework called SCIRA (Single Cell Inference of Regulatory Activity), which leverages the power of large-scale bulk RNA-Seq datasets to infer high-quality tissue-specific regulatory networks, from which regulatory activity estimates in single cells can be subsequently obtained. We show that SCIRA can correctly infer regulatory activity of transcription factors affected by high technical dropouts. In particular, SCIRA can improve sensitivity by as much as 70% compared to differential expression analysis and current state-of-the-art methods. Importantly, SCIRA can reveal novel regulators of cell-fate in tissue-development, even for cell-types that only make up 5% of the tissue, and can identify key novel tumor suppressor genes in cancer at single cell resolution. In summary, SCIRA will be an invaluable tool for single-cell studies aiming to accurately map activity patterns of key transcription factors during development, and how these are altered in disease.


2020 ◽  
Author(s):  
Mohit Goyal ◽  
Guillermo Serrano ◽  
Ilan Shomorony ◽  
Mikel Hernaez ◽  
Idoia Ochoa

Abstract: Single-cell RNA-seq is a powerful tool in the study of the cellular composition of different tissues and organisms. A key step in the analysis pipeline is the annotation of cell-types based on the expression of specific marker genes. Since manual annotation is labor-intensive and does not scale to large datasets, several methods for automated cell-type annotation have been proposed based on supervised learning. However, these methods generally require feature extraction and batch alignment prior to classification, and their performance may become unreliable in the presence of cell-types with very similar transcriptomic profiles, such as differentiating cells. We propose JIND, a framework for automated cell-type identification based on neural networks that directly learns a low-dimensional representation (latent code) in which cell-types can be reliably determined. To account for batch effects, JIND performs a novel asymmetric alignment in which the transcriptomic profile of unseen cells is mapped onto the previously learned latent space, hence avoiding the need to retrain the model whenever a new dataset becomes available. JIND also learns cell-type-specific confidence thresholds to identify and reject cells that cannot be reliably classified. We show on datasets with and without batch effects that JIND classifies cells more accurately than previously proposed methods while rejecting only a small proportion of cells. Moreover, JIND batch alignment is parallelizable, being more than five or six times faster than Seurat integration. Availability: https://github.com/mohit1997/JIND.
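The idea of cell-type-specific confidence thresholds mentioned in this abstract can be sketched as a simple rejection rule. This is not JIND's code or API: the function `classify_with_rejection`, the `"Unassigned"` label, and the example probabilities and thresholds are all hypothetical, and how JIND actually learns its thresholds is not described here.

```python
def classify_with_rejection(probs, thresholds, labels, rejected="Unassigned"):
    """Return the most probable label, or `rejected` if its probability
    falls below the threshold associated with that label."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    if probs[best] < thresholds[best]:
        return rejected
    return labels[best]


if __name__ == "__main__":
    labels = ["B cell", "T cell"]
    thresholds = [0.8, 0.5]  # hypothetical per-cell-type thresholds
    for probs in ([0.9, 0.1], [0.6, 0.4], [0.4, 0.6]):
        print(probs, "->", classify_with_rejection(probs, thresholds, labels))
```

Using one threshold per cell type, rather than a single global cutoff, lets a classifier be stricter for types that are easily confused with their neighbours and more permissive for well-separated ones.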

