K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data

2017 ◽  
Author(s):  
Florian Wagner ◽  
Yun Yan ◽  
Itai Yanai

High-throughput single-cell RNA-Seq (scRNA-Seq) is a powerful approach for studying heterogeneous tissues and dynamic cellular processes. However, compared to bulk RNA-Seq, single-cell expression profiles are extremely noisy, as they only capture a fraction of the transcripts present in the cell. Here, we propose the k-nearest neighbor smoothing (kNN-smoothing) algorithm, designed to reduce noise by aggregating information from similar cells (neighbors) in a computationally efficient and statistically tractable manner. The algorithm is based on the observation that across protocols, the technical noise exhibited by UMI-filtered scRNA-Seq data closely follows Poisson statistics. Smoothing is performed by first identifying the nearest neighbors of each cell in a step-wise fashion, based on partially smoothed and variance-stabilized expression profiles, and then aggregating their transcript counts. We show that kNN-smoothing greatly improves the detection of clusters of cells and co-expressed genes, and clearly outperforms other smoothing methods on simulated data. To accurately perform smoothing for datasets containing highly similar cell populations, we propose the kNN-smoothing 2 algorithm, in which neighbors are determined after projecting the partially smoothed data onto the first few principal components. We show that unlike its predecessor, kNN-smoothing 2 can accurately distinguish between cells from different T cell subsets, and enables their identification in peripheral blood using unsupervised methods. Our work facilitates the analysis of scRNA-Seq data across a broad range of applications, including the identification of cell populations in heterogeneous tissues and the characterization of dynamic processes such as cellular differentiation. Reference implementations of our algorithms can be found at https://github.com/yanailab/knn-smoothing.
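The step-wise procedure described in this abstract can be illustrated with a simplified sketch. It is not the reference implementation from the linked repository. The function names (`freeman_tukey`, `knn_smooth_sketch`), the doubling schedule for the number of neighbors, the median normalization, and the use of the Freeman-Tukey transform as the variance-stabilizing step are assumptions made for illustration. The code also assumes that every cell has at least one count.

```python
import numpy as np


def freeman_tukey(x):
    # Variance-stabilizing transform for Poisson-like counts (assumed choice).
    return np.sqrt(x) + np.sqrt(x + 1)


def knn_smooth_sketch(counts, k):
    """Simplified kNN-smoothing sketch.

    counts: genes x cells array of UMI counts.
    k: number of neighbors to aggregate for each cell.
    Returns a genes x cells array of aggregated counts.
    """
    counts = np.asarray(counts, dtype=float)
    smoothed = counts.copy()
    # Assumed schedule: the neighborhood size grows as 1, 3, 7, ... up to k.
    num_steps = int(np.ceil(np.log2(k + 1)))
    for step in range(1, num_steps + 1):
        k_step = min(2 ** step - 1, k)
        # Scale each partially smoothed profile to the median total count.
        totals = smoothed.sum(axis=0)
        normalized = smoothed * (np.median(totals) / totals)
        transformed = freeman_tukey(normalized)
        # Squared Euclidean distances between all pairs of cells.
        sq_norms = (transformed ** 2).sum(axis=0)
        dist2 = sq_norms[:, None] + sq_norms[None, :] - 2 * transformed.T @ transformed
        # Force each cell to rank first in its own neighbor list.
        np.fill_diagonal(dist2, -np.inf)
        order = np.argsort(dist2, axis=1, kind="stable")[:, : k_step + 1]
        # Aggregate the raw counts of each cell and its current neighbors.
        smoothed = np.stack([counts[:, idx].sum(axis=1) for idx in order], axis=1)
    return smoothed
```

Note that the aggregation in each step always sums the original counts rather than the partially smoothed values, so the partially smoothed profiles are used only to decide which cells count as neighbors.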

2016 ◽  
Author(s):  
Aaron T. L. Lun ◽  
John C. Marioni

An increasing number of studies are using single-cell RNA-sequencing (scRNA-seq) to characterize the gene expression profiles of individual cells. One common analysis applied to scRNA-seq data involves detecting differentially expressed (DE) genes between cells in different biological groups. However, many experiments are designed such that the cells to be compared are processed in separate plates or chips, meaning that the groupings are confounded with systematic plate effects. This confounding aspect is frequently ignored in DE analyses of scRNA-seq data. In this article, we demonstrate that failing to consider plate effects in the statistical model results in loss of type I error control. A solution is proposed whereby counts are summed from all cells in each plate and the count sums for all plates are used in the DE analysis. This restores type I error control in the presence of plate effects without compromising detection power in simulated data. Summation is also robust to varying numbers and library sizes of cells on each plate. Similar results are observed in DE analyses of real data where the use of count sums instead of single-cell counts improves specificity and the ranking of relevant genes. This suggests that summation can assist in maintaining statistical rigour in DE analyses of scRNA-seq data with plate effects.
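The summation step described in this abstract amounts to collapsing the single-cell count matrix into one column per plate. A minimal sketch follows, assuming a genes x cells count matrix and one plate label per cell. The function name `sum_counts_by_plate` and the input layout are hypothetical and not taken from the article, and the downstream DE test itself is not shown.

```python
import numpy as np


def sum_counts_by_plate(counts, plate_labels):
    """Sum counts over all cells that share a plate.

    counts: genes x cells array of counts.
    plate_labels: sequence with one plate identifier per cell.
    Returns (plates, summed), where summed is a genes x plates array.
    """
    counts = np.asarray(counts)
    plate_labels = np.asarray(plate_labels)
    plates = np.unique(plate_labels)  # sorted unique plate identifiers
    summed = np.stack(
        [counts[:, plate_labels == plate].sum(axis=1) for plate in plates],
        axis=1,
    )
    return plates, summed
```

The resulting plate-level sums would then be passed to a DE method in place of the single-cell counts, with each plate treated as one sample.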


2016 ◽  
Author(s):  
Bo Wang ◽  
Junjie Zhu ◽  
Emma Pierson ◽  
Daniele Ramazzotti ◽  
Serafim Batzoglou

Single-cell RNA-seq technologies enable high throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. Here, we propose a novel similarity-learning framework, SIMLR (single-cell interpretation via multi-kernel learning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization applications. Benchmarking against state-of-the-art methods for these applications, we used SIMLR to re-analyse seven representative single-cell data sets, including high-throughput droplet-based data sets with tens of thousands of cells. We show that SIMLR greatly improves clustering sensitivity and accuracy, as well as the visualization and interpretability of the data.


2018 ◽  
Author(s):  
Fan Zhang ◽  
Kevin Wei ◽  
Kamil Slowikowski ◽  
Chamith Y. Fonseka ◽  
Deepak A. Rao ◽  
...  

To define the cell populations in rheumatoid arthritis (RA) driving joint inflammation, we applied single-cell RNA-seq (scRNA-seq), mass cytometry, bulk RNA-seq, and flow cytometry to sorted T cells, B cells, monocytes, and fibroblasts from 51 synovial tissue RA and osteoarthritis (OA) patient samples. Applying an integrated computational strategy based on canonical correlation analysis to 5,452 scRNA-seq profiles, we identified 18 unique cell populations. Combining mass cytometry and transcriptomics revealed cell states expanded in RA synovia: THY1+HLAhigh sublining fibroblasts (OR=33.8), IL1B+ pro-inflammatory monocytes (OR=7.8), CD11c+T-bet+ autoimmune-associated B cells (OR=5.7), and PD-1+Tph/Tfh (OR=3.0). We also defined CD8+ T cell subsets characterized by GZMK+, GZMB+, and GNLY+ expression. Using bulk and single-cell data, we mapped inflammatory mediators to source cell populations, for example attributing IL6 production to THY1+HLAhigh fibroblasts and naïve B cells, and IL1B to pro-inflammatory monocytes. These populations are potentially key mediators of RA pathogenesis.


2021 ◽  
Author(s):  
Maryam Zand ◽  
Jianhua Ruan

Single-cell RNA sequencing (scRNAseq) offers an unprecedented potential for scrutinizing complex biological systems at single cell resolution. One of the most important applications of scRNAseq is to cluster cells into groups of similar expression profiles, which allows unsupervised identification of novel cell subtypes. While many clustering algorithms have been tested towards this goal, graph-based algorithms appear to be the most effective, due to their ability to accommodate the sparsity of the data, as well as the complex topology of the cell population. An integral part of almost all such clustering methods is the construction of a k-nearest-neighbor (KNN) network, and the choice of k, implicitly or explicitly, can have a profound impact on the density distribution of the graph and the structure of the resulting clusters, as well as the resolution of clusters that one can successfully identify from the data. In this work, we propose a fairly simple but robust approach to estimate the best k for constructing the KNN graph while simultaneously identifying the optimal clustering structure from the graph. Our method, named scQcut, employs a topology-based criterion to guide the construction of the KNN graph, and then applies an efficient modularity-based community discovery algorithm to predict robust cell clusters. The results obtained from applying scQcut on a large number of real and synthetic datasets demonstrated that scQcut, which does not require any user-tuned parameters, outperformed several popular state-of-the-art clustering methods in terms of clustering accuracy and the ability to correctly identify rare cell types. The promising results indicate that an accurate approximation of the parameter k, which determines the topology of the network, is a crucial element of a successful graph-based clustering method to recover the final community structure of the cell population.
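To make the role of k concrete, the following sketch builds a symmetric KNN graph from a small point set and shows how its density changes with k. It illustrates generic KNN graph construction only and is not the scQcut procedure; the function name `knn_graph`, the Euclidean metric, and the symmetrization rule (an edge exists if either point lists the other as a neighbor) are assumptions.

```python
import numpy as np


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbor adjacency matrix.

    points: samples x features array.
    k: number of neighbors per sample.
    Returns a boolean samples x samples adjacency matrix.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    # Squared Euclidean distances between all pairs of samples.
    diff = points[:, None, :] - points[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)
    # Exclude self-matches from the neighbor lists.
    np.fill_diagonal(dist2, np.inf)
    neighbors = np.argsort(dist2, axis=1, kind="stable")[:, :k]
    adjacency = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), neighbors.shape[1])
    adjacency[rows, neighbors.ravel()] = True
    # Keep an edge if either endpoint selected the other.
    return adjacency | adjacency.T


# Two well-separated pairs of points.
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
sparse_graph = knn_graph(points, k=1)  # edges only within each pair
dense_graph = knn_graph(points, k=3)   # every point linked to every other
```

With k=1 the two pairs remain disconnected components, while k=3 merges them into a single fully connected graph, which is the kind of topology change that a data-driven choice of k is meant to control.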


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tracy M. Yamawaki ◽  
Daniel R. Lu ◽  
Daniel C. Ellwanger ◽  
Dev Bhatt ◽  
Paolo Manzanillo ◽  
...  

Background: Elucidation of immune populations with single-cell RNA-seq has greatly benefited the field of immunology by deepening the characterization of immune heterogeneity and leading to the discovery of new subtypes. However, single-cell methods inherently suffer from limitations in the recovery of complete transcriptomes due to the prevalence of cellular and transcriptional dropout events. This issue is often compounded by limited sample availability and limited prior knowledge of heterogeneity, which can confound data interpretation.

Results: Here, we systematically benchmarked seven high-throughput single-cell RNA-seq methods. We prepared 21 libraries under identical conditions of a defined mixture of two human and two murine lymphocyte cell lines, simulating heterogeneity across immune-cell types and cell sizes. We evaluated methods by their cell recovery rate, library efficiency, sensitivity, and ability to recover expression signatures for each cell type. We observed higher mRNA detection sensitivity with the 10x Genomics 5′ v1 and 3′ v3 methods. We demonstrate that these methods have fewer dropout events, which facilitates the identification of differentially-expressed genes and improves the concordance of single-cell profiles to immune bulk RNA-seq signatures.

Conclusion: Overall, our characterization of immune cell mixtures provides useful metrics, which can guide selection of a high-throughput single-cell RNA-seq method for profiling more complex immune-cell heterogeneity usually found in vivo.


2017 ◽  
Vol 37 (17) ◽  
pp. 12-13
Author(s):  
Jennifer Chew ◽  
Adam Bemis ◽  
Ronald Lebofsky ◽  
Anna Quinlan ◽  
Kelly Kaihara

Author(s):  
Emma Dann ◽  
Neil C. Henderson ◽  
Sarah A. Teichmann ◽  
Michael D. Morgan ◽  
John C. Marioni

Cells ◽  
2021 ◽  
Vol 10 (11) ◽  
pp. 3126
Author(s):  
Dominik Saul ◽  
Robyn Laura Kosinsky

The human aging process is associated with molecular changes and cellular degeneration, resulting in a significant increase in cancer incidence with age. Despite their potential correlation, the relationship between cancer- and ageing-related transcriptional changes is largely unknown. In this study, we aimed to analyze aging-associated transcriptional patterns in publicly available bulk mRNA-seq and single-cell RNA-seq (scRNA-seq) datasets for chronic myelogenous leukemia (CML), colorectal cancer (CRC), hepatocellular carcinoma (HCC), lung cancer (LC), and pancreatic ductal adenocarcinoma (PDAC). Indeed, we detected that various aging/senescence-induced genes (ASIGs) were upregulated in malignant diseases compared to healthy control samples. To elucidate the importance of ASIGs during cell development, pseudotime analyses were performed, which revealed a late enrichment of distinct cancer-specific ASIG signatures. Notably, we were able to demonstrate that all cancer entities analyzed in this study comprised cell populations expressing ASIGs. While only minor correlations were detected between ASIGs and transcriptome-wide changes in PDAC, a high proportion of ASIGs was induced in CML, CRC, HCC, and LC samples. These unique cellular subpopulations could serve as a basis for future studies on the role of aging and senescence in human malignancies.

