Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning

Mapping Intimacies ◽

10.1101/052225 ◽

2016 ◽

Cited By ~ 7

Author(s):

Bo Wang ◽

Junjie Zhu ◽

Emma Pierson ◽

Daniele Ramazzotti ◽

Serafim Batzoglou

Keyword(s):

Gene Expression ◽

Single Cell ◽

High Throughput ◽

Cell Populations ◽

Gene Expression Measurement ◽

Data Sets ◽

Similarity Learning ◽

Rna Seq ◽

High Level ◽

Cell Data

Abstract Single-cell RNA-seq technologies enable high-throughput gene expression measurement of individual cells, and allow the discovery of heterogeneity within cell populations. Measurement of cell-to-cell gene expression similarity is critical to identification, visualization and analysis of cell populations. However, single-cell data introduce challenges to conventional measures of gene expression similarity because of the high level of noise, outliers and dropouts. Here, we propose a novel similarity-learning framework, SIMLR (single-cell interpretation via multi-kernel learning), which learns an appropriate distance metric from the data for dimension reduction, clustering and visualization applications. Benchmarking against state-of-the-art methods for these applications, we used SIMLR to re-analyse seven representative single-cell data sets, including high-throughput droplet-based data sets with tens of thousands of cells. We show that SIMLR greatly improves clustering sensitivity and accuracy, as well as the visualization and interpretability of the data.
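
To make the multi-kernel idea in this abstract concrete, the sketch below averages several Gaussian kernels of different bandwidths into one cell-to-cell similarity matrix. It is an illustration only: SIMLR learns its kernel weights and similarity matrix from the data, whereas the function name `multi_kernel_similarity`, the bandwidth values, and the fixed uniform weights here are assumptions and not the authors' implementation.

```python
import math

def multi_kernel_similarity(cells, bandwidths, weights=None):
    """Combine several Gaussian kernels into one similarity matrix.

    cells: list of expression vectors (one list of floats per cell).
    bandwidths: Gaussian kernel widths (hypothetical values chosen by the caller).
    weights: one weight per kernel; uniform if omitted (an assumption,
             a learned weighting would replace this).
    """
    if weights is None:
        weights = [1.0 / len(bandwidths)] * len(bandwidths)
    n = len(cells)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between cells i and j
            d2 = sum((a - b) ** 2 for a, b in zip(cells[i], cells[j]))
            # Weighted sum of Gaussian kernels evaluated at that distance
            sim[i][j] = sum(
                w * math.exp(-d2 / (2.0 * s * s))
                for w, s in zip(weights, bandwidths)
            )
    return sim
```

Using several bandwidths at once avoids committing to a single distance scale, which is the motivation for a multi-kernel approach when noise levels differ between data sets.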


SOMSC: Self-Organization-Map for High-Dimensional Single-Cell Data of Cellular States and Their Transitions

10.1101/124735 ◽

2017 ◽

Author(s):

Tao Peng ◽

Qing Nie

Keyword(s):

Gene Expression ◽

Single Cell ◽

Embryo Development ◽

Single Cells ◽

High Dimensional ◽

Data Sets ◽

Rna Seq ◽

Expression Levels ◽

Cell Lineages ◽

Cell Data

Measurements of gene expression levels for multiple genes in single cells provide a powerful approach to study heterogeneity of cell populations and cellular plasticity. While the expression levels of multiple genes in each cell are available in such data, the potential connections among the cells (e.g. the lineage relationship) are not directly evident from the measurement. Classifying cellular states and identifying transitions among those states are challenging due to many factors, including the small number of cells versus the large number of genes collected in the data. In this paper we adapt a classical self-organizing-map approach to single-cell gene expression data, such as those based on qPCR and RNA-seq. In this method (SOMSC), a cellular state map (CSM) is derived and employed to identify cellular states inherited in a population of measured single cells. Cells located in the same basin of the CSM are considered to be in one cellular state, while barriers between the basins provide information on transitions among the cellular states. Consequently, paths of cellular state transitions (e.g. differentiation) and a temporal ordering of the measured single cells are obtained. Applied to a set of synthetic data, two single-cell qPCR data sets and two single-cell RNA-seq data sets for a simulated model of cell differentiation, and systems on the early embryo development, haematopoietic cell lineages, human preimplantation embryo development, and human skeletal muscle myoblasts differentiation, the SOMSC shows good capabilities in identifying cellular states and their transitions in the high-dimensional single-cell data. This approach will have broad applications in studying cell lineages and cellular fate specification.


Missing Data and Technical Variability in Single-Cell RNA- Sequencing Experiments

10.1101/025528 ◽

2015 ◽

Cited By ~ 32

Author(s):

Stephanie C Hicks ◽

F. William Townes ◽

Mingxiang Teng ◽

Rafael A Irizarry

Keyword(s):

Gene Expression ◽

Missing Data ◽

Single Cell ◽

Rna Sequencing ◽

High Throughput ◽

Single Cells ◽

Systematic Errors ◽

Gene Expression Measurement ◽

Rna Seq ◽

Batch Effects

Until recently, high-throughput gene expression technology, such as RNA-Sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-Seq (scRNA-seq) is the most widely used of these technologies, and numerous publications are based on data produced with it. However, RNA-Seq and scRNA-seq data are markedly different. In particular, unlike RNA-Seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a sufficient level to be detected by sequencing technology). Another difference is that the proportion of genes reporting the expression level to be zero varies substantially across single cells compared to RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is being driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies cell-to-cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.
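
The quantity this abstract focuses on, the proportion of genes reported as zero in each cell, is simple to compute from a count matrix. The sketch below shows one way to do so; the function name `zero_fraction_per_cell` and the cells-by-genes list layout are assumptions for illustration and are not taken from the paper.

```python
def zero_fraction_per_cell(counts):
    """Return, for each cell, the fraction of genes with a zero count.

    counts: list of cells, each a list of per-gene counts
            (hypothetical cells-by-genes layout).
    """
    fractions = []
    for cell in counts:
        n_zero = sum(1 for x in cell if x == 0)
        fractions.append(n_zero / len(cell))
    return fractions
```

Comparing these per-cell fractions across batches is one way to check whether variation in the number of zeros tracks a technical factor rather than biology.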


Enhancing droplet-based single-nucleus RNA-seq resolution using the semi-supervised machine learning classifier DIEM

10.1101/786285 ◽

2019 ◽

Cited By ~ 4

Author(s):

Marcus Alvarez ◽

Elior Rahmani ◽

Brandon Jew ◽

Kristina M. Garske ◽

Zong Miao ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Supervised Machine Learning ◽

Data Sets ◽

Rna Seq ◽

Novel Approach ◽

Single Nucleus ◽

Downstream Analysis

Abstract Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. Contrary to single-cell RNA-seq (scRNA-seq), we observe that snRNA-seq is commonly subject to contamination by high amounts of extranuclear background RNA, which can lead to identification of spurious cell types in downstream clustering analyses if overlooked. We present a novel approach to remove debris-contaminated droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: 1) human differentiating preadipocytes in vitro, 2) fresh mouse brain tissue, and 3) human frozen adipose tissue (AT) from six individuals. All three data sets showed various degrees of extranuclear RNA contamination. We observed that existing methods fail to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq data, we also successfully applied DIEM to single-cell data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data fast and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.
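
The likelihood-based idea described above, modelling the expression distribution of each group and estimating it with EM, can be sketched as a small multinomial mixture. This is a generic, fully unsupervised EM written for illustration; it is not the DIEM code, and the function name `em_multinomial_mixture`, the initial profiles, and the pseudocount are all assumptions. The semi-supervised aspects of DIEM are not reproduced here.

```python
import math

def em_multinomial_mixture(counts, init_profiles, n_iter=50, pseudocount=1e-6):
    """Fit a mixture of multinomials to droplet count vectors with EM.

    counts: list of droplets, each a list of per-gene counts.
    init_profiles: one strictly positive gene-probability vector per group
                   (hypothetical starting values supplied by the caller).
    Returns (weights, profiles, responsibilities).
    """
    k = len(init_profiles)
    n_genes = len(counts[0])
    profiles = [list(p) for p in init_profiles]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each group for each droplet
        resp = []
        for droplet in counts:
            logp = [
                math.log(weights[c])
                + sum(x * math.log(profiles[c][g]) for g, x in enumerate(droplet))
                for c in range(k)
            ]
            top = max(logp)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate mixing weights and gene profiles
        for c in range(k):
            # Floor the weight so its logarithm stays defined
            weights[c] = max(sum(r[c] for r in resp) / len(counts), 1e-12)
            gene_sums = [
                sum(r[c] * d[g] for r, d in zip(resp, counts)) + pseudocount
                for g in range(n_genes)
            ]
            norm = sum(gene_sums)
            profiles[c] = [s / norm for s in gene_sums]
    return weights, profiles, resp
```

Droplets can then be assigned to the group with the highest responsibility, and those assigned to a debris-like group filtered out before clustering.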


LTMG: A novel statistical modeling of transcriptional expression states in single-cell RNA-Seq data

10.1101/430009 ◽

2018 ◽

Cited By ~ 1

Author(s):

Changlin Wan ◽

Wennan Chang ◽

Yu Zhang ◽

Fenil Shah ◽

Xiaoyu Lu ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Single Cells ◽

Cell Types ◽

R Package ◽

Data Sets ◽

Rna Seq ◽

Cell Functions ◽

Transcriptional Regulatory ◽

A Cell

Abstract A key challenge in modeling single-cell RNA-seq (scRNA-seq) data is to capture the diverse gene expression states regulated by different transcriptional regulatory inputs across single cells, which is further complicated by a large number of observed zero and low expressions. We developed a left truncated mixture Gaussian (LTMG) model that stems from the kinetic relationships between the transcriptional regulatory inputs and metabolism of mRNA and gene expression abundance in a cell. LTMG infers the expression multi-modalities across single cell entities, representing a gene’s diverse expression states; meanwhile the dropouts and low expressions are treated as left truncated, specifically representing an expression state that is under suppression. We demonstrated that LTMG has a significantly better goodness of fit on an extensive number of single-cell data sets, compared with three other state-of-the-art models. In addition, our systems kinetic approach of handling the low and zero expressions and correctness of the identified multimodality are validated on several independent experimental data sets. Application to data from complex tissues demonstrated the capability of LTMG in extracting varied expression states specific to cell types or cell functions. Based on LTMG, a differential gene expression test and a co-regulation module identification method, namely LTMG-DGE and LTMG-GCR, are further developed. We experimentally validated that LTMG-DGE has higher sensitivity and specificity in detecting differentially expressed genes, compared with five other popular methods, and that LTMG-GCR is capable of retrieving the gene co-regulation modules corresponding to perturbed transcriptional regulations. A user-friendly R package with all the analysis power is available at https://github.com/zy26/LTMGSCA.
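
To illustrate what treating low expressions as "left truncated" can mean in a likelihood, the sketch below scores a single Gaussian component in which every value at or below a cutoff contributes only the probability mass below that cutoff. This is one common way to handle such values and is an assumption about the general idea, not the LTMG package's code; the function names and the single-component simplification are hypothetical, and the full model is a mixture of several such components.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def left_cut_log_likelihood(values, mu, sigma, cutoff):
    """Log-likelihood of one Gaussian component with a left cutoff.

    Values above `cutoff` contribute the Gaussian log-density.
    Values at or below `cutoff` contribute log P(X <= cutoff), because
    only the fact that they fall below the cutoff is treated as known.
    Assumes the mass below the cutoff is not numerically zero.
    """
    log_norm = math.log(sigma * math.sqrt(2.0 * math.pi))
    below_mass = normal_cdf((cutoff - mu) / sigma)
    total = 0.0
    for v in values:
        if v <= cutoff:
            total += math.log(below_mass)
        else:
            z = (v - mu) / sigma
            total += -0.5 * z * z - log_norm
    return total
```

A component whose mean lies below the cutoff places most of its mass in the unobserved region, which corresponds to the suppressed expression state mentioned in the abstract.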


K-nearest neighbor smoothing for high-throughput single-cell RNA-Seq data

10.1101/217737 ◽

2017 ◽

Cited By ~ 39

Author(s):

Florian Wagner ◽

Yun Yan ◽

Itai Yanai

Keyword(s):

Single Cell ◽

High Throughput ◽

Nearest Neighbor ◽

Expression Profiles ◽

Simulated Data ◽

T Cell Subsets ◽

Cell Populations ◽

Rna Seq ◽

K Nearest Neighbor ◽

Heterogeneous Tissues

High-throughput single-cell RNA-Seq (scRNA-Seq) is a powerful approach for studying heterogeneous tissues and dynamic cellular processes. However, compared to bulk RNA-Seq, single-cell expression profiles are extremely noisy, as they only capture a fraction of the transcripts present in the cell. Here, we propose the k-nearest neighbor smoothing (kNN-smoothing) algorithm, designed to reduce noise by aggregating information from similar cells (neighbors) in a computationally efficient and statistically tractable manner. The algorithm is based on the observation that across protocols, the technical noise exhibited by UMI-filtered scRNA-Seq data closely follows Poisson statistics. Smoothing is performed by first identifying the nearest neighbors of each cell in a step-wise fashion, based on partially smoothed and variance-stabilized expression profiles, and then aggregating their transcript counts. We show that kNN-smoothing greatly improves the detection of clusters of cells and co-expressed genes, and clearly outperforms other smoothing methods on simulated data. To accurately perform smoothing for datasets containing highly similar cell populations, we propose the kNN-smoothing 2 algorithm, in which neighbors are determined after projecting the partially smoothed data onto the first few principal components. We show that unlike its predecessor, kNN-smoothing 2 can accurately distinguish between cells from different T cell subsets, and enables their identification in peripheral blood using unsupervised methods. Our work facilitates the analysis of scRNA-Seq data across a broad range of applications, including the identification of cell populations in heterogeneous tissues and the characterization of dynamic processes such as cellular differentiation. Reference implementations of our algorithms can be found at https://github.com/yanailab/knn-smoothing.
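
The core of the smoothing step described above, finding each cell's nearest neighbors on variance-stabilized profiles and then summing their raw counts, can be sketched in a few lines. This is a single-pass simplification: the published algorithm identifies neighbors step-wise on partially smoothed data, and the authors' reference implementation is at the URL given in the abstract. The function name `knn_smooth`, the median-depth normalization, and the use of the Freeman-Tukey transform are assumptions made for this illustration.

```python
import math

def knn_smooth(counts, k):
    """Single-pass kNN smoothing sketch for a cells-by-genes count matrix.

    counts: list of cells, each a list of per-gene UMI counts.
    k: number of neighbors to aggregate with each cell.
    Returns a new matrix in which each cell's counts are summed with
    those of its k nearest neighbors.
    """
    n = len(counts)
    totals = [sum(cell) for cell in counts]
    median_total = sorted(totals)[n // 2]  # upper median for even n

    # Scale each cell to the median depth, then variance-stabilize with
    # the Freeman-Tukey transform sqrt(x) + sqrt(x + 1).
    profiles = []
    for cell, total in zip(counts, totals):
        scale = median_total / total if total else 0.0
        profiles.append(
            [math.sqrt(x * scale) + math.sqrt(x * scale + 1.0) for x in cell]
        )

    smoothed = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(profiles[i], profiles[j]))
        group = [i] + others[:k]  # the cell itself plus its k neighbors
        # Aggregate raw counts so the result is still a sum of counts
        smoothed.append(
            [sum(counts[j][g] for j in group) for g in range(len(counts[i]))]
        )
    return smoothed
```

Summing raw counts rather than averaging transformed values keeps the output on a count scale, which fits the abstract's point that the technical noise closely follows Poisson statistics.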


LTMG: a novel statistical modeling of transcriptional expression states in single-cell RNA-Seq data

Nucleic Acids Research ◽

10.1093/nar/gkz655 ◽

2019 ◽

Vol 47 (18) ◽

pp. e111-e111 ◽

Cited By ~ 12

Author(s):

Changlin Wan ◽

Wennan Chang ◽

Yu Zhang ◽

Fenil Shah ◽

Xiaoyu Lu ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Single Cells ◽

Cell Types ◽

R Package ◽

Data Sets ◽

Rna Seq ◽

Mrna Metabolism ◽

Transcriptional Regulatory ◽

User Friendly

Abstract A key challenge in modeling single-cell RNA-seq data is to capture the diversity of gene expression states regulated by different transcriptional regulatory inputs across individual cells, which is further complicated by the large number of observed zero and low expressions. We developed a left truncated mixture Gaussian (LTMG) model from the kinetic relationships of the transcriptional regulatory inputs, mRNA metabolism and abundance in single cells. LTMG infers the expression multi-modalities across single cells; meanwhile, the dropouts and low expressions are treated as left truncated. We demonstrated that LTMG has a significantly better goodness of fit on an extensive number of scRNA-seq data sets, compared with three other state-of-the-art models. Our biological assumption of the low non-zero expressions, rationality of the multimodality setting, and the capability of LTMG in extracting expression states specific to cell types or functions are validated on independent experimental data sets. A differential gene expression test and a co-regulation module identification method are further developed. We experimentally validated that our differential expression test has higher sensitivity and specificity, compared with five other popular methods. The co-regulation analysis is capable of retrieving gene co-regulation modules corresponding to perturbed transcriptional regulations. A user-friendly R package with all the analysis power is available at https://github.com/zy26/LTMGSCA.


Time- and cost-efficient high-throughput transcriptomics enabled by Bulk RNA Barcoding and sequencing

10.1101/256594 ◽

2018 ◽

Cited By ~ 3

Author(s):

Daniel Alpern ◽

Vincent Gardeux ◽

Julie Russeil ◽

Bart Deplancke

Keyword(s):

Gene Expression ◽

Single Cell ◽

High Throughput ◽

Low Cost ◽

Cdna Libraries ◽

List Type ◽

Library Preparation ◽

Rna Seq ◽

Genome Wide ◽

Hands On

Abstract Genome-wide gene expression analyses by RNA sequencing (RNA-seq) have quickly become a standard in molecular biology because of the widespread availability of high-throughput sequencing technologies. While powerful, RNA-seq still has several limitations, including the time and cost of library preparation, which makes it difficult to profile many samples simultaneously. To deal with these constraints, the single-cell transcriptomics field has implemented the early multiplexing principle, making the library preparation of hundreds of samples (cells) markedly more affordable. However, the current standard methods for bulk transcriptomics (such as TruSeq Stranded mRNA) remain expensive, and relatively little effort has been invested to develop cheaper, but equally robust methods. Here, we present a novel approach, Bulk RNA Barcoding and sequencing (BRB-seq), that combines the multiplexing-driven cost-effectiveness of a single-cell RNA-seq workflow with the performance of a bulk RNA-seq procedure. BRB-seq produces 3’ enriched cDNA libraries that exhibit similar gene expression quantification to TruSeq and that maintain this quality, also in terms of number of detected differentially expressed genes, even with low quality RNA samples. We show that BRB-seq is about 25 times less expensive than TruSeq, enabling the generation of ready-to-sequence libraries for up to 192 samples in a day with only 2 hours of hands-on time. We conclude that BRB-seq constitutes a powerful alternative to TruSeq as a standard bulk RNA-seq approach. Moreover, we anticipate that this novel method will eventually replace RT-qPCR-based gene expression screens given its capacity to generate genome-wide transcriptomic data at a cost that is comparable to profiling 4 genes using RT-qPCR.

Software We developed a suite of open source tools (BRB-seqTools) to aid with processing BRB-seq data and generating count matrices that are used for further analyses. This suite can perform demultiplexing, generate count/UMI matrices and trim BRB-seq constructs and is freely available at http://github.com/DeplanckeLab/BRB-seqTools

Highlights Rapid (~2h hands-on time) and low-cost approach to perform transcriptomics on hundreds of RNA samples; strand specificity preserved; performance: number of detected genes is equal to Illumina TruSeq Stranded mRNA at same sequencing depth; high capacity: low cost allows increasing the number of biological replicates; produces reliable data even with low quality RNA samples (down to RIN value = 2); complete user-friendly sequencing data pre-processing and analysis pipeline allowing result acquisition in a day.


Sfaira accelerates data and model reuse in single cell genomics

Genome Biology ◽

10.1186/s13059-021-02452-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

David S. Fischer ◽

Leander Dony ◽

Martin König ◽

Abdul Moeed ◽

Luke Zappia ◽

...

Keyword(s):

Single Cell ◽

Data Sets ◽

Rna Seq ◽

Cell Type ◽

Training Models ◽

Public Data ◽

Data Partitions ◽

Cell Data ◽

Type Classification ◽

Different Levels

Abstract Single-cell RNA-seq datasets are often first analyzed independently without harnessing model fits from previous studies, and are then contextualized with public data sets, requiring time-consuming data wrangling. We address these issues with sfaira, a single-cell data zoo for public data sets paired with a model zoo for executable pre-trained models. The data zoo is designed to facilitate contribution of data sets using ontologies for metadata. We propose an adaptation of cross-entropy loss for cell type classification tailored to datasets annotated at different levels of coarseness. We demonstrate the utility of sfaira by training models across anatomic data partitions on 8 million cells.
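
One plausible way to adapt cross-entropy to labels of differing coarseness is to score a prediction by the total probability it assigns to every fine-grained class consistent with the given label. The sketch below shows that idea only; whether it matches the loss defined in the paper is an assumption, and the function name `coarse_cross_entropy` and the index-based encoding of labels are hypothetical.

```python
import math

def coarse_cross_entropy(probs, allowed_leaves):
    """Cross-entropy for a label that may be coarse.

    probs: predicted probabilities over fine-grained (leaf) cell types.
    allowed_leaves: indices of the leaf types consistent with the
                    annotation; a single index for a fine label, several
                    for a coarse one (hypothetical encoding).
    Assumes the summed probability is greater than zero.
    """
    mass = sum(probs[i] for i in allowed_leaves)
    return -math.log(mass)
```

With a fine label this reduces to ordinary cross-entropy, while a coarse label does not penalize the model for choosing among the subtypes that the annotation leaves open.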


Laplacian eigenmaps and principal curves for high resolution pseudotemporal ordering of single-cell RNA-seq profiles

10.1101/027219 ◽

2015 ◽

Cited By ~ 18

Author(s):

Kieran Campbell ◽

Chris P Ponting ◽

Caleb Webber

Keyword(s):

Gene Expression ◽

Single Cell ◽

Biological Processes ◽

Rna Seq ◽

Principal Curves ◽

Cell Level ◽

Laplacian Eigenmaps ◽

Low Dimensional ◽

Cell Data ◽

Insight Into

Advances in RNA-seq technologies provide unprecedented insight into the variability and heterogeneity of gene expression at the single-cell level. However, such data offer only a snapshot of the transcriptome, whereas it is often the progression of cells through dynamic biological processes that is of interest. As a result, one outstanding challenge is to infer such progressions by ordering gene expression from single-cell data alone, known as the cell ordering problem. Here, we introduce a new method that constructs a low-dimensional non-linear embedding of the data using Laplacian eigenmaps before assigning each cell a pseudotime using principal curves. We characterise on a theoretical level why our method is more robust to the high levels of noise typical of single-cell RNA-seq data, before demonstrating its utility on two existing datasets of differentiating cells.


CHARTS: a web application for characterizing and comparing tumor subpopulations in publicly available single-cell RNA-seq data sets

BMC Bioinformatics ◽

10.1186/s12859-021-04021-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Matthew N. Bernstein ◽

Zijian Ni ◽

Michael Collins ◽

Mark E. Burkard ◽

Christina Kendziorski ◽

...

Keyword(s):

Gene Expression ◽

Single Cell ◽

Web Application ◽

Gene Expression Omnibus ◽

Cellular Heterogeneity ◽

Data Sets ◽

Rna Seq ◽

Individual Gene ◽

Cancer Data ◽

The Web

Abstract Background Single-cell RNA-seq (scRNA-seq) enables the profiling of genome-wide gene expression at the single-cell level and in so doing facilitates insight into and information about cellular heterogeneity within a tissue. This is especially important in cancer, where tumor and tumor microenvironment heterogeneity directly impact development, maintenance, and progression of disease. While publicly available scRNA-seq cancer data sets offer unprecedented opportunity to better understand the mechanisms underlying tumor progression, metastasis, drug resistance, and immune evasion, much of the available information has been underutilized, in part, due to the lack of tools available for aggregating and analysing these data. Results We present CHARacterizing Tumor Subpopulations (CHARTS), a web application for exploring publicly available scRNA-seq cancer data sets in the NCBI’s Gene Expression Omnibus. More specifically, CHARTS enables the exploration of individual gene expression, cell type, malignancy-status, differentially expressed genes, and gene set enrichment results in subpopulations of cells across tumors and data sets. Along with the web application, we also make available the backend computational pipeline that was used to produce the analyses that are available for exploration in the web application. Conclusion CHARTS is an easy to use, comprehensive platform for exploring single-cell subpopulations within tumors across the ever-growing collection of public scRNA-seq cancer data sets. CHARTS is freely available at charts.morgridge.org.
