Sparsity-Penalized Stacked Denoising Autoencoders for Imputing Single-Cell RNA-seq Data

Weilai Chi; Minghua Deng

doi:10.3390/genes11050532

Enhancing droplet-based single-nucleus RNA-seq resolution using the semi-supervised machine learning classifier DIEM

10.1101/786285 ◽ 2019 ◽ Cited By ~ 4

Author(s): Marcus Alvarez ◽ Elior Rahmani ◽ Brandon Jew ◽ Kristina M. Garske ◽ Zong Miao ◽ ...

Keyword(s): Gene Expression ◽ Single Cell ◽ Cell Types ◽ Supervised Machine Learning ◽ Data Sets ◽ RNA Seq ◽ Novel Approach ◽ Single Nucleus ◽ Downstream Analysis

Abstract Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. In contrast to single-cell RNA-seq (scRNA-seq), we observe that snRNA-seq is commonly subject to contamination by high amounts of extranuclear background RNA, which can lead to identification of spurious cell types in downstream clustering analyses if overlooked. We present a novel approach to remove debris-contaminated droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: 1) human differentiating preadipocytes in vitro, 2) fresh mouse brain tissue, and 3) human frozen adipose tissue (AT) from six individuals. All three data sets showed various degrees of extranuclear RNA contamination. We observed that existing methods fail to account for contaminated droplets and lead to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq data, we also successfully applied DIEM to single-cell data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data quickly and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.
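
The likelihood-based idea in this abstract (debris and cell types each have their own expression distribution, fitted with EM) can be illustrated with a small sketch. It assumes a two-component multinomial mixture, simulated counts, and an initialization that labels low-count droplets as debris; all of these are illustrative assumptions, and the function name `multinomial_mixture_em` is hypothetical. It is not the DIEM implementation, which lives in the repository linked above.

```python
import numpy as np

def multinomial_mixture_em(counts, init_labels, n_iter=30, pseudocount=0.01):
    """Fit a multinomial mixture with EM (illustrative sketch, not DIEM's code)."""
    counts = np.asarray(counts, dtype=float)
    n_components = int(np.max(init_labels)) + 1
    # Start from a hard assignment: one-hot responsibilities per droplet.
    resp = np.eye(n_components)[np.asarray(init_labels)]
    for _ in range(n_iter):
        # M-step: mixing weights and per-component gene proportions.
        weights = np.clip(resp.mean(axis=0), 1e-12, None)
        profiles = resp.T @ counts + pseudocount
        profiles /= profiles.sum(axis=1, keepdims=True)
        # E-step: posterior of each component given a droplet's counts.
        # The multinomial coefficient is the same for every component and cancels.
        log_post = counts @ np.log(profiles).T + np.log(weights)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp, profiles, weights

# Simulated droplets over 4 genes (assumed toy data, not from the paper).
rng = np.random.default_rng(0)
debris_profile = np.array([0.7, 0.1, 0.1, 0.1])
nucleus_profile = np.array([0.1, 0.1, 0.1, 0.7])
debris = rng.multinomial(100, debris_profile, size=50)    # low total counts
nuclei = rng.multinomial(1000, nucleus_profile, size=50)  # high total counts
counts = np.vstack([debris, nuclei])
true_labels = np.repeat([0, 1], 50)

# Assumed initialization: droplets at or below the median total count start as debris (0).
totals = counts.sum(axis=1)
init_labels = (totals > np.median(totals)).astype(int)

resp, profiles, weights = multinomial_mixture_em(counts, init_labels)
is_debris = resp[:, 0] > 0.5
```

Droplets flagged in `is_debris` would be dropped before clustering; a real data set would need more components (one per cell type) and a less idealized initialization.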


MarkerCount: A stable, count-based cell type identifier for single cell RNA-Seq experiments

10.21203/rs.3.rs-418249/v1 ◽ 2021

Author(s): Hanbyeol Kim ◽ Joongho Lee ◽ Keunsoo Kang ◽ Seokhyun Yoon

Keyword(s): Gene Expression ◽ Single Cell ◽ Cell Types ◽ Batch Effect ◽ Expression Level ◽ RNA Seq ◽ Cell Type ◽ Stable Performance ◽ Downstream Analysis

Abstract Cell type identification is a key step in the downstream analysis of single cell RNA-seq experiments. The indispensable information for this is gene expression, which is used to cluster cells, train the model and set rejection thresholds. The problem is that expression levels are subject to batch effects arising from different platforms and preprocessing. We present MarkerCount, which uses the number of markers expressed, regardless of their expression level, to initially identify cell types and then reassigns cell types on a cluster basis. MarkerCount works in both reference-based and marker-based modes, where the latter utilizes only existing lists of markers, while the former requires a pre-annotated dataset to train the model. The performance was evaluated and compared with existing identifiers, both marker-based and reference-based, that can be customized with publicly available datasets and marker DBs. The results show that MarkerCount provides stable performance when compared with other reference-based and marker-based cell type identifiers.
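
The two steps named in this abstract (count expressed markers regardless of level, then reassign per cluster) can be sketched as follows. This is a minimal illustration under stated assumptions: a toy marker list, "expressed" defined as a count above zero, and a simple majority vote per cluster. The function names are hypothetical and the code is not MarkerCount's implementation.

```python
import numpy as np
from collections import Counter

def marker_count_scores(expr, genes, markers):
    """Fraction of each cell type's markers detected (count > 0) in each cell."""
    expressed = np.asarray(expr) > 0            # expression level is ignored
    gene_index = {g: i for i, g in enumerate(genes)}
    types = sorted(markers)
    scores = np.zeros((expressed.shape[0], len(types)))
    for j, cell_type in enumerate(types):
        idx = [gene_index[g] for g in markers[cell_type] if g in gene_index]
        scores[:, j] = expressed[:, idx].sum(axis=1) / max(len(idx), 1)
    return types, scores

def assign_by_cluster(types, scores, clusters):
    """Initial per-cell call, then reassignment to the cluster's majority call."""
    per_cell = [types[j] for j in scores.argmax(axis=1)]
    majority = {}
    for c in set(clusters):
        votes = Counter(lab for lab, cl in zip(per_cell, clusters) if cl == c)
        majority[c] = votes.most_common(1)[0][0]
    return per_cell, [majority[c] for c in clusters]

# Toy example: 5 cells, 4 genes, 2 precomputed clusters (all assumed).
genes = ["CD3E", "CD3D", "MS4A1", "CD79A"]
markers = {"T cell": ["CD3E", "CD3D"], "B cell": ["MS4A1", "CD79A"]}
expr = np.array([
    [5, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 9, 1],
    [0, 0, 4, 0],
    [200, 1, 0, 1],   # looks T-like on its own but sits in a B-dominated cluster
])
clusters = [0, 0, 1, 1, 1]

types, scores = marker_count_scores(expr, genes, markers)
per_cell, final = assign_by_cluster(types, scores, clusters)
```

Because only presence or absence of each marker enters the score, a platform-dependent shift in expression magnitude changes nothing as long as the same markers are detected.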


SPsimSeq: semi-parametric simulation of bulk and single cell RNA sequencing data

10.1101/677740 ◽ 2019 ◽ Cited By ~ 1

Author(s): Alemu Takele Assefa ◽ Jo Vandesompele ◽ Olivier Thas

Keyword(s): Gene Expression ◽ Single Cell ◽ RNA Sequencing ◽ Empirical Distribution ◽ Supplementary Information ◽ RNA Seq ◽ Sequencing Data ◽ Actual Distribution ◽ Wide Range ◽ Single Cell RNA Sequencing

Summary: SPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts. Availability and implementation: The R package and associated documentation are available from https://github.com/CenterForStatistics-UGent/SPsimSeq. Supplementary information: Supplementary data are available at bioRχiv online.


Cobolt: integrative analysis of multimodal single-cell sequencing data

Genome Biology ◽ 10.1186/s13059-021-02556-z ◽ 2021 ◽ Vol 22 (1)

Author(s): Boying Gong ◽ Yun Zhou ◽ Elizabeth Purdom

Keyword(s): Gene Expression ◽ Single Cell ◽ Chromatin Accessibility ◽ Integrative Analysis ◽ RNA Seq ◽ Sequencing Data ◽ Single Cell Sequencing ◽ Multiple Datasets ◽ Novel Method ◽ Sequencing Platforms

Abstract A growing number of single-cell sequencing platforms enable joint profiling of multiple omics from the same cells. We present Cobolt, a novel method that not only allows for analyzing the data from joint-modality platforms, but also provides a coherent framework for the integration of multiple datasets measured on different modalities. We demonstrate its performance on multi-modality data of gene expression and chromatin accessibility and illustrate the integration abilities of Cobolt by jointly analyzing this multi-modality data with single-cell RNA-seq and ATAC-seq datasets.


Machine Learning-Assisted Identification of Factors Contributing to the Technical Variability Between Bulk and Single-Cell RNA-Seq Experiments

10.21203/rs.3.rs-1247889/v1 ◽ 2022

Author(s): Sofya Lipnitskaya ◽ Yang Shen ◽ Stefan Legewie ◽ Holger Klein ◽ Kolja Becker

Keyword(s): Gene Expression ◽ Machine Learning ◽ Single Cell ◽ RNA Sequencing ◽ Quantitative Difference ◽ RNA Seq ◽ Sequencing Data ◽ Factors Affecting ◽ Expression Variability ◽ Technical Variability

Abstract Background: Recent transcriptomics studies performed at the single-cell and population levels reveal noticeable variability in the gene expression measurements provided by different RNA sequencing technologies. Because single-cell RNA-Seq (scRNA-Seq) data are noisier and more complex than bulk data, they contain a substantial number of variably expressed genes and so-called dropouts, which challenge the subsequent computational analysis and can lead to false positive discoveries. To investigate the factors affecting technical variability between RNA sequencing experiments of different technologies, we performed a systematic assessment of single-cell and bulk RNA-Seq data that had undergone the same pre-processing and sample preparation procedures. Results: Our analysis indicates that variability between gene expression measurements, as well as dropout events, is not exclusively caused by biological variability, low expression levels, or random variation. Furthermore, we propose FAVSeq, a machine learning-assisted pipeline for detecting factors that contribute to gene expression variability in matched RNA-Seq data provided by two technologies. Based on the analysis of the matched bulk and single-cell dataset, we found the 3'-UTR and transcript lengths to be the most relevant effectors of the observed variation between RNA-Seq experiments, while the same factors together with cellular compartments were shown to be associated with dropouts. Conclusions: Here, we investigated the sources of variation in RNA-Seq profiles of matched single-cell and bulk experiments. In addition, we proposed the FAVSeq pipeline for analyzing multimodal RNA sequencing data, which allowed us to identify factors affecting quantitative differences in gene expression measurements as well as the presence of dropouts. The derived knowledge can be used to improve the interpretation of RNA-Seq data and to identify genes that can be affected by assay-based deviations. Source code is available under the MIT license at https://github.com/slipnitskaya/FAVSeq.


CCSN: Single Cell RNA Sequencing Data Analysis by Conditional Cell-specific Network

10.1101/2020.01.25.919829 ◽ 2020

Author(s): Lin Li ◽ Hao Dai ◽ Zhaoyuan Fang ◽ Luonan Chen

Keyword(s): Gene Expression ◽ Single Cell ◽ RNA Sequencing ◽ Network Flow ◽ Single Cells ◽ Cellular Heterogeneity ◽ RNA Seq ◽ Sequencing Data ◽ Cell Clustering ◽ A Cell

Abstract The rapid advancement of single cell technologies has shed new light on the complex mechanisms of cellular heterogeneity. However, compared with bulk RNA sequencing (RNA-seq), single-cell RNA-seq (scRNA-seq) suffers from higher noise and lower coverage, which brings new computational difficulties. Based on statistical independence, the cell-specific network (CSN) method can quantify the overall associations between genes for each cell, but it suffers from overestimation related to indirect effects. To overcome this problem, we propose the “conditional cell-specific network” (CCSN) method, which can measure the direct associations between genes by eliminating the indirect associations. CCSN can be used for cell clustering and dimension reduction on a network basis of single cells. Intuitively, each CCSN can be viewed as the transformation from less “reliable” gene expression to more “reliable” gene-gene associations in a cell. Based on CCSN, we further design network flow entropy (NFE) to estimate the differentiation potency of a single cell. A number of scRNA-seq datasets were used to demonstrate the advantages of our approach: (1) one direct association network for one cell; (2) most existing scRNA-seq methods designed for gene expression matrices are also applicable to CCSN-transformed degree matrices; (3) CCSN-based NFE helps resolve the direction of differentiation trajectories by quantifying the potency of each cell. CCSN is publicly available at http://sysbio.sibcb.ac.cn/cb/chenlab/soft/CCSN.zip.


scImpute: Accurate And Robust Imputation For Single Cell RNA-Seq Data

10.1101/141598 ◽ 2017 ◽ Cited By ~ 20

Author(s): Wei Vivian Li ◽ Jingyi Jessica Li

Keyword(s): Gene Expression ◽ Single Cell ◽ Differential Expression Analysis ◽ Embryonic Stem ◽ RNA Seq ◽ Human Blood Cells ◽ Gene Expression Dynamics ◽ Rare Cells ◽ Downstream Analysis ◽ Zero Counts

The emerging single cell RNA sequencing (scRNA-seq) technologies enable the investigation of transcriptomic landscapes at single-cell resolution. The analysis of scRNA-seq data is complicated by excess zero or near-zero counts, the so-called dropouts, which are due to the low amounts of mRNA sequenced within individual cells. Downstream analysis of scRNA-seq data would be severely biased if the dropout events are not properly corrected. We introduce scImpute, a statistical method to accurately and robustly impute the dropout values in scRNA-seq data. scImpute automatically identifies gene expression values affected by dropout events and only performs imputation on these values, without introducing new bias to the rest of the data. scImpute also detects outlier or rare cells and excludes them from imputation. Evaluation based on both simulated and real scRNA-seq data on mouse embryos, mouse brain cells, human blood cells, and human embryonic stem cells suggests that scImpute is an effective tool to recover transcriptome dynamics masked by dropout events. scImpute is shown to correct false zero counts, enhance the clustering of cell populations and subpopulations, improve the accuracy of differential expression analysis, and aid the study of gene expression dynamics.


SDImpute: A statistical block imputation method based on cell-level and gene-level information for dropouts in single-cell RNA-seq data

PLoS Computational Biology ◽ 10.1371/journal.pcbi.1009118 ◽ 2021 ◽ Vol 17 (6) ◽ pp. e1009118

Author(s): Jing Qi ◽ Yang Zhou ◽ Zicen Zhao ◽ Shuilin Jin

Keyword(s): Gene Expression ◽ Single Cell ◽ Differential Expression Analysis ◽ Cell Types ◽ RNA Seq ◽ Cell Level ◽ Gene Level ◽ Level Information ◽ Downstream Analysis ◽ Gene Expression Levels

Single-cell RNA sequencing (scRNA-seq) technologies measure gene expression at single-cell resolution and provide a tool for exploring cell heterogeneity and cell types. Owing to the low number of mRNA copies extracted per cell, scRNA-seq data exhibit a large number of dropouts, which hinders downstream analysis. We propose a statistical method, SDImpute (Single-cell RNA-seq Dropout Imputation), to implement block imputation for dropout events in scRNA-seq data. SDImpute automatically identifies the dropout events based on the gene expression levels and the variations of gene expression across similar cells and similar genes, and it implements block imputation for dropouts by utilizing gene expression unaffected by dropouts from similar cells. Results on simulated and real datasets suggest that SDImpute is an effective tool to recover the data and preserve the heterogeneity of gene expression across cells. Compared with state-of-the-art imputation methods, SDImpute improves the accuracy of downstream analyses including clustering, visualization, and differential expression analysis.
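
The general idea of borrowing values from similar cells can be shown with a deliberately simplified stand-in: a nearest-neighbour fill, not SDImpute's block imputation or its dropout-identification step. The sketch assumes that similarity is the Pearson correlation of log-transformed profiles, that every zero is a candidate dropout, and that a zero is filled only when at least one neighbour expresses the gene. The function name `knn_impute_zeros` and the toy matrix are hypothetical.

```python
import numpy as np

def knn_impute_zeros(expr, k=2):
    """Fill zeros with the mean log-expression of the gene in the k most similar cells.

    Simplified stand-in for similar-cell imputation; zeros stay zero when
    none of the neighbours express the gene.
    """
    x = np.log1p(np.asarray(expr, dtype=float))
    corr = np.corrcoef(x)                  # cell-by-cell Pearson correlation
    np.fill_diagonal(corr, -np.inf)        # a cell is not its own neighbour
    imputed = x.copy()
    for i in range(x.shape[0]):
        neighbours = np.argsort(corr[i])[::-1][:k]
        for g in np.flatnonzero(x[i] == 0):
            vals = x[neighbours, g]
            vals = vals[vals > 0]          # use only values unaffected by zeros
            if vals.size:
                imputed[i, g] = vals.mean()
    return np.expm1(imputed)

# Toy matrix (cells x genes): two cell types, one suspected dropout in each.
expr = np.array([
    [10, 8, 0, 0],
    [12, 0, 0, 0],   # gene 1 is missing here but present in similar cells
    [9, 7, 0, 0],
    [0, 0, 10, 9],
    [0, 0, 11, 0],   # gene 3 is missing here but present in similar cells
    [0, 0, 8, 7],
])
out = knn_impute_zeros(expr, k=2)
```

In this toy case the zeros shared by a whole cell type stay at zero, which mirrors the goal of preserving heterogeneity across cells while recovering values that neighbours suggest were lost.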


MarkerCount: A stable, count-based cell type identifier for single cell RNA-Seq experiments

10.21203/rs.3.rs-418249/v2 ◽ 2021

Author(s): HanByeol Kim ◽ Joongho Lee ◽ Keunsoo Kang ◽ Seokhyun Yoon

Keyword(s): Gene Expression ◽ Single Cell ◽ Cell Types ◽ Batch Effect ◽ Expression Level ◽ RNA Seq ◽ Cell Type ◽ Stable Performance ◽ Downstream Analysis

Abstract Cell type identification is a key step in the downstream analysis of single cell RNA-seq experiments. The indispensable information for this is gene expression, which is used to cluster cells, train the model and set rejection thresholds. The problem is that expression levels are subject to batch effects arising from different platforms and preprocessing. We present MarkerCount, which uses the number of markers expressed, regardless of their expression level, to initially identify cell types and then reassigns cell types on a cluster basis. MarkerCount works in both reference-based and marker-based modes, where the latter utilizes only existing lists of markers, while the former requires a pre-annotated dataset to train the model. The performance was evaluated and compared with existing identifiers, both marker-based and reference-based, that can be customized with publicly available datasets and marker DBs. The results show that MarkerCount provides stable performance when compared with other reference-based and marker-based cell type identifiers.


Imputing Single-cell RNA-seq data by combining Graph Convolution and Autoencoder Neural Networks

10.1101/2020.02.05.935296 ◽ 2020 ◽ Cited By ~ 3

Author(s): Jiahua Rao ◽ Xiang Zhou ◽ Yutong Lu ◽ Huiying Zhao ◽ Yuedong Yang

Keyword(s): Single Cell ◽ Clustering Analysis ◽ State Of The Art ◽ Differential Expression Analysis ◽ Gene Interactions ◽ RNA Seq ◽ Sequencing Technology ◽ Imputation Methods ◽ Downstream Analysis ◽ Low Dimensional

Abstract Single-cell RNA sequencing technology enables the profiling of single-cell transcriptomes at an unprecedented throughput and resolution. However, in scRNA-seq studies, the low amount of mRNA sequenced in each cell means that a portion of mRNA molecules goes undetected, i.e. the dropout problem. Dropout events hinder various downstream analyses, such as clustering analysis, differential expression analysis, and inference of gene-to-gene relationships. Therefore, it is necessary to develop robust and effective imputation methods for the growing volume of scRNA-seq data. In this study, we have developed an imputation method (GraphSCI) that imputes the dropout events in scRNA-seq data based on graph convolution networks. The method takes advantage of low-dimensional representations of similar cells and gene-gene interactions to impute the dropouts. Extensive experiments demonstrated that GraphSCI outperforms other state-of-the-art imputation methods on both simulated and real scRNA-seq data. Moreover, by utilizing the imputed matrix, GraphSCI is able to accurately infer gene-to-gene relationships that are concealed by dropout events in the raw data.
