GT-TS: Experimental design for maximizing cell type discovery in single-cell data

Mapping Intimacies ◽

10.1101/386540 ◽

2018 ◽

Cited By ~ 4

Author(s):

Bianca Dumitrascu ◽

Karen Feng ◽

Barbara E Engelhardt

Keyword(s):

Experimental Design ◽

Single Cell ◽

Simulated Data ◽

Computational Method ◽

Rna Seq ◽

Cell Type ◽

Thompson Sampling ◽

Random Strategy ◽

Cell Data ◽

Type Information

We present the Good-Toulmin-like estimator via Thompson sampling (GT-TS), a computational method for iterative experimental design in multi-tissue single-cell RNA-seq (scRNA-seq) data. Given a budget and a model of cell type information across tissues, GT-TS estimates how many cells should be sampled from each tissue, with the goal of maximizing cell type discovery across samples from multiple iterations. In both real and simulated data, we demonstrate the advantages of GT-TS in data collection planning when compared to a random strategy in the absence of experimental design.
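As a rough illustration of the allocation idea in this abstract, the sketch below uses plain Beta-Bernoulli Thompson sampling to decide which tissue receives the next batch of cells, rewarding a tissue each time a sampled cell belongs to a type not seen before. This is not the authors' GT-TS estimator: the Good-Toulmin-like component is replaced by a simple Beta posterior, and the function name `thompson_allocate`, the simulated `true_types` input, and all parameter defaults are assumptions made for the example.

```python
import random

def thompson_allocate(true_types, n_rounds=30, batch=20, seed=0):
    """Allocate cell batches across tissues with Beta-Bernoulli Thompson sampling.

    true_types maps each tissue to a list of (cell_type, probability) pairs.
    It stands in for the unknown biology and is used only to simulate sampling.
    A sampled cell earns reward 1 if its type has not been observed before.
    """
    rng = random.Random(seed)
    tissues = list(true_types)
    # Beta(1, 1) prior on each tissue's chance of yielding a new cell type.
    alpha = {t: 1.0 for t in tissues}
    beta = {t: 1.0 for t in tissues}
    seen = set()
    counts = {t: 0 for t in tissues}
    for _ in range(n_rounds):
        # Draw one plausible discovery rate per tissue and pick the largest draw.
        draws = {t: rng.betavariate(alpha[t], beta[t]) for t in tissues}
        chosen = max(tissues, key=draws.get)
        names = [name for name, _ in true_types[chosen]]
        weights = [prob for _, prob in true_types[chosen]]
        for cell_type in rng.choices(names, weights=weights, k=batch):
            if cell_type in seen:
                beta[chosen] += 1.0   # no discovery: evidence against this tissue
            else:
                seen.add(cell_type)
                alpha[chosen] += 1.0  # discovery: evidence for this tissue
        counts[chosen] += batch
    return seen, counts

# Hypothetical two-tissue example: one diverse tissue, one homogeneous tissue.
tissues = {
    "diverse": [("type_%d" % i, 0.1) for i in range(10)],
    "uniform": [("common", 1.0)],
}
discovered, cells_per_tissue = thompson_allocate(tissues)
```

Because the chance of discovering a new type shrinks as sampling proceeds, a stationary Beta posterior is only a crude stand-in for an estimator that explicitly predicts the number of unseen types.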


A computational method to aid the design and analysis of single cell RNA-seq experiments for cell type identification

10.1101/247114 ◽

2018 ◽

Cited By ~ 1

Author(s):

Douglas Abrams ◽

Parveen Kumar ◽

R. Krishna Murthy Karuturi ◽

Joshy George

Keyword(s):

Experimental Design ◽

Single Cell ◽

Single Cells ◽

Cell Types ◽

Cell Number ◽

Fold Change ◽

Computational Method ◽

Marker Genes ◽

Cell Type ◽

Estimate Sample Size

Background: The advent of single cell RNA sequencing (scRNA-seq) enabled researchers to study transcriptomic activity within individual cells and identify inherent cell types in the sample. Although numerous computational tools have been developed to analyze single cell transcriptomes, there are no published studies or analytical packages available to guide experimental design and to devise a suitable analysis procedure for cell type identification. Results: We have developed an empirical methodology to address this important gap in single cell experimental design and analysis, packaged as an easy-to-use tool called SCEED (Single Cell Empirical Experimental Design and analysis). With SCEED, a user can choose among a variety of combinations of tools for analysis, conduct performance analysis of analytical procedures and choose the best procedure, and estimate the sample size (number of cells to be profiled) required for a given analytical procedure at varying levels of cell type rarity and other experimental parameters. Using SCEED, we examined 3 single cell algorithms using 48 simulated single cell datasets that were generated for varying numbers of cell types and their proportions, number of genes expressed per cell, number of marker genes and their fold change, and number of single cells successfully profiled in the experiment. Conclusions: Based on our study, we found that when marker genes are expressed at a fold change of 4 or more relative to the rest of the genes, either the Seurat or the Simlr algorithm can be used to analyze a single cell dataset for any number of single cells isolated (a minimum of 1000 single cells was tested). However, when marker genes are expected to be upregulated by a fold change of at most 2, the choice of single cell algorithm depends on the number of single cells isolated and the proportion of the rare cell type to be identified. In conclusion, our work allows the assessment of various single cell methods and also aids in examining the single cell experimental design.
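To make the sample-size question in this abstract concrete, the following back-of-envelope sketch asks how many cells must be profiled so that a rare cell type of frequency `p` contributes at least `k` cells with a chosen probability. It assumes cells are drawn independently (a binomial model) and is not SCEED's empirical, simulation-based procedure; the function names and defaults are hypothetical.

```python
from math import comb

def prob_at_least(n, p, k):
    """Probability that at least k of n independently sampled cells are of a
    type with frequency p (binomial upper tail)."""
    below = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))
    return 1.0 - below

def min_cells(p, k, confidence=0.95, n_max=1_000_000):
    """Smallest n such that prob_at_least(n, p, k) >= confidence.

    Uses a simple linear scan, which is adequate for small k and moderate p.
    """
    n = k
    while n <= n_max:
        if prob_at_least(n, p, k) >= confidence:
            return n
        n += 1
    raise ValueError("no sample size up to n_max reaches the requested confidence")

# Example: a type making up half of the sample, at least one cell wanted.
needed = min_cells(p=0.5, k=1, confidence=0.95)
```

Capturing enough cells is a necessary condition only; whether a clustering algorithm then separates the rare type also depends on marker fold change, which is why an empirical evaluation such as the one described above is still needed.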


De novo prediction of cell-type complexity in single-cell RNA-seq and tumor microenvironments

Life Science Alliance ◽

10.26508/lsa.201900443 ◽

2019 ◽

Vol 2 (4) ◽

pp. e201900443 ◽

Cited By ~ 1

Author(s):

Jun Woo ◽

Boris J. Winterhoff ◽

Timothy K. Starr ◽

Constantin Aliferis ◽

Jinhua Wang

Keyword(s):

Single Cell ◽

Model Comparison ◽

De Novo ◽

Nonnegative Matrix ◽

Simulated Data ◽

Cell Type ◽

Pancreatic Cell ◽

Bayesian Model Comparison ◽

Cellular Microenvironments ◽

Cell Data

Recent single-cell transcriptomic studies revealed new insights into cell-type heterogeneities in cellular microenvironments unavailable from bulk studies. A significant drawback of currently available algorithms is the need to use empirical parameters or rely on indirect quality measures to estimate the degree of complexity, i.e., the number of subgroups present in the sample. We fill this gap with a single-cell data analysis procedure allowing for unambiguous assessments of the depth of heterogeneity in subclonal compositions supported by data. Our approach combines nonnegative matrix factorization, which takes advantage of the sparse and nonnegative nature of single-cell RNA count data, with Bayesian model comparison enabling de novo prediction of the depth of heterogeneity. We show that the method predicts the correct number of subgroups using simulated data, primary blood mononuclear cell, and pancreatic cell data. We applied our approach to a collection of single-cell tumor samples and found two qualitatively distinct classes of cell-type heterogeneity in cancer microenvironments.


False signals induced by single-cell imputation

F1000Research ◽

10.12688/f1000research.16613.2 ◽

2019 ◽

Vol 7 ◽

pp. 1740 ◽

Cited By ~ 26

Author(s):

Tallulah S. Andrews ◽

Martin Hemberg

Keyword(s):

Single Cell ◽

Effect Size ◽

False Positive ◽

Statistical Tests ◽

Simulated Data ◽

False Positives ◽

Rna Seq ◽

Cell Type ◽

Imputation Methods ◽

Cell Type Specific

Background: Single-cell RNA-seq is a powerful tool for measuring gene expression at the resolution of individual cells. A challenge in the analysis of these data is the large number of zero values, representing either missing data or no expression. Several imputation approaches have been proposed to address this issue, but because they generally rely on structure inherent to the dataset under consideration, they may not provide any additional information and hence are limited by the information contained therein and the validity of their assumptions. Methods: We evaluated the risk of generating false positive or irreproducible differential expression when imputing data with six different methods. We applied each method to a variety of simulated datasets as well as to permuted real single-cell RNA-seq datasets and considered the number of false positive gene-gene correlations and differentially expressed genes. Using matched 10X and Smart-seq2 data we examined whether cell-type specific markers were reproducible across datasets derived from the same tissue before and after imputation. Results: The extent of false-positives introduced by imputation varied considerably by method. Data smoothing based methods, MAGIC, knn-smooth and dca, generated many false-positives in both real and simulated data. Model-based imputation methods typically generated fewer false-positives but this varied greatly depending on the diversity of cell-types in the sample. All imputation methods decreased the reproducibility of cell-type specific markers, although this could be mitigated by selecting markers with large effect size and significance. Conclusions: Imputation of single-cell RNA-seq data introduces circularity that can generate false-positive results. Thus, statistical tests applied to imputed data should be treated with care. Additional filtering by effect size can reduce but not fully eliminate these effects. Of the methods we considered, SAVER was the least likely to generate false or irreproducible results, and thus should be favoured over alternatives if imputation is necessary.


Cell type prioritization in single-cell data

10.1101/2019.12.20.884916 ◽

2019 ◽

Cited By ~ 1

Author(s):

Michael A. Skinnider ◽

Jordan W. Squair ◽

Claudia Kathe ◽

Mark A. Anderson ◽

Matthieu Gautier ◽

...

Keyword(s):

Single Cell ◽

Neural Circuits ◽

Cell Types ◽

Chromatin Accessibility ◽

High Dimensional ◽

Machine Learning Method ◽

Learning Method ◽

Rna Seq ◽

Cell Type ◽

Cell Data

We present a machine-learning method to prioritize the cell types most responsive to biological perturbations within high-dimensional single-cell data. We validate our method, Augur (https://github.com/neurorestore/Augur), on a compendium of single-cell RNA-seq, chromatin accessibility, and imaging transcriptomics datasets. We apply Augur to expose the neural circuits that enable walking after paralysis in response to spinal cord neurostimulation.


Sfaira accelerates data and model reuse in single cell genomics

Genome Biology ◽

10.1186/s13059-021-02452-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

David S. Fischer ◽

Leander Dony ◽

Martin König ◽

Abdul Moeed ◽

Luke Zappia ◽

...

Keyword(s):

Single Cell ◽

Data Sets ◽

Rna Seq ◽

Cell Type ◽

Training Models ◽

Public Data ◽

Data Partitions ◽

Cell Data ◽

Type Classification ◽

Different Levels

Single-cell RNA-seq datasets are often first analyzed independently without harnessing model fits from previous studies, and are then contextualized with public data sets, requiring time-consuming data wrangling. We address these issues with sfaira, a single-cell data zoo for public data sets paired with a model zoo for executable pre-trained models. The data zoo is designed to facilitate contribution of data sets using ontologies for metadata. We propose an adaption of cross-entropy loss for cell type classification tailored to datasets annotated at different levels of coarseness. We demonstrate the utility of sfaira by training models across anatomic data partitions on 8 million cells.
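One plausible reading of a cross-entropy loss adapted to labels of differing coarseness is sketched below: the classifier predicts probabilities over the finest cell types (leaves), and a coarse label is scored by the total probability assigned to all leaves it covers. This is an illustrative assumption, not sfaira's documented implementation; the function names and the three-leaf example are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def coarse_cross_entropy(logits, allowed_leaves):
    """Negative log of the probability mass on the leaves a label covers.

    With a single allowed leaf this reduces to standard cross-entropy;
    with a coarse label, any split of mass among its leaves is unpenalized.
    """
    probs = softmax(logits)
    mass = sum(probs[i] for i in allowed_leaves)
    return -math.log(mass)

# Hypothetical leaves: 0 = CD4 T cell, 1 = CD8 T cell, 2 = B cell.
logits = [0.0, 0.0, 0.0]
loss_fine = coarse_cross_entropy(logits, {0})       # label "CD4 T cell"
loss_coarse = coarse_cross_entropy(logits, {0, 1})  # label "T cell"
```

Under this formulation, a cell annotated only as "T cell" still contributes a training signal without forcing the model to commit to a subtype that the annotation does not support.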


Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

10.1101/2021.04.06.438536 ◽

2021 ◽

Author(s):

Lei Xiong ◽

Kang Tian ◽

Yuzhe Li ◽

Qiangfeng Cliff Zhang

Keyword(s):

Single Cell ◽

Rna Seq ◽

Cell Type ◽

Experimental Conditions ◽

Multiple Data ◽

Mouse Tissues ◽

Heterogeneous Datasets ◽

Cell Data ◽

Human And Mouse ◽

Biological Differences

Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogeneous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.


Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

10.21203/rs.3.rs-398163/v1 ◽

2021 ◽

Author(s):

Lei Xiong ◽

Kang Tian ◽

Yuzhe Li ◽

Qiangfeng Zhang

Keyword(s):

Single Cell ◽

Rna Seq ◽

Cell Type ◽

Experimental Conditions ◽

Multiple Data ◽

Mouse Tissues ◽

Heterogeneous Datasets ◽

Cell Data ◽

Human And Mouse ◽

Biological Differences

Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. We demonstrate that SCALEX accurately and efficiently integrates heterogeneous single-cell data using multiple benchmarks. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular landscapes of human and mouse tissues and identified multiple peripheral immune subtypes associated with COVID-19 disease severity.


Contrastive Cycle Adversarial Autoencoders for Single-cell Multi-omics Alignment and Integration

10.1101/2021.12.12.472268 ◽

2021 ◽

Author(s):

Xuesong Wang ◽

Zhihang Hu ◽

Tingyang Yu ◽

Ruijie Wang ◽

Yumeng Wei ◽

...

Keyword(s):

Single Cell ◽

Simulated Data ◽

Dimensional Manifold ◽

Omics Data ◽

Rna Seq ◽

High Dimensions ◽

Low Dimensional ◽

Low Dimensional Manifold ◽

Cell Data ◽

Insight Into

Multi-modality data are ubiquitous in biology, especially now that we have entered the multi-omics era, in which we can measure the same biological object (cell) from different aspects (omics) to provide a more comprehensive insight into the cellular system. When dealing with such multi-omics data, the first step is to determine the correspondence among different modalities. In other words, we should match data from different spaces corresponding to the same object. This problem is particularly challenging in the single-cell multi-omics scenario. First, such data are very sparse with extremely high dimensions. Second, matched single-cell multi-omics data are rare and hard to collect. Furthermore, due to the limitations of the experimental environment, the data are usually highly noisy. To promote single-cell multi-omics research, we address the above challenges by proposing a novel framework to align and integrate single-cell RNA-seq data and single-cell ATAC-seq data. Our approach can efficiently map such sparse and noisy data from different spaces to a low-dimensional manifold in a unified space, making the downstream alignment and integration straightforward. Compared with other state-of-the-art methods, our method performs better on both simulated and real single-cell data, and the improvement in integration on the simulated data is significant. The proposed method is helpful for single-cell multi-omics research.


A computational method to aid the design and analysis of single cell RNA-seq experiments for cell type identification

BMC Bioinformatics ◽

10.1186/s12859-019-2817-2 ◽

2019 ◽

Vol 20 (S11) ◽

Cited By ~ 2

Author(s):

Douglas Abrams ◽

Parveen Kumar ◽

R. Krishna Murthy Karuturi ◽

Joshy George

Keyword(s):

Single Cell ◽

Computational Method ◽

Rna Seq ◽

Cell Type


A Unified Statistical Framework for Single Cell and Bulk Sequencing Data

10.1101/206532 ◽

2017 ◽

Cited By ~ 1

Author(s):

Lingxue Zhu ◽

Jing Lei ◽

Bernie Devlin ◽

Kathryn Roeder

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Accurate Estimation ◽

Specific Gene ◽

Rna Seq ◽

Cell Type ◽

Cell Type Specific ◽

Different Cell Types ◽

Cell Data

Recent advances in technology have enabled the measurement of RNA levels for individual cells. Compared to traditional tissue-level bulk RNA-seq data, single cell sequencing yields valuable insights about gene expression profiles for different cell types, which is potentially critical for understanding many complex human diseases. However, developing quantitative tools for such data remains challenging because of high levels of technical noise, especially the “dropout” events. A “dropout” happens when the RNA for a gene fails to be amplified prior to sequencing, producing a “false” zero in the observed data. In this paper, we propose a Unified RNA-Sequencing Model (URSM) for both single cell and bulk RNA-seq data, formulated as a hierarchical model. URSM borrows the strength from both data sources and carefully models the dropouts in single cell data, leading to a more accurate estimation of cell type specific gene expression profile. In addition, URSM naturally provides inference on the dropout entries in single cell data that need to be imputed for downstream analyses, as well as the mixing proportions of different cell types in bulk samples. We adopt an empirical Bayes approach, where parameters are estimated using the EM algorithm and approximate inference is obtained by Gibbs sampling. Simulation results illustrate that URSM outperforms existing approaches both in correcting for dropouts in single cell data, as well as in deconvolving bulk samples. We also demonstrate an application to gene expression data on fetal brains, where our model successfully imputes the dropout genes and reveals cell type specific expression patterns.
