A descriptive marker gene approach to single-cell pseudotime inference

Mapping Intimacies ◽

10.1101/060442 ◽

2016 ◽

Cited By ~ 5

Author(s):

Kieran R Campbell ◽

Christopher Yau

Keyword(s):

Single Cell ◽

Marker Gene ◽

Cell Types ◽

R Package ◽

Estimation Methods ◽

Marker Genes ◽

Peak Time ◽

Transient Behaviour ◽

Link Type ◽

Cell Gene Expression

Abstract: Pseudotime estimation from single-cell gene expression allows the recovery of temporal information from otherwise static profiles of individual cells. This pseudotemporal information can be used to characterise transient events in temporally evolving biological systems. Conventional algorithms typically emphasise an unsupervised transcriptome-wide approach and use retrospective analysis to evaluate the behaviour of individual genes. Here we introduce an orthogonal approach termed “Ouija” that learns pseudotimes from a small set of marker genes that might ordinarily be used to retrospectively confirm the accuracy of unsupervised pseudotime algorithms. Crucially, we model these genes in terms of switch-like or transient behaviour along the trajectory, allowing us to understand why the pseudotimes have been inferred and learn informative parameters about the behaviour of each gene. Since each gene is associated with a switch or peak time, the genes are effectively ordered along with the cells, allowing each part of the trajectory to be understood in terms of the behaviour of certain genes. In the following we introduce our model and demonstrate that in many instances a small panel of marker genes can recover pseudotimes that are consistent with those obtained using the entire transcriptome. Furthermore, we show that our method can detect differences in the regulation timings between two genes and identify “metastable” states (discrete cell types along the continuous trajectories) that recapitulate known cell types. Ouija therefore provides a powerful complementary approach to existing whole-transcriptome-based pseudotime estimation methods. An open source implementation is available at http://www.github.com/kieranrcampbell/ouija as an R package and at http://www.github.com/kieranrcampbell/ouijaflow as a Python/TensorFlow package.
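The switch-like and transient gene behaviours described in this abstract can be illustrated with a minimal sketch. It assumes a logistic curve for a switch and a Gaussian-shaped bump for a transient peak; the function names and parametrisation are illustrative assumptions and are not taken from the Ouija source code.

```python
import math

def switch_mean(t, peak_expr, strength, switch_time):
    """Assumed logistic switch: expression rises towards peak_expr around
    switch_time (a negative strength gives a switch-off gene instead)."""
    return peak_expr / (1.0 + math.exp(-strength * (t - switch_time)))

def transient_mean(t, peak_expr, width, peak_time):
    """Assumed Gaussian-shaped transient: expression peaks at peak_time."""
    return peak_expr * math.exp(-((t - peak_time) ** 2) / (2.0 * width ** 2))

def order_genes_by_time(genes):
    """Order genes along the trajectory by their switch or peak time.
    Each gene is a dict with hypothetical keys 'name' and 'time'."""
    return sorted(genes, key=lambda g: g["time"])
```

Because every gene carries its own switch or peak time, sorting genes by that parameter places them on the same pseudotime axis as the cells, which is the interpretability property the abstract emphasises.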


Exploiting marker genes for robust classification and characterization of single-cell chromatin accessibility

10.1101/2021.04.01.438068 ◽

2021 ◽

Author(s):

Risa Karakida Kawaguchi ◽

Ziqi Tang ◽

Stephan Fischer ◽

Rohit Tripathy ◽

Peter K. Koo ◽

...

Keyword(s):

Single Cell ◽

Marker Gene ◽

Cell Types ◽

Chromatin Accessibility ◽

Marker Genes ◽

Cell Type ◽

Gene Sets ◽

Typing Methods ◽

Cell Type Specific ◽

Cell Typing

Background: Single-cell Assay for Transposase Accessible Chromatin using sequencing (scATAC-seq) measures genome-wide chromatin accessibility for the discovery of cell-type specific regulatory networks. ScATAC-seq combined with single-cell RNA sequencing (scRNA-seq) offers important avenues for ongoing research, such as novel cell-type specific activation of enhancer and transcription factor binding sites as well as chromatin changes specific to cell states. On the other hand, scATAC-seq data is known to be challenging to interpret due to its high number of zeros as well as the heterogeneity derived from different protocols. Because of the stochastic lack of marker gene activities, cell type identification by scATAC-seq remains difficult even at a cluster level. Results: In this study, we exploit reference knowledge obtained from external scATAC-seq or scRNA-seq datasets to define existing cell types and uncover the genomic regions which drive cell-type specific gene regulation. To investigate the robustness of existing cell-typing methods, we collected 7 scATAC-seq datasets targeting mouse brain for a meta-analytic comparison of neuronal cell-type annotation, including a reference atlas generated by the BRAIN Initiative Cell Census Network (BICCN). By comparing the area under the receiver operating characteristics curves (AUROCs) for the three major cell types (inhibitory, excitatory, and non-neuronal cells), cell-typing performance by single markers is found to be highly variable even for known marker genes due to study-specific biases. However, the signal aggregation of a large and redundant marker gene set, optimized via multiple scRNA-seq data, achieves the highest cell-typing performances among 5 existing marker gene sets, from the individual cell to cluster level. That gene set also shows a high consistency with the cluster-specific genes from inhibitory subtypes in two well-annotated datasets, suggesting applicability to rare cell types. 
Next, we demonstrate a comprehensive assessment of scATAC-seq cell typing using exhaustive combinations of the marker gene sets with supervised learning methods including machine learning classifiers and joint clustering methods. Our results show that the combinations using robust marker gene sets systematically ranked at the top, not only with model based prediction using a large reference data but also with a simple summation of expression strengths across markers. To demonstrate the utility of this robust cell typing approach, we trained a deep neural network to predict chromatin accessibility in each subtype using only DNA sequence. Through model interpretation methods, we identify key motifs enriched about robust gene sets for each neuronal subtype. Conclusions: Through the meta-analytic evaluation of scATAC-seq cell-typing methods, we develop a novel method set to exploit the BICCN reference atlas. Our study strongly supports the value of robust marker gene selection as a feature selection tool and cross-dataset comparison between scATAC-seq datasets to improve alignment of scATAC-seq to known biology. With this novel, high quality epigenetic data, genomic analysis of regulatory regions can reveal sequence motifs that drive cell type-specific regulatory programs.
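The comparison of single markers against a simple summation across a marker set, scored by AUROC, can be sketched as follows. This is an illustrative sketch only: the gene names and activity values are toy assumptions, and the AUROC is computed with the plain pairwise (Mann-Whitney) definition rather than with the study's own pipeline.

```python
def auroc(scores, is_positive):
    """AUROC as the fraction of (positive, negative) cell pairs in which
    the positive cell scores higher; ties count as half."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative cell")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def summed_marker_score(activity, markers):
    """Sum gene activity over a marker set, one score per cell.
    activity maps gene name -> list of per-cell values."""
    n_cells = len(activity[markers[0]])
    return [sum(activity[g][i] for g in markers) for i in range(n_cells)]

# Toy gene-activity values for four cells; the first three are inhibitory.
activity = {"Gad1": [3, 0, 2, 0], "Gad2": [0, 2, 1, 0]}
is_inhibitory = [True, True, True, False]

single = auroc(activity["Gad1"], is_inhibitory)
aggregated = auroc(summed_marker_score(activity, ["Gad1", "Gad2"]), is_inhibitory)
```

In this toy case the second cell has a dropout for one marker, so the single-marker score ties with the negative cell, while the summed score still separates them; this mirrors the abstract's point that aggregation across a redundant marker set compensates for the stochastic absence of individual marker signals.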


STACAS: Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data

10.1101/2020.06.15.152306 ◽

2020 ◽

Cited By ~ 1

Author(s):

Massimo Andreatta ◽

Santiago J. Carmona

Keyword(s):

Single Cell ◽

Distance Measure ◽

Cell Types ◽

R Package ◽

Rna Seq ◽

Batch Effects ◽

Link Type ◽

Transcriptomics Data ◽

Public Repositories ◽

Cell Data

Abstract: Computational tools for the integration of single-cell transcriptomics data are designed to correct batch effects between technical replicates or different technologies applied to the same population of cells. However, they have inherent limitations when applied to heterogeneous sets of data with moderate overlap in cell states or sub-types. STACAS is a package for the identification of integration anchors in the Seurat environment, optimized for the integration of datasets that share only a subset of cell types. We demonstrate that by i) correcting batch effects while preserving relevant biological variability across datasets, ii) filtering aberrant integration anchors with a quantitative distance measure, and iii) constructing optimal guide trees for integration, STACAS can accurately align scRNA-seq datasets composed of only partially overlapping cell populations. We anticipate that the algorithm will be a useful tool for the construction of comprehensive single-cell atlases by integration of the growing amount of single-cell data becoming available in public repositories. Code availability: R package: https://github.com/carmonalab/STACAS. Docker image: https://hub.docker.com/repository/docker/mandrea1/stacas_demo


Single-cell analysis delineates a trajectory toward the human early otic lineage

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1605537113 ◽

2016 ◽

Vol 113 (30) ◽

pp. 8508-8513 ◽

Cited By ~ 21

Author(s):

Megan Ealy ◽

Daniel C. Ellwanger ◽

Nina Kosaric ◽

Andres P. Stapper ◽

Stefan Heller

Keyword(s):

Gene Expression ◽

Stem Cell ◽

Single Cell ◽

Single Cell Analysis ◽

Cell Types ◽

Marker Genes ◽

Cell Analysis ◽

Cell Gene Expression ◽

Neural Ectoderm ◽

Cell Gene

Efficient pluripotent stem cell guidance protocols for the production of human posterior cranial placodes such as the otic placode that gives rise to the inner ear do not exist. Here we use a systematic approach including defined monolayer culture, signaling modulation, and single-cell gene expression analysis to delineate a developmental trajectory for human otic lineage specification in vitro. We found that modulation of bone morphogenetic protein (BMP) and WNT signaling combined with FGF and retinoic acid treatments over the course of 18 days generates cell populations that develop chronological expression of marker genes of non-neural ectoderm, preplacodal ectoderm, and early otic lineage. Gene expression along this differentiation path is distinct from other lineages such as endoderm, mesendoderm, and neural ectoderm. Single-cell analysis exposed the heterogeneity of differentiating cells and allowed discrimination of non-neural ectoderm and otic lineage cells from off-target populations. Pseudotemporal ordering of human embryonic stem cell and induced pluripotent stem cell-derived single-cell gene expression profiles revealed an initially synchronous guidance toward non-neural ectoderm, followed by comparatively asynchronous occurrences of preplacodal and otic marker genes. Positive correlation of marker gene expression between both cell lines and resemblance to mouse embryonic day 10.5 otocyst cells implied reasonable robustness of the guidance protocol. Single-cell trajectory analysis further revealed that otic progenitor cell types are induced in monolayer cultures, but further development appears impeded, likely because of lack of a lineage-stabilizing microenvironment. Our results provide a framework for future exploration of stabilizing microenvironments for efficient differentiation of stem cell-generated human otic cell types.


Sub-Cluster Identification through Semi-Supervised Optimization of Rare-cell Silhouettes (SCISSORS) in Single-Cell Sequencing

10.1101/2021.10.29.466448 ◽

2021 ◽

Author(s):

Jack Leary ◽

Yi Xu ◽

Ashley Morrison ◽

Chong Jin ◽

Emily C. Shen ◽

...

Keyword(s):

Single Cell ◽

Great Influence ◽

Cell Types ◽

R Package ◽

Optimal Number ◽

Marker Genes ◽

Single Cell Sequencing ◽

Cluster Identification ◽

Major Cluster ◽

Rare Cells

Single-cell RNA-sequencing (scRNA-seq) has enabled the molecular profiling of thousands to millions of cells simultaneously in biologically heterogenous samples. Currently, common practice in scRNA-seq is to determine cell type labels through unsupervised clustering and the examination of cluster-specific genes. However, even small differences in analysis and parameter choice can greatly alter clustering solutions and thus impose great influence on which cell types are identified. Existing methods largely focus on determining the optimal number of robust clusters, which is not favorable for identifying cells of extremely low abundance due to their subtle contributions towards overall patterns of gene expression. Here we present a carefully designed framework, SCISSORS, which accurately profiles subclusters within major cluster(s) for the identification of rare cell types in scRNA-seq data. SCISSORS employs silhouette scoring for the estimation of heterogeneity of clusters and reveals rare cells in heterogenous clusters by implementing a multi-step, semi-supervised reclustering process. Additionally, SCISSORS provides a method for the identification of marker genes of rare cells, which may be used for further study. SCISSORS is wrapped around the popular Seurat R package and can be easily integrated into existing Seurat pipelines. SCISSORS, including source code and vignettes for two example datasets, is freely available at https://github.com/jrleary/SCISSORS.
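Silhouette scoring as a measure of cluster heterogeneity can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance on low-dimensional coordinates and a hypothetical fixed threshold for flagging clusters; the abstract does not state SCISSORS's actual criteria, so the helper names and the threshold are assumptions.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_scores(points, labels):
    """Per-cell silhouette (b - a) / max(a, b); needs at least two clusters."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            d = [euclidean(p, q) for j, q in enumerate(points)
                 if labels[j] == c and j != i]
            if d:
                mean_dist[c] = sum(d) / len(d)
        own = labels[i]
        if own not in mean_dist:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = mean_dist[own]  # mean distance within the cell's own cluster
        b = min(v for c, v in mean_dist.items() if c != own)  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores

def mean_silhouette_by_cluster(points, labels):
    grouped = {}
    for label, s in zip(labels, silhouette_scores(points, labels)):
        grouped.setdefault(label, []).append(s)
    return {label: sum(v) / len(v) for label, v in grouped.items()}

def clusters_to_recluster(points, labels, threshold):
    """Flag clusters whose mean silhouette falls below the assumed threshold."""
    means = mean_silhouette_by_cluster(points, labels)
    return sorted(label for label, m in means.items() if m < threshold)
```

A cluster with a low mean silhouette is internally spread out relative to its separation from neighbours, which makes it a natural candidate for a further round of reclustering in search of rare subpopulations.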


Projected t-SNE for batch correction

Bioinformatics ◽

10.1093/bioinformatics/btaa189 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3522-3527 ◽

Cited By ~ 3

Author(s):

Emanuele Aliverti ◽

Jeffrey L Tilson ◽

Dayne L Filer ◽

Benjamin Babcock ◽

Alejandro Colaneri ◽

...

Keyword(s):

Single Cell ◽

High Dimensional Data ◽

Cell Types ◽

R Package ◽

High Dimensional ◽

Batch Effects ◽

Batch Correction ◽

Fundamental Information ◽

Cell Gene Expression ◽

Low Dimensional

Motivation: Low-dimensional representations of high-dimensional data are routinely employed in biomedical research to visualize, interpret and communicate results from different pipelines. In this article, we propose a novel procedure to directly estimate t-SNE embeddings that are not driven by batch effects. Without correction, interesting structure in the data can be obscured by batch effects. The proposed algorithm can therefore significantly aid visualization of high-dimensional data. Results: The proposed methods are based on linear algebra and constrained optimization, leading to efficient algorithms and fast computation in many high-dimensional settings. Results on artificial single-cell transcription profiling data show that the proposed procedure successfully removes multiple batch effects from t-SNE embeddings, while retaining fundamental information on cell types. When applied to single-cell gene expression data to investigate mouse medulloblastoma, the proposed method successfully removes batch effects related to mouse identifiers and the date of the experiment, while preserving clusters of oligodendrocytes, astrocytes, endothelial cells and microglia, which are expected to lie in the stroma within or adjacent to the tumours. Availability and implementation: Source code implementing the proposed approach is available as an R package at https://github.com/emanuelealiverti/BC_tSNE, including a tutorial to reproduce the simulation studies.


NS-Forest: A machine learning method for the objective identification of minimum marker gene combinations for cell type determination from single cell RNA sequencing

10.1101/2020.09.23.308932 ◽

2020 ◽

Author(s):

Brian Aevermann ◽

Yun Zhang ◽

Mark Novotny ◽

Trygve Bakken ◽

Jeremy Miller ◽

...

Keyword(s):

Machine Learning ◽

Single Cell ◽

Rna Sequencing ◽

Marker Gene ◽

Cell Types ◽

Biological Research ◽

Marker Genes ◽

Cell Type ◽

Type Identity ◽

Wide Range

Abstract: Single cell genomics is rapidly advancing our knowledge of cell phenotypic types and states. Driven by single cell/nucleus RNA sequencing (scRNA-seq) data, comprehensive atlas projects covering a wide range of organisms and tissues are currently underway. As a result, it is critical that the cell transcriptional phenotypes discovered are defined and disseminated in a consistent and concise manner. Molecular biomarkers have historically played an important role in biological research, from defining immune cell-types by surface protein expression to defining diseases by molecular drivers. Here we describe a machine learning-based marker gene selection algorithm, NS-Forest version 2.0, which leverages the non-linear attributes of random forest feature selection and a binary expression scoring approach to discover the minimal marker gene expression combinations that precisely capture the cell type identity represented in the complete scRNA-seq transcriptional profiles. The marker genes selected provide a barcode of the necessary and sufficient characteristics for semantic cell type definition and serve as useful tools for downstream biological investigation. The use of NS-Forest to identify marker genes for human brain middle temporal gyrus cell types reveals the importance of cell signaling and non-coding RNAs in neuronal cell type identity.


Accurate and fast cell marker gene identification with COSG

10.1101/2021.06.15.448484 ◽

2021 ◽

Author(s):

Min Dai ◽

Xiaobing Pei ◽

Xiu-Jie Wang

Keyword(s):

Single Cell ◽

Marker Gene ◽

Cell Types ◽

Superior Performance ◽

Gene Identification ◽

Marker Genes ◽

Sequencing Data ◽

Cell Type Specificity ◽

Spatially Resolved ◽

Downstream Analysis

Accurate cell classification is the groundwork for downstream analysis of single-cell sequencing data, yet identifying marker genes that distinguish different cell types remains a major challenge. We developed COSG, a cosine similarity-based method for more accurate and scalable marker gene identification. COSG is applicable to single-cell RNA sequencing data, single-cell ATAC sequencing data and spatially resolved transcriptome data. COSG is fast and scales to ultra-large datasets with millions of cells. Application to both simulated and real experimental datasets demonstrates the superior performance of COSG in terms of both accuracy and efficiency compared with other available methods. Marker genes or genomic regions identified by COSG are more indicative and show greater cell-type specificity.
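The idea of scoring markers by cosine similarity can be illustrated with a minimal sketch. It assumes that each gene's expression vector across cells is compared with a binary indicator of cluster membership; this is an illustrative simplification, not COSG's actual scoring function, and the gene names and values are toy assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def marker_scores(expression, labels, cluster):
    """Score each gene by the cosine similarity between its expression across
    cells and the 0/1 indicator of membership in the given cluster."""
    indicator = [1.0 if label == cluster else 0.0 for label in labels]
    return {gene: cosine(values, indicator) for gene, values in expression.items()}

# Toy data: geneA is expressed only in cluster "T", geneB is expressed everywhere.
expression = {"geneA": [5, 4, 0, 0], "geneB": [1, 1, 1, 1]}
labels = ["T", "T", "B", "B"]
scores = marker_scores(expression, labels, "T")
```

A gene expressed only inside the target cluster points in nearly the same direction as the indicator vector and scores close to 1, whereas a uniformly expressed gene scores lower because it also has weight on cells outside the cluster.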


dropClust2: An R package for resource efficient analysis of large scale single cell RNA-Seq data

10.1101/596924 ◽

2019 ◽

Author(s):

Debajyoti Sinha ◽

Pradyumn Sinha ◽

Ritwik Saha ◽

Sanghamitra Bandyopadhyay ◽

Debarka Sengupta

Keyword(s):

Single Cell ◽

Programming Languages ◽

Large Scale ◽

Principal Component ◽

Cell Types ◽

R Package ◽

Locality Sensitive Hashing ◽

Rna Seq ◽

Link Type ◽

Component Selection

Abstract: DropClust leverages Locality Sensitive Hashing (LSH) to speed up clustering of large scale single cell expression data. It makes ingenious use of structure-preserving sampling and modality-based principal component selection to rescue minor cell types. The existing implementation of dropClust involves interfacing with multiple programming languages, viz. R, Python and C, hindering seamless installation and portability. Here we present dropClust2, a complete R package that is not only fast but also minimally resource intensive. DropClust2 features a novel batch effect removal algorithm that allows integrative analysis of single cell RNA-seq (scRNA-seq) datasets. Availability and implementation: dropClust2 is freely available at https://debsinha.shinyapps.io/dropClust/ as an online web service and at https://github.com/debsin/dropClust as an R package.


Accurate estimation of cell composition in bulk expression through robust integration of single-cell information

10.1101/669911 ◽

2019 ◽

Cited By ~ 1

Author(s):

Brandon Jew ◽

Marcus Alvarez ◽

Elior Rahmani ◽

Zong Miao ◽

Arthur Ko ◽

...

Keyword(s):

Single Cell ◽

Cell Types ◽

R Package ◽

Accurate Estimation ◽

Marker Genes ◽

Rna Seq ◽

Cell Type ◽

Dorsolateral Prefrontal ◽

Additional Mode ◽

Single Nucleus

Abstract: We present Bisque, a tool for estimating cell type proportions in bulk expression. Bisque implements a regression-based approach that utilizes single-cell RNA-seq (scRNA-seq) data to generate a reference expression profile and learn gene-specific bulk expression transformations to robustly decompose RNA-seq data. These transformations significantly improve decomposition performance compared to existing methods when there is significant technical variation in the generation of the reference profile and observed bulk expression. Importantly, compared to existing methods, our approach is extremely efficient, making it suitable for the analysis of large genomic datasets that are becoming ubiquitous. When applied to subcutaneous adipose and dorsolateral prefrontal cortex expression datasets with both bulk RNA-seq and single-nucleus RNA-seq (snRNA-seq) data, Bisque was able to replicate previously reported associations between cell type proportions and measured phenotypes across abundant and rare cell types. Bisque requires a single-cell reference dataset that reflects physiological cell type composition and can further leverage datasets that include both bulk and single-cell measurements over the same samples for improved accuracy. We further propose an additional mode of operation that merely requires a set of known marker genes. Bisque is available as an R package at: https://github.com/cozygene/bisque.


Molecular characteristics and spatial distribution of adult human corneal cell subtypes

Scientific Reports ◽

10.1038/s41598-021-94933-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Ann J. Ligocki ◽

Wen Fury ◽

Christian Gutierrez ◽

Christina Adler ◽

Tao Yang ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Cross Sections ◽

Cell Types ◽

Marker Genes ◽

Molecular Characteristics ◽

Transcriptional Level ◽

Human Cornea ◽

Adult Human ◽

And Migration

Abstract: Bulk RNA sequencing of a tissue captures the gene expression profile from all cell types combined. Single-cell RNA sequencing identifies discrete cell-signatures based on transcriptomic identities. Six adult human corneas were processed for single-cell RNAseq and 16 cell clusters were bioinformatically identified. Based on their transcriptomic signatures and RNAscope results using representative cluster marker genes on human cornea cross-sections, these clusters were confirmed to be stromal keratocytes, endothelium, several subtypes of corneal epithelium, conjunctival epithelium, and supportive cells in the limbal stem cell niche. The complexity of the epithelial cell layer was captured by eight distinct corneal clusters and three conjunctival clusters. These were further characterized by enriched biological pathways and molecular characteristics which revealed novel groupings related to development, function, and location within the epithelial layer. Moreover, epithelial subtypes were found to reflect their initial generation in the limbal region, differentiation, and migration through to mature epithelial cells. The single-cell map of the human cornea deepens the knowledge of the cellular subsets of the cornea on a whole genome transcriptional level. This information can be applied to better understand normal corneal biology, serve as a reference to understand corneal disease pathology, and provide potential insights into therapeutic approaches.
