matchSCore: Matching Single-Cell Phenotypes Across Tools and Experiments

2018 ◽  
Author(s):  
Elisabetta Mereu ◽  
Giovanni Iacono ◽  
Amy Guillaumet-Adkins ◽  
Catia Moutinho ◽  
Giulia Lunazzi ◽  
...  

Abstract Single-cell transcriptomics allows the identification of cellular types, subtypes and states through cell clustering. In this process, similar cells are grouped before determining co-expressed marker genes for phenotype inference. The performance of computational tools is directly associated with their marker identification accuracy, but the lack of an optimal solution challenges a systematic method comparison. Moreover, phenotypes from different studies are challenging to integrate, due to varying resolution, methodology and experimental design. In this work we introduce matchSCore (https://github.com/elimereu/matchSCore), an approach to rapidly match cell populations across tools, experiments and technologies. We compared 14 computational methods and evaluated their accuracy in clustering and gene marker identification in simulated data sets. We further used matchSCore to project cell type identities across mouse and human cell atlas projects. Despite originating from different technologies, cell populations could be matched across data sets, allowing the assignment of clusters to reference maps and their annotation.
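One simple way to picture matching cell populations through their marker genes is to score the overlap between the marker set of a test cluster and that of each reference cluster. The sketch below uses the Jaccard index for that score, which is an assumption made here for illustration. The function names, gene lists and cluster labels are hypothetical, and this is not the API of the matchSCore package.

```python
def jaccard(a, b):
    """Jaccard index of two marker-gene collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_clusters(test_markers, ref_markers):
    """Map each test cluster to the reference cluster with the highest overlap.

    Both arguments are dicts: cluster label -> iterable of marker genes.
    Returns a dict: test label -> (best reference label, Jaccard score).
    """
    matches = {}
    for t_label, t_genes in test_markers.items():
        scores = {r: jaccard(t_genes, r_genes) for r, r_genes in ref_markers.items()}
        best = max(scores, key=scores.get)
        matches[t_label] = (best, scores[best])
    return matches


# Toy, made-up marker lists for illustration only.
reference = {
    "T cell": ["CD3D", "CD3E", "IL7R", "CD2"],
    "B cell": ["CD79A", "MS4A1", "CD19", "CD79B"],
}
test = {
    "cluster_0": ["CD3D", "CD3E", "CD2", "LCK"],
    "cluster_1": ["MS4A1", "CD79A", "CD74"],
}
result = match_clusters(test, reference)
```

With these toy lists, `cluster_0` shares three of five distinct genes with the "T cell" reference and `cluster_1` shares two of five with "B cell", so each is assigned to that reference population. In practice a score threshold would be needed to leave poorly matching clusters unannotated.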

2019 ◽  
Vol 21 (5) ◽  
pp. 1581-1595 ◽  
Author(s):  
Xinlei Zhao ◽  
Shuang Wu ◽  
Nan Fang ◽  
Xiao Sun ◽  
Jue Fan

Abstract Single-cell RNA sequencing (scRNA-seq) has been rapidly developing and widely applied in biological and medical research. Identification of cell types in scRNA-seq data sets is an essential step before in-depth investigations of their functional and pathological roles. However, the conventional workflow based on clustering and marker genes is not scalable for an increasingly large number of scRNA-seq data sets due to complicated procedures and manual annotation. Therefore, a number of tools have been developed recently to predict cell types in new data sets using reference data sets. These methods have not been generally adopted due to a lack of tool benchmarking and user guidance. In this article, we performed a comprehensive and impartial evaluation of nine classification software tools specifically designed for scRNA-seq data sets. Results showed that Seurat based on random forest, SingleR based on correlation analysis and CaSTLe based on XGBoost performed better than others. A simple ensemble voting of all tools can improve the predictive accuracy. Under nonideal situations, such as small-sized and class-imbalanced reference data sets, tools based on cluster-level similarities have superior performance. However, even with the function of assigning ‘unassigned’ labels, it is still challenging to catch novel cell types by solely using any of the single-cell classifiers. This article provides a guideline for researchers to select and apply suitable classification tools in their analysis workflows and sheds some light on potential directions for future improvement of classification tools.
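The "simple ensemble voting" mentioned above can be pictured as a per-cell majority vote over the labels returned by each tool. The following is a minimal sketch under that assumption. The tool names and labels are placeholders, and the rule of marking tied cells as unassigned is a choice made here, not something stated in the abstract.

```python
from collections import Counter


def majority_vote(predictions, unassigned="unassigned"):
    """Combine per-tool labels for each cell by simple majority vote.

    predictions: dict mapping tool name -> list of labels (one per cell,
    all lists in the same cell order).
    A cell is labeled `unassigned` when the top label is tied (assumed rule).
    """
    per_tool = list(predictions.values())
    n_cells = len(per_tool[0])
    if any(len(labels) != n_cells for labels in per_tool):
        raise ValueError("all tools must label the same number of cells")

    consensus = []
    for cell_labels in zip(*per_tool):
        ranked = Counter(cell_labels).most_common()
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        consensus.append(unassigned if tied else ranked[0][0])
    return consensus


# Hypothetical predictions from three tools for three cells.
votes = {
    "tool_a": ["T cell", "B cell", "NK cell"],
    "tool_b": ["T cell", "B cell", "T cell"],
    "tool_c": ["T cell", "monocyte", "B cell"],
}
consensus = majority_vote(votes)
```

Here the first two cells receive the label most tools agree on, while the third cell, on which all three tools disagree, is left unassigned.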


2019 ◽  
Author(s):  
Chenling Xu ◽  
Romain Lopez ◽  
Edouard Mehlman ◽  
Jeffrey Regier ◽  
Michael I. Jordan ◽  
...  

Abstract As single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward how to compare gene expression levels across data sets and how to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations, for instance when only one data set in a cohort is annotated, or when only a few cells in a single data set can be labeled using marker genes. We demonstrate that scVI and scANVI compare favorably to the existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that, unlike existing methods, scVI and scANVI represent the integrated data sets with a single generative model that can be directly used for any probabilistic decision-making task, using differential expression as our case study. scVI and scANVI are available as open source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.


2019 ◽  
Author(s):  
Roman Hillje ◽  
Pier Giuseppe Pelicci ◽  
Lucilla Luzi

Abstract Summary: Despite the growing availability of sophisticated bioinformatic methods for the analysis of single-cell RNA-seq data, few tools exist that allow biologists without bioinformatic expertise to directly visualize and interact with their own data and results. Here, we present Cerebro (cell report browser), a Shiny- and Electron-based standalone desktop application for macOS and Windows, which allows investigation and inspection of pre-processed single-cell transcriptomics data without requiring bioinformatic experience of the user. Through an interactive and intuitive graphical interface, users can i) explore similarities and heterogeneity between samples and cell clusters in 2D or 3D projections such as t-SNE or UMAP, ii) display the expression level of single genes or gene sets of interest, iii) browse tables of most expressed genes and marker genes for each sample and cluster. We provide a simple example to show how Cerebro can be used and what its capabilities are. Through a focus on flexibility and direct access to data and results, we think Cerebro offers a collaborative framework for bioinformaticians and experimental biologists which facilitates effective interaction to shorten the gap between analysis and interpretation of the data. Availability: Cerebro and example data sets are available at https://github.com/romanhaa/Cerebro. The R packages cerebroApp and cerebroPrepare are available at https://github.com/romanhaa/cerebroApp and https://github.com/romanhaa/cerebroPrepare, respectively. All components are released under the MIT License.


2021 ◽  
Author(s):  
Xiaowen Cao ◽  
Li Xing ◽  
Elham Majd ◽  
Hua He ◽  
Junhua Gu ◽  
...  

Abstract Background: Single-cell RNA sequencing (scRNA-seq) yields valuable insights about gene expression and gives critical information about complex tissue cellular composition. In the analysis of single-cell RNA sequencing, the annotations of cell subtypes are often done manually, which is time-consuming and irreproducible. Garnett is a cell-type annotation software based on the elastic net method. Besides cell-type annotation, supervised machine learning methods can also be applied to predict other cell phenotypes from genomic data. Despite the popularity of such applications, there is no existing study to systematically investigate the performance of those supervised algorithms in various sizes of scRNA-seq data sets. Methods and Results: This study evaluates 13 popular supervised machine learning algorithms to classify cell phenotypes, using published real and simulated data sets with diverse cell sizes. The benchmark contained two parts. In the first part, we used real data sets to assess the popular supervised algorithms’ computing speed and cell phenotype classification performance. The classification performances were evaluated using AUC statistics, F1-score, precision, recall, and false-positive rate. In the second part, we evaluated gene selection performance using published simulated data sets with a known list of real genes. Conclusion: The study outcomes showed that ElasticNet with interactions performed best in small and medium data sets. NB was another appropriate method for medium data sets. In large data sets, XGB performed excellently. Ensemble algorithms were not significantly superior to individual machine learning methods. Adding interactions to ElasticNet can help, and the improvement was significant in small data sets.
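For readers unfamiliar with the evaluation measures listed above, the sketch below computes precision, recall, F1-score and false-positive rate from true and predicted labels in the binary case. The function name and the example labels are hypothetical. AUC is omitted because it requires continuous prediction scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and false-positive rate for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1

    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


# Made-up labels: 1 = cell belongs to the phenotype of interest.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this example there are three true positives, one false positive, one false negative and three true negatives, giving precision, recall and F1 of 0.75 and a false-positive rate of 0.25.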


2021 ◽  
Author(s):  
Wolfgang Kopp ◽  
Altuna Akalin ◽  
Uwe Ohler

Advances in single-cell technologies enable the routine interrogation of chromatin accessibility for tens of thousands of single cells, shedding light on gene regulatory processes at an unprecedented resolution. Meanwhile, size, sparsity and high dimensionality of the resulting data continue to pose challenges for its computational analysis, and specifically the integration of data from different sources. We have developed a dedicated computational approach, a variational auto-encoder using a noise model specifically designed for single-cell ATAC-seq data, which facilitates simultaneous dimensionality reduction and batch correction via an adversarial learning strategy. We showcase both its individual advantages on carefully chosen real and simulated data sets, as well as the benefits for detailed cell type characterization via integrating multiple complex datasets.


2020 ◽  
Author(s):  
Hy Vuong ◽  
Thao Truong ◽  
Tan Phan ◽  
Son Pham

Abstract Most widely used tools for finding marker genes in single-cell data (SeuratT/NegBinom/Poisson, CellRanger, EdgeR, limmatrend) use a conventional definition of differentially expressed genes: genes with different mean expression values. However, in single-cell data, a cell population can be a mixture of many cell types or cell states, hence the mean expression of genes cannot represent the whole population. In addition, these tools assume that gene expression of a population belongs to a specific family of distributions. This assumption is often violated in single-cell data. In this work, we define marker genes of a cell population as genes that can be used to distinguish cells in the population from cells in other populations. Besides log-fold change, we devise a new metric to classify genes into up-regulated, down-regulated, and transitional states. In a benchmark for finding up-regulated and down-regulated genes, our tool outperforms all compared methods, including Seurat, ROTS, scDD, edgeR, MAST, limma, normal t-test, Wilcoxon and Kolmogorov–Smirnov test. Our method is much faster than all compared methods and therefore enables interactive analysis of large single-cell data sets in BioTuring Browser. The Venice algorithm is available within the Signac package: https://github.com/bioturing/signac.


Oncogene ◽  
2021 ◽  
Author(s):  
Philip Bischoff ◽  
Alexandra Trinks ◽  
Benedikt Obermayer ◽  
Jan Patrick Pett ◽  
Jennifer Wiederspahn ◽  
...  

Abstract Recent developments in immuno-oncology demonstrate that not only cancer cells, but also the tumor microenvironment can guide precision medicine. A comprehensive and in-depth characterization of the tumor microenvironment is challenging since its cell populations are diverse and can be important even if scarce. To identify clinically relevant microenvironmental and cancer features, we applied single-cell RNA sequencing to ten human lung adenocarcinomas and ten normal control tissues. Our analyses revealed heterogeneous carcinoma cell transcriptomes reflecting histological grade and oncogenic pathway activities, and two distinct microenvironmental patterns. The immune-activated CP²E microenvironment was composed of cancer-associated myofibroblasts, proinflammatory monocyte-derived macrophages, plasmacytoid dendritic cells and exhausted CD8+ T cells, and was prognostically unfavorable. In contrast, the inert N³MC microenvironment was characterized by normal-like myofibroblasts, non-inflammatory monocyte-derived macrophages, NK cells, myeloid dendritic cells and conventional T cells, and was associated with a favorable prognosis. Microenvironmental marker genes and signatures identified in single-cell profiles had prognostic value in bulk tumor profiles. In summary, single-cell RNA profiling of lung adenocarcinoma provides additional prognostic information based on the microenvironment, and may help to predict therapy response and to reveal possible target cell populations for future therapeutic approaches.


2019 ◽  
Author(s):  
Mahmoud M Ibrahim ◽  
Rafael Kramann

ABSTRACT Marker genes identified in single cell experiments are expected to be highly specific to a certain cell type and highly expressed in that cell type. Detecting a gene by differential expression analysis does not necessarily satisfy those two conditions and is typically computationally expensive for large cell numbers. Here we present genesorteR, an R package that ranks features in single cell data in a manner consistent with the expected definition of marker genes in experimental biology research. We benchmark genesorteR using various data sets and show that it is distinctly more accurate in large single cell data sets compared to other methods. genesorteR is orders of magnitude faster than current implementations of differential expression analysis methods, can operate on data containing millions of cells and is applicable to both single cell RNA-Seq and single cell ATAC-Seq data. genesorteR is available at https://github.com/mahmoudibrahim/genesorteR.


Author(s):  
Nico Borgsmüller ◽  
Jose Bonet ◽  
Francesco Marass ◽  
Abel Gonzalez-Perez ◽  
Nuria Lopez-Bigas ◽  
...  

Abstract The high resolution of single-cell DNA sequencing (scDNA-seq) offers great potential to resolve intra-tumor heterogeneity by distinguishing clonal populations based on their mutation profiles. However, the increasing size of scDNA-seq data sets and technical limitations, such as high error rates and a large proportion of missing values, complicate this task and limit the applicability of existing methods. Here we introduce BnpC, a novel non-parametric method to cluster individual cells into clones and infer their genotypes based on their noisy mutation profiles. BnpC employs a Dirichlet process mixture model coupled with a Markov chain Monte Carlo sampling scheme, including a modified split-merge move and a novel posterior estimator to predict clones and genotypes. We benchmarked our method comprehensively against state-of-the-art methods on simulated data using various data sizes, and applied it to three cancer scDNA-seq data sets. On simulated data, BnpC compared favorably against current methods in terms of accuracy, runtime, and scalability. Its inferred genotypes were the most accurate, and it was the only method able to run and produce results on data sets with 10,000 cells. On tumor scDNA-seq data, BnpC was able to identify clonal populations missed by the original cluster analysis but supported by supplementary experimental data. With ever growing scDNA-seq data sets, scalable and accurate methods such as BnpC will become increasingly relevant, not only to resolve intra-tumor heterogeneity but also as a pre-processing step to reduce data size. BnpC is freely available under MIT license at https://github.com/cbg-ethz/BnpC.


2021 ◽  
Author(s):  
Manman Dai ◽  
Min Feng ◽  
Ziwei Li ◽  
Weisan Chen ◽  
Ming Liao

ABSTRACT Chicken peripheral blood lymphocytes (PBLs) exhibit wide-ranging cell types, but current understanding of their subclasses, immune cell classification, and function is limited and incomplete. Previously, we found that viremia caused by avian leukosis virus subgroup J (ALV-J) was eliminated by 21 days post infection (DPI), accompanied by increased CD8+ T cell ratio in PBLs and low antibody levels. Here we performed single-cell RNA sequencing (scRNA-seq) of PBLs in ALV-J infected and control chickens at 21 DPI to determine chicken PBL subsets and their specific molecular and cellular characteristics, before and after viral infection. Eight cell clusters and their potential marker genes were identified in chicken PBLs. T cell populations (clusters 6 and 7) had the strongest response to ALV-J infection at 21 DPI, based on detection of the largest number of differentially expressed genes (DEGs). T cell populations of clusters 6 and 7 could be further divided into four subsets: activated CD4+ T cells (cluster A0), Th1-like cells (cluster A2), Th2-like cells (cluster A1), and cytotoxic CD8+ T cells. Hallmark genes for each T cell subset response to viral infection were initially identified. Furthermore, pseudotime analysis results suggested that chicken CD4+ T cells could potentially differentiate into Th1-like and Th2-like cells. Moreover, ALV-J infection probably induced CD4+ T cell differentiation into Th1-like cells in which the most immune related DEGs were detected. With respect to the control group, ALV-J infection also had an obvious impact on PBL cell composition. B cells showed inconspicuous response and their numbers decreased in PBLs of the ALV-J infected chickens at 21 DPI. Percentages of cytotoxic Th1-like cells and CD8+ T cells were increased in the T cell population of PBLs from ALV-J infected chickens, which were potentially key mitigating factors against ALV-J infection. More importantly, our results provided a rich resource of gene expression profiles of chicken PBL subsets for a systems-level understanding of their function under homeostatic conditions as well as in response to viral infection.

