A review of computational strategies for denoising and imputation of single-cell transcriptomic data

Author(s):  
Lucrezia Patruno ◽  
Davide Maspero ◽  
Francesco Craighero ◽  
Fabrizio Angaroni ◽  
Marco Antoniotti ◽  
...  

Abstract. Motivation: Advances in single-cell sequencing methods have paved the way for the characterization of cellular states at unprecedented resolution, revolutionizing the investigation of complex biological systems. Yet single-cell sequencing experiments are hindered by several technical issues that make the output data noisy and reduce the reliability of downstream analyses. A growing number of data science methods have therefore been proposed to recover lost or corrupted information from single-cell sequencing data. To date, however, no quantitative benchmarks have been proposed to evaluate such methods. Results: We present a comprehensive analysis of the state-of-the-art computational approaches for denoising and imputation of single-cell transcriptomic data, comparing their performance in different experimental scenarios. In detail, we compared 19 denoising and imputation methods, on both simulated and real-world datasets, with respect to several performance metrics related to imputation of dropout events, recovery of true expression profiles, characterization of cell similarity, identification of differentially expressed genes and computation time. The effectiveness and scalability of all methods were assessed with regard to distinct sequencing protocols, sample size and different levels of biological variability and technical noise. As a result, we identify a subset of versatile approaches that perform solidly on most tests and show that certain algorithmic families prove effective on specific tasks but inefficient on others. Finally, most methods appear to benefit from the introduction of appropriate assumptions on the noise distribution of biological processes.

2020 ◽  
Vol 8 (Suppl 3) ◽  
pp. A520-A520
Author(s):  
Son Pham ◽  
Tri Le ◽  
Tan Phan ◽  
Minh Pham ◽  
Huy Nguyen ◽  
...  

Background: Single-cell sequencing technology has opened an unprecedented ability to interrogate cancer. It reveals significant insights into intratumoral heterogeneity, metastasis and therapeutic resistance, which facilitates target discovery and validation in cancer treatment. With rapid advancements in throughput and strategies, a single immuno-oncology study can produce multi-omics profiles for several thousand individual cells. This overflow of single-cell data poses formidable challenges, including standardizing data formats across studies, reanalyzing individual datasets and performing meta-analysis. Methods: N/A. Results: We present BioTuring Browser, an interactive platform for accessing and reanalyzing published single-cell omics data. The platform currently hosts a curated database of more than 10 million cells from 247 projects, covering more than 120 immune cell types and subtypes, and 15 different cancer types. All data are processed and annotated with standardized labels of cell types, diseases, therapeutic responses, etc., so that they can be instantly accessed and explored in a uniform visualization and analytics interface. Based on this massive curated database, BioTuring Browser supports searching for similar expression profiles, querying a target across datasets and automatic cell type annotation. The platform supports single-cell RNA-seq, CITE-seq and TCR-seq data. BioTuring Browser is now available for download at www.bioturing.com. Conclusions: N/A.


2017 ◽  
Author(s):  
Maurizio Pellegrino ◽  
Adam Sciambi ◽  
Sebastian Treusch ◽  
Robert Durruthy-Durruthy ◽  
Kaustubh Gokhale ◽  
...  

Abstract: To enable the characterization of genetic heterogeneity in tumor cell populations, we developed a novel microfluidic approach that barcodes amplified genomic DNA from thousands of individual cancer cells confined to droplets. The barcodes are then used to reassemble the genetic profiles of cells from next generation sequencing data. Using this approach, we sequenced longitudinally collected AML tumor populations from two patients and genotyped up to 62 disease relevant loci across more than 16,000 individual cells. Targeted single-cell sequencing was able to sensitively identify tumor cells during complete remission and uncovered complex clonal evolution within AML tumors that was not observable with bulk sequencing. We anticipate that this approach will make feasible the routine analysis of heterogeneity in AML leading to improved stratification and therapy selection for the disease.


2018 ◽  
Author(s):  
Jochen Singer ◽  
Jack Kuipers ◽  
Katharina Jahn ◽  
Niko Beerenwinkel

Abstract: Understanding the evolution of cancer is important for the development of appropriate cancer therapies. The task is challenging because tumors evolve as heterogeneous cell populations with an unknown number of genetically distinct subclones of varying frequencies. Conventional approaches based on bulk sequencing are limited in addressing this challenge as clones cannot be observed directly. Single-cell sequencing holds the promise of resolving the heterogeneity of tumors; however, it has its own challenges including elevated error rates, allelic dropout, and uneven coverage. Here, we develop a new approach to mutation detection in individual tumor cells by leveraging the evolutionary relationship among cells. Our method, called SCIΦ, jointly calls mutations in individual cells and estimates the tumor phylogeny among these cells. Employing a Markov Chain Monte Carlo scheme, we robustly account for the various sources of noise in single-cell sequencing data. Our approach enables us to reliably call mutations in each single cell even in experiments with high dropout rates and missing data. We show that SCIΦ outperforms existing methods on simulated data and apply it to different real-world datasets, namely a whole-exome breast cancer dataset and a panel acute lymphoblastic leukemia dataset. Availability: https://github.com/cbg-ethz/SCIPhI


2021 ◽  
Vol 12 ◽  
Author(s):  
Zhenhua Yu ◽  
Huidong Liu ◽  
Fang Du ◽  
Xiaofen Tang

Single-cell sequencing (SCS) now reveals the landscape of genetic diversity at the single-cell level, and is particularly useful for reconstructing the evolutionary history of tumors. Multiple types of noise make SCS data notoriously error-prone and significantly complicate tumor tree reconstruction. Existing methods for tumor phylogeny estimation suffer from either high computational cost or a low-resolution view of clonal architecture, creating a need for new methods that reconstruct tumor trees efficiently and accurately. We introduce GRMT (Generative Reconstruction of Mutation Tree from scratch), a method for inferring a tumor mutation tree from SCS data. GRMT exploits the k-Dollo parsimony model, which allows each mutation to be gained once and lost at most k times. Under this constraint on mutation evolution, GRMT searches for mutation tree structures by generating the tree from scratch, and implements this as an iterative process that gradually increases the tree size by introducing one new mutation at a time until a complete tree structure containing all mutations is obtained. This enables GRMT to efficiently recover the chronological order of mutations and to scale well to large datasets. Extensive evaluations on simulated and real datasets suggest that GRMT outperforms state-of-the-art methods on multiple performance metrics. The GRMT software is freely available at https://github.com/qasimyu/grmt.
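
The k-Dollo constraint mentioned in this abstract (each mutation gained once and lost at most k times) can be illustrated with a minimal sketch. This is not GRMT's implementation: the function name `satisfies_k_dollo` and the representation of a tree's edge events as `(mutation, kind)` pairs are assumptions made only for illustration.

```python
from collections import Counter


def satisfies_k_dollo(events, k):
    """Check the k-Dollo constraint on a list of evolutionary events.

    `events` is a hypothetical flat encoding of the events placed on a
    tree's edges: (mutation, kind) pairs with kind "gain" or "loss".
    Returns True if every mutation is gained exactly once and lost at
    most k times.
    """
    gains = Counter()
    losses = Counter()
    for mutation, kind in events:
        if kind == "gain":
            gains[mutation] += 1
        elif kind == "loss":
            losses[mutation] += 1
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    mutations = set(gains) | set(losses)
    return all(gains[m] == 1 and losses[m] <= k for m in mutations)


# Hypothetical example: m1 is gained once and lost once, m2 is only gained.
example_events = [("m1", "gain"), ("m2", "gain"), ("m1", "loss")]
print(satisfies_k_dollo(example_events, k=1))
print(satisfies_k_dollo(example_events, k=0))
```

A checker like this only validates a candidate tree's events; searching over tree structures, which is the substance of the method, is outside the scope of the sketch.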


2020 ◽  
Vol 22 (Supplement_3) ◽  
pp. iii406-iii406
Author(s):  
Andrew Donson ◽  
Kent Riemondy ◽  
Sujatha Venkataraman ◽  
Ahmed Gilani ◽  
Bridget Sanford ◽  
...  

Abstract: We explored cellular heterogeneity in medulloblastoma using single-cell RNA sequencing (scRNAseq), immunohistochemistry and deconvolution of bulk transcriptomic data. Over 45,000 cells from 31 patients from all main subgroups of medulloblastoma (2 WNT, 10 SHH, 9 GP3, 11 GP4 and 1 GP3/4) were clustered using Harmony alignment to identify conserved subpopulations. Each subgroup contained subpopulations exhibiting mitotic, undifferentiated and neuronal differentiated transcript profiles, corroborating other recent medulloblastoma scRNAseq studies. The magnitude of our present study builds on the findings of existing studies, providing further characterization of conserved neoplastic subpopulations, including identification of a photoreceptor-differentiated subpopulation that was predominantly, but not exclusively, found in GP3 medulloblastoma. Deconvolution of MAGIC transcriptomic cohort data showed that neoplastic subpopulations are associated with major and minor subgroup subdivisions; for example, photoreceptor subpopulation cells are more abundant in GP3-alpha. In both GP3 and GP4, a higher proportion of undifferentiated subpopulations is associated with shorter survival and, conversely, a higher proportion of differentiated subpopulations is associated with longer survival. This scRNAseq dataset also afforded unique insights into the immune landscape of medulloblastoma, and revealed an M2-polarized myeloid subpopulation that was restricted to SHH medulloblastoma. Additionally, we performed scRNAseq on 16,000 cells from genetically engineered mouse (GEM) models of GP3 and SHH medulloblastoma. These models showed a level of fidelity with corresponding human subgroup-specific neoplastic and immune subpopulations. Collectively, our findings advance our understanding of the neoplastic and immune landscape of the main medulloblastoma subgroups in both humans and GEM models.


Author(s):  
Givanna H Putri ◽  
Irena Koprinska ◽  
Thomas M Ashhurst ◽  
Nicholas J C King ◽  
Mark N Read

Abstract. Motivation: Many ‘automated gating’ algorithms now exist to cluster cytometry and single-cell sequencing data into discrete populations. Comparative algorithm evaluations on benchmark datasets rely either on a single performance metric, or on a few metrics considered independently of one another. However, single metrics emphasize different aspects of clustering performance and do not rank clustering solutions in the same order. This underlies the lack of consensus between comparative studies regarding optimal clustering algorithms and undermines the translatability of results to other non-benchmark datasets. Results: We propose the Pareto fronts framework as an integrative evaluation protocol, wherein individual metrics are instead leveraged as complementary perspectives. Judged superior are algorithms that provide the best trade-off between the multiple metrics considered simultaneously. This yields a more comprehensive and complete view of clustering performance. Moreover, by broadly and systematically sampling algorithm parameter values using the Latin Hypercube sampling method, our evaluation protocol minimizes (un)fortunate parameter value selections as confounding factors. Furthermore, it reveals how meticulously each algorithm must be tuned in order to obtain good results, vital knowledge for users with novel data. We exemplify the protocol by conducting a comparative study between three clustering algorithms (ChronoClust, FlowSOM and Phenograph) using four common performance metrics applied across four cytometry benchmark datasets. To our knowledge, this is the first time Pareto fronts have been used to evaluate the performance of clustering algorithms in any application domain. Availability and implementation: Implementation of our Pareto front methodology and all scripts and datasets to reproduce this article are available at https://github.com/ghar1821/ParetoBench. Supplementary information: Supplementary data are available at Bioinformatics online.
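
The idea of ranking clustering solutions by Pareto dominance over several metrics at once can be sketched as follows. This is a minimal illustration, not the code in the ParetoBench repository: the function names, the run labels and the scores are hypothetical, and every metric is assumed to be oriented so that higher is better.

```python
def dominates(a, b):
    """True if score tuple `a` is at least as good as `b` on every metric
    and strictly better on at least one (higher is better, by assumption)."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(solutions):
    """Return the sorted names of the non-dominated solutions.

    `solutions` maps a hypothetical run name to a tuple of metric scores.
    """
    return sorted(
        name
        for name, score in solutions.items()
        if not any(dominates(other, score) for other in solutions.values())
    )


# Hypothetical scores on two metrics for three clustering runs.
runs = {
    "run_a": (0.9, 0.6),
    "run_b": (0.7, 0.8),
    "run_c": (0.6, 0.5),
}
print(pareto_front(runs))  # ['run_a', 'run_b']
```

Here `run_c` is dominated by `run_a`, while `run_a` and `run_b` each win on one metric, so both remain on the front. That is the sense in which the front captures trade-offs rather than a single ranking.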


2021 ◽  
Vol 9 (Suppl 1) ◽  
pp. A12.1-A12
Author(s):  
Y Arjmand Abbassi ◽  
N Fang ◽  
W Zhu ◽  
Y Zhou ◽  
Y Chen ◽  
...  

Recent advances in high-throughput single cell sequencing technologies have greatly improved our understanding of complex biological systems. Heterogeneous samples such as tumor tissues commonly harbor cancer cell-specific genetic variants and gene expression profiles, both of which have been shown to be related to the mechanisms of disease development, progression, and responses to treatment. Furthermore, stromal and immune cells within the tumor microenvironment interact with cancer cells to play important roles in tumor responses to systemic therapy such as immunotherapy or cell therapy. However, most current high-throughput single cell sequencing methods detect only gene expression levels or epigenetic events such as chromatin conformation. The information on important genetic variants, including mutations or fusions, is not captured. To better understand the mechanisms of tumor responses to systemic therapy, it is essential to decipher the connection between genotype and gene expression patterns of both tumor cells and cells in the tumor microenvironment. We developed FocuSCOPE, a high-throughput multi-omics sequencing solution that can detect both genetic variants and the transcriptome from the same single cells. FocuSCOPE has been used to successfully perform single cell analysis of both gene expression profiles and point mutations, fusion genes, or intracellular viral sequences from thousands of cells simultaneously, delivering comprehensive insights into tumor and immune cells in the tumor microenvironment at single cell resolution. Disclosure Information: Y. Arjmand Abbassi: None. N. Fang: None. W. Zhu: None. Y. Zhou: None. Y. Chen: None. U. Deutsch: None.


2020 ◽  
Author(s):  
Haoyu Ruan ◽  
Yihang Zhou ◽  
Jie Shen ◽  
Yue Zhai ◽  
Ying Xu ◽  
...  

Abstract: Metastatic lung cancer accounts for about half of brain metastases (BM). Development of leptomeningeal metastases (LM) is becoming increasingly common, and the prognosis is still poor despite advances in systemic and local approaches. Cytology analysis of the cerebrospinal fluid (CSF) remains the diagnostic gold standard. Although several previous studies performed in CSF have offered great promise for the diagnostics and therapeutics of LM, a comprehensive characterization of circulating tumor cells (CTCs) in CSF is still lacking. To fill this critical gap for lung adenocarcinoma LM (LUAD-LM), we analyzed the transcriptomes of 1,375 cells from 5 LUAD-LM patient and 3 control samples using single-cell RNA sequencing technology. We defined CSF-CTCs based on abundant expression of epithelial markers and genes of lung origin, as well as the enrichment of metabolic pathways and cell adhesion molecules, which are crucial for the survival and metastasis of tumor cells. Elevated expression of CEACAM6 and SCGB3A2 was discovered in CSF-CTCs, which could serve as candidate biomarkers of LUAD-LM. We identified substantial heterogeneity in CSF-CTCs among LUAD-LM patients and among individual cells within patients. Cell-cycle gene expression profiles and the proportion of CTCs displaying mesenchymal and cancer stem cell properties also vary among patients. In addition, CSF-CTC transcriptome profiling identified one LM case as cancer of unknown primary site (CUP). Our results shed light on the mechanism of LUAD-LM and provide a new direction for diagnostic testing of LUAD-LM and CUP cases from CSF samples.


2019 ◽  
Author(s):  
Simone Ciccolella ◽  
Murray Patterson ◽  
Paola Bonizzoni ◽  
Gianluca Della Vedova

Abstract. Background: Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes some approaches and tools infeasible. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in the size and resolution of these data begins to put a strain on such methods. One possible solution is to reduce the size of an SCS instance (usually represented as a matrix of presence, absence and missing values of the mutations found in the different sequenced cells) and infer the tree from this reduced-size instance. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells, and using these means as the reduced instance. Such an approach typically uses the Euclidean distance for computing means. However, since the values in these matrices are of a categorical nature (having the three categories present, absent and missing), we explore applying techniques for clustering categorical data, commonly used in data mining and machine learning, to SCS data with this goal in mind. Results: In this work, we present a new clustering procedure, called celluloid, aimed at clustering categorical vector or matrix data, here representing SCS instances. We demonstrate that celluloid clusters mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and that it also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. Finally, we demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime and considerably raising the upper bound on the size of SCS instances that can be solved in practice. Availability: Our approach, celluloid: clustering single cell sequencing data around centroids, is available at https://github.com/AlgoLab/celluloid/ under an MIT license.
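
The contrast this abstract draws between Euclidean means and categorical clustering can be made concrete with a small sketch in the spirit of mode-based (k-modes style) clustering. It is not celluloid's algorithm: the integer encoding of the three categories, the simple matching dissimilarity, and all function names are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical integer encoding of the three categories named in the abstract.
ABSENT, PRESENT, MISSING = 0, 1, 2


def matching_dissimilarity(a, b):
    """Count the positions at which two categorical vectors disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)


def mode_centroid(rows):
    """Build a categorical 'centroid': the most frequent category per column.

    Unlike a Euclidean mean, the result is itself a valid categorical vector.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*rows)]


def assign(rows, centroids):
    """Assign each row to the index of its nearest centroid."""
    return [
        min(
            range(len(centroids)),
            key=lambda i: matching_dissimilarity(row, centroids[i]),
        )
        for row in rows
    ]


# Three hypothetical mutation profiles across three cells.
profiles = [
    [PRESENT, ABSENT, MISSING],
    [PRESENT, ABSENT, ABSENT],
    [PRESENT, PRESENT, ABSENT],
]
print(mode_centroid(profiles))  # [1, 0, 0]
```

Alternating `assign` and `mode_centroid` until the assignments stop changing gives the usual centroid-style loop. Whether missing entries should count as a category of their own or be ignored in the dissimilarity is a design choice that this sketch does not settle.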

