scholarly journals souporcell: Robust clustering of single cell RNAseq by genotype and ambient RNA inference without reference genotypes

2019 ◽  
Author(s):  
Haynes Heaton ◽  
Arthur M. Talman ◽  
Andrew Knights ◽  
Maria Imaz ◽  
Daniel Gaffney ◽  
...  

Methods to deconvolve single-cell RNA sequencing (scRNAseq) data are necessary for samples containing a natural mixture of genotypes and for scRNAseq experiments that multiplex cells from different donors1. Multiplexing across donors is a popular experimental design with many benefits including avoiding batch effects2, reducing costs, and improving doublet detection. Using variants detected in the RNAseq reads, it is possible to assign cells to the individuals from which they arose. These variants can also be used to identify and remove cross-genotype doublet cells that may have highly similar transcriptional profiles precluding detection by transcriptional profile. More subtle cross-genotype variant contamination can be used to estimate the amount of ambient RNA in the system. Ambient RNA is caused by cell lysis prior to droplet partitioning and is an important confounder of scRNAseq analysis3. Souporcell is a novel method to cluster cells using only the genetic variants detected within the scRNAseq reads. We show that it achieves high accuracy on genotype clustering, doublet detection, and ambient RNA estimation as demonstrated across a wide range of challenging scenarios.

BMC Genomics ◽  
2020 ◽  
Vol 21 (S11) ◽  
Author(s):  
Adam Cornish ◽  
Shrabasti Roychoudhury ◽  
Krishna Sarma ◽  
Suravi Pramanik ◽  
Kishor Bhakat ◽  
...  

Abstract Background Single-cell sequencing enables us to better understand genetic diseases, such as cancer or autoimmune disorders, which are often affected by changes in rare cells. Currently, no existing software is aimed at identifying single nucleotide variations or micro (1-50 bp) insertions and deletions in single-cell RNA sequencing (scRNA-seq) data. Generating high-quality variant data is vital to the study of the aforementioned diseases, among others. Results In this study, we report the design and implementation of Red Panda, a novel method to accurately identify variants in scRNA-seq data. Variants were called on scRNA-seq data from human articular chondrocytes, mouse embryonic fibroblasts (MEFs), and simulated data stemming from the MEF alignments. Red Panda had the highest Positive Predictive Value at 45.0%, while other tools—FreeBayes, GATK HaplotypeCaller, GATK UnifiedGenotyper, Monovar, and Platypus—ranged from 5.8–41.53%. From the simulated data, Red Panda had the highest sensitivity at 72.44%. Conclusions We show that our method provides a novel and improved mechanism to identify variants in scRNA-seq as compared to currently existing software. However, methods for identification of genomic variants using scRNA-seq data can be still improved.


2021 ◽  
Vol 12 ◽  
Author(s):  
Bin Zou ◽  
Tongda Zhang ◽  
Ruilong Zhou ◽  
Xiaosen Jiang ◽  
Huanming Yang ◽  
...  

It is well recognized that batch effect in single-cell RNA sequencing (scRNA-seq) data remains a big challenge when integrating different datasets. Here, we proposed deepMNN, a novel deep learning-based method to correct batch effect in scRNA-seq data. We first searched mutual nearest neighbor (MNN) pairs across different batches in a principal component analysis (PCA) subspace. Subsequently, a batch correction network was constructed by stacking two residual blocks and further applied for the removal of batch effects. The loss function of deepMNN was defined as the sum of a batch loss and a weighted regularization loss. The batch loss was used to compute the distance between cells in MNN pairs in the PCA subspace, while the regularization loss was to make the output of the network similar to the input. The experiment results showed that deepMNN can successfully remove batch effects across datasets with identical cell types, datasets with non-identical cell types, datasets with multiple batches, and large-scale datasets as well. We compared the performance of deepMNN with state-of-the-art batch correction methods, including the widely used methods of Harmony, Scanorama, and Seurat V4 as well as the recently developed deep learning-based methods of MMD-ResNet and scGen. The results demonstrated that deepMNN achieved a better or comparable performance in terms of both qualitative analysis using uniform manifold approximation and projection (UMAP) plots and quantitative metrics such as batch and cell entropies, ARI F1 score, and ASW F1 score under various scenarios. Additionally, deepMNN allowed for integrating scRNA-seq datasets with multiple batches in one step. Furthermore, deepMNN ran much faster than the other methods for large-scale datasets. These characteristics of deepMNN made it have the potential to be a new choice for large-scale single-cell gene expression data analysis.


2019 ◽  
Author(s):  
Alemu Takele Assefa ◽  
Jo Vandesompele ◽  
Olivier Thas

SummarySPsimSeq is a semi-parametric simulation method for bulk and single cell RNA sequencing data. It simulates data from a good estimate of the actual distribution of a given real RNA-seq dataset. In contrast to existing approaches that assume a particular data distribution, our method constructs an empirical distribution of gene expression data from a given source RNA-seq experiment to faithfully capture the data characteristics of real data. Importantly, our method can be used to simulate a wide range of scenarios, such as single or multiple biological groups, systematic variations (e.g. confounding batch effects), and different sample sizes. It can also be used to simulate different gene expression units resulting from different library preparation protocols, such as read counts or UMI counts.Availability and implementationThe R package and associated documentation is available from https://github.com/CenterForStatistics-UGent/SPsimSeq.Supplementary informationSupplementary data are available at bioRχiv online.


2020 ◽  
Author(s):  
Weimiao Wu ◽  
Qile Dai ◽  
Yunqing Liu ◽  
Xiting Yan ◽  
Zuoheng Wang

AbstractSingle-cell RNA sequencing provides an opportunity to study gene expression at single-cell resolution. However, prevalent dropout events result in high data sparsity and noise that may obscure downstream analyses. We propose a novel method, G2S3, that imputes dropouts by borrowing information from adjacent genes in a sparse gene graph learned from gene expression profiles across cells. We applied G2S3 and other existing methods to seven single-cell datasets to compare their performance. Our results demonstrated that G2S3 is superior in recovering true expression levels, identifying cell subtypes, improving differential expression analyses, and recovering gene regulatory relationships, especially for mildly expressed genes.


2018 ◽  
Vol 36 (5) ◽  
pp. 421-427 ◽  
Author(s):  
Laleh Haghverdi ◽  
Aaron T L Lun ◽  
Michael D Morgan ◽  
John C Marioni

2020 ◽  
Author(s):  
Wanqiu Chen ◽  
Yongmei Zhao ◽  
Xin Chen ◽  
Xiaojiang Xu ◽  
Zhaowei Yang ◽  
...  

AbstractSingle-cell RNA sequencing (scRNA-seq) has become a very powerful technology for biomedical research and is becoming much more affordable as methods continue to evolve, but it is unknown how reproducible different platforms are using different bioinformatics pipelines, particularly the recently developed scRNA-seq batch correction algorithms. We carried out a comprehensive multi-center cross-platform comparison on different scRNA-seq platforms using standard reference samples. We compared six pre-processing pipelines, seven bioinformatics normalization procedures, and seven batch effect correction methods including CCA, MNN, Scanorama, BBKNN, Harmony, limma and ComBat to evaluate the performance and reproducibility of 20 scRNA-seq data sets derived from four different platforms and centers. We benchmarked scRNA-seq performance across different platforms and testing sites using global gene expression profiles as well as some cell-type specific marker genes. We showed that there were large batch effects; and the reproducibility of scRNA-seq across platforms was dictated both by the expression level of genes selected and the batch correction methods used. We found that CCA, MNN, and BBKNN all corrected the batch variations fairly well for the scRNA-seq data derived from biologically similar samples across platforms/sites. However, for the scRNA-seq data derived from or consisting of biologically distinct samples, limma and ComBat failed to correct batch effects, whereas CCA over-corrected the batch effect and misclassified the cell types and samples. In contrast, MNN, Harmony and BBKNN separated biologically different samples/cell types into correspondingly distinct dimensional subspaces; however, consistent with this algorithm’s logic, MNN required that the samples evaluated each contain a shared portion of highly similar cells. In summary, we found a great cross-platform consistency in separating two distinct samples when an appropriate batch correction method was used. We hope this large cross-platform/site scRNA-seq data set will provide a valuable resource, and that our findings will offer useful advice for the single-cell sequencing community.


2019 ◽  
Author(s):  
Travis S Johnson ◽  
Zhi Huang ◽  
Christina Y Yu ◽  
Tongxin Wang ◽  
Yi Wu ◽  
...  

AbstractMotivationRapid advances in single cell RNA sequencing have produced more granular subtypes of cells in multiple tissues from different species. There exists a need to develop rigorous methods that can i) model multiple datasets with ambiguous labels across species and studies and ii) remove systematic biases across datasets and species.ResultsWe developed a species- and dataset-independent transfer learning framework (LAmbDA) to train models on multiple datasets and applied our framework on scRNA-seq experiments. These models mapped corresponding cell types between datasets with inconsistent labels while simultaneously reducing batch effects. We achieved high accuracy in labeling cellular subtypes (weighted accuracy pancreas: 91%, brain: 78%) using LAmbDA Random Forest. LAmbDA Feedforward 1 Layer Neural Network achieved higher weighted accuracy in labeling cellular subtypes than CaSTLe or MetaNeighbor in brain (48%, 32%, 20% respectively). Furthermore, LAmbDA Feedforward 1 Layer Neural Network was the only method to correctly predict ambiguous cellular subtype labels in both pancreas and brain compared to CaSTLe and MetaNeighbor. LAmbDA is model- and dataset-independent and generalizable to diverse data types representing an advance in biocomputing.Availability: github.com/tsteelejohnson91/LAmbDAContact:[email protected], [email protected]


2020 ◽  
Author(s):  
Min Feng ◽  
Junming Xia ◽  
Shigang Fei ◽  
Xiong Wang ◽  
Yaohong Zhou ◽  
...  

AbstractA wide range of hemocyte types exist in insects but a full definition of the different subclasses is not yet established. The current knowledge of the classification of silkworm hemocytes mainly comes from morphology rather than specific markers, so our understanding of the detailed classification, hemocyte lineage and functions of silkworm hemocytes is very incomplete. Bombyx mori nucleopolyhedrovirus (BmNPV) is a representative member of the baculoviruses, which are a major pathogens that specifically infects silkworms and cause serious loss in sericulture industry. Here, we performed single-cell RNA sequencing (scRNA-seq) of silkworm hemocytes in BmNPV and mock-infected larvae to comprehensively identify silkworm hemocyte subsets and determined specific molecular and cellular characteristics in each hemocyte subset before and after viral infection. A total of 19 cell clusters and their potential marker genes were identified in silkworm hemocytes. Among these hemocyte clusters, clusters 0, 1, 2, 5 and 9 might be granulocytes (GR); clusters 14 and 17 were predicted as plasmatocytes (PL); cluster 18 was tentatively identified as spherulocytes (SP); and clusters 7 and 11 could possibly correspond to oenocytoids (OE). In addition, all of the hemocyte clusters were infected by BmNPV and some infected cells carried high viral-load in silkworm larvae at 3 day post infection (dpi). Interestingly, BmNPV infection can cause severe and diverse changes in gene expression in hemocytes. Cells belonging to the infection group mainly located at the early stage of the pseudotime trajectories. Furthermore, we found that BmNPV infection suppresses the immune response in the major hemocyte types. In summary, our scRNA-seq analysis revealed the diversity of silkworm hemocytes and provided a rich resource of gene expression profiles for a systems-level understanding of their functions in the uninfected condition and as a response to BmNPV.


2021 ◽  
Author(s):  
S. Kelly ◽  
Kai Battenberg ◽  
Nicola Hetherington ◽  
Makoto Hayashi ◽  
Aki Minoda

Abstract Single-cell RNA-sequencing analysis to quantify RNA molecules in individual cells has become popular owing to the large amount of information one can obtain from each experiment. We have developed UniverSC (https://github.com/minoda-lab/universc), a universal single-cell processing tool that supports any UMI-based platform. Our command-line tool enables consistent and comprehensive integration, comparison, and evaluation across data generated from a wide range of platforms.


Sign in / Sign up

Export Citation Format

Share Document