Scalable preprocessing for sparse scRNA-seq data exploiting prior knowledge

2017 ◽  
Author(s):  
Sumit Mukherjee ◽  
Yue Zhang ◽  
Joshua Fan ◽  
Georg Seelig ◽  
Sreeram Kannan

Abstract Motivation: Single cell RNA-seq (scRNA-seq) data contains a wealth of information which has to be inferred computationally from the observed sequencing reads. As the ability to sequence more cells improves rapidly, existing computational tools suffer from three problems. (1) The decreased reads-per-cell implies a highly sparse sample of the true cellular transcriptome. (2) Many tools simply cannot handle the size of the resulting datasets. (3) Prior biological knowledge such as bulk RNA-seq information of certain cell types or qualitative marker information is not taken into account. Here we present UNCURL, a preprocessing framework based on non-negative matrix factorization for scRNA-seq data, that is able to handle varying sampling distributions, scales to very large cell numbers and can incorporate prior knowledge. Results: We find that preprocessing using UNCURL consistently improves performance of commonly used scRNA-seq tools for clustering, visualization, and lineage estimation, both in the absence and presence of prior knowledge. Finally, we demonstrate that UNCURL is extremely scalable and parallelizable, and runs faster than other methods on a scRNA-seq dataset containing 1.3 million cells. Availability: Source code is available at https://github.com/yjzhang/[email protected], [email protected]
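To make the idea of factorizing a genes-by-cells count matrix concrete, here is a minimal sketch of generic non-negative matrix factorization. It is not UNCURL's algorithm: it uses the textbook Lee-Seung multiplicative updates for squared error, and the function names (`nmf`, `frob_error`), the toy matrix `counts`, and all parameters are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor x (genes x cells) into m (genes x k) and w (k x cells), all >= 0.

    Hypothetical helper: Lee-Seung multiplicative updates for squared error.
    """
    rng = random.Random(seed)
    n_genes, n_cells = len(x), len(x[0])
    m = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_genes)]
    w = [[rng.random() + 0.1 for _ in range(n_cells)] for _ in range(k)]
    for _ in range(iters):
        # w <- w * (m^T x) / (m^T m w)
        mt = transpose(m)
        num, den = matmul(mt, x), matmul(matmul(mt, m), w)
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cells)]
             for i in range(k)]
        # m <- m * (x w^T) / (m w w^T)
        wt = transpose(w)
        num, den = matmul(x, wt), matmul(m, matmul(w, wt))
        m = [[m[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n_genes)]
    return m, w


def frob_error(x, m, w):
    """Frobenius norm of the reconstruction residual x - m w."""
    r = matmul(m, w)
    return sum((x[i][j] - r[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0]))) ** 0.5


# Toy sparse count matrix: 4 genes x 6 cells, two made-up cell states.
counts = [
    [5, 4, 6, 0, 0, 1],
    [3, 4, 5, 0, 1, 0],
    [0, 1, 0, 6, 5, 4],
    [0, 0, 1, 4, 5, 6],
]
means, weights = nmf(counts, k=2)
print("reconstruction error:", frob_error(counts, means, weights))
```

In this sketch, each column of `weights` can be read as a cell's mixture over the `k` states. A Poisson-type likelihood, which suits count data better, would need a different update rule.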

eLife ◽  
2019 ◽  
Vol 8 ◽  
Author(s):  
Dylan Kotliar ◽  
Adrian Veres ◽  
M Aurel Nagy ◽  
Shervin Tabrizi ◽  
Eran Hodis ◽  
...  

Identifying gene expression programs underlying both cell-type identity and cellular activities (e.g. life-cycle processes, responses to environmental cues) is crucial for understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell’s expression profile may be a mixture of both types of programs, making them difficult to disentangle. Here, we benchmark and enhance the use of matrix factorization to solve this problem. We show with simulations that a method we call consensus non-negative matrix factorization (cNMF) accurately infers identity and activity programs, including their relative contributions in each cell. To illustrate the insights this approach enables, we apply it to published brain organoid and visual cortex scRNA-Seq datasets; cNMF refines cell types and identifies both expected (e.g. cell cycle and hypoxia) and novel activity programs, including programs that may underlie a neurosecretory phenotype and synaptogenesis.


2021 ◽  
Author(s):  
Juexiao Zhou ◽  
Bin Zhang ◽  
Haoyang Li ◽  
Longxi Zhou ◽  
Zhongxiao Li ◽  
...  

Abstract The accurate annotation of transcription start sites (TSSs) and their usage is critical for the mechanistic understanding of gene regulation under different biological contexts. To this end, specific high-throughput experimental technologies have been developed to capture TSSs in a genome-wide manner, and various computational tools have been developed for in silico prediction of TSSs based solely on genomic sequences. Most of these computational tools cast the problem as a binary classification task on a balanced dataset and thus produce large numbers of false positive predictions when applied at genome scale. To address these issues, we present DeeReCT-TSS, a deep-learning-based method that is capable of TSS identification across the whole genome based on both DNA sequences and conventional RNA-seq data. We show that by effectively incorporating these two sources of information, DeeReCT-TSS significantly outperforms other solely sequence-based methods on the precise annotation of TSSs used in different cell types. Furthermore, we develop a meta-learning-based extension for simultaneous TSS annotation on 10 cell types, which enables the identification of cell-type-specific TSSs. Finally, we demonstrate the high precision of DeeReCT-TSS on two independent datasets from the ENCODE project by correlating our predicted TSSs with experimentally defined TSS chromatin states. Our application, pre-trained models and data are available at https://github.com/JoshuaChou2018/DeeReCT-TSS_release.


2020 ◽  
Author(s):  
Sergei Rybakov ◽  
Mohammad Lotfollahi ◽  
Fabian J. Theis ◽  
F. Alexander Wolf

Abstract Existing methods for learning latent representations for single-cell RNA-seq data are based on autoencoders and factor models. However, representations learned by autoencoders are hard to interpret and representations learned by factor models have limited flexibility. Here, we introduce a framework for learning interpretable autoencoders based on regularized linear decoders. It decomposes variation into interpretable components using prior knowledge in the form of annotated feature sets obtained from public databases. Through this, it provides an alternative to enrichment techniques and factor models for the task of explaining observed variation with biological knowledge. Benchmarking our model on two single-cell RNA-seq datasets, we demonstrate how our model outperforms an existing factor model regarding scalability while maintaining interpretability.


Author(s):  
Hananeh Aliee ◽  
Fabian Theis

Abstract Tissues are complex systems of interacting cell types. Knowing cell-type proportions in a tissue is very important to identify which cells or cell types are targeted by a disease or perturbation. When measuring such responses using RNA-seq, bulk RNA-seq masks cellular heterogeneity. Hence, several computational methods have been proposed to infer cell-type proportions from bulk RNA samples. Their performance with noisy reference profiles highly depends on the set of genes undergoing deconvolution. These genes are often selected based on prior knowledge or a single-criterion test that might not be useful to dissect closely correlated cell types. In this work, we introduce AutoGeneS, a tool that automatically extracts informative genes and reveals the cellular heterogeneity of bulk RNA samples. AutoGeneS requires no prior knowledge about marker genes and selects genes by simultaneously optimizing multiple criteria: minimizing the correlation and maximizing the distance between cell types. It can be applied to reference profiles from various sources like single-cell experiments or sorted cell populations. Results from human samples of peripheral blood illustrate that AutoGeneS outperforms other methods. Our results also highlight the impact of our approach on analyzing bulk RNA samples with noisy single-cell reference profiles and closely correlated cell types. Ground truth cell proportions analyzed by flow cytometry confirmed the accuracy of the predictions of AutoGeneS in identifying cell-type proportions. AutoGeneS is available for use via a standalone Python package (https://github.com/theislab/AutoGeneS).
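The two criteria named in the abstract above (low correlation and high distance between cell-type profiles) can be illustrated with a small sketch. This is not the AutoGeneS package or its API. It assumes that each criterion is summed over pairs of cell types and that candidate gene subsets are enumerated exhaustively, which only works for toy inputs. The helpers `objectives` and `pareto_front`, and the reference profiles, are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def objectives(profiles, genes):
    """Return (summed correlation, summed distance) over cell-type pairs.

    profiles maps a cell-type name to its expression vector (one value per
    gene); genes is a tuple of gene indices forming the candidate subset.
    """
    corr = dist = 0.0
    for a, b in combinations(sorted(profiles), 2):
        va = [profiles[a][g] for g in genes]
        vb = [profiles[b][g] for g in genes]
        corr += pearson(va, vb)
        dist += euclidean(va, vb)
    return corr, dist


def pareto_front(profiles, subset_size):
    """Keep gene subsets that no other subset beats on both criteria."""
    n_genes = len(next(iter(profiles.values())))
    scored = [(s, objectives(profiles, s))
              for s in combinations(range(n_genes), subset_size)]
    front = []
    for s, (c, d) in scored:
        dominated = any(c2 <= c and d2 >= d and (c2 < c or d2 > d)
                        for _, (c2, d2) in scored)
        if not dominated:
            front.append((s, c, d))
    return front


# Made-up reference profiles for two cell types over six genes.
reference = {
    "T cell": [9.0, 1.0, 5.0, 5.0, 0.0, 2.0],
    "B cell": [1.0, 9.0, 5.0, 4.0, 0.0, 7.0],
}
for genes, corr, dist in pareto_front(reference, subset_size=3):
    print(genes, round(corr, 3), round(dist, 3))
```

Because the two criteria can conflict, the sketch returns a set of non-dominated subsets rather than a single winner. Choosing one subset from that set is a separate decision that this example leaves open.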


Author(s):  
Wenjun Kong ◽  
Yuheng C. Fu ◽  
Samantha A. Morris

Summary Transitions in cell identity are fundamental to development, reprogramming, and disease. Single-cell technologies enable the dissection of tissue composition on a cell-by-cell basis in complex biological systems. However, highly sparse single-cell RNA-seq data poses challenges for cell-type identification algorithms based on bulk RNA-seq. Single-cell analytical tools are also limited in that they require prior biological knowledge and typically classify cells in a discrete, categorical manner. Here, we present a computational tool, ‘Capybara,’ designed to measure cell identity as a continuum, at single-cell resolution. This approach enables the classification of discrete cell entities but also identifies cells harboring multiple identities, supporting a metric to quantify cell fate transition dynamics. We benchmark the performance of Capybara against other existing classifiers and demonstrate its efficacy to annotate cells and identify critical transitions within a well-characterized differentiation hierarchy, hematopoiesis. Our application of Capybara to a range of reprogramming strategies reveals previously uncharacterized regional patterning and identifies a putative in vivo correlate for an engineered cell type that has, to date, remained undefined. These findings prioritize interventions to increase the efficiency and fidelity of cell engineering strategies, showcasing the utility of Capybara to dissect cell identity and fate transitions. Capybara code and documentation are available at https://github.com/morris-lab/Capybara.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Lin Que ◽  
David Lukacsovich ◽  
Wenshu Luo ◽  
Csaba Földy

Abstract The diversity reflected by >100 different neural cell types fundamentally contributes to brain function, and a central idea is that neuronal identity can be inferred from genetic information. Recent large-scale transcriptomic assays seem to confirm this hypothesis, but a lack of morphological information has limited the identification of several known cell types. In this study, we used single-cell RNA-seq in morphologically identified parvalbumin interneurons (PV-INs), and studied their transcriptomic states in the morphological, physiological, and developmental domains. Overall, we find high transcriptomic similarity among PV-INs, with few genes showing divergent expression between morphologically different types. Furthermore, PV-INs show a uniform synaptic cell adhesion molecule (CAM) profile, suggesting that CAM expression in mature PV cells does not reflect wiring specificity after development. Together, our results suggest that while PV-INs differ in anatomy and in vivo activity, their continuous transcriptomic and homogeneous biophysical landscapes are not predictive of these distinct identities.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ege Ülgen ◽  
O. Uğur Sezerman

Abstract Background: Cancer develops due to “driver” alterations. Numerous approaches exist for predicting cancer drivers from cohort-scale genomics data. However, methods for personalized analysis of driver genes are underdeveloped. In this study, we developed a novel personalized/batch analysis approach for driver gene prioritization utilizing somatic genomics data, called driveR. Results: Combining genomics information and prior biological knowledge, driveR accurately prioritizes cancer driver genes via a multi-task learning model. Testing on 28 different datasets, this study demonstrates that driveR performs adequately, achieving a median AUC of 0.684 (range 0.651–0.861) on the 28 batch analysis test datasets, and a median AUC of 0.773 (range 0–1) on the 5157 personalized analysis test samples. Moreover, it outperforms existing approaches, achieving a significantly higher median AUC than all of MutSigCV (Wilcoxon rank-sum test p < 0.001), DriverNet (p < 0.001), OncodriveFML (p < 0.001) and MutPanning (p < 0.001) on batch analysis test datasets, and a significantly higher median AUC than DawnRank (p < 0.001) and PRODIGY (p < 0.001) on personalized analysis datasets. Conclusions: This study demonstrates that the proposed method is an accurate and easy-to-utilize approach for prioritizing driver genes in cancer genomes in personalized or batch analyses. driveR is available on CRAN: https://cran.r-project.org/package=driveR.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ruizhu Huang ◽  
Charlotte Soneson ◽  
Pierre-Luc Germain ◽  
Thomas S.B. Schmidt ◽  
Christian Von Mering ◽  
...  

Abstract treeclimbR is a method for analyzing hierarchical trees of entities, such as phylogenies or cell types, at different resolutions. It proposes multiple candidates that capture the latent signal and pinpoints branches or leaves that contain features of interest, in a data-driven way. It outperforms currently available methods on synthetic data, and we highlight the approach on various applications, including microbiome and microRNA surveys as well as single-cell cytometry and RNA-seq datasets. With the emergence of various multi-resolution genomic datasets, treeclimbR provides a thorough inspection of entities across resolutions and gives additional flexibility to uncover biological associations.


2020 ◽  
Vol 22 (1) ◽  
pp. 261
Author(s):  
Abdelnaby Khalyfa ◽  
Wesley Warren ◽  
Jorge Andrade ◽  
Christopher A. Bottoms ◽  
Edward S. Rice ◽  
...  

Intermittent hypoxia (IH) is a hallmark of obstructive sleep apnea (OSA) and induces metabolic dysfunction manifesting as inflammation, increased lipolysis and insulin resistance in visceral white adipose tissues (vWAT). However, the cell types and their corresponding transcriptional pathways underlying these functional perturbations are unknown. Here, we applied single nucleus RNA sequencing (snRNA-seq) coupled with aggregate RNA-seq methods to evaluate the cellular heterogeneity in vWAT following IH exposures mimicking OSA. C57BL/6 male mice were exposed to IH and room air (RA) for 6 weeks, and nuclei from vWAT were isolated and processed for snRNA-seq followed by differentially expressed gene (DEG) analyses by cell type, along with gene ontology and canonical pathway enrichment tests of significance. IH induced significant transcriptional changes compared to RA across 14 different cell types identified in vWAT. We identified cell-specific signature markers, transcriptional networks, metabolic signaling pathways, and cellular subpopulation enrichment in vWAT. Globally, we also identified 298 commonly regulated genes across multiple cell types that are associated with metabolic pathways. Deconvolution of cell types in vWAT using global RNA-seq revealed that distinct adipocytes appear to be differentially implicated in key aspects of metabolic dysfunction. Thus, the heterogeneity of vWAT and its response to IH at the cellular level provide important insights into the metabolic morbidity of OSA and may translate into therapeutic targets.

