Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference

10.1101/598748 ◽

2019 ◽

Cited By ~ 4

Author(s):

Yuanhua Huang ◽

Davis J McCarthy ◽

Oliver Stegle

Keyword(s):

Single Cell ◽

Real Data ◽

Experimental Designs ◽

Joint Analysis ◽

Rna Seq ◽

Computationally Efficient ◽

Primary Object ◽

Genotype Information ◽

Object Of Study ◽

Multiple Samples

Abstract: The joint analysis of multiple samples using single-cell RNA-seq is a promising experimental design, offering increased throughput while allowing batch variation to be accounted for. To enable multi-sample designs, genetic variants that segregate between the samples in the pool have been proposed as natural barcodes for cell demultiplexing. Existing demultiplexing strategies rely on access to complete genotype data from the pooled samples, which greatly limits the applicability of such methods, in particular when genetic variation is not the primary object of study. To address this, we present Vireo, a computationally efficient Bayesian model to demultiplex single-cell data from pooled experimental designs. Uniquely, our model can be applied in settings where only partial or no genotype information is available. Using simulations based on synthetic mixtures and results on real data, we demonstrate the robustness of our model and illustrate the utility of multi-sample experimental designs for common expression analyses.
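
The "natural barcode" idea can be illustrated with a minimal sketch. It covers only the simpler reference-based case, where each donor's genotype is known, and is not Vireo's model, which handles partial or missing genotypes. The function names, the `ALT_RATE` table, and the toy counts are all hypothetical choices made for illustration.

```python
import math

# Assumed expected fraction of alternative-allele reads for genotypes with
# 0, 1 or 2 copies of the alternative allele; the small offsets from 0 and 1
# stand in for sequencing error (illustrative values only).
ALT_RATE = {0: 0.01, 1: 0.5, 2: 0.99}


def log_likelihood(alt_counts, depth_counts, genotypes):
    """Binomial log-likelihood of one cell's reads given one donor's genotypes.

    The binomial coefficient is omitted because it is identical for every
    donor and cancels when donors are compared.
    """
    total = 0.0
    for alt, depth, genotype in zip(alt_counts, depth_counts, genotypes):
        p = ALT_RATE[genotype]
        total += alt * math.log(p) + (depth - alt) * math.log(1.0 - p)
    return total


def assign_donor(alt_counts, depth_counts, donor_genotypes):
    """Return (best donor, posterior per donor) under a uniform prior."""
    logs = {
        name: log_likelihood(alt_counts, depth_counts, genotypes)
        for name, genotypes in donor_genotypes.items()
    }
    top = max(logs.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {name: math.exp(value - top) for name, value in logs.items()}
    norm = sum(weights.values())
    posterior = {name: w / norm for name, w in weights.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# Toy example: three variants, two donors with known genotypes.
donors = {"A": [0, 2, 1], "B": [2, 0, 1]}
best, posterior = assign_donor([0, 3, 1], [2, 3, 2], donors)
```

Here the cell's reads match donor A at the two variants where the donors differ, so A receives nearly all of the posterior mass. The third variant is heterozygous in both donors and carries no information about which donor the cell came from.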

Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference

Genome Biology ◽

10.1186/s13059-019-1865-2 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 13

Author(s):

Yuanhua Huang ◽

Davis J. McCarthy ◽

Oliver Stegle

Keyword(s):

Single Cell ◽

Real Data ◽

Experimental Designs ◽

Rna Seq ◽

Computationally Efficient ◽

Primary Object ◽

Genotype Information ◽

Object Of Study ◽

Multiple Samples ◽

Pooled Samples

Abstract: Multiplexed single-cell RNA-seq analysis of multiple samples using pooling is a promising experimental design, offering increased throughput while allowing batch variation to be overcome. To reconstruct the sample identity of each cell, genetic variants that segregate between the samples in the pool have been proposed as natural barcodes for cell demultiplexing. Existing demultiplexing strategies rely on the availability of complete genotype data from the pooled samples, which limits the applicability of such methods, in particular when genetic variation is not the primary object of study. To address this, we present Vireo, a computationally efficient Bayesian model to demultiplex single-cell data from pooled experimental designs. Uniquely, our model can be applied in settings where only partial or no genotype information is available. Using pools based on synthetic mixtures and results on real data, we demonstrate the robustness of Vireo and illustrate the utility of multiplexed experimental designs for common expression analyses.

Scalable latent-factor models applied to single-cell RNA-seq data separate biological drivers from confounding effects

10.1101/087775 ◽

2016 ◽

Cited By ~ 6

Author(s):

Florian Buettner ◽

Naruemon Pratanwanich ◽

John C. Marioni ◽

Oliver Stegle

Keyword(s):

Single Cell ◽

Real Data ◽

Rna Seq ◽

Biological Factors ◽

Computationally Efficient ◽

Pathway Annotation ◽

Large Populations ◽

Sources Of Variation ◽

Latent Factor Models ◽

Gene Expression Levels

Single-cell RNA-sequencing (scRNA-seq) allows heterogeneity in gene expression levels to be studied in large populations of cells. Such heterogeneity can arise from both technical and biological factors, thus making decomposing sources of variation extremely difficult. We here describe a computationally efficient model that uses prior pathway annotation to guide inference of the biological drivers underpinning the heterogeneity. Moreover, we jointly update and improve gene set annotation and infer factors explaining variability that fall outside the existing annotation. We validate our method using simulations, which demonstrate both its accuracy and its ability to scale to large datasets with up to 100,000 cells. Moreover, through applications to real data we show that our model can robustly decompose scRNA-seq datasets into interpretable components and facilitate the identification of novel sub-populations.

SampleQC: robust multivariate, multi-celltype, multi-sample quality control for single cell data

10.1101/2021.08.28.458012 ◽

2021 ◽

Author(s):

Will Macnair ◽

Mark D Robinson

Keyword(s):

Quality Control ◽

Single Cell ◽

Real Data ◽

R Package ◽

Gaussian Mixture ◽

Model Fit ◽

Rna Seq ◽

Industry Standard ◽

Multiple Samples ◽

Cell Data

Quality control (QC) is a critical component of single cell RNA-seq processing pipelines. Many single cell methods assume that scRNA-seq data comprises multiple celltypes that are distinct in terms of gene expression; however, this is not reflected in current approaches to QC. We show that the current widely used methods for QC may be biased towards excluding rarer celltypes, especially those whose QC metrics are more extreme, e.g. those with naturally high mitochondrial proportions. We introduce SampleQC, which improves sensitivity and reduces bias relative to current industry standard approaches, via a robust Gaussian mixture model fit across multiple samples simultaneously. We show via simulations that SampleQC is less susceptible than other methods to exclusion of rarer celltypes. We also demonstrate SampleQC on complex real data, comprising up to 867k cells over 172 samples. The framework for SampleQC is general, and has applications as an outlier detection method for data beyond single cell RNA-seq. SampleQC is parallelized and implemented in Rcpp, and is available as an R package.

Multiple Haplotype Reconstruction from Allele Frequency Data

10.1101/2020.07.09.191924 ◽

2020 ◽

Author(s):

Marta Pelizzola ◽

Merle Behr ◽

Housen Li ◽

Axel Munk ◽

Andreas Futschik

Keyword(s):

Allele Frequency ◽

Real Data ◽

Coefficient Matrix ◽

Point Of View ◽

Design Matrix ◽

Frequency Data ◽

Computationally Efficient ◽

Regression Problem ◽

Allele Frequency Data ◽

Multiple Samples

Abstract: Since haplotype information is of widespread interest in biomedical applications, considerable effort has been put into haplotype reconstruction. Here, we propose a new, computationally efficient method, called haploSep, that is able to accurately infer major haplotypes and their frequencies just from multiple samples of allele frequency data. Our approach seems to be the first that is able to estimate more than one haplotype given such data. Even the accuracy of experimentally obtained allele frequencies can be improved by re-estimating them from our reconstructed haplotypes. From a methodological point of view, we model our problem as a multivariate regression problem where both the design matrix and the coefficient matrix are unknown. The design matrix, with 0/1 entries, models haplotypes, and the columns of the coefficient matrix represent the frequencies of haplotypes, which are non-negative and sum to one. We illustrate our method on simulated and real data, focusing on experimental evolution and microbial data.
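
The regression structure described in this abstract can be made concrete with a small forward-model sketch: given a 0/1 haplotype matrix and per-sample haplotype frequencies, it computes the allele frequencies the model would predict. This is only the noiseless forward direction under assumed names (`allele_frequencies`, `haplotypes`, `weights`); it does not implement haploSep's inference, in which both matrices are unknown.

```python
def allele_frequencies(haplotypes, weights):
    """Predict allele frequencies as the product of two matrices.

    haplotypes: one row per SNP, one column per haplotype, entries 0 or 1
                (the "design matrix" in the abstract's terminology).
    weights:    one row per haplotype, one column per sample; each column
                holds haplotype frequencies that are non-negative and sum
                to one (the "coefficient matrix").
    Returns a matrix with one row per SNP and one column per sample.
    """
    n_samples = len(weights[0])
    for j in range(n_samples):
        column = [row[j] for row in weights]
        if any(w < 0 for w in column) or abs(sum(column) - 1.0) > 1e-9:
            raise ValueError(f"column {j} is not a valid frequency vector")
    return [
        [sum(h * weights[k][j] for k, h in enumerate(snp)) for j in range(n_samples)]
        for snp in haplotypes
    ]


# Toy example: three SNPs, two haplotypes, two samples.
haplotypes = [[1, 0], [0, 1], [1, 1]]
weights = [[0.7, 0.2], [0.3, 0.8]]
freqs = allele_frequencies(haplotypes, weights)
```

In this toy case the third SNP carries the allele on both haplotypes, so its predicted frequency is one in every sample, while the first two SNPs track the frequency of a single haplotype each.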

The SZS is an efficient statistical method to identify regulated splicing events in droplet-based RNA sequencing

10.1101/2020.11.10.377572 ◽

2020 ◽

Author(s):

Julia Eve Olivieri ◽

Roozbeh Dehghannasiri ◽

Julia Salzman

Keyword(s):

Single Cell ◽

Statistical Method ◽

Rna Seq ◽

Computationally Efficient ◽

Small Set ◽

Biological Discovery ◽

Cell Type Specific ◽

Human Spermatogenesis ◽

Splicing Patterns ◽

Cell Data

Abstract: To date, the field of single-cell genomics has viewed robust splicing analysis as completely out of reach in droplet-based platforms, preventing biological discovery of single-cell regulated splicing. Here, we introduce a novel, robust, and computationally efficient statistical method, the Splicing Z Score (SZS), to detect differential alternative splicing in single cell RNA-Seq technologies including 10x Chromium. We applied the SZS to primary human cells to discover new regulated, cell type-specific splicing patterns. Illustrating the power of the SZS method, splicing of a small set of genes has high predictive power for tissue compartment in the human lung, and the SZS identifies un-annotated, conserved splicing regulation in human spermatogenesis. The SZS is a method that can rapidly identify regulated splicing events from single cell data and prioritize genes predicted to have functionally significant splicing programs.

AdRoit: an accurate and robust method to infer complex transcriptome composition

10.1101/2020.12.14.422697 ◽

2020 ◽

Author(s):

Tao Yang ◽

Nicole Alessandri-Haber ◽

Wen Fury ◽

Michael Schaner ◽

Robert Breese ◽

...

Keyword(s):

Single Cell ◽

Adaptive Learning ◽

Transcriptome Profiling ◽

Cell Types ◽

Data Interpretation ◽

Live Cells ◽

Rna Seq ◽

Cell Type ◽

Computationally Efficient ◽

Cell Composition

Abstract: RNA sequencing technology promises an unprecedented opportunity in learning disease mechanisms and discovering new treatment targets. Recent spatial transcriptomics methods further enable transcriptome profiling at spatially resolved spots in a tissue section. In controlled experiments, it is often of immense importance to know the cell composition in different samples. Understanding the cell type content in each tissue spot is also crucial to interpreting spatial transcriptome data. Though single cell RNA-seq has the power to reveal cell type composition and expression heterogeneity in different cells, it remains costly and sometimes infeasible when live cells cannot be obtained or sufficiently dissociated. To computationally resolve the cell composition in RNA-seq data of mixed cells, we present AdRoit, an accurate and robust method to infer transcriptome composition. The method estimates the proportions of each cell type in the compound RNA-seq data using known single cell data of relevant cell types. It uniquely uses an adaptive learning approach to correct the gene-wise bias due to the difference in sequencing techniques. AdRoit also utilizes cell type specific genes while controlling their cross-sample variability. Our systematic benchmarking, spanning from simple to complex tissues, shows that AdRoit has superior sensitivity and specificity compared to other existing methods. Its performance holds for multiple single cell and compound RNA-seq platforms. In addition, AdRoit is computationally efficient and runs one to two orders of magnitude faster than some of the state-of-the-art methods.

Splatter: simulation of single-cell RNA sequencing data

10.1101/133173 ◽

2017 ◽

Cited By ~ 8

Author(s):

Luke Zappia ◽

Belinda Phipson ◽

Alicia Oshlack

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Real Data ◽

Cell Types ◽

Rna Seq ◽

Sequencing Data ◽

Sequencing Technologies ◽

Simulation Based ◽

Single Cell Rna Sequencing ◽

Multiple Cell

Abstract: As single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.
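
As a rough illustration of what a gamma-Poisson simulation means, the sketch below draws a mean for each gene from a gamma distribution, perturbs it per cell with a second gamma draw, and then samples a Poisson count. It is a simplified stand-in, not Splat's actual parameterisation or the Splatter API (Splatter is an R package); the function names and default parameter values are assumptions chosen for illustration.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw one Poisson(lam) value with Knuth's multiplication method.

    Adequate for the small rates used here; it becomes slow for large lam.
    """
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(n_genes, n_cells, mean_shape=2.0, mean_scale=1.5,
                    dispersion=0.5, seed=0):
    """Return a genes x cells matrix of gamma-Poisson counts (list of lists)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_genes):
        # Gene-level mean expression, drawn once per gene.
        gene_mean = rng.gammavariate(mean_shape, mean_scale)
        row = []
        for _ in range(n_cells):
            # Cell-level rate with expectation gene_mean; mixing a Poisson
            # over this gamma yields overdispersed counts.
            rate = rng.gammavariate(1.0 / dispersion, gene_mean * dispersion)
            row.append(sample_poisson(rng, rate))
        counts.append(row)
    return counts


# Toy example: 5 genes by 8 cells, reproducible through the seed.
matrix = simulate_counts(5, 8, seed=1)
```

Fixing the seed makes the simulated matrix reproducible, which mirrors the abstract's emphasis on reproducible simulations.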

Gene regulation inference from single-cell RNA-seq data with linear differential equations and velocity inference

10.1101/464479 ◽

2018 ◽

Cited By ~ 4

Author(s):

Pierre-Cyril Aubin-Frankowski ◽

Jean-Philippe Vert

Keyword(s):

Gene Regulation ◽

Single Cell ◽

De Novo ◽

State Of The Art ◽

Real Data ◽

Biological Processes ◽

Rna Seq ◽

Cell Cycles ◽

Linear Differential ◽

Cell Trajectories

Abstract: Single-cell RNA sequencing (scRNA-seq) offers new possibilities to infer gene regulation networks (GRN) for biological processes involving a notion of time, such as cell differentiation or cell cycles. It also raises many challenges due to the destructive measurements inherent to the technology. In this work we propose a new method named GRISLI for de novo GRN inference from scRNA-seq data. GRISLI infers a velocity vector field in the space of scRNA-seq data from profiles of individual data, and models the dynamics of cell trajectories with a linear ordinary differential equation to reconstruct the underlying GRN with a sparse regression procedure. We show on real data that GRISLI outperforms a recently proposed state-of-the-art method for GRN reconstruction from scRNA-seq data.

Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape

10.1101/2021.08.13.456196 ◽

2021 ◽

Author(s):

Luke Zappia ◽

Fabian J Theis

Keyword(s):

Single Cell ◽

Open Science ◽

Software Tools ◽

Field Analysis ◽

Analysis Tool ◽

Tracking Data ◽

Rna Seq ◽

Science Practices ◽

Single Cell Rna Sequencing ◽

Multiple Samples

Recent years have seen a revolution in single-cell technologies, particularly single-cell RNA-sequencing (scRNA-seq). As the number, size and complexity of scRNA-seq datasets continue to increase, so does the number of computational methods and software tools for extracting meaning from them. Since 2016 the scRNA-tools database has catalogued software tools for analysing scRNA-seq data. With the number of tools in the database passing 1000, we take this opportunity to provide an update on the state of the project and the field. Analysis of five years of analysis tool tracking data clearly shows the evolution of the field, and that the focus of developers has moved from ordering cells on continuous trajectories to integrating multiple samples and making use of reference datasets. We also find evidence that open science practices reward developers with increased recognition and help accelerate the field.

scBatch: batch-effect correction of RNA-seq data through sample distance matrix adjustment

Bioinformatics ◽

10.1093/bioinformatics/btaa097 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3115-3123 ◽

Cited By ~ 3

Author(s):

Teng Fei ◽

Tianwei Yu

Keyword(s):

Single Cell ◽

Differential Expression Analysis ◽

Distance Matrix ◽

Real Data ◽

R Package ◽

Batch Effect ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Gene Differential Expression

Abstract

Motivation: Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data.

Results: We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods.

Availability and implementation: The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts.

Supplementary information: Supplementary data are available at Bioinformatics online.
