scholarly journals Valid post-clustering differential analysis for single-cell RNA-Seq

2018 ◽  
Author(s):  
Jesse M. Zhang ◽  
Govinda M. Kamath ◽  
David N. Tse

SummarySingle-cell computational pipelines involve two critical steps: organizing cells (clustering) and identifying the markers driving this organization (differential expression analysis). State-of-the-art pipelines perform differential analysis after clustering on the same dataset. We observe that because clustering forces separation, reusing the same dataset generates artificially low p-values and hence false discoveries. We introduce a valid post-clustering differential analysis framework which corrects for this problem. We provide software at https://github.com/jessemzhang/tn_test.

2021 ◽  
Author(s):  
Marine Gauthier ◽  
Denis Agniel ◽  
Rodolphe Thiébaut ◽  
Boris P. Hejblum

State-of-the-art methods for single-cell RNA-seq (scRNA-seq) Differential Expression Analysis (DEA) often rely on strong distributional assumptions that are difficult to verify in practice. Furthermore, while the increasing complexity of clinical and biological single-cell studies calls for greater tool versatility, the majority of existing methods only tackle the comparison between two conditions. We propose a novel, distribution-free, and flexible approach to DEA for single-cell RNA-seq data. This new method, called ccdf, tests the association of each gene expression with one or many variables of interest (that can be either continuous or discrete), while potentially adjusting for additional covariates. To test such complex hypotheses, ccdf uses a conditional independence test relying on the conditional cumulative distribution function, estimated through multiple regressions. We provide the asymptotic distribution of the ccdf test statistic as well as a permutation test (when the number of observed cells is not sufficiently large). ccdf substantially expands the possibilities for scRNA-seq DEA studies: it obtains good statistical performance in various simulation scenarios considering complex experimental designs i.e. beyond the two condition comparison), while retaining competitive performance with state-of-the-art methods in a two-condition benchmark.


2018 ◽  
Vol 34 (19) ◽  
pp. 3340-3348 ◽  
Author(s):  
Zhijin Wu ◽  
Yi Zhang ◽  
Michael L Stitzel ◽  
Hao Wu

2017 ◽  
Author(s):  
Charlotte Soneson ◽  
Mark D. Robinson

AbstractBackgroundAs single-cell RNA-seq (scRNA-seq) is becoming increasingly common, the amount of publicly available data grows rapidly, generating a useful resource for computational method development and extension of published results. Although processed data matrices are typically made available in public repositories, the procedure to obtain these varies widely between data sets, which may complicate reuse and cross-data set comparison. Moreover, while many statistical methods for performing differential expression analysis of scRNA-seq data are becoming available, their relative merits and the performance compared to methods developed for bulk RNA-seq data are not sufficiently well understood.ResultsWe present conquer, a collection of consistently processed, analysis-ready public single-cell RNA-seq data sets. Each data set has count and transcripts per million (TPM) estimates for genes and transcripts, as well as quality control and exploratory analysis reports. We use a subset of the data sets available in conquer to perform an extensive evaluation of the performance and characteristics of statistical methods for differential gene expression analysis, evaluating a total of 30 statistical approaches on both experimental and simulated scRNA-seq data.ConclusionsConsiderable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.


2016 ◽  
Author(s):  
Stefano Beretta ◽  
Yuri Pirola ◽  
Valeria Ranzani ◽  
Grazisa Rossetti ◽  
Raoul Bonnal ◽  
...  

MOTIVATION Long non-coding RNAs (lncRNAs) have recently gained interest, especially for their involvement in controlling several cell processes, but a full understanding of their role is lacking. Differential Expression (DE) analysis is one of the most important tasks in the analysis of RNA-seq data, since it potentially points out genes involved in the regulation of the condition under study. However, a classical analysis at gene level may disregard the role of Alternative Splicing (AS) in regulating cell conditions. This is the case, for example, when a given gene is expressed in all the different conditions, but the expressed isoform is significantly diverse in the different conditions (that is an isoform switch). A transcript level analysis may better shed light on this case, especially in studies having as goal, for example, a better understanding of the behavior of lncRNAs in lymphocytes T cells, which are fundamental in studies of specific diseases, such as cancer. After Cufflinks/Cuffdiff, several approaches for DE analysis at isoform/transcript level have been proposed. However, their results are often sensitive to the upstream analysis such as read mapping, transcript reconstruction and quantification, and it is often hard to choose "a priori" the most appropriate combination of tools. This work presents a tool for assisting the user in this choice, and poses the bases for a study devoted to the characterization of lncRNAs and the identification of of isoform switch events. Our tool includes a framework for the description and the execution of a set of DE pipelines over the same input dataset, as well a set of tools for reconciling and comparing the results. METHOD We designed an automated and easily customizable tool which is able to execute a set of existing pipelines for DE analysis at transcript level starting from RNA-seq data. Our method is built upon Snakemake, a workflow management system, with the specific goal of reducing the complexity of creating workflows. This approach guarantees that the experimentation is fully replicable and easy to customize. Each considered pipeline is structured in three steps: (i) transcript assembly, (ii) quantification, and (iii) DE analysis. By default, our tool builds and compares 9 different pipelines, each taking as input the same set of RNA-seq reads, obtained by combining different state-of-the-art methods to perform the transcript assembly (TA step) with different state-of-the-art methods to perform quantification and differential expression analysis (Q+DE step). More precisely, the 9 pipelines are obtained by combining two tools (Cufflinks and StringTie) and a Reference Annotation (Ensembl annotated transcripts) for the TA step, with three tools (Cuffquant+Cuffdiff, StringTie-B+Ballgown and Kallisto+Sleuth) for the Q+DE step. Abstract truncated at 3,000 characters - the full version is available in the pdf file


2021 ◽  
Author(s):  
Jordan W. Squair ◽  
Matthieu Gautier ◽  
Claudia Kathe ◽  
Mark A. Anderson ◽  
Nicholas D. James ◽  
...  

Differential expression analysis in single-cell transcriptomics enables the dissection of cell-type-specific responses to perturbations such as disease, trauma, or experimental manipulation. While many statistical methods are available to identify differentially expressed genes, the principles that distinguish these methods and their performance remain unclear. Here, we show that the relative performance of these methods is contingent on their ability to account for variation between biological replicates. Methods that ignore this inevitable variation are biased and prone to false discoveries. Indeed, the most widely used methods can discover hundreds of differentially expressed genes in the absence of biological differences. Our results suggest an urgent need for a paradigm shift in the methods used to perform differential expression analysis in single-cell data.


2017 ◽  
Author(s):  
Koen Van den Berge ◽  
Charlotte Soneson ◽  
Michael I. Love ◽  
Mark D. Robinson ◽  
Lieven Clement

AbstractDropout in single cell RNA-seq (scRNA-seq) applications causes many transcripts to go undetected. It induces excess zero counts, which leads to power issues in differential expression (DE) analysis and has triggered the development of bespoke scRNA-seq DE tools that cope with zero-inflation. Recent evaluations, however, have shown that dedicated scRNA-seq tools provide no advantage compared to traditional bulk RNA-seq tools. We introduce zingeR, a zero-inflated negative binomial model that identifies excess zero counts and generates observation weights to unlock bulk RNA-seq pipelines for zero-inflation, boosting performance in scRNA-seq differential expression analysis.


2016 ◽  
Author(s):  
Stefano Beretta ◽  
Yuri Pirola ◽  
Valeria Ranzani ◽  
Grazisa Rossetti ◽  
Raoul Bonnal ◽  
...  

MOTIVATION Long non-coding RNAs (lncRNAs) have recently gained interest, especially for their involvement in controlling several cell processes, but a full understanding of their role is lacking. Differential Expression (DE) analysis is one of the most important tasks in the analysis of RNA-seq data, since it potentially points out genes involved in the regulation of the condition under study. However, a classical analysis at gene level may disregard the role of Alternative Splicing (AS) in regulating cell conditions. This is the case, for example, when a given gene is expressed in all the different conditions, but the expressed isoform is significantly diverse in the different conditions (that is an isoform switch). A transcript level analysis may better shed light on this case, especially in studies having as goal, for example, a better understanding of the behavior of lncRNAs in lymphocytes T cells, which are fundamental in studies of specific diseases, such as cancer. After Cufflinks/Cuffdiff, several approaches for DE analysis at isoform/transcript level have been proposed. However, their results are often sensitive to the upstream analysis such as read mapping, transcript reconstruction and quantification, and it is often hard to choose "a priori" the most appropriate combination of tools. This work presents a tool for assisting the user in this choice, and poses the bases for a study devoted to the characterization of lncRNAs and the identification of of isoform switch events. Our tool includes a framework for the description and the execution of a set of DE pipelines over the same input dataset, as well a set of tools for reconciling and comparing the results. METHOD We designed an automated and easily customizable tool which is able to execute a set of existing pipelines for DE analysis at transcript level starting from RNA-seq data. Our method is built upon Snakemake, a workflow management system, with the specific goal of reducing the complexity of creating workflows. This approach guarantees that the experimentation is fully replicable and easy to customize. Each considered pipeline is structured in three steps: (i) transcript assembly, (ii) quantification, and (iii) DE analysis. By default, our tool builds and compares 9 different pipelines, each taking as input the same set of RNA-seq reads, obtained by combining different state-of-the-art methods to perform the transcript assembly (TA step) with different state-of-the-art methods to perform quantification and differential expression analysis (Q+DE step). More precisely, the 9 pipelines are obtained by combining two tools (Cufflinks and StringTie) and a Reference Annotation (Ensembl annotated transcripts) for the TA step, with three tools (Cuffquant+Cuffdiff, StringTie-B+Ballgown and Kallisto+Sleuth) for the Q+DE step. Abstract truncated at 3,000 characters - the full version is available in the pdf file


2016 ◽  
Author(s):  
Zong-Hong Zhang ◽  
Naomi R. Wray ◽  
Qiong-Yi Zhao

AbstractSummaryDifferential expression analysis using high-throughput RNA sequencing (RNA-seq) data is widely applied in transcriptomic studies and many software tools have been developed for this purpose. Active development of existing popular tools, together with emergence of new tools means that studies comparing the performance of differential expression analysis methods become rapidly out-of-date. In order to enable researchers to evaluate new and updated software in a timely manner, we developed DEAR-O, a user-friendly platform for performance evaluation of differential expression analysis based on RNA-seq data. The platform currently includes four of the most popular tools: DESeq, DESeq2, edgeR and Cuffdiff2. Based on the DEAR-O platform, researchers can evaluate the performance of different tools, or the same tool with different versions, with a customised number of biological replicates using already curated RNA-seq datasets. We also initiated an online forum for discussion of RNA-seq differential expression analysis. Through this forum, new useful tools and benchmarking datasets can be introduced. Our platform will be actively maintained to ensure new major versions of existing tools and new popular tools are included. DEAR-O will serve the community by providing timely evaluations of tools, versions and number of replicates for RNA-seq differential expression analysis.Availability and implementationThe DEAR-O platform is available at http://cnsgenomics.com/software/dear-o; the online discussion forum is https://groups.google.com/d/forum/[email protected] and [email protected]


Sign in / Sign up

Export Citation Format

Share Document