Differential Expression Analysis for RNA-Seq: An Overview of Statistical Methods and Computational Software

Cancer Informatics ◽

10.4137/cin.s21631 ◽

2015 ◽

Vol 14s1 ◽

pp. CIN.S21631 ◽

Cited By ~ 6

Author(s):

Huei-Chung Huang ◽

Yi Niu ◽

Li-Xuan Qin

Keyword(s):

Differential Expression ◽

Statistical Methods ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Computational Tools ◽

Estimation Algorithms ◽

Testing Strategies ◽

Discrete Nature ◽

Statistical Background

Deep sequencing has recently emerged as a powerful alternative to microarrays for the high-throughput profiling of gene expression. In order to account for the discrete nature of RNA sequencing data, new statistical methods and computational tools have been developed for the analysis of differential expression to identify genes that are relevant to a disease such as cancer. In this paper, it is thus timely to provide an overview of these analysis methods and tools. For readers with statistical background, we also review the parameter estimation algorithms and hypothesis testing strategies used in these methods.

Download Full-text

The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab028 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Xueyi Dong ◽

Luyi Tian ◽

Quentin Gouil ◽

Hasaru Kariyawasam ◽

Shian Su ◽

...

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Transcriptomic Analysis ◽

Statistical Testing ◽

Rna Seq ◽

Sequencing Data ◽

Short Read ◽

Sequencing Platform ◽

Long Read

Abstract Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error and small library sizes, which decreases quantification accuracy and reduces power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 was analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.

Download Full-text

A relative comparison between Hidden Markov- and Log-Linear-based models for differential expression analysis in a real time course RNA sequencing data

10.1101/448886 ◽

2018 ◽

Author(s):

Fatemeh Gholizadeh ◽

Zahra Salehi ◽

Ali Mohammad banaei-Moghaddam ◽

Abbas Rahimi Foroushani ◽

Kaveh kavousi

Keyword(s):

Real Time ◽

Differential Expression ◽

Rna Sequencing ◽

Time Course ◽

Hidden Markov ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Normalization Methods ◽

Log Linear

AbstractWith the advent of the Next Generation Sequencing technologies, RNA-seq has become known as an optimal approach for studying gene expression profiling. Particularly, time course RNA-seq differential expression analysis has been used in many studies to identify candidate genes. However, applying a statistical method to efficiently identify differentially expressed genes (DEGs) in time course studies is challenging due to inherent characteristics of such data including correlation and dependencies over time. Here we aim to relatively compare EBSeq-HMM, a Hidden Markov-based model, with multiDE, a Log-Linear-based model, in a real time course RNA sequencing data. In order to conduct the comparison, common DEGs detected by edgeR, DESeq2 and Voom (referred to as Benchmark DEGs) were utilized as a measure. Each of the two models were compared using different normalization methods. The findings revealed that multiDE identified more Benchmark DEGs and showed a higher agreement with them than EBSeq-HMM. Furthermore, multiDE and EBSeq-HMM displayed their best performance using TMM and Upper-Quartile normalization methods, respectively.

Download Full-text

Bias, robustness and scalability in differential expression analysis of single-cell RNA-seq data

10.1101/143289 ◽

2017 ◽

Cited By ~ 16

Author(s):

Charlotte Soneson ◽

Mark D. Robinson

Keyword(s):

Single Cell ◽

Differential Expression ◽

Statistical Methods ◽

Expression Analysis ◽

Method Development ◽

Differential Expression Analysis ◽

Data Sets ◽

Rna Seq ◽

Data Set ◽

Extensive Evaluation

AbstractBackgroundAs single-cell RNA-seq (scRNA-seq) is becoming increasingly common, the amount of publicly available data grows rapidly, generating a useful resource for computational method development and extension of published results. Although processed data matrices are typically made available in public repositories, the procedure to obtain these varies widely between data sets, which may complicate reuse and cross-data set comparison. Moreover, while many statistical methods for performing differential expression analysis of scRNA-seq data are becoming available, their relative merits and the performance compared to methods developed for bulk RNA-seq data are not sufficiently well understood.ResultsWe present conquer, a collection of consistently processed, analysis-ready public single-cell RNA-seq data sets. Each data set has count and transcripts per million (TPM) estimates for genes and transcripts, as well as quality control and exploratory analysis reports. We use a subset of the data sets available in conquer to perform an extensive evaluation of the performance and characteristics of statistical methods for differential gene expression analysis, evaluating a total of 30 statistical approaches on both experimental and simulated scRNA-seq data.ConclusionsConsiderable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.

Download Full-text

Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA–sequencing data

10.1101/220129 ◽

2017 ◽

Cited By ~ 2

Author(s):

Alemu Takele Assefa ◽

Katrijn De Paepe ◽

Celine Everaert ◽

Pieter Mestdagh ◽

Olivier Thas ◽

...

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Expression Analysis ◽

Web Application ◽

Empirical Bayes ◽

Performance Metrics ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Normalization Methods

ABSTRACTBackgroundProtein-coding RNAs (mRNA) have been the primary target of most transcriptome studies in the past, but in recent years, attention has expanded to include long non-coding RNAs (lncRNA). lncRNAs are typically expressed at low levels, and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data along with their normalization methods is comprehensively evaluated, with a particular focus on lncRNAs and low abundant mRNAs.ResultsThirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures are used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNA and lncRNA separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and analysis of benchmark RNA-seq datasets. No single tool uniformly outperformed the others.ConclusionOverall, the linear modeling with empirical Bayes moderation (limma) and the nonparametric approach (SAMSeq) showed best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, for achieving a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in a realistic clinical settings such as in cancer research. About half of the methods showed severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/

Download Full-text

Differential expression analysis of log-ratio transformed counts: benchmarking methods for RNA-Seq data

10.1101/231175 ◽

2017 ◽

Cited By ~ 1

Author(s):

Thomas P. Quinn ◽

Tamsyn M. Crowley ◽

Mark F. Richardson

Keyword(s):

16S Rrna ◽

Differential Expression ◽

High Precision ◽

Differential Expression Analysis ◽

Real Data ◽

False Positives ◽

Rna Seq ◽

Sequencing Data ◽

Library Size ◽

Log Ratio

AbstractBackgroundCount data generated by next-generation sequencing assays do not measure absolute transcript abundances. Instead, the data are constrained to an arbitrary “library size” by the sequencing depth of the assay, and typically must be normalized prior to statistical analysis. The constrained nature of these data means one could alternatively use a log-ratio transformation in lieu of normalization, as often done when testing for differential abundance (DA) of operational taxonomic units (OTUs) in 16S rRNA data. Therefore, we benchmark how well the ALDEx2 package, a transformation-based DA tool, detects differential expression in high-throughput RNA-sequencing data (RNA-Seq), compared to conventional RNA-Seq differential expression methods.ResultsTo evaluate the performance of log-ratio transformation-based tools, we apply the ALDEx2 package to two simulated, and one real, RNA-Seq data sets. The latter was previously used to benchmark dozens of conventional RNA-Seq differential expression methods, enabling us to directly compare transformation-based approaches. We show that ALDEx2, widely used in meta-genomics research, identifies differentially expressed genes (and transcripts) from RNA-Seq data with high precision and, given sufficient sample sizes, high recall too (regardless of the alignment and quantification procedure used). Although we show that the choice in log-ratio transformation can affect performance, ALDEx2 has high precision (i.e., few false positives) across all transformations. Finally, we present a novel, iterative log-ratio transformation (now implemented in ALDEx2) that further improves performance in simulations.ConclusionsOur results suggest that log-ratio transformation-based methods can work to measure differential expression from RNA-Seq data, provided that certain assumptions are met. Moreover, these methods have high precision (i.e., few false positives) in simulations and perform as good as, or better than, than conventional methods on real data. With previously demonstrated applicability to 16S rRNA data, ALDEx2 can work as a single tool for data from multiple sequencing modalities.

Download Full-text

Differential expression analysis of RNA sequencing data by incorporating non-exonic mapped reads

10.1101/016196 ◽

2015 ◽

Author(s):

Hung-I Harry Chen ◽

Yuanhang Liu ◽

Yi Zou ◽

Zhao Lai ◽

Devanand Sarkar ◽

...

Keyword(s):

Differential Expression ◽

Rna Sequencing ◽

Expression Analysis ◽

Background Noise ◽

Negative Binomial ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Expression Levels ◽

Differential Expressed Genes

Background RNA sequencing (RNA-seq) is a powerful tool for genome-wide expression profiling of biological samples with the advantage of high-throughput and high resolution. There are many existing algorithms nowadays for quantifying expression levels and detecting differential gene expression, but none of them takes the misaligned reads that are mapped to non-exonic regions into account. We developed a novel algorithm, XBSeq, where a statistical model was established based on the assumption that observed signals are the convolution of true expression signals and sequencing noises. The mapped reads in non-exonic regions are considered as sequencing noises, which follows a Poisson distribution. Given measureable observed and noise signals from RNA-seq data, true expression signals, assuming governed by the negative binomial distribution, can be delineated and thus the accurate detection of differential expressed genes. Results We implemented our novel XBSeq algorithm and evaluated it by using a set of simulated expression datasets under different conditions, using a combination of negative binomial and Poisson distributions with parameters derived from real RNA-seq data. We compared the performance of our method with other commonly used differential expression analysis algorithms. We also evaluated the changes in true and false positive rates with variations in biological replicates, differential fold changes, and expression levels in non-exonic regions. We also tested the algorithm on a set of real RNA-seq data where the common and different detection results from different algorithms were reported. Conclusions In this paper, we proposed a novel XBSeq, a differential expression analysis algorithm for RNA-seq data that takes non-exonic mapped reads into consideration. When background noise is at baseline level, the performance of XBSeq and DESeq are mostly equivalent. However, our method surpasses DESeq and other algorithms with the increase of non-exonic mapped reads. Only in very low read count condition XBSeq had a slightly higher false discovery rate, which may be improved by adjusting the background noise effect in this situation. Taken together, by considering non-exonic mapped reads, XBSeq can provide accurate expression measurement and thus detect differential expressed genes even in noisy conditions.

Download Full-text

Best practices on the differential expression analysis of multi-species RNA-seq

Genome Biology ◽

10.1186/s13059-021-02337-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Matthew Chung ◽

Vincent M. Bruno ◽

David A. Rasko ◽

Christina A. Cuomo ◽

José F. Muñoz ◽

...

Keyword(s):

Best Practices ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Single Species ◽

Rna Seq ◽

Species Analysis ◽

Differential Gene ◽

Multiple Species ◽

Downstream Analysis

AbstractAdvances in transcriptome sequencing allow for simultaneous interrogation of differentially expressed genes from multiple species originating from a single RNA sample, termed dual or multi-species transcriptomics. Compared to single-species differential expression analysis, the design of multi-species differential expression experiments must account for the relative abundances of each organism of interest within the sample, often requiring enrichment methods and yielding differences in total read counts across samples. The analysis of multi-species transcriptomics datasets requires modifications to the alignment, quantification, and downstream analysis steps compared to the single-species analysis pipelines. We describe best practices for multi-species transcriptomics and differential gene expression.

Download Full-text

Accounting for technical noise in differential expression analysis of single-cell RNA sequencing data

Nucleic Acids Research ◽

10.1093/nar/gkx754 ◽

2017 ◽

Vol 45 (19) ◽

pp. 10978-10988 ◽

Cited By ~ 26

Author(s):

Cheng Jia ◽

Yu Hu ◽

Derek Kelly ◽

Junhyong Kim ◽

Mingyao Li ◽

...

Keyword(s):

Single Cell ◽

Differential Expression ◽

Rna Sequencing ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Sequencing Data ◽

Technical Noise ◽

Single Cell Rna Sequencing

Download Full-text

Survey of Methods Used for Differential Expression Analysis on RNA Seq Data

Learning and Analytics in Intelligent Systems - Biologically Inspired Techniques in Many-Criteria Decision Making ◽

10.1007/978-3-030-39033-4_21 ◽

2020 ◽

pp. 226-239

Author(s):

Reema Joshi ◽

Rosy Sarmah

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Rna Seq

Download Full-text

Side-by-side analysis of alternative approaches on multi-level RNA-seq data

10.1101/131862 ◽

2017 ◽

Author(s):

Irina Mohorianu

Keyword(s):

Differential Expression ◽

Rna Seq ◽

Sequencing Data ◽

Essential Components ◽

Sequencing Quality ◽

Quality Checks ◽

Multi Level ◽

Using Data ◽

Key Steps ◽

Robust Prediction

AbstractBackgroundRNA sequencing (RNA-seq) is widely used for RNA quantification across environmental, biological and medical sciences; it enables the description of genome-wide patterns of expression and the deduction of regulatory interactions and networks. The aim of computational analyses is to achieve an accurate output, i.e. rigorous quantification of genes/transcripts to allow a reliable prediction of differential expression (DE), despite the variable levels of noise and biases present in sequencing data. The evaluation of sequencing quality and normalization are essential components of this process.ResultsWe investigate the discriminative power of existing approaches for the quality checking of mRNA-seq data and also propose additional, quantitative, quality checks. To accommodate the analysis of a nested, multi-level design using data on D. melanogaster, we incorporated the sample layout into the analysis. We describe a “subsampling without replacement”-based normalization and identification of DE that accounts for the experimental design i.e. the hierarchy and amplitude of effect sizes within samples. We also evaluate the differential expression call in comparison to existing approaches. To assess the broader applicability of these methods, we applied this series of steps to a published set of H. sapiens mRNA-seq samples.ConclusionsThe dataset-tailored methods improved sample comparability and delivered a robust prediction of subtle gene expression changes. Overall, the proposed approach offers the potential to improve key steps in the analysis of RNA-seq data by incorporating the structure and characteristics of biological experiments into the data analysis. 38

Download Full-text