Side-by-side analysis of alternative approaches on multi-level RNA-seq data

Mapping Intimacies ◽

10.1101/131862 ◽

2017 ◽

Author(s):

Irina Mohorianu

Keyword(s):

Differential Expression ◽

Rna Seq ◽

Sequencing Data ◽

Essential Components ◽

Sequencing Quality ◽

Quality Checks ◽

Multi Level ◽

Using Data ◽

Key Steps ◽

Robust Prediction

Abstract Background: RNA sequencing (RNA-seq) is widely used for RNA quantification across environmental, biological and medical sciences; it enables the description of genome-wide patterns of expression and the deduction of regulatory interactions and networks. The aim of computational analyses is to achieve an accurate output, i.e. rigorous quantification of genes/transcripts to allow a reliable prediction of differential expression (DE), despite the variable levels of noise and biases present in sequencing data. The evaluation of sequencing quality and normalization are essential components of this process. Results: We investigate the discriminative power of existing approaches for the quality checking of mRNA-seq data and also propose additional, quantitative, quality checks. To accommodate the analysis of a nested, multi-level design using data on D. melanogaster, we incorporated the sample layout into the analysis. We describe a “subsampling without replacement”-based normalization and identification of DE that accounts for the experimental design, i.e. the hierarchy and amplitude of effect sizes within samples. We also evaluate the differential expression call in comparison to existing approaches. To assess the broader applicability of these methods, we applied this series of steps to a published set of H. sapiens mRNA-seq samples. Conclusions: The dataset-tailored methods improved sample comparability and delivered a robust prediction of subtle gene expression changes. Overall, the proposed approach offers the potential to improve key steps in the analysis of RNA-seq data by incorporating the structure and characteristics of biological experiments into the data analysis.
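
The idea of normalizing by subsampling without replacement can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `subsample_counts`, the toy gene counts, and the choice of the smallest library as the target depth are all assumptions made for illustration.

```python
import random
from collections import Counter


def subsample_counts(counts, target_total, seed=0):
    """Draw target_total reads without replacement from a gene -> count mapping."""
    total = sum(counts.values())
    if target_total > total:
        raise ValueError("target_total exceeds the number of reads in the sample")
    # One list entry per read: fine for a toy example, too memory-hungry for real libraries.
    pool = [gene for gene, n in counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(pool, target_total)
    sampled = Counter(drawn)
    return {gene: sampled.get(gene, 0) for gene in counts}


# Hypothetical libraries of different depth (1000 and 2000 reads).
sample_a = {"geneA": 500, "geneB": 300, "geneC": 200}
sample_b = {"geneA": 1200, "geneB": 500, "geneC": 300}

# Bring the deeper library down to the depth of the shallower one.
target = min(sum(sample_a.values()), sum(sample_b.values()))
normalized_b = subsample_counts(sample_b, target)
```

Because reads are drawn without replacement, no gene can end up with more reads than it had originally, and a library sampled at its own depth is returned unchanged.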

The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab028 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Xueyi Dong ◽

Luyi Tian ◽

Quentin Gouil ◽

Hasaru Kariyawasam ◽

Shian Su ◽

...

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Transcriptomic Analysis ◽

Statistical Testing ◽

Rna Seq ◽

Sequencing Data ◽

Short Read ◽

Sequencing Platform ◽

Long Read

Abstract Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error rates and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) and a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.

NASA GeneLab RNA-Seq Consensus Pipeline: Standardized Processing of Short-Read RNA-Seq Data

10.1101/2020.11.06.371724 ◽

2020 ◽

Author(s):

Eliah G. Overbey ◽

Amanda M. Saravia-Butler ◽

Zhe Zhang ◽

Komal S. Rathi ◽

Homer Fogle ◽

...

Keyword(s):

Expression Profiles ◽

Gene Expression Profiles ◽

Rna Seq ◽

Sequencing Data ◽

Analysis Pipeline ◽

Short Read ◽

Gene Quantification ◽

Working Groups ◽

Using Data ◽

Data Analysis Pipeline

Summary With the development of transcriptomic technologies, we are able to quantify precise changes in gene expression profiles from astronauts and other organisms exposed to spaceflight. Members of NASA GeneLab and GeneLab-associated analysis working groups (AWGs) have developed a consensus pipeline for analyzing short-read RNA-sequencing data from spaceflight-associated experiments. The pipeline includes quality control, read trimming, mapping, and gene quantification steps, culminating in the detection of differentially expressed genes. This data analysis pipeline and the results of its execution using data submitted to GeneLab are now all publicly available through the GeneLab database. We present here the full details and rationale for the construction of this pipeline in order to promote transparency, reproducibility and reusability of pipeline data, to provide a template for data processing of future spaceflight-relevant datasets, and to encourage cross-analysis of data from other databases with the data available in GeneLab.

Impact of Gene Annotation Choice on the Quantification of RNA-Seq Data

10.21203/rs.3.rs-421080/v1 ◽

2021 ◽

Author(s):

David Chisanga ◽

Yang Liao ◽

Wei Shi

Keyword(s):

Gene Expression ◽

Gene Annotation ◽

Expression Data ◽

Refseq Gene ◽

Rna Seq ◽

Sequencing Data ◽

Microarray Expression Data ◽

Sequencing Quality ◽

Gene Expression Quantification ◽

Expression Quantification

Abstract Background: RNA sequencing is currently the method of choice for genome-wide profiling of gene expression. A popular approach to quantify expression levels of genes from RNA-seq data is to map reads to a reference genome and then count mapped reads to each gene. Gene annotation data, which include chromosomal coordinates of exons for tens of thousands of genes, are required for this quantification process. There are several major sources of gene annotations that can be used for quantification, such as Ensembl and RefSeq databases. However, there is very little understanding of the effect that the choice of annotation has on the accuracy of gene expression quantification in an RNA-seq analysis. Results: In this paper, we present results from our comparison of Ensembl and RefSeq human annotations on their impact on gene expression quantification using a benchmark RNA-seq dataset generated by the SEquencing Quality Control (SEQC) consortium. We show that the use of RefSeq gene annotation models led to better quantification accuracy, based on the correlation with ground truths including expression data from >800 real-time PCR validated genes, known titration ratios of gene expression and microarray expression data. We also found that the recent expansion of the RefSeq annotation has led to a decrease in its annotation accuracy. Finally, we demonstrated that the RNA-seq quantification differences observed between different annotations were not affected by the use of different normalization methods. Conclusion: In conclusion, our study found that the use of the conservative RefSeq gene annotation yields better RNA-seq quantification results than the more comprehensive Ensembl annotation. We also found that, surprisingly, the recent expansion of the RefSeq database, which was primarily driven by the incorporation of sequencing data into the gene annotation process, resulted in a reduction in the accuracy of RNA-seq quantification.

A relative comparison between Hidden Markov- and Log-Linear-based models for differential expression analysis in a real time course RNA sequencing data

10.1101/448886 ◽

2018 ◽

Author(s):

Fatemeh Gholizadeh ◽

Zahra Salehi ◽

Ali Mohammad Banaei-Moghaddam ◽

Abbas Rahimi Foroushani ◽

Kaveh Kavousi

Keyword(s):

Real Time ◽

Differential Expression ◽

Rna Sequencing ◽

Time Course ◽

Hidden Markov ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Normalization Methods ◽

Log Linear

Abstract With the advent of Next Generation Sequencing technologies, RNA-seq has become a leading approach for gene expression profiling. In particular, time course RNA-seq differential expression analysis has been used in many studies to identify candidate genes. However, applying a statistical method to efficiently identify differentially expressed genes (DEGs) in time course studies is challenging due to inherent characteristics of such data, including correlation and dependencies over time. Here we compare EBSeq-HMM, a Hidden Markov-based model, with multiDE, a Log-Linear-based model, on a real time course RNA sequencing dataset. To conduct the comparison, common DEGs detected by edgeR, DESeq2 and Voom (referred to as Benchmark DEGs) were used as a reference. The two models were compared using different normalization methods. The findings revealed that multiDE identified more Benchmark DEGs and showed a higher agreement with them than EBSeq-HMM. Furthermore, multiDE and EBSeq-HMM displayed their best performance using TMM and Upper-Quartile normalization methods, respectively.
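
As an illustration of one of the normalization methods named above, the following is a minimal sketch of upper-quartile scaling. It is a simplification and not the implementation used in the study or in any particular package: the function names, the toy counts, the exclusion of zero counts, and the use of the mean upper quartile as the reference are assumptions. TMM is not shown.

```python
import statistics


def upper_quartile(counts):
    """75th percentile of the non-zero counts in one sample."""
    nonzero = [c for c in counts if c > 0]
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(samples):
    """Rescale every sample so that all samples share the same upper quartile."""
    uq = {name: upper_quartile(counts) for name, counts in samples.items()}
    reference = statistics.mean(list(uq.values()))
    return {
        name: [c * reference / uq[name] for c in counts]
        for name, counts in samples.items()
    }


# Hypothetical samples: s2 is the same profile sequenced twice as deeply.
samples = {"s1": [10, 20, 30, 40, 0], "s2": [20, 40, 60, 80, 0]}
normalized = upper_quartile_normalize(samples)
```

In this toy case the two samples differ only by depth, so after scaling their normalized profiles coincide.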

Differential Expression Analysis for RNA-Seq: An Overview of Statistical Methods and Computational Software

Cancer Informatics ◽

10.4137/cin.s21631 ◽

2015 ◽

Vol 14s1 ◽

pp. CIN.S21631 ◽

Cited By ~ 6

Author(s):

Huei-Chung Huang ◽

Yi Niu ◽

Li-Xuan Qin

Keyword(s):

Differential Expression ◽

Statistical Methods ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Computational Tools ◽

Estimation Algorithms ◽

Testing Strategies ◽

Discrete Nature ◽

Statistical Background

Deep sequencing has recently emerged as a powerful alternative to microarrays for the high-throughput profiling of gene expression. In order to account for the discrete nature of RNA sequencing data, new statistical methods and computational tools have been developed for the analysis of differential expression to identify genes that are relevant to a disease such as cancer. It is thus timely to provide an overview of these analysis methods and tools in this paper. For readers with a statistical background, we also review the parameter estimation algorithms and hypothesis testing strategies used in these methods.

Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA–sequencing data

10.1101/220129 ◽

2017 ◽

Cited By ~ 2

Author(s):

Alemu Takele Assefa ◽

Katrijn De Paepe ◽

Celine Everaert ◽

Pieter Mestdagh ◽

Olivier Thas ◽

...

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Expression Analysis ◽

Web Application ◽

Empirical Bayes ◽

Performance Metrics ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Normalization Methods

Abstract Background: Protein-coding RNAs (mRNA) have been the primary target of most transcriptome studies in the past, but in recent years, attention has expanded to include long non-coding RNAs (lncRNA). lncRNAs are typically expressed at low levels, and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data along with their normalization methods is comprehensively evaluated, with a particular focus on lncRNAs and low-abundance mRNAs. Results: Thirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures are used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNA and lncRNA separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and analysis of benchmark RNA-seq datasets. No single tool uniformly outperformed the others. Conclusion: Overall, the linear modeling with empirical Bayes moderation (limma) and the nonparametric approach (SAMSeq) showed best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, for achieving a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in realistic clinical settings such as cancer research. About half of the methods showed severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/
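
Two of the metrics mentioned above, false discoveries and sensitivity, can be computed for a single simulated dataset as in the sketch below. The function name and gene identifiers are hypothetical. Strictly, the FDR is an expectation over many datasets; what is computed here is the observed false discovery proportion for one set of calls.

```python
def fdr_and_sensitivity(called, truly_de):
    """Observed false discovery proportion and sensitivity for one set of DE calls."""
    called, truly_de = set(called), set(truly_de)
    true_pos = len(called & truly_de)    # correctly called DE genes
    false_pos = len(called - truly_de)   # called DE, but not truly DE
    fdp = false_pos / len(called) if called else 0.0
    sensitivity = true_pos / len(truly_de) if truly_de else 0.0
    return fdp, sensitivity


# Hypothetical simulation: five genes are truly DE, a tool calls four genes.
truth = ["g1", "g2", "g5", "g6", "g7"]
calls = ["g1", "g2", "g3", "g4"]
fdp, sens = fdr_and_sensitivity(calls, truth)
```

Here two of the four calls are wrong and two of the five true DE genes are recovered.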

Improved prediction of smoking status via isoform-aware RNA-seq deep learning models

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009433 ◽

2021 ◽

Vol 17 (10) ◽

pp. e1009433

Author(s):

Zifeng Wang ◽

Aria Masoomi ◽

Zhonghui Xu ◽

Adel Boueiz ◽

Sool Lee ◽

...

Keyword(s):

Gene Expression ◽

Predictive Models ◽

Prediction Models ◽

Smoking Status ◽

Rna Seq ◽

Sequencing Data ◽

Test Set ◽

Gene Splicing ◽

Eukaryotic Gene ◽

Using Data

Most predictive models based on gene expression data do not leverage information related to gene splicing, despite the fact that splicing is a fundamental feature of eukaryotic gene expression. Cigarette smoking is an important environmental risk factor for many diseases, and it has profound effects on gene expression. Using smoking status as a prediction target, we developed deep neural network predictive models using gene, exon, and isoform level quantifications from RNA sequencing data in 2,557 subjects in the COPDGene Study. We observed that models using exon and isoform quantifications clearly outperformed gene-level models when using data from 5 genes from a previously published prediction model. Whereas the test set performance of the previously published model was 0.82 in the original publication, our exon-based models including an exon-to-isoform mapping layer achieved a test set AUC (area under the receiver operating characteristic curve) of 0.88, which improved to an AUC of 0.94 using exon quantifications from a larger set of genes. Isoform variability is an important source of latent information in RNA-seq data that can be used to improve clinical prediction models.
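
For readers unfamiliar with the AUC figures quoted above, the sketch below computes the AUC from its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The labels and scores are made up for illustration (1 standing for a hypothetical "smoker" class) and have no connection to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical test set: 1 = smoker, 0 = non-smoker, with model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
```

In this example eight of the nine positive/negative pairs are ordered correctly, so the AUC is 8/9. The pairwise loop is quadratic and is meant only to make the definition explicit.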

TVScript: A Tool for Exploring the Effects of Intra-Condition Variation on the Detection of Differentially Expressed Genes

10.21203/rs.3.rs-468506/v1 ◽

2021 ◽

Author(s):

Diana Lobo ◽

Raquel Godinho ◽

John Archer

Keyword(s):

Differential Expression ◽

Expression Patterns ◽

Transcript Level ◽

Differentially Expressed ◽

Rna Seq ◽

Expression Studies ◽

Individual Transcript ◽

Selection For ◽

Using Data ◽

Scientific Value

Abstract The evolution of RNA-Seq technologies yielded datasets that are of immense scientific value. Commonly, such data is generated within differential expression studies, where datasets derived from individual samples are grouped into conditions, and gene expression patterns quantified. The number of archived datasets is increasing and revisiting many at an inter-study level provides an in-depth view into transcriptome evolution. The biggest hurdle is in dealing with variation of read counts at an individual transcript level between common conditions. We present a tool, TVScript, that quantifies intra-condition variation and subsequently removes reference-based transcripts that are associated with high levels of such variation. TVScript is demonstrated at inter- and intra-study levels, using data from brain samples of dogs, wolves and foxes (aggressive and tame), where a marked improvement in the distribution of the gene-wise dispersion estimates, the metric utilized by the majority of differential expression tools, lowered the number of outliers detected. We provide support for seven candidate genes with potential for being involved with selection for tameness, and that appear to play a crucial role in canine domestication. We also identify several genes previously identified as being differentially expressed, but that possessed high intra-condition variation, weakening their relevance. TVScript is available at: https://sourceforge.net/projects/tvscript/.
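
The general idea of filtering on intra-condition variation can be sketched as follows. The abstract does not specify TVScript's actual metric or threshold, so everything here is an assumption: the coefficient of variation as the measure, the cut-off `max_cv`, the function names, and the toy counts.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (infinite when the mean is zero)."""
    mean = statistics.mean(values)
    return statistics.stdev(values) / mean if mean > 0 else float("inf")


def filter_variable_transcripts(counts_by_condition, max_cv):
    """Keep transcripts whose CV stays at or below max_cv within every condition."""
    transcripts = list(next(iter(counts_by_condition.values())).keys())
    kept = []
    for transcript in transcripts:
        cvs = [
            coefficient_of_variation(condition[transcript])
            for condition in counts_by_condition.values()
        ]
        if max(cvs) <= max_cv:
            kept.append(transcript)
    return kept


# Hypothetical replicate counts per transcript within two conditions.
counts = {
    "tame": {"tx1": [100, 110, 90], "tx2": [5, 200, 40]},
    "aggressive": {"tx1": [150, 160, 140], "tx2": [30, 35, 25]},
}
kept = filter_variable_transcripts(counts, max_cv=0.5)
```

In this toy input `tx2` varies widely among the "tame" replicates and is dropped, while `tx1` is retained.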

Differential expression analysis of log-ratio transformed counts: benchmarking methods for RNA-Seq data

10.1101/231175 ◽

2017 ◽

Cited By ~ 1

Author(s):

Thomas P. Quinn ◽

Tamsyn M. Crowley ◽

Mark F. Richardson

Keyword(s):

16S Rrna ◽

Differential Expression ◽

High Precision ◽

Differential Expression Analysis ◽

Real Data ◽

False Positives ◽

Rna Seq ◽

Sequencing Data ◽

Library Size ◽

Log Ratio

Abstract Background: Count data generated by next-generation sequencing assays do not measure absolute transcript abundances. Instead, the data are constrained to an arbitrary “library size” by the sequencing depth of the assay, and typically must be normalized prior to statistical analysis. The constrained nature of these data means one could alternatively use a log-ratio transformation in lieu of normalization, as often done when testing for differential abundance (DA) of operational taxonomic units (OTUs) in 16S rRNA data. Therefore, we benchmark how well the ALDEx2 package, a transformation-based DA tool, detects differential expression in high-throughput RNA-sequencing data (RNA-Seq), compared to conventional RNA-Seq differential expression methods. Results: To evaluate the performance of log-ratio transformation-based tools, we apply the ALDEx2 package to two simulated, and one real, RNA-Seq data sets. The latter was previously used to benchmark dozens of conventional RNA-Seq differential expression methods, enabling us to directly compare transformation-based approaches. We show that ALDEx2, widely used in meta-genomics research, identifies differentially expressed genes (and transcripts) from RNA-Seq data with high precision and, given sufficient sample sizes, high recall too (regardless of the alignment and quantification procedure used). Although we show that the choice of log-ratio transformation can affect performance, ALDEx2 has high precision (i.e., few false positives) across all transformations. Finally, we present a novel, iterative log-ratio transformation (now implemented in ALDEx2) that further improves performance in simulations. Conclusions: Our results suggest that log-ratio transformation-based methods can work to measure differential expression from RNA-Seq data, provided that certain assumptions are met. Moreover, these methods have high precision (i.e., few false positives) in simulations and perform as well as, or better than, conventional methods on real data. With previously demonstrated applicability to 16S rRNA data, ALDEx2 can work as a single tool for data from multiple sequencing modalities.
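
Why a log-ratio transformation can stand in for normalization is easiest to see with the centered log-ratio (CLR), one common member of this family. The sketch below shows only the plain CLR on strictly positive counts; it is not ALDEx2's procedure, and the function name and counts are illustrative assumptions.

```python
import math


def clr(counts):
    """Centered log-ratio: the log of each count minus the mean of the log counts."""
    if any(c <= 0 for c in counts):
        raise ValueError("CLR needs strictly positive values; handle zeros first")
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# Hypothetical sample, and the same sample sequenced three times as deeply.
shallow = [10, 40, 50]
deep = [c * 3 for c in shallow]

clr_shallow = clr(shallow)
clr_deep = clr(deep)
```

Multiplying every count by the same factor adds a constant to each log count, and that constant cancels when the mean log is subtracted. The two transformed vectors are therefore identical up to floating-point error, which is the sense in which the arbitrary library size drops out.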

Differential expression analysis of RNA sequencing data by incorporating non-exonic mapped reads

10.1101/016196 ◽

2015 ◽

Author(s):

Hung-I Harry Chen ◽

Yuanhang Liu ◽

Yi Zou ◽

Zhao Lai ◽

Devanand Sarkar ◽

...

Keyword(s):

Differential Expression ◽

Rna Sequencing ◽

Expression Analysis ◽

Background Noise ◽

Negative Binomial ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Expression Levels ◽

Differential Expressed Genes

Background RNA sequencing (RNA-seq) is a powerful tool for genome-wide expression profiling of biological samples with the advantage of high throughput and high resolution. There are many existing algorithms for quantifying expression levels and detecting differential gene expression, but none of them takes the misaligned reads that are mapped to non-exonic regions into account. We developed a novel algorithm, XBSeq, where a statistical model was established based on the assumption that observed signals are the convolution of true expression signals and sequencing noise. The mapped reads in non-exonic regions are considered as sequencing noise, which follows a Poisson distribution. Given measurable observed and noise signals from RNA-seq data, true expression signals, assumed to follow a negative binomial distribution, can be delineated, enabling the accurate detection of differentially expressed genes. Results We implemented our novel XBSeq algorithm and evaluated it by using a set of simulated expression datasets under different conditions, using a combination of negative binomial and Poisson distributions with parameters derived from real RNA-seq data. We compared the performance of our method with other commonly used differential expression analysis algorithms. We also evaluated the changes in true and false positive rates with variations in biological replicates, differential fold changes, and expression levels in non-exonic regions. We also tested the algorithm on a set of real RNA-seq data where the common and different detection results from different algorithms were reported. Conclusions In this paper, we proposed XBSeq, a novel differential expression analysis algorithm for RNA-seq data that takes non-exonic mapped reads into consideration. When background noise is at baseline level, the performance of XBSeq and DESeq are mostly equivalent. However, our method surpasses DESeq and other algorithms with the increase of non-exonic mapped reads. Only under very low read count conditions did XBSeq have a slightly higher false discovery rate, which may be improved by adjusting the background noise effect in this situation. Taken together, by considering non-exonic mapped reads, XBSeq can provide accurate expression measurement and thus detect differentially expressed genes even in noisy conditions.
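
The signal-plus-noise model described above has a simple consequence if the true signal and the noise are assumed independent: means and variances add, so the signal's moments can be estimated by subtraction. The sketch below illustrates only that relationship. It is not the XBSeq estimator; the independence assumption, the function name, the clamping at zero, and the replicate counts are all assumptions.

```python
import statistics


def signal_moments(observed, background):
    """Estimate the true signal's mean and variance by subtracting background moments.

    Assumes observed = signal + noise with signal and noise independent, so that
    mean(observed) = mean(signal) + mean(noise), and likewise for the variances.
    """
    mean_signal = statistics.mean(observed) - statistics.mean(background)
    var_signal = statistics.variance(observed) - statistics.variance(background)
    # With few replicates the differences can come out negative by chance; clamp at zero.
    return max(mean_signal, 0.0), max(var_signal, 0.0)


# Hypothetical counts for one gene across four replicates.
exonic = [120, 135, 110, 140]   # observed reads in the gene's exons
non_exonic = [10, 12, 9, 13]    # reads in matched non-exonic regions (noise)
mean_s, var_s = signal_moments(exonic, non_exonic)
```

For these numbers the estimated signal mean is 115.25 and the estimated signal variance is 186.25.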
