scholarly journals A Comparison of Methods: Normalizing High-Throughput RNA Sequencing Data

2015 ◽  
Author(s):  
Rahul Reddy

As RNA-Seq and other high-throughput sequencing grow in use and remain critical for gene expression studies, technical variability in counts data impedes studies of differential expression studies, data across samples and experiments, or reproducing results. Studies like Dillies et al. (2013) compare several between-lane normalization methods involving scaling factors, while Hansen et al. (2012) and Risso et al. (2014) propose methods that correct for sample-specific bias or use sets of control genes to isolate and remove technical variability. This paper evaluates four normalization methods in terms of reducing intra-group, technical variability and facilitating differential expression analysis or other research where the biological, inter-group variability is of interest. To this end, the four methods were evaluated in differential expression analysis between data from Pickrell et al. (2010) and Montgomery et al. (2010) and between simulated data modeled on these two datasets. Though the between-lane scaling factor methods perform worse on real data sets, they are much stronger for simulated data. We cannot reject the recommendation of Dillies et al. to use TMM and DESeq normalization, but further study of power to detect effects of different size under each normalization method is merited.

2017 ◽  
Author(s):  
Alemu Takele Assefa ◽  
Katrijn De Paepe ◽  
Celine Everaert ◽  
Pieter Mestdagh ◽  
Olivier Thas ◽  
...  

ABSTRACTBackgroundProtein-coding RNAs (mRNA) have been the primary target of most transcriptome studies in the past, but in recent years, attention has expanded to include long non-coding RNAs (lncRNA). lncRNAs are typically expressed at low levels, and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data along with their normalization methods is comprehensively evaluated, with a particular focus on lncRNAs and low abundant mRNAs.ResultsThirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures are used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNA and lncRNA separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and analysis of benchmark RNA-seq datasets. No single tool uniformly outperformed the others.ConclusionOverall, the linear modeling with empirical Bayes moderation (limma) and the nonparametric approach (SAMSeq) showed best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, for achieving a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in a realistic clinical settings such as in cancer research. About half of the methods showed severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/


2021 ◽  
Vol 3 (2) ◽  
Author(s):  
Xueyi Dong ◽  
Luyi Tian ◽  
Quentin Gouil ◽  
Hasaru Kariyawasam ◽  
Shian Su ◽  
...  

Abstract Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error and small library sizes, which decreases quantification accuracy and reduces power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 was analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.


2019 ◽  
Author(s):  
Avi Srivastava ◽  
Laraib Malik ◽  
Hirak Sarkar ◽  
Mohsen Zakeri ◽  
Fatemeh Almodaresi ◽  
...  

AbstractBackgroundThe accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy.ResultsWe investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large, and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally-acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment.ConclusionWe observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.


2014 ◽  
Author(s):  
Simon Anders ◽  
Paul Theodor Pyl ◽  
Wolfgang Huber

Motivation: A large choice of tools exists for many standard tasks in the analysis of high-throughput sequencing (HTS) data. However, once a project deviates from standard work flows, custom scripts are needed. Results: We present HTSeq, a Python library to facilitate the rapid development of such scripts. HTSeq offers parsers for many common data formats in HTS projects, as well as classes to represent data such as genomic coordinates, sequences, sequencing reads, alignments, gene model information, variant calls, and provides data structures that allow for querying via genomic coordinates. We also present htseq-count, a tool developed with HTSeq that preprocesses RNA-Seq data for differential expression analysis by counting the overlap of reads with genes. Availability: HTSeq is released as open-source software under the GNU General Public Licence and available from http://www-huber.embl.de/HTSeq or from the Python Package Index, https://pypi.python.org/pypi/HTSeq


2018 ◽  
Author(s):  
Fatemeh Gholizadeh ◽  
Zahra Salehi ◽  
Ali Mohammad banaei-Moghaddam ◽  
Abbas Rahimi Foroushani ◽  
Kaveh kavousi

AbstractWith the advent of the Next Generation Sequencing technologies, RNA-seq has become known as an optimal approach for studying gene expression profiling. Particularly, time course RNA-seq differential expression analysis has been used in many studies to identify candidate genes. However, applying a statistical method to efficiently identify differentially expressed genes (DEGs) in time course studies is challenging due to inherent characteristics of such data including correlation and dependencies over time. Here we aim to relatively compare EBSeq-HMM, a Hidden Markov-based model, with multiDE, a Log-Linear-based model, in a real time course RNA sequencing data. In order to conduct the comparison, common DEGs detected by edgeR, DESeq2 and Voom (referred to as Benchmark DEGs) were utilized as a measure. Each of the two models were compared using different normalization methods. The findings revealed that multiDE identified more Benchmark DEGs and showed a higher agreement with them than EBSeq-HMM. Furthermore, multiDE and EBSeq-HMM displayed their best performance using TMM and Upper-Quartile normalization methods, respectively.


Sign in / Sign up

Export Citation Format

Share Document