Cross-Platform Normalization Enables Machine Learning Model Training On Microarray And RNA-Seq Data Simultaneously

A relative comparison between Hidden Markov- and Log-Linear-based models for differential expression analysis in a real time course RNA sequencing data

10.1101/448886 ◽

2018 ◽

Author(s):

Fatemeh Gholizadeh ◽

Zahra Salehi ◽

Ali Mohammad Banaei-Moghaddam ◽

Abbas Rahimi Foroushani ◽

Kaveh Kavousi

Keyword(s):

Real Time ◽

Differential Expression ◽

Rna Sequencing ◽

Time Course ◽

Hidden Markov ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Normalization Methods ◽

Log Linear

Abstract:

With the advent of next-generation sequencing technologies, RNA-seq has become known as an optimal approach for gene expression profiling. In particular, time-course RNA-seq differential expression analysis has been used in many studies to identify candidate genes. However, applying a statistical method to efficiently identify differentially expressed genes (DEGs) in time-course studies is challenging because of inherent characteristics of such data, including correlation and dependencies over time. Here we compare EBSeq-HMM, a Hidden Markov-based model, with multiDE, a Log-Linear-based model, on a real time-course RNA sequencing dataset. To conduct the comparison, the DEGs detected in common by edgeR, DESeq2 and Voom (referred to as Benchmark DEGs) were used as the reference. Each of the two models was evaluated with different normalization methods. The findings revealed that multiDE identified more Benchmark DEGs and showed higher agreement with them than EBSeq-HMM. Furthermore, multiDE and EBSeq-HMM performed best with the TMM and Upper-Quartile normalization methods, respectively.
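The benchmark construction described in this abstract can be sketched with plain set operations. This is an illustration only: the gene identifiers are made up, and the `agreement` value (the share of a model's calls that are Benchmark DEGs) is an assumed measure, since the abstract does not define how agreement was computed.

```python
def benchmark_degs(edger, deseq2, voom):
    """Return the genes called differentially expressed by all three tools."""
    return set(edger) & set(deseq2) & set(voom)


def compare_to_benchmark(model_degs, benchmark):
    """Count the Benchmark DEGs a model recovers and the share of its calls they make up.

    The second value is a hypothetical agreement measure, not the paper's definition.
    """
    model_degs = set(model_degs)
    hits = model_degs & benchmark
    agreement = len(hits) / len(model_degs) if model_degs else 0.0
    return len(hits), agreement


# Made-up gene identifiers, for illustration only.
edger = {"g1", "g2", "g3", "g4"}
deseq2 = {"g2", "g3", "g4", "g5"}
voom = {"g3", "g4", "g6"}

benchmark = benchmark_degs(edger, deseq2, voom)
n_hits, agreement = compare_to_benchmark({"g3", "g4", "g7", "g8"}, benchmark)
```

Taking the intersection keeps only genes on which all three tools agree, which gives a conservative reference set at the cost of discarding genes that a single tool detects.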


Cross-platform normalization of microarray and RNA-seq data for machine learning applications

10.7287/peerj.preprints.1460 ◽

2015 ◽

Author(s):

Jeffrey A Thompson ◽

Jie Tan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

Rna Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.


Cross-platform normalization of microarray and RNA-seq data for machine learning applications

PeerJ ◽

10.7717/peerj.1621 ◽

2016 ◽

Vol 4 ◽

pp. e1621 ◽

Cited By ~ 42

Author(s):

Jeffrey A. Thompson ◽

Jie Tan ◽

Casey S. Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

Rna Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
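Two of the baseline transformations named in this abstract, the log2 transformation and quantile normalization, can be sketched in a few lines of NumPy. This is not TDM and not the authors' R package: the function names, the `+1` pseudocount, the ordinal tie handling, and the toy matrices are all assumptions made for illustration, and the sketch requires both matrices to have the same number of genes (rows).

```python
import numpy as np


def log2_transform(counts):
    """Apply log2(x + 1); the +1 pseudocount is an assumption that keeps zeros finite."""
    return np.log2(np.asarray(counts, dtype=float) + 1.0)


def reference_quantiles(reference):
    """Average the sorted columns of a genes x samples matrix into one target distribution."""
    return np.sort(np.asarray(reference, dtype=float), axis=0).mean(axis=1)


def quantile_normalize_to(target, ref_quantiles):
    """Replace each value in every column of `target` by the reference value of equal rank.

    Ties are broken by position (ordinal ranks), which is a simplification.
    """
    target = np.asarray(target, dtype=float)
    ranks = np.argsort(np.argsort(target, axis=0), axis=0)
    return ref_quantiles[ranks]


# Toy matrices (genes x samples); the values are made up.
microarray = np.array([[2.0, 3.0], [5.0, 4.0], [9.0, 8.0]])
rnaseq_counts = np.array([[0, 120], [15, 3], [700, 40]])

ref_q = reference_quantiles(microarray)
normalized = quantile_normalize_to(log2_transform(rnaseq_counts), ref_q)
```

After this step every RNA-seq sample has exactly the same value distribution as the averaged microarray reference, so only the within-sample ranking of genes carries over from the RNA-seq data.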


A comparison of per sample global scaling and per gene normalization methods for differential expression analysis of RNA-seq data

PLoS ONE ◽

10.1371/journal.pone.0176185 ◽

2017 ◽

Vol 12 (5) ◽

pp. e0176185 ◽

Cited By ~ 32

Author(s):

Xiaohong Li ◽

Guy N. Brock ◽

Eric C. Rouchka ◽

Nigel G. F. Cooper ◽

Dongfeng Wu ◽

...

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Rna Seq ◽

Gene Normalization ◽

Normalization Methods ◽

Global Scaling ◽

Per Gene


Cross-platform normalization of microarray and RNA-seq data for machine learning applications

10.7287/peerj.preprints.1460v1 ◽

2015 ◽

Author(s):

Jeffrey A Thompson ◽

Jie Tan ◽

Casey S Greene

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Machine Learning Algorithms ◽

Quantile Normalization ◽

Learning Approaches ◽

Rna Seq ◽

Distribution Matching ◽

Machine Learning Applications ◽

Cross Platform ◽

R Programming

Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.


Differential gene expression analysis tools exhibit substandard performance for long non-coding RNA–sequencing data

10.1101/220129 ◽

2017 ◽

Cited By ~ 2

Author(s):

Alemu Takele Assefa ◽

Katrijn De Paepe ◽

Celine Everaert ◽

Pieter Mestdagh ◽

Olivier Thas ◽

...

Keyword(s):

Gene Expression ◽

Differential Expression ◽

Expression Analysis ◽

Web Application ◽

Empirical Bayes ◽

Performance Metrics ◽

Differential Expression Analysis ◽

Rna Seq ◽

Sequencing Data ◽

Normalization Methods

Abstract:

Background: Protein-coding RNAs (mRNA) have been the primary target of most transcriptome studies in the past, but in recent years, attention has expanded to include long non-coding RNAs (lncRNA). lncRNAs are typically expressed at low levels and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data, along with their normalization methods, is comprehensively evaluated, with a particular focus on lncRNAs and low-abundance mRNAs.

Results: Thirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures were used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNA and lncRNA separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and analyses of benchmark RNA-seq datasets. No single tool uniformly outperformed the others.

Conclusion: Overall, linear modeling with empirical Bayes moderation (limma) and the nonparametric approach (SAMSeq) showed the best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, to achieve a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in realistic clinical settings such as cancer research. About half of the methods showed a severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/
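The two headline metrics in this abstract, false discovery rate and sensitivity, can be computed for a single simulated dataset where the truly differentially expressed genes are known. The sketch below is a minimal illustration with made-up gene identifiers, not the study's evaluation code. Strictly, the first value is the observed false discovery proportion of one run; the FDR is its expectation over repeated runs.

```python
def fdr_and_sensitivity(called, truly_de):
    """Return (false discovery proportion, sensitivity) for one set of DE calls."""
    called, truly_de = set(called), set(truly_de)
    tp = len(called & truly_de)   # true positives: called and truly DE
    fp = len(called - truly_de)   # false positives: called but not DE
    fn = len(truly_de - called)   # false negatives: DE but missed
    fdp = fp / (tp + fp) if called else 0.0
    sensitivity = tp / (tp + fn) if truly_de else 0.0
    return fdp, sensitivity


# Made-up simulation result, for illustration only.
called = {"g1", "g2", "g3", "g4"}
truth = {"g1", "g2", "g5", "g6", "g7"}
fdp, sensitivity = fdr_and_sensitivity(called, truth)
```

In this toy case half of the calls are false discoveries and two of the five truly DE genes are recovered, the kind of trade-off the abstract describes for tools with poor FDR control.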


Best practices on the differential expression analysis of multi-species RNA-seq

Genome Biology ◽

10.1186/s13059-021-02337-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Matthew Chung ◽

Vincent M. Bruno ◽

David A. Rasko ◽

Christina A. Cuomo ◽

José F. Muñoz ◽

...

Keyword(s):

Best Practices ◽

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Single Species ◽

Rna Seq ◽

Species Analysis ◽

Differential Gene ◽

Multiple Species ◽

Downstream Analysis

Abstract:

Advances in transcriptome sequencing allow for simultaneous interrogation of differentially expressed genes from multiple species originating from a single RNA sample, termed dual or multi-species transcriptomics. Compared to single-species differential expression analysis, the design of multi-species differential expression experiments must account for the relative abundances of each organism of interest within the sample, often requiring enrichment methods and yielding differences in total read counts across samples. The analysis of multi-species transcriptomics datasets requires modifications to the alignment, quantification, and downstream analysis steps compared to the single-species analysis pipelines. We describe best practices for multi-species transcriptomics and differential gene expression.


The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqab028 ◽

2021 ◽

Vol 3 (2) ◽

Author(s):

Xueyi Dong ◽

Luyi Tian ◽

Quentin Gouil ◽

Hasaru Kariyawasam ◽

Shian Su ◽

...

Keyword(s):

Differential Expression ◽

Expression Analysis ◽

Differential Expression Analysis ◽

Transcriptomic Analysis ◽

Statistical Testing ◽

Rna Seq ◽

Sequencing Data ◽

Short Read ◽

Sequencing Platform ◽

Long Read

Abstract:

Application of Oxford Nanopore Technologies’ long-read sequencing platform to transcriptomic analysis is increasing in popularity. However, such analysis can be challenging due to the high sequence error and small library sizes, which decrease quantification accuracy and reduce power for statistical testing. Here, we report the analysis of two nanopore RNA-seq datasets with the goal of obtaining gene- and isoform-level differential expression information. A dataset of synthetic, spliced, spike-in RNAs (‘sequins’) as well as a mouse neural stem cell dataset from samples with a null mutation of the epigenetic regulator Smchd1 were analysed using a mix of long-read specific tools for preprocessing together with established short-read RNA-seq methods for downstream analysis. We used limma-voom to perform differential gene expression analysis, and the novel FLAMES pipeline to perform isoform identification and quantification, followed by DRIMSeq and limma-diffSplice (with stageR) to perform differential transcript usage analysis. We compared results from the sequins dataset to the ground truth, and results of the mouse dataset to a previous short-read study on equivalent samples. Overall, our work shows that transcriptomic analysis of long-read nanopore data using long-read specific preprocessing methods together with short-read differential expression methods and software that are already in wide use can yield meaningful results.
