GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data

PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e4600 ◽  
Author(s):  
Li Chen ◽  
James Reeve ◽  
Lujun Zhang ◽  
Shengbing Huang ◽  
Xuefeng Wang ◽  
...  

Normalization is the first critical step in microbiome sequencing data analysis used to account for variable library sizes. Current RNA-Seq based normalization methods that have been adapted for microbiome data fail to consider the unique characteristics of microbiome data, which contain a vast number of zeros due to the physical absence or under-sampling of the microbes. Normalization methods that specifically address the zero-inflation remain largely undeveloped. Here we propose geometric mean of pairwise ratios—a simple but effective normalization method—for zero-inflated sequencing data such as microbiome data. Simulation studies and real datasets analyses demonstrate that the proposed method is more robust than competing methods, leading to more powerful detection of differentially abundant taxa and higher reproducibility of the relative abundances of taxa.
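
The abstract names the method but does not spell out the computation. A minimal sketch of one way to compute a "geometric mean of pairwise ratios" size factor is given below, assuming that each pairwise ratio is the median of count ratios over taxa that are nonzero in both samples, and that the geometric mean runs over all samples (including the sample itself, whose ratio is 1). The function name `gmpr_size_factors` and the `min_shared` parameter are hypothetical and are not taken from the authors' R package.

```python
import numpy as np


def gmpr_size_factors(counts, min_shared=1):
    """Sketch of geometric-mean-of-pairwise-ratios size factors.

    counts: array of shape (n_taxa, n_samples) with non-negative counts.
    min_shared: hypothetical threshold; a pair of samples is skipped when
        fewer than this many taxa are nonzero in both.
    Returns one size factor per sample (NaN if no usable pair exists).
    """
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[1]
    factors = np.full(n_samples, np.nan)

    for i in range(n_samples):
        log_medians = []
        for j in range(n_samples):
            # Use only taxa observed in both samples, so zeros never
            # enter a ratio.
            shared = (counts[:, i] > 0) & (counts[:, j] > 0)
            if shared.sum() < min_shared:
                continue
            ratios = counts[shared, i] / counts[shared, j]
            log_medians.append(np.log(np.median(ratios)))
        if log_medians:
            # Geometric mean = exp of the mean of the logs.
            factors[i] = np.exp(np.mean(log_medians))
    return factors


# Toy example: three samples, one taxon missing in the first and third.
toy = np.array([[1, 2, 4],
                [2, 4, 8],
                [3, 6, 12],
                [0, 5, 0]])
size_factors = gmpr_size_factors(toy)
normalized = toy / size_factors  # divide each sample (column) by its factor
```

Because the ratios are taken only over shared nonzero taxa, a sample pair with many zeros still contributes a ratio as long as some taxa overlap, which is the property the abstract emphasizes for zero-inflated data.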

2017 ◽  
Author(s):  
Li Chen ◽  
James Reeve ◽  
Lujun Zhang ◽  
Shengbing Huang ◽  
Jun Chen

Normalization is the first critical step in microbiome sequencing data analysis used to account for variable library sizes. Current RNA-Seq based normalization methods that have been adapted for microbiome data fail to consider the unique characteristics of microbiome data, which contain a vast number of zeros due to the physical absence or under-sampling of the microbes. Normalization methods that specifically address the zero inflation remain largely undeveloped. Here we propose GMPR - a simple but effective normalization method - for zero-inflated sequencing data such as microbiome data. Simulation studies and real datasets analyses demonstrate that the proposed method is more robust than competing methods, leading to more powerful detection of differentially abundant taxa and higher reproducibility of the relative abundances of taxa.


2018 ◽  
Author(s):  
Li Chen ◽  
James Reeve ◽  
Lujun Zhang ◽  
Shengbing Huang ◽  
Xuefeng Wang ◽  
...  

Normalization is the first critical step in microbiome sequencing data analysis used to account for variable library sizes. Current RNA-Seq based normalization methods that have been adapted for microbiome data fail to consider the unique characteristics of microbiome data, which contain a vast number of zeros due to the physical absence or under-sampling of the microbes. Normalization methods that specifically address the zero inflation remain largely undeveloped. Here we propose GMPR - a simple but effective normalization method - for zero-inflated sequencing data such as microbiome data. Simulation studies and real datasets analyses demonstrate that the proposed method is more robust than competing methods, leading to more powerful detection of differentially abundant taxa and higher reproducibility of the relative abundances of taxa.


2017 ◽  
Author(s):  
Li Chen ◽  
James Reeve ◽  
Lujun Zhang ◽  
Shengbin Huang ◽  
Jun Chen

Normalization is the first critical step in microbiome sequencing data analysis used to account for variable library sizes. Current RNA-Seq based normalization methods that have been adapted for microbiome data fail to consider the unique characteristics of microbiome data, which contain a vast number of zeros due to the physical absence or under-sampling of the microbes. Normalization methods that specifically address the zero inflation remain largely undeveloped. Here we propose GMPR - a simple but effective normalization method - for zero-inflated sequencing data such as microbiome data. Simulation studies and real datasets analyses demonstrate that the proposed method is more robust than competing methods, leading to more powerful detection of differentially abundant taxa and higher reproducibility of the relative abundances of taxa.


2017 ◽  
Author(s):  
Li Chen ◽  
Jun Chen

Summary: Normalization is the first critical step in microbiome sequencing (microbiome-Seq) data analysis to account for variable library sizes. Though RNA-Seq based normalization methods have been adapted for microbiome-Seq data, they fail to consider the unique characteristics of microbiome-Seq data, which contain a vast number of zeros due to the physical absence or undersampling of the microbes. Normalization methods that specifically address the zero inflation remain largely undeveloped. Here we propose GMPR - a simple but effective normalization method - for zero-inflated sequencing data such as microbiome-Seq data. Simulation studies and analyses of 38 real gut microbiome datasets from 16S rRNA gene amplicon sequencing demonstrated the superior performance of the proposed method. Availability and Implementation: 'GMPR' is implemented in R and available at https://github.com/jchen1981/GMPR. Supplementary Information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Fatemeh Gholizadeh ◽  
Zahra Salehi ◽  
Ali Mohammad Banaei-Moghaddam ◽  
Abbas Rahimi Foroushani ◽  
Kaveh Kavousi

Abstract: With the advent of Next Generation Sequencing technologies, RNA-seq has become an optimal approach for gene expression profiling. In particular, time course RNA-seq differential expression analysis has been used in many studies to identify candidate genes. However, applying a statistical method that efficiently identifies differentially expressed genes (DEGs) in time course studies is challenging because of inherent characteristics of such data, including correlation and dependencies over time. Here we compare EBSeq-HMM, a hidden Markov-based model, with multiDE, a log-linear-based model, on a real time course RNA sequencing dataset. To conduct the comparison, the common DEGs detected by edgeR, DESeq2 and Voom (referred to as Benchmark DEGs) were used as the reference. Each of the two models was evaluated with different normalization methods. The findings revealed that multiDE identified more Benchmark DEGs and showed higher agreement with them than EBSeq-HMM. Furthermore, multiDE and EBSeq-HMM displayed their best performance using the TMM and Upper-Quartile normalization methods, respectively.


2021 ◽  
Vol 12 ◽  
Author(s):  
Shen Yin ◽  
Xiaowei Zhan ◽  
Bo Yao ◽  
Guanghua Xiao ◽  
Xinlei Wang ◽  
...  

RNA-sequencing (RNA-seq) provides a comprehensive quantification of transcriptomic activities in biological samples. Formalin-Fixed Paraffin-Embedded (FFPE) samples are collected as part of routine clinical procedure, and are the most widely available biological sample format in medical research and patient care. Normalization is an essential step in RNA-seq data analysis. A number of normalization methods, though developed for RNA-seq data from fresh frozen (FF) samples, can be used with FFPE samples as well. The only extant normalization method specifically designed for FFPE RNA-seq data, MIXnorm, has been shown to outperform these normalization methods, but at the cost of a complex mixture model and a high computational burden. It is therefore important to adapt MIXnorm for simplicity and computational efficiency while maintaining superior performance. Furthermore, it is critical to develop an integrated tool that performs commonly used normalization methods for both FF and FFPE RNA-seq data. We developed a new normalization method for FFPE RNA-seq data, named SMIXnorm, based on a two-component mixture model that is simplified relative to MIXnorm to facilitate computation. The expression levels of expressed genes are modeled by normal distributions without truncation, and those of non-expressed genes are modeled by zero-inflated Poisson distributions. The maximum likelihood estimates of the model parameters are obtained by a nested Expectation-Maximization algorithm with a less complicated latent variable structure, and closed-form updates are available within each iteration. Real data applications and simulation studies show that SMIXnorm greatly reduces computing time compared to MIXnorm, without sacrificing performance. More importantly, we developed a web-based tool, RNA-seq Normalization (RSeqNorm), that offers a simple workflow to compute normalized RNA-seq data for both FFPE and FF samples. It includes SMIXnorm and MIXnorm for FFPE RNA-seq data, together with five commonly used normalization methods for FF RNA-seq data. Users can easily upload a raw RNA-seq count matrix and select one of the seven normalization methods to produce a downloadable normalized expression matrix for any downstream analysis. The R package is available at https://github.com/S-YIN/RSEQNORM. The web-based tool, RSeqNorm, is available at http://lce.biohpc.swmed.edu/rseqnorm with no restriction to use or redistribute.
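
The abstract mentions zero-inflated Poisson distributions for non-expressed genes without giving a parametrization. As a rough illustration only, the sketch below uses the textbook form of that distribution: with probability `pi` the count is a structural zero, otherwise it is drawn from a Poisson with mean `lam`. The names `zip_pmf`, `pi`, and `lam` are assumptions made for this example and are not taken from SMIXnorm or RSeqNorm.

```python
import math


def zip_pmf(k, pi, lam):
    """Probability mass of a zero-inflated Poisson at count k.

    pi:  assumed probability of a structural (excess) zero, 0 <= pi <= 1.
    lam: assumed mean of the Poisson component, lam > 0.
    """
    if k < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # A zero can be a structural zero or a Poisson zero.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


# The zero class receives more mass than under a plain Poisson.
p_zero_zip = zip_pmf(0, pi=0.3, lam=2.0)
p_zero_poisson = zip_pmf(0, pi=0.0, lam=2.0)
```

Setting `pi` to 0 recovers the ordinary Poisson, which makes the extra zero mass easy to see by comparison.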


2021 ◽  
Author(s):  
Yue Cao ◽  
Pengyi Yang ◽  
Jean Yee Hwa Yang

Single-cell RNA-seq (scRNA-seq) data simulation is critical for evaluating computational methods for analysing scRNA-seq data especially when ground truth is experimentally unattainable. The reliability of evaluation depends on the ability of simulation methods to capture properties of experimental data. However, while many scRNA-seq data simulation methods have been proposed, a systematic evaluation of these methods is lacking. We developed a comprehensive evaluation framework, SimBench, including a novel kernel density estimation measure to benchmark 12 simulation methods through 36 scRNA-seq experimental datasets. We evaluated the simulation methods on a panel of data properties, ability to maintain biological signals and computational scalability. Our benchmark uncovered performance differences among the methods and highlighted the varying difficulties in simulating data characteristics. Furthermore, we identified several limitations including maintaining heterogeneity of distribution. These results, together with the framework and datasets made publicly available as R packages, will guide simulation methods selection and their future development.


2017 ◽  
Author(s):  
Alemu Takele Assefa ◽  
Katrijn De Paepe ◽  
Celine Everaert ◽  
Pieter Mestdagh ◽  
Olivier Thas ◽  
...  

Background: Protein-coding RNAs (mRNA) have been the primary target of most transcriptome studies in the past, but in recent years, attention has expanded to include long non-coding RNAs (lncRNA). lncRNAs are typically expressed at low levels, and are inherently highly variable. This is a fundamental challenge for differential expression (DE) analysis. In this study, the performance of 14 popular tools for testing DE in RNA-seq data along with their normalization methods is comprehensively evaluated, with a particular focus on lncRNAs and low abundant mRNAs. Results: Thirteen performance metrics were used to evaluate DE tools and normalization methods using simulations and analyses of six diverse RNA-seq datasets. Non-parametric procedures are used to simulate gene expression data in such a way that realistic levels of expression and variability are preserved in the simulated data. Throughout the assessment, we kept track of the results for mRNA and lncRNA separately. All statistical models exhibited inferior performance for lncRNAs compared to mRNAs across all simulated scenarios and analysis of benchmark RNA-seq datasets. No single tool uniformly outperformed the others. Conclusion: Overall, the linear modeling with empirical Bayes moderation (limma) and the nonparametric approach (SAMSeq) showed best performance: good control of the false discovery rate (FDR) and reasonable sensitivity. However, for achieving a sensitivity of at least 50%, more than 80 samples are required when studying expression levels in realistic clinical settings such as cancer research. About half of the methods showed severe excess of false discoveries, making these methods unreliable for differential expression analysis and jeopardizing reproducible science. The detailed results of our study can be consulted through a user-friendly web application, http://statapps.ugent.be/tools/AppDGE/


2016 ◽  
Author(s):  
Rhonda Bacher ◽  
Li-Fang Chu ◽  
Ning Leng ◽  
Audrey P. Gasch ◽  
James A. Thomson ◽  
...  

Summary: Normalization of RNA-sequencing data is essential for accurate downstream inference, but the assumptions upon which most methods are based do not hold in the single-cell setting. Consequently, applying existing normalization methods to single-cell RNA-seq data introduces artifacts that bias downstream analyses. To address this, we introduce SCnorm for accurate and efficient normalization of scRNA-seq data.

