qtQDA: quantile transformed quadratic discriminant analysis for high-dimensional RNA-seq data

2019 ◽  
Author(s):  
Necla Koçhan ◽  
Gözde Y. Tütüncü ◽  
Gordon K. Smyth ◽  
Luke C. Gandolfo ◽  
Göknur Giner

Classification on the basis of gene expression data derived from RNA-seq promises to become an important part of modern medicine. We propose a new classification method based on a model where the data is marginally negative binomial but dependent, thereby incorporating the dependence known to be present between measurements from different genes. The method, called qtQDA, works by first performing a quantile transformation (qt) then applying Gaussian Quadratic Discriminant Analysis (QDA) using regularized covariance matrix estimates. We show that qtQDA has excellent performance when applied to real data sets and has advantages over some existing approaches. An R package implementing the method is also available.

PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e8260 ◽  
Author(s):  
Necla Koçhan ◽  
G. Yazgi Tutuncu ◽  
Gordon K. Smyth ◽  
Luke C. Gandolfo ◽  
Göknur Giner

Classification on the basis of gene expression data derived from RNA-seq promises to become an important part of modern medicine. We propose a new classification method based on a model where the data is marginally negative binomial but dependent, thereby incorporating the dependence known to be present between measurements from different genes. The method, called qtQDA, works by first performing a quantile transformation (qt) then applying Gaussian quadratic discriminant analysis (QDA) using regularized covariance matrix estimates. We show that qtQDA has excellent performance when applied to real data sets and has advantages over some existing approaches. An R package implementing the method is also available on https://github.com/goknurginer/qtQDA.
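As a rough illustration of the quantile-transformation step, the Python sketch below maps a negative binomial count onto the standard normal scale, after which Gaussian QDA could be applied to the transformed values. The function names, the mean/dispersion parameterization, and the mid-distribution handling of discrete counts are assumptions made for this sketch; they are not taken from the qtQDA package.

```python
import math
from statistics import NormalDist


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu and dispersion phi
    (variance = mu + phi * mu**2). Parameterization assumed for this sketch."""
    r = 1.0 / phi                      # "size" parameter
    p = r / (r + mu)                   # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def quantile_transform(count, mu, phi):
    """Map a count to a z-score via z = Phi^{-1}(u).

    u is the mid-distribution value F(count - 1) + 0.5 * f(count); this
    choice is an assumption that keeps u strictly inside (0, 1).
    """
    below = sum(nb_pmf(k, mu, phi) for k in range(count))
    u = below + 0.5 * nb_pmf(count, mu, phi)
    # Guard against floating-point round-off in the extreme tails.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return NormalDist().inv_cdf(u)
```

With mu = 10 and phi = 0.2, for example, a count of 0 maps to a negative z-score and a count of 40 to a positive one. The QDA step with regularized covariance estimates is not shown here.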


2016 ◽  
Vol 14 (06) ◽  
pp. 1650034 ◽  
Author(s):  
Naim Al Mahi ◽  
Munni Begum

One of the primary objectives of a ribonucleic acid (RNA) sequencing, or RNA-Seq, experiment is to identify differentially expressed (DE) genes in two or more treatment conditions. It is common practice to assume that all read counts from RNA-Seq data follow an overdispersed (OD) Poisson or negative binomial (NB) distribution, which is sometimes misleading because within each condition, some genes may have unvarying transcription levels with no overdispersion. In such a case, it is more appropriate and logical to consider two sets of genes: OD and non-overdispersed (NOD). We propose a new two-step integrated approach to distinguish DE genes in RNA-Seq data using standard Poisson and NB models for NOD and OD genes, respectively. This is an integrated approach because the method can be merged with any other NB-based method for detecting DE genes. We design a simulation study and analyze two real RNA-Seq data sets to evaluate the proposed strategy. We compare the performance of this new method combined with the three R software packages edgeR, DESeq2, and DSS, using their default settings. For both the simulated and real data sets, the integrated approaches perform better than, or at least as well as, the regular methods embedded in these R packages.
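The split into OD and NOD genes might be sketched as follows. The abstract does not state the criterion the authors use, so this sketch substitutes a simple method-of-moments dispersion estimate; the function names and the gene labels in the usage note are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments NB dispersion: max(0, (s^2 - m) / m^2).

    Assumed criterion for this sketch only. Needs at least two replicates.
    """
    m = mean(counts)
    if m == 0:
        return 0.0
    return max(0.0, (variance(counts) - m) / (m * m))


def split_genes(count_table):
    """Return (od_genes, nod_genes) from a {gene: [counts]} mapping."""
    od, nod = [], []
    for gene, counts in count_table.items():
        # A zero estimate means the variance does not exceed the mean,
        # which is what a Poisson model would allow for.
        (od if moment_dispersion(counts) > 0 else nod).append(gene)
    return od, nod
```

For a table such as `{"geneA": [10, 11, 9, 10], "geneB": [2, 40, 15, 90]}`, the first gene would be treated as NOD (Poisson) and the second as OD (NB) under this rule.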


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3797 ◽  
Author(s):  
Sheng Yang ◽  
Fang Shao ◽  
Weiwei Duan ◽  
Yang Zhao ◽  
Feng Chen

RNA sequencing (RNA-Seq) enables the measurement and comparison of gene expression with isoform-level quantification. Differences in the effect of each isoform may make traditional methods, which aggregate isoforms, ineffective. Here, we introduce a variance component-based test that can jointly test multiple isoforms of one gene to identify differentially expressed (DE) genes, especially those with isoforms that have differential effects. We model isoform-level expression data from RNA-Seq using a negative binomial distribution and consider the baseline abundance of isoforms and their effects as two random terms. Our approach tests the global null hypothesis of no difference in any of the isoforms. The null distribution of the derived score statistic is investigated using empirical and theoretical methods. The results of simulations suggest that the performance of the proposed set test is superior to that of traditional algorithms and almost reaches optimal power when the variance of covariates is large. This method is also applied to analyze real data. Our algorithm, as a supplement to traditional algorithms, is superior at selecting DE genes with sparse or opposite effects for isoforms.


2015 ◽  
Author(s):  
David M Rocke ◽  
Luyao Ruan ◽  
Yilun Zhang ◽  
J. Jared Gossett ◽  
Blythe Durbin-Johnson ◽  
...  

Motivation: An important property of a valid method for testing for differential expression is that the false positive rate should at least roughly correspond to the p-value cutoff, so that if 10,000 genes are tested at a p-value cutoff of 10^-4, and if all the null hypotheses are true, then there should be only about 1 gene declared to be significantly differentially expressed. We tested this by resampling from existing RNA-Seq data sets and also by matched negative binomial simulations. Results: Methods we examined, which rely strongly on a negative binomial model, such as edgeR, DESeq, and DESeq2, show large numbers of false positives in both the resampled real-data case and in the simulated negative binomial case. This also occurs with a negative binomial generalized linear model function in R. Methods that use only the variance function, such as limma-voom, do not show excessive false positives, as is also the case with a variance stabilizing transformation followed by linear model analysis with limma. The excess false positives are likely caused by apparently small biases in estimation of negative binomial dispersion and, perhaps surprisingly, occur mostly when the mean and/or the dispersion is high, rather than for low-count genes.
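The arithmetic behind the 10,000-gene example can be checked directly. The sketch below computes the expected number of false positives under the global null and, assuming the tests are independent (an assumption of this sketch, not a claim of the abstract), the binomial probability of seeing at least a given number of them. The function names are illustrative.

```python
import math


def expected_false_positives(n_tests, alpha):
    """Expected count of p-values below alpha when every null is true."""
    return n_tests * alpha


def prob_at_least(k, n_tests, alpha):
    """P(X >= k) for X ~ Binomial(n_tests, alpha), assuming independent tests."""
    below = sum(
        math.comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k)
    )
    return 1.0 - below
```

With 10,000 tests and a cutoff of 1e-4 the expectation is about 1, and five or more false positives would occur with probability below 1% for a well-calibrated method, so a much larger count points to miscalibrated p-values.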


F1000Research ◽  
2016 ◽  
Vol 4 ◽  
pp. 1521 ◽  
Author(s):  
Charlotte Soneson ◽  
Michael I. Love ◽  
Mark D. Robinson

High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package (tximport) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Background: Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results: We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The assessment of normalization quality emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by a single-cell data set of the cell cycle. MUREN is implemented as an R package. The code, under the GPL-3 license, is available on GitHub at github.com/hippo-yf/MUREN and on conda at anaconda.org/hippo-yf/r-muren. Conclusions: MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
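The robustness that least trimmed squares brings to pairwise normalization can be illustrated in a simplified, location-only form: given per-gene log-ratios between a sample and a reference, estimate the common shift from the most tightly clustered subset of genes. MUREN's actual regression model is not described in the abstract, so the function name, the default subset size, and the reduction to a single shift parameter are assumptions of this sketch.

```python
def lts_location(values, h=None):
    """Least trimmed squares estimate of a common shift.

    Keeps the h values with the smallest sum of squared residuals. For a
    one-dimensional location problem the optimal subset is a contiguous
    window of the sorted values, so every window is checked.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("values must not be empty")
    if h is None:
        h = n // 2 + 1                 # conventional default coverage
    h = min(max(h, 1), n)
    best_mean, best_ss = None, float("inf")
    for start in range(n - h + 1):
        window = xs[start:start + h]
        m = sum(window) / h
        ss = sum((x - m) ** 2 for x in window)
        if ss < best_ss:
            best_mean, best_ss = m, ss
    return best_mean
```

If six genes share a log-ratio of 0.3 and three up-regulated genes sit at 2.0, 2.5, and 3.0, the estimate stays near 0.3, whereas the ordinary mean is pulled above 1. This is the sense in which asymmetric differentiation is preserved rather than averaged away.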


2017 ◽  
Author(s):  
Zhun Miao ◽  
Ke Deng ◽  
Xiaowo Wang ◽  
Xuegong Zhang

Summary: The excess zeros in single-cell RNA-seq data include “real” zeros due to the on-off nature of gene transcription in single cells and “dropout” zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package, DEsingle, which employs a zero-inflated negative binomial model to estimate the proportion of real and dropout zeros and to define and detect 3 types of DE genes in single-cell RNA-seq data with higher accuracy. Availability and Implementation: The R package DEsingle is freely available at https://github.com/miaozhun/DEsingle and is under Bioconductor’s consideration. Supplementary information: Supplementary data are available at bioRxiv online.
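A zero-inflated negative binomial distribution mixes a point mass at zero with an ordinary negative binomial, so an observed zero can come from either component. The sketch below writes out the mixture and the share of zeros attributable to the point mass. The parameter names and the mean/dispersion parameterization are assumptions of this sketch, not DEsingle's API, and the abstract does not say which component corresponds to "real" versus "dropout" zeros.

```python
import math


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu and dispersion phi
    (variance = mu + phi * mu**2)."""
    r = 1.0 / phi
    p = r / (r + mu)
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def zinb_pmf(k, theta, mu, phi):
    """Zero-inflated NB: weight theta on a point mass at 0, 1 - theta on NB."""
    nb = (1.0 - theta) * nb_pmf(k, mu, phi)
    return theta + nb if k == 0 else nb


def inflation_share_of_zeros(theta, mu, phi):
    """Fraction of all zeros that come from the point-mass component."""
    return theta / zinb_pmf(0, theta, mu, phi)
```

With theta = 0.3, mu = 5, and phi = 0.5, roughly 84% of zeros come from the point-mass component and the rest from the negative binomial component.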


2021 ◽  
Author(s):  
Enrico Gaffo ◽  
Alessia Buratin ◽  
Anna Dal Molin ◽  
Stefania Bortoluzzi

Current methods for identifying circular RNAs (circRNAs) suffer from low discovery rates and inconsistent performance in diverse data sets. Therefore, the applied detection algorithm can bias high-throughput study findings by missing relevant circRNAs. Here, we show that our bioinformatics tool CirComPara2 (https://github.com/egaffo/CirComPara2), by combining multiple circRNA detection methods, consistently achieves high recall rates without loss of precision in simulated and different real data sets.

