Classification of RNA-Seq Data via Gaussian Copulas

2017 ◽  
Author(s):  
Qingyang Zhang

Abstract: RNA-sequencing (RNA-Seq) has become a preferred option for quantifying gene expression because it is more accurate and reliable than microarrays. In RNA-Seq experiments, the expression level of a gene is measured by the count of short reads that are mapped to the gene region. Although some normal-based statistical methods may also be applied to log-transformed read counts, they are not ideal for directly modeling RNA-Seq data. Two discrete distributions, the Poisson distribution and the negative binomial distribution, have been commonly used in the literature to model RNA-Seq data, where the latter is a natural extension of the former that allows for overdispersion. Due to the technical difficulty of modeling correlated counts, most existing classifiers based on discrete distributions assume that genes are independent of each other. However, as we show in this paper, the independence assumption may cause non-ignorable bias in estimating the discriminant score, making the classification inaccurate. We therefore drop the independence assumption and explicitly model the dependence between genes using a Gaussian copula. We apply a Bayesian approach to estimate the covariance matrix and the overdispersion parameter of the negative binomial distribution. Both synthetic data and real data are used to demonstrate the advantages of our model.
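The idea of coupling negative binomial marginals through a Gaussian copula can be illustrated with a small simulation. The following is only a minimal sketch for two genes, not the paper's estimation procedure: the function names, the mean/dispersion parameterization (variance = mean + dispersion × mean²), and the single correlation parameter `rho` are all assumptions made for illustration.

```python
import math
import random


def nb_pmf(k, mean, dispersion):
    """Negative binomial pmf with variance mean + dispersion * mean**2 (assumed parameterization)."""
    r = 1.0 / dispersion      # size parameter
    p = r / (r + mean)        # success probability
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def nb_quantile(u, mean, dispersion):
    """Smallest count k whose cumulative probability reaches u."""
    u = min(u, 1.0 - 1e-12)   # guard against u == 1.0 from floating-point rounding
    k = 0
    cdf = nb_pmf(0, mean, dispersion)
    while cdf < u:
        k += 1
        cdf += nb_pmf(k, mean, dispersion)
    return k


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_copula_counts(n, means, dispersions, rho, seed=0):
    """Draw n pairs of correlated counts: Gaussian copula, negative binomial marginals."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Correlated standard normals (2 x 2 Cholesky factor written out by hand).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Map to uniforms, then to counts through each marginal quantile function.
        uniforms = (normal_cdf(z1), normal_cdf(z2))
        samples.append(tuple(
            nb_quantile(u, m, d) for u, m, d in zip(uniforms, means, dispersions)
        ))
    return samples


if __name__ == "__main__":
    for pair in sample_copula_counts(5, (20.0, 50.0), (0.2, 0.1), rho=0.8):
        print(pair)
```

Setting `rho=0` in this sketch recovers the independence assumption that the abstract argues against; a nonzero `rho` induces dependence between the two counts while leaving each marginal distribution unchanged.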

2019 ◽  
Vol 35 (18) ◽  
pp. 3372-3377 ◽  
Author(s):  
Kimon Froussios ◽  
Nick J Schurch ◽  
Katarzyna Mackinnon ◽  
Marek Gierliński ◽  
Céline Duc ◽  
...  

Abstract: Motivation: RNA-seq experiments are usually carried out in three or fewer replicates. In order to work well with so few samples, differential gene expression (DGE) tools typically assume the form of the underlying gene expression distribution. In this paper, the statistical properties of gene expression from RNA-seq are investigated in the complex eukaryote Arabidopsis thaliana, extending and generalizing the results of previous work in the simple eukaryote Saccharomyces cerevisiae. Results: We show that, consistent with the results in S. cerevisiae, more gene expression measurements in A. thaliana are consistent with being drawn from an underlying negative binomial distribution than from either a log-normal distribution or a normal distribution, and that the size and complexity of the A. thaliana transcriptome do not influence the false positive rate performance of the nine widely used DGE tools tested here. We therefore recommend the use of DGE tools that are based on the negative binomial distribution. Availability and implementation: The raw data for the 17 WT Arabidopsis thaliana datasets are available from the European Nucleotide Archive (E-MTAB-5446). The processed and aligned data can be visualized in context using IGB (Freese et al., 2016), or downloaded directly, using our publicly available IGB quickload server at https://compbio.lifesci.dundee.ac.uk/arabidopsisQuickload/public_quickload/ under ‘RNAseq>Froussios2019’. All scripts and commands are available from GitHub at https://github.com/bartongroup/KF_arabidopsis-GRNA. Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Vol 70 (4) ◽  
pp. 917-934
Author(s):  
Muhammad Mansoor ◽  
Muhammad Hussain Tahir ◽  
Gauss M. Cordeiro ◽  
Sajid Ali ◽  
Ayman Alzaatreh

Abstract: A generalization of the Lindley distribution, namely the Lindley negative-binomial distribution, is introduced. The Lindley and the exponentiated Lindley distributions are considered as sub-models of the proposed distribution. The proposed model has flexible density and hazard rate functions. The density function can be decreasing, right-skewed, left-skewed or approximately symmetric. The hazard rate function can take various shapes, including increasing, decreasing and bathtub. Furthermore, the survival and hazard rate functions have closed-form representations, which makes the model tractable for censored data analysis. Some general properties of the proposed model are studied, such as ordinary and incomplete moments, the moment generating function, mean deviations, and the Lorenz and Bonferroni curves. Maximum likelihood and Bayesian estimation methods are used to estimate the model parameters. In addition, a small simulation study is conducted to evaluate the performance of the estimation methods. Two real data sets are used to illustrate the applicability of the proposed model.


Author(s):  
R. Ashly ◽  
C. S. Rajitha

The objective of this paper is to introduce a new two-parameter mixed negative binomial distribution, namely the negative binomial-improved second degree Lindley (NB-ISL) distribution. This distribution is obtained by mixing the negative binomial distribution with the improved second degree Lindley distribution. Many mixed distributions have been used in the literature for modeling overdispersed count data, and they provide a better fit than the Poisson and negative binomial distributions. In addition, we present the basic statistical properties of the new distribution, such as the factorial moments, mean and variance, and we discuss the behavior of the mean, variance and coefficient of variation. Parameters are estimated by maximum likelihood. The performance of the NB-ISL distribution is shown in practice by applying it to a real data set and comparing it with some well-known count distributions. The results show that the negative binomial-improved second degree Lindley distribution provides a better fit than the Poisson, negative binomial and negative binomial-Lindley distributions.


2016 ◽  
Author(s):  
Kimon Froussios ◽  
Nick J. Schurch ◽  
Katarzyna Mackinnon ◽  
Marek Gierliński ◽  
Céline Duc ◽  
...  

Abstract: RNA-seq experiments are usually carried out in three or fewer replicates. In order to work well with so few samples, Differential Gene Expression (DGE) tools typically assume the form of the underlying distribution of gene expression. A recent highly replicated study revealed that RNA-seq gene expression measurements in yeast are best represented as being drawn from an underlying negative binomial distribution. In this paper, the statistical properties of gene expression in the higher eukaryote Arabidopsis thaliana are shown to be essentially identical to those from yeast despite the large increase in the size and complexity of the transcriptome: gene expression measurements from this model plant species are consistent with being drawn from an underlying negative binomial or log-normal distribution, and the false positive rate performance of nine widely used DGE tools is not strongly affected by the additional size and complexity of the A. thaliana transcriptome. For RNA-seq data, we therefore recommend the use of DGE tools that are based on the negative binomial distribution.


2020 ◽  
Vol 4 (3) ◽  
pp. 484-497
Author(s):  
Puput Cahya Ambarwati ◽  
Indahwati Indahwati ◽  
Muhammad Nur Aidi

Geographically weighted regression (GWR) is a regression method for spatial data. When the response variable follows a Poisson distribution, geographically weighted Poisson regression (GWPR) can be used. However, GWPR often fails to satisfy the equidispersion assumption. The classical approach to overdispersion in a Poisson model is the Poisson-gamma mixture, which leads to the negative binomial distribution. When the response variable follows a negative binomial distribution, geographically weighted negative binomial regression (GWNBR) can be used. This study uses both simulated and real data. The simulation results indicate that GWPR still models the data accurately only when the overdispersion is close to 1, judged by the number of significant results and the average p-value. For the real data, a comparison of AIC values shows that GWNBR is a better model than GWPR for the overdispersed data on malnourished children in East Java Province in 2017.
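The Poisson and gamma construction mentioned above can be checked numerically. The sketch below is an illustration only and is unrelated to the spatial models in the study: it draws a gamma-distributed rate for each observation, then a Poisson count given that rate, and compares the variance-to-mean ratio with that of plain Poisson counts. All function names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small to moderate rates."""
    threshold = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_poisson_gamma(rng, mean, shape):
    """Poisson count whose rate is gamma distributed with the given mean and shape."""
    rate = rng.gammavariate(shape, mean / shape)  # arguments are (shape, scale)
    return sample_poisson(rng, rate)


def dispersion_index(values):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return var / m


if __name__ == "__main__":
    rng = random.Random(42)
    poisson_counts = [sample_poisson(rng, 5.0) for _ in range(20000)]
    mixed_counts = [sample_poisson_gamma(rng, 5.0, 2.0) for _ in range(20000)]
    print("Poisson dispersion index:      ", dispersion_index(poisson_counts))
    print("Poisson-gamma dispersion index:", dispersion_index(mixed_counts))
```

For the mixture, the theoretical variance is mean + mean²/shape, so with mean 5 and shape 2 the variance-to-mean ratio should come out well above 1, whereas the plain Poisson ratio stays near 1.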


2016 ◽  
Vol 5 (1) ◽  
pp. 53-65 ◽  
Author(s):  
Abdullahi Yusuf ◽  
Badamasi Bashir Mikail ◽  
Aliyu Isah Aliyu ◽  
Abdurrahaman L. Sulaiman

2018 ◽  
Author(s):  
John Christian Snyder

In Bayesian analysis, the “objective” Bayesian approach seeks to select a prior distribution not by using (often subjective) scientific belief or by mathematical convenience, but rather by deriving it under a pre-specified criterion. This approach takes the decision of prior selection out of the hands of the researcher. Ideally, for a given data model, we would like to have a prior which represents a “neutral” prior belief in the phenomenon we are studying. In categorical data analysis, the odds ratio is one of several approaches to quantify how strongly the presence or absence of one property is associated with the presence or absence of another property. In this project, we present a reference prior for the odds ratio of an unrestricted 2 x 2 table. Posterior simulation can be conducted without MCMC and is implemented on a GPU via the CUDA extensions for C. Simulation results indicate that the proposed approach to this problem is far superior to the widely used frequentist approaches that dominate this area. Real data examples also typically yield much more sensible results, especially for small sample sizes or for tables that contain zeros. An R package is also presented to allow for easy implementation of this methodology. Next, we develop an approximate reference prior for the negative binomial distribution, applying this methodology to a continuous parameterization often used for modeling over-dispersed count data as well as the typical discrete case. Results indicate that the developed prior equals the performance of the MLE in estimating the mean of the distribution but is far superior when estimating the dispersion parameter.
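To make the remark about tables that contain zeros concrete, the sketch below computes the plain sample odds ratio of a 2 x 2 table. It is not the reference-prior method described in the abstract; the function name and the choice to return infinity or NaN for degenerate tables are assumptions made for illustration.

```python
import math


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a * d) / (b * c) for the table [[a, b], [c, d]]."""
    numerator = a * d
    denominator = b * c
    if denominator == 0:
        # A zero cell in b or c leaves the sample estimate infinite or undefined.
        return math.nan if numerator == 0 else math.inf
    return numerator / denominator


if __name__ == "__main__":
    print(odds_ratio(10, 5, 4, 8))   # all cells positive: finite estimate
    print(odds_ratio(3, 0, 2, 5))    # zero cell: estimate is not finite
```

A single zero cell is enough to push the sample estimate to zero, infinity, or an undefined value, which is the kind of table where the abstract reports more sensible results from the proposed approach.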

