A Flexible Multivariate Distribution for Correlated Count Data

Kimberly F. Sellers; Tong Li; Yixuan Wu; Narayanaswamy Balakrishnan

doi:10.3390/stats4020021

A Flexible Multivariate Distribution for Correlated Count Data

Stats ◽

10.3390/stats4020021 ◽

2021 ◽

Vol 4 (2) ◽

pp. 308-326

Author(s):

Kimberly F. Sellers ◽

Tong Li ◽

Yixuan Wu ◽

Narayanaswamy Balakrishnan

Keyword(s):

Count Data ◽

Negative Binomial ◽

Real Data ◽

Data Types ◽

Real World Data ◽

Special Cases ◽

Multivariate Poisson Distribution ◽

Classical Models ◽

Over Dispersion ◽

Bernoulli Distributions

Multivariate count data are often modeled via a multivariate Poisson distribution, but this model carries a constraining assumption of data equi-dispersion (where the variance equals the mean). Real data are oftentimes over-dispersed and, as such, are commonly modeled with various extensions of a negative binomial structure. While over-dispersion is more prevalent than under-dispersion in real data, examples containing under-dispersed data are surfacing with greater frequency. Thus, there is a demonstrated need for a flexible model that can accommodate both data types. We develop a multivariate Conway–Maxwell–Poisson (MCMP) distribution to serve as a flexible alternative for correlated count data that exhibit data dispersion. This structure contains the multivariate Poisson, multivariate geometric, and multivariate Bernoulli distributions as special cases, and serves as a bridge distribution across these three classical models to address other levels of over- or under-dispersion. In this work, we not only derive the distributional form and statistical properties of this model, but we further address parameter estimation, establish informative hypothesis tests to detect statistically significant data dispersion and aid in model parsimony, and illustrate the distribution’s flexibility through several simulated and real-world data examples. These examples demonstrate that the MCMP distribution performs on par with the multivariate negative binomial distribution for over-dispersed data, and proves particularly beneficial in effectively representing under-dispersed data. Thus, the MCMP distribution offers an effective, unifying framework for modeling over- or under-dispersed multivariate correlated count data that do not necessarily adhere to Poisson assumptions.
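As a rough illustration of the dispersion behavior described in this abstract, the sketch below evaluates the univariate Conway–Maxwell–Poisson pmf, P(X = x) ∝ λ^x / (x!)^ν. It is not the multivariate construction from the paper; the function names, the truncation length `terms`, and the parameter values are assumptions chosen for illustration only.

```python
import math


def cmp_pmf_vector(lam, nu, terms=200):
    """Univariate CMP probabilities for x = 0..terms-1.

    The normalizing constant is an infinite series; truncating it at
    `terms` is an assumption that is reasonable only when the tail
    mass beyond that point is negligible.
    """
    # Work on the log scale: log weight = x*log(lam) - nu*log(x!)
    log_w = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    m = max(log_w)
    z = sum(math.exp(v - m) for v in log_w)
    return [math.exp(v - m) / z for v in log_w]


def mean_var(probs):
    """Mean and variance of a pmf given as a list indexed by the count."""
    mean = sum(j * p for j, p in enumerate(probs))
    var = sum((j - mean) ** 2 * p for j, p in enumerate(probs))
    return mean, var


if __name__ == "__main__":
    # nu = 1 gives the Poisson case; nu = 0 with lam < 1 gives the geometric case.
    for nu in (0.5, 1.0, 2.0):
        mean, var = mean_var(cmp_pmf_vector(2.0, nu))
        print(f"nu={nu}: mean={mean:.4f}, variance={var:.4f}")
```

With λ fixed, values of ν below 1 give variance above the mean (over-dispersion) and values above 1 give variance below the mean (under-dispersion), which is the sense in which the distribution bridges the classical special cases.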

The Negative Binomial – Weighted Garima Distribution: Model, Properties and Applications

Pakistan Journal of Statistics and Operation Research ◽

10.18187/pjsor.v16i1.3013 ◽

2020 ◽

pp. 1-10

Author(s):

Winai Bodhisuwan ◽

Pornpop Saengthong

Keyword(s):

Parameter Estimation ◽

Count Data ◽

Negative Binomial ◽

Likelihood Estimation ◽

Real Data ◽

Distribution Model ◽

Data Sets ◽

Factorial Moments ◽

Special Cases ◽

First Four Moments

In this paper, a new mixed negative binomial (NB) distribution, named the negative binomial-weighted Garima (NB-WG) distribution, is introduced for modeling count data. Two special cases of the proposed distribution, the negative binomial-Garima (NB-G) and negative binomial-size biased Garima (NB-SBG) distributions, are obtained by setting the specified parameter. Some statistical properties, such as the factorial moments, the first four moments, variance and skewness, are also derived. Parameter estimation is implemented using maximum likelihood estimation (MLE), and real data sets are discussed to demonstrate the usefulness and applicability of the proposed distribution.

Transition models for count data: a flexible alternative to fixed distribution models

Statistical Methods & Applications ◽

10.1007/s10260-021-00558-6 ◽

2021 ◽

Author(s):

Moritz Berger ◽

Gerhard Tutz

Keyword(s):

Count Data ◽

Regression Models ◽

Negative Binomial ◽

Real Data ◽

Distribution Models ◽

Explanatory Variables ◽

Excess Zeros ◽

Proposed Model ◽

Transition Models ◽

Fixed Distribution

Abstract: A flexible semiparametric class of models is introduced that offers an alternative to classical regression models for count data, such as the Poisson and negative binomial models, as well as to more general models accounting for excess zeros that are also based on fixed distributional assumptions. The model allows the data themselves to determine the distribution of the response variable but, in its basic form, uses a parametric term that specifies the effect of explanatory variables. In addition, an extended version is considered, in which the effects of covariates are specified nonparametrically. The proposed model and traditional models are compared in simulations and by utilizing several real data applications from the area of health and social science.

Managing Inflation

Crime & Delinquency ◽

10.1177/0011128716679796 ◽

2016 ◽

Vol 63 (1) ◽

pp. 77-87 ◽

Cited By ~ 5

Author(s):

William H. Fisher ◽

Stephanie W. Hartwell ◽

Xiaogang Deng

Keyword(s):

Count Data ◽

Regression Models ◽

Negative Binomial ◽

Negative Binomial Regression ◽

Excess Zeros ◽

Binomial Regression ◽

Negative Binomial Models ◽

Using Data ◽

Over Dispersion ◽

Binomial Models

Poisson and negative binomial regression procedures have proliferated, and now are available in virtually all statistical packages. Along with the regression procedures themselves are procedures for addressing issues related to the over-dispersion and excessive zeros commonly observed in count data. These approaches, zero-inflated Poisson and zero-inflated negative binomial models, use logit or probit models for the “excess” zeros and count regression models for the counted data. Although these models are often appropriate on statistical grounds, their interpretation may prove substantively difficult. This article explores this dilemma, using data from a study of individuals released from facilities maintained by the Massachusetts Department of Correction.
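The two-part structure mentioned in this abstract can be sketched with the zero-inflated Poisson pmf, where a probability `pi` of an "excess" zero is mixed with an ordinary Poisson count with mean `mu`. This is a minimal illustration without covariates, not the article's analysis; the function names and parameter values are assumptions. In a regression setting, `pi` would typically come from a logit or probit model and `mu` from a log-linear model.

```python
import math


def zip_pmf(y, mu, pi):
    """Zero-inflated Poisson probability of observing count y.

    pi is the probability of an excess (structural) zero and mu is the
    mean of the Poisson component.
    """
    poisson = math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))
    if y == 0:
        # Zeros arise from both the excess-zero part and the Poisson part.
        return pi + (1 - pi) * poisson
    return (1 - pi) * poisson


def zip_mean_var(mu, pi):
    """Closed-form mean and variance of the zero-inflated Poisson."""
    mean = (1 - pi) * mu
    var = (1 - pi) * mu * (1 + pi * mu)
    return mean, var


if __name__ == "__main__":
    mu, pi = 3.0, 0.3  # hypothetical values for illustration
    print("P(Y=0):", zip_pmf(0, mu, pi))
    print("mean, variance:", zip_mean_var(mu, pi))
```

Because the variance exceeds the mean whenever `pi` is positive, excess zeros alone induce over-dispersion, which is one reason the two phenomena are hard to separate when interpreting fitted models.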

BayCount: A Bayesian Decomposition Method for Inferring Tumor Heterogeneity using RNA-Seq Counts

10.1101/218511 ◽

2017 ◽

Cited By ~ 1

Author(s):

Fangzheng Xie ◽

Mingyuan Zhou ◽

Yanxun Xu

Keyword(s):

Rna Sequencing ◽

Count Data ◽

Decomposition Method ◽

Tumor Heterogeneity ◽

Negative Binomial ◽

Expression Profiles ◽

The Cancer Genome Atlas ◽

Cancer Prognosis ◽

Real World Data ◽

Tumor Sample

Abstract: Tumors are heterogeneous: a tumor sample usually consists of a set of subclones with distinct transcriptional profiles and potentially different degrees of aggressiveness and responses to drugs. Understanding tumor heterogeneity is therefore critical for precise cancer prognosis and treatment. In this paper, we introduce BayCount, a Bayesian decomposition method to infer tumor heterogeneity with highly over-dispersed RNA sequencing count data. Using negative binomial factor analysis, BayCount takes into account both the between-sample and gene-specific random effects on raw counts of sequencing reads mapped to each gene. For the posterior inference, we develop an efficient compound Poisson based blocked Gibbs sampler. Simulation studies show that BayCount is able to accurately estimate the subclonal inference, including the number of subclones, the proportions of these subclones in each tumor sample, and the gene expression profiles in each subclone. For real-world data examples, we apply BayCount to The Cancer Genome Atlas lung cancer and kidney cancer RNA sequencing count data and obtain biologically interpretable results. Our method represents the first effort in characterizing tumor heterogeneity using RNA sequencing count data that simultaneously removes the need for normalizing the counts, achieves statistical robustness, and obtains biologically/clinically meaningful insights. The R package BayCount implementing our model and algorithm is available for download.

Distributions You Can Count On …But What’s the Point?

Econometrics ◽

10.3390/econometrics8010009 ◽

2020 ◽

Vol 8 (1) ◽

pp. 9 ◽

Cited By ~ 1

Author(s):

Brendan P. M. McCabe ◽

Christopher L. Skeels

Keyword(s):

Count Data ◽

Negative Binomial ◽

Broad Class ◽

Poisson Model ◽

Econometric Analysis ◽

Discrete Distributions ◽

Optimal Test ◽

Score Tests ◽

Optimal Tests ◽

Over Dispersion

The Poisson regression model remains an important tool in the econometric analysis of count data. In a pioneering contribution to the econometric analysis of such models, Lung-Fei Lee presented a specification test for a Poisson model against a broad class of discrete distributions sometimes called the Katz family. Two members of this alternative class are the binomial and negative binomial distributions, which are commonly used with count data to allow for under- and over-dispersion, respectively. In this paper we explore the structure of other distributions within the class and their suitability as alternatives to the Poisson model. Potential difficulties with the Katz likelihood lead us to investigate a class of point optimal tests of the Poisson assumption against the alternative of over-dispersion in both the regression and intercept only cases. In a simulation study, we compare score tests of ‘Poisson-ness’ with various point optimal tests, based on the Katz family, and conclude that it is possible to choose a point optimal test which is better in the intercept only case, although the nuisance parameters arising in the regression case are problematic. One possible cause is poor choice of the point at which to optimize. Consequently, we explore the use of Hellinger distance to aid this choice. Ultimately we conclude that score tests remain the most practical approach to testing for over-dispersion in this context.
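For readers unfamiliar with the Katz family, the sketch below builds its probabilities from the usual textbook recursion p(x+1)/p(x) = (α + βx)/(1 + x). This parametrization is an assumption taken from the standard definition rather than from the paper itself, and the function names and truncation length are hypothetical. Under it, β = 0 gives the Poisson, 0 < β < 1 the negative binomial, and β < 0 the binomial case.

```python
def katz_pmf(alpha, beta, terms=500):
    """Katz-family probabilities built from p(x+1)/p(x) = (alpha + beta*x)/(1 + x).

    The recursion stops when the ratio reaches zero (the finite support
    of the binomial case) or after `terms` values, an assumed truncation.
    """
    weights = [1.0]
    for x in range(terms - 1):
        ratio = (alpha + beta * x) / (1 + x)
        if ratio <= 0:
            break
        weights.append(weights[-1] * ratio)
    total = sum(weights)
    return [w / total for w in weights]


def katz_mean_var(alpha, beta):
    """Closed-form mean and variance, valid for beta < 1."""
    mean = alpha / (1 - beta)
    var = alpha / (1 - beta) ** 2
    return mean, var


if __name__ == "__main__":
    for beta in (-0.25, 0.0, 0.5):
        mean, var = katz_mean_var(1.0, beta)
        print(f"beta={beta}: mean={mean:.4f}, variance/mean={var / mean:.4f}")
```

The variance-to-mean ratio is 1/(1 − β), so the sign of β alone determines whether a member of the family is under-, equi-, or over-dispersed.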

Extended negative binomial hurdle models

Statistical Methods in Medical Research ◽

10.1177/0962280218766567 ◽

2018 ◽

Vol 28 (5) ◽

pp. 1540-1551

Author(s):

Maengseok Noh ◽

Youngjo Lee

Keyword(s):

Count Data ◽

Negative Binomial ◽

Random Effect ◽

Random Effect Model ◽

Real Data ◽

Hurdle Models ◽

Poisson Models ◽

Effect Model ◽

Zero Rate ◽

General Statistical

Poisson models are widely used for statistical inference on count data. However, zero-inflation or zero-deflation with either overdispersion or underdispersion could occur. Currently, there is no available model for count data that allows excessive occurrence of zeros along with underdispersion in non-zero counts, even though the need for such models has been reported. Furthermore, given an excessive zero rate, we need a model that allows a larger degree of overdispersion than existing models. In this paper, we use a random-effect model to produce a general statistical model for accommodating such phenomena occurring in real data analyses.

A Stochastic Condensation Mechanism for Inducing Underdispersion in Count Models

10.20944/preprints202103.0570.v1 ◽

2021 ◽

Author(s):

Chenangnon Frédéric Tovissodé ◽

Romain Glele Kakai

Keyword(s):

Negative Binomial Distribution ◽

Count Data ◽

Negative Binomial ◽

Count Models ◽

Original Variable ◽

Binomial Distributions ◽

Original Distribution ◽

Count Distribution ◽

Special Cases ◽

Stochastic Mechanism

It is quite easy to stochastically distort an original count variable to obtain a new count variable with relatively more variability than in the original variable. Many popular overdispersion models (variance greater than mean) can indeed be obtained by mixtures, compounding or randomly stopped sums. There is no analogous stochastic mechanism for the construction of underdispersed count variables (variance less than mean), starting from an original count distribution of interest. This work proposes a generic method to stochastically distort an original count variable to obtain a new count variable with relatively less variability than in the original variable. The proposed mechanism, termed condensation, attracts probability masses from the quantiles in the tails of the original distribution and redirects them toward quantiles around the expected value. If the original distribution can be simulated, then the simulation of variates from a condensed distribution is straightforward. Moreover, condensed distributions have a simple mean-parametrization, a characteristic useful in a count regression context. An application to the negative binomial distribution resulted in a distribution allowing under-, equi- and overdispersion. In addition to graphical insights, fields of application of special cases of condensed Poisson and condensed negative binomial distributions were pointed out as an indication of the potential of condensation for a flexible analysis of count data.

The Effect of Sample Size on the Efficiency of Count Data Models: Application to Marriage Data

Journal of Economics and Behavioral Studies ◽

10.22610/jebs.v9i3.1742 ◽

2017 ◽

Vol 9 (3) ◽

pp. 6

Author(s):

Volition Tlhalitshi Montshiwa ◽

Ntebogang Dinah Moroke

Keyword(s):

Regression Model ◽

Sample Size ◽

Count Data ◽

Negative Binomial ◽

Information Criterion ◽

Data Models ◽

Hurdle Model ◽

Negative Binomial Regression Model ◽

Count Data Models ◽

Over Dispersion

Abstract: Sample size requirements are common in many multivariate analysis techniques as one of the measures taken to ensure the robustness of such techniques, but such requirements have not been of interest in the area of count data models. As such, this study investigated the effect of sample size on the efficiency of six commonly used count data models, namely: Poisson regression model (PRM), Negative binomial regression model (NBRM), Zero-inflated Poisson (ZIP), Zero-inflated negative binomial (ZINB), Poisson Hurdle model (PHM) and Negative binomial hurdle model (NBHM). The data used in this study were sourced from Data First and were collected by Statistics South Africa through the Marriage and Divorce database. PRM, NBRM, ZIP, ZINB, PHM and NBHM were applied to ten randomly selected samples ranging from 4392 to 43916 and differing by 10% in size. The six models were compared using the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Vuong’s test for over-dispersion, McFadden RSQ, Mean Square Error (MSE) and Mean Absolute Deviation (MAD). The results revealed that generally, the Negative Binomial-based models outperformed Poisson-based models. However, the results did not reveal the effect of sample size variations on the efficiency of the models since there was no consistency in the change in AIC, BIC, Vuong’s test for over-dispersion, McFadden RSQ, MSE and MAD as the sample size increased.
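The information criteria named in this abstract follow the standard definitions AIC = 2k − 2 log L and BIC = k ln(n) − 2 log L, where k is the number of fitted parameters and n the sample size. The sketch below applies them to an intercept-only Poisson fit; the counts are made-up illustrative values, not the study's marriage data, and the function names are assumptions.

```python
import math


def poisson_loglik(counts, mu):
    """Log-likelihood of i.i.d. Poisson(mu) counts."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1) for y in counts)


def aic(loglik, k):
    """Akaike Information Criterion for a model with k parameters."""
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik


if __name__ == "__main__":
    counts = [0, 1, 1, 2, 0, 3, 5, 1, 0, 2]  # hypothetical toy data
    mu_hat = sum(counts) / len(counts)  # Poisson MLE is the sample mean
    ll = poisson_loglik(counts, mu_hat)
    print("AIC:", aic(ll, k=1))
    print("BIC:", bic(ll, k=1, n=len(counts)))
```

Lower values indicate a better trade-off between fit and complexity. Because BIC's penalty grows with ln(n) while AIC's does not, the two criteria can rank models differently as the sample size changes.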

Statistical analysis of no observed effect concentrations or levels in eco-toxicological assays with overdispersed count endpoints

10.1101/2020.01.15.907881 ◽

2020 ◽

Author(s):

Ludwig A. Hothorn ◽

Felix M. Kluxen

Keyword(s):

Count Data ◽

Poisson Model ◽

Real Data ◽

Test Methods ◽

Statistical Testing ◽

Data Sets ◽

Mixing Distribution ◽

Flexible Modeling ◽

Hazard Characterization ◽

Over Dispersion

Abstract: In (eco-)toxicological hazard characterization, the No Observed Adverse Effect Concentration or Level (NOAEC or NOAEL) approach is used and often required despite its known limitations. For count data, statistical testing can be challenging due to several confounding factors, such as zero inflation, low observation numbers, variance heterogeneity, over- or under-dispersion when applying the Poisson model, or hierarchical experimental designs. As several tests are available for count data, we selected sixteen tests suitable for overdispersed counts and compared them in a simulation study. We assessed their performance considering data sets containing mixing distribution and over-dispersion with different observation numbers. The results show that there is no uniformly best approach because the assumed data conditions and assumptions are very different. However, the Dunnett-type procedure based on most likely transformation can be recommended because of its size and power behavior, which is relatively better over most data conditions as compared to the available alternative test methods, and because it allows flexible modeling and effect sizes can be estimated by confidence intervals. Related R-code is provided for real data examples.

The Effect of Sample Size on the Efficiency of Count Data Models: Application to Marriage Data

Journal of Economics and Behavioral Studies ◽

10.22610/jebs.v9i3(j).1742 ◽

2017 ◽

Vol 9 (3(J)) ◽

pp. 6-18

Author(s):

Volition Tlhalitshi Montshiwa ◽

Ntebogang Dinah Moroke

Keyword(s):

Regression Model ◽

Sample Size ◽

Count Data ◽

Negative Binomial ◽

Information Criterion ◽

Data Models ◽

Hurdle Model ◽

Negative Binomial Regression Model ◽

Count Data Models ◽

Over Dispersion

Abstract: Sample size requirements are common in many multivariate analysis techniques as one of the measures taken to ensure the robustness of such techniques, but such requirements have not been of interest in the area of count data models. As such, this study investigated the effect of sample size on the efficiency of six commonly used count data models, namely: Poisson regression model (PRM), Negative binomial regression model (NBRM), Zero-inflated Poisson (ZIP), Zero-inflated negative binomial (ZINB), Poisson Hurdle model (PHM) and Negative binomial hurdle model (NBHM). The data used in this study were sourced from Data First and were collected by Statistics South Africa through the Marriage and Divorce database. PRM, NBRM, ZIP, ZINB, PHM and NBHM were applied to ten randomly selected samples ranging from 4392 to 43916 and differing by 10% in size. The six models were compared using the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Vuong’s test for over-dispersion, McFadden RSQ, Mean Square Error (MSE) and Mean Absolute Deviation (MAD). The results revealed that generally, the Negative Binomial-based models outperformed Poisson-based models. However, the results did not reveal the effect of sample size variations on the efficiency of the models since there was no consistency in the change in AIC, BIC, Vuong’s test for over-dispersion, McFadden RSQ, MSE and MAD as the sample size increased.
