Count data sometimes exhibit a greater proportion of zero counts than is consistent with the data having been generated by a simple Poisson or negative binomial process. For example, a preponderance of zero counts has been observed in data that record the number of automobile accidents per driver, the number of criminal acts per person, the number of derogatory credit reports per person, the number of incidences of a rare disease in a population, and the number of defects in a manufacturing process, to name just a few.
Failure to properly account for the excess zeros constitutes a model misspecification that can result in biased or inconsistent estimators. Zero-inflated count models provide one method to explain the excess zeros by modeling the data as a mixture of two separate distributions: one distribution is typically a Poisson or negative binomial distribution that can generate both zero and nonzero counts, and the second distribution is a constant distribution that generates only zero counts.
When the underlying count distribution is a Poisson distribution, the mixture is called a zero-inflated Poisson (ZIP) distribution; when the underlying count distribution is a negative binomial distribution, the mixture is called a zero-inflated negative binomial (ZINB) distribution. Count data that have an incidence of zeros greater than expected for the underlying probability distribution can be modeled with a zero-inflated distribution.
The population is considered to consist of two subpopulations. Observations drawn from the first subpopulation are realizations of a random variable that typically has either a Poisson or negative binomial distribution, which can generate both zero and nonzero counts. Observations drawn from the second subpopulation always provide a zero count. Suppose the mean of the underlying Poisson or negative binomial distribution is λ and the probability of an observation being drawn from the constant distribution that always generates zeros is ω. The parameter ω is often called the zero-inflation probability. The parameters λ and ω can be modeled as functions of linear predictors; the log link function is typically used for λ, and a logit link for ω. The excess zeros are a form of overdispersion. Fitting a zero-inflated Poisson model can account for the excess zeros, but there are also other sources of overdispersion that must be considered.
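To make the mixture concrete, here is a minimal sketch of the ZIP probability mass function in Python. The symbols lam and omega stand for the Poisson mean and the zero-inflation probability; the numbers used are illustrative, not estimates from any data set.

```python
import math

def zip_pmf(y, lam, omega):
    """Zero-inflated Poisson probability mass function.

    lam   -- mean of the Poisson component
    omega -- probability of the degenerate always-zero component
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # Zeros arise from both components of the mixture.
        return omega + (1.0 - omega) * poisson
    # Nonzero counts can come only from the Poisson component.
    return (1.0 - omega) * poisson

# With lam = 2 and omega = 0.3, the zero probability is inflated well above
# the plain Poisson value exp(-2), which is roughly 0.135.
p0 = zip_pmf(0, 2.0, 0.3)
```

Any fitting routine for the ZIP model is ultimately maximizing a likelihood built from exactly these two branches.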
If there are sources of overdispersion that cannot be attributed to the excess zeros, failure to account for them constitutes a model misspecification, which results in biased standard errors.
The Poisson model assumes that the variance of the response equals its mean; if this assumption is invalid, the data exhibit overdispersion or underdispersion. A useful diagnostic tool that can aid you in detecting overdispersion is the Pearson chi-square statistic. Under certain regularity conditions, this statistic has a limiting chi-square distribution, with degrees of freedom equal to the number of observations minus the number of parameters estimated.
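As an illustration of this diagnostic, the following Python sketch computes the Pearson statistic and a dispersion ratio. The counts, fitted means, and parameter count are all made up for the example.

```python
def pearson_dispersion(observed, fitted):
    """Pearson chi-square statistic for a Poisson fit.

    Under the Poisson assumption the variance equals the mean, so each
    term is (y - mu)^2 / mu.  Dividing the statistic by its degrees of
    freedom gives a dispersion ratio; values far above 1 suggest
    overdispersion.
    """
    return sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))

# Hypothetical counts and fitted means; the "2 estimated parameters" is
# likewise hypothetical.
observed = [0, 0, 1, 5, 0, 9, 2, 0, 7, 0]
fitted = [2.4] * len(observed)
ratio = pearson_dispersion(observed, fitted) / (len(observed) - 2)
```

A ratio near 1 is consistent with the Poisson assumption; in this made-up example it is well above 1.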
Comparing the computed Pearson chi-square statistic to an appropriate quantile of a chi-square distribution with that many degrees of freedom constitutes a test for overdispersion. If overdispersion is detected, the ZINB model often provides an adequate alternative. The probability distribution of a zero-inflated negative binomial random variable Y with mean μ, dispersion parameter k, and zero-inflation probability ω is

  P(Y = 0) = ω + (1 − ω)(1 + kμ)^(−1/k)
  P(Y = y) = (1 − ω) [Γ(y + 1/k) / (y! Γ(1/k))] (kμ)^y (1 + kμ)^(−y − 1/k),  for y = 1, 2, …
Because the ZINB model assumes a negative binomial distribution for the first component of the mixture, it has a more flexible variance function.
Thus it provides a means to account for overdispersion that is not due to the excess zeros. However, the negative binomial, and thus the ZINB model, achieves this additional flexibility at the cost of an additional parameter.
Thus, if you fit a ZINB model when there is no overdispersion, the parameter estimates are less efficient compared to the more parsimonious ZIP model. If the ZINB model does not fully account for the overdispersion, more flexible mixture models can be considered. Consider a horticultural experiment to study the number of roots produced by a certain species of apple tree.
The objective is to assess the effect of both the photoperiod and the concentration levels of BAP on the number of roots produced. The analysis begins with a graphical inspection of the data. When the dependent variable is a non-negative count variable, standard OLS regression is no longer valid.
Typically, Poisson regression or some variation of it is used to analyze such count data. Poisson regression fits models of the number of occurrences (counts) of an event, where the number of occurrences is assumed to follow a Poisson distribution. The Poisson distribution has been applied to diverse events under the following basic assumptions: events occur independently, the probability of an event is proportional to the length of the exposure interval, and the probability of two or more events in a vanishingly small interval is negligible.
With these assumptions, to find the probability of k events in an exposure of size E, one divides E into n subintervals and approximates the answer as the binomial probability of observing k successes in n trials. Letting n grow arbitrarily large yields the Poisson distribution. We want to understand how the number of child deaths changes with the age of the children.
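The binomial-to-Poisson limit is easy to verify numerically. A small Python sketch, where lam and k are arbitrary illustrative values:

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Probability of k = 2 events with exposure-wide rate lam = 3, approximated
# by n Bernoulli subintervals each with success probability lam / n.
lam, k = 3.0, 2
approx_coarse = binom_pmf(k, 10, lam / 10)
approx_fine = binom_pmf(k, 100000, lam / 100000)
exact = poisson_pmf(k, lam)
```

As n grows, the binomial approximation converges to the Poisson value.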
Before we run a Poisson regression, generate logexposure as the natural log of exposure. Often in Poisson regressions, one is interested in comparing incidence rates. This is easily done by calculating incidence-rate ratios (IRR). The option irr tells Stata to report the incidence-rate ratios exp(b1) and exp(b2) instead of the coefficients b1 and b2. Similar to a Poisson regression, in a negative binomial regression the dependent count variable is believed to be generated by a Poisson-like process, except that the variation is greater than that of a true Poisson.
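Since an incidence-rate ratio is just the exponentiated coefficient, the transformation can be sketched as follows. The coefficient 0.30 and standard error 0.05 are hypothetical values, not estimates from the data discussed here.

```python
import math

def irr_with_ci(b, se, z=1.96):
    """Convert a Poisson regression coefficient and its standard error
    into an incidence-rate ratio with a 95% confidence interval.

    Exponentiating the endpoints of the coefficient's normal-theory
    interval gives the interval for the ratio.
    """
    return math.exp(b), (math.exp(b - z * se), math.exp(b + z * se))

# Hypothetical coefficient: each one-unit increase in the predictor
# multiplies the incidence rate by exp(0.30), roughly 1.35.
irr, (lo, hi) = irr_with_ci(0.30, 0.05)
```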
This extra variation is referred to as overdispersion, which may arise due to an omitted explanatory variable. We refer to v as the overdispersion parameter: the larger v is, the greater the overdispersion. Let's return to the above example. Suppose we think in our example that the overdispersion varies across cohorts. If a test for overdispersion rejects the Poisson model, use the negative binomial; else stick with the Poisson model.
In this case, the p-value is very small, so the Poisson model is inappropriate for our example. Sometimes the count of zeros in a sample is much larger than the count of any other value; in other words, the number of zeros is inflated. In that case, instead of using the ordinary negative binomial or Poisson regression, one should run the zero-inflated negative binomial model.
Obviously, how much zero-inflation is enough to call for the choice of this zero-inflated model is a matter of modeling preference, which can be resolved by the statistical tests discussed below. For the analysis of count data, many statistical software packages now offer zero-inflated Poisson and zero-inflated negative binomial regression models.
But are such models really needed? Maybe not. In most count data sets, the conditional variance is greater than the conditional mean, often much greater, a phenomenon known as overdispersion. The zero-inflated Poisson (ZIP) model is one way to allow for overdispersion. The model posits two latent groups: one whose counts are always zero and one whose counts follow a Poisson distribution, so observed values of 0 could come from either group. Although not essential, the model is typically elaborated to include a logistic regression predicting which group an individual belongs to. In cases of overdispersion, the ZIP model typically fits better than a standard Poisson model.
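The overdispersion built into the ZIP mixture can be read off its first two moments: if pi is the probability of the always-zero group and lam the Poisson mean, the mean is (1 - pi)*lam while the variance is (1 - pi)*lam*(1 + pi*lam). A quick sketch with illustrative values:

```python
def zip_mean_var(lam, pi):
    """Mean and variance of a zero-inflated Poisson mixture.

    The variance exceeds the mean by the factor (1 + pi * lam), which is
    exactly the extra dispersion the ZIP model allows for.
    """
    mean = (1.0 - pi) * lam
    var = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, var

# Illustrative values: pi = 0.25, lam = 4 gives a variance/mean ratio of 2.
m, v = zip_mean_var(4.0, 0.25)
```

Whenever pi is greater than zero, the variance strictly exceeds the mean, which is why a ZIP fit can absorb overdispersion that a plain Poisson cannot.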
But what about the zero-inflated negative binomial ZINB model? Of course, there are certainly situations where a zero-inflated model makes sense from the point of view of theory or common sense.
For example, if the dependent variable is the number of children ever born to a sample of year-old women, it is reasonable to suppose that some women are biologically sterile. For these women, no variation in the predictor variables (whatever they might be) could change the expected number of children.
A simple reparameterization of the ZINB model allows for such a restriction. So a likelihood ratio test is appropriate, although the chi-square distribution may need some adjustment because the restriction is on the boundary of the parameter space.
The zero inflation model is a latent class model.
It is proposed in a specific situation: when there are two kinds of zeros in the observed data. It is a two-part model that has a specific behavioral interpretation, and one that is not particularly complicated, by the way. The preceding discussion is not about the model; it is about curve fitting. If you use the model to predict the outcome variable and then compare these predictions to the actual data, the ZINB model will fit so much better there will be no comparison.
These models have existed for years as supported procedures in these programs. There is nothing difficult about fitting them. As for difficulty in interpreting the model, the ZINB model, as a two-part model, makes a great deal of sense. It is hard to see why it should be difficult to interpret.
The reparameterization merely inflates the zero probability. But it loses the two-part interpretation: the reparameterized model is not a zero-inflated model in the latent class sense in which it is defined. The so-called reparameterized model is no longer a latent class model.
It is true that the NB model can be tested as a restriction on the proposed model. We use data from Long on the number of publications produced by Ph.D. biochemists. The data are over-dispersed, but of course we haven't considered any covariates yet.
A Poisson Model. Let us fit the model used by Long and Freese, a simple additive model using all five predictors. We could use poisson to obtain the estimates and then estat gof to get the deviance, but we will instead use the glm command to obtain both the deviance and Pearson's chi-squared statistics immediately.
We will also store the estimates for later use. The Pearson statistic exceeds the five-percent critical value of a chi-squared distribution with the residual degrees of freedom, indicating lack of fit.
This means that we should adjust the standard errors, multiplying them by the square root of the ratio of the Pearson chi-squared statistic to its degrees of freedom. The glm command can do this for us via the scale option, which takes as its argument either a numeric value or x2, which uses the Pearson-based estimate of the dispersion.
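The adjustment itself is a one-liner. In the sketch below the statistic, degrees of freedom, and standard error are placeholders, not the values from the fitted model:

```python
import math

# Hypothetical Pearson chi-squared statistic and residual degrees of
# freedom from a Poisson fit; both numbers are made up for illustration.
pearson_x2 = 1662.55
resid_df = 909

# Quasi-Poisson scale factor: sqrt of the dispersion ratio.
scale = math.sqrt(pearson_x2 / resid_df)

se_model = 0.0261                # hypothetical model-based standard error
se_scaled = se_model * scale     # overdispersion-adjusted standard error
```

Multiplying every standard error by the same factor is exactly what attributing all lack of fit to pure error amounts to.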
Using this procedure, we have essentially attributed all the lack of fit to pure error. You may want to try poisson with the robust option to compute standard errors using the robust or 'sandwich' estimator. You will get very similar results.
In either case, all tests have to be done using Wald statistics. Likelihood ratio tests are not possible because we are not making full distributional assumptions about the outcome, relying instead on assumptions about the mean and variance. Negative Binomial Regression. We now fit a negative binomial model with the same predictors.
The model includes an overdispersion parameter that is estimated from the data. To test the significance of this parameter, you may think of computing twice the difference in log-likelihoods between this model and the Poisson model. The usual asymptotics do not apply, however, because the null hypothesis is on a boundary of the parameter space. There is some work showing that a better approximation is to treat the statistic as a 50:50 mixture of a point mass at zero and a chi-squared with one d.f.
Alternatively, treating the statistic as a chi-squared with one d.f. gives a conservative test. Either way, we have overwhelming evidence of overdispersion. For testing hypotheses about the regression coefficients, we can use either Wald tests or likelihood ratio tests, which are possible because we have made full distributional assumptions.
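The boundary-adjusted p-value is easy to compute by halving the usual chi-squared(1) tail probability, using the identity P(chi2_1 > x) = erfc(sqrt(x/2)). A sketch, where the statistic 3.84 is illustrative rather than taken from the fit:

```python
import math

def boundary_lr_pvalue(lr_stat):
    """p-value for an LR test of a variance parameter on the boundary
    (e.g. Poisson vs. negative binomial).

    Under the null the statistic is asymptotically a 50:50 mixture of a
    point mass at zero and a chi-squared with 1 d.f., so the p-value is
    half the usual chi-squared(1) tail probability.
    """
    if lr_stat <= 0:
        return 1.0
    # P(chi2 with 1 d.f. > x) = erfc(sqrt(x / 2))
    return 0.5 * math.erfc(math.sqrt(lr_stat / 2.0))

# At the usual 5% chi-squared(1) borderline, the adjusted p-value is
# roughly half of 0.05.
p = boundary_lr_pvalue(3.84)
```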
Cross Validated is a question and answer site for people interested in statistics, machine learning, data analysis, data mining, and data visualization. I have a dataset which seems to have a lot of zeroes. I have already fit a Poisson regression model as well as a negative binomial model. I would like to fit zero-inflated and hurdle models as well. Before I do, I would like to run a test to investigate whether my data really are zero-inflated.
I think there are different ways to do this. It would look like this in R: For more information, see this ever-useful tutorial. In this case, there are two processes at work.
My understanding is that a zero-inflated model is only appropriate when there is an alternate process that generates only zeros. For example, if you are attempting to estimate the number of widgets different stores sell, but some stores do not have widgets for sale, then it seems like two processes are at work: one that generates only zeros (stores that cannot sell widgets because they never stock them) and another that generates varying counts (stores that do stock widgets and therefore can sell some).
Rather than having a "test" to determine whether the data are zero-inflated, I would suggest determining whether it is plausible that there are two processes at work: one that generates only zeros and another that generates non-zero counts. If it seems reasonable given the context of your data, then use a zero-inflated model.
If it doesn't seem reasonable given the context of your data, then a zero-inflated model is probably inappropriate even though it may appear to fit your data better.
It might not be clear from what I've written above, but I want to articulate the fact that both processes can generate zeros. One process generates only zeros and the other process can generate different values which may be zero. For example, a store can stock widgets and happen to sell zero widgets. This is different from a store that does not stock widgets and therefore must sell zero widgets by default.
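One informal way to gauge this (not a formal test) is to compare the observed number of zeros with the number a fitted Poisson model would predict. A Python sketch with made-up counts and fitted means:

```python
import math

def zero_excess(counts, fitted_means):
    """Compare observed zeros with the number expected under a Poisson fit.

    Under a Poisson model, observation i is zero with probability
    exp(-mu_i), so summing those probabilities over all observations gives
    the expected number of zeros.
    """
    observed_zeros = sum(1 for y in counts if y == 0)
    expected_zeros = sum(math.exp(-mu) for mu in fitted_means)
    return observed_zeros, expected_zeros

# Hypothetical data: more zeros than a Poisson with these means predicts.
counts = [0, 0, 0, 0, 0, 0, 3, 1, 4, 2]
means = [1.0] * 10
obs, exp_ = zero_excess(counts, means)
```

A large gap between the observed and expected zero counts is the symptom that motivates zero-inflated or hurdle models; whether those zeros come from a distinct process is the substantive question discussed above.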
How to test for Zero-Inflation in a dataset?
Stata's zioprobit command fits zero-inflated ordered probit (ZIOP) models. ZIOP models are used for ordered response variables, such as (1) fully ambulatory, (2) ambulatory with restrictions, and (3) partially ambulatory, when the data exhibit a high fraction of observations at the lowest end of the ordering. It's called zero-inflated because the idea started with Poisson regression, and it was the lower-end zeros that were overly prevalent. Given the category values we just used, Stata's zioprobit command could fit 1-inflated models.
Or we could have numbered the categories 0, 1, and 2, and fit a 0-inflated model. The results would be the same either way. Standard ordered probit models cannot account for the preponderance of zero observations when the zeros relate to an extra, distinct source.
Many of the individuals in the first category will be nonsmokers who have never smoked and will never smoke. The rest of them will be ex-smokers. Think of the standard ordered probit model as fitting the behavior of smokers, including ex-smokers. The zero inflation arises because the first group now includes those who have never smoked.
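Under this setup, the probability of the lowest category mixes the two groups. A sketch of that mixture in Python (the linear predictors and cutpoint are generic symbols, not Stata output; the participation part uses a probit link, as zioprobit does):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ziop_lowest_prob(xb_participate, xb_outcome, cut1):
    """Probability of the lowest category in a zero-inflated ordered probit.

    The lowest category mixes two sources: nonparticipants (here, the
    never-smokers), with probability 1 - Phi(xb_participate), and
    participants whose latent outcome falls below the first cutpoint.
    All three arguments are generic illustrative values.
    """
    p_participate = normal_cdf(xb_participate)
    p_below_cut1 = normal_cdf(cut1 - xb_outcome)
    return (1.0 - p_participate) + p_participate * p_below_cut1

# With both linear predictors and the cutpoint at zero, half the population
# participates and half of the participants fall below the first cutpoint,
# so the lowest-category probability is 0.5 + 0.25.
p_lowest = ziop_lowest_prob(0.0, 0.0, 0.0)
```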
We have fictional data on the smoking study just described. The outcome variable is called tobacco and contains the ordered smoking categories. We want to fit a model in which smoking by those who have ever smoked is modeled by income, gender, and age. And membership in the never-smoked group is determined by income, gender, age, whether parents smoked, and religion.
The standard ordered probit parameters, coefficients and cutpoints, are displayed in the first and last parts of the output, respectively. The middle part of the output reports the probit coefficients for the inflation.
Coefficients can be difficult to interpret. For instance, what does a particular coefficient on parental smoking mean in practice? The predict pnpar option is unique to margins when used after zioprobit. We asked margins to calculate predictions of the probability of nonparticipation, which in this example means the probability of being a never-smoker.
You can also fit Bayesian zero-inflated ordered probit models using the bayes prefix. Read more about zero-inflated ordered probit in the Stata Base Reference Manual. Stata: Data Analysis and Statistical Software.
Zero-inflated ordered probit highlights: ordinal outcome; zero inflation, with zero observations generated by two distinct processes; robust, cluster-robust, and bootstrap standard errors; support for complex survey designs; prediction of marginal, joint, and conditional probabilities of levels; prediction of the probability of participation and nonparticipation; and support for Bayesian estimation.

Our original plan was to write a second edition of the book.
After writing one page, we immediately decided that we had to write a completely new book. Not that we were unhappy with our book: on the contrary.
This book contains only half a page of text that overlaps with the book. Everything else is new. The minimum prerequisite for Beginner's Guide to Zero-Inflated Models with R is knowledge of multiple linear regression.
In Chapter 2 we start with brief explanations of the Poisson, negative binomial, Bernoulli, binomial and gamma distributions. Chapters 4 and 5 contain detailed case studies using count data of orange-crowned warblers and sharks.
Just like all other chapters, these case studies are based on real datasets used in scientific papers. In Chapter 6 we use zero-altered Poisson (ZAP) models to deal with the excessive number of zeros in count data. In Chapter 7 we analyse continuous data with a large number of zeros. Biomass of Chinese tallow trees is analysed with zero-altered gamma (ZAG) models. In Chapter 8, which begins the second part of the book, we explain how to deal with dependency. Mixed models are introduced, using beaver and monkey datasets.
In Chapter 9 we encounter a rather complicated dataset in terms of dependency. Reproductive indices are sampled from plants. But the seeds come from the same source and are planted in the same bed in the same garden. We apply models that take care of the excessive number of zeros in count data, crossed random effects and nested random effects.
Up to this point we have done everything in a frequentist setting, but at this stage of the book you will see that we are reaching the limit of what we can achieve with the frequentist software. For this reason we switch to Bayesian techniques in the third part of the book. The chapter also contains exercises and a video solution file for one of the exercises. This means that you can see our screen and listen to our voices just in case you have trouble falling asleep at night.
A large number of students reviewed this chapter and we incorporated their suggestions for improvement, so Chapter 10 should be very easy to understand. We do the same for mixed models in Chapter