# Translating between the dispersion term in a negative binomial regression and random variables in SPSS

NOTE!! – when I initially posted this I was incorrect, I thought SPSS listed the dispersion term in the form of `Var(x) = mean + mean*dispersion`. But I was wrong, and it is `Var(x) = 1 + mean*dispersion` (the same as Stata’s, what Cameron and Trivedi call the NB2 model, as cited in the Long and Freese Stata book for categorical variables.) The simulation in the original post worked out because my example I used the mean as 1, here I update it to have a mean of 2 to show the calculations are correct. (Also note that this parametrization is equivalent to `Var(x) = mean*(1 + mean*dispersion)`, see Stata’s help for nbreg.)

When estimating a negative binomial regression equation in SPSS, it returns the dispersion parameter in the form of:

``Var(x) = 1 + mean*dispersion``

When generating random variables from the negative binomial distribution, SPSS does not take the parameters like this, but the more usual N trials with P successes. Stealing a bit from the R documentation for `dnbinom`, I was able to translate between the two with just a tedious set of algebra. So with our original distribution being:

``````Mean = mu
Variance = 1 + mu*a``````

R has an alternative representation closer to SPSS’s based on:

``````Mean = mu
Variance = mu + mu^2/x``````

Some tedious algebra will reveal that in this notation `x = mu^2/(1 - mu + a*mu)` (note to future self, using Solve in Wolfram Alpha could have saved some time, paper and ink). Also, R’s help for `dbinom` states that in the original N and P notation that `p = x/(x + mu)`. So here with `mu` and `a` (again `a` is the dispersion term as reported by GENLIN in SPSS) we can solve for `p`.

``````x = mu^2/(1 - mu + a*mu)
p = x/(x + mu)``````

And since `p` is solved, R lists the mean of the distribution in the N and P notation as:

``n*(1-p)/p = mu``

So with `p` solved we can figure out N as equal to:

``mu*p/(1-p) = n``

So to reiterate, if you have a mean of 2 and dispersion parameter of 4, the resultant N and P notation would be:

``````mu = 2
a = 4
x = mu^2/(1 - mu + a*mu) = 2^2/(1 - 2 + 4*2) = 4/7
p = x/(x + mu) = (4/7)/(4/7 + 2) = 2/9
n = mu*p/(1-p) = 2*(4/7)/(3/7) = 8/3``````

Here we can see that in the N and P notation the similar negative binomial model results in a fractional number of successes, which might be a surprising result for some that it is even a possibility. (There is likely an easier way to do this translation, but forgive me I am not a mathematician!)

Now we would be finished, but unfortunately SPSS’s negative binomial random functions only take integer values and do not take values of N less than 1 (R’s `dnbinom` does). So we have to do another translation of the N and P notation to the gamma distribution to be able to draw random numbers in SPSS. Another representation of the negative binomial model is a mixture of Poisson distributions, with the distribution of the mixtures being from a gamma distribution. Wikipedia lists a translation from the N and P notation to a gamma with shape = N and scale = P/(1-P).

So I wrapped these computations up in an SPSS macros that takes the mean and the dispersion parameter, calculates N and P under the hood, and then draws a random variable from the associated negative binomial distribution.

``````DEFINE !NegBinRV (mu = !TOKENS(1)
/disp = !TOKENS(1)
/out = !TOKENS(1) )
COMPUTE #x = !mu**2/(1 - !mu + !disp*!mu).
COMPUTE #p = #x / (#x + !mu).
COMPUTE #n = !mu*#p/(1 - #p).
COMPUTE #G = RV.GAMMA(#n,#p/(1 - #p)).
COMPUTE !Out = RV.POISSON(#G).
FORMATS !Out (F5.0).
!ENDDEFINE.``````

I am not sure if it is possible to use this gamma representation and native SPSS functions to calculate the corresponding CDF and PDF of the negative binomial distribution. But we can use R to do that. Here is an example of keeping the mean at 1 and varying the dispersion parameter between 0 and 5.

``````BEGIN PROGRAM R.
library(ggplot2)
x <- expand.grid(0:10,1:5)
names(x) <- c("Int","Disp")
mu <- 1
x\$PDF <- mapply(dnbinom, x=x\$Int, size=mu^2/(1 - mu + x\$Disp*mu), mu=mu)
#add in poisson
t <- data.frame(cbind(0:10,rep(0,11),dpois(0:10,lambda=1)))
names(t) <- c("Int","Disp","PDF")
x <- rbind(t,x)
p <- ggplot(data = x, aes(x = Int, y = PDF, group = as.factor(Disp))) + geom_line()
p
#for the CDF
x\$CDF <- ave(x\$PDF, x\$Disp, FUN = cumsum)
END PROGRAM.``````

Here you can see how the larger dispersion term can easily approximate the zero inflation typical in criminal justice data (see an applied example from my work). R will not take a dispersion parameter of zero in this notation (as the size would be divided by zero and not defined), so I just tacked on the Poisson distribution with a mean of zero.

Here is an example of generating random data from a negative binomial distribution with a mean of 2 and a dispersion parameter of 4. I then grab the PDF from R, and superimpose them both on a chart in SPSS (or perhaps I should call it a PMF, since it only has support on integer values). You can see the simulation with 10,000 observations is a near perfect fit (so a good sign I did not make any mistakes!)

``````*Simulation In SPSS.
INPUT PROGRAM.
LOOP Id = 1 TO 10000.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME RandNB.

!NegBinRV mu = 2 disp = 4 out = NB.

*Making seperate R dataset of PDF.
BEGIN PROGRAM R.
mu <- 2
disp <- 4
x <- 0:11
pdf <- dnbinom(x=x,size=mu^2/(1 - mu + disp*mu),mu=mu)
#add in larger than 10
pdf[max(x)+1] <- 1 - sum(pdf[-(max(x)+1)])
MyDf <- data.frame(cbind(x,pdf))
END PROGRAM.
EXECUTE.
STATS GET R FILE=* /GET DATAFRAME=MyDf DATASET=PDF_NB.
DATASET ACTIVATE PDF_NB.
FORMATS x (F2.0).
VALUE LABELS x 11 '11 or More'.

*Now superimposing bar plot and PDF from separate datasets.
DATASET ACTIVATE RandNB.
RECODE NB (11 THRU HIGHEST = 11)(ELSE = COPY) INTO NB_Cat.
FORMATS NB_Cat (F2.0).
VALUE LABELS NB_Cat 11 '11 or More'.

GGRAPH
/GRAPHDATASET NAME="Data" DATASET='RandNB' VARIABLES=NB_Cat[LEVEL=ORDINAL] COUNT()[name="COUNT"]
/GRAPHDATASET NAME="PDF" DATASET='PDF_NB' VARIABLES=x pdf
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: Data=userSource(id("Data"))
DATA: NB_Cat=col(source(Data), name("NB_Cat"), unit.category())
DATA: COUNT=col(source(Data), name("COUNT"))
SOURCE: PDF=userSource(id("PDF"))
DATA: x=col(source(PDF), name("x"), unit.category())
DATA: den=col(source(PDF), name("pdf"))
TRANS: den_per = eval(den*100)
GUIDE: axis(dim(1))
GUIDE: axis(dim(2))
SCALE: linear(dim(2), include(0))
ELEMENT: interval(position(summary.percent(NB_Cat*COUNT)), shape.interior(shape.square))
ELEMENT: point(position(x*den_per), color.interior(color.black), size(size."8"))
END GPL.``````
Advertisements

# Negative Binomial regression and predicted probabilities in SPSS

For my dissertation I have been estimating negative binomial regression models predicting the counts of crimes at small places (i.e. street segments and intersections). When evaluating the fit of poisson regression models and their variants, you typically make a line plot of the observed percent of integer values versus the predicted percent by the models. This is particularly pertinent for data that have a high proportion of zeros, as the negative binomial may still under-predict the number of zeros.

I mistakenly thought that to make such a plot you could simply estimate the predicted value following the negative binomial regression model and then round the predictions. But I was incorrect, and to make the typical predicted versus observed plot you need to estimate the probability of an observation taking an integer value, and then take the mean of that probability over all the observations. That mean will subsequently be the predicted percent given the model. Fortunately I caught my mistake before I gave some talks on my work recently, and I will show how to make said calculations in SPSS. I have posted the data to replicate this work at this dropbox link, and so you can download the data and follow along.

First, I got some help on how to estimate the predicted probabilities via an answer to my question at CrossValidated. So that question lists the formula one needs to estimate the predicted probability for any integer value N after the negative binomial model. To calculate that value though we need to make some special SPSS functions, the factorial and the complete gamma function. Both have SPSS tech help pages showing how to calculate them.

For the factorial we can use a general relationship with the `LNGAMMA` function.

``````
DEFINE !FACT (!POSITIONAL = !ENCLOSE("(",")"))
( EXP(LNGAMMA((!1)+1)) )
!ENDDEFINE.
``````

And for the complete gamma function we can use a relationship to the CDF of the gamma function.

``````
DEFINE !GAMMAF (!POSITIONAL = !ENCLOSE("(",")"))
( EXP(-1)/(!1)/(CDF.GAMMA(1,(!1),1) - CDF.GAMMA(1,(!1)+1,1)) )
!ENDDEFINE.
``````

And given these two functions, we can create a macro that takes as parameters and returns the predicted probability we are interested in:

• `out` – new variable name for predicted probability of taking on that integer value
• `PredN` – the predicted mean of the variable conditional on the covariates
• `Disp` – estimate of the dispersion parameter
• `Int` – the integer value being predicted
``````
DEFINE !PredNB (Out = !TOKENS(1)
/PredN = !TOKENS(1)
/Disp = !TOKENS(1)
/Int = !TOKENS(1) )
COMPUTE #a = (!Disp)**(-1).
COMPUTE #mu = !PredN.
COMPUTE #Y = !Int.
COMPUTE #1 = (!GAMMAF(#Y + #a))/(!FACT(#Y)*!GAMMAF(#a)).
COMPUTE #2 = (#a/(#a+#mu))**#a.
COMPUTE #3 =  (#mu/(#a + #mu))**#Y.
COMPUTE !Out =  #1*#2*#3.
!ENDDEFINE.
``````

But to make our plot we want to estimate this predicted probability over a range of values, so I created a helper macro that instead of taking only one integer value, takes the end integer value and will calculate the predicted probability of zero through N.

``````
DEFINE !PredNBRange (Num = !TOKENS(1)
/Mean = !TOKENS(1)
/Disp = !TOKENS(1)
/Stub = !TOKENS(1) )
!DO !I = 0 !TO !Num
!LET !Base = !CONCAT(!Stub,!I)
!PredNB Out = !Base PredN = !Mean Disp = !Disp Int = !I.
!DOEND
!ENDDEFINE.
``````

The example data and code I have posted compares these values to the ones predicted from Stata, and shows my function agrees with Stata to about 7 decimal points. I won’t go through all of those commands here, but I will show how to make the predicted proportions plot after you have a vector of predicted probabilities (you can download all of the code and data and the link I reference prior in the post).

So lets say that you have a vector `NB0 TO NB8`, and these are the predicted probabilities of integer values 0 to 8 for the observations in your dataset. To subsequently get the mean of the predictions, you can use the `AGGREGATE` command. Having no variables specified on the `BREAK` subcommand tells SPSS to aggregate over all values in the dataset. Here I export the file to a new dataset named `PredNBAgg`.

``````
DATASET DECLARE PredNBAgg.
AGGREGATE OUTFILE='PredNBAgg'
/BREAK =
/NB0 TO NB8 = MEAN(NB0 TO NB8).
``````

Now to merge later on to the observed proportions, I will reshape the dataset so the mean values are all in the same column using `VARSTOCASES`. Here I also make a category for the predicted probability of being 9 or higher (which isn’t typical for these types of plots, but something I believe is useful).

``````
DATASET ACTIVATE PredNBAgg.
COMPUTE NB9_Plus = 1 - SUM(NB0 TO NB8).
VARSTOCASES /MAKE NBPred FROM NB0 TO NB9_Plus /INDEX Int.
COMPUTE Int = Int - 1. /*Index starts at 1 instead of 0 */.
``````

Now I reactivate my original dataset, here named `PredNegBin`, calculate the binned observed values (with observations 9 and larger recoded to just 9) and then aggregate those values.

``````
DATASET ACTIVATE PredNegBin.
RECODE TotalCrime (9 THRU HIGHEST = 9)(ELSE = COPY) INTO Int.
DATASET DECLARE PredObsAgg.
AGGREGATE OUTFILE='PredObsAgg'
/BREAK = Int
/TotalObs = N.
``````

To get the predicted proportions within each category, I need to do another aggregation to get the total number of observations, and then divide the totals of each integer value with the total number of observations.

``````
DATASET ACTIVATE PredObsAgg.
AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES
/BREAK =
/TotalN=SUM(TotalObs).
COMPUTE PercObs = TotalObs / TotalN.
``````

Now we can go ahead and merge the two aggregated datasets together. I also go ahead and close the old `PredNBAgg` dataset and define a value label so I know that the 9 integer category is really 9 and larger.

``````
MATCH FILES FILE = *
/FILE = 'PredNBAgg'
/BY Int.
DATASET CLOSE PredNBAgg.
VALUE LABELS Int 9 '9+'.
``````

Now at this point you could make the plot with the predicted and observed proportions in seperate variables, but this would take two `ELEMENT` statements within a `GGRAPH` command (and I like to make line plots with both the lines and points, so it would actually take 4 `ELEMENT` statements). So what I do here is reshape the data one more time with `VARSTOCASES`, and make a categorical variable to identify if the proportion is the observed value or the predicted value from the model. Then you can make your chart.

``````
VARSTOCASES /MAKE Dens FROM PercObs NBPred /Index Type /DROP TotalObs TotalN.
VALUE LABELS Type
1 'Observed'
2 'Predicted'.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Int Dens Type
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Int=col(source(s), name("Int"), unit.category())
DATA: Type=col(source(s), name("Type"), unit.category())
DATA: Dens=col(source(s), name("Dens"))
GUIDE: axis(dim(1), label("Total Crimes on Street Units"))
GUIDE: axis(dim(2), label("Percent of Streets"))
GUIDE: legend(aesthetic(aesthetic.color.interior), null())
SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color.black),("2",color.red)))
ELEMENT: line(position(Int*Dens), color.interior(Type))
ELEMENT: point(position(Int*Dens), color.interior(Type), color.exterior(color.white), size(size."7"))
END GPL.
``````

And voila, here you can see the predicted values are so close to the observed that it is difficult to even see the observed values. Here instead of creating a legend I manually added labels to the chart. A better chart may be to subtract the observed from predicted (especially if you were comparing multiple poisson models), but it should be quite plain to see that the negative binomial fits quite well to the observed data in this instance.

Similar to Paul Allison’s experience, even with nearly 64% of the observations being zero, the negative binomial model fits just fine. I recently fit some other models with the same data (but a different outcome) in which the number of zeros were nearer to 90%. In that instance the negative binomial model would not converge, so estimating a zero inflated model was necessary. Here though it is clearly not necessary, and I would prefer the negative binomial model over a zip (or hurdle) as I see no obvious reason why I would prefer the complications of the different predicted zero equation in addition to the count equation.

• ## Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Join 115 other followers

• ## Follow me on Twitter

• Advertisements