# Pooling multiple outcomes into one regression equation

Something that came up for many of my students this last semester in my Seminar in Research class is that many were interested in multiple outcomes. The most common one is examining different types of delinquency for juveniles (often via surveys), but it comes up in quite a few other designs as well (e.g. different crime outcomes for spatial research, different measures of perceptions towards police, different measures of fear of crime, etc.).

Most of the time students default to estimating separate equations for each of these outcomes, but in most circumstances I was telling the students they should pool these outcomes into one model. I think that is the better default for the majority of situations. So say we have a situation with two outcomes, violent crimes and property crimes, and we have one independent variable we are interested in, say whether an individual was subjected to a particular treatment. We might then estimate two separate equations:

``````E[# Violent Crimes]  = B0v + B1v*(Treatment)

E[# Property Crimes] = B0p + B1p*(Treatment)``````

By saying that I think by default we should think about pooling is basically saying that `B1v` is going to be close to equal to `B1p` in the two equations. Pooling the models together both lets us test that assertion, as well as get a better estimate of the overall treatment effect. So to pool the models we would stack the outcomes together, and then estimate something like:

``E[# Crimes (by type)] = B0 + B1*(Treatment) + B2*(Outcome = Violent) + B3(Treatment*Outcome = Violent)``

Here the B3 coefficient tests whether the treatment effect is different for the violent crime outcome as opposed to the property crime, and the dummy variable B2 effect controls for any differences in the levels of the two overall (that is, you would expect violent incidents to be less common than property crime incidents).

Because you will have multiple measures per individual, you can correct for that (by clustering the standard errors). But in the case you have many outcomes you might also want to consider a multi-level model, and actually estimate random effects for individuals and outcomes. So say instead of just violent and property crimes, but had a survey listing for 20 different types of delinquency. In that case you might want to do a model that looks like:

``Prob(Delinquency_ij) = f[B0 + B1*(Treatment_j) + d_j + g_i]``

Where one is estimating a multi-level logistic regression equation for delinquency type i within individual j, and the g_i and d_j are the random effects for delinquency types and individuals respectively. In the case you do not have many outcomes (say only 10), the random effect distribution might be hard to estimate. In that case I would just use fixed effects for the outcome dummy variables. But I can imagine the random effects for persons are of interest in many different study designs. And this way you get one model — instead of having to interpret 20+ models.

Also you can still estimate differential treatment effects across the different items if you want to, such as by looking at the interaction of the outcome types and the treatment. But in most cases in criminology I have come across treatments are general. That is, we would expect them to decrease/increase all crime types, not just some specific violent or property crime types. So to default pooling the treatment effect estimate makes sense.

To go a bit farther — juvenile delinquency is not my bag, but offhand I don’t understand why those who examine surveys of delinquency items use that multi-level model more often. Often times people aggregate the measures altogether into one overall scale, such as saying someone checked yes to 2 out of 10 violent crime outcomes, and checked yes to 5 out of 10 property crime outcomes. Analyzing those aggregated outcomes is another type of pooling, but one I don’t think is appropriate, mainly because it ignores the overall prevalence for the different items. For example, you might have an item such as "steal a car", and another that is "steal a candy bar". The latter is much more serious and subsequently less likely to occur. Going with my prior examples, pooling items together like this would force the random effects for the individual delinquency types, g_i, to all equal zero. Just looking at the data one can obviously tell that is not a good assumption.

Here I will provide an example via simulation to demonstrate this in Stata. First I generate an example dataset that has 1,000 individuals and 20 yes/no outcomes. They way the data are simulated is that each individual has a specific amount of `self_control` that decreases the probability of an outcome (with a coefficient of -0.5), they are nested within a particular group (imagine a different school) that affect whether the outcome occurs or not. In addition to this, each individual has a random intercept (drawn from a normal distribution), and each question has a fixed different prevalence.

``````*Stata simulation
clear
set more off
set seed 10
set obs 1000
generate caseid = _n
generate group = ceil(caseid/100)
generate self_control = rnormal(0,1)
generate rand_int = rnormal(0,1)

*generating 20 outcomes that just have a varying intercept for each
forval i = 1/20 {
generate logit_`i' = -0.4 -0.5*self_control -0.1*group + 0.1*(`i'-10) + rand_int
generate prob_`i' = 1/(1 + exp(-1*logit_`i'))
generate outcome_`i' = rbinomial(1,prob_`i')
}
drop logit_* prob_* rand_int
summarize prob_*``````

And here is that final output:

``````. summarize prob_*

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
prob_1 |      1,000    .1744795    .1516094   .0031003   .9194385
prob_2 |      1,000    .1868849     .157952   .0034252   .9265418
prob_3 |      1,000     .199886    .1642414   .0037841   .9330643
prob_4 |      1,000      .21348    .1704442   .0041804   .9390459
prob_5 |      1,000    .2276601    .1765258    .004618   .9445246
-------------+---------------------------------------------------------
prob_6 |      1,000     .242416    .1824513   .0051012   .9495374
prob_7 |      1,000    .2577337    .1881855   .0056347   .9541193
prob_8 |      1,000    .2735951    .1936933   .0062236   .9583033
prob_9 |      1,000     .289978    .1989401   .0068736    .962121
prob_10 |      1,000    .3068564    .2038919    .007591   .9656016
-------------+---------------------------------------------------------
prob_11 |      1,000    .3242004    .2085164   .0083827   .9687729
prob_12 |      1,000    .3419763    .2127823   .0092562   .9716603
prob_13 |      1,000    .3601469    .2166605   .0102197   .9742879
prob_14 |      1,000    .3786715    .2201237   .0112824   .9766776
prob_15 |      1,000    .3975066    .2231474   .0124542   .9788501
-------------+---------------------------------------------------------
prob_16 |      1,000    .4166057    .2257093    .013746   .9808242
prob_17 |      1,000    .4359203    .2277906   .0151697   .9826173
prob_18 |      1,000       .4554    .2293751   .0167384   .9842454
prob_19 |      1,000     .474993    .2304504   .0184663   .9857233
prob_20 |      1,000    .4946465    .2310073   .0203689   .9870643``````

You can see from this list that each `prob*` variable then has a different overall prevalence, from around 17% for prob_1, climbing to around 50% for prob_20.

Now if you wanted to pool the items into one overall delinquency scale, you might estimate a binomial regression model (note this is not a negative binomial model!) like below (see Britt et al., 2017 for discussion).

``````*first I will show the binomial model in Britt
egen delin_total = rowtotal(outcome_*)
*Model 1
glm delin_total self_control i.group, family(binomial 20) link(logit)``````

Which shows for the results (note that the effect of self-control is too small, it should be around -0.5):

``````. glm delin_total self_control i.group, family(binomial 20) link(logit)

Iteration 0:   log likelihood =  -3536.491
Iteration 1:   log likelihood = -3502.3107
Iteration 2:   log likelihood = -3502.2502
Iteration 3:   log likelihood = -3502.2502

Generalized linear models                         No. of obs      =      1,000
Optimization     : ML                             Residual df     =        989
Scale parameter =          1
Deviance         =  4072.410767                   (1/df) Deviance =   4.117706
Pearson          =  3825.491931                   (1/df) Pearson  =    3.86804

Variance function: V(u) = u*(1-u/20)              [Binomial]
Link function    : g(u) = ln(u/(20-u))            [Logit]

AIC             =     7.0265
Log likelihood   = -3502.250161                   BIC             =  -2759.359

------------------------------------------------------------------------------
|                 OIM
delin_total |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
|
group |
2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
|
_cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623
------------------------------------------------------------------------------``````

One of the things I wish the Britt paper mentioned was that the above binomial model is equivalent to the a logistic regression model on the individual outcomes — but one that forces the predictions for each item to be the same across a person. So if you reshape the data from wide to long you can estimate that same binomial model as a logistic regression on the 0/1 outcomes.

``````*reshape wide to long
reshape long outcome_, i(caseid) j(question)
*see each person now has 20 questions each
*tab caseid

*regression model with the individual level data, should be equivalent to the aggregate binomial model
*Model 2
glm outcome_ self_control i.group, family(binomial) link(logit)``````

And here are the results:

``````. glm outcome_ self_control i.group, family(binomial) link(logit)

Iteration 0:   log likelihood = -12204.638
Iteration 1:   log likelihood = -12188.762
Iteration 2:   log likelihood = -12188.755
Iteration 3:   log likelihood = -12188.755

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,989
Scale parameter =          1
Deviance         =  24377.50934                   (1/df) Deviance =   1.219546
Pearson          =  19949.19243                   (1/df) Pearson  =   .9980085

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

AIC             =   1.219975
Log likelihood   = -12188.75467                   BIC             =  -173583.3

------------------------------------------------------------------------------
|                 OIM
outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
|
group |
2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
|
_cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623
------------------------------------------------------------------------------``````

So you can see that Model 1 and Model 2 are exactly the same (in terms of estimates for the regression coefficients).

Model 2 though should show the limitations of using the binomial model — it predicts the same probability for each delinquency item, even though prob_1 is less likely to occur than prob_20. So for example, if we generate the predictions of this model, we can see that each question has the same predicted value.

``````predict prob_mod2, mu
sort question
by question: summarize outcome_ prob_mod2``````

And here are the results for the first four questions:

``````.     by question: summarize outcome_ prob_mod2

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 1

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
outcome_ |      1,000        .183      .38686          0          1
prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 2

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
outcome_ |      1,000        .205    .4039036          0          1
prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 3

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
outcome_ |      1,000        .208    .4060799          0          1
prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 4

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
outcome_ |      1,000        .202    .4016931          0          1
prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998``````

By construction, the binomial model on the aggregated totals is a bad fit to the data. It predicts that each question should have a probability of around 32% of occurring. Although you can’t fit the zero-inflated model discussed by Britt via the individual level logit approach (that I am aware of), that approach has the same limitation as the generic binomial model approach. Modeling the individual items just makes more sense when you have the individual items. It is hard to think of examples where such a restriction would be reasonable for delinquency items.

So here a simple update is to include a dummy variable for each item. Here I also cluster according to whether the item is nested within an individual `caseid`.

``````*Model 3
glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)``````

And here are the results:

``````.     glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)

Iteration 0:   log pseudolikelihood = -11748.056
Iteration 1:   log pseudolikelihood = -11740.418
Iteration 2:   log pseudolikelihood = -11740.417
Iteration 3:   log pseudolikelihood = -11740.417

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,970
Scale parameter =          1
Deviance         =  23480.83406                   (1/df) Deviance =   1.175805
Pearson          =  19949.15609                   (1/df) Pearson  =   .9989562

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

AIC             =   1.177042
Log pseudolikelihood = -11740.41703               BIC             =  -174291.8

(Std. Err. adjusted for 1,000 clusters in caseid)
------------------------------------------------------------------------------
|               Robust
outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.3858319   .0334536   -11.53   0.000    -.4513996   -.3202641
|
group |
2  |  -.0620222   .1350231    -0.46   0.646    -.3266626    .2026182
3  |   -.049852   .1340801    -0.37   0.710    -.3126442    .2129403
4  |   .0549271   .1383412     0.40   0.691    -.2162167    .3260709
5  |  -.1329942   .1374758    -0.97   0.333    -.4024419    .1364535
6  |  -.4103578   .1401212    -2.93   0.003    -.6849904   -.1357253
7  |  -.3145033   .1452201    -2.17   0.030    -.5991296   -.0298771
8  |  -.2803599   .1367913    -2.05   0.040    -.5484659    -.012254
9  |  -.4543686   .1431314    -3.17   0.002    -.7349011   -.1738362
10  |  -.5962359   .1457941    -4.09   0.000    -.8819872   -.3104847
|
question |
2  |   .1453902   .1074383     1.35   0.176    -.0651851    .3559654
3  |   .1643203   .1094113     1.50   0.133     -.050122    .3787625
4  |   .1262597   .1077915     1.17   0.241    -.0850078    .3375272
5  |   .1830563    .105033     1.74   0.081    -.0228047    .3889173
6  |   .3609468   .1051123     3.43   0.001     .1549304    .5669633
7  |    .524749    .100128     5.24   0.000     .3285017    .7209963
8  |   .5768412   .1000354     5.77   0.000     .3807754     .772907
9  |   .7318797   .1021592     7.16   0.000     .5316513    .9321081
10  |    .571682   .1028169     5.56   0.000     .3701646    .7731994
11  |    .874362   .0998021     8.76   0.000     .6787535     1.06997
12  |   .8928982   .0998285     8.94   0.000     .6972379    1.088559
13  |   .8882734   .1023888     8.68   0.000      .687595    1.088952
14  |   .9887095   .0989047    10.00   0.000     .7948599    1.182559
15  |   1.165517   .0977542    11.92   0.000     .9739222    1.357111
16  |   1.230355   .0981687    12.53   0.000     1.037948    1.422762
17  |   1.260403   .0977022    12.90   0.000      1.06891    1.451896
18  |   1.286065    .098823    13.01   0.000     1.092376    1.479755
19  |   1.388013   .0987902    14.05   0.000     1.194388    1.581638
20  |   1.623689   .0999775    16.24   0.000     1.427737    1.819642
|
_cons |  -1.336376   .1231097   -10.86   0.000    -1.577666   -1.095085
------------------------------------------------------------------------------``````

You can now see that the predicted values for each individual item are much more reasonable. In fact they are a near perfect fit.

``````predict prob_mod3, mu
by question: summarize outcome_ prob_mod2 prob_mod3``````

And the results:

``````.     by question: summarize outcome_ prob_mod2 prob_mod3

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 1

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
outcome_ |      1,000        .183      .38686          0          1
prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
prob_mod3 |      1,000        .183    .0672242   .0475809   .4785903

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 2

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
outcome_ |      1,000        .205    .4039036          0          1
prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
prob_mod3 |      1,000        .205    .0729937   .0546202   .5149203

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 3

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
outcome_ |      1,000        .208    .4060799          0          1
prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
prob_mod3 |      1,000        .208    .0737455    .055606   .5196471

-------------------------------------------------------------------------------------------------------------------------------------------------------
-> question = 4

Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
outcome_ |      1,000        .202    .4016931          0          1
prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
prob_mod3 |      1,000        .202    .0722336   .0536408   .5101407``````

If you want, you can also test whether any "treatment" effect (or here the level of a persons self control), has differential effects across the different delinquency items.

``````*Model 4
glm outcome_ self_control i.group i.question (c.self_control#i.question), family(binomial) link(logit) cluster(caseid)
*can do a test of all the interactions equal to zero at once
testparm c.self_control#i.question``````

I’ve omitted this output, but here of course the effect of self control is simulated to be the same across the different items, so one would fail to reject the null that any of the interaction terms are non-zero.

Given the way I simulated the data, the actual correct model is a random effects one. You should notice in each of the prior models the effect of self control is too small. One way to estimate that model in Stata is to below:

``````*Model 5
melogit outcome_ self_control i.group i.question || caseid:``````

And here are the results:

``````. melogit outcome_ self_control i.group i.question || caseid:

Fitting fixed-effects model:

Iteration 0:   log likelihood = -11748.056
Iteration 1:   log likelihood = -11740.418
Iteration 2:   log likelihood = -11740.417
Iteration 3:   log likelihood = -11740.417

Refining starting values:

Grid node 0:   log likelihood =  -10870.54

Fitting full model:

Iteration 0:   log likelihood =  -10870.54
Iteration 1:   log likelihood = -10846.176
Iteration 2:   log likelihood = -10845.969
Iteration 3:   log likelihood = -10845.969

Mixed-effects logistic regression               Number of obs     =     20,000
Group variable:          caseid                 Number of groups  =      1,000

Obs per group:
min =         20
avg =       20.0
max =         20

Integration method: mvaghermite                 Integration pts.  =          7

Wald chi2(29)     =    1155.07
Log likelihood = -10845.969                     Prob > chi2       =     0.0000
------------------------------------------------------------------------------
outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
self_control |  -.4744173   .0372779   -12.73   0.000    -.5474807    -.401354
|
group |
2  |  -.0648018   .1642767    -0.39   0.693    -.3867783    .2571746
3  |  -.0740465   .1647471    -0.45   0.653    -.3969449    .2488519
4  |    .036207   .1646275     0.22   0.826     -.286457     .358871
5  |  -.1305605   .1645812    -0.79   0.428    -.4531337    .1920126
6  |  -.5072909   .1671902    -3.03   0.002    -.8349776   -.1796042
7  |  -.3732567    .165486    -2.26   0.024    -.6976032   -.0489102
8  |  -.3495889   .1657804    -2.11   0.035    -.6745126   -.0246653
9  |  -.5593725   .1675276    -3.34   0.001    -.8877205   -.2310245
10  |  -.7329717   .1673639    -4.38   0.000    -1.060999   -.4049445
|
question |
2  |   .1690546   .1240697     1.36   0.173    -.0741177    .4122268
3  |    .191157   .1237894     1.54   0.123    -.0514657    .4337797
4  |   .1467393   .1243586     1.18   0.238    -.0969991    .3904776
5  |   .2130531   .1235171     1.72   0.085     -.029036    .4551422
6  |   .4219282   .1211838     3.48   0.000     .1844123    .6594441
7  |   .6157484   .1194133     5.16   0.000     .3817027    .8497941
8  |   .6776651   .1189213     5.70   0.000     .4445837    .9107465
9  |   .8626735   .1176486     7.33   0.000     .6320865    1.093261
10  |   .6715272   .1189685     5.64   0.000     .4383532    .9047012
11  |   1.033571   .1167196     8.86   0.000     .8048051    1.262338
12  |    1.05586    .116615     9.05   0.000     .8272985    1.284421
13  |   1.050297   .1166407     9.00   0.000     .8216858    1.278909
14  |   1.171248   .1161319    10.09   0.000     .9436331    1.398862
15  |   1.384883   .1154872    11.99   0.000     1.158532    1.611234
16  |   1.463414   .1153286    12.69   0.000     1.237375    1.689454
17  |   1.499836   .1152689    13.01   0.000     1.273913    1.725759
18  |   1.530954   .1152248    13.29   0.000     1.305117     1.75679
19  |   1.654674   .1151121    14.37   0.000     1.429058    1.880289
20  |   1.941035   .1152276    16.85   0.000     1.715193    2.166877
|
_cons |  -1.591796   .1459216   -10.91   0.000    -1.877797   -1.305795
-------------+----------------------------------------------------------------
caseid       |
var(_cons)|   1.052621   .0676116                      .9281064     1.19384
------------------------------------------------------------------------------
LR test vs. logistic model: chibar2(01) = 1788.90     Prob >= chibar2 = 0.0000``````

In that model it is the closest to estimating the correct effect of self control (-0.5). It is still small (at -0.47), but the estimate is within one standard error of the true value. (Another way to estimate this model is to use `xtlogit`, but with `melogit` you can actually extract the random effects. That will have to wait until another blog post though.)

Another way to think about this model is related to item-response theory, where individuals can have a latent estimate of how smart they are, and questions can have a latent easiness/hardness. In Stata you might fit that by the code below, but a warning it takes awhile to converge. (Not sure why, the fixed effects for questions are symmetric, so assuming a random effect distribution should not be too far off. If you have any thoughts as to why let me know!)

``````*Model 6
melogit outcome_ self_control i.group || caseid: || question:``````

For an academic reference to this approach see Osgood et al.,(2002). Long story short, model the individual items, but pool them together in one model!

# Don’t include temporal lags of crime in cross-sectional crime models

In my 311 and crime paper a reviewer requested I conduct cross-lagged models. That is, predict crime in 2011 while controlling for prior counts of crime in 2010, in addition to the other specific variables of interest (here 311 calls for service). In the supplementary material I detail why this is difficult with Poisson models, as the endogenous effect will often be explosive in Poisson models, something that does not happen as often in linear models.

There is a second problem though with cross-lagged models I don’t discuss though, and it has to do with how what I think a reasonable data generating process for crime at places can cause cross-lagged models to be biased. This is based on the fact that crime at places tends to be very temporally stable (see David Weisburd’s, or Martin Andresen’s, or my work showing that). So when you incorporate temporal lags of crime in models, this makes the other variables of interest (311 calls, alcohol outlets, other demographics, whatever) biased, because they cause crime in the prior time period. This is equivalent to controlling for an intermediate outcome. For examples of this see some of the prior work on the relationship between crime and disorder by Boggess and Maskaly (2014) or O’Brien and Sampson (2015).1

So Boggess and Maskaley (BM) and O’Brien and Sampson (OS) their simplified cross-lagged model is:

`(1) Crime_post = B0*Crime_pre + B1*physicaldisorder_pre`

Where the post and pre periods are yearly counts of crime and indicators of physical disorder. My paper subsequently does not include the prior counts of crime, but does lag the physical disorder measures by a year to ensure they are exogenous.

`(2) Crime_post = B1*physicaldisorder_pre`

There are a few reasons to do these lags. The most obvious is to make explanatory variable of broken windows exogenous, by making sure it is in the past. The reasons for including lags of crime counts are most often strictly as a control variable. There are some examples where crime begets more crime directly, such as retaliatory violence, (or see Rosenfeld, 2009) but most folks who do the cross-lagged models do not make this argument.

Now, my whole argument rests on what I think is an appropriate model explaining counts of crime at places. Continuing with the physical disorder example, I think a reasonable cross-sectional model of crime at places is that there are some underlying characteristics of locations that tend to be pretty stable over fairly long periods of time, and then we have more minor stuff like physical disorder that provide small exogenous shocks to the system over time.

`(3) Crime_i = B0*(physicaldisorder_i) + Z_i`

Where crime at location `i` is a function of some fixed characteristic `Z`. I can’t prove this model is correct, but I believe it is better supported by data. To support this position, I would refer to the incredibly high correlations between counts of crime at places from year to year. This is true of every crime dataset I have worked with (at every spatial unit of analysis), and is a main point of Shaw and McKay’s work plus Rob Sampsons for neighborhoods in Chicago, as well as David Weisburd’s work on trajectories of crime at street segments in Seattle. Again, this very high correlation doesn’t strike me as reasonably explained by crime causes more crime, what is more likely is that there are a set of fixed characteristics that impact criminal behavior at a certain locations.

If a model of crime is like that in (3), there are then two problems with the prior equations. The first problem for both (1) and (2) is that lagging physical disorder measures by a year does not make any sense. The idea behind physical disorder (a.k.a. broken windows) is that visible signs of disorder prime people to behave in a particular way. The priming presumably needs to be recent to affect behavior. But this can simply be solved by not lagging physical disorder by a year in the model. The lagged physical disorder effect might approximate the contemporaneous effect, if physical disorder itself is temporally consistent over long periods. So if say we replace physical disorder with locations of bars, the lagged effect of bars likely does not make any difference, between bars don’t turn over that much (and when they do they are oft just replaced by another bar).

But what if you still include the lags of crime counts? One may think that this controls for the omitted `Z_i` effect, but the effect is very bad for the other exogenous variables, especially lagged ones or temporally consistent ones. You are probably better off with the omitted random effect, because crime in the prior year is an intermediate outcome. I suspect this bias can be very large, and likely biases the effects of the other variables towards zero by quite alot. This is because effect of the fixed characteristic is large, the effect of the exogenous characteristic is smaller, and the two are likely correlated at least to a small amount.

To show this I conduct a simulation. SPSS Code here to replicate it. The true model I simulated is:

``````(4)  BW_it = 0.2*Z_i + ew_it
(5)  Crime_it = 5 + 0.1*BW_it + 0.9*Z_i + ec_it```````

I generated this for 25,000 locations and two time points (the t subscript), and all the variables are set to have a variance of 1 (all variables are normally distributed). The error terms (ew_it and ec_it) are not correlated, and are set to whatever value is necessary so the resultant variable on the left hand side has a variance of 1. With so many observations one simulation run is pretty representative of what would happen even if I replicated the simulation multiple times. This specification makes both BW (to stand for broken windows) and Z_i correlated.

In my run, what happens when we fit the cross-lagged model? The effect estimates are subsequently:

``````Lag BW:   -0.07
Lag Crime: 0.90``````

Yikes – effect of BW is in the opposite direction and nearly as large as the true effect. What about if you just include the lag of BW?

``Lag BW: 0.22``

The reason this is closer to the true effect is because of some round-about-luck. Since BW_it is correlated with the fixed effect Z_i, the lag of BW has a slight correlation to the future BW. This potentially changes how we view the effects of disorder on crime though. If BW is more variable, we can make a stronger argument that it is exogenous of other omitted variables. If it is temporally consistent it is harder to make that argument (it should also reduce the correlation with Z_i).

Still, the only reason this lag has a positive effect is that Z_i is omitted. For us to make the argument that this approximates the true effect, we have to make the argument the model has a very important omitted variable. Something one could only do as an act of cognitive dissonance.

How about use the contemporaneous effect of BW, but still include the lag counts of crime?

``````BW:        0.13
Lag Crime: 0.86``````

That is not as bad, because the lag of crime is now not an intermediate outcome. Again though, if we switch BW with something more consistent in time, like locations of bars, the lag will be an intermediate outcome, and will subsequently bias the effect. So what about a model of the contemporaneous effect of BW, omitting Z_i? The contemporaneous effect of BW will still be biased, since Z_i is omitted from the model.

``BW: 0.32``

But a way to reduce this bias is to introduce other control variables that approximate the omitted Z_i. Here I generate a set of 10 covariates that are a function of Z_i, but are otherwise not correlated with BW nor each other.

`(6) Oth_it = 0.5*Z_i + eoth_it`

Including these covariates in the model progressively reduces the bias. Here is a table for the reduction in the BW effect for the more of the covariates you add in, e.g. `with 2` means it includes two of the control variables in the model.

``````BW (with 0):  0.32
BW (with 1):  0.25
BW (with 2):  0.21
BW (with 3):  0.19
BW (with 10): 0.14``````

So if you include other cross-sectional covariates in an attempt to control for Z_i it brings the effect of BW closer to its true effect. This is what I believe happens in the majority of social science research that use strictly cross-sectional models, and is a partial defense of what people sometimes refer to kitchen sink models.

So in brief, I think using lags of explanatory variables and lags of crime in the same model are very bad, and can bias the effect estimates quite alot.

So using lags of explanatory variables and lags of crime counts in cross-sectional models I believe are a bad idea for most research designs. It is true that it makes it their effects exogenous, but it doesn’t eliminate the more contemporaneous effect of the variable, and so we may be underestimating the effect to a very large extent. Whether of not the temporal lag effects crime has to do with how the explanatory variable itself arises, and so the effect estimated by the temporal lag is likely to be misleading (and may be biased upward or downward depending on other parts of the model).

Incorporating prior crime counts is likely to introduce more bias than it solves I think for most cross-lagged models. I believe simply using a cross-sectional model with a reasonable set of control variables will get you closer to the real effect estimates than the cross-lagged models. If you think `Z_i` is correlated with a variable of interest (or lags of crime really do cause future crime) I think you need to do the extra step and have multiple time measures and fit a real panel data model, not just a cross lagged one.

I’m still not sure though when you are better off fitting a panel model versus expanding the time for the cross-section though. For one example, I think you are better off estimating the effects of demographic variables in a cross-sectional model, as opposed to a panel one, over a short period of time, (say less than 10 years). This is because demographic shifts simply don’t occur very fast, so there is little variance within units for a short panel.

1. I actually came up with the idea of using 311 calls independently of Dan O’Brien’s work, see my prospectus in 2013 in which I proposed the analysis. So I’m not totally crazy – although was alittle bummed to miss the timing abit! Four years between proposing and publishing the work is a bit depressing as well.

# Testing the equality of two regression coefficients

The default hypothesis tests that software spits out when you run a regression model is the null that the coefficient equals zero. Frequently there are other more interesting tests though, and this is one I’ve come across often — testing whether two coefficients are equal to one another. The big point to remember is that `Var(A-B) = Var(A) + Var(B) - 2*Cov(A,B)`. This formula gets you pretty far in statistics (and is one of the few I have memorized).

Note that this is not the same as testing whether one coefficient is statistically significant and the other is not. See this Andrew Gelman and Hal Stern article that makes this point. (The link is to a pre-print PDF, but the article was published in the American Statistician.) I will outline four different examples I see people make this particular mistake.

One is when people have different models, and they compare coefficients across them. For an example, say you have a base model predicting crime at the city level as a function of poverty, and then in a second model you include other control covariates on the right hand side. Let’s say the the first effect estimate of poverty is `3 (1)`, where the value in parentheses is the standard error, and the second estimate is `2 (2)`. The first effect is statistically significant, but the second is not. Do you conclude that the effect sizes are different between models though? The evidence for that is much less clear.

To construct the estimate of how much the effect declined, the decline would be `3 - 2 = 1`, a decrease in `1`. What is the standard error around that decrease though? We can use the formula for the variance of the differences that I noted before to construct it. So the standard error squared is the variance around the parameter estimate, so we have `sqrt(1^2 + 2^2) =~ 2.23` is the standard error of the difference — which assumes the covariance between the estimates is zero. So the standard error around our estimated decline is quite large, and we can’t be sure that it is an appreciably different estimate of poverty between the two models.

There are more complicated ways to measure moderation, but this ad-hoc approach can be easily applied as you read other peoples work. The assumption of zero covariance for parameter estimates is not a big of deal as it may seem. In large samples these tend to be very small, and they are frequently negative. So even though we know that assumption is wrong, just pretending it is zero is not a terrible folly.

The second is where you have models predicting different outcomes. So going with our same example, say you have a model predicting property crime and a model predicting violent crime. Again, I will often see people make an equivalent mistake to the moderator scenario, and say that the effect of poverty is larger for property than violent because one is statistically significant and the other is not.

In this case if you have the original data, you actually can estimate the covariance between those two coefficients. The simplest way is to estimate that covariance via seemingly unrelated regression. If you don’t though, such as when you are reading someone else’s paper, you can just assume the covariance is zero. Because the parameter estimates often have negative correlations, this assumption will make the standard error estimate smaller.

The third is where you have different subgroups in the data, and you examine the differences in coefficients. Say you had recidivism data for males and females, and you estimated an equation of the effect of a treatment on males and another model for females. So we have two models:

``````Model Males  : Prob(Recidivism) = B_0m + B_1m*Treatment
Model Females: Prob(Recidivism) = B_0f + B_1f*Treatment``````

Where the `B_0?` terms are the intercept, and the `B_1?` terms are the treatment effects. Here is another example where you can stack the data and estimate an interaction term to estimate the difference in the effects and its standard error. So we can estimate a combined model for both males and females as:

``Combined Model: Prob(Recidivism) = B_0c + B_1c*Treatment + B_2c*Female + B_3c(Female*Treatment)``

Where `Female` is a dummy variable equal to 1 for female observations, and `Female*Treatment` is the interaction term for the treatment variable and the Female dummy variable. Note that you can rewrite the model for males and females as:

``````Model Mal.: Prob(Recidivism) =     B_0c      +      B_1c    *Treatment    ....(when Female=0)
Model Fem.: Prob(Recidivism) = (B_0c + B_2c) + (B_1c + B_3c)*Treatment    ....(when Female=1)``````

So we can interpret the interaction term, `B_3c` as the different effect on females relative to males. The standard error of this interaction takes into account the covariance term, unlike estimating two totally separate equations would. (You can stack the property and violent crime outcomes I mentioned earlier in a synonymous way to the subgroup example.)

The final fourth example is the simplest; two regression coefficients in the same equation. One example is from my dissertation, the correlates of crime at small spatial units of analysis. I test whether different places that sell alcohol — such as liquor stores, bars, and gas stations — have the same effect on crime. For simplicity I will just test two effects, whether liquor stores have the same effect as on-premise alcohol outlets (this includes bars and restaurants). So lets say I estimate a Poisson regression equation as:

``log(E[Crime]) = Intercept + b1*Bars + b2*LiquorStores``

And then my software spits out:

``````                  B     SE
Liquor Stores    0.36  0.10
Bars             0.24  0.05``````

And then lets say we also have the variance-covariance matrix of the parameter estimates – which most stat software will return for you if you ask it:

``````                L       B
Liquor_Stores    0.01
Bars            -0.0002 0.0025``````

On the diagonal are the variances of the parameter estimates, which if you take the square root are equal to the reported standard errors in the first table. So the difference estimate is `0.36 - 0.24 = 0.12`, and the standard error of that difference is `sqrt(0.01 + 0.0025 - 2*-0.002) =~ 0.13`. So the difference is not statistically significant. You can take the ratio of the difference and its standard error, here `0.12/0.13`, and treat that as a test statistic from a normal distribution. So the rule that it needs to be plus or minus two to be stat. significant at the 0.05 level applies.

This is called a Wald test specifically. I will follow up with another blog post and some code examples on how to do these tests in SPSS and Stata. For completeness and just because, I also list two more ways to accomplish this test for the last example.

There are two alternative ways to do this test though. One is by doing a likelihood ratio test.

So we have the full model as:

`````` log(E[Crime]) = b0 + b1*Bars + b2*Liquor_Stores [Model 1]
``````

And we have the reduced model as:

`````` log(E[Crime]) = b4 + b5*(Bars + Liquor_Stores)  [Model 2]
``````

So we just estimate the full model with Bars and Liquor Stores on the right hand side (Model 1), then estimate the reduced model (2) with the sum of `Bars + Liquor Stores` on the right hand side. Then you can just do a chi-square test based on the change in the log-likelihood. In this case there is a change of one degree of freedom.

I give an example of doing this in R on crossvalidated. This test is nice because it extends to testing multiple coefficients, so if I wanted to test `bars=liquor stores=convenience stores`. The prior individual Wald tests are not as convenient for testing more than two coefficients equality at once.

Here is another way though to have the computer more easily spit out the Wald test for the difference between two coefficients in the same equation. So if we have the model (lack of intercept does not matter for discussion here):

``y = b1*X + b2*Z [eq. 1]``

We can test the null that b1 = b2 by rewriting our linear model as:

``y = B1*(X + Z) + B2*(X - Z) [eq. 2]``

And the test for the B2 coefficient is our test of interest The logic goes like this — we can expand `[eq. 2]` to be:

``y = B1*X + B1*Z + B2*X - B2*Z [eq. 3]``

which you can then regroup as:

``y = X*(B1 + B2) + Z*(B1 - B2) [eq. 4]``

and note the equalities between equations 4 and 1.

``B1 + B2 = b1; B1 - B2 = b2``

So B2 tests for the difference between the combined B1 coefficient. B2 is a little tricky to interpret in terms of effect size for how much larger b1 is than b2 – it is only half of the effect. An easier way to estimate that effect size though is to insert (X-Z)/2 into the right hand side, and the confidence interval for that will be the effect estimate for how much larger the effect of X is than Z.

Note that this gives an equivalent estimate as to conducting the Wald test by hand as I mentioned before.

# Regression to the mean – a tale of change scores

This is a real example from my work illustrating regression to the mean. I have a scale measuring impulsivity of offenders. I had an intervention that used cognitive behavioral thereapy (CBT) in a boot camp for one group, and business as usual for another (just plain old jail). I have measures of impulsivity at `pre`, `post`, and 6 month follow up (what I label as `post2`). CBT is suppossed to reduce impulsivity, and hopefully keep it that way.

I find that those who have gained the most during the intervention tend to revert back to their prior scores once they leave the bootcamp. That is, the measure `[post - pre]`, the gain in bootcamp, has a negative correlation with `[post2 - post]`, the loss after bootcamp. Is this due to the intervention being shitty? No! It is not — this is the result of regression to the mean. This does not show any relationship between the values, it will happen even if the impulse scores are totally random.

Note that the definition of covariance is:

``Cov(X,Y) = E[(x - E[X])*(y - E[Y])]``

Where `E` is representing the expectation, and `Cov(X,Y)` of course means the covariance between `X` and `Y`. Here for easier equations we can assume the mean in the impulse scale is zero across all three waves, which makes the means of the change scores zero as well (without any loss in generality). So dropping the inner expecations this equation reduces to:

``Cov(X,Y) = E[x*y]``

So defining `post-pre = Change1` and `post2 - post = Change2`, expanding out to the original components we have:

``Cov(Change1,Change2) = Cov(post-pre,post2-post) = E[ (post-pre)*(post2-post) ]``

The last result can then be expanded to:

``E[ post*post2 - post*post - pre*post2 + pre*post ]``

Because of the bilinearity of expectation, these can be further teased out:

``E[ post*post2 ] - E[ post*post ] - E[ pre*post2 ] + E[ pre*post]``

Note we can rewrite this back into variances and covariances of the original levels:

``Cov(post,post2) - Var(post) - Cov(pre,post2) + Cov(pre,post)``

There are two things to note here. 1) The covariances in the change scores can be entirely written as functions in the covariances of the levels. They do not supply information independent of the levels themselves.

For 2), if the data are random (that is the covariances between all the levels are random), the covariances between the change scores will be negative. This is because of the minus sign in front of the variance of the post term. For random data, all the other covariances are zero. This results in the correlation between the change scores being -1/2.

For a simple example in R:

``````> set.seed(10)
> n <- 10000 #sample size
> t1 <- rnorm(n) #three random vectors
> t2 <- rnorm(n)
> t3 <- rnorm(n)
> levels <- data.frame(t1,t2,t3)
> differ <- data.frame(c1=t2-t1,c2=t3-t2)
>
> #correlations in levels are approximately zero
> cor(levels)
t1           t2            t3
t1  1.0000000000  0.001874345 -0.0007006367
t2  0.0018743450  1.000000000 -0.0045967380
t3 -0.0007006367 -0.004596738  1.0000000000
>
> #correlation of differences is -0.5
> cor(differ)
c1         c2
c1  1.0000000 -0.4983006
c2 -0.4983006  1.0000000``````

Sometimes I see people talk about regression to the mean as if it is a sociological thing, like something that needs to be explained in terms of human behavior. It is not, it is entirely mathematical.

This is also one of the reasons I don’t like using change scores, either as independent or dependent variables. They typically can be rewritten in terms of the levels, and involve coeffficient restrictions that can have strange consequences. There are some situations (fixed effects) that make sense for the dependent variable. I haven’t seen a situation in the terms of independent variables where they make sense.

# Negative Binomial regression and predicted probabilities in SPSS

For my dissertation I have been estimating negative binomial regression models predicting the counts of crimes at small places (i.e. street segments and intersections). When evaluating the fit of poisson regression models and their variants, you typically make a line plot of the observed percent of integer values versus the predicted percent by the models. This is particularly pertinent for data that have a high proportion of zeros, as the negative binomial may still under-predict the number of zeros.

I mistakenly thought that to make such a plot you could simply estimate the predicted value following the negative binomial regression model and then round the predictions. But I was incorrect, and to make the typical predicted versus observed plot you need to estimate the probability of an observation taking an integer value, and then take the mean of that probability over all the observations. That mean will subsequently be the predicted percent given the model. Fortunately I caught my mistake before I gave some talks on my work recently, and I will show how to make said calculations in SPSS. I have posted the data to replicate this work at this dropbox link, and so you can download the data and follow along.

First, I got some help on how to estimate the predicted probabilities via an answer to my question at CrossValidated. So that question lists the formula one needs to estimate the predicted probability for any integer value N after the negative binomial model. To calculate that value though we need to make some special SPSS functions, the factorial and the complete gamma function. Both have SPSS tech help pages showing how to calculate them.

For the factorial we can use a general relationship with the `LNGAMMA` function.

``````
DEFINE !FACT (!POSITIONAL = !ENCLOSE("(",")"))
( EXP(LNGAMMA((!1)+1)) )
!ENDDEFINE.
``````

And for the complete gamma function we can use a relationship to the CDF of the gamma function.

``````
DEFINE !GAMMAF (!POSITIONAL = !ENCLOSE("(",")"))
( EXP(-1)/(!1)/(CDF.GAMMA(1,(!1),1) - CDF.GAMMA(1,(!1)+1,1)) )
!ENDDEFINE.
``````

And given these two functions, we can create a macro that takes as parameters and returns the predicted probability we are interested in:

• `out` – new variable name for predicted probability of taking on that integer value
• `PredN` – the predicted mean of the variable conditional on the covariates
• `Disp` – estimate of the dispersion parameter
• `Int` – the integer value being predicted
``````
DEFINE !PredNB (Out = !TOKENS(1)
/PredN = !TOKENS(1)
/Disp = !TOKENS(1)
/Int = !TOKENS(1) )
COMPUTE #a = (!Disp)**(-1).
COMPUTE #mu = !PredN.
COMPUTE #Y = !Int.
COMPUTE #1 = (!GAMMAF(#Y + #a))/(!FACT(#Y)*!GAMMAF(#a)).
COMPUTE #2 = (#a/(#a+#mu))**#a.
COMPUTE #3 =  (#mu/(#a + #mu))**#Y.
COMPUTE !Out =  #1*#2*#3.
!ENDDEFINE.
``````

But to make our plot we want to estimate this predicted probability over a range of values, so I created a helper macro that instead of taking only one integer value, takes the end integer value and will calculate the predicted probability of zero through N.

``````
DEFINE !PredNBRange (Num = !TOKENS(1)
/Mean = !TOKENS(1)
/Disp = !TOKENS(1)
/Stub = !TOKENS(1) )
!DO !I = 0 !TO !Num
!LET !Base = !CONCAT(!Stub,!I)
!PredNB Out = !Base PredN = !Mean Disp = !Disp Int = !I.
!DOEND
!ENDDEFINE.
``````

The example data and code I have posted compares these values to the ones predicted from Stata, and shows my function agrees with Stata to about 7 decimal points. I won’t go through all of those commands here, but I will show how to make the predicted proportions plot after you have a vector of predicted probabilities (you can download all of the code and data and the link I reference prior in the post).

So lets say that you have a vector `NB0 TO NB8`, and these are the predicted probabilities of integer values 0 to 8 for the observations in your dataset. To subsequently get the mean of the predictions, you can use the `AGGREGATE` command. Having no variables specified on the `BREAK` subcommand tells SPSS to aggregate over all values in the dataset. Here I export the file to a new dataset named `PredNBAgg`.

``````
DATASET DECLARE PredNBAgg.
AGGREGATE OUTFILE='PredNBAgg'
/BREAK =
/NB0 TO NB8 = MEAN(NB0 TO NB8).
``````

Now to merge later on to the observed proportions, I will reshape the dataset so the mean values are all in the same column using `VARSTOCASES`. Here I also make a category for the predicted probability of being 9 or higher (which isn’t typical for these types of plots, but something I believe is useful).

``````
DATASET ACTIVATE PredNBAgg.
COMPUTE NB9_Plus = 1 - SUM(NB0 TO NB8).
VARSTOCASES /MAKE NBPred FROM NB0 TO NB9_Plus /INDEX Int.
COMPUTE Int = Int - 1. /*Index starts at 1 instead of 0 */.
``````

Now I reactivate my original dataset, here named `PredNegBin`, calculate the binned observed values (with observations 9 and larger recoded to just 9) and then aggregate those values.

``````
DATASET ACTIVATE PredNegBin.
RECODE TotalCrime (9 THRU HIGHEST = 9)(ELSE = COPY) INTO Int.
DATASET DECLARE PredObsAgg.
AGGREGATE OUTFILE='PredObsAgg'
/BREAK = Int
/TotalObs = N.
``````

To get the predicted proportions within each category, I need to do another aggregation to get the total number of observations, and then divide the totals of each integer value with the total number of observations.

``````
DATASET ACTIVATE PredObsAgg.
AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES
/BREAK =
/TotalN=SUM(TotalObs).
COMPUTE PercObs = TotalObs / TotalN.
``````

Now we can go ahead and merge the two aggregated datasets together. I also go ahead and close the old `PredNBAgg` dataset and define a value label so I know that the 9 integer category is really 9 and larger.

``````
MATCH FILES FILE = *
/FILE = 'PredNBAgg'
/BY Int.
DATASET CLOSE PredNBAgg.
VALUE LABELS Int 9 '9+'.
``````

Now at this point you could make the plot with the predicted and observed proportions in seperate variables, but this would take two `ELEMENT` statements within a `GGRAPH` command (and I like to make line plots with both the lines and points, so it would actually take 4 `ELEMENT` statements). So what I do here is reshape the data one more time with `VARSTOCASES`, and make a categorical variable to identify if the proportion is the observed value or the predicted value from the model. Then you can make your chart.

``````
VARSTOCASES /MAKE Dens FROM PercObs NBPred /Index Type /DROP TotalObs TotalN.
VALUE LABELS Type
1 'Observed'
2 'Predicted'.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Int Dens Type
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Int=col(source(s), name("Int"), unit.category())
DATA: Type=col(source(s), name("Type"), unit.category())
DATA: Dens=col(source(s), name("Dens"))
GUIDE: axis(dim(1), label("Total Crimes on Street Units"))
GUIDE: axis(dim(2), label("Percent of Streets"))
GUIDE: legend(aesthetic(aesthetic.color.interior), null())
SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color.black),("2",color.red)))
ELEMENT: line(position(Int*Dens), color.interior(Type))
ELEMENT: point(position(Int*Dens), color.interior(Type), color.exterior(color.white), size(size."7"))
END GPL.
``````

And voila, here you can see the predicted values are so close to the observed that it is difficult to even see the observed values. Here instead of creating a legend I manually added labels to the chart. A better chart may be to subtract the observed from predicted (especially if you were comparing multiple poisson models), but it should be quite plain to see that the negative binomial fits quite well to the observed data in this instance.

Similar to Paul Allison’s experience, even with nearly 64% of the observations being zero, the negative binomial model fits just fine. I recently fit some other models with the same data (but a different outcome) in which the number of zeros were nearer to 90%. In that instance the negative binomial model would not converge, so estimating a zero inflated model was necessary. Here though it is clearly not necessary, and I would prefer the negative binomial model over a zip (or hurdle) as I see no obvious reason why I would prefer the complications of the different predicted zero equation in addition to the count equation.