Pooling multiple outcomes into one regression equation

Something that came up for many of my students this last semester in my Seminar in Research class is an interest in multiple outcomes. The most common example is examining different types of delinquency for juveniles (often via surveys), but it comes up in quite a few other designs as well (e.g. different crime outcomes for spatial research, different measures of perceptions towards police, different measures of fear of crime, etc.).

Most of the time students default to estimating separate equations for each of these outcomes, but in most circumstances I was telling the students they should pool these outcomes into one model. I think that is the better default for the majority of situations. So say we have a situation with two outcomes, violent crimes and property crimes, and we have one independent variable we are interested in, say whether an individual was subjected to a particular treatment. We might then estimate two separate equations:

E[# Violent Crimes]  = B0v + B1v*(Treatment) 
E[# Property Crimes] = B0p + B1p*(Treatment)

Saying that by default we should think about pooling is basically saying that B1v is going to be close to equal to B1p in the two equations. Pooling the models together lets us test that assertion, as well as get a better estimate of the overall treatment effect. So to pool the models we would stack the outcomes together, and then estimate something like:

E[# Crimes (by type)] = B0 + B1*(Treatment) + B2*(Outcome = Violent) + B3*(Treatment*Outcome = Violent)

Here the B3 coefficient tests whether the treatment effect is different for the violent crime outcome as opposed to the property crime outcome, and the B2 dummy variable effect controls for any differences in the overall levels of the two (that is, you would expect violent incidents to be less common than property crime incidents).
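
To make that concrete, here is a minimal Stata sketch of the stacking step. The variable names (id, treat, y_violent, y_property) are hypothetical, and the Poisson model for the stacked counts is purely for illustration:

*minimal sketch, assuming a wide file with hypothetical variables id, treat, y_violent, y_property
reshape long y_, i(id) j(crimetype) string
generate violent = (crimetype == "violent")
*pooled model with a treatment by outcome-type interaction, clustering on persons
poisson y_ i.treat##i.violent, vce(cluster id)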

Because you will have multiple measures per individual, you should correct for that (by clustering the standard errors). But in the case you have many outcomes you might also want to consider a multi-level model, and actually estimate random effects for individuals and outcomes. So say instead of just violent and property crimes, you had a survey listing 20 different types of delinquency. In that case you might want to estimate a model that looks like:

Prob(Delinquency_ij) = f[B0 + B1*(Treatment_j) + d_j + g_i]

Where one is estimating a multi-level logistic regression equation for delinquency type i within individual j, and the g_i and d_j are the random effects for delinquency types and individuals respectively. In the case you do not have many outcomes (say only 10), the random effect distribution might be hard to estimate. In that case I would just use fixed effects for the outcome dummy variables. But I can imagine the random effects for persons are of interest in many different study designs. And this way you get one model — instead of having to interpret 20+ models.
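
As a quick sketch of that pooled setup in Stata (anticipating the melogit models estimated later in this post, with hypothetical variable names for long data that has one row per person-item):

*sketch with long data, hypothetical variables delinq, treat, item, caseid
*random intercept for persons, fixed effects for the items
melogit delinq treat i.item || caseid: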

Also you can still estimate differential treatment effects across the different items if you want to, such as by looking at the interaction of the outcome types and the treatment. But in most cases I have come across in criminology, treatments are general. That is, we would expect them to decrease/increase all crime types, not just some specific violent or property crime types. So defaulting to pooling the treatment effect estimate makes sense.

To go a bit farther — juvenile delinquency is not my bag, but offhand I don't understand why those who examine surveys of delinquency items don't use that multi-level model more often. Oftentimes people aggregate the measures altogether into one overall scale, such as saying someone checked yes to 2 out of 10 violent crime outcomes, and checked yes to 5 out of 10 property crime outcomes. Analyzing those aggregated outcomes is another type of pooling, but one I don't think is appropriate, mainly because it ignores the overall prevalence of the different items. For example, you might have an item such as "steal a car", and another that is "steal a candy bar". The former is much more serious and subsequently less likely to occur. Going with my prior examples, pooling items together like this would force the random effects for the individual delinquency types, g_i, to all equal zero. Just looking at the data one can obviously tell that is not a good assumption.

Here I will provide an example via simulation in Stata to demonstrate this. First I generate an example dataset that has 1,000 individuals and 20 yes/no outcomes. The way the data are simulated is that each individual has a specific amount of self_control that decreases the probability of an outcome (with a coefficient of -0.5), and individuals are nested within a particular group (imagine different schools) that affects whether the outcome occurs or not. In addition to this, each individual has a random intercept (drawn from a normal distribution), and each question has a different fixed prevalence.

*Stata simulation
set more off
set seed 10
set obs 1000
generate caseid = _n
generate group = ceil(caseid/100) 
generate self_control = rnormal(0,1)
generate rand_int = rnormal(0,1)

*generating 20 outcomes that just have a varying intercept for each
forval i = 1/20 { 
  generate logit_`i' = -0.4 -0.5*self_control -0.1*group + 0.1*(`i'-10) + rand_int
  generate prob_`i' = 1/(1 + exp(-1*logit_`i'))
  generate outcome_`i' = rbinomial(1,prob_`i')
}
summarize prob_*
drop logit_* prob_* rand_int

And here is that final output:

. summarize prob_*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
      prob_1 |      1,000    .1744795    .1516094   .0031003   .9194385
      prob_2 |      1,000    .1868849     .157952   .0034252   .9265418
      prob_3 |      1,000     .199886    .1642414   .0037841   .9330643
      prob_4 |      1,000      .21348    .1704442   .0041804   .9390459
      prob_5 |      1,000    .2276601    .1765258    .004618   .9445246
      prob_6 |      1,000     .242416    .1824513   .0051012   .9495374
      prob_7 |      1,000    .2577337    .1881855   .0056347   .9541193
      prob_8 |      1,000    .2735951    .1936933   .0062236   .9583033
      prob_9 |      1,000     .289978    .1989401   .0068736    .962121
     prob_10 |      1,000    .3068564    .2038919    .007591   .9656016
     prob_11 |      1,000    .3242004    .2085164   .0083827   .9687729
     prob_12 |      1,000    .3419763    .2127823   .0092562   .9716603
     prob_13 |      1,000    .3601469    .2166605   .0102197   .9742879
     prob_14 |      1,000    .3786715    .2201237   .0112824   .9766776
     prob_15 |      1,000    .3975066    .2231474   .0124542   .9788501
     prob_16 |      1,000    .4166057    .2257093    .013746   .9808242
     prob_17 |      1,000    .4359203    .2277906   .0151697   .9826173
     prob_18 |      1,000       .4554    .2293751   .0167384   .9842454
     prob_19 |      1,000     .474993    .2304504   .0184663   .9857233
     prob_20 |      1,000    .4946465    .2310073   .0203689   .9870643

You can see from this list that each prob* variable then has a different overall prevalence, from around 17% for prob_1, climbing to around 50% for prob_20.

Now if you wanted to pool the items into one overall delinquency scale, you might estimate a binomial regression model (note this is not a negative binomial model!) like below (see Britt et al., 2017 for discussion).

*first I will show the binomial model in Britt
egen delin_total = rowtotal(outcome_*)
*Model 1
glm delin_total self_control i.group, family(binomial 20) link(logit)

Which shows the following results (note that the effect of self-control is too small, it should be around -0.5):

. glm delin_total self_control i.group, family(binomial 20) link(logit)

Iteration 0:   log likelihood =  -3536.491  
Iteration 1:   log likelihood = -3502.3107  
Iteration 2:   log likelihood = -3502.2502  
Iteration 3:   log likelihood = -3502.2502  

Generalized linear models                         No. of obs      =      1,000
Optimization     : ML                             Residual df     =        989
                                                  Scale parameter =          1
Deviance         =  4072.410767                   (1/df) Deviance =   4.117706
Pearson          =  3825.491931                   (1/df) Pearson  =    3.86804

Variance function: V(u) = u*(1-u/20)              [Binomial]
Link function    : g(u) = ln(u/(20-u))            [Logit]

                                                  AIC             =     7.0265
Log likelihood   = -3502.250161                   BIC             =  -2759.359

             |                 OIM
 delin_total |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
       group |
          2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
          3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
          4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
          5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
          6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
          7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
          8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
          9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
         10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
       _cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623

One of the things I wish the Britt paper mentioned is that the above binomial model is equivalent to a logistic regression model on the individual outcomes — but one that forces the predictions for each item to be the same within a person. So if you reshape the data from wide to long you can estimate that same binomial model as a logistic regression on the 0/1 outcomes.

*reshape wide to long
reshape long outcome_, i(caseid) j(question)
*see each person now has 20 questions each
*tab caseid

*regression model with the individual level data, should be equivalent to the aggregate binomial model
*Model 2
glm outcome_ self_control i.group, family(binomial) link(logit)

And here are the results:

. glm outcome_ self_control i.group, family(binomial) link(logit)

Iteration 0:   log likelihood = -12204.638  
Iteration 1:   log likelihood = -12188.762  
Iteration 2:   log likelihood = -12188.755  
Iteration 3:   log likelihood = -12188.755  

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,989
                                                  Scale parameter =          1
Deviance         =  24377.50934                   (1/df) Deviance =   1.219546
Pearson          =  19949.19243                   (1/df) Pearson  =   .9980085

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.219975
Log likelihood   = -12188.75467                   BIC             =  -173583.3

             |                 OIM
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
self_control |  -.3683605   .0156401   -23.55   0.000    -.3990146   -.3377065
       group |
          2  |   -.059046   .0666497    -0.89   0.376    -.1896769     .071585
          3  |  -.0475712   .0665572    -0.71   0.475     -.178021    .0828785
          4  |   .0522331   .0661806     0.79   0.430    -.0774786    .1819448
          5  |  -.1266052   .0672107    -1.88   0.060    -.2583357    .0051254
          6  |   -.391597   .0695105    -5.63   0.000     -.527835   -.2553589
          7  |  -.2997012   .0677883    -4.42   0.000    -.4325639   -.1668386
          8  |   -.267207   .0680807    -3.92   0.000    -.4006427   -.1337713
          9  |  -.4340516   .0698711    -6.21   0.000    -.5709964   -.2971069
         10  |  -.5695204    .070026    -8.13   0.000    -.7067689    -.432272
       _cons |  -.5584345   .0470275   -11.87   0.000    -.6506067   -.4662623

So you can see that Model 1 and Model 2 are exactly the same (in terms of estimates for the regression coefficients).

Model 2 though should show the limitations of using the binomial model — it predicts the same probability for each delinquency item, even though prob_1 is less likely to occur than prob_20. So for example, if we generate the predictions of this model, we can see that each question has the same predicted value.

predict prob_mod2, mu
sort question
by question: summarize outcome_ prob_mod2

And here are the results for the first four questions:

.     by question: summarize outcome_ prob_mod2

-> question = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
    outcome_ |      1,000        .183      .38686          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-> question = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
    outcome_ |      1,000        .205    .4039036          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-> question = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
    outcome_ |      1,000        .208    .4060799          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

-> question = 4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
    outcome_ |      1,000        .202    .4016931          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998

By construction, the binomial model on the aggregated totals is a bad fit to the data. It predicts that each question should have a probability of around 32% of occurring. Although you can’t fit the zero-inflated model discussed by Britt via the individual level logit approach (that I am aware of), that approach has the same limitation as the generic binomial model approach. Modeling the individual items just makes more sense when you have the individual items. It is hard to think of examples where such a restriction would be reasonable for delinquency items.

So here a simple update is to include a dummy variable for each item. Here I also cluster the standard errors on caseid, since the items are nested within individuals.

*Model 3
glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)

And here are the results:

.     glm outcome_ self_control i.group i.question, family(binomial) link(logit) cluster(caseid)

Iteration 0:   log pseudolikelihood = -11748.056  
Iteration 1:   log pseudolikelihood = -11740.418  
Iteration 2:   log pseudolikelihood = -11740.417  
Iteration 3:   log pseudolikelihood = -11740.417  

Generalized linear models                         No. of obs      =     20,000
Optimization     : ML                             Residual df     =     19,970
                                                  Scale parameter =          1
Deviance         =  23480.83406                   (1/df) Deviance =   1.175805
Pearson          =  19949.15609                   (1/df) Pearson  =   .9989562

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = ln(u/(1-u))             [Logit]

                                                  AIC             =   1.177042
Log pseudolikelihood = -11740.41703               BIC             =  -174291.8

                             (Std. Err. adjusted for 1,000 clusters in caseid)
             |               Robust
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
self_control |  -.3858319   .0334536   -11.53   0.000    -.4513996   -.3202641
       group |
          2  |  -.0620222   .1350231    -0.46   0.646    -.3266626    .2026182
          3  |   -.049852   .1340801    -0.37   0.710    -.3126442    .2129403
          4  |   .0549271   .1383412     0.40   0.691    -.2162167    .3260709
          5  |  -.1329942   .1374758    -0.97   0.333    -.4024419    .1364535
          6  |  -.4103578   .1401212    -2.93   0.003    -.6849904   -.1357253
          7  |  -.3145033   .1452201    -2.17   0.030    -.5991296   -.0298771
          8  |  -.2803599   .1367913    -2.05   0.040    -.5484659    -.012254
          9  |  -.4543686   .1431314    -3.17   0.002    -.7349011   -.1738362
         10  |  -.5962359   .1457941    -4.09   0.000    -.8819872   -.3104847
    question |
          2  |   .1453902   .1074383     1.35   0.176    -.0651851    .3559654
          3  |   .1643203   .1094113     1.50   0.133     -.050122    .3787625
          4  |   .1262597   .1077915     1.17   0.241    -.0850078    .3375272
          5  |   .1830563    .105033     1.74   0.081    -.0228047    .3889173
          6  |   .3609468   .1051123     3.43   0.001     .1549304    .5669633
          7  |    .524749    .100128     5.24   0.000     .3285017    .7209963
          8  |   .5768412   .1000354     5.77   0.000     .3807754     .772907
          9  |   .7318797   .1021592     7.16   0.000     .5316513    .9321081
         10  |    .571682   .1028169     5.56   0.000     .3701646    .7731994
         11  |    .874362   .0998021     8.76   0.000     .6787535     1.06997
         12  |   .8928982   .0998285     8.94   0.000     .6972379    1.088559
         13  |   .8882734   .1023888     8.68   0.000      .687595    1.088952
         14  |   .9887095   .0989047    10.00   0.000     .7948599    1.182559
         15  |   1.165517   .0977542    11.92   0.000     .9739222    1.357111
         16  |   1.230355   .0981687    12.53   0.000     1.037948    1.422762
         17  |   1.260403   .0977022    12.90   0.000      1.06891    1.451896
         18  |   1.286065    .098823    13.01   0.000     1.092376    1.479755
         19  |   1.388013   .0987902    14.05   0.000     1.194388    1.581638
         20  |   1.623689   .0999775    16.24   0.000     1.427737    1.819642
       _cons |  -1.336376   .1231097   -10.86   0.000    -1.577666   -1.095085

You can now see that the predicted values for each individual item are much more reasonable. In fact they are a near perfect fit.

predict prob_mod3, mu
by question: summarize outcome_ prob_mod2 prob_mod3

And the results:

.     by question: summarize outcome_ prob_mod2 prob_mod3

-> question = 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
    outcome_ |      1,000        .183      .38686          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .183    .0672242   .0475809   .4785903

-> question = 2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
    outcome_ |      1,000        .205    .4039036          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .205    .0729937   .0546202   .5149203

-> question = 3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
    outcome_ |      1,000        .208    .4060799          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .208    .0737455    .055606   .5196471

-> question = 4

    Variable |        Obs        Mean    Std. Dev.       Min        Max
    outcome_ |      1,000        .202    .4016931          0          1
   prob_mod2 |      1,000      .32305    .0924081   .1049203   .6537998
   prob_mod3 |      1,000        .202    .0722336   .0536408   .5101407

If you want, you can also test whether any "treatment" effect (or here the level of a person's self control) has differential effects across the different delinquency items.

*Model 4
glm outcome_ self_control i.group i.question c.self_control#i.question, family(binomial) link(logit) cluster(caseid)
*can do a test of all the interactions equal to zero at once
testparm c.self_control#i.question

I've omitted this output, but here of course the effect of self control is simulated to be the same across the different items, so one fails to reject the null that the interaction terms are all zero.

Given the way I simulated the data, the actual correct model is a random effects one. You should notice that in each of the prior models the effect of self control is too small. One way to estimate that model in Stata is shown below:

*Model 5
melogit outcome_ self_control i.group i.question || caseid:

And here are the results:

. melogit outcome_ self_control i.group i.question || caseid:

Fitting fixed-effects model:

Iteration 0:   log likelihood = -11748.056  
Iteration 1:   log likelihood = -11740.418  
Iteration 2:   log likelihood = -11740.417  
Iteration 3:   log likelihood = -11740.417  

Refining starting values:

Grid node 0:   log likelihood =  -10870.54

Fitting full model:

Iteration 0:   log likelihood =  -10870.54  
Iteration 1:   log likelihood = -10846.176  
Iteration 2:   log likelihood = -10845.969  
Iteration 3:   log likelihood = -10845.969  

Mixed-effects logistic regression               Number of obs     =     20,000
Group variable:          caseid                 Number of groups  =      1,000

                                                Obs per group:
                                                              min =         20
                                                              avg =       20.0
                                                              max =         20

Integration method: mvaghermite                 Integration pts.  =          7

                                                Wald chi2(29)     =    1155.07
Log likelihood = -10845.969                     Prob > chi2       =     0.0000
    outcome_ |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
self_control |  -.4744173   .0372779   -12.73   0.000    -.5474807    -.401354
       group |
          2  |  -.0648018   .1642767    -0.39   0.693    -.3867783    .2571746
          3  |  -.0740465   .1647471    -0.45   0.653    -.3969449    .2488519
          4  |    .036207   .1646275     0.22   0.826     -.286457     .358871
          5  |  -.1305605   .1645812    -0.79   0.428    -.4531337    .1920126
          6  |  -.5072909   .1671902    -3.03   0.002    -.8349776   -.1796042
          7  |  -.3732567    .165486    -2.26   0.024    -.6976032   -.0489102
          8  |  -.3495889   .1657804    -2.11   0.035    -.6745126   -.0246653
          9  |  -.5593725   .1675276    -3.34   0.001    -.8877205   -.2310245
         10  |  -.7329717   .1673639    -4.38   0.000    -1.060999   -.4049445
    question |
          2  |   .1690546   .1240697     1.36   0.173    -.0741177    .4122268
          3  |    .191157   .1237894     1.54   0.123    -.0514657    .4337797
          4  |   .1467393   .1243586     1.18   0.238    -.0969991    .3904776
          5  |   .2130531   .1235171     1.72   0.085     -.029036    .4551422
          6  |   .4219282   .1211838     3.48   0.000     .1844123    .6594441
          7  |   .6157484   .1194133     5.16   0.000     .3817027    .8497941
          8  |   .6776651   .1189213     5.70   0.000     .4445837    .9107465
          9  |   .8626735   .1176486     7.33   0.000     .6320865    1.093261
         10  |   .6715272   .1189685     5.64   0.000     .4383532    .9047012
         11  |   1.033571   .1167196     8.86   0.000     .8048051    1.262338
         12  |    1.05586    .116615     9.05   0.000     .8272985    1.284421
         13  |   1.050297   .1166407     9.00   0.000     .8216858    1.278909
         14  |   1.171248   .1161319    10.09   0.000     .9436331    1.398862
         15  |   1.384883   .1154872    11.99   0.000     1.158532    1.611234
         16  |   1.463414   .1153286    12.69   0.000     1.237375    1.689454
         17  |   1.499836   .1152689    13.01   0.000     1.273913    1.725759
         18  |   1.530954   .1152248    13.29   0.000     1.305117     1.75679
         19  |   1.654674   .1151121    14.37   0.000     1.429058    1.880289
         20  |   1.941035   .1152276    16.85   0.000     1.715193    2.166877
       _cons |  -1.591796   .1459216   -10.91   0.000    -1.877797   -1.305795
caseid       |
   var(_cons)|   1.052621   .0676116                      .9281064     1.19384
LR test vs. logistic model: chibar2(01) = 1788.90     Prob >= chibar2 = 0.0000

That model comes the closest to estimating the correct effect of self control (-0.5). The estimate is still a bit small (at -0.47), but it is within one standard error of the true value. (Another way to estimate this model is to use xtlogit, but with melogit you can actually extract the random effects. That will have to wait until another blog post though.)
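
For the curious, here is a rough sketch of those two alternatives: the xtlogit version of Model 5, and the melogit post-estimation call that (I believe) extracts the empirical Bayes estimates of the person random effects. Treat the exact predict syntax as an assumption to verify in your version of Stata.

*sketch of the alternatives mentioned above
*xtlogit version of Model 5, random intercepts for caseid
xtset caseid
xtlogit outcome_ self_control i.group i.question, re

*after melogit you can obtain the person random effects
melogit outcome_ self_control i.group i.question || caseid:
predict re_caseid, reffects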

Another way to think about this model is related to item-response theory, where individuals have a latent estimate of how smart they are, and questions have a latent easiness/hardness. In Stata you might fit that with the code below, but a warning, it takes a while to converge. (Not sure why, the fixed effects for questions are symmetric, so assuming a random effect distribution should not be too far off. If you have any thoughts as to why, let me know!)

*Model 6
melogit outcome_ self_control i.group || caseid: || question:

For an academic reference to this approach see Osgood et al. (2002). Long story short, model the individual items, but pool them together in one model!


New course in the spring – Crime Science

This spring I will be teaching a new graduate level course, Crime Science. A better name for the course would be evidence based policing tactics to reduce crime — but that name is too long!

Here you can see the current syllabus. I also have a page for the course, which I will update with more material over the winter break.

Given my background it has a heavy focus on hot spots policing (different tactics at hot spots, time spent at hot spots, crackdowns vs long term). But the class covers other policing strategies as well, such as targeting chronic offenders, the focused deterrence gang model, and CPTED. We also discuss the use of technology in policing (e.g. CCTV, license plate readers, body-worn cameras).

I will weave in ethical discussions throughout the course, but I reserved the last class to specifically talk about predictive policing strategies. In particular the two main concerns are increasing disproportionate minority contact through prediction, and privacy concerns with police collecting various pieces of information.

So take my course!

New working paper: Mapping attitudes towards the police at micro places

I have a new preprint posted, Mapping attitudes towards the police at micro places. This is work with Jasmine Silver, as well as Rob Worden and Sarah McLean. See the abstract:

We demonstrate the utility of mapping community satisfaction with the police at micro places using data from citizen surveys conducted in 2001, 2009 and 2014 in one city. In each survey, respondents provided the nearest intersection to their address. We use inverse distance weighting to map a smooth surface of satisfaction with police over the entire city, which shows broader neighborhood patterns of satisfaction as well as small area hot spots of dissatisfaction. Our results show that hot spots of dissatisfaction with police do not conform to census tract boundaries, but rather align closely with hot spots of crime and police activity. Models predicting satisfaction with police show that local counts of violent crime are the strongest predictors of attitudes towards police, even above individual level predictors of race and age.

In this article we make what are essentially analogs of crime hot spot maps, but measuring dissatisfaction with the police instead.

One of the interesting findings is that these hot spots do not align nicely with census tracts (the tracts are generalized, we cannot divulge the location of the city). So the areas identified by each procedure would be much different.

As always, feel free to comment or send me an email if you have feedback on the article.

Monitoring homicide trends paper published

My paper, Monitoring Volatile Homicide Trends Across U.S. Cities (with coauthor Tom Kovandzic) has just been published online in Homicide Studies. Unfortunately, Homicide Studies does not give me a link to share a free PDF like other publishers, but you can either grab the pre-print on SSRN or always just email me for a copy of the paper.

They made me convert all of the charts to grey scale :(. Here is an example of the funnel chart for homicide rates in 2015.

And here are example fan charts I generated for a few different cities.

As always if you have feedback or suggestions let me know! I posted all of the code to replicate the analysis at this link. The prediction intervals can definitely be improved, both in their coverage and in making their length smaller, so I hope to see other researchers tackling this as well.

Creating an animated heatmap in Excel

I've been getting emails recently about the online Carto service not continuing their free use model. I've previously used this service to create animated heatmaps over time, in particular a heatmap of reported meth labs over time. That map still currently works, but I'm not sure for how long. The functionality can be replicated in recent versions of Excel though, so I will do a quick walkthrough of how to make an animated map. The csv to follow along with, as well as the final produced Excel file, you can download from this link.

I split the tutorial into two parts. Part 1 is prepping the data so the Excel 3d Map will accept it. Part 2 is making the map pretty.

Prepping the Data

The first steps before we can make the map in Excel are:

  1. eliminate rows with missing dates
  2. turn the data into a table
  3. explicitly set the date column to a date format
  4. save as an excel file

We need to do those four steps before we can worry about the mapping part. (It took me forever to figure out it did not like missing data in the time field!)

So first after you have downloaded that data, double click to open the Geocoded_MethLabs.csv file in Excel. Once that sheet is open select the G column, and then sort Oldest to Newest.

It will give you a pop-up to Expand the selection – keep that default checked and click the Sort button.

After that scroll down to the current bottom of the spreadsheet. There are around 30+ records in this dataset that have missing dates. Go ahead and select the row labels on the left, which highlights the whole row. Once you have done that, right click and then select Delete. Again you need to eliminate those missing records for the map to accept the time field.

After you have done that, select the bottom right most cell, L26260, then scroll back up to the top of the worksheet, hold shift, and select cell A1 (this should highlight all of the cells in the sheet that contain data). After that, select the Insert tab, and then select the Table button.

In the pop-up you can keep the default that the table has headers checked. If you lost the selection range in the prior step, you can simply enter it in as =$A$1:$L$26260.

After that is done you should have a nice blue formatted table. Select the G column, and then right click and select Format Cells.

Change that date column to a specific date format, here I just choose the MM/DD/YY format, but it does not matter. Excel just needs to know it represents a date field.

Finally, you need to save the file as an Excel file before we can make the maps. To do this, click File in the top left header menus, and then select Save As. Choose where you want to save the file, and then in the Save as Type dropdown in the bottom of the dialog select xlsx.

Now the data is all prepped to create the map.

Making an Animated Map

Now in this part we basically just do a set of several steps to make our map recognize the correct data and then make the map look nice.

With the prior data all prepped, you should be able to now select the 3d Map option that you can access via the Insert menu (just to the right of where the Excel charts are).

Once you click that, you should get a map opened up that looks like mine below.

Here it actually geocoded the points based on the address (very fast as well). So if you only have address data you can still create some maps. Here I want to change the data though so it uses my Lat/Lon coordinates. In the little table on the far right side, under Layer 1, I deleted all of the fields except for Lat by clicking the large X to their right (see the X circled in the screenshot below). Then I selected the + Add Field option, and then selected my Lng field.

After you select that you can select the dropdown just to the right of the field and set it as Longitude. Next navigate down slightly to the Time option, and there select the DATE field.

Now here I want to make a chart similar to the Carto graph that is of the density, so in the top of the layer column I select the blob looking thing (see its drawn outline). And then you will get various options like the below screenshot. Adjust these to your liking, but for this I made the radius of influence a bit larger, and made the opacity not 100%, but slightly transparent at 80%.

Next up is setting the color of the heatmap. The default color scale uses the typical rainbow, which should be avoided for multiple reasons, one of which is color-blindness. So in the dropdown for colors select Custom, and then you will get the option to create your own color ramp. If you click on one of the color swatches you will then get options to specify the color in a myriad of ways.

Here I use the multi-hue pink-purple color scheme via ColorBrewer with just three steps. You can see in the above screenshot I set the lowest pink step via the RGB colors (which you can find on the color brewer site.) Below is what my color ramp looks like in the end.

Next we want to set the style of the map. I like the monotone backgrounds, as they make the animated kernel density pop out much more (see also my blog post, When should we use a black background for a map). It is easy to experiment with all of these different settings though and see which ones you like more for your data.

Next I am going to change the format of the time notation in the top right of the map. Left click to select the box around the time part, and then right click and select Edit.

Here I change to the simpler Month/Year. Depending on how fast the animation runs, you may just want to change it to year. But you can leave it more detailed if you are manually dragging the time slider to look for trends.

Finally, the current default is to show all of the data permanently. There are examples where you may want to do that (see the famous example by Nathan Yau mapping the growth of Wal Mart), but here we do not want that. So navigate back to the Layer options on the right hand side, and in the little tiny clock above the Time field select the dropdown, and change it to Data shows for an instant.

Finally I select the little cog in the bottom of the map window to change the time options. Here I set the animation to run longer at 30 seconds. I also set the transition duration to slightly longer at 5 seconds. (Think of the KDE as a moving window in time.)

After that you are done! You can zoom in the map, set the slider to run (or manually run it forward/backward). Finally you can export the map to an animated file to share or use in presentations if you want. To do that click the Create Video option in the toolbar in the top left.

Here is my exported video

Now go make some cool maps!


To aid in reproducible analysis, I often have a set of FILE HANDLE commands at the header of my syntax. For example, here is basically what the top of most of my syntax files looks like.


*Simple description here of what the syntax does.

FILE HANDLE data /NAME = "H:\ProjectX\OriginalData".
FILE HANDLE save /NAME = "C:\Users\axw161530\Dropbox\Documents\BLOG\FileHandles_SPSS".

What those commands do is point to particular locations on my machine that either have the data I will use for the syntax, or where I will save the subsequent results. So instead of having to write something like:

GET FILE = "H:\ProjectX\OriginalData\SPSS_Dataset1.sav".

I can just write something like:

GET FILE = "data\SPSS_Dataset1.sav".

The same works for where I save the files, so I would use the second SAVE line instead of the first after I’ve defined a file handle.

*SAVE OUTFILE = "C:\Users\axw161530\Dropbox\Documents\BLOG\FileHandles_SPSS\TransformedData1.sav".
SAVE OUTFILE = "save\TransformedData1.sav".

Besides being shorter, this greatly aids when you need to move files around. Say I needed to move where I saved the files from my personal drive to a work H drive. If you use file handles, all you need to do is change the one line of code at the top of your syntax. If you used absolute references you would need to edit the file at every location you used that absolute path (e.g. every GET FILE, or GET TRANSLATE, or SAVE).

Another trick you can use is that you can stack multiple file handles together. Consider this example.

FILE HANDLE base /NAME = "H:\ProjectX".
FILE HANDLE data /NAME = "base\Datasets".
FILE HANDLE report /NAME = "base\Reports".

In this case I defined a base location, and then defined the data and report file handles as folders within the base file handle. Again if you need to move this entire project, all you need to do is edit that original base file handle location.

A final note, if I have a set of complicated code that I split up into multiple syntax files, one trick you can use is to place all of your file handles into one syntax file, and then use INSERT to call that syntax. Depending on the structure though, you may need to edit the INSERT call at the header of all of the sub-syntaxes, so it may not be any less work.

I should also mention you could use the CD command to some of the same effect. So instead of defining a save file handle, you could do:

CD "C:\Users\axw161530\Dropbox\Documents\BLOG\FileHandles_SPSS".
SAVE OUTFILE = "TransformedData1.sav"

I don’t typically change SPSS’s current directory, but there are legitimate reasons to do so. If SPSS needs write access to the directory in particular, this is sometimes easier than dealing with permissions. That does not come up often for me, and most of the time I have data files in many different places, so I typically just stick with using file handles.

Accessing FILE HANDLES in Python or R

If you are using file handles in SPSS code, you may want to be able to access those file handles the same way in Python or R code called within SPSS. This may be for much of the same reason — you may want to access data or save data in those programs, but use the file handles you have identified. Albert-Jan Roskam on the Nabble group showed how to do this.

So for python you could do below:

import spssaux

#Getting the SPSS file handles
file_handles = spssaux.FileHandles().resolve

#Grabbing a particular handle
data_loc = file_handles('data')

#Setting the python directory to the save location
import os
save_loc = file_handles('save')
os.chdir(save_loc)

And for R you can do:

fh <- spssdata.GetFileHandles()
file_handles <- sapply(fh, function(x) x[2])
names(file_handles) <-  sapply(fh, function(x) x[1])

#Accessing a particular file handle
data_loc <- file_handles['data']

#setting the R directory to the save handle
setwd(file_handles['save'])

In either case it is nice because you again can just edit one line at the top of your syntax if you change file locations, instead of having to edit multiple lines in the subsequent syntax file.

Presentation at ASC: Crime Data Visualization for the Future

At the upcoming American Society of Criminology conference in Philadelphia I will be presenting a talk, Crime Data Visualization for the Future. Here is the abstract:

Open data is a necessary but not sufficient condition for data to be transparent. Understanding how to reduce complicated information into informative displays is an important step for those wishing to understand crime and how the criminal justice system works. I focus the talk on using simple tables and graphs to present complicated information using various examples in criminal justice. Also I describe ways to effectively evaluate the size of effects in regression models, and make black box machine learning models more interpretable.

But I have written a paper to go with the talk as well. You can download that paper here. As always, if you have feedback/suggestions let me know.

Here are some example graphs of plotting the predictions from a random forest model predicting when restaurants in Chicago will fail their inspections.

I present on Wednesday 11/15 at 11 am. You can see the full session here. Here is a quick rundown of the other papers — Marcus was the one who put together the panel.

  • A Future Proposal for the Model Crime Report – Marcus Felson
  • Crime Data Warehouses and the future of Big Data in Criminology – Martin Andresen
  • Can We Unify Criminal Justice Data, Like the Dutch and the Nordics? – Michael Mueller-Smith

So it should be a great set of talks.

I also signed up to present a poster, Mapping Attitudes Towards the Police at Micro Places. This is work with Albany Finn Institute folks, including Jasmine Silver, Sarah McLean, and Rob Worden. Hopefully I will have a paper to share about that soon, but for a teaser on that here is an example map from that work, showing hot spots of dissatisfaction with the police estimated via inverse distance weighting. Update: for those interested, see here for the paper and here for the poster. Stop on by Thursday to check it out!

And here is the abstract:

We demonstrate the utility of mapping community satisfaction with the police at micro places using data from citizen surveys conducted in 2001, 2009 and 2014 in one city. In each survey, respondents provided the nearest intersection to their address. We use inverse distance weighting to map a smooth surface of satisfaction with police over the entire city, which shows broader neighborhood patterns of satisfaction as well as small area hot spots of dissatisfaction. Our results show that hot spots of dissatisfaction with police do not conform to census tract boundaries, but rather align closely with hot spots of crime and police activity. Models predicting satisfaction with police show that local counts of violent crime are the strongest predictors of attitudes towards police, even above individual level predictors of race and age.

My Protips for College

Last week I gave a lecture to prospective undergrad students here at UT Dallas. At the end of the talk I gave some pro-tips for undergrads — things if I could go back in time and talk to myself this is what I would say. Here are those points, with a bit more personal elaboration than I gave during my talk. This is for undergrads going for a four year degree, it is not really applicable to graduate students (and mostly not applicable to community college students).

Pick a Major

My first major piece of advice is that all students should pick a major. You should not go undeclared.

Personally I did not have this problem — I haphazardly picked criminal justice and enjoyed it enough to stick with it. I had several friends though who got stuck in the quagmire of never being able to decide on a major.

There are a few reasons I have this piece of advice. One is that you should really think about why you are going to college in the first place — you should have some cognizable career aspirations that guide why you are there. Those aspirations will obviously guide what major you choose. You don't have to worry though if you are not fully committed to that major. It is totally normal for an 18 year old to not be mature enough to make that sort of commitment. If you go for biology and realize that lab work sucks, you can change majors. So picking a major at the start is not like marriage — it is more like going on a date.

Part of the issue of going undeclared is people have the expectation that they will figure out a subject they really like later on (via a class) and then choose a major. Haphazardly taking classes does not guarantee this though. This is especially the case with introduction classes — which can often suck compared to higher level classes. Going again with the analogy to dating, you should have multiple dates (classes) in a particular major before you decide if it is/is not for you (you could always get a crappy teacher in one class as well). Being undeclared you only get minimal exposure to any particular subject. So it is better to pick something, with the mind that you might change later on, as the class that changes your mind may never come.

Finally, you should always have those career goals in mind. You shouldn’t decide to be a professional tennis player because you liked that GenEd tennis course you were randomly assigned. So ultimately you need to base what you want to major in on factors outside of how much you like the subject material. Some career opportunities are not strongly tied to a major — you can be a cop with an English degree if you want. That is not always the case though.

An additional point related to this is that you have more opportunities/majors in larger universities. If you pick the small private university and you decide you don’t want to major in education, you will be hamstrung in what other potential majors you can choose from, which leads to my next point.

More Interesting >> Easier

Second piece of advice — don’t pick easier classes over ones you think will be more interesting.

For college students, this is really the first opportunity you have in your education to shape your schedule. (You have some agency in high school, but nothing like college.) When you go for a four year degree, about half of your college courses will be general education (GenEd). For example at Bloomsburg IIRC I had to take a collection of four classes in each area of Humanities, Math & Sciences, and Social Sciences — as well as additional writing classes. These you have to choose a particular subject area (e.g. Economics, History), but any class in that department counts towards your GenEd requirements.

Now the seemingly obvious choice is to take the introduction classes in these majors, as they will be easier. This is a mistake I made. First, the intro classes can be mind numbingly boring. Second, they aren’t guaranteed to even be less work than higher level courses. Third, and most important, it ends up being much easier to do the work/show up to an interesting class than it is to wade through a really boring one. Let me say that again another way — you will not go to class if it is really boring, and maybe fail. It is easier to put the work in for classes you enjoy.

This works the same when picking a major. STEM has a reputation as being harder. Yes, aero-space engineering is really hard, but the STEM field is not filled with geniuses — they are normal people. I'd hasten to say that writing English good is not so easy either. (In most social sciences you will write — a lot.) Painting something people want to buy is probably a lot harder than learning about biology. Teaching elementary children is madness. Take chemistry if you think it is interesting — don't be afraid because of some stereotype that you think it will be really hard.

The example that most comes to mind for me was a history general education requirement. I had the option one semester to do either a 300 level course on Japanese history, or a 100 level course on something like 16th century western European history. I chose the western European course, being concerned the 300 level course would be more work. It was a terrible decision — the 100 level course was quite a bit of work, requiring writing several essays and reading very boring material. This variation is pretty typical — variation between teachers is much larger than variation in how hard the 100, 200, 300 level courses are. About the only guarantee is that a mass lecture course will be less work (mostly exams) compared to smaller courses, but those can be real stinkers. (Do research on the professors, grab a syllabus, ask someone who took the course previously, go talk to them during office hours or send an email. You can often weed out whether it is a good class or a stinker before having to sit through it. Avoid generic classes taught by someone for the very first time.)

Some courses have pre-requisites, but in many cases you can ask permission to take the course even if you have not fulfilled them. You should just use your best judgment about whether the pre-requisites are really necessary or just bureaucratic. My communities and crime undergrad course has a pre-requisite of introduction to criminology, but I highly doubt an engaged student would not be able to follow along. For differential equations you should probably have a good grasp of calculus, though (something like linear algebra would probably be ok to take with just high school mathematics, so it is dependent on the context).

In retrospect I would have also chosen various stats/econometrics courses in the economics department over the introduction to macro economics. It is really hard to convince yourself to wake up at 8 in the morning Tuesdays and Thursdays if you really hate the course. Again I did not take those higher level econ courses as I was worried about being able to pass, although I shouldn’t have been. Which leads to my next point.

Don’t be Afraid to Fail!

Third piece of advice is related to the second — don’t be afraid to fail. So you should not forgo a particular path because you are concerned about failing. It is ok to fail.

It is ok to fail for a few reasons. One is that most universities have built in the ability to drop courses. So if you take a physics class and can’t keep up with the material, you can drop the course about half way through the semester. It is pretty easy to make up those credits over the summer (if you even need to make them up at all). The system is made so you can fail a few times and it has no impact on your GPA.

The second reason it is ok to attempt harder classes is because it is ok to have B’s/C’s. Students have been primed for their whole education about the importance of grades. I’m here to tell you now that grades don’t matter all that much. The sky will not fall down if you get a C. The difference between say a 3.3 GPA and a 3.8 GPA means very little for the majority of potential career opportunities.

Now, I'm not saying you can totally stink up the joint with grades (many schools have a minimum GPA requirement), but most folks aren't trying to get into Johns Hopkins medical school — you will be fine if you take a harder course and get a C — and you will probably learn more in that hard course than taking something easy but boring. You are ultimately paying for your education. You might as well take advantage of the opportunities to learn.

I do have a few C’s on my undergraduate transcript. One is that macro-economics I talked about earlier — I only showed up to around half of the classes. Another though is the Design of Experiments — the first statistics course in the math department I took. I had to get permission to be in the course, and I did not understand everything. But after the semester the professor told me I was one of the best students in the class (despite the C – the material was difficult). I subsequently took quite a few more stats classes from that professor (I got all A’s from then on out I’m pretty sure). Taking those harder classes really prepared me for graduate school compared to many within my cohort, as well as gave me a leg up in many more research oriented positions while I was at SUNY Albany.

Talk on Scholars Day – Crime in Space and Time

I will be giving a talk tomorrow (10/21/17) at Scholars Day here at UT Dallas (where we get visits from prospective students). Here is the synopsis of my talk:

Synopsis: In this lecture, Dr. Andrew Wheeler will discuss his research on the spatial and temporal patterns of crime. He will discuss whether recent homicide trends are atypical given historical data and if you can predict which neighborhoods in Dallas have the most crime. He will also discuss what to expect from an education in criminology and the social sciences in general.

I will be at JSOM 2.106 from 11 to 11:45. Here is a bit of a sneak peek. (You will also get some Hans Rosling style animated charts of homicide trends!)

I will also discuss some of my general pro-tips for incoming college students. I will expand that into a short post next week, but if you want that advice a few days ahead come to my talk!

Some notes on PChange – estimating when trajectories cross over time

J.C. Barnes and company published a paper in JQC not too long ago and came up with a metric, PChange, to establish the number of times trajectories cross in a sample. This is more of interest to life course folks, although it is not totally far fetched to see it applied to trajectories of crime at places. Part of my interest in it was simply that it is an interesting statistical question — when two trajectories with errors cross. A seemingly simple question that has a few twists and turns. Here are my subsequent notes on that metric.

The Domain Matters

First, here is an example of the trajectories not crossing:

This points to an important assumption about those lines not crossing though that was never mentioned in the Barnes paper — the domain matters. For instance, if we draw those rays further back in time what happens?

They cross! This points to an important piece of information when evaluating PChange — the temporal domain in which you examine the data matters. So if you have a sample of juvenile delinquency measures from 14-18 you would find less change than a similar sample from 12-20.

This isn’t really a critique of PChange — it is totally reasonable to only want to examine changes within a specific domain. Who cares if delinquency trajectories cross when people are babies! But it should be an important piece of information researchers use in the future if they use PChange — longer samples will show more change. It also won’t be fair to compare PChange for samples of different lengths.

A Functional Approach to PChange

For above you may ask — how would you tell if a trajectory crosses outside of the domain of the data? The answer to that question is you estimate an underlying function of the trajectory — some type of model where the outcome is a function of age (or time). With that function you can estimate the trajectory going back in time or forward in time (or in between sampled measurements). You may not want to rely on data outside of the domain (its standard error will be much higher than data within the time domain, forecasting is always fraught with peril!), but the domain of your sample is ultimately arbitrary. So what about the question will the trajectories ever cross? Or would the trajectories have crossed if I had data for ages 12-20 instead of just 16-18? Or would they have crossed if I checked the juveniles at age 16 1/2 instead of only at 16?

So actually instead of the original way the Barnes paper formulated PChange, here is how I thought about calculating PChange. First you estimate the underlying trajectory for each individual in your sample, then you take the difference of those trajectories.

y_i = f(t)
y_j = g(t)
y_delta = f(t) - g(t) = d(t)

Where y_i is the outcome y for observation i, and y_j is the outcome y for observation j. t is a measure of time, and thus the anonymous functions f and g represent growth models for observations i and j over time. y_delta is then the difference between these two functions, which I represent as the new function d(t). So for example the functions for each individual might be quadratic in time:

y_i = b0i + b1i(t) + b2i(t^2)
y_j = b0j + b1j(t) + b2j(t^2)

Subsequently the difference function will also be quadratic, and can be simply represented as:

y_delta = (b0i - b0j) + (b1i - b1j)*t + (b2i - b2j)*t^2

Then for the trajectories to cross (or at least touch), y_delta just then has to equal zero at some point along the function. If this were math, and the trajectories had no errors, you would just set d(t) = 0 and solve for the roots of the equation. (Most people estimating models like these use functions that do have roots, like polynomials or splines). If you cared about setting the domain, you would then just check if the roots are within the domain of interest — if they are, the trajectories cross, if they are not, then they do not cross. For data on humans with age, obviously roots for negative human years will not be of interest. But that is a simple way to solve the domain problem – if you have an underlying estimate of the trajectory, just see how often the trajectories cross within equivalent temporal domains in different samples.
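
To make the root-finding step concrete, here is a minimal Stata sketch with made-up coefficients for the difference function d(t) = a + b*t + c*t^2, checking whether either root falls within a hypothetical sampled domain of ages 14 to 18:

*minimal sketch, hypothetical coefficients for d(t) = a + b*t + c*t^2
scalar a = -2
scalar b = 1.2
scalar c = -0.1
scalar disc = b^2 - 4*c*a
if disc >= 0 {
    scalar r1 = (-b + sqrt(disc))/(2*c)
    scalar r2 = (-b - sqrt(disc))/(2*c)
    display "roots at t = " r1 " and t = " r2
    *do the trajectories cross within the sampled domain of ages 14 to 18?
    display "cross within domain: " (inrange(r1,14,18) | inrange(r2,14,18))
}
else {
    display "no real roots, the trajectories never cross or touch"
}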

I’d note that the idea of having some estimate of the underlying trajectory is still relevant even within the domain of the data — not just extrapolating to time periods outside. Consider two simple curves below, where the points represent the time points where each individual was measured.

So while the two functions cross, if you only consider the sampled locations, as is done in Barnes et al.’s PChange, you would say these trajectories do not cross, when in actuality they do. It is just that the sampled locations do not happen to fall at the critical point for these two trajectories.

This points to another piece of contextual information important for interpreting PChange: the number of sample points matters. If you have samples every 6 months, you will likely find more changes than if you just had samples every year.

I don’t mean to bag on Barnes’ original metric too much; his PChange metric does not rely on estimating the underlying functional form, and so is a non-parametric approach to identifying change. But estimating the functional form for each individual has some additional niceties. One is that you do not need the measures to be at equivalent sample locations: you can compare someone measured at 11, 13, and 18 to someone who is measured at 12, 16, and 19. For people analyzing data on really young kids I bet this is a major point, since the underlying function at a specific age is more important than when you happened to conveniently measure the outcome. For older kids, though, I imagine comparing the 12 year old to the 11 year old (but in the same class) is probably not a big deal for delinquency. It does make it easier, though, to compare different cohorts in which the measures are not at nice regular intervals (e.g. Add Health, the NLSY, or anytime you have missing observations in a longitudinal survey).

In the end you would only want to estimate an underlying functional form if you have many measures (more than the 3 in my example), but this typically ties in nicely with what people modeling behavior over time are already doing: modeling the growth trajectories using some type of functional form, whether via a random effects model, a group based trajectory model, etc. Either way you get an underlying functional form. If you are willing to assume that model is good enough to describe the trajectories over time, you should think it is good enough to calculate PChange!

The Null Matters

So far this would be fine and dandy if we had perfect estimates of the underlying trajectories. We don’t though, so you may ask: even if y_delta does not exactly equal zero anywhere, its error bars might be quite wide, so wouldn’t we still infer that there is a high probability the two trajectories cross? This points to another hidden assumption of Barnes’ PChange: the null matters. In the original PChange the null is that the two trajectories do not cross; you need a sufficiently large change in consecutive time periods, relative to the standard error, to conclude they cross. If the standard error is high, you won’t consider the lines to cross. Consider the simple table below:

Period  A_Level  A_SE  B_Level  B_SE
1       4        1     1.5      0.5
2       5        1     3.0      0.5
3       6        1     4.5      0.5
4       7        1     6.0      0.5

Where A_Level and B_Level refer to the outcome for the four time periods, and A_SE and B_SE refer to the standard errors of those measurements. Here is the graph of those two trajectories, with the standard error drawn as areas for the two functions (only plus minus one standard error for each line).

And here is the graph of the differences — assuming that the covariance between the two functions is zero (so the standard error of the difference equals sqrt(A_SE^2 + B_SE^2)). Again only plus/minus one standard error.

You can see that the line never crosses zero, but the standard error area does. If our null is H0: y_delta = 0 for any t, then we would fail to reject the null in this example. So in Barnes’ original PChange these example lines would not cross, whereas with my functional approach we don’t have enough data to conclude that they don’t cross. I suspect this would make a big difference in many samples, as the standard error is going to be quite large unless you have very many observations and/or very many time points.
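
Here is that check in a quick sketch, using the little table above and a plus/minus one standard error band around the difference (you could widen the band to whatever multiple you prefer):

import numpy as np

# the levels and standard errors from the table above
a_lev = np.array([4.0, 5.0, 6.0, 7.0]);  a_se = np.repeat(1.0, 4)
b_lev = np.array([1.5, 3.0, 4.5, 6.0]);  b_se = np.repeat(0.5, 4)

# difference in the trajectories and its standard error (assuming zero covariance)
diff = a_lev - b_lev
diff_se = np.sqrt(a_se**2 + b_se**2)

# does the plus/minus one standard error band around the difference cover zero?
lo, hi = diff - diff_se, diff + diff_se
print(diff)                    # [2.5, 2.0, 1.5, 1.0], never exactly zero
print((lo <= 0) & (hi >= 0))   # the band dips below zero in the last period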

If one just wants a discrete measure of crossed or did not cross, with my functional approach you could set how wide you want to draw your error bars, and then check whether the high or low parts of that band cross zero. You may not want a discrete measure though, but a probability. To get that you would integrate the probability over the domain of interest and calculate the chunk of the predicted density that crosses zero. (Just assume the temporal domain is weighted uniformly across time.)

So in 3d, my difference function would look like this, where the bottom of the wall is the area used to calculate the probability of the lines crossing, and the height of the surface plot is the PDF at that point. (Note the area of the density is not normalized to sum to 1 in this plot.)

This surface graph ends up showing more crossing than was apparent in my prior 2d plots, as there I was only plotting plus/minus one standard error. Here imagine that the top green part of the density is the mean function, which does not cross zero, but then you have a non-trivial amount of the predicted density that does cross the zero line.

In the example where it just barely crosses one time, it seems obvious to consider that small slice as the probability of the two lines crossing. To extend this when you don’t know in advance whether to test above or below the line, I think you could calculate the probability on either side of the line, take the minimum, and then double that minimum for your p-value. So if say 5% of the area is below the line in my example above, you would double it and say the two-tailed p-value of the lines crossing is p = 0.10. Imagine the situation in which the line mostly hovers around 0, so the mass is about half on one side and half on the other. In that case the probability the lines cross seems much higher than 50%, so doubling seems intuitively reasonable.

So if you consider this probability to be a p-value, with a very small p-value you would reject the null that the lines cross. Unlike most reference distributions for p-values though, you can get a zero probability estimate of the lines crossing. You can aggregate up those probabilities as weights when calculating the overall PChange for the sample. So you may not know for certain if two trajectories cross, but you may be able to say these two trajectories cross with a 30% probability.
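
Here is one way you might sketch that probability calculation, under the simplifying assumptions that the difference is normal at each time point, the temporal domain is weighted uniformly, and the correlation of the difference function across time points is ignored:

import numpy as np
from scipy.stats import norm

# the difference function and its SE evaluated on a grid over the temporal domain
# (continuing the little example above; a finer grid works the same way)
diff    = np.array([2.5, 2.0, 1.5, 1.0])
diff_se = np.repeat(np.sqrt(1.25), 4)

# share of the predicted density below zero, averaged uniformly over the domain
p_below = norm.cdf(0, loc=diff, scale=diff_se).mean()

# two-tailed version: take the smaller side and double it
p_cross = 2 * min(p_below, 1 - p_below)
print(p_below, p_cross)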

Again this isn’t to say that PChange is bad; it is just different. I can’t give any reasoning for whether assuming the lines do cross (my functional approach) or assuming they don’t cross (Barnes’ PChange) is the better default. They are just different, but the choice would likely make a large difference in the estimated number of crossings.

Population Change vs Individual Level Change

So far I have just talked about trying to determine whether two individual lines cross. For my geographic analysis of trajectories, in which I have the whole population (just a sample in time), this may be sufficient: you can calculate all pairwise differences and then calculate PChange. (I see no data-based reason to use the permutation sampling approach Barnes suggested; our samples are not so big that we can’t just calculate all pairwise combinations.)
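
The pairwise version is trivial to compute. Here is a toy sketch where, just for illustration, a pair counts as crossing whenever the sign of the difference between the two observed trajectories flips at some point (ignoring the standard error threshold the Barnes metric uses):

from itertools import combinations
import numpy as np

def crosses(a, b):
    # 1 if the sign of the difference between the two observed trajectories ever flips
    d = np.sign(np.asarray(a) - np.asarray(b))
    return int(np.any(d[:-1] != d[1:]))

def pchange(trajectories):
    # proportion of all pairwise combinations of trajectories that cross
    return np.mean([crosses(a, b) for a, b in combinations(trajectories, 2)])

# toy data, three individuals measured at the same four time points
print(pchange([[1, 2, 3, 4], [4, 3, 2, 1], [0, 1, 1, 2]]))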

But many of the life course researchers are more likely to be interested in estimating the population of changes from their samples. Here I will show how you can do that for either random effects models or group based trajectory models, based on the summary information. This takes PChange from a sample metric to a population level metric implied by the models you have estimated. This I imagine will be much easier to generalize across samples than the individual change metrics, which will be quite susceptible to outlier trajectories, especially in small samples.

First let’s start with the random effects model. Imagine that you fit a linear growth model, and say the random intercept has a variance of 2 and the random slope has a variance of 1. These are population level metrics. The fixed effects will be immaterial for this part, and I will assume the covariance between the two random effect terms is zero, as I discuss in a moment.

First, trivially, if you selected two random individuals from the population with this random effects distribution, the probability that their underlying trajectories cross at some point is 1. The reason is that for linear models, two lines never cross only if they are perfectly parallel (the slopes are exactly equal), and sampling from a continuous distribution has zero probability of drawing two slopes that are exactly the same. This does not generalize to more complicated functions (imagine parabolas, one concave up and one concave down, shifted so they never cross), but it should be enough motivation to make the question only relevant for a specified domain of time.

So let’s say that we are evaluating the trajectories over the range t = [10,20]. What is the probability that two individuals randomly sampled from the population will cross? So again with my functional difference approach, we have

y_i = b0i + b1i*t
y_j = b0j + b1j*t
y_delta = (b0i - b0j) + (b1i - b1j)*t

Where in this case the b0 and b1 have prespecified distributions, so we know the distribution of the difference. Note that in the case with no covariates, the fixed effects will cancel out when taking the differences. (Generalizing to covariates is not as straightforward: you could either assume they are equal so they cancel out, or you could have them vary according to additional distributions, e.g. males have a 90% chance of being drawn versus a 10% chance for females, in which case the fixed effects would not cancel out.) Here I am just assuming they cancel out. Taking the difference in the trajectories does not get rid of the covariance term though; if b0 and b1 have covariance C in the overall model, the covariance between (b0i - b0j) and (b1i - b1j) will be 2*C (a quick exercise for the reader). To keep the example simple I am assuming that covariance is zero as well.

For each of the differences the means will be zero, and the variance will be the two variances added together, e.g. b0i - b0j will have a mean of zero and a variance of 2 + 2 = 4. The variance of the difference in slopes will then be 2. Now to figure out when the two lines will cross.

If you make a graph where the X axis is the difference in the intercepts, and the Y axis is the difference in the slopes, you can then mark off areas that indicate the two lines will cross given the domain. Here for example is a sampling of where the lines cross – red is crossing, grey is not crossing.
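
For straight lines the check behind a graph like that is simple: the difference line crosses zero inside the domain whenever its value at t = 10 and its value at t = 20 have opposite signs (or one of them is exactly zero). Here is a little helper, which I also apply to the example draws just below:

def crosses_domain(d_int, d_slope, t0=10, t1=20):
    # does the difference line d(t) = d_int + d_slope*t cross zero for t in [t0, t1]?
    d_start = d_int + d_slope * t0
    d_end   = d_int + d_slope * t1
    return d_start * d_end <= 0   # opposite signs (or touching zero) means a crossing

print(crosses_domain(0.5, 0.2))   # False, the example draws below do not cross in [10, 20]
print(crosses_domain(-6, 0.5))    # True, crosses zero at t = 12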

So for example, say we had two random draws:

y_i = 1   + 0.5*t
y_j = 0.5 + 0.3*t
y_delta = 0.5 + 0.2*t

This then shows that the two lines do not cross when only evaluating t between 10 and 20; they have already diverged by that point (you would need a negative t for the lines to cross). Imagine if y_delta = -6 + 0.5*t though. That line does cross zero within the domain: at t = 10 the function equals -1, whereas at t = 20 the function equals 4 (it crosses zero at t = 12).

If you do another 3d plot you can plot the bivariate PDF. Again integrate the chunks of areas in which the function crosses zero, and voila, you get your population estimate.

This works in a similar manner for higher order polynomials, although you can’t draw it in a nice graph. I’m blanking at the moment on a nice way to find these areas offhand (suggestions welcome!).

This gets a bit tricky to think about in relation to individual level change. This approach does not assume any error in the random draws of the lines, but does assume the draws have a particular distribution. So the PChange does not come from adding up whether individual lines in your sample cross; it comes from the distribution of the difference between two randomly drawn lines that is implied by your random effects model. Think of it as randomly drawing two lines from your random effect distribution, calculating whether they cross, and repeating that simulation a very large number of times. The integrations I’m suggesting are just an exact way to calculate PChange instead of the simulation approach.
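
Here is a minimal sketch of that simulation for the running example (intercept differences have variance 2 + 2 = 4, slope differences have variance 1 + 1 = 2, and the covariance between the differences is assumed to be zero):

import numpy as np

rng = np.random.default_rng(10)
n_sim = 1_000_000

# differences in intercepts and slopes between two randomly drawn individuals
d_int   = rng.normal(0, np.sqrt(4.0), n_sim)   # variance of 4
d_slope = rng.normal(0, np.sqrt(2.0), n_sim)   # variance of 2

# does the difference line change sign over the domain t = [10, 20]?
d10 = d_int + d_slope * 10
d20 = d_int + d_slope * 20
print(np.mean(d10 * d20 <= 0))   # population level PChange over that domain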

If you were to do individual change from your random effects model you would incorporate the standard error of the estimated slope and intercept for the individual observation. This is for your hypothetical population though, so I see no need to incorporate any error.

Estimating population level change from group based trajectory models via my functional approach is more straightforward. First, you would assume individuals who share the same latent trajectory cross with a high probability, so there is no need to test that. Second, for testing whether trajectories from two different groups cross, you would use the approach I’ve already discussed for individual lines and get the p-value (or probability) I mentioned.

So for example, say you had a probability of 25% that a randomly drawn person from Group A would cross a randomly drawn person from Group B. Say also that Group A is 40 out of 100 of the sample, and Group B is 60 out of 100. Then we have three different types of pairwise comparisons: A to A, B to B, and A to B. You can break down the number of pairs of each type, as well as the expected number of crosses, below.

Compare     N   %Cross   Crosses
A-A       780      100       780
B-B      1770      100      1770
A-B      2400       25       600
Total    4950       64      3150

So then we have a population level PChange estimate of about 64% from our GBTM. All of these calculations can be extended to non-integer counts; I just made them integers here to simplify the presentation.
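
Those numbers are just counts of pairwise combinations, which you can sketch out as:

from math import comb

n_a, n_b = 40, 60
p_ab = 0.25   # probability a random draw from Group A crosses a random draw from Group B

pairs_aa = comb(n_a, 2)   # 780 within-group pairs, assumed to cross
pairs_bb = comb(n_b, 2)   # 1770
pairs_ab = n_a * n_b      # 2400 between-group pairs

crosses = pairs_aa + pairs_bb + p_ab * pairs_ab   # 3150
total   = pairs_aa + pairs_bb + pairs_ab          # 4950
print(crosses / total)                            # about 0.64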

Now, that overall PChange estimate may not be all that meaningful for GBTM, as the denominator includes pairwise combinations of folks in the same trajectory group, which are not going to be of much interest. But just looking at the individual group solutions and seeing the probability that they cross could be more informative. For example, although Barnes shows the GBTM models in the paper as not crossing, depending on how wide the standard errors of the group trajectories are (which aren’t reported), this functional approach would probably assign a non-zero probability to them crossing (think of a low standard error trajectory for the higher group crossing a high standard error trajectory for the low group).

Phew, that was a long post! It is a simple question, whether two lines cross, but not such a simple solution once you consider the statistical nature of the question. I can’t be the only person to have thought about this, so if you know of similar or different approaches to testing whether two lines cross, please let me know in the comments.