Testing the equality of coefficients – Same Independent, Different Dependent variables

As promised earlier, here is one example of testing coefficient equalities in SPSS, Stata, and R.

Here we have different dependent variables, but the same independent variables. This is taken from Dallas survey data (original data link, survey instrument link), and they asked about fear of crime, and split up the questions between fear of property victimization and violent victimization. Here I want to see if the effect of income is the same between the two. People in more poverty tend to be at higher risk of victimization, but you may also expect people with fewer items to steal to be less worried. Here I also control for the race and the age of the respondent.

The dataset has missing data, so I illustrate how to select out for complete case analysis, then I estimate the model. The fear of crime variables are coded as Likert items with a scale of 1-5, (higher values are more safe) but I predict them using linear regression (see the Stata code at the end though for combining ordinal logistic equations using suest). Race is of course nominal, and income and age are binned as well, but I treat the income bins as a linear effect. I pasted the codebook for all of the items at the end of the post.

These models with multiple dependent variables have different names, economists call them seemingly unrelated regression, psychologists will often just call them multivariate models, those familiar with structural equation modeling can get the same results by allowing residual covariances between the two outcomes — they will all result in the same coefficient estimates in the end.


In SPSS we can use the GLM procedure to estimate the model. Simultaneously we can specify particular contrasts to test whether the income coefficient is different for the two outcomes.

*Grab the online data.
SPSSINC GETURI DATA URI="https://dl.dropbox.com/s/r98nnidl5rnq5ni/MissingData_DallasSurv16.sav?dl=0" FILETYPE=SAV DATASET=MissData.

*Conducting complete case analysis.
COUNT MisComplete = Safety_Violent Safety_Prop Gender Race Income Age (MISSING).
COMPUTE CompleteCase = (MisComplete = 0).
FILTER BY CompleteCase.

*This treats the different income categories as a continuous variable.
*Can use GLM to estimate seemingly unrelated regression in SPSS and test.
*equality of the two coefficients.
GLM Safety_Violent Safety_Prop BY Race Age WITH Income
  /DESIGN=Income Race Age
  /LMATRIX Income 1
  /MMATRIX ALL 1 -1.


In the output you can see the coefficient estimates for the two equations. The income effect for violent crime is 0.168 (0.023) and for property crime is 0.114 (0.022).

And then you get a separate table for the contrast estimates.

You can see that the contrast estimate, 0.054, equals 0.168 – 0.114. The standard error in this output (0.016) takes into account the covariance between the two estimates. Here you would reject the null that the effects are equal across the two equations, and the effect of income is larger for violent crime. Because higher values on these Likert scales mean a person feels more safe, this is evidence that those with higher incomes are more likely to be fearful of property victimization, controlling for age and race.

Unfortunately the different matrix contrasts are not available in all the different types of regression models in SPSS. You may ask whether you can fit two separate regressions and do this same test. The answer is you can, but that makes assumptions about how the two models are independent — it is typically more efficient to estimate them at once, and here it allows you to have the software handle the Wald test instead of constructing it yourself.


As I stated previously, seemingly unrelated regression is another name for these multivariate models. So we can use the R libraries systemfit to estimate our seemingly unrelated regression model, and then use the library multcomp to test the coefficient contrast. This does not result in the exact same coefficients as SPSS, but devilishly close. You can download the csv file of the data here.

library(systemfit) #for seemingly unrelated regression
library(multcomp)  #for hypothesis tests of models coefficients

#read in CSV file
SurvData <- read.csv(file="MissingData_DallasSurvey.csv",header=TRUE)
names(SurvData)[1] <- "Safety_Violent" #name gets messed up because of BOM

#Need to recode the missing values in R, use NA
NineMis <- c("Safety_Violent","Safety_Prop","Race","Income","Age")
for (i in NineMis){
  SurvData[SurvData[,i]==9,i] <- NA

#Making a complete case dataset
SurvComplete <- SurvData[complete.cases(SurvData),NineMis]
#Now changing race and age to factor variables, keep income as linear
SurvComplete$Race <- factor(SurvComplete$Race, levels=c(1,2,3,4), labels=c("Black","White","Hispanic","Other"))
SurvComplete$Age <- factor(SurvComplete$Age, levels=1:5, labels=c("18-24","25-34","35-44","45-54","55+"))

#now fitting seemingly unrelated regression
viol <- Safety_Violent ~ Income + Race + Age
prop <- Safety_Prop ~ Income + Race + Age
fitsur <- systemfit(list(violreg = viol, propreg= prop), data=SurvComplete, method="SUR")

#testing whether income effect is equivalent for both models
viol_more_prop <- glht(fitsur,linfct = c("violreg_Income - propreg_Income = 0"))

Here is a screenshot of the results then:

This is also the same as estimating a structural equation model in which the residuals for the two regressions are allowed to covary. We can do that in R with the lavaan library.


#for this need to convert factors into dummy variables for lavaan
DumVars <- data.frame(model.matrix(~Race+Age-1,data=SurvComplete))
names(DumVars) <- c("Black","White","Hispanic","Other","Age2","Age3","Age4","Age5")

SurvComplete <- cbind(SurvComplete,DumVars)

model <- '
     Safety_Prop    ~ Income + Black + Hispanic + Other + Age2 + Age3 + Age4 + Age5
     Safety_Violent ~ Income + Black + Hispanic + Other + Age2 + Age3 + Age4 + Age5
    #residual covariances
     Safety_Violent ~~ Safety_Prop
     Safety_Violent ~~ Safety_Violent
     Safety_Prop ~~ Safety_Prop

fit <- sem(model, data=SurvComplete)

I’m not sure offhand though if there is an easy way to test the coefficient differences with a lavaan object, but we can do it manually by grabbing the variance and the covariances. You can then see that the differences and the standard errors are equal to the prior output provided by the glht function in multcomp.

#Grab the coefficients I want, and test the difference
PCov <- inspect(fit,what="vcov")
PEst <- inspect(fit,what="list")
Diff <- PEst[9,'est'] - PEst[1,'est']
SE <- sqrt( PEst[1,'se']^2 + PEst[9,'se']^2 - 2*PCov[9,1] )


In Stata we can replicate the same prior analyses. Here is some code to simply replicate the prior results, using Stata’s postestimation commands (additional examples using postestimation commands here). Again you can download the csv file used here. The results here are exactly the same as the R results.

*Load in the csv file
import delimited MissingData_DallasSurvey.csv, clear

*BOM problem again
rename ïsafety_violent safety_violent

*we need to specify the missing data fields.
*for Stata, set missing data to ".", not the named missing value types.
foreach i of varlist safety_violent-ownhome {
    tab `i'

*dont specify district
mvdecode safety_violent-race income-age ownhome, mv(9=.)
mvdecode yearsdallas, mv(999=.)

*making a variable to identify the number of missing observations
egen miscomplete = rmiss(safety_violent safety_prop race income age)
tab miscomplete
*even though any individual question is small, in total it is around 20% of the cases

*lets conduct a complete case analysis
keep if miscomplete==0

*Now can estimate multivariate regression, same as GLM in SPSS
mvreg safety_violent safety_prop = income i.race i.age

*test income coefficient is equal across the two equations
lincom _b[safety_violent:income] - _b[safety_prop:income]

*same results as seemingly unrelated regression
sureg (safety_violent income i.race i.age)(safety_prop income i.race i.age)

*To use lincom it is the exact same code as with mvreg
lincom _b[safety_violent:income] - _b[safety_prop:income]

*with sem model
tabulate race, generate(r)
tabulate age, generate(a)
sem (safety_violent <- income r2 r3 r4 a2 a3 a4 a5)(safety_prop <- income r2 r3 r4 a2 a3 a4 a5), cov(e.safety_violent*e.safety_prop) 

*can use the same as mvreg and sureg
lincom _b[safety_violent:income] - _b[safety_prop:income]

You will notice here it is the exact some post-estimation lincom command to test the coefficient equality across all three models. (Stata makes this the easiest of the three programs IMO.)

Stata also allows us to estimate seemingly unrelated regressions combining different generalized outcomes. Here I treat the outcome as ordinal, and then combine the models using seemingly unrelated regression.

*Combining generalized linear models with suest
ologit safety_violent income i.race i.age
est store viol

ologit safety_prop income i.race i.age
est store prop

suest viol prop

*lincom again!
lincom _b[viol_safety_violent:income] - _b[prop_safety_prop:income]

An application in spatial criminology is when you are estimating the effect of something on different crime types. If you are predicting the number of crimes in a spatial area, you might separate Poisson regression models for assaults and robberies — this is one way to estimate the models jointly. Cory Haberman and Jerry Ratcliffe have an application of this as well estimate the effect of different crime types at different times of day – e.g. the effect of bars in the afternoon versus the effect of bars at nighttime.


Here is the codebook for each of the variables in the database.

    1   Very Unsafe
    2   Unsafe
    3   Neither Safe or Unsafe
    4   Safe
    5   Very Safe
    9   Do not know or Missing
    1   Very Unsafe
    2   Unsafe
    3   Neither Safe or Unsafe
    4   Safe
    5   Very Safe
    9   Do not know or Missing
    1   Male
    2   Female
    9   Missing
    1   Black
    2   White
    3   Hispanic
    4   Other
    9   Missing
    1   Less than 25k
    2   25k to 50k
    3   50k to 75k
    4   75k to 100k
    5   over 100k
    9   Missing
    1   Less than High School
    2   High School
    3   Some above High School
    9   Missing
    1   18-24
    2   25-34
    3   35-44
    4   45-54
    5   55+
    9   Missing
    1   Own
    2   Rent
    9   Missing
    999 Missing

Group based trajectory models in Stata – some graphs and fit statistics

For my advanced research design course this semester I have been providing code snippets in Stata and R. This is the first time I’ve really sat down and programmed extensively in Stata, and this is a followup to produce some of the same plots and model fit statistics for group based trajectory statistics as this post in R. The code and the simulated data I made to reproduce this analysis can be downloaded here.

First, for my own notes my version of Stata is on a server here at UT Dallas. So I cannot simply go

net from http://www.andrew.cmu.edu/user/bjones/traj
net install traj, force

to install the group based trajectory code. First I have this set of code in the header of all my do files now

*let stata know to search for a new location for stata plug ins
adopath + "C:\Users\axw161530\Documents\Stata_PlugIns"
*to install on your own in the lab it would be
net set ado "C:\Users\axw161530\Documents\Stata_PlugIns"

So after running that then I can install the traj command, and Stata will know where to look for it!

Once that is taken care of after setting the working directory, I can simply load in the csv file. Here my first variable was read in as ïid instead of id (I’m thinking because of the encoding in the csv file). So I rename that variable to id.

*Load in csv file
import delimited GroupTraj_Sim.csv
*BOM mark makes the id variable weird
rename ïid id

Second, the traj model expects the data in wide format (which this data set already is), and has counts in count_1, count_2count_10. The traj command also wants you to input a time variable though to, which I do not have in this file. So I create a set of t_ variables to mimic the counts, going from 1 to 10.

*Need to generate a set of time variables to pass to traj, just label 1 to 10
forval i = 1/10 { 
  generate t_`i' = `i'

Now we can estimate our group based models, and we get a pretty nice default plot.

traj, var(count_*) indep(t_*) model(zip) order(2 2 2) iorder(0)

Now for absolute model fit statistics, there are the average posterior probabilities, the odds of correct classification, and the observed classification proportion versus the expected classification proportion. Here I made a program that I will surely be ashamed of later (I should not brutalize the data and do all the calculations in matrix), but it works and produces an ugly table to get us these stats.

*I made a function to print out summary stats
program summary_table_procTraj
    *now lets look at the average posterior probability
    gen Mp = 0
    foreach i of varlist _traj_ProbG* {
        replace Mp = `i' if `i' > Mp 
    sort _traj_Group
    *and the odds of correct classification
    by _traj_Group: gen countG = _N
    by _traj_Group: egen groupAPP = mean(Mp)
    by _traj_Group: gen counter = _n
    gen n = groupAPP/(1 - groupAPP)
    gen p = countG/ _N
    gen d = p/(1-p)
    gen occ = n/d
    *Estimated proportion for each group
    scalar c = 0
    gen TotProb = 0
    foreach i of varlist _traj_ProbG* {
       scalar c = c + 1
       quietly summarize `i'
       replace TotProb = r(sum)/ _N if _traj_Group == c 
    *This displays the group number, the count per group, the average posterior probability for each group,
    *the odds of correct classification, and the observed probability of groups versus the probability 
    *based on the posterior probabilities
    list _traj_Group countG groupAPP occ p TotProb if counter == 1

This should work after any model as long as the naming conventions for the assigned groups are _traj_Group and the posterior probabilities are in the variables _traj_ProbG*. So when you run


You get this ugly table:

     | _traj_~p   countG   groupAPP        occ       p    TotProb |
  1. |        1      103    .937933   43.57431   .2575   .2573651 |
104. |        2      161   .9935606   229.0434   .4025   .4064893 |
265. |        3      136   .9607248   47.48378     .34   .3361456 |

The groupAPP are the average posterior probabilities – here you can see they are all quite high. occ is the odds of correct classification, and again they are all quite high. p is the proportion in each group based on the assignments for the maximum posterior probability, and the TotProb are the expected number based on the sums of the posterior probabilities. TotProb should be the same as in the Group Membership part at the bottom of the traj model. They are close (to 5 decimals), but not exactly the same (and I do not know why that is the case).

Next, to generate a plot of the individual trajectories, I want to reshape the data to long format. I use preserve in case I want to go back to wide format later and estimate more models. If you need to look to see how the reshape command works, type help reshape at the Stata prompt. (Ditto for help on all Stata commands.)

reshape long count_ t_, i(id)

To get the behavior I want in the plot, I use a scatter plot but have them connected via c(L). Then I create small multiples for each trajectory group using the by() option. Before that I slightly jitter the count data, so the lines are not always perfectly overlapped. I make the lines thin and grey — I would also use transparency but Stata graphs do not support this.

gen count_jit = count_ + ( 0.2*runiform()-0.1 )
graph twoway scatter count_jit t_, c(L) by(_traj_Group) msize(tiny) mcolor(gray) lwidth(vthin) lcolor(gray)

I’m too lazy at the moment to clean up the axis titles and such, but I think this plot of the individual trajectories should always be done. See Breaking Bad: Two Decades of Life-Course Data Analysis in Criminology, Developmental Psychology, and Beyond (Erosheva et al., 2014).

While this fit looks good, this is not the correct number of groups given how I simulated the data. I will give those trying to find the right answer a few hints; none of the groups have a higher polynomial than 2, and there is a constant zero inflation for the entire sample, so iorder(0) will be the correct specification for the zero inflation part. If you take a stab at it let me know, I will fill you in on how I generated the simulation.

Some Stata notes – Difference-in-Difference models and postestimation commands

Many of my colleagues use Stata (note it is not STATA), and I particularly like it for various panel data models. Also one of my favorite parts of Stata code that are sometimes tedious to replicate in other stat. software are the various post-estimation commands. These includes the test command, which does particular coefficient restriction tests or multiple coefficient tests, margins (and the corresponding marginsplot) which gives model based estimates at various values of the explanatory variables, and lincom and nlcom which I will show as being useful for differences in differences models.

To follow along with this post, I have posted all of the data and code here. Part of the data is simulated to be very similar to a panel data model I estimated recently, and the other dataset comes from Dan Kahan.

I don’t know Stata as well as I do SPSS or R, but these are things I tend to re-visit (and wish people sending me data and code follow as well). So these are for my own notes so I remember!

The Preamble

Before we go into estimating models in Stata, I first like to set the working directory, specify a plain text log file for the results, and then set how the results are displayed in the window. I know some programmers do not like you changing their directory, but this is the simplest way I’ve found to share code between colleagues – they just need to change one line at the beginning to change the directory to their local machine.

So the top of my code will typically have my name, plus what the code does. Note that comments in Stata start with an *.

*Andy Wheeler, ????@email.com
*This code estimates difference in difference models
*For the X intervention on Crimes per Month

After that I will set the working directory, using the cd command, which lets you upload datasets without specifying a file handle. Because of this, I can just say log using "PostEst.txt" and it saves the plain text log file in the same directory folder.

*Setting up the directory
cd "C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Stata_Postest"
*logging the results in a text file
log using "PostEst.txt", text replace
*So the output just keeps going
set more off

The code set more off is idiosyncratic to the Stata display. For results that you need to scroll, Stata asks you to click on see more output. This makes it so it just pipes all the results to the display without you needing to click on anything.

Importing Data and Setting up a Panel Model

Now we will go on to a simulated example very similar to a difference in difference model I estimated recently. Because the working directory is set that already contains my datasets, I can just call:

use "Monthly_Sim_Data.dta", clear 

To load up my simulated dataset. Before we estimate panel models in Stata, you need to tell Stata what the panel id variables refer to. You use the tsset command for that. Here the variable Exper refers to a dummy variable that equals 1 for the experimental time series, and 0 for the control time series. Ord is a squential counter for each time period – here the data is suppossed to look like the count of crimes during the month over several years. So the first variable after tsset in the panel id, and the second refers to the time variable (if there is a time variable).

*Set the panel vars
tsset Exper Ord

Now we are ready to estimate our model. In my real use case I had looked at the univariate series, and found that an ARIMA(1,0,0) model with monthly dummies to account for seasonality fit pretty well and took care of temporal auto-correlation, so I decided a reasonable substitute for the panel model would be a GEE Poisson model with monthly dummies and the errors having an AR(1) correlation.

To fit this model in Stata is:

*Original model
xtgee Y Exper#Post i.Month, family(poisson) corr(ar1)

Here I do not worry about the standard errors, but if the data are over-dispersed you can either fit a negative binomial model and/or estimate robust standard errors (or bootstrapped standard errors). So you can specify these as something like xtgee Y X, family(poisson) corr(ar1) vce(robust). The vce option is available for many of the panel data models. (Note the bootstrap does not make sense when you only have two series as in this example.)

The Post variable is the dummy variable equal to 0 before the intervention, and 1 after the intervention. Exper#Post is one way to specify interaction variables in Stata. The variable i.Month tells Stata that the Month variable is a factor, and it should estimate a different dummy variable for each month (dropping one to prevent perfect collinearity).

You can of course create your own interactions or dummy variables, but using Stata’s inbuilt commands is very nice when working with post-estimation commands, which I will show in a bit. Some models in Stata do not like the i.Factor notation, so a quick way to make all of the dummy variables is via tabulate Month, gen(m). This would create variables m1, m2, m3....m12 in the dataset that you can include on the right hand side of regression equations.

This model then produces the output:

Iteration 1: tolerance = .00055117
Iteration 2: tolerance = 1.375e-06
Iteration 3: tolerance = 2.730e-09

GEE population-averaged model                   Number of obs      =       264
Group and time vars:             Exper Ord      Number of groups   =         2
Link:                                  log      Obs per group: min =       132
Family:                            Poisson                     avg =     132.0
Correlation:                         AR(1)                     max =       132
                                                Wald chi2(14)      =    334.52
Scale parameter:                         1      Prob > chi2        =    0.0000

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Exper#Post |
        0 1  |    .339348   .0898574     3.78   0.000     .1632307    .5154653
        1 0  |   1.157615   .0803121    14.41   0.000     1.000206    1.315023
        1 1  |   .7602288   .0836317     9.09   0.000     .5963138    .9241439
       Month |
          2  |    .214908   .1243037     1.73   0.084    -.0287228    .4585388
          3  |   .2149183   .1264076     1.70   0.089    -.0328361    .4626726
          4  |   .3119988   .1243131     2.51   0.012     .0683497    .5556479
          5  |   .4416236   .1210554     3.65   0.000     .2043594    .6788877
          6  |    .556197   .1184427     4.70   0.000     .3240536    .7883404
          7  |   .4978252   .1197435     4.16   0.000     .2631322    .7325182
          8  |   .2581524   .1257715     2.05   0.040     .0116447      .50466
          9  |    .222813   .1267606     1.76   0.079    -.0256333    .4712592
         10  |    .430002    .121332     3.54   0.000     .1921956    .6678084
         11  |   .0923395   .1305857     0.71   0.479    -.1636037    .3482828
         12  |  -.1479884   .1367331    -1.08   0.279    -.4159803    .1200035
       _cons |   .9699886   .1140566     8.50   0.000     .7464418    1.193536

Post estimation commands

So now the fun part begins – trying to interpret the model. Non-linear models – such as a Poisson regression – I think are almost always misleading to interpret the coefficients directly. Even exponentiating the coefficients (so they are incident rate ratios) I think tends to be misleading (see this example). To solve this, I think the easiest solution is to just predict the model based outcome at different locations of the explanatory variables back on the original count scale. This is what the margins post-estimation command does.

margins Post#Exper, at( (base) Month )
marginsplot, recast(scatter)

So basically once you estimate a regression equation, Stata has many of its attributes accessible to subsequent command. What the margins command does is predict the value of Y at the 2 by 2 factor levels for Post and Exper. The at( (base) Month ) option says to estimate the effects at the base level of the Month factor, which happens to default to January here. The marginsplot shows this easier than I can explain.

The option recast(scatter) draws the plot so there are no lines connecting the different points. Note I prefer to draw these error bar like plots without the end cross hairs, which can be done like marginsplot, recast(scatter) ciopts(recast(rspike)), but this causes a problem when the error bars overlap.

So I initially interpreted this simply as the experimental series went down by around 2 crimes, and the control series went up by around 1 crime. A colleague though pointed out the whole point of having control series though is to say what would have happend if the intervention did not take place (the counterfactual). Because the control series was going up, we would have also expected the experimental series to go up.

The coefficients in this model though don’t directly answer that question. To estimate that relative decrease is somewhat convoluted in this parameterization, but you can use the test and lincom post-estimation commands to do it. Test basically does linear model restrictions for one or multiple variables. This can be useful to test if multiple coefficients are equal to one another, or joint testing to see if one coefficient is equal to another, or to test if a coefficient is different from a non-zero value.

So to test the relative decrease in this DiD setup ends up being (which I am too lazy to explain):

test 1.Exper#1.Post = (1.Exper#0.Post + 0.Exper#1.Post)

Note you can also do a joint test for all dummy variables with testparm:

testparm i.Month

which spits out:

 ( 1)  2.Month = 0
 ( 2)  3.Month = 0
 ( 3)  4.Month = 0
 ( 4)  5.Month = 0
 ( 5)  6.Month = 0
 ( 6)  7.Month = 0
 ( 7)  8.Month = 0
 ( 8)  9.Month = 0
 ( 9)  10.Month = 0
 (10)  11.Month = 0
 (11)  12.Month = 0

           chi2( 11) =   61.58
         Prob > chi2 =    0.0000

The idea behind this is that it often does not make sense to test the significance of only one level of a dummy variable – you want to jointly test whether the whole set of dummy variables is statistically significant. Most of the time I do this using F-tests for model restrictions (see this example in R). But here Stata does a chi-square test. (I imagine they will result in the same inferences in most circumstances.)

test just gives inferential statistics though, I wanted an actual estimate of the relative decrease. To do this you can use lincom. So working with my same set of variables I get:

lincom 1.Exper#1.Post - 0.Exper#1.Post - 1.Exper#0.Post

Which produces in the output:

 ( 1)  - 0b.Exper#1.Post - 1.Exper#0b.Post + 1.Exper#1.Post = 0

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
         (1) |  -.7367337   .1080418    -6.82   0.000    -.9484917   -.5249757

I avoid explaining why this particular set of coefficient contrasts produces the relative decrease of interest because there is an easier way to specify the DiD model to get this coefficient directly (see the Wikipedia page linked earlier for an explanation):

*An easier way to estimate the model
xtgee Y i.Exper i.Post Exper#Post i.Month, family(poisson) corr(ar1)

Which you can look at the output and see that the interaction term now recreates the same -.7367337 (.1080418) estimate as the prior lincom command:

Iteration 1: tolerance = .00065297
Iteration 2: tolerance = 1.375e-06
Iteration 3: tolerance = 2.682e-09

GEE population-averaged model                   Number of obs      =       264
Group and time vars:             Exper Ord      Number of groups   =         2
Link:                                  log      Obs per group: min =       132
Family:                            Poisson                     avg =     132.0
Correlation:                         AR(1)                     max =       132
                                                Wald chi2(14)      =    334.52
Scale parameter:                         1      Prob > chi2        =    0.0000

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
     1.Exper |   1.157615   .0803121    14.41   0.000     1.000206    1.315023
      1.Post |    .339348   .0898574     3.78   0.000     .1632307    .5154653
  Exper#Post |
        1 1  |  -.7367337   .1080418    -6.82   0.000    -.9484917   -.5249757
       Month |
          2  |    .214908   .1243037     1.73   0.084    -.0287228    .4585388
          3  |   .2149183   .1264076     1.70   0.089    -.0328361    .4626726
          4  |   .3119988   .1243131     2.51   0.012     .0683497    .5556479
          5  |   .4416236   .1210554     3.65   0.000     .2043594    .6788877
          6  |    .556197   .1184427     4.70   0.000     .3240536    .7883404
          7  |   .4978252   .1197435     4.16   0.000     .2631322    .7325182
          8  |   .2581524   .1257715     2.05   0.040     .0116447      .50466
          9  |    .222813   .1267606     1.76   0.079    -.0256333    .4712592
         10  |    .430002    .121332     3.54   0.000     .1921956    .6678084
         11  |   .0923395   .1305857     0.71   0.479    -.1636037    .3482828
         12  |  -.1479884   .1367331    -1.08   0.279    -.4159803    .1200035
       _cons |   .9699886   .1140566     8.50   0.000     .7464418    1.193536

But we can do some more here, and figure out the hypothetical experimental mean if the experimental series would have followed the same trend as the control series. Here I use nlcom and exponentiate the results to get back on the original count scale:

nlcom exp(_b[1.Exper] + _b[1.Post]  + _b[_cons])

Which gives an estimate of:

       _nl_1:  exp(_b[1.Exper] + _b[1.Post]  + _b[_cons])

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _nl_1 |   11.78646   1.583091     7.45   0.000     8.683656    14.88926

So in our couterfactual world, the intervention decreased crimes from around 12 to 6 per month instead of 8 to 6 in my original interpretation, a larger effect because of the control series. We can show how nlcom reproduces the output of margins by reproducing the experimental series mean at the baseline, over 8 crimes per month.

*This creates the observed estimates, (at January) - if you want for another month need to add to above
margins Post#Exper, at( (base) Month )
*to show it recreates margins of pre period
nlcom exp(_b[1.Exper] + _b[_cons])

Which produces the output:

. margins Post#Exper, at( (base) Month )

Adjusted predictions                              Number of obs   =        264
Model VCE    : Conventional

Expression   : Exponentiated linear prediction, predict()
at           : Month           =           1

             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Post#Exper |
        0 0  |   2.637914   .3008716     8.77   0.000     2.048217    3.227612
        0 1  |   8.394722   .8243735    10.18   0.000      6.77898    10.01046
        1 0  |   3.703716   .3988067     9.29   0.000     2.922069    4.485363
        1 1  |   5.641881   .5783881     9.75   0.000     4.508261    6.775501

. *to show it recreates margins of pre period
. nlcom exp(_b[1.Exper] + _b[_cons])

       _nl_1:  exp(_b[1.Exper] + _b[_cons])

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _nl_1 |   8.394722   .8243735    10.18   0.000      6.77898    10.01046

Poisson models are no big deal

Much of my experience suggests that although Poisson models are standard fare for predicting low crime counts in our field, they almost never make a difference when evaluating the marginal effects like I have above. So here we can reproduce the same GEE model, but instead of Poisson have it be a linear model. Here I use the quietly command to suppress the output of the model. Comparing coefficients directly between the two models does not make sense, but comparing the predictions is fine.

*Compare Poisson to a linear model
quietly xtgee Y Exper#Post i.Month, family(gaussian) corr(ar1)
margins Post#Exp, at( (base) Month )

And this subsequently produces:

Adjusted predictions                              Number of obs   =        264
Model VCE    : Conventional

Expression   : Linear prediction, predict()
at           : Month           =           1

             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Post#Exper |
        0 0  |   1.864656   .6951654     2.68   0.007     .5021569    3.227155
        0 1  |   9.404887   .6951654    13.53   0.000     8.042388    10.76739
        1 0  |   3.233312   .6992693     4.62   0.000     1.862769    4.603854
        1 1  |   5.810809   .6992693     8.31   0.000     4.440266    7.181351

The predictions are very similar between the two models. I simply state this here because I think the parallel trends assumption for the DiD model makes more sense on a linear scale than it does for the exponential scale. Pretty much wherever I look, the effects of explanatory variables appear pretty linear to me when predicting crime counts, so I think linear models make a lot of sense, even if you are predicting low count data, despite current conventions in the field.

An additional Example using marginsplot and areas

Long post – but stick with me! The last thing I wanted to go over was how to get area’s instead of error bars for marginsplot. The pdf help online has some examples – so this is pretty much just a restatement of the help. (Stata’s help in the online PDFs are excellent – code and math and figures and references and intelligible discussion all in one.) Like Andrew Gelman, I think the discrete bars are misleading when the sample locations are just arbitrary locations along a continuous function. I also think the areas look nicer, especially when the error bars overlap.

So using the same example from Dan Kahan, we first use drop _all to get rid of the prior panel data, then use import to load up the csv file.

*This clears the prior data.
drop _all 

*grab the csv data
import delimited gelman_cup_graphic_reporting_challenge_data

Now we can estimate our model and our margins:

*estimate a similar logit model
logit agw c.left_right##c.religiosity
*graph areas using margins - pretended -1 and 1 are the high/low religious cut-offs
quietly margins, at(left_right=(-1.6(0.1)1.6) religiosity=(-1 1))

quietly is definately useful here for the margins command, we don’t want to pour through the whole text table, since it is only intended for a chart. And now to make our area chart at the sample points:

marginsplot, xlabel(-1.6(0.4)1.6) recast(line) recastci(rarea) ciopts(color(*.8))

That is much better looking that the default with the ugly cross-hair error bars overlapping. Stata draws the lines unfortunately in the wrong order (see the blue line over the red area), and if I figure out an easy way to fix that I will update the post in the future. I need to learn the options for Stata charts to a greater extent before I give any more advice about Stata charts though.

To wrap up. I typically have in my Stata do file:

*Finish the script.
drop _all 
exit, clear

This drops the dataset (as mentioned before), and exit, clear then basically closes out of Stata.