A principled approach to conducting subgroup analysis

Social scientists often face a problem when conducting analysis — we have theories that are not tightly coupled to actual measures of individual behavior. A common response is to fit models of many different, interrelated measures. These can be different outcome variables, e.g. if I know poverty predicts all crime, does poverty predict both violent crime and property crime at the city level? Or different explanatory variables, e.g. does being a minority reduce your chances of getting a job interview, or does it matter what type of minority you are — e.g. Black, Asian, Hispanic, Native American, etc.?

Another situation is conducting analysis across different units of analysis, e.g. seeing whether a treatment has a different effect for males than for females, or whether a treatment works well in one country but not in another. Or if I find that a policy intervention works at the city level, are the effects spread across all areas of the city, or concentrated in just some neighborhoods?

On their face these may seem like unique problems. They are not; they are all different variants of subgroup analysis. In my dissertation, while trying to identify situations in which you need to use small geographic units of analysis, I realized that the logic behind choosing a geographic unit of analysis is the same as in these other subgroup analyses. I outline my logic more succinctly in this article, but I will try it in a blog post as well.

I am what I would call a "reductionist social scientist". In plain terms, if we fit a model:

Y = B*X

we can always get more specific, either in terms of explanatory variables:

Y = b1*x1 + b2*x2, where X = x1 + x2

Or in terms of the outcome:

y1 = b1*X
y2 = b2*X, where Y = y1 + y2

Hubert Blalock talks about this in his causal inferences book. I think many social scientists are reductionists in this sense: we can always estimate more specific explanatory variables, more specific outcomes, or effects within different subgroups, ad nauseam. Thus the question is not whether we should conduct analysis in some particular subgroup, but when we should be satisfied that the aggregate effect we estimate is good enough, or at least not misleading.

So remember this when evaluating a model: the aggregate effect is a function of the subgroup effects. In linear models the math is easy (I show some examples in my linked paper), but the logic generally holds for non-linear models as well. So when should we be ok with the aggregate level effect? We should be ok if we expect the direction and size of the effects in the subgroups to be similar. We should not be ok if the effects are countervailing across the subgroups, or if the magnitude of the differences is very large.

For some simplistic examples, if we go with our job interview for minorities relative to whites example:

Prob(Job Interview) = 0.5 + 0*(Minority)

So here the effect is zero; minorities have the same probability as white individuals, 50%. But let's say we estimate an effect for different minority categories:

Prob(Job Interview) = 0.5 + 0.3(Asian) - 0.3(Black)

Our aggregate effect for minorities is zero because it is positive for Asian individuals and negative for Black individuals, and in the aggregate these two effects cancel out. That is one situation in which we should be worried.
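
If you want to check this logic for yourself, here is a quick simulation in R that mirrors the numbers above. The data are made up, so treat it as a sketch rather than anything substantive.

#Simulated job interview example, +0.3 for Asian and -0.3 for Black applicants
set.seed(2)
n <- 100000
grp <- sample(c("White","Asian","Black"), n, replace=TRUE)
p <- c(White=0.5, Asian=0.8, Black=0.2)[grp]
interview <- rbinom(n, 1, p)
minority <- as.numeric(grp != "White")
coef(lm(interview ~ minority))                            #aggregate minority effect is near zero
coef(lm(interview ~ relevel(factor(grp), ref="White")))   #subgroup effects are near +0.3 and -0.3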

Now how about the effect of poverty on crime:

All Crime = 5*(Percent in Poverty)

Versus different subsets of crime, violent and property.

Violent Crime = 3*(Percent in Poverty)
Property Crime = 2*(Percent in Poverty)

Here we can see that the subgroups contribute to the total, but the effect for property is slightly less than that for violent crime. Here the aggregate effect is not misleading, but the micro level effect may be theoretically interesting.

For the final example, let's say we do a gun buy back program, and we estimate the reduction in shootings at the citywide level. Say we estimate the number of shootings per month as:

Shootings in the City = 10 - 5*(Gun Buy Back)

So we say the gun buy back reduced shootings by 5 per month. Maybe we think the reduction is restricted to certain areas of the city. For simplicity, say this city only has two neighborhoods, North and South, and we estimate the effect of the gun buy back in each:

Shootings in the North Neighborhood = 9 - 5*(Gun Buy Back)
Shootings in the South Neighborhood = 1 - 0*(Gun Buy Back)

Here we find the program only reduced shootings in the North neighborhood; it had no appreciable effect in the South neighborhood. The aggregate city level effect is not misleading, we can just be more specific by decomposing that effect across different areas.

Here I will relate this to some of my recent work — using 311 calls for service to predict crime at micro places in DC.

In a nutshell, I’ve fit models of the form:

Crime = B*(311 Calls for Service)

And I found that 311 calls have a positive, but small, effect on crime.

Over time, either at presentations in person or in peer review, I’ve gotten three different "subgroup" critiques. These are:

  • I think you cast the net too wide in 311 calls, e.g. "bulk collections" should not be included
  • I think you shouldn’t aggregate all crime on the left hand side, e.g. I think the effect is mostly for robberies
  • I think you shouldn’t estimate one effect for the entire city, e.g. I think these signs of disorder matter more in some neighborhoods than others

Now, these are all reasonable questions, but do they call into question my main aggregate finding? Not at all.

For casting the net too wide with 311 calls: do you think bulk collections have a negative relationship to crime? Unlikely. (I've eliminated them from my current article due to one reviewer complaint, but to be honest I think they should be included. Seeing a crappy couch on the street is not much different than seeing garbage.)

For all crime on the left hand side: do you think 311 calls have a negative effect on some crimes, but a positive effect on others? Again, unlikely. It may be the case that they have larger effects on some crimes than others, but that does not mean the effect on all crime is misleading. So what if the effect is larger for robberies than for other crimes? You can go and build a theory about why that is the case and test it.

For the one estimate across different parts of the city: do you think it has a negative effect in some parts and a positive effect in others? Again, unlikely. It may be that the effect is larger in some areas, but overall we expect it to be positive or zero in all areas of the city. The aggregate citywide effect is not likely to be misleading.

These are all fine future research questions, but I get frustrated when they are given as reasons to critique my current findings. They don't invalidate the aggregate findings at all.

In response to this, you may think, well why not conduct all these subgroup analyses – what's the harm? There are a few different harms to conducting these subgroup analyses willy-nilly. They are all related to chasing the noise and interpreting it.

For each of these subgroups you will have less power to estimate effects than in the aggregate. Say I test the effect of each individual 311 call type (there are nearly 30 that I sum together). Simply by chance some of these will have null or slightly negative effects, and all of the effects will be small by themselves. I have no a priori reason to think some have a different effect than others; the theory behind why they are related to crime at all (Broken Windows) does not distinguish between them.
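
To make that concrete, here is a quick simulation in R in which every one of 30 call types has the same small positive effect, yet estimating them separately still flips some signs purely from sampling noise. The numbers are made up; this is only a sketch of the point.

#30 call types that all have the same small positive effect on the outcome
set.seed(1)
n <- 500
k <- 30
X <- matrix(rpois(n*k, 2), n, k)            #counts for 30 call types
y <- rpois(n, exp(0.5 + 0.01*rowSums(X)))   #true effect is +0.01 for every type
m <- glm(y ~ X, family=poisson)
sum(coef(m)[-1] < 0)   #how many of the 30 (all truly positive) estimates come out negative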

This often ends up being a catch-22 in peer review. You do more specific analysis, by chance a coefficient goes in the wrong direction, and the reviewer interprets it as evidence that your measures and/or model are bunk. In reality they are just over-interpreting noise.

That is in response to reviewers, but what about conducting subgroup analysis on your own? Here you have to worry about the garden of forking paths. Say I conducted the subgroup analysis for different types of crime outcomes, and they are all in the same direction except for thefts from auto. I then report all of the results except for thefts from auto, because that result does not confirm my theory. This is a large current problem in reproducing social science findings — a subgroup analysis may seem impressive, but you have to worry about which results the researcher cherry picked.

Reporting only confirmatory evidence for some subgroups will always be a problem in data analysis — not even pre-registration of your analysis plan will solve it. Thus you should only do subgroup analysis if there is strong theoretical reasoning to think the aggregate effect is misleading. If you think subgroup differences may be theoretically interesting, you should simply plan a new study to assess them from the start.

Given some of the reviews I received for the 311 paper, I am stuffing many of these subgroup analyses in appendices just to preempt reviewer critique. (Ditto for my paper on alcohol outlets and crime I will be presenting at ASC in a few weeks, which will be up on SSRN likely next week.) I don't think it is the right thing to do though; again, I think it is mostly noise mining, but perpetually nit-picky reviewers have basically forced me to do it.

My endorsement for criminal justice at Bloomsburg University

The faculty at PASSHE schools (public universities in Pennsylvania) are currently on strike. My main reason to write this post is because I went to Bloomsburg University and I received a terrific education (a BA in criminal justice). If I could go back and do it all over again, I would still definitely attend Bloomsburg.

For a bit of background, all of the PA state schools were originally formed as normal schools — colleges to prepare teachers for lower education. They were intentionally placed around the state, so students did not have to travel very far. This is why they seem to be in rural places no one has ever heard of (it is intentional). For those in New York, this is an equivalent story for the smaller SUNY campuses — although unlike New York the state schools in PA have no shared acronym. This does not include Penn State University, which is a land-grant school. At some later point, the normal colleges expanded to universities and PASSHE was formed.

There are some pathological problems with higher education currently. One of them is the rising price of tuition. PASSHE schools are basically the cheapest places you can get a bachelor's degree. At private institutions (or Penn State Univ.) you are going to pay two to three times as much as at a PASSHE institution.

Criminal justice is a continually growing degree. To meet the teaching demand, many programs are filling in with adjunct labor. Bloomsburg did not do this when I was there, and this still appears to be the case. The majority of the faculty I had (04-08) are still on the faculty (this is true for criminal justice, sociology, and the two math professors I took all my statistics courses with), although the CJ program appears to have grown beyond the three faculty members who were there in my time (Leo Barrile, Neal Slone, and Pam Donovan). They did have an additional adjunct when I was there (who shall go unnamed) who is in the running for laziest teacher I have ever had.

Now, don't get me wrong — adjuncts can be good teachers. I've taught as an adjunct myself. You should be concerned though if the majority of the courses in a department are being taught by adjuncts. People with professional experience can be great teachers — especially for advanced courses about their particular expertise — but they should rarely be teaching core courses for a degree in criminal justice. Core courses for CJ would likely include intro. to criminal justice, criminology, penology, criminal law, statistics, and research design. (The last two are really essential courses for any student in the social sciences.)

The main reason some professors are better than others is not directly related to being tenure track faculty or adjunct though — a big factor is continuity. When I am teaching a course for the first time, students are guinea pigs, whereas a professor who has taught the course many semesters is going to be better prepared. Adjuncts with poor pay are not as likely to stick around, so you get a revolving door. Folks who have been around a while are just more likely to be polished teachers.

To end, some students choose bigger schools because they believe there are more opportunities (either to have fun or for their education). There were really more opportunities at Bloomsburg than I could even take advantage of. Besides the BA in criminal justice, I had a minor in statistics and sociology. I also got my introduction to making maps by taking a GIS class in the geography department. In retrospect I would have taken a few more math classes (like swapping out the Econ Statistics courses for Macro-Econ). Bloomsburg is small but don’t worry about having a fun time either — if you get take out do NAPS, if you just want a slice do OIP. (The pizza here in Dallas is terrible all the places I’ve tried.)

While it may be frustrating to students (or maybe more to parents who are paying the bills), it is in the best long term interest to preserve the quality of education at PASSHE schools. Appropriate pay and benefits for faculty and adjuncts is necessary to do that.

I think I will write a blog post describing more about an undergraduate degree in criminal justice, but if you are a student here in the Dallas area interested in criminology at UTD, always feel free to send me an email with questions. Also please email me a place where I can get a decent tasting slice!

Testing the equality of two regression coefficients

The default hypothesis test that software spits out when you run a regression model is of the null that each coefficient equals zero. Frequently there are other, more interesting tests though, and this is one I've come across often — testing whether two coefficients are equal to one another. The big point to remember is that Var(A-B) = Var(A) + Var(B) - 2*Cov(A,B). This formula gets you pretty far in statistics (and is one of the few I have memorized).

Note that this is not the same as testing whether one coefficient is statistically significant and the other is not. See this Andrew Gelman and Hal Stern article that makes this point. (The link is to a pre-print PDF, but the article was published in the American Statistician.) I will outline four different examples where I see people make this particular mistake.

One is when people have different models, and they compare coefficients across them. For an example, say you have a base model predicting crime at the city level as a function of poverty, and then in a second model you include other control covariates on the right hand side. Let's say the first effect estimate of poverty is 3 (1), where the value in parentheses is the standard error, and the second estimate is 2 (2). The first effect is statistically significant, but the second is not. Do you conclude that the effect sizes are different between models though? The evidence for that is much less clear.

To construct the estimate of how much the effect declined, the decline would be 3 - 2 = 1, a decrease of 1. What is the standard error around that decrease though? We can use the formula for the variance of the difference that I noted before to construct it. The standard error squared is the variance around each parameter estimate, so sqrt(1^2 + 2^2) =~ 2.24 is the standard error of the difference — which assumes the covariance between the two estimates is zero. So the standard error around our estimated decline is quite large, and we cannot be sure the estimated effect of poverty is appreciably different between the two models.

There are more complicated ways to measure moderation, but this ad-hoc approach can easily be applied as you read other people's work. The assumption of zero covariance between the parameter estimates is not as big of a deal as it may seem. In large samples these covariances tend to be very small, and they are frequently negative. So even though we know that assumption is wrong, just pretending the covariance is zero is not a terrible folly.
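
Here is that back of the envelope calculation in R, just plugging in the estimates and standard errors from the example above (3 with a standard error of 1, versus 2 with a standard error of 2):

#Difference between the two poverty estimates, assuming zero covariance
b1 <- 3; se1 <- 1   #poverty effect in the base model
b2 <- 2; se2 <- 2   #poverty effect with the controls added
d <- b1 - b2
se_d <- sqrt(se1^2 + se2^2)
c(difference=d, se=se_d, z=d/se_d)   #z is well under 2, so no clear evidence the effect changed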

The second is where you have models predicting different outcomes. So going with our same example, say you have a model predicting property crime and a model predicting violent crime. Again, I will often see people make an equivalent mistake to the moderator scenario, and say that the effect of poverty is larger for property than violent because one is statistically significant and the other is not.

In this case, if you have the original data you actually can estimate the covariance between those two coefficients. The simplest way is to estimate that covariance via seemingly unrelated regression. If you don't have the data though, such as when you are reading someone else's paper, you can just assume the covariance is zero. Because the parameter estimates often have negative correlations, this assumption will tend to make the standard error estimate of the difference smaller than it should be.
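
If you do have the original data, a sketch of the seemingly unrelated regression route in R might look like the below. This uses the systemfit package; the data frame and variable names (dat, Property, Violent, Poverty) are made up for illustration, and the coefficient labels depend on how you name the equations.

#Seemingly unrelated regressions for the two crime outcomes
library(systemfit)
eqs <- list(prop = Property ~ Poverty,
            viol = Violent ~ Poverty)
sur <- systemfit(eqs, method="SUR", data=dat)
b <- coef(sur)   #coefficients are labeled equation_variable, e.g. "prop_Poverty"
V <- vcov(sur)   #covariance matrix across both equations
d <- b["prop_Poverty"] - b["viol_Poverty"]
se_d <- sqrt(V["prop_Poverty","prop_Poverty"] + V["viol_Poverty","viol_Poverty"] -
             2*V["prop_Poverty","viol_Poverty"])
d/se_d   #z statistic for the difference in the two poverty effects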

The third is where you have different subgroups in the data, and you examine the differences in coefficients. Say you had recidivism data for males and females, and you estimated an equation of the effect of a treatment on males and another model for females. So we have two models:

Model Males  : Prob(Recidivism) = B_0m + B_1m*Treatment
Model Females: Prob(Recidivism) = B_0f + B_1f*Treatment

Where the B_0? terms are the intercepts, and the B_1? terms are the treatment effects. Here is another example where you can stack the data and estimate an interaction term to get the difference in the effects and its standard error. So we can estimate a combined model for both males and females as:

Combined Model: Prob(Recidivism) = B_0c + B_1c*Treatment + B_2c*Female + B_3c(Female*Treatment)

Where Female is a dummy variable equal to 1 for female observations, and Female*Treatment is the interaction term for the treatment variable and the Female dummy variable. Note that you can rewrite the model for males and females as:

Model Mal.: Prob(Recidivism) =     B_0c      +      B_1c    *Treatment    ....(when Female=0)
Model Fem.: Prob(Recidivism) = (B_0c + B_2c) + (B_1c + B_3c)*Treatment    ....(when Female=1)

So we can interpret the interaction term, B_3c, as the difference in the treatment effect for females relative to males. The standard error of this interaction takes the covariance term into account, unlike estimating two totally separate equations. (You can stack the property and violent crime outcomes I mentioned earlier in an analogous way to this subgroup example.)
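
As a quick sketch in R (the variable names are made up), the stacked model is just an interaction term. Here I use a linear probability model to match the equations above, but the same logic applies with a logit link.

#Stacked model, dat has recid (0/1), treat (0/1), and female (0/1)
m <- lm(recid ~ treat*female, data=dat)
summary(m)    #the treat:female row is B_3c, the female minus male difference in treatment effects
confint(m)    #confidence interval for that difference, already accounting for the covariance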

The fourth and final example is the simplest: two regression coefficients in the same equation. One example is from my dissertation on the correlates of crime at small spatial units of analysis. I test whether different places that sell alcohol — such as liquor stores, bars, and gas stations — have the same effect on crime. For simplicity I will just test two effects: whether liquor stores have the same effect as on-premise alcohol outlets (which include bars and restaurants). So let's say I estimate a Poisson regression equation as:

log(E[Crime]) = Intercept + b1*Bars + b2*LiquorStores

And then my software spits out:

                  B     SE      
Liquor Stores    0.36  0.10
Bars             0.24  0.05

And then let's say we also have the variance-covariance matrix of the parameter estimates – which most stat software will return for you if you ask it:

                 Liquor_Stores    Bars  
Liquor_Stores     0.01
Bars             -0.0002          0.0025

On the diagonal are the variances of the parameter estimates, which if you take the square root are equal to the reported standard errors in the first table. So the difference estimate is 0.36 - 0.24 = 0.12, and the standard error of that difference is sqrt(0.01 + 0.0025 - 2*-0.0002) =~ 0.11. You can take the ratio of the difference and its standard error, here 0.12/0.11 =~ 1.1, and treat that as a test statistic from a normal distribution, so the usual rule that it needs to be beyond plus or minus two to be statistically significant at the 0.05 level applies. Here the difference is not statistically significant.
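
In R this by-hand calculation only takes a few lines. A sketch, assuming a fitted model m with variables named Bars and LiquorStores (e.g. m <- glm(crime ~ Bars + LiquorStores, family=poisson, data=dat), where crime and dat are hypothetical names):

#Wald test for equal coefficients, pulled straight from the fitted model
b <- coef(m)
V <- vcov(m)
d <- unname(b["LiquorStores"] - b["Bars"])
se_d <- sqrt(V["LiquorStores","LiquorStores"] + V["Bars","Bars"] - 2*V["LiquorStores","Bars"])
c(difference=d, se=se_d, z=d/se_d, p.value=2*pnorm(-abs(d/se_d)))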

This is called a Wald test specifically. I will follow up with another blog post and some code examples on how to do these tests in SPSS and Stata. For completeness and just because, I also list two more ways to accomplish this test for the last example.

The first alternative is to do a likelihood ratio test.

So we have the full model as:

 log(E[Crime]) = b0 + b1*Bars + b2*Liquor_Stores [Model 1]

And we have the reduced model as:

 log(E[Crime]) = b4 + b5*(Bars + Liquor_Stores)  [Model 2]

So we just estimate the full model with Bars and Liquor Stores on the right hand side (Model 1), then estimate the reduced model (2) with the sum of Bars + Liquor Stores on the right hand side. Then you can just do a chi-square test based on the change in the log-likelihood. In this case there is a change of one degree of freedom.

I give an example of doing this in R on CrossValidated. This test is nice because it extends to testing multiple coefficients, e.g. if I wanted to test bars = liquor stores = convenience stores. The individual Wald tests above are not as convenient for testing the equality of more than two coefficients at once.
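
For a rough sketch of what the likelihood ratio version looks like in R (again the variable names here are hypothetical):

#Full model versus the model that constrains the two effects to be equal
full <- glm(crime ~ Bars + LiquorStores, family=poisson, data=dat)
reduced <- glm(crime ~ I(Bars + LiquorStores), family=poisson, data=dat)
anova(reduced, full, test="Chisq")   #chi-square test on one degree of freedom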

The second alternative is a way to have the computer more easily spit out the Wald test for the difference between two coefficients in the same equation. Say we have the model (the lack of an intercept does not matter for the discussion here):

y = b1*X + b2*Z [eq. 1]

We can test the null that b1 = b2 by rewriting our linear model as:

y = B1*(X + Z) + B2*(X - Z) [eq. 2]

And the test of the B2 coefficient is our test of interest. The logic goes like this — we can expand [eq. 2] to be:

y = B1*X + B1*Z + B2*X - B2*Z [eq. 3]

which you can then regroup as:

y = X*(B1 + B2) + Z*(B1 - B2) [eq. 4]

and note the equalities between equations 4 and 1.

B1 + B2 = b1; B1 - B2 = b2

So testing whether B2 equals zero is the same as testing whether b1 equals b2. B2 is a little tricky to interpret as an effect size for how much larger b1 is than b2 – it is only half of the difference. An easier way to estimate that effect size is to insert (X-Z)/2 instead of (X-Z) into the right hand side; the coefficient and confidence interval for that term are then the estimate of how much larger the effect of X is than the effect of Z.

Note that this gives an equivalent result to conducting the Wald test by hand as I mentioned before.
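
For a concrete sketch of the reparameterization in R (hypothetical names again):

#eq. 1 versus the reparameterized eq. 2, using the (X-Z)/2 rescaling
m1 <- lm(y ~ X + Z, data=dat)
m2 <- lm(y ~ I(X + Z) + I((X - Z)/2), data=dat)
summary(m2)   #the I((X - Z)/2) row is the estimate of b1 - b2, with its standard error and test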

Group based trajectory models in Stata – some graphs and fit statistics

For my advanced research design course this semester I have been providing code snippets in Stata and R. This is the first time I've really sat down and programmed extensively in Stata, and this is a followup to produce some of the same plots and model fit statistics for group based trajectory models as this post in R. The code and the simulated data I made to reproduce this analysis can be downloaded here.

First, for my own notes my version of Stata is on a server here at UT Dallas. So I cannot simply go

net from http://www.andrew.cmu.edu/user/bjones/traj
net install traj, force

to install the group based trajectory code. First I have this set of code in the header of all my do files now

*let stata know to search for a new location for stata plug ins
adopath + "C:\Users\axw161530\Documents\Stata_PlugIns"
*to install on your own in the lab it would be
net set ado "C:\Users\axw161530\Documents\Stata_PlugIns"

So after running that then I can install the traj command, and Stata will know where to look for it!

Once that is taken care of, after setting the working directory I can simply load in the csv file. Here my first variable was read in as ïid instead of id (I'm thinking because of the encoding of the csv file), so I rename that variable to id.

*Load in csv file
import delimited GroupTraj_Sim.csv
*BOM mark makes the id variable weird
rename ïid id

Second, the traj model expects the data in wide format (which this data set already is in), with counts in count_1, count_2, ..., count_10. The traj command also wants you to input a time variable, which I do not have in this file. So I create a set of t_ variables to go along with the counts, just numbered 1 to 10.

*Need to generate a set of time variables to pass to traj, just label 1 to 10
forval i = 1/10 { 
  generate t_`i' = `i'
}

Now we can estimate our group based models, and we get a pretty nice default plot.

traj, var(count_*) indep(t_*) model(zip) order(2 2 2) iorder(0)

Now for absolute model fit statistics, there are the average posterior probabilities, the odds of correct classification, and the observed classification proportion versus the expected classification proportion. Here I made a program that I will surely be ashamed of later (I should not brutalize the data like this, and should do the calculations in matrix form instead), but it works and produces an ugly table that gets us these stats.

*I made a function to print out summary stats
program summary_table_procTraj
    *now lets look at the average posterior probability
    gen Mp = 0
    foreach i of varlist _traj_ProbG* {
        replace Mp = `i' if `i' > Mp 
    }
    sort _traj_Group
    *and the odds of correct classification
    by _traj_Group: gen countG = _N
    by _traj_Group: egen groupAPP = mean(Mp)
    by _traj_Group: gen counter = _n
    gen n = groupAPP/(1 - groupAPP)
    gen p = countG/ _N
    gen d = p/(1-p)
    gen occ = n/d
    *Estimated proportion for each group
    scalar c = 0
    gen TotProb = 0
    foreach i of varlist _traj_ProbG* {
       scalar c = c + 1
       quietly summarize `i'
       replace TotProb = r(sum)/ _N if _traj_Group == c 
    }
    *This displays the group number, the count per group, the average posterior probability for each group,
    *the odds of correct classification, and the observed probability of groups versus the probability 
    *based on the posterior probabilities
    list _traj_Group countG groupAPP occ p TotProb if counter == 1
end

This should work after any model as long as the naming conventions for the assigned groups are _traj_Group and the posterior probabilities are in the variables _traj_ProbG*. So when you run

summary_table_procTraj

you get this ugly table:

     | _traj_~p   countG   groupAPP        occ       p    TotProb |
  1. |        1      103    .937933   43.57431   .2575   .2573651 |
104. |        2      161   .9935606   229.0434   .4025   .4064893 |
265. |        3      136   .9607248   47.48378     .34   .3361456 |

The groupAPP column contains the average posterior probabilities – here you can see they are all quite high. occ is the odds of correct classification, and again they are all quite high. p is the proportion in each group based on assigning each observation to its maximum posterior probability group, and TotProb is the expected proportion based on the sums of the posterior probabilities. TotProb should be the same as in the Group Membership part at the bottom of the traj output. They are close (to 5 decimals), but not exactly the same (and I do not know why that is the case).

Next, to generate a plot of the individual trajectories, I want to reshape the data to long format. I use preserve in case I want to go back to wide format later and estimate more models. If you need to look to see how the reshape command works, type help reshape at the Stata prompt. (Ditto for help on all Stata commands.)

preserve
reshape long count_ t_, i(id)

To get the behavior I want in the plot, I use a scatter plot but have them connected via c(L). Then I create small multiples for each trajectory group using the by() option. Before that I slightly jitter the count data, so the lines are not always perfectly overlapped. I make the lines thin and grey — I would also use transparency but Stata graphs do not support this.

gen count_jit = count_ + ( 0.2*runiform()-0.1 )
graph twoway scatter count_jit t_, c(L) by(_traj_Group) msize(tiny) mcolor(gray) lwidth(vthin) lcolor(gray)

I’m too lazy at the moment to clean up the axis titles and such, but I think this plot of the individual trajectories should always be done. See Breaking Bad: Two Decades of Life-Course Data Analysis in Criminology, Developmental Psychology, and Beyond (Erosheva et al., 2014).

While this fit looks good, this is not the correct number of groups given how I simulated the data. I will give those trying to find the right answer a few hints; none of the groups have a higher polynomial than 2, and there is a constant zero inflation for the entire sample, so iorder(0) will be the correct specification for the zero inflation part. If you take a stab at it let me know, I will fill you in on how I generated the simulation.

Writing equations in Microsoft Word

A student asked me about using LaTex the other day, and I stated that it is a bit of a hassle for journal articles in our field, so I have begun to use it less. Most of the journals in my field (criminology and criminal justice) make it difficult to turn in an article in that format. Many refuse to accept PDF submissions outright, and the last time I submitted a LaTex file to JQC (a Springer journal) that would not compile, I received zero help from staff over a month of emails, so I just reformatted it to a Word document anyway. The last time I submitted a LaTex document to Criminology, a reviewer said it probably had typos — without pointing out any, of course. (FYI folks, besides doing the obvious and pointing out typos if they exist, my text editor has a spell checker, same as Word, to highlight typos.) Besides this, none of my co-workers use LaTex, so it is a non-starter when I am collaborating. I did my dissertation in LaTex, and I would do that in LaTex again, but for smaller articles it is not a big deal.

The main nicety of LaTex is math equations. I don't do too much heavy math, and I have figured out the Microsoft Word equation editor well enough to suit most of my needs. So here is a set of examples for many of the use cases I have needed in journal articles. I also have this in a Word (docx) document and a PDF for handy reference. Those have a few references I gathered from the internet, but the best IMO is this guy's blog (he is, I think, a developer for Word) and this document authored by the same individual.

One of the things to note about the equation editor in Word is that you can type various shortcuts and they will be automatically converted. For example, you can type \gamma, hit the space bar, and the equation will change to showing the gamma symbol. So there are some similarities to LaTex. (Another pro-tip: to start an equation in Word you can press Alt and =.) In the subsequent examples I will use <space> to represent hitting the space bar, and there are other examples of using <back> (for the left arrow key) and <backspace> for the backspace button.

Greek characters, subscripts and superscripts

If you type

log<space>(\lambda) = \beta_0<space> + \beta_1<space>(X) + \beta_2<space>(X^2<space>)

you get:


Bars and hats

For these you need to hit the space key twice, so

x\hat<space><space> = y\bar<space><space>

turns into:

Expected value and variance

For the equivalent of \mathbb in LaTex, you can do

\doubleV<space>(X)= \doubleE<space>(X^2<space>) - \doubleE<space>(X)^2<space>

Plain text within equation

To do plain text within an equation, equivalent to \text{*} in LaTex, you can use double quotes. (Note that you do not need a backslash before "log".) So

Y = -1\cdot<space>log<space>("Property Crime"<space>) + (not pretty text)

looks like:

Sum and product

To get the product symbol is simply \prod<space>, and here is a more complicated example for the sum:

n^-1\cdot<space>\sum^n_(i=1)<space>x_i<space>= x\bar<space><space>

Square root

Square roots always cause me trouble for how they look and kern (both in LaTex and Word). Here is how I would do an example of Euclidean distance,



The big (stacked) fraction is simple, but I had to search for a bit to find how to do inline fractions (what Word calls "linear"). So here back slash followed by forward slash does the inline fraction:

1/n = 1\/n

Numbering an equation

I’ve seen quite a few different hacks for numbering equations in Word. If you need to number and refer to them in text often, I would use LaTex. But here is one way to do it in Word.

E = mc^2#(30)<enter>

produces below (is it just me or does this make the equation look different than the prior ones in Word?):

Multiple lines of equations

For a while I did not think this was possible, but I recently found examples of multiline equations (equivalent to \align in Latex). The way this works is you place a & sign before the symbols you want to line up (same as LaTex), but for Word to split a line you use @. So if you type


you will get:

Have any more good examples? Let me know in the comments!

The other nicety of LaTex is formatting references — you are on your own for that in Word though.

Added code snippets page

I’ve written quite a few blog posts over the years, and it is getting to be hard for me to keep all of them orderly. So I recently created a page with a simple table of contents with code snippets broken down into categories. There are also a few extra SPSS macros in there that I have not written blog posts about.

Every now and then I see a broken link and update it, but I know there are more I am missing. So just send an email if I have a link to some of my code and it is currently broken.


Regression to the mean – a tale of change scores

This is a real example from my work illustrating regression to the mean. I have a scale measuring the impulsivity of offenders. I had an intervention that used cognitive behavioral therapy (CBT) in a boot camp for one group, and business as usual for another (just plain old jail). I have measures of impulsivity at pre, post, and a 6 month follow up (which I label as post2). CBT is supposed to reduce impulsivity, and hopefully keep it that way.

I find that those who gained the most during the intervention tend to revert back toward their prior scores once they leave the boot camp. That is, the measure [post - pre], the gain in boot camp, has a negative correlation with [post2 - post], the loss after boot camp. Is this due to the intervention being shitty? No! It is not — this is the result of regression to the mean. This negative correlation does not show any meaningful relationship between the measures; it will happen even if the impulse scores are totally random.

Note that the definition of covariance is:

Cov(X,Y) = E[(x - E[X])*(y - E[Y])]

Where E represents the expectation, and Cov(X,Y) of course means the covariance between X and Y. Here, for easier equations, we can assume the mean of the impulse scale is zero across all three waves, which makes the means of the change scores zero as well (without any loss of generality). So dropping the inner expectations, this equation reduces to:

Cov(X,Y) = E[x*y]

So defining post-pre = Change1 and post2 - post = Change2, expanding out to the original components we have:

Cov(Change1,Change2) = Cov(post-pre,post2-post) = E[ (post-pre)*(post2-post) ]

The last result can then be expanded to:

E[ post*post2 - post*post - pre*post2 + pre*post ]

Because of the linearity of expectation, these can be further teased out:

E[ post*post2 ] - E[ post*post ] - E[ pre*post2 ] + E[ pre*post]

Note we can rewrite this back into variances and covariances of the original levels:

Cov(post,post2) - Var(post) - Cov(pre,post2) + Cov(pre,post)

There are two things to note here. 1) The covariance of the change scores can be written entirely as a function of the variances and covariances of the levels. The change scores do not supply information independent of the levels themselves.

2) If the data are totally random (that is, the covariances between all of the levels are zero), the covariance between the change scores will still be negative. This is because of the minus sign in front of the variance of the post term. For random data all the other covariance terms are zero, so Cov(Change1,Change2) = -Var(post), while Var(Change1) = Var(Change2) = 2*Var(post) when the variances are equal across waves. This results in the correlation between the change scores being -1/2.

For a simple example in R:

> set.seed(10)
> n <- 10000 #sample size
> t1 <- rnorm(n) #three random vectors
> t2 <- rnorm(n)
> t3 <- rnorm(n)
> levels <- data.frame(t1,t2,t3)
> differ <- data.frame(c1=t2-t1,c2=t3-t2)
> #correlations in levels are approximately zero
> cor(levels)
              t1           t2            t3
t1  1.0000000000  0.001874345 -0.0007006367
t2  0.0018743450  1.000000000 -0.0045967380
t3 -0.0007006367 -0.004596738  1.0000000000
> #correlation of differences is -0.5
> cor(differ)
           c1         c2
c1  1.0000000 -0.4983006
c2 -0.4983006  1.0000000

Sometimes I see people talk about regression to the mean as if it is a sociological thing, like something that needs to be explained in terms of human behavior. It is not, it is entirely mathematical.

This is also one of the reasons I don't like using change scores, either as independent or dependent variables. They can typically be rewritten in terms of the levels, and they involve coefficient restrictions that can have strange consequences. There are some situations (fixed effects) where they make sense as the dependent variable. I haven't seen a situation where they make sense as independent variables.

New course – Advanced Methods in Criminology

This fall I am teaching a PhD course, listed as Crim 7301 – Seminar in Criminology Research and Analysis. The link to UT Dallas's coursebook description is here, and I have placed a page on my blog that contains the syllabus. My blog page is just nicer, since I can include more info there than I can directly on UTD's page, as well as update material as I go.

The UT Dallas description of the course is pretty open, but I was mostly motivated to design a course that goes over the quasi-experimental research designs I encounter most often in practice. Students also have a lab component, which entails actually conducting such data analyses, with code snippets mostly in SPSS, R, and Stata.

For this, as well as other course materials, if you cannot access the Dropbox links feel free to email me and I will send the material directly.

As with many graduate level methods courses, it is heavily influenced by my personal experience. But I am open to suggestions in the future if you want a particular topic covered. For example I debated on including missing data analysis (e.g. multiple imputation, full information max. likelihood) for a week. If there was interest in that (or some other topic) I could definitely update the syllabus. Just come by for a chat or send me an email with your suggestion.

In general for students, if you have questions or would like my input on project ideas feel free to stop by. My current posted office hours are Tuesday and Friday, 9 to 11 am, but I am at my office during normal work hours for the whole week. (Knock if the door is closed, I am often in here.)

Roadblocks in Buffalo update (plus more complaints about peer-review!)

I've updated the Roadblocks in Buffalo manuscript due to a rejection and the subsequent critiques. So be prepared for my complaints about the peer review!

I’ve posted the original manuscript, reviews and a line-by-line response here. This was reviewed at Policing: An International Journal of Police Strategies & Management. I should probably always do this, but I felt compelled to post this review by the comically negative reviewer 1 (worthy of an article on The Allium).

The comment of reviewer 1 that really prompted me to even bother writing a response was the critique of the maps. I spend a lot of time making my figures nice and understandable. I'm all ears if you think they can be improved, but best be prepared for my response if you critique something silly.

So here is the figure in question – spot anything wrong?

The reviewer stated it did not have a legend, so it did not meet "GIS standards". The lack of a legend is intentional. When you open Google Maps, do you see a legend? Nope! It is a positive thing to make a graphic simple enough that it does not need a legend. This particular map only has three elements: the outline of Buffalo, the streets, and the points where the roadblocks took place. There is no need to make a little box illustrating these three things – they are obvious. The title is sufficient to know what you are looking at.

Reviewer 2 was more even keeled. The only thing I would consider a large problem in their review was that they did not think we matched comparable control areas. If true, I agree it is a big deal, but I'm not quite sure why they thought this (sure, balance wasn't perfect, but it is pretty close across a dozen variables). I wouldn't release the paper if I thought the control areas were not reasonable.

Besides arbitrary complaints about the literature review, this is probably the most frustrating thing about peer reviews. Often you will get a list of two dozen complaints, with most being minor and fixable in a sentence (if not entirely arbitrary), but the article will still be rejected. People have different internal thresholds for what is or is not publishable. I'm on the end that even with warts most of the work I review should still be published (or at least the authors given a chance to respond). Of the 10 papers I've reviewed, my record is 5 revise-and-resubmits, 4 conditional accepts, and 1 rejection. One of the revise-and-resubmits I gave a pretty large critique of (in that I didn't think it was possible to improve the research design), but the other 4 could easily be changed to accepts after addressing my concerns. So worst case scenario I've given the green light to 8/10 of the manuscripts I've reviewed.

Many reviewers are at the other end though. Sometimes comically so, in that given the critiques nothing would ever meet their standards. I might call it the golden-cow peer review standard.

Even though both of my manuscripts have been rejected from PSM, I do like their use of a rubric. This experience makes me wonder what would happen if the reviewers did not give a final reject-accept decision, and the editors just took the actual comments and made their own decision. Editors do a version of this currently, but some are known to reject if any of the reviewers gives a rejection, no matter what the reviewers actually say. It would force the editor to use more discretion if the reviewers themselves did not make the final judgement. It would also force reviewers to be more clear in their critiques. If they are superficial the editor will ignore them, whereas a final accept-reject decision is easy to take into account even if the review does not state any substantive critiques.

I don't know if I can easily articulate what I think is a big deal and what isn't though. I am a quant guy, so the two instances I rejected were for model identification in one and for sample selection biases in the other. So, things that essentially could not be changed. I haven't read a manuscript that was so poor I considered it unsalvageable in terms of writing. (I will do a content analysis of the reviews I've received sometime, but almost all complaints about the literature review are arbitrary and shouldn't be used as reasons for rejection.)

Oftentimes I write a bunch of notes on the paper manuscript during my first read, and then when I go to write up the critique I edit many of them out. This often catches silly initial comments of mine, as I come to better understand the manuscript. Examples of silly comments in the reviews of the roadblock paper are claiming I don't conduct a pre-post analysis (reviewer 1), and asking for things already stated in the manuscript (reviewer 2 asking how long the roadblocks were and whether they were "high visibility"). While it is always the case that things could be explained more clearly, at some point the reviewers need to be more careful in reading the manuscript. I think my motto of "be specific" helps with this. Being generic helps to conceal silly critiques.

Plotting panel data with many lines in SPSS

A quick blog post – so you all are not continually assaulted by my mug shot on the front page of the blog!

Panel data is complicated. When conducting univariate time series analysis, pretty much everyone plots the series. I presume people do not do this as often for panel data because the charts tend to be messier and less informative. But using transparency and small multiple plots are easy ways to start unpacking the information. Here I am going to show these using plots of arrest rates from 1970 through 2014 for New York state counties. The data and code can be downloaded here, and that zip file contains info on where the original data came from. It is all publicly available – but mashing up the historical census data for the population counts by county is a bit of a pain.

So I will start by grabbing my created dataset, and then making a default plot of all the lines. The Y axis is the arrest rate per 1,000 population, and the X axis is the year.

*Grab the dataset.
FILE HANDLE data /NAME = "!!Your File Handle Here!!!".
GET FILE = "data\Arrest_WPop.sav".

*Default plot - all of the lines over time.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Total_Rate County 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Total_Rate=col(source(s), name("Total_Rate"))
  DATA: County=col(source(s), name("County"), unit.category())
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Total Arrest Rate per 1,000"))
  ELEMENT: line(position(Year*Total_Rate), split(County))
END GPL.

That is not too bad, but we can do slightly better by making the lines small and semi-transparent (which is the same advice for dense scatterplots):

*Make them transparent and smaller.
FORMATS Total_Rate (F2.0).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Total_Rate County 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Total_Rate=col(source(s), name("Total_Rate"))
  DATA: County=col(source(s), name("County"), unit.category())
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Total Arrest Rate per 1,000"))
  SCALE: linear(dim(1), min(1970), max(2014))
  ELEMENT: line(position(Year*Total_Rate), split(County), transparency(transparency."0.7"), size(size."0.7"))
END GPL.

This helps disentangle the many lines bunched up. There appear to be two outliers, and basically the rest of the pack.

A quick way to check out each individual line is then to make small multiples. Here I wrap the panels, and make the plot size bigger. I also make the X and Y axis null. This is ok though, as I am just focusing on the shape of the trend, not the absolute level.

*Small multiples.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Total_Rate County 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(1000px,1000px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Total_Rate=col(source(s), name("Total_Rate"))
  DATA: County=col(source(s), name("County"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), null())
  GUIDE: axis(dim(3), opposite())
  SCALE: linear(dim(1), min(1970), max(2014))
  ELEMENT: line(position(Year*Total_Rate*County))
  PAGE: end()
END GPL.
*Manually edited to make less space between panels.

There are a total of 62 counties in New York, so this is feasible. With panel sets of many more lines, you can either split the small multiple into more graphs, or cluster the lines based on the overall shape of the trend into different panels.

Here you can see that the outliers are New York county (Manhattan) and Bronx county. Bronx is a pretty straight upward trend (which mirrors many other counties), but Manhattan’s trajectory is pretty unique and has a higher variance than most other places in the state. Also you can see Sullivan county has quite a high rate compared to most other upstate counties (upstate is New York talk for everything not in New York City). But it leveled off fairly early in the time series.

This dataset also has arrest rates broken down by different categories: felony (drug, violent, DWI, other) and misdemeanor (drug, DWI, property, other). It is interesting to see that arrest rates have been increasing in most places over this long time period, even though crime rates have been going down since the 1990s. They all appear to be pretty correlated, but let me know if you use this dataset to do some more digging. (It appears index crime totals can be found going back to 1990 here.)