Paper on Roadblocks in Buffalo published

My paper with Scott Phillips, A quasi-experimental evaluation using roadblocks and automatic license plate readers to reduce crime in Buffalo, NY, has just been published online first in the Security Journal. Springer has gifted me a special link through which you can read the paper. Previously when I have been given links like that from the publisher they had a time limit, but the email for this one said nothing about that. But even if that link goes bad, you can always read the pre-print of the article I posted on SSRN.


Title: A quasi-experimental evaluation using roadblocks and automatic license plate readers to reduce crime in Buffalo, NY

Abstract:

This article evaluates the effectiveness of a hot spots policing strategy: using automated license plate readers at roadblocks in Buffalo, NY. Different roadblock locations were chosen by the Buffalo Police Department every day over a two-month period. We use propensity score matching to identify a set of control locations based on prior counts of crime and demographic factors. We find modest reductions in Part 1 violent crimes (10 over all roadblock locations and over the two months) using t tests of mean differences. We find a 20% reduction in traffic accidents using fixed effects negative binomial regression models. Both results are sensitive to the model used though, and the fixed effects models predict increases in crimes due to the intervention. We suggest that the limited intervention at one time may be less effective than focusing on a single location multiple times over an extended period.

And here is Figure 2 from the paper, showing the units of analysis (street midpoints and intersections) and how the treatment locations were assigned.

Much ado about nothing: Overinterpreting volatility in homicide rates

I’m not much of a macro criminologist, but being asked questions by my dad (about Richard Rosenfeld and the Ferguson effect) and by my dentist yesterday (about some of Trump’s comments on rising crime trends) has prompted me to jump into it and give my opinion. Long story short — I believe many sources are overinterpreting short term fluctuations as more meaningful than they are.

First I will tackle national crime rates. If you have happened to walk by a TV playing CNN the past few days, you may have heard Donald Trump being criticized for his statements on crime rates. This is partially a conflation of overall levels of crime with changes in crime over time. Basically, crime is currently low compared to historical patterns, but homicide rates have been rising for the past two years. This is easier to show in a chart than to explain in words. So here is the national estimated homicide rate per 100,000 individuals since 1960.1

The 2016 number is not official and is still an estimate, but basically the pattern is this – crime has generally been falling across the country since the early 1990s. Crime rates in just the past few years have finally dropped below the levels of the 1960s, but for the past two years homicides have been increasing. So some have pointed to the increase in the past two years and claimed the sky is falling. To support this they say the rate of change is the largest in the past 40 years. There are better charts to show rates of change (a semi-log chart), but the overall look is basically the same.

You have to really squint to see that the change from 2014 to 2015 is a larger jump than any of the other changes over the entire period, so arguments based on the size of recent changes in the homicide rate are hyperbole (either on a linear scale or a logarithmic scale). And even if you take the recent increases over the past two years as evidence of a more general rising trend, in the broader pattern homicide rates are still close to their low point of the past 50 years.

For a bit of general advice: for any source that gives you a percent change, you always want to see the base numbers and the longer term historical trends. Any media source that cites recent increases in homicides without providing this graph of long term historical crime trends is simply misleading. I’ve seen this done in many places, see this example from the New York Times or this recent note from the Economist. So this isn’t something specific to the President.

Now, macro criminologists don’t really have any better track record explaining these patterns than macro economists have in explaining economic trends. Basically we have a bunch of patchwork theories that make sense for parts of the trend, but not the entire time frame. Changes in routine activities in the 1960s, increases in incarceration, the decline of crack use, the ease of calling 911 with cell phones, lead exposure, abortion (just to name a few). And academics come up with new theories all the time, the most recent being the Ferguson effect — which is simply another term for de-policing.

Now a bit on trends for specific cities. How this ties in with the national trend is that some articles have been pointing out that some cities have seen increases and some have not. That is fine to point out (albeit trivial), but then the articles frequently go on to generate stories about why crime is rising in those specific places. Those on the left cite civil unrest and police brutality as possible reasons (Milwaukee, St. Louis, Chicago, Baltimore), while those on the right cite the deleterious effects of police departments not being as proactive (stops in Chicago, arrests in Baltimore).

While any of these explanations may turn out reasonable in the end, I’m pretty sure most of these articles severely underappreciate the volatility in homicide rates. Take an example with St. Louis, with a city population of just over 300,000. A homicide rate of 50 individuals per 100,000 means a total of 150 murders. A homicide rate of 40 per 100,000 means 120 murders. So we are only talking about a change of 30 murders overall. Fluctuations of around 10 in the murder rate would not be unexpected for a city with a population of 300,000 individuals. The confidence interval for a rate of 150 murders per 300,000 individuals is 126 to 176 murders.2
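Here is a quick sketch of those numbers in R (the interval is the same exact Clopper-Pearson interval as in footnote 2):

# convert homicide rates to counts for a city of 300,000
pop <- 300000
c(50, 40) * pop / 100000                     # 150 and 120 murders

# exact Clopper-Pearson interval around 150 murders out of 300,000
(binom.test(150, pop)$conf.int[1:2]) * pop   # roughly 126 to 176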

Even that understates the typical volatility in homicide rates though, as it basically assumes the underlying rate does not change over time. In reality crime statistics are more bursty, and show wilder fluctuations in different places.3 To show this for many cities, I use the data from the Economist article mentioned earlier and create a motion chart of the changes in homicide rates over time. The idea behind this chart is a funnel chart. Cities with lower populations will show higher variance, and subsequently those dots on the left hand side of the chart will jump around a lot more. The population figures are current and not varying, so the dots just move up and down on the Y axis.

For best viewing, make the X axis on the log scale, and size the points according to the population of the city. If you are at a desktop computer, you can open up a bigger version of the chart here.

Selecting individual points and then letting the animation run, though, illustrates the typical variability of crime over time. Here is the trace of St. Louis over the 36 year period.

New Orleans is another good example; it fluctuates from under 30 to over 90 over the time period.

And here is Chicago, which shows less fluctuation than the smaller cities (as expected) but still has a range of homicide rates around 20 over the time period.

Howard Wainer has previously pointed this relationship out, and called it The Most Dangerous Equation. Basically, if you look you will be able to find some upward crime trends, especially in smaller cities. You need to look at the long term though, and understand typical fluctuations, to make a reasonable decision as to whether crime is increasing or whether it is just typical year to year variation. The majority of news articles on the topic are just chock full of post hoc ergo propter hoc reasoning for particular cherry picked cities, and the explanations often don’t make sense for crime patterns over the past decade in those same cities, let alone for different cities experiencing similar conditions but not having rising homicide rates.
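To get a sense of how much year to year movement you would expect from counting error alone, here is a small simulation sketch in R. The rate of 40 per 100,000 is made up and held fixed, so all of the spread comes from Poisson noise:

# expected spread in homicide rates from Poisson noise alone,
# holding the true rate fixed at 40 per 100,000
set.seed(10)
pops <- c(100000, 300000, 1000000, 3000000)
true_rate <- 40
sd_rate <- sapply(pops, function(p) {
  counts <- rpois(10000, lambda = true_rate * p / 100000)
  sd(counts / p * 100000)   # standard deviation of the simulated rates
})
data.frame(population = pops, sd_of_rate = round(sd_rate, 1))

The smaller cities swing around quite a bit even though the underlying risk never changes, which is exactly the funnel pattern in the motion chart.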



  1. For my notes about data sources, generally the data have come from the FBI UCR data tool (for the 1960 through 2014 data). 2015 data have come from the FBI web page for the 2015 UCR report. The 2016 projections come from this Economist article as well as the 50 cities data for the google motion chart.
  2. Calculated in R via (binom.test(150,300000)$conf.int[1:2])*300000. This is the exact Clopper-Pearson confidence interval.
  3. So even though this 538 article does a better job of acknowledging volatility, whatever test they use to determine statistically significant increases is likely to have too many false positives.

New undergrad course – Communities and Crime

This semester I am teaching a new undergrad course, Communities and Crime. There are still a few seats left if you are a UT Dallas student and interested. (You can also audit the course even if you are not a UT Dallas student.)

You can see the syllabus from the linked page, but compared to other syllabi I’ve found floating around (see Dan O’Brien or Elizabeth Groff for two undergrad examples), I focus more on micro places than others do. Some syllabi I’ve found spend basically the whole semester on social disorganization, which I think is excessive.

One experiment I am going to try for this course is to use Dallas open crime data and then have the students make predictions. For example, for their first assignment they are supposed to predict, based on social disorganization theory, which neighborhood has the most crime in Dallas using this neighborhood map of Dallas. (Fusion table embedding is not working in my WordPress post at the moment for some reason!)

These neighborhoods were obtained from Jane Massey, a researcher for the Dallas area Habitat for Humanity. Hence why the flood plain is its own neighborhood. It is the most reasonable source I’ve seen so far. Most sources generally agree (see Dallas Magazine for one example), but that data is not very tidy. See this web app to draw your own neighborhood in Dallas as well. And of course, for students interested, part of the discussion will be about how you define a neighborhood.

Blogging in Review – 2016

The site has continued to grow in 2016. Looking back over the prior years, the growth has looked pretty linear the whole time.

I take a hit in December, but I almost managed an average of 200 site views per day in November. I topped 100,000 cumulative site views for the blog’s entire existence in November of this year.

Despite moving from Albany to Texas, I still managed to publish 40 new pages this year, which I am pretty happy with. I don’t set any hard expectations for myself, but I like to publish something at least once every two to four weeks.

While some of my initial traffic is bursty, e.g. gets shared on a popular site and you get a couple hundred views in a day, most of my traffic is a slow trickle of referrals from google. Here is a plot of my pages by average views per day, broken down by some of my main categories. Posts colored in red have an SPSS tag, and so the Python and R columns can also be posts on SPSS. (So most of my python posts are calling python from SPSS.)

So even my most popular posts do not average more than a few views per day, and most do not get any appreciable traffic at all. Here are the labels in that dot plot to show what posts they are.

Don’t ask me why some end up being more popular than others (who knew Venn diagrams in R?). I wrote a few more blog posts on using various google maps APIs with python in response to the google places post being popular. The google street view post is doing pretty well, the others not so much though.

My motivation for posts though is more in line with an academic journal/notebook/diary — I essentially post on whatever project I am working on; I don’t go and research specific topics just for the blog. I am happy with the extra exposure though — and I’m sure there is more value added in a tutorial blog post than in a stuffy academic paper that is read by two dozen individuals (even if only the latter counts towards my tenure)!

Review of Trees, maps, and theorems: Effective Communication for rational minds by Jean-luc Doumont

I was recently introduced to the work of Jean-luc Doumont via Robert Kosara. So I picked up his book, Trees, maps, and theorems: Effective Communication for rational minds, and it does not disappoint.

In a nutshell, if you have read Tufte’s Visual display of quantitative information and liked it, you will like Doumont’s book as well. He pursues the same minimalist ideal as Tufte, but has advice not just about statistical graphics, but about all aspects of scientific communication: writing, presentations, and even email.

Doumont’s chapter on effective graphical displays is mainly a brief overview of Tufte’s main points for statistical graphics (also he gives some advice on pictures and icons), but otherwise the book has quite a bit of new advice. Here is a quick sampling of some of the points that most resonated with me:

The rule of three: It is very difficult to maintain more than three items in our short term memory. While some people use the magic number 7 rule, Doumont notes this is clearly the upper limit. Doumont’s suggestion of using three (such as for subheadings in a document, or bullet points in a PowerPoint presentation) also coincides with Howard Wainer’s suggestion to limit the number of significant digits in tables to three.

For oral presentations with slides, he suggests printing out your slides 6 to a page on standard letter size paper. If you have a hard time reading them, the font is too small. I’m not sure if this fits in line with my suggestions for font sizes; it will take some more investigation on my part. Another piece of advice for oral presentations is that you can’t read text on slides and listen to the presenter at the same time. Those two inputs compete in our brain, as opposed to images and talking at the same time. Doumont gives the same advice as Tufte (prepare a handout), but I don’t think this is a good idea. (The handout can be distracting.) If you need people to read text, just take a break and get a sip of water. Otherwise make the text as minimal as possible.

My only real point of contention is that Doumont makes the mistake, in talking about graphics, of saying that one only needs two points labeled on an axis. This is not true in general; you need three. Imagine I gave you an axis:

2--?--8

For a linear scale, the missing point would be 5, but for a logarithmic scale (in base 2) the missing point would be 4. I figured this is worth pointing out as I recently reviewed a paper where a legend for a raster image (pretty sure ArcGIS was the culprit) only had the end points labeled.
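Here is the arithmetic in R for that little example:

mean(c(2, 8))            # 5, the midpoint on a linear scale
2^mean(log2(c(2, 8)))    # 4, the geometric mean, i.e. the midpoint on a log (base 2) axis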

Doumont also has a bunch of advice about writing that I will need to periodically reread. In general one point is that the first sentence of either a section (or paragraph) should be declarative as to the point of that section. Sometimes folks lead with fluff that is only revealed to be related to the material later on in the section.

My writing and work will definitely not live up to Doumont’s standard, but it is a goal I believe scientists should strive for.

Paper – Replicating Group Based Trajectory Models of Crime at Micro-Places in Albany, NY published

My article on estimating crime trajectories in Albany from 2000 through 2014 has been published in the latest issue of JQC.

That link is permanent, but Springer has gifted me a temporary free pdf link that works for everyone for up to four weeks. So grab that if you are interested.

Also note that I have the pre-print posted on SSRN. Since that is Albany PD’s data, I cannot provide code to replicate the analysis. But I have produced a series of blog posts showing how to replicate the trajectory and the point pattern analysis on your own data if you are interested, see

Here is the cross Ripley’s L plot testing for clustering between the different trajectory groupings.

Also always feel free to send me an email if you have questions about the findings and paper.

ASC 2016 – Quantifying the Local and Spatial Effects of Alcohol Outlets on Crime

This year at the American Society of Criminology I will be presenting some work from my dissertation, Quantifying the Local and Spatial Effects of Alcohol Outlets on Crime. I have the working paper posted on SSRN, and that also has a link to download data and code to reproduce the findings in the paper.

I will be presenting at the panel Alcohol and Crime on Wednesday at 9:30 (at the Cambridge room on the 2nd level).

Here is the abstract:

This paper estimates the relationship between alcohol outlets and crime at micro place street units in Washington, D.C. Three specific additions to this voluminous literature are articulated. First, the diffusion effect of alcohol outlets is larger than the local effect. This has important implications for crime prevention. The second is that in this sample the effects of on-premise and off-premise outlets are very similar in magnitude. I argue this is evidence in favor of routine activities theory, in opposition to theories which emphasize individual alcohol consumption. The final is that alcohol outlets have large effects on burglary, despite the fact that alcohol outlets cannot increase the number of vulnerable targets, as they can with interpersonal crimes. I discuss how this can either be interpreted as evidence that alcohol outlets self-select into already crime prone areas, or potentially that the presence of motivated offenders matters much more than increasing the number of potential victims.

The most interesting finding is that I estimate the diffusion effect of alcohol outlets to be larger than the local effect. I then show that this is the case for some other papers as well; it is just that interpreting the regression model is tricky. Here is a diagram showing what happens. The idea is that the regression coefficient for the spatial lag is one orange dot, and the local effect is the blue dot. Adding a bar though diffuses its effect to multiple places, so when adding up all the smaller orange dots, they result in more crime than the one bigger blue dot.

A principled approach to conducting subgroup analysis

Social scientists often have a problem when conducting analysis — we have theories that are not tightly coupled to actual measures of individual behavior. A common response to this is to estimate models of many different, interrelated measures. This can be with outcome variables, e.g. if I know poverty predicts all crimes, does poverty predict both violent crime and property crime at the city level? Or with explanatory variables, e.g. does being a minority reduce your chances of getting a job interview, or does the specific type of minority matter — Black, Asian, Hispanic, Native American, etc.?

Another situation is conducting analysis among different units of analysis, e.g. see if a treatment has a different effect for males or females, or see if a treatment works well in one country, but does not work well in another. Or if I find that a policy intervention works at the city level, are the effects in all areas of the city, or in just some neighborhoods?

On their face, these may all seem like unique problems. They are not; they are all different variants of subgroup analysis. In my dissertation, while trying to identify situations in which you need to use small geographic units of analysis, I realized that the logic behind choosing a geographic unit of analysis is the same as the logic behind these different subgroup analyses. I outline my logic more succinctly in this article, but I will try it in a blog post as well.

I am what I would call a "reductionist social scientist". In plain terms, if we fit a model:

Y = B*X

we can always get more specific, either in terms of explanatory variables:

Y = b1*x1 + b2*x2, where X = x1 + x2

Or in terms of the outcome:

y1 = b1*X
y2 = b2*X, where Y = y1 + y2

Hubert Blalock talks about this in his causal inferences book. I think many social scientists are reductionists in this sense: we can always estimate more specific explanatory variables, or more specific outcomes, or effects within different subgroups, ad nauseam. Thus the problem is not whether we should conduct analysis in some particular subgroup, but when we should be satisfied that the aggregate effect we estimate is good enough, or at least not misleading.

So remember this when evaluating a model: the aggregate effect is a function of the subgroup effects. In linear models the math is easy (I show some examples in the linked paper), but the logic generally holds for non-linear models as well. So when should we be ok with the aggregate level effect? We should be ok if we expect the direction and size of the effects in the subgroups to be similar. We should not be ok if the effects are countervailing in the subgroups, or if the magnitude of the differences is very large.

For some simplistic examples, if we go with our job interview for minorities relative to whites example:

Prob(Job Interview) = 0.5 + 0*(Minority)

So here the effect is zero; minorities have the same probability as white individuals, 50%. But let’s say we estimate an effect for different minority categories:

Prob(Job Interview) = 0.5 + 0.3(Asian) - 0.3(Black)

Our aggregate effect for minorities is zero because it is positive for Asian individuals and negative for Black individuals, and in the aggregate these two effects cancel out. That is one situation in which we should be worried.
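Here is a small simulation sketch in R of that cancellation (made-up data, using linear probability models to keep it simple):

set.seed(1)
n <- 10000
group <- sample(c("White", "Asian", "Black"), n, replace = TRUE)
minority <- as.numeric(group != "White")
p <- 0.5 + 0.3 * (group == "Asian") - 0.3 * (group == "Black")
interview <- rbinom(n, 1, p)

coef(lm(interview ~ minority))                                   # aggregate effect near zero
coef(lm(interview ~ I(group == "Asian") + I(group == "Black")))  # roughly +0.3 and -0.3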

Now how about the effect of poverty on crime:

All Crime = 5*(Percent in Poverty)

Versus different subsets of crime, violent and property.

Violent Crime = 3*(Percent in Poverty)
Property Crime = 2*(Percent in Poverty)

Here we can see that the subgroups contribute to the total, but the effect for property crime is slightly less than that for violent crime. The aggregate effect is not misleading, but the micro level effects may be theoretically interesting.

For the final situation, different areas, let’s say we do a gun buy back program and we estimate the reduction in shootings at the city wide level. So let’s say we estimate the number of shootings per month:

Shootings in the City = 10 - 5*(Gun Buy Back)

So we say the gun buy back reduced shootings by 5 per month. Maybe we think the reduction is restricted to certain areas of the city. For simplicity, say this city only has two neighborhoods, North and South. So we estimate the effect of the gun buy back in these two neighborhoods:

Shootings in the North Neighborhood = 9 - 5*(Gun Buy Back)
Shootings in the South Neighborhood = 1 - 0*(Gun Buy Back)

Here we find the program only reduced shootings in the North neighborhood; it had no appreciable effect in the South neighborhood. The aggregate city level effect is not misleading, we can just be more specific by decomposing that effect into the different areas.


Here I will relate this to some of my recent work — using 311 calls for service to predict crime at micro places in DC.

In a nutshell, I’ve fit models of the form:

Crime = B*(311 Calls for Service)

And I found that 311 calls have a positive, but small, effect on crime.

Over time, either at presentations in person or in peer review, I’ve gotten three different "subgroup" critiques. These are:

  • I think you cast the net too wide in 311 calls, e.g. "bulk collections" should not be included
  • I think you shouldn’t aggregate all crime on the left hand side, e.g. I think the effect is mostly for robberies
  • I think you shouldn’t estimate one effect for the entire city, e.g. I think these signs of disorder matter more in some neighborhoods than others

Now, these are all reasonable questions, but do they call into question my main aggregate finding? Not at all.

For casting the net too wide with 311 calls: do you think that bulk collections have a negative relationship to crime? Unlikely. (I’ve eliminated them from my current article due to one reviewer complaint, but to be honest I think they should be included. Seeing a crappy couch on the street is not much different than seeing garbage.)

For all crime on the left hand side: do you think 311 calls have a negative effect on some crimes but a positive effect on others? Again, unlikely. It may be the case that they have larger effects on some crimes than others, but that does not mean the effect on all crime is misleading. So what if it is larger for robberies than for other crimes? You can go and build a theory about why that is the case and test it.

For the one estimate across different parts of the city: do you think it has a negative effect in some parts and a positive effect in others? Again, unlikely. It may be that in some areas the effect is larger, but overall we expect it to be positive or zero in all areas of the city. The aggregate city wide effect is not likely to be misleading.

These are all fine future research questions, but I get frustrated when they are given as reasons to critique my current findings. They don’t invalidate the aggregate findings at all.


In response to this, you may think: well, why not conduct all these subgroup analyses, what’s the harm? There are a few different harms to conducting these subgroup analyses willy-nilly. They are all related to chasing noise and then interpreting it.

For each of these subgroups, you will have less power to estimate effects than in the aggregate. Say I test the effect of each individual 311 call type (there are nearly 30 that I sum together). Simply by chance some of these will have null or slightly negative effects, and all will be small by themselves. I have no a priori reason to think some have a different effect than others; the theory behind why they are related to crime at all (broken windows) does not distinguish between them.
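A small simulation sketch in R shows what happens (made-up data in which every call type has the exact same small positive effect):

set.seed(2)
n <- 500; k <- 30
calls <- matrix(rpois(n * k, lambda = 2), n, k)   # 30 individual call types
crime <- 1 + 0.02 * rowSums(calls) + rnorm(n)     # identical true effect for every type
fit <- lm(crime ~ calls)
sum(coef(fit)[-1] < 0)   # several of the 30 estimated coefficients still come out negative

Even though the true effect is positive for every single call type, chance alone flips a handful of the estimates.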

This often ends up being a catch-22 in peer review. You do more specific analysis, by chance a coefficient goes in the wrong direction, and the reviewer interprets it as evidence that your measures and/or model are bunk. In reality they are just over-interpreting noise.

That is in response to reviewers, but what about conducting subgroup analysis on your own? Here you have to worry about the garden of forking paths. Say I conducted the subgroup analysis for different types of crime outcomes, and they are all in the same direction except for thefts from auto. I then report all of the results except for thefts from auto, because that one does not confirm my theory. This is a large current problem in reproducing social science findings — a subgroup analysis may seem impressive, but you have to worry about which results the researcher cherry picked.

Reporting only confirmatory evidence for some subgroups will always be a problem in data analysis — not even pre-registration of your analysis plan will solve it. Thus, you should only do subgroup analysis if there is strong theoretical reasoning to think the aggregate effect is misleading. If you think subgroup differences may be theoretically interesting, you should simply plan a new study from the start to assess them.

Given some of the reviews I received for the 311 paper, I am stuffing many of these subgroup analyses in appendices just to preempt reviewer critique. (Ditto for my paper on alcohol outlets and crime that I will be presenting at ASC in a few weeks; that will be up on SSRN likely next week.) I don’t think it is the right thing to do though (again, I think it is mostly noise mining), but perpetually nit-picky reviewers have basically forced me to do it.

My endorsement for criminal justice at Bloomsburg University

The faculty at PASSHE schools (public universities in Pennsylvania) are currently on strike. My main reason to write this post is that I went to Bloomsburg University and received a terrific education (a BA in criminal justice). If I could go back and do it all over again, I would still definitely attend Bloomsburg.

For a bit of background, all of the PA state schools were originally formed as normal schools — colleges to prepare teachers for the lower grades. They were intentionally placed around the state, so students did not have to travel very far. This is why they seem to be in rural places no one has ever heard of (it is intentional). For those in New York, this is an equivalent story to the smaller SUNY campuses — although unlike New York, the state schools in PA have no shared naming convention. This does not include Penn State University, which is a land-grant school. At some later point, the normal colleges expanded to universities and PASSHE was formed.

There are some pathological problems with higher education currently. One of them is the rising price of tuition. PASSHE schools are basically the cheapest places where you can get a bachelor’s degree. At private institutions (or Penn State Univ.) you are going to pay two to three times as much as at a PASSHE institution.

Criminal justice is a continually growing degree. To meet the teaching demand, many programs are filling in with adjunct labor. Bloomsburg did not do this when I was there, and this continues to appear to be the case. The majority of the faculty I had (04-08) are still on the faculty (this is true for criminal justice, sociology, and the two math professors I took all my statistics courses with), although the CJ program appears to have grown beyond the three faculty members (Leo Barrile, Neal Slone, and Pam Donovan) who were there in my time. They did have an additional adjunct when I was there (who shall go unnamed) who is in the running for the laziest teacher I have ever had.

Now, don’t get me wrong — adjuncts can be good teachers. I’ve taught as an adjunct myself. You should be concerned though if the majority of the courses in a department are being taught by adjuncts. People with professional experience can be great teachers — especially for advanced courses about their particular expertise — but they should rarely be teaching core courses for a degree in criminal justice. Core courses for CJ would likely include intro to criminal justice, criminology, penology, criminal law, statistics, and research design. (The last two are really essential courses for any student in the social sciences.)

The main reason some professors are better than others is not directly related to being tenure track faculty or adjunct though — a big factor is continuity. When I am teaching a course for the first time, students are guinea pigs, whereas a professor who has taught the course many semesters is going to be better prepared. Adjuncts with poor pay are not as likely to stick around, so you get a revolving door. Folks who have been around a while are just more likely to be polished teachers.

To end, some students choose bigger schools because they believe there are more opportunities (either to have fun or for their education). There were really more opportunities at Bloomsburg than I could even take advantage of. Besides the BA in criminal justice, I had minors in statistics and sociology. I also got my introduction to making maps by taking a GIS class in the geography department. In retrospect I would have taken a few more math classes (like swapping out the Econ Statistics courses for Macro-Econ). Bloomsburg is small, but don’t worry about having a fun time either — if you get take out, do NAPS; if you just want a slice, do OIP. (The pizza here in Dallas is terrible at all the places I’ve tried.)

While it may be frustrating to students (or maybe more to the parents who are paying the bills), it is in everyone’s best long term interest to preserve the quality of education at PASSHE schools. Appropriate pay and benefits for faculty and adjuncts are necessary to do that.

I think I will write a blog post describing more about an undergraduate degree in criminal justice, but if you are a student here in the Dallas area interested in criminology at UTD, always feel free to send me an email with questions. Also please email me a place where I can get a decent tasting slice!

Testing the equality of two regression coefficients

The default hypothesis test that software spits out when you run a regression model is of the null that the coefficient equals zero. Frequently there are other more interesting tests though, and this is one I’ve come across often — testing whether two coefficients are equal to one another. The big point to remember is that Var(A-B) = Var(A) + Var(B) - 2*Cov(A,B). This formula gets you pretty far in statistics (and is one of the few I have memorized).

Note that this is not the same as testing whether one coefficient is statistically significant and the other is not. See this Andrew Gelman and Hal Stern article that makes this point. (The link is to a pre-print PDF, but the article was published in the American Statistician.) I will outline four different examples where I see people make this particular mistake.

One is when people have different models and they compare coefficients across them. For an example, say you have a base model predicting crime at the city level as a function of poverty, and then in a second model you include other control covariates on the right hand side. Let’s say the first effect estimate of poverty is 3 (1), where the value in parentheses is the standard error, and the second estimate is 2 (2). The first effect is statistically significant, but the second is not. Do you conclude that the effect sizes are different between models though? The evidence for that is much less clear.

To construct the estimate of how much the effect declined, the decline would be 3 - 2 = 1, a decrease of 1. What is the standard error around that decrease though? We can use the formula for the variance of the difference that I noted before to construct it. The standard error squared is the variance around the parameter estimate, so the standard error of the difference is sqrt(1^2 + 2^2) =~ 2.2 — which assumes the covariance between the estimates is zero. So the standard error around our estimated decline is quite large, and we can’t be sure that the estimate of the poverty effect is appreciably different between the two models.
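Here is that back-of-the-envelope calculation in R:

# difference in the poverty estimates across the two models,
# assuming zero covariance between them
b1 <- 3; se1 <- 1
b2 <- 2; se2 <- 2
diff    <- b1 - b2
se_diff <- sqrt(se1^2 + se2^2)
c(difference = diff, se = se_diff, z = diff / se_diff)   # z is well under 2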

There are more complicated ways to measure moderation, but this ad-hoc approach can be easily applied as you read other people’s work. The assumption of zero covariance for parameter estimates is not as big of a deal as it may seem. In large samples these covariances tend to be very small, and they are frequently negative. So even though we know that assumption is wrong, just pretending it is zero is not a terrible folly.

The second is where you have models predicting different outcomes. So going with our same example, say you have a model predicting property crime and a model predicting violent crime. Again, I will often see people make an equivalent mistake to the moderator scenario, and say that the effect of poverty is larger for property than violent because one is statistically significant and the other is not.

In this case if you have the original data, you actually can estimate the covariance between those two coefficients. The simplest way is to estimate that covariance via seemingly unrelated regression. If you don’t though, such as when you are reading someone else’s paper, you can just assume the covariance is zero. Because the parameter estimates often have negative correlations, this assumption will make the standard error estimate smaller.
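If you do have the original data, a sketch of the seemingly unrelated regression route in R might look like the below. This uses the systemfit package, and the data frame and variable names (dat, property_rate, violent_rate, poverty) are hypothetical:

library(systemfit)   # install.packages("systemfit")

eqs <- list(property = property_rate ~ poverty,
            violent  = violent_rate  ~ poverty)
fit <- systemfit(eqs, method = "SUR", data = dat)

b <- coef(fit)   # ordered by equation: property intercept, property poverty, violent intercept, violent poverty
V <- vcov(fit)   # includes the cross-equation covariance between the two poverty estimates
d  <- b[2] - b[4]
se <- sqrt(V[2, 2] + V[4, 4] - 2 * V[2, 4])
c(difference = unname(d), z = unname(d / se))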

The third is where you have different subgroups in the data, and you examine the differences in coefficients. Say you had recidivism data for males and females, and you estimated an equation of the effect of a treatment on males and another model for females. So we have two models:

Model Males  : Prob(Recidivism) = B_0m + B_1m*Treatment
Model Females: Prob(Recidivism) = B_0f + B_1f*Treatment

Where the B_0? terms are the intercept, and the B_1? terms are the treatment effects. Here is another example where you can stack the data and estimate an interaction term to estimate the difference in the effects and its standard error. So we can estimate a combined model for both males and females as:

Combined Model: Prob(Recidivism) = B_0c + B_1c*Treatment + B_2c*Female + B_3c(Female*Treatment)

Where Female is a dummy variable equal to 1 for female observations, and Female*Treatment is the interaction term for the treatment variable and the Female dummy variable. Note that you can rewrite the model for males and females as:

Model Mal.: Prob(Recidivism) =     B_0c      +      B_1c    *Treatment    ....(when Female=0)
Model Fem.: Prob(Recidivism) = (B_0c + B_2c) + (B_1c + B_3c)*Treatment    ....(when Female=1)

So we can interpret the interaction term, B_3c, as the difference in the treatment effect for females relative to males. The standard error of this interaction takes into account the covariance term, unlike estimating two totally separate equations would. (You can stack the property and violent crime outcomes I mentioned earlier in an analogous way to this subgroup example.)
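A minimal sketch of the stacked approach in R (hypothetical recidivism data, and a linear probability model to match the equations above; a logit would work the same way):

# female = 1 for female observations; the treatment:female coefficient is B_3c
fit <- lm(recidivism ~ treatment * female, data = dat)
summary(fit)   # the standard error of treatment:female already accounts for the covariance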

The final fourth example is the simplest: two regression coefficients in the same equation. One example is from my dissertation, on the correlates of crime at small spatial units of analysis. I test whether different places that sell alcohol — such as liquor stores, bars, and gas stations — have the same effect on crime. For simplicity I will just test two effects, whether liquor stores have the same effect as on-premise alcohol outlets (this includes bars and restaurants). So let’s say I estimate a Poisson regression equation as:

log(E[Crime]) = Intercept + b1*Bars + b2*LiquorStores

And then my software spits out:

                  B     SE      
Liquor Stores    0.36  0.10
Bars             0.24  0.05

And then let’s say we also have the variance-covariance matrix of the parameter estimates – which most stat software will return for you if you ask it:

                L       B  
Liquor_Stores    0.01
Bars            -0.0002 0.0025

On the diagonal are the variances of the parameter estimates, which if you take the square root are equal to the reported standard errors in the first table. So the difference estimate is 0.36 - 0.24 = 0.12, and the standard error of that difference is sqrt(0.01 + 0.0025 - 2*-0.0002) =~ 0.11. So the difference is not statistically significant. You can take the ratio of the difference and its standard error, here 0.12/0.11, and treat that as a test statistic from a normal distribution. So the rule that it needs to be plus or minus two to be stat. significant at the 0.05 level applies.
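Here is a quick sketch of that calculation in R, plugging in the numbers from the tables above:

b_liquor <- 0.36; b_bars <- 0.24
v_liquor <- 0.01; v_bars <- 0.0025; cov_lb <- -0.0002

diff    <- b_liquor - b_bars
se_diff <- sqrt(v_liquor + v_bars - 2 * cov_lb)
c(difference = diff, se = se_diff, z = diff / se_diff)
2 * pnorm(-abs(diff / se_diff))   # two-sided p-value, clearly above 0.05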

This is called a Wald test specifically. I will follow up with another blog post and some code examples on how to do these tests in SPSS and Stata. For completeness and just because, I also list two more ways to accomplish this test for the last example.


There are two alternative ways to do this test though. One is by doing a likelihood ratio test.

So we have the full model as:

 log(E[Crime]) = b0 + b1*Bars + b2*Liquor_Stores [Model 1]
 

And we have the reduced model as:

 log(E[Crime]) = b4 + b5*(Bars + Liquor_Stores)  [Model 2]
 

So we just estimate the full model with Bars and Liquor Stores on the right hand side (Model 1), then estimate the reduced model (2) with the sum of Bars + Liquor Stores on the right hand side. Then you can just do a chi-square test based on the change in the log-likelihood. In this case there is a change of one degree of freedom.

I give an example of doing this in R on crossvalidated. This test is nice because it extends to testing multiple coefficients, so I could use it if I wanted to test bars = liquor stores = convenience stores. The prior individual Wald tests are not as convenient for testing the equality of more than two coefficients at once.
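As a sketch in R (the data frame and variable names are hypothetical), the likelihood ratio test is just:

full    <- glm(crime ~ bars + liquor_stores, family = poisson, data = dat)
reduced <- glm(crime ~ I(bars + liquor_stores), family = poisson, data = dat)

# chi-square test on the change in the log-likelihood, one degree of freedom
anova(reduced, full, test = "Chisq")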


Here is another way though to have the computer more easily spit out the Wald test for the difference between two coefficients in the same equation. So if we have the model (lack of intercept does not matter for discussion here):

y = b1*X + b2*Z [eq. 1]

We can test the null that b1 = b2 by rewriting our linear model as:

y = B1*(X + Z) + B2*(X - Z) [eq. 2]

And the test for the B2 coefficient is our test of interest. The logic goes like this — we can expand [eq. 2] to be:

y = B1*X + B1*Z + B2*X - B2*Z [eq. 3]

which you can then regroup as:

y = X*(B1 + B2) + Z*(B1 - B2) [eq. 4]

and note the equalities between equations 4 and 1.

B1 + B2 = b1; B1 - B2 = b2

So the test of whether B2 equals zero is a test of whether b1 = b2. B2 is a little tricky to interpret in terms of effect size for how much larger b1 is than b2 – it is only half of the difference. An easier way to estimate that effect size is to insert (X-Z)/2 into the right hand side instead, and the coefficient (and confidence interval) for that term will be the estimate of how much larger the effect of X is than Z.

Note that this gives an estimate equivalent to conducting the Wald test by hand, as I mentioned before.
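A sketch of that reparameterization in R (same hypothetical data frame and variable names as before):

# original parameterization: log(E[Crime]) = Intercept + b1*Bars + b2*LiquorStores
orig <- glm(crime ~ bars + liquor_stores, family = poisson, data = dat)

# rewritten: the coefficient on I((bars - liquor_stores)/2) equals b1 - b2,
# and its z statistic is the same Wald test of b1 = b2 as above
rewrit <- glm(crime ~ I(bars + liquor_stores) + I((bars - liquor_stores)/2),
              family = poisson, data = dat)
summary(rewrit)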