Using regularization to generate synthetic controls and conformal prediction for significance tests

When viewing past synthetic control results, one of the things that has struck me is that the matching of the pre-trends is often really good — almost too good in many cases (it appears to be fitting to noise, although you may argue that is a feature in terms of matching exogenous shocks). For example, if you have a pre-treatment series of 10 years and a potential donor pool of 30 units, you could technically pick 10 donors at random, fit a linear regression predicting the 10 treated-unit observations from those 10 donor series over the same pre-treatment period, and get perfect predictions (ignoring the typical constraints one places on the coefficients).
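
To make that concrete, here is a minimal R sketch using made-up random data (not the crime panel) showing that ordinary regression with as many donors as pre-treatment years interpolates the treated series exactly:

#Hypothetical data: 10 pre-treatment years, 10 randomly picked donor series
set.seed(10)
treated <- rnorm(10)
donors <- data.frame(matrix(rnorm(10*10), ncol=10))
fit <- lm(treated ~ ., data=donors) #R drops one redundant coefficient, the fit is still exact
summary(fit)$r.squared    #essentially 1
max(abs(residuals(fit)))  #residuals are all ~0, a perfect "pre-trend" match to pure noise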

So how do we solve that problem? One solution is to use regularized regression (e.g. ridge regression, the lasso), which is designed for situations where the number of predictors is greater than the number of observations. So I can cast the matching procedure as a regression problem to generate the weights. Those regression procedures are typically used for forecasting, but they do not have well defined standard errors, and so are typically only used for point forecasts. One way to make inferences, though, is to generate the synthetic weights (here using lasso regression) and then use conformal prediction intervals to do our hypothesis testing of counterfactual trends.

Here I walk through an example using state panel crime data in R; full code and data can be downloaded here.

A Synthetic Control Example

So first, these are the packages we need to replicate the results. conformalInference is not on CRAN yet, so use devtools to install it.

#library(devtools)
#install_github(repo="ryantibs/conformal", subdir="conformalInference")
library(conformalInference)
library(glmnet)
library(Synth)

Then I have prepped a nice state panel dataset of crime rates and counts from 1960 through 2014. I set a hypothetical treatment start year of 2005, just so I have a nice 10 years of post-treatment data for illustration. That is a pretty long pre-treatment panel though, and a good number of potential donors.

MyDir <- "C:\\Users\\axw161530\\Desktop\\SynthIdeas"
setwd(MyDir)

TreatYear <- 2005

LongData <- read.csv("CrimeStatebyState_Edited.csv")
summary(LongData)

Next I prep my data. Currently it is in long panel format, but I need it in wide format to fit the regression equations I want. I am just matching on violent crime rates here. I take out NY, as it is missing a few years of data. (This dataset also includes DC.) Then I split it up into my pre-intervention and post-intervention sets.

#Changing the data to wide for just the violent offenses
wide <- LongData[,c('State','Year','Violent.Crime.rate')]
names(wide)[3] <- 'VCR'
wide <- reshape(wide, idvar="Year", timevar="State", direction="wide")
summary(wide)
#Take out NY because of NAs
wide <- wide[,c(1:33,35:52)]

wide_pre <- as.matrix(wide[wide$Year < TreatYear,])
wide_post <- as.matrix(wide[wide$Year >= TreatYear,])

Now onto the good stuff: we can estimate our lasso regression using the pre-period data to get our weights. This constrains the coefficients to be positive and below 1, but does not constrain them to sum to 1. I just choose Alabama as an example treated unit — I intentionally chose a state and year that should not have any treatment effects, for illustration and to check the coverage of my technique versus more traditional analyses.

You can see in my notes this is different from traditional synth in that it has an intercept as well. I was surprised, but the in-sample predictions were really bad without the intercept, no matter how I sliced it.

res <- glmnet(x=wide_pre[,3:51],y=wide_pre[,2],family="gaussian",
       lower.limits=0,upper.limits=1,intercept=TRUE,standardize=FALSE,
       alpha=1) #need the intercept, predictions suck otherwise

Even though this does not constrain the coefficients to sum to 1, it ends up with weights really close to that ideal anyway (the sum of the non-intercept coefficients is just over 1.01). When I use cross-validation it does not choose weights that sum to unity, but in sample the above code and cv.glmnet give really similar predictions.
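
If you want to check that yourself, here is a sketch of the cross-validation comparison using the same wide_pre matrix and res fit from above (just a sketch; results will depend on the random folds):

#Cross-validated lasso, same constraints as before
set.seed(10)
cv_res <- cv.glmnet(x=wide_pre[,3:51],y=wide_pre[,2],family="gaussian",
                    lower.limits=0,upper.limits=1,intercept=TRUE,
                    standardize=FALSE,alpha=1)
sum(coef(cv_res,s="lambda.min")[-1]) #weights need not sum to ~1 here
cv_pred <- predict(cv_res,newx=wide_pre[,3:51],s="lambda.min")
cor(cv_pred, predict(res,newx=wide_pre[,3:51],s=min(res$lambda))) #in-sample predictions are similar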

co_ridge <- as.matrix(coef(res))
fin <- co_ridge[,"s99"]
active <- fin[fin > 0] #Does not include intercept

If you print active we then have our state weights below (and the intercept is pretty tiny, -22). So I am not quite sure why eliminating the intercept was causing such problems in this example. North Carolina just sneaks in, but otherwise the synthetic control is a mix of Arkansas, California, Illinois, Kentucky, and Texas. The intercept is just a level shift, so we are still matching curves otherwise, and that does not bother me very much.

VCR.AR 0.2078156362
VCR.CA 0.1201658279
VCR.IL 0.1543015666
VCR.KY 0.2483613907
VCR.NC 0.0002896238
VCR.TX 0.2818272850

If we look at our predictions for the pre-period, Alabama had the typical crime path, with a big rise going into the early 90's and then a fall afterward (black line), and our in-sample predictions from the lasso regression are decent.

pre_pred <- predict(res,newx=wide_pre[,3:51],s=min(res$lambda)) #for not cv results

plot(wide_pre[,1],wide_pre[,2],type='l',xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_pre[,1],pre_pred,bg='red',pch=21) #Not too shabby
legend(1960,800,legend=c("Observed Alabama","Predicted"),col=c("black","black"), pt.bg=c("black","red"), lty=c(1,NA), pch=c(NA,21))

Now to evaluate the post-intervention period, we are going to generate conformal prediction intervals using a jackknife approach. Basically we do all of the above, but leave one pre-treatment year out at a time and try to predict Alabama's violent crime rate for that left-out year. Repeating that process over all the prior years gives us an estimate of the standard error of our prediction. We then apply that standard error to future years, so we can tell if the observed trend is different from the counterfactual we estimated (given that the counterfactual has error). I generate both 90% and 99% prediction intervals.

train_fun <- function(x, y, out=NULL){
  return( glmnet(x,y,alpha=1,standardize=FALSE,intercept=TRUE,nlambda=100,
                lower.limits=0,upper.limits=1,family="gaussian")
  )
}

pred_fun = function(out, newx) {
    return(predict(out, newx, s=min(out$lambda)))
}

limits_10 <- conformal.pred.jack(x=wide_pre[,3:51],y=wide_pre[,2],x0=wide_post[,3:51],
                                 train.fun=train_fun,predict.fun=pred_fun,alpha=0.10,
                                 verbose=TRUE)

limits_01 <- conformal.pred.jack(x=wide_pre[,3:51],y=wide_pre[,2],x0=wide_post[,3:51],
               train.fun=train_fun,predict.fun=pred_fun,alpha=0.01,
               verbose=TRUE)

post_pred <- predict(res,newx=wide_post[,3:51],s=min(res$lambda)) #counterfactual predictions for the post period

plot(wide_post[,1],wide_post[,2],type='l',ylim=c(150,650),xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_post[,1],post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_10$lo,col='grey')
lines(wide_post[,1],limits_10$up,col='grey')
lines(wide_post[,1],limits_01$lo,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up,col='grey',lwd=3)
legend("topright",legend=c("Observed Albama","Predicted","90% Pred. Int.","99% Pred. Int."),cex=0.7,
       col=c("black","black","grey","grey"), pt.bg="red", lty=c(1,NA,1,1), pch=c(NA,21,NA,NA), lwd=c(1,1,1,3))

Then at the end of the above code snippet I made a plot. The black line is the observed series for Alabama from 2005 through 2014. The red dots are the estimated counterfactual based on the pre-period weights. The lighter grey lines are the prediction intervals. So we can see the observed series is just outside the 90% intervals 3 times in the later years (we would only expect that 1 time), but it is easily within the 99% intervals throughout.

Note these are prediction intervals, not confidence intervals. Thinking about it, I honestly don't know whether we want prediction or confidence intervals in this circumstance, but prediction intervals will be wider.

So this approach just matches on the pre-treatment outcome series itself. To match on additional covariates, you can add them in as extra rows of the pre-treatment dataset (although you would want to normalize the values to a similar mean and standard deviation as the pre-treatment outcome series).
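
Here is a hypothetical sketch of what that could look like, assuming a vector avg_logpop with one value per state, the treated unit first and then the donors in the same order as the columns of wide_pre[,3:51] (avg_logpop and the rescaling helper are made up for illustration):

#Rescale the covariate to roughly the mean/sd of the treated outcome series
scl <- function(v, target){ (v - mean(v))/sd(v) * sd(target) + mean(target) }
cov_scaled <- scl(avg_logpop, wide_pre[,2])
x_aug <- rbind(wide_pre[,3:51], cov_scaled[-1]) #donor covariate values as an extra row
y_aug <- c(wide_pre[,2], cov_scaled[1])         #treated unit covariate value appended to y
res_aug <- glmnet(x=x_aug, y=y_aug, family="gaussian", lower.limits=0, upper.limits=1,
                  intercept=TRUE, standardize=FALSE, alpha=1)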

You may also add in other covariates, like functions of time (although this changes the nature of the identification). So for example say you incorporate a linear and quadratic trend in time, and the lasso only chooses those two time factors and no control areas. You are doing something more akin to an interrupted time series analysis at that point (the counterfactual is simply based on your estimate of the pre-trend). I think that is OK sometimes, but it is quite different from using control areas to hopefully capture random shocks.
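
A similar hypothetical sketch for time trends, which enter as extra predictor columns (note the [0,1] bounds above would then also apply to the trend coefficients; glmnet accepts vectors for lower.limits/upper.limits if you want to relax them for just those columns):

#Linear and quadratic trends as additional predictors
yr_pre  <- wide_pre[,1] - 1960
yr_post <- wide_post[,1] - 1960
x_pre_t  <- cbind(wide_pre[,3:51],  lin=yr_pre,  quad=yr_pre^2)
x_post_t <- cbind(wide_post[,3:51], lin=yr_post, quad=yr_post^2) #would go into the post period predictions
res_t <- glmnet(x=x_pre_t, y=wide_pre[,2], family="gaussian", lower.limits=0, upper.limits=1,
                intercept=TRUE, standardize=FALSE, alpha=1)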

Comparing to Traditional Synth results

To see whether my error intervals are similar to the placebo approach, I used the old school Synth R package. It isn't 100% comparable, as it makes you match on at least one covariate, so here I chose to also match on the average logged population over the pre-treatment period.

#NY is missing years
LongData_MinNY <- LongData[as.character(LongData$State) != "NY",c("State","Year","Violent.Crime.rate","Population")]
LongData_MinNY$StateNum <- as.numeric(LongData_MinNY$State)
LongData_MinNY$State <- as.character(LongData_MinNY$State)
LongData_MinNY$LogPop <- log(LongData_MinNY$Population)    

state_nums <- unique(LongData_MinNY$StateNum)
    
dataprep.out <- dataprep(foo = LongData_MinNY,
                         dependent = "Violent.Crime.rate",
                         predictors = c("LogPop"),
                         unit.variable = "StateNum",
                         unit.names.variable = "State",
                         time.variable = "Year",
                         treatment.identifier = 2,
                         controls.identifier = state_nums[!state_nums %in% 2],
                         time.optimize.ssr = 1960:(TreatYear-1),
                         time.predictors.prior = 1960:(TreatYear-1),
                         time.plot = 1960:2014
                         )

synth_res <- synth(dataprep.out)
synth_tables <- synth.tab(dataprep.res = dataprep.out, synth.res = synth_res)
synth_tables$tab.w #a bunch of little weights across the board
path.plot(synth.res = synth_res, dataprep.res = dataprep.out, tr.intake=TreatYear,Xlab='',Ylab='Violent Crime Rate per 100,000',
      Legend=c("Alabama","Synthetic Control"), Legend.position=c("topleft"))

Looking at the weights, it is a bunch of little weights spread across many different states. Looking at the plot, it does not appear to be any better a fit than the lasso approach.

And then I just do the typical approach and use placebo checks for inference. I loop over my 49 placebos (minus one state for NY, but plus one because this list includes DC).

#Dataframes to stuff the placebos check results into
Predicted <- data.frame(dataprep.out$Y0plot %*% synth_res$solution.w)
names(Predicted) <- "TreatPred"

Pred_MinTreat <- data.frame(TreatPred = Predicted$TreatPred - LongData_MinNY[LongData_MinNY$StateNum == 2,"Violent.Crime.rate"])

#Now I just need to loop over the other states and collect their results for the placebo tests

placebos <- state_nums[!state_nums %in% 2]
for (i in placebos){
  dataprep.plac <- dataprep(foo = LongData_MinNY,
                           dependent = "Violent.Crime.rate",
                           predictors = c("LogPop"),
                           unit.variable = "StateNum",
                           unit.names.variable = "State",
                           time.variable = "Year",
                           treatment.identifier = i,
                           controls.identifier = state_nums[!state_nums %in% i],
                           time.optimize.ssr = 1960:(TreatYear-1),
                           time.predictors.prior = 1960:(TreatYear-1),
                           time.plot = 1960:2014
  )
  synth_resP <- synth(dataprep.plac)
  synth_tablesP <- synth.tab(dataprep.res = dataprep.plac, synth.res = synth_resP)
  nm <- paste0("S.",i)
  Predicted[,nm] <- dataprep.plac$Y0plot %*% synth_resP$solution.w
  Pred_MinTreat[,nm] <- Predicted[,nm] - LongData_MinNY[LongData_MinNY$StateNum == i,"Violent.Crime.rate"]
}

If you look at the synth estimates for Alabama (grey circles), they are almost exactly the same as the lasso predictions (red circles), even though the weights are very different.

PredRecent <- Predicted[1960:2014 >= TreatYear,]
DiffRecent <- Pred_MinTreat[1960:2014 >= TreatYear,]

plot(wide_post[,1],wide_post[,2],type='l',ylim=c(100,700),xlab='',ylab='Violent Crime Rate per 100,000')
points(wide_post[,1],post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_10$lo,col='grey')
lines(wide_post[,1],limits_10$up,col='grey')
lines(wide_post[,1],limits_01$lo,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up,col='grey',lwd=3)
points(wide_post[,1],PredRecent$TreatPred,bg='grey',pch=21)
legend("topright",legend=c("Observed Albama","Lasso Pred.","90% Pred. Int.","99% Pred. Int.","Synth Pred."),cex=0.6,
       col=c("black","black","grey","grey"), pt.bg=c(NA,"red",NA,NA,"grey"), lty=c(1,NA,1,1,NA), pch=c(NA,21,NA,NA,21), lwd=c(1,1,1,3,1))

But when we look at variation in our placebo results (thin, purple lines), they are much wider than our conformal prediction intervals.

plot(wide_post[,1],wide_post[,2]-post_pred,type='l',ylim=c(-500,500),xlab='',ylab='Observed - Predicted (Violent Crime Rates)')
points(wide_post[,1],post_pred-post_pred,bg='red',pch=21)
lines(wide_post[,1],limits_01$lo-post_pred,col='grey',lwd=3)
lines(wide_post[,1],limits_01$up-post_pred,col='grey',lwd=3)

for (i in 2:ncol(PredRecent)){
  lines(wide_post[,1],DiffRecent[,i],col='#9400D340',lwd=0.5)
}

legend(x=2005.5,y=-700,legend=c("Observed Alabama","Lasso Pred.","99% Pred. Int.","Placebos"),
       col=c("black","black","grey",'#9400D3'), pt.bg=c(NA,"red",NA,NA), lty=c(1,NA,1,1), 
       pch=c(NA,21,NA,NA), lwd=c(1,1,3,0.5), xpd=TRUE, horiz=TRUE, cex = 0.45)

So I was hoping they would be the same (that the conformal intervals would cover the placebos at the expected rate), but alas they are not. So I'm not sure if my conformal intervals are too small, or if the placebo checks are extra noisy. I can't prove it, but I suspect the placebo checks are somewhat noisy, mainly because there will always be some intervention idiosyncratic to specific donors over long periods of time that makes them no longer good counterfactuals. This seems especially true for predictions further out from the treatment year. Although I find the logic of the placebo checks pretty convincing, so I am somewhat torn.

Since we have 49 donors in this example, the two-tailed p-value for being outside all of the placebos would be 2/(49+1) = 0.04. Here we would need an intervention that increased or decreased violent crime rates by around 400 per 100,000 to be outside the placebo lines, pretty much an impossible standard given a baseline of only around 400 crimes per 100,000 as of 2004. The 99% conformal intervals are still pretty wide, with an increase or decrease of about 150 violent crimes per 100,000 needed to be a significant change. The two lines way outside 400 happen to be Alaska and Wyoming, not DC, so maybe tiny population states have a higher volatility problem. But besides them there are a bunch of placebo states around plus/minus 300 as well.

So caveat emptor if you want to use this idea in your own work; I don't know if my suggestion is good or bad. Here it suggests it is more diagnostic (smaller intervals) than the placebo checks, and it isn't limited by the number of potential donors in setting the alpha level for your tests (e.g. if you only have 10 potential donors your placebo checks can only give 90% intervals).

Since this is just one example, there are a few things I would need to know before recommending it more generally. One is that it may not work with a shorter pre-treatment series and/or a smaller donor pool. (I am not sure of any better way of checking that than via a ton of different simulations.)

More general notes

Doing some more lit review while preparing this post, I appear to be about 15th in line to suggest this approach (so don't take it as novel). In terms of using the lasso to estimate the synth weights, Susan Athey and colleagues have proposed something similar, in addition to using other machine learning techniques. Also see Amjad et al. (2018) in the Journal of Machine Learning Research, and this workshop by Alex Hollingsworth and Coady Wing. I am apparently not even the first to think of using conformal prediction intervals; see this working paper (Chernozhukov, Wuthrich, and Zhu, 2019) posted just a few weeks prior.

There is another R package, gsynth, that appears to solve the problem of p > n via a variable reduction technique (Xu, 2017). Xu also discusses how incorporating more information is really making different identification assumptions. So again just getting good predictions/minimizing the in-sample mean square error is not necessarily the right approach to get correct causal inferences.

This is just a blog post, so again I can't say offhand whether this is an improvement over other work. It is just illustrative that the bounds for the conformal prediction may be smaller than the typical permutation based approach. Casting it as a regression problem is something I intuitively grok more, and I think it opens up more possibilities. For example, you may want to use binomial logistic models instead of linear ones for the fitting process (to take into account the greater volatility of smaller population states).


Making caterpillar plots for random effects in SPSS

For one of my classes for PhD students (in seminar research and analysis), I talk about the distinction between random effect models and fixed effect models for a week.

One of my favorite plots to go with random effect models is called a caterpillar plot. Typically folks just stop at reporting the variance of the random intercepts and slopes when they estimate these models. But you not only get the global variance estimates, you can also get an estimate (and standard error) for each higher level unit. So if I have 100 people, and I estimate a random intercept for those 100 people, I can say "Joe B's random intercept is 0.5, and Jane Doe's random intercept is -0.2", etc.

So this is halfway in between confirmatory data analysis (we used a model to get those estimates) and more exploratory analysis, and it is often useful for further understanding the model and seeing if you should add anything else. E.g. if you see the random intercepts have a high correlation with some other piece of person-level information, that information should be incorporated into the model. It is also useful to spot outliers. And if you have spatial data, mapping the random intercepts is something you should do.

SPSS recently made it easier to make these types of plots (as of V25), so I am going to give an example. In my class I give code examples in R, Stata, and SPSS whenever I can, so this link contains code for all three programs. I will be using data from my dissertation, with crime on street segments in DC, nested within regular grid cells (used to approximate neighborhoods).

SPSS Code

So first, data prep: I define where my data is using FILE HANDLE, read in the csv file, and compute a new variable (the sum of both detritus and physical infrastructure 311 calls). Then finally I declare that the FishID variable (my grid cell neighborhoods) is a nominal level variable. SPSS needs that defined correctly for the later models.

*************************************************************.
FILE HANDLE data /NAME = "??????Your Path Here!!!!!!!!!!!".

*Importing the CSV file into SPSS.
GET DATA  /TYPE=TXT
  /FILE="data\DC_Crime_withAreas.csv"
  /ENCODING='UTF8'
  /DELCASE=LINE
  /DELIMITERS=","
  /QUALIFIER='"'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /DATATYPEMIN PERCENTAGE=95.0
  /VARIABLES=
  MarID AUTO
  XMeters AUTO
  YMeters AUTO
  FishID AUTO
  XMetFish AUTO
  YMetFish AUTO
  TotalArea AUTO
  WaterArea AUTO
  AreaMinWat AUTO
  TotalLic AUTO
  TotalCrime AUTO
  CFS1 AUTO
  CFS2 AUTO
  CFS1Neigh AUTO
  CFS2Neigh AUTO
  /MAP.
CACHE.
EXECUTE.
DATASET NAME CrimeDC.
DATASET ACTIVATE CrimeDC.

*Compute a new variable, total number of 311 calls for service.
COMPUTE CFS = CFS1 + CFS2.
EXECUTE.

VARIABLE LEVEL FishID (NOMINAL).
*************************************************************.

Now onto the good stuff, estimating our model. Here we are looking at the fixed effects of bars and 311 calls on crime on street segments, but also estimating a random intercept for each grid cell. As of V25, SPSS lets you specify an option to print the solution for the random statements, which we can capture in a new SPSS dataset using the OMS command.

So first we declare our new dataset to dump the results in, Catter. Then we specify an OMS command to capture the random effect estimates, and then estimate our negative binomial model. I swear SPSS did not use to be like this, but now you need to end the OMS command before you putz with that dataset.

*************************************************************.
DATASET DECLARE Catter.

OMS
  /SELECT TABLES
  /IF SUBTYPES='Empirical Best Linear Unbiased Predictions'
  /DESTINATION FORMAT=SAV OUTFILE='Catter' VIEWER=YES
  /TAG='RandTable'.

*SOLUTION option only as of V25.
GENLINMIXED
  /FIELDS TARGET=TotalCrime
  /TARGET_OPTIONS DISTRIBUTION=NEGATIVE_BINOMIAL
  /FIXED EFFECTS=TotalLic CFS
  /RANDOM USE_INTERCEPT=TRUE SUBJECTS=FishID SOLUTION=TRUE
  /SAVE PREDICTED_VALUES(PredRanEff).

OMSEND TAG='RandTable'.
EXECUTE.
DATASET ACTIVATE Catter.
*************************************************************.

And now we can navigate over to the saved table and make our caterpillar plot. Because we have over 500 areas, I sort the results and don’t display the X axis. But this lets you see the overall distribution and spot any outliers.

*************************************************************.
*Lets make a caterpillar plot.
FORMATS Prediction Std.Error LowerBound UpperBound (F4.2).
SORT CASES BY Prediction (D).

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Var1 Prediction LowerBound UpperBound
  /GRAPHSPEC SOURCE=INLINE
  /FRAME INNER=YES.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Var1=col(source(s), name("Var1"), unit.category())
  DATA: Prediction=col(source(s), name("Prediction"))
  DATA: LowerBound=col(source(s), name("LowerBound"))
  DATA: UpperBound=col(source(s), name("UpperBound"))
  SCALE: cat(dim(1), sort.data())
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), label("BLUP"))
  SCALE: linear(dim(2), include(0))
  ELEMENT: edge(position(region.spread.range(Var1*(LowerBound + UpperBound))), size(size."0.5"))
  ELEMENT: point(position(Var1*Prediction), color.interior(color.black), size(size."1"))
END GPL.
*************************************************************.

And here is my resulting plot.

And I show in the linked code examples not only for random intercepts, but also how to do the same thing for random slopes. Here is an example of a model where I let the TotalLic effect (the number of alcohol licenses on the street segment) vary by neighborhood grid cell. (The flat 0 estimates and consistent standard errors are grid cells with 0 licenses in the entire area.)

The way to interpret these estimates is as follows. The fixed effect part of the regression equation here is 0.247 + 0.766*Licenses. That alcohol license effect varies across the study area though; some places have a random slope of +2, so the equation there could be thought of as 0.247 + (0.766 + 2)*Licenses (ignoring the random intercept part). So the effect of bars in that area is much larger. There are also places with negative random slopes, so the effect of bars in those places is smaller. You can do the same type of thought experiment simply with the reported variance components, but I find the caterpillar plots to be a really good visual to show what those random effects actually mean.

For other really good multilevel modelling resources, check out the Centre for Multilevel Modelling and Germán Rodríguez's online notes. Eventually I will get around to uploading my seminar class notes and code snippets, but in the meantime if you see a week and would like my code examples, always feel free to email.

Knowing when to fold them: A quantitative approach to ending investigations

The recent work on investigations in the criminal justice field has my head turning about potential quantitative applications in this area (check out the John Eck & Kim Rossmo podcasts on Jerry's site first, then check out the recent papers in Criminology and Public Policy on the topic for a start). One particular problem that was presented to me was detective caseloads — detectives are human, so they can only handle so many cases at once. Triage typically happens at the initial crime reporting stage, with inputs such as the seriousness of the offense, the overall probability of the case being solved, and the future dangerousness of the folks involved going into the calculus of assigning a case.

Here I wanted to focus on a different problem though — how long to keep cases open? There are diminishing returns to keeping cases open indefinitely, so PDs should be able to right-size the back end of detective open cases as well as the front end triaging. My suggested solution is to estimate a survival model of the probability of a case being solved, and then you can estimate an expected return on investment given the time you put in.

Here is a simplified example. Say the table below shows the (instantaneous) probability of a case being solved per week put into the investigation.

Week 1  20%
Week 2  10%
Week 3   5%
Week 4   3%
Week 5   1%

In survival model parlance, this would be the hazard function in discrete time increments. The probabilities diminish over time, which should also be true in practice (a higher probability of being solved right away, which gets lower over time). The expected return of investigating this crime at time t is the cumulative probability of the crime being solved at time t, multiplied by whatever value you assign to the case being solved. The cost of investigating is fixed per week (based on the detective's salary), so the cumulative cost is just t*invest_cost.

So just to fill in some numbers, let's say that it costs the police department $1,000 a week to keep an investigation going. Also say a solved crime has a return of $10,000 (the latter number will be harder to figure out in practice, as cost of crime estimates are not a perfect fit). Filling in our table, we have below our detective return on investment estimates (note that the cumulative probability of being solved is not simply the sum of the instantaneous probabilities, else it would eventually go over 100%). So the expected return (the ROI column) at week 1 is 10,000*0.2 = 2,000, at week 2 it is 10,000*0.28 = 2,800, etc.

        h(t) solved%  cum-costs   ROI   
Week 1  20%    20%     1,000     2,000
Week 2  10%    28%     2,000     2,800
Week 3   5%    32%     3,000     3,200
Week 4   3%    33%     4,000     3,300
Week 5   1%    34%     5,000     3,400
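
The same arithmetic in a short R sketch (using the hypothetical hazard and dollar values above):

h <- c(0.20, 0.10, 0.05, 0.03, 0.01)  #weekly hazard of the case being solved
solved <- 1 - cumprod(1 - h)          #cumulative probability solved by each week
cum_cost <- 1000 * seq_along(h)       #$1,000 per week of investigation
exp_return <- 10000 * solved          #expected return if a solved case is worth $10,000
round(data.frame(week=1:5, hazard=h, solved=solved, cum_cost=cum_cost, exp_return=exp_return), 3)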

So the cumulative costs outweigh the expected return from the case by Week 4 here. So in practice (in this hypothetical example) you may say to a detective: you get 4 weeks to figure it out, and if it is not solved by then it should be closed (but not cleared), and you should move on to other things. In the long run (I think) this strategy will make sure detective resources are balanced against actual cases solved.

This right-sizes investigation lengths from a global perspective, but you might also consider whether to close a case on an individual, case-by-case basis. In that case you would not count the sunk cost of the investigation so far; it is just the probability of the case being solved going forward relative to the future resources necessary. (You do the same table, just start the cum-cost and solved-percent columns from scratch whenever you are making that decision.)
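
A small sketch of that going-forward calculation, using the same hypothetical numbers (the helper name is made up):

#Restart the expected return/cost table from the week the decision is made
going_forward <- function(h, value=10000, cost=1000, start=1){
  h_f <- h[start:length(h)]
  solved_f <- 1 - cumprod(1 - h_f) #probability solved going forward, sunk costs ignored
  data.frame(week=start:length(h), solved=solved_f,
             cum_cost=cost*seq_along(h_f), exp_return=value*solved_f)
}
going_forward(c(0.20,0.10,0.05,0.03,0.01), start=3) #deciding at the start of week 3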

In an actual applied setting, you can estimate the survival function however you want (e.g. you may want a cure mixture model, so that not all cases are solved even given infinite time). It is also the case that different crimes will not only have different survival curves, but also different costs of crime (e.g. a murder has a greater cost to society than a theft) and probably different investigative resources needed (detective costs may also get lower over time, so are not constant). You can bake all of that right into this estimate. So you may say the cost of a murder is infinite, and you should keep that case open and investigate it forever. A burglary though may have a very short interval before it should be dropped (but still get some initial investment).

Another neat application of this is that if you can generate reasonable returns to solving crimes, you can right-size your overall detective bureau. That is, you can make a quantitative argument that I need X more detectives, and they will help solve Y more crimes resulting in Z return on investment. It may be that we should greatly expand detective bureaus, but have them keep many cases open only a short time period. I'm thinking of the recent officer shortages in Dallas, where very few cases are assigned at all. (Some PDs have patrol officers take on initial detective duties at the crime scene as well.)

There are definitely difficulties with applying this approach. One is that getting an estimate of the value of solving a crime is going to be tough; it bridges quantitative cost of crime estimates (although many of those are sunk costs after the crime has been perpetrated; arresting someone does not undo the bullet wound), the likelihood of future reoffending, and ethical boundaries as well. If we are thinking about a detective bureau that is over-booked to begin with, we aren't just deciding on assigning individual cases at that point, but will need to consider pre-empting current investigations for new ones (e.g. if you drop case A and pick up case B, we have a better ROI). And that is ignoring the problem of estimating the survival curves of different cases, which is tricky using observational data as well (selection biases in what cases are currently assigned could make our survival curve estimates too low or too high).

This problem has to have been tackled in different contexts before (either by actuaries or in other business/medical contexts). I don’t know the best terms to google though to figure it out — so let me know in the comments if there is related work I should look into on solving this problem.

Actively monitoring place based crime interventions

Recently I presented my work (with Jerry Ratcliffe) on the Weighted Displacement Difference test at the New York State GIVE Symposium. My talk fit right in with what the folks at the American Society of Evidence Based Police discussed as well. In particular, Jason Potts and Jeremiah Johnson gave a talk about how officers can conduct their own experiments, and my work provides a simple tool to test if changes over time are significant or just due to chance.

There was one point of contention though between us — the ASEBP folks advocate for a fail fast model of evaluation, whereas I advocated for planning longer term experiments. In particular, I suggest this chart to plan your experiments. So say you have an area with only 10 crimes per month: I would suggest you run the experiment for at least 4 months, so that if what you are doing is 50% effective at reducing crime, you will conclude it has at least weak evidence of effectiveness using my WDD test. If you think 50% is too high a bar, running it for 12 months means it only needs to be a little over 25% effective to tell if it is working.

The idea behind failing fast and innovating I totally get, but whether we can actually see if something is effective in the short run with low baseline crime counts may be a road block to this idea in practice. Note that this is not me bagging on people doing experiments — what is most likely to happen if you do an experiment with low power is that you will conclude it is not effective, even if it partially works. So I'm more concerned the BetaGov fail fast model is likely to throw out cost-effective interventions that don't appear on their face to be effective, as opposed to generating false positives.1

Am I being too negative though? And can we create a monitoring tool to give more immediate feedback — so instead of waiting a year to see the results, we evaluate the efficacy of an intervention as it unfolds? To do this I am giving cusum charts a try, and did a little simulation to show how it might look in practice. SPSS code to replicate the findings is here.

So what I did was simulate a baseline control area with 10 crimes per time period, and a treated area that had a 20% reduction (so it goes down to 8 crimes on average per time period). Here is the time series of those two areas; the black line is the control area, and the red line is the treated area. Time periods can be whatever you want (e.g. days, weeks, months); what matters is the overall average and the difference between the two series.

Based on this graph, you can’t really visually tell if the red treated area is doing any better than the black control area — they overlap too much. But we aren’t just interested in the effect for any one time period, but in the cumulative effect over time. To calculate that, you just subtract the black line from the red line, and take the cumulative sum of that difference. The next chart shows that statistic, along with 100 simulated lines showing what happens when you do the same cumulative statistic to data with no changes.
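
The linked SPSS code replicates the actual charts; here is a quick R sketch of the same simulation logic for anyone following along in R (the means and the number of simulated lines are the values described above):

#Simulate the control (mean 10) and treated (20% reduction, mean 8) series
set.seed(10)
periods <- 25
control <- rpois(periods, 10)
treated <- rpois(periods, 8)
cusum <- cumsum(treated - control) #cumulative treated minus control difference

#100 simulated cusum lines under no difference between the areas
null_lines <- replicate(100, cumsum(rpois(periods, 10) - rpois(periods, 10)))

plot(1:periods, cusum, type="l", col="red", ylim=range(null_lines, cusum),
     xlab="Time Period", ylab="Cumulative Difference")
for (j in 1:100) lines(1:periods, null_lines[,j], col="#00000022")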

So you can see here that it takes about 13 time periods for the cumulative effects to fall outside the simulation boundaries, but you might conclude there is suggestive evidence of effectiveness after say 8+ time periods. Going further out, the chart also shows the cumulative number of crimes prevented over the life of the intervention, going down to around 75 crimes prevented by 25 time periods.

The number of time periods necessary to show divergence is dependent on how effective the intervention is. So if we have the same baseline average of 10 crimes per time period, but the intervention is 50% effective (so reduces to an average of 5 crimes per time period), you can tell it diverges by period 6 in this second simulation example.

If we go back to my power chart I made for the WDD test, you can see that these effective time periods are close to my power chart suggestions for the weak evidence line. So this cusum approach is maybe slightly more diagnostic, and has the benefit you may be able to stop the experiment early if it is really effective.

You should still commit to running the experiment for a set amount of time though. The amount of time should be based on how effective you think the intervention needs to be, as well as cost-benefit analyses. If something costs a lot of money (e.g. overtime), the effectiveness threshold to make it worthwhile in practice is a much higher bar than for something that is closer to zero cost (such as shifting around current assignments). Given that ASEBP is advocating for lower level officers to experiment, those are more likely to be the latter type of low cost interventions.

There are still a few issues with using cusum charts like this though. One, this is very dependent on having a control area with the same level of crime counts. My WDD test only needs the parallel trends assumption, but this needs both parallel trends and equal levels. (There are ways to get rid of the equal levels assumption though; one is to take the difference-in-differences and calculate the cusums of those over time.)

Another is that you need to reset cusum charts after a particular time period — you can see the simulations are random walks, and so grow much wider over time. I'm not sure at that point though whether you should choose a new control area or just stick with the prior one. In the first example you can see the red line overestimates the effectiveness — the true cumulative effect should be -2 per time period (estimated -75 versus an expected -50 after 25 time periods). For the second the true effect is -5 per time period, so the estimate is a slight underestimate (estimated -100 versus an expected -125 after 25 time periods).

But this is about the best middle ground between actively monitoring place based crime interventions and my advocacy for planning long term interventions that I can drum up for now. It gives short term feedback, but you should still be committed to running the experiment for a longer period of time. The sequential monitoring allows you to stop early if the intervention is really effective; see this example for A/B tests. But otherwise you are often better off just planning a long term intervention and not peeking at the short term results.

Besides the technical stats portion of being able to tell if it diverges from a control area, there may also be a behavioral element, in that you need a longer period of time to generate deterrence, or for officers to effectively implement the strategy. You can notice in these examples that if you only ran 5 time periods, the series meander about 0 and so don't appear to be effective. It takes longer, even with the 50% effective intervention, to know if the intervention worked given these low baseline crime counts.


  1. Although with low power you do have issues with what Andrew Gelman calls type S (sign) and type M (magnitude) errors as well. That is, even if you do conclude it is effective, you may think it is way more effective than it actually is in reality based on your noisy estimates. Or you conclude it has iatrogenic effects when it actually works in practice.

Optimal treatment assignment with network spillovers

This post is motivated by a recent piece by Wood and Papachristos (2019) (WP from here on), which finds that if you treat an individual at high risk for gunshot victimization, there are positive spillover effects on the individuals they are connected to. This creates a tricky problem in identifying the best individuals to intervene with given finite resources. You may not want to just choose the people with the highest risk; the best bang for your buck will be folks who are some function of high risk and connected to others with high risk (as well as those in areas of the network not already treated).

For a simplified example consider the network below, with individuals' baseline probabilities of future risk noted in the nodes. Let's say the local treatment effect reduces the probability to 0, the spillover effect reduces the probability by half, and you can only treat 1 node. Who do you treat?

We could select the person with the highest baseline probability (B), and the total reduction ends up being 0.5(B) + 0.1(E) = 0.6 (the 0.1 is the spillover effect for E). We could choose node A, which also has a high baseline probability and the most connections, and the total reduction is 0.4(A) + 0.05(C) + 0.05(D) + 0.1(E) = 0.6. But it ends up that in this network the optimal node to choose is E, because the spillovers to A and B justify choosing a lower probability individual: 0.2(E) + 0.2(A) + 0.25(B) = 0.65.

Using this idea of a local treatment effect and a spillover effect, I formulated an integer linear program. The objective is:

\text{Maximize} \{ \sum_{i = 1}^n (L_i\cdot p_{li} + S_i \cdot p_{si}) \}

Where p_{li} is the reduction in the probability due to the local effect, and p_{si} is the reduction in the probability due to the spillover effect. These probabilities are fixed values you know at the onset, e.g. estimated from some model like in Wheeler, Worden, and Silver (2019) (and Papachristos has related work using the network itself to estimate risk). Each node i then gets two decision variables: L_i equals 1 if that node is treated, and S_i equals 1 if the node gets a spillover effect (depending on who is treated). The findings in WP show that these effects are not additive (you don't get extra benefit if you are treated and your neighbors are also treated, or if you have multiple neighbors treated), which makes it easier to keep the problem on the probability scale. So we then have our constraints:

  1. L_i , S_i \in \{ 0,1 \}
  2. \sum L_i = K
  3. S_i \leq 1 - L_i , \forall \text{ Node}_i
  4. \sum_{\text{neigh}(i)} L_j \geq S_i , \forall \text{ Node}_i

Constraint 1 says these are binary 0/1 decision variables. Constraint 2 limits the number of people treated to K (a value that we choose). Constraint 3 ensures that if a local decision variable is set to 1, then the spillover variable for that node has to be 0 (if the local variable is 0, the spillover can be either 0 or 1). Constraint 4 handles the neighbor relations: for node i, the spillover decision variable can only be set to 1 if at least one of its neighbors' local treatment variables is set to 1.

So in the end, if the number of nodes is n, we have 2*n decision variables and 2*n + 1 constraints. I find it easier just to look at code sometimes, so here is this simple network and problem formulated in Python using networkx and pulp. (Here is a full file of the code and data used in this post.)

####################################################
import pulp
import networkx

Nodes = ['a','b','c','d','e']
Edges = [('a','c'),
         ('a','d'),
         ('a','e'),
         ('b','e')]

p_l = {'a': 0.4, 'b': 0.5, 'c': 0.1, 'd': 0.1,'e': 0.2}
p_s = {'a': 0.2, 'b': 0.25, 'c': 0.05, 'd': 0.05,'e': 0.1}
K = 1

G = networkx.Graph()
G.add_edges_from(Edges)

P = pulp.LpProblem("Choosing Network Intervention", pulp.LpMaximize)
L = pulp.LpVariable.dicts("Treated Units", [i for i in Nodes], lowBound=0, upBound=1, cat=pulp.LpInteger)
S = pulp.LpVariable.dicts("Spillover Units", [i for i in Nodes], lowBound=0, upBound=1, cat=pulp.LpInteger)

P += pulp.lpSum( p_l[i]*L[i] + p_s[i]*S[i] for i in Nodes)
P += pulp.lpSum( L[i] for i in Nodes ) == K

for i in Nodes:
    P += S[i] <= 1 - L[i]                                    #constraint 3, no spillover variable for treated nodes
    P += pulp.lpSum( L[j] for j in G.neighbors(i) ) >= S[i]  #constraint 4, spillover only if a neighbor is treated

P.solve()

#Should select e for local, and a & b for spillover
print(pulp.value(P.objective))
print(pulp.LpStatus[P.status])

for n in Nodes:
    print([n,L[n].varValue,S[n].varValue])
####################################################

And this returns the correct results: node E is chosen in this example, and A and B get the spillover effects. In the linked code I provide a nicer function where you just pipe in your network, your two probability reduction estimates, and the number of treated units, and it pipes out the results for you.

For an example with a larger network, just as a proof of concept, I conducted the same analysis choosing 20 people to treat in a network of 311 nodes I pulled from Rostami and Mondani (2015). I simulated some baseline probabilities to pipe in, and made the local treatment effect a 50% reduction in the probability and the spillover effect a 20% reduction. Here red squares are treated, pink circles are the spillover, and non-treated nodes are grey circles. It did not always choose the locally highest probability nodes (the largest ones), but it did tend to choose highly connected folks with a high probability (and also some high probability isolates).

This problem is solved in an instant. And I think out of the box this will work for even large networks of say over 100,000 nodes (I have let CPLEX churn on problems with near half a million decision variables on my desktop overnight). I need to check myself to make 100% sure though. A simple way to make the problem smaller if needed though is to conduct the analysis on subsets of connected components, and then shuffle the results back together.

Looking at the results, it is very similar to my choosing representatives work (Wheeler et al., 2019), and I think you could get similar results by just piping in 1's for each of the local and spillover probabilities. One of the things I want to work on going forward is treatment non-compliance. If we are talking about giving some of these folks social services, they don't always take up your offer (this is a problem in choosing reps for call-ins as well). WP actually relied on this to draw control nodes in their analysis. I thought for a bit that the problem of treatment non-compliance in this setting was intractable, but another paper on a totally different topic (Bogle et al., 2019) has given me some recent hope that it can be solved.

This same idea is also related to hot spots policing (think spatial diffusion of benefits). And I have some ideas about that to work on in the future as well (e.g. how wide a net to cast when doing hot spots interventions given geographical constraints).

References

  • Bogle, J., Bhatia, N., Ghobadi, M., Menache, I., Bjørner, N., Valadarsky, A., & Schapira, M. (2019). TEAVAR: striking the right utilization-availability balance in WAN traffic engineering. In Proceedings of the ACM Special Interest Group on Data Communication (pp. 29-43).
  • Rostami, A., & Mondani, H. (2015). The complexity of crime network data: A case study of its consequences for crime control and the study of networks. PloS ONE, 10(3), e0119309.
  • Wheeler, A. P., McLean, S. J., Becker, K. J., & Worden, R. E. (2019). Choosing Representatives to Deliver the Message in a Group Violence Intervention. Justice Evaluation Journal, Online First.
  • Wheeler, A. P., Worden, R. E., & Silver, J. R. (2019). The Accuracy of the Violent Offender Identification Directive Tool to Predict Future Gun Violence. Criminal Justice and Behavior, 46(5), 770-788.
  • Wood, G., & Papachristos, A. V. (2019). Reducing gunshot victimization in high-risk social networks through direct and spillover effects. Nature Human Behaviour, 1-7.


Making a hexbin map in ggplot

In a recent working paper I made a hexbin map all in R. (Gio did most of the hard work of data munging and modeling though!) Figured I would detail the process here for some notes. Hexagon binning is purportedly better than regular squares (to avoid artifacts of runs in discretized data). But the reason I use them in this circumstance is mostly just an aesthetic preference.

There are two tricky parts to this: 1) making the north arrow and scale bar, and 2) figuring out the dimensions needed to make regular hexagons. As an illustration I use the shooting victim data from Philly (see the working paper for all the details); full data and code to replicate are here. I will walk through a bit of it though.

Data Prep

First to start out, I just use these three libraries, and set the working directory to where my data is.

library(ggplot2)
library(rgdal)
library(proj4)
setwd('C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\HexagonMap_ggplot\\Analysis')

Now I read in the Philly shooting data, and then an outline of the city that is projected. Note I read in the shapefile data using rgdal, which imports the projection info. I need that to be able to convert the latitude/longitude spherical coordinates in the shooting data to a local projection. (Unless you are making a webmap, you pretty much always want to use some type of local projection, and not spherical coordinates.)

#Read in the shooting data
shoot <- read.csv('shootings.csv')
#Get rid of missing
shoot <- shoot[!is.na(shoot$lng),c('lng','lat')]
#Read in the Philly outline
PhilBound <- readOGR(dsn="City_Limits_Proj.shp",layer="City_Limits_Proj")
#Project the Shooting data
phill_pj <- proj4string(PhilBound)
XYMeters <- proj4::project(as.matrix(shoot[,c('lng','lat')]), proj=phill_pj)
shoot$x <- XYMeters[,1]
shoot$y <- XYMeters[,2]

Making a Basemap

It is a bit of work to make a nice basemap in R and ggplot, but once that upfront work is done it is really easy to make more maps. To start, the GISTools package has a set of functions to get a north arrow and scale bar, but I have had trouble with them. The ggsn package imports the north arrow as a bitmap instead of a vector, and I also had a difficult time with its scale bar function. (I have not figured out the cartography package either; I can't keep up with all the mapping stuff in R!) So long story short, this is my solution for adding a north arrow and scale bar, but I admit better solutions probably exist.

So basically I just build my own polygons and labels to add into the map where I want. The code is motivated by the functions in GISTools.

#creating north arrow and scale bar, motivation from GISTools package
arrow_data <- function(xb, yb, len) {
  s <- len
  arrow.x = c(0,0.5,1,0.5,0) - 0.5
  arrow.y = c(0,1.7  ,0,0.5,0)
  adata <- data.frame(aX = xb + arrow.x * s, aY = yb + arrow.y * s)
  return(adata)
}

scale_data <- function(llx,lly,len,height){
  box1 <- data.frame(x = c(llx,llx+len,llx+len,llx,llx),
                     y = c(lly,lly,lly+height,lly+height,lly))
  box2 <- data.frame(x = c(llx-len,llx,llx,llx-len,llx-len),
                     y = c(lly,lly,lly+height,lly+height,lly))
  return(list(box1,box2))
}

x_cent <- 830000
len_bar <- 3000
offset_scaleNum <- 64300
arrow <- arrow_data(xb=x_cent,yb=67300,len=2500)
scale_bxs <- scale_data(llx=x_cent,lly=65000,len=len_bar,height=750)

lab_data <- data.frame(x=c(x_cent, x_cent-len_bar, x_cent, x_cent+len_bar, x_cent),
                       y=c( 72300, offset_scaleNum, offset_scaleNum, offset_scaleNum, 66500),
                       lab=c("N","0","3","6","Kilometers"))

This is about the best I have been able to do in automating the creation of the north arrow and scale bar polygons, while still having flexibility in where to place the labels. But now we have all of the ingredients necessary to make our basemap. Make sure to use coord_fixed() for maps! Also for background maps I typically like making the outline thicker, and then having borders for smaller polygons lighter and thinner to create a hierarchy. (If you don't want the background map to have any color, use fill=NA.)

base_map <- ggplot() + 
            geom_polygon(data=PhilBound,size=1.5,color='black', fill='darkgrey', aes(x=long,y=lat)) +
            geom_polygon(data=arrow, fill='black', aes(x=aX, y=aY)) +
            geom_polygon(data=scale_bxs[[1]], fill='grey', color='black', aes(x=x, y = y)) + 
            geom_polygon(data=scale_bxs[[2]], fill='white', color='black', aes(x=x, y = y)) + 
            geom_text(data=lab_data, size=4, aes(x=x,y=y,label=lab)) +
            coord_fixed() + theme_void()

#Check it out           
base_map

This is what it looks like on my Windows machine in RStudio — it ends up looking a little different when I export the figure straight to PNG though. I will get to that in a minute.

Making a hexagon map

Now you have your basemap you can superimpose whatever other data you want. Here I wanted to visualize the spatial distribution of shootings in Philly. One option is a kernel density map. I tend to like aggregated count maps though better for an overview, since I don’t care so much for drilling down and identifying very specific hot spots. And the counts are easier to understand than densities.

In geom_hex you can supply a vertical and horizontal parameter to control the size of the hexagon — supplying the same for each does not create a regular hexagon though. The way the hexagon is oriented in geom_hex the vertical parameter is vertex to vertex, whereas the horizontal parameter is side to side.

Here are three helper functions. First, wd_hex gives you the horizontal width given the vertical height. So if you want your hexagon to be 1000 meters vertex to vertex (so a side is 500 meters), wd_hex(1000) returns just over 866. Second, if for your map you want to convert the counts to densities per unit area, you can use hex_area to figure out the size of your hexagon. Going again with our 1000 meter vertex-to-vertex hexagon, hex_area(1000/2) gives just under 650,000 square meters (or about 0.65 square kilometers).

For maps though, I think it makes the most sense to set the hexagon to a particular area, and hex_dim does that. If you want to set your hexagons to a square kilometer, given our projected data is in meters, we would just do hex_dim(1000^2), which with rounding gives vertical/horizontal measures of about (1241, 1075) to supply to geom_hex.

#ggplot geom_hex you need to supply height and width
#if you want a regular hexagon though, these
#are not equal given the default way geom_hex draws them
#https://www.varsitytutors.com/high_school_math-help/how-to-find-the-area-of-a-hexagon

#get width given height
wd_hex <- function(height){
  tri_side <- height/2
  sma_side <- height/4
  width <- 2*sqrt(tri_side^2 - sma_side^2)
  return(width)
}

#now to figure out the area if you want
#side is simply height/2 in geom_hex
hex_area <- function(side){
  area <- 6 * (  (sqrt(3)*side^2)/4 )
  return(area)
}

#So if you want your hexagon to have a regular area need the inverse function
#Gives height and width if you want a specific area
hex_dim <- function(area){
  num <- 4*area
  den <- 6*sqrt(3)
  vert <- 2*sqrt(num/den)  #vertex to vertex height
  horz <- wd_hex(vert)     #matching side to side width
  return(c(vert,horz))
}

my_dims <- hex_dim(1000^2)   #making it a square kilometer
sqrt(hex_area(my_dims[1]/2)) #check to make sure it is square km
#my_dims also checks out with https://hexagoncalculator.apphb.com/

Now onto the good stuff. I tend to think discrete bins make nicer looking maps than continuous fills. So through some trial/error you can figure out the best way to make those via cut. Also I make the outlines for the hexagons thin and white, and make the hexagons semi-transparent. So you can see the outline for the city. I like how by default areas with no shootings are not given any hexagon.

lev_cnt <- seq(0,225,25)
shoot_count <- base_map + 
               geom_hex(data=shoot, color='white', alpha=0.85, size=0.1, binwidth=my_dims, 
                        aes(x=x,y=y,fill=cut(..count..,lev_cnt))) + 
               scale_fill_brewer(name="Count Shootings", palette="OrRd")

We have come this far, so now to automate exporting the figure to a PNG file. I've had trouble recently getting journals to not bungle the vector figures I forward them, so I am just going with high res PNG to avoid that hassle. If you render the figure and use the GUI to export to PNG, it won't be as high resolution, and you can often easily see aliasing pixels (e.g. the pixels in the north arrow in the earlier basemap image).

png('Philly_ShootCount.png', height=5, width=5, units="in", res=1000, type="cairo") 
shoot_count
dev.off()

Note the font size and location in the exported PNG are often not exactly as they are when rendered in the R GUI window or RStudio on my Windows machine. So make sure to check the PNG file.

Making nice margin plots in Stata

For a recent working paper I had a student of mine (Jordan Riddell) help write some code to make nice margin plots in Stata, based on the work of Ben Jann and his grstyle code. Another good resource is Trenton Mize’s Sociological Science article on non-linear interactions. Here is what the end output will look like.

My notes on how to get this to follow. Data and code to follow along can be downloaded from here.

Start Up

First in my do file, I have a typical start up that sets the working directory and logs the results to a text file. I use set more off so I don't have to do the annoying thing of telling Stata to keep scrolling down. The next part is partly idiosyncratic to my Stata set up — I call Stata from a centralized install location here at EPPS in UTD. I don't have write access there, so to install commands I need to set my own place to put them on my local machine. So I add a location to adopath that is on my machine, and I also do net set ado to that same location.

Finally, for here I ssc installed grstyle and palettes. The code is currently commented out, as I only need to install it once. But it is good for others to know what extra packages they need to fully replicate your results.

**************************************************************************
*START UP STUFF

*Set the working directory and plain text log file
cd "C:\Users\axw161530\Dropbox\Documents\BLOG\Stata_NiceMargins\Analysis"

*log the results to a text file
log using "LogitModels.txt", text replace

*so the output just keeps going
set more off

*let stata know to search for a new location for stata plug ins
adopath + "C:\Users\axw161530\Documents\Stata_PlugIns\V15"
net set ado "C:\Users\axw161530\Documents\Stata_PlugIns\V15"

*In this script I use 
*net install http://www.stata-journal.com/software/sj18-3/gr0073/
*ssc install grstyle, replace
*ssc install palettes, replace
**************************************************************************

Graph Settings

Here is what I did to change my default graph settings. Again check out Ben Jann’s awesome website and all the great examples he made; that will be more productive than me commenting on every individual line.

**************************************************************************
*Graph Settings
grstyle clear
set scheme s2color
grstyle init
grstyle set plain, box
grstyle color background white
grstyle set color Set1
grstyle yesno draw_major_hgrid yes
grstyle yesno draw_major_ygrid yes
grstyle color major_grid gs8
grstyle linepattern major_grid dot
grstyle set legend 4, box inside
grstyle color ci_area gs12%50
**************************************************************************

Data Prep

This part is pretty straightforward. I read in the data as a CSV file and generate a new variable that is the weekly average number of crimes within 1,000 feet in the historical crime data (see the working paper for more details). One trick I like to use with regression models that have many terms is to make a global that specifies those variables, so I don’t need to retype them a bunch; I named it $ContVars here. Finally, for simplicity in this script I am just examining the burglary incidents, so I get rid of the other crimes using the keep command.

**************************************************************************
*DATA PREP

*Getting the data
import delimited CrimeStrings_withData.csv

*Making the previous densities per time period
generate buff_1000_1 = buff_1000 * (7/1611)

*control variables used in the regression
global ContVars "d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 whiteperc blackperc hispperc asianperc under17 propmove perpoverty perfemheadhouse perunemploy perassist i.month c.dateint"

*For here I am just examining burglary incidents 
keep if crimetype == 3
**************************************************************************

Making a Nice marginsplot

So basically what I want to do in the end is to draw an interaction effect between a dummy variable (whether a crime resulted in an arrest) and a continuous variable (the historical crime density at a location). I am predicting whether a crime results in a near-repeat follow up — hot spots with more crime on average will have more near-repeats simply by chance.

When displaying that interaction effect though, I want to limit it to the support of the historical crime density in the sample. Stated another way, the historical crime density variable basically ranges from 0 to 2.5 in the sample — I don’t care what the interaction effect is at a historical crime density of 3.

To do that in Stata, I use summarize to get the min/max of the historical crime density and pipe them into globals. The Grid global then tells Stata how often to calculate those effects. Too few points and the plot may not look smooth, too many and margins will take forever to calculate the results. Here 100 points is plenty.

*I will need this later to draw the margins over the support
*Of the prior crime density
summarize buff_1000_1
global MyMin = r(min)
global MyMax = r(max)
global Grid = ($MyMax-$MyMin)/100
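
Since the burglary densities here run from roughly 0 to 2.5, the at() option used later ends up being roughly equivalent to writing c.buff_1000_1=(0(0.025)2.5) by hand; letting Stata compute the endpoints just keeps the script free of hard-coded numbers.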

This may seem like overkill, as I could just fill in those values by hand later. If you look at the replication code for my paper though, I ended up doing this same thing for four different crimes and two different estimates, so I wanted an automated approach that avoids as many magic numbers as possible.

Now I estimate my logistic regression model, with my interaction effect and the global $ContVars control variables I specified earlier. Here I am predicting whether a burglary has a follow up near-repeat crime (within 1000 feet and 7 days). I think an arrest will reduce that probability.

*Now estimate the logit model
logit future0_1000_7 i.arrest c.buff_1000_1 i.arrest#c.buff_1000_1 $ContVars

Note that the estimate of the interaction effect looks like this:

--------------------------------------------------------------------------------------
      future0_1000_7 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
            1.arrest |  -.0327821    .123502    -0.27   0.791    -.2748415    .2092774
         buff_1000_1 |   1.457795   .0588038    24.79   0.000     1.342542    1.573048
                     |
arrest#c.buff_1000_1 |
                  1  |  -.5013652   .2742103    -1.83   0.067    -1.038807    .0360771
                  

So how exactly do I interpret this? It is very difficult to interpret the coefficients directly — it is much easier to make graphs and visualize what those effects actually mean on the probability of a near-repeat burglary occurring.
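
As a quick back-of-the-envelope on the log-odds scale, using the coefficients above, the effect of an arrest at a given historical density d is about

-0.033 + (-0.501)*d

so the arrest effect is essentially zero at a density of 0 and roughly -1.29 at the top of the support (d = 2.5). But that is still on the log-odds scale, and translating it to probabilities depends on all the other covariates, which is exactly what margins handles for us.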

Now the good stuff. Basically I want to show the predicted probability of a near-repeat follow up crime, conditional on whether an arrest occurred, as well as the historical crime density. The first line uses quietly, so I don’t get the full margins table in the output. The second is just all the goodies to make my nice grey scale plot. Note I name the plot — this will be needed for later combining multiple plots.

*Create the two margin plots
quietly margins arrest, at(c.buff_1000_1=($MyMin ($Grid) $MyMax))
marginsplot, recast(line) noci title("Residential Burglary, Predictive Margins") xtitle("Historical Crime Density") ytitle("Pr(Future Crime = 1)") plot1opts(lcolor(black)) plot2opts(lcolor(gs6) lpattern("--")) legend(on order(1 "no arrest" 2 "arrest")) name(main)

You could superimpose confidence intervals on the prior plot, but those are the pointwise intervals for the probability of each individual line; they don’t directly tell you about the difference between the two lines. Visually judging the difference between lines often leads to misinterpretation (e.g. remember Cleveland’s example of reading the difference between the export/import curves originally drawn by Playfair), and superimposing multiple error bands tends to get visually messy. So a solution is to directly graph the estimate of the difference between those two probabilities in a separate graph. (Another idea I’ve seen, here, is a CI of the difference centered at the midpoint of the two lines.)

quietly margins, dydx(arrest) at(c.buff_1000_1=($MyMin ($Grid) $MyMax))
marginsplot, recast(line) plot1opts(lcolor(gs8)) ciopt(color(black%20)) recastci(rarea) title("Residential Burglary, Average Marginal Effects of Arrest") xtitle("Historical Crime Density") ytitle("Effects on Pr(Future Crime)") name(diff)

Yay for the fact that Stata can now draw transparent areas. Here we can see that even though the marginal effect grows at higher prior crime densities (suggesting an arrest has a larger effect on reducing near repeats in hot spots), the confidence interval of the difference grows larger as well.

To end I combine the two plots together (same image at the beginning of the post), and then export them to a higher resolution PNG.

*Now combining the plots together
graph combine main diff, xsize(6.5) ysize(2.7) iscale(.8) name(comb)
graph close main diff
graph export "BurglaryMarginPlot.png", width(6000) replace

I am often doing things interactively in the Stata shell when I am writing up scripts, including redoing charts. To be able to redo a chart with the same name, you need to not only graph close it, but also graph drop it from memory. Then dropping all the data and using exit will finish out the script and close down Stata entirely.

**************************************************************************  
*FINISHING UP THE SCRIPT

*closing the combined graph
graph close comb

*This is necessary if you want to reuse the plot names 
graph drop _all 

*Finish the script.
drop _all 
exit, clear
**************************************************************************  

Finding the dominant set in a network (python)

My paper, Choosing representatives to deliver the message in a group violence intervention, is now published online at the Justice Evaluation Journal. For those who don’t have access to that journal, here is a link good for 50 e-prints (for a limited time), and here is a pre-print version, and you can always send me an email for the published copy.

I’ve posted Python code to replicate the analysis, including the original network nodes and edges data for the groups. I figured I would go through a quick example of applying the code so others can use the algorithm.

The main idea is that for a focused deterrence initiative, you want to identify folks for the call-ins who will spread the deterrence message around the network. When working with several PDs I figured looking at who was actually called in would be interesting. Literally the first network graph I drew is below on the left — folks who were called in are the big red squares. This was one of the main problem gangs, and the PD had been doing call-ins for over a year at this point. Those are not quite the worst set of four folks to call in based on the topology of the network, but damn close.

But to criticize the PD I need to come up with a better solution — which is the graph on the right hand side. The larger red squares are my suggested call-ins, and they reach everyone within one step; that is, everyone is at most one link away from someone who attended a call-in. When the whole graph is covered like this, the set is called a dominant set of the graph.
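
For intuition, here is a minimal sketch of a greedy way to build such a set with networkx: repeatedly pick whichever node covers the most currently uncovered nodes (itself plus its neighbors) until everyone is covered. This is just to convey the idea; it is not necessarily the same algorithm as the domSet_Whe function used below.

############################################################
#Sketch: greedy heuristic for a dominant set (illustration only)

import networkx as nx

def greedy_dom_set(G):
    uncovered = set(G.nodes())
    chosen = []
    while uncovered:
        #pick the node that covers the most currently uncovered nodes
        best = max(G.nodes(), key=lambda n: len(({n} | set(G[n])) & uncovered))
        chosen.append(best)
        uncovered -= {best} | set(G[best])
    return chosen

#toy example on a simple path graph, 0-1-2-3-4-5-6
G = nx.path_graph(7)
ds = greedy_dom_set(G)
print(ds)                          #[1, 4, 5] in my session
print(nx.is_dominating_set(G, ds)) #True
############################################################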

Below I give a quicker example using my code for others to generate the dominant set (instead of going through all of the replication analysis). If you are a PD interested in applying this for your focused deterrence initiative let me know!


So first, to set up the Python code, I import all of the needed libraries (the only non-standard one is networkx). Then I import my own set of functions from MyFunctions.py, and change the working directory.

############################################################
#The libraries I need

import itertools
import networkx as nx
import csv
import sys
import os

#Now importing my own functions I made
locDir = r'C:\Users\axw161530\Dropbox\Documents\BLOG\DominantSet_Python'
sys.path.append(locDir)
from MyFunctions import *

#setting the working directory to this location
os.chdir(locDir)
#print(os.getcwd())
############################################################

In the next part I read in the CSV data for City 4 Gang 1, both the nodes and the edges. Then I create a networkx graph based only on the edges. Technically I do not use the node information at all for this, just the edges, which list a source and a target.

############################################################
#Reading in the csv files that have the nodes and the edges
#And turning into a networkX graph

#simple function to read in csv files
def ReadCSV(loc):
    tup = []
    with open(loc) as f:
        z = csv.reader(f)
        for row in z:
            tup.append(tuple(row))
    return tup
            
#Reading my csv files into lists of tuples

nd = ReadCSV('Nodes_City4_Gang1.csv')
ed = ReadCSV('Edges_City4_Gang1.csv')
head_node = nd.pop(0) #First row for both is a header
head_edge = ed.pop(0)

#Turning my csv files into networkx objects
C1G4 = nx.Graph()
C1G4.add_edges_from(ed)
############################################################

Now to get my suggested dominant set, it is as simple as this function call:

ds_C1G4 = domSet_Whe(C1G4)
print(ds_C1G4)

In my current session this gives the nodes ['21', '18', '17', '16', '3', '22', '20', '6']. If you look back at my original graph this set is somewhat different, but the differences are essentially single swaps where the best node to choose is arbitrary.

I have a bunch of other functions in the analysis; one of interest identifies, given who is under probation/parole, the best people to call in (see the domSet_WheSub function). Again, if you are interested in pursuing this further, always feel free to reach out to me.

David Bayley

David Bayley is best known in my research area, policing interventions to reduce crime, for this opening paragraph in Police for the Future:

The police do not prevent crime. This is one of the best kept secrets of modern life. Experts know it, the police know it, but the public does not know it. Yet the police pretend that they are society’s best defense against crime and continually argue that if they are given more resources, especially personnel, they will be able to protect communities against crime. This is a myth.

This quote is now paraded as backwards thinking, often presented before discussing the overall success of hot spots policing. If you didn’t read the book, you might come to the conclusion that this quote is a parallel to the nothing works mantra in corrections research. That take is not totally off-base: Police for the Future was published in 1994, so it was just at the start of the CompStat revolution and hot spots policing. The evidence base was no doubt much thinner at that point and deserving of skepticism.

I don’t take the contents of David’s book to be as hardline on the stance that police cannot reduce crime, at least at the margins, as his opening quote suggests, though. He has a chapter devoted to traditional police responses (crackdowns, asset forfeiture, stings, tracking chronic offenders), where he mostly expresses scientific skepticism of their effectiveness given their cost. He also discusses problem oriented approaches to solving crime problems, how to effectively measure police performance (outputs vs outcomes), and promotes evaluation research to see what works. It is still all totally relevant twenty plus years later.

The greater context of David’s quote comes from his work examining police forces internationally. David was more concerned about professionalization of police forces. Part of this is better record keeping of crimes, and in the short term crime rates will often increase because of this. In class he mocked metrics used to score international police departments on professionalization that used crime as a measure that went into their final grade. He thought the function of the police was broader than reducing crime to zero.


I was in David’s last class he taught at Albany. The last day he sat on the desk at the front of the room and expressed doubt about whether he accomplished anything tangible in his career. This is the fate of most academics. Very few of us can point to direct changes anyone implemented in response to our work. Whether something works is independent of an evaluation I conduct to show it works. Even if a police department takes my advice about implementing some strategy, I am still only at best indirectly responsible for any crime reductions that follow. Nothing I could write would ever compete with pulling a single person from a burning car.

While David was being humble, he was also right. If I had to make a guess, I would say David’s greatest impact likely came about through his training of international police forces — which I believe spanned multiple continents and included work with the United Nations. (As opposed to something he wrote having some greater, tangible impact.) But even there, if we went and tried to find direct evidence of David’s impact, it would be really hard to put a finger on any specific outcome.

If a police department wanted to hire me, but I would be fired if I did not reduce crimes by a certain number within that first year, I would not take that job. I am confident that I can crunch numbers with the best of them, but given real constraints of police departments I would not take that bet. Despite devoting most of my career to studying policing interventions to reduce crime, even with the benefit of an additional twenty years of research, I’m not sure if David’s quote is as laughable as many of my peers frame it to be.

Why I publish preprints

I encourage peers to publish preprint articles — journal articles posted before they go through the whole peer review process and are formally published. It isn’t the norm in our field, and I’ve gotten some pushback from colleagues, so I figured I would put on paper why I think it is a good idea. In short, the benefits (increased exposure) outweigh the minimal costs of doing so.

The good — getting your work out there

The main benefit of posting preprints is to get your work more exposure. This occurs in two ways. One is that traditional peer-reviewed work is often behind paywalls, which prevents the majority of non-academics from accessing your work. The point about paywalls applies just the same to other academics in some cases. So while the prior blog post I linked by Laura Huey notes that you can get access to some journals through your local library, it takes several steps, and every added step loses some folks who don’t want to spend the time. Even through my university it is not uncommon for me to not be able to access a journal article. I can technically get the article through inter-library loan, but that takes more time, time I am not going to spend unless I really want to see the contents of the article.

This I consider a minor benefit. Ultimately, if you want your academic work to be more influential in the field, you need to write about it in non-academic outlets (like magazines and newspapers) and present it directly to CJ practitioner audiences. But there are a few CJ folks who read journal articles, as well as a few academics, who are missing your work because of that paywall.

A bigger benefit is actually that you get your work out much quicker. The academic publishing cycle makes it impossible to publish your work in a timely fashion. If you are lucky, once your paper is finished, it will be published in six months. More realistically it will be a year before it is published online in our field (my linked article only considers when it is accepted, tack on another month or two to go through copy-editing).

Honestly, I publish preprints because I get really frustrated with waiting on peer review. No offense to my peers, but I do good work that I want others to read — I do not need a stamp from three anonymous reviewers to validate it. I would need to do an experiment to know for sure (having a preprint might displace some views/downloads from the published version), but I believe the earlier and open versions on average double the exposure my papers would have had from just publishing in traditional journals. It is likely a much different audience than traditional academic crim people, but that is a good thing.

But even without that extra exposure I would still post preprints, because it makes me happy to self-publish my work when it is at the finish line, in what can be a miserably long and very much delayed gratification process otherwise.

The potential downsides

Besides the actual time cost of posting a preprint (I will detail that more precisely in the next section; it isn’t much work), I will go through several common arguments for why posting preprints is a bad idea. I don’t believe they carry much weight, and I have not personally experienced any of these downsides.

What if I am wrong — Typically I only post a paper either when I am giving a talk on it, or when it is ready to go out for peer review, so I don’t encourage posting really early versions of work. Even at this stage there is never any guarantee you did not make a big mistake (I make mistakes all the time!), but the sky will not fall down if you post a preprint that is wrong. Just take it down if you feel it is a net negative to the scholarly literature (a very hard bar to clear — the results of the hypothesis tests alone do not make the work a net positive or negative). If you think it is good enough to send out for peer review, it is definitely at the stage where you can share the preprint.

What if the content changes after peer review — My experience with peer review is mostly pedantic stuff — lit. review/framing complaints, do some robustness checks for analysis, beef up the discussion. I have never had a substantive interpretation change after peer-review. Even if you did, you can just update the preprint with the new results. While this could be bad (an early finding gets picked up that is later invalidated) this is again something very rare and a risk I am willing to take.

Note peer review is not infallible, and so hedging that peer review will catch your mistakes is mostly a false expectation. Peer review does not spin your work into gold, you have to do that yourself.

My ideas may get scooped — This has never personally happened to me. Posting a preprint can actually prevent more direct plagiarism, as you have a time-stamped record of your work. Someone taking your idea and rewriting it is a potential risk (the same risk as presenting at a conference) — really only applicable for folks working on secondary data analysis. With the preprint posted the other person should at least cite your work, but sorry, neither presenting some work nor posting a preprint gives you sole ownership of an idea.

Journals will view preprints negatively — Or journals do not allow preprints. I haven’t come across a journal in our field that forbids preprints. Out of likely 100+ reviews at this point, I’ve had one reviewer note the posted preprint as a negative (suggesting I was double publishing or plagiarizing my own work). An editor that actually reads the reviews should know that is not a substantive critique; that was likely just a dinosaur reviewer who wasn’t familiar with the idea of preprints (and they gave an overall positive review in that case, so it did not get the paper axed). If you are concerned about this, just email the editor for feedback, but I’ve never had a problem from editors.

Peer reviewers will know who I am — This I admit is a known unknown. Peer review in our crim/CJ journals is mostly double blind (most geography and statistics journals I have reviewed for are not; I know who the authors are). If you presented the work at a conference you have already given up anonymity, and the field is small enough that for a good chunk of work the reviewers can guess who the author is anyway. So your anonymity is often a moot point at the peer review stage regardless.

So I don’t know how much reviewers are biased if they know who you are (it can work both ways, if you get a friend they may be more apt to give a nicer review). It likely can make a small difference at the margins, but again I personally don’t think the minor risk/cost outweighs the benefits.

These negatives are no doubt real, but again I personally find them minor enough risks to not outweigh the benefits of posting preprints.

The not hard work of actually posting preprints

All posting a preprint involves is uploading a PDF file of your work to either your website or a public hosting service. My current workflow is to keep the different components of a journal article in several Word documents (I don’t use LaTeX very often, and Word doesn’t work so well with one big file, especially with many pictures). I then export those components to PDF files and stitch them together using the freeware tool PDFtk. It has a GUI and a command line, so I just have a bat file in my paper directory that contains something like:

pdftk.exe TitlePage.pdf MainPaper.pdf TablesGraphs.pdf Appendix.pdf cat output CombinedPaper.pdf

So just a double click to update the combined pdf when I edit the different components.

Public hosting services I have used in the past to post preprints are Academia.edu, SSRN, and SocArXiv, although again you could just post the PDF on your own webpage (and Google Scholar will eventually pick it up). I use SocArXiv now, as SSRN currently makes you sign up for an account to download PDFs (again a hurdle, the same as going through inter-library loan). Academia.edu also makes you sign up for an account, and has weird terms of service.

Here is an example paper of mine on SocArXiv. (Note the total downloads; most of my published journal articles have fewer than half that many.) SocArXiv also does not bother my co-authors to create an account when I upload a paper. If we had a more criminal justice focused repository I would use that, but SocArXiv is fine.

There are other components of open science I should write about — such as replication materials/sharing data, and open peer reviewed journals, but I will leave those to another blog post. Posting preprints takes very little extra work compared to what academics are currently doing, so I hope more people in our field start doing it.