Preprint – A Quasi-Experimental Evaluation Using Roadblocks and Automatic License Plate Readers to Reduce Crime in Buffalo, NY

I have a new preprint article posted on SSRN – A Quasi-Experimental Evaluation Using Roadblocks and Automatic License Plate Readers to Reduce Crime in Buffalo, NY. This is some work I have been conducting with Scott Phillips out at SUNY Buffalo (as well as Dae-Young Kim, although he is not on this paper).

Here is the abstract:

Purpose: To evaluate the effectiveness of a hot spots policing strategy: using automated license plate readers at roadblocks.

Design: Different roadblock locations were chosen by the Buffalo Police Department every day over a two month period. We use propensity score matching to identify a set of control locations based on prior counts of crime and demographic factors before the intervention took place. We then evaluate the reductions in Part 1 crimes, calls for service, and traffic accidents at roadblock locations compared to control locations.

Findings: We find modest reductions in Part 1 violent crimes (10 over all roadblock locations and over the two months) using t-tests of mean differences. We find a 20% reduction in traffic accidents using fixed effects negative binomial regression models. Both results are sensitive to the model used though, and the fixed effects models predict increases in crimes due to the intervention.

Research Limitations: The main limitations are the quasi-experimental nature of the intervention, the short length of the intervention, and that many micro places have low baseline counts of crime.

Originality/Value: This adds to literature on hot spots policing – in particular on the use of automated license plate readers and traffic enforcement at hot spots of crime. While the results are mixed, it provides some evidence that the intervention has potential to reduce crime.

And here is one figure from the paper, showing how street units are defined, and given the intersection the road block was stationed on how we determined the treated street units:

Feedback is always welcome!

Some Stata notes – Difference-in-Difference models and postestimation commands

Many of my colleagues use Stata (note it is not STATA), and I particularly like it for various panel data models. Also one of my favorite parts of Stata code that are sometimes tedious to replicate in other stat. software are the various post-estimation commands. These includes the test command, which does particular coefficient restriction tests or multiple coefficient tests, margins (and the corresponding marginsplot) which gives model based estimates at various values of the explanatory variables, and lincom and nlcom which I will show as being useful for differences in differences models.

To follow along with this post, I have posted all of the data and code here. Part of the data is simulated to be very similar to a panel data model I estimated recently, and the other dataset comes from Dan Kahan.

I don’t know Stata as well as I do SPSS or R, but these are things I tend to re-visit (and wish people sending me data and code follow as well). So these are for my own notes so I remember!

The Preamble

Before we go into estimating models in Stata, I first like to set the working directory, specify a plain text log file for the results, and then set how the results are displayed in the window. I know some programmers do not like you changing their directory, but this is the simplest way I’ve found to share code between colleagues – they just need to change one line at the beginning to change the directory to their local machine.

So the top of my code will typically have my name, plus what the code does. Note that comments in Stata start with an *.

*Andy Wheeler, ????
*This code estimates difference in difference models
*For the X intervention on Crimes per Month

After that I will set the working directory, using the cd command, which lets you upload datasets without specifying a file handle. Because of this, I can just say log using "PostEst.txt" and it saves the plain text log file in the same directory folder.

*Setting up the directory
cd "C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Stata_Postest"
*logging the results in a text file
log using "PostEst.txt", text replace
*So the output just keeps going
set more off

The code set more off is idiosyncratic to the Stata display. For results that you need to scroll, Stata asks you to click on see more output. This makes it so it just pipes all the results to the display without you needing to click on anything.

Importing Data and Setting up a Panel Model

Now we will go on to a simulated example very similar to a difference in difference model I estimated recently. Because the working directory is set that already contains my datasets, I can just call:

use "Monthly_Sim_Data.dta", clear 

To load up my simulated dataset. Before we estimate panel models in Stata, you need to tell Stata what the panel id variables refer to. You use the tsset command for that. Here the variable Exper refers to a dummy variable that equals 1 for the experimental time series, and 0 for the control time series. Ord is a squential counter for each time period – here the data is suppossed to look like the count of crimes during the month over several years. So the first variable after tsset in the panel id, and the second refers to the time variable (if there is a time variable).

*Set the panel vars
tsset Exper Ord

Now we are ready to estimate our model. In my real use case I had looked at the univariate series, and found that an ARIMA(1,0,0) model with monthly dummies to account for seasonality fit pretty well and took care of temporal auto-correlation, so I decided a reasonable substitute for the panel model would be a GEE Poisson model with monthly dummies and the errors having an AR(1) correlation.

To fit this model in Stata is:

*Original model
xtgee Y Exper#Post i.Month, family(poisson) corr(ar1)

Here I do not worry about the standard errors, but if the data are over-dispersed you can either fit a negative binomial model and/or estimate robust standard errors (or bootstrapped standard errors). So you can specify these as something like xtgee Y X, family(poisson) corr(ar1) vce(robust). The vce option is available for many of the panel data models. (Note the bootstrap does not make sense when you only have two series as in this example.)

The Post variable is the dummy variable equal to 0 before the intervention, and 1 after the intervention. Exper#Post is one way to specify interaction variables in Stata. The variable i.Month tells Stata that the Month variable is a factor, and it should estimate a different dummy variable for each month (dropping one to prevent perfect collinearity).

You can of course create your own interactions or dummy variables, but using Stata’s inbuilt commands is very nice when working with post-estimation commands, which I will show in a bit. Some models in Stata do not like the i.Factor notation, so a quick way to make all of the dummy variables is via tabulate Month, gen(m). This would create variables m1, m2, m3....m12 in the dataset that you can include on the right hand side of regression equations.

This model then produces the output:

Iteration 1: tolerance = .00055117
Iteration 2: tolerance = 1.375e-06
Iteration 3: tolerance = 2.730e-09

GEE population-averaged model                   Number of obs      =       264
Group and time vars:             Exper Ord      Number of groups   =         2
Link:                                  log      Obs per group: min =       132
Family:                            Poisson                     avg =     132.0
Correlation:                         AR(1)                     max =       132
                                                Wald chi2(14)      =    334.52
Scale parameter:                         1      Prob > chi2        =    0.0000

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Exper#Post |
        0 1  |    .339348   .0898574     3.78   0.000     .1632307    .5154653
        1 0  |   1.157615   .0803121    14.41   0.000     1.000206    1.315023
        1 1  |   .7602288   .0836317     9.09   0.000     .5963138    .9241439
       Month |
          2  |    .214908   .1243037     1.73   0.084    -.0287228    .4585388
          3  |   .2149183   .1264076     1.70   0.089    -.0328361    .4626726
          4  |   .3119988   .1243131     2.51   0.012     .0683497    .5556479
          5  |   .4416236   .1210554     3.65   0.000     .2043594    .6788877
          6  |    .556197   .1184427     4.70   0.000     .3240536    .7883404
          7  |   .4978252   .1197435     4.16   0.000     .2631322    .7325182
          8  |   .2581524   .1257715     2.05   0.040     .0116447      .50466
          9  |    .222813   .1267606     1.76   0.079    -.0256333    .4712592
         10  |    .430002    .121332     3.54   0.000     .1921956    .6678084
         11  |   .0923395   .1305857     0.71   0.479    -.1636037    .3482828
         12  |  -.1479884   .1367331    -1.08   0.279    -.4159803    .1200035
       _cons |   .9699886   .1140566     8.50   0.000     .7464418    1.193536

Post estimation commands

So now the fun part begins – trying to interpret the model. Non-linear models – such as a Poisson regression – I think are almost always misleading to interpret the coefficients directly. Even exponentiating the coefficients (so they are incident rate ratios) I think tends to be misleading (see this example). To solve this, I think the easiest solution is to just predict the model based outcome at different locations of the explanatory variables back on the original count scale. This is what the margins post-estimation command does.

margins Post#Exper, at( (base) Month )
marginsplot, recast(scatter)

So basically once you estimate a regression equation, Stata has many of its attributes accessible to subsequent command. What the margins command does is predict the value of Y at the 2 by 2 factor levels for Post and Exper. The at( (base) Month ) option says to estimate the effects at the base level of the Month factor, which happens to default to January here. The marginsplot shows this easier than I can explain.

The option recast(scatter) draws the plot so there are no lines connecting the different points. Note I prefer to draw these error bar like plots without the end cross hairs, which can be done like marginsplot, recast(scatter) ciopts(recast(rspike)), but this causes a problem when the error bars overlap.

So I initially interpreted this simply as the experimental series went down by around 2 crimes, and the control series went up by around 1 crime. A colleague though pointed out the whole point of having control series though is to say what would have happend if the intervention did not take place (the counterfactual). Because the control series was going up, we would have also expected the experimental series to go up.

The coefficients in this model though don’t directly answer that question. To estimate that relative decrease is somewhat convoluted in this parameterization, but you can use the test and lincom post-estimation commands to do it. Test basically does linear model restrictions for one or multiple variables. This can be useful to test if multiple coefficients are equal to one another, or joint testing to see if one coefficient is equal to another, or to test if a coefficient is different from a non-zero value.

So to test the relative decrease in this DiD setup ends up being (which I am too lazy to explain):

test 1.Exper#1.Post = (1.Exper#0.Post + 0.Exper#1.Post)

Note you can also do a joint test for all dummy variables with testparm:

testparm i.Month

which spits out:

 ( 1)  2.Month = 0
 ( 2)  3.Month = 0
 ( 3)  4.Month = 0
 ( 4)  5.Month = 0
 ( 5)  6.Month = 0
 ( 6)  7.Month = 0
 ( 7)  8.Month = 0
 ( 8)  9.Month = 0
 ( 9)  10.Month = 0
 (10)  11.Month = 0
 (11)  12.Month = 0

           chi2( 11) =   61.58
         Prob > chi2 =    0.0000

The idea behind this is that it often does not make sense to test the significance of only one level of a dummy variable – you want to jointly test whether the whole set of dummy variables is statistically significant. Most of the time I do this using F-tests for model restrictions (see this example in R). But here Stata does a chi-square test. (I imagine they will result in the same inferences in most circumstances.)

test just gives inferential statistics though, I wanted an actual estimate of the relative decrease. To do this you can use lincom. So working with my same set of variables I get:

lincom 1.Exper#1.Post - 0.Exper#1.Post - 1.Exper#0.Post

Which produces in the output:

 ( 1)  - 0b.Exper#1.Post - 1.Exper#0b.Post + 1.Exper#1.Post = 0

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
         (1) |  -.7367337   .1080418    -6.82   0.000    -.9484917   -.5249757

I avoid explaining why this particular set of coefficient contrasts produces the relative decrease of interest because there is an easier way to specify the DiD model to get this coefficient directly (see the Wikipedia page linked earlier for an explanation):

*An easier way to estimate the model
xtgee Y i.Exper i.Post Exper#Post i.Month, family(poisson) corr(ar1)

Which you can look at the output and see that the interaction term now recreates the same -.7367337 (.1080418) estimate as the prior lincom command:

Iteration 1: tolerance = .00065297
Iteration 2: tolerance = 1.375e-06
Iteration 3: tolerance = 2.682e-09

GEE population-averaged model                   Number of obs      =       264
Group and time vars:             Exper Ord      Number of groups   =         2
Link:                                  log      Obs per group: min =       132
Family:                            Poisson                     avg =     132.0
Correlation:                         AR(1)                     max =       132
                                                Wald chi2(14)      =    334.52
Scale parameter:                         1      Prob > chi2        =    0.0000

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
     1.Exper |   1.157615   .0803121    14.41   0.000     1.000206    1.315023
      1.Post |    .339348   .0898574     3.78   0.000     .1632307    .5154653
  Exper#Post |
        1 1  |  -.7367337   .1080418    -6.82   0.000    -.9484917   -.5249757
       Month |
          2  |    .214908   .1243037     1.73   0.084    -.0287228    .4585388
          3  |   .2149183   .1264076     1.70   0.089    -.0328361    .4626726
          4  |   .3119988   .1243131     2.51   0.012     .0683497    .5556479
          5  |   .4416236   .1210554     3.65   0.000     .2043594    .6788877
          6  |    .556197   .1184427     4.70   0.000     .3240536    .7883404
          7  |   .4978252   .1197435     4.16   0.000     .2631322    .7325182
          8  |   .2581524   .1257715     2.05   0.040     .0116447      .50466
          9  |    .222813   .1267606     1.76   0.079    -.0256333    .4712592
         10  |    .430002    .121332     3.54   0.000     .1921956    .6678084
         11  |   .0923395   .1305857     0.71   0.479    -.1636037    .3482828
         12  |  -.1479884   .1367331    -1.08   0.279    -.4159803    .1200035
       _cons |   .9699886   .1140566     8.50   0.000     .7464418    1.193536

But we can do some more here, and figure out the hypothetical experimental mean if the experimental series would have followed the same trend as the control series. Here I use nlcom and exponentiate the results to get back on the original count scale:

nlcom exp(_b[1.Exper] + _b[1.Post]  + _b[_cons])

Which gives an estimate of:

       _nl_1:  exp(_b[1.Exper] + _b[1.Post]  + _b[_cons])

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _nl_1 |   11.78646   1.583091     7.45   0.000     8.683656    14.88926

So in our couterfactual world, the intervention decreased crimes from around 12 to 6 per month instead of 8 to 6 in my original interpretation, a larger effect because of the control series. We can show how nlcom reproduces the output of margins by reproducing the experimental series mean at the baseline, over 8 crimes per month.

*This creates the observed estimates, (at January) - if you want for another month need to add to above
margins Post#Exper, at( (base) Month )
*to show it recreates margins of pre period
nlcom exp(_b[1.Exper] + _b[_cons])

Which produces the output:

. margins Post#Exper, at( (base) Month )

Adjusted predictions                              Number of obs   =        264
Model VCE    : Conventional

Expression   : Exponentiated linear prediction, predict()
at           : Month           =           1

             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Post#Exper |
        0 0  |   2.637914   .3008716     8.77   0.000     2.048217    3.227612
        0 1  |   8.394722   .8243735    10.18   0.000      6.77898    10.01046
        1 0  |   3.703716   .3988067     9.29   0.000     2.922069    4.485363
        1 1  |   5.641881   .5783881     9.75   0.000     4.508261    6.775501

. *to show it recreates margins of pre period
. nlcom exp(_b[1.Exper] + _b[_cons])

       _nl_1:  exp(_b[1.Exper] + _b[_cons])

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _nl_1 |   8.394722   .8243735    10.18   0.000      6.77898    10.01046

Poisson models are no big deal

Much of my experience suggests that although Poisson models are standard fare for predicting low crime counts in our field, they almost never make a difference when evaluating the marginal effects like I have above. So here we can reproduce the same GEE model, but instead of Poisson have it be a linear model. Here I use the quietly command to suppress the output of the model. Comparing coefficients directly between the two models does not make sense, but comparing the predictions is fine.

*Compare Poisson to a linear model
quietly xtgee Y Exper#Post i.Month, family(gaussian) corr(ar1)
margins Post#Exp, at( (base) Month )

And this subsequently produces:

Adjusted predictions                              Number of obs   =        264
Model VCE    : Conventional

Expression   : Linear prediction, predict()
at           : Month           =           1

             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Post#Exper |
        0 0  |   1.864656   .6951654     2.68   0.007     .5021569    3.227155
        0 1  |   9.404887   .6951654    13.53   0.000     8.042388    10.76739
        1 0  |   3.233312   .6992693     4.62   0.000     1.862769    4.603854
        1 1  |   5.810809   .6992693     8.31   0.000     4.440266    7.181351

The predictions are very similar between the two models. I simply state this here because I think the parallel trends assumption for the DiD model makes more sense on a linear scale than it does for the exponential scale. Pretty much wherever I look, the effects of explanatory variables appear pretty linear to me when predicting crime counts, so I think linear models make a lot of sense, even if you are predicting low count data, despite current conventions in the field.

An additional Example using marginsplot and areas

Long post – but stick with me! The last thing I wanted to go over was how to get area’s instead of error bars for marginsplot. The pdf help online has some examples – so this is pretty much just a restatement of the help. (Stata’s help in the online PDFs are excellent – code and math and figures and references and intelligible discussion all in one.) Like Andrew Gelman, I think the discrete bars are misleading when the sample locations are just arbitrary locations along a continuous function. I also think the areas look nicer, especially when the error bars overlap.

So using the same example from Dan Kahan, we first use drop _all to get rid of the prior panel data, then use import to load up the csv file.

*This clears the prior data.
drop _all 

*grab the csv data
import delimited gelman_cup_graphic_reporting_challenge_data

Now we can estimate our model and our margins:

*estimate a similar logit model
logit agw c.left_right##c.religiosity
*graph areas using margins - pretended -1 and 1 are the high/low religious cut-offs
quietly margins, at(left_right=(-1.6(0.1)1.6) religiosity=(-1 1))

quietly is definately useful here for the margins command, we don’t want to pour through the whole text table, since it is only intended for a chart. And now to make our area chart at the sample points:

marginsplot, xlabel(-1.6(0.4)1.6) recast(line) recastci(rarea) ciopts(color(*.8))

That is much better looking that the default with the ugly cross-hair error bars overlapping. Stata draws the lines unfortunately in the wrong order (see the blue line over the red area), and if I figure out an easy way to fix that I will update the post in the future. I need to learn the options for Stata charts to a greater extent before I give any more advice about Stata charts though.

To wrap up. I typically have in my Stata do file:

*Finish the script.
drop _all 
exit, clear

This drops the dataset (as mentioned before), and exit, clear then basically closes out of Stata.

Jane Jacobs on Neighborhoods

Google image on 5/4/16 – Jane Jacobs 100th birthday.

Continuing on with my discussion of neighborhoods, it seems apt to talk about how Jane Jacob defines neighborhoods in her book The Death and Life of Great American Cities. She is often given perfunctory citations in criminology articles for the idea that mixed zoned neighborhoods (those with both residential and commercial zoning all together) increase “eyes on the street” and thus can reduce crime. I will write another post more fully about informal social control and how her ideas fit in, but here I want to focus on her conception of neighborhoods.

She asks the question, what do neighborhoods accomplish – and from this attempts to define what is a neighborhood. Thus she feels a neighborhood can only be defined in terms of having actual political capital to use for its constituents – and so a neighborhood is any region in which can be organized enough to act as an independent polity and campaign against the larger government in which it is nested. This is quite different from the current conception of neighborhoods in most social sciences – in which we assume neighborhoods exist, calculate their effects on behavior, and then tautologically say they exist when we find they do affect behavior.

Based on her polity idea Jacob’s defines three levels of possible neighborhoods:

  • your street
  • the greater area of persons – around 100,000
  • the entire city

This again is quite different than most of current social sciences. Census tracts and block groups are larger swaths than just one block. In very dense cities like NYC, census block groups are often the square defined by streets on either side, so they split apart people across the street from one another. Census tracts intentionally are made to contain around 8,000 people. For a counter-example criminologist, Ralph Taylor does think streets are neighborhood units. A hearsay paraphrase of his I recently heard was “looking out your front door, all that matters is how far you can see down the street in either direction”.

I think Jacob’s guesstimate of 100,000 may be partly biased from only thinking about giant megapolises. Albany itself has only around 100,000 residents – so it is either streets or the whole city per her definition. Although I can’t argue much about smaller areas having little to no political capital to accomplish specific goals.

In some of the surveys I have participated in they have asked the question about how you define your neighborhood. Here are the responses for two different samples (from the same city at two different time points) for an example:

I think the current approach – that neighborhoods are defined by their coefficients on the right hand side of regression models is untenable in the end – based on ideas that are derivatives of the work in my dissertation. Mainly that discrete neighborhood boundaries are a fiction of social scientists.

Neighborhoods in Albany according to Google

One of the most vexing aspects of spatial analysis in the social sciences in the concept of neighborhoods. There is a large literature on neighborhood effects in criminology, but no one can really define a neighborhood. For analysis they are most often assumed to approximately conform to census areas (like tracts or blocks). Sometimes there are obvious physical features that divide neighborhoods (most often a major roadway), but more often boundaries are fuzzy.

I’ve worked on several surveys (at the Finn Institute) in which we ask people what neighborhood they live in as well as the nearest intersection to their home. Even where there is a clear border, often people say the “wrong” neighborhood, especially near the borders. IIRC, when I calculated the wrongness for one survey in Syracuse we did it was only around 60% of the time the respondents stated they lived the right neighborhood. I do scare quotes around “wrong” because it is obviously arbitrary where people draw the boundaries, so more people saying the wrong neighborhood is indicative of the borders being misaligned than the respondents being wrong.

For this reason I like the Google maps approach in which they just place a label at the approximate center of noteworthy neighborhoods. I emulated this for a recent background map I made for a paper in Albany. (Maps can be opened in a separate tab to see a larger image.)

As background I did not grow up in Albany, but I’ve lived and worked in the Capital District since I came up to Albany for grad school – since 2008. Considering this and the fact that I make maps of Albany on a regular basis is my defense I have a reasonable background to make such judgements.

When looking at Google’s reverse geocoding API the other day I noticed they returned a neighborhood field in the response. So I created a regular sampling grid over Albany to see what they return. First, lets see my grid and where Google actually decides some neighborhood exists. Large grey circles are null, and small red circles some neighborhood label was returned. I have no idea where Google culls such neighborhood labels from.

See my python code at the end of the post to see how I extracted this info. given an input lat-lng. In the reverse geo api they return multiple addresses – but I only examine the first returned address and look for a neighborhood. (So I could have missed some neighborhoods this way – it would take more investigation.)

Given the input fishnet I then dissolved the neighborhood labels into areas. Google has quite a few more specific neighborhoods than me.

I’ve never really made much of a distinction between West Hill and Arbor Hill – although the split is clearly at Henry Johnson. Also I tend to view Pine Hill as the triangle between Western and Central before the State campus – but Google and others seem to disagree with me. What I call the Pinebush Google calls the Dunes. Dunes is appropriate, because it actually has sand dunes, but I can’t recall anyone referring to it as that. Trees are pretty hard to come by in Arbor Hill though, so don’t be misled. Also kill is Dutch for creek, so you don’t have to worry that Normanskill is such a bad place (even if your name is Norman).

For a third opinion, see

You can see more clearly in this map how Pine Hill’s area goes south of Madison. Google maps has a fun feature showing related maps, and so they show a related map on someones take for where law students should or should not get an apartment. In that map you can see that south of Madison is affectionately referred to as the student ghetto. That comports with my opinion as well, although I did not think putting student ghetto was appropriate for my basemap for a journal article!

People can’t seem to help but shade Arbor Hill in red. Which sometimes may be innocent – if red is the first color used in defaults (as Arbor Hill will be the first neighborhood in an alphabetic list). But presumably the law student making the apartment suggestions map should know better.

In short, it would be convenient for me (as a researcher) if everyone could agree with what a neighborhood is and where its borders are, but that is not reality.

Here is the function in Python to grab the neighborhood via the google reverse geocoding API. Here if it returns anything it grabs the first address returned and searches for the neighborhood in the json. If it does not find a neighborhood it returns None.

#Reverse geocoding and looking up neighborhoods
import urllib, json

def GoogRevGeo(lat,lng,api=""):
  base = r""
  GeoUrl = base + "latlng=" + str(lat) + "," + str(lng) + "&key=" + api
  response = urllib.urlopen(GeoUrl)
  jsonRaw =
  jsonData = json.loads(jsonRaw)
  neigh = None
  if jsonData['status'] == 'OK':
    for i in jsonData['results'][0]['address_components']:
      if i['types'][0] == 'neighborhood':
        neigh = i['long_name']
  return neigh

Shape, color, and pattern constants in SPSS charts

I have a version of the SPSS (Statistics) Version 24 GPL reference guide bookmarked here. The reference guide is great to skim through and see what is possible in SPSS charts – especially the set of examples on pages 329 to 411.

On page 413 they also give a set of constant colors, shapes, and texture patterns you can use in charts. Colors you can also specify in RGB scale, but it is often convenient to just say or etc. Shapes and patterns for practical purposes you have to choose among the constants. (Technically in the chart template you can edit the cycle styles, and change a circle to an ellipse for example, or change the points for a dash pattern, but this would be painful for anything besides a few constants.)

Here is a handy reference guide to actually visualize those constants. Many you can guess what they look like, but the colors are more subtle. Who knew there were differences between tomato, salmon, and pink! (The tomato is more like tomato soup color.)

Here are the color constants (you can open the chart in a new tab to see a larger image):

The shape constants:

The elbow and the elbowArrow do not look correct – but will take some more time to investigate. The others look ok to me though. (The number of sides and star points appear to me to be something you can also manipulate in the chart template cycles, if for some reason you want a hendecagon).

And here are the pattern constants. I plot them with a grey filled interior – you can see some specifically only have an outline and always have a transparent fill:

Here is code to replicate the charts, and here is a PDF to download with the constants. The colors and shapes are hard to read because they are squeezed in so small, but you can zoom into the PDF and read the categories. I haven’t used dashed lines ever, so I omit those constants here. (Patterns I use pretty rarely, but I have used them if there are only two categories.)

A useful change for the colors would be sorting in a logical order. They are just currently in alphabetical. I am too lazy though to convert the colors to a colorspace and sort them though. (Maybe converting the PDF to SVG would do the trick easy enough though.)

Posting my peer reviews on Publons and a few notes about reviewing

Publons is a service that currates your peer review work. In a perfect world this would be done by publishers – they just forward your reviews with some standard meta-data. This would be useful when one paper is reviewed multiple times, as well as identifying good vs. poor reviewers. I forget where I saw the suggestion recently (maybe on the orgtheory or scatterplot blog), but someone mentioned it would be nice if you submit your paper to a different journal to forward your previous reviews at your discretion. I wouldn’t mind that at all, as oft the best journals will reject for lesser reasons because they can be more selective. (Also I’ve gotten copy-paste same reviews from different journals, even though I have updated the draft to address some of the comments. Forwarding would allow me to address those comments directly before the revise-resubmit decision.)

I’ve posted all of my reviews so far, but they are only public if the paper is eventually accepted. So here you can see my review for the recent JQC article Shooting on the Street: Measuring the Spatial Influence of Physical Features on Gun Violence in a Bounded Street Network by Jie Xu and Elizabeth Griffiths.

I’ve done my fair share of complaining about reviews before, but I don’t think the whole peer-review process is fatally flawed despite its low reliability. People take peer review a bit too seriously at times – but that is a problem for most academics in general. Even if you think your idea is not getting a fair shake, just publish it yourself on your website (or places like SSRN and ArXiv). This of course does not count towards things like tenure – but valuing quantity over quality is another separate problem currently in academia.

In the spirit of do unto others as you would have them do unto you, here are a two main points I try to abide by when I review papers.

  • be as specific as possible in your critique

There is nothing more frustrating than getting a vague critique (the paper has multiple mispellings and grammar issues). A frequent one I have come across (both in reviews of my papers and seeing comments others have made on papers I’ve reviewed) is in the framing of the paper – a.k.a. the literature review. (Which makes sending the paper to multiple journals so frustrating, you will always get more arbitrary framing debates each time with new reviewers.)

So for a few examples:

  • (bad) The literature review is insufficient
  • (good) The literature review skips some important literature, see specifically (X, 2004; Y, 2006; Z, 2007). The description of (A, 2000) is awkward/wrong.
  • (bad) The paper is too long, it can be written in half the length
  • (better) The paper could be shortened, section A.1 can be eliminated in my opinion, and section A.2 can be reduced to one paragraph on X.

Being specific provides a clear path for the author to correct what you think, or at least respond if they disagree. The “you can cut the paper in half” I don’t even know how to respond to effectively, nor the generic complaint about spelling. One I’ve gotten before is “x is an innapropriate measure” with no other detail. This is tricky because I have to guess why you think it is innapropriate, so I have to make your argument for you (mind read) and then respond why I disagree (which obviously I do, or I wouldn’t have included that measure to begin with). So to respond to a critique I at first have to critique my own paper – maybe this reviewer is more brilliant than I thought.

Being specific I also think helps cut down on arbitrary complaints that are arguable.

  • provide clear signals to the editor, both about main critiques and the extent to which they can be addressed

Peer review has two potential motivations, one is a gate-keeper and one is to improve the draft. Often times arbitrary advice by reviewers intended for the latter is not clearly delineated in the review, so it is easily confused for evidence pertinent to the gate-keeper function. I’ve gotten reviews of 20 bullet points or 2,000 words that make it seem like a poor paper due to sheer length of the comment, but the majority are minor points or arbitrary suggestions. Longer reviews actually suggest the paper is better – if there is something clearly wrong you can say it in a much shorter space.

Gabriel Rossman states these different parts of peer review a bit more succintly than me:

You need to adopt a mentality of “is it good how the author did it” rather than “how could this paper be made better”

I think this is a good quip to follow. I might add “don’t sweat the small stuff” to that as well. Some editors will read the paper and reviews and make judgement calls – but some editors just follow the reviewers blindly – so I worry with the 20 bullet point minor review that it unduly influenced a reject decision. I’m happy to respond to the bullets, and happy you took the time, but I’m not happy about you (the reviewer) not giving clear advice to the editor of the extent to which I can address those points.

I still give advice about improving the manuscript, but I try to provide clear signals to the editor about main critiques, and I also will explicitly state whether they can be addressed. The “can be addressed” is not for the person writing the paper – it is for the editor making the decision for whether to revise-and-resubmit! The main critiques in my experience will either entail 2-3 main points (or none at all for some papers). I also typically say when things are minor and put them in a separate section, which editors can pretty much ignore.

Being a quantitative guy, the ones that frustrate me the most are complaints about model specifications. Some are legitimately major complaints, but often times it will be things that are highly unlikely to greatly influence the reported results. Examples are adding/dropping/changing a particular control variable and changes in the caliper used for propensity score matching. Note I’m not saying you shouldn’t ask to see differences, I’m asking that you clearly articulate why your suggestion is preferable and make an appropriate judgement as to whether it is a big problem or a little problem. A similar complaint is what information to include in tables or in the main manuscript or appendix. The author already has the information, so it is minor editing, not a major problem.

While I am here I will end with three additional complaints that don’t fit into anywhere previously in my post. One, multiple rounds of review are totally a waste. So the life cycle of the paper-review should be

paper -> review -> editor decision -> reject/accept
                                      revise-resumbit -> new paper + responses to reviews -> editor decision

The way the current system works, I have to submit another review after the new paper has been submitted. I rather the editor take the time to see if the authors sufficiently addressed the original complaints, because as a reviewer I am not an unbiased judge of that. So if I say something is important and you retort it is not, what else do you want me to say in my second review! It is the editors job at that point to arbiter disagreements. This then stops the cycle of multiple rounds of review, which have a very large amount of diminishing returns in improving the draft.

This then leads into my second complaint, generally about keeping a civil tone for reviews. In general I don’t care if a reviewer is a bit gruff in the review – it is not personal. But since reviewers have a second go, when I respond I need to keep an uber deferential tone. I don’t think that is really necessary, and I’d rather original authors have similar latitude to be gruff in responses. Reviewers say stupid things all the time (myself included) and you should be allowed to retort that my critique is stupid! (Of course with reasoning as to why it is stupid…)

Finally, I had one reviewer start recently:

This paper is interesting and very well written…I will not focus on the paper’s merits, but instead restrict myself to areas where it can be improved.

The good is necessary to signal to the editor whether a paper should be published. I’ve started to try in my own reviews to include more of the good (which is definately not the norm) and argue why a paper should be published. You can see in my linked review of the Xu and Griffiths paper by the third round I simply gave arguments why the paper should be published, despite a disagreement about the change-point model they reported on in the paper.

Using the Google Geocoding API with Python

Previously I posted how to use the geopy python library to call the Google geocode API. But somewhere along the way my version of geopy was not working (maybe because the API changed). Instead of figuring out that problem, I just wrote my own function to call the Google API directly. No need to worry about installing geopy.

Part of the reason I blog is so I have notes for myself – I’m pretty sure I’ve rewritten this several times for different quick geocoding projects, but I couldn’t find them this morning when I needed to do it again. So here is a blog post for my own future reference.

Here is the function, it takes as input the full string address. Also I was getting back some null responses by rapid fire calling the API (with only 27 addresses), so I set the function to delay for five seconds and that seemed to fix that problem.

import urllib, json, time
def GoogGeoAPI(address,api="",delay=5):
  base = r""
  addP = "address=" + address.replace(" ","+")
  GeoUrl = base + addP + "&key=" + api
  response = urllib.urlopen(GeoUrl)
  jsonRaw =
  jsonData = json.loads(jsonRaw)
  if jsonData['status'] == 'OK':
    resu = jsonData['results'][0]
    finList = [resu['formatted_address'],resu['geometry']['location']['lat'],resu['geometry']['location']['lng']]
    finList = [None,None,None]
  time.sleep(delay) #in seconds
  return finList

And here is an example use of the function. It returns the formatted address, the latitude and the longitude.

#Example Use
test = r"1600 Amphitheatre Parkway, Mountain View, CA"
geoR = GoogGeoAPI(address=test)
print geoR

This works for a few addresses without an API key. Even with an API key though the limit I believe is 2,500 – so don’t use this to geocode a large list. Also if you have some special characters in your address field this will take more work. For example if you have an & for an intersection I bet this url call will fail. But that should not be too hard to deal with. Also note the terms of service for using the API (which I don’t understand – so don’t ask me!)

I should eventually wrap up all of this google API python code into an extension for SPSS. Don’t hold your breath though for me getting the time to do that.

Presentation at ACJS 2016

I will be presenting at the ACJS (Academy of Criminal Justice Sciences) conference in Denver in a few days. My talk will be on some of the work I have been conducting with the Albany Police Department via the Finn Institute (Rob Worden and Sarah McLean are co-authors on the presentation). The title is Making stops smart: Predicting arrest rates from discretionary police stops at micro places in Albany, NY and here is the abstract:

Police stops are one of the most invasive, but regularly used crime control tactics by police. Similar to how focusing police resources at hot spots of crime can improve police efficiency, here we examine the spatial variation in arrest rates at micro places (street segments and intersections) in Albany, NY. Using data from over 240,000 discretionary police stops, we fit random effects logistic regression models to predict the probability of an arrest at different micro places. We show that like hot spots, there are examples of high arrest rate locations next to low arrest rate locations. Using a simulation, we show that if one displaced stops from low arrest locations to high arrest locations, one could make half as many stops but still have the same number of total arrests.

Here is a funnel chart of the arrest hit rates at micro-places across the city. You can see quite a bit of extra variation in arrest rates to attempt to explain.

I am giving this at 8 am on Thursday (see Event #185 in the program)

There will be two other presentations at the moment (Ling Wu is not going to make it), and they are:

  • Results from a victim generated crime mapping software, Zavin Nazaretian et al. – Indiana University of PA
  • Spatial analysis of aggravated assault and homicide crime scene, arrest and offender residence locations in Houston, TX, Elishewah Weisz – Sam Houston

So if you are interested in crime mapping stuff it should be a good session.

Feel free to bug me if you see me around at ACJS.


Also before I forget, my co-workers are presenting a poster on analysis of Syracuse Truce – a focused deterrence gang intervention. The posters are on Friday, so I won’t be around unfortunately. The title is Gangs, groups, networks, and deterrence: An evaluation of Syracuse Truce. (See poster #45 in the same program I linked to earlier.) Rob and Kelly will at least be manning the poster though – so you can go and bug them about the details!

Here is a picture of the reach of call-ins for one particular gang. The idea is for those who attended call ins to spread the message to other members. So this graph evaluates how well the call-ins would be expected to reach all of the members of the gang.

If you are wondering what I do for my job – yes I pretty much just make maps and graphs all day long😉




Some GIS data scraping adventures: Banksy graffiti and gang locations in NYC

I’ve recently scraped some geographic data that I may use in my graduate level GIS course. I figured I would share with everyone, and take some time to describe for others how I scraped the data.

So to start, if you read an online article and it has a webmap with some GIS data in it – the data exists somewhere. It won’t always be the case that you can actually download the data, but for the most current and popular interactive mapping tools, the data is often available if you know where to look.

For example, I asked on the GIS stackexchange site awhile ago how can you download the point data in this NYC homicide map from the Times. I had emailed the reporters multiple times and they did not respond. A simple solution the answerers suggested was to use website developer tools to see what was being loaded when I refreshed the page. It happened that the map is being populated by a two simple text files (1,2).

It may be an interesting project to see how this compares (compiling via news stories) versus official data, which NYC recently released going back to 2006. Especially since such crowdsourced news datasets are used for other things, like counting mass shootings.

The two example mapping datasets I provide below though are a bit different process to get the underlying data – but just as easy. Many current webmaps use geojson files as the backend. What I did for the two examples below is I just looked at the html source for the website, and look for json data formats – links that specify “js” or “json” extensions. If you click through those external json links you can see if they have the data.

The other popular map type though comes from ESRI. You can typically find an ESRI server populating the map, and if the website has say a parcel data lookup you can often find an ESRI geocoding server (see here for one example of using an ESRI geocoding api). The maps though unfortunately do not always have exposed data. Sometimes what looks like vector data are actually just static PNG tiles. Council Districts in this Dallas map are an example. If you dig deep enough, you can find the PNG tiles for the council districts, but that does not do anyone much good. Pretty much all of those layers are available for download from other sources though. A similar thing happens with websites with crime reports, such as RAIDS Online or They intentionally build the web map so you cannot scrape the data.

So that said, before we go further though – it should go without saying that you should not steal/plagiarize people’s articles or simply rip-off their graphics. Conducting new analysis with the publicly available data though seems fair game to me.

Banksy Taggings in NYC

There was a recent stink in the press about Kim Rossmo and company using geographic offender profiling to identify the likely home location of the popular graffiti artist Banksy. Here is the current citation of the journal article for those interested:

Hauge, M. V., Stevenson, M. D., Rossmo, D. K., and Le Comber, S. C. (2016). Tagging banksy: using geographic profiling to investigate a modern art mystery. Journal of Spatial Science, pages 1-6. doi:10.1080/14498596.2016.1138246

The article uses data from Britain, so I looked up to see if his taggings in other places was available. I came across this article showing a map of locations in New York City. So I searched where the data was coming from, and found the json file that contains the point data here. I just built a quick excel spreadsheet to parse the data, and you can download that spreadsheet here.

Gang Locations

This article posts a set of gang territories in NYC. This is pretty unique – I am unfamiliar with any other public data source that identifies gang territories. So I figured it would be a potential fun project for students in my GIS course – for instance in overlaying with the 311 graffiti data.

Again the data at the backend is in json format, and can be found here. To convert this data to a shapefile is a bit challenging, as it has points, lines and polygons all in the same file. What I did was buffer the lines and points by a small amount to be able to stuff them all in one shapefile. A zip file of that shapefile can be downloaded here.

Drop me a note if you use this data, I’d be interested in your analyses! Hence why I am sharing the data for others to play with:)

On overlapping error bars in charts

Andrew Gelman posted an example graph the other day in a blog post which showed trends over time in measures of smiling in high school yearbook photos.

Surprisingly, Andrew did not make a comment on the error bars in the graph. Error bars with cross hairs are often distracting in the plot. In the example graph it is quite bad, in that they perfectly overlap, so the ends are very difficult to disentangle. Here I will suggest some alternatives.

I simulated data that approximately captures the same overall trends, and replicated the initial chart in SPSS.

First, a simple solution with only two groups is to use semi-transparent areas instead of the error bars.

This makes it quite easy to see the overlap and non-overlap of the two groups. This will even print out nice in black-white. In the end, this chart is over-complicated by separating out genders. Since each follow the same trend, with females just having a constant level shift over the entire study period, there is not much point in showing each in a graph. A simpler solution would just pool them together (presumably the error bars would be smaller by pooling as well). The advice here still applies though, and the areas are easier to viz. than the discontinuous error bars.

For more complicated plots with more groups, I would suggest doing small multiples.

While it is harder now to see the exact overlap between groups, we can at least visually assess the trends within each group quite well. In the original it is quite a bit of work to figure out the differences between groups and keep the within group comparisons straight. Since the trends are so simple it is not impossible, but with more variable charts it would be quite a bit of work.

For instances in which a trend line is not appropriate, you can dodge the individual error bars on the x-axis so that they do not perfectly overlap. This is the same principle as in clustered bar charts, just with points and error bars instead of bars.

Here I like using just the straight lines (a tip taken from Andrew Gelman). The serif part of the I beam like error bars I find distracting, and make it necessary to separate the lines further. Using just the lines you can pack many more into a small space, like caterpillar plots of many random effects.

Here is a copy of the SPSS syntax used to generate these graphs.


Get every new post delivered to your Inbox.

Join 73 other followers