Plotting Predictive Crime Curves

Writing up some notes on how to evaluate crime prediction models has been on my to-do list for a bit. A recent paper on knife homicides in London is a good motivating use case. In short, when you have continuous model predictions, there are a few different graphs I would typically like to see in place of accuracy tables.

The linked paper does not provide data, so for a similar illustration I grab the lower super output area crime stats from here, and use the 08-17 data to predict homicides in 18-Feb19. I’ve posted the SPSS code I used to do the data munging and graphs here; all the stats could be done in Excel as well (it just involves sorting, cumulative sums, and division). Note this is not quite a replication of the paper, as it includes all cases in the homicide/murder minor crime category, and not just knife crime. There ends up being a total of 147 homicides/murders from 2018 through Feb-2019, so the nature of the task is still very similar: predicting a pretty rare outcome among almost 5,000 lower super output areas (4,831 to be exact).

So the first plot I like to make goes like this. Use whatever metric you want based on historical data to rank your areas; here I used assaults from 08-17. Sort the dataset in descending order based on your prediction, and then calculate the cumulative number of homicides. Then calculate two more columns: the proportion of all homicides captured so far, and the proportion of all areas covered so far.

Easier to show than to say. So for reference your data might look something like below (pretend we have 100 homicides and 1000 areas for a simpler looking table):

 PriorAssault  CurrHom  CumHom  PropHom    PropArea
 1000          1          1      1/100      1/1000
  987          0          1      1/100      2/1000
  962          2          3      3/100      3/1000
  920          1          4      4/100      4/1000
    .          .          .       .          .
    .          .          .       .          .
    .          .          .       .          .
    0          0        100    100/100   1000/1000

You would sort on the PriorAssault column (in descending order), and then calculate CumHom (Cumulative Homicides), PropHom (Proportion of All Homicides) and PropArea (Proportion of All Areas). Then you just plot PropArea on the X axis and PropHom on the Y axis. Here is that plot using the London data.
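The sort-and-accumulate steps can be sketched in a few lines of Python. This is only an illustration on made-up numbers, not the SPSS code used for the London data; the function name `capture_curve` and the toy values are my own assumptions.

```python
def capture_curve(prediction, outcome):
    """Rank areas by prediction (descending) and accumulate the outcome.

    Returns one (prop_area, prop_outcome, cum_outcome) tuple per area,
    in ranked order. Ties keep their original order.
    """
    n_areas = len(prediction)
    total = sum(outcome)
    # Indices of the areas sorted from highest to lowest prediction
    order = sorted(range(n_areas), key=lambda i: prediction[i], reverse=True)
    rows = []
    cum = 0
    for rank, i in enumerate(order, start=1):
        cum += outcome[i]
        rows.append((rank / n_areas, cum / total, cum))
    return rows


# Hypothetical toy data: five areas, four homicides in total
prior_assault = [920, 0, 1000, 962, 987]
curr_hom = [1, 0, 1, 2, 0]
curve = capture_curve(prior_assault, curr_hom)
for prop_area, prop_hom, cum_hom in curve:
    print(f"{prop_area:.2f}  {prop_hom:.2f}  {cum_hom}")
```

Plotting the first element of each tuple on the X axis against the second on the Y axis gives the curve described above.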

Paul Ekblom suggests plotting the ROC curve, and I am too lazy now to show it, but it is very similar to the above graph. Basically you can do a weighted ROC curve (so areas with more than 1 homicide get more weight in the graph). (See Mohler and Porter, 2018 for an academic reference to this point.)

Here is the weighted ROC curve that SPSS spits out; I’ve also superimposed the predictions generated via prior homicides. You can see that prior homicides as the predictor is very near the line of equality, suggesting prior homicides are no better than a coin-flip, whereas using all prior assaults does a little better job, although not great. SPSS gives the area-under-the-curve stat at 0.66 with a standard error of 0.02.

Note that the prediction can be anything, it does not have to be prior crimes. It could be predictions from a regression model (like RTM), see this paper of mine for an example.

So while these do an OK job of showing the overall predictive ability of whatever metric (here they show that using assaults is better than random), it isn’t really great evidence that hot spots are the go-to strategy. Hot spots policing relies on very targeted enforcement of a small number of areas. The ROC curve shows the entire area. If you need to patrol 1,000 LSOAs to effectively capture enough crimes to make it worth your while, I wouldn’t call that hot spots policing anymore; it is too large.

So another graph you can do is to just plot the cumulative number of crimes you capture versus the total number of areas. Note this is based on the same information as before (using rankings based on assaults), we are just plotting whole numbers instead of proportions. But it drives home the point a bit better that you need to go to quite a large number of areas to be able to capture a substantive number of homicides. Here I zoom in the plot to only show the first 800 areas.

So even though the overall curve shows better than random predictive ability, it is unclear to me if a rare homicide event is effectively concentrated enough to justify hot spots policing. Better than random predictions are not necessarily good enough.

A final metric worth making note of is the Predictive Accuracy Index (PAI). The PAI is often used in evaluating forecast accuracy, see some of the work of Spencer Chainey or Grant Drawve for some examples. The PAI is simply % Crime Captured/% Area, which we have already calculated in our prior graphs. So you want a value much higher than 1.

While those cited examples again use tables with simple cut-offs, you can make a graph like this to show the PAI metric under different numbers of areas, same as the above plots.
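To make the calculation concrete, here is a small Python sketch of the PAI at every possible cut-off. It assumes "% Area" is measured as the share of areas targeted (as in the plots above) rather than land area, and the function name and toy numbers are hypothetical.

```python
def pai_by_cutoff(prediction, outcome):
    """PAI when targeting the top k ranked areas, for k = 1..n."""
    n_areas = len(prediction)
    total = sum(outcome)
    # Rank areas from highest to lowest prediction
    order = sorted(range(n_areas), key=lambda i: prediction[i], reverse=True)
    pai = []
    cum = 0
    for k, i in enumerate(order, start=1):
        cum += outcome[i]
        # Same as (cum / total) / (k / n_areas), i.e. % crime over % areas
        pai.append((cum * n_areas) / (total * k))
    return pai


# Hypothetical toy data: five areas, four homicides in total
prior_assault = [920, 0, 1000, 962, 987]
curr_hom = [1, 0, 1, 2, 0]
pai = pai_by_cutoff(prior_assault, curr_hom)
print(pai)
```

By construction the last value is always 1, since targeting every area captures every crime.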

The saw-tooth ends up looking very much like a precision-recall curve, but I haven’t sat down and figured out the equivalence between the two as of yet. It is pretty noisy, but we might have two regimes based on this — target around 30 areas for a PAI of 3-5, or target 150 areas for a PAI of 3. PAI values that low are not something to brag to your grandma about though.

There are other stats like the predictive efficiency index (PAI vs the best possible PAI) and the recapture-rate index that you could do the same types of plots with. But I don’t want to put everyone to sleep.


Weighted buffers in R

Had a request not so recently about implementing weighted buffer counts. The idea behind a weighted buffer is that instead of say counting the number of crimes that happen within 1,000 meters of a school, you want to give events that are closer to the school more weight.

There are two reasons you might want to do this for crime analysis:

  • You want to measure the amount of crime around a location, but you would rather have a weighted crime count, where crimes closer to the location have a greater weight than those further away.
  • You want to measure attributes nearby a location (so things that predict crime), but give a higher weight to those closer to a location.

The second is actually more common in academic literature: see John Hipp’s Egohoods, or Liz Groff’s work on measuring proximity to bars, or Joel Caplan’s use of kernel density to estimate the effect of crime generators. Jerry Ratcliffe and colleagues’ work on the buffer intensity calculator is actually the motivation for the original request. So here are some quick code snippets in R to accomplish either. Here is the complete code and original data to replicate.

Here I use over 250,000 reported Part 1 crimes in DC from 08 through 2015, 173 school locations, and 21,506 street units (street segment midpoints and intersections) I constructed for various analyses in DC (all from open data sources) as examples.

Example 1: Crime Buffer Intensities Around Schools

First, let’s define where our data is located and read in the CSV files (don’t judge me setting the directory, I do not use RStudio!)

MyDir <- 'C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\buffer_stuff_R\\Code' #Change to location on your machine!
setwd(MyDir)

CrimeData <- read.csv('DC_Crime_08_15.csv')
SchoolLoc <- read.csv('DC_Schools.csv')

Now there are several ways to do this, but here is the way I think will be most useful in general for folks in the crime analysis realm. Basically the workflow is this:

  • For a given school, calculate the distance between all of the crime points and that school
  • Apply whatever function to that distance to get your weight
  • Sum up your weights

For the function to the distance there are a bunch of choices (see Jerry’s buffer intensity I linked to previously for some example discussion). I’ve written previously about using the bi-square kernel. So I will illustrate with that.

Here is an example for the first school record in the dataset.

#Example for crimes around school, weighted by Bisquare kernel
BiSq_Fun <- function(dist,b){
    ifelse(dist < b, ( 1 - (dist/b)^2 )^2, 0)
    }

S1 <- t(SchoolLoc[1,2:3])
Dis <- sqrt( (CrimeData$BLOCKXCOORD - S1[1])^2 + (CrimeData$BLOCKYCOORD - S1[2])^2 )
Wgh <- sum( BiSq_Fun(Dis,b=2000) )

Then repeat that for all of the locations that you want the buffer intensities, and stuff it in the original SchoolLoc data frame. (Takes less than 30 seconds on my machine.)

SchoolLoc$BufWeight <- -1 #Initialize field

#Takes about 30 seconds on my machine
for (i in 1:nrow(SchoolLoc)){
  S <- t(SchoolLoc[i,2:3])
  Dis <- sqrt( (CrimeData$BLOCKXCOORD - S[1])^2 + (CrimeData$BLOCKYCOORD - S[2])^2 )
  SchoolLoc[i,'BufWeight'] <- sum( BiSq_Fun(Dis,b=2000) )
}

In this example there are 173 schools and 276,621 crimes. It is too big to create all of the pairwise comparisons at once (which would generate nearly 50 million records), but the looping isn’t cumbersome or slow enough to worry about building a KDTree.

One thing to note about this technique is that if the buffers are large (or you have locations nearby one another), one crime can contribute to weighted crimes for multiple places.

Example 2: Weighted School Counts for Street Units

Extending this idea to estimating attributes at places essentially just swaps out the crime locations with whatever you want to calculate, a la Liz Groff and her inverse distance weighted bars paper. I will show something a little different though, using the weights to create a weighted sum, which is related to John Hipp and Adam Boessen’s idea about Egohoods.

So here for every street unit I’ve created in DC, I want an estimate of the number of students nearby. I not only want to count the number of kids in attendance in schools nearby, but I also want to weight schools that are closer to the street unit by a higher amount.

So here I read in the street unit data. Also I do not have school attendance counts in this dataset, so I just simulate some numbers to illustrate.

StreetUnits <- read.csv('DC_StreetUnits.csv')
StreetUnits$SchoolWeight <- -1 #Initialize school weight field

#Adding in random school attendance
SchoolLoc$StudentNum <- round(runif(nrow(SchoolLoc),100,2000)) 

Now it is very similar to the previous example, you just do a weighted sum of the attribute, instead of just counting up the weights. Here for illustration purposes I use a different weighting function, inverse distance weighting with a distance cut-off. (I figured this would need a better data management strategy to be timely, but this loop works quite fast as well, again under a minute on my machine.)

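#Will use inverse distance weighting with cut-off instead of bi-square
#Note a distance of exactly 0 returns Inf, so this assumes no street unit
#sits exactly on top of a school
Inv_CutOff <- function(dist,cut){
    ifelse(dist < cut, 1/dist, 0)
}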

for (i in 1:nrow(StreetUnits)){
    SU <- t(StreetUnits[i,2:3])
    Dis <- sqrt( (SchoolLoc$XMeters - SU[1])^2 + (SchoolLoc$YMeters - SU[2])^2 )
    Weights <- Inv_CutOff(Dis,cut=8000)
    StreetUnits[i,'SchoolWeight'] <- sum( Weights*SchoolLoc$StudentNum )
}   

The same idea could be used for other attributes, like sales volume for restaurants to get a measure of the business of the location (I think more recent work of John Hipp’s uses the number of employees).

For some attributes you may want to do the weighted mean instead of a weighted sum. For example, if you were using estimates of the proportion of residents in poverty, it makes more sense for this measure to be a spatially smoothed mean estimate than a sum. In this case it works exactly the same but you would replace sum( Weights*SchoolLoc$StudentNum ) with sum( Weights*SchoolLoc$StudentNum )/sum(Weights). (You could use the centroid of census block groups in place of the polygon data.)

Some Wrap-Up

Using these buffer weights really just swaps out one arbitrary decision for data analysis (the buffer distance) with another (the distance weighting function). Although the weighting function is more complicated, I think it is probably closer to reality for quite a few applications.

Many of these different types of spatial estimates are related to one another (kernel density estimation, geographically weighted regression, kriging). So there are many different ways that you could go about making similar estimates. Not letting the perfect be the enemy of the good, I think what I show here will work quite well for many crime analysis applications.

Reasons Police Departments Should Consider Collaborating with Me

Much of my academic work involves collaborating and consulting with police departments on quantitative problems. Most of the work I’ve done so far is very ad-hoc, through either the network of other academics asking for help on some project or police departments cold contacting me directly.

In an effort to advertise a bit more clearly, I wrote a page that describes examples of prior work I have done in collaboration with police departments. That discusses what I have previously done, but doesn’t describe why a police department would bother to collaborate with me or hire me as a consultant. In fact, it probably makes more sense to contact me for things no one has done before (including myself).

So here is a more general way to think about (from a police department’s or criminal justice agency’s perspective) whether it would be beneficial to reach out to me.

Should I do X?

So no one is going to be against different evidence based policing practices, but not all strategies make sense for all jurisdictions. For example, while focused deterrence has been successfully applied in many different cities, if you do not have much of a gang violence problem it probably does not make sense to apply that strategy in your jurisdiction. Implementing any particular strategy should take into consideration the cost as well as the potential benefits of the program.

Should I do X may involve more open ended questions. I’ve previously conducted in person training for crime analysts that goes over various evidence based practices. It also may involve something more specific, such as should I redistrict my police beats? Or I have a theft-from-vehicle problem, what strategies should I implement to reduce them?

I can suggest strategies to implement, or conduct cost-benefit analysis as to whether a specific program is worth it for your jurisdiction.

I want to do X, how do I do it?

This is actually the best scenario for me. It is much easier to design a program up front that allows a police department to evaluate its efficacy (such as designing a randomized trial and collecting key measures). I also enjoy tackling some of the nitty-gritty problems of implementing particular strategies more efficiently or developing predictive instruments.

So you want to do hotspots policing? What strategies do you want to do at the hotspots? How many hotspots do you want to target? Those are examples of where it would make sense to collaborate with me. Pretty much all police departments should be doing some type of hot spots policing strategy, but depending on your particular problems (and budget constraints), it will change how you do your hot spots. No budget doesn’t mean you can’t do anything — many strategies can be implemented by shifting your current resources around in particular ways, as opposed to paying for a special unit.

If you are a police department at this stage I can often help identify potential grant funding sources, such as the Smart Policing grants, that can be used to pay for particular elements of the strategy (that have a research component).

I’ve done X, should I continue to do it?

Have you done something innovative and want to see if it was effective? Or are you putting a bunch of money into some strategy and are skeptical it works? It is always preferable to design a study up front, but often you can conduct pretty effective post-hoc analysis using quasi-experimental methods to see if some crime reduction strategy works.

If I don’t think you can do a fair evaluation I will say so. For example I don’t think you can do a fair evaluation of chronic offender strategies that use officer intel with matching methods. In that case I would suggest how you can do an experiment going forward to evaluate the efficacy of the program.

Mutual Benefits of Academic-Practitioner Collaboration

Often I collaborate with police departments pro bono, so you may ask what is in it for me. As an academic I get evaluated mostly by my research productivity, which involves writing peer reviewed papers and getting research grants. So money is not the main factor from my perspective. It is typically easier to write papers about innovative problems or programs. If it involves applying for a grant (on a project I am interested in) I will volunteer my services to help write the grant and design the study.

I could go through my career writing papers without collaborating with police departments, but my work with police departments is more meaningful. It is not zero-sum either; I tend to get better ideas when understanding specific agencies’ problems.

So get in touch if you think I can help your agency!

CAN SEBP webcast on predictive policing

I was recently interviewed for a webcast by the Canadian Society of Evidence Based Policing on Predictive Policing.

I am not directly affiliated with any software vendor, so these are my opinions as an outsider, academic, and regular consultant for police departments on quantitative problems.

I do have some academic work on predictive policing applications that folks can peruse at the moment (listed below). The first is on evaluating the accuracy of people-based predictions, the second addresses the problem of disproportionate minority contact in spatial predictive systems.

  • Wheeler, Andrew P., Robert E. Worden, and Jasmine R. Silver. (2018) The predictive accuracy of the Violent Offender Identification Directive (VOID) tool. Conditionally accepted at Criminal Justice and Behavior. Pre-print available here.
  • Wheeler, Andrew P. (2018) Allocating police resources while limiting racial inequality. Pre-print available here.

I have some more work on predictive policing applications in the pipeline, so just follow the blog or follow me on Twitter for updates about future work.

If police departments are interested in predictive policing applications and would like to ask me some questions, always feel free to get in contact. (My personal email is listed on my CV, my academic email is just Andrew.Wheeler at utdallas.edu.)

Most of my work consulting with police departments is ad-hoc (and much of it is pro bono), so if you think I can be of help always feel free to get in touch, either for developing predictive applications or for evaluating whether they are effective at achieving the outcomes you are interested in.

Monitoring Use of Force in New Jersey

Recently ProPublica published a map of uses-of-force across different jurisdictions in New Jersey. Such information can be used to monitor whether agencies are overall doing a good or bad job.

I’ve previously discussed the idea of using funnel charts to spot outliers, mostly around homicide rates but the idea is the same when examining any type of rate. For example in another post I illustrated its use for examining rates of officer involved shootings.

Here is another example applying it to lesser uses of force in New Jersey. Below is the rate of use of force reports per the total number of arrests. (Code to replicate at the end of the post.)

The average use of force per arrest in the state is around 3%, so the error bars are drawn relative to the state average. Here is an interactive chart in which you can use tool tips to see the individual jurisdictions.
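For readers who have not seen how the funnel is built, here is a rough Python sketch. It assumes exact binomial limits around the state average and a 99% interval; both are my assumptions for illustration, and the code linked at the end of the post may construct the bounds differently.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k events in n trials, computed on the log scale."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def funnel_bounds(n_arrests, state_rate, alpha=0.01):
    """Lower and upper rate limits for an agency with n_arrests arrests.

    The limits are the alpha/2 and 1 - alpha/2 quantiles of a binomial
    with the statewide rate, divided by the number of arrests.
    """
    low_k, high_k = None, n_arrests
    cum = 0.0
    for k in range(n_arrests + 1):
        cum += binom_pmf(k, n_arrests, state_rate)
        if low_k is None and cum >= alpha / 2:
            low_k = k
        if cum >= 1 - alpha / 2:
            high_k = k
            break
    return low_k / n_arrests, high_k / n_arrests


# The limits tighten around the 3% state average as arrests increase
for n in (100, 1000, 10000):
    low, high = funnel_bounds(n, 0.03)
    print(n, round(low, 4), round(high, 4))
```

An agency whose observed rate falls outside the limits for its number of arrests is the kind of point that sits outside the funnel.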

Now the original press release, noted by Seth Stoughton on Twitter, reported that several towns have ratios of black to white use of force that are very high. Scott Wolfe suspected that was partly a function of smaller towns having more variable rates. Basically, since one is comparing the ratio between two rates that each have error, the error bars around the rate ratio will also be quite large.

Here is the chart showing the same type of funnel around the rate ratio of black to white use-of-force relative to the average over the whole sample (the black percent use of force is 3.2 percent of arrests, and the white percent use of force is 2.4, and the rate ratio between the two is 1.35). I show in the code how I constructed this, which I should write a blog post about itself, but in short there are decisions I could make to make the intervals wider. So the points that are just slightly above a ratio of 2 at around 10,000 arrests are arguably not outliers, those more to the top-right of the plot though are much better evidence. (I’d note that if one group is very small, you could always make these error bars really large, so to construct them you need to make reasonable assumptions about the size of the two groups you are comparing.)

And here is another interactive chart in which you can view the outliers again. In the original press release, Millville, Lakewood, and South Orange are noted as outliers. Using arrests as the denominator instead of population, they each have a rate ratio of around 2. In this chart Millville and Lakewood are outside the bounds, but just barely. South Orange is within the bounds. So those aren’t the places I would have called out according to this chart.

In that same Twitter thread other folks noted the potential reliability/validity problems of such data (Pete Moskos and Kyle McLean). These charts cannot say why individual agencies are outliers, either high or low. It could be their officers are really using force at different rates, but it could also be they are using different definitions for reporting force. There are also other potential explanations for the use of force distribution as well as the ratio differences in black vs white; no doubt policing in Princeton vs Camden is substantively different. Also even if all individual agencies are doing well, it does not mean there are no potential problem officers (as noted by David Pyrooz, often a few officers contribute to most UoF).

Despite these limitations, I still think there is utility in this type of monitoring. It is basically a flag to dig deeper when anomalous patterns are spotted. Those unaccounted for factors contribute to more points being pushed outside of my constructed limits (overdispersion), but the limits still clearly indicate when a pattern is so far outside the norm of what is expected that the public deserves some explanation. Also it highlights when agencies are potentially doing well, and so can be promoted according to their current practices.

This is a terrific start to effectively monitoring police agencies by ProPublica — state criminal justice agencies should be doing this themselves though.

Here is the code to replicate the analysis.

New preprint: Allocating police resources while limiting racial inequality

I have a new working paper out, Allocating police resources while limiting racial inequality. In this work I tackle the problem that a hot spots policing strategy likely exacerbates disproportionate minority contact (DMC). This is because of the pretty simple fact that hot spots of crime tend to be in disadvantaged/minority neighborhoods.

Here is a graph illustrating the problem. X axis is the proportion of minorities stopped by the police in 500 by 500 meter grid cells (NYPD data). Y axis is the number of violent crimes over a long time period (12 years). So a typical hot spots strategy would choose the top N areas to target (here I do top 20). These are all very high proportion minority areas. So the inevitable extra police contact in those hot spots (in the form of either stops or arrests) will increase DMC.

I’d note that the majority of critiques of predictive policing focus on whether reported crime data is biased or not. I think that is a bit of a red herring though; you could use totally objective crime data (say swap out reported crime for acoustic gunshot sensor data) and you would still have the same problem.

The proportion of stops by the NYPD of minorities has consistently hovered around 90%, so doing a bunch of extra stuff in those hot spots will increase DMC, as those 20 hot spots tend to have 95%+ stops of minorities (with the exception of one location). Also note this 90% has not changed even with the dramatic decrease in stops overall by the NYPD.

So to illustrate my suggested solution here is a simple example. Consider you have a hot spot with predicted 30 crimes vs a hot spot with predicted 28 crimes. Also imagine that the 30 crime hot spot results in around 90% stops of minorities, whereas the 28 crime hot spot only results in around 50% stops of minorities. If you agree reducing DMC is a reasonable goal for the police in-and-of-itself, you may say choosing the 28 crime area is a good idea, even though it is a less efficient choice than the 30 crime hot spot.

I show in the paper how to codify this trade-off into a linear program that says choose X hot spots, but has a constraint based on the expected number of minorities likely to be stopped. Here is an example graph that shows it doesn’t always choose the highest crime areas to meet that racial equity constraint.
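To show the logic of the trade-off on a toy scale, here is a Python sketch that simply tries every combination of areas instead of solving a linear program. That brute-force swap, the made-up areas, and the exact form of the constraint (minority stops as a share of all expected stops in the chosen areas) are all my assumptions for illustration; the paper’s formulation may differ in its details.

```python
from itertools import combinations


def pick_hot_spots(areas, n_pick, max_minority_share):
    """Choose n_pick areas with the most predicted crime, subject to the
    expected share of minority stops in the chosen set being at most
    max_minority_share. Brute force, so only suitable for tiny examples."""
    best, best_crime = None, -1
    for combo in combinations(areas, n_pick):
        crime = sum(a["crime"] for a in combo)
        stops = sum(a["stops"] for a in combo)
        minority = sum(a["stops"] * a["minority_share"] for a in combo)
        if minority <= max_minority_share * stops and crime > best_crime:
            best, best_crime = combo, crime
    names = tuple(a["name"] for a in best) if best else ()
    return names, best_crime


# Hypothetical areas: predicted crimes, expected stops, minority share of stops
areas = [
    {"name": "A", "crime": 30, "stops": 100, "minority_share": 0.90},
    {"name": "B", "crime": 28, "stops": 100, "minority_share": 0.50},
    {"name": "C", "crime": 29, "stops": 100, "minority_share": 0.95},
    {"name": "D", "crime": 10, "stops": 100, "minority_share": 0.40},
]

for cap in (1.0, 0.8, 0.6):
    print(cap, pick_hot_spots(areas, 2, cap))
```

Tightening the cap moves the selection away from the two highest crime areas, and the total predicted crime captured drops accordingly.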

This results in a trade-off of efficiency though. Going back to the original hypothetical, trading off a 28 crime vs 30 crime area is not a big deal. But if the trade off was 3 crimes vs 30 that is a bigger deal. In this example I show that getting to 80% stops of minorities (NYC is around 70% minorities) results in hot spots with around 55% of the crime compared to the no constraint hot spots. So in the hypothetical it would go from 30 crimes to 17 crimes.

There won’t be a uniform formula to calculate the expected decrease in efficiency, but I think getting to perfect equality with the residential pop. will typically result in similar large decreases in many scenarios. A recent paper by George Mohler and company showed similar fairly steep declines. (That uses a totally different method, but I think will be pretty similar outputs in practice — can tune the penalty factor in a similar way to changing the linear program constraint I think.)

So basically the trade-off to get perfect equity will be steep, but I think the best case scenario is that a PD can say "this predictive policing strategy will not make current levels of DMC worse" by applying this algorithm on-top-of your predictive policing forecasts.

I will be presenting this work at ASC, so stop on by! Feedback always appreciated.

The random distribution of near-repeat strings

One thing several studies that examine near-repeat patterns have looked at is the distribution of the string of near-repeats. So near-repeats sometimes result in only 2 cases connected, sometimes 3, sometimes 4, etc. Here is an example from a recent work on arsons (Turchan et al., 2018):

Cory Haberman and Jerry Ratcliffe were the first I noticed to do this in this paper (Jerry’s near-repeat calculator has the option to export the strings). It is also a similar idea to what Davies and Marchione did in this paper.

Looking at these strings of events has clear utility for crime analysts, as they have a high probability of being linked to the same offender(s). Building off of some prior work, I wrote some python code to see what the distribution of these strings would look like when you randomly permuted the times in the data (which is the same approach used to estimate the intervals in the near repeat calculator). Here is the data and code, which is an analysis of 14,184 thefts from motor vehicles in Dallas that occurred in 2015.

So first I break down the total number of near repeat strings, defining near-repeats as events within 1000 feet and 7 days of each other. I then conduct 99 random permutations to see how many strings might happen by chance even if there is no near-repeat phenomenon. Some near-repeats can simply happen by chance, especially in places where crime is more prevalent. A length of string 1 in the table means it is not a near repeat, and 10+ means the string has 10 or more events in it. The numbers are the number of chains (in the Turchan article parlance), so 1,384 2-length chains means it includes 2,768 crime events.
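Here is a stripped-down Python sketch of that procedure, not the code linked above. It assumes a string is a connected chain of events (each event within the distance and time thresholds of at least one other event in the chain), treats the thresholds as inclusive, and uses a naive all-pairs comparison that would be slow on 14,184 events. The function names and toy events are hypothetical.

```python
import random
from collections import Counter


def string_lengths(events, max_dist, max_days):
    """events is a list of (x, y, day). Returns {string length: number of strings}."""
    n = len(events)
    parent = list(range(n))  # union-find structure to link events into strings

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        xi, yi, ti = events[i]
        for j in range(i + 1, n):
            xj, yj, tj = events[j]
            close_space = (xi - xj) ** 2 + (yi - yj) ** 2 <= max_dist ** 2
            close_time = abs(ti - tj) <= max_days
            if close_space and close_time:
                parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(n))
    return Counter(sizes.values())


def permuted_lengths(events, max_dist, max_days, n_perm=99, seed=10):
    """Shuffle the times across the locations and recount strings each time."""
    rng = random.Random(seed)
    times = [t for _, _, t in events]
    results = []
    for _ in range(n_perm):
        rng.shuffle(times)
        shuffled = [(x, y, t) for (x, y, _), t in zip(events, times)]
        results.append(string_lengths(shuffled, max_dist, max_days))
    return results


# Hypothetical toy events: (x in feet, y in feet, day)
events = [(0, 0, 0), (100, 0, 1), (200, 0, 5), (5000, 5000, 2), (0, 0, 100)]
observed = string_lengths(events, max_dist=1000, max_days=7)
perms = permuted_lengths(events, max_dist=1000, max_days=7)
print(observed)
```

Comparing the observed counts for each string length against the range of counts across the permutations gives the kind of table described above.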

If you compare the observed to the bounds in the table, you can see there are fewer isolates (1 length) in the observed than in the permutation distribution, and more 2 and 3 string events. After that the higher level strings occur about as frequently in the observed data as in the random data, with the exception that 10+ strings are fewer, but not by much.

So this provides evidence of the boost hypothesis in this data, albeit many near-repeat strings are still likely to occur just by chance, and the differences are not uber large. A crime analyst may be more interested in the question though "if I have X events in a near-repeat string, should I look into the data more". The idea being that since 2-strings are not that rare it would probably be a waste of an analyst’s time to dig into all of the two-events. I don’t think this is the perfect way to make that decision, but here is a breakdown of the distribution of strings for the permuted data.

So isolates happen in the random data 86% of the time. 2-strings happen 8.7% of the time, 3-strings 2.6%, etc. Based on this I would recommend requiring at least 3 events in a near-repeat string if you have a low threshold in terms of "should I bother to dig into these events". If you want a high threshold though you may do more like 6+ events in a string.

This again is a little bit of a slippage, as this is actually: if you randomly picked a crime, what is the probability it is in a string of near-repeats of length N. I’m not quite sure of a better way to pose it though. Maybe it is better to think in terms of forecasts (e.g. given N prior crimes, what is the probability of an additional near-repeat crime, similar to Piza and Carter). Or maybe in terms of if there are N near-repeats, what is the probability they will be linked to a common person (a la Mike Porter and crime linkage).

Also I should mention some of the cool work Liz Groff and Travis Taniguchi are doing on near-repeat work. I should probably just use their near-repeat code instead of rolling my own.

New paper: A simple weighted displacement difference test to evaluate place based crime interventions

At the ECCA conference this past spring Jerry Ratcliffe asked if I could apply some of my prior work on evaluating changes in crime patterns over time to make a set of confidence intervals for the weighted displacement quotient statistic (WDQ). The answer to that is no, you can’t, but in its stead I created another statistic in which you can do that, the weighted displacement difference (WDD). The work is published in the open access journal Crime Science.

The main idea is we wanted a simple statistic folks can use to evaluate place based interventions to reduce crime. All you need is pre and post crime counts for your treated and control areas of interest. Here is an Excel spreadsheet to calculate the statistic, and below is a screen shot. You just need to fill in the pre and post counts for the treated and control locations and the spreadsheet will spit out the statistic, along with a p-value and a 95% confidence interval of the number of crimes reduced.

What is different compared to the WDQ statistic is that you need a control area for the displacement area too in this statistic. But if you are not worried about displacement, you can actually just put in zeros for the displacement area and still do the statistic for the local area (and its control area). In this way you can actually do two estimates, one for the local effects and one for the displacement. Just put in zeros for the other values.

While you don’t really need to read the paper to be able to use the statistic, we do have some discussion on choosing control areas. In general the control areas should have similar counts of crime, you shouldn’t have a treatment area that has 100 crimes and a control area that only has 10 crimes. We also have this graph, which is basically a way to conduct a simple power analysis — the idea that “could you reasonably detect whether the intervention reduced crime” before you actually conduct the analysis.

So the way to read this graph is as follows. If you have a set of treated and control areas that average 100 crimes per area in each period (so the cumulative total across the eight counts is around 800), the intervention needs to have prevented around 30 crimes to show even weak evidence of a crime reduction (a one-tailed p-value of less than 0.1). Many interventions just aren’t set up to have strong evidence of crime reductions. For example if you have a baseline of 20 crimes, you need to prevent 15 of them to find weak evidence of effectiveness. Interventions in areas with fewer baseline crimes basically cannot be verified as effective using this simple a design.
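To get a feel for where numbers like these come from, here is a back-of-the-envelope Python sketch. It is not the code behind the graph. It assumes the counts are Poisson, so the variance of the statistic is roughly the sum of all eight counts, and it uses a normal approximation for the one-tailed p-value. The function name and arguments are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_reduction(avg_count, n_cells=8, one_tailed_p=0.1):
    """Rough number of crimes an intervention must prevent to reach
    the given one-tailed p-value.

    Assumption: each of the n_cells counts (4 areas x 2 periods) is
    Poisson with mean avg_count, so the variance is their sum.
    """
    total = avg_count * n_cells
    standard_error = sqrt(total)
    z_crit = NormalDist().inv_cdf(1 - one_tailed_p)
    return z_crit * standard_error


for baseline in (20, 100):
    needed = min_detectable_reduction(baseline)
    print(f"baseline {baseline}: need to prevent roughly {needed:.0f} crimes")
```

Under these assumptions the sketch gives values in the same ballpark as those read off the graph (a bit higher), so treat the graph in the paper as the reference and this only as an illustration of the logic.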

For those more mathy, I created a test statistic based on the differences in the changes of the counts over time by making an assumption that the counts are Poisson distributed. This is then basically just a combination of two difference-in-difference estimates (for the local and the displacement areas) using counts instead of means. For researchers with the technical capabilities, it probably makes more sense to use a data-based approach to identify control areas (such as the synthetic control method or propensity score matching). This is of course assuming an actual randomized experiment is not feasible. But this is too much of a burden for many crime analysts, so if you can construct a reasonable control area by hand you can use this statistic.
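Here is a minimal Python sketch of that idea, for those who would rather not use the spreadsheet. It is an illustration, not the spreadsheet's formulas: it assumes independent Poisson counts (so the variance of the estimate is the sum of the eight counts) and a normal approximation for the p-value and confidence interval. The function and argument names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wdd(t_pre, t_post, ct_pre, ct_post,
        d_pre=0, d_post=0, cd_pre=0, cd_post=0):
    """Weighted displacement difference from pre/post counts.

    t = treated, ct = control for treated,
    d = displacement, cd = control for displacement.
    Leave the displacement counts at zero for a local-only estimate.
    """
    local = (t_post - t_pre) - (ct_post - ct_pre)
    displacement = (d_post - d_pre) - (cd_post - cd_pre)
    estimate = local + displacement
    # Assumption: independent Poisson counts, so variances add up
    variance = (t_pre + t_post + ct_pre + ct_post
                + d_pre + d_post + cd_pre + cd_post)
    if variance == 0:
        raise ValueError("all counts are zero, cannot compute a standard error")
    se = sqrt(variance)
    z = estimate / se
    return {
        "wdd": estimate,
        "se": se,
        "z": z,
        # one-tailed p-value for a crime reduction (negative estimate)
        "p_reduction": NormalDist().cdf(z),
        "ci95": (estimate - 1.96 * se, estimate + 1.96 * se),
    }


result = wdd(t_pre=100, t_post=70, ct_pre=100, ct_post=100,
             d_pre=50, d_post=50, cd_pre=50, cd_post=50)
print(result)
```

In this made-up example the treated area drops by 30 crimes while everything else stays flat, so the estimate is -30, and the one-tailed p-value comes out at about 0.10.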

Aoristic analysis for hour of day and day of week in Excel

I’ve previously written code to conduct Aoristic analysis in SPSS. Since this reaches about an N of three crime analysts (if that even), I created an Excel spreadsheet to do the calculations for both the hour of the day and the day of the week in one go.

Note if you simply want within day analysis, Joseph Glover has a nice spreadsheet with VBA functions to accomplish that. But here I provide analysis for both the hour of the day and the day of the week. Here is the spreadsheet and some notes, and I will walk through using the spreadsheet below.

First off, you need your data in Excel to be in BeginDateTime and EndDateTime fields; you cannot have the dates and times in separate fields. If you do have them in separate fields and they are formatted correctly, you can simply add your date field to your time field. If you have the times in three separate date, hour, and minute fields, you can use a formula like =DATE + HOUR/24 + MINUTE/(60*24) to create the combined datetime field in Excel (Excel stores each day as one integer, so times are fractions of a day).
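If you are preparing the data outside of Excel first, the same arithmetic can be sketched in Python. This is only an illustration of the formula above. It assumes Excel's default 1900 date system, where serial numbers for modern dates count days from 1899-12-30, and the function name is made up.

```python
from datetime import date

# Assumption: Excel's default 1900 date system (valid for dates after Feb 1900)
EXCEL_EPOCH = date(1899, 12, 30)


def excel_datetime(d, hour, minute):
    """Combine a date, hour, and minute into one Excel serial number.

    Mirrors =DATE + HOUR/24 + MINUTE/(60*24): the integer part is the
    day, the fractional part is the time of day.
    """
    day_number = (d - EXCEL_EPOCH).days
    return day_number + hour / 24 + minute / (60 * 24)


# Noon on 1 January 2019
print(excel_datetime(date(2019, 1, 1), 12, 0))
```

Pasting the resulting number into a cell formatted as a date-time should display the combined begin or end time.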

Presumably at this stage you should fix your data if it has errors. Do you have missing begin/end times? Some police databases when there is an exact time treat the end date time as missing — you will want to fix that before using this spreadsheet. I constructed the spreadsheet so it will ignore missing cells, as well as begin datetimes that occur after the end datetime.
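A quick pre-check along these lines can be sketched in Python before pasting anything into the spreadsheet. This is not part of the spreadsheet itself. It assumes that a missing end time with a known begin time means the time is exact (so the end is set equal to the begin), and the function name is hypothetical.

```python
from datetime import datetime


def clean_intervals(rows):
    """Split (begin, end) pairs into usable rows and problem counts.

    rows: iterable of (begin, end), where either value may be None.
    Returns (good_rows, n_missing, n_switched).
    """
    good, n_missing, n_switched = [], 0, 0
    for begin, end in rows:
        if begin is not None and end is None:
            # Assumption: a known begin with no end is an exact time
            end = begin
        if begin is None or end is None:
            n_missing += 1
        elif begin > end:
            n_switched += 1
        else:
            good.append((begin, end))
    return good, n_missing, n_switched


rows = [
    (datetime(2019, 1, 7, 8, 0), datetime(2019, 1, 7, 17, 0)),  # fine
    (datetime(2019, 1, 8, 9, 30), None),                         # exact time
    (None, datetime(2019, 1, 9, 12, 0)),                         # missing begin
    (datetime(2019, 1, 10, 18, 0), datetime(2019, 1, 10, 6, 0)), # switched
]
good, n_missing, n_switched = clean_intervals(rows)
print(len(good), n_missing, n_switched)
```

The three numbers it returns correspond to the kinds of checks worth making by eye as well: usable rows, rows with missing times, and rows where the begin and end look switched.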

So once your begin and end times are correctly set up, you can copy paste your dates into my Aoristic_HourWeekday.xlsx excel spreadsheet to do the aoristic calculations. If following along with my data I posted, go ahead and open up the two excel files in the zip file. In the Arlington_Burgs.xlsx data select the B2 cell.

Then scroll down to the bottom of the sheet, hold Shift, and then select the D3269 cell. That should highlight all of the data you need. Right-click, and then select Copy (or simply Ctrl + C).

Now migrate over to the Aoristic_HourWeekday.xlsx spreadsheet, and paste the data into the first three columns of the OriginalData sheet.

Now go to the DataConstructed sheet. Basically we need to update the formulas to recognize the new rows of data we just copied in. So go ahead and select the A11 to MI11 row. (Note there are a bunch of columns hidden from view).

Now we have a little over 3,000 cases in the Arlington burglary data. Grab the little green square in the lower right hand part of the selected cells, and then drag down the formulas. With your own data, you simply want to do this for as many cases as you have. If you go past your total N it is ok, it just treats the extra rows like missing data. This example with 3,268 cases then takes about a minute to crunch all of the calculations.

If you navigate to the TimeIntervals sheet, this is where the intervals are actually referenced, but I also place several summary statistics you might want to check out. The Total N shows that I have 3,268 good rows of data (which is what I expected). I have 110 missing rows (because I went over), and zero rows that have the begin/end times switched. The total proportion should always equal 1 — if it doesn’t I’ve messed up something — so please let me know!

Now the good stuff: if you navigate to the NiceTables_Graphs sheet it does all the summaries that you might want. Considering it takes a while to do all the calculations (even for a tiny dataset of 3,000 cases), if you want to edit things I would suggest copying and pasting the data values from this sheet into another one, to avoid redoing needless calculations.

Interpreting the graphs you can see that burglaries in this dataset have a higher proportion of events during the daytime, but only on weekdays. Basically what you would expect.

Personally I would always do this analysis in SPSS, as you can make much nicer small multiple graphs than Excel like below. Also my SPSS code can split the data between different subsets. This particular Excel code you would just need to repeat for whatever subset you are interested in. But a better Excel sleuth than me can likely address some of those critiques.

One minor additional note on this is that Jerry’s original recommendation rounded the results. My code does proportional allocation. So if you have an interval like 00:50 to 01:30, it would assign the [0-1] hour a weight of 10/40, and the [1-2] hour a weight of 30/40 (Jerry’s original would be 50% in each hour bin). Also if you have an interval that is longer than the entire week, I simply assign equal ignorance to each bin; I don’t further wrap it around.
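For those who want to see the proportional allocation logic spelled out, here is a minimal Python sketch. It is not a port of the spreadsheet formulas, just the same idea: each weekday-by-hour bin gets the share of the interval that falls inside it, and intervals of a week or longer get an equal weight in all 168 bins. The function name is made up.

```python
from datetime import datetime, timedelta

WEEK = timedelta(days=7)
ONE_HOUR = timedelta(hours=1)


def aoristic_weights(begin, end):
    """Return {(weekday, hour): weight} for one event, weights sum to 1.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    total = end - begin
    if total < timedelta(0):
        raise ValueError("begin occurs after end")
    if total == timedelta(0):
        # Exact time: all the weight goes in a single bin
        return {(begin.weekday(), begin.hour): 1.0}
    if total >= WEEK:
        # Equal ignorance across every weekday-hour bin
        return {(d, h): 1 / 168 for d in range(7) for h in range(24)}
    weights = {}
    cur = begin
    while cur < end:
        # End of the hour bin that cur falls in
        next_hour = cur.replace(minute=0, second=0, microsecond=0) + ONE_HOUR
        stop = min(next_hour, end)
        key = (cur.weekday(), cur.hour)
        weights[key] = weights.get(key, 0.0) + (stop - cur) / total
        cur = stop
    return weights


# The 00:50 to 01:30 example, on a Monday
w = aoristic_weights(datetime(2019, 1, 7, 0, 50), datetime(2019, 1, 7, 1, 30))
print(w)
```

For the example interval this gives a weight of 0.25 for the Monday [0-1] bin and 0.75 for the Monday [1-2] bin. Summing these dictionaries over all events gives the hour by weekday table.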

Data sources for crime generators

Those interested in micro place based crime analysis often need to collect information on businesses or other facilities where many people gather (e.g. hospitals, schools, libraries, parks). To keep it short, businesses influence the comings-and-goings of people, and those people are the ones who commit offenses and are victimized. For those doing neighborhood level research, census data is almost a one stop shop, but that is not the case when trying to collect the business data of interest. Here are some tips and resources I have collected over the years of conducting this research.

Alcohol License Data

Most states have a state level board in which one needs to obtain a license to sell alcohol. Bars and liquor stores are one of the most common micro crime generator locations criminologists are interested in, but in most states places like grocery stores, gas stations, and pharmacies also sell alcohol (minus those Quakers in Pennsylvania) and so need a license. So such lists contain many different crime generators of interest. For example here is Texas’s list, which includes a form to search for and download various license data. Here is Washington’s, which just has spreadsheets of the current alcohol and cannabis licenses in the state. To find these you can generally just google something like “Texas alcohol license data”.

In my experience these also have additional fields to further distinguish between the different types of locations. For example, besides the difference between on-premise vs off-premise, you can often also tell the difference between a sit down restaurant vs a more traditional bar. (Often based on the percent of food-stuffs vs alcohol that make up total revenue.) So if you were interested in a dataset of gas stations to examine commercial robbery, I would go here first as opposed to the other sources (again PA is an exception to that advice though, as well as dry counties).

Open Data Websites

Many large cities now have open data websites. If you simply google “[Your City] open data” they will often come up. Every city is unique in what data they have available, so you will just have to take a look on the site to see if whatever crime generator you are interested in is available. (These sites almost always contain reported crimes as well; I daresay reported crimes are the most common open data on these websites.) For businesses, the city may have a directory (like Chicago), although that is not the norm. They often have other points/places of interest as well, such as parks, hospitals and schools.

Another example is googling “[your city] GIS data”. Often cities/counties have a GIS department, and I’ve found that many publicly release some data, such as parcels, zoning, streets, school districts, etc. that are not included on the open data website. For example here is the Dallas GIS page, which includes streets, parcels, and parks. (Another pro-tip is that many cities have an ArcGIS data server lurking in the background, which you can often use to geocode address data. See these blog posts of mine (python, R) for examples.) If you have a county website and you need some data, it never hurts to send a quick email to see if some of those datasets are available (ditto for crime via the local crime analyst). You have nothing to lose by asking.

I’d note that sometimes you can figure out a bit from the zoning/parcel dataset. For instance there may be a particular special code for public schools or apartment complexes. NYC’s PLUTO data is the most extensive I have ever seen for a parcel dataset. Most though have simpler codes, but you can still at least figure out apartments vs residential vs commercial vs mixed zoning.

You will notice that finding these sites involves using google effectively. Since every place is idiosyncratic it is hard to give general advice. But google searches are easy. Recently I needed public high schools in Dallas for a project, and it was not on any of the prior sources I noted. A google search however turned up a statewide database of the public and charter school locations. If you include things like “GIS” or “shapefile” or “data” in the search it helps whittle it down some to provide a source that can actually be downloaded/manipulated.

Scraping from public websites

The prior two sources are generally going to be better vetted. They of course will have errors, but are typically based on direct data sources maintained by either the state or local government. All of the other sources I will list though are secondary, and I can’t really say to what extent they are incorrect. The biggest thing I have noticed with these data sources is that they tend to be missing facilities in my ad-hoc checks. (Prior mentioned sources at worst I’ve noticed a rare address swap with a PO box that was incorrect.)

I’ve written previously about using the Google Places API to scrape data. I’ve since updated that into a short python code snippet: all you need is a bounding box for the area you are looking at, and it will do a grid search over the area for the place type you are interested in. Joel Caplan has a post about using Google Earth in a similar manner, but unfortunately that has a quite severe limitation, as it only returns 10 locations. My python code snippet has no such limitation.
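To illustrate what a grid search over a bounding box looks like, here is a minimal Python sketch that only builds the grid of search centers. It is not the linked snippet and it does not call any API. The function name, the default step size, and the flat-earth conversion from meters to degrees are all assumptions for illustration.

```python
from math import ceil, cos, radians

# Assumption: rough meters per degree of latitude, fine for a city-sized box
M_PER_DEG_LAT = 111_320.0


def grid_points(min_lat, min_lon, max_lat, max_lon, step_m=1000.0):
    """Return (lat, lon) cell centers that tile the bounding box.

    Cells are at most step_m meters on a side (approximately).
    """
    mid_lat = (min_lat + max_lat) / 2
    m_per_deg_lon = M_PER_DEG_LAT * cos(radians(mid_lat))
    n_rows = max(1, ceil((max_lat - min_lat) * M_PER_DEG_LAT / step_m))
    n_cols = max(1, ceil((max_lon - min_lon) * m_per_deg_lon / step_m))
    dlat = (max_lat - min_lat) / n_rows
    dlon = (max_lon - min_lon) / n_cols
    return [
        (min_lat + (i + 0.5) * dlat, min_lon + (j + 0.5) * dlon)
        for i in range(n_rows)
        for j in range(n_cols)
    ]


# Made-up bounding box roughly around Dallas, with coarse 5 km cells
points = grid_points(32.7, -96.9, 32.8, -96.8, step_m=5000)
print(len(points), points[0])
```

Each center would then be used for one nearby search for the place type of interest. The search radius needs to be at least about 0.71 times the step (half the cell diagonal) so the circles cover the cell corners, and the results need to be de-duplicated since neighboring circles overlap.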

I don’t really understand Google’s current pricing scheme, but the places API has a very large number of free requests. So I’m pretty sure you won’t run out even when scraping a large city. (The geocoding and distance APIs unfortunately allow far fewer free requests, and so are much more limited.)

Other sources I have heard people use before are Yelp and Yellow pages. I haven’t checked those sources extensively (or whether they have APIs like Google). When looking closely at the Google data, it tends to be missing places (it is up to the business owner to sign up for a business listing). Despite it being free and seemingly madness to not take the step to have your business listed easily in map searches, it is easy to find businesses that do not come up. So user beware.

Also, it is pretty murky whether scraping the data for academic articles violates the terms of service for these sites. They say you can’t cache the original data, but if you just store the lat/lon and then turn it into a “count of locations” or a “distance to nearest location” (à la risk terrain modelling), I believe that does not violate the TOS (I am not a lawyer though, so take that with a grain of salt). Also for academic projects, since you are not making money I would not worry too extensively about being sued, but it is not a totally crazy concern.
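As an illustration of turning stored lat/lon points into those two derived measures, here is a minimal Python sketch. It assumes great-circle (haversine) distances on a spherical earth and a made-up 500 meter default buffer. The function names are hypothetical, and for real projects you would likely project the coordinates and use a spatial library instead.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # assumption: spherical earth


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def place_features(origin, places, radius_m=500.0):
    """Count of places within radius_m of origin, and distance to nearest.

    origin: (lat, lon); places: list of (lat, lon).
    Returns (count, nearest_distance_m), nearest is None if no places.
    """
    dists = [haversine_m(origin[0], origin[1], lat, lon) for lat, lon in places]
    count = sum(d <= radius_m for d in dists)
    nearest = min(dists) if dists else None
    return count, nearest


# Made-up coordinates for illustration only
origin = (32.78, -96.80)
places = [(32.78, -96.80), (32.79, -96.80), (33.00, -96.80)]
print(place_features(origin, places, radius_m=1500))
```

Only the count and the nearest distance end up in the analysis dataset here, not the scraped records themselves.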

Finally, the nature of scraping the business data is no different than other researchers who have been criticized for scraping public sites like Facebook or dating websites (it is just a business instead of personal info). I personally don’t find it unethical (and I did not think those prior researchers were unethical), but others will surely disagree.

City Observatory Data

City Observatory has a convenient set of data that they named the StoreFront Index. They have individual data points you can download for many different metro areas, along with their SIC codes. See also here for a nice map and to see if your metro area of interest is included.

See here for the tech report on which stores are included. They do not include liquor stores and gas stations though in their index. (Since it is based on Jane Jacobs’s work I presume they also do not include used car sale lots.)

Lexis Nexis Business Data (and other proprietary sources)

The store front data come from a private database, Custom Lists U.S. Business Database. I’m not sure exactly what vendor produces this (a google search brings up several), but here are a few additional proprietary sources researchers may be interested in.

My local library in Plano (as well as my university) has access to a database named Reference USA. This allows you to search for businesses in a particular geo area (such as zip code), as well as by other characteristics (such as by the previously mentioned SIC code). Also this database includes additional info. about sales and number of employees, which may be of further interest to tell the difference between small and large stores. (Obviously Wal-Mart has more customers and more crime than a smaller department store.) It provides the street address, which you will then need to geocode.

Reference USA though only allows you to download 250 addresses at a time, so it could be painful for crime generators that are more prevalent or for larger cities. Another source my friendly UTD librarian pointed out to me is Lexis Nexis’s database of public businesses. It has all the same info. as Reference USA and you can bulk download the files. See here for a screenshot walkthrough my librarian created for me.

Any good sources I am missing? Let me know in the comments. In particular these databases I mention are cross-sectional snapshots in time. It would be difficult to use these to measure changes over time with few exceptions.