American Community Survey Variables of Interest to Criminologists

I’ve written prior blog posts about downloading Five Year American Community Survey data estimates (ACS for short) for small area geographies, but one of the main hiccups is figuring out what variables you want to use. The census has so many variables that are just small iterations of one another (e.g. Males under 5, males 5 to 9, males 10 to 14, etc.) that it is quite a chore to specify the ones you want. Often you want combinations of variables or to calculate percentages as well, so you need to take two or more variables and turn them into your constructed variable.

I have posted some notes on the variables I have used for past projects in an excel spreadsheet. This includes the original variables, as well as some notes for creating percentage variables. Some are tricky. For example, to figure out the proportion of black residents for block groups you need to add the non-Hispanic black and Hispanic black estimates (and then divide by the total population). For spatially oriented criminologists these are basically indicators commonly used for social disorganization. It also includes notes on what is available at the smaller block group level, as not all of the variables are. So you are more limited in your choices if you want an area that small.
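As a rough illustration of building one of these percentage variables, here is a minimal sketch for the proportion of black residents. The variable codes (B03002_001 for total population, B03002_004 for non-Hispanic black alone, B03002_014 for Hispanic black alone) are my reading of the B03002 table and should be checked against the spreadsheet, and the block group counts are made up.

```python
# Sketch: proportion of black residents for one block group.
# The variable codes are assumptions to verify against the ACS templates,
# and the counts below are made-up numbers for illustration.
def prop_black(row):
    total = row['B03002_001']  # total population
    if total == 0:
        return None  # empty block groups have no defined proportion
    black = row['B03002_004'] + row['B03002_014']  # non-Hispanic + Hispanic black
    return black / total

example_bg = {'B03002_001': 1200, 'B03002_004': 270, 'B03002_014': 30}
print(prop_black(example_bg))
```

Returning None for a zero population is just one way to handle empty block groups; dropping them before the calculation works as well.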

Let me know if you have been using other variables for your work. I’m not an expert on these variables by any stretch, so don’t take my list as authoritative in any way. For example I have no idea whether it is valid to use the imputed data for moving in the prior year at the block group level. (In general I have not incorporated the estimates of uncertainty for any of the variables into my analyses, not sure of the additional implications for the imputed data tables.) Also I have not incorporated variables that could be used for income-inequality or for ethnic heterogeneity (besides using white/black/Hispanic to calculate the index). I’m sure there are other social disorganization relevant variables at the block group level folks may be interested in as well. So let me know in the comments or shoot me an email if you have suggestions to update my list.

I would prefer if as a field we could create a set of standardized indices so we are not all using different variables (see for example this Jeremy Miles paper). It is a bit hodge-podge though what variables folks use from study-to-study, and most folks don’t report the original variables so it is hard to replicate their work exactly. British folks have their index of deprivation, and it would be nice to have a similarly standardized measure to use in social science research for the states.


The ACS data has consistent variable names over the years, for example B03001_001 is the total population, B03002_003 is the Non-Hispanic white population, etc. Unfortunately those variables are not necessarily in the same tables from year to year, so concatenating ACS results over multiple years is a bit of a pain. Below I post a python script that, given a directory of the excel template files, will produce a nice set of dictionaries to help find what table particular variables are in.

#This python code reads the ACS meta-data templates
#to make it easier to search for tables that have particular variables
import xlrd, os

mydir = r'!!!Insert your path to the excel files here!!!!!'

def acs_vars(directory):
    #get the excel files in the directory
    excel_files = []
    for file in os.listdir(directory):
        if file.endswith(".xls"):
            excel_files.append( os.path.join(directory, file) )
    #getting the variables in a nice dictionaries
    lab_dict = {}
    loc_dict = {}
    for file in excel_files:
        book = xlrd.open_workbook(file) #first open the xls workbook
        sh = book.sheet_by_index(0)
        var_names = [i.value for i in sh.row(0)] #names on the first row (avoids shadowing the builtin vars)
        labs = [i.value for i in sh.row(1)] #labels on the second
        #now add to the overall dictionary
        for v,l in zip(var_names,labs):
            lab_dict[v] = l
            loc_dict[v] = file
    #returning the two dictionaries
    return lab_dict,loc_dict
    
labels,tables = acs_vars(mydir)

#now if you have a list of variables you want, you can figure out the table
interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
            'B11003_016','B14006_002','B01001_003','B23025_005','B22010_002',
            'B16002_004']
            
for i in interest:
    head, tail = os.path.split(tables[i])
    print (i,labels[i],tail)
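If you would rather see which variables sit in each table, the lookup can be inverted. This is a minimal sketch under the assumption that the lookup is a plain dictionary of variable name to file path, like the second dictionary returned above; the toy dictionary and its file names are hypothetical.

```python
# Sketch: invert a variable -> file lookup so each table lists its variables.
# toy_tables is a made-up stand-in for the real lookup dictionary.
import os
from collections import defaultdict

toy_tables = {'B03001_001': '/acs/Seq4.xls',
              'B02001_005': '/acs/Seq3.xls',
              'B01001_003': '/acs/Seq3.xls'}

def vars_by_table(loc_dict):
    grouped = defaultdict(list)
    for var, path in loc_dict.items():
        grouped[os.path.basename(path)].append(var)  # key on the file name only
    return {tab: sorted(names) for tab, names in grouped.items()}

by_table = vars_by_table(toy_tables)
for tab in sorted(by_table):
    print(tab, by_table[tab])
```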

Paper published: Evaluating Community Prosecution Code Enforcement in Dallas, Texas

Some work John Worrall and I collaborated on was just published in Justice Quarterly, Evaluating Community Prosecution Code Enforcement in Dallas, Texas. I have two links to share:

If you need access to the article always feel free to email.

Below is the abstract:

We evaluated a community prosecution program in Dallas, Texas. City attorneys, who in Dallas are the chief prosecutors for specified misdemeanors, were paired with code enforcement officers to improve property conditions in a number of proactive focus areas, or PFAs, throughout the city. We conducted a panel data analysis, focusing on the effects of PFA activity on crime in 19 PFAs over a six-year period (monthly observations from 2010 to 2015). Control areas with similar levels of pre-intervention crime were also included. Statistical analyses controlled for pre-existing crime trends, seasonality effects, and other law enforcement activities. With and without dosage data, the total crime rate decreased in PFA areas relative to control areas. City attorney/code enforcement teams, by seeking the voluntary or court-ordered abatement of code violations and criminal activity at residential and commercial properties, apparently improved public safety in targeted areas.

This was a neat program, as PFAs are near equivalents of hot spots that police focus on. So for the evaluation we drew control areas from Dallas PD’s Target Area Action Grid (TAAG) Areas:

New course in the spring – Crime Science

This spring I will be teaching a new graduate level course, Crime Science. A better name for the course would be evidence based policing tactics to reduce crime — but that name is too long!

Here you can see the current syllabus. I also have a page for the course, which I will update with more material over the winter break.

Given my background it has a heavy focus on hot spots policing (different tactics at hot spots, time spent at hot spots, crackdowns vs long term). But the class covers other policing strategies, such as chronic offenders, the focused deterrence gang model, and CPTED. We also discuss the use of technology in policing (e.g. CCTV, license plate readers, body-worn cameras).

I will weave in ethical discussions throughout the course, but I reserved the last class to specifically talk about predictive policing strategies. In particular the two main concerns are increasing disproportionate minority contact through prediction, and privacy concerns with police collecting various pieces of information.

So take my course!

Monitoring homicide trends paper published

My paper, Monitoring Volatile Homicide Trends Across U.S. Cities (with coauthor Tom Kovandzic) has just been published online in Homicide Studies. Unfortunately, Homicide Studies does not give me a link to share a free PDF like other publishers, but you can either grab the pre-print on SSRN or always just email me for a copy of the paper.

They made me convert all of the charts to grey scale :(. Here is an example of the funnel chart for homicide rates in 2015.

And here are example fan charts I generated for a few different cities.

As always if you have feedback or suggestions let me know! I posted all of the code to replicate the analysis at this link. The prediction intervals can definitely be improved both in coverage and in making their length smaller, so I hope to see other researchers tackling this as well.

Notes on using UCR data for class projects

Students in my classes often want to use UCR reported data for projects. One thing many don’t realize though is that the UCR data reported to the FBI is only aggregate statistics at regular intervals for the entire jurisdiction. So for example one can’t look at hot spots using reported UCR data.

If you do have a hypothesis that can be reasonably examined using monthly or yearly data at the jurisdiction level, here are a few notes on using UCR data. First is that you can get the most detailed downloads of data from ICPSR. That link has data series going back to 1960, and ends up being about two years behind (e.g. it is close to the end of 2017, and only 2015 data is available).

The datasets on ICPSR have monthly data for Part 1 crime types, as well as some information on arrests and clearances. Also they have all of the individual agencies, along with their ORI code. The ORI code allows you to link agencies over time.

While the FBI does have a page for more up to date UCR data (they just released the 2016 stats, so they are about a year behind), they are much more limited in the types of tables they disseminate. There typically is one table for Part 1 crime rates for individual large cities for each year, but otherwise it is aggregated to different city sizes. So most data analyses need to use the ICPSR data — the data directly from the FBI is not detailed enough.

For those wishing to map the data, it ends up being a bit tricky. Most people in the US are probably under the jurisdiction of at least two police departments — the local PD and the state police. Many people are also under the jurisdiction of a local sheriff. So many of these police agencies have overlapping boundaries. There is no easy source of the geographic boundaries for the police departments, but the ICPSR data does contain the zipcode for the headquarters for the police department. This won’t be accurate for state police — but should be suitable for mapping purposes for local agencies and sheriffs (sheriffs are sometimes organized at the county level). If you want polygon data for jurisdictional boundaries you will need to search for individual agencies and political boundaries — there is no easy source to download them all at once. Many rural areas will have police departments cover multiple towns, but if you stick to more urban areas you might be able to use city boundaries.

The ICPSR data has crime reports aggregated to the county level, so if that level of aggregation is not problematic you may use that data directly. You should be aware of many of the complaints about UCR data quality though. Mike Maltz has written a bit about it, but there are quite a few other folks who have noticed problems with reporting in the UCR data. The main problem to watch out for is missing data being accidentally reported as zero crimes occurring.

To stack datasets from different years from ICPSR is not too difficult if you are not going too far back in time. But if you go back to the older data, ICPSR changed the variable order. The variables are simply listed as V1 TO V100 something, so for example V15 in 1979 is not the same variable as V15 in 2005. My notes say they used the same variable order from 1998-2015, but you will want to check that yourself (I downloaded the SPSS files, it would not surprise me if the datasets differed for some of the years.)
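One way to handle the changing variable order is to rename each year's V-variables to common names before stacking. Here is a minimal sketch; the mappings, the ORI value, and the counts are all hypothetical, and the real positions have to be read off the ICPSR codebook for each year.

```python
# Sketch: rename per-year V-variables to common names before stacking.
# rename_maps and the records below are made up for illustration only.
rename_maps = {
    1979: {'V3': 'ori', 'V15': 'murder_jan'},
    2005: {'V3': 'ori', 'V22': 'murder_jan'},
}

def harmonize(records, year):
    mapping = rename_maps[year]
    out = []
    for rec in records:
        row = {new: rec[old] for old, new in mapping.items()}  # keep only mapped fields
        row['year'] = year
        out.append(row)
    return out

data_1979 = [{'V3': 'XX00100', 'V15': 2, 'V22': 0}]
data_2005 = [{'V3': 'XX00100', 'V15': 7, 'V22': 1}]
stacked = harmonize(data_1979, 1979) + harmonize(data_2005, 2005)
print(stacked)
```

Keeping the mapping in one dictionary per year makes it easy to spot check against the codebook when a year looks off.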

Some additional resources students may want to familiarize themselves with to gather UCR data more quickly are the FBI UCR data tool and Mike Maltz’s cleaned up dataset and notes on how he made it. You should probably just use Mike Maltz’s dataset if you are using data over time.

If you are just interested in yearly homicides, I have provided a dataset of cleaned up homicides that goes back to 1960. See my paper that goes along with that dataset on graphing temporal homicide trends (mapping those trends could be an interesting project as well!).

Graphs and interrupted time series analysis – trends in major crimes in Baltimore

Pete Moskos’s blog is one I regularly read, and in a recent post he pointed out how major crimes (aggravated assaults, robberies, homicides, and shootings) have been increasing in Baltimore since the riot on 4/27/15. He provides a series of different graphs using moving averages to illustrate the rise, see below for his initial attempt:

He also has an interrupted moving average plot that shows the break more clearly – but honestly I don’t understand his description, so I’m not sure how he created it.
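For reference, here is a minimal sketch of a plain trailing moving average on daily counts. The second function is only my guess at what an interrupted version could look like (smoothing each side of the break separately), not Pete's actual method, and the daily counts are made up.

```python
# Sketch: trailing moving average of daily counts (made-up data).
def moving_average(counts, window=7):
    out = []
    for i in range(len(counts)):
        if i + 1 < window:
            out.append(None)  # not enough prior days yet
        else:
            chunk = counts[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out

# Assumption: an "interrupted" version smooths each side of the break on its own,
# so the average never mixes pre and post observations.
def interrupted_ma(counts, break_index, window=7):
    return (moving_average(counts[:break_index], window)
            + moving_average(counts[break_index:], window))

daily = [20, 25, 22, 30, 28, 24, 26, 40, 35]
print(moving_average(daily, window=7))
print(interrupted_ma(daily, break_index=7, window=3))
```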

I recreated his initial line plot using SPSS, and I think a line plot with a guideline shows the bump post riot pretty clearly.

The bars in Pete’s graph are not the easiest way to visualize the trend. Here making the line thin and lighter grey also helps.

The way to analyze this data is using an interrupted time series analysis. I am not going to go through all of those details, but for those interested I would suggest picking up David McDowell’s little green book, Interrupted Time Series Analysis, for a walkthrough. One of the first steps though is to figure out the ARIMA structure, which you do by examining the auto-correlation function. Here is that ACF for this crime data.

You can see that it is positive and stays quite consistent. This is indicative of a moving average model. It does not show the geometric decay of an auto-regressive process, nor is the autocorrelation anywhere near 1, which you would expect for an integrated process. Also the partial autocorrelation plot shows the geometric decay, which is again consistent with a moving average model. See my note at the bottom on how this interpretation was wrong! (David Greenberg sent me a note.)
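If you want to compute the autocorrelations yourself rather than rely on a stats package, here is a minimal sketch of the usual sample autocorrelation function. The short series is made up just to show the calculation.

```python
# Sketch: sample autocorrelation function for lags 1..max_lag.
def acf(series, max_lag):
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)  # lag 0 sum of squares
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(1, max_lag + 1)]

toy_series = [1, 2, 3, 4, 5]  # made-up values
print(acf(toy_series, 2))
```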

Although it is typical to analyze crime counts as a Poisson model, I often like to use linear models. Coefficients are much easier to interpret. Here the distribution of the counts is high enough I am ok using a linear interrupted ARIMA model.

So I estimated an interrupted time series model. I include a dummy variable term that equals 1 as of 4/27/15 and after, and equals 0 before. That variable is labeled PostRiot. I then have dummy variables for each month of the year (M1, M2, …., M11) and days of the week (D1,D2,….D6). The ARIMA model I estimate then is (0,0,7), with a constant. Here is that estimate.
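To make the regressors concrete, here is a minimal sketch of how one row of dummy variables could be built for a given date. The choice of omitted categories (December for the months, Sunday for the days of the week) is an assumption for illustration and may not match the coding in my SPSS syntax.

```python
# Sketch: build the PostRiot, month, and day-of-week dummies for one date.
from datetime import date

RIOT = date(2015, 4, 27)

def design_row(day):
    row = {'PostRiot': int(day >= RIOT)}  # 1 on the riot date and after
    # M1..M11, with December as the omitted category (assumption)
    for m in range(1, 12):
        row['M%d' % m] = int(day.month == m)
    # D1..D6 for Monday..Saturday, with Sunday omitted (assumption)
    for d in range(1, 7):
        row['D%d' % d] = int(day.isoweekday() == d)
    return row

print(design_row(date(2015, 4, 27)))
```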

So we get an estimate that post riot, major crimes have increased by around 7.5 per day. This is pretty similar to what you get when you just look at the daily mean pre-post riot, so it isn’t really any weird artifact of my modeling strategy. Pre-riot it is under 25 per day, and post it is over 32 per day.

This result is pretty robust across different model specifications. Dropping the constant term results in a larger post riot estimate (over 10). Inclusion of fewer or more MA terms (as well as seasonal MA terms for 7 days) does not change the estimate. Inclusion of the monthly or day of week dummy variables does not make a difference in the estimate. Changing the outlier value on 4/27/15 to a lower value (here I used the pre-mean, 24) does reduce the estimate slightly, but only to 7.2.

There is a bit of residual autocorrelation I was never able to get rid of, but it is fairly small, with the highest autocorrelation of only about 0.06.

Here is the SPSS code to reproduce the Baltimore graphs and ARIMA analysis.

As a note, while Pete believes this is a result of depolicing (i.e. Baltimore officers being less proactive) the evidence for that hypothesis is not necessarily confirmed by this analysis. See Stephen Morgan’s analysis on crime and arrests, although I think proactive street stops should likely also be included in such an analysis.


This Baltimore data just shows a bump up in the series, but investigating homicides in Chicago it looks to me like an upward trend post the McDonald shooting. This graph is at the monthly level.

I have some other work on Chicago homicide geographic patterns going back quite a long time I can hopefully share soon!

I will need to update the Baltimore analysis to look at just homicides as well. Pete shows a similar bump in his charts when just examining homicides.

For additional resources for folks interested in examining crime over time, I would suggest checking out my article, Monitoring volatile homicide trends across U.S. cities, as well as Tables and Graphs for Monitoring Crime Patterns. I’m doing a workshop at the upcoming International Association for Crime Analysts conference on how to recreate such graphs in Excel.


David Greenberg sent me an email to note my interpretation of the ACF plots was wrong, and that a moving average process should only have a spike, and not show the slow decay. He is right, and so I updated the interrupted ARIMA models to include higher order AR terms instead of MA terms. The final model I settled on was (5,0,0): I kept adding higher order AR terms until the AR coefficients were not statistically significant. For these models I still included a constant.
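To see the difference David is pointing out, here is a minimal sketch of the textbook theoretical autocorrelations for a first order moving average process versus a first order auto-regressive process. The coefficient values of 0.5 are arbitrary, chosen only for illustration.

```python
# Sketch: theoretical ACF of MA(1) versus AR(1).
def ma1_acf(theta, max_lag):
    # MA(1): y_t = e_t + theta * e_{t-1}; only lag 1 is non-zero
    return [theta / (1 + theta ** 2) if k == 1 else 0.0
            for k in range(1, max_lag + 1)]

def ar1_acf(phi, max_lag):
    # AR(1): y_t = phi * y_{t-1} + e_t; geometric decay phi**k
    return [phi ** k for k in range(1, max_lag + 1)]

print(ma1_acf(0.5, 4))  # a single spike, then zeros
print(ar1_acf(0.5, 4))  # slow decay toward zero
```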

For the model that includes the outlier riot count, it results in an estimate that the riot increased these crimes by 7.5 per day, with a standard error of 0.5.

This model has no residual auto-correlation until you get up to very high lags. Here is a table of the Box-Ljung stats for up to 60 lags.
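For those curious how the Box-Ljung statistic is put together, here is a minimal sketch using the standard formula Q = n(n+2) times the sum of r_k squared over (n-k). The toy residuals are made up. Turning Q into a p-value means comparing it to a chi-square distribution (typically with degrees of freedom reduced by the number of estimated ARMA terms), which I leave to a stats package.

```python
# Sketch: cumulative Box-Ljung Q statistics for lags 1..max_lag.
def ljung_box(residuals, max_lag):
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [x - mean for x in residuals]
    denom = sum(d * d for d in dev)
    running = 0.0
    stats = []
    for k in range(1, max_lag + 1):
        r_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom  # lag k autocorrelation
        running += r_k ** 2 / (n - k)
        stats.append(n * (n + 2) * running)
    return stats

toy_resid = [1, 2, 3, 4, 5]  # made-up values
print(ljung_box(toy_resid, 2))
```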

Estimating the same ARIMA model with the outlier value changed to 24, the post riot estimate is still over 7.

Overall, the post-riot increase estimate is pretty robust across these different ARIMA model settings. The lowest estimate I was able to get was a post mean increase of 5 when not including an intercept and not including the outlier crime counts on the riot date. So I think this result holds up pretty well to a bit of scrutiny.

New working paper: Choosing Representatives to Deliver the Message in a Group Violence Intervention

I have a new preprint up on SSRN, Choosing Representatives to Deliver the Message in a Group Violence Intervention. This is what I will be presenting at ACJS next Friday the 24th. Here is the abstract:

Objectives: The group based violence intervention model is predicated on the assumption that individuals who are delivered the deterrence message spread the message to the remaining group members. We focus on the problem of who should be given the initial message to maximize the reach of the message within the group.

Methods: We use social network analysis to create an algorithm to prioritize individuals to deliver the message. Using a sample of twelve gangs in four different cities, we identify the number of members in the dominant set. The edges in the gang networks are defined by being arrested or stopped together in the prior three years. In eight of the gangs we calculate the reach of observed call-ins, and compare these with the sets defined by our algorithm. In four of the gangs we calculate the reach for a strategy that only calls-in members under supervision.

Results: The message only needs to be delivered to around 1/3 of the members to reach 100% of the group. Using simulations we show our algorithm identifies the minimal dominant set in the majority of networks. The observed call-ins were often inefficient, and those under supervision could be prioritized more effectively.

Conclusions: Group based strategies should monitor their potential reach based on who has been given the message. While only calling-in those under supervision can reach a large proportion of the gang, delivering the message to those not under supervision will likely be needed to reach 100% of the group.

And here is an image of the observed reach for one of the gang networks using both call-ins and custom notifications.

The paper has the gang networks available at this link, and uses Python to do the network analysis and SPSS to draw the graphs.
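To give a flavor of the idea, here is a minimal sketch of calculating reach and of a generic greedy heuristic for picking a dominating set. This is not necessarily the algorithm in the paper, and the toy network is made up. Reach here means the people given the message plus everyone directly tied to them.

```python
# Sketch: reach of a call-in set and a generic greedy dominating set heuristic.
def reach(network, chosen):
    covered = set(chosen)
    for member in chosen:
        covered |= network[member]  # add everyone directly tied to this member
    return covered

def greedy_dominating_set(network):
    uncovered = set(network)
    chosen = []
    while uncovered:
        # pick the member who reaches the most not-yet-reached members
        best = max(sorted(network),
                   key=lambda m: len(({m} | network[m]) & uncovered))
        chosen.append(best)
        uncovered -= {best} | network[best]
    return chosen

# made-up undirected network stored as adjacency sets
toy_gang = {
    'a': {'b', 'c', 'd'},
    'b': {'a'},
    'c': {'a'},
    'd': {'a', 'e'},
    'e': {'d', 'f'},
    'f': {'e'},
}
picked = greedy_dominating_set(toy_gang)
print(picked, len(reach(toy_gang, picked)), 'of', len(toy_gang))
```

A greedy pick like this is not guaranteed to find the smallest possible set, so it should be read as an illustration of the reach calculation rather than a replacement for the paper's code.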

If you are interested in applying this to your work let me know! Not only do I think this is a good idea for focused deterrence initiatives for criminal justice agencies, but I think the idea can be more widely applied to other fields in social sciences, such as public health (needle clean/dirty exchange programs) or organizational studies (finding good leaders in an organization to spread a message).

Paper on Roadblocks in Buffalo published

My paper with Scott Phillips, A quasi-experimental evaluation using roadblocks and automatic license plate readers to reduce crime in Buffalo, NY, has just been published online first in the Security Journal. Springer gifts me a special link in which you can read the paper. Previously when I have been given links like that from the publisher they have a time limit, but the email for this one said nothing. But even if that goes bad you can always read my pre-print of the article I posted on SSRN.


Title: A quasi-experimental evaluation using roadblocks and automatic license plate readers to reduce crime in Buffalo, NY

Abstract:

This article evaluates the effectiveness of a hot spots policing strategy: using automated license plate readers at roadblocks in Buffalo, NY. Different roadblock locations were chosen by the Buffalo Police Department every day over a two-month period. We use propensity score matching to identify a set of control locations based on prior counts of crime and demographic factors. We find modest reductions in Part 1 violent crimes (10 over all roadblock locations and over the two months) using t tests of mean differences. We find a 20% reduction in traffic accidents using fixed effects negative binomial regression models. Both results are sensitive to the model used though, and the fixed effects models predict increases in crimes due to the intervention. We suggest that the limited intervention at one time may be less effective than focusing on a single location multiple times over an extended period.

And here is Figure 2 from the paper, showing the units of analysis (street midpoints and intersections) and how the treatment locations were assigned.

Much ado about nothing: Overinterpreting volatility in homicide rates

I’m not much of a macro criminologist, but being asked questions by my dad (about Richard Rosenfeld and the Ferguson effect) and the dentist yesterday (asking about some of Trump’s comments about rising crime trends) has prompted me to jump into it and give my opinion. Long story short: I believe many sources are overinterpreting short term fluctuations as more meaningful than they are.

First I will tackle national crime rates. So if you have happened to walk by a TV playing CNN the past few days, you may have heard Donald Trump being criticized for his statements on crime rates. This is partially a conflation of overall levels of crime with changes in crime over time. Basically crime is currently low compared to historical patterns, but homicide rates have been rising in the past two years. This is easier to show in a chart than to explain in words. So here is the national estimated homicide rate per 100,000 individuals since 1960.[1]

2016 is not official and is still an estimate, but basically the pattern is this: crime has been falling generally across the country since the early 1990’s. Crime rates in just the past few years have finally dropped below levels in the 1960’s, but for the past two years homicides have been increasing. So some have pointed to the increase in the past two years and have claimed the sky is falling. To say this they say the rate of change is the largest in the past 40 years. There are better charts to show rates of change (a semi-log chart), but the overall look is basically the same.

You have to really squint to see that change from 2014 to 2015 is a larger jump than any of the changes over the entire period, so arguments based on the size of recent changes in the homicide rate are hyperbole (either on a linear scale or a logarithmic scale). And even if you take the recent increases over the past two years as evidence of a more general rising trend, for a broader term pattern we still have homicide rates close to a low point in the past 50 years.

For a bit of general advice: for any source that gives you a percent change, you always want to see the base numbers and any longer term historical trends. Any media source that cites recent increases in homicides without providing this graph of long term historical crime trends is simply misleading. I’ve seen this done in many places, see this example from the New York Times or this recent note from the Economist. So this isn’t something specific to the President.

Now, macro criminologists don’t really have any better track record explaining these patterns than macro economists have in explaining economic trends. Basically we have a bunch of patchwork theories that make sense for parts of the trend, but not the entire time frame: changes in routine activities in the 1960’s, increases in incarceration, the decline of crack use, ease of calling 911 with cell-phones, lead use, abortion (just to name a few). And academics come up with new theories all the time, the most recent being the Ferguson effect, which is simply another term for de-policing.

Now a bit on trends for specific cities. How this ties in with the national trend is that some articles have been pointing out that some cities have seen increases and some have not. That is fine to point out (albeit trivial), but then the articles frequently go on to generate stories about why crime is rising in those specific places. Those on the left cite civil unrest and police brutality as possible reasons (Milwaukee, St. Louis, Chicago, Baltimore), while those on the right cite the deleterious effects of police departments not being as proactive (stops in Chicago, arrests in Baltimore).

While any of these explanations may turn out reasonable in the end, I’m pretty sure most of these articles severely underappreciate the volatility in homicide rates. Take an example with St. Louis, with a city population of just over 300,000. A homicide rate of 50 individuals per 100,000 means a total of 150 murders. A homicide rate of 40 per 100,000 means 120 murders. So we are only talking about a change of 30 murders overall. Fluctuations of around 10 in the murder rate would not be unexpected for a city with a population of 300,000 individuals. The confidence interval for a rate of 150 murders per 300,000 individuals is 126 to 176 murders.[2]
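For those who want to check an interval like this without R, here is a minimal sketch of an exact Clopper-Pearson interval using only the Python standard library. It finds the two bounds by bisection on the binomial tail probabilities; the helper names are my own, and this is just one way to do the calculation.

```python
# Sketch: exact (Clopper-Pearson) binomial interval via bisection.
import math

def binom_cdf(k, n, p):
    # P(X <= k) for X ~ Binomial(n, p), each term computed on the log scale
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return total

def solve_increasing(f, target, lo, hi, iters=60):
    # bisection for f(p) = target, where f is increasing in p
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    eps = 1e-12
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, eps, 1 - eps)
    upper = 1.0 if k == n else solve_increasing(
        lambda p: -binom_cdf(k, n, p), -alpha / 2, eps, 1 - eps)
    return lower, upper

low, high = clopper_pearson(150, 300000)
print(low * 300000, high * 300000)  # bounds expressed as counts of murders
```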

Even that though understates the typical volatility in homicide rates, as it basically assumes the proportion does not change over time. In reality crime statistics are more bursty, and show wilder fluctuations in different places.[3] To show this for many cities, I use the data from the Economist article mentioned earlier, and create a motion chart of the changes in homicide rates over time. The idea behind this chart is a funnel chart. Cities with lower populations will show higher variance, and subsequently those dots on the left hand side of the chart will jump around a lot more. The population figures are current and not varying, so the dots just move up and down on the Y axis.
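To illustrate why the dots for smaller cities jump around more, here is a minimal sketch of funnel bounds around a fixed rate. It uses a simple normal approximation to the binomial, which is my simplification here, and the overall rate of 10 per 100,000 is a hypothetical value. As noted, real homicide data are burstier than this, so these bounds are if anything too narrow.

```python
# Sketch: approximate funnel bounds for a homicide rate by city population.
import math

def funnel_bounds(rate_per_100k, population, z=1.96):
    p = rate_per_100k / 100000
    se = math.sqrt(p * (1 - p) / population)  # binomial standard error
    lower = max(0.0, p - z * se) * 100000
    upper = (p + z * se) * 100000
    return lower, upper

for pop in [100000, 300000, 1000000, 8000000]:
    lo_rate, hi_rate = funnel_bounds(10, pop)
    print(pop, round(lo_rate, 1), round(hi_rate, 1))
```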

For best viewing, make the X axis on the log scale, and size the points according to the population of the city. If you are at a desktop computer, you can open up a bigger version of the chart here.

Selecting individual points and then letting the animation run though illustrates the typical variability of crime over time. Here is the trace of St. Louis over the 36 year period.

New Orleans is another good example, with fluctuations from under 30 to over 90 in the time period.

And here is Chicago, which shows less fluctuation than the smaller cities (as expected) but still has a range of homicide rates around 20 over the time period.

Howard Wainer has previously pointed this relationship out, and called it The Most Dangerous Equation. Basically, if you look you will be able to find some upward crime trends, especially in smaller cities. You need to look at it in the long term though and understand typical fluctuations to make a reasonable decision as to whether crime is increasing or if it is just typical year to year variation. The majority of news articles on the topic are just chock full of post hoc ergo propter hoc for particular cherry picked cities, and they often don’t make sense in explaining crime patterns over the past decade in those particular cities, let alone make sense for different cities experiencing similar conditions but not having rising homicide rates.



  1. For my notes about data sources, generally the data have come from the FBI UCR data tool (for the 1960 through 2014 data). 2015 data have come from the FBI web page for the 2015 UCR report. The 2016 projections come from this Economist article as well as the 50 cities data for the google motion chart.
  2. Calculated in R via (binom.test(150,300000)$conf.int[1:2])*300000. This is the exact Clopper-Pearson confidence interval.
  3. So even though this 538 article does a better job of acknowledging volatility, whatever test they use to determine statistically significant increases is likely to have too many false positives.

Keeping it simple: Viz. mass shooting definitions

My wife asked me the other day about some mass shooting statistics, in particular some claims of an average of one a day in the US. Without knowing the source, I told her outright it is probably because that person widened the net to events beyond what most people stereotypically consider a mass shooting.

Now, I have no personal opinion on how it should be defined, and being a researcher in criminal justice I appreciate people digging into the details. I was prompted to write this post by an interactive application showing how the numbers change by Kevin Schaul of the Washington Post (referred via Flowing Data). I was pretty frustrated by Kevin’s example interactive application though – there are much simpler ways than making me change the definition and seeing what individual events pop up. Here is an example screen shot of inputting a definition and then how Kevin’s data pop out.

So, downloading the same Reddit data for 2015 so far (as of 12/7/15) I created what I consider to be simple summaries. Caveat – these crowdsourced datasets are likely to have substantial missing data, especially towards the events with fewer injured. First I made a frequency histogram of the total number of dead per incident.

So you can see that if you only want to include dead in your personal definition, the one per day statistic is a dramatic over-representation. If you want to draw the line at 5 or more you will have around 9 more events than you would if you made the line at 6 or more. If you make the line at 10 or more there are only two incidents, but there are another 4 if you include incidents with 8 or 9 dead.

Another simple overview is a table. Here are tables of dead, injured, and the combined counts per each incident, sorted in descending value of the count. So the way to read this is that there were 147 separate incidents in the reddit database that had 0 deaths, and 104 that had only one death, etc. The tables also have percents and cumulative percentage, so you can see how where you define the cut-point changes how much of the data you chop-off. Cumulative counts would be just as useful.
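A table like this takes only a few lines to produce. Here is a minimal sketch with made-up counts of dead per incident; I sort by the number of dead from low to high so the cumulative percent shows how much of the data falls at or below each value, which is one choice among several.

```python
# Sketch: frequency table with percent and cumulative percent.
from collections import Counter

def freq_table(values):
    counts = Counter(values)
    total = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / total,  # percent
                     100 * running / total))       # cumulative percent
    return rows

dead_per_incident = [0, 0, 0, 1, 1, 2, 4, 9]  # made-up data
for row in freq_table(dead_per_incident):
    print(row)
```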

I have no personal problem using injured as well in a mass shooting definition. Basically the difference between being shot and being killed is seemingly due to random happenstance, so a shooting with 10 injured and no one killed can easily be argued to be a mass shooting in my opinion. Kevin’s interactive makes you choose an and condition though between injured and killed, whereas one could place the cut point at an or condition or simply the combined total. Here is a cross tabulation of the frequencies of injured by dead.

You can clearly see that the reddit definition is a combined total of injured or dead of 4 or more via the line on the upper left of the table. Kevin’s and condition forces you to make a cut-point along each axis, basically choosing a rectangle in the lower right of the above crosstab table. If you want a combined total though, it will be along a diagonal somewhere in the table.
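To make the three kinds of definitions concrete, here is a minimal sketch that counts incidents under an and condition, an or condition, and a combined total. The incidents are made-up (dead, injured) pairs, not the reddit data.

```python
# Sketch: counting incidents under different mass shooting definitions.
incidents = [(0, 4), (1, 3), (2, 2), (4, 0), (0, 10), (5, 6)]  # (dead, injured), made up

def count_and(data, min_dead, min_injured):
    # rectangle in the crosstab: both cut-points must be met
    return sum(1 for d, i in data if d >= min_dead and i >= min_injured)

def count_or(data, min_dead, min_injured):
    # either cut-point is enough
    return sum(1 for d, i in data if d >= min_dead or i >= min_injured)

def count_combined(data, min_total):
    # diagonal in the crosstab: dead plus injured
    return sum(1 for d, i in data if d + i >= min_total)

print(count_and(incidents, 2, 2))
print(count_or(incidents, 4, 4))
print(count_combined(incidents, 4))
```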

I appreciate these interactive visualizations allow a viewer to dig deeper into specific events in the data, but that does not mean some simple summaries could not also accompany the piece.