# New working paper: Choosing Representatives to Deliver the Message in a Group Violence Intervention

I have a new preprint up on SSRN, Choosing Representatives to Deliver the Message in a Group Violence Intervention. This is what I will be presenting at ACJS next Friday the 24th. Here is the abstract:

Objectives: The group based violence intervention model is predicated on the assumption that individuals who are delivered the deterrence message spread the message to the remaining group members. We focus on the problem of who should be given the initial message to maximize the reach of the message within the group.

Methods: We use social network analysis to create an algorithm to prioritize individuals to deliver the message. Using a sample of twelve gangs in four different cities, we identify the number of members in the dominant set. The edges in the gang networks are defined by being arrested or stopped together in the prior three years. In eight of the gangs we calculate the reach of observed call-ins, and compare these with the sets defined by our algorithm. In four of the gangs we calculate the reach for a strategy that only calls-in members under supervision.

Results: The message only needs to be delivered to around 1/3 of the members to reach 100% of the group. Using simulations we show our algorithm identifies the minimal dominant set in the majority of networks. The observed call-ins were often inefficient, and those under supervision could be prioritized more effectively.

Conclusions: Group based strategies should monitor their potential reach based on who has been given the message. While only calling-in those under supervision can reach a large proportion of the gang, delivering the message to those not under supervision will likely be needed to reach 100% of the group.

And here is an image of the observed reach for one of the gang networks using both call-ins and custom notifications.

The paper has the gang networks available at this link, and uses Python to do the network analysis and SPSS to draw the graphs.

If you are interested in applying this to your work let me know! Not only do I think this is a good idea for focused deterrence initiatives for criminal justice agencies, but I think the idea can be more widely applied to other fields in social sciences, such as public health (needle clean/dirty exchange programs) or organizational studies (finding good leaders in an organization to spread a message).

# Paper on Roadblocks in Buffalo published

My paper with Scott Phillips, A quasi-experimental evaluation using roadblocks and automatic license plate readers to reduce crime in Buffalo, NY, has just been published online first in the Security Journal. Springer gifts me a special link in which you can read the paper. Previously when I have been given links like that from the publisher they have a time limit, but the email for this one said nothing. But even if that goes bad you can always read my pre-print of the article I posted on SSRN.

Title: A quasi-experimental evaluation using roadblocks and automatic license plate readers to reduce crime in Buffalo, NY

Abstract:

This article evaluates the effective of a hot spots policing strategy: using automated license plate readers at roadblocks in Buffalo, NY. Different roadblock locations were chosen by the Buffalo Police Department every day over a two-month period. We use propensity score matching to identify a set of control locations based on prior counts of crime and demographic factors. We find modest reductions in Part 1 violent crimes (10 over all roadblock locations and over the two months) using t tests of mean differences. We find a 20% reduction in traffic accidents using fixed effects negative binomial regression models. Both results are sensitive to the model used though, and the fixed effects models predict increases in crimes due to the intervention. We suggest that the limited intervention at one time may be less effective than focusing on a single location multiple times over an extended period.

And here is Figure 2 from the paper, showing the units of analysis (street midpoints and intersections) and how the treatment locations were assigned.

I’m not much of a macro criminologist, but being asked questions by my dad (about Richard Rosenfeld and the Ferguson effect) and the dentist yesterday (asking about some of Trumps comments about rising crime trends) has prompted me to jump into it and give my opinion. Long story short — many sources I believe are overinterpreting short term fluctuations as more meaningful than they are.

First I will tackle national crime rates. So if you have happened to walk by a TV playing CNN the past few days, you may have heard Donald Trump being criticized for his statements on crime rates. This is partially a conflation with the difference between overall levels of crime versus changes in crime over time. Basically crime is currently low compared to historical patterns, but homicide rates have been rising in the past two years. This is easier to show in a chart than to explain in words. So here is the national estimated homicide rate per 100,000 individuals since 1960.1

2016 is not official and is still an estimate, but basically the pattern is this – crime has been falling generally across the country since the early 1990’s. Crime rates in just the past few years have finally dropped below levels in the 1960’s, but for the past two years homicides have been increasing. So some have pointed to the increase in the past two years and have claimed the sky is falling. To say this they say the rate of change is the largest in past 40 years. There are better charts to show rates of change (a semi-log chart), but the overall look is basically the same.

You have to really squint to see that change from 2014 to 2015 is a larger jump than any of the changes over the entire period, so arguments based on the size of recent changes in the homicide rate are hyperbole (either on a linear scale or a logarithmic scale). And even if you take the recent increases over the past two years as evidence of a more general rising trend, for a broader term pattern we still have homicide rates close to a low point in the past 50 years.

For a bit of general advice — any source that gives you a percent change you always want to see the base numbers and any longer term historical trends. Any media source that cites recent increases in homicides without providing this graph of long term historical crime trends is simply misleading. I’ve seen this done in many places, see this example from the New York Times or this recent note from the Economist. So this isn’t something specific to the President.

Now, macro criminologists don’t really have any better track record explaining these patterns than macro economists have in explaining economic trends. Basically we have a bunch of patch work theories that make sense for parts of the trend, but not the entire time frame. Changes in routine activities in 1960’s, increases in incarceration, the decline of crack use, ease of calling 911 with cell-phones, lead use, abortion (just to name a few). And academics come up with new theories all the time, the most recent being the Ferguson effect — which is simply another term for de-policing.

Now a bit on trends for specific cities. How this ties in with the national trend is that some articles have been pointing out that some cities have seen increases and some have not. That is fine to point out (albeit trivial), but then the articles frequently go on generate stories about why crime is rising in those specific places. Those on the left cite civil unrest and police brutality as possible reasons (Milwaukee, St. Louis, Chicago, Baltimore), while those on the right cite the deleterious effects of police departments not being as proactive (stops in Chicago, arrests in Baltimore).

While any of these explanations may turn out reasonable in the end, I’m pretty sure most of these articles severely underappreciate the volatility in homicide rates. Take an example with St. Louis, with a city population of just over 300,000. A homicide rate of 50 individuals per 100,000 means a total of 150 murders. A homicide rate of 40 per 100,000 means 120 murders. So we are only talking about a change of 30 murders overall. Fluctuations of around 10 in the murder rate would not be unexpected for a city with a population of 300,000 individuals. The confidence interval for a rate of 150 murders per 300,000 individuals is 126 to 176 murders.2

Even that though understates the typical volatility in homicide rates. As basically that assumes the proportion does not change over time. In reality crime statistics are more bursty, and show wilder fluctuations in different places.3 To show this for many cities, I use the data from the Economist article mentioned earlier, and create a motion chart of the changes in homicide rates over time. The idea behind this chart is a funnel chart. Cities with lower populations will show higher variance, and subsequently those dots on the left hand side of the chart will jump around alot more. The population figures are current and not varying, so the dots just move up and down on the Y axis.

For best viewing, make the X axis on the log scale, and size the points according to the population of the city. If you are at a desktop computer, you can open up a bigger version of the chart here.

Selecting individual points and then letting the animation run though illustrates the typical variability of crime over time. Here is the trace of St. Louis over the 36 year period.

New Orleans is another good example, we have fluctuations from under 30 to over 90 in the time period.

And here is Chicago, which shows less fluctuation than the smaller cities (as expected) but still has a range of homicide rates around 20 over the time period.

Howard Wainer has previously pointed this relationship out, and called it The Most Dangerous Equation. Basically, if you look you will be able to find some upward crime trends, especially in smaller cities. You need to look at it in the long term though and understand typical fluctuations to make a reasonable decision as to whether crime is increasing or if it is just typical year to year variation. The majority of news articles on the topic and just chock full of post hoc ergo propter hoc for particular cherry picked cites, and they often don’t make sense in explaining crime patterns over the past decade in those particular cities, let alone make sense for different cities experience similar conditions but not having rising homicide rates.

1. For my notes about data sources, generally the data have come from the FBI UCR data tool (for the 1960 through 2014 data). 2015 data have come from the FBI web page for the 2015 UCR report. The 2016 projections come from this Economist article as well as the 50 cities data for the google motion chart.
2. Calculated in R via (binom.test(150,300000)\$conf.int[1:2])*300000. This is the exact Clopper-Pearson confidence interval.
3. So even though this 538 article does a better job of acknowledging volatility, whatever test they use to determine statistically significant increases is likely to have too many false positives.

# Keeping it simple: Viz. mass shooting definitions

My wife asked me the other day about some mass shooting statistics, in particular some claims of an average of one a day in the US. Without knowing the source, I told her outright it is probably because that person widened the net to events beyond what most people stereotypically consider a mass shooting.

Now, I have no personal opinion on how it should be defined, and being a researcher in criminal justice I appreciate people digging into the details. I was prompted to write this post by an interactive application showing how the numbers change by Kevin Schaul of the Washington Post (referred via Flowing Data). I was pretty frustrated by Kevin’s example interactive application though – there are much simpler ways than making me change the definition and seeing what individual events pop up. Here is an example screen shot of inputting a definition and then how Kevin’s data pop out.

So, downloading the same Reddit data for 2015 so far (as of 12/7/15) I created what I consider to be simple summaries. Caveat – these crowdsourced datasets are likely to have substantial missing data, especially towards the events with fewer injured. First I made a frequency histogram of the total number of dead per incident.

So you can see that if you only want to include dead in your personal definition, the one per day statistic is a dramatic over-representation. If you want to draw the line at 5 or more you will have around 9 more events than you would if you made the line at 6 or more. If you make the line at 10 or more there are only two incidents, but there are another 4 if you include incidents with 8 or 9 dead.

Another simple overview is a table. Here are tables of dead, injured, and the combined counts per each incident, sorted in descending value of the count. So the way to read this is that there there 147 seperate incidents in the reddit database that had 0 deaths, and 104 that had only one death, etc. The tables also have percents and cumulative percentage, so you can see how where you define the cut-point changes how much of the data you chop-off. Cumulative counts would be just as useful.

I have no personal problem using injured as well in a mass shooting definition. Basically the difference between being shot and being killed is seemingly due to random happenstance, so a shooting with 10 injured and no one killed can easily be argued to be a mass shooting in my opinion. Kevin’s interactive makes you choose an and condition though between injured and killed, whereas one could place the cut point at an or condition or simply the combined total. Here is a cross tabulation of the frequencies of injured by dead.

You can clearly see the reddit definition is the combined total of injured or dead is 4 via the line on the upper left of the table. Kevin’s and condition forces you to make a cut-point along each axis, basically choosing a rectangle in the lower right of the above crosstab table. If you want a combined total though, it will be along a diagonal somewhere in the table.

I appreciate these interactive visualizations allow a viewer to dig deeper into specific events in the data, but that does not mean some simple summaries could not also accompany the piece.

As a researcher in criminal justice, tackling ethical questions is a difficult task. There are no hypotheses to test, nor models to fit, just opinions bantering around. I figured I would take my best shot and writing some coherent thoughts on the topic of the data police collect and its impacts on personal privacy – and my blog is really the best outlet.

What prompted this is a recent Nick Selby post which suggested the use of license plate readers (LPRs) to target Johns in LA is one of the worst ideas ever and a good example of personal privacy invasion by law enforcement. (Also see this Washington Post opinion article.)

I have a bit of a different and more neutral take on the program, and will try to articulate some broader themes in personal privacy invasion and the collection/use of data by police. I think it is an important topic and will continue to be with the continual expansion of public sensor data being collected by the police (with body worn cameras, stationary cameras, cell phone data, GPS traces being some examples). Basically, much of the negative sentiment I’ve seen so far of this hypothetical intervention are for reasons that don’t have to do with privacy. I’ll articulate these points by presenting alternative, currently in use police programs that use similar means, but have different ends.

To describe the LA program in a nutshell, the police use what are called license plate readers to identify particular vehicles being driven in known prostitution areas. LPRs are just cameras that take a snapshot of a license plate, automatically code the alpha-numeric plate, and then place that [date-time-location-plate-car image] in a database. Linking up this data with registered vehicles, in LA the idea is to have the owner of the vehicle sent a letter in the mail. The letter itself won’t have any legal consequences, just a note that says the police know you have been spotted. The idea in theory is that you will think you are more likely to be caught in the future, and may have some public shaming also if your family happens to see the letter, so you will be less likely to solicit a prostitute in the future.

To start with, some of the critiques of the program focus on the possibilities of false positives. Probably no reasonable person would think this is a worthwhile idea if the false positive rate is anything but small – people will be angry with being falsely accused, there are negative externalities in terms of family relationships, and any potential crime reduction upside would be so small that it is not worthwhile. But, I don’t think that itself is damning to this idea – I think you could build a reasonable algorithm to limit false positives. Say the car is spotted multiple times at a very specific location, and specific times, and the home owners address is not nearby the location. It would be harder to limit false positives in areas where people conduct other legitimate business, but I think it has potential with just LPR data, and would likely improve by adding in other information from police records.

If you have other video footage, like from a stationary camera, I think limiting false positives can definitely be done by incorporating things like loitering behavior and seeing the driver interact with an individual on the street. Eric Piza has done similar work on human coding/monitoring video footage in Newark to identify drug transactions, and I have had conversations with an IBM Smart City rep. and computer scientists about automatically coding audio and video to identify particular behaviors that are just as complicated. False negatives may still be high, but I would be pretty confident you could create a pretty low false positive rate for identifying Johns.

As a researcher, we often limit our inquiries to just evaluating 1) whether the program works (e.g. reduces crime) and 2) if it works whether it is cost-effective. LPR’s and custom notifications are an interesting case compared to say video cameras because they are so cheap. Camera’s and the necessary data storage infrastructure are so expensive that, to be frank, are unlikely to be a cost-effective return on investment in any short term time frame even given the best case scenario crime reductions (ditto for police body worn cameras). LPR’s and mailing letters on the other hand are cheap (both in terms of physical capital and human labor), so even small benefits could be cost-effective.

So in short, I don’t think the idea should be dismissed outright because of false positives, and the idea of using public video/sensor footage to proactively identify criminal behavior could be expanded to other areas. I’m not saying this particular intervention would work, but I think it has better potential than some programs police departments are currently spending way more money on.

Assuming you could limit the false positives, the next question then is it ok for the police to intrude on the privacy of individuals who have not committed any particular crime? The answer to this I don’t know, but there are other examples of police sending letters that are similar in nature but haven’t generated much critique. One is the use of letters to trick offenders with active warrants to turning themselves in. Another more similar example though are custom notifications. These are very similar in that often the individuals aren’t identified because of specific criminal charges, but are identified using data analytics and human intelligence to place them as high risk and gang involved offenders. Intrusion to privacy is way higher for these custom notifications than the suggested Dear John letters, but individuals did much more to precipitate police action as well.

When the police stop you in the car or on the street the police are using discretion to intrude in your privacy under circumstances where you have not necessarily committed a crime. Is there any reason a cop has to take that action in person versus seeing it on a video? Automatic citations at red light cameras are similar in mechanics to what this program is suggesting.

The note about negative externalities to legitimate businesses in the areas and the cost of letters I consider hyperbole. Letters are cheap, and actual crime data is frequently available that could already be used to redline neighborhoods. But Nick’s critique of the information being collated by outside agencies and used in other actuarial aspects, such as loans and employment decisions, I think is legitimate. I have no good answers to this problem – I have mixed feelings as I think open data is important (which ironically I can’t quantify in any meaningful way), and I think perpetual online criminal histories are a problem as well. Should we not have public crime maps though because businesses are less likely to invest in high crime neighborhoods? I think doing a criminal background check for many businesses is a legitimate query as well.

I have mixed feelings about familial shaming being an explicit goal of the letters, but compared to an arrest the letter is mundane. It is even less severe than a citation (which given some state laws you could be given a citation for loitering in a high prostitution area). Is a program that intentionally tries to shame a person – which I agree could have incredible family repercussions – a legitimate goal of the criminal justice system? Fair question, but in terms of privacy issues though I think it is a red herring – you can swap out different letters that would not have those repercussions but still uses the same means.

What if instead of the "my eyes are on you" letter the police simply sent a PSA like post-card that talked about the blight of sex workers? Can police never send out letters? How about if police send out letters to people who have previous victimizations about ways to prevent future victimization? I have a feeling much of the initial negative reactions to the Dear John program are because of the false positive aspect and the "victimless" nature of the crime. The ethical collection and use of data is a bit more subtle though.

LPR data was initially intended to passively identify stolen cars, but it is pretty ripe for mission creep. One example is that the police could use LPR data to actively track a cars location without a warrant. It is easy to think of both good and other bad examples of its use. For good examples, retrospectively identifying a car at the scene of a crime I think is reasonable, or to notify the police of a vehicle associated with a kidnapping.

For another example use of LPR data, what if the police did not send custom notifications, but used such LPR data to create a John list of vehicles, and then used that as information to profile the cars? If we think using LPR data to identify stolen cars is a legitimate use should we ignore the data we have for other uses? Does the potential abuse of the data outweigh the benefits – so LPR collection shouldn’t be allowed at all?

For equivalent practices, most police departments have chronic offender or gang lists that use criminal history, victimizations, where you have been stopped and who you have been stopped with to create similar databases. This is all from data the police routinely collect. The LPR data can be reasonably questioned whether it is available for such analytics use – police RMS data is often available in large swaths to the general public though.

Although you can question whether police should be allowed to collect LPR data, I am going to assume LPR data is not going to go away, and cameras definitely are not. So how do you regulate the use of such data within police departments? In New York, when you conduct an online criminal history check you have to submit a reason for doing the check. That is a police officer or a crime analyst can’t do a check of your next door neighbor because you are curious – you are supposed to have a more relevant reason related to some criminal investigation. You could have a similar set up with LPR that prevents actively monitoring a car except in particular circumstances and to purge the data after a particular time frame. It would be up to the state though to enact legislation and monitor its use. There is currently some regulation of gang databases, such as sending notifications to individuals if they are on the list and when to take people off the list.

Similar questions can be extended beyond public cameras though to other domains, such as DNA collection and cell phone data. Cell phone data is regularly collected with warrants currently. DNA searching is going beyond the individual to familial searches (imagine getting a DUI, and then the police use your DNA to tell that a close family member committed a rape).

Going forward, to frame the discussion of police behavior in terms of privacy issues, I would ask two specific questions:

• Should the police be allowed to collect this data?
• Assuming the police have said data, what are reasonable uses of that data?

I think the first question, should the police be allowed to collect this data, should be intertwined with how well does the program work and how cost-effective is the program (or potential if the program has not been implemented yet). There are no bright lines, but there will always be a trade off between personal privacy and public intrusion. Higher personal intrusion would demand a higher level of potential benefits in terms of safety. Given that LPR’s are passively collecting data I consider it an open question whether they meet a threshold of whether it is reasonable for the police to collect such data.

Some data police now collect, such as public video and DNA, I don’t see going away whether or not they meet a reasonable trade-off. In those cases I think it is better to ask what are reasonable uses of that data and how to prevent abuses of it. Basically any police technology can be given extreme examples where it saved a life or where a rogue agent used it in a nefarious way. Neither extreme case should be the only information individuals use to evaluate whether such data collection and use is ethical though.

# How wide to make the net in actuarial tools? (false positives versus false negatives)

An interesting debate/question came up in my work recently. I conducted an analysis of a violence risk assessment tool for a police department. Currently the PD takes around the top 1,000 scores of this tool, and then uses further intelligence and clinical judgements to place a small number of people on a chronic offender list (who are then subject to further interventions). My assessment of the predictive validity when examining ROC curves suggested the tool does a pretty good job discriminating violent people up to around the top 6,000 individuals and after that flattens out. In a sample of over 200,000, the top 1000 scores correctly classified 30 of the 100 violent cases, and the top 6000 classified 60.

So the question came up should we recommend that the analysts widen the net to the top 6,000 scores, instead of only examining the top 1,000 scores? There are of course costs and limitations of what the analysts can do. It may simply be infeasible for the analysts to review 6,000 people. But how do you set the limit? Should the clinical assessments be focused on even fewer individuals than 1,000?

We can make some estimates of where the line should be drawn by setting weights for the cost of a false positive versus a false negative. Implicit in the whole exercise of predicting violence in a small set of people is that false negatives (failing to predict someone will be violent when they are) greatly outweigh a false positive (predicting someone will be violent but they are not). The nature of the task dictates that you will always need to have quite a few false positives to classify even a few true positives, and no matter what you do there will only be a small number of false negatives.

Abstractly, you can place a value on the cost of failing to predict violence, and a cost on the analysts time to evaluate cases. In this situation we want to know whether the costs of widening the net to 6,000 individuals are less than the costs of only examining the top 1,000 individuals. Here I will show we don’t even need to know what the exact cost of a false positive or a false negative is, only the relative costs, to make an estimate about whether the net should be cast wider.

The set up is that if we only take the top 1,000 scores, it will capture 30 out of the 100 violent cases. So there will be (100 – 30) false negatives, and (1000 – 30) false positives. If we increase the scores to evaluate the top 6,000, it will capture 60 out the 100 violent cases, but then we will have (6000 – 60) false positives. I can not assign a specific number to the cost of a false negative and a false positive. So we can write these cost equations as:

1) (100 - 30)*FN + (1000 - 30)*FP = Cost Low
2) (100 - 60)*FN + (6000 - 60)*FP = Cost High

Even though we do not know the exact cost of a false negative, we can talk about relative costs, e.g. 1 false negative = 1000*false positives. There are too many unknowns here, so I am going to set FP = 1. This makes the numbers relative, not absolute. So with this constraint the reduced equations can be written as:

1) 70*FN +  970 = Cost Low
2) 40*FN + 5940 = Cost High

So we want to know the ratio at which there is a net benefit over including the top 6,000 scores versus only the top 1,000. So this means that Cost High < Cost Low. To figure out this point, we can subtract equation 2 from equation 1:

3) (70 - 40)*FN - 4970 = Cost Low - Cost High

If we set this equation to zero and solve for FN we can find the point where these two equations are equal:

30*FN - 4970 = 0
30*FN = 4970
FN = 4970/30 = 165 + 2/3

If the value of a false negative is more than 166 times the value of a false positive, Cost Low - Cost High will be positive, and so the false negatives are more costly to society relative to the analysts time spent. It is still hard to make guesses as to whether the cost of violence to society is 166 times more costly than the analysts time, but that is at least one number to wrap your head around. In a more concrete example, such as granting parole or continuing to be incarcerated, given how expensive prison is net widening (with these example numbers) would probably not be worth it. But here it is a bit more fuzzy especially because the analysts time is relatively inexpensive. (You also have to guess how well you can intervene, in the prison example incarceration essentially reduces the probability of committing violence to zero, whereas police interventions can not hope to be that successful.)

As long as you assume that the classification rate is linear within this range of scores, the same argument holds for net widening any number. But in reality there are diminishing returns the more scores you examine (and 6,000 is basically where the returns are near zero). If you conduct the same exercise between classifying zero and the top 1,000, the rate of the cost of a false negative to a false positive needs be 32+1/3 to justify evaluating the top 1,000 scores. If you actually had an estimate of the ratio of the cost of false positives to false negatives you could then figure out exactly how wide to make the net. But if you think the ratio is well above 166, you have plenty of reason to widen the net to the larger value.

# New paper: Replicating Group-Based Trajectory Models of Crime at Micro-Places in Albany, NY

I posted a pre-print of a paper myself, Rob Worden and Sarah McLean have finished, Replicating Group-Based Trajectory Models of Crime at Micro-Places in Albany, NY. This is part of the work of the Finn Institute in collaboration with the Albany police department, and the goal of the project was to identify micro places (street segments and intersections) that showed long term patterns of being high crime places.

The structured abstract is below:

Objectives: Replicate two previous studies of temporal crime trends at the street block level. We replicate the general approach of group-based trajectory modelling of crimes at micro-places originally taken by Weisburd, Bushway, Lum and Yan (2004) and replicated by Curman, Andresen, and Brantingham (2014). We examine patterns in a city of a different character (Albany, NY) than those previously examined (Seattle and Vancouver) and so contribute to the generalizability of previous findings.

Methods: Crimes between 2000 through 2013 were used to identify different trajectory groups at street segments and intersections. Zero-inflated Poisson regression models are used to identify the trajectories. Pin maps, Ripley’s K and neighbor transition matrices are used to show the spatial patterning of the trajectory groups.

Results: The trajectory solution with eight classes is selected based on several model selection criteria. The trajectory of each those groups follow the overall citywide decline, and are only separated by the mean level of crime. Spatial analysis shows that higher crime trajectory groups are more likely to be nearby one another, potentially suggesting a diffusion process.

Conclusions: Our work adds additional support to that of others who have found tight coupling of crime at micro-places. We find that the clustering of trajectories identified a set of street units that disproportionately contributed to the total level of crime citywide in Albany, consistent with previous research. However, the temporal trends over time in Albany differed from those exhibited in previous work in Seattle but were consistent with patterns in Vancouver.

And here is one of the figures, a drawing of the individual trajectory groupings over the 14 year period. As always, if you have any comments on the paper feel free to shoot me an email.

# Big data problems for Criminal Justice

I am on the job market this year, and I have noticed a few academic jobs focused on big data (see this Penn State posting for one example). Because example data sets in criminal justice are not typical fodder for big data conversations, I figured I would talk abit about my experiences and illustrate the need for the types of skills needed to manipulate and analyze these big datasets.

As opposed to trying to further define the big data buzzword, I will simply talk about the actual size of data I have dealt with. Depending on the definition used, most large criminal justice datasets may be called medium sized data. That is you can load it in a database or statistical program (particularly those that do not load everything into RAM, like SPSS and SAS) and calculate different summary statistics and fit simple models. Were not talking about datasets that need custom big data solutions like Hadoop. The biggest single table I’ve personally worked with is a set of 25 million arrest histories (with around 150 variables). Using SPSS server to sort this dataset took less than a minute, using my local machine it took about 10 minutes. Nothing much to complain about there, and it is where the statistical programs that don’t load everything into memory shine.

To talk specifics, the police agency where I was an analyst at (Troy, NY) is a fairly small city with a population of around 50,000 people. They generated around 60,000 calls for service per year (this includes anytime someone calls 911, or police initiated interactions like a traffic stop). Every single one of these incidents generates a one to many relationship for multiple tables, and here is a sampling of those relationships; multiple free text description of the event and follow up investigations, people involved in the incident, offences committed, property stolen or damaged, persons arrested, property recovered or confiscated, drug and weapon contraband, vehicles involved, etc. Over the time period of 04-13 the incident narratives themselves are around 1 gigabyte, and the number of unique individuals and institutions in the "names" table was around 100,000. None of these tables alone would be considered big data, but when taking multiple years and having to conduct multiple table merges it turns into complicated medium size data pretty quickly.

I’m sure I’m not alone here working with police departments. In the past month I’ve had conversations with two individuals about corrections datasets that result in millions of records. Criminal justice organizations have been collecting data for along time, and given say 50,000 records per year it only takes 10 years to turn that into 500,000. When considering larger agencies (like statewide corrections or courts) the per year becomes even larger.

Most of the time summary statistics and fairly simple regression models are all researchers and analysts are interested in in criminal justice. The field is not heavily devoted to prediction, and certainly not to fitting complicated machine learning models. Many regression tasks can be estimated with data as large as 25 million records (given that the number of predictor variables tends to be small) and even if it didn’t sampling (or reducing the data to unique observations and weighting) is an obvious option. So for these types of simple needs just learning effective practices at manipulating datasets — such as SQL and best practices for conducting data manipulations in statistical packages is most of the education one needs. But these are still definitely needs that are not met in any social science curricula that I am aware. By fire is my only experience.

Two particular areas that turn little data into big data are spatial and network analysis, as one not only needs to consider the number of nodes but also the number of edges (or potential edges) in the system to calculate various measures. For example, in my dissertation I needed to conduct spatial lags of several variables (and this is needed in calculating measures such as Moran’s I). In matrix notation this typically involves calculating Wx, where W is an n by n spatial weights matrix. In my dissertation, n was 21,506, so not a large dataset, but W is then a 21,506^2 matrix. It can be held in memory, but good luck trying to calculate anything with it. Most of the spatial econometrics literature discusses how calculating W^-1 is problematic, let alone the simpler operation of Wx. So to do those calculations I needed to create custom code. I hope to be able to write a blog post on how it can be done at some point – but these blog posts aren’t earning me any brownie points to getting a job (let alone getting tenure in the future).

The other area that I believe needs to be developed in the social science related to medium data problems are custom visualization solutions. Data in social science typically has lots of noise to signal, and adding in 100,000 observations rarely makes things clearer. This is why I think visualization within the social sciences has potential to expand, as the majority of historical discussions are not extensible to our particular use applications in the social sciences.

So I’m excited by academia recognizing that big data is a problem and takes custom solutions in the social sciences. An environment where I can be reworded for taking on those big data tasks and partly focus on publishing software, as opposed to solely publish or perish, would help develop the field and have a more lasting impact on practical applications than journal articles. At least a place that acknowledges the need to develop curricula related to these data management tasks would be a good start. But I’m not sure I like the types of applications currently being pitched in the social sciences as big data problems, particularly the trivial applications of examining social networks like facebook or twitter, nor emphasis on big data tools like Hadoop that I don’t think are applicable to the social scientists toolset. But I’m certainly biased to think that applications in criminal justice have more practical implications than alot of contemporary social science research.

# Quantifying racial bias in peremptory challenges

A question came up recently on cross validated about putting some numbers on the amount of bias in jury selection. I had a previous question of a similar nature, so it had been on my mind previously. The original poster did not say this was specifically for a Batson challenge, but that is simply my presumption. It is both amazing and maddening that given the same question four different potential analyses were suggested. Although it is a bit out of the norm for what I talk about, I figured it would be worth a post.

# Some background on Batson

For some background, Batson challenges are specifically in the context of selecting jurors for a trial. (Everything that follows is specific to what I know about law in the US.) To select a jury first the court selects potential jurors for the venire from the general public. Then both the prosecution and defence counsels have the opportunity to question individuals in the venire. A typical flow seems to be that a panel of the venire is selected (say 10), then the court has a set of standardized questions they ask every individual potential juror. This part is referred to as voir dire. If the individual states they can not be impartial, or there is some other characteristic that indicates they cannot be impartial that potential juror can be eliminated for cause. Without intervention of counsel the standard questions by the court typically weeds out any obvious cases. After the standard questions both counsels have the opportunity to ask their own questions and further identify challenges for cause. There are no limits on who can be eliminated for cause.

The wrinkle specific to Batson though is that each counsel is given a fixed number of peremptory challenges. The number is dictated by the severity of the case (in more serious cases each side has a higher number of challenges). The wikipedia page says in some circumstances the defense gets more than the prosecution, but the total number is always fixed in advance.

The logic behind peremptory challenges is that either counsel can use personal discretion to eliminate potential jurors without needing a justification. Basically it is a fail-safe of the court to allow gut feelings of either counsel to eliminate jurors they believe will be partial to the opposing side. But based on the equal protection clause it was decided in Batson vs. Kentucky that one can not use the challenges solely based on race. As a side effect of allowing so many peremptory challenges, one can easily eliminate a particular minority group, as being a minority group they will only have a few representatives in the venire.

During the voir dire if the opposing counsel believes the opposition is using the peremptory challenges in a racially discriminatory manner, they can object with a Batson challenge. The supreme court decided on three steps to evaluate the challenge.

1. The party that objected has the burden to prove a prima facie case that the challenges were used in a discriminatory manner. This includes an argument that the group is discriminated against is cognizable, and that there is additional numerical evidence of discrimination.
2. Then the burden shifts on the party being challenged to justify the use of the peremptory challenges based on race neutral reasons.
3. The burden then shifts back to the original challenging party. This is to dispute whether the reasons proferred for the use of the peremptory challenges are purely pretextual.

Witnessing the proceedings for this particular case in the New York court of appeals case is what prompted my interest, and I recommend reading their decision as a good general background on Batson challenges (the wikipedia page is lacking quite a bit). What follows is some number crunching specific to the first part, establishing a prima facie case of discrimination.

# Now some numbers

Batson challenges are made in situ during voir dire. All the cases I am familiar with simply use fractions to establish that the peremptory challenges are being used in a discriminatory manner. The fact that the numbers are changing during voir dire makes the calculations of statistics more difficult. But I will address the ex post facto assessment of the first step given the final counts of the number of peremptory challenges and the total number on the venire with their racial distribution. This presumption I will later discuss how it might impact on the findings in a more realistic setting.

Consider the case of People vs. Hecker (linked to above). It happened that the peremptory challenges by the defence to exclude two Asian’s from the jury panel is what prompted the Batson challenge. Later on one other Asian juror was seated to the jury. The appeals court considered in this case whether step 1 was justified, so it is not a totally academic question to attempt to quantify the chances of two out of three Asian’s being challenged.

First, I will specify how we might put a number of this chance occurrence. If a person randomly selected 13 names out of a hat with 39 people, and of those names 3 individuals were Asian, what is the probability that 2 of those selected would be Asian? This probability is dictated by the hypergeometric distribution. More generally, the set up is:

• n equals the total number of eligible cases that are subject to be challenged
• p equals the total number of the race in question that are subject to be challenged
• k equals the total number of peremptory challenges used on the racial group in question
• d equals the total number of peremptory challenges

And the hypergeometric distribution is calculated using binomial coefficients as:

$\frac{{p \choose k} {n-p \choose d-k} }{{n \choose d}}$

So plugging the numbers listed above into the formula, we get the probability of two out of three Asian jurors being challenged if the challenges were made randomly would be equal to below according to Wolfram Alpha:

$\frac{{p \choose k} {n-p \choose d-k} }{{n \choose d}} = \frac{{3 \choose 2} {39-3 \choose 13-2} }{{39 \choose 13}} = \frac{156}{703} \approx 0.22$

So the probability of a chance occurrence given this particular set of circumstances is 22%, not terribly small, although may be sufficient given other circumstances to justify the first step (it seems the first step is intended that the burden is rather light). Where did my numbers come from though exactly? Given the circumstances of the case in the appeal decision the only number that would be uncontroversial would be p = 2, that two Asian jurors were excluded based on peremptory challenges. p = 3 comes from after the case, in which one other Asian juror was actually seated. It appears in the appeals case I linked to they consider the jury composition after the Batson challenge was initially brought, but obviously the initial trial judge can’t use that future information. I choose to use d = 13 because the defence only used 13 of their 15 peremptory challenges by the end of the seating. The total number n = 39 is the most difficult to come by.

The appeal case states that in the first pool 18 jurors were brought for questioning, 5 were eliminated for cause, and that both the defence and prosecution used 5 peremptory challenges. I chose to count the total number as 13 for this round, 18 minus the 5 eliminated for cause. In our pulling names out of the hat experiment though you may consider the number to be only 8, so not count the cases the prosecution used their peremptory challenges on. The second round brought another 18 potential jurors, of which 4 were eliminated for challenge. In this round the two Asian jurors were challenged by the defence when the judge asked to evaluate the first 9 of this panel (given the language it appears defence used 3 peremptory challenges during this evaluation of the first 9). At the end of the second round both parties used another 5 peremptory challenges. So to get the the total number of 39, I use 13 for the first round plus 14 for the second, although I could reasonably use 8 for the first and 9 for the second. I end up at 39 by using 13 + 14 + 12 – the last Asian juror (Kazuko) was the the 13th to be questioned on the third panel. One prior juror had been challenged for cause, so I count this as 12 towards the total of 39. There were a total of 26 panellists in this round, and the defence used a total of another 3 peremptory challenges, and there is no other information on whether the prosecution used any more peremptory challenges. (I’m unsure the total number of jurors seated for the case, so I can’t make many other guesses – the total number of jurors selected in the prior two rounds were 7). I don’t worry about the selection of the alternate juror for this analysis.

So lets try to apply this same analysis at the exact time the Batson challenge was raised, when the second Asian juror was eliminated. I prefer to make the calculations at the time the challenge was raised, as the opposing counsel may later alter their behavior in light of a prior Batson challenge. In doing this, now we have p = 2 and k = 2 (there was no other mention of any Asian’s being challenged for cause). We have a bit of uncertainty about d and n though. d at a minimum for the defence has to be 7 at this point (five in the first round plus the two Asians in the second round), but could be as many as 10 (5 in the first and 5 in the second). n could be as mentioned before 8 + 9 = 17 (excluding cases the prosecution challenged) or 13 + 14 = 27 (including all cases not challenged for cause). In the middle you may consider n to be 13 + 9 = 22, the exact point when the defence was asked to bring forth challenges for the first 9 seated on that round of the panel. So what difference does this make on the estimated probabilities? Well lets just graph the estimates for all values of d between 7 and 10, and n between 17 and 27. Here the lines are for different values of d (with labels at the beginning left part of the line), and the x axis is the different values of n. We can see the probabilities follow the pattern that as n increases and d decreases the probability of that combination goes down. Even over all these values the probability never goes below 5%. I do not know if a 5% probability is sufficient for the numerical justification of the prima facie case of discrimination. 22% seems too low a threshold to me (by chance about 1 in 5 times) but 5% may be good enough (by chance 1 in 20 times).

So lets try this same sensitivity analysis for the entire case. For the evaluation of everything after the fact I think p = 3 and k = 2 are largely uncontroversial, but lets vary d between 7 and 15 and n between 23 (chosen to be at the lower end of cases that were legitimately evaluated in the pool by the end I believe) to 45. Some of these situations are not commensurate with the limited information we have (e.g. d = 7 and n = 45 is not possible) but I think the graph will be informative anyway. My labelling could use more work, but the line on top is d = 15 and they increment until the line on bottom where d = 7.

So here we can see that the probability after the fact never gets much below 10%, and that is for lower values of d and higher values of n that are not likely possible given the data. So basically in the case that looks the worst for the defence here the probability of selecting two out of three Asian’s by random (giving varying numbers of peremptory challenges and varying the pool from which to draw them) is never below 10%. Not a terribly strong case that the numerical portion of step 1 has been satisfied. Basically, no matter what reasonable values you put in for d or n in this circumstance the probability of choosing 2 out of three Asian jurors randomly is not going to be much below 10%.

# On some of the other suggested analyses

On the original question on CV I mentioned previously, besides my own suggested analysis here there were 3 other suggested analysis:

• Using a regression model to predict the probability of a racial group being challenged
• Analysis of Contingency tables
• Calculating all potential permutations, and then counting the percentage of those permutations that meet some criteria.

All three I do not think are completely unreasonable, but I prefer the approach I listed above. I will attempt to articulate those reasons.

So first I will talk about the regression approach. This is generically a model of the form predicting the probability of a peremptory challenge based on the race of the potential juror:

$\text{Prob}(\text{Challenge}) = f(\beta_0 + \beta_1(\text{Racial Group}_i))$

The anonymous function is typically a logit (for a logistic model) or a probit function. This generalized linear model is then estimated via maximum likelihood, and one formulates hypothesis tests for the B_1 coefficient. The easy critique of this is that the test is not likely to be very powerful with the small samples – as the estimates are based on maximum likelihood and are only guaranteed to unbiased asymptotically. This could be a fairly simple exercise to attempt to see the behavior of this bias in the small samples, but I suspect it reduces the power of the test greatly. Also note that in the case where the subgroup of interest is always challenged, such as in the two out of two Asian’s in the Hecker case mentioned, the equation is not identified due to perfect separation. There are alternative ways to estimate the equation in the case of perfect separation, but this does not mitigate the small sample problem.

More generally, my original formulation of the data generating mechanism being the hypergeometric distribution, drawing names out of hat, is quite different than this. This is a model of the probability of anyone being peremptory challenged. One then estimates the model to see if the probability is increased among the racial group of interest. This is arguably not the question of interest. For instance, say the model estimated the probability of an Asian being challenged to be only 6%, and the probability of anyone else to be 4%. In one sense, this establishes the prima facie case of discrimination of Asian’s compared to everyone else, but does only a probability of 6% of using a peremptory challenge warrant a Batson challenge? I don’t think so. If you think that a challenge will never come with such low probabilities, you are right in that the expected probabilities for the racial group of question will not be that low when a Batson challenge is made, but once you consider the uncertainty in the estimates (e.g. 95% confidence intervals) they could easily be that low. On the flip side if the racial group is struck 96% of the time, but everyone else is struck 94% of the time, does that establish the numerical evidence of discrimination? I’m not sure, it may if this prevents any of the particular racial group being seated.

The analysis of contingency tables, in particular Fisher’s Exact Test, is exactly the same as my hypergeometric approach if one only considers the racial group of interest against all other parties. Fisher’s exact test is a reasonable approach over the more typical chi-square because, 1) the cells will be quite small, and 2) this is one of the unusual cases where the marginals are fixed. So making a 2 by 2 contingency table based on the very first example I gave (which resulted in a probability of around 22%) would be a table:

Challenge No Challenge Total
Asian 2 1 3
Other 11 25 36
Total 13 26 39

For the formula the 2 by 2 table is referred to as:

Challenge No Challenge Total
Asian a b a + b
Other c d c + d
Total a + c b + d n

Which Fisher’s Exact Test can be formulated by the binomial coefficients:

$\frac{{a + b \choose a} {c+d \choose c} }{{n \choose a+c}} = \frac{{2 + 1 \choose 2} {11+25 \choose 11} }{{39 \choose 13}}$

Which if you look closely is exactly the same set of binomial coefficients for the hypergeometric test I listed previously. So, as long as one only tests the one racial group against all others Fisher’s Exact test of a 2 by 2 contingency table is exactly the same as my recommendation. I don’t particularly think the historical p-value <= 0.05 standard is necessary, but it is the same information.

What bothers me more about this approach is when people start adding other cells in the contingency table. In the first step you need to establish a pattern of discrimination against one particular cognizable group. The treatment of other groups is non sequitur to this question in the first step. (In People vs Black evaluations of how unemployment was treated for non-black jurors was considered is steps 2 and 3, but not in the first step.) Including other groups into the table though will change the outcome. Such ad hoc decisions on what racial groups to consider should not have any effect on the evidence of discrimination against the specific racial group of interest. This problem of what groups is similarly applicable to the regression approach mentioned above. Hypothesis tests of the coefficients will be dependent on what particular contrasts you wish to draw and will change the estimates if certain groups are specified in the equation.

The final approach, counting up particular permutations that meet a particular threshold is intuitive, but again has an ad hoc element, the same problem with choosing which racial groups will impact the test statistic. All of the approaches (including my own) need to be explicit about the groups being tested beforehand. Only monitoring one group as in the hypergeometric test I presented earlier is much simpler to justify ex post facto, but it still would be best to establish the cognizable groups before voir dire takes place. For the tests that use other racial or ethnic groups in the calculations are much more suspect to justification, as the picking and choosing of the other groups will impact the calculations.

# Some recommendations

A typical question I get asked as an academic is, So what would you recommend to improve the situation? Totally reasonable question that I often don’t have a good answer to. It is easy to throw out recommendations without considering the entirety of the situation, and the complexities of the criminal justice system are no exception. With full awareness that no one with any authority will likely read my recommendations, my suggestions follow none-the-less.

The first is, only slightly in jest, is to only allow 1 peremptory challenge. There is no bright line rule on numerical evidence presented that is necessary to establish discrimination, but the NY State court of appeals case I mentioned did indicate that it takes more than 1 challenge to establish a pattern of discrimination. This may seem extreme, but the logic applies to the same to allowing fewer peremptory challenges. The fewer the challenges, the less capability either counsel has to entirely eliminate a particular racial group from the jury. It simultaneously makes counsels use of the challenges more precious, so they should be more hesitant to use them based on gut feelings predicated solely by racial stereotyping.

As another side effect (good or bad depending on how you look at it) it also makes the evaluation of whether one is using the challenges in a racially discriminatory manner much clearer. As I shown above, when d decreased the probability of the outcome generally decreased. For example with the Hecker case, pretend there were only a total of 5 peremptory challenges. So in this hypothetical situation have n = 20, d = 5, and p and k = 2 (if the number of n was much higher with the number of peremptory challenges limited to 5 for each side the jury would have to be close to set already by the time 20 individuals were questioned). The probability of this is 5%, whereas if d = 7 the probability is 11%. To be very generic, having fewer challenges makes particular racial patterns less likely by chance.

I understand the motivation for peremptory challenges, but it is unclear to me why such a large number are currently afforded for most cases. Also to the extent that timeliness is a priority for the court, allowing fewer challenges would certainly decrease the necessary time needed for voir dire. (Which was a concern in the Hecker case, as the court only allowed a very short time for questioning the panels.)

The second is that case-law should be established for cognizable groups, particularly given the racial make up of the defendent(s) and victim(s). Or, conversely, counsels should be required a priori to voir dire to establish the cognizable groups. This avoids cherry picking any group for a Batson challenge, as one could always specify a group based on the ex post facto characteristics of the groups used for peremptory challenges. To put a probability to whether a certain number of a particular group could be chosen at at random it is necessary to supply the hypothesis before looking at data. Ad hoc selections of a group could always occur, and with some of the other statistical tests ad hoc inclusion of cells in a contingency table or to include in a test statistic could impact the analysis. Making such a case a priori should prevent any nefarious manipulation of the numbers after the fact.

Neither of these appear to be too onerous to me to be reasonable suggestions. I doubt any lawyer or judge is going to be typing binomial coefficients into Wolfram Alpha during voir dire anytime soon though. Maybe I should make a look up table or nomogram for hypergeometric probabilities for typical values that would come up during voir dire. It would be pretty easy for a lawyer to keep a tally and then do a look up, or keep in mind before hand at what point a set of challenges is unlikely due to chance. With many peremptory challenges and few uses of a particular group, I suspect the probabilities of that happening by chance are much larger than people expect. The two out of two Asian’s in the Hecker case is a good example where the numerical evidence of discrimination is very weak no matter how you plug in the numbers.

# Informational Asymmetries in my role as Crime Analyst

One aspect I’ve come to realize in my job as crime analyst, and really in any technical job I’ve had, is that I face large informational asymmetries between myself and my employers (and colleagues). What exactly do I mean? Well, I consider a prime example of informational asymmetry when I have a large body of knowledge about some particular topic or task I need to conduct, and the person asking for the task has relatively little.

I believe this is problematic in one major way with my job: That people don’t know what is or is not reasonable to ask me to do, or similarly how long it takes me to conduct particular tasks. I believe most of the time this makes people hesitate to ask me particular questions or ask me to conduct particular analysis. The obverse happens though not entirely infrequently, I get asked nonchalantly to do something that is a considerable investment.

I’m not sure how to best solve this situation (especially the not asking part) besides by developing relationships with colleagues and the boss, and through experience elucidating what I can (or can’t do). To a certain extent I can’t know what people want if they don’t ask me.

The situation in which someone asks me to do something that takes more of in investment is easier, in that I can directly tell the person that this request is either unreasonable or will take along time. A good example of tasks that on the outside may look similar in scope, but are largely different are descriptive vs. causal analysis.

Examples of the difference are “How many calls for service occurred at this particular apartment in the last year?” (descriptive), or “Is there more crime around 15 Main St. than we would normally expect?” (causal). The first is typically just a query or the database and a table or map, and this will typically satisfy the answer. The other though is much more difficult, I have to dream up a reasonable comparison, else the information I provide may be potentially out of context.

The information I produce also depends on who is asking. If someone within the PD asks for descriptive statistics, that is usually all I provide. If someone from the public asks for descriptive statistics, I frequently (at least attempt to) provide more context for those statistics (i.e. some reasonable comparisons or historical trends that form the basis for causal analysis).

This is because I assume people within the PD have the necessary external context to evaluate the information, whereas people outside the PD don’t. If I just stated how many calls for service occurred on your street block, you may think your street is crime ridden, because you don’t have a good internal baseline to judge what is a reasonable number of calls for service. In such requests to the public I try to provide historical numbers over a long period (as people are often worried about newer trends) or comparisons to neighboring areas.

The informational asymmetry problem stills persists though, and filters into other areas of work. In particular how am I evaluated within the PD itself.