New working paper: Choosing Representatives to Deliver the Message in a Group Violence Intervention

I have a new preprint up on SSRN, Choosing Representatives to Deliver the Message in a Group Violence Intervention. This is what I will be presenting at ACJS next Friday the 24th. Here is the abstract:

Objectives: The group based violence intervention model is predicated on the assumption that individuals who are delivered the deterrence message spread the message to the remaining group members. We focus on the problem of who should be given the initial message to maximize the reach of the message within the group.

Methods: We use social network analysis to create an algorithm to prioritize individuals to deliver the message. Using a sample of twelve gangs in four different cities, we identify the number of members in the dominant set. The edges in the gang networks are defined by being arrested or stopped together in the prior three years. In eight of the gangs we calculate the reach of observed call-ins, and compare these with the sets defined by our algorithm. In four of the gangs we calculate the reach for a strategy that only calls-in members under supervision.

Results: The message only needs to be delivered to around 1/3 of the members to reach 100% of the group. Using simulations we show our algorithm identifies the minimal dominant set in the majority of networks. The observed call-ins were often inefficient, and those under supervision could be prioritized more effectively.

Conclusions: Group based strategies should monitor their potential reach based on who has been given the message. While only calling-in those under supervision can reach a large proportion of the gang, delivering the message to those not under supervision will likely be needed to reach 100% of the group.

And here is an image of the observed reach for one of the gang networks using both call-ins and custom notifications.

The paper has the gang networks available at this link, and uses Python to do the network analysis and SPSS to draw the graphs.
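
For readers who want to experiment with the basic idea, the networkx library ships a greedy approximation for dominating sets. This is not the prioritization algorithm from the paper, just a minimal sketch on a made-up toy gang network to show what identifying a dominant set looks like in code.

import networkx as nx
from networkx.algorithms.approximation import min_weighted_dominating_set

# toy gang network: an edge means two members were arrested or stopped together
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 7), (5, 8)]
G = nx.Graph(edges)

# greedy approximation to the minimum dominating set (weight=None treats nodes equally)
dom = min_weighted_dominating_set(G, weight=None)

# every member is either in the set or directly tied to someone in it
covered = set(dom) | {nbr for n in dom for nbr in G[n]}
assert covered == set(G.nodes)
print(dom)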

If you are interested in applying this to your work let me know! Not only do I think this is a good idea for focused deterrence initiatives for criminal justice agencies, but I think the idea can be more widely applied to other fields in social sciences, such as public health (needle clean/dirty exchange programs) or organizational studies (finding good leaders in an organization to spread a message).

Weekly and monthly graphs for monitoring crime patterns (SPSS)

I was recently asked for some code to show how I created the charts in my paper, Tables and Graphs for Monitoring Crime Patterns (Pre-print can be seen here).

I cannot share the data used in the paper, but I can replicate the charts with some other public data. I will use publicly available calls for service data from Burlington, VT to illustrate.

The idea behind these time-series charts is not forecasting, but identifying anomalous patterns, such as recent spikes in the data. (So they are more in line with the ideas behind control charts.) Often, even in big jurisdictions, one prolific offender can cause a spike in crimes over a week or a month. They are also good for checking more general trends, to see if crimes have had slighter, longer-term trends up or down.

For a preview, we will be making a weekly time series chart:

In the weekly chart the red line is the actual data, the black line is the average of the prior 8 weeks, and the grey band is a Poisson confidence interval around that prior moving average. The red dot is the most recent week.

And we will also be making a monthly seasonal chart:

The red line is the counts of calls per month in the current year, and the lighter grey lines are prior years (here going back to 2012).


So to start, I saved the calls for service data from 2012 through the current date (6/20/2016) as a csv file. And here is the code to read in that incident level data.

*Change this to where the csv file is located on your machine.
FILE HANDLE data /NAME = "C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Tables_Graphs".
GET DATA  /TYPE=TXT
  /FILE="data\Calls_for_Service_Dashboard_data.csv"
  /ENCODING='UTF8'
  /DELCASE=LINE
  /DELIMITERS=","
  /QUALIFIER='"'
  /ARRANGEMENT=DELIMITED
  /FIRSTCASE=2
  /DATATYPEMIN PERCENTAGE=95.0
  /VARIABLES=
  AdjustedLatitude AUTO
  AdjustedLongitude AUTO
  AlcoholRelated AUTO
  Area AUTO
  CallDateTime AUTO
  CallType AUTO
  Domv AUTO
  DayofWeek AUTO
  DrugRelated AUTO
  EndDateTime AUTO
  GeneralTimeofDay AUTO
  IncidentNumber AUTO
  LocationType AUTO
  MentalHealthRelated AUTO
  MethodofEntry AUTO
  Month AUTO
  PointofEntry AUTO
  StartDateTime AUTO
  Street AUTO
  Team AUTO
  Year AUTO
  /MAP.
CACHE.
EXECUTE.
DATASET NAME CFS.

First I will be making the weekly chart. What I did when I was working as an analyst was make a chart that showed the recent weekly trends and identified whether the prior week was higher than you might expect it to be. The weekly patterns can be quite volatile though, so I smoothed the data based on the average of the prior eight weeks, and calculated a confidence interval around that average count (based on the Poisson distribution).

As a start, we are going to turn our date variable, CallDateTime, into an SPSS date variable (it gets read in as a string, AM/PM in date-times are so annoying!). Then we are going to calculate the number of days since some baseline – here it is 1/1/2012, which is Sunday. Then we are going to calculate the weeks since that Sunday. Lastly we select out the most recent week, as it is not a full week.

*Days since 1/1/2012.
COMPUTE #Sp = CHAR.INDEX(CallDateTime," ").
COMPUTE CallDate = NUMBER(CHAR.SUBSTR(CallDateTime,1,#Sp),ADATE10).
COMPUTE Days = DATEDIFF(CallDate,DATE.MDY(1,1,2012),"DAYS").
COMPUTE Weeks = TRUNC( (Days-1)/7 ).
FREQ Weeks /FORMAT = NOTABLE /STATISTICS = MIN MAX.
SELECT IF Weeks < 233.

Here I do weeks since a particular date, since if you use XDATE.WEEK you can end up with partial weeks. The magic number 233 can be replaced by something like SELECT IF Weeks < ($TIME - 3*24*60*60). if you know you will be running the syntax on a set date, such as when you do a production job. (Another way is to use AGGREGATE to figure out the latest date in the dataset.)

Next, when you use AGGREGATE in SPSS, weeks with zero calls simply do not show up in the aggregated data, which will mess up our charts. There end up being 22 different call types in the Burlington data, so I make a base dataset (named WeekFull) that has every call type for each week. Then I aggregate the original calls for service dataset to CallType and Week, merge the latter into the former, and finally recode the missing values to zeroes.

*Make sure I have a full set in the aggregate.
FREQ CallType.
AUTORECODE CallType /INTO CallN.
*22 categories, may want to collapse a few together.
INPUT PROGRAM.
LOOP #Weeks = 0 TO 232.
  LOOP #Calls = 1 TO 22.
    COMPUTE CallN = #Calls.
    COMPUTE Weeks = #Weeks.
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME WeekFull.

*Aggregate number of tickets to weeks.
DATASET ACTIVATE CFS.
DATASET DECLARE WeekCalls.
AGGREGATE OUTFILE='WeekCalls'
  /BREAK Weeks CallN
  /CallType = FIRST(CallType)
  /TotCalls = N.

*Merge Into WeekFull.
DATASET ACTIVATE WeekFull.
MATCH FILES FILE = *
  /FILE = 'WeekCalls'
  /BY Weeks CallN.
DATASET CLOSE WeekCalls.
*Missing are zero cases.
RECODE TotCalls (SYSMIS = 0)(ELSE = COPY).

Now we are ready to calculate our statistics and make our charts. First we create a date variable that represents the beginning of the week (for our charts later on). Then I use SPLIT FILE and CREATE to calculate the prior moving average only within individual call types. The last part of the code calculates a confidence interval around the prior moving average, assuming the data are Poisson distributed. (More discussion of this is in my academic paper.)

DATASET ACTIVATE WeekFull.
COMPUTE WeekBeg = DATESUM(DATE.MDY(1,1,2012),(Weeks*7),"DAYS").
FORMATS WeekBeg (ADATE8).

*Moving average of prior 8 weeks.
SORT CASES BY CallN Weeks.
SPLIT FILE BY CallN.
CREATE MovAv = PMA(TotCalls,8).
*Calculating the plus minus 3 Poisson intervals.
COMPUTE #In = (-3/2 + SQRT(MovAv)).
DO IF #In >= 0.
  COMPUTE LowInt = #In**2.
ELSE.
  COMPUTE LowInt = 0.
END IF.
COMPUTE HighInt = (3/2 + SQRT(MovAv))**2.
EXECUTE.

If you would rather use the inverse of the Poisson distribution, I have notes in the code at the end to do that, but the two approaches are pretty similar in my experience. You also might consider (as I mention in the paper) rounding fractional values for LowInt down to zero as well.
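
As a quick aside, here is a small Python sketch (assuming numpy and scipy are available) comparing the plus/minus 3 square-root interval above with the exact inverse Poisson quantiles at the same tail probabilities; for moderate counts the two tend to land within a count or so of each other.

import numpy as np
from scipy.stats import poisson

mov_av = 12.0  # hypothetical prior 8-week moving average for one call type

# the 3-sigma band on the square-root scale, mirroring the SPSS code above
low_approx = max(np.sqrt(mov_av) - 1.5, 0) ** 2
high_approx = (np.sqrt(mov_av) + 1.5) ** 2

# exact Poisson quantiles at the matching two-sided 3-sigma tail probabilities
low_exact = poisson.ppf(0.00135, mov_av)
high_exact = poisson.ppf(0.99865, mov_av)

print(low_approx, high_approx, low_exact, high_exact)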

Now we are ready to make our charts. The last data manipulation is to just put a flag in the file for the very last week (which will be marked with a large red circle). I use EXECUTE before the chart just to make sure the variable is available. Finally I keep the SPLIT FILE on, which produces 22 charts, one for each call type.

IF Weeks = 232 FinCount = TotCalls.
EXECUTE.

*Do a quick look over all of them.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=WeekBeg TotCalls MovAv LowInt HighInt FinCount MISSING=VARIABLEWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: WeekBeg=col(source(s), name("WeekBeg"))
  DATA: TotCalls=col(source(s), name("TotCalls"))
  DATA: MovAv=col(source(s), name("MovAv"))
  DATA: LowInt=col(source(s), name("LowInt"))
  DATA: HighInt=col(source(s), name("HighInt"))
  DATA: FinCount=col(source(s), name("FinCount"))
  SCALE: pow(dim(2), exponent(0.5))
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Crime Count"))
  ELEMENT: line(position(WeekBeg*TotCalls), color(color.red), transparency(transparency."0.4"))
  ELEMENT: area(position(region.spread.range(WeekBeg*(LowInt+HighInt))), color.interior(color.lightgrey), 
  transparency.interior(transparency."0.4"), transparency.exterior(transparency."1"))
  ELEMENT: line(position(WeekBeg*MovAv))
  ELEMENT: point(position(WeekBeg*FinCount), color.interior(color.red), size(size."10"))
END GPL.
SPLIT FILE OFF.

This is good for the analyst, since I can monitor many series at once. Here is an example the procedure produces for mental health calls:

The current value is within the confidence band, so it is not alarmingly high. But we can see that these calls have been trending up over the past few years. Plotting on the square root scale approximately stabilizes the variance of the Poisson counts, and a nice thing about the SPLIT FILE approach is that SPSS picks sensible Y axis ranges for each individual call type.

You can update this to make plots for individual crimes, and here I stuff four different crime types into a small multiple plot. I use a TEMPORARY and SELECT IF statement before the GGRAPH code to select out the crime types I am interested in.

FORMATS TotCalls MovAv LowInt HighInt FinCount (F3.0).
TEMPORARY.
SELECT IF ANY(CallN,3,10,13,17).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=WeekBeg TotCalls MovAv LowInt HighInt FinCount CallN MISSING=VARIABLEWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(900px,600px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: WeekBeg=col(source(s), name("WeekBeg"))
  DATA: TotCalls=col(source(s), name("TotCalls"))
  DATA: MovAv=col(source(s), name("MovAv"))
  DATA: LowInt=col(source(s), name("LowInt"))
  DATA: HighInt=col(source(s), name("HighInt"))
  DATA: FinCount=col(source(s), name("FinCount"))
  DATA: CallN=col(source(s), name("CallN"), unit.category())
  COORD: rect(dim(1,2), wrap())
  SCALE: pow(dim(2), exponent(0.5))
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), start(1), delta(3))
  GUIDE: axis(dim(3), opposite())
  GUIDE: form.line(position(*,0),color(color.lightgrey),shape(shape.half_dash))
  ELEMENT: line(position(WeekBeg*TotCalls*CallN), color(color.red), transparency(transparency."0.4"))
  ELEMENT: area(position(region.spread.range(WeekBeg*(LowInt+HighInt)*CallN)), color.interior(color.lightgrey), 
  transparency.interior(transparency."0.4"), transparency.exterior(transparency."1"))
  ELEMENT: line(position(WeekBeg*MovAv*CallN))
  ELEMENT: point(position(WeekBeg*FinCount*CallN), color.interior(color.red), size(size."10"))
  PAGE: end()
END GPL.
EXECUTE.

You could use fancier time-series models to create the confidence bands or identify the outliers (exponential smoothing would be similar to the prior moving average I show), but this ad-hoc approach worked well in my case. (I wanted to build fancier models, but when I was employed as a crime analyst I did not let the perfect be the enemy of the good in getting at least this done.)


Now we can move on to making our monthly chart. The weekly charts are sometimes hard to interpret with highly seasonal data. So this chart gives each year its own line. Instead of drawing error bars, the prior years' data show the typical variation. It is then easy to see seasonal ups-and-downs, as well as whether the latest month is an outlier.

Getting back to the code, I activate the original calls for service dataset and then close the weekly dataset. Then it is much the same as for weeks, but here I just use calendar months and match to a fully expanded set of call types and months over the period. (I do not care about normalizing months; it is ok that February is only 28 days.)

DATASET ACTIVATE CFS.
DATASET CLOSE WeekFull.

COMPUTE Month = XDATE.MONTH(CallDate).
COMPUTE Year = XDATE.YEAR(CallDate).

DATASET DECLARE AggMonth.
AGGREGATE OUTFILE = 'AggMonth'
  /BREAK Year Month CallN
  /MonthCalls = N.

INPUT PROGRAM.
LOOP #y = 2012 TO 2016.
  LOOP #m = 1 TO 12.
    LOOP #call = 1 TO 22.
      COMPUTE CallN = #call.
      COMPUTE Year = #y.
      COMPUTE Month = #m.
      END CASE.
    END LOOP.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME MonthAll.

MATCH FILES FILE = *
  /FILE = 'AggMonth'
  /BY Year Month CallN.
DATASET CLOSE AggMonth.

Next I select out the most recent month of data (June 2016) since it is not a full month. (When I originally made these charts I would normalize to days and extrapolate out for my monthly meeting. These forecasts were terrible though, even only extrapolating two weeks, so I stopped doing them.) Then I calculate a variable called Current, which flags the most recent year to be drawn in red in the chart.

COMPUTE MoYr = DATE.MDY(Month,1,Year).
FORMATS MoYr (MOYR6) Year (F4.0) Month (F2.0).
SELECT IF MoYr < DATE.MDY(6,1,2016).
RECODE MonthCalls (SYSMIS = 0)(ELSE = COPY).

*Making current year red.
COMPUTE Current = (Year = 2016).
FORMATS Current (F1.0).

SORT CASES BY CallN MoYr.
SPLIT FILE BY CallN.

*Same thing with the split file.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Month MonthCalls Current Year
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Month=col(source(s), name("Month"), unit.category())
  DATA: MonthCalls=col(source(s), name("MonthCalls"))
  DATA: Current=col(source(s), name("Current"), unit.category())
  DATA: Year=col(source(s), name("Year"), unit.category())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Calls"), start(0))
  GUIDE: legend(aesthetic(aesthetic.color.interior), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("0",color.lightgrey),("1",color.red)))
  ELEMENT: line(position(Month*MonthCalls), color.interior(Current), split(Year))
END GPL.

You can again customize this to be individual charts for particular crimes or small multiples. You can see in the example at the beginning of the post that retail thefts are high for March, April, and May. I was interested in examining overdoses, as the northeast (and many parts of the US) are having a problem with heroin at the moment. In the weekly charts the counts are so low it is hard to see any trends though.

We can see that overdoses were high in March. The next highest lines are months in 2015, so while it looks like a problem here in Burlington, it started around a year ago.

For low counts of crime (say under 20 per month) seasonality tends to be hard to spot. For more frequent crimes though, you can often see peaks and dips in summer and winter. It is not universal that crimes increase in the summer though. For ordinance violations (and ditto for noise complaints) we can see a pretty clear peak in September. (I don't know why that is, though there is likely some logical explanation for it.)

My main motivation for promoting these is to replace terrible CompStat tables of year-over-year percent changes. All of the patterns I've shown are nearly impossible to see in tables of counts per month.

Finally if you want to export your images to place into another report, you can use:

OUTPUT EXPORT /PNG IMAGEROOT = "data\TimeGraphs.png".

PNG please – simple line graphics like these should definitely not be exported as jpegs.

Here is a link to the full set of syntax and the csv data to follow along. I submitted a proposal to do an hour-long training session on this at the upcoming IACA conference, so hopefully that gets funded and I can go into this some more.

Using and Making Cumulative Probability Charts

Stephen Few had a recent post critiquing an evaluation of a particular data visualization. Long story short, the experiment asked questions like "What is the probability that X is above 5?", and compared accuracy across mean plus error bar charts, histogram-like visualizations, and animated visualizations showing random draws.

It is always the case in data viz. that some charts make particular questions easier to answer. This is one question, what is the probability a value is above X, that traditional histograms or error bar charts are not well suited for. But there is an alternative I don't see used very often, the cumulative probability chart, that is well suited to answer that question.

It is a totally reasonable question to ask as well. For one example, when I was a crime analyst I used this chart to show the time in between shootings. Many shootings are retaliatory, so I was interested in saying, if a shooting happened on Sunday, how long should the PD be on guard after that initial shooting? Do most retaliatory shootings happen within hours, days, or weeks of a prior shooting? This is a hard question to answer with histograms, but is easier to answer with cumulative probability plots.

Here is that example chart for time-in-between shootings:

Although this chart is not regularly used, it is really easy to explain how to interpret it. For example, at time equals 7 days (on the X axis), the probability that a shooting would have occurred is under 60%. In my opinion, it is easier to explain this chart than a histogram to a lay audience.

The chart is often not a canned option in software, but it takes only a simple set of steps to produce the right ingredients, and then you can use a typical line chart. So those steps, generically, are:

  • sort the data
  • rank the data (1 for the lowest value, 2 for the second lowest value, etc.)
  • calculate rank/(total sample size) – call this Prop
  • plot the data on the X axis, and Prop on the Y axis

This can be easily done in any software, and you can download the excel spreadsheet used to make the above chart here.
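
If you are not an Excel person, here is a minimal Python sketch of those four steps (with made-up days-between-shootings data, not the real series):

import numpy as np
import matplotlib.pyplot as plt

days_between = np.array([1, 1, 2, 3, 3, 4, 6, 7, 9, 12, 15, 21, 30, 45])

x = np.sort(days_between)             # 1) sort the data
ranks = np.arange(1, len(x) + 1)      # 2) rank the data
prop = ranks / len(x)                 # 3) rank / total sample size

plt.plot(x, prop)                     # 4) data on the X axis, Prop on the Y axis
plt.xlabel("Days between shootings")
plt.ylabel("Cumulative probability")
plt.show()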

A variant of this chart often used in crime analysis is the proportion of places on the X axis and the cumulative proportion of crime on the Y axis. E.g. Pareto’s 80/20 rule – or 50/1 rule – or whatever. The chart makes it easy to pick whatever cut-offs you want. If you have your spatial units of analysis in one column, and the total number of crimes in a second column, the procedure to produce this chart is:

  • sort the data descending by number of crimes
  • rank the data
  • calculate rank/(total sample size) – this equals the proportion of all spatial units – call this PropUnits
  • calculate the cumulative number of crimes – call this Cum_Crime
  • calculate Cum_Crime/(Total Crime) – this equals the proportion of all crimes – call this PerCumCrime
  • plot PerCumCrime on the Y axis and PropUnits on the X axis.

See the third sheet of the excel file for a hypothetical example. This pattern basically happens in all aspects of criminal justice. That is, the majority of the bad stuff is happening among a small number of people/places. See this example from William Spelman showing places, victims, and offenders.
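
The same goes for the crime-concentration variant; a short Python sketch with hypothetical crime counts per spatial unit looks like this:

import numpy as np
import matplotlib.pyplot as plt

crimes = np.array([50, 30, 20, 10, 5, 5, 3, 2, 1, 1, 0, 0, 0, 0, 0])

crimes = np.sort(crimes)[::-1]                            # sort descending by crimes
prop_units = np.arange(1, len(crimes) + 1) / len(crimes)  # proportion of spatial units
per_cum_crime = np.cumsum(crimes) / crimes.sum()          # cumulative proportion of crime

plt.plot(prop_units, per_cum_crime)
plt.xlabel("Proportion of spatial units")
plt.ylabel("Cumulative proportion of crime")
plt.show()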

We can see there that 10% of the victims account for 40% of all victimizations etc.

Making and Exploring Crime Networks (Access and Excel)

I’ve been doing quite a bit of stuff with gang networks lately at work. Networks are a total PIA to create and do data manipulation on in traditional spreadsheets and statistics tools though, so I figured I would blog about some of my attempts to ease the pain for fellow crime analysts.

First I will show how to create an edge list in Access from the way a traditional police RMS database is set up. Second I will show a trick about exploring people and gangs by creating a dynamic lookup in Excel. You can download the Access Database I used and the Excel spreadsheet here to follow along.

Making an Edge List in Access

I’ve previously shown how to make an edgelist in SPSS. I’ll cast the net wider and show how to do this in Access though.

In a nutshell, an edge list is a table of the form:

Person A, Person B
Person B, Person C
Person C, Person D

Where being in the same row shows some type of connection between the two persons, e.g. Person A is connected to Person B. In police databases the connections most often of interest are co-offending (e.g. two people were arrested for the same incident) or being stopped together (e.g. in the same car or during the same field interrogation).

Typically police databases will have a table that lists a common incident identifier, along with persons associated with that incident and their involvement. Here is a screen shot of the simple example I made in an Access Database to mimic this which I named IncidentPersons:

So here we can see that for incident 1, Andy Pandy, Sandy Randy, and Candy Dandy are all persons involved. Candy is the victim, and the other two were arrested. This table is always called something different for every PD’s RMS system, but some examples I have come across are crossref and person_exploded. All RMS’s I have seen though have some sort of table like this.

To make an edge list from this table takes some knowledge of SQL, but it can be done in one query. Basically we will be joining a table to itself, and selecting out distinct rows. Here is the most basic SQL query in Access to accomplish this.

SELECT DISTINCT F.PersonID, F.PersonName, S.PersonID, S.PersonName
FROM IncidentPersons AS F INNER JOIN IncidentPersons AS S ON F.IncidentID = S.IncidentID
WHERE F.PersonID < S.PersonID;

To walk through this, we make two table aliases from the same original IncidentPersons table, F and S. Then we do an INNER JOIN based on the original incident ID. If we stopped here without the last WHERE clause, we would have pairs of people with themselves, plus duplicate ties of the form A -> B and B -> A. So selecting only instances in which F.PersonID < S.PersonID eliminates those self edges and duplicates. The last part is SELECT DISTINCT instead of plain SELECT. This makes it so any particular edge is only returned once. (If you deleted DISTINCT in this database, Andy Pandy -> Sandy Randy would be returned twice.)
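
For analysts who would rather script this than build an Access query, here is a sketch of the same self-join in pandas; the column names just mirror the hypothetical IncidentPersons example, not any real RMS.

import pandas as pd

inc = pd.DataFrame({
    "IncidentID": [1, 1, 1, 2, 2],
    "PersonID":   [101, 102, 103, 101, 104],
    "PersonName": ["Andy Pandy", "Sandy Randy", "Candy Dandy", "Andy Pandy", "Randy Candy"],
})

# join the table to itself on the incident, keep one direction of each pair, and drop
# duplicates -- the equivalent of SELECT DISTINCT ... WHERE F.PersonID < S.PersonID
edges = inc.merge(inc, on="IncidentID", suffixes=("_F", "_S"))
edges = edges[edges["PersonID_F"] < edges["PersonID_S"]]
edges = edges[["PersonID_F", "PersonName_F", "PersonID_S", "PersonName_S"]].drop_duplicates()
print(edges)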

Running this query we then have:

In practice it will be more complicated because you will want to filter certain connections and add more info. on people into the final edge list. Here I ignore the involvement type, but you may want to only restrict matches to certain co-involvements (since offender-victim is of a different nature than co-offending). You also may want to not just know those connected, but count up the number of times those people are connected. For my work, I have always just limited to co-offending and being stopped together (and haven’t ever worried about the number of ties).

Also, depending on how the database is normalized, people's names will often change or have spelling errors, but they will still be linked to the same PersonID. These different spellings would cause the DISTINCT selection to not work as expected. A workaround is to select based only on the unique PersonID's and not import other data, then merge in the person data in an additional query. For gang network analysis you will likely want to merge in gang affiliation (which will probably be in a separate table, not in the RMS). If you are still following along though you can figure that stuff out on your own.

Making an Edge Lookup Table in Excel

So now that I have shown how to make the edge table, what do you do with it? (No excuses – since I gave examples in both SPSS and SQL!) Here I will show a simple trick to explore the network using filtering in Excel.

The edge list itself is often the format needed to import into other network based software. So you can make a nice network graph using Gephi or whatever. A graph is good for seeing the overall form of the network when it is limited to only a few nodes, but full networks are typically really complicated, and tools like Gephi aren't very good for drilling down into specific people. Here I will show my simple drilldown solution using Excel.

The network I use for this example is entirely made up; it was simulated using NetworkX (Python), the names are random draws from some internet lists of popular baby names and last names I have already forgotten the source of, and the dates of birth are random between 1975 and 1997. I also made up a list of 7 gangs (but people have a 9/16 chance of being assigned to no gang).

So starting with an edgelist, here is a screenshot of my made up edge list excel table.

The problem with this format is that if I filter the Id.1 column for 19 (BONNIE BARKER), that person could also appear in the Id.2 column, so I would potentially miss edges. A simple solution is just to duplicate the data, but switch the order of the edges. Then when I filter by Id = 19, I will get all possible Bonnie Barker edges.

For a simple example of how to do this on a small table, if you start with:

17,19
18,19
19,20
19,21

If you filter the first column by 19, you will eliminate the 19’s in the second column. So just make a new table that has the ID’s reversed:

19,17
19,18
20,19
21,19

And then stack the two tables on top of one another

17,19 |
18,19 |  Table 1
19,20 |
19,21 |
19,17 +
19,18 +  Table 2
20,19 +
21,19 +

So now if you filter the first column by 19 you get all four of 19's connections. This is just three copy-pastes in excel to go from the original edge list to this table.
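
If you prefer to script those copy-paste steps, here is a short pandas sketch (with made-up column names) that stacks the reversed edges, filters on a person, and counts each person's connections:

import pandas as pd

edges = pd.DataFrame({"Id1": [17, 18, 19, 19], "Id2": [19, 19, 20, 21]})

reversed_edges = edges.rename(columns={"Id1": "Id2", "Id2": "Id1"})  # swap the two ID columns
stacked = pd.concat([edges, reversed_edges], ignore_index=True)      # stack the two tables

print(stacked[stacked["Id1"] == 19])   # all four of person 19's connections
print(stacked.groupby("Id1").size())   # degree, i.e. number of people each person is tied to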

Now we can make a filter that dynamically changes based on user input. Here I make a selection cell in the top row; in N1 you can put in a person's ID. Then in A2, the formula is =IF(B2=$N$1,1,0). You can then paste this formula down, and it always references cell N1 because of the absolute $ modifiers.

Here is a screenshot of my example LookupTable in excel filtering for person 431.

If you update the personid in N1, then hit the reapply button in the toolbar (or hit Ctrl+Alt+L) to update the filter. Here I updated it to person 382.

The context of why I created this example was to identify people that were connected to gang members, but themselves were not in the gang. Basically have a list to take to officers and say, are you sure this person is not an actual member of the gang? The spreadsheet is then a tool if I have a meeting, where someone can say, who is Raelyn Hatfield connected to? I can easily update the id and filter.

You can do this drill down in the original edge table if you have the IF condition look in both the first and second id column, but I do this because it is easier to see who a person is connected to. You only have to look in one column – you don’t have to scan back and forth between two columns to see the connections.

You can also do other aggregations on this table as well. For instance, if you aggregate using a pivot table and count the number of instances, you get the degree centrality of a person (i.e. the number of different people a person is connected to).

If you want to do a drilldown of specific gangs you could use the same logic and build another filter column, but this will duplicate people when they are connected to another person in the same gang. That would be an instance where it might be easier to use just the original edge table.

Testing day-of-week crime randomness paper published

My paper, Testing Serial Crime Events for Randomness in Day of Week Patterns with Small Samples, was recently published in the Journal of Investigative Psychology and Offender Profiling. Here is the pre-print version on SSRN if you can’t get access to that journal.

The main idea behind the paper was if you had a series of a few crime events that you know are linked to the same offender, can we tell if those patterns are random with respect to the day of the week? We know spatial patterns are often clustered, but police responses such as surveillance are conditioned not only on a spatial location, but take place during certain days and times. I wanted to know when I could go to command staff and say, yeah you should BOLO on Saturday. Or just as importantly say in response, no the observed patterns could easily happen if the offender were just randomly picking days.

In the paper I show that if you have only 3 events and they all occur on the same day, you would reject the null that crimes have an equal probability across all seven days of the week at a p-value of less than 0.05 (under that null, the chance all three land on the same day is 7*(1/7)^3 = 1/49, or about 0.02). I also show that the exact test I propose has pretty good power for as few as 8 events in the series. So if you have, say, 10 events and you fail to reject the null that each day of the week has an equal probability of being chosen, it is pretty good evidence that a police response should not have any preference for a particular day.

To illustrate how one would use the test, I have a simple spreadsheet posted here (the zip file also has my SPSS code to reproduce the results in the paper) in which you can type in the days of the week the crimes occurred on, and it calculates the hypothesis test.

The spreadsheet contains both the G-test and Kuiper’s V test. If you don’t read the paper and understand the difference, just use the G-test and ignore the Kuiper’s V results. For crime analysts, this is basically the minimum of what you need to know.


For analysts who are more into the nitty gritty, I also have some R code that is a bit more flexible; it calculates the exact test for varying numbers of bins and provides some code to conduct power analysis. You can either download the code from GitHub and source it to define the functions, or simply copy-paste it into the console. The only library dependency is the partitions library, so make sure that is installed before following along.

So if you have downloaded the code, you can use something like below to insert the functions and load the partitions library.

library(partitions)
mydir <- "C:\\Users\\andrew.wheeler\\Dropbox\\Documents\\BLOG\\ExactTest_Weekdays"
setwd(mydir)
source("Exact_Dist.R")

Now, say you had a series of crimes that had 4 on Saturday, 3 on Tuesday, and 1 on Sunday. You can test this for randomness by simply using:

crime <- c(1,0,3,0,0,0,4)
res <- SmallSampTest(d=crime)
res

Which prints at the console:

Small Sample Test Object 
Test Type is G 
Statistic is 15.5455263389754 
p-value is:  0.0182662  
Data are:  1 0 3 0 0 0 4 
Null probabilities are:  0.14 0.14 0.14 0.14 0.14 0.14 0.14 
Total permutations are:  3003  

This defaults to using the likelihood ratio G-test, but you can also use Kuiper’s V, the chi-square test, or the Kolmogorov-Smirnov test. You can also change the null hypothesis away from equal probability in the bins. I default to the G-test in my paper because it is more powerful than the more typical chi-square after 8 crimes for 7 day-of-week bins, and equal in power to the chi-square for smaller sample sizes. So to do the chi-square test on the same data, use:

resChi <- SmallSampTest(d=crime, type="Chi")
resChi
chisq.test(crime) #for comparison to base R 
chisq.test(crime, simulate.p.value = TRUE, B = 10000)

You can see the test statistic mimics base R’s chisq.test, and the p-value is slightly higher than the asymptotic p-value (the exact test should always have a higher p-value than the asymptotic distribution, and here it is lower than the simulated p-value). In this situation the simulation approach would have been fine. I prefer the exact approach when feasible though, because it is exact and you don’t need to worry about convergence of the simulation (for which most everyone simply picks a large number of draws and hopes for the best).

I’ve also made some code that allows for easy evaluation of the power of the exact test. Coding-wise it was easiest to simply reuse the original object created with the test, so I know it invites post-hoc power analysis – forgive me my sloth in coding practices. So say you wanted to do a priori power analysis with the Kuiper’s V test for 10 bins and 15 observations (over 1.3 million permutations, i.e. n <- 15; m <- 10; choose(n+m-1,m-1)). You can simply make an original object (with any observed values across the bins).

test10_data <- c(15,rep(0,9))
test10_perm <- SmallSampTest(d=test10_data, type="KS")
#takes around a minute

The default null is equal probability across the bins, and to do a power analysis you have to specify an alternative. Let's say for the alternative there is equal probability in 5 of the bins, and zero probability in the other 5. (Most of the work is done in making the original permutation object; the power analysis itself is quite fast, hence why I coded it to work this way.)

p_alt <- c(rep(1/5,5),rep(0,5))
Pow_test <- PowAlt(SST=test10_perm,p_alt=p_alt)
Pow_test

This prints out at the console:

Power for Small Sample Test 
Test statistic is: KS  
Power is: 0.1822815  
Null is: 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1  
Alt is: 0.2 0.2 0.2 0.2 0.2   0   0   0   0   0  
Alpha is: 0.05  
Number of Bins: 10  
Number of Observations: 15  

So for this alternative there is quite low power, only 0.18. But if we change it to only have mass in four of the bins, the power goes way up to over 0.99.

> p_alt2 <- c(rep(1/4,4),rep(0,6))
> Pow_test2 <- PowAlt(SST=test10_perm,p_alt=p_alt2)
> Pow_test2
Power for Small Sample Test 
Test statistic is: KS  
Power is: 0.9902265  
Null is: 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1  
Alt is: 0.25 0.25 0.25 0.25   0   0   0   0   0   0  
Alpha is: 0.05  
Number of Bins: 10  
Number of Observations: 15 

So this shows how the exact test R code can be extended beyond just 7 day-of-week bins. I have not really done any exploration of the power of the KS test or of differing numbers of bins though.

The spatial clustering of hits-vs-misses in shootings using Ripley’s K

My article, Replicating Group-Based Trajectory Models of Crime at Micro-Places in Albany, NY was recently published online first at the Journal of Quantitative Criminology (here is a pre-print version, and you can get my JQC offprint for the next few weeks).

Part of the reason I blog is to show how to replicate some of my work. I have previously shown how to generate the group based trajectory models (1,2), and here I will illustrate how to replicate some of the point pattern analysis I conducted in that paper.

A regular task for an analyst is to evaluate spatial clustering. A technique I use in that article is to use Ripley’s K statistic to evaluate clustering between different types of events, or what spatial statistics jargon calls a marked point pattern. I figured I would illustrate how to do this on some example point patterns of shootings, so analysts could replicate for other situations. The way crime data is collected and geocoded makes generating the correct reference distributions for the tests different than if the points could occur anywhere in the study region.

The example I am going to use is shooting incidents marked by whether they hit the intended victim or missed. One theory in criminology is that murder is simply the extension of general violence, with the difference between aggravated assault and murder often being happenstance. One instance where this appears to be the case, from my observations, is shootings. It seems pure luck whether an individual gets hit or the bullets miss everyone. Here I will see if there appears to be any spatial patterning in shootings where a victim was hit. That is, we know that shootings themselves are clustered, but within those clusters of shootings are the locations of hits-and-misses further clustered, or are they random?

A hypothetical process in which clustering of shooting hits can occur is if some shootings are meant to simply scare individuals vs. being targeted at people on the street. It could also occur if there are different tactics, like drive-bys vs. being on foot. This could occur if say one area had many shootings related to drug deals gone wrong, vs. another area that had gang retaliation drive by shootings. If they are non-random, the police may consider different interventions in different places – probably focusing on locations where people are more likely to be injured from the shooting. If they are random though, there is nothing special about analyzing shootings with a person hit versus shootings that missed everyone – you might as well analyze and develop strategy for all shootings.

So to start, we will be using the R statistical package and the spatstat library. To follow along I made a set of fake shooting data in a csv file. So first load up R; I then change the working directory to where my csv file is located and read the csv into an R data frame.

library(spatstat)

#change R directory to where your file is
MyDir <- "C:\\Users\\andrew.wheeler\\Dropbox\\Documents\\BLOG\\R_shootingVic\\R_Code"
setwd(MyDir)

#read in shooting data
ShootData <- read.csv("Fake_ShootingData.csv", header = TRUE)
head(ShootData)

The file has four fields: an ID, the X and Y coordinates, and a Vic column with 0’s and 1’s. Next, to conduct the spatial analysis we are going to convert this data into an object that the spatstat library can work with. To do that we need to create a study window. Typically you would have the outline of the city, but for this analysis the window won’t matter, so here I make a window that is just slightly larger than the bounding box of the data.

#create ppp file, window does not matter for permutation test
StudyWin <- owin(xrange=c(min(ShootData$X)-0.01,max(ShootData$X)+0.01),yrange=c(min(ShootData$Y)-0.01,max(ShootData$Y)+0.01))
Shoot_ppp <- ppp(ShootData$X, ShootData$Y, window=StudyWin, marks=as.factor(ShootData$Vic), unitname = c("meter","meters"))

When Ripley’s K was originally developed it was for points that could occur anywhere in the study region, such as the locations of different tree species. Crimes are a bit different though, in that there are some areas in any city where a crime basically cannot occur (such as the middle of a lake). Also, crimes are often simply geocoded according to addresses and intersections, so this further reduces the locations where the points can be located in a crime dataset. To calculate the sample statistic for Ripley’s K one does not need to account for this, but to tell whether those patterns are random one needs to simulate the point pattern under a realistic null hypothesis. There are different ways to do the simulation, but here the simulation keeps the shooting locations fixed and randomly assigns each shooting to have a victim hit or not, keeping the same marginal frequencies. That is, it basically just shuffles which events are counted as having a victim hit versus ones in which no people were hit. The Dixon article calls this the random relabelling approach.

Most of the spatstat functions can take a separate list of point patterns to use to simulate error bounds for different functions, so this function takes the initial point pattern, generates the permutations, and stuffs them in a list. I set the seed so the analysis can be reproduced exactly.

#generate the simulation envelopes to use in the Cross function
MarkedPerms <- function(ppp, nlist) {
  require(spatstat)
  myppp_list <- c() #make empty list
  for (i in 1:nlist) {
    current_ppp <-  ppp(ppp$x, ppp$y,window=ppp$window,marks=sample(ppp$marks))  #making permutation      
    myppp_list[[i]] <- current_ppp                                               #appending perm to list
  }
  return(myppp_list)
}

#now making a set of simulated permutations
set.seed(10) #setting seed for reproducibility
MySimEvel <- MarkedPerms(ppp=Shoot_ppp,nlist=999)

Now we have all the ingredients to conduct the analysis. Here I call the cross K function and submit my set of simulated point patterns named MySimEvel. With only 100 points in the dataset it works pretty quickly, and then we can graph the Ripley’s K function. The grey bands are the simulated K statistics, and the black line is the observed statistic. We can see the observed is always within the simulated bands, so we conclude that conditional on shooting locations, there is no clustering of shootings with a victim versus those with no one hit. Not surprising, since I just simulated random data.

#Cross Ripleys K
CrossK_Shoot <- alltypes(Shoot_ppp, fun="Kcross", envelope=TRUE, simulate=MySimEvel)
plot(CrossK_Shoot$fns[[2]], main="Cross-K Shooting Victims vs. No Victims", xlab="Meters")

I conducted this same analysis with actual shooting data in three separate cities that I have convenient access to. It is a hodge podge of sample sizes, but City A and City B have around 100 shootings each, and City C has around 500. In City A the observed line is very near the bottom, suggesting some evidence that shootings with victims hit may be further apart than would be expected, although for most distances it stays within the 99% simulation band. (I can’t think of a theoretical reason why being spread apart would occur.) City B is pretty clearly within the simulation band, and City C’s observed pattern mirrors the mean of the simulation bands almost exactly. Since City C has the largest sample, I think this is pretty good evidence that shootings with a person hit are spatially random conditional on where shootings occur.

Long story short, when conducting Ripley’s K with crime data, the default way of generating the simulation envelopes for the statistics is not appropriate given how crime data is recorded. I show here one way to account for that when generating simulation envelopes.

Some ad-hoc fuzzy name matching within Police databases

A repeated, annoying task I have had to undertake is to take a list of names and dates of birth and match them to a reference set. This can happen when you try to merge data from different sources. Or, when working with police RMS data, a frequent problem is that the same individual can go into the master name list multiple times. This can be due either to data error, or to someone being malicious and providing the PD with fraudulent data.

Most often I am trying to match a smaller set of names to a bigger set from a police RMS system. Typically what I do is grab all of the names in the police RMS, make them the same case, and then simply sort the file by last then first name. Then I typically go through one by one from the smaller file and identify the name ID’s that are in the sorted bigger police database. Ctrl-F can make it a quick search for only a few people.

This works quite well for small numbers, but I wanted to see if I could make some simple rules for when I need to match a larger list. For example, the above workflow may be fine if I need to look up 10 names quickly, but say you want to eliminate duplicates in the entire PD RMS system? A manual hand search through 100,000+ names is crazy (and will be out of date before you finish).

A tool I’ve used in the past (and would recommend) is FRIL, Fine-grained records integration and linkage tool. In a nutshell the way that tool works is that you can calculate string distances between names and/or date distances between date-of-births from two separate files. Then you specify how close you want the records to be to either automatically match the records or manually view and make a personal determination if the two records are the same person. FRIL has a really great interface to quickly view the suggested matches and manually confirm or reject certain matches.

FRIL uses the Potter Stewart "I know it when I see it" approach to finding matches. There is no ground truth; you just use your best judgement whether two names belong to the same person, and FRIL uses some metrics to filter out the most unlikely candidates. I have a bit of a unique strategy though to identify typical string and date-of-birth differences in fuzzy name matches by using police RMS data itself. Police RMS’s tend to have a variety of people who are linked up to the same master name index value, but who for potentially several reasons have various idiosyncrasies among different individual incidents. This allows me to calculate distances within persons, so a ground truth estimate, and then I can compare different distances against a control sample to see how well the metrics discriminate matches.

There can be several reasons for slightly different data among an individual's incidents in a police RMS even though they end up linked to the same person. One is that RMS systems frequently incorporate tables from several different data sources: dispatchers have their own CAD system, the PD has a system to type in paper records, custodial arrests/fingerprinting may have another system, etc. Merging this data into one RMS may simply cause differences in how the data is stored or even how particular fields tend to be populated in the database. A second reason is that individuals can be ex-ante associated with a particular master name index, but can still have differences in various person fields for any particular incident.

One simple example is that for an arrest report the offender may have an old address in the system, so the officer types in the new address. The same thing can happen with slight name or DOB changes. The master name index should update with the most recent info, but you have a record trail of all the minor variations through each incident. The type of involvement in an incident also has an impact on what information is collected and the quality of that information. For example, if I was interviewed as a witness to a crime, I may just go down in the report as Andy Wheeler with no date of birth info. If I was arrested, someone would take more time to put in my full name, Andrew Wheeler, and my date of birth. But if the original person inputting the data took the time, they would probably realize I was the same person and associate me with the same master ID.

So I can look at these within-ID changes to see the typical distances. What I did was take a name database from a police department I work with, make all pair-wise comparisons between unique names and dates of birth, and then calculate several string distances between the names and the day differences between listed DOB's. I then made a randomly matched sample for a comparison group. For the database I was working with, this ended up being over 100,000 people with the same ID but different names/DOB's somewhere in different incidents, with an average of between 2 and 3 different names/DOB's per person (so a sample of nearly 200,000 same-name comparisons; two names only result in one comparison, but three names result in 3 comparisons). For the control sample I took one of these persons' names and matched it with another random person in the database, so I have a control sample of over 100,000 cases.

The data I was working with is secondary, so the names were already aggregated to Last, First Middle. If I had the original database I could do distances for the individual fields (and probably not worry about the middle name) but it somewhat simplifies the analysis as well. Here are some histograms of the Levenshtein distance between the name strings for the same person and random samples. The Levenshtein distance is the number of single edits it takes to transform one string to another string, so 0 would be the same word.

Part of the reason the distances within the same name have such a long tail is because of the already aggregated data. There end up being some people with full middle names, some with middle initials, and some with no middle names at all. So what I did was calculate a normalized Levenshtein distance based on the max and min possible values the distance can take (listed at the Wikipedia page) given the sizes of the two input strings. The minimum value is the difference in the lengths of the two strings, and the maximum is the length of the longer string. So then I calculate NormLevenDist = (LevenDist - min)/(max - min). This would cause Wheeler, Andy P and Wheeler, Andy Palmer to have a normalized distance of zero, whereas the edit distance would be 5. So in these histograms you can see even more discrimination between the two classes, based mainly on such names being perfect subsets of other names.
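
For reference, here is a self-contained Python sketch of that normalized distance (a plain dynamic-programming edit distance; packages like python-Levenshtein compute the same thing much faster):

def leven(a, b):
    # standard Levenshtein edit distance between strings a and b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (or match)
        prev = cur
    return prev[-1]

def norm_leven(a, b):
    # (LevenDist - min) / (max - min); min = difference in lengths, max = longer length
    lo, hi = abs(len(a) - len(b)), max(len(a), len(b))
    return 0.0 if hi == lo else (leven(a, b) - lo) / (hi - lo)

print(leven("WHEELER, ANDY P", "WHEELER, ANDY PALMER"))       # 5
print(norm_leven("WHEELER, ANDY P", "WHEELER, ANDY PALMER"))  # 0.0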

If you eliminate the 0’s in the normalized distance, you can get a better look at the shapes of each distribution. There is no clear cut-off between the samples, but there is a pretty clear difference in the distributions.

I also calculated the Jaro-Winkler and the Dice (bi-gram) string distances. All four of these metrics had a fairly high correlation, around 0.8 with one another, and all did pretty well classifying the same ID’s according to their ROC curves.
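
If you want to reproduce that kind of check on your own data, the evaluation is just an ROC/AUC on the distance with the within-ID versus control label. A tiny sketch, assuming scikit-learn is available and using made-up numbers:

from sklearn.metrics import roc_auc_score

same_id = [1, 1, 1, 0, 0, 0]                  # 1 = same master ID, 0 = random control pair
norm_dist = [0.0, 0.1, 0.55, 0.5, 0.6, 0.8]   # made-up normalized Levenshtein distances

# negate the distance so that higher scores mean "more likely the same person"
print(roc_auc_score(same_id, [-d for d in norm_dist]))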

If I wanted to train a classifier as accurately as possible, I would use all of these metrics and probably make some sort of decision tree (or estimate their effects via logistic regression), but I wanted to make a simple function (since it will be doing quite a few comparisons) that calculates as few of the metrics as possible, so I just went with the normalized Levenshtein distance here. Jaro-Winkler would probably be more competitive if I had the separate first and last names (and played around with the weights for the beginning and ends of the strings). If you had mixed strings, like some are First Middle Last and others are Last First I suspect the dice similarity would be the best.

But in the end I think all of the string metrics will do a pretty similar job for this input data, and the normalized Levenshtein distance will work pretty well, so I am going to stick with that. (I don’t consider soundex matching here; I’ve very rarely come across an example where soundex would match names with a high edit distance. Typos are much more common than intentional misspellings based on enunciation, I believe, and even the intentional misspellings tend to have a small edit distance.)

Now, looking at the absolute differences in the DOB’s (where both are available) provides a bit of a different pattern. Here are the histograms:

But I think the easiest illustration of this is to examine the frequency table of the most common day differences for the same individuals.

Obviously zero is the most common, but you can see a few other bumps that illustrate the nature of data mistakes. The second is 365 days – exactly off by one year. Also in the top 10 are 366, 731 & 730 – off by either a year and a day or two years. 1096, 1095, 1461, and 1826 are examples of these yearly cycles as well. 36,525 is an example of being off by a century! Somewhere along the way some of the DOB fields were accidentally assigned to dates in the future (such as 2032 vs. 1932). The final examples are off by some other number typical of a simple typo, such as 10 or 20, or by roughly the difference of one month (27, 30, 31). By the end of this table, the PDF for the same persons is smaller than the PDF of the control sample.

I also calculated string distances by transforming the DOB’s to mm/dd/yy format, but when incorporating the yearly cycles and other noted mistakes they did not appear to offer any new information.

So based on this information, I made a set of ad-hoc rules to classify matched names. I wanted to keep the false positive rate at less than 1 in 1,000, but make the true positive rate as high as possible. The simple rules I came up with were:

  • If the normalized Levenshtein distance is less than 0.2, consider it a match
  • If the normalized Levenshtein distance is less than 0.4 and the dates are close, consider it a match.

Close dates are defined as:

  • absolute difference of within 10 days OR
  • days apart are 10,20,27,30,31 OR
  • the number of days within a yearly cycle are less than 10
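
These rules are simple enough to put into a small Python function. This is a sketch, not the exact code used in the post, and the yearly-cycle check is my own reading of the rule above; a missing DOB is handled by passing no day difference.

def close_date(days_apart):
    # 'close' DOBs per the rules above; the yearly-cycle check is one interpretation
    d = abs(days_apart)
    cycle = min(d % 365, 365 - d % 365)        # days away from an exact yearly cycle
    return d <= 10 or d in (20, 27, 30, 31) or cycle < 10

def name_match(norm_leven_dist, days_apart=None):
    if norm_leven_dist < 0.2:
        return True
    if days_apart is not None and norm_leven_dist < 0.4 and close_date(days_apart):
        return True
    return False

print(name_match(0.15))        # match on the name alone
print(name_match(0.3, 366))    # close name plus a DOB off by a year and a day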

This match procedure produces the classification table below:

The false positive rate is right where I wanted it to be, but the true positive rate is a bit lower than I hoped. It is a simple tool to implement though, and it can handle missing data for a birthday.

It is a bit hard to share this data and provide reproducible code, but if you want help doing something similar with your own data just shoot me an email and I will help. This was all done in SPSS and Python (using the extended transforms python code). In the end I wanted to make a simple Python function to use with the FUZZY command to automatically match names.

Tables and Graphs paper rejection/update – and on the use of personal pronouns in scientific writing

My paper, Tables and Graphs for Monitoring Temporal Crime Patterns was recently rejected from Policing: An International Journal of Police Strategies & Management. I’ve subsequently updated the SSRN draft based on feedback from the review, and here I post the reviews and my responses to those reviews (in the text file).

One of the main critiques by both reviewers was that the paper was too informal, mainly because of the use of "I" in the paper. I use personal pronouns in writing intentionally, despite typical conventions in scientific writing, so I figured a blog post about why I do this is in order. I’ve been criticized for it on other occasions as well, but this is the first time it was listed as a main reason to reject an article of mine.

My main motivation comes from Michael Billig’s book Learn to Write Badly: How to Succeed in the Social Sciences (see a prior blog post I wrote on its contents). In a nutshell, when you use personal pronouns it is clear that you, the author, are doing something. When you rewrite the sentence to avoid personal pronouns, you often obfuscate who the actor is in a particular sentence.

For an example of Billig’s point that personal pronouns can be more informative, I state in the paper:

I will refer to this metric as a Poisson z-score.

I could rewrite this sentence as:

This metric will be referred to as a Poisson z-score.

But that is ambiguous as to its source. Did someone else coin this phrase, and I am borrowing it? No – it is a phrase I made up, and using the personal pronoun clearly articulates that fact.

Pretty much all of the examples where I eliminated first person in the updated draft were of the nature,

In this article I discuss the use of percent change in tables.

which I subsequently changed to:

This article discusses the use of percent changes as a metric in tables.

Formal I suppose, but insipid. All rewriting the sentence to avoid the first person pronoun does is make the article seem like a sentient being, as well as force me into the passive voice. I don’t see how the latter is better in any way, shape, or form – yet this is one of the main reasons my paper was rejected above. The use of "we" in academic articles seems to be more common, but using "we" when there is only one author is just silly. So I will continue to use "I" when I am the only author.

Presentation at IACA 2015: An Exploratory Network Analysis of Hot People and Places

My work was accepted at the 2015 International Association of Crime Analysts Conference, so I will be going to Denver this fall to present. These are "NIJ Research Track" presentations, which basically took the place of the old MAPS conference presentations. I will be presenting on Thursday at 2:30 (Panel 9). The title of the presentation is An Exploratory Network Analysis of Hot People and Places, and below is the abstract for the talk:

Intelligence led policing practices focus on chronic offenders and hot spots of crime. I examine the connections between these hot people and hot places by considering micro places (street segments and intersections) and people as nodes in an interconnected network. I focus on whether hot people tend to have a finite set of locations they congregate, and whether hot places have unique profiles of chronic offenders. The end goal is to identify if observed patterns can help police combine targeted enforcement of hot people and hot places in one overarching strategy.

This is still a work in progress, but here is a quick preview. This graph is a random sample of offender footprints in each panel.

Also during this slot there are two other presentations, below is their info:

Title: Offender Based Investigations: A Paradigm Shift in Policing, Author: Chief James W. Buie, Gaston County Police

Abstract: Routine activity theory and research conducted by Wiles, P & Costello, A. (2000) ‘The Road to Nowhere’ prove criminals 1) commit crimes where comfortable and 2) are very likely to re-offend. Therefore it makes sense to elevate the importance of offender locations in relation to crimes. Our focus on the offender is called Offender Based Investigations (OBI). We’re using GIS to plot not only crimes but criminals as well. Our experience over the past seven years of utilizing OBI has proven that mapping offenders plays a critical role in solving crimes.

Title: A Spatial Network Analysis of Gangs Author: Davin Hall, Crime Analysis Specialist for the Greensboro PD

Abstract: Gangs are characterized by the associations between members and the territory that members operate in. Many representations of gang territory and gang networks are separate from one another; a map shows the territory while a network map shows linkages between individuals. This presentation demonstrates the combination of territory and network within a single map, for both gangs and gang members. This analysis shows law enforcement a clearer picture of the complex relationships that occur between and within gangs.

You can see the full agenda here. I really enjoyed the presentations last year in Seattle, and this year looks great as well. If you want any more info on my particular work, or want to get together in Denver this fall, feel free to send me an email.

How wide to make the net in actuarial tools? (false positives versus false negatives)

An interesting debate/question came up in my work recently. I conducted an analysis of a violence risk assessment tool for a police department. Currently the PD takes around the top 1,000 scores from this tool, and then uses further intelligence and clinical judgements to place a small number of people on a chronic offender list (who are then subject to further interventions). My assessment of the predictive validity, based on examining ROC curves, suggested the tool does a pretty good job discriminating violent individuals up to around the top 6,000 scores, after which the discrimination flattens out. In a sample of over 200,000 people, the top 1,000 scores correctly classified 30 of the 100 violent cases, and the top 6,000 classified 60.
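
As an aside, these top-k capture numbers are straightforward to tabulate if you have the risk scores and an indicator of subsequent violence. Below is a minimal Python sketch of that calculation; the data are simulated stand-ins I generated for illustration (the arrays scores and violent are my own hypothetical names), not the actual PD data.

import numpy as np

# simulated stand-in data for illustration -- NOT the actual PD data
rng = np.random.default_rng(10)
n = 200_000
scores = rng.normal(size=n)                  # hypothetical risk tool scores
# plant 100 "violent" cases, more likely among the higher scores
prob = np.exp(2 * scores)
prob /= prob.sum()
violent = np.zeros(n, dtype=int)
violent[rng.choice(n, size=100, replace=False, p=prob)] = 1

# rank everyone by score (highest first), then count how many of the
# violent cases fall within the top-k scores for different list sizes
order = np.argsort(-scores)
for k in (1_000, 6_000):
    captured = violent[order[:k]].sum()
    print(f"Top {k}: {captured} of {violent.sum()} violent cases captured")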

So the question came up: should we recommend that the analysts widen the net to the top 6,000 scores, instead of only examining the top 1,000? There are of course costs and limitations to what the analysts can do. It may simply be infeasible for the analysts to review 6,000 people. But how do you set the limit? Should the clinical assessments be focused on even fewer individuals than 1,000?

We can make some estimates of where the line should be drawn by setting weights for the cost of a false positive versus a false negative. Implicit in the whole exercise of predicting violence in a small set of people is that a false negative (failing to predict someone will be violent when they are) greatly outweighs a false positive (predicting someone will be violent when they are not). The nature of the task dictates that you will always need quite a few false positives to classify even a few true positives, and no matter what you do there will only be a small number of false negatives.

Abstractly, you can place a value on the cost of failing to predict violence, and a cost on the analysts’ time to evaluate cases. In this situation we want to know whether the costs of widening the net to 6,000 individuals are less than the costs of only examining the top 1,000. Here I will show that we don’t need to know the exact cost of a false positive or a false negative, only their relative costs, to estimate whether the net should be cast wider.

The set up is that if we only take the top 1,000 scores, we capture 30 of the 100 violent cases. So there will be (100 - 30) false negatives and (1000 - 30) false positives. If we expand to the top 6,000 scores, we capture 60 of the 100 violent cases, but then have (6000 - 60) false positives. I cannot assign a specific number to the cost of a false negative (FN) or a false positive (FP), so we write the cost equations in terms of those unknowns:

1) (100 - 30)*FN + (1000 - 30)*FP = Cost Low
2) (100 - 60)*FN + (6000 - 60)*FP = Cost High

Even though we do not know the exact cost of a false negative, we can talk about relative costs, e.g. 1 false negative = 1000*false positives. There are too many unknowns here, so I am going to set FP = 1. This makes the numbers relative, not absolute. So with this constraint the reduced equations can be written as:

1) 70*FN +  970 = Cost Low
2) 40*FN + 5940 = Cost High

We want to know when there is a net benefit to including the top 6,000 scores versus only the top 1,000, that is, when Cost High < Cost Low. To figure out where this happens, we can subtract equation 2 from equation 1:

3) (70 - 40)*FN - 4970 = Cost Low - Cost High

If we set this equation to zero and solve for FN, we can find the point where the two costs are equal:

30*FN - 4970 = 0
30*FN = 4970
FN = 4970/30 = 165 + 2/3

If the value of a false negative is more than about 166 times the value of a false positive, Cost Low - Cost High will be positive, meaning the false negatives are more costly to society than the extra analyst time spent, and widening the net to the top 6,000 is worth it. It is still hard to guess whether the cost of violence to society is 166 times the cost of the analysts’ time, but that is at least one number to wrap your head around. In a more concrete example, such as granting parole versus continued incarceration, net widening (with these example numbers) would probably not be worth it given how expensive prison is. But here it is a bit more fuzzy, especially because the analysts’ time is relatively inexpensive. (You also have to guess how well you can intervene: in the prison example, incarceration essentially reduces the probability of committing violence to zero, whereas police interventions cannot hope to be that successful.)
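
For those who want to check the arithmetic or plug in their own numbers, below is a small Python helper I wrote for this post (just a sketch of the algebra above, not part of any formal analysis). It fixes the cost of a false positive at 1 and returns the false negative cost at which the two strategies are equally costly.

def break_even_fn_cost(tp_small, n_small, tp_large, n_large):
    # With the cost of a false positive fixed at 1, the break-even cost of a
    # false negative is (extra false positives) / (extra true positives).
    # The total number of violent cases cancels out of the subtraction,
    # so it does not need to be passed in.
    extra_tp = tp_large - tp_small
    extra_fp = (n_large - tp_large) - (n_small - tp_small)
    return extra_fp / extra_tp

# widening the net from the top 1,000 (30 captured) to the top 6,000 (60 captured)
print(break_even_fn_cost(30, 1000, 60, 6000))  # 4970/30 = 165.67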

As long as you assume the classification rate is linear within this range of scores, the same argument holds for widening the net to any number of scores. But in reality there are diminishing returns the more scores you examine (and 6,000 is basically where the returns are near zero). If you conduct the same exercise comparing classifying zero people to the top 1,000, the ratio of the cost of a false negative to a false positive needs to be at least 32 + 1/3 to justify evaluating the top 1,000 scores. If you actually had an estimate of the ratio of the cost of a false negative to a false positive, you could then figure out exactly how wide to make the net. But if you think the ratio is well above 166, you have plenty of reason to widen the net to the larger value.
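
Using the same break_even_fn_cost helper sketched above, the zero versus top 1,000 comparison is a one-liner:

# comparing no review at all to reviewing the top 1,000 scores
print(break_even_fn_cost(0, 0, 30, 1000))  # 970/30 = 32.33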