Identifying near repeat crime strings in R or Python

People in criminology should be familiar with repeats or near-repeats for crimes such as robbery, burglaries, or shootings. An additional neat application of this idea though is to pull out strings of incidents that are within particular distance and time thresholds. See this example analysis by Haberman and Ratcliffe, The Predictive Policing Challenges of Near Repeat Armed Street Robberies. This is particularly useful to an analyst interested in crime linkage — to see if those particular strings of incidents are likely to be committed by the same offender.

Here I will show how to pluck out those near-repeat strings in R or Python. The general idea is to transform the incidents into a network, where two incidents are connected only if they meet the distance and time requirements. Then you can identify the connected components of the graph, and those are your strings of near-repeat events.

To follow along, here is the data and the code used in the analysis. I will be showing this on an example set of thefts from motor vehicles (aka burglaries from motor vehicles) in Dallas in 2015. In the end I take two different approaches to this problem — in R the solution will only work for smaller datasets (say n~5000 or less), but the python code should scale to much larger datasets.

Near-repeat strings in R

The approach I take in R does the steps as follows:

  1. compute the distance matrix for the spatial coordinates
  2. convert this matrix to a set of 0’s and 1’s, 1’s correspond to if the distance is below the user specified distance threshold (call it S)
  3. compute the distance matrix for the times
  4. convert this matrix to a set of 0’1 and 1’s, 1’s correspond to if the distance is below the user specified time threshold (call it T)
  5. use element-wise multiplication on the S and T matrices, call the result A, then set the diagonal of A to zero
  6. A is now an adjacency matrix, which can be converted into a network
  7. extract the connected components of that network

So here is an example of reading in the thefts from motor vehicle data, and defining my function, NearStrings, to grab the strings of incidents. Note you need to have the igraph R library installed for this code to work.


MyDir <- "C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\SourceNearRepeats"

BMV <- read.csv(file="TheftFromMV.csv",header=TRUE)

#make a function
NearStrings <- function(data,id,x,y,time,DistThresh,TimeThresh){
    library(igraph) #need igraph to identify connected components
    MyData <- data
    SpatDist <- as.matrix(dist(MyData[,c(x,y)])) < DistThresh  #1's for if under distance
    TimeDist <-  as.matrix(dist(MyData[,time])) < TimeThresh #1's for if under time
    AdjMat <- SpatDist * TimeDist #checking for both under distance and under time
    diag(AdjMat) <- 0 #set the diagonal to zero
    row.names(AdjMat) <- MyData[,id] #these are used as labels in igraph
    colnames(AdjMat) <- MyData[,id] #ditto with row.names
    G <- graph_from_adjacency_matrix(AdjMat, mode="undirected") #mode should not matter
    CompInfo <- components(G) #assigning the connected components

So here is a quick example run on the first ten records. Note I have a field that is named DateInt in the csv, which is just the integer number of days since the first of the year. In R though if the dates are actual date objects you can submit them to the dist function though as well.

#Quick example with the first ten records
BMVSub <- BMV[1:10,]
ExpStrings <- NearStrings(data=BMVSub,id='incidentnu',x='xcoordinat',y='ycoordinat',time='DateInt',DistThresh=30000,TimeThresh=3)

So here we can see this prints out:

> ExpStrings
            CompId CompNum
000036-2015      1       3
000113-2015      2       4
000192-2015      2       4
000251-2015      1       3
000360-2015      2       4
000367-2015      3       1
000373-2015      4       2
000378-2015      4       2
000463-2015      2       4
000488-2015      1       3

The CompId field is a unique Id for every string of events. The CompNum field states how many events are within the string. So we have one string of events that contains 4 records in this subset.

Now this R function comes with a big caveat, it will not work on large datasets. I’d say your pushing it with 10,000 incidents. The issue is holding the distance matrices in memory. But if you can hold the matrices in memory this will still run quite fast. For 5,000 incidents it takes around ~15 seconds on my machine.

#Second example alittle larger, with the first 5000 records
BMVSub2 <- BMV[1:5000,]
BigStrings <- NearStrings(data=BMVSub2,id='incidentnu',x='xcoordinat',y='ycoordinat',time='DateInt',DistThresh=1000,TimeThresh=3)

The elements in the returned matrix will line up with the original dataset, so you can simply add those fields in, and do subsequent analysis (such as exporting back into a mapping program and digging into the strings).

#Add them into the original dataset
BMVSub2$CompId <- BigStrings$CompId
BMVSub2$CompNum <- BigStrings$CompNum   

You can check out the number of chains of different sizes by using aggregate and table.

#Number of chains
table(aggregate(CompNum ~ CompId, data=BigStrings, FUN=max)$CompNum)

This prints out:

   1    2    3    4    5    6    7    9 
3814  405   77   27    3    1    1    1

So out of our first 1,000 incidents, using the distance threshold of 1,000 feet and the time threshold of 3 days, we have 3,814 isolates. Thefts from vehicles with no other incidents nearby. We have 405 chains of 2 incidents, 77 chains of 3 incidents, etc. You can pull out the 9 incident like this since there is only one chain that long:

#Look up the 9 incident
BMVSub2[BMVSub2$CompNum == 9,]  

Which prints out here:

> BMVSub2[BMVSub2$CompNum == 9,]
      incidentnu xcoordinat ycoordinat StartDate DateInt CompId CompNum
2094 043983-2015    2460500    7001459 2/25/2015      56   1842       9
2131 044632-2015    2460648    7000542 2/26/2015      57   1842       9
2156 045220-2015    2461162    7000079 2/27/2015      58   1842       9
2158 045382-2015    2460154    7000995 2/27/2015      58   1842       9
2210 046560-2015    2460985    7000089  3/1/2015      60   1842       9
2211 046566-2015    2460452    7001457  3/1/2015      60   1842       9
2260 047544-2015    2460154    7000995  3/2/2015      61   1842       9
2296 047904-2015    2460452    7001457  3/3/2015      62   1842       9
2337 048691-2015    2460794    7000298  3/4/2015      63   1842       9

Or you can look up a particular chain by its uniqueid. Here is an example of a 4-chain set.

> #Looking up a particular incident chains
> BMVSub2[BMVSub2$CompId == 4321,]
      incidentnu xcoordinat ycoordinat StartDate DateInt CompId CompNum
4987 108182-2015    2510037    6969603 5/14/2015     134   4321       4
4988 108183-2015    2510037    6969603 5/14/2015     134   4321       4
4989 108184-2015    2510037    6969603 5/14/2015     134   4321       4
4993 108249-2015    2510037    6969603 5/14/2015     134   4321       4

Again, only use this function on smaller crime datasets.

Near-repeat strings in Python

Here I show how to go about a similar process in Python, but the algorithm does not calculate the whole distance matrix at once, so can handle much larger datasets. An additional note is that I exploit the fact that this list is sorted by dates. This makes it so I do not have to calculate all pair-wise distances – I will basically only compare distances within a moving window under the time threshold – this makes it easily scale to much larger datasets.

So first I use the csv python library to read in the data and assign it to a list with a set of nested tuples. Also you will need the networkx library to extract the connected components later on.

import networkx as nx
import csv
import math

dir = r'C:\Users\axw161530\Dropbox\Documents\BLOG\SourceNearRepeats'

BMV_tup = []
with open(dir + r'\TheftFromMV.csv') as f:
    z = csv.reader(f)
    for row in z:

The BMV_tup list has the column headers, so I extract that row and then figure out where all the elements I need, such as the XY coordinates, the unique Id’s, and the time column are located in the nested tuples.

colnames = BMV_tup.pop(0)
print colnames
print BMV_tup[0:10]

xInd = colnames.index('xcoordinat')
yInd = colnames.index('ycoordinat')
dInd = colnames.index('DateInt')
IdInd = colnames.index('incidentnu')

Now the magic — here is my function to extract those near-repeat strings. Again, the list needs to be sorted by dates for this to work.

def NearStrings(CrimeData,idCol,xCol,yCol,tCol,DistThresh,TimeThresh):
    G = nx.Graph()
    n = len(CrimeData)
    for i in range(n):
        for j in range(i+1,n):
            if (float(CrimeData[j][tCol]) - float(CrimeData[i][tCol])) > TimeThresh:
                xD = math.pow(float(CrimeData[j][xCol]) - float(CrimeData[i][xCol]),2)
                yD = math.pow(float(CrimeData[j][yCol]) - float(CrimeData[i][yCol]),2)
                d = math.sqrt(xD + yD)
                if d < DistThresh:
    comp = nx.connected_components(G)
    finList = []
    compId = 0
    for i in comp:
        compId += 1
        for j in i:
    return finList

We can then do the same test on the first ten records that we did in R.

print NearStrings(CrimeData=BMV_tup[0:10],idCol=IdInd,xCol=xInd,yCol=yInd,tCol=dInd,DistThresh=30000,TimeThresh=3)

And this subsequently prints out:

[('000378-2015', 1), ('000373-2015', 1), ('000113-2015', 2), ('000463-2015', 2), ('000192-2015', 2), ('000360-2015', 2), 
('000251-2015', 3), ('000488-2015', 3), ('000036-2015', 3)]

The component Id’s wont be in the same order as in R, but you can see we have the same results. E.g. the string with three incidents contains the Id’s 000251, 000488, and 000036. Note that this approach does not return isolates — incidents which have no nearby space-time examples.

Running this on the full dataset of over 14,000 incidents takes around 20 seconds on my machine.

BigResults = NearStrings(CrimeData=BMV_tup,idCol=IdInd,xCol=xInd,yCol=yInd,tCol=dInd,DistThresh=1000,TimeThresh=3)

And that should scale pretty well for really big cities and really big datasets. I will let someone who knows R better than me figure out workarounds to scale to bigger datasets in that language.