Using KDTree’s in python to calculate neighbor counts

For a few different projects I’ve had to take a set of crime data and calculate the number of events nearby. It is a regular geospatial task, counting events in a particular buffer, but one that can be quite cumbersome if you have quite a few points to cross-reference. For one project I needed to calculate this for 4,500 street segments and intersections against a set of over 100,000 crime incidents. While this is not big data, if you go the brute force route and simply calculate all pair-wise distances, this results in 450 million comparisons. Since all I wanted was the number of counts within a particular distance buffer, KDTree’s offer a much more efficient search solution.

Here I give an example in Python using numpy and the nearest neighbor algorithms available in SciPy. This example is calculating the number of shootings in DC within 1 kilometer of schools. The shooting data is sensor data via ShotSpotter, and is publicly available at the new open data site. The school data I calculated the centroids based on school area polygons, available from the legacy DC data site. This particular example was motivated by this Urban Institute post.

There are around 170 schools and under 40,000 total shootings, so this would not be a big issue to calculate all the pairwise distances. It is a simple and quick example though. Here I have the IPython notebook and CSV files are available to download (using Wakari – the Anaconda free service).

So first we will just import the spatial data into numpy arrays. Note each file has the coordinates in projected meters.

import numpy as np
#getting schools and shootings from CSV files
schools = np.genfromtxt('DC_Schools.csv', delimiter=",", skip_header=1)
shootings = np.genfromtxt('ShotSpotter.csv', delimiter=",", skip_header=1, usecols=(4,5))

Next we can import the KDTree function, and then supply it with the two spatial coordinates for the shootings. While this is a relatively small file, even for my example larger set of crime data with over 100,000 incidents building the tree was really fast, less than a minute.

#making KDTree, and then searching within 1 kilometer of school
from sklearn.neighbors import KDTree
shoot_tree = KDTree(shootings)

Finally you can then search within a particular radius. You can either search one location at a time, but here I do a batch search and count the number of shootings that are within 1,000 meters from each school. Again this is a small example, but even with my 4,500 locations against 100,000 crimes it was very fast. (Whereas my initial SPSS code to do all 450 million pairwise combinations was taking all night.)

shoot_tree.query_radius(schools[:,[1,2]],r=1000,count_only=True)

Which produces an array of the counts for each of the schools:

array([1168,  179, 2384,  686,  583, 1475,  239, 1890, 2070,  990,   74,
        492,   10,  226, 2057, 1813, 1137,  785,  742, 1893, 4650, 1910,
          0,  926, 2380,  818, 2868, 1230,    0, 3230, 1103, 2253, 4531,
       1548,    0,    0,    0, 1002, 1912,    0, 1543,    0,  580,    4,
       1224,  843,  212,  591,    0,    0, 1127, 4520, 2283, 1413, 3255,
        926,  972,  435, 2734,    0, 2828,  724,  796,    1, 2016, 2540,
        369, 1903, 2216, 1697,  155, 2337,  496,  258,  900, 2278, 3123,
        794, 2312,  636, 1663,    0, 4896,  604, 1141,    7,    0, 1803,
          2,  283,  270, 1395, 2087, 3414,  143,  238,  517,  238, 2017,
       2857, 1100, 2575,  179,  876, 2175,  450, 5230, 2441,   10, 2547,
        202, 2659, 4006, 2514,  923,    6, 1481,    0, 2415, 2058, 2309,
          3, 1132,    0, 1363, 4701,  158,  410, 2884,  979, 2053, 1120,
       2048, 2750,  632, 1492, 1670,  242,  131,  863, 1246, 1361,  843,
       3567, 1311,  107, 1501,    0, 2176,  296,  803,    0, 1301,  719,
         97, 2312,  150,  475,  764, 2078,  912, 2943, 1453,  178,  177,
        389,  244,  874,  576,  699,  295,  843,  274])

And that is it! Quite simple.

The ShotSpotter data is interesting, with quite a few oddities that are worth exploring. I recently wrote a chapter on SPSS’s geospatial tools for an upcoming book on advanced SPSS features, providing a predictive police example predicting weekly shootings using the new geospatial temporal modelling tool. The book is edited by Keith McCormick and Jesus Salcedo, and I will write a blog post whenever it comes out highlighting what the chapter goes over.

Advertisements
Leave a comment

1 Comment

  1. Spatial join points to polygons using Python and SPSS | Andrew Wheeler

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: