A recent use case of mine I had around 60 million points that I wanted to assign to census block groups. ArcGIS was being problematic to simply load in the 60 million point dataset (let alone spatial join it), so I wrote some python code and will show using python and SPSS how to accomplish this.
First, a shout out to Rex Douglass and this blog post, I’ve adapted most of the python code here from that example. Also before we get started, it will be necessary to download several geospatial libraries for python. Here you need shapely, pyshp, and rtree. As a note, I have only been able to get these to install and work using the IOOS channel for Anaconda, e.g.
conda install -c ioos shapely rtree pyshp. (I have not been able to get fiona to work.)
The Python Part
So I will go through a quick rundown of the python code first. All of the data and code to run this yourself can be downloaded here. To start, I import all of the necessary libraries and functions.
import shapefile from rtree import index from shapely.geometry import Polygon, Point
The next step is to read in the polygon shapefile that we want to assign points to. Note you could swap this part out with fiona (if you can get it working!), but I just use the pyshp function
shapefile.Reader. Note you need to change the
data string to point to where the shapefile containing your polygons is located on your local machine.
#load in the shapefile of block groups data = r'C:\Users\axw161530\Dropbox\Documents\BLOG\Point_inPoly_PythonSPSS' bg_NYC = shapefile.Reader(data + r'\NYC_BG14_Proj.shp')
In my data these are block groups for New York city, and they are projected into feet using a local projection. (As an FYI, you can open up the “prj” file for shapefiles in a plain text editor to see the projection.) Now, the shapefile object,
bg_NYC here, has several iterables that you can access either the geometries or the records available. First we need to get those individual polygons and stuff into a list, and then convert into a Polygon object shapely can deal with.
bg_shapes = bg_NYC.shapes() #get the iterable for the polygon boundary points bg_points = [q.points for q in bg_shapes] #convert to list of geometry polygons = [Polygon(q) for q in bg_points] #convert to a shapely Polygon
Next I am going to do two things. First to make a vector that matches those Polygons to a particular id, I need to read in the data attributes from the shapefile. This is accomplished via the
.records() attribute. For US census geometries they have what is oft labeled a GEOID. In this example shapefile the GEOID ends up being in the second variable slot. The second thing I accomplish here is I build an rtree lookup. The motivation for this is, when we do a point in polygon check, it can be an expensive procedure the more polygons you have. You can first limit the number of potential polygons to check though by only checking whether a point falls within the bounding box of a polygon, and then do the more expensive operation on the actual (more complicated) boundary of the polygon.
#build spatial index from bounding boxes #also has a second vector associating area IDs to numeric id bg_records = bg_NYC.records() #bg_records is the geoid idx = index.Index() #creating an rtree c_id = 0 area_match =  for a,b in zip(bg_shapes,bg_records): area_match.append(b) idx.insert(c_id,a.bbox,obj=b) c_id += 1
Now we have all the necessary ingredients to make a function that inputs one X,Y point, and then returns a GEOID. First, the function turns the input X,Y points into a Point object shapely can work with. Second, it does the bounding box lookup I mentioned earlier, using the
idx rtree that is available in the global environment. Third, it loops over those resulting polygons that intersected the bounding box, and checks to see if the point is within that polygon using the shapely operation
point.within(polygon). If that is true, it returns the associated GEOID, and if none are found it returns None. Again, the objects in this function
area_match are taken from the global environment. A few additional notes: it will return the first point in polygon found, so if you have overlapping polygons this will simply return the first, not necessarily all of them. That is not the case with our census polygons here though. Second, the functionality here is for a point on the exact border between two polygons to return False.
#now can define function with polygons, area_match, and idx as globals def assign_area(x,y): point = Point(x,y) for i in idx.intersection((x,y,x,y)): if point.within(polygons[i]): return area_match[i] return None #note points on the borders will return None
To test this function I have a set of points in New York for this particular projection already associated with a GEOID.
#now testing test_vec = [(1003610, 239685, '360050063002'), (1006787, 240666, '360050183022'), ( 993580, 219484, '360610122001'), ( 986385, 214971, '360610115001'), ( 947148, 167688, '360850201001'), ( 0, 0, 'Miss')] for a,b,c in test_vec: print [assign_area(x=a,y=b),c]
And this should subsequently print out at your console:
['360050063002', '360050063002'] ['360050183022', '360050183022'] ['360610122001', '360610122001'] ['360610115001', '360610115001'] ['360850201001', '360850201001'] [None, 'Miss']
For those wishing to do this in vectorized in python, check out the GeoPanda’s functionality. But here I let it churn out one by one by using SPSS.
The SPSS Part
So once the above function is defined in your SPSS environment, we can simply use
SPSSINC TRANS to assign XY data to a block group. Here is a quick example. First we read in some data, this is the homicide data from the New York times discussed here. It has the points projected in the same feet as the polygons were.
*Conducting point in polygon tests with Python and SPSS. FILE HANDLE data /NAME = "C:\Users\axw161530\Dropbox\Documents\BLOG\Point_inPoly_PythonSPSS". *Read in the NYC homicide data. GET TRANSLATE FILE='data\HomPoints_JoinBG.dbf' /TYPE=DBF /MAP . DATASET NAME HomData.
Now I am going to use the SPSS command
SHOW to display the current date and time, (so you can see how long the operation takes). This dataset has 4,021 cases of homicide, and the set of polygons we are matching to has around 6,500 block groups. The time the operation takes depends on both, but the rtree search should make the number of polygons not as big a deal as simply looping through all of them. Second, I use SPSSINC TRANS to call the python function we previously constructed. Third, this dataset already has the GEOID matched to the points (via ArcGIS), so I check to make sure I get the same results as ArcGIS. In this example there are quite a few points that ArcGIS failed to return a match for, but this operation does. (It would take more investigation on my part though as to why that is the case.)
*Use this to show timing. SHOW $VAR. *Now using SPSSINC TRANS to assign geoid. SPSSINC TRANS RESULT=GeoID2 TYPE=12 /FORMULA "assign_area(x=XFt,y=YFt)". SHOW $VARS. *Check that the operations are all correct (as compared to ArcGIS) COMPUTE Check = (GEOID = GEOID2). FREQ Check.
This example runs almost instantly. For some tests with my bigger dataset of 60 million, matching half a million points to this set of polygons took around 12 minutes.
Again, all of the data and code to run this at once can be downloaded here. I will need to make a blog post at some point of using pyproj to project point data in SPSS as well, such as to go to and from Lat-Lon to a local projection. You probably always want to do geometric operations like this and buffers with projected data, but you may get the data in Lat-Lon or want to export data in Lat-Lon to use online maps.
For those working with crime data, I oft complain that crime is frequently on the borders of census geographies. But due to slight differences in resolution, most GIS systems will still assign crime points to census geographies. I’m not sure if it is a big problem for much analysis in our field, but the proportion on the border is clearly quite large in some instances. For things that can occur often outdoors, like robberies and field stops, the proportion is even higher because crime is often recorded at intersections (I have estimates for the percentage of crimes at intersections for 14 years in Albany in this paper). So the problem depends on the crime type or nature of the incident (traffic stops are almost always listed at intersections), but I have seen analysis I would bet over 50% of the incidents are on the border of census blocks and/or block groups.
A general way to check this in GIS is to turn your polygon data into lines, and then assign points to the nearest line and check the distance. You will see many points that are very close to the border (say within 5 meters) that really should be undetermined.