Scraping Meth Labs with Python

For awhile in my GIS courses I have pointed to the DEA’s website that has a list of busted meth labs across the county, named the National Clandestine Laboratory Register. Finally a student has shown some interest in this, and so I spent alittle time writing a scraper in Python to grab the data. For those who would just like the data, here I have a csv file of the scraped labs that are geocoded to the city level. And here is the entire SPSS and Python script to go from the original PDF data to the finished product.

So first off, if you visit the DEA website, you will see that each state has its own PDF file (for example here is Texas) that lists all of the registered labs, with the county, city, street address, and date. To turn this into usable data, I am going to do three steps in Python:

  1. download the PDF file to my local machine using urllib python library
  2. convert that PDF to an xml file using the pdftohtml command line utility
  3. use Beautifulsoup to parse the xml file

I will illustrate each in turn and then provide the entire Python script at the end of the post.

So first, lets import the libraries we need, and also note I downloaded the pdftohtml utility and placed that location as a system path on my Windows machine. Then we need to set a folder where we will download the files to on our local machine. Finally I create the base url for our meth labs.

from bs4 import BeautifulSoup
import urllib, os

myfolder = r'C:\Users\axw161530\Dropbox\Documents\BLOG\Scrape_Methlabs\PDFs' #local folder to download stuff
base_url = r'' #online site with PDFs for meth lab seizures

Now to just download the Texas pdf file to our local machine we would simply do:

a = 'tx'
url = base_url + r'/' + a + '.pdf'
file_loc = os.path.join(myfolder,a)
urllib.urlretrieve(url,file_loc + '.pdf')

If you are following along and replaced the path in myfolder with a folder on your personal machine, you should now see the Texas PDF downloaded in that folder. Now I am going to use the command line to turn this PDF into an xml document using the os.system() function.

#Turn to xml with pdftohtml, does not need xml on end
cmd = 'pdftohtml -xml ' + file_loc + ".pdf " + file_loc

You should now see that there is an xml document to go along with the Texas file. You can check out its format using a text editor (wordpress does not seem to like me showing it here).

So basically we can use the top and the left attributes within the xml to identify what row and what column the items are in. But first, we need to read in this xml and turn it into a BeautifulSoup object.

MyFeed = open(file_loc + '.xml')
textFeed =
FeedParse = BeautifulSoup(textFeed,'xml')

Now the FeedParse item is a BeautifulSoup object that you can query. In a nutshell, we have a top level page tag, and then within that you have a bunch of text tags. Here is the function I wrote to extract that data and dump it into tuples.

#Function to parse the xml and return the line by line data I want
def ParseXML(soup_xml,state):
    data_parse = []
    page_count = 1
    pgs = soup_xml.find_all('page')
    for i in pgs:
        txt = i.find_all('text')
        order = 1
        for j in txt:
            value = j.get_text() #text
            top = j['top']
            left = j['left']
            dat_tup = (state,page_count,order,top,left,value)
            order += 1
        page_count += 1
    return data_parse

So with our Texas data, we could call ParseXML(soup_xml=FeedParse,state=a) and it will return all of the data nested in those text tags. We can just put these all together and loop over all of the states to get all of the data. Since the PDFs are not that large it works quite fast, under 3 minutes on my last run.

from bs4 import BeautifulSoup
import urllib, os

myfolder = r'C:\Users\axw161530\Dropbox\Documents\BLOG\Scrape_Methlabs\PDFs' #local folder to download stuff
base_url = r'' #online site with PDFs for meth lab seizures
state_ab = ['al','ak','az','ar','ca','co','ct','de','fl','ga','guam','hi','id','il','in','ia','ks',
state_name = ['Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','Florida','Georgia','Guam','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas',
              'Kentucky','Louisiana','Maine','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey',
              'New Mexico','New York','North Carolina','North Dakota','Ohio','Oklahoma','Oregon','Pennsylvania','Rhode Island','South Carolina','South Dakota','Tennessee','Texas',
              'Utah','Vermont','Virginia','Washington','West Virginia','Wisconsin','Wyoming','Washington DC']

all_data = [] #this is the list that the tuple data will be stashed in

#Function to parse the xml and return the line by line data I want
def ParseXML(soup_xml,state):
    data_parse = []
    page_count = 1
    pgs = soup_xml.find_all('page')
    for i in pgs:
        txt = i.find_all('text')
        order = 1
        for j in txt:
            value = j.get_text() #text
            top = j['top']
            left = j['left']
            dat_tup = (state,page_count,order,top,left,value)
            order += 1
        page_count += 1
    return data_parse

#This loops over the pdfs, downloads them, turns them to xml via pdftohtml command line tool
#Then extracts the data

for a,b in zip(state_ab,state_name):
    #Download pdf
    url = base_url + r'/' + a + '.pdf'
    file_loc = os.path.join(myfolder,a)
    urllib.urlretrieve(url,file_loc + '.pdf')
    #Turn to xml with pdftohtml, does not need xml on end
    cmd = 'pdftohtml -xml ' + file_loc + ".pdf " + file_loc
    #parse with BeautifulSoup
    MyFeed = open(file_loc + '.xml')
    textFeed =
    FeedParse = BeautifulSoup(textFeed,'xml')
    #Extract the data elements
    state_data = ParseXML(soup_xml=FeedParse,state=b)
    all_data = all_data + state_data

Now to go from those sets of tuples to actually formatted data takes a bit of more work, and I used SPSS for that. See here for the full set of scripts used to download, parse and clean up the data. Basically it is alittle more complicated than just going from long to wide using the top marker for the data as some rows are off slightly. Also there is complications for long addresses being split across two lines. And finally there are just some data errors and fields being merged together. So that SPSS code solves a bunch of that. Also that includes scripts to geocode the to the city level using the Google geocoding API.

Let me know if you do any analysis of this data! I quickly made a time series map of these events via CartoDB. You can definately see some interesting patterns of DEA concentration over time, although I can’t say if that is due to them focusing on particular areas or if they are really the areas with the most prevalent Meth lab problems.

Spatial join points to polygons using Python and SPSS

A recent use case of mine I had around 60 million points that I wanted to assign to census block groups. ArcGIS was being problematic to simply load in the 60 million point dataset (let alone spatial join it), so I wrote some python code and will show using python and SPSS how to accomplish this.

First, a shout out to Rex Douglass and this blog post, I’ve adapted most of the python code here from that example. Also before we get started, it will be necessary to download several geospatial libraries for python. Here you need shapely, pyshp, and rtree. As a note, I have only been able to get these to install and work using the IOOS channel for Anaconda, e.g. conda install -c ioos shapely rtree pyshp. (I have not been able to get fiona to work.)

The Python Part

So I will go through a quick rundown of the python code first. All of the data and code to run this yourself can be downloaded here. To start, I import all of the necessary libraries and functions.

import shapefile
from rtree import index
from shapely.geometry import Polygon, Point

The next step is to read in the polygon shapefile that we want to assign points to. Note you could swap this part out with fiona (if you can get it working!), but I just use the pyshp function shapefile.Reader. Note you need to change the data string to point to where the shapefile containing your polygons is located on your local machine.

#load in the shapefile of block groups
data = r'C:\Users\axw161530\Dropbox\Documents\BLOG\Point_inPoly_PythonSPSS'
bg_NYC = shapefile.Reader(data + r'\NYC_BG14_Proj.shp')

In my data these are block groups for New York city, and they are projected into feet using a local projection. (As an FYI, you can open up the “prj” file for shapefiles in a plain text editor to see the projection.) Now, the shapefile object, bg_NYC here, has several iterables that you can access either the geometries or the records available. First we need to get those individual polygons and stuff into a list, and then convert into a Polygon object shapely can deal with.

bg_shapes = bg_NYC.shapes()  #get the iterable for the polygon boundary points
bg_points = [q.points for q in bg_shapes] #convert to list of geometry
polygons = [Polygon(q) for q in bg_points] #convert to a shapely Polygon

Next I am going to do two things. First to make a vector that matches those Polygons to a particular id, I need to read in the data attributes from the shapefile. This is accomplished via the .records() attribute. For US census geometries they have what is oft labeled a GEOID. In this example shapefile the GEOID ends up being in the second variable slot. The second thing I accomplish here is I build an rtree lookup. The motivation for this is, when we do a point in polygon check, it can be an expensive procedure the more polygons you have. You can first limit the number of potential polygons to check though by only checking whether a point falls within the bounding box of a polygon, and then do the more expensive operation on the actual (more complicated) boundary of the polygon.

#build spatial index from bounding boxes
#also has a second vector associating area IDs to numeric id
bg_records = bg_NYC.records() #bg_records[0][1] is the geoid
idx = index.Index() #creating an rtree
c_id = 0
area_match = []
for a,b in zip(bg_shapes,bg_records):
    c_id += 1

Now we have all the necessary ingredients to make a function that inputs one X,Y point, and then returns a GEOID. First, the function turns the input X,Y points into a Point object shapely can work with. Second, it does the bounding box lookup I mentioned earlier, using the idx rtree that is available in the global environment. Third, it loops over those resulting polygons that intersected the bounding box, and checks to see if the point is within that polygon using the shapely operation point.within(polygon). If that is true, it returns the associated GEOID, and if none are found it returns None. Again, the objects in this function idx, polygons, and area_match are taken from the global environment. A few additional notes: it will return the first point in polygon found, so if you have overlapping polygons this will simply return the first, not necessarily all of them. That is not the case with our census polygons here though. Second, the functionality here is for a point on the exact border between two polygons to return False.

#now can define function with polygons, area_match, and idx as globals
def assign_area(x,y):
    point = Point(x,y)
    for i in idx.intersection((x,y,x,y)): 
        if point.within(polygons[i]):
            return area_match[i]
    return None
#note points on the borders will return None

To test this function I have a set of points in New York for this particular projection already associated with a GEOID.

#now testing
test_vec = [(1003610, 239685, '360050063002'),
            (1006787, 240666, '360050183022'),
            ( 993580, 219484, '360610122001'),
            ( 986385, 214971, '360610115001'),
            ( 947148, 167688, '360850201001'),
            (      0,      0, 'Miss')]

for a,b,c in test_vec:
    print [assign_area(x=a,y=b),c]

And this should subsequently print out at your console:

['360050063002', '360050063002']
['360050183022', '360050183022']
['360610122001', '360610122001']
['360610115001', '360610115001']
['360850201001', '360850201001']
[None, 'Miss']

For those wishing to do this in vectorized in python, check out the GeoPanda’s functionality. But here I let it churn out one by one by using SPSS.

The SPSS Part

So once the above function is defined in your SPSS environment, we can simply use SPSSINC TRANS to assign XY data to a block group. Here is a quick example. First we read in some data, this is the homicide data from the New York times discussed here. It has the points projected in the same feet as the polygons were.

*Conducting point in polygon tests with Python and SPSS.
FILE HANDLE data /NAME = "C:\Users\axw161530\Dropbox\Documents\BLOG\Point_inPoly_PythonSPSS".
*Read in the NYC homicide data.
GET TRANSLATE FILE='data\HomPoints_JoinBG.dbf' /TYPE=DBF /MAP .

Now I am going to use the SPSS command SHOW to display the current date and time, (so you can see how long the operation takes). This dataset has 4,021 cases of homicide, and the set of polygons we are matching to has around 6,500 block groups. The time the operation takes depends on both, but the rtree search should make the number of polygons not as big a deal as simply looping through all of them. Second, I use SPSSINC TRANS to call the python function we previously constructed. Third, this dataset already has the GEOID matched to the points (via ArcGIS), so I check to make sure I get the same results as ArcGIS. In this example there are quite a few points that ArcGIS failed to return a match for, but this operation does. (It would take more investigation on my part though as to why that is the case.)

*Use this to show timing.

*Now using SPSSINC TRANS to assign geoid.
  /FORMULA "assign_area(x=XFt,y=YFt)".

*Check that the operations are all correct (as compared to ArcGIS)
FREQ Check.

This example runs almost instantly. For some tests with my bigger dataset of 60 million, matching half a million points to this set of polygons took around 12 minutes.

To End

Again, all of the data and code to run this at once can be downloaded here. I will need to make a blog post at some point of using pyproj to project point data in SPSS as well, such as to go to and from Lat-Lon to a local projection. You probably always want to do geometric operations like this and buffers with projected data, but you may get the data in Lat-Lon or want to export data in Lat-Lon to use online maps.

For those working with crime data, I oft complain that crime is frequently on the borders of census geographies. But due to slight differences in resolution, most GIS systems will still assign crime points to census geographies. I’m not sure if it is a big problem for much analysis in our field, but the proportion on the border is clearly quite large in some instances. For things that can occur often outdoors, like robberies and field stops, the proportion is even higher because crime is often recorded at intersections (I have estimates for the percentage of crimes at intersections for 14 years in Albany in this paper). So the problem depends on the crime type or nature of the incident (traffic stops are almost always listed at intersections), but I have seen analysis I would bet over 50% of the incidents are on the border of census blocks and/or block groups.

A general way to check this in GIS is to turn your polygon data into lines, and then assign points to the nearest line and check the distance. You will see many points that are very close to the border (say within 5 meters) that really should be undetermined.

Downloading and reading in American Community Survey Data: Python and SPSS

I had a prior blog post on working with American Community Survey data in SPSS. The meta-data format has changed from that example though, and the Census gives out comma separated files and xls Templates now. So this will be an update, and I have good stuff for those working strictly in python, as well as those wanting to load the data is SPSS.

So first, when downloading the small geographies from the Census’s FTP site, they have a ton of files. See this page, which contains the 5 year estimates for 2014 for New York block groups and tracts. Now instead of downloading each zip file one by one, we can write a quick python script to download all the files.

import urllib, os

downFold = r'C:\Users\axw161530\Dropbox\Documents\BLOG\ACS_Python_SPSS\Data'
base = r''

for i in range(1,5):  #change range(1,5) to range(1,122) to download all zip files
    file = "20145ny0" + str(i).zfill(3) + ""
    urllib.urlretrieve(base + file, os.path.join(downFold,file))

#also download the geography file
urllib.urlretrieve(base + "g20145ny.csv", os.path.join(downFold,"g20145ny.csv"))

The downFold string is where the files will be downloaded to (so change that to a place on your local machine), and the base string ends up being the base URL for that particular set of files. The files go from 1 to 121 in that example, but just to keep the time down I only download tables one through four. The second urlib.urlretrieve line downloads the geography csv file (we won’t be using the other geography file, which is the same data but in tab delimited format).

Now we can go and download the meta data excel file shells. For this dataset they are located here. Here we want the 5 year templates. Once that data is downloaded, then unzip all of the files. You could technically do this in python as well, but I just use 7zip, as that has a handy dialog to unzip multiple files to the same place.

So the way the files work, there are a set of estimate and margin of error text files that are comma delimited that have the demographic characteristics. (Here for all block groups and census tracts in New York.) The xls excel files contain the variable names, along with a brief description for the variables.

If you are a hipster and only do data analysis in python, here is a function that takes the location to a xls template file and the corresponding data file and reads it into a pandas data frame.

#this reads in american community survey data
import xlrd
import pandas as pd

def readACS(Template,Data):
    book = xlrd.open_workbook(Template) #first open the xls workbook
    sh = book.sheet_by_index(0)
    vars = [i.value for i in sh.row(0)] #names on the first row
    labs = [i.value for i in sh.row(1)] #labels on the second
    #this rewrites duplicate 'BLANK' names, mangle dups not working for me
    n = 0
    vars2 = []
    for i in range(len(vars)):
        if vars[i] == 'BLANK':
            n += 1
            vars2.append('BLANK.' + str(n))
    #check for if geo file or data file
    if vars2[1] == 'FILETYPE':
        df = pd.read_csv(Data,names=vars2,dtype={'FILETYPE':'object'})
        df = pd.read_csv(Data,names=vars2)
    return df,zip(vars2,labs)

In a nutshell, it reads the metadata column names and labels from the excel spreadsheet, then reads in the csv file with the data. It returns two objects, the one on the left is a pandas dataframe, and the one on the right is a zipped up list of the variable names and the variable labels. This would be a bit simpler, except that the format for the geo dataset is a little different than all the data files and contains multiple “BLANK” fields (the mangle_dupe_cols option in read_csv is not behaving like I expect it to). For the non-geographic file, I need to tell python the filetype column is a string, else it interprets the “e” in the estimate files as a scientific number (e.g. 1e5 = 100,000).

So here is an example of using this function to grab the second table. When I unzipped the excel templates, it nested the data templates in another subfolder, hence the TemplateFold string.

TemplateFold = downFold + r'\seq'
Tab002,Meta002 = readACS(TemplateFold + r'\Seq2.xls',downFold + r'\e20145ny0002000.txt')

If you want to check out all the variable labels, you can then do:

for i in Meta002:
    print i 

Or if you want to turn that into a dictionary you can simply do dict(Meta002). If you wanted to import all 121 tables and merge them you should be able to figure that out in a simple loop from here (note the “x.zfill(n)” function to pad the integers with leading zeroes). But I typically only work with a select few tables and variables at a time, so I won’t worry about that here.

The function works the same with the geographic data and its template. (Which that metadata template is not nested in the further down seq folder.)

GeoDat,MetaGeo = readACS(downFold + r'\2014_SFGeoFileTemplate.xls',downFold + r'\g20145ny.csv')

Note if you are working with both the estimates and the margin of error files, you may want to put somewhere in the code to change the variable names to signify that, such as by putting a suffix of “e” or “m”. If you just work with the estimates though you don’t need to worry about that.

Reading ACS data into SPSS

For those working in SPSS, I’ve shown previously how to turn python data into SPSS data. I’ve started working on a function to make this simpler with pandas dataframes, but I will hold off on that for now (need to figure out datetimes and NaN’s). So what I did here was grab the meta-data from the template xls file (same as before), but from that build the necessary DATA LIST command in SPSS, and then just submit the command. SPSS has the added benefit of having native meta-data fields, so I can also add in the variable labels. Also, this only uses the xlrd library, in case you do not have access to pandas. (I point SPSS to Anaconda, instead of worrying about using pip with the native SPSS python install.)

So in SPSS, you would first define this function

*This just builds the necessary SPSS program to read in the american community survey data.
#creating my own function to read in data
import xlrd, spss
def OpenACS(Template,Data):
  book = xlrd.open_workbook(Template)
  sh = book.sheet_by_index(0)
  vars = [i.value for i in sh.row(0)]
  labs = [i.value for i in sh.row(1)]
  #this rewrites duplicate 'BLANK' names, mangle dups not working for me
  n = 0
  vars2 = []
  for i in range(len(vars)):
    if vars[i] == 'BLANK':
      n += 1
      vars2.append('BLANK.' + str(n))
    #check for if geo file or data file
  if vars2[1] == 'FILETYPE':  #regular data
    ncols = sh.ncols - 6 #need the end of the number of variables
    ext =  ' (' + str(ncols) + 'F7.0)'
    v2 = '\n '.join(vars2[6:])
    Tab = Data[-10:-7] #Names the dataset based on the table number
  else: #geo data file
    ncols = sh.ncols
    ext =  ' (' + str(ncols) + 'A255)' #255 should be big enough to fit whatever str
    v1 = " / "
    v2 = '\n '.join(vars2)
    Tab = "Geo"
  #this sets errors off, implicit missing data for block groups
  #now creating the import program to read in the data
  begin = "DATA LIST LIST(',') FILE = '%s'" % (Data)
  full_str = begin + v1 + v2 + ext + "\n."
  #now reading in the dataset
  #assigning a dataset name
  datName = "DATASET NAME Table" + Tab + "." 
  #now adding in the variable labels
  for i,j in zip(vars2,labs):
    #replaces double quotes with single quotes in label
    strVal = """VARIABLE LABELS %s "%s".""" % (i,j.replace('"',"'"))
  if Tab == "Geo":
    spss.Submit("ALTER TYPE ALL (A = AMIN).")

Again this is much shorter if I only needed to worry about the data files and not the geo file, but that slight formatting difference is a bit of a pain. Here I use the errors off trick to suppress the data list errors for missing data (which is intended, as not all of the data is available at the block group level). But you will still get error messages in the SPSS syntax bottom section. They can be ignored if it is the “insufficient data” warning.

Here is an example of using this python function now to read the data into SPSS. This automatically assigns a dataset name, either based on the Table number, or “Geo” for the geographic data.

*Now reading in the data files I want.
downFold = r'C:\Users\axw161530\Dropbox\Documents\BLOG\ACS_Python_SPSS\Data'
TemplateFold = downFold + r'\seq'

#reading in Data file, table number 2
OpenACS(TemplateFold + r'\Seq2.xls',downFold + r'\e20145ny0002000.txt')

#reading in Geo file
OpenACS(downFold + r'\2014_SFGeoFileTemplate.xls',downFold + r'\g20145ny.csv')

And Voila, there is your small area American Community Survey data in SPSS. This will produce two different datasets, “Table002” and “TableGeo” that can be merged together via MATCH FILES.

Let me know in the comments if you have already worked out a function to turn pandas dataframes into SPSS datasets.

Added code snippets page

I’ve written quite a few blog posts over the years, and it is getting to be hard for me to keep all of them orderly. So I recently created a page with a simple table of contents¬†with code snippets broken down into categories. There are also a few extra SPSS macros in there that I have not written blog posts about.

Every now and then I see a broken link and update it, but I know there are more I am missing. So just send an email if I have a link to some of my code and it is currently broken.


Sentence length in academic articles

While reviewing a paper recently it struck me that the content was (very) good, but the writing was stereotypical academic. My first impression was that this was caused by monotonously long sentences. See this advice from Gary Provost (via Francis Diebold). Part of the reason why long sentences are undesirable is not only for aesthetic reasons though — longer sentences are harder to parse, hold in memory, and subsequently understand. See Steven Pinker’s The Sense of Style writing guide for discussion.

So I did some text analysis of the sentences. To do the text analysis I used the nltk library in python, and here is the IPython notebook to replicate for yourself if you care to do so. In the notebook I have saved two text corpuses, one my finished draft of this article. I compared the sentence length to Mark Twain’s Huckleberry Finn (text via here).

For a simple example getting started with the library, here is an example of tokenizing a string into words and sentences:

#some tests for, nice book to follow along
import nltk'punkt') #need to download this for the English sentence tokenizer files

#this splits up punctuation
test = """At eight o'clock on Thursday morning Arthur didn't feel very good. This is a second sentence."""
tokens = nltk.word_tokenize(test)
print tokens

ts = nltk.sent_tokenize(test)
print ts

The first prints out each individual word (plus punctuation in some circumstances) and the second marks individual sentences. I have the line'punkt') commented out, as I downloaded it once already. (Running once in Wakari I did not need to download it again – I presume it would work similarly on your local machine.)

So what I did was transfer the PDF document I was reviewing to a text file and then clean up things like the section headers (ditto for my academic articles I compare it to). In Huckleberry I took out the table of contents and the "CHAPTER ?" parts. I also started a list of variables that were parsed as words but that I did not want to count after the sentences and words were tokenized. For example, an inline cite such as (X, 1996) would be split into 4 words with the original tokenizer, (, X, 1996 and ). The "x96" is an en-dash. Below takes those instances out.

#Get the corpus
f = open('SmallSample_Corpus.txt')
raw =

#Count number of sentences
sent_tok = nltk.sent_tokenize(raw)
ns = len(sent_tok)

#Count number of words
word_tok = nltk.word_tokenize(raw) #need to take out commas plus other stuff
NoWord = [',','(',')',':',';','.','%','\x96','{','}','[',']','!','?',"''","``"]
word_tok2 = [i for i in word_tok if i not in NoWord]
nw = len(word_tok2)

#Average Sentence length are words divided by sentences
print float(nw)/ns

There are inevitably more instances of things that shouldn’t be counted as words, but that makes the sentences longer on average. For example, I spotted a few possessive 's that were listed as different words. (The nltk library is smart and lists contractions as seperate words.)

So someone may know a better way to count the words, but all the articles should have the same biases. In my tests, here are the average number of words per sentence:

  • article I was reviewing, 28
  • my small sample article, 27
  • my working article (that has not undergone review), 25
  • Huck Finn, 20

So the pot is calling the kettle black here – my writing is not much better. I looked at the difference between an in-print article and a working draft, as responses to reviewers I bet will make the sentences longer. Hedges in statements that academics love.

Looking at the academic article histograms they are fairly symmetric, confirming my impression about monotonous sentence length. To make the histograms I used the panda’s library, which has a nice simple method.

sent_len = []
for i in sent_tok:
    sent_w1 = nltk.word_tokenize(i)
    sent_w2 = [i for i in sent_w1 if i not in NoWord]

import pandas as pd

dfh = pd.DataFrame(sent_len)
dfh.hist(bins = 50);

Here is the histogram for my small sample paper:

And here it is for Huck Finn

(I’m not much of an exemplar for making graphs in python – forgive the laziness in the figures.) Apparently analyzing sentence length has a long history, see a paper by G. Udny Yule in 1939! From a quick perusal the long right tail is more usual for analyzing texts. The symmetry I see for this sample of academic articles is not the norm.

There could be more innocuous reasons for this. Huck Finn has dialogue with shorter sentences, and the academic articles have numbers and citations. (Although I think it is reasonable to count those things towards sentence complexity, "1" or "one" should have the same complexity.)

I will have to keep this in mind in the future (maybe I should write my articles in poem form)!

Neighborhoods in Albany according to Google

One of the most vexing aspects of spatial analysis in the social sciences in the concept of neighborhoods. There is a large literature on neighborhood effects in criminology, but no one can really define a neighborhood. For analysis they are most often assumed to approximately conform to census areas (like tracts or blocks). Sometimes there are obvious physical features that divide neighborhoods (most often a major roadway), but more often boundaries are fuzzy.

I’ve worked on several surveys (at the Finn Institute) in which we ask people what neighborhood they live in as well as the nearest intersection to their home. Even where there is a clear border, often people say the “wrong” neighborhood, especially near the borders. IIRC, when I calculated the wrongness for one survey in Syracuse we did it was only around 60% of the time the respondents stated they lived the right neighborhood. I do scare quotes around “wrong” because it is obviously arbitrary where people draw the boundaries, so more people saying the wrong neighborhood is indicative of the borders being misaligned than the respondents being wrong.

For this reason I like the Google maps approach in which they just place a label at the approximate center of noteworthy neighborhoods. I emulated this for a recent background map I made for a paper in Albany. (Maps can be opened in a separate tab to see a larger image.)

As background I did not grow up in Albany, but I’ve lived and worked in the Capital District since I came up to Albany for grad school – since 2008. Considering this and the fact that I make maps of Albany on a regular basis is my defense I have a reasonable background to make such judgements.

When looking at Google’s reverse geocoding API the other day I noticed they returned a neighborhood field in the response. So I created a regular sampling grid over Albany to see what they return. First, lets see my grid and where Google actually decides some neighborhood exists. Large grey circles are null, and small red circles some neighborhood label was returned. I have no idea where Google culls such neighborhood labels from.

See my python code at the end of the post to see how I extracted this info. given an input lat-lng. In the reverse geo api they return multiple addresses – but I only examine the first returned address and look for a neighborhood. (So I could have missed some neighborhoods this way – it would take more investigation.)

Given the input fishnet I then dissolved the neighborhood labels into areas. Google has quite a few more specific neighborhoods than me.

I’ve never really made much of a distinction between West Hill and Arbor Hill – although the split is clearly at Henry Johnson. Also I tend to view Pine Hill as the triangle between Western and Central before the State campus – but Google and others seem to disagree with me. What I call the Pinebush Google calls the Dunes. Dunes is appropriate, because it actually has sand dunes, but I can’t recall anyone referring to it as that. Trees are pretty hard to come by in Arbor Hill though, so don’t be misled. Also kill is Dutch for creek, so you don’t have to worry that Normanskill is such a bad place (even if your name is Norman).

For a third opinion, see

You can see more clearly in this map how Pine Hill’s area goes south of Madison. Google maps has a fun feature showing related maps, and so they show a related map on someones take for where law students should or should not get an apartment. In that map you can see that south of Madison is affectionately referred to as the student ghetto. That comports with my opinion as well, although I did not think putting student ghetto was appropriate for my basemap for a journal article!

People can’t seem to help but shade Arbor Hill in red. Which sometimes may be innocent – if red is the first color used in defaults (as Arbor Hill will be the first neighborhood in an alphabetic list). But presumably the law student making the apartment suggestions map should know better.

In short, it would be convenient for me (as a researcher) if everyone could agree with what a neighborhood is and where its borders are, but that is not reality.

Here is the function in Python to grab the neighborhood via the google reverse geocoding API. Here if it returns anything it grabs the first address returned and searches for the neighborhood in the json. If it does not find a neighborhood it returns None.

#Reverse geocoding and looking up neighborhoods
import urllib, json

def GoogRevGeo(lat,lng,api=""):
  base = r""
  GeoUrl = base + "latlng=" + str(lat) + "," + str(lng) + "&key=" + api
  response = urllib.urlopen(GeoUrl)
  jsonRaw =
  jsonData = json.loads(jsonRaw)
  neigh = None
  if jsonData['status'] == 'OK':
    for i in jsonData['results'][0]['address_components']:
      if i['types'][0] == 'neighborhood':
        neigh = i['long_name']
  return neigh

Using the Google Geocoding API with Python

Previously I posted how to use the geopy python library to call the Google geocode API. But somewhere along the way my version of geopy was not working (maybe because the API changed). Instead of figuring out that problem, I just wrote my own function to call the Google API directly. No need to worry about installing geopy.

Part of the reason I blog is so I have notes for myself – I’m pretty sure I’ve rewritten this several times for different quick geocoding projects, but I couldn’t find them this morning when I needed to do it again. So here is a blog post for my own future reference.

Here is the function, it takes as input the full string address. Also I was getting back some null responses by rapid fire calling the API (with only 27 addresses), so I set the function to delay for five seconds and that seemed to fix that problem.

import urllib, json, time
def GoogGeoAPI(address,api="",delay=5):
  base = r""
  addP = "address=" + address.replace(" ","+")
  GeoUrl = base + addP + "&key=" + api
  response = urllib.urlopen(GeoUrl)
  jsonRaw =
  jsonData = json.loads(jsonRaw)
  if jsonData['status'] == 'OK':
    resu = jsonData['results'][0]
    finList = [resu['formatted_address'],resu['geometry']['location']['lat'],resu['geometry']['location']['lng']]
    finList = [None,None,None]
  time.sleep(delay) #in seconds
  return finList

And here is an example use of the function. It returns the formatted address, the latitude and the longitude.

#Example Use
test = r"1600 Amphitheatre Parkway, Mountain View, CA"
geoR = GoogGeoAPI(address=test)
print geoR

This works for a few addresses without an API key. Even with an API key though the limit I believe is 2,500 – so don’t use this to geocode a large list. Also if you have some special characters in your address field this will take more work. For example if you have an & for an intersection I bet this url call will fail. But that should not be too hard to deal with. Also note the terms of service for using the API (which I don’t understand – so don’t ask me!)

I should eventually wrap up all of this google API python code into an extension for SPSS. Don’t hold your breath though for me getting the time to do that.

Turning SPSS data into Python data

Previously I blogged about how to take Python data and turn it back into SPSS data. Here we are going to do the opposite — turn SPSS data into Python objects. First to start out we will make a simple dataset of three variables.

DATA LIST Free /X Y (2F1.0) Z (A1). 
1 2 A
4 5 B
7 8 C

To import this data into Python, we need to import the spss class of functions, which then you can read cases from the active dataset using the Cursor attribute. Here is an example of grabbing all of the cases.

*Importing all of the data.
import spss
dataCursor = spss.Cursor()
AllData = dataCursor.fetchall()
print AllData

What this then prints out is ((1.0, 2.0, 'A'), (4.0, 5.0, 'B'), (7.0, 8.0, 'C')), a set of nested tuples. You can also just grab one case by replacing dataCursor.fetchall() with dataCursor.fetchone(), in which case it will just return one tuple.

To only grab particular variables from the list, you can pass a set of indices in the spss.Cursor object. Remember, Python indices start at zero, so if you want the first and second variables in the dataset, you need to grab the 0 and 1 indices.

*Only grabbing certain variables.
dataNum = spss.Cursor([0,1])
spNumbers = dataNum.fetchall()
print spNumbers

This subsequently prints out ((1.0, 2.0), (4.0, 5.0), (7.0, 8.0)). When grabbing one variable, you may want just a list of the objects instead of the nested tuples. Here I use list comprehension to turn the resulting tuples for the Z variable into a nice list.

*Converting to a nice list.
dataAlp = spss.Cursor([2])
spAlp = dataAlp.fetchall()
spAlp_list = [i[0] for i in spAlp] #convert to nice list
print spAlp
print spAlp_list

The first print object is (('A',), ('B',), ('C',)), but the second is ['A', 'B', 'C'].

The above code works fine if you know the position of the variable in the file, but if the position can change this won’t work. Here is a one liner to get the variable names of the active dataset and plop them in a list.

*Way to get SPSS variable names.
varList = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
print varList

Now if you have your list of variable names you want, you can figure out the index value. There are two ways to do it, iterate over the list of variable names in the dataset, or iterate over the list of your specified variables. I do the latter here (note this will result in an error if you supply a variable name not in the dataset).

*Find the indices of specific variables.
LookVars = ["X","Z"]
VarInd = [varList.index(i) for i in LookVars]
print VarInd

Now you can just supply VarInd above to the argument for spss.Cursor to grab those variables. Here I wrapped it all up in a function.

*Easy function to use.
import spss
def AllSPSSdat(vars):
  if vars == None:
    varNums = range(spss.GetVariableCount())
    allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
    varNums = [allvars.index(i) for i in vars]
  data = spss.Cursor(varNums)
  pydata = data.fetchall()
  return pydata

You can either supply a list of variables or None, in the latter case all of the variables are returned.

MyDat = AllSPSSdat(vars=["Y","Z"])
print MyDat

This set of nested tuples is then pretty easy to convert to other Python objects. Panda’s dataframes, Numpy arrays, and NetworkX objects are all one liners. Here is turning the entire dataset into a panda’s data frame.

*Turn into pandas data frame.
import pandas as pd
MyDat = AllSPSSdat(vars=None)
allvars = [spss.GetVariableName(i) for i in range(spss.GetVariableCount())]
PanDat = pd.DataFrame(list(MyDat),columns=allvars)
print PanDat

Which prints out.

   X  Y  Z 
0  1  2  A 
1  4  5  B 
2  7  8  C

Using Python to grab Google Street View imagery

I am at it again with using Google data. For a few projects I was interested in downloading street view imagery data. It has been used in criminal justice applications as a free source for second hand systematic social observation by having people code aspects of disorder from the imagery (instead of going in person) (Quinn et al., 2014), as estimates of the ambient walking around population (Yin et al., 2015), and examining criminogenic aspects of the built environment (Vandeviver, 2014).

I think it is just a cool source of data though to be honest. See for example Phil Cohen’s Family Inequality post in which he shows examples of auctioned houses in Detroit over time.

Using the Google Street View image API you can submit either a set of coordinates or an address and have the latest street view image returned locally. This ends up being abit simpler than my prior examples (such as the street distance API or the places API) because it just returns the image blob, no need to parse JSON.

Below is a simple example in python, using a set of addresses in Detroit that are part of a land bank. This function takes an address and a location to download the file, then saves the resulting jpeg to your folder of choice. I defaulted for the image to be 1200×800 pixels.

import urllib, os

myloc = r"C:\Users\andrew.wheeler\Dropbox\Public\ExampleStreetView" #replace with your own location
key = "&key=" + "" #got banned after ~100 requests with no key

def GetStreet(Add,SaveLoc):
  base = ""
  MyUrl = base + Add + key
  fi = Add + ".jpg"
  urllib.urlretrieve(MyUrl, os.path.join(SaveLoc,fi))

Tests = ["457 West Robinwood Street, Detroit, Michigan 48203",
         "1520 West Philadelphia, Detroit, Michigan 48206",
         "2292 Grand, Detroit, Michigan 48238",
         "15414 Wabash Street, Detroit, Michigan 48238",
         "15867 Log Cabin, Detroit, Michigan 48238",
         "3317 Cody Street, Detroit, Michigan 48212",
         "14214 Arlington Street, Detroit, Michigan 48212"]

for i in Tests:

Dropbox has a nice mosaic view for a folder of pictures, you can view all seven photos here. Here is the 457 West Robinwood Street picture:

In my tests my IP got banned after around 100 images, but you can get a verified google account which allows 25,000 image downloads per day. Unfortunately the automatic API only returns the most recent image – there is no way to return older imagery nor know the date-stamp of the current image. (You technically could download the historical data if you know the pano id for the image. I don’t see any way though to know the available pano id’s though.)

But this is definately easier for social scientists wishing to code images as oppossed to going into the online maps. Hopefully the API gets extended to have dates and a second API to return info. on what image dates are available. I’m not sure if Mike Bader’s software app is actually in the works, but for computer scientists there is a potential overlap with social scientists to do feature extraction of various social characteristics, in addition to manual coding of the images.

Using KDTree’s in python to calculate neighbor counts

For a few different projects I’ve had to take a set of crime data and calculate the number of events nearby. It is a regular geospatial task, counting events in a particular buffer, but one that can be quite cumbersome if you have quite a few points to cross-reference. For one project I needed to calculate this for 4,500 street segments and intersections against a set of over 100,000 crime incidents. While this is not big data, if you go the brute force route and simply calculate all pair-wise distances, this results in 450 million comparisons. Since all I wanted was the number of counts within a particular distance buffer, KDTree’s offer a much more efficient search solution.

Here I give an example in Python using numpy and the nearest neighbor algorithms available in SciPy. This example is calculating the number of shootings in DC within 1 kilometer of schools. The shooting data is sensor data via ShotSpotter, and is publicly available at the new open data site. The school data I calculated the centroids based on school area polygons, available from the legacy DC data site. This particular example was motivated by this Urban Institute post.

There are around 170 schools and under 40,000 total shootings, so this would not be a big issue to calculate all the pairwise distances. It is a simple and quick example though. Here I have the IPython notebook and CSV files are available to download (using Wakari – the Anaconda free service).

So first we will just import the spatial data into numpy arrays. Note each file has the coordinates in projected meters.

import numpy as np
#getting schools and shootings from CSV files
schools = np.genfromtxt('DC_Schools.csv', delimiter=",", skip_header=1)
shootings = np.genfromtxt('ShotSpotter.csv', delimiter=",", skip_header=1, usecols=(4,5))

Next we can import the KDTree function, and then supply it with the two spatial coordinates for the shootings. While this is a relatively small file, even for my example larger set of crime data with over 100,000 incidents building the tree was really fast, less than a minute.

#making KDTree, and then searching within 1 kilometer of school
from sklearn.neighbors import KDTree
shoot_tree = KDTree(shootings)

Finally you can then search within a particular radius. You can either search one location at a time, but here I do a batch search and count the number of shootings that are within 1,000 meters from each school. Again this is a small example, but even with my 4,500 locations against 100,000 crimes it was very fast. (Whereas my initial SPSS code to do all 450 million pairwise combinations was taking all night.)


Which produces an array of the counts for each of the schools:

array([1168,  179, 2384,  686,  583, 1475,  239, 1890, 2070,  990,   74,
        492,   10,  226, 2057, 1813, 1137,  785,  742, 1893, 4650, 1910,
          0,  926, 2380,  818, 2868, 1230,    0, 3230, 1103, 2253, 4531,
       1548,    0,    0,    0, 1002, 1912,    0, 1543,    0,  580,    4,
       1224,  843,  212,  591,    0,    0, 1127, 4520, 2283, 1413, 3255,
        926,  972,  435, 2734,    0, 2828,  724,  796,    1, 2016, 2540,
        369, 1903, 2216, 1697,  155, 2337,  496,  258,  900, 2278, 3123,
        794, 2312,  636, 1663,    0, 4896,  604, 1141,    7,    0, 1803,
          2,  283,  270, 1395, 2087, 3414,  143,  238,  517,  238, 2017,
       2857, 1100, 2575,  179,  876, 2175,  450, 5230, 2441,   10, 2547,
        202, 2659, 4006, 2514,  923,    6, 1481,    0, 2415, 2058, 2309,
          3, 1132,    0, 1363, 4701,  158,  410, 2884,  979, 2053, 1120,
       2048, 2750,  632, 1492, 1670,  242,  131,  863, 1246, 1361,  843,
       3567, 1311,  107, 1501,    0, 2176,  296,  803,    0, 1301,  719,
         97, 2312,  150,  475,  764, 2078,  912, 2943, 1453,  178,  177,
        389,  244,  874,  576,  699,  295,  843,  274])

And that is it! Quite simple.

The ShotSpotter data is interesting, with quite a few oddities that are worth exploring. I recently wrote a chapter on SPSS’s geospatial tools for an upcoming book on advanced SPSS features, providing a predictive police example predicting weekly shootings using the new geospatial temporal modelling tool. The book is edited by Keith McCormick and Jesus Salcedo, and I will write a blog post whenever it comes out highlighting what the chapter goes over.