Scraping Meth Labs with Python

For awhile in my GIS courses I have pointed to the DEA’s website that has a list of busted meth labs across the county, named the National Clandestine Laboratory Register. Finally a student has shown some interest in this, and so I spent alittle time writing a scraper in Python to grab the data. For those who would just like the data, here I have a csv file of the scraped labs that are geocoded to the city level. And here is the entire SPSS and Python script to go from the original PDF data to the finished product.

So first off, if you visit the DEA website, you will see that each state has its own PDF file (for example here is Texas) that lists all of the registered labs, with the county, city, street address, and date. To turn this into usable data, I am going to do three steps in Python:

  1. download the PDF file to my local machine using urllib python library
  2. convert that PDF to an xml file using the pdftohtml command line utility
  3. use Beautifulsoup to parse the xml file

I will illustrate each in turn and then provide the entire Python script at the end of the post.

So first, lets import the libraries we need, and also note I downloaded the pdftohtml utility and placed that location as a system path on my Windows machine. Then we need to set a folder where we will download the files to on our local machine. Finally I create the base url for our meth labs.

from bs4 import BeautifulSoup
import urllib, os

myfolder = r'C:\Users\axw161530\Dropbox\Documents\BLOG\Scrape_Methlabs\PDFs' #local folder to download stuff
base_url = r'https://www.dea.gov/clan-lab' #online site with PDFs for meth lab seizures

Now to just download the Texas pdf file to our local machine we would simply do:

a = 'tx'
url = base_url + r'/' + a + '.pdf'
file_loc = os.path.join(myfolder,a)
urllib.urlretrieve(url,file_loc + '.pdf')

If you are following along and replaced the path in myfolder with a folder on your personal machine, you should now see the Texas PDF downloaded in that folder. Now I am going to use the command line to turn this PDF into an xml document using the os.system() function.

#Turn to xml with pdftohtml, does not need xml on end
cmd = 'pdftohtml -xml ' + file_loc + ".pdf " + file_loc
os.system(cmd)

You should now see that there is an xml document to go along with the Texas file. You can check out its format using a text editor (wordpress does not seem to like me showing it here).

So basically we can use the top and the left attributes within the xml to identify what row and what column the items are in. But first, we need to read in this xml and turn it into a BeautifulSoup object.

MyFeed = open(file_loc + '.xml')
textFeed = MyFeed.read()
FeedParse = BeautifulSoup(textFeed,'xml')
MyFeed.close()

Now the FeedParse item is a BeautifulSoup object that you can query. In a nutshell, we have a top level page tag, and then within that you have a bunch of text tags. Here is the function I wrote to extract that data and dump it into tuples.

#Function to parse the xml and return the line by line data I want
def ParseXML(soup_xml,state):
    data_parse = []
    page_count = 1
    pgs = soup_xml.find_all('page')
    for i in pgs:
        txt = i.find_all('text')
        order = 1
        for j in txt:
            value = j.get_text() #text
            top = j['top']
            left = j['left']
            dat_tup = (state,page_count,order,top,left,value)
            data_parse.append(dat_tup)
            order += 1
        page_count += 1
    return data_parse

So with our Texas data, we could call ParseXML(soup_xml=FeedParse,state=a) and it will return all of the data nested in those text tags. We can just put these all together and loop over all of the states to get all of the data. Since the PDFs are not that large it works quite fast, under 3 minutes on my last run.

from bs4 import BeautifulSoup
import urllib, os

myfolder = r'C:\Users\axw161530\Dropbox\Documents\BLOG\Scrape_Methlabs\PDFs' #local folder to download stuff
base_url = r'https://www.dea.gov/clan-lab' #online site with PDFs for meth lab seizures
                                           #see https://www.dea.gov/clan-lab/clan-lab.shtml
state_ab = ['al','ak','az','ar','ca','co','ct','de','fl','ga','guam','hi','id','il','in','ia','ks',
            'ky','la','me','md','ma','mi','mn','ms','mo','mt','ne','nv','nh','nj','nm','ny','nc','nd',
            'oh','ok','or','pa','ri','sc','sd','tn','tx','ut','vt','va','wa','wv','wi','wy','wdc']
            
state_name = ['Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','Florida','Georgia','Guam','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas',
              'Kentucky','Louisiana','Maine','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey',
              'New Mexico','New York','North Carolina','North Dakota','Ohio','Oklahoma','Oregon','Pennsylvania','Rhode Island','South Carolina','South Dakota','Tennessee','Texas',
              'Utah','Vermont','Virginia','Washington','West Virginia','Wisconsin','Wyoming','Washington DC']

all_data = [] #this is the list that the tuple data will be stashed in

#Function to parse the xml and return the line by line data I want
def ParseXML(soup_xml,state):
    data_parse = []
    page_count = 1
    pgs = soup_xml.find_all('page')
    for i in pgs:
        txt = i.find_all('text')
        order = 1
        for j in txt:
            value = j.get_text() #text
            top = j['top']
            left = j['left']
            dat_tup = (state,page_count,order,top,left,value)
            data_parse.append(dat_tup)
            order += 1
        page_count += 1
    return data_parse

#This loops over the pdfs, downloads them, turns them to xml via pdftohtml command line tool
#Then extracts the data

for a,b in zip(state_ab,state_name):
    #Download pdf
    url = base_url + r'/' + a + '.pdf'
    file_loc = os.path.join(myfolder,a)
    urllib.urlretrieve(url,file_loc + '.pdf')
    #Turn to xml with pdftohtml, does not need xml on end
    cmd = 'pdftohtml -xml ' + file_loc + ".pdf " + file_loc
    os.system(cmd)
    #parse with BeautifulSoup
    MyFeed = open(file_loc + '.xml')
    textFeed = MyFeed.read()
    FeedParse = BeautifulSoup(textFeed,'xml')
    MyFeed.close()
    #Extract the data elements
    state_data = ParseXML(soup_xml=FeedParse,state=b)
    all_data = all_data + state_data

Now to go from those sets of tuples to actually formatted data takes a bit of more work, and I used SPSS for that. See here for the full set of scripts used to download, parse and clean up the data. Basically it is alittle more complicated than just going from long to wide using the top marker for the data as some rows are off slightly. Also there is complications for long addresses being split across two lines. And finally there are just some data errors and fields being merged together. So that SPSS code solves a bunch of that. Also that includes scripts to geocode the to the city level using the Google geocoding API.

Let me know if you do any analysis of this data! I quickly made a time series map of these events via CartoDB. You can definately see some interesting patterns of DEA concentration over time, although I can’t say if that is due to them focusing on particular areas or if they are really the areas with the most prevalent Meth lab problems.

Some GIS data scraping adventures: Banksy graffiti and gang locations in NYC

I’ve recently scraped some geographic data that I may use in my graduate level GIS course. I figured I would share with everyone, and take some time to describe for others how I scraped the data.

So to start, if you read an online article and it has a webmap with some GIS data in it – the data exists somewhere. It won’t always be the case that you can actually download the data, but for the most current and popular interactive mapping tools, the data is often available if you know where to look.

For example, I asked on the GIS stackexchange site awhile ago how can you download the point data in this NYC homicide map from the Times. I had emailed the reporters multiple times and they did not respond. A simple solution the answerers suggested was to use website developer tools to see what was being loaded when I refreshed the page. It happened that the map is being populated by a two simple text files (1,2).

It may be an interesting project to see how this compares (compiling via news stories) versus official data, which NYC recently released going back to 2006. Especially since such crowdsourced news datasets are used for other things, like counting mass shootings.

The two example mapping datasets I provide below though are a bit different process to get the underlying data – but just as easy. Many current webmaps use geojson files as the backend. What I did for the two examples below is I just looked at the html source for the website, and look for json data formats – links that specify “js” or “json” extensions. If you click through those external json links you can see if they have the data.

The other popular map type though comes from ESRI. You can typically find an ESRI server populating the map, and if the website has say a parcel data lookup you can often find an ESRI geocoding server (see here for one example of using an ESRI geocoding api). The maps though unfortunately do not always have exposed data. Sometimes what looks like vector data are actually just static PNG tiles. Council Districts in this Dallas map are an example. If you dig deep enough, you can find the PNG tiles for the council districts, but that does not do anyone much good. Pretty much all of those layers are available for download from other sources though. A similar thing happens with websites with crime reports, such as RAIDS Online or CrimeReports.com. They intentionally build the web map so you cannot scrape the data.

So that said, before we go further though – it should go without saying that you should not steal/plagiarize people’s articles or simply rip-off their graphics. Conducting new analysis with the publicly available data though seems fair game to me.

Banksy Taggings in NYC

There was a recent stink in the press about Kim Rossmo and company using geographic offender profiling to identify the likely home location of the popular graffiti artist Banksy. Here is the current citation of the journal article for those interested:

Hauge, M. V., Stevenson, M. D., Rossmo, D. K., and Le Comber, S. C. (2016). Tagging banksy: using geographic profiling to investigate a modern art mystery. Journal of Spatial Science, pages 1-6. doi:10.1080/14498596.2016.1138246

The article uses data from Britain, so I looked up to see if his taggings in other places was available. I came across this article showing a map of locations in New York City. So I searched where the data was coming from, and found the json file that contains the point data here. I just built a quick excel spreadsheet to parse the data, and you can download that spreadsheet here.

Gang Locations

This article posts a set of gang territories in NYC. This is pretty unique – I am unfamiliar with any other public data source that identifies gang territories. So I figured it would be a potential fun project for students in my GIS course – for instance in overlaying with the 311 graffiti data.

Again the data at the backend is in json format, and can be found here. To convert this data to a shapefile is a bit challenging, as it has points, lines and polygons all in the same file. What I did was buffer the lines and points by a small amount to be able to stuff them all in one shapefile. A zip file of that shapefile can be downloaded here.

Drop me a note if you use this data, I’d be interested in your analyses! Hence why I am sharing the data for others to play with 🙂

A shout out for ScraperWiki

I’ve used the online tool ScraperWiki a few times to scrape some online data into a simple spreadsheet. It provides a set of tools to code in Python (and a few other languages) online in the browser and dump the resulting set of information into an online table. It also allows you to set the data collection on a timer and updates the table with new information. It is just a nice set of tools that makes getting and sharing the data easier.

For a few sample scripts I’ve used it for in the past:

  • Grabbing the number of chat posts from Stack Overflow chat rooms.
  • Grabbing the user location from profiles on the GIS.SE site. (This is really redundant with the power to query the SO data explorer – it can be more up to date though.)
  • Scraping the NYPD reported precinct statistics that were uploaded in PDF form. (Yes you can turn PDFs into xml which you can subsequently parse).

I previously used what is now called ScraperWiki Classic. But since the service is migrating to a new platform my prior scripts will not continue downloading data.

It appears my NYPD data scraper does not work any more anyway (they do have an online map now – but unfortunately are not providing the data in a convenient format for bulk download to the NYC Open Data site last I knew). But it has some historical data from the weekly precinct level crime stats between the end of 2010 and through most of 2013 if you are interested.

I’ve migrated the chat posts scraper to the new site, and so it will continue to update with new data for the activity in the SO R Room and the main chat room on the CrossValidated Stats Site. I’ll have to leave analysing that data for a future post, but if you are interested in amending the scraper to download data for a different room it is pretty easy. Just fork it and add/change the rooms of interest to the url_base array. You can download a csv file of the data directly from this link. The Mono field is a count of the unique "Monologue" tags in the html, and the Baseroom shows what chat room the data is from. Note that the Mono value shows 0 even if the room did not exist at that date.

I’m not sure how to make the scraper viewable to the public, but you can still view the code on the classic site. Note the comment about the CPU time isn’t applicable to the main site; I could download the transcripts all the way back to 2010 and I did not receive any errors from ScraperWiki.