I’ve used the online tool ScraperWiki a few times to scrape some online data into a simple spreadsheet. It provides a set of tools to code in Python (and a few other languages) online in the browser and dump the resulting set of information into an online table. It also allows you to set the data collection on a timer and updates the table with new information. It is just a nice set of tools that makes getting and sharing the data easier.
For a few sample scripts I’ve used it for in the past:
- Grabbing the number of chat posts from Stack Overflow chat rooms.
- Grabbing the user location from profiles on the GIS.SE site. (This is really redundant with the power to query the SO data explorer – it can be more up to date though.)
- Scraping the NYPD reported precinct statistics that were uploaded in PDF form. (Yes you can turn PDFs into xml which you can subsequently parse).
I previously used what is now called ScraperWiki Classic. But since the service is migrating to a new platform my prior scripts will not continue downloading data.
It appears my NYPD data scraper does not work any more anyway (they do have an online map now – but unfortunately are not providing the data in a convenient format for bulk download to the NYC Open Data site last I knew). But it has some historical data from the weekly precinct level crime stats between the end of 2010 and through most of 2013 if you are interested.
I’ve migrated the chat posts scraper to the new site, and so it will continue to update with new data for the activity in the SO R Room and the main chat room on the CrossValidated Stats Site. I’ll have to leave analysing that data for a future post, but if you are interested in amending the scraper to download data for a different room it is pretty easy. Just fork it and add/change the rooms of interest to the
url_base array. You can download a csv file of the data directly from this link. The
Mono field is a count of the unique "Monologue" tags in the html, and the Baseroom shows what chat room the data is from. Note that the
Mono value shows 0 even if the room did not exist at that date.
I’m not sure how to make the scraper viewable to the public, but you can still view the code on the classic site. Note the comment about the CPU time isn’t applicable to the main site; I could download the transcripts all the way back to 2010 and I did not receive any errors from ScraperWiki.