American Community Survey Variables of Interest to Criminologists

I’ve written prior blog posts about downloading Five Year American Community Survey data estimates (ACS for short) for small area geographies, but one of the main hiccups is figuring out what variables you want to use. The census has so many variables that are just small iterations of one another (e.g. Males under 5, males 5 to 9, males 10 to 14, etc.) that it is quite a chore to specify the ones you want. Often you want combinations of variables or to calculate percentages as well, so you need to take two or more variables and turn them into your constructed variable.

I have posted some notes on the variables I have used for past projects in an excel spreadsheet. This includes the original variables, as well as some notes for creating percentage variables. Some are tricky — such as figuring out the proportion of black residents for block groups you need to add non-Hispanic black and Hispanic black estimates (and then divide by the total population). For spatially oriented criminologists these are basically indicators commonly used for social disorganization. It also includes notes on what is available at the smaller block group level, as not all of the variables are. So you are more limited in your choices if you want that small of area.

Let me know if you have been using other variables for your work. I’m not an expert on these variables by any stretch, so don’t take my list as authoritative in any way. For example I have no idea whether it is valid to use the imputed data for moving in the prior year at the block group level. (In general I have not incorporated the estimates of uncertainty for any of the variables into my analyses, not sure of the additional implications for the imputed data tables.) Also I have not incorporated variables that could be used for income-inequality or for ethnic heterogeneity (besides using white/black/Hispanic to calculate the index). I’m sure there are other social disorganization relevant variables at the block group level folks may be interested in as well. So let me know in the comments or shoot me an email if you have suggestions to update my list.

I would prefer if as a field we could create a set of standardized indices so we are not all using different variables (see for example this Jeremy Miles paper). It is a bit hodge-podge though what variables folks use from study-to-study, and most folks don’t report the original variables so it is hard to replicate their work exactly. British folks have their index of deprivation, and it would be nice to have a similarly standardized measure to use in social science research for the states.


The ACS data has consistent variable names over the years, such as B03001_001 is the total population, B03002_003 is the Non-Hispanic white population, etc. Unfortunately those variables are not necessarily in the same tables from year to year, so concatenating ACS results over multiple years is a bit of a pain. Below I post a python script that given a directory of the excel template files will produce a nice set of dictionaries to help find what table particular variables are in.

#This python code grabs ACS meta-data templates
#To easier search for tables that have particular variables
import xlrd, os

mydir = r'!!!Insert your path to the excel files here!!!!!'

def acs_vars(directory):
    #get the excel files in the directory
    excel_files = []
    for file in os.listdir(directory):
        if file.endswith(".xls"):
            excel_files.append( os.path.join(directory, file) )
    #getting the variables in a nice dictionaries
    lab_dict = {}
    loc_dict = {}
    for file in excel_files:
        book = xlrd.open_workbook(file) #first open the xls workbook
        sh = book.sheet_by_index(0)
        vars = [i.value for i in sh.row(0)] #names on the first row
        labs = [i.value for i in sh.row(1)] #labels on the second
        #now add to the overall dictionary
        for v,l in zip(vars,labs):
            lab_dict[v] = l
            loc_dict[v] = file
    #returning the two dictionaries
    return lab_dict,loc_dict
    
labels,tables = acs_vars(mydir)

#now if you have a list of variables you want, you can figure out the table
interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
            'B11003_016','B14006_002','B01001_003','B23025_005','B22010_002',
            'B16002_004']
            
for i in interest:
    head, tail = os.path.split(tables[i])
    print (i,labels[i],tail)
Advertisements

Downloading and reading in American Community Survey Data: Python and SPSS

I had a prior blog post on working with American Community Survey data in SPSS. The meta-data format has changed from that example though, and the Census gives out comma separated files and xls Templates now. So this will be an update, and I have good stuff for those working strictly in python, as well as those wanting to load the data is SPSS.

So first, when downloading the small geographies from the Census’s FTP site, they have a ton of files. See this page, which contains the 5 year estimates for 2014 for New York block groups and tracts. Now instead of downloading each zip file one by one, we can write a quick python script to download all the files.

import urllib, os

downFold = r'C:\Users\axw161530\Dropbox\Documents\BLOG\ACS_Python_SPSS\Data'
base = r'http://www2.census.gov/programs-surveys/acs/summary_file/2014/data/5_year_seq_by_state/NewYork/Tracts_Block_Groups_Only/'

for i in range(1,5):  #change range(1,5) to range(1,122) to download all zip files
    file = "20145ny0" + str(i).zfill(3) + "000.zip"
    urllib.urlretrieve(base + file, os.path.join(downFold,file))

#also download the geography file
urllib.urlretrieve(base + "g20145ny.csv", os.path.join(downFold,"g20145ny.csv"))

The downFold string is where the files will be downloaded to (so change that to a place on your local machine), and the base string ends up being the base URL for that particular set of files. The files go from 1 to 121 in that example, but just to keep the time down I only download tables one through four. The second urlib.urlretrieve line downloads the geography csv file (we won’t be using the other geography file, which is the same data but in tab delimited format).

Now we can go and download the meta data excel file shells. For this dataset they are located here. Here we want the 5 year templates. Once that data is downloaded, then unzip all of the files. You could technically do this in python as well, but I just use 7zip, as that has a handy dialog to unzip multiple files to the same place.

So the way the files work, there are a set of estimate and margin of error text files that are comma delimited that have the demographic characteristics. (Here for all block groups and census tracts in New York.) The xls excel files contain the variable names, along with a brief description for the variables.

If you are a hipster and only do data analysis in python, here is a function that takes the location to a xls template file and the corresponding data file and reads it into a pandas data frame.

#this reads in american community survey data
import xlrd
import pandas as pd

def readACS(Template,Data):
    book = xlrd.open_workbook(Template) #first open the xls workbook
    sh = book.sheet_by_index(0)
    vars = [i.value for i in sh.row(0)] #names on the first row
    labs = [i.value for i in sh.row(1)] #labels on the second
    #this rewrites duplicate 'BLANK' names, mangle dups not working for me
    n = 0
    vars2 = []
    for i in range(len(vars)):
        if vars[i] == 'BLANK':
            n += 1
            vars2.append('BLANK.' + str(n))
        else:
            vars2.append(vars[i])
    #check for if geo file or data file
    if vars2[1] == 'FILETYPE':
        df = pd.read_csv(Data,names=vars2,dtype={'FILETYPE':'object'})
    else:
        df = pd.read_csv(Data,names=vars2)
    return df,zip(vars2,labs)

In a nutshell, it reads the metadata column names and labels from the excel spreadsheet, then reads in the csv file with the data. It returns two objects, the one on the left is a pandas dataframe, and the one on the right is a zipped up list of the variable names and the variable labels. This would be a bit simpler, except that the format for the geo dataset is a little different than all the data files and contains multiple “BLANK” fields (the mangle_dupe_cols option in read_csv is not behaving like I expect it to). For the non-geographic file, I need to tell python the filetype column is a string, else it interprets the “e” in the estimate files as a scientific number (e.g. 1e5 = 100,000).

So here is an example of using this function to grab the second table. When I unzipped the excel templates, it nested the data templates in another subfolder, hence the TemplateFold string.

TemplateFold = downFold + r'\seq'
Tab002,Meta002 = readACS(TemplateFold + r'\Seq2.xls',downFold + r'\e20145ny0002000.txt')

If you want to check out all the variable labels, you can then do:

for i in Meta002:
    print i 

Or if you want to turn that into a dictionary you can simply do dict(Meta002). If you wanted to import all 121 tables and merge them you should be able to figure that out in a simple loop from here (note the “x.zfill(n)” function to pad the integers with leading zeroes). But I typically only work with a select few tables and variables at a time, so I won’t worry about that here.

The function works the same with the geographic data and its template. (Which that metadata template is not nested in the further down seq folder.)

GeoDat,MetaGeo = readACS(downFold + r'\2014_SFGeoFileTemplate.xls',downFold + r'\g20145ny.csv')

Note if you are working with both the estimates and the margin of error files, you may want to put somewhere in the code to change the variable names to signify that, such as by putting a suffix of “e” or “m”. If you just work with the estimates though you don’t need to worry about that.

Reading ACS data into SPSS

For those working in SPSS, I’ve shown previously how to turn python data into SPSS data. I’ve started working on a function to make this simpler with pandas dataframes, but I will hold off on that for now (need to figure out datetimes and NaN’s). So what I did here was grab the meta-data from the template xls file (same as before), but from that build the necessary DATA LIST command in SPSS, and then just submit the command. SPSS has the added benefit of having native meta-data fields, so I can also add in the variable labels. Also, this only uses the xlrd library, in case you do not have access to pandas. (I point SPSS to Anaconda, instead of worrying about using pip with the native SPSS python install.)

So in SPSS, you would first define this function

*This just builds the necessary SPSS program to read in the american community survey data.
BEGIN PROGRAM Python.
#creating my own function to read in data
import xlrd, spss
def OpenACS(Template,Data):
  book = xlrd.open_workbook(Template)
  sh = book.sheet_by_index(0)
  vars = [i.value for i in sh.row(0)]
  labs = [i.value for i in sh.row(1)]
  #this rewrites duplicate 'BLANK' names, mangle dups not working for me
  n = 0
  vars2 = []
  for i in range(len(vars)):
    if vars[i] == 'BLANK':
      n += 1
      vars2.append('BLANK.' + str(n))
    else:
      vars2.append(vars[i])
    #check for if geo file or data file
  if vars2[1] == 'FILETYPE':  #regular data
    ncols = sh.ncols - 6 #need the end of the number of variables
    ext =  ' (' + str(ncols) + 'F7.0)'
    v1 = ' /FILEID FILETYPE (2A6) STUSAB (A2) CHARITER (A3) SEQUENCE (A4) LOGRECNO (A7) '
    v2 = '\n '.join(vars2[6:])
    Tab = Data[-10:-7] #Names the dataset based on the table number
  else: #geo data file
    ncols = sh.ncols
    ext =  ' (' + str(ncols) + 'A255)' #255 should be big enough to fit whatever str
    v1 = " / "
    v2 = '\n '.join(vars2)
    Tab = "Geo"
  #this sets errors off, implicit missing data for block groups
  spss.Submit("PRESERVE.")
  spss.Submit("SET RESULTS OFF ERRORS OFF.")
  #now creating the import program to read in the data
  begin = "DATA LIST LIST(',') FILE = '%s'" % (Data)
  full_str = begin + v1 + v2 + ext + "\n."
  #now reading in the dataset
  spss.Submit(full_str)
  #assigning a dataset name
  datName = "DATASET NAME Table" + Tab + "." 
  spss.Submit(datName)
  #now adding in the variable labels
  for i,j in zip(vars2,labs):
    #replaces double quotes with single quotes in label
    strVal = """VARIABLE LABELS %s "%s".""" % (i,j.replace('"',"'"))
    spss.Submit(strVal)
  if Tab == "Geo":
    spss.Submit("ALTER TYPE ALL (A = AMIN).")
  spss.Submit("RESTORE.")
END PROGRAM.

Again this is much shorter if I only needed to worry about the data files and not the geo file, but that slight formatting difference is a bit of a pain. Here I use the errors off trick to suppress the data list errors for missing data (which is intended, as not all of the data is available at the block group level). But you will still get error messages in the SPSS syntax bottom section. They can be ignored if it is the “insufficient data” warning.

Here is an example of using this python function now to read the data into SPSS. This automatically assigns a dataset name, either based on the Table number, or “Geo” for the geographic data.

*Now reading in the data files I want.
BEGIN PROGRAM Python.
downFold = r'C:\Users\axw161530\Dropbox\Documents\BLOG\ACS_Python_SPSS\Data'
TemplateFold = downFold + r'\seq'

#reading in Data file, table number 2
OpenACS(TemplateFold + r'\Seq2.xls',downFold + r'\e20145ny0002000.txt')

#reading in Geo file
OpenACS(downFold + r'\2014_SFGeoFileTemplate.xls',downFold + r'\g20145ny.csv')
END PROGRAM.
EXECUTE.

And Voila, there is your small area American Community Survey data in SPSS. This will produce two different datasets, “Table002” and “TableGeo” that can be merged together via MATCH FILES.

Let me know in the comments if you have already worked out a function to turn pandas dataframes into SPSS datasets.