Downloading and reading in American Community Survey Data: Python and SPSS

I had a prior blog post on working with American Community Survey data in SPSS. The metadata format has changed since that example though, and the Census now gives out comma separated files and xls templates. So this will be an update, and I have good stuff for those working strictly in python, as well as those wanting to load the data into SPSS.

So first, when downloading the small geographies from the Census’s FTP site, they have a ton of files. See this page, which contains the 5 year estimates for 2014 for New York block groups and tracts. Now instead of downloading each zip file one by one, we can write a quick python script to download all the files.

import urllib, os

downFold = r'C:\Users\axw161530\Dropbox\Documents\BLOG\ACS_Python_SPSS\Data'
base = r'http://www2.census.gov/programs-surveys/acs/summary_file/2014/data/5_year_seq_by_state/NewYork/Tracts_Block_Groups_Only/'

for i in range(1,5):  #change range(1,5) to range(1,122) to download all zip files
    file = "20145ny0" + str(i).zfill(3) + "000.zip"
    urllib.urlretrieve(base + file, os.path.join(downFold,file))

#also download the geography file
urllib.urlretrieve(base + "g20145ny.csv", os.path.join(downFold,"g20145ny.csv"))

The downFold string is where the files will be downloaded to (so change that to a place on your local machine), and the base string ends up being the base URL for that particular set of files. The files go from 1 to 121 in that example, but just to keep the time down I only download tables one through four. The second urllib.urlretrieve line downloads the geography csv file (we won’t be using the other geography file, which is the same data but in tab delimited format).

Now we can go and download the metadata Excel file shells. For this dataset they are located here; we want the 5 year templates. Once that data is downloaded, unzip all of the files. You could technically do this in python as well (see the sketch below), but I just use 7zip, as that has a handy dialog to unzip multiple files to the same place.
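For those who want to stay in python for the unzipping step, here is a minimal sketch using the standard zipfile library (this just assumes all of the zip files are sitting in the downFold folder from the download script):

import zipfile, glob, os

#unzip every zip file in the download folder to that same folder
for z in glob.glob(os.path.join(downFold, "*.zip")):
    with zipfile.ZipFile(z) as zf:
        zf.extractall(downFold)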

So the way the files work, there is a set of estimate and margin of error text files that are comma delimited and contain the demographic characteristics. (Here for all block groups and census tracts in New York.) The xls Excel files contain the variable names, along with a brief description of each variable.

If you are a hipster and only do data analysis in python, here is a function that takes the location of an xls template file and the corresponding data file and reads the data into a pandas data frame.

#this reads in american community survey data
import xlrd
import pandas as pd

def readACS(Template,Data):
    book = xlrd.open_workbook(Template) #first open the xls workbook
    sh = book.sheet_by_index(0)
    vars = [i.value for i in sh.row(0)] #names on the first row
    labs = [i.value for i in sh.row(1)] #labels on the second
    #this rewrites duplicate 'BLANK' names, mangle dups not working for me
    n = 0
    vars2 = []
    for i in range(len(vars)):
        if vars[i] == 'BLANK':
            n += 1
            vars2.append('BLANK.' + str(n))
        else:
            vars2.append(vars[i])
    #check for if geo file or data file
    if vars2[1] == 'FILETYPE':
        df = pd.read_csv(Data,names=vars2,dtype={'FILETYPE':'object'})
    else:
        df = pd.read_csv(Data,names=vars2)
    return df,zip(vars2,labs)

In a nutshell, it reads the metadata column names and labels from the excel spreadsheet, then reads in the csv file with the data. It returns two objects: the first is a pandas dataframe, and the second is a zipped up list of the variable names and the variable labels. This would be a bit simpler, except that the format of the geo dataset is a little different from the data files and contains multiple “BLANK” fields (the mangle_dupe_cols option in read_csv is not behaving like I expect it to). For the non-geographic files, I need to tell python the FILETYPE column is a string, else it interprets the “e” in the estimate file values as scientific notation (e.g. 1e5 = 100,000).

So here is an example of using this function to grab the second table. When I unzipped the excel templates, it nested the data templates in another subfolder, hence the TemplateFold string.

TemplateFold = downFold + r'\seq'
Tab002,Meta002 = readACS(TemplateFold + r'\Seq2.xls',downFold + r'\e20145ny0002000.txt')

If you want to check out all the variable labels, you can then do:

for i in Meta002:
    print i 

Or if you want to turn that into a dictionary you can simply do dict(Meta002). If you wanted to import all 121 tables and merge them, you should be able to figure that out in a simple loop from here (note the “x.zfill(n)” function to pad the integers with leading zeroes); a sketch of the reading part is below. But I typically only work with a select few tables and variables at a time, so I won’t worry about that here.
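For example, a quick sketch of reading the first four tables into dictionaries (this assumes the unzipped estimate files and templates keep the e20145ny0XXX000.txt and SeqX.xls naming patterns shown above) would be:

#read a set of tables into dictionaries of dataframes and metadata lists
TabDict = {}
MetaDict = {}
for i in range(1,5):  #change to range(1,122) for all of the tables
    temp = TemplateFold + r'\Seq' + str(i) + '.xls'
    dat = downFold + r'\e20145ny0' + str(i).zfill(3) + '000.txt'
    TabDict[i], MetaDict[i] = readACS(temp,dat)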

The function works the same with the geographic data and its template. (That metadata template is not nested in the seq subfolder like the others.)

GeoDat,MetaGeo = readACS(downFold + r'\2014_SFGeoFileTemplate.xls',downFold + r'\g20145ny.csv')

Note if you are working with both the estimate and the margin of error files, you may want to add something to the code to change the variable names to signify which is which, such as a suffix of “e” or “m” (a sketch of one way to do that is below). If you just work with the estimates though you don’t need to worry about that.
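For example, here is a minimal sketch of that renaming for table 2, leaving the six identifier fields alone. (The identifier names are the same ones used in the SPSS code later in the post, and I am assuming the margin of error file follows the same naming pattern as the estimate file but with an m prefix.)

#tag the estimate and margin of error columns so they can be told apart
idVars = ['FILEID','FILETYPE','STUSAB','CHARITER','SEQUENCE','LOGRECNO']
Est002, MetaE = readACS(TemplateFold + r'\Seq2.xls', downFold + r'\e20145ny0002000.txt')
Moe002, MetaM = readACS(TemplateFold + r'\Seq2.xls', downFold + r'\m20145ny0002000.txt')
Est002.columns = [v if v in idVars else v + '_e' for v in Est002.columns]
Moe002.columns = [v if v in idVars else v + '_m' for v in Moe002.columns]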

Reading ACS data into SPSS

For those working in SPSS, I’ve shown previously how to turn python data into SPSS data. I’ve started working on a function to make this simpler with pandas dataframes, but I will hold off on that for now (I need to figure out datetimes and NaNs). So what I did here was grab the metadata from the template xls file (same as before), but from that build the necessary DATA LIST command in SPSS, and then just submit the command. SPSS has the added benefit of having native metadata fields, so I can also add in the variable labels. Also, this only uses the xlrd library, in case you do not have access to pandas. (I point SPSS to Anaconda, instead of worrying about using pip with the native SPSS python install.)

So in SPSS, you would first define this function

*This just builds the necessary SPSS program to read in the american community survey data.
BEGIN PROGRAM Python.
#creating my own function to read in data
import xlrd, spss
def OpenACS(Template,Data):
  book = xlrd.open_workbook(Template)
  sh = book.sheet_by_index(0)
  vars = [i.value for i in sh.row(0)]
  labs = [i.value for i in sh.row(1)]
  #this rewrites duplicate 'BLANK' names, mangle dups not working for me
  n = 0
  vars2 = []
  for i in range(len(vars)):
    if vars[i] == 'BLANK':
      n += 1
      vars2.append('BLANK.' + str(n))
    else:
      vars2.append(vars[i])
    #check for if geo file or data file
  if vars2[1] == 'FILETYPE':  #regular data
    ncols = sh.ncols - 6 #need the end of the number of variables
    ext =  ' (' + str(ncols) + 'F7.0)'
    v1 = ' /FILEID FILETYPE (2A6) STUSAB (A2) CHARITER (A3) SEQUENCE (A4) LOGRECNO (A7) '
    v2 = '\n '.join(vars2[6:])
    Tab = Data[-10:-7] #Names the dataset based on the table number
  else: #geo data file
    ncols = sh.ncols
    ext =  ' (' + str(ncols) + 'A255)' #255 should be big enough to fit whatever str
    v1 = " / "
    v2 = '\n '.join(vars2)
    Tab = "Geo"
  #this sets errors off, implicit missing data for block groups
  spss.Submit("PRESERVE.")
  spss.Submit("SET RESULTS OFF ERRORS OFF.")
  #now creating the import program to read in the data
  begin = "DATA LIST LIST(',') FILE = '%s'" % (Data)
  full_str = begin + v1 + v2 + ext + "\n."
  #now reading in the dataset
  spss.Submit(full_str)
  #assigning a dataset name
  datName = "DATASET NAME Table" + Tab + "." 
  spss.Submit(datName)
  #now adding in the variable labels
  for i,j in zip(vars2,labs):
    strVal = """VARIABLE LABELS %s "%s".""" % (i,j)
    spss.Submit(strVal)
  if Tab == "Geo":
    spss.Submit("ALTER TYPE ALL (A = AMIN).")
  spss.Submit("RESTORE.")
END PROGRAM.

Again this would be much shorter if I only needed to worry about the data files and not the geo file, but that slight formatting difference is a bit of a pain. Here I use the errors off trick to suppress the DATA LIST errors for missing data (which is expected, as not all of the data is available at the block group level). But you will still get error messages in the bottom section of the SPSS syntax window. They can be ignored if they are the “insufficient data” warning.

Here is an example of using this python function now to read the data into SPSS. This automatically assigns a dataset name, either based on the Table number, or “Geo” for the geographic data.

*Now reading in the data files I want.
BEGIN PROGRAM Python.
downFold = r'C:\Users\axw161530\Dropbox\Documents\BLOG\ACS_Python_SPSS\Data'
TemplateFold = downFold + r'\seq'

#reading in Data file, table number 2
OpenACS(TemplateFold + r'\Seq2.xls',downFold + r'\e20145ny0002000.txt')

#reading in Geo file
OpenACS(downFold + r'\2014_SFGeoFileTemplate.xls',downFold + r'\g20145ny.csv')
END PROGRAM.
EXECUTE.

And Voila, there is your small area American Community Survey data in SPSS. This will produce two different datasets, “Table002” and “TableGeo” that can be merged together via MATCH FILES.

Let me know in the comments if you have already worked out a function to turn pandas dataframes into SPSS datasets.

Paper – Replicating Group Based Trajectory Models of Crime at Micro-Places in Albany, NY published

My article on estimating crime trajectories in Albany from 2000 through 2014 has been published in the latest issue of JQC.

That link is permanent, but Springer gifts me a temporary free pdf link for everyone for up to four weeks. So grab that if you are interested.

Also note that I have the pre-print posted on SSRN. Since that is Albany PD’s data, I cannot provide code to replicate the analysis. But I have produced a series of blog posts showing how to replicate the trajectory and the point pattern analysis on your own data if you are interested.

Here is the cross Ripley’s L plot testing for clustering between the different trajectory groupings.

Also always feel free to send me an email if you have questions about the findings and paper.

ASC 2016 – Quantifying the Local and Spatial Effects of Alcohol Outlets on Crime

This year at the American Society of Criminology I will be presenting some work from my dissertation, Quantifying the Local and Spatial Effects of Alcohol Outlets on Crime. I have the working paper posted on SSRN, and that also has a link to download data and code to reproduce the findings in the paper.

I will be presenting at the panel Alcohol and Crime on Wednesday at 9:30 (at the Cambridge room on the 2nd level).

Here is the abstract:

This paper estimates the relationship between alcohol outlets and crime at micro place street units in Washington, D.C. Three specific additions to this voluminous literature are articulated. First, the diffusion effect of alcohol outlets is larger than the local effect. This has important implications for crime prevention. The second is that in this sample the effects of on-premise and off-premise outlets are very similar in magnitude. I argue this is evidence in favor of routine activities theory, in opposition to theories which emphasize individual alcohol consumption. The final is that alcohol outlets have large effects on burglary, despite the fact that alcohol outlets cannot increase the number of vulnerable targets, as they can with interpersonal crimes. I discuss how this can either be interpreted as evidence that alcohol outlets self-select into already crime prone areas, or potentially that the presence of motivated offenders matters much more than increasing the number of potential victims.

The most interesting finding is that I estimate the diffusion effect of alcohol outlets to be larger than the local effect. I then show that this is the case for some other papers as well; it is just that interpreting the regression model is tricky. Here is a diagram showing what happens. The idea is that the regression coefficient for the spatial lag is one orange dot, and the local effect is the blue dot. Adding a bar though diffuses crime to multiple places, so when you add up all the smaller orange dots, they result in more crime than the one bigger blue dot.

Some inverse distance weighting hacks – using R and spatstat

For a recent project I was working with survey responses about attitudes towards the police, and I wanted to make a map of those responses. The typical default to accomplish this is inverse distance weighting. For those familiar with hot spot maps of crime, this is similar in that it produces a smooth isarithmic map, but instead of being a density it predicts values. For my project I wanted to explore two different things: 1) estimating the variance of the IDW estimate, and 2) exploring different weighting schemes besides the default inverse distance. The R code for my functions and data for analysis can be downloaded here.


What is inverse distance weighting?

Since this isn’t typical fodder for social scientists, I will present a simple example to illustrate.

Imagine you are a farmer who wants to know where to plant corn vs. soybeans, and you are using the nitrogen content of the soil to decide. You take various samples from a field and measure the nitrogen content, but you want predictions for the areas you did not sample. So say we have four measures at various points in the field.

Nit     X   Y
1.2     0   0
2.1     0   5
2.6    10   2
1.5     6   5

From this let’s say we want to estimate the average nitrogen content at the center of the field, (5,5). Inverse distance weighting is just as the name says: the weight used to estimate the average nitrogen content at the center is based on the inverse of the distance between the sample point and the center. Most often people use the inverse of the distance squared as the weight. So from this we have the following weights.

Nit     X   Y   Weight
1.2     0   0   1/50
2.1     0   5   1/25
2.6    10   2   1/34
1.5     6   5   1/1

You can see the last row is the closest point, so it gets the largest weight. The weighted average of nitrogen for the (5,5) point ends up being ~1.55.
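To make that calculation concrete, the weighted mean is just the sum of the weighted values divided by the sum of the weights:

(1.2*(1/50) + 2.1*(1/25) + 2.6*(1/34) + 1.5*(1/1)) / (1/50 + 1/25 + 1/34 + 1/1) =~ 1.68/1.09 =~ 1.55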

For inverse distance weighted maps, one then makes a series of weighted estimates at a regular grid over the study space. So not just an estimate at 5,5, but also 5,4|5,3|5,2 etc. And then you have a regular grid of values you can plot.


Example – Street Clean Scores in LA

An ok example to demonstrate this is an LA database rating streets based on their cleanliness. Some might quibble that it only makes sense to estimate street cleanliness values on the streets themselves, but I think it is ok for exploratory data analysis. Just visualizing the streets directly is very hard given their small width and irregularity.

So to follow along, first I load all the libraries I will be using, then set my working directory, and finally source my updated (hacked) inverse distance weighting functions.

library(spatstat)
library(inline)
library(rgdal)
library(maptools)
library(ncf)

MyDir <- "C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\IDW_Variance_Bisquare\\ExampleAnalysis"
setwd(MyDir)

#My updated idw functions
source("IDW_Var_Functions.R")

Next we need to create a point pattern object spatstat can work with, so we import our street scores, which contain an X and Y coordinate for the midpoint of each street segment, as well as the boundary of the city of Los Angeles. Then we can create a marked point pattern. For reference, the street scores can range from 0 (clean) to a max of 3 (dirty).

CleanStreets <- read.csv("StreetScores.csv",header=TRUE)
summary(CleanStreets)
BorderLA <- readOGR("CityBoundary.shp", layer="CityBoundary")

#create Spatstat object and window
LA_Win <- as.owin(BorderLA)
LA_StreetPP <- ppp(CleanStreets$XMidPoint,CleanStreets$YMidPoint, window=LA_Win, marks=CleanStreets$StreetScor)

Now we can estimate a smooth inverse distance weighted map by calling my new function, idw2. This returns both the original weighted mean (equivalent to the original spatstat idw estimate) and the variance. Here I plot them side by side (see the end of the blog post for how I calculate the variance). The weighted mean is on the left, and the variance estimate is on the right. In the returned object, the rat image is the weighted mean and the var image is the weighted variance.

#Typical inverse distance weighted estimate
idw_res <- idw2(LA_StreetPP) #only takes a minute
par(mfrow=c(1,2))
plot(idw_res$rat) #this is the weighted mean
plot(idw_res$var) #this is the weighted variance

So contrary to expectations, this does not provide a very smooth map. It is quite rough. This is partially because social science data is not going to be as regular as natural science measurements. In spatial stats jargon street to street measures will have a large nugget – a clean street can be right next to a dirty one.

Here the default is using inverse distance squared – what if we just use inverse distance though?

#Inverse distance (linear)
idw_Lin <- idw2(LA_StreetPP, power=1)
plot(idw_Lin$rat)
plot(idw_Lin$var)

This is smoothed out a little more. There is essentially one dirty spot in the central eastern part of the city (I don’t know anything about LA neighborhoods). Compared to the first set of maps, the dirty streets in the northern mass of the city are basically entirely smoothed out, whereas before you could at least see little spikes.

So I was wondering if there are better weights we could choose to smooth out the data a little more. One I have used in a few recent projects is the bisquare kernel, which I was introduced to by the geographically weighted regression folks. The bisquare kernel weight equals [1 - (d/b)^2]^2 when d < b, and zero otherwise. Here d is the distance, and b is a user chosen distance threshold. We can make a plot to illustrate the difference in the weight functions, here using a bisquare kernel distance threshold of 2000 meters.

#example weight functions over 3000 meters
dist <- 1:3000
idw1 <- 1/dist
idw2 <- 1/(dist^2)
b <- 2000
bisq <- ifelse(dist < b, ( 1 - (dist/b)^2 )^2, 0)
plot(dist,idw1,type='l')
lines(dist,idw2,col='red')
lines(dist,bisq,col='blue')

Here you can see both of the inverse distance weighted lines trail to zero almost immediately, whereas the bisquare kernel trails off much more slowly. So let’s check out our maps using a bisquare kernel with the distance threshold set to 2000 meters. The biSqW function is equivalent to the original spatstat idw function, but uses the bisquare kernel and returns the variance estimate as well. You just need to pass it a distance threshold for the b_dist parameter.

#BiSquare weighting, 2000 meter distance
LA_bS_w <- biSqW(LA_StreetPP, b_dist=2000)
plot(LA_bS_w$rat)
plot(LA_bS_w$var)

Here we get a map that looks more like a typical hot spot kernel density map. We can see some of the broader trends in the northern part of the city, and even see a really dirty hot spot I did not previously notice in the northeastern peninsula.

The 2,000 meter distance threshold was just ad-hoc though. How large or small should it be? A quick check of the spatial correlogram is one way to make the choice slightly more objective. Here I use the correlog function in the ncf package to estimate this. I subsample the data first (I presume it has a call to dist somewhere, which would choke on the full dataset).

#correlogram on a random sample, the full data is too big
subSamp <- CleanStreets[sample(nrow(CleanStreets), 3000), ]
fit <- correlog(x=subSamp$XMidPoint,y=subSamp$YMidPoint,z=subSamp$StreetScor, increment=100, resamp=0, quiet=TRUE)
plot(fit)

Here we can see points very near each other have a correlation of 0.2, and this trails off to zero before 20 kilometers (the distances here are in meters). FYI the correlation rising back up at very large distances often occurs for data that have broader spatial trends.

So let’s try out a bisquare kernel with a distance threshold of 10 kilometers.

#BiSquare weighting, 10000 meter distance
LA_bS_w <- biSqW(LA_StreetPP, b_dist=10000)
plot(LA_bS_w$rat)
plot(LA_bS_w$var)

That is now a bit oversmoothed. But it allows a nicer range of potential values, as opposed to simply sticking with the inverse distance weighting.


A few notes on the variance of IDW

So I hacked the idw function in the spatstat package to return the variance of the estimate as well as the actual weighted mean. This involved going into the C function, so I used the inline package to create my own version. Ditto for creating the maps using the bisquare weights instead of inverse distance weighting. To quickly see those functions, here is the R code.

Given some harassment on Crossvalidated by Mark Stone, I also updated the algorithm to be a more numerically safe one, both for the weighted mean and the weighted variance. Note though that that Wikipedia article has a special definition for the variance. The correct Bessel correction for weighted data (in this case) is the sum of the weights (V1) minus the sum of the squared weights (V2) divided by V1, i.e. V1 - V2/V1. Here I just divide by V1, but that could easily be changed (not sure if in the sum of squared weights I need to worry about underflow). I.e. change the line MAT(var, ix, iy, Ny) = m2 / sumw; to MAT(var, ix, iy, Ny) = m2 / (sumw - sumw2/sumw); in the various C calls.
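For those who want the logic without digging through the C, here is a rough python sketch of the kind of one pass, numerically stable update I am describing for the weighted mean and variance (the actual C code differs in its bookkeeping, so treat this as illustrative only):

def weighted_mean_var(values, weights):
    #accumulate the weighted mean and weighted sum of squares in one pass
    sumw, mean, m2 = 0.0, 0.0, 0.0
    for x, w in zip(values, weights):
        sumw += w
        delta = x - mean
        mean += (w/sumw)*delta
        m2 += w*delta*(x - mean)
    #divide by the sum of the weights, i.e. no Bessel correction
    return mean, m2/sumw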

Someone should also probably write in a check to prevent distances of zero, maybe by capping the weights to never be above a certain value, although it is not obvious what the default cap should be. (If you have data on the unit square, weights above 1 would occur quite regularly, but for a large city like this projected in meters, capping the weight at 1 would be fine.)

In general these variance maps did not behave like I expected them to, either with this or other data. When using Bessel’s correction they tended to look even weirder. So I would need to explore some more before I go and recommend them. Probably should not waste more time on this though, and should just fit an actual kriging model to produce the standard errors of the estimates.

A principled approach to conducting subgroup analysis

Social scientists often have a problem when conducting analysis — we have theories that are not tightly coupled to actual measures of individual behavior. A common response to this is to conduct models of many different, interrelated measures. This can be with outcome variables, e.g. if I know poverty predicts all crimes, does poverty predict both violent crime and property crime at the city level? Or with explanatory variables, e.g. does being a minority reduce your chances of getting a job interview, or does it matter what type of minority you are — e.g. Black, Asian, Hispanic, Native American, etc.

Another situation is conducting analysis among different units of analysis, e.g. see if a treatment has a different effect for males or females, or see if a treatment works well in one country, but does not work well in another. Or if I find that a policy intervention works at the city level, are the effects in all areas of the city, or in just some neighborhoods?

On their face, these may all seem like unique problems. They are not; they are all different variants of subgroup analysis. In my dissertation, while trying to identify situations in which you need to use small geographic units of analysis, I realized that the logic behind choosing a geographic unit of analysis is the same as for these different subgroup analyses. I outline my logic more succinctly in this article, but I will try it in a blog post as well.

I am what I would call a "reductionist social scientist". In plain terms, if we fit a model:

Y = B*X

we can always get more specific, either in terms of explanatory variables:

Y = b1*x1 + b2*x2, where X = x1 + x2

Or in terms of the outcome:

y1 = b1*X
y2 = b2*X, where Y = y1 + y2

Hubert Blalock talks about this in his causal inferences book. I think many social scientists are reductionists in this sense, we can always estimate more specific explanatory variables, or more specific outcomes, or within different subgroups, ad nauseam. Thus the problem is not should we conduct analysis in some particular subgroup, but when should we be satisfied that the aggregate effect we estimate is good enough, or at least not misleading.

So remember when we are evaluating a model that the aggregate effect is a function of the subgroup effects. In linear models the math is easy (I show some examples in my linked paper), but the logic generally holds for non-linear models as well. So when should we be ok with the aggregate level effect? We should be ok with the aggregate effect if we expect the direction and size of the effects in the subgroups to be similar. We should not be ok if the effects are countervailing in the subgroups, or if the magnitude of the differences is very large.

For some simplistic examples, if we go with our job interview for minorities relative to whites example:

Prob(Job Interview) = 0.5 + 0*(Minority)

So here the effect is zero; minorities have the same probability as white individuals, 50%. But let’s say we estimate an effect for different minority categories:

Prob(Job Interview) = 0.5 + 0.3(Asian) - 0.3(Black)

Our aggregate effect for minorities is zero, because it is positive for Asian individuals and negative for Black individuals, and in the aggregate these two effects cancel out. That is one situation in which we should be worried.

Now how about the effect of poverty on crime:

All Crime = 5*(Percent in Poverty)

Versus different subsets of crime, violent and property.

Violent Crime = 3*(Percent in Poverty)
Property Crime = 2*(Percent in Poverty)

Here we can see that the subgroups contribute to the total, but the effect for property is slightly less than that for violent crime. Here the aggregate effect is not misleading, but the micro level effect may be theoretically interesting.

For the final example of different areas, let’s say we do a gun buy back program, and we estimate the reduction in shootings at the city wide level. So let’s say we estimate the number of shootings per month:

Shootings in the City = 10 - 5*(Gun Buy Back)

So we say the gun buy back reduced 5 shootings per month. Maybe we think the reduction is restricted to certain areas of the city. For simplicity, say this city only has two neighborhoods, North and South. So we estimate the effect of the gun buy back in these two neighborhoods:

Shootings in the North Neighborhood = 9 - 5*(Gun Buy Back)
Shootings in the South Neighborhood = 1 - 0*(Gun Buy Back)

Here we find the program only reduced shootings in the North neighborhood, it had no appreciable effects in the south neighborhood. The aggregate city level effect is not misleading, we can just be more specific about decomposing that effect to different areas.


Here I will relate this to some of my recent work — using 311 calls for service to predict crime at micro places in DC.

In a nutshell, I’ve fit models of the form:

Crime = B*(311 Calls for Service)

And I found that 311 calls have a positive, but small, effect on crime.

Over time, either at presentations in person or in peer review, I’ve gotten three different "subgroup" critiques. These are:

  • I think you cast the net too wide in 311 calls, e.g. "bulk collections" should not be included
  • I think you shouldn’t aggregate all crime on the left hand side, e.g. I think the effect is mostly for robberies
  • I think you shouldn’t estimate one effect for the entire city, e.g. I think these signs of disorder matter more in some neighborhoods than others

Now, these are all reasonable questions, but does it call into question my main aggregate finding? Not at all.

For the casting the net too wide for 311 calls, do you think that bulk collections have a negative relationship to crime? Unlikely. (I’ve eliminated them from my current article due to one reviewer complaint, but to be honest I think they should be included. Seeing a crappy couch on the street is not much different than seeing garbage.)

For all crime on the left hand side, do you think 311 calls have a negative effect on some crimes, but a positive effect on others? Again, unlikely. It may be the case that it has larger effects on some than others, but that does not mean the effect on all crime is misleading. And if it is larger for robberies than for other crimes, you can go and build a theory about why that is the case and test it.

For the one estimate in different parts of the city, do you think it has a negative effect in some parts, and a positive effect in others? Again, unlikely. It may be that in some areas the effect is larger, but overall we expect it to be positive or zero in all areas of the city. The aggregate city wide effect is not likely to be misleading.

These are all fine as future research questions, but I get frustrated when they are given as reasons to critique my current findings. They don’t invalidate the aggregate findings at all.


In response to this, you may think, well why not conduct all these subgroup analyses – what’s the harm? There are a few different harms to conducting these subgroup analyses willy-nilly. They all relate to chasing the noise and interpreting it.

For each of these subgroups, you will have less power to estimate effects than in the aggregate. Say I test the effect of each individual 311 call type (there are near 30 that I sum together). Simply by chance some of these will have null effects or slightly negative effects, and all will be small by themselves. I have no a priori reason to think some have a different effect than others; the theory behind why they are related to crime at all (Broken Windows) does not distinguish between them.

This often ends up being a catch-22 in peer review. You do more specific analysis, by chance a coefficient goes in the wrong direction, and the reviewer interprets it as your measures and/or model is bunk. In reality they are just over-interpreting noise.

That is in response to reviewers, but what about conducting subgroup analyses on your own? Here you have to worry about the garden of forking paths. Say I conducted the subgroup analysis for different types of crime outcomes, and they are all in the same direction except for thefts from autos. I then report all of the results except for thefts from autos, because that result does not confirm my theory. This is a large current problem in reproducing social science findings — a subgroup analysis may seem impressive, but you have to worry about which results the researcher cherry picked.

Only reporting confirmatory evidence for some subgroups will always be a problem in data analysis — not even pre-registration of your data plan will solve it. Thus, you should only do subgroup analysis if there is strong theoretical reasoning to think the aggregate effect is misleading. If you think subgroup differences may be theoretically interesting, you should simply plan a new study from the start to assess them.

Given some of the reviews I received for the 311 paper, I am stuffing many of these subgroup analyses in appendices just to preempt reviewer critique. (Ditto for my paper on alcohol outlets and crime that I will be presenting at ASC in a few weeks, which will be up on SSRN likely next week.) I don’t think it is the right thing to do though; again, I think it is mostly noise mining, but perpetually nit-picky reviewers have basically forced me to do it.

My endorsement for criminal justice at Bloomsburg University

The faculty at PASSHE schools (public universities in Pennsylvania) are currently on strike. My main reason for writing this post is that I went to Bloomsburg University and received a terrific education (a BA in criminal justice). If I could go back and do it all over again, I would still definitely attend Bloomsburg.

For a bit of background, all of the PA state schools were originally formed as normal schools — colleges to prepare teachers for lower education. They were intentionally placed around the state, so students did not have to travel very far. This is why they seem to be in rural places no one has ever heard of (it is intentional). For those in New York, this is an equivalent story for the smaller SUNY campuses — although unlike New York the state schools in PA have no shared acronym. This does not include Penn State University, which is a land-grant school. At some later point, the normal colleges expanded to universities and PASSHE was formed.

There are some pathological problems with higher education currently. One of them is the rising price of tuition. PASSHE schools are basically the cheapest places you can get a bachelor’s degree. At private institutions (or Penn State Univ.) you are going to pay two to three times as much as at a PASSHE institution.

Criminal justice is a continually growing degree. To meet the teaching demand, many programs are filling in with adjunct labor. Bloomsburg did not do this when I was there, and this still appears to be the case. The majority of the faculty I had (04-08) are still on the faculty (this is true for criminal justice, sociology, and the two math professors I took all my statistics courses with), although the CJ program appears to have grown beyond the three faculty members (Leo Barrile, Neal Slone, and Pam Donovan) who were there in my time. They did have an additional adjunct when I was there (who shall go unnamed) who is in the running for laziest teacher I have ever had.

Now, don’t get me wrong — adjuncts can be good teachers. I’ve taught as an adjunct myself. You should be concerned though if the majority of the courses in a department are being taught by adjuncts. People with professional experience can be great teachers — especially for advanced courses about their particular expertise — but they should rarely be teaching core courses for a degree in criminal justice. Core courses for CJ would likely include intro to criminal justice, criminology, penology, criminal law, statistics, and research design. (The last two are really essential courses for any student in the social sciences.)

The main reason some professors are better than others is not directly related to being tenure track faculty versus adjunct though — a big factor is continuity. When I am teaching a course for the first time, students are guinea pigs, whereas a professor who has taught the course many semesters is going to be better prepared. Adjuncts with poor pay are not as likely to stick around, so you get a revolving door. Folks who have been around awhile are just more likely to be polished teachers.

To end, some students choose bigger schools because they believe there are more opportunities (either to have fun or for their education). There were really more opportunities at Bloomsburg than I could even take advantage of. Besides the BA in criminal justice, I had a minor in statistics and sociology. I also got my introduction to making maps by taking a GIS class in the geography department. In retrospect I would have taken a few more math classes (like swapping out the Econ Statistics courses for Macro-Econ). Bloomsburg is small but don’t worry about having a fun time either — if you get take out do NAPS, if you just want a slice do OIP. (The pizza here in Dallas is terrible all the places I’ve tried.)

While it may be frustrating to students (or maybe more to parents who are paying the bills), it is in everyone’s best long term interest to preserve the quality of education at PASSHE schools. Appropriate pay and benefits for faculty and adjuncts are necessary to do that.

I think I will write a blog post describing more about an undergraduate degree in criminal justice, but if you are a student here in the Dallas area interested in criminology at UTD, always feel free to send me an email with questions. Also please email me a place where I can get a decent tasting slice!

Testing the equality of two regression coefficients

The default hypothesis test that software spits out when you run a regression model is the null that the coefficient equals zero. Frequently there are other more interesting tests though, and this is one I’ve come across often — testing whether two coefficients are equal to one another. The big point to remember is that Var(A-B) = Var(A) + Var(B) - 2*Cov(A,B). This formula gets you pretty far in statistics (and is one of the few I have memorized).

Note that this is not the same as testing whether one coefficient is statistically significant and the other is not. See this Andrew Gelman and Hal Stern article that makes this point. (The link is to a pre-print PDF, but the article was published in the American Statistician.) I will outline four different examples I see people make this particular mistake.

One is when people have different models, and they compare coefficients across them. For an example, say you have a base model predicting crime at the city level as a function of poverty, and then in a second model you include other control covariates on the right hand side. Let’s say the first effect estimate of poverty is 3 (1), where the value in parentheses is the standard error, and the second estimate is 2 (2). The first effect is statistically significant, but the second is not. Do you conclude that the effect sizes are different between models though? The evidence for that is much less clear.

To construct the estimate of how much the effect declined, the decline would be 3 - 2 = 1, a decrease of 1. What is the standard error around that decrease though? We can use the formula for the variance of the difference that I noted before to construct it. The standard error squared is the variance around each parameter estimate, so we have sqrt(1^2 + 2^2) =~ 2.23 as the standard error of the difference — which assumes the covariance between the estimates is zero. So the standard error around our estimated decline is quite large, and we can’t be sure that the two models give appreciably different estimates of the poverty effect.

There are more complicated ways to measure moderation, but this ad-hoc approach can be easily applied as you read other people’s work. The assumption of zero covariance for the parameter estimates is not as big of a deal as it may seem. In large samples these covariances tend to be very small, and they are frequently negative. So even though we know that assumption is wrong, just pretending it is zero is not a terrible folly.

The second case is where you have models predicting different outcomes. So going with our same example, say you have a model predicting property crime and a model predicting violent crime. Again, I will often see people make a mistake equivalent to the moderator scenario, and say that the effect of poverty is larger for property crime than violent crime because one is statistically significant and the other is not.

In this case if you have the original data, you actually can estimate the covariance between those two coefficients. The simplest way is to estimate that covariance via seemingly unrelated regression. If you don’t though, such as when you are reading someone else’s paper, you can just assume the covariance is zero. Because the parameter estimates often have negative correlations, this assumption will make the standard error estimate smaller.

The third case is where you have different subgroups in the data, and you examine the differences in coefficients. Say you had recidivism data for males and females, and you estimated one model for the effect of a treatment on males and another model for females. So we have two models:

Model Males  : Prob(Recidivism) = B_0m + B_1m*Treatment
Model Females: Prob(Recidivism) = B_0f + B_1f*Treatment

Where the B_0? terms are the intercepts, and the B_1? terms are the treatment effects. This is another example where you can stack the data and estimate an interaction term to get the difference in the effects and its standard error. So we can estimate a combined model for both males and females as:

Combined Model: Prob(Recidivism) = B_0c + B_1c*Treatment + B_2c*Female + B_3c(Female*Treatment)

Where Female is a dummy variable equal to 1 for female observations, and Female*Treatment is the interaction term for the treatment variable and the Female dummy variable. Note that you can rewrite the model for males and females as:

Model Mal.: Prob(Recidivism) =     B_0c      +      B_1c    *Treatment    ....(when Female=0)
Model Fem.: Prob(Recidivism) = (B_0c + B_2c) + (B_1c + B_3c)*Treatment    ....(when Female=1)

So we can interpret the interaction term B_3c as the difference in the treatment effect for females relative to males (a quick python sketch of this combined model is below). The standard error of this interaction takes into account the covariance term, unlike estimating two totally separate equations would. (You can stack the property and violent crime outcomes I mentioned earlier in a synonymous way to this subgroup example.)
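As a quick python sketch of what that combined model looks like (using a logit just for illustration, and assuming a pandas dataframe dat with Recidivism, Treatment, and Female columns; those names are placeholders, not from any real dataset):

import statsmodels.formula.api as smf

#combined model, the Treatment:Female coefficient is B_3c, the difference in treatment effects
comb = smf.logit('Recidivism ~ Treatment + Female + Treatment:Female', data=dat).fit()
print comb.summary()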

The final, fourth example is the simplest: two regression coefficients in the same equation. One example is from my dissertation, on the correlates of crime at small spatial units of analysis. I test whether different places that sell alcohol — such as liquor stores, bars, and gas stations — have the same effect on crime. For simplicity I will just test two effects, whether liquor stores have the same effect as on-premise alcohol outlets (this includes bars and restaurants). So let’s say I estimate a Poisson regression equation as:

log(E[Crime]) = Intercept + b1*Bars + b2*LiquorStores

And then my software spits out:

                  B     SE      
Liquor Stores    0.36  0.10
Bars             0.24  0.05

And then let’s say we also have the variance-covariance matrix of the parameter estimates – which most stat software will return for you if you ask it:

                 Liquor_Stores    Bars
Liquor_Stores     0.0100
Bars             -0.0002          0.0025

On the diagonal are the variances of the parameter estimates, which if you take the square root are equal to the reported standard errors in the first table. So the difference estimate is 0.36 - 0.24 = 0.12, and the standard error of that difference is sqrt(0.01 + 0.0025 - 2*-0.0002) =~ 0.11. So the difference is not statistically significant. You can take the ratio of the difference and its standard error, here 0.12/0.11, and treat that as a test statistic from a normal distribution. So the rule that it needs to be plus or minus two to be stat. significant at the 0.05 level applies.
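If you want to double check that arithmetic, here is a quick python snippet just plugging in the numbers from the two tables above:

from math import sqrt

b_liquor, b_bars = 0.36, 0.24
v_liquor, v_bars, cov = 0.01, 0.0025, -0.0002

diff = b_liquor - b_bars                   #0.12
se_diff = sqrt(v_liquor + v_bars - 2*cov)  #~0.11
print diff, se_diff, diff/se_diff          #test statistic around 1.1, not significant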

This is called a Wald test specifically. I will follow up with another blog post and some code examples on how to do these tests in SPSS and Stata. For completeness and just because, I also list two more ways to accomplish this test for the last example.


There are two alternative ways to do this test though. One is by doing a likelihood ratio test.

So we have the full model as:

 log(E[Crime]) = b0 + b1*Bars + b2*Liquor_Stores [Model 1]
 

And we have the reduced model as:

 log(E[Crime]) = b4 + b5*(Bars + Liquor_Stores)  [Model 2]
 

So we just estimate the full model with Bars and Liquor Stores on the right hand side (Model 1), then estimate the reduced model (2) with the sum of Bars + Liquor Stores on the right hand side. Then you can just do a chi-square test based on the change in the log-likelihood. In this case there is a change of one degree of freedom.
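If you happen to be working in python instead of R or Stata, here is a hedged sketch of the same likelihood ratio test using statsmodels (the dataframe dat and its column names are placeholders I am assuming, not data from the paper):

import statsmodels.formula.api as smf
from scipy.stats import chi2

#full model with separate coefficients (Model 1)
full = smf.poisson('Crime ~ Bars + Liquor_Stores', data=dat).fit()
#reduced model with one coefficient for the sum (Model 2)
dat['BarsLiquor'] = dat['Bars'] + dat['Liquor_Stores']
reduced = smf.poisson('Crime ~ BarsLiquor', data=dat).fit()

#chi-square test on the change in the log-likelihood, one degree of freedom
lr_stat = 2*(full.llf - reduced.llf)
p_value = chi2.sf(lr_stat, 1)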

I give an example of doing this in R on crossvalidated. This test is nice because it extends to testing multiple coefficients, so I could test bars = liquor stores = convenience stores. The prior individual Wald tests are not as convenient for testing the equality of more than two coefficients at once.


Here is another way to have the computer more easily spit out the Wald test for the difference between two coefficients in the same equation. So if we have the model (the lack of an intercept does not matter for the discussion here):

y = b1*X + b2*Z [eq. 1]

We can test the null that b1 = b2 by rewriting our linear model as:

y = B1*(X + Z) + B2*(X - Z) [eq. 2]

And the test for the B2 coefficient is our test of interest. The logic goes like this — we can expand [eq. 2] to be:

y = B1*X + B1*Z + B2*X - B2*Z [eq. 3]

which you can then regroup as:

y = X*(B1 + B2) + Z*(B1 - B2) [eq. 4]

and note the equalities between equations 4 and 1.

B1 + B2 = b1; B1 - B2 = b2

So the test of B2 is the test for the difference between b1 and b2. B2 is a little tricky to interpret as an effect size for how much larger b1 is than b2 though – it is only half of the difference. An easier way to estimate that effect size is to insert (X-Z)/2 into the right hand side instead, and then the coefficient and confidence interval for that term will be the estimate of how much larger the effect of X is than Z.

Note that this gives an equivalent estimate to conducting the Wald test by hand as I mentioned before.
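For completeness, here is a hedged python sketch of that reparameterization using ordinary least squares (again the dataframe dat and the columns y, X, and Z are just placeholders):

import statsmodels.formula.api as smf

#reparameterize so the second coefficient directly estimates b1 - b2
dat['SumXZ'] = dat['X'] + dat['Z']
dat['HalfDiff'] = (dat['X'] - dat['Z'])/2
m = smf.ols('y ~ SumXZ + HalfDiff', data=dat).fit()
#the coefficient and confidence interval for HalfDiff give the estimate of b1 - b2
print m.summary()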

Group based trajectory models in Stata – some graphs and fit statistics

For my advanced research design course this semester I have been providing code snippets in Stata and R. This is the first time I’ve really sat down and programmed extensively in Stata, and this is a followup to produce some of the same plots and model fit statistics for group based trajectory models as this post in R. The code and the simulated data I made to reproduce this analysis can be downloaded here.

First, for my own notes, my version of Stata is on a server here at UT Dallas, so I cannot simply run

net from http://www.andrew.cmu.edu/user/bjones/traj
net install traj, force

to install the group based trajectory code. Instead, I now have this set of code in the header of all my do files:

*let stata know to search for a new location for stata plug ins
adopath + "C:\Users\axw161530\Documents\Stata_PlugIns"
*to install on your own in the lab it would be
net set ado "C:\Users\axw161530\Documents\Stata_PlugIns"

So after running that, I can install the traj command, and Stata will know where to look for it!

Once that is taken care of after setting the working directory, I can simply load in the csv file. Here my first variable was read in as ïid instead of id (I’m thinking because of the encoding in the csv file). So I rename that variable to id.

*Load in csv file
import delimited GroupTraj_Sim.csv
*BOM mark makes the id variable weird
rename ïid id

Second, the traj command expects the data in wide format (which this data set already is), and has counts in count_1, count_2, ..., count_10. The traj command also wants you to input a time variable too, which I do not have in this file. So I create a set of t_ variables to mimic the counts, going from 1 to 10.

*Need to generate a set of time variables to pass to traj, just label 1 to 10
forval i = 1/10 { 
  generate t_`i' = `i'
}

Now we can estimate our group based models, and we get a pretty nice default plot.

traj, var(count_*) indep(t_*) model(zip) order(2 2 2) iorder(0)
trajplot

Now for absolute model fit statistics, there are the average posterior probabilities, the odds of correct classification, and the observed classification proportion versus the expected classification proportion. Here I made a program that I will surely be ashamed of later (I should not brutalize the data like this, and should instead do the calculations in matrices), but it works and produces an ugly table to get us these stats.

*I made a function to print out summary stats
program summary_table_procTraj
    preserve
    *now lets look at the average posterior probability
    gen Mp = 0
    foreach i of varlist _traj_ProbG* {
        replace Mp = `i' if `i' > Mp 
    }
    sort _traj_Group
    *and the odds of correct classification
    by _traj_Group: gen countG = _N
    by _traj_Group: egen groupAPP = mean(Mp)
    by _traj_Group: gen counter = _n
    gen n = groupAPP/(1 - groupAPP)
    gen p = countG/ _N
    gen d = p/(1-p)
    gen occ = n/d
    *Estimated proportion for each group
    scalar c = 0
    gen TotProb = 0
    foreach i of varlist _traj_ProbG* {
       scalar c = c + 1
       quietly summarize `i'
       replace TotProb = r(sum)/ _N if _traj_Group == c 
    }
    *This displays the group number, the count per group, the average posterior probability for each group,
    *the odds of correct classification, and the observed probability of groups versus the probability 
    *based on the posterior probabilities
    list _traj_Group countG groupAPP occ p TotProb if counter == 1
    restore
end

This should work after any model as long as the naming conventions for the assigned groups are _traj_Group and the posterior probabilities are in the variables _traj_ProbG*. So when you run

summary_table_procTraj

You get this ugly table:

     | _traj_~p   countG   groupAPP        occ       p    TotProb |
     |------------------------------------------------------------|
  1. |        1      103    .937933   43.57431   .2575   .2573651 |
104. |        2      161   .9935606   229.0434   .4025   .4064893 |
265. |        3      136   .9607248   47.48378     .34   .3361456 |

The groupAPP are the average posterior probabilities – here you can see they are all quite high. occ is the odds of correct classification, and again they are all quite high. p is the proportion in each group based on assigning observations to their maximum posterior probability group, and TotProb is the expected proportion based on the sums of the posterior probabilities. TotProb should be the same as in the Group Membership part at the bottom of the traj output. They are close (to 5 decimals), but not exactly the same (and I do not know why that is the case).

Next, to generate a plot of the individual trajectories, I want to reshape the data to long format. I use preserve in case I want to go back to wide format later and estimate more models. If you need to look to see how the reshape command works, type help reshape at the Stata prompt. (Ditto for help on all Stata commands.)

preserve
reshape long count_ t_, i(id)

To get the behavior I want in the plot, I use a scatter plot but have them connected via c(L). Then I create small multiples for each trajectory group using the by() option. Before that I slightly jitter the count data, so the lines are not always perfectly overlapped. I make the lines thin and grey — I would also use transparency but Stata graphs do not support this.

gen count_jit = count_ + ( 0.2*runiform()-0.1 )
graph twoway scatter count_jit t_, c(L) by(_traj_Group) msize(tiny) mcolor(gray) lwidth(vthin) lcolor(gray)

I’m too lazy at the moment to clean up the axis titles and such, but I think this plot of the individual trajectories should always be done. See Breaking Bad: Two Decades of Life-Course Data Analysis in Criminology, Developmental Psychology, and Beyond (Erosheva et al., 2014).

While this fit looks good, this is not the correct number of groups given how I simulated the data. I will give those trying to find the right answer a few hints: none of the groups have a polynomial higher than 2, and there is a constant zero inflation for the entire sample, so iorder(0) will be the correct specification for the zero inflation part. If you take a stab at it let me know, and I will fill you in on how I generated the simulation.

Writing equations in Microsoft Word

A student asked me about using LaTex the other day, and I stated that it is a bit of a hassle for journal articles in our field, so I have begun to use it less. Most of the journals in my field (criminology and criminal justice) make it difficult to turn in an article in that format. Many refuse to accept PDF articles outright, and the last time I submitted a LaTex file to JQC (a Springer journal) that would not compile, I received zero help from staff over a month of emails, so I just reformatted it into a Word document anyway. The last time I submitted a LaTex document to Criminology, a reviewer said it probably had typos — without pointing out any of course. (FYI folks, besides doing the obvious and pointing out typos if they exist, my text editor has a spell checker same as Word to highlight typos.) Besides this, none of my co-workers use LaTex, so it is a non-starter when I am collaborating. I did my dissertation in LaTex, and I would do that in LaTex again, but for smaller articles it is not a big deal.

The main nicety of LaTex is math equations. I don’t do very heavy math stuff, and I have figured out the Microsoft Word equation editor well enough to suit most of my needs. So here is a set of examples for many of the use cases I have needed in journal articles. I also have this in a Word (docx) document and a PDF for handy reference. Those have a few references I gathered from the internet, but the best IMO is this guy’s blog (who I think is a developer for Word) and this document authored by the same individual.

One of the things to note about the equation editor in Word is that you can type various shortcuts that are then automatically converted. For example, you can type \gamma, hit the space bar, and the equation will change to show the gamma symbol. So there are some similarities to LaTex. (Another pro-tip: to start an equation in Word you can press Alt=.) In the subsequent examples I will use <space> to represent hitting the space bar, and there are other examples using <back> (for the left arrow key) and <backspace> for the backspace button.

Greek characters, subscripts and superscripts

If you type

log<space>(\lambda) = \beta_0<space> + \beta_1<space>(X) + \beta_2<space>(X^2<space>)

you get:

Accents

For these you need to hit the space key twice, so

x\hat<space><space> = y\bar<space><space>

turns into:

Expected value and variance

For the equivalent of \mathbb in LaTex, you can do

\doubleV<space>(X)= \doubleE<space>(X)^2<space> + \doubleE<space>(X^2<space>)

Plain text within equation

To do plain text within an equation, equivalent to \text{*} in LaTex, you can use double quotes. (Note that you do not need a backslash before "log".) So

Y = -1\cdot<space>log<space>("Property Crime"<space>) + (not pretty text)

looks like:

Sum and product

To get the product symbol is simply \prod<space>, and here is a more complicated example for the sum:

n^-1\cdot<space>\sum^n_(i=1)<space>x_i<space>= x\bar<space><space>

Square root

Square roots always cause me trouble for how they look and kern (both in LaTex and Word). Here is how I would do an example of Euclidean distance,

d_ij<space>=\sqrt<space><space><back>(x_i-x_j)^2+(y_i-y_j)^2<space>

Fractions

The big (stacked) fraction is simple, but I had to search a bit to find how to do inline fractions (what Word calls "linear"). So here a backslash followed by a forward slash does the inline fraction:

1/n = 1\/n

Numbering an equation

I’ve seen quite a few different hacks for numbering equations in Word. If you need to number and refer to them in text often, I would use LaTex. But here is one way to do it in Word.

E = mc^2#(30)<enter>

produces below (is it just me or does this make the equation look different than the prior ones in Word?):

Multiple lines of equations

For a while I did not think this was possible, but I recently found examples of multiline equations (equivalent to \align in Latex). The way this works is you place a & sign before the symbols you want to line up (same as LaTex), but for Word to split a line you use @. So if you type

\eqarray(10x&=4y@5x&=2y)\eqarray<space><backspace>

you will get:

Have any more good examples? Let me know in the comments!

The other nicety of LaTex is formatting references — you are on your own for that in Word though.

Added code snippets page

I’ve written quite a few blog posts over the years, and it is getting to be hard for me to keep all of them orderly. So I recently created a page with a simple table of contents with code snippets broken down into categories. There are also a few extra SPSS macros in there that I have not written blog posts about.

Every now and then I see a broken link and update it, but I know there are more I am missing. So just send an email if I have a link to some of my code and it is currently broken.