Comparing samples post-matching – some helper functions after FUZZY (SPSS)

I’ve been conducting quite a few case-control or propensity score matching studies lately. So I wrote some helper functions for use after the SPSS FUZZY command. These create the case-control dataset, plus calculate some of the standardized bias metrics for matching on continuous outcomes.

The use case here is if you have a sub-set of treated individuals, and you want to draw a comparison sample matched on certain characteristics (which can include just one propensity score and/or multiple covariates). Here is the macro to follow along, and I will provide a quick walkthrough of how it works. (There is documentation in the header for what the parameters are and what the function returns.)

So first I am going to import my macro using INSERT:

*Inserting the macro.
INSERT FILE = "C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Matching_StandBias\PropBalance_Macro.sps".

Now just for illustration I am going to make a fake dataset to illustrate the utility of matching. Here I have a universe of 2,000 people. There is a subset of treated individuals (165), but they are only selected if they are under 28 years old and male.

*Create a fake dataset.
LOOP Id = 1 TO 2000.
COMPUTE YearsOld = RV.UNIFORM(18,40).
FORMATS Male (F1.0) YearsOld (F2.0).
DO IF Male = 1 AND YearsOld <= 28.
  COMPUTE Treated = RV.BERNOULLI(0.3).
  COMPUTE Treated = 0.
COMPUTE #OutLogit = 0.7 + 0.5*Male - 0.05*YearsOld - 0.7*Treated.
COMPUTE #OutProb = 1/(1 + EXP(-#OutLogit)).
FREQ Treated Outcome.

So what happens when we make comparisons among the entire sample, which includes females and older people?

*Compare means with the original full sample.
T-TEST GROUPS=Treated(0 1) /VARIABLES=Outcome.

We get basically no difference, our treated mean is 0.40 and the untreated mean is 0.39. But instead of comparing the 165 to the entire sample, we draw more reasonable control cases. Here we do an exact match on Male, and then we do a fuzzy match on YearsOld to within 3 years.

*Draw the comparison sample based on Male (exact) and YearsOld (Fuzzy).

Now what the FUZZY command does in SPSS is creates a new variable, named Match1 here, that places the matched Id in the same row as the original treated sample. You cannot easily make the updated comparisons that you want though in this data format. So after writing the code to do this about 7 times, I decided to make it into a simple macro. Here is an example of calling my macro, !MatchedSample.

*Now run my macro to make the matched sample.
!MatchedSample Dataset=OrigData Id=Id Case=Treated MatchGroup=MGroup Controls=[Match1] 
  MatchVars=[YearsOld] OthVars=Outcome Male.

This then spits out two new datasets, as well as appends a new variable to the original dataset named MatchedSample to show what cases have been matched. Then it is simple to see the difference in our means among our matched sample.

*Now the t-test with the matched sample subset.
T-TEST GROUPS=Treated(0 1) /VARIABLES=Outcome.

Which shows the same mean for treated, 0.40 (since all the treated were matched), but the comparison group now has a mean of 0.51, so here the treatment reduced the outcome.

The macro also provides an additional dataset named AggStats that estimates the standardized bias in the original sample vs. the standardized bias in the matched sample. (Standardized bias is just Cohen’s D measure multiplied by 100.) This then also calculates the standardized bias reduction for each continuous covariate. Before I forget, a neat way to test for balance jointly (instead of one variable at a time) is to conduct an additional regression equation predicting treatment and then testing for all coefficients equal to zero.

In this fake example the propensity scores would not be needed, you could just estimate a typical logistic regression equation controlling for YearsOld and Male. But the utility of matching comes from when you don’t know the functional form of how those covariates affect the outcome. So if the outcome was a very non-linear function of age, you don’t have to worry about estimating that function, you can just match on age and still get a reasonable comparison of the mean difference for treated vs. not-treated.

Weekly and monthly graphs for monitoring crime patterns (SPSS)

I was recently asked for some code to show how I created the charts in my paper, Tables and Graphs for Monitoring Crime Patterns (Pre-print can be seen here).

I cannot share the data used in the paper, but I can replicate them with some other public data. I will use calls for service publicly available from Burlington, VT to illustrate them.

The idea behind these time-series charts are not for forecasting, but to identify anomalous patterns – such as recent spikes in the data. (So they are more in line with the ideas behind control charts.) Often even in big jurisdictions, one prolific offender can cause a spike in crimes over a week or a month. They are also good to check more general trends as well, to see if crimes have had more slight, but longer term trends up or down.

For a preview, we will be making a weekly time series chart:

In the weekly chart the red line is the actual data, the black line is the average of the prior 8 weeks, and the grey band is a Poisson confidence interval around that prior moving average. The red dot is the most recent week.

And we will also be making a monthly seasonal chart:

The red line is the counts of calls per month in the current year, and the lighter grey lines are prior years (here going back to 2012).

So to start, I saved the 2012 through currently 6/20/2016 calls for service data as a csv file. And here is the code to read in that incident level data.

*Change this to where the csv file is located on your machine.
FILE HANDLE data /NAME = "C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Tables_Graphs".
  AdjustedLatitude AUTO
  AdjustedLongitude AUTO
  AlcoholRelated AUTO
  Area AUTO
  CallDateTime AUTO
  CallType AUTO
  Domv AUTO
  DayofWeek AUTO
  DrugRelated AUTO
  EndDateTime AUTO
  GeneralTimeofDay AUTO
  IncidentNumber AUTO
  LocationType AUTO
  MentalHealthRelated AUTO
  MethodofEntry AUTO
  Month AUTO
  PointofEntry AUTO
  StartDateTime AUTO
  Street AUTO
  Team AUTO
  Year AUTO

First I will be making the weekly chart. What I did when I was working as an analyst was make a chart that showed the recent weekly trends and to identify if the prior week was higher than you might expect it to be. The weekly patterns can be quite volatile though, so I smoothed the data based on the average of the prior eight weeks, and calculated a confidence interval around that average count (based on the Poisson distribution).

As a start, we are going to turn our date variable, CallDateTime, into an SPSS date variable (it gets read in as a string, AM/PM in date-times are so annoying!). Then we are going to calculate the number of days since some baseline – here it is 1/1/2012, which is Sunday. Then we are going to calculate the weeks since that Sunday. Lastly we select out the most recent week, as it is not a full week.

*Days since 1/1/2012.
COMPUTE #Sp = CHAR.INDEX(CallDateTime," ").
COMPUTE Days = DATEDIFF(CallDate,DATE.MDY(1,1,2012),"DAYS").
COMPUTE Weeks = TRUNC( (Days-1)/7 ).
SELECT IF Weeks < 233.

Here I do weeks since a particular date, since if you do XDATE.WEEK you can have not full weeks. The magic number 233 can be replaced by sometime like SELECT IF Weeks < ($TIME - 3*24*60*60). if you know you will be running the syntax on a set date, such as when you do a production job. (Another way is to use AGGREGATE to figure out the latest date in the dataset.)

Next what I do is that when you use AGGREGATE in SPSS, there can be missing weeks with zeroes, which will mess up our charts. There end up being 22 different call-types in the Burlington data, so I make a base dataset (named WeekFull) that has all call types for each week. Then I aggregate the original calls for service dataset to CallType and Week, and then I merge the later into the former. Finally I then recode the missings intos zeroes.

*Make sure I have a full set in the aggregate.
FREQ CallType.
*22 categories, may want to collapse a few together.
LOOP #Weeks = 0 TO 232.
  LOOP #Calls = 1 TO 22.
    COMPUTE CallN = #Calls.
    COMPUTE Weeks = #Weeks.

*Aggregate number of tickets to weeks.
  /BREAK Weeks CallN
  /CallType = FIRST(CallType)
  /TotCalls = N.

*Merge Into WeekFull.
  /FILE = 'WeekCalls'
  /BY Weeks CallN.
*Missing are zero cases.

Now we are ready to calculate our statistics and make our charts. First we create a date variable that represents the beginning of the week (for our charts later on.) Then I use SPLIT FILE and CREATE to calculate the prior moving average only within individiual call types. The last part of the code calculates a confidence interval around prior moving average, and assumes the data is Poisson distributed. (More discussion of this is in my academic paper.)

COMPUTE WeekBeg = DATESUM(DATE.MDY(1,1,2012),(Weeks*7),"DAYS").

*Moving average of prior 8 weeks.
CREATE MovAv = PMA(TotCalls,8).
*Calculating the plus minus 3 Poisson intervals.
COMPUTE #In = (-3/2 + SQRT(MovAv)).
DO IF #In >= 0.
  COMPUTE LowInt = #In**2.
  COMPUTE LowInt = 0.
COMPUTE HighInt = (3/2 + SQRT(MovAv))**2.

If you rather use the inverse of the Poisson distribution I have notes in the code at the end to do that, but they are pretty similar in my experience. You also might consider (as I mention in the paper), rounding fractional values for the LowInt down to zero as well.

Now we are ready to make our charts. The last data manipulation is to just put a flag in the file for the very last week (which will be marked with a large red circle). I use EXECUTE before the chart just to make sure the variable is available. Finally I keep the SPLIT FILE on, which produces 22 charts, one for each call type.

IF Weeks = 232 FinCount = TotCalls.

*Do a quick look over all of them.
  /GRAPHDATASET NAME="graphdataset" VARIABLES=WeekBeg TotCalls MovAv LowInt HighInt FinCount MISSING=VARIABLEWISE
  SOURCE: s=userSource(id("graphdataset"))
  DATA: WeekBeg=col(source(s), name("WeekBeg"))
  DATA: TotCalls=col(source(s), name("TotCalls"))
  DATA: MovAv=col(source(s), name("MovAv"))
  DATA: LowInt=col(source(s), name("LowInt"))
  DATA: HighInt=col(source(s), name("HighInt"))
  DATA: FinCount=col(source(s), name("FinCount"))
  SCALE: pow(dim(2), exponent(0.5))
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Crime Count"))
  ELEMENT: line(position(WeekBeg*TotCalls), color(, transparency(transparency."0.4"))
  ELEMENT: area(position(region.spread.range(WeekBeg*(LowInt+HighInt))), color.interior(color.lightgrey), 
  transparency.interior(transparency."0.4"), transparency.exterior(transparency."1"))
  ELEMENT: line(position(WeekBeg*MovAv))
  ELEMENT: point(position(WeekBeg*FinCount), color.interior(, size(size."10"))

This is good for the analyst, I can monitor many series. Here is an example the procedure produces for mental health calls:

The current value is within the confidence band, so it is not alarmingly high. But we can see that they have been trending up over the past few years. Plotting on the square root scale makes the Poisson count data have the same variance, but a nice thing about the SPLIT FILE approach is that SPSS does smart Y axis ranges for each individual call type.

You can update this to make plots for individual crimes, and here I stuff four different crime types into a small multiple plot. I use a TEMPORARY and SELECT IF statement before the GGRAPH code to select out the crime types I am interested in.

FORMATS TotCalls MovAv LowInt HighInt FinCount (F3.0).
SELECT IF ANY(CallN,3,10,13,17).
  /GRAPHDATASET NAME="graphdataset" VARIABLES=WeekBeg TotCalls MovAv LowInt HighInt FinCount CallN MISSING=VARIABLEWISE
  PAGE: begin(scale(900px,600px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: WeekBeg=col(source(s), name("WeekBeg"))
  DATA: TotCalls=col(source(s), name("TotCalls"))
  DATA: MovAv=col(source(s), name("MovAv"))
  DATA: LowInt=col(source(s), name("LowInt"))
  DATA: HighInt=col(source(s), name("HighInt"))
  DATA: FinCount=col(source(s), name("FinCount"))
  DATA: CallN=col(source(s), name("CallN"), unit.category())
  COORD: rect(dim(1,2), wrap())
  SCALE: pow(dim(2), exponent(0.5))
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), start(1), delta(3))
  GUIDE: axis(dim(3), opposite())
  GUIDE: form.line(position(*,0),color(color.lightgrey),shape(shape.half_dash))
  ELEMENT: line(position(WeekBeg*TotCalls*CallN), color(, transparency(transparency."0.4"))
  ELEMENT: area(position(region.spread.range(WeekBeg*(LowInt+HighInt)*CallN)), color.interior(color.lightgrey), 
  transparency.interior(transparency."0.4"), transparency.exterior(transparency."1"))
  ELEMENT: line(position(WeekBeg*MovAv*CallN))
  ELEMENT: point(position(WeekBeg*FinCount*CallN), color.interior(, size(size."10"))
  PAGE: end()

You could do more fancy time-series models to create the confidence bands or identify the outliers, (exponential smoothing would be similar to just the prior moving average I show) but this ad-hoc approach worked well in my case. (I wanted to make more fancy models, but I did not let the perfect be the enemy of the good to get at least this done when I was employed as a crime analyst.)

Now we can move onto making our monthly chart. These weekly charts are sometimes hard to visualize with highly seasonal data. So what this chart does is gives each year a new line. Instead of drawing error bars, the past years data show the typical variation. It is then easy to see seasonal ups-and-downs, as well as if the latest month is an outlier.

Getting back to the code — I activate the original calls for service database and then close the Weekly database. Then it is much the same as for weeks, but here I just use calendar months and match to a full expanded set of calls types and months over the period. (I do not care about normalizing months, it is ok that February is only 28 days).



  /BREAK Year Month CallN
  /MonthCalls = N.

LOOP #y = 2012 TO 2016.
  LOOP #m = 1 TO 12.
    LOOP #call = 1 TO 22.
      COMPUTE CallN = #call.
      COMPUTE Year = #y.
      COMPUTE Month = #m.
      END CASE.

  /FILE = 'AggMonth'
  /BY Year Month CallN.

Next I select out the most recent month of the date (June 2016) since it is not a full month. (When I originally made these charts I would normalize to days and extrapolate out for my monthly meeting. These forecasts were terrible though, even only extrapolating two weeks, so I stopped doing them.) Then I calculate a variable called Current – this will flag the most recent year to be red in the chart.

COMPUTE MoYr = DATE.MDY(Month,1,Year).
FORMATS MoYr (MOYR6) Year (F4.0) Month (F2.0).
SELECT IF MoYr < DATE.MDY(6,1,2016).
RECODE MonthCalls (SYSMIS = 0)(ELSE = COPY).

*Making current year red.
COMPUTE Current = (Year = 2016).
FORMATS Current (F1.0).


*Same thing with the split file.
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Month MonthCalls Current Year
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Month=col(source(s), name("Month"), unit.category())
  DATA: MonthCalls=col(source(s), name("MonthCalls"))
  DATA: Current=col(source(s), name("Current"), unit.category())
  DATA: Year=col(source(s), name("Year"), unit.category())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2), label("Calls"), start(0))
  GUIDE: legend(aesthetic(aesthetic.color.interior), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("0",color.lightgrey),("1",
  ELEMENT: line(position(Month*MonthCalls), color.interior(Current), split(Year))

You can again customize this to be individual charts for particular crimes or small multiples. You can see in the example at the beginning of the post Retail thefts are high for March, April and May. I was interested to examine overdoses, as the northeast (and many parts of the US) are having a problem with heroin at the moment. In the weekly charts they are so low of counts it is hard to see any trends though.

We can see that overdoses were high in March. The other highest line are months in 2015, so it looks like a problem here in Burlington, but it started around a year ago.

For low counts of crime (say under 20 per month) seasonality tends to be hard to spot. For crimes more frequent though you can often see pits and peaks in summer and winter. It is not universal that crimes increase in the summer though. For ordinance violations (and ditto for Noise complaints) we can see a pretty clear peak in September. (I don’t know why that is, there is likely some logical explanation for it though.)

My main motivation to promote these is to replace terrible CompStat tables of year-over-year percent changes. All of these patterns I’ve shown are near impossible to tell from tables of counts per month.

Finally if you want to export your images to place into another report, you can use:

OUTPUT EXPORT /PNG IMAGEROOT = "data\TimeGraphs.png".

PNG please – simple vector graphics like these should definately not be exported as jpegs.

Here is a link to the full set of syntax and the csv data to follow along. I submitted to doing an hour long training session at the upcoming IACA conference on this, so hopefully that gets funded and I can go into this some more.

Sentence length in academic articles

While reviewing a paper recently it struck me that the content was (very) good, but the writing was stereotypical academic. My first impression was that this was caused by monotonously long sentences. See this advice from Gary Provost (via Francis Diebold). Part of the reason why long sentences are undesirable is not only for aesthetic reasons though — longer sentences are harder to parse, hold in memory, and subsequently understand. See Steven Pinker’s The Sense of Style writing guide for discussion.

So I did some text analysis of the sentences. To do the text analysis I used the nltk library in python, and here is the IPython notebook to replicate for yourself if you care to do so. In the notebook I have saved two text corpuses, one my finished draft of this article. I compared the sentence length to Mark Twain’s Huckleberry Finn (text via here).

For a simple example getting started with the library, here is an example of tokenizing a string into words and sentences:

#some tests for, nice book to follow along
import nltk'punkt') #need to download this for the English sentence tokenizer files

#this splits up punctuation
test = """At eight o'clock on Thursday morning Arthur didn't feel very good. This is a second sentence."""
tokens = nltk.word_tokenize(test)
print tokens

ts = nltk.sent_tokenize(test)
print ts

The first prints out each individual word (plus punctuation in some circumstances) and the second marks individual sentences. I have the line'punkt') commented out, as I downloaded it once already. (Running once in Wakari I did not need to download it again – I presume it would work similarly on your local machine.)

So what I did was transfer the PDF document I was reviewing to a text file and then clean up things like the section headers (ditto for my academic articles I compare it to). In Huckleberry I took out the table of contents and the "CHAPTER ?" parts. I also started a list of variables that were parsed as words but that I did not want to count after the sentences and words were tokenized. For example, an inline cite such as (X, 1996) would be split into 4 words with the original tokenizer, (, X, 1996 and ). The "x96" is an en-dash. Below takes those instances out.

#Get the corpus
f = open('SmallSample_Corpus.txt')
raw =

#Count number of sentences
sent_tok = nltk.sent_tokenize(raw)
ns = len(sent_tok)

#Count number of words
word_tok = nltk.word_tokenize(raw) #need to take out commas plus other stuff
NoWord = [',','(',')',':',';','.','%','\x96','{','}','[',']','!','?',"''","``"]
word_tok2 = [i for i in word_tok if i not in NoWord]
nw = len(word_tok2)

#Average Sentence length are words divided by sentences
print float(nw)/ns

There are inevitably more instances of things that shouldn’t be counted as words, but that makes the sentences longer on average. For example, I spotted a few possessive 's that were listed as different words. (The nltk library is smart and lists contractions as seperate words.)

So someone may know a better way to count the words, but all the articles should have the same biases. In my tests, here are the average number of words per sentence:

  • article I was reviewing, 28
  • my small sample article, 27
  • my working article (that has not undergone review), 25
  • Huck Finn, 20

So the pot is calling the kettle black here – my writing is not much better. I looked at the difference between an in-print article and a working draft, as responses to reviewers I bet will make the sentences longer. Hedges in statements that academics love.

Looking at the academic article histograms they are fairly symmetric, confirming my impression about monotonous sentence length. To make the histograms I used the panda’s library, which has a nice simple method.

sent_len = []
for i in sent_tok:
    sent_w1 = nltk.word_tokenize(i)
    sent_w2 = [i for i in sent_w1 if i not in NoWord]

import pandas as pd

dfh = pd.DataFrame(sent_len)
dfh.hist(bins = 50);

Here is the histogram for my small sample paper:

And here it is for Huck Finn

(I’m not much of an exemplar for making graphs in python – forgive the laziness in the figures.) Apparently analyzing sentence length has a long history, see a paper by G. Udny Yule in 1939! From a quick perusal the long right tail is more usual for analyzing texts. The symmetry I see for this sample of academic articles is not the norm.

There could be more innocuous reasons for this. Huck Finn has dialogue with shorter sentences, and the academic articles have numbers and citations. (Although I think it is reasonable to count those things towards sentence complexity, "1" or "one" should have the same complexity.)

I will have to keep this in mind in the future (maybe I should write my articles in poem form)!

Trajectories are not real

I recently published my peer review of Andresen, Curman, and Linning (2016), The Trajectories of Crime at Places: Understanding the Patterns of Disaggregated Crime Types. In it I probably come off as a bit curmudgeonly, but in my defense I voted for revise-and-resubmit (I don’t know why Publons does not have a field to note this in your review).

I’ve gotten a few questions about doing trajectory analysis, so I figured I would come out and make my opinion pretty clear — I don’t think such trajectories exist. Part of the motivation in my paper to estimate the trajectories in Albany was to replicate Weisburd’s work in Seattle. (Andresen et al. (2016) give the same reasoning for their currently cited article — at least to stick with the k-means clustering procedure anyway.) But that does not mean I think distinct groups of trajectories exist. I tried to sneak this in wherever possible in my JQC article, but if you just quickly skimmed it you might not have that impression.

For an ugly image I generated but did not make the paper, here is my reasoning that discrete trajectories do not exist:


You can see that each individual grouping (unique colors) basically overlaps with each other. This solution passes all the usual model based criteria for trajectory groups, and each individual line tends to have a very high posterior probability of assignment to its respective group. Despite this there is essentially no separation between the groups – they all just blend into one another. If people plotted the individual trajectories this would be more obvious in other work, but people almost always solely plot the predicted trajectories.

Now, although I’m pretty sure that the clusters are not real (at least in the data I have seen) that does not mean that I don’t think they can be useful. Longitudinal data can be complicated, and clustering trajectories are a simple procedure to make sense of it. For a good example of its utility, Weisburd et al., (2015) used it to assign blocks for a randomized experiment. This is a slightly more robust way than ranking based some prior time interval (in that it might out some weird trajectory groups that simple averaging would not).

I believe it is similar in utility to creating hot spot polygons or maps of kernel density estimates. They can illuminate the patterns the messy data, but the hot spot the computer spits out is not a real entity – it is something we arbitrarily created.

Preprint – A Quasi-Experimental Evaluation Using Roadblocks and Automatic License Plate Readers to Reduce Crime in Buffalo, NY

I have a new preprint article posted on SSRN – A Quasi-Experimental Evaluation Using Roadblocks and Automatic License Plate Readers to Reduce Crime in Buffalo, NY. This is some work I have been conducting with Scott Phillips out at SUNY Buffalo (as well as Dae-Young Kim, although he is not on this paper).

Here is the abstract:

Purpose: To evaluate the effectiveness of a hot spots policing strategy: using automated license plate readers at roadblocks.

Design: Different roadblock locations were chosen by the Buffalo Police Department every day over a two month period. We use propensity score matching to identify a set of control locations based on prior counts of crime and demographic factors before the intervention took place. We then evaluate the reductions in Part 1 crimes, calls for service, and traffic accidents at roadblock locations compared to control locations.

Findings: We find modest reductions in Part 1 violent crimes (10 over all roadblock locations and over the two months) using t-tests of mean differences. We find a 20% reduction in traffic accidents using fixed effects negative binomial regression models. Both results are sensitive to the model used though, and the fixed effects models predict increases in crimes due to the intervention.

Research Limitations: The main limitations are the quasi-experimental nature of the intervention, the short length of the intervention, and that many micro places have low baseline counts of crime.

Originality/Value: This adds to literature on hot spots policing – in particular on the use of automated license plate readers and traffic enforcement at hot spots of crime. While the results are mixed, it provides some evidence that the intervention has potential to reduce crime.

And here is one figure from the paper, showing how street units are defined, and given the intersection the road block was stationed on how we determined the treated street units:

Feedback is always welcome!

Some Stata notes – Difference-in-Difference models and postestimation commands

Many of my colleagues use Stata (note it is not STATA), and I particularly like it for various panel data models. Also one of my favorite parts of Stata code that are sometimes tedious to replicate in other stat. software are the various post-estimation commands. These includes the test command, which does particular coefficient restriction tests or multiple coefficient tests, margins (and the corresponding marginsplot) which gives model based estimates at various values of the explanatory variables, and lincom and nlcom which I will show as being useful for differences in differences models.

To follow along with this post, I have posted all of the data and code here. Part of the data is simulated to be very similar to a panel data model I estimated recently, and the other dataset comes from Dan Kahan.

I don’t know Stata as well as I do SPSS or R, but these are things I tend to re-visit (and wish people sending me data and code follow as well). So these are for my own notes so I remember!

The Preamble

Before we go into estimating models in Stata, I first like to set the working directory, specify a plain text log file for the results, and then set how the results are displayed in the window. I know some programmers do not like you changing their directory, but this is the simplest way I’ve found to share code between colleagues – they just need to change one line at the beginning to change the directory to their local machine.

So the top of my code will typically have my name, plus what the code does. Note that comments in Stata start with an *.

*Andy Wheeler, ????
*This code estimates difference in difference models
*For the X intervention on Crimes per Month

After that I will set the working directory, using the cd command, which lets you upload datasets without specifying a file handle. Because of this, I can just say log using "PostEst.txt" and it saves the plain text log file in the same directory folder.

*Setting up the directory
cd "C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Stata_Postest"
*logging the results in a text file
log using "PostEst.txt", text replace
*So the output just keeps going
set more off

The code set more off is idiosyncratic to the Stata display. For results that you need to scroll, Stata asks you to click on see more output. This makes it so it just pipes all the results to the display without you needing to click on anything.

Importing Data and Setting up a Panel Model

Now we will go on to a simulated example very similar to a difference in difference model I estimated recently. Because the working directory is set that already contains my datasets, I can just call:

use "Monthly_Sim_Data.dta", clear 

To load up my simulated dataset. Before we estimate panel models in Stata, you need to tell Stata what the panel id variables refer to. You use the tsset command for that. Here the variable Exper refers to a dummy variable that equals 1 for the experimental time series, and 0 for the control time series. Ord is a squential counter for each time period – here the data is suppossed to look like the count of crimes during the month over several years. So the first variable after tsset in the panel id, and the second refers to the time variable (if there is a time variable).

*Set the panel vars
tsset Exper Ord

Now we are ready to estimate our model. In my real use case I had looked at the univariate series, and found that an ARIMA(1,0,0) model with monthly dummies to account for seasonality fit pretty well and took care of temporal auto-correlation, so I decided a reasonable substitute for the panel model would be a GEE Poisson model with monthly dummies and the errors having an AR(1) correlation.

To fit this model in Stata is:

*Original model
xtgee Y Exper#Post i.Month, family(poisson) corr(ar1)

Here I do not worry about the standard errors, but if the data are over-dispersed you can either fit a negative binomial model and/or estimate robust standard errors (or bootstrapped standard errors). So you can specify these as something like xtgee Y X, family(poisson) corr(ar1) vce(robust). The vce option is available for many of the panel data models. (Note the bootstrap does not make sense when you only have two series as in this example.)

The Post variable is the dummy variable equal to 0 before the intervention, and 1 after the intervention. Exper#Post is one way to specify interaction variables in Stata. The variable i.Month tells Stata that the Month variable is a factor, and it should estimate a different dummy variable for each month (dropping one to prevent perfect collinearity).

You can of course create your own interactions or dummy variables, but using Stata’s inbuilt commands is very nice when working with post-estimation commands, which I will show in a bit. Some models in Stata do not like the i.Factor notation, so a quick way to make all of the dummy variables is via tabulate Month, gen(m). This would create variables m1, m2, m3....m12 in the dataset that you can include on the right hand side of regression equations.

This model then produces the output:

Iteration 1: tolerance = .00055117
Iteration 2: tolerance = 1.375e-06
Iteration 3: tolerance = 2.730e-09

GEE population-averaged model                   Number of obs      =       264
Group and time vars:             Exper Ord      Number of groups   =         2
Link:                                  log      Obs per group: min =       132
Family:                            Poisson                     avg =     132.0
Correlation:                         AR(1)                     max =       132
                                                Wald chi2(14)      =    334.52
Scale parameter:                         1      Prob > chi2        =    0.0000

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Exper#Post |
        0 1  |    .339348   .0898574     3.78   0.000     .1632307    .5154653
        1 0  |   1.157615   .0803121    14.41   0.000     1.000206    1.315023
        1 1  |   .7602288   .0836317     9.09   0.000     .5963138    .9241439
       Month |
          2  |    .214908   .1243037     1.73   0.084    -.0287228    .4585388
          3  |   .2149183   .1264076     1.70   0.089    -.0328361    .4626726
          4  |   .3119988   .1243131     2.51   0.012     .0683497    .5556479
          5  |   .4416236   .1210554     3.65   0.000     .2043594    .6788877
          6  |    .556197   .1184427     4.70   0.000     .3240536    .7883404
          7  |   .4978252   .1197435     4.16   0.000     .2631322    .7325182
          8  |   .2581524   .1257715     2.05   0.040     .0116447      .50466
          9  |    .222813   .1267606     1.76   0.079    -.0256333    .4712592
         10  |    .430002    .121332     3.54   0.000     .1921956    .6678084
         11  |   .0923395   .1305857     0.71   0.479    -.1636037    .3482828
         12  |  -.1479884   .1367331    -1.08   0.279    -.4159803    .1200035
       _cons |   .9699886   .1140566     8.50   0.000     .7464418    1.193536

Post estimation commands

So now the fun part begins – trying to interpret the model. Non-linear models – such as a Poisson regression – I think are almost always misleading to interpret the coefficients directly. Even exponentiating the coefficients (so they are incident rate ratios) I think tends to be misleading (see this example). To solve this, I think the easiest solution is to just predict the model based outcome at different locations of the explanatory variables back on the original count scale. This is what the margins post-estimation command does.

margins Post#Exper, at( (base) Month )
marginsplot, recast(scatter)

So basically once you estimate a regression equation, Stata has many of its attributes accessible to subsequent command. What the margins command does is predict the value of Y at the 2 by 2 factor levels for Post and Exper. The at( (base) Month ) option says to estimate the effects at the base level of the Month factor, which happens to default to January here. The marginsplot shows this easier than I can explain.

The option recast(scatter) draws the plot so there are no lines connecting the different points. Note I prefer to draw these error bar like plots without the end cross hairs, which can be done like marginsplot, recast(scatter) ciopts(recast(rspike)), but this causes a problem when the error bars overlap.

So I initially interpreted this simply as the experimental series went down by around 2 crimes, and the control series went up by around 1 crime. A colleague though pointed out the whole point of having control series though is to say what would have happend if the intervention did not take place (the counterfactual). Because the control series was going up, we would have also expected the experimental series to go up.

The coefficients in this model though don’t directly answer that question. To estimate that relative decrease is somewhat convoluted in this parameterization, but you can use the test and lincom post-estimation commands to do it. Test basically does linear model restrictions for one or multiple variables. This can be useful to test if multiple coefficients are equal to one another, or joint testing to see if one coefficient is equal to another, or to test if a coefficient is different from a non-zero value.

So to test the relative decrease in this DiD setup ends up being (which I am too lazy to explain):

test 1.Exper#1.Post = (1.Exper#0.Post + 0.Exper#1.Post)

Note you can also do a joint test for all dummy variables with testparm:

testparm i.Month

which spits out:

 ( 1)  2.Month = 0
 ( 2)  3.Month = 0
 ( 3)  4.Month = 0
 ( 4)  5.Month = 0
 ( 5)  6.Month = 0
 ( 6)  7.Month = 0
 ( 7)  8.Month = 0
 ( 8)  9.Month = 0
 ( 9)  10.Month = 0
 (10)  11.Month = 0
 (11)  12.Month = 0

           chi2( 11) =   61.58
         Prob > chi2 =    0.0000

The idea behind this is that it often does not make sense to test the significance of only one level of a dummy variable – you want to jointly test whether the whole set of dummy variables is statistically significant. Most of the time I do this using F-tests for model restrictions (see this example in R). But here Stata does a chi-square test. (I imagine they will result in the same inferences in most circumstances.)

test just gives inferential statistics though, I wanted an actual estimate of the relative decrease. To do this you can use lincom. So working with my same set of variables I get:

lincom 1.Exper#1.Post - 0.Exper#1.Post - 1.Exper#0.Post

Which produces in the output:

 ( 1)  - 0b.Exper#1.Post - 1.Exper#0b.Post + 1.Exper#1.Post = 0

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
         (1) |  -.7367337   .1080418    -6.82   0.000    -.9484917   -.5249757

I avoid explaining why this particular set of coefficient contrasts produces the relative decrease of interest because there is an easier way to specify the DiD model to get this coefficient directly (see the Wikipedia page linked earlier for an explanation):

*An easier way to estimate the model
xtgee Y i.Exper i.Post Exper#Post i.Month, family(poisson) corr(ar1)

Which you can look at the output and see that the interaction term now recreates the same -.7367337 (.1080418) estimate as the prior lincom command:

Iteration 1: tolerance = .00065297
Iteration 2: tolerance = 1.375e-06
Iteration 3: tolerance = 2.682e-09

GEE population-averaged model                   Number of obs      =       264
Group and time vars:             Exper Ord      Number of groups   =         2
Link:                                  log      Obs per group: min =       132
Family:                            Poisson                     avg =     132.0
Correlation:                         AR(1)                     max =       132
                                                Wald chi2(14)      =    334.52
Scale parameter:                         1      Prob > chi2        =    0.0000

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
     1.Exper |   1.157615   .0803121    14.41   0.000     1.000206    1.315023
      1.Post |    .339348   .0898574     3.78   0.000     .1632307    .5154653
  Exper#Post |
        1 1  |  -.7367337   .1080418    -6.82   0.000    -.9484917   -.5249757
       Month |
          2  |    .214908   .1243037     1.73   0.084    -.0287228    .4585388
          3  |   .2149183   .1264076     1.70   0.089    -.0328361    .4626726
          4  |   .3119988   .1243131     2.51   0.012     .0683497    .5556479
          5  |   .4416236   .1210554     3.65   0.000     .2043594    .6788877
          6  |    .556197   .1184427     4.70   0.000     .3240536    .7883404
          7  |   .4978252   .1197435     4.16   0.000     .2631322    .7325182
          8  |   .2581524   .1257715     2.05   0.040     .0116447      .50466
          9  |    .222813   .1267606     1.76   0.079    -.0256333    .4712592
         10  |    .430002    .121332     3.54   0.000     .1921956    .6678084
         11  |   .0923395   .1305857     0.71   0.479    -.1636037    .3482828
         12  |  -.1479884   .1367331    -1.08   0.279    -.4159803    .1200035
       _cons |   .9699886   .1140566     8.50   0.000     .7464418    1.193536

But we can do some more here, and figure out the hypothetical experimental mean if the experimental series would have followed the same trend as the control series. Here I use nlcom and exponentiate the results to get back on the original count scale:

nlcom exp(_b[1.Exper] + _b[1.Post]  + _b[_cons])

Which gives an estimate of:

       _nl_1:  exp(_b[1.Exper] + _b[1.Post]  + _b[_cons])

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _nl_1 |   11.78646   1.583091     7.45   0.000     8.683656    14.88926

So in our couterfactual world, the intervention decreased crimes from around 12 to 6 per month instead of 8 to 6 in my original interpretation, a larger effect because of the control series. We can show how nlcom reproduces the output of margins by reproducing the experimental series mean at the baseline, over 8 crimes per month.

*This creates the observed estimates, (at January) - if you want for another month need to add to above
margins Post#Exper, at( (base) Month )
*to show it recreates margins of pre period
nlcom exp(_b[1.Exper] + _b[_cons])

Which produces the output:

. margins Post#Exper, at( (base) Month )

Adjusted predictions                              Number of obs   =        264
Model VCE    : Conventional

Expression   : Exponentiated linear prediction, predict()
at           : Month           =           1

             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Post#Exper |
        0 0  |   2.637914   .3008716     8.77   0.000     2.048217    3.227612
        0 1  |   8.394722   .8243735    10.18   0.000      6.77898    10.01046
        1 0  |   3.703716   .3988067     9.29   0.000     2.922069    4.485363
        1 1  |   5.641881   .5783881     9.75   0.000     4.508261    6.775501

. *to show it recreates margins of pre period
. nlcom exp(_b[1.Exper] + _b[_cons])

       _nl_1:  exp(_b[1.Exper] + _b[_cons])

           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
       _nl_1 |   8.394722   .8243735    10.18   0.000      6.77898    10.01046

Poisson models are no big deal

Much of my experience suggests that although Poisson models are standard fare for predicting low crime counts in our field, they almost never make a difference when evaluating the marginal effects like I have above. So here we can reproduce the same GEE model, but instead of Poisson have it be a linear model. Here I use the quietly command to suppress the output of the model. Comparing coefficients directly between the two models does not make sense, but comparing the predictions is fine.

*Compare Poisson to a linear model
quietly xtgee Y Exper#Post i.Month, family(gaussian) corr(ar1)
margins Post#Exp, at( (base) Month )

And this subsequently produces:

Adjusted predictions                              Number of obs   =        264
Model VCE    : Conventional

Expression   : Linear prediction, predict()
at           : Month           =           1

             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
  Post#Exper |
        0 0  |   1.864656   .6951654     2.68   0.007     .5021569    3.227155
        0 1  |   9.404887   .6951654    13.53   0.000     8.042388    10.76739
        1 0  |   3.233312   .6992693     4.62   0.000     1.862769    4.603854
        1 1  |   5.810809   .6992693     8.31   0.000     4.440266    7.181351

The predictions are very similar between the two models. I simply state this here because I think the parallel trends assumption for the DiD model makes more sense on a linear scale than it does for the exponential scale. Pretty much wherever I look, the effects of explanatory variables appear pretty linear to me when predicting crime counts, so I think linear models make a lot of sense, even if you are predicting low count data, despite current conventions in the field.

An additional Example using marginsplot and areas

Long post – but stick with me! The last thing I wanted to go over was how to get area’s instead of error bars for marginsplot. The pdf help online has some examples – so this is pretty much just a restatement of the help. (Stata’s help in the online PDFs are excellent – code and math and figures and references and intelligible discussion all in one.) Like Andrew Gelman, I think the discrete bars are misleading when the sample locations are just arbitrary locations along a continuous function. I also think the areas look nicer, especially when the error bars overlap.

So using the same example from Dan Kahan, we first use drop _all to get rid of the prior panel data, then use import to load up the csv file.

*This clears the prior data.
drop _all 

*grab the csv data
import delimited gelman_cup_graphic_reporting_challenge_data

Now we can estimate our model and our margins:

*estimate a similar logit model
logit agw c.left_right##c.religiosity
*graph areas using margins - pretended -1 and 1 are the high/low religious cut-offs
quietly margins, at(left_right=(-1.6(0.1)1.6) religiosity=(-1 1))

quietly is definately useful here for the margins command, we don’t want to pour through the whole text table, since it is only intended for a chart. And now to make our area chart at the sample points:

marginsplot, xlabel(-1.6(0.4)1.6) recast(line) recastci(rarea) ciopts(color(*.8))

That is much better looking that the default with the ugly cross-hair error bars overlapping. Stata draws the lines unfortunately in the wrong order (see the blue line over the red area), and if I figure out an easy way to fix that I will update the post in the future. I need to learn the options for Stata charts to a greater extent before I give any more advice about Stata charts though.

To wrap up. I typically have in my Stata do file:

*Finish the script.
drop _all 
exit, clear

This drops the dataset (as mentioned before), and exit, clear then basically closes out of Stata.

Jane Jacobs on Neighborhoods

Google image on 5/4/16 – Jane Jacobs 100th birthday.

Continuing on with my discussion of neighborhoods, it seems apt to talk about how Jane Jacob defines neighborhoods in her book The Death and Life of Great American Cities. She is often given perfunctory citations in criminology articles for the idea that mixed zoned neighborhoods (those with both residential and commercial zoning all together) increase “eyes on the street” and thus can reduce crime. I will write another post more fully about informal social control and how her ideas fit in, but here I want to focus on her conception of neighborhoods.

She asks the question, what do neighborhoods accomplish – and from this attempts to define what is a neighborhood. Thus she feels a neighborhood can only be defined in terms of having actual political capital to use for its constituents – and so a neighborhood is any region in which can be organized enough to act as an independent polity and campaign against the larger government in which it is nested. This is quite different from the current conception of neighborhoods in most social sciences – in which we assume neighborhoods exist, calculate their effects on behavior, and then tautologically say they exist when we find they do affect behavior.

Based on her polity idea Jacob’s defines three levels of possible neighborhoods:

  • your street
  • the greater area of persons – around 100,000
  • the entire city

This again is quite different than most of current social sciences. Census tracts and block groups are larger swaths than just one block. In very dense cities like NYC, census block groups are often the square defined by streets on either side, so they split apart people across the street from one another. Census tracts intentionally are made to contain around 8,000 people. For a counter-example criminologist, Ralph Taylor does think streets are neighborhood units. A hearsay paraphrase of his I recently heard was “looking out your front door, all that matters is how far you can see down the street in either direction”.

I think Jacob’s guesstimate of 100,000 may be partly biased from only thinking about giant megapolises. Albany itself has only around 100,000 residents – so it is either streets or the whole city per her definition. Although I can’t argue much about smaller areas having little to no political capital to accomplish specific goals.

In some of the surveys I have participated in they have asked the question about how you define your neighborhood. Here are the responses for two different samples (from the same city at two different time points) for an example:

I think the current approach – that neighborhoods are defined by their coefficients on the right hand side of regression models is untenable in the end – based on ideas that are derivatives of the work in my dissertation. Mainly that discrete neighborhood boundaries are a fiction of social scientists.

Neighborhoods in Albany according to Google

One of the most vexing aspects of spatial analysis in the social sciences in the concept of neighborhoods. There is a large literature on neighborhood effects in criminology, but no one can really define a neighborhood. For analysis they are most often assumed to approximately conform to census areas (like tracts or blocks). Sometimes there are obvious physical features that divide neighborhoods (most often a major roadway), but more often boundaries are fuzzy.

I’ve worked on several surveys (at the Finn Institute) in which we ask people what neighborhood they live in as well as the nearest intersection to their home. Even where there is a clear border, often people say the “wrong” neighborhood, especially near the borders. IIRC, when I calculated the wrongness for one survey in Syracuse we did it was only around 60% of the time the respondents stated they lived the right neighborhood. I do scare quotes around “wrong” because it is obviously arbitrary where people draw the boundaries, so more people saying the wrong neighborhood is indicative of the borders being misaligned than the respondents being wrong.

For this reason I like the Google maps approach in which they just place a label at the approximate center of noteworthy neighborhoods. I emulated this for a recent background map I made for a paper in Albany. (Maps can be opened in a separate tab to see a larger image.)

As background I did not grow up in Albany, but I’ve lived and worked in the Capital District since I came up to Albany for grad school – since 2008. Considering this and the fact that I make maps of Albany on a regular basis is my defense I have a reasonable background to make such judgements.

When looking at Google’s reverse geocoding API the other day I noticed they returned a neighborhood field in the response. So I created a regular sampling grid over Albany to see what they return. First, lets see my grid and where Google actually decides some neighborhood exists. Large grey circles are null, and small red circles some neighborhood label was returned. I have no idea where Google culls such neighborhood labels from.

See my python code at the end of the post to see how I extracted this info. given an input lat-lng. In the reverse geo api they return multiple addresses – but I only examine the first returned address and look for a neighborhood. (So I could have missed some neighborhoods this way – it would take more investigation.)

Given the input fishnet I then dissolved the neighborhood labels into areas. Google has quite a few more specific neighborhoods than me.

I’ve never really made much of a distinction between West Hill and Arbor Hill – although the split is clearly at Henry Johnson. Also I tend to view Pine Hill as the triangle between Western and Central before the State campus – but Google and others seem to disagree with me. What I call the Pinebush Google calls the Dunes. Dunes is appropriate, because it actually has sand dunes, but I can’t recall anyone referring to it as that. Trees are pretty hard to come by in Arbor Hill though, so don’t be misled. Also kill is Dutch for creek, so you don’t have to worry that Normanskill is such a bad place (even if your name is Norman).

For a third opinion, see

You can see more clearly in this map how Pine Hill’s area goes south of Madison. Google maps has a fun feature showing related maps, and so they show a related map on someones take for where law students should or should not get an apartment. In that map you can see that south of Madison is affectionately referred to as the student ghetto. That comports with my opinion as well, although I did not think putting student ghetto was appropriate for my basemap for a journal article!

People can’t seem to help but shade Arbor Hill in red. Which sometimes may be innocent – if red is the first color used in defaults (as Arbor Hill will be the first neighborhood in an alphabetic list). But presumably the law student making the apartment suggestions map should know better.

In short, it would be convenient for me (as a researcher) if everyone could agree with what a neighborhood is and where its borders are, but that is not reality.

Here is the function in Python to grab the neighborhood via the google reverse geocoding API. Here if it returns anything it grabs the first address returned and searches for the neighborhood in the json. If it does not find a neighborhood it returns None.

#Reverse geocoding and looking up neighborhoods
import urllib, json

def GoogRevGeo(lat,lng,api=""):
  base = r""
  GeoUrl = base + "latlng=" + str(lat) + "," + str(lng) + "&key=" + api
  response = urllib.urlopen(GeoUrl)
  jsonRaw =
  jsonData = json.loads(jsonRaw)
  neigh = None
  if jsonData['status'] == 'OK':
    for i in jsonData['results'][0]['address_components']:
      if i['types'][0] == 'neighborhood':
        neigh = i['long_name']
  return neigh

Shape, color, and pattern constants in SPSS charts

I have a version of the SPSS (Statistics) Version 24 GPL reference guide bookmarked here. The reference guide is great to skim through and see what is possible in SPSS charts – especially the set of examples on pages 329 to 411.

On page 413 they also give a set of constant colors, shapes, and texture patterns you can use in charts. Colors you can also specify in RGB scale, but it is often convenient to just say or etc. Shapes and patterns for practical purposes you have to choose among the constants. (Technically in the chart template you can edit the cycle styles, and change a circle to an ellipse for example, or change the points for a dash pattern, but this would be painful for anything besides a few constants.)

Here is a handy reference guide to actually visualize those constants. Many you can guess what they look like, but the colors are more subtle. Who knew there were differences between tomato, salmon, and pink! (The tomato is more like tomato soup color.)

Here are the color constants (you can open the chart in a new tab to see a larger image):

The shape constants:

The elbow and the elbowArrow do not look correct – but will take some more time to investigate. The others look ok to me though. (The number of sides and star points appear to me to be something you can also manipulate in the chart template cycles, if for some reason you want a hendecagon).

And here are the pattern constants. I plot them with a grey filled interior – you can see some specifically only have an outline and always have a transparent fill:

Here is code to replicate the charts, and here is a PDF to download with the constants. The colors and shapes are hard to read because they are squeezed in so small, but you can zoom into the PDF and read the categories. I haven’t used dashed lines ever, so I omit those constants here. (Patterns I use pretty rarely, but I have used them if there are only two categories.)

A useful change for the colors would be sorting in a logical order. They are just currently in alphabetical. I am too lazy though to convert the colors to a colorspace and sort them though. (Maybe converting the PDF to SVG would do the trick easy enough though.)

Posting my peer reviews on Publons and a few notes about reviewing

Publons is a service that currates your peer review work. In a perfect world this would be done by publishers – they just forward your reviews with some standard meta-data. This would be useful when one paper is reviewed multiple times, as well as identifying good vs. poor reviewers. I forget where I saw the suggestion recently (maybe on the orgtheory or scatterplot blog), but someone mentioned it would be nice if you submit your paper to a different journal to forward your previous reviews at your discretion. I wouldn’t mind that at all, as oft the best journals will reject for lesser reasons because they can be more selective. (Also I’ve gotten copy-paste same reviews from different journals, even though I have updated the draft to address some of the comments. Forwarding would allow me to address those comments directly before the revise-resubmit decision.)

I’ve posted all of my reviews so far, but they are only public if the paper is eventually accepted. So here you can see my review for the recent JQC article Shooting on the Street: Measuring the Spatial Influence of Physical Features on Gun Violence in a Bounded Street Network by Jie Xu and Elizabeth Griffiths.

I’ve done my fair share of complaining about reviews before, but I don’t think the whole peer-review process is fatally flawed despite its low reliability. People take peer review a bit too seriously at times – but that is a problem for most academics in general. Even if you think your idea is not getting a fair shake, just publish it yourself on your website (or places like SSRN and ArXiv). This of course does not count towards things like tenure – but valuing quantity over quality is another separate problem currently in academia.

In the spirit of do unto others as you would have them do unto you, here are a two main points I try to abide by when I review papers.

  • be as specific as possible in your critique

There is nothing more frustrating than getting a vague critique (the paper has multiple mispellings and grammar issues). A frequent one I have come across (both in reviews of my papers and seeing comments others have made on papers I’ve reviewed) is in the framing of the paper – a.k.a. the literature review. (Which makes sending the paper to multiple journals so frustrating, you will always get more arbitrary framing debates each time with new reviewers.)

So for a few examples:

  • (bad) The literature review is insufficient
  • (good) The literature review skips some important literature, see specifically (X, 2004; Y, 2006; Z, 2007). The description of (A, 2000) is awkward/wrong.
  • (bad) The paper is too long, it can be written in half the length
  • (better) The paper could be shortened, section A.1 can be eliminated in my opinion, and section A.2 can be reduced to one paragraph on X.

Being specific provides a clear path for the author to correct what you think, or at least respond if they disagree. The “you can cut the paper in half” I don’t even know how to respond to effectively, nor the generic complaint about spelling. One I’ve gotten before is “x is an innapropriate measure” with no other detail. This is tricky because I have to guess why you think it is innapropriate, so I have to make your argument for you (mind read) and then respond why I disagree (which obviously I do, or I wouldn’t have included that measure to begin with). So to respond to a critique I at first have to critique my own paper – maybe this reviewer is more brilliant than I thought.

Being specific I also think helps cut down on arbitrary complaints that are arguable.

  • provide clear signals to the editor, both about main critiques and the extent to which they can be addressed

Peer review has two potential motivations, one is a gate-keeper and one is to improve the draft. Often times arbitrary advice by reviewers intended for the latter is not clearly delineated in the review, so it is easily confused for evidence pertinent to the gate-keeper function. I’ve gotten reviews of 20 bullet points or 2,000 words that make it seem like a poor paper due to sheer length of the comment, but the majority are minor points or arbitrary suggestions. Longer reviews actually suggest the paper is better – if there is something clearly wrong you can say it in a much shorter space.

Gabriel Rossman states these different parts of peer review a bit more succintly than me:

You need to adopt a mentality of “is it good how the author did it” rather than “how could this paper be made better”

I think this is a good quip to follow. I might add “don’t sweat the small stuff” to that as well. Some editors will read the paper and reviews and make judgement calls – but some editors just follow the reviewers blindly – so I worry with the 20 bullet point minor review that it unduly influenced a reject decision. I’m happy to respond to the bullets, and happy you took the time, but I’m not happy about you (the reviewer) not giving clear advice to the editor of the extent to which I can address those points.

I still give advice about improving the manuscript, but I try to provide clear signals to the editor about main critiques, and I also will explicitly state whether they can be addressed. The “can be addressed” is not for the person writing the paper – it is for the editor making the decision for whether to revise-and-resubmit! The main critiques in my experience will either entail 2-3 main points (or none at all for some papers). I also typically say when things are minor and put them in a separate section, which editors can pretty much ignore.

Being a quantitative guy, the ones that frustrate me the most are complaints about model specifications. Some are legitimately major complaints, but often times it will be things that are highly unlikely to greatly influence the reported results. Examples are adding/dropping/changing a particular control variable and changes in the caliper used for propensity score matching. Note I’m not saying you shouldn’t ask to see differences, I’m asking that you clearly articulate why your suggestion is preferable and make an appropriate judgement as to whether it is a big problem or a little problem. A similar complaint is what information to include in tables or in the main manuscript or appendix. The author already has the information, so it is minor editing, not a major problem.

While I am here I will end with three additional complaints that don’t fit into anywhere previously in my post. One, multiple rounds of review are totally a waste. So the life cycle of the paper-review should be

paper -> review -> editor decision -> reject/accept
                                      revise-resumbit -> new paper + responses to reviews -> editor decision

The way the current system works, I have to submit another review after the new paper has been submitted. I rather the editor take the time to see if the authors sufficiently addressed the original complaints, because as a reviewer I am not an unbiased judge of that. So if I say something is important and you retort it is not, what else do you want me to say in my second review! It is the editors job at that point to arbiter disagreements. This then stops the cycle of multiple rounds of review, which have a very large amount of diminishing returns in improving the draft.

This then leads into my second complaint, generally about keeping a civil tone for reviews. In general I don’t care if a reviewer is a bit gruff in the review – it is not personal. But since reviewers have a second go, when I respond I need to keep an uber deferential tone. I don’t think that is really necessary, and I’d rather original authors have similar latitude to be gruff in responses. Reviewers say stupid things all the time (myself included) and you should be allowed to retort that my critique is stupid! (Of course with reasoning as to why it is stupid…)

Finally, I had one reviewer start recently:

This paper is interesting and very well written…I will not focus on the paper’s merits, but instead restrict myself to areas where it can be improved.

The good is necessary to signal to the editor whether a paper should be published. I’ve started to try in my own reviews to include more of the good (which is definately not the norm) and argue why a paper should be published. You can see in my linked review of the Xu and Griffiths paper by the third round I simply gave arguments why the paper should be published, despite a disagreement about the change-point model they reported on in the paper.


Get every new post delivered to your Inbox.

Join 77 other followers