Dropbox links need updating

Just as an FYI, I use Dropbox to share much of my work. They have made two recent changes that affect many of the links on my website. One is that the Public folder is no longer in use. I thought that since I gave out individual links to items in the Public folder those wouldn’t be affected — apparently they are though. So I need to go through and update many of those public links.

The second is that HTML pages served up via Dropbox are no longer rendered in the browser. So I can’t use Dropbox as a simple website server for some pages. You can still download the page and view the HTML (same as, say, a PDF), but I realize this is a bit more inconvenient. In some cases I will have to convert those Dropbox links to pages on my WordPress blog. (Which adds another layer of complexity, as sometimes I use those as supplementary materials for academic papers, which require anonymity. Not sure how to deal with that though.)

Be patient, I have quite a few links I need to fix in various places, so it will take me a while. If you spot a broken link, please let me know (via comment or email). Also, if you cannot download any of my posted material (e.g. folks in China cannot access Dropbox), always feel free to send an email and I will distribute it that way. See my about page or my CV page for my email address.


Graphs and interrupted time series analysis – trends in major crimes in Baltimore

Pete Moskos’s blog is one I regularly read, and in a recent post he pointed out how major crimes (aggravated assaults, robberies, homicides, and shootings) have been increasing in Baltimore post the riot on 4/27/15. He provides a series of different graphs using moving averages to illustrate the rise; see below for his initial attempt:

He also has an interrupted moving average plot that shows the break more clearly – but honestly I don’t understand his description, so I’m not sure how he created it.

I recreated his initial line plot using SPSS, and I think a line plot with a guideline shows the bump post riot pretty clearly.

The bars in Pete’s graph are not the easiest way to visualize the trend. Here making the line thin and lighter grey also helps.

The way to analyze this data is using an interrupted time series analysis. I am not going to go through all of those details, but for those interested I would suggest picking up David McDowall’s little green book, Interrupted Time Series Analysis, for a walkthrough. One of the first steps though is to figure out the ARIMA structure, which you do by examining the auto-correlation function. Here is that ACF for this crime data.

You can see that it is positive and stays quite consistent. This is indicative of a moving average model. It does not show the geometric decay of an auto-regressive process, nor is the autocorrelation anywhere near 1, which you would expect for an integrated process. Also the partial autocorrelation plot shows the geometric decay, which is again consistent with a moving average model. See my note at the bottom: this interpretation turned out to be wrong! (David Greenberg sent me a note pointing this out.)
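The plots here are from SPSS, but as a quick sketch the same checks are one-liners in R (assuming major is a hypothetical numeric vector of the daily major crime counts):

acf(major, lag.max = 30)   #autocorrelation function
pacf(major, lag.max = 30)  #partial autocorrelation function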

Although it is typical to analyze crime counts with a Poisson model, I often like to use linear models, as the coefficients are much easier to interpret. Here the daily counts are high enough that I am OK using a linear interrupted ARIMA model.

So I estimated an interrupted time series model. I include a dummy variable term that equals 1 as of 4/27/15 and after, and equals 0 before. That variable is labeled PostRiot. I then have dummy variables for each month of the year (M1, M2, …., M11) and days of the week (D1,D2,….D6). The ARIMA model I estimate then is (0,0,7), with a constant. Here is that estimate.
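The actual analysis is in SPSS (code linked below), but as a hedged sketch of the same type of model in R, you can pass the PostRiot dummy to the base arima function as an external regressor (major and dates are hypothetical stand-ins for the daily series; the month and day-of-week dummies would just be extra columns of xreg):

PostRiot <- as.numeric(dates >= as.Date("2015-04-27"))   #1 on 4/27/15 and after, 0 before
fit <- arima(major, order = c(0, 0, 7), xreg = PostRiot)  #MA(7) with a constant
fit  #the PostRiot coefficient is the estimated post-riot shift in daily counts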

So we get an estimate that post riot, major crimes have increased by around 7.5 per day. This is pretty similar to what you get when you just look at the daily mean pre-post riot, so it isn’t really any weird artifact of my modeling strategy. Pre-riot it is under 25 per day, and post it is over 32 per day.

This result is pretty robust across different model specifications. Dropping the constant term results in a larger post riot estimate (over 10). Inclusion of fewer or more MA terms (as well as seasonal MA terms for 7 days) does not change the estimate. Inclusion of the monthly or day of week dummy variables does not make a difference in the estimate. Changing the outlier value on 4/27/15 to a lower value (here I used the pre-mean, 24) does reduce the estimate slightly, but only to 7.2.

There is a bit of residual autocorrelation I was never able to get rid of, but it is fairly small, with the highest autocorrelation of only about 0.06.

Here is the SPSS code to reproduce the Baltimore graphs and ARIMA analysis.

As a note, while Pete believes this is a result of depolicing (i.e. Baltimore officers being less proactive), this analysis does not necessarily confirm that hypothesis. See Stephen Morgan’s analysis on crime and arrests, although I think proactive street stops should likely also be included in such an analysis.


This Baltimore data just shows a bump up in the series, but investigating homicides in Chicago it looks to me like an upward trend post the McDonald shooting. This graph is at the monthly level.

I have some other work on Chicago homicide geographic patterns going back quite a long time I can hopefully share soon!

I will need to update the Baltimore analysis to look at just homicides as well. Pete shows a similar bump in his charts when just examining homicides.

For additional resources for folks interested in examining crime over time, I would suggest checking out my article, Monitoring volatile homicide trends across U.S. cities, as well as Tables and Graphs for Monitoring Crime Patterns. I’m doing a workshop at the upcoming International Association for Crime Analysts conference on how to recreate such graphs in Excel.


David Greenberg sent me an email to note that my interpretation of the ACF plots was wrong – a moving average process should only show a spike, not the slow decay. He is right, and so I updated the interrupted ARIMA models to include higher order AR terms instead of MA terms. The final model I settled on was (5,0,0) — I kept adding higher order AR terms until the additional AR coefficients were not statistically significant. For these models I still included a constant.
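Continuing the hypothetical R sketch from earlier, the updated specification just swaps the MA terms for AR terms:

fit_ar5 <- arima(major, order = c(5, 0, 0), xreg = PostRiot)  #AR(5) with a constant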

For the model that includes the outlier riot count, the estimate is that the riot increased these crimes by 7.5 per day, with a standard error of 0.5.

This model has no residual auto-correlation until you get up to very high lags. Here is a table of the Box-Ljung stats for up to 60 lags.
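In the R sketch, the analogous residual check is a Ljung-Box test on the model residuals, for example:

Box.test(residuals(fit_ar5), lag = 60, type = "Ljung-Box")  #fitdf can be set to the number of estimated ARMA terms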

Estimating the same ARIMA model with the outlier value changed to 24, the post riot estimate is still over 7.

Subsequently the post-riot increase estimate is pretty robust across these different ARIMA model settings. The lowest estimate I was able to get was a post mean increase of 5 when not including an intercept and not including the outlier crime counts on the riot date. So I think this result holds up pretty well to a bit of scrutiny.

New working paper: The effect of housing demolitions on crime in Buffalo, New York

I have a new working paper up, The effect of housing demolitions on crime in Buffalo, New York. This is in conjunction with my colleagues Dae-Young Kim and Scott Phillips, who are at SUNY Buffalo. Below is the abstract.

Objectives: From 2010 through 2015, the city of Buffalo demolished over 2,000 residences. This study examines whether those demolitions resulted in crime reductions.

Methods: Analysis was conducted at micro places matching demolished parcels to comparable control parcels with similar levels of crime. In addition, spatial panel regression models were estimated at the census tract and quarterly level, taking into account demographic characteristics of neighborhoods.

Results: We find that at the micro place level, demolitions cause a steep drop in reported crime at the exact parcel, and result in additional crime decreases at buffers of up to 1,000 feet away. At the census tract level, results indicated that demolitions reduced Part 1 crimes, but the effect was not statistically significant across different models.

Conclusions: While concerns over crime and disorder are common for vacant houses, the evidence that housing demolitions are an effective crime reduction solution is only partially supported by the analyses here. Future research should compare demolitions in reference to other neighborhood revitalization processes.

As always, if you have feedback/comments let me know.

And here are a few maps from the paper!

Geocoding with census data and the Census API

For my online GIS class I have a tutorial on creating an address locator using street centerline data in ArcGIS. Eventually I would like to put all of my class online, but for now I am just sharing that one, as I’ve forwarded it a lot recently.

That tutorial used local street centerline data in Dallas that you can download from Dallas’s open data site. It also gives directions on how to use an online ESRI geocoding service — which Dallas has. But what if those are not an option? A student recently wanted to geocode data from San Antonio, and the only street data file they publicly provide lacks the beginning and ending street number.

That data is insufficient to create an address locator. The road data you can download from the census’s web interface also lacks this information. But you can download street centerline data with beginning and ending addresses from the census FTP site. For example, here is the url that contains the streets with the address features. To use that you just have to figure out what state and county you are interested in downloading. The census even has ESRI address locators already made for you using 2012 data at the state level. Again, you just need to figure out your state’s number and download it.

Once you download the data with the beginning and ending street numbers, you can follow along with that tutorial the same as with the public data.

Previously I’ve written about using the Google geocoding API. If you just have crime data from one jurisdiction, it is simple to make a geocoder for just that locality. But if you have data for many cities (say if you were geocoding home addresses) this can be more difficult. An alternative online API to google that does not have daily limits is the Census Geocoding API.

Here is a simple example in R of calling the census API and geocoding a list of addresses.

library(httr)
library(jsonlite)

get_CensusAdd <- function(street,city,state,zip,benchmark=4){
    base <- "https://geocoding.geo.census.gov/geocoder/locations/address?"
    soup <- GET(url=base,query=list(street=street,city=city,state=state,zip=zip,format='json',benchmark=benchmark))
    dat <- fromJSON(content(soup,as='text'), simplifyVector=TRUE)
    D_dat <- dat$result$addressMatches
    if (length(D_dat) > 1){
        return(c(D_dat['matchedAddress'],D_dat['coordinates'][[1]])) #error will just return null, x[1] is lon, x[2] is lat
    } else {
        return(c('',NA,NA))
    }
}

#now create function to loop over data frame and return set of addresses
geo_CensusTIGER <- function(street,city,state,zip,sleep=1,benchmark=4){
  #make empty data frame to hold the results
  l <- length(street)
  MyDat <- data.frame(matrix(nrow=l,ncol=3))
  names(MyDat) <- c("MatchedAdd","Lon","Lat")
  for (i in 1:l){
    x <- suppressMessages(get_CensusAdd(street=street[i],city=city[i],state=state[i],zip=zip[i],benchmark=benchmark))
    if (length(x) > 0){
        MyDat[i,1] <- x[1]
        MyDat[i,2] <- x[2]
        MyDat[i,3] <- x[3]
    }
    Sys.sleep(sleep)
  }
  MyDat$street <- street
  MyDat$city <- city
  MyDat$zip <- zip
  MyDat$state <- state
  return(MyDat)
}

## Arbitrary dataframe for an exercise
AddList <- data.frame(
  IdNum = c(1,2,3,4,5),
  Address = c("450 W Harwood Rd", "2878 Fake St", "2775 N Collin St", "2775 N Collins St", "Lakewood Blvd and W Shore Dr"),
  City = c("Hurst", "Richardson", "Arlington", "Arlington", "Dallas"),
  State = c("TX", "TX", "TX", "TX", "TX")
)

test <- geo_CensusTIGER(street=AddList$Address,city=AddList$City,state=AddList$State,zip=rep('',5))

If you check out the results, you will see that this API does not appear to do fuzzy matching. 2775 N Collin St failed, whereas 2775 N Collins St was able to return a match. You can also see that it will return an intersection, but in my tests "/" did not work (so in R you can simply use gsub to replace different intersection types with "and"). I haven’t experimented with it too much, so let me know if you have any other insight into this API.
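For example, a quick way to normalize intersection separators before sending them to the API (Addresses here is just whatever character vector you are geocoding):

Addresses <- gsub("/", " and ", Addresses, fixed = TRUE)  #turns "A St / B St" into "A St and B St"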

In a follow-up post I will show a Python function to use the Census geocoding API, as well as the Nominatim online geocoding API, which you can use for addresses outside of the United States.

IACA Conference 2017 workshop: Monitoring temporal crime trends for outliers (Excel)

This fall at the International Association of Crime Analysts conference I am doing a workshop, Monitoring temporal crime trends for outliers: A workshop using Excel. If you can’t wait (or are not going) I have all my materials already prepared, which you can download here. That includes a walkthrough of my talk/tutorial, as well as a finished Excel workbook. It is basically a workshop to go with my paper, Tables and graphs for monitoring temporal crime trends: Translating theory into practical crime analysis advice.

For some preview, I will show how to make a weekly smoothed chart with error bands:

As well as a monthly seasonal chart:

I use Excel not because I think it is the best tool, but mainly because I think it is the most popular among crime analysts. In the end I just care about getting the job done! (Although I’ve given reasons why I think Excel is more painful than any statistical program.) Even though it is harder to make small multiple charts in Excel, I show how to make these charts using pivot tables and filters, so watching them auto-update when you update the filter is pretty cool.

For those with SPSS I have already illustrated how to make similar charts in SPSS here. You could of course replicate that in R or Stata or whatever if you wanted.

I am on the preliminary schedule currently for Tuesday, September 12th at 13:30 to 14:45. I will be in New Orleans on the 11th, 12th and 13th, so if you want to meet always feel free to send an email to set up a time.

Don’t include temporal lags of crime in cross-sectional crime models

In my 311 and crime paper a reviewer requested I conduct cross-lagged models. That is, predict crime in 2011 while controlling for prior counts of crime in 2010, in addition to the other specific variables of interest (here 311 calls for service). In the supplementary material I detail why this is difficult with Poisson models, as the endogenous effect will often be explosive in Poisson models, something that does not happen as often in linear models.

There is a second problem with cross-lagged models that I don’t discuss there though, and it has to do with how what I think is a reasonable data generating process for crime at places can cause cross-lagged models to be biased. This is based on the fact that crime at places tends to be very temporally stable (see David Weisburd’s, or Martin Andresen’s, or my work showing that). So when you incorporate temporal lags of crime in models, this biases the estimates for the other variables of interest (311 calls, alcohol outlets, other demographics, whatever), because those variables also cause crime in the prior time period. This is equivalent to controlling for an intermediate outcome. For examples of this see some of the prior work on the relationship between crime and disorder by Boggess and Maskaly (2014) or O’Brien and Sampson (2015).1

So for Boggess and Maskaly (BM) and O’Brien and Sampson (OS), their simplified cross-lagged model is:

(1) Crime_post = B0*Crime_pre + B1*physicaldisorder_pre

Where the post and pre periods are yearly counts of crime and indicators of physical disorder. My paper subsequently does not include the prior counts of crime, but does lag the physical disorder measures by a year to ensure they are exogenous.

(2) Crime_post = B1*physicaldisorder_pre

There are a few reasons to do these lags. The most obvious is to make the explanatory variable of broken windows exogenous, by making sure it is in the past. The reason for including lags of crime counts is most often strictly as a control variable. There are some examples where crime begets more crime directly, such as retaliatory violence (see Rosenfeld, 2009), but most folks who estimate cross-lagged models do not make this argument.

Now, my whole argument rests on what I think is an appropriate model explaining counts of crime at places. Continuing with the physical disorder example, I think a reasonable cross-sectional model of crime at places is that there are some underlying characteristics of locations that tend to be pretty stable over fairly long periods of time, and then we have more minor stuff like physical disorder that provide small exogenous shocks to the system over time.

(3) Crime_i = B0*(physicaldisorder_i) + Z_i

Where crime at location i is a function of some fixed characteristic Z. I can’t prove this model is correct, but I believe it is better supported by the data. To support this position, I would refer to the incredibly high correlations between counts of crime at places from year to year. This is true of every crime dataset I have worked with (at every spatial unit of analysis), and is a main point of Shaw and McKay’s work, plus Rob Sampson’s for neighborhoods in Chicago, as well as David Weisburd’s work on trajectories of crime at street segments in Seattle. Again, this very high correlation doesn’t strike me as reasonably explained by crime causing more crime; what is more likely is that there is a set of fixed characteristics that impact criminal behavior at certain locations.

If a model of crime is like that in (3), there are then two problems with the prior equations. The first problem for both (1) and (2) is that lagging the physical disorder measures by a year does not make any sense. The idea behind physical disorder (a.k.a. broken windows) is that visible signs of disorder prime people to behave in a particular way. The priming presumably needs to be recent to affect behavior. But this can simply be solved by not lagging physical disorder by a year in the model. The lagged physical disorder effect might approximate the contemporaneous effect if physical disorder itself is temporally consistent over long periods. So if, say, we replace physical disorder with locations of bars, the lagged effect of bars likely does not make any difference, because bars don’t turn over that much (and when they do, they are often just replaced by another bar).

But what if you still include the lags of crime counts? One may think that this controls for the omitted Z_i effect, but it is very bad for the other exogenous variables, especially lagged or temporally consistent ones. You are probably better off with the omitted random effect, because crime in the prior year is an intermediate outcome. I suspect this bias can be very large, and likely biases the effects of the other variables towards zero by quite a lot. This is because the effect of the fixed characteristic is large, the effect of the exogenous characteristic is smaller, and the two are likely correlated to at least a small degree.

To show this I conduct a simulation. SPSS Code here to replicate it. The true model I simulated is:

(4)  BW_it = 0.2*Z_i + ew_it
(5)  Crime_it = 5 + 0.1*BW_it + 0.9*Z_i + ec_it

I generated this for 25,000 locations and two time points (the t subscript), and all the variables are set to have a variance of 1 (all variables are normally distributed). The error terms (ew_it and ec_it) are not correlated with each other, and their variances are set to whatever value is necessary so the resultant variable on the left-hand side has a variance of 1. With so many observations, one simulation run is pretty representative of what would happen even if I replicated the simulation multiple times. This specification makes BW (which stands for broken windows) and Z_i correlated.
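The linked code is in SPSS, but here is a minimal R sketch of the same data generating process and the models that follow (the error standard deviations are worked out so each variable has variance 1; the exact estimates will differ a bit from the run I report below):

set.seed(10)
n <- 25000
Z <- rnorm(n)  #fixed characteristic of the location
ew_sd <- sqrt(1 - 0.2^2)                            #so BW has variance 1
ec_sd <- sqrt(1 - (0.1^2 + 0.9^2 + 2*0.1*0.9*0.2))  #so Crime has variance 1
BW_pre  <- 0.2*Z + rnorm(n, sd = ew_sd)
BW_post <- 0.2*Z + rnorm(n, sd = ew_sd)
Crime_pre  <- 5 + 0.1*BW_pre  + 0.9*Z + rnorm(n, sd = ec_sd)
Crime_post <- 5 + 0.1*BW_post + 0.9*Z + rnorm(n, sd = ec_sd)

coef(lm(Crime_post ~ Crime_pre + BW_pre))  #cross-lagged model, equation (1)
coef(lm(Crime_post ~ BW_pre))              #lag of BW only, equation (2)
coef(lm(Crime_post ~ Crime_pre + BW_post)) #contemporaneous BW plus the lag of crime
coef(lm(Crime_post ~ BW_post))             #contemporaneous BW, omitting Z_i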

In my run, what happens when we fit the cross-lagged model? The effect estimates are subsequently:

Lag BW:   -0.07
Lag Crime: 0.90

Yikes – effect of BW is in the opposite direction and nearly as large as the true effect. What about if you just include the lag of BW?

Lag BW: 0.22

The reason this is closer to the true effect is because of some roundabout luck. Since BW_it is correlated with the fixed effect Z_i, the lag of BW has a slight correlation with future BW. This potentially changes how we view the effects of disorder on crime though. If BW is more variable over time, we can make a stronger argument that it is exogenous of other omitted variables. If it is temporally consistent it is harder to make that argument (being more variable should also reduce the correlation with Z_i).

Still, the only reason this lag has a positive effect is that Z_i is omitted. For us to make the argument that this approximates the true effect, we have to make the argument the model has a very important omitted variable. Something one could only do as an act of cognitive dissonance.

How about use the contemporaneous effect of BW, but still include the lag counts of crime?

BW:        0.13
Lag Crime: 0.86

That is not as bad, because the lag of crime is now not an intermediate outcome. Again though, if we switch BW with something more consistent in time, like locations of bars, the lag will be an intermediate outcome, and will subsequently bias the effect. So what about a model of the contemporaneous effect of BW, omitting Z_i? The contemporaneous effect of BW will still be biased, since Z_i is omitted from the model.

BW: 0.32

But a way to reduce this bias is to introduce other control variables that approximate the omitted Z_i. Here I generate a set of 10 covariates that are a function of Z_i, but are otherwise not correlated with BW or with each other.

(6) Oth_it = 0.5*Z_i + eoth_it

Including these covariates in the model progressively reduces the bias. Here is a table showing the reduction in the BW effect as more of the covariates are added in; e.g., "with 2" means the model includes two of the control variables (see the sketch after the table for how these covariates and models might be generated).

BW (with 0):  0.32
BW (with 1):  0.25
BW (with 2):  0.21
BW (with 3):  0.19
BW (with 10): 0.14
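Continuing the hypothetical R sketch, the control covariates and the progressively larger models look something like:

eo_sd <- sqrt(1 - 0.5^2)  #so each covariate has variance 1
Oth <- sapply(1:10, function(j) 0.5*Z + rnorm(n, sd = eo_sd))  #equation (6)
coef(lm(Crime_post ~ BW_post))               #with 0 controls
coef(lm(Crime_post ~ BW_post + Oth[, 1:2]))  #with 2 controls
coef(lm(Crime_post ~ BW_post + Oth))         #with all 10 controls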

So if you include other cross-sectional covariates in an attempt to control for Z_i, it brings the effect of BW closer to its true effect. This is what I believe happens in the majority of social science research that uses strictly cross-sectional models, and is a partial defense of what people sometimes refer to as kitchen sink models.

So in brief, I think using lags of explanatory variables and lags of crime in the same model is very bad, and can bias the effect estimates quite a lot.

So using lags of explanatory variables and lags of crime counts in cross-sectional models is, I believe, a bad idea for most research designs. It is true that it makes their effects exogenous, but it doesn’t eliminate the more contemporaneous effect of the variable, and so we may be underestimating the effect to a very large extent. Whether or not the temporal lag affects crime has to do with how the explanatory variable itself arises, and so the effect estimated by the temporal lag is likely to be misleading (and may be biased upward or downward depending on other parts of the model).

For most cross-lagged models, I think incorporating prior crime counts is likely to introduce more bias than it solves. I believe simply using a cross-sectional model with a reasonable set of control variables will get you closer to the real effect estimates than the cross-lagged models. If you think Z_i is correlated with a variable of interest (or lags of crime really do cause future crime), I think you need to do the extra step of collecting multiple time measures and fitting a real panel data model, not just a cross-lagged one.

I’m still not sure when you are better off fitting a panel model versus expanding the time window for the cross-section though. For one example, I think you are better off estimating the effects of demographic variables in a cross-sectional model, as opposed to a panel one, over a short period of time (say less than 10 years). This is because demographic shifts simply don’t occur very fast, so there is little variance within units for a short panel.


  1. I actually came up with the idea of using 311 calls independently of Dan O’Brien’s work, see my prospectus in 2013 in which I proposed the analysis. So I’m not totally crazy – although I was a little bummed to miss the timing a bit! Four years between proposing and publishing the work is a bit depressing as well.

Paper: The Effect of 311 Calls for Service on Crime in D.C. at Microplaces published

My paper, The Effect of 311 Calls for Service on Crime in D.C. at Microplaces, was published online first at Crime & Delinquency. Here is the link to the published paper. If you do not have access to a library where you can get the paper, always feel free to email and I will send an off-print. But I also have the pre-print posted on SSRN. Often the only difference between my pre-prints and the finished version is that the published paper is shorter!

As a note, I’ve also posted all of the data and code to replicate my findings. The note is unfortunately buried at the end of the paper, instead of the beginning.

This was the first paper published from my dissertation. I have pre-prints out for two others, What we can learn from small units and Local and Spatial Effect of Bars. Hopefully you will see those two in print in the near future as well!

Testing the equality of coefficients – Same Independent, Different Dependent variables

As promised earlier, here is one example of testing coefficient equalities in SPSS, Stata, and R.

Here we have different dependent variables, but the same independent variables. This is taken from Dallas survey data (original data link, survey instrument link), and they asked about fear of crime, and split up the questions between fear of property victimization and violent victimization. Here I want to see if the effect of income is the same between the two. People in more poverty tend to be at higher risk of victimization, but you may also expect people with fewer items to steal to be less worried. Here I also control for the race and the age of the respondent.

The dataset has missing data, so I illustrate how to select out for complete case analysis, then I estimate the model. The fear of crime variables are coded as Likert items with a scale of 1-5 (higher values mean feeling more safe), but I predict them using linear regression (see the Stata code at the end though for combining ordinal logistic equations using suest). Race is of course nominal, and income and age are binned as well, but I treat the income bins as a linear effect. I pasted the codebook for all of the items at the end of the post.

These models with multiple dependent variables have different names, economists call them seemingly unrelated regression, psychologists will often just call them multivariate models, those familiar with structural equation modeling can get the same results by allowing residual covariances between the two outcomes — they will all result in the same coefficient estimates in the end.

SPSS

In SPSS we can use the GLM procedure to estimate the model. Simultaneously we can specify particular contrasts to test whether the income coefficient is different for the two outcomes.

*Grab the online data.
SPSSINC GETURI DATA URI="https://dl.dropbox.com/s/r98nnidl5rnq5ni/MissingData_DallasSurv16.sav?dl=0" FILETYPE=SAV DATASET=MissData.

*Conducting complete case analysis.
COUNT MisComplete = Safety_Violent Safety_Prop Gender Race Income Age (MISSING).
COMPUTE CompleteCase = (MisComplete = 0).
FILTER BY CompleteCase.


*This treats the different income categories as a continuous variable.
*Can use GLM to estimate seemingly unrelated regression in SPSS and test.
*equality of the two coefficients.
GLM Safety_Violent Safety_Prop BY Race Age WITH Income
  /DESIGN=Income Race Age
  /PRINT PARAMETER
  /LMATRIX Income 1
  /MMATRIX ALL 1 -1.

FILTER OFF.  

In the output you can see the coefficient estimates for the two equations. The income effect for violent crime is 0.168 (0.023) and for property crime is 0.114 (0.022).

And then you get a separate table for the contrast estimates.

You can see that the contrast estimate, 0.054, equals 0.168 – 0.114. The standard error in this output (0.016) takes into account the covariance between the two estimates. Here you would reject the null that the effects are equal across the two equations, and the effect of income is larger for violent crime. Because higher values on these Likert scales mean a person feels more safe, this is evidence that those with higher incomes are relatively more fearful of property victimization than of violent victimization, controlling for age and race.

Unfortunately the different matrix contrasts are not available in all the different types of regression models in SPSS. You may ask whether you can fit two separate regressions and do this same test. The answer is you can, but that assumes the two models are independent — it is typically more efficient to estimate them at once, and here it allows you to have the software handle the Wald test instead of constructing it yourself.

R

As I stated previously, seemingly unrelated regression is another name for these multivariate models. So we can use the R library systemfit to estimate our seemingly unrelated regression model, and then use the multcomp library to test the coefficient contrast. This does not result in the exact same coefficients as SPSS, but they are devilishly close. You can download the csv file of the data here.

library(systemfit) #for seemingly unrelated regression
library(multcomp)  #for hypothesis tests of models coefficients

#read in CSV file
SurvData <- read.csv(file="MissingData_DallasSurvey.csv",header=TRUE)
names(SurvData)[1] <- "Safety_Violent" #name gets messed up because of BOM

#Need to recode the missing values in R, use NA
NineMis <- c("Safety_Violent","Safety_Prop","Race","Income","Age")
#summary(SurvData[,NineMis])
for (i in NineMis){
  SurvData[SurvData[,i]==9,i] <- NA
}

#Making a complete case dataset
SurvComplete <- SurvData[complete.cases(SurvData),NineMis]
#Now changing race and age to factor variables, keep income as linear
SurvComplete$Race <- factor(SurvComplete$Race, levels=c(1,2,3,4), labels=c("Black","White","Hispanic","Other"))
SurvComplete$Age <- factor(SurvComplete$Age, levels=1:5, labels=c("18-24","25-34","35-44","45-54","55+"))
summary(SurvComplete)

#now fitting seemingly unrelated regression
viol <- Safety_Violent ~ Income + Race + Age
prop <- Safety_Prop ~ Income + Race + Age
fitsur <- systemfit(list(violreg = viol, propreg= prop), data=SurvComplete, method="SUR")
summary(fitsur)

#testing whether income effect is equivalent for both models
viol_more_prop <- glht(fitsur,linfct = c("violreg_Income - propreg_Income = 0"))
summary(viol_more_prop) 

Here is a screenshot of the results then:

This is also the same as estimating a structural equation model in which the residuals for the two regressions are allowed to covary. We can do that in R with the lavaan library.

library(lavaan)

#for this need to convert factors into dummy variables for lavaan
DumVars <- data.frame(model.matrix(~Race+Age-1,data=SurvComplete))
names(DumVars) <- c("Black","White","Hispanic","Other","Age2","Age3","Age4","Age5")

SurvComplete <- cbind(SurvComplete,DumVars)

model <- '
    #regressions
     Safety_Prop    ~ Income + Black + Hispanic + Other + Age2 + Age3 + Age4 + Age5
     Safety_Violent ~ Income + Black + Hispanic + Other + Age2 + Age3 + Age4 + Age5
    #residual covariances
     Safety_Violent ~~ Safety_Prop
     Safety_Violent ~~ Safety_Violent
     Safety_Prop ~~ Safety_Prop
'

fit <- sem(model, data=SurvComplete)
summary(fit)

I’m not sure offhand though if there is an easy way to test the coefficient differences with a lavaan object, but we can do it manually by grabbing the variance and the covariances. You can then see that the differences and the standard errors are equal to the prior output provided by the glht function in multcomp.

#Grab the coefficients I want, and test the difference
PCov <- inspect(fit,what="vcov")
PEst <- inspect(fit,what="list")
Diff <- PEst[9,'est'] - PEst[1,'est']
SE <- sqrt( PEst[1,'se']^2 + PEst[9,'se']^2 - 2*PCov[9,1] )
Diff;SE

Stata

In Stata we can replicate the same prior analyses. Here is some code to simply replicate the prior results, using Stata’s postestimation commands (additional examples using postestimation commands here). Again you can download the csv file used here. The results here are exactly the same as the R results.

*Load in the csv file
import delimited MissingData_DallasSurvey.csv, clear

*BOM problem again
rename ïsafety_violent safety_violent

*we need to specify the missing data fields.
*for Stata, set missing data to ".", not the named missing value types.
foreach i of varlist safety_violent-ownhome {
    tab `i'
}

*dont specify district
mvdecode safety_violent-race income-age ownhome, mv(9=.)
mvdecode yearsdallas, mv(999=.)

*making a variable to identify the number of missing observations
egen miscomplete = rmiss(safety_violent safety_prop race income age)
tab miscomplete
*even though any individual question is small, in total it is around 20% of the cases

*lets conduct a complete case analysis
preserve 
keep if miscomplete==0

*Now can estimate multivariate regression, same as GLM in SPSS
mvreg safety_violent safety_prop = income i.race i.age

*test income coefficient is equal across the two equations
lincom _b[safety_violent:income] - _b[safety_prop:income]

*same results as seemingly unrelated regression
sureg (safety_violent income i.race i.age)(safety_prop income i.race i.age)

*To use lincom it is the exact same code as with mvreg
lincom _b[safety_violent:income] - _b[safety_prop:income]

*with sem model
tabulate race, generate(r)
tabulate age, generate(a)
sem (safety_violent <- income r2 r3 r4 a2 a3 a4 a5)(safety_prop <- income r2 r3 r4 a2 a3 a4 a5), cov(e.safety_violent*e.safety_prop) 

*can use the same as mvreg and sureg
lincom _b[safety_violent:income] - _b[safety_prop:income]

You will notice here it is the exact same post-estimation lincom command to test the coefficient equality across all three models. (Stata makes this the easiest of the three programs IMO.)

Stata also allows us to combine different generalized linear models via seemingly unrelated estimation. Here I treat the outcomes as ordinal, and then combine the models using suest.

*Combining generalized linear models with suest
ologit safety_violent income i.race i.age
est store viol

ologit safety_prop income i.race i.age
est store prop

suest viol prop

*lincom again!
lincom _b[viol_safety_violent:income] - _b[prop_safety_prop:income]

An application in spatial criminology is when you are estimating the effect of something on different crime types. If you are predicting the number of crimes in a spatial area, you might estimate separate Poisson regression models for assaults and robberies — this is one way to estimate those models jointly. Cory Haberman and Jerry Ratcliffe have an application of this as well, estimating effects for different crime types at different times of day – e.g. the effect of bars in the afternoon versus the effect of bars at nighttime.
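Stata’s suest makes that joint estimation easy, but as a hedged sketch of one way to do a similar joint test in R for two Poisson outcomes, you can stack the counts, interact the predictor with crime type, and cluster the standard errors on the area id (dat, with columns id, assaults, robberies, and bars, is a hypothetical data frame):

library(sandwich)  #clustered variance estimator
library(lmtest)    #coeftest

long <- rbind(
  data.frame(id = dat$id, count = dat$assaults,  type = "assault", bars = dat$bars),
  data.frame(id = dat$id, count = dat$robberies, type = "robbery", bars = dat$bars)
)
m <- glm(count ~ type*bars, family = poisson, data = long)
#the type:bars interaction is the difference in the bar effect across the two crime types
coeftest(m, vcov = vcovCL(m, cluster = long$id))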

Codebook

Here is the codebook for each of the variables in the database.

Safety_Violent  
    1   Very Unsafe
    2   Unsafe
    3   Neither Safe or Unsafe
    4   Safe
    5   Very Safe
    9   Do not know or Missing
Safety_Prop 
    1   Very Unsafe
    2   Unsafe
    3   Neither Safe or Unsafe
    4   Safe
    5   Very Safe
    9   Do not know or Missing
Gender  
    1   Male
    2   Female
    9   Missing
Race    
    1   Black
    2   White
    3   Hispanic
    4   Other
    9   Missing
Income  
    1   Less than 25k
    2   25k to 50k
    3   50k to 75k
    4   75k to 100k
    5   over 100k
    9   Missing
Edu 
    1   Less than High School
    2   High School
    3   Some above High School
    9   Missing
Age 
    1   18-24
    2   25-34
    3   35-44
    4   45-54
    5   55+
    9   Missing
OwnHome 
    1   Own
    2   Rent
    9   Missing
YearsDallas
    999 Missing

New working paper – Monitoring volatile homicide trends across U.S. cities

I have a new working paper out — Monitoring volatile homicide trends across U.S. cities, with one of my colleagues Tomislav Kovandzic. You can grab the pre-print on SSRN, and the paper has links to code to replicate the charts and models in the paper.

Here I look at homicide rates in U.S. cities and use funnel charts and fan charts to show the typical volatility in homicide rates between cities and within cities over time. As I’ve written previously, I think much of the media narrative around homicide increases are hyperbolic and often cherry pick reasons why they think homicides are going up.

I’ve shown examples of funnel charts on this blog before, so I will use a different image as the tease. To generate the prediction intervals for fan charts I estimate binomial random effect models. Below is an example for New Orleans (homicide rate per 100,000 population):
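For the models behind those intervals, a rough sketch in R (not the exact specification in the paper) could use lme4’s glmer with a binomial outcome, assuming a hypothetical long data frame hom with columns city, year, homicides, and pop:

library(lme4)

m <- glmer(cbind(homicides, pop - homicides) ~ (1 | city) + (1 | year),
           data = hom, family = binomial)
summary(m)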

As always, if you have feedback feel free to send me an email.

Communities and Crime

This was my first semester teaching undergrads at UT Dallas. I taught the Communities and Crime undergrad course. I thought it went very well, and I was impressed with the undergrads here. For the course I had students do a bunch of different prediction assignments based on open data in Dallas, such as predicting what neighborhood has the most crime, or which specific bar has the most assaults. The idea being they would use the theories I discussed in the prior lecture to make the best predictions.

For their final assignment, I had students predict an arbitrary area to capture the most robberies in 2016 (up to that point they had only been predicting crimes in 2015). I used the same metric that NIJ is using in their crime forecasting challenge – the predictive accuracy index. This is simply % crime/% area, so students who give larger areas are more penalized. This ended up producing a pretty neat capstone to the end of the semester.
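As a small aside, the predictive accuracy index is simple to compute; here is a tiny R helper (the numbers in the example call are made up):

#PAI = (% of crime captured) / (% of area covered)
pai <- function(crimes_in, crimes_total, area_in, area_total) {
  (crimes_in/crimes_total) / (area_in/area_total)
}
pai(crimes_in = 40, crimes_total = 800, area_in = 0.25, area_total = 100)  #returns 20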

Below is a screen shot of the map, and here is a link to an interactive version. (WordPress.com sites only allow specific types of iframe sources, so my dropbox src link to the interactive Leaflet map gets stripped.)

Look forward to teaching this class again (as of now it seems I will regularly offer it every spring).

More news on classes to come soon. I am teaching GIS applications in Criminology online over the summer. For a quick idea about the content, it will be almost the same as the GIS course in criminal justice I previously taught at SUNY.

In short, if you think maps rock then you should take my classes 😉