Some more testing coefficient contrasts: Multinomial models and indirect effects

Testing the equality of two coefficients is one of my more popular posts. This is a good thing — often the more interesting hypothesis is a test of two parameters against each other, as opposed to the strict null hypothesis of a single coefficient against zero. Every now and then I get questions about applying this idea to new situations where it is not obvious how to proceed. So here are a few examples using demonstration R code.

Multinomial Models

One question I received was how to apply the advice to test coefficients across different contrasts in multinomial models. It may not seem obvious, but the general approach — extracting the coefficients and the covariance between those estimates — works the same way as in most regression models.

So in a quick example in R:

library(nnet)
data(mtcars)
library(car)

mtcars$cyl <- as.factor(mtcars$cyl)  
mtcars$am <- as.factor(mtcars$am)  
mod <- multinom(cyl ~ am + hp, data=mtcars, Hess=TRUE)
summary(mod)

And the estimates for mod are:

> summary(mod)
Call:
multinom(formula = cyl ~ am + hp, data = mtcars)

Coefficients:
  (Intercept)       am1        hp
6   -42.03847  -3.77398 0.4147498
8   -92.30944 -26.27554 0.7836576

Std. Errors:
  (Intercept)       am1        hp
6    27.77917  3.256003 0.2747842
8    31.93525 46.854100 0.2559052

Residual Deviance: 7.702737 
AIC: 19.70274 

So say we want to test whether the hp effect is the same for 6 cylinders vs 8 cylinders. To test that, we just grab the covariance and construct our test:

#Example constructing test by hand
v <- vcov(mod)
c <- coef(mod)
dif <- c[1,3] - c[2,3]
se <- sqrt( v[3,3] + v[6,6] - 2*v[3,6])
z <- dif/se
#test stat, standard error, and two-tailed p-value
dif;se;2*(1 - pnorm(abs(z)))

We end up with a p-value of 0.0002505233, so we would reject the null that these two effects are equal to one another. Note that to get the variance-covariance estimates for the parameters you need to set Hess=TRUE in the multinom call.

An easier way though is to use the linearHypothesis function in the car library to conduct the same test:

> linearHypothesis(mod,c("6:hp = 8:hp"),test="Chisq")
Linear hypothesis test

Hypothesis:
6:hp - 8:hp = 0

Model 1: restricted model
Model 2: cyl ~ am + hp

  Df  Chisq Pr(>Chisq)    
1                         
2  1 13.408  0.0002505 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

You can see that although this is framed as a Chi-square test, it results in the same p-value. The Wald test can also be extended to testing multiple coefficient equalities at once, and a popular one for multinomial models is to test whether any of the coefficients differ across levels of the dependent categories. The idea behind that test is to see whether you can collapse a category with another one that is equivalent.

To do that test, I created a function that does all of the contrasts at once:

#Creating function to return tests for all coefficient equalities at once
all_tests <- function(model){
  v <- colnames(coef(model))
  d <- rownames(coef(model))
  allpairs <- combn(d,2,simplify=FALSE)
  totn <- length(allpairs) + length(d)
  results <- data.frame(ord=1:totn)
  results$contrast <- ""
  results$test <- ""
  results$Df <- NULL
  results$Chisq <- NULL
  results$pvalue <- NULL
  iter <- 0
  for (i in allpairs){
    iter <- iter + 1
    l <- paste0(i[1],":",v)
    r <- paste0(i[2],":",v)
    test <- paste0(l," = ",r)
    temp_res <- linearHypothesis(model,test,test="Chisq")
    results$contrast[iter] <- paste0(i[1]," vs ",i[2])
    results$test[iter] <- paste(test,collapse=" and ")
    results$Df[iter] <- temp_res$Df[2]
    results$Chisq[iter] <- temp_res$Chisq[2]
    results$pvalue[iter] <- temp_res$Pr[2]    
  }
  ref <- model$lab[!(model$lab %in% d)]
  for (i in d){
    iter <- iter + 1
    test <- paste0(i,":",v," = 0")
    temp_res <- linearHypothesis(model,test,test="Chisq")
    results$contrast[iter] <- paste0(i," vs ",ref)
    results$test[iter] <- paste(test,collapse=" and ")
    results$Df[iter] <- temp_res$Df[2]
    results$Chisq[iter] <- temp_res$Chisq[2]
    results$pvalue[iter] <- temp_res$Pr[2]  
  }
  return(results)
}

Not only does this construct the tests between the observed categories, it also tests whether each set of coefficients is simultaneously zero, which is the appropriate contrast against the referent category.

> all_tests(mod)
  ord contrast                                                            test Df        Chisq       pvalue
1   1   6 vs 8 6:(Intercept) = 8:(Intercept) and 6:am1 = 8:am1 and 6:hp = 8:hp  3    17.533511 0.0005488491
2   2   6 vs 4                    6:(Intercept) = 0 and 6:am1 = 0 and 6:hp = 0  3     5.941417 0.1144954481
3   3   8 vs 4                    8:(Intercept) = 0 and 8:am1 = 0 and 8:hp = 0  3 44080.662112 0.0000000000

User beware of multiple testing with this, as I am not sure what the appropriate post-hoc correction is when examining so many hypotheses. This example with just three contrasts is obviously not a big deal, but with more categories you get n choose 2, or (n*(n-1))/2, total contrasts.
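
If you do want some protection, one quick option is to feed the p-values from the returned data frame into R's p.adjust function. This is just a sketch (I am not claiming Holm is the right correction for this particular family of tests):

#Sketch of a multiple comparison adjustment on the results
res <- all_tests(mod)
res$p_adj <- p.adjust(res$pvalue, method="holm") #or "bonferroni", "BH", etc.
res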

Testing the equality of multiple indirect effects

Another example I was asked about recently was whether you could use the same procedure to test indirect effects (popular in moderation and mediation analysis). Those end up being a bit trickier: to define the variance and covariance of the indirect effects we are not just adding and subtracting the original parameters, we are multiplying them.

Thus to estimate the standard errors and covariances of indirect effects folks often use the delta method. Here is an example in R using the lavaan library (taken from a code snippet Yves Rosseel posted himself) to estimate the variance-covariance matrix of the model-defined indirect parameters.

#function taken from post in
#https://groups.google.com/forum/#!topic/lavaan/skgZRyzqtYM
library(lavaan)
vcov.def <- function(model){
  m <- model
  orig <- vcov(m)
  free <- m@Fit@x
  jac <- lavaan:::lavJacobianD(func = m@Model@def.function, x = free)
  vcov_def <- jac %*% orig %*% t(jac)
  estNames <- subset(parameterEstimates(m),op==":=")
  row.names(vcov_def) <- estNames$lhs
  colnames(vcov_def) <- estNames$lhs
  #I want to print the covariance table estimates to make sure the
  #labels are in the correct order
  estNames$se2 <- sqrt(diag(vcov_def))
  estNames$difSE <- estNames$se - estNames$se2
  print(estNames[,c('lhs','se','se2','difSE')])
  print('If difSE is not zero, labels are not in right order')
  return(vcov_def)
}
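
For intuition, in the simplest case of a single indirect effect that is a product of two coefficients, say a*d, that Jacobian calculation reduces to the familiar first-order delta method approximation, Var(a*d) ≈ d^2*Var(a) + a^2*Var(d) + 2*a*d*Cov(a,d). Here is a minimal sketch of that arithmetic using made-up numbers (not values from the model below):

#First order delta method for a single product a*d, made-up numbers
a <- 0.5; d <- 0.7        #point estimates of the two paths
va <- 0.01; vd <- 0.02    #their sampling variances
cad <- 0.001              #their sampling covariance
var_ad <- d^2*va + a^2*vd + 2*a*d*cad #approximate variance of a*d
sqrt(var_ad) #approximate standard error of the indirect effect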

Now here is an example of testing individual parameter estimates for indirect effects.

set.seed(10)
n <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
X3 <- rnorm(n)
M <- 0.5*X1 + 0.4*X2 + 0.3*X3 + rnorm(n)
Y <- 0.1*X1 + 0.2*X2 + 0.3*X3 + 0.7*M + rnorm(n)
Data <- data.frame(X1 = X1, X2 = X2, X3 = X3, Y = Y, M = M)
model <- ' # direct effect
             Y ~ X1 + X2 + X3 + d*M
           # mediator
             M ~ a*X1 + b*X2 + c*X3
           # indirect effects
             ad := a*d
             bd := b*d
             cd := c*d
         '
model_SP.fit <- sem(model, data = Data)
summary(model_SP.fit)

#now apply to your own sem model
defCov <- vcov.def(model_SP.fit)

Unfortunately, as far as I know the linearHypothesis function does not work for lavaan objects, so if we want to test whether the indirect effects are equal, ad = bd, we need to construct the test by hand. But with the vcov.def function we have the covariance terms we need.

#testing hypothesis that "ad = bd"
#so doing "ad - bd = 0"
model_SP.param <- parameterEstimates(model_SP.fit)
model_SP.defined <- subset(model_SP.param, op==":=")
dif <- model_SP.defined$est[1] - model_SP.defined$est[2]
var_dif <- defCov[1,1] + defCov[2,2] - 2*defCov[1,2]
#so the test standard error of the difference is 
se_dif <- sqrt(var_dif)
#and the test statistic is
tstat <- dif/se_dif 
#two tailed p-value
dif;se_dif;2*(1 - pnorm(abs(tstat)))

To test whether all three indirect parameters are equal to each other at once, one way is to estimate a restricted model and then use a likelihood ratio test of the restricted versus the full model. It is pretty easy in lavaan to create coefficient restrictions — just give the paths that were allowed to vary a single common label:

restrict_model <- ' # direct effect
                      Y ~ X1 + X2 + X3 + d*M
                    # mediator
                      M ~ a*X1 + a*X2 + a*X3
                    # indirect effects
                      ad := a*d
                  '

model_SP.restrict <- sem(restrict_model, data = Data)
lavTestLRT(model_SP.fit, model_SP.restrict)
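
Alternatively, since we already have the covariance matrix of the defined parameters, a joint Wald test of the two restrictions (ad = bd and ad = cd) can be pieced together by hand. Here is a sketch reusing the defCov and model_SP.defined objects from above (this assumes the defined parameters are in the order ad, bd, cd, which is worth double checking against the printed labels):

#Joint Wald test that ad = bd = cd, constructed by hand
est <- model_SP.defined$est              #the three indirect effect estimates
K <- rbind(c(1,-1,0), c(1,0,-1))         #contrasts ad - bd and ad - cd
W <- t(K %*% est) %*% solve(K %*% defCov %*% t(K)) %*% (K %*% est)
#chi-square statistic and p-value, df equal to the number of contrasts
W; pchisq(as.numeric(W), df=nrow(K), lower.tail=FALSE)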

If folks know of an easier way to do the Wald tests via lavaan models let me know, I would be interested!

 


Making interactive plots with R and Plotly

I wrote a small op-ed based on the homicide studies work I recently published about interpreting crime trends. Unfortunately that op-ed was not picked up by anyone (I missed the timing a bit, maybe next year when the UCR stats come out I can just update the numbers and make the same point). I’ve posted that op-ed here, and I wanted to make a quick blog post detailing how I made the interactive graphs in that post using R and the Plotly library. All the data and code to replicate this can be downloaded from here.

Unfortunately with my free WordPress blog I cannot embed the actual interactive graphics, but I will provide links to working online versions at my UT Dallas page and show a screenshot of each. So first, let's load all of the libraries that you will need, as well as set the working directory. (Of course change it to wherever you have the data saved on your local machine.)

#########################################################
#Making a shiny app for homicide rate chart
library(shiny)
library(ggplot2)
library(plotly)
library(htmlwidgets)
library(scales)

mydir <- "C:\\Users\\axw161530\\Box Sync\\Projects\\HomicideGraphs\\Analysis\\Analysis" 
setwd(mydir)
#########################################################

Now I just read in the data. I have two datasets; the funnel rates file simply has additional columns, already created, to draw the funnel graphs. (See here or here, or the data in the original Homicide Studies paper linked at the top, for how to construct these.)

############################################################
#Get the data 

FunnRates <- read.csv(file="FunnelData.csv",header=TRUE)
summary(FunnRates)
FunnRates$Population <- FunnRates$Pop1 #These are just to make nicer labels 
FunnRates$HomicideRate <- FunnRates$HomRate

IntRates <- read.csv(file="IntGraph.csv",header=TRUE)
summary(IntRates)
############################################################

Funnel Chart for One Year

First, plotly makes it dead easy to take graphs you created via ggplot and turn them into an interactive graph. So here is a link to the interactive chart, and below is a screenshot.

To walk through the code, first you make your (almost) plain Jane ggplot object. Here I name it p. You will get a warning about an “unknown aesthetics: text”, but that aesthetic is what plotly uses to create the tooltips. Then you use the ggplotly function to turn the original ggplot graph p into an interactive graph. By default the plotly object has more stuff in the tooltip than I want, so you can basically go into the innards of the plotly object and strip that out. The final part just sets the margins a little larger than the default, as the axis labels were otherwise slightly cut off.

############################################################
#Make the funnel chart
year_sel <- 2015
p <- ggplot(data = FunnRates[FunnRates$Year == year_sel,]) + geom_point(aes(x=Population, y=HomicideRate, text=NiceLab), pch=21) +
     geom_line(aes(x=Population,y=LowLoc99)) + geom_line(aes(x=Population,y=HighLoc99)) + 
     labs(title = paste0("Homicide Rates per 100,000 in ",year_sel)) + 
     scale_x_log10("Population", limits=c(10000,10000000), breaks=c(10^4,10^5,10^6,10^7), labels=comma) + 
     scale_y_continuous("Homicide Rate", breaks=seq(0,110,10)) + 
     theme_bw() #+ theme(text = element_text(size=20), axis.title.y=element_text(margin=margin(0,10,0,0)))

pl <- ggplotly(p, tooltip = c("HomicideRate","text"))
#pl <- plotly_build(p, width=1000, height=900)
#See https://stackoverflow.com/questions/45801389/disable-hover-information-for-a-specific-layer-geom-of-plotly
pl$x$data[[2]]$hoverinfo <- "none"
pl$x$data[[3]]$hoverinfo <- "none"
pl <- pl %>% layout(margin = list(l = 75, b = 65))
############################################################

After this point you can just type pl into the console and it will open up an interactive window. Or you can use the saveWidget function from the htmlwidgets package, something like saveWidget(as_widget(pl), "FunnelChart_2015.html", selfcontained=TRUE) to save the graph to an html file.

Now there are a couple of things to note. You can edit various parts of the graph, such as its size and label text size, but depending on your application that might not be a good idea. If you need to account for smaller screens, I think it is best to stick with the defaults, as they adjust to the screen in use. For the size of the graph, if you are embedding it in a webpage using iframes you can set the size at that point. If you look at my linked op-ed you can see I make the funnel chart taller than it is wide — that is through the iframe specs.

Funnel Chart over Time

Ok, now onto the fun stuff. So we have a funnel chart for one year, but I have homicide data from 1965 through 2015. Can we examine those over time? Plotly has an easy-to-use additional aesthetic for ggplot graphs, named frame, that allows you to add a slider to the interactive chart for animation. The additional argument ids links one object over time, a la the Hans Rosling bubble chart. Here is a link to the interactive version, and below is a screenshot:

############################################################
#Making the funnel chart where you can select the year
py <- ggplot(data = FunnRates) + geom_point(aes(x=Population, y=HomicideRate, text=NiceLab, frame=Year,ids=ORI), pch=21) +
      geom_line(aes(x=Population,y=LowLoc99,frame=Year)) + geom_line(aes(x=Population,y=HighLoc99,frame=Year)) + 
      labs(title = paste0("Homicide Rates per 100,000")) + 
      scale_x_log10("Population", limits=c(10000,10000000), breaks=c(10^4,10^5,10^6,10^7), labels=comma) + 
      scale_y_continuous("Homicide Rate", breaks=seq(0,110,10), limits=c(0,110)) + 
      theme_bw() #+ theme(text = element_text(size=20), axis.title.y=element_text(margin=margin(0,10,0,0)))

ply <- ggplotly(py, tooltip = c("text")) %>% animation_opts(0, redraw=FALSE)
ply$x$data[[2]]$hoverinfo <- "none"
ply$x$data[[3]]$hoverinfo <- "none"
saveWidget(as_widget(ply), "FunnelChart_YearSelection.html", selfcontained=FALSE)
############################################################

The way I created the data it does not make sense to do a smooth animation for the funnel line, so this just flashes to each new year (via the animation_opts spec). (I could prepare the data so it would look nicer in an animation, but I will wait for someone to pick up the op-ed before I bother too much more with this.) But the slider still gives you the ability to pick whichever year you want.

Fan Chart Just One City

Next we are onto the fan charts for each individual city with the prediction intervals. Again you can just create this simple chart in ggplot, and then use plotly to make a version with tooltips. Here is a link to an interactive version, and below is a screenshot.

###################################################
#Making the fan graph for New Orleans
titleLab <- unique(IntRates[,c("ORI","NiceLab","AgencyName","State")])
p2 <- ggplot(data=IntRates[IntRates$ORI == "LANPD00",], aes(x=Year, y=HomRate)) + 
     geom_ribbon(aes(ymin=LowB, ymax=HighB), alpha=0.2) +
     geom_ribbon(aes(ymin=LagLow25, ymax=LagHigh25), alpha=0.5) +
     geom_point(shape=21, color="white", fill="red", size=2) +
     labs(x = "Year", y="Homicide Rate per 100,000") +
     #scale_x_continuous(breaks=seq(1960,2015,by=5)) + 
     ggtitle(paste0("Prediction Intervals for ",titleLab[titleLab$ORI == "LANPD00",c("NiceLab")])) +
     theme_bw() #+ theme(text = element_text(size=20), axis.title.y=element_text(margin=margin(0,10,0,0)))
#p2
pl2 <- ggplotly(p2, tooltip = c("Year","HomRate"), dynamicTicks=TRUE)
pl2$x$data[[1]]$hoverinfo <- "none"
pl2$x$data[[2]]$hoverinfo <- "none"
pl2 <- pl2 %>% layout(margin = list(l = 100, b = 65))
#pl2
saveWidget(as_widget(pl2), "FanChart_NewOrleans.html", selfcontained=FALSE)
###################################################

Note that when you save the widget with selfcontained=FALSE, it places several parts of the data in separate folders. I always presumed this was more efficient than making one huge html file, but I don’t know for sure.

Fan Chart with Dropdown Selection

Unfortunately the frame type animation does not make as much sense here. It would be hard for someone to find a particular city of interest in that slider (as a note, the slider can take nominal data, so if I only had a few cities it would work out ok, but with a few hundred it will not). So feature request if anyone from plotly is listening — please add a dropdown type option for ggplot graphs! In the meantime there is an alternative using a traditional plot_ly type chart. Here is that interactive fan chart with a police agency dropdown, and below is a screenshot.

###################################################
#Making the fan graph where you can select the city of interest
#Need to have a dropdown for the city

titleLab <- unique(IntRates[,c("ORI","NiceLab","State")])
nORI <- length(titleLab[,1])
choiceP <- vector("list",nORI)
for (i in 1:nORI){
choiceP[[i]] <- list(method="restyle", args=list("transforms[0].value", unique(IntRates$NiceLab)[i]), label=titleLab[i,c("NiceLab")])
}

trans <- list(list(type='filter',target=~NiceLab, operation="=", value=unique(IntRates$NiceLab)[1]))
#<br> creates the line breaks in the plotly tooltip text
textLab <- ~paste("Homicide Rate:",HomRate,'<br>Year:',Year,'<br>Homicides:',Homicide,
                  '<br>Population:',Pop1,'<br>Agency Name:',NiceLab)

#Lets try with the default plotly
#See https://community.plot.ly/t/need-help-on-using-dropdown-to-filter/6596
ply4 <- IntRates %>% 
        plot_ly(x= ~Year,y= ~HighB, type='scatter', mode='lines', line=list(color='transparent'), showlegend=FALSE, name="90%", hoverinfo="none", transforms=trans) %>%
        add_trace(y=~LowB,  type='scatter', mode='lines', line=list(color='transparent'), showlegend=FALSE, name='10%', hoverinfo="none", transforms=trans,
          fill = 'tonexty', fillcolor='rgba(105,105,105,0.3)') %>%
        add_trace(x=~Year, y=~HomRate, type='scatter', mode='markers', marker = list(size=10, color = 'rgba(255, 182, 193, .9)', line = list(color = 'rgba(152, 0, 0, .8)', width = 1)),
          hoverinfo='text', text=textLab, transforms=trans) %>%
        layout(title = "Homicide Rates and 80% Prediction Intervals by Police Department",
          xaxis = list(title="Year"),
          yaxis = list(title="Homicide Rate per 100,000"),
          updatemenus=list(list(type='dropdown',active=0,buttons=choiceP)))
                               
saveWidget(as_widget(ply4), "FanChart_Dropdown.html", selfcontained=FALSE)
###################################################

So in short, plotly makes it super easy to make interactive graphs with tooltips. As a long term goal I would like to make a visual supplement to the traditional UCR report (I find the complaints about which tables to include miss the point — there are much better ways to show the information than worrying about the specific tables). So if you would like to work on that with me, feel free to get in touch!

 

Geocoding with census data and the Census API

For my online GIS class I have a tutorial on creating an address locator using street centerline data in ArcGIS. Eventually I would like to put all of my class online, but for now I am just sharing that one, as I’ve forwarded it a lot recently.

That tutorial used local street centerline data in Dallas that you can download from Dallas’s open data site. It also gives directions on how to use an online ESRI geocoding service — which Dallas has. But what if those are not an option? A student recently wanted to geocode data from San Antonio, and the only street data file they publicly provide lacks the beginning and ending street number.

That data is insufficient to create an address locator. The road data you can download from the census’s web interface also lacks this information. But you can download street centerline data with beginning and ending addresses from the census FTP site. For example, here is the url that contains the streets with the address features. To use that you just have to figure out the state and county you are interested in and download those files. The census even has ESRI address locators already made for you, using 2012 data, at the state level. Again you just need to figure out your state’s number and download it.

Once you download the data with the begin and ending street numbers you can follow along with that tutorial the same as the public data.

Previously I’ve written about using the Google geocoding API. If you just have crime data from one jurisdiction, it is simple to make a geocoder for just that locality. But if you have data for many cities (say if you were geocoding home addresses) this can be more difficult. An alternative online API to Google that does not have daily limits is the Census Geocoding API.

Here is a simple example in R of calling the census API and geocoding a list of addresses.

library(httr)
library(jsonlite)

get_CensusAdd <- function(street,city,state,zip,benchmark=4){
    base <- "https://geocoding.geo.census.gov/geocoder/locations/address?"
    soup <- GET(url=base,query=list(street=street,city=city,state=state,zip=zip,format='json',benchmark=benchmark))
    dat <- fromJSON(content(soup,as='text'), simplifyVector=TRUE)
    D_dat <- dat$result$addressMatches
    if (length(D_dat) > 1){
    return(c(D_dat['matchedAddress'],D_dat['coordinates'][[1]])) #error will just return null, x[1] is lon, x[2] is lat
    }
    else {return(c('',NA,NA))}
}

#now create function to loop over data frame and return set of addresses
geo_CensusTIGER <- function(street,city,state,zip,sleep=1,benchmark=4){
  #make empty data frame to hold the results
  l <- length(street)
  MyDat <- data.frame(matrix(nrow=l,ncol=3))
  names(MyDat) <- c("MatchedAdd","Lon","Lat")
  for (i in 1:l){
    x <- suppressMessages(get_CensusAdd(street=street[i],city=city[i],state=state[i],zip=zip[i],benchmark=benchmark))
    if (length(x) > 0){
        MyDat[i,1] <- x[1]
        MyDat[i,2] <- x[2]
        MyDat[i,3] <- x[3]
    }
    Sys.sleep(sleep)
  }
  MyDat$street <- street
  MyDat$city <- city
  MyDat$zip <- zip
  MyDat$state <- state
  return(MyDat)
}

## Arbitrary dataframe for an exercise
AddList <- data.frame(
  IdNum = c(1,2,3,4,5),
  Address = c("450 W Harwood Rd", "2878 Fake St", "2775 N Collin St", "2775 N Collins St", "Lakewood Blvd and W Shore Dr"),
  City = c("Hurst", "Richardson", "Arlington", "Arlington", "Dallas"),
  State = c("TX", "TX", "TX", "TX", "TX")
)

test <- geo_CensusTIGER(street=AddList$Address,city=AddList$City,state=AddList$State,zip=rep('',5))

If you check out the results, you will see that this API does not appear to do fuzzy matching. 2775 N Collin St failed, whereas 2775 N Collins St returned a match. You can also see that it will return an intersection, but in my tests "/" did not work as a separator (so in R you can simply use gsub to replace the different intersection separators with "and"). I haven’t experimented with it too much, so let me know if you have any other insight into this API.
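
For example, something along these lines before sending the addresses to the API (just a sketch, the exact patterns to swap out will depend on what shows up in your own data):

#Sketch: standardize intersection separators to "and" before geocoding
AddList$Address <- gsub("/", " and ", AddList$Address, fixed=TRUE)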

In another post I will follow up with a python function for the Census geocoding API, as well as the Nominatim online geocoding API, which you can use for addresses outside of the United States.

Testing the equality of coefficients – Same Independent, Different Dependent variables

As promised earlier, here is one example of testing coefficient equalities in SPSS, Stata, and R.

Here we have different dependent variables, but the same independent variables. This is taken from Dallas survey data (original data link, survey instrument link), which asked about fear of crime and split the questions between fear of property victimization and fear of violent victimization. I want to see if the effect of income is the same for the two. People in more poverty tend to be at higher risk of victimization, but you may also expect people with fewer items to steal to be less worried. Here I also control for the race and the age of the respondent.

The dataset has missing data, so I illustrate how to select out a complete case analysis, and then I estimate the model. The fear of crime variables are coded as Likert items on a scale of 1-5 (higher values mean the respondent feels more safe), but I predict them using linear regression (see the Stata code at the end though for combining ordinal logistic equations using suest). Race is of course nominal, and income and age are binned as well, but I treat the income bins as a linear effect. I pasted the codebook for all of the items at the end of the post.

These models with multiple dependent variables go by different names: economists call them seemingly unrelated regressions, psychologists will often just call them multivariate models, and those familiar with structural equation modeling can get the same results by allowing residual covariances between the two outcomes — they all result in the same coefficient estimates in the end.

SPSS

In SPSS we can use the GLM procedure to estimate the model. Simultaneously we can specify particular contrasts to test whether the income coefficient is different for the two outcomes.

*Grab the online data.
SPSSINC GETURI DATA URI="https://dl.dropbox.com/s/r98nnidl5rnq5ni/MissingData_DallasSurv16.sav?dl=0" FILETYPE=SAV DATASET=MissData.

*Conducting complete case analysis.
COUNT MisComplete = Safety_Violent Safety_Prop Gender Race Income Age (MISSING).
COMPUTE CompleteCase = (MisComplete = 0).
FILTER BY CompleteCase.


*This treats the different income categories as a continuous variable.
*Can use GLM to estimate seemingly unrelated regression in SPSS and test.
*equality of the two coefficients.
GLM Safety_Violent Safety_Prop BY Race Age WITH Income
  /DESIGN=Income Race Age
  /PRINT PARAMETER
  /LMATRIX Income 1
  /MMATRIX ALL 1 -1.

FILTER OFF.  

In the output you can see the coefficient estimates for the two equations. The income effect for violent crime is 0.168 (0.023) and for property crime is 0.114 (0.022).

And then you get a separate table for the contrast estimates.

You can see that the contrast estimate, 0.054, equals 0.168 – 0.114. The standard error in this output (0.016) takes into account the covariance between the two estimates. Here you would reject the null that the effects are equal across the two equations, and the effect of income is larger for violent crime. Because higher values on these Likert scales mean a person feels more safe, this is evidence that, relative to violent victimization, those with higher incomes are more fearful of property victimization, controlling for age and race.

Unfortunately these matrix contrasts are not available for all of the different types of regression models in SPSS. You may ask whether you can fit two separate regressions and do the same test. The answer is you can, but that assumes the two models are independent — it is typically more efficient to estimate them at once, and here it lets the software handle the Wald test instead of you constructing it yourself.

R

As I stated previously, seemingly unrelated regression is another name for these multivariate models. So we can use the R library systemfit to estimate our seemingly unrelated regression model, and then use the multcomp library to test the coefficient contrast. This does not result in the exact same coefficients as SPSS, but devilishly close. You can download the csv file of the data here.

library(systemfit) #for seemingly unrelated regression
library(multcomp)  #for hypothesis tests of models coefficients

#read in CSV file
SurvData <- read.csv(file="MissingData_DallasSurvey.csv",header=TRUE)
names(SurvData)[1] <- "Safety_Violent" #name gets messed up because of BOM

#Need to recode the missing values in R, use NA
NineMis <- c("Safety_Violent","Safety_Prop","Race","Income","Age")
#summary(SurvData[,NineMis])
for (i in NineMis){
  SurvData[SurvData[,i]==9,i] <- NA
}

#Making a complete case dataset
SurvComplete <- SurvData[complete.cases(SurvData),NineMis]
#Now changing race and age to factor variables, keep income as linear
SurvComplete$Race <- factor(SurvComplete$Race, levels=c(1,2,3,4), labels=c("Black","White","Hispanic","Other"))
SurvComplete$Age <- factor(SurvComplete$Age, levels=1:5, labels=c("18-24","25-34","35-44","45-54","55+"))
summary(SurvComplete)

#now fitting seemingly unrelated regression
viol <- Safety_Violent ~ Income + Race + Age
prop <- Safety_Prop ~ Income + Race + Age
fitsur <- systemfit(list(violreg = viol, propreg= prop), data=SurvComplete, method="SUR")
summary(fitsur)

#testing whether income effect is equivalent for both models
viol_more_prop <- glht(fitsur,linfct = c("violreg_Income - propreg_Income = 0"))
summary(viol_more_prop) 

Here is a screenshot of the results then:

This is also the same as estimating a structural equation model in which the residuals for the two regressions are allowed to covary. We can do that in R with the lavaan library.

library(lavaan)

#for this need to convert factors into dummy variables for lavaan
DumVars <- data.frame(model.matrix(~Race+Age-1,data=SurvComplete))
names(DumVars) <- c("Black","White","Hispanic","Other","Age2","Age3","Age4","Age5")

SurvComplete <- cbind(SurvComplete,DumVars)

model <- '
    #regressions
     Safety_Prop    ~ Income + Black + Hispanic + Other + Age2 + Age3 + Age4 + Age5
     Safety_Violent ~ Income + Black + Hispanic + Other + Age2 + Age3 + Age4 + Age5
    #residual covariances
     Safety_Violent ~~ Safety_Prop
     Safety_Violent ~~ Safety_Violent
     Safety_Prop ~~ Safety_Prop
'

fit <- sem(model, data=SurvComplete)
summary(fit)

I’m not sure offhand though if there is an easy way to test the coefficient differences with a lavaan object, but we can do it manually by grabbing the variance and the covariances. You can then see that the differences and the standard errors are equal to the prior output provided by the glht function in multcomp.

#Grab the coefficients I want, and test the difference
PCov <- inspect(fit,what="vcov")
PEst <- inspect(fit,what="list")
Diff <- PEst[9,'est'] - PEst[1,'est']
SE <- sqrt( PEst[1,'se']^2 + PEst[9,'se']^2 - 2*PCov[9,1] )
Diff;SE
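
And if you want the corresponding z statistic and two-tailed p-value from that difference and standard error (which should line up with the glht output above):

#z statistic and two-tailed p-value for the difference
z <- Diff/SE
z; 2*(1 - pnorm(abs(z)))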

Stata

In Stata we can replicate the same analyses. Here is some code to replicate the prior results using Stata’s postestimation commands (additional examples using postestimation commands here). Again, you can download the csv file used here. The results are exactly the same as the R results.

*Load in the csv file
import delimited MissingData_DallasSurvey.csv, clear

*BOM problem again
rename ïsafety_violent safety_violent

*we need to specify the missing data fields.
*for Stata, set missing data to ".", not the named missing value types.
foreach i of varlist safety_violent-ownhome {
    tab `i'
}

*dont specify district
mvdecode safety_violent-race income-age ownhome, mv(9=.)
mvdecode yearsdallas, mv(999=.)

*making a variable to identify the number of missing observations
egen miscomplete = rmiss(safety_violent safety_prop race income age)
tab miscomplete
*even though any individual question is small, in total it is around 20% of the cases

*lets conduct a complete case analysis
preserve 
keep if miscomplete==0

*Now can estimate multivariate regression, same as GLM in SPSS
mvreg safety_violent safety_prop = income i.race i.age

*test income coefficient is equal across the two equations
lincom _b[safety_violent:income] - _b[safety_prop:income]

*same results as seemingly unrelated regression
sureg (safety_violent income i.race i.age)(safety_prop income i.race i.age)

*To use lincom it is the exact same code as with mvreg
lincom _b[safety_violent:income] - _b[safety_prop:income]

*with sem model
tabulate race, generate(r)
tabulate age, generate(a)
sem (safety_violent <- income r2 r3 r4 a2 a3 a4 a5)(safety_prop <- income r2 r3 r4 a2 a3 a4 a5), cov(e.safety_violent*e.safety_prop) 

*can use the same as mvreg and sureg
lincom _b[safety_violent:income] - _b[safety_prop:income]

You will notice it is the exact same post-estimation lincom command to test the coefficient equality across all three models. (Stata makes this the easiest of the three programs IMO.)

Stata also allows us to combine different generalized linear models in a seemingly unrelated estimation. Here I treat the outcomes as ordinal, and then combine the two models using suest.

*Combining generalized linear models with suest
ologit safety_violent income i.race i.age
est store viol

ologit safety_prop income i.race i.age
est store prop

suest viol prop

*lincom again!
lincom _b[viol_safety_violent:income] - _b[prop_safety_prop:income]

An application in spatial criminology is estimating the effect of something on different crime types. If you are predicting the number of crimes in a spatial area, you might fit separate Poisson regression models for assaults and robberies — this is one way to estimate those models jointly. Cory Haberman and Jerry Ratcliffe have an application of this as well, estimating effects for different crime types at different times of day – e.g. the effect of bars in the afternoon versus the effect of bars at nighttime.

Codebook

Here is the codebook for each of the variables in the database.

Safety_Violent  
    1   Very Unsafe
    2   Unsafe
    3   Neither Safe or Unsafe
    4   Safe
    5   Very Safe
    9   Do not know or Missing
Safety_Prop 
    1   Very Unsafe
    2   Unsafe
    3   Neither Safe or Unsafe
    4   Safe
    5   Very Safe
    9   Do not know or Missing
Gender  
    1   Male
    2   Female
    9   Missing
Race    
    1   Black
    2   White
    3   Hispanic
    4   Other
    9   Missing
Income  
    1   Less than 25k
    2   25k to 50k
    3   50k to 75k
    4   75k to 100k
    5   over 100k
    9   Missing
Edu 
    1   Less than High School
    2   High School
    3   Some above High School
    9   Missing
Age 
    1   18-24
    2   25-34
    3   35-44
    4   45-54
    5   55+
    9   Missing
OwnHome 
    1   Own
    2   Rent
    9   Missing
YearsDallas
    999 Missing

Identifying near repeat crime strings in R or Python

People in criminology should be familiar with repeats or near-repeats for crimes such as robbery, burglaries, or shootings. An additional neat application of this idea though is to pull out strings of incidents that are within particular distance and time thresholds. See this example analysis by Haberman and Ratcliffe, The Predictive Policing Challenges of Near Repeat Armed Street Robberies. This is particularly useful to an analyst interested in crime linkage — to see if those particular strings of incidents are likely to be committed by the same offender.

Here I will show how to pluck out those near-repeat strings in R or Python. The general idea is to transform the incidents into a network, where two incidents are connected only if they meet the distance and time requirements. Then you can identify the connected components of the graph, and those are your strings of near-repeat events.

To follow along, here is the data and the code used in the analysis. I will be showing this on an example set of thefts from motor vehicles (aka burglaries from motor vehicles) in Dallas in 2015. In the end I take two different approaches to this problem — in R the solution will only work for smaller datasets (say n~5000 or less), but the python code should scale to much larger datasets.

Near-repeat strings in R

The approach I take in R does the steps as follows:

  1. compute the distance matrix for the spatial coordinates
  2. convert this matrix to a set of 0’s and 1’s, with 1’s where the distance is below the user specified distance threshold (call this matrix S)
  3. compute the distance matrix for the times
  4. convert this matrix to a set of 0’s and 1’s, with 1’s where the time difference is below the user specified time threshold (call this matrix T)
  5. use element-wise multiplication on the S and T matrices, call the result A, then set the diagonal of A to zero
  6. A is now an adjacency matrix, which can be converted into a network
  7. extract the connected components of that network

So here is an example of reading in the thefts from motor vehicle data, and defining my function, NearStrings, to grab the strings of incidents. Note you need to have the igraph R library installed for this code to work.

library(igraph)

MyDir <- "C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\SourceNearRepeats"
setwd(MyDir)

BMV <- read.csv(file="TheftFromMV.csv",header=TRUE)
summary(BMV)

#make a function
NearStrings <- function(data,id,x,y,time,DistThresh,TimeThresh){
    library(igraph) #need igraph to identify connected components
    MyData <- data
    SpatDist <- as.matrix(dist(MyData[,c(x,y)])) < DistThresh  #1's for if under distance
    TimeDist <-  as.matrix(dist(MyData[,time])) < TimeThresh #1's for if under time
    AdjMat <- SpatDist * TimeDist #checking for both under distance and under time
    diag(AdjMat) <- 0 #set the diagonal to zero
    row.names(AdjMat) <- MyData[,id] #these are used as labels in igraph
    colnames(AdjMat) <- MyData[,id] #ditto with row.names
    G <- graph_from_adjacency_matrix(AdjMat, mode="undirected") #mode should not matter
    CompInfo <- components(G) #assigning the connected components
    return(data.frame(CompId=CompInfo$membership,CompNum=CompInfo$csize[CompInfo$membership]))
}

So here is a quick example run on the first ten records. Note I have a field named DateInt in the csv, which is just the integer number of days since the first of the year. In R, if the dates are actual date objects you can submit them to the dist function as well.

#Quick example with the first ten records
BMVSub <- BMV[1:10,]
ExpStrings <- NearStrings(data=BMVSub,id='incidentnu',x='xcoordinat',y='ycoordinat',time='DateInt',DistThresh=30000,TimeThresh=3)
ExpStrings

So here we can see this prints out:

> ExpStrings
            CompId CompNum
000036-2015      1       3
000113-2015      2       4
000192-2015      2       4
000251-2015      1       3
000360-2015      2       4
000367-2015      3       1
000373-2015      4       2
000378-2015      4       2
000463-2015      2       4
000488-2015      1       3

The CompId field is a unique Id for every string of events. The CompNum field states how many events are within the string. So we have one string of events that contains 4 records in this subset.

Now this R function comes with a big caveat: it will not work on large datasets. I’d say you’re pushing it with 10,000 incidents. The issue is holding the distance matrices in memory. But if you can hold the matrices in memory this will still run quite fast. For 5,000 incidents it takes around ~15 seconds on my machine.
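
To get a sense of why, here is a quick back-of-the-envelope calculation for the memory one dense numeric matrix takes up (and NearStrings holds several matrices of that dimension at once):

#Rough memory in gigabytes for one dense n x n numeric matrix (8 bytes per cell)
n <- 10000
(n^2 * 8) / (1024^3) #~0.75 gigabytes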

#Second example alittle larger, with the first 5000 records
BMVSub2 <- BMV[1:5000,]
BigStrings <- NearStrings(data=BMVSub2,id='incidentnu',x='xcoordinat',y='ycoordinat',time='DateInt',DistThresh=1000,TimeThresh=3)

The rows of the returned data frame line up with the original dataset, so you can simply add those fields in and do subsequent analysis (such as exporting back into a mapping program and digging into the strings).

#Add them into the original dataset
BMVSub2$CompId <- BigStrings$CompId
BMVSub2$CompNum <- BigStrings$CompNum   

You can check out the number of chains of different sizes by using aggregate and table.

#Number of chains
table(aggregate(CompNum ~ CompId, data=BigStrings, FUN=max)$CompNum)

This prints out:

   1    2    3    4    5    6    7    9 
3814  405   77   27    3    1    1    1

So out of our first 5,000 incidents, using the distance threshold of 1,000 feet and the time threshold of 3 days, we have 3,814 isolates: thefts from vehicles with no other incidents nearby. We have 405 chains of 2 incidents, 77 chains of 3 incidents, etc. You can pull out the 9-incident chain like this, since there is only one chain that long:

#Look up the 9 incident
BMVSub2[BMVSub2$CompNum == 9,]  

Which prints out here:

> BMVSub2[BMVSub2$CompNum == 9,]
      incidentnu xcoordinat ycoordinat StartDate DateInt CompId CompNum
2094 043983-2015    2460500    7001459 2/25/2015      56   1842       9
2131 044632-2015    2460648    7000542 2/26/2015      57   1842       9
2156 045220-2015    2461162    7000079 2/27/2015      58   1842       9
2158 045382-2015    2460154    7000995 2/27/2015      58   1842       9
2210 046560-2015    2460985    7000089  3/1/2015      60   1842       9
2211 046566-2015    2460452    7001457  3/1/2015      60   1842       9
2260 047544-2015    2460154    7000995  3/2/2015      61   1842       9
2296 047904-2015    2460452    7001457  3/3/2015      62   1842       9
2337 048691-2015    2460794    7000298  3/4/2015      63   1842       9

Or you can look up a particular chain by its unique id. Here is an example of a 4-incident chain.

> #Looking up a particular incident chains
> BMVSub2[BMVSub2$CompId == 4321,]
      incidentnu xcoordinat ycoordinat StartDate DateInt CompId CompNum
4987 108182-2015    2510037    6969603 5/14/2015     134   4321       4
4988 108183-2015    2510037    6969603 5/14/2015     134   4321       4
4989 108184-2015    2510037    6969603 5/14/2015     134   4321       4
4993 108249-2015    2510037    6969603 5/14/2015     134   4321       4

Again, only use this function on smaller crime datasets.

Near-repeat strings in Python

Here I show how to go about a similar process in Python, but the algorithm does not calculate the whole distance matrix at once, so it can handle much larger datasets. An additional note is that I exploit the fact that the list is sorted by date. That means I do not have to calculate all pairwise distances – I only compare distances within a moving window under the time threshold – which makes it scale easily to much larger datasets.

So first I use the csv python library to read in the data and assign it to a list with a set of nested tuples. Also you will need the networkx library to extract the connected components later on.

import networkx as nx
import csv
import math

dir = r'C:\Users\axw161530\Dropbox\Documents\BLOG\SourceNearRepeats'

BMV_tup = []
with open(dir + r'\TheftFromMV.csv') as f:
    z = csv.reader(f)
    for row in z:
        BMV_tup.append(tuple(row))

The BMV_tup list includes the column headers, so I extract that row and then figure out where the elements I need (the XY coordinates, the unique Id’s, and the time column) are located in the nested tuples.

colnames = BMV_tup.pop(0)
print colnames
print BMV_tup[0:10]

xInd = colnames.index('xcoordinat')
yInd = colnames.index('ycoordinat')
dInd = colnames.index('DateInt')
IdInd = colnames.index('incidentnu')

Now the magic — here is my function to extract those near-repeat strings. Again, the list needs to be sorted by dates for this to work.

def NearStrings(CrimeData,idCol,xCol,yCol,tCol,DistThresh,TimeThresh):
    G = nx.Graph()
    n = len(CrimeData)
    for i in range(n):
        for j in range(i+1,n):
            if (float(CrimeData[j][tCol]) - float(CrimeData[i][tCol])) > TimeThresh:
                break
            else:
                xD = math.pow(float(CrimeData[j][xCol]) - float(CrimeData[i][xCol]),2)
                yD = math.pow(float(CrimeData[j][yCol]) - float(CrimeData[i][yCol]),2)
                d = math.sqrt(xD + yD)
                if d < DistThresh:
                    G.add_edge(CrimeData[j][idCol],CrimeData[i][idCol])
    comp = nx.connected_components(G)
    finList = []
    compId = 0
    for i in comp:
        compId += 1
        for j in i:
            finList.append((j,compId))
    return finList

We can then do the same test on the first ten records that we did in R.

print NearStrings(CrimeData=BMV_tup[0:10],idCol=IdInd,xCol=xInd,yCol=yInd,tCol=dInd,DistThresh=30000,TimeThresh=3)

And this subsequently prints out:

[('000378-2015', 1), ('000373-2015', 1), ('000113-2015', 2), ('000463-2015', 2), ('000192-2015', 2), ('000360-2015', 2), 
('000251-2015', 3), ('000488-2015', 3), ('000036-2015', 3)]

The component Id’s won’t be in the same order as in R, but you can see we have the same results. E.g. the string with three incidents contains the Id’s 000251, 000488, and 000036. Note that this approach does not return isolates — incidents which have no other incidents nearby in space and time.

Running this on the full dataset of over 14,000 incidents takes around 20 seconds on my machine.

BigResults = NearStrings(CrimeData=BMV_tup,idCol=IdInd,xCol=xInd,yCol=yInd,tCol=dInd,DistThresh=1000,TimeThresh=3)

And that should scale pretty well for really big cities and really big datasets. I will let someone who knows R better than me figure out workarounds to scale to bigger datasets in that language.

Using the exact reference distribution for small sample Benford tests

I recently came across another potential application related to my work on testing small samples for randomness in day-of-week patterns. Testing digit frequencies for Benford’s law basically works on the same principles as my day-of-week bins, so here I will show an example in R.

First, download my functions here and save them to your local machine. The only library dependency for this code to work is the partitions library, so install that. Now for the code, you can source my functions and load the partitions library. The second part of the code shows how to generate the null distribution for Benford’s digits — the idea is that lower digits will have a higher probability of occurring.

#Example using small sample tests for Benfords law
library(partitions)
#switch to wherever you downloaded the functions to your local machine
source("C:\\Users\\axw161530\\Dropbox\\PublicCode_Git\\ExactDist\\Exact_Dist.R")

f <- 1:9
p_fd <- log10(1 + (1/f)) #first digit probabilities

And so if you do cbind(f,p_fd) at the console it prints out:

      f       p_fd
 [1,] 1 0.30103000
 [2,] 2 0.17609126
 [3,] 3 0.12493874
 [4,] 4 0.09691001
 [5,] 5 0.07918125
 [6,] 6 0.06694679
 [7,] 7 0.05799195
 [8,] 8 0.05115252
 [9,] 9 0.04575749

So we expect over 30% of the first digits to be 1’s, just under 18% to be 2’s, etc. To show how we can use this for small samples, I take an example of fraudulent checks from Mark Nigrini’s book, Digital Analysis Using Benford’s Law (I can’t find a Google Books link to this older edition — I took the numbers from the 2000 one published by Global Audit Press).

#check data from Nigrini page 84
checks <- c(1927.48,
           27902.31,
           86241.90,
           72117.46,
           81321.75,
           97473.96,
           93249.11,
           89658.17,
           87776.89,
           92105.83,
           79949.16,
           87602.93,
           96879.27,
           91806.47,
           84991.67,
           90831.83,
           93766.67,
           88338.72,
           94639.49,
           83709.28,
           96412.21,
           88432.86,
           71552.16)

#extracting the first digits
fd <- substr(format(checks,trim=TRUE),1,1)
tot <- table(factor(fd, levels=paste(f)))

Now Nigrini says this is too small a sample to perform actual statistical tests, so he just looks at it at face value. If you print out the tot object you will see that we have mostly high digits in the series: three 7’s, nine 8’s, and nine 9’s.

> tot

1 2 3 4 5 6 7 8 9 
1 1 0 0 0 0 3 9 9 

Now, my work on small samples for day-of-week crime sprees showed that, given reasonably expected offender behavior, you only need as few as 8 crimes to have pretty good power to test for randomness in the series. So given that, I would expect that with this series of 23 checks it is not impossible to detect significant deviations from the null Benford’s distribution. But first we need to figure out whether I can actually generate the exact distribution for 23 digits in 9 bins in memory.

m <- length(tot)
n <- sum(tot)
choose(m+n-1,m-1)

Which prints out just under 8 million, [1] 7888725. So we should be able to hold that in memory.

So now comes the actual test, and as my comment says this takes a few minutes for R to figure out — so feel free to go get a coffee.

#Takes a few minutes
resG <- SmallSampTest(d=tot,p=p_fd,type="G")
resG

Here I use the likelihood ratio G test instead of the more usual Chi-Square test, as I found it always had more power in the day-of-the-week paper. From our printout of resG we subsequently get:

> resG
Small Sample Test Object 
Test Type is G 
Statistic is 73.4505062784174 
p-value is:  2.319579e-14  
Data are:  1 1 0 0 0 0 3 9 9 
Null probabilities are:  0.3 0.18 0.12 0.097 0.079 0.067 0.058 0.051 0.046 
Total permutations are:  7888725 

Since the p-value is incredibly small, we would reject the null that the first digit distribution of these checks follows Benford’s law. So on its face we can reject the null with this dataset, but it would take more investigation in general to recommend how many observations you would need in practice before you can reasonably use the small sample distribution. Those functions include code that lets you estimate the power given an alternative distribution, so for a quick and quite conservative check I look at the power if the alternative distribution were uniform instead of Benford’s with our 23 observations. The idea is that if people make up numbers uniformly instead of according to Benford’s law, what is the probability we would catch them with 23 observations? I label this as conservative because I doubt people even do a good job of making them uniform — most number fudging cases will be much more egregious I imagine.

#power under null of equal probability
p_alt <- rep(1/9,9)
PowUni <- PowAlt(SST=resG,p_alt=p_alt) #again takes a few minutes
PowUni

So based on that we get a power estimate of:

> PowUni
Power for Small Sample Test 
Test statistic is: G  
Power is: 0.5276224  
Null is: 0.3 0.18 0.12 0.097 0.079 0.067 0.058 0.051 0.046  
Alt is: 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11  
Alpha is: 0.05  
Number of Bins: 9  
Number of Observations: 23 

So the power is only 0.5 in this example. Since you want to aim for power of ~0.80 or higher, this shows that you are not likely to uncover more subtle patterns of manipulation with this few observations. The power would go up for more realistic deviations though — so I don’t think this idea is totally dead in the water.

If you are like me and find it annoying to wait a few minutes for the results, a quick and dirty way to make the test go faster is to collapse bins that have zero observations. Our first digits have zero observations in the 3, 4, 5 and 6 bins, so I collapse those and do the test with only 6 bins, which makes the results return in around ~10 seconds instead of a few minutes. Note that collapsing bins is better suited to the G or the Chi-Square test, because for the KS test or Kuiper’s test the order of the bins matters.

#Smaller subset
UpProb <- c(p_fd[c(1,2,7,8,9)],sum(p_fd[c(3,4,5,6)]))
ZeroAdd <- c(table(fd),0)

resSmall <- SmallSampTest(d=ZeroAdd,p=UpProb,type="G")
resSmall

When you do this for the G and the Chi-square tests you will get the same test statistics, but the p-value changes slightly; in this example it is a bit smaller (and still quite small).

> resSmall
Small Sample Test Object 
Test Type is G 
Statistic is 73.4505062784174 
p-value is:  1.527991e-15  
Data are:  1 1 3 9 9 0 
Null probabilities are:  0.3 0.18 0.058 0.051 0.046 0.37 
Total permutations are:  98280  

I can’t say for sure the behavior of the tests when collapsing categories, but I think it is reasonable offhand, especially if you have some substantive reason to collapse them before looking at the data.

In practice, the way I expect this would work is not just for testing one individual, but as a way to prioritize audits of many individuals. Say you had a large company and wanted to check the invoices for 1,000’s of managers, but each manager may only have 20 some invoices. You would then compute this test for each manager and subsequently rank them by the p-values (or do some correction like the false discovery rate) for further scrutiny. That takes a bit more work to code up efficiently than what I have here though. For example, I may pre-compute the exact CDF’s for each test statistic, aggregate them a little so they fit in memory, and then check each test against the relevant CDF.

But feel free to bug me if you want to use this idea in your own work and want some help implementing it.


For some additional examples, here is some code to get the second digit expected probabilities:

#second digit probabilities
s <- 0:9
x <- expand.grid(f*10,s)
x$end <- log10(1 + (1/(x[,1]+x[,2])))
p_sd <- aggregate(end ~ Var2, data=x, sum)
p_sd

which are expected to be much more uniform than the first digits:

> p_sd
   Var2        end
1     0 0.11967927
2     1 0.11389010
3     2 0.10882150
4     3 0.10432956
5     4 0.10030820
6     5 0.09667724
7     6 0.09337474
8     7 0.09035199
9     8 0.08757005
10    9 0.08499735

And we can subsequently also test the check sample for deviation from Benford’s law in the second digits. Here I show an example of using the exact distribution for the Kolmogorov-Smirnov test. (There may be reasonable arguments for using Kuiper’s test with digits as well, but for both the KS and the Kuiper’s V you need to make sure the bins are in the correct order to conduct those tests.) To speed up the computation I only test the first 18 checks.

#second digits test for sample checks, but with smaller subset
sd <- substr(format(checks[1:18],trim=TRUE),2,2)
tot_sd <- table(factor(sd, levels=paste(s)))
resK_2 <- SmallSampTest(d=tot_sd,p=p_sd,type="KS")
resK_2

And the test results are:

> resK_2
Small Sample Test Object 
Test Type is KS 
Statistic is 0.222222222222222 
p-value is:  0.7603276  
Data are:  1 2 2 2 1 0 2 4 1 3 
Null probabilities are:  0.12 0.11 0.11 0.1 0.1 0.097 0.093 0.09 0.088 0.085 
Total permutations are:  4686825 

So for the second digits of our checks we would fail to reject the null that the data follow Benford’s distribution. Testing the full 23 checks would require over 28 million permutations – I will update based on how long that takes.
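
As a quick check of that count, here is the same choose(m+n-1, m-1) calculation from before, now with the 10 second-digit bins and all 23 checks:

#Number of permutations for 23 observations across the 10 second digit bins
choose(10 + 23 - 1, 10 - 1) #28,048,800, just over 28 million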

The final example I will give is with another small example dataset — the last 12 purchases on my credit card.

#My last 12 purchases on credit card
purch <- c( 72.00,
           328.36,
            11.57,
            90.80,
            21.47,
             7.31,
             9.99,
             2.78,
            10.17,
             2.96,
            27.92,
            14.49)
#artificial numbers, 72.00 is parking at DFW, 9.99 is Netflix   

In reality, digits can deviate from Benford’s law for reasons that have nothing to do with fraud — such as artificial constraints in the system. But when I test this set of values, I fail to reject the null that they follow Benford’s distribution:

fdP <- substr(format(purch,trim=TRUE),1,1)
totP <- table(factor(fdP, levels=paste(f)))

resG_P <- SmallSampTest(d=totP,p=p_fd,type="G")
resG_P

So for a quick test the first digits of my credit card purchases do approximately follow Benford’s law.

> resG_P
Small Sample Test Object 
Test Type is G 
Statistic is 12.5740089945434 
p-value is:  0.1469451  
Data are:  3 4 1 0 0 0 2 0 2 
Null probabilities are:  0.3 0.18 0.12 0.097 0.079 0.067 0.058 0.051 0.046 
Total permutations are:  125970  

Some inverse distance weighting hacks – using R and spatstat

For a recent project I was mapping survey responses about attitudes towards the police, and I wanted to make a map of those responses. The typical default to accomplish this is inverse distance weighting. For those familiar with hot spot maps of crime, this is similar in that it produces a smooth isarithmic map, but instead of a density it predicts values. For my project I wanted to explore two different things: 1) estimating the variance of the IDW estimate, and 2) exploring different weighting schemes besides the default inverse distance. The R code for my functions and the data for the analysis can be downloaded here.


What is inverse distance weighting?

Since this isn’t typical fodder for social scientists, I will present a simple example to illustrate.

Imagine you are a farmer and want to know where to plant corn vs. soy beans, and are using the nitrogen content of the soil to determine that. You take various samples from a field and measure the nitrogen content, but you want predictions for the areas you did not sample. So say we have four measures at various points in the field.

Nit     X   Y
1.2     0   0
2.1     0   5
2.6    10   2
1.5     6   5

From this let’s say we want to estimate the average nitrogen content at the center of the field, the point (5,5). Inverse distance weighting is just as the name says: the weight each sample gets in estimating the nitrogen content at the center is based on the distance between that sample point and the center. Most often people use the inverse of the distance squared as the weight. So from this we have the following weights.

Nit     X   Y Weight
1.2     0   0   1/50
2.1     0   5   1/25
2.6    10   2   1/34
1.5     6   5   1/ 1

You can see the last row is the closest point, so it gets the largest weight. The weighted average of nitrogen for the (5,5) point ends up being ~1.55.
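
As a quick check on that number, here is a minimal sketch of the weighted average for the (5,5) point, using the inverse squared distance weights from the table above.

#weighted average of nitrogen at the point (5,5)
nit <- c(1.2, 2.1, 2.6, 1.5)
sx  <- c(0, 0, 10, 6)
sy  <- c(0, 5, 2, 5)
w <- 1 / ((sx - 5)^2 + (sy - 5)^2)  #inverse squared distance weights, 1/50 1/25 1/34 1/1
sum(w * nit) / sum(w)               #~1.55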

For inverse distance weighted maps, one then makes a series of weighted estimates over a regular grid covering the study space. So not just an estimate at (5,5), but also at (5,4), (5,3), (5,2), etc. And then you have a regular grid of values you can plot.
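
Continuing the toy sketch above (reusing nit, sx, and sy), here is what estimating over a made-up regular grid looks like. The grid points are offset to half-integers just so no grid point lands exactly on a sample point.

#IDW estimates over a regular grid of points covering the toy field
grd <- expand.grid(gx = seq(0.5, 9.5, by = 1), gy = seq(0.5, 4.5, by = 1))
grd$est <- apply(grd, 1, function(g){
  w <- 1 / ((sx - g["gx"])^2 + (sy - g["gy"])^2)
  sum(w * nit) / sum(w)
})
head(grd)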


Example – Street Clean Scores in LA

An ok example to demonstrate this is an LA database rating streets based on their cleanliness. Some might quibble that it only makes sense to estimate street cleanliness values on the streets themselves, but I think it is ok for exploratory data analysis. Just visualizing the streets directly is very hard given their small width and irregularity.

So to follow along, first I load all the libraries I will be using, then set my working directory, and finally import my updated inverse distance weighted hacked functions I will be using.

library(spatstat)
library(inline)
library(rgdal)
library(maptools)
library(ncf)

MyDir <- "C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\IDW_Variance_Bisquare\\ExampleAnalysis"
setwd(MyDir)

#My updated idw functions
source("IDW_Var_Functions.R")

Next we need to create a point pattern object spatstat can work with, so we import our street scores, which contain an X and Y coordinate for the midpoint of each street segment, as well as the boundary of the city of Los Angeles. Then we can create a marked point pattern. For reference, the street scores can range from 0 (clean) to a max of 3 (dirty).

CleanStreets <- read.csv("StreetScores.csv",header=TRUE)
summary(CleanStreets)
BorderLA <- readOGR("CityBoundary.shp", layer="CityBoundary")

#create Spatstat object and window
LA_Win <- as.owin(BorderLA)
LA_StreetPP <- ppp(CleanStreets$XMidPoint,CleanStreets$YMidPoint, window=LA_Win, marks=CleanStreets$StreetScor)

Now we can estimate a smooth inverse distance weighted map by calling my new function, idw2. This returns both the original weighted mean (equivalent to the original spatstat idw function) and the variance. Here I plot them side by side (see the end of the post for how I calculate the variance). The weighted mean is on the left, and the variance estimate is on the right. In the returned object the rat image is the weighted mean, and the var image is the weighted variance.

#Typical inverse distance weighted estimate
idw_res <- idw2(LA_StreetPP) #only takes a minute
par(mfrow=c(1,2))
plot(idw_res$rat) #this is the weighted mean
plot(idw_res$var) #this is the weighted variance

So contrary to expectations, this does not provide a very smooth map. It is quite rough. This is partially because social science data is not going to be as regular as natural science measurements. In spatial stats jargon, street-to-street measures will have a large nugget – a clean street can be right next to a dirty one.

Here the default is using inverse distance squared – what if we just use inverse distance though?

#Inverse distance (linear)
idw_Lin <- idw2(LA_StreetPP, power=1)
plot(idw_Lin$rat)
plot(idw_Lin$var)

This is smoothed out a little more. There is essentially one dirty spot in the central eastern part of the city (I don’t know anything about LA neighborhoods). Compared to the first set of maps, the dirty streets in the northern mass of the city are basically entirely smoothed out, whereas before you could at least see little spikes.

So I was wondering whether there could be better weights to smooth out the data a little more. One I have used in a few recent projects is the bisquare kernel, which I was introduced to by the geographically weighted regression folks. The bisquare kernel weight equals [1 - (d/b)^2]^2 when d < b, and zero otherwise. Here d is the distance, and b is a user chosen distance threshold. We can make a plot to illustrate the difference in weight functions, here using a bisquare kernel distance of 2000 meters.

#example weight functions over 3000 meters
dist <- 1:3000
w_idw1 <- 1/dist       #inverse distance
w_idw2 <- 1/(dist^2)   #inverse distance squared (renamed so it does not mask the idw2 function sourced above)
b <- 2000
bisq <- ifelse(dist < b, ( 1 - (dist/b)^2 )^2, 0)  #bisquare kernel weight
plot(dist,w_idw1,type='l')
lines(dist,w_idw2,col='red')
lines(dist,bisq,col='blue')

Here you can see both of the inverse distance weighted lines trail to zero almost immediately, whereas the bisquare kernel trails off much more slowly. So let’s check out our maps using a bisquare kernel with the distance threshold set to 2000 meters. The biSqW function is equivalent to the original spatstat idw function, but uses the bisquare kernel and returns the variance estimate as well. You just need to pass it a distance threshold for the b_dist parameter.

#BiSquare weighting, 2000 meter distance
LA_bS_w <- biSqW(LA_StreetPP, b_dist=2000)
plot(LA_bS_w$rat)
plot(LA_bS_w$var)

Here we get a map that looks more like a typical hot spot kernel density map. We can see some of the broader trends in the northern part of the city, and even see a really dirty hot spot I did not previously notice in the northeastern peninsula.

The 2,000 meter distance threshold was just ad-hoc though. How large or small should it be? A quick check of the spatial correlogram is one way to make the choice slightly more objective. Here I use the correlog function in the ncf package to estimate this. I subsample the data first (I presume correlog has a call to dist somewhere, so the full dataset would be too much to handle).

#correlogram on a random sample, the full data is too big
subSamp <- CleanStreets[sample(nrow(CleanStreets), 3000), ]
fit <- correlog(x=subSamp$XMidPoint,y=subSamp$YMidPoint,z=subSamp$StreetScor, increment=100, resamp=0, quiet=TRUE)
plot(fit)

Here we can see points very near each other have a correlation of about 0.2, and this trails off to zero before 20 kilometers (the distances here are in meters). FYI, the correlation rising back up at very large distances often occurs for data that have broader spatial trends.
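
If you want something slightly more automatic than eyeballing the plot, my recollection is that the ncf correlog object stores an interpolated x-intercept (treat the field name as an assumption and check str(fit) if it errors out); that distance could then be fed in as the b_dist threshold.

#distance at which the correlogram is interpolated to cross zero
fit$x.intercept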

So let’s try out a bisquare kernel with a distance threshold of 10 kilometers.

#BiSquare weighting, 10000 meter distance
LA_bS_w <- biSqW(LA_StreetPP, b_dist=10000)
plot(LA_bS_w$rat)
plot(LA_bS_w$var)

That is now a bit oversmoothed. But it allows a nicer range of potential values, as opposed to simply sticking with the inverse distance weighting.


A few notes on the variance of IDW

So I hacked the idw function in the spatstat package to return the variance of the estimate as well as the actual weighted mean. This involved going into the C function, so I use the inline package to create my own version. Ditto for creating the maps using the bisquare weights instead of inverse distance weighting. To quickly see those functions, here is the R code.

Given some harassment on CrossValidated by Mark Stone, I also updated the algorithm to a more numerically safe one, both for the weighted mean and the weighted variance. Note though that the Wikipedia article uses a special definition for the variance. The correct Bessel correction for weighted data (in this case) is the sum of the weights (V1) minus the sum of the squared weights (V2) divided by V1, i.e. V1 - V2/V1. Here I just divide by V1, but that could easily be changed (I am not sure whether I need to worry about underflow in the sum of squares). I.e. change the line MAT(var, ix, iy, Ny) = m2 / sumw; to MAT(var, ix, iy, Ny) = m2 / (sumw - sumw2/sumw); in the various C calls.
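
For reference, here is a minimal two-pass R sketch of what the hacked C code computes for a single grid point (w_mean_var is just an illustrative helper, not a function in the posted code; the actual C code uses the more numerically safe one-pass updating scheme mentioned above). The correct = TRUE option applies the V1 - V2/V1 correction.

#weighted mean and variance for one grid point, given values x and weights w
w_mean_var <- function(x, w, correct = FALSE){
  V1 <- sum(w)
  V2 <- sum(w^2)
  m  <- sum(w * x) / V1          #inverse distance weighted mean
  m2 <- sum(w * (x - m)^2)       #weighted sum of squared deviations
  v  <- if (correct) m2 / (V1 - V2/V1) else m2 / V1
  c(mean = m, var = v)
}

#toy usage with the nitrogen example values and inverse squared distance weights
w_mean_var(x = c(1.2, 2.1, 2.6, 1.5), w = c(1/50, 1/25, 1/34, 1))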

Someone should also probably write in a check to prevent distances of zero, maybe by capping the weights so they are never above a certain value, although it is not obvious what the default cap should be. (If you have data on the unit square, weights above 1 would occur quite regularly, but for a large city like this projected in meters, capping the weight at 1 would be fine.)
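
A hypothetical guard along those lines might look like the following, with the cap value being the part that is hard to default.

#cap inverse distance weights so a zero distance does not blow up to Inf
d <- c(0, 30, 250, 1200)   #example distances in meters
w_max <- 1                 #per the note above, a cap of 1 is fine for meter projections
w <- pmin(1/d^2, w_max)
w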

In general these variance maps did not behave like I expected them to, either with this or other data. When using Bessel’s correction they tended to look even weirder. So I would need to explore more before I go and recommend them. Probably I should not waste more time on this though, and instead just fit an actual kriging model to produce the standard errors of the estimates.

Added code snippets page

I’ve written quite a few blog posts over the years, and it is getting to be hard for me to keep all of them orderly. So I recently created a page with a simple table of contents with code snippets broken down into categories. There are also a few extra SPSS macros in there that I have not written blog posts about.

Every now and then I see a broken link and update it, but I know there are more I am missing. So just send me an email if you find a link to some of my code that is currently broken.

The spatial clustering of hits-vs-misses in shootings using Ripley’s K

My article, Replicating Group-Based Trajectory Models of Crime at Micro-Places in Albany, NY was recently published online first at the Journal of Quantitative Criminology (here is a pre-print version, and you can get my JQC offprint for the next few weeks).

Part of the reason I blog is to show how to replicate some of my work. I have previously shown how to generate the group based trajectory models (1,2), and here I will illustrate how to replicate some of the point pattern analysis I conducted in that paper.

A regular task for an analyst is to evaluate spatial clustering. A technique I use in that article is to use Ripley’s K statistic to evaluate clustering between different types of events, or what spatial statistics jargon calls a marked point pattern. I figured I would illustrate how to do this on some example point patterns of shootings, so analysts could replicate for other situations. The way crime data is collected and geocoded makes generating the correct reference distributions for the tests different than if the points could occur anywhere in the study region.

The example I am going to use is shooting incidents that are marked by whether they hit the intended victim or missed. One theory in criminology is that murder is simply the extension of general violence, with the difference between aggravated assault and murder often being happenstance. One instance where this appears to be the case, from my observations, is shootings. It seems pure luck whether an individual gets hit or the bullets miss everyone. Here I will see if there appears to be any spatial patterning in shootings where a victim was hit. That is, we know that shootings themselves are clustered, but within those clusters of shootings, are the locations of hits-and-misses further clustered or are they random?

A hypothetical process that could produce clustering of hits is if some shootings are meant simply to scare individuals while others are targeted at people on the street. It could also arise from different tactics, like drive-bys versus shootings on foot – say one area has many shootings related to drug deals gone wrong, while another area has gang retaliation drive-by shootings. If hits are non-random, the police may consider different interventions in different places – probably focusing on locations where people are more likely to be injured in a shooting. If they are random though, there is nothing special about analyzing shootings with a person hit versus shootings that missed everyone – you might as well analyze and develop strategy for all shootings.

So to start, we will be using R and the spatstat library. To follow along, I made a set of fake shooting data in a csv file. So first load up R; I then change the working directory to where my csv file is located, and read the csv into an R data frame.

library(spatstat)

#change R directory to where your file is
MyDir <- "C:\\Users\\andrew.wheeler\\Dropbox\\Documents\\BLOG\\R_shootingVic\\R_Code"
setwd(MyDir)

#read in shooting data
ShootData <- read.csv("Fake_ShootingData.csv", header = TRUE)
head(ShootData)

The file has four fields: ID, X & Y coordinates, and a Vic column with 0’s and 1’s. Next, to conduct the spatial analysis, we are going to convert this data into an object that the spatstat library can work with. To do that we need to create a study window. Typically you would have the outline of the city, but for this analysis the window won’t matter, so here I make a window that is just slightly larger than the bounding box of the data.

#create ppp object, the window does not matter for the permutation test
StudyWin <- owin(xrange=c(min(ShootData$X)-0.01,max(ShootData$X)+0.01),yrange=c(min(ShootData$Y)-0.01,max(ShootData$Y)+0.01))
Shoot_ppp <- ppp(ShootData$X, ShootData$Y, window=StudyWin, marks=as.factor(ShootData$Vic), unitname = c("meter","meters"))

When Ripley’s K was originally developed, it was for points that could occur anywhere in the study region, such as the locations of different tree species. Crimes are a bit different though, in that there are some areas in any city where a crime basically cannot occur (such as the middle of a lake). Also, crimes are often simply geocoded to addresses and intersections, which further reduces the locations where the points can fall in a crime dataset. To calculate the sample statistic for Ripley’s K one does not need to account for this, but to tell whether the observed pattern is random one needs to simulate the point pattern under a realistic null hypothesis. There are different ways to do the simulation, but here the simulation keeps the shooting locations fixed and randomly assigns each shooting to be either a hit or a miss, keeping the marginal frequencies the same. That is, it basically just shuffles which events are counted as a victim being hit and which had no people hit. The Dixon article calls this the random relabelling approach.

Most of the spatstat functions can take a separate list of point patterns to use to simulate error bounds, so the function below takes the initial point pattern, generates the permutations, and stuffs them in a list. I set the seed so the analysis can be reproduced exactly.

#generate the simulation envelopes to use in the Cross function
MarkedPerms <- function(ppp, nlist) {
  require(spatstat)
  myppp_list <- c() #make empty list
  for (i in 1:nlist) {
    current_ppp <-  ppp(ppp$x, ppp$y,window=ppp$window,marks=sample(ppp$marks))  #making permutation      
    myppp_list[[i]] <- current_ppp                                               #appending perm to list
  }
  return(myppp_list)
}

#now making a set of simulated permutations
set.seed(10) #setting seed for reproducibility
MySimEvel <- MarkedPerms(ppp=Shoot_ppp,nlist=999)

Now we have all the ingredients to conduct the analysis. Here I call the cross K function and submit my set of simulated point patterns named MySimEvel. With only 100 points in the dataset it works pretty quickly, and then we can graph the Ripley’s K function. The grey bands are the simulated K statistics, and the black line is the observed statistic. We can see the observed is always within the simulated bands, so we conclude that conditional on shooting locations, there is no clustering of shootings with a victim versus those with no one hit. Not surprising, since I just simulated random data.

#Cross Ripleys K
CrossK_Shoot <- alltypes(Shoot_ppp, fun="Kcross", envelope=TRUE, simulate=MySimEvel)
plot(CrossK_Shoot$fns[[2]], main="Cross-K Shooting Victims vs. No Victims", xlab="Meters")

I conducted this same analysis with actual shooting data in three separate cities for which I have convenient access to the data. The samples are a hodgepodge of sizes, but City A and City B have around 100 shootings each, and City C has around 500 shootings. In City A the observed line is very near the bottom, suggesting some evidence that shooting victims may be further apart than would be expected, but for most distances it is within the 99% simulation band. (I can’t think of a theoretical reason why being spread apart would occur.) City B is pretty clearly within the simulation band, and City C’s observed pattern mirrors the mean of the simulation bands almost exactly. Since City C has the largest sample, I think this is pretty good evidence that shootings with a person hit are spatially random conditional on where shootings occur.

Long story short, when conducting Ripley’s K with crime data, the default way to generate the simulation envelopes for the statistics is not appropriate given how crime data is recorded. I show here one way to account for that when generating the simulation envelopes.

One sided line buffers in R using rgeos

I’ve started to do more geographic data manipulation in R, and part of the reason I do blog posts is for self-reference, so I figured I would share some of the geographic functions I have been working on.

The other day on StackOverflow there was a question that asked how to do one sided buffers in R. The question was closed (and the linked duplicate is closed as well), so I post my response here.

The workflow I describe to make one-sided buffers is, in a nutshell:

  • expand the original polyline into a very small area by using a normal buffer, call this Buf0
  • do a normal two sided buffer on the original polyline, without square or rounded ends, call this Buf1
  • take the geographic difference between Buf1 and Buf0, which basically splits the original buffer into two parts, and then return them as two separate spatial objects

Here is a simple example.

library(rgeos)
library(sp)

TwoBuf <- function(line,width,minEx){
  Buf0 <- gBuffer(line,width=minEx,capStyle="SQUARE")
  Buf1 <- gBuffer(line,width=width,capStyle="FLAT")
  return(disaggregate(gDifference(Buf1,Buf0)))
}

Squig <- readWKT("LINESTRING(0 0, 0.2 0.1, 0.3 0.6, 0.4 0.1, 1 1)") #Orig
TortBuf <- TwoBuf(line=Squig,width=0.05,minEx=0.0001)
plot(TortBuf, col=c('red','blue'))  #First object on left, second on right

If you imagine travelling along the polyline, which in this example goes from left to right, this is how I know the first (red) polygon is the left-hand side buffer and the second (blue) polygon is the right-hand side buffer. (To pick a specific one, you can subset like TortBuf[1] or TortBuf[2].)

If we reverse the line string, the order of the two buffers is reversed as well.

SquigR <- readWKT("LINESTRING(1 1, 0.4 0.1, 0.3 0.6, 0.2 0.1, 0 0)") #Reversed
TortBuf <- TwoBuf(line=SquigR,width=0.05,minEx=0.0001)
plot(TortBuf, col=c('red','blue'))  #Again first object on left, second on right

Examples of south to north and north to south work the same as well.

SquigN <- readWKT("LINESTRING(0 0, 0 1)") #South to North
TortBuf <- TwoBuf(line=SquigN,width=0.05,minEx=0.0001)
plot(TortBuf, col=c('red','blue'))  #Again first object on left, second on right

SquigS <- readWKT("LINESTRING(0 1, 0 0)") #North to South
TortBuf <- TwoBuf(line=SquigS,width=0.05,minEx=0.0001)
plot(TortBuf, col=c('red','blue'))  #Again first object on left, second on right

One example in which this procedure does not work is when the polyline closes on itself and creates a polygon.

Square <- readWKT("LINESTRING(0 0, 1 0, 1 1, 0 1, 0 0)") #Square
TortBuf <- TwoBuf(line=Square,width=0.05,minEx=0.0001)
plot(TortBuf, col=c('red','blue','green'))

#Switch the direction
SquareR <- readWKT("LINESTRING(0 0, 0 1, 1 1, 1 0, 0 0)") #Square Reversed
TortBuf <- TwoBuf(line=SquareR,width=0.05,minEx=0.0001)
plot(TortBuf, col=c('red','blue','green'))                #Still the same order

This messes up the order as well. If you know that your polyline is actually a polygon, you can do a positive and a negative buffer to get the desired effect. If I have a need to expand this to multipart polylines I will post an update, but I have some other buffer functions I may share in the meantime.
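
For example, here is a quick sketch of that positive/negative buffer idea for the square, treating it as a polygon rather than a polyline (just an illustration, not part of the posted functions).

#treat the square as a polygon and use positive/negative buffers for the
#outside and inside one sided buffers
SquarePoly <- readWKT("POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))")
OutBuf <- gDifference(gBuffer(SquarePoly, width=0.05), SquarePoly)   #outside ring
InBuf  <- gDifference(SquarePoly, gBuffer(SquarePoly, width=-0.05))  #inside ring
plot(OutBuf, col='red')
plot(InBuf, col='blue', add=TRUE)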