Some notes on PChange – estimating when trajectories cross over time

J.C. Barnes and company published a paper in JQC not too long ago and came up with a metric, PChange, to establish the number of times trajectories cross in a sample. This is more of interest to life course folks, although it is not totally far fetched to see it applied to trajectories of crime at places. Part of my interest in it was simply that it is an interesting statistical question — when two trajectories with errors cross. A seemingly simple question that has a few twists and turns. Here are my subsequent notes on that metric.

The Domain Matters

First, here is an example of the trajectories not crossing:

This points to an important assumption about those lines not crossing though that was never mentioned in the Barnes paper — the domain matters. For instance, if we draw those rays further back in time what happens?

They cross! This points to an important piece of information when evaluating PChange — the temporal domain in which you examine the data matters. So if you have a sample of juvenile delinquency measures from 14-18 you would find less change than a similar sample from 12-20.

This isn’t really a critique of PChange — it is totally reasonable to only want to examine changes within a specific domain. Who cares if delinquency trajectories cross when people are babies! But it is an important piece of context for researchers who use PChange in the future — longer samples will show more change. It also isn’t fair to compare PChange across samples of different lengths.

A Functional Approach to PChange

For the above you may ask — how would you tell if a trajectory crosses outside of the domain of the data? The answer is that you estimate an underlying function for the trajectory — some type of model where the outcome is a function of age (or time). With that function you can estimate the trajectory going back in time or forward in time (or in between sampled measurements). You may not want to rely on data outside of the domain (its standard error will be much higher than for data within the time domain; forecasting is always fraught with peril!), but the domain of your sample is ultimately arbitrary. So what about the question of whether the trajectories will ever cross? Or would the trajectories have crossed if I had data for ages 12-20 instead of just 16-18? Or would they have crossed if I checked the juveniles at age 16 1/2 instead of only at 16?

So actually instead of the original way the Barnes paper formulated PChange, here is how I thought about calculating PChange. First you estimate the underlying trajectory for each individual in your sample, then you take the difference of those trajectories.

y_i = f(t)
y_j = g(t)
y_delta = f(t) - g(t) = d(t)

Where y_i is the outcome y for observation i, and y_j is the outcome y for observation j. t is a measure of time, and the functions f and g represent growth models for observations i and j over time. y_delta is then the difference between these two functions, which I represent as the new function d(t). So for example the functions for each individual might be quadratic in time:

y_i = b0i + b1i(t) + b2i(t^2)
y_j = b0j + b1j(t) + b2j(t^2)

Subsequently the difference function will also be quadratic, and can be simply represented as:

y_delta = (b0i - b0j) + (b1i - b1j)*t + (b2i - b2j)*t^2

Then for the trajectories to cross (or at least touch), y_delta just has to equal zero at some point along the function. If this were pure math, and the trajectories had no errors, you would just set d(t) = 0 and solve for the roots of the equation. (Most people estimating models like these use functions whose roots are easy to find, like polynomials or splines.) If you cared about setting the domain, you would then just check whether the roots are within the domain of interest — if they are, the trajectories cross; if they are not, they do not cross. For data on humans with age, obviously roots at negative human years will not be of interest. But that is a simple way to solve the domain problem — if you have an underlying estimate of the trajectory, just see how often the trajectories cross within equivalent temporal domains in different samples.
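
To make this concrete, here is a quick sketch in Python of the root-finding step, using made-up quadratic coefficients for the two individuals (the coefficients and the 12 to 20 domain are just for illustration):

import numpy as np

#Hypothetical quadratic growth curves, y = b0 + b1*t + b2*t^2
bi = (1.0, 1.20, -0.05)  #made-up coefficients for individual i
bj = (9.4, 0.32, -0.03)  #made-up coefficients for individual j

#Difference function d(t), coefficients ordered highest degree first for np.roots
d_coef = [bi[2] - bj[2], bi[1] - bj[1], bi[0] - bj[0]]
roots = np.roots(d_coef)

#Keep real roots that fall inside the domain of interest, ages 12 to 20
real = roots[np.isreal(roots)].real
print(real[(real >= 12) & (real <= 20)])  #non-empty, so the trajectories cross in the domain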

I’d note that the idea of having some estimate of the underlying trajectory is still relevant even within the domain of the data — not just extrapolating to time periods outside. Consider two simple curves below, where the points represent the time points where each individual was measured.

So while the two functions cross, when only considering the sampled locations, as is done in Barnes et al.’s PChange, you would say these trajectories do not cross, when in actuality they do. The sampled locations simply miss the critical point for these two trajectories.

This points to another piece of contextual information important to interpreting PChange — the number of sample points matters. If you have samples every 6 months, you will likely find more changes than if you just had samples every year.

I don’t mean here to bag on Barnes’ original metric too much — his PChange metric does not rely on estimating the underlying functional form, and so is a non-parametric approach to identifying change. But estimating the functional form for each individual has some additional niceties — one is that you do not need the measures to be at equivalent sample locations. You can compare someone measured at 11, 13, and 18 to someone who is measured at 12, 16, and 19. For people analyzing outcomes for really young kids I bet this is a major point — the underlying function at a specific age is more important than when you conveniently measured the outcome. For older kids though I imagine comparing the 12 year old to the 11 year old (in the same class) is probably not a big deal for delinquency. It does make it easier though to compare different cohorts in which the measures are not at nice regular intervals (e.g. Add Health, NLSY, or anytime you have missing observations in a longitudinal survey).

In the end you would only want to estimate an underlying functional form if you have many measures (more than the 3 in my example), but this ties in nicely with what people modeling behavior over time are already doing — modeling the growth trajectories using some type of functional form. Whether it is a random effects model or a group based trajectory model, these give you an underlying functional form. If you are willing to assume that model is good enough to model the trajectories over time, you should think it is good enough to calculate PChange!

The Null Matters

So this so far would be fine and dandy if we had perfect estimates of the underlying trajectories. We don’t though, so you may ask: even though y_delta does not exactly equal zero anywhere, its error bars might be quite wide. Wouldn’t we then still infer that there is a high probability the two trajectories cross? This points to another hidden assumption of Barnes’ PChange — the null matters. In the original PChange the null is that the two trajectories do not cross — you need a sufficient change in consecutive time periods relative to the standard error to conclude they cross. If the standard error is high, you won’t consider the lines to cross. Consider the simple table below:

Period  A_Level  A_SE  B_Level  B_SE
1       4        1     1.5      0.5
2       5        1     3        0.5
3       6        1     4.5      0.5
4       7        1     6        0.5

Where A_Level and B_Level refer to the outcome for the four time periods, and A_SE and B_SE refer to the standard errors of those measurements. Here is the graph of those two trajectories, with the standard error drawn as areas for the two functions (only plus minus one standard error for each line).

And here is the graph of the differences — assuming that the covariance between the two functions is zero (so the standard error of the difference equals sqrt(A_SE^2 + B_SE^2)). Again only plus/minus one standard error.

You can see that the line never crosses zero, but the standard error area does. If our null is H0: y_delta = 0 for any t, then we would fail to reject the null in this example. So in Barnes’ original PChange these example lines would not cross, whereas with my functional approach we don’t have enough data to say they don’t cross. This I suspect would make a big difference in many samples, as the standard error is going to be quite large unless you have very many observations and/or very many time points.
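
To make the arithmetic concrete, here is a small sketch in Python that reproduces the difference line and its standard error from the table (again assuming zero covariance between the two series):

import numpy as np

#Levels and standard errors from the table above
A = np.array([4., 5., 6., 7.])
B = np.array([1.5, 3., 4.5, 6.])
A_SE, B_SE = 1.0, 0.5

diff = A - B                          #2.5, 2.0, 1.5, 1.0
diff_se = np.sqrt(A_SE**2 + B_SE**2)  #~1.118 for every period
print(diff - diff_se)                 #lower 1 SE band dips below zero at period 4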

If one just wants a measure of crossed or did not cross, with my functional approach you could set how wide you want to draw your error bars, and then estimate whether the high or low parts of that bar cross zero. You may not want a discrete measure though, but a probability. To get that you would integrate the probability over the domain of interest and calculate the chunk of the function that crosses zero. (Just assume the temporal domain is uniform across time.)

So in 3d, my difference function would look like this, where the area at the bottom of the wall is used to calculate the probability of the lines crossing, and the height of the surface plot is the PDF at that point. (Note the area of the density is not normalized to sum to 1 in this plot.)

This surface graph ends up crossing more than was observed in my prior 2d plots, as I was only plotting 1 standard error. Here imagine that the top green part of the density is the mean function — which does not cross zero — but then you have a non-trivial amount of the predicted density that does cross the zero line.

In the example where it just crosses one time by a little, it seems obvious to consider the small slice as the probability of the two lines crossing. To extend this to the case where you don’t know a priori whether to test above or below the line, you could calculate the probability on either side of the line, take the minimum, and then double that minimum for your p-value. So if say 5% of the area is below the line in my above example, you would double it and say the two-tailed p-value of the lines crossing is p = 0.10. Imagine the situation in which the line mostly hovers around 0, so the mass is about half on one side and half on the other. In that case the probability the lines cross seems much higher than 50%, so doubling seems intuitively reasonable.
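
Here is a minimal sketch in Python of one way to operationalize that calculation: treat the difference at each period as normal, average the below-zero mass over the (uniform) temporal domain, and then double the smaller side for the two-tailed version. Treating the integral as a pointwise average is just one reasonable reading, not the only way to do it.

import numpy as np
from scipy.stats import norm

diff = np.array([2.5, 2.0, 1.5, 1.0])  #d(t) at the four periods, from the table
diff_se = np.sqrt(1.0**2 + 0.5**2)

p_below = norm.cdf(0, loc=diff, scale=diff_se)  #pointwise probability d(t) < 0
mass_below = p_below.mean()                     #averaged over the uniform domain
print(2*min(mass_below, 1 - mass_below))        #two-tailed p-value of crossing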

So if you consider this probability to be a p-value, with a very small p-value you would reject the null that the lines cross. Unlike most reference distributions for p-values though, you can get a zero probability estimate of the lines crossing. You can aggregate up those probabilities as weights when calculating the overall PChange for the sample. So you may not know for certain if two trajectories cross, but you may be able to say these two trajectories cross with a 30% probability.

Again this isn’t to say that PChange is bad — it is just different. I can’t give any strong reasoning for whether assuming they do cross (my functional approach) or assuming they don’t cross (Barnes’ PChange) is better – they are just different, but the choice would likely make a large difference in the estimated number of crossings.

Population Change vs Individual Level Change

So far I have just talked about trying to determine whether two individual lines cross. For my geographic analysis of trajectories in which I have the whole population (just a sample in time), this may be sufficient. You can calculate all pairwise differences and then calculate PChange (I see no data based reason to use the permutation sample approach Barnes suggested – we don’t have that big of samples, we can just calculate all pairwise combinations.)

But for many of the life course researchers, they are more likely to be interested in estimating the population of changes from the samples. Here I will show how you can do that for either random effects models, or for group based trajectory models based on the summary information. This takes PChange from a sample metric to a population level metric implied via your models you have estimated. This I imagine will be much easier to generalize across samples than the individual change metrics, which will be quite susceptible to outlier trajectories, especially in small samples.

First let’s start with the random effects model. Imagine that you fit a linear growth model — say the random intercept has a variance of 2, and the random slope has a variance of 1. So these are population level metrics. The fixed effects will be immaterial for this part, and I will discuss how the covariance between the two random effect terms enters in a moment.

First, trivially, if you selected two random individuals from the population with this random effects distribution, the probability their underlying trajectories cross at some point is 1. The reason is that for linear models, two lines never cross only if their slopes are exactly equal, and sampling from a continuous distribution has zero probability of drawing two identical slopes. This does not generalize to more complicated functions (imagine parabolas concave up and concave down that are shifted up and down so they never cross), but should be enough of a motivation to make the question only relevant for a specified domain of time.

So let’s say that we are evaluating the trajectories over the range t = [10,20]. What is the probability two individuals randomly sampled from the population will cross? So again with my functional difference approach, we have

y_i = b0i + b1i*t
y_j = b0j + b1j*t
y_delta = (b0i - b0j) + (b1i - b1j)*t

Where in this case the b0 and b1 have prespecified distributions, so we know the distribution of the difference. Note that in the case with no covariates, the fixed effects will cancel out when taking the differences. (Generalizing to covariates is not as straightforward — you could either assume they are equal so they cancel out, or you could have them vary according to additional distributions, e.g. males have a 90% chance of being drawn versus females a 10% chance, in which case the fixed effects would not cancel out.) Here I am just assuming they cancel out. Additionally, taking the difference in the trajectories does not cancel out the covariance term: for two independent draws, the covariance between (b0i - b0j) and (b1i - b1j) is twice the covariance between b0 and b1 in the overall model, so it is only zero when the original covariance is zero.
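
To see why, expand the covariance for two independent draws (the cross-person terms vanish by independence):

Cov(b0i - b0j, b1i - b1j) = Cov(b0i, b1i) + Cov(b0j, b1j) = 2*Cov(b0, b1)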

For each of the differences the means will be zero, and the variance will be the two variances added together, e.g. b0i - b0j will have a mean of zero and a variance of 2 + 2 = 4. The variance of the difference in slopes will then be 2. Now to figure out when the two lines will cross.

If you make a graph where the X axis is the difference in the intercepts, and the Y axis is the difference in the slopes, you can then mark off areas that indicate the two lines will cross given the domain. Here for example is a sampling of where the lines cross – red is crossing, grey is not crossing.

So for example, say we had two random draws:

y_i = 1   + 0.5*t
y_j = 0.5 + 0.3*t
y_delta = 0.5 + 0.2*t

This then shows that the two lines do not cross when only evaluating t between 10 and 20. They have already diverged that far out (you would need negative t to have the lines cross). Imagine instead that y_delta = -6 + 0.5*t. This line does cross zero within the domain: at t = 10 the function equals -1, whereas at t = 20 it equals 4.

If you do another 3d plot you can plot the bivariate PDF. Again integrate the chunks of areas in which the function crosses zero, and voila, you get your population estimate.

This works in a similar manner for higher order polynomials, although you can’t draw it in a nice graph. I’m blanking at the moment on a way to find these areas offhand in a nice way — suggestions welcome!

This gets a bit tricky to think about in relation to individual level change. This approach does not assume any error in the random draws of the line, but assumes the draws will have a particular distribution. So the PChange does not come from adding up whether individual lines in your sample cross; it comes from the estimated distribution of what the difference in two randomly drawn lines would look like, as implied by your random effects model. Think of it as: based on your random effect distribution you randomly draw two lines, calculate whether they cross, and then repeat this simulation a very large number of times. The integrations I’m suggesting are just an exact way to calculate PChange instead of the simulation approach.
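
Here is a minimal simulation sketch in Python of that thought experiment, using the variances from the example above (and assuming a zero covariance between intercepts and slopes; with a non-zero covariance you would draw the differences from the corresponding bivariate normal instead):

import numpy as np

rng = np.random.default_rng(10)
n_sim = 1000000

#Differences of two independent draws: intercept variance 2+2=4, slope variance 1+1=2
d_int = rng.normal(0, np.sqrt(4), n_sim)
d_slope = rng.normal(0, np.sqrt(2), n_sim)

#A pair of lines crosses where d_int + d_slope*t = 0, i.e. t* = -d_int/d_slope
t_star = -d_int/d_slope
print(np.mean((t_star >= 10) & (t_star <= 20)))  #probability of crossing in [10,20]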

If you were to do individual change from your random effects model you would incorporate the standard error of the estimated slope and intercept for the individual observation. This is for your hypothetical population though, so I see no need to incorporate any error.

Estimating population level change from group based trajectory models via my functional approach is more straightforward. First, with my functional approach you would assume individuals who share the same latent trajectory cross with a high probability — no need to test that. Second, for testing whether two individual trajectories cross you would use the approach I’ve already discussed for individual lines and obtain the p-value I mentioned.

So for example, say you had a probability of 25% that a randomly drawn person from group A would cross a randomly drawn person from Group B. Say also that Group A has 40/100 of the sample, and Group B is 60/100. So then we have three different groups: A to A, B to B, and A to B. You can then break down the pairwise number in each group, as well as the number of crosses below.

Compare   N    %Cross Cross
A-A      780    100    780
B-B     1770    100   1770
A-B     2400     25    600
Total   4950     64   3150

So then we have a population level PChange estimate of 64% from our GBTM. All of these calculations can be extended to non-integers, I just made them integers here to simplify the presentation.
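
The table boils down to a weighted average, which is simple enough to sketch in code:

#Pairwise counts for groups of size 40 and 60, and the implied PChange
nA, nB, p_AB = 40, 60, 0.25
AA = nA*(nA - 1)/2   #780 within-A pairs
BB = nB*(nB - 1)/2   #1770 within-B pairs
AB = nA*nB           #2400 between-group pairs
crosses = AA + BB + AB*p_AB
print(crosses/(AA + BB + AB))  #~0.64, the population PChange estimate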

Now, that overall PChange estimate may not be very meaningful for GBTM, as the denominator includes pairwise combinations of folks in the same trajectory group, which is not going to be of much interest. But just looking at the individual group solutions and seeing the probability they cross could be more informative. For example, although Barnes shows the GBTM models in the paper as not crossing, depending on how wide the standard errors of the functions are (which aren’t reported), this functional approach would probably assign a non-zero probability of them crossing (think of a low standard error for the higher group crossing a high standard error for the low group).


Phew — that was a long post! Let me know in the comments if you have any thoughts/concerns on what I wrote. A simple question — whether two lines cross — but not a simple solution once you consider the statistical nature of the question. I can’t be the only person to have thought about this — if you know of similar or different approaches to testing whether two lines cross please let me know in the comments.


IACA Conference 2017 workshop: Monitoring temporal crime trends for outliers (Excel)

This fall at the International Association of Crime Analysts conference I am doing a workshop, Monitoring temporal crime trends for outliers: A workshop using Excel. If you can’t wait (or are not going) I have all my materials already prepared, which you can download here. That includes a walkthrough of my talk/tutorial, as well as a finished Excel workbook. It is basically a workshop to go with my paper, Tables and graphs for monitoring temporal crime trends: Translating theory into practical crime analysis advice.

For some preview, I will show how to make a weekly smoothed chart with error bands:

As well as a monthly seasonal chart:

I use Excel not because I think it is the best tool, but mainly because I think it is the most popular among crime analysts. In the end I just care about getting the job done! (Although I’ve given reasons why I think Excel is more painful than any statistical program.) Even though it is harder to make small multiple charts in Excel, I show how to make these charts using pivot tables and filters, so watching them auto-update when you update the filter is pretty cool.

For those with SPSS I have already illustrated how to make similar charts in SPSS here. You could of course replicate that in R or Stata or whatever if you wanted.

I am on the preliminary schedule currently for Tuesday, September 12th at 13:30 to 14:45. I will be in New Orleans on the 11th, 12th and 13th, so if you want to meet always feel free to send an email to set up a time.

New working paper – Monitoring volatile homicide trends across U.S. cities

I have a new working paper out — Monitoring volatile homicide trends across U.S. cities, with one of my colleagues Tomislav Kovandzic. You can grab the pre-print on SSRN, and the paper has links to code to replicate the charts and models in the paper.

Here I look at homicide rates in U.S. cities and use funnel charts and fan charts to show the typical volatility in homicide rates between cities and within cities over time. As I’ve written previously, I think much of the media narrative around homicide increases is hyperbolic and often cherry picks reasons why homicides are going up.

I’ve shown examples of funnel charts on this blog before, so I will use a different image as the tease. To generate the prediction intervals for fan charts I estimate binomial random effect models. Below is an example for New Orleans (homicide rate per 100,000 population):

As always, if you have feedback feel free to send me an email.

Scraping Meth Labs with Python

For a while in my GIS courses I have pointed to the DEA’s website that has a list of busted meth labs across the country, named the National Clandestine Laboratory Register. Finally a student has shown some interest in this, and so I spent a little time writing a scraper in Python to grab the data. For those who would just like the data, here I have a csv file of the scraped labs that are geocoded to the city level. And here is the entire SPSS and Python script to go from the original PDF data to the finished product.

So first off, if you visit the DEA website, you will see that each state has its own PDF file (for example here is Texas) that lists all of the registered labs, with the county, city, street address, and date. To turn this into usable data, I am going to do three steps in Python:

  1. download the PDF file to my local machine using the urllib Python library
  2. convert that PDF to an xml file using the pdftohtml command line utility
  3. use BeautifulSoup to parse the xml file

I will illustrate each in turn and then provide the entire Python script at the end of the post.

So first, let’s import the libraries we need, and also note I downloaded the pdftohtml utility and placed that location as a system path on my Windows machine. Then we need to set a folder where we will download the files to on our local machine. Finally I create the base url for our meth labs.

from bs4 import BeautifulSoup
import urllib, os
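#note: urllib.urlretrieve here is Python 2 syntax; in Python 3 it lives at urllib.request.urlretrieve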

myfolder = r'C:\Users\axw161530\Dropbox\Documents\BLOG\Scrape_Methlabs\PDFs' #local folder to download stuff
base_url = r'https://www.dea.gov/clan-lab' #online site with PDFs for meth lab seizures

Now to just download the Texas pdf file to our local machine we would simply do:

a = 'tx'
url = base_url + r'/' + a + '.pdf'
file_loc = os.path.join(myfolder,a)
urllib.urlretrieve(url,file_loc + '.pdf')

If you are following along and replaced the path in myfolder with a folder on your personal machine, you should now see the Texas PDF downloaded in that folder. Now I am going to use the command line to turn this PDF into an xml document using the os.system() function.

#Turn to xml with pdftohtml, does not need xml on end
cmd = 'pdftohtml -xml ' + file_loc + ".pdf " + file_loc
os.system(cmd)

You should now see that there is an xml document to go along with the Texas file. You can check out its format using a text editor (wordpress does not seem to like me showing it here).

So basically we can use the top and the left attributes within the xml to identify what row and what column the items are in. But first, we need to read in this xml and turn it into a BeautifulSoup object.

MyFeed = open(file_loc + '.xml')
textFeed = MyFeed.read()
FeedParse = BeautifulSoup(textFeed,'xml')
MyFeed.close()

Now the FeedParse item is a BeautifulSoup object that you can query. In a nutshell, we have a top level page tag, and then within that you have a bunch of text tags. Here is the function I wrote to extract that data and dump it into tuples.

#Function to parse the xml and return the line by line data I want
def ParseXML(soup_xml,state):
    data_parse = []
    page_count = 1
    pgs = soup_xml.find_all('page')
    for i in pgs:
        txt = i.find_all('text')
        order = 1
        for j in txt:
            value = j.get_text() #text
            top = j['top']
            left = j['left']
            dat_tup = (state,page_count,order,top,left,value)
            data_parse.append(dat_tup)
            order += 1
        page_count += 1
    return data_parse

So with our Texas data, we could call ParseXML(soup_xml=FeedParse,state=a) and it will return all of the data nested in those text tags. We can just put these all together and loop over all of the states to get all of the data. Since the PDFs are not that large it works quite fast, under 3 minutes on my last run.

from bs4 import BeautifulSoup
import urllib, os

myfolder = r'C:\Users\axw161530\Dropbox\Documents\BLOG\Scrape_Methlabs\PDFs' #local folder to download stuff
base_url = r'https://www.dea.gov/clan-lab' #online site with PDFs for meth lab seizures
                                           #see https://www.dea.gov/clan-lab/clan-lab.shtml
state_ab = ['al','ak','az','ar','ca','co','ct','de','fl','ga','guam','hi','id','il','in','ia','ks',
            'ky','la','me','md','ma','mi','mn','ms','mo','mt','ne','nv','nh','nj','nm','ny','nc','nd',
            'oh','ok','or','pa','ri','sc','sd','tn','tx','ut','vt','va','wa','wv','wi','wy','wdc']
            
state_name = ['Alabama','Alaska','Arizona','Arkansas','California','Colorado','Connecticut','Delaware','Florida','Georgia','Guam','Hawaii','Idaho','Illinois','Indiana','Iowa','Kansas',
              'Kentucky','Louisiana','Maine','Maryland','Massachusetts','Michigan','Minnesota','Mississippi','Missouri','Montana','Nebraska','Nevada','New Hampshire','New Jersey',
              'New Mexico','New York','North Carolina','North Dakota','Ohio','Oklahoma','Oregon','Pennsylvania','Rhode Island','South Carolina','South Dakota','Tennessee','Texas',
              'Utah','Vermont','Virginia','Washington','West Virginia','Wisconsin','Wyoming','Washington DC']

all_data = [] #this is the list that the tuple data will be stashed in

#Function to parse the xml and return the line by line data I want
def ParseXML(soup_xml,state):
    data_parse = []
    page_count = 1
    pgs = soup_xml.find_all('page')
    for i in pgs:
        txt = i.find_all('text')
        order = 1
        for j in txt:
            value = j.get_text() #text
            top = j['top']
            left = j['left']
            dat_tup = (state,page_count,order,top,left,value)
            data_parse.append(dat_tup)
            order += 1
        page_count += 1
    return data_parse

#This loops over the pdfs, downloads them, turns them to xml via pdftohtml command line tool
#Then extracts the data

for a,b in zip(state_ab,state_name):
    #Download pdf
    url = base_url + r'/' + a + '.pdf'
    file_loc = os.path.join(myfolder,a)
    urllib.urlretrieve(url,file_loc + '.pdf')
    #Turn to xml with pdftohtml, does not need xml on end
    cmd = 'pdftohtml -xml ' + file_loc + ".pdf " + file_loc
    os.system(cmd)
    #parse with BeautifulSoup
    MyFeed = open(file_loc + '.xml')
    textFeed = MyFeed.read()
    FeedParse = BeautifulSoup(textFeed,'xml')
    MyFeed.close()
    #Extract the data elements
    state_data = ParseXML(soup_xml=FeedParse,state=b)
    all_data = all_data + state_data

Now to go from those sets of tuples to actually formatted data takes a bit more work, and I used SPSS for that. See here for the full set of scripts used to download, parse and clean up the data. Basically it is a little more complicated than just going from long to wide using the top marker for the data, as some rows are off slightly. There are also complications from long addresses being split across two lines. And finally there are just some data errors and fields being merged together. So that SPSS code solves a bunch of that. It also includes scripts to geocode the data to the city level using the Google geocoding API.

Let me know if you do any analysis of this data! I quickly made a time series map of these events via CartoDB. You can definitely see some interesting patterns of DEA concentration over time, although I can’t say if that is due to them focusing on particular areas or if those are really the areas with the most prevalent meth lab problems.

Much ado about nothing: Overinterpreting volatility in homicide rates

I’m not much of a macro criminologist, but being asked questions by my dad (about Richard Rosenfeld and the Ferguson effect) and the dentist yesterday (asking about some of Trump’s comments about rising crime trends) has prompted me to jump into it and give my opinion. Long story short — many sources I believe are overinterpreting short term fluctuations as more meaningful than they are.

First I will tackle national crime rates. So if you have happened to walk by a TV playing CNN the past few days, you may have heard Donald Trump being criticized for his statements on crime rates. This partially conflates overall levels of crime with changes in crime over time. Basically crime is currently low compared to historical patterns, but homicide rates have been rising in the past two years. This is easier to show in a chart than to explain in words. So here is the national estimated homicide rate per 100,000 individuals since 1960.[1]

2016 is not official and is still an estimate, but basically the pattern is this – crime has been falling generally across the country since the early 1990’s. Crime rates in just the past few years have finally dropped below levels in the 1960’s, but for the past two years homicides have been increasing. So some have pointed to the increase in the past two years and have claimed the sky is falling. To say this they claim the rate of change is the largest in the past 40 years. There are better charts to show rates of change (a semi-log chart), but the overall look is basically the same.

You have to really squint to see that the change from 2014 to 2015 is a larger jump than any of the changes over the entire period, so arguments based on the size of recent changes in the homicide rate are hyperbole (either on a linear scale or a logarithmic scale). And even if you take the recent increases over the past two years as evidence of a more general rising trend, in the broader term we still have homicide rates close to a low point in the past 50 years.

For a bit of general advice — for any source that gives you a percent change, you always want to see the base numbers and any longer term historical trends. Any media source that cites recent increases in homicides without providing this graph of long term historical crime trends is simply misleading. I’ve seen this done in many places, see this example from the New York Times or this recent note from the Economist. So this isn’t something specific to the President.

Now, macro criminologists don’t really have any better track record explaining these patterns than macro economists have in explaining economic trends. Basically we have a bunch of patchwork theories that make sense for parts of the trend, but not the entire time frame. Changes in routine activities in the 1960’s, increases in incarceration, the decline of crack use, the ease of calling 911 with cell-phones, lead exposure, abortion (just to name a few). And academics come up with new theories all the time, the most recent being the Ferguson effect — which is simply another term for de-policing.

Now a bit on trends for specific cities. How this ties in with the national trend is that some articles have been pointing out that some cities have seen increases and some have not. That is fine to point out (albeit trivial), but then the articles frequently go on to generate stories about why crime is rising in those specific places. Those on the left cite civil unrest and police brutality as possible reasons (Milwaukee, St. Louis, Chicago, Baltimore), while those on the right cite the deleterious effects of police departments not being as proactive (stops in Chicago, arrests in Baltimore).

While any of these explanations may turn out reasonable in the end, I’m pretty sure most of these articles severely underappreciate the volatility in homicide rates. Take an example with St. Louis, with a city population of just over 300,000. A homicide rate of 50 individuals per 100,000 means a total of 150 murders. A homicide rate of 40 per 100,000 means 120 murders. So we are only talking about a change of 30 murders overall. Fluctuations of around 10 in the murder rate would not be unexpected for a city with a population of 300,000 individuals. The confidence interval for a rate of 150 murders per 300,000 individuals is 126 to 176 murders.[2]
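
The footnote gives the R call; the same exact interval can also be computed from the beta quantiles, which is what Clopper-Pearson is under the hood. A quick sketch in Python:

from scipy.stats import beta

#Exact Clopper-Pearson interval for 150 murders out of a population of 300,000
x, n, alpha = 150, 300000, 0.05
lower = beta.ppf(alpha/2, x, n - x + 1)
upper = beta.ppf(1 - alpha/2, x + 1, n - x)
print(lower*n, upper*n)  #roughly 126 to 176 murders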

Even that though understates the typical volatility in homicide rates, as it basically assumes the underlying proportion does not change over time. In reality crime statistics are more bursty, and show wilder fluctuations in different places.[3] To show this for many cities, I use the data from the Economist article mentioned earlier, and create a motion chart of the changes in homicide rates over time. The idea behind this chart is a funnel chart. Cities with lower populations will show higher variance, and subsequently those dots on the left hand side of the chart will jump around a lot more. The population figures are current and not varying, so the dots just move up and down on the Y axis.

For best viewing, make the X axis on the log scale, and size the points according to the population of the city. If you are at a desktop computer, you can open up a bigger version of the chart here.

Selecting individual points and then letting the animation run though illustrates the typical variability of crime over time. Here is the trace of St. Louis over the 36 year period.

New Orleans is another good example; we have fluctuations from under 30 to over 90 over the time period.

And here is Chicago, which shows less fluctuation than the smaller cities (as expected) but still has a range of homicide rates around 20 over the time period.

Howard Wainer has previously pointed this relationship out, and called it The Most Dangerous Equation. Basically, if you look you will be able to find some upward crime trends, especially in smaller cities. You need to look at the long term though and understand typical fluctuations to make a reasonable decision as to whether crime is increasing or if it is just typical year to year variation. The majority of news articles on the topic are just chock full of post hoc ergo propter hoc for particular cherry picked cities, and they often don’t make sense in explaining crime patterns over the past decade in those particular cities, let alone make sense for different cities experiencing similar conditions but not having rising homicide rates.



  1. For my notes about data sources, generally the data have come from the FBI UCR data tool (for the 1960 through 2014 data). 2015 data have come from the FBI web page for the 2015 UCR report. The 2016 projections come from this Economist article as well as the 50 cities data for the google motion chart.
  2. Calculated in R via (binom.test(150,300000)$conf.int[1:2])*300000. This is the exact Clopper-Pearson confidence interval.
  3. So even though this 538 article does a better job of acknowledging volatility, whatever test they use to determine statistically significant increases is likely to have too many false positives.

Side by side charts in SPSS

One aspect of SPSS charts that you need to use syntax for is to create side-by-side charts. Here I will illustrate a frequent use case, time series charts with different Y axes. You can download the data and code to follow along here. This is data for Buffalo, NY on reported crimes from the UCR data tool.

So after you have downloaded the csv file with the UCR crime rates in Buffalo and have imported the data into SPSS, you can make a graph of violent crime rates over time.

*Making a chart of the violent crime rate.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year ViolentCrimerate MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: ViolentCrimerate=col(source(s), name("ViolentCrimerate"))
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
  ELEMENT: line(position(Year*ViolentCrimerate))
  ELEMENT: point(position(Year*ViolentCrimerate), color.interior(color.black), color.exterior(color.white), size(size."7"))
END GPL.

I like to superimpose the points on simple line charts, to reinforce where the year observations are. Here we can see that there is a big drop post 1995 for the following four years (something that would be hard to say exactly without the points). Part of the story of Buffalo though is the general decline in population (similar to most of the rust belt part of the nation since the 70’s).

*Make a chart of the population decline.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Population MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Population=col(source(s), name("Population"))
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Population"))
  ELEMENT: line(position(Year*Population))
  ELEMENT: point(position(Year*Population), color.interior(color.black), color.exterior(color.white), size(size."7"))
END GPL.

Now we want to place these two charts on top of one another. So check out the syntax below, in particular the GRAPH: begin statements.

*Now put the two together.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Population ViolentCrimerate
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL 
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Population=col(source(s), name("Population"))
  DATA: ViolentCrimerate=col(source(s), name("ViolentCrimerate"))
  GRAPH: begin(origin(14%,12%), scale(85%, 60%))
  GUIDE: axis(dim(1), label("Year"), opposite())
  GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
  ELEMENT: line(position(Year*ViolentCrimerate))
  ELEMENT: point(position(Year*ViolentCrimerate), color.interior(color.black), color.exterior(color.white), size(size."7"))
  GRAPH: end()
  GRAPH: begin(origin(14%, 75%), scale(85%, 20%)) 
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Population"))
  ELEMENT: line(position(Year*Population))
  ELEMENT: point(position(Year*Population), color.interior(color.black), color.exterior(color.white), size(size."7"))  
  GRAPH: end()
END GPL.    

In a nutshell, the GRAPH: begin statements allow you to chunk up the graph space to make different/arbitrary plots. The percentages start in the top left, so for the first violent crime rate graph, the origin is listed as 14% and 12%. This means the graph starts 14% to the right in the overall chart space, and 12% down. These paddings are needed to make room for the axis labels. Then for the scale part, it lists 85% and 60%. The 85% means take up 85% of the X range in the chart, but only 60% of the Y range. So this shows how to make the violent crime chart take a bigger proportion of the overall chart space than the population chart. You can technically do charts with varying axes in SPSS without this, but you would have to make the panels take up an equal amount of space. This way you can make the panels whatever proportion you want.

For Buffalo the big drop in 1996 is largely due to a very large reduction in aggravated assaults (from over 3,000 in 1995 to under 1,600 in 1996). So here I superimpose a bar chart to visualize aggravated assaults as a proportion of all violent crimes. This wouldn’t be my first choice of how to show this, but I think it is a good illustration of how to superimpose and/or stack additional charts using this same technique in SPSS.

*Also superimposing a stacked bar chart on the total types of crimes in the background.
COMPUTE PercentAssault = (Aggravatedassault/ViolentCrimeTotal)*100.
FORMATS PercentAssault (F2.0).
EXECUTE.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Population ViolentCrimerate PercentAssault
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL 
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Population=col(source(s), name("Population"))
  DATA: ViolentCrimerate=col(source(s), name("ViolentCrimerate"))
  DATA: PercentAssault=col(source(s), name("PercentAssault"))
  GRAPH: begin(origin(14%,12%), scale(75%, 60%))
  GUIDE: axis(dim(1), label("Year"), opposite())
  GUIDE: axis(dim(2), label("Violent Crime Rate per 100,000"))
  ELEMENT: line(position(Year*ViolentCrimerate))
  ELEMENT: point(position(Year*ViolentCrimerate), color.interior(color.black), color.exterior(color.white), size(size."7"))
  GRAPH: end()
  GRAPH: begin(origin(14%, 75%), scale(75%, 20%)) 
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Population"))
  ELEMENT: line(position(Year*Population))
  ELEMENT: point(position(Year*Population), color.interior(color.black), color.exterior(color.white), size(size."7"))  
  GRAPH: end()
  GRAPH: begin(origin(14%, 12%), scale(75%, 60%)) 
  SCALE: linear(dim(2), min(0), max(60))
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), label("Percent Assault"), opposite(), color(color.red), delta(10))
  ELEMENT: bar(position(Year*PercentAssault), color.interior(color.red), transparency.interior(transparency."0.7"), transparency.exterior(transparency."1.0"), size(size."5"))
  GRAPH: end()
END GPL.

While multiple time series charts are a common use, you can basically use your imagination about what you want to accomplish with this. Another common example is to put border histograms on a scatterplot (which the GPL reference guide has an example of). Here is an example I posted recently to Nabble that has the number of individuals at risk in a Kaplan-Meier plot.

Heatmaps in SPSS

Heatmap is a visualization term that gets used in a few different circumstances, but here I mean a regular grid in which you use color to indicate particular values. Here is an example from Nathan Yau via FlowingData:

They are often not the best visualization for evaluating general patterns, but they offer a mix of zooming in on specific individuals as well as identifying overall trends. In particular I like using them to look at missing data patterns in surveys in SPSS, which I will show an example of in this blog post. Here I am going to use a community survey for Dallas in 2016. The original data can be found here, and the original survey questions can be found here. I’ve saved that survey as an SPSS file you can access at this link. (The full code in one sps syntax file is here.)


So first I am going to open up the data file from online, and name the dataset DallasSurv16.

*Grab the data from online.
SPSSINC GETURI DATA
URI="https://dl.dropbox.com/s/5e07yi9hd0u5opk/Surv2016.sav?dl=0"
FILETYPE=SAV DATASET=DallasSurv16.

Here I am going to illustrate making a heatmap with the questions asking about fear of crime and victimization, the Q6 questions. First I am going to make a copy of the original dataset, as we will be making some changes to the data. I do this via the DATASET COPY function, and follow it up by activating that new dataset. Then I do a frequency to check out the set of Q6 items.

DATASET COPY HeatMap.
DATASET ACTIVATE HeatMap.
FREQ Q6_1Inyourneighborhoodduringthe TO Q69Fromfire.

From the survey instrument, the nine Q6 items have values of 1 through 5, and then a "Don’t Know" category labeled as 9. All of the items also have system missing values. First we are going to recode the system missing items to a value of 8, and then we are going to sort the dataset by those questions.

RECODE Q6_1Inyourneighborhoodduringthe TO Q69Fromfire (SYSMIS = 8)(ELSE = COPY).
SORT CASES BY Q6_1Inyourneighborhoodduringthe TO Q69Fromfire.

You will see the effect of sorting the cases in a bit in our graph. But the idea of how to make the heatmap in the grammar of graphics is that in your data you have a variable that specifies the X axis, a variable for the Y axis, and then a variable for the color in your heatmap. To get that set up, we need to go from our nine separate Q6 variables to one variable. We do this in SPSS by using VARSTOCASES to reshape the data.

VARSTOCASES /MAKE Q6 FROM Q6_1Inyourneighborhoodduringthe TO Q69Fromfire /INDEX = QType.

So now every person who answered the survey has 9 different rows in the dataset instead of one. The original answers to the questions are placed in the new Q6 variable, and the QType variable is a number from 1 to 9. So now individual people will go on the Y axis, and each question will go on the X axis. But before we make the chart, we will add the meta-data in SPSS to our new Q6 and QType variables.

VALUE LABELS QType
  1 'In your neigh. During Day'
  2 'In your neigh. At Night'
  3 'Downtown during day'
  4 'Downtown at night'
  5 'Parks during day'
  6 'Parks at Night'
  7 'From violent crime'
  8 'From property crime'
  9 'From fire'
.
VALUE LABELS Q6
 8 "Missing" 
 9 "Don't Know"
 1 'Very Unsafe'
 2 'Unsafe'
 3 'Neither safe or unsafe'
 4 'Safe'
 5 'Very Safe'
.
FORMATS Q6 QType (F1.0).

Now we are ready for our GGRAPH statement. It is pretty gruesome but just bear with me for a second.

TEMPORARY.
SELECT IF DISTRICT = 1.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=QType ID Q6
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(800px,2000px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: QType=col(source(s), name("QType"), unit.category())
  DATA: ID=col(source(s), name("ID"), unit.category())
  DATA: Q6=col(source(s), name("Q6"), unit.category())
  GUIDE: axis(dim(1), opposite())
  GUIDE: axis(dim(2), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1", color.darkred),("2", color.red),("3", color.lightgrey), 
            ("4", color.lightblue), ("5", color.darkblue), ("9", color.white), ("8", color.white)))
  SCALE: cat(dim(2), sort.data(), reverse())
  ELEMENT: polygon(position(QType*ID), color.interior(Q6), color.exterior(color.grey), transparency.exterior(transparency."0.7"))
  PAGE: end()
END GPL.
EXECUTE.

And this produces the chart,

So to start, normally I would use the chart builder dialog to make the skeleton for the GGRAPH code and update that. Here if you make a scatterplot in the chart dialog and assign the color it gets you most of the way there. But I will walk through some of the other steps.

  • TEMPORARY. and then SELECT IF – these two steps only draw a heatmap for survey responses for the roughly 100 individuals from council district 1. The EXECUTE. command at the end then closes out the TEMPORARY command.
  • Then in the inline GPL code, PAGE: begin(scale(800px,2000px)) changes the chart dimensions to be taller and skinnier than the default chart size in SPSS. Also note you need a corresponding PAGE: end() command when you use a PAGE: begin() command.
  • GUIDE: axis(dim(1), opposite()) draws the labels for the X axis on the top of the graph, instead of the bottom.
  • GUIDE: axis(dim(2), null()) prevents drawing the Y axis, which just uses the survey id to position the survey responses.
  • The SCALE: cat(aesthetic( statement maps different colors to each different survey response. Feeling-safe responses are given blues, and not-safe responses reds. I gave neutral grey and missing white as well.
  • SCALE: cat(dim(2), sort.data(), reverse()), this tells SPSS to draw the Y axis in the order in which the data are already sorted. Because I sorted the Q6 variables before I did the VARSTOCASES this sorts the responses with the most fear to the top.
  • The ELEMENT: polygon( statement just draws the squares, and then specifies to color the interior of the squares according to the Q6 variable. I give the outline of the squares a grey color, but white works nicely as well. (Black is a bit overpowering.)

So now you have the idea. But like I said, it can sometimes be hard to identify overall patterns. So sometimes I like to limit the responses in the graph. Here I make a heatmap of the full dataset (over 1,500 responses), but just look at the different types of missing data. Red is system missing in the original dataset, and black is where the survey respondent filled in "Don’t Know".

*Missing data representation.
TEMPORARY.
SELECT IF (Q6 = 9 OR Q6 = 8).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=QType ID Q6 MISSING = VARIABLEWISE
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(800px,2000px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: QType=col(source(s), name("QType"), unit.category())
  DATA: ID=col(source(s), name("ID"), unit.category())
  DATA: Q6=col(source(s), name("Q6"), unit.category())
  GUIDE: axis(dim(1), opposite())
  GUIDE: axis(dim(2), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1", color.darkred),("2", color.red),("3", color.lightgrey), 
            ("4", color.lightblue), ("5", color.darkblue), ("9", color.black), ("8", color.red)))
  ELEMENT: polygon(position(QType*ID), color.interior(Q6), color.exterior(color.grey), transparency.exterior(transparency."0.7"))
  PAGE: end()
END GPL.
EXECUTE.

You can see that system missing responses across all 6 questions happen very rarely (I only see three cases), but there are a ton of "Don’t Know" responses. Another way to simplify the data is to use small multiples for each type of response. Here is the first graph, but using a panel for each of the individual survey responses. See the COORD: rect(dim(1,2), wrap()) and then the ELEMENT statement for the updates, as well as the changes making the chart shorter and fatter and not drawing the legend.

*Small multiple.
TEMPORARY.
SELECT IF DISTRICT = 1.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=QType ID Q6
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(1000px,1000px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: QType=col(source(s), name("QType"), unit.category())
  DATA: ID=col(source(s), name("ID"), unit.category())
  DATA: Q6=col(source(s), name("Q6"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1), opposite())
  GUIDE: axis(dim(2), null())
  GUIDE: legend(aesthetic(aesthetic.color.interior), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1", color.darkred),("2", color.red),("3", color.lightgrey), 
            ("4", color.lightblue), ("5", color.darkblue), ("9", color.white), ("8", color.white)))
  SCALE: cat(dim(2), sort.data(), reverse())
  ELEMENT: polygon(position(QType*ID*Q6), color.interior(Q6), color.exterior(color.grey), transparency.exterior(transparency."0.7"))
  PAGE: end()
END GPL.
EXECUTE.

You technically do not need to reshape the data using VARSTOCASES at first to make these heatmaps (there is an equivalent VARSTOCASES command within GGRAPH you could use), but this way is simpler in my opinion. (I could not figure out a way to use multiple response sets to make these non-aggregated charts, so if you can figure that out let me know!)


The idea of a heatmap can be extended to much larger grids — basically any raster graphic can be thought of as a heatmap. But in SPSS you probably do not want to make heatmaps that are very dense. The reason is that SPSS always makes its charts in vector format; you cannot tell it to render a certain chart as a raster. So a very dense heatmap will take a long time to render. But I like to use them in some situations, as I have shown here with smaller N data in SPSS.

Also off-topic, but I may be working on a cook-book with examples for SPSS graphics. If I have not already made a blog post about it, let me know what examples you would like to see!

Some inverse distance weighting hacks – using R and spatstat

For a recent project I was mapping survey responses to attitudes towards the police, and I wanted to make a map of those responses. The typical default to accomplish this is inverse distance weighting. For those familiar with hot spot maps of crime, this is similar in that it produces a smooth isarithmic map, but instead of being a density it predicts values. For my project I wanted to explore two different things: 1) estimating the variance of the IDW estimate, and 2) exploring different weighting schemes besides the default inverse distance. The R code for my functions and data for analysis can be downloaded here.


What is inverse distance weighting?

Since this isn’t typical fodder for social scientists, I will present a simple example to illustrate.

Imagine you are a farmer and want to know where to plant corn vs. soy beans, and are using the nitrogen content of the soil to determine that. You take various samples from a field and measure the nitrogen content, but you want predictions for the areas you did not sample. So say we have four measures at various points in the field.

Nit     X   Y
1.2     0   0
2.1     0   5
2.6    10   2
1.5     6   5

From this let’s say we want to estimate the average nitrogen content at the center, 5 and 5. Inverse distance weighting is just as the name says: the weight used to estimate the average nitrogen content at the center is based on the distance between the sample point and the center. Most often people use the inverse of the distance squared as the weight. So from this we have the following weights.

Nit     X   Y Weight
1.2     0   0   1/50
2.1     0   5   1/25
2.6    10   2   1/34
1.5     6   5   1/ 1

You can see the last row is the closest point, so gets the largest weight. The weighted average of nitrogen for the 5,5 point ends up being ~1.55.

For inverse distance weighted maps, one then makes a series of weighted estimates at a regular grid over the study space. So not just an estimate at 5,5, but also 5,4|5,3|5,2 etc. And then you have a regular grid of values you can plot.
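
To make the toy example concrete, here is a quick standalone sketch in Python of the weighted estimate (the actual analysis below uses my R functions):

import numpy as np

#Soil samples from the toy example: nitrogen value and X,Y location
nit = np.array([1.2, 2.1, 2.6, 1.5])
xy = np.array([[0., 0.], [0., 5.], [10., 2.], [6., 5.]])

def idw_est(pt, power=2):
    #inverse distance to a power as the weight (assumes pt is not a sample point)
    d = np.sqrt(((xy - pt)**2).sum(axis=1))
    w = 1/d**power
    return (w*nit).sum()/w.sum()

print(idw_est(np.array([5., 5.])))  #~1.55, matching the worked example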


Example – Street Clean Scores in LA

An ok example to demonstrate this is an LA database rating streets based on their cleanliness. Some might quibble that it only makes sense to estimate street cleanliness values on streets, but I think it is ok for exploratory data analysis. Just visualizing the streets is very hard given their small width and irregularity.

So to follow along, first I load all the libraries I will be using, then set my working directory, and finally import my updated inverse distance weighted hacked functions I will be using.

library(spatstat)
library(inline)
library(rgdal)
library(maptools)
library(ncf)

MyDir <- "C:\\Users\\axw161530\\Dropbox\\Documents\\BLOG\\IDW_Variance_Bisquare\\ExampleAnalysis"
setwd(MyDir)

#My updated idw functions
source("IDW_Var_Functions.R")

Next we need to create a point pattern object spatstat can work with, so we import our street scores, which contain an X and Y coordinate for the midpoint of each street segment, as well as the boundary of the city of Los Angeles. Then we can create a marked point pattern. For reference, the street scores can range from 0 (clean) to a max of 3 (dirty).

CleanStreets <- read.csv("StreetScores.csv",header=TRUE)
summary(CleanStreets)
BorderLA <- readOGR("CityBoundary.shp", layer="CityBoundary")

#create Spatstat object and window
LA_Win <- as.owin(BorderLA)
LA_StreetPP <- ppp(CleanStreets$XMidPoint,CleanStreets$YMidPoint, window=LA_Win, marks=CleanStreets$StreetScor)

Now we can estimate a smooth inverse distance weighted map by calling my new function, idw2. This returns both the original weighted mean (equivalent to the original spatstat idw argument), but also returns the variance. Here I plot them side by side (see the end of the blog post on how I calculate the variance). The weighted mean is on the left, and the variance estimate is on the right. For the functions the rat image is the weighted mean, and the var image is the weighted variance.

#Typical inverse distance weighted estimate
idw_res <- idw2(LA_StreetPP) #only takes a minute
par(mfrow=c(1,2))
plot(idw_res$rat) #this is the weighted mean
plot(idw_res$var) #this is the weighted variance

So contrary to expectations, this does not produce a very smooth map – it is quite rough. This is partially because social science data is not going to be as regular as natural science measurements. In spatial stats jargon, street-to-street measures will have a large nugget – a clean street can be right next to a dirty one.

Here the default is using inverse distance squared – what if we just use inverse distance though?

#Inverse distance (linear)
idw_Lin <- idw2(LA_StreetPP, power=1)
plot(idw_Lin$rat)
plot(idw_Lin$var)

This is smoothed out a little more. There is essentially one dirty spot in the central eastern part of the city (I don’t know anything about LA neighborhoods). Compared to the first set of maps, the dirty streets in the northern mass of the city are basically entirely smoothed out, whereas before you could at least see little spikes.

So I was wondering if there could be better weights we could choose to smooth out the data a little more. One I have used in a few recent projects is the bisquare kernel, which I was introduced to by the geographically weighted regression folks. The bisquare kernel weight equals [1 - (d/b)^2]^2 when d < b, and zero otherwise. Here d is the distance, and b is a user-chosen distance threshold. We can make a plot to illustrate the difference in the weight functions, here using a bisquare kernel distance of 2000 meters.

#example weight functions over 3000 meters
dist <- 1:3000
idw_lin <- 1/dist      #inverse distance
idw_sq <- 1/(dist^2)   #inverse distance squared, named so it does not clobber the idw2 function
b <- 2000
bisq <- ifelse(dist < b, ( 1 - (dist/b)^2 )^2, 0)
plot(dist,idw_lin,type='l')
lines(dist,idw_sq,col='red')
lines(dist,bisq,col='blue')

Here you can see both of the inverse distance weighted lines trail to zero almost immediately, whereas the bisquare kernel trails off much more slowly. So let's check out our maps using a bisquare kernel with the distance threshold set to 2000 meters. The biSqW function is equivalent to the original spatstat idw function, but uses the bisquare kernel and returns the variance estimate as well. You just need to pass it a distance threshold for the b_dist parameter.

#BiSquare weighting, 2000 meter distance
LA_bS_w <- biSqW(LA_StreetPP, b_dist=2000)
plot(LA_bS_w$rat)
plot(LA_bS_w$var)

Here we get a map that looks more like a typical hot spot kernel density map. We can see some of the broader trends in the northern part of the city, and even see a really dirty hot spot I did not previously notice in the northeastern peninsula.

The 2,000 meter distance threshold was just ad-hoc though. How large or small should it be? A quick check of the spatial correlogram is one way to make the choice slightly more objective. Here I use the correlog function in the ncf package to estimate it. I subsample the data first, since the full dataset is too big (I presume correlog has a call to dist somewhere).

#correlogram, random sample since the full data is too big
subSamp <- CleanStreets[sample(nrow(CleanStreets), 3000), ]
fit <- correlog(x=subSamp$XMidPoint,y=subSamp$YMidPoint,z=subSamp$StreetScor, increment=100, resamp=0, quiet=TRUE)
plot(fit)

Here we can see points very nearby each other have a correlation of 0.2, and this trails off to zero before 20 kilometers (the distances here are in meters). FYI, the correlation rising back up at very large distances often occurs for data that have broader spatial trends.
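As an aside, the object correlog returns includes an interpolated x-intercept (the distance at which the correlation first crosses zero), so a semi-automatic choice of the bisquare threshold could be as simple as:

#distance at which the correlation first crosses zero
fit$x.intercept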

So let's try out a bisquare kernel with a distance threshold of 10 kilometers.

#BiSquare weighting, 10000 meter distance
LA_bS_w <- biSqW(LA_StreetPP, b_dist=10000)
plot(LA_bS_w$rat)
plot(LA_bS_w$var)

That is now a bit oversmoothed. But it allows a nicer range of potential values, as opposed to simply sticking with inverse distance weighting.


A few notes on the variance of IDW

So I hacked the idw function in the spatstat package to return the variance of the estimate as well as the actual weighted mean. This involved going into the underlying C function, so I use the inline package to create my own version. Ditto for creating the maps using the bisquare weights instead of inverse distance weighting. To quickly see those functions, here is the R code.

Given some harassment on CrossValidated by Mark Stone, I also updated the algorithm to a more numerically safe one, for both the weighted mean and the weighted variance. Note though that the Wikipedia article on weighted variance has a particular definition in mind. The correct Bessel correction for weighted data (in this case) is the sum of the weights (V1) minus the sum of the squared weights (V2) divided by V1, i.e. V1 - V2/V1. Here I just divide by V1, but that could easily be changed (I am not sure if in the sum of squares I need to worry about underflow). I.e. change the line MAT(var, ix, iy, Ny) = m2 / sumw; to MAT(var, ix, iy, Ny) = m2 / (sumw - sumw2/sumw); in the various C calls.
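Here is a minimal R sketch of that incremental algorithm (a weighted version of West 1979), mirroring what the updated C code does; w_mean_var and the bessel flag are just my own labels for illustration:

#numerically safer one-pass weighted mean and variance
w_mean_var <- function(x, w, bessel=FALSE){
    mn <- 0; m2 <- 0; sumw <- 0; sumw2 <- 0
    for (i in seq_along(x)){
        sumw <- sumw + w[i]
        sumw2 <- sumw2 + w[i]^2
        delta <- x[i] - mn
        mn <- mn + (w[i]/sumw)*delta        #update the running weighted mean
        m2 <- m2 + w[i]*delta*(x[i] - mn)   #update the running sum of squared deviations
    }
    denom <- if (bessel) sumw - sumw2/sumw else sumw  #V1, or V1 - V2/V1 with Bessel
    c(mean=mn, var=m2/denom)
}

#quick sanity check
w_mean_var(c(1,3), c(1,1))               #mean 2, population variance 1
w_mean_var(c(1,3), c(1,1), bessel=TRUE)  #sample variance 2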

Someone should also probably write in a check to prevent distances of zero. Maybe by capping the weights to never be above a certain value, although it is not trivial what the default top value should be. (If you have data on the unit square, weights above 1 would occur quite regularly, but for a large city like this projected in meters, capping the weight at 1 would be fine.)
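Something like this would work as a guard (idw_weight and the default w_max of 1 are just hypothetical choices):

#cap IDW weights so zero distances do not produce infinite weights
idw_weight <- function(d, power=2, w_max=1){
    pmin(1/d^power, w_max)  #1/0^power is Inf, pmin caps it at w_max
}
idw_weight(c(0, 1, 10))  #returns 1.00, 1.00, 0.01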

In general these variance maps did not behave like I expected them to, either with this or other data. When using Bessel's correction they tended to look even weirder. So I would need to explore some more before I go and recommend them. Probably I should not waste more time on this though, and should just fit an actual kriging model to produce the standard errors of the estimates.
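For reference, a bare-bones ordinary kriging run with the gstat package might look something like the sketch below. This is not what my functions do, just the standard alternative; grid_pts is a hypothetical grid of points to predict over, and the exponential variogram is an arbitrary choice.

library(gstat)
library(sp)

#promote the street scores to a spatial points object
coordinates(CleanStreets) <- ~XMidPoint+YMidPoint

#empirical variogram, then fit an exponential model to it
emp_vario <- variogram(StreetScor ~ 1, CleanStreets)
exp_vario <- fit.variogram(emp_vario, vgm("Exp"))

#ordinary kriging, returns predictions (var1.pred) and their variances (var1.var)
kr <- krige(StreetScor ~ 1, CleanStreets, newdata=grid_pts, model=exp_vario)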

Group based trajectory models in Stata – some graphs and fit statistics

For my advanced research design course this semester I have been providing code snippets in Stata and R. This is the first time I've really sat down and programmed extensively in Stata, and this post is a followup to produce some of the same plots and model fit statistics for group based trajectory models as this post in R. The code and the simulated data I made to reproduce this analysis can be downloaded here.

First, a note for my own records: my version of Stata is on a server here at UT Dallas, so I cannot simply run

net from http://www.andrew.cmu.edu/user/bjones/traj
net install traj, force

to install the group based trajectory code. Instead, I now have this set of code in the header of all my do files:

*let stata know to search for a new location for stata plug ins
adopath + "C:\Users\axw161530\Documents\Stata_PlugIns"
*to install on your own in the lab it would be
net set ado "C:\Users\axw161530\Documents\Stata_PlugIns"

So after running that I can install the traj command, and Stata will know where to look for it!

Once that is taken care of and the working directory is set, I can simply load in the csv file. Here my first variable was read in as ïid instead of id (I'm thinking because of the byte order mark in the csv file), so I rename that variable back to id.

*Load in csv file
import delimited GroupTraj_Sim.csv
*BOM mark makes the id variable weird
rename ïid id

Second, the traj model expects the data in wide format (which this data set already is), and has counts in count_1, count_2, ..., count_10. The traj command also wants you to input time variables though, which I do not have in this file. So I create a set of t_ variables to mimic the count periods, just going from 1 to 10.

*Need to generate a set of time variables to pass to traj, just label 1 to 10
forval i = 1/10 { 
  generate t_`i' = `i'
}

Now we can estimate our group based models, and we get a pretty nice default plot.

traj, var(count_*) indep(t_*) model(zip) order(2 2 2) iorder(0)
trajplot

Now for absolute model fit statistics, there are the average posterior probabilities, the odds of correct classification, and the observed classification proportions versus the expected classification proportions. Here I made a program that I will surely be ashamed of later (I should not brutalize the data, and should do all the calculations in matrices), but it works and produces an ugly table that gets us these stats.
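For reference, the odds of correct classification the program calculates for each group g is just the odds of that group's average posterior probability against the odds of its observed assignment proportion:

occ_g = [APP_g/(1 - APP_g)] / [p_g/(1 - p_g)]

Where APP_g is the average posterior probability among observations assigned to group g (by maximum posterior probability), and p_g is the proportion of the sample assigned to group g. A common rule of thumb is that occ values above 5 indicate the model assigns individuals to groups quite well.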

*I made a function to print out summary stats
program summary_table_procTraj
    preserve
    *now lets look at the average posterior probability
    gen Mp = 0
    foreach i of varlist _traj_ProbG* {
        replace Mp = `i' if `i' > Mp 
    }
    sort _traj_Group
    *and the odds of correct classification
    by _traj_Group: gen countG = _N
    by _traj_Group: egen groupAPP = mean(Mp)
    by _traj_Group: gen counter = _n
    gen n = groupAPP/(1 - groupAPP)
    gen p = countG/ _N
    gen d = p/(1-p)
    gen occ = n/d
    *Estimated proportion for each group
    scalar c = 0
    gen TotProb = 0
    foreach i of varlist _traj_ProbG* {
       scalar c = c + 1
       quietly summarize `i'
       replace TotProb = r(sum)/ _N if _traj_Group == c 
    }
    *This displays the group number, the count per group, the average posterior probability for each group,
    *the odds of correct classification, and the observed probability of groups versus the probability 
    *based on the posterior probabilities
    list _traj_Group countG groupAPP occ p TotProb if counter == 1
    restore
end

This should work after any model as long as the naming conventions for the assigned groups are _traj_Group and the posterior probabilities are in the variables _traj_ProbG*. So when you run

summary_table_procTraj

You get this ugly table:

     | _traj_~p   countG   groupAPP        occ       p    TotProb |
     |------------------------------------------------------------|
  1. |        1      103    .937933   43.57431   .2575   .2573651 |
104. |        2      161   .9935606   229.0434   .4025   .4064893 |
265. |        3      136   .9607248   47.48378     .34   .3361456 |

The groupAPP are the average posterior probabilities – here you can see they are all quite high. occ is the odds of correct classification, and again all the groups are quite high. p is the proportion in each group based on assigning observations to their maximum posterior probability group, and TotProb is the expected proportion in each group based on the sums of the posterior probabilities. TotProb should be the same as in the Group Membership part at the bottom of the traj output. They are close, but not exactly the same (and I do not know why that is the case).

Next, to generate a plot of the individual trajectories, I want to reshape the data to long format. I use preserve in case I want to go back to wide format later and estimate more models. If you need to look to see how the reshape command works, type help reshape at the Stata prompt. (Ditto for help on all Stata commands.)

preserve
reshape long count_ t_, i(id)

To get the behavior I want in the plot, I use a scatter plot but have them connected via c(L). Then I create small multiples for each trajectory group using the by() option. Before that I slightly jitter the count data, so the lines are not always perfectly overlapped. I make the lines thin and grey — I would also use transparency but Stata graphs do not support this.

gen count_jit = count_ + ( 0.2*runiform()-0.1 )
graph twoway scatter count_jit t_, c(L) by(_traj_Group) msize(tiny) mcolor(gray) lwidth(vthin) lcolor(gray)

I’m too lazy at the moment to clean up the axis titles and such, but I think this plot of the individual trajectories should always be done. See Breaking Bad: Two Decades of Life-Course Data Analysis in Criminology, Developmental Psychology, and Beyond (Erosheva et al., 2014).

While this fit looks good, this is not the correct number of groups given how I simulated the data. I will give those trying to find the right answer a few hints: none of the groups have a polynomial higher than 2, and there is a constant zero inflation for the entire sample, so iorder(0) will be the correct specification for the zero inflated part. If you take a stab at it, let me know and I will fill you in on how I generated the simulation.

Plotting panel data with many lines in SPSS

A quick blog post – so you all are not continually assaulted by my mug shot on the front page of the blog!

Panel data is complicated. When conducting univariate time series analysis, pretty much everyone plots the series. I presume people do not do this as often for panel data because the charts tend to be messier and less informative. But transparency and small multiple plots are easy places to start unpacking the information. Here I am going to show these using plots of arrest rates from 1970 through 2014 in New York state counties. The data and code can be downloaded here, and that zip file contains info on where the original data came from. It is all publicly available – but mashing up the historical census data for the population counts by county is a bit of a pain.

So I will start by grabbing my created dataset and then making a default plot of all the lines. The Y axis is the arrest rate per 1,000 population, and the X axis is the year.

*Grab the dataset.
FILE HANDLE data /NAME = "!!Your File Handle Here!!!".
GET FILE = "data\Arrest_WPop.sav".
DATASET NAME ArrestRates.

*Small multiple lines over time - default plot.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Total_Rate County 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Total_Rate=col(source(s), name("Total_Rate"))
  DATA: County=col(source(s), name("County"), unit.category())
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Total Arrest Rate per 1,000"))
  ELEMENT: line(position(Year*Total_Rate), split(County))
END GPL.

That is not too bad, but we can do slightly better by making the lines small and semi-transparent (which is the same advice for dense scatterplots):

*Make them transparent and smaller.
FORMATS Total_Rate (F2.0).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Total_Rate County 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Total_Rate=col(source(s), name("Total_Rate"))
  DATA: County=col(source(s), name("County"), unit.category())
  GUIDE: axis(dim(1), label("Year"))
  GUIDE: axis(dim(2), label("Total Arrest Rate per 1,000"))
  SCALE: linear(dim(1), min(1970), max(2014))
  ELEMENT: line(position(Year*Total_Rate), split(County), transparency(transparency."0.7"), size(size."0.7"))
END GPL.

This helps disentangle the bunched-up lines. There appear to be two outliers, and then basically the rest of the pack.

A quick way to check out each individual line is to make small multiples. Here I wrap the panels and make the plot size bigger. I also make the X and Y axes null. This is ok though, as I am just focusing on the shape of each trend, not the absolute level.

*Small multiples.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Year Total_Rate County 
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(1000px,1000px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Year=col(source(s), name("Year"))
  DATA: Total_Rate=col(source(s), name("Total_Rate"))
  DATA: County=col(source(s), name("County"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1), null())
  GUIDE: axis(dim(2), null())
  GUIDE: axis(dim(3), opposite())
  SCALE: linear(dim(1), min(1970), max(2014))
  ELEMENT: line(position(Year*Total_Rate*County))
  PAGE: end()
END GPL.
*Manually edited to make less space between panels.

There are a total of 62 counties in New York, so this is feasible. With panel sets of many more lines, you can either split the small multiple into more graphs, or cluster the lines based on the overall shape of the trend into different panels.
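SPSS does not make that clustering easy, but as a quick example of the logic in R, something like the below would work (just a sketch; wide_rates is a hypothetical counties-by-years matrix of the arrest rates):

#cluster trajectories by shape, wide_rates is counties x years
std_rates <- t(scale(t(wide_rates)))  #standardize each county's series
clus <- cutree(hclust(dist(std_rates), "ward.D2"), k=4)  #k=4 panels is arbitrary
table(clus)  #how many counties end up in each panel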

Here you can see that the outliers are New York county (Manhattan) and Bronx county. Bronx is a pretty straight upward trend (which mirrors many other counties), but Manhattan's trajectory is pretty unique and has a higher variance than most other places in the state. Also you can see Sullivan county has quite a high rate compared to most other upstate counties (upstate is New York talk for everything not in New York City), but it leveled off fairly early in the time series.

This dataset also has arrest rates broken down by different categories: felony (drug, violent, DWI, other) and misdemeanor (drug, DWI, property, other). It is interesting to see that arrest rates have been increasing in most places over this long time period, even though crime rates have been going down since the 1990s. They all appear to be pretty correlated, but let me know if you use this dataset to do some more digging. (It appears index crime totals can be found going back to 1990 here.)