Custom square root scale (with negative values) in ggplot2 (R)

My prior rootogram post Jon Peck made the astute comment that rootograms typically are plotted on a square root scale. (Which should of been obvious to me given the name!) The reason for a square root scale for rootograms is visualization purposes, the square root scale gives more weight to values nearby 0 and shrinks values farther away from 0.

SPSS can not have negative values on a square root scale, but you can make a custom scale using ggplot2 and the scales package in R for this purpose. Here I just mainly replicated this short post by Paul Hiemstra.

So in R, first we load the scales and the ggplot2 package, and then create our custom scale function. Obviously the square root of a negative value is not defined for real numbers, so what we do is make a custom square root function. The function simply takes the square root of the absolute value, and then multiplies by the sign of the original value. This function I name S_sqrt (for signed square root). We also make its inverse function, which is named IS_sqrt. Finally I make a third function, S_sqrt_trans, which is the one used by the scales package.

library(scales)
library(ggplot2)

S_sqrt <- function(x){sign(x)*sqrt(abs(x))}
IS_sqrt <- function(x){x^2*sign(x)}
S_sqrt_trans <- function() trans_new("S_sqrt",S_sqrt,IS_sqrt)

Here is a quick example data set in R to work with.

#rootogram example, see http://stats.stackexchange.com/q/140473/1036
MyText <- textConnection("
Dist Val1 Val2
1 0.03 0.04
2 0.12 0.15
3 0.45 0.50
4 0.30 0.24 
5 0.09 0.04 
6 0.05 0.02
7 0.01 0.01
")
MyData <- read.table(MyText,header=TRUE)
MyData$Hang <- MyData$Val1 - MyData$Val2

And now we can make our plots in ggplot2. First the linear scale, and second update our plot to the custom square root scale.

p <- ggplot(data=MyData, aes(x = as.factor(Dist), ymin=Hang, ymax=Val1)) + 
     geom_hline(aes(yintercept=0)) + geom_linerange(size=5) + theme_bw()
p

p2 <- p + scale_y_continuous(trans="S_sqrt",breaks=seq(-0.1,0.5,0.05), name="Density")
p2

Laplacian Centrality in NetworkX (Python)

The other day I read a few papers on a new algorithm for calculating centrality in networks. Below are the two papers describing the Laplacian Centrality metric. The first is for non-weighted networks, and the second for weighted networks.

  • Qi, X., Duval, R. D., Christensen, K., Fuller, E., Spahiu, A., Wu, Q., Wu, Y., Tang, W., and Zhang, C. (2013). Terrorist networks, network energy and node removal: A new measure of centrality based on laplacian energy. Social Networking, 02(01):19-31.
  • Qi, X., Fuller, E., Wu, Q., Wu, Y., and Zhang, C.-Q. (2012). Laplacian centrality: A new centrality measure for weighted networks. Information Sciences, 194:240-253. PDF Here.

The metric is fairly intuitive I think. The centrality parameter is a function of the local degree plus the degree’s of the neighbors (with different weights for each). I figured it would be a quick programming exercise (which means I spent way too long trying to implement it!). To follow is some code that replicates the measures for both weighted and non-weighted graphs, using the Python networkx library.

The non-weighted graph code is easy, and is a near copy-paste from some igraph code snippet that was already available. Just some updates to idiom’s for NetworkX specifically. The norm option specifies whether you want solely the numerator value, the difference between the energy in the full graph versus the graph with the node removed (norm=False), or whether you want to divide this value by the energy for the full graph. Note this function ignores the weights in the graph. nbunch is if you want to pass the function only a subset of points to calculate the centrality. (If you do that you might as well have norm=False for time savings as well.)

def lap_cent(graph, nbunch=None, norm=False):
  if nbunch is None:
    vs = graph.nodes()
  else:
    vs = nbunch
  degrees = graph.degree(weight=None)
  if norm is True:
    den = sum(v**2 + v for i,v in degrees.items())
    den = float(den)
  else:
    den = 1
  result = []
  for v in vs:
    neis = graph.neighbors(v)
    loc = degrees[v]
    nei = 2*sum(degrees[i] for i in neis)
    val = (loc**2 + loc + nei)/den
    result.append(val)
  return result

The weighted network is a bit more tricky though. I thought coding all of the two walks seemed a royal pain, so I developed a different algorithm that I believe is quicker. Here are three functions, but the last one is the one of interest, lap_cent_weighted. The options are similar to the unweighted version, with the exception that you can pass a weight attribute (which is by default named ‘weight’ in NetworkX graphs).

def lap_energy(graph, weight='weight'):
  degrees = graph.degree(weight=weight)
  d1 = sum(v**2 for i,v in degrees.items())
  wl = 0
  for i in graph.edges(data=True):
    wl += (i[2].get(weight))**2
  return [d1,2*wl]

def cw(graph,node,weight='weight'):
  neis = graph.neighbors(node)
  ne = graph[node]
  cw,sub = 0,0
  for i in neis:
    we = ne[i].get(weight)
    od = graph.degree(i,weight=weight)
    sub += -od**2 + (od - we)**2
    cw += we**2
  return [cw,sub]

def lap_cent_weighted(graph, nbunch=None, norm=False, weight='weight'):
  if nbunch is None:
    vs = graph.nodes()
  else:
    vs = nbunch
  if norm == True:
    fe = lap_energy(graph,weight=weight)
    den = float(fe[0]+fe[1])
  else:
    den = 1
  result = []
  for i in vs:
     d2 = graph.degree(i,weight=weight)
     w2 = cw(graph,i,weight=weight)
     fin = d2**2 - w2[1] + 2*w2[0]
     result.append(fin/den)
  return result

For a brief overview of the new algorithm (in some quick and dirty text math), to define the energy of the entire graph it is:

sum(di^2) + 2*sum(wij^2)   (1)

Where di are the degrees for all i nodes, and the second term is 2 times the sum of the weights squared. So when you take out a particular node, say ‘A’, the drop in the second term is easy, just iterate over the neighbors of ‘A’, and calculate 2*sum(waj^2), then subtract that from the second term in equation 1.

The first term is slightly more complex. First there is a decrease due to simply the degree of the removed node, da^2. There is also a decrease in the degree on the neighboring nodes as well, so you need to calculate their updated contribution. The necessary info. is available when you iterate over the neighbor list though, and if the original contribution is di^2, and the weight of wia, then the updated weight is -di^2 + (di - wia)^2. You can calculate this term at the same time you calculate the decrease in the weights in the second term.

I believe this algorithm is faster than the one originally written in the second paper. It should be something like O(n*a), where a is the average number of neighbors for the entire graph, and n are the number of nodes. Or in worst case a is the maximum number of neighbors any node has in the graph (which should be less than or equal to the max degree in a weighted graph).

Here is an example of using the functions with the small, toy example in the weighted network paper. Note that the lap_cent function ignores the weights.

import networkx as nx

Gp = nx.Graph()
ed = [('A','B',4),('A','C',2),('C','B',1),('B','D',2),('B','E',2),('E','F',1)]
Gp.add_weighted_edges_from(ed)

x = lap_cent(Gp)
xw = lap_cent_weighted(Gp, norm=True)
for a,b,c in zip(Gp.nodes(),x,xw):
  print a,b,c

Which prints out at the console:

A 18 0.7 
C 18 0.28 
B 34 0.9 
E 16 0.26 
D 10 0.22 
F 6 0.04

If you want to see that the graph is the same as the graph in the weighted Qi paper, use below.

import matplotlib.pyplot as plt
pos=nx.spring_layout(Gp) # positions for all nodes
nx.draw(Gp,pos=pos)
nx.draw_networkx_labels(Gp,pos=pos)
nx.draw_networkx_edge_labels(Gp,pos=pos)
plt.show()

Venn diagrams in R (with some discussion!)

The other day I had a set of three separate categories of binary data that I wanted to visualize with a Venn diagram (or a Euler) diagram of their intersections. I used the venneuler R package and it worked out pretty well.

library(venneuler)
MyVenn <- venneuler(c(A=74344,B=33197,C=26464,D=148531,"A&B"=11797, 
                       "A&C"=9004,"B&C"=6056,"A&B&C"=2172,"A&D"=0,"A&D"=0,"B&D"=0,"C&D"=0))
MyVenn$labels <- c("A\n22","B\n7","C\n5","D\n58")
plot(MyVenn)
text(0.59,0.52,"1")
text(0.535,0.51,"3")
text(0.60,0.57,"2")
text(0.64,0.48,"4") 

Some digging around on this topic though I came across some pretty interesting discussion, in particular a graph makeover of a set of autism diagnoses, see:

for background. Below is a recreated image of the original Venn diagram under discussion (from Kosara’s American Scientist article.)

Applying this example to the venneuler library did not work out so well.

MyVenn2 <- venneuler(c(A=111,B=65,C=94,"A&B"=62,"A&C"=77,"B&C"=52,"A&B&C"=51))
MyVenn2$labels <- c("PL-ADOS","clinician","ADI-R")
plot(MyVenn2)

Basically there is a limit on the size the intersections can be with the circles, and here the intersection of all three sets is very large, so there is no feasible solution for this example.

This is alittle bit different situation than typical for Venn diagrams though. Typically these charts all one is interested in is the overlaps between each set. But the autism graph that is secondary. What they were really interested in was the sensitivity of the different diagnostic measures (i.e. percentage identifying true positives), and to see if any particular combination had the greatest sensitivity. Although Kosara in his blog post says that all of the redesigns are better than the original I don’t entirely agree, I think Kosara’s version of the Venn diagram with the text labels does a pretty good job, although I think Kosara’s table is sufficient as well. (Kosara’s recreated graph has better labelling than the original Venn diagram, mainly by increasing the relative font size.)

For the autism graph there are basically two over-arching goals:

  • identifying the percent within arbitrary multiple intersections
  • keeping in mind the baseline N for each of the arbitrary sets

It is not immediately visually obvious, but IMO it is not that hard to arbitrarily collapse different categories in the original Venn diagram and make some rough judgements about the sensitivity. To me the first thing I look at is the center piece, see it is quite a high percentage, and then look to see if I can make any other arbitrary categories to improve upon the sensitivity of all three tests together. All others are either very small baselines or do not improve the percentage, so I conclude that all three combined likely have the most sensitivity. You may also see that the clinicians are quite high for each intersection, so it is questionable whether the two other diagnostics offer any significant improvement over just the clinicians judgement, but many of the clinician sets have quite small N’s, so I don’t put as much stock in that.

Another way to put it is if we think of the original Venn diagram as a graphical table I think it does a pretty good job. The circles and the intersections are a lie factor in the graph, in that their areas do not represent the baseline rates, but it is an intuitive way to lay out the textual categories, and only takes a little work to digest the material. Kosara’s sorted table does a nice job of this as well, but it is easier to ad-hoc combine categories in the Venn diagram than in table rows that are not adjacent. Visually the information does not pop out at you, like a functional relationship in a scatterplot, but the Venn diagram has the ingredients that allow you to drill down and estimate the information you are looking for. Being able to combine arbitrary categories is the key here, and I don’t think any of the other graphical representations allow one to do that very easily.

I thought a useful redesign would be to keep the Venn theme, but have the repeated structures show the base rate via Isotype like recurring graphs. Some of this is motivated by using such diagrams in interpreting statistics (see this post by David Spieghalter for one example, the work of Gerd Gigenzer is relevant as well). I was not able make a nice set of contained glyphs though. Here is a start of what I am talking about, I just exported the R graph into Inkscape and superimposed a bunch of rectangles.

This does not visualize the percentage, but one way to do that would be to color or otherwise distinguish the blocks in a certain way. Also I gave up before I finished the intersecting middle piece, and I would need to make the boxes a bit smaller to be able to squeeze it in. I think this idea could be made to work, but this particular example making the Venn even approximately proportional is impossible, and so sticking with the non-proportional Venn diagram and just saying it is not proportional is maybe less likely to be misleading.

I think the idea of using Isotype like repeated structures though could be a generally good idea. Even when the circles can be made to have the areas of the intersection exact, it is still hard to visually gauge the size of circles (rectangles are easier). So the multiple repeated pixels may be more useful anyway, and putting them inside of the circles still allows the arbitrary collapsing of different intersections while still being able to approximately gauge base rates.

Transforming KDE estimates from Logistic to Probability Scale in R

The other day I had estimates from several logistic regression models, and I wanted to superimpose the univariate KDE’s of the predictions. The outcome was fairly rare, so the predictions were bunched up at the lower end of the probability scale, and the default kernel density estimates on the probability scale smeared too much of the probability outside of the range.

It is a general problem with KDE estimates, and there are two general ways to solve it:

  • truncate the KDE and then reweight the points near the edge (example)
  • estimate the KDE on some other scale that does not have a restricted domain, and then transform the density back to the domain of interest (example)

The first is basically the same as edge correction in spatial statistics, just in one dimension instead of the two. Here I will show how to do the second in R, mapping items on the logistic scale to the probability scale. The second linked CV post shows how to do this when using the log transformation, and here I will show the same with mapping logistic estimates (e.g. from the output of a logistic regression model). This requires the data to not have any values at 0 or 1 on the probability scale, because these will map to negative and positive infinity on the logistic scale.

In R, first define the logit function as log(p/(1-p) and the logistic function as 1/(1+exp(-x)) for use later:

logistic <- function(x){1/(1+exp(-x))}
logit <- function(x){log(x/(1-x))}

We can generate some fake data that might look like output from a logistic regression model and calculate the density object.

set.seed(10)
x <- rnorm(100,0,0.5)
l <- density(x)  #calculate density on logit scale

This blog post goes through the necessary math, but in a nut shell you can’t simply just transform the density estimate using the same function, you need to apply an additional transformation (referred to as the Jacobian). So here is an example transforming the density estimate from the logistic scale, l above, to the probability scale.

px <- logistic(l$x)  #transform density to probability scale
py <- l$y/(px*(1-px))
plot(px,py,type='l')

To make sure that the area does sum to one, we can superimpose the density calculated on the data transformed to the probability scale. In this example of fake data the two are pretty much identical. (Black line is my transformed density, and the red is the density estimate based on the probability data.)

dp <- density(logistic(x)) #density on the probability values to begin with
lines(dp$x,dp$y,col='red')

Here is a helper function, denLogistic, to do this in the future, which simply takes the data (on the logistic scale) and returns a density object modified to the probability scale.

logistic <- function(x){1/(1+exp(-x))}
logit <- function(x){log(x/(1-x))}
denLogistic <- function(x){
  d <- density(x)
  d$x <- logistic(d$x)
  d$y <- d$y/(d$x*(1-d$x))
  d$call <- 'Logistic Density Transformed to Probability Scale'
  d$bw <- paste0(signif(d$bw,4)," (on Logistic scale)")
  return(d)
}

In cases where more of the probability density is smeared beyond 0-1 on the probability scale, the logistic density estimate will look different. Here is an example with a wider variance and more predictions near zero, so the two estimates differ by a larger amount.

lP <- rnorm(100,-0.9,1)
test <- denLogistic(lP)
plot(test)
lines(density(logistic(lP)),col='red')

Again, this works well for data on the probability scale that can not be exactly zero or one. If you have data like that, the edge correction type KDE estimators are better suited.

New working paper: What We Can Learn from Small Units of Analysis

I’ve posted a new working paper, What We Can Learn from Small Units of Analysis to SSRN. This is a derivative of my dissertation (by the same title). Below is the abstract:

This article provides motivation for examining small geographic units of analysis based on a causal logic framework. Local, spatial, and contextual effects are confounded when using larger units of analysis, as well as treatment effect heterogeneity. I relate these types of confounds to all types of aggregation problems, including temporal aggregation, and aggregation of dependent or explanatory variables. Unlike prior literature critiquing the use of aggregate level data, examples are provided where aggregation is unlikely to hinder the goals of the particular research design, and how heterogeneity of measures in smaller units of analysis is not a sufficient motivation to examine small geographic units. Examples of these confounds are presented using simulation with a dataset of crime at micro place street units (i.e. street segments and intersections) in Washington, D.C.

As always, if you have comments or critiques let me know.

Some ad-hoc fuzzy name matching within Police databases

A repeated annoying task I have had to undertake is take a list of names and date-of-births and match them to a reference set. This can happen when you try to merge data from different sources. Or when working with police RMS data, a frequent problem is the same individual can go into the master name list multiple times. This can be either due to data error, or someone being malicious and providing the PD with fraudulent data.

Most often I am trying to match a smaller set of names to a bigger set from a police RMS system. Typically what I do is grab all of the names in the police RMS, make them the same case, and then simply sort the file by last then first name. Then I typically go through one by one from the smaller file and identify the name ID’s that are in the sorted bigger police database. Ctrl-F can make it a quick search for only a few people.

This works quite well for small numbers, but I wanted to see if I could make some simple rules when I need to match a larger list. For example, the above workflow may be fine if I need to look up 10 names quickly, but say you want to eliminate duplicates in the entire PD RMS system? A manually hand search through 100,000+ names is crazy (and will be out of date before you finish).

A tool I’ve used in the past (and would recommend) is FRIL, Fine-grained records integration and linkage tool. In a nutshell the way that tool works is that you can calculate string distances between names and/or date distances between date-of-births from two separate files. Then you specify how close you want the records to be to either automatically match the records or manually view and make a personal determination if the two records are the same person. FRIL has a really great interface to quickly view the suggested matches and manually confirm or reject certain matches.

FRIL uses the Potter Stewart I know it when I see it approach to finding matches. There is no ground truth, you just use your best judgement whether two names belong to the same person, and FRIL uses some metrics to filter out the most unlikely candidates. I have a bit of a unique strategy though to identify typical string and date of birth differences in fuzzy name matches by using Police RMS data itself. Police RMS’s tend to have a variety of people who are linked up to the same master name index value, but for potentially several reasons they have various idiosyncrasies among different individual incidents. This allows me to calculate distances within persons, so a ground truth estimate, and then I can evaluate different distances compared to a control sample to see how well they the metrics discriminate matches.

There can be several reasons for slightly different data among an individuals incidents in a police RMS, but that they end up being linked to the same person. One is that frequently RMS systems incorporate tables from several different data sources, dispatchers have their own CAD system, the PD has a system to type in paper records, custodial arrests/finger printing may have another system, etc. Merging this data into one RMS may simply cause differences in how the data is stored or even how particular fields tend to be populated in the database. A second reason is that individuals can be ex-ante associated to a particular master name index, but can still have differences is various person fields for any particular incident.

One simple example is that for an arrest report the offender may have an old address in the system, so the officer types in the new address. The same thing can happen to slight name changes or DOB changes. The master name index should update with the most recent info, but you have a record trail of all the minor variations through each incident. Depending on the type of involvement in an incident has an impact on what information is collected and the quality of that information as well. For example, if I was interviewed as a witness to a crime, I may just go down in the report as Andy Wheeler with no date of birth info. If I was arrested, someone would take more time to put in my full name, Andrew Wheeler, and my date of birth. But if the original person inputting the data took the time, they would probably realize I was the same person and associate me with the same master ID.

So I can look at these within ID changes to see the typical distances. What I did was take a name database from a police department I work with, make all pair-wise comparisons between unique names and date of births, and then calculate several string distances between the names and the date differences between listed DOB’s. I then made a randomly matched sample for a comparison group. For the database I was working with this ended up being over 100,000 people with the same ID, but different names/DOB’s somewhere in different incidents, with an average of between 2~3 different names/dob’s per person (so a sample of nearly 200,000 same name comparisons, two names only results in one comparison, but three names results in 3 comparisons). My control sample took one of these names person and matched another random person in the database as a control group, so I have a control group sample of over 100,000 cases.

The data I was working with is secondary, so the names were already aggregated to Last, First Middle. If I had the original database I could do distances for the individual fields (and probably not worry about the middle name) but it somewhat simplifies the analysis as well. Here are some histograms of the Levenshtein distance between the name strings for the same person and random samples. The Levenshtein distance is the number of single edits it takes to transform one string to another string, so 0 would be the same word.

Part of the reason the distances within the same name have such a long tail is because of the already aggregated data. There end up being some people with full middle names, some with middle initials, and some with no middle names at all. So what I did was calculate a normalized Levenshtein distance based on the max and min possible values (listed at the Wikipedia page) the string can take given the size of the two input strings. The minimum value is the difference in the length of the two strings, the maximum is the length of the longest string. So then I calculate NormLevenDist = (LevenDist - min)/(max - min). This would cause Wheeler, Andy P and Wheeler, Andy Palmer to have a normalized distance of zero, whereas the edit distance would be 5. So in these histograms you can see even more discrimination between the two classes, based mainly on such names being perfect subsets of other names.

If you eliminate the 0’s in the normalized distance, you can get a better look at the shapes of each distribution. There is no clear cut-off between the samples, but there is a pretty clear difference in the distributions.

I also calculated the Jaro-Winkler and the Dice (bi-gram) string distances. All four of these metrics had a fairly high correlation, around 0.8 with one another, and all did pretty well classifying the same ID’s according to their ROC curves.

If I wanted to train a classifier as accurately as possible, I would use all of these metrics and probably make some sort of decision tree (or estimate their effects via logistic regression), but I wanted to make a simple function (since it will be doing quite a few comparisons) that calculates as few of the metrics as possible, so I just went with the normalized Levenshtein distance here. Jaro-Winkler would probably be more competitive if I had the separate first and last names (and played around with the weights for the beginning and ends of the strings). If you had mixed strings, like some are First Middle Last and others are Last First I suspect the dice similarity would be the best.

But in the end I think all of the string metrics will do a pretty similar job for this input data, and the normalized Levenshtein distance will work pretty well so I am going to stick with that. (I don’t consider soundex matching here, I’ve very rarely come across an example where soundex would match but had a high edit distance for names, e.g. typos are much more common than intentional mis-spellings based on enunciation I believe, and even the intentional mis-spellings tend to have a small edit distance.)

Now looking at the absolute differences in the DOB’s (where both are available) provides a bit of a different pattern. Here is the histograms

But I think the easiest illustration of this is to examine the frequency table of the most common day differences for the same individuals.

Obviously zero is the most common, but you can see a few other bumps that illustrate the nature of data mistakes. The second is 365 days – exactly off by one year. Also in the top 10 are 366, 731 & 730 – off by either a year and a day or two years. 1096, 1095, 1461, 1826, are examples of these yearly cycles as well. 36,525 are examples of being off by a century! Somewhere along the way some of the DOB fields were accidentally assigned to dates in the future (such as 2032 vs. 1932). The final examples are off by some other number typical of a simple typo, such as 10, 20, or by the difference in one month 30,31,27. By the end of this table of PDF of the same persons is smaller than the PDF of the control sample.

I also calculated string distances by transforming the DOB’s to mm/dd/yy format, but when incorporating the yearly cycles and other noted mistakes they did not appear to offer any new information.

So based on this information, I made a set of ad-hoc rules to classify matched names. I wanted to keep the false positive rate at less than 1 in 1,000, but make the true positive as high as possible. The simple rules I came up with were:

  • If a normalized Levenshtein distance of less than 0.2, consider a match
  • If a normalized Levenshtein distance of less than 0.4 and a close date, consider a match.

Close dates are defined as:

  • absolute difference of within 10 days OR
  • days apart are 10,20,27,30,31 OR
  • the number of days within a yearly cycle are less than 10

This match procedure produces the classification table below:

The false positive rate is right where I wanted it to be, but the true positive is a bit lower than I hoped. But it is a simple tool though to implement, and built into it you can have missing data for a birthday.

It is a bit hard to share this data and provide reproducible code, but if you want help doing something similar with your own data just shoot me an email and I will help. This was all done in SPSS and Python (using the extended transforms python code). In the end I wanted to make a simple Python function to use with the FUZZY command to automatically match names.

Tables and Graphs paper rejection/update – and on the use of personal pronouns in scientific writing

My paper, Tables and Graphs for Monitoring Temporal Crime Patterns was recently rejected from Policing: An International Journal of Police Strategies & Management. I’ve subsequently updated the SSRN draft based on feedback from the review, and here I post the reviews and my responses to those reviews (in the text file).

One of the main critiques by both reviewers was that the paper was too informal, mainly because of the use of "I" in the paper. I use personal pronouns in writing intentionally, despite typical conventions in scientific writing, so I figured a blog post about why I do this is in order. I’ve been criticized for it on other occasions as well, but this is the first time it was listed as a main reason to reject an article of mine.

My main motivation comes from Michael Billig’s book Learn to Write Badly: How to Succeed in the Social Sciences (see a prior blog post I wrote on the contents). In a nut-shell, when you use personal pronouns it is clear that you, the author, are doing something. When you rewrite the sentence to avoid personal pronouns, you often obfuscate who the actor is in a particular sentence.

For an example of Billig’s point that personal pronouns can be more informative, I state in the paper:

I will refer to this metric as a Poisson z-score.

I could rewrite this sentence as:

This metric will be referred to as a Poisson z-score.

But that is ambiguous as to its source. Did someone else coin this phrase, and I am borrowing it? No – it is a phrase I made up, and using the personal pronoun clearly articulates that fact.

Pretty much all of the examples where I eliminated first person in the updated draft were of the nature,

In this article I discuss the use of percent change in tables.

which I subsequently changed to:

This article discusses the use of percent changes as a metric in tables.

Formal I suppose, but insipid. All rewriting the sentence to avoid the first person pronoun does is make the article seem like a sentient being, as well as forces me to use the passive tense. I don’t see how the latter is better in any way, shape, or form – yet this is one of the main reasons my paper is rejected above. The use of "we" in academic articles seems to be more common, but using "we" when there is only one author is just silly. So I will continue to use "I" when I am the only author.

Presentation at IACA 2015: An Exploratory Network Analysis of Hot People and Places

My work was accepted at the 2015 International Association of Crime Analysts Conference, so I will be going to Denver this fall to present. These are "NIJ Research Track" presentations, but basically took the place of the old MAPS conference presentations. So on Thursday at 2:30 (Panel 9) I will be presenting. The title of the presentation is An Exploratory Network Analysis of Hot People and Places, and below is the abstract for the talk:

Intelligence led policing practices focus on chronic offenders and hot spots of crime. I examine the connections between these hot people and hot places by considering micro places (street segments and intersections) and people as nodes in an interconnected network. I focus on whether hot people tend to have a finite set of locations they congregate, and whether hot places have unique profiles of chronic offenders. The end goal is to identify if observed patterns can help police combine targeted enforcement of hot people and hot places in one overarching strategy.

This is still a work in progress, but here is a quick preview. This graph is a random sample of offender footprints in each panel.

Also during this slot there are two other presentations, below is their info:

Title: Offender Based Investigations: A Paradigm Shift in Policing, Author: Chief James W. Buie, Gaston County Police

Abstract: Routine activity theory and research conducted by Wiles, P & Costello, A. (2000) ‘The Road to Nowhere’ prove criminals 1) commit crimes where comfortable and 2) are very likely to re-offend. Therefore it makes sense to elevate the importance of offender locations in relation to crimes. Our focus on the offender is called Offender Based Investigations (OBI). We’re using GIS to plot not only crimes but criminals as well. Our experience over the past seven years of utilizing OBI has proven that mapping offenders plays a critical role in solving crimes.

Title: A Spatial Network Analysis of Gangs Author: Davin Hall, Crime Analysis Specialist for the Greensboro PD

Abstract: Gangs are characterized by the associations between members and the territory that members operate in. Many representations of gang territory and gang networks are separate from one another; a map shows the territory while a network map shows linkages between individuals. This presentation demonstrates the combination of territory and network within a single map, for both gangs and gang members. This analysis shows law enforcement a clearer picture of the complex relationships that occur between and within gangs.

You can see the full agenda here. I really enjoyed the presentations last year at Seattle, and this year looks great as well. If you want any more info. on my particular work feel free to send me an email, or if you want to get together at Denver this fall.

Favorite maps and graphs in historical criminology

I was reading Charles Booth’s Life and Labour of the People in London (available entirely at Google books) and stumbled across this gem of a connected dot plot (between pages 18-19, maybe it came as a fold out in the book?)

(You will also get a surprise of the hand of the scanner in the page prior!) This reminded me I wanted to make a collection of my favorite historical examples of maps and graphs for criminology and criminal justice. If you read through Calvin Schmid’s Handbook of Graphical Presentation (available for free at the internet archive) it was a royal pain to create such statistical graphics by hand before computers. It makes you appreciate the effort all that much more, and many of the good ones will rival the quality of any graphic you can make on the computer.

Calvin Schmid himself has some of my favorite example maps. See for instance this gem from Urban Crime Areas: Part II (American Sociological Review, 1960):

The most obvious source of great historical maps in criminology though is from Shaw and McKay’s Juvenile Delinquency in Urban Areas. It was filled with incredible graphs and maps throughout. Here are just a few examples. (These shots are taken from the second edition in 1969, but they are all from the first part of the book, so were likely in the 1942 edition):

Dot maps

Aggregated to grid cells

The concentric zonal model

And they even have some binned scatterplots to ease in calculating linear regression equations

Going back further, Friendly in A.-M. Guerry’s moral statistics of France: Challenges for multivariable spatial analysis has some examples of Guerry’s maps and graphs. Besides choropleth maps, Guerry has one of the first examples of a ranked bumps chart (as later coined by Edward Tufte) of the relative rankings of the counts of crime at different ages (1833):

I don’t have access to any of Quetelet’s historical maps, but Cook and Wainer in A century and a half of moral statistics in the United Kingdom: Variations on Joseph Fletcher’s thematic maps have examples of Joseph Fletcher’s choropleth maps (as of 1849):

Going to more recent mapping examples, the Brantingham’s most notable I suspect is their crime pattern nodes and paths diagram, but my favorites are the ascii glyph contour maps in Crime seen through a cone of resolution (1976):

The earliest example of a journey-to-crime map I am aware of is Capone and Nichols Urban structure and criminal mobility (1976) (I wouldn’t be surprised though if there are earlier examples)

Besides maps, one other famous criminology graphic that came to mind was the age-crime curve. This is from Age and the Explanation of Crime (Hirschi and Gottfredson, 1983) (pdf here). This I presume was made with the computer – although I imagine it was still a pain in the butt to do it in 1983 compared to now! Andresen et al.’s reader Classics in Environmental Criminology in the Quetelet chapter has an age crime curve recreated in it (1842), but I will see if I can find an original scan of the image.

I will admit I have not read Wolfgang’s work, but I imagine he had graphs of the empirical cumulative distribution of crime offenses somewhere in Delinquency in a Birth Cohort. But William Spelman has many great examples of them for both people and places. Here is one superimposing the two from Criminal Careers of Public Places (1995):

Michael Maltz has spent much work on advocating for visual presentation as well. Here is an example from his chapter, Look Before You Analyze: Visualizing Data in Criminal Justice (pdf here) of a 2.5d kernel density estimate. Maltz discussed this in an earlier publication, Visualizing Homicide: A Research Note (1998), but the image from the book chapter is nicer.

Here is an album with all of the images in this post. I will continue to update this post and album with more maps and graphs from historical work in criminology as I find them. I have a few examples in mind — I plan on adding a multivariate scatterplot in Don Newman’s Defensible Space, and I think Sampson’s work in Great American City deserves to be mentioned as well, because he follows in much of the same tradition as Shaw and McKay and presents many simple maps and graphs to illustrate the patterns. I would also like to find the earliest network sociogram of crime relationships. Maltz’s book chapter has a few examples, and Papachristo’s historical work on Al Capone should be mentioned as well (I thought I remembered some nicer network graphs though in Papachristos’s book chapter in the Morselli reader).

Let me know if there are any that I am missing or that you think should be added to the list!

How wide to make the net in actuarial tools? (false positives versus false negatives)

An interesting debate/question came up in my work recently. I conducted an analysis of a violence risk assessment tool for a police department. Currently the PD takes around the top 1,000 scores of this tool, and then uses further intelligence and clinical judgements to place a small number of people on a chronic offender list (who are then subject to further interventions). My assessment of the predictive validity when examining ROC curves suggested the tool does a pretty good job discriminating violent people up to around the top 6,000 individuals and after that flattens out. In a sample of over 200,000, the top 1000 scores correctly classified 30 of the 100 violent cases, and the top 6000 classified 60.

So the question came up should we recommend that the analysts widen the net to the top 6,000 scores, instead of only examining the top 1,000 scores? There are of course costs and limitations of what the analysts can do. It may simply be infeasible for the analysts to review 6,000 people. But how do you set the limit? Should the clinical assessments be focused on even fewer individuals than 1,000?

We can make some estimates of where the line should be drawn by setting weights for the cost of a false positive versus a false negative. Implicit in the whole exercise of predicting violence in a small set of people is that false negatives (failing to predict someone will be violent when they are) greatly outweigh a false positive (predicting someone will be violent but they are not). The nature of the task dictates that you will always need to have quite a few false positives to classify even a few true positives, and no matter what you do there will only be a small number of false negatives.

Abstractly, you can place a value on the cost of failing to predict violence, and a cost on the analysts time to evaluate cases. In this situation we want to know whether the costs of widening the net to 6,000 individuals are less than the costs of only examining the top 1,000 individuals. Here I will show we don’t even need to know what the exact cost of a false positive or a false negative is, only the relative costs, to make an estimate about whether the net should be cast wider.

The set up is that if we only take the top 1,000 scores, it will capture 30 out of the 100 violent cases. So there will be (100 – 30) false negatives, and (1000 – 30) false positives. If we increase the scores to evaluate the top 6,000, it will capture 60 out the 100 violent cases, but then we will have (6000 – 60) false positives. I can not assign a specific number to the cost of a false negative and a false positive. So we can write these cost equations as:

1) (100 - 30)*FN + (1000 - 30)*FP = Cost Low
2) (100 - 60)*FN + (6000 - 60)*FP = Cost High

Even though we do not know the exact cost of a false negative, we can talk about relative costs, e.g. 1 false negative = 1000*false positives. There are too many unknowns here, so I am going to set FP = 1. This makes the numbers relative, not absolute. So with this constraint the reduced equations can be written as:

1) 70*FN +  970 = Cost Low
2) 40*FN + 5940 = Cost High

So we want to know the ratio at which there is a net benefit over including the top 6,000 scores versus only the top 1,000. So this means that Cost High < Cost Low. To figure out this point, we can subtract equation 2 from equation 1:

3) (70 - 40)*FN - 4970 = Cost Low - Cost High

If we set this equation to zero and solve for FN we can find the point where these two equations are equal:

30*FN - 4970 = 0
30*FN = 4970
FN = 4970/30 = 165 + 2/3

If the value of a false negative is more than 166 times the value of a false positive, Cost Low - Cost High will be positive, and so the false negatives are more costly to society relative to the analysts time spent. It is still hard to make guesses as to whether the cost of violence to society is 166 times more costly than the analysts time, but that is at least one number to wrap your head around. In a more concrete example, such as granting parole or continuing to be incarcerated, given how expensive prison is net widening (with these example numbers) would probably not be worth it. But here it is a bit more fuzzy especially because the analysts time is relatively inexpensive. (You also have to guess how well you can intervene, in the prison example incarceration essentially reduces the probability of committing violence to zero, whereas police interventions can not hope to be that successful.)

As long as you assume that the classification rate is linear within this range of scores, the same argument holds for net widening any number. But in reality there are diminishing returns the more scores you examine (and 6,000 is basically where the returns are near zero). If you conduct the same exercise between classifying zero and the top 1,000, the rate of the cost of a false negative to a false positive needs be 32+1/3 to justify evaluating the top 1,000 scores. If you actually had an estimate of the ratio of the cost of false positives to false negatives you could then figure out exactly how wide to make the net. But if you think the ratio is well above 166, you have plenty of reason to widen the net to the larger value.

Follow

Get every new post delivered to your Inbox.

Join 60 other followers