# Monitoring homicide trends paper published

My paper, Monitoring Volatile Homicide Trends Across U.S. Cities (with coauthor Tom Kovandzic) has just been published online in Homicide Studies. Unfortunately, Homicide Studies does not give me a link to share a free PDF like other publishers, but you can either grab the pre-print on SSRN or always just email me for a copy of the paper.

They made me convert all of the charts to grey scale :(. Here is an example of the funnel chart for homicide rates in 2015.

And here are example fan charts I generated for a few different cities.

As always if you have feedback or suggestions let me know! I posted all of the code to replicate the analysis at this link. The prediction intervals can definately be improved both in coverage and in making their length smaller, so I hope to see other researchers tackling this as well.

# Poster presentations should have a minimum font size of 25 points

A fairly generic problem I’ve been trying to do some research on is how large should fonts be for posters and PowerPoint presentations. The motivation is my diminishing eyesight over the years, and in particular default labels for statistical graphics are almost always too small in my opinion. Projected presentations just exacerbate the problem.

First, to tackle the project we need to find research about the the sizes that individuals can comfortably read letters. You don’t measure size of letters in absolute distance terms though, you measure it in the subtended angle that an object commands in your vision. That is, it is both a function of the height of the letters as well as the distance you are away from the object. I.e. in the below diagram angle A is larger than angle B.

The best guide for the size of this angle I have found for letters is an article by Sidney Smith, Letter Size and Legibility. Smith (1979) had a set of students make various labels and then have people stand too far away to be able to read them. Then the participants walked towards the labels until they could read them. Here is the histogram of those subtended angles (in radians) Smith produced:

From this Smith gives the recommendation as 0.007 radians as a good bet for pretty much everyone to be able to read the text. My research into other recommendations (eye tests, highway symbols) tends to be smaller, and between mine and Smith’s other sources tends to produce a range of 0.003 to 0.010 radians. Personal experimentation for me is that 0.007 is a good size, although up to 0.010 is not uncomfortably large. Most everyone with corrective vision can clearly see under 0.007, but we shouldn’t be making our readers strain to read the text.

For comparison, I sit approximately 22 inches away from my computer screen. A subtended angle of 0.007 produces a font size of just over 11 points at that distance. At my usual sitting distance I can read fonts down to 7 points, but I would prefer not to under usual circumstances.

This advice can readily translate to font sizes in poster presentations, since there is a limited range in which people will attempt to read them. Block’s (1996) suggestion that most people are around 4 feet away when they read a poster seems pretty reasonable to me, and so this produces a letter height of 0.34 inches needed to correspond to a 0.007 subtended angle. One point of font is 1/72 inches in letter height, so this converts to a 25 point font as the minimum to which most individuals can comfortably read the words in a poster. (R Functions at the end of the post for conversions, although it is based on relatively simple geometry.)

This advice is larger than Block’s (which is 20 point), but fits in line with Colin Purrington’s templates, which use 28 point for the smallest font. Again note that this is the minimum font for the poster, things like titles and author names should clearly be larger than the minimum to create a hierarchy. Again a frequent problem are axis labels for statistical graphics.

It will take more work to extend this advice to projected presentations, since there is more variability in projected sizes as well as rooms. So if you see a weirdo with a measuring tape at the upcoming ASC conference, don’t be alarmed, I’m just collecting some data!

Here are some R functions, the first takes a height and distance and return the subtended angle (in radians). The second takes the distance and radians to produce a height.

``````visual_angleR <- function(H,D){
x <- 2*atan(H/(2*D))
return(x)
}

return(x)
}``````

Since a point of font is 1/72 of an inch, the code to calculate the recommended font size is `visual_height(D=48,Rad=0.007)*72` and I take the ceiling of this value for the 25 point recommendation.

# Venn diagrams in R (with some discussion!)

The other day I had a set of three separate categories of binary data that I wanted to visualize with a Venn diagram (or a Euler) diagram of their intersections. I used the `venneuler` R package and it worked out pretty well.

``````library(venneuler)
MyVenn <- venneuler(c(A=74344,B=33197,C=26464,D=148531,"A&B"=11797,
"A&C"=9004,"B&C"=6056,"A&B&C"=2172,"A&D"=0,"A&D"=0,"B&D"=0,"C&D"=0))
MyVenn\$labels <- c("A\n22","B\n7","C\n5","D\n58")
plot(MyVenn)
text(0.59,0.52,"1")
text(0.535,0.51,"3")
text(0.60,0.57,"2")
text(0.64,0.48,"4") ``````

Some digging around on this topic though I came across some pretty interesting discussion, in particular a graph makeover of a set of autism diagnoses, see:

for background. Below is a recreated image of the original Venn diagram under discussion (from Kosara’s American Scientist article.)

Applying this example to the `venneuler` library did not work out so well.

``````MyVenn2 <- venneuler(c(A=111,B=65,C=94,"A&B"=62,"A&C"=77,"B&C"=52,"A&B&C"=51))
plot(MyVenn2)``````

Basically there is a limit on the size the intersections can be with the circles, and here the intersection of all three sets is very large, so there is no feasible solution for this example.

This is alittle bit different situation than typical for Venn diagrams though. Typically these charts all one is interested in is the overlaps between each set. But the autism graph that is secondary. What they were really interested in was the sensitivity of the different diagnostic measures (i.e. percentage identifying true positives), and to see if any particular combination had the greatest sensitivity. Although Kosara in his blog post says that all of the redesigns are better than the original I don’t entirely agree, I think Kosara’s version of the Venn diagram with the text labels does a pretty good job, although I think Kosara’s table is sufficient as well. (Kosara’s recreated graph has better labelling than the original Venn diagram, mainly by increasing the relative font size.)

For the autism graph there are basically two over-arching goals:

• identifying the percent within arbitrary multiple intersections
• keeping in mind the baseline N for each of the arbitrary sets

It is not immediately visually obvious, but IMO it is not that hard to arbitrarily collapse different categories in the original Venn diagram and make some rough judgements about the sensitivity. To me the first thing I look at is the center piece, see it is quite a high percentage, and then look to see if I can make any other arbitrary categories to improve upon the sensitivity of all three tests together. All others are either very small baselines or do not improve the percentage, so I conclude that all three combined likely have the most sensitivity. You may also see that the clinicians are quite high for each intersection, so it is questionable whether the two other diagnostics offer any significant improvement over just the clinicians judgement, but many of the clinician sets have quite small N’s, so I don’t put as much stock in that.

Another way to put it is if we think of the original Venn diagram as a graphical table I think it does a pretty good job. The circles and the intersections are a lie factor in the graph, in that their areas do not represent the baseline rates, but it is an intuitive way to lay out the textual categories, and only takes a little work to digest the material. Kosara’s sorted table does a nice job of this as well, but it is easier to ad-hoc combine categories in the Venn diagram than in table rows that are not adjacent. Visually the information does not pop out at you, like a functional relationship in a scatterplot, but the Venn diagram has the ingredients that allow you to drill down and estimate the information you are looking for. Being able to combine arbitrary categories is the key here, and I don’t think any of the other graphical representations allow one to do that very easily.

I thought a useful redesign would be to keep the Venn theme, but have the repeated structures show the base rate via Isotype like recurring graphs. Some of this is motivated by using such diagrams in interpreting statistics (see this post by David Spieghalter for one example, the work of Gerd Gigenzer is relevant as well). I was not able make a nice set of contained glyphs though. Here is a start of what I am talking about, I just exported the R graph into Inkscape and superimposed a bunch of rectangles.

This does not visualize the percentage, but one way to do that would be to color or otherwise distinguish the blocks in a certain way. Also I gave up before I finished the intersecting middle piece, and I would need to make the boxes a bit smaller to be able to squeeze it in. I think this idea could be made to work, but this particular example making the Venn even approximately proportional is impossible, and so sticking with the non-proportional Venn diagram and just saying it is not proportional is maybe less likely to be misleading.

I think the idea of using Isotype like repeated structures though could be a generally good idea. Even when the circles can be made to have the areas of the intersection exact, it is still hard to visually gauge the size of circles (rectangles are easier). So the multiple repeated pixels may be more useful anyway, and putting them inside of the circles still allows the arbitrary collapsing of different intersections while still being able to approximately gauge base rates.

# Using funnel charts for monitoring rates

For a recent project of mine I was exploring arrest rates at small units of analysis (street midpoints and intersections) for one of the collaborations we have at the Finn Institute.

One of the graphs I was interested in making was a funnel plot of the arrest rates on the Y axis and the total number of stops on the X axis. You can then draw confidence bands around a particular percentage (typically the overall rate) to see if any observations have oddly high or low rates. This procedure is described in Funnel plots for comparing institutional performance (Spiegelhalter, 2005), PDF Here. Funnel charts are more common in meta-analysis, but I hope to illustrate their utility here in monitoring rates with different denominators.

For an illustration I will make a funnel plot of the homicide rates per 100,000 given by the police agencies in New York and Pennsylvania in 2012 (available via the UCR data tool). Here is the data and code available to replicate the results in this blog post in a zip file. If you have the `SPSSINC GETURI DATA` add-on you can start just by calling (otherwise you have to download the data to follow along).

``````SPSSINC GETURI DATA
URI="https://dl.dropboxusercontent.com/u/3385251/NY_PA_crimerates_2012.sav"
FILETYPE=SAV DATASET=NY_PA.``````

I start at the point in the code where the data is already loaded, and I generate a scatterplot for `Population` and `Murder_and_nonnegligent_manslaughter_rate`. I eliminate agencies that did not report a full 12 months or had missing data for homicides and/or the population, and this results in around 360 homicide rates for populations above 10,000 and up nearly 10 million (in New York City). The labels for SPSS graphs inherit the format for the original data, so I make the homicide rates `F2.0` to avoid decimal places in the axis tick mark labels.

``````FORMATS Murder_and_nonnegligent_manslaughter_rate (F2.0).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Population Murder_and_nonnegligent_manslaughter_rate
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Population=col(source(s), name("Population"))
DATA: Murder_and_nonnegligent_manslaughter_rate=col(source(s),
name("Murder_and_nonnegligent_manslaughter_rate"))
GUIDE: axis(dim(1), label("Population"))
GUIDE: axis(dim(2), label("Murder Rate per 100,000"))
SCALE: log(dim(1))
ELEMENT: point(position(Population*Murder_and_nonnegligent_manslaughter_rate))
END GPL.``````

So we can see that the bulk of the institutions have very low murder rates, the overall rate is just shy of 6 murders per 100,000 in this sample. But there are quite a few locations that appear to be high outliers. We are talking about proportions as small as `0.00006` to `0.00065` though, so are some of the high murder rate locations really that surprising? Lets see by adding an estimate of the error if the individual rates were drawn from the same distribution as the overall rate.

Now to make the funnel chart we need to estimate the confidence interval for a rate of `6/100,000` for population sizes between 10,000 and 10 million to add to our chart. We can add these interpolated bounds directly into our SPSS dataset. So to estimate the lines what I do is use the `AGGREGATE` command to interpolate population step sizes for the min and max population in the sample (and calculate the total homicide rate). Here I interpolate the population steps on the log scale, as the X axis in the chart uses a log scale.

``````*Aggregate proportions and max/min counts.
COMPUTE Id = \$casenum - 1.
/BREAK
/MaxPop = MAX(Population)
/MinPop = MIN(Population)
/AllHom = SUM(Murder_and_nonnegligent_Manslaughter)
/TotalPop = SUM(Population)
/TotalObs = MAX(Id).
*Overall homicide rate per 100,000.
COMPUTE HomRate = (AllHom/TotalPop)*100000.
*Now interpolating bounds for the population.
COMPUTE #Step = (LG10(MaxPop) - LG10(MinPop))/TotalObs.
COMPUTE N = 10**(LG10(MinPop) + Id*#Step).``````

So if you see the variable `N` in the dataset now you will see that the highest value is equal to the population in New York City at over 8 million. SPSS does not have an inverse function for the binomial distribution, but it does have one for the beta distribution. Wikipedia lists how the Clopper-Pearson binomial confidence intervals can be rewritten in terms of the beta distribution. I use this equality here to generate 99% confidence intervals for the population sizes in the `N` variable for the overall homicide rate, `HomeRate` (remembering to convert rates per 100,000 back to proportions).

``````*Now calculating bands based on statewide homicide rates.
*Exact bounds based on inverse beta.
*http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval.
COMPUTE #Alph = 0.01.
COMPUTE #Suc = (HomRate*N)/100000.
COMPUTE LowB = IDF.BETA(#Alph/2,#Suc,N-#Suc+1)*100000.
COMPUTE HighB = IDF.BETA(1-#Alph/2,#Suc+1,N-#Suc)*100000.
EXECUTE.``````

Now we can superimpose `LowB`, `HighB`, and `HomRate` on the same graph as the previous scatterplot to give a sense of the uncertainty in the estimates. Here I also change the size of the graph using `PAGE` statements. I color the error bounds as light grey lines, and the overall homicide rate as a red line, and then superimpose the homicide rates for each agency on top of them.

``````*Now can make the plot with the error bounds.
FORMATS LowB HighB HomRate (F2.0) N (F8.0).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Population Murder_and_nonnegligent_manslaughter_rate
LowB HighB HomRate N
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
PAGE: begin(scale(600px,800px))
SOURCE: s=userSource(id("graphdataset"))
DATA: Population=col(source(s), name("Population"))
DATA: Murder_and_nonnegligent_manslaughter_rate=col(source(s),
name("Murder_and_nonnegligent_manslaughter_rate"))
DATA: LowB=col(source(s), name("LowB"))
DATA: HighB=col(source(s), name("HighB"))
DATA: HomRate=col(source(s), name("HomRate"))
DATA: N=col(source(s), name("N"))
GUIDE: axis(dim(1), label("Population"))
GUIDE: axis(dim(2), label("Murder Rate per 100,000"), delta(2))
SCALE: log(dim(1))
SCALE: linear(dim(2), min(2), max(64))
ELEMENT: line(position(N*HomRate), color(color.red), size(size."2"))
ELEMENT: line(position(N*LowB), color(color.grey))
ELEMENT: line(position(N*HighB), color(color.grey))
ELEMENT: point(position(Population*Murder_and_nonnegligent_manslaughter_rate), size(size."5"),
transparency.exterior(transparency."0.5"))
PAGE: end()
END GPL.``````

So I hope you see where the name funnel chart comes from now! This chart gives us an idea of the variability we would expect if random data were drawn with a rate of 6/100000 for population sizes between 10,000 and 100,000 is quite large, and most of the agencies with high homicide rates are still below the upper bound. Even with a population of 100,000, the 99% confidence interval covers rates between 2 and almost 16 per 100,000 – which translates directly to 2 and 16 homicides in a jurisdiction of that size.

Here I add labels to the chart for locations above the interval and below the interval (but at least have one homicide). I add the total number of homicides in that jurisdiction in parenthesis to the label as well.

``````*Now adding in labels.
COMPUTE #Alph = 0.01.
COMPUTE #Suc = (HomRate*Population)/100000.
COMPUTE LowA = IDF.BETA(#Alph/2,#Suc,Population-#Suc+1)*100000.
COMPUTE HighA = IDF.BETA(1-#Alph/2,#Suc+1,Population-#Suc)*100000.
EXECUTE.

STRING HighLab (A50).
IF Murder_and_nonnegligent_manslaughter_rate > HighA
HighLab = CONCAT(RTRIM(REPLACE(Agency,"Police Dept",""))," (",LTRIM(STRING(Murder_and_nonnegligent_Manslaughter,F3.0)),")").
IF Murder_and_nonnegligent_manslaughter_rate  0
HighLab = CONCAT(RTRIM(REPLACE(Agency,"Police Dept",""))," (",LTRIM(STRING(Murder_and_nonnegligent_Manslaughter,F3.0)),")").
EXECUTE.``````

The labels get a little busy, but we can see that save for Buffalo and Rochester, all of the jurisdictions with higher than expected homicide rates are in Pennsylvania, with Philadelphia being the most egregious outlier. New York City and Yonkers are actually lower than the confidence interval, although quite close (along with a set of cities with zero homicides in populations between mainly the lower bound of 10,000 to 100,000). Chester city and Mckeesport are high locations as well, but we can see from the labels they only have 22 and 10 homicides respectively. The point to the left of Mckeesport is Coatesville, which ends up having only 6 homicides.

Where I consider such charts useful are to identify outliers (such as Philadelphia and Chester City). Depending on the nature of the study will depend on what this information could be useful for. For a reporting agency like the FBI they may be interested in seeing if Chester City is a reporting anomaly, or the NIJ/DOJ may be interested in funnelling resources to these high homicide rate locations. The same is true for a police department monitoring rates of say contraband recovery at particular locations, or uses of force per officer incident.

For a researcher they may be interested in other causal factors that could cause a high or low rate, as this might give insight into the mechanisms that increase or decrease homicides (for this particular chart). Evaluating against the overall homicide rate for the sample is an ad-hoc decision (it appears in this chart that PA and NY may plausibly have distinct homicide rate values, with PA being higher) but it at least gives the analyst an idea of the variability when evaluating rates.

# Log Scaled Charts in SPSS

Log scales are convenient for a variety of data. Here I am going to post a brief tutorial about making and formatting log scales in SPSS charts. So first lets start with a simple set of data:

``````DATA LIST FREE / X (F1.0) Y (F3.0).
BEGIN DATA
1 1
2 5
3 10
4 20
5 40
6 50
7 70
8 90
9 110.
END DATA.
DATASET NAME LogScales.
VARIABLE LEVEL X Y (SCALE).
EXECUTE.``````

In GPL code a scatterplot with linear scales would simply be:

``````GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"))
ELEMENT: point(position(X*Y))
END GPL.``````

To change this chart to a log scale you need to add a `SCALE` statement. Here I will specify the Y axis (the 2nd dimension) as having a logarithmic scale by inserting `SCALE: log(dim(2), base(2))` between the last `GUIDE` statement and the first `ELEMENT` statement. When using log scales, many people default to having a base of 10, but this uses a base of 2, which works much better with the range of data in this example. (Log base 2 also works much better for ratios that don’t span into the 1,000’s as well.)

``````GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"))
SCALE: log(dim(2), base(2))
ELEMENT: point(position(X*Y))
END GPL.``````

Although the order of the commands makes no difference, I like to have the `ELEMENT` statements last, and then the prior statements before and together with like statements. If you want to have more control over the scale, you can specify and `min` or a `max` for the chart (by default SPSS tries to choose nice values based on the data). With log scales the minimum needs to be above 0. Here I use `0.5`, which is an equivalent distance from 1 -> 2 on a log scale.

``````GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"))
SCALE: log(dim(2), base(2), min(0.5))
ELEMENT: point(position(X*Y))
END GPL.``````

You can see here though we have a problem — two 1’s in the Y axis! SPSS inherits the formats for the axis from the data. Since the Y data are formatted as F3.0, the 0.5 tick mark is rounded up to 1. You can fix this by formatting the variable before the `GGRAPH` command.

``````FORMATS Y (F4.1).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"))
SCALE: log(dim(2), base(2), min(0.5))
ELEMENT: point(position(X*Y))
END GPL.``````

But this is annoying, as it adds a decimal to all of the tick values. I wish you could use fractions in the ticks, but this is not possible that I know of. To prevent this you should be able to specify the `start()` option in the `GUIDE` command for the second dimension, but in a few attempts it was not working for me (and it wasn’t a conflict with my template — in this example set the `DEFAULTTEMPLATE=NO` to check). So here I adjusted the minimum to be `0.8` instead of `0.5` and SPSS did not draw any ticks below 1 (you may have to experiment with this based on how much padding the Y axis has in your default template).

``````FORMATS Y (F3.0).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE DEFAULTTEMPLATE=YES.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"), start(1))
SCALE: log(dim(2), base(2), min(0.8))
ELEMENT: point(position(X*Y))
END GPL.``````

I will leave talking about using log scales if you have 0’s in your data for another day (and talk about displaying missing data in the plot as well.) SPSS has a `safeLog` scale, which is good for data that has negative values, but not necessarily my preferred approach if you have count data with 0’s.