# Using funnel charts for monitoring rates

For a recent project of mine I was exploring arrest rates at small units of analysis (street midpoints and intersections) for one of the collaborations we have at the Finn Institute.

One of the graphs I was interested in making was a funnel plot of the arrest rates on the Y axis and the total number of stops on the X axis. You can then draw confidence bands around a particular percentage (typically the overall rate) to see if any observations have oddly high or low rates. This procedure is described in Funnel plots for comparing institutional performance (Spiegelhalter, 2005). Funnel charts are more common in meta-analysis, but I hope to illustrate their utility here in monitoring rates with different denominators.

For an illustration I will make a funnel plot of the 2012 homicide rates per 100,000 reported by the police agencies in New York and Pennsylvania (available via the UCR data tool). The data and code to replicate the results in this blog post are available in a zip file. If you have the SPSSINC GETURI DATA add-on you can start just by calling the command below (otherwise you have to download the data to follow along).

SPSSINC GETURI DATA
URI="https://dl.dropboxusercontent.com/u/3385251/NY_PA_crimerates_2012.sav"
FILETYPE=SAV DATASET=NY_PA.

I start at the point in the code where the data is already loaded, and I generate a scatterplot for Population and Murder_and_nonnegligent_manslaughter_rate. I eliminate agencies that did not report a full 12 months or had missing data for homicides and/or the population, which results in around 360 homicide rates for populations above 10,000 and up to nearly 10 million (New York City). The labels for SPSS graphs inherit the format of the original data, so I set the homicide rates to F2.0 to avoid decimal places in the axis tick mark labels.

FORMATS Murder_and_nonnegligent_manslaughter_rate (F2.0).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Population Murder_and_nonnegligent_manslaughter_rate
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Population=col(source(s), name("Population"))
DATA: Murder_and_nonnegligent_manslaughter_rate=col(source(s),
name("Murder_and_nonnegligent_manslaughter_rate"))
GUIDE: axis(dim(1), label("Population"))
GUIDE: axis(dim(2), label("Murder Rate per 100,000"))
SCALE: log(dim(1))
ELEMENT: point(position(Population*Murder_and_nonnegligent_manslaughter_rate))
END GPL.

So we can see that the bulk of the institutions have very low murder rates; the overall rate is just shy of 6 murders per 100,000 in this sample. But there are quite a few locations that appear to be high outliers. We are talking about proportions as small as 0.00006 to 0.00065 though, so are some of the high murder rate locations really that surprising? Let's see by adding an estimate of the error we would expect if the individual rates were drawn from the same distribution as the overall rate.

Now to make the funnel chart we need to estimate the confidence interval for a rate of 6/100,000 for population sizes between 10,000 and 10 million, and add it to our chart. We can add these interpolated bounds directly into our SPSS dataset. To estimate the lines, I use the AGGREGATE command to calculate the min and max population in the sample (and the total homicide rate), and then interpolate population step sizes between them. I interpolate the population steps on the log scale, as the X axis in the chart uses a log scale.

*Aggregate proportions and max/min counts.
COMPUTE Id = $casenum - 1.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES
/BREAK
/MaxPop = MAX(Population)
/MinPop = MIN(Population)
/AllHom = SUM(Murder_and_nonnegligent_Manslaughter)
/TotalPop = SUM(Population)
/TotalObs = MAX(Id).
*Overall homicide rate per 100,000.
COMPUTE HomRate = (AllHom/TotalPop)*100000.
*Now interpolating bounds for the population.
COMPUTE #Step = (LG10(MaxPop) - LG10(MinPop))/TotalObs.
COMPUTE N = 10**(LG10(MinPop) + Id*#Step).

If you look at the variable N in the dataset now, you will see that its highest value equals the population of New York City, over 8 million. SPSS does not have an inverse function for the binomial distribution, but it does have one for the beta distribution. Wikipedia lists how the Clopper-Pearson binomial confidence intervals can be rewritten in terms of the beta distribution. I use this equality here to generate 99% confidence intervals for the population sizes in the N variable at the overall homicide rate, HomRate (remembering to convert rates per 100,000 back to proportions).

*Now calculating bands based on statewide homicide rates.
*Exact bounds based on inverse beta.
*http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval.
COMPUTE #Alph = 0.01.
COMPUTE #Suc = (HomRate*N)/100000.
COMPUTE LowB = IDF.BETA(#Alph/2,#Suc,N-#Suc+1)*100000.
COMPUTE HighB = IDF.BETA(1-#Alph/2,#Suc+1,N-#Suc)*100000.
EXECUTE.

Now we can superimpose LowB, HighB, and HomRate on the same graph as the previous scatterplot to give a sense of the uncertainty in the estimates. Here I also change the size of the graph using PAGE statements. I color the error bounds as light grey lines and the overall homicide rate as a red line, and then superimpose the homicide rates for each agency on top of them.

*Now can make the plot with the error bounds.
FORMATS LowB HighB HomRate (F2.0) N (F8.0).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Population Murder_and_nonnegligent_manslaughter_rate
LowB HighB HomRate N
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
PAGE: begin(scale(600px,800px))
SOURCE: s=userSource(id("graphdataset"))
DATA: Population=col(source(s), name("Population"))
DATA: Murder_and_nonnegligent_manslaughter_rate=col(source(s),
name("Murder_and_nonnegligent_manslaughter_rate"))
DATA: LowB=col(source(s), name("LowB"))
DATA: HighB=col(source(s), name("HighB"))
DATA: HomRate=col(source(s), name("HomRate"))
DATA: N=col(source(s), name("N"))
GUIDE: axis(dim(1), label("Population"))
GUIDE: axis(dim(2), label("Murder Rate per 100,000"), delta(2))
SCALE: log(dim(1))
SCALE: linear(dim(2), min(2), max(64))
ELEMENT: line(position(N*HomRate), color(color.red), size(size."2"))
ELEMENT: line(position(N*LowB), color(color.grey))
ELEMENT: line(position(N*HighB), color(color.grey))
ELEMENT: point(position(Population*Murder_and_nonnegligent_manslaughter_rate), size(size."5"),
transparency.exterior(transparency."0.5"))
PAGE: end()
END GPL.

So I hope you see where the name funnel chart comes from now! This chart shows that the variability we would expect, if random data were drawn with a rate of 6/100,000, is quite large for population sizes between 10,000 and 100,000, and most of the agencies with high homicide rates are still below the upper bound. Even with a population of 100,000, the 99% confidence interval covers rates between 2 and almost 16 per 100,000 – which translates directly to 2 and 16 homicides in a jurisdiction of that size.
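As a quick check outside of SPSS, the same Clopper-Pearson bounds can be computed in Python. This is a minimal sketch (not part of the original workflow, and it assumes scipy is available); it mirrors the IDF.BETA calls above:

from scipy.stats import beta

def clopper_pearson(successes, n, alpha=0.01):
    #exact binomial bounds via the beta distribution, as in the IDF.BETA calls
    lower = beta.ppf(alpha/2, successes, n - successes + 1)
    upper = beta.ppf(1 - alpha/2, successes + 1, n - successes)
    return lower, upper

#99% interval for a rate of 6 per 100,000 in a population of 100,000
lo, hi = clopper_pearson(6, 100000)
print(lo*100000, hi*100000) #roughly 1.5 to 15.7 per 100,000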
Here I add labels to the chart for locations above the interval, and for locations below the interval that have at least one homicide. I add the total number of homicides in that jurisdiction in parentheses to the label as well.

*Now adding in labels.
COMPUTE #Alph = 0.01.
COMPUTE #Suc = (HomRate*Population)/100000.
COMPUTE LowA = IDF.BETA(#Alph/2,#Suc,Population-#Suc+1)*100000.
COMPUTE HighA = IDF.BETA(1-#Alph/2,#Suc+1,Population-#Suc)*100000.
EXECUTE.
STRING HighLab (A50).
IF Murder_and_nonnegligent_manslaughter_rate > HighA HighLab =
CONCAT(RTRIM(REPLACE(Agency,"Police Dept",""))," (",LTRIM(STRING(Murder_and_nonnegligent_Manslaughter,F3.0)),")").
IF Murder_and_nonnegligent_manslaughter_rate < LowA AND Murder_and_nonnegligent_manslaughter_rate > 0 HighLab =
CONCAT(RTRIM(REPLACE(Agency,"Police Dept",""))," (",LTRIM(STRING(Murder_and_nonnegligent_Manslaughter,F3.0)),")").
EXECUTE.

The labels get a little busy, but we can see that, save for Buffalo and Rochester, all of the jurisdictions with higher than expected homicide rates are in Pennsylvania, with Philadelphia being the most egregious outlier. New York City and Yonkers are actually lower than the confidence interval, although quite close (along with a set of cities with zero homicides, mainly with populations between the lower bound of 10,000 and 100,000). Chester City and McKeesport are high locations as well, but we can see from the labels they only have 22 and 10 homicides respectively. The point to the left of McKeesport is Coatesville, which ends up having only 6 homicides.

Where I consider such charts useful is in identifying outliers (such as Philadelphia and Chester City). What that information is then useful for depends on the nature of the study. A reporting agency like the FBI may be interested in seeing whether Chester City is a reporting anomaly, while the NIJ/DOJ may be interested in funnelling resources to these high homicide rate locations. The same is true for a police department monitoring rates of, say, contraband recovery at particular locations, or uses of force per officer incident. A researcher may be interested in other causal factors behind a high or low rate, as this might give insight into the mechanisms that increase or decrease homicides (for this particular chart). Evaluating against the overall homicide rate for the sample is an ad-hoc decision (it appears in this chart that PA and NY may plausibly have distinct homicide rates, with PA being higher), but it at least gives the analyst an idea of the variability when evaluating rates.

# Dissertation Draft

I figured I would post the current draft of my dissertation. It is being evaluated by the committee members now, so why not have everyone evaluate it! Also, since I am on the job market, it is proof I am close to finished. Here is a pdf of the draft. This draft is not guaranteed to stay the same as I find errors, but at this point changes should (hopefully) be minimal. As always, I appreciate any feedback.

The title is What we can learn from small units of analysis, and below is the abstract.

The dissertation is aimed at advancing knowledge of the correlates of crime at small geographic units of analysis. I begin by detailing what motivates examining crime at small places, and focus on how aggregation creates confounds that limit causal inference.
Local and spatial effects are confounded when using aggregate units, so to the extent the researcher wishes to distinguish between these two types of effects, it should guide what unit of analysis is chosen. To illustrate these differences, I examine local, spatial and contextual effects for bars, broken windows and crime using publicly available data from Washington, D.C.

# Stacking Intervals

Motivated by visualizing overlapping intervals from the This is not Rocket Science blog (by Valentine Svensson), I developed some code in SPSS to accomplish a similar feat. Here is an example plot from the blog:

It is related to a graph I attempted to make for visualizing overlap in crime events, where interval censored crime data is applicable. The algorithm I mimic here is not quite the same as Valentine's, but is similar in effect. Here is the macro I made to accomplish this in SPSS. (I know I could also use the linked author's Python code directly in SPSS.)

DEFINE !StackInt (!POSITIONAL = !TOKENS(1)
/!POSITIONAL = !TOKENS(1)
/Fuzz = !DEFAULT("0") !TOKENS(1)
/Out = !DEFAULT(Y) !TOKENS(1)
/Sort = !DEFAULT(Y) !TOKENS(1)
/TempVals = !DEFAULT(100) !TOKENS(1) )
!IF (!Sort = "Y") !THEN
SORT CASES BY !1 !2.
!IFEND
VECTOR #TailEnd(!TempVals).
DO IF ($casenum = 1).
COMPUTE #TailEnd(1) = End + !UNQUOTE(!Fuzz).
COMPUTE !Out = 1.
LOOP #i = 2 TO !TempVals.
COMPUTE #TailEnd(#i) = Begin-1.
END LOOP.
ELSE.
DO REPEAT i = 1 TO !TempVals /T = #TailEnd1 TO !CONCAT("#TailEnd",!TempVals).
COMPUTE T = LAG(T).
DO IF (Begin > T AND MISSING(!Out)).
COMPUTE T = End + !UNQUOTE(!Fuzz).
COMPUTE !Out = i.
END IF.
END REPEAT.
END IF.
FORMATS !Out !CONCAT("(F",!LENGTH(!TempVals),".0)").
FREQ !Out /FORMAT = NOTABLE /STATISTICS = MAX.
!ENDDEFINE.

An annoyance for this is that I need to place the ending bins in a preallocated vector. Since you won't know beforehand how many offset rows you will need, I default to 100 values. The function prints a set of frequencies after the command though (which forces a data pass), and if you have missing data in the output you need to increase the size of the vector.

One simple addition I made to the code over the original is what I call Fuzz in the code above. For crime data there are likely to be many near overlaps in addition to many crimes that have an exact known time. What the fuzz factor does is displace the lines if they touch within the fuzz factor. E.g. if you had two intervals, (1,3) and (3.5,5), and specified a fuzz factor of 1, the second interval would be placed on top of the first, even though they don't exactly overlap. This appears to work reasonably well for both small and large n in some experimentation, but if you use too large a fuzz factor it will produce river-like runs through the overlap histogram.
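To make the algorithm concrete, here is a minimal Python sketch of the same stacking logic (hypothetical code, not the macro itself): each interval goes on the first row whose current tail end it does not touch within the fuzz factor.

def stack_intervals(intervals, fuzz=0):
    #intervals are (begin, end) pairs; returns a row number per interval
    tails = []  #current tail end of each row
    rows = []
    for begin, end in sorted(intervals):
        for i, tail in enumerate(tails):
            if begin > tail:      #fits on an existing row
                tails[i] = end + fuzz
                rows.append(i + 1)
                break
        else:                     #start a new row
            tails.append(end + fuzz)
            rows.append(len(tails))
    return rows

#with a fuzz of 1, (3.5,5) is displaced to a new row since 3.5 <= 3 + 1
print(stack_intervals([(1, 3), (3.5, 5)], fuzz=1))  #[1, 2]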

Here is an example of using the macro with some fake data.

SET SEED 10.
INPUT PROGRAM.
LOOP #i = 1 TO 1000.
COMPUTE Begin = RV.NORMAL(50,15).
COMPUTE End = Begin + EXP(RV.GAMMA(1,2)).
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.

!StackInt Begin End Fuzz = "1" TempVals = 101.

GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Begin End Y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: End=col(source(s), name("End"))
DATA: Begin=col(source(s), name("Begin"))
DATA: Y=col(source(s), name("Y"), unit.category())
GUIDE: axis(dim(1))
GUIDE: axis(dim(2), null())
ELEMENT: edge(position((End+Begin)*Y))
END GPL.

And here is the graph:

Using the same burglary data from Arlington I used for my aoristic macro, here is an example plot for a little over 6,300 burglaries in Arlington in 2012, using a fuzz factor of 2 days.

This is somewhat misleading, as events with a known exact time are not given any area in the plot. Here is the same plot, but with the midpoint of the line plotted as a circle on top of the edge.

You can also add in other information to the plot. Here I color edges and points according to the beat that they are in.

It produces a pretty plot, but it is difficult to discern any patterns. I imagine this type of plot would be good for evaluating when there is the most uncertainty in crime reports. I might guess that if I made this plot for burglaries after college students come back from break, you would see a larger number of long intervals, showing large windows of uncertainty. It is difficult to see any patterns in the Arlington data though, besides more events in the summertime (which an aoristic chart would have shown to begin with). At the level of abstraction of over a year, I don't think you will be able to spot patterns like certain days of the week or times of the month.

I wonder if this would be useful for survival analysis, in particular to spot any temporal or cohort effects.

# Quantifying racial bias in peremptory challenges

A question came up recently on Cross Validated about putting some numbers on the amount of bias in jury selection. I had answered a previous question of a similar nature, so it had been on my mind already. The original poster did not say this was specifically for a Batson challenge, but that is simply my presumption. It is both amazing and maddening that, given the same question, four different potential analyses were suggested. Although it is a bit out of the norm for what I talk about, I figured it would be worth a post.

# Some background on Batson

For some background, Batson challenges are specifically in the context of selecting jurors for a trial. (Everything that follows is specific to what I know about law in the US.) To select a jury, the court first selects potential jurors for the venire from the general public. Then both the prosecution and defence counsels have the opportunity to question individuals in the venire. A typical flow seems to be that a panel of the venire is selected (say 10), and then the court has a set of standardized questions it asks every individual potential juror. This part is referred to as voir dire. If an individual states they cannot be impartial, or there is some other characteristic that indicates they cannot be impartial, that potential juror can be eliminated for cause. Without intervention of counsel, the standard questions by the court typically weed out any obvious cases. After the standard questions, both counsels have the opportunity to ask their own questions and further identify challenges for cause. There is no limit on the number of jurors who can be eliminated for cause.

The wrinkle specific to Batson though is that each counsel is given a fixed number of peremptory challenges. The number is dictated by the severity of the case (in more serious cases each side has a higher number of challenges). The wikipedia page says in some circumstances the defence gets more than the prosecution, but the total number is always fixed in advance.

The logic behind peremptory challenges is that either counsel can use personal discretion to eliminate potential jurors without needing a justification. Basically it is a fail-safe of the court, allowing gut feelings of either counsel to eliminate jurors they believe will be partial to the opposing side. But based on the equal protection clause, it was decided in Batson vs. Kentucky that one cannot use the challenges solely based on race. As a side effect of allowing so many peremptory challenges, one can easily eliminate a particular minority group, since a minority group will only have a few representatives in the venire.

During the voir dire if the opposing counsel believes the opposition is using the peremptory challenges in a racially discriminatory manner, they can object with a Batson challenge. The supreme court decided on three steps to evaluate the challenge.

1. The party that objected has the burden to prove a prima facie case that the challenges were used in a discriminatory manner. This includes an argument that the group discriminated against is cognizable, and that there is additional numerical evidence of discrimination.
2. Then the burden shifts on the party being challenged to justify the use of the peremptory challenges based on race neutral reasons.
3. The burden then shifts back to the original challenging party, to dispute whether the reasons proffered for the use of the peremptory challenges are purely pretextual.

Witnessing the proceedings for this particular case in the New York Court of Appeals is what prompted my interest, and I recommend reading that decision as a good general background on Batson challenges (the wikipedia page is lacking quite a bit). What follows is some number crunching specific to the first step, establishing a prima facie case of discrimination.

# Now some numbers

Batson challenges are made in situ during voir dire. All the cases I am familiar with simply use fractions to establish that the peremptory challenges are being used in a discriminatory manner. The fact that the numbers change during voir dire makes the calculation of statistics more difficult. But I will address the ex post facto assessment of the first step, given the final counts of the number of peremptory challenges and the total number on the venire with their racial distribution. I will later discuss how this presumption might impact the findings in a more realistic setting.

Consider the case of People vs. Hecker (linked to above). It happened that peremptory challenges by the defence to exclude two Asians from the jury panel were what prompted the Batson challenge. Later on, one other Asian juror was seated on the jury. The appeals court considered in this case whether step 1 was justified, so it is not a totally academic question to attempt to quantify the chances of two out of three Asians being challenged.

First, I will specify how we might put a number on this chance occurrence. If a person randomly selected 13 names out of a hat containing 39 names, 3 of which were Asian, what is the probability that 2 of those selected would be Asian? This probability is dictated by the hypergeometric distribution. More generally, the set up is:

• n equals the total number of eligible cases that are subject to be challenged
• p equals the total number of the race in question that are subject to be challenged
• k equals the total number of peremptory challenges used on the racial group in question
• d equals the total number of peremptory challenges

And the hypergeometric distribution is calculated using binomial coefficients as:

$\frac{\binom{p}{k} \binom{n-p}{d-k}}{\binom{n}{d}}$

So plugging the numbers listed above into the formula, the probability of two out of three Asian jurors being challenged, if the challenges were made randomly, is as below (computed via Wolfram Alpha):

$\frac{\binom{p}{k} \binom{n-p}{d-k}}{\binom{n}{d}} = \frac{\binom{3}{2} \binom{39-3}{13-2}}{\binom{39}{13}} = \frac{156}{703} \approx 0.22$
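If you want to check the arithmetic without Wolfram Alpha, the same probability can be computed with scipy's hypergeometric distribution (a sketch, not from the original post; scipy parameterizes by total size, number of successes, and number of draws):

from scipy.stats import hypergeom

#P(exactly 2 of the 3 Asian panelists among 13 challenges out of 39)
print(hypergeom.pmf(2, 39, 3, 13))  #~0.22, i.e. 156/703
#P(2 or more), if you prefer the tail probability
print(hypergeom.sf(1, 39, 3, 13))   #~0.25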

So the probability of a chance occurrence given this particular set of circumstances is 22%, not terribly small, although it may be sufficient given other circumstances to justify the first step (it seems the first step is intended to carry a rather light burden). Where did my numbers come from though, exactly? Given the circumstances of the case in the appeal decision, the only number that would be uncontroversial is k = 2, that two Asian jurors were excluded based on peremptory challenges. p = 3 comes from after the case, in which one other Asian juror was actually seated. It appears in the appeals case I linked to they consider the jury composition after the Batson challenge was initially brought, but obviously the initial trial judge can't use that future information. I choose to use d = 13 because the defence only used 13 of their 15 peremptory challenges by the end of the seating. The total number n = 39 is the most difficult to come by.

The appeal case states that in the first pool 18 jurors were brought for questioning, 5 were eliminated for cause, and both the defence and prosecution used 5 peremptory challenges. I chose to count the total number as 13 for this round, 18 minus the 5 eliminated for cause. In our pulling-names-out-of-the-hat experiment though, you may consider the number to be only 8, not counting the cases the prosecution used their peremptory challenges on. The second round brought another 18 potential jurors, of which 4 were eliminated for cause. In this round the two Asian jurors were challenged by the defence when the judge asked counsel to evaluate the first 9 of this panel (given the language, it appears the defence used 3 peremptory challenges during this evaluation of the first 9). At the end of the second round both parties had used another 5 peremptory challenges. So to get to the total of 39, I use 13 for the first round plus 14 for the second, although I could reasonably use 8 for the first and 9 for the second. I end up at 39 by using 13 + 14 + 12 – the last Asian juror (Kazuko) was the 13th to be questioned on the third panel. One prior juror had been challenged for cause, so I count this as 12 towards the total of 39. There were a total of 26 panellists in this round, the defence used a total of another 3 peremptory challenges, and there is no other information on whether the prosecution used any more peremptory challenges. (I'm unsure of the total number of jurors seated for the case, so I can't make many other guesses – the total number of jurors selected in the prior two rounds was 7.) I don't worry about the selection of the alternate juror for this analysis.

So let's try to apply this same analysis at the exact time the Batson challenge was raised, when the second Asian juror was eliminated. I prefer to make the calculations at the time the challenge was raised, as the opposing counsel may later alter their behavior in light of a prior Batson challenge. In doing this, we now have p = 2 and k = 2 (there was no other mention of any Asians being challenged for cause). We have a bit of uncertainty about d and n though. d at a minimum for the defence has to be 7 at this point (five in the first round plus the two Asians in the second round), but could be as many as 10 (5 in the first and 5 in the second). n could be, as mentioned before, 8 + 9 = 17 (excluding cases the prosecution challenged) or 13 + 14 = 27 (including all cases not challenged for cause). In the middle you may consider n to be 13 + 9 = 22, the exact point when the defence was asked to bring forth challenges for the first 9 seated in that round of the panel. So what difference does this make on the estimated probabilities? Let's just graph the estimates for all values of d between 7 and 10, and n between 17 and 27. Here the lines are for different values of d (with labels at the beginning left part of the line), and the X axis shows the different values of n. The probabilities follow the pattern that as n increases and d decreases, the probability of that combination goes down. Even over all these values the probability never goes below 5%. I do not know if a 5% probability is sufficient for the numerical justification of the prima facie case of discrimination. 22% seems too low a threshold to me (by chance about 1 in 5 times), but 5% may be good enough (by chance 1 in 20 times).
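For replication, a small Python loop over the plausible values reproduces this sensitivity analysis (again a sketch assuming scipy):

from scipy.stats import hypergeom

#P(both of the 2 Asian panelists challenged), varying d and n
for d in range(7, 11):
    probs = [hypergeom.pmf(2, n, 2, d) for n in range(17, 28)]
    print(d, [round(p, 3) for p in probs])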

So let's try this same sensitivity analysis for the entire case. For the evaluation of everything after the fact, I think p = 3 and k = 2 are largely uncontroversial, but let's vary d between 7 and 15 and n between 23 (chosen to be at the lower end of cases that I believe were legitimately evaluated in the pool by the end) and 45. Some of these situations are not commensurate with the limited information we have (e.g. d = 7 and n = 45 is not possible), but I think the graph will be informative anyway. My labelling could use more work, but the line on top is d = 15 and they increment until the line on the bottom, where d = 7.

So here we can see that the probability after the fact never gets much below 10%, and that is for lower values of d and higher values of n that are not likely possible given the data. So basically, in the case that looks the worst for the defence, the probability of selecting two out of three Asians at random (varying the number of peremptory challenges and the pool from which to draw them) is never below 10%. Not a terribly strong case that the numerical portion of step 1 has been satisfied. Basically, no matter what reasonable values you put in for d or n in this circumstance, the probability of choosing two out of three Asian jurors randomly is not going to be much below 10%.

# On some of the other suggested analyses

On the original question on CV I mentioned previously, besides my own suggested analysis there were 3 other suggested analyses:

• Using a regression model to predict the probability of a racial group being challenged
• Analysis of Contingency tables
• Calculating all potential permutations, and then counting the percentage of those permutations that meet some criteria.

None of the three are completely unreasonable, but I prefer the approach I listed above. I will attempt to articulate why.

So first I will talk about the regression approach. This is generically a model of the form predicting the probability of a peremptory challenge based on the race of the potential juror:

$\text{Prob}(\text{Challenge}) = f(\beta_0 + \beta_1 \cdot \text{Racial Group}_i)$

The function f is typically a logit (for a logistic model) or a probit function. This generalized linear model is then estimated via maximum likelihood, and one formulates hypothesis tests for the $\beta_1$ coefficient. The easy critique of this is that the test is not likely to be very powerful with small samples – the estimates are based on maximum likelihood and are only guaranteed to be unbiased asymptotically. It would be a fairly simple exercise to examine the behavior of this bias in small samples, but I suspect it reduces the power of the test greatly. Also note that in the case where the subgroup of interest is always challenged, such as the two out of two Asians in the Hecker case mentioned, the equation is not identified due to perfect separation. There are alternative ways to estimate the equation in the case of perfect separation, but this does not mitigate the small sample problem.

More generally, my original formulation of the data generating mechanism as the hypergeometric distribution, drawing names out of a hat, is quite different from this. This is a model of the probability of anyone being peremptorily challenged. One then estimates the model to see if the probability is increased among the racial group of interest. This is arguably not the question of interest. For instance, say the model estimated the probability of an Asian being challenged to be 6%, and the probability of anyone else to be 4%. In one sense this establishes the prima facie case of discrimination against Asians compared to everyone else, but does a probability of only 6% of using a peremptory challenge warrant a Batson challenge? I don't think so. If you think a challenge will never come with such low probabilities, you are right in that the expected probabilities for the racial group in question will not be that low when a Batson challenge is made, but once you consider the uncertainty in the estimates (e.g. 95% confidence intervals) they could easily be that low. On the flip side, if the racial group is struck 96% of the time, but everyone else is struck 94% of the time, does that establish the numerical evidence of discrimination? I'm not sure; it may, if this prevents any of the particular racial group from being seated.

The analysis of contingency tables, in particular Fisher’s Exact Test, is exactly the same as my hypergeometric approach if one only considers the racial group of interest against all other parties. Fisher’s exact test is a reasonable approach over the more typical chi-square because, 1) the cells will be quite small, and 2) this is one of the unusual cases where the marginals are fixed. So making a 2 by 2 contingency table based on the very first example I gave (which resulted in a probability of around 22%) would be a table:

|       | Challenge | No Challenge | Total |
|-------|-----------|--------------|-------|
| Asian | 2         | 1            | 3     |
| Other | 11        | 25           | 36    |
| Total | 13        | 26           | 39    |

For the formula the 2 by 2 table is referred to as:

|       | Challenge | No Challenge | Total |
|-------|-----------|--------------|-------|
| Asian | a         | b            | a + b |
| Other | c         | d            | c + d |
| Total | a + c     | b + d        | n     |

Which Fisher’s Exact Test can be formulated by the binomial coefficients:

$\frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{n}{a+c}} = \frac{\binom{2+1}{2} \binom{11+25}{11}}{\binom{39}{13}}$

If you look closely, this is exactly the same set of binomial coefficients as the hypergeometric test I listed previously. So, as long as one only tests the one racial group against all others, Fisher's Exact Test of a 2 by 2 contingency table is exactly the same as my recommendation. I don't particularly think the historical p-value <= 0.05 standard is necessary, but it is the same information.
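The equivalence is easy to demonstrate numerically. Here is a sketch (assuming scipy) comparing fisher_exact on the 2 by 2 table above against the hypergeometric tail probability:

from scipy.stats import fisher_exact, hypergeom

table = [[2, 1], [11, 25]]  #challenge/no challenge by Asian/other
odds, p = fisher_exact(table, alternative='greater')
p_hyper = hypergeom.sf(1, 39, 3, 13)  #P(2 or more of the 3 Asians challenged)
print(p, p_hyper)  #the two p-values agree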

What bothers me more about this approach is when people start adding other cells to the contingency table. In the first step you need to establish a pattern of discrimination against one particular cognizable group. The treatment of other groups is a non sequitur to this question in the first step. (In People vs. Black, evaluations of how unemployment was treated for non-black jurors were considered in steps 2 and 3, but not in the first step.) Including other groups in the table, though, will change the outcome. Such ad hoc decisions on what racial groups to consider should not have any effect on the evidence of discrimination against the specific racial group of interest. This problem of what groups to include is similarly applicable to the regression approach mentioned above: hypothesis tests of the coefficients will depend on what particular contrasts you wish to draw, and the estimates will change if certain groups are specified in the equation.

The final approach, counting up the particular permutations that meet some threshold, is intuitive, but again has an ad hoc element: the same problem of choosing which racial groups to include will impact the test statistic. All of the approaches (including my own) need to be explicit about the groups being tested beforehand. Only monitoring one group, as in the hypergeometric test I presented earlier, is much simpler to justify ex post facto, but it still would be best to establish the cognizable groups before voir dire takes place. Tests that use other racial or ethnic groups in the calculations are much harder to justify, as the picking and choosing of the other groups will impact the calculations.

# Some recommendations

A typical question I get asked as an academic is, "So what would you recommend to improve the situation?" A totally reasonable question that I often don't have a good answer to. It is easy to throw out recommendations without considering the entirety of the situation, and the complexities of the criminal justice system are no exception. With full awareness that no one with any authority will likely read my recommendations, my suggestions follow nonetheless.

The first, only slightly in jest, is to only allow 1 peremptory challenge. There is no bright line rule on the numerical evidence necessary to establish discrimination, but the NY State Court of Appeals case I mentioned did indicate that it takes more than 1 challenge to establish a pattern of discrimination. This may seem extreme, but the same logic applies to allowing fewer peremptory challenges. The fewer the challenges, the less capability either counsel has to entirely eliminate a particular racial group from the jury. It simultaneously makes counsels' use of the challenges more precious, so they should be more hesitant to use them on gut feelings predicated solely by racial stereotyping.

As another side effect (good or bad depending on how you look at it), it also makes the evaluation of whether one is using the challenges in a racially discriminatory manner much clearer. As I showed above, when d decreased the probability of the outcome generally decreased. For example with the Hecker case, pretend there were only a total of 5 peremptory challenges. So in this hypothetical situation we have n = 20, d = 5, and p = k = 2 (if n were much higher with the number of peremptory challenges limited to 5 per side, the jury would have to be close to set already by the time 20 individuals were questioned). The probability of this is 5%, whereas if d = 7 the probability is 11%. Generically, having fewer challenges makes particular racial patterns less likely by chance.
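Both of those probabilities are simple to verify (a sketch assuming scipy):

from scipy.stats import hypergeom

#P(both Asian panelists challenged) with n = 20 and a varying number of challenges
print(hypergeom.pmf(2, 20, 2, 5))  #d = 5, ~0.05
print(hypergeom.pmf(2, 20, 2, 7))  #d = 7, ~0.11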

I understand the motivation for peremptory challenges, but it is unclear to me why such a large number are currently afforded for most cases. Also to the extent that timeliness is a priority for the court, allowing fewer challenges would certainly decrease the necessary time needed for voir dire. (Which was a concern in the Hecker case, as the court only allowed a very short time for questioning the panels.)

The second is that case-law should be established for cognizable groups, particularly given the racial make up of the defendant(s) and victim(s). Or, conversely, counsels should be required, prior to voir dire, to establish the cognizable groups. This avoids cherry picking a group for a Batson challenge, as one could always specify a group based on the ex post facto characteristics of the individuals targeted by the peremptory challenges. To put a probability on whether a certain number of a particular group could be chosen at random, it is necessary to supply the hypothesis before looking at the data. Ad hoc selections of a group could always occur, and with some of the other statistical tests the ad hoc inclusion of cells in a contingency table, or of groups in a test statistic, could impact the analysis. Making such a case a priori should prevent any nefarious manipulation of the numbers after the fact.

Neither of these suggestions appears too onerous to me to be reasonable. I doubt any lawyer or judge is going to be typing binomial coefficients into Wolfram Alpha during voir dire anytime soon though. Maybe I should make a look-up table or nomogram of hypergeometric probabilities for typical values that would come up during voir dire. It would be pretty easy for a lawyer to keep a tally and then do a look-up, or to keep in mind beforehand at what point a set of challenges is unlikely due to chance. With many peremptory challenges and few uses on a particular group, I suspect the probabilities of that happening by chance are much larger than people expect. The two out of two Asians in the Hecker case is a good example where the numerical evidence of discrimination is very weak no matter how you plug in the numbers.

# Turning data from Python into SPSS data

I’ve shown how you can grab data from SPSS and use it in Python commands, and I figured a post about the opposite process (taking data in Python and turning it into an SPSS data file) would be useful, as a few different motivating examples have come up for me previously.

So first, as a simple illustration, let's make some simple data in Python as a list of tuples.

BEGIN PROGRAM Python.
MyData = [(1,2,'A'),(4,5,'B'),(7,8,'C')]
END PROGRAM.

Now to export this data into SPSS you can use spss.StartDataStep(), append variables using varlist.append, and then add cases using cases.append (see the Python programming PDF that comes with SPSS in the help to peruse all of these functions plus the documentation). This particular code adds 3 variables (two numeric and one string) and then loops through the MyData Python object and adds those cases to the defined SPSS dataset.

BEGIN PROGRAM Python.
import spss
spss.StartDataStep()                   #start the data step
MyDatasetObj = spss.Dataset(name=None) #define the data object
MyDatasetObj.varlist.append('X1',0)
MyDatasetObj.varlist.append('X2',0)
MyDatasetObj.varlist.append('X3',1)
for i in MyData:                       #add cases in a loop
    MyDatasetObj.cases.append(i)
spss.EndDataStep()
END PROGRAM.

Here this will create a SPSS dataset and give it a generic name of the form xDataset? where ? will be an incrementing number based on the session history of naming datasets. To specify the name beforehand you need to use the SPSS command DATASET DECLARE X. and then place the dataset name as the option in the spss.Dataset(name='X') command.

As linked above I have had to do this a few times from Python objects, so I decided to make a bit of a simpler SPSS function to take care of this work for me.

BEGIN PROGRAM Python.
#Export to SPSS dataset function
import spss

def SPSSData(data,vars,types,name=None):
    VarDict = zip(vars,types) #combining variables and formats into tuples
    spss.StartDataStep()
    datasetObj = spss.Dataset(name=name) #if you give a name, it needs to be declared
    #appending variables to dataset
    for i in VarDict:
        datasetObj.varlist.append(i[0],i[1])
    #now the data
    for j in data:
        datasetObj.cases.append(list(j))
    spss.EndDataStep()
END PROGRAM.

This code takes an arbitrary Python object (data), and two lists, one of the SPSS variable names and the other of the format for the SPSS variables (either 0 for numeric or an integer for the size of the strings). To transform the data to SPSS, it needs a list of the same dimension as the variables you have defined, so this works for any data object that can be iterated over and that can be coerced to returning a list. Or more simply, if list(data[0]) returns a list of the same dimensions for the variables you defined, you can pass the data object to this function. This won’t work for all situations, but will for quite a few.

So with the permutation examples I previously linked to, we can use the itertools library to create a set of all the different permutations of string ABC. Then I define a set of variables and formats as lists, and then we can use the SPSSData function I created to make a new dataset.

DATASET DECLARE Combo.
BEGIN PROGRAM Python.
import itertools
YourSet = 'ABC'
YourLen = 3
x = itertools.permutations(YourSet,YourLen)
v = ['X1','X2','X3']
t = [1,1,1]
SPSSData(data=x,vars=v,types=t,name='Combo')
END PROGRAM. 

This work flow is not optimal if you are creating the data in a loop (such as in the Google Places API example I linked to earlier), but works well for static python objects, such as the object returned by itertools.
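One workaround when you do build the data in a loop is to accumulate the rows in a plain Python list first, and only call the function once at the end. A minimal sketch with made-up values (it assumes the SPSSData function from above is already defined in the session):

BEGIN PROGRAM Python.
rows = []
for i in range(5):
    rows.append((i, i**2))  #whatever each iteration produces
SPSSData(data=rows, vars=['X','XSq'], types=[0,0])
END PROGRAM.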

# Presentation at IACA 2014 – Making Field Stops Smart

Part of the work I am doing with the Finn Institute in collaboration with the Albany Police Department was accepted as a presentation at the upcoming IACA conference in Seattle next week. The NIJ used to have a separate Crime Mapping conference, but they folded it into the yearly IACA conference. So this is one of the NIJ Crime Mapping presentations.

The title of the presentation is Making Field Stops Smart, and below is the abstract:

Mapping hot spots of crime incidents for use in allocating patrol resources has become commonplace. This research is intended to extend that logic to mapping locations of field interviews. The project has two specific spatial analysis components: 1) are most of the stops being conducted at high crime locations, and 2) are the locations with the most stops also the locations with the most productive stops (in terms of arrests, contraband recovery, and stopping chronic offenders)? Making stops smart is being conducted as a research partnership between the Albany, NY police department and the Finn Institute of Public Safety.

The time of the presentation is at 15:30 on Thursday 9/11. Two other presenters, Eric Paull from Akron, Ohio and Christian Peterson from Portland, Oregon have presentations on the panel as well (see the IACA agenda for their talk abstracts).

I am uncomfortable publicly releasing the pre-print white papers given the collaboration (Rob Worden and Sarah McLean are co-authors) and because APD's name is directly attached to the work. But if you send me an email I can forward the white paper for this presentation and related work we are doing.

If you see me at IACA feel free to come up and say hi. I do not have any other plans while I am in town besides going to presentations.

# Rejected!

My Critique of Slope Graphs paper was recently rejected as a short article from The American Statistician. I’ve uploaded the new paper to SSRN with the suggested critiques and my responses to them (posted here).

I ended up bugging Nick Cox for some pre-peer-review feedback and he actually agreed! (A positive externality of participating at the Cross Validated Q/A site.) The main outcome of Nick's review was a considerably shorter paper. The reviews from TAS were pretty mild (and totally reasonable), but devoid of anything positive. The main damning aspect of the paper is that the reviewers (including Cox) just did not find the paper very interesting or well motivated.

My main motivation was the recent examples of slope graphs in the popular media, most of which are poor statistical graphics (and are much better suited as scatterplots). The most obvious is Cairo's book cover, which I thought in and of itself deserved a critique – but maybe I should not have been so surprised about a poor statistical graphic on the cover. I will not argue that this is anything but a rather weak motivation, but it was one I felt was warranted given the figures praising the use of slopegraphs in inappropriate situations.

In the future I may consider adding in more examples of slopegraphs besides the cover of Alberto Cairo's book. In my collection of examples I may pull out a few more from the popular media and popular data viz books (besides Cairo's there are blog post examples from Ben Fry and Andy Kirk – I haven't read their books so I'm unsure if they are within them). For a preview, I consider pretty much all of the examples bad except for Tufte's original ones. Part of the reason I did not do this is that I wrote the paper as a short article for TAS, and I figured adding these examples would make it too long.

I really had no plans to submit it anywhere besides TAS, so this may sit as just a pre-print for now. Let me know if you think it may be within the scope of another journal that I may consider.

# Log Scaled Charts in SPSS

Log scales are convenient for a variety of data. Here I am going to post a brief tutorial on making and formatting log scales in SPSS charts. First let's start with a simple set of data:

DATA LIST FREE / X (F1.0) Y (F3.0).
BEGIN DATA
1 1
2 5
3 10
4 20
5 40
6 50
7 70
8 90
9 110
END DATA.
DATASET NAME LogScales.
VARIABLE LEVEL X Y (SCALE).
EXECUTE.

In GPL code a scatterplot with linear scales would simply be:

GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"))
ELEMENT: point(position(X*Y))
END GPL.

To change this chart to a log scale you need to add a SCALE statement. Here I specify the Y axis (the 2nd dimension) as having a logarithmic scale by inserting SCALE: log(dim(2), base(2)) between the last GUIDE statement and the first ELEMENT statement. When using log scales many people default to a base of 10, but this uses a base of 2, which works much better with the range of data in this example. (Log base 2 also works much better for ratios that don't span into the 1,000's.)
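To see why base 2 fits this data better than base 10, consider the tick marks each base would generate over Y's range of 1 to 110 (a quick sketch in plain Python):

#candidate tick locations for each base over the range 1 to 110
base2 = [2**i for i in range(0, 7)]    #1, 2, 4, 8, 16, 32, 64
base10 = [10**i for i in range(0, 3)]  #1, 10, 100
print(base2, base10)  #base 2 gives many more reference points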

GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"))
SCALE: log(dim(2), base(2))
ELEMENT: point(position(X*Y))
END GPL.

Although the order of the statements makes no difference, I like to have the ELEMENT statements last, and group like statements together before them. If you want more control over the scale, you can specify a min or a max for the chart (by default SPSS tries to choose nice values based on the data). With log scales the minimum needs to be above 0. Here I use 0.5, which on a log2 scale is the same distance below 1 as 2 is above it (log2(0.5) = -1 and log2(2) = 1).

GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"))
SCALE: log(dim(2), base(2), min(0.5))
ELEMENT: point(position(X*Y))
END GPL.

You can see here though that we have a problem: two 1's on the Y axis! SPSS inherits the formats for the axis from the data. Since the Y data are formatted as F3.0, the 0.5 tick mark is rounded up to 1. You can fix this by formatting the variable before the GGRAPH command.

FORMATS Y (F4.1).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"))
SCALE: log(dim(2), base(2), min(0.5))
ELEMENT: point(position(X*Y))
END GPL.

But this is annoying, as it adds a decimal to all of the tick values. I wish you could use fractions in the ticks, but as far as I know this is not possible. To prevent this you should be able to specify the start() option in the GUIDE statement for the second dimension, but in a few attempts it was not working for me (and it wasn't a conflict with my template; set DEFAULTTEMPLATE=NO in this example to check). So here I adjusted the minimum to 0.8 instead of 0.5, and SPSS did not draw any ticks below 1 (you may have to experiment with this based on how much padding the Y axis has in your default template).

FORMATS Y (F3.0).
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
/GRAPHSPEC SOURCE=INLINE DEFAULTTEMPLATE=YES.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: X=col(source(s), name("X"))
DATA: Y=col(source(s), name("Y"))
GUIDE: axis(dim(1), label("X"))
GUIDE: axis(dim(2), label("Y"), start(1))
SCALE: log(dim(2), base(2), min(0.8))
ELEMENT: point(position(X*Y))
END GPL.

I will leave using log scales when you have 0's in your data for another day (and will talk about displaying missing data in the plot as well). SPSS has a safeLog scale, which is good for data that have negative values, but it is not necessarily my preferred approach for count data with 0's.

# Understanding Uncertainty – crime counts and the Poisson distribution

A regular occurrence for me when I was a crime analyst went along the lines of, "There was a noteworthy crime event in the media, can you provide some related analysis". Most of the time this followed one or multiple noteworthy crimes that caught the public’s attention, which could range from a series of thefts from vehicles over a month, the same gas station being robbed on consecutive days, or a single murder.

Any single violent crime is awful, and this is not meant to deny that. But single noteworthy events are often misconstrued as crime waves, or as general signs that the neighborhood is in decline or that the city is more dangerous now than it ever was. The media is intentionally hyperbolic, so it is not an effective gauge of whether crime is really increasing or decreasing. Here I will show an example of using the Poisson distribution to assess whether a recent spree of crimes is more than you would expect by chance.

So let's say that over the course of 20 years, the mean number of homicides in a jurisdiction is 2. Let's also say that in the year so far, we have 5 homicides. Ignoring that the year has not concluded, what is the probability of observing 5 or more homicides? Assuming the number of homicides is a Poisson distributed random variable (often not too unreasonable for low counts over long time periods), the probability is around 5.3%. Small, but not totally improbable. To calculate this probability it is just 1 minus the cumulative distribution function for a Poisson distribution with the given mean. This can be calculated easily in R by using ppois, i.e. 1 - ppois(5-1,2) (just replace 5 with your observed count and 2 with your mean). Note that I subtract 1 from 5, otherwise it would be testing the probability of over 5 instead of 5 or more. For those analysts using Excel, the formula is =1 - POISSON.DIST(5-1,2,TRUE). For SPSS it would be COMPUTE Prob = 1 - CDF.POISSON(5-1,2).

The reason for making these calculations is specifically to understand chance variation given the historical numbers. For the crime analyst it is necessary to avoid chasing noise. For the public it is necessary to understand the context of current events in light of historical data. For another example, say that the average number of robberies in a month is 15; what is the probability of observing 21 or more robberies? It is 8%, so basically you would expect this high a number to happen at least one time every year (i.e. 0.08*12 ~ 1). Without any other information, there is little reason for a crime analyst to spend extra time examining 21 robberies in a month based on the total number of events alone. Ditto for the public: there is little reason to be alarmed by that many robberies in a month given the historical data. (The analyst may want to examine the robberies for other reasons, but there is no reason to be fooled into thinking there is an unexpected increase.)

Here I’ve ignored some complications for the sake of simplicity in the analysis. One is that crime may not be Poisson distributed, but may be under- or over-dispersed. In the case of over-dispersion (which seems to happen more often with crime data) the series likely has a higher number of 0's and then bursts of high activity. In this case you would expect the high bursts more often than you would with the Poisson distribution. For under-dispersed Poisson data, the variance is smaller than the mean, and so high bursts of activity are less likely. These assumptions are fairly simple to check (at least to see if they are grossly violated): either see if the mean approximately equals the variance, or draw a histogram and superimpose the Poisson density for the given mean. This also ignores seasonal fluctuations in crime (e.g. more burglaries occur in the summer than in the winter).
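A crude check for over- or under-dispersion is to compare the mean and the variance of the historical counts. A sketch with made-up weekly counts (numpy assumed):

import numpy as np

counts = np.array([0, 2, 1, 3, 0, 5, 2, 1, 2, 4])  #hypothetical weekly counts
print(counts.mean(), counts.var())  #roughly equal if the Poisson assumption is reasonable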

Even if you do not like making the Poisson assumption, a very simple analysis is to plot the time series of the event over a long period. The rarer the crime, the larger the aggregation and the longer the time series you might need, but this is pretty straightforward to do with a SQL query and whatever program you use to conduct analysis. For UCR crime counts, you can try the UCR data tool to see if your jurisdiction has historical annual data going back to 1985. My experience is that the vast majority of crime waves depicted in the media are simply chance fluctuations, clearly visible as such just by inspecting the time series plot. Similarly, such a plot will show if there is an increase or a decrease compared to historical numbers. Another simple analysis is to take the current numbers and rank them against, say, the prior 50 to 100 values in the series. If the current value is truly abnormal, it should be the highest, or very near the highest, among those prior values.

Another complication I have ignored is multiple testing. When one is constantly observing a series, even a rare event is likely to happen over a long period of observation. So let's say that in your jurisdiction there are on average 3 domestic assaults in a week, and one week you observe 9. The probability of observing 9 or more is 0.003, but over a whole year the probability of this happening at least once is around 18% (i.e. 1 - ppois(8,3)^52 in R code). Over the course of 10 years (around 520 weeks) we would expect around 2 weeks to have 9 or more domestic assaults (i.e. 520*(1-ppois(8,3))). (This probability goes up higher if we consider sliding windows, e.g. 9 or more domestic assaults in any 7 day span, instead of just over distinct weeks.) These statistics make the assumption that events are independent (likely not true in practice), but I would rather make that false assumption to get a sense of the probability than rely on gut feelings or opinions based on the notoriety of the recent crime(s).
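All of the probabilities in this section are one-liners in Python as well (a sketch assuming scipy, mirroring the R code above):

from scipy.stats import poisson

print(1 - poisson.cdf(5 - 1, 2))    #P(5+ homicides | mean 2), ~0.053
print(1 - poisson.cdf(21 - 1, 15))  #P(21+ robberies | mean 15), ~0.083
print(1 - poisson.cdf(8, 3)**52)    #P(at least one week with 9+ assaults in a year | mean 3), ~0.18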

The title for this blog post is taken from David Spiegelhalter's site Understanding Uncertainty, and that link provides an analogous example with the recent cluster of plane crashes.

# Using regular expressions in SPSS

SPSS has a native set of string manipulation functions that will suffice for many simple situations. But with the ability to call Python routines, one can use regular expressions (regex for short) to accomplish more complicated searches. A post on Nabble recently discussed extracting zip codes from address data as an example, and I figured it would make a good general example for the blog.

So first, the SPSSINC TRANS command basically allows you to use Python functions the same as SPSS functions to manipulate a set of data. So for example if there is a function in Python and it takes one parameter, say f(a), and returns one value, you can use SPSSINC TRANS to have SPSS return a new variable based on this Python function. So say we have a variable named X1 and we want to get the result of f(X1) for every case in our dataset, as long as the function f() is available in the Python environment the following code would accomplish this:

SPSSINC TRANS RESULT=f_X TYPE=0 /FORMULA f(X1).

This will create a new variable in the active SPSS dataset, f_X, that is the result based on passing the values of X1 to the function. The TYPE parameter specifies the format of the resulting variable; 0 signifies that the result of the function is a number and if it is a string you simply pass the size of the string.

So using SPSSINC TRANS, we can create a function that does a regular expression search and returns the matching substring. The case on Nabble is a perfect example: extracting a set of 5 consecutive digits in a string is difficult to do with native SPSS string manipulation functions, but is really easy with a regex.

So the workflow is quite simple: import the re library, define your regular expression search, and then make your function. For SPSS, if you return more than one value from a function, it expects them in a tuple (or a list). The Nabble thread discusses more complicated regexes, but here I keep it simple: \d{5}. This is interpreted as a search for \d, which is shorthand for the digits 0-9, followed by {n}, which is shorthand for repeating the preceding pattern n times in a row.

BEGIN PROGRAM Python.
import re
s = re.compile(r'\d{5}')
def SearchZip(MyStr):
    Zip = re.search(s,MyStr)
    if Zip:
        return [Zip.group()]
    else:
        return [""]
END PROGRAM. 

Let's make up some test data to test our function within Python, to make sure it works correctly.

BEGIN PROGRAM Python.
#Lets try a couple of examples.
test = ["5678 maple lane townname, md 20111",
"123 oak st #4 someplace, RI 02913-1234",
"9011 cedar place villagename"]

for i in test:
    print i
    print SearchZip(i)
END PROGRAM. 

And this outputs in the SPSS window:

5678 maple lane townname, md 20111
['20111']
123 oak st #4 someplace, RI 02913-1234
['02913']
9011 cedar place villagename
['']

So at this point it is as simple as below (assuming Address is the string field in the SPSS dataset we are searching):

SPSSINC TRANS RESULT=Zip TYPE=5 /FORMULA SearchZip(Address).

This just scratches the surface of what regexes can do. Based on some of the back and forth in the recent Nabble discussion, here is a pretty general solution: it searches for 5 digits at the end of the string, optionally followed by a space or a dash and another 4 more digits, re.compile(r"(\d{5})(?:[\s-])?(\d{4})?\s*$"). Note that because the middle grouping is optional, this would also match 9 digits in a row (which I think is ok in my experience cleaning string address fields, especially since the search is limited to the end of the string). Here is the full function for use. Note if you get errors about the None type conversion, update your version of SPSSINC TRANS; see this Nabble thread for details.

BEGIN PROGRAM Python.
import re
SearchZ = re.compile(r"(\d{5})(?:[\s-])?(\d{4})?\s*$") #5 digits in a row @ end of string
#and optionally space or dash plus 4 digits
def SearchZip(MyStr):
    Zip = re.search(SearchZ,MyStr)
    #re.search returns None if there is no match, so return
    #a tuple of two None's in that case
    if Zip:
        return Zip.groups()
    else:
        return (None,None)

#Lets try a couple of examples.
test = ["5678 maple lane townname, md 20111",
"5678 maple lane townname, md 20111 \t",
"123 oak st #4 someplace, RI 02913-1234",
"9011 cedar place villagename",
"123 oak st #4 someplace, RI 029131234",
"123 oak st #4 someplace, RI 02913 1234"]

for i in test:
    print [i]
    print SearchZip(i)
END PROGRAM. 

Because this returns two separate groups, the SPSSINC TRANS command will need to specify multiple variables, so would be something like:

SPSSINC TRANS RESULT=Zip5 Zip4 TYPE=5 4 /FORMULA SearchZip(Address).