Using Google Fusion Tables to make some maps!

In the past, to share interactive maps with others I’ve used BatchGeo and CartoDB. BatchGeo makes it super easy to geocode a few incidents, and CartoDB has a few more stylistic options (including some very cool animations). Both of these services limit the number of points you can map for free, though. The new Google Maps lets you make similar products to BatchGeo and CartoDB: you can upload a csv or kml file, do some light editing of the points, and then embed an iframe in a website if you want (I wish Google Maps had a time slider like Google Earth does). Here is an example from my PhD of a few locations where one of my original models did a very poor job of predicting the amount of crime at the street midpoint or intersection.

But for a few recent projects I wanted to place many more geographies on the map than these free versions allow. ArcGIS Online is pretty slick in my few tests, but I am settling on Google Fusion tables for the ability to link the geographies and data tables (plus the ability to filter is very nice). Basically you can upload your data table and kml in separate Fusion tables and then merge them to create your own polygons with associated data. Here is another example from my dissertation, with the embedded map below.

Basically what I do is make a set of units of analysis based on street mid-points and intersections. I then divide the city up into the Thiessen polygons of that set of points, which I use to allocate different areal measures to each unit. For example, I can calculate how much sidewalk area overlaps each Thiessen polygon.
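For anyone curious what that overlap calculation might look like in code, below is a minimal sketch using geopandas. The file names and the UnitID field are placeholders for illustration, not the actual data I used, and both layers are assumed to share the same projected coordinate system.

#Hypothetical sketch: intersect Thiessen polygons with sidewalk polygons
#and sum the sidewalk area that falls inside each Thiessen unit.
import geopandas as gpd

thiessen = gpd.read_file("thiessen_polygons.shp")   #placeholder file, one polygon per street unit with a "UnitID" field
sidewalks = gpd.read_file("sidewalks.shp")          #placeholder sidewalk polygons

#overlay returns the piecewise intersections of the two layers
pieces = gpd.overlay(thiessen, sidewalks, how="intersection")
pieces["SideArea"] = pieces.geometry.area

#total sidewalk area within each Thiessen polygon (zero if no overlap)
side_by_unit = pieces.groupby("UnitID")["SideArea"].sum().reset_index()
result = thiessen.merge(side_by_unit, on="UnitID", how="left").fillna({"SideArea": 0})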

I’m using Google Fusion tables for some other projects in which I want people within the PD to be able to interactively explore the data. My main interest in these slippy maps is that you can pan and zoom – with a static map it is hard to recreate all of the potential views a consumer of the map wants. I can typically show the forest – the general data patterns – more nicely in a static map, but if I think the user will want to zoom in to particular locations these interactive maps meet that challenge. Pop-ups allow for briefly digging into the data as well, but don’t allow for visualizing patterns. Fusion tables are very limited in the styling of the geography though. (All of these free versions are pretty limited, but Fusion tables are especially restrictive for point symbology and creating choropleth classes.)

Using these maps has a trade-off when sharing with the PD though. They are what I would call semi-public: if you want others to be able to view the map you can share a link, but then anyone with the link can see the map. This prevents sharing sensitive information on the map that could possibly be leaked. (For access control over more sensitive information, e.g. requiring a user to sign on to a secure website, I know Bair Analytics offers paid products like that – probably some of the prior web map apps I mentioned do so as well.) I’ve made them in the past for Troy P.D., but I really have no idea how often they were used – so other analysts, let me know in the comments if you’ve had success disseminating info. within the police department with maps like these.

I’m getting devilishly close to finishing my dissertation, and I will post an update and link when the draft is complete. My prospectus can be seen here, and the linked maps are part of some supplemental material I compiled. The supplemental info. should provide a little more detail on what the maps are showing.

Smoothed regression plots for multi-level data

Bruce Weaver on the SPSS Nabble site pointed out that the Centre for Multilevel Modelling has added some syntax files for multilevel modelling for SPSS. I went through the tutorials (in R and Stata) a few years ago and would highly recommend them.

Somehow following the link trail I stumbled on this white paper, Visualising multilevel models: The initial analysis of data, by John Bell, and figured it would be good fodder for the blog. Bell basically shows how using smoothed regression estimates within groups is a good first step in the data analysis of complicated multi-level data. I obviously agree, and previously showed how to use ellipses to the same effect. The plots in the Bell white paper, though, are very easy to replicate directly in base SPSS syntax (no extra stat modules or macros required) using just GGRAPH and inline GPL.

For illustration purposes, I will use the same data as I did to illustrate ellipses. It is the popular2.sav sample from Joop Hox’s book. So, on to the SPSS code: first we will define a FILE HANDLE for where the popular2.sav data is located and then open that file.

FILE HANDLE data /NAME = "!!!!!!Your Handle Here!!!!!".
GET FILE = "data\popular2.sav".
DATASET NAME popular2.

Now, writing the GGRAPH code that will follow is complicated. But we can use the GUI to help us write most of it and then edit the pasted code to get the plot we want in the end. The easiest way to start getting the graph with the regression lines we want is to navigate to the chart builder menu (Graphs -> Chart Builder) and create a scatterplot with extrav on the x axis, popular on the y axis, and class to color the points. The image below is a screen shot of this process, and below that is the GGRAPH code you get when you paste the syntax.

*Base plot created from GUI.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=extrav popular class[LEVEL=NOMINAL] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: extrav=col(source(s), name("extrav"))
  DATA: popular=col(source(s), name("popular"))
  DATA: class=col(source(s), name("class"), unit.category())
  GUIDE: axis(dim(1), label("extraversion"))
  GUIDE: axis(dim(2), label("popularity sociometric score"))
  GUIDE: legend(aesthetic(aesthetic.color.exterior), label("class ident"))
  ELEMENT: point(position(extrav*popular), color.exterior(class))
END GPL.

Now, we aren’t actually going to generate this chart. With 100 classes, it will be too difficult to identify any differences between classes unless a whole class is an extreme outlier. Here I am going to make several changes to generate the linear regression line of popularity on extraversion within each class. To do this we will make some edits to the ELEMENT statement:

  • replace point with line
  • replace position(extrav*popular) with position(smooth.linear(extrav*popular)) – this tells SPSS to generate the linear regression line
  • replace color.exterior(class) with split(class) – the split modifier tells SPSS to generate the regression lines within each class.
  • make the regression lines semi-transparent by adding in transparency(transparency."0.7")

Extra things I did for aesthetics:

  • I added jittered points to the plot, and made them small and highly transparent (these really aren’t necessary in the plot and are slightly distracting). Note I placed the points first in the GPL code, so the regression lines are drawn on top of the points.
  • I changed the FORMATS of extrav and popular to F2.0. SPSS takes the formats for the axis in the charts from the original variables, so this prevents decimal places in the chart (and SPSS intelligently chooses to only label the axes at integer values on its own).
  • I take out the GUIDE: legend line – it is unneeded since we do not use any colors in the chart.
  • I change the x and y axis labels, e.g. GUIDE: axis(dim(1), label("Extraversion")) to be title case.

*Updated chart with smooth regression lines.
FORMATS extrav popular (F2.0).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=extrav popular class[LEVEL=NOMINAL] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: extrav=col(source(s), name("extrav"))
  DATA: popular=col(source(s), name("popular"))
  DATA: class=col(source(s), name("class"), unit.category())
  GUIDE: axis(dim(1), label("Extraversion"))
  GUIDE: axis(dim(2), label("Popularity Sociometric Score"))
  ELEMENT: point.jitter(position(extrav*popular), transparency.exterior(transparency."0.7"), size(size."3"))
  ELEMENT: line(position(smooth.linear(extrav*popular)), split(class), transparency(transparency."0.7"))
END GPL.

So here we can see that the slopes are mostly positive, with intercepts varying mostly between 0 and 6. The slopes are generally positive and (I would guess) around 0.25. There are a few outlier slopes, and given that the class sizes do not vary much (most are around 20) we might dig into these outlier locations a bit more to see what is going on. Generally though, with 100 classes it doesn’t strike me as very odd that some go against the norm, and a random effects model with varying intercepts and slopes seems reasonable, as does the assumption that the distribution of slopes is normal. The intercepts and slopes probably have a slight negative correlation, but not as much as I would have guessed given a set of scores that are so restricted in this circumstance.

Now, the Bell paper has several examples of using the same type of regression lines within groups, but using loess regression estimates to assess non-linearity. It is really simple to update the above plot to incorporate this – one would simply change smooth.linear to smooth.loess. SPSS also has the ability to estimate quadratic and cubic polynomial fits right within GPL (e.g. smooth.cubic).

Here I will suggest a slightly different chart that allows one to assess how much the linear and non-linear regression lines differ within each class. Instead of super-imposing all of the lines on one plot, I make a small multiple plot where each class gets its own panel. This allows a much simpler assessment of whether any one class shows a clear non-linear trend.

  • Because we have 100 groups I make the plot bigger using the PAGE command. I make it about as big as can fit on my screen without having to scroll, 1,200 x 1,200 pixels. (Also note that when you use a PAGE: begin command you need an accompanying PAGE: end() command.)
  • For the small multiples, I wrap the panels by setting COORD: rect(dim(1,2), wrap()).
  • I strip the x and y axis labels from the plot (simply delete the label options within the GUIDE statements). Space is precious – I don’t want it to be taken up with axis labels and legends.
  • For the panel label I place the label on top of the panel by setting the opposite option, GUIDE: axis(dim(3), opposite()).

WordPress shrinks the graph to fit it in the blog post, but if you open the graph up in a second window you can see how big it is and explore it more easily.

*Checking for non-linear trends.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=extrav popular class[LEVEL=NOMINAL] MISSING=LISTWISE 
    REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  PAGE: begin(scale(1200px,1200px))
  SOURCE: s=userSource(id("graphdataset"))
  DATA: extrav=col(source(s), name("extrav"))
  DATA: popular=col(source(s), name("popular"))
  DATA: class=col(source(s), name("class"), unit.category())
  COORD: rect(dim(1,2), wrap())
  GUIDE: axis(dim(1))
  GUIDE: axis(dim(2))
  GUIDE: axis(dim(3), opposite())
  ELEMENT: line(position(smooth.linear(extrav*popular*class)), color(color.black))
  ELEMENT: line(position(smooth.loess(extrav*popular*class)), color(color.red))
  PAGE: end()
END GPL.

The graph is complicated, but with some work one can go group by group to see any deviations from the linear regression line. So here we can see that most of the non-linear loess lines are quite similar to the linear line within each class. The only one that strikes me as noteworthy is class 45.

Here there is not much data within classes (around 20 students), so we have to be wary of estimating these non-linear regression lines from small samples. You could generate error intervals around the linear and polynomial regression lines within GPL, but here I do not do that as it adds a bit of complexity to the plot. But this is an excellent tool if you have many points within your groups, and it is amenable to quite a large set of panels.

Jittered scatterplots with 0-1 data

Scatterplots with discrete variables and many observations take some touches beyond the defaults to make them useful. Consider the case of a categorical outcome that can only take two values, 0 and 1. What happens when we plot this data against a continuous covariate with my default chart template in SPSS?

Oh boy, that is not helpful. Here is the fake data I made and the GGRAPH code to make said chart.

*Inverse logit - see.
*http://andrewpwheeler.wordpress.com/2013/06/25/an-example-of-using-a-macro-to-make-a-custom-data-transformation-function-in-spss/.
DEFINE !INVLOGIT (!POSITIONAL  !ENCLOSE("(",")") ) 
1/(1 + EXP(-!1))
!ENDDEFINE.

SET SEED 5.
INPUT PROGRAM.
LOOP #i = 1 TO 1000.
  COMPUTE X = RV.UNIFORM(0,1).
  DO IF X <= 0.2.
    COMPUTE YLin = -0.5 + 0.3*(X-0.1) - 4*((X-0.1)**2).
  ELSE IF X > 0.2 AND X < 0.8.
    COMPUTE YLin = 0 - 0.2*(X-0.5) + 2*((X-0.5)**2) - 4*((X-0.5)**3).
  ELSE.
    COMPUTE YLin = 3 + 3*(X - 0.9).
  END IF.
  COMPUTE #YLin = !INVLOGIT(YLin).
  COMPUTE Y = RV.BERNOULLI(#YLin).
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME NonLinLogit.
FORMATS Y (F1.0) X (F2.1).

*Original chart.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"))
  ELEMENT: point(position(X*Y))
END GPL.

So here we will do a few things to the chart to make it easier to interpret.

SPSS can jitter the points directly within GGRAPH code (see point.jitter), but here I jitter the data myself by a small uniform amount. The extra aesthetic options for making points smaller and semi-transparent are at the end of the ELEMENT statement.

*Making a jittered chart.
COMPUTE YJitt = RV.UNIFORM(-0.04,0.04) + Y.
FORMATS Y YJitt (F1.0).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y YJitt
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: YJitt=col(source(s), name("YJitt"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"), delta(1), start(0))
  SCALE: linear(dim(2), min(-0.05), max(1.05))
  ELEMENT: point(position(X*YJitt), size(size."3"), 
           transparency.exterior(transparency."0.7"))
END GPL.

If I made the Y axis categorical I would need to use point.jitter in the inline GPL code because SPSS will always force the categories to the same spot on the axis. But since I draw the Y axis as continuous here I can do the jittering myself.

A useful tool for exploratory data analysis is to add a smoothing term to the plot – a local estimate of the mean at different locations of the X axis. No binning necessary; here is an example using loess right within the GGRAPH call. The red line is the smoother, and the blue line is the actual proportion I generated the fake data from. It does a pretty good job of identifying the discontinuity at 0.8, but the earlier change points are not visible. Loess was originally meant for continuous data, but for exploratory analysis it works just fine on the 0-1 data here. See also smooth.mean for 0-1 data.

*Now adding in a smoother term.
COMPUTE ActualFunct = !INVLOGIT(YLin).
FORMATS Y YJitt ActualFunct (F2.1).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=X Y YJitt ActualFunct
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: X=col(source(s), name("X"))
  DATA: Y=col(source(s), name("Y"))
  DATA: YJitt=col(source(s), name("YJitt"))
  DATA: ActualFunct=col(source(s), name("ActualFunct"))
  GUIDE: axis(dim(1), label("X"))
  GUIDE: axis(dim(2), label("Y"), delta(0.2), start(0))
  SCALE: linear(dim(2), min(-0.05), max(1.05))
  ELEMENT: point(position(X*YJitt), size(size."3"), 
           transparency.exterior(transparency."0.7"))
  ELEMENT: line(position(smooth.loess(X*Y, proportion(0.2))), color(color.red))
  ELEMENT: line(position(X*ActualFunct), color(color.blue))
END GPL.

SPSS’s default smoothing is a little too smooth for my taste, so I set the proportion of the X variable to use in estimating the mean within the position statement.

I wish SPSS had the ability to draw error bars around the smoothed means (you can draw them around the linear regression lines with quadratic or cubic polynomial terms, but not around the local estimates like smooth.loess or smooth.mean). I realize they are not well defined and rarely have the coverage properties of typical regression estimators – but I would rather have some idea about the error than no idea. Here is an example using the ggplot2 library in R. Of course we can work the magic right within SPSS.

BEGIN PROGRAM R.
#Grab Data
casedata <- spssdata.GetDataFromSPSS(variables=c("Y","X"))
#ggplot smoothed version
library(ggplot2)
library(splines)
MyPlot <- ggplot(aes(x = X, y = Y), data = casedata) + 
          geom_jitter(position = position_jitter(height = .04, width = 0), alpha = 0.1, size = 2) +
          stat_smooth(method="glm", family="binomial", formula = y ~ ns(x,5))
MyPlot
END PROGRAM.

To accomplish the same thing in SPSS you can estimate restricted cubic splines and then use any applicable regression procedure (e.g. LOGISTIC, GENLIN) to save the predicted values and confidence intervals. It is pretty easy to call the R code though!
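If you would rather stay in Python than call R, you can piece together a similar spline smoother with statsmodels and patsy. This is only a rough sketch of the idea – the toy data below stands in for the simulated X and Y above, and the spline degrees of freedom are an arbitrary choice.

#Rough Python analogue of the ggplot2 spline smoother: a logistic regression
#on a natural cubic spline basis, plotting predicted probabilities with CIs.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

#toy stand-in for the simulated data above
rng = np.random.default_rng(5)
X = rng.uniform(0, 1, 1000)
p = 1/(1 + np.exp(-(-0.5 + 3*X)))   #made up smooth relationship
df = pd.DataFrame({"X": X, "Y": rng.binomial(1, p)})

#cr() is patsy's natural cubic regression spline basis
mod = smf.glm("Y ~ cr(X, df=5)", data=df, family=sm.families.Binomial()).fit()

grid = pd.DataFrame({"X": np.linspace(df["X"].min(), df["X"].max(), 200)})
pred = mod.get_prediction(grid).summary_frame()   #mean and confidence interval columns

plt.scatter(df["X"], df["Y"] + rng.uniform(-0.04, 0.04, len(df)), s=5, alpha=0.1)
plt.plot(grid["X"], pred["mean"], color="red")
plt.fill_between(grid["X"], pred["mean_ci_lower"], pred["mean_ci_upper"], alpha=0.3)
plt.show()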

I haven’t explored the automatic linear modelling procedures, so let me know in the comments if there is a simple way right in SPSS to explore such non-linear predictions.

Poisson regression and crazy predictions

Here is a problem I’ve encountered a few times in my own work (and others’) with Poisson regression models and the exponential link function. It came up recently in some discussions on the scatterplot blog by Jeremy Freese (see 1 & 2) critiquing the PNAS paper on the effect of female-named hurricanes on death tolls, so I figured I would expand on those thoughts a little here.

The problem when you estimate a Poisson regression model is that the exponential link function can become explosive for explanatory variables that have a large range. To be clear, we have a Poisson regression model of the form (here E[Y] means the expected value of Y):

log(E[Y]) = B1*(X)
    E[Y]  = e^(B1*X)

If X has a small range this may be fine, but if X has a large range it can become problematic. Consider if Y is hurricane deaths, X is the monetary damage of the hurricane, and B1 = 0.01. Let’s say the monetary damage ranges from 1 to 1000 (imagine these are in thousands of dollars, so they range between $1,000 and $1 million). What happens with the predictions?

E[Deaths] = e^(0.01*   1) =     1.01
E[Deaths] = e^(0.01*   5) =     1.05
E[Deaths] = e^(0.01*  10) =     1.11
E[Deaths] = e^(0.01*  50) =     1.65
E[Deaths] = e^(0.01* 100) =     2.71
E[Deaths] = e^(0.01* 500) =   148
E[Deaths] = e^(0.01*1000) = 22026

These predictions are invariant to linear transformations – that is, Z-scoring X or changing its units doesn’t change the predictions (expressing X in [dollars/1000] makes no difference compared to just including [X] on the right hand side). The coefficient B1 will simply be scaled by the appropriate inverse transformation. Also, I’d note that expressed as an incident rate ratio the effect would be e^0.01 = 1.01. On its face this appears to be a totally innocuous effect, and only in consideration of the variation in X does it appear to be absurd.

You can see that if the range of X were smaller, say between 1 and 100, the predictions might be fine – the predictions between 1 and 100 only vary by about 1.7 deaths. The problem with these explosive predictions at larger values is that they are nonsense for most social scientific research. A simple sanity check to see if this is occurring is to compare the predicted value from your Poisson regression equation at the low end of X versus the high end (and just pretend all of the other explanatory variables are set to 0), exactly as I have done here. If the high end is crazy, you will need to consider an alternative model specification (or be very clear that the model cannot be extrapolated to the larger values of X).
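If you want to automate that check, a tiny helper function is enough – here is a hypothetical Python snippet (the function name is mine, and it only handles the single-covariate case described above).

#Quick sanity check: compare the implied Poisson predictions at the
#extremes of X, holding all other covariates at zero.
import numpy as np

def poisson_range_check(b, x_min, x_max):
    lo, hi = np.exp(b * x_min), np.exp(b * x_max)
    print("E[Y] at X=%g: %.2f" % (x_min, lo))
    print("E[Y] at X=%g: %.2f" % (x_max, hi))
    print("Ratio of high to low prediction: %.1f" % (hi / lo))

poisson_range_check(0.01, 1, 1000)   #the hurricane damage example above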

A useful alternative parametrization is simply to log X; in this case, exponentiating the right hand side makes the prediction a power of the original metric.

So imagine we fit the model:

log(E[Y]) = B2*(log(X))
    E[Y]  = e^(B2*log(X))
          = X^B2

Let’s say here that B2 = 0.5. What happens to our predictions now?

E[Deaths] =    1^0.5 =  1
E[Deaths] =    5^0.5 =  2.2
E[Deaths] =   10^0.5 =  3.2
E[Deaths] =   50^0.5 =  7.1
E[Deaths] =  100^0.5 = 10
E[Deaths] =  500^0.5 = 22
E[Deaths] = 1000^0.5 = 32

Those predictions look a little easier to swallow at the larger ranges. Notice also the differences in predictions at the smaller values? There is more discrimination among the smaller values than on the original scale, but the larger values are suppressed. Let’s consider the predictions side by side for easier comparison.

   X    e^(0.01*X)   X^0.5
----    ----------   -----
   1             1       1
   5             1       2
  10             1       3
  50             2       7
 100             3      10
 500           148      22
1000         22026      32

A frequent problem with logging the explanatory variables is that they contain zeroes. A simple alternative is to treat log(0) as 0 and then include a separate dummy variable equal to 1 when X = 0. This model may not make Occam happy, as it implies a discontinuity at 0, but it is in my opinion a small price to pay. Also, if there are a lot of zeroes, it doesn’t strike me as totally unrealistic to have a mixture of what happens at 0 and what happens at the higher values. So the full model written out would be:

log(E[Y]) = B3*D + B4*(log(X))

But the model is essentially discontinuous. When X=0, we treat log(X)=0 and D=1, so the model reduces to:

log(E[Y]) = B3*D :When X = 0

When X>0, D=0 and the model reduces to:

log(E[Y]) = B4*(log(X)) :When X > 0

Now, it certainly would be weird if B3>>0, as this would imply a high spike at 0, after which Y drops back down to 1 at an X value of 1 and then increases with X. If we expect B4 to be positive, then a negative value of B3 (or one very close to 0) would make the most sense. It is still a discontinuity in the function, but one that may make theoretical sense. So imagine we fit the equation log(E[Y]) = B3*D + B4*(log(X)); let’s say B4 is 0.5 (the same as B2), and that B3 is equal to -0.1. This would then make the set of predictions:

E[Deaths] =  e^-0.1  =  0.9
E[Deaths] =    1^0.5 =  1.0
E[Deaths] =    5^0.5 =  2.2
E[Deaths] =   10^0.5 =  3.2
E[Deaths] =   50^0.5 =  7.1
E[Deaths] =  100^0.5 = 10
E[Deaths] =  500^0.5 = 22
E[Deaths] = 1000^0.5 = 32

So in this made up example the discontinuity pretty much fits right in with the rest of the function. We may consider other non-linear transformations of X as well (splines or higher powers), but frequently an additional problem is that the bulk of the data lie in the lower end of the range. So for our dollars, if the distribution were highly right skewed, there may only be a few values at 100 or higher. These can be highly influential if you use powers of X (e.g. include X^2, X^3 etc. on the right hand side) – so splines are a better choice – but essentially no matter how you fit the function it will be hard to verify the fit at these values or to extrapolate to those tails. So the fit of the original function may be fine – it just implies unrealistic marginal effects in the tails.
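Constructing the dummy and the logged variable for that parametrization takes only a couple of lines; here is a hedged numpy sketch (the example values are made up).

#Build the right hand side variables for the dummy-plus-log parametrization:
#D = 1 when X is zero, and log(X) with log(0) treated as 0.
import numpy as np

X = np.array([0, 0, 1, 5, 10, 50, 100, 500, 1000], dtype=float)   #made up example values

D = (X == 0).astype(float)        #dummy equal to 1 when X = 0
LogX = np.zeros_like(X)
LogX[X > 0] = np.log(X[X > 0])    #treat log(0) as 0

#D and LogX then both go on the right hand side of the Poisson regression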

So how do we verify one equation over the other? Visualizing count data in scatterplots tends to be harder than visualizing continuous data, especially if there is a stockpile of data at 0. The problem is simply exacerbated if the explanatory variable has a similar right skew: there will be a large mass near the origin of the plot and very sparse data everywhere else.

My simple suggestion is to just bin the data at X values, which is very simple if X is integer valued, and then plot the mean and standard error of the Y values within those bins. As Poisson regression and its variants rely on asymptotic properties, if the error bars are too variable to deduce a pattern you should be concerned your sample size isn’t large enough to begin with.
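As an illustration, the binned means and standard errors are only a few lines of pandas – the toy data here are simulated to roughly follow the X^0.5 relationship above, not taken from any real data.

#Bin the data at integer X values and plot the mean of Y with standard errors.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#toy data roughly following E[Y] = X^0.5
rng = np.random.default_rng(1)
X = rng.integers(1, 101, 2000)
Y = rng.poisson(np.sqrt(X))
df = pd.DataFrame({"X": X, "Y": Y})

binned = df.groupby("X")["Y"].agg(["mean", "std", "count"])
binned["se"] = binned["std"] / np.sqrt(binned["count"])

plt.errorbar(binned.index, binned["mean"], yerr=2 * binned["se"], fmt="o")
plt.show()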

If the bulk of the data only have a small range over X, then it will be hard in practice to differentiate between the two model parametrizations I suggest here (with the typical noisy data we have in the social sciences). So you may prefer logging the X variable simply to prevent the dramatic explosions in the tails of the data right from the start.

I do feel comfortable saying that if the ratio of the largest to smallest value of the independent variable is over 100, you should check the predictions of the exponential link function very closely (if the smallest value is 0, just estimate the ratio as if the smallest value were 1). If this ratio is 100 or larger, unless the regression coefficient in the Poisson equation is very small (<<0.01) the predictions may explode into very implausible ranges for the larger X values.

What is up with 3d graphics for book covers?

The other day in Google Books I noticed Graphics for Statistics and Data Analysis with R by Kevin Keen in the related books section. What caught my eye was not the title (there have to be 100+ related R books at this point) but the really awful 3d pie chart.

Looking at the preview on Google Books, this appears to be an unfortunate substitution. The actual cover has a much more reasonable set of surface plots, and other online book stores (e.g. Amazon) appear to have the correct cover.

I suspect someone at CRC Press used some stock imagery for the cover, and unfortunately the weird 3d pie graph has been propagated to the Google Books preview without correction.

This reminded me of a few other book covers in cartography and data visualization that I find less than appealing. Now, I’m not saying to judge a book by its cover, and I have not read all of the books I will point to here. But I find the use of 3d graphics on book covers in the data visualization field to be strange, bordering on cognitive dissonance with the advice most of the authors give.

First I’ll start with a book I have read, and would suggest to everyone, Thematic Cartography and Geographic Visualization by Slocum et al. I have the 2005 version, and it is adorned with this 3d landscape. (Sorry, this is the largest image I can find online – other editions I believe have different covers.)

The multivariate display of data is admirable – so for exploratory analysis you could make a reasonable argument for the use of proportionally sized circles superimposed on the choropleth map. The use of 3d in this circumstance though is gratuitous, and the extreme perspective hides much of the data while highlighting the green hills in the background.

For the second mapping book I have slight reservations about critiquing the cover (I am on the job market!). I have not read the book, so I cannot say anything about its contents. But roaming the book displays at an ASC conference I remember seeing this cover, GIS and Spatial Analysis for the Social Sciences: Coding, Mapping, and Modeling by Nash and Asencio.

This probably should not count among the other 3d graphics I am showing. The bar columns do have shading for 3d perspective – but the map is otherwise 2d. The spectral color scheme is an awful choice though. The red in the map tends to stand out the most – but those are the places with zero crimes, which I don’t think is the impression you want to make. The choropleth colors appear to be displaying the same data as the point data. The point data are so clustered that the choropleth can only be described as misleading – which may be a good point to make in the text with side by side maps – but on the cover? Bar locations seem to be unrelated (as we might expect for juvenile crime), but they are again aggregated to the (probably) census units – making me question whether the aggregation obfuscates the relationship. Bars are not available from the census – so this aggregation was likely intentional. I have no idea about the content of the book, and I will likely get it and do an overall review of all crime mapping books sometime. But the cover is unambiguously a bad map.

The last book cover with 3d graphics (related to data-visualization) that I immediately remembered was R For Dummies by Meys and de Vries.

Now this one, when you look closely, really is not bad. It is not a graph on the cover, but a set of winding, hexagon-cylinder stair steps. So the analogy of taking small steps is fine – but the visual similarity to other statistical 3d graphics is clear. Consider the SPSS For Dummies book by Griffith.

Now that is an intentional 3d chart made up of tiny blocks, with a trend line if you look closely, shadowed by cigarette-like red bars in the background. At least this is so strange (and not possible in statistical software) that it would never be confused with an actual reasonable statistical graphic. The Dummies series has such brand recognition as well that the dominant part of the cover might be the iconic yellow and type, as opposed to the inset graphic.

Not wanting to leave other software out of the loop, I looked for examples for SAS and Stata. SAS has a reasonable 3d cover in SAS System for Statistical Graphics by Friendly.

Short sidetrack story: I first learned statistical programming using SAS back in my undergrad days at Bloomsburg University. Default graphics for SAS at that point (04-08) were, I believe, still the ASCII-art looking things (at least that is what I remember). During our last meeting for my last statistics class, one of the other students showed me you could turn on the ODS output to html – and the tables and graphs were by default pretty nice. I have not had a need to use SAS since.

This 3d cover by Friendly is arguably a reasonable use of 3d. 3d graphs are hard to navigate, and the anchors connecting the observations to the non-linear surface make it easier to see whether a point sits below or above the surface. It is certainly difficult though to understand the fit of the function – so a series of bivariate graphs would likely be more intuitive – especially given the meager number of points. I suspect the 3d on the cover is there for the same reason 3d graphics were used on the other covers – because it looks cooler to book marketers!

Stata managed to buck the 3d graph trend – I could not find any example Stata books with 3d graphics. Nick Cox’s newer collection of his Speaking Stata series though has some interesting embellishments.

While in isolation all of the graphs are fine, I’m sure Cox would not endorse the gratuitous use of color gradients in the graphics (I don’t think svg-like gradients like that are even possible in Stata graphics). The ternary diagrams show nothing but triangles as well – so I don’t think such gradients are a good idea for what is simply the background of the plot. Such embellishments could actually encode data, but in the case of these bar graphs they likely neither hurt nor help with understanding the plot. When such gradients are used as the background though, they compete with the actual data in the plot. Stata apparently can do 3d graphs – so I might suggest I write a book on crime modelling (published by Stata Press) and insert a 3d graph on the cover (as this is clearly a niche in the market not currently filled!). I might have to make room for Chernoff faces somewhere on the front or back cover as well.

So maybe I am just seeing things in these examples of 3d covers. If anyone has any insight into how publishers choose their covers let me know – or if you have other examples of bad data visualization on book covers! Since most of my maps and graphs are pretty dull in 2d, I might just outsource the graphic design if I made a book.

Learning to write badly in the social sciences

I recently finished Michael Billig’s book, Learn to Write Badly: How to Succeed in the Social Sciences, and I was largely convinced of Billig’s thesis, so I shall reiterate it briefly here. Billig argues that much of social scientific writing is difficult to understand because of the excessive use of nouns instead of verbs. If we (ironically) use the word nominalization to describe this process of turning verbs into nouns, it may sound pretty similar to the old-hat advice of avoiding jargon in scientific writing.

Billig goes a bit further though than the usual avoid-jargon advice (which is uncontroversial), and gives many examples of where this change from verbs to nouns has negative consequences for how writing is interpreted. Two of these are:

  • Verbs are often much clearer about which actor is performing which actions (in turning the verb into a noun both the actor and the action can become ambiguous)
  • Replacing verbs with nouns gives a false sense of authority that the noun actually exists

The first is important for social scientists because we are pretty much always describing the actions of humans. The second Billig likens to marketing strategies to promote one’s work (similar to how advertisements promote products), an analogy I imagine turns a few academics’ stomachs.

As an example from my own work, I will use the title of one of my papers, The Moving Home Effect: A Quasi Experiment Assessing Effect of Home Location on the Offence Location. The title is really awful, and in a bit of self-deprecation, the few times I presented the work I would make fun of my title making skills at the opening of my talk. The first part of the title before the colon, "The Moving Home Effect", is an example of an ambiguous use of nouns. First, describing my findings as the moving home effect is rather ambiguous – it could mean affecting anything and everything. My particular study is much more restricted: I examined the distance between crimes before and after offenders move. Second, I found the effect of moving on the distance between crimes to be rather small, which is probably one of the more interesting points of the paper. So saying "The Moving Home Effect" places unwanted emphasis on the idea that the effect exists and is real.

The same exercise can be used for the second part of the title. The use of quasi-experiment is simply econometrics jargon and is unneeded. It is an appeal to associate with a particular camp of analysis, and really does nothing to describe the nature of the work. So I propose possible rewrites of my title:

  • Moving one’s home slightly changes the average distance between offences
  • When an offender moves, it slightly changes the location of where they offend

These are much more descriptive (and shorter) than the original title. Reviewing my own work, such examples are rampant, so I have a bit of work to do to live up to Billig’s ideal, but I am convinced it will lead to improved writing. It also seems to me a good exercise for making one’s writing more concise, which is always welcome.

In (slight) defense of word clouds

Word cloud of the frequency of words in Dr. Seuss’s Cat in the Hat via Jason Davies D3.js app.

I try to view the practice of data visualization with an open mind. In particular, I like to think that there is rarely a strict dichotomy of good or bad visualizations (or, synonymously, maps). It is a continual gradation, and we can use knowledge of visual perception to guide us toward visualizations that will be better for communicating particular information. Here I want to spend a few minutes talking about word clouds, and note some things they do well along with the bad.

So first, I will be focusing solely on the presentation of data in the form of the word cloud. Jacob Harris has a wonderful rant about why he hates word clouds, but it is focused on the fact that many word clouds are devoid of real meaning. You can present meaningless content in any graphical form you want – so I will not discuss this further.

When we evaluate the utility of any particular graphic, one needs to be clear about the motivation for the graphic – what data the graphic is intended to display to its audience. With that goal in mind, we can then evaluate how well the graphic meets it. If the goal of a word cloud is a graphical depiction of the entire distribution of words, it is obviously inferior to a bar chart (or, with a long tail, a chart of the empirical CDF). The words in a word cloud tend to be all jumbled up in no particular order, so it is difficult to see the distribution. If the goal is a quantitative estimate of the difference between two words, word clouds also fail here, because we know that area judgements are much more difficult than comparing lengths along an aligned axis. In addition, the length of the word (or even particular letters with ascenders or descenders, depending on the font used) further confounds the size comparisons between the words in the word cloud. That is, even if the font sizes are in the end the same, SomeReallyBigLongWord will appear larger in the graphic than a word such as tiny. Again bar charts are clearly superior for either of these tasks, and Marti Hearst has a nice PDF white paper discussing these shortcomings (via Stephen Few’s Perceptual Edge site), What’s up with Tag Clouds? Also see Stephen Few’s critique of the use of packed circles in Tableau for a similar critique.

Given the shortcomings, there are two potential aspects of word clouds that I personally find appealing in terms of communicating information. They both have to do with focusing attention on particular words, as opposed to focusing on the empirical distribution of words or on the size difference between two words. The first aspect relates to the Stroop effect. Stroop’s experiments showed that words providing an interfering mental stimulus slow reading reaction time. Or more simply, it is easier to read red and blue than it is to read green and orange. So what does this have to do with tag clouds? In tag clouds, the differentially sized words themselves are the data. There is no cognitive burden as in bar charts, because one need not navigate between the label and the bar.

The second aspect is related to the fact that in a particular graphic, certain elements will dominate the reader’s attention. This is similar to creating layering in a chart (often discussed in cartography): make the elements you want to focus attention on come to the foreground, and let other elements recede to the background. For one example, Stephen Kosslyn argues that bar charts are preferable to Cleveland dot plots for presentations, as the bars have higher visual recognition (e.g. stand out more, or are more salient). Word clouds do this very well for the bigger words, while bar charts tend to be more appropriate for visualizing the entire distribution. That is, in a bar chart the bigger words aren’t immediately more salient – the actual bars displaying the data are.

Based on this, I would suspect that in a user study word clouds would outperform bar charts on the following tasks in terms of time taken:

  • Return the top three words in the graph.
  • Given two particular words, say which word is larger than the other.

Or more simply, even though we are not terribly effective at judging areas precisely, we can still rank order words based on their size. So in terms of rank order tasks I think word clouds will perform alright for simple comparisons. Given no time constraints, I would expect bar charts to always be more accurate, but depending on the nature of the task they will take longer. For the top three words, if the chart is ordered one would only need to read the top three and then reply. If the chart is unordered, one needs to take time to find them (this is where interference and the Stroop effect come into play). In most word clouds I see, the top words are pretty much immediately recognizable, so I suspect word clouds will be much faster for these tasks.

For the two particular words, I suspect it will boil down to which format lets you find the words the fastest. In the bar chart you have to scan a list, whereas in a tag cloud it is unordered. With the exception of very tiny words in the word cloud, I bet you will find each word faster in a word cloud in most examples. (This is my guess, I know of no experiments confirming this specific scenario.)

So what does this mean in terms of visualizing data – either in the form of a tag/word cloud or a more typical form like a bar chart? For the bar chart, the bar is the most salient feature of the graph. If you want to draw more attention to the particular words, while still leaving the data intact, consider direct labels on the bars or maybe a more table-like graphic. As always, if you can selectively filter out a larger number of words, it will provide a simpler graphic to evaluate for the remaining items.

Can the tag cloud ever be a reasonable data visualization compared to a simple table or bar graph? I believe it can, but only if the typical layout algorithms are improved to incorporate ancillary information. Tag clouds are very similar to bin packing algorithms, and strike me as natively similar to force directed network layout algorithms (such as Dorling cartograms). The layouts used by Wordle or the Jason Davies implementation simply place the words by a convenient algorithm in rank order of importance. They pretty much just take the biggest word, place it, then go around in a circle to find a place for the next biggest word.

We can think of other ways to lay out the words though. CiteULike has a wonderful example where the words are laid out in alphabetical order; you can see my entire library of tags here. Here is a screenshot of my tag library on the homepage; since it is a smaller space, they filter out the small tags:

The motivation for this is not to view the empirical distribution of the tags, but to look up particular tags. You can click on a tag and it brings up all of the entries in my library with that tag. If you don’t want to navigate to an additional site, simply take a look at the sidebar to the right of this WordPress template I use. There is a list of the tags I use on this site, ordered alphabetically and filtered, but with the font sized in proportion to how often each tag is used in posts. I find these uses of word clouds quite convenient, where I would rather give priority to the words themselves than to specific quantitative estimates of their frequencies.

There are other potential ways to lay out the words that can hold other quantitative information. For instance, I could lay out the tags in my CiteULike library according to shared articles. Coloring the words, and the typical random-order layouts, I find distracting, but for particular tasks (such as simple look up) they work just fine. Again, because the focus is on the word, and not an exact quantitative assessment of the magnitude (nor an understanding of the overall distribution), I say tag clouds are better in this particular circumstance than a bar chart. My arguments based on cognitive load and the speed of surveying word clouds certainly do not make them appropriate for all data viz. tasks, but I believe they work for some.

Using the Google Places API in Python

So I was interested in scraping data from the Google Places API recently. My motivation came from some criminology studies that examine the relationship between coffee shops and crime (Papachristos et al. 2011; Smith, 2014), under the premise that coffee shops are a measure of gentrification. I am also interested in gaining information about other place characteristics, as places with some sort of non-residential use (e.g. a bus stop, a gas station) tend to have elevated amounts of crime compared to solely residential locations. (These are not per se characteristics that make places criminogenic – they just influence the walking-around population, and thus make the baseline exposure of places greater because of more human interaction.)

So here I will give an example of grabbing data from the Google Places API given a grid of locations from SPSS, and exporting the results to an SPSS dataset. First you need to sign up for an API key with Google from the Google APIs console. This allows 1,000 queries per day. Here I will make an example of 6 cases, but again with just the base query rates you can submit up to 1,000 requests per day (and can get 100,000 by verifying your identity).

*Need UNICODE.
PRESERVE.
SET UNICODE=ON.

*Some example data.
DATA LIST FREE / ID Lat Lng.
BEGIN DATA
1   38.901989   -77.041921
2   38.901991   -77.036157
3   38.901993   -77.030393
4   38.906493   -77.041924
5   38.906495   -77.036159
6   38.906497   -77.030394
END DATA.
DATASET NAME Locations.

I typically don’t have SPSS in Unicode mode, but since Google will return some results in Unicode it is necessary to turn it on. After that I create a dataset with 6 locations in central D.C. These happen to be centroids from a regular grid I laid over the city with 500 meter square grid cells (and it only takes 800-some cells to cover the city of D.C.).
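If you don’t already have such a grid of centroids, generating one is only a few lines; here is a hypothetical Python sketch with a made up bounding box and a rough degree approximation for 500 meters (you would then clip the centroids to the city boundary).

#Generate a rough regular grid of lat/lng centroids to feed to the API.
import numpy as np

lat_step = 500 / 111320.0                               #approx. degrees per 500 meters of latitude
lng_step = 500 / (111320.0 * np.cos(np.radians(38.9)))  #and of longitude near D.C.

#made up bounding box roughly covering Washington D.C.
lats = np.arange(38.80, 39.00, lat_step) + lat_step / 2
lngs = np.arange(-77.12, -76.90, lng_step) + lng_step / 2

grid = [(lat, lng) for lat in lats for lng in lngs]     #clip these to the city boundary in practice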

Now Google has a pretty good intro to get your feet wet, so here I will just show the helper functions I created to ease the task. The first function takes as input parameters a latitude, a longitude, a radius (in meters), a type field, and your key for the maps API. It grabs the JSON you get as a response from that specific URL, and then turns the JSON object into a nested structure that can be traversed in Python. You can check out from Google what other information is available, but the IterJson function is just a helper that takes specific results from GoogPlac and puts them into a nicer list for me to use. Here I just grab the geographic coordinates, a place name, a reference string, and the address associated with the place.

*Functions to grab the necessary data from Google.
BEGIN PROGRAM.
import urllib, json

#Grabbing and parsing the JSON data
def GoogPlac(lat,lng,radius,types,key):
  #making the url
  AUTH_KEY = key
  LOCATION = str(lat) + "," + str(lng)
  RADIUS = radius
  TYPES = types
  MyUrl = ('https://maps.googleapis.com/maps/api/place/nearbysearch/json'
           '?location=%s'
           '&radius=%s'
           '&types=%s'
           '&sensor=false&key=%s') % (LOCATION, RADIUS, TYPES, AUTH_KEY)
  #grabbing the JSON result
  response = urllib.urlopen(MyUrl)
  jsonRaw = response.read()
  jsonData = json.loads(jsonRaw)
  return jsonData

#This is a helper to grab the Json data that I want in a list
def IterJson(place):
  x = [place['name'], place['reference'], place['geometry']['location']['lat'], 
         place['geometry']['location']['lng'], place['vicinity']]
  return x
END PROGRAM.

Now we can set up the parameters for the search in addition to the geographic coordinates. Here I make Python objects specifying the search type, cafe, and my API key. I will pass a constant of 375 meters for the radius field – which will cause just a bit of overlap among my grid cells.

*Setting the parameters I will use in the query - type & my key.
BEGIN PROGRAM.
#necessary info for geocoding
MyKey = 'Your API Key Here!!!'
MyType = 'cafe'
END PROGRAM.

Now the magic happens; this code 1) declares a new dataset to place the results in, and then in Python 2) grabs the current SPSS dataset, 3) sets the variables for the new dataset, 4) searches Google at each location in the original dataset, and 5) appends cases to the new SPSS dataset for every place returned at each initial search location. The nested loops are necessary because one location will return multiple places. (For those not using SPSS, the SPSS data here, alldata, is simply a list of named tuples, one for each row of SPSS data. This code is basically amenable to whatever data structure you are working with; just loop through your grid of geographic coordinates and then append those results to whatever data structure you want.)

*Now querying google and placing the results into a new file.
DATASET DECLARE PlacesResults.
BEGIN PROGRAM.
#Grabbing SPSS data
import spss, spssdata
alldata = spssdata.Spssdata().fetchall()

#making an SPSS dataset to place the results into
spss.StartDataStep()
datasetObj = spss.Dataset(name='PlacesResults')
datasetObj.varlist.append('Id',0)
datasetObj.varlist.append('LatSearch',0)
datasetObj.varlist.append('LngSearch',0)
datasetObj.varlist.append('name',100)
datasetObj.varlist.append('reference',500)
datasetObj.varlist.append('LatRes',0)
datasetObj.varlist.append('LngRes',0)
datasetObj.varlist.append('vicinity',300)

#appending the data to the SPSS dataset
for case in alldata:
  search = GoogPlac(lat=case[1],lng=case[2],radius=375,types=MyType,key=MyKey) #grab data
  if search['status'] == 'OK':                                                 #loop through results
    for place in search['results']:
      x = IterJson(place)
      results = list(case) + x
      datasetObj.cases.append(results)                                         #appending data to SPSS dataset
spss.EndDataStep()
END PROGRAM.

Now we have the data in our PlacesResults SPSS dataset for further manipulation or whatever. One comment on a prior post was basically "What is the point of SPSS, I can do all of this in Python". Well, this is generally true – but I continue to do work in SPSS because it is a much more convenient tool for data management than Python (at least for myself, knowing SPSS really well and Python not so well). Because we have slightly overlapping search radii, you will have duplicate results in your set. The following code in SPSS eliminates those duplicate results.

*************************************.
*Getting rid of duplicate results.
DATASET ACTIVATE PlacesResults.
SHOW N.
SORT CASES BY name vicinity LatRes LngRes Id.
MATCH FILES FILE = *
  /FIRST = Keep
  /BY name vicinity LatRes LngRes.
SELECT IF Keep = 1.
MATCH FILES FILE = * /DROP Keep.
EXECUTE.
SHOW N.
*************************************.
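If you would rather handle the de-duplication in Python before the results ever hit SPSS, a set of seen keys works just as well. Here is a minimal sketch – the all_results list is an assumption on my part, holding rows in the same variable order as the SPSS dataset defined above.

#Hypothetical Python-only alternative: drop duplicate places by keying
#on name, vicinity, and the result coordinates.
seen = set()
unique_results = []
for row in all_results:  #each row like [Id, LatSearch, LngSearch, name, reference, LatRes, LngRes, vicinity]
    key = (row[3], row[7], row[5], row[6])
    if key not in seen:
        seen.add(key)
        unique_results.append(row)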

So now you can do the analysis with your results without having to worry about duplicates. If you don’t want to keep SPSS in Unicode mode, remember to restore your settings before you end your SPSS session.

*Restoring back to LOCAL mode.
DATASET CLOSE ALL.
NEW FILE.
RESTORE.

So, I have done all this, but I must admit that it appears to me that this act would violate the terms of service for using the Google Places API data (see 10.1.3 Restrictions against Data Export or Copying specifically), as this is creating a database of the cached results. But others like FloatingSheep do approximately the same type of requests (and obviously cache the data in a permanent database to make such maps). Maybe if someone from Google is listening they can let me know if I am violating the terms of service for doing such a non-commercial academic project.

I will update at a later point about the results for cafes and crime at the micro place level in D.C. (as part of my dissertation). There actually is a dataset of sidewalk cafes from DC.gov, and a quick look at one of the older files I have (downloaded in Oct-13) shows 452 listed cafes. The scraped data from Google only totals 368 over the whole city – so it seems likely I am undercounting by taking the data from Google – but further investigation will be needed.

Some more value-by-alpha maps for D.C. Census Blocks

I’ve made some more value-by-alpha maps for my dissertation for percent non-white population in comparison to percentage of female-headed households for Census blocks in 2010 in D.C. See my first post for some background. The choropleth classes for the percents are chosen according to quintiles of the distributions and the alpha classes are arbitrary (note the alpha class uses households as the baseline in both maps, even though percent non-white uses the population counts).

When making these maps I’ve found that the Color Brewer sequential styles that span two colors work out much better than those that span one. What happens with the one-color sequential themes is that the faded-out colors end up being confounded with the lighter colors in the fully opaque ranges. Using two-color sequential schemes (here Yellow to Red and Yellow to Blue) provides greater discrimination between the classes.


I did not try out the black background for these maps (I thought perhaps it would be a bit jarring to have a swath of black stand out in the document). The CUNY Center for Urban Research has some other example value-by-alpha maps for New York City elections in 2013. After some discussion with Steven Romalewski, they decided they liked the white background better for their maps, and from my quick attempts at these examples I think I agree.

Taking account of the baseline in kernel density maps using CrimeStat

When making kernel density maps, sometimes the phenomenon is heavily clustered in particular locations simply because the population at risk is uneven over the study space. xkcd puts this in a bit more layman’s terms than I do:

So how do we take into account the underlying population? It depends on the data, but if you actually have the population at risk as points, you can make kernel density maps that are the ratio of the cases to the underlying total population. I will show how to do this type of raster kernel density estimate in CrimeStat using some data on reported assaults and arrests from the city of Chicago.

To make the necessary smooth estimate of the proportion of arrests in CrimeStat you will need two separate files: the primary file should be all arrests of interest, and the secondary file should be all of the incidents (so the arrests are a subset of all incidents). And of course both files need to have the geocoordinates already.

CrimeStat has a nice GUI to make our KDE maps. You will be greeted with the following screen after starting the CrimeStat program.

Now we can enter in our data. First click the Select Files button and then navigate to your data file. Here I saved the separate files as DBFs, and for the primary file I use all of the arrests associated with an assault incident in Chicago in 2013.

Now that the file is loaded into CrimeStat, we need to specify which fields contain the spatial coordinates in the variables section. Then we set the appropriate options in the bottom panels. Here I am using projected coordinates in feet. I don’t specify a time unit, so that option is superfluous.

Now we can enter in our secondary file, which will be all of the assault incidents in Chicago in 2013. The Secondary File tab is in the set of minor tabs under the larger Data Setup tab (CrimeStat has an incredible number of routines, hence the many options). The workflow is equivalent to that of the primary file: import the spreadsheet and define the fields.

Now we need to set up the reference grid to which we will write the raster output. Still on the same Data Setup main tab, we navigate to the Reference file minor tab. Here we specify a set of coordinates (in the particular projection in use) as a rectangle corresponding to the lower left and upper right corners. Then you can control how fine the grid is by specifying a larger number of columns. Here the cell sizes are 300 square feet. Note you can save the particular reference file for future use.

Now we can finally move on to estimating our kernel density map! Navigate to the main Spatial Modeling I tab and then to the Interpolation I minor tab. To make a ratio of our two rasters (which will be the smoothed proportion of arrests), we choose the Dual KDE estimation and specify the normal kernel. Typically for KDE maps the kernel makes very little difference; choosing an appropriate bandwidth impacts the look of the map to a much greater extent. I typically default to around 300 meters, but here I choose a smaller 500 foot bandwidth (we will see this is seriously undersmoothed – but I would rather start undersmoothed than oversmoothed).

The field Area units: points per ends up being superfluous when specifying the ratio of the two densities. Clicking on the Save Result to button, we can choose to save the output to various geographic data file formats (both vector and raster). Here I specify ArcGIS’s raster format.

Now we are ready to calculate the KDE raster. Simply click the Compute button at the bottom of CrimeStat, and be a little patient with a dataset of this size (4,000-some arrests and over 17,500 total incidents). After that runs we can import the rasters into our favorite GIS and make some maps.

You will notice when you first load the raster that there are several strange artifacts. This is a function of places with very few incidents having a low baseline with which to calculate the smoothed proportion of arrests. Unfortunately it appears CrimeStat writes 0 where null data values should be (places with zero density in the denominator).

A quick fix to this problem though is to make a separate kernel density map of just the incidents and superimpose that on top of your smoothed arrests. Then you can make the zero density areas the same color as the background map so they are filtered out. Here I filter out areas that have an incident density of less than 0.02 (these are absolute densities, so they sum to the total number of incidents used to calculate them to begin with).
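To make the logic of the dual kernel density ratio (and the masking of near-zero baseline cells) concrete outside of the CrimeStat GUI, here is a small self-contained Python sketch using scipy with fake data. It is only meant to illustrate the idea – the bandwidth choice and masking threshold are arbitrary and it will not reproduce CrimeStat’s output.

#Illustration of a dual KDE ratio: smoothed proportion of arrests among
#incidents, masking grid cells where the incident density is tiny.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

#fake incident locations (2 x N array of x,y), with arrests a random subset
incidents = rng.normal(size=(2, 2000))
arrests = incidents[:, rng.random(2000) < 0.25]

#evaluate both densities on a common grid
xx, yy = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
grid = np.vstack([xx.ravel(), yy.ravel()])

dens_all = gaussian_kde(incidents)(grid) * incidents.shape[1]  #absolute density of incidents
dens_arr = gaussian_kde(arrests)(grid) * arrests.shape[1]      #absolute density of arrests

ratio = dens_arr / dens_all                        #smoothed proportion of arrests
ratio[dens_all < 0.02 * dens_all.max()] = np.nan   #mask cells with a tiny baseline, analogous to the filter above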

So below are the final KDE maps. As you can see from all of them, 500 feet is seriously undersmoothed, but the absolute densities of incidents and arrests (the two left-most maps) appear to be highly correlated. If you look at the hit rate of arrests in the right-most map though, the percentage of arrests appears to be spatially random.

Other possibilities for similar analysis are, say, accidents involving injuries or pedestrians where the baseline is all accidents, field stops that result in contraband recovery, or a comparison of densities before and after an intervention (although there I may take the absolute difference as opposed to the ratio).

Of course this just scratches the surface of possible analysis. When the population at risk is not so conveniently labelled in the data set, such as coming from census geographies, one may consult the literature on dasymetric mapping (also see the head bang procedure in CrimeStat). Bivand et al. (2008) have an example of calculating the ratio raster along with some statistical tests, and the spatstat library has some more convenient functions to accomplish this and map the results (see the relrisk function). One can also estimate a logistic regression model with the spatial coordinates as non-linear predictors (e.g. using splines) and then plot the predicted probabilities for each grid cell.
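That last suggestion – a logistic regression of the case/control labels on smooth functions of the coordinates – might look something like the hedged Python sketch below (statsmodels and patsy, with made up column names and toy data standing in for the real incidents).

#Sketch of a logistic regression on spline terms of the coordinates,
#then predicting the probability of arrest over a regular grid of cells.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

#toy stand-in: one row per incident with projected coordinates and a 0/1 arrest flag
rng = np.random.default_rng(2)
df = pd.DataFrame({"XCoord": rng.uniform(0, 10, 5000),
                   "YCoord": rng.uniform(0, 10, 5000)})
df["Arrest"] = rng.binomial(1, 0.25, 5000)

mod = smf.glm("Arrest ~ cr(XCoord, df=4) * cr(YCoord, df=4)",
              data=df, family=sm.families.Binomial()).fit()

#predicted probability of arrest for each grid cell centroid
gx, gy = np.meshgrid(np.linspace(0, 10, 50), np.linspace(0, 10, 50))
grid = pd.DataFrame({"XCoord": gx.ravel(), "YCoord": gy.ravel()})
grid["PredProb"] = mod.predict(grid)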

I’m not sure of a quick global test of whether the proportion of arrests is spatially random though. I thought off-hand you could use a spatial scan test for the case-control data (e.g. using SaTScan or similar functions in the spatstat R library), although I’m not sure if that counts as quick.
