Bean plots in SPSS

It seems like I have come across alot of posts recently about visualizing univariate distributions. Besides my own recent blog post about comparing distributions of unequal size in SPSS, here are a few other blog posts I have recently come across;

Such a variety of references is not surprising though. Examining univariate distributions is a regular task for data analysis and can tell you alot about the nature of data (including potential errors in the data). Here are some posts on the Cross Validated Q/A site of related interest I have compiled;

In particular the recent post on bean plots and Luca Fenu’s post motivated my playing around with SPSS to produce the bean plots here. Note Jon Peck has published a graphboard template to generate violin plots for SPSS, but here I will show how to generate them in the usual GGRAPH commands. It is actually pretty easy, and here I extend the violin plots to include the beans suggested in bean plots!

A brief bit about the motivation for bean plots. Besides consulting the article by Peter Kampstra, one is interested in viewing a univariate continuous distribution among a set of different categories. To do this one uses a smoothed kernel density estimate of the distribution for each of the subgroups. When viewing the smoothed distribution though one loses the ability to identify patterns in the individual data points. Patterns can mean many things, such as outliers, or patterns such as striation within the main body of observations. The bean plot article gives an example where striation in measurements at specific inches can be seen. Another example might be examining the time of reported crime incidents (they will have bunches at the beginning of the hour, as well as 15, 30, & 45 minute marks).

Below I will go through a brief series of examples demonstrating how to make bean plots in SPSS.


SPSS code to make bean plots

First I will make some fake data for us to work with.

******************************************.
set seed = 10.
input program.
loop #i = 1 to 1000.
compute V1 = RV.NORM(0,1).
compute groups = TRUNC(RV.UNIFORM(0,5)).
end case.
end loop.
end file.
end input program.
dataset name sim.
execute.

value labels groups
0 'cat 0'
1 'cat 1'
2 'cat 2'
3 'cat 3'
4 'cat 4'.
******************************************.

Next, I will show some code to make the two plots below. These are typical kernel density estimates of the V1 variable I made for the entire distribution, and these are to show the elements of the base bean plots. Note the use of the TRANS statement in the GPL to make a constant value to plot the rug of the distribution. Also note although such rugs are typically shown as bars, you could pretty much always use point markers as well in any situation where you use bars. Below the image is the GGRAPH code used to produce them.

******************************************.
*Regular density estimate with rug plot.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V1 MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: V1=col(source(s), name("V1"))
  TRANS: rug = eval(-26)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), label("Density"))
  SCALE: linear(dim(2), min(-30))
  ELEMENT: interval(position(V1*rug), transparency.exterior(transparency."0.8"))
  ELEMENT: line(position(density.kernel.epanechnikov(V1*1)))
END GPL.

*Density estimate with points instead of bars for rug.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V1 MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: V1=col(source(s), name("V1"))
  TRANS: rug = eval(-15)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), label("Density"))
  SCALE: linear(dim(2), min(-30))
  ELEMENT: point(position(V1*rug), transparency.exterior(transparency."0.8"))
  ELEMENT: line(position(density.kernel.epanechnikov(V1*1)))
END GPL.
******************************************.

Now bean plots are just the above plots rotatated 90 degrees, adding a reflection of the distribution (so the area of the density is represented in two dimensions), and then further paneled by another categorical variable. To do the reflection, one has to create a fake variable equal to the first variable used for the density estimate. But after that, it is just knowing alittle GGRAPH magic to make the plots.

******************************************.
compute V2 = V1.

varstocases
/make V from V1 V2
/index panel_dum.

GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=V panel_dum groups MISSING=LISTWISE REPORTMISSING=NO
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  COORD: transpose(mirror(rect(dim(1,2))))
  DATA: V=col(source(s), name("V"))
  DATA: panel_dum=col(source(s), name("panel_dum"), unit.category())
  DATA: groups=col(source(s), name("groups"), unit.category())
  TRANS: zero = eval(10)
  GUIDE: axis(dim(1), label("V1"))
  GUIDE: axis(dim(2), null())
  GUIDE: axis(dim(3), null())
  SCALE: linear(dim(2), min(0))
  ELEMENT: area(position(density.kernel.epanechnikov(V*1*panel_dum*1*groups)), transparency.exterior(transparency."1.0"), transparency.interior(transparency."0.4"), 
           color.interior(color.grey), color.exterior(color.grey)))
  ELEMENT: interval(position(V*zero*panel_dum*1*groups), transparency.exterior(transparency."0.8"))
END GPL.
    ******************************************.

Note I did not label the density estimate anymore. I could have, but I would have had to essentially divide the density estimate by two, since I am showing it twice (which is possible, and if you wanted to show it you would omit the GUIDE: axis(dim(2), null()) command). But even without the axis they are still reasonable for relative comparisons. Also note the COORD statement for how I get the panels to mirror each other (the transpose statement just switches the X and Y axis in the charts).

I just post hoc edited the chart to get it to look nice (in particular settign the spacing between the panel_dum panel to zero and making the panel outlines transparent), but most of those things can likley be more steamlined by making an appropriate chart template. Two things I do not like, which I may need to edit the chart template to be able to accomplish anyway; 1) There is an artifact of a white line running down the density estimates, (it is hard to see with the rug, but closer inspection will show it), 2) I would prefer to have a box around all of the estimates and categories, but to prevent a streak running down the middle of the density estimates one needs to draw the panel boxes without borders. To see if I can accomplish these things will take further investigation.

This framework is easily extended to the case where you don’t want a reflection of the same variable, but want to plot the continuous distribution estimate of a second variable. Below is an example, and here I have posted the syntax in entirety used in making this post. In there I also have an example of weighting groups inversely proportional to the total items in each group, which should make the area of each group equal.

In this example of comparing groups, I utilize dots instead of the bar rug, as I believe it provides more contrast between the two distributions. Also note in general I have not superimposed other summary statistics (some of the bean plots have quartile lines super-imposed). You could do this, but it gets a bit busy.

About these ads
Leave a comment

3 Comments

  1. Louise S.

     /  June 22, 2012

    Hi Andrew,
    Thanks for posting the syntax – it was really useful to follow the steps you took to make these. I’m just wondering how you could add a coloured point to distinguish the median in each group- would it be something like this:
    ELEMENT: point(position(summary.mean(V*1*panel_dum*1*groups))?
    thanks in advance for your help!
    louise

    Reply
    • Hi Louise,

      In theory, something like ELEMENT: point(position(summary.median(V*1*panel_dum*1*groups)), color.exterior(panel_dum), shape("Median"), size(size.large)) should work, although my quick attempts to get it to act as desired were unsucessful. I also tried to make an actual summary variable via the TRANS command in inline GPL and that did not work either. So what I ended up doing was making a new variable and plotting that in its own element statement. Note when you have multiple elements like this the legend gets a bit un-wieldy, and what I have been doing is editing post-hoc in a different vector editor to get the legend to how I want it. Part of the problem I think is that SPSS does not like mapping the same aesthetic to different types of elements, so sometimes it gives errors when trying to construct the legend.

      Below is an example using the same data that is in the last plot in my blog post.

      
      AGGREGATE
        /OUTFILE=* MODE=ADDVARIABLES
        /BREAK=groups panel_dum
        /V_median=MEDIAN(V).
      
      GGRAPH
        /GRAPHDATASET NAME="graphdataset" VARIABLES=V V_median panel_dum groups MISSING=LISTWISE REPORTMISSING=NO
        /GRAPHSPEC SOURCE=INLINE.
      BEGIN GPL
        SOURCE: s=userSource(id("graphdataset"))
        COORD: transpose(mirror(rect(dim(1,2))))
        DATA: V=col(source(s), name("V"))
        DATA: V_median=col(source(s), name("V_median"))
        DATA: panel_dum=col(source(s), name("panel_dum"), unit.category())
        DATA: groups=col(source(s), name("groups"), unit.category())
        TRANS: zero = eval(20)
        TRANS: med_point = eval(40)
        GUIDE: axis(dim(1), label("V1"))
        GUIDE: axis(dim(2), label("Frequency"))
        GUIDE: axis(dim(3), null())
        GUIDE: legend(aesthetic(aesthetic.color.exterior))
        GUIDE: legend(aesthetic(aesthetic.color.interior))
        GUIDE: legend(aesthetic(aesthetic.shape.interior), null())
        GUIDE: legend(aesthetic(aesthetic.transparency.exterior), null())
        GUIDE: legend(aesthetic(aesthetic.transparency.interior), null())
        GUIDE: legend(aesthetic(aesthetic.size), null())
        SCALE: linear(dim(2), min(0))
        SCALE: cat(aesthetic(aesthetic.shape.interior), map(("Median", shape.square), ("Rug", shape.circle)))
        SCALE: cat(aesthetic(aesthetic.transparency), map(("Median", transparency."0.0"), ("Rug", transparency."0.9")))
        SCALE: cat(aesthetic(aesthetic.size), map(("Median", size."8"), ("Rug", size."4")))
        ELEMENT: point(position(V*zero*panel_dum*1*groups), color.exterior(panel_dum), color.interior(panel_dum), transparency.interior("Rug"), transparency.exterior("Rug"), 
                                 shape.interior("Rug"), size("Rug"))
        ELEMENT: point(position(V_median*zero*panel_dum*1*groups), color.exterior(panel_dum), color.interior(panel_dum), transparency.interior("Median"), transparency.exterior("Median"), 
                                 shape.interior("Median"), size("Median"))
        ELEMENT: area(position(density.kernel.epanechnikov(V*1*panel_dum*1*groups)), transparency.exterior(transparency."1.0")), color(panel_dum),  transparency.interior(transparency."0.5")))
      END GPL.

      Thanks for the comment and if you have any other questions let me know here in the comments or feel free to shoot me an email.

      Andy

      Reply
  1. The baseline in histograms & outliers | Andrew Wheeler

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 32 other followers

%d bloggers like this: