My Year Blogging in Review – 2018

The blog continues to grow in site views. I had a little north of 90,000 site views over the entire year. (If you find that impressive, don't be; a very large proportion are likely bots.)

The trend on the original count scale looks linear, but the variance is much better behaved on the log scale. So I’m not sure what the best forecast would be.

I thought the demise had already started earlier in the year, as I actually saw the first year-over-year decreases in June and July. But the views recovered in the following months.

So based on that, I think a slowdown in growth is a better bet than the linear projection.

For those interested in extending their reach, you should not only consider social media and creating a website/blog, but also writing up your work for a more general newspaper. I wrote an article for The Conversation about some of my work on officer-involved shootings in Dallas, and it accumulated nearly 7,000 views within a week of being published.

Engagement with a broader audience is very bursty. Looking at my statistics for particular articles, it doesn’t make much sense to report average views per day. I tend to get a ton of views in the first few days, and then basically nothing after that. So if I rank the top posts by average views per day, the ranking is dominated by my more recent posts.

This is partly due to shares on Twitter, which drive short-term views but do not impact longer-term views as far as I can tell. That is, a popular post on Twitter does not appear to predict consistent views referred via Google searches. In the past year I had a ratio of about 50:1 referrals from Google vs. Twitter, and I did not have any new posts settle into a consistent number of views (most settle in at under 3 views per day after the initial wave). So basically all of my most viewed posts are the same as prior years.

Since I joined Twitter this year, I actually have made fewer blog posts. Not including this post, I’ve made 29 posts in 2018.

2011  5
2012 30
2013 40
2014 45
2015 50
2016 40
2017 35
2018 29

One example of substitution is tweeting when a paper is published. I typically do a short write-up when I post a working paper — there is not much point in doing another one when it is published online. (To date I have not had a working paper greatly change in content from the published version.) I generally just like sharing nice graphs I am working on. Here is an example of citations over time I quickly published to Twitter, which was simpler than doing a whole blog post.

Since it is difficult to determine how much engagement I will get for any particular post, it is important to just keep plugging away. Twitter can help a particular post take off (see these examples I wrote about for the Cross Validated Blog), but any one tweet or blog post is more likely to be a dud than anything.

The length of time it takes from submission to publication

The other day I received a positive comment about my housing demolition paper. It made me laugh a bit inside — I finished that work so long ago it felt like the comment was talking about history. That paper is not so ancient though: I submitted it on 8/4/17, it went through one round of revision, and I got the email from Jean McGloin with the conditional acceptance on 1/16/18. It then came out online first a few months later (3/15/18), and it is in the current print issue of JRCD, which came out in May 2018.

This ignores the time it takes from conception to finishing a project (we started the project sometime in 2015), but focusing just on the publishing process this is close to the best case scenario for the life-cycle of a paper through peer reviewed journals in criminology & criminal justice. The realist best case scenario typically is:

  • Submission
  • Wait 3 months for peer reviews
  • Get chance to revise-resubmit
  • Wait another 3 months for second round of reviews and editor final decision

So ignoring the time it takes for editors to make decisions and the time for you to turn around edits, you should not bank on a paper being accepted in under 6 months. There are exceptions to this: some journals/editors don’t bother with the second three-month wait period for reviewers to look at your revisions (which I think is the correct way to do it), and sometimes you will get reviews back faster or slower than three months, but that realist scenario is the norm for most journals in the CJ/Crim field. Things that make this process much slower (multiple rounds of revisions, editors taking time to make decisions, the time it takes to make extensive revisions) are much more common than things that make it go faster (I’ve only heard myths about a uniform accept on the first round without revisions).

Not having tenure, this is something that is on my mind. It is a bit of a rat race trying to publish all the papers expected of you, and due to the length of peer review times you essentially need to have your articles out and under review well before your tenure deadline is up. The six month lag is the best case scenario, in which your paper is accepted at the first journal you submit to. The top journals are uber competitive though, so you often have to go through that process multiple times due to rejections.

So to measure that time I took my papers, including those not published, to see what this life-cycle time is. If I only included those that were published it would bias the results to make the time look shorter. Here I measured the time it took from submission of the original article until when I received the email of the paper being accepted or conditionally accepted. So I don’t consider the lag time at the end with copy-editing and publishing online, nor do I consider up front time from conception of the project or writing the paper. Also I include three papers that I am not shopping around anymore, and censored them at the date of the last reject. For articles still under review I censored them at 5/9/18.

So first, for 25 of my papers that have received one editorial decision, here is a graph of the typical number of rejects I get for each paper. A 0 for a paper means it was published at the first journal I submitted to, a 1 means I had one reject and was accepted at the second journal I submitted the paper to, etc. (I use "I" but this includes papers I am co-author on as well.) The Y axis shows the total percentage, and the label for each bar shows the total N.

So the proportion of papers of mine that are accepted on the first round is 28%, and I have a mean of 1.6 rejections per article. This does not take into account censoring (not sure how to for this estimate), and that biases the estimate of rejects per paper downward here, as it includes some articles under review now that will surely be rejected at some point after writing this blog post.

The papers with multiple rejects run the typical gamut of why academic papers are sometimes hard to publish: null results, a hostile reviewer at multiple places, controversial findings. It also illustrates that peer review is not necessarily a beacon showing the absolute truth of an article. I’m pretty sure everything I’ve published, even papers accepted at the first venue, has had at least one reviewer with negative comments. You could find reasons to reject the findings of anything I write that has been peer reviewed — the same as you can think many of my pre-print articles are correct or useful even though they do not currently have a peer review stamp of approval.

Most of those rejections add about three months to the life-cycle, but some can be faster (these include desk rejections), and some can be slower (rejections on later rounds of revisions). So using those begin times, end times, and taking into account censoring, I can estimate the typical survival time of my papers within the peer-review system, lumping all of those different factors together into the total time. Here is the 1 - survival chart, so it can be interpreted as the probability of a paper being accepted within a given number of days. This includes 26 papers (one more that has not had a first decision), so this estimate does account for papers that are censored.

The Kaplan-Meier estimate of the median survival times for my papers is 290 days. So if you want a 50% chance of your article being published, you should expect 10 months based on my experience. The data is too sparse to estimate extreme quantiles, but say I want an over 80% probability of an article being published based on this data, how much time do I need? The estimate based on this data is at least 460 days.
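
To give a sense of the mechanics for anyone who does not use SPSS, below is a minimal sketch of the same Kaplan-Meier calculation in Python using the lifelines package. The durations and acceptance indicators here are made-up placeholders (not my data), so this is an illustration only, not a reproduction of the analysis above.

# Minimal Kaplan-Meier sketch using the lifelines package (pip install lifelines).
# The durations and event flags below are hypothetical placeholders, not my data.
from lifelines import KaplanMeierFitter

# days from first submission to acceptance; 0 in accepted = still under review (censored)
durations = [95, 180, 210, 290, 310, 400, 460, 520, 150, 330]
accepted  = [1,   1,   1,   1,   0,   1,   0,   1,   1,   0]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=accepted)

print(kmf.median_survival_time_)   # median days until acceptance
kmf.plot_survival_function()       # plot 1 minus this for "probability accepted by day t" (needs matplotlib)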

Different strategies will produce different outcomes — so my paper survival times may not generalize to yours, but I think that estimate will be pretty reasonable for most folks in Crim/CJ. I try to match papers to journals that I think are the best fit (so I don’t submit everything to Criminology or Justice Quarterly at the first go), so I have a decent percentage of papers that land on the first round. If I submitted first round to more mediocre journals overall, my survival times would be shorter. But even many mid-tiered journals in our field have overall acceptance rates below 10%, and I never think anything I submit is really a slam-dunk sure thing, so I don’t think my overall strategy is the biggest factor. Some of that survival time is my fault and includes time editing the article in between rejects and revise-resubmits, but the vast majority of it is simply waiting on reviewers.

So the sobering truth for those of us without tenure is that, based on my estimates, you need to have your journal articles out the door well over a year before you go up for review to really ensure that your work is published. I have a non-trivial chunk of my work (near 20%) that has taken over one and a half years to publish. For folks currently getting their PhD it is really the same pressure, since to land a tenure-track job you need to have publications as well. (It is actually, I think, one reasonable argument for taking a longer time to write your dissertation.) And that is just for the publishing part — it does not include actually writing the article or conducting the research. The nature of the system is very much delayed gratification in having your work finally published.

Here is a link to the data on survival times for my papers, as well as the SPSS code to reproduce the analysis.

Digg reader is shutting down, giving Twitter a try

I’ve used RSS feeds for quite a while now to keep up with blogs I enjoy. I also use them to follow scholarly journals of interest. Unfortunately, my current feed reader of choice (Digg Reader) is shutting down.

This is the second time my feed reader has shuttered (I used Google Reader before that shut down as well). Another particular problem I never really solved was link rot. Google Reader had some metrics where you could see old feeds that had not had any new posts for a while. Digg had no such service, and I tried my hand at writing python code to do this myself, but that code never quite worked out.
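
For anyone who wants to attempt the same thing, here is a rough sketch of what such a link-rot check could look like in Python using the feedparser library. To be clear, this is not the code I wrote back then (that never worked out), just a hypothetical illustration, and the feed URLs are placeholders.

# Hypothetical sketch of a link-rot check for an RSS/Atom subscription list
# using feedparser (pip install feedparser). The feed URLs are placeholders.
import feedparser
from datetime import datetime, timedelta

feeds = ["https://example.com/feed1.xml",
         "https://example.com/feed2.xml"]

cutoff = datetime.now() - timedelta(days=365)   # flag feeds with no post in the last year

for url in feeds:
    d = feedparser.parse(url)
    if d.bozo or not d.entries:
        print("possibly dead: %s" % url)
        continue
    dates = [datetime(*e.published_parsed[:6]) for e in d.entries
             if getattr(e, "published_parsed", None)]
    if not dates:
        print("no dated entries: %s" % url)
    elif max(dates) < cutoff:
        print("stale (last post %s): %s" % (max(dates).date(), url))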

To partially replace this service, instead of migrating to another feed reading service I will give Twitter a shot. Twitter is a bit chaotic from what I can tell — I much prefer the spreadsheet-like listing of just titles to peruse news and events of interest in the morning. I had been using Google+ and liked it (yes, I know I’m one of those nerds), but it is a bit of a ghost town. So I will migrate entirely over to Twitter.

My Year Blogging in Review – 2017

So the blog has continued to show linear growth in views over time, though I take a good hit in December.

I only ended up writing 35 new posts in 2017 (that includes things that are not blog posts, like pages I created for new classes). For comparison, in 2015 I wrote 50 and in 2016 I wrote 40. I’ve managed to be pretty consistent over time though; here is the cumulative total over time.

That is more or less what I aim for, to just have some content every few weeks.

There is not much to say in terms of popular posts on the site for the year. My most popular posts are ones I’ve written in previous years. I did not have any post this year gain a large number of viewers when it was first written. It is just a slow accumulation of around 200 views per day, mostly people being referred via Google searches.

I wanted to analyze the topics I’ve written about over time, so I grabbed all of the tags I’ve placed on posts. I collapsed both categories and tags, as I don’t really make much of a distinction when I pick them. Here is a graph of the number of posts that have that tag and the page views (this will double count page views, for example a post could have both SPSS and Data Visualization). None refers to pages that are not blog posts, like my home page and pages I created for class syllabi.

If we look at the ratio though, you can see my scholarly posts are mostly ignored; only in total do they accumulate much viewing.

My posts showing how to use various Google Maps services with python must rank reasonably high in Google searches, as I get a slow trickle of hits for them every day. The high uncertainty is driven by my "ratios need to be plotted on log scales" post.

I tried to analyze whether the content of my posts has substantively changed over time. I suspected that since I took my job at Dallas my posts have swayed more towards paper/scholarly (the tag I use for academic related things) and away from technical computing stuff. I have too few posts though (and too many tags) to easily make sense of it. Taking only tags that are included on 30 or more posts, here are the counts of those tags over time.
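
If you want to do a similar tabulation with your own blog's tags, one quick way (not necessarily how I made the graphs here) is with pandas, assuming you have exported the posts to a flat file with one row per post-tag pair. The file name and column names below are made up for illustration.

# Hypothetical sketch: counting post tags per year with pandas.
# Assumes a CSV with one row per post-tag pair and columns "date" and "tag"
# (the file and column names are made up, adjust to your own export).
import pandas as pd

posts = pd.read_csv("post_tags.csv", parse_dates=["date"])
posts["year"] = posts["date"].dt.year

# keep only tags that appear on 30 or more posts
common = posts["tag"].value_counts()
common = common[common >= 30].index
sub = posts[posts["tag"].isin(common)]

# cross-tabulate tag counts by year
counts = pd.crosstab(sub["year"], sub["tag"])
print(counts)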

About the only clear trend is that scholarly has risen with SPSS dropping, the other frequent categories though look to me to be fairly consistent. I could spend more time grouping the tags into thematic content, but I have too many other things I need to do (including writing other blog posts)!

Happy New Year!

Blogging in review 2015

For some meta commentary on the blog itself, the blog has continued to grow, surpassing a total of 100,000 cumulative site views since its inception in December 2011 (according to the stats that wordpress keeps). I added a total of 50 posts & pages in 2015, for a total of 170. The monthly growth is shown below, and it now appears linear. (This data was a few days short of the 1st, so December ended up cracking 4,000 site views in total.)

Pretty much all of my top posts are SPSS related, and were posted in years prior to 2015. Currently I average around 140 views per day, but it is spread out over many of the pages. Below is a table of the average views per day for my most popular pages. Note that the average site views of my home page are currently much higher, but it is dated to when I created the blog. (Created as of 12/28/15, so a few days short of the new year.)

Top Pages

1. Home page / Archives (posted 12/15/11): 20,048 views over 1,474 days, 13.60 views per day
   https://andrewpwheeler.wordpress.com/
2. Odds Ratios NEED To Be Graphed On Log Scales (posted 10/26/13): 6,460 views over 793 days, 8.15 views per day
   https://andrewpwheeler.wordpress.com/2013/10/26/odds-ratios-need-to-be-graphed-on-log-scales/
3. Comparing continuous distributions of unequal size groups in SPSS (posted 04/29/12): 6,722 views over 1,338 days, 5.02 views per day
   https://andrewpwheeler.wordpress.com/2012/04/29/comparing-continuous-distributions-of-unequal-size-groups-in-spss/
4. Using the Google Places API in Python (posted 05/15/14): 2,339 views over 592 days, 3.95 views per day
   https://andrewpwheeler.wordpress.com/2014/5/15/using-the-google-places-api-in-python/
5. Hacking the default SPSS chart template (posted 01/03/12): 5,705 views over 1,455 days, 3.92 views per day
   https://andrewpwheeler.wordpress.com/2012/01/03/hacking-the-default-spss-chart-template/
6. Why I feel SPSS (or any statistical package) is better than Excel for this particular job (posted 03/30/13): 3,786 views over 1,003 days, 3.77 views per day
   https://andrewpwheeler.wordpress.com/2013/3/30/why-i-feel-spss-(or-any-statistical-package)-is-better-than-excel-for-this-particular-job/
7. Avoid Dynamite Plots! Visualizing dot plots with super-imposed confidence intervals in SPSS and R (posted 02/20/12): 4,868 views over 1,407 days, 3.46 views per day
   https://andrewpwheeler.wordpress.com/2012/02/20/avoid-dynamite-plots-visualizing-dot-plots-with-super-imposed-confidence-intervals-in-spss-and-r/
8. Using sequential case processing for data management in SPSS (posted 02/18/13): 3,539 views over 1,043 days, 3.39 views per day
   https://andrewpwheeler.wordpress.com/2013/2/18/using-sequential-case-processing-for-data-management-in-spss/

So most of the posts go relatively unnoticed, and even when they do get shared it is at best a few hundred views in the day or two after the post goes up. Being on my home page probably gets all of my posts a bit of exposure for a week or two. But looking at the long haul, many of my tutorial SPSS, python, and R posts get a fair amount of traffic. Or at least enough to continue to motivate me to write more posts!

As always, I don’t have a fixed schedule for writing posts, nor any real roadmap of what I plan to blog about. I will say though, doing my tour on the job market, it has been interesting to get some recognition for what I post on the blog. It is mainly students who have asked me about it, but I’ve had some folks mention it at conferences as well.

Music and distractions in the workplace

I was recently re-reading Zen and the Art of Motorcycle Maintenance, and it re-reminded me of why I do not like to listen to music in the workplace. The thesis in Pirsig’s book (in regards to listening to music) is simple: you can’t concentrate entirely on the task at hand if you have music distracting you. So those who value their work tend not to have idle distractions like music playing (and instead are fully engrossed in their work).

I have worked in various shared workspaces (cubicles and shared offices) for quite a while now, and I do have a knack for going off into space and ignoring all of the background noise around me. But I still do not like listening to music, even though I have learned to cope with the situation. At this point I prefer the open office workspace, as there at least is no illusion of privacy. When I worked at a cubicle someone coming behind me and scaring me was basically a daily thing.

Scott Adams, the artist of the Dilbert comic, had a recent blog post saying that music is the lesser evil compared to constant distractions via the internet (email, facebook, twitter, etc.). This I can understand as well, and sometimes I turn off the wi-fi to try to get work done without distraction. I don’t see how turning on music helps, but given its prevalence it may just be a difference between myself and other people. I should probably turn off the wi-fi for all but an hour in the morning and an hour in the afternoon every day, but I’m pretty addicted to the internet at this point.

How easily I am distracted partly depends on the task I am currently working on though. Sometimes I can get really engrossed in a particular problem and become obsessed with it to the point you could probably set the office on fire and I wouldn’t notice. For example, this programming problem dominated my thoughts for around two days, and I ended up thinking of the general solution while I did not have access to a computer (while I was waiting for my car to get inspected). Most of the time though I can only give that type of concentration for an hour or two a day, and the rest of the time I am working in a state of easy distraction.

Background music I don’t like, and other ambient noises I can manage to drown out, but background TV drives me crazy. My family was watching videos (on TV and tablets) the other day while I was reading Zen and ironically I became angry, because I was really into the book and wanted to give it my full concentration. I know people who watch TV in bed to go to sleep, and it is giving me a headache just thinking about it while I am writing this blog post.

I highly recommend both Zen and the Art of Motorcycle Maintenance and Scott Adams’ blog. I’m glad I revisited Zen, as it is an excellent philosophical book on the logic of science that did not make much of an impression on me as an undergrad, but I have a much better grasp of it after having finished my PhD and read some other philosophy texts (like Popper).

Emailing with Python and SPSS

Emailing automated messages using Python was on my bucket list for a few projects, so here I will illustrate how to do that within SPSS. Basically the use case is that you have an automated report generated by SPSS and you want to send that report to certain parties (or to yourself while you are away from work). Emailing right within Python cuts out the annoying middle step of having to send the email yourself.

There are basically two parts to emailing within Python: 1) building the message and 2) opening your server and sending the mail. The latter is pretty simple; the former is quite tedious. So adapting from several posts around the internet (1,2,3 plus others I don’t remember at this point), here is a function to build the email text.

*function to build message.
BEGIN PROGRAM Python.
from os.path import basename
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.MIMEBase import MIMEBase
from email import Encoders
from email.utils import COMMASPACE, formatdate

def make_mail(send_from, send_to, subject, text, files=None):
    assert isinstance(send_to, list)

    #set the headers explicitly -- passing them as keyword arguments to
    #MIMEMultipart() tacks them onto the Content-Type header instead
    msg = MIMEMultipart()
    msg['From'] = send_from
    msg['To'] = COMMASPACE.join(send_to)
    msg['Date'] = formatdate(localtime=True)
    msg['Subject'] = subject
    msg.attach(MIMEText(text))

    #attach any files as base64 encoded payloads
    if files is not None:
      for f in files:
        part = MIMEBase('application', 'base64')
        part.set_payload(open(f,"rb").read())
        part.add_header('Content-Disposition', 'attachment', filename=basename(f))
        Encoders.encode_base64(part)
        msg.attach(part)
    return msg.as_string()
END PROGRAM.

The function takes the following arguments:

  • send_from: Your email address as a string
  • send_to: A list of email addresses to send to.
  • subject: A text string for the subject (please use a subject when you send an email!)
  • text: The text composition of the email in a string
  • files: A list of files to attach, given with their full directory paths

Basically an email message is just a particular text format (which actually looking at the markup I’m slightly amazed email still functions at all). Building the markup for the to, from, subject and text in the email is tedious but relatively straightforward. However, attaching files (the main motivation for this to begin with!) is rather a pain in the butt. Here I just encode all the files in base64, and CSV, PDF, and PNG files have all worked out so far for me in my tests. (You can attach images as binary, but this approach seems to work fine at least for PNG images.)

So here is an example constructing a message, and I attach three example files. Here I just use my gmail address as both the from and to address. You can uncomment the print MyMsg at the end to see the particular markup, but it is quite long with the base64 attached files.

*Now lets make a message.
BEGIN PROGRAM Python.

us = "apwheele"
fr = us + "@gmail.com"
to = [fr]
te = "Hello!"

MyCSV = [r"C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Email_Python\Test.csv",
         r"C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Email_Python\Test.pdf",
         r"C:\Users\andrew.wheeler\Dropbox\Documents\BLOG\Email_Python\OUTPUT0.PNG"]

MyMsg = make_mail(send_from=fr,send_to=to,subject="Test",text=te,files=MyCSV)
#print MyMsg
END PROGRAM.

The second part is opening your email server and sending the message — relatively straightforward. Many people write their Python emailing functions with the username and password as arguments to the function. This does not make much sense to me, as they will basically be constants for a particular user, so I find it simpler to make the message and then open the server and send it. If you want to send multiple messages it also makes more sense to open up the server just once. Below, to make it work for yourself you just have to insert your own username and password (and possibly update the port number for your server).

*Now set up the server and send a message.
BEGIN PROGRAM Python.

us = "!!Your Username!!"
pa = "!!Your Password!!"

import smtplib
server = smtplib.SMTP('smtp.gmail.com',587)
server.starttls()

server.login(us,pa)
server.sendmail(fr,to, MyMsg)
server.quit()
END PROGRAM.

I don’t have Outlook on any of my personal machines, but hopefully it is just as simple when sending an email through a client as it is through gmail.
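
If you are on an Outlook/Exchange style account rather than gmail, my guess (untested, so treat the host and port as an assumption on my part) is that the only change needed is the SMTP server you point smtplib at:

*Same send logic, just pointed at a different SMTP server (untested assumption on my part).
BEGIN PROGRAM Python.
import smtplib
server = smtplib.SMTP('smtp.office365.com',587)  #host/port are an assumption, check with your provider
server.starttls()
server.login(us,pa)
server.sendmail(fr,to, MyMsg)
server.quit()
END PROGRAM.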

My Blogging in Review in 2013

2013 was my second year of blogging. I published 40 posts in 2013 (for a total of 72), and my site views were just a few shy of 21,000 for the year. I only received 7,200 site views in 2012, so the blog has seen a fair bit of growth. The below chart aggregates the site views per month since the beginning (in December 2011) until December 2013. December has been a bit of a dip, with only around an average of 60 views per day, but I was up to an average of 78 and 75 views per day in October and November respectively.

The large uptick in March was due to the Junk Charts Challenge being mentioned by Kaiser Fung. I got over 500 site views that day, and have totalled 765 referrals from the JunkCharts domain. This is pretty similar to the bursty behavior I noted on the CV blog: one good tweet or mention by a prominent figure will boost visibility by a large margin.

Most of the regular traffic though comes from generic internet searches, mainly for SPSS related material. A few of my earlier posts, Comparing continuous distributions of unequal size groups in SPSS (2,468 total views), Hacking the default SPSS chart template (2,237), and Avoid Dynamite Plots! Visualizing dot plots with super-imposed confidence intervals in SPSS and R (1,542), are some of my most popular posts. The Junk Charts Challenge post has a total of 1,804 views, but it seems to me that it was more of a flood initially and then a trickle, as opposed to the steady views the other posts bring.

Last year I said I would blog about a few topics and failed to write a post about any of them, so I won’t do that again this year. I will however state that I am currently on the job market, as I recently defended my prospectus. If you are aware of a job opportunity you think I would be interested in, or would like to talk to me about a consulting project feel free to send me an email (you can see my CV for my qualifications and brief discussion of past and current consulting services I have provided).

Some sites give advice about maintaining a blog and attracting visitors (such as writing posts so often). My advice is to write quality material, and the rest is just icing on the cake. Hopefully I have more cake for you in the near future.

Why I feel SPSS (or any statistical package) is better than Excel for this particular job

I debated on pulling an Andrew Gelman and adding a ps to my prior Junk Charts Challenge post, but it ended up being too verbose, so I just made an entirely new follow-up. To start, the discussion so far has evolved over this series of posts:

  • The original post on remaking a great line chart by Kaiser Fung, with the suggestion that the task (data manipulation and graphing) is easier in Excel.
  • My response on how to make the chart in SPSS.
  • Kaiser’s response to my post, in which I doubt I swayed his opinion on using Excel for this task!

It appears to me, based on the discussion so far, that the only real quarrel is whether the data manipulation is sufficiently complicated compared to the ease of pointing and clicking in Excel to justify using Excel. Recreating Kaiser’s chart in SPSS does take some advanced knowledge of sorting and using lags to identify the pits and recoveries (the same logic could be extended to the data manipulations Kaiser says I skim over, as long as you can numerically or externally define what is the start of a recession).

All things considered for the internet, discussion has been pretty cordial so far. Although it is certainly sprinkled in my post, I didn’t mean for my post on SPSS to say that the task of grabbing data from online, manipulating it, and creating the graph was in any objective way easier in SPSS than in Excel. I realize pointing-and-clicking in Excel is easier for most, and only a few really adept at SPSS (like myself) would consider it easier in SPSS. I write quite a few tutorials on how to do things in SPSS, and that was one of the motivations for the tutorial. I want people using SPSS (or really any graphing software) to make nice graphs – and so if I think I can add value this way to the blogosphere I will! I hope my most value added is through SPSS tutorials, but I try to discuss general graphing concepts in the posts as well, so even for those not using SPSS it hopefully has some other useful content.

My original post wasn’t meant to discuss why I feel SPSS is better for this particular task, although it is certainly a reasonable question to ask (I tried to avoid it to prevent flame wars to be frank – but now it appears I’ve stepped in it). As one of the comments on Kaiser’s follow-up notes (and I agree), some tools are better for some jobs and we shouldn’t prefer one tool because of some sort of dogmatic allegiance. To make it clear though, and it was part of my motivation to write my initial response to the challenge post, I highly disagree that this particular task, which entails grabbing data from the internet, manipulating it, creating a graph, and updating said graph on a monthly basis, is better done in Excel. For a direct example of my non-allegiance to doing everything in SPSS for this job, I wouldn’t do the grabbing-the-data-from-the-internet part in SPSS (indeed, it isn’t even directly possible unless you use Python code). Assuming it could be fully automated, I would write a custom SPSS job that manipulates the data after a wget command grabs the data, and have it all wrapped up in one bat file that runs on a monthly timer.
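
To sketch what I mean by that automation (this is purely hypothetical, and the URL and file paths are placeholders, not the real data source), the data-grabbing step could itself be a short Python script that the scheduled bat file kicks off before running the SPSS syntax:

# Hypothetical sketch of the "grab the data" step; the URL and paths are placeholders.
# (This uses the Python 2 urllib call; in Python 3 it is urllib.request.urlretrieve.)
import urllib

DATA_URL = "https://example.gov/monthly_data.csv"   # placeholder, not the real source
urllib.urlretrieve(DATA_URL, r"C:\Jobs\monthly_data.csv")

# the bat file on a monthly Task Scheduler timer would then run the SPSS syntax
# that reshapes this file and remakes the chart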

To go off on a slight tangent, why do I think I’m qualified to make such a distinction? Well, I use both SPSS and Excel on a regular basis. I wouldn’t consider myself a wiz at Excel nor VBA for Excel, but I have made custom Excel macros in the past to perform various jobs (make and format charts/tables etc.), and I have one task (a custom daily report of the crime incidents reported the previous day) I do on a daily basis at my job in Excel. So, FWIW, I feel reasonably qualified to make decisions on what tasks I should perform in which tools. So I’m giving my opinion, the same way Kaiser gave his initial opinion. I doubt my experience is as illustrious as Kaiser’s, but you can go to my CV page to see my current and prior work roles as an analyst. If I thought Excel, or Access, or R, or Python, or whatever was a better tool I would certainly personally use and suggest that. If you don’t have a little trust in my opinion on such matters, well, you shouldn’t read what I write!

So, again to be clear, I feel this is a job better for SPSS (both the data manipulation and creating the graphics), although I admit it is initially harder to write the code to accomplish the task than pointing, clicking and going through chart wizards in Excel. So here I will try to articulate those reasons.

  • Any task I do on a regular basis, I want to be as automated as possible. Having to point-click and copy-paste on a regular basis invites human error and is a waste of time. I don’t doubt you could fully (or very nearly) automate the task in Excel (as the comment on my blog post mentions). But this will ultimately involve scripting in VBA, which diminishes any claim that the Excel solution is easier than the SPSS solution.
  • The breadth of data management capabilities, statistical analysis, and graphics is much larger in SPSS than in Excel. Consider the VBA code necessary to replicate my initial VARSTOCASES command in Excel, that is, reshaping wide data to stacked long form. Consider the necessary VBA code to execute summary statistics over different groups without knowing what the different groups are beforehand. These are just a sampling of data management tools that are routine in statistics packages. In terms of charting, the most obvious function lacking in Excel is that it currently does not have facilities to make small-multiple charts (you can see some exceptional hacks from Jon Peltier, but those are certainly more limited in functionality than SPSS). Not mentioned (but most obvious) is the statistical capabilities of a statistical software!

So certainly this particular job could be done in Excel, as it does not require any functionality unique to a stats package. But why hamstring myself with these limitations from the outset? Frequently after I build custom, routine analyses like this I continually go back and provide more charts, so even if I have a good conceptualization of what I want to do at the onset there is no guarantee I won’t want to add this functionality later. In terms of charting, not having flexible small-multiple charts is really a big deal; they can be used all the time.

Admittedly, this job is small enough in scope that if, say, the prior analyst was updating the chart regularly via copy-paste like Kaiser is suggesting, I would consider just keeping that same format (there is certainly an opportunity cost to re-writing the code in SPSS, and the fact that it is only updated monthly means it would take quite some time to recover the time invested if the task were fully automated). I personally have enough experience in SPSS that I know I could script a solution quicker from the onset in SPSS than in Excel (I certainly can’t extrapolate that to anyone else though).

Part of both my preference and experience in SPSS comes from the jobs I personally have to do. For an example, I routinely pull a database of 500,000 incidents, do some data cleaning, and then merge this to a table of 300,000 charges and offenses and then merge to a second table of geocoded incident locations. Then using this data I routinely subset it, create aggregate summaries, tables, estimate various statistics and models, make some rudimentary maps, or even export the necessary data to import into a GIS software.

For argument’s sake (with the exception of some of the more complicated data cleaning) this could mostly be done in SQL – but certainly no reasonable person should consider doing these multiple table merges and data cleaning in Excel (the nice interactive facilities of working with the spreadsheet in Excel are greatly diminished with any tables that take more than a few scrolls to see). Statistical packages are really much more than tools to fit models; they are tools for working with and manipulating data. I would highly recommend that if you have to conduct routine tasks in which you manipulate data (something I assume most analysts have to do) you consider learning statistical software, the same way I would recommend you get to know SQL.

To be more balanced, here are things (knowing SPSS really well and Excel not as thoroughly) I think Excel excels at compared to SPSS:

  • Ease of making nicely formatted tables
  • Ease of directly interacting and editing components of charts and tables (this includes adding in supplementary vector graphics and labels).
  • Sparklines
  • Interactive Dashboards/Pivot Tables

Routine data management is not one of them, and only sparklines and interactive dashboards are functionality for which I would prefer to make the end product in Excel over SPSS (and even then the whole workflow does not need to be in one software). I clean up ad-hoc tables for distribution in Excel all the time, because (as I said above) editing them in Excel is easier than editing them in SPSS. Again, my opinion, FWIW.

My experience blogging in 2012

I figured I would write a brief post about my experience blogging. I created this blog and published my first post in December of 2011. Since then, in 2012, I published 30 blog posts and totaled 7,200 views. While I thought the number was quite high (albeit a bit disappointing compared to the numbers of Larry Wasserman), it is still many more people than would have listened to what I had to say if I didn’t write a blog. When starting out I averaged under 10 views a day, but throughout the year it steadily grew, and now I average about 30 views per day. The post that had the most traffic in one day was When should we use a black background for a map?, and that was largely because of some twitter traffic (a result of Steven Romalewski tweeting it and then it being re-tweeted by Kenneth Field); it had 73 views that day.

I started the blog because I really loved reading a lot of others’ blogs, and so I hope to encourage others to do so as well. It is a nice venue for an academic to share work and opinions, as it is more flexible and can be less formal than articles. Also much of what I write about I would just consider helpful tips or generic discussion that I wouldn’t get to discuss otherwise (SPSS programming and graph tips will never make it into a publication). One of my main motivations was actually R-Bloggers and the SAS blog roll; I would like a similarly active community for SPSS, but there is none really that I have found outside of the NABBLE forum (some exceptions are Andy Field, The Analysis Factor, Jon Peck, and these few posts by a Louis K that I only found through the labyrinth that is the IBM developerworks site (note I think you need to be signed in to even see that site), but they certainly aren’t very active and/or don’t write much about SPSS). I assume the best way to remedy that is to lead by example! Most of my more popular posts are ones about SPSS, and I frequently get web traffic via general google searches of SPSS + something else I blogged about (hacking the template and comparing continuous distributions are my two top posts).

The blog is also just another place to highlight my academic work and bring more attention to it. WordPress tells me how often someone clicks a link on the blog, and someone has clicked the link to my CV close to 40 times since I’ve made the blog. Hopefully I will have some pre-print journal articles to share on the blog in the near future (as well as my prospectus). My post on my presentation at ASC did not generate much traffic, but I would love to see a similar trend for other criminologists/criminal justicians in the future. My work isn’t perfect for sure, but why not get it out there, at least for it to be judged and hopefully get feedback.

I would like to blog more, and I actively try to write something if I haven’t in a few weeks, but I don’t stress about it too much. I certainly have an infinite pool of posts to write about programming and generating graphs in SPSS. I have also thought about talking about historical graphics in criminology and criminal justice, or generally talking about some historical and contemporary crime mapping work. Other potential posts I’d like to write are a more formal treatment of why I loathe most difference-in-differences designs, and perhaps one about the silliness that can ensue when using null-hypothesis significance testing to determine racial bias. But they will both take more careful elaboration, so they might not come anytime soon.

So in short, SPSSers, crime mappers, criminologists/criminal justicians, I want you to start blogging, and I will eagerly consume your work (and in the meantime hopefully produce some more useful stuff on my end)!