Knowing when to fold them: A quantitative approach to ending investigations

The recent work on investigations in the criminal justice field has my head turning about potential quantitative applications in this area (check out the John Eck & Kim Rossmo podcasts on Jerry’s site first, then the recent papers in Criminology and Public Policy on the topic for a start). One particular problem presented to me was detective caseloads — detectives are human, so they can only handle so many cases at once. Triage typically happens at the initial crime-reporting stage, with inputs such as the seriousness of the offense, the overall probability of the case being solved, and the future dangerousness of the folks involved going into the calculus to assign a case.

Here I wanted to focus on a different problem though — how long to keep cases open? There are diminishing returns to keeping cases open indefinitely, and so PDs should be able to right size the backend of detective open cases as well as the front end triaging. Here my suggested solution is to estimate a survival model of the probability of a case being solved, and then you can estimate an expected return on investment given the time you put in.

Here is a simplified example. Say the table below shows the (instantaneous) probability of a case being solved per weeks put into the investigation.

Week 1  20%
Week 2  10%
Week 3   5%
Week 4   3%
Week 5   1%

In survival model parlance, this would be the hazard function in discrete time increments. The hazard diminishes over time, which should generally hold in practice (a case has a higher probability of being solved right away, and that probability gets lower over time). The expected return of investigating this crime at time t is the cumulative probability of the crime being solved at time t, multiplied by whatever value you assign to the case being solved. The costs of investigating are fixed (based on the detective’s salary), so they are just a multiple, t*invest_costs.

So just to fill in some numbers, let’s say that it costs the police department $1,000 a week to keep an investigation going. Also say a crime has a return of $10,000 if it is solved (the latter number will be harder to figure out in practice, as cost-of-crime estimates are not a perfect fit). So filling in our table, below are our detective return on investment estimates (note that the cumulative probability of being solved is not simply the sum of the instantaneous probabilities, else it would eventually go over 100%). So the return on investment (ROI) at week 1 is 10,000*0.2 = 2,000, at week 2 is 10,000*0.28 = 2,800, etc.

        h(t) solved%  cum-costs   ROI   
Week 1  20%    20%     1,000     2,000
Week 2  10%    28%     2,000     2,800
Week 3   5%    32%     3,000     3,200
Week 4   3%    33%     4,000     3,300
Week 5   1%    34%     5,000     3,400
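The columns above only take a few lines to compute. Here is a short sketch using the hypothetical hazards and dollar figures from the text (variable names are mine; small rounding differences from the table are possible, e.g. week 4 computes to 33.7% rather than 33%):

```python
# Build the ROI table from weekly hazards: the cumulative probability of
# being solved is 1 minus the product of (1 - hazard) over the weeks.

hazards = [0.20, 0.10, 0.05, 0.03, 0.01]  # h(t) per week
weekly_cost = 1_000    # detective cost per week of investigation
solve_value = 10_000   # value assigned to solving the case

surv = 1.0  # probability the case is still unsolved
for week, h in enumerate(hazards, start=1):
    surv *= (1 - h)
    solved = 1 - surv              # cumulative probability solved
    cum_cost = week * weekly_cost  # cumulative investigation cost
    roi = solved * solve_value     # expected return at this week
    print(f"Week {week}: solved={solved:.1%} cost={cum_cost:,} ROI={roi:,.0f}")
```

The survival-product form guarantees the cumulative probability never exceeds 100%, the point noted above.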

So the cumulative costs outweigh the expected return by Week 4 here. So in practice (in this hypothetical example) you may say to a detective: you get 4 weeks to figure it out, and if it is not solved by then it should be closed (but not cleared), and you should move onto other things. In the long run (I think) this strategy will make sure detective resources are balanced against actual cases solved.

This right sizes investigation lengths from a global perspective, but you also might consider whether to close a case on an individual case-by-case basis. In that case you wouldn’t calculate the sunk cost of the investigation so far, it is just the probability of the case being solved going forward relative to future necessary resources. (You do the same table, just start the cum-costs and solved percent columns from scratch whenever you are making that decision.)
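As a sketch of that from-scratch rule (the function and numbers are hypothetical, reusing the table above): restart the survival calculation at the decision point and keep only the weeks whose forward-looking expected return still covers the forward-looking cost.

```python
def continue_weeks(future_hazards, weekly_cost, solve_value):
    """Restart the ROI table from today: how many further weeks keep the
    forward-looking expected return at or above the forward-looking cost?"""
    surv, best = 1.0, 0
    for week, h in enumerate(future_hazards, start=1):
        surv *= (1 - h)  # probability still unsolved after this week
        if (1 - surv) * solve_value >= week * weekly_cost:
            best = week
    return best

# From the start of the investigation: worth keeping open through week 3
# (the cumulative return first flips below cumulative cost at week 4).
print(continue_weeks([0.20, 0.10, 0.05, 0.03, 0.01], 1_000, 10_000))  # → 3
# Re-deciding at the start of week 3, with only 5%/3%/1% hazards left,
# no further week pays for itself.
print(continue_weeks([0.05, 0.03, 0.01], 1_000, 10_000))  # → 0
```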

In an actual applied setting, you can estimate the survival function however you want (e.g. you may want a cure mixture-model, so not all cases will result in 100% being solved given infinite time). It is also the case that different crimes will not only have different survival curves, but also will have different costs of crime (e.g. a murder has a greater cost to society than a theft) and probably different investigative resources needed (detective costs may also get lower over time, so are not constant). You can bake that all right into this estimate. So you may say the cost of a murder is infinite, and you should forever keep that case open investigating it. A burglary though may be a very short time interval before it should be dropped (but still have some initial investment).

Another neat application of this is that if you can generate reasonable returns to solving crimes, you can right-size your overall detective bureau. That is, you can make a quantitative argument: I need X more detectives, and they will help solve Y more crimes resulting in Z return on investment. It may be we should greatly expand detective bureaus, but have them keep most cases open for only a short time period. I’m thinking of the recent officer shortages in Dallas, where very few cases are assigned at all. (Some PDs have patrol officers take on initial detective duties at the crime scene as well.)

There are definitely difficulties with applying this approach. One is that estimating the value of solving a crime is going to be tough — it spans quantitative cost-of-crime estimates (although many of those costs are sunk once the crime has been perpetrated; arresting someone does not undo the bullet wound), the likelihood of future reoffending, and ethical boundaries as well. If we are thinking about a detective bureau that is over-booked to begin with, we aren’t deciding on assigning individual cases at that point, but will need to consider pre-empting current investigations for new ones (e.g. if you drop case A and pick up case B, we have a better ROI). And that is ignoring the survival-estimation part for different cases, which is tricky using observational data as well (selection biases in which cases are currently assigned could certainly make our survival curve estimates too low or too high).

This problem has to have been tackled in different contexts before (either by actuaries or in other business/medical contexts). I don’t know the best terms to google though to figure it out — so let me know in the comments if there is related work I should look into on solving this problem.

David Bayley

David Bayley is best known in my research area, policing interventions to reduce crime, for this opening paragraph in Police for the Future:

The police do not prevent crime. This is one of the best kept secrets of modern life. Experts know it, the police know it, but the public does not know it. Yet the police pretend that they are society’s best defense against crime and continually argue that if they are given more resources, especially personnel, they will be able to protect communities against crime. This is a myth.

This quote is now paraded as backwards thinking, often presented before discussing the overall success of hot spots policing. If you didn’t read the book, you might come to the conclusion that this quote is a parallel to the nothing works mantra in corrections research. That take is not totally off-base: Police for the future was published in 1994, so it was just at the start of the CompStat revolution and hot spots policing. The evidence base was no doubt much thinner at that point and deserving of skepticism.

I don’t take the contents of David’s book as being as hardline on the stance that police cannot reduce crime, at least at the margins, as his opening quote suggests though. He has a chapter devoted to traditional police responses (crackdowns, asset forfeiture, stings, tracking chronic offenders), where he mostly expresses scientific skepticism of their effectiveness given their cost. He also discusses problem-oriented approaches to solving crime problems, how to effectively measure police performance (outputs vs. outcomes), and promotes evaluation research to see what works. Still all totally relevant twenty-plus years later.

The greater context of David’s quote comes from his work examining police forces internationally. David was more concerned about the professionalization of police forces. Part of this is better record keeping of crimes, and in the short term crime rates will often increase because of this. In class he mocked metrics used to score international police departments on professionalization that included crime rates in the final grade. He thought the function of the police was broader than reducing crime to zero.


I was in David’s last class he taught at Albany. The last day he sat on the desk at the front of the room and expressed doubt about whether he accomplished anything tangible in his career. This is the fate of most academics. Very few of us can point to direct changes anyone implemented in response to our work. Whether something works is independent of an evaluation I conduct to show it works. Even if a police department takes my advice about implementing some strategy, I am still only at best indirectly responsible for any crime reductions that follow. Nothing I could write would ever compete with pulling a single person from a burning car.

While David was being humble, he was also right. If I had to make a guess, I would say David’s greatest impact likely came about through his training of international police forces — which I believe spanned multiple continents and included doing work with the United Nations. (As opposed to saying something he wrote had some greater, tangible impact.) But even there, if we went and tried to find direct evidence of David’s impact it would be really hard to put a finger on any specific outcome.

If a police department wanted to hire me, but I would be fired if I did not reduce crimes by a certain number within that first year, I would not take that job. I am confident that I can crunch numbers with the best of them, but given real constraints of police departments I would not take that bet. Despite devoting most of my career to studying policing interventions to reduce crime, even with the benefit of an additional twenty years of research, I’m not sure if David’s quote is as laughable as many of my peers frame it to be.

Why I publish preprints

I encourage peers to publish preprint articles — versions of journal articles posted before they go through the whole peer review process and are formally published. It isn’t the norm in our field, and I’ve gotten some pushback from colleagues, so I figured I would put on paper why I think it is a good idea. In short, the benefits (increased exposure) outweigh the minimal costs of doing so.

The good — getting your work out there

The main benefit of posting preprints is to get your work more exposure. This occurs in two ways: one is that traditional peer-reviewed work is often behind paywalls. This prevents the majority of non-academics from accessing your work, and in some cases prevents other academics from reading it as well. So while the prior blog post I linked by Laura Huey notes that you can get access to some journals through your local library, it takes several steps. Adding in steps, you basically lose some folks who don’t want to spend the time. Even through my university it is not uncommon for me to be unable to access a journal article. I can technically get the article through inter-library loan, but that takes more time — time I am not going to spend unless I really want to see the contents of the article.

This I consider a minor benefit. Ultimately if you want your academic work to be more influential in the field you need to write about your work in non-academic outlets (like magazines and newspapers) and present it directly to CJ practitioner audiences. But there are a few CJ folks who read journal articles you are missing, as well as a few academics who are missing your work because of that paywall.

A bigger benefit is actually that you get your work out much quicker. The academic publishing cycle makes it impossible to publish your work in a timely fashion. If you are lucky, once your paper is finished, it will be published in six months. More realistically it will be a year before it is published online in our field (my linked article only considers when it is accepted, tack on another month or two to go through copy-editing).

Honestly, I publish preprints because I get really frustrated with waiting on peer review. No offense to my peers, but I do good work that I want others to read — I do not need a stamp from three anonymous reviewers to validate my work. I would need to do an experiment to know for sure (having a preprint might displace some views/downloads from the published version) but I believe the earlier and open versions on average doubles the amount of exposure my papers would have had compared to just publishing in traditional journals. It is likely a much different audience than traditional academic crim people, but that is a good thing.

But even without that extra exposure I would still post preprints, because it makes me happy to self-publish my work when it is at the finish line, in what can be a miserably long and very much delayed gratification process otherwise.

The potential downsides

Besides the actual time cost of posting a preprint (in the next section I will detail that more precisely — it isn’t much work), I will go through several common arguments for why posting preprints is a bad idea. I don’t believe they carry much weight, and I have not personally experienced any of them.

What if I am wrong — Typically I only post papers either when I am doing a talk, or when it is ready to go out for peer review. So I don’t encourage posting really early versions of work. While even at this stage there is never any guarantee you did not make a big mistake (I make mistakes all the time!), the sky will not fall down if you post a preprint that is wrong. Just take it down if you feel it is a net negative to the scholarly literature (which is very hard to do — the results of hypothesis tests do not make the work a net positive/negative). If you think it is good enough to send out for peer review it is definitely at the stage where you can share the preprint.

What if the content changes after peer review — My experience with peer review is mostly pedantic stuff — lit. review/framing complaints, do some robustness checks for analysis, beef up the discussion. I have never had a substantive interpretation change after peer-review. Even if you did, you can just update the preprint with the new results. While this could be bad (an early finding gets picked up that is later invalidated) this is again something very rare and a risk I am willing to take.

Note peer review is not infallible, and so hedging that peer review will catch your mistakes is mostly a false expectation. Peer review does not spin your work into gold, you have to do that yourself.

My ideas may get scooped — This I have never personally had happen to me. Posting a preprint can actually prevent more direct plagiarism, as you have a time-stamped example of your work. In terms of someone taking your idea and rewriting it, this is a potential risk (the same risk as presenting at a conference) — really only applicable for folks working on secondary data analysis. With the preprint posted, the other person should at least cite your work, but sorry, neither presenting some work nor posting a preprint gives you sole ownership of an idea.

Journals will view preprints negatively — Or journals do not allow preprints. I haven’t come across a journal in our field that forbids preprints. I’ve had one reviewer note (out of likely 100+ at this point) that the pre-print was posted as a negative (suggesting I was double publishing or plagiarizing my own work). An editor that actually reads reviews should know that is not a substantive critique. That was likely just a dinosaur reviewer that wasn’t familiar with the idea of preprints (and they gave an overall positive review in that one case, so did not get the paper axed). If you are concerned about this, just email the editor for feedback, but I’ve never had a problem from editors.

Peer reviewers will know who I am — This I admit is a known unknown. Peer review in our crim/CJ journals is mostly double blind (most geography and statistics journals I have reviewed for are not — I know who the authors are). If you presented the work at a conference you have already given up anonymity, and the field is small enough that for a good chunk of work the reviewers can guess who the author is anyway. So your anonymity is often a moot point at the peer review stage.

So I don’t know how much reviewers are biased if they know who you are (it can work both ways, if you get a friend they may be more apt to give a nicer review). It likely can make a small difference at the margins, but again I personally don’t think the minor risk/cost outweighs the benefits.

These negatives are no doubt real, but again I personally find them minor enough risks to not outweigh the benefits of posting preprints.

The not hard work of actually posting preprints

All posting a preprint involves is uploading a PDF file of your work to either your website or a public hosting service. My current workflow is to keep the different components of a journal article in several Word documents (I don’t use LaTeX very often, and Word doesn’t work so well with one big file, especially with many pictures). I then export those components to PDF files and stitch them together using the freeware tool PDFtk. It has a GUI and a command line, so I just have a bat file in my paper directory that lists something like:

pdftk.exe TitlePage.pdf MainPaper.pdf TablesGraphs.pdf Appendix.pdf cat output CombinedPaper.pdf

So it just takes a double click to update the combined PDF when I edit the different components.

Public hosting services I have used in the past to post preprints are Academia.edu, SSRN, and SocArXiv, although again you could just post the PDF on your webpage (and Google Scholar will eventually pick it up). I use SocArXiv now, as SSRN currently makes you sign up for an account to download PDFs (again a hurdle, the same as going through inter-library loan). Academia.edu also makes you sign up for an account, and has weird terms of service.

Here is an example paper of mine on SocArXiv. (Note the total downloads, most of my published journal articles have fewer than half that many downloads.) SocArXiv also does not bother my co-authors to create an account when I upload a paper. If we had a more criminal justice focused depository I would use that, but SocArXiv is fine.

There are other components of open science I should write about — such as replication materials/sharing data, and open peer reviewed journals, but I will leave those to another blog post. Posting preprints takes very little extra work compared to what academics are currently doing, so I hope more people in our field start doing it.

 

My Year Blogging in Review – 2018

The blog continues to grow in site views. I had a little north of 90,000 site views over the entire year. (If you find that impressive don’t be, a very large proportion are likely bots.)

The trend on the original count scale looks linear, but on the log scale the variance is much nicer. So I’m not sure what the best forecast would be.

I thought the demise had already started earlier in the year, as I actually saw the first year-over-year decreases in June and July. But the views recovered in the following months.

So based on that, I think the slowdown in growth is a better bet than the linear projection.

For those interested in extending their reach, you should not only consider social media and creating a website/blog, but also writing up your work for a more general newspaper. I wrote an article for The Conversation about some of my work on officer involved shootings in Dallas, and that accumulated nearly 7,000 views within a week of it being published.

Engagement in a greater audience is very bursty. Looking at my statistics for particular articles, it doesn’t make much sense to report average views per day. I tend to get a ton of views on the first few days, and then basically nothing after that. So if I do the top posts by average views per day it is dominated by my more recent posts.

This is partly due to shares on Twitter, which drive short-term views but do not impact longer-term views as far as I can tell. That is, a popular post on Twitter does not appear to predict consistent views being referred via Google searches. In the past year I got a ratio of about 50:1 referrals from Google vs. Twitter, and I did not have any posts with a consistent number of views (most settle in at under 3 views per day after the initial wave). So basically all of my most viewed posts are the same as prior years.

Since I joined Twitter this year, I actually have made fewer blog posts. Not including this post, I’ve made 29 posts in 2018.

2011  5
2012 30
2013 40
2014 45
2015 50
2016 40
2017 35
2018 29

Some examples of substitution are tweets when a paper is published. I typically do a short write up when I post a working paper — there is not much point of doing another one when it is published online. (To date I have not had a working paper greatly change from the published version in content.) I generally just like sharing nice graphs I am working on. Here is an example of citations over time I just quickly published to Twitter, which was simpler than doing a whole blog post.

Since it is difficult to determine how much engagement I will get for any particular post, it is important to just keep plugging away. Twitter can help a particular post take off (see these examples I wrote about for the Cross Validated Blog), but any one tweet or blog post is more likely to be a dud than anything.

Reasons Police Departments Should Consider Collaborating with Me

Much of my academic work involves collaborating and consulting with police departments on quantitative problems. Most of the work I’ve done so far is very ad-hoc, through either the network of other academics asking for help on some project or police departments cold contacting me directly.

In an effort to advertise a bit more clearly, I wrote a page that describes examples of prior work I have done in collaboration with police departments. That discusses what I have previously done, but doesn’t describe why a police department would bother to collaborate with me or hire me as a consultant. In fact, it probably makes more sense to contact me for things no one (including myself) has done before.

So here is a more general way to think about (from a police department’s or criminal justice agency’s perspective) whether it would be beneficial to reach out to me.

Should I do X?

No one is going to be against different evidence-based policing practices, but not all strategies make sense for all jurisdictions. For example, while focused deterrence has been successfully applied in many different cities, if you do not have much of a gang violence problem it probably does not make sense to apply that strategy in your jurisdiction. Implementing any particular strategy should take into consideration its costs as well as its potential benefits.

Should I do X may involve more open-ended questions. I’ve previously conducted in-person training for crime analysts that goes over various evidence-based practices. It also may involve something more specific, such as: should I redistrict my police beats? Or: I have a theft-from-vehicle problem, what strategies should I implement to reduce it?

I can suggest strategies to implement, or conduct cost-benefit analysis as to whether a specific program is worth it for your jurisdiction.

I want to do X, how do I do it?

This is actually the best scenario for me. It is much easier to design a program up front that allows a police department to evaluate its efficacy (such as designing a randomized trial and collecting key measures). I also enjoy tackling some of the nitty-gritty problems of implementing particular strategies more efficiently or developing predictive instruments.

So you want to do hotspots policing? What strategies do you want to do at the hotspots? How many hotspots do you want to target? Those are examples of where it would make sense to collaborate with me. Pretty much all police departments should be doing some type of hot spots policing strategy, but depending on your particular problems (and budget constraints), it will change how you do your hot spots. No budget doesn’t mean you can’t do anything — many strategies can be implemented by shifting your current resources around in particular ways, as opposed to paying for a special unit.

If you are a police department at this stage I can often help identify potential grant funding sources, such as the Smart Policing grants, that can be used to pay for particular elements of the strategy (that have a research component).

I’ve done X, should I continue to do it?

Have you done something innovative and want to see if it was effective? Or are you putting a bunch of money into some strategy and are skeptical it works? It is always preferable to design a study up front, but often you can conduct pretty effective post-hoc analysis using quasi-experimental methods to see if some crime reduction strategy works.

If I don’t think you can do a fair evaluation I will say so. For example I don’t think you can do a fair evaluation of chronic offender strategies that use officer intel with matching methods. In that case I would suggest how you can do an experiment going forward to evaluate the efficacy of the program.

Mutual Benefits of Academic-Practitioner Collaboration

Often I collaborate with police departments pro bono — so you may ask, what is in it for me then? As an academic I am evaluated mostly by my research productivity, which involves writing peer-reviewed papers and getting research grants. So money is not the main factor from my perspective. It is typically easier to write papers about innovative problems or programs. If a project involves applying for a grant (on a topic I am interested in) I will volunteer my services to help write the grant and design the study.

I could go through my career writing papers without collaborating with police departments. But my work with police departments is more meaningful. It is not zero-sum; I tend to get better ideas from understanding specific agencies’ problems.

So get in touch if you think I can help your agency!

New preprint: Allocating police resources while limiting racial inequality

I have a new working paper out, Allocating police resources while limiting racial inequality. In this work I tackle the problem that a hot spots policing strategy likely exacerbates disproportionate minority contact (DMC). This is because of the pretty simple fact that hot spots of crime tend to be in disadvantaged/minority neighborhoods.

Here is a graph illustrating the problem. The X axis is the proportion of minorities stopped by the police in 500 by 500 meter grid cells (NYPD data). The Y axis is the number of violent crimes over a long time period (12 years). So a typical hot spots strategy would choose the top N areas to target (here I do the top 20). These are all very high proportion minority areas. So the inevitable extra police contact in those hot spots (in the form of either stops or arrests) will increase DMC.

I’d note that the majority of critiques of predictive policing focus on whether reported crime data is biased or not. I think that is a bit of a red herring though, you could use totally objective crime data (say swap out acoustic gun shot sensors with reported crime) and you still have the same problem.

The proportion of stops by the NYPD of minorities has consistently hovered around 90%, so doing a bunch of extra stuff in those hot spots will increase DMC, as those 20 hot spots tend to have 95%+ stops of minorities (with the exception of one location). Also note this 90% has not changed even with the dramatic decrease in stops overall by the NYPD.

So to illustrate my suggested solution here is a simple example. Consider you have a hot spot with predicted 30 crimes vs a hot spot with predicted 28 crimes. Also imagine that the 30 crime hot spot results in around 90% stops of minorities, whereas the 28 crime hot spot only results in around 50% stops of minorities. If you agree reducing DMC is a reasonable goal for the police in-and-of-itself, you may say choosing the 28 crime area is a good idea, even though it is a less efficient choice than the 30 crime hot spot.

I show in the paper how to codify this trade-off into a linear program that says choose X hot spots, but has a constraint based on the expected number of minorities likely to be stopped. Here is an example graph that shows it doesn’t always choose the highest crime areas to meet that racial equity constraint.
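In the paper this is formulated as a linear program; as a hedged stand-in, here is a tiny brute-force version with made-up numbers (a real application would use an LP/MIP solver and constrain the expected count of minority stops, rather than the weighted share used in this toy):

```python
from itertools import combinations

# Toy version of the allocation problem: pick N hot spots to maximize
# predicted crime covered, subject to a cap on the expected minority
# share of stops. All numbers below are invented for illustration.

# (predicted crimes, expected minority share of stops) per candidate area
areas = [(30, 0.90), (28, 0.50), (25, 0.85), (20, 0.60), (18, 0.95), (15, 0.40)]

def best_selection(areas, n_pick, max_minority_share):
    best, best_crimes = None, -1
    for combo in combinations(areas, n_pick):
        crimes = sum(c for c, _ in combo)
        # expected minority share across the selection, weighting each
        # area's share by its predicted activity
        share = sum(c * m for c, m in combo) / crimes
        if share <= max_minority_share and crimes > best_crimes:
            best, best_crimes = combo, crimes
    return best, best_crimes

unconstrained, total = best_selection(areas, 3, 1.0)    # no equity cap
constrained, total_c = best_selection(areas, 3, 0.70)   # cap at 70%
print(total, total_c)  # → 83 78
```

Here the constrained pick swaps the 25-crime/85%-minority area for a 20-crime/60% one, giving up 5 predicted crimes to get under the 70% cap — the same kind of trade-off the graph shows.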

This results in a trade-off of efficiency though. Going back to the original hypothetical, trading off a 28 crime vs 30 crime area is not a big deal. But if the trade off was 3 crimes vs 30 that is a bigger deal. In this example I show that getting to 80% stops of minorities (NYC is around 70% minorities) results in hot spots with around 55% of the crime compared to the no constraint hot spots. So in the hypothetical it would go from 30 crimes to 17 crimes.

There won’t be a uniform formula to calculate the expected decrease in efficiency, but I think getting to perfect equality with the residential population will typically result in similarly large decreases in many scenarios. A recent paper by George Mohler and company showed similar fairly steep declines. (That uses a totally different method, but I think it will produce pretty similar outputs in practice — you can tune the penalty factor in a similar way to changing the linear program constraint, I think.)

So basically the trade-off to get perfect equity will be steep, but I think the best case scenario is that a PD can say "this predictive policing strategy will not make current levels of DMC worse" by applying this algorithm on-top-of your predictive policing forecasts.

I will be presenting this work at ASC, so stop on by! Feedback always appreciated.

New paper: A simple weighted displacement difference test to evaluate place based crime interventions

At the ECCA conference this past spring Jerry Ratcliffe asked if I could apply some of my prior work on evaluating changes in crime patterns over time to make a set of confidence intervals for the weighted displacement quotient statistic (WDQ). The answer to that is no, you can’t, but in its stead I created another statistic in which you can do that, the weighted displacement difference (WDD). The work is published in the open access journal Crime Science.

The main idea is we wanted a simple statistic folks can use to evaluate place based interventions to reduce crime. All you need are pre and post crime counts for your treated and control areas of interest. Here is an excel spreadsheet to calculate the statistic, and below is a screen shot. You just need to fill in the pre and post counts for the treated and control locations and the spreadsheet will spit out the statistic, along with a p-value and a 95% confidence interval of the number of crimes reduced.

What is different compared to the WDQ statistic is that this statistic needs a control area for the displacement area too. But if you are not worried about displacement, you can just put in zeros for the displacement areas and still compute the statistic for the local area (and its control area). In this way you can actually do two estimates, one for the local effects and one for the displacement — just put in zeros for the other values.

While you don’t really need to read the paper to be able to use the statistic, we do have some discussion on choosing control areas. In general the control areas should have similar counts of crime, you shouldn’t have a treatment area that has 100 crimes and a control area that only has 10 crimes. We also have this graph, which is basically a way to conduct a simple power analysis — the idea that “could you reasonably detect whether the intervention reduced crime” before you actually conduct the analysis.

So the way to read this graph is: if you have a set of treated and control areas that average 100 crimes in each period (so the cumulative total across all the areas is around 800 crimes), the intervention needs to have prevented around 30 crimes to even show weak evidence of a crime reduction (a one-tailed p-value of less than 0.1). Many interventions just aren’t set up to provide strong evidence of crime reductions. For example, if you have a baseline of 20 crimes, you need to prevent 15 of them to find weak evidence of effectiveness. Interventions in areas with fewer baseline crimes basically cannot be verified as effective using this simple a design.
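For a rough sense of where those numbers come from, here is a minimal back-of-the-envelope sketch of that power analysis. It assumes the null variance of the WDD equals the sum of all the observed counts (per the Poisson logic in the paper), with eight count cells (pre/post × treated/control × local/displacement). The function name and cell setup are my own illustration, and this normal approximation only lands in the ballpark of the graph, so use the figure in the paper for real planning.

```python
from statistics import NormalDist

def min_detectable_reduction(avg_count, n_cells=8, alpha=0.10):
    """Approximate smallest crime reduction detectable at a one-tailed
    alpha. Assumes each of the n_cells pre/post treated/control counts
    is Poisson with mean avg_count, so the null standard error of the
    WDD is sqrt(n_cells * avg_count)."""
    z = NormalDist().inv_cdf(1 - alpha)  # ~1.28 for alpha = 0.10
    return z * (n_cells * avg_count) ** 0.5

# roughly matching the examples in the text
print(round(min_detectable_reduction(100)))  # 36
print(round(min_detectable_reduction(20)))   # 16
```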

For those more mathy, I created a test statistic based on the differences in the changes of the counts over time by making the assumption that the counts are Poisson distributed. It is then basically just a combination of two difference-in-difference estimates (for the local and the displacement areas) using counts instead of means. For researchers with the technical capabilities, it probably makes more sense to use a data based approach to identify control areas (such as the synthetic control method or propensity score matching), assuming an actual randomized experiment is not feasible. But that is too much of a burden for many crime analysts, so if you can construct a reasonable control area by hand you can use this statistic.
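As a minimal sketch of how that calculation works, the code below follows the description above: each count is treated as Poisson, so the variance of the difference-in-differences estimate is just the sum of all the counts entering it. The function name, sign conventions, and one-tailed p-value are my own reading of the setup — check any results against the spreadsheet.

```python
from statistics import NormalDist

def wdd_test(pre_t, post_t, pre_ct, post_ct,
             pre_d=0, post_d=0, pre_cd=0, post_cd=0):
    """Sketch of the weighted displacement difference: change in the
    treated area minus change in its control, plus the same contrast
    for the displacement area and its control. Counts are assumed
    Poisson, so the variance of the estimate is the sum of all counts.
    Leave the displacement arguments at zero for local effects only."""
    est = (post_t - pre_t) - (post_ct - pre_ct) \
        + (post_d - pre_d) - (post_cd - pre_cd)
    se = (pre_t + post_t + pre_ct + post_ct +
          pre_d + post_d + pre_cd + post_cd) ** 0.5
    z = est / se
    p_one_tail = NormalDist().cdf(z)     # small when crime went down
    ci95 = (est - 1.96 * se, est + 1.96 * se)
    return est, se, z, p_one_tail, ci95
```

For example, a drop from 100 to 60 crimes in the treated area with a flat control (100 pre and post) gives an estimate of 40 crimes prevented with a standard error of about 19.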

American Community Survey Variables of Interest to Criminologists

I’ve written prior blog posts about downloading Five Year American Community Survey data estimates (ACS for short) for small area geographies, but one of the main hiccups is figuring out what variables you want to use. The census has so many variables that are just small iterations of one another (e.g. Males under 5, males 5 to 9, males 10 to 14, etc.) that it is quite a chore to specify the ones you want. Often you want combinations of variables or to calculate percentages as well, so you need to take two or more variables and turn them into your constructed variable.

I have posted some notes on the variables I have used for past projects in an excel spreadsheet. This includes the original variables, as well as some notes for creating percentage variables. Some are tricky — for example, to figure out the proportion of black residents for block groups you need to add the non-Hispanic black and Hispanic black estimates (and then divide by the total population). For spatially oriented criminologists these are basically the indicators commonly used for social disorganization. The spreadsheet also includes notes on what is available at the smaller block group level, as not all of the variables are. So you are more limited in your choices if you want that small of an area.
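As a concrete example of one of those constructed percentages, here is a minimal sketch of the black-resident proportion just described. The variable codes (B03002_001 total, B03002_004 non-Hispanic black, B03002_014 Hispanic black) are my recollection of the ACS B03002 table — double check them against the spreadsheet before using.

```python
def prop_black(row):
    """Proportion black = (non-Hispanic black + Hispanic black) / total.
    Variable codes are from the ACS B03002 table (verify before use)."""
    total = row['B03002_001']
    if total == 0:
        return None  # avoid divide-by-zero for empty block groups
    return (row['B03002_004'] + row['B03002_014']) / total

# toy block group record
bg = {'B03002_001': 1500, 'B03002_004': 300, 'B03002_014': 30}
print(prop_black(bg))  # 0.22
```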

Let me know if you have been using other variables for your work. I’m not an expert on these variables by any stretch, so don’t take my list as authoritative in any way. For example I have no idea whether it is valid to use the imputed data for moving in the prior year at the block group level. (In general I have not incorporated the estimates of uncertainty for any of the variables into my analyses, not sure of the additional implications for the imputed data tables.) Also I have not incorporated variables that could be used for income-inequality or for ethnic heterogeneity (besides using white/black/Hispanic to calculate the index). I’m sure there are other social disorganization relevant variables at the block group level folks may be interested in as well. So let me know in the comments or shoot me an email if you have suggestions to update my list.

I would prefer if as a field we could create a set of standardized indices so we are not all using different variables (see for example this Jeremy Miles paper). What variables folks use is a bit of a hodge-podge from study to study though, and most folks don’t report the original variables, so it is hard to replicate their work exactly. British folks have their index of deprivation, and it would be nice to have a similarly standardized measure to use in social science research for the States.


The ACS data has consistent variable names over the years — e.g. B03001_001 is the total population, B03002_003 is the non-Hispanic white population, etc. Unfortunately those variables are not necessarily in the same tables from year to year, so concatenating ACS results over multiple years is a bit of a pain. Below I post a python script that, given a directory of the excel template files, will produce a nice pair of dictionaries to help find what table particular variables are in.

#This python code grabs the ACS meta-data templates
#to make it easier to search for tables that have particular variables
import xlrd, os

mydir = r'!!!Insert your path to the excel files here!!!!!'

def acs_vars(directory):
    #get the excel files in the directory
    excel_files = []
    for file in os.listdir(directory):
        if file.endswith(".xls"):
            excel_files.append( os.path.join(directory, file) )
    #collect the variables into two dictionaries
    lab_dict = {}
    loc_dict = {}
    for file in excel_files:
        book = xlrd.open_workbook(file) #first open the xls workbook
        sh = book.sheet_by_index(0)
        var_names = [i.value for i in sh.row(0)] #variable names on the first row
        var_labels = [i.value for i in sh.row(1)] #labels on the second row
        #now add to the overall dictionaries
        for v,l in zip(var_names,var_labels):
            lab_dict[v] = l    #variable name -> label
            loc_dict[v] = file #variable name -> file it is located in
    #returning the two dictionaries
    return lab_dict,loc_dict
    
labels,tables = acs_vars(mydir)

#now if you have a list of variables you want, you can figure out the table
interest = ['B03001_001','B02001_005','B07001_017','B99072_001','B99072_007',
            'B11003_016','B14006_002','B01001_003','B23025_005','B22010_002',
            'B16002_004']
            
for i in interest:
    head, tail = os.path.split(tables[i])
    print (i,labels[i],tail)

The length it takes from submission to publication

The other day I received a positive comment about my housing demolition paper. It made me laugh a bit inside — it felt like I finished that work so long ago that it was ancient history. That paper is not so ancient though: I submitted it 8/4/17, it went through one round of revision, and I got the email from Jean McGloin with the conditional acceptance on 1/16/18. It then came online first a few months later (3/15/18), and it is in the current print issue of JRCD, which came out in May 2018.

This ignores the time it takes from conception to finishing a project (we started the project sometime in 2015), but focusing just on the publishing process, this is close to the best case scenario for the life-cycle of a paper through peer reviewed journals in criminology & criminal justice. The realistic best case scenario typically is:

  • Submission
  • Wait 3 months for peer reviews
  • Get chance to revise-resubmit
  • Wait another 3 months for second round of reviews and editor final decision

So ignoring the time it takes for editors to make decisions and the time for you to turn around edits, you should not bank on a paper being accepted in under 6 months. There are exceptions to this — some journals/editors don’t bother with the second three month wait period for reviewers to look at your revisions (which I think is the correct way to do it), and sometimes you will get reviews back faster or slower than three months — but that realistic scenario is the norm for most journals in the CJ/Crim field. Things that make this process much slower (multiple rounds of revisions, editors taking time to make decisions, the time it takes to make extensive revisions) are much more common than things that make it shorter (I’ve only heard myths about a uniform accept on the first round without revisions).

Not having tenure this is something that is on my mind. It is a bit of a rat race trying to publish all the papers expected of you, and due to the length of peer review times you essentially need to have your articles out and under review well before your tenure deadline is up. The six month lag is the best case scenario in which your paper is accepted at the first journal you submit to. The top journals are uber competitive though, so you often have to go through that process multiple times due to rejections.

So to measure that time I took my papers, including those not published, to see what this life-cycle time is. If I only included those that were published it would bias the results to make the time look shorter. Here I measured the time it took from submission of the original article until when I received the email of the paper being accepted or conditionally accepted. So I don’t consider the lag time at the end with copy-editing and publishing online, nor do I consider up front time from conception of the project or writing the paper. Also I include three papers that I am not shopping around anymore, and censored them at the date of the last reject. For articles still under review I censored them at 5/9/18.

So first, for 25 of my papers that have received one editorial decision, here is a graph of the typical number of rejects I get for each paper. A 0 for a paper means it was published at the first journal I submitted to, a 1 means I had one reject and was accepted at the second journal I submitted the paper to, etc. (I use "I" but this includes papers I am co-author on as well.) The Y axis shows the total percentage, and the label for each bar shows the total N.

So the proportion of my papers that are accepted on the first round is 28%, and I have a mean of 1.6 rejections per article. This does not take into account censoring (I’m not sure how to for this simple estimate), which biases the rejects-per-paper number downward, as it includes some articles under review now that will surely be rejected at some point after writing this blog post.

The papers with multiple rejects run the typical gamut of why academic papers are sometimes hard to publish. Null results, a hostile reviewer at multiple places, controversial findings. It also illustrates that peer review is not necessarily a beacon showing the absolute truth of an article. I’m pretty sure everything I’ve published, even papers accepted at the first venue, have had one reviewer with negative comments. You could find reasons to reject the findings of anything I write that has been peer reviewed — same as you can think many of my pre-print articles are correct or useful even though they do not currently have a peer review stamp of approval.

Most of those rejections add about three months to the life-cycle, but some can be faster (these include desk rejections) and some slower (rejections on later rounds of revisions). So using those begin times and end times, and taking into account censoring, I can estimate the typical survival time of my papers within the peer-review system, lumping all of those different factors together into the total time. Here is the 1 - survival chart, so it can be interpreted as the cumulative probability of acceptance by a given number of days. This includes 26 papers (one more that has not had a first decision), so this estimate does account for papers that are censored.

The Kaplan-Meier estimate of the median survival times for my papers is 290 days. So if you want a 50% chance of your article being published, you should expect 10 months based on my experience. The data is too sparse to estimate extreme quantiles, but say I want an over 80% probability of an article being published based on this data, how much time do I need? The estimate based on this data is at least 460 days.
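If you want to replicate this with your own papers but don’t use SPSS, the Kaplan-Meier estimate is simple enough to compute by hand. Here is a minimal sketch in Python; the durations below are made-up toy numbers for illustration, not my actual paper data.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier product-limit estimator. durations are days until
    acceptance (or censoring); events is 1 if accepted, 0 if censored.
    Returns a list of (time, survival probability) at each event time."""
    pts = sorted(zip(durations, events))
    at_risk = len(pts)
    surv, curve = 1.0, []
    i = 0
    while i < len(pts):
        t = pts[i][0]
        n_t, accepted = at_risk, 0
        while i < len(pts) and pts[i][0] == t:  # handle tied times
            accepted += pts[i][1]
            at_risk -= 1
            i += 1
        if accepted:
            surv *= 1 - accepted / n_t
            curve.append((t, surv))
    return curve

def median_survival(durations, events):
    """First time the survival curve drops to 0.5 or below."""
    for t, s in kaplan_meier(durations, events):
        if s <= 0.5:
            return t
    return None  # median not reached

# toy example: days to acceptance, with two papers censored (event = 0)
days   = [120, 150, 290, 300, 400, 460, 500, 200]
accept = [  1,   1,   1,   1,   1,   1,   0,   0]
print(median_survival(days, accept))  # 300
```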

Different strategies will produce different outcomes — so my paper survival times may not generalize to yours, but I think that estimate will be pretty reasonable for most folks in Crim/CJ. I try to match papers to journals that I think are the best fit (so I don’t submit everything to Criminology or Justice Quarterly on the first go), so a decent percent of my papers land on the first round. If I submitted to more mediocre journals on the first round my survival times would be shorter. But even many mid-tiered journals in our field have overall acceptance rates below 10%, and I never think anything I submit is a slam dunk sure thing, so I don’t think my overall strategy is the biggest factor. Some of that survival time is my fault — it includes the time editing the article in between rejects and revise-resubmits — but the vast majority of it is simply waiting on reviewers.

So the sobering truth for those of us without tenure is that based on my estimates you need to have your journal articles out of the door well over a year before you go up for review to really ensure that your work is published. A non-trivial chunk of my work (near 20%) has taken over one and a half years to publish. For folks currently getting their PhD it is the same pressure really, since to land a tenure track job you need to have publications as well. (I think that is actually one reasonable argument for taking a longer time to write your dissertation.) And that is just for the publishing part — it does not include actually writing the article or conducting the research. The nature of the system is very much delayed gratification in having your work finally published.

Here is a link to the data on survival times for my papers, as well as the SPSS code to reproduce the analysis.

Work on Shootings in Dallas Published

I have two recent articles that examine racial bias in decisions to shoot using Dallas Police Data:

  • Wheeler, Andrew P., Scott W. Phillips, John L. Worrall, and Stephen A. Bishopp. (2018) What factors influence an officer’s decision to shoot? The promise and limitations of using public data. Justice Research and Policy Online First.
  • Worrall, John L., Stephen A. Bishopp, Scott C. Zinser, Andrew P. Wheeler, and Scott W. Phillips. (2018) Exploring bias in police shooting decisions with real shoot/don’t shoot cases. Crime & Delinquency Online First.

In each the main innovation is using control cases in which officers pulled their firearm and pointed at a suspect, but decided not to shoot. Using this design we find that officers are less likely to shoot African-Americans, which runs counter to most recent claims of racial bias in police shootings. Besides the simulation data of Lois James, this is a recurring finding in the recent literature — see Roland Fryer’s estimates of this as well (although he uses TASER incidents as control cases).

The reason for the two articles is that John and I found out through casual conversation that we were both pursuing very similar projects, so we decided to collaborate. The paper on which John is first author examined individual officer level outcomes — in particular it retrieved personnel complaint records for individual officers, and found they did correlate with officer decisions to shoot. For my article I wanted to intentionally stick with the publicly available open data, as a main point of the work was to articulate where the public data falls short and in turn suggest what information would be needed in such a public database to reasonably identify racial bias. (The public data is aggregated to the incident level — one incident can have multiple officers shooting.) From that I suggest that instead of a specific officer involved shooting database, it would make more sense to have officer use of force (at all levels) attached to incident based reporting systems (i.e. NIBRS should have use of force fields included). In a nutshell, when examining any particular use-of-force outcome, you need a counter-factual in which that use-of-force could have happened, but didn’t. The natural way to get that is to have all levels of force recorded.

Both John and I thought prior work that only looked at shootings was fundamentally flawed. In particular, analyses where armed/unarmed is the main outcome among only a set of shooting cases confuse cause and effect, and subsequently cannot be used to determine racial bias in officer decision making. Another way to think about it is that when only looking at shootings you are limiting yourself to examining potentially bad outcomes — officers often use their discretion for good (the shooting rate in the Dallas data is only 3%). So in this regard databases that only include officer involved shooting cases are fundamentally limited in assessing racial bias — you need cases in which officers did not shoot to assess bias in officer decision making.

This approach of course has some limitations as well. In particular it relies on another point of discretion for officers — when to draw their firearm. It could be the case that there is no bias in terms of when officers pull the trigger, but that they are more likely to pull their gun against minorities — our studies cannot rule out that interpretation. But it is also the case that other factors could explain why minorities are more likely to have an officer point a gun at them, such as geographic policing or, even more basic, that minorities call the police more often. In either case, at the specific decision point of pulling the trigger, there is no evidence of racial bias against minorities in the Dallas data.

I did not post pre-prints of this work due to the potentially contentious nature, as well as the fact that colleagues were working on additional projects based on the same data. I have posted the last version before the copy-edits of the journal for the paper in which I am first author here. If you would like a copy of the article John is first author always feel free to email.