Keeping it simple: Viz. mass shooting definitions

My wife asked me the other day about some mass shooting statistics, in particular some claims of an average of one a day in the US. Without knowing the source, I told her outright it is probably because that person widened the net to events beyond what most people stereotypically consider a mass shooting.

Now, I have no personal opinion on how it should be defined, and being a researcher in criminal justice I appreciate people digging into the details. I was prompted to write this post by an interactive application showing how the numbers change by Kevin Schaul of the Washington Post (referred via Flowing Data). I was pretty frustrated by Kevin’s example interactive application though – there are much simpler ways than making me change the definition and seeing what individual events pop up. Here is an example screen shot of inputting a definition and then how Kevin’s data pop out.

So, downloading the same Reddit data for 2015 so far (as of 12/7/15) I created what I consider to be simple summaries. Caveat – these crowdsourced datasets are likely to have substantial missing data, especially towards the events with fewer injured. First I made a frequency histogram of the total number of dead per incident.

So you can see that if you only want to include dead in your personal definition, the one per day statistic is a dramatic over-representation. If you want to draw the line at 5 or more you will have around 9 more events than you would if you made the line at 6 or more. If you make the line at 10 or more there are only two incidents, but there are another 4 if you include incidents with 8 or 9 dead.

Another simple overview is a table. Here are tables of dead, injured, and the combined counts per each incident, sorted in descending value of the count. So the way to read this is that there were 147 separate incidents in the Reddit database that had 0 deaths, 104 that had only one death, etc. The tables also have percents and cumulative percentages, so you can see how where you define the cut-point changes how much of the data you chop off. Cumulative counts would be just as useful.
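For anyone who wants to build this kind of table themselves, here is a minimal sketch in Python. The `frequency_table` function and the example values are my own illustration (they are not the Reddit data), and I cumulate from the largest value downward so the cumulative count reads as "incidents with at least this many dead".

```python
from collections import Counter

def frequency_table(counts):
    """Return rows of (value, n, percent, cumulative n, cumulative percent),
    ordered from the largest value to the smallest."""
    tally = Counter(counts)
    total = sum(tally.values())
    rows = []
    running = 0
    for value in sorted(tally, reverse=True):
        running += tally[value]
        rows.append((value, tally[value], 100.0 * tally[value] / total,
                     running, 100.0 * running / total))
    return rows

# Hypothetical deaths-per-incident values, NOT the Reddit data.
dead = [0, 0, 0, 1, 1, 2, 4, 0, 1, 9]

print(" dead     n    pct cum_n cum_pct")
for value, n, pct, cum_n, cum_pct in frequency_table(dead):
    print(f"{value:>5} {n:>5} {pct:>6.1f} {cum_n:>5} {cum_pct:>7.1f}")
```

Reading down the cumulative columns tells you directly how many incidents survive any "at least k dead" cut-off.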

I have no personal problem using injured as well in a mass shooting definition. Basically the difference between being shot and being killed is seemingly due to random happenstance, so a shooting with 10 injured and no one killed can easily be argued to be a mass shooting in my opinion. Kevin’s interactive makes you choose an "and" condition though between injured and killed, whereas one could place the cut point at an "or" condition or simply the combined total. Here is a cross tabulation of the frequencies of injured by dead.

You can clearly see that the Reddit definition is a combined total of injured or dead of 4 or more, via the line on the upper left of the table. Kevin’s "and" condition forces you to make a cut-point along each axis, basically choosing a rectangle in the lower right of the above crosstab table. If you want a combined total though, the cut will be along a diagonal somewhere in the table.
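To make the three kinds of cut-point concrete, here is a small sketch. The function names, thresholds, and incident list are all hypothetical and only meant to show how the rectangle ("and"), the L-shaped region ("or"), and the diagonal (combined total) select different incidents from the same crosstab.

```python
def and_rule(dead, injured, min_dead, min_injured):
    """Rectangle in the crosstab: both thresholds must be met."""
    return dead >= min_dead and injured >= min_injured

def or_rule(dead, injured, min_dead, min_injured):
    """L-shaped region: either threshold is enough."""
    return dead >= min_dead or injured >= min_injured

def total_rule(dead, injured, min_total):
    """Diagonal cut: only the combined count matters."""
    return dead + injured >= min_total

# Hypothetical (dead, injured) pairs, NOT taken from the Reddit data.
incidents = [(0, 4), (1, 3), (4, 0), (2, 2), (0, 10), (5, 6)]

n_and = sum(and_rule(d, i, 2, 2) for d, i in incidents)
n_or = sum(or_rule(d, i, 4, 4) for d, i in incidents)
n_total = sum(total_rule(d, i, 4) for d, i in incidents)
print(n_and, n_or, n_total)
```

Even on six made-up incidents the three rules give three different counts, which is the whole reason the definition matters.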

I appreciate these interactive visualizations allow a viewer to dig deeper into specific events in the data, but that does not mean some simple summaries could not also accompany the piece.

License plate readers and the trade off in privacy

As a researcher in criminal justice, I find tackling ethical questions a difficult task. There are no hypotheses to test, nor models to fit, just opinions being bandied around. I figured I would take my best shot at writing some coherent thoughts on the topic of the data police collect and its impacts on personal privacy – and my blog is really the best outlet.

What prompted this is a recent Nick Selby post which suggested the use of license plate readers (LPRs) to target Johns in LA is one of the worst ideas ever and a good example of personal privacy invasion by law enforcement. (Also see this Washington Post opinion article.)

I have a bit of a different and more neutral take on the program, and will try to articulate some broader themes in personal privacy invasion and the collection/use of data by police. I think it is an important topic and will continue to be with the continual expansion of public sensor data being collected by the police (with body worn cameras, stationary cameras, cell phone data, GPS traces being some examples). Basically, much of the negative sentiment I’ve seen so far about this hypothetical intervention is for reasons that don’t have to do with privacy. I’ll articulate these points by presenting alternative, currently in use police programs that use similar means, but have different ends.

To describe the LA program in a nutshell, the police use what are called license plate readers to identify particular vehicles being driven in known prostitution areas. LPRs are just cameras that take a snapshot of a license plate, automatically code the alpha-numeric plate, and then place that [date-time-location-plate-car image] in a database. Linking up this data with registered vehicles, in LA the idea is to have the owner of the vehicle sent a letter in the mail. The letter itself won’t have any legal consequences, just a note that says the police know you have been spotted. The idea in theory is that you will think you are more likely to be caught in the future, and may have some public shaming also if your family happens to see the letter, so you will be less likely to solicit a prostitute in the future.

To start with, some of the critiques of the program focus on the possibilities of false positives. Probably no reasonable person would think this is a worthwhile idea if the false positive rate is anything but small – people will be angry with being falsely accused, there are negative externalities in terms of family relationships, and any potential crime reduction upside would be so small that it is not worthwhile. But, I don’t think that itself is damning to this idea – I think you could build a reasonable algorithm to limit false positives. Say the car is spotted multiple times at a very specific location, at specific times, and the owner’s home address is not near the location. It would be harder to limit false positives in areas where people conduct other legitimate business, but I think it has potential with just LPR data, and would likely improve by adding in other information from police records.
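As a rough sketch of what such a rule could look like, here is some Python. Everything in it is an assumption of mine for illustration: the record layout, the `flag_plates` name, and the thresholds (three separate days, late night hours, a home address at least 5 km away) are not from the LA proposal or any real system.

```python
from collections import defaultdict

def flag_plates(sightings, home_km, min_days=3,
                hours=(22, 23, 0, 1, 2), min_home_km=5.0):
    """Flag plates seen at the same location, within the given hours, on at
    least `min_days` distinct days, whose registered address is far away.

    sightings: iterable of (plate, location, day, hour) tuples (hypothetical layout)
    home_km:   dict of plate -> distance in km from the registered address
    """
    days_seen = defaultdict(set)
    for plate, location, day, hour in sightings:
        if hour in hours:
            days_seen[(plate, location)].add(day)

    flagged = set()
    for (plate, _location), days in days_seen.items():
        # A plate with an unknown home distance defaults to 0 km, so it is
        # never flagged; this errs on the side of fewer false positives.
        far_from_home = home_km.get(plate, 0.0) >= min_home_km
        if len(days) >= min_days and far_from_home:
            flagged.add(plate)
    return flagged

# Made-up records for illustration only.
sightings = [
    ("AAA111", "corner_1", 1, 23), ("AAA111", "corner_1", 2, 23),
    ("AAA111", "corner_1", 5, 0),
    ("BBB222", "corner_1", 1, 23), ("BBB222", "corner_1", 2, 23),
    ("BBB222", "corner_1", 3, 23),   # repeated, but lives around the corner
    ("CCC333", "corner_1", 1, 14), ("CCC333", "corner_1", 2, 23),
]
home_km = {"AAA111": 12.0, "BBB222": 0.4, "CCC333": 20.0}
print(flag_plates(sightings, home_km))
```

Tightening any of the thresholds trades more false negatives for fewer false positives, which is the direction you would want to err in here.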

If you have other video footage, like from a stationary camera, I think limiting false positives can definitely be done by incorporating things like loitering behavior and seeing the driver interact with an individual on the street. Eric Piza has done similar work on human coding/monitoring video footage in Newark to identify drug transactions, and I have had conversations with an IBM Smart City rep. and computer scientists about automatically coding audio and video to identify particular behaviors that are just as complicated. False negatives may still be high, but I would be pretty confident you could create a pretty low false positive rate for identifying Johns.

As researchers, we often limit our inquiries to just evaluating 1) whether the program works (e.g. reduces crime) and 2) if it works whether it is cost-effective. LPRs and custom notifications are an interesting case compared to say video cameras because they are so cheap. Cameras and the necessary data storage infrastructure are so expensive that, to be frank, they are unlikely to be a cost-effective return on investment in any short term time frame even given the best case scenario crime reductions (ditto for police body worn cameras). LPRs and mailing letters on the other hand are cheap (both in terms of physical capital and human labor), so even small benefits could be cost-effective.

So in short, I don’t think the idea should be dismissed outright because of false positives, and the idea of using public video/sensor footage to proactively identify criminal behavior could be expanded to other areas. I’m not saying this particular intervention would work, but I think it has better potential than some programs police departments are currently spending way more money on.

Assuming you could limit the false positives, the next question then is: is it ok for the police to intrude on the privacy of individuals who have not committed any particular crime? The answer to this I don’t know, but there are other examples of police sending letters that are similar in nature but haven’t generated much critique. One is the use of letters to trick offenders with active warrants into turning themselves in. Another more similar example though is custom notifications. These are very similar in that often the individuals aren’t identified because of specific criminal charges, but are identified using data analytics and human intelligence to place them as high risk and gang involved offenders. Intrusion to privacy is way higher for these custom notifications than the suggested Dear John letters, but individuals did much more to precipitate police action as well.

When the police stop you in the car or on the street the police are using discretion to intrude in your privacy under circumstances where you have not necessarily committed a crime. Is there any reason a cop has to take that action in person versus seeing it on a video? Automatic citations at red light cameras are similar in mechanics to what this program is suggesting.

The note about negative externalities to legitimate businesses in the areas and the cost of letters I consider hyperbole. Letters are cheap, and actual crime data is frequently available that could already be used to redline neighborhoods. But Nick’s critique of the information being collated by outside agencies and used in other actuarial aspects, such as loans and employment decisions, I think is legitimate. I have no good answers to this problem – I have mixed feelings as I think open data is important (which ironically I can’t quantify in any meaningful way), and I think perpetual online criminal histories are a problem as well. Should we not have public crime maps though because businesses are less likely to invest in high crime neighborhoods? I think doing a criminal background check for many businesses is a legitimate query as well.

I have mixed feelings about familial shaming being an explicit goal of the letters, but compared to an arrest the letter is mundane. It is even less severe than a citation (and under some state laws you could be given a citation for loitering in a high prostitution area). Is a program that intentionally tries to shame a person – which I agree could have incredible family repercussions – a legitimate goal of the criminal justice system? Fair question, but in terms of privacy issues I think it is a red herring – you can swap out different letters that would not have those repercussions but still use the same means.

What if instead of the "my eyes are on you" letter the police simply sent a PSA like post-card that talked about the blight of sex workers? Can police never send out letters? How about if police send out letters to people who have previous victimizations about ways to prevent future victimization? I have a feeling much of the initial negative reactions to the Dear John program are because of the false positive aspect and the "victimless" nature of the crime. The ethical collection and use of data is a bit more subtle though.

LPR data was initially intended to passively identify stolen cars, but it is pretty ripe for mission creep. One example is that the police could use LPR data to actively track a car’s location without a warrant. It is easy to think of both good and other bad examples of its use. For good examples, retrospectively identifying a car at the scene of a crime I think is reasonable, or to notify the police of a vehicle associated with a kidnapping.

For another example use of LPR data, what if the police did not send custom notifications, but used such LPR data to create a John list of vehicles, and then used that as information to profile the cars? If we think using LPR data to identify stolen cars is a legitimate use should we ignore the data we have for other uses? Does the potential abuse of the data outweigh the benefits – so LPR collection shouldn’t be allowed at all?

For equivalent practices, most police departments have chronic offender or gang lists that use criminal history, victimizations, where you have been stopped and who you have been stopped with to create similar databases. This is all from data the police routinely collect. Whether LPR data should be available for such analytic use can reasonably be questioned – police RMS data is often available in large swaths to the general public though.

Although you can question whether police should be allowed to collect LPR data, I am going to assume LPR data is not going to go away, and cameras definitely are not. So how do you regulate the use of such data within police departments? In New York, when you conduct an online criminal history check you have to submit a reason for doing the check. That is, a police officer or a crime analyst can’t do a check of their next door neighbor because they are curious – they are supposed to have a more relevant reason related to some criminal investigation. You could have a similar set up with LPR data that prevents actively monitoring a car except in particular circumstances and purges the data after a particular time frame. It would be up to the state though to enact legislation and monitor its use. There is currently some regulation of gang databases, such as sending notifications to individuals if they are on the list and when to take people off the list.

Similar questions can be extended beyond public cameras though to other domains, such as DNA collection and cell phone data. Cell phone data is regularly collected with warrants currently. DNA searching is going beyond the individual to familial searches (imagine getting a DUI, and then the police use your DNA to tell that a close family member committed a rape).

Going forward, to frame the discussion of police behavior in terms of privacy issues, I would ask two specific questions:

  • Should the police be allowed to collect this data?
  • Assuming the police have said data, what are reasonable uses of that data?

I think the first question, should the police be allowed to collect this data, should be intertwined with how well the program works and how cost-effective it is (or potentially could be, if the program has not been implemented yet). There are no bright lines, but there will always be a trade off between personal privacy and public intrusion. Higher personal intrusion would demand a higher level of potential benefits in terms of safety. Given that LPRs are passively collecting data I consider it an open question whether they meet a threshold of whether it is reasonable for the police to collect such data.

Some data police now collect, such as public video and DNA, I don’t see going away whether or not they meet a reasonable trade-off. In those cases I think it is better to ask what are reasonable uses of that data and how to prevent abuses of it. Basically any police technology can be given extreme examples where it saved a life or where a rogue agent used it in a nefarious way. Neither extreme case should be the only information individuals use to evaluate whether such data collection and use is ethical though.

Randomness in ranking officers

I was recently re-reading the article The management of violence by police patrol officers (Bayley & Garofalo, 1989) (noted as BG from here on). In this article BG had NYPD officers (in three precincts) each give a list of their top 3 officers in terms of minimizing violence. The idea was to have officers give self-assessments to the researcher, and then for the researcher to try to tease out differences between the good officers and a sample of other officers in police-citizen encounters.

BG’s results stated that the rankings were quite variable, that a single officer very rarely had over 8 votes, and that they chose the cut-off at 4 votes to categorize them as a good officer. Variability in the rankings does not strike me as odd, but these results are so variable I suspected they were totally random, and taking the top vote officers was simply chasing the noise in this example.

So what I did was make a quick simulation. BG stated that most of the shifts in each precinct had around 25 officers (and they tended to only rate officers they worked with). So I simulated a random process where 25 officers each randomly pick 3 of the 25 officers on the shift (the code below does not exclude an officer picking themselves), replicating the process 10,000 times (SPSS code at the end of the post). This is the exact same situation Wilkinson (2006) talks about in Revising the Pareto chart, and here is the graph he suggests. The bars represent the 1st and 99th percentiles of the simulation, and the dot represents the modal category. So in at least 98% of the simulations the top ranked officer has between 5 and 10 votes. This would suggest in these circumstances you would need more than 10 votes to be considered non-random.

The idea is that while getting 10 votes at random for any one person would be rare, we aren’t only looking at one person, we are looking at a bunch of people. It is an example of the extreme value fallacy.
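The gap between "one particular officer" and "anyone on the shift" can be made concrete with a quick binomial calculation. This is only a sketch under my own simplifying assumptions: each of 25 voters independently includes a given officer with probability 3/25, and I use the union bound (25 times the single-officer tail) as an upper limit on the chance that at least one officer reaches the threshold. The function name is mine.

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

n_voters = 25        # officers casting votes on the shift
n_officers = 25      # officers who can receive votes
p_pick = 3 / 25      # chance a given voter names a given officer

print("votes  P(one officer >= k)  bound on P(any officer >= k)")
for k in range(4, 12):
    single = binom_tail(n_voters, p_pick, k)
    any_bound = min(1.0, n_officers * single)
    print(f"{k:>5}  {single:>19.5f}  {any_bound:>28.5f}")
```

The second column is what you would look at if you picked an officer in advance; the third is closer to what matters when you pick the top vote getter after the fact.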

Here is the SPSS code to replicate the simulation.

***************************************************************************.
*This code simulates randomly ranking individuals.
SET SEED 10.
INPUT PROGRAM.
LOOP #n = 1 TO 1e4.
  LOOP #i = 1 TO 25.
    COMPUTE Run = #n.
    COMPUTE Off = #i.
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Sim.
*Now for every officer, choosing 3 out of 25 by random (without replacement).
SPSSINC TRANS RESULT = V1 TO V3
  /FORMULA "random.sample(range(1,26),3)".
FORMATS V1 TO V3 (F2.0).
*Creating a set of 25 dummies.
VECTOR OffD(25,F1.0).
COMPUTE OffD(V1) = 1.
COMPUTE OffD(V2) = 1.
COMPUTE OffD(V3) = 1.
RECODE OffD1 TO OffD25 (SYSMIS = 0).
*Aggregating and then reshaping.
DATASET DECLARE AggResults.
AGGREGATE OUTFILE='AggResults'
  /BREAK Run
  /OffD1 TO OffD25 = SUM(OffD1 TO OffD25).
DATASET ACTIVATE AggResults.
VARSTOCASES /MAKE OffVote FROM OffD1 TO OffD25 /INDEX OffNum.
*Now compute the ordering.
SORT CASES BY Run (A) OffVote (D).
COMPUTE Const = 1.
SPLIT FILE BY Run.
CREATE Ord = CSUM(Const).
SPLIT FILE OFF.
MATCH FILES FILE = * /DROP Const.
*Quantile graph (for entire simulation).
FORMATS Ord (F2.0) OffVote (F2.0).
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Ord PTILE(OffVote,99)[name="Ptile99"] 
                                    PTILE(OffVote,1)[name="Ptile01"] MODE(OffVote)[name="Mod"]
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Ord=col(source(s), name("Ord"), unit.category())
  DATA: Ptile01=col(source(s), name("Ptile01"))
  DATA: Ptile99=col(source(s), name("Ptile99"))
  DATA: Mod=col(source(s), name("Mod"))
  DATA: OffVote=col(source(s), name("OffVote"))
  DATA: Run=col(source(s), name("Run"), unit.category())
  GUIDE: axis(dim(1), label("Ranking"))
  GUIDE: axis(dim(2), label("Number of Votes"), delta(1))
  ELEMENT: interval(position(region.spread.range(Ord*(Ptile01+Ptile99))), color.interior(color.lightgrey))
  ELEMENT: point(position(Ord*Mod), color.interior(color.grey), size(size."8"), shape(shape.circle))
END GPL.
***************************************************************************.
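For readers without SPSS, here is a rough Python equivalent of the same simulation. It is a sketch rather than a port: the function names are mine, the nearest-rank percentile may not match the definition SPSS uses for PTILE, and Python's random number stream with seed 10 will not reproduce the SPSS draws. As in the SPSS formula, voters sample from all 25 officers, so a self-vote is possible.

```python
import math
import random
from collections import Counter

def simulate_rankings(n_officers=25, n_picks=3, n_runs=10_000, seed=10):
    """Return one descending list of vote counts per simulated shift."""
    rng = random.Random(seed)
    officers = range(1, n_officers + 1)
    runs = []
    for _run in range(n_runs):
        votes = Counter()
        for _voter in officers:
            # Sample without replacement from all officers (self included).
            votes.update(rng.sample(officers, n_picks))
        runs.append(sorted((votes[o] for o in officers), reverse=True))
    return runs

def nearest_rank(values, q):
    """Nearest-rank percentile; SPSS may use a different definition."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(q / 100 * len(ordered)) - 1)]

def summarize(runs):
    """For each rank, return (rank, 1st percentile, mode, 99th percentile)."""
    rows = []
    for rank in range(len(runs[0])):
        at_rank = [run[rank] for run in runs]
        mode = Counter(at_rank).most_common(1)[0][0]  # ties broken arbitrarily
        rows.append((rank + 1, nearest_rank(at_rank, 1), mode,
                     nearest_rank(at_rank, 99)))
    return rows

runs = simulate_rankings(n_runs=2000)  # fewer runs than the post, for speed
rows = summarize(runs)
for rank, p01, mode, p99 in rows[:5]:
    print(rank, p01, mode, p99)
```

The first printed row corresponds to the top ranked officer, which is the bar to compare against the 4 vote cut-off BG used.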

Big data problems for Criminal Justice

I am on the job market this year, and I have noticed a few academic jobs focused on big data (see this Penn State posting for one example). Because example data sets in criminal justice are not typical fodder for big data conversations, I figured I would talk a bit about my experiences and illustrate the types of skills needed to manipulate and analyze these big datasets.

As opposed to trying to further define the big data buzzword, I will simply talk about the actual size of data I have dealt with. Depending on the definition used, most large criminal justice datasets may be called medium sized data. That is, you can load it in a database or statistical program (particularly those that do not load everything into RAM, like SPSS and SAS) and calculate different summary statistics and fit simple models. We’re not talking about datasets that need custom big data solutions like Hadoop. The biggest single table I’ve personally worked with is a set of 25 million arrest histories (with around 150 variables). Using SPSS server to sort this dataset took less than a minute; using my local machine it took about 10 minutes. Nothing much to complain about there, and it is where the statistical programs that don’t load everything into memory shine.

To talk specifics, the police agency where I was an analyst (Troy, NY) serves a fairly small city with a population of around 50,000 people. They generated around 60,000 calls for service per year (this includes anytime someone calls 911, or police initiated interactions like a traffic stop). Every single one of these incidents generates a one to many relationship for multiple tables, and here is a sampling of those relationships: multiple free text descriptions of the event and follow up investigations, people involved in the incident, offences committed, property stolen or damaged, persons arrested, property recovered or confiscated, drug and weapon contraband, vehicles involved, etc. Over the time period of 2004 to 2013 the incident narratives themselves are around 1 gigabyte, and the number of unique individuals and institutions in the "names" table was around 100,000. None of these tables alone would be considered big data, but when taking multiple years and having to conduct multiple table merges it turns into complicated medium size data pretty quickly.

I’m sure I’m not alone here working with police departments. In the past month I’ve had conversations with two individuals about corrections datasets that result in millions of records. Criminal justice organizations have been collecting data for a long time, and given say 50,000 records per year it only takes 10 years to turn that into 500,000. When considering larger agencies (like statewide corrections or courts) the per-year count becomes even larger.

Most of the time summary statistics and fairly simple regression models are all researchers and analysts are interested in in criminal justice. The field is not heavily devoted to prediction, and certainly not to fitting complicated machine learning models. Many regression tasks can be estimated with data as large as 25 million records (given that the number of predictor variables tends to be small), and even if they could not, sampling (or reducing the data to unique observations and weighting) is an obvious option. So for these types of simple needs, just learning effective practices for manipulating datasets, such as SQL and best practices for conducting data manipulations in statistical packages, is most of the education one needs. But these are still definitely needs that are not met in any social science curricula that I am aware of. Trial by fire is my only experience.
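To illustrate the "unique observations and weighting" trick, here is a minimal sketch. The variable layout (prior arrests, rearrested) and the records are hypothetical; the point is only that a weighted statistic on the collapsed table matches the unweighted statistic on the full table, while the collapsed table can be far smaller when there are few distinct covariate patterns.

```python
from collections import Counter

def collapse(records):
    """Reduce records to unique rows plus a frequency weight for each."""
    return list(Counter(records).items())

def weighted_mean(collapsed, column):
    """Mean of one column, using the frequency weights."""
    total_weight = sum(weight for _row, weight in collapsed)
    return sum(row[column] * weight for row, weight in collapsed) / total_weight

# Hypothetical (prior_arrests, rearrested) records.
records = [(0, 0)] * 5 + [(1, 0)] * 3 + [(1, 1)] * 2 + [(3, 1)] * 4

collapsed = collapse(records)
raw_mean = sum(row[1] for row in records) / len(records)

print(len(records), "records collapse to", len(collapsed), "unique rows")
print(raw_mean, weighted_mean(collapsed, 1))
```

Most statistical packages accept such frequency weights directly in their regression routines, so the same idea carries over to model fitting.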

Two particular areas that turn little data into big data are spatial and network analysis, as one not only needs to consider the number of nodes but also the number of edges (or potential edges) in the system to calculate various measures. For example, in my dissertation I needed to conduct spatial lags of several variables (and this is needed in calculating measures such as Moran’s I). In matrix notation this typically involves calculating Wx, where W is an n by n spatial weights matrix. In my dissertation, n was 21,506, so not a large dataset, but W is then a 21,506 by 21,506 matrix. It can be held in memory, but good luck trying to calculate anything with it. Most of the spatial econometrics literature discusses how calculating W^-1 is problematic, let alone the simpler operation of Wx. So to do those calculations I needed to create custom code. I hope to be able to write a blog post on how it can be done at some point – but these blog posts aren’t earning me any brownie points toward getting a job (let alone getting tenure in the future).
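As a small illustration of why Wx does not require the full matrix, here is a sketch that stores only each unit's neighbor list. To be clear, this is not the custom code from my dissertation; it assumes binary contiguity weights that are row-standardized (so the lag is the average of the neighbors' values), and the function name and toy data are made up. A dense W with n = 21,506 has 21,506^2 = 462,508,036 cells, whereas a neighbor list only needs one entry per actual edge.

```python
def spatial_lag(neighbors, x):
    """Row-standardized spatial lag: element i is the mean of x over i's neighbors.

    neighbors: list where neighbors[i] holds the indices adjacent to unit i
    x:         list of values, one per unit
    """
    lag = []
    for nbrs in neighbors:
        if nbrs:
            lag.append(sum(x[j] for j in nbrs) / len(nbrs))
        else:
            lag.append(0.0)  # assumption: units with no neighbors get a lag of 0
    return lag

# Toy example: four units in a row, each adjacent to the units beside it.
neighbors = [[1], [0, 2], [1, 3], [2]]
x = [10.0, 20.0, 30.0, 40.0]
print(spatial_lag(neighbors, x))
```

The work here scales with the number of edges rather than with n squared, which is what makes the calculation feasible for sparse spatial weights.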

The other area that I believe needs to be developed in the social sciences related to medium data problems is custom visualization solutions. Data in social science typically has lots of noise to signal, and adding in 100,000 observations rarely makes things clearer. This is why I think visualization within the social sciences has potential to expand, as the majority of historical discussions are not extensible to our particular applications in the social sciences.

So I’m excited by academia recognizing that big data is a problem and takes custom solutions in the social sciences. An environment where I can be rewarded for taking on those big data tasks and partly focus on publishing software, as opposed to solely publish or perish, would help develop the field and have a more lasting impact on practical applications than journal articles. At least a place that acknowledges the need to develop curricula related to these data management tasks would be a good start. But I’m not sure I like the types of applications currently being pitched in the social sciences as big data problems, particularly the trivial applications of examining social networks like Facebook or Twitter, nor the emphasis on big data tools like Hadoop that I don’t think are applicable to the social scientist’s toolset. But I’m certainly biased to think that applications in criminal justice have more practical implications than a lot of contemporary social science research.

Online Crime Mapping for Troy PD

One of the big projects I have been working on since joining the Troy Police Department as a crime analyst last fall is producing timely geocoded data. I am happy to say that a fruit of this labor is the public crime map, via RAIDS Online, that has finally gone public (and can be viewed here). The credit for the online map mainly goes to BAIR Analytics and their free online mapping platform. I merely serve up the data for them to put on the map.

I’ve come to believe that more open data is the way of the future, and in particular an online crime map is a way to engage and enlighten the public to the realities of crime statistics. This does come with some potential negative externalities for the police department, such as complaints about inaccuracy, decreasing home prices, and misleading symbology and offset geocoding. I firmly believe though that providing this information empowers the public to be more engaged in matters of crime and safety within their communities.

I thank the Troy Police Department for supporting the project in spite of these potential negative consequences, and Chief Tedesco for his continual support of the project. I also thank Capt. Cooney for arranging for all of the media releases. Below are the current online news stories (will update with CW15 if they post a story).

Here I end with a list of reading materials I consider necessary for any other crime analyst pondering the decision whether to publish crime statistics online. And I end by again thanking Troy PD for allowing me to publish this data, and BAIR for providing the online service that makes it possible with a zero dollar budget.


Let me know if I should add any papers to the list! Privacy implications (such as this work by Michael Leitner and colleagues) might be worth a read as well for those interested. See my geomasking tag at CiteUlike for various other references.

Informational Asymmetries in my role as Crime Analyst

One aspect I’ve come to realize in my job as crime analyst, and really in any technical job I’ve had, is that I face large informational asymmetries between myself and my employers (and colleagues). What exactly do I mean? Well, I consider a prime example of informational asymmetry when I have a large body of knowledge about some particular topic or task I need to conduct, and the person asking for the task has relatively little.

I believe this is problematic in one major way with my job: people don’t know what is or is not reasonable to ask me to do, or similarly how long it takes me to conduct particular tasks. I believe most of the time this makes people hesitate to ask me particular questions or ask me to conduct particular analysis. The converse happens though not entirely infrequently: I get asked nonchalantly to do something that is a considerable investment.

I’m not sure how to best solve this situation (especially the not asking part) besides by developing relationships with colleagues and the boss, and through experience elucidating what I can (or can’t do). To a certain extent I can’t know what people want if they don’t ask me.

The situation in which someone asks me to do something that takes more of an investment is easier, in that I can directly tell the person that this request is either unreasonable or will take a long time. A good example of tasks that on the outside may look similar in scope, but are largely different, is descriptive vs. causal analysis.

Examples of the difference are “How many calls for service occurred at this particular apartment in the last year?” (descriptive), or “Is there more crime around 15 Main St. than we would normally expect?” (causal). The first is typically just a query of the database and a table or map, and this will typically satisfy the request. The other though is much more difficult: I have to dream up a reasonable comparison, else the information I provide may be potentially out of context.

The information I produce also depends on who is asking. If someone within the PD asks for descriptive statistics, that is usually all I provide. If someone from the public asks for descriptive statistics, I frequently (at least attempt to) provide more context for those statistics (i.e. some reasonable comparisons or historical trends that form the basis for causal analysis).

This is because I assume people within the PD have the necessary external context to evaluate the information, whereas people outside the PD don’t. If I just stated how many calls for service occurred on your street block, you may think your street is crime ridden, because you don’t have a good internal baseline to judge what is a reasonable number of calls for service. For such requests from the public I try to provide historical numbers over a long period (as people are often worried about newer trends) or comparisons to neighboring areas.

The informational asymmetry problem still persists though, and filters into other areas of work, in particular how I am evaluated within the PD itself.