New preprint and Monitoring Time Between Events

Will be a long post today, have some updates on a preprint, quotes on Flock cameras, an upcoming webinar, plus some R analysis examples of monitoring time between rare crime events.

Pre-print on JTC and examining the Buffer Zone

For a few updates on other projects, I have a pre-print out with Kim Rossmo, The Journey-to-Crime Buffer Zone: Measurement Issues and Methodological Challenges.

Two parts to this paper. Part 1, to test whether a journey to crime (JTC) distribution conforms to a buffer zone (an area with lower, but non-zero, probability of offending nearby their home for predatory crimes against strangers), it only makes sense to look at an individual offenders JTC. This is because mixtures of multiple offenders can each individually have a buffer, but in the aggregate do not (in particular if offenders have varying travel distances). This is the same point in Van Koppen & De Keijser (1997), and the fact that offenders have different travel distance distributions is pretty well established now (Andresen et al., 2014, Drawve et al., 2015; Townsley & Sidebottom, 2010).

The second part is given we need to examine individual offenders, I worked out estimates of how many observations you need to effectively measure whether a buffer zone exists. I estimate you need around 50 observations when using a gamma distribution to measure the existence of a buffer vs monotonically decreasing. Above graphs shows a kernel density estimator that takes into account to not smear the probability below 0 distance, using a transform trick to calculate the KDE on the log scale and then back transform. Both case studies we look at suggest a more peaked distribution for the buffer than gamma probably makes more sense for those samples, but pretty strong evidence the buffer exists. The code to replicate the methods and papers findings is on Github.

If you are a department and you have a good case study of a prolific offender get in touch, would be happy to add more case studies to the paper. Part of the difficulty is having high fidelity measures, offenders tend to move a lot (Wheeler, 2012), and so it is typically necessary to have an analyst really make sure the home (or nearest anchor node) locations are all correct. In addition to most prolific offenders don’t have that many observations.

Flock Story

For a second update, I had a minor quote in Tyler Duke’s story on Flock Cameras in North Carolina, Camera by camera, North Carolina police build growing network to track vehicles. I think license plate readers are good investments for PDs (see Ozer 2016 for a good example, I actually like the mobile ones in vehicles more than fixed ones, see Wheeler & Phillips, 2018 for a case study in Buffalo using them at low friction road blocks). But I do think more regulation to prevent people doing indiscriminate searches is in order (similar to how doing background checks in most states have state rules).

I get more annoyed by Flock’s advertising that suggests they solve 10% of crime nationwide, which is absurd. It is very poorly done design (Snow & Charpentier, 2024) that does a regression of clearance rates regressed on cameras per officer, and suggests the cameras increase clearance rates by 10%. There is multiple things wrong with this – interpreting regression coefficients incorrectly (an increase in 1 in camera per-officer is quite a few cameras and does not in translate to they increase clearances 10% in toto), confounding in the design (smaller agencies with higher clearance by default will have more cameras per officer), not taking into account weights in the modeling or interpretation (e.g. a 20% increase in a small department and a 0% increase in a large department should not average to an overall 10% increase). Probably the worse part about this though is extrapolating from they have cameras in a few hundred departments to saying they help solve 10% of crime nationwide.

It is extra silly because it does not even matter – it makes close to no material difference to the quality of Flock’s products (which look to me high quality, they certainly don’t need to increase clearance rates by 10% to be worth investing in ALPRs for an agency, ALPRs are so cheap if a single camera helps with say 10-20 arrests they are worth it). If anyone from Flock is listening and wants to fund a real high quality study just let me know and ask, but this work they have put out is ridiculous.

Tyle Duke’s (and the Newsobserver in general) I think do a really good job on various data stories. So highly recommend checking out that and their other work.

Webinar on Monitoring Temporal Crime Trends

I am doing a webinar for the Carolina Crime Analysis Association on Monitoring Temporal Crime Trends at the end of the month on May 31st.

Free for CCAA members and $10 for non-members. I will be going over work I have written in various places before, such as the Poisson Z-score for CompStat reports (Wheeler, 2016). I did a recent blog post on the Crime De-Coder site on using the Poisson distribution to flag outliers for rare crime events, e.g. if you have 0.8 robberies per month, is a month with 3 robberies weird?

For other regional crime analysis groups, if you have requests like that feel free. I am thinking I want to spend more time with regional groups than worrying about the bigger IACA group going forward.

And this Poisson example segways into the final section of the blog post.

Monitoring Time in Between Events

So the above of using the Poisson distribution to say is 3 robberies in a month weird, you have to think about the nature of how a crime analyst will use and act on that information. So in that scenario, that may be an analyst has a monthly CompStat report, and it is useful to say ‘yeah 3 is high, but is consistent with chance variation that is not uncommon’. In this scenario though, if you have counts that are high, it is not the best (although better than nothing) to wait until the end of the month CompStat report.

Another common case though is the analyst is regularly reading reports, and they come in and read a new robbery, and right then and there say “I feel like there are more robberies than usual”. How would they tell then if there are more than you would expect? It does not make sense to wait for the end of the month (you technically can back-calculate in the prior month and use a scan statistic, but I think what I will suggest below is a more diagnostic approach).

Here I will outline an approach by examining the time in between events, which is motivated by a comment Rob Fornango mentioned on LinkedIn.

So there is a duality between the Poisson distribution and the exponential distribution – if you have a mean of 0.8 events per month, the inter-arrival times are exponentially distributed with a mean of 1/0.8. The typical motivation for a Poisson distribution is the inter-arrival times are independent, so you can technically just work with the inter-arrival times directly.

Here is a quick simulation in R to show that you can simulate inter-arrival times, and then turn them into counts per unit time. The counts per unit time will then be Poisson distributed. Note that R you give the lambda term directly in the R parameterization, whereas others (like in scipy) you specify 1/lambda. I know it is not documented well, but I leave as an exercise to the reader who cares enough to figure out what I am doing here and why.

set.seed(10)
pmean <- 0.8
n <- 50000

re <- rexp(n,pmean) # simulating exponential
rec <- cumsum(re)   # translating to times
frec <- floor(rec)  # will aggregate to counts per 1 unit

# factor is to include units with 0 counts
recV <- 0:max(frec)
frec <- factor(frec,levels=recV)
re_tab <- as.data.frame(table(frec))
re_tab$frec <- recV
re_tab$Freq <- factor(re_tab$Freq,levels=0:max(re_tab$Freq))

# Two tables are to aggregate to units, and then get a count
# of counts per unit
count_tab <- as.data.frame(table(re_tab$Freq))
names(count_tab) <- c("Count","ExpSim")
count_tab$Count <- as.numeric(levels(count_tab$Count))
count_tab$PoisExp <- round(dpois(count_tab$Count,pmean)*length(recV))

And this prints out a table that shows very close correspondence between the two.

> print(count_tab)
  Count ExpSim PoisExp
      0  28538   28411
      1  22959   22729
      2   8813    9092
      3   2356    2424
      4    482     485
      5     75      78
      6      5      10
      7      2       1

Ok, with that established, how do we take into account the time in between events, and use that to flag if recent events are occurring too close to each other? Going with my suggestion of using 1/100 or 1/1000 probability to flag an outlier, for a single “time between two events”, you can look at the quantiles of the exponential distribution. So for the 1/100 threshold:

qexp(0.01,0.8)

Gives 0.01256292. Note this is in months, so if we say a month is 30 days, we could then say it is 0.01256292*30, which is 0.38 days, or just over 9 hours. So this is saying basically if you had two robberies on the same shift, given a mean of 0.8 per month in your jurisdiction, that may be worth looking into if they are the same offender. Not terribly helpful as that would be something most analysts would spot without the help of analytics.

But say you had an event with a rate of 0.1 per month (so on average just over one per year). Two events in three days then would be cause for alarm, qexp(0.01,0.1)*30 is just over 3 days.

So that is examining two recent events, you could extend this to several recent events nearby in time (what I think is likely to be more useful for crime analysts). So say you had a crime on Monday, Wednesday, and then Saturday. So two times in between of 2 days and 3 days. I would say the probability of this occurring is:

prod(pexp(c(2,3),0.8/30))

Which R gives as 0.003993034, so around 4 in 1,000. This is the probability of the 2 days multiplied by the probability of 3 days. We can make a graph of the string of three events (so two times in-between) that meet our less than 0.01 chance.

library(ggplot2)

theme_cdc <- function(){
  theme_bw() %+replace% theme(
     text = element_text(size = 16),
     panel.grid.major= element_line(linetype = "longdash"),
     panel.grid.minor= element_blank()
) }

days <- 1:20
df <- expand.grid(x=days,y=days)
df$p <- pexp(df$x,pmean/30)*pexp(df$y,pmean/30)
df <- df[df$p < 0.01,]
p <- ggplot(df,aes(x=x,y=y)) + 
     geom_point(pch=21,fill='grey',size=7.5) +
     labs(x=NULL,y=NULL,title='Nearby Days < 0.01') +
     scale_y_continuous(breaks=days,limits=c(1,max(days))) +
     scale_x_continuous(breaks=days,limits=c(1,max(days))) +
     theme_cdc()

p

So this gives a chart that meets the criteria for days between in 3 nearby events for the 0.8 per month scenario. So if you have times between of 3 and 4 it meets this threshold, as well as 2 and 8, etc. This data for 1+ days it pretty much never gets to the 1/1000 threshold.

You can technically extend this to multiple crimes. We are in the cusum chart territory then. The idea behind cusum charts is if you have an expected value of say 10, in a typical process control chart if you had a bunch of values 12,14,11,13,12, they may not individually alarm. But you can see that the process is consistently above the expected value, which for random data it should fluctuate sometimes below 10 and sometimes above 10. The consistent above the expected value is itself a signal that will alarm in the cusum approach.

I debate on doing more cusum type process control charts with crime data, but they are abit of work to reset (they will always alarm eventually, and then you reset the cumulative statistics and start the process over again) – but in this scenario the reset is not too difficult.

So the approach would be something like:

probs <- pexp(days_between,mean_per_unit)
snorm <- qnorm(probs)
cumvals <- cumsum(snorm)

The cusum approach works like this here. So only start counting if the days between are less than qexp(0.5,pmean), which here is about 26 days. If you have any time in between more than 26 days, you reset the cusum chart. But if you have several events with times less than 26 days, you do the above calculations, and if the cumulative sum gets lower than -4 (so multiple events nearby less than 26 days apart), you alarm. So for example, for our mean of 0.8, if you had a string of 7,6,12,9,10 days in between for crimes:

pmean <- 0.8
days_between <- c(7,6,12,9,10)
probs <- pexp(days_between,pmean/30)
snorm <- qnorm(probs)
cumvals <- cumsum(snorm)

That would alarm on the final crime, even though those are 6 crimes spread apart 44 days.

This is because snorm will have a standard normal distribution, and so the typical alarm rate for cusum charts with mean zero and standard deviation of 1 is +/- 4. You can technically use it for events too far apart as well here, although I don’t know of situations where people would care too much about that (either in crime or other monitoring situations).

This is all more complicated than 5+ in a month example, partly why I haven’t used cusum charts (or days in between) in other examples. But hopefully someone finds that useful to monitor rare events, and not wait for their end of month stats to alert them!

References

Andresen, M. A., Frank, R., & Felson, M. (2014). Age and the distance to crime. Criminology & Criminal Justice, 14, 314-333.
Drawve, G., Walker, J. T., & Felson, M. (2015). Juvenile offenders: An examination of distance-to-crime and crime clusters. Cartography and Geographic Information Science, 42, 122-133.
Ozer, M. (2016). Automatic licence plate reader (ALPR) technology: Is ALPR a smart choice in policing?. The Police Journal, 89(2), 117-132.
Rossmo, D. K., & Wheeler, A. (2024). The Journey-to-Crime Buffer Zone: Measurement Issues and Methodological Challenges. CrimRxiv
Snow, A., & Charpentier, C. (2024) Flock Safety Technologies in Law Enforcement: An Initial Evaluation of Effectiveness in Aiding Police in Real-World Crime Clearance
Townsley, M., & Sidebottom, A. (2010). All offenders are equal, but some are more equal than others: Variation in journeys to crime between offenders. Criminology, 48, 897-917.
Wheeler, A. (2012). The moving home effect: A quasi experiment assessing effect of home location on the offence location. Journal of Quantitative Criminology, 28, 587-606.
Wheeler, A. P. (2016). Tables and graphs for monitoring temporal crime trends: Translating theory into practical crime analysis advice. International Journal of Police Science & Management, 18(3), 159-172.
Wheeler, A. P., & Phillips, S. W. (2018). A quasi-experimental evaluation using roadblocks and automatic license plate readers to reduce crime in Buffalo, NY. Security Journal, 31, 190-207.
Van Koppen, P. J., & De Keijser, J. W. (1997). Desisting distance decay: On the aggregation of individual crime trips. Criminology, 35(3), 505-515.

Leave a comment

by apwheele on May 12, 2024 • Permalink

Posted in Crime Analysis, Papers, R

Tagged outlier, preprint, temporal

Posted by apwheele on May 12, 2024

https://andrewpwheeler.com/2024/05/12/new-preprint-and-monitoring-time-between-events/

Power for analyzing Likert items

First for some other updates of interest to folks on the blog. On CRIME De-Coder a blog post, You should be geocoding crime data locally. I give python code to create a local geocoding engine using the arcpy library.

This is a bit more techy, so would typically post this on this blog instead of the CRIME De-Coder one. But, currently the web is sorely lacking in good advice for local geocoding solutions. Vast majority of sites discuss online geocoding APIs (e.g. google or the census geocoder), which I guess are common for web-apps, but they do not make sense for crime analysis. For the few webpages that are actually relevant to describe local solutions (that do not involve calling an online web API), all the exmples use PostGIS that I am aware of. PostGIS is both very difficult to setup and has worse results compared to ESRI. So I know ESRI is a paid for service, but they have reasonable academic and small business pricing (and most police departments already have access), so to me this is a reasonable use case. If you need to geocode 100k cases, the license fee for ESRI is worth it at that point relative to using the web engines.

Definitely do not spend thousands of dollars if you need to batch geocode a few million records. That is something that is worth getting in touch with me about. And so hopefully that gets picked up by search engines and drives a bit more traffic to my consulting website.

A second example I posted some python code to help construct network experiments. So the idea here is you want to spread out the treated nodes so you have a specific allocation of treated, connected to treated (what I call spillover here), and those not connected to treated (the leftover control group). This python code constructs linear programs to accomplish certain treated/not-touched proportions. So this graph shows if you choose to treat 1 person, but have constraints on 1,2,3 leftover.

And then you can apply this to bigger networks, here the network is 311 nodes, and 90 are treated and I want a total of 150 not treated.

Idea derivative from some work Bruce Desmarais discussed on Twitter, but also have thought about this in some discussion with Barak Ariel in focused deterrence style interventions. So hopefully that comes in handy.

My linear programming formulation is not as svelte as the optimal treatment assignment with spillovers formulation, it is 3*N + 2*E decision variables and 5*N + 2*E constraints (where N is the number of nodes and E is the number of un-directed edges). I have a feeling my formulation is redundant, so if I write my constraints smarter can be more like 2N decision variables and 2N + E constraints.

But for my examples I show it solves quite fast as is (and maybe solvers get rid of the cruft in pre-solve), so not worried about that at the moment. Don’t know the typical size networks people use, but I suspect it will work just fine and dandy on typical machines for networks even as large as 10k. (Generally if I keep the model to under 100k decision variables it is only a few minutes to solve the types of problems I show on this blog.)

Power with Likert items

The other day for a grant application we needed to conduct power analysis. Our design was t-test of mean differences for a simple treated/control group randomized experiment with the outcome being a Likert score survey response. (I know, most of the time people create latent scores with Likert items, analyzing the individual items is fine IMO and simpler to specify for a pre-registration analysis.) I am sure others have needed to do similar things, but I could not find code online to help out with this. So I scripted up a simulation in R to do this, and figured sharing would be useful.

So the rub with Likert data, and why you can’t use typical power calculations, is that they have ceiling effects. If most people answer on average 4.5 out of the your scale up to 5, it is difficult to go much higher. Here I simulate data that has that skew (so ceiling effects come into play), and then go through the motions of doing the t-test. So first for some setup, I have a function that rounds and clips data to limited sets of integers, plus some plotting functions.

# power analysis of Likert data
library(ggplot2)

# custom theme
theme_cdc <- function(){
  theme_bw() %+replace% theme(
     text = element_text(size = 16),
     panel.grid.major= element_line(linetype = "longdash"),
     panel.grid.minor= element_blank()
) }

set.seed(10) # setting the random seed

# rounding/clipping data to integers 
clipint <- function(x,min=1,max=5){
    rx <- round(x)
    rx <- ifelse(rx < min,min,rx)
    rx <- ifelse(rx > max,max,rx)
    return(rx)
}

This following function generates the p-values and standard errors, what I will use later in my simulation. Here I use a t-test of mean differences, but it would be fairly easy to say swap out with an ordinal logistic regression if you prefer that. Probably the bigger deal is the simulation generates data using a normal distribution, and then post rounds/clip the data. There is probably a smarter way to generate the data according to the logistic model and ordinal intercepts (Frank Harrell discusses such things a bit on his blog). But this at least takes into account that the data will be skewed, even in the control group, to have more positive outcomes and thus take the ceiling into account.

# this uses OLS to do t-test of mean differences
# generates normal data, but then rounds/clips
sim_ols <- function(n,eff=0.5,int=4,sd=1){
    df <- data.frame(1:n)
    df$treat <- sample(c(0,1),n,replace=TRUE)
    df$latent <- int + eff*df$treat + rnorm(n,sd=sd)
    df$likert <- clipint(df$latent)
    m1 <- lm(likert ~ treat,data=df)
    cd <- coef(summary(m1))
    pval <- cd[2,4]
    se <- cd[2,2]
    return(c(pval,se))
}

Now we can move onto the simulations, this evaluates sample sizes from 100 to 2000 (in increments of 50), effect sizes of 0.1, 0.3, and 0.5, and repeats the simulations 10k times. I then see the power (how often the two-tailed p-value is less than 0.05), as well as the standard error (precision) of the estimates. Effect sizes are in terms of the original Likert scale values, what I take to be much easier to reason about. (I have seen power analyses here use Cohen’s D, which you really can’t get a very large Cohen’s D value due to ceiling effects with this data.)

# running power estimates over different
# sample sizes and effect sizes
samp_sizes <- seq(100,2000,50)
eff_sizes <- c(0.1,0.3,0.5)
rep_size <- 10000
df <- expand.grid(samps_sizes=samp_sizes,eff_sizes=eff_sizes,pow=c(NA),se=c(NA))
for (i in 1:nrow(df)){
    ss <- df[i,1]
    es <- df[i,2]
    stats <- replicate(rep_size,sim_ols(n=ss,eff=es))
    smean <- rowMeans(stats)
    df[i,3] <- mean(stats[1,] < 0.05) # alpha 0.05
    df[i,4] <- smean[2]
}

df$eff_sizes <- as.factor(df$eff_sizes)

The graph of the power shows what you would expect, so with a few hundred samples you can determine an effect size of 0.5, but with a smaller effect size (on the order of 0.1) you will need more than 2k samples.

# Graph of power
powg <- ggplot(data=df,aes(x=samps_sizes,y=pow)) + 
        geom_line(aes(color=eff_sizes)) + 
        geom_point(pch=21,color='white',size=2,aes(fill=eff_sizes)) +
        labs(x='Sample Sizes',y=NULL,title='Power Estimates') +
        scale_y_continuous(breaks=seq(0,1,0.1)) + 
        scale_x_continuous(breaks=seq(100,2000,250)) + 
        scale_color_discrete(name="Effect Sizes") +
        scale_fill_discrete(name="Effect Sizes") + 
        theme_cdc()

Unfortunately, in reality with most survey measures of police data, e.g. rate your officer 1 to 5, a 0.5 effect is a really large increase. In my mapping attitudes paper, some demographics shift global attitudes at max by 0.2, and I doubt most interventions will be that dramatic. So I like plotting the precision of the estimator, which the effect size doesn’t really make a dent here (it could with more severe ceiling effects).

# Graph of Standard Errors (for Precision)
precg <- ggplot(data=df,aes(x=samps_sizes,y=se,color=eff_sizes)) + 
         geom_line(aes(color=eff_sizes)) + 
         geom_point(pch=21,color='white',size=2,aes(fill=eff_sizes)) +
         labs(x='Sample Sizes',y=NULL,title='Precision Estimates') +
         scale_x_continuous(breaks=seq(100,2000,250)) + 
         scale_color_discrete(name="Effect Sizes") +
         scale_fill_discrete(name="Effect Sizes") + 
         theme_cdc()

With field experiments when considering post police contacts (general attitude surveys you have more wiggle room, but still they cost money to survey) you really can’t control the sample size. You are at the whims of whatever events happen in the police departments daily duties. So the best you can do is make approximate plans for “how long am I going to collect measures before I analyze the data”, and “how reasonably precise will my estimates be”.

This particular grant I make arguments we care more about a non-inferiority type test (e.g. can be sure attitudes are not worse, within a particular error level, given our treatment is more cost-effective than business as usual). But if we did an intervention specifically intended to improve attitudes, you may be talking more like 5,000+ contacts to detect an effect of 0.1 (given likely sample non-response), which is still a big effect.

You would gain power by doing a scale (e.g. summing together multiple items or conducting a factor analysis), assuming the intervention effects the underlying latent trait, and piecemeal individual items. But that will have to be for another day simulating data to do that end-to-end analysis.

Despite not being an academic anymore, I have in the hopper more grant collaborations than I did when I was a professor. The NIJ season is winding down, so probably too late to collaborate on any more of those. But if you have other ideas and need quant help with your projects, always feel free to reach out. I enjoy doing these side projects with academics when reasonable funding is available.

Leave a comment

by apwheele on May 2, 2024 • Permalink

Posted in Crime Analysis, Criminal Justice, Data Visualization, Python, R

Tagged power-analysis, survey

Posted by apwheele on May 2, 2024

https://andrewpwheeler.com/2024/05/02/power-for-analyzing-likert-items/

Some musings on plagiarism

There have been several recent high profile cases of plagiarism in the news recently. Chris Rufo, a conservative pundit, has identified plagiarism in Claudine Gay and Christina Cross’s work. In a bit of tit-for-tat, Neri Oxman was then profiled for similar sins (as her husband was one of the most vocal critics of Gay).

Much of the discussion around this seems to be more focused on who is doing the complaining than the actual content of the complaints. It can simultaneously be true that Chris Rufo has ulterior motives to malign those particular scholars (ditto for those criticizing Neri Oxman), but the complaints are themselves legitimate.

What is plagiarism? I would typically quote a specific definition here, but I actually do not like the version in the Merriam Webster (or American Heritage) dictionary. I assert plagiarism is the direct copying of material without proper attribution to its original source. The concept of “idea” plagiarism seems to muddy up the waters – I am ignoring that here. That is, even if you don’t “idea” plagiarize, you can do “direct copying of material without proper attribution”.

It may seem weird that you can do one without the other, but many of the examples given (whether by Rufo or those who pointed out Neri Oxman’s) are insipid passages that are mostly immaterial to the main thesis of the paper, or they have some form of attribution but it is not proper. So it goes, the way academics write you can cut out whole cloth large portions of writing and it won’t make a difference to the story line.

These passages however are clearly direct copying of material without proper attribution. I think Cross’s is a great example, here is one (of several Rufo points out):

Words are so varied, if I gave you a table of descriptions, something like:

survey years: 1998, 2000, ...., 2012
missing data: took prior years survey values

And asked 100 scholars to put that in a plain text description in a paragraph, all 100 would have some overlap in wording, but not near this extent. For numbers, these paragraphs have around 30 words, say your word vocabulary relevant to describe that passage is 100 words, the overlap would be under 10% of the material. Here is a simple simulation to illustrate.

And this does not consider the word order (and the fact that the corpus is much larger than 100 words). Which will both make the probability of this severe of overlap much smaller.

Unlike what ASA says, it does not matter that what was copied is a straightforward description of a dataset, it clearly still is plagiarism in terms of “direct copying without proper attribution”. The Neri Oxman examples are similar, they are direct copying, even if they have an in text citation at the end, that is not proper attribution. If these cases went in front of ethics boards at universities that students are subject to, they would all be found guilty of plagiarism. The content of what was copied and its importance to the work does not matter at all.

So the defense of the clearly indefensible based on ideological grounds among academics I find quite disappointing. But, as a criminologist I am curious as to its prevalence – if you took a sample of dissertations or peer reviewed articles, how many would have plagiarized passages? Would it be 1%, or 10%, or 50%, I genuinely do not know (it clearly won’t be 0% though!).

I would do this to my own articles if I could easily (I don’t think I did, but it is possible I self-plagiarized portions of my dissertation in later peer reviewed articles). So let me know if you have Turnitin (or are aware of another service), that can upload PDFs to check this.

I’d note here that some of the defense of these scholars is the “idea” part of plagiarism, which is a shifting sands definition of saying it is necessary to steal some fundamental idea for it to be plagiarism. Idea plagiarism isn’t really a thing, at least not a thing anymore than vague norms among academics (or even journalists). Scooping ideas is poor form, but that is it.

I am not aware of my words being plagiarized, I am aware of several examples of scholars clearly taking work from my code repositories or blog posts and not citing them. (As a note to scholars, it is fine to cite my blog posts, I maybe have a dozen or so citations to my blog at this point.) But those would not typically be considered plagiarism. If someone copies one of my functions and applies it to their own data, it is not plagiarism as typically understood. If I noted that and sent those examples to COPE and asked for the articles to be retracted, I am pretty sure COPE would say that is not plagiarism as typically treated.

Honestly it does not matter though. I find it obnoxious to not be cited, but it is a minor thing, and clearly does not impact the quality of the work. I basically expect my work on the blog to be MIT licensed (subject to change) – it is mostly a waste of time for me to police how it is used. Should Cross be disciplined for her plagiarism? Probably not – if it was an article I would suggest a correction would be sufficient.

I can understand students may be upset that they are held to higher standards than their professors, but I am not sure if that means students should be given more slack or if professors should be given less. These instances of plagiarism by Gay/Cross/Oxman I take more as laziness than anything, but they do not have much to do with whether they are fit for their jobs.

AI writing

Pretty soon, even my plagiarism definition is not going to work anymore, as generative chatbots are essentially stochastic parrots – everything they do is paraphrasing plagiarism, but in a form that is hard to see direct copying (that tools like Turnitin will identify). So I am starting to get links to the blog from Perplexity and Iask. So people may cite ChatGPT or whatever service generated the text, but that service copied from someone else, and no one is the wiser.

These specific services have paraphrasing citations, e.g. you ask a question, it gives the response + 3 citations (these are called RAG applications in LLM speak). So you may think they are OK in terms of above, they give paraphrasing type at the end of paragraph citations. But I have noticed they frequently spit out near verbatim copies of my work for a few blog posts I get traffic for. The example I have linked to was this post on fitting a beta-binomial model in python. So I intentionally wrote that post because the first google results were bad/wrong and ditto for the top stackoverflow questions. So I am actually happy my blog got picked up by google quite fast and made it to the top of the list.

These services are stochastic, so subject to change, but iask is currently a direct copy of my work and does not give me as a list of its citations (although I come up high in their search rankings, so not clear to me why I am not credited). Even if it did though it would be the same problem as Oxman, it would not be a proper quoted citation.

And the Cross faux pas edit a few words will not only be common but regular with these tools.

This is pretty much intended though from my perspective with these tools – I saw a wrong on the internet and wrote a post to make it so others do not make the same mistakes. Does it matter if the machine does not properly attribute me? I just want you to do your stats right, I don’t get a commission if you properly cite me. (I rather you just personally hire me to help than get a citation!)

I think these services may make citation policing over plagiarism impossible though. No major insights here to prevent it – I say now it is mostly immaterial to the quality of the work, but I do think saying you can carte blanche plagiarize is probably not good.

Leave a comment

by apwheele on April 20, 2024 • Permalink

Posted in scholarly, writing

Tagged plagiarism

Posted by apwheele on April 20, 2024

https://andrewpwheeler.com/2024/04/20/some-musings-on-plagiarism/

Grabbing the NHAMCS emergency room data in python

So Katelyn Jetelina on her blog, The Local Epidemiologist, had a blog post (with Heidi Moseson) on how several papers examining mifepristone related to emergency room (ER) visits were retracted. (Highly recommend Katelyn’s blog, I really enjoy the mix of virology and empirical data discussions you can’t get from other outlets.)

This reminded me I had on the todo list examining the CDC’s NHAMCS (National Hospital Ambulatory Medical Care Survey) data. This is a sample of ER visit data collated by the CDC. I previously putzed with this data to illustrate predictive models for wait times, and I was interested in examining gun violence survival rates in this dataset.

I had the idea with checking out gun violence in this data after seeing Jeff Brantingham’s paper showing gun shot survival rates in California have been decreasing, and ditto for Chicago via work by Jen’s Ludwig and Jacob Miller. It is not uber clear though if this is a national pattern, Jeff Asher does not think so for example. So I figured the NHAMCS would be a good way to check national survival rates, and maybe see if any metro areas were diverging over time.

Long story short, the NHAMCS data is waaay too small of sample to look at rare outcomes like gun violence. (So probably replicating the bad studies Katelyn mentions in her blog are not worth the effort, they will be similarly rare). But the code is concise enough to share in a quick blog post for others if interested. Looking at the data the other day, I realized you could download SPSS/SAS/Stata files instead of the fixed width from the CDC website. This is easier than my prior post, as you can read those different files into python directly without having to code all of the variable fields from the fixed width file.

So for some upfront, the main library you need is pandas (as well as pyreadstat installed). The rest is just stuff that comes with pythons standard library. The NHAMCS files are zipped SPSS files, so a bit more painful to download but not that much of an issue. (Unfortunately you cannot just read in memory, like I did with Excel/csv here, I have to save the file to disk and then read it back.)

import pandas as pd
import zipfile
from io import BytesIO
import requests
from os import path, remove

# This downloads zip file for SPSS
def get_spss(url,save_loc='.',convert_cat=False):
    ext = url[-3:]
    res = requests.get(url)
    if ext == 'zip':
        zf = zipfile.ZipFile(BytesIO(res.content))
        spssf = zf.filelist[0].filename
        zz = zf.open(spssf)
        zs = zz.read()
    else:
        zs = BytesIO(res.content)
        spssf = path.basename(url)
    sl = path.join('.',spssf)
    with open(sl, "wb") as sav:
        sav.write(zs)
    df = pd.read_spss(sl,convert_categoricals=convert_cat)
    remove(sl)
    return df

Now that we have our set up, we can just read in each year. Note 2005!

# creating urls
base_url = 'https://ftp.cdc.gov/pub/health_statistics/nchs/dataset_documentation/NHAMCS/spss/'
files = ['ed02-spss.zip',
         'ed03-spss.zip',
         'ed04-spss.zip',
         'ed05-sps.zip',
         'ed06-spss.zip',
         'ed07-spss.zip',
         'ed08-spss.zip',
         'ed09-spss.zip',
         'ed2010-spss.zip',
         'ed2011-spss.zip',
         'ed2012-spss.zip',
         'ed2013-spss.zip',
         'ed2014-spss.zip',
         'ed2015-spss.zip',
         'ed2016-spss.zip',
         'ed2017-spss.zip',
         'ed2018-spss.zip',
         'ed2019-spss.zip',
         'ed2019-spss.zip',
         'ed2020-spss.zip',
         'ed2021-spss.zip']
urls = [base_url + f for f in files]

def get_data():
    res_data = []
    for u in urls:
        res_data.append(get_spss(u))
    for r in res_data:
        r.columns = [v.upper() for v in list(r)]
    vars = []
    for i,d in enumerate(res_data):
        year = i + 2001
        vars += list(d)
    vars = list(set(vars))
    vars.sort()
    vars = pd.DataFrame(vars,columns=['V'])
    for i,d in enumerate(res_data):
        year = i + 2001
        uc = [v.upper() for v in list(d)]
        vars[str(year)] = 1*vars['V'].isin(uc)
    return res_data, vars

rd, va = get_data()
all_data = pd.concat(rd,axis=0,ignore_index=True)

Note that the same links with the zipped up sav files also have .sps files, so you can see how the numeric variables are encoded. Or pass in the argument convert_cat=True to the get_spss function and it will turn the data into strings based on the labels.

You can check out which variables are available for which years via the va dataframe. They are quite consistent though. The bigger pain is that for older years, we have ICD9 codes, and for more recent years we have ICD10. So it takes a bit of work to normalize between the two (for ICD10, just looking at the first 3 is ok, for ICD9 you need to look at all 5 though). It is similar to NIBRS crime data, a single event can have different codes associated with it, so you need to look across all of them to identify whether any of the codes are associated with gun assaults.

# Assaulting Gun Violence for ICD9/ICD10
# ICD9, https://www.aapc.com/codes/icd9-codes-range/165/
# ICD9, https://www.cdc.gov/nchs/injury/ice/amsterdam1998/amsterdam1998_guncodes.htm
# ICD10, https://www.icd10data.com/ICD10CM/Codes/V00-Y99/X92-Y09
gv = {'Handgun': ['X93','9650'],
      'Longgun': ['X94','9651','9652','9653'],
      'Othergun': ['X95','9654']}

any_gtype = gv['Handgun'] + gv['Longgun'] + gv['Othergun']
gv['Anygun'] = any_gtype

fields = ['CAUSE1','CAUSE2','CAUSE3']

all_data['Handgun'] = 0
all_data['Longgun'] = 0
all_data['Othergun'] = 0
all_data['Anygun'] = 0


for f in fields:
    for gt, gl in gv.items():
        all_data[gt] = all_data[gt] + 1*all_data[f].isin(gl) + 1*all_data[f].str[:3].isin(gl)

for gt in gv.keys():
    all_data[gt] = all_data[gt].clip(0,1)

There are ranging between 10k and 40k rows in each year, but overall there are very few observations of assaultive gun violence. So even with over 500k rows across the 19 years, there are fewer than 200 incidents of people going to the ER because of a gun assault.

# Not very many, only a handful each
all_data[gv.keys()].sum(axis=0)

# This produces
# Handgun      20
# Longgun      11
# Othergun    139
# Anygun      170

These are far too few in number to estimate the survival rate changes over time. So the Brantingham or Ludwig analysis that looks at larger register data of healthcare claims, or folks looking at reported crime incident data, is likely to be much more reasonable to estimate those trends. If you do a groupby per year this becomes even more stark:

# Per year it is quite tiny
all_data.groupby('YEAR')[list(gv.keys())].sum()

#         Handgun  Longgun  Othergun  Anygun
# YEAR
# 2002        1        0        12      13
# 2003        5        2         4      11
# 2004        1        3        12      16
# 2005        2        1         7      10
# 2007        1        0        14      15
# 2008        2        2        10      14
# 2009        0        0        12      12
# 2010        1        0        10      11
# 2011        2        1         9      12
# 2012        1        0         6       7
# 2013        0        0         0       0
# 2014        1        0         2       3
# 2015        0        0         6       6
# 2016        0        0         5       5
# 2017        0        0         0       0
# 2018        0        1         4       5
# 2019        0        0         6       6
# 2020        0        0         9       9
# 2021        1        0         4       5

While using weights you can get national level estimates, the standard errors are based on the observed number of cases. Which in retrospect I should have realized – gun violence is pretty rare, so rates in the 1 to 100 per 100,000 would be the usual range. If anything these are maybe a tinge higher than I should have guessed (likely due to how CDC does the sampling).

To be able to do the analysis of survival rates I want, the sample sizes here would need to be 100 times larger than they are. Which would require something more akin to NIBRS reporting by hospitals directly, instead of having CDC do boots on the ground samples. Which is feasible of course (no harder for medical providers to do this than police departments), see SPARCS with New York data for example.

But perhaps others can find this useful. It may be easier to do things more like 1/100 events and analyze them. The data has quite a few variables, like readmission due to other events, public/private insurance, different drugs, and then of course all the stuff that is recorded via ICD10 codes (which is both health events as well as behavioral). So probably not a large enough sample to do analysis of other criminal justice related health care incidents, but they do add up to big victim costs to the state that are easy to quantify, as the medicaid population is a large chunk of that.

Leave a comment

by apwheele on March 30, 2024 • Permalink

Posted in data science, healthcare, Python

Tagged cdc, health data

Posted by apwheele on March 30, 2024

https://andrewpwheeler.com/2024/03/30/grabbing-the-nhamcs-emergency-room-data-in-python/

Matching mentors to mentees using OR-tools

So the other day on Hackernews, a startup had a question in their entry to interview:

You have m mentees – You have M mentors – You have N match scores between mentees and mentors, based on how good of a match they are for mentorship. – mxM > N, because not every mentee is eligible to match with every mentor. It depends on seniority constraints. – Each mentee has a maximum mentor limit of 1 – Each mentor has a maximum mentee limit of l, where 0<l<4 How would you find the ideal set of matches to maximize match scores across all mentees and mentors?

Which I am not interested in applying to this start-up (it is in Canada), but it is a fun little distraction. I would use linear programming to solve this problem – it is akin to an assignment matching problem.

So here to make some simple data, I have two mentors and four mentees. Each mentee has a match score to its mentor, and not all mentees have a match score with each mentor.

# using googles or-tools
from ortools.linear_solver import pywraplp

# Create data
menA = {'a':0.2, 'b':0.4, 'c':0.5}
menB = {'b':0.3, 'c':0.4, 'd':0.5}

So pretend each mentor can be assigned two mentees, what is the optimal pairing? If you do a greedy approach, you might say, for menA lets do the highest, b then c. And then for mentor B, we only have left mentee d. This is a total score of 0.4 + 0.5 + 0.5 = 1.4. If we do greedy the other way, we get mentor B assigned c and d, and then mentor A can have a and b assigned, with a score of 0.4 + 0.5 + 0.4 + 0.2 = 1.5. And this ends up being the optimal approach in this scenario.

Here I am going to walk through creating this model in google’s OR tools package. First we create a few different data structures, one is a set of pairs of mentor/mentees, and these will end up being our decision variables. The other is a set of the unique mentees (and we will need a set of the unique mentors as well) later for the constraints.

# a few different data structures
# for our linear programming problem
dat = {'A':menA,
       'B':menB}

mentees = []

pairs = {}
for m,vals in dat.items():
    for k,v in vals.items():
        mentees.append(k)
        pairs[f'{m},{k}'] = v

mentees = set(mentees)

Now we can create our model, and here we are maximizing the matches. Edit: Note that GLOP is not an MIP solver, leaving the blog post as since in my experiments it does always return 0/1’s (some LP formulations will always return binary, I do not know if this is always true for this one or not). If you want to be sure to use integer variables though, use pywraplp.Solver.CreateSolver("SAT") instead of GLOP.

# Create model, variables
solver = pywraplp.Solver.CreateSolver("GLOP")
matchV = [solver.IntVar(0, 1, p) for p in pairs.keys()]

# set objective, maximize matches
objective = solver.Objective()
for m in matchV:
    objective.SetCoefficient(m, pairs[m.name()])

objective.SetMaximization()

Now we have two sets of constraints. One set is that for each mentor, they are only assigned two individuals. And then for each mentee, they are only assigned one mentor.

# constraints, mentors only get 2
Mentor_constraints = {d:solver.RowConstraint(0, 2, d) for d in dat.keys()}
# mentees only get 1
Mentee_constraints = {m:solver.RowConstraint(0, 1, m) for m in mentees}
for m in matchV:
    mentor,mentee = m.name().split(",")
    Mentor_constraints[mentor].SetCoefficient(m, 1)
    Mentee_constraints[mentee].SetCoefficient(m, 1)

Now we are ready to solve the problem,

# solving the model
status = solver.Solve()
print(objective.Value())

# Printing the results
for m in matchV:
    print(m,m.solution_value())

Which then prints out:

>>> print(objective.Value())
1.5
>>>
>>>
>>> # Printing the results
>>> for m in matchV:
...     print(m,m.solution_value())
...
A,a 1.0
A,b 0.0
A,c 1.0
B,b 1.0
B,c 0.0
B,d 1.0

So here it actually chose a different solution than I listed above, but is also maximizing the match scores at 1.5 total. Mentor A with {a,c} and Mentor B with {b,d}.

Sometimes people are under the impression that linear programming is only for tiny data problems. Here the problem size grows depending on how filled in the mentor/mentee pairs are. So if you have 1000 mentors, and they each have 50 potential mentees, the total number of decision variables will be 50,000. Which can still be solved quite fast. Note the problem does not per se grow with more mentees per se, since it can be a sparse problem.

To show this a bit easier, here is a function to return the nice matched list:

def match(menDict,num_assign=2):
    # prepping the data
    mentees,mentors,pairs = [],[],{}
    for m,vals in menDict.items():
        for k,v in vals.items():
            mentees.append(str(k))
            mentors.append(str(m))
            pairs[f'{m},{k}'] = v
    mentees,mentors = list(set(mentees)),list(set(mentors))
    print(f'Total decision variables {len(pairs.keys())}')
    # creating the problem
    solver = pywraplp.Solver.CreateSolver("GLOP")
    matchV = [solver.IntVar(0, 1, p) for p in pairs.keys()]
    # set objective, maximize matches
    objective = solver.Objective()
    for m in matchV:
        objective.SetCoefficient(m, pairs[m.name()])
    objective.SetMaximization()
    # constraints, mentors only get num_assign
    Mentor_constraints = {d:solver.RowConstraint(0, num_assign, f'Mentor_{d}') for d in mentors}
    # mentees only get 1 mentor
    Mentee_constraints = {m:solver.RowConstraint(0, 1, f'Mentee_{m}') for m in mentees}
    for m in matchV:
        mentor,mentee = m.name().split(",")
        #print(mentor,mentee)
        Mentor_constraints[mentor].SetCoefficient(m, 1)
        Mentee_constraints[mentee].SetCoefficient(m, 1)
    # solving the model
    status = solver.Solve()
    # figuring out whom is matched
    matches = {m:[] for m in mentors}
    for m in matchV:
        if m.solution_value() > 0.001:
            mentor, mentee = m.name().split(",")
            matches[mentor].append(mentee)
    return matches

I have the num_assigned as a constant for all mentors, but it could be variable as well. So here is an example with around 50k decision variables:

# Now lets do a much larger problem
from random import seed, random, sample
seed(10)
mentors = list(range(1000))
mentees = list(range(4000))

# choose random 50 mentees
# and give random weights
datR = {}
for m in mentors:
    sa = sample(mentees,k=50)
    datR[m] = {s:random() for s in sa}

# This still takes just a few seconds on my machine
# 50k decision variables
res = match(datR,num_assign=4)

And this takes less than 2 seconds on my desktop, which much of that is probably setting up the problem.

For very big problems, you can separate out the network into connected components, and then run the linear program on each seperate component. Only in the case with dense and totally connected networks will you need to worry much about running out of memory I suspect.

So I have written in the past how I stopped doing homework assignments for interviews. More recently at work we give basic pair coding sessions – something like a python hello world problem, and then a SQL question that requires knowing GROUP BY. If they get that far (many people who say they have years of data science experience fail at those two), we then go onto making environments, git, and describing some machine learning methods.

You would be surprised how many data scientists I interview only know how to code in Jupyter notebooks and cannot do a hello world program from the command line. It is part of the reason I am writing my intro python book. Check out the table of contents and first few chapters below. Only one more final chapter on an end-to-end project before the book will be released on Amazon. (So this is close to the last opportunity to purchase as the lower price before then.)

DS_PythonCrimeAnalysis_EarlyRelease Download

Leave a comment

by apwheele on March 24, 2024 • Permalink

Posted in data science, Python

Tagged linear programming

Posted by apwheele on March 24, 2024

https://andrewpwheeler.com/2024/03/24/matching-mentors-to-mentees-using-or-tools/

Poisson designs and Minimum Detectable Effects

Ian Adam’s posted a working paper the other day on power analysis for analyzing counts, Power Simulations of Rare Event Counts and Introduction to the ‘Power Lift’ Metric (Adams, 2024). I have a few notes I wanted to make in regards to Ian’s contribution. Nothing I say conflicts with what he writes, moreso just the way I have thought about this problem. It is essentially the same issue as I have written about monitoring crime trends (Wheeler, 2016), or examining quasi-experimental designs with count data (Wheeler & Ratcliffe, 2018; Wilson, 2022).

I am going to make two broader points here: point 1, power is solely a property of the aggregate counts in treated vs control, you don’t gain power by simply slicing your data into finer temporal time periods. Part 2 I show an alternative to power, called minimum detectable effect sizes. This focuses more on how wide your confidence intervals are, as opposed to power (which as Ian shows is not monotonic). I think it is easier to understand the implications of certain designs when approached this way – both from “I have this data, what can I determine from it” (a retrospective quasi-experimental design), as well as “how long do I need to let this thing cook to determine if it is effective”. Or more often “how effective can I determine this thing is in a reasonable amount of time”.

Part 1, Establishing it is all about the counts

So lets say you have a treated and control area, where the base rate is 10 per period (control), and 8 per period (treated):

##########
set.seed(10)
n <- 20 # time periods
reduction <- 0.2 # 20% reduced
base <- 10

control <- rpois(n,base)
treat <- rpois(n,base*(1-reduction))

print(cbind(control,treat))
##########

And this simulation produces 20 time periods with values below:

 [1,]      10     6
 [2,]       9     5
 [3,]       5     3
 [4,]       8     8
 [5,]       9     5
 [6,]      10    10
 [7,]      10     7
 [8,]       9    13
 [9,]       8     6
[10,]      13     8
[11,]      10     6
[12,]       8     8
[13,]      11     8
[14,]       7     8
[15,]      10     7
[16,]       6     8
[17,]      12     3
[18,]      15     5
[19,]      10     8
[20,]       7     7

Now we can fit a Poisson regression model, simply comparing treated to control:

##########
outcome <- c(control,treat)
dummy <- rep(0:1,each=n)

m1 <- glm(outcome ~ dummy,family=poisson)
summary(m1)
###########

Which produces:

Call:
glm(formula = outcome ~ dummy, family = poisson)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.69092  -0.45282   0.01894   0.38884   2.04485

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.23538    0.07313  30.568  < 2e-16 ***
dummy       -0.29663    0.11199  -2.649  0.00808 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 32.604  on 39  degrees of freedom
Residual deviance: 25.511  on 38  degrees of freedom
AIC: 185.7

Number of Fisher Scoring iterations: 4

In this set of data, the total treated count is 139, and the total control count is 187. Now watch what happens when we fit a glm model on the aggregated data, where we just now have 2 rows of data?

##########
agg <- c(sum(treat),sum(control))
da <- c(1,0)
m2 <- glm(agg ~ da,family=poisson)
summary(m2)
##########

And the results are:

Call:
glm(formula = agg ~ da, family = poisson)

Deviance Residuals:
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.23111    0.07313  71.534  < 2e-16 ***
da          -0.29663    0.11199  -2.649  0.00808 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 7.0932e+00  on 1  degrees of freedom
Residual deviance: 9.5479e-15  on 0  degrees of freedom
AIC: 17.843

Number of Fisher Scoring iterations: 2

Notice how the treatment effect coefficients and standard errors are the exact same results as they are with the micro observations. This is something people who do regression models often do not understand. Here you don’t gain power by having more observations, power in the Poisson model is determined by the total counts of things you have observed.

If this were not the case, you could just slice observations into finer time periods and gain power. Instead of counts per day, why not per hour? But that isn’t how it works when using Poisson research designs. Counter-intuitive perhaps, you get smaller standard errors when you observe higher counts.

It ends up being the treatment effect estimate in this scenario is easy to calculate in closed form. This is just riffing off of David Wilson’s work (Wilson, 2022).

treat_eff <- log(sum(control)/sum(treat))
treat_se <- sqrt(1/sum(control) + 1/sum(treat))
print(c(treat_eff,treat_se))

Which produces [1] 0.2966347 0.1119903.

For scenarios in which are slightly more complicated, such as treated/control have different number of periods, you can use weights to get the same estimates. Here for example we have 25 periods in treated and 19 periods in the control using the regression approach.

# Micro observations, different number of periods
treat2 <- rpois(25,base*(1 - reduction))
cont2 <- rpois(19,base)
val2 <- c(treat2,cont2)
dum2 <- c(rep(1,25),rep(0,19))
m3 <- glm(val2 ~ dum2,family=poisson)

# Aggregate, estimate rates
tot2 <- c(sum(treat2),sum(cont2))
weight <- c(25,19)
rate2 <- tot2/weight
tagg2 <- c(1,0)
# errors for non-integer values is fine
m4 <- glm(rate2 ~ tagg2,weights=weight,family=poisson) 
print(vcov(m3)/vcov(m4)) # can see these are the same estimates
summary(m4)

Which results in:

>print(vcov(m3)/vcov(m4)) # can see these are the same estimates
            (Intercept)      dum2
(Intercept)   0.9999999 0.9999999
dum2          0.9999999 0.9999992
>summary(m4)

Call:
glm(formula = rate2 ~ tagg2, family = poisson, weights = weight)

Deviance Residuals:
[1]  0  0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.36877    0.07019  33.750  < 2e-16 ***
tagg2       -0.38364    0.10208  -3.758 0.000171 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The treatment effect estimate is similar, where the variance is still dictated by the counts.

treat_rate <- log(rate2[1]/rate2[2])
treat_serate <- sqrt(sum(1/tot2))
print(c(treat_rate,treat_serate))

Which again is [1] -0.3836361 0.1020814, same as the regression results.

Part 2, MDEs

So Ian’s paper has simulation code to determine power. You can do infinite sums with the Poisson distribution to get closer to closed form estimates, like the e-test does in my ptools package. But the simulation approach is fine overall, so just use Ian’s code if you want power estimates.

The way power analysis works, you pick an effect size, then determine the study parameters to be able to detect that effect size a certain percentage of the time (the power, typically set to 0.8 for convenience). An alternative way to think about the problem is how variable will your estimates be? You can then back out the minimum detectable effect size (MDE), given those particular counts. (Another way people talk about this is plan for precision in your experiment.)

Lets do a few examples to illustrate. So say you wanted to know if training reduced conducted energy device (CED) deployments. You are randomizing different units of the city, so you have treated and control. Baseline rates are around 5% per arrest, and say you have 10 arrests per day in each treated/control arm of the study. Around 30 days, you will have ~15 CED usages. Subsequently the standard error of the logged incident rate ratio will be approximately sqrt(1/15 + 1/15) = 0.37. Thus, the smallest effect size you could detect has to be a logged incident rate ratio pretty much double that value.

Presumably we think the intervention will decrease CED uses, so we are looking at an IRR of exp(-0.37*2) = 0.48. So you pretty much need to cut CED usage in half to be able to detect if the intervention worked when only examining the outcomes for one month. (The 2 comes from using a 95% confidence interval.)

If we say we think best case the intervention had a 20% reduction in CED usage, we would then need exp(-se*2) = 0.8. log(0.8) ~ -0.22, so we need a standard error of se = 0.11 to meet this minimum detectable effect. If we have equal counts in each arm, this is approximately sqrt(1/x + 1/x) = 0.11, with rearranging we get 0.11^2 = 2*(1/x), and then 2/(0.11^2) = x = 166. So we want over 160 events in each treated/control group, to be able to detect a 20% reduction.

Now lets imagine a scenario in which one of the arms is fixed, such as retrospective analysis. (Say the control group is prior time periods before training, and 100% of the patrol officers gets the training.) So we have fixed 100 events in the control group, in that scenario, we need to monitor our treatment until we observe sqrt(1/x + 1/100) = 0.11, that 20% reduction standard. We can rearrange this to be 0.11^2 - 1/100 = 1/x, which is x = 1/0.0021 = 476.

When you have a fixed background count, in either in a treated or control arm, that pretty much puts a lower bound on the standard error. In this case with the control arm that has a fixed 100 events, the standard error can never be smaller than sqrt(1/100) = 0.1. So in that case, you can never detect an effect smaller than exp(-0.2).

Another way to think about this is that with smaller effect sizes, you can approximately translate the standard errors to percent point ranges. So if you want to say plan for precision estimates of around +/- 5% – that is a standard error of 0.05. We are going to need sqrt(z) ~ 0.05. At a minimum we need 400 events in one of the treated or control arms, since sqrt(1/400) = 0.05 (and that is only taking into account one of the arms).

For those familiar with survey stats, these are close to the same sample size recommendation for proportions – it is just instead of total sample size, it is the total counts we are interested in. E.g. if you want +/- 5% for sample proportions, you want around 1,000 observations.

And most of the examples of more complicated research designs (e.g. fixed or random effects, overdispersion estimates) will likely make the power lower, not higher, than the back of the envelope estimates here. But they should be a useful starting to know whether a particular experimental design is dead in the water to detect reasonable effect sizes of interest.

If you found this interesting, you will probably find my work on continuous monitoring of crime trends over time also interesting:

This approach relies on very similar Poisson models to what Ian is showing here, you just monitor the process over time and draw the error intervals as you go. For low powered designs, the intervals will just seem hopelessly wide over time.

References

Adams, I. (2024) Power Simulations of Rare Event Counts and Introduction to the ‘Power Lift’ Metric. CrimRxiv
Blattman, C., Green, D., Ortega, D., & Tobón, S. (2018). Place-based interventions at scale: The direct and spillover effects of policing and city services on crime (No. w23941). National Bureau of Economic Research.
Wheeler, A. P. (2016). Tables and graphs for monitoring temporal crime trends: Translating theory into practical crime analysis advice. International Journal of Police Science & Management, 18(3), 159-172.
Wheeler, A.P., & Ratcliffe, J.H. (2018). A simple weighted displacement difference test to evaluate place based crime interventions. Crime Science, 7(1), 11.
Wilson, D. B. (2022). The relative incident rate ratio effect size for count-based impact evaluations: When an odds ratio is not an odds ratio. Journal of Quantitative Criminology, 38(2), 323–34.

Leave a comment

by apwheele on March 18, 2024 • Permalink

Posted in Crime Analysis, data science, R

Tagged Poisson

Posted by apwheele on March 18, 2024

https://andrewpwheeler.com/2024/03/18/poisson-designs-and-minimum-detectable-effects/

Harmweighted hotspots, using ESRI python API, and Crime De-Coder Updates

Haven’t gotten the time to publish a blog post in a few. There has been a ton of stuff I have put out on my Crime De-Coder website recently. For some samples since I last mentioned here, have published four blog posts:

on what AI regulation in policing would look like
high level advice on creating dashboards
overview of early warning systems for police
types of surveys for police departments

For surveys a few different groups have reached out to me in regards to the NIJ measuring attitudes solicitation (which is essentially a follow up of the competition Gio and myself won). So get in touch if interested (whether a PD or a research group), may try to coordinate everyone to have one submission instead of several competing ones.

To keep up with everything, my suggestion is to sign up for the RSS feed on the site. If you want an email use the if this than that service. (I may have to stop doing my AltAc newsletter emails, it is so painful to send 200 emails and I really don’t care to sign up for another paid for service to do that.)

I also have continued the AltAc newsletter. Getting started with LLMs, using secrets, advice on HTML, all sorts of little pieces of advice every other week.

I have created a new page for presentations. Including, my recent presentation at the Carolina Crime Analysis Association Conference. (Pic courtesy of Joel Caplan who was repping his Simsi product – thank you Joel!)

If other regional IACA groups are interested in a speaker always feel free to reach out.

And finally a new demo on creating a static report using quarto/python. It is a word template I created (I like often generating word documents that are easier to post-hoc edit, it is ok to automate 90% and still need a few more tweaks.)

Harmweighted Hotspots

If you like this blog, also check out Iain Agar’s posts, GIS/SQL/crime analysis – the good stuff. Here I wanted to make a quick note about his post on weighting Crime Harm spots.

So the idea is that when mapping harm spots, you could have two different areas with same high harm, but say one location had 1 murder and one had 100 thefts. So if murder harm weight = 100 and theft harm weight = 1, they would be equal in weight. Iain talks about different transformations of harm, but another way to think about it is in terms of variance. So here assuming Poisson variance (although in practice that is not necessary, you could estimate the variance given enough historical time series data), you would have for your two hotspots:

Hotspot1: mean 1 homicide, variance 1
Hotspot2: mean 100 thefts, variance 100

Weight of 100 for homicides, 1 for theft

Hotspot1: Harmweight = 1*100 = 100
          Variance = 100^2*1 = 10,000
          SD = sqrt(10,000) = 100

Hotspot2: Harmweight = 100*1 = 100
          Variance = 1^2*100 = 100
          SD = sqrt(100) = 10

When you multiply by a constant, which is what you are doing when multiplying by harm weights, the relationship with variance is Var(const*x) = const^2*Var(x). The harm weights add variance, so you may simple add a penalty term, or rank by something like Harmweight - 2*SD (so the lower end of the harm CI). So in this example, the low end of the CI for Hotspot 1 is 0, but the low end of the CI for Hotspot2 is 80. So you would rank Hotspot2 higher, even though they are the same point estimate of harm.

The rank by low CI is a trick I learned from Evan Miller’s blog.

You could fancy this up more with estimating actual models, having multiple harm counts, etc. But this is a quick way to do it in a spreadsheet with just simple counts (assuming Poisson variance). Which I think is often quite reasonable in practice.

Using ESRI Python API

So I knew you could use python in ESRI, they have a notebook interface now. What I did not realize is now with Pro you can simply do pip install arcgis, and then just interact with your org. So for a quick example:

from arcgis.gis import GIS

# Your ESRI url
gis = GIS("https://modelpd.maps.arcgis.com/", username="user_email", password="???yourpassword???")
# For batch geocoding, probably need to do GIS(api_key=<your api key>)

This can be in whatever environment you want, so you don’t even need ArcGIS installed on the system to use this. It is all web-api’s with Pro. To geocode for example, you would then do:

from arcgis.geocoding import geocode, Geocoder, get_geocoders, batch_geocode

# Can search to see if any nice soul has published a geocoding server

arcgis_online = GIS()
items = arcgis_online.content.search('geocoder north carolina', 'geocoding service', max_items=30)

# And we have four
#[<Item title:"North Carolina Address Locator" type:Geocoding Layer owner:ecw31_dukeuniv>,
# <Item title:"Southeast North Carolina Geocoding Service" type:Geocoding Layer owner:RaleighGIS>, 
# <Item title:"Geocoding Service - AddressNC " type:Geocoding Layer owner:nconemap>, 
# <Item title:"ArcGIS World Geocoding Service - NC Extent" type:Geocoding Layer owner:NCDOT.GOV>]

geoNC = Geocoder.fromitem(items[0]) # lets try Duke
#geoNC = Geocoder.fromitem(items[-1]) # NCDOT.GOV
# can also do directly from URL
# via items[0].url
# url = 'https://utility.arcgis.com/usrsvcs/servers/8caecdf6384144cbafc9d56944af1ccf/rest/services/World/GeocodeServer'
# geoNC = Geocoder(url,gis)

# DPAC
res = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1)
print(res[0])

Note you cannot cache the geocoding results. To do that, you need to use credits and probably sign in via a token and not a username password.

# To cache, need a token
r2 = geocode('123 Vivian Street, Durham, NC 27701',geocoder=geoNC, max_locations=1,for_storage=True)

# If you have multiple addresses, use batch_geocode, again need a token
#dc_res = batch_geocode(FullAddressList, geocoder=geoNC)

Geocoding to this day is still such a pain. I will need to figure out if you can make a local geocoding engine with ESRI and then call that through Pro (I mean I know you can, but not sure pricing for all that).

Overall being able to work directly in python makes my life so much easier, will need to dig more into making some standard dashboards and ETL processes using ESRI’s tools.

I have another post that has been half finished about using the ESRI web APIs, hopefully will have time to put that together before another 6 months passes me by!

Leave a comment

by apwheele on March 3, 2024 • Permalink

Posted in Crime Analysis, Crime Mapping, Criminal Justice, data science, geocoding, Python

Tagged ESRI, speaking

Posted by apwheele on March 3, 2024

https://andrewpwheeler.com/2024/03/03/harmweighted-hotspots-using-esri-python-api-and-crime-de-coder-updates/

Getting started with github notes

I mentioned on LinkedIn the other day I think github is a good resource for crime analysts to learn. Even if you don’t write code, it is convenient to have an audit-trail of changes in documents.

Jerry Ratcliffe made the comment that it is a tough learning curve, and I agree dealing with merge conflicts is a pain in the butt:

In the past I have suggested people to get started by using the github desktop GUI tool. But I do not suggest that anymore because of the issues Jerry mentions. If you get headaches like this, you pretty much need to use the command line to deal with them. I do not have many git commands memorized, and I will give a rundown of my getting started with git and github notes. So I just suggest now people bite the bullet and learn the command line.

Agree it takes some effort, but I think it is well worth it.

Making a project and first commit

Technically github is the (now Microsoft owned) software company that offers web hosted version control, and git is a more general system for version control. (There is another popular web host called Gitlab for example.) Here I will offer advice about using github and git from the command line.

So first, I typically create projects first online on the web-browser on github.com (I do not have the command prompt command memorized to create a new repository). On github.com, click the green New button:

Here I am creating a new repo named example_repo. I do it this way intentionally, as I can make sure I set the repo owner to the correct one (myself or my organization), and set the repo to the correct public/private by default. Many things you want to default to private.

Note on windows, the git command is probably not installed by default. If you install git-bash, it should be available in the command prompt.

Now that you have your repository created, in github I click the green code button, and copy the URL to the repo:

Then from the command line, navigate to where you want to download the repo (I set up my windows machine so I have a G drive mapped to where I download github repos). So from command line, mine looks like:

# cd to to correct location
git clone https://github.com/apwheele/example_repo.git
# now go inside the folder you just downloaded
cd ./example_repo

Now typically I do two things when first creating a repo, edit the README.md to give a high level overview of the project, and also create a .gitignore file (no file extension!). Often you have files that you don’t want committed to the github repository. Most of my .gitignore files look like this for example, where # are comment lines:

# No csv files
*.csv

# No python artifacts
*.pyc
__pycache__

# can prevent uploading entire folders if you want
/folder_dont_upload

Note if you don’t generally want files, but want a specific file for whatever reason, you can use an exclamation point, e.g. !./data/keep_me.csv will include that file, even if you have *.csv as ignored in the .gitignore file in general. And if you want to upload an empty folder, place a .gitkeep file in that folder.

Now in the command prompt, run git status. You will see the files that you have edited listed (minus any file that is ignored in the gitignore file).

So once you have those files edited, then in the command prompt you will do three different commands in a row:

git add .
git commit -m 'making init commit'
git push

The first command git add ., adds all of the files you edited (again minus any file that is ignored in the gitignore file). Note you can add a specific file one at a time if you want, e.g. git add README.md, but using the period adds all of the files you edited at once.

Git commit adds in a message where you should write a short note on the changes. Technically at this point you could go and do more changes, but here I am going to git push, which will send the updates to the online hosted github branch. (Note if doing this the first time from the command prompt, you may need to give your username and maybe set up a github token or do two-factor authentication).

You don’t technically need to do these three steps at once, but in my workflows I pretty much always do. Now you can go checkout the online github repo and see the updated changes.

Branches

When you are working on things yourself for small projects, just those above commands and committing directly to the default main branch is fine. Branches allow for more complicated scenarios like:

you want the main code to not change, but you want to experiment and try out some changes
you have multiple people working on the code at the same time

Branches provide isolation – they allow the code in the branch to change, whereas code in main (or other branches) does not change. Here I am going to show how to make a branch in the command prompt, but first a good habit when working with multiple people is to do at the start of your day:

git fetch
git pull origin main

Git fetch updates the repo if other collaborators added a branch (but does not update the files directly). And git pull origin main pulls the most recent main branch version. So if a colleague updated main, when you do git pull origin main it will update the code on your local computer. (If you want to pull the most recent version of a different branch, it will be git pull origin branch_name.)

To create a new branch, you can do:

git checkout -b new_branch

Note if the branch is already created you can just omit the -b flag, and this just switches to that branch. Make a change, and then when pushing, use git push origin new_branch, which specifies you are specifically pushing to your branch you just created (instead of pushing to the default main branch).

# after editing readme to make a change
git add .
git commit -m 'trivial edit'
git push origin new_branch

Now back in the browser, you can go and checkout the updated code by switching the branch you are looking at in the dropdown on the left hand part of the screen that says “new_branch” with the tiny branching diagram:

A final step, you want to merge the branch back into the main code script. If you see the big green button Compare and Pull Request in the above screenshot, click that, and it will bring up a dialog about creating a pull request. Then click the green Create Pull Request button:

Then after you created the request, it will provide another dialogue to merge in the code into the target (by default main).

If everything is ok (you have correct permissions and no merge conflicts), you can click the buttons to merge the branches and that is that.

Merge Conflicts

The rub with above is that sometimes merge conflicts happen, and as Jerry mentions, these can be a total pain to sort out. It is important to understand why merge conflicts happen first though, and to take steps to prevent them. In my experience merge conflicts most often happen because of two reasons:

Multiple people are working on the same branch, and I forget to run git pull origin branch at the start of my day, so I did not incorporate the most recent changes. (Note these can happen via auto-changes as well, such as github actions running scripts.)

The second scenario is someone updated main, and I did not update my version of main. This tends to occur with more long term development. Typically this means at the start of my day, I should have run git checkout main, then git pull origin main.

I tend to find managing merge conflicts is very difficult using the built in github tools (so I don’t typically use git rebase for example). More commonly, when I have a merge conflict for a single file, first I will save the file that is giving me problems outside of the github repo (so I don’t accidently delete/overwrite it). Then, if my new_branch is conflicting with main, I will do:

# this pulls the exact file from main
git checkout main conflict_file.txt
git add conflict_file.txt
git commit -m 'pulling file to fix conflict'
git push origin new_branch

Then if I want to make edits to conflict_file.txt, make the edits now, then redo add-commit-push.

This workflow tends to be easier in my experience than dealing with rebase or trying to edit the merge conflicts directly.

It is mostly important though to realize what caused the merge conflict to begin with, to prevent the pain of dealing with it again in the future. My experience these are mostly avoidable, and mean you made a personal mistake in not pulling the most recent version, or more rarely collaboration with a colleague wasn’t coordinated correctly (you both editing the same file at the same time).

I realize this is not easy – it takes a bit of work to understand github and incorporate into your workflow. I think it is a worthwhile tool for analysts and data scientists to learn though.

Leave a comment

by apwheele on January 14, 2024 • Permalink

Posted in Crime Analysis, data science

Tagged command-line, github

Posted by apwheele on January 14, 2024

https://andrewpwheeler.com/2024/01/14/getting-started-with-github-notes/

Recoding America review, Data Science CV Update, Sworn Dashboard

Over this Christmas break I read Jennifer Pahlka’s Recoding America. It is a great book and I really recommend it.

My experience working in criminal justice is a bit different than Pahlka’s examples, but even if you are just interested in private sector product/project management this is a great book. It has various user experience gems as well (such as forms that eliminate people, put the eliminating questions in order by how many people they filter).

Pahlka really digs on waterfall, I have critiqued agile on the blog in the past, but we are both just using generic words to describe bad behavior. I feel like a kindred spirit with Pahlka based on some of her anecdotes; concrete boats, ridiculous form questions, PDF inputs that only work on ancient web-browsers, mainframes are not the problem stupid requirements are, hiring too many people makes things worse, people hanging up on them in phone calls when you tell the truth – so many good examples.

To be specific with agile/waterfall, Pahlka is very critical of fixed requirements coming down on high from policy makers. When you don’t have strong communication at the requirements gathering stage between techies, users and owners making the requests (which can happen in private sector too), you can get some comical inefficiencies.

A good example for my CJ followers are policies to do auto-clearance of records in California. So the policy makers made a policy that said those with felony convictions for stealing less than $1,000 can be expunged, but there is no automated way to do this, since the criminal records do not save the specific dollar amount in the larceny charge. (And to do the manual process is very difficult, so pretty much no one bothers.) It probably would make more sense to say something like “a single felony larceny charge that is 5 years old will be auto-cleared”, that is not exactly the same but is similar in spirit to what the legislature wants, and can be easily automated based on criminal records that are already collected by the state. A real effective solution would look like data people working with policy makers directly and giving scenarios “if we set the criteria to X, it will result in Y clearances”. These are close to trivial things to ask a database person to comment on, there is no fundamental reason why policy/techs can’t go back in forth and craft policy that makes sense and is simpler to implement.

To be more generic, what can happen is someone requests X, X is really hard/impossible, but you can suggest a,b,c instead that is easier to accomplish and probably meets the same high level goals. There is asymmetry in what people ask for and understanding of the work it takes to accomplish those requests, an important part of your job as a programmer/analyst is to give feedback to those asking to make the requests better. It takes understanding from the techies of the business requirements (Pahlka suggests govt should hire more product owners, IMO would rather just have senior developer roles do that stuff directly). And it takes people asking to be open to potential changes. Which most people are in my experience, just sometimes you get people who hang up in phone calls when you don’t tell them what they want to hear.

I actually like the longer term plan out a few months waterfall approach (I find that easier to manage junior developers, I think the agile shorter term stuff is too overbearing at times). But it requires good planning and communication between end users and developers no matter whether you say you are doing waterfall or agile. My experience in policing is not much like the policy people giving stone tablets, I have always had more flexibility to give suggestions in my roles. But I do think many junior crime analysts need to learn to say “you asked for percent change, here is a different stat instead that is better for what you want”.

What I am trying to do with CRIME De-Coder is really consistent with Pahlka’s goals with Code for America. I think it is really important for CJ agencies to take on more human investment in tech. Part of the reason I started CRIME De-Coder was anger – I get angry when I see cities pay software vendors six digits for crappy software that a good crime analyst could do. Or pay a consulting firm 6 figures for some mediocre (and often inappropriate) statistical analysis. Cities can do so much better by internally developing skills to take on many software projects, which are not moving mountains, and often outside software causes more problems than they solve.

At work we are starting to hire a new round of data scientists (no links to share, they are offshore in India, and first round is through a different service). The resume over-stating technical expertise for data scientists is at lol levels at this point. Amazing how everyone is an LLM, deep learning, and big data expert these days.

I’ve written before how I am at a loss on how to interview data scientists. The resumes I am getting are also pretty much worthless at this point. One problem I am seeing in these resumes is that people work on teams, so people can legitimately claim “I worked on this LLM”, but when you dig in and ask about specifics you find out they only contributed this tiny thing (which is normal/OK). But the resumes look like they are Jedi masters in advanced machine learning.

I went and updated my data science resume in response to reading others. (I should probably put that in HTML, so it shows up in google search results.) I don’t really have advice for folks “what should your resume look like” – I have no clue how recruiters view these things. No doubt my resume is not immune to a recruiter saying “you have 10+ years with python, but I don’t see any Jira experience, so I don’t think you are qualified”.

What I have done is only include stuff in the resume where I can link to specific, public examples (peer reviewed work, blog posts, web pages, github). I doubt recruiters are going to click on a single link in the resume (let alone all 40+), but that is what I personally would prefer when I am reviewing a resume. Real tangible stuff so someone can see I actually know how to write code.

So for example in the most recent update of the resume, I took Unix, Kubernetes/Docker, Azure, and Databricks off. Those are all tech I have worked with at HMS/Gainwell, but do not have any public footprint to really show off. I have some stuff on Docker on the blog, but nothing real whiz bang to brag about. And I have written some about my deployment strategy for python code in Databricks using github actions. (I do like Azure DevOps pipelines, very similar to building github actions, which are nice for many of the batch script processes I do. My favorite deployment pattern at work is using conda + persistent Fedora VMs. Handling servers/kubernetes everything docker is a total pain.) “Expertise” in those tools is probably too strong, I think claiming basic competence is reasonable though. (Databricks has changed so much in the two years we have been using it at work I’m not sure anyone outside of Databricks themselves could claim expertise – only if you are a very fast learner!)

But there is no real fundamental way for an outsider to know I have any level of competence/expertise in these tech tools. Honestly they do not matter – if you want me to use google cloud or AWS for something equivalent to Azure DevOps, or Snowflake instead of Databricks, it doesn’t really matter. You just learn the local stack in a month or two. Some rare things you do need very specialized tech skills, say if someone wanted me to optimize latency in serving pytorch LLMs, that will be tough given my background. Good luck posting that position on LinkedIn!

But the other things I list, I can at least pull up a web page to say “here is code I wrote to do this specific thing”. Proof is in the pudding. Literally 0 of the resumes I am reviewing currently have outside links to any code, so it could all be made up (and clearly for many people is embellished). I am sure people think mine is embellished as well, best I can do to respond to that is share public links.

For updates on CRIME De-Coder:

Sworn Dashboard, trends in officer hiring using D3.js to make charts and supabase as a backend
How to evaluate predictive policing software blog post
Crime De-Coder Storefront, currently can purchase (in and outside US) my GUI tool to download streetview images, as well as pre-release of python book

I researched ways to do payments for so long, in the end just turning on WooPayments in wordpress (and using an iframe) was a super simple solution (and it works fine for digital downloads and international payments). Will need to figure out webhooks with Stripe to do more complicated stuff eventually (like SaaS services, licenses, recurring payments), but for now this set up works for what I need.

I will start up newsletters again next week.

1 Comment

by apwheele on January 2, 2024 • Permalink

Posted in Criminal Justice, data science, scholarly

Tagged book-review, resume

Posted by apwheele on January 2, 2024

https://andrewpwheeler.com/2024/01/02/recoding-america-review-data-science-cv-update-sworn-dashboard/

Year in Review 2023: How did CRIME De-Coder do?

In 2023, I published 45 pages on the blog. Cumulative site views were slightly more than last year, a few over 150,000.

I would have had pretty much steady cumulative views from last year (site views took a dip in April, the prior year had quite a bit of growth, I suspect something to do with the way WordPress counts stats changed), but in December my post Forecasts need to have error bars hit front page on Hackernews. This generated about 12k views for that post over two days. (In 2022 I had just shy of 140k views in total.)

It was very high on the front page (#1) for most of that day. So for folks who want to guesstimate the “death by Hackernews” referrals, I would guess if your site/app can handle 10k requests in an hour you will be ok. WordPress by default this is fine (my Crime De-Coder Hostinger site is maybe not so good for that, the SLA is 20k requests per day). Also interesting note, about 10% of people who were referred to the forecast post clicked at least one other page on the site.

So I started CRIME De-Coder in February this year. I have published a few over 30 pages on that site during the year, and have accumulated a total of a few more than 11k site views. This is very similar to the first year of my personal blog, with publishing around 30 posts and getting just over 7k total views for the year. This is almost entirely via direct referrals (I share posts on LinkedIn, google searches are just a trickle).

Sometimes people are like “cool you started your own company”, but really I did that same type of consulting since I was in grad school. I have had a fairly consistent set of consulting work (around $20k per year) for quite awhile. That was people cold asking me for help with mostly statistical analysis.

The reason I started CRIME De-Coder was to be more intentional about it – advertise the work I do, instead of waiting for people to come to me. Doing your own LLC is simple, and it is more a website than anything.

So how much money did I make this year for CRIME De-Coder? Not that much more than $30k (I don’t count the data competitions I won in that metric, but actual commissioned work.) I do have substantially more work lined up for next year though already (more on the order of $50k so far, although no doubt some of that will fall through).

I sent out something like 30 some soft pitches during the year to people in my extended network (first or strong second degree). I don’t know the typical rate of something like that, but mine was abysmal – I was lucky to get an email response no thanks. These are just ideas like “hey I could build you an interactive dashboard with your data” or “you paid this group $150k, I would do that same thing for less than $30k”.

Having CRIME De-Coder did however did increase my first degree network to “ask me for stat analysis” more. So it was definitely worth spending time doing the website and creating the LLC. Don’t ask me for advice though about making pitches for consulting work!

The goal is ultimately to be able to go solo, and just do my consulting work as my full time job. It is hard to see that happening though – even if I had 5 times the amount of work lined up, it would still just be short term single projects. I have pitched more consistent retainers, but no one has gone for that. Small police departments if interested in outsourcing crime analysis let me know – that is I believe the best solution for them. Also have pitched to think tanks to hire me part time as well, as well as CJ programs to hire me in part time roles as well. I understand the CJ programs no interest, I am way more expensive than typical adjunct, I am a good deal for other groups though. (I mean I am good deal for CJ programs as well, part of the value add is supervising students for research, but universities don’t value that very high.)

I will ultimately keep at it – sending email pitches is easy. And I am hoping that as the website gets more organic search referrals, I will be able to break out of my first degree network.

Leave a comment

by apwheele on December 26, 2023 • Permalink

Posted in Crime Analysis, Criminal Justice, Meta, website, writing

Tagged year-in-review

Posted by apwheele on December 26, 2023

https://andrewpwheeler.com/2023/12/26/year-in-review-2023-how-did-crime-de-coder-do/

Search for:
Recent Posts
Categories
Categories
Site RSS Feeds
- RSS - Posts
- RSS - Comments
Follow Blog via Email

Enter your email address to follow this blog and receive notifications of new posts by email.

Email Address:

Join 377 other subscribers
aoristic cartography census choropleth citeulike color cost-benefit courses crime-mapping crime-trends Crime Analysis Criminal Justice data-manipulation data visualization deep-learning excel flow-data geocoding ggplot2 github google-streetview-api grammar of graphics group-based-trajectory gun-violence healthcare homicide-rates hot spots hypothesis-testing kernel-density linear programming logistic-regression machine-learning MACRO mapping matplotlib meta multi-level negative-binomial network NetworkX officer-involved-shooting open-science paper Papers peer-review Poisson prediction Predictive-Policing preprint presentation Python Python-programability pytorch quasi-experiment r recidivism regression resources scholarly seaborn shootings simulation slopegraph small-multiples social-networking SPSS stackexchange Stata statistics survey time-series uncertainty wdd web-scraping writing
Top Posts & Pages
Stack Exchange

Pre-print on JTC and examining the Buffer Zone

Flock Story

Webinar on Monitoring Temporal Crime Trends

Monitoring Time in Between Events

References

Power with Likert items

AI writing

Part 1, Establishing it is all about the counts

Part 2, MDEs

References

Harmweighted Hotspots

Using ESRI Python API

Making a project and first commit

Branches

Merge Conflicts

Recent Posts

Categories

Site RSS Feeds

Follow Blog via Email

Top Posts & Pages

Stack Exchange