Testing the equality of two regression coefficients

The default hypothesis test that software spits out when you run a regression model is of the null that the coefficient equals zero. Frequently there are other, more interesting tests though, and this is one I’ve come across often — testing whether two coefficients are equal to one another. The big point to remember is that Var(A-B) = Var(A) + Var(B) - 2*Cov(A,B). This formula gets you pretty far in statistics (and is one of the few I have memorized).

Note that this is not the same as testing whether one coefficient is statistically significant and the other is not. See this Andrew Gelman and Hal Stern article that makes this point. (The link is to a pre-print PDF, but the article was published in The American Statistician.) I will outline four different examples where I see people make this particular mistake.

One is when people have different models, and they compare coefficients across them. For example, say you have a base model predicting crime at the city level as a function of poverty, and then in a second model you include other control covariates on the right hand side. Let’s say the first effect estimate of poverty is 3 (1), where the value in parentheses is the standard error, and the second estimate is 2 (2). The first effect is statistically significant, but the second is not. Do you conclude that the effect sizes are different between the models though? The evidence for that is much less clear.

To construct the estimate of how much the effect declined, the decline would be 3 - 2 = 1, a decrease of 1. What is the standard error around that decrease though? We can use the formula for the variance of a difference that I noted before to construct it. The variance of each parameter estimate is its standard error squared, so sqrt(1^2 + 2^2) =~ 2.24 is the standard error of the difference — assuming the covariance between the estimates is zero. So the standard error around our estimated decline is quite large, and we can’t be sure that the estimated effect of poverty is appreciably different between the two models.
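
Just to make the arithmetic concrete, here is that back-of-the-envelope check written out in R (the numbers are the hypothetical estimates above, and the zero covariance is an assumption):

b_base <- 3; se_base <- 1   #poverty effect in the base model
b_full <- 2; se_full <- 2   #poverty effect with the controls added

diff    <- b_base - b_full               #estimated decline, 1
se_diff <- sqrt(se_base^2 + se_full^2)   #assumes Cov = 0, ~2.24
z       <- diff / se_diff                #~0.45, nowhere near +/- 2
c(Difference = diff, SE = se_diff, Z = z)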

There are more complicated ways to measure moderation, but this ad-hoc approach can be easily applied as you read other people’s work. The assumption of zero covariance between the parameter estimates is not as big a deal as it may seem. In large samples these covariances tend to be very small, and they are frequently negative. So even though we know that assumption is wrong, just pretending the covariance is zero is not a terrible folly.

The second is where you have models predicting different outcomes. So going with our same example, say you have a model predicting property crime and a model predicting violent crime. Again, I will often see people make a mistake equivalent to the moderator scenario, and say that the effect of poverty is larger for property crime than for violent crime because one is statistically significant and the other is not.

In this case if you have the original data, you actually can estimate the covariance between those two coefficients. The simplest way is to estimate that covariance via seemingly unrelated regression. If you don’t though, such as when you are reading someone else’s paper, you can just assume the covariance is zero. Because the parameter estimates often have negative correlations, this assumption will make the standard error estimate smaller than it should be (so the test will be slightly liberal).
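
If you do have the original data, here is a sketch of what the seemingly unrelated regression approach could look like in R, using the systemfit package and simulated city-level data (the variable names are made up, and I index the coefficients by systemfit's equation_variable labels, so check names(coef(sur)) if yours differ):

library(systemfit)
set.seed(2)
city_data <- data.frame(Poverty = rnorm(200), Density = rnorm(200))
city_data$PropertyRate <- 3 + 2.0*city_data$Poverty + rnorm(200)
city_data$ViolentRate  <- 1 + 1.5*city_data$Poverty + rnorm(200)

eqs <- list(property = PropertyRate ~ Poverty + Density,
            violent  = ViolentRate  ~ Poverty + Density)
sur <- systemfit(eqs, method = "SUR", data = city_data)

#Wald test of equal poverty effects, now using the estimated covariance
b <- coef(sur); V <- vcov(sur)
d    <- b["property_Poverty"] - b["violent_Poverty"]
se_d <- sqrt(V["property_Poverty","property_Poverty"] +
             V["violent_Poverty","violent_Poverty"] -
             2*V["property_Poverty","violent_Poverty"])
c(Difference = d, SE = se_d, Z = d/se_d)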

The third is where you have different subgroups in the data, and you examine the differences in coefficients. Say you had recidivism data for males and females, and you estimated one model for the effect of a treatment on males and another model for females. So we have two models:

Model Males  : Prob(Recidivism) = B_0m + B_1m*Treatment
Model Females: Prob(Recidivism) = B_0f + B_1f*Treatment

Where the B_0? terms are the intercepts, and the B_1? terms are the treatment effects. This is a case where you can stack the data and estimate an interaction term to get the difference in the effects and its standard error. So we can estimate a combined model for both males and females as:

Combined Model: Prob(Recidivism) = B_0c + B_1c*Treatment + B_2c*Female + B_3c(Female*Treatment)

Where Female is a dummy variable equal to 1 for female observations, and Female*Treatment is the interaction term for the treatment variable and the Female dummy variable. Note that you can rewrite the model for males and females as:

Model Mal.: Prob(Recidivism) =     B_0c      +      B_1c    *Treatment    ....(when Female=0)
Model Fem.: Prob(Recidivism) = (B_0c + B_2c) + (B_1c + B_3c)*Treatment    ....(when Female=1)

So we can interpret the interaction term, B_3c, as the difference in the treatment effect for females relative to males. The standard error of this interaction takes the covariance term into account, unlike estimating two totally separate equations. (You can stack the property and violent crime outcomes I mentioned earlier in an analogous way to the subgroup example.)
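
Here is a sketch of the stacked approach in R with simulated recidivism data (a logistic model stands in for whatever functional form you actually use):

set.seed(3)
recid <- data.frame(Female = rbinom(2000, 1, 0.3), Treatment = rbinom(2000, 1, 0.5))
p <- plogis(-0.5 - 0.4*recid$Treatment + 0.2*recid$Female +
             0.3*recid$Female*recid$Treatment)
recid$Recid <- rbinom(2000, 1, p)

#stacked model: the Treatment:Female coefficient (B_3c) is the male/female
#difference in treatment effects, with a standard error that accounts for
#the covariance between the estimates
m <- glm(Recid ~ Treatment*Female, family = binomial, data = recid)
summary(m)$coefficients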

The fourth and final example is the simplest: two regression coefficients in the same equation. One example is from my dissertation, on the correlates of crime at small spatial units of analysis. I test whether different places that sell alcohol — such as liquor stores, bars, and gas stations — have the same effect on crime. For simplicity I will just test two effects, whether liquor stores have the same effect as on-premise alcohol outlets (which include bars and restaurants). So let’s say I estimate a Poisson regression equation as:

log(E[Crime]) = Intercept + b1*Bars + b2*LiquorStores

And then my software spits out:

                  B     SE      
Liquor Stores    0.36  0.10
Bars             0.24  0.05

And then lets say we also have the variance-covariance matrix of the parameter estimates – which most stat software will return for you if you ask it:

                Liquor_Stores    Bars
Liquor_Stores    0.01
Bars            -0.0002          0.0025

On the diagonal are the variances of the parameter estimates; their square roots equal the reported standard errors in the first table. So the difference estimate is 0.36 - 0.24 = 0.12, and the standard error of that difference is sqrt(0.01 + 0.0025 - 2*-0.0002) =~ 0.11. You can take the ratio of the difference and its standard error, here 0.12/0.11 =~ 1.1, and treat that as a test statistic from a normal distribution. The usual rule that it needs to be beyond plus or minus two to be statistically significant at the 0.05 level applies, so the difference is not statistically significant.
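
Before moving on, here is a quick R sketch of that by-hand calculation, with simulated data standing in for my dissertation data:

set.seed(1)
street_units <- data.frame(Bars = rpois(5000, 0.5), LiquorStores = rpois(5000, 0.3))
street_units$Crime <- rpois(5000, exp(-0.5 + 0.24*street_units$Bars +
                                       0.36*street_units$LiquorStores))

m <- glm(Crime ~ Bars + LiquorStores, family = poisson, data = street_units)
b <- coef(m); V <- vcov(m)

d    <- b["LiquorStores"] - b["Bars"]           #difference in effects
se_d <- sqrt(V["LiquorStores","LiquorStores"] + V["Bars","Bars"] -
             2*V["LiquorStores","Bars"])        #Var(A-B) formula
c(Difference = d, SE = se_d, Z = d/se_d)        #compare Z to +/- 2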

This is called a Wald test specifically. I will follow up with another blog post and some code examples on how to do these tests in SPSS and Stata. For completeness and just because, I also list two more ways to accomplish this test for the last example.


There are two alternative ways to do this test though. One is by doing a likelihood ratio test.

So we have the full model as:

 log(E[Crime]) = b0 + b1*Bars + b2*Liquor_Stores [Model 1]
 

And we have the reduced model as:

 log(E[Crime]) = b4 + b5*(Bars + Liquor_Stores)  [Model 2]
 

So we just estimate the full model with Bars and Liquor Stores on the right hand side (Model 1), then estimate the reduced model (Model 2) with the sum of Bars + Liquor Stores on the right hand side. The test statistic is then two times the difference in the log-likelihoods, compared against a chi-square distribution. In this case the reduced model imposes one restriction, so the test has one degree of freedom.

I give an example of doing this in R on CrossValidated. This test is nice because it extends to testing multiple coefficients, say if I wanted to test bars = liquor stores = convenience stores. The prior individual Wald tests are not as convenient for testing the equality of more than two coefficients at once.
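
Along the same lines as that CrossValidated answer, here is a minimal sketch of the likelihood ratio test using the simulated street_units data from the earlier Wald test example:

full    <- glm(Crime ~ Bars + LiquorStores,    family = poisson, data = street_units)
reduced <- glm(Crime ~ I(Bars + LiquorStores), family = poisson, data = street_units)

#LR statistic is 2 times the difference in log-likelihoods, 1 degree of freedom
lr <- 2*(as.numeric(logLik(full)) - as.numeric(logLik(reduced)))
c(LR = lr, p = 1 - pchisq(lr, df = 1))
anova(reduced, full, test = "Chisq")   #same test via the deviance table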


Here is another way to have the computer more easily spit out the Wald test for the difference between two coefficients in the same equation. So if we have the model (the lack of an intercept does not matter for the discussion here):

y = b1*X + b2*Z [eq. 1]

We can test the null that b1 = b2 by rewriting our linear model as:

y = B1*(X + Z) + B2*(X - Z) [eq. 2]

And the test of the B2 coefficient is our test of interest. The logic goes like this — we can expand [eq. 2] to:

y = B1*X + B1*Z + B2*X - B2*Z [eq. 3]

which you can then regroup as:

y = X*(B1 + B2) + Z*(B1 - B2) [eq. 4]

and note the equalities between equations 4 and 1.

B1 + B2 = b1; B1 - B2 = b2

So the test of whether B2 equals zero is exactly the test of whether b1 equals b2. B2 is a little tricky to interpret in terms of effect size for how much larger b1 is than b2 though: since b1 - b2 = 2*B2, it is only half of the difference. An easier way to estimate that effect size is to use (X - Z)/2 as the second regressor instead; its coefficient and confidence interval are then the estimate of how much larger the effect of X is than the effect of Z.

Note that this gives an estimate equivalent to conducting the Wald test by hand as I showed before.
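
Using the same simulated data as before, the rewritten model is just a matter of constructing the two new regressors:

#the second term's coefficient is the Bars - LiquorStores difference in effects,
#with its own standard error and Wald z test of the null that it equals zero
m2 <- glm(Crime ~ I(Bars + LiquorStores) + I((Bars - LiquorStores)/2),
          family = poisson, data = street_units)
summary(m2)$coefficients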

Regression to the mean – a tale of change scores

This is a real example from my work illustrating regression to the mean. I have a scale measuring impulsivity of offenders. I had an intervention that used cognitive behavioral therapy (CBT) in a boot camp for one group, and business as usual for another (just plain old jail). I have measures of impulsivity at pre, post, and 6-month follow-up (which I label as post2). CBT is supposed to reduce impulsivity, and hopefully keep it that way.

I find that those who gained the most during the intervention tend to revert back to their prior scores once they leave the boot camp. That is, the measure [post - pre], the gain in boot camp, has a negative correlation with [post2 - post], the loss after boot camp. Is this due to the intervention being shitty? No! It is not — this is the result of regression to the mean. This pattern does not by itself show any real relationship between the measures; it will happen even if the impulse scores are totally random.

Note that the definition of covariance is:

Cov(X,Y) = E[(X - E[X])*(Y - E[Y])]

Where E represents the expectation, and Cov(X,Y) of course means the covariance between X and Y. For easier equations we can assume the mean of the impulse scale is zero across all three waves, which makes the means of the change scores zero as well (without any loss of generality). So dropping the inner expectations, this equation reduces to:

Cov(X,Y) = E[X*Y]

So defining Change1 = post - pre and Change2 = post2 - post, and expanding out to the original components, we have:

Cov(Change1,Change2) = Cov(post-pre,post2-post) = E[ (post-pre)*(post2-post) ]

The last result can then be expanded to:

E[ post*post2 - post*post - pre*post2 + pre*post ]

Because of the bilinearity of expectation, these can be further teased out:

E[ post*post2 ] - E[ post*post ] - E[ pre*post2 ] + E[ pre*post]

Note we can rewrite this back into variances and covariances of the original levels:

Cov(post,post2) - Var(post) - Cov(pre,post2) + Cov(pre,post)

There are two things to note here. 1) The covariance of the change scores can be written entirely as a function of the variances and covariances of the levels. The change scores do not supply information independent of the levels themselves.

2) If the data are random (that is, the covariances between all the levels are zero), the covariance between the change scores will still be negative. This is because of the minus sign in front of the variance of the post term. For random data all the other terms are zero, so Cov(Change1,Change2) = -Var(post), while Var(Change1) = Var(Change2) = 2*Var(post) when the waves have equal variances. This results in the correlation between the change scores being -Var(post)/(2*Var(post)) = -1/2.

For a simple example in R:

> set.seed(10)
> n <- 10000 #sample size
> t1 <- rnorm(n) #three random vectors
> t2 <- rnorm(n)
> t3 <- rnorm(n)
> levels <- data.frame(t1,t2,t3)
> differ <- data.frame(c1=t2-t1,c2=t3-t2)
> 
> #correlations in levels are approximately zero
> cor(levels)
              t1           t2            t3
t1  1.0000000000  0.001874345 -0.0007006367
t2  0.0018743450  1.000000000 -0.0045967380
t3 -0.0007006367 -0.004596738  1.0000000000
> 
> #correlation of differences is -0.5
> cor(differ)
           c1         c2
c1  1.0000000 -0.4983006
c2 -0.4983006  1.0000000

Sometimes I see people talk about regression to the mean as if it is a sociological thing, like something that needs to be explained in terms of human behavior. It is not; it is entirely mathematical.

This is also one of the reasons I don’t like using change scores, either as independent or dependent variables. They typically can be rewritten in terms of the levels, and involve coefficient restrictions that can have strange consequences. There are some situations (such as fixed effects models) where a change score makes sense as the dependent variable. I haven’t seen a situation where change scores make sense as independent variables.

Negative Binomial regression and predicted probabilities in SPSS

For my dissertation I have been estimating negative binomial regression models predicting the counts of crimes at small places (i.e. street segments and intersections). When evaluating the fit of Poisson regression models and their variants, you typically make a line plot of the observed percent of each integer value versus the percent predicted by the model. This is particularly pertinent for data that have a high proportion of zeros, as the negative binomial may still under-predict the number of zeros.

I mistakenly thought that to make such a plot you could simply estimate the predicted value following the negative binomial regression model and then round the predictions. But I was incorrect; to make the typical predicted versus observed plot you need to estimate the probability of an observation taking each integer value, and then take the mean of that probability over all the observations. That mean is the predicted percent given the model. Fortunately I caught my mistake before I gave some talks on my work recently, and I will show how to make said calculations in SPSS. I have posted the data to replicate this work at this dropbox link, so you can download the data and follow along.
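
To make the logic concrete, here is a sketch of that calculation in R with simulated data (the SPSS version follows in the rest of the post):

library(MASS)
set.seed(4)
dat <- data.frame(X1 = rnorm(5000), X2 = rnorm(5000))
dat$Crime <- rnbinom(5000, size = 1.2, mu = exp(-0.3 + 0.5*dat$X1 + 0.3*dat$X2))

m  <- glm.nb(Crime ~ X1 + X2, data = dat)
mu <- predict(m, type = "response")   #predicted mean for each observation
k  <- 0:8
#probability of each integer value per observation, then averaged over all
#observations - the average is the predicted percent for the plot
pred_prop <- sapply(k, function(j) mean(dnbinom(j, size = m$theta, mu = mu)))
obs_prop  <- sapply(k, function(j) mean(dat$Crime == j))
round(cbind(k, obs_prop, pred_prop), 3)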

First, I got some help on how to estimate the predicted probabilities via an answer to my question at CrossValidated. That answer lists the formula one needs to estimate the predicted probability of any integer value N after the negative binomial model. To calculate that value we need to define two helper functions in SPSS, the factorial and the complete gamma function. Both have SPSS tech help pages showing how to calculate them.

For the factorial we can use its general relationship with the LNGAMMA function, y! = EXP(LNGAMMA(y+1)).


DEFINE !FACT (!POSITIONAL = !ENCLOSE("(",")"))
( EXP(LNGAMMA((!1)+1)) )
!ENDDEFINE.

And for the complete gamma function we can use a relationship to the CDF of the gamma function.


DEFINE !GAMMAF (!POSITIONAL = !ENCLOSE("(",")"))
( EXP(-1)/(!1)/(CDF.GAMMA(1,(!1),1) - CDF.GAMMA(1,(!1)+1,1)) )
!ENDDEFINE.

And given these two functions, we can create a macro that takes the following parameters and returns the predicted probability we are interested in:

  • Out – new variable name for the predicted probability of taking on that integer value
  • PredN – the predicted mean of the variable conditional on the covariates
  • Disp – estimate of the dispersion parameter
  • Int – the integer value being predicted

DEFINE !PredNB (Out = !TOKENS(1)
               /PredN = !TOKENS(1)
               /Disp = !TOKENS(1)
               /Int = !TOKENS(1) )
COMPUTE #a = (!Disp)**(-1).
COMPUTE #mu = !PredN.
COMPUTE #Y = !Int.
COMPUTE #1 = (!GAMMAF(#Y + #a))/(!FACT(#Y)*!GAMMAF(#a)).
COMPUTE #2 = (#a/(#a+#mu))**#a.
COMPUTE #3 =  (#mu/(#a + #mu))**#Y.
COMPUTE !Out =  #1*#2*#3.
!ENDDEFINE.
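
As a check on the parameterization, with a = 1/Disp the macro computes GAMMAF(y+a)/(FACT(y)*GAMMAF(a)) * (a/(a+mu))^a * (mu/(a+mu))^y, which is the same probability R's dnbinom function returns if you set size = 1/Disp and mu to the predicted mean. A quick sketch with made-up values:

mu <- 0.8; disp <- 1.5; y <- 0:8
round(dnbinom(y, size = 1/disp, mu = mu), 7)   #should match the macro's output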

But to make our plot we want to estimate this predicted probability over a range of values, so I created a helper macro that, instead of taking only one integer value, takes the end integer value and calculates the predicted probabilities of zero through N.


DEFINE !PredNBRange (Num = !TOKENS(1)
                    /Mean = !TOKENS(1)
                    /Disp = !TOKENS(1)
                    /Stub = !TOKENS(1) )
!DO !I = 0 !TO !Num
  !LET !Base = !CONCAT(!Stub,!I)
  !PredNB Out = !Base PredN = !Mean Disp = !Disp Int = !I.
!DOEND 
!ENDDEFINE.

The example data and code I have posted compare these values to the ones predicted from Stata, and show my function agrees with Stata to about 7 decimal places. I won’t go through all of those commands here, but I will show how to make the predicted proportions plot after you have a vector of predicted probabilities (you can download all of the code and data at the link I referenced earlier in the post).

So let’s say that you have a set of variables NB0 TO NB8, and these are the predicted probabilities of integer values 0 to 8 for the observations in your dataset. To subsequently get the mean of the predictions, you can use the AGGREGATE command. Having no variables specified on the BREAK subcommand tells SPSS to aggregate over all cases in the dataset. Here I export the file to a new dataset named PredNBAgg.


DATASET DECLARE PredNBAgg.
AGGREGATE OUTFILE='PredNBAgg'
  /BREAK = 
  /NB0 TO NB8 = MEAN(NB0 TO NB8).

To merge with the observed proportions later on, I reshape the dataset using VARSTOCASES so the mean values are all in the same column. Here I also make a category for the predicted probability of being 9 or higher (which isn’t typical for these types of plots, but is something I believe is useful).


DATASET ACTIVATE PredNBAgg.
COMPUTE NB9_Plus = 1 - SUM(NB0 TO NB8).
VARSTOCASES /MAKE NBPred FROM NB0 TO NB9_Plus /INDEX Int.
COMPUTE Int = Int - 1. /*Index starts at 1 instead of 0 */.

Now I reactivate my original dataset, here named PredNegBin, calculate the binned observed values (with observations 9 and larger recoded to just 9) and then aggregate those values.


DATASET ACTIVATE PredNegBin.
RECODE TotalCrime (9 THRU HIGHEST = 9)(ELSE = COPY) INTO Int.
DATASET DECLARE PredObsAgg.
AGGREGATE OUTFILE='PredObsAgg'
  /BREAK = Int
  /TotalObs = N.

To get the observed proportions within each category, I need to do another aggregation to get the total number of observations, and then divide the total for each integer value by the total number of observations.


DATASET ACTIVATE PredObsAgg.
AGGREGATE OUTFILE = * MODE=ADDVARIABLES OVERWRITE=YES
  /BREAK = 
  /TotalN=SUM(TotalObs).
COMPUTE PercObs = TotalObs / TotalN.

Now we can merge the two aggregated datasets together. I also close the old PredNBAgg dataset and define a value label so I know that the 9 integer category is really 9 and larger.


MATCH FILES FILE = *
  /FILE = 'PredNBAgg'
  /BY Int.
DATASET CLOSE PredNBAgg.
VALUE LABELS Int 9 '9+'.

Now at this point you could make the plot with the predicted and observed proportions in separate variables, but this would take two ELEMENT statements within a GGRAPH command (and I like to make line plots with both lines and points, so it would actually take 4 ELEMENT statements). So what I do here is reshape the data one more time with VARSTOCASES, and make a categorical variable to identify whether the proportion is the observed value or the predicted value from the model. Then you can make your chart.


VARSTOCASES /MAKE Dens FROM PercObs NBPred /Index Type /DROP TotalObs TotalN.
VALUE LABELS Type 
 1 'Observed'
 2 'Predicted'.
GGRAPH
  /GRAPHDATASET NAME="graphdataset" VARIABLES=Int Dens Type
  /GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
  SOURCE: s=userSource(id("graphdataset"))
  DATA: Int=col(source(s), name("Int"), unit.category())
  DATA: Type=col(source(s), name("Type"), unit.category())
  DATA: Dens=col(source(s), name("Dens"))
  GUIDE: axis(dim(1), label("Total Crimes on Street Units"))
  GUIDE: axis(dim(2), label("Percent of Streets"))
  GUIDE: legend(aesthetic(aesthetic.color.interior), null())
  SCALE: cat(aesthetic(aesthetic.color.interior), map(("1",color.black),("2",color.red)))
  ELEMENT: line(position(Int*Dens), color.interior(Type))
  ELEMENT: point(position(Int*Dens), color.interior(Type), color.exterior(color.white), size(size."7"))
END GPL.

And voila, in the resulting chart you can see the predicted values are so close to the observed that it is difficult to even see the observed values. Here instead of creating a legend I manually added labels to the chart. A better chart may be to plot the predicted minus the observed proportions (especially if you were comparing multiple Poisson-type models), but it should be quite plain to see that the negative binomial fits the observed data quite well in this instance.

Similar to Paul Allison’s experience, even with nearly 64% of the observations being zero, the negative binomial model fits just fine. I recently fit some other models with the same data (but a different outcome) in which the proportion of zeros was nearer to 90%. In that instance the negative binomial model would not converge, so estimating a zero-inflated model was necessary. Here though it is clearly not necessary, and I would prefer the negative binomial model over a ZIP (or hurdle) model, as I see no obvious reason to prefer the complication of a separate zero equation in addition to the count equation.