Group based trajectory models in Stata – some graphs and fit statistics

For my advanced research design course this semester I have been providing code snippets in Stata and R. This is the first time I’ve really sat down and programmed extensively in Stata, and this is a follow-up to produce some of the same plots and model fit statistics for group based trajectory models as this post in R. The code and the simulated data I made to reproduce this analysis can be downloaded here.

First, for my own notes: my version of Stata is on a server here at UT Dallas, so I cannot simply run

net from http://www.andrew.cmu.edu/user/bjones/traj
net install traj, force

to install the group based trajectory code. Instead, I now have this set of code in the header of all my do files

*let Stata know to also search this folder for user-written commands and plug ins
adopath + "C:\Users\axw161530\Documents\Stata_PlugIns"
*and tell net install to put newly installed packages in that same folder
net set ado "C:\Users\axw161530\Documents\Stata_PlugIns"

So after running that I can install the traj command, and Stata will know where to look for it!

Once that is taken care of, and after setting the working directory, I can simply load in the csv file. Here my first variable was read in as ïid instead of id (I’m thinking because of the encoding in the csv file), so I rename that variable to id.

*Load in csv file
import delimited GroupTraj_Sim.csv
*BOM mark makes the id variable weird
rename ïid id

Second, the traj model expects the data in wide format (which this data set already is), and has counts in count_1, count_2, ..., count_10. The traj command also wants you to input a time variable, which I do not have in this file. So I create a set of t_ variables to mimic the counts, going from 1 to 10.

*Need to generate a set of time variables to pass to traj, just label 1 to 10
forval i = 1/10 { 
  generate t_`i' = `i'
}

Now we can estimate our group based models, and we get a pretty nice default plot.

traj, var(count_*) indep(t_*) model(zip) order(2 2 2) iorder(0)
trajplot

Now for absolute model fit statistics, there are the average posterior probabilities, the odds of correct classification, and the observed classification proportion versus the expected classification proportion. Here I made a program that I will surely be ashamed of later (I should do all the calculations in matrices rather than brutalize the data), but it works and produces an ugly table to get us these stats.

*I made a program to print out summary stats
*drop any earlier definition so the do file can be re-run without an error
capture program drop summary_table_procTraj
program summary_table_procTraj
    preserve
    *now lets look at the average posterior probability
    *Mp ends up as the maximum posterior probability for each observation
    gen Mp = 0
    foreach i of varlist _traj_ProbG* {
        replace Mp = `i' if `i' > Mp
    }
    sort _traj_Group
    *and the odds of correct classification
    by _traj_Group: gen countG = _N
    by _traj_Group: egen groupAPP = mean(Mp)
    by _traj_Group: gen counter = _n
    gen n = groupAPP/(1 - groupAPP)
    gen p = countG/ _N
    gen d = p/(1-p)
    gen occ = n/d
    *Estimated proportion for each group
    *scalar() makes sure c is read as the scalar and not as a variable abbreviation
    scalar c = 0
    gen TotProb = 0
    foreach i of varlist _traj_ProbG* {
        scalar c = scalar(c) + 1
        quietly summarize `i'
        replace TotProb = r(sum)/ _N if _traj_Group == scalar(c)
    }
    gen d_pp = TotProb/(1 - TotProb)
    gen occ_pp = n/d_pp
    *This displays the group number [_traj_~p],
    *the count per group (based on the max post prob), [countG]
    *the average posterior probability for each group, [groupAPP]
    *the odds of correct classification (based on the max post prob group assignment), [occ]
    *the odds of correct classification (based on the weighted post. prob), [occ_pp]
    *and the observed probability of groups versus the probability [p]
    *based on the posterior probabilities [TotProb]
    list _traj_Group countG groupAPP occ occ_pp p TotProb if counter == 1
    restore
end

This should work after any model as long as the naming conventions for the assigned groups are _traj_Group and the posterior probabilities are in the variables _traj_ProbG*. So when you run

summary_table_procTraj

You get this ugly table:

     | _traj_~p   countG   groupAPP        occ     occ_pp       p    TotProb |
     |-----------------------------------------------------------------------|
  1. |        1      103   .9379318   43.57342   43.60432   .2575   .2573645 |
104. |        2      136   .9607258   47.48513   48.30997     .34   .3361462 |
240. |        3      161   .9935605   229.0413   225.2792   .4025   .4064893 |

The groupAPP are the average posterior probabilities – here you can see they are all quite high. occ is the odds of correct classification, and again they are all quite high. Update: Jeff Ward stated that I should be using the weighted posterior proportions for the OCC calculation, not the proportions based on the max. post. probability (that I should be using TotProb in the above table instead of p). So I have updated to include an additional column, occ_pp based on that suggestion. I will leave occ in though just to keep a paper trail of my mistake.
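As a quick arithmetic check of how the two OCC columns are built, the first row of the table can be reproduced by hand. This is only a sketch using the rounded values printed above (the scalar names are made up for illustration), so expect agreement to a few decimals rather than exactly.

```stata
* values copied from the first row of the table above (group 1)
scalar chk_app = .9379318
scalar chk_p   = .2575
scalar chk_tp  = .2573645

* odds of correct classification = odds(groupAPP) / odds(group proportion)
scalar chk_occ    = (chk_app/(1 - chk_app)) / (chk_p/(1 - chk_p))
scalar chk_occ_pp = (chk_app/(1 - chk_app)) / (chk_tp/(1 - chk_tp))

* these should land close to the occ and occ_pp columns (about 43.57 and 43.60)
display "occ    = " chk_occ
display "occ_pp = " chk_occ_pp
```

The only difference between the two is the denominator: occ uses the proportion assigned by the maximum posterior probability, occ_pp uses TotProb.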

p is the proportion in each group based on the assignments for the maximum posterior probability, and TotProb is the expected proportion based on the sums of the posterior probabilities. TotProb should be the same as in the Group Membership part at the bottom of the traj model output. They are close (to 5 decimals), but not exactly the same (and I do not know why that is the case).
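To make the distinction between p and TotProb concrete, here is a small sketch on made-up posterior probabilities for four hypothetical subjects and two groups (the numbers are invented for illustration, not taken from the simulated data). Note that clear drops whatever data are in memory, so try it in a separate session.

```stata
* toy data: four hypothetical subjects, two groups (values are made up)
clear
input _traj_ProbG1 _traj_ProbG2
.90 .10
.80 .20
.30 .70
.05 .95
end

* assign each subject to the group with the larger posterior probability
gen _traj_Group = cond(_traj_ProbG1 >= _traj_ProbG2, 1, 2)

* p for group 1: share of subjects assigned to it
quietly count if _traj_Group == 1
scalar p_g1 = r(N)/_N

* TotProb for group 1: mean of its posterior probability column
quietly summarize _traj_ProbG1
scalar totprob_g1 = r(mean)

display "p = " p_g1 "   TotProb = " totprob_g1
```

Here two of the four subjects are assigned to group 1, so p is one half, while the mean posterior probability for group 1 is a bit higher because the third subject still carries some weight for that group.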

Next, to generate a plot of the individual trajectories, I want to reshape the data to long format. I use preserve in case I want to go back to wide format later and estimate more models. If you need to look to see how the reshape command works, type help reshape at the Stata prompt. (Ditto for help on all Stata commands.)

preserve
reshape long count_ t_, i(id)

To get the behavior I want in the plot, I use a scatter plot but have the points connected via c(L). Then I create small multiples for each trajectory group using the by() option. Before that I slightly jitter the count data, so the lines are not always perfectly overlapped. I make the lines thin and grey. I would also use transparency, but the version of Stata I am using does not support it (Stata 15 and later do).

gen count_jit = count_ + ( 0.2*runiform()-0.1 )
graph twoway scatter count_jit t_, c(L) by(_traj_Group) msize(tiny) mcolor(gray) lwidth(vthin) lcolor(gray)

I’m too lazy at the moment to clean up the axis titles and such, but I think this plot of the individual trajectories should always be done. See Breaking Bad: Two Decades of Life-Course Data Analysis in Criminology, Developmental Psychology, and Beyond (Erosheva et al., 2014).
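On the transparency point: newer versions of Stata (15 and later) accept an opacity suffix on colors, for example gray%30, which helps when many trajectories overlap. A minimal sketch on made-up long-format data, with hypothetical variables grp, y, and y_jit standing in for _traj_Group, count_, and count_jit. It will not run on versions before 15, and clear drops the data in memory, so try it in a separate session.

```stata
* toy long-format data: 30 hypothetical subjects, 10 periods each
clear
set seed 10
set obs 300
gen id = ceil(_n/10)
bysort id: gen t_ = _n
gen grp = mod(id, 3) + 1
gen y = rpoisson(grp)
gen y_jit = y + (0.2*runiform() - 0.1)
sort id t_

* gray%30 draws the lines and markers at 30 percent opacity
twoway scatter y_jit t_, c(L) by(grp) msize(tiny) mcolor(gray%30) lwidth(vthin) lcolor(gray%30)
```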

While this fit looks good, this is not the correct number of groups given how I simulated the data. I will give those trying to find the right answer a few hints: none of the groups have a polynomial order higher than 2, and there is a constant zero inflation for the entire sample, so iorder(0) will be the correct specification for the zero inflation part. If you take a stab at it let me know, and I will fill you in on how I generated the simulation.


30 Comments

  1. mario spiezio

     /  April 19, 2018

    Dear Prof Wheeler, thanks a lot for the code. It is really helpful. I was wondering whether it is possible to use margins after traj to compute average marginal effects of the covariates included in risk(). Thank you.

    • I don’t think so, it appears you will need to calculate that effect yourself. I’m not sure exactly how you would do it with the mixture model.

  2. Dimi

     /  May 7, 2018

    Dear Prof Wheeler,
    Thank you for the code.
    For my research, I am trying to use the traj plugin for a survival analysis. As I am not a statistician (I am a physician) you can imagine I need a little help.
    My initial issue is that, although I do not have missing values in the long format, in the wide format (due to the fact that some patients died during the follow-up) I do. When I run the code, these patients with “missing values” are dropped from the analysis. Do you know how to overcome this problem?
    All my best,
    Dimi

  3. lori

     /  August 12, 2018

    can you covary outcome group when you have a risk variable?

    • Not exactly sure what you mean — are you asking about a joint trajectory model with multiple outcomes and predicting probability of being in a particular trajectory?

      Also to see various examples you can type

      help traj

      into Stata and it provides various examples.

      • lori

         /  August 13, 2018

        I have an assessment that is completed across 4 different time points – 12, 18, 24, and 36 months. I am putting scores on a different assessment completed at 6 months as a risk variable. Can I also covary outcome group along with this risk marker to control for diagnosis at 36 months (the outcome group)?

      • I think that would just be something like

        traj, var(y1-y4) model(logit) order(2 2 2 2) risk(pre)

        if I am understanding correctly. [Sorry if I am not!]

      • lori

         /  August 13, 2018

        My formula is traj, var(abc*) indep(time*) model(cnorm) min(53) max(142) order(3) risk(mu6cmss)

        When i try to covary for outcome group at the last time point, I get:
        . traj, var(abc*) indep(time*) model(cnorm) min(53) max(142) order(3) risk(mu6cmss) cov(dxgroup)
        option cov() not allowed
        r(198);

        . traj, var(abc*) indep(time*) model(cnorm) min(53) max(142) order(3) risk(mu6cmss) tcov(dxgroup)
        The number of variables in tcov1 and var1 must match or be a multiple.
        r(198);

        var has 4 time points, dxgroup has 1 time point

      • tcov is for time-varying covariates. A covariate that is not time-varying can be used to predict which trajectory group a person is likely to be in, but not the shape of the trajectory over time. So covariates that are not time-varying should probably go in the “risk()” option in most cases.

        What exactly is “dxgroup”?

      • lori

         /  August 13, 2018

        dxgroup is the outcome group. We collect data at 12, 18, 24, and 36 months. Then at 36 months, we do a diagnostic check to see if any of our kids end up with a diagnosis of autism. I want to be able to add this into my formula to covary out any effects of outcome group, to check if my predictor variable (an assessment completed at 6 months of age) can predict trajectory membership above and beyond outcome group.

      • Given that description your formula should be:

        traj, var(abc*) indep(time*) model(cnorm) min(53) max(142) order(3) risk(mu6cmss dxgroup)

        with only one group though this is probably not identified. So you will need to change it to something like:

        “order(2 2)” for two trajectory groups or
        “order(2 2 2)” for three trajectory groups etc.

        (With only four time points I would not do a cubic, about the best you can do is a quadratic).

  4. lori

     /  August 13, 2018

    If you have two risk variables, does it matter what order you put them in? If I put the 6 month assessment first followed by dxgroup, I get different information.

    • If you mean should

      “risk(mu6cmss dxgroup)”

      give a different result than

      “risk(dxgroup mu6cmss)”

      it should not. One caveat is that the labelling of the trajectory groups is arbitrary, so they could in theory converge to the same estimates, but different groups get different labels (and subsequently the coefficients predicting trajectory membership may then be altered).

      Given that the mixture models have difficulty converging, that could be a culprit as well.

      This is a bit awkward to give so many comments, just send me an email and it will be easier for me to respond.

  5. Yong Kyu Lee

     /  August 16, 2018

    Thank you for the thorough review.
    I had a great help from it.
    I usually use SAS, so I am not familiar with STATA, but for some political reason I have to do GBTM in STATA.
    And I am wondering whether there is a way to get a file, as with ‘SAS ods output’, of the GBTM results, containing the ID and the group assigned by posterior probability.

    • You can just save the results to a csv file. After you run the traj command, something like:

      ********************
      preserve
      keep ID _traj*
      outsheet using "TrajResults.csv", replace comma
      restore
      ********************
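      In newer versions of Stata, export delimited does the same job as outsheet and accepts a variable list, so the preserve, keep, and restore steps are not needed. A sketch on made-up data (the two rows are invented for illustration, and clear drops the data in memory, so after a real traj run you would only need the last line):

      ```stata
      * toy results: two hypothetical subjects with traj-style variable names
      clear
      input ID _traj_Group _traj_ProbG1 _traj_ProbG2
      1 1 .9 .1
      2 2 .2 .8
      end

      * write only the ID and the traj output variables to a csv file
      export delimited ID _traj* using "TrajResults.csv", replace
      ```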

  6. Risha Gidwani-Marszowski

     /  July 18, 2019

    Thanks so much for this very helpful walkthrough.

    I am running the following code with the -traj- program in Stata looking at monthly cost trajectories. Here, t_1-t_12 represent 12 months:

    *********************************************************************************************
    traj, var(total_cost*) indep(t_*) model(cnorm) min(0) max(748938.3) order (2 2 2)
    **********************************************************************************************

    I get the following error:

    “total_cost2 is not within min1 = 0 and max1 = 748938.3”

    The problem is, the maximum value for total_cost2 is actually $748,938.3. So the error doesn’t make sense to me. If I increase the max to 748938.4, I get another error of “unrecognized command.”

    Question: How can I avoid getting the top error?

    I tried making the “min” and “max” values that exceed the range of the actual cost values in the entire 12-month dataset and that didn’t work.

    I tried specifying min(0) and min1…min(12), with the values for min 1…12 being the actual maximum values in the dataset for those months, and got an error for max(7), suggesting I cannot specify past max(6).

    • Sorry for the late approval — vacation last week! I am not sure about that error, the code looks correct to me.

      If the numbers are too large for whatever reason, you can divide all the dependent variables by a constant (e.g. divide by 10,000). If that does not work I might send Bobby Jones an email with your data and a reproducible example and see if he can give any advice.
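      A minimal sketch of that rescaling, on made-up data (only the total_cost stub and the 10,000 divisor come from the discussion above, the values are hypothetical, and clear drops the data in memory). The min() and max() passed to traj would then need to be on the rescaled units as well, for example a max comfortably above 74.9 instead of 748938.3.

      ```stata
      * toy wide data: three hypothetical subjects, two monthly cost columns
      clear
      input total_cost1 total_cost2
      1200.5 748938.3
      0 530
      45000 10
      end

      * divide every cost column by the same constant
      foreach v of varlist total_cost* {
          replace `v' = `v'/10000
      }
      summarize total_cost*
      ```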

    • A.Carlsen

       /  September 11, 2019

      I have the exact same problem. Did you find a solution to the problem?

  7. Heine Strand

     /  August 9, 2019

    Hi,

    I wonder about the BIC and AIC stats in traj and how to decide the number of groups and slopes based on them. We run a model where we follow dementia patients’ cognition over 8 years. A model with two groups with zero slope provided this stats:

    0,0: BIC= -3980.05 (N=1554) BIC= -3977.56 (N=449) AIC= -3969.35 ll= -3965.35

    While the more reasonable model would be 2 groups with (1,1) or (2,2). For the former, the stats are:

    1,1: BIC= -3679.76 (N=1554) BIC= -3676.04 (N=449) AIC= -3663.72 ll= -3657.72

    As I understand, lower BIC and AIC values are preferable. Here the 0,0 model has the lowest BIC (-3977.56) compared to the 1,1 model (-3676.04), so is it the model to be preferred? This goes against our expectation of the progression, and the trajplot clearly suggests the 1,1 model gives a better fit. Should we pick the model with smallest absolute BIC/AIC? I see several authors have used this strategy, even if the literature seems to suggest that lowest values are best:
    https://stats.stackexchange.com/questions/84076/negative-values-for-aic-in-general-mixed-model.

    Can you help us?

    • Yeah it is confusing, most people report AIC/BIC in positive numbers, so a lower value is better fit. Nagin and company always report it the opposite in their software with negative BIC/AIC values, so a higher value (closer to zero) is better.

      I’ve never sat down and figured out why different folks do it differently!

      So your perceptions match up with your results.
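      As a rough check on the sign convention, the numbers quoted above are consistent with the software reporting AIC = ll - k and BIC = ll - 0.5*k*ln(N), which is the usual definition multiplied by -1/2, so larger (closer to zero) is better. In the sketch below k = 4 is inferred from the difference between AIC and ll for the 0,0 model rather than read from the output, and the scalar names are made up.

      ```stata
      * values quoted in the comment above for the 0,0 model
      scalar chk_ll = -3965.35
      scalar chk_k  = 4

      scalar chk_aic      = chk_ll - chk_k
      scalar chk_bic_obs  = chk_ll - 0.5*chk_k*ln(1554)
      scalar chk_bic_subj = chk_ll - 0.5*chk_k*ln(449)

      * compare with the reported -3969.35, -3980.05 and -3977.56
      display "AIC          = " chk_aic
      display "BIC (N=1554) = " chk_bic_obs
      display "BIC (N=449)  = " chk_bic_subj
      ```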

      • Heine Strand

         /  August 9, 2019

        Thanks, really reassuring! Saved my day 🙂

  8. Hyeonmi

     /  September 18, 2019

    Thanks for your posting. For comparing the results, how can I get a Relative Entropy?

    • I haven’t seen reference to that one in this context — that would be looking at say if you did a mixture of 3 groups vs a mixture of 4 groups? Or are you asking something different?

  9. Viviane

     /  September 26, 2019

    Hi Andrew,
    Thanks a lot for your helpful post!
    I wonder if the do.file above to generate [groupAPP occ TotProb , etc] can also be applied directly to logistic models (binary variables) – without any modification.
    Many thanks in advance,
    Viviane Straatmann

    • It takes advantage of standardized variable names post the traj command. In particular _traj_ProbG* stores the posterior probabilities for each mixture, and _traj_Group stores an integer label for each group.

      So if you say you did something like below I think it will work:

      ****************************
      logit y x
      predict _traj_ProbG1
      gen _traj_ProbG2 = 1 - _traj_ProbG1
      gen _traj_Group = (_traj_ProbG1 < 0.5) + 1
      summary_table_procTraj
      ****************************

      Where group 1 is the positive class, and group 2 is the 0 class. I haven't given any thought though as to whether this makes sense in this context.

      • Viviane

         /  September 27, 2019

        Thanks for replying!
        I’m using the command below to generate the groups (I’m still working on decisions of trajectory shapes and number of groups):

        traj, model(logit) var(poverty_5-poverty_18) indep(year_1-year_5) order(1 1 1)

        From this I get the _traj_Group _traj_ProbG1 _traj_ProbG2 _traj_ProbG3, and then I am running your do.file (below), that seems to be working fine.
        However, I wanna check with you if the calculation used in your program (tested in your example above with a zip model) also works in a logit model.

        program summary_table_procTraj_VSS2
        preserve
        *now lets look at the average posterior probability
        gen Mp = 0
        foreach i of varlist _traj_ProbG* {
        replace Mp = `i' if `i' > Mp
        }
        sort _traj_Group
        *and the odds of correct classification
        by _traj_Group: gen countG = _N
        by _traj_Group: egen groupAPP = mean(Mp)
        by _traj_Group: gen counter = _n
        gen n = groupAPP/(1 - groupAPP)
        gen p = countG/ _N
        gen d = p/(1-p)
        gen occ = n/d
        *Estimated proportion for each group
        scalar c = 0
        gen TotProb = 0
        foreach i of varlist _traj_ProbG* {
        scalar c = c + 1
        quietly summarize `i'
        replace TotProb = r(sum)/ _N if _traj_Group == c
        }
        gen d_pp = TotProb/(1 - TotProb)
        gen occ_pp = n/d_pp
        *This displays the group number [_traj_~p],
        *the count per group (based on the max post prob), [countG]
        *the average posterior probability for each group, [groupAPP]
        *the odds of correct classification (based on the max post prob group assignment), [occ]
        *the odds of correct classification (based on the weighted post. prob), [occ_pp]
        *and the observed probability of groups versus the probability [p]
        *based on the posterior probabilities [TotProb]
        list _traj_Group countG groupAPP occ occ_pp p TotProb if counter == 1
        restore
        end
        summary_table_procTraj_VSS2

        Thanks again for your support!
        Kind regards,
        Viviane

      • Sorry I misinterpreted. Yes, it works the same no matter what link function you are using for the outcome. Those stats are all about the latent classes/mixtures, they aren’t tied directly to the nature of the outcome.

  10. Viviane

     /  September 29, 2019

    Many thanks, Andrew! 🙂

  1. Paper – Replicating Group Based Trajectory Models of Crime at Micro-Places in Albany, NY published | Andrew Wheeler
