

Growth Mixture Modeling and Missing Data 



I wanted to know if I am setting up my GMM correctly or if there is a better way to model the data. My data set has arrest data for individuals starting from when they were juveniles (approximately age 12) up through their current age (ranging from 21 to 67 years). From these raw data, I created a set of 49 indicator variables reflecting whether or not a person was arrested at a given age. Variable 1 represents arrested at age 12 or not for all subjects, variable 2 represents arrested at age 13 or not, and so on up through age 60. If a person is younger than 60, as most in the data set are, I set the indicator variables for all years after their current age to missing. For instance, a 35-year-old would have missing data for the indicators representing ages 36 through 60, with the variables for ages 12 through 35 set to 0 (no arrest) or 1 (arrested). I then ran these data through Mplus using Type = mixture to estimate a GMM with linear and quadratic terms. The model converges (2 to 5 classes, with the 5-class model best), but I get a warning saying the covariance coverage is below the specified limit. I assume this is because of the missing data for the variables representing later years. Should I be concerned about that warning? Is there a better or more correct way to model these data? Thanks in advance! James
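To make the data layout above concrete, here is a minimal Python sketch (not Mplus code) of how the 49 indicators could be built and how the proportion of cases observed on a pair of variables falls off for later ages. The function names and the three example people are hypothetical, and the coverage function is only the plain proportion of cases observed on both variables, which is an assumption about what the warning refers to.

```python
AGES = list(range(12, 61))  # ages 12..60 give 49 indicator variables


def arrest_indicators(current_age, arrest_ages):
    """Return one value per age: 1 = arrested, 0 = not arrested,
    None = missing because the person has not reached that age yet."""
    arrested = set(arrest_ages)
    row = []
    for age in AGES:
        if age > current_age:
            row.append(None)
        else:
            row.append(1 if age in arrested else 0)
    return row


def covariance_coverage(rows, j, k):
    """Proportion of cases with both variable j and variable k observed."""
    both = sum(1 for r in rows if r[j] is not None and r[k] is not None)
    return both / len(rows)


# Hypothetical people: (current age, ages at which an arrest occurred).
people = [(21, [16]), (35, [14, 20]), (67, [])]
rows = [arrest_indicators(age, arrests) for age, arrests in people]

# Coverage for the pair (age 12, age 60): only the oldest person has both.
low_coverage = covariance_coverage(rows, 0, len(AGES) - 1)
```

With mostly young respondents, pairs involving the later ages are observed for only a small share of cases, which is the pattern the question describes.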


So much missing data can be an issue. I suggest using the multiple group multiple cohort approach shown in Example 6.18. 


I looked at 6.18 and see your point. However, I keep getting the error message: "One or more variables in the data set have no nonmissing values." Unlike the example, there are not an equal number of measurements per participant. The number of measurements varies by age group. To approximate the multiple cohort model, I created 4 groups based on age (< 30, 31-40, etc.). My model statement is:
Model: i s q | arrv1@0 arrv2@.1 arrv3@.2 arrv4@.3 arrv5@.4 arrv6@.5... arrv44@4.3;
All measurement points are included in this statement, which I assumed refers to the oldest group. For each younger group, I set up a similar statement but only included the times at which they were measured. For the <30 group, it looks like this:
Model <30: i s q | arrv1@0 arrv2@.1 arrv3@.2 arrv4@.3 arrv5@.4 arrv6@.5... arrv19@1.8;
And so on. This setup generates the missing data error. What am I doing wrong?
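One way to investigate the error message quoted above before rerunning is to check, outside Mplus, which variables have no observed values at all, both overall and within each age group. The Python sketch below is only a diagnostic under assumed names (`all_missing_variables`, the toy `groups` data) and does not claim to reproduce how Mplus itself decides to raise the error.

```python
def all_missing_variables(rows, names):
    """Return the names of variables that are missing (None) in every row."""
    return [
        name
        for j, name in enumerate(names)
        if all(row[j] is None for row in rows)
    ]


# Hypothetical toy data: three variables, two age groups.
names = ["arrv1", "arrv2", "arrv3"]
groups = {
    "younger": [[0, 1, None], [1, None, None]],
    "older": [[0, 0, 1], [1, 0, None]],
}

# Variables with no nonmissing values within each group.
empty_by_group = {
    label: all_missing_variables(rows, names) for label, rows in groups.items()
}
```

Any variable listed for a group is one that the group's model statement refers to without any data behind it, which is worth checking against the ages actually covered by that group.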


Please send your input, data, output, and license number to support@statmodel.com. 


Linda, one follow-up question. I have slogged on with this model and, after figuring out how to deal with the missing data and the coverage issue, have successfully run it as a GMM. Upon adding covariates to the model (regressing both class and the growth factors on covariates), I notice that the part of the printout showing the estimated probabilities for the categorical indicators at each time point by class is no longer presented. Reading through posts on here, I understand this is because the estimated probabilities depend on the covariates. I wanted to calculate them by hand for one of the covariates and classes so I could show the effect of that covariate on the growth curve for that class. I have not been able to figure out the formula for doing so using the intercepts of the growth factors and the estimated coefficients for each growth factor regressed on a covariate (by class). The examples in the Mplus manual (showing conversion from log odds to probabilities) seem not to apply. Is it possible to calculate these probabilities by hand, and if so, can you direct me to an example or the formula? Thanks again, James


The log odds for various combinations of covariate values are calculated using the results from the output and the values of the covariates. See pages 409-410 of the user's guide.
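As a rough illustration of that calculation, the Python sketch below computes a class-specific arrest probability at one time score for a chosen covariate value. It assumes a logit link, a single covariate x, a threshold tau entering with a negative sign, and growth factors evaluated at their conditional means (within-class factor variances are ignored). The function name and all parameter names are hypothetical placeholders for values read off the output; the exact parameterization should be checked against the user's guide.

```python
import math


def arrest_probability(t, x, tau, alpha, gamma):
    """Probability of arrest at time score t for covariate value x in one class.

    alpha: growth factor intercepts for the class, keys "i", "s", "q"
    gamma: coefficients of the growth factors regressed on x, same keys
    tau:   threshold of the binary indicator (assumed equal across time)
    """
    i = alpha["i"] + gamma["i"] * x  # conditional mean of the intercept factor
    s = alpha["s"] + gamma["s"] * x  # conditional mean of the linear slope
    q = alpha["q"] + gamma["q"] * x  # conditional mean of the quadratic slope
    log_odds = -tau + i + s * t + q * t * t
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up values, for illustration only.
alpha = {"i": 0.0, "s": 0.5, "q": -0.1}
gamma = {"i": 1.0, "s": 0.0, "q": 0.0}
curve_x0 = [arrest_probability(t / 10, 0, 1.0, alpha, gamma) for t in range(5)]
curve_x1 = [arrest_probability(t / 10, 1, 1.0, alpha, gamma) for t in range(5)]
```

Comparing the two curves at the same time scores shows the effect of the covariate on the trajectory for that class.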


Does it make sense to include a case with only one observed value in a GMM? I can see how this case would be informative in a simple growth curve model; it would help determine the average value of the curve at that time point. However, with a mixture model, it seems like this case would be unclassifiable. If two classes have the same initial value but diverge at later time points, a case with only one observation at the initial time point might have an equal probability of being in each class. Of course, I can see how this situation might arise even for a case with the maximum number of observations: each observed value could fall halfway between two classes. So, would a case with a single observation help determine the cross-sectional mixture distribution at that one time point? Thank you for your help!
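The classification concern raised above can be checked with a small Bayes-rule calculation. The Python sketch below is a simplification that assumes the binary observations are independent given class (no within-class random effects); the function name, class proportions, and class-specific arrest probabilities are all made up for illustration.

```python
def posterior_class_probs(priors, p_arrest_by_class, observed):
    """Posterior class probabilities for one case.

    priors:            class proportions, one per class
    p_arrest_by_class: per class, the arrest probability at each time point
    observed:          0/1 per time point, or None where the value is missing
    """
    joint = []
    for prior, probs in zip(priors, p_arrest_by_class):
        likelihood = 1.0
        for p, u in zip(probs, observed):
            if u is None:
                continue  # missing time points contribute nothing
            likelihood *= p if u == 1 else 1.0 - p
        joint.append(prior * likelihood)
    total = sum(joint)
    return [value / total for value in joint]


# Two hypothetical classes with the same starting probability that later diverge.
priors = [0.5, 0.5]
classes = [[0.2, 0.2, 0.2], [0.2, 0.5, 0.8]]

one_obs = posterior_class_probs(priors, classes, [1, None, None])
full_obs = posterior_class_probs(priors, classes, [1, 1, 1])
```

Under these assumptions, a case observed only at the first time point gets posterior probabilities equal to the class proportions, while a fully observed case is separated much more clearly.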


I would include all observations. I can't imagine that it would make much of a difference one way or the other. But you can always try it and see.


Thank you! 


