I wanted to know if I am setting up my GMM correctly or if there is a better way to model the data. My data set has arrest data for individuals starting from when they were juveniles (approximately age 12) up through whatever their current age happens to be (ranging from 21 to 67 years of age).
From these raw data, I created a set of 49 indicator variables reflecting whether a person was arrested at a given age or not. Variable 1 represents arrested at age 12 or not for all subjects; variable 2 represents arrested at age 13 or not, etc. up through age 60. If a person is younger than 60, as most in the data set are, I set the indicator variables for all ages beyond their current age to missing. For instance, a 35-year-old would have missing data for the indicators representing ages 36 through 60, with the variables for ages 12 through 35 set to 0 (no arrest) or 1 (arrested).
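The coding scheme described above can be sketched as follows. This is only an illustration in Python, not the code actually used: the function name `arrest_indicators` and the example arrest ages are hypothetical, and it assumes each person's current age and set of arrest ages are known.

```python
import numpy as np

FIRST_AGE, LAST_AGE = 12, 60  # 49 indicator variables, one per year of age

def arrest_indicators(current_age, arrest_ages):
    """Return one row of 0/1 indicators, with NaN for ages not yet reached."""
    ages = np.arange(FIRST_AGE, LAST_AGE + 1)
    row = np.isin(ages, list(arrest_ages)).astype(float)
    row[ages > current_age] = np.nan  # not yet observed, so missing rather than 0
    return row

# Hypothetical 35-year-old arrested at ages 16, 17 and 22
row = arrest_indicators(current_age=35, arrest_ages={16, 17, 22})
```

Ages 12 through 35 come out as 0 or 1 and ages 36 through 60 come out as missing, matching the 35-year-old example in the post.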
I then ran these data through Mplus using TYPE = MIXTURE to estimate a GMM with linear and quadratic terms. The model converges for 2 to 5 classes, with the 5-class model best, but I get a warning saying the covariance coverage is below the specified limit. I assume this is because of the missing data for the variables representing later years.
Should I be concerned about that error message? Is there a better or more correct way to model these data?
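To see where a coverage warning of this kind could come from, here is a minimal sketch that computes pairwise coverage outside Mplus. It assumes covariance coverage means the proportion of cases with both variables in a pair observed; the toy data and the function name `covariance_coverage` are hypothetical.

```python
import numpy as np

def covariance_coverage(data):
    """Proportion of cases with both variables observed, for every pair of columns."""
    observed = (~np.isnan(data)).astype(float)  # 1 where a value is present
    return observed.T @ observed / data.shape[0]

# Hypothetical toy data: 4 people, 3 ages; NaN marks ages not yet reached
toy = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, np.nan],
    [0.0, np.nan, np.nan],
    [0.0, np.nan, np.nan],
])
coverage = covariance_coverage(toy)
```

With a design like the one described, pairs involving the oldest ages are observed only for the few oldest people, so their coverage will be low by construction.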
I looked at 6.18 and see your point. However, I keep getting the error message: "One or more variables in the data set have no non-missing values." Unlike the example, there are not an equal number of measurements per participant. The number of measurements varies by age group.
To approximate the multiple cohort model, I created 4 groups based on age (< 30, 31-40, etc.). My model statement is:
All measurement points are included in this statement, which I assumed refers to the oldest group. For each younger group, I set up a similar statement but only included the times for which they were measured. For the <30 group, it looks like this:
One follow-up question. I have slogged on with this model and after figuring out how to deal with missing data and the coverage issue, have successfully run it as a GMM.
Upon adding covariates to the model (regressing both class and the growth factors on covariates), I notice that the part of the printout showing estimated probabilities for the categorical indicators at each time point by class is no longer presented. Reading through posts on here, I understand this is because the estimated probabilities depend on the covariates.
I wanted to calculate them by hand for one of the covariates and classes so I could show the effects of that covariate on the growth curve for that class. I have not been able to figure out the formula for doing so using the intercepts for the growth factors and the estimated values for each growth factor regressed on a covariate (by class). The examples in the Mplus manual (showing conversion from log odds to probabilities) seem not to apply. Is it possible to calculate these probabilities by hand, and if so, can you direct me to an example or the formula?
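One possible way to set up such a hand calculation is sketched below. It rests on several assumptions that should be checked against the actual output: a logit link with a single threshold held equal across time, time scores of 0, 1, 2, ..., and growth factors evaluated at their conditional means given the covariate (which matches the model-implied probability only if the within-class growth factor variances are zero; otherwise the probability has to be integrated over the factor distribution). All parameter values and the function name `arrest_probability` are hypothetical.

```python
import math

def arrest_probability(threshold, alphas, gammas, x, t):
    """P(u_t = 1 | x) in one class, with growth factors at their conditional means."""
    # Conditional mean of each growth factor: intercept + slope on the covariate
    i, s, q = (a + g * x for a, g in zip(alphas, gammas))
    logit = -threshold + i + s * t + q * t ** 2
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical estimates for one class: (intercept, linear, quadratic) factors
threshold = 2.0
alphas = (0.0, 0.3, -0.02)  # growth factor intercepts
gammas = (0.5, 0.1, 0.0)    # growth factors regressed on the covariate

curve_x0 = [arrest_probability(threshold, alphas, gammas, 0, t) for t in range(5)]
curve_x1 = [arrest_probability(threshold, alphas, gammas, 1, t) for t in range(5)]
```

Plotting `curve_x0` against `curve_x1` would show how the covariate shifts the estimated curve for that class under these assumptions.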
Does it make sense to include a case with only one observed value in a GMM?
I can see how this case would be informative in a simple growth curve model; it would help determine the average value of the curve at that time-point. However, with a mixture model, it seems like this case would be unclassifiable. If two classes have the same initial value but diverge at later time-points, a case with only one observation at the initial time-point might have an equal probability of being in each class.
Of course, I can see how this example might exist even for a case with the maximum number of observations. Each observed value could fall halfway between two classes. So, would a case with a single observation help determine the cross-sectional mixture distribution at that one time-point?
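The classification question can be illustrated with a small Bayes-rule sketch. It assumes binary outcomes that are independent given class, and the class proportions and class-specific probabilities are made up for illustration; `posterior_class_probs` is a hypothetical helper, not anything Mplus provides.

```python
def posterior_class_probs(priors, class_probs, observed):
    """Posterior class membership given binary outcomes; None marks a missing time point."""
    joint = []
    for prior, probs in zip(priors, class_probs):
        like = prior
        for p, u in zip(probs, observed):
            if u is None:
                continue  # missing values drop out of the likelihood
            like *= p if u == 1 else 1.0 - p
        joint.append(like)
    total = sum(joint)
    return [j / total for j in joint]

# Two hypothetical classes with the same probability at time 1, diverging later
priors = [0.7, 0.3]
class_probs = [[0.2, 0.2, 0.2], [0.2, 0.5, 0.8]]

one_obs = posterior_class_probs(priors, class_probs, [1, None, None])
full = posterior_class_probs(priors, class_probs, [1, 1, 1])
```

Under these assumptions, the case with a single observation gets posterior probabilities equal to the class proportions (0.7 and 0.3 here) rather than necessarily an even split, while the fully observed case is classified much more sharply.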
Please send your input, output, data to Support along with your license number.
Daniel Lee posted on Monday, August 12, 2019 - 9:18 am
Hi Dr. Muthen, I ran my growth mixture model (GMM) across 4 waves of data. Some participants were missing data on some, but not all, of the measurement periods. I noticed that for those participants with some missing data, they weren't dropped during the GMMs, leading me to believe that FIML was used to treat missing data.
Can you explain why FIML can be used for GMM but not for the BCH or three-step methods? As I understood from our prior discussion, the default missing data treatment for the BCH method in Mplus is listwise deletion, and one can implement multiple imputation, while GMMs allow FIML.