Growth Mixture Modeling and Missing Data
Mplus Discussion > Latent Variable Mixture Modeling >
 James Swartz posted on Friday, May 22, 2009 - 12:12 pm
I wanted to know if I am setting up my GMM correctly or if there is a better way to model the data. My data set has arrest data for individuals starting from when they were juveniles (approximately age 12) up through whatever their current age happens to be (ranging from 21 to 67 years of age).

From these raw data, I created a set of 49 indicator variables reflecting whether a person was arrested at a given age or not. Variable 1 represents arrested at age 12 or not for all subjects; variable 2 represents arrested at age 13 or not, etc. up through age 60. If a person is younger than 60, as most in the data set are, I set the indicator variable for that year and for all subsequent years to be missing. For instance, a 35 year old would have missing data for the indicators representing ages 36 through 60 with the variables for years 12 through 35 set at 0 (no arrest) or 1 (arrested).
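A minimal sketch of the coding scheme described above, in Python rather than Mplus. It follows the worked example in the post (a 35-year-old is observed for ages 12 through 35 and missing for 36 through 60). The function name, arguments, and the arrest ages in the usage line are hypothetical, not taken from the original data.

```python
import math

FIRST_AGE, LAST_AGE = 12, 60  # ages covered by the 49 indicators


def arrest_indicators(current_age, arrest_ages):
    """Return one value per age 12..60.

    1 = arrested at that age, 0 = not arrested,
    NaN = age not yet reached (missing by design).
    """
    arrested = set(arrest_ages)
    row = []
    for age in range(FIRST_AGE, LAST_AGE + 1):
        if age > current_age:
            row.append(math.nan)  # no information yet for this age
        else:
            row.append(1 if age in arrested else 0)
    return row


# Hypothetical 35-year-old arrested at ages 17 and 22.
row = arrest_indicators(current_age=35, arrest_ages=[17, 22])
```

Here `row` has 49 entries, of which the last 25 (ages 36 through 60) are missing.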

I then ran these data through Mplus using Type = mixture to estimate a GMM with linear and quadratic terms. The model converges (2 to 5 classes, with the 5-class model fitting best), but I get a warning saying the covariance coverage is below the specified limit. I assume this is because of the missing data for the variables representing later years.
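For readers unfamiliar with the warning: covariance coverage is the proportion of cases that have observed values on both variables of a pair, and Mplus compares it with a minimum set by the COVERAGE option (default 0.10). The sketch below computes that proportion on a toy data set in Python; the helper names and the toy rows are hypothetical, and this is an illustration of the quantity, not Mplus's own code.

```python
import math


def covariance_coverage(rows):
    """coverage[j][k] = share of cases with both variable j and variable k observed."""
    n_cases = len(rows)
    n_vars = len(rows[0])
    coverage = [[0.0] * n_vars for _ in range(n_vars)]
    for j in range(n_vars):
        for k in range(n_vars):
            both = sum(
                1 for r in rows
                if not math.isnan(r[j]) and not math.isnan(r[k])
            )
            coverage[j][k] = both / n_cases
    return coverage


def pairs_below(coverage, limit=0.10):
    """List variable pairs (j <= k) whose coverage falls below the limit."""
    n = len(coverage)
    return [(j, k) for j in range(n) for k in range(j, n) if coverage[j][k] < limit]


nan = math.nan
# Toy data: three indicators, later ones missing for younger cases.
rows = [
    [0, 1, 0],
    [0, 0, nan],
    [1, nan, nan],
    [0, nan, nan],
]
coverage = covariance_coverage(rows)
```

With indicators that are missing for everyone below a given age, pairs involving the oldest ages are the ones most likely to fall under the limit.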

Should I be concerned about that error message? Is there a better or more correct way to model these data?

Thanks in advance!

 Linda K. Muthen posted on Friday, May 22, 2009 - 5:33 pm
So much missing data can be an issue. I suggest using the multiple group multiple cohort approach shown in Example 6.18.
 James Swartz posted on Saturday, May 23, 2009 - 11:33 am
I looked at 6.18 and see your point. However, I keep getting the error message: "One or more variables in the data set have no non-missing values." Unlike the example, there are not an equal number of measurements per participant. The number of measurements varies by age group.

To approximate the multiple cohort model, I created 4 groups based on age (< 30, 31-40, etc.). My model statement is:

Model: i s q| arrv1@0 arrv2@.1 arrv3@.2 arrv4@.3 arrv5@.4 arrv6@.5... arrv44@4.3;

All measurement points are included in this statement, which I assumed refers to the oldest group. For each younger group, I set up a similar statement but only included the times for which they were measured. For the <30 group, it looks like this:

Model <30:
i s q | arrv1@0 arrv2@.1 arrv3@.2 arrv4@.3 arrv5@.4 arrv6@.5... arrv19@1.8;

And so on...

This setup generates the missing data error. What am I doing wrong?
 Linda K. Muthen posted on Saturday, May 23, 2009 - 2:18 pm
Please send your input, data, output, and license number to
 James Swartz posted on Tuesday, June 09, 2009 - 7:08 pm

One follow-up question. I have slogged on with this model and after figuring out how to deal with missing data and the coverage issue, have successfully run it as a GMM.

Upon adding covariates to the model (regressing both class and the growth factors on covariates), I notice that the part of the printout showing estimated probabilities for the categorical indicators at each time point by class is no longer presented. Reading through posts on here, I understand this is because the estimated probabilities depend on the covariates.

I wanted to calculate them by hand for one of the covariates and classes so I could show the effects of that covariate on the growth curve for that class. I have not been able to figure out the formula for doing so using the intercepts for the growth factors and the estimated values for each growth factor regressed on a covariate (by class). The examples in the Mplus manual (showing conversion from log odds to probabilities) seem not to apply. Is it possible to calculate these probabilities by hand, and if so, can you direct me to an example or the formula?

Thanks again,
 Linda K. Muthen posted on Wednesday, June 10, 2009 - 6:08 am
The log odds for various combinations of covariate values are calculated using the results from the output and the values of the covariates. See pages 409-410 of the user's guide.
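A rough Python sketch of the kind of hand calculation being asked about. It assumes a logit link and evaluates the curve at the growth factor means implied by one covariate value, so it ignores any within-class growth factor variance (averaging over that variance would need numerical integration). The function name and every number in the usage lines are hypothetical; real values would come from the class-specific output.

```python
import math


def arrest_probability(t, x, threshold, intercepts, slopes):
    """P(u_t = 1) at time score t for covariate value x, within one class.

    intercepts[f]: intercept of growth factor f ("i", "s", "q")
    slopes[f]:     regression of growth factor f on the covariate
    """
    # Growth factor means implied by the covariate value.
    i, s, q = (intercepts[f] + slopes[f] * x for f in ("i", "s", "q"))
    # Log odds: growth curve value minus the indicator threshold.
    logit = -threshold + i + s * t + q * t ** 2
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical estimates for one class.
intercepts = {"i": 0.0, "s": 1.0, "q": -0.2}
slopes = {"i": 0.5, "s": 0.0, "q": 0.0}
p_x0 = arrest_probability(t=0.0, x=0, threshold=2.0, intercepts=intercepts, slopes=slopes)
p_x1 = arrest_probability(t=0.0, x=1, threshold=2.0, intercepts=intercepts, slopes=slopes)
```

Evaluating the function over all time scores for two covariate values gives two curves for the same class that can be plotted side by side.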
 Jonathan Larson posted on Wednesday, March 26, 2014 - 6:21 am
Does it make sense to include a case with only one observed value in a GMM?

I can see how this case would be informative in a simple growth curve model; it would help determine the average value of the curve at that time-point. However, with a mixture model, it seems like this case would be unclassifiable. If two classes have the same initial value but diverge at later time-points, a case with only one observation at the initial time-point might have an equal probability of being in each class.

Of course, I can see how this example might exist even for a case with the maximum number of observations. Each observed value could fall halfway between two classes. So, would a case with a single observation help determine the cross-sectional mixture distribution at that one time-point?
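The single-observation scenario can be checked with Bayes' rule. The Python sketch below assumes binary outcomes that are independent given class (no within-class growth factor variance), and all class proportions and probabilities are hypothetical. When two classes share the same probability at the first time point, a case observed only there gets posterior probabilities equal to the class proportions, which are equal across classes only if the classes are the same size.

```python
def posterior_class_probs(class_props, p_event, observed):
    """Posterior class probabilities for one case.

    class_props[c]: prior proportion of class c
    p_event[c][t]:  P(u_t = 1 | class c)
    observed:       dict mapping time index -> observed 0/1 value;
                    unobserved time points are simply left out
    """
    joint = []
    for prior, probs in zip(class_props, p_event):
        likelihood = 1.0
        for t, u in observed.items():
            likelihood *= probs[t] if u == 1 else 1.0 - probs[t]
        joint.append(prior * likelihood)
    total = sum(joint)
    return [j / total for j in joint]


class_props = [0.7, 0.3]
p_event = [
    [0.2, 0.10, 0.05],  # class 1: starts at 0.2, declines
    [0.2, 0.40, 0.60],  # class 2: same start, increases
]
one_obs = posterior_class_probs(class_props, p_event, {0: 1})
two_obs = posterior_class_probs(class_props, p_event, {0: 1, 2: 1})
```

The case with one observation is not sharply classified, but it still enters the likelihood and informs the shared starting probability. Adding a second observation at the last time point moves most of the posterior mass to class 2.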

Thank you for your help!
 Linda K. Muthen posted on Wednesday, March 26, 2014 - 12:10 pm
I would include all observations. I can't imagine that it would make much of a difference one way or the other. But you can always do it and see.
 Jonathan Larson posted on Thursday, March 27, 2014 - 12:51 pm
Thank you!
 saravanelst posted on Tuesday, September 20, 2016 - 7:18 am
I would like to run a GMM, but am wondering how many time points I could use.

Percentage missing at each time point:
year 1: 0%
year 2: 21.8%
year 3: 36.7%
year 4: 50.4%
year 5: 63.6%
year 6: 77.5%
year 7: 91.1%

Is there a maximum number of missings that is allowed?
 Bengt O. Muthen posted on Tuesday, September 20, 2016 - 10:22 am
No, but the data from year 7 don't seem to be of much value given the high percentage missing data.
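To put the year 7 figure in context, here is a small Python sketch of the pairwise coverage these percentages would imply. It assumes purely monotone dropout (anyone observed in a later year was also observed in every earlier year), which the post does not state, and it reads the year 6 figure as 77.5%. Under that assumption the share of cases observed at both of two years equals the share observed at the later year.

```python
# Percent missing per year, as posted above.
missing_pct = {1: 0.0, 2: 21.8, 3: 36.7, 4: 50.4, 5: 63.6, 6: 77.5, 7: 91.1}


def coverage_if_monotone(year_a, year_b):
    """Share of cases observed at both years, assuming monotone dropout."""
    later = max(year_a, year_b)
    return 1.0 - missing_pct[later] / 100.0


# Years whose coverage with year 1 would fall under the Mplus default of 0.10.
low_years = [y for y in missing_pct if coverage_if_monotone(1, y) < 0.10]
```

Under this assumption only year 7 (coverage of about 0.089 with every earlier year) drops below the default limit. With non-monotone missingness the pairwise coverage could be lower still.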
 David Okech posted on Sunday, April 16, 2017 - 1:51 pm
I got this syntax error; what should I do? My data have very little missing information.
One or more variables in the data set have no non-missing values.
Check your data and format statement.

  Continuous    Number of
  Variable      Observations    Variance

  **EASE        0

Data set contains cases with missing on x-variables.
These cases were not included in the analysis.
Number of cases with missing on x-variables: 144
 Bengt O. Muthen posted on Monday, April 17, 2017 - 6:35 am
Please send your input, output, data to Support along with your license number.
 Daniel Lee posted on Monday, August 12, 2019 - 9:18 am
Hi Dr. Muthen,
I ran my growth mixture model (GMM) across 4 waves of data. Some participants were missing data on some, but not all, of the measurement periods. I noticed that for those participants with some missing data, they weren't dropped during the GMMs, leading me to believe that FIML was used to treat missing data.

Can you explain why FIML is used for GMM, and why it cannot be used for BCH or three-step method? I think, per our prior discussion, the default missing data treatment for BCH method is listwise deletion (in Mplus) and one can implement multiple imputation...while GMMs allow FIML.

Thank you so much!
 Bengt O. Muthen posted on Monday, August 12, 2019 - 5:36 pm
BCH works with a single DV and therefore FIML doesn't help - see our RMA book.

Trying to model missing on X's for R3step brings in other difficulties.