We are using Mplus to estimate growth mixture models for repeated assessments of depression. We seem to be getting quite different answers regarding class membership depending on whether or not we include predictors of latent class membership in the model. The estimated class sizes change quite a bit and individuals jump between classes. Is this indicative of some kind of problem or model misspecification or just something to be expected from the additional information included in the model?
bmuthen posted on Wednesday, November 29, 2000 - 5:09 pm
This is an interesting topic and it would be good to accumulate experience now that more researchers get into growth mixture models and other latent class models with covariates predicting class membership. In an ASB latent class example that we use for training sessions, the analysis is done in steps: using only the u indicators, reducing down to the class-defining u indicators, and adding the c-predicting covariates x. In this example, we found a strong agreement in class definition across the 3 steps which is an indication that the model is stable and trustworthy. On the other hand, I have one growth mixture example with y's and x's, where the classes seemed to change when adding x's as predictors of c. So, my early impression is that this problem may happen in some data. I can think of 3 reasons. One is that more information is available when adding x's and therefore this solution is what one should trust. Another is that the model may be misspecified when adding the x's because there may be some omitted direct effects from some x's to some y's/u's (these can be included). A third explanation is more subtle and has to do with individuals' misfit. There may be examples where for some individuals in the sample the y's/u's "pull" the classes in a different direction than the x's. Note that both y/u and x information contribute to class formation. Consider the example where in a 2-class model a high x value has a positive influence on being in class 2, and being in class 2 gives a high probability for u=1 for most u's. Individuals who have many u=1 outcomes but low x values are not fitting this model well. If the x information dominates the u information then these individuals will be differently classified using only u versus using u and x. - Just some preliminary thoughts.
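The second explanation above (omitted direct effects) can be checked by adding a direct effect to the model with covariates. A minimal sketch in Mplus input form, assuming four binary indicators u1-u4, one covariate x, two classes, and a hypothetical data file name; none of these names come from the ASB example:

```mplus
TITLE:    LCA with a covariate and one direct effect (sketch)
DATA:     FILE = mydata.dat;     ! hypothetical file name
VARIABLE: NAMES = u1-u4 x;
          CATEGORICAL = u1-u4;   ! binary latent class indicators
          CLASSES = c(2);
ANALYSIS: TYPE = MIXTURE;
MODEL:
  %OVERALL%
  c ON x;      ! x predicts class membership
  u4 ON x;     ! direct effect of x on one indicator
```

Running this with and without the `u4 ON x` line and comparing the class sizes is one way to see whether an omitted direct effect is driving the change in classification.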
I have estimated a 2 class latent class SEM. I then tried to introduce a continuous covariate into the model. For this, I fixed the loadings of items on this continuous covariate in my %overall% model based on an earlier CFA. However, when I specify "c#1 on MUN" under the %overall% command (where MUN is my continuous covariate), I get the following message: *** FATAL ERROR RECIPROCAL INTERACTION PROBLEM. Am I doing something wrong? How can one incorporate continuous covariates in an LC model?
I would need to see your full output to understand what is happening. Please send it to email@example.com.
anonymous posted on Tuesday, February 14, 2006 - 8:10 am
Hello Bengt and Linda,
I was wondering if any new information has been learned about this classification problem when different predictors/covar/outcomes are included in the model. I'm new with this type of analysis, but I was wondering if one possible solution would be to estimate class membership in an unconditional model and simply fix their class membership in subsequent models. I have an interest in retaining class membership across models for consistency's sake.
Cheers. And Happy Valentine's Day.
bmuthen posted on Tuesday, February 14, 2006 - 5:02 pm
Take a look at the Muthen (2004) chapter in the Kaplan handbook posted on our web site under Recent Papers and you will find this issue discussed. I don't see it as a problem that classification changes when adding covariates - having more information makes the classification better. If there are big changes, I would think the model with covariates is more trustworthy. But feel free to make a counter argument.
Hello Dr. Muthen, I am estimating a latent class model with 3 continuous and 3 binary predictors, and I have a couple of questions. (1) At 2 classes, my df for Chi-Square Test of Model Fit for the Binary and Ordered Categorical (Ordinal) Outcomes is 0. When I estimate a 3-class model, I receive notice that the "df cannot be computed for this part of the model". Does this mean that my model is underidentified and, thus, misspecified? If yes, can I constrain some parameters to fix this problem? Also, is a just-identified model ok? (2) I am trying to compute BLRT using Tech 14 with LRTBootstrap = 100 and LRTstarts = 0 0 40 10. I receive a message that the p value is not reliable because THE BEST LOGLIKELIHOOD VALUE WAS NOT REPLICATED in x out of 100 bootstrap draws. However, with multiple sets of starting values, I get the same parameter estimates and my model seems to replicate, suggesting that I am not reaching a local maximum.
Hi Dr. Muthen, I want to correct my post on 11/10. I meant to say that I am estimating a latent class model with 3 continuous and 3 binary indicators. My questions were: (1) At 2 classes, my df for Chi-Square Test of Model Fit for the Binary and Ordered Categorical (Ordinal) Outcomes is 0. When I estimate a 3-class model, I receive notice that the "df cannot be computed for this part of the model". Does this mean that my model is underidentified and, thus, misspecified? If yes, can I constrain some parameters to fix this problem? Also, is a just-identified model ok? (2) I am trying to compute BLRT using Tech 14 with LRTBootstrap = 100 and LRTstarts = 0 0 40 10. I receive a message that the p value is not reliable because THE BEST LOGLIKELIHOOD VALUE WAS NOT REPLICATED in x out of 100 bootstrap draws. However, with multiple sets of starting values (e.g., starts = 1000 300), I get the same parameter estimates and my model seems to replicate, suggesting that I am not reaching a local maximum.
The two chi-square test statistics given are for the categorical latent class indicators in the model. You can ignore them because you also have continuous latent class indicators. So you don't need to worry about identification.
You should adjust the LRTSTARTS option. See TECH14 in the Mplus Version 4.1 User's Guide which is on the website for hints about how to use TECH14.
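A minimal sketch of what adjusting LRTSTARTS can look like. The four values are the initial-stage and final-stage random starts for the k-1 class model followed by those for the k class model in the bootstrap draws; the specific numbers below are illustrative assumptions, not recommended values:

```mplus
ANALYSIS: TYPE = MIXTURE;
          STARTS = 1000 300;        ! random starts for the analysis model
          LRTBOOTSTRAP = 100;       ! number of bootstrap draws
          LRTSTARTS = 0 0 100 20;   ! more starts for the k class model
OUTPUT:   TECH11 TECH14;
```

If the message about the best loglikelihood not being replicated persists, the last two values can be raised further, at the cost of longer run times.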
I’m doing an LCA with covariates. The class indicators are 14 ordinal (5 categories) variables, and I also have 15 potential covariates (2 continuous, 13 binary). My sample size is 12257, and it has sample weights. The result is a model with 4 classes, according to the LMR test. Then, if I introduce the covariates, the classes don’t change very much, but the LMR test suggests a model with 3 classes. Should I keep 4 or 3 classes? Thanks in advance, Fernando.
I would look at more than the LMR test, for example, loglikelihood, BIC, etc. Also, theory and predictive validity can guide the number of classes. See the following paper that is available on the website:
Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (ed.), Handbook of quantitative methodology for the social sciences (pp. 345-368). Newbury Park, CA: Sage Publications.
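For reference, the BIC mentioned above is computed from the loglikelihood as follows, where \(\log L\) is the maximized loglikelihood, \(p\) is the number of free parameters, and \(n\) is the sample size. Lower values indicate a better trade-off between fit and parsimony, so the 3-class and 4-class solutions can be compared on this value within the same set of variables:

```latex
\mathrm{BIC} = -2 \log L + p \, \ln(n)
```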
Predictive validity, in terms of expected effect of covariates is Ok for both RLCA models (I don't have distal outcomes). Item profiles are parallel for LCA4, RLCA4, and RLCA3. Sorry for the insistence but I don't see any other indications in that paper. Thanks in advance, Fernando.
You need to look at the fourth class and see where it comes from. For example, in the four class results, are two classes the same as in the three class results? And does the fourth class come from a division of one class? If so, is the covariate profile different in those two classes? Is there substantive meaning to the fourth class? Basically, statistics can take you only so far. Then you need to use substance and logic.
Linda, thank you very much. After reviewing the theory and rerunning my models, I have reached the conclusion that the problem lies in the covariates that simultaneously affect the classes and the items directly. My questions are: 1) Is it best to include only direct effects when supported by theory, or to include all covariates as direct effects and proceed backwards by deleting the nonsignificant ones? 2) Must I let the coefficients of the direct covariates vary among classes, or would that hide the class formation? I don’t have a clear criterion about this issue. 3) Do you know any reference on this issue (covariate selection and setting)? Thanking you in advance, Fernando.
1) I think you would assume that you have relatively few direct effects because otherwise you have a low degree of measurement invariance across the latent classes. Therefore, I would add direct effects as needed by significance, both statistical and substantive.
2) Direct effects that vary across classes are typically hard to estimate with any precision so I would let them be class-invariant.
3) No, none except writings such as my own 2004 chapter in the Kaplan handbook.
I have a growth mixture model (regarding the variances, only the intercept variance is class invariant relaxed) and 6 covariates. Estimating unconditional models leads to 4 groups and very poor classification quality. Adding covariates (predicting class membership) to the model suggests 3 classes and leads to substantially improved classification quality. In addition, class sizes change when adding these covariates. Referring to your first argument of November 29, 2000 and to your post of February 14, 2006, would it be OK to report the class solutions based on the model with covariates included? They seem more trustworthy.
This could point to the need for direct effects from the covariates to the latent class indicators. See the following paper which is available on the website:
Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (ed.), Handbook of quantitative methodology for the social sciences (pp. 345-368). Newbury Park, CA: Sage Publications.
Thank you. As far as I understand, one should include all significant effects of covariates in the class-finding process. So should I also include effects of my covariates on the intercepts (growth parameters) of the classes, even though the intercept variance is held equal? It seems to me that I should include class-invariant effects of my covariates on the intercept, but only the significant ones, or all of them?
Meanwhile, I included all possible effects of covariates on growth parameters, held class invariant, as you said. I wonder whether these class-invariant coefficients can be interpreted, or whether they are there just for specification purposes. My main interest lies in the effects on class membership, but there are also some interesting effects on the growth parameters, albeit the same for all classes.
You first consider the effect of covariates on class membership and then the further, within-class effect of covariates on the growth factors. The latter effect is interpretable and can be understood e.g. as high x value making it more likely to be extra high within the class.
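The two kinds of covariate effects described above can be sketched as follows, assuming four repeated measures y1-y4 at equally spaced times, a single covariate x, and three classes (all of these are illustrative assumptions, not the poster's model):

```mplus
VARIABLE: NAMES = y1-y4 x;
          CLASSES = c(3);
ANALYSIS: TYPE = MIXTURE;
MODEL:
  %OVERALL%
  i s | y1@0 y2@1 y3@2 y4@3;   ! linear growth model
  c ON x;                      ! effect of x on class membership
  i s ON x;                    ! within-class effect on growth factors
```

Because `i s ON x` appears only in the %OVERALL% part, its coefficients are held equal across classes, which matches the class-invariant specification discussed in this exchange.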
Hao Duong posted on Sunday, September 21, 2008 - 1:15 pm
Dear Dr. Muthen, I have three questions:
1. My three-class model looks better than the two-class model based on fit indices and practical interpretation. However, when I add covariates to the three-class model, the p-value of the LO-MENDELL-RUBIN LRT TEST is large (0.867). What should I do? Should I try the two-class model with covariates?
2. In two-level GMM for a continuous outcome, there are three options for building the model:
a. Individual-level categorical latent variable
b. Between-level categorical latent variable
c. Both individual-level and between-level categorical latent variables
Running all three seems complicated. What would you suggest I do?
3. I would like to examine whether treatment has different effects on different classes. Treatment is a between-level (school-level) variable. It does not work when I try to regress the individual-level categorical latent variable on the treatment in the between part of the model. I believe that I have to build the model with a between-level categorical latent variable and then, in each class, regress the categorical latent variable (between-level) on the treatment. Is that correct?
I am trying to run an LCA model with covariates. I first ran the model without the covariates and am now adding them in. Some of my covariates are categorical and some are continuous. When I specify the variables as categorical, I get the fatal error message saying reciprocal interaction problem. Below is my input:
VARIABLE: Names ARE ARCODE grade pub lang relat auton confr attach stressE gpa autobh somat negaff posaff sepos seneg anxmn acadmo;
USEVARIABLES ARE grade pub relat-stressE somat-acadmo;
CLASSES = C(4); CATEGORICAL ARE grade relat; MISSING is all(99); IDVAR = ARCODE;
ANALYSIS: TYPE = MIXTURE; STARTS 50 10; ESTIMATOR=ML; ALGORITHM=INTEGRATION;
MODEL: %OVERALL% C#1-C#3 on grade relat; C#1-C#3 on pub auton confr attach stressE;
Do I need to add more model specifications to solve this problem? Or can I not have continuous and categorical covariates? Also, I should mention that one of my categorical variables has two levels and one has 7 levels. Is that a problem for Mplus?
The CATEGORICAL option is for dependent variables only. Do not put covariates on this list. Covariates are either binary or continuous and are both treated as continuous as in regular regression. If you have a nominal covariate, you need to create a set of dummy variables.
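A sketch of the dummy-variable approach using the DEFINE command, applied to the input above. It assumes, purely for illustration, that grade is the two-level covariate and that relat is the nominal covariate coded 1 to 7 with category 7 as the reference; the coding and the names r1-r6 are hypothetical. Variables created in DEFINE go at the end of the USEVARIABLES list, and the covariates are left off the CATEGORICAL list:

```mplus
VARIABLE: USEVARIABLES ARE grade pub auton-stressE somat-acadmo
            r1 r2 r3 r4 r5 r6;
          CLASSES = C(4);
DEFINE:   IF (relat EQ 1) THEN r1 = 1;
          IF (relat NE 1) THEN r1 = 0;
          IF (relat EQ 2) THEN r2 = 1;
          IF (relat NE 2) THEN r2 = 0;
          IF (relat EQ 3) THEN r3 = 1;
          IF (relat NE 3) THEN r3 = 0;
          IF (relat EQ 4) THEN r4 = 1;
          IF (relat NE 4) THEN r4 = 0;
          IF (relat EQ 5) THEN r5 = 1;
          IF (relat NE 5) THEN r5 = 0;
          IF (relat EQ 6) THEN r6 = 1;
          IF (relat NE 6) THEN r6 = 0;
MODEL:    %OVERALL%
          C ON grade pub auton confr attach stressE
               r1 r2 r3 r4 r5 r6;
```

Only the changed parts of the input are shown. How cases with a missing value on relat are treated by these IF statements should be checked in the output before relying on the dummies.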
Jon Heron posted on Wednesday, June 24, 2009 - 7:12 am
recently, I was very interested to read
Clark, S. & Muthén, B. (2009). Relating latent class analysis results to variables not included in the analysis.
as this issue has been bugging me for some time, and judging by the age of this thread, I'm not the only one!
Since then (i.e. this morning) I read the following:
Multinomial Logit Latent-Class Regression Models: An Analysis of the Predictors of Gender-Role Attitudes among Japanese Women Kazuo Yamaguchi The American Journal of Sociology, Vol. 105, No. 6 (May, 2000), pp. 1702-1740
which describes a model that, in my mind, sits half-way between the 1- and 2-stage approaches. This paper presents a conditional LCA model in which there are no 2-way interactions between the X's and U's and no 3-way interactions between the X's, C's and U's. Perhaps this is only possible within a log-linear approach, but it seems to be exactly what I, and other researchers, are after - a model which reflects the latent nature of C without allowing X to affect the class-specific probabilities.
I have been attempting to implement this model with the following model - see next post
Jon Heron posted on Wednesday, June 24, 2009 - 7:14 am
XWITH cannot be used for an interaction with a categorical latent variable - your c. Interactions with c should instead be handled by letting a relationship between variables vary across the c classes.
I have to look at the Yamaguchi article to know what he is doing.
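Letting a relationship vary across classes, as suggested above in place of XWITH, can be sketched like this. The variables y and x and the two-class setup are hypothetical and are not taken from the poster's model:

```mplus
MODEL:
  %OVERALL%
  y ON x;
  %c#1%
  y ON x;     ! mentioning the slope here frees it in class 1
  %c#2%
  y ON x;     ! and in class 2, so the x effect differs by class
```

A difference between the two class-specific slopes plays the role that a c-by-x interaction term would otherwise play.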
Jon Heron posted on Wednesday, June 24, 2009 - 9:17 am
The way I read the 2000 Yamaguchi article is that his eqn. (1) specifies in Mplus terms:
usev = r s t a b c; categorical = r s t; classes = y(2);
MODEL: %overall% y on a b c;
This would give the 2-way terms of (1) for YA, YB, YC, and for RY, SY, TY. So (1) is the same as a standard latent class regression model in Mplus. As such, the "x's" a, b, and c do influence the latent class formation. Which by the way, I think is natural that they do - exactly in line with the case of factor scores in a MIMIC model with continuous latent variables.
Jon Heron posted on Friday, June 26, 2009 - 3:18 am
thanks for taking the time to read that paper
the bit I am stuck on is on p1712:
"The substantive meaning of latent classes is determined by the pattern of association of Y with the response variables. If we further allow this pattern of association to depend on covariates, the meaning of contrasts between being in one latent class versus another class changes with covariates, which is clearly undesirable in comparing regression coefficients across variables. Hence, even though a statistical improvement may be obtained by doing so, we should not introduce such three-factor interactions into the multinomial logit latent-class regression models; we make latent classes reflect the most statistically significant latent division of the entire population— even if such a division is not the most significant one among some subgroups of the population."
Is anything known with respect to the performance of BLRT in conditional vs. unconditional growth mixture models? BLRT points to 4 classes (unconditional) or 2 classes (conditional). LMRT seems more consistent and suggests 2 classes regardless of conditional or unconditional.
I don't think there have been any thorough studies comparing BLRT and LMRT. I think their behaviors will vary based on the data and the model. No one statistic should be used to determine the number of classes. Several should be examined and substantive theory should definitely play in. See the following paper on the website:
Nylund, K.L., Asparouhov, T., & Muthen, B. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling. A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569.
Katherine Masyn and I have done more studies on the performance of the LMR and BLRT for latent class analysis (LCA) models. In our studies, the BLRT consistently identifies the correct number of classes for unconditional models. However, for any type of conditional model there was not a consistent pattern of which worked better (BLRT or LMR) since both showed some weakness under different misspecifications (with respect to the covariates paths of influence). One main finding from this study (which will be under review soon) is that you should do class enumeration on the unconditional model using BLRT and BIC and then systematically include covariates. One thing to pay close attention to is that a significant change in the nature of the class formation from the unconditional to conditional models (even for the same number of classes) may signal a misspecification in the conditional model.
I have a 2-class solution that hardly changes (with respect to growth shapes and membership proportions) when I add some theoretically important covariates to the model (effects on C and within-class effects on growth factors). However, in the 3-class solution, the additional 3rd class changes shape (from a stable trend to a declining trend) when I add covariates, and the class membership proportions change significantly compared with the 3-class model without covariates. I tried direct effects of covariates on manifest indicators, but the problem persisted. Additionally, the 3rd class (9% of the sample) is not distinguished by covariates from the next lower trajectory (effects on C). In this case, is it plausible to argue that the 3-class model is unstable, not trustworthy, and does not add substantial information in comparison to the 2-class model? I would be especially interested in the "not trustworthy" or "unstable" argument.
I would think in terms of whether the 3rd class adds a substantively important class that is different from what you see with 2 classes. To help decide, I would also use BIC both with and without covariates.
I don't think it is a good argument against 3 classes to say that the 3rd class is not distinguished by covariate influence from another class. There is always a chance that some important covariate is left out and there is also a possibility that the 3rd class makes a difference for a distal outcome.
I do think that a model with direct effects included from covariates possibly is more reliable than a model without covariates. If those covariate effects are significant, leaving out covariates - that is, having class indicators be influenced only by the latent class - is a misspecified model because latent class isn't the only predictor.
I've been doing an analysis where I've been trying to produce multiple imputations of plausible values for latent class membership as suggested in:
Asparouhov, T. & Muthén, B. (2010). Plausible values for latent variables using Mplus. Technical Report.
However, I found that when I run an imputation model using ESTIMATOR=BAYES I get substantially different proportions assigned to each class compared to the proportions estimated from the ML estimated latent class model, even without including additional indicators of class membership (which I do want to do). I found this was happening even if the response probability parameters for all the latent class indicators were constrained to identical values in both models.
This makes me worry whether or not I can trust the imputations of class membership.
Would this kind of effect be expected? If so, what causes it?
Or am I maybe just doing something wrong?
Note: the latent class indicators in my model are repeated measurements of 3 outcomes from a cohort study, so there is data missing due to drop out for some of the latent class indicators.
I am not aware of any such discrepancy. Try generating 1000 plausible values after running a minimum of 10000 iterations. If that does not fix the problem then it will be something specific to your data. Theoretically the result should be the same with any data though so if you can't resolve the issue send all the information you can to firstname.lastname@example.org
In a typical example like EXAMPLE 7.12: LCA WITH BINARY LATENT CLASS INDICATORS USING AUTOMATIC STARTING VALUES WITH RANDOM STARTS WITH A COVARIATE AND A DIRECT EFFECT, what will the reference class be by default if there are three classes, i.e., c(3)? I would like to confirm whether Mplus uses the third category as the default reference class, and whether the reference class can be changed afterwards.
I have estimated a 4 class model which contains 11 latent indicators. Next, I want to run a number of analyses, leaving out one of the 11 indicators (resulting in 11 different models with 10 latent indicators). How would you recommend to compare the 11 indicator model with the 10 indicator models? I have been including the 11th indicator as a covariate, allowing me to compute Walds test through the Model test option but I am not sure if that is the ideal method. I have also computed Loglikelihood difference test according to the following article http://www.statmodel.com/chidiff.shtml resulting in chi square difference values which seemed unrealistic.
Not sure what you want to know about your 11 models, that is, what you are comparing. (1) Perhaps which indicator is the least informative about the latent classes? Or, (2) which indicator contributes more to misfit?
You cannot work with logL or BIC because your 11 models have different DVs. For (1) you can use the new version 7.3 OUTPUT option Entropy in the 11-indicator run. For (2) you can use TECH10 (assuming categorical indicators).
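A sketch of requesting both pieces of output in the 11-indicator run, assuming categorical indicators named u1-u11 (hypothetical names) and the 4-class model described in the question:

```mplus
VARIABLE: NAMES = u1-u11;
          CATEGORICAL = u1-u11;
          CLASSES = c(4);
ANALYSIS: TYPE = MIXTURE;
OUTPUT:   ENTROPY TECH10;   ! ENTROPY requires Version 7.3 or later
```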
Anna Hawrot posted on Monday, November 24, 2014 - 1:27 am
Hi, I've been reading recently about different procedures for dealing with changes in LCA solution after adding covariates/distal outcomes (DU3STEP, BCH). However, all these methods assume that covariates/distal outcomes are observed variables. Is it possible to benefit from these procedures in models in which covariates and distal outcomes are latent or both latent and observed?
Hi, Mplus 7.3 enables the computation of variable-specific entropy Ej for latent class indicator Uj. According to the note on ‘Variable specific entropy contribution, October 17, 2014’, univariate entropies are directly comparable with each other. My question is whether there is an absolute threshold X, so that Ej < X means the indicator Uj provides too little information about the latent variable and should therefore be excluded?
Dear Drs. Muthen, Thank you for the previous answer.
I would like to follow up with this new question.
My final optimal unconditional model of health has 4 classes, which I labelled as stable, recovering, slightly improving and improving.
Due to missing data on X, when adding my predictor in the one-step approach procedure, the number of observations drops from 9076 to 5679.
It follows that classification rates change a lot. However, when I have a closer look at the mean intercepts, mean slopes and mean quadratic terms of each new class, they haven't changed that much compared to the unconditional model parameters.
For example, mean intercept of the improving class in the unconditional models was 2.514; now in the conditional model, is 2.334 (and similarly for slope and q term).
Can I still call the new classes as the old ones (stable, recovering, slightly improving and improving classes) based on the similar parameters?
I am new to LCA and so have a very basic question.
I want to run a 3-class LCA (20 binary indicators).
If I include covariates in the USEVARIABLES statement, then class membership is very different to that obtained if I exclude the covariates. The difference in the 3-class composition is 40%, 45% and 15% versus 49.5%, 47.1%, 3.4%.
When I include covariates in the USEVARIABLES statement, I have not included any subsequent statements such as c on x1 x2 x3 and have not specified the covariates using the r3step command.
I am assuming that the covariates are being used in some way to determine or refine class membership probabilities or is there another reason for the changes in class membership ?
When you include a covariate X in the model, you should also include c ON X. Otherwise, the model is probably misspecified. When including c ON X, the class formation may be very different from that of not including X in the analysis, for reasons described in Web Notes 15 and 21.
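A minimal sketch of the specification this reply refers to, using the 20 binary indicators and 3 classes from the question. The names u1-u20 and x1-x3 are placeholders:

```mplus
VARIABLE: NAMES = u1-u20 x1-x3;
          USEVARIABLES = u1-u20 x1-x3;
          CATEGORICAL = u1-u20;
          CLASSES = c(3);
ANALYSIS: TYPE = MIXTURE;
MODEL:
  %OVERALL%
  c ON x1 x2 x3;   ! covariates predict class membership
```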
The model did not seem to throw up any warnings in this regard (when I did not explicitly specify the c ON X command).
1. I was thinking that it would be better to use R3STEP instead of c on X when specifying covariates. Would this be ok ?
I also want to add in some concurrent outcome scores for depression and anxiety, as I suspect that the classes will differ on these outcomes (continuous outcomes).
My syntax/modeling question is...
2. Can I use MODEL CONSTRAINT command to look for class differences on these outcomes (depression & anxiety) alongside R3step command for covariates? I want to set up class difference scores and test for significance.
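For reference, the R3STEP alternative raised in question 1 is requested through the AUXILIARY option rather than through c ON x. A sketch with placeholder names; with this approach the covariates stay off the USEVARIABLES list, so the classes are formed from the indicators alone:

```mplus
VARIABLE: NAMES = u1-u20 x1-x3;
          USEVARIABLES = u1-u20;
          CATEGORICAL = u1-u20;
          CLASSES = c(3);
          AUXILIARY = (R3STEP) x1 x2 x3;
ANALYSIS: TYPE = MIXTURE;
```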
I’m trying to run a 4 class LCA with a distal binary outcome (O1) in Version 7.31.
Variable: Names are ID I1-I13 O1-O3 C1-C8 ; USEVARIABLES ARE ID I1-I13 ; Categorical are I1-I13 ; Idvariable is ID; Auxiliary = O1 (DCAT) ; Missing = all(-1234) ; Classes = L (4);
Analysis: Type = mixture ;
A series of questions:
1) I thought BCH was 'best' for latent class uncertainty. Does DCAT vs. BCH matter in this case?
2) At the moment the OR association between my latent classes and outcome are estimated relative to class 4. I’d like to make Class 2 (the biggest group) my reference category – is there a trick for this?
3) I’ve identified a bunch of potential confounds for the relationship (C1-C8). I wanted to look at the association between these and latent class membership, and also if the latent class > outcome relationship is affected by covariate control. How do I add these without affecting latent class identification? These confounds are all binary so I’d like to estimate odds ratios.
4) Can I take into account survey weights/ clusters?
In one of my studies, I have two research questions. One is to identify subtypes of smokers who attend a smoking intervention. The other is to explore whether the smoker subtypes predict time to dropout from the intervention. To answer the first question, I plan to conduct an LCA; to answer the second, I plan to conduct a survival analysis. Here, I have one question:
In my LCA, I also plan to include 5 covariates: age, gender, education, income, and marital status. Since the 5 covariates are assumed to influence the structure of smoker subtypes, I plan to use a one-step approach to conduct the LCA. Afterwards, I plan to conduct a survival analysis to investigate how class membership predicts the time to dropout from the intervention. Since the 5 covariates listed above are also important predictors of intervention attrition, I plan to control for them again in the survival analysis to test the pure effect of class membership on intervention attrition. However, I am not sure whether this model is correct. Since the 5 covariates have already been used in the LCA to identify the subtypes of smokers, do you think I should control for them again in the survival analysis when testing whether the smoker subtypes predict intervention attrition? If the answer is no, could you please let me know what problems arise from controlling for the 5 covariates again in my survival analysis? Thanks!