Daniel posted on Tuesday, August 24, 2004 - 12:08 pm
I read your recent chapter in the Kaplan text on latent variable modeling, and have a question about a paper I'm revising. I have a continuous outcome that is not normally distributed. I modeled it with freely estimated class variances, and found three trajectories. Because the variables are not normally distributed, I gather that I cannot use the LMR LRT for testing the number of trajectories. The BIC supports four classes, but the change in BIC from three to four classes is minimal, and the fourth class is very superficial. Since I cannot use the LMR LRT, can I use the BIC criterion to select 3 classes?
bmuthen posted on Tuesday, August 24, 2004 - 12:23 pm
Note that the assumption of within-class normality does not imply that the mixture is normal; the mixture can be very non-normal. Your observed variables correspond to the mixture, so these mixture models handle non-normal observed variables. Therefore, you can use the LMR. The performance of the LMR - and the BIC - is, however, not sufficiently well-known in most cases. I would not decide on the number of classes based only on statistical measures but also on interpretability. Given what you say about the superficial 4th class, it sounds like the way to go is 3 classes.
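To illustrate the point about within-class normality, here is a minimal Python sketch. The class proportions, means, and standard deviations are made-up values for illustration only; they are not from Daniel's data. Each class is drawn from a normal distribution, yet the pooled (observed) variable is clearly skewed.

```python
import random
import statistics

def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n values from a mixture of normals (one normal per latent class)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # Pick a class according to the class proportions, then draw within class.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        draws.append(rng.gauss(means[k], sds[k]))
    return draws

def skewness(xs):
    """Sample skewness (third standardized moment)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

# Hypothetical two-class example: 80% of cases centered at 0, 20% centered at 4.
mixture_skew = skewness(sample_mixture(20000, [0.8, 0.2], [0.0, 4.0], [1.0, 1.0]))
# A single normal class for comparison.
single_skew = skewness(sample_mixture(20000, [1.0], [0.0], [1.0]))
print(round(mixture_skew, 2), round(single_skew, 2))
```

The single-class sample has skewness near zero, while the two-class mixture is strongly right-skewed even though each class is normal.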
Anonymous posted on Wednesday, August 17, 2005 - 2:51 pm
I am currently working on a model of behavior change about which there is some controversy in the literature. Specifically, there is some debate regarding the nature of the latent "readiness to change" construct. Some of the debate about the measurement model for the readiness to change construct essentially centers on whether individuals occupy discrete/discontinuous states of readiness or whether the construct is better modeled as continuous. My question is whether evaluation using mixture modeling might be able to shed some light on this, on statistical grounds. That is, would a finding that one class best explained the data argue against the notion of different discrete classes and suggest a continuous latent construct? Or is the distinction between continuous and categorical latent variables largely a matter of heuristic concern? Vermunt, for instance, notes that the distinction between latent classes and latent traits is largely a matter of the number of points across which one integrates. So I suppose the question is: does mixture modeling provide statistical evidence regarding the categorical/continuous nature of a latent construct?
bmuthen posted on Thursday, August 18, 2005 - 9:47 am
That is a good question that we still know too little about. I was less hopeful about this earlier. But I am looking at this currently for categorical items, contrasting latent class modeling with factor analysis modeling and with hybrids of the two, and I am gradually getting the impression that in several cases these models are distinguishable in terms of statistical fit. A one-class model with a continuous factor might fit considerably better or worse than a 2-class model without continuous factors. And often, hybrids fit considerably better than either. True, the classes can be seen as discrete points on a continuum - taking a non-parametric view of the factor distribution (which I think your Vermunt reference is about) - and this matter can only be resolved by relating the classes to other variables - antecedents and consequences - to see if classes are significantly different on those variables. Given that we can now do these analyses conveniently in a general modeling framework, it would be interesting to see more investigations of this type.
Chuck Green posted on Friday, August 19, 2005 - 8:26 am
Yes, that is the Vermunt reference I had in mind. When you discuss hybrids, are you referring to latent categorical variables derived from a combination of continuous latent factors and observed variables?
bmuthen posted on Friday, August 19, 2005 - 9:23 am
By hybrid I mean a model that has both continuous and categorical latent variables (the outcomes can be of any kind). For example, a latent class model that has factor variation within classes making the items correlate within class.
I should have said lowest BIC. Perhaps I was talking about the possibility of a negative BIC, which can happen when logL is positive so that the first BIC term is negative - if the sample size and/or number of parameters is small, the negative first term will dominate the positive second term and BIC will be negative. But even so, we want the smallest BIC (not in an absolute sense).
Justin Jager posted on Tuesday, September 26, 2006 - 9:00 am
I'm using the tech14 option in conjunction with Type=mixture to request a PBL ratio test. I am having a difficult time manipulating which class is the first class identified (and therefore is the class deleted to obtain the c-1 comparison model).
I ran a 3 class solution and a 4 class solution. Comparing the mean estimates across the 3-class solution and the 4-class solution, it is clear which class out of the 4 class solution is the "new" class. Intuitively, it seems to me that I want the "new" class to be the first class identified, so that when comparing the c model to c-1 model the deleted class will be the "new" class.
In order to accomplish this, I re-ran the 4-class solution, but this time used the tech14 option, and listed the start values for the "new" class first, and then listed the start values for the three remaining classes (listing the start values for the largest class last). However, the first class identified is not the "new" class whose start values are listed first.
Given the above I have two questions: (1) Am I being too stringent in my use of the ratio test by identifying the "new" class and manipulating the start values so that it is the first class identified?
(2) If the answer to #1 is No, then do you have any suggestions for manipulating the model so that the "new" class is the first class identified?
2) in using tech14 with the following values: lrtstarts = 40 10 2500 20; I get a warning indicating that 4 of 5 bootstraps failed to replicate. The p-value for the BLRT is significant, but I assume the lack of replication is a significant problem. How should the lrtstarts be increased, and at what point would you determine that the solution can not replicate?
By the way, the sentence in my first message should be: Dr. Nagin describes in his book (Group-based modeling of development, 2005) that the lowest absolute BIC-value is preferred, so the closest to zero. (without 'not').
Mplus uses
BIC(M) = -2 logL + r log n,
where L is the likelihood, r is the number of parameters and n is the sample size. In Mplus we want the smallest BIC(M).
Nagin in (4.1) of his book uses the alternative
BIC(N) = logL - 0.5 r log n,
so that BIC(M) = -2 BIC(N). Nagin wants the largest BIC(N).
With BIC(M) the second term is always positive (or non-negative). The first term is typically positive as well because logL is typically negative. The term decreases as the likelihood increases (gets better). So here we want small positive BIC(M) values. In the rare cases when logL is positive (L > 1) the first term is negative and becomes more negative as the likelihood increases. So here too we want the smaller BIC(M), where say -10 is smaller than -5 (-10 is further to the left on the real line).
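The two conventions can be checked numerically. This is a minimal Python sketch, assuming the Mplus form BIC(M) = -2 logL + r log n implied by the description of the two terms above; the function names and the input values are hypothetical.

```python
import math

def bic_mplus(loglik, n_params, n):
    """BIC in the 'smaller is better' form: -2 logL + r log n."""
    return -2.0 * loglik + n_params * math.log(n)

def bic_nagin(loglik, n_params, n):
    """BIC in Nagin's 'larger is better' form: logL - 0.5 r log n."""
    return loglik - 0.5 * n_params * math.log(n)

# Hypothetical model: logL = -3880.0, 20 parameters, 500 observations.
bm = bic_mplus(-3880.0, 20, 500)
bn = bic_nagin(-3880.0, 20, 500)
print(bm, bn, bm / bn)  # the ratio is -2

# Hypothetical case with a positive logL and few parameters: BIC(M) is negative.
bm_neg = bic_mplus(50.0, 3, 20)
print(bm_neg)
```

Because the two forms differ only by the factor -2, the model with the smallest BIC(M) is always the model with the largest BIC(N), whatever the sign of the values.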
Is there a way to test the difference between two LCAs with the same number of classes, but in which one model includes a forced zero-behavior class (i.e., a forced no-substance use class)?
Specifically, I've determined a 4-class solution fits the data best relative to 1,2,3, and 5-class models. However the fit indices for a restricted and unrestricted 4-class model are nearly equivalent:
My sense is that the BIC discrepancy is negligible (?) and that the restricted model should be selected based upon parsimony (i.e., fewer parameters estimated). The actual make-up of the 4-classes is comparable (basically, one of the unrestricted model classes has a high number of non-use individuals), but prevalence of the classes is different across the models, and predictors vary slightly.
Any guidance -- or a specific test that I might run to determine the "best" model?
I think I would go with the unrestricted model if it has the lower BIC, unless you have a specific theory for the existence of a zero class. I am not much for "model trimming"; I prefer just reporting the results even if some parameters may not be needed. But if you want to test for one of the classes being at zero probability for all items, perhaps you can use the Wald test of Model Test. For instance, you can define item j's probability as
pj = 1/(1+exp(tj));
where tj is the label for the threshold of item j. Then you use
You do this for all items at the same time. I haven't tried it, but I think it should work.
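As a quick numerical check of the threshold-to-probability formula above, here is a small Python sketch. The function name and the threshold values are illustrative assumptions, not Mplus output.

```python
import math

def item_probability(threshold):
    """Item probability implied by a logit threshold: p = 1 / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(threshold))

# A threshold of 0 gives probability 0.5; a large positive threshold
# (15 is used here purely as an illustration) gives a probability near zero.
for t in (-2.0, 0.0, 2.0, 15.0):
    print(t, item_probability(t))
```

This shows why a class with essentially zero item probabilities corresponds to very large thresholds, that is, to parameters at the edge of their range.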
Thank you for the response. A couple of follow-ups:
Is there any indication as to the "sensitivity" of the Adj. BIC -- given these differences are less than 2 pts?
Would these two 4-class models be considered nested, since the same predictors and number of classes are specified -- only one model includes constraints that force one class to include youth with no use? If so, is it appropriate to conduct a difference in chi-square (or equivalent) test to see whether the constraints significantly worsen fit?
Also, is there any utility in examining the quality-of-classification table? Even though entropy is identical, there is some variation in both the diagonal and off-diagonal elements.
There is a literature on how differences between BIC values should be viewed. See for instance
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
The models are nested, but the assumptions of the likelihood-ratio chi-square test are not fulfilled because the zero-class model specifies parameters that are on the border of their admissible space, namely zero item probabilities conditional on class. Come to think of it, the Wald test would be negatively affected by this too.
The classification table might tell you about differences across models in being able to tell certain classes apart.
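For readers who want a rough feel for the BIC-difference literature mentioned above, here is a small Python sketch. It assumes the commonly cited Kass and Raftery grades for twice the log Bayes factor (0 to 2, 2 to 6, 6 to 10, above 10) and treats a BIC difference as an approximation to that quantity. The function name and the example BIC values are hypothetical.

```python
def bic_difference_grade(bic_a, bic_b):
    """Grade the evidence for the lower-BIC model from the absolute BIC difference.

    Assumes the BIC difference approximates 2 * ln(Bayes factor) and uses the
    commonly cited Kass-Raftery cut points.
    """
    diff = abs(bic_a - bic_b)
    if diff < 2:
        return "not worth more than a bare mention"
    if diff < 6:
        return "positive"
    if diff < 10:
        return "strong"
    return "very strong"

# Hypothetical BIC values for two 4-class models that differ by less than 2 points.
print(bic_difference_grade(10452.3, 10453.9))
```

Under these assumptions, a difference of less than 2 points gives essentially no basis for preferring one model over the other on BIC alone.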
I'm evaluating GMM solutions to identify the "best" number of classes for an outcome (total sleep time) measured over 14 occasions. There is clearly one large class, and some number of smaller classes, probably 2 or 3.
I take it from Bengt's response in this thread on October 01, 2006 - 12:14 pm that specifying starting values is not necessary to get a valid BLRT test of the K vs K-1 solutions with TECH14. I assume that the way the classes differ would therefore be based on the profile means for the two solutions? So, how do I identify the K and K-1 mean profiles that were used in the BLRT? (I've found that the "K" BLRT solution is not identical to the model solution it follows.)
Next question on selecting the "best" number of classes.
I'm puzzled about the interpretation of the VLMR and BLRT tests for the K vs K-1 solutions. I've obtained identical H0 LL values (and -2LL diff values) from TECH11 and TECH14 outputs in the same run, and VLMR will have a p-value WAY larger than .05, while BLRT will have a p-value below .0000. Further, after obtaining BLRT for a sequence of solutions with increasing classes (even specifying LRTBOOTSTRAP=100), I have found that it is always significant at .0000 even when the #classes is getting silly. I've also found that the BIC and SABIC also get smaller with additional classes, even for a large number of classes (5 or more, with most being very small), so not much help there.
Tofighi & Enders (2008) recommended the BLRT and sample-size adjusted BIC as the most useful indices for GMM solutions, and Nylund et al. (2006 revised draft) also like the BLRT for GMM.
Do you have further suggestions on the use of these indices vis a vis the # of classes and substantive interpretations in trying to find the "real" solution?
Still another question re the # of classes (my ignorance seems bottomless!) -
How does one decide what values to specify for the LRTSTARTS command? I've read a couple of posts here about it, and the V5 manual (pp. 500-501). The defaults are 0 0 20 5, and you suggest perhaps 2 1 50 15 as an example of a different specification. Why so few for the K-1 class solution? Why that number for the K-class solution? (The question arises partly because of the problematic data set you've helped me with, for which I needed to specify 1500 random starts to get a replicated maximum before going to the BLRT. So, I can use the OPTSEED option from one or more of the replicated solutions when I run the TECH14 analysis, but what issues dictate how to choose the number of draws for the bootstrapped K-1 and K solution analysis?)
Any references would be great, so I don't have to keep bothering you!
Belay that second message re the use of the BLRT resulting in p-values < .0000. That happened with specified starting values and LRTBOOTSTRAP >= 100, but after running more models last night, and another TECH14 BLRT this morning using the default for holding random effects constant, and allowing the TECH14 to run on its own (no specs for LRTBOOTSTRAP or LRTSTARTS), I got a BLRT that made sense (in this case, not significant)! This was after specifying OPTSEED from an analysis I ran overnight, where I used 4000 random starts and got two (2!?!) replicated maximums for the LL.
I also checked the K-1 LL in the TECH14 output, and it was the same as the maximum from the 3-class solution I had gotten previously with the same ANALYSIS settings. So there's some consistency! and a way to know what the K-1 solution would look like from the TECH14 K-1 model.
I'm still troubled though, that out of 4000 random starts, I would get only two replications of the largest LL (-3880.753), then just one LL that I had gotten multiple replications of in prior runs of this 4-class analysis (-3882.103), then 25 identical LL (-3883.747), with the difference between the largest and smallest LL from these three solutions being only 2.994. Having only two out of 4000 replicated LL still seems pretty chancy, making me wonder about a local maximum that I just happened to hit by chance, twice.
Your last 4 posts touch on topics we teach at our 2-day Mplus Short Courses. One just took place at the psychometric meeting and another one is coming up in November at Ann Arbor. This is in the area of Topics 5 and 6 (see our web site for topics and handouts). It is too large a topic to teach on Mplus Discussion - so I will just give some brief comments.
Topic 5 has on slide 197:
"More On LCA Testing Of K – 1 Versus K Classes
Bootstrap Likelihood Ratio Test (LRT): TECH14
• LRT = 2*[logL(model 1) – logL(model 2)], where model 2 is nested within model 1
• When testing a k-1-class model against a k-class model, the LRT does not have a chi-square distribution due to boundary conditions, but its distribution can be determined empirically by bootstrapping
Bootstrap steps:
1. In the k-class run, estimate both the k-class and the k-1-class model to get the LRT value for the data
2. Generate (at most) 100 samples using the parameter estimates from the k-1-class model and for each generated sample get the log likelihood value for both the k-1 and the k-class model to compute the LRT values for all generated samples
3. Get the p value for the data LRT by comparing its value to the distribution in 2."
Because step 2 generates data according to the k-1-class model, the k-1-class model is easier to fit than the k-class model and therefore requires fewer starts.
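Step 3 of the slide can be sketched in a few lines of Python. This is only an illustration of the final comparison, assuming the p-value is the share of bootstrap LRT values at least as large as the LRT for the real data; it is not a description of Mplus internals, and the function names and all numbers are hypothetical.

```python
def lrt(loglik_k, loglik_k_minus_1):
    """LRT = 2 * [logL(k-class model) - logL(k-1-class model)]."""
    return 2.0 * (loglik_k - loglik_k_minus_1)

def bootstrap_p_value(observed_lrt, bootstrap_lrts):
    """Share of bootstrap LRT values that are at least as large as the observed LRT."""
    exceed = sum(1 for value in bootstrap_lrts if value >= observed_lrt)
    return exceed / len(bootstrap_lrts)

# Hypothetical log likelihoods for the real data (k-class and k-1-class models).
observed = lrt(-1000.0, -1003.0)

# Hypothetical LRT values from 9 samples generated under the k-1-class model.
boot = [1.2, 7.5, 3.3, 6.4, 0.8, 9.1, 2.2, 4.0, 5.1]

p_value = bootstrap_p_value(observed, boot)
print(observed, round(p_value, 4))
```

With only a handful of bootstrap draws the p-value can take only a few distinct values, which is one reason the number of draws matters when reading a BLRT result.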
Having only 2 replicated best LLs out of 4000 is a sign of a problem - it typically indicates that the model tries to read too much out of the data. This happens when using too many parameters such as too many classes, particularly when the sample size is not large and the data signal is not strong.
Thanks very much for the information, Bengt, and for confirming that getting only 2 maximum LL out of 4000 does indicate a problem with this model! I *have* been worried that I was trying to squeeze water from this stone!
I saw an earlier announcement about the November short course and I have been planning to attend. Meanwhile, I see that your handouts are available to view online and will look through the ones for Topics 5 & 6.
You used to sell the short course handouts through the Mplus website, but I couldn't find the page for ordering them. It's very generous of you to offer them for download now! I like to read as much as I can to find answers to my questions before bothering you folks, so I really appreciate the references you provide and the short-course handouts. Thank you.
I've run a 3-class GMM to test a 3 vs 2-class model with BLRT. I first ensured that I had a maximum LL that had many replications, then selected two seeds to check the solution. OPTSEED with both seeds produced identical solutions. I ran the model with one of the seeds using OPTSEED, first with no specifications for the TECH14 BLRT. I got a p = 0.3333 with 9 successful BS draws. Then, I ran the model with the same OPTSEED, but this time specifying
to increase the reliability of the BLRT and the K-1 solution. I get identical LL for the K model (same seed), and identical LL for the K-1 model (and it is identical to the LL from the previous 2-class model I ran without BLRT).
However, the p-value for the BLRT is now 0.0000 with 50 successful BS draws. This is the same type of result I referred to earlier, where the BLRT produces p = .0000 no matter how many classes are in the model, when specifying LRTBOOTSTRAP at some number, usually 50, 100, or 150 for the models I've run.
Could you direct me to a reference that will help me understand this inconsistency in BLRT p-values for the same LL and -2LL diff?
Erratum: My posting on July 14, 2008 - 3:53 pm was wrong about the recommendation of Tofighi & Enders (2008) regarding the indices they recommended for choosing the number of classes in a GMM. I have found my own error in a Google(TM) search using their names, and hope this correction will also show up in future Google searches so folks won't be misled by my error.
I wrote that Tofighi & Enders "recommended the BLRT and sample-size adjusted BIC as the most useful indices for GMM solutions" but my memory failed me, and I should have looked at the paper again before citing them. In fact, Tofighi & Enders (2006; 2008) did not evaluate the BLRT at all in their Monte Carlo study of GMM indices for choosing the number of classes, because it had just been added to Mplus when they did their study. Instead, their recommendation was that the sample-size adjusted BIC was the best overall index in its performance across a number of conditions, and the Lo-Mendell-Rubin LRT was next best in several situations.
In fact, Nylund, Asparouhov, & Muthen (2006 revised draft; Structural Equation Modeling, 2007) recommended the BLRT for choosing the number of classes in GMM under some circumstances, but their Monte Carlo study did not evaluate very many conditions for GMM.
Thanks for your note on July 16th and the follow-up tech spt via email. My question is "where-to-go-from-here-for-now" in using the BLRT to help choose the number of classes for GMM.
I understand that it is best to set the largest class as the last class when using the BLRT, but (from another posting) that the order of the other classes is not essential. However, I have been unsuccessful in making the largest class the last class, whether I use class intercepts from the solution being tested as starting values, or whether I use the categorical latent variable means. With either method, the program still re-orders the classes so that the largest class is not last in the TECH14 run, defeating the purpose of the BLRT to some extent, and taking a lot of time to run repeated bootstrap tests trying to make the last class the largest. Using the categorical latent variable means, for example, I got this from the prior solution for the 4-class model being tested, in which the last class WAS largest:
Categorical Latent Variables
Means
C#1 -1.146
C#2 -2.576
C#3 -2.860
So, I specified in the Model %OVERALL% statement for the TECH14 run:
[ c#1*-1.146 c#2*-2.576 c#3*-2.860 ];
But no matter how I ordered those three values, the subsequent runs put the largest class as number 2 or 3. Any thoughts about what I'm doing wrong?
I’m running a lot of LCA and SEM analyses with and without factors and covariates. A few questions (hope they are not too "simple"):
1) I suppose that the dot in the scale correction factor is a decimal separator.
2) In the simplified Satorra-Bentler use of scale correction factors in difference testing, I suppose that the number of parameters is the number of free parameters.
3) When I incorporate covariates in an analysis, the number of subjects may differ considerably because cases with missing values on the covariates are excluded. Concomitantly, the LL and the statistics (AIC etc.) change considerably: say with 2100 subjects the adjusted BIC may be 31,000, and when incorporating a covariate only 1700 subjects are included, with an adjusted BIC of 25,000. The correction for sample size does not seem appropriate. Is it possible to compare the two models by calculating F in chi-square = (n-1)*F? Or alternatively by correcting e.g. BIC for sample size (e.g. using the above numbers BICcorr = (25,000/1700)*2100 = 30,882)? An alternative way to do comparisons is to use the USEOBSERVATION option to only include a full data set. It's easy to do, but becomes tedious as more covariates are included.
4) How do you calculate DF in a mixture model?
5) Are there any rules for using the entropy measure in deciding on the best fit?
1. Yes.
2. Yes.
3. Only models with the same set of observed variables and the same set of observations can be compared.
4. Degrees of freedom are relevant only for models where means, variances, and covariances are sufficient statistics for model estimation. In other cases, the number of free parameters is used.
5. Entropy is not a fit statistic.
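On point 5, it may help to see what entropy summarizes. The Python sketch below assumes the usual relative entropy formula for mixture models, 1 - sum_i sum_k(-p_ik ln p_ik) / (n ln K), computed from each person's posterior class probabilities; whether this matches a given program's output exactly is an assumption. The function name and the example probabilities are hypothetical.

```python
import math

def relative_entropy(posteriors):
    """Relative entropy of classification from posterior class probabilities.

    posteriors: list of rows, one per person, each row summing to 1 over K classes.
    Returns a value between 0 (no separation) and 1 (perfect classification).
    """
    n = len(posteriors)
    k = len(posteriors[0])
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0.0:  # 0 * ln(0) is taken as 0
                total -= p * math.log(p)
    return 1.0 - total / (n * math.log(k))

# Hypothetical posterior probabilities for four people and two classes.
example = [[0.95, 0.05], [0.90, 0.10], [0.20, 0.80], [0.60, 0.40]]
print(round(relative_entropy(example), 3))
```

The value depends only on how sharply people are assigned to classes, not on how well the model reproduces the data, which is consistent with the answer that entropy is not a fit statistic.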
Thanks Linda, This cleared up a few things for me. A few follow up questions:
If, in a model with the same set of observations, we replace one covariate with another so that the number of parameters and subjects are the same, shouldn't it be possible to compare the two models? If the first model gives a BIC of say 31000 and the second 30000, wouldn't you conclude that the second model is a better model and should be preferred?
Entropy: although entropy is not a fit statistic, is there any formal way to conclude that an entropy of 0.7 is worse than 0.8 (e.g. in the example above), and would you be able to include such a result in your decision of which model to choose?
Hello! I have a question concerning tech14. The LRT value for my real data is dramatically different from the LRT values for the simulated data sets, which I monitored in the tech8 window (e.g. 124 vs. 66). Should I alter the settings of lrtstarts, or is this not really a problem? In addition, the p-values of VLMR and tech14 differ dramatically too (p = .95 vs. p < .001). The real data and generated data H0-LL's (also H1-LLs) in tech14 are the same, as described in the manual. So I think it's a problem with the generated data sets and their H1- and H0-LLs (lrtstarts)!?
Ok, it takes some days to finally compute the tech14s; then I will send it. But I guess it's rather hard to find a hint from the outputs alone, because the BLRT LRT values of the generated data sets are available only in the tech8 window (disappearing after computation). The following sentence in my last post was nonsense: 'The real data and generated data H0-LL's (also H1-LLs) in tech14 are the same, as described in the manual.' I only wanted to say that I reproduced the H0 and H1 LLs of the 'real data k-1 run' in my tech14 run, which probably points to something being wrong with the bootstrapping of the generated data sets. But this is a guess based on the phenomena I described regarding the LLs in the tech8 window.
In mixture models (especially with large samples) it sometimes happens that the examined fit indices (CAIC, BIC, aBIC, etc.) keep on decreasing while additional classes are added, potentially because of their sensitivity to sample size. In most cases when this happens, the additional classes don't necessarily make sense (substantively or statistically: very small classes, classes that only represent a meaningless division of preceding classes, etc.).
In those cases, to choose the number of classes one is left with theory and subjectivity.
It seems to me that in such cases the fit indices (CAIC, BIC, aBIC, etc) associated with varying number of classes might be depicted graphically and interpreted as an EFA scree test to help in the determination of the correct number of classes.
So, 1) Do you have any misgivings about this method? 2) Do you know of any references to a paper either suggesting this method (scree test) or using it to choose the correct number of classes?
However, you need to always be open to the possibility that you are fishing in the wrong pond - the model family you are in may not be the best for the data and if you switch model family you might find a minimum BIC. For example switching from LCGA to GMM.
Can you tell me what the difference is in Tech11 between the VLMR and the adjusted LMR?
Also, you recommend that the last class is the largest class. However, you also recommended that model identifying restrictions not be included in the first class. Would you consider starting values for the first class to be model identifying restrictions? I am currently using starting values for the first class to make it the smallest class, thereby making sure that the first class for Tech11 is not the largest class.
The authors provided a post-hoc adjustment. You can see the original article for the details. We do not use the adjusted LMR.
Starting values are not model identifying restrictions. These would be some restrictions on model parameters.
I would use starting values to make the last class the largest class not the first class the smallest class.
Keng-Han Lin posted on Thursday, February 18, 2010 - 1:28 pm
Hi Linda and Bengt,
I'm using LCA on complex survey data with 48 indicators (13 of them continuous). The BIC suggests the 6-class model: 46213.84 (2-class model), 45634.95 (3), 45419.72 (4), 45322.00 (5), 45195.95 (6), 45290.84 (7). But we are wondering if there are other rules we could follow to better determine the number of classes. The BLRT (tech 14) is not supported for complex survey data. The p-values of the LMR test are as follows: 0.047 (2-class model), 0.373 (3), 0.582 (4), 0.591 (5), 1.000 (6), 1.000 (7). For the 3-class model, which has an LMR p-value of 0.373, does this suggest that the 2-class model (H0) is good enough in our case?
If so, which statistics should I depend on? Or other criteria I should take into account?
Perhaps the following paper which is available in the website can help:
Nylund, K.L., Asparouhov, T., & Muthen, B. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling. A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569.
You should also consider whether the classes have any substantive or theoretical basis.
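The BIC values quoted in the question can be tabulated to see where the minimum falls and how large the successive drops are. This is a minimal Python sketch using only the numbers from the post; the variable names are my own.

```python
# BIC values by number of classes, as reported in the question above.
bic_by_classes = {
    2: 46213.84,
    3: 45634.95,
    4: 45419.72,
    5: 45322.00,
    6: 45195.95,
    7: 45290.84,
}

# Number of classes with the smallest BIC.
best_k = min(bic_by_classes, key=bic_by_classes.get)

# Drop in BIC when going from k-1 to k classes (negative means BIC went up).
drops = {
    k: bic_by_classes[k - 1] - bic_by_classes[k]
    for k in sorted(bic_by_classes)
    if k - 1 in bic_by_classes
}

print("lowest BIC at", best_k, "classes")
for k, drop in drops.items():
    print(k, round(drop, 2))
```

The minimum is at 6 classes, but most of the improvement comes from the first added classes, which is one more reason to weigh the substantive meaning of the classes alongside the index.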
Hello, I am investigating trajectories of body image over time, from 13 to 30 years. The sample in the GMM is 1082. I have problems deciding on the number of classes (several of us here are puzzled). With 3 classes the BIC is the lowest, but the LMR-LRT is 0.079; with 4 classes it is 0.038. Entropy for 3 classes is 0.727 and for 4 classes 0.734 (not very high!). However, the 4-class solution has one class which is only 1.5%, that is 15 persons, which doesn't make much sense. I have learned to trust the LMR-LRT, but find it hard to proceed with 4 classes due to the sample size in each class. What is your opinion, also regarding the relatively low entropy? Best regards, Ingrid
Adding to my questions above... The bootstrap test is significant for both the 3- and 4-class solutions. The LMR-LRT value above is of course the p-value. A colleague suggested adding q@0 to my model command; then the 3-class solution performed somewhat better: the LMR-LRT p-value is 0.024 (4-class p = 0.21). However, the entropy is around 0.70 (a bit lower for both the 3- and 4-class solutions). Thanks for your help. Ingrid
The significance of LMR-LRT should be interpreted only the first time it is greater than .05 not after that. Your first post suggests two or three classes. The meaningfulness of the classes should determine your choice.
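The "first time it is greater than .05" rule can be written out as a short Python sketch. The function name is hypothetical, and the reading that a non-significant test for k classes points to k-1 classes is my interpretation of the rule. The p-values are the ones Ingrid reported in her first post; the 2-class p-value was not reported and is simply left out.

```python
def classes_suggested_by_lmr(p_values, alpha=0.05):
    """Apply the stopping rule to LMR-LRT p-values keyed by number of classes.

    Going up from the smallest model, stop at the first k whose p-value exceeds
    alpha and return k - 1; later p-values are ignored. Returns None if every
    test is significant.
    """
    for k in sorted(p_values):
        if p_values[k] > alpha:
            return k - 1
    return None

# LMR-LRT p-values from the first post above (3 classes: 0.079, 4 classes: 0.038).
ingrid = {3: 0.079, 4: 0.038}
print(classes_suggested_by_lmr(ingrid))
```

The significant 4-class p-value is never reached, because the rule stops at the 3-class test.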
Hi, I have a question about selecting the number of classes in LCA. For any reasonable number of classes, Tech11 and Tech 14 show zero p-values. Also, BIC decreases (for 3 or more classes) rather slowly. Does this mean LCA does not fit the data and should not be used? On the other hand, the 4- and 5-class solutions have high entropy (around 0.9) and (given my research goals) seem to make sense. Is it ok to use LCA and choose a solution that makes sense with respect to my research goals? Thanks for your help, Tomas
I am running a Latent Profile Analysis with 5 continuous variables. My sample size is 430. I have been able to replicate the loglikelihood values for the 1 and 2 class solutions. I have also been using the steps outlined by Asparouhov & Muthén webnote 14 to test the number of latent classes using the BLRT from TECH 14 using the OPTSEED option. Based on those recommendations I get the warning THE BEST LOGLIKELIHOOD WAS NOT REPLICATED. I continue to get the message even after increasing LRTSTARTS to 50 20 100 20 and using LRTBOOTSTRAP from 100 through 500. In all of these, the loglikelihood for the k-1 class is the correct one from step 1.
In my LPA and FMA analyses, the BLRT remains significant (at .0000) for every analysis. My samples are n = 533 and n = 181, so I'm not sure it's a result of too much power. Shaunna Clark thought she had read cases of this, and suggested relying on ICs and substantive interpretation of the models (and the other LRTs, which do reach non-significance). I was wondering whether you know of any citations to justify this kind of decision?
I would use BIC and substantive interpretation. The Nylund et al paper also shows that BIC is one of the top indices (second best?). Notwithstanding the Nylund et al article, my experience is that BLRT is less dependable in practice for some reason.