Hi! For many of my LC models the response pattern frequencies can be pretty sparse, which makes the use of the chi square suspect. I remember a neat little paper in MBR, I think, by Linda Collins suggesting a Monte Carlo approach to assessing fit with such data. It strikes me that Mplus has MC capabilities; is there some way to implement Linda's approach in Mplus?
And while I'm here... :) Will it become possible to use LC with complex survey data at some point?
That is an interesting line of thought. Which MBR issue was that in? I'd like to get back to you about this to say whether it can be done inside the current Mplus or needs to be done by generating data outside Mplus and running it through Mplus using the RUNALL Utility (see new web info). Further Mplus developments are going full speed ahead, and latent class models with complex sample data are one of several things the team is working on.
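For readers who want to see the general Monte Carlo idea raised above, here is a rough Python sketch run outside Mplus. It is not the procedure from the MBR paper and not an Mplus feature. It assumes the model-implied response-pattern probabilities are already available (for example, copied from fitted-model output) and treats them as fixed, whereas a full parametric bootstrap would re-estimate the model on every simulated data set. The function names `pearson_x2` and `monte_carlo_p` are made up for illustration.

```python
import random
from collections import Counter

def pearson_x2(counts, probs, n):
    """Pearson chi-square over all patterns with nonzero expected count."""
    return sum((counts.get(k, 0) - n * p) ** 2 / (n * p)
               for k, p in probs.items() if p > 0)

def monte_carlo_p(observed, probs, reps=500, seed=1):
    """Share of simulated data sets whose X2 is at least the observed X2.

    observed: dict pattern -> observed frequency
    probs:    dict pattern -> model-implied probability (assumed given;
              the model is NOT re-estimated for each replicate here)
    """
    rng = random.Random(seed)
    n = sum(observed.values())
    patterns = list(probs)
    weights = [probs[k] for k in patterns]
    x2_obs = pearson_x2(observed, probs, n)
    hits = 0
    for _ in range(reps):
        # Draw n response patterns from the model-implied distribution.
        sample = Counter(rng.choices(patterns, weights=weights, k=n))
        if pearson_x2(sample, probs, n) >= x2_obs:
            hits += 1
    return hits / reps
```

The returned proportion plays the role of a p-value that does not lean on the chi-square reference distribution, which is the part that becomes doubtful when cells are sparse.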
Anonymous posted on Wednesday, December 05, 2001 - 7:23 am
I am interested in creating a latent class analysis using a series of mutually dependent variables (e.g. race, age dummies). In this data, people can have only one race and one age category. Is it even possible to use these types of variables to indicate latent class? All of the potential variables that may be included in the model are in the form of mutually exclusive series. They overlap with other series, but not within series. There are a number of potential series that can be included in this model (e.g. zip code, aid categories, family status, gender).
Second, in a separate run, where there were more variables that were not mutually exclusive, some categories were set to 15. Does this represent a potential problem, or does it merely indicate that everyone fits into one category of that indicator?
bmuthen posted on Thursday, December 06, 2001 - 6:02 pm
I can imagine a class being defined by a high probability of being in a certain age range, family status, etc. Mplus does not yet handle unordered categorical latent class indicators. Although I have not tried this, it would seem possible to approximate such analysis using a series of dummy binary indicators created from the polytomous variables, as long as only K-1 binary variables are derived from a K-category item. Other readers of Mplus Discussion may have an opinion on this.
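As a small illustration of the K-1 dummy idea, here is a Python sketch for preparing such indicators before the data go into Mplus. The function name `dummy_code`, the age groups, and the choice of the last category as reference are all assumptions made for the example.

```python
def dummy_code(values, categories):
    """Return K-1 binary indicators per observation; the last category
    in `categories` is the reference and is coded as all zeros."""
    kept = categories[:-1]
    return [[1 if v == c else 0 for c in kept] for v in values]

age_group = ["18-34", "35-64", "65+", "35-64"]   # made-up data
rows = dummy_code(age_group, ["18-34", "35-64", "65+"])
# rows == [[1, 0], [0, 1], [0, 0], [0, 1]]
```

Note that the K-1 indicators derived from one item can never both equal 1, so they are not independent of each other.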
Logits fixed at +-15 do not imply a problem. In fact, it helps the interpretation in that such an indicator definitely is switched on or off. For instance, a class that is defined as consisting of people positively inclined toward purchasing a youth item might have a low probability for the "age > 65" indicator being switched on.
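To see why a logit of +-15 amounts to "switched on or off", the logistic transform can be worked out in a few lines of Python. This is only the standard transform; which sign corresponds to "on" depends on whether the printed value is a logit or a threshold, so the sign convention used here is an assumption.

```python
import math

def logit_to_prob(logit):
    """Logistic transform: probability implied by a logit."""
    return 1.0 / (1.0 + math.exp(-logit))

p_on = logit_to_prob(15.0)    # about 0.9999997
p_off = logit_to_prob(-15.0)  # about 0.0000003
```

Either way, the implied probability is within rounding error of 0 or 1.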
Anonymous posted on Monday, December 10, 2001 - 8:45 pm
It would be wrong to use such mutually dependent variables as class indicators, since class indicators should be independent given the class. Represent each series as an unordered categorical latent class indicator, rather than through dummy variables. As long as there is no direct effect on this indicator (U on X), there is no difference between ordered and unordered indicators.
As for using "zip code" as a class indicator, you may want to explore the latent class regression to model the impact of exogenous variables on the class membership.
When I run a 2-class latent variable model, the following error message shows up: "THE MODEL ESTIMATION DID NOT TERMINATE NORMALLY DUE TO CHANGE IN THE LIKELIHOOD DURING THE LAST E STEP. AN INSUFFICIENT NUMBER OF E STEP ITERATIONS MAY HAVE BEEN USED. INCREASE THE NUMBER OF MITERATIONS. ESTIMATES CANNOT BE TRUSTED."
Where do I increase the number of MITERATIONS? How can I fix this problem? The model that I input is identifiable. Your information and suggestions will be appreciated. Thank you.
I'm working with an LCA with a large pool of dichotomous indicators (42). I started this project with a more traditional LCA program (WinLTA), which, as I understand it, uses the crosstabulation of all of the items and the latent variable. Based on how long it was taking to iterate, I estimated it might finish sometime in the 22nd century. I reduced the indicator pool to 21 and had better luck.
Does LCA in MPlus have the same limitation? That is, is computation time based on the 2^n cells? Or can I expand the item pool without as many worries?
I'm also running into apparent identification problems pretty early with the 21-item set, where the solution is dependent on starting values (for a 6-class model I tried 20 different sets of starting values and got 20 different solutions). Is this likely to get better or worse with larger numbers of items? Sample size is 890, with a modest degree of missing data (10-20%).
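To put numbers on the 2^n cells mentioned above, here is a quick Python calculation using the item counts and sample size from this post. The function name `sparseness` is made up for the example.

```python
def sparseness(n_items, n_obs, n_categories=2):
    """Cells in the full cross-tabulation and average observations per cell."""
    cells = n_categories ** n_items
    return cells, n_obs / cells

cells_21, avg_21 = sparseness(21, 890)   # 2,097,152 cells
cells_42, avg_42 = sparseness(42, 890)   # 4,398,046,511,104 cells
```

With 890 respondents, at most 890 of those cells can be non-empty, so nearly the whole table is empty in either case.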
bmuthen posted on Thursday, August 15, 2002 - 7:26 am
Mplus has quite efficient LCA computations. I have an example with 17 binary items, n=7,300, and 4 classes that takes 50 seconds on my 1 Mhz computer. The number of items should not be a problem.
Having as many local solutions as you report may be an indication that the 6-class model does not have much information in the data to support it, i.e. it may not be suitable for the data. The new Mplus version 2.12, which will shortly be out, offers a new likelihood ratio test (TECH11) that allows you to see if a smaller number of classes is sufficient. Testing the model fit is difficult with many items because of sparse cells, so this approach of testing the incremental fit when adding classes is useful.
Great, thanks. 1 Ghz, I take it? These are taking 20-30 minutes per run on a 1.6 or 1.8 GHz (I don't quite remember) machine. I run a SAS job to generate the random starting values, and then run MPlus in batch mode. So it's basically a day devoted to each set of runs.
I did go with a simpler model (5 classes -- 4 out of 20 runs converged on the same solution, and the BIC is only marginally worse than the best BIC from the 6-class models). I look forward to 2.12.
Following myself up. I went to the 42-item models, and was surprised to find that each ran in less than 40 seconds. I've got the warning that the crosstab was too big to allow for calculating the chi-square. But, since I'm not using the chi-square, that's fine with me. Everything I do need (including -2LL) seems to be present. Apparently the chi-square calculation was what was taking most of the 20-30 minutes. If I'm interpreting this correctly, it makes me wonder if in future versions it might be possible to turn that off while running smaller models . . .
bmuthen posted on Thursday, August 15, 2002 - 9:58 am
Yes, I meant 1 Ghz. Please send the input and data for the run that took 20-30 minutes to firstname.lastname@example.org for further investigation.
Andy Ross posted on Wednesday, October 12, 2005 - 9:55 am
In an LCA, I was wondering at what level of sparseness I should be concerned. As far as I have understood it, any level of sparseness will undermine the chi-square test of model fit, and we should therefore use BIC instead.
However, does there come a point when the level of sparseness will undermine the whole solution? And if so, what is that point?
Many thanks for your support
bmuthen posted on Wednesday, October 12, 2005 - 11:39 am
It depends very much on the model and how many parameters need to have information from the sparse parts of your data. You need to have at least a couple of observations contributing to a parameter's estimation. To get good SEs one needs larger samples than for point estimates. Probably the best way to get a rough sense of this is to do a Monte Carlo simulation using Mplus. There are general guidelines for doing this (although not specifically for LCA) in the Muthen & Muthen (2000) SEM article referenced on our web site. See also the Monte Carlo inputs on your Mplus CD.
I'm experimenting with multiple imputation of a large categorical dataset using latent class analysis, as proposed by Vermunt et al. (2008). To do so, I first need to run an LCA to estimate the density of many categorical variables (about 300). Vermunt et al. recommended suppressing the Newton-Raphson algorithm and standard error estimation in this procedure, since we are only interested in the posterior class probabilities of the subjects. How can I suppress the Newton-Raphson algorithm and standard error estimation in LCA?
I'm interested in conducting LCA on lifestyle data (n=217) with 11 two-category indicators (ordinal) and 1 three-category indicator (ordinal). I'm concerned that I am being excessive with the number of indicators/variables given the small sample size. Although I am concerned about sparse data, the computation time is short and the class profiles make a lot of sense. Are there assumptions for LCA in terms of minimum sample size relative to the number of indicators (and # categories per indicator)?
I think LCA is perhaps less sensitive to small sample sizes due to it being a parsimonious model. But to get a good feeling for it, you need to do a Monte Carlo study for your particular case, which is easy to do in Mplus (see Chapter 12 of the v6 UG).
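To give a feel for the data-generation step of such a Monte Carlo study, here is a rough Python sketch for the design described in the question (n=217, 11 binary items, 1 three-category item). It is not Mplus syntax. The two-class structure and all population probabilities are hypothetical values picked only for illustration, and the sketch stops short of fitting the model to each replicate, which a real study would do.

```python
import random
from collections import Counter

def simulate_lca(n, class_probs, item_probs, seed=0):
    """Draw n response patterns from a latent class model.

    class_probs: list of class proportions
    item_probs:  item_probs[c][j] is the list of category probabilities
                 for item j in class c
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Pick a class, then draw each item independently given the class.
        c = rng.choices(range(len(class_probs)), weights=class_probs)[0]
        row = tuple(rng.choices(range(len(p)), weights=p)[0]
                    for p in item_probs[c])
        data.append(row)
    return data

# Hypothetical population values, chosen only for illustration.
binary_hi, binary_lo = [0.2, 0.8], [0.8, 0.2]
three_hi, three_lo = [0.1, 0.3, 0.6], [0.6, 0.3, 0.1]
item_probs = [[binary_hi] * 11 + [three_hi],
              [binary_lo] * 11 + [three_lo]]

data = simulate_lca(217, [0.5, 0.5], item_probs)
counts = Counter(data)
possible = 2 ** 11 * 3                      # 6144 possible patterns
singletons = sum(1 for v in counts.values() if v == 1)
```

Comparing `len(counts)` and `singletons` with `possible` shows how thinly 217 cases spread over the table under the assumed model.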
I am applying a Mover-Stayer-Model with four time points and 1,700 cases. For each time point, there is one nominal manifest indicator. I am succeeding in estimating the model if my nominal indicator (and hence also the latent variables) have less than five classes. However, the nominal variable that is of interest to me has seven classes.
If I try to estimate the Mover-Stayer-Model with more than four classes, Mplus either runs indefinitely (I quit the program after it had been running for three weeks) or tells me that it does not have enough memory to run it.
I understand that the problem has to do with sparse data. With seven classes and four panel waves, I would ultimately arrive at a crosstable with 7^4 = 2401 cells, with a lot of them being empty. Before I give up, however, I would like to ask whether there is anything I can do about this problem. Do you know of a way to run a mover-stayer-model with seven manifest and latent classes?
Hello, I wanted to know if I am overfitting my latent class model. Although it is estimated in Mplus, I am not sure if it is empirically identified. I have 5 binary indicators of my latent variable (discrimination) and have estimated four classes of discrimination among a sample of 965 adolescents. Any help would be greatly appreciated.
Thank you Linda for the reference and associated formula... very helpful.
Is there an equivalent way to estimate degrees of freedom and LCA parameters when you have both binary and ordinal indicators? In another analysis I have estimated a 2-class LCA with 3 indicators (one of which is binary, and the other 2 are three-level ordinal indicators).
I realize that a 2 class LCA model with 3 binary indicators is not empirically identified.
You need to figure the cells in your H1 model and subtract 1 for the number of H1 parameters. You need to figure the number of thresholds in the H0 model and the number of categorical latent variable means for the H0 model. Just expand on the slide I referred you to.
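The counting rule just described can be written out as a small Python helper. This is a sketch assuming an unrestricted LCA with one set of thresholds per class and no covariates or equality constraints; the function name `lca_df` is made up for the example.

```python
from math import prod

def lca_df(categories, n_classes):
    """Degrees of freedom for an unrestricted LCA with categorical indicators.

    categories: number of categories of each indicator, e.g. [2, 3, 3]
    Returns (H1 parameters, H0 parameters, degrees of freedom).
    """
    h1_params = prod(categories) - 1            # cells minus 1
    thresholds = n_classes * sum(k - 1 for k in categories)
    class_means = n_classes - 1                 # categorical latent variable means
    h0_params = thresholds + class_means
    return h1_params, h0_params, h1_params - h0_params

lca_df([2, 2, 2], 2)   # (7, 7, 0)
lca_df([2, 3, 3], 2)   # (17, 11, 6)
lca_df([2] * 5, 4)     # (31, 23, 8)
```

The three calls correspond to the 3-binary-indicator case, the one-binary-plus-two-ordinal case, and the 5-indicator, 4-class model asked about in this thread. Positive degrees of freedom are necessary for identification but do not guarantee it.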
Hi Linda, thank you for your response. I don't presently have an active support contract. I looked at this again; it appears that the *'s are not present when I use listwise=on, so I assume the *'s represent response patterns with missing data?
mpduser1 posted on Friday, June 14, 2013 - 1:23 pm
I am fitting a series of LC models in Mplus using weighted data and the newer model fitting procedures (specifically LMR Chi-Square), as well as AIC and BIC.
I'm noticing that LMR Chi-Square and BIC provide an indication of a reasonable number of latent classes (T < 5), and substantively the classes seem reasonable. However, AIC does not provide an indication of a reasonable number of latent classes.
So, my question is, is the wide divergence between the AIC and BIC results a function of sparse data?
I would guess that AIC is less conservative than BIC in identifying the best latent class solution. Yet I've not seen published examples where the AIC fails to suggest a reasonable number of latent classes.
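For reference, the difference in conservativeness comes straight from the penalty terms, which can be checked in a few lines of Python. These are the textbook definitions, not anything specific to weighted data, and n = 1000 is just an example value.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion."""
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# Per-parameter penalty: 2 for AIC, ln(n) for BIC.
penalty_ratio = math.log(1000) / 2.0   # about 3.45 at n = 1000
```

Since each added class costs the same number of extra parameters, AIC charges much less per class than BIC in samples of this size and will tend to keep favoring larger models.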
I am running a latent class model with 9 binary indicators. Looking at the univariate response proportions, I notice that very few study participants fail to endorse one of the items (i.e. 2 out of almost 500). Given that this item provides very little information with which to differentiate individuals, I am inclined to drop it from the model. This item is also a key item when determining study eligibility, so one would expect 100% endorsement. My question: can items with such skewed distributions lead to problems with an LCA, and is it generally advisable to consider removing them when performing a latent class analysis?
I don't think such an item necessarily leads to problems in the analysis, but it probably doesn't contribute much and you are wise to remove it.
Sam Daneils posted on Thursday, November 14, 2019 - 1:14 pm
Dear Dr. Muthen,
I have a question on how to improve LCA class homogeneity and separation for indicators with sparse response patterns.
I initially estimated an LCA model with 14 binary indicators. But because all of them contain sparse responses, and the initial 14-indicator model showed high bivariate residuals among different indicators, I combined the 14 indicators into 6 indicators.
However, my models (4 class results shown below as an example) still have issues with low class homogeneity and separation. I tried constraining posterior parameters but that didn't help much.
Inspired by your other publications, I think I will try including covariates in the model.
Could you please point me to other approaches I should consider?