Mplus Discussion >> LCA fit with very sparse response patterns


John Schafer posted on Friday, February 16, 2001 - 6:29 pm

Hi! For many of my LC models the response pattern frequencies can be pretty sparse, which makes the use of the chi square suspect. I remember a neat little paper in MBR, I think, by Linda Collins suggesting a Monte Carlo approach to assessing fit with such data. It strikes me that Mplus has MC capabilities; is there some way to implement Linda's approach in Mplus?

and while I'm here... :) will it become possible to use LC with complex survey data at some point?

Thanks, and thanks for such a great product!

Chuck

Bengt O. Muthen posted on Saturday, February 17, 2001 - 9:55 am

That is an interesting line of thought. Which MBR issue was that in? I'd like to get back to you about this to say if it can be done inside the current Mplus or needs to be done by generating data outside Mplus and running it through Mplus using the RUNALL Utility (see new web info). Further Mplus developments are going full speed ahead and latent class models with complex sample data is one of several things the team is working on.

John Schafer posted on Sunday, February 18, 2001 - 7:48 am

Thank you, Bengt! I believe this is the article (I lifted it from Linda's cv, the article is in the office):

Collins, L. M., Fidler, P. L., Wugalter, S. E., & Long, J. L. (1993). Goodness-of-fit testing for latent class models. Multivariate Behavioral Research, 28, 375-389.

It is (IMHO) a pretty cool little article. Thanks again, Chuck

Bengt O. Muthen posted on Sunday, February 18, 2001 - 12:21 pm

Thanks, Chuck. Will get back to you about this.
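
As a rough illustration of the Monte Carlo idea raised in this exchange, the Python sketch below builds a simulated reference distribution for the likelihood-ratio statistic G² instead of relying on the chi-square approximation, which is suspect when response patterns are sparse. It is a simplified sketch, not Mplus functionality and not necessarily the exact procedure of the cited article: the fitted parameters are held fixed, whereas a fuller version would re-estimate the model on each simulated data set. The function names (`pattern_probs`, `g2`, `monte_carlo_pvalue`) are hypothetical, and all item probabilities are assumed to lie strictly between 0 and 1.

```python
import itertools
import math
import random
from collections import Counter


def pattern_probs(class_probs, item_probs):
    """Model-implied probability of every binary response pattern.

    class_probs[k] is P(class k); item_probs[k][j] is P(item j = 1 | class k).
    """
    n_items = len(item_probs[0])
    probs = {}
    for pattern in itertools.product((0, 1), repeat=n_items):
        total = 0.0
        for pk, items in zip(class_probs, item_probs):
            like = pk
            for u, p in zip(pattern, items):
                like *= p if u == 1 else 1.0 - p
            total += like
        probs[pattern] = total
    return probs


def g2(counts, probs):
    """Likelihood-ratio statistic; empty cells contribute nothing."""
    n = sum(counts.values())
    return 2.0 * sum(
        o * math.log(o / (n * probs[pat])) for pat, o in counts.items() if o > 0
    )


def monte_carlo_pvalue(counts, class_probs, item_probs, reps=200, seed=1):
    """Share of simulated G² values at least as large as the observed one."""
    probs = pattern_probs(class_probs, item_probs)
    patterns = list(probs)
    weights = [probs[p] for p in patterns]
    n = sum(counts.values())
    observed = g2(counts, probs)
    rng = random.Random(seed)
    exceed = 0
    for _ in range(reps):
        # Draw a data set of the same size from the fitted model.
        sample = Counter(rng.choices(patterns, weights=weights, k=n))
        if g2(sample, probs) >= observed:
            exceed += 1
    return observed, exceed / reps
```

Here `counts` maps each observed response pattern (a tuple of 0/1 values) to its frequency. Enumerating all patterns is only practical for a modest number of items.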

Anonymous posted on Wednesday, December 05, 2001 - 7:23 am

I am interested in creating a latent class analysis using several series of mutually dependent variables (e.g. race dummies, age dummies). In these data, people can have only one race and one age category. Is it even possible to use these types of variables to indicate latent class? All of the potential variables that may be included in the model are in the form of mutually exclusive series. They overlap with other series, but not within a series. There are a number of potential series that can be included in this model (e.g. zip code, aid categories, family status, gender).

Second, in a separate run, where there were more variables that were not mutually exclusive, some categories were set to 15. Does this represent a potential problem, or does it merely indicate that everyone falls into one category of that indicator?

bmuthen posted on Thursday, December 06, 2001 - 6:02 pm

I can imagine a class being defined by a high probability of being in a certain age range, family status, etc. Mplus does not yet handle unordered categorical latent class indicators. Although I have not tried this, it would seem possible to approximate such analysis using a series of dummy binary indicators created from the polytomous variables, as long as only K-1 binary variables are derived from a K-category item. Other readers of Mplus Discussion may have an opinion on this.

Logits fixed at +-15 do not imply a problem. In fact, it helps the interpretation in that such an indicator definitely is switched on or off. For instance, a class that is defined as consisting of people positively inclined toward purchasing a youth item might have a low probability for the "age > 65" indicator being switched on.
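
The two points in this reply can be sketched in a few lines of Python. The first helper converts a logit threshold into an endorsement probability, assuming the convention P(u = 1 | class) = 1 / (1 + exp(threshold)), under which a threshold of +15 means the indicator is essentially switched off and -15 means it is essentially switched on. The second derives K-1 binary indicators from a K-category variable, using the last category as the reference. Both function names are hypothetical.

```python
import math


def prob_from_threshold(tau):
    """P(u = 1 | class) for a binary indicator with logit threshold tau.

    Assumes the convention in which a large positive threshold
    corresponds to a probability near zero.
    """
    return 1.0 / (1.0 + math.exp(tau))


def dummy_code(value, categories):
    """K-1 binary indicators for a K-category value.

    The last entry of `categories` is the reference and maps to all zeros.
    """
    return [1 if value == c else 0 for c in categories[:-1]]


print(prob_from_threshold(15.0))   # very close to 0: indicator switched off
print(prob_from_threshold(-15.0))  # very close to 1: indicator switched on
print(dummy_code("b", ["a", "b", "c"]))
```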

Anonymous posted on Monday, December 10, 2001 - 8:45 pm

It would be wrong to use such mutually dependent variables as class indicators, since class indicators should be independent given the class. Represent each series as an unordered categorical latent class indicator, rather than through dummy variables. As long as there is no direct effect on this indicator (U on X), there is no difference between ordered and unordered indicators.

As for using "zip code" as a class indicator, you may want to explore the latent class regression to model the impact of exogenous variables on the class membership.

YI posted on Friday, May 10, 2002 - 4:38 am

Dear all,

When I run a 2-class latent variable model, an error message like this shows up:
"THE MODEL ESTIMATION DID NOT TERMINATE NORMALLY DUE TO CHANGE IN THE LIKELIHOOD DURING THE LAST E STEP.
AN INSUFFICENT NUMBER OF E STEP ITERATIONS MAY HAVE BEEN USED. INCREASE THE NUMBER OF MITERATIONS. ESTIMATES CANNOT BE TRUSTED. "

Where do I increase the number of MITERATIONS, and how do I fix this problem? The model that I input is identifiable. Your information and suggestions will be appreciated. Thank you.

Yi

Linda K. Muthen posted on Friday, May 10, 2002 - 5:59 am

MITERATIONS is an option of the ANALYSIS command. See page 40 of the Mplus User's Guide.

Patrick Malone posted on Thursday, August 15, 2002 - 6:02 am

Greetings.

I'm working with an LCA with a large pool of dichotomous indicators (42). I started this project with a more traditional LCA (WinLTA), which, as I understand it, uses the crosstabulation of all of the items and the latent. Based on how long it was taking to iterate, I estimated it might finish sometime in the 22nd century. I reduced the indicator pool to 21 and had better luck.

Does LCA in MPlus have the same limitation? That is, is computation time based on the 2^n cells? Or can I expand the item pool without as many worries?

I'm also running into apparent identification problems pretty early with the 21-item set, where the solution is dependent on starting values (for a 6-class model I tried 20 different sets of starting values and got 20 different solutions). Is this likely to get better or worse with larger numbers of items? Sample size is 890, with a modest degree of missing data (10-20%).

Thanks.

bmuthen posted on Thursday, August 15, 2002 - 7:26 am

Mplus has quite efficient LCA computations. I have an example with 17 binary items, n=7,300, and 4 classes that takes 50 seconds on my 1 Mhz computer. The number of items should not be a problem.

Having as many local solutions as you report may be an indication that the 6-class model does not have much information in the data to support it, i.e. it may not be suitable for the data. The new Mplus version 2.12, which will shortly be out, offers a new likelihood ratio test (TECH11) that allows you to see if a smaller number of classes is sufficient. Testing the model fit is difficult with many items because of sparse cells, so this approach of testing the incremental fit when adding classes is useful.
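
To make the sparseness point concrete, the following Python sketch counts the cells in the full crosstabulation of the items and the largest share of them that a sample of a given size could possibly occupy. The helper names are hypothetical; the item counts (21 and 42 binary items) and the sample size (890) are taken from the question above.

```python
def crosstab_cells(categories_per_item):
    """Number of cells in the full crosstabulation of the items."""
    cells = 1
    for c in categories_per_item:
        cells *= c
    return cells


def max_occupied_fraction(categories_per_item, n):
    """Upper bound on the share of cells that n observations can fill."""
    cells = crosstab_cells(categories_per_item)
    return min(n, cells) / cells


for n_items in (21, 42):
    items = [2] * n_items
    print(n_items, crosstab_cells(items), max_occupied_fraction(items, 890))
```

With 21 binary items there are 2,097,152 cells, so 890 cases can fill well under 0.1% of them, which is why a chi-square test against the full table carries little information.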

Patrick Malone posted on Thursday, August 15, 2002 - 7:53 am

Great, thanks. 1 Ghz, I take it? These are taking 20-30 minutes per run on a 1.6 or 1.8 GHz (I don't quite remember) machine. I run a SAS job to generate the random starting values, and then run MPlus in batch mode. So it's basically a day devoted to each set of runs.

I did go with a simpler model (5 classes -- 4 out of 20 runs converged on the same solution, and the BIC is only marginally worse than the best BIC from the 6-class models). I look forward to 2.12.

Meanwhile, I'll experiment with adding items.

Thanks.

Patrick Malone posted on Thursday, August 15, 2002 - 9:03 am

Following myself up. I went to the 42-item models, and was surprised to find that each ran in less than 40 seconds. I've got the warning that the crosstab was too big to allow for calculating the chi-square. But, since I'm not using the chi-square, that's fine with me. Everything I do need (including -2LL) seems to be present. Apparently the chi-square calculation was what was taking most of the 20-30 minutes. If I'm interpreting this correctly, it makes me wonder if in future versions it might be possible to turn that off while running smaller models . . .

Thanks.

bmuthen posted on Thursday, August 15, 2002 - 9:58 am

Yes, I meant 1 Ghz. Please send the input and data for the run that took 20-30 minutes to support@statmodel.com for further investigation.

Andy Ross posted on Wednesday, October 12, 2005 - 9:55 am

Hi

In an LCA, at what level of sparseness should I be concerned? As far as I understand it, any level of sparseness will undermine the chi-square test of model fit, and we should therefore use BIC instead.

However, does there come a point when the level of sparseness will undermine the whole solution? And if so, what is that point?

Many thanks for your support

Andy

bmuthen posted on Wednesday, October 12, 2005 - 11:39 am

It depends very much on the model and how many parameters need to have information from the sparse parts of your data. You need to have at least a couple of observations contributing to a parameter's estimation. To get good SEs one needs larger samples than for point estimates. Probably the best way to get a rough sense of this is to do a Monte Carlo simulation using Mplus. There are general guidelines for doing this (although not specifically for LCA) in the Muthen & Muthen (2000) SEM article referenced on our web site. See also the Monte Carlo inputs on your Mplus CD.

Moh Yin Chang posted on Thursday, August 28, 2008 - 12:28 pm

I'm experimenting with multiple imputation of a large categorical dataset using latent class analysis, as proposed by Vermunt et al. (2008). To do so I first need to run an LCA to estimate the density of many categorical variables (about 300). Vermunt et al. recommend suppressing the Newton-Raphson algorithm and standard error estimation in this procedure, since we are only interested in the posterior class probabilities of the subjects. How can I suppress the Newton-Raphson algorithm and standard error estimation in LCA?

Bengt O. Muthen posted on Friday, August 29, 2008 - 8:11 am

You would fix all parameters.

Moh Yin Chang posted on Friday, August 29, 2008 - 8:52 am

Dr. Muthen,

The only parameters that I'm interested in are class probabilities. Do you mean that I should fix all variances? Would doing so suppress the Newton-Raphson algorithm?

Bengt O. Muthen posted on Friday, August 29, 2008 - 9:43 am

I mean - fix all model parameters. That's the only way to avoid the optimization. You still get estimated posterior probabilities for each individual and each class.
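
Why posterior probabilities are still available with every parameter fixed can be seen from Bayes' rule: once the class probabilities and item probabilities are given, no optimization is needed. The Python sketch below illustrates this for binary indicators under local independence. It is a minimal illustration with a hypothetical function name, not a description of how Mplus computes them internally; missing responses (marked `None`) are simply skipped.

```python
def posterior_class_probs(pattern, class_probs, item_probs):
    """P(class | responses) by Bayes' rule for binary indicators.

    class_probs[k] is P(class k); item_probs[k][j] is P(item j = 1 | class k).
    A response of None is treated as missing and contributes no information.
    """
    joint = []
    for pk, items in zip(class_probs, item_probs):
        like = pk
        for u, p in zip(pattern, items):
            if u is None:
                continue
            like *= p if u == 1 else 1.0 - p
        joint.append(like)
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical fixed parameters for two classes and two items.
print(posterior_class_probs((1, 1), [0.5, 0.5], [[0.9, 0.9], [0.1, 0.1]]))
```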

Moh Yin Chang posted on Saturday, September 06, 2008 - 5:27 pm

Dr. Muthen,

It appears that Mplus does not allow fixing variances for categorical outcomes and all my 293 variables are categorical. What would you recommend me to do to avoid variance estimation?

Linda K. Muthen posted on Saturday, September 06, 2008 - 5:55 pm

Categorical outcomes do not have variances. You don't need to fix them. Just don't mention them.

Chris Platania-Phung posted on Thursday, September 15, 2011 - 5:21 pm

Hi,

I'm interested in conducting LCA on lifestyle data (n=217) with 11 two-category indicators (ordinal) and 1 three-category indicator (ordinal). I'm concerned about being excessive with the number of indicators/variables given the small sample size. Although I am concerned about sparse data, the computation time is short and the class profiles make a lot of sense. Are there assumptions for LCA in terms of minimum sample size relative to the number of indicators (and number of categories per indicator)?

Looking forward to your reply.

Bengt O. Muthen posted on Thursday, September 15, 2011 - 6:11 pm

I think LCA is perhaps less sensitive to small sample sizes due to it being a parsimonious model. But to get a good feeling for it, you need to do a Monte Carlo study for your particular case, which is easy to do in Mplus (see Chapter 12 of the v6 UG).
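
For a feel of how sparse such data are before setting up a full Monte Carlo study, the Python sketch below simulates one data set of the size in question (n = 217, 11 binary items and 1 three-category item) and counts the distinct response patterns. It is not a substitute for the Mplus Monte Carlo facility. The two-class population and all of its probabilities are hypothetical values chosen only for illustration, as is the function name `simulate_lca`.

```python
import random
from collections import Counter


def simulate_lca(n, class_probs, item_probs, seed=0):
    """Draw n response vectors from a latent class model.

    item_probs[k][j] is the list of category probabilities of item j in class k.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        k = rng.choices(range(len(class_probs)), weights=class_probs)[0]
        row = tuple(
            rng.choices(range(len(cat)), weights=cat)[0] for cat in item_probs[k]
        )
        data.append(row)
    return data


# Hypothetical population: two equally sized classes with opposite profiles.
class_probs = [0.5, 0.5]
item_probs = [
    [[0.2, 0.8]] * 11 + [[0.1, 0.3, 0.6]],
    [[0.8, 0.2]] * 11 + [[0.6, 0.3, 0.1]],
]

data = simulate_lca(217, class_probs, item_probs)
patterns = Counter(data)
cells = 2 ** 11 * 3  # full crosstabulation of the 12 items
singletons = sum(1 for c in patterns.values() if c == 1)
print(len(patterns), "distinct patterns out of", cells, "possible cells")
print(singletons, "patterns observed exactly once")
```

Repeating the draw over many seeds and fitting the model to each data set is the part that a proper Monte Carlo study adds.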

Thomas Plischke posted on Wednesday, November 16, 2011 - 4:59 am

Also concerning the problem of sparse data:

I am applying a Mover-Stayer-Model with four time points and 1,700 cases. For each time point, there is one nominal manifest indicator. I am succeeding in estimating the model if my nominal indicator (and hence also the latent variables) have less than five classes. However, the nominal variable that is of interest to me has seven classes.

If I try to model the Mover-Stayer-Model with more than four classes, either MPLUS processes indefinitely (I quit the program after running three weeks) or tells me that it does not have enough memory space to run it.

I understand that the problem has to do with sparse data. With seven classes and four panel waves, I would ultimately arrive at a crosstable with 7^4 = 2401 cells, with a lot of them being empty. Before I quit my enterprise, however, I would like to ask whether there is anything I can do about this problem. Do you know of a way to run a mover-stayer model with seven manifest and latent classes?

Linda K. Muthen posted on Wednesday, November 16, 2011 - 2:10 pm

The number of classes is not determined by the number of categories of the nominal variable. I would start with 2 classes.

Bernice Garnett posted on Thursday, June 07, 2012 - 7:46 am

Hello,
I wanted to know whether I am overfitting my latent class model: although it is estimated in Mplus, I am not sure that it is empirically identified. I have 5 binary indicators of my latent variable (discrimination) and have estimated four classes of discrimination in a sample of 965 adolescents. Any help would be greatly appreciated.

Thanks,

Bernice

Linda K. Muthen posted on Thursday, June 07, 2012 - 2:21 pm

Do you have an LCA with five binary latent class indicators or a model with one continuous factor that has five binary factor indicators?

Bernice Garnett posted on Monday, June 11, 2012 - 7:18 am

Sorry if I was not clear. I have an LCA with five binary latent class indicators.

Thanks for your help,

Bernice

Linda K. Muthen posted on Monday, June 11, 2012 - 11:59 am

How many classes does your model have?

Bernice Garnett posted on Wednesday, June 13, 2012 - 2:21 pm

My model has four classes.

Thanks!

Bernice

Linda K. Muthen posted on Wednesday, June 13, 2012 - 3:14 pm

You can identify up to 5 classes with 5 binary indicators. See Slide 72 of the Topic 5 course handout to see how to figure this out.

Bernice Garnett posted on Thursday, June 14, 2012 - 12:10 pm

Thank you Linda for the reference and associated formula... very helpful.

Is there an equivalent way to calculate the degrees of freedom and the number of LCA parameters when you have both binary and ordinal indicators? In another analysis I have estimated a 2-class LCA with 3 indicators (one binary and two three-level ordinal).

I realize that a 2 class LCA model with 3 binary indicators is not empirically identified.

Thank you!

Bernice

Linda K. Muthen posted on Friday, June 15, 2012 - 1:06 pm

You need to figure the cells in your H1 model and subtract 1 for the number of H1 parameters. You need to figure the number of thresholds in the H0 model and the number of categorical latent variable means for the H0 model. Just expand on the slide I referred you to.
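
The counting rule described in this reply can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes an unrestricted LCA in which every class has its own thresholds for every indicator. A non-negative result is only a necessary counting condition, not a guarantee of identification.

```python
def lca_parameter_count(categories_per_item, n_classes):
    """Return (H1 parameters, H0 parameters, degrees of freedom).

    H1: number of cells in the full crosstabulation minus 1.
    H0: one threshold per non-reference category, per item, per class,
        plus n_classes - 1 categorical latent variable means.
    """
    cells = 1
    for c in categories_per_item:
        cells *= c
    h1 = cells - 1
    thresholds = n_classes * sum(c - 1 for c in categories_per_item)
    class_means = n_classes - 1
    h0 = thresholds + class_means
    return h1, h0, h1 - h0


print(lca_parameter_count([2] * 5, 5))     # five binary indicators, 5 classes
print(lca_parameter_count([2] * 5, 6))     # 6 classes: more H0 than H1 parameters
print(lca_parameter_count([2, 3, 3], 2))   # one binary and two three-level items
```

For five binary indicators this gives 31 H1 parameters against 29 H0 parameters with 5 classes (2 degrees of freedom) and 35 H0 parameters with 6 classes, which matches the limit of 5 classes stated earlier. For one binary and two three-level indicators with 2 classes, the count is 17 H1 parameters, 11 H0 parameters, and 6 degrees of freedom.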

Joseph E. Glass posted on Thursday, May 02, 2013 - 2:22 pm

In the TECH10 output, response patterns are printed. For example,
1 00 2 10 3 01 4 11
5 *0 6 *1 7 0*

What does response pattern 5, 6, and 7 indicate, where there is a "*" as one of the values? This is an analysis with two dichotomous variables. Thank you!

Linda K. Muthen posted on Thursday, May 02, 2013 - 2:30 pm

Please send the output and your license number to support@statmodel.com so I can see the context.

Joseph E. Glass posted on Thursday, May 02, 2013 - 5:22 pm

Hi Linda, thank you for your response.
I don't presently have an active support contract. I looked at this again, it appears that the *'s are not present when I use listwise=on, so I assume the *'s represent response patterns with missing data?

Bengt O. Muthen posted on Thursday, May 02, 2013 - 9:07 pm

I think that is right.
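
Under the reading tentatively confirmed here, where "*" marks a missing response, the pattern listing can be reproduced from raw data with a few lines of Python. This is only an illustration of that interpretation with hypothetical function names; it does not reproduce the ordering or numbering that TECH10 uses.

```python
from collections import Counter


def pattern_string(row):
    """Render one response vector, writing '*' for a missing value (None)."""
    return "".join("*" if v is None else str(v) for v in row)


def tabulate_patterns(rows):
    """Frequency of each response pattern, missing-data patterns included."""
    return Counter(pattern_string(r) for r in rows)


# Hypothetical data on two dichotomous variables.
rows = [(0, 0), (1, 0), (None, 0), (None, 0), (0, None)]
print(tabulate_patterns(rows))
```

Dropping every row that contains `None` before tabulating corresponds to listwise deletion, which is why the starred patterns disappear in that case.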

Joseph E. Glass posted on Friday, May 03, 2013 - 6:23 am

Thank you Linda and Bengt for your responses!

mpduser1 posted on Friday, June 14, 2013 - 1:23 pm

I am fitting a series of LC models in Mplus using weighted data and the newer model fitting procedures (specifically LMR Chi-Square), as well as AIC and BIC.

I'm noticing that LMR Chi-Square and BIC provide indication of a reasonable number of latent classes (T < 5), and substantively the classes seem reasonable. However AIC does not provide an indication of a reasonable number of latent classes.

So, my question is, is the wide divergence between the AIC and BIC results a function of sparse data?

I would guess that AIC is less conservative than BIC in identifying the best latent class solution. Yet I've not seen published examples where the AIC fails to suggest a reasonable number of latent classes.

Bengt O. Muthen posted on Sunday, June 16, 2013 - 4:05 pm

I am not sure that sparse data would lead to this discrepancy. Don't know what might.
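
For reference, the two criteria differ only in how heavily they penalize each additional parameter, which the Python sketch below makes explicit using the standard definitions. It does not explain the divergence asked about, and the function names and example numbers are hypothetical.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: penalty of 2 per parameter."""
    return -2.0 * loglik + 2.0 * n_params


def bic(loglik, n_params, n):
    """Bayesian information criterion: penalty of ln(n) per parameter."""
    return -2.0 * loglik + n_params * math.log(n)


# Hypothetical loglikelihood and parameter count, for illustration only.
print(aic(-1000.0, 10), bic(-1000.0, 10, 500))
```

Because ln(n) exceeds 2 once n is 8 or more, BIC charges more per parameter than AIC in any realistic sample, so AIC will tend to keep favoring models with more classes.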

cathy labrish posted on Wednesday, June 06, 2018 - 5:22 pm

Hello,

I am running a latent class model with 9 binary indicators. Looking at the univariate response proportions, I notice that for one item very few study participants fail to endorse it (i.e. 2 out of almost 500). Given that this item provides very little information with which to differentiate individuals, I am inclined to drop it from the model. This item is also a key item for determining study eligibility, so one would expect 100% endorsement. My question:
Can items with such skewed distributions lead to problems with an LCA, and is it generally advisable to consider removing them when performing a latent class analysis?

Bengt O. Muthen posted on Wednesday, June 06, 2018 - 6:15 pm

I don't think such an item necessarily leads to problems in the analysis, but it probably doesn't contribute much and you are wise to remove it.

Sam Daneils posted on Thursday, November 14, 2019 - 1:14 pm

Dear Dr. Muthen,

I have a question on how to improve LCA class homogeneity and separation for indicators with sparse response patterns.

I initially estimated an LCA model with 14 binary indicators. Because all of them have sparse responses, and the initial 14-indicator model showed high bivariate residuals among several indicators, I combined the 14 indicators into 6.

However, my models (4-class results shown below as an example) still have issues with low class homogeneity and separation. I tried constraining posterior parameters but that didn't help much.

Inspired by your other publications, I think I will try including a covariate in the model.

Could you please point me to other approaches I should consider?

Any suggestion is greatly appreciated!

Loglikelihood               -31052.761
Akaike (AIC)                 62159.521
Sample-Size Adjusted BIC     62277.897
Chi-square                      68.19
Entropy                          0.642

      C1      C2      C3      C4
I1    0.076   0.322   1       0.065
I2    0.025   0.358   0.043   0.017
I3    0.007   0.359   0.09    0.012
I4    0.774   0.443   0.236   0.35
I5    0.635   0.494   0.271   0.366
I6    0.194   0.233   0.01    0

Thanks in advance!
Sam

Bengt O. Muthen posted on Thursday, November 14, 2019 - 5:16 pm

You can work with your 14 items and study their item-specific entropy contribution. See

http://www.statmodel.com/download/UnivariateEntropy.pdf
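
As background for the entropy value of 0.642 reported above, the Python sketch below computes the overall relative entropy from a matrix of posterior class probabilities, assuming the usual definition E = 1 - [sum over cases and classes of -p ln p] / (n ln K). It is not the item-specific (univariate) entropy described in the linked note, which is not reproduced here. The function name is hypothetical.

```python
import math


def relative_entropy(posteriors):
    """Relative entropy in [0, 1]; 1 means perfectly separated classes.

    posteriors[i][k] is the posterior probability of class k for case i.
    """
    n = len(posteriors)
    k = len(posteriors[0])
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0.0:  # the limit of p * ln(p) as p -> 0 is 0
                total -= p * math.log(p)
    return 1.0 - total / (n * math.log(k))


# Hypothetical posteriors for three cases and two classes.
print(relative_entropy([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]))
```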