LCA and sampling weights
 Mark Shevlin posted on Monday, April 03, 2006 - 9:11 am
Hi, I have been estimating LCA models based on a large national survey. In order to get an accurate estimate of class sizes, should I (a) use the sampling weights to adjust estimates of class size after classifying cases, or (b) include the sampling weight in the actual LCA analysis?

I have noticed that including a sampling weight in the analysis tends to result in solutions with fewer classes, when I had expected only the test statistics to be adjusted.

Many thanks in advance
 Linda K. Muthen posted on Monday, April 03, 2006 - 2:59 pm
Sampling weights should be included in the analysis because they affect parameter estimates, standard errors, and tests of model fit.
 Mark Shevlin posted on Wednesday, April 05, 2006 - 4:21 am
Many thanks
 Mark Shevlin posted on Thursday, April 06, 2006 - 2:14 am
Hi Linda,
I have run into some problems when using the sampling weight in an LCA. The LRT statistic seems to be behaving oddly. The mean for the VUONG-LO-MENDELL-RUBIN LIKELIHOOD RATIO TEST without the sampling weight is 35 and with the sampling weight it is 426234. This difference seems to be large. With the weighting variable, the mean for the model with one class fewer is 35, and for the model with one class more it is 37000. Do these estimates seem reasonable?

I have tried the parametric likelihood ratio bootstrap test but cannot seem to get this test when a weight variable is included in the model. Can the bootstrap test be conducted when a weight variable is included?

Many thanks in advance
 Linda K. Muthen posted on Thursday, April 06, 2006 - 7:14 am
I would have to see your analysis to answer this. Please send your input, data, output, and license number to support. Bootstrap and weights cannot be used together.
 Justin Jager posted on Monday, December 18, 2006 - 6:26 am
In the first response in this thread, Dr. Muthen endorses the use of sampling weights when using LCA. Well, is it ever valid to not use sample weights? For example, via LCA, I am identifying different classes of substance use trajectories. The sample I am using oversamples heavy substance users, so the weight variable, in order to render the results representative of the U.S. population, weights the heavy users lower than the non-heavy users.

Not surprisingly, the optimal number of latent classes (as well as the growth characteristics of the classes) varies depending upon whether the weight variable is included in the analyses. In short, when the weight variable is used, which weights low substance users more, fewer latent classes are identified among the heavy substance users, while the opposite is true when weights are not used.

(continued on in post below...)
 Justin Jager posted on Monday, December 18, 2006 - 6:27 am
(continuation of post above...)

It seems to me that an argument can be made for not using the sample weights in this case. That is, the additional classes identified among the heavy users when not using the sample weight are "real" classes: they exist in the sample, and they exist in the population. If one is trying to identify latent classes among a small sub-sample of the population, it seems to make perfect sense to oversample that small sub-sample in order to do so. In order to make the estimates representative, the posterior probabilities for group membership could then be used to make the latent classes known classes, and then the sample weight could be applied to these later analyses.

Do you see a fundamental flaw in the argument above for not using the sample weight initially, but bringing it back in later for subsequent analyses? While the logic of my argument seems pretty straightforward, I am not familiar with the nuts-and-bolts of how sample weights actually impact class identification in LCA -- so there could be something I am failing to realize.


 Linda K. Muthen posted on Monday, December 18, 2006 - 8:56 am
If you don't use sampling weights, your generalizations are to the sample. If you use sampling weights, you can generalize to the population. You could also consider looking only at the heavy users.
 jtw posted on Monday, October 04, 2010 - 9:41 am

I understand that to generalize LCGA results to the population, one should conduct the analysis with the appropriate sampling weight applied. However, I am going to do additional analyses (e.g., ANOVAs) with individuals assigned to their most likely latent class. In general, I believe it to be appropriate to weight analyses such as the ANOVA. However, in this particular case it seems there may be double weighting occurring, since weights would be applied for the LCGA and then again for the post-trajectory analysis (e.g., ANOVA), which doesn't seem right to me.

Should I apply the sampling weight at the LCGA stage only? Apply the weight at the ANOVA stage only? Apply the weight during both the LCGA and ANOVA stages? Any guidance is most helpful. Thanks.
 Bengt O. Muthen posted on Tuesday, October 05, 2010 - 9:58 am
I think you should use weights in both stages. Using them in the first stage ensures correct parameter estimates, which form the basis for the posterior probabilities that give the most likely class. But then in the ANOVA you need to account for the fact that the people assigned to a given most likely class should not all count equally, so weight again.
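As an illustration of the second stage only, here is a minimal Python sketch of weighted outcome means within each most likely class. The function name and the data are hypothetical, and the sketch gives point estimates only; it does not produce design-based standard errors or the ANOVA test itself.

```python
from collections import defaultdict

def weighted_class_means(outcome, most_likely_class, weights):
    """Weighted mean of an outcome within each most likely class."""
    num = defaultdict(float)  # sum of w * y per class
    den = defaultdict(float)  # sum of w per class
    for y, c, w in zip(outcome, most_likely_class, weights):
        num[c] += w * y
        den[c] += w
    return {c: num[c] / den[c] for c in den}

# Hypothetical data: four people, two classes, unequal sampling weights.
y = [1.0, 3.0, 2.0, 6.0]
cls = [1, 1, 2, 2]
w = [1.0, 3.0, 2.0, 2.0]
means = weighted_class_means(y, cls, w)
print(means)
```

In class 1 the second person carries three times the weight of the first, so the weighted mean is pulled toward that person's value; with equal weights within a class the weighted and unweighted means coincide.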
 Carol Rhonda Burns posted on Tuesday, September 29, 2015 - 8:34 am
Dear Dr. Muthen,

I am running an LCA with an epidemiological sample, using weights. I saved the most likely class and exported it to SPSS. When I run frequencies in SPSS, I get the same weighted percentages for each class as in Mplus, but the actual frequencies (class counts) are markedly different. Would you be able to explain why this is the case?
 Linda K. Muthen posted on Tuesday, September 29, 2015 - 10:57 am
You must apply the weights to the posterior probabilities that you exported.
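A minimal Python sketch of what applying the weights to exported posterior probabilities could look like. It assumes that the estimation software rescales the weights to sum to the sample size, which would explain identical percentages but different counts; that rescaling, the function name, and the data are assumptions for illustration.

```python
def weighted_class_counts(posteriors, weights, rescale=True):
    """Sum of weight * posterior probability per class.

    With rescale=True the weights are first scaled to sum to the number
    of observations (an assumption about how the software reports counts).
    """
    n = len(weights)
    scale = n / sum(weights) if rescale else 1.0
    counts = [0.0] * len(posteriors[0])
    for probs, w in zip(posteriors, weights):
        for j, p in enumerate(probs):
            counts[j] += scale * w * p
    return counts

# Hypothetical posterior probabilities for three people and two classes.
post = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
w = [10.0, 30.0, 20.0]

scaled = weighted_class_counts(post, w)                  # counts sum to n
raw = weighted_class_counts(post, w, rescale=False)      # counts sum to sum(w)
scaled_props = [c / sum(scaled) for c in scaled]
raw_props = [c / sum(raw) for c in raw]
print(scaled, raw, scaled_props)
```

The two sets of counts differ by a constant factor, so the class proportions are the same either way.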
 Carol Rhonda Burns posted on Wednesday, September 30, 2015 - 7:30 am
Thank you!
 Corey Savage posted on Thursday, January 28, 2016 - 1:05 am
I am running a latent profile analysis with a nationally representative sample of 1200 individuals clustered in 100 programs. The latent class indicators are 6 counts and 10 Rasch scales.

When using the sampling weights, the BLRT is not allowed for the number of latent classes. I've come across simulation studies where the LMR test has fairly high rates of Type I error. In my analysis the p-value for the LMR test for 2 vs 3 classes was 0.3. The next best fit index, the BIC, continues to decrease substantially through 8-9 class models. How would one interpret what to do here? I understand that substantive reasoning is the next best step, but the initial tests didn't help much to point in a direction.

Any help or references would be much appreciated!
 Bengt O. Muthen posted on Friday, January 29, 2016 - 11:05 am
BIC is the only index that is useful here because it is the only one that takes complex sampling features into account.

If BIC doesn't show a minimum, you may want to add some residual correlations among outcomes. Which ones to add can be gauged by adding a single factor to the model and seeing for which items its loadings are significant.
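For reference, a small Python sketch of looking for a BIC minimum across class solutions, using the usual definition BIC = -2 logL + p ln(n). The loglikelihoods and parameter counts below are made up for illustration and do not come from any analysis in this thread.

```python
import math

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2*logL + p*ln(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

# Hypothetical (loglikelihood, number of free parameters) per class count.
fits = {2: (-5200.0, 33), 3: (-5100.0, 50), 4: (-5080.0, 67)}
n = 1200  # hypothetical sample size

bics = {k: bic(ll, p, n) for k, (ll, p) in fits.items()}
best = min(bics, key=bics.get)  # class count with the lowest BIC
for k in sorted(bics):
    print(k, round(bics[k], 1))
print("lowest BIC at", best, "classes")
```

If the BIC keeps dropping as classes are added, no entry in such a table is a clear minimum, which is the situation the residual-correlation suggestion above is meant to address.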
 Corey Savage posted on Friday, January 29, 2016 - 11:58 am
OK. I believe a minimum is found, but with a very high number of classes. I've found in the literature that when using Rasch scores as indicators in an LCA/LPA, the BIC can recommend a spurious number of classes. If the BIC doesn't perform well either in this case, I guess I am feeling a bit in the dark about the number of classes to select. What would you recommend?
 Corey Savage posted on Friday, January 29, 2016 - 12:17 pm
Also, do you by chance have a reference for only the BIC being relevant to use with complex sampling?
 Bengt O. Muthen posted on Friday, January 29, 2016 - 6:11 pm
I don't see why Rasch scores would hurt BIC unless their distributions are very skewed.

No special reference, but the TECH11 and TECH14 theory references do not consider complex survey data.
 Corey Savage posted on Friday, January 29, 2016 - 7:39 pm
There are a couple of the Rasch scores that are quite skewed and a couple that are bimodal. What would be the best approach here, or would you recommend against using the sampling weights so I could utilize BLRT?
 Bengt O. Muthen posted on Monday, February 01, 2016 - 10:05 am
I would stay with BIC. See also my answer to your other question on this.
 Ann-Renee Blais posted on Thursday, September 08, 2016 - 8:55 am
Good morning,

I'm working with data from a stratified random sample with 4 strata and their corresponding sampling weights. I believe my weights are the raw weights (e.g., 23.03, 5.50). In order to generalize the results of my LPA to the population, is the following syntax appropriate?

weight is weight;
strat is stratum;
type is mixture complex

When I run the LPA, I get sample frequencies, however. What is wrong with my syntax?

Thank you for your help!

 Linda K. Muthen posted on Thursday, September 08, 2016 - 11:51 am
Please send the output and your license number to support, and explain exactly where in the output you are looking and what you don't understand.
 Virginia Rangel posted on Friday, March 10, 2017 - 11:41 am
I am trying to run an LCA using complex data. My data file has the weights and also the replicate weights. However, when I try to run the analysis with

ANALYSIS=Complex Mixture;

I get an error message saying that "Mixture" cannot be used with replicate weights. But when I take out the "Mixture" from "analysis=", I get a new error message stating that "Classes option is only available with Type=Mixture".

Can replicate weights (REPSE=BRR) not be used in LCA? If not, should I drop the replicate weights and just use the weight? If they can be, how should I alter my syntax?

I also had the following error message:
"Analysis with replicate weights is not allowed with algorithm=integration"

Original syntax:
Variable: names =ID X1-X23 W1 W1S001-W1S200;
Count=X1 X2 X23;

Analysis: Type=Complex;

Thank you!

 Tihomir Asparouhov posted on Friday, March 10, 2017 - 11:40 pm
Replicate weights are not available / implemented for all models (not available for LCA and algo=int).

You can do it "manually" using 201 separate runs, one for each of the weight variables, and then combining them to obtain the SE following the formulas in

With some R programming you can automate that

or, if you split the data into 201 data sets with one weight variable each, Mplus external montecarlo can run all the runs for you and save the results; see User's Guide example 12.6 step 2.
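The formulas referred to above are not preserved in this thread. As a rough Python sketch, a common general form of the replicate-weight variance is a multiplier times the sum of squared deviations of the replicate estimates from the full-sample estimate. The multiplier depends on the replication method (for example 1/R for plain BRR with R replicates, or (R-1)/R for a delete-one jackknife); the function name, the default multiplier, and the numbers below are assumptions.

```python
import math

def replicate_se(full_estimate, replicate_estimates, multiplier=None):
    """Standard error from replicate-weight runs.

    full_estimate: estimate from the run with the main sampling weight.
    replicate_estimates: the same parameter from each replicate-weight run.
    multiplier: method-specific constant; defaults to 1/R (plain BRR),
                which is an assumption and should match the survey's docs.
    """
    r = len(replicate_estimates)
    if multiplier is None:
        multiplier = 1.0 / r
    ss = sum((est - full_estimate) ** 2 for est in replicate_estimates)
    return math.sqrt(multiplier * ss)

# Hypothetical estimates of one parameter: main run plus four replicates.
full = 0.30
reps = [0.28, 0.32, 0.31, 0.29]
se = replicate_se(full, reps)
print(round(se, 4))
```

The same function would be applied parameter by parameter, after checking that the classes are in the same order in every run.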
 Rachel Casey posted on Sunday, November 19, 2017 - 1:13 pm
Hello Drs.,

I have attempted to enter the following syntax for an LCA:

file is 'C:/Users/LCAnowght.csv';
weight = w;
missing = all(9);
classes = MHDIFF (2);
type = mixture;
estimator = MLR;
starts = 1000 100;
stiterations = 20;
tech11 tech14;

I am receiving the following error messages:
The number of observations is 0. Check your data and format statement.
Data file: \client\c$\users\rache\desktop\/LCAnowght.csv
Non-missing blank found in data file at record #1, field #: 10

When I enter the syntax without the weight line and variable, Mplus runs without issue. What am I doing wrong? Thank you in advance for your assistance.
 Linda K. Muthen posted on Tuesday, November 21, 2017 - 7:55 am
Blanks are not allowed with free format data. Sometimes SPSS uses blanks for missing data. You need to resave the data without blanks or use a FORMAT statement and MISSING = BLANK;
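One way to resave the data without blanks is a small preprocessing step outside Mplus. The Python sketch below replaces empty fields with a numeric missing-data code; the code 9 mirrors the `missing = all(9)` line in the question, the function name and sample rows are hypothetical, and it assumes 9 is never a legitimate value for any variable in the file (including the weight).

```python
import csv
import io

def fill_blanks(rows, missing_code="9"):
    """Replace empty or whitespace-only fields with a missing-data code."""
    return [
        [cell if cell.strip() != "" else missing_code for cell in row]
        for row in rows
    ]

# Hypothetical two-record file with blank fields, held in memory here.
raw = "1,0,,2.5\n0,,1,1.2\n"
rows = list(csv.reader(io.StringIO(raw)))
cleaned = fill_blanks(rows)

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(cleaned)
print(out.getvalue())
```

For a real file, the in-memory strings would be replaced by `open(...)` calls on the input and output paths, with `newline=""` as the `csv` module recommends.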
 Nicholas J Parr posted on Saturday, September 28, 2019 - 9:34 pm
Good day -

I am conducting an LCA with covariates (R3STEP), with inverse propensity weighting as described by Butera et al., i.e., incorporating IPWs as survey weights in the mixture model.

In my estimates of the (inverse propensity weighted) covariate's association with class membership (ORs in this case), I would like to incorporate error associated with propensity score estimation. I've estimated the PSs used to calculate the IPWs in R using a generalized boosted regression model.

Do you have any suggestion for how I might adjust the SEs of the covariate ORs with the PS estimation error?

Many thanks,
 Tihomir Asparouhov posted on Monday, September 30, 2019 - 9:20 am
You can use the PS standard errors to generate/impute the weights multiple times and then analyze the multiple data sets using type=imputation in Mplus.

Another method that might be reasonable is to use the bootstrap, as in bootstrap replicate weights.
It would involve bootstrapping the data, then estimating the IPW for each sample, and then using the same method as for the bootstrap replicate weights, but a lot of the computations would have to be done outside Mplus. You would be using external montecarlo in Mplus, but the rest of the computations would have to be done in R or Excel.
 Nicholas J Parr posted on Monday, September 30, 2019 - 11:30 am
Thank you, Tihomir.

To make sure I understand your recommendation: for the first approach, I'd estimate the IPWs multiple times and include them in a dataset with the other variables in the LCA, then run the model on those multiple datasets using type=imputation. Is that correct? If so, how many replications do you think would be appropriate (my concern is that the GBM PS method takes quite a while and it's a large dataset, with n of about 50000)? Last, how would I specify the multiple datasets in the Mplus syntax?

Thanks again,
 Tihomir Asparouhov posted on Monday, September 30, 2019 - 12:44 pm
Correct. I would use 10. It shouldn't take a long time. See User's Guide example 13.13 for how to setup type=imputation.
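To show how the extra uncertainty enters when results from several data sets are combined, here is a minimal Python sketch of the standard multiple-imputation combining rules (Rubin's rules). It assumes that pooling over the data sets follows these rules and that the estimates are on the scale being pooled (for ORs, the log-odds scale); the function name and numbers are hypothetical.

```python
import math

def pool_rubin(estimates, std_errors):
    """Pool one parameter across m data sets with Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                    # pooled estimate
    within = sum(se ** 2 for se in std_errors) / m                # average within variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, math.sqrt(total)

# Hypothetical log-odds estimates and SEs from three weighted data sets.
est = [0.5, 0.6, 0.4]
ses = [0.1, 0.1, 0.1]
pooled, pooled_se = pool_rubin(est, ses)
print(round(pooled, 3), round(pooled_se, 4))
```

The between-data-set term is where variation in the regenerated weights shows up: if every data set gave the same estimate, the pooled SE would reduce to the within-data-set SE.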
 Nicholas J Parr posted on Monday, September 30, 2019 - 1:27 pm
That makes it perfectly clear - thank you again!

One final question if you don't mind, more related to the overall LCA. As I mentioned I was planning to use R3STEP for the covariate. If I wanted to include known groups (e.g., by gender), it seems like the automatic 3-step methods don't allow more than one categorical latent variable (according to the error I received and reading other discussions). Do you have a suggestion for this situation? Obviously I could run the LCA on subgroups of the dataset (by gender), but I'm not sure if that would bias standard errors or cause other issues.

Thanks again,
 Tihomir Asparouhov posted on Monday, September 30, 2019 - 3:57 pm
I think running the LCA on subgroups of the data set is the way to go. In the case of all binary indicators, that would be equivalent to running the LCA on the full data set with gender as a direct predictor of all the indicators.
 Nicholas J Parr posted on Friday, October 11, 2019 - 10:08 pm
Hi again - So I got this up and running with the 10 data sets and using type = imputation. After running the model, I realized that I don't get the item response probabilities in the output. I consulted another thread and used Linda's suggestion to run another model with only 1 data set and with the parameter estimates from the imputation model (below).

My questions are: 1) Does this syntax look correct? 2) Does this model with the fixed parameters still account for the IPW-weighting in the imputation model? 3) Should I test the covariate using the R3STEP method, or is this in effect the final step of a manual 3 step, in which I should just use an "ON" statement under %OVERALL%?

I did try to run R3STEP in the imputation model - it seemed to run ok but I got really extreme estimates for the regression of the covariate.

Thanks again!

STARTS = 200 50;


[...and so on for the other 3 classes]
 Tihomir Asparouhov posted on Monday, October 14, 2019 - 5:10 pm
1) Yes, but you would be using that syntax only for the purpose of getting the item response probabilities.

2) Since there are no parameters to be estimated in that model the weights are irrelevant - you can include them or not and you would still be getting the same item response probabilities

3) I think you want to use R3STEP. It might be insightful if you also analyze just one imputed data set (or run them one at a time). That may help you understand the exploding parameters.
 Yoon Oh posted on Wednesday, October 23, 2019 - 11:26 am

I'm wondering if the R3STEP can be used with both sampling weights and replicate weights. If not, what would be the best way to conduct LCA with covariates while also incorporating sampling weights and replicate weights?

Please advise. Thank you so much.

 Tihomir Asparouhov posted on Friday, October 25, 2019 - 2:12 pm
It is not supported currently, but you can do it with some extra work. Sampling weights alone are supported. This way you can obtain the point estimate. Subsequently, to utilize the replicate weights and get the proper standard errors, you must estimate the same model for each replicate weight (essentially replacing the sampling weight with the replicate weight). If you have 50 replicate weights, you get 50 additional runs. To avoid any latent class label switching, the additional 50 runs should be done with starts=0 and starting values obtained from the final results of the run with the sampling weights. To obtain the correct standard errors use formula (1)