

I'm working on a growth mixture model with a couple of covariates. The models seem to be running fine, but when I output the Class Probabilities I get probabilities ranging from 0 to 4 or so. My understanding is that they should range from 0 to 1. So the question is: am I misinterpreting the output, or is something wrong with my model? Thanks.


I am assuming that you are talking about the output and not about saving conditional probabilities in a separate file. When you have covariates, only logit values are printed, not probabilities. When you have no covariates, the results are printed as both probabilities and logits. Let me know if this is not what you mean.

Peter Tice posted on Thursday, September 14, 2000 - 8:04 am



I started with an analysis identifying the proper number of latent classes for a particular set of data and ended up with 4 latent classes with the following probability distribution:
#1 78.06%
#2 4.41%
#3 13.81%
#4 3.70%
By itself, this is not problematic until I modify the model to include covariates as predictors of latent class membership (e.g., C#1 on verb3 impuls, etc.). After running such a model with the 4 class solution my probability distribution looks rather different:
#1 68.06%
#2 12.95%
#3 14.39%
#4 4.58%
I don't recall seeing an explicit discussion of this in the manual but would like to understand why this happens and which class distribution I should rely on. At first glance this reminds me of circumstances in conventional growth curve modeling where the parameter estimates vary between unconditional (i.e., w/o predictor variables) and conditional (i.e., w/predictor variables) models. Thank you,


Your analogy with growth modeling is good. For a stable solution, the two class prob distributions should be the same, but may not be for somewhat misspecified models. For example, some covariates may have direct effects on some outcomes. Note also that the order of the classes may be different for the two solutions. 


Another question. Is it possible to work with (i.e., specify, constrain, etc.) the class probabilities in the Model statement (I may just be missing something in the manual). For example, I'd like to constrain two latent classes to be of equal size in a k>2 latent class analysis. Thanks, Pat 


You can constrain the class probabilities to be equal as follows in a three or greater class model. In this example, the probabilities for the first two classes are held equal.
Model:
%overall%
[c#1] (1);
[c#2] (1);
You can also fix the values if you want:
Model:
%overall%
[c#1@1.308] (1);
[c#2@1.308] (1);
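For anyone wondering how the [c#...] logit values translate into class sizes, here is a minimal Python sketch (not Mplus code). It assumes the usual multinomial logit parameterization with the last class as the reference class, whose logit is 0. The function name is made up for illustration.

```python
import math

def class_probabilities(logits):
    """Convert [c#k] logits to class probabilities.

    `logits` holds the means for classes 1..K-1; the last class is the
    reference class with its logit fixed at 0.
    """
    exps = [math.exp(v) for v in logits] + [1.0]  # exp(0) = 1 for the last class
    total = sum(exps)
    return [e / total for e in exps]

# Three-class model with [c#1@1.308] and [c#2@1.308]
probs = class_probabilities([1.308, 1.308])
print([round(p, 3) for p in probs])
```

Equal logits give equal probabilities for the first two classes, and the third class takes whatever is left.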


Thanks 


Linda, I'm finding that the above isn't behaving the way I expected it to. I'm running a 12-class LCA, in which I'm using training data to constrain members of one manifest group to one set of six classes, and members of another manifest group to the other set of six classes. There are no predictors of the latent class membership involved. The two sets of six classes are constrained to be identical: the thresholds for each indicator in Class 1 are constrained to be equal to the thresholds in Class 7, and so forth. With Bengt's help, I was able to get that part going. The next step I want to take is to test the hypothesis that the class sizes are identical across sets, i.e., that Class 1 is the same size as Class 7, and so on. I've used the syntax above to constrain the class size parameters to be equal across groups (1 to 7, 2 to 8, etc.). I've tried leaving Class 6 unconstrained and constraining it to zero (I want it to match Class 12 in size). The model converges, and the "means" in the "latent class regression model part" do follow the equality constraint. However, the "proportions of total sample size" do not, and the variance is not in any obvious pattern. I'd appreciate any insights on this.

bmuthen posted on Thursday, October 18, 2001 - 11:14 am



What is printed under the heading FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE is based on the estimated posterior probabilities for each individual, given the model and the individual's data. Posterior probabilities should be thought of as akin to factor scores. In most latent class models, the proportions reported here will agree perfectly with the class probabilities as obtained from the estimated [c#...] logit values, but not always; your model is an example of an exception. So I would say that you succeeded in getting the [c#...] parameters set up the way you wanted. The fact that the posterior probability results disagree may be an indication of model misfit, and how they disagree can be a suggestion for how to modify the model.


I'm back to the problem with some other data, and I'm still wrestling with this, I'm afraid. I'm testing a treatment vs. control difference. I've run one model with 8 classes. By training data, the control Ss are constrained to be in odd-numbered classes and the treatment Ss in even-numbered classes. The two groups are the same size. In the first model, the indicator logits for the classes are constrained to be equal across pairs of classes (class 1 with 2, class 3 with 4, etc.); there are no constraints on the latent class probabilities. This seems to be working fine, and fits only slightly worse than a model without the constraints. Then I want to test the hypothesis that, assuming the same class structure, the class sizes/proportions are the same in the two groups. Eyeballing the FINAL CLASS COUNTS in the first model, it looks like there's a pattern of differences. So I ran a second model, with the following code added to the %OVERALL% portion:
[lc#1] (201);
[lc#2] (201);
[lc#3] (202);
[lc#4] (202);
[lc#5] (203);
[lc#6] (203);
[lc#7@0];
Running this model shows a very slightly worse -2LL, and the following results:
LATENT CLASS REGRESSION MODEL PART
Means
LC#1  -0.434  0.202  -2.141
LC#2  -0.434  0.202  -2.141
LC#3  -0.055  0.127  -0.429
LC#4  -0.055  0.127  -0.429
LC#5  -0.528  0.170  -3.097
LC#6  -0.528  0.170  -3.097
LC#7   0.000  0.000   0.000
and
FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
Class 1   87.19078  0.09797
Class 2   93.93257  0.10554
Class 3  124.40489  0.13978
Class 4  140.18348  0.15751
Class 5   82.87007  0.09311
Class 6   81.98881  0.09212
Class 7  150.53426  0.16914
Class 8  128.89514  0.14483
It really doesn't look like the two results sections are matching up, and I'm at a loss. Any suggestions? Thanks again, Pat

Anonymous posted on Tuesday, November 06, 2001 - 8:26 am



Actually, your results are matching up. The final class count is the sum of the estimated posterior probabilities for each individual. Much like the average factor score estimate could differ from the estimated factor mean, those numbers are different. Of course, you may question the restrictions imposed in the latent class regression model part.


Ok, thanks. Somehow your wording helps Bengt's message from 10/18 click; now I see what's going on. So does MPlus anywhere produce what I was looking for? The raw probabilities of class membership? Or is the best way to exponentiate the logits, and work with the odds? Thanks. 

Anonymous posted on Tuesday, November 06, 2001 - 6:42 pm



For this particular model you can just average the final class proportions within each constrained pair; that is, the raw probability of class membership for class 1 and for class 2 is (0.09797+0.10554)/2.
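As a rough check on the averaging suggestion, the model-implied probabilities can also be computed directly from the logits. The Python sketch below assumes the three constrained means are negative (-0.434, -0.055, -0.528) and that class 8 is the reference class with logit 0. Under that assumption the probabilities implied by the logits line up with the pairwise averages of the final class proportions.

```python
import math

# Assumed signs: the three constrained logits are taken to be negative.
logits = [-0.434, -0.434, -0.055, -0.055, -0.528, -0.528, 0.0, 0.0]  # class 8 = reference
exps = [math.exp(v) for v in logits]
total = sum(exps)
model_probs = [e / total for e in exps]

# Proportions printed under FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
posterior_props = [0.09797, 0.10554, 0.13978, 0.15751,
                   0.09311, 0.09212, 0.16914, 0.14483]

for k in range(0, 8, 2):
    pair_avg = (posterior_props[k] + posterior_props[k + 1]) / 2
    print(f"classes {k + 1}-{k + 2}: model {model_probs[k]:.4f}, pair average {pair_avg:.4f}")
```

So exponentiating the logits and normalizing gives the same answer as averaging within pairs, which is why the two results sections do match up.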

Anonymous posted on Wednesday, February 09, 2005 - 3:03 am



Dear Linda, I am a new user of Mplus. I am using your short courses package as well as the User's Guide. I have started the course on modelling with categorical latent variables and I would very much appreciate it if you would expand a little on the statement on page 13, "The u-c relation is a logit regression". With thanks

bmuthen posted on Wednesday, February 09, 2005 - 10:49 am



Look up computations with logistic regression in the Version 3 Mplus User's Guide, chapter 13. This explains the logit in (81) on the page you are referring to. U is the binary dependent variable and c is a categorical x variable. U has a threshold which is the negative of an intercept parameter. If c has 2 classes, then that means that we have a dummy x variable which in line with linear regression means that there are 2 intercepts in the model. 
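A small Python sketch of that relation, for illustration only: with a binary u and a class-specific logit threshold, the probability of u = 1 in a class is 1 / (1 + exp(threshold)), because the threshold is the negative of the intercept. The threshold values below are hypothetical and the function name is made up.

```python
import math

def prob_u_equals_1(threshold):
    """P(u = 1 | class) for a binary u with a logit threshold.

    The threshold is the negative of the logistic regression intercept,
    so P(u = 1) = 1 / (1 + exp(threshold)).
    """
    return 1.0 / (1.0 + math.exp(threshold))

# Hypothetical thresholds for a two-class model (not taken from any real output)
thresholds = {"class 1": -1.0, "class 2": 2.0}
for label, tau in thresholds.items():
    print(label, round(prob_u_equals_1(tau), 3))
```

Each class has its own threshold, which plays the role of the two intercepts you would get from a dummy x variable.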


Hi, I have a two-class solution with covariates and the logistic output looks like this.
C#1 ON
  Y19  420.059  269.306  1.560
  Y20    0.515    0.611  0.843
  Y21   40.788   16.549  2.465
  Y22    3.643    1.368  2.662
Intercepts
  C#1   86.006   54.002  1.593
Is class 1 the base class? I.e., for the coefficient of Y20, which is 0.515, does the odds ratio exp(0.515) = 1.674 imply that for a change of 1 unit in variable Y20, the odds that the individual belongs to the base class (class 1) increase by a factor of 1.674? Am I interpreting this correctly?


The last class is the reference class. See Chapter 13 of the Mplus User's Guide. In the section, Calculating Probabilities From Logistic Regression Coefficients, you will find a full interpretation. 
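To make the odds interpretation concrete, here is a hedged Python sketch for the two-class case. It uses only the Y20 slope from the output above and a hypothetical intercept, ignoring the other covariates. With class 2 as the reference class, exp(slope) is the factor by which the odds of class 1 versus class 2 change per unit of the covariate.

```python
import math

b_y20 = 0.515                 # slope of C#1 ON Y20 from the output above
odds_ratio = math.exp(b_y20)  # multiplicative change in the odds of class 1 vs. class 2
print(round(odds_ratio, 3))

def prob_class1(intercept, slope, x):
    """P(c = 1 | x) in a two-class model with class 2 as the reference class."""
    logit = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical intercept, single covariate only, to show the odds scaling
a = -1.0
p0 = prob_class1(a, b_y20, 0.0)
p1 = prob_class1(a, b_y20, 1.0)
odds0 = p0 / (1.0 - p0)
odds1 = p1 / (1.0 - p1)
print(round(odds1 / odds0, 3))  # same value as the odds ratio
```

Note that class 1 is the class being compared against the reference, not the reference class itself.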


Hello Dr.Muthen, Thanks a lot. Regards 

Anonymous posted on Friday, June 17, 2005 - 9:21 am



Problem 1: I have a dichotomous X variable (that underlies a continuous distribution) that I want to use as a predictor in a growth mixture model. I can't find an example in the manual on defining this X as categorical. Is it unnecessary to do so (if it underlies a continuous distribution)? Problem 2: On a different project, I have a somewhat unexpected effect of a cognitive measure called X on the LGM slope of achievement (y1 through y4). The model has 4 time points and the last two are freely estimated. The estimated time scores are 1.5 and 1.8. I am having trouble interpreting the effect of X, which I would have expected to be positively related to the slope (it is positively related to the intercept). This may be a rather interesting finding: the higher a respondent scores on cognitive skill X, the higher their intercept will be, but the higher they score on X, the lower their growth rate. I plotted the slope estimates by the X variable (by the way, I LOVE that we can do this in MPLUS!!) and everything looks fine (yet negative). It seems to me that it's not their growth rate 'overall', it's their initial growth rate (0, 1), because the estimated time scores are 1.5 and 1.8, suggesting a smaller rate of growth at times 3 and 4. That is, in the first period of growth (0, 1) there is a greater incline for those with a lower score on cognitive skill X, but as time goes on (1.5 and 1.8) that initial effect is smaller. Is it correct that the effect of X (when the y's are coded 0, 1, *2 *3 rather than 0 1 2 3) is on the initial growth rate (0 1)? We assume it's linear and stable (as it would be if modeled 0 1 2 3), but when you estimate the time scores you can't conclude that. Or am I unnecessarily getting thrown off by the estimated time scores?
Say the effect of X is .30. Then to get the slope, multiply factor loading by effect:
0 (.3) = 0
1 (.3) = .3
2 (.3) = .6
3 (.3) = .9
but with estimated time scores:
0 (.3) = 0
1 (.3) = .3
1.5 (.3) = .45
1.8 (.3) = .54
So I would conclude a higher score on X was related to a slower growth rate, but that this slower growth was strongest initially during the first two waves of data. Thanks for your help!!
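The arithmetic in this example can be laid out in a few lines of Python. This is only a sketch of the time score times effect calculation described in the post, with the wave-to-wave change added to show where the X-related difference accumulates fastest. The function names are made up.

```python
effect_of_x = 0.30  # effect of X on the slope factor, as in the example above

linear_scores = [0, 1, 2, 3]
estimated_scores = [0, 1, 1.5, 1.8]

def x_contribution(time_scores, effect):
    """X-related difference in the expected outcome at each wave (time score * effect)."""
    return [round(t * effect, 2) for t in time_scores]

def wave_to_wave_change(values):
    """Change in the X-related difference between adjacent waves."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]

for label, scores in [("linear", linear_scores), ("estimated", estimated_scores)]:
    contrib = x_contribution(scores, effect_of_x)
    print(label, contrib, wave_to_wave_change(contrib))
```

With the estimated time scores, the change between waves shrinks after the first interval, which is the pattern the post describes.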

BMuthen posted on Sunday, June 19, 2005 - 5:37 am



1. The scale of observed x variables is not an issue in estimation. They should not be placed on the CATEGORICAL list. In regression, covariates are treated as either dummy variables or continuous variables.
2. It is not unusual for a covariate to have a positive influence on the intercept growth factor and a negative influence on the slope growth factor. You may want to use time scores 0 * * 1 if you are interested in growth between the first and last time points.

Anonymous posted on Wednesday, August 03, 2005 - 10:56 am



I am relatively new to mixture modeling and am not sure that I understand the rationale behind constraining class probabilities to be equal. There does not seem to be a clear explanation, in the manual or on the discussion board, of when this should or should not be done. I have a few questions on this issue:
1) Is this primarily a theoretical consideration (i.e., I have no a priori beliefs as to class sizes so I will assume them to be relatively close to equal)? Or is it instead based upon clues derived from an analysis of results from runs in which class probabilities were not constrained?
2) By using this feature, am I somehow forcing data to fit in groups or am I simply making it easier for the model to converge, similar to assigning starting values for class thresholds?
3) What is the relationship (if any) between the logit starting values I assign for class thresholds and constraining class probabilities to be equal?
Thanks in advance. As a novice, I know I'm probably blurring lines on several of these issues but appreciate your thoughtful response. BTW, I am working with all continuous variables if that makes any difference.

bmuthen posted on Monday, August 08, 2005 - 1:29 pm



Typically you would not constrain the class probabilities to be equal. What is the reason that you think they should be or are held equal? 

Marco Grados posted on Thursday, December 01, 2005 - 9:45 am



I have run LCA on 3 sets of data (N = 197, N = 203, and N = 400) for psychiatric comorbidities in a sample of patients (5 comorbidities). I obtain 3 classes as the best solution in all three cases and they make sense clinically. I was interested in the probabilities of belonging to each class for subjects and used the command:
SAVEDATA:
FILE IS prob.result.dat;
FORMAT IS FREE;
SAVE = CPROBABILITIES;
I have used this before in another analysis and have obtained probs as required. However, now all I am obtaining for the variables cprob1, cprob2 and cprob3 are 0's and 1's... Any ideas appreciated. Marco Grados, MD, MPH


Please send your input, data, output, and license number to support@statmodel.com and we can take a look at it. 


Dear Dr. Muthen, I am working on GMM and I am trying to decide on the best number of classes. When I choose the model with the best BIC, the class probabilities vary between .65 and .85. My questions are: Does a class probability of .65 still indicate good fit? Is there a minimum required value for the average latent class probabilities? Is there a rule of thumb with regard to the average latent class probabilities to decide on model fit and the number of classes? Or should I just trust the best BIC and not base my conclusion on the class probabilities? Do you know of any reference that discusses this topic? Thanks in advance for your answer.


See the following paper, which can be downloaded from the website, to see how to determine the number of classes: Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (ed.), Handbook of quantitative methodology for the social sciences (pp. 345-368). Newbury Park, CA: Sage Publications. Probabilities are not related to model fit.


I have been comparing 2, 3 and 4 latent class models to the same with 1 factor. The best fit appears to be the 3 class 1 factor model. In the final class counts based on the estimated model and estimated posterior probabilities, class 2 has 3.7% of the sample. However in the final classification class 2 has no individuals! Distribution based on most likely class membership is: C#1 0.272 C#2 0.000 C#3 0.728 I notice from the discussion above that it's possible to constrain 2 classes to be the same size where there's 3 or more classes. I wonder if that is feasible for my analysis? Or is there something else going wrong with my analysis? Fiona 


This means that the individuals in Class 2 have most likely class membership in another class which points to no need for Class 2. 


Dear Dr. Muthen, I am running a 3-class multigroup LCGA with the following class probabilities.
group 1: 0.28 0.63 0.09
group 2: 0.25 0.68 0.07
A chi-square cross-tabulation (by hand) indicates that the probabilities differ between the groups. Is there a way in Mplus to test directly whether the 0.63 differs from the 0.68? And is there a way to get SEs or CIs for these probabilities? Thanks in advance.


You can use MODEL CONSTRAINT to define the probabilities. You will then obtain standard errors for them. You can use MODEL TEST to test if they are equal. See the user's guide for further information. 


Unfortunately I need a little more help from you. I tried several statements, but nothing works, and also it seems I really need to make constraints with MODEL CONSTRAINT, but I only want SEs. What should I specify differently?
Model constraint:
[cg#1.c#1]; (do I need to add (1)?)
[cg#1.c#2];
[cg#1.c#3];
[cg#2.c#1];
[cg#2.c#2];
[cg#2.c#3];
(By the way, I was referring to the probabilities under this heading: LATENT TRANSITION PROBABILITIES BASED ON THE ESTIMATED MODEL.) Thanks again.


See Slides 48-50 of the Topic 6 course handout. This shows how MODEL CONSTRAINT can be used to define the latent transition probabilities described toward the end of Chapter 13. Standard errors are estimated for new parameters defined in MODEL CONSTRAINT.
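For intuition about where a standard error for a probability can come from, here is a simplified Python sketch. It assumes a two-class case with a single logit and applies the delta method, which is an assumption on my part about how such standard errors are typically obtained, not a statement of what Mplus does internally. The logit and its standard error are hypothetical numbers.

```python
import math

def prob_and_se_from_logit(logit, se_logit):
    """Probability of class 1 in a two-class model and its delta-method SE."""
    p = 1.0 / (1.0 + math.exp(-logit))
    se_p = p * (1.0 - p) * se_logit  # dp/dlogit = p * (1 - p)
    return p, se_p

# Hypothetical logit and standard error, for illustration only
p, se_p = prob_and_se_from_logit(0.5, 0.2)
print(round(p, 3), round(se_p, 3))
```

With more than two classes, each probability depends on several logits, so the same idea needs the covariances among the logit estimates as well.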


Hi there, I am using growth mixture models to estimate developmental trajectories from continuous data at 6 time points. I am using the following syntax to save the results and conditional probs:
SAVEDATA:
File is gmmcp.dat;
format is free;
save=cprobabilities;
However, I need the cprobs (only) to more than three decimal places. Can you please help? Many thanks


This is not possible in the current version of Mplus. 


Hi, I am currently conducting LCA with 5 binary variables and 3 categorical variables (with 3 categories each). When looking at the results, quite a few item response probabilities are set to 0.0/1.0, because the "logit thresholds were approached and set at extreme values". Is there a possibility to turn this default setting off? If not, is this also normally done in LatentGold? (I ask this for comparative reasons; I am comparing my results with those from a previous study.) Also, I was wondering whether you can tell me when one should use a logit and when a probit link when conducting LCA. The v5 manual only mentions the option with regard to LTA. Which one should you use when you employ an LCA? Finally, in the recent book on LCA and LTA by Collins and Lanza (Wiley series) it is explicitly stated that the likelihood ratio statistic (G^2) shouldn't be used when the df of the model is high because the reference distribution for the G^2 statistic is not known. Is it still okay to use the statistics reported in TECH11, as they are based on the likelihood ratio statistic? Thanks in advance, Martijn


I don't know what Latent Gold does, but fixing them is the correct thing to do. It is most common to use logit, but probit is also possible. I would need to see the exact verbiage from the book to interpret what they are saying. You can send it to support@statmodel.com.


Yes, the likelihood-ratio chi2 (G^2) does not approximate chi2 well with the sparseness in the observed-data frequency table that you get with many variables. And you cannot use chi2 difference testing to find the appropriate number of latent classes; that also is not chi2 distributed. Instead, BIC has been found to work well, as have TECH11 and TECH14. See for example the article Nylund, K.L., Asparouhov, T., & Muthen, B. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling. A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569, which is on our web site under Papers, Latent Class Analysis. TECH11 and TECH14 are akin to the bootstrap approaches mentioned in the book you referred to, although not focusing on the frequency table chi2, but directly on the likelihood-ratio statistic, considering its non-chi2 distribution.

anonymous posted on Monday, April 05, 2010 - 12:02 pm



Hello, I am trying to determine how best to characterize derived classes. I noticed that in the "results in probability scale" section of Mplus output, there are significance values associated with the probabilities of each latent class indicator. Are these p-values indicating whether the probability is greater than 0 or testing another hypothesis? In addition, in one case, the probability is significant even though the probability value is 0. In other cases, when the probability value is 1.00, the p-value is also 1.00. Thanks in advance! Marcy


The test is against zero. I would need to see the full output and your license number at support@statmodel.com to comment on your other statements. They sound odd. 


I presume this is a very basic question about Latent Class Analysis. I am running a two-group LCA (trying to discriminate MZ and DZ twins from a bunch of ratings of their similarity). I have good reason to think that the solution should be unbalanced, with about 90% of the cases in one group and 10% in the other. But no matter what I do I seem to end up with close to 50-50. Is there a way to force Mplus to find an unbalanced solution? Is it a matter of fixing the mean of the latent categorical variable on a logit scale? Can one do it with starting values? Or by fixing the loadings on a few select items? Thanks.


Please send the output and your license number to support@statmodel.com. 


Dear Bengt/Linda, I am working on a GMM/LCGA type model which includes about 10,000 cases. I first identified the correct number of classes, and now wish to regress c on some covariates. However, I'd also like to treat the most likely classes from the first stage as observed variables in the second stage, just to check my model solutions. I found that the maximum record length is set at 5,000 in Mplus 6, so I have some problems saving the class probabilities. Can you recommend any workaround procedure? Thanks heaps,


I'm not sure I totally understand the question. The data for each observation can be on more than one record. 


Sorry, I should be clearer. I have a fairly large dataset (10,000) and wish to save class probabilities. There seems to be a record length maximum of 5,000, however. Is there a way to save data for more than 5,000 records?


Record length and the number of records are not the same thing. I don't know of any such maximum. Please send the files and your license number to support so I can see what you mean.


Hello, I am working on a LCA with longitudinal binary data and estimate thresholds for each class. However, when I add an intercept to the overall statement of the model (which is actually a growth parameter), the model fit is considerably improved. How can this be, and does this affect the interpretation of the thresholds? e.g. do I need to standardise them now as they will be affected by a latent continuous variable (intercept)? Thank you very much for your time. 


It sounds like you are freeing the intercept of the intercept growth factor. This should be fixed at zero as part of the growth model parameterization. It should not be freed. If this is not what you are saying, please send your output and license number to support@statmodel.com. 


Dear Dr. Muthen, I am using a mixture modeling approach to crime rates at the county level. The best fitting model was the one with four latent classes. When I ran the frequencies on the class variable saved in the data file and compared those with the class counts reported in the output, I noticed some differences:
Class 1: 46 (in output), 57 (in data file)
Class 2: 2561 (in output), 2558 (in data file)
Class 3: 206 (in output), 205 (in data file)
Class 4: 13 (in output), 6 (in data file)
Shouldn't they be the same? Thank you.


You are most likely comparing most likely class membership with class counts from the estimated model. These are the same only when entropy is one. 
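A toy Python example may help show why the two sets of counts differ. The posterior probabilities below are hypothetical, and the sketch uses sums of posterior probabilities as the fractional class counts, compared with counts based on each case's most likely class. The entropy line uses the usual relative entropy formula, 1 minus the summed -p ln p over n ln K.

```python
import math

# Hypothetical posterior probabilities (rows = cases, columns = classes)
posteriors = [
    [0.90, 0.10],
    [0.60, 0.40],
    [0.55, 0.45],
    [0.20, 0.80],
]

n_classes = len(posteriors[0])

# Fractional counts: column sums of the posterior probabilities
posterior_counts = [sum(row[k] for row in posteriors) for k in range(n_classes)]

# Counts based on most likely class membership: each case goes to its modal class
modal_counts = [0] * n_classes
for row in posteriors:
    modal_counts[row.index(max(row))] += 1

# Relative entropy: equals 1 only when every posterior probability is 0 or 1
n = len(posteriors)
uncertainty = sum(-p * math.log(p) for row in posteriors for p in row if p > 0)
entropy = 1.0 - uncertainty / (n * math.log(n_classes))

print(posterior_counts, modal_counts, round(entropy, 3))
```

The further the posterior probabilities are from 0 and 1, the lower the entropy and the more the two sets of counts can diverge.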


Hi, I am estimating LPA models from plausible values (20 sets) from a previous ESEM model (i.e., I want to estimate the profiles based on the "factor scores", but tried PVs for greater precision). Doing so, I get warnings that I cannot save class memberships and, most importantly, class probabilities from the models. I guess that is because we work from 20 different data sets. I am wondering whether there would be a way to save this information (i.e., to combine/merge the class probability results into a single data file)? Thanks


The only way to get most likely class membership is to run each data set separately. You would then need to decide how to use the sets of most likely class memberships. 

Jan Ivanouw posted on Wednesday, April 20, 2011 - 5:56 am



I would very much like help in locating the formula for calculating CPROBABILITIES for the individuals in a mixed model. What I mean is each individual's probability for belonging to class 1, class 2 etc. when knowing the person's item scores and the Mplus estimated model parameters for a certain mixed model. Mplus gives these probabilities in a save file when demanding CPROBABILITIES, but I would like to be able to calculate them myself based on the parameters estimated by a Mplus model (like thresholds and alpha). I have looked into the Technical Appendix 8, but it seems that the formula/algorithm to use is not there. 


It is formula (161) of Tech App 8. When you use that in the last step, when the iterations have converged, you get the CPROBs. It's a bit involved. See also
Muthén, B. & Shedden, K. (1999). Finite mixture modeling with mixture outcomes using the EM algorithm. Biometrics, 55, 463-469.
and the more technical Section 6.4 of
Muthén, B. & Asparouhov, T. (2009). Growth mixture modeling: Analysis with non-Gaussian random effects. In Fitzmaurice, G., Davidian, M., Verbeke, G. & Molenberghs, G. (eds.), Longitudinal Data Analysis, pp. 143-165. Boca Raton: Chapman & Hall/CRC Press.

Jan Ivanouw posted on Wednesday, April 20, 2011 - 11:12 pm



Thank you very much. 

Jan Ivanouw posted on Thursday, April 28, 2011 - 10:41 am



Hi, working with the calculation of cprobabilities I still have problems. In Appendix 8 equation 161 I need P(u|c,x) and I find this from eq 152/153. My problem is in eq 152, the term (tau - u*). tau is a value which is estimated in the Mplus model, while as I understand it u* is a dimension. How can I find the value for u* to plug into the eq?


u* comes from eqns (150) and (151), so it is also a function of model parameters and of covariate values x_i. You should view u* as the logit behind each observed u, so the parameters of those two eqns are found in whatever relationships in your model influence the u's.


Thank you for the explanation. I have an additional question concerning calculation of posterior probabilities: I tried the method of calculating posterior probabilities which is described in Topic 5, slides 69 and 70 (Berlin, July 2009 version). I have a simple model with one latent class variable (with two classes) and with 10 binary indicators. I estimated the model in Mplus. Then I plugged the probabilities from this model into the equations in slides 69 and 70 for some selected persons. The result was that I got the same posterior probabilities as those calculated by Mplus. My question is: In slide 70 it is indicated that I have to use the EM algorithm in order to get the posterior probability for a person, treating the class membership as missing data. Why was this in fact not necessary when using the parameters from the estimated Mplus model?


The probabilities you plugged in (using slide 71) were from the Mplus output, and Mplus computes those from the estimated model parameters, which are obtained via ML estimation using the EM algorithm.
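For readers who want to try the same plug-in calculation, here is a minimal Python sketch for a latent class model with binary indicators. It assumes conditional independence of the items within class and applies Bayes' rule to already-estimated parameters, so no EM iterations are needed at this stage. All parameter values are made up and the function name is hypothetical.

```python
def lca_posteriors(class_probs, item_probs, responses):
    """Posterior class probabilities for one response pattern via Bayes' rule.

    class_probs[k]   : estimated probability of class k
    item_probs[k][j] : estimated P(u_j = 1 | class k)
    responses[j]     : observed 0/1 response to item j
    """
    joint = []
    for pi_k, probs_k in zip(class_probs, item_probs):
        likelihood = 1.0
        for p, u in zip(probs_k, responses):
            likelihood *= p if u == 1 else (1.0 - p)
        joint.append(pi_k * likelihood)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class model with three binary items (values are made up)
class_probs = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.7],
              [0.2, 0.3, 0.1]]
post = lca_posteriors(class_probs, item_probs, [1, 1, 0])
print([round(p, 3) for p in post])
```

EM is what produces the class and item probabilities in the first place; once they are fixed, the posterior probabilities follow from a single pass like this.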


Hello, when estimating an LCA or LPA, is it possible to restrict the ratio or number of cases per class in the class solution? Thank you!


You can try inequality constraints on the class probabilities, or rather on their logit parameters.


Thank you! And where can I find the syntax for those procedures?


See the Version 6 UG, pages 617-618, for the general approach and the end of chapter 14 for how the logits relate to the probabilities for the multinomial regression.

Beth Bynum posted on Friday, August 26, 2011 - 7:44 am



Good morning, I've run an LPA with one sample and I have determined the number of classes to retain. Using the classes that were determined in the LPA, I would like to determine the most probable classes for a new sample of individuals. For example, I would like to give the same items that were used as class indicators to a new sample and use the results of the LPA to predict which class each person would most likely be in. Ideally, for each person in the new sample, I would like to be able to use the item responses and compute the probability of being in each class, without having to run a new LPA. Is this possible? If so, does MPLUS output provide the necessary information to compute the probabilities? What equation should I use? Thanks!


Yes, this can be done and is a good use of LPA. For the new sample of individuals you simply fix all the model parameters at the solution you got from your first sample. This can easily be done using the SVALUES option in the OUTPUT command. The second run then does not estimate any parameters but only estimates the posterior probabilities for each class for each subject (see the CPROBS option of the SAVEDATA command). The results also include an indication of the most likely class. In fact, you can do this for only a single subject. 

Beth Bynum posted on Tuesday, September 06, 2011 - 12:26 pm



Thanks for the response. Is there a way to estimate the probabilities without running MPLUS? For example, we would like to give each individual in the new sample the set of items, then, in an on demand environment provide feedback to the individual about the class they would most likely fall into based on their pattern of responses on the items. We would like to be able to set this up in an online or computer environment using a standard programming language such as visual basic or javascript. I tracked down and deconstructed the posterior probability equation that LPA uses to estimate the probabilities. Since I know the latent class means, latent class covariance matrices, and the response vector, I could use the equation to produce new probabilities for each new individual. The formula relies on complex matrix algebra to estimate the density functions, which isn’t a show stopper, but we were hoping to find something that might be simpler. My Questions: Have you ever seen anyone use LPA/LCA in this type of application? Do you know of any other way to compute the probabilities that might be simpler than relying on the posterior probability formula? We were hoping that LPA would be similar to discriminant analysis and produce a set of equations that predict class membership for a single person. Finally, I wanted to check which density function MPLUS uses to compute the posterior probabilities. Does it use the multivariate normal density function? 


I think this is a not uncommon type of application and I know others who have had similar interests in having their own routine. I don't think one can avoid going via the computations of the posterior probabilities (like we show in Appendix 8 in the V2 Tech App.). Discriminant analysis assumes known groups, whereas with the posterior probabilities of mixtures a subject is a fractional member of several groups. Yes, Mplus uses the multivariate normal density when the outcomes are continuous. 
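For a routine of one's own, the posterior probability calculation can be fairly short when the indicators are assumed uncorrelated within class, because the multivariate normal density then reduces to a product of univariate normal densities. The Python sketch below makes that assumption explicitly; with within-class covariances the full multivariate density would be needed instead. The parameter values are made up and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, variance):
    """Univariate normal density."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def lpa_posteriors(class_probs, means, variances, y):
    """Posterior class probabilities for one person's continuous responses y.

    Assumes indicators are uncorrelated within class, so the multivariate
    normal density is a product of univariate normal densities.
    """
    joint = []
    for pi_k, mu_k, var_k in zip(class_probs, means, variances):
        density = 1.0
        for y_j, m, v in zip(y, mu_k, var_k):
            density *= normal_pdf(y_j, m, v)
        joint.append(pi_k * density)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class, two-indicator solution (values are made up)
class_probs = [0.7, 0.3]
means = [[0.0, 0.0], [2.0, 2.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
post = lpa_posteriors(class_probs, means, variances, [1.8, 2.1])
print([round(p, 3) for p in post], "most likely class:", post.index(max(post)) + 1)
```

The same few lines translate directly to JavaScript or Visual Basic, since only exponentials, square roots, and products are involved.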


I have a question regarding interpretation of conditional probabilities for particular classes. If the conditional probability for a particular response category in a particular latent class is say 0.75 can one say that 75% of respondents in this class selected this category or is it only appropriate to say that respondents classified here had a 75% chance of selecting this category. I am of the opinion that in this case these two are interchangeable since the probabilities also represent proportions. Is this the case? 


I think it is somewhat better to say: "respondents classified here had a 75% chance of selecting this category". This is the probability that is being estimated. Even better is: "respondents in this class had an estimated probability of 0.75 of selecting this category." That avoids the ambiguous phrase "classified here". Note that we are talking about a model, that is, about subjects' class membership (a subject is a member of only one class in the model), not how they were classified after the parameter estimation was done (when they are fractionally members of several classes). The other statement sounds like you are talking about the subjects who are most likely in this class and saying that 75% of them are in this response category, which may not be true.

Carolyn CL posted on Friday, March 01, 2013  10:19 am



I am running an LCA with 3 continuous-level variables and 2 dummy variables (reference dummy omitted: D_Me). I am not clear how to interpret the conditional probabilities for responses on the dummy variables.
1. Why does MPLUS assign 1 and 0 to some of the conditional probabilities, and how can I interpret this? Example:
Latent Class 1
                Estimate    S.E.    Est./S.E.   P-Value
D_LOCE
  Category 1     0.145     0.307      0.472      0.637
  Category 2     0.855     0.307      2.788      0.005
D_HICE
  Category 1     1.000     0.000      0.000      1.000
  Category 2     0.000     0.000      0.000      1.000
2. Should I ignore the 'Results in probability scale' and calculate actual probabilities for each dummy variable using thresholds (Chpt 13)?
3. Is there an easier way to calculate the probabilities for the reference dummy variable?


I would treat the 3-category variable as a nominal variable by putting it on the NOMINAL list. 1. The thresholds have extreme values. 2. No, the results in probability scale will be the same as any values you calculate by hand. 3. Treat it as nominal.

Carolyn CL posted on Thursday, March 07, 2013  9:51 am



Thank you very much. I am now treating the 3-category variable as categorical (it is ordered). To be clear:
1. The model terminates normally, but I am getting this message:
IN THE OPTIMIZATION, ONE OR MORE LOGIT THRESHOLDS APPROACHED AND WERE SET AT THE EXTREME VALUES. EXTREME VALUES ARE -15.000 AND 15.000. THE FOLLOWING THRESHOLDS WERE SET AT THESE VALUES:
* THRESHOLD 1 OF CLASS INDICATOR CE_CAT FOR CLASS 3 AT ITERATION 110
* THRESHOLD 2 OF CLASS INDICATOR CE_CAT FOR CLASS 3 AT ITERATION 110
Should I be concerned?
2. Can the "Results in probability scale" indeed be interpreted as the probability that members of each class fall into a specific category of the ordinal-level variable?


1. Yes. 2. Yes. 
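
To make the link between logit thresholds and the "Results in probability scale" concrete, here is a small Python sketch (not Mplus code). It assumes the logit parameterization in which the cumulative probability of being at or below category k is the logistic function of threshold k, with no covariates affecting the indicator. The function name is hypothetical. Under that assumption, a threshold fixed at 15 or -15 corresponds to a cumulative probability that is 1 or 0 to three decimals, which is why probabilities of 1.000 and 0.000 appear in the output.

```python
import math


def category_probs(thresholds):
    """Category probabilities of an ordered categorical indicator in one class.

    Assumption: P(u <= k) = 1 / (1 + exp(-tau_k)) for logit thresholds
    tau_1 <= tau_2 <= ..., with no covariate effects on the indicator.
    """
    cumulative = [1.0 / (1.0 + math.exp(-t)) for t in thresholds] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        # Probability of a category is the difference of adjacent cumulatives
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


# A binary indicator whose single threshold sits at the extreme value 15:
binary_extreme = category_probs([15.0])

# A 3-category indicator with thresholds at -15 and 15:
# essentially all of the probability falls in the middle category.
three_cat_extreme = category_probs([-15.0, 15.0])
```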


We would like to calculate the percentage of correct class assignments in a simulation study of latent profile analysis (no covariates). Is it possible to save the individual posterior probabilities for each class and the most likely class membership (SAVE=CPROB) in a Monte Carlo simulation with multiple replications, to compare with the Monte Carlo data sets that provide true class memberships? Or is there another way to obtain this value?


No, to do this, you will need to save the data sets from Monte Carlo and use some sort of batch facility like the RUNALL utility or possibly MplusAutomation with R in an external Monte Carlo. 
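
Once each replication has been analyzed externally and the most likely class has been saved next to the true class, the comparison step itself is short. The Python sketch below is one possible way to do it and is not part of Mplus or MplusAutomation. Because the order of estimated classes is arbitrary, it takes the best agreement over all relabelings of the estimated classes, which is an assumption about how "correct" should be defined. Function and variable names are hypothetical.

```python
from itertools import permutations


def classification_accuracy(true_classes, assigned_classes, n_classes):
    """Proportion correctly classified, maximized over class relabelings.

    Classes are assumed to be coded 1..n_classes. Checking all
    permutations is only practical for a small number of classes.
    """
    n = len(true_classes)
    best = 0.0
    for perm in permutations(range(1, n_classes + 1)):
        # perm[a - 1] is the true-class label matched to estimated class a
        hits = sum(
            1
            for t, a in zip(true_classes, assigned_classes)
            if perm[a - 1] == t
        )
        best = max(best, hits / n)
    return best


# Hypothetical replication in which estimated classes 1 and 2 are swapped
true_c = [1, 1, 2, 2, 3]
most_likely = [2, 2, 1, 1, 3]
accuracy = classification_accuracy(true_c, most_likely, 3)
```

Averaging this value over replications gives an overall percentage of correct assignments.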


Dear Linda and Bengt, While performing a 4 class latent profile analysis, the SAVEDATA = CPROB; option only results in a datafile in which the values of the variables + id variable are included. So, no posterior and class probabilities and also no assigned class number. Could you tell me what I'm doing wrong? Thank you. Mirjam
TITLE: LPA 4 classes
DATA: FILE IS LPAZscores2.csv;
VARIABLE: NAMES ARE id Dep_T2 Reexp_T2 Avoid_T2 Emo_T2 Hyper_T2;
  usevar = Dep_T2 Reexp_T2 Avoid_T2 Emo_T2 Hyper_T2;
  IDVARIABLE = id;
  CLASSES = c (4);
ANALYSIS: TYPE = MIXTURE;
  STARTS 500 125;
savedata: file is "classmembership2"
  SAVE = CPROB;
  FORMAT is FREE;
OUTPUT: TECH1 TECH11 TECH12;


Please send your output and license number to support@statmodel.com. 

Kathleen posted on Wednesday, February 05, 2014  1:21 pm



I'm using LCA and LTA and my questions here concern which data to present in tables and graphically. Should I output the results into a data file, requesting cprobabilities, and then compute the means of each item for each class using each person's most likely class membership? Or should I graph the class probabilities of each item from the "results in probability scale" section of the output file? The means obtained from averaging by most likely class differ from the results (in the probability scale) in the output, as they are different things, but I'm not sure which is better to present. Can you also clarify the difference between the most likely class membership and the probabilities?


I would use the Mplus Plot command to plot the means/probabilities for the items in each class. I would not use most likely class unless necessary and it isn't here. Each person gets an estimated probability (cprob) to be in each class. So with 2 classes the person may get the cprob values 0.85 and 0.15. The most likely class membership for that person is then class 1. But of course that is a cruder piece of information than using both 0.85 and 0.15. A second person with cprobs 0.90, 0.10 is a bit more clearly a class 1 member. 


Hi there, I'm trying to get a new dataset with the class membership that also includes the id variable. It is not working. What am I doing wrong? Here is my code:
DATA: FILE is C:\Users\dissfullbl.dat;
  FORMAT IS FREE;
VARIABLE: NAMES ARE ID pda0102 pda0304 pda0506 pda0708 pda0910 pda1112;
  MISSING = ALL (999);
  USEVAR = pda0102 pda0304 pda0506 pda0708 pda0910 pda1112;
  IDVARIABLE = ID;
  Classes = c (2);
ANALYSIS: TYPE=mixture;
SAVEDATA: FILE IS C:\Users\lcaPDA.dat;
  FORMAT IS FREE;
  SAVE = cprobabilities;
Thank you!


Please send the output and your license number to support@statmodel.com.

Alvin posted on Monday, May 05, 2014  5:09 pm



Hi Dr Muthen, I just ran a 3-class factor mixture model using the Bayesian estimator, which took less time to reach convergence than using algorithm integration. The output looks interesting. However, I couldn't figure out the response pattern in each class, as no item-response probabilities were given. I wonder if I need to use an auxiliary variable, export the class probabilities to a separate file, and analyze class characteristics that way. Is there a way to obtain item-response probabilities in the output?


No automatic way. You would have to be able to express the probabilities using Model constraint. Also, if you do Bayesian mixtures you have to make sure that you know about "label switching". 

Alvin posted on Wednesday, May 07, 2014  5:56 pm



Thanks Dr Muthen. I've read your work on Bayesian analysis and have come across "label switching", and it seems that there isn't a solution to this yet, is there? Another interesting observation, which I read somewhere, is that model modifications do not really have much impact on the overall fit of the model as assessed using PPC. Why is that, and how do you then work out potential areas of misfit? Also, can you request DIC in mixture models? If not, are there alternative indices that allow for model comparison?


We discuss the label switching in our Topic 9 Bayes course on the website. You can apply constraints. Because of this and the current lack of measures for model comparisons, I would not recommend Bayes for mixture modeling unless you are an expert Bayes analyst. I don't have the PPP experience you mention. That is, model improvements can be seen in the range given in the output below the PPP value. 


I am currently on a project in which I need to calculate the probabilities of belonging to a set of latent classes, given individual responses to the latent class indicators. I can do this when the latent class indicators are categorical but I can't find an equation that will calculate these probabilities when the indicators are truly continuous. Could you point me to such an equation? 


The general principle is the same, using Bayes' theorem. I don't know where it is written out.
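
For readers looking for the general form, Bayes' theorem for continuous indicators can be written as below. This is a generic statement of the principle, not a quotation from Mplus documentation. The symbols are introduced here for illustration: $\pi_k$ is the estimated probability of class $k$, $K$ is the number of classes, and $f_k$ is the within-class density of the indicator vector $\mathbf{y}$, taken to be multivariate normal with class-specific mean vector $\boldsymbol{\mu}_k$ and covariance matrix $\boldsymbol{\Sigma}_k$.

```latex
P(c = k \mid \mathbf{y})
  = \frac{\pi_k \, f_k(\mathbf{y})}{\sum_{j=1}^{K} \pi_j \, f_j(\mathbf{y})},
\qquad
f_k(\mathbf{y})
  = (2\pi)^{-p/2} \, |\boldsymbol{\Sigma}_k|^{-1/2}
    \exp\!\left( -\tfrac{1}{2}
      (\mathbf{y} - \boldsymbol{\mu}_k)^{\top}
      \boldsymbol{\Sigma}_k^{-1}
      (\mathbf{y} - \boldsymbol{\mu}_k) \right)
```

Here $p$ is the number of indicators and the $\pi$ inside $(2\pi)^{-p/2}$ is the mathematical constant, not a class probability. With categorical indicators, $f_k$ is replaced by the product of the class-specific response probabilities, which is why the principle is the same in both cases.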


I saved class probabilities and then opened the saved probabilities in WordPad. How do I determine which column represents which class? Any suggestions to help interpret this issue would be much appreciated. 


Asked and answered on Mplus support. 


I'm trying to save a dataset with the most likely class membership. I'm using Mplus v 7.2 (Mac). When I run the following code, it saves a dataset with the observed variables, the weight variable, and the id variable, but no variable for the most likely class. Do you see what I'm doing wrong?
TITLE: Unconditional LCA, Full Sample, 4 class;
DATA: FILE = "pta mplus data 3.txt";
VARIABLE: NAMES = aid gswgt4_2
    edu18 edu20 edu22 edu24 edu26 edu28 edu30
    job18 job20 job22 job24 job26 job28 job30
    rel18 rel20 rel22 rel24 rel26 rel28 rel30
    par18 par20 par22 par24 par26 par28 par30;
  USEVARIABLES = edu18-par30;
  CATEGORICAL = edu18-par30;
  CLASSES = c(4);
  WEIGHT = gswgt4_2;
  AUXILIARY = aid;
ANALYSIS: TYPE = MIXTURE;
  STARTS = 20 10;
OUTPUT: TECH11;
SAVEDATA: FILE = "pta 3 unc 4 cprob.txt"
  SAVE = CPROB;
The output file lists the following for saved variables.
Save file
  pta 3 unc 4 cprob.txt
Order and format of variables
  EDU18 F10.3
  ...
  PAR30 F10.3
  GSWGT4_2 F10.3
  AID F10.3
Save file format
  30F10.3


There is no semicolon after the FILE option in the SAVEDATA command – so Mplus can’t see the SAVE=CPROB request. 

ZHANG Liang posted on Wednesday, September 23, 2015  7:47 am



Hello. I ran a 4-class LCGA and requested the CPROBABILITIES. The variables in my input file are ID, A1-A6. The columns in cprob.dat are arranged like this: [original values of A1-A6] [ID] [estimated values of A1-A6] [??] [??] [posterior probabilities for each class] [the class to which this case is finally assigned]. But I'm not sure about the meaning of the two columns denoted as [??]. Could you please explain it? Thank you!


Look at the bottom of the output where the variables saved are listed. 


I have a question regarding the computation of the posterior probabilities for another individual using the information from Mplus. I am considering an LCGA and, more specifically, an unconditional model. In particular, I have 30 repeated measures on about 3000 individuals. Using Mplus I have obtained 3 classes defined by an intercept and a slope. I have seen formula 145 in Appendix 8 but I am not sure about its application. Which values do I have to enter as x_i? Are they the 30 repeated values we have for each individual? And how does the formula work? Thanks in advance.


I am looking at the technical appendix for Version 2 on our website and the posteriors are given in equation (161). You don't have x's but only y's which are the outcomes at the different time points. 


Thanks Dr Muthen. However, looking at this equation, I still do not know how to apply it. I understand that I only have y's, which are the outcomes at the different time points, but then how do the x's and u's work? And how can I adapt equation (161) to my data? Thanks again.


You ignore the u term and drop u and x from the other parts of the formula. See also Muthen & Shedden (1999) on our website. If this seems unfamiliar and too technical, you may need to contact a statistical consultant to help you.
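
For an unconditional LCGA with only continuous y's, the computation can be sketched in Python as follows. This is an illustration under stated assumptions rather than a transcription of the appendix equation. It assumes a linear growth model with no growth factor variances within class, so the class-specific mean at each time point is intercept + slope × time score. It also assumes uncorrelated normal residuals whose variances are equal across classes. The function name and all numbers are hypothetical.

```python
import math


def lcga_posteriors(y, times, class_probs, intercepts, slopes, resid_vars):
    """Posterior class probabilities for one individual's repeated measures.

    Assumptions: linear LCGA, class-specific mean intercept + slope * time,
    independent normal residuals with variances shared across classes,
    and no missing outcomes.
    """
    log_joint = []
    for pi_k, a_k, b_k in zip(class_probs, intercepts, slopes):
        ll = math.log(pi_k)
        for y_t, t, v in zip(y, times, resid_vars):
            mu = a_k + b_k * t  # model-implied class mean at this time point
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (y_t - mu) ** 2 / v)
        log_joint.append(ll)
    # Normalize on the log scale for numerical stability
    top = max(log_joint)
    weights = [math.exp(l - top) for l in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-class example with three time points
post = lcga_posteriors(
    y=[0.0, 1.0, 2.0],
    times=[0.0, 1.0, 2.0],
    class_probs=[0.5, 0.5],
    intercepts=[0.0, 0.0],
    slopes=[1.0, -1.0],
    resid_vars=[1.0, 1.0, 1.0],
)
```

With 30 repeated measures, `y`, `times`, and `resid_vars` would each hold 30 values, and the parameter estimates would come from the Mplus output.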


I am running a growth mixture model. The number of observations included in the analysis is 570, since some cases are missing information on x-variables or on time scores. However, when I request the data file for the cprobabilities and Fscores, I actually get information for 630 individuals (the entire sample). Why is that information included in the data file even though those individuals are not included in the analysis?


Please send the output and the saved file to Support along with your license number. 


Hi, I am running a latent transition analysis following Nylund-Gibson et al. (2014). In my LTA model I have measurement invariance by class across time points (i.e., classes at t1 have the same structure as classes at t2), and I need to get the logits for the classification probabilities of the most likely class for each time point. To do so, I was using the starting values (class latent means, and indicators by class) of the LTA to run 2 separate LCAs (one per time point), but I get a completely messed up class membership compared to the LTA. What is the best way to do this? Is there a way to get logits of the classification probabilities directly from the LTA?


These are the svalues from my LTA:
[ c1#1*0.10978 ];
[ c1#2*1.09612 ];
etc.
MODEL C1:
%C1#1%
[ ar_1*1.91941 ] (1);
[ auth_1*0.36262 ] (2);
[ polcyn_1*0.29949 ] (3);
[ mor_1*0.31598 ] (4);
%C1#2%
[ ar_1*0.56575 ] (13);
[ auth_1*0.41641 ] (14);
[ polcyn_1*0.13853 ] (15);
[ mor_1*0.41398 ] (16);
MODEL C2:
%C2#1%
[ ar_2*1.91941 ] (1);
[ auth_2*0.36262 ] (2);
[ polcyn_2*0.29949 ] (3);
[ mor_2*0.31598 ] (4);
[ dropout$1@15 ];
%C2#2%
[ ar_2*0.56575 ] (13);
[ auth_2*0.41641 ] (14);
[ polcyn_2*0.13853 ] (15);
[ mor_2*0.41398 ] (16);


And this is my LCA model based on the LTA:
usevariables = nAR_1 AUTH_1 POLCYN_1 MOR_1 ;
CLASSES = C1(9);
TYPE = MIXTURE ;
STARTS = 0;
Model:
%OVERALL%
[ c1#1@0.10978 ];
[ c1#2@1.09612 ];
ar_1@0.16055 (5);
auth_1@0.19758 (6);
polcyn_1@0.52895 (7);
mor_1@0.23867 (8);
%C1#1%
[ ar_1@1.91941 ] (1);
[ auth_1@0.36262 ] (2);
[ polcyn_1@0.29949 ] (3);
[ mor_1@0.31598 ] (4);
%C1#2%
[ ar_1@0.56575 ] (13);
[ auth_1@0.41641 ] (14);
[ polcyn_1@0.13853 ] (15);
[ mor_1@0.41398 ] (16);


We request that postings be limited to one window. Questions that require long input/output segments should be sent to Support along with license number. 


Dear Bengt / Linda, Is it possible to constrain the latent class probabilities, so that they are always greater than, say, 2%? In the LCA model that I am estimating I get several "empty" latent classes (please see the output below). If I could constrain the LC probabilities to be greater than 1%, I assume that this problem would not happen.
************************************
FINAL CLASS COUNTS AND PROPORTIONS FOR THE LATENT CLASSES BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP
Class Counts and Proportions
   1    10870    0.75754
   2        0    0.00000
   3      146    0.01017
   4      747    0.05206
   5        0    0.00000
   6        0    0.00000
   7      140    0.00976
   8        0    0.00000
   9        0    0.00000
  10      128    0.00892
  11      125    0.00871
  12      424    0.02955
  13      148    0.01031
  14        0    0.00000
  15      262    0.01826
  16      123    0.00857
  17      881    0.06140
  18      163    0.01136
  19        0    0.00000
  20      192    0.01338


I would not recommend it. I think you want to know when zeros occur. I don't see why they would be problematic.


Hi Bengt, OK, thanks for your reply. We are trying to classify mental health patients according to latent classes of a cost distribution (outcome = daily cost). We are also trying to find the optimal number of latent classes for such a cost distribution. So my idea was that no latent class should be empty, otherwise it makes little sense to say: "the best model has 20 latent classes, but 7 of these are empty". In this case, the number of "effective" classes to which the patients are assigned is only 13. A colleague of mine has also wisely suggested perhaps constraining the variance of the latent classes; that would probably be another feasible remedy to avoid the zeros. What do you think about this solution instead?


I assumed that the 20 classes came from a combination of several latent class variables. If it is only one latent class variable, I would think that BIC would not lead you to 20 classes. But perhaps your analysis is not the usual exploratory LCA but rather a confirmatory version based on substantive matters that I don't know about. Latent classes don't have variances, only probabilities.

Guido Biele posted on Tuesday, December 05, 2017  6:18 am



I would like to constrain a (Bayesian) latent profile model using the size of the latent classes, e.g. class 3 is largest, class 2 is second largest, class 1 is third largest, etc. I assume this syntax should do the trick (for a 4-class model):
MODEL:
%Overall%
[c#1* c#2* c#3*] (c1-c3);
...
MODEL CONSTRAINTS:
c2 > c1;
c3 > c2;
...
My questions:
1) Where can I read up on how class probabilities are modeled and what the link between the c#1, c#2 ... parameters and the class proportions is?
2) Is it correct that I cannot specify the constraints such that class 1 becomes the largest class?
3) Is there another approach you would rather use to constrain a model by the size of the classes?
4) Is there a way to set the priors for c1-c3 in a Bayesian analysis? I couldn't find any examples on using Dirichlet priors to achieve this.
Thanks in advance!
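
On the link asked about in question 1, a short Python sketch may help. It assumes a model without covariates in which the [c#k] parameters are multinomial logits relative to the last class, whose logit is fixed at zero. The function name and the example values are hypothetical.

```python
import math


def class_proportions(logits):
    """Convert logits for classes 1..K-1 into proportions for classes 1..K.

    Assumption: the last class is the reference class with logit 0 and
    there are no covariates predicting class membership.
    """
    all_logits = list(logits) + [0.0]
    top = max(all_logits)  # subtract the maximum for numerical stability
    weights = [math.exp(a - top) for a in all_logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 4-class example with c#1 < c#2 < c#3
props = class_proportions([-1.0, 0.5, 1.5])
```

Under this parameterization, ordering the logits orders the corresponding class proportions, and a class with a positive logit is larger than the reference class.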
