I'm working on a growth mixture model with a couple of covariates. The models seem to be running fine, but when I output the Class Probabilities I get values ranging from 0 to 4 or so. My understanding is that they should range from 0 to 1. So the question is: am I misinterpreting the output, or is something wrong with my model? Thanks.
I am assuming that you are talking about the output and not about saving conditional probabilities in a separate file. When you have covariates, only logit values are printed, not probabilities. When you have no covariates, the results are printed as both probabilities and logits. Let me know if this is not what you mean.
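To make the logit-versus-probability distinction concrete, here is a minimal sketch of how multinomial-logit values map to class probabilities at a given covariate value. It assumes the last class is the reference class with intercept and slope fixed at zero; the function name and all numbers are hypothetical and do not come from the poster's output.

```python
import math

def class_probabilities(intercepts, slopes, x):
    """Class probabilities from a multinomial logit at covariate value x.

    intercepts/slopes hold one entry per non-reference class; the reference
    (last) class is assumed to have intercept and slope fixed at 0.
    """
    logits = [a + b * x for a, b in zip(intercepts, slopes)] + [0.0]
    denom = sum(math.exp(v) for v in logits)
    return [math.exp(v) / denom for v in logits]

# Hypothetical 3-class estimates: a logit of 4.0 is a legitimate value,
# but it is not itself a probability.
probs = class_probabilities(intercepts=[4.0, 1.2], slopes=[-0.5, 0.3], x=0.0)
print([round(p, 3) for p in probs])
```

Whatever the logits are, the resulting probabilities lie between 0 and 1 and sum to 1.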
Peter Tice posted on Thursday, September 14, 2000 - 8:04 am
I started with an analysis identifying the proper number of latent classes for a particular set of data and ended up with 4 latent classes with the following probability distribution:
#1 78.06% #2 4.41% #3 13.81% #4 3.70%
By itself, this is not problematic until I modify the model to include covariates as predictors of latent class membership (e.g., C#1 on verb3 impuls, etc.). After running such a model with the 4 class solution my probability distribution looks rather different.
#1 68.06% #2 12.95% #3 14.39% #4 4.58%
I don't recall seeing an explicit discussion of this in the manual but would like to understand why this happens and which class distribution should I rely on. At first glance this reminds me of circumstances in conventional growth curve modeling where the parameter estimates vary between unconditional (i.e., w/o predictor variables) and conditional (i.e., w/predictor variables) models.
Your analogy with growth modeling is good. For a stable solution, the two class prob distributions should be the same, but may not be for somewhat misspecified models. For example, some covariates may have direct effects on some outcomes. Note also that the order of the classes may be different for the two solutions.
Another question. Is it possible to work with (i.e., specify, constrain, etc.) the class probabilities in the Model statement (I may just be missing something in the manual). For example, I'd like to constrain two latent classes to be of equal size in a k>2 latent class analysis.
I'm finding that the above isn't behaving the way I expected it to. I'm running a 12-class LCA, in which I'm using training data to constrain members of one manifest group to one set of six classes, and members of another manifest group to the other set of six classes. There are no predictors of the latent class membership involved.
The two sets of six classes are constrained to be identical -- the thresholds for each indicator in Class 1 are constrained to be equal to the thresholds in Class 7, and so forth. With Bengt's help, I was able to get that part going. The next step I want to take is test the hypothesis that the class sizes are identical across sets -- that Class 1 is the same size as Class 7, and so on.
I've used the syntax above to constrain the class size parameters to be equal across groups (1 to 7, 2 to 8, etc.). I've tried leaving Class 6 unconstrained and constraining it to zero (I want it to match Class 12 in size). The model converges, and the "means" in the "latent class regression model part" do follow the equality constraint. However, the "proportions of total sample size" do not, and the variance is not in any obvious pattern.
I'd appreciate any insights on this.
bmuthen posted on Thursday, October 18, 2001 - 11:14 am
What is printed under the heading
FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
is based on the estimated posterior probabilities for each individual, given the model and the individual's data. Posterior probabilities should be thought of as akin to factor scores. In most latent class models, the proportions reported here will agree perfectly with the class probabilities as obtained from the estimated [c#...] logit values, but not always - your model is an example of an exception. So I would say that you succeeded in getting the [c#...] parameters set up the way you wanted. The fact that the posterior probability results disagree may be an indication of model misfit, and how they disagree can be a suggestion for how to modify the model.
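A small sketch of the distinction drawn here, with made-up numbers: the model-implied class probabilities come from the class logits, while the reported counts are column sums of the individual posterior probabilities. The 2-class setup, the constrained logit, and the posterior values are all hypothetical.

```python
import math

# Hypothetical 2-class model in which the class logit is constrained to 0,
# so the model-implied class probabilities are 0.5 and 0.5.
c1_logit = 0.0
model_probs = [
    math.exp(c1_logit) / (1 + math.exp(c1_logit)),
    1 / (1 + math.exp(c1_logit)),
]

# Hypothetical posterior probabilities (rows = individuals, columns = classes)
posteriors = [[0.90, 0.10], [0.60, 0.40], [0.55, 0.45], [0.30, 0.70], [0.52, 0.48]]

# Final class counts: column sums of the posterior probabilities
counts = [sum(row[k] for row in posteriors) for k in range(2)]
proportions = [c / len(posteriors) for c in counts]

print(model_probs, proportions)
```

In this toy case the logit-based probabilities satisfy the equality constraint while the posterior-based proportions do not, which is the pattern described in the post above.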
I'm back to the problem with some other data, and I'm still wrestling with this, I'm afraid. I'm testing a treatment vs. control difference. I've run one model with 8 classes. By training data, the control Ss are constrained to be in odd-numbered classes and the treatment Ss in even-numbered classes. The two groups are the same size. In the first model, the indicator logits for the classes are constrained to be equal across pairs of classes (class 1 with 2, class 3 with 4, etc.); there are no constraints on the latent class probabilities. This seems to be working fine, and fits only slightly worse than a model without the constraints.
Then I want to test the hypothesis that, assuming the same class structure, the class sizes/proportions are the same in the two groups. Eyeballing the FINAL CLASS COUNTS in the first model, it looks like there's a pattern of differences. So I ran a second model, with the following code added to the %OVERALL% portion:
FINAL CLASS COUNTS AND PROPORTIONS OF TOTAL SAMPLE SIZE
Class 1    87.19078   0.09797
Class 2    93.93257   0.10554
Class 3   124.40489   0.13978
Class 4   140.18348   0.15751
Class 5    82.87007   0.09311
Class 6    81.98881   0.09212
Class 7   150.53426   0.16914
Class 8   128.89514   0.14483
It really doesn't look like the two results sections are matching up, and I'm at a loss.
Thanks again, Pat
Anonymous posted on Tuesday, November 06, 2001 - 8:26 am
Actually your results are matching up. The final class count is the sum of the estimated posterior probabilities for each individual. Much like the average factor score estimate could be different from the estimated factor mean those numbers are different. Of course you may question the restrictions imposed in the latent class regression model part.
Ok, thanks. Somehow your wording helps Bengt's message from 10/18 click; now I see what's going on. So does MPlus anywhere produce what I was looking for? The raw probabilities of class membership? Or is the best way to exponentiate the logits, and work with the odds?
Anonymous posted on Tuesday, November 06, 2001 - 6:42 pm
For this particular model you can just average the final class proportions within each constrained pair; that is, the raw probability of class membership for class 1 and for class 2 is (0.09797+0.10554)/2.
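The averaging suggested here can be applied to all four constrained pairs. The sketch below uses the proportions posted earlier in the thread; the pairing of adjacent classes (1 with 2, 3 with 4, and so on) follows the poster's description, and the variable names are illustrative only.

```python
# Proportions of total sample size for classes 1-8, as posted above
proportions = [0.09797, 0.10554, 0.13978, 0.15751,
               0.09311, 0.09212, 0.16914, 0.14483]

# Average within each constrained pair: (1,2), (3,4), (5,6), (7,8)
pair_means = [(proportions[i] + proportions[i + 1]) / 2 for i in range(0, 8, 2)]

for pair, mean in zip(["1-2", "3-4", "5-6", "7-8"], pair_means):
    print(f"classes {pair}: {mean:.6f}")
```

Each class in a pair is assigned the pair's mean, so the eight values still sum to 1.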
Anonymous posted on Wednesday, February 09, 2005 - 3:03 am
Dear Linda, I am a new user of Mplus. I am using your short courses package as well as the User's Guide. I have started the course on modelling with categorical latent variables, and I would very much appreciate it if you would expand a little on the statement on page 13, "The u-c relation is a logit regression". With thanks
bmuthen posted on Wednesday, February 09, 2005 - 10:49 am
Look up computations with logistic regression in the Version 3 Mplus User's Guide, chapter 13. This explains the logit in (81) on the page you are referring to. U is the binary dependent variable and c is a categorical x variable. U has a threshold which is the negative of an intercept parameter. If c has 2 classes, then that means that we have a dummy x variable which in line with linear regression means that there are 2 intercepts in the model.
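As a rough illustration of the threshold statement above: if the threshold is the negative of the logit intercept, then the probability of u = 1 in a class is 1 / (1 + exp(threshold)). The sketch assumes a binary u with a logit link and no covariates; the threshold values and the function name are hypothetical.

```python
import math

def prob_u1(threshold):
    """P(u = 1 | class) when the class-specific logit intercept is -threshold."""
    return 1.0 / (1.0 + math.exp(threshold))

# Hypothetical class-specific thresholds for one binary indicator
thresholds = {"class 1": -1.0, "class 2": 2.0}
for label, tau in thresholds.items():
    print(label, round(prob_u1(tau), 3))
```

A negative threshold gives a probability above 0.5 and a positive threshold gives one below 0.5; the two classes differ only in this intercept, which is what the two-intercept remark amounts to.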
Is class 1 the base class? I.e., for the coefficient of y20, which is 0.515, does the log odds = 1.674 imply that for a change of 1 unit in variable y20, the odds that the individual belongs to the base class (class 1) increase by a factor of 1.674?
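A quick arithmetic check on the two numbers in this question: exponentiating the coefficient 0.515 reproduces 1.674, which suggests that 0.515 is the log-odds coefficient and 1.674 is the corresponding odds ratio. Which class serves as the reference is not settled by this calculation.

```python
import math

coefficient = 0.515                 # logit coefficient quoted in the post
odds_ratio = math.exp(coefficient)  # multiplicative change in odds per unit of y20

print(round(odds_ratio, 3))
```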
Anonymous posted on Friday, June 17, 2005 - 9:21 am
Problem 1: I have a dichotomous X variable (that underlies a continuous distribution) that I want to use as a predictor in a growth mixture model. I can't find an example in the manual on defining this X as categorical, is it unnecessary to do so (if it underlies a continuous distribution)?
Problem 2: On a different project, I have a somewhat unexpected effect of a cognitive measure called X on the LGM slope of achievement (y1 thru y4). The model has 4 time points and the last two are freely estimated. The estimated time scores are 1.5 and 1.8. I am having trouble interpreting the effect of X, which I would have expected to be positively related to the slope (it is positively related to the intercept). This may be a rather interesting finding: the higher a respondent scores on cognitive skill X, the higher their intercept will be, but the lower their growth rate.
I plotted the slope estimates by the X variable (by the way, I LOVE that we can do this in MPLUS!!) and everything looks fine (yet negative). It seems to me that it's not their growth rate 'overall' but their initial growth rate (0, 1), because the estimated time scores are 1.5 and 1.8, suggesting a smaller rate of growth at times 3 and 4. That is, in the first period of growth (0, 1) there is a greater incline for those with a lower score on cognitive skill X, but as time goes on (1.5 and 1.8) that initial effect is smaller. Is it correct that the effect of X (when the y's are coded 0, 1, *2, *3 rather than 0, 1, 2, 3) is on the initial growth rate (0, 1)? We would assume it is linear and stable if the time scores were fixed at 0, 1, 2, 3, but when you estimate the time scores you can't conclude that. Or am I unnecessarily getting thrown off by the estimated time scores?
Say the effect of X is -.30 then to get the slope:
1. The scale of observed x variables is not an issue in estimation. They should not be placed on the CATEGORICAL list. In regression, covariates are treated as either dummy variables or continuous variables.
2. It is not unusual for a covariate to have a positive influence on the intercept growth factor and a negative influence on the slope growth factor. You may want to use time scores 0 * * 1 if you are interested in growth between the first and last time points.
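To see how a positive effect on the intercept factor and a negative effect on the slope factor combine, here is a minimal sketch. It assumes the usual growth-model form in which the expected outcome at each occasion is the intercept factor plus the slope factor times that occasion's time score. The slope effect of -0.30 and the time scores 0, 1, 1.5, 1.8 are taken from the question; the intercept effect of 0.50 is hypothetical.

```python
time_scores = [0.0, 1.0, 1.5, 1.8]  # two fixed, two estimated (from the post)
effect_on_intercept = 0.50          # hypothetical positive effect of X
effect_on_slope = -0.30             # effect of X on the slope factor (from the post)

# Expected difference in y at each occasion for a one-unit difference in X
diffs = [effect_on_intercept + effect_on_slope * t for t in time_scores]

for t, d in zip(time_scores, diffs):
    print(f"time score {t}: difference in expected y = {d:.2f}")
```

Under this form the slope effect is scaled by each time score, so with estimated time scores the unit of the slope is the change between the occasions scored 0 and 1.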
Anonymous posted on Wednesday, August 03, 2005 - 10:56 am
I am relatively new to mixture modeling and am not sure that I understand the rationale behind constraining class probabilities to be equal. There does not seem to be a clear explanation, in the manual or on the discussion board, of when this should or should not be done. I have a few questions on this issue:
1)Is this primarily a theoretical consideration (i.e., I have no a priori beliefs as to class sizes so I will assume them to be relatively close to equal)? Or is it instead based upon clues derived from an analysis of results from runs in which class probabilities were not constrained?
2)By using this feature, am I somehow forcing data to fit in groups or am I simply making it easier for the model to converge, similar to assigning starting values for class thresholds?
3) What is the relationship (if any) between the logit starting values I assign for class thresholds and constraining class probabilities to be equal?
Thanks in advance - as a novice, I know I'm probably blurring lines on several of these issues but appreciate your thoughtful response.
BTW - I am working with all continuous variables if that makes any difference.
bmuthen posted on Monday, August 08, 2005 - 1:29 pm
Typically you would not constrain the class probabilities to be equal. What is the reason that you think they should be or are held equal?
Marco Grados posted on Thursday, December 01, 2005 - 9:45 am
I have run LCA on 3 sets of data (N = 197, N = 203, and N = 400) for psychiatric comorbidities in a sample of patients (5 comorbidities). I obtain 3 classes as the best solution in all three cases, and they make sense clinically. I was interested in each subject's probabilities of belonging to each class and used the command:
SAVEDATA: FILE IS prob.result.dat; FORMAT IS FREE; SAVE = CPROBABILITIES;
I have used this before in another analysis and have obtained probs as required. However, now all I am obtaining for the variables cprob1, cprob2 and cprob3 are 0's and 1's...
I am working on GMM and I try to decide on the best number of classes. When I choose the model with the best BIC, the class probabilities vary between .65 and .85.
My questions are: Does a class probability of .65 still indicate good fit? Is there a minimum required value for the average latent class probabilities? Is there a rule of thumb with regard to the average latent class probabilities to decide on model fit and the number of classes?
Or should I just trust the best BIC and not base my conclusion on the class probabilities?
Do you know of any reference that discusses this topic?
See the following paper which can be downloaded from the website to see how to determine the number of classes:
Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (ed.), Handbook of quantitative methodology for the social sciences (pp. 345-368). Newbury Park, CA: Sage Publications.
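For readers unsure what the average latent class probabilities in the question summarize, the sketch below shows one way such a table can be computed from saved posterior probabilities: assign each person to the most likely class, then average the posterior probabilities within each assigned group. This assumes the question refers to that classification table; the posterior values are made up.

```python
# Hypothetical posterior probabilities (rows = individuals, columns = classes)
posteriors = [[0.85, 0.15], [0.70, 0.30], [0.40, 0.60], [0.20, 0.80], [0.65, 0.35]]
n_classes = 2

sums = [[0.0] * n_classes for _ in range(n_classes)]
sizes = [0] * n_classes
for row in posteriors:
    modal = row.index(max(row))  # most likely class for this person
    sizes[modal] += 1
    for k in range(n_classes):
        sums[modal][k] += row[k]

# Row m: average posterior probabilities among people assigned to class m
# (None if nobody is assigned to that class)
avg = [[s / sizes[m] for s in sums[m]] if sizes[m] else None
       for m in range(n_classes)]
print(avg)
```

The diagonal entries are the values usually read as classification quality; no cutoff is implied by this calculation.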
I have been comparing 2, 3 and 4 latent class models to the same with 1 factor. The best fit appears to be the 3 class 1 factor model. In the final class counts based on the estimated model and estimated posterior probabilities, class 2 has 3.7% of the sample. However in the final classification class 2 has no individuals! Distribution based on most likely class membership is:
C#1 0.272 C#2 0.000 C#3 0.728
I notice from the discussion above that it's possible to constrain 2 classes to be the same size where there's 3 or more classes. I wonder if that is feasible for my analysis? Or is there something else going wrong with my analysis? Fiona
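The two summaries contrasted in this post can diverge without anything being wrong numerically. As a toy illustration with invented numbers, a class can carry posterior mass in every individual yet never be anyone's most likely class:

```python
# Hypothetical posterior probabilities for four individuals and three classes
posteriors = [
    [0.60, 0.10, 0.30],
    [0.20, 0.15, 0.65],
    [0.25, 0.05, 0.70],
    [0.15, 0.20, 0.65],
]
n_classes = 3

# Counts based on posterior probabilities (fractional membership)
posterior_counts = [sum(row[k] for row in posteriors) for k in range(n_classes)]

# Counts based on most likely class membership (each person counted once)
modal_counts = [0] * n_classes
for row in posteriors:
    modal_counts[row.index(max(row))] += 1

print(posterior_counts, modal_counts)
```

Here class 2 has a nonzero posterior-based count but no members under most-likely-class assignment.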
I am running a 3-class multigroup LCGA with the following class probabilities.
group 1: 0.28 0.63 0.09 group 2: 0.25 0.68 0.07
A chi-square crosstabulation (by hand) indicates that the probabilities differ between the groups. Is there a way in Mplus to test directly whether the 0.63 differs from the 0.68? And is there a way to get SE's or CI's for these probabilities?
See Slides 48-50 of the Topic 6 course handout. This shows how MODEL CONSTRAINT can be used to define the latent transition probabilities described toward the end of Chapter 13. Standard errors are estimated for new parameters defined in MODEL CONSTRAINT.
I am currently conducting LCA with 5 binary variables and 3 categorical variables (with 3 categories each). When looking at the results, quite a few item response probabilities are set to 0.0/1.0, because the "logit thresholds were approached and set at extreme values". Is there a possibility to turn this default setting off? If not, is this also normally done in LatentGold? (I ask this for comparative reasons; I am comparing my results with those from a previous study.) Also, I was wondering whether you can tell me when one should use a logit and when a probit link when conducting LCA. The v5 manual only mentions the option with regard to LTA. Which one should you use when you employ an LCA? Finally, in the recent book on LCA and LTA by Collins and Lanza (Wiley series) it is explicitly stated that the likelihood ratio statistic (G^2) shouldn't be used when the df of the model is high because the reference distribution for the G^2 statistic is not known. Is it still okay to use the statistics reported in TECH11, as they are based on the likelihood ratio statistic?
Yes, the likelihood-ratio chi-2 (G^2) does not approximate chi-2 well with the sparseness in the observed-data frequency table that you get with many variables. And you cannot use chi-2 difference testing to find the appropriate number of latent classes; that also is not chi-2 distributed. Instead, BIC has been found to work well, as have TECH11 and TECH14. See for example the article
Nylund, K.L., Asparouhov, T., & Muthen, B. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling. A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569.
which is on our web site under Papers, Latent Class Analysis. TECH11 and TECH14 are akin to the bootstrap approaches mentioned in the book you referred to, although not focusing on the frequency table chi-2, but directly on the likelihood-ratio statistic, considering its non-chi-2 distribution.
anonymous posted on Monday, April 05, 2010 - 12:02 pm
Hello, I am trying to determine how best to characterize derived classes. I noticed that in the "results in probability scale" section of Mplus output, there are significance values associated with the probabilities of each latent class indicator. Are these p-values indicating whether the probability is greater than 0 or testing another hypothesis? In addition, in one case, the probability is significant even though the probability value is 0. In other cases, when the probability value is 1.00, the p-value is also 1.00. thanks in advance! Marcy
I presume this is a very basic question about Latent Class Analysis. I am running a two group LCA (trying to discriminate MZ and DZ twins from a bunch of ratings of their similarity). I have good reason to think that the solution should be unbalanced, with about 90% of the cases in one group and 10% in the other. But no matter what I do I seem to end up with close to 50-50. Is there a way to force Mplus to find an unbalanced solution. Is it a matter of fixing the mean of the latent categorical variable on a logit scale? Can one do it with starting values? Or fixing the loadings on a few select items? Thanks.
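Purely as arithmetic on the logit scale mentioned in this question: with two classes, a 90/10 split corresponds to a class logit of ln(0.9/0.1). The sketch below only computes that value; whether to fix the parameter at it or use it as a starting value is not something this calculation settles. The function name is hypothetical.

```python
import math

def class_logit(p_class, p_reference):
    """Logit of a class relative to the reference class."""
    return math.log(p_class / p_reference)

logit_90_10 = class_logit(0.90, 0.10)
print(round(logit_90_10, 3))
```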
Dear Bengt/Linda, I am working on a GMM/LCGA type model which includes about 10.000 cases. I first identified the correct number of classes, and now wish to regress c on some covariates. However, I'd also like to treat the most likely classes from the first stage as observed variables in the second stage, just to check my model solutions. I found that the maximum record length is set at 5.000 in Mplus 6, so I have some problems saving the class probabilities. Can you recommend any work-around procedure? Thanks heaps,
Sorry, I should be clearer. I have a fairly large dataset (10.000) and wish to save class probabilities. There seems to be a record length maximum of 5.000, however. Is there a way to save data for more than 5.000 records?
Hello, I am working on a LCA with longitudinal binary data and estimate thresholds for each class. However, when I add an intercept to the overall statement of the model (which is actually a growth parameter), the model fit is considerably improved.
How can this be, and does this affect the interpretation of the thresholds? e.g. do I need to standardise them now as they will be affected by a latent continuous variable (intercept)?
It sounds like you are freeing the intercept of the intercept growth factor. This should be fixed at zero as part of the growth model parameterization. It should not be freed. If this is not what you are saying, please send your output and license number to firstname.lastname@example.org.
Dear Dr. Muthen, I am using a mixture modeling approach to crime rates at the county level. The best fitting model was the one with four latent classes. When I ran the frequencies on the class variable saved in the data file and compared those with the class counts reported in the output, I noticed some differences:
Class 1: 46 (in output) 57 (in data file)
Class 2: 2561 (in output) 2558 (in data file)
Class 3: 206 (in output) 205 (in data file)
Class 4: 13 (in output) 6 (in data file)
Hi, I am estimating LPA models from plausible values (20 sets) from a previous ESEM model (i.e. I want to estimate the profiles based on the "factor scores", but tried PVS for greater precision). Doing so, I get warnings that I cant save class memberships and most importantly class probabilities from the models ? I guess that is because we work from 20 different data sets. I am wondering whether there would be a way to save these information (i.e. to combine/merge the class probability results into a single data file) ? Thanks
The only way to get most likely class membership is to run each data set separately. You would then need to decide how to use the sets of most likely class memberships.
Jan Ivanouw posted on Wednesday, April 20, 2011 - 5:56 am
I would very much like help in locating the formula for calculating CPROBABILITIES for the individuals in a mixed model.
What I mean is each individual's probability for belonging to class 1, class 2 etc. when knowing the person's item scores and the Mplus estimated model parameters for a certain mixed model.
Mplus gives these probabilities in a save file when demanding CPROBABILITIES, but I would like to be able to calculate them myself based on the parameters estimated by a Mplus model (like thresholds and alpha).
I have looked into the Technical Appendix 8, but it seems that the formula/algorithm to use is not there.
It is formula (161) of the Tech App 8. When you use that in the last step when the iterations have converged you get the CPROBs. It's a bit involved.
Muthén, B. & Shedden, K. (1999). Finite mixture modeling with mixture outcomes using the EM algorithm. Biometrics, 55, 463-469.
and the more technical Section 6.4 of
Muthén, B. & Asparouhov, T. (2009). Growth mixture modeling: Analysis with non-Gaussian random effects. In Fitzmaurice, G., Davidian, M., Verbeke, G. & Molenberghs, G. (eds.), Longitudinal Data Analysis, pp. 143-165. Boca Raton: Chapman & Hall/CRC Press.
Jan Ivanouw posted on Wednesday, April 20, 2011 - 11:12 pm
Thank you very much.
Jan Ivanouw posted on Thursday, April 28, 2011 - 10:41 am
Hi, Working with the calculation of cprobabilities I still have problems.
In Appendix 8 equation 161 I need P(u|c,x) and I find this from eq 152/153.
My problem is in eq 152 the term -(tau - u*). tau is a value which is estimated in the Mplus model, while as I understand it u* is a dimension. How can I find the value for u* to plug into the eq?
u* comes from eqns (150) and (151), so it is also a function of model parameters - and covariate values x_i. You should view u* as the logit behind each observed u, so the parameters of those two eqns are found in whatever relationships in your model that influence the u's.
I have an additional question concerning calculation of posterior probabilities:
I tried the method of calculating posterior probabilities which is described in topic 5 slides 69 and 70 (Berlin july 2009 version). I have a simple model with one latent class variable (with two classes), and with 10 binary indicators.
I estimated the model in Mplus. Then I plugged the probabilities from this model into the equations in slides 69 and 70 for some selected persons. The result was that I got the same posterior probabilities that were calculated by Mplus.
My question is: In slide 70 it is indicated that I have to use the EM algorithm in order to get the posterior probability for a person, treating the class membership as missing data. Why was this in fact not necessary when using the parameters from the estimated Mplus model?
The probabilities you plugged in (using slide 71) were from the Mplus output and Mplus computes those from the estimated model parameters - and those are obtained via ML estimation using the EM algorithm.
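The plug-in calculation discussed here can be sketched with Bayes' rule for a latent class model with binary indicators, assuming the indicators are independent within class. Once the parameter estimates are in hand, no further iteration is involved. The function name and all parameter values are hypothetical.

```python
def lca_posteriors(class_probs, item_probs, responses):
    """Posterior class probabilities for one response pattern.

    class_probs: estimated class probabilities, one per class
    item_probs:  item_probs[k][j] = P(u_j = 1 | class k)
    responses:   observed 0/1 responses for one person
    """
    joint = []
    for pi_k, p_k in zip(class_probs, item_probs):
        likelihood = 1.0
        for p, u in zip(p_k, responses):
            likelihood *= p if u == 1 else (1.0 - p)
        joint.append(pi_k * likelihood)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical 2-class, 3-item estimates and one response pattern
class_probs = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.7], [0.2, 0.3, 0.1]]
post = lca_posteriors(class_probs, item_probs, [1, 1, 0])
print([round(p, 3) for p in post])
```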
See the Version 6 UG, pages 617-618 for the general approach and end of chapter 14 for how the logits relate to the probabilities for the multinomial regression.
Beth Bynum posted on Friday, August 26, 2011 - 7:44 am
I’ve run an LPA with one sample and I have determined the number of classes to retain. Using the classes that were determined in the LPA, I would like to determine the most probable classes for a new sample of individuals. For example, I would like to give the same items that were used as class indicators to a new sample and use the results of the LPA to predict which class each person would most likely be in. Ideally, for each person in the new sample, I would like to be able to use the item responses and compute the probability of being in each class, without having to run a new LPA. Is this possible? If so, does MPLUS output provide the necessary information to compute the probabilities? What equation should I use?
Yes, this can be done and is a good use of LPA. For the new sample of individuals you simply fix all the model parameters at the solution you got from your first sample. This can easily be done using the SVALUES option in the OUTPUT command. The second run then does not estimate any parameters but only estimates the posterior probabilities for each class for each subject (see the CPROBS option of the SAVEDATA command). The results also include an indication of the most likely class. In fact, you can do this for only a single subject.
Beth Bynum posted on Tuesday, September 06, 2011 - 12:26 pm
I tracked down and deconstructed the posterior probability equation that LPA uses to estimate the probabilities. Since I know the latent class means, latent class covariance matrices, and the response vector, I could use the equation to produce new probabilities for each new individual. The formula relies on complex matrix algebra to estimate the density functions, which isn’t a show stopper, but we were hoping to find something that might be simpler.
My Questions: Have you ever seen anyone use LPA/LCA in this type of application? Do you know of any other way to compute the probabilities that might be simpler than relying on the posterior probability formula? We were hoping that LPA would be similar to discriminant analysis and produce a set of equations that predict class membership for a single person. Finally, I wanted to check which density function MPLUS uses to compute the posterior probabilities. Does it use the multivariate normal density function?
I think this is a not uncommon type of application and I know others who have had similar interests in having their own routine. I don't think one can avoid going via the computations of the posterior probabilities (like we show in Appendix 8 in the V2 Tech App.). Discriminant analysis assumes known groups, whereas with the posterior probabilities of mixtures a subject is a fractional member of several groups. Yes, Mplus uses the multivariate normal density when the outcomes are continuous.
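For the continuous-indicator case, a hand-rolled routine along the lines discussed might look like the sketch below. It makes a simplifying assumption that is not stated in the thread: indicators are uncorrelated within class, so the multivariate normal density reduces to a product of univariate normal densities. With within-class covariances a full multivariate density would be needed. Function names and parameter values are hypothetical.

```python
import math

def normal_pdf(y, mean, variance):
    """Univariate normal density."""
    return math.exp(-0.5 * (y - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)

def lpa_posteriors(class_probs, means, variances, y):
    """Posterior class probabilities for one person's indicator vector y,
    assuming diagonal within-class covariance matrices."""
    joint = []
    for pi_k, mu_k, var_k in zip(class_probs, means, variances):
        density = 1.0
        for y_j, m, v in zip(y, mu_k, var_k):
            density *= normal_pdf(y_j, m, v)
        joint.append(pi_k * density)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical 2-class, 2-indicator solution applied to one new person
post = lpa_posteriors(
    class_probs=[0.5, 0.5],
    means=[[0.0, 0.0], [2.0, 2.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
    y=[0.0, 0.0],
)
print([round(p, 3) for p in post])
```

With many indicators the product of densities can underflow, so working with log densities would be the safer choice in practice.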
I have a question regarding interpretation of conditional probabilities for particular classes. If the conditional probability for a particular response category in a particular latent class is say 0.75 can one say that 75% of respondents in this class selected this category or is it only appropriate to say that respondents classified here had a 75% chance of selecting this category. I am of the opinion that in this case these two are interchangeable since the probabilities also represent proportions. Is this the case?
"respondents classified here had a 75% chance of selecting this category". This is the probability that is being estimated. Even better is:
"respondents in this class had an estimated probability of 0.75 of selecting this category."
- That avoids the ambiguous phrase "classified here". Note that we are talking about a model, that is, about subjects' class membership (a subject is a member of only one class in the model), not how they were classified after the parameter estimation was done (when they are fractionally members of several classes).
The other statement sounds like you are talking about the subjects who are most likely in this class and among them 75% are in this response category - that may not be true.
Carolyn CL posted on Friday, March 01, 2013 - 10:19 am
I am running a LCA with 3 continuous level variables and 2 dummy variables (reference dummy omitted: D_Me). I am not clear how to interpret the conditional probabilities for responses on the dummy variables.
1. Why does MPLUS assign 1 and 0 to some of the conditional probabilities - how can I interpret this?
I would treat the 3-category variable as a nominal variable by putting it on the NOMINAL list.
1. The thresholds have extreme values. 2. No, the results in probability scale will be the same as any values you calculate by hand. 3. Treat it as nominal.
Carolyn CL posted on Thursday, March 07, 2013 - 9:51 am
Thank you very much.
I am now treating the 3-category variable as categorical (it is ordered).
To be clear:
1. The model terminates normally, but I am getting this message:
IN THE OPTIMIZATION, ONE OR MORE LOGIT THRESHOLDS APPROACHED AND WERE SET AT THE EXTREME VALUES. EXTREME VALUES ARE -15.000 AND 15.000.
THE FOLLOWING THRESHOLDS WERE SET AT THESE VALUES:
* THRESHOLD 1 OF CLASS INDICATOR CE_CAT FOR CLASS 3 AT ITERATION 110
* THRESHOLD 2 OF CLASS INDICATOR CE_CAT FOR CLASS 3 AT ITERATION 110
Should I be concerned?
2. The "Results in probability scale" can indeed be interpreted as the probability each class falls into a specific category of the ordinal level variable?
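To show why thresholds at -15 or 15 appear as probabilities of 0 or 1, here is a sketch for a 3-category ordered indicator. It assumes a cumulative logit form in which the probability of falling at or below a category is the logistic function of that category's threshold. The message above does not say which extreme each threshold hit, so the sign combinations below are hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probs(thresholds):
    """Category probabilities from ordered logit thresholds,
    assuming P(u <= j) = logistic(threshold_j)."""
    cumulative = [logistic(t) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

# Hypothetical combinations of extreme thresholds for one class
for taus in ([15.0, 15.0], [-15.0, 15.0], [-15.0, -15.0]):
    print(taus, [round(p, 6) for p in ordinal_probs(taus)])
```

In each case essentially all of the class's probability falls in a single category.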
We would like to calculate the percentage of correct class assignments in a simulation study of latent profile analysis (no covariates). Is it possible to save the individual posterior probabilities for each class and the most likely class membership (save=cprob) in a Monte Carlo simulation with multiple replications, to compare with the Monte Carlo data sets that provide true class memberships? Or is there another way to obtain this value?
While performing a 4 class latent profile analysis, the SAVEDATA = CPROB; option only results in a datafile in which the values of the variables + id variable are included. So, no posterior and class probabilities and also no assigned class number. Could you tell me what I'm doing wrong?
TITLE: LPA 4 classes
DATA: FILE IS LPAZscores2.csv;
VARIABLE: NAMES ARE id Dep_T2 Reexp_T2 Avoid_T2 Emo_T2 Hyper_T2;
usevar = Dep_T2 Reexp_T2 Avoid_T2 Emo_T2 Hyper_T2;
IDVARIABLE = id;
CLASSES = c (4);
ANALYSIS: TYPE = MIXTURE;
STARTS 500 125;
savedata: file is "classmembership2" SAVE = CPROB;
FORMAT is FREE;