Anonymous posted on Monday, December 10, 2001 - 4:00 pm
Can you please explain the major differences between latent class analysis and cluster analysis? What are the advantages of using LCA?
bmuthen posted on Tuesday, December 11, 2001 - 4:57 pm
LCA can be seen as a special case of cluster analysis, within the family of mixture cluster analysis. Mixture cluster analysis has been advocated by McLachlan and other statisticians as perhaps a better clustering method than the traditional ones (see McLachlan ref. in the Mplus reference section). The real question is which criterion used to form the clusters is most relevant to the particular application. LCA assumes that it is relevant to find clusters of individuals for whom the observed variables are independent, which is another way of saying that in the total sample the latent class variable is the only thing that causes the observed variables to be related to each other. This in turn is in line with factor analysis. If instead you believe that the observed variables have direct relationships, perhaps LCA is not a good method for clustering.
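For readers new to Mplus, a minimal LCA setup looks something like the following sketch (file and variable names are hypothetical):

```
TITLE:    basic latent class analysis with binary indicators
DATA:     FILE = mydata.dat;        ! hypothetical data file
VARIABLE: NAMES = u1-u5;
          CATEGORICAL = u1-u5;      ! the latent class indicators
          CLASSES = c(3);           ! request a 3-class model
ANALYSIS: TYPE = MIXTURE;
```

Conditional independence is the default here: within each class, the u variables are modeled as independent.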
Is it possible to correlate the error terms in Mplus if the assumption of independence of the latent class indicators does not hold? In other words, is it possible to fit an LCA and a cluster model and see which one fits the data better? What happens when indicators are correlated within class, but their correlation is low? If cluster analysis is available, would I be able to include a predictor? Thanks in advance for your reply.
npark posted on Thursday, February 19, 2004 - 7:31 am
I thought I posted this message yesterday to the list, but it might have not gone through. If it is a duplicate, please ignore this.
I obtained a 6-class solution using 26 variables (LCA with binary and continuous indicators). A colleague questioned the possibility of multicollinearity among those 26 observed variables and its consequences. I examined the correlation matrix of the 26 variables and found that the highest bivariate correlation is .06; the others are less than .06. I think there is a possibility of redundant information from correlated items, but I am not sure whether collinearity is a problem in mixture modeling and, if it is, at what point a researcher should be alarmed.
I will greatly appreciate your answer; or please point me to relevant materials (references) on these issues. Thanks very much.
2. It's always a good idea to look at data in several ways to better understand it. It is likely that if you find two factors, you will find three classes. If by data reduction, you mean creating factor scores and using the factor scores in mixture modeling, I have never heard that recommendation.
npark posted on Friday, February 20, 2004 - 8:56 am
Thanks very much!
Anonymous posted on Friday, October 08, 2004 - 5:26 pm
Dear Dr. Muthen:
1) GMMs with different number of groups are not nested; therefore, it is inappropriate to use the likelihood ratio test for model comparison (Ghosh & Sen, 1985; Nagin, 1999). Is this the same situation for LCA?
2) How is the sample-size adjusted BIC defined? Sometimes I find that BIC is smaller for K classes than for K+1 classes, but the adjusted BIC is smaller for K+1 classes than for K classes. Which one should be used for model choice?
Anonymous posted on Sunday, October 10, 2004 - 6:36 pm
Please ignore the above questions. I got it.
Anonymous posted on Tuesday, November 16, 2004 - 7:55 am
I am using Mplus to run a latent profile analysis. I am using 14 scales (on a continuous metric) as input for the analyses. These scales measure 4 latent variables; however, the scale-level data were used in the analyses to estimate parameter information for each of the 14 scales. I was able to find a 3-class solution that made sense -- both in terms of interpretation and fit evidence. A few questions on the procedure:
Is it true that the optimal class solution found by Mplus is the number of latent traits minus 1?
Also, I know that probability information is used in class assignment, but does Mplus allow individual cases to "switch" classes during the iterative process (like a K-means cluster analysis)?
Finally, since latent profile analysis is in the SEM framework, I'm not sure of the role of error. Are the input variables assumed to be measured without error? Or could error be 'partitioned' (as with uniqueness terms in CFA) into the Mplus parameter estimates for each latent class? Error isn't assumed to be zero just because covariance matrix input is used with the analyses, right?
Thank you for your assistance
hildebtb posted on Tuesday, November 16, 2004 - 11:18 am
In terms of LCA versus cluster analysis, is it fair to say that LCA would be preferable to cluster analysis if you were trying to determine whether there are subtypes of a particular diagnostic category? For instance, suppose you had 10 criterion variables that are indicative of types of body image disturbance. You hypothesize that these criteria are met in different patterns, with each pattern representing a different type of body image disturbance with different etiology, phenomenology, genetic predisposition, and comorbidity. Would LCA be more appropriate than cluster analysis to identify the different subtypes of body image disturbance?
bmuthen posted on Tuesday, November 16, 2004 - 1:01 pm
It is not always the case that the number of classes found (k) relates to the number of factors (m) as k = m+1, but it seems to often happen and does have a psychometric reason (see e.g. Bartholomew's book on our web site).
Yes, individuals' class probabilities change over the iterations, and therefore most likely class membership can also change.
You can think of LPA as having error variances - in this case the within-class variance for each outcome. So, the latent class variable explains some of the variation in the outcome and the residual the rest.
I am having trouble conceptually with using LCA in a particular data set that I have of steroid users. I am particularly interested in determining whether there are unique patterns of steroid use. I have a list of drugs (14 total) that have different properties and are likely used in different ways to achieve different goals (build muscle, reduce fat, etc.). They can also be broken down in several ways (some are injected, some are taken orally, some speed up metabolism, others help build muscle, etc.). I also have quantity and frequency data for the larger constructs (how much is taken orally and for what duration). My question is whether an LCA model is the most effective way to determine unique patterns of use. I am particularly concerned with violations of local independence, because amount and frequency should be correlated even within class (although the directions may differ), and the number of drugs taken will in most cases also be related to amount, so taking a drug vs. not taking a drug will usually be positively related to amount. Given the inter-relationships between most of the indicator variables, I wonder whether the LCA model would be trustworthy, given that I would have to allow most variables to be correlated within class to fit an accurate model.
bmuthen posted on Monday, December 27, 2004 - 2:55 pm
Sounds like you have binary use/no-use variables for each drug and for many drugs also QF information. LCA can handle within-class correlations, although with binary variables it is hard to estimate a model where there are many of these; it is easier with continuous outcomes.
I wonder if a 2-part mixture model is relevant here; this is often useful with strong floor effects (many zeros). In your context, 2-part modeling would consider - for each drug - a variable that has one binary part indicating whether the drug is used and another continuous part indicating how much it is used. For the second part you could multiply Q and F into a continuous amount variable. You can then have a mixture model for the 2 parts analyzed simultaneously. The 2 parts are typically strongly correlated. This modeling can be done in Mplus and we have some positive experiences with it.
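As a rough sketch of what this could look like in an Mplus input (variable and file names are hypothetical; the DATA TWOPART command, available in later Mplus versions, splits each amount variable into a binary use part and a continuous amount part):

```
DATA:         FILE = drugs.dat;          ! hypothetical file name
DATA TWOPART: NAMES = amt1-amt3;         ! QxF amount variables with many zeros
              BINARY = bin1-bin3;        ! created 0/1 use indicators
              CONTINUOUS = cont1-cont3;  ! created amounts, missing when unused
VARIABLE:     NAMES = amt1-amt3;
              USEVARIABLES = bin1-bin3 cont1-cont3;
              CATEGORICAL = bin1-bin3;
              CLASSES = c(2);
ANALYSIS:     TYPE = MIXTURE;
```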
Thank you Dr. Muthen for the suggestion. Let me give you a bit more detail to make sure that this would work. The data is as follows:
3 steroids taken orally (binary use/no use for each drug); quantity of oral steroids (continuous); frequency of oral steroids (continuous)
7 steroids taken through injection (binary use/no use for each drug); quantity of injectable steroids (continuous); frequency of injectable steroids (continuous)
1 over-the-counter fat-burning drug (binary use/no use); quantity of OTC fat burner (continuous); frequency of OTC fat burner (continuous)
3 illegal fat-burning drugs (binary use/no use for each drug); quantity of each fat-burning drug (continuous); frequency of all illegal fat-burning drugs (continuous)
It seems as though this data would make the 2-part mixture model a bit more complicated as we don't have Q and F for each individual drug. Would the 2-part mixture model still be possible and could you suggest an example?
bmuthen posted on Monday, December 27, 2004 - 5:46 pm
That data structure makes it more complex. I wouldn't throw all the variables into one LCA analysis, 2-part or not. You can always simplify. One way is to analyze only the 14 binary variables by LCA; that can be informative in itself. Another is to do a 2-part LCA with 4 variables, where the binary variables are "any oral," "any injection," "any legal fat burner," and "any illegal fat burner."
Thank you for the advice again. I have run the 14-binary-indicator LCA and am happy with the results. I wanted to try the 2-part mixture model as you suggested, though, to see if adding the quantity and frequency variables adds interesting information. Is there an example that you could recommend?
Also, would it be possible to allow Q and F to correlate within class using this model, instead of combining them into a single variable? I believe there are users who have high Q and low F, high Q and high F, low Q and high F, and low Q and low F for each of these drugs, and I think that creating a combined QxF variable would mask these groups.
bmuthen posted on Thursday, December 30, 2004 - 3:53 pm
Yes, you could correlate Q and F within class, and that would not be problematic in 2-part LCA/LPA. Although the User's Guide does not have an example of exactly this kind, Ex 6.16 from the Version 3 User's Guide - although a growth model - could be generalized to mixture modeling. We encourage such combinations of examples based on the UG components. There is a paper by Brown et al. on the Mplus web site that has a 2-part growth application, although not a mixture. I have a paper on 2-part growth mixture modeling and also some setups for 2-part factor mixture analysis that I could share. One question I have found important is whether the mixture holds for both parts or only one of them; these variations can be studied. - You might publish before I have time to
Blaze Aylmer posted on Thursday, September 29, 2005 - 2:16 am
What algorithms does Mplus use to undertake cluster analysis?
Thanks in advance
bmuthen posted on Thursday, September 29, 2005 - 5:46 am
Latent class analysis can be used for clustering. This method has been found to perform better than k-means clustering. You can also use more general forms of latent class analysis where you allow for within-class (within-cluster) correlations among the variables.
K Faouzy posted on Monday, December 12, 2005 - 9:50 pm
I am having trouble conceptually with using LCA on a particular data set that I have on business strategy. I am particularly interested in determining whether there are unique patterns of different strategy types. I have a list of variables representing different types of strategy (18 total) that have different properties and are likely used in different ways to achieve different types of strategy (Innovators, Followers, etc.). For instance, respondents were asked to indicate the importance of product innovation to the accomplishment of their business strategy, using a seven-point Likert scale with endpoints "Least Important" (1) and "Extremely Important" (7). My sample is 120 companies, all of which answered the 18 questions. I expect to obtain three to four distinct groups (clusters) of companies, each group following one type of business strategy. Latent class analysis will be performed to identify three to four groups (clusters), as suggested by the literature. My first question is whether an LCA model is the most effective way to identify such distinct groups (clusters). Second, in terms of LCA versus cluster analysis, is it fair to say that LCA would be preferable to cluster analysis in this situation?
It sounds like latent class analysis would be a good approach. LCA is a type of cluster analysis. It performs better than k-means clustering because variances do not need to be equal across classes.
Tonia F posted on Friday, January 27, 2006 - 10:32 am
I would really appreciate some advice, as I've never done cluster analysis before. I am working with population-based data on about 10,000 women. I have 5 binary/dichotomous (coded 0, 1) variables for types of violence (control, fear, demean, physical, sexual). We want to know which forms of violence are likely to co-occur. Others in the field have used cluster analysis to identify patterns. Is this an appropriate technique? Is hierarchical clustering best? K-means? If hierarchical, is single, complete, or average linkage appropriate for my data? There are so many decisions to make, and I am unsure which are the right ones for my data. Many many thanks!
I think latent class analysis would be appropriate for your data and research question. I think it is a preferred clustering technique over K-means clustering, for example.
Tonia F posted on Tuesday, January 31, 2006 - 12:42 pm
Thank you for your response, Dr. Muthen. Is it possible to talk briefly about how one would determine the number of latent classes in an LCA? I am confused as to whether one would start with the minimum or maximum number of possible classes. In my situation, the maximum number is quite large. Many many thanks!
In my experience, one starts with the two-class solution and goes up from there. I suggest you look at the B. Muthen paper in a book edited by Kaplan, which can be downloaded from the website. It shows a strategy for determining the number of classes.
anon posted on Wednesday, February 08, 2006 - 5:05 pm
I have a question about how to determine the optimal clustering solution. Can it be done by examining the BICs alone? Can entropy be used? Both? What do you advise?
I have seen bits and pieces of this question asked, but never answered in a straightforward manner.
thanks for your help.
bmuthen posted on Wednesday, February 08, 2006 - 6:29 pm
There are several criteria in addition to BIC - and another one coming in Mplus Version 4 (bootstrapped LRT). For an overview, see my 2004 chapter in the Kaplan handbook - the pdf is on our web site:
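A sketch of the analysis and output requests covering these criteria (TECH14, the bootstrapped LRT, requires Version 4 or later; other commands omitted):

```
ANALYSIS: TYPE = MIXTURE;
          STARTS = 500 50;    ! enough random starts to replicate the best LL
OUTPUT:   TECH11              ! Lo-Mendell-Rubin LRT against k-1 classes
          TECH14;             ! bootstrapped LRT against k-1 classes
```

These tests compare the k-class model against the k-1 class model, complementing BIC in deciding the number of classes.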
mpduser1 posted on Tuesday, February 16, 2010 - 9:05 am
I have a question about interpreting / reconfiguring the results from a latent class regression in Mplus.
Specifically, in a 4-class model, is it possible to reconfigure or transform the Mplus output so that my regression results pertain to the log odds of being in class 1 versus all other classes, then class 2 versus all other classes, etc. (rather than, say, the log odds of being in class 1 vs. class 4, class 2 versus class 4, class 3 versus class 4, etc.)?
The multinomial regression coefficients get their interpretation as the log odds of a class relative to the last class. I may be wrong, but I don't think there is a simple transformation to get a coefficient that portrays the log odds of a class relative to all others - unlike the coefficients of multinomial regression, it would probably depend on the values of the covariates. But one could compute the log odds you want for a certain set of covariate values.
Anne Chan posted on Wednesday, February 17, 2010 - 1:46 pm
Hello. I ran an LCA on 4 motivation constructs in learning. My data set is quite big, with about 25,000 respondents. According to the AIC and BIC, the 13-class solution fits the data best. However, there are too many clusters in the solution, and the characteristics of some clusters are too similar.
Why are there so many clusters in the best-fitting solution? Is it related to the big sample size? May I ask if you have any suggestions for how I can work with this dataset but obtain a best solution with fewer clusters?
I assume your 4 outcomes are continuous. Have you checked if BIC is better with a 1-factor model? It is not always the case that a simple LCA is a good model for the data. There is also the possibility of using a Factor Mixture Model so that the within-class correlations are not restricted to be zero. See the Mplus web site under Papers for articles on that.
Anne Chan posted on Thursday, February 18, 2010 - 5:24 am
Thanks for your suggestion. 1-factor model is not a better fit. I will check the Factor Mixture Model. Thanks a lot!
anonymous posted on Saturday, February 20, 2010 - 7:45 am
Hello, I am conducting an LCA (n=1510) of psychiatric diagnoses in males. I have a few questions: 1. In terms of assessing the optimal number of classes, the indices are somewhat contradictory. The LL estimate continues to decrease at 4 classes, the LMR likelihood statistic is significant at 3 classes (but no longer at 4 classes), the BIC begins to increase at 2 classes, and the sample-size adjusted BIC begins to increase at 3 classes. 2. For one psychiatric diagnosis variable, 4 categories are generated - which is strange since it is a dichotomous variable. 3. I receive the following message, which I'm not familiar with:
IN THE OPTIMIZATION, ONE OR MORE LOGIT THRESHOLDS APPROACHED AND WERE SET AT THE EXTREME VALUES. EXTREME VALUES ARE -15.000 AND 15.000. THE FOLLOWING THRESHOLDS WERE SET AT THESE VALUES:
* THRESHOLD 1 OF CLASS INDICATOR DSM_SP_N FOR CLASS 1 AT ITERATION 400
* THRESHOLD 2 OF CLASS INDICATOR DSM_SP_N FOR CLASS 1 AT ITERATION 400
* THRESHOLD 3 OF CLASS INDICATOR DSM_SP_N FOR CLASS 1 AT ITERATION 400
* THRESHOLD 1 OF CLASS INDICATOR DSM_PDS_ FOR CLASS 1 AT ITERATION 400
* THRESHOLD 1 OF CLASS INDICATOR DSM_SAD_ FOR CLASS 2 AT ITERATION 400
* THRESHOLD 1 OF CLASS INDICATOR DSM_GAD_ FOR CLASS 2 AT ITERATION 400
1. Fit statistics can be contradictory. You also need to consider the theoretical meaning of the classes. 2. If this is the case, it sounds like you are not reading your data correctly. 3. When thresholds become large, they are fixed, reflecting probabilities of zero and one. This can be helpful in defining the classes.
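For reference, in the logit parameterization a threshold $\tau_{jc}$ for binary item $j$ in class $c$ maps to an item probability as

```latex
P(u_j = 1 \mid c) = \frac{1}{1 + e^{\tau_{jc}}},
\qquad
\tau_{jc} = 15 \;\Rightarrow\; P \approx 0,
\qquad
\tau_{jc} = -15 \;\Rightarrow\; P \approx 1 .
```

So a threshold fixed at +15 corresponds to an endorsement probability of essentially zero, and one fixed at -15 to essentially one.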
anonymous posted on Sunday, February 21, 2010 - 4:40 pm
Thanks very much for your response. Do I understand you correctly that Mplus by default fixes thresholds to -15 and +15 when they become extreme, reflecting probabilities of 1 (-15) and 0 (+15)? Thanks!
We've never seen entropy of one. It could be over-fitting but I can't say for sure.
anonymous posted on Wednesday, March 10, 2010 - 1:20 pm
I am now attempting to include a continuous covariate in a 5-class LCA of complex survey data. All models ran successfully except the 5-class model:
WARNING: THE BEST LOGLIKELIHOOD VALUE WAS NOT REPLICATED. THE SOLUTION MAY NOT BE TRUSTWORTHY DUE TO LOCAL MAXIMA. INCREASE THE NUMBER OF RANDOM STARTS.
However, I increased the initial- and final-stage random starts to 3000 and 2000, respectively, but continued to receive this message. I've also tried changing the starting values by using the STSEED option.
Erika Wolf posted on Thursday, December 09, 2010 - 9:02 am
I'm running an LPA with random starts and I'm homing in on a 3-class solution. However, in the 3-class model I get the following message: THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS 0.194D-19. PROBLEM INVOLVING PARAMETER 20.
The parameter this is referring to is estimated at 0 (and I would expect it to be 0 in that class). I'm wondering if this is causing the problem and I can ignore this message? Or is there something else I can do to resolve the issue? Thanks!
I am doing a set of LCAs using ordinal indicators with 3 categories. For the first LCA, I have 26 indicators of the latent class and 255 observations. Beginning with the 2-class model, I get the following error message: THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS -0.373D-18. PROBLEM INVOLVING PARAMETER 8.
Any input on how I could address this would be great. Thank you.
Susan Pe posted on Tuesday, September 18, 2012 - 6:36 am
Hi, I am considering doing a latent class analysis using 3 variables. Is it possible to use latent variables for the latent class analysis? I think the observed proxy for the variable may not be good enough and using a latent variable made up of 3 items may be a better proxy for the variable. Or should I try to come up with the best observed proxy for the variables in the latent class analysis? Thank you.
Jon Heron posted on Tuesday, September 18, 2012 - 7:11 am
Continuous latent variables would be a second order Latent Profile Analysis I guess.
If the variables are binary, no more than two classes can be extracted.
See Example 7.17 where a factor is used as you suggest. You could try it both ways. One issue with using a factor is that the indicators may have a direct relationship to the categorical latent variable.
I have 20 continuous variables with which to perform a latent profile analysis, with a sample size of around 900. 1) How do I determine the identifiability of my model? 2) Generally, should residual variances be held equal across classes, and what is the theoretical basis for this?
Any good papers on this topic will be greatly appreciated.
Hi Linda, I picked a 4-class solution for my unconditional LCA model. I am now running conditional models. I notice that the item endorsement probabilities for each class change slightly as I add predictors. Is it possible for me to constrain the loadings of the indicators in the conditional models to what they were in the unconditional model? This way, the classes mean the same thing upon adding predictors to the model. Note that all of my indicators are dichotomous. Thank you.
This may indicate the need for direct effects. See the following paper on the website:
Muthén, B. (2004). Latent variable analysis: Growth mixture modeling and related techniques for longitudinal data. In D. Kaplan (ed.), Handbook of quantitative methodology for the social sciences (pp. 345-368).
You may also consider the 3-step approach "R3STEP" that is introduced in the just released Version 7. This holds the class proportions fixed at the values of the unconditional model. See Mplus Web Note 15 as well as the V7 training videos from Utrecht that are referred to on our home page.
Bengt and Linda, I have Mplus Version 6.12. Is there a way to constrain the solution to that of the unconditional model manually, without R3STEP? I was thinking I could do this by using @ values for the thresholds of the dichotomous items, rather than start values (where * is used), as demonstrated in the Mplus manual. Will this work? Thank you!
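For example, something like the following (the threshold values shown are hypothetical, standing in for estimates taken from the unconditional run):

```
MODEL:  %OVERALL%
        c ON x;                    ! covariate predicting class membership
        %c#1%
        [u1$1@-2.35 u2$1@1.10];    ! thresholds fixed at unconditional estimates
        %c#2%
        [u1$1@0.45 u2$1@-0.80];
```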
Hi Linda and Bengt, I am now using R3STEP and I think it is great! However, in comparing the effect of different predictors on class membership in univariate fashion, I notice that the fit indices are the same regardless of what predictor I enter.
Specifically, the -2LL is -2765.7 and the AIC is 5753.4 regardless of whether I enter Internalizing, Externalizing, or Adversity. I would expect the fit of the model to be different depending on what predictor I enter.
Or, is it the case that these fit statistics refer to the fit of the initial model without the predictor (the first step of the 3-step process)? The numbers above are indeed very close to those from the unconditional model.
Is there a way to get the fit of the model with the predictor added, enabling me to compare the fit of models with various predictors?
I am running a LCA with both continuous and binary indicators. I am interested in reporting odds ratios and confidence intervals for a 2-class solution. I am trying to make one specific class the referent group to obtain the odds ratios in the direction I want. Without using the CINTERVAL statement, I am able to do this by specifying starting values for the classes in my model input statement. However, when I add the CINTERVAL statement, regardless of using starting values (both exact and extreme), I cannot seem to change the class that is used as the referent in the output. Do you know if there is a more appropriate way to do this?
Is it a requirement that responses on the indicators in an LCA be independent of each other? That is, there should not be contingency between two indicators of the latent class, right? In other words, there should not be correlations between responses on the indicators above and beyond what is accounted for by the latent class variable, right?
Is this stated in the Mplus manual so that I may cite this idea?
LCA indicators can be highly correlated, but the model says that they are not correlated within class. This is the standard LCA conditional independence assumption that you will find in any LCA writing including in the LCA book our UG refers to: Hagenaars & McCutcheon (2002). You add classes until this is reasonably fulfilled (you can check by TECH10).
It is hard to do model fit testing with many categorical variables. In those cases I would take the more practical approach and increase the number of classes to see if new substantively meaningful classes emerge.
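The TECH10 check mentioned above amounts to adding one line to the OUTPUT command:

```
OUTPUT: TECH10;   ! univariate and bivariate model fit for the categorical
                  ! indicators; large bivariate chi-square contributions flag
                  ! item pairs that remain dependent within class
```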
I'm running a 3-step LCGA including a range of auxiliary variables (using x(r3step)). I have added CINTERVAL to the output command and expected to find ORs and CIs from the logistic regression in the output. However, they do not seem to be there. Am I missing something?
I understand that in LCA (binary indicators), when thresholds become large they are fixed, reflecting probabilities of zero and one. In the case of LPA (count indicators), when one gets extreme logit parameters set at -15 and 15, does this mean that the probability of the mean for a given indicator is zero or one?
I am new to LCA/LPA and have a few questions before I get started. I am interested in determining clusters of people with arthritis. Previous work using hierarchical cluster analysis has formed clusters based mainly on one construct (psychological profiles) and then looked at associations with other factors once the clusters were established. So my questions are:
1. With LCA/LPA am I able to cluster on more than one construct e.g. psych profiles (3 variables), performance of physical activity tests (4 variables), sensory testing (2 variables) and patient reported pain and function (3 variables)? If this is possible are there concerns in using multiple factors/constructs?
2. I understand that once the clusters are formed that there is the assumption of independence of the variables. As per Dr. Muthen's first comment in this thread about suspecting that there are direct relationships between the observed variables, does one consider a minimal correlation e.g. r<0.4 to determine whether direct relationships exist?
1. Perfectly fine to cluster based on all constructs jointly, working with all the indicators jointly. You have to choose the model specification: either letting the factor means vary across classes (invariant indicator intercepts), or letting the indicator intercepts vary across classes (factor means fixed at zero). See my "hybrid" paper from 2008 on our website.
2. Note that having a latent class variable implies that the indicators are correlated; the model says that is why they are correlated. The question is whether there is residual, or within-class, correlation between some indicators. I don't know if you have continuous or categorical indicators; in both cases, however, you can use WITH to capture some of these residual correlations.
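For continuous indicators, such a class-specific residual correlation can be sketched as follows (variable names hypothetical):

```
MODEL:  %OVERALL%
        y1 WITH y2;    ! residual covariance between two indicators, all classes
        %c#1%
        y1 WITH y2;    ! mentioning it again lets it differ in class 1
```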
I am conducting an LCA, but I keep receiving the following error message:
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS -0.243D-16. PROBLEM INVOLVING PARAMETER 12.
According to the parameter specification, the problem parameter is a tau (threshold). I was initially using ordinal indicators (3 categories), so I changed the data to binary indicators (2 categories), but I still receive the same error message. Could you suggest any solutions to this problem?