Skewed variables for latent class/pro...
Mplus Discussion > Latent Variable Mixture Modeling >
 Xu, Man posted on Wednesday, May 12, 2010 - 6:49 am
I have a bit of a problem conducting a latent class/profile analysis. I have a few scales derived from factor scores from CFA models. I would like to use these scales as indicators for latent profile analysis, but I am not sure whether it is good to proceed with such skewed data (most of the data are in the low range, with very few in the mid and high range). I thought about categorising these scales and then conducting a latent class analysis, but this would result in loss of information due to truncation. Another method I am thinking about is to transform the factor scores nonlinearly (e.g., arctan) so that their distributions resemble a normal distribution. But I am not sure whether the interpretation of the latent profile analysis results changes if this is done to the factor scores.

If it is fine to transform the CFA derived factor scores, I was wondering if in Mplus I could in one go transform (correct the non-normal shape) the latent variables, and fit a latent profile model using the transformed variables as indicators?

Please give some advice and suggestions.

Thanks very much!
 Bengt O. Muthen posted on Thursday, May 13, 2010 - 7:56 am
If you expect a latent class (mixture) model underlying your data it is natural for you to see non-normal outcomes; that's what the mixture can explain.
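
As a rough illustration of this point, the Python sketch below computes the skewness of a mixture of normal classes from its class proportions, means, and standard deviations. The numbers are made up for illustration and do not come from any model in this thread. Each class is symmetric on its own, yet the pooled distribution is skewed when the classes are unequal in size.

```python
def mixture_skewness(weights, means, sds):
    """Skewness of a finite mixture of normal components."""
    mu = sum(w * m for w, m in zip(weights, means))
    # Variance = within-class variance + between-class variance.
    var = sum(w * (s ** 2 + (m - mu) ** 2)
              for w, m, s in zip(weights, means, sds))
    # Third central moment; each normal class has zero skewness of its own.
    third = sum(w * ((m - mu) ** 3 + 3 * (m - mu) * s ** 2)
                for w, m, s in zip(weights, means, sds))
    return third / var ** 1.5

# Hypothetical two-class example: a large low class and a small high class.
skew_two_class = mixture_skewness([0.8, 0.2], [0.0, 3.0], [1.0, 1.0])
# A single normal class, for comparison, has no skewness.
skew_one_class = mixture_skewness([1.0], [2.0], [1.5])
print(skew_two_class > 0, skew_one_class)
```

Positive skewness in the pooled data is therefore compatible with normal distributions within each class.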

But perhaps you are concerned about skewed factor scores because the measurements do not cover the full range of the factors well. This might suggest that you should do the mixture modeling on the original outcomes.

I would not first derive factor scores, then transform/discretize them, and then do mixture modeling - you don't know what you have at the end of such a long path.
 Jennifer Buckley posted on Monday, November 22, 2010 - 5:44 am
I have six variables that are measured on a 0-10 scale, which I want to use with other binary variables, in a latent class model.

Ideally, I would like to treat the six variables as continuous. However, at least two of the variables are very negatively skewed, e.g., over 92% of cases are between 6 and 10 and 30-40% are at 10.

From reading previous posts (which I don't seem to be able to find again) I believe that this is problematic. I'm therefore seeking advice regarding the options in this situation.

I have thought about collapsing categories to create a categorical variable; however, it is theoretically difficult to do this.

Alternatively, I understand that one option is to treat them as censored variables. However, I am unsure of the assumptions involved in this approach.

Any help with this is greatly appreciated.

Kind regards,

Jen Buckley
 Linda K. Muthen posted on Monday, November 22, 2010 - 8:04 am
Can you give an example of the question and answer format for the scale?
 Jennifer Buckley posted on Monday, November 22, 2010 - 9:18 am
Hi, thank you for the quick response.

The variables all come from the same question block, which has the following opening:

"For each of the tasks I read out please tell me on a score of 0-10 how much responsibility you think governments should have. 0 means it should not be governments’ responsibility at all and 10 means it should be entirely governments’ responsibility".

This is then followed by a list of tasks, such as providing health care for the sick.

A showcard was also used with the question, showing 0-10 in a line with the following statements: at 0, "should not be governments' responsibility at all" and at 10, "should be entirely governments' responsibility".
 Bengt O. Muthen posted on Monday, November 22, 2010 - 5:33 pm
You could treat the variables that have strong ceiling effects as censored-normal. It may not make a big difference compared to treating them as regular continuous-normal variables. Typically you have to have quite a strong ceiling effect to see a difference.
 Jennifer Buckley posted on Monday, November 29, 2010 - 4:15 am
Following your suggestion, I've run models with censored variables. However, I've not come across censored variables before and I've encountered some problems.

1) I'm finding it difficult to understand whether they are appropriate in this case. If I understand correctly, censoring implies that the measurement tool is not able to measure the full range of responses. However, the measurement scale used implies that 10 is the maximum: individuals believe the government is entirely responsible. Therefore, it seems strange to imply that the full range is not measured, and it is difficult to interpret class means of 14 or 20.

2) With the variables as continuous, a four-class model was optimal. However, with the censored variables, three- and four-class models might not be identified. (The following message appears: ONE OR MORE PARAMETERS WERE FIXED TO AVOID SINGULARITY OF THE INFORMATION MATRIX. THE SINGULARITY IS MOST LIKELY DUE TO THE MODEL IS NOT IDENTIFIED, OR DUE TO A LARGE OR A SMALL PARAMETER ON THE LOGIT SCALE. THE FOLLOWING PARAMETERS WERE FIXED: 29 28 35. The parameters are NU(P) for one out of two censored variables.)

I appreciate that these are not directly issues with Mplus; however, any guidance or references in relation to the use of censored variables in this case, and steps for getting models identified, would be greatly appreciated.

Thank you for your time, Jen Buckley
 Linda K. Muthen posted on Monday, November 29, 2010 - 9:50 am
1. Even though 10 is the highest category allowed, not all who answer 10 may have the same opinion. There may be censoring.

2. Large intercepts in some classes mean that everyone in those classes is at the maximum.
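
To make both points concrete, here is a minimal Python sketch of a censored-from-above normal variable, where the observed score is min(y*, 10) for a latent normal y*. The latent mean of 14 echoes the class means mentioned in the question; the latent standard deviation of 2 is purely an assumption for illustration, and the lower bound of 0 is ignored.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def censored_from_above(latent_mean, latent_sd, ceiling):
    """Return (share at ceiling, expected observed score) for y = min(y*, ceiling)."""
    z = (ceiling - latent_mean) / latent_sd
    share_at_ceiling = 1.0 - normal_cdf(z)
    # E[min(y*, c)] = mu * Phi(z) - sigma * phi(z) + c * (1 - Phi(z))
    expected_observed = (latent_mean * normal_cdf(z)
                         - latent_sd * normal_pdf(z)
                         + ceiling * share_at_ceiling)
    return share_at_ceiling, expected_observed

# Assumed values: latent class mean 14, latent SD 2, scale maximum 10.
share, mean_obs = censored_from_above(latent_mean=14.0, latent_sd=2.0, ceiling=10.0)
print(round(share, 3), round(mean_obs, 2))
```

Under these assumed values, almost the whole class sits at 10 and the expected observed score stays just below 10, so a latent mean above the scale maximum describes the uncensored y*, not the scores people actually report.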
 Jennifer Buckley posted on Wednesday, December 01, 2010 - 2:28 am
Thanks for answering my questions. Hopefully, you might be able to help me with the following?

Can you explain, or provide a reference referring to, the technical problems associated with treating variables with strong ceiling effects as continuous-normal?

In relation to point 2, I now understand the reason for the message, but not how problematic this is: do I need to do something to the model, or is it OK to proceed?

Thanks again,

Jen Buckley
 Bengt O. Muthen posted on Wednesday, December 01, 2010 - 10:43 am
The problem relates to the fact that the assumption of a normally distributed residual for a DV with a strong ceiling effect cannot hold at the ceiling value - only negative residual values can occur. The regression slope is typically underestimated relative to the uncensored underlying DV slope.
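
The slope attenuation can be seen in a small simulation. The sketch below is only an illustration under assumed values (intercept 8, slope 2, unit residual variance, ceiling at 10, a fixed random seed); it fits an ordinary least-squares slope to the latent outcome and to the same outcome censored at the ceiling.

```python
import random

def ols_slope(x, y):
    """Ordinary least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(2010)  # fixed seed so the run is repeatable
CEILING = 10.0             # assumed scale maximum

x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
# Latent (uncensored) outcome with an assumed true slope of 2.
y_latent = [8.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
# Observed outcome: values above the ceiling are recorded as the ceiling.
y_observed = [min(v, CEILING) for v in y_latent]

slope_latent = ols_slope(x, y_latent)
slope_observed = ols_slope(x, y_observed)
share_at_ceiling = sum(v == CEILING for v in y_observed) / len(y_observed)
print(slope_latent, slope_observed, share_at_ceiling)
```

With a bit under a fifth of the cases piled up at the ceiling in this setup, the slope fitted to the observed outcome comes out smaller than the slope fitted to the latent outcome.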

See Maddala's 1983 book on "Limited-dependent and qualitative variables in econometrics" and perhaps also the classic paper by Tobin. See Reference section in our Topic 2 handout.

Regarding point 2: It is ok to proceed; there is no problem.
 Jennifer Buckley posted on Thursday, December 02, 2010 - 1:25 am
Thank you, that's very helpful.
 Lucy Riglin posted on Tuesday, May 26, 2015 - 10:04 am
Is it correct that I do not need to transform skewed variables before conducting an LPA (using MLR)?
Many thanks
 Bengt O. Muthen posted on Tuesday, May 26, 2015 - 2:51 pm
Yes, the skewness is part of what is expected in mixtures and part of what determines the classes.
 Lucy Riglin posted on Thursday, May 28, 2015 - 2:34 am
Thank you very much for confirming. Do you know of any published work that I could reference supporting this?
 Bengt O. Muthen posted on Thursday, May 28, 2015 - 1:43 pm
I can only give a general background for this, such as the McLachlan & Peel reference we give in the UG ref list.
 Nicholas Bishop posted on Thursday, May 25, 2017 - 12:18 pm
I've conducted a latent profile analysis with observed variables measured on 5-point Likert scales that are somewhat negatively skewed (skewness between -.7 and -1.02). My assumption is that using the latent profile approach to determine the number of classes is correct for this situation, but I am concerned that a latent class analysis may be better suited to these outcome measures. Is it correct to use an LPA in this scenario? Thanks.
 Nicholas Bishop posted on Thursday, May 25, 2017 - 1:50 pm
I should ask what your thoughts are, not if it is "correct"....
 Bengt O. Muthen posted on Thursday, May 25, 2017 - 6:38 pm
Because LPA is a mixture model, you would expect your outcomes to be non-normally distributed with a certain amount of skewness. I would only turn to viewing the outcomes as categorical (using LCA) if you have strong floor or ceiling effects.
 Nicholas Bishop posted on Friday, May 26, 2017 - 7:29 am
Thank you for your input.
 Dmitriy Poznyak posted on Tuesday, March 06, 2018 - 10:23 am
Hi, Mplus team,

I am estimating a model in which one variable (cost) is extremely skewed. I realize that mixture modeling is designed to capture skewed outcomes and by transforming the data I could potentially lose information on classes that are potentially present.

I compared model results using the untransformed and the transformed cost variable and observed non-trivial differences. Few differences were observed in the 2-class model, but in the 3-class model the percentage distribution of profiles and the mean values of cost were different. For instance, keeping cost on the original metric leads to separation of one class (2% of the sample) with extremely high cost (not surprising). In the model with log-transformed cost, I instead observed two classes (17% and 9%) with similarly high cost but differing with respect to most other indicators.

Some potential solutions that I have in mind are: 1. Trim some extreme outliers; 2. Rescale cost (e.g., by dividing it by 1000) to bring the metric closer to the rest of the variables and facilitate model convergence. I'd like to avoid recoding cost to categorical, unless you see this as an absolutely recommended approach.

Also, under what circumstances would you recommend freeing the variances of cost across classes? And would you recommend sticking to a Bayesian estimator given the skewness of the data, or would MLR handle this just fine?

Thank you as always!
 Bengt O. Muthen posted on Tuesday, March 06, 2018 - 2:30 pm
One key issue is if your cost variable has floor or ceiling effects so that this is the major cause of the skewness. If not, perhaps the log transformation is sufficient. You can also try the skew-t approach of our paper on our website:

Asparouhov, T. & Muthén, B. (2015). Structural equation models and mixture models with continuous non-normal skewed distributions. Structural Equation Modeling: A Multidisciplinary Journal, DOI: 10.1080/10705511.2014.947375.
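
For comparing the options raised in this exchange, it may help to check what each one does to the skewness of the cost variable. The Python sketch below uses a made-up list of costs (not data from this thread) and the usual moment-based skewness coefficient. Dividing by 1000 changes the metric but not the shape, whereas the log transformation changes the shape.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical cost values with a long right tail.
costs = [100, 120, 150, 200, 300, 500, 1000, 5000, 20000]

skew_raw = skewness(costs)
skew_rescaled = skewness([c / 1000 for c in costs])  # linear rescaling
skew_log = skewness([math.log(c) for c in costs])    # nonlinear transformation
print(skew_raw, skew_rescaled, skew_log)
```

In this toy example the rescaled variable has the same skewness as the raw one, and the logged variable is less skewed, which is one reason class solutions can differ between the raw and log metrics.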
 Scot Seitz posted on Wednesday, October 23, 2019 - 1:23 pm
I am running a latent profile analysis using the manual BCH method. I have 7 indicators, but I am concerned that some of the indicators have ceiling effects. I have read that one strategy for handling ceiling effects is to turn the continuous variables into categorical variables.

What percentage of responses at the highest response option would indicate severe enough ceiling effects to warrant turning the continuous indicators into categorical indicators for the BCH LPA/LCA method?

Thank you!
 Bengt O. Muthen posted on Wednesday, October 23, 2019 - 1:53 pm
When you get up above 25% at the floor or ceiling, things start to be affected.
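
A quick way to apply this rule of thumb is to compute the share of responses at the lowest and highest scale points for each indicator. The Python sketch below is a minimal illustration; the function names and the example scores are hypothetical, and the 0.25 default simply encodes the rough 25% guideline above.

```python
def floor_ceiling_shares(values, low, high):
    """Proportion of responses at the scale minimum and at the scale maximum."""
    n = len(values)
    at_floor = sum(v == low for v in values) / n
    at_ceiling = sum(v == high for v in values) / n
    return at_floor, at_ceiling

def flag_floor_ceiling(values, low, high, threshold=0.25):
    """Flag an indicator whose floor or ceiling share exceeds the threshold."""
    at_floor, at_ceiling = floor_ceiling_shares(values, low, high)
    return {"floor": at_floor > threshold, "ceiling": at_ceiling > threshold}

# Hypothetical 0-10 item with 35 of 100 responses at the top of the scale.
scores = [10] * 35 + [9] * 20 + [8] * 20 + [7] * 10 + [6] * 8 + [3] * 5 + [0] * 2
shares = floor_ceiling_shares(scores, low=0, high=10)
flags = flag_floor_ceiling(scores, low=0, high=10)
print(shares, flags)
```

An indicator flagged this way is a candidate for the censored or categorical treatment discussed in this thread; one below the threshold can usually stay continuous.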