Non-normal distribution

Mplus Discussion > Structural Equation Modeling >

hai hong li posted on Monday, July 31, 2006 - 7:47 pm

Hi Linda,

I would like to fit an SEM model to a set of ordered categorical variables that have L- or J-shaped distributions. That is, the assumption that the latent variable is normally distributed is not valid, and some values (up to 2 percent) are missing.

1. Can Mplus handle this properly?
2. Can it accommodate a formative construct?

Thank you in advance
haihong

Linda K. Muthen posted on Tuesday, August 01, 2006 - 9:25 am

Factors with categorical factor indicators are not necessarily non-normal.

1. Yes.
2. Yes.

hai hong li posted on Tuesday, August 01, 2006 - 2:03 pm

You wrote that "Factors with categorical factor indicators are not necessarily non-normal". So if I have five-point Likert scales that are extremely skewed (L or J shape), for the analysis of the data you do not make the assumption that the underlying latent variable is normally distributed. That is, you do not substitute polychoric correlations for the covariances to analyze the data.

Is this correct?

Thanks
haihong

Bengt O. Muthen posted on Tuesday, August 01, 2006 - 7:04 pm

There are 2 things here. First, the fact that your Likert scale items are skewed does not mean that the factor must be skewed. Take the example of very extreme attitude items where most people disagree. This gives skewed items, but the factor may still be normal. The observed non-normality may simply be due to extremeness of the item wording.

Second, the default assumption in Mplus is that the factor is normal. It does not have to be normal, however, if you work with mixture modeling. As for your last sentence, you may be interested in the following statement from our short courses:

Note that by assuming normal factors and using probit links, ML uses
the same model as WLSMV. This is because normal factors and probit
links result in multivariate normal u* variables. For model
estimation, WLSMV uses the limited information of first- and
second-order moments, thresholds and sample correlations of the
multivariate normal u* variables (tetrachoric, polychoric, and
polyserial correlations), whereas ML uses full information from all
moments of the data.
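
A minimal Mplus input sketch of the two estimation routes named in the statement above, for a one-factor model with ordered categorical indicators. The file name, the item names u1-u5, and the missing-value code are assumptions made for illustration; none of them come from this thread.

```mplus
TITLE:    One-factor model, ordinal indicators (sketch);
DATA:     FILE = items.dat;        ! hypothetical file name
VARIABLE: NAMES = u1-u5;           ! hypothetical item names
          CATEGORICAL = u1-u5;     ! treat Likert items as ordinal
          MISSING = ALL (-999);    ! hypothetical missing code
ANALYSIS: ESTIMATOR = WLSMV;       ! limited information, probit
! Full-information alternative for the same model:
!         ESTIMATOR = ML; LINK = PROBIT;
MODEL:    f BY u1-u5;              ! factor is normal by default
```

Swapping the ANALYSIS line for the commented alternative keeps the model the same and changes only how much of the data's information is used in estimation.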

Xu, Man posted on Thursday, July 12, 2012 - 11:23 am

Could I please follow up on this lead? Suppose one's latent variable is not normally distributed, regardless of whether the items are continuous or ordinal. For example, most psychiatric screening instruments aren't normally distributed, and a binary cutoff is often applied to sum scores. It is now quite common to apply CFA to such ordinal item-level data (either as continuous under RML or as ordinal under WLSMV) and to use the latent factors as predictors or outcomes in SEM. I was wondering what the implications are for results based on this kind of measurement model, and whether there is a way to deal with this.

Thanks!

Bengt O. Muthen posted on Thursday, July 12, 2012 - 6:27 pm

Not sure what your major concern is - perhaps it is that you don't believe your latent variable is normally distributed.

You seem to take a non-normal distribution of a psychiatric screening instrument as an argument that the corresponding latent variable is not normal, but per my discussion above, I don't think that necessarily follows.

For a related, recent article, see

Wall, M. M., Guo, J., & Amemiya, Y. (2012). Mixture factor analysis for approximating a nonnormally distributed continuous latent factor with continuous and dichotomous observed variables. Multivariate Behavioral Research, 47:2, 276-313.

where a non-normal latent variable is obtained using the Mplus mixture approach.

Xu, Man posted on Friday, July 13, 2012 - 2:17 am

Yes, I am worried that the factor is not normally distributed. I checked the data I have (3 instruments). One of them has normally distributed factor scores (as you say). The other two do not. I declared all items to be categorical with the default estimator.

At the item level, I found that the instrument with normally distributed factor scores has more items that are normally distributed. The two with skewed factor score distributions have mostly highly skewed items.

Does this mean that the latent variable approach is not suitable for the other two instruments?

Thank you for the paper. I shall read it.

Bengt O. Muthen posted on Friday, July 13, 2012 - 8:37 am

Take a look at slide 117 of our Topic 2 handout and you can see why an observed score can be non-normal at the same time as a latent score being normal.

A latent variable can be normal and still give a non-normal estimated factor score distribution. This is because of items that don't capture the tails of the factor distribution well, for instance too easy or too hard items.

I would guess that the issue you are concerned with is likely of less importance than other aspects of your modeling.

Jazgul Ismailova posted on Tuesday, January 22, 2013 - 6:20 pm

Transforming skewed data in ESEM

Hello!

In my ESEM model I am trying to estimate the relationship between two variables: the explanatory variable is positively skewed (Likert scale on preference) and the dependent variable is negatively skewed (consumption data). I wonder whether I need to use a log transformation to normalize the data, or whether it is enough to use the WLSMV estimator and skip transformations. Another reason for choosing the WLSMV estimator is that I have some other categorical variables in the model.
Thanks,
Jazgul

Bengt O. Muthen posted on Tuesday, January 22, 2013 - 6:48 pm

I would not transform variables unless that makes the linearity specification more realistic. The MLR estimator handles non-normality, or if you don't want to assume continuous variables, then WLSMV takes care of it.

Jazgul Ismailova posted on Tuesday, January 22, 2013 - 7:22 pm

Thanks! If I have both continuous and dichotomous variables in the model, can I specify both estimators?

/Jazgul

Linda K. Muthen posted on Wednesday, January 23, 2013 - 6:39 am

You cannot use more than one estimator in an analysis. Both weighted least squares and maximum likelihood can be used for a model with a combination of continuous and dichotomous variables.

Lucie Tesarova posted on Wednesday, August 14, 2013 - 8:19 am

Hello Linda,

I was hoping you could help me. I am new to Mplus, but I have read many discussions here and also the Mplus manual. However, I am failing to find a way to improve my CFA model fit (incremental stats used already).

I am trying to fit a 2-factor CFA (2 latent variables). I am using factor indicators that were calculated as the means of specific items measured on a Likert scale. Thus the factor indicators are not normally distributed (histograms and normality tests support my assumption).

On top of the non-normality, my data have quite a large amount of missing data, and neither imputation nor transformation improved the normality.

I have tried to use WLSMV as well as WLS and am getting errors stating that I have no categorical variables present (I cannot declare the factor indicators as categorical because they are not integers). When I use MLM or MLR I get an error message that I have to use listwise deletion, which is impossible due to the amount of missing values.

I have used FIML so far, but the model fit is poor when I evaluate all fit indices other than chi-square (TLI = 0.8).

Have you got any suggestions?

Thank you in advance for your help.

Lucie

Linda K. Muthen posted on Wednesday, August 14, 2013 - 9:15 am

Try an EFA to see if the two factors you are specifying are supported by the data. You could consider going back to the original items.

Steven A. Miller posted on Tuesday, June 24, 2014 - 1:56 pm

If I have data that is not normally distributed -- reaction time, so likely exGaussian, would it be appropriate to use Mplus 7.2's features for skew and kurtosis to model this? Or are there properties of the distribution that limit the feasibility of applying the new features to such a distribution? Is it possible to predict the amount of skew with an exogenous variable? If so, how?

Thanks,
Steve

Bengt O. Muthen posted on Tuesday, June 24, 2014 - 2:48 pm

If you have a continuous DV that is non-normal and without floor or ceiling effects, why don't you try out the new 7.2 features. If you have an exogenous variable, the non-normal specification is on the residual of the DV. This means that the DV can be non-normal due to both a non-normal exogenous variable and a non-normal residual.
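
As a rough sketch of what trying the 7.2 features could look like for a regression with a non-normal residual: the DISTRIBUTION option of the ANALYSIS command requests a non-normal distribution for the continuous dependent variable. The file name and the variable names rt and x are hypothetical, and the options available should be checked against the User's Guide for the version in use.

```mplus
TITLE:    Regression with skew-t residual (sketch);
DATA:     FILE = rt.dat;           ! hypothetical file name
VARIABLE: NAMES = rt x;            ! hypothetical variable names
ANALYSIS: DISTRIBUTION = SKEWT;    ! non-normal residual for rt
MODEL:    rt ON x;                 ! x is the exogenous variable
```

Because the non-normal specification applies to the residual of rt, skewness in rt can still come from x as well as from the residual.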

Daniel Lee posted on Monday, June 30, 2014 - 9:08 pm

Hi Dr. Muthen,

Thank you for responding to my questions. I appreciate you, and your team, very much!
While conducting factor analysis, I had a few questions in mind:

1) Items in my latent variable are scaled differently (ordinal & dichotomous). Could I use WLSMV in this scenario? If so, is there anything else I need to be aware of in terms of conducting my analyses the right way?

2) If I elect to use WLSMV estimation, do I need to conduct a test (Tech13?) for multivariate normality? I would guess no since WLSMV uses a probit function...but I just wanted to make sure!

3) Lastly, are there any diagnostics I should be aware of when conducting a CFA using WLSMV?

Thank you so much Dr. Muthen!!!

Best,

Dan

Daniel Lee posted on Monday, June 30, 2014 - 9:11 pm

Hi Dr. Muthen,

I'm so sorry, I had one more question that I forgot to include in the previous post!

4) If I conduct a WLSMV, can I still use FIML to treat missingness in data?

Thank you so much,

Dan

Linda K. Muthen posted on Tuesday, July 01, 2014 - 7:56 am

1. You can use WLSMV or ML in this situation.

2. There is no test for multivariate normality.

3. TECH10.

4. No. FIML is full-information maximum likelihood. With WLSMV, missing data are handled using pairwise present. If you want FIML, use the ML estimator.

Daniel Lee posted on Tuesday, July 01, 2014 - 11:17 am

Hi Dr. Muthen,

Thank you for the response! Just one question for clarification purposes:

With regard to my first question, if you have categorical and dichotomous manifest variables, wouldn't ML estimation generate biased standard errors? I always thought (I don't remember where I saw this) WLSMV was the way to go when indicators were categorical. Therefore, I'm curious as to why ML might be appropriate in this situation.

Thank you so much! Your responses are always so helpful!

Dan

Linda K. Muthen posted on Tuesday, July 01, 2014 - 11:56 am

You can treat variables as categorical with both WLSMV and ML. You would put them on the CATEGORICAL list in both cases.

Jean-Samuel Cloutier posted on Wednesday, August 05, 2015 - 8:34 am

I have a U-shaped distribution for one of my latent variables, as if many observations were 0, many were 1, and fewer were in between. What regression type should I use?

Linda K. Muthen posted on Wednesday, August 05, 2015 - 12:11 pm

I don't know of any regression for a u-shaped variable.

Julie Kim posted on Thursday, August 13, 2015 - 10:51 am

Hello Linda

I appreciate all your posts on the forum.

I am conducting CFA and SEM, but before I even start, I know I have to check multivariate normality. When I looked at the univariate items, they are not normal, and therefore the data are not multivariate normal.

From your posts so far, it seems that WLSMV or TECH13 takes care of this issue. Am I understanding correctly? In other words, if I use WLSMV or TECH13, do I not need to do anything about multivariate non-normal data (e.g., transformation)?

Thank you so much

Linda K. Muthen posted on Thursday, August 13, 2015 - 2:58 pm

Are you asking about categorical variables or continuous variables?

Julie Kim posted on Thursday, August 13, 2015 - 4:32 pm

Linda, my data have both categorical variables (e.g., gender) and continuous variables (many Likert scales). But for the SEM part, they are all continuous, I believe.

Linda K. Muthen posted on Thursday, August 13, 2015 - 6:53 pm

The scale of covariates like gender is not an issue in regression. Covariates can be binary or continuous and in both cases they are treated as continuous in regression.

If you have likert items and they have floor or ceiling effects, a piling up of observations in the lowest or highest categories, you should treat them as categorical. It sounds like that is the case. If you treat them as categorical by putting them on the CATEGORICAL list and using either WLSMV or ML, the categorical methodology of probit or logit regression takes care of this.

Julie Kim posted on Monday, August 17, 2015 - 7:10 pm

Linda, I appreciate your answers very much. Excuse me for asking such basic questions.

1. Here is my understanding so far as I analyze my data (which include categorical variables such as yes/no questions, continuous variables such as percentages, and many Likert scales). From what you described, when any kind of variable is univariate non-normal, I can put it on the CATEGORICAL list (because the non-normality is likely due to floor or ceiling effects) and use either WLSMV or ML to take care of multivariate non-normality. Is this true, or am I misunderstanding?

In other words, is it true that I do not need to transform anything if I use CATEGORICAL and WLSMV or ML?

2. Does WLSMV/ML option take care of homoscedasticity as well?

Thank you

Linda K. Muthen posted on Tuesday, August 18, 2015 - 6:44 am

1. If you have a variable measured on a continuous scale like height or weight and the variable is not normally distributed, you can use an estimator that is robust to non-normality like MLR. It is not necessary to transform the variable. You cannot put it on the CATEGORICAL list.

If you have a binary or Likert-type variable, the numbers, 0/1, or 0/1/2/3/4 have no numeric meaning. They simply denote categories. Only this type of variable can be put on the CATEGORICAL list. When it is put on the CATEGORICAL list, categorical variable methodology is used. This methodology is developed to handle non-normal distributions of frequencies across categories.

2. Not to my knowledge.

Julie Kim posted on Tuesday, August 18, 2015 - 7:31 am

Linda, thank you so much.
Please allow me to ask follow-up question.

After more thought, I realize that one of my latent variables has 3 indicators: (1) a yes/no (0, 1) categorical item, (2) the % of women in the chosen job, and (3) a composite score (likely ranging from 10 to 60). So I have a mixture of continuous (2, 3) and categorical (1) indicators. In this case, do you have any recommendation? I believe I can no longer use CATEGORICAL.

Thank you

Linda K. Muthen posted on Tuesday, August 18, 2015 - 8:25 am

You put the categorical variables on the CATEGORICAL list. For the others, the default is to treat them as continuous.
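
A minimal sketch of a factor with one binary and two continuous indicators, following the rule above. The file name and the variable names u1, y2, and y3 are hypothetical stand-ins for the yes/no item, the percentage, and the composite score.

```mplus
TITLE:    Factor with mixed indicator types (sketch);
DATA:     FILE = mixed.dat;        ! hypothetical file name
VARIABLE: NAMES = u1 y2 y3;        ! hypothetical names
          CATEGORICAL = u1;        ! only the yes/no item is listed
ANALYSIS: ESTIMATOR = WLSMV;       ! ML is the other option
MODEL:    f BY u1 y2 y3;           ! y2, y3 continuous by default
```

Only one ESTIMATOR is given for the whole analysis; the CATEGORICAL list, not the estimator, is what distinguishes the variable types.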

Doris Matosic posted on Tuesday, March 08, 2016 - 9:46 am

Hi,

I am running CFAs for the scales I used in my study. One of the scales (continuous variable, 6-items, 1 factor) is non normally distributed (skewness -3.2 and kurtosis 11.7). I have used MLR estimator and it still provides me with poor model fit (chi square/df ratio and RMSEA are high). How would you suggest to go about this issue? I was thinking of transforming the data, but not sure if that would be a solution.

Thank you

Bengt O. Muthen posted on Wednesday, March 09, 2016 - 8:12 am

Maybe you have strong floor or ceiling effects.

Doris Matosic posted on Wednesday, March 09, 2016 - 8:41 am

Thank you so much!

I think it might be a strong ceiling effect. What approach would you suggest in order to run the CFA with the strong ceiling effect of one variable?

Thank you!

Bengt O. Muthen posted on Wednesday, March 09, 2016 - 11:52 am

Censored using WLSMV is one possibility.

But first I would do an EFA.
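
A hedged sketch of the censored-variable route mentioned above for a one-factor, 6-item scale with a ceiling effect. The file name and the item names y1-y6 are assumptions; in practice only the items that actually pile up at the top category would go on the CENSORED list.

```mplus
TITLE:    CFA with ceiling effect as censoring (sketch);
DATA:     FILE = scale.dat;        ! hypothetical file name
VARIABLE: NAMES = y1-y6;           ! hypothetical item names
          CENSORED = y1-y6 (a);    ! (a) = censored from above
ANALYSIS: ESTIMATOR = WLSMV;
MODEL:    f BY y1-y6;
```

For a floor effect, the (a) setting would be replaced by (b), which marks censoring from below.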

WEN Congcong posted on Sunday, April 24, 2016 - 10:01 pm

Hello, Dr. Muthen,
I have some questions about testing the data's normality. Thank you in advance.
In my study, I want to first use multiple group ESEM to test the measurement invariance and the structural invariance of 3 student groups and find out the correlations across groups, then use latent profile analysis to explore the data's structure and take the results as references for the results of the multiple group ESEM. At this point, I have come up with some questions.
Question 1: When should I test the observed variables' normality (9 observed variables)? I think I should undertake this step in the EFA part of the multiple group ESEM because the default estimator is ML in Mplus.
Question 2: How can I test the normality of the 9 observed variables? Is there any introduction to this? The only thing I can find is the normality testing when dealing with mixture models in CFA.
Any response will be really appreciated!
Wen Congcong

WEN Congcong posted on Sunday, April 24, 2016 - 10:34 pm

Here is my program. Is it correct?
TITLE: Testing non-normality;
DATA: FILE = "C:/Users/dell/Documents/data.csv";
  LISTWISE = ON;
VARIABLE:
  NAMES ARE cate y1-y9;
  USEVARIABLES ARE y1-y9;
OUTPUT: SAMPSTAT TECH12;

Bengt O. Muthen posted on Monday, April 25, 2016 - 6:34 pm

There is no need to test for normality. Just use the MLR estimator and if the SEs and chi-2 are substantially different from those of ML you know that non-normality is an issue and you should use MLR.
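
The comparison suggested above can be sketched as a single input that is run twice, once with each estimator. The variable names follow the program posted earlier; the shortened file name and the three-factor structure are purely hypothetical placeholders.

```mplus
TITLE:    ML versus MLR comparison (sketch);
DATA:     FILE = data.csv;         ! shortened, hypothetical path
VARIABLE: NAMES = cate y1-y9;
          USEVARIABLES = y1-y9;
ANALYSIS: ESTIMATOR = MLR;         ! second run: ESTIMATOR = ML;
MODEL:    f1 BY y1-y3;             ! hypothetical structure
          f2 BY y4-y6;
          f3 BY y7-y9;
```

The point estimates are the same in both runs; what is compared is the standard errors and the chi-square test statistic.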

WEN Congcong posted on Tuesday, April 26, 2016 - 6:32 pm

Thank you very much!

Christoph Schaefer posted on Friday, April 20, 2018 - 3:01 am

Dear Professors,

I've specified an SEM with manifest as well as latent variables. As some of the indicators and manifest variables are non-normal, I would like to calculate robust fit indices. If I understand it correctly, the MLM estimator is based on the Satorra-Bentler correction, which corrects for kurtosis, but not for skewness. Is that the case? Could I use a different estimator which is also (more or less) robust to skewness?

Best wishes,
Chris

Bengt O. Muthen posted on Friday, April 20, 2018 - 2:52 pm

We recommend MLR for all kinds of non-normality.

Christoph Schaefer posted on Friday, April 20, 2018 - 3:06 pm

Many thanks for your prompt answer.
If I may ask about MLR: According to the Manual, MLR estimates are robust when used with type=complex.
When I insert type=complex, however, I get the message that this command "requires a cluster variable, a stratification variable or replicate weights."
Is there a way around this?

Christoph Schaefer posted on Friday, April 20, 2018 - 3:58 pm

Maybe the key is the correct interpretation of the following sentence in the Mplus User's Guide: "MLR - maximum likelihood parameter estimates with standard errors and a chi-square test statistic (when applicable) that are robust to non-normality and non-independence of observations when used with TYPE=COMPLEX."
Does this sentence mean "MLR estimates and the chi-square test statistic are ONLY robust to non-normality IF used in combination with type=complex" or "MLR estimates and the chi-square test statistic are always robust to non-normality, AND ALSO robust to NON-INDEPENDENCE of observations if used with type=complex" (or something else)?

Bengt O. Muthen posted on Friday, April 20, 2018 - 4:54 pm

Answer to your 3:06 post:

MLR does not need Type=Complex or Cluster=. See the Estimator table in the User's Guide (look up index entry Estimator).

Tihomir Asparouhov posted on Friday, April 20, 2018 - 5:49 pm

MLR is robust to non-normality. I would recommend reading web note 2

http://statmodel.com/examples/webnote.shtml#web2

and the references that are there.

Here is another good reading

https://www.stat.berkeley.edu/~census/mlesan.pdf

Tihomir Asparouhov posted on Friday, April 20, 2018 - 5:59 pm

Just one more line: if your variables are non-normal but the model is correct, the ML estimator gives correct (consistent) point estimates (simply because the sample mean and sample variance estimates are consistent under non-normality). The only problem that non-normality causes is that the standard errors and chi-square testing are incorrect with ML, and this is where the Huber-White (1980) sandwich standard errors come in (this is MLR in Mplus) and fix that problem. The Huber-White (1980) method is now very well established everywhere and can fix other problems, see

https://en.wikipedia.org/wiki/Heteroscedasticity-consistent_standard_errors

or this

http://www.statmodel2.com/download/webnotes/mplusnote72.pdf

Christoph Schaefer posted on Friday, April 20, 2018 - 6:28 pm

Thanks a lot for your efforts.

Hillary Gorin posted on Tuesday, May 22, 2018 - 1:47 pm

Hi,

When assessing the distribution of non-normal data, I obtained the following AIC and BIC values (using a four time point growth curve model).

Zero Inflated Poisson:

AIC BIC
9247.536|9294.937

Zero Inflated Negative Binomial:

AIC BIC
9244.643|9313.112

As you can see, the AIC and BIC values are lower in different models. What should I assume to be the best fitting distribution?

Thanks!
Hillary

Bengt O. Muthen posted on Tuesday, May 22, 2018 - 5:44 pm

This is a case where statistics doesn't provide clear guidance (although the BIC advantage of the ZIP model is bigger than the AIC advantage of the ZINB model).
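
The size of the two advantages can be read off directly from the values posted above (lower is better for both criteria):

```latex
\begin{align*}
\mathrm{AIC}_{\mathrm{ZIP}} - \mathrm{AIC}_{\mathrm{ZINB}}
  &= 9247.536 - 9244.643 = 2.893
  && \text{(ZINB lower)} \\
\mathrm{BIC}_{\mathrm{ZINB}} - \mathrm{BIC}_{\mathrm{ZIP}}
  &= 9313.112 - 9294.937 = 18.175
  && \text{(ZIP lower)}
\end{align*}
```

So the BIC gap in favor of ZIP is roughly six times the AIC gap in favor of ZINB.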

Hillary Gorin posted on Tuesday, May 22, 2018 - 5:54 pm

Hello Dr. Muthen,

Thank you for your response! Ok, so despite the lack of clarity, I should use the ZIP model?

Hillary

Bengt O. Muthen posted on Tuesday, May 22, 2018 - 6:03 pm

I would. But it is up to you. To make the choice, you can also plot the fit of the model to the data as we show in our Short Course and also in our RMA book.

Hillary Gorin posted on Tuesday, May 22, 2018 - 6:42 pm

Great, thank you for these resources.

Hillary

Mira Patel posted on Sunday, August 19, 2018 - 10:12 pm

I have a 15-item measure that had three factors determined through an EFA.

I have a new dataset and I wish to run a CFA. My data is skewed in that there is a ceiling effect for each item (most participants chose Agree or Strongly Agree in the 5-point Likert Scale).

Is it possible to do a CFA?

Bengt O. Muthen posted on Monday, August 20, 2018 - 11:31 am

Yes, but I would treat the variables as Categorical (this will mean ordinal in your setting).