Missing data modeling
 Kajsa Yang-Hansen posted on Monday, June 07, 2004 - 9:46 am
Are missing data now allowed in, for example, multilevel models with sample weights in Mplus 3? Thanks in advance.

Best, Kajsa.
 Linda K. Muthen posted on Monday, June 07, 2004 - 10:53 am
 CW posted on Tuesday, August 24, 2004 - 7:02 pm

I am fitting a CFA with 2 factors, 20 ordinal indicators, and missing data, with version 3. Though I don't fully understand the missing data methods in v.3 for categorical indicators, I think I prefer the EM ML variety over WLSMV with pairwise deletion.

I've read the descriptions of the ML, MLR and MLF estimators on p. 401 of the manual, but I don't know how to pick which one to use. Do you have any suggestions for how to choose?
 Linda K. Muthen posted on Tuesday, August 24, 2004 - 7:29 pm
I would use MLR.
 CW posted on Tuesday, August 24, 2004 - 9:51 pm
Ok, thanks for the reply. Why that one?

Also, I tried to fit a one-factor model with correlated errors using MLR and it said:


What does this mean?
 Linda K. Muthen posted on Wednesday, August 25, 2004 - 6:52 am
MLR gives robust standard errors. I would have to see your output to answer the question about the error message. Please send it to support@statmodel.com.
 Anonymous posted on Tuesday, November 02, 2004 - 1:46 pm
Is there an Mplus technical report and/or Muthén paper available that details how Mplus handles MAR missing data in an MLSEM?
 Linda K. Muthen posted on Tuesday, November 02, 2004 - 2:52 pm
Technical Appendix 6, which is available at the Mplus website, discusses missing data. See also the following reference:

Little, R.J., & Rubin, D.B. (2002). Statistical analysis with missing data. Second edition. New York: John Wiley & Sons.
 Anonymous posted on Tuesday, November 02, 2004 - 3:03 pm
Thank you.

It sounds like, in short, Mplus uses the conventional EM fix: substituting in the sufficient statistics during the E-step, iterating, re-substituting, etc.

I assume this same procedure is used for missing data at both Level 1 and Level 2 of an MLSEM?
 Linda K. Muthen posted on Tuesday, November 02, 2004 - 3:17 pm
Yes, it is.
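
The E-step/M-step cycle described above can be illustrated on a toy problem. The sketch below is a textbook EM for a bivariate normal in which y1 is complete and y2 is partly missing. It is not Mplus's implementation, and the function name `em_bivariate_normal` and the data are made up for illustration.

```python
def em_bivariate_normal(data, n_iter=500):
    """EM estimates for a bivariate normal where y2 may be missing (None).

    Assumes y1 is always observed. Returns (mu1, mu2, s11, s22, s12)
    with ML (divisor n) variances and covariance.
    """
    n = len(data)
    y1 = [row[0] for row in data]
    y2_obs = [row[1] for row in data if row[1] is not None]

    # Starting values: y1 moments from all cases, y2 moments from observed cases.
    mu1 = sum(y1) / n
    s11 = sum((v - mu1) ** 2 for v in y1) / n
    mu2 = sum(y2_obs) / len(y2_obs)
    s22 = sum((v - mu2) ** 2 for v in y2_obs) / len(y2_obs)
    s12 = 0.0

    for _ in range(n_iter):
        # E-step: expected sufficient statistics given the current parameters.
        slope = s12 / s11
        resid_var = s22 - s12 ** 2 / s11
        sum2 = sum22 = sum12 = 0.0
        for a, b in data:
            if b is None:
                e_b = mu2 + slope * (a - mu1)   # E[y2 | y1]
                e_bb = e_b ** 2 + resid_var     # E[y2^2 | y1]
            else:
                e_b, e_bb = b, b ** 2
            sum2 += e_b
            sum22 += e_bb
            sum12 += a * e_b
        # M-step: update the parameters from the completed statistics.
        mu2 = sum2 / n
        s22 = sum22 / n - mu2 ** 2
        s12 = sum12 / n - mu1 * mu2
    return mu1, mu2, s11, s22, s12


# Hypothetical data: the two cases with the largest y1 have y2 missing.
data = [(1.0, 2.1), (2.0, 2.9), (3.0, 4.2), (4.0, 4.8), (5.0, None), (6.0, None)]
mu1, mu2, s11, s22, s12 = em_bivariate_normal(data)
```

With this monotone pattern the EM estimate of the y2 mean moves from the complete-case mean of 3.5 toward the regression-adjusted value 3.5 + 0.94 * (3.5 - 2.5) = 4.44, because the cases that lost y2 have above-average y1.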
 krisitne amlund hagen posted on Thursday, November 13, 2008 - 6:03 am
In my RCT, our participants are assessed at waves 1, 3, and 4. Missingness was not a problem from w1 to w3. But now, at w4 or follow-up, all the "problematic" children in the control group have been lost, leaving us with a "super control group" to which our experimental group is being compared. Both Wave 1 and wave 3 problem-scales predict drop-out in the control group (the greater the problems the more likely they are to drop out), but are unrelated to drop-out in the exp group.

Please help me with the specific Mplus command that takes this "uneven missingness" into consideration. Do I have to apply weights to anything? If so, how do I calculate the weights?

Thank you!
 Linda K. Muthen posted on Thursday, November 13, 2008 - 8:20 am
In principle, the MAR assumption of TYPE=MISSING handles this because missingness is a function of observed outcomes at prior timepoints.
 Nicholas Bishop posted on Tuesday, August 03, 2010 - 1:34 pm
Hello. I am analyzing decline in word recall scores in an elderly population over a decade with measurement taken every two years. I am attempting to use a Wu-Carroll selection model to control for the relationship between the outcome and dropout. All individuals were alive at baseline (t0) measurement. When I regress dropout (t1-t5) on intercept and slope, the model does not converge. When I regress dropout (t2-t5) on intercept and slope, the model converges. Why would the inclusion of dropout at t1 cause problems? Thank you.
dd(t)= 0 - observed, 1 - dropout at time t, -99 - dropout at previous time

i s | totrec98@0 totrec00@.2 totrec02@.4 totrec04@.6 totrec06@.8 totrec08@1;
[i*8.607 s*-2.802];
dd00-dd08 on i s;
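
The dropout coding described in this post (0 = observed, 1 = dropout at time t, -99 = dropout at a previous time) can be sketched as follows. This is only an illustration of the coding rule, assuming monotone dropout and an observed first wave; the function name `dropout_indicators` is hypothetical and not part of Mplus.

```python
MISSING = -99  # missing-value flag used in the post above


def dropout_indicators(outcomes):
    """Build one dropout indicator per wave after the first.

    outcomes: scores per wave, with None for an unobserved wave.
    Assumption: once a person drops out, all later waves are treated
    as post-dropout, even if a later score happens to be present.
    """
    indicators = []
    dropped = False
    for value in outcomes[1:]:
        if dropped:
            indicators.append(MISSING)   # dropped out at a previous wave
        elif value is None:
            indicators.append(1)         # drops out at this wave
            dropped = True
        else:
            indicators.append(0)         # still observed
    return indicators


# Hypothetical respondent observed in 1998 and 2000, then lost.
example = dropout_indicators([10, 9, None, None, None, None])
```

For this respondent the indicators for 2000 through 2008 are 0, 1, -99, -99, -99.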
 Linda K. Muthen posted on Wednesday, August 04, 2010 - 9:36 am
For the variable dd98, what is the proportion of zeroes and ones?
 Nicholas Bishop posted on Wednesday, August 04, 2010 - 11:44 am
Hi Linda,
For dd98, there were no missing cases. For dd00, 90% were observed (0's) and about 10% of cases were missing (1's). For dd02, 77% were observed (0's), 13% went missing (1's), and 10% were missing from previous wave (-99's).
 Linda K. Muthen posted on Thursday, August 05, 2010 - 10:25 am
If there are no dropouts at the first time point, don't use that dummy variable.
 Nicholas Bishop posted on Thursday, August 05, 2010 - 11:14 am
Hi Linda,
I am not using the dropout indicators for the first wave of measurement (taken in 1998). I am attempting to use the dropout indicators from 2000-2008, which I was successfully able to do using the Diggle-Kenward selection model. When attempting to use the Wu-Carroll selection model, I was not able to get the model to converge using dropout indicators from 2000-2008. However, when I did not include the dropout dummy for 2000 and only included dropout indicators from 2002-2008, the model successfully converged.
 Linda K. Muthen posted on Thursday, August 05, 2010 - 1:26 pm
Please send the outputs and your license number to support@statmodel.com.
 Suzanne Elgendy posted on Sunday, September 12, 2010 - 5:20 pm
Dr. Muthen,

What is the best way to determine the degree to which the covariance matrix estimated using MLR replicates the original matrix? Is there a certain fit index I should be referring to?

Thank you,
Suzanne Elgendy
 Linda K. Muthen posted on Monday, September 13, 2010 - 8:49 am
I assume that by original matrix you mean the sample covariance matrix. If means, variances, and covariances are the statistics used for model estimation, all fit statistics examine this.
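
As a rough illustration of what "examining" the sample covariance matrix means, the sketch below compares a sample matrix with a model-implied matrix using one common textbook definition of a standardized root mean square residual. The matrices are made up, and the formula Mplus uses for its own SRMR may differ (for example by also involving means), so this is an assumption-laden sketch rather than a reproduction of Mplus output.

```python
import math


def srmr(sample_cov, implied_cov):
    """Root mean square of standardized residuals over the unique elements."""
    p = len(sample_cov)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            # Standardize each residual by the sample standard deviations.
            scale = math.sqrt(sample_cov[i][i] * sample_cov[j][j])
            total += ((sample_cov[i][j] - implied_cov[i][j]) / scale) ** 2
    return math.sqrt(total / (p * (p + 1) / 2))


# Hypothetical 2 x 2 example: the model under-reproduces the covariance by 0.1.
S = [[1.0, 0.5], [0.5, 1.0]]
Sigma = [[1.0, 0.4], [0.4, 1.0]]
value = srmr(S, Sigma)
```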
 Nicholas Bishop posted on Friday, October 01, 2010 - 2:06 pm
I am currently using a number of NMAR models (Diggle-Kenward, Wu-Carroll, and pattern mixture) to control for non-random missing data on the Y variable in my study. When implementing NMAR models, what options do I have for handling missing data on my X variables? I am losing a large number of observations because respondents have Y observations without X observations and the missing data on X seems to be handled through listwise deletion. Thank you.

Nicholas Bishop
Arizona State University
 Linda K. Muthen posted on Saturday, October 02, 2010 - 10:44 am
You could use DATA IMPUTATION to impute values for the missing x's or you could include the variances of the x's in the MODEL command. Doing this means they are treated as dependent variables and distributional assumptions are made about them.
 Nicholas Bishop posted on Monday, October 11, 2010 - 7:51 am
Thank you Linda. When I include the X's with missing data in the model command, I receive an error "THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY...". The warning points me to the PSI matrix diagonal for one of the dichotomous x's that I included in the model command (it was not defined as categorical). Do you think the problem is arising due to the use of the binary variable in maximum likelihood estimation? I can also send my information if that will help. Thanks.
 Linda K. Muthen posted on Monday, October 11, 2010 - 11:34 am
The mean and variance of a binary variable are not orthogonal and can generate a message about non-identification when the variable is included in the model and is not identified as categorical. I can't say more without seeing the full output at support@statmodel.com.
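
A minimal sketch of why the mean and variance of a 0/1 variable are tied together: with divisor n, the variance equals p(1 - p), where p is the mean, so the variance carries no information beyond the mean. The data below are made up for illustration.

```python
def mean_and_variance(values):
    """Return the mean and the divisor-n variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


# Hypothetical binary x with three ones out of ten cases.
x = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
p, var = mean_and_variance(x)
implied = p * (1 - p)  # variance implied by the mean alone
```

Here `var` and `implied` agree (both 0.21 up to rounding), which is the dependence that can trigger the message when the variable is treated as continuous.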
 Maren Winkler posted on Sunday, October 17, 2010 - 10:14 am
Dear Drs. Muthén,

I have estimated a model with complete data on the predictor side (7 indicators, a nested-factor model, N = 1187) and a lot of missing data on the criterion side (11 indicators, a g-factor-model, N = 79).

The standardized factor loadings from my 11 indicators on their latent variable on the criterion side (where data is only available for N = 79) are really high in the SEM, mainly between .90 and .99.
When I just estimate a g-factor-model CFA for those 79 people with those 11 indicators, the standardized factor loadings range between .56 and .83.

Why do the estimates change so much? What is the best way to deal with this problem?

Thanks for your help!
 Linda K. Muthen posted on Monday, October 18, 2010 - 9:32 am
It sounds like the sample statistics are very different for the two samples.
 Maren Winkler posted on Tuesday, October 19, 2010 - 1:13 am
The 79 people are a subsample of those 1187. The mean of those 79 on the seven indicators on the predictor side is much higher because a majority of them were selected on the basis of their scores on those indicators. Variances are similar.

Since I only have data on the criterion side for those 79 people I can't estimate sample statistics for the indicators on the criterion side for the other ones.

So the different means on the indicators on the predictor side are the reason why factor loadings on the criterion side change so much in the SEM?

Are my estimates reliable? What would be the best way to handle this situation?

Thanks a lot!
 Bengt O. Muthen posted on Tuesday, October 19, 2010 - 8:34 am
The analysis of the criterion variables only (for n=79) is a submodel of the full model that includes the predictor variables. If your model estimates are very different for the criterion variables in the submodel and full model analyses, this probably means that the assumptions of the full model don't fully hold - in particular the covariances between the predictor variables and the criterion variables may not be captured fully by their sets of factors being related. Look for big modification indices in the full model analysis.

Because you select the n=79 based on the predictor variables, it would seem that MAR may hold so in principle the full model analysis should give you the right answer.
 Maren Winkler posted on Wednesday, October 20, 2010 - 4:22 am
Dear Dr. Muthén,

thanks for your reply.

So far, I've used the option for auxiliary variables (m) where Mplus does not give Modification indices.

Hence, I decided to drop the auxiliary option in order to get Modification indices.
Now, for my measurement model on the criterion side, this leads to "normal" factor loadings for my 11 indicators. However, one of my path coefficients from the two latent variables which are the predictors changes a lot - from .59 to .81 (the other one remains more or less the same) - this seems pretty unrealistic and I'm not sure what to do with this result.
In this model, the highest ModIndices are for residual correlations on the predictor side (where there is no missing data) - not for covariances between predictor and criterion variables.

Would you have any suggestions on how to proceed?

Thank you very much!
 Bengt O. Muthen posted on Wednesday, October 20, 2010 - 10:26 am
Perhaps your factor variances changed as well, so that in a standardized metric the change wasn't that big. In any case, I would trust this solution.
 Maren Winkler posted on Wednesday, October 20, 2010 - 2:02 pm
The factor loadings that changed were the standardized factor loadings already.
So far, I've fixed my factor variances to one in all models in order to identify the model - would you suggest to estimate those variances freely and to fix the factor loading for the first indicator at 1? And then to check whether the variance of the factor on the criterion side changes from the measurement model (n = 79) to the structural model (N = 1187)?
 Bengt O. Muthen posted on Wednesday, October 20, 2010 - 2:21 pm
No need to do that.

Note that you haven't fixed the factor variance at 1 for the dependent (criterion) factors - you fixed the residual variance; but that's ok too. All these choices give the same standardized solution.
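
The point that different scaling choices give the same standardized solution can be checked numerically. The sketch below covers only the simplest case, a single exogenous factor scaled either by fixing its variance at 1 or by fixing the first loading at 1. The numbers are invented, and the residual-variance scaling for dependent factors mentioned above is not shown.

```python
import math


def standardized_loadings(loadings, factor_var, resid_vars):
    """Standardized loading = lambda * sd(factor) / sd(indicator)."""
    return [
        lam * math.sqrt(factor_var) / math.sqrt(lam ** 2 * factor_var + theta)
        for lam, theta in zip(loadings, resid_vars)
    ]


resid_vars = [0.36, 0.64, 0.51]

# Scaling A: factor variance fixed at 1, all loadings free.
loadings_a = [0.8, 0.6, 0.7]
std_a = standardized_loadings(loadings_a, 1.0, resid_vars)

# Scaling B: first loading fixed at 1, factor variance free.
# The same model rescaled: loadings divided by 0.8, variance 0.8 ** 2.
loadings_b = [lam / loadings_a[0] for lam in loadings_a]
std_b = standardized_loadings(loadings_b, loadings_a[0] ** 2, resid_vars)
```

Both scalings imply the same indicator variances, so `std_a` and `std_b` coincide.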
 Andrea Barrocas Gottlieb posted on Tuesday, March 25, 2014 - 3:33 pm
I'm attempting to impute data from a cross sectional cohort sequential study.

I received the following message in the output.


I attempted to decrease the number of variables used to impute the data, and received the same message.

I would be grateful for some guidance on how to proceed.

Thank you!
 Bengt O. Muthen posted on Tuesday, March 25, 2014 - 3:53 pm
As a first step, take a look at the Version 7 UG ex 11.5, page 397, and its use of USEVARIABLES, AUXILIARY, and IMPUTE. As a second step, have a look at the 14 practical tips in Section 4 of the paper on our website:

Asparouhov, T. & Muthén, B. (2010). Multiple imputation with Mplus. Technical Report. Version 2.

If you still have problems, please send data, input, output and license number to Support@statmodel.com.
 Andrea Barrocas Gottlieb posted on Tuesday, March 25, 2014 - 3:57 pm
Dr. Muthen,
Thank you very much!
 Lucy Markson posted on Thursday, May 07, 2015 - 3:53 am
I have an overall sample of 1600 which has missing data throughout. I am using the estimator MLR for growth modelling. I get these warning messages in the output:

WARNING
Data set contains cases with missing on x-variables.
These cases were not included in the analysis.
Number of cases with missing on x-variables: 772
Data set contains cases with missing on all variables except
x-variables. These cases were not included in the analysis.
Number of cases with missing on all variables except x-variables: 34

Would it be possible to get some clarification on what MLR does in relation to missing data and what these warning messages mean?

Many thanks
 Bengt O. Muthen posted on Thursday, May 07, 2015 - 9:55 am
MLR handles missing data under the MAR assumption, that is, using what is commonly called "FIML". That applies to people with missing on some but not all DVs, in the subset of people who don't have any missing on IVs and don't have all DVs missing.

If you want to include those with missing on IVs, you can mention those variable names in the Model, thereby extending the model and making stronger, normality, assumptions for the IVs.
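
The three groups of cases implied by the warnings and the reply can be sketched as a simple classification. The function name `classify_cases`, the variable names, and the order of the checks (missing on x first) are assumptions made for illustration; this is not how Mplus is implemented internally.

```python
def classify_cases(cases, y_names, x_names):
    """Count cases by how the default analysis would treat them.

    cases: list of dicts mapping variable name to value (None = missing).
    """
    counts = {"missing_on_x": 0, "missing_on_all_y": 0, "analyzed": 0}
    for case in cases:
        if any(case[name] is None for name in x_names):
            counts["missing_on_x"] += 1       # excluded: missing on an IV
        elif all(case[name] is None for name in y_names):
            counts["missing_on_all_y"] += 1   # excluded: no DV observed
        else:
            counts["analyzed"] += 1           # kept; partial DVs handled under MAR
    return counts


# Hypothetical data with two outcomes and one covariate.
cases = [
    {"y1": 1.0, "y2": 2.0, "x": 0.5},    # complete
    {"y1": 1.5, "y2": None, "x": 0.1},   # partial DVs, still analyzed
    {"y1": 2.0, "y2": 2.5, "x": None},   # missing on x
    {"y1": None, "y2": None, "x": 0.3},  # missing on all DVs
]
counts = classify_cases(cases, y_names=["y1", "y2"], x_names=["x"])
```

Bringing the x variables into the model, as suggested above, would move the "missing_on_x" cases into the analyzed group at the cost of distributional assumptions about x.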
 Lucy Markson posted on Friday, May 15, 2015 - 7:25 am
Thank you very much for this information. Is it possible to find out what code is needed to tell Mplus to include missing on IV cases?

Many thanks
 Bengt O. Muthen posted on Friday, May 15, 2015 - 7:58 am
If x is an IV, just say


in the Model command.
 Lucy Markson posted on Monday, May 18, 2015 - 7:45 am
OK thank you very much. I have added this and the missing data are being estimated but now I get the following warning:
Parameter 35, APFY9
It names a different variable in each model.
Is it possible to find out what this could mean?
Many thanks
 Bengt O. Muthen posted on Monday, May 18, 2015 - 2:40 pm
That could be due to an x variable that is binary, in which case it is ignorable.
 timboto17@gmail.com posted on Thursday, September 07, 2017 - 7:56 pm
Dear Dr Muthen,
I am reading Muthen & Muthen 2002 paper and I have a basic question.
I generated 200 complete data sets with Mplus simulation device, used an external program to create MAR on the same datasets and ran a SEM using Mplus on the data with and without missing. The following pattern was seen:
1) The same sample size is used in the two analyses. This is expected because an inspection of the data showed that no individual has missing in all variables.
2) The log likelihood, however, differs, with a lower mean and smaller SD for the complete data (e.g., M=-15476.595, SD=48.47 for the complete data and M=-14471.77, SD=71.57 for the incomplete data). I was expecting the log likelihood to be different, because different information is used despite the equal sample size. However, I wasn't expecting consistently higher log likelihood values for the incomplete data (I repeated this experiment with different models and the same pattern is always observed). How can we explain this result? Might this be because missing information introduces a level of uncertainty in the data, which results in variability in the log likelihood? Do you have any other explanation?
Thank you for your help.
 Bengt O. Muthen posted on Saturday, September 09, 2017 - 5:59 pm
You may want to ask this on SEMNET.
 Namita Joshi posted on Thursday, March 22, 2018 - 7:45 am

I am running a joint process survival model in Mplus to account for missing data as well as a longitudinal outcome measure.

When specifying the following code for freely estimating the mean


I am getting this error
The following MODEL statements are ignored:
* Statements in the GENERAL group:
[ i$1 ]

Is there an error in the way I am specifying it? Please let me know if you need more information.
 Bengt O. Muthen posted on Thursday, March 22, 2018 - 12:03 pm
[i$1*] refers to a threshold for a categorical variable. If the variable is continuous you should use [i*]
 Simon Coulombe posted on Tuesday, August 27, 2019 - 6:38 pm
I ran an analysis with ESTIMATOR = WLSMV. I have some missing values in one of the variables mediating a pathway between another IV and a DV (both of which do not have any missing values).

With WLSMV, are the cases with missing values on that mediator removed, or are they retained as part of the analysis?

Thank you
 Bengt O. Muthen posted on Wednesday, August 28, 2019 - 3:11 pm
The WLSMV handling of missing data is described in the FAQ on our website:

Estimator choices with categorical outcomes