Missing Data, Complex Samples, and Ca...
 Tom Munk posted on Monday, December 13, 2004 - 7:39 am
I recently used Mplus and AMOS to replicate a published article using NELS data. I was able to recreate the author's values, but only by ignoring the complex nature of the sample and the categorical nature of the data. When I tried to address those two issues, I was unable to account for them at the same time as the large amount of missing data. Can Mplus (or any other package) appropriately handle all three of these important issues in one study?
 bmuthen posted on Monday, December 13, 2004 - 8:31 am
The current version of Mplus can handle categorical outcomes, missing data, and complex survey data. Complex survey data can be handled by Type=complex to correct SEs or by Type = twolevel to do multilevel modeling.
 Tom Munk posted on Thursday, December 16, 2004 - 10:04 am
According to the heading for this discussion section: "Missing data are not allowed in mixture, multilevel, and complex sample models or for categorical outcome variables."

In my replication of the NELS study, my experience agreed with this. I was unable to use CATEGORICAL, type=MISSING, and WEIGHT together in a single Mplus run. In fact, I was unable to use type=MISSING in combination with either WEIGHT or CATEGORICAL.
 Linda K. Muthen posted on Thursday, December 16, 2004 - 10:09 am
I'm a little confused. I don't see what you are quoting "Missing data are not allowed in mixture, multilevel, and complex sample models or for categorical outcome variables." In Version 3 of Mplus, you should be able to combine CATEGORICAL, MISSING, and WEIGHT in a single Mplus run with COMPLEX or TWOLEVEL. Please send an example showing that you cannot to support@statmodel.com.
 Linda K. Muthen posted on Thursday, December 16, 2004 - 10:16 am
Oh, I see where that is written. I guess this needs to be updated.
 Paul Rathouz posted on Tuesday, February 08, 2005 - 12:44 pm
The missing data section of the Mplus discussion page says that "In Mplus, missing data are allowed in cases where all outcome variables are continuous and normally distributed and a general model is being estimated. Missing data are not allowed in mixture, multilevel, and complex sample models or for categorical outcome variables." However, later discussion indicates that missing data can now be handled in some way for complex and/or categorical outcome data. I am assuming that estimation of the model is still maximum likelihood. What is the exact missing data methodology being used in this case? Can it be used in any model that we can specify in Mplus?
 Linda K. Muthen posted on Tuesday, February 08, 2005 - 2:43 pm
I'm afraid that the introductions on Mplus Discussion have not been updated. Thank you for bringing this to our attention.

Following is a description from the Mplus User's Guide of the current Mplus missing data capabilities:

"Mplus has several options for the estimation of models with missing data. Mplus provides maximum likelihood (ML) estimation under MCAR (missing completely at random) and MAR (missing at random; Little & Rubin, 2002) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types. MAR means that missingness can be a function of observed covariates and observed outcomes. For censored and categorical outcomes using the weighted least squares estimators, missingness is allowed to be a function of the observed covariates but not the observed outcomes. Non-ignorable missing data modeling is possible using maximum likelihood where categorical outcomes represent indicators of missingness and where missingness may be influenced by continuous and categorical latent variables (Muthén et al., 2003).

Multiple data sets generated using multiple imputation (Schafer, 1997) can be analyzed using a special feature of Mplus. Parameter estimates are averaged over the set of analyses, and standard errors are computed using the average of the standard errors over the set of analyses and the between analysis parameter estimate variation.

In all models, missingness is not allowed for the observed covariates because they are not part of the model. The outcomes are modeled conditional on the covariates and the covariates have no distributional assumption. Covariate missingness can be modeled if the covariates are explicitly brought into the model and given a distributional assumption. With missing data, the standard errors for the parameter estimates are computed using the observed rather than the expected information matrix (Kenward & Molenberghs, 1998). Bootstrap standard errors and confidence intervals are also available with missing data."
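The pooling described in the quoted passage matches the standard combining rules for multiple imputation (often called Rubin's rules). As a sketch only, with notation introduced here rather than taken from the User's Guide: let m be the number of imputed data sets, and let \hat{\theta}_j and SE_j be the estimate and standard error of one parameter in data set j. The standard form of the rules averages the squared standard errors:

```latex
\bar{\theta} = \frac{1}{m}\sum_{j=1}^{m}\hat{\theta}_j,
\qquad
W = \frac{1}{m}\sum_{j=1}^{m} SE_j^{2},
\qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{\theta}_j-\bar{\theta}\right)^{2},
\qquad
SE_{\text{pooled}} = \sqrt{\,W + \left(1+\frac{1}{m}\right)B\,}
```

Here W is the within-imputation variance and B is the between-analysis variation in the parameter estimates. When the imputed data sets give very different estimates, B dominates and the pooled standard error grows accordingly.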
 Anonymous posted on Sunday, February 20, 2005 - 1:49 pm
I am reading through Mplus Web Note #7 and have a question about a statement made in the note.

In Version 2 of the note, Asparouhov and Muthen note (pg. 6): "...indeed there is no sample statistics based solution of [missing data with continuous and categorical outcomes under complete MAR assumption] even for single level model[s]...".

In Version 3 of the note, the authors state (pg. 8): "...there is no closed form solution to this estimation problem [using an ML estimator] even for the estimation of the mean of an observed variable...".

Would you provide a clarification of this point? Do you mean that EM-based imputation of the sufficient statistics is not possible (in principle) for multilevel models? A citation or two would also be especially helpful (since I thought, via Little and Rubin, most MAR missing data problems could be addressed via EM-based techniques).

Thank you.
 Anonymous posted on Wednesday, March 02, 2005 - 8:44 am
The EM algorithm can be used for multilevel models, and just as in single-level models, the EM algorithm doesn't lead to a closed form solution but to an iterative procedure. By closed form solution we mean an explicit estimation formula, as opposed to iterative algorithms such as EM.

 Anonymous posted on Monday, May 09, 2005 - 11:02 pm
How can I use Mplus to perform EFA in the case of incomplete data, not missing data? All variables in my case are categorical.
 bmuthen posted on Tuesday, May 10, 2005 - 5:56 am
What do you mean by "incomplete data not missing data"? That sounds like a contradiction.
 Anonymous posted on Tuesday, May 10, 2005 - 6:30 am
Dear bmuthen, You are correct. There is a contradiction in the words. What I meant was "unbalanced data". That is, we may not have responses for some of the variables.
 bmuthen posted on Tuesday, May 10, 2005 - 5:28 pm
If you have missing data on some variables for some individuals, you simply add "missing" to your type= statement in the Analysis command (and declare the symbol for missing values in the Variable command).

If you have missing data on some variables for all individuals, you may simply delete those variables.
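As a minimal sketch of the first case (missing data on some variables for some individuals), an input for an EFA with categorical items might look like the following. The file name, item names, the missing-value code -9, and the number of factors are all hypothetical, and the placement of MISSING inside the TYPE list is an assumption based on the description above rather than a tested setup.

```mplus
TITLE:    EFA with categorical items and missing data (sketch)
DATA:     FILE IS items.dat;          ! hypothetical file name
VARIABLE: NAMES ARE u1-u6;            ! hypothetical item names
          CATEGORICAL ARE u1-u6;      ! all items treated as categorical
          MISSING ARE ALL (-9);       ! assumes -9 is the missing-value code
ANALYSIS: TYPE = EFA 1 2 MISSING;     ! 1- and 2-factor solutions; MISSING added to TYPE
```

In more recent Mplus versions, missing data handling is the default, so the MISSING keyword in the TYPE option should no longer be needed; the MISSING option in the VARIABLE command is still required to declare the missing-value code.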
 Boris Drahner posted on Tuesday, October 25, 2011 - 5:25 am
I have several categorical (binary) covariates with some missing values in a complex twolevel analysis. If I bring them into the model by mentioning their variances, I assume normality. Do you think this is a reasonable approach to handling missingness on binary variables? And if I do so, all the categorical covariates with missing values are treated as dependent variables and no longer contribute to the missingness function, if I am interpreting the user's guide correctly. Is that right?
Thanks for any help.
 Linda K. Muthen posted on Tuesday, October 25, 2011 - 2:44 pm
I think a better approach with categorical variables is to use DATA IMPUTATION where you can declare the variables to be categorical.
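A minimal single-level sketch of this suggestion is shown below. The file name, variable names, missing-value code, and number of imputed data sets are hypothetical, and the sketch does not address the two-level structure mentioned in the question.

```mplus
TITLE:    Imputation with binary covariates declared categorical (sketch)
DATA:     FILE IS school.dat;         ! hypothetical file name
VARIABLE: NAMES ARE y1-y4 x1-x3;      ! hypothetical variable names
          MISSING ARE ALL (-9);       ! assumes -9 is the missing-value code
ANALYSIS: TYPE = BASIC;               ! imputation from an unrestricted model
DATA IMPUTATION:
          IMPUTE = y1-y4 x1-x3 (c);   ! (c) marks x1-x3 as categorical for imputation
          NDATASETS = 10;             ! number of imputed data sets (arbitrary choice)
          SAVE = imp*.dat;            ! the asterisk is replaced by the data set number
```

The saved data sets can then be analyzed in a second run, so the binary covariates never need a normality assumption in the analysis model itself.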
 Lindsay Bell posted on Wednesday, March 21, 2012 - 12:23 pm
Hi -

I am trying to impute data using a saturated model that accounts for the nested structure of the data. Abbreviated syntax is below (gender is the first variable and h_income is the last).

CLUSTER = schoolID;

MODEL: gender-h_income WITH gender-h_income;

Does this model take into account the nested structure when imputing data? If not, how can I specify a saturated model that does this? I have no variables at level 2, but I think that values on certain variables will be similar within clusters.

Thank you,
 Bengt O. Muthen posted on Wednesday, March 21, 2012 - 9:09 pm
You need to do a Twolevel analysis to account for the nested structure of the data when imputing. See the UG for examples. A saturated H1 model may be hard to work with for the imputations, and a simpler H0 model may have to be used. See the UG example 11.7.
 Daniel Lee posted on Wednesday, June 19, 2013 - 8:04 pm

I'm trying to use multiple imputation for a model with only categorical variables. While running the diagnostic analysis the potential scale reduction value never drops below 1.05 despite showing over 10,000 iterations. In the model I'm estimating all covariances. Should I be looking beyond 10,000 iterations?

Thanks in advance for your help!
 Linda K. Muthen posted on Thursday, June 20, 2013 - 7:57 am
I don't think that value is bad. Compare the estimates with 10,000 versus 20,000 iterations to see if the estimates are close.
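For context on what a value of 1.05 means, the potential scale reduction (PSR) compares between-chain and within-chain variation. The following is a sketch in the usual Gelman-Rubin style, with notation introduced here; the exact formula Mplus applies should be checked against its technical documentation. For one parameter, with m chains, n retained iterations per chain, draws \theta_{jt}, and chain means \bar{\theta}_j:

```latex
W = \frac{1}{m}\sum_{j=1}^{m}\frac{1}{n}\sum_{t=1}^{n}\left(\theta_{jt}-\bar{\theta}_j\right)^{2},
\qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\bar{\theta}_j-\bar{\theta}\right)^{2},
\qquad
\mathrm{PSR} = \sqrt{\frac{W+B}{W}}
```

Under this definition, PSR = 1.05 gives (W + B)/W = 1.05^2 = 1.1025, so the between-chain variance is roughly 10% of the within-chain variance.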
 Robert Buch posted on Monday, September 16, 2013 - 2:38 am

I am doing a SEM on longitudinal data where I have missing data. Specifically, whereas participation on each measurement occasion varied from 248 (Questionnaire data Time 1) to 117 (physical test data Time 3), the sample of individuals who participated in all three measurement occasions (both questionnaire and physical test data) consisted of 84 cadets.

When doing a SEM in Mplus, it reports that "Number of observations" is 125.

Does this mean that my N = 125?

What would you recommend that I report in my research paper under the SEM results? N = 248, or N = 125, or N = 84?

Yours Sincerely,
 Linda K. Muthen posted on Monday, September 16, 2013 - 11:17 am
I would think the sample size showing in the output would be 248. It should be the number of observations that do not have missing at all occasions. Cases can be deleted that have missing on observed exogenous variables. Please send the files and your license number to support@statmodel.com if this does not help.
 j guo posted on Thursday, June 05, 2014 - 12:40 am
I did a simple pilot study to compare the results of FIML and multiple imputation (MI). Specifically, the same information was used for FIML and MI, without auxiliary variables, sampling weights, or clustering. Three latent constructs with 13 items (4-point Likert scale) were included, based on the PISA12 sample (14131). For the estimator, I used ML. Five imputed data sets were used in MI.
First, I compared the results of FIML CFA and MI CFA, in which I treated the Likert items as continuous. I found the FIML CFA was almost equivalent to the MI CFA. However, there are substantial differences in ESEM with target rotation (targeting cross-loadings to 0) when comparing FIML ESEM and MI ESEM.
For example,
In FIML, Y1 with Y2 was .637(.008) whereas the corresponding coefficient was .209(.334) in MI.
I also tried to compare FIML with categorical imputation ESEM (treat the items as categorical). There are still huge differences.
Why is this so different in ESEM with target rotation?

In addition, I am confused that I had a moderate Rate of Missing for each coefficient in MI, although covariance coverage indicated no missing data in the MI files.
 Bengt O. Muthen posted on Thursday, June 05, 2014 - 3:53 pm
Please send the relevant outputs carefully describing which do what to Support together with your license number.
 j guo posted on Sunday, June 08, 2014 - 3:46 pm
Hi, Bengt

I sent the outputs with my license number to support@statmodel.com last Friday, but I have not received any reply.

Thank you.
 Shaljan  posted on Thursday, December 10, 2015 - 8:31 am
I am using hierarchically structured data. I need to use both the data imputation and the TYPE = COMPLEX options in Mplus. When I use these options simultaneously, I get a warning that Mplus is not currently capable of using the TYPE = COMPLEX option with the data imputation option.

Would it be a good idea to use the 10 multiply imputed datasets with the GROUPING option and run the analysis using the TYPE = COMPLEX option?

I have pasted the relevant syntax below for your review and feedback.


Grouping = Imputation_ (1=A 2=B 3=C 4=D 5=E 6=F 7=G 8=H 9=I 10=J);


Weight = W_FSTUWT;
Cluster = SCHOOLID;

Type = complex;
Estimator = MLR;
 Tihomir Asparouhov posted on Thursday, December 10, 2015 - 3:39 pm
You could use two-level imputation and accommodate the cluster variable SCHOOLID (that would be your second level cluster variable). Sampling weights can be included in the imputation only as a covariate. Once the data is imputed, you should use user's guide example 13.13.
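A sketch of the analysis step, run after the imputed data sets have been saved, could look like this. The list file name, the outcome names y1-y4, and the one-factor model are hypothetical; W_FSTUWT and SCHOOLID come from the question. Whether weights and clustering belong in this step is an assumption based on the original poster's setup, not something stated in the reply.

```mplus
TITLE:    Analysis of multiply imputed data sets (sketch)
DATA:     FILE IS implist.dat;        ! hypothetical file listing the imputed data set names
          TYPE = IMPUTATION;          ! estimates and SEs are combined across data sets
VARIABLE: NAMES ARE y1-y4 W_FSTUWT SCHOOLID;   ! y1-y4 are hypothetical
          USEVARIABLES ARE y1-y4;
          WEIGHT = W_FSTUWT;
          CLUSTER = SCHOOLID;
ANALYSIS: TYPE = COMPLEX;
          ESTIMATOR = MLR;
MODEL:    f BY y1-y4;                 ! hypothetical one-factor model
```

This replaces the GROUPING approach from the question: with TYPE = IMPUTATION, each imputed data set is analyzed separately and the results are pooled, rather than the ten data sets being treated as ten groups in one model.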
 slck.dgn@hotmail.com posted on Friday, November 11, 2016 - 11:54 am
Drs. Muthen

I am using data from a complex two-level survey design, which provides teacher and school weights, as well as 100 BRR weights (created using Fay's formula).

I am trying to fit H0 model in which a Multilevel CFA model and imputation are conducted together.

variable: names are
Pre1-Pre7 TCHWGT BRR1-BRR100;
missing are Pre1-Pre7 (7, 9);
weight= TCHWGT;
repweights= BRR1-BRR100;

type= complex twolevel;
repse= BRR;

%within% PDw BY Pre1-Pre7;
%between% ; (no Level 2 model)

Data imputation:
impute = Pre1-Pre7;
save= imputation*.csv;

Using this, I got *** ERROR in VARIABLE command
TYPE = TWOLEVEL analysis requires option CLUSTER.

I tried to add cluster ID and stratification, but then I got:

*** ERROR in VARIABLE command
The STRATIFICATION and CLUSTER options are not allowed with replicate weights.
Replicate weights contain information about stratification and/or clustering

I would like to impute data using H0 (ML CFA model) model. So, I think I could not perform imputation using my ML CFA model. So what is the alternative? Maybe imputing without a model first, then fitting ML CFA?

 Tihomir Asparouhov posted on Friday, November 11, 2016 - 1:53 pm
A theoretically solid method for multiple imputation doesn't exist in the presence of sampling weights (not just in Mplus). If you want to impute, you can treat the weight as a covariate or add as covariates the variables used in the construction of the weights (or use log(weight)).

In Mplus type=twolevel is not available with replicate weights and as far as I know not anywhere else either.

From the description you give above I would recommend

type= complex;
repse= BRR;

model: PDw BY Pre1-Pre7;

You don't need to worry about the missing data. Mplus uses likelihood-based estimation that takes care of the missing data.
 slck.dgn@hotmail.com posted on Saturday, November 12, 2016 - 5:13 am
Thank you Dr. Asparouhov!

Indeed, you are right. The ratio of missingness varies from .01% to .07% for all variables. Imputation may not be "a must" in my case. Now I am using TYPE=COMPLEX with BRR. I used categorical option and WLSMV estimator.

The other thing we need to discuss is using Level 1 or level 1 sampling weights in this same situation. Some authors suggest using within sampling weights (created by dividing Level 1 into Level 2 weights), but some find using only level 1 sampling weights useful with BRR. What is your recommendation?