Message/Author 

Shige posted on Saturday, April 24, 2004  3:23 pm



Dear All, I am trying to fit an SEM survival model where some of the covariates in the measurement model have heavily missing data (20%). Multiple imputation seems to be the best choice in this case. Based on my reading of the Mplus 3 user guide, Mplus does not have the facility to carry out multiple imputation, but it can process imputed data (example 12.13). In that case, can anybody share their experience about which multiple imputation software to use with Mplus? I know there is a large body of literature on multiple imputation, and I am a little lost... Thanks! 


Joe Schafer's NORM program is probably one way to get imputed data. I believe that it is freeware. Schafer is at Penn State in the Statistics Department. 

Shige posted on Sunday, April 25, 2004  2:12 am



Thanks Linda, that seems to be a good place to start. I also found that Gary King has a program called "amelia" that does similar things, and it seems to be able to handle nonnormal data pretty well (which NORM is not designed to handle). 

Anonymous posted on Sunday, April 25, 2004  6:36 pm



btw, consider also modeling your missing data variable, if your variable is categorical you can model it with a mixture model, you can also do this with ordered/unordered categorical without mixture 

Shige posted on Monday, April 26, 2004  10:26 am



Dear Anonymous, Can you point me to a example? Thanks! Shige 

Anonymous posted on Monday, April 26, 2004  9:54 pm



The mixture approach is described in the CACE papers (http://statmodel.com/mplus/examples/penn.html#jo and http://statmodel.com/references.html#catlatent) and in ex7.24. The approach without mixture is basically a regular path analysis: add type=missing and integration=montecarlo to example ex3.12. Btw, both of these methods should produce the same results as multiple imputation. 

Shige posted on Monday, May 03, 2004  11:18 pm



Thanks, it's very helpful. 

Anonymous posted on Monday, May 24, 2004  6:07 pm



Hello, In my data, there are 7% missing data for one variable, 1%-3% for 4 variables, and about 9% for 5 covariates. All these variables would be included in my MIMIC model. Based on 3 of those 5 covariates with a 9% missing rate, I would extract 3 subpopulations (the 3 subpopulations may have lower missing rates for other variables). All analyses would be carried out on these 3 subpopulations. Can I then ignore the missing data, or do I need to run imputation to replace them? If I need to impute, which is better: imputing before subpopulation extraction or after? Thank you. 



Are your dependent variables continuous or categorical? 

Anonymous posted on Tuesday, May 25, 2004  9:25 am



My dependent variables are 3 levels of ordered responses (1-3). Thank you. 

Anonymous posted on Tuesday, May 25, 2004  2:07 pm



Linda, My dependent variables are 3 levels of ordered responses (1-3). In that case, do I need to impute missing data? Thank you. 

bmuthen posted on Tuesday, May 25, 2004  8:05 pm



You might want to do multiple imputations to handle the missing data on the covariates and then do modeling of the categorical outcomes taking missing data on the outcomes into account. 


On page 308 of the User's Guide to Version 3.0 it says "Multiple data sets generated using multiple imputation (Schafer, 1997) can be analyzed using a special feature of Mplus." What is this special feature? Alan Acock 


See Example 12.13 and the IMPUTATION option of the DATA command. 


Hi Linda/Bengt, I'm using Mplus V3 to conduct several EFAs and CFAs on 2 datasets. The variables are all ordinal (4-point Likert) and contain from 3 to 10% missing data. When running the EFAs while treating the data as categorical, I included the line TYPE = MISSING. Question 1) What method does Mplus employ to deal with the missing data in this situation? As for the CFAs, I have been using NORM to create multiply imputed data sets, which I then use in Mplus via the IMPUTATION option; this works fine. Question 2) Does this approach seem reasonable, or is there an easier way to deal with the missing data without using NORM? Thanks for your time. Cheers, Chris 

bmuthen posted on Saturday, November 06, 2004  5:33 am



With EFA and categorical variables, least-squares estimation is used and missing data is simply handled by what amounts to pairwise present data. With CFA and categorical variables you may also use maximum likelihood, at least if you don't have too many factors so that the numerical integration is feasible, and then the usual approach of ML under MAR assumptions is used. But using the imputation approach you mention should be fine. 


Thanks Bengt! cheers chris 

Anonymous posted on Friday, December 31, 2004  7:23 am



Hi. Is there a conflict between multiple imputation analysis and categorical variables declared in the VARIABLES section of the code? A model with categorical items runs without error in each of the individual imputed datasets, but none converge when using the TYPE=IMPUTATION command to run them all at once. Thanks! 

Shige Song posted on Friday, December 31, 2004  8:24 am



Also, Stata has a set of user-contributed routines to generate multiply imputed data sets. Try "findit imputation" at the command prompt. 


There should be no conflict between multiple imputation and categorical variables. To look into this I would need two imputed data sets, the two outputs that show that these data sets worked individually, and the output that shows the problem you encountered with multiple imputation. 

Anonymous posted on Tuesday, March 15, 2005  7:20 am



Hello, I am trying to do a multilevel CFA with imputed data. I imputed the data with NORM and created an ASCII file containing the names of the 5 data sets, as described in the Mplus User's Guide. I specified a model with 3 factors on both the within and the between level. Now I have 2 questions: 1. The output says that the number of replications "requested" is 5, whereas the number of replications "completed" is 1 or 3 (depending on the specific model). What does this mean? And why is the number of replications completed not also 5? 2. The program tells me that the output option "standardized" is not available for Monte Carlo. Is that right? How can I get standardized parameter estimates (factor loadings) with imputed data? I am looking forward to your reply. Thank you very much in advance! 


For some reason, the model did not converge for all five of your data sets. You could run each data set separately to see if that gives you more information about why there were convergence problems. Standardized estimates are not available with Monte Carlo which is what our imputation uses. You would need to compute the standardized estimates by hand. 


Good afternoon. I'm working with imputed data, on a project that I started in the v2 days, when I combined estimates in an external program. Rubin's rules, which I assume Mplus is using to get the SEs of parameter estimates, also usually give a degrees of freedom by which to evaluate the Est/SE on the t distribution. Is there a way to get that information from Mplus? Thanks, Pat 

BMuthen posted on Friday, April 15, 2005  1:53 am



We don't provide that currently, but I would think that the t distribution is well enough approximated by a normal distribution in most cases. 
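Editorial note: the degrees of freedom asked about above can be computed outside Mplus from the per-data-set results. The following is a minimal sketch of Rubin's rules for a single parameter, assuming you have collected the estimate and its standard error from a separate run on each imputed data set. The function name `pool_rubin` and the numbers in the usage note are illustrative only and do not come from Mplus.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    Returns (pooled estimate, pooled standard error, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")

    q_bar = mean(estimates)                        # pooled point estimate
    u_bar = mean([se ** 2 for se in std_errors])   # within-imputation variance
    b = variance(estimates)                        # between-imputation variance (m - 1 divisor)
    total = u_bar + (1 + 1 / m) * b                # total variance

    if b == 0:
        df = math.inf                              # no between-imputation variation
    else:
        r = (1 + 1 / m) * b / u_bar                # relative increase in variance
        df = (m - 1) * (1 + 1 / r) ** 2            # Rubin's (1987) degrees of freedom
    return q_bar, math.sqrt(total), df
```

For example, `pool_rubin([0.50, 0.54, 0.46, 0.52, 0.48], [0.10] * 5)` pools five made-up estimates. A large df supports the normal approximation mentioned in the reply above; a small df suggests that more imputations may be worthwhile.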


Thanks, Bengt. Just as an addendum, I've also heard secondhand that Paul Allison recommends using the df as an index of the adequacy of the number of imputations, so this would be quite useful information in a future release. Thanks, Pat 


If you send me an email suggesting this to support@statmodel.com, I will add it to our list of future additions when I return. 


Hi, I am trying to run a basic 5-class LCA model (four nominal indicators, each with 3 categories) with imputed data sets. When the model is run on a single data set the output is fine. But when the model is run on imputed data I get an output warning (copied below) and no model results. Do I need to use different output commands for imputed data? Thanks for your help.
OUTPUT: TECH1 TECH8;
*** WARNING in Output command
  TECH1 option is the default for MONTECARLO.
*** WARNING in Output command
  SAMPSTAT option is not available when outcomes are censored, unordered categorical (nominal), or count variables. Request for SAMPSTAT is ignored.
2 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS 


Analyzing imputed data sets uses external Monte Carlo so that is why you get the warning about TECH1. You should get the other warning even if you are running only one data set. It sounds like you need to send input/output, data, and your license number to support@statmodel.com. 


Hello there, Is it possible to set up a multigroup path analysis with imputed data in Mplus? I have two groups. The data are contained in 5 imputed data sets for each group, so there are 10 datasets altogether. I have tried the "individual data, different data sets" method of specifying the groups (as described in Ch. 13 of the User Guide), listing two files that each contain the names of five imputed data sets. That didn't work, though: Mplus returned the error message "There are fewer NOBSERVATION entries than groups in the analysis." (My sample sizes are above 5000 in each of the imputed data sets.) Should I combine the two groups so that there are only 5 datasets in total (each containing data from both groups; this would be the "individual data, one data set" method), or is there another way? As always, I'm grateful for this brilliant discussion site, Peter 


... actually, I've another question about multiple imputation and path analysis (not necessarily multigroup this time). Can you get R-squares for the dependent variables, like you would for a path analysis without imputation? 


For multiple group analysis with imputation, the data for both groups need to be in one data set with a grouping variable included. We don't give R-square with multiple imputation because the output is based on our Monte Carlo output. You would need to compute this by hand. 

fati posted on Friday, October 14, 2005  11:47 am



I am doing an LCA with missing data, but I am not sure I understand it well. 1. If I have a missing data pattern, can I create a file with type=missing basic, define a pattern variable, and use the resulting file in a second-step analysis? Is that correct? And if I have other missing values, can I define them in the second analysis with type=mixture missing and missing=all (999)? Is that OK? 2. What is the MCAR test, and how can I obtain it with mixture modeling? 3. I have a question about the earlier message posted by Anonymous on Monday, April 26, 2004 9:54 pm. He suggests that I can do the path analysis without mixture, with type=missing and integration=montecarlo, and also fit a model with type=mixture missing, and that I should get the same result. Which results should be the same, and what is the reason for doing this? 4. What is the maximum percentage of missing data that is acceptable for doing multiple imputation? Thank you very much for your response. 

bmuthen posted on Saturday, October 15, 2005  6:03 am



1. If you have missing data you should do two things in one and the same run. First, you should define what your missing data symbol is by using MISSING = all (999), say, in the VARIABLE command. Second, you should use TYPE = MISSING in the ANALYSIS command which gives you the so called MAR approach of ML estimation. 2. MCAR testing is testing that the data are missing completely at random. MCAR is not a necessary condition given that you can use the less restrictive MAR assumption, so you should have very specific reasons for wanting to know about MCAR. I would want a strong majority of the information to come from the data, not the model. 3. I don't know what your model is (the old post doesn't tell me that). Perhaps you refer to the fact that TYPE = MIXTURE MISSING with a single class gives the same results as TYPE = MISSING. 4. That's difficult to say and is also related to how nonrandom the missingness is. You should read Joe Schafer's book (see the Mplus web site) where he describes ways to quantify effects of degrees of missingness in terms of the uncertainty it brings to the estimation. The more missingness you have, the more your results rely on your model instead of your data, which is not good. 

fati posted on Thursday, October 20, 2005  6:59 am



1. I know that I must do this for missing data, but my question is what to do if I have a missing data pattern (see my last message). 2. OK. 3. My model is an LCA with 25 categorical variables. The last message that I saw is: Anonymous posted on Sunday, April 25, 2004 6:36 pm: "btw, consider also modeling your missing data variable, if your variable is categorical you can model it with a mixture model, you can also do this with ordered/unordered categorical without mixture". The message that contains the question is: Shige posted on Saturday, April 24, 2004 3:23 pm: "Dear All, I am trying to do a SEM survival model where some of the covariates in the measurement model have some heavily missing data (20%). Multiple imputation seems to be the best choice in this case. Based on my reading of the Mplus 3 user guide, Mplus does not have the facility to carry out multiple imputation, but it can process imputed data (example 12.13). In that case, can anybody share their experience about which multiple imputation software to use to work with Mplus? I know there is large body of literature of multiple imputation, I am a little lost..." Thank you. 

bmuthen posted on Thursday, October 20, 2005  6:56 pm



That earlier post was referring to missing data in covariates, not in the outcomes. You don't need multiple imputations for outcomes; missingness on outcomes is taken care of by ML under MAR, that is, TYPE = MISSING. 

Anonymous posted on Wednesday, November 02, 2005  5:59 pm



Hi. I would like to know how the summary of the chi-square statistics is calculated with "TYPE IS IMPUTATION". Is the "mean" simply the unweighted mean of the chi-square statistics? Thanks! 


We don't give a chi-square for multiple imputation because we are not clear on the theory for this. 

Anonymous posted on Wednesday, November 02, 2005  7:39 pm



So, I shouldn't interpret the "mean" of the chi-square, right? Below is the summary from running the multiple imputation analysis with 10 imputations.
TESTS OF MODEL FIT
Number of Free Parameters              9
Chi-Square Test of Model Fit
    Degrees of freedom                18
    Mean                          52.313
    Std Dev                        5.759
    Number of successful computations 10
         Proportions           Percentiles
      Expected  Observed    Expected  Observed
       0.990     1.000        7.015    42.415
       0.980     1.000        7.906    42.415
       0.950     1.000        9.390    42.415
       0.900     1.000       10.865    42.415
       0.800     1.000       12.857    42.415
       0.700     1.000       14.440    49.892
       0.500     1.000       17.338    51.864
       0.300     1.000       20.601    54.372
       0.200     1.000       22.760    54.558
       0.100     1.000       25.989    56.637
       0.050     1.000       28.869    56.637
       0.020     1.000       32.346    56.637
       0.010     1.000       34.805    56.637


Are you saying TYPE=IMPUTATION; or TYPE=MONTECARLO; in your DATA command? 


I finally found a multiple imputation example and you are correct that we do print the mean of chi-square. It is simply an unweighted mean. It has not been adjusted in any way because we are not clear on the theory for this. 

S. Oesterle posted on Friday, January 20, 2006  3:20 pm



I am using TYPE=IMPUTATION to analyze 20 data sets that I have created via multiple imputation in NORM. I am estimating a path model with observed variables only. The output in Mplus does not give me standardized estimates. I know that you said in an earlier response that "Standardized estimates are not available with Monte Carlo which is what our imputation uses. You would need to compute the standardized estimates by hand." How do I calculate the standardized estimates, particularly when my dependent variables are binary? I was going to use the following formula: take the ratio of the standard deviation of x to that of y and multiply it by the unstandardized estimate. However, the sample statistics printed in the output do not include variances for categorical variables and, besides, are only printed for the first data set. Is there any way to get the sample statistics averaged across all data sets, just like the fit statistics and the regression estimates? Or is there any other way to calculate the standardized estimate? 

bmuthen posted on Friday, January 20, 2006  5:53 pm



For a binary dependent variable u, Mplus uses a standardization of the slope that divides by the standard deviation of u*, a latent response variable underlying u (the u* variance is also used in the R-square for binary outcomes of McKelvey & Zavoina, 1975, which is referred to in some textbooks). In a regular probit regression of u on x, the variance of u* given x, i.e. the residual variance, is fixed at 1. With logit regression, the residual variance is fixed at pi**2/3. So the standard deviation of u* is the sqrt of the sum of the variance in u* explained by x plus this residual variance. 
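Editorial note: a minimal sketch of the hand calculation described above, assuming a single covariate x, so that the variance in u* explained by x is b**2 * var(x). The logit residual variance is taken as pi**2/3, the variance of the standard logistic distribution. The function names are illustrative and are not Mplus features; with several covariates the explained variance would also need their covariances.

```python
import math

# Residual variance of the latent response u* under each link.
RESIDUAL_VARIANCE = {"probit": 1.0, "logit": math.pi ** 2 / 3}


def ustar_variance(b, var_x, link="probit"):
    """Variance of u*: part explained by x plus the fixed residual variance."""
    return b ** 2 * var_x + RESIDUAL_VARIANCE[link]


def standardized_slope(b, var_x, link="probit"):
    """Slope standardized with respect to both x and u*: b * sd(x) / sd(u*)."""
    return b * math.sqrt(var_x) / math.sqrt(ustar_variance(b, var_x, link))


def r_square_ustar(b, var_x, link="probit"):
    """McKelvey-Zavoina style R-square: explained share of the u* variance."""
    return b ** 2 * var_x / ustar_variance(b, var_x, link)
```

With imputed data, b would be the pooled slope and var_x the variance of x averaged over the imputed data sets; that averaging step is an assumption of this sketch, not something the Mplus output provides.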

S. Oesterle posted on Monday, January 23, 2006  4:04 pm



I am estimating a multiple group analysis (2 groups) for a path model using TYPE=IMPUTATION with 20 imputed datasets. In the model where all parameters are estimated freely across the 2 groups, I do not get any error messages and the estimation terminated normally. However, when I look at the coefficients for the second group, most (but not all) estimates and standard errors are zero. The estimates for the first group look ok. When I estimate the model separately for the 2 groups, I get correct results. What could be going on here? 


If you are using a version earlier than Version 3.12, you should download the most recent version of Mplus. If you still have the problem, send the input, two data sets, output, and your license number to support@statmodel.com. 

S. Oesterle posted on Monday, January 23, 2006  5:06 pm



Installing version 3.13 did not solve the problem. I will send you my files. Thanks! 

Scott Grey posted on Wednesday, February 01, 2006  3:11 pm



Hello! I have been attempting to conduct a multilevel growth curve analysis (“TYPE IS TWOLEVEL”) with missing data using the multiple imputation feature, as there are a number of covariates with missing data in our dataset. Mplus appears to replicate the analysis in the DOS window, but when the DOS window closes there is no output in the GUI window. An output file is generated, but it always ends at “Input data file(s)”. The program has no problem running other analyses on the imputed data, such as “TYPE IS COMPLEX”. Here is the code:
DATA: FILE IS "C:\Documents and Settings\insthealthsa4\ My Documents\DARE\Imputation\AUGUSTINE_m\AUG_imp.txt";
  TYPE IS IMP;
VARIABLE: NAMES ARE crsswlk hs9dist region1 region2 region3 region4 Urban
    stressms rhsfree MSfrelun MSredlun MSwhite MSBlack MSLatino MSAsian MSother
    treatms att7rev att9rev utq41a utq41b utq41c utq41d n2q51a n2q51b n2q51c
    n2q51d upq40c n1q44c_7 n1q44c_8 AGE sex family catuse1 catuse2 catuse3
    catuse4 catuse5 catuse6 upq37a upq37b upq37c upq37d upq37e upq37f upq37g
    upq37h n2q46a n2q46b n2q46c n2q46d n2q46e n1q7g n1q7j n2q7g n2q7j eq10g
    eq10j tq10c tq10f q10c q10f smoket1 drinkt1 pot1 smoket2 drinkt2 pot2
    smoket3 drinkt3 pot3 smoket4 drinkt4 pot4 smoket5 drinkt5 pot5 latino black
    asian am_ind oth_race clu1 clu2 clu3 clu4 clu5 clu6 clu7 clu8 clu9 clu10
    clu11 clu12 clu13 clu14 clu15 clu16 clu17 clu18 clu19;
  USEVARIABLES ARE urban rhsfree n2q51a n2q51b n2q51c n2q51d upq40c n1q44c_7
    n1q44c_8 age sex latino black oth_race anysmoke anydrink anypot lowhi
    hilow hihi mattpol7 mattpol8 mattpol9;
  WITHIN IS upq40c n1q44c_7 n1q44c_8 age sex anysmoke anydrink anypot lowhi
    hilow hihi latino black oth_race mattpol7 mattpol8 mattpol9;
  BETWEEN IS rhsfree urban;
  CLUSTER IS hs9dist;
DEFINE: mattpol7 = (tq10c+tq10f)/2;
  mattpol8 = (eq10g+eq10j)/2;
  mattpol9 = (n2q7g+n2q7j)/2;
  devian7 = (upq37a+upq37b+upq37d+upq37g)/4;
  devian9 = (n2q46e+n2q46a+n2q46b+n2q46c)/4;
  lowhi = 0; hilow = 0; hihi = 0;
  anysmoke = 0; anydrink = 0; anypot = 0;
  IF (devian7 LE 1 AND devian9 GT 1) THEN lowhi = 1;
  IF (devian7 GT 1 AND devian9 LE 1) THEN hilow = 1;
  IF (devian7 GT 1 AND devian9 GT 1) THEN hihi = 1;
  IF (smoket1 GT 0 OR smoket2 GT 0 OR smoket3 GT 0 OR smoket4 GT 0 OR smoket5 GT 0) THEN anysmoke = 1;
  IF (drinkt1 GT 0 OR drinkt2 GT 0 OR drinkt3 GT 0 OR drinkt4 GT 0 OR drinkt5 GT 0) THEN anydrink = 1;
  IF (pot1 GT 0 OR pot2 GT 0 OR pot3 GT 0 OR pot4 GT 0 OR pot5 GT 0) THEN anypot = 1;
ANALYSIS: TYPE IS TWOLEVEL;
  ESTIMATOR = ML;
  ITERATIONS = 1000;
  CONVERGENCE = 0.00005;
MODEL:
  %WITHIN%
  attinstb BY n2q51a n2q51b n2q51c n2q51d;
  i s | mattpol7@0 mattpol8@1 mattpol9@2;
  mattpol7 ON upq40c;
  mattpol8 ON n1q44c_7;
  mattpol9 ON n1q44c_8;
  attinstb ON i s age sex lowhi hilow hihi anysmoke anydrink anypot latino black oth_race;
  %BETWEEN%
  attinstw BY n2q51a n2q51b n2q51c n2q51d;
  n2q51a-n2q51d@0;
  attinstw ON rhsfree urban;
OUTPUT: TECH1 TECH8;
THANKS FOR YOUR HELP!! 


Please send your input, data, and license number to support@statmodel.com. Looking at the input alone cannot tell me what happened. 


Hello, I am estimating a latent class model with type=imputation. There are 5 latent class indicators (Y), 4 of which are categorical, while 1 is nominal. I have also got covariates (X) that relate to the latent variable (C), and some direct effects from covariates to some of the Ys. The estimation runs fine, but the output reports values of .000 for all estimates associated with the nominal Y, that is, both for the means of the nominal Y associated with each latent class and for the direct effect of one of the X variables on the nominal Y. In contrast, estimates associated with the categorical Ys are given and make sense. The same problem does not occur when I run the model on just one data set (that is, without "type=imputation"). Neither does the problem occur when I specify my nominal Y as categorical and use "type=imputation". What could it be that goes wrong when estimating with imputed datasets and a nominal latent class indicator? 


This sounds like a problem that has been fixed in Version 4.1. If you download Version 4.1, you should be fine. If not, send your input, data, output, and license number to support@statmodel.com. 


Yes, using version 4.1 resolved the problem. Thank you, Linda! 


when using multiple imputation and regressing latent variables upon other latent variables, is it sufficient to set the all the latent variables' variances to 1 to get standardized values for these regression coefficients? or is it necessary to hand calculate these, too? i feel like this should be a simple question, but i have a very hazy grasp of how standardization works, exactly. thanks in advance for any help. 


If you fix the metric of the factors by fixing the factor variances to one instead of a factor loading, you would receive estimates equivalent to the Std standardization of Mplus. The two standardizations used in Mplus are described in Chapter 11 of the Mplus User's Guide where the general output is described. 


thanks for that reply. now i'm running into another issue. i'm using multiple imputation and specifying the MLM estimator because of some evidence that the multivariate distribution is not normal. now i would like to impose some model constraints and test these via chi-square difference tests. of course, when using MLM one cannot simply subtract the chi-squares; one needs the scaling correction factors. however. . . the output when using MI does not provide scaling correction factors (that i can find). i tried using the difftest option that works for WLSMV, and it informed me that this only works for WLSMV. so. . . is there something i am missing that would allow me to test model constraints using MI and the MLM estimator? thanks, tom 


. . . i realized after i wrote this that i could, of course, calculate the difference tests for each of the 5 MI data sets separately. is that the only way to go about this? 


With multiple imputation, we give the average of fit statistics like chi-square. I don't think there is any theory on how to actually calculate chi-square for multiple imputation. Because of this, I don't know how you would do difference testing in this situation. 


Hello, I'm a student in TW. I am trying to run MI in NORM, and 11 of my 18 variables are on an ordinal scale. How do I examine these variables and decide how to transform them to fit the assumption of normality? Sorry if my question is in the wrong place. 


Following up on the previous question: for the ordinal variables, should I choose the "logit transformation" with limited range to make the transformed values reasonable, and choose rounding "to the nearest observed value"? 


I'm not familiar with NORM. I would not transform ordinal variables because the numbers assigned to the categories do not represent numerical values. 


After using NORM to make 5 datasets, I use Mplus version 3.0 to read these datasets. However, the output is "***ERROR(Err#: 29)Error opening file: mi.dat". My syntax is the following:
data: file is mi.dat;
  type is imputation;
variable: names are cl edu inc gen bmi year at1-at7 ba1-ba4;
  usevariable are at1-at7;
analysis: type is general basic;
These datasets are named mi1, mi2, mi3, mi4, and mi5. They are saved in the same folder as the syntax. Thanks for your help. 


The error means the file cannot be found. Check if the extension dat was added twice to the data set. Otherwise, send the input, data sets, and your license number to support@statmodel.com. 
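Editorial note: a minimal sketch of a pre-flight check for this kind of "Error opening file" message, assuming the setup described in Example 12.13 where the FILE option names a plain-text list file with one data set name per line. Resolving relative names against the folder that holds the list file is an assumption of this sketch, as is the hypothetical function name `missing_imputed_files`.

```python
from pathlib import Path


def missing_imputed_files(list_file):
    """Return the files that cannot be found for a TYPE=IMPUTATION run.

    If the list file itself is missing, it is the only entry returned.
    Otherwise every non-blank line is treated as a data set name and
    checked relative to the folder that contains the list file.
    """
    list_path = Path(list_file)
    if not list_path.is_file():
        return [str(list_path)]

    names = [line.strip() for line in list_path.read_text().splitlines()]
    names = [name for name in names if name]          # skip blank lines
    return [name for name in names
            if not (list_path.parent / name).is_file()]
```

An empty list means every file was found. A doubled extension such as mi1.dat.dat, as suggested in the reply above, would show up here as a missing mi1.dat.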

Yung Chung posted on Thursday, June 29, 2006  2:30 am



Hi, this is with specific reference to Thomas Rodebaugh's post on Thursday, June 15, 2006 3:03 pm: There is a SAS macro that calculates a "weighted" average of the chi-squares obtained over imputed datasets (each individual statistic overstates the strength of the evidence against the null hypothesis because it ignores missing-data uncertainty). The syntax can be found on Paul Allison's website. Hope this helps 
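Editorial note: a minimal sketch of one published rule for combining chi-square values across imputations, usually labelled D2 in the multiple imputation literature. It is an assumption that this corresponds to what the macro mentioned above computes, and the function name `pool_chi_square_d2` is illustrative. The statistic is referred to an F distribution with k and df2 degrees of freedom, where k is the degrees of freedom of the chi-square test in each data set; the p-value lookup is left out so that the sketch needs only the standard library.

```python
import math
from statistics import mean, variance


def pool_chi_square_d2(chi_squares, k):
    """Combine m chi-square values (each with k df) into the D2 statistic.

    Returns (d2, df2); compare d2 with an F(k, df2) distribution.
    """
    m = len(chi_squares)
    if m < 2:
        raise ValueError("need chi-square values from at least two imputations")

    roots = [math.sqrt(d) for d in chi_squares]
    # Relative increase in variance, estimated from the spread of sqrt(chi-square).
    r = (1 + 1 / m) * variance(roots)
    d2 = (mean(chi_squares) / k - (m + 1) / (m - 1) * r) / (1 + r)
    # Denominator degrees of freedom; infinite when the values do not vary.
    df2 = math.inf if r == 0 else k ** (-3 / m) * (m - 1) * (1 + 1 / r) ** 2
    return d2, df2
```

When the chi-square values differ across data sets, d2 * k is smaller than their simple mean, which reflects the point made above that the unadjusted values overstate the evidence against the null hypothesis.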

Susan Scott posted on Friday, October 13, 2006  11:41 am



Hi, I would like to know where I can find information on how TYPE=IMPUTATION analyses are run. When I do analyses on multiply-imputed datasets in SAS, the model is run on each of the datasets and then the estimates are combined (using PROC MIANALYZE). However, I am finding that when I have an SEM model that has converged with TYPE=IMPUTATION and I try to run the same model on the individual datasets, I often get the message that the model does not converge. I would like to understand why this is happening. Thank you, Susan Scott 


I cannot answer this question without seeing the input, data sets, output, and your license number at support@statmodel.com. Please send the output with TYPE=IMPUTATION; where all data sets converged and also an output where using an individual data set did not converge. 

Susan Scott posted on Friday, October 20, 2006  2:08 pm



I have emailed everything. I'm not sure if it hasn't registered in my email program because I sent it from here, or if it was not sent for some reason. If you do not receive anything, please let me know. Thank you, Susan Scott 


I have not yet received anything. 


Hi Linda and Bengt, I created five imputed datasets to be used for a CFA based on MI. The model converges fine for each dataset individually, but when I combine the datasets in the analysis using the TYPE = IMPUTATION option and a separate input, I get the following message: "Number of replications Requested 5 Completed 1". Then when I change the order of the datasets in the input file I obtain 4 successful replications. I am unclear about what the cause of this discrepancy might be. Do you have any suggestions? I have pasted the syntax for a very much simplified model in which I find the same problem. Thank you, Rick Sawatzky.
SYNTAX
DATA: file is mi.txt;
  TYPE = IMPUTATION;
ANALYSIS: ESTIMATOR = WLSMV;
VARIABLE: NAMES ARE y1-y7;
  USEVARIABLES ARE y1-y7;
  CATEGORICAL ARE y1-y7;
MODEL: f BY y1-y7; 


Just to clarify my previous posting, the CFA of the imputed data files runs fine when I do not specify categorical data (i.e., the problem only occurs when the variables are specified as categorical). 


Please send the input, data sets, output, and your license number to support@statmodel.com. We need this information to understand what is happening. 


Linda, Thanks so much. The above problem is now solved. However, when I use the WLSMV estimator I find that the output file does not provide a mean chi-square statistic. I assume that this might be because the estimated degrees of freedom for WLSMV might not be identical for the different MI datasets. Is this explanation correct, or is there another reason why the mean chi-square is not provided (it is provided when I use WLSM estimation for the same model)? Thanks again, Rick. 


The mean chi-square would not be meaningful for WLSMV for the same reason the chi-square values cannot be used for difference testing: only the p-value is meaningful. 


Hi, I want to use TYPE=IMPUTATION to do (say) a two-step hierarchical regression analysis and test the difference between the two models. (1) Can I use the difference between the mean -2 log-likelihood statistics for the two models to test for the improvement in fit from the second set of variables added in the larger model? (2) There is no test for model fit (all values are 0). I assume I can use the mean -2LL and df to test each model? (3) How can I get the corrected df and p-values for the t-statistics reported by the program? (4) How can I get the relative efficiency and fraction of missing information for the intercept and predictors? Thanks, bac 


1 and 2. No. The difference between two average log-likelihoods is not distributed chi-square. 3. The standard errors are correct so the ratio gives a correct z-score. 4. This is not provided. You would have to see the Schafer text to see how to do this. 
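Editorial note: a minimal sketch of the hand calculation for point 4, assuming you have the estimate and standard error of one parameter from each imputed data set. It uses the standard Rubin formulas for the fraction of missing information and the relative efficiency of m imputations; the function name `missing_information` is illustrative and nothing here is produced by Mplus.

```python
from statistics import mean, variance


def missing_information(estimates, std_errors):
    """Return (r, fmi, relative_efficiency) for one parameter.

    r   : relative increase in variance due to missing data
    fmi : estimated fraction of missing information
    relative_efficiency : efficiency of m imputations relative to infinitely many
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")

    u_bar = mean([se ** 2 for se in std_errors])   # within-imputation variance
    b = variance(estimates)                        # between-imputation variance
    if b == 0:
        return 0.0, 0.0, 1.0                       # imputations agree exactly

    r = (1 + 1 / m) * b / u_bar
    df = (m - 1) * (1 + 1 / r) ** 2                # Rubin's degrees of freedom
    fmi = (r + 2 / (df + 3)) / (r + 1)
    relative_efficiency = 1 / (1 + fmi / m)
    return r, fmi, relative_efficiency
```

A relative efficiency close to 1 indicates that adding more imputed data sets would change the standard errors very little.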


Thanks, Linda. I must have a basic misunderstanding about the deviance chi-squared, or I wonder why it would not be relevant to 1 & 2 above. In other maximum likelihood-based models, -2LL is distributed as chi-squared with df = the number of parameters in the model. So, it seems like the "mean -2LL" would also be distributed chi-squared. Also, the difference between the -2LL for one model and the -2LL for a nested model is called the deviance, and is distributed as chi-squared with df = the difference in the number of parameters estimated by the two models. This allows the test for the difference between hierarchical models with, for example, logistic regression and multilevel regression. I don't understand why the same would not be true in the case of hierarchical linear regression, estimated with maximum likelihood in this case. Could you help me with a reference so I could learn why the "mean deviance" would not also be distributed chi-squared? Thanks, bac 


We have never seen an article discussing whether the log-likelihood averaged over imputed data sets is distributed as chi-square. It may be, and whether it is may also be a function of the number of imputed data sets. If you know of a reference that supports this, please let us know. 


Thanks, Linda. Here are some references that you may find useful. I haven't gotten the Statistics in Medicine reference yet, but Don Rubin referred to it and two others in the "IMPUTE" thread "IMPUTE: Re: "Averaging" chi-square values (fwd)" as providing information about averaging chi-squared values from SEM models on imputed data sets. There are some other notes in the IMPUTE threads re averaging R-squared values, but you already report R-squared for the imputation analysis. It would be nice to have the DF for the t-tests and an option for testing the deviance, too!
Thread: http://www.mailarchive.com/impute@utdallas.edu/msg00158.html
References:
Li, K. H., Raghunathan, T. E., & Rubin, D. B. (1991). Large-sample significance levels from multiply imputed data using moment-based statistics and an F reference distribution. Journal of the American Statistical Association, 86(416), 1065-1073.
Meng, X. L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data sets. Biometrika, 79(1), 103-111.
Rubin, D. B., & Schenker, N. (1991). Multiple imputation in health-care databases: an overview and some applications. Stat Med, 10(4), 585-598. (PMID: 2057657) 


Thanks for the references. We'll take a look at them. 


Looks like some of these references are useful in terms of model testing in future versions of Mplus. 

Anonymous posted on Friday, May 04, 2007  12:16 pm



Good afternoon, I have a quick question: How does Mplus identify the datasets used in the imputation process? Is there a way to specify the names of the data sets or does the program search for some identifier? Thanks.


There is no identifier. The data set names are listed in a file and the file is accessed. See Example 12.13. 


Hello, I have found what might be a bug in your MI procedure, or at least a documentation problem. I have run a linear regression using 10 MI data sets produced in SAS. In one analysis, I used the names in the SAS file, 7 of which have more than 8 characters. Mplus warns that the 7 names contain more than 8 characters and that only the 1st 8 will be used in the output, and gives the offending names. When I run the same inp file after correcting the names to be only 8 characters, I get different results. There are no other differences in the two analyses. The run with the short names corresponds closely to the SAS output. Here are the reduced outputs:
Run with 7 long varnames:
STAIX1 ON
CCMC 1.635 1.866
CCFC 3.035 1.408
CCMXCCF 0.340 1.505
Run with short names:
STAIX1 ON
CCMC 1.236 2.644
CCFC 2.632 2.836
CCMXCCF 0.077 1.439
Thanks, bac


I would need to see the input, data sets, output, and license number at support@statmodel.com. If you are not using Version 4.21, I suggest you run the analyses with Version 4.21 as a first step. 

Anonymous posted on Monday, May 07, 2007  5:17 am



In reference to the response Linda K. Muthen posted on Friday, May 04, 2007  12:32 pm, I am seeking further clarification of how Mplus delineates the datasets. Thank you for your referral to the Mplus User's Guide. In example 12.13, the following is stated: "Each record of the file must contain one data set name. For example, if five data sets are being analyzed, the contents of impute.dat would be:
data1.dat
data2.dat
data3.dat
data4.dat
data5.dat
where data1.dat, data2.dat, data3.dat, data4.dat, and data5.dat are the names of the five data sets created using multiple imputation."
After reviewing this example, my questions are: Are the imputed datasets to be analyzed identified by their possession of the suffix .dat (or any other ASCII file format, .txt, .csv etc.)? Does the program actively search for a variable containing this data? If so, is it correct to say that to analyze imputed data, one must create a character variable that distinguishes each data set, when preparing the data for analysis using Mplus? I'm sorry to be a bother, but I'd like to understand how Mplus makes this distinction. Thanks again for your aid.


The program looks for the file impute.dat. It then reads the data sets in the order found in the list of data set names. There is no required extension for the names of the data sets. 
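For anyone preparing the list file by script, here is a minimal Python sketch of the layout described in Example 12.13: one data set name per record, in the order the data sets should be read. The function name `write_imputation_list` is made up for illustration, and the file names are simply the ones quoted from the example.

```python
from pathlib import Path


def write_imputation_list(list_path, dataset_names):
    """Write one data set name per line, preserving the given order.

    Hypothetical helper: the list file is what the FILE option points to
    when TYPE=IMPUTATION is used, as described in Example 12.13.
    """
    names = [str(name).strip() for name in dataset_names]
    if any(not name for name in names):
        raise ValueError("data set names must be non-empty")
    Path(list_path).write_text("\n".join(names) + "\n")
    return names


if __name__ == "__main__":
    # Names taken from the user's guide example; any extension is allowed.
    write_imputation_list("impute.dat", [f"data{i}.dat" for i in range(1, 6)])
```

The imputed data sets themselves stay in their own files; the list file only holds their names.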

Anonymous posted on Monday, May 07, 2007  12:29 pm



Okay. I think I understand now. The FILE command tells the program to search for other files whose names are listed in the impute file. The analysis is then based on results combined across the data sets. I was assuming the data sets were all combined into one larger file.


For Bruce Cooper: The order of the variable names is different in the two inputs:
arsmamso arsmamao msosc maosc msxma ccmc ccfc ccmxccf
arsmamsos arsmamaos msosc maosc ccmc ccfc msxma ccmxccf


I have several questions regarding the output when type=imputation is used. 1. If a replication results in warnings (such as a warning about a singular matrix), are that replication's results still included in the output? 2. I cannot seem to find what the expected and observed "proportions" and "percentiles" columns mean in the tests of model fit section. Can you refer me to a page in the User's Guide or briefly explain it? I need enough detail to be able to address its meaning if a Committee member should ask. Thank you! Chris Lloyd


1. Only runs that did not converge are not included in the results. 2. These are explained in Chapter 11 under Monte Carlo Output. 


Hello, I am using Mplus to perform SEM on a MI dataset for my dissertation. Any new information about how to obtain a chi-square representing the combined datasets? In another post, you made reference to the p-value of a chi-square being valid. Does this mean that the average p-value is valid for determining statistical significance? Thank you. K.Murray


I don't think the average p-value would be valid.


Thank you. 


Hi Linda and Bengt, I have 10 imputed data sets created in Amelia with which I would like to do a complex EFA (weighted and clustered) with categorical data and a linear regression using the results of the EFA in a CFA framework and then regressing them on to categorical outcomes and covariates. I am using Mplus v.5. Can I use IMPUTATION with COMPLEX? Are there any other issues that I should be concerned about? Thanks for your assistance. Cheers, Alison 


Hello again, I just checked the User's Guide and it appears that I cannot use IMPUTATION with EFA  is that correct? Cheers, Alison 


You can use IMPUTATION with COMPLEX EFA with the PROMAX or VARIMAX rotations but not with the other rotations. 


I'm using multiple imputation and am doing a latent growth curve model. I would like to do a multiple group analysis. In a posting above it was stated that it is possible to use multiple imputation for multiple group analyses. To clarify, is this true when the multiple imputation is performed on the whole sample as opposed to separate imputation analyses for the groups of interests (e.g., boys and girls)? This is probably simple, but I'm having trouble wrapping my head around how it would work to compare 2 groups (that you are testing to be nonequivalent) if the complete dataset is constructed based on the assumption that the missing data patterns are generated by one sample. Thanks for your help! 


I would think you would base the imputation on the full sample unless missing data patterns vary for males and females. I would suggest seeing what the imputation literature says however as I have no evidence to support this opinion. 


Hi, Can Mplus run an EFA when TYPE=IMPUTATION is used? Or does TYPE=IMPUTATION only work with analyses like example 12.13? (Our variables are ordinal.) Sorry for disturbing you again. Our institution wants to update our Mplus, and before that we have to make sure that EFA works with TYPE=IMPUTATION. Thanks for your help.


TYPE=EFA and the IMPUTATION option cannot be used in combination. 


Dear Drs Muthen, I have 20 multiply imputed correlation matrices, but not the imputed cases from which they were computed. Can I use TYPE = IMPUTATION to estimate a model in Mplus from these 20 correlation matrices? Or does this option only work with separate cases? Thank you very much in advance for your help, Daniel


TYPE=IMPUTATION requires raw data. It cannot be used with summary data. 


Hi, My question is related to the post " Anonymous posted on Tuesday, March 15, 2005  7:20 am" and reply. I am running an analysis using imputed data sets, and the output indicates that the number of replications "completed" is only 2, when I had 5 data sets originally. I ran each set separately and there were no errors (i.e., "model estimation terminated normally"). So, why is it that I don't get all 5 sets included in the analysis? 


Which version of Mplus are you using? 


Not the newest update  I think it's version 4 (bought it almost a year ago). 


You should download Version 5.1. 


That worked! Thanks. 


Good morning, I am conducting a path analysis with 10 imputed datasets. Is there a way to run Model Constraint with Type = Imputation? I have tried and am not getting an error (but am also only getting truncated output). Thanks, aprile


MODEL CONSTRAINT is available with the IMPUTATION option. I was wrong about that. Please send your input, data, output, and license number to support@statmodel.com. 

Donna Ansara posted on Wednesday, November 05, 2008  4:00 pm



Hello, I am running latent class regression analysis using Type=imputation and am able to run this perfectly fine. I am interested in presenting confidence intervals for the regression coefficients for the covariates and Mplus does not seem to provide this output when I specify the cinterval option. Would it be appropriate to calculate them using the standard errors that are indicated for the regression coefficients in the usual manner (i.e., estimate +/- 1.96*SE)? Thank you for your assistance.


This would be correct. 
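In case a worked version helps, the interval in the question (estimate +/- 1.96*SE) can be computed as in the small Python sketch below. The function name and the example numbers are illustrative only; they are not taken from any Mplus output.

```python
def wald_interval(estimate, se, z=1.96):
    """Return (lower, upper) for estimate +/- z * SE.

    z = 1.96 gives an approximate 95% interval under a normal
    reference distribution, as in the question above.
    """
    if se < 0:
        raise ValueError("standard error must be non-negative")
    half_width = z * se
    return estimate - half_width, estimate + half_width


if __name__ == "__main__":
    # Illustrative values: a coefficient of 0.40 with SE 0.10.
    lower, upper = wald_interval(0.40, 0.10)
    print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```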

bob calie posted on Tuesday, May 12, 2009  8:21 am



Hi All, I'm trying to impute missing data for a binary variable (say, gender: girl/boy). Since the data were collected from multiple schools and there are apparently distinct proportions of gender across schools, it seems that a 'stratified' imputation is more appropriate. Any ideas? Thanks very much in advance. 


I don't know much about imputing data. I think you should pose this question to the developer of the software you would be using to impute the data. 

bob calie posted on Wednesday, May 13, 2009  1:24 pm



But Mplus can deal with missing data. What I was asking is if it's possible to impute missing with different probability across different clusters. This is not a research question and I was asking the developer of Mplus. Maybe I posted it in a wrong place or I shouldn't have used Mplus. Thanks anyway. 


If I were you I would ask Joseph Schafer or colleagues. He developed the freeware "Norm" which is a multiple imputation program. Mplus does not impute missing data. It handles missing data via a maximum likelihood approach (FIML). 

bob calie posted on Thursday, May 14, 2009  8:58 am



Thanks, Mike! 


Hi, I am doing some simulations involving multiple imputation. I have imputed data for 100 replications using SAS and created 10 output datafiles for each replication. I would now like to use MPlus to conduct a LGCM for each of the 10 imputations for each of the 100 replications. I see how easy it is to read one set of 10 imputations in using the TYPE=IMPUTATION command, but I'm unsure of how to do this for my set of 100 replications, each with 10 imputations. Does this make sense? Thanks for any suggestions. Holmes 


There is no way in Mplus to combine the ten imputation outputs. You would need to write a program to extract what you need from the output and combine it. 
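If you do end up writing your own combining program, the sketch below shows one common way to pool a single parameter across imputations, namely Rubin's combining rules. This is an assumption about what "combine" would mean here, not something stated in this thread, and the extraction of estimates and standard errors from each output file is left out. The function name `pool_rubin` is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one parameter over m imputations using Rubin's rules.

    Returns (pooled_estimate, pooled_se). The total variance is
    W + (1 + 1/m) * B, where W is the mean squared SE (within) and
    B is the sample variance of the estimates (between).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists with at least two imputations")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])
    between = variance(estimates)  # uses the m - 1 denominator
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


if __name__ == "__main__":
    # Illustrative numbers for one replication with three imputations.
    est, se = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
    print(est, se)
```

For a simulation with many replications, the same function would be called once per replication on that replication's set of imputation results.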


Thanks, Linda, for the info. Holmes 


Dear Linda, I've analyzed 13 imputed datasets, of which 6 are completed, according to the output. However, when I run them individually, only 3 actually converge. A few earlier posts asked the same question, but there was no indication of what may have been the problem(s). Could you please enlighten me as to what is happening? Cheers.


Just to complete my post: I'm using the 5.2 version. Cheers. 


If you add TECH9 to the output command, it should show the problem. If this does not help, please send the problem and your license number to support@statmodel.com. 

Anonymouse posted on Tuesday, August 25, 2009  11:20 am



Hello, I am testing a path model which is composed of a series of quantitative variables predicting two binary dependent variables. I am using multiple imputations to handle missing values on the x variables. When I try to use the MODEL INDIRECT command, I get a message indicating that Mplus cannot perform MODEL INDIRECT for multiple imputations. Is there any way to work around this? It seems that I have to use multiple imputations for the missing x values because otherwise listwise deletion is used... 


You can use MODEL CONSTRAINT to define the indirect effects. 


Hi, I have data from 1187 subjects on 135 variables. There is missing data on one variable (approx. 11%), which is the only variable that is not categorical. I've done Multiple Imputation with NORM, getting 20 datafiles for further analyses. I have used the WLSMV estimator for my SEM. Mplus suggests using NOCHISQUARE and NOSERROR to reduce computation time. I've done that but get the following messages: "NOCHISQUARE option is not available with multiple imputation. Request for NOCHISQUARE is ignored. NOSERROR option is not available with multiple imputation. Request for NOSERROR is ignored." Why then does Mplus suggest this option if I can't use it? Moreover, I have a question concerning the output file when using multiple imputation: For the tests of model fit, there are not only mean and SD for CFI, TLI etc., but also expected and observed proportions and percentiles. What do these results tell me? Thanks for your help!


The option is recommended in general but can't be used with imputation. See page 330 of the user's guide for a description of the expected and observed proportions and percentiles. 


Thanks for your advice. However, my output is a bit different from the example you give. I'm running SEM with multiple imputation (5 computations). For the chi-square test I only get the following output:
"TESTS OF MODEL FIT
Number of Free Parameters 297
Chi-Square Test of Model Fit
Number of successful computations 5
Proportions Percentiles
Expected Observed Expected Observed ..."
So I actually don't have Mean, Std Dev for chi-square. Is there a command needed to ask Mplus for chi-square? Moreover, for the other fit indices (CFI, TLI, RMSEA,...), the Std Dev is zero and hence, percentiles expected and observed are always the same. How do I read the proportions expected and observed then? Thank you very much for your help!


I would need to see your full output to understand what you are seeing. Please send it and your license number to support@statmodel.com. 


Hi, I read your example 12.13 but am still unsure how to set up the input data file. I tried to stack the imputed files in one data set and apparently it doesn't work. May I know what the dataset should look like for type=imputation? Thanks


Each imputed data set should be in a separate file. The file specified using the FILE option should contain the names of the datasets. Please reread the example and also see page 424 of the user's guide. 


Is there a way to perform model test with multiply imputed data?


MODEL TEST can be used with multiple imputation. Please send your problem and your license number to support if you are having a problem.


I am running logit models with 40 imputations using the ML estimator and would like to see if model fit improves when I add a block of variables. In other words, I would like to assess the significance of the change in the loglikelihood relative to the change in number of estimated parameters. My question is whether the values for the loglikelihoods given when using multiple imputations can be used in a straightforward way (i.e., computing twice the difference in the loglikelihoods for the nested models) or do I need to apply some sort of correction as is described in your technical report: "Chi-square statistics with multiple imputation". I am not clear if the output I am getting (using 5.21) already contains the correction to the mean loglikelihood value described in that report.


The current Mplus version provides loglikelihood testing with imputation only for the SEM model with continuous variables (that would be the test of fit). As far as I know there is no simple way to construct likelihood correction factors that can be used easily to do general LRT tests for arbitrary nested models, i.e., even for the simple SEM with continuous variables you can only get the test of fit at this time. I would say LRT with imputation is still a tricky topic. The Wald test, on the other hand, is not: use MODEL TEST to conduct a test for multiple parameters. In addition, the SE (and the T value/test, which is the same as the univariate Wald test) that are already in the output can be used to see if fit improves, i.e., to see if the predictors are significant.
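As an illustration of the univariate case only, the ratio of an estimate to its SE can be turned into a two-sided p-value as in the Python sketch below. It assumes a standard normal reference distribution, which is my simplification rather than a statement about what Mplus reports, and the function name is hypothetical. The joint test for a block of parameters is what MODEL TEST is for and is not reproduced here.

```python
import math


def wald_z_test(estimate, se):
    """Return (z, two_sided_p) for H0: parameter = 0.

    Assumes z = estimate / SE is approximately standard normal.
    The two-sided p-value is 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    if se <= 0:
        raise ValueError("standard error must be positive")
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


if __name__ == "__main__":
    # Illustrative values only.
    z, p = wald_z_test(0.50, 0.20)
    print(f"z = {z:.2f}, p = {p:.4f}")
```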

Paola posted on Friday, February 12, 2010  6:34 am



I have 1000 replications, and each replication contains 5 imputed datasets. Is it possible to run a random intercept model on all 1000 replications with both type=Montecarlo and type=Imputation? If so, how?


I think you want to combine TYPE=IMPUTATION and TYPE=MONTECARLO. This cannot be done with Mplus. 

Dylan K posted on Monday, February 22, 2010  7:32 am



Dear MPlus team, I'm a complete novice with MPlus. I'm using it to hopefully produce a MIMIC LCA model. I performed Multiple Imputation using STATA as I had missing covariates. I'm okay with reading the imputed file into MPlus but where I'm getting stuck is in specifying an indicator for the imputed datasets within the single file. When I run the input file without this I get reams of output along the lines of: *** ERROR in Data command An error occurred while opening the file specified for the FILE option. File: C:\Documents and Settings\user\Desktop\Dylan\SRA\OrigData\mi\***111415. Hope you can help prevent me pulling my hair out any further! Thanks and best wishes, Dylan 


The data sets for multiple imputation must be in separate files. 
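If the imputation software has written all imputations into one stacked file, something like the following Python sketch can split it into separate files and write the list file at the same time. It assumes a CSV with a header row and an indicator column called `imp`, with 0 marking the original unimputed rows; those names and that coding are assumptions for illustration and will differ by software. The output files are written without a header row and without the indicator column.

```python
import csv
from pathlib import Path


def split_stacked(stacked_csv, out_dir, indicator="imp", skip=("0",)):
    """Split a stacked imputation file into one file per imputation.

    Hypothetical helper. Rows whose indicator value is in `skip`
    (assumed here to be the original data) are dropped. Returns the
    list of data set names, which is also written to impute.dat.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    groups = {}  # imputation number -> list of rows, in file order
    with open(stacked_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            key = row.pop(indicator)
            if key in skip:
                continue
            groups.setdefault(key, []).append(list(row.values()))

    names = []
    for key, rows in groups.items():
        name = f"data{key}.dat"
        with open(out_dir / name, "w", newline="") as fh:
            writer = csv.writer(fh, delimiter=" ", lineterminator="\n")
            writer.writerows(rows)
        names.append(name)

    # One data set name per line, as the list file requires.
    (out_dir / "impute.dat").write_text("\n".join(names) + "\n")
    return names
```

Keeping the list file in the same folder as the data sets means the names in it can be written without paths, provided the input file points to that folder.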

Dylan K posted on Monday, February 22, 2010  8:08 am



Hi, Thanks for getting back to me so quickly. I'm still getting confused. I've split the imputed datasets into different files, each beginning with data (e.g. data1.dat). I've called the file that these are stored in data as well. My input file reads:
DATA: File is "C:\Documents and Settings\user\Desktop\Dylan\SRA\data.dat"; TYPE=IMPUTATION;
I'm getting the following messages:
*** ERROR in Data command The file specified for the FILE option cannot be found. Check that this file exists: C:\Documents and Settings\user\Desktop\Dylan\SRA\data.dat
I get the same whether I name the parent file just data or put the .dat extension on. I've also tried to put the data in csv but no luck. Does it matter if my syntax file is stored in another file? Can you see where I'm going wrong? (Apologies if this is a very basic question!) Thanks in advance, Dylan


See Example 12.13 in the user's guide. If this does not help, please send your output file and license number to support@statmodel.com. 

Kelly P posted on Tuesday, March 02, 2010  7:35 am



Hi, I am currently considering using multiple imputation due to missing data problems I am encountering with my dataset. However, the data has a large number of sibling pairs, for which I use the cluster command, and the model I am running has indirect effects. Would these options be available if I was using type=imputation? From the above posts it looks like there would be ways to compute the indirect effects, but I am also concerned about the cluster option. Thanks! 


The CLUSTER option is available for TYPE=IMPUTATION; 

Kelly P posted on Wednesday, March 03, 2010  4:35 am



Thanks Linda! Are there examples available anywhere showing how to use the CONSTRAINT command to compute indirect effects? 


No. An indirect effect is the product of the regression coefficients. 
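To make the arithmetic concrete, here is a small Python sketch of the product of two regression coefficients. The standard error shown is the first-order delta-method (Sobel) approximation, which treats the two estimates as uncorrelated; that formula is my addition for illustration and is not necessarily what Mplus computes for a new parameter defined in MODEL CONSTRAINT. The names `a` and `b` are hypothetical path labels.

```python
import math


def indirect_effect(a, b, se_a, se_b):
    """Return (a*b, approximate SE) for a two-path indirect effect.

    a: coefficient for x -> m, b: coefficient for m -> y.
    SE uses the first-order delta method, ignoring any covariance
    between the two estimates: sqrt(a^2 * se_b^2 + b^2 * se_a^2).
    """
    effect = a * b
    se = math.sqrt(a ** 2 * se_b ** 2 + b ** 2 * se_a ** 2)
    return effect, se


if __name__ == "__main__":
    # Illustrative values only.
    effect, se = indirect_effect(2.0, 3.0, 0.1, 0.2)
    print(effect, se)
```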


Hello, reading the posts above led me to the assumption that indirect effects and/or interaction modelling with latent variables shouldn't be done with multiply imputed data, because it violates the basic assumption of multiple imputation (linear relationships between all variables within the imputed datasets). Do I understand this right? Thanks for your help! miriam


Indirect effects are linear so your concern would not apply to them. For interactions, you may want to include the interactions in the set of variables used for imputation. 


Hi, is there a way to include a certain variable in the variable names list and still not to use it for imputation? Now in the User's guide it is stated: "Because the variable z is included in the NAMES list, it is also used to impute missing data for y1, y2, y3, y4, x1 and x2"(p.348). It is obvious that not all variables are useful for imputation, for example IDs. 


In UG ex 11.5 you don't have a USEV statement, which means that the USEV variables are the same as the NAMES variables. If you add a USEV statement that excludes, say, ID from the NAMES list of variables, the variables on the USEV list are the ones that will be used for imputation.


Hi all, I have a general modeling question. I've used Stata to prep my data for use in mplus. I've also needed to handle some missing data on my dependent variables. So, using ICE in stata, I created a bunch of datasets in which I've imputed values for a select number of variables, leaving everything else alone. What this means is that when Mplus sees the data, I have a "complete" set of data for my dependent variables, but may still have missing values on the independent, covariates, and controls. I thought Mplus handled missing data that was not on the dependent variable, but I'm finding that analysis on my imputed datasets using the "Type = IMPUTATION" command still loses many cases, largely due to the covariates and control variables. Am I doing something wrong in Mplus? Or, do I want to create datasets in which I've imputed everything, and just send Mplus complete data? thank you! craig 


Missing data theory is for dependent variables only. If you don't want observations with missing on the covariates to be excluded, you need to impute for the full data set. You can do this in Version 6 of Mplus using the DATA IMPUTATION command. See Example 11.5 in the Version 6 user's guide. 


Hi Prof. Muthen, We only have v5, so I'll impute everything I want to in Stata, and then use the files in Mplus. Thank you very much for the prompt reply! craig 


I am looking to use multiply imputed data sets to run a multiple regression model with a continuous outcome variable. I have missingness on both my predictors and outcome variable, so I am wondering if it is necessary to omit the outcome variable from the imputation model when creating the MI data sets ? Is there a reference you can recommend that deals with this? Thanks 

Jon Heron posted on Thursday, July 29, 2010  7:19 am



Hi John, I know a reference that says the opposite, if that helps: John W. Graham, "Missing Data Analysis: Making It Work in the Real World", Annual Review of Psychology, Vol. 60: 549-576. In the section on dispelling the myths: "The fear is that including the DV in the imputation model might lead to bias in estimating the important relationships (e.g., the regression coefficient of a program variable predicting the DV). However, the opposite actually happens. When the DV is included in the model, all relevant parameter estimates are unbiased, but excluding the DV from the imputation model for the IVs and covariates can be shown to produce biased estimates. The problem with leaving the DV out of the imputation model is this: When any variable is omitted from the model, imputation is carried out under the assumption that the correlation is r = 0 between the omitted variable and variables included in the imputation model. Thus, when the DV is omitted, the correlations between it and the IVs (and covariates) included in the model are all suppressed (i.e., biased) toward 0."


Thank you Jon for your suggestions and reference. Much appreciated. I was thinking about this question in the context of a planned missingness design (3-form; Graham, Hofer, and MacKinnon, 1996), where the DV construct and the predictors (if measured by multiple items) are systematically reduced in different versions of the form and the missingness subsequently imputed using all available information from all 3 forms. I hope this makes sense? John

Jon Heron posted on Thursday, July 29, 2010  8:21 am



Hmm, never come across that before. Is the missing data by design treated as MCAR then? depending on how they divide their sample I suppose. 


"Hmm, never come across that before. Is the missing data by design treated as MCAR then? depending on how they divide their sample I suppose?" Yes Jon 

David Bard posted on Monday, August 09, 2010  10:57 pm



Can you clarify the variable output from a two-level MI procedure in version 6.0? It looks like variables with the same original names represent the observed and imputed values, variables appended by asterisks are thresholds or within-level latent response values, and variables prefaced by 'B_' represent random posterior draws from the between level (for variables modeled at both levels), but I couldn't find this documented (if it is documented, could you direct me to that segment of the manual in case other questions arise). Is it possible to output latent response variable values for between-level-only variables? I'm not seeing a B_ variable for any of my between-only variables. Also, I tried to save my subject ID variable as an auxiliary variable. A column for it appears in each imputation file, but each value is stored as 10 asterisks. Is there a limit on the size of these auxiliary variables? My IDs are 7 digits. Thanks.


The IDVARIABLE option of the VARIABLE command should be used to identify the id variable not the AUXILIARY option. Please send the full output as an attachment and your license number to support@statmodel.com so I can see what is being saved. 


Is there a limit to the number of variables that can be imputed at one time? 


There is no limit, but with a large number of variables the number of parameters in the imputation model may be large. Use only the analysis variables and missing data correlates to impute data. Don't use all variables in a data set for example. 


Thanks Linda. Another question: is there a way to put in variable-specific minimum and maximum constraints for multiple imputation? For example, multiple imputation often results in extreme values on some variables, so constraints are necessary to tell the program that imputed values should only fall between 1 and 4 (as an example). Is there any place to do this in Mplus right now?


There is currently not a way to do this for continuous outcomes. 


I was just reminded that you do have the VALUES option. If the number of values that are present in the data is relatively small (1, 2, 3, 4), you just list those. Otherwise you can use 1.0 1.1 1.2 .... 3.9 4.0 to get a rounding of the imputed value to the first decimal.
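Typing out a long list such as 1.0 1.1 ... 4.0 by hand is error-prone, so here is a small Python sketch that generates it for pasting into the input file. The `snap` function is only there to illustrate what "rounding to the nearest listed value" means; how Mplus applies the VALUES list internally is not described in this thread, so treat that part as an assumption. Both function names are hypothetical.

```python
def value_grid(low, high, step):
    """Return evenly spaced values from low to high inclusive."""
    if step <= 0 or high < low:
        raise ValueError("need step > 0 and high >= low")
    n_steps = round((high - low) / step)
    # Round to clean up floating-point noise such as 2.3000000000000003.
    return [round(low + i * step, 10) for i in range(n_steps + 1)]


def snap(x, grid):
    """Return the grid value closest to x (illustration only)."""
    return min(grid, key=lambda v: abs(v - x))


if __name__ == "__main__":
    grid = value_grid(1.0, 4.0, 0.1)
    # Space-separated list ready to paste after the option name.
    print(" ".join(f"{v:.1f}" for v in grid))
```

With this grid, an imputed value outside the 1 to 4 range would be pulled to the nearest end of the list, which is the kind of bound asked about above.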


Great  thank you! 


Hi! I am trying to use the new multiple imputation software in MPLUS but all I am getting are fatal error messages. It seems to be reading the data in correctly, so I was wondering if there is a way to get a more detailed error message so that I can troubleshoot. I am using output: TECH8 as per example 11.5. *** FATAL ERROR PROBLEMS OCCURRED DURING THE DATA IMPUTATION. Thanks 


Please send your input, output, data and license number to support@statmodel.com. 


Thanks. Just did. Clara 

Tom Booth posted on Monday, October 11, 2010  5:16 am



Hello, I am trying to run multiple imputations on a set of mixed categorical and continuous variables (n=972). I am using the default H1 imputation (sequential regression). From reading Asparouhov & Muthen (2010, 15th July) this seemed most appropriate. I am getting a warning that reads: *** FATAL ERROR THERE IS NOT ENOUGH SPACE TO RUN MPLUS ON THE CURRENT INPUT FILE.... I have no other programs running, and have installed the 32-bit version, for which the machine I am running it on has plenty of capacity. I am unsure what about the analysis I am running is causing this issue. Regards, Tom


Please send your output file and license number to support@statmodel.com. 


Hello, I am receiving the same error message as Clara described above: *** FATAL ERROR PROBLEMS OCCURRED DURING THE DATA IMPUTATION. I am receiving the imputed data sets correctly but the imputed list file contains zero data. What is needed to correctly produce the list for the imputed data sets? Thanks for your help. Nicholas 


The list file should be generated automatically. Please send your output file and license number to support@statmodel.com. 

Dana Wood posted on Thursday, October 21, 2010  12:51 pm



Hello, I am trying to use the MODEL TEST command with multiply imputed datasets. The model runs fine with the multiply imputed datasets, but when I add in the request for MODEL TEST, I don't get any output. The black MSDOS screen appears for a brief flash and then nothing happens. 


If you are not using Version 6.1, try that. If you are, please send the input, data, and your license number to support@statmodel.com. 


Hi, I have a couple of questions related to examples 11.5 and 11.6 of the manual. (1) Example 11.5: You mention on page 348 that all variables that are part of the NAMES list are used to impute data (on the variables listed under IMPUTE). Is there any way NOT to use some variables that are part of the list in the imputation? (2) Example 11.6: Will this be the same in plausible values imputations? Will all the variables listed in the NAMES list be used in generating the plausible values or only those included in the MODEL section? If yes, is there a way again to not use some variables? (3) Is it possible to generate multiple imputation data sets (5, 10, 20, etc.) including imputed values for missing data on observed variables and plausible values in the same data sets? (4) How can we include additional variables in the saved multiple imputation and/or plausible values data set (let's say the z variables from example 11.5)? I do not necessarily want to impute these data or to use them in the imputation algorithm, just to have them saved in the created data sets so as to be able to use them in subsequent analysis. Will the simple AUXILIARY function (without e, m, r) work? Thank you!


I am glad you asked so that this can be clarified. (1) and (4): UG ex 11.5 is not as clear as it could be on this point. A user would typically work with not only the NAMES and IMPUTE lists of variables, but also a USEVARIABLES list and an AUXILIARY list. The NAMES list simply reads the variables in the original data set. The USEVARIABLE list is a smaller subset of variables from the NAMES list, just as in an ordinary analysis. The NAMES list variables are the variables used to create the imputations. In UG ex11.5, the USEVARIABLE list is absent and therefore defaults to the NAMES list. Typically you also want to save into the imputed data set other variables that are not to be used in the imputation and to do that you put those variables on the AUXILIARY list. (2) Same thing. (3) The SAVE = data set contains what you are asking for. The PLAUSIBLE = data set gives summary statistics for plausible values. 


Correction. I should have said: UG ex 11.5 is not as clear as it could be on this point. A user would typically work with not only the NAMES and IMPUTE lists of variables, but also a USEVARIABLES list and an AUXILIARY list. The NAMES list simply reads the variables in the original data set. The USEVARIABLES list is a smaller subset of variables from the NAMES list, just as in an ordinary analysis. The USEVARIABLES list variables are the variables used to create the imputations. In UG ex11.5, the USEVARIABLES list is absent and therefore defaults to the NAMES list. Typically you also want to save into the imputed data set other variables that are not to be used in the imputation and to do that you put those variables on the AUXILIARY list.


Thank you very much! It is indeed clearer. 


Hi again, Does the AUXILIARY (m) function work in the generation of plausible values (i.e. plausible values are generated from a model, but can we let variables NOT in the model influence the generation of plausible values)? The question is based on the result you report in the "plausible value" paper that, when plausible values are to be used in a secondary analysis, all of the variables to be used in this secondary analysis need to be part of the PVs generation... I am generating PVs from a complex ESEM-within-CFA model. I will use them in a secondary analysis with an additional variable. Yet, when I add this variable to the ESEM-within-CFA model and allow it to correlate with the factors, the model crashes. Or do I just need to get the variable into the model by estimating its variance without allowing it to correlate with the factors?


Aux(m) is only intended for ML estimation, not the Bayesian estimation used with plausible values. 


Hello, I am currently using Mplus version 6.1 to perform multiple imputation. I am receiving the following warning message when I run my imputation model: *** FATAL ERROR PROBLEMS OCCURRED DURING THE DATA IMPUTATION. THE PSI MATRIX IS NOT POSITIVE DEFINITE. THE PROBLEM OCCURRED IN CHAIN 2. All variables included in the impute list contain missing data, and I am using the PROCESSORS = 2 command to reduce computing time. I have also specified categorical and continuous variables. Can you suggest changes I can make to get the imputation working? Thanks for your help. Nick 


There are two things you want to consider. One is the clarification of the UG imputation ex 11.5: UG ex 11.5 is not as clear as it could be on this point. A user would typically work with not only the NAMES and IMPUTE lists of variables, but also a USEVARIABLES list and an AUXILIARY list. The NAMES list simply reads the variables in the original data set. The USEVARIABLES list is a smaller subset of variables from the NAMES list, just as in an ordinary analysis. The USEVARIABLES list variables are the variables used to create the imputations. In UG ex11.5, the USEVARIABLES list is absent and therefore defaults to the NAMES list. Typically you also want to save into the imputed data set other variables that are not to be used in the imputation and to do that you put those variables on the AUXILIARY list. The other is the list of 14 suggestions in Section 4 of the AsparouhovMuthen (2010) imputation paper on our website. 


Dear Prof. Muthén, I'm trying to run a SEM with an imputed data set. Mplus reads in all 5 data files but uses none of them, hence the number of free parameters is zero and the test of model fit is not executed. TECH9 indicates that there is no convergence for each replication. However, if I run the 5 data sets separately, imputations 1, 4, and 5 show a good fit while imputations 2 and 3 do not converge. I am wondering why imputations 1, 4, and 5 are not used with type=imputation although these models show good fit. Furthermore, is there anything I might investigate to see why imputations 2 and 3 do not converge? I am especially surprised by that since I have 11 variables in my data sets but only 3 were imputed, so the other 8 variables are the same in each of the imputed data sets. Thank you very much for your support! 


If you are not using Version 6.1, you should do so. If you have this problem with Version 6.1, please send your input, data sets, output, and your license number to support@statmodel.com. 


Hi, I have multiply imputed data sets and I want to perform a likelihood ratio test to compare 2 nested models. I read the technical appendix "Chi-Square with Multiple Imputation". Is there a way to compute this test in Mplus? Thanks, Alain Girard, University of Montreal 


This can be done with the ML estimator. 


Hi, Is it possible that the order of the variables that are used to impute missing data has an effect on imputation? 


Yes. Multiple imputation uses random number generation as a part of the MCMC estimation of the imputation model. When the variables are reordered different random bits will be used for different variables. This however should have minimal impact on any proper use of the imputed data sets. 
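The point about variable order can be pictured with a small sketch. It is not how Mplus assigns random numbers internally (that is not described in this thread); it only assumes a single seeded stream that is consumed in the order the variables are listed, with the hypothetical helper `draws_by_variable` standing in for the sampler.

```python
import random

def draws_by_variable(order, seed=2011):
    """Hand out one random draw per variable, in the listed order (toy stand-in)."""
    rng = random.Random(seed)  # same seed, so the underlying stream is identical
    return {name: rng.random() for name in order}

first = draws_by_variable(["y1", "y2", "y3"])
second = draws_by_variable(["y3", "y2", "y1"])

# The same three numbers are drawn; only the variable each one lands on changes.
print(sorted(first.values()) == sorted(second.values()))
# The first draw goes to whichever variable is listed first.
print(first["y1"] == second["y3"])
```

Each variable therefore receives different random numbers after reordering, which changes the individual imputed values without changing the model they are drawn from.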


I have another problem. When I try to impute data, I get the following error message: FATAL ERROR THE NUMBER OF CLUSTERS PLUS THE PRIOR DEGREES OF FREEDOM OF PSI MUST BE GREATER THAN THE NUMBER OF LATENT VARIABLES. USING MORE INFORMATIVE PRIOR FOR PSI CAN RESOLVE THIS PROBLEM Usually, I have solved it by decreasing the number of variables used for imputation. What should I do? 


This happens because the number of variables in the imputation is more than the number of clusters in data. You can either remove some of the variables in the imputation model or you can perform an H0 imputation. With an H0 imputation you can use a factor analysis model on the second level for imputation purposes or you can use an unrestricted model with a different prior for the variance covariance matrix. Take a look at section 3.3 in http://statmodel.com/download/Imputations7.pdf 


Also I think you are using Mplus version 6. If you use version 6.1 you will not have that problem. 


Thank you. I have now downloaded Mplus 6.1, and the problem is solved. I still have another question. I center my data before imputation (as I also form interaction terms before imputation). However, I am not sure which mean value to use when I center variables that have missing values. Should I subtract the mean values that are computed on the basis of the cases/clusters that have no missing values? My second question concerns the variances of the parameter estimates. I understand that these are squared standard errors. But sometimes the variance estimates (in the TECH3 output) are not equal to the squared standard errors in my model output (e.g., SE = .104, variance estimate = .005). Is that due to rounding error or something else? What should I do in this case? I need these variance estimates for calculating simple slopes. 


You should center after imputation. Those values sound quite different even for rounding. Please send the output and your license number to support@statmodel.com so we can take a look at it. 
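A quick arithmetic check shows why those two numbers cannot be reconciled by rounding. The sketch uses only the values quoted in the question; the helper name `consistent_with_rounding` is made up for illustration.

```python
def consistent_with_rounding(se, reported_variance, decimals=3):
    """True if se**2 would round to the reported variance at the given precision."""
    half_unit = 0.5 * 10 ** (-decimals)
    return reported_variance - half_unit <= se ** 2 < reported_variance + half_unit

se = 0.104
tech3_variance = 0.005

print(round(se ** 2, 6))               # 0.010816, about twice the reported variance
print(round(tech3_variance ** 0.5, 4)) # 0.0707, the SE that a variance of .005 implies
print(consistent_with_rounding(se, tech3_variance))
```

A variance printed as .005 corresponds to an SE of roughly .067 to .074, so an SE of .104 points to something other than rounding.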


I have understood that it is advised to include all the necessary interaction terms in the imputation phase. If I were to center after imputation, how can I create interaction terms between observed variables? It seems that I cannot use "define" command. 


You can save the imputed data sets. DEFINE works with TYPE=IMPUTATION. 
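The order of operations suggested here (impute first, then center, then form the product) can be sketched outside Mplus as follows. This is not Mplus syntax and does not show what DEFINE does internally; the data values and the helper `center_and_interact` are hypothetical, and centering at each data set's own mean is an assumption made for the illustration.

```python
from statistics import mean

def center_and_interact(rows):
    """rows: list of (x, m) pairs from ONE completed (imputed) data set."""
    x_bar = mean(x for x, _ in rows)
    m_bar = mean(m for _, m in rows)
    # Center both variables, then build the interaction from the centered values.
    return [(x - x_bar, m - m_bar, (x - x_bar) * (m - m_bar)) for x, m in rows]

# Two made-up completed data sets; the second row's x was "imputed" differently.
imputed_sets = [
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
    [(1.0, 2.0), (2.5, 4.0), (3.0, 5.0)],
]
processed = [center_and_interact(rows) for rows in imputed_sets]
print(processed[0])
```

Because the means are computed on completed data, the question of which cases to use for the mean no longer arises.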

David Bard posted on Monday, March 21, 2011  11:41 pm



I'm having a hard time grasping how the default H1 sequential model is parameterized and estimated when there is a mixture of categorical and continuous imputation variables. The output seems to suggest that both a WLSMV and a Bayes estimator are being used at various points in time. Is the model first estimated with WLSMV and then somehow transitioned to a Bayesian analysis? When I try to create an H0 model that mimics sequential regression with a WLSMV estimator, I'm asked to use the Theta parameterization, but the default H1 model output claims to use Delta. When I try to use a Bayesian estimator for this H0 model, I am unable to reach convergence. Is it even possible to write the default H1 seq reg model as an M+ H0 model? Thanks! 


First let me say that unless you are using the older version Mplus 6, the default imputation model is not sequential. Starting with version 6.1 the default imputation model is COVARIANCE. I think your confusion about what is happening stems from the fact that you have a model (and if you don't specify an estimator you are essentially using the default WLSMV estimator) and a data imputation statement. In this case, Mplus assumes that you want the WLSMV estimator for your estimation, but you want to deal with the missing data via multiple imputations. Therefore Mplus will perform Bayes estimation first to impute the missing data, then analyze the imputed data using the WLSMV estimator. To simplify the methodology I would suggest that as a first step you perform the imputation and estimation separately. To get only imputation specify type=basic in the analysis command and remove the model statement. This will just generate imputed data, which you can later analyze as in the example on page 348 in the User's Guide. Now if you are interested in H0 imputations, follow example 11.7, i.e., you have to specify estimator=Bayes, an imputation model, and the data imputation statement. To mimic the sequential regression imputation as an H0 model imputation the first thing to do is to specify the command MEDIATOR = OBSERVED; in the analysis command. 


Regarding parameterizations, any Bayes or imputation estimation is based on the Theta parameterization. With the WLSMV estimator, on the other hand, both the Theta and the Delta parameterizations are generally available and can be used. For some models, however, only the Theta parameterization is available, and the sequential model is sometimes such a model. 

David Bard posted on Tuesday, March 22, 2011  4:28 pm



You are right, I have not yet upgraded to 6.1, but will do so shortly. The WLSMV is listed as an Estimator in my file with or without a model statement (under 'Summary of Analysis' section of the output), but sounds like this simply reflects the default estimator were I to have included a model. Thanks for clarifying. I do want the seq reg imputation in this instance. Any advice on getting my H0 version of this off the ground. Do I need to include fairly accurate starting values? I think van Buuren and Ibrahim have commented that the sequence of variable regressions can matter. Can you share the M+ default for setting up these seq regression equations when type=basic? Are the data restructured first to appear roughly like Monotone missingness? 


David, I am not very clear on why you want the imputation with the sequential method. Version 6.1 already has a better method: the covariance model. Second, I am not sure why you are not using the H1 imputation, which is already preset for you in terms of optimal performance: just use type=basic, estimator=Bayes, add the data imputation statement, and add model=sequential. Mplus does not reorder the variables; we use the order specified in the usevar command. We have not seen examples where the order of the variables is important. Finally, if you want to do an H0 imputation, you have specified MEDIATOR = OBSERVED; as well as the model, and you are experiencing convergence problems, I would suggest that you send it to support@statmodel.com. Tihomir 


I would like to use observed classroom-level means in my analyses. However, some individuals (in some classrooms) have missing values, and thus classroom means would be calculated on the basis of those individuals who don't have missing values (these individual scores are imputed later on though). Is that problematic? The problem is that I would like to create interaction terms between classroom-level means and include them in the model when imputing the rest of the data. 


This should not be a problem. 


Can the SAVEDATA command be used to control the output file format for the imputed files from DATA IMPUTATION? Thanks, MG 


I am conducting simple slope analyses (to follow up my interactions). I have imputed my data (20 data sets). I need to use the variances and covariances of my parameter estimates. Do I need to hand-calculate the averages of the variances and covariances across the 20 data sets (from output 3), or is there an easier solution (I guess I could use squared standard errors from the model output to get the variances of the parameter estimates)? 


Michael: The format of imputed data sets cannot be changed. If the original data is in fixed format, the data saving for imputation will use the format of the original data. But if it is free format, then it uses the default of F10.3. 
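For readers who post-process the saved files, F10.3 means each value occupies 10 columns with 3 decimals. A minimal sketch of that layout, with the hypothetical helper `f10_3`; it reproduces only the width and number of decimals and is not a claim about every detail of how Mplus writes the file.

```python
def f10_3(value):
    """Format one number in a 10-column field with 3 decimals, right-aligned."""
    return format(value, "10.3f")

# Three made-up values written as one fixed-width record.
record = "".join(f10_3(v) for v in (1.5, -0.25, 123.4567))
print(repr(record))
print(len(record))  # 30: three fields of 10 columns each
```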


Katlin: TECH3 is available with TYPE=IMPUTATION. 


Yes, I do use TECH3, but I get 20 variance-covariance matrices (as I have 20 imputed data sets). So, for instance, when I need a covariance value between the intercept and moderator, do I need to calculate the average covariance across the data sets? I assume this is what I need to do as there isn't such a "summary" matrix across the data sets. Or am I wrong? 


Averaging over TECH3 is not correct. You can square the standard error of the variance parameter but that does not get you everything you need. Perhaps you can use MODEL CONSTRAINT to do what you want. 
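One way to see why a plain average falls short is the standard combining rule for multiple imputation (Rubin's rules): the average of the per-data-set matrices captures only the within-imputation part, and a between-imputation part has to be added. The sketch below implements that textbook rule with made-up numbers; it is not taken from Mplus output, and the function name `pool_covariance` is hypothetical.

```python
def pool_covariance(estimates, covariances):
    """Rubin's rules for m vectors of estimates and their m covariance matrices."""
    m = len(estimates)
    p = len(estimates[0])
    q_bar = [sum(e[i] for e in estimates) / m for i in range(p)]
    # Within-imputation part: the plain average of the m matrices.
    within = [[sum(c[i][j] for c in covariances) / m for j in range(p)]
              for i in range(p)]
    # Between-imputation part: covariance of the estimates across data sets.
    between = [[sum((e[i] - q_bar[i]) * (e[j] - q_bar[j]) for e in estimates) / (m - 1)
                for j in range(p)] for i in range(p)]
    total = [[within[i][j] + (1 + 1 / m) * between[i][j] for j in range(p)]
             for i in range(p)]
    return q_bar, within, total

# Made-up results from m = 3 imputed data sets, two parameters each.
estimates = [[0.50, 0.20], [0.60, 0.10], [0.40, 0.30]]
covariances = [[[0.010, 0.002], [0.002, 0.040]]] * 3
q_bar, within, total = pool_covariance(estimates, covariances)
print(q_bar)
print(within[0][0], total[0][0])
```

In this toy example the pooled variance of the first parameter is more than twice the averaged one, so the averaged matrix would understate the uncertainty.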


Could you be more specific (I would really appreciate your help)? How would I get covariance estimates when using MODEL CONSTRAINT? 


I'm not saying you would get the covariance estimates from MODEL CONSTRAINT. Perhaps you can define whatever it is you want a standard error for in MODEL CONSTRAINT. You would then obtain a standard error. Other than that, I have no suggestions. See MODEL CONSTRAINT in the user's guide for further information. 
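The quantity at stake in this exchange is the simple slope and its standard error. A minimal sketch of the delta-method arithmetic, assuming a model y = b0 + b1*x + b2*z + b3*x*z and assuming the variances and covariance passed in are already correctly pooled across imputations; all numbers are made up, and `simple_slope` is a hypothetical helper. Defining the same expression as a new parameter in MODEL CONSTRAINT would let Mplus report the standard error directly, which is presumably the point of the suggestion above.

```python
from math import sqrt

def simple_slope(b1, b3, z, var_b1, var_b3, cov_b1_b3):
    """Slope of y on x at moderator value z, with its standard error."""
    slope = b1 + b3 * z
    variance = var_b1 + z ** 2 * var_b3 + 2 * z * cov_b1_b3
    return slope, sqrt(variance)

slope, se = simple_slope(b1=0.40, b3=0.15, z=1.0,
                         var_b1=0.010, var_b3=0.004, cov_b1_b3=-0.002)
print(slope, se)
```

The covariance term is why squared standard errors alone are not enough: they supply the two variances but not the covariance between b1 and b3.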

Yijie Wang posted on Friday, April 22, 2011  9:18 am



Hello, I'm doing a multiple imputation and want to use the generated data for further analyses. Is there a way for mplus to combine all the imputed datasets and yield an averaged dataset? Thank you! 


No. We produce the individual data sets only. 


Hi, I understand that interaction terms should be included in an imputation model. When using the unrestricted H1 imputation option, should the interaction terms themselves be imputed along with the variables from which they are derived (which would lead to interaction terms that are not the exact product of their source variables), or should the interaction terms be used only as predictors in the imputation, and then recalculated from the imputed data during analysis? Best, MG 


We would not include the interaction term in the imputation of the data. We would use it only in the subsequent analysis. 


Hi, Do you recommend running latent factor interaction models on multiply imputed data? Thanks. 


Do you mean creating factor scores via plausible values and then creating interactions? Or do you mean imputing missing values on observed variables and then doing XWITH? The former is an interesting idea that should be explored. The latter is straightforward. 


I was actually asking about the latter but was confused about the post dated March 30, 2010  12:42 pm and Linda's related answer. 1. My structural model is composed of latent factor interactions. Should I have included XWITH while running my imputation model? Or is it fine to run the imputation by modelling main effects and then specify XWITH while running the structural model on imputed data? 2. Is there a way to get around the two-step approach to multiple imputation (i.e., first impute data, then estimate the structural model)? Can't we do them simultaneously? 3. Is multiple group analysis on imputed data straightforward as well? If so, then I am able to test whether grouping improves model fit by comparing the models' loglikelihoods, right? I am asking this because in the output we get the message "the loglikelihood cannot be used directly for chi-square testing with imputed data" but the post dated March 19, 2007  5:59 pm states they could be used. 


1. The imputation model does not have to be correct relative to the analysis model, but how large the deviation can be depends on the situation. So with l.v. interactions, your analysis model contains them, and the imputation may or may not use them. My current thinking is that the imputation is probably good enough without using them. I am not sure if our H1 (unrestricted) imputation has difficulty converging when having both the l.v.'s and their interactions. 2. Yes, you can do it in one run by specifying estimator = ML/WLSMV (but not Bayes). 3. Yes, multiple-group imputation can be done. We provide a chi-square test suitable for multiply imputed data; see the technical appendix Chi-Square Statistics with Multiple Imputation. I am not sure this can handle chi-square difference testing, however. A new version of the Topic 9 handout including an expanded discussion of multiple imputation taught at UConn last week will be posted next week. 


Dear Linda and Bengt, I'm trying to run a multigroup SEM with imputed data. However, in order to get the model running in all imputed data sets, I need to restrict the residual variances of one of my variables to be equal. Running the model separately in each imputed data set does not require this restriction. I am now wondering why I need restrictions in the overall model when I don't need to restrict the residual variances in each of the single data sets. Is it right that, when using imputed data, Mplus runs each data set separately and then combines the results using Rubin's formula? And if so, why do I need special restrictions with the imputed data set? Thanks in advance! Sofie 


This does not make sense. If you are not using Version 6.11, try that. If you are and still have the problem, please send the files and your license number to support@statmodel.com. 

Eric Teman posted on Sunday, June 26, 2011  6:52 pm



I have read in several places that multiple imputation has no set rules in regard to pooling likelihood ratio chi-square values or adjunct fit indexes. Does Mplus have a special way of handling this? 


The following technical appendix on the website describes our ML chi-square for multiple imputation: Chi-Square Statistics with Multiple Imputation. For all other fit statistics we give the average over imputations. 

Eric Teman posted on Monday, June 27, 2011  6:42 pm



Are you aware of any issues when taking the average over imputations for fit statistics? I'm just wondering whether Enders' concern is warranted about "no rules exist for combining fit indices from multiply imputed datasets" 


The averages are not correct. See the Technical Appendix to see the difference between the average and the correct chi-square. 

Eric Teman posted on Monday, June 27, 2011  8:12 pm



Sorry, I was referring to the adjunct fit indexes. Are there any known problems/issues with those averages? 


All of the averages for RMSEA etc. are simply averages. Only the ML chi-square is correct for imputation. Chi-square for weighted least squares is also given as an average and is not to be interpreted for model fit. 


Hiya, I'm trying to run a multiple imputation model but experiencing some problems. Mplus stops at iteration 12500 because of lack of memory. I'm running the program in 32-bit Windows, with a 2.3 GHz dual processor (using processor=2), 2.2 GB of RAM and all non-essential processes (even antivirus) terminated. I found the 2010, version 2 Asparouhov & Muthén paper, where they recommend the use of the FBITER and THIN options (the latter is also suggested by my output). Yet, when I try to use the FBITER option in my 6.11 version of Mplus I'm told this function is unrecognised. Any help with this is welcome. 


Please send your files and license number to support@statmodel.com. 

Eric Teman posted on Wednesday, June 29, 2011  4:16 pm



Are you aware of any published research indicating taking the averages of adjunct fit statistics across imputations is not correct? 


I don't know of anything specifically. You might look at Craig Enders's book and Joe Schafer's book. Both references should be in the Topic 9 course handout. 


I am not sure what adjunct fit statistics are, but in general the chi-square statistics should not be added directly. See http://statmodel.com/download/MI7.pdf for simulations and a description of how Mplus does this. Also, any approximate fit indices based on the correct chi-square statistic should be valid. 


Hi, I would like to use the variance-covariance matrix of the coefficients to make some plots. I can export the matrices for each imputation (TECH3). How can I summarize the 10 matrices into 1? I understand that I could use the constraint command, but for all covariances this means a whole lot of input. Are the means over the 10 matrices a good approximation? Or the median? It doesn't have to be 100% correct, since it's only for plots. It just can't be too sensitive to which data set is used. Greetings, Ruben. 


With TYPE=IMPUTATION, you will get a correct TECH3. This is what you should use. 


Hi, I am imputing a single categorical variable using a number of completely observed variables in my data set. Do I have to include completely observed categorical variables on the Categorical statement? 


Any categorical variable on the IMPUTE list should have (c) after it. The CATEGORICAL option is for the analysis not the imputation of the data. All dependent variables in the analysis should be on the CATEGORICAL list. 


Hi, I ran and saved multiple imputations with Mplus, but when I analyze the list of data sets, Mplus doesn't estimate the fit indices and reports a negative variance in every data set (FILE IS TESIMPlist.dat; TYPE = IMPUTATION;). When I analyze each data set by itself, it works fine and doesn't report the negative variance. Why is Mplus not working properly with the list? Thanks, Mauricio 


If you are not using Version 6.12, do. If you are, please send the relevant files and your license number to support@statmodel.com. 

Eric Teman posted on Tuesday, January 17, 2012  1:36 pm



During the analysis phase of multiple imputation, is it possible for Mplus to save the averaged parameter estimates (and the corrected chi-square) as a data file? When I use SAVEDATA: RESULTS ARE results.dat, I get the unaveraged parameter estimates for the NDATASETS, which means no corrected chi-square is being saved. 


We do not save the averaged results. We save the results from each imputation. The average results are given in the results section. 

Eric Teman posted on Friday, January 27, 2012  7:20 pm



When fixing the latent variances to one so that all factor loadings can be estimated, is it normal for WLSMV used with multiple imputation to produce negative factor loadings? Is this OK? 


Factor loadings can be positive or negative. They are regression coefficients. 

Eric Teman posted on Friday, January 27, 2012  8:20 pm



Sorry, I should have been more clear. It is a simulation study where I have set the population values to be positive. But when I employ multiple imputation, the factor loadings are often negative when the latent variances are fixed to 1, but never negative when the latent variances are free. It seems a bit odd. 


Perhaps what you see is that all the factor loadings for a certain factor change sign to negative. That is ok and simply means that your factor is reversed (say from knowledge to ignorance). That gives the same fit. You often see this sign reversal in EFA. It is harmless. When you set the metric by fixing a loading to 1 you effectively decide on the sign. 
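The claim that a full sign reversal gives the same fit can be checked directly: with one factor, the model-implied covariance between two indicators is the product of their loadings times the factor variance, so flipping every loading leaves each product unchanged. A minimal sketch with made-up loadings and residual variances; `implied_covariance` is a hypothetical helper.

```python
def implied_covariance(loadings, residual_variances, factor_variance=1.0):
    """Model-implied covariance matrix of a one-factor model."""
    p = len(loadings)
    return [[loadings[i] * factor_variance * loadings[j]
             + (residual_variances[i] if i == j else 0.0)
             for j in range(p)] for i in range(p)]

loadings = [0.8, 0.7, 0.6]
residuals = [0.36, 0.51, 0.64]
flipped = [-lam for lam in loadings]  # the "reversed" factor

same = implied_covariance(loadings, residuals) == implied_covariance(flipped, residuals)
print(same)
```

Flipping only some of the loadings would change the implied covariances, which is why the harmless case is the one where all loadings of a factor change sign together.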


I have a question about imputing interactions. Initially I thought I should just impute my main variable, and then aggregate my imputed data sets to calculate interactions based on the main variable of interest. However, reading over the literature (von Hippel 2009, "transform, then impute") and how Mplus derives results from multiply imputed data sets, I realized that I should include interactions in the imputation procedure. OK, this all makes sense, but I am having trouble with model convergence. I've increased the iterations and deleted variables from the USEVARIABLES list and it still hasn't solved the problem. Is there any way this problem could be related to the fact that I am asking for standardized interactions? Thanks for any input. 


Try imputing without the interactions and see if that works. 


Are variables included in the NAMES list automatically used to impute missing data or do I have to define them explicitly as auxiliary variables? Thank you very much! 


If you do not have a USEVARIABLES list, all variables on the NAMES list are used to impute data for the variables on the IMPUTE list. 


Thanks for the reply! Using the VALUES command I specified the range of the imputed values for each variable (minimum and maximum), but an inspection of the imputed data sets shows that for some imputations the values exceed those restrictions. Any idea how this is possible? I would expect nonconvergence if the MCMC algorithm can't find a value within the specified range after x number of iterations... Do I need to worry about this (percentage of missingness 17%)? 


Please send the output, a data set that shows this, and your license number to support@statmodel.com. 


An additional question: How can I let Mplus know that it should not use SCHOOLID as a covariate to impute the requested variables? How can I include my SCHOOLID in the imputed data sets? When I use the 'Cluster' option in the VARIABLE command (with TYPE=COMPLEX), Mplus computes all the requested data sets but in the output it shows the error message 'all variables are uncorrelated with all other variables'. When I run the same MI model without SCHOOLID included in the input file, my model runs perfectly. When I ran the same input data file but then with SCHOOLID included (in combination with the USEVARIABLES list in the input syntax), my SCHOOLID was not shown in the imputed data sets. I guess there's a simple solution but I can't figure it out. Thank you very much! 


Use the IDVARIABLE option of the VARIABLE command. 


When using the "values=" option with multiple imputation, is it possible to specify a range of values in which negative values are possible? 


We do not currently allow negative values but will do so in the next version. The workaround for this is to add a constant to your variable that makes all numbers positive, impute, and then subtract the constant. 
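The workaround described above (shift, impute, shift back) can be sketched as follows. The imputation step itself is not shown: `imputed_shifted` is a hypothetical stand-in for what would come back from Mplus, and `shift_constant` with its `margin` argument is an illustrative helper, not an Mplus option. Any range given for the shifted variable would presumably need to be shifted by the same constant.

```python
def shift_constant(values, margin=1.0):
    """Smallest non-negative constant that moves every observed value to >= margin."""
    observed = [v for v in values if v is not None]
    return max(0.0, margin - min(observed))

raw = [-3.0, None, 0.5, -1.5, None, 2.0]   # None marks a missing value
c = shift_constant(raw)
shifted = [None if v is None else v + c for v in raw]

# Hypothetical completed data returned by the imputation of `shifted`.
imputed_shifted = [1.0, 2.2, 4.5, 2.5, 3.1, 6.0]

# Undo the shift so the variable is back on its original scale.
restored = [v - c for v in imputed_shifted]
print(c, restored)
```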

Aurora Zhao posted on Thursday, March 08, 2012  5:44 pm



Hi Dr. Muthen, I am a beginner of handling missing data with multiple imputation. I am looking at the example 11.5. I am wondering how to calculate this missing data correlate "z" from the original data and save it into the data set to do M.I. Thank you very much! 


Z is not a variable that you create. It is part of the dataset that is used to impute the variables on the IMPUTE list. 


Hello, I tried to do multiple imputation using the following input, but I couldn't find the saved output file. Could you please tell me where the output file is saved?
TITLE: this is an example of multiple imputation for a set of variables with missing values
DATA: FILE IS C:\Users\Admin\Desktop\MGH\Owis\path3.dat;
VARIABLE: NAMES ARE smo em sym act emo env pf ef sf re bmi age fev gender act5 actc;
  missing = .;
DATA IMPUTATION: IMPUTE = smo (c) em (c) sym pf ef sf re bmi age fev gender (c) act5 (c) actc (c);
  NDATASETS = 10;
  SAVE = C:\Users\Admin\Desktop\MGH\Owis\essra.dat;
ANALYSIS: TYPE = BASIC;
OUTPUT: TECH8 


The saved data set name should be essra*.dat. The asterisk is replaced by the numbers of the data sets, for example, essra1.dat, essra2.dat, etc. 
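The naming rule just described can be spelled out in a few lines, which is handy when checking that all expected files were written. The helper `imputed_file_names` is hypothetical and only mirrors the rule stated above (the asterisk is replaced by the data set number).

```python
def imputed_file_names(pattern, n_datasets):
    """Expand a SAVE-style pattern such as 'essra*.dat' into numbered file names."""
    if pattern.count("*") != 1:
        raise ValueError("pattern needs exactly one asterisk")
    return [pattern.replace("*", str(k)) for k in range(1, n_datasets + 1)]

names = imputed_file_names("essra*.dat", 10)
print(names[0], names[-1])  # essra1.dat essra10.dat
```

A pattern without an asterisk, as in the input posted above, is rejected by this sketch because there is nowhere to put the data set number.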


I did that but I got this error message: *** ERROR in Data command The file specified for the FILE option cannot be found. Check that this file exists: C:\Users\Admin\Desktop\MGH\Owis\essra1.dat 


Please send the relevant files and your license number to support@statmodel.com. 

finnigan posted on Friday, April 13, 2012  10:37 am



Linda, I have a longitudinal data set where indicators have 20-30% missing data across three waves. Covariates have 15% missing data across three waves. I will be testing measurement invariance using CFA and then estimating a multiple-indicator growth model. I am pursuing two solutions, FIML and multiple imputation, to handle the 20-30% missing data. For multiple imputation, should the model used to generate the data sets be the CFA or the growth model? Thanks 


I would impute according to the H1 model. 


Hi, I have a few questions about multiple imputation with a multilevel model. First, I am finding high autocorrelations lasting for many iterations for the between-level parameters. Do you know if this is normal? Second, is there a way to get Mplus to show me the autocorrelations for more than 30 lags? Third, I don't quite understand where Mplus draws the imputed data sets. If I specify THIN=500 in the data imputation command, then is Mplus drawing the imputed values from every 500th iteration, beginning with the first iteration after burn-in (i.e., 1st data set has values from the 10,000th, 2nd from the 10,500th, and so on)? Finally, in the imputation, I want Mplus to take into account the fact that certain values (i.e., family socioeconomic status) tend to be similar within schools. I have specified it like this: CLUSTER=schoolid; ANALYSIS: TYPE = TWOLEVEL; MODEL: %WITHIN% ses ON par_ed h_income; %BETWEEN% ses; Does this accomplish my goal of accounting for the school-level clustering of values on that variable? If not, can you tell me how I can? Thank you, Lindsay 


1. It is normal to have high autocorrelations in two-level imputation, especially when the number of clusters is not far from the number of variables (this leads to a nearly singular variance-covariance matrix on the between level). We would basically recommend an H0 model imputation with a 1-factor analysis model on the between level. Take a look at sections 3.3 and 3.4 in http://statmodel.com/download/Imputations7.pdf 2. You cannot get more than 30 autocorrelations. You have to use the thin command to discard MCMC draws; that would let you see how correlated more distant draws are. For example, if using thin=50, the 50th autocorrelation will become the first. 3. The thin option in the data imputation command works as you describe above. 4. First you should make sure that Mplus does what you think it does: look at slide 184 in http://statmodel.com/download/Topic9v52%20%5BCompatibility%20Mode%5D.pdf By default all variables in Mplus are present on both levels, within and between, which accounts for the similarity of SES within clusters (the commands that restrict that default are within= and between=). 
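Point 2 can be checked on a simulated chain: after keeping every k-th draw, the lag-1 autocorrelation of the thinned chain is (up to sampling error) the lag-k autocorrelation of the original chain. The sketch below uses a toy AR(1) series as a stand-in for MCMC output; the coefficient 0.9, the seed, and the chain length are arbitrary assumptions, and nothing here comes from Mplus itself.

```python
import random

def autocorrelation(chain, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain)
    cov = sum((chain[t] - mean) * (chain[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Toy "MCMC" chain: an AR(1) process with strong serial dependence.
rng = random.Random(9)
chain = [0.0]
for _ in range(49999):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))

thin = 5
thinned = chain[::thin]                      # keep every 5th draw
lag5_full = autocorrelation(chain, thin)     # lag-5 autocorrelation, full chain
lag1_thinned = autocorrelation(thinned, 1)   # lag-1 autocorrelation, thinned chain
print(round(lag5_full, 2), round(lag1_thinned, 2))  # both near 0.9**5
```

So to inspect dependence beyond the 30 lags that are displayed, one can rerun with a larger thinning interval and read the early lags of the thinned chain.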


Thank you for your reply. Just to make sure I understand: if I specify ANALYSIS: TYPE = BASIC TWOLEVEL; then the similarity of variables within clusters is accounted for, unless I list them as WITHIN. If I want to specify an H1 model on the within level and an H0 model on the between level, do I just not specify any model for the within level? If variables are listed as part of the between-level H0 model, is their cluster-level similarity still accounted for? Thank you, Lindsay 


Sorry, a couple more questions: how do I evaluate the model when using TYPE = TWOLEVEL BASIC? The program isn't giving me the Bayesian plots, so I don't know how to assess whether the estimates reached a stable pattern or if there is an issue with autocorrelation. Also, the model is converging and not giving me any error messages even when there are more between-level variables than there are clusters. Can I be comfortable with the results? Thank you very much, Lindsay 

istia posted on Thursday, May 10, 2012  9:06 am



What exactly is the role of the random chi-square value in a regression model or even in multiple imputation? Can anybody share some papers? Thanks in advance. 


Lindsay, if you have more variables than clusters you should be using the H0 imputation method (rightmost path in the diagram on slide 184), like this: TYPE = TWOLEVEL; estimator=bayes; model: %within% y1-y100 with y1-y100; %between% y1-y100; Then add the data imputation command: data imputation: impute=y1-y100; save=imputations*.dat; 


Istia: What is "random value chi square"? 


Tihomir, thank you so much for your reply with the example syntax. That is very helpful. One follow-up question: does it make a difference that a few of the variables are observed at the between level and have no within-cluster variance? Would that change the imputation syntax at all? Thank you, Lindsay 

istia posted on Thursday, May 10, 2012  9:11 pm



Linda, sorry if I was wrong or misunderstood it. What I've read here: http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_multiple_imputation_univariate_linear.htm is that there will be a random value 'u' from a chi-square distribution. I just can't work out what exactly it means. What is the influence of using this chi-square value in producing a regression value or an imputed value? 


Lindsay You should specify those variables on the between= list in the variable command. 


istia Take a look at http://statmodel.com/download/MI7.pdf 


Ok, thank you, I will do that. Just to be clear, though, the model syntax: model: %within% y1-y100 with y1-y100; %between% y1-y100; will be exactly the same, even if variables y95-y100 are between? It just seems strange to me to have variables that are between variables as part of the within section in the model. Thank you, Lindsay 


Istia, regarding "random value 'u' of Chi Square", I think you should ignore that. That's just the way they describe the chi-square testing. The way we describe the chi-square testing is in the document Tihomir pointed to. If you want to learn more about multiple imputation, I would recommend the 2010 book by Craig Enders. It refers to Mplus. 


As a follow-up, I just tried the syntax and got the error message "between variables cannot be used on the within level." Instead I tried: model: %within% y1-y94 with y1-y94; %between% y1-y100 with y1-y100; But this model is not converging, I'm guessing because there are too many parameters relative to the number of clusters. Perhaps instead I should try this within syntax and the between-level analysis model with reference to the between-level variances of other variables? I.e., model: %within% y1-y94 with y1-y94; %between% y1-y91; y92 ON y95 y96 y97 y98 y99 y100; y93 ON y95 y96 y97 y98 y99 y100; y94 ON y95 y96 y97 y98 y99 y100; I really appreciate your guidance. Lindsay


I just tried the syntax with the between-level analysis model specified: model: %within% y1-y94 with y1-y94; %between% y1-y91; y92 ON y95 y96 y97 y98 y99 y100; y93 ON y95 y96 y97 y98 y99 y100; y94 ON y95 y96 y97 y98 y99 y100; and it converged very well. Everything looks good except that the between-level variance parameters for all the variables except y92, y93, and y94 have very high autocorrelations. Does this indicate a problem with the imputation model, or do I just need to increase the thinning until the autocorrelation drops to near zero? Thank you again, Lindsay


You don't need to have the autocorrelation drop to 0. Instead you can aim for the 30th autocorrelation (lag 30) to be below 0.2 or 0.3. Try thin=10 or even 50 or 100.
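
The rule of thumb above can be checked outside Mplus if the saved draws for a parameter are available. The following is only a minimal Python sketch: the function names and the simulated AR(1) chain are hypothetical stand-ins for real posterior draws, and it assumes that thinning by k simply means keeping every k-th draw.

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of a sequence of MCMC draws at the given lag."""
    n = len(chain)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(chain) - 1")
    mean = sum(chain) / n
    total_ss = sum((x - mean) ** 2 for x in chain)
    if total_ss == 0:
        raise ValueError("chain is constant")
    cross = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag))
    return cross / total_ss


def thin(chain, k):
    """Keep every k-th draw (assumed meaning of thinning by k)."""
    return chain[::k]


# Hypothetical slowly mixing chain: an AR(1) process with coefficient 0.97.
random.seed(1)
chain = [0.0]
for _ in range(19999):
    chain.append(0.97 * chain[-1] + random.gauss(0.0, 1.0))

raw_lag30 = autocorrelation(chain, 30)
thinned_lag30 = autocorrelation(thin(chain, 50), 30)
print("lag-30 autocorrelation, raw chain:     ", round(raw_lag30, 3))
print("lag-30 autocorrelation, every 50th draw:", round(thinned_lag30, 3))
```

If the lag-30 value of the thinned chain is still above roughly 0.2 to 0.3, a larger thinning interval would be the next thing to try.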


Hi, We have a question about planned missingness and FIML. We have a large dataset (N = 1400) and have used a three-form design for a large part of a long questionnaire. This means we have a lot of missing data, but all of it is MCAR. We plan on using FIML to deal with the missing data, but have noticed that Mplus does not give all fit statistics. We do not get the RMSEA, CFI, TLI, or chi-square. We were told that currently Mplus has no way of reliably estimating these fit statistics, and therefore does not give them. Is this indeed the case? Do you know of any papers that discuss this issue? And is there a way around this issue without using MI? Thanks for any tips or advice!


Hi, I am running a multiple imputation in Mplus, but I have run into the problem that the data set is big (1119 variables, around 5000 subjects) and Mplus tells me that I cannot include more than 500 variables. I can get the imputation to run when I select some variables to be imputed with the USEVARIABLES command. But when I do this, am I excluding all the other variables from the imputation process? How can I impute big data sets in Mplus? Thank you


Laura, You don't have to use MI (multiple imputation), which it sounds like you are doing given that you don't get all fit indices (which haven't been statistically developed yet). With missing by design you might instead want to use multiple-group ML analysis, with groups corresponding to the three forms. A good applied source for missing data handling is the C. Enders 2010 book.


Mauricio, There are 3 lists of variables: The NAMES list which describes the data (it can contain more than 500 variables), the USEV list which are the variables that inform the imputations, and the IMPUTE list which says which subset of variables we want to have imputations from. Typically, your USEV list is much shorter than your NAMES list. You don't need all the NAMES list variables to inform the multiple imputations, but usually a very short list of variables. The IMPUTE list should contain a shorter list of the variables for which you want to do a particular analysis. So 500 USEV variables would seem to be more than enough. 


Hi, I am using 20 imputed data sets to do a two-level analysis. I've discovered that under certain circumstances, the program is not analyzing all of the data sets (the output says requested: 20, completed: 19). One scenario in which this is happening is with a between-level dichotomous variable that was completely observed, and so is identical in every imputed data set. Can you help me figure out why that might be happening and what I can do to recover the 20th data set? Thank you, Lindsay


Add TECH9 to the OUTPUT command to see the reason. 


Hello, I am having a similar issue to a previous post. I am using 10 imputed data sets in a single-level model. There are 177 cases in the data set, but the output says that the average number of observations is 141. The number of replications requested is 10, but only 8 completed. My output also does not provide sample statistics or a chi-square test result. I added TECH9. It says that the model terminated normally and gives the warning: THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX....PROBLEM INVOLVING PARAMETER 17. Parameter 17 is the covariance of a variable with itself from the PSI matrix. I examined each of the imputed files; all have complete data for all 177 cases. I've also confirmed that my implist.dat file lists all 10 data sets, correctly named. I am using version 6.0. I have a friend with the 6.11 update and asked her to run the model. The result is the same, but her output has an additional message above the estimates for model fit: "THE CHI-SQUARE COULD NOT BE COMPUTED. THIS MAY BE DUE TO AN INSUFFICIENT NUMBER OF IMPUTATIONS OR A LARGE AMOUNT OF MISSING DATA." Her output does include sample statistics, but states that they are for only 8 data sets. Please advise. I appreciate your time and help. Thank you, Natalie


Please send the relevant files and your license number to support@statmodel.com. 


Hi, I have a question on how Mplus combines the results from multiply imputed data sets. Does Mplus combine both the unstandardized and the standardized results? Or does it combine only the unstandardized results and then standardize the combined results? Thank you


The average of the parameter estimates over the multiple imputations is standardized.


Hi, We have a question about fit statistics (CFI, ...) when applying multiple imputation. Because no formal pooling rules are currently available, Enders (2010) recommends that the 20 (or more) estimates of each fit index be used to create an empirical distribution. I wonder how I'd best do that. Should I run my syntax on every imputed dataset and save each fit statistic manually, or is there a more efficient way (perhaps a summary of the fit statistics of each dataset in one output)? Thank you very much


You would need to run your syntax on each imputed data set. 
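
Once the fit index from each separate run has been copied out, the empirical distribution can be summarized in a few lines. This is a minimal Python sketch, not an Mplus feature: the function name is made up for illustration and the CFI values are invented, not results from any real analysis.

```python
import statistics


def summarize_fit(values):
    """Describe the empirical distribution of one fit index over imputed data sets."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need the fit index from at least two imputed data sets")
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "sd": statistics.stdev(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


# Hypothetical CFI values, one per imputed data set (made-up numbers).
cfi_by_dataset = [0.952, 0.948, 0.955, 0.941, 0.950, 0.957, 0.946, 0.953, 0.949, 0.951]
for name, value in summarize_fit(cfi_by_dataset).items():
    print(name, round(value, 4))
```

Reporting the range and median alongside the mean shows how much the index moves from one imputed data set to the next, which the average alone hides.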


Hi, I'm running a latent class growth analysis on 5 time points with missing data. I've been able to use the multiple imputation procedure nicely to establish that there are 2 classes. However, I want to save class membership (i.e., cprobabilities) from this analysis and Mplus is telling me I can't. Is there a way I can get the class membership for each patient exported from Mplus when using the multiple imputation method? (Please note there are 5 data sets.) chris


For cprobs you need an estimated model and data for the subject in question. You have the estimated model (the average over the imps), but which of the 5 data sets should you use? That's the problem. You can create cprobs for each data set using the average estimates (fixed parameter values when running with each data set), and then average the cprobs. It is not clear, however, that this is the best approach  we are at the research frontier here. 
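
The averaging step described above could be done outside Mplus once the cprobs from each run have been saved. The Python sketch below is only illustrative: the nested-list layout, the function names, and the numbers are assumptions, and it presumes the classes are in the same order in every run (which fixing the parameters at the averaged estimates is meant to ensure).

```python
def average_cprobs(cprobs_by_dataset):
    """Average posterior class probabilities over imputed data sets.

    cprobs_by_dataset[d][i][k] is the probability that subject i belongs to
    class k in the run on imputed data set d (hypothetical layout).
    """
    m = len(cprobs_by_dataset)
    n_subjects = len(cprobs_by_dataset[0])
    n_classes = len(cprobs_by_dataset[0][0])
    averaged = []
    for i in range(n_subjects):
        row = [
            sum(cprobs_by_dataset[d][i][k] for d in range(m)) / m
            for k in range(n_classes)
        ]
        averaged.append(row)
    return averaged


def modal_class(probs):
    """Return the 1-based index of the most likely class."""
    return max(range(len(probs)), key=lambda k: probs[k]) + 1


# Two imputed data sets, two subjects, two classes (made-up numbers).
runs = [
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.7, 0.3], [0.2, 0.8]],
]
for subject, probs in enumerate(average_cprobs(runs), start=1):
    print(subject, [round(p, 3) for p in probs], modal_class(probs))
```

As the reply notes, it is not established that this is the best approach, so the averaged probabilities are better treated as descriptive than as a formally pooled result.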

bkeller posted on Friday, July 27, 2012  1:25 pm



I'd like to determine the fraction of missing information (see, e.g., Snijders & Bosker, 2nd ed, p. 141) for the parameters in a TWOLEVEL RANDOM type analysis for a TYPE = IMPUTATION run. Is there a way to output the average within-dataset variance W.bar = (1/M)*Sum(SE_m^2) and the between-imputation variance B = (1/(M-1))*Sum((theta.hat_m - theta.bar)^2) for each parameter so that I can calculate this? Thank you!


This will be included in the output in the next Mplus version, but you can still compute it with the current version. If you have just a small number of imputed data sets, run all of them one at a time; you then have all the parameter estimates and can compute this by hand. If you have many imputed data sets, use the external Monte Carlo technique described in the User's Guide, Example 12.6, Step 2. This way you can get the between-imputation variance (column 3) and the average SE (column 4).
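
For the by-hand route, the two quantities asked about follow directly from the per-data-set estimates and standard errors of one parameter. A minimal Python sketch, with a hypothetical function name and made-up numbers:

```python
import statistics


def within_between(estimates, standard_errors):
    """Return (W.bar, B) for one parameter across M imputed data sets.

    W.bar is the average squared standard error; B is the sample variance
    of the estimates, which divides by M - 1.
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need matching estimates and SEs from at least two data sets")
    w_bar = sum(se ** 2 for se in standard_errors) / m
    b = statistics.variance(estimates)
    return w_bar, b


# Made-up estimates and SEs of one parameter from five imputed data sets.
estimates = [0.42, 0.45, 0.39, 0.44, 0.41]
standard_errors = [0.10, 0.11, 0.10, 0.12, 0.10]
w_bar, b = within_between(estimates, standard_errors)
print("W.bar =", round(w_bar, 5), " B =", round(b, 5))
```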

bkeller posted on Monday, August 06, 2012  1:39 pm



I used the MONTECARLO technique you described to get B and W.bar, thank you. I am further interested in saving as output an array which contains the actual estimated parameter values across imputations. I am using 25 imputed datasets so I would rather not run each one and compile by hand. Is there a way to ask Mplus to save the results (something similar to SAVEDATA: ESTIMATES = file.dat;) but for all 25? 


Use savedata: results=file.dat; 

EFried posted on Thursday, August 09, 2012  12:45 pm



I'm using multiple imputation and am running a multilevel growth model. In my model results, the baseline covariates are shown as correlated 0.000 (which is not actually the case in the dataset). The correlations between time-varying covariates are, in contrast, shown as expected. All other model results look fine as well. My question is whether the zero correlations have something to do with the way Mplus deals with multiple imputation files, or whether something is wrong with the analysis? Thank you


Please send the output and your license number to support@statmodel.com so I can see what you are doing. 


Hi, I have ten multiple imputation data sets which are my data for CFA. Since my data are categorical, I have used WLSMV and specified the variables as categorical. For chi-square, there is the following information:

Chi-Square Test of Model Fit
Number of successful computations 10

      Proportions          Percentiles
  Expected  Observed   Expected  Observed
     0.990     1.000      0.020    39.184
     0.980     1.000      0.040    39.184
     0.950     1.000      0.103    39.184
     0.900     1.000      0.211    39.184
     0.800     1.000      0.446    39.184
     0.700     1.000      0.713    40.065
     0.500     1.000      1.386    53.839
     0.300     1.000      2.408    62.245
     0.200     1.000      3.219    68.881
     0.100     1.000      4.605   105.435
     0.050     1.000      5.991   105.435
     0.020     1.000      7.824   105.435
     0.010     1.000      9.210   105.435

What does this output mean? With ML, there is just one chi-square value; here, there is none. Thank you very much for your help!


Hi, and an additional question: Are the mean of CFI / TLI / RMSEA / WRMR the pooled results over all 10 datasets? What do "proportions" and "percentiles" of these indices mean? Thank you! 


The only fit statistic that has been developed for multiple imputation is chi-square. For the others we give the average over the imputed data sets. The output is described on page 362 of the user's guide.
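
The "Expected" percentile column in the table posted above can be reproduced by hand: the values match the upper-tail quantiles of a chi-square distribution with 2 degrees of freedom. That the model has 2 degrees of freedom is an inference from the posted numbers, not something stated in the thread. The Python sketch below uses the closed form for 2 df and, as a further assumption, reads the "Observed" proportion as the share of data sets whose chi-square exceeds the expected percentile.

```python
import math


def chi2_upper_quantile_df2(p):
    """Value exceeded with probability p by a chi-square variable with 2 df.

    With 2 df the upper-tail probability is exp(-x / 2), so x = -2 * ln(p).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return -2.0 * math.log(p)


def observed_proportion(chi_squares, critical_value):
    """Share of per-data-set chi-square values above the critical value."""
    return sum(1 for x in chi_squares if x > critical_value) / len(chi_squares)


proportions = [0.990, 0.980, 0.950, 0.900, 0.800, 0.700, 0.500,
               0.300, 0.200, 0.100, 0.050, 0.020, 0.010]
for p in proportions:
    print(f"{p:.3f}  {chi2_upper_quantile_df2(p):.3f}")
```

Under that reading, an observed proportion of 1.000 in every row says that all ten data sets gave a chi-square above even the largest expected percentile (9.210), which is consistent with the smallest observed value in the table being 39.184.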


Thanks for your reply, Linda. I have checked my output again: there is NO mean or standard deviation for chi-square, just the information I've posted above ("10 successful computations" and the table with expected and observed percentiles and proportions). Do I need to specify something more? I do get the mean and SD for CFI, RMSEA, TLI and WRMR though. On the other hand, when using ML with the same imputed data, there is just one chi-square value (and CFI, RMSEA etc.), but no table with expected and observed proportions and percentiles. Why does this happen? Thanks!


Please send the two outputs and your license number to support@statmodel.com so I can see what you are looking at. 


Dear Dr. Muthén, I am doing multiple imputation for four variables in a latent growth model. When I request 20 replications, only 19 are completed. I requested TECH9, which shows only that the first replication did not converge and that the number of iterations was exceeded (TECH9 gave no reason why). Could it be that one of the variables has too many missing values to do imputation? (Data from 83 out of 175 cases are missing for this variable.) If I do not include this variable in the missing data analysis, then all the replications are completed correctly. Thank you for your response!


Please send the output, data, and your license number to support@statmodel.com. 


Hi, I am running a multiple imputation with continuous variables. I used the rounding option and requested 5 decimals for the imputed variables because my original values have 5 decimals too: ROUNDING = sach lit (5); Unfortunately Mplus gives only the default of 3 decimals for the imputed data and reduces the non-imputed values to 3 decimals. Do you have any idea what's wrong here? Thanks in advance, Sofie


Add savedata: Format=F10.5; 


Thanks for your help, but mplus says that doesn't work for multiple imputation. *** WARNING in SAVEDATA command The FORMAT option for saving data is not available for TYPE=MONTECARLO or multiple imputation. The FORMAT option will be ignored. Have you any other suggestion? Thanks Sofie 

Kofan Lee posted on Wednesday, November 14, 2012  7:23 am



I am using multiple imputation in CFAs. I use CFA to check theoretical constructs and to prepare for item parceling. I have run into some questions, and I wonder if I have something wrong with my syntax. It seems that this combination (CFA with imputed data) brings some limitations. For instance, the survey data I have are non-normally distributed. However, I cannot use SAVE=MAHA and TECH13 to detect multivariate outliers and check multivariate normality. Further, modification indices cannot be requested either. Also, the chi-square test under MLR estimation does not produce a significance test. Would you mind taking a look at my commands below: Title: CFA motivation using imputated data; Data: File = C:\Users\koflee\Desktop\111312\serious.imputelist.dat; Type = imputation; VARIABLE: NAMES ARE m1-m19 s20-s37; USEVARIABLES ARE m1 m2 m4 m6-m19; ANALYSIS: ESTIMATOR = MLR; MODEL: im by m4 m10 m15 m18; id by m8 m14 m17; intro by m2 m7 m13; em by m1 m6 m11 m16; am by m9 m12 m19; SAVEDATA: save=maha; OUTPUT: TECH4 tech13 STANDARDIZED RESIDUAL MODINDICES (0); Thank you so much, Kofan


Sofie, it should have worked. Are you using type=basic? Send your example to support@statmodel.com. Tihomir


Kofan: Some things have not yet been developed for multiple imputation. For example, the chi-square test has been developed only for the ML estimator. For other estimators we give the mean and other information.

Kofan Lee posted on Thursday, November 15, 2012  6:11 am



Linda, Thanks for the response. I actually have very few missing responses and have decided to delete those cases. I have a question about Mahalanobis D. I tried to run a CFA (ML estimation) with the following code: SAVEDATA: save=maha; However, this command is ignored by Mplus. Should I add something to this syntax? Thank you and have a wonderful day, Kofan


You also need to name a file using the FILE option. 

Kofan Lee posted on Saturday, November 17, 2012  6:02 am



Linda, That works. Thank you 

Jan Zirk posted on Thursday, November 29, 2012  5:13 pm



Dear Bengt or Linda, I would like to ask you about imputation in the context of Bayesian plausible values. In a dataset with a big sample size (n>10000) there are many ordered-categorical variables from 5 instruments + demographic measures. To decrease computational demand I would like to transform the ordered-categorical variables to continuous plausible value measures. Is it better to do this via one big "H1 model" (i.e., "** with **" where ** means all variables in the dataset) or would it be better to run 5 separate H1 models (separate for each instrument's categorical variables)? Best wishes, Jan

Jan Zirk posted on Thursday, November 29, 2012  5:15 pm



P.S. I would like to next run SEMs with all measures entered. 


It sounds like you are putting a factor behind each ordered-categorical variable. If you can do it with all the variables from all 5 instruments that would be best, assuming they are at least moderately correlated. But if that gives you too many variables, then, assuming you have enough variables within each instrument, doing it instrument-wise would seem OK too. I guess any combination of the different sets of plausible values for the 5 instruments is equally valid.

Jan Zirk posted on Friday, November 30, 2012  8:08 am



Thank you Bengt! It seems then like a topic worth an article or short note. "It sounds like you are putting a factor behind each ordered-categorical variable": exactly, I wanted to extract them with LRESPONSES. One more question: to best reflect the original underlying data structure, if I run such an H1 model on all the available variables and see in its output that, e.g., a few links are n.s., do you think it would be worth the effort to trim such links and in the next step extract plausible values from the backwards-deleted/trimmed version of the H1 model? Or should I rather extract them regardless of the n.s. connections?


My guess is that it wouldn't be worth the effort. 

Jan Zirk posted on Friday, November 30, 2012  9:09 am



Yes, thank you; this is what I thought. Best wishes, 


I use multiple imputation to handle missing data. Can I interpret model fit indices as usual? (in the analysis phase) Thank you! 


No, except for ML. We report the average over the imputations for the other fit statistics, which have not yet been developed for multiple imputation.


Thank you. Could you please specify? Do you mean that when I use the ML estimator, fit indices are interpretable, or are they still averages?


With ML, the chi-square value is interpretable, not the other fit statistics. The other fit statistics are averages.


Dear Linda/Bengt, I conducted multiple imputation analyses in a (relatively small) sample of 175 persons. Next, I estimated a longitudinal path model without problems. However, when I explicitly model a correlation between the predictors, I receive the following error message: "THE BASELINE CHI-SQUARE COULD NOT BE COMPUTED. THIS MAY BE DUE TO AN INSUFFICIENT NUMBER OF IMPUTATIONS OR A LARGE AMOUNT OF MISSING DATA." When I increased the number of imputations, I still received the same error message. I do not receive this error message if I do not explicitly model the correlation between the predictors. What could be the reason for that, please? Thank you in advance!


Please send the two outputs (with and without the correlation) and your license number to support@statmodel.com. 


Dear Linda, I have questions related to MI with the Bayesian method. What combining rules does Mplus use, especially with a random effect model (latent curve model)? For fixed effects, Rubin (1987) presented the method for combining results from a data analysis performed m times. Or are alternatives used (e.g., jackknife variance estimator, fractionally weighted imputation)? Can these combination formulas be used with nonlinear models? How about the posterior predictive p-value? Is it handled the same way as the "combining rules of Likelihood Ratio Test" or "Wald test", which Asparouhov and Muthén (July 27, 2010) explained in "Chi-Square Statistics with Multiple Imputation: Version 2"?


Parameter estimates are averaged over the set of analyses. Standard errors are computed using the average of the squared standard errors over the set of analyses and the between-analysis parameter estimate variation (Rubin, 1987; Schafer, 1997). A chi-square test of overall model fit is provided (Asparouhov & Muthén, 2008c; Enders, 2010). All other values are averaged over the set of analyses.
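
The combining rule for estimates and standard errors described above is usually written as Rubin's rules. The Python sketch below shows the textbook form for a single parameter, with the total variance T = W.bar + (1 + 1/M) * B; the function name is hypothetical, the numbers are made up, and whether Mplus applies exactly this small-M correction factor is an assumption here rather than something stated in the reply.

```python
import math
import statistics


def pool_rubin(estimates, standard_errors):
    """Pool one parameter over M imputed data sets using Rubin's (1987) rules.

    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(standard_errors):
        raise ValueError("need matching estimates and SEs from at least two data sets")
    pooled_estimate = statistics.mean(estimates)
    w_bar = statistics.mean([se ** 2 for se in standard_errors])  # within variance
    b = statistics.variance(estimates)                            # between variance
    total_variance = w_bar + (1.0 + 1.0 / m) * b
    return pooled_estimate, math.sqrt(total_variance)


# Made-up estimates and SEs of one parameter from five imputed data sets.
estimate, se = pool_rubin([0.42, 0.45, 0.39, 0.44, 0.41],
                          [0.10, 0.11, 0.10, 0.12, 0.10])
print("pooled estimate:", round(estimate, 4), " pooled SE:", round(se, 4))
```

The pooled SE is never smaller than the square root of the average squared SE, because the between-imputation variance adds the uncertainty due to the missing data.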


Dear Linda, Thank you for your quick reply. You mean PPP is also averaged over the set of analyses? If so, is there any possibility of bias? 


Yes, that is also an average. I don't believe there is any theory as to how this should be combined. 


Dear Linda, Deeply appreciate your help. 

Eric Deemer posted on Wednesday, March 27, 2013  1:25 pm



Hello, I ran an analysis on 5 imputed data sets but get no output. I used "type = imputation" in the DATA command also. The computation window flashes on the screen and that is all. Is there something I am doing wrong? Thanks, Eric 


Please send the files and your license number to support@statmodel.com. 


Dear Linda & Bengt, when using TYPE = IMPUTATION and 10 imputed datasets, there are only residual covariances. Is there an option to get residual correlations as well? If so  which command would I have to use? Thanks for your help! 


There is not an option for also getting the residual correlations. 

Ping Li posted on Thursday, April 04, 2013  1:39 pm



Hi Linda, I am analyzing data sets that I have imputed myself. When I run the syntax, just as Eric Deemer mentioned above, the computation window flashes on the screen but there is no output file. Could you help me see what is wrong with the syntax: TITLE: Public Administration; DATA: FILE=imputelist.dat; TYPE=imputation; VARIABLE: NAMES ARE ciserq1-ciserq8 ceserq1-ceserq5 agency gen age gender cltype clgend clpart clint educ ethnic tenure jobpos toint car1-car4 hr1-hr11 lmx1-lmx7 tiserq1-tiserq8 teserq1-teserq5; USEVARIABLE ciserq1-ciserq8 ceserq1-ceserq5 gen hr1-hr11; ANALYSIS: ESTIMATOR=ML; MODEL: ciserq BY ciserq1* ciserq2-ciserq8; ciserq@1; ceserq BY ceserq1* ciserq2-ceserq5; ceserq@1; hr BY hr1* hr2-hr11; hr@1; ciserq ceserq on hr gen; OUTPUT: STANDARDIZED(stdyx); Thanks very much!


Please send the files and your license number to support@statmodel.com. 


Hi, my dataset contains missing values. I have completed multiple imputation with m = 10 and use Mplus to estimate structural models over the 10 imputed datasets. Additionally, I have used the original data with MISSING = BLANK to estimate the same models. All model fit indices based on analyses with the dataset that still contains missing data are substantially better than those based on the 10 imputed datasets. I've been wondering which algorithm Mplus uses when analyzing data with missing values that explains these differences. Thank you. 


The only fit statistic that has been developed for multiple imputation is chi-square for maximum likelihood. The means are reported for all other fit statistics.


Dr. Muthen, I am working on a multiple imputation procedure and in accordance with your Asparouhov & Muthen (2010) paper which states "The missing data is imputed after the MCMC sequence has converged", I am trying to run an MCMC sequence. Even though I am using Mplus v7 (which should be able to run Bayesian statistics), I am getting " Unrecognized setting for ESTIMATOR option: BAYES". Why do you think this is happening? 


I think you are not using Version 7. Check at the top of the output where it shows which version you are using. 


Dr. Muthen, Thank you very much for your prompt response. I checked and you are right; it shows version 5.1. But this is weird because I can see the little Mplus7 sign in the top left corner of my window when I am using the program. And when I check the About Mplus section it says Mplus version 7. Do you think this has something to do with settings?


You must have more than one Mplus.exe on your hard drive. So do a search and delete all but the most recent.


Thank you Dr. Muthen, that worked. Yalcin 


Dr. Muthen, I have one other issue that I keep coming across. Please see the line of syntax below: missing are Q51A-Q51F (100) Q52A-Q52E (99 98) Q53A-Q53D (98 99); I am doing multiple imputation and the syntax above is where I define which values need to be imputed. When I use the syntax above, for the Q52 variables it does not recognize 99 as a missing value flag. However, if I change the order so that 99 comes first and 98 comes after, it does what it is supposed to do. The same thing happens for the other variable series as well. Somehow it does not function properly when I write 98 before 99. Do you have any idea why this might be happening? Thanks in advance!


Yes, you need to put the negative number first. If it is second, the minus sign is read as a list. 


Dear Linda, getting back to my post from April 12th: is it possible to get the separate results for CFI, RMSEA etc. for the different imputations? Why would the mean for CFI, RMSEA indicate better fit in models with FIML than in models with imputed data? How does Mplus estimate parameters when there is missing data? Thank you very much.


Q1. You would have to run each imputation as a separate data set. Q2. Don't know offhand. It's a research question; a dissertation topic for someone? I would not trust the average CFI or RMSEA from imputations unless it's been researched, because those measures "don't know" that the analysis of each imputed data set is based on imputed data. When you do CFI/RMSEA by FIML, they "know" that data are missing because the chi-square on which they are based knows. Q3. FIML is done the regular way of ML assuming MAR. Bayesian imputation assumes the same thing.
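
The point in Q2 that CFI and RMSEA are built on the chi-square can be made concrete with the usual textbook formulas. The Python sketch below is illustrative only: the function names and numbers are made up, and software differs in details (for example, whether the RMSEA denominator uses N or N - 1), so this is not claimed to be the exact Mplus computation.

```python
import math


def rmsea(chi2, df, n):
    """Textbook RMSEA point estimate from the model chi-square (N - 1 version)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_baseline, df_baseline):
    """Textbook CFI from the model and baseline chi-square values."""
    d_model = max(chi2_model - df_model, 0.0)
    d_baseline = max(chi2_baseline - df_baseline, d_model)
    if d_baseline == 0.0:
        return 1.0
    return 1.0 - d_model / d_baseline


# Made-up values: the same model with a larger chi-square fits worse on both indices.
print(round(rmsea(30.0, 10, 201), 3), round(cfi(30.0, 10, 410.0, 15), 3))
print(round(rmsea(50.0, 10, 201), 3), round(cfi(50.0, 10, 410.0, 15), 3))
```

Because both indices are functions of the chi-square, any inflation or deflation of the per-data-set chi-square carries straight through to the averaged CFI and RMSEA.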


Dr. Muthen, I am imputing the data with Mplus. Examining the imputed data, I found that even though the original data are on a 1-5 scale, the imputed data include values smaller than 0 and bigger than 6. Should I be concerned, or can I simply use the highest (or lowest) possible value for those out-of-range cells? Thanks!


See the VALUES option of DATA IMPUTATION. I think this may be what you want. 


This was helpful, thanks! 


Dr. Muthen, I am new to both Mplus and MI and this is why I have so many questions. There is another issue that I am struggling with. I just discovered that even though my original sample size is 3000+, the imputed data come with a sample size of 1882. I reviewed the help contents, this forum, and the user guide but I couldn't find any explanation of why this is happening. My hunch is that this might be a memory issue, but I don't know. Can you tell why this might be happening? Thanks!


You are probably reading your original data incorrectly by having blanks in the data set and using free format or by having too many variable names in the NAMES list. If this does not help, send the data, output, and your license number to support@statmodel.com. 
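
One way to check for the first cause before sending files is to count the fields on every row of the raw data file and compare the count with the length of the NAMES list. The Python sketch below assumes a whitespace-delimited file; the function name and the example rows are hypothetical.

```python
def rows_with_wrong_field_count(lines, n_names):
    """Return (line_number, field_count) for rows that do not have n_names fields.

    Assumes whitespace-delimited values; completely empty lines are skipped.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if fields and len(fields) != n_names:
            problems.append((number, len(fields)))
    return problems


# Made-up example: the second row has a blank where a value should be.
example_rows = ["1 2 3", "4   6", "7 8 9"]
print(rows_with_wrong_field_count(example_rows, n_names=3))
```

For a real file the rows would come from something like `open(path).read().splitlines()`, where `path` is the location of the data file. Any row reported here has a blank instead of a missing-value code (or an extra value), which in free format shifts the values that follow it.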


Dear Linda, for my model, I use the following command: TYPE = IMPUTATION; I have 10 imputed datasets (with no missing values). When estimating a model with four latent factors, model estimation proceeds normally. However, when trying to establish a 2nd order factor (F2 BY F11* F12 F13 F14; F2@1), I get the following message: "The chi-square could not be computed. This may be due to an insufficient number of imputations or a large amount of missing data." There definitely is no missing data. Why would 10 imputations not suffice? All other models worked well, as did the same model with a different dataset. Thanks for your help!


You can try increasing the number of imputations. If that does not help, send the output and your license number to support@statmodel.com. 


Multiple Imputation in 7.1 produces a new column of results called "rate of missing". Can you tell me what this refers to and how it's computed? I was hoping it was fraction of missing information, but the values don't match my hand calculations and I can't find it in the Guide. Thanks much! 


It is the same as fraction of missing information. How do you calculate it? 


I use FMI = Vb/Vt, where Vb is the between-imputation parameter variance and Vt is the total parameter variance (this definition is in Bodner 2008, Schafer 1997, Enders 2010). Does Mplus use Schafer's rate of missing information (defined here: http://sites.stat.psu.edu/~jls/mifaq.html#minf)?


Yes 
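
The two definitions in this exchange give different numbers for a small number of imputations, which may explain the mismatch with the hand calculation. The Python sketch below contrasts the plain ratio Vb/Vt from the post with the degrees-of-freedom-adjusted rate of missing information as it is usually given in Rubin (1987) and Schafer (1997); the function names and inputs are made up, and that Mplus computes exactly the second formula is taken only from the one-word reply above.

```python
def fmi_ratio(w_bar, b, m):
    """Plain ratio Vb / Vt, with Vt = W.bar + (1 + 1/M) * B."""
    total = w_bar + (1.0 + 1.0 / m) * b
    return b / total


def rate_of_missing_information(w_bar, b, m):
    """Adjusted rate of missing information (Rubin, 1987; Schafer, 1997).

    r is the relative increase in variance due to nonresponse and nu is the
    degrees of freedom of the pooled estimate. Requires w_bar > 0 and b > 0.
    """
    r = (1.0 + 1.0 / m) * b / w_bar
    nu = (m - 1) * (1.0 + 1.0 / r) ** 2
    return (r + 2.0 / (nu + 3.0)) / (r + 1.0)


# Made-up within and between variances for one parameter, M = 5 imputations.
w_bar, b, m = 0.0112, 0.00057, 5
print("Vb/Vt:                      ", round(fmi_ratio(w_bar, b, m), 4))
print("rate of missing information:", round(rate_of_missing_information(w_bar, b, m), 4))
```

The adjusted rate is always larger than the plain ratio, and the gap shrinks as the number of imputations grows.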

Alvin Wee posted on Thursday, June 27, 2013  6:02 am



Hi, the new column "rate of missing" in version 7 is FMI, right? How do we make use of it (i.e., how do we use it to indicate the quality of the imputations)? Is there a range or cutoff?

Alvin Wee posted on Thursday, June 27, 2013  8:32 am



Also, in the output the unstandardized estimates equal the values in the "rate of missing" column. What does this mean?


Please send the output and your license number to support@statmodel.com. 


I am planning on running a H1 imputation model to save the imputed data sets. I have data clustered within families and was going to use the TYPE=BASIC TWOLEVEL. What I am having trouble figuring out is whether the data needs to be in wide or long format (it is currently one row per family). 


You would not need TWOLEVEL if your data are in wide format and the cluster variable is family. You would just need TYPE=BASIC. Multivariate analysis takes care of any non-independence of observations.

Yan Liu posted on Thursday, July 11, 2013  6:11 pm



Dear Dr. Muthen, I did multiple imputation and conducted a mediation analysis with 10 imputed data sets. However, I can see that only 9 data sets were analyzed, and the model fit indices were not given in the output. I wonder if there is anything wrong with my Mplus code. Here is my imputation code: USEVARIABLES ARE Y1t1-M1t2; AUXILIARY = schid tchid stdid x1 x2 sex age; MISSING = blank; ANALYSIS: type = basic; DATA IMPUTATION: impute = Y1t1-M1t2; ndatasets = 10; save = imput*.dat; OUTPUT: TECH8; This is my mediation analysis code: USEVARIABLES ARE x1 x2 Y1t1-M1t2 M_d Y1_d Y2_d; DEFINE: M_d=M1t2-M1t2; ! Difference score of mediator between time 1 and time 2 Y1_d=Y1t2-Y1t1; ! Difference score of Y1 Y2_d=Y2t2-Y2t1; ! Difference score of Y2 MODEL: M_d ON x1 (a1); M_d ON x2 (a2); Y1_d ON M_d (b1); Y2_d ON M_d (b2); Y1_d ON x1 x2; Y2_d ON x1 x2; MODEL CONSTRAINT: NEW(indb1 indb2 indb3 indb4); indb1=a1*b1; indb2=a1*b2; indb3=a2*b1; indb4=a2*b2; Here is what I saw in the output of my analysis: SUMMARY OF ANALYSIS Average number of Observations 228 Number of replications Requested 10 Completed 9


Run each separately and see if one has a problem. 

Yan Liu posted on Monday, July 29, 2013  2:25 pm



Dear Dr. Muthen, Thanks for your suggestion! I ran the analysis on each imputed data set separately and found that the model-data fit was really poor for all of them. Is this the reason why I cannot finish running all the imputed data sets? The data were collected at 2 time points. The total sample size is 228. There is no missing data at time 1, but about 16% missing at time 2. The outcome variables are percentages of time, which were derived from counts. The mediator is continuous. The independent variables are two dummy variables (2 interventions vs. control). The distributions of the 2 outcome variables are bimodal for two groups at both time points. For the third group this is less obvious. I tried two ways to model my data: (1) Using difference scores (time2 - time1). The distributions of the difference scores are not bimodal, though a little skewed. The model fit was very bad for each imputed data set! Then I took a further look at the distributions for each group. Two of them are still bimodal. My question is: should I not use difference scores, or not use the ML estimator? The outcome variables are overdispersed and bimodality is still a problem for two of the groups. (2) Modeling the time 2 variables with the time 1 variables as covariates. Given the bimodality problem and the missing data issue, what models would you suggest I use? Thanks! Yan


So are you saying that the multiple imputation run had only 9 of the 10 converging, but when you ran them separately all 10 converged? Which version of Mplus are you using? Turning to your key question, poor fit can be due to using ML instead of MLR, but the bimodality is likely to not be resolved by MLR. I would suggest investigating the cause of the bimodality. Perhaps you want to simply use a binary variable instead of the bimodal one? 

Yan Liu posted on Wednesday, July 31, 2013  3:45 pm



Dear Dr. Muthen, Thanks for your suggestion! I think using a binary variable will be an easier way to solve the problem. To dichotomize the continuous outcomes, I am thinking of three ways to do it: (1) choose a cutoff that separates the two modes (one small and one big distribution), (2) ask for experts' opinions, and (3) run a regression analysis with outcome, predictors, and mediator using latent class analysis (constrained to 2 classes) and then save the class membership. Will the third option work, and is it better? Oh, I used Mplus 6.11. So is there any difference between the versions? Best regards, Yan


The choice between 1-3 has to be made by the researcher. I would recommend always using the latest version of Mplus, which currently is 7.11.

Yan Liu posted on Friday, August 02, 2013  7:51 am



Dear Dr. Muthen, Thanks a lot! Should I dichotomize outcome variables first and then impute missing data or the other way around? One more question. When imputing continuous outcome variables (should be zero or positive), I found that some imputed values are negative. Is there a way to constrain the imputed value not to be negative? Best regards, Yan 


Personally, I would use the maximum information for imputation and so would not dichotomize first, but this is your choice. See the VALUES option on page 518 in the Version 7 UG. 


Hello! I am trying to run latent growth curve models with longitudinal data. I have 10-15 waves of data, but unfortunately roughly 30% of participants are missing on my predictor. From my understanding, MI is the best method for handling the missing data on these x-values. Does that sound right? I was able to create the imputed data sets, but was not able to take the next step and run the model using the imputed data. The MS-DOS window appeared briefly and disappeared, and no output was produced. If you have any advice, I'd really appreciate it! Thanks, Lauren 


Please send the input and data sets to support@statmodel.com. If you are not using Version 7.11, try that first. 


Hello, I'm new to Mplus, and I'm doing a growth mixture model. As I have some missing data in the covariates, I wanted to do multiple imputation (TYPE=IMPUTATION). I see from older posts that I should use starting values to avoid label switching. Does this still apply, or does the program automatically use the estimates from the first data set as starting values for the subsequent data sets? I'm using version 7.11. Thanks, Ragnhild 


Instead of multiple imputation I would include the covariates in the model by mentioning their variances in the overall MODEL command and use FIML. In this way distributional assumptions are made about them but cases with missing on one or more covariates are not excluded from the analysis. 


Thanks for the advice! I will try that. 


Linda & Bengt: I appreciate the guidance you've provided for my journey into imputation. In trying to understand what is going on, and the tests involved, I analyzed a data set of about 300 observations (all continuous scales) in a couple of ways: (1) FIML using ML, MLR, and Bayes estimators. (2) Bayes estimation of factor scores, creating 50 imputed data sets, then ML estimation of the latent model. The FIML estimates all showed poor model fit (no surprise). The ML chi-square results from the imputed data, however, showed good fit (using the chi-square test in the output). That surprised me, given the poor fit using FIML. I was expecting consistent (though not exactly the same) results. Is the ML test produced from the imputed data testing something different, or am I missing an additional step? (I read the Multiple Imputation technical paper, version 2, 07/27/2010, and I assume that the output chi-square is the appropriate test of fit.) I don't want to publish an incorrect interpretation and would appreciate guidance to help me avoid that! Thanks! Michael 


I believe you are looking at an average chi-square value and a standard error in the multiple imputation output. Is this the case? If so, this is not the true imputation chi-square. 


Linda: If that is what is provided in the output, then yes. There is a note about the average over 50 data sets, but that appears after the SAMPLE STATISTICS heading, so I thought it referred only to those. Here is an excerpt from my Mplus imputation output:

MODEL FIT INFORMATION
Number of Free Parameters 37
Loglikelihood
  H0 Value 619.847
  H1 Value 391.050
Chi-Square Test of Model Fit
  Value 14.280
  Degrees of Freedom 23
  P-Value 0.9186
Chi-Square Test of Model Fit for the Baseline Model
  Value 168.670
  Degrees of Freedom 44
  P-Value 0.0000
SRMR (Standardized Root Mean Square Residual)
  Value 0.063

I had assumed the chi-square results produced by the ML estimator would reflect the information in the Technical Appendix "Chi-Square Statistics With Multiple Imputation", Version 2. Do I need to "hand calculate" the appropriate chi-square statistic using these averages? Thanks. Michael 


The values above are not averages. They are the values described in the Technical Appendix "Chi-Square Statistics With Multiple Imputation". They are available only for ML and not, for example, for MLR. Please send the two outputs, imputation and FIML, along with your license number to support@statmodel.com. 


My question is about the autocorrelation plots obtained using multiple imputation. I'm not sure what is being plotted on the horizontal axis. No matter how many iterations I specify, the axis runs from 1-30. Are iterations "binned" somehow to create the horizontal axis? If so, is the binning achieved by just dividing the total number of iterations by 30? I tried to change the axis range, but Mplus shut down, so I'm guessing that's not an option. Thanks, Debbi 


Please send the output, graph file, and your license number to support@statmodel.com. 


With my data, I have computed the same SEM twice: once with some missing data, using ML, and once with 10 imputed datasets. The chi-square value for the MI datasets is smaller than the value for the dataset with some missing information (with the same df and N); however, the CFI for the MI datasets is lower than the one for the dataset with some missing information. Why would this happen? Thanks for your help! 


In addition to my previous post: For simulation purposes, I have also used mean imputation and single stochastic regression imputation. Both methods have led to results very similar to ML with missing data. So it is only in the analyses with 10 imputed datasets that the chi-square is much lower (around 1275 compared to around 1450 for the three other options, with df = 98 and N = 2326) and the CFI is lower, too (.910 compared to .916). Thanks. 


For your MI approach, are you referring to the chi-squares for each imputed data set, or the one chi-square that summarizes all the imputed data sets? I am referring to techniques discussed on slides 212 of the 6/1/11 Topic 9 handout. 


In my output file, there is only one chi-square value, which I presume is the one that summarizes all imputed data sets. (I'm using Mplus 7 with "TYPE = IMPUTATION;" and ten imputed datasets; according to the output, all ten requested replications are completed.) 


Please send the output for the 2 runs that you compare to Support. 


Thanks for your suggestion, I've done so this morning. An additional question: I have compared the chi-square values of the baseline model between multiple imputation and ML (with missing data), and they differ quite a lot, whereas the differences in the baseline-model chi-square among ML (with missing data), mean imputation, and single stochastic regression imputation are comparatively small. Why would that happen? 


Please send those 2 baseline outputs and data to Support. 

Anonymous posted on Monday, January 06, 2014  2:24 am



Hello! I have estimated a CFA with categorical variables using 5 MI datasets (TYPE = IMPUTATION). I ran two CFAs with different models. In order to compare them, can I simply calculate the chi-square difference test by subtracting the chi-square values provided in the MODEL FIT part of the outputs? Thanks in advance! 


The chi-square with multiple imputation cannot be used for difference testing. You should use FIML if this is important to your study. 

Lucy Morgan posted on Thursday, January 23, 2014  1:23 am



Hi, I am trying to run a fully latent path model (N = 199) with a dataset that is complex (data collected from care assistants, clustered by nursing homes), has non-normal distributions, and has missing data (< 5%). Data are missing on both exogenous and endogenous variables. I understand that FIML can account for missing data on endogenous variables only, so the number of observations is reduced to 167 when I run the model. I have a couple of questions I would be very grateful if you could answer: 1) Should I use multiple imputation to impute missing data for ALL variables, and then run the model based on the imputed datasets? Or should I only impute data for the exogenous variables and then run the model with the imputed datasets AND FIML? (I did try to impute only the exogenous variables, but missing data on the endogenous variables were replaced with * and the model would not run....) 2) When running the multiple imputation, would it be ok to run a straightforward imputation (TYPE = BASIC) as per ex 11.5 in the Version 6.0 handbook, or should I be running the multiple imputation with a model that reflects my dataset (TYPE = COMPLEX), similar to ex 11.7? 3) When I run a TYPE = COMPLEX model, I cannot also use MLM (which I need to account for the non-normal data). Can I simply substitute MLR? Many (many!) thanks Lucy 


In your case, I would not use multiple imputation. I would use COMPLEX and MLR and include the variances of all observed exogenous covariates in the MODEL command. They will be treated as dependent variables and distributional assumptions will be made about them. Missing data theory will then be used for them. This is asymptotically the same as doing multiple imputation. 


Hello, I have missing data for a 4-wave LGC model I am running. I can't use MLR as an estimator because 2 of my variables are non-normal, and I can't use MLM because I have missing values. So I chose to impute 10 data sets (in Mplus 7) and then use MLM estimation. I understand the chi-square and fit statistics are just averages and not accurate evaluations of fit (unless you're using ML), but does MLM still produce robust SEs and parameter estimates with multiply imputed data sets? That is, is it appropriate to use MLM estimation with multiply imputed data if I'm not planning on comparing nested models? 


Why can't you use MLR? MLR is robust to nonnormality of continuous variables. What do you mean by nonnormal? 


Sorry, my mistake re: MLR (by non-normal I mean one continuous predictor and the continuous outcome variable are both highly leptokurtic). When I previously estimated the unconditional LGC model with MLM on imputed data, the fit indices RMSEA, CFI and SRMR all indicated adequate fit, but with MLR on the original data (with missing) these indices all indicate inadequate fit (the parameter estimates are quite similar). I assume I have to put more stock in the MLR-estimated fit statistics and conclude that my model has poor fit, correct? Thanks 


You can't assess fit in multiple imputation using the means of the fit statistics. These are means. How well or poorly they represent fit has not been studied. So yes, it seems your model does not fit. 


Hi Dr. Muthen, I'm using multiple imputation (Amelia) and running 10 imputations. How do I pool the model fit indices (CMIN/df, CFI, RMSEA, SRMR)? An earlier post (from 2006) says to just calculate a simple average of the values (as there is no specific theory on this). I was wondering if this has changed or if that is still the practice. Thank you. 


This is still a research question. 


Dear Linda, I have a question about "the concept of multiple imputation using Bayesian estimation". When imputations are created under Bayesian arguments, MI has a natural interpretation as an approximate Bayesian inference. In addition, I thought that this missing data technique uses the Bayesian estimation method when obtaining parameter estimates. So, I wrote the syntax below:

ANALYSIS: ESTIMATOR = BAYES;
MODEL: i s | Y1@0 Y2@1 Y3@2 Y4@3;
  [i](a); [s](b); i(c); s(d); i with s(e);
MODEL PRIORS: a ~ N(190, 20); b ~ N(7, 3);
  c ~ IW(616, 5); d ~ IW(8, 5); e ~ IW(28, 5);
DATA IMPUTATION: IMPUTE = Y1-Y4; NDATASETS = 15; SAVE = C:\*.DAT;

By the way, some are confused about whether "MI using Bayesian" indicates Bayesian estimation or simply means MI as proposed by Rubin. I thought it is the former. Am I correct? Thank you 


See page 516 of our UG for an overview of how imputation works in Mplus. See also the imputation examples in Chapter 11. You can do "H1 imputation" or "H0 imputation". Your setup is an H0 imputation example in line with UG ex 11.7. Rubin proposed MI using Bayes, so these are one and the same. I recommend the Craig Enders missing data book. 


Dear Bengt, Thank you for your information. You explained that "the data can be imputed from any other model that can be estimated in Mplus with the Bayesian estimator (H0)". Then, the imputed data sets are used in the estimation, where Bayesian estimation is used to estimate the model. Does this mean that parameter estimates from the Bayesian posterior distributions were obtained for each imputed data set and then combined? Am I correct? If so, does setting informative priors (MODEL PRIORS) affect the imputation step, the estimation process, or both? 


Q1. Multiple draws were generated from the Bayesian posterior distribution of H0 model estimated parameters and for each draw data were generated. Q2. The priors only influence the first H0 model estimation step. 


Dear Bengt, I deeply appreciate your quick reply. Here is one last question. When estimating parameters (in the estimation step), does Mplus simply use noninformative priors? Thank you. 


Yes. 


Hello, I am doing multiple imputation with a Bayesian estimator. I have a few parameters that appear to have autocorrelations still present at 30 lags. How can I see the autocorrelations for lags greater than 30? Both the output and the plot only go through 30. Also, how can I get the fraction of missing information for each parameter? Thank you, Lindsay 


Using the THIN option of the ANALYSIS command can give you bigger lags. If you use THIN=10, the autocorrelations that you will see are essentially those at lags 10, 20, 30, ..., 300, so each lag is multiplied by 10. Alternatively, use the BPARAMETERS option of the SAVEDATA command to get all parameter values and compute the desired autocorrelation in Excel. The fraction of missing information for each parameter is obtained after the imputations are done, as in example 13.13, where the desired model is specified. 
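
For readers who save the posterior draws (for example via BPARAMETERS) and prefer a script to a spreadsheet, the lag-k autocorrelation can be computed directly. The following is only a minimal Python sketch: the function name `autocorrelation` is hypothetical, and it assumes you have already extracted the draws for one parameter into a plain list, in iteration order and from a single chain.

```python
def autocorrelation(draws, lag):
    """Sample autocorrelation of a sequence of draws at the given lag.

    draws: list of numbers for ONE parameter, in iteration order
           (assumed to come from a single chain).
    lag:   non-negative integer smaller than len(draws).
    """
    n = len(draws)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(draws)")
    mean = sum(draws) / n
    # Lag-0 autocovariance (the variance), using divisor n.
    c0 = sum((x - mean) ** 2 for x in draws) / n
    if c0 == 0:
        raise ValueError("draws are constant; autocorrelation is undefined")
    # Lag-k autocovariance, also with divisor n (the usual biased estimator).
    ck = sum((draws[t] - mean) * (draws[t + lag] - mean)
             for t in range(n - lag)) / n
    return ck / c0


# Example with made-up numbers: any lag can be requested, not only 1-30.
example = [1.0, 2.0, 3.0, 4.0, 5.0]
print(autocorrelation(example, 1))
```

Calling the function in a loop over the lags of interest (say 40, 50, 100) gives the values beyond the 30 lags shown in the output and plot.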


Hello, I am working with a multiply imputed dataset (5 imputed sets) because we employed a 3-form planned missingness design in a large questionnaire. My most important variables are non-normally distributed, so I usually use the MLR estimator for my models. As I conclude from reading all the information here, the fit statistics (chi-square, RMSEA, CFI) for MLR are 1) all averages over the imputed sets; 2) these averages are not reliable because Mplus does not "realize" they are from imputed data; 3) for the same reason, the fit statistics of the separate imputed sets are not reliable either. Now some questions that arise are: How do I assess the fit of my model? Should I run it with ML and see if the fit statistics are similar to the averages I get with MLR? That is, as an indication only. I don't think actually reporting results of ML models is a good idea given the non-normal distribution of my variables. What do I report in a manuscript when I want to refer to model fit (reviewers ask for it)? Is there ANY way to test nested models using MLR (constrained vs. unconstrained, to test for moderation by gender)? Or any other way to test moderation in this case? Thank you, Suzan 


If you have planned missingness, use FIML not multiple imputation. Then you have fit statistics and can test nested models. 


Dear Linda, Thank you for your reply. Unfortunately we do have to work with the imputed sets. Could you tell me how I should report on the model fit using TYPE=IMPUTATION in a manuscript? Are there other ways to assess the fit? And can I use the Wald test instead for moderation purposes? Thank you, Suzan 


You can use MODEL TEST with multiple imputation. No difference testing can be done. The only absolute fit statistic is chi-square for maximum likelihood for continuous outcomes. See the following paper on the website under Bayesian Analysis: Asparouhov, T. & Muthén, B. (2010). Bayesian analysis of latent variable models using Mplus. Technical Report. Version 4. As far as what others report, I don't know. You might want to ask that on a general discussion forum like SEMNET. 


I'm sorry. This is the paper I meant: Asparouhov, T. & Muthén, B. (2010). Multiple imputation with Mplus. Technical Report. Version 2. 


Hello, I am trying to do imputation of missing data before running a ULSMV analysis. I'm getting an error message "PROBLEM INVOLVING VARIABLES AND xC‡ . REMOVING ONE OF THESE VARIABLES FROM THE IMPUTATION RUN CAN RESOLVE THE PROBLEM." The problem is that strange characters are appearing instead of the names of the variables, and I cannot figure out which variables I should remove. Thanks in advance for the help! 


Please send the input, data, output, and your license number to support@statmodel.com. 

Joop Hox posted on Thursday, May 15, 2014  2:12 am



Hi all, I have a practical question: is there a maximum limit to the number of imputed datasets that Mplus can handle? Joop 


Not that I know of. If you have had a problem, please send it to support. 

Shiny7 posted on Tuesday, September 30, 2014  11:07 am



Dear Mrs. Muthen, may I ask another question, please? Is 'Analyze multiple imputation datasets' compatible with multilevel modeling? I tried it; the analysis runs well, but the regression coefficients and SEs are not plausible. Furthermore, Mplus registered 22 clusters, although in fact there are only 21. I hope you can give me a little support. Thanks a lot! Shiny 


Yes, multiple imputation can be done with multilevel modeling. It sounds like you are not reading your data correctly. If you can't see the problem, send the data and output to support@statmodel.com. 

Shiny7 posted on Tuesday, September 30, 2014  11:47 am



okay, thank u so much, I am going to check my model again... Have a nice day... 


It's the data you should check. You may have blanks in it that cause it to be misread in free format. Blanks are not allowed in free format data. 

Shiny7 posted on Wednesday, October 01, 2014  12:31 am



Dear Mrs. Muthen, thanks a lot, it was only the term 'Imputation_' that had been missing at the beginning of the 'names command'.... Now it works fine... 

WenHsu Lin posted on Sunday, October 19, 2014  12:26 am



Hello Mrs. Muthen, I am trying to use MI in Mplus; however, I do not know how to use the VALUES syntax. My syntax is as follows: variable: names are income sex w3dep; usevariables are income sex w3dep; missing is blank; data imputation: impute = number income sex(c) w3dep; So, how do I tell Mplus that my w3dep ranges from 1 to 16? Thank you. 


VALUES = w3dep (1-16); 

Lois Downey posted on Tuesday, October 21, 2014  10:57 am



I used DATA IMPUTATION in Mplus to generate 5 datasets with values imputed for all missing data in my dataset. I am now using TYPE = IMPUTATION to analyze the data. Some of the outcome variables in the dataset are censored from below. When I define one of these outcomes as censored, Mplus does not provide any warning that this statement will be ignored. However, the results of the analysis match the results of an analysis in which I omit the CENSORED command. Does this mean that Mplus ignores the statement and performs the analysis as if the outcome is uncensored continuous? 


Please send the output and license number to Support so we can diagnose this. 

WenHsu Lin posted on Tuesday, October 21, 2014  5:42 pm



Dear Mrs. Muthen, I have dropouts (missing data) in my longitudinal data. I ran an LGM (wave 1 to wave 3) and used a wave 4 latent variable as a distal outcome (deviance). I ran a multiple imputation, and the results were different from those obtained with the Mplus default for handling missing data. Specifically, for the multiple imputation model, deviance on i s was not significant. On the other hand, when I do not use multiple imputation and just specify that missing is blank, I get a significant result for the same path. Which one of these is more trustworthy? Thank you so much. 


I'm considering multiple imputation for a dataset I'm working with (all categorical variables & some dichotomous & so am using the WLSMV estimator). The dataset has a weight variable but no clustering or stratification. My hesitation with using multiple imputation comes from this thread, which suggests that the resulting fit indices (e.g., RMSEA, TLI, etc.) are averages & are therefore not interpretable per Linda Muthen. My understanding is that only chi-square is interpretable, yet I am concerned about using that with a fairly large sample size. Is it still the case that interpreting other fit indices is an issue being researched (no clear answer yet)? I suppose the alternative would be EM imputation in one dataset, yet I know of some concerns in the literature with doing so with dichotomous variables. Thank you. 


This is still the case except for maximum likelihood with continuous outcomes. 

Ashley posted on Tuesday, December 02, 2014  4:34 pm



Is it possible to compute basic descriptives across imputed datasets (means, SD, correlations, etc)? Also, I've conducted a CFA and I would like to compute reliabilities (alpha) of the factors identified. Is this possible to compute across all imputed datasets? Thank you in advance. 


Q1. Try Type=Basic. Q2. You can do this using SEM in line with articles/books by Raykov and Marcoulides. 

Ashley posted on Tuesday, December 16, 2014  12:07 pm



Hello, I attempted to obtain the descriptives of the imputed datasets using type = basic. While I can get the means and correlations, I have been unable to get the standard deviations. Is there another code that I could use? Also, I've been unable to figure out how to flag significant correlations. Is this possible for MI data? Thank you in advance. 


We give sample statistics for the first imputed data set. For continuous variables you should get means, variances, and covariances. For categorical variables, you should get thresholds and correlations. 


Dear Dr Muthen and Muthen, As discussed above, when using multiple imputation the model output includes a column Rate of Missing. I understand this represents the (un)certainty of the model results due to the missing data. Yet, I have been unable to find any information regarding what is an acceptable rate of missing information. Could you provide me with a reference on this topic? Thank you in advance and have a lovely new year! Aurelie 


See the FAQ Missingness Fraction on the website. 
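
As background on where a rate of missing information comes from, Rubin's combining rules for a single parameter can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `pool_rubin` is hypothetical, the inputs are the per-data-set estimates and standard errors you would collect yourself, and the "missing_rate" it returns is the simple large-sample ratio of between-imputation variance to total variance. Whether the Rate of Missing column in the Mplus output uses exactly this formula is an assumption; the FAQ is the authority on that.

```python
import math


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets with Rubin's rules.

    estimates:  list of m point estimates (one per imputed data set)
    std_errors: list of m standard errors (same order)
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least 2 estimates and matching std_errors")
    q_bar = sum(estimates) / m                                   # pooled estimate
    within = sum(se ** 2 for se in std_errors) / m               # W: mean squared SE
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # B: variance of estimates
    total = within + (1 + 1 / m) * between                       # T = W + (1 + 1/m) B
    # Share of the total variance that is due to the missing data.
    missing_rate = (1 + 1 / m) * between / total if total > 0 else 0.0
    return {"estimate": q_bar, "se": math.sqrt(total), "missing_rate": missing_rate}


# Made-up numbers for illustration only.
print(pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))
```

A value near 0 means the imputed data sets agree closely on the parameter; a value near 1 means most of the uncertainty in that parameter comes from the missing data.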


Hello, I'm trying to analyze some datasets (impmissl) created through DATA IMPUTATION on the original dataset (Aim2). When I try to use these datasets to run my model (using DATA: FILE IS impmissllist.dat; TYPE = IMPUTATION;), I find that there are illegal characters: asterisks. I figured out that these are probably missing values in the original dataset, Aim2.dta, that are represented as 9999 in Aim2.dat. For some reason, though, for the variables that I did not impute (to save time/power), it seems the missing values may have been converted back to their original asterisks (the file Aim2.dat was the result of the stata2mplus utility, where asterisks representing missing data were replaced by 9999). I've tried listing the * as a missing value in the MISSING statement, e.g. Missing are all (*) or Missing are vat1 (*) dx1(*), but that doesn't seem to work either, as I get errors, e.g.: "ERROR in VARIABLE command Period (.) or asterisk (*) used as the missing symbol must apply to the whole dataset. No variables (or ALL) should be mentioned in the MISSING option..." How can I either replace the * again or keep the values from being made into * in the first place? Or maybe something else entirely is happening? Thanks, Michelle 


Mplus uses asterisks as the missing data flag in data sets it saves. Say MISSING = *; Don't use variable names or ALL in the MISSING statement. You should look at the output where you created the imputed data sets. All of the information about the data sets, including the order of the variables, is shown at the end of the output. 


Dear all, I have run a multiple imputation with Stata, but I would like to run a latent class analysis with Mplus. For this reason, I would like to know if there is an option/command in Mplus for handling my new dataset, which includes 20 datasets created by the multiple imputation. I don't want to do all the analysis (MI+LCA) with Stata because for my main analysis, with another dataset, I found the classes with Mplus. So I'd like to use the same program for this dataset as well, which I'm using to confirm my results. Thank you! Fabio 


See Example 13.13 


Dear Linda, thank you for your quick answer. I guess you meant Example 12.13, since 13.13 does not exist in the book. Anyway, I ran the analysis on a small dataset (with 5 imputed datasets instead of 20) to check whether the code was correct, but something is wrong. I get 2 errors: "Errors for replication with data file" and "ERROR in DATA command. An error occurred while opening the file specified for the FILE option." The following is the code I used.

TITLE: LCA with multiple imputation
DATA: file is O:\LCA\Mplus\prova.txt;
  type = imputation;
VARIABLE: names = nat sex drug age id data;
  usevariables = nat sex drug age;
  categorical = age;
  nominal = nat sex drug;
  idvariable = id;
  classes = c(2);
ANALYSIS: type = mixture;
  starts = 1000 100;
OUTPUT: standardized; tech1 tech7;
SAVEDATA: file = indout_LCA_mi.txt;
  results = res_LCA_mi.txt;
  save = cprobabilities;

The variable data is numeric: 1 represents the first imputed dataset, 2 the second, and so on. Thank you very much for your help! Fabio 


In the current user's guide, the example is 13.13. Please send the files and your license number to support@statmodel.com. 


Yes, I found the current user's guide, and the example I have looked at is the same. I can't send the data set, but I can give you an example:

nat sex drug age id data
1 1 1 1 1 1
1 1 0 1 2 1
0 1 0 3 3 1

and so on until the end of the first imputed data set. Then there is the second imputed data set:

0 0 0 2 1 2
1 1 0 3 2 2
1 0 1 1 3 2

I did that for all the imputed data sets, so my file "prova" has one data set under the others, with the variable data indicating the number of the data set. Is this the proper way to set up the file? May I run a mixture analysis with MI data sets? Thank you for your understanding. Fabio 


Each imputed data set must be in a separate file. The file named in the FILE option contains the names of the data sets. This is explained in the example. 
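
For a stacked file like the one described above, where a column identifies the imputation number, the split into separate files plus a list file can be scripted. The following is a minimal Python sketch, not an Mplus feature: the function name `split_stacked`, the output names `imp1.dat`, `imp2.dat`, ... and `implist.dat` are hypothetical choices, and it assumes whitespace-delimited rows with the imputation index in the last column.

```python
import os


def split_stacked(stacked_path, out_dir, index_col=-1,
                  prefix="imp", list_name="implist.dat", skip_header=False):
    """Split a stacked imputation file into one file per imputed data set.

    Rows are grouped by the value in column `index_col` (default: last column).
    All columns are kept, so the NAMES list in the Mplus input stays the same.
    Also writes a list file with one data-file name per line.
    Returns the list of data-file names in order of first appearance.
    """
    groups = {}  # dicts keep insertion order in Python 3.7+
    with open(stacked_path) as fh:
        for i, line in enumerate(fh):
            fields = line.split()
            if not fields or (skip_header and i == 0):
                continue  # skip blank lines and, optionally, a header row
            groups.setdefault(fields[index_col], []).append(" ".join(fields))

    names = []
    for key, rows in groups.items():
        name = f"{prefix}{key}.dat"
        with open(os.path.join(out_dir, name), "w") as out:
            out.write("\n".join(rows) + "\n")
        names.append(name)

    # The list file is what the FILE option would point to with TYPE = IMPUTATION.
    with open(os.path.join(out_dir, list_name), "w") as out:
        out.write("\n".join(names) + "\n")
    return names
```

If the stacked file contains a row of variable names, pass `skip_header=True`, since the data files themselves should contain only numbers. The list file is written next to the data files, so relative names work when the Mplus input is run from the same folder.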


Ok, sorry! Now it is clear and everything worked. Thank you again for your help. Best Fabio 


Dear Professors Muthén, I am performing multiple imputation on a large sample of data. I have two types of missing data: by design (99) and not by design (999). I want to impute the 999 (missing not by design) only, but I can't find the right syntax. Could you help me with this? Thank you, Andrea 


Just use 999 as your missing data designation in the imputation run. The missing by design could later be handled by multiple-group analysis, for instance. 


Thank you! 

Jen posted on Thursday, March 05, 2015  12:50 pm



Hello, I had a question related to the above about missing data that is a combination of MCAR and MAR(ish). I am constructing a relatively complex structural model with 8 multi-indicator latent constructs plus 3 manifest variables, one of which is categorical. The categorical manifest variable is one of five mediators and correlates with other mediators (so including covariances seems necessary). Additionally, many of the indicators for the latent variables are categorical. Two latent variables (all indicators) and the categorical manifest variable are MCAR for half the sample (due to giving participants a random subset of measures). There is a small amount of data missing for other reasons. Because of the categorical variables and the need for covariances, I'd like to use WLSMV but am concerned about the missing data handling. I thought I might use MI, but wonder if I am okay to just impute the small amount of MAR(ish) data but not the MCAR data. Imputing the MCAR data seems to be causing issues (everyone in that half of the sample is being assigned identical values). I am also open to other ideas for this situation. Thank you! 


How about handling the MCAR by multiple-group analysis in WLSMV? I assume the MCAR patterns of nonmissing variables can be used to form separate groups of subjects. And then use MI for within-group MAR missing. 

Jen posted on Thursday, March 05, 2015  2:31 pm



Would the structural model for the group with missing variables just exclude those variables, then? And I would test whether various parameters differed across groups and hope not such that they could be constrained? The random groups are of course of no substantial interest. One very key variable theoretically is actually missing for 50% of the sample, but I am hoping to include that half of the sample to increase the precision of estimates of other paths in the model given our interest in indirect effects. Thank you for the suggestion. 


Right, the MCAR missing variables in a group would be excluded. 


Hello, What are my options for saving data from a two-level analysis constructed using 20 imputed values? I received the message: "The SAVE option is not available for TYPE=MONTECARLO or TYPE=IMPUTATION." Is there another way to save the level-2 score (cluster mean) I created from the 20 input data sets using the code below? "readcomp" is the variable I am reading in that has 20 plausible values (1 in each of the data sets analyzed and combined here). So I already have the 20 plausible values/imputations. I just want to get the level-2 average of them and save that calculated variable for further analysis. Thank you.

DATA: FILE IS "R:\proj\031815list.txt";
  TYPE = imputation;
VARIABLE: NAMES ARE teachID std_blck readcomp;
  USEVARIABLES ARE readcomp B_BYSCHL;
  USEOBSERVATIONS ARE std_blck eq 1;
  BETWEEN IS B_BYSCHL;
  WITHIN IS readcomp;
  CLUSTER IS teachID;
  MISSING IS .;
DEFINE: B_BYSCHL = CLUSTER_MEAN (readcomp);
ANALYSIS: TYPE = TWOLEVEL;
MODEL: %WITHIN% readcomp;
  %BETWEEN% [B_BYSCHL]; B_BYSCHL;
SAVEDATA: file is R:\proj\031915.dat;
OUTPUT: Sampstat STDYX TECH1 SAMP res; 


This would have to be done using a batch file on an Mplus input that has a single data set, one run for each imputed data set (i.e., not using TYPE=IMPUTATION). You can modify the RUNALL utility available on our website at http://statmodel.com/runutil.shtml. If you run into problems setting this up, please email your files to support@statmodel.com. 

Tom Booth posted on Sunday, March 22, 2015  10:19 am



Dear Linda/Bengt, I have a potentially very simple question I am just struggling to find an answer to. I wish to use multiple imputation, and then fit a model on those data sets, ideally in a single script. However, I want to use more variables for the imputations than I use in the model. When the extended list of variables (those I wish to use for imputation) is added, and the model syntax only uses a subset of the variables, fit is obviously poor as there is a large set of uncorrelated variables. Is there a way round this, or is this a case of needing to do the analysis in two stages? Best Tom 


You have to do it in two steps. 

Mike Zyphur posted on Wednesday, April 15, 2015  4:25 am



Hi Linda and Bengt, Some datasets have imputed values as separate variables rather than separate datasets (e.g., instead of 20 datasets, there is a single dataset wherein each variable with missing data is repeated 20 times). When this exists, is there any way to run Mplus so that instead of the "Data: Type = imputation" command, it is possible to indicate which range of variables have the imputed values for each variable? This is a shot in the dark, I realize, but would be a big help for an ongoing project. Thanks for your time and help! Mike Zyphur 


No, Mplus requires separate data sets. 
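
Since separate data sets are required, a wide file with repeated imputed columns has to be restructured outside Mplus first. The following is a minimal Python sketch under stated assumptions: the function name `wide_to_datasets` is hypothetical, the rows are assumed to be already loaded as dictionaries keyed by column name, and the imputed copies of a variable are assumed to be named with a numeric suffix such as x_1, x_2, ..., x_20 (adjust `sep` to match the actual naming).

```python
def wide_to_datasets(rows, complete_cols, imputed_bases, m, sep="_"):
    """Turn one wide data set into m separate imputed data sets.

    rows:          list of dicts, one per case, keyed by column name
    complete_cols: columns without imputed copies (e.g. an id variable)
    imputed_bases: base names of variables stored as base_1 ... base_m
    m:             number of imputations
    Returns a list of m data sets; each is a list of rows (lists of values)
    with columns ordered as complete_cols followed by imputed_bases.
    """
    datasets = []
    for k in range(1, m + 1):
        # Pick the k-th imputed copy of every imputed variable.
        cols = list(complete_cols) + [f"{base}{sep}{k}" for base in imputed_bases]
        datasets.append([[row[c] for c in cols] for row in rows])
    return datasets


# Made-up example: one imputed variable x with 2 imputations.
rows = [{"id": 1, "x_1": 5, "x_2": 6},
        {"id": 2, "x_1": 7, "x_2": 8}]
for k, data in enumerate(wide_to_datasets(rows, ["id"], ["x"], m=2), start=1):
    print(k, data)
```

Each returned data set would then be written to its own file, with a list file naming those files, before running the analysis with TYPE = IMPUTATION.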

dennis posted on Friday, April 17, 2015  3:19 pm



To follow protocol for reporting results, I am putting together a table of correlation/covariances, means, and standard deviations. I used multiple imputation with ULSMV. I am having a difficult time finding the means and standard deviation in the output. To report my means, do I use the number under “Means/Intercepts/Thresholds?” How do I get the standard deviations to report? 


For continuous variables, these values would be means. Standard deviations are the square root of the variances that are found on the diagonal of the covariance matrix. 
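The arithmetic in this reply can be sketched in a few lines of Python: standard deviations are the square roots of the diagonal of the covariance matrix, and correlations follow by dividing each covariance by the product of the two standard deviations. The function names are illustrative only, and the covariance matrix is assumed to have been copied out of the output by hand.

```python
import math

def sds_from_cov(cov):
    """Standard deviations: square roots of the diagonal variances."""
    return [math.sqrt(cov[i][i]) for i in range(len(cov))]

def corr_from_cov(cov):
    """Correlation matrix: cov[i][j] / (sd[i] * sd[j])."""
    sd = sds_from_cov(cov)
    n = len(cov)
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]
```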

dennis posted on Friday, April 17, 2015  9:34 pm



Ok, thanks! So, for ordinal variables, these values would be thresholds? Does it make sense to report thresholds for these variables as I would the means and sd for their continuous counterpart? 


You can do this. 


Hello, I am trying to conduct a CFA model with categorical and dichotomous variables using WLSMV. My dataset had a large number of missing values so I used multiple imputation. I have three questions: 1. Based on this thread it appears that fit indices (RMSEA, TLI, etc.) are not calculated for imputed datasets. Is this still the case? If so, how is model fit assessed for this type of data? 2. My original dataset had a sample size of 3,000 but after the imputation was run the analysis reads that there are only 1,500 observations. Is there something I missed about the imputation process? 3. Is there a way to get frequency tables for the imputed data, OR get a final dataset with the final imputed values so that they can be transferred into another program?


1. Yes. It cannot be assessed. 2. It sounds like you have more variable names than columns in the data set, causing two records to be needed for each observation. 3. You can save the imputed data sets. See the SAVE option of DATA IMPUTATION.


Thank you Dr. Muthen. I rechecked my variable names and columns and adjusted them. However, now I am still only getting 2785 observations. I also entered the following command in order to only get integer values and am still getting 3 decimal values in my output files. Rounding = cp2 cp4a cp4 np1 np2 cp5 cp6 cp7 cp8 cp9 cp13 cp21 it1 prot3 prot6 prot8 b10a b11 b13 b18 b21 b21a b31 b32 edcat q11 etid Vote inccat(0); Lastly, it looks like the values associated with my categorical data points have been changed in the imputation process. Categories that were labeled 1 are now labeled 0, 2 is now 1, etc. Is there any way that I can keep my original values? Thank you! 


Please send the relevant files and your license number to support@statmodel.com. 

CB posted on Tuesday, April 28, 2015  8:29 am



Hello, I'm running multiple imputation as part of LCA. I have specified the variables that I want included in the model in the USEVARIABLES command. However, I have variables that I don't want in the LCA but want to be used for the imputation; what code is needed to incorporate these variables? Thanks! 


Do the imputation analysis in a first, separate step before the LCA. 


Hello, I have three questions about my current analyses using TYPE=IMPUTATION for combining over 20 data sets, each with 1 plausible value estimate of student performance. I am running a TYPE=TWOLEVEL COMPLEX RANDOM model. 1) The 20 read-in data sets have complete data (no missingness), and 165,411 cases. But when I run my analyses in Mplus, the number of cases is 145,678. Why is that happening? 2) What does it mean that only 18 of my 20 requested replications are being completed successfully? Why would not all replications complete? 3) Several of my dichotomous variables are x variables on Level 1 and y variables on Level 2; as such, Mplus is treating them as y variables on both levels (a common warning message that I have seen before). This is OK with me, but should I therefore specify a WLSMV estimator rather than MLR? Thank you.


Hello, I have figured out the answer to #1. Some cases are missing the stratification variable. However, can you please inform me on issues 2 and 3? Thank you sincerely. Lisa


2. You would need to run those data sets separately to see what problem they have. 3. No. 

CB posted on Thursday, April 30, 2015  8:03 am



Thanks for the quick response! Is it possible to impute for a nominal variable in Mplus? I tried using TYPE=BASIC, which didn't work. Does TYPE need a different option to impute a nominal variable? As I mentioned before, I'm running multiple imputation as part of LCA and I want to perform multiple imputation using variables that I don't want in the LCA but included in the imputation. I'm following the suggestion, which is definitely appreciated, to run the imputation separately and before the LCA. However, is there a way to aggregate the results from all of the imputed datasets into one dataset and use that summary dataset to run in a single LCA? Is this a valid approach for imputation? 

Jon Heron posted on Friday, May 01, 2015  3:11 am



The question of how to combine LCA and MI is in my view unanswered. At least I hope this is still the case, since I am writing a grant on this at the minute! Lanza and Collins (LCA/LTA book) talk about issues surrounding a pre-LCA MI step, and indeed you do run into the problem of pooling the results across imputed datasets, particularly if in some of your imputed datasets a different solution may be supported or perhaps even the theory-driven solution is empirically non-identified. Another problem with a pre-LCA MI step is that the imputation is likely to be misspecified. For instance, if your goal is for latent C to moderate an X-Y relationship, it is not possible to add this to your imputation. The first source I have managed to find for a warning about pre-LCA imputation is Colder CR, Mehta P, et al. Health Psychol 2001; 20(2): 127-35. However, I did still carry out pre-LCA imputation in one of my own papers in 2011. My own feeling is that in some cases pre-LCA imputation will be fine, in some cases post-LCA imputation will be fine (i.e. shoehorning an imputation step between steps 2 and 3 of the bias-adjusted three-step method), and in other instances we will need to bite the bullet and do what I think of as concurrent LCA and imputation, i.e. MCMC. That is the essence of my grant. If anyone reading this has some spare cash, please drop me an email :)


Answer to CB: It's not clear to me what you want to do. Do you want to use the LCA model for imputing the additional variables? If so, you can save the LCA parameters using the BPARAMETERS option, use a fixed set of parameters 100 iterations apart, fix the LCA parameters to those, and impute one data set at a time. Nominal variables are not available in the imputations, but they are equivalent to categorical/ordinal if you are imputing only from the latent class variable.


Response to Jon: I think I mainly agree, but I wonder if one should distinguish between different roles for variables that have missingness/need MI. For LCA indicators I would not necessarily use MI but simply ML under MAR. For covariates predicting latent classes, I would perhaps use MI without worrying about latent classes. For distal outcomes I would probably not use MI but use ML-MAR, although if you don't want the distals to influence class membership, then perhaps you want to do an LCA-based MI for the distals, where that LCA doesn't necessarily have the same number of classes as the central one.

Jon Heron posted on Saturday, May 02, 2015  12:04 am



Thanks Bengt. Yes, my current draft describes appropriate options for each variable type in turn. If a C-on-X model is planned, I had been mulling over the idea of imputing covariates only and letting ML worry about class indicators. That would certainly reduce the variation in LCA solution across datasets and make it easier to pool results with confidence. best, Jon

CB posted on Monday, May 04, 2015  6:51 am



Thanks Jon and Dr. Muthen for your thoughts! Tihomir, my intention was to run multiple imputation as part of LCA. I had specified the variables I wanted to use in the LCA. I have missing data on an LCA indicator, so I wanted to impute them by using the variables specified for LCA in addition to other variables not in the LCA. Dr. Muthen had suggested that I perform the imputation separately from the LCA. Thus, my questions dealt with how to operationalize this. I did want to clarify your response though. If I'm imputing missing data from an LCA indicator, then the missing nominal variable cannot be imputed and is just left as missing then? 


Nominal variables can't be imputed in Mplus yet. 


Hello, I would like to report chi-square statistics with fit indices. But I used 10 multiple imputation datasets. MLM estimation method was used to handle nonnormality of one factor with 6 indicators (out of 5). So, Mplus output doesn't provide scaling factor. How do I report chi-square statistics? Thank you for your support and help.


The test of fit with imputation is available currently only for single level ML estimation with continuous variables. 


Thank you for your answer. So, what is your suggestion? I need to report chi-square statistics with BIC, AIC, aBIC, and so on.


These fit statistics have not yet been developed for the case of multiple imputation. If you need fit statistics, you need to use FIML for your missing data or listwise deletion. 


Thank you for your reply, Dr. Muthen. I would like to explain my paper more clearly and get some advice. One of the reviewers criticized our manuscript because we didn't use multiple imputation for missing cases. So, before running SEM, we imputed missing data using multiple imputation in SPSS with the EM algorithm and made parcel scores for a couple of latent constructs. Then we conducted SEM (5 latent constructs). As you know, I got fit indices and a chi-square statistic averaged across 10 datasets. Because I used MLM (robust to nonnormality, because one latent construct is not normal), I need to use the scaling factor to adjust the chi-square, but it is not provided in the output. My questions are: 1) Can I report fit indices averaged across 10 datasets in the manuscript? 2) If that is not possible, what are other options for my case? How can I make parcel scores and run SEM without multiple imputation? 3) If fit indices are okay to report, and reporting the chi-square is the only problem, what is your suggestion for reporting the chi-square statistic? Thank you for your answers in advance.


1) That would not be a good idea, as explained in terms of chi-square in our Topic 9 handout of 6/1/11, slides 210-216; see also the reference to the tech doc Asparouhov-Muthen (2010). 2) Don't use MI, use FIML. 3) No suggestion beyond the tech doc Asparouhov-Muthen (2010).


Thank you so much for your answer. So, let me be clear about this. When do I need to create parcel scores: before running SEM or within the SEM syntax (e.g., DEFINE)? Can FIML handle missing cases when I create parcel scores? Thanks,


I think you are asking how to handle parcels with missing data on items in the parcel. If so, this general analysis strategy question is best directed to SEMNET. 


I am trying to use mixture modelling with multiple imputation files. I do not get any statistics or an output file with class information. With FIML the entropy is 0.68; with the imputation command the entropy is 0.78, and the classes are similar in both methods. Is there a way, when I am using TYPE = MIXTURE MISSING, to tell the program to use x5-x7 for the imputations? DATA: FILE IS "xyz.dat"; VARIABLE: NAMES ARE x1 x2 x3 x4 x5 x6 x7; USEVARIABLES ARE x1 x2 x3; CLASSES = c(5); MISSING ARE ALL (999); ANALYSIS: TYPE = MIXTURE MISSING; LRTSTARTS = 0 0 50 20; STARTS = 500 200; MODEL: %OVERALL% i s | x1@0 x2@0.52 x3@2.2; i@0; s@0;


I use the following script in Mplus for multiple imputation: DATA: FILE IS "EPDS_For_imputations.dat"; VARIABLE: NAMES ARE x1 x2 x3 x4 x5 x6 x7; CATEGORICAL ARE x5; USEVARIABLES ARE x1 x2 x3 x4 x5 x6 x7; MISSING ARE ALL (999); DATA IMPUTATION: IMPUTE = x1 x2 x3; NDATASETS = 20; SAVE = impute*.dat; VALUES = x1-x3 (0-30); ANALYSIS: TYPE = BASIC;


TYPE=BASIC is for descriptive statistics only. Remove this to estimate a latent class model. Please limit posts to one window. 

John Woo posted on Sunday, August 16, 2015  9:17 pm



Hi, when I run different models using Type=imputation, Mplus does not seem to produce the usual warning/error messages in the output (e.g., latent variable psi matrix not positive definite). Is this correct? Does this mean that I should manually check for any obvious estimation issues (e.g., latent var corr greater than one)? Related, is it possible to have estimation issues (e.g., untrustworthy s.e., non positive definite psi, etc) with an individual imputed dataset and yet the final output across all imputed datasets appear to show no such issues? Thank you in advance. 


Add TECH9 to the OUTPUT command to get the messages. If there is a problem with an imputed data set, the analysis is not completed. You will see that in the output where it shows how many were completed. 

John Woo posted on Monday, August 17, 2015  12:08 pm



Thank you. I have a quick follow-up question. I used Mplus to generate five imputed datasets. TECH9 shows a warning for the psi matrix in just one of the five imputed datasets. What would be the protocol for dealing with an error/warning message in just a few of the many imputed datasets? Should I try to modify the model specification until all imputed datasets are error free? Or can I safely ignore (or even drop) the few problematic datasets? Would it mar the 'integrity' of multiple imputation if I ignore the few? Thank you again in advance.


I would run the analysis on the data set with the problem to get more information about the problem. 


I am estimating a structural equation model where the indicators of my latent variable are count data. I have a lot of missing data and I believe that multiple imputation will give me better estimation than FIML because it takes into account nonnormality better. I ran a multiple imputation model, where I declared some of the count variables as categorical. Some of the count variables have more than 20 values and so I could not declare them as categorical variables. The imputation seemed to run okay; however, when I tried to use TYPE IS IMPUTATION to use the imputed data sets, I got this error "Invalid symbol in data file: "*" at record #: 1, field #: 42" Sure enough, some of the variables have "*" in the fields in the *.dat files. Do you agree with the way I am modeling my missing data and can you help me figure out why I got * in my data files? Thanks, Jennie 


If the missing is on the count variables, I would use FIML. I don't see why Mult Imp would take nonnormality better into account. As for the second part, we would need to see your MI output and the data files, so send to Support along with license number. 


So FIML is robust to nonnormal data? I thought I'd just read that it was not. Now I can't find the reference. I have another question. Because of the complexities of our assessment, it often seems as if we have more missing than we actually do. We collect a lot of data every 3 years (wave data) and a subset of items are collected every year (annual). If the wave is done in that year, then the annual assessment is not done.The variables are created based on time of assessment. So a person will be missing on annual variables because the information is in the "wave" variable for that year. I could tediously recode the variables to avoid this, by deleting all the "wave" variables and recoding them into "annual" variables. What I am trying to figure out is, does it matter? Basically the same information is in the data set, but in the current coding system there is an extra variable with more missing data. 


With missing data, ML (FIML) assumes normality, but so does multiple imputation. Neither is perfectly robust to nonnormality, but if you don't have too low coverage you have a certain degree of robustness. As for your other question, you may want to take that one to SEMNET since it is more general.


If the indicators of my latent variable are counts with a lot of zeros, should I declare them as count variables or categorical? 


Counts with a lot of zeros can be fit by ZIP or NEGBIN, possibly with inflation added. See our handout for the Topic 2 short course on our web site, and also the mixtures discussed in Topic 5.


Hi all I have used Version 7.3 to create 20 imputed datasets on which I am running further analysis (measurement and structural). My CFAs on these run fine until I add a grouping variable (multi group analysis). They then will not run and I get an input error message but no information in the output to help me identify the problem. Is this because type=imputation and 'grouping is xxx' cannot be used together? 


Please send output to Support along with license number. 


I post here the solution to my own query in case it helps others. See previous post of same date. I found in looking at the imputed datasets that the program had recoded my grouping variable. Originally this variable was coded 1 / 2 and in the imputed datasets it was recoded as 0 / 1. So changing my syntax under 'grouping' was the solution. Thanks for the prompt response Bengt.


Hi all, I was wondering under which algorithm the H0 imputation operates? It is stated that it is a Bayes estimation. Does that mean that it is a simple data augmentation algorithm, or are the choices the same as for the H1 imputation? I want to compare different approaches for missing data and I could not find anything that specifies exactly what H0 does... Thanks for your help!


See the paper on our website: Asparouhov, T. & Muthén, B. (2010). Multiple imputation with Mplus. Technical Report. Version 2.

Andreas Wahl posted on Saturday, January 30, 2016  12:18 am



Thank you for your reply. I already read the paper, which is why I was wondering how H0 operates exactly and under which assumptions, as this is not stated. It says that H0 is a restricted imputation with a Bayesian estimator. Now I was wondering, as there are three options for H1 imputation (regression etc.), what is meant by this Bayesian estimator?


The 3 options are only for H1 imputation, specifying the model for the imputation. For H0 imputation you instead use the model in the MODEL command. That is, in the Bayesian MCMC iterations you not only view parameters as unknowns but also missing data. After convergence several more iterations are completed to give the missing data values. So the H0 imputations are based on the estimated H0 model specified in the MODEL command. 


Thank you very much for your response. This helps a lot and leads me to another question: Is it possible to impute the missing values with H0 imputation, save the data sets with the imputed values, and analyze the model afterwards with ML?


Yes, you would take an approach similar to the two steps of UG ex 11.8.

Andreas Wahl posted on Tuesday, February 02, 2016  11:01 pm



Thank you very much! This helps a lot. 

sun young posted on Friday, March 04, 2016  12:52 pm



Hi I imputed each group separately and got con* and treat* imputed data sets. Now I am trying to generate an imputed_all data by combining the two imputed group data. I've been searching on this website but couldn't find an answer. I would appreciate it if you advise me. Thank you very much. 


You can do multiple-group imputation in Mplus. I am not sure if you want to combine your data vertically or horizontally. The former is multiple-group and the latter can be done by the MERGE option.

sun young posted on Monday, March 07, 2016  12:11 pm



Thank you. I did try the former suggestion. I have a follow-up question. After creating 20 data sets, I used "imputed_list.txt" as the imputed data set for my analysis. When I opened this imputed_list.txt file, I found that it contains a list of 20 txt files. Can I actually obtain and see (or save) the completed data (as either a txt or csv file)? I am having a hard time defining variables from the imputed variables, so I thought I might do it in R or Stata and come back to Mplus for analysis. Thank you very much again.


The 20 data sets are also saved. 

sun young posted on Monday, March 07, 2016  2:50 pm



Thank you very much for the prompt response, Linda. You're right about the saved 20 data sets; I have them. I might have a misunderstanding about this MI process, but I am wondering how I can create one completed imputation data set from these 20 data sets in csv or txt form. If I understand it correctly, a complete imputation data set based on 20 data sets doesn't mean that I should merge those 20 data sets vertically so that the number of observations will be n*20. I appreciate your answer very much.


In Mplus, the imputed data sets must be in different files. If you want to use them in a program where there is a requirement to put them in one file, you will need to see how that program needs them and put the files together that way. 
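For programs that expect all imputations in one "long" file with an imputation index, the stacking can be sketched in Python as below. This is only an illustration under several assumptions: the list file names one data file per line, those names are relative to the list file's folder, each data file is whitespace-delimited with one record per line, and the function name stack_imputations is made up here. Whether a stacked file is what the other program wants should be checked against that program's documentation.

```python
from pathlib import Path

def stack_imputations(list_file):
    """Stack imputed data sets into one long table.

    Returns a list of rows; the first field of each row is the
    imputation number (1, 2, ...), followed by the original fields
    kept as strings.
    """
    list_path = Path(list_file)
    names = [ln.strip() for ln in list_path.read_text().splitlines() if ln.strip()]
    rows = []
    for imp, name in enumerate(names, start=1):
        # Assumption: data files sit next to the list file.
        for line in (list_path.parent / name).read_text().splitlines():
            if line.strip():
                rows.append([str(imp)] + line.split())
    return rows
```

The stacked rows are meant to be analyzed by imputation number and then pooled, not treated as a single sample of size n*20.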


Dear Dr Muthén, I am trying to impute some continuous and categorical variables, which are measured longitudinally. 1. I am not quite sure how to decide on the correct imputation model. 2. I tried 'sequential', just to get some experience using multiple imputation in Mplus, but I get the error message "Fatal error. Failure to generate trunckated normal deviate. The problem occurred in chain 2." I am not quite sure what this means. Thank you so much for your advice! Sincerely, Aurelie 


Use the default if you are not sure. 


Dear Dr Muthén, I have a dataset of over 2000 clients with 4 measurement moments, and two levels (families within therapists). 1) When I use the default imputation method, multiple imputations run without problems. However, I don't get a column with 'rate of missing'. Is there anything I should specify in the input file to receive those values? 2) Also, based on the paper by Asparouhov & Muthén (2010) 'Multiple imputation in Mplus' I would think that sequential regression would be the preferred method, as I use a combination of continuous and categorical data. Moreover, the paper by Enders (2015) 'Multilevel multiple imputation' suggests that sequential regression (in his paper 'chained regression') performs best in models with random slopes. My model of interest will use random slopes: ix sx | x@0 x@1 x@2 x@4; iy sy | y@0 y@1 y@2; iy on ix sx; sy on ix sx; Unfortunately, Mplus does not seem able to reach convergence when using sequential regression (after half a day, Mplus was still running iterations). Is this normal? Can I expect it to reach convergence at some point? What would be your advice? Thank you for your advice! Sincerely, Aurelie


1) You can analyze the imputed data as in user's guide example 13.13 with an unrestricted model such as: model: y1-y10 with y1-y10; Rate of missing refers to a model parameter, not a variable, and is specific to a model. 2) Use the default method.
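To show why a rate of missing information belongs to a parameter rather than a variable, here is a hedged Python sketch of the standard Rubin combining rules for a single parameter estimated on m imputed data sets. The function name pool_rubin is made up, the ratio returned is the simple large-sample version, and the exact quantity Mplus prints in its "rate of missing" column may be computed differently.

```python
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets (Rubin's rules).

    Returns (pooled estimate, pooled standard error,
    approximate fraction of missing information).
    """
    m = len(estimates)
    q_bar = mean(estimates)                      # pooled point estimate
    within = mean(se ** 2 for se in std_errors)  # average sampling variance
    between = variance(estimates)                # variance across data sets (divisor m - 1)
    total = within + (1 + 1 / m) * between
    fmi = (1 + 1 / m) * between / total          # simple large-sample ratio
    return q_bar, total ** 0.5, fmi
```

Because the between-imputation variance differs from parameter to parameter, the same data can give a high ratio for one parameter and a low one for another.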


Dear Dr. Asparouhov, Thank you for your reply. I managed to impute my data using the default method. However, the imputed data contain 'impossible' values. For example, one of my variables has a range between 50 and 100, but the imputed datasets contain values outside this range. Is this a problem? These variables are measured at multiple moments. I recode these variables into 1 new dichotomous variable consisting of decliners (if they fall below a certain cutoff during one of the measurement moments) and persisters (if they stay above the cutoff all of the time). Is it problematic that the imputed values are outside the 'possible' range in this situation? If so, is it possible to impute using a restriction on the range? Thank you for your advice! Sincerely, Aurelie


Are you sure you are reading the original data set correctly. Do a TYPE=BASIC with no MODEL command on the original data to check the sample size and descriptive statistics. 


Dear Dr Muthen, When I checked the data using TYPE=BASIC, I didn't see any problems. The means look OK, although they are slightly different from the means I get in SPSS. Does Mplus usually only impute observed values? Is it possible to specify within which range Mplus should impute? Thank you for your time! Aurelie


Mplus can impute latent variable values too, that is, factor scores. You can specify that a variable is categorical, e.g. binary. Range restrictions are not available for continuous variables but it is rarely a problem.


See the VALUES option of the DATA IMPUTATION command. You can restrict the values with this option. If you continue to have problems, send the files and your license number to support@statmodel.com. 


Thank you for your advice. The VALUES option seems to work. 


I used Mplus to conduct multiple imputation to estimate values of missing categorical likert scale survey data, using the Type=BASIC command (according to the example in ch.11). I was asked by a reviewer whether this method is "hot deck or cold deck" multiple imputation. My sense is that it is neither, since Mplus uses a Bayesian approach. Any suggestions for how to respond to the reviewer's comment? Thanks! 


It is not hot or cold deck. The method is described in the paper on our website under Papers, Bayesian Analysis: Asparouhov, T. & Muthén, B. (2010). Multiple imputation with Mplus. Technical Report. Version 2. It is also described in a shortened form in the UG, see the index.


Thank you. 


Hi, I've been through the examples and cannot find any examples of how to impute a multilevel dataset. I have individual level data clustered at the county level (with weights) and would like to impute the individual level missing values (there are no missing level two values). I am concerned that the clustering won't be accounted for in the MI. How could I set this up? Thanks! 


Sorry, I should have posted my code as well. This runs but I am unclear if I am appropriately accounting for clustering and weights. VARIABLE: NAMES = cntyid M80py Apr13pt pt10_13 zM80py zApr13pt zpt10_13 uid wtcnty08 sex age rhisp rnativam rasian rblack rnhopi rwhite frpl mj301 mj3010 rkmjmg pwmjw fwmjw; MISSING=.; CLUSTER = cntyid; WEIGHT = wtcnty08; USEVARIABLES ARE cntyid wtcnty08 sex age rhisp rnativam rasian rblack rnhopi rwhite frpl mj301 mj3010 rkmjmg pwmjw fwmjw; AUXILIARY= uid M80py Apr13pt pt10_13 zM80py zApr13pt zpt10_13; DATA IMPUTATION: IMPUTE = sex (c) age rhisp (c) rnativam (c) rasian (c) rblack (c) rnhopi (c) rwhite (c) frpl (c) mj301 (c) mj3010 (c) rkmjmg (c) pwmjw (c) fwmjw (c); NDATASETS = 40; SAVE = ncast*.dat; ANALYSIS: TYPE=BASIC TWOLEVEL; OUTPUT: TECH8; 


My apologies, I found it here: http://www.statmodel.com/discussion/messages/22/4640.html?1360259751 Thanks for providing such a useful resource!


Dear Dr Muthén, I imputed 40 datasets. When I run my analyses, it often happens that only a subset of those 40 are replicated. If the replicated datasets provide warnings, I use those warnings to adapt my model. Often I do end up with a model which replicates for all 40 datasets. But sometimes, it does not. 1) How problematic is it, if a model does not replicate for 1 or 2 of the 40 imputed datasets? 2) Is it appropriate to use warnings which do not appear in all 40 datasets to adapt the model? So, for example, if a certain path seems problematic according to the warnings of 10 or 15 datasets, would it be appropriate to remove the path in the model, which would then be run on all 40 datasets? Or is there a better way to go about 'solving' such warnings? Thank you! Sincerely, Aurelie 


Tough question. It's true that the nonconvergence of each of the 40 replicates is somewhat informative about the fragility of the model. But you may be deleting a theoretically and statistically important path that perhaps would have no problem if other parts of the model were correctly specified.


Hello, I am trying to impute covariates for a LCGA using multiple imputation. I used the following input: VARIABLE: NAMES ARE [all the variables in my dataset]; IDVARIABLE IS pid; USEVARIABLES ARE [all variables in my dataset except pid and the auxiliary variables] CATEGORICAL ARE [all categorical variables, both those to be imputed and those used for imputation]; AUXILIARY ARE [my outcome variables, which I want to keep in my dataset, but not use for imputation]; MISSING ALL (9999); DATA IMPUTATION: IMPUTE = (c) [the categorical variables which I want to impute]; NDATASETS = 10; SAVE = lcga_pred_imp*.dat; ANALYSIS: TYPE = BASIC; OUTPUT: TECH8; This works without error or warning messages, but when I subsequently want to run my LCGA with the imputed datasets, I get the following error: "*** ERROR Invalid symbol in data file: "*" at record #: 2, field #: 30" Does the imputation input look correct to you? Why does it produce these invalid records in the imputed datasets? Looking forward to your feedback. 


The imputed data sets have an asterisk as the missing data flag. You should be reading the imputed data sets according to the information about the saved data sets shown at the end of the output when they were saved.
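When the saved data sets are read outside Mplus (for example to inspect them before the next analysis step), the asterisk flag has to be handled explicitly. A minimal Python sketch, assuming whitespace-delimited records and using a made-up function name parse_record:

```python
def parse_record(line, missing_flag="*"):
    """Split one whitespace-delimited record into floats.

    Fields equal to the missing-data flag become None instead of
    raising a conversion error.
    """
    return [None if tok == missing_flag else float(tok) for tok in line.split()]
```

Within Mplus itself, the equivalent step is to declare the flag and variable order exactly as listed in the saved-data information at the end of the imputation output.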


Yes, now it works, thank you very much! 

Tom McDonald posted on Thursday, December 08, 2016  10:13 am



I am trying to run a CFA on 100 imputed files. The model runs on about half of the files, and for the other half I get the message that they "Did Not Result in a Completed Replication". I'm not sure of the next step in this process.


Please send input, output and original data to Support along with your license number. 

Tibor Zin posted on Friday, December 16, 2016  2:43 am



Dear Dr Muthén, I would like to ask a question about fit indices with imputed data. I have two-wave data with 2400 observations, but 1400 people did not participate in the second wave (missing at random). I have imputed 25 datasets and conducted a multiple-group analysis. My problem is that with the maximum likelihood estimator, the following error appeared: THE CHI-SQUARE COULD NOT BE COMPUTED. THIS MAY BE DUE TO AN INSUFFICIENT NUMBER OF IMPUTATIONS OR A LARGE AMOUNT OF MISSING DATA. Increasing the number of imputations does not help. However, using MLR does. Could you please tell me where the problem is? Many thanks!


It sounds like you first do imputation and then do ML on those data. I'd suggest simply using ML on the original data. If this doesn't help, please send your outputs to Support along with your license number. 

Tibor Zin posted on Sunday, December 18, 2016  8:11 am



Thank you for fast reply! I am sorry but I did not provide all information. I am not using latent variables but only observed variables. I believe that in this case all missing data would be excluded if I use ML. 


Just use ML. I don't see why imputation is needed. 

Tibor Zin posted on Tuesday, December 20, 2016  5:10 am



I think that I should apply MI because I use 6 variables from the first wave (no missing) and 2 variables from the second wave (60% attrition), but 4 variables were created by subtracting a variable in the first wave from a variable in the second wave. Thus, when the second wave data are missing, this variable also has a missing value. But if I use MI instead of FIML, I can put into the imputations the original variables from which these variables were created. In other words, I lose more information using FIML.


Ok. But you can also use Auxiliary(M) with ML. 

Tibor Zin posted on Tuesday, December 20, 2016  3:05 pm



Thanks for the advice! But should I specify that these auxiliary variables would predict their counterparts? Is it possible to specify the influence of auxiliary variables in Mplus? Let’s say that one auxiliary variable would predict missing values of one observed variable and the second auxiliary variable would predict missing values of the second observed variable? 


I am referring to the automatic approach of Auxiliary (M), but you can also do it "by hand" in your own setup. We discuss this in one of our topics related to missing data in the set of short courses on the web. 


I am just wondering if there is any way to do multiple imputation such that the result is one dataset (consisting of the averages for each value of each variable across my many imputed datasets). I ask because I would like to create pre-post change variables (Wave 2 response - Wave 1 response), and it seems like I would need to do so after imputation . . . and based on particular values found in one dataset. Note: I already have pre-post change values in cases where I had Wave 1 and Wave 2 responses. Those were created in SPSS prior to the thought of doing multiple imputation, and prior to me bringing the data into Mplus. I ask the question above because I assume that I should NOT impute values for missing data for this variable (given that those missing values are missing because Wave 2 data are missing, which would now be in the process of imputation themselves). If imputing pre-post change values is somehow appropriate, please let me know because that solves my problem. Thoughts? Thank you in advance for your help.


Second question of the day (my first question is above at 7:08 on December 22): When doing multiple imputation, how do I constrain the valid range of the imputed values? For example, I have many 1-7 Likert scales in my dataset, but I noticed that values of 8 and above were being imputed when 7 is the max. I'd like to set the range to 1-7. Thanks! 


I would recommend going a step back, before any data processing is done, and performing the imputation on the raw/original data. You can then use the Mplus DEFINE command to form the differences. You can average the imputed data sets, but I have not seen that done before. On your second question, you have two options: a. Impute the variable from a model treating the variable as categorical: "data imputation: impute=Y(c);" means that Y will be imputed as a categorical variable. See user's guide example 11.5. b. Round off to particular values, see user's guide page 521: "data imputation: values=Y(1-7);" 
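
The two-step workflow suggested above might look like the sketch below. It is only an illustration under assumptions: the file names, the variable names (w1, w2, x1, x2, change), the missing-value code, the number of data sets and the analysis model are all hypothetical. Step 1 imputes the raw wave scores; step 2 forms the difference with DEFINE in each imputed data set and pools the results.

```
TITLE:    Step 1 (sketch): impute the raw wave scores
DATA:     FILE = raw.dat;               ! hypothetical file name
VARIABLE: NAMES = w1 w2 x1 x2;
          USEVARIABLES = w1 w2 x1 x2;
          MISSING = ALL (999);          ! assumed missing-value code
DATA IMPUTATION:
          IMPUTE = w2 (c);              ! option a: impute as categorical
          ! option b instead: IMPUTE = w2;  VALUES = w2 (1-7);
          NDATASETS = 20;               ! arbitrary choice
          SAVE = imp*.dat;              ! writes imp1.dat, imp2.dat, ...
ANALYSIS: TYPE = BASIC;
```

```
TITLE:    Step 2 (sketch): analyze the imputed data sets
DATA:     FILE = implist.dat;           ! list file written in step 1
          TYPE = IMPUTATION;
VARIABLE: NAMES = w1 w2 x1 x2;          ! check the order in step 1 output
          USEVARIABLES = x1 x2 change;
DEFINE:   change = w2 - w1;             ! computed in each imputed data set
MODEL:    change ON x1 x2;              ! placeholder analysis model
```

Because the change score is computed from imputed raw scores, it never has to be imputed itself. The order of variables in the saved files should be taken from the step 1 output rather than assumed.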


Thank you very much, Tihomir. Very helpful. Quick update and some new questions: 1. Regarding the pre-post change variables, I ended up using the DEFINE command after imputation to create them (Wave 2 - Wave 1) and it seemed to average from the multiple imputed datasets just fine. Very exciting. 2. NEW QUESTION: I found the VALUES option and used it as you described. For some reason, it keeps telling me I have an "Unknown variable in VALUE option," despite the fact that the variable is also where it needs to be above in the input file (NAME, USEVARIABLE, IMPUTE). I also found and used the ROUNDING option (for other reasons) and I got the same outcome for the same variable: "Unknown variable in ROUNDING option." What am I not understanding? 3. NEW QUESTION: When I open up one of my newly imputed datasets in Notepad++, I am able to scroll right far enough so that each variable (222 total) can have its own column. However, in Notepad and in the Mplus data viewer, both programs run out of room to the right, so the values continue on the next line. This messes up the column alignment, places values where they should not be, etc. I was actually able to "successfully" run my model for the imputed datasets, but there were various WARNINGS that were clearly related to the columns being misaligned, and the results were clearly off. Thoughts on this? Thank you very much in advance! 


Please send your output and one imputed data set to Support along with your license number. 

PoYi Chen posted on Sunday, February 19, 2017  11:05 pm



Dear Dr. Muthen, I have a question about the confidence intervals obtained from the "model constraint" + "data imputation" commands. I got CIs for the newly defined parameters in my model after imputation by using the code below. However, I wonder how I should report these CIs. Is it correct for me to say these CIs are obtained by Mplus after pooling the standard errors from the delta method, or are they the average of the CIs across imputed data sets calculated by Mplus? Code: data imputation: IMPUTE = v4 (c) v5 (c); MODEL: f1 by v1 (L11) v2 (L21) v3 (L31) v4 (L41) v5 (L51); model b: f1 by v1 (L12) v2 (L22) v3 (L32) v4 (L42) v5 (L52); model constraint: new( IL2 IL3 IL4 IL5); IL2 = L21 - L22; IL3 = L31 - L32; IL4 = L41 - L42; IL5 = L51 - L52; analysis: OUTPUT: Cinterval; Results New/Additional Parameters IL2 0.118 0.069 0.044 0.086 0.217 0.242 0.291 


Dr. Muthen, I am attempting to run a mediation model with covariates using imputed data. I used the MODEL constraint to examine the indirect effect. The model terminates normally, and the means of the variables and sample size are correct, but the S.E.'s are all identical and every variable in the model is significant. Am I missing a step? Below is my input: Variable: Names are Health3w ELA age sex race educ inc10 health2w fsumal; Missing are all (999); Usevariables are Health3w ELA age sex race educ inc10 health2w fsumal; ANALYSIS: Bootstrap is 1000; Model: Fsumal on ELA age sex race educ inc10 health2w (a); health3w on Fsumal ELA age sex race educ inc10 health2w (b); Model CONSTRAINT: New(ab); ab=a*b; output: standardized; 


Here is the answer to PoYi Chen: They are NOT the average of CI across imputed data sets calculated by Mplus. It is correct to say that these CI are obtained by Mplus after pooling the standard errors from the delta method. The pooling is the standard imputation pooling, see bottom of page 3 https://www.statmodel.com/download/MI7.pdf The delta method is used for each imputation before pooling. 
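
The pooling described above can be written out as follows. This is a sketch of the standard multiple-imputation combining rules in assumed notation (none of these symbols appear in the thread): $m$ is the number of imputed data sets, and $\hat\theta_j$ and $SE_j$ are the estimate of a NEW parameter (for example IL2 = L21 - L22) and its delta-method standard error in imputed data set $j$.

```latex
% Delta method within imputed data set j, for a difference of two loadings
SE_j^2 = \widehat{\mathrm{Var}}_j(L_{21}) + \widehat{\mathrm{Var}}_j(L_{22})
         - 2\,\widehat{\mathrm{Cov}}_j(L_{21}, L_{22})

% Pooled point estimate: average over the m imputed data sets
\bar\theta = \frac{1}{m} \sum_{j=1}^{m} \hat\theta_j

% Within-imputation and between-imputation variance
W = \frac{1}{m} \sum_{j=1}^{m} SE_j^2 ,
\qquad
B = \frac{1}{m-1} \sum_{j=1}^{m} \bigl(\hat\theta_j - \bar\theta\bigr)^2

% Pooled standard error
SE_{\mathrm{pooled}} = \sqrt{\, W + \Bigl(1 + \frac{1}{m}\Bigr) B \,}
```

The reported interval is then built around $\bar\theta$ using $SE_{\mathrm{pooled}}$, which is why it is not an average of the $m$ separate intervals.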


Julia Sheffler: Please send your output to Support along with your license number. 
