Anonymous posted on Friday, October 29, 1999  11:42 am



Does Mplus impute values for those that are missing? 


No, Mplus does not impute values for those that are missing. It uses all data that is available to estimate the model using full information maximum likelihood. Each parameter is estimated directly without first filling in missing data values for each individual. 


I am having difficulty getting Mplus to converge on H1 (and thus to get a chi-square test) for a missing-data latent growth curve model, even when I fiddle with the starting values and convergence criteria. It runs fine when I do not ask for "type = missing h1;" but then I can't get the chi-square. Am I missing some fundamental piece of the puzzle? 


If you give the Mplus statement type=missing h1, the program first estimates H1 and then H0. You may want to first do a type=basic missing run. The H1 estimation that this leads to can be difficult if there are large percentages of missing data; see the Covariance Coverage output. Starting values are not needed for H1. You can try to sharpen the convergence criterion as described in the User's Guide. 
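
As a rough illustration of what the Covariance Coverage output summarizes, the sketch below computes, for each pair of variables, the proportion of cases in which both are observed. The data values and the use of None as the missing-value marker are assumptions made for this example only; it is not Mplus code and is not meant to reproduce Mplus output exactly.

```python
from itertools import combinations_with_replacement


def covariance_coverage(rows):
    """Return {(i, j): proportion of cases with variables i and j both observed}.

    rows is a list of equal-length lists; None marks a missing value
    (a convention of this sketch, not of Mplus).
    """
    n_cases = len(rows)
    n_vars = len(rows[0])
    coverage = {}
    for i, j in combinations_with_replacement(range(n_vars), 2):
        both = sum(1 for r in rows if r[i] is not None and r[j] is not None)
        coverage[(i, j)] = both / n_cases
    return coverage


# Hypothetical data: three repeated measures with dropout after wave 1 or 2.
data = [
    [1.0, 2.0, 3.0],
    [1.5, 2.5, None],
    [0.5, None, None],
    [2.0, 1.0, 4.0],
]
cov = covariance_coverage(data)
lowest = min(cov.values())  # the value to compare against a coverage limit
print(cov)
print("lowest coverage:", lowest)
```

Pairs with low coverage contribute little information about the corresponding covariance, which is one reason the H1 estimation can be slow to converge.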

Anonymous posted on Wednesday, February 14, 2001  9:07 am



I have data that do not fit the assumptions Mplus imposes for SEM with missing data so I am using a multivariate, multiple imputation approach such as that advocated by Little and Rubin. My question is whether the coefficients and standard errors generated by the Mplus WLSMV estimator present particular problems for those planning on combining results from several separate imputed data sets. 


As far as I understand, combining estimates and s.e.'s from analyses of multiply imputed data could be done in the usual way also when using WLSMV. 
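
For readers unfamiliar with "the usual way", the sketch below shows the standard combining rules (Rubin's rules) for one parameter: the pooled estimate is the mean across imputed data sets, and the pooled variance adds the average squared standard error to the between-imputation variance inflated by (1 + 1/m). The estimates and standard errors are made-up numbers, and the function name is hypothetical; this is not an Mplus feature.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                       # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                 # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical results for a single coefficient from five imputed data sets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]
pooled_est, pooled_se = pool_rubin(estimates, std_errors)
print(pooled_est, pooled_se)
```

The same function would be applied to each parameter in turn, using the estimates and standard errors saved from the separate runs.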

Anonymous posted on Thursday, May 10, 2001  1:25 pm



In the manual, you point out that "Mplus has two special data handling features when data are missing because of the design of the study." I understand that using the "not by design" missing data features, models assume that data are missing at random or missing completely at random. I have a data problem that doesn't technically seem to fit either scenario. The study is looking at drug/alcohol treatment over time. It follows 2 cohorts of over 1,300 adults at baseline, 6 months, 18 months, 24 months, and 36 months. Because of funding constraints only 1 cohort (n about 700) was interviewed at 18 months. Both cohorts were interviewed at each of the other waves. I am wondering whether we should simply drop the entire 18-month wave from a growth curve model or whether we can somehow include the existing data from the 1 cohort that was interviewed. Technically, the missing cohort at 18 months was not missing at random and it did not seem to be similar to the missing-by-design examples either. In addition, because this is a longitudinal study, there are data at waves other than 18 months that are missing, but these are more defensibly considered “missing at random.” Any advice? Thanks in advance. 


I would not get rid of the data for 18 months. Measuring one cohort only at 18 months constitutes missing by design and is MCAR. You also have attrition which may be MAR but I couldn't comment on that. I would analyze all of the data using TYPE=MISSING. This is if your outcomes are continuous. TYPE=MISSING is not available for categorical outcomes. 

Anonymous posted on Friday, May 11, 2001  1:55 pm



I guess it is MCAR (i.e., a random event caused it). I wasn't thinking of it as such since I was wondering if any existing differences between the 2 cohorts would pose a problem in the missing data estimation. But it was not cohort differences that "caused" the missing data, just the flip of the cohort coin. Your advice is very helpful. Thanks! And yes, we do have attrition too that we would consider MAR. 

Mike W posted on Monday, September 10, 2001  9:42 am



I'm interested in analyzing repeated measures data using a latent growth curve model. The data come from a complex sampling design (individual cases have sampling weights), are nonnormal, AND have both planned/unplanned missing values (i.e., cohort sequential design/sample attrition). I'm interested in using 3 of Mplus' features: 1. complex sampling (type=complex) 2. FIML missing data estimation 3. Robust estimators (MLM, MLMV, WLSM, WLSMV). My question is whether these 3 features can be used in conjunction. If not, I'm wondering if it would make sense to do multiple imputation on the missing data, and then use the complex sampling & robust estimators in conjunction? I appreciate any thoughts you may have. MW 


We are having an update in about two weeks that will include crossing TYPE=COMPLEX with MISSING AND MIXTURE, but weights will not be allowed at this time. You can do a one class mixture and thereby cross complex without weights and missing. Perhaps that might help. The estimator is MLR. MLR has maximum likelihood estimates and robust standard errors. This may help. Otherwise, multiple imputation would be the way to go. 

Anonymous posted on Wednesday, February 20, 2002  4:20 am



Does Mplus have any methods to estimate a model when the missing data is nonignorable? 


Work is being done in this area but nothing definitive is available at this time. A possibility is to use the pattern mixture approach (see Little's 1995 JASA paper), using covariates, a multiple group approach, or a mixture approach. 


I am trying to estimate the Diggle and Kenward (1994) model in order to account for nonignorable missing data. This would, however, require estimating a multiple-groups model with heterogeneous structures (as it is called in AMOS) using MPLUS. Thus, different groups should be allowed to have different variables included in the analyses (e.g., in a longitudinal setting with one outcome variable measured up to six times, for group1 outcome1 would be modeled, for group2 outcome1 and outcome2 would be modeled, etc.). I did not find such an option in MPLUS, so my question is: Is it possible to estimate a multiple-groups model with heterogeneous structures in MPLUS (which would also be helpful for "pattern mixture models")? 

bmuthen posted on Tuesday, March 26, 2002  8:37 am



In the Diggle and Kenward approach, Mplus would need to model the growth part among the "y's" and the missingness as a function of previous observed y's among the "u's", where the quotation marks are used to refer to the general model parts of the Mplus framework. The D & K approach therefore needs to be able to allow missingness on y's as well as "u ON y" (logit) regressions. This combination cannot be done in the current Mplus, but is planned for version 3 due out early 2003. The pattern mixture approach, however, would seem to be possible to carry out in the current Mplus. This would not use the regular multiple-group track because that requires nonzero variance for equal numbers of observed variables in the groups, which is not present here due to missingness. Instead the mixture track (type=mixture) would be used. In the mixture track, there is no problem due to missing data and zero variance for y's. The groups corresponding to the dropout patterns can be represented by latent classes ("c"), where the known class membership is handled using the "training data" feature. The growth model parameters for the y's can then be allowed to vary across classes (groups, patterns) to the extent desired. We can help with trying this approach. 

Anonymous posted on Thursday, May 23, 2002  7:09 am



As I understand it, the new analyses in Mplus 2.1 assume MAR, but they use White corrected S.E.s, which I thought assume MCAR. Am I mistaken in my assumption? 


Theory supports the fact that the corrected standard errors (sandwich or White) for missing data are correct under MCAR with nonnormal data. For normal data, they are correct under MAR. We have found that these corrected standard errors also work better than regular standard errors under MAR and nonnormality. However, there appears to be no theory to support this (see Yuan and Bentler in Soc Methods, 2000). 

Sherry Chen posted on Wednesday, June 12, 2002  11:07 am



I am trying to replicate the table on page 27 of Allison's Missing Data using Mplus. I have trouble specifying the model correctly. In AMOS syntax, the model is like this:
gradrat = () + csat + .. + (1) error
act (correlated with) error
This implies that variable act is also correlated with all the independent variables in the model. How should I specify the same model in Mplus? I have tried the following and a few variations of it, but they don't seem to work:
VARIABLE: NAMES ARE csat act stufac gradrat rmbrd private lenroll ;
USEVARIABLES ARE csat act stufac gradrat rmbrd private lenroll ;
MISSING ARE ALL (999);
MODEL: gradrat on csat stufac rmbrd private lenroll ;
act with gradrat csat stufac rmbrd private lenroll;
ANALYSIS: TYPE IS MISSING H1 Meanstructure;
ESTIMATOR IS ML;
CONVERGENCE = 0.005;
COVERAGE = 0.10;
Another try is to specify the model as follows. The results are very close, but I don't think it is the same model:
MODEL: gradrat on csat stufac rmbrd private lenroll ;
act on gradrat csat stufac rmbrd private lenroll; 


Could you send the AMOS output and the two Mplus outputs to support@statmodel.com? If you can send the data, that would also be helpful. 


Thank you for sending the outputs. The correct model in Mplus is the one using the WITH statements. The reason the answers did not agree is that this model did not converge. I added two starting values for the variables GRADRAT and CSAT, which have large variances, and the model converged to the same solution as AMOS. 


The following input does what I want to do:
TITLE: Modified from http://www.statmodel.com/mplus/examples/categorical/cat4.html
DATA: FILE IS wmimicd.dat;
VARIABLE: NAMES ARE x1-x3 y1-y16;
USEV = y6-y10;
CATEGORICAL = y6-y10;
GROUPING = x3 (1 = groupA 2 = groupB);
ANALYSIS: TYPE = MGROUP MEANSTRUCTURE;
MODEL: f1 BY y6-y10;
OUTPUT: standardized;
What changes do I need to make in this input file when y10 is missing by design in groupB (e.g., Type = mixture missing)? Also, are there any fit indices unavailable after respecifying this missing data problem as a mixture analysis? 

Anonymous posted on Tuesday, July 09, 2002  9:26 am



You can split groupB into two groups, one group for the observations with y10 present and the other with y10 missing. Add f1 by y10@0 for the last group and use type = mgroup meanstructure. Type = mixture missing is not going to give you what you want. 

bmuthen posted on Monday, July 15, 2002  4:16 pm



The idea of the solution proposed above is good, namely that y10 should not influence the fitting function in the last group where it is missing, but there seem to be two complications. Mplus will complain that y10 has zero variance in the last group. This can be circumvented by letting one person have a different value for y10 in the missing data group to give a quasi-nonzero y10 variance. Also, I think the weight matrix will be singular with zero variance and I don't know its quality if a quasi-nonzero variance is introduced for y10. I don't know if some other trick can be used. Categorical missing facilities are forthcoming in future Mplus versions. 

Anonymous posted on Thursday, July 25, 2002  7:24 am



Hello. I am working on an LCA model with some missing data, and I would appreciate some advice on its behavior. The (binary) latent class indicators include 5 behaviors measured at each of 2 post-treatment follow-ups (for a total of 10 indicators). About 40% of the sample were interviewed at the short-term follow-up, but not the long-term assessment. Some non-interviews were by design, and some represent attrition. I also have several pre-treatment covariates I am using to predict the classes. There are several points I am wrestling with:
1. The "Test for MCAR..." is clearly nonsignificant. What practical effect should this have on modeling strategy?
2. Group membership changes noticeably when I add predictors. I suspect many "changers" are individuals with only 1 interview, because they simply have less "u" information available for classification, increasing the importance of their "x" information. If this is true, is it reasonable to believe that the LCA with covariates is more likely to be the "correct" model? Should decisions about the number of classes be made in the presence of the covariates?
3. To test for possible nonignorable missingness, is it appropriate in the context of LCA to try a "pattern mixture" approach (in the spirit of Little or Hedeker & Gibbons)? That is, adding a "missing interview" indicator and interaction terms to the set of covariates.
Thank you in advance for any suggestions. 

bmuthen posted on Friday, July 26, 2002  10:05 am



Good questions.
Re 1, your MCAR test would seem to suggest that you can feel more comfortable using the ML approach that you are using. The ML approach is correct under the less strict MAR assumption. So, having support for MCAR is comforting, but I don't see that it changes your modeling strategy.
Re 2, changing group membership may point to a misspecification. If in the true model the predictors influence only class membership and not the latent class indicators directly, then you should get statistically the same membership with and without predictors in the model. But if the true model has some direct effects of predictors on latent class indicators, then class membership will change when including predictors but not allowing for the direct effects. The solution is to examine the need for direct effects by including one at a time and looking at chi-square differences (2*logL differences). It is also correct that predictors help to better determine class membership when the latent class indicator information is not strong, but with a correctly specified model this additional information should not cause essential changes in membership.
Re 3, yes, a pattern mixture approach could be useful here. 

Anonymous posted on Wednesday, July 31, 2002  6:54 am



If I am trying to run a Discrete-Time Survival Analysis, but I have missing data in my X values, is the only way for me to estimate a model with missing data to use a program such as NORM and impute the missing values? 

bmuthen posted on Wednesday, July 31, 2002  9:38 am



Yes, unless the x variables are such that they do not influence class membership, in which case they can be turned into "y variables" (for which missingness is handled) by referring to a parameter for x (e.g. its mean). 

Anonymous posted on Saturday, November 02, 2002  2:54 pm



I need help running an EFA with missing data. Missingness is due to use of a 3-form design for 180 participants; data from 49 additional participants who completed either the first or second half of the 64-item set are also included. Covariance coverages range from .262 to .633. I used the following code (my first MPLUS experience):
Data: file is deedataII.txt;
Variable: names are I1-I94;
Usevariables are I1-I64;
Missing = .;
ANALYSIS: TYPE IS EFA 1 5 MISSING;
ESTIMATOR = ML;
H1ITERATIONS = 500;
H1CONVERGENCE = 0.0001;
COVERAGE = 0.10;
I have tried lowering the coverage criterion to .08, running the model with up to 16 of the 64 variables of interest deleted, eliminating the H1ITERATIONS & H1CONVERGENCE statements, and using analysis type missing basic. The messages I get go something like this:
THE MISSING DATA EM ALGORITHM FOR THE H1 MODEL HAS NOT CONVERGED WITH RESPECT TO THE LOGLIKELIHOOD FUNCTION. THIS COULD BE DUE TO LOW COVARIANCE COVERAGE OR A NOT SUFFICIENTLY STRICT EM PARAMETER CONVERGENCE CRITERION. CHECK THE COVARIANCE COVERAGE, OR SHARPEN THE EM PARAMETER CONVERGENCE CRITERION, OR RERUN WITHOUT H1 TO OBTAIN H0 PARAMETER ESTIMATES AND STANDARD ERRORS.
NOTE THAT THE NUMBER OF H1 PARAMETERS (MEANS, VARIANCES, AND COVARIANCES) IS GREATER THAN THE NUMBER OF OBSERVATIONS.
NUMBER OF H1 PARAMETERS : 2144
NUMBER OF OBSERVATIONS : 229
I think that the covariance coverage is adequate. How do I go about changing the convergence criterion or running the model without H1? 
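
The parameter count in the message above can be checked by hand: an unrestricted (H1) model for p observed variables has p means plus p(p + 1)/2 variances and covariances. A small sketch of that arithmetic follows (the function name is hypothetical):

```python
def h1_parameter_count(p):
    """Free parameters in an unrestricted mean and covariance model for p variables."""
    means = p
    variances_and_covariances = p * (p + 1) // 2
    return means + variances_and_covariances


n_h1 = h1_parameter_count(64)  # 64 means + 2080 variances and covariances
n_obs = 229
print(n_h1, n_h1 > n_obs)
```

With 64 items this gives 2144, the figure reported in the note, which is far more than the 229 observations available.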

bmuthen posted on Sunday, November 03, 2002  8:38 am



You can try sharpening the H1convergence criterion to, say, 0.00001. One question is whether your missingness is by design; you mention a 3-form design. If so, there may be alternative approaches. 

Anonymous posted on Sunday, November 03, 2002  10:11 am



In response to your question, yes, the missingness is by design; 180 participants completed 3-form design questionnaires containing 2 of the three subsets of items. I have additional data from 49 participants, each of whom completed half of the 64 items of interest. What options does this give me? 

bmuthen posted on Sunday, November 03, 2002  5:38 pm



Here is an answer about what one can do in principle with missing by design, without claiming that this is how you should try to do your analysis. If I understand your design correctly, apart from the 49 subjects, there are 3 groups of subjects, each of which has missingness on parts of the variables. In a CFA, these 3 groups could be handled via multiple-group modeling where in each group only the reduced set of variables actually observed in the group would be considered, so that each group would only have missingness that is not by design. This would be an analysis with only about 2/3 of the variables and therefore perhaps less heavy. This approach has 2 complications for you. One, it is not clear how to handle the 49 participants since each group needs to have the same number of variables. Two, you want to do an EFA. Regarding your analysis, what is the lowest coverage value that gets printed? 

Anonymous posted on Sunday, November 03, 2002  6:11 pm



The lowest covariance coverage is .262. 

bmuthen posted on Tuesday, November 05, 2002  8:42 am



Have you had any success using the sharpened H1convergence criterion? If not, perhaps you want to send your input and data to Mplus support so they can help. 

Anonymous posted on Saturday, December 07, 2002  12:36 pm



I am thinking about using multiple imputation with data on which I am doing a structural equation model. The outcome variable in this model is dichotomous, which limits my options for handling missing data. I am considering using multiple imputation, and am wondering how to approach doing this in Mplus. I can create my multiple data sets in other software packages. I have read on your website about the RUNALL facility. Would it make sense to run the analyses with that? Also, are there any other features in Mplus that might be useful in this, including anything that would combine the results from the multiple runs to give final estimates? (Or is that step something I need to do by hand?) Thanks. 


RUNALL would be the way to go. Results for the analysis of each data set are saved in an ASCII file which can subsequently be analyzed to obtain means of the parameter estimates etc. 

anonymous posted on Monday, January 27, 2003  7:11 pm



I would like to use the MLR option across multiple imputations. Because I have no missing data I am not specifying type=missing. When I try to run the model I get a message telling me that the MLR estimator is not available with type=general. Is MLR only available if you have missing data? 


You can say TYPE=MISSING even if you don't have any missing data. Then you will be able to use MLR. 

Anonymous posted on Tuesday, June 03, 2003  4:02 pm



My sense is that Mplus can only account for data missing on Y variables. Is this because the computation is too intensive to include imputation on X's, or because it's empirically incorrect to impute on Y's and X's at the same time? I ask because I've noticed that one of the HLM packages allows multiple imputations on X and Y in the same model run. This would appear to imply that such models borrow information from the X's to impute Y's (and vice versa). Will Mplus allow for imputation on X's and Y's in the near future (version 3.0)? Thanks. 

bmuthen posted on Tuesday, June 03, 2003  5:33 pm



Modeling typically concerns a specification of the distribution of y | x (y conditional on x), whereas the marginal distribution of x is not involved in the model. When there is missing on x, a model for the marginal x part needs to be added. This is true for imputations as well as other modeling. That's why missingness on x's changes the picture and is not trivial: it calls for an extended model that may be hard to specify realistically. I am not clear on what type of x modeling HLM does for imputations in the x part; I am not sure that this is stated, so please let me know if I am wrong. Mplus does not do imputations, but handles missing data in a general way using ML under MAR. Mplus can handle missing on x's if they are brought into the model as "y's". This is done automatically in some tracks of the program (such as non-mixture, non-categorical). In other tracks, x's can be moved into the y set by mentioning parameters related to them in the model. Missing on x is then handled by a normality model for the x's. Normality may not be suitable if x's are, say, binary and skewed. In Schafer's imputation programs, missingness on categorical x's is handled by loglinear modeling. Mplus Version 3 will have more facilities related to missingness on categorical variables and missingness for variables that have random slopes. 


Dear Dr. Muthen, Would it be possible to estimate in Mplus 1. multilevel model with missing data 2. multilevel SEM with missing data? For question 2, I would appreciate your recommended reference. Thank you. Yongyun Shin 


Both 1 and 2 can be estimated in the current version of Mplus. These techniques have been available since Version 2.1, which came out in May 2002. The use of these techniques is described in the Addendum to the Mplus User's Guide, which can be found at www.statmodel.com under Product Support. More features are coming in Version 3 in the Fall. The Mplus techniques for multilevel SEM with missing data are described in a paper that we will be happy to make available at the end of the Summer. We are not aware of any other references on this topic. 


I am using Mplus to perform a stepwise regression analysis. I am using Mplus because some of the data are missing (N=363). For Step 1, the Mplus syntax is:
Title: Injury analysis;
Data: file is "filename";
Variable: Names are DV IV1 IV2 IV3 IV4 IV5;
Usevariables are DV IV1;
Missing are all (999);
Analysis: Type = H1 Meanstructure missing;
Model: DV on IV1;
Output: Standardized;
The output shows:
Chi-square test of model fit for the baseline model
Value 3.080
DF 1
P-Value 0.0000
(R sq = 0.011)
For Step 2:
Title: Injury analysis;
Data: file is "filename";
Variable: Names are DV IV1 IV2 IV3 IV4 IV5;
Usevariables are DV IV1 IV2 IV3 IV4 IV5;
Missing are all (999);
Analysis: Type = H1 Meanstructure missing;
Model: DV on IV1 IV2 IV3 IV4 IV5;
Output: Standardized;
The output shows:
Chi-square test of model fit for the baseline model
Value 17.652
DF 5
P-Value 0.0000
(R sq = 0.062)
To calculate the significance of the R sq change (0.062 - 0.011 = 0.051) can I simply calculate the change in Chi sq (17.652 - 3.080 = 14.572), the change in DF (5 - 1 = 4) and conclude that the R sq change is significant @ p<.01? (The critical value of Chi sq for 4 df and p < .01 is 13.28.) Or am I on the wrong track completely?? For your advice please, Peter Elliott 


A chi-square difference test cannot be used to determine whether an R-square difference is significant. It can be used to see if parameters in nested models are significant. For example, you could compare Model: DV on IV1 IV2 IV3 IV4 IV5; to Model: DV on IV1 IV2@0 IV3@0 IV4@0 IV5@0; to determine if the four covariates in the model are jointly significant. 
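
A minimal sketch of the nested-model comparison described above: the difference in chi-square between the restricted model (four slopes fixed at zero) and the full model is referred to a chi-square distribution with degrees of freedom equal to the difference in df. The fit statistics below are hypothetical, not taken from any output in this thread, and the sketch assumes ordinary ML chi-square values; scaled statistics from robust estimators generally need a corrected difference test instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from the closed form for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # ... and step the shape parameter up by one until it reaches df / 2.
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_difference_test(chi2_restricted, df_restricted, chi2_full, df_full):
    """Return (chi-square difference, df difference, p-value) for nested models."""
    diff = chi2_restricted - chi2_full
    ddf = df_restricted - df_full
    return diff, ddf, chi2_sf(diff, ddf)


# Hypothetical values: the full regression is just-identified (chi-square 0, df 0);
# fixing four slopes at zero gives a restricted model with 4 df.
diff, ddf, p = chi2_difference_test(15.2, 4, 0.0, 0)
print(diff, ddf, p)
```

A small p-value here would indicate that the four covariates are jointly significant.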


You are so quick and so helpful. Thank you Linda. Best wishes, Peter Elliott 

Anonymous posted on Monday, June 28, 2004  1:34 pm



Hi there, I wish to run a multigroup analysis (women vs. men) using the missing option. 1. Would my analysis code be type = missing h1; or type = missing h1 mgroup? 2. Can I compare nested models (e.g., restricting covariances to be equal across groups) using a chi-square difference test when using the missing command? 3. When I run an mgroup analysis (not specifying missing) leaving the 'estimator=' blank, Mplus uses the ML estimator. Can I always trust what Mplus picks? For example, I have some categorical IVs and some categorical indicators of a latent variable. Thanks in advance! 

bmuthen posted on Monday, June 28, 2004  3:19 pm



1. The former. 2. Yes 3. With categorical factor indicators Mplus defaults to WLSMV  and yes, the defaults have good reasons behind them. 

Anonymous posted on Tuesday, June 29, 2004  7:20 am



I am trying to use missing data analysis to run a simple path model. However, Mplus only analyzes cases with no missing data (i.e., listwise). How can I use FIML in this situation? Thanks!
Mplus VERSION 3.01
MUTHEN & MUTHEN
06/29/2004 10:00 AM
INPUT INSTRUCTIONS
TITLE: PATH ANALYSIS
DATA: FILE IS C:\a1.DAT;
VARIABLE: NAMES ARE id x1 x2 x3 y x4;
CATEGORICAL ARE y;
USEVARIABLES ARE x2 x3 y x4;
missing are x2 x3 y x4 (9);
analysis: type = basic missing;
MODEL: y ON x2 x3 x4;
OUTPUT: stand tech1;
*** WARNING Data set contains cases with missing on all variables. These cases were not included in the analysis. Number of cases with missing on all variables: 115
*** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 318
*** WARNING Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis. Number of cases with missing on all variables except x-variables: 209
3 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS 

bmuthen posted on Tuesday, June 29, 2004  9:50 am



Regression on x's does not include a model for the x distribution; the model concerns the y outcome conditional on the x's. To handle missing data on x's, you need to expand the model to include a model for the x's, e.g. assuming normality. This can be done in several ways. One way is to first do a multiple imputation step outside Mplus. Note that Mplus can take multiply imputed data as input. Another way is to include the x's in the model in Mplus; this is done by mentioning, say, their variances: x2-x4; You can use ML estimation by the Analysis option estimator = ml; in which case the missing data on your 3 x's results in 3 dimensions of numerical integration. 


Hello: I am running a parallel process latent growth curve model in version 3.11 (3 equally spaced time points) involving two outcomes measured continuously (depression and smoking). The latent intercepts and slopes are regressed on two x's: gender and number of siblings. The data are nested (individuals nested within schools). There are missing values on both the Y's and on the sibling "X" variable. The model appears to run fine (no warnings or error messages) and generates results that make sense. However, when I attempt to evaluate the plausibility of the model for girls and boys separately, I get a warning message that states "data set contains cases with missing on x-variables. These cases were not included in the analysis". Below is the syntax used for the latent growth model using the multiple group procedure:
USEVAR=DEPPSH1 DEPPSH2 DEPPSH3 CIGS1 CIGS2 CIGS3 NUMSIB;
GROUPING IS SEX (1=MALE 0=FEMALE);
MISSING=BLANK;
CLUSTER=SCHID;
ANALYSIS: TYPE=MISSING H1 MEANSTRUCTURE COMPLEX;
MODEL: I1 BY DEPPSH1-DEPPSH3@1;
S1 BY DEPPSH1@0 DEPPSH2@1 DEPPSH3@2;
I2 BY CIGS1-CIGS3@1;
S2 BY CIGS1@0 CIGS2@1 CIGS3@2;
S1 ON I2;
S2 ON I1;
[DEPPSH1-DEPPSH3@0 I1-S1];
[CIGS1-CIGS3@0 I2-S2];
I1 WITH S1;
I2 WITH S2;
I1 WITH I2;
S1 WITH S2;
I1 ON NUMSIB;
S1 ON NUMSIB;
I2 ON NUMSIB;
S2 ON NUMSIB;
OUTPUT: STANDARDIZED;
Am I making an error somewhere in the syntax? Does Mplus offer FIML for LGC models where there is missing data on both the Y's and the X's in the context of running complex models (nested data) and multiple group comparisons? I look forward to your response. D.J. DeWit 


I don't see how you would not have missing on x's when you read the full data set and have missing on x's when you look at part of the data set. If you send the two outputs, the one that worked and the one that didn't, and the data, to support@statmodel.com, I can figure this out. Regarding missing on x's, the following is from Chapter 1 in the Mplus User's Guide: "In all models, missingness is not allowed for the observed covariates because they are not part of the model. The outcomes are modeled conditional on the covariates and the covariates have no distributional assumption. Covariate missingness can be modeled if the covariates are explicitly brought into the model and given a distributional assumption." 

Anonymous posted on Tuesday, October 26, 2004  3:01 am



I have a question concerning missing data. I am constructing a two-level model. Some of my within-level variables have missing values. Should I specify that type is twolevel missing or is it not necessary? I am using MLR as the estimator. Thank you! 


If you want the model estimated using all observations, you must use the MISSING keyword as part of the TYPE option. The default is listwise deletion. 


I had a general question about using the H1 feature with missing data present. The manual (p.361) says that H1 allows the estimation of an "unrestricted mean and covariance model with TYPE=MISSING." Appendix 6 of the technical manual says basically the same thing: that the H1 model does not restrict ug or sigmag. I am just wondering if this could be unpacked a little because I'm trying to understand what's going on underneath the hood. Thanks! 


The unrestricted model is the model with all means, variances, and covariances of the observed variables free. There are no restrictions on any of the parameters. It is the H1 model. The reason that it is not automatically estimated with TYPE=MISSING in all cases is that it can be slow and is needed only to compute chi-square. So Mplus has it as an option. Without it, you will get parameter estimates and standard errors but not chi-square. 

Anonymous posted on Thursday, January 27, 2005  8:33 am



I have a model where I am testing for invariance of structural paths across gender in the multiple-group context (all observed, continuous variables) but I am concerned that I have data that are NMAR. One of my endogenous variables is frequency/quantity of alcohol use and I have strong reason to believe that missingness on alcohol use is related to true levels of alcohol use. Consistent with suggestions made in earlier postings, I have constructed training data to represent 4 gender X missing data groups (i.e., males w/complete data, males w/incomplete data, etc.; missing data patterns are too sparse for additional patterns). In order to get weighted averages for structural coefficients (and intercepts) across missing data patterns (as you would via Hedeker/Gibbons 97), I have constrained all parameters to equality within gender for all models (i.e., male incompletes and male incompletes equated, female incompletes and female incompletes equated). My base model would be the fully constrained model (all parameters equated across all 4 groups). In order to test for invariance across gender, I allowed males to differ from females (but maintained equality constraints between the within-gender missing data groupings). I then used 2(deviance1 - deviance2) for single df X^2 difference tests for invariance across gender. I wanted to know if a) this approach to pattern mixture modeling is generally defensible, b) could I compare the deviance from a fully saturated model against my base model so I can give an indication of "model fit" (and hand calculate RMSEA and the like) and c) if this is defensible, is there a citation to justify this approach specifically in SEM other than Hedeker/Gibbons 97 or Little 93 (i.e., Muthen/Brown 01; is this manuscript available)? 

bmuthen posted on Thursday, January 27, 2005  1:12 pm



A pattern-mixture approach of this kind seems generally ok, but I need some clarification. You seem to have a typo in the parenthesis at the end of the last paragraph, and the second parenthesis of the second paragraph sounds strange to me. It seems like you want to test gender invariance while allowing for missing data differences within each gender. Also, does Hedeker's work give formulas for weighting regression coefficients across the missing data groups? I have not seen a reference to pattern-mixture for SEM. Muthen-Brown is still not available and is focused on actually letting latent variables predict missingness. 

Anonymous posted on Thursday, January 27, 2005  2:09 pm



Yes, I am looking to test gender invariance and adjust for differences across the missing data groups. Hedeker does give formulas for weighting regression coefficients across missing data groups. He does this by weighting the estimates by the observed proportions among the missing data groups in his 97 Psych Methods paper (an illustration of a conditional LGM with NMAR dropout in Proc Mixed). Here is the link to the .pdf: http://tigger.uic.edu/%7Ehedeker/RRMPAT.pdf. Formulas 12 and 14 are the formulas for the weighted estimates and standard errors, respectively. The corresponding dataset and SAS IML program that performs the matrix operation described therein is at http://tigger.uic.edu/~hedeker/ml.html. I simulated longitudinal data that were NMAR for 2 missing data patterns and analyzed the simulated data both in Proc Mixed/IML with his approach and in GGMM in Mplus with the two missing data groups identified w/training data, and got nearly identical estimates. So the approach seemed viable, but I did not want to move forward without consultation. I have had great difficulty finding a published analog to Hedeker's approach in SEM and had wondered whether Muthen-Brown was the SEM analog, but I also wanted your take on the approach before going forward. 
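The idea of averaging pattern-specific estimates by the observed pattern proportions can be sketched like this. It is a simplified illustration and not a transcription of formulas 12 and 14 in the paper: the function name and numbers are hypothetical, and the standard error below treats the proportions as fixed, so it leaves out any contribution from sampling variability in the proportions themselves.

```python
import math

def pool_across_patterns(estimates, std_errors, counts):
    """Average per-pattern estimates, weighting by observed pattern proportions."""
    total = sum(counts)
    weights = [n / total for n in counts]
    pooled = sum(w * b for w, b in zip(weights, estimates))
    # Simplifying assumption: weights are fixed constants and the
    # pattern-specific estimates are independent of one another.
    variance = sum((w * se) ** 2 for w, se in zip(weights, std_errors))
    return pooled, math.sqrt(variance)

# Hypothetical slope estimates for completers (n=300) and dropouts (n=100).
pooled, se = pool_across_patterns([0.50, 0.20], [0.10, 0.20], [300, 100])
print(round(pooled, 3), round(se, 4))
```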

BMuthen posted on Friday, January 28, 2005  2:42 pm



I will look at this when I am back in the country, after February 2. It looks like you're on the right track and may be the first to do pattern-mixture in SEM. 

Anonymous posted on Friday, January 28, 2005  4:58 pm



Thank you so much, Bengt. I look forward to hearing more from you when you are back. Bidding you safe travels on your return to LA. 

Anonymous posted on Friday, February 04, 2005  3:03 am



Hello and thanks for a great program. I'm running a simple linear regression analysis in Mplus 3 where I want to correct the standard errors for the design effect (two-level structure) as well as estimate this model with missing data. The "problem" is that I have missing data on both the dependent variable and several of the independent variables. In another posting on this page you wrote: "Regression on x's does not include a model for the x distribution, but the model concerns the y outcome conditional on the x's. To handle missing data on x's, you need to expand the model to include a model for the x's, e.g. assuming normality. This can be done in several ways. One way is to first do a multiple imputation step outside Mplus. Note that Mplus can take multiply imputed data as input. Another way is to include the x's in the model in Mplus; this is done by mentioning say their variances: x2-x4;" This leaves me a bit confused about exactly how to write it in syntax. Could you please take a look at my syntax and see if this is the correct way to write a model where we have missing data in a complex design (where there is also missingness on the independent variables) as well as correcting the standard errors?

Names: The names of the variables in the dataset;
Missing = All (99);
centering GRANDMEAN(age gender education);
cluster is klnr1;
Usevariables are antisos age gender education singlem1 singlem2 stepf JPC singlef;
Analysis: type=complex; type=missing;
Model: antisos on age gender education singlem1 singlem2 stepf JPC singlef;
age gender education singlem1 singlem2 stepf JPC singlef;

In this model we wish to see how six different family structures, expressed as five dummy variables (singlem1 singlem2 stepf JPC singlef), predict antisos after we have controlled for age, gender, and education. I have no missing data on the dummy variables, only on the control variables and the dependent variable. Is this the correct way to do it? 
Do you think it's best to impute the missing data with multiple imputation (NORM) before using Mplus, or to include the x's in the model in Mplus by mentioning their variances (as I think I have done here)? And related: is it ok to impute missing data with NORM and use the imputed data sets in Mplus 3 even when you have a nested data set? Thank you in advance. 


I think this is the correct approach. I don't think NORM handles clustered data. You may have to correlate the observed variables using the WITH option. I'm not sure of the default. You could, however, remove the dummy variables from the variance list given that they have no missing data. 

Anonymous posted on Wednesday, March 09, 2005  11:32 am



What approach does Mplus use to compute the S.E.'s for the H1 model with missing data? Is it using the observed information matrix evaluated at the final estimates? 


The observed information matrix is used. With ML and MLR, there is an option to use the expected information matrix. 

Anonymous posted on Friday, March 18, 2005  10:02 pm



I have longitudinal survey data at 5 time points. I am interested in using multiple imputation to handle missing data. I plan on using available data from time 1 for the imputation model used to impute values for the missing values at time 1. I would like to use the imputed (i.e., complete) data from time 1 to help impute the missing data in time 2, and so on. I have two main questions about this. First, would you recommend doing a single imputation for each wave of data? Otherwise, I would have, say, m=5 imputed data sets for time 1, and then it is not clear how I would go about using time 1 to help impute the time 2 data. Second, do you have recommendations about whether to use individual items vs. scale scores in the imputation model? I would like to have complete item-level data (for subsequent factor analyses), not just complete scale scores (for path analyses, for example). I have seen examples of multiple imputation and they all use scale scores in the model. Is it ever appropriate to use individual items in the imputation model? I can't seem to find anything about this in the literature. Thank you. 

bmuthen posted on Saturday, March 19, 2005  4:55 am



In principle, a good approach would be to use item-level data for all 5 time points jointly, perhaps adding covariates, analyzing these variables by ML under the usual MAR assumption. This approach is certainly doable on the scale score (or IRT theta) level. But perhaps the approaches you discuss are motivated by this ML-MAR approach involving too many variables when working on the item level. Perhaps that is why you suggest a time-by-time approach. However, the use of complete (partly imputed) data from time 1 for imputing values for time 2 does not seem like a good approach to me since it is acting as if the imputed time 1 data are real. And a single imputation would not give the desired result of multiple imputation, which is to show the true variability. Staying with the idea of imputing item-level data for each time point separately, it seems feasible to do this using observed (not imputed) data on items (and covariates) at all other time points. I am not familiar with literature on these matters. 

Anonymous posted on Saturday, March 19, 2005  10:49 am



Thank you for your quick reply. When you say "use the item-level data for all 5 time points jointly, perhaps adding covariates," isn't it the case that the covariates would already be factored in due to having all the survey items in the imputation model already? So I'm not sure what you mean here. Are you saying that if I want to use depression info as part of the model to impute values for missing anxiety scale items, I should use the depression scale score instead of the depression scale items? Also, just to clarify, you think it is appropriate to use data collected at subsequent time points to impute values from previous time points? (I'm not arguing against the view, just want to clarify.) Would this still be the case if there is a reason to expect measurements to change over time (e.g., some of the participants belong to an anxiety treatment group)? Thank you again. 

bmuthen posted on Saturday, March 19, 2005  4:19 pm



When I mentioned covariates I was making a distinction between background variables (e.g. demographics) and the (test?) items. It sounds like you are calling all of these variables "survey items", so we were probably just using different vocabulary. So my answer is no to your question at the end of your first paragraph. Regarding your second paragraph, yes, my inclination would be to use any variable that might be correlated with the items with missing data. 

Anonymous posted on Saturday, March 19, 2005  5:12 pm



Thanks again. This has been very helpful. 

Anonymous posted on Tuesday, April 05, 2005  6:51 pm



I am using GGMM to analyse a longitudinal dataset with missing values. It seems that if "missing" is specified in the variable and analysis commands, the FIML method will be used and the default algorithm is EM, i.e. the observed log likelihood will be maximized. Am I right about this? In the output I got a warning saying that the Fisher information matrix and standard error matrix related to some parameters cannot be inverted. What does this imply in general? One more question: is MCMC ever considered in Mplus when dealing with latent variables and missing values? 

BMuthen posted on Wednesday, April 06, 2005  3:16 am



Yes, you are correct about your first question. Regarding the information matrix, this implies that your model is not identified. Ask for TECH1 to see which parameter causes this. The current version of Mplus does not include MCMC analysis. 

Anonymous posted on Thursday, April 07, 2005  10:13 am



Many thanks for your prompt answer. 

Anonymous posted on Wednesday, June 01, 2005  10:23 am



Hi there, I am using type = missing h1 (with the ML estimator) for a structural equation model using all continuous variables (latent and manifest). I am trying to provide a brief description of what missing h1 does for a manuscript. I read the manual but was confused. Could you provide a brief description for inclusion in the manuscript? Thanks in advance, Courtney 

bmuthen posted on Wednesday, June 01, 2005  5:56 pm



"Missing H1" says that we want to do ML estimation of an unconstrained (saturated) covariance matrix for the observed variables, taking missingness into account under the MAR assumption (see the Little & Rubin book). This ML-MAR estimation is carried out using the EM algorithm in line with the Little & Rubin book. The estimated covariance matrix is used to compute a chi-square test of model fit, comparing H0 to H1. 

Anonymous posted on Saturday, June 04, 2005  2:29 pm



Is it appropriate to use H1 with outcomes that are all categorical? That is, using theta parameterization and the WLSMV estimator? Thanks. 


Yes, it is. There is a table in Chapter 15 of the Mplus User's Guide that shows which TYPE options are available for various estimators and outcome scales. See ESTIMATOR in the index of the User's Guide to find this table. 


Is there any new information available on implementing the Hedeker-Gibbons pattern-mixture approach in Mplus? (See the 1/27/05 post above.) 

BMuthen posted on Wednesday, June 22, 2005  12:15 am



No. Is there anything in particular you would like to know? 


Dear Dr. Muthen, I am running a path analysis with 7 IVs at Time 1 predicting 2 DVs at Time 2. I have some missing data (not a huge amount; the coverage is around .9 for all variables). I have specified Type = missing h1 under the analysis command. I have the following questions: 1. Does this missing data command take into account ALL variables that are listed under NAMES ARE, or does it use only the variables that are listed in USEVARIABLES ARE? 2. If the latter is true, how do I go about bringing other variables into the missing value analysis (for instance, relevant covariates listed in the NAMES ARE list)? 3. One of the Time 1 variables is gender. The way I have written the syntax now, I have just listed gender after the ON statement. Should I specify that gender is categorical? If so, how do I write that in the syntax? Thank you in advance, and thanks for a wonderful help page. 


1. USEVARIABLES only. 2. The USEVARIABLES list should contain all variables in the MODEL command, both independent and dependent variables. 3. You should not place independent variables on the CATEGORICAL list. This list is for dependent variables only. 

Anonymous posted on Wednesday, July 06, 2005  9:47 am



Dear Dr. Muthen, I have a question about missing value treatment. If I want to use FIML instead of the EM algorithm, how can I do that? Analysis=missing? According to your previous response related to missing value treatment, "Missing H1" says that we want to do ML estimation of an unconstrained (saturated) covariance matrix for the observed variables, taking missingness into account under the MAR assumption (see the Little & Rubin book). This ML-MAR estimation is carried out using the EM algorithm in line with the Little & Rubin book. Do you think that the MCMC option in LISREL does the same thing as multiple imputation under NORM or SAS proc MI? Thank you very much! 

bmuthen posted on Wednesday, July 06, 2005  4:44 pm



FIML is an estimator and EM is one algorithm for computing FIML estimates. Other algorithms include Quasi-Newton, Fisher Scoring, and Newton-Raphson. Mplus uses the EM algorithm for the unrestricted H1 models and the other algorithms for H0 models. Saying Analysis type = missing implies using all available data. With FIML this is the standard "MAR" approach to missingness. MCMC stands for Markov Chain Monte Carlo. I don't know how the LISREL approach relates to NORM. 
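To make "using all available data" concrete, here is a minimal sketch of the observed-data log-likelihood for two normally distributed variables. It only illustrates the general casewise idea behind ML under MAR and is not Mplus's implementation; the function name is hypothetical and missing values are represented by None. Whichever algorithm is used (EM, Quasi-Newton, and so on), this is the kind of function being maximized.

```python
import math

def observed_loglik(rows, mean, cov):
    """Bivariate-normal log-likelihood where each case uses only its observed values."""
    m1, m2 = mean
    s11, s12 = cov[0]
    s22 = cov[1][1]
    det = s11 * s22 - s12 * s12
    total = 0.0
    for y1, y2 in rows:
        if y1 is not None and y2 is not None:
            # Complete case: full bivariate normal density.
            d1, d2 = y1 - m1, y2 - m2
            quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
            total += -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
        elif y1 is not None:
            # Only y1 observed: marginal normal density of y1.
            total += -0.5 * (math.log(2.0 * math.pi * s11) + (y1 - m1) ** 2 / s11)
        elif y2 is not None:
            # Only y2 observed: marginal normal density of y2.
            total += -0.5 * (math.log(2.0 * math.pi * s22) + (y2 - m2) ** 2 / s22)
        # Cases with both values missing contribute nothing.
    return total

rows = [(0.0, 0.0), (0.0, None), (None, None)]
print(observed_loglik(rows, (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))))
```

No value is filled in at any point: an incomplete case simply contributes the density of whatever it has observed.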

Ad Vermulst posted on Friday, October 07, 2005  1:42 am



Dear Bengt/Linda, I am using type=missing H1 in combination with the WLSMV estimator for ordered categorical dependent variables. Can you tell me how Mplus 3 deals with missing values in this situation? I have read appendix 6 of your technical appendices, but this appendix is restricted to normally distributed y-variables. Maybe you can give me a lit. reference? Thank you very much. Ad Vermulst 

bmuthen posted on Saturday, October 08, 2005  2:07 pm



See the missing data section of Chapter 1 of the Version 3 User's Guide, which has the same content as the intro paragraphs for the Missing Data topic here on Mplus Discussion. Essentially, pairwise information is used with categorical outcomes using the WLSMV estimator. 

Reetu posted on Wednesday, December 14, 2005  2:19 pm



I am trying to do an exploratory factor analysis with both categorical and continuous variables. I have missing values in both, and I'm getting an error telling me that I can only use the missing option if all my dependent variables are continuous. Is there a way of getting around this? How should I treat my categorical missing values? 


I'm not sure which version of the program you are using but I just tried this in Version 3.13 and it is fine. 

Reetu posted on Wednesday, December 14, 2005  2:58 pm



I'm using version 2.14. 


That does not have missing data estimation for categorical outcomes. This came out in Version 3. 

Annonymous posted on Wednesday, January 11, 2006  10:50 am



Is the missing data estimation for categorical outcomes appropriate even if it does not appear that the data is MAR or MCAR? How can one test to know for certain if the data is not missing at random? 

bmuthen posted on Wednesday, January 11, 2006  11:02 am



There is no test for MAR. If one suspects ways in which MAR is violated, nonignorable missing data modeling can be attempted to see if results differ. Although it is not always easy, you can do nonignorable modeling in Mplus; see for example the model diagrams posted at http://www.gseis.ucla.edu/faculty/muthen/ED231e/Handouts/Lecture17.pdf 

Annonymous posted on Wednesday, January 11, 2006  11:23 am



I'm finding those diagrams a bit hard to follow. Can you explain in words what the nonignorable missing data approach is, and what the Mplus code would look like? 

william ryan posted on Tuesday, February 07, 2006  10:13 am



Hi, out there: Full information maximum likelihood and multiple imputation are clearly superior to ad hoc approaches. I am debating which one to use for my path analyses. Does anyone know if MI has a clear advantage over FIML? 

bmuthen posted on Tuesday, February 07, 2006  6:24 pm



The approaches should give about the same results. I have heard Joe Schafer say that if you can do FIML, do it. MI is mostly intended for when it is too hard to do FIML. 


I wanted to get a sense for whether or not there is a mathematical and/or conceptual relationship between three approaches to the modeling of nonignorable missingness. The first two are: a) MI where the missing data pattern indicators are included (along with the variables of interest) in the imputation model but only the variables of interest are included in the analysis model (Schafer, 2003, Stat. Neerlandica) and b) FIML with auxiliary variables where the missingness indicators are additional outcomes predicted by the IV(s) of interest (along with the DVs of substantive interest) with residual correlations between the missingness indicator(s) and the substantive DVs (Graham, 2003, SEM). I came across Schafer's (2003) suggestion on a simple approach to pattern mixture modeling where he says, in contrast to traditional PMMs, "...this process of averaging the results across patterns may be carried out by MI. Suppose that we generate imputations Y1mis.....YMmis under a pattern mixture model. Once these imputations exist, we may forget about "R" (the missing data pattern indicators) and use the imputed datasets to estimate the parameters of P(Ycom) directly." (bottom of p. 27; the link to the paper on Schafer's site is http://www.stat.psu.edu/~jls/reprints/schafer_2003_neerlandica.pdf). Using R in the imputation model and throwing it out in the analysis model sounded very much like using R as a special-case auxiliary variable a la Graham (2003). In Graham (2003), Collins et al. (2001, Psych. Methods), and elsewhere, the equivalence between MI with auxiliary variables and FIML with auxiliary variables is either discussed or illustrated. 
But one of the key models suggested by Graham (2003) (the correlated residuals model described above) looked very similar to a third model (Muthen/Jo/Brown 03 JASA; specifically the model on page 6 of your lecture17.pdf) except for two things: a) mixtures of longitudinal trajectories (which is not an important difference per se) and b) latent missing data classes (e.g., CU in the diagram) that are correlated with (or at least account for differences in conditional means on) the growth parameters. Now to my real question: assuming the same model structure of interest across the two approaches (e.g., single-population LGM), is it safe or reasonable to say that Graham's (2003) model is a special case of your "CU" JASA model where missing data pattern class is "known" (or at least captured with observed measure(s) of missing data class)? 


I like the Schafer and Graham (2002) Psych Methods paper and their discussion of MI and FIML. Consider cases where you have variables (z, say) that relate to missing data and that don't belong in your model of interest for x and y. With MI you would use z in the imputation model but not in the analysis model. With FIML you would use z as extra y variables that are freely related to y and x. The modeling with the missing data indicators (u, say, as in Lecture 17) is different. If you have MAR, modeling the u's in an unrestricted way in addition to x and y gives the same ML results as analyzing x and y only (ignorability of missingness). Modeling the u's aims to handle nonignorability. Lecture 17 suggests several possible alternatives for doing such u modeling. Page 6 that you point to tries to simplify the u structure. This relates to pattern-mixture modeling, where you have to use all missingness patterns as covariates. The pattern-mixture model essentially corresponds to a latent class model (the model with cu) that has as many classes as there are missing data patterns. With a latent class model for u, you essentially reduce the number of patterns to the number of classes. You can combine the u modeling idea of Lecture 17 with the z modeling idea above. 


Thanks so much for your response, Bengt; it was very helpful. I had an additional question on u modeling in general and cu modeling in particular. Other approaches to NMAR have a mechanism for (weighted) averaging of parms and se's across the missingness patterns, such as hand-calculation, equality constraints (e.g., Allison 87, MKH 87), combining via matrix manipulation (HG 97), or the multiple imputation approach to NMAR that Schafer discusses in the .pdf linked above. For CU modeling of NMAR, it seems like constraining the estimates to get a weighted average of the covariate -> growth parameter effects (i.e., X -> I, X -> S) is no problem (of course, modeling X -> I and X -> S only in the %overall% part of the model is less code to do the same thing). But it also seems like if one is interested in getting a weighted averaged estimate of the growth parameter intercepts (GPIs) across all the latent missing data groups (E[I | X, CU] and E[S | X, CU]), you may not be able to estimate them directly in the analysis, because if you constrain the GP intercepts to equality, the problem reduces back to an MAR solution: constraining the GPIs in each CU class to equality eliminates the relation between the growth parameters and CU, which seems like the very part of the model that handles nonignorability. But if you allow the GPIs to vary across CU, you do not get a single (weighted averaged) estimate. Is my understanding of this off-base? If so, any additional guidance you could provide would definitely be appreciated. If this is not off-base, then would you recommend hand-calculation of the weighted average if one was interested in inferences on the GPIs? 


I think your understanding is correct. You don't want to hold these parameters equal across classes, and this does lead to the problem of how one presents the results mixing over classes. I don't think this is resolved; it needs research. On the other hand, with a cu approach you have fewer patterns (the number of classes), and therefore perhaps you are interested in presenting the results for each class by itself without weighting (mixing) them together. The classes may be so fundamentally different that you would rather treat them separately. 


I was wondering what missing data strategy you would recommend for a small longitudinal SEM model? More specifically, I ran a SEM model in which there were 58 subjects at the first time point and only 50 subjects at the second time point (i.e. 8 subjects had missing data). I ran the model two ways (1) with listwise cases deleted and (2) with the means in the place of the missing data. Both models fit the data almost equally well and the same paths were significant in both models. Is the listwise strategy more rigorous than running the model with the means? Does this depend on the percentage of the sample that is missing data? Should I run the model another way? 


What happens if you use TYPE=MISSING in the ANALYSIS command? I think this would be far better than using the means. 


That works, my missing data are computed. Do you recommend using EM or regression imputation? 


Mplus uses the EM algorithm for ML estimation under the "MAR" assumption; see the Little & Rubin missing data book. In this approach, missing data are not imputed, but parameters of the model are estimated directly using all available data. 


Is it alright to use the EM algorithm when your data are nonnormal? 


There is little theory on ML under MAR with nonnormal outcomes. I think the results are still better than listwise deletion. See also Mplus Web Note#2 posted on our web site. 


Hello, In Stata, I created a data set that has several multiply imputed data sets. When I try to read this data set into Mplus, however, I get the same two error messages repeated until the program finally aborts: Errors for replication with data file [and then it lists a bunch of numbers]. *** ERROR in Data command The file specified for the FILE option cannot be found. Check that this file exists: [and then again, a bunch of numbers]. As far as I can tell, the Stata file contains 5 multiply imputed data sets, but can you tell from the above message whether this is a problem with the data in Stata or in Mplus? Thank you, Christina 


The message means that the file you have named using the FILE option cannot be found. Perhaps you have misspelled it or it has an extra extension that you are not aware of. If the file contains 5 data sets, you need to separate them if you plan to use the IMPUTATION option of Mplus. If you have further questions on this topic, please send them along with your license number to support@statmodel.com. 
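Separating a stacked file into one file per imputation could look roughly like the sketch below. It rests on several assumptions that are not from this thread: the data were exported as a CSV with a header row, a column (here called "imp", a hypothetical name) holds an integer imputation index, and index 0 marks the original unimputed data and is skipped. The function and output file names are also made up for illustration.

```python
import csv
from pathlib import Path

def split_imputations(stacked_csv, out_dir, index_col="imp", skip_original="0"):
    """Write one headerless, space-delimited file per imputation plus a list file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    groups = {}
    with open(stacked_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            key = row.pop(index_col)      # remove the index column from the data
            if key == skip_original:      # assumed marker for the unimputed data
                continue
            groups.setdefault(key, []).append(list(row.values()))
    names = []
    for key in sorted(groups, key=int):
        name = f"imp{key}.dat"
        with open(out_dir / name, "w") as out:
            for values in groups[key]:
                out.write(" ".join(values) + "\n")
        names.append(name)
    # The list file holds one data file name per line.
    (out_dir / "implist.dat").write_text("\n".join(names) + "\n")
    return names
```

The list file would then be named in the DATA command (FILE IS implist.dat; TYPE = IMPUTATION;), with the separate data files kept in the same folder so that the names in the list resolve.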


In the intro to the Missing Data Modeling Discussion board, there's a reference to a paper I can't find: "Nonignorable missing data modeling is possible using maximum likelihood where categorical outcomes represent indicators of missingness and where missingness may be influenced by continuous and categorical latent variables (Muthén et al., 2003)." Can you provide a link or more information? 


That is the JASA article which you find on our web site under References. 

Andy Ross posted on Wednesday, June 21, 2006  7:09 am



Dear Prof. Muthen, I am attempting to run a MIMIC LCA model with missingness on the covariates. Rather than including the x's in the model by mentioning their variances, which would require using integration to estimate the model, a colleague of mine recommended creating a new variable with mean zero and small variance, giving random values to each case, and then regressing all the covariates on this random variable. The covariates are then not independent variables in the model and can be missing. The syntax for this model is as follows (rg is the new, random variable):

Data: file = c:\soton\ncdmis2.dat;
Variable: names = sx sc2 sc3 me ma ha br pv cd ep pa sm ex pt re kd em hq rg;
classes = c (4);
categorical = pt re kd hq;
nominal = em;
missing are all(99);
Analysis: type = mixture missing;
starts (0);
Model:
%overall%
c#1-c#3 on sx-ex;
sx sc2 sc3 me ma ha br pv cd ep pa sm ex on rg;
%c#1%
[pt$1*0.688 pt$2*0.243 re$1*2.218 re$2*15];
[kd$1*3.054 kd$2*15];
[em#1*3.121 em#2*0.819 em#3*1.908];
[hq$1*3.320 hq$2*0.432 hq$3*0.287];
%c#2%
[pt$1*1.408 pt$2*0.572 re$1*1.164 re$2*3.578];
[kd$1*3.190 kd$2*0.464];
[em#1*0.814 em#2*0.523 em#3*0.879];
[hq$1*0.499 hq$2*2.253 hq$3*3.224];
%c#3%
[pt$1*3.985 pt$2*4.999 re$1*3.658 re$2*0.867];
[kd$1*3.799 kd$2*6.832];
[em#1*0.943 em#2*2.388 em#3*3.821];
[hq$1*1.205 hq$2*0.686 hq$3*1.473];
%c#4%
[pt$1*4.116 pt$2*2.653 re$1*2.708 re$2*7.775];
[kd$1*3.485 kd$2*1.518];
[em#1*3.180 em#2*2.270 em#3*1.908];
[hq$1*2.777 hq$2*0.057 hq$3*0.848];
Output: tech1 tech8 modindices;

However, for some reason, whilst this works for my colleague, it does not for me: integration is still requested. Could you please answer two questions? Firstly, what do you think of my colleague's suggestion? Does it sound feasible, at least in principle? Secondly, is there any indication in the syntax why the solution is not working here? Many thanks, Andy 


My initial reaction is that this is not a good idea. I would have to hear why your colleagues think it is a good idea to say more. In your case, I would use multiple imputation. You can use the NORM program to generate imputed data sets and analyze them in Mplus using the IMPUTATION option. Problems with the Mplus syntax should be sent to support@statmodel.com. Please include the input, data, output, and license number. 
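For readers who want to see what combining results across imputed data sets involves, the usual pooling rules (Rubin's rules) can be sketched as below. This is a generic illustration with hypothetical numbers and a hypothetical function name, and it is not a description of Mplus internals.

```python
import math
from statistics import mean, variance

def rubin_pool(estimates, std_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(se ** 2 for se in std_errors)   # average sampling variance
    between = variance(estimates)                 # variance of estimates across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical estimates and standard errors from three imputed data sets.
est, se = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(est, round(se, 4))
```

The between-imputation term is what a single imputation cannot supply, which is why several data sets are analyzed.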

Andy Ross posted on Wednesday, June 21, 2006  8:37 am



Dear Prof. Muthen, Many thanks for your speedy response. I will be certain to pass your thoughts on to my colleague. Multiple imputation has been our method of choice so far; however, the problem is that we now want to save the probabilities in a data file, which is something you cannot do when working with multiple data sets. Can I ask, are you suggesting that in our case FIML wouldn't really be an option, i.e. use MI and accept that we will not be able to save the probabilities? With many thanks, Andy 


If you have more than two or three covariates with missing data, it is impractical to bring the covariates into the model because the computational burden of numerical integration would be heavy. If only two or three of your covariates have missing data, then FIML should be fine. You should study the missing data in your covariates. Perhaps there are some with very little missing data such that you could allow the listwise deletion on those and bring the others into the model. 

Andy Ross posted on Thursday, June 22, 2006  7:32 am



Many thanks again. I tried running the model again, mentioning the variances of the three variables which had the greatest missingness, as suggested. The program requested that I use the ALGORITHM=INTEGRATION method of estimation. However, when including this term under the analysis command I got the following error message: *** FATAL ERROR THIS MODEL CAN BE DONE ONLY WITH MONTECARLO INTEGRATION. Is this to be expected? Could you please give me some indication of how I should set up the estimation? Many thanks, Andy 


Add INTEGRATION=MONTECARLO to the ANALYSIS command. 

Andy Ross posted on Thursday, June 22, 2006  8:21 am



I tried this; however, I get the following error message: *** WARNING in Analysis command The INTEGRATION option is not available with this analysis. INTEGRATION will be ignored. It then goes on to say, as before: *** WARNING in Model command This latent class regression requires numerical integration. Add ALGORITHM=INTEGRATION to the ANALYSIS command. Problem with: C#1 ON EX etc. 


It sounds like you need to send your input, data, output, and license number to support@statmodel.com so we can see the whole picture. 

Julie Hall posted on Thursday, June 22, 2006  9:51 am



Hello, I would like to use FIML to accommodate my missing data. Is that possible with WLSMV? Thanks in advance! 


Missing data estimation using FIML is available for categorical outcomes by using the maximum likelihood estimator. Missing data estimation is also available using the weighted least squares estimator. 

Julie Hall posted on Thursday, June 22, 2006  10:30 am



Thanks so much. How does Mplus deal with missing data when using WLSMV? 


Pairwise present if there are no covariates. Missing as function of the covariates if there are covariates. 
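The "pairwise present" idea can be illustrated with a small sketch: each pair of variables uses every case that has both values observed, regardless of what is missing elsewhere in the row. The sketch below shows only this case selection, using an ordinary sample covariance for continuous values. It does not reproduce the correlation estimation that WLSMV performs for categorical outcomes, and the function name and data are hypothetical.

```python
def pairwise_cov(rows, j, k):
    """Sample covariance of columns j and k over cases where both are observed."""
    pairs = [(r[j], r[k]) for r in rows if r[j] is not None and r[k] is not None]
    n = len(pairs)
    mean_j = sum(a for a, _ in pairs) / n
    mean_k = sum(b for _, b in pairs) / n
    cov = sum((a - mean_j) * (b - mean_k) for a, b in pairs) / (n - 1)
    return cov, n

rows = [
    (1.0, 2.0, None),
    (2.0, 4.0, 1.0),
    (3.0, None, 2.0),
    (4.0, 8.0, 5.0),
]
print(pairwise_cov(rows, 0, 1))  # uses 3 cases
print(pairwise_cov(rows, 0, 2))  # uses a different set of 3 cases
```

Listwise deletion would keep only the two fully complete rows here, whereas each pair above retains three.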


I am confused about how Mplus handles missing data. When I type the following into Mplus:

DATA: FILE IS fulljoin.txt;
VARIABLE: NAMES ARE t1-t10 y1-y80 p1-p10 t11-t20 m1-m20;
TSCORES = t1-t10;
MISSING=ALL (999);
USEVARIABLES ARE t1-t10 p1-p10;
ANALYSIS: TYPE=MISSING;
MODEL: i s | p1-p10 AT t1-t10;
OUTPUT: SAMP;

I receive the means for each of the variables p1-p10. However, only the variables that do not have any missing data match up with the means calculated in Excel. I am certain that the missing values are set to 999 in the Mplus file. Thank you for your help. 


If you do not specify TYPE=MISSING; in the ANALYSIS command, Mplus uses listwise deletion of any observation with a missing value on one or more of the analysis variables. 
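A small sketch shows why means can disagree with a spreadsheet when cases are handled differently. A spreadsheet average typically uses every non-empty cell in a column, while listwise deletion first drops any row with a missing value. The data and function names below are hypothetical; note also that ML estimates of the means under TYPE=MISSING are a third quantity and need not equal either of these.

```python
def listwise_means(rows):
    """Column means after dropping every row that has any missing value."""
    complete = [r for r in rows if all(v is not None for v in r)]
    return [sum(col) / len(complete) for col in zip(*complete)]

def available_case_means(rows):
    """Column means using every observed value in each column separately."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return means

rows = [(1.0, 10.0), (2.0, None), (6.0, 30.0)]
print(listwise_means(rows))        # first column uses 2 rows
print(available_case_means(rows))  # first column uses all 3 rows
```

Only the column with missing values loses information directly, but under listwise deletion the fully observed column changes too, because its rows are dropped along with the incomplete ones.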

peter kane posted on Tuesday, July 18, 2006  1:12 pm



A question about missing data. I am analyzing some longitudinal data in a cross-lag model and have about 15% of subjects missing data on one variable at the first time point. These same subjects are missing subsequent time points for this specific variable. Essentially, for this 15% of the sample, there are no data on this one variable. However, these subjects have observations on other variables. My question is whether I should delete the subjects who are missing this variable, or conduct the analysis on the entire sample by employing a missing data estimation procedure such as FIML. I guess I am not sure if the data are "missing at random". Thank you very much for your ideas/suggestions. 


I would use missing data estimation even if the data are not missing at random if deleting cases meant losing 15 percent of the sample. You might want to do the analysis both ways and see if it affects the interpretation of the results. 


Dear Bengt, I am conducting some analyses using data from NESARC. In a recent article (Grant et al., 2006), analyses were conducted using a sample of past-year drinkers (n = 26946). I hope you could answer a query that I have. I am aware that you have conducted analyses on this dataset and am interested to know how you and your colleagues dealt with missing data among this sample. I have read in the literature that listwise deletion of missing data is quite popular. I am aware, however, that the NESARC dataset contains a weighting variable. I have read on the Mplus discussion board that deleting missing data can have an adverse effect on the weighting variable. I want to use the weighting variable in my analyses and I am therefore reluctant to delete cases with missing data. In an attempt to overcome this problem, I have included the following commands in the input: Variable: Missing are all (9); Type: Complex mixture missing; However, I am aware that other people have used the algorithm command in their analyses. Is this an appropriate solution to the issue of missing data? If not, what command(s) would you suggest I use/change in my analyses? Thank you for your time. 


By using Type=MISSING, you are not doing listwise deletion but using the standard MAR approach. I would advise against listwise deletion. 


I would like to follow up on a query from my previous post (July 27 2006). I have missing values for approximately 4% of my data. I am considering recoding my dataset from values of 'yes' to 'criteria present' and values of 'no' or 'missing' to 'all other responses'. Do you think that I could statistically defend this treatment of missing data? I am aware that treating unknowns or missing values as negative has a certain element of risk (as there may be false negatives), but given the low proportion of missing data, I am unsure as to whether this is a problem. I was also wondering if you could perhaps suggest any references or authors that may have utilised such a technique? Thank you for your time. 


I would treat the missing as missing and use TYPE=MISSING; I think it is dangerous to start recoding. You may want to search the literature to see if you can find anyone who advocates the approach that you suggest. 

Julie Hall posted on Wednesday, August 09, 2006  8:55 am



I am using Mplus for my dissertation analyses and I want to make sure that I understand how my missing data will be handled. I am using WLSMV (with covariates) and my understanding is that the data will be treated as missing as a function of the covariates. Could someone explain what this means? Thanks so much! 


The ML-MAR approach to missing data allows missingness to be predicted by variables that are not missing for the individual. So both y and x variables. However, with WLSMV, if missing is predicted by y variables, the results are distorted, while they are not if missing is predicted by x variables. 


Hello Bengt (I am having to send this in 2 parts...), I wanted to follow up with you on our discussion from March 2006 on this thread about Latent Class Pattern Mixture Models (LCPMMs, i.e., "CU" models for NMAR dropout). With a very small sample (N = 128), I have looked at a series of K-class (i.e., single-class through 4-class) LCPMMs where CU (treatment attendance classes) jointly accounts for a) three-piece linear growth in alcohol use over time across 12 weekly alc. assessments, b) observed measures of treatment attendance (i.e., missingness) from weeks 2-12 of tx (i.e., everyone "shows up" for week 1) and c) the (calendar) week of the trial when the person started treatment. BIC and entropy suggest that a 3-class model fits best and, in fact, when you mix estimates (i.e., growth parameter intercepts, tx effects for each piece) across classes (weighted averaging outside the analysis), you make a different inference than you would have made if you took the results of the 1-class model (e.g., standard LGM under ignorability, but with the missingness indicators left in the model to compare BICs). (Part 1 ends here.....) 


(Part 2 starts here....) My question is that the 3-class model has 64 parameters, exactly half the number of people in the dataset, which is a dangerously low ratio of people to parameters (i.e., 2-to-1, though the class-specific estimates do not look strange and I reproduce the log likelihood value multiple times with 500 starts). But 33 of those parameters (11 indicators x 3 classes) are the thresholds for the missingness (show/no-show) indicators. Lin et al (2004; Biometrics) say that for CU models for NMAR, data are MAR within each class, after conditioning on class membership. It seems to me that once you condition on CU, you could ignore these missingness indicators (just as you would never need the missingness indicators in single-population models) and not be penalized for having such a low ratio, because more than half the parameters in the model would not even be there if class membership were known. I wanted to get your thoughts on this and see if this was off-base........ Lin, H.Q., McCulloch, C.E., & Rosenheck, R.A. (2004) Latent pattern mixture model for informative intermittent missing data in longitudinal studies. Biometrics, 60, 295-305. 


I see what you are saying, but it seems that you cannot get at CU status without estimating those thresholds, so I think they are necessary. It is an empirical question if you do better with such a 64-parameter model than not trying NMAR at all. 


Thanks again Bengt. I agree that this would be a very different model w/o the thresholds. I just worried a little bit about the low ratio, especially given that this particular CU model looks better empirically than the 1-class model under MAR (though I realize this is not necessarily a "test" for or against MAR). No one has brought up the ratio problem with these data and it seems like it doesn't worry you either..... 


The ratio worries me; a simulation might indicate how much one should worry. 


I conducted a small sim as part of this work (as part of a poster at Yihing Hser's CALDAR conference and a talk I gave Oct 2 at Bud MacCallum's brown bag @ UNC), which focused on confidence interval coverage for the mixing of the class-specific parameters in the mean structure, using all the class-specific parameter estimates (e.g., growth parameter intercepts, treatment effects, show/no-show thresholds, variance components, etc.) from the 3-class model as population parameters with simulation N=128 (500 replications). I looked at coverage under 1-class through 4-class models, given there were 3 classes in the pop. There were two things that were encouraging for the three-class solution: 1) coverage was excellent for the 4 effects I looked at (weighted average treatment effects on the three linear pieces and the intercept at the last week of treatment), between 92-98% coverage across all replications and 2) no nonconverged solutions/local maxima in any replication. Coverage was bad for 2-class and terrible for 1-class, with the majority of the confidence interval misses (relative to the pop. tx effect(s)) coming because the (class-mixed) tx effect was overestimated. For 4-class, the % of nonconverged solutions was so high (even with 700 random starts in all conditions) that I stopped studying anything beyond 3-class. Does this help?.. 


The 3-class results sound encouraging. 


I am conducting analyses using data from NESARC, a complex survey design study. My analyses are concerned with a subsample of respondents, which I identify in my setup using the subpopulation command. My query concerns the coding of respondents who are not members of the subsample. How should they be coded? 


There is no need for any special coding for observations not in the subpopulation. This is handled internally. 


Just to clarify, those respondents who are not included in my subpopulation are coded as missing in the dataset (due to 2 screener questions). I have identified these people in the setup (missing are all 9 and type = mixture complex missing). Is this correct? Also, in my output, should the number of observations reflect the number of respondents in my subsample or rather the entire sample? 


Hello Bengt/Linda, I wanted to follow up on an Oct 2006 thread on Latent Class Pattern Mixtures (e.g., MJB, 2003) on an issue that probably comes up in any K=>2 GMM. In working with the covariance matrix of the estimates (covb) (and a Jacobian matrix of 1st-order derivatives) to generate delta method standard errors for weighted-averaged estimates from LCPMM, I noticed that there were nonzero covariances *across* classes. I initially thought that was strange, as I was expecting covb to be block-diagonal (0s for all parameter covariances across classes). But then I wondered if these nonzero covariances were one of the places in the model where the uncertainty in class membership was reflected; in fact the two-class combination (in my K=3 model) that has the largest off-diagonal in the matrix of average latent class probabilities also has the largest cross-class covariances. The other sets of cross-class covariances are 0 (or so small as to be functionally 0). Are my suspicions on-base? If not, any explanation as to why covb isn't block-diagonal would be very helpful...... 


That's right; it has to do with the posterior probabilities which are spread over all classes, so due to uncertainty as you say. 
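As a rough illustration of why the cross-class covariances matter for the delta method calculation described in the question, here is a minimal Python sketch. It assumes the class weights are fixed constants (a simplification: in the actual model the class probabilities are estimated too, so the Jacobian would have extra columns), and the weights and covariance values are invented for the example.

```python
import math

def weighted_average_se(weights, cov):
    """Delta-method SE of sum_k w_k * b_k, treating the weights w_k as fixed.

    With fixed weights the gradient is just the weight vector, so the
    variance is the quadratic form w' * cov * w.
    """
    k = len(weights)
    var = sum(weights[i] * weights[j] * cov[i][j]
              for i in range(k) for j in range(k))
    return math.sqrt(var)

# Hypothetical class proportions and covariance matrix of three
# class-specific estimates (classes 1 and 2 have a nonzero covariance).
weights = [0.5, 0.3, 0.2]
cov = [
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.00],
    [0.00, 0.00, 0.16],
]

# Same matrix with the cross-class covariances zeroed out (block-diagonal).
cov_diag = [[cov[i][j] if i == j else 0.0 for j in range(3)] for i in range(3)]

se_full = weighted_average_se(weights, cov)
se_diag = weighted_average_se(weights, cov_diag)
print(se_full, se_diag)
```

With a positive cross-class covariance, assuming block-diagonality understates the standard error of the mixed estimate in this example.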


Thanks so much as always; hope "retirement" is treating you well..... 


Dear Linda and/or Bengt, I am analyzing data from a longitudinal study of risk for anxiety disorders and depression in 600+ high school juniors. At T1, we obtained self-reports on vulnerability measures for all subjects and tried to obtain peer-report versions of the same measures for all subjects. However, because some subjects refused to nominate peers and some peers refused to participate, we actually obtained peer-reports on roughly 50% of our subjects only. I was thinking about incorporating the missing data by using the multiple group approach to missing data. However, I have come across some references suggesting that the FIML approach to missing data is conceptually equivalent to the multiple group approach. If this is true, it certainly seems preferable to me to go the FIML route based on ease of model specification. Can you confirm that these approaches are equivalent? If not, when would you use the one and when would you use the other? Thanks for your time! 


The FIML and multiple group approaches are the same. 


thanks very much Linda! 


My N is 99 but when I run the following syntax, the number of observations in the output is only 70. VARIABLE: NAMES ARE (I deleted this for brevity); MISSING = ALL (99); USEVAR = ad1 ad2 ad3 satbf1 percrit1 percrit2 percrit3; ANALYSIS: TYPE = MEANSTRUCTURE MISSING; MODEL: i s | ad1@0 ad2@1 ad3@2; i s ON satbf1; ad1 ON percrit1; ad2 ON percrit2; ad3 ON percrit3; OUTPUT: SAMPSTAT STANDARDIZED MODINDICES (3.84) Do you know what I am doing incorrectly? 


Look to see if the output says that individuals with missing on all variables or missing on x variables are deleted. If you don't see this, please send your input, output, data and license number to support@statmodel.com. 


I'd like to use estimated sigma within and between matrices for multilevel regression and path analysis. In part, these matrices would serve as input for multiple group analysis. Many variables in my dataset are treated as covariates. As far as I know, covariate missingness leads to listwise deletion when using FIML. When using the sigma matrices as input for analysis with covariate missingness, I wonder what would be the right N for the sample/the groups. Does missingness in covariates have to be taken into account to determine N for covariance matrix input? Thanks a lot! 


For the pooled-within matrix use the sample size shown in the output where you saved the pooled-within matrix minus the number of clusters. This takes into account any observations lost due to missing data. For the sigma between matrix the sample size is the number of clusters. 

Lisa Liu posted on Friday, May 18, 2007  12:33 am



Hi, I am trying to run a two-level path analysis but am having trouble estimating the missing data. When I take out the level two data and just run it as a path analysis, the model successfully estimates the missing data. But when I add in the level two data, it stops working. This is strange because all of the missing data is level 1 data. Any suggestions? Thanks for your time! 


I would need more information to help you. Please send your input, data, output, and license number to support@statmodel.com. 

V X posted on Friday, August 24, 2007  3:35 am



1. When using the "Type = imputation" option, how does Mplus generate the S.E.s of the estimates? Does it apply Rubin's rules? I compared the results of the same model with FIML and MI, and the S.E.s are quite different. 2. How can I request output of the relative efficiency, relative increase in variance, and fraction of missing information using Mplus? Thank you. 


1. We estimate standard errors for multiple imputation according to the Schafer 1997 reference listed in the user's guide. FIML and MI are asymptotically equivalent. Differences can come about with small samples. 2. These items are not currently available in Mplus. 
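Since the quantities asked about in point 2 are not printed, they can be computed by hand from the per-imputation results. The Python sketch below applies the standard combining rules usually attributed to Rubin (1987); it is an illustration under that assumption, not a statement of what Mplus computes internally. The fraction of missing information here is the simple large-sample version without a degrees-of-freedom adjustment, and the estimates and standard errors are invented.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine one parameter's estimates and SEs across m imputed data sets."""
    m = len(estimates)
    q_bar = mean(estimates)                    # pooled point estimate
    w = mean(se ** 2 for se in std_errors)     # within-imputation variance
    b = variance(estimates)                    # between-imputation variance (m - 1 divisor)
    t = w + (1 + 1 / m) * b                    # total variance
    riv = (1 + 1 / m) * b / w                  # relative increase in variance
    fmi = (1 + 1 / m) * b / t                  # fraction of missing information (unadjusted)
    rel_eff = 1 / (1 + fmi / m)                # relative efficiency of using m imputations
    return {"estimate": q_bar, "se": math.sqrt(t),
            "riv": riv, "fmi": fmi, "rel_eff": rel_eff}

# Hypothetical results from m = 3 imputed data sets.
pooled = pool_rubin([1.4, 1.5, 1.6], [0.2, 0.2, 0.2])
print(pooled)
```

Note that the pooled standard error can never be smaller than the square root of the average squared per-imputation standard error, which is one quick sanity check when two programs disagree.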

V X posted on Monday, August 27, 2007  3:46 pm



This is a follow-up question to my previous inquiry about analyzing data using multiple imputation. I generated multiple imputed data sets (40 replications) using PROC MI in SAS. I then analyzed the data using both Mplus and PROC CALIS, taking into account that the data were generated by multiple imputation methods. Below are examples of the resulting parameter estimates with standard errors in parentheses. For comparison, I have also included estimates obtained from Mplus using full information maximum likelihood. Analyzing data based on multiple imputation procedures (using the identical sets of data in both analyses): estimate from Mplus: 1.47 (se = .04); estimate from CALIS: 1.54 (se = .23). Analyzing data without multiple imputation: estimate from Mplus (FIML): 1.52 (se = .24). It is interesting to note that the standard error obtained using PROC CALIS based on MI is quite comparable to that obtained using Mplus with FIML. The result from Mplus based on MI is strikingly different. How does one account for the large discrepancy? 


I would need more information to comment on this. Given that the parameter estimates are simple averages over the replications, I wonder why they are different. Unless they are the same, I wouldn't expect the standard errors to be the same. If you send the three outputs and your license number to support@statmodel.com, I can take a look at it. 

Jie Lu posted on Tuesday, September 18, 2007  11:17 pm



Hi, I am trying to fit a structural model with imputed data sets generated through the ICE procedure in Stata. One of the endogenous variables is a dummy. After I fit the model, I do get the averaged CFI, TLI, and RMSEA and their respective standard deviations. However, I did not get those for the chi-square. How can I get them? Thanks 


The current version of Mplus gives a mean and standard deviation for chi-square. 

LJ posted on Sunday, September 23, 2007  6:35 am



Hi, Linda, My version of Mplus is 4.21. When I changed the type of all my endogenous variables to continuous, I can get the averaged chi-square and its standard deviation. But if some endogenous variables are categorical, I cannot get the averaged chi-square and standard deviation. Is there some constraint for the WLSMV estimator? Thanks 


With WLSMV, the chi-square test statistic and the degrees of freedom are adjusted to obtain a correct p-value. So the degrees of freedom vary across the replications and therefore we do not report the average. 

V X posted on Thursday, November 22, 2007  11:59 am



I do not fully understand the "integration = montecarlo;" option in Mplus. Would you please provide some references so that I can get a good understanding of what the mechanism is and when to apply it? Happy Thanksgiving! 


Monte Carlo integration can be useful with many dimensions of integration and in other special cases described in the user's guide. You can search for this in the computational statistics literature. I don't know of a particular reference offhand. 


Hi Bengt and Linda, I am interested in using Mplus for fitting a growth model to a data set with missing values on the outcome variable. I could use TYPE= RANDOM MISSING and the model would produce factor scores and other estimates under a MAR assumption (missing data mechanism depends on observed data). However, my question is the following. If I want to model the missing data mechanism as in Diggle and Kenward, I could use the "missing data indicator at time t ON outcome at time (t-1)" (u ON y) type of code but would still need the TYPE=MISSING bit of code to avoid listwise deletion. Am I not "overriding" the missing data code with the inclusion of TYPE= MISSING? Should not factor scores obtained under the two model specifications (with and without the explicit missing data model) be different due to the presence of the model for the missing data mechanism, as I am only including "outcome at time (t-1)" in the missing data model? Thanks, Graciela 


The alternative of using missing data indicators (u_t on y*_t-1) in the modeling (so allowing for MNAR) takes the same approach of using all available data as MAR does, so Type = Missing does not override this. Factor scores should come out different in the two approaches given that the models are different. 


Hi, I have a question about the MLR estimator, the WLSMV estimator, and missing data. Regardless of which estimator I use, I get the following message: "Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis." and only the cases that have been observed on y are included. I was under the impression that only the cases with missing on X should be deleted from an analysis estimated with FIML so I am surprised that both MLR and WLSMV use the same number of cases. Can you please explain why this is happening and how I might change the syntax in order to utilize the cases that have been observed on x? Thank you. 


In a conditional model, information on x does not contribute to the estimation of the regression coefficient in the regression of y on x, and the mean and variance of x are not estimated. So an observation with only information on x is not used because it has no information to contribute. In an unconditional model where the means, variances, and covariance between y and x are estimated, cases with information only on x are included in the analysis. The only thing you can do to avoid this exclusion is to mention the variances of the x variables that have missing on x in the MODEL command. This will cause them to be treated as y variables. Their means, variances, and covariances will be estimated and distributional assumptions will be made about them. 


Thank you for your quick reply. Would you suggest estimating the (co)variances for the x-side variables (and declaring non-normal variables on the categorical line)? Could this cause other problems or violate assumptions? 


I would not do this. If your x variables are continuous normal, it would probably be okay and in line with multiple imputation programs. If they are categorical, it would change the model. The bottom line is that if you are interested in regression coefficients, bringing the cases with only x's into the model will not change the results. 


Thank you 

anonymous posted on Tuesday, March 18, 2008  6:54 am



Hello, I understand that 10% covariance coverage is the default setting in Mplus. However, is there any "rule of thumb" regarding the proportion of data that must be present to get reliable estimates? 


I would be happy if I had no lower than 80 percent coverage. I don't know of any rule of thumb. 


Hi Bengt/Linda, Have you published and/or are you aware of any articles that illustrate methods for modeling the conditional expectation of the likelihood given the data and current values of the parameter set (i.e., EM for model parameters) for conventional (single-class, single-level) SEMs? I am either coming across applications of EM a) for means and covariances (e.g., Schafer, 1997, p. 163-181), b) for model parameters within the multilevel track (e.g., Lee & Poon, 1998, Stat. Sinica; Liang & Bentler, 2004, Psychometrika) or c) for model parameters in the mixture track (Muthén/Shedden, 1999). For b and c, some places are obvious where the model, within the E-step, would be modified/structured to fit a conventional model as a special case, and not so obvious in other places. And many texts that discuss casewise ML for MAR data for conventional SEM seem to say little about EM, NR or any other optimization techniques (while there is plenty of talk on this for multilevel and/or mixture SEM). Any help either of you could provide on this would be greatly appreciated... 


EM for single-level, single-class SEM is described in a Rubin et al article on factor analysis (Rubin-Thayer?). 


Thank you Bengt, Looks like they have two papers that are relevant: Rubin, D.R. & Thayer, D.T. (1982) EM Algorithms for Maximum Likelihood Factor Analysis. Psychometrika, 47, 69-76. Rubin, D.R. & Thayer, D.T. (1983) More on EM for ML Factor Analysis. Psychometrika, 48, 253-257. There was also this: Bentler, P.M. & Tanaka, J.S. (1983) Problems with EM algorithms for ML factor analysis. Psychometrika, 48, 371-375. Won't be able to get them until Mon (have hardcopy access but no electronic). The 2nd Rubin paper appears to be a response to the paper by Bentler & Jeff Tanaka where both groups traded concerns about susceptibility of another optimization method (NR maybe? I'll see on Monday...) to local maxima. Thanks for pointing me in the right direction...... 


Hi Linda or Bengt, I know that a model with more parameters than subjects has identification problems (presumably related to the fact that the rank of the data matrix is limited by the number of subjects in this case) but I am not clear on how missing data impacts this. If I am fitting a model with, say, 200 parameters free to be estimated and 600 subjects total but have complete data from, say, 150 subjects (self-report is collected from all 600 but more expensive measures such as peer-report and diagnostic interview are collected from subsamples), would we have the same identification problems that we would have had if we didn't have the 450 subjects with partial data? Or do the additional subjects with partial data help in this regard? Thanks for any insight you can provide! 


I don't think this will result in identification problems. I think that you may obtain large standard errors for parameters for which you have little information to compute the H1 model. 


Thanks for the very quick reply Linda! I don't understand the last part about parameters for which we have little information to compute the H1 model. If you would be able to elaborate a bit I would appreciate it, as I am not sure which parameters those would be. 


Let's say that the sample covariance between y1 and y2 has only 10 percent coverage. Then the standard errors of any parameters that draw on that information may be large. 
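For readers who want to check coverage before running a model, the idea can be sketched in a few lines of Python: the coverage for a pair of variables is taken here as the proportion of cases with both variables observed, and the diagonal entry is the proportion observed on the variable itself. The data and variable names below are invented, and the function is an illustration rather than a reproduction of the Mplus Covariance Coverage output.

```python
def covariance_coverage(rows, a, b):
    """Proportion of cases in which both variable a and variable b are observed."""
    both = sum(1 for r in rows if r[a] is not None and r[b] is not None)
    return both / len(rows)

# Hypothetical data: y1 is observed for all 10 cases, y2 for only 1 of them.
rows = [{"y1": float(i), "y2": None} for i in range(10)]
rows[0]["y2"] = 5.0

cov_y1_y2 = covariance_coverage(rows, "y1", "y2")  # pairwise coverage
cov_y2_y2 = covariance_coverage(rows, "y2", "y2")  # diagonal: coverage of y2 itself
cov_y1_y1 = covariance_coverage(rows, "y1", "y1")
print(cov_y1_y2, cov_y2_y2, cov_y1_y1)
```

Here the covariance between y1 and y2 rests on a single case, so any parameter that draws on it would be expected to have a large standard error.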


crystal clear now, thanks Linda! 


Dear Drs Muthen, this may be a very silly question but I am struggling to figure out how Mplus handles missing data (MD). I have a dataset of 8028 people, with 248 variables (a mix of categorical and continuous), with missing data in all of them. I know that Mplus 5 takes MD into account by default, but I want to know why the number of analysed cases differs according to the number of variables in the dataset. See this example: when I regress age on sex (neither has MD) the number of analysed cases is only 5189 instead of 8028. On the other hand, when I create a new dataset including only age and sex as variables the number of analysed cases is 8028. Why do these differences arise? Is there any default mechanism that I am missing? Thanks in advance for your reply 


I would have to see the two full outputs and your license number at support@statmodel.com to answer that. 


Dear Drs. Muthen, I am confused about which approach Mplus uses to handle missing data, especially in the default setting (TYPE = MISSING & H1 as the default). Would you mind elaborating on the approach used in the default setting? I am not expecting statistical instruction but I want to know the exact approach that Mplus uses. Thank you, 


Chapter 1 of the user's guide has a description of missing data handling in Mplus along with a reference. 


Dear Dr. Muthen, Chapter 1 of the user's guide says this. "Mplus provides maximum likelihood estimation under MCAR (missing completely at random) and MAR (missing at random; Little & Rubin, 2002) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types" Does this mean Mplus used the method Rubin (2002) recommended? This might be due to my lack of knowledge. Could you please specify the method you used? Can I just say Mplus used "special" ML to handle missing data? Please let me know. Thank you so much for your help in advance. 


I am not familiar with Rubin (2002). Just quote the user's guide if you cannot paraphrase it. 

Nikolai Eton posted on Wednesday, October 22, 2008  8:40 am



Dear Linda and Bengt, I am interested in LGC with multiple indicators and multilevel growth mixture models. As my dataset is "different" in a way, I would like to hear your advice on how to handle missing data in it using Mplus. The dataset consists of several variables (cont. & cat.) across 5 timepoints on the item level (every variable shows up 5 times) and is hierarchical with individuals nested in groups. The whole dataset is related to one country. The special thing about the dataset is that there is a high fluctuation across groups (that I can control), but also across countries (that I can't control). Thus, individuals sometimes change groups. In addition to this, sometimes they leave the country (and probably return later), which appears as a missing value for the time of absence. Overall, I have about 1000 individuals nested in about 25 groups with all 5 timepoints available for 106, 4 timepoints for 87, 3 timepoints for 165, 2 timepoints for 220 and 1 timepoint for 442 variables. I do not necessarily want to explain the absence. What is your recommendation on taking care of missing values here? Thank you very much for your help. 

Nikolai Eton posted on Wednesday, October 22, 2008  8:58 am



One add-on to my question: the minimum covariance coverage is .113 (although that var has about 70% missing data information). Thus, basic analysis results in this error message: THE COVARIANCE COVERAGE FALLS BELOW THE SPECIFIED LIMIT. 


I think there are two separate issues here. If you use group as a level 2, you are treating group as a random mode of variation. In this case, changing group membership implies the need to use a "crossed random effects" approach (see the multilevel lit.), which Mplus currently cannot do. The leaving the country at some time points is a missing data question which probably is handled fine by the standard ML MAR approach of Mplus. But you want to pay attention to coverage that is lower than the Mplus default limit of 0.10 (which can be altered); that limit may already be too low (it was just put there to prevent convergence problems) for seriously relying on the results. It depends on where the low coverage occurs. If it is for a covariance between say the first and last time point, but coverage is otherwise high, then that is not so problematic because you typically don't have a growth model parameter corresponding to the covariance between first and last. A problem would be if low coverage happens for a variable (the diagonal of the coverage report) or for variables at close time points. 


We have a data set of 300 adolescents who were sampled at three waves in a cohort sequential design. It's a typical longitudinal data set and we have a reasonable amount of missingness. We did a series of unconditional and conditional LGM models to describe and predict constructs and concluded our paper with a conditional parallel process model between two constructs. To maximize our power and sample, we included cases that had data at two time points and used the missing estimator (our rationale being that two points provide a line if not a curve). When we looked at our data everything made sense and we wrote up and submitted our manuscript. Upon receiving our reviews of this and two similar papers, we received consistent critiques that I'm hoping you can help me clarify for the reviewers. I'm writing today to ask if you can direct me to literature that addresses these issues. 1) Reviewers were concerned that latent growth curve models cannot be properly identified or stably fit with 3 (or fewer) time points. Is there evidence that the models are trustworthy? 2) Is our rationale to keep cases with at least two time points and use the missing estimator something we can justify? 3) For some of our estimates, the size of the estimate is very small (e.g., slope = .008). Although significant, how are small estimates to be interpreted? Thanks for your time. 


1) Typically, at least 4 time points are desirable for good growth modeling. With only 3 time points, there are several model misspecification risks that cannot be countered due to having too few time points to identify more flexible models. This is discussed in our Mplus Short Courses, Topic 3 (see videos and handouts on our web site). Still, many published studies have used only 3 time points. If all individuals have only 2 time points, only a very limited growth model is possible. 2) It sounds like you have 3 time points for a majority of individuals and 2 for some. The percentages of each should be given. And, in fact, you could have included individuals with only 1 time point. This is what ML estimation under the "MAR" assumption of missing data theory (see the Little & Rubin book) would do. If a majority has 3 time points, I don't see a serious problem with this approach. 3) I think you are talking about a slope mean. The size of this depends on the time scores. The real question is what the implied change in mean is for the outcome from one time point to the next. You find that in the Mplus output when requesting RESIDUAL. 
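Point 3 can be illustrated with a small Python sketch. It assumes a plain unconditional linear growth model, where the model-implied mean at each wave is the intercept mean plus the slope mean times that wave's time score. The intercept mean of 2.0 and both sets of time scores are hypothetical; only the slope mean of .008 comes from the question.

```python
def implied_means(intercept_mean, slope_mean, time_scores):
    """Model-implied outcome means for an unconditional linear growth model."""
    return [intercept_mean + slope_mean * t for t in time_scores]

slope_mean = 0.008      # from the question
intercept_mean = 2.0    # hypothetical

# Same slope mean under two hypothetical codings of time.
by_wave = implied_means(intercept_mean, slope_mean, [0, 1, 2])     # time in waves
by_month = implied_means(intercept_mean, slope_mean, [0, 12, 24])  # time in months

change_per_wave_a = by_wave[1] - by_wave[0]     # about 0.008 per wave
change_per_wave_b = by_month[1] - by_month[0]   # about 0.096 per wave
print(change_per_wave_a, change_per_wave_b)
```

The same slope mean implies a twelve times larger change between adjacent waves when the time scores are in months, which is why the estimate has to be read together with the time scores and the scale of the outcome.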

Hemant Kher posted on Thursday, October 30, 2008  2:53 pm



Dr. Muthen, Greetings. I am working on fitting growth models to survey data and have a question about missing data. There were 233 students in the sample. We collected data at 4 equally spaced time points. With regard to our key variables used to fit growth models, here is the breakdown of how often students provided data: 107 students provided data at all 4 time points (46%); 64 students provided data at 3 points (27%); 36 students provided data at 2 points (15%); 23 students provided data at 1 point (10%); 3 students did not provide any data (1%). I have read somewhere that for growth models, we need at least 1 time point, but I am not sure if having close to 10% of people who provided only 1 observation will affect our growth models. 


Assume that you have a linear growth model. The most important factor is how many individuals have at least 3 time points because that's how many you need to identify all the parameters. The individuals with fewer time points also contribute to the estimation of some parameters so they are helpful to include. Of those who have at least 3 time points you also want to know how representative they are of the whole group; a simple thing to check is whether the mean of the outcome at the first time point is significantly different across the 4 missing data groups you list. If different, you may consider "pattern-mixture modeling". 
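As a minimal sketch of the descriptive step behind that check, the following groups individuals by how many time points they provided and reports the first-time-point mean in each group. The data, the use of `None` as the missing-value code, and the function name are assumptions for illustration; the sketch does not carry out the significance test itself.

```python
from collections import defaultdict


def summarize_by_count(rows):
    """Group people by number of observed time points.

    rows: one list of outcomes per person, with None for a missing wave.
    Returns {n_observed: (n_people, mean of first-wave outcome or None)}.
    """
    first_wave = defaultdict(list)
    for row in rows:
        n_observed = sum(value is not None for value in row)
        first_wave[n_observed].append(row[0])

    summary = {}
    for n_observed, firsts in first_wave.items():
        seen = [value for value in firsts if value is not None]
        mean_first = sum(seen) / len(seen) if seen else None
        summary[n_observed] = (len(firsts), mean_first)
    return summary


# Hypothetical data: 4 waves, None marks a wave the student skipped.
data = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, None, 2.0, 1.0],
    [2.0, None, None, None],
    [None, None, None, None],
]
for n_observed, (n_people, mean_first) in sorted(summarize_by_count(data).items()):
    print(n_observed, n_people, mean_first)
```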


In an earlier posting, it was mentioned that FIML was available for categorical outcomes. However, whenever I have tried this I get a warning stating, "Data set contains cases with missing on x-variables. These cases were not included in the analysis." This has been the case when I have run logistic regression analyses and when I have run SEM models with binary indicators of latent variables. Can you clear this up for me? Is there a way I can get Mplus to use FIML with such analyses? 


A regression model is estimated conditioned on the observed exogenous variables. Means, variances, and covariances of these variables are not part of the regression model. Missing data theory applies to observed endogenous variables. You can include the observed exogenous variables in the model by mentioning their variances in the MODEL command. By doing this they are treated as endogenous variables and distributional assumptions are made about them. 


Thank you for the helpful response. From looking at the manual I am not clear on the code for mentioning the variances in the MODEL command. Can you provide a little more detail on this? 


See "variances" in the index of the Mplus User's Guide. It points to page 524, which explains how to refer to variances. 


Hello, I have a missing data question that I was hoping someone could answer. We developed a ten-factor measure of connectedness, with one factor measuring connectedness to siblings. As expected, some of our subjects do not have siblings and thus their data are missing appropriately. To avoid losing those subjects on the other factors, I estimated the missing data using a multiple imputation procedure and followed it with an invariance analysis that compared sibling and non-sibling samples across the factor loadings, intercepts, residuals, and covariance matrix. My first question is: do you find this analytic approach appropriate? As I expected, the results were nearly identical across the two samples, both when testing the ten-factor model and the single-factor model (i.e., sibling connectedness scale). The only caveat is that the sibling connectedness results should only generalize to subjects with siblings. My second question is whether mixture modeling with known classes is a better approach to answer this question. If my understanding of mixture modeling is correct, I would draw the same conclusion. Am I correct? Thanks for your time and consideration. 


I would look at subjects with and without siblings separately. Then if you want to compare them, do so on the factors that are not about sibling connectedness. Imputing siblings for those without seems a little iffy. Mixture modeling with only a known class variable is the same as multiple group analysis. 

anonymous posted on Thursday, February 26, 2009  3:52 pm



Hello, I am trying to determine how to approach a missing data issue. I have ratings of depression severity across time for about N=400. The timing of observations varies across individuals, so I plan to nest time points within individuals. One problem, however, is that the number of data points also varies across individuals. For instance, the number of data points for the sample ranges from 1 to 16, with a mean of 8 data points, SD=2.8, and variance=7.7. I am not certain what would be the best way to approach this. For example, would it be best to include only the first 8 time points for the analysis? Any thoughts would be very much appreciated. 


I would include all the data. The varying timings are handled by the AT option of the growth language (using ) and the varying number of time points can be handled either by (1) using a single-level, wide approach, letting the observation vector be of length 16 and using a missing data symbol for time points not available, or (2) using a two-level (time points within individuals), long approach with a univariate outcome, where the different number of time points per individual merely results in different cluster sizes and is therefore inconsequential. 

anonymous posted on Friday, February 27, 2009  10:14 am



Hi Dr. Muthen, I have attempted a LGMM using the first option, however the model does not converge. Do you think that convergence would be more feasible with the second option? Thanks very much for your help! 


I am not sure. It would depend on the reason for nonconvergence. You would have to contact support@statmodel.com to have this diagnosed. 

anonymous posted on Saturday, February 28, 2009  2:46 pm



The error I obtain is the following: *** ERROR One or more variables have a variance of zero. Check your data and format statement. There is one variable with only 3 subjects and the variance is 0.027. However, Mplus indicates that the variance is 0.000 for this variable. Is this possible or is the data file incorrect? Many thanks! 


This cannot be answered without seeing your input, data, output, and license number at support@statmodel.com. 

anonymous posted on Monday, March 02, 2009  7:15 am



Dr. Muthen, Many thanks. I will send you the input, data, and output. I know that Mplus does not generate graphs when the type=random analysis is used to account for individually-varying times of observation. I'm guessing this is also true if type=random is used in the LGMM framework, correct? Thanks very much for your assistance. 


I'm using the ECLS-B database from NCES. There are about 12 weights to be applied to various variable sets. I think I've identified the correct weight for my variables and should receive confirmation from NCES soon. However, even though the output in SPSS shows I have a variable weighted, my Mplus output shows only something like 51 cases were analyzed. There are missing weight values for some cases, predictably. And yet, I'm told I cannot impute any value, neither a 1 nor a 0 for example, in Mplus. Do you have any transformation or filtering suggestions, so that I can do a CFA with a larger sample? 


In my understanding, missing is not allowed for a weight variable. You need to check with NCES to determine which weight variable you should use. It should not have missing. 

anonymous posted on Wednesday, March 04, 2009  11:45 am



Hello Dr. Muthen, Regarding your response to my query concerning how to approach missing data (Bengt O. Muthen posted on Thursday, February 26, 2009  6:53 pm), what are the advantages and disadvantages to taking a long vs. wide approach? I have more familiarity with the wide approach and would prefer it, but will certainly consider the long approach if it has definite advantages over missing data. Also, a few more questions: 1) Can the long approach be used in conjunction with a LGMM? 2) Is it possible to graph the classes of trajectories with an HLM that uses the analysis=random option? Thanks very much! 


I think the wide approach is generally preferable, but not always. For example, you can allow the residual variances for the outcomes to vary across time. But the wide approach has to use the max number of observations per subject, which may lead to a long observation vector (very wide). And with individually-varying times of observation, having a different residual variance for each time point makes for many parameters. Furthermore, some time points may not have variation in the outcome if the missingness is extreme. 1) Yes. In this case the latent class variable is a between-level variable (see UG for examples). 2) I think so; try it. 

anonymous posted on Thursday, March 05, 2009  7:19 am



Hi Dr. Muthen, Yes, using the wide approach, I've found that some time points do not have variation in the outcome b/c the missingness is extreme. I was thinking of simply only including time points where the covariance coverage equals or exceeds .60 (although I have no reference to justify this approach). 1. Does this seem to be an acceptable solution? 2. Do you know of any references that recommend such a covariance coverage? thanks! 


You can manipulate the data to fit better with the wide approach by deleting time points or combining them with adjacent timepoints, but such manipulation does not seem right. Given what you see, I would instead take the long approach. You may find the DATA WIDETOLONG option helpful. 
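For readers who restructure the file outside Mplus, the wide-to-long idea can be sketched as below. This is only an illustration of the data layout, assuming `None` marks an unavailable time point; it is not how Mplus implements the conversion, and the function and variable names are hypothetical.

```python
def wide_to_long(wide_rows):
    """Turn one row per person into one row per observed time point.

    wide_rows: list of (person_id, [y_1, ..., y_T]) with None for missing.
    Returns a list of (person_id, occasion, y) records; missing occasions
    are dropped, so people simply contribute different numbers of rows.
    """
    long_rows = []
    for person_id, outcomes in wide_rows:
        for occasion, y in enumerate(outcomes, start=1):
            if y is not None:
                long_rows.append((person_id, occasion, y))
    return long_rows


# Hypothetical wide data: 3 possible occasions per person.
wide = [
    (1, [5, None, 7]),
    (2, [None, None, 3]),
]
for record in wide_to_long(wide):
    print(record)
```

In the long layout the person identifier plays the role of the cluster variable, which is why unequal numbers of time points show up only as unequal cluster sizes.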


Just wondering if it's possible in any way to run MLM estimation when having missing data. I have noticed that MLM requires the raw data (so it must be a FIML type estimation) so even if I feed the model with a covariance matrix it won't work. Thanks in advance! 


Missing data are not allowed with MLM. 

anonymous posted on Thursday, March 12, 2009  12:46 pm



Hello Dr. Muthen, I've attempted to transform the data from wide to long per your suggestion (Bengt O. Muthen posted on Thursday, March 05, 2009  8:13 am). Thankfully, the model ran! However, I have a few questions regarding interpretation: 1) How might I obtain a graph of the LGM trajectory? When I attempt to view the individually-fitted curves, only two data points are plotted on the y-axis. 2) If I am using the long option, I no longer need TSCORES, correct? 3) How might I compare LGM models? When using the wide approach, I've conducted chi-square difference tests for nested models (intercept vs. intercept + slope vs. intercept + slope + quadratic slope), but I am not certain how to do this using the loglikelihood test. Many thanks! 


1) You need to send this to support with the usual information. 2) You need to use AT (and TSCORES) also here if you want to take into account individually-varying times of observation. 3) For two nested models, 2 times the difference in loglikelihood is chi-square distributed, with degrees of freedom equal to the difference in the number of free parameters. 
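Point 3 can be worked through by hand or with a short script. The sketch below assumes plain ML loglikelihoods (no scaling correction) and uses made-up loglikelihood values and parameter counts; `chi2_sf` is a small helper written here for integer degrees of freedom, not a library routine.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_k (x/2)^(k-1/2) / Gamma(k+1/2)
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= half / (k - 0.5)
        total += math.exp(-half) * term
    return total


def lr_test(loglik_restricted, loglik_general, n_par_restricted, n_par_general):
    """Likelihood-ratio test of a restricted model against a more general one."""
    statistic = 2.0 * (loglik_general - loglik_restricted)
    df = n_par_general - n_par_restricted
    return statistic, df, chi2_sf(statistic, df)


# Hypothetical values: intercept-only model vs. intercept + slope model.
statistic, df, p_value = lr_test(-1510.2, -1504.9, 5, 8)
print(round(statistic, 3), df, round(p_value, 4))
```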


Is it OK to write that Mplus estimates models under missing data theory using all available data (when the 'missing' option is used)? I.e., it does not perform pairwise deletion, does it? 


Sanja, In the newer versions of Mplus, TYPE = MISSING is the default, where missing cases are handled under the Missing at Random (MAR) assumption using Full-Information Maximum Likelihood (FIML). You may also specify models with listwise deletion through LISTWISE=ON in the DATA command. More information is provided in the User's Guide, pp. 78. Sincerely, Amir 


With ML estimators all available data are used, using "MAR". Mplus only does pairwise with categorical outcomes and WLSMV. 


Hi, I've received this comment from a reviewer, regarding a confirmatory factor analytic study: "Were missing data patterns missing at random (this can be done in Mplus by specifying a mixture analysis and using only a single class latent variable, using the %OVERALL% syntax at the beginning of the model statement and declaring the outcome variable as categorical variables)." I don't understand what they are suggesting, and even if I did understand, I don't see how any test could tell if the data were MAR/MCAR vs MNAR. Is this documented anywhere? Thanks, Jeremy 


I also don't understand what is suggested here. And there is no general test of MAR/MCAR versus MNAR. You can only test for MCAR. 


Thanks for confirming. Jeremy 


I have Mplus version 5. I am running a path analysis and I understand that the default is to estimate the model under missing data theory. How can I turn off this option? I just want to use complete case analysis in order to compare my results with another package. Thank you 


Add LISTWISE=ON; to the DATA command. 


Thank you 


i tried adding LISTWISE=on, to the data command, but I get error: *** ERROR in Data command Unknown option: LISTWISE 


This option came out with Version 5. Perhaps you are using an older version where listwise deletion is the default. If not you need to send your full output and license number to support@statmodel.com. 


In V 4.21, I tried running a logistic model on a binary categorical outcome with Type=missing: ANALYSIS: Type = missing; ESTIMATOR = ML; But I got the following message: *** FATAL ERROR THIS MODEL CAN BE DONE ONLY WITH MONTECARLO INTEGRATION. Please advise. 


Add INTEGRATION=MONTECARLO; to the ANALYSIS command. Also download Version 5.21. 


Dear Dr. Muthen. A reviewer of one of my manuscripts requested that I report how Mplus handles missing data. I have a complex structural equation model (see below). I used the WLSMV estimator and MISSING = ALL (999). The outcome variable is categorical (1=relapse, 0=abstinent) and no subjects are missing on this variable. However, some subjects have missing data on some of the other observed variables. For instance, some subjects do not have data for c1-c4 (each of the observed variables that make up the crave latent variable). Is my description of what Mplus does in this situation correct? Syntax for the model is below. “Intent to treat abstinence was the dependent variable in the current study. Thus, none of the participants were missing on the dependent variable (i.e., missing were counted as relapse). However, some participants did not complete all of the study measures. Mplus handles these missing values by estimating them using the other variables in the model.”
MODEL:
SES by s1 s2 s3 s4;
Neigh by h1-h4;
Support by i1-i3;
NA by n1-n4;
agency by a1-a5;
Crave by c1-c4;
neigh on ses;
support on ses neigh;
NA on neigh support crave;
agency on crave na;
w4itt on agency ses; 


Factor indicators are dependent variables. For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes. When there are no covariates in the model, this is analogous to pairwise present analysis. I think that what you are saying is that all of your dependent variables are continuous except abstinence. If this is the case, I would use maximum likelihood estimation where maximum likelihood estimation under MCAR (missing completely at random) and MAR (missing at random; Little & Rubin, 2002) is available for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types. MAR means that missingness can be a function of observed covariates and observed outcomes. 


I am interested in receiving your suggestions for analyzing my data using Mplus. The data comes from an intervention study for couples transitioning to marriage. 18 couples completed pretest data and 12 couples completed posttest data. I am interested in assessing change in couple's attachment, affect regulation, empathy, and trust (continuous variables) following intervention. This was a very preliminary study which is why I want to use the FIML capabilities of Mplus to keep the sample size higher than 12. 


The standard approach to analyzing longitudinal data is to use FIML under the "MAR" assumption (see the missing data literature). This means that you use all available data: 18 couples at time 1 and 12 couples at time 2. I assume that the 12 couples are a subset of the 18. Couple, not individual, represents the mode of variation for which independence of observations is assumed to hold, so the sample size is 18. Because of this, note that with only 2 time points the sample size of 18 is quite low and does not allow the estimation of a model with many variables and parameters. 

Newcomer posted on Tuesday, July 28, 2009  8:44 am



Hi, I am using Mplus to run linear regression and just wonder if Mplus can save the adjusted means (predicted values from a multiple regression). If so, what is the command? Thanks! 


You cannot obtain these automatically. You would need to use the DEFINE command in a subsequent analysis to obtain them. 

Newcomer posted on Tuesday, July 28, 2009  2:00 pm



Thanks so much Linda! Just to confirm: I will need to plug the regression coefficients into the equation to calculate the predicted values using the DEFINE command, and then use the SAVEDATA command to save them, right? 


Right. 
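The arithmetic being confirmed here is just the regression equation evaluated for each case. As a hedged illustration outside Mplus, with made-up coefficient values and hypothetical variable names `x1` and `x2`:

```python
def predicted_value(intercept, coefficients, case):
    """Predicted outcome: intercept plus the sum of slope * covariate value.

    coefficients: {covariate name: estimated slope}
    case: {covariate name: observed value for one person}
    """
    return intercept + sum(slope * case[name] for name, slope in coefficients.items())


# Hypothetical estimates copied by hand from a first regression run.
intercept = 1.5
coefficients = {"x1": 0.4, "x2": -0.2}

# Hypothetical covariate values for two people.
cases = [{"x1": 2.0, "x2": 5.0}, {"x1": 0.0, "x2": 1.0}]
for case in cases:
    print(round(predicted_value(intercept, coefficients, case), 3))
```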


In a previous post the following was stated by another user: "In the newer versions of Mplus, TYPE = MISSING is the default, where missing cases are handled under the Missing at Random (MAR) assumption using Full-Information Maximum Likelihood (FIML)." This was followed by a statement from Bengt: "With ML estimators all available data are used, using 'MAR'." I have seen this sort of wording, "all available data are used", by Drs. Muthen in regard to missing data in several places, but I have not seen either of them directly state that when using TYPE=MISSING, FIML is being employed. Is it fair to say that when you specify TYPE=MISSING (which is now the default) Mplus is using FIML? 


"FIML" is used in some literature to mean full-information maximum-likelihood estimation (most often with continuous outcomes, but that is not necessary) and with missing data the "MAR" assumption of missing data theory is utilized. (As an aside, I think the "full-information" part is superfluous because maximum-likelihood estimation uses full information; to me it is not a good idea to add unnecessary acronyms beyond those in mainstream statistics.) Mplus uses ML to refer to maximum-likelihood estimation. ML under MAR is therefore the same as "FIML" and uses all available data. So, TYPE=MISSING together with ESTIMATOR=ML gives "FIML". TYPE=MISSING together with ESTIMATOR=WLSMV, however, does not use MAR but a less flexible assumption detailed in the UG. 


Many thanks for the clarification! 


I am running growth models with a lot of missing data. I need to compare nested models. However, my understanding is that it is not appropriate to use traditional chi-square difference tests to compare the fit of the nested models when modeling missing data due to the approximated chi-square values. Further, the manual states that the DIFFTEST option can only be used with MLMV or WLSMV estimators, yet I am using the default (ML) estimator. What is the most appropriate way to compare the relative fit of the nested models in this case? Should I be changing my estimator, or using some other approach? Thank you. 


The presence of missing data should not be an issue with difference testing. It is only the estimator that dictates the type of difference testing. ML uses a simple difference in chi-square. MLR requires the use of a scaling correction factor. Estimators ending in MV can use the DIFFTEST option. 
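For the MLR case, the commonly used calculation is the Satorra-Bentler scaled chi-square difference. The sketch below assumes you have, for each model, the scaled chi-square, its scaling correction factor, and its degrees of freedom from the output; the numbers shown are invented for illustration and the function name is hypothetical.

```python
def scaled_chi2_difference(t0, c0, d0, t1, c1, d1):
    """Scaled chi-square difference test for two nested models.

    t0, c0, d0: scaled chi-square, scaling correction factor, and df of the
                nested (more restrictive) model.
    t1, c1, d1: the same quantities for the less restrictive comparison model.
    Returns (scaled difference statistic, difference in df).
    """
    if d0 <= d1:
        raise ValueError("the nested model must have more degrees of freedom")
    # Scaling correction for the difference test.
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)
    if cd <= 0:
        raise ValueError("non-positive difference scaling correction")
    # Undo each model's scaling, take the difference, rescale by cd.
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, d0 - d1


# Hypothetical output values for a nested and a comparison model.
statistic, df = scaled_chi2_difference(120.0, 1.2, 40, 100.0, 1.25, 36)
print(round(statistic, 3), df)
```

The resulting statistic is referred to a chi-square distribution with the returned degrees of freedom, in the same way as an ordinary ML chi-square difference.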


Linda, Thank you for your response. I realize that the ML chi-square difference test is typically the difference in the chi-square values (with the difference in the degrees of freedom as the df). However, when I estimate these models using regular ML estimation, the df change between samples. For example, if I run a model with sample A, and then run the exact same code with sample B, my chi-square df changes from 70 to 71, respectively. This implies to me that regular chi-square difference testing might not be OK. Am I totally off base? 


If the model is the same, changing the sample does not change the degrees of freedom with ML. If you send the two outputs and your license number to support@statmodel.com, I will find the explanation of this difference. 


Both models estimate the same number of parameters. The difference in degrees of freedom is due to a different number of parameters in the unrestricted models due to different patterns of missing data in the two samples. 


In that case, can I still use the traditional chi-square difference test to compare these results to the results from a nested model, even though the df change? 


Nested models are tested for the same data set so this should not be a problem. 


Do you deem it necessary to conduct an analysis of sample selectivity when FIML is used? I thought of comparing those with full data with those having at least one missing value on the main study variables of interest. However, I'm not sure whether this analysis is theoretically needed because FIML uses all data available to estimate the model. In case there are differences between both groups, is MAR violated? 


MAR is not necessarily violated; the missingness can still be predicted by the variables that are observed. You cannot test if MAR holds. Although of interest in itself, your comparison can only reject MCAR. So unless you try to get into NMAR (not missing at random) modeling, you might just as well go ahead with ML under MAR (i.e., what is often called FIML). 


I'm trying to understand how missingness in x variables is handled in Mplus. I have tried the simplest case with two continuous variables based on a sample size N=415 with X missing 22 cases while Y is missing only 1 case. If I regress y on x I get a message that 1 case is missing on all variables (N = 414). I had expected a message indicating that the analysis would be based on 393 cases (415 - 22). Is the analysis based on 414 cases or on 393 cases (i.e., is it listwise deletion, or are the cases with missing Xs somehow adjusted for missingness rather than omitted from the analysis)? I tried to find information on this and don't understand one of your statements that "Covariate missingness can be modeled if the covariates are explicitly brought into the model and given a distributional assumption." Have I done this in my example? Thank you. 


Your analysis uses TYPE=GENERAL with continuous outcomes. In this special case, there is no difference between estimating the model conditioned on the x variables or treating the x variables as y variables. This is why the 22 cases are not deleted from the analysis. In other cases, it does make a difference how the x variables are treated and cases with missing on x are deleted unless they are explicitly brought into the model by, for example, mentioning their variances in the MODEL command. In this case, they are treated as y variables and distributional assumptions are made about them. 


Hi, I am doing a longitudinal study of 1000 children followed at four time points to assess language and literacy growth. Since this study is still ongoing, there are some children who have not been assessed at time 3 and time 4 yet. In one of my papers I'm focusing on time 2 to time 4, doing SEM, to examine how various language skills are related to later literacy development. I'm not very familiar with missing data, but in my data I have some missing values due to the fact that some children have not been assessed yet. What type of missing is this, and how do I handle it? Thank you! 


It sounds like it would be MCAR so using our default of MAR with TYPE=MISSING should be fine. 


Hi Linda, I want to compare a measurement model obtained from a complete sample (N=1041) with the same measurement model obtained by multiple imputation using the same data with approximately 30% planned missingness (MCAR). I want to see if the MI approach gets close to the original measurement model in a real data set. The items are scaled on a 7-point Likert scale. I have managed to run the measurement model using both methods and the models look similar, but I wondered whether the data could be combined in one measurement invariance type analysis (multigroup?). Is this possible? For the MI analysis I used a .dat file with the names of the 30 imputed data files. Thanks for any pointers. 


The complete-data sample and the MI samples are not independent so multigroup analysis would not be correct. What you could do is to divide the sample into groups that have different planned missingness (variables for which everyone has data plus variables that some have data for) and then do a multigroup analysis where you can test invariance over model parts that are in common for the different groups. So this would not use MI. 
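One way to form such groups before the multigroup run is to sort cases by which variables they were given. The following is a minimal sketch of that bookkeeping, assuming `None` marks a variable that was not collected for a case; the data and function names are hypothetical.

```python
from collections import defaultdict


def pattern_of(row):
    """Observed/missing pattern of one case: 1 = observed, 0 = missing."""
    return tuple(0 if value is None else 1 for value in row)


def group_by_pattern(rows):
    """Map each missingness pattern to the indices of the cases showing it."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[pattern_of(row)].append(index)
    return dict(groups)


# Hypothetical planned-missingness design with three item blocks:
# everyone gets block 1; some cases skip block 2, others skip block 3.
rows = [
    [1, 2, None],
    [3, 4, None],
    [5, None, 6],
    [7, 8, 9],
]
for pattern, members in sorted(group_by_pattern(rows).items()):
    print(pattern, members)
```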


Thanks 

Holly Burke posted on Thursday, January 14, 2010  2:03 pm



Hello, I was wondering how Mplus handles missing data with WLSM in categorical factor analysis? I thought Mplus handled missing data using maximum likelihood, but when I run the following analysis code: TYPE = COMPLEX EFA 1 5 MISSING; the output says the program used the WLSM estimator so how could the program also be using the ML estimator? Thank you. 


In this case, Mplus uses a pairwise present approach. 


We have collected student self-report data at seven time points and are interested in doing MM, which may lead into GMM or LGM, depending on the results of the MM. However, we have missing data (total n = 1434; listwise n = ~1271). We have determined that the missing data are not MCAR, and for now are treating them as MAR (will eventually do MNAR models, but are starting with MAR). We would like to do FIML. My question is two-part:
1. Is the following syntax FIML?
TITLE: 2ClassA means free and var free but fix equal 0 covars
DATA: File is 'EffortMM99.dat';
VARIABLE: names are id eff1 eff2 eff3 eff4 eff5 eff6 eff7;
Usevariables are eff1 eff2 eff3 eff4 eff5 eff6 eff7;
missing are all (99);
classes=c(2);
ANALYSIS: type=mixture missing;
estimator = MLR;
starts 500 500;
MODEL: %overall%
eff1 eff2 eff3 eff4 eff5 eff6 eff7;
OUTPUT: TECH1;
2. We have quite a number of external covariates, which we are hoping to use with the auxiliary command. However, some of the external covariate data are missing, as well. Can we use these data with FIML and the auxiliary command? Or, what is your recommendation? Thank you for any advice that you can offer! It is much appreciated. 


1. Whenever you combine TYPE=MISSING and maximum likelihood estimation, you have FIML. 2. The AUXILIARY setting for missing data correlates cannot be used with TYPE=MIXTURE. You would need to do this yourself. 


Dr. Muthen, Thank you for your response. Would the best approach be to compute MI separately for the external criteria, based upon mixture (class membership from the MM) and then use the imputed data set as auxiliary variables? Or, is there another approach that would be preferable? Thank you. Jeanne 


By MI I assume you mean Multiple Imputation. I don't know about MI software with a mixture (I assume your MM notation means mixture modeling), but perhaps you mean doing MI for subjects grouped by most likely class, which might be an alright approximation. But perhaps you could simply do MI for the external covariates without involving mixtures. If your substantive model can reasonably be extended to include those multiply imputed external covariates among your other covariates, that might be the most straightforward approach. Otherwise, you can include the externals as auxiliaries, either with them imputed or with their missingness. I hope I understood your questions. 


Dr. Muthen, Thank you; that is extremely helpful. Jeanne 


Dear Dr. Muthen, in my dataset some data are missing. I have 13 measurement occasions. I use the line LISTWISE=ON; this drops every subject who has a missing value on any of those 13 occasions. My question: is there a command to set a criterion for the missing data per subject? For example, I only want to exclude those subjects who have more than 3 missing values out of the 13 possible. Thank you for your support. Walter 


I would not recommend using any rule related to missing data. This can cause the sample to be skewed. I would use all available data. 


Dear Dr. Muthen, I have come across some problems running my measurement model. I would like to run the Theory of Planned Behavior on a dataset containing 2,000 participants. My file consists of 30 observed variables which load on 5 factors: intention (3 obs. var); pros (9 obs. var); cons (7 obs. var); self-efficacy (9 obs. var); and social influence (2 obs. var). If I try to run this model:
MISSING ARE ALL (9);
ANALYSIS: ITERATIONS = 1000;
CONVERGENCE = 0.00005;
COVERAGE = 0.10;
OUTPUT: SAMPSTAT MODINDICES(10) STANDARDIZED TECH4;
Model: intenT0 by inten1t0-inten3t0;
prost0 by pros1t0-pros9t0;
cont0 by con1t0-con7t0;
EEt0 by EE1t0-EE9t0;
SIt0 by SSt0 SMt0;
No model results are shown (at least only the estimate is shown without s.e., p-values, MI etcetera) and I receive the following text: MAXIMUM LOGLIKELIHOOD VALUE FOR THE UNRESTRICTED (H1) MODEL IS 63112.478 NO CONVERGENCE. NUMBER OF ITERATIONS EXCEEDED. I have already tried to increase the number of iterations but this didn't help. Can the high number of missing values explain this error (the number of missing data patterns is 133)? If yes, how can I solve this? If not, do you have another suggestion that explains this error? Thank you very much for your help. Best wishes! 


Please forget my previous question!! A colleague of mine solved the problem. My apologies for the inconvenience! Very best wishes, Maartje 


I am wondering the best way to handle a standard CFA with dichotomous indicators where 1 indicator has missing data for all members of a dichotomous covariate? I get the following error when I run the model: THE WEIGHT MATRIX PART OF VARIABLE AMEN IS NONINVERTIBLE. THIS MAY BE DUE TO ONE OR MORE CATEGORIES HAVING TOO FEW OBSERVATIONS. CHECK YOUR DATA AND/OR COLLAPSE THE CATEGORIES FOR THIS VARIABLE. PROBLEM INVOLVING THE REGRESSION OF AMEN ON GENDER. THE PROBLEM MAY BE CAUSED BY AN EMPTY CELL IN THE BIVARIATE TABLE. 


To be sure we understand the model, please send your full output and license number to support@statmodel.com. 

Brian Hall posted on Friday, March 26, 2010  1:27 pm



Dear Dr. Muthen, A quick question: I'm using the MLR estimator for CFA analyses. I have opted to use multiple imputation in order to test CFA models separately in multiple waves of data (sample size precludes normal temporal invariance investigation using FIML). Given the robust estimation, I am concerned that MPLUS is not providing a scaled correction factor in the imputed results. Is this a valid concern? Do I need to compute the scaling factor? and if so, how? Thanks in advance, Brian 


I'm not sure that using multiple imputation rather than FIML helps with a small sample size. You can test for invariance over time without looking at each time point separately. See the Topic 4 course handout starting with Slide 78 where multiple indicator growth is shown. The first steps test for measurement invariance. If you are using TYPE=IMPUTATION and MLR, you will obtain an average MLR chi-square and standard deviation over imputations. These chi-square values have been corrected using the scaling correction factor. How to use a scaling correction factor with multiple imputation is a research question. 


Hello, I am running a latent profile analysis with imputed data. I have generated 40 imputations with SAS proc mi, and created an ASCII file containing the names of the 40 data sets as described in the Mplus User’s Guide. Although I am able to get the LPA models to converge, I am concerned about the range of the indicator estimates across the classes. I have three continuous variables as indicators, all of which have been z-scored. Is it possible that the profiles may change meaning from imputation to imputation in Mplus? In other words, across the 40 data sets is it necessary to verify that profile 1 always has the same meaning across imputations, as do profile 2, profile 3, etc.? How does Mplus handle this? Thank you for your help! 


You should use starting values to ensure that the classes don't switch across imputations. Run one data set with sufficient random starts to get stable starting values. 


Dear Linda & Bengt, I have a question about handling missing data in SEM analysis for panel studies. Marini, Olsen, & Rubin (1980) suggest this method should be used with nested missing-data patterns, that is, every subsequent wave should be a subsample of the previous wave, like this:
t1 t2 t3
n 1 1 1
n 1 1 0
n 1 0 0
etc.
A few reviews I found don't make clear statements on this issue (Enders, 2001; Newman, 2003). For example, what if the missing-data pattern is like this:
t1 t2 t3
n 1 1 1 = 3 complete waves
n 1 1 0 = t1 and t2; not t3
n 1 0 0 = just t1
n 1 0 1 = t1 and t3; not t2
n 0 1 1 = just t2 and t3
n 0 0 1 = just t3
n 0 1 0 = just t2
My concern is what is most recommended to do with the non-nested cases of the available data: drop them, or keep them for the analysis of panel data when using ML? Another concern is the 'few cases' in the panel paths: is it only a problem of sample size (enough data to estimate the parameters), or is there a relation between the N of the within covariances (panel cases) and the between covariances (cross cases)? I'll welcome any comments or directions on this issue, thanks in advance! Diego. 


I think you make a distinction between dropout (monotone missingness) and intermittent missingness. I would think it is ok to make the standard MAR assumption for intermittent missing; perhaps it is even MCAR. You should certainly keep these cases in your analyses. The principle should be to use all available data. MAR for dropout may hold close enough, but for dropout one may want to also investigate other modeling (see for instance my paper under Missing data). But this is more advanced since it means that the missingness is part of what you model. You then bring up coverage which Mplus prints for each outcome and pairs of outcomes. You want both types to be high in a longitudinal model. 


Dr. Muthen, I'm running a simple model examining one indirect effect with one mediator. I have missing values on all variables (x, m, and y) and mentioned the x variable in the model command with the aim of FIML handling all missing data. However, Mplus is dropping cases that have missing values on all variables. Can FIML not address such cases? When I run the same model in Amos (which from what I understand also uses FIML) it appears to use the entire sample. Can you please explain what is happening here? Thank you. 


Cases with missing on all variables have nothing to contribute to the modeling. Although AMOS does not tell you these observations are not used, they are not. 


Thank you. There are other variables not in the model that these cases have values on. Would Mplus stop dropping the cases if I brought these in as auxiliary variables? If so, do I only mention these variables in the auxiliary command or do they also need to be mentioned in the usevariables command? 


No, that will not change things. 

Katy Roche posted on Friday, September 17, 2010  7:23 am



I have 20 imputed .dat files. Can you point me to syntax that creates the .dat file which lists the 20 imputed data sets? 


If you imputed the data in Mplus, the file is created for you. See DATA IMPUTATION. If you did this outside of Mplus, you need to create the file yourself. 

Katy Roche posted on Friday, September 17, 2010  8:50 am



I did this outside of Mplus. Is there syntax or guidance on how to create the file myself? I know that the file is to list the 20 data sets but am unclear about how to create this. 


See Example 13.13 in the most recent user's guide which is on the website. 
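
In case it helps others who imputed outside of Mplus: the list file is just a plain text file with one data set name per line. Below is a minimal sketch in Python for generating it. The file names imp1.dat to imp20.dat and the list file name implist.dat are made up for illustration, so substitute your own. As I understand Example 13.13, the DATA command then points FILE at this list file and specifies TYPE = IMPUTATION.

```python
from pathlib import Path

def imputation_list_text(data_files):
    """Return the list-file contents: one data set file name per line."""
    return "".join(f"{name}\n" for name in data_files)

def write_imputation_list(data_files, list_path):
    """Write the list file to disk."""
    Path(list_path).write_text(imputation_list_text(data_files))

# Hypothetical names for 20 imputed data sets: imp1.dat ... imp20.dat
example_files = [f"imp{i}.dat" for i in range(1, 21)]

if __name__ == "__main__":
    # "implist.dat" is an assumed name for the list file.
    write_imputation_list(example_files, "implist.dat")
```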


I have a question about the individual LL values output under the SAVEDATA option. In trying to reproduce individual log-likelihood values from a single-replication simulated data set under MAR missingness, the values I calculated for a single case (in PROC IML) were slightly different from the values produced in the SAVEDATA output. I was originally using the model-implied means and covariances under H0 to calculate the log-likelihood in PROC IML but then switched to the H1 means and covariances; the H1 sufficient statistics seemed to reproduce the individual LL values in the output data set. So a) am I right in my understanding that the LL values under the SAVEDATA command are the H1 LL values and, if so, b) is there any way to also output the individual values under H0? 


Try using TYPE=RANDOM. There is a problem with TYPE=GENERAL and all continuous outcomes for the saved LL's which will be changed in the next update. 


Thank you, Linda. Confirmed on my end that TYPE=RANDOM + SAVEDATA/LOGLIKE gives H0 LLs. Might both H0 and H1 LLs be available in SAVEDATA for the next update? 


No, just H0. Why would you want this for the H1 model? 


More for illustrative purposes, to show case-level discrepancies between H1 and H0 LLs. I was thinking about them for a module on FIML for a seminar on missing data. Your point is well taken though, because it is difficult to think of how H1 LLs would be useful in practice when your concern is H0 in real applications. 
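
For a seminar module like this, a small hand calculation may be useful. The sketch below (Python, standard library only) computes one case's log-likelihood under a multivariate normal using only the variables observed for that case, which is the usual casewise building block of ML under MAR for continuous outcomes. It is an illustration, not Mplus's internal code. Whether the result is an H0 or an H1 value depends entirely on which means and covariance matrix you plug in (model-implied or unrestricted). The function name and the use of None for missing are assumptions of this sketch.

```python
import math

def casewise_loglik(y, mu, sigma):
    """Log-likelihood of one case under N(mu, sigma), using observed entries only.

    y     : list of values, with None marking a missing value
    mu    : list of means (same length as y)
    sigma : covariance matrix as a list of lists
    """
    obs = [i for i, v in enumerate(y) if v is not None]
    p = len(obs)
    if p == 0:
        return 0.0  # a case missing on everything contributes nothing

    d = [y[i] - mu[i] for i in obs]                   # deviations, observed part
    s = [[sigma[i][j] for j in obs] for i in obs]     # observed sub-matrix

    # Cholesky factor low of s, so that s = low * low^T
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = s[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]

    # Forward substitution: low * z = d, so d' s^{-1} d = z'z
    z = []
    for i in range(p):
        z.append((d[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])

    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(p))
    quad = sum(v * v for v in z)
    return -0.5 * (p * math.log(2.0 * math.pi) + log_det + quad)
```

Summing this over cases with the H0 moments and again with the H1 moments gives the two log-likelihoods whose case-level differences you would be showing.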


Just to be sure of myself: when type=missing is specified, what is the default method that Mplus uses to handle missing data? Or is this a function of the model specified? In my case, it is a simple path analysis; all variables are observed. 


It is a function of the estimator and variable type not the model. Mplus provides maximum likelihood estimation under MCAR (missing completely at random), MAR (missing at random), and NMAR (not missing at random) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types (Little & Rubin, 2002). MAR means that missingness can be a function of observed covariates and observed outcomes. For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes. When there are no covariates in the model, this is analogous to pairwise present analysis. 

Sarah Ryan posted on Wednesday, October 06, 2010  10:53 am



I am working on the prospectus for my dissertation in which I will be using a national (NCES) longitudinal dataset. A number of exogenous variables are categorical, as is the mediating variable (5 levels) and the outcome (7 levels so could be considered continuous). I have some missingness (data meet MAR) seldom more than 10% and a good deal less in most cases. I have two questions, and I apologize if these seem horribly basic: 1) Is it best to use the WLSMV estimator? and, if so, 2) Do I need to employ MI to deal with missingness as a first step? I have had some faculty at a training I attended suggest that it is always better to impute first, even in the SEM framework. I'm still learning, so I'm having trouble understanding when and why I would or would not be wise to use MI. Thanks. 


With categorical outcomes, you can use either weighted least squares or maximum likelihood estimation. If your model has more than four factors, maximum likelihood would not be feasible because numerical integration is required. If you use maximum likelihood, you can use the default missing data estimation which is asymptotically equivalent to multiple imputation. If you use weighted least squares estimation, I would use multiple imputation because missing data estimation with weighted least squares estimation is not as good as with maximum likelihood. 

Alice posted on Thursday, November 04, 2010  11:27 am



Question for Mplus discussion board: I am a new user of Mplus. I am trying to run latent class analysis. The data file covers two waves of data. The data file has complex sample survey features. It has stratification, clustering, and weights. And I need to use the subpopulation option. The sample is the wave 4 sample. My questions are: (1) After I limit my sample to the wave 4 sample, my data still have missing values on all variables. Income is a continuous variable and all others are categorical variables. Am I able to use Full Information Maximum Likelihood to deal with missingness? I read somewhere online that Mplus FIML is only for continuous variables. Is that true? (2) Is FIML able to deal with a complex survey sample? (3) Can I use multiple waves to run latent class analysis? Please see my code in the next message for reference. 


1. The default in Mplus is estimating the model using all information using maximum likelihood. A person who has missing values on all variables does not contribute anything so they are deleted. 2. Yes. 3. This would be LCGA or GMM. See the Chapter 8 examples in the user's guide. 

Alice posted on Friday, November 05, 2010  10:55 am



Thanks for the reply, Linda. Although my data are longitudinal (two waves), there are no repeated measures. Wave 1 data are respondents' reports of their parents' socioeconomic variables and wave 4 data are respondents' own socioeconomic variables. I want to use latent class analysis to capture intergenerational mobility, and I want to assign individuals to different classes. For example, I want to classify people into groups like moving up, staying the same as their parents, or moving down. For this kind of model, can I do simple latent class analysis (treating the longitudinal data as cross-sectional data) instead of LCGA or GMM? BTW, when using Full Information Maximum Likelihood to deal with missing data, do I need to specify it in the code? 


I am trying to analyse clinical + genetic data from a patient cohort as part of my PhD. I have started using LCA (LatentGolD) to classify any underlying latent classes within my data, however after reading the manuals and a few tutorials, I am still confused as to how to determine the best cluster model. Some places I have noticed they just opt for the lowest BIC, however in other places they select the lowest L2 value. Is there any set criteria to select the best model? p.s. I have a very basic statistical background! 


It is very common to use BIC with mixtures. Take a look at Nylund, K.L., Asparouhov, T., & Muthén, B. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling: A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569, which is on the Mplus web site. 


Answer for Alice: FIML has come to mean ML under the MAR assumption of missing data. It is true that the term is typically used with continuous outcomes, but ML under MAR can be used with categorical outcomes as well, not just continuous ones. It is the default in the current Mplus version. You obtain ML under MAR simply by specifying what missing data symbol you have in the data (Missing = in the Variable command). By requesting Patterns in the Output command you will see what missingness there is in your data. It sounds like you want to do Latent Transition Analysis. This is a Latent Class Analysis at several time points where you can study changes in class membership over time. The User's Guide has several such examples and there are several papers posted on our web site on this topic. 

Xi Chen posted on Monday, November 15, 2010  2:14 pm



Hi Dr. Muthen, I ran a simple regression in Mplus and SPSS. The number of valid cases in SPSS with listwise deletion is 409, while the number of observations in Mplus is only 290, together with this: *** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 190 I checked the way I read the data and I did not find any problem. Looking forward to your suggestions. Thanks! 


That message suggests that you did not do listwise deletion in Mplus. That message is related to TYPE=MISSING where missing data theory does not apply to independent variables. To do listwise deletion in Mplus, specify LISTWISE=ON in the DATA command. 

Xi Chen posted on Monday, November 15, 2010  2:40 pm



the result was the same with listwise=on. 


Please send the two outputs and your license number to support@statmodel.com. 

Xi Chen posted on Monday, November 15, 2010  5:15 pm



It turns out that the total N is 712 but Mplus only used 480 observations. The variables in the model do not have missing data and there are 712 rows in the data file. Is there any way to find out which part of the data is used in Mplus? Thanks! 


You can save the data using the SAVEDATA command. 

Xi Chen posted on Monday, November 15, 2010  9:22 pm



Hi Dr. Muthen, I have checked the data used in the analysis against the original data. It looks like Mplus deleted some observations for reasons other than missing data (some observations without missing data were also deleted). Is there any reason why Mplus would delete observations from the analysis? Thanks! 


No. 


Hi, I just upgraded Mplus to 6.1 and ran an old program, and the number of subjects is now lower. I want to estimate a regression model using FIML. But now I receive the following message: *** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 29 1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS How can I use FIML with 6.1? Thanks, Alain Girard, University of Montreal 


In Version 5, the default changed from listwise deletion to TYPE=MISSING. You can obtain listwise deletion by adding LISTWISE=ON; to the DATA command. Missing data theory does not apply to observed exogenous covariates. That is why observations with missing on x are excluded. If you want them included, you must mention their variances in the MODEL command. When you do that they are treated as dependent variables and distributional assumptions are made about them. 


I have a question. Data could be MCAR, MAR, or MNAR. How can we identify which type of missingness we have? I am using SAS. So, what is the mechanism? 


You can't identify whether the data are missing as MAR or NMAR, the two key contenders. For how to approach this dilemma, see the 2 papers on missing data by Enders and Muthen et al. mentioned on our home page, which also show how to do NMAR modeling in Mplus. 


Hello, I have a question regarding a discrepancy between the estimated sample statistics produced by the descriptives produced for example 11.2, and the estimated sample statistics provided for a LGM. For the outcome trajectory, the estimated means for the baseline observation are the same, but diverge for later observations. Specifically, the estimated means are substantially lower for the LGM produced SAMPLE STATISTICS: ESTIMATED SAMPLE STATISTICS at later observations than the RESULTS FOR BASIC ANALYSIS: ESTIMATED SAMPLE STATISTICS. What is the source of this difference? Thanks. Nick 


Please send the output files that show this and your license number to support@statmodel.com. 


I am a doctoral student attempting to test second-order factors that I will include in a future SEM analysis. I am using complex survey data which require weights. My data set has two different versions of the weights: a base weight with Taylor series strata and PSU, or replicate weights. When I was proposing this project, I was advised by my mentor to use FIML in Mplus to address missing data. I have been using the replicate weights with bootstrap standard errors, but it appears that FIML is not available with this method of weighting. Am I correct in my understanding that FIML is not available when using replicate weights with bootstrap standard errors? In this case, what approach do you recommend? If at all possible, I would like to avoid listwise deletion. 


I don't think this is the case. Please send the output and your license number to support@statmodel.com so I can see what message you are getting. 


Dear Dr. Muthen, I am unsure whether I understand the above explanations correctly. My model is a simple 1-factor, 20-item model with ordinal data. I am using the WLSMV estimator. However, I have missing data. How can I handle these missing data points? Can I use FIML? Thank you very much, Sabine 


You can use FIML with your model and this will handle the missing data well. 


can FIML be used when WLSMV is specified as estimator and is this accomplished by type = missing command? An earlier reply post by Linda to another user's question indicated that "if you use TYPE=MISSING with WLSMV, the missing data technique is pairwise present." Thanks! 


No, FIML cannot be used with WLSMV. 

Sarah Ryan posted on Wednesday, April 27, 2011  3:24 pm



I am running a mediation (mediator is latent continuous) model with four latent factors, and predominantly categorical indicators. The outcome is ordered categorical with six levels. Would it be advisable to treat the outcome as continuous in this case (rather than specifying it as CATEGORICAL) in order to reduce the computation burden of the numerical integration that will be required for this model? 


It depends on whether the ordinal variable piles up at either end, that is, has a floor or ceiling effect. If it does, it should be treated as a categorical variable. If not, you are probably safe to treat it as a continuous variable. You can also consider using the WLSMV estimator. If you have categorical factor indicators, each factor is one dimension of integration with maximum likelihood estimation. 

Sarah Ryan posted on Thursday, April 28, 2011  2:26 pm



Okay. If I use the WLSMV estimator with imputed data sets, however, is there a way to test multigroup invariance? My reading has led me to think that difference testing with imputed data is relatively unexplored. Also, is a corrected/scaled Wald statistic provided in the output when using Mplus to analyze imputed data sets using WLSMV? 


I don't know of a way to do a difference test with imputed data. A correct Wald test using MODEL TEST is provided with imputed data. 


A quick question I had: I realize that beginning with v. 5 Mplus uses missing and Type=H1 as the default in model analyses. However, I was curious as to why the exact same model run (same command syntax) would indicate different missingness (on the same input data file) between versions 4.1 and 6.0. This was noted when an adviser ran the same analyses on a different version than I am using (the estimates are basically the same, but in the analyses using 4.1, all the data are indicated as being present, whereas V. 6 indicates that there are 2 cases with data missing on x-variables and 150 cases where data are missing on all variables except x-variables). I know from a previous analysis using FIML in v. 4.1 that the warning for missing data is worded such that missing data are noted as 'number of cases with missing on all variables'. Is this difference because v. 5 and higher looks at missingness separately for x and y variables, whereas v. 4.1 looks at it with respect to all variables considered simultaneously? 


This difference is explained under Version History. See Analysis Conditional on Covariates under Version 6.1. 


Thanks Linda, read through this. To make sure I am understanding the difference fully would it be correct to say that pre v6.0 cases were only deleted if they were missing values for all variables (i.e., endogenous and exogenous) whereas currently cases are deleted if they are missing either all x vars, all y vars, and/or both considered in total? 


Yes, for pre Version 6. For Version 6 and on, cases are deleted if they have missing for one or more x variables or all y variables. These are considered separately. 
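
To make the rule concrete, here is a small Python sketch of the case-selection logic as described in this reply. It only mirrors the wording above and is not Mplus code; the function name, the dictionary layout, and the use of None for missing are assumptions for illustration.

```python
def is_analyzed(case, x_names, y_names):
    """Return True if a case would be kept under the rule stated above.

    case    : dict mapping variable name -> value, with None marking missing
    x_names : observed exogenous variables (covariates)
    y_names : dependent variables
    """
    missing_any_x = any(case[name] is None for name in x_names)
    missing_all_y = all(case[name] is None for name in y_names)
    # The two conditions are checked separately; either one drops the case.
    return not (missing_any_x or missing_all_y)
```

So a case with complete x and a single observed y is kept, while a case with every y observed but one x missing is dropped.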


Many thanks. One thing I am still having trouble wrapping my head around is that when I conduct parallel analyses in V6, my results are exactly the same as they were in v4. The only difference is that in v6 I get the warning that the data set contains cases with missing on all variables except x-variables, and that these cases were not included in the analysis. With those cases not included in the analysis, how is it that the model results can still be exactly the same? Does it have something to do with the fact that I have complete data for the x-variable in the model? Bear with me if the answer is straightforward and I am just not seeing it. 


It is the case that for maximum likelihood estimation for continuous outcomes with no missing data, the results will be the same if the model is estimated with y and x or with y conditioned on x. It is only in this case that the results will be the same. We changed to y conditioned on x to be in line with the rest of the program and regression in general. 


Are there sample size parameters for conducting a pattern mixture growth model? I am writing a data analysis section for a grant in which missing data that is not ignorable is expected. It is a small clinical trial with a total of 60 subjects with equal distribution in two treatment groups (n=30 in each). I know this is very small, but was wondering if testing a model with two growth classes (one with missing one without) would be possible? 


That depends on how many time points you have and what the growth shape is, plus parameter values. You have to do a Monte Carlo simulation study to learn about it. 60 may not be too low even for a 2-class model, but with mixtures the answer also depends on the degree of class separation in the growth factor means. See UG chapter 12. 

Eric Teman posted on Tuesday, August 16, 2011  9:25 pm



When using FIML in Mplus, does Mplus delete cases with missing values on exogenous observed variables by default? And uses full information for missing values on the endogenous variables? 


Yes. Missing data theory does not apply to observed exogenous variables. Any case with missing on one or more observed exogenous variables is eliminated from the analysis. 

Eric Teman posted on Wednesday, August 17, 2011  5:51 pm



Does this also mean that missing data theory doesn't apply to CFAs, since there are only exogenous variables in a CFA? 


There are only endogenous variables in CFA. Endo is Y, exo is X. 

Eric Teman posted on Wednesday, August 17, 2011  6:14 pm



If missing values on covariates are ordinal, will FIML still delete the cases? 


Yes. You can use multiple imputation where you specify that the variable is categorical. 

Eric Teman posted on Wednesday, August 17, 2011  6:49 pm



And this is only since version 6.1 right? 


Multiple imputation came out in Version 6. 

Eric Teman posted on Thursday, August 18, 2011  11:36 am



When did FIML start deleting cases with missing values on the exogenous variables? What did it do prior to that? 


Version 6.1. See Version History on our web site to find more information about this. You can easily revert to before v6.1 by mentioning the means or variances of the covariates. But this then makes additional model assumptions, not included in the original model. They are the same assumptions you make with multiple imputation. We made this change to be consistent throughout the program with categorical modeling, mixture modeling, and other cases. 

Eric Teman posted on Thursday, August 18, 2011  2:38 pm



Hypothetically, let's say a dataset had missing values only on exogenous variables, i.e., the endogenous observed variables are complete. Would employing FIML be identical to using listwise deletion? 


Yes, at least the way I define FIML. FIML is helpful when some endogenous variables have some missing values because then missingness is allowed to be a function of some of the other, not missing, variables for the individuals with missing. For instance, in a longitudinal study, the outcome at the first time point may be observed for many persons and this may predict later missingness. 

Eric Teman posted on Saturday, August 20, 2011  3:03 pm



On p. 458 of the version 6 Mplus manual, it says, "The ASCII files...must be created by the user." I noticed yesterday, though, that I did not need to manually create these. When did Mplus start doing this automatically? 


This is done automatically if you impute the data sets using DATA IMPUTATION but not if you impute the data sets outside of Mplus. 

Eric Teman posted on Wednesday, August 24, 2011  8:05 pm



Hypothetically, in a Monte Carlo simulation study where a design cell contains 1,000 replications of a CFA model where multiple imputation was used (with 10 imputed data sets created per replication), would you simply average all of the parameter estimates and fit statistics (including chi-square) for that one cell across all imputations within replications? 


You can do that for the parameter estimates but not for the fit statistics. How to accumulate fit statistic information is unstudied except for ML chi-square. See the most recent Topic 9 course handout for the formula. 
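
For the parameter estimates within one replication, the usual pooling is Rubin's rules: average the estimates, and combine the within-imputation and between-imputation variance for the standard error. A minimal Python sketch follows. This is the textbook formula, not a description of Mplus internals, and the function name is made up. It does not cover fit statistics.

```python
import math

def pool_estimates(estimates, standard_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    Returns (pooled_estimate, pooled_standard_error). Requires m >= 2.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                                  # pooled estimate
    within = sum(se ** 2 for se in standard_errors) / m         # mean squared SE
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between                  # total variance
    return q_bar, math.sqrt(total)
```

In a simulation cell you would apply this once per replication and then summarize the pooled values across the 1,000 replications.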

Eric Teman posted on Thursday, August 25, 2011  2:39 pm



Is the ML chi-square over 5 imputations, for example, output anywhere for reading into an outside statistics software package? Or will I have to calculate the ML chi-square over the number of imputations in multiple imputation? 


Note that the ML chi-square for each imputation, or the average of this over replications, is not a useful measure of fit but misestimates fit quite a bit. See the Topic 9 handout of 6/1/11, slides 212-216, for a study of this. The correct chi-square T_imp is printed. 

Eric Teman posted on Thursday, August 25, 2011  3:02 pm



But is it printed in the ASCII results file? I can't find it there. I see it in the OUT files but not the ASCII files. 

Eric Teman posted on Thursday, August 25, 2011  4:31 pm



I am using WLSMV as the estimator. Maybe that is why I'm not getting T_imp. In this case, how should I proceed with calculating the chi-square over imputations? 


You are hitting the research frontier. That hasn't been invented yet. 

Eric Teman posted on Thursday, August 25, 2011  5:46 pm



Would it be reasonable/appropriate to do a Monte Carlo simulation study to see how multiple imputation works with the WLSMV estimator? Mplus is capable of this, right? We just don't know how it will act? 

Eric Teman posted on Thursday, August 25, 2011  10:26 pm



To be more specific, I mean we don't know how the adjunct and chi-square fit statistics will behave when WLSMV is used, right? Might it be beneficial for a Monte Carlo simulation to be done? 


You will probably learn something about how poorly the statistic performs, but then there is still the need for theory to suggest a better statistic. 


Basic question... I'm a new user and not sure which topic to post this in. How do I get means and n missing for each variable? I want to compare the means (and number of observations for each variable) in Mplus with the means (and n's) in SAS to make sure the data set conversion was completed accurately. I was able to get means from the following code (I'm using version 6.11 on Linux):
Title: Checking mplus datafile to SAS datafile
Data: File is FIdata_82311_nonames.dat;
Variable: Names are lwlkm08 lwlks08 lwlkb08 wlk2x08 sidewalk parkdcar grass_strip bike_trail safe_street noculdesac intrsctn altroute strght_st trees intrstng_thg natrlsight attrctv_bldg traffic trffcspd_slow trffcspd_fast pedsgnl_cross crossw_beeps islands cross_busyst car_sdwlk curbcut crime_high walk_unsafed walk_unsafen straydog alleys_unsafe teens_unsafe streetlights walkbike_seen walkstores parking_diff walkplaces walktobus hilly barriers_walk popdens2 popdens3 singlemixed singlefamily housing;
Missing = all(1234) ;
Analysis: Type = basic;
Output: sampstat; 


Followup to my post above: I was not able to determine the n for each variable in Mplus, using the code above. 


TYPE=BASIC is how you would obtain the means. You can also use the SAMPSTAT option. The sample size is the same for each mean. It is shown at the beginning of the analysis summary. 


Hello, I am sorry if I am not posting in the right place; I could not find an appropriate topic. I keep getting this error message when running my input file: *** ERROR The length of the data field exceeds the 40-character limit for free-formatted data. Error at record #: 1, field #: 32 *** ERROR The number of observations is 0. Check your data and format statement. Data file: F:\MATCHEDBYCHNR W1234_mplus.csv I have saved the SPSS file as .csv without variable names, so everything should be all right. Furthermore, I am using type=complex and have missing data; is FIML used automatically? Also, I have indirect effects. I have found out that bootstrap cannot be used; is there any other option to find out whether the indirect effects are significant? (I've heard about Prodscal but am not sure how that works.) Best, Anouk 


It sounds like you may have fixed format data that you are trying to read as free format. If that is not the problem, please send the relevant files and your license number to support@statmodel.com. If you are using a version after Version 5, TYPE=MISSING is the default. When the BOOTSTRAP option is not available, Delta method standard errors are provided. 


Dear Linda, Thank you for your reply. How can I indicate that I have a fixed format in the input file? Can you advise me on a reference for the Delta method? Thank you, 


Dear Linda, Using a .dat file I have now succeeded in avoiding this error. However, a variable is now not recognized in the model command: model: f1 by w12mep w12met f2 by w2afp1 w2afp2 Mplus says f2 is not recognized. What can I do to solve that? Best, 


See the FORMAT option. The Delta method standard errors are computed with MODEL INDIRECT automatically. There is a FAQ on the website if you want more information. You need a semicolon (;) after each BY statement. 


Thank you very much! That helps. I finally have output now. One (hopefully last) question: I get this message: THE COVARIANCE COVERAGE WAS NOT FULFILLED FOR ALL GROUPS THE MISSING DATA EM ALGORITHM WILL NOT BE INITIATED CHECK YOUR DATA OR LOWER THE COVARIANCE COVERAGE LIMIT. SAMPLE STATISTICS THE MINIMUM COVARIANCE COVERAGE WAS NOT FULFILLED FOR ALL GROUPS 1. I thought FIML was used and not EM, but that should be similar, right? 2. How could I solve the covariance coverage problem? Thank you again for the quick replies. Best, 


1. It is. 2. The default covariance coverage is .10. You must have lower coverage than that in one group. See the COVERAGE option in the user's guide. 
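
If you want to check coverage yourself before running the model, the idea is simple: for each pair of variables, take the proportion of cases that have both observed. The Python sketch below is a hand check under that definition, not a reimplementation of the Mplus output. The function names and the use of None for missing are assumptions; the 0.10 default limit comes from the reply above. For a multiple-group run you would apply it to each group's rows separately.

```python
def covariance_coverage(rows):
    """Proportion of cases with both variables observed, for every pair.

    rows : list of equal-length lists, with None marking a missing value
    Returns a p x p matrix; the diagonal is the per-variable coverage.
    """
    n = len(rows)
    p = len(rows[0])
    return [
        [
            sum(1 for r in rows if r[i] is not None and r[j] is not None) / n
            for j in range(p)
        ]
        for i in range(p)
    ]

def pairs_below_limit(coverage, limit=0.10):
    """Return (i, j) index pairs, j <= i, whose coverage falls below the limit."""
    return [
        (i, j)
        for i in range(len(coverage))
        for j in range(i + 1)
        if coverage[i][j] < limit
    ]
```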


Hello, I am interested in using Mplus with large data sets like TIMSS and NAEP, in particular TIMSS. I have a few questions: 1) Does Mplus allow for sample weights for items, and if yes, how? 2) How does Mplus handle missing data in blocks? To explain in detail: TIMSS is a collection of 12 booklets that is administered to several thousand students. Each student answers only 2 booklets. As a result, if one wants to study the whole data set, one will have blocks of missing data. In TIMSS 2003, 740 students respond to the items in blocks 1 and 2, another 740 students respond to blocks 2 and 3, another 740 respond to blocks 3 and 4, and so on. If I stack all these items, there will be blocks of items which are missing. Can Mplus handle such data? 3) I will be using it for latent class analysis and was wondering if I could fix some of the parameters for the purpose of equating these blocks of items. Thank you 


1) Mplus allows for sampling weights. See the paper (which is also on our web site): Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation Modeling, 12, 411-434. I don't know what you mean by "sample weight to items". 2) You have missing by design, which can be handled in Mplus in two ways. One is to have as variables (columns in the data) all the variables in all 12 booklets, so that all students have missing on most variables. The other is to do a multiple-group analysis where each group of students has its own set of variables (but the same number of variables). 3) Yes, you can hold parameters equal for the purpose of equating. 
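
As a rough illustration of the first option (one wide record per student with every item as a column), here is a Python sketch of the stacking step. The item names and the dictionary layout are hypothetical, and None stands in for whatever missing value flag you later declare in the Mplus VARIABLE command.

```python
def stack_booklets(responses, all_items):
    """Build one wide row per student covering every item in every booklet.

    responses : list of dicts, each mapping the administered item names to 0/1
    all_items : ordered list of every item name across all booklets
    Items a student was not given come out as None (missing by design).
    """
    return [[student.get(item) for item in all_items] for student in responses]
```

For example, a student who only saw blocks 1 and 2 gets None in every column that belongs to blocks 3 through 12.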


Thank you for your reply and sorry for the double posting. Just to clarify, these variables are all categorical, not continuous: basically correct (1) or wrong (0) answers to mathematics questions. Can Mplus work with this much missing data? About 50% of the data is missing. Thank you in advance 


No problem. 


Dear Linda, I have output for two mediation models now. However, the robust chi-square values are not computed (with MLR and TYPE=COMPLEX). Also, no standard errors are displayed, only the estimates (I use STANDARDIZED and CINTERVAL in the OUTPUT command). Do you know why? Thank you, 


It sounds like the model did not converge. If you want help, please send the output and your license number to support@statmodel.com. 


Hello, Would you please direct me to the particular Mplus chapter that discusses handling blocks of missing data for categorical variables, mainly Latent Class modeling. Thank you for your help in advance 


I'm not sure what you mean by blocks of missing data. If you mean missing by design, this is taken care of by the default of TYPE=MISSING. 


The missing data are not by design. The data set is a national data set that has several chunks missing, and I would like to learn how Mplus handles these types of data. Thanks 


Please see pages 7-8 of the user's guide. 


Hello, I am receiving the error "THE MINIMUM COVARIANCE COVERAGE WAS NOT FULFILLED FOR ALL GROUPS" when running a structural equation model using complex survey data and WLSMV (n=1,100, with 46 observed variables). There are several latent variables and a number of observed dummy variables as covariates. Everything is being regressed on a dichotomous variable. When I run the analysis variables through type=BASIC to investigate covariance coverage, it appears that all coverage values are well above 0.9. I wonder what I should be looking for. This is not a multiple group analysis so I wonder if there are other groups that the error message would be referring to? Perhaps it is referring to the clusters in the complex survey data? Thanks so much! 


Please send the output and your license number to support@statmodel.com so I can see the full picture. 


Using Mplus 6, I am getting listwise deletion for my DV, even after specifying in the syntax: MISSING = KQEET07(999); KQEET07 is my DV and the only variable with missing data. I am confused because I have used the same syntax with similar models. Now, instead of getting FIML, I get listwise deletion. Any ideas? Thanks for your help! 


The default is not listwise deletion. If you want me to see what is happening, send the output and your license number to support@statmodel.com. 


Drs. Muthen: I have seen the use of latent coefficients (latent "placeholders," if you will) in longitudinal growth models where there is planned missing data. Can latent coefficients (or latent "placeholders") also be used in the context of multigroup modeling when not all items from a standard scale were administered to one group? What about if the item was not administered to either group? Would we be able to use latent placeholders and have Mplus estimate what the factor loadings for those items would have been had they been administered? Or is this an inappropriate use of the latent coefficients / latent placeholders? Do you have an example in the Mplus manual, or do you know of an article, that has used latent placeholders in the context of multigroup modeling with planned missing data in the past? Thanks. 


More specifically, can latents be used in multigroup CFAs, rather than multigroup LGMs, when there is planned missing data? 


If a study has planned missingness, a missing value flag should be assigned to individuals who did not take the item. Nothing more needs to be done. If no one took an item, it will not be used in the analysis. An item with all missing contributes nothing to the model. 


Would Mplus estimate a CFA loading for an item that was not administered if we included it in the model and in our code? That is, could Mplus estimate what the loading would have been, had it been administered in that group, based on the functioning of the other items in the lower-order factor on which it loads? 


And if so, would I need to set something for the latent item equal to other estimates in the model, such as the error variance for that item? 


You cannot identify or estimate a loading for an item that was not administered to anyone in the sample because there is no sample information on this loading. You have to have some subjects who have responses on the item so that you know how this item correlates with the other items, and therefore can draw inference about how subjects who didn't take the item might have scored. In a two-group model you can have an item that is only administered in one of the two groups, but you cannot estimate a group-specific loading for that item in the group that didn't take the item. This is for the same reason as above. I have not heard of latent coefficients / latent placeholders, so I don't know what that is. 


Thank you! 


Bengt (or Linda): You mentioned that "in a two-group model you can have an item that is only administered in one of the two groups." This is what I am doing, and the items are categorical, in a 2-group (gender) CFA model. One such item that was administered to only one group (girls only) is CBCEYEP (measuring eye problems as a somatic symptom). When I run the analysis, I get the message: *** ERROR Categorical variable CBCEYEP contains less than 2 categories. However, that is not true. Among girls, to whom the item was administered, all three possible categories were endorsed. I set loadings and thresholds to be equal in both groups, and freed the scaling factor in one group. Can you tell me from this information what I am doing wrong? 


Please send the input, data, output, and your license number to support@statmodel.com. 


Hello, thanks for your advice, Linda! I was looking at this downloadable sheet: http://www.statmodel.com/download/Different%20number%20of%20variables%20in%20different%20groups.pdf We have extra variables in certain groups (and missing in the other group), and these variables are dependent. Fixing the residual variances to be equal to those in the other group implies Theta parameterization, does it not? Thanks again. 


The trick in that FAQ is only for continuous outcomes, not categorical outcomes that you have. Here is another approach you can take. Assume as an example that you have 10 items that both males and females respond to and assume that each gender also responds to 5 additional items, but they aren't the same for the two genders. So each gender responds to 15 items. Your input should then refer to 15 items in the USEV list and your model should have any equality constraints applied only to the same 10 items that both genders respond to. 


Thanks, Bengt. Why wouldn't I list 20 items in the USEV list, if there were 10 common items + 5 extra items for boys + 5 extra items for girls? If I list 15, which 15 do I list? Those administered to the first group? 


You don't want to list 20 items because both groups would then have 5 items where nobody in the group has responses. The 15 items are different items for the two groups. For males, it is the set of 15 items that males responded to; for females, it is the set of 15 items that females responded to. So you have to arrange your data that way. 


Bengt, I am trying to run this as a multigroup model, where parameters for males and parameters for females are estimated simultaneously. Do I estimate this in three stages? For example, get the estimates for the model with items that females responded to, then get the estimates for the model with items that males responded to, then run the overall model with parameters set at the values derived from the first two models for the items that are missing for a certain gender in the overall model? I am sorry. I am so confused about this. I think the three-stage strategy may work because Linda mentioned that groups with no data for an item do not contribute to the estimates? However, if I have for example a factor with 8 indicators but two are missing for boys and two are missing for girls, and I try this three-stage strategy, can I assume that including the loadings for these items administered only to one of the two groups would have no effect on the other loadings when I add them in? 


What I suggested is a single analysis, not a multi-stage analysis. I suggested a simultaneous, 2-group analysis of males and females. You arrange your data so that, say, the first 10 columns are the common items and the next 5 columns are the items specific to each gender (so those 5 are different items for the two genders). So for instance if you have one factor f, you say in the Overall part of the model: f by y1-y15; You can then apply measurement invariance across gender for the first 10 items. The next 5 items contribute to the measurement of f, although they are different items for the two genders. This is a standard type of approach when different sets of subjects take different forms of an achievement test. A similar approach is also used with multiple-cohort data. If you are still unsure of what I am suggesting, you may want to consult with an SEM person on your campus who can sit down with you and talk you through it. 
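
The setup described above might be sketched as follows. This is only an illustration under assumptions: the variable names, the grouping codes, and the group label are hypothetical, and the sketch is written for continuous items. With categorical items, thresholds take the place of intercepts and further settings may be needed.

```
VARIABLE:
  NAMES ARE gender y1-y15;        ! hypothetical layout: y1-y10 common items,
                                  ! y11-y15 gender-specific items
  USEVARIABLES ARE y1-y15;
  GROUPING IS gender (1 = male 2 = female);
MODEL:
  f BY y1-y15;                    ! overall model, applies to both groups
MODEL female:
  f BY y11-y15;                   ! mentioned again so these loadings are
                                  ! not held equal across gender
  [y11-y15];                      ! same for their intercepts
```

Columns y11-y15 hold different items for males and females, so only y1-y10 keep the default across-group equalities.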


Hello, what does the following message imply about my data, and how can I fix the problem so that the model will run? I don't think one entire group is missing data for these items, so I am not sure why I am getting these messages. WARNING: THE BIVARIATE TABLE OF VANDA_D AND SKINP_D HAS AN EMPTY CELL. COMPUTATIONAL PROBLEMS ESTIMATING THE CORRELATION FOR VANDA_D AND SKINP_D. 


P.S. I do have several items with low endorsement, such as the items below, which were mentioned in the warning above. Can items with low endorsement lead to the generation of a warning like the messages above? The data are not really missing, so it's a confusing message to receive.

VANDA_D
  Category 1   0.963   315.000
  Category 2   0.037    12.000
SKINP_D
  Category 1   0.937   251.000
  Category 2   0.063    17.000


When a bivariate table has an empty cell, this implies a correlation of one which means that only one of the variables should be used in the analysis. Variables that correlate one are not statistically distinguishable. Empty cells can occur for extreme items when sample size is small. 

Amanda Hare posted on Wednesday, January 11, 2012  8:28 am



Hi there, I am trying to run what I thought was a very simple model using version 6.1, predicting wave 2 self esteem (continuous) from sex (categorical), wave 1 self esteem (continuous), and authoritative parenting (continuous). The problem is that I'm getting listwise deletion of all cases with missing on x-variables! Here are the highlights: MISSING IS .; USEVARIABLES ARE Ssex PAQaeAbD w1SRSESs w2SRSESs; ANALYSIS: Type = Missing; MODEL: w2SRSESs on Ssex w1SRSESs PAQaeAbD; OUTPUT: sampstat standardized; Can you help? Thanks! 


This is because missing data theory does not apply to observed exogenous variables. To avoid this, you would need to bring all of the covariates into the model by mentioning their variances in the MODEL command. When you do this, they are treated as dependent variables and distributional assumptions are made about them. 
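
A minimal sketch of what this could look like for the model in the question, using the variable names from that post. Only the last MODEL statement is new; treat it as an illustration rather than a recommended final specification.

```
MODEL:
  w2SRSESs ON Ssex w1SRSESs PAQaeAbD;
  ! Mentioning the variances of the covariates brings them into
  ! the model, so cases with missing values on them are kept.
  ! They are then treated as dependent variables with
  ! distributional assumptions, which is a strong assumption
  ! for a binary variable such as Ssex.
  Ssex w1SRSESs PAQaeAbD;
```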

EFried posted on Saturday, January 14, 2012  8:51 pm



Dear Dr Muthén! Data set of N=800, 5 measurement points (MP), first MP has 5%, last MP 40% missings on my one continuous outcome variable. Covariates (6 time invariant, 1 time varying) also have some missings. If I run the whole growth mixture model, Mplus deletes about 50% of my subjects, which is an insane amount of information I do not want to lose: "Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 396" What to do? (1) Auxiliary isn't possible in type RANDOM or MISSING if I see that correctly. (2) I watched your videos 3-6, but the parts about missing data confused me more than they helped me ;). I read up on "Diggle-Kenward selection modeling" and "Roy's model (pattern-mixture modeling)" but I don't want to write a paper on missing data and imputation. Other people must struggle with this also. Are there any guidelines I can follow on this? (3) Chapter 11 of your wonderful manual: "Covariate missingness can be modeled if the covariates are brought into the model and distributional assumptions such as normality are made about them." What does this mean? How do I model covariate missingness exactly? Thank you so much! T 


I would choose number 3. What this means is that in regression the model is estimated conditioned on the covariates and no distributional assumptions are made about them. If you bring them into the model and treat them as dependent variables, distributional assumptions are made about them. 

EFried posted on Wednesday, January 18, 2012  9:55 pm



Thank you! Is there an example in the v6 manual or in any of the videos for this? Or in one of the papers? I wouldn't quite know how to do this. Also, I have 5 measurement points, and 8 time invariant and 2 time varying covariates (with 4 measurement points each). So I probably would have to decide which ones to bring into the model as dependent variables, otherwise the model would no longer be identified? Thank you Torvon 


You should bring all observed exogenous covariates into the model. If they are called x1 and x2, you say in the MODEL command x1 x2; 

EFried posted on Thursday, January 19, 2012  12:27 pm



So to the model ... %OVERALL% i s | y0@0 y1@1 y2*2 y3*3 y4*4; i s ON x1-x5; y0 ON x6; y1 ON x7; y2 ON x8; y3 ON x9; y4 ON x10; ... I add the line x1-x10; ? Thanks 


Yes. 

Nancy Lewis posted on Tuesday, January 31, 2012  10:42 am



I am trying to run a mixed-effects meta-analysis using SEM, with 4 dummy coded moderator variables as fixed effects and a random effect for the intercept. Several studies are missing data on one or more moderator variables. When I run the model with TYPE=RANDOM, I get a warning indicating that listwise deletion was done and only 11 of my 22 cases were included in the model. However, when I run the model with both the intercept and moderators as fixed, I do not get this warning and all 22 cases are included. Why is this and what do I need to do to use FIML for the mixed-effects analysis? Thank you for your help. 


You are probably running an older version of the program where with all continuous variables the y with x model was estimated instead of the y given x. 

Nancy Lewis posted on Tuesday, January 31, 2012  2:53 pm



I am using Version 6.11. 


Please send your outputs and license number to support@statmodel.com so I can see what is happening. 

katie bee posted on Wednesday, February 08, 2012  10:14 am



Professor Muthen, I am using Mplus Version 6.1 and am using ML w/monte carlo integration, and have been unable to get fit statistics. I thought I read that beginning w/version 3, this would be possible? Do you have any suggestions? Thank you. 


Chi-square and related fit statistics are not available when means, variances, and covariances are not sufficient statistics for model estimation. 

Anonymous posted on Friday, February 17, 2012  10:14 am



Hi. I’m trying to unpack the defaults in Mplus (5.21) Re: the way it “adjusts” for observed control covariates, across different estimators. I have a model: Latent Y regressed on latent X1 and a set of observed control covariates. I use the MLR estimator. It looks like the default is to give estimates of the covariances between the observed and latent covariates. My understanding is that one doesn’t need to “call in” the covariances amongst the observed covariates, in order to make sure that the fitted regression parameters control for the other variables in the model. If I, say, use numerical integration here instead, it looks like the default is to NOT estimate covariances between the observed and latent covariates. Is it still the case that the regression parameters are adjusted for the effects of the covariates in the model (observed or latent)? I ask b/c I fit the same model –w/ MLR and then w/numerical integration—I get notably different estimates for the effects of my observed covariates. W/ MLR, it looks like it might be adjusting for the other covariates, whereas w/ numerical integration it does not appear to be. If I “call in” the covariances between the latent and observed covariates w/ numerical integration, it looks like the MLR model w/out numerical integration. I suppose it could also be a difference in the way the two models handle missing data (?). Thanks for any thoughts. 


Please send the two outputs and your license number to support@statmodel.com. 

gibbon lab posted on Monday, March 05, 2012  8:49 am



Hi Linda, In one of your old posts (above), you mentioned "For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes." I was just wondering if you have a reference paper for this so that I can read more details. Thanks a lot. 


Hi Linda, I am new to MPLUS and have version 6.12. I am running a simple CFA with one factor and 21 ordinal outcome variables. But, I have missing data (assuming MAR). I am a little confused as I have read different things in terms of whether the program 'handles' the missing data when missing is indicated. I have gotten the program to run and get fit indices, but am worried that the estimator being used isn't appropriate. (WLSMV). Is this ok? thanks so much! INPUT INSTRUCTIONS TITLE: 1 FACTOR CFA OF COGNITIVE CAPACITY SAFETY TBI DATA: FILE IS "C:\Users\kathy\Desktop\shepherd_safety_project\ ControlFIle_cg_cc3_3_2012.dat"; VARIABLE: NAMES ARE cc1 cc2 cc3 cc4 cc5 cc6 cc7 cc8 cc9 cc10 cc11 cc12 cc13 cc14 cc15 cc16 cc17 cc18 cc19 cc20 cc21; CATEGORICAL ARE cc1 cc2 cc3 cc4 cc5 cc6 cc7 cc8 cc9 cc10 cc11 cc12 cc13 cc14 cc15 cc16 cc17 cc18 cc19 cc20 cc21; MISSING are all (9); MODEL: F1 BY cc1 cc2 cc3 cc4 cc5 cc6 cc7 cc8 cc9 cc10 cc11 cc12 cc13 cc14 cc15 cc16 cc17 cc18 cc19 cc20 cc21; OUTPUT: SAMPSTAT; STAND; RESIDUAL; PATTERNS; SAVEDATA: FILE IS COGCAP_02122012cfa.DAT; DIFFTEST IS DERIV.DAT; FORMAT IS F2.0; PLOT: TYPE IS PLOT2; 


Gibbon: There is no paper that describes this. With WLSMV the dependent variables are looked at in pairs so missing data information cannot be gathered from all variables like in maximum likelihood. 


Kathleen: In your case with only one factor, I would recommend using MLR. As an alternative, you can do multiple imputation to generate data sets and use WLSMV to analyze them. 
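
As a rough sketch, the first suggestion might look like this for the input posted above. It assumes the same variable names and missing-value code as in that post, and it leaves out the DATA, OUTPUT, and SAVEDATA parts.

```
VARIABLE:
  NAMES ARE cc1-cc21;
  CATEGORICAL ARE cc1-cc21;
  MISSING ARE ALL (9);
ANALYSIS:
  ESTIMATOR = MLR;     ! maximum likelihood using all available data;
                       ! with one factor there is one dimension
                       ! of numerical integration
MODEL:
  F1 BY cc1-cc21;
```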

naT posted on Thursday, April 19, 2012  4:15 pm



I am having trouble reproducing residual variances from the estimated parameters of path analysis model with missing data. I ran path models with and without missing variables. When I manually recalculated the residual variance using estimated parameters, I could only reproduce mplus residual variances when I have full data. I took out the cases with missing variables and recalculated again but my calculation still did not match with the mplus estimate of residual variance. Could you please help me understand what is the problem. I built the model as single indicator factor analysis model. Thank you. 


I don't see how you can manually calculate the residual variance; it is estimated by ML, I assume? You don't say if your variables are continuous or categorical; in the latter case residual variances are not free parameters. 

naT posted on Friday, April 20, 2012  2:19 pm



It is estimated with ML. All variables were continuous. I just recalculated as the variance of difference between estimated dependent variable Y' and actual Y. Y' was calculated as Y'=intercept+coef*X. X and Y are observed, so I used the parameters (coef and intercept) produced by mplus to reproduce residual variance in a spreadsheet. Please let me know if I have misunderstood. I was able to reproduce the residual variance for full data but not for missing data. 


You are assuming that estimated variance of Y equals the sample variance of Y. This is not true for all models. If you look at your missing data run, requesting Residual in the Output command, you will probably see that the difference between estimated and observed variance is not zero. 

Mai Sherif posted on Tuesday, May 08, 2012  8:42 am



I am fitting a latent variable model that also includes random effects. Some of the items are binary while the rest are continuous. My data includes missing values as well. The output that I get says that the dimension of numerical integration is 1 and it is actually very fast. I was wondering how the fitting takes place with only one dimension of numerical integration although I have seven latent variables and random effects? Is there a dimension reduction technique used by MPlus? Thanks! 


We would need to see the output to tell. Please send to Support. 


Hi there, I am running some new analyses for a revise/resubmit and have encountered a problem. When I ran the following script back in November, the results yielded an N of 184: USEVARIABLES ARE gender1y psautm1 psautm4 psycon1 psycon4; MISSING IS .; ANALYSIS: TYPE IS MEANSTRUCTURE; MODEL: psycon4 on psycon1; psycon4 on psautm1; psycon4 on gender1y; psautm4 on psautm1; psautm4 on psycon1; psautm4 on gender1y; psautm1 with psycon1; psautm4 with psycon4; OUTPUT: sampstat standardized; Today, the same script with the same data is giving me a much smaller N. Can you help? Amanda 


Please send the two outputs and your license number to support@statmodel.com. 

Mai Sherif posted on Wednesday, May 16, 2012  10:20 am



I have 4 u's (random effects) and 3 z's (latent variables) and the output below specifies that the dimension of numerical integration is only 3. I am just wondering why the dimension is only 3 and how the other latent variables are integrated. usevar = G1-G4 L1-L4 N1-N3; Categorical are N1-N3; Missing are all (9); Analysis: Estimator=MLR; Model: u1 by G1@1 L1@1 ; u2 by G2@1 L2@1 ; u3 by G3@1 L3@1 ; u4 by G4@1 L4@1 ; G1-G4; L1-L4; z1 by G1-G4 N1; z2 by L1-L4 N2; z3 by N1@1 N2@1 N3@1; [N1$1 N2$1 N3$1]; z1 with z2; z1 with z3; z2 with z3; u1 with u2-u4 @0; u2 with u3-u4 @0; u3 with u4 @0; u1-u4 with z1-z3 @0; Number of dependent variables 11 Number of continuous latent variables 7 Integration Specifications Type STANDARD Number of integration points 15 Dimensions of numerical integration 3 Adaptive quadrature ON Link LOGIT 

Mai Sherif posted on Wednesday, May 16, 2012  10:25 am



Another question I have is about dealing with survival data. Do we have to specify that some indicators are survival indicators? Or is it sufficient to have the survival items set up as in (Muthen and Masyn, 2005) and then Mplus will automatically model the conditional probabilities of survival (hazard function) rather than just a logistic model? Many thanks. 


Please send the output and your license number to support@statmodel.com. For discrete-time survival, you do not need to specify that the indicators are survival indicators. You need to arrange the data and specify the model as shown in Example 6.19. Please limit your posts to one window. 

Eric Deemer posted on Wednesday, May 16, 2012  11:26 am



Hello, I would like to determine the proportion of missing data in my data set. Under the DATA MISSING command, would I just ask for DESCRIPTIVES for the variables that I named as missing? Are frequencies provided as part of the DESCRIPTIVES output? Eric 


Ask for the PATTERNS option in the OUTPUT command. 
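
A small sketch of such a run, assuming a hypothetical file name, variable names, and missing-value code. TYPE = BASIC is added here because, as mentioned elsewhere in this thread, its output can be used to inspect covariance coverage.

```
DATA:
  FILE IS mydata.dat;       ! hypothetical file name
VARIABLE:
  NAMES ARE y1-y4;          ! hypothetical variable names
  MISSING ARE ALL (-99);    ! hypothetical missing-value code
ANALYSIS:
  TYPE = BASIC;
OUTPUT:
  PATTERNS;                 ! missing data patterns and their frequencies
```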

Eric Deemer posted on Wednesday, May 16, 2012  12:13 pm



Thanks, Linda! Eric 

Mai Sherif posted on Thursday, May 17, 2012  3:38 am



Dear Linda, Thanks a lot! 

rs posted on Thursday, June 21, 2012  9:59 pm



Hi Linda, I am conducting SEM analysis using data from a cross-sectional study. Missing data analysis for some variables showed data are missing not at random (MNAR). I am wondering whether MLR will be able to handle data which are missing not at random? If not, will it help if I impute missing values using SPSS for these variables before I conduct the analysis using Mplus? Thank you. 


You want to make a distinction between MCAR, MAR, and NMAR (=MNAR). See missing data books, or our Topic 4 teaching. Mplus does MAR by default (often called FIML) and can also do NMAR modeling. Mplus also does multiple imputation. Multiple imputation assumes MAR. What you are probably seeing is that MCAR does not hold. MAR may still hold. There is no way of knowing if MAR or NMAR holds. 


Hi there, I'm doing a latent class growth analysis across 5 time points (baseline, 3, 6, 9, and 12 months) with missing data. In reading various literature, I understand I should use the "auxiliary" function to ensure my data is MAR. So, I've pasted my syntax below to ensure it is correct because my output doesn't seem to change with or without the auxiliary function (note, the sample size is 269 patients). USEV ARE age(0,1) educ(0,1) employ (0,1) t1_PA t2_PA t3_PA t4_PA t5_PA; MISSING = t2_PA t3_PA t4_PA t5_PA(999); CLASSES = c(2); AUXILIARY = (r) age educ employ; ANALYSIS: TYPE = MIXTURE; STARTS 20 2; MITERATION = 300; MODEL: %OVERALL% i s | t1_PA@0 t2_PA@1 t3_PA@2 t4_PA@3 t5_PA@4; i-s@0; Any help would be greatly appreciated! chis 

Eric Teman posted on Friday, June 22, 2012  8:13 pm



When it says missingness is not allowed on observed exogenous variables, does this include the indicators of latent exogenous variables? 


Cris: you want to use auxiliary = (M) ..., not (R). See Topic 4. 


Eric: Indicators of exogenous latent variables are endogenous variables. 


Hi Prof Muthen, In response to one of your previous comments "FIML is an estimator and EM is one algorithm for computing FIML estimates. Other algorithms include Quasi-Newton, Fisher Scoring, and Newton-Raphson. Mplus uses the EM algorithm for the unrestricted H1 models and the other algorithms for H0 models." Aren't Quasi-Newton, Fisher Scoring and Newton-Raphson mathematical methods for finding a solution of an equation? How are they related to missing data? If using these algorithms for H0 models, is it true that the missing data were not taken into account for H0 models? Thanks. 


I think I was trying to make a distinction between estimators and algorithms because it seems like sometimes missing data handling is referred to as using the "EM approach", which mixes apples and oranges. So the answer is Yes to your first question. Any of the algorithms can be used to do ML under MAR, which is often called FIML. So the answer to your second question is No: the use of these algorithms is unrelated to whether missing data is handled or not. You have to look to the assumptions made in the estimation to know how missingness is handled. 

gibbon lab posted on Monday, June 25, 2012  6:11 pm



Hi Linda, In response to one of your old comments "There is no paper that describes this. With WLSMV the dependent variables are looked at in pairs so missing data information cannot be gathered from all variables like in maximum likelihood." So pairwise likelihood uses information from each pair of the observed endogenous variables. But for those who have only one observed endogenous variable, there are no pairs available. Will those subjects be thrown away when using pairwise likelihood? Thanks. 


Missing data theory applies only to two or more dependent variables. 


I am including auxiliary variables in a latent growth curve model. Do I need to define what variables are binary or nominal before listing them in the auxiliary option? 


Auxiliary variables are treated as continuous and should not be specified to be other than that. Using a nominal variable as an auxiliary variable would not work. You may want to create a set of dummy variables. 


I am having some trouble with coding in upgraded Mplus. My old version of Mplus was Version 4. If I ran a continuous growth curve in 4 I had to specify the TYPE=Missing option, but once I did that it would handle both missingness on my observed Y's (from attrition) and missingness on my X variables. Now in the new version (6) TYPE=Missing is the default but my model is "kicking out" anyone with missingness on any X variable. It only used to do that when I modeled a noncontinuous Y variable. Is this some problem in my code or did the default FIML change such that it automatically drops cases with missing on the X variables? Thanks, Miles 


There was a default change in Version 6. It is described in the Version History on the website under Version 6.1. 


Hi Drs. Muthen, We are experiencing a drop from N = 9303 to N = 7717 in our model with only one exogenous variable. However, we checked missingness on the exogenous variable, and there is an incidence of missingness of only n = 347. Is there another reason that Mplus removes cases from analysis, other than missingness on an exogenous variable, such that we are losing almost 1600 cases? Thank you. 


Please send the output, data, and your license number to support@statmodel.com. 


Hi there, I am trying to run through the steps of factorial invariance and ultimately run from these latent constructs a growth curve over 4 time points (continuous data). I have imputed my missing data (resulting in 20 data sets) using the latest version of Amelia and created the list.dat file (as is done in example 11.5 of the user guide). However, while I no longer have missing data, I do have some variables (mean scores) that will be ultimately included in my model that are nonnormal (not seriously however). I would like to use an appropriate estimator that will allow me to use the TYPE = imputation command to summarize results and provide me with the information I require to examine the CFI and RMSEA CI to judge my model as I go through the steps of factorial invariance. From what I can tell with my first attempt, using the MLR estimator (not the MLM as it seems its not available with TYPE=imputation) I cannot access the CI for the RMSEA (as it is provided if using the ML estimator with TYPE = imputation). My question is how robust the ML estimator is with nonnormal data and if it would be appropriate to use this (knowing I have some nonnormality) so that I can get the fit information I require using the TYPE= imputation command. Your advice would be invaluable. Many thanks! 


With multiple imputation, the only fit statistic that has been developed is chi-square for ML. For the others, averages are given. I would run ML and MLR and see how different the standard errors are. If they are not that different, it would indicate that your variables are not that nonnormal. I would then use ML. 


Hi Is there any way to obtain listwise deletion only for usevariable list not for all the variables (listed in names ARE .....)? Many thanks, Ebi 


You should get listwise deletion for the variables on the USEVARIABLES list if one is used. If you don't think this is the case, please send the files and your license number to support@statmodel.com. 


I have 13 groups, and am testing a five item CFA (but 150 items in the data set). The sample sizes shown in the "Number of observations" section of the results are 20 to 30% less than the real sample sizes. The only explanation that I can think of is that listwise deletion has been applied to all items mentioned in the "names are" part. Can you think of any other explanation? If not, I will talk to my supervisor to send the files. Thanks. 


Look at the warning messages that are printed for possible reasons. Check that you are reading the data correctly. You may have blanks in the data set that are not allowed with free format. Check that the number of variable names in the NAMES list is the same as the number of columns in the data set. 


Hi Thanks. I did not know with free format blank is not allowed. It worked out. Mohsen 

Bogdan Voicu posted on Tuesday, September 18, 2012  8:09 am



Hi, I run a TWOLEVEL model. N is 72418. Mplus 6.12 drops 42258 cases due to missingness on the x-variables. I have used SPSS to check for the total number of cases with at least a missing value and it is two times lower: 20765. If I run a model with no predictor (just the dependent variable), there is no difference in the number of dropped cases reported by Mplus as compared to the one that I compute in SPSS. The more predictors I add, the higher the loss of cases when using Mplus (as compared to the value that I compute in SPSS). Since I am not very experienced with Mplus, it is probably something that I miss, but I have no clue what this should be. Any suggestion would be more than welcome! 


In Mplus, a case is dropped if it has missing values on one or more predictors. To explain the differences, you would need to send the relevant files and your license number to support@statmodel.com. 


Do Mplus 6 and 7 use FIML to calculate the means and variances produced when Type = Basic? I read hypothetical data for 6 obs and 3 vars into Mplus:

Var1 Var2 Var3
1    2    2
2    3    3
3    4    .
4    5    .
6    7    7
7    8    8

The Mplus printout shows mean = 4.833 and variance = 4.472 for both Var2 and Var3. For Var2, these values are the same as those produced by Excel and those I calculate by hand. However, for Var3 Excel and my hand calculations show mean = 5.000 and pop. variance = 6.500. Are the Mplus values 4.833 and 4.472 different from these because of FIML? Thanks for clarifying how FIML does or doesn't change the sample descriptive statistics produced in Mplus. 


Yes, FIML is used for BASIC. So you draw on information from other variables. 
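
To see where 4.833 and 4.472 can come from, here is a minimal sketch in Python that reproduces them from the six cases in the post above. It is not the Mplus algorithm; it uses the textbook factored-likelihood shortcut for a monotone missing pattern under normality. Var1 is left out because in these data it is an exact linear function of Var2 (Var1 = Var2 - 1) and so adds no information; that simplification is an assumption of this sketch.

```python
# Data from the post above; None marks a missing Var3 value.
var2 = [2, 3, 4, 5, 7, 8]
var3 = [2, 3, None, None, 7, 8]

def mean(xs):
    return sum(xs) / len(xs)

def ml_var(xs):
    # Maximum-likelihood variance: divide by n, not n - 1.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Step 1: Var2 is complete, so its ML mean and variance use all six cases.
mu2, s22 = mean(var2), ml_var(var2)

# Step 2: regress Var3 on Var2 using only the cases where Var3 is observed.
pairs = [(x, y) for x, y in zip(var2, var3) if y is not None]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
mx, my = mean(xs), mean(ys)
slope = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
resid_var = sum((y - intercept - slope * x) ** 2 for x, y in pairs) / len(pairs)

# Step 3: combine the two pieces into ML estimates for Var3.
mu3 = intercept + slope * mu2
s33 = slope ** 2 * s22 + resid_var

print("ML estimates for Var3:", round(mu3, 3), round(s33, 3))
print("Complete-case values: ", mean(ys), ml_var(ys))
```

Because Var3 equals Var2 wherever it is observed, the regression has slope 1 and no residual variance, so the Var3 estimates inherit the Var2 mean and variance from all six cases. The complete-case values (5.000 and 6.500) use only the four cases with Var3 observed.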

André Krug posted on Wednesday, November 28, 2012  7:09 am



Hello, I have one question: how can Mplus 7 show me means separated by a category such as treatment? Thank you very much. 


You can do a TYPE=BASIC for each group separately using the USEOBSERVATIONS option. Or if you want to test parameter differences, you can do a multiple group analysis. 


Hello, I have data that are neither MAR nor missing by design: we are measuring sexual dysfunctions as part of our analysis, and people who did not engage in sexual activity were unable to answer many of the questions, so they have system-missing values. We are using mixture models to analyse the relationships between sexual dysfunctions, depression and anxiety disorders. I was wondering whether using FIML would be an acceptable way to deal with these cases, or if you have any other advice? All the best, Miriam 


FIML is probably the best you can do. It is probably better than listwise deletion. 


Thanks Linda. 


One more question: can we use FIML at the disorder level (i.e., total scores from separate scales), or does it have to be at the item level? Thanks again, Miriam 


You can use FIML also for sum scores. 


Hello, Dr. Muthen, I tried to fit a multilevel regression model with missing data on the Y variable. I want to explore whether having an outgroup friend (level-1 predictor, dichotomous) influences attitudes toward the outgroup. The Y variable (attitude) is continuous and is MCAR. Here is my syntax:

Variable: Names are ID School Friend Attitude;
USEV = School Friend Attitude;
WITHIN = Friend;
MISSING are Attitude (99);
CLUSTER = School;
Analysis: TYPE = TWOLEVEL RANDOM MISSING;
MODEL:
%WITHIN%
sfriend | Attitude on Friend;
%BETWEEN%
Attitude sFriend;
Attitude with sfriend;

I got the warning "Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis." I don't know why Mplus used listwise deletion to deal with missingness on the Y variable. How can I use FIML instead? Many thanks! 


The missing data theory of FIML does not apply to observed exogenous variables. The model is estimated conditioned on these variables. This is not listwise deletion. FIML is being used. 


You can use the PATTERNS option in the OUTPUT command to check your missing data patterns on your Y variables. 


I am preparing data for an MSEM path model that will be examined in Mplus for my dissertation. The extent of missing data is less than 5%. I am considering two options for handling the missing data. 1) Impute missing data using EM before running the model. This would allow me to retain available data for summary scores. 2) Run the model in Mplus using FIML estimation of missing data. This is a more accurate estimation but would be based on less information, because the summary scores would be missing for any case missing data on any item in the measure. The extent of missing data would also be greater because of this handling of the missing data. What might be the advantages and disadvantages of using EM vs. FIML in this instance? 


Imputation and FIML should give quite similar results. Typically, if you can do FIML that gives you more options for various tests. Note that Mplus does imputation. I don't see how there would be a greater extent of missing data with FIML than imputation. Regarding the summary scores, why not use the average value of the items that are not missing. The imputed values don't carry new information anyway. 
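
The suggestion to average the items that are not missing can be sketched as follows. This is only an illustration in Python, not Mplus syntax; the function name `scale_score` and the `min_items` threshold are hypothetical choices made for the example.

```python
def scale_score(items, min_items=1):
    """Average the non-missing item responses; None marks a missing item.

    Returns None when fewer than `min_items` items were answered
    (`min_items` is a hypothetical threshold, not something from the thread).
    """
    observed = [v for v in items if v is not None]
    if len(observed) < min_items:
        return None
    return sum(observed) / len(observed)

# One respondent skipped the second item; the score uses the other three.
print(scale_score([4, None, 5, 3]))

# A respondent who answered nothing gets a missing scale score.
print(scale_score([None, None, None]))
```

With this approach a summary score is missing only when too few items were answered, so far fewer scale scores are lost than when any single missing item makes the whole score missing.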

Hicham Raïq posted on Monday, January 21, 2013  10:06 am



I'm working on SEM with categorical variables. In the output of my results, I have this warning: "Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 105". The number of observations is 2388. Which method can you suggest for dealing with the missing data: listwise deletion or pairwise deletion? Some authors propose maximum likelihood estimation for incomplete data, but this option doesn't work in my case because my analysis is type=complex. Thank you for any advice about this situation. 


Is your question about how to include the 105 cases? 

Hicham Raïq posted on Thursday, January 24, 2013  6:51 am



Maybe I should include those cases, but is that the best method to deal with missing data? Thanks 


Missing data theory does not apply to observed exogenous covariates. If you want to include those cases in your analysis, you need to use multiple imputation. See DATA IMPUTATION in the user's guide. 


Dr. Muthen, below is from a post of yours dated December 15, 2011, 2:21 pm: "If a study has planned missingness, a missing value flag should be assigned to individuals who did not take the item. Nothing more needs to be done." I am working on data with skip patterns, such that some variables are answered by only a subset of participants. I have two questions: 1) When using multiple imputation, can I impute such that only those who should have responded but refused are imputed? I don't want values for those who are valid skips. If I set valid skips to missing, they will be imputed. If I do not set them to missing but assign them values, the program will use those values as valid responses while imputing missing data. What should I do? 2) The model I am working on includes these variables that are answered by only a subsample. When modeling these variables, is there anything that needs to be done? Or do I use them just like any other variable in the model? Thank you in advance! 

Jenny L. posted on Friday, May 31, 2013  10:56 pm



Dear Professors, If the missing data are not specified in the command, how would they be treated by Mplus? In my data set, my missing data were blank; I forgot to specify missing is blank in the first place, but still got an output with no error message. I'm curious how those missing data were treated. Thank you in advance for your help. 

Jenny L. posted on Friday, May 31, 2013  11:12 pm



I should also mention that when missing values were not specified, Mplus still seemed to use all data (i.e., the sample size was the number of all participants I had). 


If you do not declare a value as a missing value flag using the MISSING option, it is treated as a valid value. All cases will be used. If you have a blank in the data set with fixed format and do not declare it as missing, it is treated as a zero. Blanks are not allowed with free format data. They cause the data to be misread. 
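
Why a blank misreads free-format data but not fixed-format data can be shown with a small sketch. This is a Python illustration of the general idea only, not of how Mplus reads files; the record and the column layout are hypothetical.

```python
# One record with four variables; the third value is missing and left blank.
# Layout assumed for this example: one-character fields separated by one space.
record = "1 2   4"

# Free-format reading splits on whitespace, so the blank simply disappears
# and the value 4 slides into the third variable's position.
free_fields = record.split()

# Fixed-format reading slices by column position, so the blank stays in place.
fixed_fields = [record[i:i + 1] for i in range(0, len(record), 2)]

print(free_fields)   # three fields: the record is one value short
print(fixed_fields)  # four fields: the third one is blank
```

In the free-format case the record comes up one value short, so later values shift into the wrong variables. In the fixed-format case the blank stays in its column, where, per the reply above, it is treated as a zero unless blank is declared as the missing value flag.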

Jenny L. posted on Saturday, June 01, 2013  9:04 am



Thank you for the clarification! 


Hi there, I am conducting a longitudinal CFA with one factor, four non-normal continuous indicators and two time points, pre-intervention and post-intervention. I have about 650 cases. I am conducting tests of measurement invariance (partial strict invariance is supported) and the aim of the study is to determine the effect of the intervention on the latent mean (so the reduction in the latent mean from pre to post, which is significant). I am able to run the analyses no problem, but the problem is that about 50% of the post-intervention data is missing from drop out. I would rather use MLR and all cases than perform listwise deletion, but I'm not sure what impact such a large amount of missing data would have. Any help would be really appreciated. Thanks, Louise 


The Mplus default is MAR using all cases, which is obtained when requesting either ML or MLR. But you are right that 50% attrition is a lot and that means that the results depend to an uncomfortably large extent on the model assumptions, including normality. It can be particularly problematic if the missingness rate is different for the intervention groups. I assume it is not possible to try to find a random sample of those who were lost to follow-up. 


Wow! Thank you for your very prompt reply. The intervention is done online in an open access research setting, so we have no control group just the intervention and no contact with the patient once they drop out. The indicator variables are very much left censored. I'm gathering you'd suggest using completer only data? Thanks again, Louise 


No, using completer only data would probably be worse. The best you can do is probably using all data via MLR (assuming MAR). 


Thank you very much. 

SYoon posted on Friday, August 09, 2013  12:07 pm



Hi, I am using Mplus 7.1. I have missing values in independent and dependent variables. Outcome variables are continuous but mediator variables are categorical. I wanted to handle them with FIML, but then I got this error message:

TYPE IS MISSING; ESTIMATOR IS ML;
*** ERROR MODEL INDIRECT is not available for analysis with ALGORITHM=INTEGRATION.

So I changed the syntax like this:

TYPE IS MISSING; ESTIMATOR IS WLSM;
*** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 953

Do you have any suggestions? Thank you. 


Because your mediator is categorical you have to pay special attention to how to treat the mediator in the modeling. Call it u and let u* be an underlying continuous latent response variable for u. The key question is if u or u* is the predictor (IV) for the distal outcome y. ML uses u which complicates matters. WLSMV uses u*. Bayes can use either. More correct causal effects are obtained as in the paper on our website: Muthén, B. (2011). Applications of causally defined direct and indirect effects in mediation analysis using SEM in Mplus. A simple approach is to use WLSMV and include the x variables in the model by mentioning their means or variances. This also enables Model Indirect. You can also do Multiple Imputations as a first step to handle the missingness on the x's. 

SYoon posted on Friday, August 09, 2013  3:17 pm



Thank you so much for the specific direction. I've tried two approaches according to your comment.

1) TYPE IS MISSING; ESTIMATOR IS WLSMV; and then I included the variances of the x-variables in my original model:

ses black hisp asian white male urbanicity public;

But then I got this message: WARNING: VARIABLE URBANICI MAY BE DICHOTOMOUS BUT DECLARED AS CONTINUOUS. Does WLSMV hold the missing at random assumption? If I don't include the variances or means of the x-variables, does it just delete cases with missing values on the x-variables? That is, with just TYPE IS MISSING; ESTIMATOR IS WLSMV;

2) I also tried to do multiple imputation of the outcome variables (but not the mediators):

TYPE IS imputation;
*** ERROR in DATA command There are fewer NOBSERVATIONS entries than groups in the analysis.

I would appreciate it if you have any further suggestions. Thank you. 


WLSMV is not the best estimator for handling missing data. ML and Bayes are better because they are full-information estimators. But with ML you can't have a u* mediator. So why not try Bayes? Regarding 2), you would have to send your files to Support to diagnose the problem. 

SYoon posted on Friday, August 09, 2013  3:56 pm



I tried Bayes as well but it doesn't seem to handle indirect effect model here. MODEL INDIRECT is not available for analysis with ESTIMATOR=BAYES. When I use ML then I get this message. MODEL INDIRECT is not available for analysis with ALGORITHM=INTEGRATION. Should I stay with WLSMV in this case? I greatly appreciate your help. 


The fact that Model Indirect is not available should not deter you. You can express the effect yourself using Model parameter labels that you use in Model Constraint. 

boydadavid posted on Monday, September 23, 2013  8:45 am



Hi, I have a question regarding the covariance coverage output. I see that one of my variables has a number of ****** beside it. What does this mean?

Covariances
            SEXF       AGE          EDUC       INCOME        SWD
SEXF        0.249
AGE         0.158      154.638
EDUC        0.059      5.428        2.733
INCOME      195.182    31807.575    1612.430   ***********
SWD         0.015      0.474        0.029      71.745        0.068
NEVERMAR    0.007      2.898        0.106      1065.354      0.021
CHRONIC     0.010      0.691        0.025      148.704       0.005


This means that income has a variance so large that it will not fit in the space allocated. We recommend keeping the variances of continuous variables between one and ten. You can rescale variables using the DEFINE command by dividing them by a constant so that their variances are between one and ten, for example, y = y/10; 
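
The arithmetic behind the rescaling advice is that dividing a variable by a constant c divides its variance by c squared. A minimal sketch in Python, with hypothetical income values and a hypothetical constant of 1000 (neither comes from the poster's data):

```python
def variance(xs):
    """Sample variance (divides by n - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

income = [12000, 15500, 18000, 20000, 16500]  # hypothetical values
c = 1000                                      # hypothetical rescaling constant

rescaled = [x / c for x in income]

print(variance(income))    # in the millions: too wide for the output field
print(variance(rescaled))  # smaller by a factor of c**2 = 1,000,000
```

For these made-up numbers the rescaled variance lands between one and ten. With real data the constant has to be chosen by looking at the actual variance first.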

Eric Deemer posted on Thursday, September 26, 2013  6:57 pm



I'm trying to calculate the percentage of missing data in my data set. If I specify "missing = all(999)", for example, is there a command I can use to determine the frequency with which the value "999" is observed? Thanks. Eric 


The PATTERNS option of the OUTPUT command will show you the patterns of missing data. 
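
If the raw file is at hand, the frequency of the flag and the pattern counts can also be tallied outside Mplus. A minimal sketch in Python, assuming a flag of 999 as in the question and a small hypothetical data set:

```python
from collections import Counter

MISSING_FLAG = 999

# Hypothetical data set: rows are cases, columns are three variables.
rows = [
    [3, 999, 5],
    [2, 4, 999],
    [999, 999, 1],
    [4, 2, 3],
]

# Overall frequency and percentage of the missing value flag.
n_values = sum(len(r) for r in rows)
n_missing = sum(v == MISSING_FLAG for r in rows for v in r)
pct_missing = 100 * n_missing / n_values

# A pattern is a string with 'x' for observed and '.' for missing.
patterns = Counter(
    "".join("." if v == MISSING_FLAG else "x" for v in r) for r in rows
)

# Cases with at least one missing value = total minus the complete pattern.
n_incomplete = len(rows) - patterns["xxx"]

print(n_missing, round(pct_missing, 1))
print(dict(patterns))
print(n_incomplete)
```

This only counts flags; it says nothing about why the values are missing.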

una posted on Thursday, November 07, 2013  6:18 am



Dear Prof. Muthen, I am running an analysis with MLR. If I understand correctly, the default is that Mplus deletes cases with missing values on exogenous observed variables and uses full information for missing values on the endogenous variables? What are the advantages of using this approach? Do you have a reference to read more about this? Thank you very much in advance, 


The advantages are using all available information rather than using listwise deletion. See the Little and Rubin reference in the user's guide. 

milan lee posted on Friday, December 06, 2013  8:35 am



Hi, I wanted to test the missing pattern of my dataset based on Little’s MCAR test (Schlomer, Bauman, & Card, 2010) using Mplus. I checked posts on the Mplus forum, and it looks like we have to use "type=mixture" to obtain Little's MCAR test and the variables have to be categorical. However, I doubt I understand it well. There has to be a general command for testing MCAR and MNAR for imputation in a general regression model (not a complex mixture model). May I have your advice on how to conduct this MCAR test with a chi-square value in Mplus? What is the command syntax for this test with continuous variables and simple regression models? Thank you very much! 


The MCAR test we give for categorical variables with Mixture is not Little's MCAR test. It comes from Fuchs (1982), "Maximum Likelihood Estimation and Model Selection in Contingency Tables With Missing Data," Journal of the American Statistical Association. About NMAR, I would recommend reading http://statmodel.com/download/Muthen%20et%20al%202011Psych%20MethGrowth1.pdf In particular, MAR vs. NMAR testing is conditional on assumptions about the missing data mechanism, so I would say it is somewhat limited. That has nothing to do with which software you use; it comes from the fact that the MAR hypothesis is very, very general, so it is hard to test against any ignorable missing data mechanism. Little's test is not available in the current version of Mplus.


milan lee posted on Friday, December 06, 2013  7:03 pm



Thanks a lot for your explanation, Tihomir! Very very informative and helpful!!! 

Matteo posted on Wednesday, January 15, 2014  9:58 am



Dear MPlus developers, I'm trying to understand the exact algorithm that you use for dealing with missing data through maximum likelihood. Reading classical papers on the topic, I thought that there exists a closed form for the maximization of the full information maximum likelihood problem with missing data only when the outcomes can be considered multivariate normal, while in all the other cases, so for example with categorical outcomes, we need iterative methods like the EM algorithm. Is this what MPlus does? Or am I wrong? Sorry for bothering you, but I didn't find this information anywhere, Thanks in advance. 


Appendix 6 in http://statmodel.com/download/techappen.pdf gives some information on the missing data estimation. A closed-form expression is available for all the models estimated with ML. Mplus does not use the EM algorithm to deal with missing data. The general ML estimation is described in http://statmodel.com/download/ChapmanHall06V24.pdf 

Matteo posted on Thursday, January 16, 2014  3:51 am



Dear Tihomir, thank you very much for your answer! I had already seen Appendix 6, but I didn't find what I was looking for and also I thought it was a little bit out of date, since it starts by saying "Missing Data is allowed for in cases where all y variables are continuous and normally distributed", while I read in the general description of modelling missing data that "MPlus provides ML estimation under MCAR (missing completely at random), MAR (missing at random), and NMAR (not missing at random) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types (Little & Rubin, 2002)." That is also what I'm mostly interested in. I will give a look to the second reference you gave me, Thank you very much again for your answer! 

Anonymous posted on Monday, January 20, 2014  2:52 pm



In SEM with the criterion variable having categorical indicators with missing values (the predictor variables have continuous indicators with missing values), can I use FIML? Thank you very much in advance!! 


Yes. 

Alexis posted on Saturday, February 22, 2014  3:27 am



Hi, I’m running a model with ML estimation. By default Mplus deletes cases with missing values on exogenous variables and uses full information for missing values on the endogenous variables. The result, however, is that I still lose 33 cases due to missing values on five exogenous variables. On this forum, I read that Mplus can handle missing values on x-variables if they are brought into the model as y-variables, for instance by mentioning the variance of the variable in the MODEL command. But I’m not sure what the effect of this is and what exactly I’m doing. Isn’t it a bit artificial? So I’m wondering what the normal/standard procedure is for handling missing values. Do you follow the default and accept that you lose 33 cases due to missing values on x-variables? Or do you do some tricks/adjustments to make your x-variables look like y-variables so that no cases are deleted? If the latter option is the standard/best approach, could you please tell me precisely what I should add to my MODEL command, given that I have five exogenous variables, of which three are dichotomous and two are continuous? Is it just as simple as adding “X1 X2 X3 X4 X5;” to the syntax? Thank you very much in advance. 


(1) Usually in multiple imputation normality is assumed for the variables with missing. This approach is also taken if you bring the x variables into the model. (2) Or, you can use multiple imputation and specify that a variable is, say, categorical, and then an underlying continuous-normal latent response variable is assumed. Both approach (1) and (2) therefore make assumptions. Treating your binary x's as continuous-normal in approach (1) is only an approximation and taking approach (2) may also be only an approximation. So in both cases you go beyond the assumption of the model you originally specified for the y's as a function of the x's. Approach (1) is probably very often taken. You mention 33 missings, but more relevant is probably the percentage that this corresponds to: if it is small, the analysis doesn't rely on assumptions as much as if it is large. 


I have a question about missing data. I am analyzing some longitudinal data in a cross-lag model (N = 250) and have 30% of subjects missing data at T1, 26% at T2 and 38% at T3. Relatedly, 36% have data at all three time points, 34% at 2 of 3, and 30% at only 1 of 3. Based on examining correlations within the sample, these appear to be MAR, and we have included correlated variables in the analysis as auxiliary variables to help with estimation and reduce bias. We submitted the paper, and both the handling editor and one of the reviewers expressed concern regarding the % missing. Do you know of any references that give guidelines on acceptable levels of missingness? Or do you have a personal rule of thumb? We have cited McArdle et al. 2004 and Enders & Bandalos, 2001 on the use of FIML to address missingness, and Enders 2010 on the use of auxiliary variables. Any other suggestions? Thank you very much for your ideas/suggestions. 


I would be most concerned about how many observations are present at two of the three time points. I would also compare my results to those of listwise deletion. You can also create dummy variables for missing for times two and three and regress them on the time 1 outcome to see if y1 predicts missingness. I don't know of any discussion of how much missingness is too much. The Enders book is the most likely source. 
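
The dummy-variable check can be sketched as follows. This Python illustration uses hypothetical data and an ordinary least-squares slope (a linear probability model) purely to show the idea; a logistic regression would be the more usual choice for a binary dummy, and nothing here is Mplus output.

```python
# Hypothetical data: time-1 outcome and time-2 outcome (None = missing).
y1 = [10, 12, 15, 18, 20, 22]
y2 = [11, 13, None, 19, None, None]

# Dummy variable: 1 if the time-2 outcome is missing, 0 otherwise.
miss2 = [1 if v is None else 0 for v in y2]

def mean(xs):
    return sum(xs) / len(xs)

m_y, m_d = mean(y1), mean(miss2)

# Slope from regressing the dummy on y1 (linear probability model).
slope = (
    sum((y - m_y) * (d - m_d) for y, d in zip(y1, miss2))
    / sum((y - m_y) ** 2 for y in y1)
)

print(miss2)
print(round(slope, 4))
```

A slope clearly different from zero suggests that missingness at time 2 is related to the observed time-1 outcome, which is consistent with MAR rather than MCAR. With six made-up cases the number itself means nothing.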


We are conducting a randomized control trial and we are doing multilevel modeling to determine program effect. According to the What Works Clearinghouse (WWC), when dealing with missing data in our analyses we need to do so separately for the treatment and control groups. Can we use FIML separately for the treatment and control groups? The other alternative that WWC accepts is multiple imputation but this also needs to be done separately for treatment and control groups. Is there a way to do this within Mplus? 


You can do this using multiple group analysis. 


Dr. Muthen, I am running a 3-step GMM model with 5-wave scale scores as my outcome variables and a few independent variables to predict class membership (e.g., race and LGS scores). Three participants did not indicate their race and one is missing on LGS. To keep them in the step 3 analysis, here is the model syntax: Model: %OVERALL% i s | pl_1@0 pl_2@1 pl_3@2 pl_4@3 pl_5@4; i WITH s; c on AA sex LGS z_t1_c z_t1_n z_t1_e highschool somecoll college; [AA LGS]; ... The analysis did include all 163 participants. However, the AIC, BIC, and ABIC are much larger than in the model without estimating AA and LGS. I wonder what your advice would be given this situation. Thank you. 


When you bring the covariates into the model, the metric of the AIC etc. changes. By the way, you must bring in all of the covariates, not just a subset. 

HanJung Ko posted on Tuesday, April 29, 2014  11:11 pm



Thank you, Dr. Linda. When I bring in all the covariates, the model could not converge. I was wondering whether it was because of a small sample size, around 160? 


Please send the output and your license number to support@statmodel.com. 


Dr. Muthen, I'm running a latent growth model in which the latent intrinsic work rewards variable at each of the six waves was specified by four items, and then intercept and slope were estimated using the six latent variables. Finally, I used the intercept and slope to predict generativity at the final wave along with some control variables. My concern is that there are two types of missing data here: those who did not participate in a given wave (missing at random) and those who participated but did not answer the intrinsic work rewards questions because they were unemployed at the time (missing not at random). I'm fine with having FIML estimate for those who missed the wave, but not comfortable estimating for those who participated in the wave but were unemployed. Would it be possible for you to give me some suggestions on how to restructure my model to account for this missing not at random? Would it be appropriate to add six dummy variables, one for each wave's employment status, when predicting generativity to address the concern of unemployment? Or should I revise in some way the longitudinal CFA model in the earlier step? Thank you very much. 


It's a good research question that I don't think I know the answer to. I wonder what it would be like if you use a parallel process growth model where you have one binary part of employed/unemployed and one continuous part with an intrinsic work reward score. The latter is missing when the former is in the unemployed status. Which would mean missing as a function of an observed variable so could be MAR. At least the missing would not only be a function of other intrinsic work reward scores, but directly a function of employment. 


Hello, I am planning on running path analysis models involving examination of direct and indirect effects on a sample (N = 159) with data at 3 different time points. My variables of interest are scale scores (means of multiple items from questionnaires). The sample has both item-level missingness (1 or more items of a scale missing) and scale-level missingness (entire scale missing) for both predictor and outcome variables, and I am seeking advice on the best approach to deal with missingness. I had the following questions: 1) Should item-level missingness be dealt with using multiple imputation as a first step in software outside of Mplus? And should this then be followed by maximum likelihood estimation at the scale level in Mplus? Or: 2) Should item-level and scale-level missingness be dealt with using multiple imputation outside of Mplus and the imputed "complete" dataset be used for subsequent analyses? Any clarification would be much appreciated! 


The optimal approach would seem to be to formulate a factor model for the item indicators for each factor and then simply use FIML (so assuming MAR). The practical problem arises from the one-factor models maybe not fitting the data well. Imputation could use a less restrictive model. But then again, the scales you mention probably are sums of items, which in itself assumes a one-factor model. 


Hi Bengt, Thanks for your quick reply! The scales mentioned are actually subscales from questionnaires with more than one factor. Just to clarify, are you suggesting running a full SEM instead of a path analysis model to assist with item-level missingness using FIML? I am concerned doing so would be difficult given my sample size. Thanks! Tania 


Yes. It may be more difficult but it would yield a better analysis. 

RSrinivasan posted on Saturday, July 05, 2014  11:51 am



Hello Drs. Muthen, I am a grad student working on a project with missing data in almost all variables. I have 5 latent variables, including 2 exogenous variables. I tried the syntax for missing data, but the error keeps asking me to add listwise=on and nochisquare in the output. I did that as well, but the error keeps coming back. Please help. I would not want the program to delete a complete record for a few missing data points. Here is my syntax:

Data: FILE IS "I:\Data\XYZWV.csv" ;
LISTWISE=ON;
Variable: Names are x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4 w1 w2 w3 w4 v1 v2 v3 v4 gender age marital edu income city pur apppur amtpur;
Usevariables are x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4 w1 w2 w3 w4 v1 v2 v3 v4;
missing = all (999);
analysis: type = missing;
MODEL: x by x1-x4; y by y1-y4; z by z1-z4; w by w1-w4; v by v1-v4;
OUTPUT: MODINDICES standardized nochisquare;

*** WARNING in ANALYSIS command Starting with Version 5, TYPE=MISSING is the default for all analyses. To obtain listwise deletion, use LISTWISE=ON in the DATA command. 1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS 


If you want the model estimated using all available information, remove LISTWISE=ON; from the DATA command. The warning is just informing you that the default is using all available information. If you get an error when you remove LISTWISE=ON, send that output and your license number to support@statmodel.com. 


Hello, I would like to conduct a model-based (H1) multiple imputation analysis but my model contains a latent variable interaction. I specified a Bayesian estimator but a message in the output file says that Bayesian estimation is not allowed with latent variable interaction and so the default estimator (ML) was used. However, I also see specifications for Bayesian estimation in the Summary of Analysis section. Will you please tell me whether the resulting imputed data were imputed using a Bayesian estimator or an ML estimator? Also, can I be sure that the imputation was done under my H1 model and not under H0? Thank you. 


Thanks Dr. Muthen. 


My apologies, I had the H0 and H1 imputation language reversed. I see now that an H1 model was used for imputation in the model that I described. Thank you. 


Hello, I am new to Mplus and am using version 6. I am trying to conduct multiple imputation for specified variables prior to analyses for my MA Thesis. The problem I have encountered is that missing data that appear as 999 in my SPSS data file and in my .dat data file, which Mplus is reading, appear as asterisks in all of the imputed files. I checked, and each 999 in my original data sets appears as an asterisk in the imputed files. I conducted a cross-sectional version of the same multiple imputation and subsequent analyses last week, and did not encounter this problem. I copied and pasted the same input file for the current data sets. Could you help to guide me toward my error? My apologies for the basic question. 


Mplus uses the asterisk as the missing data symbol when data are saved. When you read the data, you must say MISSING = *; Be sure to note that the variables may not be in the original order. All information about the format and variable order are given at the end of the output where the imputed data sets are saved. 
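
If a saved data file needs to be inspected outside Mplus, the asterisk has to be mapped to a missing value by whatever reads the file. A minimal sketch in Python, assuming a whitespace-separated record; the helper name `parse_record` and the sample record are hypothetical, and the actual format of a saved file is the one listed at the end of the Mplus output.

```python
def parse_record(line, missing_symbol="*"):
    """Split a whitespace-separated record and map the missing symbol to None."""
    return [None if tok == missing_symbol else float(tok) for tok in line.split()]

# Hypothetical record with the second variable missing.
print(parse_record("1.250 * 3.000"))
```

Inside Mplus the corresponding step is the MISSING = *; statement mentioned above.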


Hello, Linda. Thank you for your reply. I don't understand what you mean by "when data are saved." Likewise, I don't understand what you mean by "when you read the data." Do you mean when I run the input file for the analysis, or when I run the multiple imputation input file? In the latter, I specified MISSING = ALL (999); under VARIABLES. I made sure that the variables in the dat file, VARIABLES list, and IMPUTE VARIABLES list are in the same order. I just redid the entire process, starting from my SPSS file. The same asterisks appear in my imputed files. I ran some simple multiple regression analyses to see what would happen. I specified IMPUTATION as TYPE and gave the correct list name to retrieve my imputed files. I got a full set of output, even though there are still asterisks in the imputed files and I did not specify MISSING in the regression input. I got a warning saying that the CHI2 test could not be conducted, perhaps due to a large amount of missing data. This is the only warning I received despite having the asterisks in the files. Thank you for any further clarification you can provide! 


Please send the relevant files and your license number to support@statmodel.com. 

Eric Deemer posted on Tuesday, September 30, 2014  7:23 pm



I'm wondering why Mplus doesn't use cases with missing data on predictors with FIML? From what I've read, it's not possible to use cases with missing data on Y under FIML but still possible (albeit more difficult) to use cases with missing values on X but observed values on Y. Any light you can shed on this would be helpful. Thanks so much. Eric 


Missing data theory applies to dependent variables. Missing data theory does not apply to observed exogenous variables because the model is estimated conditioned on these variables. You can mention the variances of the observed exogenous variables in the MODEL command. This causes them to be treated as dependent variables and distributional assumptions are made about them but they will be used in the analysis. 

Eric Deemer posted on Wednesday, October 01, 2014  8:00 am



Thanks, Linda. That helps a lot. Best, Eric 

Briana Chang posted on Tuesday, November 11, 2014  10:39 am



Hello Linda and Bengt, I'm hoping this is a simple question and I'd like to be sure about it before I proceed. I'd like to bring covariates into the model so that FIML is used to handle missing on the covariates. Can you do this for binary covariates with missing values? Thank you. 


Yes. Note that if you want to bring one covariate into the model, you must bring all covariates into the model. You cannot bring in just a subset. 

Yoosoo posted on Saturday, January 03, 2015  10:55 am



Hello, I am running a SEM with 16 endogenous observed variables and 4 latent variables using WLSMV estimator with default missing data option. The data summary tells me that the number of missing data pattern is 29, and the reported # of observations is 48000, which is the total # of samples. Is there a way that I can find out the number of incomplete observations that were imputed and included in the analysis? Thank you for your helpful support as always. 


There is no imputation done with WLSMV. The pairwise present approach is used. 

Yoosoo posted on Saturday, January 03, 2015  6:16 pm



Thank you for your comment Bengt. Is there a way to simply report in the output the number of incomplete observations that were found in the input data? 


You can see this in the frequencies for the missing data patterns. The total sample minus those with no missing would be the number of observations with some missing values. For each variable or pairs of variables, see the coverage values. 
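The subtraction described above can be sketched in a few lines of Python, outside Mplus. The pattern frequencies and the function name `count_incomplete` are made up for illustration; in practice the frequencies would be read off the missing data pattern section of the output.

```python
def count_incomplete(pattern_freqs, complete_pattern):
    """Return the number of observations with at least one missing value.

    pattern_freqs maps a missing data pattern label to its frequency;
    complete_pattern is the label of the pattern with no missing values.
    """
    total = sum(pattern_freqs.values())
    return total - pattern_freqs.get(complete_pattern, 0)


# Hypothetical frequencies: pattern 1 is the pattern with no missing values.
freqs = {1: 900, 2: 60, 3: 40}
n_incomplete = count_incomplete(freqs, complete_pattern=1)
print(n_incomplete)
```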

Djangou C posted on Sunday, January 11, 2015  11:34 pm



I'm doing multiple imputation with Mplus and would like to know how to compute the standard deviation of a point estimate (the mean) from the standard error provided by Mplus. Could you please give a reference? Thank you 


The missing data book by Joe Schafer gives a good account of all the relevant imputation formulas used in Mplus. 
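For orientation, the standard combining rules for multiply imputed data (often called Rubin's rules) can be sketched as below. This is a minimal Python illustration of the textbook formulas, not a description of what Mplus does internally; the function name `pool_rubin` and the input numbers are hypothetical.

```python
import math


def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m                       # pooled point estimate
    within = sum(se ** 2 for se in std_errors) / m   # average sampling variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between           # total variance
    return q_bar, math.sqrt(total)


# Hypothetical estimates and standard errors from three imputed data sets.
pooled_est, pooled_se = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(pooled_est, pooled_se)
```

The pooled standard error grows with the spread of the estimates across imputations, which is how the extra uncertainty due to missing data enters.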


I’m currently trying to run two-level multilevel models for several binary outcomes using FIML estimation procedures with longitudinal complex sample data. The models are complex: the level-1 models typically have 10-20 binary IVs and the level-2 models for the intercepts have a maximum of 16 continuous IVs. Many of the IVs are completely observed. Do I need to bring all of the x variables into the model in order to have observations with missing data on the x variables included? In a 6/22/2006 posting you note that “If only two or three of your covariates have missing data, then FIML should be fine. You should study the missing data in your covariates. Perhaps there are some with very little missing data such that you could allow the listwise deletion on those and bring the others into the model.” However, on 11/12/2014 you say that “if you want to bring one covariate into the model, you must bring all covariates in to the model. You cannot bring in just a subset.” A small subset of the IVs accounts for most of my missing data. Is there a way to use the 6/2006 strategy, using listwise deletion for x variables missing small amounts of data and not including in the model the x variables that have no missing data? Multiple imputation isn’t feasible for a variety of reasons. It looks as though your thinking on this may have changed, but I figured it’s worth asking. Thank you in advance for your help! 


The issue with not bringing all the covariates into the model is that you want the covariates to correlate freely (as covariates should). This may not happen unless you model it. Say that you have e.g. two covariates and missing on X1 and not missing on X2 and you bring X1 into the model (essentially making it a Y). This model may leave X1 and X2 specified as uncorrelated. If you say X1 WITH X2 then you bring X2 into the model, so you have to say X1 ON X2 to correlate them and saying ON can have consequences for the rest of the model. So it is safest to bring all the Xs into the model. I assume you have considerable missingness on that small subset of IVs so that Listwise deletion is not an option. 


I am running a complex model with many x-variables. One of those x-variables has missing values. If I bring this x-variable into the model by mentioning its variance, the model does not fit any more. The problem is that this variable is now assumed to be uncorrelated with the other x-variables. So I added WITH statements, which brought all the other x-variables into the model, and then I needed to add more WITH statements. Now I get a warning that the number of observed variables exceeds the number of clusters in my model. This puzzles me. If I look at the diagram view, the model seems the same (and when I do this with x-variables without missing values the chi-square and df are also the same). Why is there this large increase in observed variables, and do you know a way to deal with this problem? Is there a way to let Mplus estimate the x-variables without increasing the number of observed variables, or can I ignore this warning? Thanks a lot for your help. Jaap 


You must bring all of the covariates into the model or none of them. You can do this by mentioning the variances. When you do this, they are treated as dependent variables in the model. The warning is to remind you that independence of observation with clustered data is at the cluster level. The impact of this on your results has not been well studied. 


Dear Linda, Thank you for your answer. I still have a question about bringing in all the covariates. Why does this increase the number of free parameters in the model while at the same time it doesn't affect the number of degrees of freedom? 


Because the increased number of free parameters is the same as the increased number of parameters in H1, namely the means, variances, and covariances among the covariates. 
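The bookkeeping behind this answer can be checked with a small Python sketch. The parameter counts for H0 and H1 are hypothetical; only the formula for the added covariate parameters (p means plus p(p+1)/2 variances and covariances) comes from the answer above.

```python
def added_parameters(p):
    """Means, variances, and covariances among p covariates."""
    return p + p * (p + 1) // 2


# Hypothetical counts before the covariates are brought into the model.
h1_params, h0_params = 20, 15
df_before = h1_params - h0_params

# Bringing p = 3 covariates into the model adds the same parameters to both.
extra = added_parameters(3)
df_after = (h1_params + extra) - (h0_params + extra)
print(extra, df_before, df_after)
```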


Hi Drs. Muthen, I am unclear why Mplus is not deleting cases that are missing data for all dependent variables. Notes from my output are below. BWACHGAP is dependent; all other variables are independent. 140,274 is the N in my total sample, but why is this number not decreased, given that some cases do not have data for the sole dependent variable?
SUMMARY OF ANALYSIS
Number of groups 1
Number of observations 140274
Number of dependent variables 1
Number of independent variables 11
Number of continuous latent variables 0
Observed dependent variables Continuous BWACHGAP
Observed independent variables CRITSKLL DIFFMETH SCH_CITY SCH_TOWN SCH_RURL SCH_MDLG SCH_LARG SCH_TITI SCH_ETH2 SCH_NSLP SCH_SMLL
Cluster variable TEACHID4
Between variables BWACHGAP CRITSKLL DIFFMETH SCH_CITY SCH_TOWN SCH_RURL SCH_MDLG SCH_LARG SCH_TITI SCH_ETH2 SCH_NSLP SCH_SMLL
Estimator MLR
Information matrix OBSERVED
. . .
SUMMARY OF DATA
Number of missing data patterns 2
Number of clusters 20127
Average cluster size 6.969
MISSING DATA PATTERNS (x = not missing)
          1  2
BWACHGAP  x
CRITSKLL  x  x
DIFFMETH  x  x
SCH_CITY  x  x
SCH_TOWN  x  x
SCH_RURL  x  x
SCH_MDLG  x  x
SCH_LARG  x  x
SCH_TITI  x  x
SCH_ETH2  x  x
SCH_NSLP  x  x
SCH_SMLL  x  x


Please send the output to support so we can see the full issue. 


Further, Drs. Muthen, I saw the note above that by modeling the variances of exogenous variables, they are treated as dependent and distributional assumptions are made about them; it was implied that this is one way to retain cases that would otherwise be dropped due to having missing data on all dependent variables. (Please correct me if that is wrong.) However, I have the following questions: 1) Why would an analyst want exogenous variables to be treated as dependent in the model; what consequences are there to this? 2) When I explored the result of modeling vs. not modeling the variances of my exogenous variables named above, I found that the fit indices changed drastically simply due to the explicit modeling of these variances. Please see below. Is this drop in goodness of fit due to improper assumptions that may be made about the distributions of these variables? Is the assumption multivariate normality? Thank you. *Without* exogenous variables' variances explicitly modeled: RMSEA 0.013 CFI 0.830 TLI 0.716 *With* exogenous variables' variances explicitly modeled: RMSEA 0.079 CFI 0.027 TLI 0.189 


Please send these 2 outputs to support so we can see the whole story. 


Thank you Bengt. What insight can you give here? I am currently using a corporate-licensed install of Mplus, and do not have the license number on hand. 


OK, I will send the output, but I cannot request the license number until Monday, and I am not sure if all employees receive that number. 


Regarding your first question, perhaps you are bringing x's into the model by mentioning their variances. In this case you no longer have a univariate model for your BWACHGAP DV, but you have a multivariate model for BWACHGAP and all the x's. So even if you have missing on BWACHGAP, people who have nonmissing data on at least one x variable are (correctly) kept in the analysis sample. Regarding the change in fit, I cannot speculate except to say that you should make sure you let all the x's correlate freely. 


UG Ch11 states: "NMAR modeling is possible using ML estimation where categorical outcomes are indicators of missingness and where missingness can be predicted by continuous and categorical latent variables." Yes, I have predicted missingness as a dichotomous outcome in such models: 2 DVs are modeled, the outcome itself and missingness. Both can be regressed on covariates, and this assumes MAR. 1) By correlating these two DVs, we can see if missingness is correlated with the predicted score in the whole sample, that is, whether NMAR is a better assumption, right? Or is this not true, if ML estimation of missing scores (first DV) assumes MAR in the first place? 2) The above strategy (correlating these 2 DVs) works in a 1-level model but not a 2-level model. With the latter I get: "Covariances involving between-only categorical variables are not currently defined on the BETWEEN level." I can run regressions of both DVs on the between level, but not correlate these outcomes. Does Mplus not allow for modeling covariances of dichotomous DVs on the between level? I know mixture modeling is another option for NMAR, but given the first statement above, it seems this strategy should work: "categorical outcomes are indicators of missingness and missingness can be predicted..." Perhaps simply not in a 2-level model where missingness is between only? 


Drs. Muthen, I employed the strategy of creating a latent variable on level 2 to define the residual variance for the indicator of missingness, and successfully correlated this residual variance with the residual variance of the central variable. It seems that MAR is a plausible assumption given an estimated value of the correlation of missingness with the DV of about 0: F1 WITH BWACHGAP 0.003 0.424 0.006 0.995 Is this an OK way to assess the MAR assumption? Thank you sincerely. 


1) The missing data literature emphasizes that you cannot test whether NMAR is more suitable than MAR. I recommend the book by Craig Enders. This means that your 2 DV approach is not correct. Perhaps because the information on the residual correlation that you focus on comes only from those who don't have missing on Y (the rest is handled by the bivariate normal information). For NMAR modeling you need at least 2 DVs, not counting the binary missing data indicators. For more on NMAR modeling, see for instance: Muthén, B., Asparouhov, T., Hunter, A. & Leuchter, A. (2011). Growth modeling with non-ignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological Methods, 16, 17-33. 2) No longer relevant. 


Hello, I have a dataset (n=3,000), with a large number of missing values. I was told to conduct multiple imputation using Bayesian analysis. When I try to run the imputation, the "DATA IMPUTATION" command does not turn blue. I was wondering if it is unavailable in the demo version, or if I have done something wrong in the input. Here is my input:
TITLE: bayesian imputation test
DATA: FILE IS "E:\Dissertation\2012LAPOPComplete.dat"
VARIABLE:
NAMES ARE gen cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32
prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 q10new ed etid q11 q2;
USE VARIABLES ARE gen cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32
prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 q10new ed etid q11 q2;
CATEGORICAL ARE gen cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32
prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 ed etid q11 gen q2;
AUXILLARY ARE ed etid q11 q2 gen;
MISSING ARE cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32
prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 q10new ed etid q11 q2
(888888 988888 999999);
DATA IMPUTATION:
IMPUTE = cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32
prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 q10new ed etid q11 q2;
NDATASETS = 10;
SAVE = missimp*.dat;
ANALYSIS: TYPE = BASIC;
OUTPUT: TECH8;


Commands with more than one word do not turn blue. You can use only 6 variables in the Demo. 


Dear Linda/Bengt, I understand that Mplus removes cases with missing on all variables to run FIML estimation. I am working with political trust survey questions: out of about 37,000 respondents, 485 were removed for having missing values on all variables. This makes the summary statistics in the Mplus basic analysis slightly off from the summary stats I have in Stata. I'm concerned that this ends up removing reticent respondents who are scared to answer the survey question. Hypothetically, just to see if I could retain the 485 cases, I added a different variable to the basic analysis in Mplus, which almost all respondents answered. This time, Mplus ran the basic analysis without a problem and no cases were removed, yet the summary statistics in the basic output are still off by the same amount as when Mplus removed the 485 cases. I'm confused because Mplus should have been using exactly the same data and missing values as Stata this time. Any idea why this happened? 


You have to make a distinction between DVs and IVs and multivariate vs univariate estimation in the following sense. Perhaps you are saying that you have a model with DVs and IVs and where all the DVs have missing data. They get removed because they don't add information to the estimation of relations between DVs and IVs. When you say you add a variable with no missing, perhaps that is used as a DV, in which case Mplus keeps everyone because it can draw on missing data theory. Summary statistics can be computed using Type=Basic, which uses all DV and IV variables, and since no "DV ON IV" is part of this analysis, all variables will be used in the missing data analysis to compute the summary statistics. Statistically, there would be no difference between Stata and Mplus, only in how you use the programs. If this doesn't help, send relevant outputs from Mplus and Stata to support along with your license number. 


Hello, I am using Mplus 7 to perform IRT analyses on 104 dichotomous outcomes. Participants were only administered about a quarter of the items. Items were sampled in a way that creates MAR, as the missingness can be partially predicted by a categorical covariate, i.e., grade. Specifically, participants were administered a higher proportion of grade appropriate items and lower proportions of grade inappropriate items. I read in the user’s manual that a covariate can be used to model missingness of categorical DVs when one uses WLS. However, my attempts to use Auxiliary = grade(M) with WLS, WLSMV, ULSMV, and MLR have all generated the error message "Analysis with categorical variables is not available with the 'm' specifier in the AUXILIARY option." Is there another way to account for the MAR, in the context of dichotomous items and lots of (n=2317) missing data patterns? 


You can let grade predict the factor(s) and thereby draw on MAR. Missingness as a function of a covariate is what the manual refers to for WLSMV handling missing data. 


Interesting... So in Janelle's example above, if one were to regress the factor(s) on grade then you're saying that would adjust for the MAR design of the item sampling? Would the same approach work if one were to use MLR and ULSMV? 


In addition to my post above, is there a way to print the IRT parameterization when one uses a covariate in the model? I tried using the D(1.0) output option, but that didn't work. thanks for all your guidance! 


First post: Q1: Yes, if there is no other variable than grade not included in the model that predicts the missingness. Q2: Yes. 


Second post: IRT typically uses N(0,1) for the latent variable, which you won't necessarily get with a covariate. So you would have to do the reparameterization yourself, e.g., using Model Constraint. 


Sorry, I'm still struggling with how to get the IRT parameterization. The (simplified) code below results in an error message that Model Constraint does not recognize the N function. I also tried constraining the factor mean to zero and factor residual variance to one, but that does not yield the IRT parameterization of thresholds/difficulties either, presumably because it is the factor variance that needs to be constrained. VARIABLE: NAMES are mplusid V1-V104 GRADE; USEVARIABLES are V1-V104 GRADE; CATEGORICAL = V1-V104; MISSING are . ; ANALYSIS: TYPE = GENERAL; ! ESTIMATOR = MLR; PROCESSORS = 2; MODEL: F1 BY V1* V2-V104; F1 on GRADE; [F1] (mean); MODEL CONSTRAINT: mean = N(0,1); 


You can try to do this using the tech report we have on our website: see the IRT page and the paper described there as "A brief technical description of the formulas used in the plots of item characteristic curves and information curves." If you can't handle these formulas, you can try to make this simple by estimating the model in a first step without Model Constraint (your statement is not allowed). Fix the residual variance of f1 at 1 (f1@1). Asking for TECH4 you get the total variance of f1. Rerun by using a grade variable scaled so that the total variance of f1 is 1. 
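For readers doing the reparameterization by hand, here is a minimal Python sketch of the usual algebra. It assumes a logit link, a single factor f with mean alpha and variance psi, and an item model of the form logit P(u=1|f) = -threshold + loading*f; standardizing f then gives the two-parameter form a*(theta - b). This is an illustration derived from that assumption, not a quotation from the tech report, and the function name `to_irt` is hypothetical. With a covariate in the model, alpha and psi would be the total factor mean and variance (for example, as reported by TECH4).

```python
import math


def to_irt(loading, threshold, factor_mean=0.0, factor_var=1.0):
    """Convert a loading and threshold to discrimination a and difficulty b.

    Assumes logit P(u=1 | f) = -threshold + loading * f with
    f ~ N(factor_mean, factor_var), rewritten as a * (theta - b)
    where theta = (f - factor_mean) / sqrt(factor_var).
    """
    sd = math.sqrt(factor_var)
    a = loading * sd
    b = (threshold - loading * factor_mean) / (loading * sd)
    return a, b


# Hypothetical item: loading 2.0, threshold 1.0, factor mean 0.5, variance 4.
a, b = to_irt(2.0, 1.0, factor_mean=0.5, factor_var=4.0)
print(a, b)
```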


Thank you, Bengt, for referring me to the tech report. Reparameterizing the loadings and thresholds to discriminations and difficulties was no trouble. Interestingly, whether or not one used a covariate to account for MAR had essentially no impact on estimates of a and b when using MLR, rs = .99 and 1.0 for as and bs, respectively. However, using a covariate to account for MAR had major impact on estimates of b when one used WLSMV, r among bs only .60. Estimates of b from MLR and WLSMV were more closely aligned when one accounted for the MAR using a covariate, r = .90, than when one did not account for MAR, r = .66. 


Interesting; makes sense. 


How do people assess the mechanism of missingness (i.e., MCAR, MAR, NMAR) when dealing with categorical variables? Little's MCAR test does not work. Is there an alternative in Mplus? 


Typically, it is not assessed. One cannot test MAR against NMAR, for instance. ML under MAR is used as the standard, also in Mplus. See also the examples in the missing data chapter 11 in the UG. 


Hi Linda & Bengt, I am following your recommendations for determining the best approach to dealing with NMAR data, and running MAR, pattern-mixture, and selection models. I cannot get the input for the Roy-Muthen model to run, and was hoping you could explain what the "u" variable is? E.g. Variable: Names = y0-y5 u1-u5; Missing = *; usev = y0-y5 d1 d2 d3 d4 d5; classes = cu(2) cy(4); Thank you 


See the Muthen et al (2011) Psych Methods article on our web site. Page 22 describes the Roy method. The u variables of the runs shown on the web site can be ignored. I think they reflect missing or not at the different time points. 


Hi Bengt, Thank you for your response. However, when I removed the "u" variables from the input, the model didn't run. Do I need to be coding missingness as present/absent manually? Thank you 


I'm currently trying to run some manual R3STEP models for a binary distal outcome and am trying to use FIML estimation procedures to account for missing data. Many of my IVs are binary. I have from 200 to 3,600 observations in each latent class. When I try to run models for my binary outcomes, I receive a message that the covariance matrix for one or more of my classes cannot be inverted. I’m wondering if this is happening because of (1) the distributional assumption of multivariate normality for all of the IVs and (2) very sparse/empty cell counts when I enter 23 binary indicators in the model. My emerging impression is that FIML estimation cannot handle any empty cells in the crosstabs/frequency table and covariance/design matrices (e.g. no variance in outcome within cell). FIML estimation often just won’t work when I have fewer than 5 observations in any cell, for example, if I try to look at differences in a binary outcome by latent class, gender, and race (4 x 2 x 3) and one subgroup’s cell for the presence of the binary outcome is empty. It looks as though I need to have some threshold number of observations in each possible cell in order to successfully use FIML estimation procedures; FIML can’t seem to handle sparse tables. Is there a better way to do this? I suspect that multiple imputation will also be problematic. Thank you in advance for your help! 


I assume you have missing data on your binary outcome(s) to give zero frequencies for certain combinations of binary covariate (IV) values. With binary covariates a singular covariance matrix problem may arise in those cases. I don't think there is a way around that problem; it is a common problem in logistic regression. If that doesn't answer your concerns, we need more information to comment. Please send input, output, data, and license number to Support along with a clarification of: (1) do you have missing on the binary outcomes or the binary IVs? (2) when you talk about entering 23 binary indicators, is that the outcomes you are talking about or the IVs? 

WenHsu Lin posted on Wednesday, October 14, 2015  6:30 pm



Hi, I have a question regarding wave nonresponse in my data (attrition). I have 8 waves and the attrition (wave 1 vs. wave 8) is almost 40%. I then ran my growth curve model. My question is: Can FIML provide proper estimation? I checked the Covariance Coverage table and some numbers were at .6. Is this OK? Can I combine FIML and weights (IPTW) to adjust for attrition? Thank you 


Coverage of 0.6 is not good. FIML or any other missing data technique will have to rely too strongly on model assumptions, especially that the missing is MAR and that the variables are normally distributed. But things are better if earlier time points have higher coverage. Assumptions play a much smaller role if the coverage is at least say 0.8. 
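To make the coverage numbers concrete, here is a minimal Python sketch of how a covariance coverage matrix can be computed by hand: each entry is the proportion of cases with both variables observed. The data, with `None` marking a missing value, and the function name `covariance_coverage` are made up for illustration.

```python
def covariance_coverage(rows):
    """Proportion of cases with both variables observed, for every pair."""
    n = len(rows)
    p = len(rows[0])
    coverage = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            both = sum(1 for r in rows if r[i] is not None and r[j] is not None)
            coverage[i][j] = both / n
    return coverage


# Hypothetical two-wave data for five people; two drop out before wave 2.
data = [
    (10.0, 11.0),
    (12.0, 12.5),
    (9.0, None),
    (11.0, 10.0),
    (13.0, None),
]
cov = covariance_coverage(data)
print(cov)
```

The diagonal gives the proportion observed on each variable, and the off-diagonal entries show how much data informs each covariance.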

WenHsu Lin posted on Thursday, October 15, 2015  4:36 pm



Yes. Earlier time points have coverage over .9, which then drops to .6. I am concerned because different methods give somewhat different results. Thank you. 


You can show this in a write up so readers can judge it. For instance, report the results when using only some of the early time points as compared to using all time points. 


Hi, I'm testing a path analysis model with two binary covariates which lead to the deletion of 5.4% of the sample. For continuous outcomes, I know that there is a way of dealing with missing values on covariates by including them explicitly in the model. Is this possible with binary covariates too? If not, what would be the best way of dealing with these missing values? Thank you very much. 


Sorry for my mistake. I forgot to specify that the two binary covariates have missing values (which lead to the deletion of 5.4% of the sample). Thanks 


Yes, you can do the same with binary covariates, although it is a bit better and also possible using Bayes to say that they are binary. 


Thanks. In other words, I should simply use Estimator=Bayes instead of Estimator=ML? Because I'm trying to test a mediation model, should I use the Model Constraint command instead of Model Indirect? (If I understand well from a previous post, I would also have to divide by the SD of the DV and multiply by the SD of the IV.) Is there a reference that I could read with a more concrete example of this (or syntax)? Thank you for everything. 


Hi. I am running a latent profile analysis on 13 rating variables with two-level nesting (15 scenarios within 300 people). I was able to successfully model a covariate (age group, dummy coded) with the use of the algorithm=integration and integration=montecarlo statements, and by including the covariate mean explicitly in the MODEL command. However, I am now trying to test a different covariate (implicit emotion beliefs), and when I try running the same syntax, I get the following missing data warning message: Data set contains cases with missing on all variables. These cases were not included in the analysis. Number of cases with missing on all variables: 297 My syntax is: missing = .; usevar = subjid ems SSavoid SSleave SMmod ADneg ADpos REdet REdis REpos RErum REacc RMsup RMpos RMphys; classes = c(3); cluster = subjid; Analysis: Type = Mixture Complex ; algorithm = integration; integration = montecarlo; Starts = 100 10; stiterations = 10; k1starts = 100 10; processors =8(starts); Output: SAMPSTAT tech11; Thank you for your help! 


Add the Patterns option to your OUTPUT command so you can see the missing data structure. If that doesn't help, send output to Support along with license number. 


Answer to Coulombe: If you want to include covariates in the model and say that some are binary, Bayes is the way to go. Put the binary ones on the Categorical list and for covariates x1-xp say: x1-xp with x1-xp; Yes on your second paragraph. No, not yet, but it is coming. 


Thank you Dr. Muthen. Just to make sure: adding x1-xp with x1-xp is the way to include the covariates in the model (by correlating them all with each other)? Finally, if I understand well, we don't use bootstrap with Bayesian analysis? Thank you very much. Simon 


Q1. Yes. Q2. Right, that's not needed with Bayes. Bayes still allows nonnormal parameter distributions and nonsymmetric confidence intervals. 

SABA posted on Thursday, November 26, 2015  8:25 am



Hi, I am running a multiple regression. 25% of my data is missing on one questionnaire because respondents refused to respond to that specific questionnaire; however, they have responded to another questionnaire which is also part of the model. The data is missing at random. The analysis type is complex and these respondents are clustered in my analysis. I ran my model trying both the ML and MLR estimators and in both cases I get the following message. *** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 1621 My question is why these cases are excluded from the analysis and why ML is not estimating them. Thank you 


In regular regression you use only subjects with complete data on the x's. If you have subjects who have missing on x's but not missing on y, you can benefit from that using missing data theory by "including the x's in the model", which you can do in Mplus by adding x1-x5; assuming you have 5 x's. 
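Which cases enter the analysis under each choice can be illustrated with a small Python sketch. The rules below are a simplified reading of the answers in this thread, not Mplus's actual code, and the data and the function name `keep_case` are hypothetical; `None` marks a missing value.

```python
def keep_case(y_values, x_values, x_in_model):
    """Decide whether one case enters the analysis (simplified rules)."""
    y_observed = any(v is not None for v in y_values)
    x_observed = [v is not None for v in x_values]
    if x_in_model:
        # The x's are treated like dependent variables, so a case is dropped
        # only when it is missing on every variable.
        return y_observed or any(x_observed)
    # Default: the model is conditional on x, so any missing x drops the
    # case, as does missing on all dependent variables.
    return y_observed and all(x_observed)


# Hypothetical cases: (y values, x values).
cases = [
    ([1.2], [0.5, 1.0]),     # complete
    ([0.7], [None, 2.0]),    # missing on one x
    ([None], [0.3, 1.5]),    # missing on y only
    ([None], [None, None]),  # missing on everything
]
n_default = sum(keep_case(y, x, x_in_model=False) for y, x in cases)
n_with_x = sum(keep_case(y, x, x_in_model=True) for y, x in cases)
print(n_default, n_with_x)
```

As noted elsewhere in the thread, a larger analysis sample does not by itself change the slope; only cases with y observed and some x missing add information about it.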

SABA posted on Thursday, December 03, 2015  2:44 am



Hi, I am running a multiple regression. 25% of my data is missing on one questionnaire because respondents refused to respond to that specific questionnaire; however, they have responded to another questionnaire which is also part of the model. The data is missing at random. The analysis type is complex and these respondents are clustered in my analysis. I ran my model trying both the ML and MLR estimators and in both cases I get the following message. *** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 1621 If I use estimator MAR, then I get the following message *** ERROR in ANALYSIS command Unrecognized setting for ESTIMATOR option: MAR Could you please tell me why these estimators are not estimating the missing values, and what the solution could be? Thank you 


MAR is not an estimator. Perhaps you are thinking about ML estimation using the MAR assumption. Saying ML or MLR will estimate under the MAR assumption. Subjects with missing data on x are typically excluded because x is not part of the model in terms of parameters estimated. But adding a normality assumption on x you can bring x into the model saying in the Model command: x; and this will bring up the sample size. Note, however, that regression slope estimates are only affected if subjects with missing on x have data on y. 

Phil Wood posted on Monday, December 14, 2015  3:24 pm



Is there any way to exclude observations from analysis based on the number of missing data points across variables? Something along the order of, say, in SAS, saying If nmiss(of item1-item10)<5 then delete; Thanks! 


Not directly. But you can use DEFINE to do it by using IF(y EQ _MISSING)THEN ymiss=1 ELSE ymiss=0; and then use DEFINE to add up the ymiss. 
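The same count-then-filter logic, done as a preprocessing step before the data reach Mplus, might look like the Python sketch below. The cutoff, the data, and the function names are illustrative assumptions; `None` marks a missing value.

```python
def n_missing(row):
    """Count the missing values in one case."""
    return sum(v is None for v in row)


def drop_sparse(rows, max_missing):
    """Keep only cases with at most max_missing missing values."""
    return [r for r in rows if n_missing(r) <= max_missing]


# Hypothetical cases on four items.
rows = [
    (1, 2, 3, 4),
    (1, None, 3, None),
    (None, None, None, 4),
]
kept = drop_sparse(rows, max_missing=2)
print(len(kept))
```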


Dear Drs. Muthen and Asparouhov, I am conducting an MGA (11 groups) with one outcome. It is my understanding that to fill in missing data on the outcome (i.e., FIML), I need to call out the variances of all covariates. However, I am interested in nonlinear age trends in my outcome, therefore my model contains the following interactions: Age*Age and Age*Age*Age. My question is do I need to call out the variances of the interactions like so: Age AgeSq AgeCu Covar1 Covar2; OR can they be excluded like so: Age Covar1 Covar2; Thank you so much for your time. Best, Grace 


If a person has missing data on the outcome it won't help if you bring the covariates into the model as you suggest. The people you want to include are those with the outcome observed who have missing on some covariates. You need to mention all variables including the interactions (for deeper technical reasons, this won't be technically exactly correct but approximately so). 


Ah, understood. Thank you for your insight! 


Dear Mplus team, A follow-up from my last question: When I have only one DV, how can I 'get' Mplus to include cases with missing data on the DV (but no missing data on IVs)? Would it be appropriate to include other variables known to be correlated with the DV in the model, say, by calling out their variances and/or means? If so, is it advisable to call out the means, variances, or both? If not, is there another approach to retain cases with missing data on the DV? Thank you for taking the time to respond, particularly to someone who is early on in the learning process. Kindly, Grace 


If you have a regression of y ON x, you can certainly bring x into the model by mentioning its mean or variance. But the slope won't be affected by data on people who have missing on y and observed on x. The slope is affected only by data on people who have observed y and missing x. 

Sona Aoyagi posted on Friday, February 12, 2016  7:48 am



Hi, I'd like to know about the TYPE=DDROPOUT option in the DATA MISSING command, which is for pattern-mixture models. 1) The user's guide says "For TYPE=SDROPOUT and TYPE=DDROPOUT, the number of binary indicators is one less than the number of variables in the NAMES statement because dropout cannot occur before the second time point an individual is observed." But I actually have cases which dropped out at the first observed point (i.e. before the second time point). Is there any way to build these cases into the pattern-mixture model? 2) When I ran the model (TYPE=DDROPOUT), I got the error below. "One or more variables have a variance of zero. Check your data and format statement." The error occurred at d1. Could you tell me the meaning of this error message? DATA MISSING: NAMES = y0-y5; BINARY = d1-d5; TYPE = DDROPOUT; MODEL: i s | y0@0 y1@2 y2@6 y3@10 y4@14 y5@20; i ON d1-d5; s ON d3-d5; s ON d1 (1); s ON d2 (1); 3) Sorry for this beginner question: What is the difference between TYPE=MISSING and TYPE=DDROPOUT? TIA 


Type = Missing is for ML under the MAR assumption. Type=Ddropout is for pattern-mixture modeling of NMAR data. For a review see the paper on our website: Muthén, B., Asparouhov, T., Hunter, A. & Leuchter, A. (2011). Growth modeling with non-ignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological Methods, 16, 17-33. If you are a beginner when it comes to missing data handling I would not use pattern-mixture modeling and therefore not Ddropout. 

Lin Jiang posted on Tuesday, February 16, 2016  11:03 am



Hi Dr. Muthen, My model has two latent variables. The IV is a latent variable with two observed variables (1 categorical, 1 continuous); the DV is a latent variable with 4 observed variables. The categorical observed variable for the IV has 50 missing values out of 257. However, the continuous variable for the IV has 145 missing values. I ran the overall model first. The model fits. However, when I run the multigroup SEM with gender, it shows "no convergence number of iterations exceeded". I guess the "no convergence" is caused by the large number of missing values. I want to impute the missing values first, and then run the multigroup SEM model again. However, one of my dissertation committee members insists that I should first make the multigroup model with missing values fit, and then make the model fit after imputation. I doubt whether the first step is necessary: since a lot of missing values exist, it is hard to make the multigroup SEM model fit. Do you have any suggestions? Also, my data were collected from both males and females. When I use imputation, should I estimate the missing data separately by gender? How can I do that? (I didn't attend any Mplus trainings/workshops because of a tight budget. I learned how to use it by myself.) Thank you for your help! 


Send the output from the no-convergence run(s) to Support along with your license number. But note that with 145 missing values out of 257, no missing data approach is trustworthy, because you rely too much on model assumptions and too little on data.

Lin Jiang posted on Tuesday, February 16, 2016  7:28 pm



Dr. Muthen, Thank you for your reply! I am using the copy of Mplus installed on the computers in the statistics lab at my university. Do you know where I can find the license number? Thank you.


You must be the registered user of an Mplus license with a current support contract to be eligible for support. You can ask the IT person in charge of the lab if this is possible. 


Dear Dr Muthen, We have a longitudinal study with 4 waves spread over 2 years. As we use routinely collected data from a treatment for criminal adolescents, we have a large amount of missing data: 50-70% missing on each variable at each wave. As a consequence, our covariance coverage varies from just above .20 up to approximately .60. We are conducting growth models, as well as latent class growth models, using FIML. I was wondering whether there is any way to get an indication of the estimation bias that FIML introduces into our parameter estimates. For example, for multiple imputation Collins (2001) suggests computing a standardized bias, which is 100*(average estimate - parameter)/se, where se is the standard deviation of the estimate. Is there any such way to investigate whether the missing data lead to biased parameter estimates when using FIML? Thank you so much for your advice! Aurelie


You could do a simulation study to see the effect of so much missing data on your results. We recommend no more than 10 to 20 percent missing based on our experience. 
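
For such a simulation study, the standardized bias quoted in the question can be written out as below. This is only a restatement of that formula; the symbols are introduced here for illustration: \bar{\hat{\theta}} is the average estimate across replications, \theta is the population value used to generate the data, and SD(\hat{\theta}) is the standard deviation of the estimates across replications.

```latex
\[
\text{standardized bias}
  = 100 \times \frac{\bar{\hat{\theta}} - \theta}{\mathrm{SD}(\hat{\theta})}
\]
% Illustrative arithmetic with made-up numbers:
% average estimate 0.52, population value 0.50, SD 0.10
\[
100 \times \frac{0.52 - 0.50}{0.10} = 20
\]
```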


Dear Linda, I would like to perform a CFA with a variable that has some missing values. I used multiple imputation in SPSS and it generated 5 data sets. How can I analyse these 5 imputed data sets in Mplus? Can I also perform multiple imputation in Mplus? Many thanks in advance for your help,


See Example 13.13 for using imputed data sets. See Example 11.5 for imputing data in Mplus. 
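
As a rough illustration of the first suggestion (a sketch, not the User's Guide example itself): with TYPE = IMPUTATION, the FILE option points to a text file that lists the imputed data files, one name per line, and Mplus reports estimates and standard errors combined over the data sets. The file names (implist.dat, imp1.dat to imp5.dat) and the variable names y1-y4 are hypothetical.

```
TITLE:    CFA on five imputed data sets (sketch);
DATA:     FILE = implist.dat;      ! hypothetical list file, see below
          TYPE = IMPUTATION;
VARIABLE: NAMES = y1-y4;           ! hypothetical variable names
MODEL:    f BY y1-y4;

! Assumed contents of implist.dat, one imputed data file per line:
!   imp1.dat
!   imp2.dat
!   imp3.dat
!   imp4.dat
!   imp5.dat
```

Data sets imputed in SPSS would presumably first have to be saved as five separate plain-text files without variable names in the first row.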


Dear Linda, After analysing my missing data, I found that less than 5% of the data are missing, so I will not use multiple imputation. Will Mplus handle my missing data automatically (using FIML), or should I specify something in the syntax? I have already read a lot of information, but I am a little confused and I think I need your advice. Many thanks in advance for your help.


With maximum likelihood estimation, FIML is the default. You don't need to specify anything. 


Thanks very much. However, I am conducting a CFA with categorical indicators. Can I also use FIML with categorical indicators? Once again, many thanks for your help.


You can use the CATEGORICAL option with maximum likelihood estimation.


Dear Linda, Many thanks for your reply and for your help. However, I tried to use this in Mplus and it didn't work (I received an error message). Below is the syntax that I wrote for a CFA with unordered categorical observed variables and maximum likelihood estimation, so that you can see whether I did something wrong.
Title: CFA DSMIVJ;
Data: File is validacao only gamblingversao21904.dat;
Variable:
  NAMES are DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec;
  USEVARIABLES are DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec;
  MISSING are all (999);
  NOMINAL are all;
Analysis: TYPE IS MISSING H1
Model: F1 by DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec;
Output: STANDARDIZED; MODINDICES;
Is that the correct syntax for a CFA with unordered categorical variables, using maximum likelihood estimation for missing values? I ask because the software is also using the WLSMV estimator to handle categorical variables. Once again, many thanks for your help.


The default is WLSMV. Add ESTIMATOR = ML; to the ANALYSIS command if you want maximum likelihood. 
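
A minimal sketch of an input that combines the CATEGORICAL option with maximum likelihood, reusing the variable names and the missing-value code from the question. The data file name is a placeholder, and treating the items as CATEGORICAL rather than NOMINAL is an assumption made here for illustration.

```
TITLE:    CFA with categorical indicators, ML estimation (sketch);
DATA:     FILE IS mydata.dat;            ! placeholder file name
VARIABLE: NAMES ARE DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec
                    DSM6rec DSM7rec DSM8rec DSM9rec;
          CATEGORICAL ARE DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec
                          DSM6rec DSM7rec DSM8rec DSM9rec;
          MISSING ARE ALL (999);
ANALYSIS: ESTIMATOR = ML;                ! overrides the WLSMV default
MODEL:    F1 BY DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec
                DSM6rec DSM7rec DSM8rec DSM9rec;
```

Leaving out the ANALYSIS line should give the default WLSMV estimator, so the same input can serve for both runs.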


Linda, Thank you very much for your advice. I will change the estimator and use ML instead of WLSMV. However, I am still a bit confused about one thing: if I still use the WLSMV estimator (the default for categorical data), how will the software handle missing data? From what I have read, I couldn't tell whether it is listwise or pairwise. Once again, many thanks for your help.


Pairwise. 


Dear Linda, Many thanks for all your help. I followed your suggestion and ran the CFA of my instrument (9 items, unordered categorical variables) with both estimators: one analysis with WLSMV (the default for categorical variables) and another with ML. I obtained different results. Given that I have only 8 missing values in a sample of 750 respondents, and that the missing data pattern appears to be MAR, what do you think would be the best approach in my case? Would it be to use the WLSMV estimator and let the software handle the missing data automatically through the pairwise method? It's a CFA with 9 items, so I do not have covariates, right? Once again, thank you very much for all your help and insights.


Are you using the CATEGORICAL option with both WLSMV and ML? You should be doing this. You should be comparing the patterns of significance, not the values of the coefficients. ML gives logistic regression and WLSMV gives probit regression. They are not on the same scale.


Dear Linda, Once again, thank you very much for your help and insights. Do you mean running a single input that uses the categorical option with both WLSMV and ML? Or do you mean running two inputs and then comparing the patterns of significance (and not the values of the coefficients)? I ran the syntax below for the categorical option with WLSMV and ML. Can you please tell me if it is right?
Title: CFA DSMIVJ;
Data: File is validacao only gamblingversao21904.dat;
Variable:
  NAMES are DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec;
  USEVARIABLES are DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec;
  MISSING are all (999);
  NOMINAL are all;
Analysis: TYPE IS MISSING H1
  ESTIMATOR IS WLSMV
Model: F1 by DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec;
Output: STANDARDIZED; MODINDICES;
Is this the correct way to write the syntax for using the categorical option with WLSMV and ML? In addition, question 2) If the data were missing completely at random, do you think the pairwise method would be fine? Once again, many thanks for all your help.


Please send the two outputs and your license number to support@statmodel.com. 


Thank you, Linda. I will send them.


Hello Dr. Muthen, I did an MGCFA for an 11-item scale (2 groups). One group has missing values on items; the valid sample size per item ranges from 498 to 505. The output showed that the sample size for that group is 502. Why? Thanks! Jiebing


Send to Support along with your license number. 


Hello, I am running a series of multiple regression analyses. The data set contains missing values so I am using maximum likelihood to retain the full sample in models. Some of the outcome variables are nonnormal, others are normally distributed. Based on my understanding, it is appropriate to use MLR (rather than ML) to estimate parameters for those models with outcome variables that are nonnormally distributed. I am wondering what the best practice is: is it appropriate to use MLR for all analyses reported within the same paper, even for those with outcome variables that are normally distributed? Or should I use ML when modeling outcomes that are normally distributed and MLR when modeling outcomes that are not normally distributed? Results do differ slightly when using ML versus MLR. Thank you! 


You should use MLR throughout. 
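
A minimal sketch of requesting MLR for one such regression. The file name, variable names, and missing-value code are hypothetical.

```
TITLE:    Regression with missing data, MLR (sketch);
DATA:     FILE IS mydata.dat;        ! placeholder file name
VARIABLE: NAMES ARE y x1 x2;         ! hypothetical variable names
          MISSING ARE ALL (-999);    ! hypothetical missing-value code
ANALYSIS: ESTIMATOR = MLR;
MODEL:    y ON x1 x2;
```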
