No, Mplus does not impute values for those that are missing. It uses all data that is available to estimate the model using full information maximum likelihood. Each parameter is estimated directly without first filling in missing data values for each individual.
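To illustrate what "using all data that is available" means, here is a minimal Python sketch (not Mplus code, and not a description of how Mplus is implemented internally) of a casewise log-likelihood for a bivariate normal model. A case with y2 missing contributes the marginal density of y1 alone, so nothing is filled in. The function names, and the restriction to two variables with missingness only on y2, are assumptions made for illustration.

```python
import math

def casewise_loglik(y1, y2, mu1, mu2, var1, var2, cov12):
    """Log-likelihood contribution of one case under a bivariate normal model.

    y2 may be None (missing); that case then contributes only the
    univariate normal density of y1. Illustrative only.
    """
    if y2 is None:
        return -0.5 * (math.log(2 * math.pi * var1) + (y1 - mu1) ** 2 / var1)
    det = var1 * var2 - cov12 ** 2          # determinant of the 2x2 covariance matrix
    d1, d2 = y1 - mu1, y2 - mu2
    # Quadratic form d' Sigma^{-1} d written out for the 2x2 case
    quad = (var2 * d1 * d1 - 2 * cov12 * d1 * d2 + var1 * d2 * d2) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)

def fiml_loglik(cases, mu1, mu2, var1, var2, cov12):
    """Sum the casewise contributions over complete and incomplete cases."""
    return sum(casewise_loglik(y1, y2, mu1, mu2, var1, var2, cov12)
               for y1, y2 in cases)
```

Maximizing this sum over the means, variances, and covariance gives estimates that use every observed value, which is the idea behind estimating each parameter directly.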
I am having difficulty getting Mplus to converge on H1 (and thus to get a chi-sq test) for a missing-data-latent-growth-curve model, even when I fiddle with the starting values and convergence criteria. It runs fine when I do not ask for "type= missing h1;" but then I can't get the chi-sq. Am I missing some fundamental piece of the puzzle?
If you give the Mplus statement type=missing h1, the program first does H1 and then H0. You may want to first do a type=basic missing. The H1 estimation that this leads to can be difficult if there are large percentages of missing data - see the Covariance Coverage output. Starting values are not needed for H1. You can try to sharpen the convergence criterion as described in the User's Guide.
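For readers who want to inspect coverage before running Mplus, the following Python sketch computes a covariance-coverage-style matrix: for each pair of variables, the proportion of cases with both observed. It assumes missing values are coded as None; the function name is hypothetical and the output layout is not meant to match the Mplus printout exactly.

```python
def covariance_coverage(rows):
    """Return a p x p matrix of the proportion of cases with both
    variable i and variable j observed (None marks a missing value)."""
    n = len(rows)
    p = len(rows[0])
    coverage = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            both = sum(1 for r in rows if r[i] is not None and r[j] is not None)
            coverage[i][j] = both / n
    return coverage

# Small made-up data set: 4 cases, 3 variables
example = [
    (1.0, 2.0, None),
    (1.5, None, 3.0),
    (2.0, 2.5, 3.5),
    (0.5, 1.0, None),
]
cov = covariance_coverage(example)
```

Low off-diagonal entries flag variable pairs whose covariance rests on few cases, which is where H1 estimation tends to struggle.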
Anonymous posted on Wednesday, February 14, 2001 - 9:07 am
I have data that do not fit the assumptions Mplus imposes for SEM with missing data so I am using a multivariate, multiple imputation approach such as that advocated by Little and Rubin.
My question is whether the coefficients and standard errors generated by the Mplus WLSMV estimator present particular problems for those planning on combining results from several separate imputed data sets.
As far as I understand, combining estimates and s.e.'s from analyses of multiple-imputed data could be done in the usual way also when using WLSMV.
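"The usual way" is presumably Rubin's combining rules; that reading is an assumption, as the reply does not name them. A minimal Python sketch for one parameter follows, with a hypothetical function name and made-up numbers: the pooled estimate is the mean across imputed data sets, and the total variance is the average squared standard error plus (1 + 1/m) times the between-imputation variance.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Combine one parameter's estimates and standard errors from m
    imputed data sets using Rubin's rules. Returns (estimate, se)."""
    m = len(estimates)
    q_bar = mean(estimates)                          # pooled point estimate
    within = mean([se ** 2 for se in std_errors])    # average sampling variance
    between = variance(estimates)                    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)

# Made-up results from three imputed data sets
est, se = pool_rubin([0.50, 0.54, 0.46], [0.10, 0.10, 0.10])
```

The same function would be applied parameter by parameter to the estimates saved from each run.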
Anonymous posted on Thursday, May 10, 2001 - 1:25 pm
In the manual, you point out that "Mplus has two special data handling features when data are missing because of the design of the study."
I understand that using the "not by design" missing data features, models assume that data are missing at random or missing completely at random. I have a data problem that doesn't technically seem to fit either scenario.
The study is looking at drug/alcohol treatment over time. It follows 2 cohorts of over 1,300 adults at baseline, 6 months, 18 months, 24 months, and 36 months. Because of funding constraints only 1 cohort (n about 700) was interviewed at 18 months. Both cohorts were interviewed at each of the other waves. I am wondering whether or not we should simply drop the entire 18-month wave data in a growth curve model or if we can somehow include the existing data from the 1 cohort who was interviewed. Technically, the missing cohort at 18 months was not missing at random and it did not seem to be similar to the missing patterns by design examples either. In addition, because this is a longitudinal study, there are data at waves other than 18 months that are missing, but these are more defensibly considered "missing at random."
I would not get rid of the data for 18 months. Measuring one cohort only at 18 months constitutes missing by design and is MCAR. You also have attrition which may be MAR but I couldn't comment on that. I would analyze all of the data using TYPE=MISSING. This is if your outcomes are continuous. TYPE=MISSING is not available for categorical outcomes.
Anonymous posted on Friday, May 11, 2001 - 1:55 pm
I guess it is MCAR (i.e., a random event caused it). I wasn't thinking of it as such since I was wondering if any existing differences between the 2 cohorts would pose a problem in the missing data estimation. But it was not cohort differences that "caused" the missing data, just the flip of the cohort coin. Your advice is very helpful. Thanks!
And yes, we do have attrition too that we would consider MAR.
Mike W posted on Monday, September 10, 2001 - 9:42 am
I'm interested in analyzing repeated measures data using a latent growth curve model. The data come from a complex sampling design (individual cases have sampling weights), are non-normal, AND have both planned/unplanned missing values (i.e., cohort sequential design/sample attrition).
I'm interested in using 3 of Mplus' features: 1. complex sampling (type=complex) 2. FIML missing data estimation 3. Robust estimators (MLM, MLMV, WLSM, WLSMV).
My question is whether these 3 features can be used in conjunction. If not, I'm wondering if it would make sense to do multiple imputation on the missing data, and then use the complex sampling & robust estimators in conjunction?
We are having an update in about two weeks that will include crossing TYPE=COMPLEX with MISSING AND MIXTURE, but weights will not be allowed at this time. You can do a one class mixture and thereby cross complex without weights and missing. Perhaps that might help. The estimator is MLR. MLR has maximum likelihood estimates and robust standard errors. This may help. Otherwise, multiple imputation would be the way to go.
Anonymous posted on Wednesday, February 20, 2002 - 4:20 am
Does Mplus have any methods to estimate a model when the missing data is non-ignorable?
Work is being done in this area but nothing definitive is available at this time. A possibility is to use the pattern mixture approach (see Little's 1995 JASA paper), using covariates, a multiple group approach, or a mixture approach.
I am trying to estimate the Diggle and Kenward (1994) model in order to account for non-ignorable missing data. This would, however, require estimating a multiple-groups model with heterogeneous structures (as it is called in AMOS) using Mplus. Thus, different groups should be allowed to have different variables included in the analyses (e.g., in a longitudinal setting with one outcome variable measured up to six times, for group1 outcome1 would be modeled, for group2 outcome1 and outcome2 would be modeled, etc.). I did not find such an option in Mplus, so my question is: Is it possible to estimate a multiple-groups model with heterogeneous structures in Mplus (which would also be helpful for "pattern mixture models")?
bmuthen posted on Tuesday, March 26, 2002 - 8:37 am
In the Diggle and Kenward approach, Mplus would need to model the growth part among the "y's" and the missingness as a function of previous observed y's among the "u's", where the quotation marks are used to refer to the general model parts of the Mplus framework. The D & K approach therefore needs to be able to allow missingness on y's as well as "u ON y" (logit) regressions. This combination cannot be done in the current Mplus, but is planned for version 3 due out early 2003.
The pattern mixture approach, however, would seem to be possible to carry out in the current Mplus. This would not use the regular multiple-group track because that requires non-zero variance for equal numbers of observed variables in the groups, which is not present here due to missingness. Instead the mixture track (type=mixture) would be used. In the mixture track, there is no problem due to missing data and zero variance for y's. The groups corresponding to the dropout patterns can be represented by latent classes ("c"), where the known class membership is handled using the "training data" feature. The growth model parameters for the y's can then be allowed to vary across classes (groups, patterns) to the extent desired. We can help with trying this approach.
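As a rough illustration of preparing known class membership from dropout patterns, the Python sketch below derives one 0/1 indicator column per pattern. It assumes that training data are supplied as one 0/1 variable per class, with 1 marking the class a person is known to belong to; that layout should be checked against the User's Guide. The function names and the definition of a pattern as "last wave observed" are choices made here for illustration.

```python
def dropout_pattern(y):
    """Index (0-based) of the last wave with an observed outcome; -1 if none.
    None marks a missing value."""
    last = -1
    for t, value in enumerate(y):
        if value is not None:
            last = t
    return last

def training_indicators(rows):
    """Map each case to a one-hot row with one column per dropout pattern.
    Returns (sorted list of patterns, list of 0/1 indicator rows)."""
    patterns = sorted({dropout_pattern(y) for y in rows})
    class_of = {pat: k for k, pat in enumerate(patterns)}
    indicators = []
    for y in rows:
        k = class_of[dropout_pattern(y)]
        indicators.append([1 if c == k else 0 for c in range(len(patterns))])
    return patterns, indicators
```

The indicator columns would then be appended to the data file and named as training variables, with the growth parameters allowed to differ across the corresponding classes.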
Anonymous posted on Thursday, May 23, 2002 - 7:09 am
As I understand it, the new analyses in Mplus 2.1 assume MAR, but they use White corrected S.E.s, which I thought assume MCAR. Am I mistaken in my assumption?
Theory supports the fact that the corrected standard errors (sandwich or White) for missing data are correct under MCAR with non-normal data. For normal data, they are correct under MAR. We have found that these corrected standard errors also work better than regular standard errors under MAR and non-normality. However, there appears to be no theory to support this (see Yuan and Bentler in Soc Methods, 2000).
Thank you for sending the outputs. The correct model in Mplus is the one using the WITH statements. The reason the answers did not agree is that this model did not converge. I added two starting values for the variables GRADRAT and CSAT, which have large variances, and the model converged to the same solution as AMOS.
The following input does what I want to do:
TITLE: Modified from http://www.statmodel.com/mplus/examples/categorical/cat4.html
DATA: FILE IS wmimicd.dat;
VARIABLE: NAMES ARE x1-x3 y1-y16;
  USEV = y6-y10;
  CATEGORICAL = y6-y10;
  GROUPING = x3 (1 = groupA 2 = groupB);
ANALYSIS: TYPE = MGROUP MEANSTRUCTURE;
MODEL: f1 BY y6-y10;
OUTPUT: standardized;
What changes do I need to make in this input file when y10 is missing by design in groupB (e.g., Type = mixture missing)? Also, are there any fit indices unavailable after respecifying this missing data problem as a mixture analysis?
Anonymous posted on Tuesday, July 09, 2002 - 9:26 am
You can split groupB into two groups, one group for the observations with y10 present and the other with y10 missing. Add f1 by y10@0 for the last group and use type = mgroup meanstructure. Type = mixture missing is not going to give you what you want.
The idea of the solution proposed above is good - that y10 should not influence the fitting function in the last group where it is missing - but there seem to be two complications. Mplus will complain that y10 has zero variance in the last group. This can be circumvented by letting one person have a different value for y10 in the missing data group to give a quasi-nonzero y10 variance. Also, I think the weight matrix will be singular with zero variance and I don't know its quality if a quasi-nonzero variance is introduced for y10. I don't know if some other trick can be used. Categorical missing facilities are forthcoming in future Mplus versions.
Anonymous posted on Thursday, July 25, 2002 - 7:24 am
Hello. I am working on an LCA model with some missing data, and I would appreciate some advice on its behavior. The (binary) latent class indicators include 5 behaviors measured at each of 2 posttreatment follow-ups (for a total of 10 indicators). About 40% of the sample were interviewed at the short-term follow-up, but not the long-term assessment. Some noninterviews were by design, and some represent attrition. I also have several pretreatment covariates I am using to predict the classes. There are several points I am wrestling with:
1. The "Test for MCAR..." is clearly nonsignificant. What practical effect should this have on modeling strategy?
2. Group membership changes noticeably when I add predictors. I suspect many "changers" are individuals with only 1 interview, because they simply have less "u" information available for classification, increasing the importance of their "x" information. If this is true, is it reasonable to believe that the LCA with covariates is more likely to be the "correct" model? Should decisions about the number of classes be made in the presence of the covariates?
3. To test for possible nonignorable missingness, is it appropriate in the context of LCA to try a "pattern mixture" approach (in the spirit of Little or Hedeker & Gibbons)? That is, adding a "missing interview" indicator and interaction terms to the set of covariates.
Thank you in advance for any suggestions.
bmuthen posted on Friday, July 26, 2002 - 10:05 am
Re 1, your MCAR test would seem to suggest that you can feel more comfortable using the ML approach that you are using. The ML approach is correct under the less strict MAR assumption. So, having support for MCAR is comforting, but I don't see that it changes your modeling strategy.
Re 2, changing group membership may point to a misspecification. If in the true model the predictors influence only class membership and not the latent class indicators directly, then you should get statistically the same membership with and without predictors in the model. But if the true model has some direct effects of predictors on latent class indicators, then class membership will change when including predictors but not allowing for the direct effects. The solution is to examine the need for direct effects by including one at a time and looking at chi-square differences (2*logL differences). It is also correct that predictors help to better determine class membership when the latent class indicator information is not strong, but with a correctly specified model this additional information should not cause essential changes in membership.
Re 3, yes, a pattern mixture approach could be useful here.
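The chi-square difference (2*logL difference) test mentioned under Re 2 can be computed by hand from the two loglikelihood values printed in the outputs. The Python sketch below is one way to do that using only the standard library; the function names and the loglikelihood values are made up for illustration, and it assumes ordinary ML loglikelihoods from nested models.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df,
    built from the closed forms for df = 1 and df = 2 plus the recurrence
    Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1), with h = x / 2."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(h))   # df = 1 starting point
    else:
        a, q = 1.0, math.exp(-h)              # df = 2 starting point
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def loglik_difference_test(loglik_restricted, loglik_full, df):
    """Return (chi-square difference, p-value) for two nested models."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    return stat, chi2_sf(stat, df)

# Hypothetical example: adding one direct effect (1 extra parameter)
stat, p = loglik_difference_test(-1003.0, -1000.0, df=1)
```

Here df is the number of parameters freed by adding the direct effect, which is 1 when a single effect on a binary indicator is added.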
Anonymous posted on Wednesday, July 31, 2002 - 6:54 am
I am trying to run a Discrete-Time Survival Analysis, but I have missing data in my X values. Is the only way to estimate the model to use a program such as NORM to impute the missing values?
bmuthen posted on Wednesday, July 31, 2002 - 9:38 am
Yes, unless the x variables are such that they do not influence class membership, in which case they can be turned into "y variables" (for which missingness is handled) by referring to a parameter for x (e.g. its mean).
Anonymous posted on Saturday, November 02, 2002 - 2:54 pm
I need help running an EFA with missing data. Missingness is due to use of a 3-form design for 180 participants; data from 49 additional participants who completed either the first or second half of the 64-item set are also included. Covariance coverages range from .262 to .633. I used the following code--my first Mplus experience.
Data: file is deedataII.txt;
Variable: names are I1-I94;
  Usevariables are I1-I64;
  Missing = .;
ANALYSIS: TYPE IS EFA 1 5 MISSING;
  ESTIMATOR = ML;
  H1ITERATIONS = 500;
  H1CONVERGENCE = 0.0001;
  COVERAGE = 0.10;
I have tried lowering the coverage criterion to .08, running the model with up to 16 of the 64 variables of interest deleted, eliminating the H1ITERATIONS & H1CONVERGENCE statements, and using analysis type missing basic. The messages I get go something like this...
THE MISSING DATA EM ALGORITHM FOR THE H1 MODEL HAS NOT CONVERGED WITH RESPECT TO THE LOGLIKELIHOOD FUNCTION. THIS COULD BE DUE TO LOW COVARIANCE COVERAGE OR A NOT SUFFICIENTLY STRICT EM PARAMETER CONVERGENCE CRITERION. CHECK THE COVARIANCE COVERAGE, OR SHARPEN THE EM PARAMETER CONVERGENCE CRITERION, OR RERUN WITHOUT H1 TO OBTAIN H0 PARAMETER ESTIMATES AND STANDARD ERRORS.
NOTE THAT THE NUMBER OF H1 PARAMETERS (MEANS, VARIANCES, AND COVARIANCES) IS GREATER THAN THE NUMBER OF OBSERVATIONS.
NUMBER OF H1 PARAMETERS : 2144
NUMBER OF OBSERVATIONS : 229
I think that the covariance coverage is adequate--how do I go about changing the convergence criterion or running the model without H1?
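The parameter count in the message above follows directly from the number of analysis variables: an unrestricted (H1) model for p variables has p means plus p(p+1)/2 variances and covariances. A short Python check, with a hypothetical function name:

```python
def h1_parameter_count(p):
    """Number of free parameters in the unrestricted mean and covariance
    model for p observed variables: p means + p(p+1)/2 (co)variances."""
    return p + p * (p + 1) // 2

n_params = h1_parameter_count(64)   # 64 items I1-I64 -> 64 + 2080 = 2144
```

With 64 items this gives the 2144 parameters reported, against 229 observations, which is part of why the H1 estimation is demanding here.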
bmuthen posted on Sunday, November 03, 2002 - 8:38 am
You can try sharpening the H1convergence criterion to say 0.00001. One question is if your missingness is by design - you mention a 3-form design. If so, there may be alternative approaches.
Anonymous posted on Sunday, November 03, 2002 - 10:11 am
In response to your question, yes the missingness is by design; 180 participants completed 3-form design questionnaires containing 2 of the three subsets of items. I have additional data from 49 participants, each of whom completed half of the 64 items of interest. What options does this give me?
bmuthen posted on Sunday, November 03, 2002 - 5:38 pm
Here is an answer about what one can do in principle with missing by design - without claiming that this is how you should try to do your analysis. If I understand your design correctly, apart from the 49 subjects, there are 3 groups of subjects, each of which has missingness on parts of the variables. In a CFA, these 3 groups could be handled via multiple-group modeling where in each group only the reduced set of variables actually observed in the group would be considered, so that each group would only have missingness that is not by design. This would be an analysis with only about 2/3 of the variables and therefore perhaps less heavy. This approach has 2 complications for you. One, it is not clear how to handle the 49 participants since each group needs to have the same number of variables. Two, you want to do an EFA.
Regarding your analysis, what is the lowest coverage value that gets printed?
Anonymous posted on Sunday, November 03, 2002 - 6:11 pm
The lowest covariance coverage is .262.
bmuthen posted on Tuesday, November 05, 2002 - 8:42 am
Have you had any success using the sharpened H1convergence criterion? If not, perhaps you want to send your input and data to Mplus support so they can help.
Anonymous posted on Saturday, December 07, 2002 - 12:36 pm
I am thinking about using multiple imputation with data on which I am doing a structural equation model. The outcome variable in this model is dichotomous, which limits my options for handling missing data. I am considering using multiple imputation, and am wondering how to approach doing this in Mplus. I can create my multiple data sets in other software packages. I have read on your website about the RUNALL facility. Would it make sense to run the analyses with that? Also, are there any other features in Mplus that might be useful in this, including anything that would combine the results from the multiple runs to give final estimates? (Or is that step something I need to do by hand?) Thanks.
RUNALL would be the way to go. Results for the analysis of each data set are saved in an ASCII file which can subsequently be analyzed to obtain means of the parameter estimates etc.
anonymous posted on Monday, January 27, 2003 - 7:11 pm
I would like to use the MLR option across multiple imputations. Because I have no missing data I am not specifying type=missing. When I try to run the model I get a message telling me that the MLR estimator is not available with type=general. Is MLR only available if you have missing data?
You can say TYPE=MISSING even if you don't have any missing data. Then you will be able to use MLR.
Anonymous posted on Tuesday, June 03, 2003 - 4:02 pm
My sense is that Mplus can only account for data missing on Y variables.
Is this because the computation is too intensive to include imputation on X's, or because it's empirically incorrect to impute on Y's and X's at the same time?
I ask because I've noticed that one of the HLM packages allows multiple imputations on X and Y in the same model run. This would appear to imply that such models borrow information from the X's to impute Y's (and vice-versa).
Will Mplus allow for imputation on X's and Y's in the near future (version 3.0) ?
bmuthen posted on Tuesday, June 03, 2003 - 5:33 pm
Modeling typically concerns a specification of the distribution of y | x (y conditional on x), whereas the marginal distribution of x is not involved in the model. When there is missing on x, a model for the marginal x part needs to be added. This is true for imputations as well as other modeling. That's why missingness on x's changes the picture and is not trivial - it calls for an extended model that may be hard to specify realistically.
I am not clear on what type of x modeling HLM does for imputations in the x part - I am not sure that this is stated; please let me know if I am wrong. Mplus does not do imputations, but handles missing data in a general way using ML under MAR. Mplus can handle missing on x's if they are brought into the model as "y's". This is done automatically in some tracks of the program (such as non-mixture, non-categorical). In other tracks, x's can be moved into the y set by mentioning parameters related to them in the model. Missing on x is then handled by a normality model for the x's. Normality may not be suitable if x's are say binary and skewed. In Schafer's imputation programs, missingness on categorical x's is handled by loglinear modeling. Mplus Version 3 will have more facilities related to missingness on categorical variables and missingness for variables that have random slopes.
Both 1 and 2 can be estimated in the current version of Mplus. These techniques have been available since Version 2.1 which came out in May 2002. The use of these techniques is described in the Addendum to the Mplus User's Guide which can be found at www.statmodel.com under Product Support. More features are coming in Version 3 in the Fall.
The Mplus techniques for multilevel SEM with missing data are described in a paper that we will be happy to make available at the end of the Summer. We are not aware of any other references on this topic.
I am using Mplus to perform a stepwise regression analysis. I am using Mplus because some of the data is missing (N=363). For Step 1, the Mplus syntax is:
Title: Injury analysis;
Data: file is "filename";
Variable: Names are DV IV1 IV2 IV3 IV4 IV5;
  Usevariables are DV IV1;
  Missing are all (999);
Analysis: Type = H1 Meanstructure missing;
Model: DV on IV1;
Output: Standardized;
The output shows:
Chi-square test of model fit for the baseline model
  Value 3.080
  DF 1
  P-Value 0.0000
(R sq = 0.011)
For Step 2:
Title: Injury analysis;
Data: file is "filename";
Variable: Names are DV IV1 IV2 IV3 IV4 IV5;
  Usevariables are DV IV1 IV2 IV3 IV4 IV5;
  Missing are all (999);
Analysis: Type = H1 Meanstructure missing;
Model: DV on IV1 IV2 IV3 IV4 IV5;
Output: Standardized;
The output shows:
Chi-square test of model fit for the baseline model
  Value 17.652
  DF 5
  P-Value 0.0000
(R sq = 0.062)
To calculate the significance of the R sq change (0.062 - 0.011 = 0.051) can I simply calculate the change in Chi sq (17.652 - 3.080 = 14.572), the change in DF (5 - 1 = 4) and conclude that the R sq change is significant @ p<.01? (The critical value of Chi sq for 4 df and p < .01 is 13.28). Or am I on the wrong track completely?? For your advice please,
A chi-square difference test cannot be used to determine whether an R-square difference is significant. It can be used to see if parameters in nested models are significant. For example, you could compare
Anonymous posted on Monday, June 28, 2004 - 1:34 pm
I wish to run a multi-group analysis (women vs. men) using the missing option.
1. Would my analysis code be type=missing h1; or type = missing h1 mgroup?
2. Can I compare nested models (e.g., restricting covariances to be equal across groups) using a chi-square difference test when using the missing command?
3. When I run a mgroup analysis (not specifying missing) leaving the 'estimator=' blank, Mplus uses the ML estimator. Can I always trust what Mplus picks? For example, I have some categorical ivs and some categorical indicators of a latent variable.
3. With categorical factor indicators Mplus defaults to WLSMV - and yes, the defaults have good reasons behind them.
Anonymous posted on Tuesday, June 29, 2004 - 7:20 am
I am trying to use missing data analysis to run a simple path model. However, Mplus only analyzes cases with no missing data (i.e., listwise). How can I use FIML in this situation? Thanks!
Mplus VERSION 3.01 MUTHEN & MUTHEN 06/29/2004 10:00 AM
TITLE: PATH ANALYSIS
DATA: FILE IS C:\a1.DAT;
VARIABLE: NAMES ARE id x1 x2 x3 y x4;
  CATEGORICAL ARE y;
  USEVARIABLES ARE x2 x3 y x4;
  missing are x2 x3 y x4 (-9);
analysis: type = basic missing;
MODEL: y ON x2 x3 x4;
OUTPUT: stand tech1;
*** WARNING
Data set contains cases with missing on all variables. These cases were not included in the analysis. Number of cases with missing on all variables: 115
*** WARNING
Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 318
*** WARNING
Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis. Number of cases with missing on all variables except x-variables: 209
3 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
bmuthen posted on Tuesday, June 29, 2004 - 9:50 am
Regression on x's does not include a model for the x distribution, but the model concerns the y outcome conditional on the x's. To handle missing data on x's, you need to expand the model to include a model for the x's, e.g. assuming normality. This can be done in several ways. One way is to first do a multiple imputation step outside Mplus. Note that Mplus can take multiply imputed data as input. Another way is to include the x's in the model in Mplus - this is done by mentioning say their variances:
You can use ML estimation by the Analysis option
estimator = ml;
in which case the missing data on your 3 x's results in 3 dimensions of numerical integration.
I am running a parallel process latent growth curve model in version 3.11 (3 equally spaced time points) involving two outcomes measured continuously (depression and smoking). The latent intercepts and slopes are regressed on two x's: gender and number of siblings. The data are nested (individuals nested within schools). There are missing values on both the Y's and on the sibling "X" variable.
The model appears to run fine (no warnings or error messages) and generates results that make sense. However, when I attempt to evaluate the plausibility of the model for girls and boys separately, I get a warning message that states "data set contains cases with missing on x-variables. These cases were not included in the analysis". Below is the syntax used for latent growth model using the multiple group procedure:
Am I making an error somewhere in the syntax? Does Mplus offer FIML for LGC models where there is missing data on both the Y's and the X's in the context of running complex models (nested data) and multiple group comparisons?
I don't see how you would not have missing on x's when you read the full data set and have missing on x's when you look at part of the data set. If you send the two outputs, the one that worked and the one that didn't, and the data, to firstname.lastname@example.org, I can figure this out.
Regarding missing on x's, the following is from Chapter 1 in the Mplus User's Guide:
"In all models, missingness is not allowed for the observed covariates because they are not part of the model. The outcomes are modeled conditional on the covariates and the covariates have no distributional assumption. Covariate missingness can be modeled if the covariates are explicitly brought into the model and given a distributional assumption."
Anonymous posted on Tuesday, October 26, 2004 - 3:01 am
I have a question concerning missing data. I am constructing a twolevel model. Some of my within-level variables have missing values. Should I specify that type is twolevel missing or is it not necessary? I am using mlr as an estimator.
The unrestricted model is the model of all means, variances, and covariances of the observed variables being free. There are no restrictions on any of the parameters. It is the H1 model. The reason that it is not automatically estimated with TYPE=MISSING in all cases is that it can be slow and is needed only to compute chi-square. So Mplus has it as an option. Without it, you will get parameter estimates and standard errors but not chi-square.
Anonymous posted on Thursday, January 27, 2005 - 8:33 am
I have a model where I am testing for invariance of structural paths across gender in the multiple-group context (all observed, continuous variables) but I am concerned that I have data that are NMAR. One of my endogenous variables is frequency/quantity of alcohol use and I have strong reason to believe that missingness on alcohol use is related to true levels of alcohol use. Consistent with suggestions made in earlier postings, I have constructed training data to represent 4 gender X missing data groups (i.e., males w/complete data, males w/incomplete data, etc. - missing data patterns are too sparse for additional patterns). In order to get weighted averages for structural coefficients (and intercepts) across missing data patterns (as you would via Hedeker/Gibbons 97), I have constrained all parameters to equality within gender for all models (i.e., male incompletes and male incompletes equated, female incompletes and female incompletes equated).
My base model would be the fully constrained model (all parameters equated across all 4 groups) - In order to test for invariance across gender, I allowed males to differ from females (but maintained equality constraints between the within-gender missing data groupings). I then used 2(deviance1-deviance2) for single df X^2 difference tests for invariance across gender. I wanted to know if a) this approach to pattern mixture modeling is generally defensible, b) could I compare the deviance from a fully saturated model against my base model so I can give an indication of "model fit" (and hand calculate RMSEA and the like) and c) if this is defensible, is there a citation to justify this approach specifically in SEM other than Hedeker/Gibbons 97 or Little 93 (i.e., Muthen/Brown 01 - is this manuscript available)?
bmuthen posted on Thursday, January 27, 2005 - 1:12 pm
A pattern-mixture approach of this kind seems generally ok, but I need some clarification. You seem to have a typo in the parenthesis at the end of the last paragraph and the second parenthesis of the second paragraph sounds strange to me. Seems like you want to test gender invariance while allowing for missing data differences within each gender. Also, does Hedeker's work give formulas for weighting regression coefficients across the missing data groups? I have not seen a reference to pattern-mixture for SEM. Muthen-Brown is still not available and is focused on actually letting latent variables predict missingness.
Anonymous posted on Thursday, January 27, 2005 - 2:09 pm
Yes, I am looking to test gender invariance and adjust for differences across the missing data groups.........Hedeker does give formulas for weighting regression coefficients across missing data groups. He does this by weighting the estimates by the observed proportions among the missing data groups in his 97 Psych Methods paper (an illustration of a conditional LGM with NMAR dropout in Proc Mixed). Here is the link to the .pdf http://tigger.uic.edu/%7Ehedeker/RRMPAT.pdf. Formulas 12 and 14 are the formulas for the weighted estimates and standard errors respectively. The corresponding dataset and SAS IML program that performs the matrix operation described therein is at http://tigger.uic.edu/~hedeker/ml.html. I simulated longitudinal data that were NMAR for 2 missing data patterns and analyzed the simulated data both in Proc Mixed/IML with his approach and in GGMM in Mplus with the two missing data groups identified w/training data and got nearly identical estimates. So the approach seemed viable but I did not want to move forward without consultation. I have had great difficulty finding a published analog to Hedeker's approach in SEM and had wondered whether Muthen-Brown was the SEM analog but also wanted your take on the approach before going forward........
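The weighting by observed proportions described above can be sketched for the point estimate as follows. This Python snippet is only an illustration with a hypothetical function name and made-up numbers; it is not taken from the paper or the SAS IML program, and it does not reproduce the standard-error formula, which also has to account for the sampling variability of the estimates and of the pattern proportions.

```python
def pattern_weighted_estimate(estimates, counts):
    """Average pattern-specific estimates of one coefficient, weighting
    each missing data pattern by its observed share of the sample."""
    total = sum(counts)
    return sum(b * n / total for b, n in zip(estimates, counts))

# Made-up slope estimates for completers (n = 300) and dropouts (n = 100)
overall = pattern_weighted_estimate([0.40, 0.10], [300, 100])
```

In a training-data mixture run, the pattern-specific estimates would be the class-specific coefficients and the counts the known class sizes.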
BMuthen posted on Friday, January 28, 2005 - 2:42 pm
I will look at this when I am back in the country -- after February 2. It looks like you're on the right track and may be the first to do pattern-mixture in SEM.
Anonymous posted on Friday, January 28, 2005 - 4:58 pm
Thank you so much Bengt - look forward to hearing more from you when you are back. Bidding you safe travels on your return to LA......
Anonymous posted on Friday, February 04, 2005 - 3:03 am
Hello and thanks for a great program.
I'm running a simple linear regression analysis in Mplus3 where I want to correct the standard errors for the design effect (two-level structure) as well as estimate this model with missing data.
The "problem" is that I have got missing data on both the dependent variable and several of the independent variables. In another posting on this page you wrote
"Regression on x's does not include a model for the x distribution, but the model concerns the y outcome conditional on the x's. To handle missing data on x's, you need to expand the model to include a model for the x's, e.g. assuming normality. This can be done in several ways. One way is to first do a multiple imputation step outside Mplus. Note that Mplus can take multiply imputed data as input. Another way is to include the x's in the model in Mplus - this is done by mentioning say their variances:
This leaves me a bit confused on exactly how to write it in syntax. Could you please take a look at my syntax and see if this is the correct way to write a model where we have got missing data in a complex design (where there is also missing on the independent variables) as well as correcting the standard errors.
Names: The names of the variables in the dataset;
Missing = All (-99);
centering GRANDMEAN(age gender education);
cluster is klnr1;
Usevariables are antisos age gender education singlem1 singlem2 stepf JPC singlef;
Analysis: type=complex; type=missing;
Model: antisos on age gender education singlem1 singlem2 stepf JPC singlef;
age gender education singlem1 singlem2 stepf JPC singlef;
****** In this model we wish to see how six different family structures expressed as five dummy variables (singlem1 singlem2 stepf JPC singlef) predicts antisos after we have controlled for age gender education. I have got no missing data on the dummy variables-only on the control variables and the dependent variable.
Is this the correct way to do it? Do you think it's best to impute the missing data with multiple imputation (NORM) before you use Mplus, or to include the x's in the model in Mplus by mentioning their variances (as I think I have done here)? And related: Is it ok to impute missing data with NORM and use the imputed datasets in Mplus3 even when you have got a nested data set?
I think this is the correct approach. I don't think NORM handles clustered data. You may have to correlate the observed variables using the WITH option. I'm not sure of the default. You could, however, remove the dummy variables from the variance list given that they have no missing data.
Anonymous posted on Wednesday, March 09, 2005 - 11:32 am
What approach does Mplus use to compute the S.E.'s for the H1 model with missing data? Is it using the observed information matrix evaluated at the final estimates?
The observed information matrix is used. With ML and MLR, there is an option to use the expected information matrix.
Anonymous posted on Friday, March 18, 2005 - 10:02 pm
I have longitudinal survey data at 5 time points. I am interested in using multiple imputation to handle missing data. I plan on using available data from time 1 for the imputation model used to impute values for the missing values at time 1. I would like to use the imputed (i.e., complete) data from time 1 to help impute the missing data in time 2, and so on.
I have two main questions about this:
Would you recommend doing a single imputation for each wave of data? Otherwise, I would have, say, m=5 imputed data sets for time 1, and then it is not clear how I would go about using time 1 to help impute the time 2 data.
Also, do you have recommendations about whether to use individual items vs. scale scores in the imputation model? I would like to have complete item-level data (for subsequent factor analyses), not just complete scale scores (for path analyses, for example). I have seen examples of multiple imputation and they all use scale scores in the model. Is it ever appropriate to use individual items in the imputation model? I can't seem to find anything about this in the literature.
bmuthen posted on Saturday, March 19, 2005 - 4:55 am
In principle, a good approach would be to use item-level data for all 5 time points jointly, perhaps adding covariates, analyzing these variables by ML under the usual MAR assumption. This approach is certainly doable on the scale score (or IRT theta) level. But perhaps the approaches you discuss are motivated by this ML-MAR approach involving too many variables when working on the item level. Perhaps that is why you suggest a time-by-time approach. However, the use of complete (partly imputed) data from time 1 for imputing values for time 2 does not seem like a good approach to me since it is acting as if the imputed time 1 data are real. And a single imputation would not give the desired result of multiple imputation - showing the true variability. Staying with the idea of imputing item-level data for each time point separately, it seems feasible to do this using observed (not imputed) data on items (and covariates) at all other time points. I am not familiar with literature on these matters.
Anonymous posted on Saturday, March 19, 2005 - 10:49 am
Thank you for your quick reply. When you say "use the item-level data for all 5 time points jointly, perhaps adding covariates," isn't it the case that the covariates would already be factored in due to having all the survey items in the imputation model already? So I'm not sure what you mean here. Are you saying that if I want to use depression info as part of the model to impute values for missing anxiety scale items, I should use the depression scale score instead of the depression scale items?
Also, just to clarify, you think it is appropriate to use data collected at subsequent time points to impute values from previous time points? (I'm not arguing against the view, just want to clarify). Would this still be the case if there is a reason to expect measurements to change over time (e.g., some of the participants belong to an anxiety treatment group)?
Thank you again.
bmuthen posted on Saturday, March 19, 2005 - 4:19 pm
When I mentioned covariates I was making a distinction between background variables (e.g. demographics) and the (test?) items - it sounds like you are calling all of these variables "survey items" so we were probably just using different vocabulary. So my answer is no to your question at the end of your first paragraph. Regarding your second paragraph, yes my inclination would be to use any variable that might be correlated with the items with missing data.
Anonymous posted on Saturday, March 19, 2005 - 5:12 pm
Thanks again. This has been very helpful.
Anonymous posted on Tuesday, April 05, 2005 - 6:51 pm
I am using GGMM to analyse a longitudinal dataset with missing values. It seems that if "missing" is specified in the variable and analysis commands, the FIML method will be used and the default algorithm is EM, i.e., the observed-data log likelihood will be maximized. Am I right about this? In the output I got a warning saying that the Fisher information matrix and standard error matrix related to some parameters cannot be inverted. What does this generally imply?
One more question: is MCMC ever considered in Mplus when dealing with latent variables and missing values?
BMuthen posted on Wednesday, April 06, 2005 - 3:16 am
Yes, you are correct about your first question.
Regarding the information matrix, this implies that your model is not identified. Ask for TECH1 to see which parameter causes this.
The current version of Mplus does not include MCMC analysis.
Anonymous posted on Thursday, April 07, 2005 - 10:13 am
Many thanks for your prompt answer.
Anonymous posted on Wednesday, June 01, 2005 - 10:23 am
I am using type = missing h1 (with the ML estimator) for a structural equation model using all continuous variables (latent and manifest). I am trying to provide a brief description of what missing h1 does for a manuscript. I read the manual but was confused. Could you provide a brief description for inclusion in the manuscript?
Thanks in advance,
bmuthen posted on Wednesday, June 01, 2005 - 5:56 pm
"Missing H1" says that we want to do ML estimation of an unconstrained (saturated) covariance matrix for the observed variables taking missingness into account under the MAR assumption (see the Little & Rubin book). This ML-MAR estimation is carried out using the EM algorithm in line with the L-R book. The estimated covariance matrix is used to compute a chi-square test of model fit, comparing H0 to H1.
Anonymous posted on Saturday, June 04, 2005 - 2:29 pm
Is it appropriate to use H1 with outcomes that are all categorical? That is, using theta parameterization and the WLSMV estimator?
Yes, it is. There is a table in Chapter 15 of the Mplus User's Guide that shows which TYPE options are available for various estimators and outcome scales. See ESTIMATOR in the index of the user's guide to find this table.
Dear Dr. Muthen, I am running a path analysis with 7 IVs at Time1 predicting 2 DVs at Time2. I have some missing data (not a huge amount, the coverage is around .9 for all variables). I have specified the Type = missing h1 under the analysis command. I have the following questions:
1. Does this missing data command take into account ALL variables that are listed under NAMES ARE, or does it use the variables that are listed in the USE VARIABLES ARE only?
2. If the latter is true, how do I go about letting other variables into the missing value analysis (for instance relevant covariates listed in the NAMES ARE list)?
3. One of the Time 1 variables is gender. The way I have written the syntax now, gender is just listed after the ON statement. Should I specify that gender is categorical? If so, how do I write that in the syntax?
Thank you in advance and thanks for a wonderful help-page
2. The USEVARIABLES list should contain all variables in the MODEL command -- independent and dependent variables.
3. You should not place independent variables on the CATEGORICAL list. This list is for dependent variables only.
Anonymous posted on Wednesday, July 06, 2005 - 9:47 am
Dear Dr. Muthen,
I have a question about missing value treatment. How can I run FIML instead of the EM algorithm?
According to your previous response related to missing value treatment,
"Missing H1" says that we want to do ML estimation of an unconstrained (saturated) covariance matrix for the observed variables taking missingness into account under the MAR assumption (see the Little & Rubin book). This ML-MAR estimation is carried out using the EM algorithm in line with the L-R book.
Do you think that the MCMC option in LISREL does the same thing as multiple imputation under NORM or SAS proc MI?
Thank you very much!!!
bmuthen posted on Wednesday, July 06, 2005 - 4:44 pm
FIML is an estimator and EM is one algorithm for computing FIML estimates. Other algorithms include Quasi-Newton, Fisher Scoring, and Newton-Raphson. Mplus uses the EM algorithm for the unrestricted H1 models and the other algorithms for H0 models.
Saying Analysis type = missing implies using all available data. With FIML this is the standard "MAR" approach to missingness.
MCMC stands for Markov Chain Monte Carlo. I don't know how the LISREL approach relates to NORM.
Ad Vermulst posted on Friday, October 07, 2005 - 1:42 am
Dear Bengt/Linda, I am using type=missing H1 in combination with the WLSMV estimator for ordered categorical dependent variables. Can you tell me how MPLUS 3 deals with missing values in this situation? I have read appendix 6 of your technical appendices, but this appendix is restricted to normally distributed y-variables. Maybe you can give me a lit. reference? Thank you very much. Ad Vermulst
bmuthen posted on Saturday, October 08, 2005 - 2:07 pm
See the missing data section of Chapter 1 of the version 3 User's Guide - which has the same content as the intro paragraphs for the Missing data topic here on Mplus Discussion.
Essentially, pair-wise information is used with categorical outcomes using the WLSMV estimator.
Reetu posted on Wednesday, December 14, 2005 - 2:19 pm
I am trying to do an exploratory factor analysis with both categorical and continuous variables. I have missings in both and I'm getting an error that is telling me that i can only use the missing option if all my dependents are continuous. Is there a way of getting around this? How should I treat my categorical missings?
That does not have missing data estimation for categorical outcomes. This came out in Version 3.
Annonymous posted on Wednesday, January 11, 2006 - 10:50 am
Is the missing data estimation for categorical outcomes appropriate even if it does not appear that the data is MAR or MCAR? How can one test to know for certain if the data is not missing at random?
bmuthen posted on Wednesday, January 11, 2006 - 11:02 am
There is no test for MAR. If one suspects ways in which MAR is violated, non-ignorable missing data modeling can be attempted to see if results differ. Although it is not always easy, you can do non-ignorable modeling in Mplus - see for example the model diagrams posted at
Full information maximum likelihood and multiple imputation are clearly superior to ad hoc approaches. I am debating which one to use for modeling my path analyses. Does anyone know if MI has a clear advantage over FIML?
bmuthen posted on Tuesday, February 07, 2006 - 6:24 pm
The approaches should give about the same results. I have heard Joe Schafer say that if you can do FIML, do it. - MI is mostly intended for when it is too hard to do FIML.
I wanted to get a sense for whether or not there is a mathematical and/or conceptual relationship between three approaches to the modeling of non-ignorable missingness - the first two are: a) MI where the missing data pattern indicators are included (along with the variables of interest) in the imputation model but only the variables of interest are included in the analysis model (Schafer, 2003, Stat. Neerlandica) and b) FIML with auxiliary variables where the missingness indicators are additional outcomes predicted by the IV(s) of interest (along with the DVs of substantive interest) with residual correlations between the missingness indicator(s) and the substantive DVs (Graham, 2003, SEM).
I came across Schafer's (2003) suggestion on a simple approach to pattern mixture modeling where he says in contrast to traditional PMMS "......this process of averaging the results across patterns may be carried out by MI. Suppose that we generate imputations Y1mis.....YMmis under a pattern mixture model. Once these imputations exist, we may forget about "R" (the missing data pattern indicators) and use the imputed datasets to estimate the parameters of P(Ycom) directly."(bottom of p.27) (link to the paper on Schafer's site is @
Using R in the imputation model and throwing it out in the analysis model sounded very much like using R as a special-case auxiliary variable a la Graham (2003). In Graham (2003), Collins et al., (2001, Psych. Methods) and elsewhere, the equivalence between MI with auxiliary variables and FIML with auxiliary variables is either discussed or illustrated. But one of the key models that is suggested by Graham (2003) (the correlated residuals model described above) looked very similar to a third model (Muthen/Jo/Brown 03 JASA - specifically the model on page 6 of your lecture17.pdf) except for two things: a) mixtures of longitudinal trajectories (which is not an important difference per se) and b) latent missing data classes (e.g., CU in the diagram) that are correlated with (or at least account for differences in conditional means on) the growth parameters. Now to my real question - assuming the same model structure of interest across the two approaches (e.g., single-population LGM), is it safe or reasonable to say that Graham's (2003) model is a special case of your "CU" JASA model where missing data pattern class is "known" (or at least captured with observed measure(s) of missing data class)?
I like the Schafer and Graham (2002) Psych Methods paper and their discussion of MI and FIML. Consider cases where you have variables (z, say) that relate to missing data and that don't belong in your model of interest for x and y. With MI you would use z in the imputation model but not in the analysis model. With FIML you would use z as extra y variables that are freely related to y and x.
The modeling with the missing data indicators (u say as in Lecture 17) is different. If you have MAR, modeling the u's in an unrestricted way in addition to x and y gives the same ML results as analyzing x and y only (ignorability of missingness). Modeling the u's aims to handle non-ignorability. Lecture 17 suggests several possible alternatives for doing such u modeling. Page 6 that you point to tries to simplify the u structure. This relates to pattern-mixture modeling where you have to use all missingness patterns as covariates. The pattern-mixture model essentially corresponds to a latent class model (the model with cu) that has as many classes as there are missing data patterns. With a latent class model for u, you essentially reduce the number of patterns to the number of classes. You can combine the u modeling idea of Lecture 17 with the z modeling idea above.
Thanks so much for your response Bengt; it was very helpful. I had an additional question on u modeling in general and cu modeling in particular. Other approaches to NMAR have a mechanism for (weighted) averaging of parms and se's across the missingness patterns such as hand-calculation, equality constraints (e.g., Allison 87, MKH 87), combining via matrix manipulation (HG 97) or the multiple imputation approach to NMAR that Schafer discusses in the .pdf linked above. For CU modeling of NMAR, it seems like constraining the estimates to get a weighted average of the covariate>growth parameter effects (i.e., X>I, X>S) is no problem (of course, modeling X>I and X>S only in the %overall% part of the model is less code to do the same thing). But it also seems like if one is interested in getting a weighted averaged estimate of the growth parameter intercepts (GPIs) (across all the latent missing data groups), ( E[ I | X, CU] and E[ S | X, CU] ), you may not be able to estimate them directly in the analysis because if you constrain the GP intercepts to equality, the problem reduces back to an MAR solution - it seems like constraining the GPIs in each CU class to equality eliminates the relation between the growth parameters and CU which seems like the very part of the model that handles non-ignorability. But if you allow the GPIs to vary across CU, you do not get a single (weighted averaged) estimate. Is my understanding of this off-base? If so, any additional guidance you could provide would definitely be appreciated. If this is not off-base, then would you recommend hand-calculation of the weighted average if one was interested in inferences on the GPIs?
I think your understanding is correct. You don't want to hold these parameters equal across classes, and this does lead to the problem of how one presents the results mixing over classes. I don't think this is resolved, but needs research. On the other hand, with a cu approach you have fewer patterns (number of classes) and therefore perhaps you are interested in presenting the results for each class by itself without weighting (mixing) them together - the classes may be so fundamentally different that you rather treat them separately.
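For readers who do want to mix class-specific estimates by hand, the arithmetic can be sketched as follows. This is only an illustration under loud assumptions: the class proportions are treated as known constants (so the uncertainty in the proportions is ignored and the standard error may be too small), and the function name `mix_over_classes` and its inputs are hypothetical, not Mplus output names or Hedeker's exact formulas. The covariance matrix argument allows non-zero covariances between estimates from different classes.

```python
import math


def mix_over_classes(weights, estimates, cov):
    """Weighted average of class-specific estimates and its standard error.

    weights   -- class proportions (rescaled here to sum to 1)
    estimates -- one estimate of the same parameter per class
    cov       -- covariance matrix of those estimates (list of lists),
                 cross-class covariances included

    Assumption: the weights are fixed, so only the sampling variability
    of the estimates enters the variance (a simplified delta method).
    """
    total = sum(weights)
    w = [wk / total for wk in weights]
    mixed = sum(wk * ek for wk, ek in zip(w, estimates))
    k = len(w)
    variance = sum(w[a] * w[b] * cov[a][b] for a in range(k) for b in range(k))
    return mixed, math.sqrt(variance)


# Hypothetical three-class example: intercept estimates per class.
proportions = [0.5, 0.3, 0.2]
intercepts = [1.0, 2.0, 4.0]
covariance = [
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.00],
    [0.00, 0.00, 0.16],
]
mixed_estimate, mixed_se = mix_over_classes(proportions, intercepts, covariance)
print(mixed_estimate, mixed_se)
```

A fuller treatment would also propagate the sampling variability of the class proportions, which this sketch deliberately leaves out.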
I was wondering what missing data strategy you would recommend for a small longitudinal SEM model? More specifically, I ran a SEM model in which there were 58 subjects at the first time point and only 50 subjects at the second time point (i.e. 8 subjects had missing data). I ran the model two ways (1) with listwise cases deleted and (2) with the means in the place of the missing data. Both models fit the data almost equally well and the same paths were significant in both models. Is the listwise strategy more rigorous than running the model with the means? Does this depend on the percentage of the sample that is missing data? Should I run the model another way?
Mplus uses the EM algorithm for ML estimation under the "MAR" assumption; see the Little & Rubin missing data book. In this approach, missing data are not imputed, but parameters of the model are estimated directly using all available data.
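To illustrate why direct ML estimation under MAR is not the same as averaging the observed values, here is a minimal sketch for the simplest case: two variables, x fully observed and y missing for some subjects. It assumes bivariate normality and uses the well-known closed-form (factored likelihood) ML estimate of the mean of y for this monotone pattern, so no EM iterations are needed. The function name and the toy data are hypothetical and this is not how Mplus is implemented internally.

```python
def ml_mean_monotone(x, y):
    """ML estimate of mean(y) when x is complete and y has missing values.

    Missing y values are given as None. Under bivariate normality and MAR:
        mean_y = ybar_obs + slope * (xbar_all - xbar_obs)
    where slope is the regression of y on x among the complete cases.
    """
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n_obs = len(pairs)
    xbar_all = sum(x) / len(x)
    xbar_obs = sum(xi for xi, _ in pairs) / n_obs
    ybar_obs = sum(yi for _, yi in pairs) / n_obs
    sxx = sum((xi - xbar_obs) ** 2 for xi, _ in pairs)
    sxy = sum((xi - xbar_obs) * (yi - ybar_obs) for xi, yi in pairs)
    slope = sxy / sxx
    return ybar_obs + slope * (xbar_all - xbar_obs)


# Toy data: the two subjects with the highest x have no y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, None, None]

observed_mean = sum(v for v in y if v is not None) / 4
ml_mean = ml_mean_monotone(x, y)
print(observed_mean, ml_mean)
```

In this toy example the mean of the observed y values is about 2.5, while the ML estimate is about 3.44, because the subjects with missing y have high x and x predicts y. No value is filled in for any individual; only the parameter estimate changes.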
Hello, In Stata, I created a data set that has several multiply imputed data sets. When I try to read this data set into MPLUS, however, I get the same two error messages repeated until the program finally aborts:
-Errors for replication with data file [and then it lists a bunch of numbers].
-*** ERROR in Data command The file specified for the FILE option cannot be found. Check that this file exists: [and then again, a bunch of numbers].
As far as I can tell, the Stata file contains 5 multiply imputed data sets, but can you tell from the above message if this is a problem with the data in Stata or in Mplus?
The message means that the file you have named using the FILE option cannot be found. Perhaps you have misspelled it or it has an extra extension that you are not aware of. If the file contains 5 data sets, you need to separate them if you plan to use the IMPUTATION option of Mplus. If you have further questions on this topic, please send them along with your license number to email@example.com.
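A rough sketch of the separation step, done outside Mplus, is given below. It assumes the stacked file has been exported from Stata as a CSV with a header row and a column that numbers the imputations (here called `_mi_m`, with 0 marking the original incomplete data); that column name, the `imp` file prefix, and the function name are assumptions for illustration, so adjust them to your own export. Each imputed data set is written without a header, and a list file with one data file name per line is written alongside them, since with the IMPUTATION option the FILE option names such a list file rather than a data file.

```python
import csv
import os


def split_imputations(stacked_csv, out_dir, index_col="_mi_m", prefix="imp"):
    """Split a stacked CSV of imputed data into one file per imputation.

    Rows with index 0 (assumed to be the original incomplete data) are
    skipped. Returns the list of data file names that were written.
    """
    groups = {}
    with open(stacked_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            m = int(row.pop(index_col))
            if m == 0:
                continue  # original data with missing values, not an imputation
            groups.setdefault(m, []).append(row)

    names = []
    for m in sorted(groups):
        name = f"{prefix}{m}.dat"
        with open(os.path.join(out_dir, name), "w", newline="") as out:
            writer = csv.writer(out)
            for row in groups[m]:
                writer.writerow(row.values())  # values only, no header row
        names.append(name)

    # List file: one data file name per line.
    with open(os.path.join(out_dir, f"{prefix}list.dat"), "w") as lst:
        lst.write("\n".join(names) + "\n")
    return names
```

Any string variables or non-numeric codes in the export would still need to be recoded before Mplus can read the files.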
In the intro to the Missing Data Modeling Discussion board, there's a reference to a paper I can't find: "Non-ignorable missing data modeling is possible using maximum likelihood where categorical outcomes represent indicators of missingness and where missingness may be influenced by continuous and categorical latent variables (Muthén et al., 2003)." Can you provide a link or more information?
That is the JASA article which you find on our web site under References.
Andy Ross posted on Wednesday, June 21, 2006 - 7:09 am
Dear Prof. Muthen
I am attempting to run a MIMIC LCA model with missingness on the covariates.
A colleague of mine recommended an alternative to including the x's in the model by mentioning their variances, which would require using integration to estimate the model: instead, create a new variable with mean zero and small variance, giving random values to each case, and then regress all the covariates on this random variable. The covariates are then no longer independent variables in the model and can have missing values.
The syntax for this model is as follows (rg is the new, random variable)
Data: file = c:\soton\ncdmis2.dat;
Variable: names = sx sc2 sc3 me ma ha br pv cd ep pa sm ex pt re kd em hq rg;
classes = c (4);
categorical = pt re kd hq;
nominal = em;
missing are all(99);
Analysis: type = mixture missing;
starts (0);
Model: %overall%
c#1-c#3 on sx-ex;
sx sc2 sc3 me ma ha br pv cd ep pa sm ex on rg;
My initial reaction is that this is not a good idea. I would have to hear why your colleagues think it is a good idea to say more.
In your case, I would use multiple imputation. You can use the NORM program to generate imputed data sets and analyze them in Mplus using the IMPUTATION option.
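For reference, the usual way of combining results across imputed data sets (Rubin's rules) can be sketched in a few lines. The pooled estimate is the average of the per-data-set estimates, and the pooled variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The function name and the example numbers below are hypothetical.

```python
import math


def pool_rubin(estimates, std_errors):
    """Combine one parameter's estimates from m imputed data sets.

    Returns (pooled_estimate, pooled_standard_error) using Rubin's rules:
        total variance = within + (1 + 1/m) * between
    """
    m = len(estimates)
    pooled = sum(estimates) / m
    within = sum(se ** 2 for se in std_errors) / m
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical regression coefficient from m = 5 imputed data sets.
coefs = [0.50, 0.54, 0.46, 0.52, 0.48]
ses = [0.10, 0.10, 0.10, 0.10, 0.10]
pooled_coef, pooled_se = pool_rubin(coefs, ses)
print(pooled_coef, pooled_se)
```

The pooled standard error is larger than any single-data-set standard error here, which reflects the extra uncertainty due to the missing data.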
Problems with the Mplus syntax should be sent to firstname.lastname@example.org. Please include the input, data, output, and license number.
Andy Ross posted on Wednesday, June 21, 2006 - 8:37 am
Dear Prof. Muthen
Many thanks for your speedy response - i will be certain to pass your thoughts onto my colleague.
Multiple imputation has been our method of choice so far, however the problem is we now want to save the probabilities in a data file, which is something you cannot do when working with multiple datasets.
Can i ask, are you suggesting that in our case, FIML wouldn't really be an option? i.e. use MI and accept that we will not be able to save the probabilities?
If you have more than two or three covariates with missing data, it is impractical to bring the covariates into the model because the computational burden of numerical integration would be heavy. If only two or three of your covariates have missing data, then FIML should be fine. You should study the missing data in your covariates. Perhaps there are some with very little missing data such that you could allow the listwise deletion on those and bring the others into the model.
Andy Ross posted on Thursday, June 22, 2006 - 7:32 am
Many thanks again.
I tried running the model again, mentioning the variance of the three variables which had the greatest missingness as suggested.
The model requested that i use ALGORITHM=INTEGRATION method of estimation. However when including this term under the analysis command I got the following error message:
*** FATAL ERROR THIS MODEL CAN BE DONE ONLY WITH MONTECARLO INTEGRATION.
Is this to be expected? Could you please give me some indication of how i should set up the estimation?
Missing data estimation using FIML is available for categorical outcomes by using the maximum likelihood estimator. Missing data estimation is also available using the weighted least squares estimator.
Julie Hall posted on Thursday, June 22, 2006 - 10:30 am
Thanks so much. How does Mplus deal with missing data when using WLSMV?
I am confused about how M-plus handles missing data. When I type the following into M-plus:
/*
DATA: FILE IS fulljoin.txt;
VARIABLE: NAMES ARE t1-t10 y1-y80 p1-p10 t11-t20 m1-m20;
TSCORES = t1-t10;
MISSING=ALL (999);
USEVARIABLES ARE t1-t10 p1-p10;
ANALYSIS: TYPE=MISSING;
MODEL: i s | p1-p10 AT t1-t10;
OUTPUT: SAMP;
*/
I receive the means for each of the variables p1 - p10. However, only the variables that do not have any missing data match up with the means calculated in Excel. I am certain that the missing values are set to 999 in the Mplus file. Thank you for your help.
If you do not specify TYPE=MISSING; in the ANALYSIS command, Mplus uses listwise deletion of any observation with a missing value on one or more of the analysis variables.
peter kane posted on Tuesday, July 18, 2006 - 1:12 pm
question about missing data.
i am analyzing some longitudinal data in a cross-lag model and have about 15% of subjects missing data on one variable at the first time point. these same subjects are missing subsequent time points for this specific variable. essentially, for this 15% of the sample, there is no data on this one variable. however, these missing subjects have observations on other variables.
my question is whether i should delete the subjects who are missing this variable, or conduct the analysis on the entire sample by employing a missing data estimation procedure such as FIML? i guess i am not sure if the data is "missing at random". thank you very much for your ideas/suggestions.
I would use missing data estimation even if the data are not missing at random if it meant losing 15 percent of the sample. You might want to do the analysis both ways and see if it affects the interpretation of the results.
I am conducting some analyses using data from NESARC. In a recent article (Grant et al., 2006) analyses were conducted using a sample of past year drinkers (n = 26946). I hope you could answer a query that I have.
I am aware that you have conducted analyses on this dataset and am interested to know how you and your colleagues dealt with missing data among this sample. I have read in the literature that listwise deletion of missing data is quite popular. I am aware however that the NESARC dataset contains a weighting variable. I have read on the Mplus discussion board that deleting missing data can have an adverse effect on the weighting variable. I want to use the weighting variable in my analyses and I am therefore reluctant to delete cases with missing data.
In an attempt to overcome this problem, I have included the following commands in the input:
Variable: Missing are all (-9);
Type: Complex mixture missing;
However, I am aware that other people have used the algorithm command in their analyses. Is this an appropriate solution to the issue of missing data? If not, what command(s) would you suggest I use/change in my analyses?
I would like to extend on a query from my previous post (July 27 2006). I have missing values for approximately 4% of my data. I am considering recoding my dataset from values of 'yes' to 'criteria present' and values of 'no' or 'missing' to 'all other responses'.
Do you think that I could statistically defend this treatment of missing data? I am aware that treating unknowns or missing values as negative has a certain element of risk (as there may be false negatives), but given the low proportion of missing data, I am unsure as to whether this is a problem. I was wondering however if you could perhaps suggest any references or authors that may have utilised such a technique?
I would treat the missing as missing and use TYPE=MISSING; I think it is dangerous to start recoding. You may want to search the literature to see if you can find anyone who advocates the approach that you suggest.
Julie Hall posted on Wednesday, August 09, 2006 - 8:55 am
I am using Mplus for my dissertation analyses and I want to make sure that I understand how my missing data will be handled. I am using WLSMV (with covariates) and my understanding is that the data will be treated as missing as a function of the covariates. Could someone explain what this means? Thanks so much!
The ML-MAR approach to missing data allows missingness to be predicted by variables that are not missing for the individual. So both y and x variables. However, with WLSMV, if missing is predicted by y variables, the results are distorted, while they are not if missing is predicted by x variables.
Hello Bengt (I am having to send this in 2 parts...), I wanted to follow-up with you on our discussion from March 2006 on this thread about Latent Class Pattern Mixture Models (LCPMMs, i.e., "CU" models for NMAR dropout). With a very small sample (N = 128), I have looked at a series of K-class (i.e., single-class through 4-class) LCPMMs where CU (treatment attendance classes) jointly accounts for a) three-piece linear growth in alcohol use over time across 12 weekly alc. assessments, b) observed measures of treatment attendance (i.e., missingness) from weeks 2-12 of tx (i.e., everyone "shows up" for week 1) and c) the (calendar) week of the trial when the person started treatment. BIC and entropy suggest that a 3-class model fits best and, in fact, when you mix estimates (i.e., growth parameter intercepts, tx effects for each piece) across classes (weighted averaging outside the analysis), you make a different inference than you would have made if you took the results of the 1-class model (e.g., standard LGM under ignorability - but with the missingness indicators left in the model to compare BICs). (Part 1 ends here.....)
(Part 2 starts here....) My question is the 3-class model has 64 parameters - exactly half the number of people in the dataset, which is a dangerously low ratio of people-to-parameters (i.e., 2-to-1 - though the class-specific estimates do not look strange and I reproduce the log likelihood value multiple times with 500 starts). But 33 of those parameters (11 indicators x 3 classes) are the thresholds for the missingness (show/no-show) indicators. Lin et al (2004; Biometrics) say that for CU models for NMAR, data are MAR within each class, after conditioning on class membership - seems to me that once you condition on CU, you could ignore these missingness indicators (just as you would never need the missingness indicators in single-population models) and not be penalized for having such a low ratio because more than half the parameters in the model would not even be there if class membership were known. I wanted to get your thoughts on this and see if this was off-base........
Lin, H.Q., McCulloch, C.E., & Rosenheck, R.A. (2004) Latent pattern mixture model for informative intermittent missing data in longitudinal studies. Biometrics, 60, 295-305.
I see what you are saying, but it seems that you cannot get at CU status without estimating those thresholds, so I think they are necessary. It is an empirical question if you do better with such a 64-parameter model than not trying NMAR at all.
Thanks again Bengt. I agree that this would be a very different model w/o the thresholds. I just worried a little bit about the low ratio, especially given that this particular CU model looks better empirically than 1-class model under MAR (though I realize this is not necessarily a "test" for or against MAR). No one has brought up the ratio problem with these data and it seems like it doesn't worry you either.....
I conducted a small sim as part of this work (as part of a poster at Yih-ing Hser's CALDAR conference and a talk I gave Oct 2 at Bud MacCallum's brown bag @ UNC), which focused on confidence interval coverage for the mixing of the class-specific parameters in the meanstructure, using all the class-specific parameter estimates (e.g., growth parameter intercepts, treatment effects, show/no-show thresholds, variance components, etc.) from the 3-class model as population parameters with simulation N=128 (500 replications). I looked at coverage under 1-class through 4-class models, given there were 3 classes in the pop. There were two things that were encouraging for the three-class solution: 1) coverage was excellent for the 4 effects I looked at (weighted average treatment effects on the three linear pieces and the intercept at the last week of treatment), between 92-98% coverage across all replications and 2) no non-converged solutions/local maxima in any replication. Coverage was bad for 2-class and terrible for 1-class, with the majority of the confidence interval misses (relative to the pop. tx effect(s)) coming because the (class-mixed) tx effect was overestimated. 4-class is where the % of non-converged solutions was so high (even with 700 random starts in all conditions), I stopped studying anything beyond 3-class. Does this help?..
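As a small illustration of what "coverage" means in a simulation like this, the sketch below counts how often a normal-theory confidence interval from each replication contains the population value. The function name, the 1.96 multiplier for a 95% interval, and the four toy replications are assumptions for illustration only and are not taken from the simulation described above.

```python
def ci_coverage(estimates, std_errors, true_value, z=1.96):
    """Proportion of replications whose interval est +/- z*se covers true_value."""
    hits = 0
    for est, se in zip(estimates, std_errors):
        lower = est - z * se
        upper = est + z * se
        if lower <= true_value <= upper:
            hits += 1
    return hits / len(estimates)


# Four toy replications with a population value of 1.0.
rep_estimates = [0.9, 1.1, 1.5, 1.0]
rep_ses = [0.1, 0.1, 0.1, 0.2]
coverage = ci_coverage(rep_estimates, rep_ses, true_value=1.0)
print(coverage)
```

Here three of the four intervals contain 1.0, so the coverage is 0.75; the third replication misses because the effect is overestimated relative to its standard error.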
I am conducting analyses using data from NESARC, a complex survey design study. My analyses are concerned with a sub-sample of respondents, which I identify in my set-up using the subpopulation command. My query concerns the coding of respondents who are not members of the sub-sample. How should they be coded?
Just to clarify, those respondents who are not included in my subpopulation are coded as missing in the dataset (due to 2 screener questions). I have identified these people in the set-up (missing are all -9 and type = mixture complex missing). Is this correct?
Also, in my output, should the number of observations reflect the number of respondents in my subsample or rather the entire sample?
I wanted to follow-up on an Oct 2006 thread on Latent Class Pattern Mixtures (e.g., MJB, 2003) on an issue that probably comes up in any K=>2 GMM. In working with the covariance matrix of the estimates (covb) (and a Jacobian matrix of 1st-order derivatives) to generate delta method standard errors for weighted-averaged estimates from LCPMM, I noticed that there were non-zero covariances *across* classes. I initially thought that was strange, as I was expecting covb to be block-diagonal (0s for all parameter covariances across classes). But then I wondered if these non-zero covariances were one of the places in the model where the uncertainty in class membership was reflected; in fact the two class combination (in my K=3 model) that has the largest off-diagonal in the matrix of average latent class probabilities also has the largest cross-class covariances. The other sets of cross-class covariances are 0 (or so small as to be functionally 0). Are my suspicions on-base? If not, any explanation as to why covb isn't block diagonal would be very helpful......
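To make the question concrete, here is a minimal sketch of a delta-method standard error for a weighted average of class-specific estimates, computed once with the full covb and once with the cross-class covariances zeroed out. All numbers are hypothetical, the helper name `delta_method_se` is made up, and the class weights are treated as fixed (a simplifying assumption; in a real LCPMM the class probabilities are estimated too and would add terms to the Jacobian).

```python
import math

def delta_method_se(gradient, covb):
    """Delta-method SE of a scalar function of the estimates: sqrt(g' V g)."""
    n = len(gradient)
    variance = sum(gradient[i] * covb[i][j] * gradient[j]
                   for i in range(n) for j in range(n))
    return math.sqrt(variance)

# Hypothetical class weights for K = 3, treated as fixed (assumption).
weights = [0.5, 0.3, 0.2]

# Hypothetical covb for the three class-specific treatment effects,
# with a non-zero covariance between classes 1 and 2.
covb_full = [
    [0.040, 0.010, 0.000],
    [0.010, 0.090, 0.000],
    [0.000, 0.000, 0.160],
]

# Same matrix with every cross-class covariance forced to zero.
covb_block = [[covb_full[i][j] if i == j else 0.0 for j in range(3)]
              for i in range(3)]

# For a weighted average sum_k w_k * theta_k the gradient is the weight vector.
se_full = delta_method_se(weights, covb_full)
se_block = delta_method_se(weights, covb_block)
print(round(se_full, 4), round(se_block, 4))
```

With a positive cross-class covariance, ignoring the off-diagonal block understates the standard error of the weighted average, so the full covb should be used in the Jacobian computation.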
Dear Linda and/or Bengt, I am analyzing data from a longitudinal study of risk for anxiety disorders and depression in 600+ high school juniors. At T1, we obtained self-reports on vulnerability measures for all subjects and tried to obtain peer-report versions of the same measures for all subjects. However, because some subjects refused to nominate peers and some peers refused to participate, we actually obtained peer-reports on roughly 50% of our subjects only. I was thinking about incorporating the missing data by using the multiple group approach to missing data. However, I have come across some references suggesting that the FIML approach to missing data is conceptually equivalent to the multiple group approach. If this is true, it certainly seems preferable to me to go the FIML route based on ease of model specification. Can you confirm that these approaches are equivalent? If not, when would you use the one and when would you use the other. Thanks for your time!
My N is 99, but when I run the following syntax, the number of observations in the output is only 70.
VARIABLE: NAMES ARE (I deleted this for brevity);
MISSING = ALL (99);
USEVAR = ad1 ad2 ad3 satbf1 percrit1 percrit2 percrit3;
ANALYSIS: TYPE = MEANSTRUCTURE MISSING;
MODEL: i s | ad1@0 ad2@1 ad3@2;
i s ON satbf1;
ad1 ON percrit1;
ad2 ON percrit2;
ad3 ON percrit3;
OUTPUT: SAMPSTAT STANDARDIZED MODINDICES (3.84);
Look to see if the output says that individuals with missing on all variables or missing on x variables are deleted. If you don't see this, please send your input, output, data and license number to email@example.com.
I'd like to use estimated sigma within and between matrices for multilevel regression and path analysis. In part, these matrices would serve as input for multiple group analysis.
Many variables in my dataset are treated as covariates. As far as I know, covariate missingness leads to listwise deletion when using FIML.
When using the sigma matrices as input for an analysis with covariate missingness, I wonder what the right N would be for the sample/the groups. Does missingness in covariates have to be taken into account to determine N for covariance matrix input?
For the pooled-within matrix use the sample size shown in the output where you saved the pooled-within matrix minus the number of clusters. This takes into account any observations lost due to missing data. For the sigma between matrix the sample size is the number of clusters.
Lisa Liu posted on Friday, May 18, 2007 - 12:33 am
Hi, I am trying to run a two-level path analysis but am having trouble estimating the missing data. When I take out the level two data and just run it as a path analysis, the model successfully estimates the missing data. But when I add in the level two data, it stops working. This is strange because all of the missing data is level 1 data. Any suggestions? Thanks for your time!
1. When using the "Type = imputation" option, how does Mplus generate the S.E.s of the estimates? Does it apply Rubin's rules? I compared the results of the same model with FIML and MI, and the S.E.s are quite different.
2. How can I request output of the relative efficiency, relative increase in variance, and fraction of missing information in Mplus?
1. We estimate standard errors for multiple imputation according to the Schafer 1997 reference listed in the user's guide. FIML and MI are asymptotically equivalent. Differences can come about with small samples.
2. These items are not currently available in Mplus.
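For readers who want to check pooled results by hand, the sketch below applies the combining rules as they are usually stated (Rubin's rules, which Schafer 1997 also presents). It is not Mplus code: the function name `pool_rubin` and all numbers are made up for illustration, and the fraction of missing information shown is the simple large-sample ratio without the degrees-of-freedom correction.

```python
import math

def pool_rubin(estimates, standard_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                    # pooled estimate
    within = sum(se ** 2 for se in standard_errors) / m           # average within variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between variance
    total = within + (1 + 1 / m) * between                        # total variance
    return {
        "estimate": q_bar,
        "se": math.sqrt(total),
        # Relative increase in variance due to missing data.
        "riv": (1 + 1 / m) * between / within,
        # Simple (uncorrected) fraction of missing information.
        "fmi_approx": (1 + 1 / m) * between / total,
    }

# Hypothetical estimates and standard errors from m = 5 imputed data sets.
estimates = [1.50, 1.46, 1.55, 1.43, 1.51]
standard_errors = [0.20, 0.21, 0.19, 0.22, 0.20]

pooled = pool_rubin(estimates, standard_errors)
print(pooled)
```

Because the between-imputation variance is added to the average within-imputation variance, the pooled standard error cannot be smaller than the square root of the average squared standard error across the imputed data sets.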
This is a follow-up question to my previous inquiry about analyzing data using multiple imputation.
I generated multiple imputed data sets (40 replications) using PROC MI in SAS. I then analyzed the data using both Mplus and PROC CALIS, taking into account that the data were generated by multiple imputation methods. Below are examples of the resulting parameter estimates with standard errors in parentheses. For comparison, I have also included estimates obtained from Mplus using full information maximum likelihood.
Analyzing data based on multiple imputation procedures (using the identical sets of data in both analyses):
estimate from Mplus: 1.47 (se = .04)
estimate from CALIS: 1.54 (se = .23)
Analyzing data without multiple imputation:
estimate from Mplus (FIML): 1.52 (.24)
It is interesting to note that the standard error obtained using PROC CALIS based on MI is quite comparable to that obtained using Mplus with FIML. The result from Mplus based on MI is strikingly different. How does one account for the large discrepancy?
I would need more information to comment on this. Given that the parameter estimates are simple averages over the replications, I wonder why they are different. Unless they are the same, I wouldn't expect the standard errors to be the same. If you send the three outputs and your license number to firstname.lastname@example.org, I can take a look at it.
Jie Lu posted on Tuesday, September 18, 2007 - 11:17 pm
I am trying to fit a structural model with imputed data sets generated through the ICE procedure in Stata. One of the endogenous variables is a dummy. After I fit the model, I do get the averaged CFI, TLI, and RMSEA and their respective standard deviations. However, I did not get those for the chi-square. How can I get them?
My version of Mplus is 4.21. When I change the type of all my endogenous variables to continuous, I can get the averaged chi-square and its standard deviation. But if some endogenous variables are categorical, I cannot get the averaged chi-square and standard deviation. Is there some constraint for the WLSMV estimator?
With WLSMV, the chi-square test statistic and the degrees of freedom are adjusted to obtain a correct p-value. So the degrees of freedom vary across the replications and therefore we do not report the average.
V X posted on Thursday, November 22, 2007 - 11:59 am
I do not fully understand the "integration = montecarlo;" option in Mplus. Would you please provide some references so that I can get a good understanding of what the mechanism is and when to apply it?
Monte Carlo integration can be useful with many dimensions of integration and in other special cases described in the user's guide. You can search for this in the computational statistics literature. I don't know of a particular reference offhand.
Hi Bengt and Linda, I am interested in using Mplus for fitting a growth model to a data set with missing values on the outcome variable. I could use TYPE = RANDOM MISSING and the model would produce factor scores and other estimates under a MAR assumption (the missing data mechanism depends on observed data). However, my question is the following. If I want to model the missing data mechanism as in Diggle and Kenward, I could use the "missing data indicator at time t ON outcome at time (t-1)" (u ON y) type of code but would still need the TYPE = MISSING bit of code to avoid listwise deletion. Am I not "overriding" the missing data code with the inclusion of TYPE = MISSING? Should the factor scores obtained under the two model specifications (with and without the explicit missing data model) not be different due to the presence of the model for the missing data mechanism, as I am only including "outcome at time (t-1)" in the missing data model? Thanks, Graciela
The alternative of using missing data indicators (u_t on y*_t-1) in the modeling (so allowing for MNAR) takes the same approach of using all available data as MAR does, so Type = Missing does not override this. Factor scores should come out different in the two approaches given that the models are different.
In a conditional model, information on x does not contribute to the estimation of the regression coefficient in the regression of y on x, and the mean and variance of x are not estimated. So an observation with information only on x is not used because it has no information to contribute.
In an unconditional model where the means, variances, and covariance between y and x are estimated, cases with information only on x are included in the analysis.
The only thing you can do to avoid this exclusion is to mention the variances of the x variables that have missing on x in the MODEL command. This will cause them to be treated as y variables. Their means, variances, and covariances will be estimated and distributional assumptions will be made about them.
Thank you for your quick reply. Would you suggest estimating the (co)variances for the x-side variables (and declaring nonnormal variables on the categorical line)? Could this cause other problems or violate assumptions?
I would not do this. If your x variables are continuous normal, it would probably be okay and in line with multiple imputation programs. If they are categorical, it would change the model. The bottom line is that if you are interested in regression coefficients, bringing the cases with only x's into the model will not change the results.
Have you published and/or are you aware of any articles that illustrate methods for modeling the conditional expectation of the likelihood given the data and current values of the parameter set (i.e., EM for model parameters) for conventional (single-class, single-level) SEMs? I am either coming across applications of EM a) for means and covariances (e.g., Schafer, 1997, p.163-181), b) for model parameters within the multilevel track (e.g., Lee & Poon, 1998, Stat. Sinica; Liang & Bentler, 2004, Psychometrika) or c) for model parameters in the mixture track (Muthén/Shedden, 1999). For b and c, some places are obvious where the model, within the E-step, would be modified/structured to fit a conventional model as a special case - and not-so-obvious in other places. And many texts that discuss casewise ML for MAR data for conventional SEM seem to say little about EM, N-R or any other optimization techniques (while there is plenty of talk on this for multilevel and/or mixture SEM). Any help either of you could provide on this would be greatly appreciated………
Looks like they have two papers that are relevant:
Rubin, D.B. & Thayer, D.T. (1982) EM Algorithms for Maximum Likelihood Factor Analysis. Psychometrika, 47, 69-76.
Rubin, D.B. & Thayer, D.T. (1983) More on EM for ML Factor Analysis. Psychometrika, 48, 253-257.
There was also this:
Bentler, P.M. & Tanaka, J.S. (1983) Problems with EM algorithms for ML factor analysis. Psychometrika, 48, 371-375.
Won't be able to get them until Mon (have hardcopy access but no electronic). The 2nd Rubin paper appears to be a response to the paper by Bentler & Jeff Tanaka where both groups traded concerns about susceptibility of another optimization method (N-R maybe? I'll see on Monday...) to local maxima. Thanks for pointing me in the right direction......
Hi Linda or Bengt, I know that a model with more parameters than subjects has identification problems (presumably related to the fact that the rank of the data matrix is limited by the number of subjects in this case) but I am not clear on how missing data impacts this. If I am fitting a model say with 200 parameters free to be estimated and 600 subjects total but have complete data from say 150 subjects (self-report is collected from all 600 but more expensive measures such as peer-report and diagnostic interview are collected from subsamples) would we have the same identification problems that we would have had if we didn't have the 450 subjects with partial data? or do the additional subjects with partial data help in this regard? Thanks for any insight you can provide!
Thanks for the very quick reply, Linda! I don't understand the last part about parameters for which we have little information to compute the H1 model. If you would be able to elaborate a bit, I would appreciate it, as I am not sure which parameters those would be.
Dear Drs Muthen, this may be a very silly question, but I am struggling to figure out how Mplus handles missing data (MD). I have a dataset of 8028 people, with 248 variables (both categorical and continuous), with missing data in all of them. I know that Mplus 5 takes MD into account by default, but I want to know why the number of analysed cases differs according to the number of variables in the dataset. For example, when I regress age on sex (neither has MD), the number of analysed cases is only 5189 instead of 8028. On the other hand, when I create a new dataset including only age and sex as variables, the number of analysed cases is 8028. Why do these differences occur? Is there any default mechanism that I am missing?
"Mplus provides maximum likelihood estimation under MCAR (missing completely at random) and MAR (missing at random; Little & Rubin, 2002) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types"
Does this mean Mplus used the method Rubin (2002) recommended?
This might be due to my lack of knowledge. Could you please specify the method you used? Can I just say Mplus used "special" ML to handle missing data?
Just quote the user's guide if you cannot paraphrase it.
Nikolai Eton posted on Wednesday, October 22, 2008 - 8:40 am
Dear Linda and Bengt,
I am interested in LGC with multiple indicators and multilevel growth mixture models. As my dataset is "different" in a way, I would like to hear your advice on how to handle missing data in it using Mplus.
The dataset consists of several variables (cont. & cat.) across 5 timepoints on the item level (every variable shows up 5 times) and is hierarchical, with individuals nested in groups. The whole dataset is related to one country. The special thing about the dataset is that there is high fluctuation across groups (that I can control), but also across countries (that I can't control). Thus, individuals sometimes change groups. In addition, they sometimes leave the country (and probably return later), which appears as missing values during the time of absence.
Overall, I have about 1000 individuals nested in about 25 groups with all 5 timepoints available for 106, 4 timepoints for 87, 3 timepoints for 165, 2 timepoints for 220 and 1 timepoint for 442 variables. I do not necessarily want to explain the absence.
What is your recommendation on taking care of missing values in here?
Thank you very much for your help.
Nikolai Eton posted on Wednesday, October 22, 2008 - 8:58 am
One add-on to my question: the minimum covariance coverage is .113 (although that variable has about 70% missing data). Thus, the basic analysis results in this error message:
THE COVARIANCE COVERAGE FALLS BELOW THE SPECIFIED LIMIT.
If you use group as a level 2, you are treating group as a random mode of variation. In this case, changing group membership implies the need to use a "crossed random effects" approach (see the multilevel lit.), which Mplus currently cannot do.
Leaving the country at some time points is a missing data question, which is probably handled fine by the standard ML MAR approach of Mplus. But you want to pay attention to coverage that is lower than the Mplus default limit of 0.10 (which can be altered) - that limit may already be too low (it is just put there to prevent convergence problems) for seriously relying on the results. It depends on where the low coverage occurs. If it is for a covariance between, say, the first and last time point, but coverage is otherwise high, then that is not so problematic because you typically don't have a growth model parameter corresponding to the covariance between first and last. A problem would be if low coverage happens for a variable (the diagonal of the coverage report) or for variables at close time points.
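To see what the coverage report measures, here is a minimal sketch that computes it from raw data: the diagonal is the proportion of cases observed on each variable, and each off-diagonal entry is the proportion of cases observed on both variables of a pair. The data, the function names, and the use of `None` as the missing marker are all assumptions for illustration; this is not how Mplus is implemented.

```python
def covariance_coverage(rows):
    """Proportion of cases with both variables j and k observed (None = missing)."""
    n_cases = len(rows)
    n_vars = len(rows[0])
    coverage = [[0.0] * n_vars for _ in range(n_vars)]
    for j in range(n_vars):
        for k in range(n_vars):
            both = sum(1 for row in rows
                       if row[j] is not None and row[k] is not None)
            coverage[j][k] = both / n_cases
    return coverage

def low_coverage_pairs(coverage, limit=0.10):
    """Variable pairs (j, k), j <= k, whose coverage falls below the limit."""
    n_vars = len(coverage)
    return [(j, k) for j in range(n_vars) for k in range(j, n_vars)
            if coverage[j][k] < limit]

# Hypothetical data: 5 individuals, one outcome at 3 time points.
rows = [
    [1.0, 2.0, 3.0],
    [1.5, None, 2.5],
    [2.0, 2.2, None],
    [1.2, None, None],
    [None, 1.9, 2.1],
]

coverage = covariance_coverage(rows)
flagged = low_coverage_pairs(coverage, limit=0.5)
print(coverage)
print(flagged)
```

Entries flagged on the diagonal, or for adjacent time points, are the ones the reply above singles out as most worrying.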
We have a data set of 300 adolescents who were sampled at three waves in a cohort sequential design. It's a typical longitudinal data set and we have a reasonable amount of missingness.
We did a series of unconditional and conditional LGM models to describe and predict constructs and concluded our paper with a conditional parallel process model between two constructs. To maximize our power and sample, we included cases that had data at two time points and used the missing estimator (our rationale being that two points provides a line if not a curve). When we looked at our data everything made sense and we wrote up and submitted our manuscript.
Upon receiving our reviews to this and two similar papers, we received consistent critiques that I'm hoping you can help me clarify for the reviewers. I'm writing today to ask if you can direct me to literature that address these issues.
1) Reviewers were concerned that latent growth curve models cannot be properly identified or stably fit with 3 (or fewer) time points. Is there evidence that the models are trustworthy?
2) Is our rationale to keep cases with at least two time points and use the missing estimator something we can justify?
3) For some of our estimates, the size of the estimate is very small (e.g., slope = .008). Although significant, how are small estimates to be interpreted?
1) Typically, at least 4 time points are desirable for good growth modeling. With only 3 time points, there are several model misspecification risks that cannot be countered due to having too few time points to identify more flexible models. This is discussed in our Mplus Short Courses, Topic 3 (see videos and handouts on our web site). Still, many published studies have used only 3 time points. If all individuals have only 2 time points, only a very limited growth model is possible.
2) It sounds like you have 3 time points for a majority of individuals and 2 for some. The percentages of each should be given. And, in fact, you could have included individuals with only 1 time point. This is what ML estimation under the "MAR" assumption of missing data theory (see the Little & Rubin book) would do. If a majority has 3 time points, I don't see a serious problem with this approach.
3) I think you are talking about a slope mean. The size of this depends on the time scores. The real question is what the implied change in mean is for the outcome from one time point to the next. You find that in the Mplus output when requesting RESIDUAL.
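As a small worked example of point 3, the sketch below computes model-implied outcome means for a linear growth model as intercept mean plus slope mean times the time score. Only the slope mean of .008 comes from the question; the intercept mean and the monthly time scores are hypothetical, chosen to show how a small slope can still imply a visible change between waves.

```python
def implied_means(intercept_mean, slope_mean, time_scores):
    """Model-implied outcome means for a linear growth model."""
    return [intercept_mean + slope_mean * t for t in time_scores]

# Hypothetical: time scored in months, three waves one year apart.
time_scores = [0, 12, 24]
means = implied_means(intercept_mean=2.0, slope_mean=0.008,
                      time_scores=time_scores)

# Implied change in the outcome mean from one wave to the next.
changes = [later - earlier for earlier, later in zip(means, means[1:])]
print(means, changes)
```

Under this scoring, a slope mean of .008 per month corresponds to a change of about .096 per wave, which is the quantity worth interpreting against the scale of the outcome.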
Hemant Kher posted on Thursday, October 30, 2008 - 2:53 pm
Dr. Muthen -- Greetings,
I am working on fitting growth models to survey data and have a question about missing data.
There were 233 students in the sample. We collected data at 4 equally spaced time points. With regards to our key variables used to fit growth models, here is the breakdown of how often students provided data:
107 students provided data at all 4 time points (46%)
64 students provided data at 3 points (27%)
36 students provided data at 2 points (15%)
23 students provided data at 1 point (10%)
3 students did not provide any data (1%)
I have read somewhere that for growth models, we need at least 1 time point -- but I am not sure if having close to 10% people that provided only 1 observation will affect our growth models.
Assume that you have a linear growth model. The most important factor is how many individuals have at least 3 time points because that's how many you need to identify all the parameters. The individuals with fewer time points also contribute to the estimation of some parameters so they are helpful to include. Of those who have at least 3 time points you also want to know how representative they are of the whole group - a simple thing to check is if the mean of the outcome at the first time point is significantly different across the 4 missing data groups you list. If different, you may consider "pattern-mixture modeling".
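Using the breakdown given in the question, a quick tally shows how many students have the 3 or more time points needed to identify all parameters of a linear growth model, and how many contribute at least some data. This is plain arithmetic on the counts posted above; the variable names are made up.

```python
# Number of students by number of time points observed (from the question).
counts = {4: 107, 3: 64, 2: 36, 1: 23, 0: 3}

total = sum(counts.values())
at_least_three = sum(n for points, n in counts.items() if points >= 3)
at_least_one = sum(n for points, n in counts.items() if points >= 1)

share_at_least_three = at_least_three / total
print(total, at_least_three, at_least_one, round(share_at_least_three, 3))
```

So 171 of the 233 students (about 73%) have at least 3 time points, and 230 have at least one observation to contribute.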
In an earlier posting, it was mentioned that FIML was available for categorical outcomes. However, whenever I have tried this I get a warning stating, "Data set contains cases with missing on x-variables. These cases were not included in the analysis." This has been the case when I have run logistic regression analyses and when I have run SEM models with binary indicators of latent variables. Can you clear this up for me? Is there a way I can get Mplus to use FIML with such analyses?
A regression model is estimated conditioned on the observed exogenous variables. Means, variances, and covariances of these variables are not part of the regression model. Missing data theory applies to observed endogenous variables. You can include the observed exogenous variables in the model by mentioning their variances in the MODEL command. By doing this they are treated as endogenous variables and distributional assumptions are made about them.
I have a missing data question that I was hoping someone could answer. We developed a ten-factor measure of connectedness, with one factor measuring connectedness to siblings. As expected, some of our subjects do not have siblings and thus their data is missing appropriately. To avoid losing those subjects on the other factors, I estimated the missing data using a multiple imputation procedure and followed it with an invariance analysis that compared sibling and non-sibling samples across the factor loadings, intercepts, residuals, and covariance matrix.
My first question is do you find this analytic approach appropriate? As I expected, the results were nearly identical across the two samples, both when testing the ten factor model and single factor model (i.e., sibling connectedness scale). The only caveat is that the sibling connectedness results should only generalize to subjects with siblings.
My second question is whether mixture modeling with known classes is a better approach to answer this question. If my understanding of mixture modeling is correct, I would draw the same conclusion. Am I correct?
I would look at subjects with and without siblings separately. Then if you want to compare them, do so on the factors that are not about sibling connectedness. Imputing siblings for those without seems a little iffy.
Mixture modeling with only a known class variable is the same as multiple group analysis.
anonymous posted on Thursday, February 26, 2009 - 3:52 pm
Hello, I am trying to determine how to approach a missing data issue. I have ratings of depression severity across time for about N=400. The timing of observations varies across individuals, so I plan to nest time points within individuals. One problem, however, is that the number of data points also varies across individuals. For instance, the number of data points for the sample ranges from 1-16, with a mean of 8 data points, SD=2.8, and variance=7.7. I am not certain what would be the best way to approach this. For example, would it be best to include only the first 8 time-points for the analysis? Any thoughts would be very much appreciated.
I would include all the data. The varying timings are handled by the AT option of the growth language (using |) and the varying number of time points can be handled either by
(1) using a single-level, wide approach letting the observation vector be of length 16 and using a missing data symbol for time points not available
(2) using a two-level (time points within individuals), long approach with a univariate outcome, where the different number of time points per individual is merely resulting in different cluster sizes and is therefore inconsequential.
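The two layouts described above can be sketched as follows. In the wide layout every individual has a vector of the maximum length, padded with a missing marker; in the long layout each observed time point becomes one row, so individuals simply end up with different cluster sizes. The data, the `None` marker, and the function name are assumptions for illustration, and only 4 slots are used instead of 16 to keep the example short.

```python
from collections import Counter

def wide_to_long(wide_rows):
    """One (id, time, y) row per observed occasion; missing occasions are dropped."""
    long_rows = []
    for row in wide_rows:
        for time, y in zip(row["times"], row["y"]):
            if time is not None and y is not None:
                long_rows.append((row["id"], time, y))
    return long_rows

# Hypothetical wide data: up to 4 occasions, individually-varying times.
wide_rows = [
    {"id": 1, "times": [0.0, 1.5, 3.0, None], "y": [12.0, 10.0, 9.0, None]},
    {"id": 2, "times": [0.0, 2.0, None, None], "y": [15.0, 14.0, None, None]},
    {"id": 3, "times": [0.0, None, None, None], "y": [8.0, None, None, None]},
]

long_rows = wide_to_long(wide_rows)

# In the long layout, the number of time points is just the cluster size.
cluster_sizes = Counter(person_id for person_id, _, _ in long_rows)
print(long_rows)
print(dict(cluster_sizes))
```

The long layout never needs padding, which is why very uneven numbers of time points tend to be easier to handle there.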
anonymous posted on Friday, February 27, 2009 - 10:14 am
Hi Dr. Muthen, I have attempted a LGMM using the first option, however the model does not converge. Do you think that convergence would be more feasible with the second option? Thanks very much for your help!
I am not sure. It would depend on the reason for non-convergence. You would have to contact email@example.com to have this diagnosed.
anonymous posted on Saturday, February 28, 2009 - 2:46 pm
The error I obtain is the following: *** ERROR One or more variables have a variance of zero. Check your data and format statement.
There is one variable with only 3 subjects and the variance is 0.027. However, Mplus indicates that the variance is 0.000 for this variable. Is this possible or is the data file incorrect? Many thanks!
This cannot be answered without seeing your input, data, output, and license number at firstname.lastname@example.org.
anonymous posted on Monday, March 02, 2009 - 7:15 am
Dr. Muthen, Many thanks. I will send you the input, data, and output.
I know that Mplus does not generate graphs when the type=random analysis is used to account for individually-varying times of observation. I'm guessing this is also true if type=random is used in the LGMM framework, correct?
I'm using the ECLS-B database from NCES. There are about 12 weights to be applied to various variable sets. I think I've identified the correct weight for my variables and should receive confirmation from NCES soon.
However, even though the output in SPSS shows I have a variable weighted, my Mplus output shows that only something like 51 cases were analyzed. There are missing weight values for some cases, predictably. And yet, I'm told I cannot impute any value, neither a 1 nor a 0 for example, in Mplus.
Do you have any transformation or filtering suggestions, so that I can do a CFA with a larger sample?
In my understanding, missing is not allowed for a weight variable. You need to check with NCES to determine which weight variable you should use. It should not have missing.
anonymous posted on Wednesday, March 04, 2009 - 11:45 am
Hello Dr. Muthen, Regarding your response to my query concerning how to approach missing data (Bengt O. Muthen posted on Thursday, February 26, 2009 - 6:53 pm), what are the advantages and disadvantages of taking a long vs. wide approach? I have more familiarity with the wide approach and would prefer it, but will certainly consider the long approach if it has definite advantages for handling missing data. Also, a few more questions: 1) Can the long approach be used in conjunction with a LGMM? 2) Is it possible to graph the classes of trajectories with an HLM that uses the type=random option? Thanks very much!
I think the wide approach is generally preferable, but not always. For example, you can allow the residual variances for the outcomes to vary across time. But the wide approach has to use the max number of observations per subject which may lead to a long observation vector (very wide). And with individually-varying times of observation having a different residual variance for each time point makes for many parameters. Furthermore, some time points may not have variation in the outcome if the missingness is extreme.
1) Yes. In this case the latent class variable is a between-level variable (see UG for examples).
2) I think so; try it.
anonymous posted on Thursday, March 05, 2009 - 7:19 am
Hi Dr. Muthen, Yes, using the wide approach, I've found that some time points do not have variation in the outcome b/c the missingness is extreme. I was thinking of simply only including time points where the covariance coverage equals or exceeds .60 (although I have no reference to justify this approach). 1. Does this seem to be an acceptable solution? 2. Do you know of any references that recommend such a covariance coverage? thanks!
You can manipulate the data to fit better with the wide approach by deleting time points or combining them with adjacent timepoints, but such manipulation does not seem right. Given what you see, I would instead take the long approach. You may find the DATA WIDETOLONG option helpful.
Just wondering if it's possible in any way to run MLM estimation when having missing data. I have noticed that MLM requires the raw data (so it must be a FIML type estimation) so even if I feed the model with a covariance matrix it won't work.
anonymous posted on Thursday, March 12, 2009 - 12:46 pm
Hello Dr. Muthen, I've attempted to transform the data from wide to long per your suggestion (Bengt O. Muthen posted on Thursday, March 05, 2009 - 8:13 am). Thankfully, the model ran! However, I have a few questions regarding interpretation: 1) How might I obtain a graph of the LGM trajectory? When I attempt to view the individually-fitted curves, only two data points are plotted on the y-axis. 2) If I am using the long option, I no longer need TSCORES correct? 3) How might I compare LGM models? When using the wide approach, I've conducted chi-square diff tests for nested models (intercept vs. intercept + slope vs. Intercept + slope + quadratic slope), but I am not certain how to do this using the log likelihood test. Many thanks!
In the newer versions of Mplus, TYPE = MISSING is the default, where missing cases are handled under the Missing at Random (MAR) assumption using Full-Information Maximum Likelihood (FIML). You may also specify models with listwise deletion through LISTWISE=ON in the DATA-command. More information is provided in the User's Guide, pp. 7-8.
I've received this comment from a reviewer, regarding a confirmatory factor analytic study:
"Were missing data patterns missing at random (this can be done in Mplus by specifying a mixture analysis and using only a single class latent variable, using the %OVERALL% syntax at the beginning of the model statement and declaring the outcome variable as categorical variables)."
I don't understand what they are suggesting, and even if I did understand, I don't see how any test could tell if the data were MAR/MCAR vs MNAR.
I have Mplus version 5. I am running a path analysis and I understand that the default is to estimate the model under missing data theory. How can I turn off this option? I just want to use complete case analysis in order to compare my results with another package. Thank you
This option came out with Version 5. Perhaps you are using an older version where listwise deletion is the default. If not you need to send your full output and license number to email@example.com.
A reviewer of one of my manuscripts requested that I report how Mplus handles missing data. I have a complex structural equation model (see below). I used the WLSMV estimator and MISSING = ALL (999). The outcome variable is categorical (1=relapse, 0=abstinent) and no subjects are missing on this variable. However, some subjects have missing data on some of the other observed variables. For instance, some subjects do not have data for c1-c4 (each of the observed variables that make up the crave latent variable). Is my description of what Mplus does in this situation correct? Syntax for the model is below.
“Intent to treat abstinence was the dependent variable in the current study. Thus, none of the participants were missing on the dependent variable (i.e., missing were counted as relapse). However, some participants did not complete all of the study measures. Mplus handles these missing values by estimating them using the other variables in the model.”
MODEL: SES by s1 s2 s3 s4; Neigh by h1-h4; Support by i1-i3; NA by n1-n4; agency by a1-a5; Crave by c1-c4; neigh on ses; support on ses neigh; NA on neigh support crave; agency on crave na; w4itt on agency ses;
Factor indicators are dependent variables. For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes. When there are no covariates in the model, this is analogous to pairwise present analysis.
I think that what you are saying is that all of your dependent variables are continuous except abstinence. If this is the case, I would use maximum likelihood estimation. Maximum likelihood estimation under MCAR (missing completely at random) and MAR (missing at random; Little & Rubin, 2002) is available for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), and count outcomes, or combinations of these variable types. MAR means that missingness can be a function of observed covariates and observed outcomes.
I am interested in receiving your suggestions for analyzing my data using Mplus. The data come from an intervention study for couples transitioning to marriage. 18 couples completed pre-test data and 12 couples completed post-test data. I am interested in assessing change in couples' attachment, affect regulation, empathy, and trust (continuous variables) following intervention. This was a very preliminary study, which is why I want to use the FIML capabilities of Mplus to keep the sample size higher than 12.
The standard approach to analyzing longitudinal data is to use FIML under the "MAR" assumption (see the missing data literature). This means that you use all available data: 18 couples at time 1 and 12 couples at time 2. I assume that the 12 couples are a subset of the 18. Couple, not individual, is the unit for which independence of observations is assumed to hold, so the sample size is 18. Because of this, note that with only 2 timepoints the sample size of 18 is quite low and does not allow the estimation of a model with many variables and parameters.
Newcomer posted on Tuesday, July 28, 2009 - 8:44 am
Hi, I am using Mplus to run linear regression and just wonder if Mplus can save the adjusted means (predicted values from a multiple regression). If so, what is the command?
You cannot obtain these automatically. You would need to use the DEFINE command in a subsequent analysis to obtain them.
Newcomer posted on Tuesday, July 28, 2009 - 2:00 pm
Thanks so much Linda! Just to confirm--so, I will need to plug in the regression coefficients in the equation to calculate the predicted values using the DEFINE command, and then use the SAVEDATA command to save it, right?
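As a rough illustration of the calculation being described here, a predicted value is the intercept plus the sum of each slope times the case's covariate value. The sketch below is written in Python rather than Mplus; the coefficient values and the function name are made up for illustration and are not output from any real analysis.

```python
def predicted_value(intercept, slopes, covariates):
    """Return intercept + sum(slope_k * x_k) for a single case."""
    if len(slopes) != len(covariates):
        raise ValueError("need exactly one covariate value per slope")
    return intercept + sum(b * x for b, x in zip(slopes, covariates))


# Hypothetical estimates from a regression of y on x1 and x2.
intercept = 2.0
slopes = [0.5, -1.0]

# Two hypothetical cases, each a list of [x1, x2].
cases = [[1.0, 0.0], [4.0, 2.0]]
predictions = [predicted_value(intercept, slopes, c) for c in cases]
print(predictions)
```

The same arithmetic is what one would type into a DEFINE statement by hand, using the estimates from the first run.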
In a previous post the following was stated by another user:
"In the newer versions of Mplus, TYPE = MISSING is the default, where missing cases are handled under the Missing at Random (MAR) assumption using Full-Information Maximum Likelihood (FIML)."
And then was followed up with a statement by Bengt:
"With ML estimators all available data are used, using 'MAR'"
I have seen this sort of wording, "all available data are used", by Drs. Muthen in regard to missing data in several places, but I have not seen either of them directly state that when using TYPE=MISSING, FIML is being employed.
Is it fair to say that when you specify TYPE=MISSING (which is now the default) Mplus is using FIML?
"FIML" is used in some literature to mean full-information maximum-likelihood estimation (most often with continuous outcomes, but that is not necessary) and with missing data the "MAR" assumption of missing data theory is utilized. (As an aside, I think the "full-information" part is superfluous because maximum-likelihood estimation uses full information; to me it is not a good idea to add unnecessary acronyms beyond those in mainstream statistics.) Mplus uses ML to refer to maximum-likelihood estimation. ML under MAR is therefore the same as "FIML" and uses all available data.
So, TYPE=MISSING together with ESTIMATOR=ML gives "FIML". TYPE=MISSING together with ESTIMATOR=WLSMV, however, does not use MAR but a less flexible assumption detailed in the UG.
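To make "uses all available data" concrete, here is a minimal Python sketch of a casewise log-likelihood for two normally distributed outcomes. It assumes a bivariate normal model; a case with both values contributes the bivariate density, and a case with one value contributes the univariate marginal density of the observed variable. The function names and data are hypothetical, and this is not a description of Mplus internals.

```python
import math


def normal_logpdf(y, mean, var):
    """Log density of a univariate normal."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def bivariate_logpdf(y1, y2, m1, m2, v1, v2, c12):
    """Log density of a bivariate normal with covariance c12."""
    det = v1 * v2 - c12 ** 2
    d1, d2 = y1 - m1, y2 - m2
    quad = (v2 * d1 ** 2 - 2 * c12 * d1 * d2 + v1 * d2 ** 2) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)


def case_loglik(y1, y2, m1, m2, v1, v2, c12):
    """Log-likelihood contribution of one case; None marks a missing value."""
    if y1 is None and y2 is None:
        return 0.0  # nothing observed, so nothing is contributed
    if y2 is None:
        return normal_logpdf(y1, m1, v1)
    if y1 is None:
        return normal_logpdf(y2, m2, v2)
    return bivariate_logpdf(y1, y2, m1, m2, v1, v2, c12)


# Hypothetical data: the second case is missing y2, the third is missing y1.
data = [(1.0, 2.0), (0.5, None), (None, 1.5)]
params = dict(m1=0.0, m2=1.0, v1=1.0, v2=2.0, c12=0.5)
total_loglik = sum(case_loglik(y1, y2, **params) for y1, y2 in data)
print(total_loglik)
```

Maximizing this sum over the parameters is what ML under MAR amounts to in this small example; no values are filled in for the missing entries.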
I am running growth models with a lot of missing data. I need to compare nested models. However, my understanding is that it is not appropriate to use traditional chi-square difference tests to compare the fit of the nested models when modeling missing data due to the approximated chi-square values. Further, the manual states that the DIFFTEST option can only be used with MLMV or WLSMV estimators, yet I am using the default (ML) estimator. What is the most appropriate way to compare the relative fit of the nested models in this case? Should I be changing my estimator, or using some other approach?
The presence of missing data should not be an issue with difference testing. It is only the estimator that dictates the type of difference testing. ML uses a simple difference in chi-square. MLR requires the use of a scaling correction factor. Estimators ending in MV can use the DIFFTEST option.
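The two kinds of difference test mentioned here can be sketched in a few lines of Python. The first function is the plain ML chi-square difference. The second is the scaled difference based on log-likelihoods and scaling correction factors, using the usual formula cd = (p0*c0 - p1*c1)/(p0 - p1) and TRd = -2*(L0 - L1)/cd, where 0 is the nested model and 1 the comparison model. All input numbers below are hypothetical.

```python
def ml_chisq_difference(chisq_nested, df_nested, chisq_comparison, df_comparison):
    """Plain ML difference test: difference in chi-square and in df."""
    return chisq_nested - chisq_comparison, df_nested - df_comparison


def scaled_loglik_difference(ll_nested, scf_nested, npar_nested,
                             ll_comparison, scf_comparison, npar_comparison):
    """Scaled difference test from log-likelihoods (e.g. with MLR).

    scf_* are the scaling correction factors and npar_* the number of
    free parameters of each model.
    """
    cd = ((npar_nested * scf_nested - npar_comparison * scf_comparison)
          / (npar_nested - npar_comparison))
    trd = -2.0 * (ll_nested - ll_comparison) / cd
    return trd, npar_comparison - npar_nested


# Hypothetical ML results for two nested growth models.
diff, diff_df = ml_chisq_difference(95.2, 12, 80.1, 9)

# Hypothetical MLR results: nested model, then comparison model.
trd, trd_df = scaled_loglik_difference(-2606.0, 1.450, 39, -2583.0, 1.546, 47)
print(round(diff, 1), diff_df, round(trd, 3), trd_df)
```

Either statistic is then referred to a chi-square distribution with the returned degrees of freedom.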
Thank you for your response. I realize that ML chi-square is typically the difference in the chi-square values (with the difference in the degrees of freedom as the df). However, when I estimate these models using regular ML estimation, the df change between samples. For example, if I run a model with sample A, and then run the exact same code with sample B, my chi-square df changes from 70 to 71, respectively. This implies to me that regular chi-square difference testing might not be ok. Am I totally off base?
If the model is the same, changing the sample does not change the degrees of freedom with ML. If you send the two outputs and your license number to firstname.lastname@example.org, I will find the explanation of this difference.
Both models estimate the same number of parameters. The difference in degrees of freedom is due to a different number of parameters in the unrestricted models due to different patterns of missing data in the two samples.
Do you deem it necessary to conduct an analysis of sample selectivity when FIML is used? I thought of comparing those with full data with those having at least one missing value on the main study variables of interest. However, I'm not sure whether this analysis is theoretically needed because FIML uses all data available to estimate the model. If there are differences between the two groups, is MAR violated?
MAR is not necessarily violated - the missingness can still be predicted by the variables that are observed. You cannot test if MAR holds. Although of interest in itself, your comparison can only reject MCAR. So unless you try to get into NMAR (not missing at random) modeling, you might just as well go ahead with ML under MAR (i.e. what is often called FIML).
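The comparison discussed here can be sketched as follows: split the cases by whether they are missing on the study variables, then compare the two groups on a variable observed for everyone. The Python sketch below uses a Welch t statistic; the variable names and data are hypothetical. A large difference is evidence against MCAR only and says nothing about MAR versus NMAR.

```python
import math
from statistics import mean, variance


def split_by_missingness(rows, observed_var, other_vars):
    """Return values of observed_var for complete and incomplete cases."""
    complete, incomplete = [], []
    for row in rows:
        has_missing = any(row[v] is None for v in other_vars)
        (incomplete if has_missing else complete).append(row[observed_var])
    return complete, incomplete


def welch_t(group_a, group_b):
    """Welch t statistic for a difference in means (unequal variances)."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical data: age is fully observed, y is missing for older cases.
rows = [
    {"age": 10, "y": 1.0}, {"age": 12, "y": 2.0}, {"age": 14, "y": 1.5},
    {"age": 20, "y": None}, {"age": 22, "y": None}, {"age": 24, "y": None},
]
complete, incomplete = split_by_missingness(rows, "age", ["y"])
t_stat = welch_t(complete, incomplete)
print(round(t_stat, 3))
```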
I'm trying to understand how missingness in x variables is handled in Mplus. I have tried the simplest case with two continuous variables based on a sample size N=415 with X missing 22 cases while Y is missing only 1 case. If I regress y on x I get a message that 1 case is missing on all variables (N = 414). I had expected a message indicating that the analysis would be based on 393 cases (415 - 22). Is the analysis based on 414 cases or on 393 cases (i.e., is it listwise deletion, or are the cases with missing Xs somehow adjusted for missingness rather than omitted from the analysis)? I tried to find information on this and don't understand one of your statements that "Covariate missingness can be modeled if the covariates are explicitly brought into the model and given a distributional assumption." Have I done this in my example? Thank you.
Your analysis uses TYPE=GENERAL with continuous outcomes. In this special case, there is no difference between estimating the model conditioned on the x variables or treating the x variables as y variables. This is why the 22 cases are not deleted from the analysis. In other cases, it does make a difference how the x variables are treated and cases with missing on x are deleted unless they are explicitly brought into the model by, for example, mentioning their variances in the MODEL command. In this case, they are treated as y variables and distributional assumptions are made about them.
I am doing a longitudinal study of 1000 children followed at four time points to assess language and literacy growth. Since this study is still ongoing, there are some children that have not been assessed at time 3 and time 4 yet. In one of my papers I'm focusing on time 2-time 4, doing SEM, to examine how various language skills are related to later literacy development. I'm not very familiar with missing data, but in my data I have some missing values due to the fact that some children have not been assessed yet. What type of missing is this, and how do I handle it?
I want to compare a measurement model obtained from a complete sample (N=1041) with the same measurement model obtained by multiple imputation using the same data with approximately 30% planned missingness MCAR. I want to see if the MI approach gets close to the original measurement model in a real data set. The items are scaled on 7-point Likert scales.
I have managed to run the measurement model using both methods and the models look similar but I wondered whether the data could be combined in one measurement invariance type analysis (multigroup?).
Is this possible?
For the MI analysis I used a .dat file with the names of the 30 imputed datafiles.
The complete-data sample and the MI samples are not independent so multigroup analysis would not be correct.
What you could do is to divide the sample into groups that have different planned missingness (variables for which everyone has data plus variables that some have data for) and then do a multigroup analysis where you can test invariance over model parts that are in common for the different groups. So this would not use MI.
Holly Burke posted on Thursday, January 14, 2010 - 2:03 pm
I was wondering how Mplus handles missing data with WLSM in categorical factor analysis?
I thought Mplus handled missing data using maximum likelihood, but when I run the following analysis code: TYPE = COMPLEX EFA 1 5 MISSING; the output says the program used the WLSM estimator so how could the program also be using the ML estimator?
We have collected student self-report data at seven time points and are interested in doing MM, which may lead into GMM or LGM, depending on the results of the MM. However, we have missing data (total n = 1434; listwise n = ~1271). We have determined that the missing data are not MCAR, and for now are treating them as MAR (will eventually do MNAR models, but are starting with MAR). We would like to do FIML.
My question is two-part:
1. Is the following syntax FIML?
TITLE: 2ClassA means free and var free but fix equal 0 covars DATA: File is 'EffortMM99.dat'; VARIABLE: names are id eff1 eff2 eff3 eff4 eff5 eff6 eff7; Usevariables are eff1 eff2 eff3 eff4 eff5 eff6 eff7; missing are all (99); classes=c(2); ANALYSIS: type=mixture missing; estimator = MLR; starts 500 500; MODEL: %overall% eff1 eff2 eff3 eff4 eff5 eff6 eff7; OUTPUT: TECH1;
2. We have quite a number of external covariates, which we are hoping to use with the auxiliary command. However, some of the external covariate data are missing, as well. Can we use these data with FIML and the auxiliary command? Or, what is your recommendation?
Thank you for any advice that you can offer! It is much appreciated.
By MI I assume you mean Multiple Imputation. I don't know about MI software with a mixture (I assume your MM notation means mixture modeling), but perhaps you mean doing MI for subjects grouped by most likely class, which might be an alright approximation. But perhaps you could simply do MI for the external covariates without involving mixtures.
If your substantive model can reasonably be extended to include those multiply imputed external covariates among your other covariates, that might be the most straightforward approach. Otherwise, you can include the externals as auxiliaries, either with them imputed or with their missingness.
I have come across some problems running my measurement model. I would like to run the Theory of Planned Behavior on a dataset containing 2,000 participants. My file consists of 30 observed variables that load on 5 factors: intention (3 obs. var); pros (9 obs. var); cons (7 obs. var); self-efficacy (9 obs. var); and social influence (2 obs. var).
Model: intenT0 by inten1t0-inten3t0; prost0 by pros1t0-pros9t0; cont0 by con1t0-con7t0; EEt0 by EE1t0- EE9t0; SIt0 by SSt0 SMt0;
No model results are shown (at least only the estimate is shown without s.e., p-values, MI etcetera) and I receive the following text: MAXIMUM LOG-LIKELIHOOD VALUE FOR THE UNRESTRICTED (H1) MODEL IS -63112.478 NO CONVERGENCE. NUMBER OF ITERATIONS EXCEEDED.
I have already tried to increase the number of iterations but this didn't help. Can the high number of missing values explain this error (the number of missing data patterns is 133)? If yes, how can I solve this? If not, do you have another suggestion that explains this error? Thank you very much for your help. Best wishes!
I am wondering the best way to handle a standard CFA with dichotomous indicators where 1 indicator has missing data for all members of a dichotomous covariate? I get the following error when I run the model:
THE WEIGHT MATRIX PART OF VARIABLE AMEN IS NON-INVERTIBLE. THIS MAY BE DUE TO ONE OR MORE CATEGORIES HAVING TOO FEW OBSERVATIONS. CHECK YOUR DATA AND/OR COLLAPSE THE CATEGORIES FOR THIS VARIABLE. PROBLEM INVOLVING THE REGRESSION OF AMEN ON GENDER. THE PROBLEM MAY BE CAUSED BY AN EMPTY CELL IN THE BIVARIATE TABLE.
To be sure we understand the model, please send your full output and license number to email@example.com.
Brian Hall posted on Friday, March 26, 2010 - 1:27 pm
Dear Dr. Muthen, A quick question: I'm using the MLR estimator for CFA analyses. I have opted to use multiple imputation in order to test CFA models separately in multiple waves of data (sample size precludes normal temporal invariance investigation using FIML).
Given the robust estimation, I am concerned that MPLUS is not providing a scaled correction factor in the imputed results. Is this a valid concern? Do I need to compute the scaling factor? and if so, how? Thanks in advance, Brian
I'm not sure that using multiple imputation rather than FIML helps with a small sample size. You can test for invariance over time without looking at each time point separately. See the Topic 4 course handout starting with Slide 78 where multiple indicator growth is shown. The first steps test for measurement invariance.
If you are using TYPE=IMPUTATION and MLR, you will obtain an average MLR chi-square and standard deviation over imputations. These chi-square values have been corrected using the scaling correction factor. How to use a scaling correction factor with multiple imputation is a research question.
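For parameter estimates and standard errors (as opposed to the chi-square, which is the open question above), the usual way to combine results over imputed data sets is Rubin's rules: average the estimates, and add the average squared standard error to (1 + 1/m) times the between-imputation variance of the estimates. The Python sketch below shows that arithmetic with made-up numbers; it is not a statement about how any chi-square or scaling correction should be pooled.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Pool one parameter over m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                       # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Hypothetical factor loading estimated in three imputed data sets.
estimates = [1.0, 2.0, 3.0]
standard_errors = [0.5, 0.5, 0.5]
pooled, pooled_se = pool_rubin(estimates, standard_errors)
print(pooled, round(pooled_se, 4))
```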
I am running a latent profile analysis with imputed data. I have generated 40 imputations with SAS proc mi, and created an ASCII file containing the names of the 40 data sets as described in the Mplus User’s guide. Although I am able to get the LPA models to converge I am concerned about the range of the indicator estimates across the classes - I have three continuous variables as indicators, all of which have been z-scored. Is it possible that the profiles may well change meaning from imputation to imputation in Mplus? In other words, across the 40 datasets is it necessary to verify that profile 1 always has the same meaning across imputations, as do profile 2, profile 3, etc. How does Mplus handle this?
I have a question about handling missing data in SEM analysis for panel studies. Marini, Olsen, & Rubin (1980) suggest that this method should be used with nested patterns of missing data, that is, every subsequent wave should be a subsample of the previous wave, like this:
   t1 t2 t3
n  1  1  1
n  1  1  0
n  1  0  0
etc...
A few reviews I found, don't clearly make statements on this issue (Enders, 2001; Newman, 2003). For example, what if the pattern missing data is like this:
   t1 t2 t3
n  1  1  1 = all 3 waves complete
n  1  1  0 = t1 and t2; not t3
n  1  0  0 = just t1
n  1  0  1 = t1 and t3; not t2
n  0  1  1 = t2 and t3; not t1
n  0  0  1 = just t3
n  0  1  0 = just t2
My concern is what is recommended for the non-nested cases in the available data: should they be dropped, or kept for the analysis of panel data when using ML?
Another concern is the 'few cases' in the panel paths: is it only a problem of sample size (enough data to estimate the parameters), or is there a relation between the N for the within covariances (panel cases) and the N for the between covariances (cross cases)?
I'll welcome any comments or directions on this issue, thanks in advance!
I think you make a distinction between dropout (monotone missingness) and intermittent missingness. I would think it is ok to make the standard MAR assumption for intermittent missing; perhaps it is even MCAR. You should certainly keep these cases in your analyses. The principle should be to use all available data. MAR for dropout may hold close enough, but for dropout one may want to also investigate other modeling (see for instance my paper under Missing data). But this is more advanced since it means that the missingness is part of what you model.
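The distinction between dropout and intermittent missingness can be checked mechanically from the 1/0 wave patterns. In the Python sketch below, a pattern counts as dropout (monotone) if the case is observed at the first wave and, once missing, never returns; every other incomplete pattern is labelled intermittent. The labels and the function name are my own shorthand, not Mplus terminology.

```python
def classify_pattern(pattern):
    """Classify a tuple of 1 (observed) / 0 (missing) indicators per wave."""
    if all(pattern):
        return "complete"
    if not any(pattern):
        return "all missing"
    first_missing = pattern.index(0)
    # Monotone dropout: observed up to some wave, missing at every later wave.
    if first_missing > 0 and not any(pattern[first_missing:]):
        return "dropout (monotone)"
    return "intermittent"


patterns = [(1, 1, 1), (1, 1, 0), (1, 0, 0), (1, 0, 1),
            (0, 1, 1), (0, 0, 1), (0, 1, 0)]
labels = {p: classify_pattern(p) for p in patterns}
for p, label in labels.items():
    print(p, label)
```

Whatever the label, the advice above still applies: all of these cases stay in the analysis.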
You then bring up coverage which Mplus prints for each outcome and pairs of outcomes. You want both types to be high in a longitudinal model.
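Coverage is simply the proportion of cases with data present: for a single outcome on the diagonal, and for both members of a pair off the diagonal. A minimal Python sketch with hypothetical data, where None marks a missing value:

```python
def covariance_coverage(rows):
    """Proportion of cases with both variable i and variable j observed."""
    n = len(rows)
    p = len(rows[0])
    coverage = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            both = sum(1 for r in rows
                       if r[i] is not None and r[j] is not None)
            coverage[i][j] = both / n
    return coverage


# Hypothetical two-wave data for four cases.
rows = [
    [1.0, 2.0],
    [1.0, None],
    [None, 3.0],
    [2.0, 2.0],
]
coverage = covariance_coverage(rows)
for line in coverage:
    print(line)
```

In a longitudinal model, low off-diagonal values for distant time points mean that the covariances between those time points rest on few cases.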
Dr. Muthen, I'm running a simple model examining one indirect effect with one mediator. I have missing values on all variables (x, m, and y) and mentioned the x variable in the model command with the aim of FIML handling all missing data. However, Mplus is dropping cases that have missing values on all variables. Can FIML not address such cases? When I run the same model in Amos (which from what I understand also uses FIML) it appears to use the entire sample. Can you please explain what is happening here? Thank you.
Thank you. There are other variables not in the model that these cases have values on. Would Mplus stop dropping the cases if I brought these in as auxiliary variables? If so, do I only mention these variables in the auxiliary command or do they also need to be mentioned in the usevariables command?
I have a question about the individual LL values output under the SAVEDATA option. In trying to reproduce individual log-likelihood values from a single-rep simulated dataset under MAR missingness, the values I calculated for a single case (in Proc IML) were slightly different from the value(s) produced in the SAVEDATA output. I was originally using the model-implied means and covariances under H0 to calculate the log-likelihood in Proc IML but then switched to the H1 means and covariances; the H1 sufficient stats seemed to reproduce the proper individual LL values in the output dataset. So a) am I right in my understanding that the LL values under the SAVEDATA command are the H1 LL values and, if so, b) is there any way to also output the individual values under H0?
More for illustrative purposes for case-level discrepancies between H1 and H0 LLs - was thinking about them for a module on FIML for a seminar on missing data. Your point is well taken though, because it is difficult to think of how H1 LLs would be useful in practice when your concern is H0 in real applications.
Just to be sure of myself: when type=missing is specified, what is the default method that Mplus uses to handle missing data? Or is this a function of the model specified? In my case, it is a simple path analysis; all variables are observed.
It is a function of the estimator and variable type not the model.
Mplus provides maximum likelihood estimation under MCAR (missing completely at random), MAR (missing at random), and NMAR (not missing at random) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types (Little & Rubin, 2002). MAR means that missingness can be a function of observed covariates and observed outcomes. For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes. When there are no covariates in the model, this is analogous to pairwise present analysis.
Sarah Ryan posted on Wednesday, October 06, 2010 - 10:53 am
I am working on the prospectus for my dissertation in which I will be using a national (NCES) longitudinal dataset. A number of exogenous variables are categorical, as is the mediating variable (5 levels) and the outcome (7 levels- so could be considered continuous). I have some missingness (data meet MAR)- seldom more than 10% and a good deal less in most cases. I have two questions, and I apologize if these seem horribly basic:
1) Is it best to use the WLSMV estimator?
and, if so,
2) Do I need to employ MI to deal with missingness as a first step? I have had some faculty at a training I attended suggest that it is always better to impute first, even in the SEM framework.
I'm still learning, so I'm having trouble understanding when and why I would or would not be wise to use MI.
With categorical outcomes, you can use either weighted least squares or maximum likelihood estimation. If your model has more than four factors, maximum likelihood would not be feasible because numerical integration is required. If you use maximum likelihood, you can use the default missing data estimation which is asymptotically equivalent to multiple imputation. If you use weighted least squares estimation, I would use multiple imputation because missing data estimation with weighted least squares estimation is not as good as with maximum likelihood.
Alice posted on Thursday, November 04, 2010 - 11:27 am
Question for Mplus discussion board:
I am a new user of Mplus. I am trying to run a latent class analysis. The data file covers two waves of data. The data file has complex sample survey features: stratification, clustering, and weights.
And I need to use subpopulation option. The sample is wave4 sample.
My questions are:
(1) After I limit my sample to the wave 4 sample, my data still have missing values on all variables. Income is a continuous variable and all others are categorical variables. Am I able to use Full Information Maximum Likelihood to deal with missingness? I googled this and found a claim that Mplus FIML is only for continuous variables. Is that true?
(2) Is FIML able to deal with complex survey samples?
(3) Can I use multiple waves to run latent class analysis?
Please see my code in the next message for reference.
1. The default in Mplus is estimating the model using all information using maximum likelihood. A person who has missing values on all variables does not contribute anything so they are deleted.
3. This would be LCGA or GMM. See the Chapter 8 examples in the user's guide.
Alice posted on Friday, November 05, 2010 - 10:55 am
Thanks for the reply, Linda. Although my data are longitudinal (two waves), there are no repeated measures. Wave 1 data are respondents' reports of their parents' socioeconomic variables and wave 4 data are respondents' own socioeconomic variables. I want to use latent class analysis to capture intergenerational mobility, and I want to assign individuals to different classes. For example, I want to classify people into groups such as moving up, staying the same as their parents, or moving down. For this kind of model, can I do a simple latent class analysis (treating the longitudinal data as cross-sectional data) instead of LCGA or GMM?
BTW, using Full Information Maximum Likelihood to deal with missing data, do I need to specify it in the code?
I am trying to analyse clinical + genetic data from a patient cohort as part of my PhD. I have started using LCA (LatentGolD) to identify any underlying latent classes within my data; however, after reading the manuals and a few tutorials, I am still confused as to how to determine the best cluster model. In some places I have noticed they just opt for the lowest BIC, whereas in other places they select the lowest L2 value. Are there any set criteria for selecting the best model?
It is very common to use BIC with mixtures - take a look at
Nylund, K.L., Asparouhov, T., & Muthén, B. (2007). Deciding on the number of classes in latent class analysis and growth mixture modeling. A Monte Carlo simulation study. Structural Equation Modeling, 14, 535-569.
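For readers who want to see the BIC comparison spelled out: BIC = -2*logL + p*ln(n), where p is the number of free parameters and n the sample size, and the model with the lowest value is preferred. The Python sketch below uses hypothetical log-likelihoods for 1-, 2- and 3-class models.

```python
import math


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion; lower is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical results: number of classes -> (log-likelihood, free parameters).
models = {1: (-1500.0, 6), 2: (-1450.0, 10), 3: (-1445.0, 14)}
n_obs = 500

bic_values = {k: bic(ll, p, n_obs) for k, (ll, p) in models.items()}
best = min(bic_values, key=bic_values.get)
for k in sorted(bic_values):
    print(k, round(bic_values[k], 2))
print("lowest BIC:", best, "classes")
```

Here the third class improves the log-likelihood only slightly, so the penalty for its extra parameters makes the 2-class model the one with the lowest BIC.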
FIML has come to mean ML under the MAR assumption of missing data. It is true that the term is typically used with continuous outcomes, but ML under MAR can be used with categorical outcomes as well, not just continuous ones. It is the default in the current Mplus version. You obtain ML under MAR simply by specifying what missing data symbol you have in the data (Missing = in the Variable command). By requesting Patterns in the Output command you will see what missingness there is in your data.
It sounds like you want to do Latent Transition Analysis. This is a Latent Class Analysis at several time points where you can study changes in class membership over time. The User's Guide has several such examples and there are several papers posted on our web site on this topic.
Xi Chen posted on Monday, November 15, 2010 - 2:14 pm
Hi Dr. Muthen, I ran a simple regression in Mplus and SPSS. The number of valid cases in SPSS with listwise deletion is 409, while the number of observations in Mplus is only 290, together with this: *** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 190
I checked the way I read data and I did not find any problem. Looking forward to your suggestions. Thanks!
That message suggests that you did not do listwise deletion in Mplus. That message is related to TYPE=MISSING where missing data theory does not apply to independent variables. To do listwise deletion in Mplus, specify LISTWISE=ON in the DATA command.
Xi Chen posted on Monday, November 15, 2010 - 2:40 pm
Xi Chen posted on Monday, November 15, 2010 - 5:15 pm
It turns out that the total N is 712 but Mplus only used 480 observations. The variables in the model do not have missing data and there are 712 rows in the data file. Is there any way to find out which part of the data is used in Mplus? Thanks!
Xi Chen posted on Monday, November 15, 2010 - 9:22 pm
Hi Dr. Muthen, I have checked the data used in the analysis and the original data. It looks like Mplus deleted some observations for reasons other than missing data (some observations without missing data were also deleted). Is there any reason why Mplus would delete observations from the analysis? Thanks!
Hi, I just upgraded Mplus to 6.1 and when I run an old program the number of subjects is now lower.
I want to estimate a regression model using FIML. But now I receive the following message:
*** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 29 1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
In Version 5, the default changed from listwise deletion to TYPE=MISSING. You can obtain listwise deletion by adding LISTWISE=ON; to the DATA command.
Missing data theory does not apply to observed exogenous covariates. That is why observations with missing on x are excluded. If you want them included, you must mention their variances in the MODEL command. When you do that they are treated as dependent variables and distributional assumptions are made about them.
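The inclusion rule described in this thread can be written down as a small filter. This Python sketch is only a simplified reading of the statements above, not Mplus code: when the model is estimated conditional on x, a case needs complete x values and at least one observed y; when the x variables are brought into the model (for example by mentioning their variances), they are treated like y variables and a case is dropped only if it is missing on everything. The function and argument names are hypothetical.

```python
def included_cases(rows, y_vars, x_vars, x_in_model=False):
    """Return the cases that would be kept under the rule sketched above."""
    kept = []
    for row in rows:
        y_observed = [row[v] is not None for v in y_vars]
        x_observed = [row[v] is not None for v in x_vars]
        if x_in_model:
            # x is treated like a dependent variable: keep any case
            # that has at least one observed value.
            keep = any(y_observed) or any(x_observed)
        else:
            # y conditioned on x: x must be complete, and at least one
            # y must be observed.
            keep = all(x_observed) and any(y_observed)
        if keep:
            kept.append(row)
    return kept


# Hypothetical data with one x and one y; None marks a missing value.
rows = [
    {"x": 1.0, "y": 2.0},
    {"x": None, "y": 2.0},   # missing on x
    {"x": 1.0, "y": None},   # missing on all variables except x
    {"x": None, "y": None},  # missing on all variables
]
n_conditional = len(included_cases(rows, ["y"], ["x"]))
n_joint = len(included_cases(rows, ["y"], ["x"], x_in_model=True))
print(n_conditional, n_joint)
```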
You can't identify whether the data are missing as MAR or NMAR, the two key contenders. For how to approach this dilemma, see the 2 papers on missing data by Enders and Muthen et al. mentioned on our home page, which also show how to do NMAR modeling in Mplus.
Hello, I have a question regarding a discrepancy between the estimated sample statistics produced by the descriptives produced for example 11.2, and the estimated sample statistics provided for a LGM. For the outcome trajectory, the estimated means for the baseline observation are the same, but diverge for later observations. Specifically, the estimated means are substantially lower for the LGM produced SAMPLE STATISTICS: ESTIMATED SAMPLE STATISTICS at later observations than the RESULTS FOR BASIC ANALYSIS: ESTIMATED SAMPLE STATISTICS.
I am a doctoral student attempting to test second-order factors that I will include in a future SEM analysis. I am using complex survey data which require weights. My dataset does have two different versions of the weights – a base weight with a Taylor Series strata and PSU, or replicate weights. When I was proposing this project, I was advised by my mentor to use FIML in Mplus to address missing data. I have been using the replicate weights with bootstrap standard errors, but it appears that FIML is not available with this method of weighting.
Am I correct in my understanding that FIML is not available when using replicate weights with bootstrap standard errors?
In this case, what approach do you recommend? If at all possible, I would like to avoid listwise deletion.
Can FIML be used when WLSMV is specified as the estimator, and is this accomplished by the type = missing command? An earlier reply by Linda to another user's question indicated that "if you use TYPE=MISSING with WLSMV, the missing data technique is pairwise present." Thanks!
Sarah Ryan posted on Wednesday, April 27, 2011 - 3:24 pm
I am running a mediation (mediator is latent continuous) model with four latent factors, and predominantly categorical indicators. The outcome is ordered categorical with six levels. Would it be advisable to treat the outcome as continuous in this case (rather than specifying it as CATEGORICAL) in order to reduce the computational burden of the numerical integration that will be required for this model?
It depends on whether the ordinal variable piles up at either end, that is, has a floor or ceiling effect. If it does, it should be treated as a categorical variable. If not, you are probably safe to treat it as a continuous variable. You can also consider using the WLSMV estimator. If you have categorical factor indicators, each factor is one dimension of integration with maximum likelihood estimation.
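One simple way to look for piling up at either end is to compute the proportion of observed responses in the lowest and highest categories. The Python sketch below does only that; how large a proportion counts as a floor or ceiling effect is a judgment call and no cutoff is implied. The data and function name are hypothetical.

```python
from collections import Counter


def end_category_proportions(values, lowest, highest):
    """Proportion of observed responses in the lowest and highest category."""
    observed = [v for v in values if v is not None]
    counts = Counter(observed)
    n = len(observed)
    return counts[lowest] / n, counts[highest] / n


# Hypothetical six-level ordinal outcome; None marks a missing value.
outcome = [1, 1, 1, 2, 3, 6, None]
floor_prop, ceiling_prop = end_category_proportions(outcome, lowest=1, highest=6)
print(floor_prop, round(ceiling_prop, 3))
```

In this made-up example half of the observed responses sit in the lowest category, which is the kind of pattern that would argue for treating the variable as categorical.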
Sarah Ryan posted on Thursday, April 28, 2011 - 2:26 pm
Okay. If I use the WLSMV estimator with imputed data sets, however, is there a way to test multigroup invariance?
My reading has led me to think that difference testing with imputed data is relatively unexplored.
Also, is a corrected/scaled Wald statistic provided in the output when using Mplus to analyze imputed data sets using WLSMV?
I realize that beginning with v. 5 Mplus uses missing and Type=H1 as the default in model analyses. However, I was curious as to why the same exact model run (same command-line syntax) would indicate different missingness (on the same input data file) between versions 4.1 and 6.0. This was noted when an adviser ran the same analyses on a different version than I am using (the estimates are basically the same, but in the analyses using 4.1, all the data is indicated as being present whereas in V. 6 it indicates that there are 2 cases with data missing on x-variables and 150 cases where data is missing on all variables except x variables). I know from a previous analyses using FIML in v 4.1 that the warning for missing data is worded such that missing data are noted as 'number of cases with missing on all variables'. Is this difference b/c v 5 and higher looks at missingness as a function of x and y variables whereas v 4.1 looks at it with respect to all variables considered simultaneously?
Thanks Linda, read through this. To make sure I am understanding the difference fully would it be correct to say that pre v6.0 cases were only deleted if they were missing values for all variables (i.e., endogenous and exogenous) whereas currently cases are deleted if they are missing either all x vars, all y vars, and/or both considered in total?
Many thanks. One thing I am still having trouble wrapping my head around is that when I conduct parallel analyses in V6, my results are exactly the same as they were in v4. The only difference is that in v6 I get the warning that data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis.
With those cases not included in the analysis, how is it that the model results can still be exactly the same? Does it have something to do with the fact that I have complete data for the x-variable in the model? Bear with me if the answer is straightforward and I am just not seeing it.
It is the case that for maximum likelihood estimation for continuous outcomes with no missing data, the results will be the same if the model is estimated with y and x or with y conditioned on x. It is only in this case that the results will be the same. We changed to y conditioned on x to be in line with the rest of the program and regression in general.
Are there sample size parameters for conducting a pattern mixture growth model? I am writing a data analysis section for a grant in which missing data that is not ignorable is expected. It is a small clinical trial with a total of 60 subjects with equal distribution in two treatment groups (n=30 in each). I know this is very small, but was wondering if testing a model with two growth classes (one with missing one without) would be possible?
That depends on how many time points you have and what the growth shape is, plus parameter values. You have to do a Monte Carlo simulation study to learn about it. 60 may not be too low even for a 2-class model, but with mixtures the answer also depends on the degree of class separation in the growth factor means. See UG chapter 12.
Eric Teman posted on Tuesday, August 16, 2011 - 9:25 pm
When using FIML in Mplus, does Mplus delete cases with missing values on exogenous observed variables by default? And does it use full information for missing values on the endogenous variables?
Version 6.1. See Version History on our web site to find more information about this.
You can easily revert to before v6.1 by mentioning the means or variances of the covariates. But this then makes additional model assumptions, not included in the original model. They are the same assumptions you make with multiple imputation. We made this change to be consistent throughout the program with categorical modeling, mixture modeling, and other cases.
Eric Teman posted on Thursday, August 18, 2011 - 2:38 pm
Hypothetically, let's say a dataset had missing values only on exogenous variables, i.e., the endogenous observed variables are complete. Would employing FIML be identical to using listwise deletion?
Yes, at least the way I define FIML. FIML is helpful when some endogenous variables have some missing values because then missingness is allowed to be a function of some of the other, not missing, variables for the individuals with missing. For instance, in a longitudinal study, the outcome at the first time point may be observed for many persons and this may predict later missingness.
Eric Teman posted on Saturday, August 20, 2011 - 3:03 pm
On p. 458 of the version 6 Mplus manual, it says, "The ASCII files...must be created by the user." I noticed yesterday, though, that I did not need to manually create these. When did Mplus start doing this automatically?
This is done automatically if you impute the data sets using DATA IMPUTATION but not if you impute the data sets outside of Mplus.
Eric Teman posted on Wednesday, August 24, 2011 - 8:05 pm
Hypothetically, in a Monte Carlo simulation study where a design cell contains 1,000 replications of a CFA model where multiple imputation was used (with 10 imputation data sets created per replication), would you simply average all of the parameter estimates and fit statistics (including chi-square) for that one cell across all imputations within replications?
You can do that for the parameter estimates but not for the fit statistics. How to accumulate fit statistic information is unstudied except for ML chi-square. See the most recent Topic 9 course handout for the formula.
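For the parameter estimates, pooling across imputations is commonly done with Rubin's rules: average the estimates, then combine the within-imputation and between-imputation variance. The sketch below is plain Python with hypothetical numbers; it applies to parameter estimates and standard errors only, not to fit statistics.

```python
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and standard errors from at least 2 imputations")
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average squared standard error
    between = variance(estimates)                   # variance of estimates, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total ** 0.5

# Hypothetical results for one loading from 5 imputed data sets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
std_errors = [0.10, 0.10, 0.10, 0.10, 0.10]
pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
```

The pooled standard error is larger than any single-imputation standard error because the between-imputation variance reflects the uncertainty due to the missing values.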
Eric Teman posted on Thursday, August 25, 2011 - 2:39 pm
Is the ML chi-square over 5 imputations, for example, output anywhere for reading into an outside statistics software package? Or will I have to calculate the ML chi-square over the number of imputations in multiple imputation?
Note that the ML chi-square for each imputation - or the average of this over replications - is not a useful measure of fit but misestimates fit quite a bit. See the Topic 9 handout of 6/1/11, slides 212-216 for a study of this. The correct chi-square T_imp is printed.
Eric Teman posted on Thursday, August 25, 2011 - 3:02 pm
But is it printed in the ASCII results file? I can't find it there. I see it in the OUT files but not the ASCII files.
Eric Teman posted on Thursday, August 25, 2011 - 4:31 pm
I am using WLSMV as the estimator. Maybe that is why I'm not getting T_imp. In this case, how should I proceed with calculating the chi-square over imputations?
You are hitting the research frontier - that hasn't been invented yet.
Eric Teman posted on Thursday, August 25, 2011 - 5:46 pm
Would it be reasonable/appropriate to do a Monte Carlo simulation study to see how multiple imputation works with the WLSMV estimator? Mplus is capable of this, right? We just don't know how it will act?
Eric Teman posted on Thursday, August 25, 2011 - 10:26 pm
To be more specific, I mean we don't know how the adjunct and chi-square fit statistics will behave when WLSMV is used, right? Might it be beneficial for a Monte Carlo simulation to be done?
I am sorry if I am not posting in the right place; I could not find an appropriate topic. I keep getting this error message when running my input file.
*** ERROR
The length of the data field exceeds the 40-character limit for free-formatted data.
Error at record #: 1, field #: 32
*** ERROR
The number of observations is 0. Check your data and format statement.
Data file: F:\MATCHEDBYCHNR W1234_mplus.csv
I have saved the spss file as .csv without variable names so everything should be alright.
Furthermore, I was wondering: I am using type=complex and have missing data. Is FIML used automatically? Also, I have indirect effects, and I have found out that bootstrap cannot be used. Is there any other option to know whether the indirect effects are significant? (I've heard about Prodscal but am not sure how that works.)
I am interested in using MPlus with large data sets like TIMSS and NAEP, in particular TIMSS. To do so I have a few questions:
1) Does Mplus allow for sample weight to items? If yes, how?
2) How does Mplus handle missing data in blocks? To explain in detail:
TIMSS is a collection of 12 booklets that is administered to several thousand students. Each student answers only 2 booklets; as a result, if one wants to study the whole data set, one will have blocks of missing data. I was wondering if Mplus can handle such data sets. In TIMSS 2003, 740 students respond to items in blocks 1 and 2, another 740 students respond to blocks 2 and 3, another 740 respond to blocks 3 and 4, and so on. If I stack all these items, there will be blocks of items that are missing. Can Mplus handle such data?
3) I will be using it for latent class analysis and was wondering if I could fix some of the parameters for the purpose of equating these blocks of items.
1) Mplus allows for sampling weights - see the paper (which is also on our web site):
Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation Modeling, 12, 411-434.
I don't know what you mean by "sample weight to items"
2) You have missing by design which can be handled in Mplus in two ways. One is to have as variables (columns in the data) all the variables in all 12 booklets so that all students have missing on most variables. The other is to do a multiple-group analysis where each group of students has its own set of variables (but the same number of variables).
3) Yes, you can hold parameters equal for the purpose of equating.
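The first approach in point 2 (one column per item across all booklets, with a missing value flag for items a student never saw) can be sketched in plain Python as below. The flag value -99, the function name, and the tiny 4-block layout are hypothetical; whatever flag is used would have to be declared as the missing value in the Mplus input.

```python
MISSING = -99  # hypothetical missing value flag

def wide_row(responses_by_block, n_blocks, items_per_block):
    """Build one student's row with every block's items as columns.

    responses_by_block maps a 1-based block number to that block's item responses.
    Blocks the student did not take are filled with the missing value flag.
    """
    row = [MISSING] * (n_blocks * items_per_block)
    for block, responses in responses_by_block.items():
        if len(responses) != items_per_block:
            raise ValueError("wrong number of responses for block %d" % block)
        start = (block - 1) * items_per_block
        row[start:start + items_per_block] = responses
    return row

# A student who took blocks 2 and 3 of a 4-block design with 2 items per block.
student = wide_row({2: [1, 0], 3: [0, 1]}, n_blocks=4, items_per_block=2)
```

Every student then has the same number of columns, and the blocks that were not administered show up only as flagged missing values.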
Thank you for your reply, and sorry for the double posting. Just to clarify, these variables are all categorical, not continuous: basically correct (1) or wrong (0) answers to mathematics questions. Can Mplus work with missing data when about 50% of the data is missing?
I have output for two mediation models now. However, the robust chi square values are not computed. (with MLR AND TYPE=COMPLEX) Also, no standard errors are displayed only the estimates. (i use the STANDARDIZED CINTERVAL command in OUTPUT)
I am receiving the error "THE MINIMUM COVARIANCE COVERAGE WAS NOT FULFILLED FOR ALL GROUPS" when running a structural equation model using complex survey data and WLSMV (n=1,100, with 46 observed variables). There are several latent variables and a number of observed dummy variables as covariates. Everything is being regressed on a dichotomous variable.
When I run the analysis variables through type=BASIC to investigate covariance coverage, it appears that all coverage values are well above 0.9. I wonder what I should be looking for. This is not a multiple group analysis so I wonder if there are other groups that the error message would be referring to? Perhaps it is referring to the clusters in the complex survey data?
Drs. Muthen: I have seen the use of latent coefficients (latent "placeholders," if you will) in longitudinal growth models where there is planned missing data.
Can latent coefficients (or latent "placeholders") also be used in the context of multi-group modeling when not all items from a standard scale were administered to one group?
What about if the item was not administered to either group? Would we be able to use latent placeholders and have Mplus estimate what the factor loadings for those items would have been had they been administered?
Or is this an inappropriate use of the latent coefficients / latent placeholders?
Do you have an example in the Mplus manual, or do you know of an article, that has used latent placeholders in the context of multigroup modeling with planned missing data in the past?
If a study has planned missingness, a missing value flag should be assigned to individuals who did not take the item. Nothing more needs to be done. If no one took an item, it will not be used in the analysis. An item with all missing contributes nothing to the model.
You cannot identify or estimate a loading for an item that was not administered to anyone in the sample because there is no sample information on this loading. You have to have some subjects who have responses on the item so that you know how this item correlates with the other items - and therefore can draw inference about how subjects who didn't take the item might have scored.
In a two-group model you can have an item that is only administered in one of the two groups, but you cannot estimate a group-specific loading for that item in the group that didn't take the item. This is for the same reason as above.
I have not heard of latent coefficients / latent placeholders, so I don't know what that is.
Bengt (or Linda): You mentioned that "in a two-group model you can have an item that is only administered in one of the two groups."
This is what I am doing, and the items are categorical, in a 2-group (gender) CFA model. One such item that was administered to only one group (girls only) is CBCEYEP (measuring eye problems as a somatic symptom).
When I run the analysis, I get the message: *** ERROR Categorical variable CBCEYEP contains less than 2 categories.
However, that is not true. Among girls, to whom the item was administered, all three possible categories were endorsed. I set loadings and thresholds to be equal in both groups, and freed the scaling factor in one group.
Can you tell me from this information what I am doing wrong?
We have extra variables in certain groups (and missing in the other group), and these variables are dependent. Fixing the residual variances to be equal to those in the other group implies Theta parameterization, does it not?
The trick in that FAQ is only for continuous outcomes, not categorical outcomes that you have. Here is another approach you can take.
Assume as an example that you have 10 items that both males and females respond to and assume that each gender also responds to 5 additional items, but they aren't the same for the two genders. So each gender responds to 15 items. Your input should then refer to 15 items in the USEV list and your model should have any equality constraints applied only to the same 10 items that both genders respond to.
You don't want to list 20 items because both groups would then have 5 items to which nobody in the group has responses.
The 15 items are different items for the two groups. For males, it is the set of 15 items that males responded to, for females it is the set of items that females responded to. So you have to arrange your data that way.
Bengt, I am trying to run this as a multi-group model, where parameters for males and estimates for females are estimated simultaneously.
Do I estimate this in three stages? For example, get the estimates for the model with items that females responded to, then get the estimates for the model with items that males responded to, then run the overall model with parameters set at the values derived from the first two models for the items that are missing for a certain gender in the overall model?
I am sorry. I am so confused about this. I think the three-stage strategy may work because Linda mentioned that groups with no data for an item do not contribute to the estimates? However, if I have for example a factor with 8 indicators but two are missing for boys and two are missing for girls, and I try this three-stage strategy, can I assume that including the loadings for these items administered only to one of the two groups would have no effect on the other loadings when I add them in?
What I suggested is a single analysis, not a multi-stage analysis. I suggested a simultaneous, 2-group analysis of males and females. You arrange your data so that say the first 10 columns are the common items and the next 5 columns are the items specific to each gender (so those 5 are different items for the two genders). So for instance if you have one factor f, you say in the Overall part of the model:
f by y1-y15;
You can then apply measurement invariance across gender for the first 10 items. The next 5 items contribute to the measurement of f, although they are different items for the two genders.
This is a standard type of approach when different sets of subjects take different forms of an achievement test. A similar approach is also used with multiple-cohort data.
If you are still unsure of what I am suggesting, you may want to consult with an SEM person on your campus who can sit down with you and talk you through it.
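The data arrangement described above (common items in the first columns, each gender's own items in the columns after them) can be sketched in plain Python. All item names here are hypothetical, and the example uses 3 common and 2 gender-specific items instead of 10 and 5 to keep it short.

```python
def arrange_row(record, common_items, specific_items, group_key="gender"):
    """Return one data row: the common items first, then the items
    that are specific to this person's group."""
    group = record[group_key]
    columns = common_items + specific_items[group]
    return [record[name] for name in columns]

COMMON = ["c1", "c2", "c3"]                               # taken by both genders
SPECIFIC = {"male": ["m1", "m2"], "female": ["f1", "f2"]}  # taken by one gender only

male = {"gender": "male", "c1": 1, "c2": 0, "c3": 2, "m1": 1, "m2": 0}
female = {"gender": "female", "c1": 0, "c2": 2, "c3": 1, "f1": 2, "f2": 2}

male_row = arrange_row(male, COMMON, SPECIFIC)
female_row = arrange_row(female, COMMON, SPECIFIC)
```

Both rows have the same number of columns, but the last two columns hold different items for the two groups, so equality constraints would only make sense for the first three.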
Hello, what does the following message imply about my data, and how can I fix the problem so that the model will run? I don't think one entire group is missing data for these items, so I am not sure why I am getting these messages.
WARNING: THE BIVARIATE TABLE OF VANDA_D AND SKINP_D HAS AN EMPTY CELL.
COMPUTATIONAL PROBLEMS ESTIMATING THE CORRELATION FOR VANDA_D AND SKINP_D.
P.S. I do have several items with low endorsement, such as the items below, which were mentioned in the warning above. Can items with low endorsement lead to the generation of a warning like the messages above? The data are not really missing, so it's a confusing message to receive.
When a bivariate table has an empty cell, this implies a correlation of one which means that only one of the variables should be used in the analysis. Variables that correlate one are not statistically distinguishable. Empty cells can occur for extreme items when sample size is small.
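To see which pair of items triggers such a warning, the bivariate table can be tabulated outside Mplus. The sketch below is plain Python with made-up responses for two dichotomous items; the function name is hypothetical.

```python
from collections import Counter
from itertools import product

def empty_cells(x, y):
    """List the cells of the x-by-y table that contain no cases.
    Pairs with a missing value (None) on either variable are skipped."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    counts = Counter(pairs)
    x_levels = sorted({a for a, _ in pairs})
    y_levels = sorted({b for _, b in pairs})
    return [cell for cell in product(x_levels, y_levels) if counts[cell] == 0]

# Two rarely endorsed items: nobody scores 1 on the first and 0 on the second.
item_a = [0, 0, 0, 0, 0, 0, 1, 1]
item_b = [0, 0, 0, 0, 0, 1, 1, 1]
cells = empty_cells(item_a, item_b)
```

With low endorsement and a small sample, a cell such as (1, 0) can easily end up empty even though no data are missing.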
Amanda Hare posted on Wednesday, January 11, 2012 - 8:28 am
I am trying to run what I thought was a very simple model using version 6.1, predicting wave 2 self esteem (continuous) from sex (categorical), wave 1 self esteem (continuous), and authoritative parenting (continuous). The problem is that I'm getting listwise deletion of all cases with missing on x-variables! Here are the highlights:
This is because missing data theory does not apply to observed exogenous variables. To avoid this, you would need to bring all of the covariates into the model by mentioning their variances in the MODEL command. When you do this, they are treated as dependent variables and distributional assumptions are made about them.
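The case-selection rule as described in this thread (cases with missing on any x-variable are dropped, and so are cases with missing on all variables except the x-variables) can be counted ahead of time. This is a plain-Python sketch of that description with hypothetical variable names, not Mplus's actual implementation.

```python
def classify_cases(rows, x_names, y_names):
    """Count cases by how they would be treated when the model is
    estimated for y conditioned on x. None marks a missing value."""
    summary = {"missing_on_x": 0, "missing_on_all_y": 0, "analyzed": 0}
    for row in rows:
        if any(row[x] is None for x in x_names):
            summary["missing_on_x"] += 1
        elif all(row[y] is None for y in y_names):
            summary["missing_on_all_y"] += 1
        else:
            summary["analyzed"] += 1  # partial missingness on y is still used
    return summary

rows = [
    {"x1": 1.0, "y1": 2.0, "y2": 3.0},
    {"x1": None, "y1": 2.0, "y2": 3.0},   # missing on an x-variable
    {"x1": 1.0, "y1": None, "y2": None},  # missing on all y-variables
    {"x1": 1.0, "y1": None, "y2": 3.0},   # partly missing on y
]
summary = classify_cases(rows, x_names=["x1"], y_names=["y1", "y2"])
```

Comparing such counts with the warnings in the output shows how many cases would be kept if the covariates were brought into the model instead.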
EFried posted on Saturday, January 14, 2012 - 8:51 pm
Dear Dr Muthén!
Data set of N=800, 5 measurement points (MP), first MP has 5%, last MP 40% missings on my one continuous outcome variable. Covariates (6 time invariant, 1 time varying) also have some missings. If I run the whole growth mixture model, MPLUS deletes about 50% of my subjects, which is an insane amount of information I do not want to lose:
"Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 396"
What to do?
(1) Auxiliary isn't possible in type RANDOM or MISSING if I see that correctly.
(2) I watched your videos 3-6, but the parts about missing data confused me more than they helped me ;). I read up on "Diggle Kenward selection Modeling" and "Roy's Model (Pattern Mixture Modeling)" but I don't want to write a paper on missing data and imputation. Other people must struggle with this also. Are there any guidelines I can follow on this?
(3) Chapter 11 of your wonderful manual: "Covariate missingness can be modeled if the covariates are brought into the model and distributional assumptions such as normality are made about them." What does this mean - how do I model covariate missingness exactly?
I would choose number 3. What this means is that in regression the model is estimated conditioned on the covariates and no distributional assumptions are made about them. If you bring them into the model and treat them as dependent variables, distributional assumptions are made about them.
EFried posted on Wednesday, January 18, 2012 - 9:55 pm
Thank you! Is there an example in the v6 manual or in any of the videos for this? Or in one of the papers? I wouldn't quite know how to do this.
Also, I have 5 measurement points, and 8 time invariant and 2 time varying covariates (with 4 measurement points each).
So I probably would have to decide which ones to bring into the model as dependent variable, otherwise the model would become not identified anymore?
Nancy Lewis posted on Tuesday, January 31, 2012 - 10:42 am
I am trying to run a mixed-effects meta-analysis using SEM, with 4 dummy coded moderator variables as fixed effects and a random effect for the intercept. Several studies are missing data on one or more moderator variables.
When I run the model with TYPE=RANDOM, I get a warning indicating that listwise deletion was done and only 11 of my 22 cases were included in the model. However, when I run the model with both the intercept and moderators as fixed, I do not get this warning and all 22 cases are included.
Why is this and what do I need to do to use FIML for the mixed-effects analysis?
katie bee posted on Wednesday, February 08, 2012 - 10:14 am
I am using Mplus Version 6.1 and am using ML w/monte carlo integration, and have been unable to get fit statistics. I thought I read that beginning w/version 3, this would be possible? Do you have any suggestions?
Chi-square and related fit statistics are not available when means, variances, and covariances are not sufficient statistics for model estimation.
Anonymous posted on Friday, February 17, 2012 - 10:14 am
Hi. I’m trying to unpack the defaults in Mplus (5.21) Re: the way it “adjusts” for observed control covariates, across different estimators. I have a model: Latent Y regressed on latent X1 and a set of observed control covariates. I use the MLR estimator. It looks like the default is to give estimates of the covariances between the observed and latent covariates. My understanding is that one doesn’t need to “call in” the covariances amongst the observed covariates, in order to make sure that the fitted regression parameters control for the other variables in the model. If I, say, use numerical integration here instead, it looks like the default is to NOT estimate covariances between the observed and latent covariates. Is it still the case that the regression parameters are adjusted for the effects of the covariates in the model (observed or latent)? I ask b/c I fit the same model –w/ MLR and then w/numerical integration—I get notably different estimates for the effects of my observed covariates. W/ MLR, it looks like it might be adjusting for the other covariates, whereas w/ numerical integration it does not appear to be. If I “call in” the covariances between the latent and observed covariates w/ numerical integration, it looks like the MLR model w/out numerical integration. I suppose it could also be a difference in the way the two models handle missing data (?). Thanks for any thoughts.
gibbon lab posted on Monday, March 05, 2012 - 8:49 am
In one of your old posts (above), you mentioned "For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes." I was just wondering if you have a reference paper for this so that I can read more details. Thanks a lot.
I am new to MPLUS and have version 6.12. I am running a simple CFA with one factor and 21 ordinal outcome variables. But, I have missing data (assuming MAR). I am a little confused as I have read different things in terms of whether the program 'handles' the missing data when missing is indicated. I have gotten the program to run and get fit indices, but am worried that the estimator being used isn't appropriate. (WLSMV). Is this ok?
thanks so much!
TITLE: 1 FACTOR CFA OF COGNITIVE CAPACITY SAFETY TBI
DATA: FILE IS "C:\Users\kathy\Desktop\shepherd_safety_project\ ControlFIle_cg_cc3_3_2012.dat";
VARIABLE: NAMES ARE cc1 cc2 cc3 cc4 cc5 cc6 cc7 cc8 cc9 cc10 cc11 cc12 cc13 cc14 cc15 cc16 cc17 cc18 cc19 cc20 cc21;
CATEGORICAL ARE cc1 cc2 cc3 cc4 cc5 cc6 cc7 cc8 cc9 cc10 cc11 cc12 cc13 cc14 cc15 cc16 cc17 cc18 cc19 cc20 cc21;
MISSING are all (9);
MODEL: F1 BY cc1 cc2 cc3 cc4 cc5 cc6 cc7 cc8 cc9 cc10 cc11 cc12 cc13 cc14 cc15 cc16 cc17 cc18 cc19 cc20 cc21;
OUTPUT: SAMPSTAT; STAND; RESIDUAL; PATTERNS;
SAVEDATA: FILE IS COGCAP_02122012cfa.DAT;
DIFFTEST IS DERIV.DAT;
FORMAT IS F2.0;
PLOT: TYPE IS PLOT2;
I am having trouble reproducing residual variances from the estimated parameters of path analysis model with missing data.
I ran path models with and without missing variables. When I manually recalculated the residual variance using estimated parameters, I could only reproduce mplus residual variances when I have full data. I took out the cases with missing variables and recalculated again but my calculation still did not match with the mplus estimate of residual variance.
Could you please help me understand what is the problem. I built the model as single indicator factor analysis model. Thank you.
I don't see how you can manually calculate the residual variance - it is estimated by ML I assume? You don't say if your variables are continuous or categorical, where in the latter case residual variances are not free parameters.
It is estimated with ML. All variables were continuous.
I just recalculated as the variance of difference between estimated dependent variable Y' and actual Y. Y' was calculated as Y'=intercept+coef*X. X and Y are observed, so I used the parameters (coef and intercept) produced by mplus to reproduce residual variance in a spreadsheet. Please let me know if I have misunderstood. I was able to reproduce the residual variance for full data but not for missing data.
You are assuming that estimated variance of Y equals the sample variance of Y. This is not true for all models. If you look at your missing data run, requesting Residual in the Output command, you will probably see that the difference between estimated and observed variance is not zero.
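The point can be illustrated with the model-implied variance of Y in a single-predictor regression, which is slope squared times the variance of X plus the residual variance. The sketch below is plain Python with made-up numbers; it only shows that the implied variance and the sample variance of the observed Y values need not agree.

```python
from statistics import pvariance

def implied_variance_y(slope, var_x, resid_var):
    """Model-implied variance of Y for the regression Y = intercept + slope * X + e."""
    return slope ** 2 * var_x + resid_var

# Hypothetical estimates from a run with missing data.
model_var = implied_variance_y(slope=0.5, var_x=4.0, resid_var=3.5)

# Sample variance (divisor n) of the Y values that happen to be observed.
y_observed = [2.0, 4.0, 6.0, 8.0]
gap = pvariance(y_observed) - model_var
```

When the gap is not zero, a residual variance recomputed by hand from the observed cases will not match the estimated one.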
Mai Sherif posted on Tuesday, May 08, 2012 - 8:42 am
I am fitting a latent variable model that also includes random effects. Some of the items are binary while the rest are continuous. My data includes missing values as well. The output that I get says that the dimension of numerical integration is 1 and it is actually very fast. I was wondering how the fitting takes place with only one dimension of numerical integration although I have seven latent variables and random effects? Is there a dimension reduction technique used by MPlus?
Mai Sherif posted on Wednesday, May 16, 2012 - 10:20 am
I have 4 u's (random effects) and 3 z's (latent variables) and the output below specifies that the dimension of numerical integration is only 3. I am just wondering why the dimension is only 3 and how the other latent variables are integrated.
u1 with u2-u4 @0;
u2 with u3-u4 @0;
u3 with u4 @0;
u1-u4 with z1-z3 @0;

Number of dependent variables 11
Number of continuous latent variables 7

Integration Specifications
Type STANDARD
Number of integration points 15
Dimensions of numerical integration 3
Adaptive quadrature ON
Link LOGIT
Mai Sherif posted on Wednesday, May 16, 2012 - 10:25 am
Another question I have is about dealing with survival data. Do we have to specify that some indicators are survival indicators? Or is it sufficient to have the survival items set up as in Muthen and Masyn (2005), and then MPlus will automatically model the conditional probabilities of survival (hazard function) rather than just a logistic model?
For discrete-time survival, you do not need to specify that the indicators are survival indicators. You need to arrange the data and specify the model as shown in Example 6.19.
Please limit your posts to one window.
Eric Deemer posted on Wednesday, May 16, 2012 - 11:26 am
Hello, I would like to determine the proportion of missing data in my data set. Under the DATA MISSING command, would I just ask for DESCRIPTIVES for the variables that I named as missing? Are frequencies provided as part of the DESCRIPTIVES output?
I am conducting SEM analysis using data from cross-sectional study. Missing data analysis for some variables showed data are missing not at random (MNAR). I am wondering whether MLR will be able to handle data which are missing not at random? If not, will it help if I impute missing values using SPSS for these variables before I conduct the analysis using MPLUS?
You want to make a distinction between MCAR, MAR, and NMAR (=MNAR). See missing data books, or our Topic 4 teaching. Mplus does MAR by default (often called FIML) and can also do NMAR modeling. Mplus also does multiple imputation. Multiple imputation assumes MAR.
What you are probably seeing is that MCAR does not hold. MAR may still hold. There is no way of knowing if MAR or NMAR holds.
Hi there, I'm doing a latent class growth analysis across 5 time points (baseline, 3, 6, 9, and 12 months) with missing data. In reading various literature, I understand I should use the "auxiliary" function to ensure my data is MAR. So, I've pasted my syntax below to ensure it is correct because my output doesn't seem to change with or without the auxiliary function (note, the sample size is 269 patients).
In response to one of your previous comments "FIML is an estimator and EM is one algorithm for computing FIML estimates. Other algorithms include Quasi-Newton, Fisher Scoring, and Newton-Raphson. Mplus uses the EM algorithm for the unrestricted H1 models and the other algorithms for H0 models. "
Aren't Quasi-Newton, Fisher Scoring and Newton-Raphson mathematical methods for finding a solution of an equation? How are they related to missing data? If using these algorithms for H0 models, is it true that the missing data were not taken into account for H0 models? Thanks.
I think I was trying to make a distinction between estimators and algorithms because it seems like sometimes missing data handling is referred to as using the "EM approach", which mixes apples and oranges. So the answer is Yes to your first question. Any of the algorithms can be used to do ML under MAR, which is often called FIML. So the answer to your second question is No - the use of these algorithms is unrelated to whether missing data is handled or not. You have to look to the assumptions made in the estimation to know how missingness is handled.
gibbon lab posted on Monday, June 25, 2012 - 6:11 pm
In response to one of your old comments "There is no paper that describes this. With WLSMV the dependent variables are looked at in pairs so missing data information cannot be gathered from all variables like in maximum likelihood."
So pairwise likelihood uses information from each pair of the observed endogenous variables. But for those who have only one observed endogenous variable, there are no pairs available. Will those subjects be thrown away when using pairwise likelihood? Thanks.
Auxiliary variables are treated as continuous and should not be specified to be other than that. Using a nominal variable as an auxiliary variable would not work. You may want to create a set of dummy variables.
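Creating the dummy variables is a data preparation step done before the file is read by Mplus. A minimal plain-Python sketch, assuming a hypothetical nominal variable and using the first category in sorted order as the reference:

```python
def dummy_code(values, reference=None):
    """Turn a nominal variable into 0/1 dummy variables, one per
    non-reference category. Missing values (None) stay missing."""
    levels = sorted({v for v in values if v is not None})
    if reference is None:
        reference = levels[0]
    dummies = {}
    for level in levels:
        if level == reference:
            continue
        dummies[level] = [None if v is None else int(v == level) for v in values]
    return dummies

# Hypothetical nominal variable with three categories and one missing value.
region = ["north", "south", "west", None, "south"]
dummies = dummy_code(region)
```

A variable with k categories yields k - 1 dummy variables, each of which could then be listed as a separate auxiliary variable.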
I am having some trouble with coding in upgraded Mplus. My old version of Mplus was Version 4. If I ran a continuous growth curve in 4 I had to specify the TYPE=Missing option but once I did that it would handle both missingness on my observed Y's (from attrition) and missingness on my X variables. Now in the new version (6) TYPE=Missing is the default but my model is "kicking out" anyone with missingness on any X variable. It only used to do that when I modeled a noncontinuous Y variable. Is this some problem in my code or did the default FIML change such that it automatically drops cases with missing on the X variables?
I am trying to run through the steps of factorial invariance and ultimately run from these latent constructs a growth curve over 4 time points (continuous data).
I have imputed my missing data (resulting in 20 data sets) using the latest version of Amelia and created the list.dat file (as is done in example 11.5 of the user guide). However, while I no longer have missing data, I do have some variables (mean scores) that will be ultimately included in my model that are non-normal (not seriously however).
I would like to use an appropriate estimator that will allow me to use the TYPE = imputation command to summarize results and provide me with the information I require to examine the CFI and RMSEA CI to judge my model as I go through the steps of factorial invariance.
From what I can tell with my first attempt, using the MLR estimator (not the MLM as it seems it's not available with TYPE=imputation) I cannot access the CI for the RMSEA (as it is provided if using the ML estimator with TYPE = imputation).
My question is how robust the ML estimator is with nonnormal data and if it would be appropriate to use this (knowing I have some nonnormality) so that I can get the fit information I require using the TYPE= imputation command.
The only fit statistic that has been developed for multiple imputation is chi-square for ML. For the others, averages are given. I would run ML and MLR and see how different the standard errors are. If they are not that different, it would indicate that your variables are not that non-normal. I would then use ML.
I have 13 groups, and am testing a five item CFA (but 150 items in the data set). The sample sizes shown in the "Number of observations" section of the result are 20 to 30% less than the real sample sizes. The only explanation that I can think of is that listwise deletion has been applied to all items mentioned in the "names are" part. Can you think of any other explanation? If not, I will talk to my supervisor to send the files. Thanks.
Look at the warning messages that are printed for possible reasons. Check that you are reading the data correctly. You may have blanks in the data set that are not allowed with free format. Check that the number of variable names in the NAMES list is the same as the number of columns in the data set.
Hi, thanks. I did not know that blanks are not allowed with free format. It worked out.
Bogdan Voicu posted on Tuesday, September 18, 2012 - 8:09 am
I run a TWOLEVEL model. N is 72418. MPlus6.12 drops 42258 cases due to missingness on the x-variables. I have used SPSS to check for the total number of cases with at least a missing value and it is two times lower: 20765.
If I run a model with no predictor (just the dependent variable), there is no difference in the number of dropped cases reported by MPlus as compared to the one that I compute in SPSS. The more predictors I add, the higher the loss of cases when using MPlus (as compared to the value that I compute in SPSS).
Since I am not very experienced with MPlus, it is probably something that I miss, but I have no clue what this should be. Any suggestion would be more than welcome!
I have data that is neither MAR nor missing by design - we are measuring sexual dysfunctions as part of our analysis, and people that did not engage in sexual activity were unable to answer many of the questions, so have system missing values. We are using mixture models to analyse the relationships between sexual dysfunctions, depression and anxiety disorders.
I was wondering whether using FIML would be an acceptable way to deal with these cases, or if you have any other advice?
I tried to fit a multilevel regression model with missing data on the Y variable. I want to explore whether having an out-group friend (level-1 predictor, dichotomous) influences attitudes toward the out-group. The Y variable (attitude) is continuous and is MCAR. Here is my syntax:
Variable: Names are ID School Friend Attitude;
USEV = School Friend Attitude;
WITHIN = Friend;
MISSING are Attitude (99);
CLUSTER = School;
Analysis: TYPE = TWOLEVEL RANDOM MISSING;
MODEL: %WITHIN% sfriend | Attitude on Friend;
%BETWEEN% Attitude sFriend; Attitude with sfriend;
I got the warning "Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis."
I don't know why Mplus used list-wise deletion to deal with missingness on Y variable? How can I use FIML instead?
I am preparing data for an MSEM path model that will be examined in Mplus for my dissertation. The extent of missing data is less than 5%. I am considering two options for handling the missing data.
1) Impute missing data using EM before running the model. This would allow me to retain available data for summary scores.
2) Run the model in Mplus using FIML estimation of missing data. This is a more accurate estimation but would be based on less information, because the summary scores would be missing for any case missing data on any item in the measure. The extent of missing data would also be greater because of this handling of the missing data.
What might be the advantages and disadvantages of using EM vs. FIML in this instance?
Imputation and FIML should give quite similar results. Typically, if you can do FIML that gives you more options for various tests. Note that Mplus does imputation. I don't see how there would be a greater extent of missing data with FIML than imputation.
Regarding the summary scores, why not use the average value of the items that are not missing. The imputed values don't carry new information anyway.
Hicham Raïq posted on Monday, January 21, 2013 - 10:06 am
I am working on SEM with categorical variables. In the output, I get this warning: Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 105
The number of observations is 2388.
What method can you suggest for dealing with the missing data: listwise deletion or pairwise deletion? Some authors propose maximum likelihood estimation for incomplete data, but this option doesn't work in my case because my analysis is type=complex.
Below is from a post of yours dated December 15, 2011 - 2:21 pm:
"If a study has planned missingness, a missing value flag should be assigned to individuals who did not take the item. Nothing more needs to be done."
I am working on data with skip patterns such that there are some variables that are responded by only a subset of participants. I have two questions:
1) When using multiple imputation, can I impute such that only those who should have responded but refused are imputed? I don't want values for those who are valid skips. If I set valid skips to missing, they will be imputed. If I do not set to missing but assign them values, the program will use those values as valid responses while imputing missing data. What should I do?
2) The model I am working on includes these variables that are responded by only a sub-sample. When modeling these variables, is there anything that needs to be done? Or do I just use them just like any other variable in the model?
Thank you in advance!
Jenny L. posted on Friday, May 31, 2013 - 10:56 pm
If the missing data are not specified in the command, how would they be treated by Mplus? In my data set, my missing data were blank; I forgot to specify missing is blank in the first place, but still got an output with no error message. I'm curious how those missing data were treated.
Thank you in advance for your help.
Jenny L. posted on Friday, May 31, 2013 - 11:12 pm
I should also mention that when missing values were not specified, Mplus still seemed to use all data (i.e., the sample size was the number of all participants I had).
Hi there - I am conducting a longitudinal CFA with one factor, four non-normal continuous indicators and two time points, pre-intervention and post-intervention. I have about 650 cases. I am conducting tests of measurement invariance (partial strict invariance is supported) and the aim of the study is to determine the effect of the intervention on the latent mean (so the reduction in the latent mean from pre to post, which is significant). I am able to run the analyses no problem, but the problem is that about 50% of the post-intervention data is missing from drop out. I would rather use MLR and all cases rather than perform listwise deletion, but I'm not sure what the impact of such a large amount of missing data would have. Any help would be really appreciated.
The Mplus default is MAR using all cases, which is obtained when requesting either ML or MLR. But you are right that 50% attrition is a lot and that means that the results depend to an uncomfortably large extent on the model assumptions, including normality. It can be particularly problematic if the missingness rate is different for the intervention groups. I assume it is not possible to try to find a random sample of those who were lost to follow-up.
The intervention is done online in an open access research setting, so we have no control group just the intervention and no contact with the patient once they drop out. The indicator variables are very much left censored.
I'm gathering you'd suggest using completer only data?
Because your mediator is categorical you have to pay special attention to how to treat the mediator in the modeling. Call it u and let u* be an underlying continuous latent response variable for u. The key question is if u or u* is the predictor (IV) for the distal outcome y. ML uses u which complicates matters. WLSMV uses u*. Bayes can use either. More correct causal effects are obtained as in the paper on our website:
Muthén, B. (2011). Applications of causally defined direct and indirect effects in mediation analysis using SEM in Mplus.
A simple approach is to use WLSMV and include the x variables in the model by mentioning their means or variances. This also enables Model Indirect. You can also do Multiple Imputations as a first step to handle the missingness on the x's.
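A minimal sketch of the WLSMV approach just described, assuming one covariate x, a binary mediator u, and a continuous distal outcome y. The file name, variable names, and missing value flag are hypothetical.

```
DATA:     FILE = mydata.dat;      ! hypothetical file name
VARIABLE: NAMES = x u y;          ! hypothetical: covariate, mediator, outcome
          CATEGORICAL = u;        ! binary mediator
          MISSING = ALL (-999);   ! hypothetical missing value flag
ANALYSIS: ESTIMATOR = WLSMV;
MODEL:    u ON x;
          y ON u x;
          x;                      ! mentioning the variance brings x into the model
MODEL INDIRECT:
          y IND u x;              ! indirect effect of x on y via u
```

With WLSMV the predictor of y is the latent response variable u* rather than the observed u, as noted in the reply.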
This means that income has a variance so large that it will not fit in the space allocated. We recommend keeping the variances of continuous variables between one and ten. You can rescale variables using the DEFINE command by dividing them by a constant so that their variances are between one and ten, for example,
y = y/10;
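As a sketch of the same rescaling applied to income: dividing a variable by a constant c divides its variance by c squared, so if income had a variance of about 4,000,000 (a made-up figure), dividing by 1000 would bring it to about 4. The file name, the other variable names, and the constant are assumptions for illustration.

```
DATA:     FILE = mydata.dat;        ! hypothetical file name
VARIABLE: NAMES = y income;         ! hypothetical variable list
DEFINE:   income = income/1000;     ! variance shrinks by a factor of 1000 squared
MODEL:    y ON income;
```

The coefficient for income is then interpreted per 1000 units of the original scale.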
Eric Deemer posted on Thursday, September 26, 2013 - 6:57 pm
I'm trying to calculate the percentage of missing data in my data set. If I specify "missing = all(-999)", for example, is there a command I can use to determine the frequency with which the value "-999" is observed? Thanks.
The PATTERNS option of the OUTPUT command will show you the patterns of missing data.
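A minimal sketch of an input that requests the patterns, assuming the -999 flag from the question. The file name and variable names are hypothetical.

```
DATA:     FILE = mydata.dat;        ! hypothetical file name
VARIABLE: NAMES = y1-y5;            ! hypothetical variable names
          MISSING = ALL (-999);
ANALYSIS: TYPE = BASIC;
OUTPUT:   PATTERNS;
```

The output lists each missing data pattern with its frequency. The covariance coverage table reports the proportion of data present, and one minus a variable's diagonal entry in that table is its proportion missing.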
una posted on Thursday, November 07, 2013 - 6:18 am
Dear Prof. Muthen, I am running an analysis with MLR. If I understand correctly, the default is that Mplus deletes cases with missing values on exogenous observed variables and uses full information for missing values on the endogenous variables? What are the advantages of using this approach? Do you have a reference to read more about this? Thank you very much in advance,
The advantages are using all available information rather than using listwise deletion. See the Little and Rubin reference in the user's guide.
milan lee posted on Friday, December 06, 2013 - 8:35 am
Hi, I wanted to test the missing pattern of my dataset based on Little’s MCAR test (Schlomer, Bauman, & Card, 2010) using Mplus. I checked out posts on the Mplus forum and it looks like we have to use "type=mixture" to obtain Little's MCAR test and the variables have to be categorical. However, I doubt I understand it well. There has to be a general command for testing MCAR and MNAR for imputation in a general regression model (not a complex mixture model). May I have your advice on how to conduct this MCAR test with a chi-square value in Mplus? What is the command syntax for this test with continuous variables and simple regression models? Thank you very much!
In particular, MAR vs. NMAR testing is conditional on assumptions about the missing data mechanism. So I would say it is somewhat limited (that has nothing to do with which software you use - it comes from the fact that the MAR hypothesis is very, very general - it is hard to test against any ignorable missing data mechanism).
Little's test is not available in the current version of Mplus.
milan lee posted on Friday, December 06, 2013 - 7:03 pm
Thanks a lot for your explanation, Tihomir! Very very informative and helpful!!!
Matteo posted on Wednesday, January 15, 2014 - 9:58 am
Dear MPlus developers, I'm trying to understand the exact algorithm that you use for dealing with missing data through maximum likelihood. Reading classical papers on the topic, I thought that there exists a closed form for the maximization of the full information maximum likelihood problem with missing data only when the outcomes can be considered multivariate normal, while in all the other cases, so for example with categorical outcomes, we need iterative methods like the EM algorithm. Is this what MPlus does? Or am I wrong? Sorry for bothering you, but I didn't find this information anywhere, Thanks in advance.
Matteo posted on Thursday, January 16, 2014 - 3:51 am
Dear Tihomir, thank you very much for your answer! I had already seen Appendix 6, but I didn't find what I was looking for and also I thought it was a little bit out of date, since it starts by saying "Missing Data is allowed for in cases where all y variables are continuous and normally distributed", while I read in the general description of modelling missing data that "MPlus provides ML estimation under MCAR (missing completely at random), MAR (missing at random), and NMAR (not missing at random) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types (Little & Rubin, 2002)." That is also what I'm mostly interested in. I will give a look to the second reference you gave me, Thank you very much again for your answer!
Anonymous posted on Monday, January 20, 2014 - 2:52 pm
In SEM with the criterion variable having categorical indicators with missing values (the predictor variables have continuous indicators with missing values), can I use FIML? Thank you very much in advance!!
Alexis posted on Saturday, February 22, 2014 - 3:27 am
Hi, I’m running a model with ML estimation. By default Mplus deletes cases with missing values on exogenous variables and uses full information for missing values on the endogenous variables. The result is, however, that I’m still missing 33 cases due to missings on five exogenous variables. On this forum, I read that Mplus can handle missings on x-variables if they are brought into the model as y-variables, for instance, by mentioning the variance of the variable in the model command. But I’m not sure what the effect of this is and what I’m exactly doing. Isn’t it a bit artificial? So I’m wondering what the normal/standard procedure is for handle missings. Do you follow the default and accept that you have 33 missings due to missings on x-variables. Or do you do some tricks/adjustments to make your x-variables look like y-variables and consequently no cases are deleted? If the latter option is the standard/best approach, could you please tell me what I precisely should add to my model command given the fact I have five exogenous variables of which three are dichotomous and two are continuous. Is it just as simple as adding “X1 X2 X3 X4 X5;” into the syntax? Thank you very much in advance.
(1) Usually in multiple imputation normality is assumed for the variables with missing. This approach is also taken if you bring the x variables into the model. (2) Or, you can use multiple imputation and specify that a variable is say categorical and then an underlying continuous-normal latent response variable is assumed. Both approach (1) and (2) therefore make assumptions. Treating your binary x's as continuous-normal in approach (1) is only an approximation and taking approach (2) may also be only an approximation. So in both cases you go beyond the assumption of the model you originally specified for the y's as a function of the x's. Approach (1) is probably very often taken. You mention 33 missings, but more relevant is probably the percentage that this corresponds to - if it is small the analysis doesn't rely on assumptions as much as if it is large.
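A sketch of approach (2), assuming the five x variables from the question with x1-x3 binary and x4-x5 continuous. The file names, the outcome name y, the missing value flag, and the number of data sets are hypothetical.

```
DATA:     FILE = mydata.dat;                ! hypothetical file name
VARIABLE: NAMES = y x1-x5;                  ! hypothetical names; x1-x3 binary
          MISSING = ALL (-999);             ! hypothetical missing value flag
DATA IMPUTATION:
          IMPUTE = x1 (c) x2 (c) x3 (c) x4 x5;  ! (c) = impute as categorical
          NDATASETS = 20;
          SAVE = imp*.dat;                  ! imputed data sets plus a list file
ANALYSIS: TYPE = BASIC;
```

The saved list file can then be read in a second run with TYPE = IMPUTATION in the DATA command to estimate the original model.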
I am analyzing some longitudinal data in a cross-lag model (N = 250)and have 30% of subjects missing data at T1, 26% at T2 and 38% at T3. Relatedly, 36% have data at all three time points, 34% at 2 of 3 and, 30% at only 1 of 3.
Based on examining correlations within the sample, these appear to be MAR and we have included correlated variables in the analysis as auxiliary variables to help with estimation and reduce bias.
We submitted the paper and both the handling editor and 1 of the reviewers expressed concern regarding the % missing. Do you know of any references that give guidelines on acceptable levels of missingness? Or do you have a personal rule of thumb? We have cited McArdle et al 2004 & Enders & Bandalos, 2001 on the use of FIML to address missingness. And Enders 2010 on the use of auxiliary variables. Any other suggestions? thank you very much for your ideas/suggestions.
I would be most concerned about how many observations are present at two of the three time points. I would also compare my results to those of listwise deletion. You can also create dummy variables for missing for times two and three and regress them on the time 1 outcome to see if y1 predicts missingness.
I don't know of any discussion of how much missingness is too much. The Enders book is the most likely source.
We are conducting a randomized control trial and we are doing multilevel modeling to determine program effect. According to the What Works Clearinghouse (WWC), when dealing with missing data in our analyses we need to do so separately for the treatment and control groups. Can we use FIML separately for the treatment and control groups?
The other alternative that WWC accepts is multiple imputation but this also needs to be done separately for treatment and control groups. Is there a way to do this within Mplus?
Dr. Muthen, I am running a 3-step GMM model with 5-wave scale scores as my outcome variables and a few independent variables to predict the class membership (e.g., race and LGS scores). Three participants did not indicate their race and one is missing on LGS. To keep them in the step 3 analysis, here is the model syntax: Model: %OVERALL% i s | pl_1@0 pl_2@1 pl_3@2 pl_4@3 pl_5@4; i WITH s; c on AA sex LGS z_t1_c z_t1_n z_t1_e highschool somecoll college; [AA LGS]; ... The analysis did include all the 163 participants. However, the AIC, BIC, and ABIC are much larger than for the model without estimating AA and LGS. I wonder what your advice would be given this situation. Thank you.
I'm running a latent growth model in which the latent intrinsic work rewards variable at each of the six waves was specified by four items, and then intercept and slope were estimated using the six latent variables. Finally, I used the intercept and slope to predict generativity at the final wave along with some control variables.
My concern is that there are two types of missing data here: those who did not participate in a given wave (missing at random) and those who participated but did not answer intrinsic work rewards questions because they were unemployed at the time (missing not at random). I'm fine with having FIML estimate for those who missed the wave, but not comfortable estimating for those who participated in the wave but were unemployed. Would it be possible for you to give me some suggestions on how to restructure my model to account for this missing not at random? Would it be appropriate to add six dummy variables, one for each wave's employment status, when predicting generativity to address the concern of unemployment? Or should I revise in some way the longitudinal CFA model in the earlier step? Thank you very much.
It's a good research question that I don't think I know the answer to. I wonder what it would be like if you use a parallel process growth model where you have one binary part of employed/unemployed and one continuous part with an intrinsic work reward score. The latter is missing when the former is in the unemployed status. Which would mean missing as a function of an observed variable so could be MAR. At least the missing would not only be a function of other intrinsic work reward scores, but directly a function of employment.
I am planning on running path analysis models involving examination of direct and indirect effects on a sample (N = 159) with data at 3 different time points. My variables of interest are scale scores (means of multiple items from questionnaires). The sample has both item-level missingness (1 or more items of a scale missing) and scale-level missingness (entire scale missing) for both predictor and outcome variables and I am seeking advice on the best approach to deal with missingness. I had the following questions:
1) Should item-level missingness be dealt with using multiple imputation as a first step in software outside of Mplus? And should this then be followed by maximum likelihood estimation at the scale-level in Mplus? Or:
2) Should item-level and scale-level missingness be dealt with using multiple imputation outside of Mplus and the imputed "complete" dataset be used for subsequent analyses?
The optimal approach would seem to be to formulate a factor model for the item indicators for each factor and then simply use FIML (so assuming MAR). The practical problem arises from the one-factor models maybe not fitting the data well. Imputation could use a less restrictive model. But then again, the scales you mention probably are sums of items which in itself assumes a one-factor model.
Thanks for your quick reply! The scales mentioned are actually subscales from questionnaires with more than one factor.
Just to clarify, are you suggesting running a full SEM instead of a path analysis model to assist with item-level missingness using FIML? I am concerned doing so would be difficult given my sample size.
Yes. It may be more difficult but it would yield a better analysis.
RSrinivasan posted on Saturday, July 05, 2014 - 11:51 am
Hello Drs. Muthen,
I am a grad student working on a project with missing data in almost all variables. I have 5 latent variables, incl. 2 exogenous variables. I tried the syntax for missing data, but the error keeps asking me to add listwise=on and nochisquare in output. I did that as well, but the error keeps coming back. Please help. I would not want the program to delete a complete data set for a few missing data points. Here is my syntax:
FILE IS "I:\Data\XYZWV.csv" ; LISTWISE=ON;
x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4 w1 w2 w3 w4 v1 v2 v3 v4 gender age marital edu income city pur apppur amtpur;
x by x1-x4; y by y1-y4; z by z1-z4; w by w1-w4; v by v1-v4;
OUTPUT: MODINDICES standardized nochisquare;
*** WARNING in ANALYSIS command Starting with Version 5, TYPE=MISSING is the default for all analyses. To obtain listwise deletion, use LISTWISE=ON in the DATA command. 1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
If you want the model estimated using all available information, remove LISTWISE=ON; from the DATA command. The warning is just informing you that the default is using all available information. If you get an error when you remove LISTWISE=ON, send that output and your license number to firstname.lastname@example.org.
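A sketch of the input with LISTWISE=ON removed, reassembled from the fragments in the post. The command headers, the USEVARIABLES line, and the missing value flag are assumptions, since the post does not show them; only the measurement part of the model appears in the post, so only that part is reproduced.

```
DATA:     FILE IS "I:\Data\XYZWV.csv";   ! LISTWISE=ON; removed
VARIABLE: NAMES ARE x1 x2 x3 x4 y1 y2 y3 y4 z1 z2 z3 z4
                    w1 w2 w3 w4 v1 v2 v3 v4
                    gender age marital edu income city
                    pur apppur amtpur;
          USEVARIABLES ARE x1-v4;        ! assumed: the 20 indicators
          MISSING ARE ALL (-999);        ! assumed flag; not given in the post
MODEL:    x BY x1-x4;
          y BY y1-y4;
          z BY z1-z4;
          w BY w1-w4;
          v BY v1-v4;
OUTPUT:   MODINDICES STANDARDIZED NOCHISQUARE;
```

Without LISTWISE=ON, the default uses all available information, as the warning message states.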
Hello, I would like to conduct a model-based (H1) multiple imputation analysis but my model contains a latent variable interaction. I specified a Bayesian estimator but a message in the output file says that Bayesian estimation is not allowed with latent variable interaction and so the default estimator (ML) was used. However, I also see specifications for Bayesian estimation in the Summary of Analysis section. Will you please tell me whether the resulting imputed data were imputed using a Bayesian estimator or a ML estimator? Also, can I be sure that the imputation was done under my H1 model and not under H0? Thank you.
I am trying to conduct multiple imputation for specified variables prior to analyses for my MA Thesis. The problem I have encountered is that missing data that appear as 999 in my SPSS data file and in my .dat data file, which Mplus is reading, appear as asterisks in all of the imputed files. I checked, and each 999 in my original data sets appears as an asterisk in the imputed files. I conducted a cross-sectional version of the same multiple imputation and subsequent analyses last week, and did not encounter this problem. I copied and pasted the same input file for the current data sets. Could you help to guide me toward my error? My apologies for the basic question.
I don't understand what you mean by "when data are saved."
Likewise, I don't understand what you mean by "when you read the data." Do you mean when I run the input file for the analysis, or when I run the multiple imputation input file? In the latter, I specified MISSING = ALL (999); under VARIABLES.
I made sure that the order of variables in the dat file, VARIABLES list, and IMPUTE VARIABLES list are in the same order.
I just re-did the entire process, starting from my SPSS file. The same asterisks appear in my imputed files. I ran some simple multiple regression analyses to see what would happen. I specified IMPUTATION as TYPE and gave the correct list name to retrieve my imputed files. I got a full set of output, even though there are still asterisks in the imputed files and I did not specify MISSING in the regression input. I got a warning saying that the CHI-2 test could not be conducted, perhaps due to a large amount of missing data. This is the only warning I received despite having the asterisks in the files.
Thank you for any further clarification you can provide!
Eric Deemer posted on Tuesday, September 30, 2014 - 7:23 pm
I'm wondering why Mplus doesn't use cases with missing data on predictors with FIML? From what I've read, it's not possible to use cases with missing data on Y under FIML but still possible (albeit more difficult) to use cases with missing values on X but observed values on Y. Any light you can shed on this would be helpful. Thanks so much.
Missing data theory applies to dependent variables. Missing data theory does not apply to observed exogenous variables because the model is estimated conditioned on these variables. You can mention the variances of the observed exogenous variables in the MODEL command. This causes them to be treated as dependent variables and distributional assumptions are made about them but they will be used in the analysis.
Eric Deemer posted on Wednesday, October 01, 2014 - 8:00 am
Thanks, Linda. That helps a lot.
Briana Chang posted on Tuesday, November 11, 2014 - 10:39 am
Hello Linda and Bengt,
I'm hoping this is a simple question and I'd like to be sure about it before I proceed. I'd like to bring covariates into the model so that FIML is used to handle missing on the covariates. Can you do this for binary covariates with missing values?
You can see this in the frequencies for the missing data patterns. The total sample minus those with no missing would be the number of observations with some missing values. For each variable or pairs of variables, see the coverage values.
Djangou C posted on Sunday, January 11, 2015 - 11:34 pm
I'm doing multiple imputation with Mplus and would like to know how to compute the standard deviation of a point estimate (the mean) from the standard error provided by Mplus. Could you please give a reference? Thank you
I’m currently trying to run two level multilevel models for several binary outcomes using FIML estimation procedures with longitudinal complex sample data. The models are complex: the level-1 models typically having 10-20 binary IVs and the level-2 models for the intercepts having a maximum of 16 continuous IVs. Many of the IVs are completely observed.
Do I need to bring all of the x variables into the model in order to have observations having missing data for the x variables included? In a 6/22/2006 posting you note that “If only two or three of your covariates have missing data, then FIML should be fine. You should study the missing data in your covariates. Perhaps there are some with very little missing data such that you could allow the listwise deletion on those and bring the others into the model.” However, on 11/12/2014 you say that “if you want to bring one covariate into the model, you must bring all covariates in to the model. You cannot bring in just a subset.”
A small subset of the IVs account for most of my missing data. Is there a way to use the 6/2006 strategy and use listwise deletion for x variables missing small amounts of data – and not include x variables which don’t have any missing data in the model? Multiple imputation isn’t feasible for a variety of reasons. It looks as though your thinking on this may have changed – but figured it’s worth asking. Thank you in advance for your help!
The issue with not bringing all the covariates into the model is that you want the covariates to correlate freely (as covariates should). This may not happen unless you model it. Say that you have e.g. two covariates and missing on X1 and not missing on X2 and you bring X1 into the model (essentially making it a Y). This model may leave X1 and X2 specified as uncorrelated. If you say X1 WITH X2 then you bring X2 into the model, so you have to say X1 ON X2 to correlate them and saying ON can have consequences for the rest of the model. So it is safest to bring all the Xs into the model. I assume you have considerable missingness on that small subset of IVs so that Listwise deletion is not an option.
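A minimal sketch of bringing both covariates into the model and letting them correlate, using the two-covariate case from the reply. The file name, the outcome name y, and the missing value flag are hypothetical.

```
DATA:     FILE = mydata.dat;      ! hypothetical file name
VARIABLE: NAMES = y x1 x2;        ! hypothetical: outcome and two covariates
          MISSING = ALL (-999);   ! hypothetical missing value flag
MODEL:    y ON x1 x2;
          x1 x2;                  ! variances mentioned: both x's enter the model
          x1 WITH x2;             ! covariates correlate freely
```

Both covariates are then treated as dependent variables with distributional assumptions, so cases with missing values on x1 are kept.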
I am running a complex model with many x-variables. One of those x-variables has missings. If I bring this x-variable into the model by mentioning the variance, the model does not fit any more. The problem is that this variable is now assumed to be uncorrelated with the other x-variables. So I added WITH-statements which brought in all the other x-variables into the model. And I needed to add more WITH-statements. Now I get a warning that the number of observed variables is exceeding the number of clusters in my model.
This puzzles me. If I look at the diagram view, the model seems the same (and when I do this with x-variables without missings the chi-square and df are also the same). Why is there this large increase in observed variables, and do you know a way to deal with this problem? Is there a way to let Mplus estimate the x-variables without increasing the number of observed variables, or can I ignore this warning?
You must bring all of the covariates into the model or none of them. You can do this by mentioning the variances. When you do this, they are treated as dependent variables in the model. The warning is to remind you that independence of observation with clustered data is at the cluster level. The impact of this on your results has not been well studied.
Thank you for your answer. I still have a question about bringing in all the covariates. Why does this increase the number of free parameters in the model while at the same time not affecting the number of degrees of freedom?
Hi Drs. Muthen, I am unclear why Mplus is not deleting cases that are missing data for all dependent variables. Notes from my output are below. BWACHGAP is dependent; all other variables are independent. 140,274 is the N in my total sample, but why is this number not decreased, given that some cases do not have data for the sole dependent variable?
SUMMARY OF ANALYSIS Number of groups 1 Number of observations 140274 Number of dependent variables 1 Number of independent variables 11 Number of continuous latent variables 0 Observed dependent variables Continuous BWACHGAP Observed independent variables CRITSKLL DIFFMETH SCH_CITY SCH_TOWN SCH_RURL SCH_MDLG SCH_LARG SCH_TITI SCH_ETH2 SCH_NSLP SCH_SMLL
Further, Drs. Muthen, I saw the note above that by modeling the variances of exogenous variables, they are treated as dependent and distributional assumptions are made about them; it was implied that this one way to retain cases that would otherwise be dropped due to having missing data on all dependent variables. (Please correct me if that is wrong.) However, I have the following questions: 1) Why would an analyst want exogenous variables to be treated as dependent in the model; what consequences are there to this? 2) When I explored the result of modeling vs. not modeling the variances of my exogenous variables named above, I found that the fit indices changed drastically simply due to the explicit modeling of these variances.
Please see below. Is this drop in the goodness in fit due to the improper assumptions that may be made about the distributions of these variables? Is the assumption multivariate normality? Thank you.
Regarding your first question, perhaps you are bringing x's into the model by mentioning their variances. In this case you no longer have a univariate model for your BWACHGAP DV, but you have a multivariate model for BWACHGAP and all the x's. So even if you have missing on BWACHGAP, people who have non-missing data on at least one x variable are (correctly) kept in the analysis sample.
Regarding the change in fit, I cannot speculate except to say that you should make sure you let all the x's correlate freely.
UG Ch11 states: "NMAR modeling is possible using ML estimation where categorical outcomes are indicators of missingness and where missingness can be predicted by continuous and categorical latent variables."
Yes, I have predicted missingness as a dichotomous outcome in such models--2 DVs are modeled: the outcome itself, and missingness. Both can be regressed on covariates, and this assumes MAR.
1) By correlating these two DVs, we can see if missingness is correlated with the predicted score in the whole sample--whether NMAR is a better assumption--right? Or is this not true, if ML estimation of missing scores (first DV) assumes MAR in the first place?
2) The above strategy (correlating these 2 DVs) works in a 1-level model but not a 2-level model. With the latter I get: "Covariances involving between-only categorical variables are not currently defined on the BETWEEN level."
I can run regressions of both DVs on the between level--but not correlate these outcomes. Does Mplus not allow for modeling covariances of dichotomous DVs on the between level?
I know mixture modeling is another option for NMAR, but given the first statement above, it seems this strategy should work: "categorical outcomes are indicators of missingness and missingness can be predicted..."
Perhaps simply not in a 2-level model where missingness is between only?
Drs. Muthen, I employed the strategy of creating a latent variable on level 2 to define the residual variance for the indicator of missingness, and successfully correlated this residual variance with the residual variance of the central variable.
It seems that MAR is a plausible assumption given an estimated value of the correlation of missingness with the DV of about 0: F1 WITH BWACHGAP -0.003 0.424 -0.006 0.995
Is this an OK way to assess the MAR assumption? Thank you sincerely.
1) The missing data literature emphasizes that you cannot test whether NMAR is more suitable than MAR. I recommend the book by Craig Enders. This means that your 2-DV approach is not correct, perhaps because the information on the residual correlation that you focus on comes only from those who don't have missing on Y (the rest is handled by the bivariate normal information). For NMAR modeling you need at least 2 DVs, not counting the binary missing data indicators. For more on NMAR modeling, see for instance:
Muthén, B., Asparouhov, T., Hunter, A. & Leuchter, A. (2011). Growth modeling with non-ignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological Methods, 16, 17-33.
I have a dataset (n=3,000), with a large number of missing values. I was told to conduct multiple imputation using Bayesian analysis. When I try to run the imputation, the "DATA IMPUTATION" command does not turn blue. I was wondering if it is unavailable in the demo version, or if I have done something wrong in the input.
Here is my input:
TITLE: bayesian imputation test
DATA: FILE IS "E:\Dissertation\2012LAPOPComplete.dat"
VARIABLE:
  NAMES ARE gen cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32 prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 q10new ed etid q11 q2;
  USE VARIABLES ARE gen cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32 prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 q10new ed etid q11 q2;
  CATEGORICAL ARE gen cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32 prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 ed etid q11 gen q2;
  AUXILLARY ARE ed etid q11 q2 gen;
  MISSING ARE cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32 prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 q10new ed etid q11 q2 (888888 988888 999999);
DATA IMPUTATION:
  IMPUTE = cp8 cp9 cp6 cp7 cp21 it1 b10a b11 b13 b18 b21 b21a b31 b32 prot3 prot6 prot8 vote cp4 cp4a cp2 np2 cp13 np1 cp5 q10new ed etid q11 q2;
  NDATASETS = 10;
  SAVE = missimp*.dat;
ANALYSIS: TYPE = BASIC;
OUTPUT: TECH8;
I understand that Mplus removes cases with missing on all variables to run FIML estimation. I am working with political trust survey questions: out of about 37,000 respondents, 485 were removed for having missing values on all variables. This makes the summary statistics on the Mplus basic analysis slightly off from the summary stats I have in Stata.
I'm concerned that this ends up removing reticent respondents from the data who are scared to answer the survey question. Hypothetically, just to see if I could retain the 485 cases, I added a different variable into the basic analysis in Mplus, which almost all respondents answered.
This time, Mplus ran the basic analysis without a problem-- no cases were removed, yet the summary statistics in the basic output are still off by the same amount as when Mplus removed the 485 cases. I'm confused because Mplus should have been using the same exact data and missing values as Stata this time. Any idea why this happened?
You have to make a distinction between DVs and IVs and multivariate vs univariate estimation in the following sense.
Perhaps you are saying that you have a model with DVs and IVs and where all the DVs have missing data. They get removed because they don't add information to the estimation of relations between DVs and IVs. When you say you add a variable with no missing, perhaps that is used as a DV in which case Mplus keeps everyone because it can draw on missing data theory.
Summary statistics can be computed using Type=Basic, which uses all DV and IV variables. Since no "DV ON IV" is part of this analysis, all variables will be used in the missing data analysis to compute the summary statistics.
Statistically, there would be no difference between Stata and Mplus, only in how you use the programs.
If this doesn't help, send relevant outputs from Mplus and Stata to support along with your license number.
Hello, I am using Mplus 7 to perform IRT analyses on 104 dichotomous outcomes. Participants were only administered about a quarter of the items. Items were sampled in a way that creates MAR, as the missingness can be partially predicted by a categorical covariate, i.e., grade. Specifically, participants were administered a higher proportion of grade appropriate items and lower proportions of grade inappropriate items. I read in the user’s manual that a covariate can be used to model missingness of categorical DVs when one uses WLS. However, my attempts to use Auxiliary = grade(M) with WLS, WLSMV, ULSMV, and MLR have all generated the error message "Analysis with categorical variables is not available with the 'm' specifier in the AUXILIARY option." Is there another way to account for the MAR, in the context of dichotomous items and lots of (n=2317) missing data patterns?
Interesting... So in Janelle's example above, if one were to regress the factor(s) on grade then you're saying that would adjust for the MAR design of the item sampling? Would the same approach work if one were to use MLR and ULSMV?
In addition to my post above, is there a way to print the IRT parameterization when one uses a covariate in the model? I tried using the D(1.0) output option, but that didn't work. thanks for all your guidance!
Sorry, I'm still struggling with how to get the IRT parameterization. The (simplified) code below results in an error message that Model Constraint does not recognize the N function. I also tried constraining the factor mean to zero and the factor residual variance to one, but that does not yield the IRT parameterization of thresholds/difficulties either, presumably because it is the factor variance that needs to be constrained.
VARIABLE: NAMES are mplusid V1-V104 GRADE; USEVARIABLES are V1-V104 GRADE; CATEGORICAL = V1-V104; MISSING are . ;
You can try to do this using the tech report we have on our website:
See the IRT page and the paper mentioned at:
A brief technical description of the formulas used in the plots of item characteristics curves and information curves is available
If you can't handle these formulas, you can try to make this simple by estimating the model in a first step without Model Constraint (your statement is not allowed). Fix the residual variance of f1 at 1 (f1@1). Asking for TECH4 you get the total variance of f1. Rerun by using a grade variable scaled so that the total variance of f1 is 1.
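The first step of the workaround described above might be set up roughly as follows. This is a minimal sketch rather than the poster's actual input: the factor name f1, the item list V1-V104, and the covariate GRADE come from the thread, while the freed first loading and the choice of estimator are assumptions.

```
VARIABLE:
  NAMES = mplusid V1-V104 GRADE;
  USEVARIABLES = V1-V104 GRADE;
  CATEGORICAL = V1-V104;
  MISSING = .;
ANALYSIS:
  ESTIMATOR = MLR;      ! assumption; the thread also tried WLSMV
MODEL:
  f1 BY V1-V104*;       ! free all loadings, including the first
  f1 ON GRADE;          ! covariate that predicts missingness
  f1@1;                 ! fix the residual variance of f1 at 1
OUTPUT:
  TECH4;                ! reports the estimated total variance of f1
```

The second run mentioned in the reply would follow once the TECH4 output has been inspected.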
Thank you, Bengt, for referring me to the tech report. Reparameterizing the loadings and thresholds to discriminations and difficulties was no trouble.
Interestingly, whether or not one used a covariate to account for MAR had essentially no impact on estimates of a and b when using MLR, rs = .99 and 1.0 for as and bs, respectively. However, using a covariate to account for MAR had major impact on estimates of b when one used WLSMV, r among bs only .60. Estimates of b from MLR and WLSMV were more closely aligned when one accounted for the MAR using a covariate, r = .90, than when one did not account for MAR, r = .66.
Hi Linda & Bengt, I am following your recommendations for determining the best approach to dealing with NMAR data, and running MAR, pattern-mixture, and selection models. I cannot get the input for the Roy-Muthen model to run, and was hoping you could explain what the "u" variable is?
See the Muthen et al (2011) Psych Methods article on our web site. Page 22 describes the Roy method. The u variables of the runs shown on the web site can be ignored. I think they reflect missing or not at the different time points.
I'm currently trying to run some manual R3STEP models for a binary distal outcome and am trying to use FIML estimation procedures to account for missing data. Many of my IVs are binary. I have from 200 to 3,600 observations in each latent class. When I try to run models for my binary outcomes, I receive a message that the covariance matrix for one or more of my classes cannot be inverted. I’m wondering if this is happening because of (1) the distributional assumption of multivariate normality for all of the IVs and (2) very sparse/empty cell counts when I enter 2-3 binary indicators in the model.
My emerging impression is that FIML estimation cannot handle any empty cells in the crosstabs/frequency table and covariance/design matrices (e.g. no variance in outcome within cell). FIML estimation often just won’t work when I have fewer than 5 observations in any cell. One example is when I try to look at differences in a binary outcome by latent class, gender, and race (4 x 2 x 3) and one subgroup’s cell for the presence of the binary outcome is empty. It looks as though I need to have some threshold number of observations in each possible cell in order to successfully use FIML estimation procedures; FIML can’t seem to handle sparse tables. Is there a better way to do this? I suspect that multiple imputation will also be problematic. Thank you in advance for your help!
I assume you have missing data on your binary outcome(s) to give zero frequencies for certain combinations of binary covariate (IV) values. With binary covariates a singular covariance matrix problem may arise in those cases. I don't think there is a way around that problem; it is a common problem in logistic regression.
If that doesn't answer your concerns, we need more information to comment. Please send input, output, data, and license number to Support along with a clarification of:
- do you have missing on the binary outcomes or the binary IVs?
- when you talk about entering 2-3 binary indicators is that the outcomes you are talking about or the IVs?
Wen-Hsu Lin posted on Wednesday, October 14, 2015 - 6:30 pm
Hi, I have a question regarding wave non-response in my data (attrition). I have 8 waves and the attrition (wave 1 vs. wave 8) is almost 40%. I then ran my growth curve. My question is: Can FIML provide proper estimation? I checked the Covariance Coverage table and some numbers were at .6. Is this OK? Can I combine FIML and weights (IPTW) to adjust for attrition? Thank you
Coverage of 0.6 is not good. FIML or any other missing data technique will have to rely too strongly on model assumptions, especially that the missing is MAR and that the variables are normally distributed.
But things are better if earlier time points have higher coverage. Assumptions play a much smaller role if the coverage is at least say 0.8.
Wen-Hsu Lin posted on Thursday, October 15, 2015 - 4:36 pm
Yes. Earlier time points have coverage over .9, which then drops to .6. I am concerned because different methods give somewhat different results. Thank you.
Hi, I'm testing a path analysis model with two binary covariates which lead to the deletion of 5.4% of the sample.
For continuous outcome, I know that there is a way of dealing with missing values on covariates by including them explicitly into the model. Is this possible with binary covariates too? If no, what would be the best way of dealing with these missing values?
Thanks. In other words, I should simply use Estimator=Bayes instead of Estimator=ML?
Because I'm trying to test a mediation model, I should use the Model constraint command instead of the Model indirect? (If I understand well from a previous post, I would also have to divide by the SD of the DV and multiply by the SD of the IV).
Is there a reference that I could read with a more concrete example of this? (or syntax)
Hi. I am running a latent profile analysis on 13 rating variables with two-level nesting (15 scenarios within 300 people). I was able to successfully model a covariate (age group, dummy coded) by using the algorithm=integration and integration=montecarlo statements and including the covariate mean explicitly in the MODEL command.
However, I am trying now to test a different covariate (implicit emotion beliefs), and when I try running the same syntax, I get the following missing data warning message:
Data set contains cases with missing on all variables. These cases were not included in the analysis. Number of cases with missing on all variables: 297
Q2. Right, that's not needed with Bayes. Bayes still allows non-normal parameter distributions and non-symmetric confidence intervals.
SABA posted on Thursday, November 26, 2015 - 8:25 am
Hi, I am running a multiple regression. 25% of my data is missing on one questionnaire because respondents refused to respond to that specific questionnaire; however, they responded to another questionnaire which is also part of the model. The data is missing at random. The analysis type is complex and these respondents are clustered in my analysis. I ran my model with both the ML and MLR estimators and in both cases I get the following message: *** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 1621 My question is: why are these cases excluded from the analysis, and why is ML not estimating them? Thank you
In regular regression you use only subjects with complete data on the x's.
If you have subjects who have missing on x's but not missing on y, you can benefit from that using missing data theory by "including the x's in the model", which you can do in Mplus by adding
assuming you have 5 x's.
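The statement that followed "by adding" did not survive in the post. A plausible form, assuming the five covariates are named x1 through x5 and the outcome is y (all hypothetical names), is a variance statement for the x's in the MODEL command:

```
MODEL:
  y ON x1-x5;   ! the regression of interest
  x1-x5;        ! mentioning the variances brings the x's into the model
```

Once the x's are in the model, a normality assumption is made for them, as other replies in this thread point out.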
SABA posted on Thursday, December 03, 2015 - 2:44 am
Hi, I am running a multiple regression. 25% of my data is missing on one questionnaire because respondents refused to respond to that specific questionnaire; however, they responded to another questionnaire which is also part of the model. The data is missing at random. The analysis type is complex and these respondents are clustered in my analysis. I ran my model with both the ML and MLR estimators and in both cases I get the following message: *** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 1621 If I use estimator MAR, then I get the following message: *** ERROR in ANALYSIS command Unrecognized setting for ESTIMATOR option: MAR Could you please tell me why these estimators are not estimating the missing values, and what the solution could be? Thank you
MAR is not an estimator. Perhaps you are thinking about ML estimation using the MAR assumption. Saying ML or MLR will estimate under the MAR assumption.
Subjects with missing data on x are typically excluded because x is not part of the model in terms of parameters estimated. But adding a normality assumption on x you can bring x into the model saying in the Model command:
and this will bring up the sample size. Note, however, that regression slope estimates are only affected if subjects with missing on x have data on y.
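For the clustered regression described in the question, the pieces of this reply might fit together as sketched below. The cluster variable clus, the outcome y, the covariates x1-x5, and the missing-value code are all hypothetical; only the TYPE, ESTIMATOR, and variance-statement ideas come from the thread.

```
VARIABLE:
  NAMES = clus y x1-x5;
  USEVARIABLES = y x1-x5;
  CLUSTER = clus;
  MISSING = ALL (-999);    ! assumed missing-value flag
ANALYSIS:
  TYPE = COMPLEX;
  ESTIMATOR = MLR;         ! ML under MAR; "MAR" itself is not a setting
MODEL:
  y ON x1-x5;
  x1-x5;                   ! variances of the x's: cases missing on x are kept
```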
Phil Wood posted on Monday, December 14, 2015 - 3:24 pm
Is there any way to exclude observations from analysis based on the number of missing data points across variables? Something along the order of, say, in SAS, saying If nmiss(of item1-item10)<5 then delete; Thanks!
If a person has missing data on the outcome it won't help if you bring the covariates into the model as you suggest. The people you want to include are those with the outcome observed who have missing on some covariates.
You need to mention all variables including the interactions (for deeper technical reasons, this won't be technically exactly correct but approximately so).
When I have only one DV, how can I 'get' Mplus to include cases with missing data on the DV (but no missing data on IVs)?
Would it be appropriate to include other variables known to be correlated with the DV in the model, say, by calling out their variances and/or means? If so, is it advisable to call out the means, variances, or both?
If not, is there another approach to retain cases with missing data on the DV?
Thank you for taking the time to respond--particularly to someone who is early on in the learning process.
If you have a regression of y ON x, you can certainly bring x into the model by mentioning its mean or variance. But the slope won't be affected by data on people who have missing on y and observed on x. The slope is affected only by data on people who have observed y and missing x.
Sona Aoyagi posted on Friday, February 12, 2016 - 7:48 am
Hi, I'd like to know about TYPE=DDROPOUT option in DATA MISSING command, which is for pattern mixture model.
1) In user's guide, it says "For TYPE=SDROPOUT and TYPE=DDROPOUT, the number of binary indicators is one less than the number of variables in the NAMES statement because dropout cannot occur before the second time point an individual is observed." But I actually have cases which are dropped out at the first observed point (i.e. before the second time point). Is there any solution to build in these cases which occurred before the second time point in pattern mixture model?
2) When I ran the model (TYPE=DDROPOUT), I got the error below: "One or more variables have a variance of zero. Check your data and format statement." The error occurred at d1. Could you tell me the meaning of this error message?
DATA MISSING:
  NAMES = y0-y5;
  BINARY = d1-d5;
  TYPE = DDROPOUT;
MODEL:
  i s | y0@0 y1@2 y2@6 y3@10 y4@14 y5@20;
  i ON d1-d5;
  s ON d3-d5;
  s ON d1 (1);
  s ON d2 (1);
3) Sorry for this beginner question: What is the difference between TYPE=MISSING and TYPE=DDROPOUT?
Type = Missing is for ML under the MAR assumption. Type=Ddropout is for pattern-mixture modeling of NMAR data. For a review see the paper on our website:
Muthén, B., Asparouhov, T., Hunter, A. & Leuchter, A. (2011). Growth modeling with non-ignorable dropout: Alternative analyses of the STAR*D antidepressant trial. Psychological Methods, 16, 17-33.
If you are a beginner when it comes to missing data handling I would not use pattern-mixture modeling and therefore not Ddropout.
Lin Jiang posted on Tuesday, February 16, 2016 - 11:03 am
Hi Dr. Muthen,
My model has two latent variables. The IV is a latent variable with two observed variables (1 categorical, 1 continuous); the DV is a latent variable with 4 observed variables. The categorical observed variable for the IV has 50 missing values out of 257. However, the continuous variable for the IV has 145 missing values.
I run the overall model first. The model fits. However, when I run the multi-group SEM with gender, it shows "no convergence number of iterations exceeded". I guess, the "no convergence" is caused by the large number of missing values.
I want to impute the missing values first, and then run the multi-group SEM model again. However, one of my dissertation committee members insists that I should first make the multi-group model fit with the missing values, and then make the model fit after imputation. I doubt whether the first step is necessary: since a lot of missing values exist, it is hard to make the multi-group SEM model fit. Do you have any suggestions?
Also, my data were collected from both male and female. When I use the imputation, should I estimate the missing data separately based on different gender? How can I do that? ( I didn't attend any Mplus trainings/workshops because of the tight budget. I learned how to use it by myself.) Thank you for your help!
We have a longitudinal study with 4 waves spread over 2 years. As we use routinely collected data from a treatment for criminal adolescents, we have a large amount of missing data: we have 50 - 70% missing data on each variable on each wave. As a consequence, our covariance coverage varies from just above .20 up to approximately .60. We are conducting growth models, as well as latent class growth models, using FIML.
I was wondering whether there is any way in which we could get an indication of the estimation bias introduced in our parameter estimates by FIML. For example, for multiple imputation Collins (2001) suggests computing a standardized bias, which is 100*(average estimate - parameter)/se, where se is the standard deviation of the estimate. Are there any such possibilities to investigate whether the missing data is leading to biased parameter estimates when using FIML?
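Written out, the standardized bias quoted in the question is as follows. The symbols are notation introduced here for illustration: the barred term is the average estimate across replications, \theta is the true parameter value, and SD is the standard deviation of the estimates.

```latex
\text{standardized bias} \;=\; 100 \times \frac{\bar{\hat{\theta}} - \theta}{\mathrm{SD}(\hat{\theta})}
```

Because the formula involves the true parameter value, it can only be evaluated where that value is known, for example in a simulation study.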
I would like to perform a CFA with a variable which has some missing values. In SPSS I used imputation and it generated 5 data sets. How can I analyse these 5 data sets generated from multiple imputation in Mplus? Can I perform multiple imputation in Mplus? Many thanks in advance for your help.
In fact, I was analysing my missing data and I found that I have less than 5% missing data, so I will not use multiple imputation. Do you think that Mplus will handle my missing data automatically (using FIML), or should I write something specific in the syntax? I have already read a lot of information, but I am a little confused and I think I need your advice. Many thanks in advance for your help.
Many thanks for your reply and for your help. However, I tried to use this in Mplus and it didn't work (I received an error message). Here I attach the syntax that I wrote for performing a CFA with unordered categorical observed variables with maximum likelihood estimation, so that you can see if I did something wrong.
Title: CFA DSM-IV-J; Data: File is validacao only gamblingversao21904.dat; Variable: NAMES are DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec; USEVARIABLES are DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec; MISSING are all (-999); NOMINAL are all; Analysis: TYPE IS MISSING H1
Model: F1 by DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec;
Output: STANDARDIZED; MODINDICES;
Is that the correct syntax for a CFA with unordered categorical variables, using maximum likelihood estimation for missing values? I ask because the software is also using the WLSMV estimator to handle categorical variables. Once again, many thanks for your help.
Thank you very much for your advice. I will change the estimator and use ML instead of WLSMV. However, I am still a bit confused about one thing: can you please tell me, if I still use the WLSMV estimator (the default for categorical data), how the software will handle missing data? From what I have read, I didn't understand whether it's listwise or pairwise. Once again, many thanks for your help.
Many thanks for all your help. In fact, I followed your suggestion and thus, I tried to perform the CFA of my instrument (an instrument with unordered categorical variables with 9 items) using both estimators, that is, I performed one analysis with the WLSMV estimator (the default for categorical variables) and I performed another analysis with ML estimator and I obtained different results. Taking into account that I only have 8 missing values for a sample of 750 respondents and analysing the pattern of the missing, it seemed that the pattern is MAR, what do you think it will be the best approach for my case? Do you think that will be to use the estimator WLSMV and let the software handle automatically through pairwise method? It's a CFA with 9 items, so I do not have covariates, right? Once again, thank you very much for all your help and insights.
Are you using the CATEGORICAL option with WLSMV and ML? You should be doing this. You should be comparing the patterns of significance, not the values of the coefficients. ML gives logistic regression and WLSMV gives probit regression. They are not on the same scale.
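As a sketch of that comparison, the same model could be run twice with the items declared on the CATEGORICAL list, changing only the estimator. The item names are taken from the poster's input; the list shorthand DSM1rec-DSM9rec and the rest of the layout are illustrative assumptions.

```
VARIABLE:
  NAMES = DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec
          DSM6rec DSM7rec DSM8rec DSM9rec;
  MISSING = ALL (-999);
  CATEGORICAL = DSM1rec-DSM9rec;   ! in place of NOMINAL
ANALYSIS:
  ESTIMATOR = WLSMV;               ! run 1: probit; change to ML for run 2 (logit)
MODEL:
  F1 BY DSM1rec-DSM9rec;
OUTPUT:
  STANDARDIZED;
```

Since the two runs are on different scales, the comparison is between which loadings are significant, not between the coefficient values.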
Once again, thank you very much for your help and insights. Therefore, do you mean performing the same syntax for using the categorical option with WLSMV and ML? Or do you mean performing two syntaxes and then compare the patterns of significance (and not the values of the coefficients) I performed this syntax for the categorical option with WLSMV and ML. Can you please tell me if it is right?
Title: CFA DSM-IV-J; Data: File is validacao only gamblingversao21904.dat; Variable: NAMES are DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec; USEVARIABLES are DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec; MISSING are all (-999); NOMINAL are all; Analysis: TYPE IS MISSING H1 ESTIMATOR IS WLSMV Model: F1 by DSM1rec DSM2rec DSM3rec DSM4rec DSM5rec DSM6rec DSM7rec DSM8rec DSM9rec; Output: STANDARDIZED; MODINDICES;
Is this the correct way to write the syntax for using the categorical option with WLSMV and ML? In addition, question 2) If the data were missing completely at random, do you think that the pairwise method would be fine? Once again, many thanks for all your help.
Hello, I am running a series of multiple regression analyses. The data set contains missing values so I am using maximum likelihood to retain the full sample in models. Some of the outcome variables are non-normal, others are normally distributed. Based on my understanding, it is appropriate to use MLR (rather than ML) to estimate parameters for those models with outcome variables that are non-normally distributed. I am wondering what the best practice is: is it appropriate to use MLR for all analyses reported within the same paper, even for those with outcome variables that are normally distributed? Or should I use ML when modeling outcomes that are normally distributed and MLR when modeling outcomes that are not normally distributed? Results do differ slightly when using ML versus MLR. Thank you!
I ran a LCGA model with the LRTBOOTSTRAP option to test the difference in fit between a 3 class model and a 2 class model. The model took several days to run and when it finished the output did not contain any results but rather ended with a warning that the covariance coverage falls below the specified limit. It seems rather odd to me that it took several days to run and seemed to be fitting models with different random start values but then did not give me the results from any of the models. Does this seem odd to you too?
The program terminated before the computation was completed either by you (accidentally) or the Mplus program crashed during the computation. Run it again and if it crashes again send it to email@example.com
thanks - it ran for several days. I need to figure out how to submit it as a batch job to version of Mplus on my university's social science computing cluster so it doesn't tie up my computer for several days again. Once I figure that out, I will re-run it and get back to you.
I have a continuous variable, measured on a 7-point Likert scale, that should run in the opposite direction from the others. I’ve already written this syntax following your suggestion, but without requesting the summary data option, e.g.: Define: SBPN13r=7-SBPN13r; SPBN19r=7-SBPN19r;
I have imported data from SPSS (trialled in both CSV and tab delimited formats), n=590 and 48 variables. When I run any analysis in Mplus 7, it is indicated that there is 1 missing data pattern, while none are missing in SPSS, and none correspond to the value of missing set in Mplus (-99). Using type=basic, it is furthermore indicated that none of the variables in the dataset have missing data patterns, yet a data pattern is missing with a frequency of 590. No covariance coverage is found to be below 1.000.
I am currently working under the assumption that this will not affect my results, but I would like to know if there are specific types of errors in either the dataset or my Mplus input instructions that may have caused this?
I'm trying to use LISTWISE=ON in the DATA command as I would like to use an MLM estimation; however, I keep getting the error message "Estimator MLM is only available when LISTWISE=ON is specified in the DATA command. Default estimator will be used". I don't understand why it isn't recognising that I've specified listwise on. Any help would be much appreciated!
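For reference, the placement that the error message asks for looks like the sketch below; the file name, variable names, and model are hypothetical. If LISTWISE = ON sits under another command, such as ANALYSIS or VARIABLE, that could explain the message, though this is only a guess without seeing the input.

```
DATA:
  FILE = mydata.dat;
  LISTWISE = ON;        ! belongs in the DATA command
VARIABLE:
  NAMES = y1-y6;
  MISSING = ALL (-99);
ANALYSIS:
  ESTIMATOR = MLM;
MODEL:
  f BY y1-y6;
```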
We have cross-sectional data for which Little's MCAR test divided by its df is somewhat larger than 2, suggesting that our data are NMAR. However, the extent of missingness is only small (between 1 and 2% in a large dataset). Would such a model be estimated in Mplus without being biased? Or would one need auxiliary variables? Can these data be MAR despite the Little test?
Taking a second look, we saw that including the mean of two variables as a third variable when calculating the Little test caused the test to be higher than 2 (removing that third variable resulted in a value lower than 2, suggesting MAR). Still, I'm interested in what the options were if the data had been MNAR :-)
Rejecting MCAR still makes it possible that MAR holds.
DavidBoyda posted on Monday, December 19, 2016 - 10:40 am
Apologies, this is probably not the correct area for this question. I am experiencing an odd anomaly in Mplus (I think). I have a variable with 42 endorsed Yeses. However, when I use said variable in a model, the univariate proportions and counts show this:
I'm new to Mplus and I'm struggling to do a linear regression with missing data in the independent variables. Unfortunately Mplus performs listwise deletion even when "listwise=on" is not specified. Why is that? As I understand Mplus should handle missing data with FIML by default.
Update: I believe I solved the problem by adding the variables with missing values in brackets:
Syntax:
title: Multiple regression with missing data
data: file is hsbmis2.dat;
variable:
  names are id female race ses hises prog academic read write math science socst hon;
  usevariables are write read female math;
  Missing are all (-9999);
model:
  write on female read math;
  [female read math];
Regression analysis is done conditional on the covariates and therefore they cannot have missing data. FIML doesn't change this. One can however extend the model to include the covariates which you have done. All of this is explained in chapter 10 of our new book mentioned on our home page.
I am trying to impute data for a multi-group structural equation model but am new to imputation and am having difficulty assessing the most appropriate course of action.
My model uses categorical (dependent) and latent continuous variables (independent, mediator); data are non-normal; and, in addition to the grouping variable, data are clustered (multi-level). Should I use the TYPE = BASIC TWOLEVEL; command in this situation?
In addition, a colleague mentioned predictive mean matching may be appropriate - but was uncertain whether this technique is available in Mplus. Is it, or something similar, available?
Thank you for the quick response. I am missing data across the indicators of my main independent variable (a latent factor) and in several covariates. By using imputation I was hoping to avoid sample size reduction due to missing on x-variables.
You don't need imputation for missing on x variables. You can "bring the x's into the model" by mentioning their means, variances, or covariances. Missingness will then be accepted on the x's and handled via FIML (we write about this in our new book).
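As a rough sketch of what "bringing the x's into the model" could look like for a latent independent variable with incomplete covariates: the file name, variable names, and missing-value code below are made up for illustration, and a continuous outcome y is assumed. The WITH statements are written out explicitly rather than relying on program defaults.

```
TITLE:    Sketch: covariates brought into the model
DATA:     FILE IS mydata.dat;        ! hypothetical file name
VARIABLE: NAMES ARE y x1-x4 z1 z2;   ! hypothetical variable names
          USEVARIABLES ARE y x1-x4 z1 z2;
          MISSING ARE ALL (-999);    ! hypothetical missing-value code
MODEL:    f BY x1-x4;                ! latent independent variable
          y ON f z1 z2;              ! z1 and z2 are observed covariates
          z1 z2;                     ! mentioning the variances brings z1, z2 into the model
          z1 WITH z2;                ! covariance among the covariates
          f WITH z1 z2;              ! covariances between factor and covariates
```

Note that once the covariates are part of the model, distributional assumptions (normality, with the default estimator) are made about them as well, which is not the case when the model is estimated conditional on them.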
The error message I'm receiving says that there is a non-missing blank at record 1 field 15. I cannot, however, locate any problem with this case. What, specifically, should I be looking for with this error message?
I'm doing a multiple regression with data from a P&P survey containing missing data in both the dependent and manifest variables. Data weights have been added to correct for disproportional stratification. The data seem to be MAR.
The model is specified as follows:
Y on X1 X2 (etc.)
Estimator = MLR
Without accounting for sample weights, the model and missing data handling work quite nicely, giving a much better model fit compared to pairwise or listwise deletion. However, once sample weights are added to the model, standard errors increase significantly, resulting in an overall worse fit.
I have estimated the exact same model in SPSS using pairwise deletion with sample weights toggled on and off. The SPSS model with pairwise deletion and sample weights has slightly lower standard errors and a higher R Square (0,724 vs. 0,674) compared to the Mplus MLR model. I find this quite baffling since MLR should at least equal pairwise deletion unless the model and missing data handling have been misspecified.
What would you recommend I do next in order to improve model fit with sample weights turned on?
I assume that Y is a latent variable measured by several indicators so that model fit is an issue. Typically pairwise deletion is different and not as good as ML(R) under the MAR assumption. Judging by SEs and R-square is not necessarily the best approach. If you have missing on x's you may want to "bring them into the model" by mentioning their variances. We discuss these missing data matters in Chapter 10 of our new book.
I have two latent class variables and 10 items (5 items for each latent class variable); all items are binary. How can I write the MODEL command to assign the items to each related latent class variable? I tried to write: c1 by u1-u5 c2 by v1-v5 but I got a warning message.
Luke Rapa posted on Friday, February 17, 2017 - 6:49 am
I have a well-fitting measurement & structural model (RMSEA for both is .03 and CFI/TLI are .95 or higher). In both models, all individual indicators are loading significantly as expected and at the .50 level or higher. The structural model is longitudinal and includes multiple waves of data; there are some variables in the model with a high degree of missingness. I’m using MLR due to some non-normality.
When I add auxiliary variables to help address missingness, I find that the signs of certain factor loadings are reversing and paths that were previously significant no longer are so. I am using the following code to declare auxiliary variables:
AUXILIARY = (m) v1 v2 v3;
Is there a problem with this declaration of auxiliaries? If not, is there a reason that various factor loadings would change direction (reverse signs, from positive to negative loadings) and paths would become non-significant with the addition of these auxiliary variables?
The missing data handling is different, as intended, so it sounds like the missingness is strongly selective. See also our short course handout and video on this on our website under Topic 4.
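For reference, a minimal sketch of where the AUXILIARY declaration sits in a full input might look as follows. The file name, the indicator names y1-y6, the two-factor model, and the missing-value code are hypothetical; only the AUXILIARY line and the MLR estimator come from the post above. The auxiliary variables are listed on NAMES but not on USEVARIABLES.

```
TITLE:    Sketch: missing data correlates via AUXILIARY (m)
DATA:     FILE IS mydata.dat;        ! hypothetical file name
VARIABLE: NAMES ARE y1-y6 v1 v2 v3;  ! hypothetical indicator names
          USEVARIABLES ARE y1-y6;    ! analysis variables only
          MISSING ARE ALL (-999);    ! hypothetical missing-value code
          AUXILIARY = (m) v1 v2 v3;  ! v1-v3 inform missingness, not the model
ANALYSIS: ESTIMATOR = MLR;
MODEL:    f1 BY y1-y3;               ! hypothetical measurement model
          f2 BY y4-y6;
          f2 ON f1;                  ! hypothetical structural path
```

Running the same input with and without the AUXILIARY line and comparing the estimates is one way to see how much the auxiliary variables change the missing data handling.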
Anne Black posted on Wednesday, March 15, 2017 - 6:42 am
Dear Dr. Muthen, I am estimating a SEM with a categorical outcome, and several categorical covariates with varying amounts of missing data. I am using type=complex and am specifying grouping, weight, stratification, and cluster variables. I was hoping to bring the incomplete covariates into the model and use estimator=Bayes to handle the missing data, but this isn't an option with multiple groups analysis. Can you suggest another way to handle the missingness? Thank you!
I was looking to use dummy coded variables as a part of the auxiliary (m) statement. Is this acceptable practice when all model DVs are continuous?
I would think this meets assumptions of saturated correlates, where all dummy auxiliary variables are correlated with each other as well as with IVs and with residuals of DVs. Correlations between continuous and dummies would be point-biserial, and the correlations among dummies themselves would be phi correlations.
I noticed one suggestion on this thread to use dummies on the auxiliary line, but another post on these forums (http://www.statmodel.com/discussion/messages/22/3457.html?1481589074) mentioned "continuous variables only" in the context of auxiliary variables. However, the original poster was talking about declaring the dummies as categorical, which I presume you wouldn't have to do if using MLR estimation and assuming the dummies function in terms of point-biserial and phi correlations?
Hi, with a binary outcome, using either WLSMV or MLR, I get these warnings:
*** WARNING
  Data set contains cases with missing on x-variables.
  These cases were not included in the analysis.
  Number of cases with missing on x-variables:  206
*** WARNING
  Data set contains cases with missing on all variables except x-variables.
  These cases were not included in the analysis.
  Number of cases with missing on all variables except x-variables:  80
This is basically leaving out any cases with a single missing value (=listwise). When I use "listwise=on", I get the same final number of observations, without these warnings.
So, with categorical outcomes, there is no way to use FIML instead of listwise?
I assume you have a single DV, in which case this happens: FIML doesn't kick in for this univariate response case. You can "bring the x's into the model" by mentioning their variances, and "FIML" would then be activated because you now have a multivariate response case.
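A hedged sketch of this for a single binary outcome under MLR is given below. The file name, variable names, and missing-value code are hypothetical. The ALGORITHM = INTEGRATION line is an assumption: with a categorical outcome, bringing covariates with missing data into the model may require numerical integration, and with many such covariates Monte Carlo integration (INTEGRATION = MONTECARLO) may be needed to keep the computations feasible.

```
TITLE:    Sketch: binary outcome with x's brought into the model
DATA:     FILE IS mydata.dat;        ! hypothetical file name
VARIABLE: NAMES ARE u x1 x2;         ! hypothetical variable names
          USEVARIABLES ARE u x1 x2;
          CATEGORICAL IS u;          ! binary dependent variable
          MISSING ARE ALL (-999);    ! hypothetical missing-value code
ANALYSIS: ESTIMATOR = MLR;
          ALGORITHM = INTEGRATION;   ! assumption: may be required here
MODEL:    u ON x1 x2;
          x1 x2;                     ! mentioning the variances brings x1, x2 into the model
```

Cases with missing on all variables, including the x's, would still be excluded.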
See also chapter 10 of our new book where these issues are described.
Good morning! I am running a growth mixture model with some predictors of class membership. Some of these predictors are scales made up of individual variables with a lot of missing data. I would like to know:
1. Is it better to impute the individual variables that make up the scales, or the scale values themselves?
2. Is it better to (a) do a multiple imputation of the predictor variables outside Mplus (e.g., using Stata) and then use the imputed data sets in Mplus, or (b) do the imputation in Mplus?
3. In scenario (a), how can I combine the different Stata data sets for use in Mplus? Is there any guidance on this in the Mplus User's Guide?
4. In scenario (b), where can I find help writing the code for multiple imputation of the predictors in Mplus?
These Define rules are described in the V8 UG on pages 643 and 644. The mean uses those items that don't have missing. The sum deletes subjects with missing on any item (I think I got that right). FIML is not applied because Define is not part of the Model command.
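To make the two rules concrete, here is a small sketch with hypothetical item names i1-i5, a hypothetical outcome y, and a made-up file name and missing-value code. New variables created in DEFINE go at the end of the USEVARIABLES list.

```
TITLE:    Sketch: scale scores created in DEFINE
DATA:     FILE IS mydata.dat;        ! hypothetical file name
VARIABLE: NAMES ARE i1-i5 y;         ! hypothetical variable names
          USEVARIABLES ARE y scale_m;
          MISSING ARE ALL (-999);    ! hypothetical missing-value code
DEFINE:   scale_m = MEAN(i1-i5);     ! average of the items that are not missing
          scale_s = SUM(i1-i5);      ! missing if any item is missing; shown for contrast, not used
MODEL:    y ON scale_m;
```

Under these rules, a subject who answered only i1 and i2 would get a value on scale_m (the average of those two items) but a missing value on scale_s.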
I just think it is a good default approach. We seldom have theories about zero x-variable correlations - most theories are about y's as a function of x's. If you mis-specify and don't include an x-covariance where there should be one, you get misfit and perhaps distorted results.