Version 2.02 allows missing data modeling when a latent mixture model is fit to data with a complex sampling design. Can missing data be handled with other models using complex data? I thought I read in the manual that it can't, but maybe the missing data feature has been added to other models in v. 2.02.
I am designing a two-level SEM and I have a great deal of missing data on the independent variables at the within level. There is no missing data on the between-level independent variables. Can this model still be handled by Mplus Version 3, or should I replace all of the missing data at the within level before running the program? Thanks for any suggestions.
This can be handled by Mplus Version 3. You should treat the x variables as y variables. The normality assumption changes from normality given x to overall normality. You change them to y's by mentioning their variances in the MODEL command. Then use TYPE = TWOLEVEL MISSING;
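As a hedged illustration of this reply, the input below sketches a two-level model in which the within-level covariates are brought into the model by mentioning their variances. The file name, variable names, and missing-value code are assumptions made for the example, not taken from the poster's data.

```mplus
TITLE:    Two-level model with missing within-level covariates (sketch)
DATA:     FILE = example.dat;        ! hypothetical file name
VARIABLE: NAMES = clus y x1 x2 w;    ! hypothetical variable names
          MISSING = ALL (-9);        ! assumed missing-value code
          CLUSTER = clus;
          WITHIN = x1 x2;
          BETWEEN = w;               ! w has no missing data
ANALYSIS: TYPE = TWOLEVEL MISSING;   ! form given in the reply for Version 3
MODEL:
  %WITHIN%
  y ON x1 x2;
  x1 x2;          ! mentioning the variances treats x1 and x2 as y variables
  x1 WITH x2;     ! covariance stated explicitly (an assumption of this sketch)
  %BETWEEN%
  y ON w;
```

Because x1 and x2 are now treated as y variables, the normality assumption covers them as well, as the reply notes.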
We are running a structural equation model with clustered data (teenagers clustered within schools) using TYPE=COMPLEX modeling. We have missing data for some schools. Are there any special concerns we should consider when the missing data is at the second level, other than the usual things, like coverage and that missingness is at least MAR, when using TYPE=COMPLEX MISSING?
bmuthen posted on Saturday, April 24, 2004 - 9:03 am
Questions again from a new user of Mplus 3. Could I return to the previous question posted on April 05? As you suggested, I added the variances of the Xs in the MODEL command, but the output tells me to use ALGORITHM = INTEGRATION; INTEGRATION = MONTECARLO in the ANALYSIS command. I referred to the example in the Mplus short course (multilevel regression model, page 52): in that input there is no specification of variances for the variables with missing data and no ALGORITHM option, although missing data is declared in the VARIABLE command. So in general, in which situations should I add the variances of Xs with missing data? In fact, after I added ALGORITHM = INTEGRATION; INTEGRATION = MONTECARLO to the ANALYSIS command, no output was produced at all; it only shows "INPUT READING TERMINATED NORMALLY" (I set the OUTPUT options to SAMPSTAT TECH8).
In this case, is it necessary to run a Monte Carlo simulation to generate the missing data?
If possible, could you please suggest one complete example of a two-level model with RANDOM that deals with missing data? This might let me ask you fewer questions about similar issues.
We don't have examples that show MISSING. You just need to add it to the TYPE option of the ANALYSIS command. I suspect that your outcomes are not continuous and that is why numerical integration is required. Please send your output to email@example.com if you want me to look at it.
Mpduser1 posted on Wednesday, September 28, 2005 - 11:51 am
I'm building a multilevel SEM with two endogenous variables, Y1 and Y2, both of which are prone to missingness, and both of which have WITHIN and BETWEEN sources of variation. The missing data rate for Y2 is much higher than the missing data rate for Y1. Y2 is categorical, Y1 is ordinal.
My question is this: Does Mplus 3.13 use information from both the WITHIN and BETWEEN portions of the model when adjusting the maximum likelihood calculations to account for the missing data?
I ask because this could greatly influence my variable selection / modeling strategy.
bmuthen posted on Wednesday, September 28, 2005 - 9:06 pm
The answer is yes. That is how maximum-likelihood estimation under the standard "MAR" assumption works.
anonymous posted on Monday, January 16, 2006 - 10:13 pm
I am running a multinomial logistic regression analysis (nominal dv; using missing and complex estimation) and wish to test whether two of my three-way interaction betas are significantly different from one another. For example, I have a 3-level dv (one level is the reference) and a 3-way interaction that is statistically significant when comparing the first level to the reference group and not significant when comparing the second level to the reference. I wish to know if the 2 betas are significantly different from one another. Any ideas?
bmuthen posted on Tuesday, January 17, 2006 - 10:54 am
You compare the log likelihood (LL) of your model with that of a model where you constrain your betas to be equal (using the usual Mplus approach to equality constraints). Then use 2 times the LL difference as an LRT chi-square test of the equality, with df = the difference in the number of parameters of the two models.
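A minimal sketch of the restricted model described in this reply, assuming a hypothetical three-category nominal outcome u and a hypothetical product variable int3 holding the three-way interaction. The test statistic would be 2 x (LL of the unrestricted model - LL of the restricted model), with df equal to the number of equality constraints (one here). With ESTIMATOR = MLR, the difference is normally adjusted using the scaling correction factors printed with the log-likelihoods; that adjustment is not shown.

```mplus
! Fragment only: DATA and the rest of VARIABLE are omitted.
VARIABLE: NOMINAL = u;         ! hypothetical outcome; last class is the reference
MODEL:
  u#1 ON x1 x2 x3;             ! hypothetical lower-order terms
  u#2 ON x1 x2 x3;
  u#1 ON int3 (1);             ! the same number in parentheses holds
  u#2 ON int3 (1);             ! the two interaction slopes equal
```

The unrestricted model is the same input with the two `(1)` labels removed.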
I have developed an SEM with TYPE = COMPLEX (cluster data) and ESTIMATOR = MLR. Since I specified MISSING ARE ALL (-9), I assume that there has been listwise deletion of cases. The n varies with the number of variables (with missing values) used in the analyses.
Since I have missing, and would like to use a method equivalent to FIML, I have tried to specify TYPE = MISSING H1. Mplus gives no error message or warning, but simply responds with silence.
The relevant commands look like this:
ANALYSIS: TYPE = complex; TYPE = missing h1; ESTIMATOR = MLR;
I have tested out various ways, for instance this one:
ANALYSIS: TYPE = complex missing h1; ESTIMATOR = MLR;
There will always be listwise deletion of cases with missing on covariates because the model is estimated conditioned on the covariates. Means, variances, and covariances of the covariates are not estimated as part of the model. No missing data theory exists for covariates. If you don't want cases with missing on the covariates to be deleted, you need to bring the covariates into the model by mentioning their variances in the MODEL command. Means, variances, and covariances will then be estimated for them. In addition, distributional assumptions will be made about them as for any dependent variable.
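The following is a sketch of what this reply describes for the TYPE = COMPLEX case, assuming a hypothetical one-factor model with two covariates. All variable names are illustrative, and the ANALYSIS line simply repeats the form used in the question.

```mplus
VARIABLE: NAMES = school y1-y3 x1 x2;   ! hypothetical names
          MISSING ARE ALL (-9);
          CLUSTER = school;
ANALYSIS: TYPE = COMPLEX MISSING H1;    ! form used in the question
          ESTIMATOR = MLR;
MODEL:    f BY y1-y3;
          f ON x1 x2;
          x1 x2;        ! variances mentioned: x1 and x2 enter the model
          x1 WITH x2;   ! covariance stated explicitly in this sketch
```

Cases with missing values on x1 or x2 would then be kept, at the price of a distributional assumption on the covariates.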
student07 posted on Friday, July 27, 2007 - 8:33 am
I'd like to ask how Mplus deals with missing values for x-variables (covariates) which are measured only on the between-level when using TYPE= twolevel?
Any observation with a missing value on a covariate is eliminated from the analysis.
student07 posted on Monday, July 30, 2007 - 7:01 am
Thank you very much for your response to my earlier question. I have now found that when using "type= twolevel missing", no chi-square statistics, CFI, or TLI are reported in the output. Am I doing something wrong here? Or is there any way to request CFI and TLI when using "type= twolevel missing"?
I'm trying to carry out a twolevel analysis with data from a pre-post-and-follow-up design in an intervention study. There are three groups (one control group and two treatment groups) on level 2 (operationalized as two dummy variables which predict the dependent variable on level 2). My question is: How can I do a twolevel analysis that takes missing data into account? Is there something like "TYPE=MISSING" for the twolevel approach?
The default since Version 5 is TYPE=MISSING for all analyses.
Kätlin Peets posted on Thursday, February 17, 2011 - 11:58 am
I have a question. My model looks like this:
Laused2 on sugu ; Laused2 on Reading0 ; Laused2 ON Math0; Laused2 ON Avoid0;
%between% reading0 avoid0 math0 AAA;
Laused2 on Reading0 ; Laused2 on Math0; Laused2 on Avoid0; Laused2 ON AAA;! between-level predictor
Thus, I specify reading0, avoid0, math0, and AAA as part of the model in order not to lose cases with missing values on covariates. The model modification indices suggest that I specify covariances between avoid0, reading0, and math0. However, when I do so, my model parameters (especially the between-level slopes) change. Why is that?
Does the MISSING default in version 5 handle missing data differently for
TYPE = TWOLEVEL RANDOM
than for a
TYPE = GENERAL analysis?
I've used Mplus for years, but always for SEM or LGM. I'm trying to analyze data for a school-level randomized control trial, in which students have a pre-test and a post-test. However, the output includes the following warnings:
*** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 327
*** WARNING Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis. Number of cases with missing on all variables except x-variables: 56
Why is it excluding these cases if I do not have LISTWISE = ON?
In GENERAL prior to Version 6, the model was not estimated conditioned on the observed exogenous variables as is done with TWOLEVEL RANDOM. Starting with Version 6, all models are estimated conditioned on the observed exogenous variables.
Missing data theory applies only to dependent variables. This is why observations with missing on observed exogenous variables are excluded. See the 6.1 Version History for further information.
I specify all the possible covariances between my covariates (at the within and between level) to be able to include all the cases in my analyses (when I mention only variances of x-s instead of covariances, the model fit is very bad). However, I get an error message:
MAXIMUM LOG-LIKELIHOOD VALUE FOR THE UNRESTRICTED (H1) MODEL IS -5322.918
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS -0.177D-16. PROBLEM INVOLVING PARAMETER 37.
THE NONIDENTIFICATION IS MOST LIKELY DUE TO HAVING MORE PARAMETERS THAN THE NUMBER OF CLUSTERS. REDUCE THE NUMBER OF PARAMETERS.
We do not know the impact of having more parameters than clusters. This has not been studied. Certainly you don't want more between parameters than clusters because the number of clusters is the number of independent units.
If you include the covariates in the model, you must estimate the means, variances, and covariances of these variables. Perhaps you would be better off losing the observations that have missing data on the covariates.
I could, but my sample size decreases by 30%. I considered using MI. However, I need to know covariances for my parameter estimates (Tech 3 output gives a covariance matrix for each of my imputed data sets) to estimate simple slopes. And, I did not know how to get such an estimate.
Hi, I'm using the MONTECARLO feature of Mplus to generate a 2-level model with 3 level-1 predictors (2 fixed and 1 random) and 1 level-2 predictor.
I'm interested in creating 10% and 30% missingness across either the level-1 predictors, the level-2 predictor, or both.
When I use the PATMISS and PATPROBS options, Mplus informs me that for TYPE = TWOLEVEL RANDOM I must use Monte Carlo integration. However, when I use this integration I get several errors in the TECH9 output.
I've attempted using the missing= and MODEL MISSING: commands, but have not had much success.
What would be the best way to create 10% and 30% missingness on my multilevel data? Thank you for your time.
I have some variables measuring depression and activities of daily living, which I believe have some missing data. I will be creating percents based on total scores of these scales because they are frequency scales (not truly continuous). The depression scale ranges from 0 to 3 for each of 9 items; the activities of daily living scale ranges from 0 to 2 for each of 5 items.
If I use the DEFINE statement at the beginning of my program, as below, will Mplus, by default, replace missing items with the maximum likelihood-estimated value for that item? Or should I handle missing data in SAS prior to exporting my data to Mplus for analysis? Thanks for your help!
Thanks, Dr. Muthen. I could be making a silly mistake, but when I use this DEFINE command, I get no variance on the resulting variable. I summed across the depression items, then divided by the total possible score of 3*9=27 to create a percent, which could then be divided into four categories. (For this project, we wanted four categories for depression.) But I end up with a DEPRESSC variable that has no variance, so Mplus won't run the model for depression.
There WAS variance on the original DEPRESS sum variable, and not all persons would fall into category 1. Is there some obvious mistake that I am making?
DEFINE:
DEPRESS = SUM(H3SP5 H3SP6 H3SP7 H3SP8 H3SP9 H3SP10 H3SP11 H3SP12 H3SP13)/27;
IF 0 <= DEPRESS < .25 THEN DEPRESSC = 1;
IF .25 <= DEPRESS < .50 THEN DEPRESSC = 2;
IF .50 <= DEPRESS < .75 THEN DEPRESSC = 3;
IF .75 <= DEPRESS <= 1 THEN DEPRESSC = 4;
The error message: *** ERROR One or more variables have a variance of zero. Check your data and format statement.
Continuous Variable    Number of Observations    Variance
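One defensive way to write the same recoding is sketched below. It assumes, without confirmation anywhere in this thread, that the chained comparisons (such as 0 <= DEPRESS < .25) and the division attached to SUM are not being read as intended. Each condition is written as an explicit two-part test joined by AND, and the division is done in its own statement.

```mplus
DEFINE:
  depress = SUM (h3sp5 h3sp6 h3sp7 h3sp8 h3sp9 h3sp10 h3sp11 h3sp12 h3sp13);
  depress = depress/27;          ! proportion of the maximum score (3*9 = 27)
  IF (depress GE 0   AND depress LT .25) THEN depressc = 1;
  IF (depress GE .25 AND depress LT .50) THEN depressc = 2;
  IF (depress GE .50 AND depress LT .75) THEN depressc = 3;
  IF (depress GE .75 AND depress LE 1)   THEN depressc = 4;
! Variables created in DEFINE also need to appear at the end of the
! USEVARIABLES list in the VARIABLE command if they are used in the model.
```

Checking the distribution of depressc with TYPE = BASIC before fitting the model would show whether all four categories are populated.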
I am conducting multilevel modeling with random slopes. Let's say I regress y on x and z. And, y on x is treated as random (varies between classrooms). However, I have missing data on my y. I have heard that I could potentially regress z on x to include more cases in my analyses (using FIML). I tried it and it worked. Is this allowed?
I have an aggression variable at the within level and I want to create an average cluster aggression score to use at the between level. I understand that Mplus does this automatically (by not specifying this variable as within or between). My question is how does Mplus handle missing observations at the within level (e.g., level-1 aggression scores missing for a few individuals within each cluster). More specifically, is the average value based simply on the average of the non-missing observations or are the missing observations somehow estimated first using the standard ML missing procedure?
Related to the above, when would someone use Define Cluster_Mean instead of having Mplus calculate the between level values automatically?
When you don't put an individual-level variable on the WITHIN list, an average cluster score is not created, a latent variable decomposition is done. See Examples 9.1 and 9.2. To create an average cluster score, use the CLUSTER_MEAN option in DEFINE. For each cluster, the value is the average of the non-missing values in each cluster. If all values are missing in a cluster, the value is missing.
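A minimal sketch of the CLUSTER_MEAN route mentioned in this reply, assuming hypothetical variables clus (cluster identifier), y (outcome), and agg (level-1 aggression). The created variable aggmean is the observed cluster average of the non-missing agg values; it is not the latent decomposition that Mplus performs when agg is left off the WITHIN list.

```mplus
DATA:     FILE = example.dat;              ! hypothetical file name
VARIABLE: NAMES = clus y agg;              ! hypothetical names
          USEVARIABLES = y agg aggmean;    ! created variable goes last
          MISSING = ALL (-9);              ! assumed missing-value code
          CLUSTER = clus;
          WITHIN = agg;
          BETWEEN = aggmean;
DEFINE:   aggmean = CLUSTER_MEAN (agg);    ! mean of non-missing values per cluster
ANALYSIS: TYPE = TWOLEVEL;
MODEL:
  %WITHIN%
  y ON agg;
  %BETWEEN%
  y ON aggmean;
```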
I don't think missing data handling is the deciding factor here. See the following paper which is available on the website:
Lüdtke, O., Marsh, H.W., Robitzsch, A., Trautwein, U., Asparouhov, T., & Muthén, B. (2008). The multilevel latent covariate model: A new, more reliable approach to group-level effects in contextual studies. Psychological Methods, 13, 203-229.
Katerina Gk posted on Wednesday, October 16, 2013 - 3:54 am
I have a twolevel random type of model with missing data: Missing are all (999); ANALYSIS: TYPE IS TWOLEVEL RANDOM; ESTIMATOR IS ML; ALGORITHM=INTEGRATION; INTEGRATION=MONTECARLO; ......
When I don't have the interaction, and so use TYPE IS TWOLEVEL with ESTIMATOR = WLSMV, the program reads the model quickly at the beginning and then takes some time to converge, but it does give the output. Now, after adding RANDOM to the TYPE option, adding the interaction, and changing the estimator to ML, Mplus reads the model very slowly, printing the iterations one by one, so I was thinking that something is wrong because of the missing data and ESTIMATOR = ML.
1) Am I right in saying that the program should read the model more quickly at the beginning?
2) If yes, could you please recommend something to fix the error?
Dear Dr. Muthen, I am running a twolevel random slope model with missing data on both levels. I want to use FIML. I have 8 level-1 variables, 3 level-2 variables, and one cross-level interaction.
Is it correct to use, besides the ON statements (within: Y ON x1 x2 x3...; between: y ON z1 z2 z3; S ON ...), the variance statements (within: x1 x2 x3...; between: z1 z2 z3), or is it also necessary to estimate the covariances (x1 WITH x2 x3...)? The two approaches yield somewhat different results. Which would be correct, and what is the difference between the approaches? Thanks, Christoph
Thanks. Indeed, just mentioning the variances of the x-variables didn't covary them. One further question: to test cross-level interactions with missing data, Monte Carlo integration is required. The UG says that the LL for models with Monte Carlo integration may be imprecise. So would it be better to use the z-value for determining the significance of the random slope instead of an LR test?
The precision depends on the number of dimensions of integration - see TECH8 screen printing and also Summary output. With say up to 4 dimensions precision may be sufficient; while with 8 dimensions it may not be.
I don't think it is clear that an LR test is better; both are affected by precision issues.
Thanks, and one further question has come up. In some random slope models the estimation does not terminate normally DUE TO A NON-ZERO DERIVATIVE OF THE OBSERVED-DATA LOGLIKELIHOOD. I have now excluded clusters with low covariance coverage (in some clusters the slope is based on 2 cases, cluster size = 10), and the estimation terminates normally. Is this reasonable? And are there guidelines regarding the number of cases (with non-missing values) within clusters for the estimation of random slopes with missing data? Christoph
Anonymous posted on Sunday, December 14, 2014 - 11:12 am
Dear Drs. Muthén,
I have an unbalanced, longitudinal dataset (the German socio-economic panel) with several subsamples, all starting in consecutive waves. So I have missing data by design before the start of a single subsample. Moreover, there is item-nonresponse, wave-nonresponse and finally, drop-out.
I use a typical longitudinal multi-level model with observations clustered within participants.
Here are several questions concerning the resulting missing data:
(1) Is it appropriate to neglect sensitivity analyses for missing data and instead use type=twolevel and FIML only?
(2) Enders (2010) says the conventional multiple imputation procedure does not consider clustering, and you have to use special procedures. Is Mplus able to do that?
(3) Can I use the Diggle-Kenward selection model with type=twolevel in Mplus?
(4) I use weights. Participants who do not take part in the survey in a specific wave do NOT have a weight. I think it would not be adequate to use multiple imputation to impute for the missing waves (variables + weights). Can I use the selection model in this case or would it be better to use FIML and use only the observed information?
I would be very grateful for your support. Thank you very much!
I ran a two-level model with a continuous outcome (Y) and three predictors (cohort, treatment & prescore). After running the model, I got warning messages that 17 cases missing on Y and 2 cases missing on x-variables were not included in the analysis. --> Question1: Does this mean that those 19 cases were excluded from the actual analysis? (I'm asking this because my colleague said that Mplus has the capacity to do data replacement during the actual analyses, so actually those 19 cases were included in the actual analysis, which is different from my understanding)
I then ran the same model with the addition of the following statement: Model: cohort treatment prescore; I found that the number of observations in the summary of analysis is now the total number of cases, with no warning messages about missing Y or missing x-variables. --> Question2: Does this mean that the cases with missing Y and those with missing x-variables are now all included in the analysis? --> Question3: If yes, how can missing Y and missing x be handled by the addition of the above statement? I wonder what is being done behind the scenes.
When we say 19 cases are excluded, they are not used in any way. The model is estimated conditioned on exogenous x variables. Missing data theory does not apply to them. When you bring them into the model, distributional assumptions are made about them and missing data theory is used. Observations with missing on all y variables have nothing to contribute to the analysis and are therefore excluded.
I'm sorry if I misunderstood what you said, but I am confused.
From your previous posting, cases with missing on Y (a single observed variable) are excluded even when covariates are brought into the FIML model because they have nothing to contribute to the analysis. But now it sounds as if missing Y can be handled by bringing covariates into the FIML model. Am I missing something?
I believe that the following is true. If you have an output that shows otherwise, please send it.
We make a distinction between x and y even when x is brought into the model. So I think that a case with missing on all y's will still be deleted.
The above situation is different than having just one y. With just one y that does not have missing, y is not excluded so when the x's are brought in missing data theory can be used.
Yoon Oh posted on Thursday, October 22, 2015 - 6:59 pm
I'm trying to run a three-level model with random intercept and random slope. A problem is that there are missing data on a covariate with a random slope. I wanted to include cases with missing covariates in the analysis by bringing all Xs into the model. But I ended up with an error saying "This model estimation is not available due to missing data in a covariate with random slope."
The following code was used for the analysis. Would you please help me figure out how to proceed?
ANALYSIS: TYPE = THREELEVEL RANDOM; ESTIMATOR = ML;
MODEL: %WITHIN% Y ON X1 X2 X3 ; B4 | Y ON X4 ; X1 X2 X3 X4 ;
The only way to do this is to conduct multiple imputation first. You can do 3-level imputations in Mplus using the H0 imputation track.
Patricia posted on Thursday, May 05, 2016 - 10:43 am
I am running a multi-group path analysis:
grouping = Insomnia (0 = NoInsomSx, 1 = InsomSx)
I receive the following error message:
*** WARNING Data set contains unknown or missing values for GROUPING, PATTERN, COHORT, CLUSTER and/or STRATIFICATION variables. Number of cases with unknown or missing values: 3 1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
However, I checked my data file and all cases are coded appropriately for the grouping variable (no missing values). What is happening? Thank you!
I'm running a 2-level model (no random slopes), with missing data. I'm trying to use FIML (mentioning variances of predictors) to include all participants, but I run into issues with this due to cross-level interactions. I tried incorporating the montecarlo integration algorithm to deal with this issue, but I receive the following error: " THE ESTIMATED BETWEEN COVARIANCE MATRIX COULD NOT BE INVERTED. COMPUTATION COULD NOT BE COMPLETED IN ITERATION 1. CHANGE YOUR MODEL AND/OR STARTING VALUES. THE MODEL ESTIMATION DID NOT TERMINATE NORMALLY DUE TO AN ERROR IN THE COMPUTATION. CHANGE YOUR MODEL AND/OR STARTING VALUES."
Below is the relevant syntax:
ANALYSIS: type=twolevel random; estimator=MLR; algorithm=integration; integration=montecarlo;
MODEL:
%within%
lonew3 on sex_dw1 lonew2c AA asian white other;
s_vic | lonew3 on zpickw2;
s_recip | lonew3 on recip_dw2;
s_vicrec | lonew3 on zvicXrec;
zpickw2 recip_dw2 lonew3 sex_dw1 lonew2c aa asian white other zvicXrec;
%between%
lonew3 on sthw2_Mc;
s_vic on sthw2_Mc; s_recip on sthw2_Mc; s_vicrec on sthw2_Mc;
s_vic@0 s_recip@0 s_vicrec@0;
sthw2_mc s_vic s_recip s_vicrec;
Any help troubleshooting these error messages would be greatly appreciated.
1. Does this mean it is necessary to allow random slopes in order for the montecarlo integration to converge? Or is there an alternative way to specify that those slopes are fixed at zero that won't cause convergence problems?
2. I want to confirm that the following error message is ignorable, given that it is followed by "THE MODEL ESTIMATION TERMINATED NORMALLY"
WARNING: THE MODEL ESTIMATION HAS REACHED A SADDLE POINT OR A POINT WHERE THE OBSERVED AND THE EXPECTED INFORMATION MATRICES DO NOT MATCH. AN ADJUSTMENT TO THE ESTIMATION OF THE INFORMATION MATRIX HAS BEEN MADE. THE CONDITION NUMBER IS -0.943D-03. THE PROBLEM MAY ALSO BE RESOLVED BY DECREASING THE VALUE OF THE MCONVERGENCE OR LOGCRITERION OPTIONS OR BY CHANGING THE STARTING VALUES OR BY INCREASING THE NUMBER OF INTEGRATION POINTS OR BY USING THE MLF ESTIMATOR.
Dear Mplus team, I have been trying to estimate twolevel models with observations nested within persons (data from the German Socio-Economic Panel). Wherever I have categorical dependent variables, I run into problems (days and weeks of computing time) because Monte Carlo numerical integration is required if I bring predictors and a couple of other meaningful covariates into the model to handle missing data (including attrition). I have tried to switch to Bayes because I read somewhere in the forum that it might work faster. However, I still get a fatal error message that this model can only be done with Monte Carlo integration. My questions are: 1) Is it really impossible to work around numerical integration with type = twolevel, a categorical dependent variable, and covariances between predictors in the model? 2) Is bringing all covariances between the predictors into the model really the best way of enabling FIML for predictors as well? I do get quite different sample sizes and regression estimates if I don't bring predictors into the model (or not all of them).
Is there an obvious reason I would receive considerably different results using SPSS's (V 24) MIXED procedure for a three-level model compared to Mplus's (V 7.4) TYPE IS THREELEVEL? The coefficients are pretty close, but the Mplus standard errors are considerably smaller, resulting in considerably smaller p-values for covariates of interest in the Mplus results.
For SPSS I am using REML, and in Mplus using MLR. I expect this would produce some discrepancy, but the differences appear greater than expected.
I can paste syntax and/or output if needed, but thought I'd first inquire if there is a simple and obvious explanation that transcends the particulars of my model.
I will say, there is no missing data--so that can be ruled out as an explanation. Though, the reason I am using Mplus is because I intend to introduce covariates in subsequent models that do have missingness, and plan to mention the variances for those covariates in order to not drop cases; but first want to determine why my results are not replicating across statistical programs for a basic model without missingness.
Thank you for your prompt reply. I should have investigated that myself first. Indeed those results are similar--and the p-values fall roughly midway between the SPSS REML and Mplus MLR p-values.
Using a continuous outcome variable, the different estimates for our dichotomous Treatment variable are:
SPSS REML: 1.41 (1.75), p = .465
SPSS ML: 1.63 (1.46), p = .267
Mplus ML: 1.59 (1.47), p = .278
Mplus MLR: 1.59 (1.10), p = .147
I recognize that the statistical inference doesn't vary between these models; however, when I include other covariates, namely pretest, the p-value with Mplus MLR does become statistically significant.
Do you have any thoughts on which results we might consider most trustworthy? What factors should be considered in making this decision?
These data are drawn from a 22 site cluster randomized trial, with site-level assignment to one of two conditions. Sites varied in size, but the sample is fairly balanced between conditions.
Thank you so much for your recommendation and rationale.
More to the point of this thread, missing data in multilevel analysis:
I have missingness on a covariate (pretest) and plan to mention that covariate, so that means and variances are estimated, and I don't drop cases missing pretest. I also plan to use the method described in Mplus Web Notes No. 11 on constructing covariates in multilevel regression. My understanding is, when mentioning covariates to address missingness, that all covariates need to be mentioned (so as to not make the correlation between the variable in and out of the model zero). However, it is my hunch that this is a level-specific requirement.
Using a twolevel model as an example, where missingness only occurs for variable Pre:
%WITHIN% Post ON Pre L1cov;
%BETWEEN% Post ON Pre L2cov;
Am I correct that I need to mention Pre and L1cov on the WITHIN level (both measured at the within-level), but it is unnecessary to mention Pre and L2cov on the BETWEEN level because of orthogonality of variance between the within- and between-parts of the model?
I am running an LCA on complex survey data and I am having trouble with a persistent error.
*** WARNING Data set contains cases with missing on all variables. These cases were not included in the analysis. Number of cases with missing on all variables: 412 1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
I've seen that this can occur when there is missingness in the covariates, but I dropped all observations with missing on the covariates before analyzing the data in Mplus. I also recoded all missing values in the predictors to -9999 in Stata to avoid any issues with Mplus reading blanks. Could this error be referring to missingness on the binary predictors, and if so, how can I resolve that? I thought the default method for dealing with missing data was FIML, which does not drop observations. Any insight would be appreciated!
One last question: What is the proper term or phrasing I should use when describing this technique for handling missingness?
Is it that I declare covariates as endogenous so that FIML can be applied to them?
Is this a meanstructure approach?
Thank you for your guidance. If there is anything I can cite that describes what Mplus is doing so as to not drop cases, please let me know. I have looked at the statmodel Web Notes and Special Topics pages, and it was not apparent to me that any of those manuscripts addresses this issue in particular.
Thank you again.
Anne Black posted on Wednesday, March 15, 2017 - 2:16 pm
Dear Dr. Muthen, I'm using the data imputation option to handle incomplete covariates measured for individuals nested within clusters. If I include the cluster variable in the imputation model, will the hierarchical data structure be preserved, or do I need to specify that another way? Thank you for your advice.
A colleague indicated that Mplus could not handle performing FIML with an independent (exogenous) variable unless it is tricked into doing so by predicting the var with incomplete data using an auxiliary variable. The auxiliary variable could even be just a column of 1s. This makes Mplus treat the var as an endogenous variable.
Supplementary to this, in a SAS paper (312-2012), Paul Allison addressed several ways of using ML when data are missing on predictor variables. One is to use the EM algorithm to produce the means and covariance matrix for all the variables in the model (using PROC MI with NIMPUTE=0). The second is to use a SEM-based FIML approach (using PROC CALIS).
My question is: does the Mplus approach of mentioning the mean or variance of an independent variable with incomplete data not use FIML to handle the missingness? In that case, this approach in Mplus might be more like Allison's first approach. Or is it a FIML approach, so that this is another way (besides the use of auxiliary variables) to trick Mplus into treating the variable as endogenous?
Any guidance you can provide on terminology or phrasing for describing this approach will be greatly appreciated, including if this is what is referenced in the literature as a mean structure approach to handling missing data.
Anne Black posted on Thursday, March 16, 2017 - 8:07 am
Thank you, Dr. Muthen. I ultimately need to conduct a multiple groups analysis (for which estimator=Bayes is not available), but think I could use example 11.8 to impute values for each group separately, then combine the data sets. Does that seem reasonable? Or better to use the grouping variable (which is different from the cluster variable) in the imputation model?
With Mplus you can bring a covariate x into the model (making it endogenous) by mentioning either its mean or variance. This implies that the model is expanded to the joint distribution of y and x instead of the usual approach of y conditional on x (saying nothing about the x distribution). For this, Mplus uses FIML assuming that x is normal. All of this is described in detail in Chapter 10 of our new book.
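In MODEL syntax, the two options named in this reply could look like the fragment below. The variable names y and x are placeholders, and only one of the two statements is needed.

```mplus
MODEL:
  y ON x;
  [x];      ! mentioning the mean of x brings x into the model
  ! x;      ! alternatively, mention the variance of x instead
```

Either statement moves the analysis from y given x to the joint distribution of y and x, which is where the normality assumption on x comes from.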
Thank you for the succinct explanation and the reference to your book. I see Ch.1.9.4 covers Bringing covariates into the model: Missing data on x. I'll be ordering it now. Tusen tack!
Melanie Wall posted on Wednesday, September 27, 2017 - 1:15 pm
In Mplus 7.11 we were able to run imputation with type = complex. But now in Mplus 8 (and in our older version 7.4) we cannot run it. We get the error about COMPLEX not being compatible with DATA IMPUTATION.
Coming online, we see many posts saying Complex and imputation are not compatible, with other fixes suggested. Should we trust the 7.11 output?
In Version 7.11 the data imputation is done correctly, however, none of the complex sampling features (weights, strata, cluster) are used during the imputation. The complex sampling features were used only during the model estimation that uses the imputed data.
We have now disallowed that so that it is clearer what is being done. You can repeat the 7.11 process in two steps in Version 8: impute the data using type=basic without the complex features, and then analyze the imputed data using the complex features in a second step. The results should be identical (two-step V8 vs. V7.11).
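The two-step process described here might be set up roughly as follows. This is a sketch under assumptions: the file names, variable names, number of imputed data sets, and the analysis model are all hypothetical, and the use of AUXILIARY to carry the design variables into the saved files should be checked against the variable order reported at the end of the imputation output.

```mplus
! Step 1: imputation without the complex-sampling features
DATA:     FILE = survey.dat;              ! hypothetical file name
VARIABLE: NAMES = psu strat wt y x1 x2;   ! hypothetical names
          USEVARIABLES = y x1 x2;
          AUXILIARY = psu strat wt;       ! assumed to be saved with the data
          MISSING = ALL (-9);
DATA IMPUTATION:
          IMPUTE = y x1 x2;
          NDATASETS = 20;
          SAVE = imp*.dat;
ANALYSIS: TYPE = BASIC;

! Step 2 (separate input file): analysis using the complex features
DATA:     FILE = implist.dat;             ! list of the saved data sets
          TYPE = IMPUTATION;
VARIABLE: NAMES = y x1 x2 psu strat wt;   ! must match the saved order
          USEVARIABLES = y x1 x2;
          CLUSTER = psu;
          STRATIFICATION = strat;
          WEIGHT = wt;
ANALYSIS: TYPE = COMPLEX;
MODEL:    y ON x1 x2;
```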
Anyway, we don't really recommend that approach. What we recommend is the following (unfortunately that is not done automatically for you).
1. For cluster sampling we recommend two-level data imputation (using type=twolevel basic and the cluster variable = cluster sampling unit)
2. Strata - we recommend using multiple groups, i.e., impute each stratum separately.
3. Sampling weights - if you have sampling weights and missing data we don't really recommend MI at all. The best method is FIML. If you still need to do MI - you should use the sampling weight and log(sampling weight) and all other information related to the sampling weights in your imputation model. Any kind of proper MI method with sampling weights has to explicitly model the relationship between the weight and the variables, which is a huge drawback, because FIML works without assuming any relationship form between the weights and the variables, i.e., it works with any relationship and it doesn't have to be specified.
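As a rough sketch of points 1 and 3, the imputation step could use TYPE = TWOLEVEL BASIC with the sampling unit as the cluster variable and include both the weight and its log in the imputation model. All names and the file name are hypothetical, and stratum-by-stratum imputation (point 2) is not shown.

```mplus
DATA:     FILE = survey.dat;                ! hypothetical file name
VARIABLE: NAMES = psu wt y x1 x2;           ! hypothetical names
          USEVARIABLES = y x1 x2 wt logwt;  ! weight and log weight included
          MISSING = ALL (-9);
          CLUSTER = psu;                    ! cluster sampling unit
DEFINE:   logwt = LOG(wt);                  ! natural log of the sampling weight
DATA IMPUTATION:
          IMPUTE = y x1 x2;
          NDATASETS = 20;
          SAVE = imp2l*.dat;
ANALYSIS: TYPE = TWOLEVEL BASIC;
```

Other information related to the weights, as the reply recommends, would be added to the USEVARIABLES list in the same way.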
Melanie Wall posted on Thursday, September 28, 2017 - 10:51 am
Thank you, Tihomir, for solving the mystery about 7.11. I understand the issue/difficulty with imputing when there are weights. We would be happy to just use FIML for our problem, but we have several covariates (X variables) with missing values, which FIML cannot address other than by listwise deletion. Do you have any suggested tricks for turning X variables into Y variables so that subjects with missing X values will not be excluded by FIML?
You can still try MI via point 3 above; however, the easiest way might be to use FIML and add a model for the X variable that has missing values (making it a dependent variable). Something like X1 ON X2-X5; where X2 to X5 are the covariates that have no missing values and X1 is the covariate that has missing values. One thing to keep in mind is that the handling of missing values is always based on some model assumptions, so whichever way you go, make sure the model assumptions are reasonable.
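A sketch of this suggestion in a complex-sample setting, assuming hypothetical names for the outcome, covariates, cluster variable, and weight:

```mplus
VARIABLE: NAMES = psu wt y x1-x5;   ! hypothetical names
          MISSING = ALL (-9);       ! assumed missing-value code
          CLUSTER = psu;
          WEIGHT = wt;
ANALYSIS: TYPE = COMPLEX;
MODEL:    y ON x1-x5;
          x1 ON x2-x5;   ! x1 has missing values; it now has its own model
```

Here x1 is treated as a dependent variable, so a (normal) regression model is assumed for it, which is the model assumption the reply asks to check.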
Just to add that in certain situations MI, using a strategy like the one described above to deal with the weights, might be the best solution. For example, if you have many binary covariates with many missing values, you are better off imputing from a multivariate probit model, which is not available for ML/FIML.