I need an opinion on a missing-data problem with an x-variable.
In a model that I am running, an observed (x) variable greatly reduces the sample analyzed (combined with other x-variables, the sample goes down by about 50%). I am thinking that a solution would be to make it endogenous, by making it latent (maybe with a formative-indicator factor). The problem is like this: I have a variable, match, obtained by summing three binary variables (tmatch, scmatch, stmatch). All three are important, but they have a great deal of missing values. I am thinking of a model like this:
matchl by; t by tmatch; sc by scmatch; st by stmatch; matchl on t sc st;
y1 on matchl; y2 on matchl; (these last two are needed for identification, according to Jarvis, MacKenzie, & Podsakoff, 2003; they can be two extra indicators or, actually, parts of the larger model).
Or I can use DEFINE: matchl=t+sc+st; and drop matchl by;
Does it sound like a reasonable solution? Is this what you mean by "Covariate missingness can be modeled if the covariates are brought into the model and distributional assumptions such as normality are made about them." in the Mplus manual, Chapter 1?
I'm interested in modeling the over-time relationship between two binary variables in a variety of models: autoregressive cross-lagged model, parallel process growth model, and an autoregressive latent trajectory model. I will have missing data on each of the repeated measures as well as at least some of the covariates that will be used as control variables. I'm planning to use weighted least squares estimation for these analyses rather than ML. I have come across 3 possible solutions for dealing with missing data for the binary dependent variables and the covariates on the discussion board, and I'm hoping for some guidance on which would be the best given my situation (or an alternative suggestion if I've overlooked something): 1) Multiple imputation for both of the binary dependent variables as well as the covariates. 2) Allow for WLS to estimate missingness as a function of the covariates for the two binary dependent variables, and use multiple imputation for just the covariates. 3) Allow for WLS to estimate missingness as a function of the covariates for the two binary dependent variables, and mention the variances of the covariates in the MODEL command.
2) and 3) would not allow for missingness predicted by the binary outcomes before dropout. Alternative 1) seems most reasonable. You may also want to try a Bayesian approach, which, like ML, gives a full-information analysis.
Thank you for your help! I'll also look into the Bayesian approach.
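For readers following alternative 1), here is a rough sketch of how imputation might be set up in Mplus. It is an illustration only: the file name panel.dat, the variable names u1-u4, v1-v4, x1, x2, the missing-value code -999, and the number of data sets are all assumptions, not taken from the post above.

```
! Input 1: generate the imputed data sets (all names are hypothetical)
DATA:     FILE = panel.dat;
VARIABLE: NAMES = u1-u4 v1-v4 x1 x2;
          MISSING = ALL (-999);
DATA IMPUTATION:
          IMPUTE = u1-u4 (c) v1-v4 (c) x1 x2;  ! (c) = impute as categorical
          NDATASETS = 20;
          SAVE = panelimp*.dat;
ANALYSIS: TYPE = BASIC;
```

```
! Input 2: analyze the imputed data sets with weighted least squares
DATA:     FILE = panelimplist.dat;   ! list file written by the SAVE option
          TYPE = IMPUTATION;
VARIABLE: NAMES = u1-u4 v1-v4 x1 x2; ! check the saved variable order in the
                                     ! output of input 1 before reusing this
          CATEGORICAL = u1-u4 v1-v4;
ANALYSIS: ESTIMATOR = WLSMV;
MODEL:    u2 v2 ON u1 v1 x1 x2;      ! one cross-lagged step, for illustration
```

For the Bayesian alternative, the original data would be analyzed directly with ESTIMATOR = BAYES; in the ANALYSIS command instead of the two-step setup above.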
Rod Bond posted on Tuesday, April 10, 2012 - 5:58 am
I have missing data on covariates and am trying to deal with the problem by bringing the x variable(s) into the model. When I bring some variables into the model, however, I get the following warning "THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX." I get this warning even if there is no missing data for the covariate in question and even when I have only one covariate. For example, I run an analysis with Gender as a covariate on which there is no missing data. If I run it without bringing it into the model, it runs fine. If I bring it into the model by referring to its variance in the MODEL command, I get the warning but also identical results. I have tried rescaling Gender so that its variance is similar to that of other variables, but that makes no difference. Any ideas? Thanks
I have some missing values on covariates and brought the covariates into the model by adding them to the MODEL command as mentioned above. But, as some of them are binary, how can I tell Mplus which of them don't have a normal distribution?
For that you should use multiple imputation. But ignoring the binary aspect may not be a big sin unless you also have categorical DVs in your model.
Tracy Witte posted on Wednesday, April 03, 2013 - 10:40 am
I have a couple of questions about bringing predictor variables into the model so that missing data on predictor variables can be handled with FIML:
1) I realize that doing so means that the same assumptions for the rest of the model will now be applied to the predictor variables, which may not be tenable. However, if one is using MLR as the estimator, is it less problematic to include predictor variables in the model, even if they deviate somewhat from a normal distribution?
2) Does including predictor variables in the model change the substantive interpretation of the results? Or, will any differences in parameter estimates primarily be a function of the degree to which the model assumptions are untenable for the predictor variables?
3) This issue is addressed on the following website (http://www.ats.ucla.edu/stat/mplus/faq/fiml_counts.htm). I noticed that they included the predictors in the model by modeling the intercepts, rather than the variance of the predictor variables. Is this because this model included count variables, rather than continuous variables?
4) In general, is it considered better to use multiple imputation if you have missing data for predictor variables? That is, how "experimental" is it considered to be to use this approach with FIML?
1. It may make it less problematic. 2. No. 3. You can mention any parameter of the covariate. It does not matter which one. 4. The two methods are asymptotically equivalent. Imputation may be better for categorical variables. Imputation has fewer testing options.
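To illustrate answer 3, a minimal sketch with hypothetical variable names (y, x1, x2) might look like this. Either way of mentioning the covariates brings them into the model:

```
MODEL:
  y ON x1 x2;
  x1 x2;         ! mentions the variances of x1 and x2
! Alternative that serves the same purpose:
!  [x1 x2];      ! mentions the means of x1 and x2 instead
```

Once the covariates are mentioned, distributional assumptions are made about them, as in the manual passage quoted at the top of the thread.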
Stine Hoj posted on Wednesday, January 22, 2014 - 5:04 pm
I am wondering if you could help me to understand why bringing a covariate into the model has a pronounced effect on model fit statistics.
I have a linear growth model with a continuous outcome (Y1) measured at 3 time points. The model includes one time-varying covariate (X1) and several time-invariant covariates (Z1-Z7).
MODEL: i s | Y11@0 Y12@1 Y13@2; i s ON Z1-Z7; Y11 on X11; Y12 on X12; Y13 on X13;
(RMSEA = 0.05, CFI = 0.968, SRMR = 0.036)
Approximately 20% of the sample are missing values on X1 at some time point. However, if I bring X1 into the model to retain these observations, the fit statistics are markedly worse.
MODEL: i s | Y11@0 Y12@1 Y13@2; i s ON Z1-Z7; Y11 on X11; Y12 on X12; Y13 on X13; X11 X12 X13;
Please send the two outputs and your license number to firstname.lastname@example.org. Include TECH1 in both outputs.
Malte Jansen posted on Thursday, February 27, 2014 - 7:28 am
Dear Mplus Team,
It would be great if you could give me some advice regarding the following situation:
I am trying to regress student achievement on a number of predictors and several interactions between continuous predictors. The predictors include binary variables (e.g. sex), continuous variables with one indicator and continuous variables with several indicators. The students are nested in classes, but all predictors are on the individual student level. The dataset includes some missing values on both the outcome and all three kinds of predictor variables.
I am unsure which modeling approach to use because:
1. I would like to use FIML to handle the missing data as it's more convenient than multiple imputation. When I use FIML (by estimating the variances of all predictors in the MODEL statement), all manifest predictors are treated as latent variables with one indicator, right? I guess this might be problematic for binary predictors such as sex. Would you recommend using FIML on the binary predictors?
2. If I use FIML only for continuous predictors and do not include their covariances with the binary predictors in the model, a bad model fit results. If I include the covariances, again FIML is used on the binaries. Do I necessarily have to include the covariances?
[part 2 coming up]
Malte Jansen posted on Thursday, February 27, 2014 - 7:28 am
3. When I include a latent interaction, no standardized coefficients are reported. I tried to standardize all variables prior to the analysis as well as set the variance of all latent variables to 1. However, the results for the models without interactions are different from the STDYX output, especially the regression coefficients for the binary predictors. Is there any way to obtain "STDY" coefficients (or coefficients that could be interpreted in a similar way) for the binary predictor variables when interactions are included?
Which modeling approach do you think would be the most suitable? It would be great if you could help me.
In the future, please limit your post to one window.
Stine Hoj posted on Wednesday, March 26, 2014 - 7:52 pm
Can you point me to any resources that might help me understand how important the assumption of continuous normality is when bringing covariates into the model?
As in the last post, I would like to use FIML for reasons of convenience, but most of my covariates are binary. In actual fact, these binary covariates are missing almost no data; it is just the one continuous covariate that is missing ~20% of responses. I am aware that I need to bring all of the covariates into the model, not just a subset, but I am wondering how to assess what the implications of this might be. Thank you.
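As a sketch of what bringing in all of the covariates, not just a subset, might look like for the growth model posted earlier in this thread (variable names are taken from that post; the WITH statement and the TECH1 request are illustrative assumptions, not a confirmed recommendation):

```
MODEL:
  i s | Y11@0 Y12@1 Y13@2;
  i s ON Z1-Z7;
  Y11 ON X11;
  Y12 ON X12;
  Y13 ON X13;
  X11-X13 Z1-Z7;         ! mention the variances of every covariate
  X11-X13 WITH Z1-Z7;    ! state the covariances between the two sets
OUTPUT: TECH1;           ! shows which covariate covariances are free
```

An earlier post in this thread reports poor fit when covariances between the covariates brought in and the remaining covariates are left out, so TECH1 is a useful check of what is actually being estimated.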
I am using Mplus to run a longitudinal cross-lagged model, with clustered data. I am using the MLR estimator and my syntax looks like this:
ANALYSIS: TYPE = COMPLEX; MODEL: CTRM_F2 TOTAL_F2 ON CTRM_F1 TOTAL_F1; CTRM_F1 TOTAL_F1 ON CTRM_IN TOTAL_IN; CTRM_F2 ON CTRM_IN; TOTAL_F2 ON TOTAL_IN; CTRM_F1 WITH TOTAL_F1; CTRM_F2 WITH TOTAL_F2;
When I run the analyses, I get warning messages indicating that some cases are excluded due to missing data on x-variables. As you know, I can resolve this issue and include all cases by adding: [CTRM_IN] [TOTAL_IN]
However, do you recommend doing so, so that as many cases as possible are included, or do you perceive it to be more appropriate to not include cases with missing data on x-variables?
Which estimator are you using? Why do you think Mplus uses listwise deletion? It does not.
Eric Deemer posted on Monday, September 29, 2014 - 12:45 pm
I am fitting contextual effect and multilevel mediation models to the same data set using the same variables. However, the number of observations used in the ML mediation analysis was 1722 but 1563 in the contextual effect analysis. I'm confused as to why this is the case. Could it be that FIML models the missing data better in the ML mediation model than in the contextual effect model? I'm using MLR as the estimator.
I'm running a mediation model with 3 x variables, 3 mediators, and 2 y variables. All are continuous with the exception of 2 x variables, which are dichotomous. I also include several (mostly dichotomous) control variables. I have virtually no missing data on the x, y, and mediators, but as much as 8% missing data in the control variables and would like to use FIML to use all cases in the sample (about 1650).
I have tried running 3 models: 1) using listwise deletion; 2) using FIML by naming all x variables (including controls) in the model statement; 3) using listwise deletion but also naming the x variables in the model statement.
For my 2 dichotomous x variables (key predictors), the estimates change substantially between models 1 and 2 but the estimates between models 2 and 3 are very similar. This suggests to me that naming the x variables in the model statement is changing the estimates.
Do you have any insight into why this might be (violation of the normality assumption?)? Is there anything I can try to figure out what is going on? I'd really like to use FIML if possible. Any help would be much appreciated.
Thank you for your response. It can't be sample size differences or missingness because the estimates change between model 1 (when I use listwise deletion and do not name the x variables in the model statement) and model 3 (when I use listwise deletion and name the x variables in the model statement). The samples are identical and the only difference is that in model 3 I name the x variables in the model statement.
Maybe a better question is: under what circumstances would naming the x variables in the model statement (treating them as endogenous variables) change the estimates in the model when there is no missing data? Thanks again.
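The three runs described above differ only in two switches, which can be sketched as follows. The file name, variable names (y1 y2, m1-m3, x1-x3, c1-c5), and missing-value code are hypothetical stand-ins for the poster's data:

```
DATA:     FILE = mediation.dat;
          LISTWISE = ON;         ! models 1 and 3; remove for model 2
VARIABLE: NAMES = y1 y2 m1-m3 x1-x3 c1-c5;
          MISSING = ALL (-999);
MODEL:
  m1-m3 ON x1-x3 c1-c5;
  y1 y2 ON m1-m3 x1-x3 c1-c5;
  x1-x3 c1-c5;                   ! models 2 and 3; remove for model 1
OUTPUT:   TECH1;
```

Because models 1 and 3 use the same cases, comparing their TECH1 output shows which parameters are added or changed by naming the x variables.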
I am estimating a multivariate negative binomial model with six dependent variables. Each accounts for the underlying population by constraining an exposure variable to @1, but each has a different underlying population (e.g., Black males, Black females, White males, etc.). For each regression model, there are cases where the exposure variable is equal to 0. I've set missing values, but Mplus excludes all cases where at least one of these values = 0 for each model (so I lose 155 cases), whereas I want it to exclude only the cases with 0 for that specific model (approximately 35 0s per model). Is there a way that I can do this?
When you say "dependent variables", do you mean independent variables? That is, the same as the "exposure variables" and Black, Female etc? Why do you fix at 1? Not sure what you mean by Mplus deleting cases = 0, unless that is the missing values flag. Perhaps you need to send to support along with your license number.
Thanks for your response. No, that's not what I meant. I'm sorry if it wasn't clear. I figured out a work-around though. Thanks!
Zehua Cui posted on Wednesday, May 09, 2018 - 9:16 pm
Hi Dr. Muthen, I have a question about bringing predictor variables in the model command to handle missing data. Do I need to bring all the predictor variables into the model including the ones that do not have missing data?
Also, by predictor variables, do I need to bring just my control variable (covariates) or do I need to bring both my main predictor variable and control variables (covariates)?
Zehua Cui posted on Thursday, May 10, 2018 - 9:24 pm
Hi Dr. Muthen, Thank you so much for clarifying this! I am a graduate student and still learning SEM. I actually have another question about Multiple Imputation. My question is that when doing MI to handle missing data, should I just impute independent variables with missing data or should I impute both independent and dependent variables with missing data? My problem is that when I impute both independent and dependent variables, my results are really different from when I use FIML. But when I just impute independent variables, my results are close to what I got when using FIML. Also, when I use MI in my data set, I still get about 200 cases (N=1354) that are excluded from the analysis because of missing on all variables except x variables. What should I do in this case? Would you recommend that I just stick to using FIML (bring all independent variables into the MODEL command)?