The two methods are asymptotically equivalent. In both cases, with that much missing data, you will be relying too much on model assumptions rather than the data. I don't think one approach is better than the other in this situation.
In CFA, when Estimator = WLSMV, Parameterization = THETA, and Listwise = OFF, what is the default method that Mplus (version 5) uses to deal with missing data: FIML or MI?
Specifically, is it possible to have WLSMV and FIML simultaneously in a single model, given that the former is a least-squares procedure that requires pairwise deletion while the latter is a maximum likelihood procedure that uses all available information? Please clarify, thanks.
With WLSMV, the method used when the model has no covariates is pairwise present. One can use ML with categorical outcomes and obtain FIML. The problem is each factor requires one dimension of integration when the factor indicators are categorical. Or one can impute data and use WLSMV with the imputed data sets.
Yes, actually I am running a MIMIC model, i.e., CFA with covariates. Quite a few factors have indicators (outcome variables) that are a combination of continuous and categorical variables, which is why I chose the WLSMV estimator. Also, about 15% of participants had missing values on at least one of the indicators. The MIMIC model was run on both the complete cases only (about 85% of participants) and the whole sample in order to assess whether missing values are an issue. So I am facing this question: in the situation specified above and in my previous post, what method is Mplus (version 5) using to address missing values, multiple imputation or FIML? If I understand your response correctly, it should be MI, but I would rather ask for your kind confirmation. Thanks.
With WLSMV, Mplus uses neither multiple imputation nor FIML. Without covariates, it uses pairwise present. With covariates, missingness is allowed to be a function of the observed covariates but not the observed outcomes.
What I suggest above is to use multiple imputation in Mplus to generate imputed data sets and then analyze these imputed data sets using WLSMV.
Sarah Ryan posted on Wednesday, March 09, 2011 - 7:02 pm
I am running a mediation model using a large federal data set- my N=8,000- using data from both parents and students. On student-level data, missingness is below 7% for all variables- but on parents, it ranges between 10% and 20%. In trying to decide between using MI and FIML, I've read Asparouhov & Muthen (2010) who write:
"When the data set contains a large number of variables but we are interested in modeling a small subsets of variables it is useful to impute the data from the entire set and then use the imputed data sets with the subset of variables. This way we don't have to worry about excluding an important missing data predictor" (p. 23).
If I understand correctly, this reason would apply in my research (there are several hundred student and parent variables in the data set).
Here's my dilemma, however. My understanding from Graham's work is that FIML and MI are only asymptotically equivalent when ALL of the same variables are used with both. Further, Graham recommends approximately 100 imputed data sets in order to achieve such equivalence.
MI with so many variables for this many data sets would be extremely intensive computationally- but to include only a subset of predictors seems to miss the point being made by A & M.
I would use the variables in your model plus any key variables beyond that which you think might be related to missingness. Most variables are not.
Sarah Ryan posted on Thursday, March 10, 2011 - 7:33 pm
Okay, thank you. Also, I should clarify that I will be using FIML in Mplus to estimate the full model; at this point, I'm just trying to figure out how best to approach missingness. I've also considered using auxiliary variables to assist in dealing with missingness using FIML. If the auxiliary variables were the same variables I would use as missing data predictors in MI, then the two approaches should be equivalent, is that correct?
This is for my dissertation work, so this is new territory for me. Thanks for your response.
You may want to take a look at Craig Enders 2010 missing data book (Guilford) if you haven't already.
Sarah Ryan posted on Thursday, April 21, 2011 - 8:23 pm
Following up here on the post above. I did go ahead and read Enders' book, which helped my understanding of MI immensely. I remain undecided. I have a rather complex model- several covariates (missingness on some, one reason making MI attractive), two latent exogenous constructs, several manifest exogenous measures, a latent mediator and continuous outcome. I will also use multiple group invariance testing. This is complex survey data and I'm using data from both parents and students. Student missingness ranges from 0% to 11%, but parent missingness ranges from 0% to 28% (16%-20% missingness on most parent-level varbs).
I know I would need to run MI for the full sample (includes all race/eth groups) as well as for each of the two subgroups of comparison (race-level comp). I also gather from reading on these boards that it would be best to fix parameter estimates for one of the MI sets (for full sample, and each subgroup) at the pooled imputation average AND THEN run the analysis model using FIML.
I'm wondering if, given the high levels of parent-missing and some missing on covariates, using MI would produce more accurate analysis model parameter estimates.
Do you have a stronger (convince me, please!) argument for why staying within the FIML framework (which would be simpler) would likely still produce estimates that are just as accurate, even given the missingness issues?
In my mind, the answer to your question largely depends on the scaling of your variables, particularly the covariates. My post is a bit long, so I have to split it into separate chunks.
First, suppose that all of your variables are continuously scaled, or approximately so. In this case, I think that the choice of FIML vs. MI is personal preference. Theoretically, there is no reason to expect noticeable performance differences (FIML might have slightly smaller SEs because MI uses a more complex saturated model to deal with missing data, but this difference is usually negligible). There are a couple nuances here. With FIML, you will need to take care of the missing data on the covariates (you seem to be aware of this issue). Suppose you have two covariates, X1 and X2. You would do this as follows:
X1 X2; X1 with X2;
As I understand it, explicitly listing X1 and X2 effectively makes the covariates single indicators of a latent construct -- a programming trick that converts the Xs to Ys while still maintaining the exogenous status of the variables in the model. The missing data on your outcomes would be automatically handled by FIML.
Same situation -- continuous variables. Turning to MI, the imputation process is a bit simpler than you describe. You would impute the data separately for your ethnic groups -- there is NO need to then impute the data for the whole sample. Separate-group imputation automatically fills in the missing values in a way that preserves mean differences among the ethnic groups and all group by variable interactions in the data. Said differently, you would have filled in the data using the most general model possible, and the set of imputations that you get from this procedure would be appropriate for all your analyses (the imputation routine would need to include all covariates, outcomes, etc.). I'm not sure what you are referring to when you talk about fixing the estimates at the pooled average, then running the model using FIML. I *think* you might be describing the method for pooling likelihood ratio tests, but that would not be necessary -- Mplus reports the pooled chi-square. On my website (appliedmissingdata.com) I have an example of separate-group imputation in Mplus 6.
Next, suppose that one of your covariates is binary. I think that the situation becomes murkier here. Take that programming trick that lists the variances of the X variables and their covariances. Again, as I understand it, this would make the covariate a single indicator of a latent construct. However, it isn't so clear to me that this is appropriate with a binary covariate, because you would be assuming normality for the covariate, and employing a linear factor model to convert the X to a pseudo Y might cause problems. The first problem is more conceptual; the programming trick does not produce a model that is a one to one translation of the complete-data analysis. Whether this introduces bias, I don't know. Second, I could imagine that Mplus would issue a warning about the standard errors because the mean and the variance of the binary variable are linearly dependent when you use the linear factor model to handle the missing data. I'm not sure that either of the problems is substantial ...
MI would provide a useful alternative. Mplus allows you to specify variables as categorical or continuous in the imputation model. In the case of an incomplete categorical covariate, imputation would use a logistic regression to fill in plausible values. The MI procedure would be identical to what I describe above (impute separately for each ethnic group). You would simply use the (c) option to denote categorical variables.
Continuing on ... Finally, if your outcomes/indicators are not continuous -- say Likert items -- then the choice isn't all that clear to me. FIML would assume multivariate normality, as you know. The missing data handling probably won't behave any differently than complete-data ML. With MI, you have two options: (a) use linear regression to impute the incomplete variables, or (b) use logistic regression to impute. The former assumes normality and would produce fractional imputed values (not a problem beyond aesthetics, you would not want to round these). The latter does not assume normality and would produce discrete imputations. I know of no studies that have compared these two imputation approaches, but I suspect that the logistic imputation might lead to larger SEs because the multinomial model would have more parameters than a linear model. I would probably still go with FIML and MLR standard errors, but the choice isn't so clear cut.
The other thing to consider is your model testing procedures. It sounds like you plan to perform likelihood ratio tests. FIML is probably going to be easier to deal with here. Mplus computes pooled LR tests, but I'm not sure if there is a way to automate the computation of these tests when you are comparing fit between models from two separate analysis runs (e.g., your invariance tests). If you have to compute the LR tests by brute force, it would be time consuming.
Thanks Craig. I believe this post was interesting and useful for many of us.
Sarah Ryan posted on Friday, April 22, 2011 - 3:58 pm
I am more appreciative than you could know for your response above, and the time you took to offer some insight. Let me add some thoughts on a few of your points.
Sep. group imputation: The full sample includes four racial/ethnic groups, but I will test for model invariance between only two of those groups. Is it still the case, then, that I would not need to impute for the full sample?
Fixing estimates at pooled average: In the discussion board on "MODEL INDIRECT and MISSING," Linda Muthen offers this advice if one is trying to examine indirect effects with imputed data (from what I understand, this is advised because otherwise Mplus will not provide indirect effects with imputed data): "With multiple imputation, you can fix all parameters at the average value given in the imputation run and them run the analysis with MODEL INDIRECT using one of the imputation data sets and no IMPUTATION. It does not matter which data set because nothing is estimated."
(See next post for the rest of my message)
Sarah Ryan posted on Friday, April 22, 2011 - 4:20 pm
Varb. scaling: None of the covariates with missingness are binary, but some are categorical (ordinal, with 3 to 7 levels). This is true of many of the model variables (ordinal, although those with 5+ levels could perhaps be treated as continuous). What I understand from a response I got from L. Muthen on choice of estimator is that I can use MLR with binary/categorical variables as long as there are four or fewer factors in the model (because numerical integration is required to estimate the logistic regressions, which becomes unwieldy with more than four factors). So, with any categorical covariates, ML would use logistic regression to estimate the single-indicator latent construct (the "programming trick").
Model testing: Yes, I do plan to use LR tests. Sounds like the model invariance testing would be pretty intensive if I went with MI.
Finally, from your response, I'm inferring that % missingness on any given variable is not as critical as the factors you discussed when deciding between FIML and MI as one's approach to missing data in an analysis model similar to mine. If that is the case, then, weighing all of this, FIML is looking like the way to go here.
Either way, reading your 2010 book has left me feeling much more confident in my knowledge about MI and how I would go about it, if not in the dissertation, then in the future.
I might have misunderstood your original description when you were referring to imputing the entire sample. I thought you were referring to a situation where you (a) impute without regard to ethnicity, then (b) re-impute separately by ethnicity. This is what you would want to avoid because you could not compare analysis models that differ with respect to the imputation model. In terms of separate-group imputation, I would still impute separately within each of your four groups, even though you only plan to compare two groups in your invariance analyses (assuming a sufficient N within each group). This would allow each group to have its own mean vector and covariance matrix. There is nothing lost by imputing the data using a richer set of variables/associations than what you have in the subsequent analyses. Doing it this way would give you imputations that could serve for all of your subsequent analyses.
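To illustrate why separate-group imputation preserves group differences, here is a minimal Python sketch outside Mplus. It is only a toy under loud assumptions: the data are made up, and a deterministic mean fill stands in for the imputation model, whereas real multiple imputation draws random values from a predictive distribution. The point is only that a fill model that ignores group membership pulls the groups toward each other.

```python
from statistics import mean

def fill(values, fill_value):
    """Replace missing entries (None) with fill_value."""
    return [fill_value if v is None else v for v in values]

# Hypothetical scores for two groups; None marks a missing value.
group_a = [10.0, 12.0, 14.0, None, None]
group_b = [20.0, 22.0, 24.0, None, None]

obs_a = [v for v in group_a if v is not None]
obs_b = [v for v in group_b if v is not None]

# (1) Fill model that ignores group membership: one pooled mean for everyone.
pooled_mean = mean(obs_a + obs_b)
diff_pooled = mean(fill(group_b, pooled_mean)) - mean(fill(group_a, pooled_mean))

# (2) Separate-group fill: each group gets its own mean.
diff_separate = mean(fill(group_b, mean(obs_b))) - mean(fill(group_a, mean(obs_a)))

print("observed difference:      ", mean(obs_b) - mean(obs_a))
print("after pooled fill:        ", diff_pooled)
print("after separate-group fill:", diff_separate)
```

In this toy case the pooled fill shrinks the group difference, while the separate-group fill leaves it at its observed value, which is the property described above for group means.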
Continued ... The mediation effect. To get around the lack of MODEL INDIRECT with imputed data, why don't you instead use the MODEL CONSTRAINT command, as follows:
m on x (a); y on m (b); y on x (cprime);
new(ab); ab = a*b;
Provided that your a and b paths are linear regressions, this would give you what you want. I'm not sure I completely understand the constraint part that you were describing, but you would want to estimate the mediated effect, then average the m estimates. The MODEL CONSTRAINT option would give you that.
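For readers who want to see what "averaging the m estimates" amounts to, here is a small Python sketch of pooling by Rubin's rules. The ab estimates and standard errors below are hypothetical numbers, not output from any real analysis, and the function name is made up for illustration; with imputed data sets Mplus reports pooled results itself, so this is only meant to show the arithmetic.

```python
from math import sqrt
from statistics import mean, variance

def pool(estimates, std_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # variance across imputations
    total = within + (1 + 1 / m) * between         # total variance
    return q_bar, sqrt(total)

# Hypothetical ab = a*b estimates and their SEs from m = 5 imputed data sets.
ab_estimates = [0.10, 0.12, 0.11, 0.09, 0.13]
ab_std_errors = [0.03, 0.03, 0.04, 0.03, 0.03]

ab_pooled, ab_pooled_se = pool(ab_estimates, ab_std_errors)
print("pooled ab =", round(ab_pooled, 4), "pooled SE =", round(ab_pooled_se, 4))
```

The pooled SE is larger than the average within-imputation SE because the between-imputation term carries the extra uncertainty due to the missing data.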
Finally, the percentage of missing data would have no bearing on your decision. All things being equal, MI and ML are asymptotically equivalent. MI uses Monte Carlo simulation to average across a distribution of plausible replacement values for the missing data, whereas ML essentially uses calculus to do the same thing analytically. Although the procedures look very different, they are in fact doing the same thing. MI uses a saturated model (typically, although not necessarily) to handle the missing data, whereas ML uses a more parsimonious model that doesn't spend all the degrees of freedom (at least in your example). This might produce tiny differences in SEs, but there is no other reason to expect ML and MI to differ.
The model testing part certainly favors ML. The pooled LR test in MI is a bit of an unknown because simulation studies have not thoroughly assessed its performance. It's probably safe to say that most of what is known about the LR test in complete-data ML also applies to missing data.
Sarah Ryan posted on Friday, April 22, 2011 - 8:52 pm
No, you did understand about group imputation...I didn't, at first! Now it makes sense. I also looked at the examples and code you give in the 2011 article in SEM journal on SGI, which made it even more clear.
Thanks also for the MODEL CONSTRAINT advice. That's quite helpful.
It seems like the research on how/when FIML and MI differ in approaching missing data is still emerging. I'm also learning that one has to weigh many different factors about the analysis model and data when deciding between the two. In trying to wrap my head around this, I've done a lot of reading. Some authors have strong convictions about always using one (or the other) to deal with missingness. My sense now is that it's not that simple, and that the researcher needs to make that decision based on the investigation at hand.
WLSMV uses pairwise present when the model does not have covariates.
Elina Dale posted on Thursday, October 10, 2013 - 2:21 pm
Sorry, but what does it mean in practice? Say I have 4 factor indicators and one of them (y1) has missing values. Does pairwise present look at missing values between y1 and y2, y1 and y3, etc.? If y1 is missing some observations but y2 is not, does it impute y1? Thank you!
This has nothing to do with imputing anything. It means that the correlations in the matrix that is analyzed are computed using the number of people who do not have missing data on the variables involved.
Elina Dale posted on Thursday, October 10, 2013 - 10:53 pm
So, if they are not imputed, does it mean that if person A has a missing value for y1, which is one of the four observed factor indicators, his response to y2-y4 will also be deleted?
Thus, if my original sample had 200 people, and 10 of them had missing values on just 1 out of 4 items of the factor, my CFA model will have 190 observations only?
This sounds like listwise deletion, and I thought Mplus was better at handling missing values in item scales or factor indicators.
It is not listwise deletion. Missing data is looked at for pairs of variables -- pairwise present. So if 50 people have non-missing values for y1 and y2, 50 observations are used to compute the correlation between y1 and y2. If 70 people have non-missing values for y1 and y3, 70 observations will be used to compute the correlation between y1 and y3. Etc.
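The pair-by-pair counting described above can be sketched in a few lines of Python. This is an illustration only, with made-up data and a hypothetical function name. It uses ordinary Pearson correlations for simplicity, whereas the correlations Mplus analyzes for categorical indicators are of a different type; what carries over is how the number of observations is determined for each pair.

```python
from itertools import combinations
from math import sqrt

def pairwise_present(data, names):
    """Return {(a, b): (n_used, r)}, using for each pair of variables
    only the rows where both are observed (None marks a missing value)."""
    result = {}
    for i, j in combinations(range(len(names)), 2):
        pairs = [(row[i], row[j]) for row in data
                 if row[i] is not None and row[j] is not None]
        n = len(pairs)
        mx = sum(x for x, _ in pairs) / n
        my = sum(y for _, y in pairs) / n
        sxy = sum((x - mx) * (y - my) for x, y in pairs)
        sxx = sum((x - mx) ** 2 for x, _ in pairs)
        syy = sum((y - my) ** 2 for _, y in pairs)
        result[(names[i], names[j])] = (n, sxy / sqrt(sxx * syy))
    return result

# Hypothetical toy data: seven people, three indicators (y1, y2, y3).
data = [
    (1.0, 2.0, 1.0),
    (2.0, 1.0, None),
    (3.0, 4.0, 2.0),
    (None, 3.0, 4.0),
    (5.0, 6.0, 3.0),
    (4.0, None, 5.0),
    (6.0, 5.0, None),
]

table = pairwise_present(data, ["y1", "y2", "y3"])
for pair, (n, r) in table.items():
    print(pair, "n =", n, "r =", round(r, 3))

# For comparison: listwise deletion keeps only fully observed rows.
listwise_n = sum(all(v is not None for v in row) for row in data)
print("listwise n =", listwise_n)
```

Each correlation here is based on a different subset of people, and every subset is larger than the set of complete cases that listwise deletion would keep.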
For FIML use maximum likelihood estimators.
Elina Dale posted on Friday, October 11, 2013 - 5:37 pm
Thank you so much!!! Now I understand it.
What about when we have covariates, but our items are still categorical variables? So, we still use WLSMV estimator.
The Mplus guide says "For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes."
So, does it mean that when we have covariates, for example sex and age, Mplus imputes the missing values for scale items based on these two variables?
I am not sure I understand what "is allowed to be a function" means.
For ex, we have 100 observations. In a scale for anxiety with four items (y1-y4), y1 is missing 10 values but there are no missing values for sex and age which are used in the model to predict anxiety as measured by y1-y4. Does it mean the n for the model will be 100?
I apologize for these questions and I really appreciate your guidance!
I would like to know what approach MLR is using with missing data. In my output, all cases are being used since the number of observations is similar to the total number of cases, yet MLR is not the same as FIML, is it? How does the approach of MLR compare to FIML or multiple imputation?
All maximum likelihood estimators use what people refer to as FIML for missing data. FIML and multiple imputation are asymptotically similar techniques.
Nara Jang posted on Thursday, February 26, 2015 - 12:58 am
Dear Prof. Muthen,
I used FIML imputation (i.e., !listwise=on; & !missing = all(999);) for both CFA and SEM with an interaction involving a latent variable (i.e., testing a moderator), as well as for descriptive statistics.
Regarding the results of the descriptive statistics, the estimated variances of all continuous variables with missing data were influenced by the value 999 used to mark missing data. In addition, the maximum values were 999 for all continuous variables with missing data. The following are my questions. Any advice will be greatly appreciated!
First, would you tell me if it is correct to use FIML imputation (i.e., !Listwise=on; & !missing=all(999);) for descriptive statistics? Second, if it is correct, would you tell me how to report the variance and maximum values of continuous variables with missing data? Third, would you tell me if there is a way to identify the replaced values instead of using missing data? Fourth, would you tell me how to report the descriptive statistics when using FIML imputation in SEM analyses?
I deeply appreciate your help in advance!