I am fitting a CFA with 2 factors, 20 ordinal indicators, and missing data in Mplus version 3. Although I don't fully understand the version 3 missing data methods for categorical indicators, I think I prefer the EM ML approach over WLSMV with pairwise deletion.
I've read the descriptions of the ML, MLR and MLF estimators on p. 401 of the manual, but I don't know how to pick which one to use. Do you have any suggestions for how to choose?
In my RCT, our participants are assessed at waves 1, 3, and 4. Missingness was not a problem from wave 1 to wave 3. But now, at wave 4 (follow-up), all the "problematic" children in the control group have been lost, leaving us with a "super control group" to which our experimental group is being compared. Both the wave 1 and wave 3 problem scales predict drop-out in the control group (the greater the problems, the more likely the children are to drop out), but they are unrelated to drop-out in the experimental group.
Please help me with the specific Mplus command that takes this "uneven missingness" into consideration. Do I have to apply weights to anything? If so, how do I calculate the weights?
Hello. I am analyzing decline in word recall scores in an elderly population over a decade, with measurements taken every two years. I am attempting to use a Wu-Carroll selection model to control for the relationship between the outcome and dropout. All individuals were alive at the baseline (t0) measurement. When I regress dropout (t1-t5) on intercept and slope, the model does not converge. When I regress dropout (t2-t5) on intercept and slope, the model converges. Why would the inclusion of dropout at t1 cause problems? Thank you. The dropout indicators are coded as dd(t) = 0 if observed, 1 if dropout at time t, and -99 if dropout at a previous time.
Hi Linda, For dd98, there were no missing cases. For dd00, 90% were observed (0's) and about 10% of cases were missing (1's). For dd02, 77% were observed (0's), 13% went missing (1's), and 10% were missing from previous wave (-99's).
Hi Linda, I am not using the dropout indicators for the first wave of measurement (taken in 1998). I am attempting to use the dropout indicators from 2000-2008, which I was successfully able to do using the Diggle-Kenward selection model. When attempting to use the Wu-Carroll selection model, I was not able to get the model to converge using dropout indicators from 2000-2008. However, when I did not include the dropout dummy for 2000 and only included dropout indicators from 2002-2008, the model successfully converged.
Hello, I am currently using a number of NMAR models (Diggle-Kenward, Wu-Carroll, and pattern mixture) to control for non-random missing data on the Y variable in my study. When implementing NMAR models, what options do I have for handling missing data on my X variables? I am losing a large number of observations because respondents have Y observations without X observations and the missing data on X seems to be handled through listwise deletion. Thank you.
You could use DATA IMPUTATION to impute values for the missing x's or you could include the variances of the x's in the MODEL command. Doing this means they are treated as dependent variables and distributional assumptions are made about them.
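As a minimal sketch of the second option, assuming a hypothetical linear growth model with outcomes y1-y4 and covariates x1 and x2 (all variable names and time scores here are invented for illustration and are not taken from the poster's model), mentioning the variances of the x's in the MODEL command might look like this:

```
MODEL:
  i s | y1@0 y2@1 y3@2 y4@3;   ! hypothetical linear growth model
  i s ON x1 x2;                ! growth factors regressed on the covariates
  x1 x2;                       ! mentioning the variances of the x's
                               ! brings them into the model
```

With the x's in the model, cases that have missing values on x1 or x2 are no longer dropped, at the cost of the distributional assumptions described above.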
Thank you Linda. When I include the X's with missing data in the model command, I receive an error "THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY...". The warning points me to the PSI matrix diagonal for one of the dichotomous x's that I included in the model command (it was not defined as categorical). Do you think the problem is arising due to the use of the binary variable in maximum likelihood estimation? I can also send my information if that will help. Thanks.
The mean and variance of a binary variable are not orthogonal and can generate a message about non-identification when the variable is included in the model and is not identified as categorical. I can't say more without seeing the full output; please send it to email@example.com.
I have estimated a model with complete data on the predictor side (7 indicators, a nested-factor model, N = 1187) and a lot of missing data on the criterion side (11 indicators, a g-factor-model, N = 79).
The standardized factor loadings from my 11 indicators on their latent variable on the criterion side (where data is only available for N = 79) are really high in the SEM, mainly between .90 and .99. When I just estimate a g-factor-model CFA for those 79 people with those 11 indicators, the standardized factor loadings range between .56 and .83.
Why do the estimates change so much? What is the best way to deal with this problem?
The 79 people are a subsample of those 1187. The mean of those 79 on the seven indicators on the predictor side is much higher because a majority of them were selected on the basis of their scores on those indicators. Variances are similar.
Since I only have data on the criterion side for those 79 people I can't estimate sample statistics for the indicators on the criterion side for the other ones.
So the different means on the indicators on the predictor side are the reason why factor loadings on the criterion side change so much in the SEM?
Are my estimates reliable? What would be the best way to handle this situation?
The analysis of the criterion variables only (for n=79) is a submodel of the full model that includes the predictor variables. If your model estimates are very different for the criterion variables in the submodel and full model analyses, this probably means that the assumptions of the full model don't fully hold - in particular the covariances between the predictor variables and the criterion variables may not be captured fully by their sets of factors being related. Look for big modification indices in the full model analysis.
Because you select the n=79 based on the predictor variables, it would seem that MAR may hold so in principle the full model analysis should give you the right answer.
So far, I've used the option for auxiliary variables (m) where Mplus does not give Modification indices.
Hence, I decided to drop the auxiliary option in order to get Modification indices. Now, for my measurement model on the criterion side, this leads to "normal" factor loadings for my 11 indicators. However, one of my path coefficients from the two latent variables which are the predictors changes a lot - from .59 to .81 (the other one remains more or less the same) - this seems pretty unrealistic and I'm not sure what to do with this result. In this model, the highest ModIndices are for residual correlations on the predictor side (where there is no missing data) - not for covariances between predictor and criterion variables.
The factor loadings that changed were already the standardized factor loadings. So far, I've fixed my factor variances to one in all models in order to identify the model. Would you suggest estimating those variances freely and fixing the factor loading for the first indicator at 1, and then checking whether the variance of the factor on the criterion side changes from the measurement model (n = 79) to the structural model (N = 1187)?
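For reference, the two ways of setting the metric of a factor mentioned above can be sketched as follows. The factor name g and the indicator names y1-y11 are placeholders rather than the actual variables in this analysis, and only one of the two alternatives would appear in a given input:

```
! Alternative A: factor variance fixed at one, all loadings free
MODEL:
  g BY y1* y2-y11;   ! the asterisk frees the loading of y1
  g@1;               ! variance of g fixed at one

! Alternative B: the Mplus default, first loading fixed at one
MODEL:
  g BY y1-y11;       ! loading of y1 is fixed at one by default
  g;                 ! variance of g is estimated
```

Note that in a structural model where g is regressed on other factors, g@1 and g refer to the residual variance of g rather than its total variance, which may matter when comparing the measurement model with the structural model.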
I'm attempting to impute data from a cross sectional cohort sequential study.
I received the following message in the output.
PROBLEM OCCURRED DURING THE DATA IMPUTATION. YOU MAY BE ABLE TO RESOLVE THIS PROBLEM BY SPECIFYING THE USEVARIABLES OPTION TO REDUCE THE NUMBER OF VARIABLES USED IN THE IMPUTATION MODEL. YOU MAY ALSO BE ABLE TO RESOLVE THIS PROBLEM BY INCREASING THE NUMBER OF ITERATIONS USING THE THIN OR BITERATIONS OPTIONS OF THE ANALYSIS COMMAND. SPECIFYING A DIFFERENT IMPUTATION MODEL MAY ALSO RESOLVE THE PROBLEM.
I attempted to decrease the number of variables used to impute the data, and received the same message.
I would be grateful for some guidance on how to proceed.
As a first step, take a look at the Version 7 UG ex 11.5, page 397, and its use of USEVARIABLES, AUXILIARY, and IMPUTE. As a second step, have a look at the 14 practical tips in Section 4 of the paper on our website:
Asparouhov, T. & Muthén, B. (2010). Multiple imputation with Mplus. Technical Report. Version 2.
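A minimal sketch of how USEVARIABLES, AUXILIARY, and IMPUTE can fit together in an imputation run is given below. It is only an illustration: the file name, variable names, missing-value code, and number of imputed data sets are all assumptions and would need to be replaced with your own.

```
DATA:
  FILE = mydata.dat;            ! hypothetical file name
VARIABLE:
  NAMES = y1-y6 x1 x2 z1-z3;    ! hypothetical variable names
  USEVARIABLES = y1-y6 x1 x2;   ! variables used in the imputation
  AUXILIARY = z1-z3;            ! not used in the imputation, but saved
                                ! with the imputed data sets
  MISSING = ALL (-999);         ! assumed missing-value code
DATA IMPUTATION:
  IMPUTE = y1-y6 x1 (c) x2;     ! (c) marks x1 as categorical
  NDATASETS = 20;               ! number of imputed data sets
  SAVE = imp*.dat;              ! asterisk is replaced by the data set number
ANALYSIS:
  TYPE = BASIC;
```

Starting from a small USEVARIABLES list like this and adding variables gradually can help locate which variables trigger the imputation problem.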
I have an overall sample of 1600 which has missing data throughout. I am using the estimator MLR for growth modelling. I get these warning messages in the output:
*** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 772
*** WARNING Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis. Number of cases with missing on all variables except x-variables: 34
Would it be possible to get some clarification on what MLR does in relation to missing data and what these warning messages mean?
MLR handles missing data under the MAR assumption, that is, using what is commonly called "FIML". That applies to people with missing on some but not all DVs, in the subset of people who don't have any missing on IVs and don't have all DVs missing.
If you want to include those with missing on IVs, you can mention those variable names in the MODEL command, thereby extending the model and making stronger assumptions, namely normality, for the IVs.
OK, thank you very much. I have added this and the missing data are being estimated, but now I get the following warning:
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS 0.479D-11. PROBLEM INVOLVING THE FOLLOWING PARAMETER: Parameter 35, APFY9
It names a different variable in each model. Is it possible to find out what this could mean? Many thanks
Dear Dr Muthen, I am reading the Muthen & Muthen (2002) paper and I have a basic question. I generated 200 complete data sets with the Mplus simulation facility, used an external program to create MAR missingness in the same data sets, and ran an SEM in Mplus on the data with and without missing values. The following pattern was seen: 1) The same sample size is used in the two analyses. This is expected because an inspection of the data showed that no individual has missing values on all variables. 2) The log likelihood, however, differs, with a lower mean and a smaller SD for the complete data (e.g., M = -15476.595, SD = 48.47 for the complete data and M = -14471.77, SD = 71.57 for the incomplete data). I was expecting the log likelihood to be different, because different information is used despite the equal sample size. However, I wasn't expecting consistently higher log likelihood values for the incomplete data (I repeated this experiment with different models and the same pattern is always observed). How can we explain this result? Could it be that missing information introduces a level of uncertainty in the data, which results in variability in the log likelihood? Do you have any other explanation? Thank you for your help. Regards, Sam