I added control variables to my model and got the following message. I wanted to use full information, so I defined all of the variables as having missing data. It seems that the program deleted cases using listwise deletion. Because of the following warning, I could not see the model estimation results.
Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 94
1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
What can I do to see the results of the model estimation?
The only way to avoid the listwise deletion of covariates is to bring them into the model as dependent variables. You can do this by mentioning their variances in the MODEL command. You then make distributional assumptions about them.
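As a minimal sketch of what "mentioning their variances" can look like in a complete input file: the file name, the variable names y, x1, x2, and the missing-value code below are all hypothetical and not taken from the original poster's model.

```mplus
TITLE:    Sketch of keeping cases with missing covariates;
DATA:     FILE IS mydata.dat;       ! hypothetical file name
VARIABLE: NAMES ARE y x1 x2;        ! hypothetical variable names
          MISSING ARE ALL (-999);   ! hypothetical missing-value code
ANALYSIS: ESTIMATOR = ML;
MODEL:    y ON x1 x2;               ! regression of the outcome on covariates
          x1 x2;                    ! variances mentioned, so x1 and x2 are
                                    ! treated as dependent variables
```

Listing x1 and x2 by themselves refers to their variances. With that line present, cases with missing values on x1 or x2 should no longer be dropped, at the cost of a distributional assumption about the covariates.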
nina chien posted on Friday, September 04, 2009 - 1:58 pm
To avoid listwise deletion of covariates, I added the covariates into the model by mentioning their variances. This resulted in a series of WARNING statements that some of my covariates are categorical (which they are): “WARNING: VARIABLE GENDERR MAY BE DICHOTOMOUS BUT DECLARED AS CONTINUOUS.”
So, I tried stating covariates as categorical, but this resulted in error statements that the covariates are not DV’s: “ERROR in VARIABLE command. CATEGORICAL option is used for dependent variables only. GENDERR is not a dependent variable.”
I'm very confused because the WARNING and ERROR messages are contradictory. Can I ignore the warning statements?
Also, the model where covariates were declared as continuous did not converge. Is this related to the WARNING?
You should not put covariates on the CATEGORICAL list. The message comes about because the mean and variance of a dichotomous variable are not independent of each other. I would need to see the full output and your license number at firstname.lastname@example.org to say if this is related to convergence.
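The dependence between mean and variance can be written out. For a dichotomous variable x coded 0/1 with P(x = 1) = p (the symbol p is introduced here only for illustration):

```latex
\mathrm{E}(x) = p, \qquad
\mathrm{Var}(x) = p\,(1 - p) = \mathrm{E}(x)\,\bigl(1 - \mathrm{E}(x)\bigr)
```

The variance is fully determined by the mean, whereas treating the variable as continuous and normal estimates the mean and the variance as two separate parameters.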
I have been asked by a reviewer to provide a reference for exactly what Mplus is doing when bringing covariates with missing data into the model as dependent variables by steps such as mentioning their variances in the MODEL command (and thus making distributional assumptions about them). If you have any recommendations for this, I would greatly appreciate it.
I don't know of a reference to this exactly. When you bring the x's into the model, multivariate normality is assumed for the x's and all continuous y's. This assumption is discussed in any structural equation modeling book.
Missing data theory does not apply to observed exogenous variables. The model is estimated conditioned on them. You can bring all of the covariates into the model by mentioning their variances in the MODEL command. Then they will be treated as dependent variables and distributional assumptions will be made about them.
Thanks Linda. I have run the analysis, and realise that if I bring the exogenous variable with missing data into the model, df becomes higher. I am wondering if this would pose any issue with explaining to readers how the model was specified. How do people usually justify the use of this method?
I am running a mediational model with two covariates (relationship status and Greek membership). I have missing values on these. I specified type=missing with MLR. Then I entered the covariates in the MODEL command and Mplus uses the whole sample. I understand that missing data theory does not apply to exogenous observed variables. Compared with the run where I did not include them in the model, there were no substantive changes except in one of my hypothesized effects. I want to keep the whole sample, but I am wondering what brought about this change. Can I trust the output with the observed variables included in the model?
When you mention the means, variances, or covariances of the exogenous variables in the MODEL command, they are treated as dependent variables and distributional assumptions are made about them. This can change the results if not all variables are continuous. See the Version 6.1 version history on the website for a description.
When I am running a one-factor CFA model with continuous data, I get the following error. "Data set contains cases with missing on all variables. These cases were not included in the analysis. Number of cases with missing on all variables: 2"
Previously, I have run the same model without getting the error message.
I have checked whether the error is due to misreading the input file, but I still cannot find its source.
I have several questions regarding logistic regression with missing values. My data set contains 1000 cases, 300 of them with missing values. y and x1 are both binary and have missing values; x2-x4 are continuous and completely observed. I want Mplus to use all cases, including those with missing values. In order to do that I mentioned the variance of x1 and ran the following model:
ANALYSIS: estimator = ml; integration = montecarlo;
MODEL: y on x1 x2 x3 x4;
x1;
My questions are as follows:
1. When I mention the variance of x1 Mplus uses 950 of 1000 cases. If I specify one further independent variable (no matter which one), all cases are included. What I don’t understand is that the coefficients of the model vary depending on which variance I choose to mention. This observation confuses me and I am not sure which variance I should mention.
2. A more general question concerns predicting probabilities. Is it right that it is not possible to predict probabilities when there are missing values on an independent variable using ML estimation?
3. Is the given R square the McKelvey & Zavoina’s R square – and can I trust the value although I have missing values on the independent variable or does this bias the R square-computation (as I would guess)?
1. You should include the variances of all of the covariates or none of the covariates. When you include, for example, y1 and y2, they are correlated, and y3 and y4 are correlated because the model is estimated conditional on them. But there are no correlations between y1, y2 and y3, y4. The zero correlations vary depending on which variances you include in the model.
2. I don't think this can be done.
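Applied to the input in the question, the advice in point 1 might look like the following sketch. It assumes the same variable names as the question and changes only the variance line.

```mplus
ANALYSIS: estimator = ml; integration = montecarlo;
MODEL: y on x1 x2 x3 x4;
x1 x2 x3 x4;   ! variances of all four covariates, not only x1
```

Because x1 is binary, this specification can be expected to produce the "dichotomous but declared as continuous" warning discussed earlier in the thread.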
Hanna Esser posted on Wednesday, September 03, 2014 - 6:13 am
Dear Mplus team,
I have a question regarding missing values in my path model.
Variables x1-x6 are exogenous. Four of them are binary (gender and yes/no variables), two are continuous. Variables y1-y7 are endogenous. y1-y6 are ordinal, y7 is binary (yes/no). y1-y6 are also exogenous, because I compute regressions on y7.
My current sample size is 4009 and I would like to use the whole sample for my analysis. Unfortunately, there are missing data in variables x2-x6. I assume that they are MCAR. When I do the analysis without extra assumptions, there are 1031 cases with missing on x-variables which get excluded from my analysis. Most of the missing data are on the variable “income” because many refused to give this information and on the variable “Parents without qualification” (yes/no).
By reading old discussions I found out that when I mention the variances, no listwise deletion happens. This is the case in my analysis; it includes all 4009 cases when I do this, but I get the warning that my exogenous variables x1-x4 are dichotomous but declared as continuous (which is true).
What can I do to include all cases and avoid the listwise deletion of 25% of my cases?
Shiny7 posted on Monday, September 29, 2014 - 3:37 am
Dear Drs. Muthen,
I ran a multilevel model with 7 x-variables and one continuous outcome using MLR.
As known, Mplus does not include cases with missings on all x-Variables, which in my case are many, many cases.
Am I right that the only solution is to mention the variances of the x-Variables in the model command? Although the assumption that my data are multivariate normally distributed is in fact not given (!) (using MLR therefore).
A further problem is that I have only 21 clusters, and if I bring the variances of the x-variables into the model, the number of parameters exceeds the number of clusters, which is also a known problem.
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS 0.206D-20. PROBLEM INVOLVING THE FOLLOWING PARAMETER: Parameter 65, GXNEIGH.
I thought maybe this was a degrees of freedom issue but the "number of free parameters" is 65. What is the issue here, and is there another way to get the program not to listwise delete on X?
This is likely due to one or more of your predictors being binary. The mean and variance of a binary variable are not orthogonal and this can trigger the message. Comment out the means. If the message disappears, you can put them back and ignore the message.
I am running a logistic regression with a binary dependent variable (MLR estimation). I have both continuous and binary independent variables. Both the dependent and independent variables have missing data.
1) Should I mention the variances only for independent variables with missing data, or should I mention the variances for all independent variables?
2) Is the procedure of mentioning variances also valid for categorical (binary) independent variables?
3) Is there anything further I should do to address the missing values of Y?
1. You must mention all of the variances of the covariates or none of them. If you mention only some, the model is estimated with zero covariances among the ones you mentioned and the ones you did not mention.
2. Yes, for maximum likelihood estimation but not for weighted least squares estimation. Note that each variance mentioned requires one dimension of integration.
3. Missing data theory does not apply to a single dependent variable. If you bring in the covariates, they are treated as dependent variables so you have more than one. Distributional assumptions are then made about the covariates.
Thank you, Dr. Muthen, for such a quick response! I was hoping you could clarify one more thing. Here is the model portion of my input:
MODEL: y1 on x1 x2 x3 x4;
x1; x2; x3; x4;
You mentioned in #3 above that when I bring the covariates into the model, they are treated as dependent variables; thus, I would have more than one dependent variable. Does this mean that I am currently estimating missing data for both the x and y variables based on my current input? I ask this because the number of observations being used in the analysis seems to reflect this. Thank you for your help.
I have received a warning on a model that is measuring the relationship of a predictive composite model on a latent model and a measured outcome:
*** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 21
1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
You indicated in a previous post: "The only way to avoid the listwise deletion of covariates is to bring them into the model as dependent variables. You can do this by mentioning their variances in the MODEL command. You then make distributional assumptions about them."
Is the code to mention the variances in the MODEL command included in one of your workshop handouts or in the users guide? I cannot find any coding instructions.
You said in an earlier post that when you mention the variances of the covariates (in order to estimate x-side missing data), they are treated as dependent variables and distributional assumptions are made about them. Does that mean they must be normally distributed? If so, how robust would an analysis like this be to violations of normality? I am using multinomial logistic regression (y has 3 levels) and my x variables have a strong positive skew with a high frequency of zeroes, although they are not zero inflated. Is there anything else you can recommend that would be more appropriate? I am hesitant to delete cases with missing x variables as this seems to affect the means and standard deviations quite a bit. Thank you.
It is not clear in general how robust ML-MAR is to strong non-normality, but take a look at section 4.7 of the following paper on our website which was just published in the SEM journal:
Asparouhov, T. & Muthén, B. (2014). Structural equation models and mixture models with continuous non-normal skewed distributions. Web note 19. Version 2.
So if your x variables are truly continuous variables, not discrete, and have no floor or ceiling effects, you could use the skew-t distribution for multiple imputation in a first step, followed by the logistic regression.