I added control variables to my model and received the following message after the addition. I wanted to use full information, so I defined all of the variables with missing data. It seems that the program deleted cases using listwise deletion, and because of the following warning I could not see the model estimation results.
Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 94
1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
What can I do to see the results of the model estimation?
The only way to avoid the listwise deletion of covariates is to bring them into the model as dependent variables. You can do this by mentioning their variances in the MODEL command. You then make distributional assumptions about them.
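As an illustration of this advice (the variable names below are hypothetical, not from the post above), the MODEL command might look like this sketch:

```
MODEL:  y ON x1 x2 x3;    ! regression of interest, as before
        x1 x2 x3;         ! mentioning the variances brings the x's
                          ! into the model as dependent variables
```

Mentioning the variances of all three covariates together, rather than only some of them, is what later posts in this thread recommend.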
nina chien posted on Friday, September 04, 2009 - 1:58 pm
To avoid listwise deletion of covariates, I added the covariates into the model by mentioning their variances. This resulted in a series of WARNING statements that some of my covariates are categorical (which they are): “WARNING: VARIABLE GENDERR MAY BE DICHOTOMOUS BUT DECLARED AS CONTINUOUS.”
So, I tried stating covariates as categorical, but this resulted in error statements that the covariates are not DV’s: “ERROR in VARIABLE command. CATEGORICAL option is used for dependent variables only. GENDERR is not a dependent variable.”
I'm very confused because the WARNING and ERROR messages are contradictory. Can I ignore the warning statements?
Also, the model where covariates were declared as continuous did not converge. Is this related to the WARNING?
You should not put covariates on the CATEGORICAL list. The message comes about because the mean and variance of a dichotomous variable are not independent of each other. I would need to see the full output and your license number at firstname.lastname@example.org to say if this is related to convergence.
I have been asked by a reviewer to provide a reference for exactly what Mplus is doing when bringing covariates with missing data into the model as dependent variables by steps such as mentioning their variances in the MODEL command (and thus making distributional assumptions about them). If you have any recommendations for this, I would greatly appreciate it.
I don't know of a reference to this exactly. When you bring the x's into the model, multivariate normality is assumed for the x's and all continuous y's. This assumption is discussed in any structural equation modeling book.
Missing data theory does not apply to observed exogenous variables. The model is estimated conditioned on them. You can bring all of the covariates into the model by mentioning their variances in the MODEL command. Then they will be treated as dependent variables and distributional assumptions will be made about them.
Thanks Linda. I have run the analysis and realise that if I bring the exogenous variable with missing data into the model, the degrees of freedom increase. I am wondering whether this poses any issue when explaining to readers how the model was specified. How do people usually justify the use of this method?
I am running a mediation model with two covariates (relationship status and Greek membership), both of which have missing values. I specified TYPE=MISSING with MLR, entered the covariates in the MODEL command, and Mplus now uses the whole sample. I understand that missing data theory does not apply to exogenous observed variables. When I do not include them in the model, there are no substantive changes except in one of my hypothesized effects. I want to keep the full sample, but I am wondering what caused the change. Can I trust the output with the observed variables included in the model?
When you mention the means, variances, or covariances of the exogenous variables in the MODEL command, they are treated as dependent variables and distributional assumptions are made about them. This can change the results if not all variables are continuous. See the Version 6.1 version history on the website for a description.
When I am running a one-factor CFA model with continuous data, I get the following error. "Data set contains cases with missing on all variables. These cases were not included in the analysis. Number of cases with missing on all variables: 2"
Previously, I have run the same model without getting the error message.
I have checked whether the error is due to a misread input file, but I still cannot find its source.
I have several questions regarding logistic regression with missing values. My data set contains 1000 cases, 300 of them with missing values. y and x1 are both binary and have missing values; x2-x4 are continuous and completely observed. I want Mplus to use all cases, including those with missing values. In order to do that I mentioned the variance of x1 and ran the following model:
ANALYSIS: estimator = ml;
integration = montecarlo;
MODEL: y on x1 x2 x3 x4;
x1;
My questions are as follows:
1. When I mention the variance of x1 Mplus uses 950 of 1000 cases. If I specify one further independent variable (no matter which one), all cases are included. What I don’t understand is that the coefficients of the model vary depending on which variance I choose to mention. This observation confuses me and I am not sure which variance I should mention.
2. A more general question concerns predicting probabilites. Is it right that it is not possible to predict probabilities when there are missing values on an independent variable using ML estimation?
3. Is the given R square the McKelvey & Zavoina’s R square – and can I trust the value although I have missing values on the independent variable or does this bias the R square-computation (as I would guess)?
1. You should include the variances of all of the covariates or none of the covariates. When you include, for example, x1 and x2, they are correlated with each other, and x3 and x4 are correlated because the model is estimated conditional on them. But there are no correlations between the pair x1, x2 and the pair x3, x4. Which correlations are fixed at zero depends on which variances you include in the model, which is why the coefficients change.
2. I don't think this can be done.
Hanna Esser posted on Wednesday, September 03, 2014 - 6:13 am
Dear Mplus team,
I have a question regarding missing values in my path model.
Variables x1-x6 are exogenous. Four of them are binary (gender and yes/no variables), two are continuous. Variables y1-y7 are endogenous. y1-y6 are ordinal, y7 is binary (yes/no). y1-y6 are also exogenous, because I compute regressions on y7.
My current sample size is 4009, and I would like to use the whole sample for my analysis. Unfortunately, there are missing data on variables x2-x6, which I assume are MCAR. When I do the analysis without extra assumptions, 1031 cases with missing on x-variables get excluded from my analysis. Most of the missing data are on the variable “income,” because many respondents refused to give this information, and on the variable “parents without qualification” (yes/no).
By reading old discussions I found out that when I mention the variances, no listwise deletion happens. This is the case in my analysis; all 4009 cases are included when I do this, but I get the warning that my exogenous variables x1-x4 are dichotomous but declared as continuous (which is true).
What can I do to include all cases and avoid the listwise deletion of 25% of my cases?
Shiny7 posted on Monday, September 29, 2014 - 3:37 am
Dear Drs. Muthen,
I ran a multilevel model with 7 x-variables and one continuous outcome using MLR.
As you know, Mplus does not include cases with missing on the x-variables, which in my case are many, many cases.
Am I right that the only solution is to mention the variances of the x-variables in the MODEL command? This is despite the fact that the assumption of multivariate normality does not actually hold for my data (which is why I use MLR).
A further problem is that I have only 21 clusters, and if I bring the variances of the x-variables into the model, the number of parameters exceeds the number of clusters, which is also a known problem.
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS 0.206D-20. PROBLEM INVOLVING THE FOLLOWING PARAMETER: Parameter 65, GXNEIGH.
I thought maybe this was a degrees of freedom issue but the "number of free parameters" is 65. What is the issue here, and is there another way to get the program not to listwise delete on X?
This is likely due to one or more of your predictors being binary. The mean and variance of a binary variable are not orthogonal and this can trigger the message. Comment out the means. If the message disappears, you can put them back and ignore the message.
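The dependence described here is easy to verify numerically. Below is a small sketch in Python (not Mplus) showing that for 0/1 data the maximum-likelihood variance equals p-hat times (1 minus p-hat), so it is fully determined by the mean:

```python
import numpy as np

# For a binary (0/1) variable, the maximum-likelihood variance is
# mean(x) - mean(x)^2 = p_hat * (1 - p_hat), i.e., it is fully
# determined by the mean. This is why the mean and variance of a
# dichotomous covariate are not independent parameters.
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10_000)

p_hat = x.mean()
ml_var = x.var()                   # ML (population) variance
implied = p_hat * (1.0 - p_hat)    # variance implied by the mean alone

print(abs(ml_var - implied) < 1e-12)
```

Because the two parameters are tied together like this, the first-order derivative product matrix can look non-positive definite even though nothing is wrong with the model.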
I am running a logistic regression with a binary dependent variable (MLR estimation). I have both continuous and binary independent variables. Both the dependent and independent variables have missing data.
1) Should I mention the variances only for independent variables with missing data, or should I mention the variances for all independent variables?
2) Is the procedure of mentioning variances also valid for categorical (binary) independent variables?
3) Is there anything further I should do to address the missing values of Y?
1. You must mention all of the variances of the covariates or none of them. If you mention only some, the model is estimated with zero covariances between the covariates you mentioned and those you did not.
2. Yes, for maximum likelihood estimation but not for weighted least squares estimation. Note that each variance mentioned requires one dimension of integration.
3. Missing data theory does not apply to a single dependent variable. If you bring in the covariates, they are treated as dependent variables so you have more than one. Distributional assumptions are then made about the covariates.
Thank you, Dr. Muthen, for such a quick response! I was hoping you could clarify one more thing. Here is the model portion of my input:
MODEL: y1 on x1 x2 x3 x4;
x1; x2; x3; x4;
You mentioned in #3 above that when I bring the covariates into the model, they are treated as dependent variables; thus, I would have more than one dependent variable. Does this mean that I am currently estimating missing data for both the x and y variables based on my current input? I ask this because the number of observations being used in the analysis seems to reflect this. Thank you for your help.
I have received a warning on a model that is measuring the relationship of a predictive composite model on a latent model and a measured outcome:
*** WARNING
Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 21
1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
You indicated in a previous post: "The only way to avoid the listwise deletion of covariates is to bring them into the model as dependent variables. You can do this by mentioning their variances in the MODEL command. You then make distributional assumptions about them."
Is the code to mention the variances in the MODEL command included in one of your workshop handouts or in the users guide? I cannot find any coding instructions.
You said in an earlier post that when you mention the variances of the covariates (in order to estimate x-side missing data), they are treated as dependent variables and distributional assumptions are made about them. Does that mean they must be normally distributed? If so, how robust would an analysis like this be to violations of normality? I am using multinomial logistic regression (y has 3 levels) and my x variables have a strong positive skew with a high frequency of zeroes, although they are not zero inflated. Is there anything else you can recommend that would be more appropriate? I am hesitant to delete cases with missing x variables as this seems to affect the means and standard deviations quite a bit. Thank you.
It is not clear in general how robust ML-MAR is to strong non-normality, but take a look at section 4.7 of the following paper on our website which was just published in the SEM journal:
Asparouhov, T. & Muthén B. (2014). Structural equation models and mixture models with continuous non-normal skewed distributions. Web note 19. Version 2. Download Mplus inputs and outputs used in this paper here.
So if your x variables are truly continuous variables, not discrete, and have no floor or ceiling effects, you could use the skew-t distribution for multiple imputation in a first step, followed by the logistic regression.
I have run a latent class growth analysis model (3 classes) with one predictor x. This predictor had missing data points, so I estimated its variance in order to use the total sample in my analysis. However, after I added x variance in the model command, I got an error message and the c (the latent class variable) on x (predictor with missing data) command was ignored. Attached is my syntax for the model command. Could you please let me know what’s wrong in my syntax? Thank you.
*** ERROR
The following MODEL statements are ignored:
* Statements in the OVERALL class: C#1 ON X C#2 ON X
*** ERROR
One or more MODEL statements were ignored. These statements may be incorrect or are only supported by ALGORITHM=INTEGRATION.
I am rather new to Mplus and SEM and am a graduate student, so I apologize if I am overlooking something obvious. I read through the forums and understand that if I include my x-variables in the model as dependent variables, I can avoid listwise deletion of cases. My problem is that when I do this, the output includes "WITH" estimates that I did not request in my input, and I do not understand why. I am also unsure how this affects the model I outlined in the input (which has only 4 WITH statements).
Additional info that might be needed: I am using WLSMV and have two categorical outcome measures in my model. Listwise=off.
Is it possible to avoid getting all of the "WITH" output that was not requested?
My mentors have been adamant that my analysis should use WLSMV because of the ordinal/categorical nature of my outcome measures. Thus, my main follow-up question is: how can I use WLSMV for my analysis and still avoid losing cases?
Again, I apologize for my lack of experience and thank you for your guidance.
My apologies. I should have mentioned that I was using the COMPLEX analysis option due to the inclusion of survey design features (weight, stratification, cluster). It gives me an error stating that ML cannot be used with this type of analysis.
Any other ideas on how I can avoid dropping my cases with missing data, while still being able to use WLSMV?
More info: the full sample is nearly 11,000 cases. Most of my variables have between 0.5-3% missing (not much), but three specific variables are missing 10-13% (they come from a different questionnaire; none of them are outcome measures). I did a mean/mode imputation and ran the analysis the same way as with the missing values included, and the results are nearly identical to the pairwise-deleted output (trivial differences in a couple of coefficients; the significance of variables did not change). That being said, I am guessing this method is frowned upon.
Again, thank you for all of the advice. I am learning a lot through your forums, publications, and user's guide.
I should add that when I took the weights out and ran the model with ML (just to see how it would work), it still deleted the same number of cases as with WLSMV. Thus, I'm at a loss as to how to proceed.
I am running a three-wave cross-lagged analysis with two categorical variables (yes/no). I do have various patterns of missing data. Sometimes it is missing on a first wave, sometimes on the second, first and second, second and third, etc.
1) To my understanding, Mplus uses FIML by default?
2) Should I use different method to deal with missing data instead? (multiple imputation or other?)
Hi. I am trying to run a multilevel regression model. Around 2000 kids (out of 7000) are being dropped because of missing data; in order to get around this problem, I have included all my variables' variances. Below you can see part of the syntax:
CLUSTER = laestab;
ANALYSIS: TYPE = TWOLEVEL RANDOM;
MODEL: %WITHIN%
emosumx ON genderC EthnicWR FamSumC CommSumC;
genderC; EthnicWR; FamSumC; CommSumC;
%BETWEEN%
emosumx ON PhaseEdu1 PhaseEdu2 Urban;
PhaseEdu1; PhaseEdu2; Urban;
When I run the model, I get the following error: THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS 0.228D-16. PROBLEM INVOLVING PARAMETER 26.
THE MODEL ESTIMATION TERMINATED NORMALLY
Parameter 26 is PSI GenderC X GenderC.
So I am a tiny bit confused as to what I should be doing. What is wrong?
I'm finding that when I use the USEOBSERVATIONS command, Mplus drops cases beyond the ones that are missing x-variables. For example, the Mplus output indicates 266 observations and says that 6 cases were dropped because they were missing an x-value; however, my SPSS dataset has 292 cases. I'm wondering why 20 additional cases appear to have been excluded?
Milan R posted on Thursday, August 10, 2017 - 11:55 am
Hi Dr. Muthen, I wanted to run a four-class GMM with binary predictors, which have missing data. What is the most appropriate way to handle the missingness on the x-side in my case? I learned from many posts on the forum that I could bring a predictor into the model by mentioning it and thereby making distributional assumptions. But since my predictors are categorical, I wonder whether mentioning their variances in the MODEL command is the way to go. Or should I follow the 3-step approach using auxiliary variables? Thanks!
This is a tricky situation without a good solution. You can use WITH among the x's to bring them into the model. But, this leads to numerical integration with many dimensions so it is computationally demanding.
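A rough sketch of the WITH approach in a mixture model (the variable names are hypothetical; the key point from the reply above is that each covariate brought in this way adds a dimension of numerical integration):

```
ANALYSIS:  TYPE = MIXTURE;
           ALGORITHM = INTEGRATION;
           INTEGRATION = MONTECARLO;
MODEL:     %OVERALL%
           c ON x1 x2;
           x1 WITH x2;    ! brings the x's into the model
```

With many x-variables that have missing data, the dimensions of integration grow quickly, which is what makes this computationally demanding.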
Milan R posted on Thursday, August 10, 2017 - 7:23 pm
Hi Dr. Muthen, Thank you for your prompt reply! For my situation, do you have a rough estimate of the computational time needed to run this GMM with a sample size of 7000? I would like to know a reasonable length of time before giving up or suspecting a problem with my model. Thanks again.
I couldn't say because it depends on so many factors including your specific model, the number of x variables with missing data, and your data. Ask for TECH8 and you will see the time each iteration takes.
Amanda Sim posted on Monday, November 06, 2017 - 5:42 pm
Dear Dr. Muthen,
My iv consists of a sum score of 17 yes/no items. There are 4 cases (out of 291) with missing data on one or more of the 17 items; hence, there are 4 cases with the iv missing.
In order to include these 4 cases in my analysis, I brought the iv into the model. However, when I do this, the regression coefficients get much larger (e.g. changes from 0.1 when the 4 cases are excluded to 0.4 when iv is brought into the model).
Do you think this is because multivariate normality is assumed when the iv is brought into the model (and the distribution of my iv is NOT actually normal)? I looked at the 4 cases to see if there was anything unusual about them but although they are all clustered around one end of the distribution, they do not appear to be massive outliers.
I am unsure how to interpret the different results when excluding the 4 cases vs. bringing them into the model and would appreciate your thoughts on how to proceed.
Please send these 2 outputs to Support along with your license number.
Zhi Li posted on Tuesday, December 19, 2017 - 7:09 pm
Dear Dr. Muthen, I am running a path analysis with 3 continuous endogenous variables regressed on 7 exogenous variables (5 continuous, 2 binary). Based on the messages above, I know that mentioning the variances/means of the exogenous variables in the model will help avoid listwise deletion on the x-variables. Yet when I include all of the variances of the exogenous variables in the model (because they have to be included all together), I get the following error message:
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION.
My questions: (a) I am wondering whether this has anything to do with the fact that three of my continuous exogenous variables have no missing values at all. I raise this because when I do not mention their variances, the error message disappears. I don't think the error is caused by mentioning the variances of the binary variables, because even when they are not included, the error message still appears. (b) What do you suggest in my current situation to avoid listwise deletion on x? Thanks so much, and I appreciate your help!
I seem to be having a similar problem with my binary covariates. I specified that the variance should be estimated, so missing observations are not removed from the sample, but left out the means:
age_* bmi_final*;        !continuous
[age_* bmi_final*];
male* hispanic* black*;  !binary
I still get the following error:
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS -0.185D-15. PROBLEM INVOLVING THE FOLLOWING PARAMETER: Parameter 62, %BETWEEN%: BLACK
This message can be ignored when binary covariates are brought into the model by mentioning their parameter(s). Note, however, that there are issues to consider - better ways to do this - when bringing binary covariates into the model as we explain in our Short Course Topic 11 video and handout on our website (YouTube or full video version), referring to Chapter 10 of our book Regression and Mediation Analysis using Mplus.
Thank you so much. Should I then be mentioning both the means and variances for the binary covariates? At one point I had tried to constrain the variances (dependent on the mean of the binary variable), but that made the model fail.
Thank you so much for your assistance. I have one final question. I recognize that the warning about variance is ignorable; when I explicitly state that I want covariances estimated between covariates I get a similar warning. Is this also ignorable?
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS 0.178D-17. PROBLEM INVOLVING THE FOLLOWING PARAMETER: Parameter 88, %BETWEEN%: BMI_FINAL WITH HISPANIC
That was indeed the case (HISPANIC is a binary variable). However, in the series of 6 models I am running (3 outcomes/2 sets of predictors) I get an error for a continuous variable:
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS 0.166D-18. PROBLEM INVOLVING THE FOLLOWING PARAMETER: Parameter 78, %BETWEEN%: BMI_FINAL
I am not sure why this would be happening, since nothing has changed in my code other than variable names.
I'm attempting to run an SEM and consistently receive the same warning. I have been through all my data and ensured that all missing data are coded as -99, and this is accounted for in my model.
I have read through a few of the solutions (e.g., multiple imputation) suggested on this forum and in your FAQ, but I am unsure whether these are appropriate for my situation, as I know my colleagues have run similar models without having to do this. I was wondering if you might be able to help at all?
This is my input:
VARIABLE: Names are Age Gender Ethnic LEC_No Childhoo Anx_Atta Avo_Atta PTCI_Tot IIP_Tot PSS_Tot RE_DX AVO_DX TH_DX AD_DX NSC_DX DR_DX PHQ_Symp;
USEVAR are Age Gender Ethnic LEC_No RE_DX AVO_DX TH_DX AD_DX NSC_DX DR_DX PHQ_Symp;
Missing are all (-99);
ANALYSIS: ESTIMATOR = MLR;
MODEL: PTSD by RE_DX AVO_DX TH_DX;
DSO by AD_DX NSC_DX DR_DX;
PHQ_Symp on PTSD Gender Age Ethnic LEC_No;
PHQ_Symp on DSO Gender Age Ethnic LEC_No;
And this is the error message I receive (apologies for the double post I reached the size limit for one):
*** WARNING
Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 6
*** WARNING
Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis. Number of cases with missing on all variables except x-variables: 83
2 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS
We are running a twolevel (logistic) regression analysis with missing data on x. As far as I've understood, there are several options to avoid listwise deletion:
a. bring the variances of all x-variables into the model. But this shouldn't be done too lightly, especially with binary x's. Right?
b. do the same thing, but use the Bayes estimator (as recommended in your RMA book).
c. use multiple imputation and then run the regression analyses on the imputed data.
Some questions:
1. Is there a preferred method?
2. Can I impute missing data as is done in example 11.6 of the user's guide? Or should I specify a more advanced/different imputation model?
3. We want to run several regression models (using the same predictors each time, but different outcomes). Following example 11.6, should we include only the variables relevant to the specific regression in the IMPUTE = syntax? Or should we specify all variables relevant to our paper? Or is it better to do MI separately and then run all analyses on the new imputed datasets (so that all analyses are based on exactly the same imputed data)?
Thanks for your advice!
Kind regards, Aurelie
P.s. I very much appreciate all the help I've been receiving through the past few years on this discussion board. It has really helped me to better understand Mplus and improve my skills.
1. I don't think there are general guidelines for what is best. If you have a lot of missing data, in your case, you might want to try one non-imputation approach and one imputation approach. I like the Bayes method you mention in b, but Bayes uses probit, not logit regression and therefore might not be attractive (next Mplus version will have Bayes logit).
2-3. If you impute, I would use UG ex 11.5 and include all the different DVs that you are interested in so that you use the same imputed data sets in all subsequent analyses.
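Following this advice, a minimal imputation step might look like the sketch below (file and variable names are hypothetical):

```
DATA:             FILE = mydata.dat;
VARIABLE:         NAMES = y1-y3 x1-x5;
                  MISSING = ALL (-99);
DATA IMPUTATION:  IMPUTE = y1-y3 x1-x5;   ! include all DVs of interest
                  NDATASETS = 20;
                  SAVE = myimp*.dat;
ANALYSIS:         TYPE = BASIC;
```

Each subsequent regression run can then read the saved list of data sets with TYPE = IMPUTATION in the DATA command, so all analyses are based on the same imputed data.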