Don't want deletion in logistic regre...
Mplus Discussion > Missing Data Modeling >
 Calvin Croy posted on Friday, January 06, 2006 - 1:42 pm
I tried running a logistic regression by specifying "Type is Missing H1" and "Estimator is ML" in the hope that this would let me avoid listwise deletion. Note that I did NOT specify Type is Logistic. Nevertheless, Mplus implemented listwise deletion: 13 observations with missing values for the dependent variable were omitted, as well as 37 that had missing values for one or more of my 3 predictors.

I would like to find out why the listwise deletion was still implemented. I think it has something to do with what you say in the User's Guide: "In all models, missingness is not allowed for the observed covariates because they are not part of the model. The outcomes are modeled conditional on the covariates and the covariates have no distributional assumption. Covariate missingness can be modeled if the covariates are explicitly brought into the model and given a distributional assumption".

My questions:

1. Am I right that the above quote explains why the listwise deletion was implemented? Or did I do something wrong in my Mplus code (everything converged fine and I got all the parameter estimates)?

2. Could you elaborate upon the above quote? I'm still not clear what is going on mathematically that would allow FIML to work around the missing data for a MIMIC analysis but not for a logistic regression.

3. If the above quote is applicable to the logistic regression, could you explain why the covariates (predictors) are considered outside the model?

4. If I were to bring the covariates into the model and give them distributional assumptions, could I avoid listwise deletion? If yes, I would appreciate your showing me how to do this. In the analysis I attempted, all the predictors were dichotomous and in future models all additional predictors would be normally distributed.

5. Is there a way to print out the comments on this discussion board without having the rightmost margin cut off? I'd like to be able to turn off displaying the menu that runs down the left side (topics, last day, last 3 days, etc). Or is there a special print button for the discussion comments that I haven't found?

Thanks so very much for your assistance. I really appreciate your help.
 Linda K. Muthen posted on Friday, January 06, 2006 - 2:54 pm
1. That sounds like the reason.
2. When you say MIMIC analysis, I think you mean TYPE=GENERAL. If TYPE=GENERAL, all variables are treated as y variables and distributional assumptions are made about them. This is not the case with maximum likelihood with categorical outcomes. You will have to bring the variables into the model and make distributional assumptions about them if you do not want them to be deleted.
3. As in regular regression, the means, variances, and covariances of the covariates are not estimated as part of the model. This is the same situation.
4. Yes. Mention their variances in the MODEL command.
5. I'm afraid I don't know the answer to this. I would copy what I want to print and paste in Notepad and then print it.
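
The distinction in points 2 to 4 can be sketched in likelihood terms. This is an illustration in notation that does not appear in the thread: u_i is the outcome, x_i the covariate vector split into observed and missing parts, beta the regression parameters, and g(x; phi) the distribution assumed for the covariates once they are brought into the model. It assumes the missing values are missing at random.

```latex
% Conditional likelihood: x_i is only conditioned on, so every x_i
% must be fully observed and incomplete cases are dropped.
L_{\text{cond}}(\beta) = \prod_{i=1}^{n} f(u_i \mid x_i;\, \beta)

% Joint likelihood once the covariates are given a distribution g:
% the missing part of x_i is integrated out, so the case is kept.
L_{\text{joint}}(\beta, \phi) = \prod_{i=1}^{n} \int
  f\bigl(u_i \mid x_i^{\text{obs}}, x_i^{\text{mis}};\, \beta\bigr)\,
  g\bigl(x_i^{\text{obs}}, x_i^{\text{mis}};\, \phi\bigr)\, dx_i^{\text{mis}}
```

For a case with complete covariates the integral disappears and the contribution is simply f(u_i | x_i; beta) g(x_i; phi).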
 Calvin Croy posted on Tuesday, January 17, 2006 - 8:05 am

1. You say in point #2 of your response that I need to make distributional assumptions about the variables if I don't want them deleted. Could you please explain what distributional assumptions you are referring to? Whether the variables are dichotomous or continuous? That the underlying populations have particular means, variances, and covariances that I specify? Do I need to specify particular values for means, variances, and covariances that I believe are the parametric values in the population my sample came from? Do I use these as starting values? Do I need to include these assumed parametric values when all my predictors are dichotomous?

2. If I had 1 dichotomous dependent variable, 2 categorical predictors, and 1 continuous predictor in the logistic regression, could you please spell out for me what you mean in your response point #4 when you say "Mention their variances in the model"? Am I supposed to calculate the variances of all 3 predictors using my sample data and specify those variances in the model somehow? What should be the specific wording of the lines in my Model command? Sorry, I need a lot of hand-holding!

In my first request I am trying to determine what assumptions I need to make in order to avoid the listwise deletion, and then I can see whether I think they're reasonable. In my second request, provided I am willing to make the assumptions, I am trying to figure out exactly how to implement what you've suggested.

Thank you again for all your assistance.
 bmuthen posted on Tuesday, January 17, 2006 - 11:01 am
Missingness in covariates can be handled by adding to the original logistic regression model an assumption of (for example) normality for the covariates. Imputation techniques often use this assumption as a proxy even when some covariates are dichotomous.

Say that you have the logistic regression model u on x1-x3. To bring x1-x3 into the model, applying the normality assumption, you mention their variances by simply saying

x1-x3;
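
A minimal sketch of a complete input built around this idea might look as follows. The file name, the missing-value code, and the variable list are hypothetical placeholders, and the ANALYSIS settings simply repeat those mentioned in the first post; depending on the Mplus version and the model, additional numerical integration options may be required.

```
TITLE:    Logistic regression, covariates in the model
DATA:     FILE IS mydata.dat;       ! hypothetical file name
VARIABLE: NAMES ARE u x1-x3;
          CATEGORICAL IS u;         ! binary outcome
          MISSING ARE ALL (-999);   ! hypothetical missing code
ANALYSIS: TYPE = MISSING H1;        ! as in the first post
          ESTIMATOR = ML;
MODEL:    u ON x1-x3;               ! regression of u on x1-x3
          x1-x3;                    ! mention covariate variances
```

The only change from an ordinary logistic regression input is the last MODEL line.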
 Calvin Croy posted on Wednesday, January 18, 2006 - 3:32 pm
Thanks for your response. I am familiar with how dichotomous data are often imputed using multiple imputation methods for continuous data.

Could you please elaborate about what your phrase "mentioning their variances" implies?

If I just say x1-x3 in the model command as you show above, you indicate that I will be making an assumption that the predictors are normally distributed, which is fine.

However, what assumptions about the variances of x1-x3 would I be making?

1. If in my sample data x1 has a variance of 2, x2 has a variance of 10, and x3 has a variance of .703, am I building a model that assumes the population has these exact variances?

2. Or would I be assuming that x1-x3 each has a variance of 1 (i.e. they have standard normal distributions)? Or that they have some other particular variance?

3. If the answer to #1 is Yes (i.e. the model rests on the assumption that the population variances are 2, 10, and .703), then how can I tell to what extent the odds ratios and standard errors are accurate if the true population variances are far from these assumed values?

Once again, I appreciate your help!
 bmuthen posted on Thursday, January 19, 2006 - 9:20 am
Saying x1-x3; is harmless. The variances will be freely estimated without any restriction. It is simply an Mplus "trick" to bring the x's into the model in terms of giving them a distributional assumption.
 Calvin Croy posted on Friday, January 27, 2006 - 10:04 am
This is very good news. Assuming I don't want to impute (which I have done a lot of and have published on that subject), I'm now faced with deciding whether to run the logistic regression in Mplus or use SAS/SPSS/Stata. I'm inclined to use Mplus since it would let me avoid listwise deletion, but want to check a few more things with you first.

1. So if I understand all you've said, the only "drawback" of running this type of logistic regression, where I've listed the predictors to have their variances freely estimated (for example by specifying x1-x3 in the previous example), is that I must assume they are normally distributed? This sounds too good to be true as a way of avoiding listwise deletion.

2. If I'm willing to make this normality assumption for the predictors, would there be any drawbacks to running the logistic regression in Mplus to avoid the listwise deletion compared to doing the analysis in SPSS or SAS? I'm just referring to computational algorithms here, not the output content or ease of use. Though I really want odds ratios, I can calculate them from the Mplus output.

3. Being able to avoid listwise deletion in Mplus would seem to give it a real advantage over SAS, SPSS, and Stata (which I believe always invoke listwise deletion for logistic regressions, but maybe not in Proc Mixed, Proc Genmod, Proc Probit, or Catmod). Do you agree that this ability to avoid listwise deletion gives Mplus an edge?

4. If I had normally distributed predictors where the missing data were Missing at Random (but not necessarily Missing Completely at Random), wouldn't I get superior estimates of the regression coefficients out of Mplus rather than using SAS with listwise deletion?

5. I assume I could also run multiple regression in Mplus and avoid listwise deletion too if I'm willing to assume normality and list the predictors so they're brought into the model. Correct? (i.e. what works for logistic regression would also work for multiple regression)

Thank you again so very much for all your help.
 Linda K. Muthen posted on Saturday, January 28, 2006 - 9:26 am
Your statements sound correct. With logistic regression, bringing the covariates into the model estimation may lead to numerical integration which can be computationally heavy depending on the number of covariates, the size of the sample, etc.
 Calvin Croy posted on Tuesday, January 31, 2006 - 2:03 pm
Yes, I had to use the Monte Carlo integration to get my test program to run.

I have appreciated your help. Thank you!
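
For readers in the same situation, a sketch of ANALYSIS settings that request Monte Carlo integration is shown below. It is an assumption about what such a setup could look like, not the input actually used in this thread, and any other ANALYSIS options would stay as in your own input.

```
ANALYSIS: ESTIMATOR = ML;
          ALGORITHM = INTEGRATION;     ! numerical integration
          INTEGRATION = MONTECARLO;    ! Monte Carlo integration
```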
 Dustin Pardini posted on Monday, May 22, 2006 - 6:46 am
I am running a Poisson regression using an MLR estimator with missing data on the IVs and DV. I would like to use the "type = missing" command with numerical integration to handle the missingness. Some of the IVs are categorical (binary) and have minimal missingness. However, when I bring all of the IVs into the model, this assumes they are all normally distributed.

Would it be more appropriate to use the categorical command to specify the binary IVs as categorical? The model runs and provides sensible results when I do this, but the output describes the categorical IVs as dependent variables in the model.
 Linda K. Muthen posted on Monday, May 22, 2006 - 9:32 am
If you use the CATEGORICAL option, they are brought in just as in the first situation and treated as y's but in addition they are treated as categorical y's. In regression, covariates can be binary or continuous and in both cases, they are treated as continuous. So I don't think you should put them on the CATEGORICAL list.
 Dustin Pardini posted on Monday, May 22, 2006 - 10:22 am
Thanks for the advice Linda. However, when I bring the covariates into the model I get the following error message.


 Linda K. Muthen posted on Monday, May 22, 2006 - 11:04 am
You would need to send your input, data, output, and license number for me to answer that.
 Amanda Barczyk posted on Thursday, February 03, 2011 - 6:07 am

I am doing CFA and SEM models for my dissertation. I was having a problem with the missing data (cases were being deleted instead of all available data being used). So, I used the technique you posted above: "bring x1-x3 into the model, applying the normality assumption, you mentioning their variances by simply saying x1-x3;".

I was hoping you could give me a little more information about what this actually does for my missing data. I am using weights and am therefore also using "ESTIMATOR=MLR;". The model is not able to run without this syntax, so I do not believe your technique above is a way of dealing with missing data separate from maximum likelihood; however, I am not sure. Could you please provide a little more detail (or a citation) on what this does?

Thank you very much!
 Linda K. Muthen posted on Thursday, February 03, 2011 - 1:00 pm
Including the variances in the MODEL command causes the variables to be treated as dependent variables in the estimation of the model rather than estimating the model conditioned on these variables and making no distributional assumptions about them. This is also done in multiple imputation.

I don't know why the model would not run without this syntax. You should send the output and your license number to
 FN briere posted on Wednesday, February 22, 2012 - 9:22 am
Hi, when using FIML estimation to deal with missing data in logistic or linear regression by bringing covariates into the model, I typically get the following type of error message, which was already reported in this thread:
I am a bit surprised at how often I get this message, with different sets of covariates (which are mostly relatively normally distributed or binary) and not necessarily with a large number of covariates.
I don't know if this is a typical problem or not, but I'd be interested in knowing what tends to cause this and if there are general strategies to limit these types of problems (other than using MI instead to deal with missing data).

Thank you in advance,
 Linda K. Muthen posted on Wednesday, February 22, 2012 - 1:37 pm
This may be due to binary covariates. If so, it is because the mean and variance of binary variables are not orthogonal. This causes the identification message and can be ignored if this is the reason.
 FN briere posted on Wednesday, February 22, 2012 - 3:21 pm
yes, this seems to be the case. I do not get these messages when binary covariates are removed.

What's tricky is that when they are included, the error message is not necessarily related to these binary covariates. Does not make it easy to identify the issue.

Thank you for the quick reply,
 Linda K. Muthen posted on Wednesday, February 22, 2012 - 3:44 pm
In most cases that I have seen, it points to the variance of the binary variable in TECH1.
 FN briere posted on Tuesday, February 28, 2012 - 7:06 am
Hi again,
I've run several tests over the last week and the results are consistent. In my data:
1) Error messages occur automatically when one or more binary variable is included;
2) The error message does not necessarily relate to the binary variable. In fact, it tends to relate to the last parameter in the PSI matrix in TECH1, whatever the variable is as the exact parameter involved changes if the order of covariates is changed... This can be a non-binary covariate which causes no problem when no binary variable is included (whatever the combination of non-binary covariates is);
3) I do not get error messages when binary variables are not included, even if covariates are categorical variables with more than 2 categories;

Overall, this leads me to believe that the non-identification issues are related to the binary covariates. I hope this helps others who encounter similar issues.
 Bengt O. Muthen posted on Wednesday, February 29, 2012 - 8:54 am
It is correct that the error message may not point to the binary variable that is the cause of the message. The issue is that for a binary variable the mean, say p, is mathematically related to the variance, p(1-p). That doesn't mean that the model is non-identified. It is a useful error message because, with missing data on a binary variable, mentioning its variance means that the binary variable is treated as a continuous-normal variable, and this can result in biased parameter estimates. Multiple imputation in Mplus may be better because you can specify that the variable is categorical.
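
The relation between the mean and the variance of a binary variable can be written out as a short derivation. The notation is illustrative: x is a 0/1 variable with P(x = 1) = p, and mu and sigma^2 are the mean and variance parameters of a normal model.

```latex
% For a 0/1 variable, x^2 = x, so E[x^2] = E[x] = p.
E[x] = p, \qquad
\operatorname{Var}(x) = E[x^2] - \bigl(E[x]\bigr)^2 = p - p^2 = p(1-p)

% A normal model instead treats the two moments as separate free
% parameters, \mu and \sigma^2, although the data determine both
% through the single quantity p.
```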
 Marie Nancy Seraphin posted on Wednesday, May 27, 2015 - 9:13 am

I am new to Mplus. I have read the posts here, which apply to a problem I have running a logistic regression model with missing data on my covariates.

I would like to use FIML to offset the effect of missingness. I would also like to specify the variances in the model statement, as mentioned above, to make distributional assumptions. However, I am not sure how to actually code these commands.

My model command right now looks like this:
y on x1-x14;

Please Help!

Thank you
 Linda K. Muthen posted on Wednesday, May 27, 2015 - 10:11 am
y on x1-x14;
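
Following the approach described earlier in this thread (mentioning the covariate variances in the MODEL command), a sketch for this model might look like the lines below. It assumes all 14 covariates are to be brought in and treated as normally distributed.

```
MODEL:  y ON x1-x14;   ! the original regression
        x1-x14;        ! mention the variances of all covariates
```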
 Marie Nancy Seraphin posted on Wednesday, May 27, 2015 - 11:22 am
Thank you very much.
 Sara Geven posted on Sunday, November 29, 2015 - 3:48 am
Dear all,
I want to estimate a model that contains two endogenous variables with missing values: students' ability and their socioeconomic status. As these two variables are related to each other, I was thinking about including a covariance between them (instead of estimating their variances). In this way, these variables also become part of the model, respondents with missing values are kept, and I take their relatedness into account. Is this approach okay?
Thank you in advance!

 Bengt O. Muthen posted on Sunday, November 29, 2015 - 11:05 am
Do you mean that the two variables are exogenous as opposed to endogenous?
 Sara Geven posted on Monday, November 30, 2015 - 12:33 am
Dear Prof. Muthen,
Yes, I am sorry. Stupid mistake. I meant exogenous.
I am also wondering whether it is advisable to predict other exogenous variables on the exogenous variables with missing values (instead of just predicting their variances/covariances)?

 Bengt O. Muthen posted on Monday, November 30, 2015 - 5:39 pm
Yes, you can bring covariates into the model. But note that your model then makes normality assumptions about them. And it helps the slope estimates only if some of those with missing on the cov's are not missing on y.
 Bengt O. Muthen posted on Monday, November 30, 2015 - 5:40 pm
I don't understand your last question because it sounds like you want to regress some exogenous vbles on other exog vbles - but then they are not exogenous.
 Sara Geven posted on Tuesday, December 01, 2015 - 3:01 am
Dear Prof. Muthen,
Thank you for this reply. I indeed have data with missing on x's for which y is present.

When estimating the variance of exogenous variables (to avoid listwise deletion), I noticed that the model fit sometimes becomes bad. This seems to be due to the fact that these variables are also related to other exogenous variables in the model. I understand you get a 'funny' model when predicting paths from other exogenous variables on these variables. However, is there another way to address this issue?

Kind regards,
 Linda K. Muthen posted on Tuesday, December 01, 2015 - 6:36 am
You must bring all of the observed exogenous variables into the model. You cannot bring in only a subset. This can be done only for maximum likelihood estimation not for weighted least squares.
 Sara Geven posted on Tuesday, December 08, 2015 - 9:42 am
Dear Prof. Muthen,
Thank you for your answer. What should I do when some of the exogenous variables are dummies? I do not want to make those part of the model, right? Only the continuous ones.

 Linda K. Muthen posted on Tuesday, December 08, 2015 - 9:47 am
You can't bring a subset in. You must either bring them all in or none of them. When you bring in a subset, the correlations between the ones in and the ones out are zero, which will cause the model not to fit well.
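
As a sketch of the "all or none" rule, with hypothetical variable names (abil and ses continuous, d1 and d2 dummies), the MODEL command would mention the variances of every observed exogenous variable, dummies included:

```
MODEL:  y ON abil ses d1 d2;   ! hypothetical variable names
        abil ses d1 d2;        ! variances of ALL exogenous
                               ! variables, dummies included
```

Which covariances among these variables are then estimated can be checked in the TECH1 output; a specific covariance can be requested explicitly with a WITH statement such as abil WITH ses;.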