Mplus Discussion >> Multiple imputation with complex survey data


Lily Wang posted on Sunday, February 05, 2012 - 12:23 pm

Hi, Drs Muthens,

I encountered a problem (an unexpected termination, to be specific) when trying to impute data. The data come from a national survey (TYPE=COMPLEX). I wrote:

DATA IMPUTATION:
impute=V1 V2 V3;
save=CCimpute*.dat;

Mplus terminates during the second imputation (as shown in the DOS window) without any further notice or error message. The output window does not pop up as usual. If I open the output file manually, it contains only the syntax I wrote.

Is there any way to fix the problem?

Linda K. Muthen posted on Sunday, February 05, 2012 - 1:20 pm

If you are not using Version 6.12, you should download it. If you are, send the files and your license number to support@statmodel.com.

Stata posted on Thursday, February 09, 2012 - 11:34 pm

Dear Muthens,

I am trying to impute (MI) with a national dataset for multilevel latent class analysis:

Format is (F4.0, F3.0, 51F1.0);
VARIABLE:NAMES ARE SID STRAT BB1-BB51;
USEVARIABLES = BB1-BB51;
AUXILIARY = SID STRAT;
CATEGORICAL ARE BB1-BB51;
Missing = ALL (9);
Data Imputation:
IMPUTE = BB1-BB51(c);
NDATASETS = 40;
ANALYSIS: TYPE = BASIC;

The last 4 variables are dichotomous; the rest are on a 4-point scale.

I was not able to use the imputed data to run the multilevel analysis. This is what I got:
*** ERROR(Err#: 64)
Invalid symbol at record #: 1
The record is shown below this message
"diss.imp1.dat"

Therefore, I ran two level MI adding the following syntax:

cluster = SID;
type=BASIC Twolevel;

*** FATAL ERROR

THE CONVERGENCE CRITERION IS NOT SATISFIED.
INCREASE THE MAXIMUM NUMBER OF ITERATIONS OR INCREASE THE CONVERGENCE CRITERION

How can I fix these problems? Thank you.

Linda K. Muthen posted on Friday, February 10, 2012 - 6:48 am

Please send the relevant files and your license number to support@statmodel.com.

FN briere posted on Wednesday, March 28, 2012 - 12:56 pm

Hi, two questions regarding twolevel MI:
1) I would like to make sure that the following syntax is appropriate to generate a twolevel H1 (unrestricted) imputation.

Usevariables =
Alevel1 Blevel1 Clevel1 Alevel2 ;
cluster = SCHOOL;
within = Alevel1 Blevel1 Clevel1;
between = Alevel2 ;
MISSING are all (-100);

DATA IMPUTATION:
IMPUTE = Alevel1 Blevel1 Clevel1 Alevel2;
NDATASETS = 5;
SAVE = twolevel*.dat;

ANALYSIS: TYPE = twolevel;

2) Would it be preferable not to specify the outcome to be used in a twolevel regression model (second step after imputation) as either within or between?
Thank you in advance,
Frédéric

Bengt O. Muthen posted on Thursday, March 29, 2012 - 3:39 pm

1) You want to say Type = Twolevel Basic;

And you want to remove the Within= line since that would mis-specify the variables as not having any between-level variance.

2) You should only use

Between = Alevel2;

during imputation and estimation phases.
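Putting both corrections together, the imputation input from the question might look like the sketch below. It uses only the lines quoted in the thread; the DATA command and the NAMES list were not posted and are left out, so this is not a complete input file.

```mplus
VARIABLE:
  USEVARIABLES = Alevel1 Blevel1 Clevel1 Alevel2;
  CLUSTER = SCHOOL;
  BETWEEN = Alevel2;      ! only the school-level variable is declared
  MISSING = ALL (-100);   ! no WITHIN= line: level-1 variables keep
                          ! their between-level variance

DATA IMPUTATION:
  IMPUTE = Alevel1 Blevel1 Clevel1 Alevel2;
  NDATASETS = 5;
  SAVE = twolevel*.dat;

ANALYSIS:
  TYPE = TWOLEVEL BASIC;  ! unrestricted (H1) two-level imputation
```

The same BETWEEN = Alevel2; line, again without a WITHIN list, would then be carried into the analysis run on the imputed files.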

FN briere posted on Friday, March 30, 2012 - 10:06 am

Thank you, this is very useful.

I gather from a different post that for models with random slopes and cross-level interactions, it is necessary to switch to an H0 approach. I wish to do something as close as possible to H1, but including random slopes.

I would tend to do this by specifying an H0 imputation model with random slopes and all correlations between variables at the two levels.

e.g.
Usevariables =
x1 x2 y1 Alevel2 ;
cluster = SCHOOL;
between = Alevel2 ;
MISSING are all (-100);

ANALYSIS: TYPE = TWOLEVEL;
ESTIMATOR = BAYES;

MODEL:
%WITHIN%
s | y1 on x1;
y1 with x2;
x1 with x2;
%BETWEEN%
s y1 x2 x1 Alevel2 with s y1 x2 x1 Alevel2;

DATA IMPUTATION:
IMPUTE = x1 x2 y1 Alevel2;
NDATASETS = 5;
SAVE = twolevel*.dat;

Does that seem like a correct approach?
Thank you for your time again,

Bengt O. Muthen posted on Saturday, March 31, 2012 - 10:57 am

I think that's ok. It sounds like your primary interest is in getting imputed data for some later investigation, not estimating the model parameters. You can estimate the model parameters without imputing.

Assuming imputing is the primary interest, in general it may be a good idea to impute from a model that is as close to the "true" model as possible. With twolevel settings, however, it can be difficult to get convergence with a very unrestricted model (a model close to H1), mainly due to having many between-level parameters. That's why our UG gives an example of imputing from a simpler model than the later analysis model. How far apart these two models can be is an interesting research question.

FN briere posted on Saturday, March 31, 2012 - 11:39 am

Thanks, always very useful.

One last question which I think may also benefit others.

Given that my main interest is more in specific cross-level interactions than in random effects, another option may be to run an H1 imputation including cross-level interaction terms. I tried that and results from models estimated on imputed files (second step) look fine. I am thinking to go this way, as I did get some convergence problems with H0 models with random effects.

Any thoughts on this strategy?

Bengt O. Muthen posted on Saturday, March 31, 2012 - 12:23 pm

I don't see the distinction between cross-level interactions and random effects. Your model above would have a cross-level interaction if you regress the random slope s on Alevel2.

FN briere posted on Saturday, March 31, 2012 - 12:59 pm

Yes, I wasn't clear.
I am wondering about different strategies to obtain something similar:
defining a cross-level interaction term as a variable to be included in an H1 imputation vs. specifying a random slope s to be regressed on Alevel2 in an H0 model. I get no convergence problems with the first strategy, but I do get some with the second. This is why I ask.

Bengt O. Muthen posted on Sunday, April 01, 2012 - 11:12 am

I see, you just enter a variable that is the product of the two variables. That sounds reasonable.
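A minimal sketch of the product-term strategy, reusing the variable names from the earlier H0 example. The name x1a2 is made up for illustration, and it is an assumption that a DEFINE-created variable is accepted on the IMPUTE list in your Mplus version, so check the output before relying on it.

```mplus
VARIABLE:
  ! Variables created in DEFINE go at the end of the USEVARIABLES list.
  USEVARIABLES = x1 x2 y1 Alevel2 x1a2;
  CLUSTER = SCHOOL;
  BETWEEN = Alevel2;
  MISSING = ALL (-100);

DEFINE:
  x1a2 = x1 * Alevel2;    ! hypothetical cross-level product term;
                          ! missing whenever x1 or Alevel2 is missing

DATA IMPUTATION:
  IMPUTE = x1 x2 y1 Alevel2 x1a2;
  NDATASETS = 5;
  SAVE = twolevel*.dat;

ANALYSIS:
  TYPE = TWOLEVEL BASIC;  ! H1 imputation, no random slope needed
```

Note that the product is imputed as a variable in its own right, so imputed values of x1a2 will not exactly equal the product of the imputed x1 and Alevel2.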

It is an interesting general question of how close to the "true" model the imputation model has to be - how much difference it makes in the quality of the imputed data. A research topic.

Fernando H Andrade posted on Monday, April 09, 2012 - 8:33 am

Dear Dr. Muthen
I am doing multilevel SEM using multiple imputation and incorporating the complex survey design. The data come from Add Health, and my sub-population for the complex survey analysis is students attending 9th to 11th grade in wave 1.

My outcome is a factor composed of GPA in math, reading, science and social studies. I performed multiple imputation to deal with missing data; however, I still have some missing data because some students did not take one or more of the four courses during the school year.

I am not sure how Mplus deals with these valid missing cases. I am considering excluding from my sub-population all cases that are missing on all four courses, but I would like to keep cases that have taken at least one course. Would it be OK to keep those cases and just specify the value for missing (for example, missing are -9999), or does Mplus expect non-missing values for all variables in each imputed data set? (In that case, how do I deal with valid missing cases?)

thank you
Fernando

Linda K. Muthen posted on Tuesday, April 10, 2012 - 1:56 pm

You can have missing data in the imputed data sets. This missing data is treated in the regular way.

Fernando H Andrade posted on Friday, April 13, 2012 - 7:08 am

that is nice!, thank you
fernando

Jennifer Gibson posted on Friday, July 27, 2012 - 8:32 am

It sounds like the above case presented by Fernando involves variables that have missing values because no values were imputed for those variables. Could we instead have variables that have values imputed for some cases but left missing for other cases?

The example I'm thinking of is a survey with a skip pattern such that there are items that only a subset of respondents answer. Let's say 60 out of 100 respondents are presented with an item but only 50 answer it. I would want to impute for those 10 who were presented the item but elected not to respond. Is this possible? Perhaps through some kind of "if" statement associated with DATA IMPUTATION?

Tihomir Asparouhov posted on Friday, July 27, 2012 - 4:38 pm

You can impute the data but not use the imputed values for the subgroup. Some kind of code in the analysis part of the imputed data should work (if 9999 is the missing value and you have both the imputed Yimp and the non-imputed Y in the same file):

define: if (group==1 .and. Y=9999) then Yimp=_missing;

This will restore the missing value where you need it.
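A fuller sketch of that analysis-step input, written with the EQ and AND operators listed for DEFINE in the User's Guide. The list file name, the variable layout, and the coding of group are all hypothetical; the sketch also assumes Y is not declared on the MISSING option in this run, so that 9999 can be tested as an ordinary value.

```mplus
DATA:
  FILE = implist.dat;      ! hypothetical list of imputed data sets
  TYPE = IMPUTATION;

VARIABLE:
  NAMES = group Y Yimp;    ! hypothetical layout: original Y saved
                           ! alongside the imputed copy Yimp
  USEVARIABLES = Yimp;

DEFINE:
  ! group EQ 1 marks respondents who were never shown the item
  IF (group EQ 1 AND Y EQ 9999) THEN Yimp = _MISSING;

ANALYSIS:
  TYPE = BASIC;
```

Respondents who saw the item but did not answer keep their imputed Yimp values, while those who skipped it by design are set back to missing in every imputed data set.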

Jennifer Gibson posted on Tuesday, July 31, 2012 - 6:29 am

Perfect. Thank you, Tihomir!

TaeKyoung Lee posted on Tuesday, August 28, 2012 - 6:01 am

Dear Dr. Muthen

I hope the fall semester is going smoothly.
Can I ask for your advice on how to impute a cluster variable using multiple imputation (TYPE=COMPLEX)?

Here is my syntax:

VARIABLE:NAMES = econpr2 invpar latediv tr addep3 parrej delq bmi3 ill3 biosex4 hos smoking alcohol eat ill4d psu never ed;
Missing are all(999);
DATA IMPUTATION:
IMPUTE = econpr2 invpar latediv (c) tr (c) addep3 parrej delq bmi3
ill3 biosex4 (c) hos smoking alcohol eat ill4d psu
never (c) ed (c);
NDATASETS = 10;
SAVE = illimp*.dat;

ANALYSIS: TYPE = BASIC;
OUTPUT: TECH8;

With this syntax, I was able to get complete data sets without any missingness.
However, I then found a problem: psu is the cluster variable (school ID), and with the above syntax the program treats that variable as continuous.
Although this variable has quite a large range (1-371), it is not a continuous variable.
So I am wondering whether there is a way to treat this variable as a cluster variable within the context of multiple imputation.

Linda K. Muthen posted on Tuesday, August 28, 2012 - 10:55 am

Are you trying to impute for missing values on school ID?

TaeKyoung Lee posted on Tuesday, August 28, 2012 - 4:26 pm

Yes. Is it possible to impute for missing values on school ID?

Linda K. Muthen posted on Tuesday, August 28, 2012 - 5:18 pm

This does not make sense to me. It should be your analysis variables you impute for.

Christoph Weber posted on Tuesday, March 25, 2014 - 6:41 am

Dear Dr. Muthén,
I have several questions regarding MI for multilevel models.

1.) I want to include cluster means as predictors in the analysis.
If I use an H1 imputation and do not specify (level 1) variables as within, then it is not necessary to include cluster means (of the level 1 variables) on the between level. Is this correct?

2.) I have 3-level data (pupils, classes, schools). There is no missing data for class and school level variables.
I want to estimate the effect of a 0/1 coded treatment variable at class level on performance. Additionally I'm interested in cross level interactions (effect of school level variables on the effect of treatment on performance).

2a.) I tried type = threelevel and estimator = bayes, but the model does not converge. Would it be adequate to use type = basic twolevel (cluster = class) in order to achieve convergence? (there are no problems for type = basic and type = basic twolevel).

2b.) Due to convergence problems it seems also not possible to include the crosslevel interactions.
Would it be adequate for H1-imputation to include simple product terms (VariableLev3xVariableLev2, ...)?

Thanks a lot!
Christoph Weber

Linda K. Muthen posted on Tuesday, March 25, 2014 - 10:08 am

The WITHIN and BETWEEN options should be used if appropriate when the data are imputed.

Please send the three-level output with convergence problems and your license number to support@statmodel.com.

Aurelie Lange posted on Wednesday, December 17, 2014 - 11:50 am

Hi,

I have tried to run a complex twolevel model with a cross-level interaction based on 10 imputed datasets. The model without the cross-level interaction runs fine. However, when including the interaction, it seems as if the data is not used, and hence I don't get any results (both the fit indices and the estimates are zero). The output also says:
Number of replications
Requested 10
Completed 0

My input file reads as follows:
DATA:
FILE IS TotalFile19t.4.bmsrclist.dat;
type = imputation;
CLUSTER = TherID cohort;
MISSING ARE ALL (999);
WITHIN = zgender Language therexp athome inschool noarrest;
BETWEEN = team1-team26 teamyexp;
CATEGORICAL = athome inschool noarrest (0-1);

ANALYSIS:
TYPE = COMPLEX TWOLEVEL RANDOM;
INTEGRATION = MONTECARLO;
PROCESSORS = 2;

MODEL:
%WITHIN%
s | famav19 on therexp;
famav19 on zgender language;
athome on famav19 therexp;
inschool on famav19 therexp;
noarrest on famav19 zgender therexp;

%BETWEEN%
famav19 s on teamyexp;
famav19 on team1-team26;

I would be very grateful for your advice!

Kind regards,
Aurelie

Linda K. Muthen posted on Wednesday, December 17, 2014 - 3:58 pm

None of the replications were completed. Try analyzing one of the data sets to see what the problem is.
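In practice this means pointing the DATA command at a single imputed file and dropping TYPE = IMPUTATION, with the rest of the input left as posted. The individual file names are not given in the thread, so the name below is a placeholder.

```mplus
DATA:
  ! One imputed data set instead of the list file.
  ! The file name is a placeholder; use one of the names
  ! listed inside TotalFile19t.4.bmsrclist.dat.
  FILE IS imputed_set_1.dat;
  ! TYPE = IMPUTATION; is omitted for a single data set.
```

Run this way, Mplus prints the error for that data set in the output instead of only reporting that zero replications completed.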

Aurelie Lange posted on Thursday, December 18, 2014 - 7:03 am

Dear Dr Muthen,

Thank you for your reply. I have analysed a single data set and received the error message "THE ESTIMATED BETWEEN COVARIANCE MATRIX COULD NOT BE INVERTED." When looking into the manual I saw that centering was used in cross-level interactions. Therefore, I have grand-mean centered therexp.
1) The analysis of every single data file now ran without errors. However, when using the 10 imputed datasets together, still only 4 of the replications were completed.

2) Also, in the output, I see the estimates of all paths (my model includes more paths than mentioned above), including the between-level path s on teamyexp. However, I don't see the estimates for the path famav19 on therexp. I was wondering whether this is correct or whether it is an error in my model. Is there any way I could get the estimates for this specific path?

3) In the output with the model estimates I also have a column "Rate of Missing". What does this column mean? Since I have imputed all my data, I wouldn't expect any missing values (I also get this column in the analysis without the cross-level interaction, which runs without problems).

I hope you could provide me with some suggestions to move on.

Kind regards,
Aurelie

Bengt O. Muthen posted on Thursday, December 18, 2014 - 6:43 pm

Please send the outputs and the data files to Support along with your license number.

Virginia Rangel posted on Monday, February 27, 2017 - 12:20 pm

I am trying to impute school level data using the following syntax:

VARIABLE:
NAMES=ID X1-X20 W1SCHOOL
W1S001-W1S200;
MISSING=ALL (-9);

DATA IMPUTATION:
IMPUTE=X1 X2 X3-X19 (c);
NDATASETS=5;
SAVE=PSF1MI*.dat;

ANALYSIS:
TYPE=BASIC;

OUTPUT:
TECH*;

When I ran this last week, it worked with no problem. Today, however, I realized I had to make a change to the underlying dataset and so I re-ran the imputation command and got the following message:

*** FATAL ERROR

THE CONVERGENCE CRITERION IS NOT SATISFIED.
INCREASE THE MAXIMUM NUMBER OF ITERATIONS OR INCREASE THE CONVERGENCE CRITERION

Unfortunately, I cannot send my data file as it is a secured data set.

Thank you in advance for any assistance!

Bengt O. Muthen posted on Monday, February 27, 2017 - 6:49 pm

Try one of the other 2 imputation methods.
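The reply does not name the option, but it presumably refers to the MODEL option of DATA IMPUTATION, whose settings are COVARIANCE (the default), SEQUENTIAL, and REGRESSION. Under that reading, a sketch of the change to the posted input would be:

```mplus
DATA IMPUTATION:
  IMPUTE = X1 X2 X3-X19 (c);
  NDATASETS = 5;
  MODEL = SEQUENTIAL;     ! or MODEL = REGRESSION;
                          ! the default is MODEL = COVARIANCE
  SAVE = PSF1MI*.dat;
```

The other commands in the input stay the same; only the imputation model used to generate the data sets changes.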

Pia Kreijkes posted on Wednesday, May 23, 2018 - 5:02 am

Hi,

I want to use multiple imputation for categorical (ordinal) WITHIN variables in a data set that contains a cluster variable. Pupils reported on the teaching practices of their teacher (within level) and are clustered into classes.

I do not actually have a between level variable but I wonder if I need to take the clustering into account. If so, can I use TYPE = BASIC TWOLEVEL or is TYPE=BASIC COMPLEX available for MI?

If I can only use TYPE = BASIC TWOLEVEL, do I have to specify all variables as WITHIN as well as include the CLUSTER=Class or is it sufficient to only include the CLUSTER?

Thanks

Tihomir Asparouhov posted on Wednesday, May 23, 2018 - 1:09 pm

You should use type=basic twolevel; and you should not list the variables on the within command. Here is some sample code.

variable: names=y1-y10 c;
cluster=c;
missing=all(999);
categorical=y1-y10;
data: file=1.dat;
analysis: type = twolevel basic;
data imputation:
impute = y1-y10 (c);
ndatasets = 10;
save = missimp*.dat;

Pia Kreijkes posted on Wednesday, May 23, 2018 - 2:33 pm

Great, thanks for your help!

Pia Kreijkes posted on Thursday, May 24, 2018 - 8:59 am

Hello, I would like to ask some follow up questions.

1) All 30 demanded data sets were created, which took a couple of hours but then I received an error message stating that no output file could be created. What could the reason be and is it still 'safe' to use the imputed data sets?

2) I have more variables than clusters. Does this pose a problem even though they are all WITHIN variables? I did not understand this fully from your chapter on Multiple Imputation with Mplus (Asparouhov & Muthén, 2010)

3) The imputed data sets contain variable values from 0 to 4 rather than the original 1 to 5. Similarly the gender variable now has scores 0 and 1 rather than 1 and 2. What could the reason be and is there a way I can prevent this re-coding in the future?

4) I have some AUXILIARY variables, which contain missing data that I do not wish to impute. In the imputed data sets, missing values are replaced by *, which can then not be read in later analyses. Is there a way that Mplus retains the original value?

Many thanks

Tihomir Asparouhov posted on Thursday, May 24, 2018 - 10:31 am

1) This is probably due to issues with computing multilevel polychoric correlations and you can ignore that but you shouldn't use these imputations because of point 2)

2) If you declare all the variables as within then you have essentially annulled the multilevel imputation and you imputed the data as if it is single level and ignoring the clustering. You should remove the within= command as in the sample code I gave above. Unfortunately also having more variables than clusters makes the imputation problem more difficult and you were able to avoid this difficulty by ignoring the clustering. If you remove the within= command you will probably get more problems. This situation can be resolved using H0 imputation, see the diagram on page 576 in the User's guide and Example 11.7 for the setup. I would recommend a model that looks like this
model:
%within%
y1-y10 with y1-y10;
%between%
f by y1-y10;
Such a model will allow you to estimate the unrestricted model on the within level while having an identified model on the between level.

3) We do not have a command to do that. You have to add 1 to all the values. You can do that with a separate run in Mplus using the define command run for each imputed data set. This is generally just a convenience and has no impact on the meaning.

4) There are two ways to do this. You can use the command
SAVEDATA: MISSFLAG = 999;
However, if you have multiple numbers indicating different types of missing data rather than just one 999 value, this won't work for you. If the variables are truly auxiliary variables listed on the auxiliary command (meaning they don't participate in the imputation at all and no information is used from them), you can simply not declare those values as missing. So instead of using
MISSING=all(999);
you can use
MISSING=y1-y5(999);
auxiliary=y6-y10;
and that will not change y6-y10. If the auxiliary variables are not truly auxiliary, you will have to duplicate them with the define command and make the duplicate copy truly auxiliary.
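For point 3, the separate DEFINE run could look like the sketch below, shown for the first imputed file from the earlier sample code. The column order in NAMES and the output file name are assumptions; the order in which the imputed files are written should be checked in the imputation output.

```mplus
! Sketch: shift imputed categories from 0-4 back to 1-5.
DATA:
  FILE = missimp1.dat;       ! repeat for missimp2.dat, missimp3.dat, ...

VARIABLE:
  NAMES = y1-y10 c;          ! assumed column order of the saved file
  USEVARIABLES = y1-y10;
  AUXILIARY = c;             ! keeps the cluster variable in the saved file

DEFINE:
  DO (1, 10) y# = y# + 1;    ! adds 1 to each of y1 through y10

ANALYSIS:
  TYPE = BASIC;

SAVEDATA:
  FILE = missimp1_plus1.dat; ! hypothetical output name
```

A 0/1 gender variable can be moved back to 1/2 with the same kind of statement.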

Pia Kreijkes posted on Thursday, May 24, 2018 - 11:43 am

Thanks again for your helpful and prompt explanations.

Just to clarify, I did not actually define my variables as within in the MI, as this is what you have suggested in the last post. What I tried to say under point 2 is that all my variables are within and not between variables but I defined neither within nor between variables in the syntax. And the datasets were imputed anyways. However, I increased the H1 iterations to achieve this. If the data sets are created although I have more variables than clusters and although I do not define the variables as within, can I then use the imputations or will the data sets likely be faulty and I should use the H0 model you suggested?

Thanks

Tihomir Asparouhov posted on Thursday, May 24, 2018 - 3:08 pm

I see. We have done extensive simulations on the topic of large number of variables with small number of clusters. See Sections 3.2, 3.3, 3.4, 3.5 in
http://statmodel2.com/download/Imputations7.pdf

The H0 imputations can be a bit more precise but are susceptible to model assumptions. I think you can use the data sets that you have generated - they are not faulty.
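For readers who do want to try the H0 route, the sketch below combines the earlier sample code with the suggested model. TYPE = TWOLEVEL with ESTIMATOR = BAYES follows the H0 setup shown earlier in this thread; file and variable names are those of the sample code, and convergence is not guaranteed.

```mplus
DATA:
  FILE = 1.dat;

VARIABLE:
  NAMES = y1-y10 c;
  CLUSTER = c;
  MISSING = ALL (999);
  CATEGORICAL = y1-y10;

ANALYSIS:
  TYPE = TWOLEVEL;
  ESTIMATOR = BAYES;         ! H0 imputation is model-based

MODEL:
  %WITHIN%
  y1-y10 WITH y1-y10;        ! unrestricted on the within level
  %BETWEEN%
  f BY y1-y10;               ! one factor keeps the between level identified

DATA IMPUTATION:
  IMPUTE = y1-y10 (c);
  NDATASETS = 10;
  SAVE = missimp*.dat;
```

The one-factor between-level structure is the model assumption the reply warns about: if it is badly wrong, the imputed values inherit that misfit.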

Pia Kreijkes posted on Thursday, May 24, 2018 - 3:12 pm

Great news! Thanks for the advice.

Aurelie Lange posted on Saturday, April 18, 2020 - 2:48 pm

Dear Prof Muthen,

I am running a multiple imputation using the following input:

USEVARIABLES ARE
var1 var2 var3 clus;
between is var3;
AUXILIARY = clus;
cluster = clus;

DATA IMPUTATION:
IMPUTE = var1 var2 var3;
NDATASETS = 40;
SAVE = imp*.dat;

ANALYSIS: TYPE = twolevel basic;

However, I get the error message
"Auxiliary variable CLUS has multiple uses."
If I leave out the auxiliary command, clus is not imputed. However, if I put clus in the IMPUTE command (as well as in usevariables), I get the message "Unknown variable(s) in the IMPUTE option".

How can I specify the cluster variable as well as making sure it is kept in the imputed files?

Thank you!

Bengt O. Muthen posted on Saturday, April 18, 2020 - 3:17 pm

Please send the output to Support along with your license number.