Mplus Discussion >> Monte Carlo simulations with missing data

Jenny B posted on Friday, July 30, 2004 - 2:44 am

I am trying to use a Monte Carlo simulation to estimate the power, given the amount of missing data that I have, in a particular study.

So far I have used saved parameter estimates (from my sample) as true population values. Using MODEL MISSING, the output shows the missing data pattern from the first replication only. Is it possible to get the missing data patterns from every replication saved, so that I can check that the proportions of missing data generated are similar to those in the sample?

I have tried to use PATMISS and PATPROBS to specify the missing data patterns and proportions, but ran into difficulty specifying a pattern (and its relative proportion of the population) that is complete, i.e., has no missing data. Is this possible?

Thanks very much.

Linda K. Muthen posted on Friday, July 30, 2004 - 8:08 am

The Monte Carlo output prints only the missing data patterns from the first replication. If you want it for each replication, you would need to save the data sets and analyze each one separately. It is usually not necessary to do this as the first replication is usually enough to show that you have gotten the patterns correct. You can increase the sample size for one replication to 100,000 to see if the missing value patterns are what you expect.
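
One way to do the check described above outside Mplus is to tally the missing data patterns in a saved replication. The following is only a minimal sketch: the function name `pattern_frequencies`, the in-memory `rows` layout, and the use of `*` as the missing value flag are assumptions for illustration, so adjust them to however the saved data sets are actually written.

```python
from collections import Counter

def pattern_frequencies(rows, missing_flag="*"):
    """Return {pattern: proportion}, where a pattern is a string with
    'x' for an observed value and '.' for a missing value."""
    counts = Counter(
        "".join("." if value == missing_flag else "x" for value in row)
        for row in rows
    )
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

# Tiny made-up data set (4 cases, 3 variables), for illustration only.
rows = [
    ["1.2", "0.4", "*"],
    ["0.3", "*", "*"],
    ["2.1", "1.0", "0.7"],
    ["0.9", "0.2", "*"],
]

freqs = pattern_frequencies(rows)
for pattern, share in sorted(freqs.items()):
    print(pattern, share)
```

The resulting proportions can then be compared with the pattern proportions in the real sample.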

Regarding PATMISS and PATPROBS, you need to send the output to support@statmodel.com along with a description of what you want the missing data to look like.

Anonymous posted on Sunday, January 09, 2005 - 10:24 am

With Mplus, model parameters estimated from real data can be saved and used as the true population parameter values for a Monte Carlo simulation. For a Monte Carlo study of latent growth modeling with missing values and covariates, can Mplus save the missing data patterns that exist in the real data and then automatically use these patterns when generating data in the simulation?
If we have to specify the missing data patterns in the MODEL MISSING command, it would be very difficult to match them to the patterns in the real data. Any comment or suggestion will be highly appreciated.

bmuthen posted on Sunday, January 09, 2005 - 12:55 pm

It is currently not possible to save the missing data patterns for use in MC. I would try to mimic the major (most frequent) patterns. Missing data is hard to simulate in a realistic way because the cause of missingness (predicted by x's, earlier y's, missing y's, latent continuous variables, latent categorical variables) is not known. So experimentation with likely scenarios is needed.

Mike Cheung posted on Sunday, May 15, 2005 - 8:11 pm

Dear Muthen,

I have two questions about Monte Carlo simulations with missing data:

1) I used "Montecarlo" to generate missing data and wanted to analyze the data with "listwise deletion". Therefore, I used "ANALYSIS: TYPE=BASIC;". However, I got the following error message:

*** ERROR in Montecarlo command
TYPE = MISSING must be specified when using MONTECARLO to generate missing data.

How can I generate missing data without using the function of "missing" in the analysis?

2) In Monte Carlo simulations with TECH9 in OUTPUT, the program prints the error message for the replications. I would like to see if these problematic replications are included in calculating the summary statistics and parameter estimates or not.

If yes, how can I exclude these problematic replications in the simulations?

If no, the number of "convergent" and "good" replications is likely smaller than the number of replications stated in the design. Is it possible to have a fixed number, say 1000, of "convergent" and "good" replications?

Thanks in advance!

Linda K. Muthen posted on Monday, May 16, 2005 - 8:27 am

It sounds like you want to generate data sets that contain missing data but then analyze them using listwise deletion. You cannot do this in one step. You will have to generate the data in the first step and analyze them using external Monte Carlo. See Example 11.6 and Chapter 18 of the Mplus User's Guide. To generate data with missingness, you need to include MISSING in the TYPE option.

The results are for the number of replications that were completed. This number is printed in the output. It is not possible to specify the number of completed replications. You would have to keep increasing the number of requested replications until you get the number of completed replications that you want.
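
As a rough planning aid for the trial-and-error approach described above, the number of replications to request can be estimated from the completion rate seen in a pilot run. This is a sketch only: `replications_to_request` is a hypothetical helper, and the 95% completion rate is a made-up value, not a figure from this thread.

```python
import math

def replications_to_request(target_completed, completion_rate):
    """Estimate how many replications to request so that, on average,
    target_completed of them finish. completion_rate is the share of
    replications that completed in a pilot run (0 < rate <= 1)."""
    if not 0.0 < completion_rate <= 1.0:
        raise ValueError("completion_rate must be in (0, 1]")
    return math.ceil(target_completed / completion_rate)

# Hypothetical pilot: 95% of requested replications completed.
requested = replications_to_request(1000, 0.95)
print(requested)
```

Because the number of completed replications varies from run to run, the estimate is only a starting point, and the completed count printed in the output still has to be checked.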

Anonymous posted on Sunday, August 21, 2005 - 7:37 am

Dear Muthen,

I would like to analyze many external "mean and covariance matrices" by MONTECARLO. I used something like below:

DATA: FILE IS covlist.dat;
TYPE = MONTECARLO;
TYPE IS MEANS COVARIANCE;
NOBSERVATIONS = 100;

I got an error message saying that the input covariance matrix could not be inverted. However, it worked perfectly okay if I analyzed the covariance matrices one by one.

It seems to me that it's not possible to use "TYPE =MONTECARLO" and "TYPE IS MEANS COVARIANCE" at the same time. Am I correct? Thanks!

Linda K. Muthen posted on Monday, August 22, 2005 - 8:27 am

The external Monte Carlo facility of Mplus requires raw data. We will add a better error message.

Tor Neilands posted on Thursday, March 23, 2006 - 11:22 am

Hi, Linda.

In your excellent and very helpful 2002 SEM Journal paper on Monte Carlo simulation of statistical power and effect size, on page 605 you provide an example of simulating a missing data process with missing at random data for a linear growth model with four indicators.

You note that when the dichotomous covariate has a value of zero, the first indicator has 12% missing data, the second indicator has 18% missing, the third has 27%, and the fourth indicator has 50% missing. When the covariate is 1.00 in value, the respective amounts of missing data are 12% again for the first indicator, 38%, 50%, and 78% for the remaining indicators.

The syntax for this model is shown on the top of page 616 of the appendix of the article. The missing data section is stipulated as follows:

MODEL MISSING:
%OVERALL%
[y1@-2 y2@-1.5 y3@-1 y4@0];
y2-y4 ON x@1;

I'm curious to learn how the values of -2, -1.5, -1, and zero map onto the proportions of missingness that you report on page 605. In other words, how does one work backwards from an expected amount of missing data per wave of measurement to obtain the proper regression weights to use in the MODEL MISSING section of the syntax?

Regards and thanks,

Tor

Linda K. Muthen posted on Thursday, March 23, 2006 - 4:06 pm

The values are logits. To turn them into probabilities, use the formula

p = 1 / (1 + exp (-logit))

for y1 the logit of -2 results in the probability of 0.12.

For y2-y4, the logit is based on the covariate also. The logit for y2 is

logit = -1.5 + bx;

For x=1,

logit = -1.5 + 1*1 = -.5

The probability for a logit of -.5 is .38.
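
The arithmetic in this reply can be checked with a few lines of Python. The sketch below takes the intercepts and the slope of 1 from the MODEL MISSING statement quoted earlier in the thread; the dictionary layout and the function names are illustrative choices, not Mplus output. The `logit` function gives the reverse mapping, from a desired missing proportion to the value to fix in MODEL MISSING when no covariate is involved.

```python
import math

def logistic(logit_value):
    """p = 1 / (1 + exp(-logit)), the formula given in the reply above."""
    return 1.0 / (1.0 + math.exp(-logit_value))

def logit(p):
    """Inverse mapping: the logit that yields missing probability p."""
    return math.log(p / (1.0 - p))

# Intercepts and slopes from the quoted MODEL MISSING statement.
# y1 is not regressed on x, so its slope is 0.
intercepts = {"y1": -2.0, "y2": -1.5, "y3": -1.0, "y4": 0.0}
slopes = {"y1": 0.0, "y2": 1.0, "y3": 1.0, "y4": 1.0}

def missing_probabilities(x):
    """Missing probability per indicator for covariate value x."""
    return {
        name: logistic(intercepts[name] + slopes[name] * x)
        for name in intercepts
    }

for x in (0, 1):
    rounded = {name: round(p, 2) for name, p in missing_probabilities(x).items()}
    print("x =", x, rounded)
```

For x = 0 this gives 0.12, 0.18, 0.27, and 0.50; for x = 1 it gives 0.12, 0.38, 0.50, and about 0.73 for y4.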

Tor Neilands posted on Saturday, March 25, 2006 - 9:10 pm

Thank you, Linda.

I have a follow-up question. I am simulating power for a latent growth curve model with four observed indicators per time point and missing data at times 2-4. There are four time points, so there are 16 Y measures.

I want to create "wave missing" data that captures a typical attrition scenario where once a subject drops out of the study she no longer contributes data to that wave or subsequent waves. That is, I want Mplus to generate MAR missing data for 10% of the sample at time 2, 15% at time 3, and 20% at time 4. The catch is that the missing data generation should yield four patterns:

- complete data for all 16 measures
- missing data for all time 2, 3, & 4 measures (10% of the sample)
- missing data for all time 3 & 4 measures (5% of the sample)
- missing data for all time 4 measures (5% of the sample)

So, 20% of the total sample will have some form of missing data.

This is my current MODEL MISSING syntax.
MODEL MISSING:

%OVERALL%
[y21@-2.198 y31@-1.735 y41@-1.387
y22@-2.198 y32@-1.735 y42@-1.387
y23@-2.198 y33@-1.735 y43@-1.387
y24@-2.198 y34@-1.735 y44@-1.387];

This yields measure-specific missingness that is correct within wave, but many missing data patterns are generated. Many of these are not realistic for my situation (e.g., one measure within a wave will be missing, but others will have data; our interviewer-based protocols will result in virtually complete data for any participant that is not lost to attrition).

Many thanks in advance for your suggestions,

Tor
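
The problem described in this post can be quantified with a short calculation. The sketch below assumes that each of the 12 indicators at times 2-4 goes missing independently with its wave's rate, which is how the post describes the output of the intercept-only MODEL MISSING statement; the variable names are illustrative.

```python
import math

def logit(p):
    """Logit that corresponds to missing probability p."""
    return math.log(p / (1.0 - p))

# Per-measure missing rates by wave, as stated in the post.
wave_rates = {"time 2": 0.10, "time 3": 0.15, "time 4": 0.20}
measures_per_wave = 4

# Intercepts implied by those rates (close to the values in the post).
intercepts = {wave: round(logit(p), 3) for wave, p in wave_rates.items()}
print(intercepts)

# Under independent draws, a case is complete only if all 12 indicators
# at times 2-4 happen to be observed.
p_complete = math.prod((1.0 - p) ** measures_per_wave for p in wave_rates.values())

# Probability that all four time-2 measures are missing together.
p_whole_wave_2 = wave_rates["time 2"] ** measures_per_wave

print(round(p_complete, 3), p_whole_wave_2)
```

Under this independence assumption only about 14% of cases are complete, compared with the intended 80%, and an entire wave goes missing together only rarely, which is consistent with the many unrealistic patterns reported.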

Bengt O. Muthen posted on Sunday, March 26, 2006 - 3:46 pm

Would the Mplus options PATMISS and PATPROBS do what you want?

Tor Neilands posted on Sunday, March 26, 2006 - 5:40 pm

Thanks, Bengt.

Is it possible to use PATMISS and PATPROBS in conjunction with MODEL MISSING to generate MAR missing data patterns?

Thanks,

Tor

Linda K. Muthen posted on Sunday, March 26, 2006 - 5:54 pm

No, you would need to use one or the other.

Tor Neilands posted on Sunday, March 26, 2006 - 10:03 pm

Thank you, Linda.

You and Bengt are both amazing to respond in this forum over the weekend. Much appreciation to you both for your prompt and helpful replies.

I will continue to experiment with MODEL MISSING.

In your SEM Journal paper, you converted portions of the Mplus Monte Carlo output into Cohen-metric effect sizes. Could you describe what portions of the Mplus output you used to obtain the Cohen effect sizes you reported for the two growth model scenarios (slope regression weight = .2; slope regression weight = .1)? I'm unclear on how you were able to use the Mplus results to obtain the Cohen metric effect sizes.

Thank you,

Tor

Bengt O. Muthen posted on Monday, March 27, 2006 - 6:35 am

I don't know what you mean by converting portions of the Mplus output to Cohen effect sizes. Perhaps you refer to page 604 in the SEM article where we say that we get different effect sizes for different regression coefficients for s regressed on the covariate. The effect size computations are described on pp. 604-605. One can also compute it wrt observed outcomes.

Tor Neilands posted on Monday, March 27, 2006 - 11:49 am

Thank you, Bengt.

You're right. I was referring to the discussion on pp. 604-605.

Apologies for being dense, but I am still not seeing how one obtains the value of .63 as the estimate of the effect size from the input slope coefficient of .20.

Thanks,

Tor

Bengt O. Muthen posted on Monday, March 27, 2006 - 3:14 pm

Just divide the reg coeff of the growth slope factor regressed on the binary x by the SD (the sqrt of the variance) of the growth slope factor. This is a Cohen-like quantity in that the reg coeff is the mean difference between the 2 x groups in the slope growth factor.
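
The division described above can be written out as follows. This is a sketch under a stated assumption: the thread does not give the variance of the growth slope factor, so the value 0.1 used here is hypothetical and was chosen only because it reproduces an effect size of roughly .63 for a coefficient of .20.

```python
import math

def cohen_like_effect_size(slope_on_x, slope_factor_variance):
    """Regression coefficient of the slope factor on the binary x,
    divided by the SD (square root of the variance) of the slope factor."""
    return slope_on_x / math.sqrt(slope_factor_variance)

# ASSUMPTION: 0.1 is an illustrative variance, not a value from the thread.
assumed_variance = 0.1

for coefficient in (0.2, 0.1):
    print(coefficient, round(cohen_like_effect_size(coefficient, assumed_variance), 2))
```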

Tor Neilands posted on Sunday, April 16, 2006 - 7:20 pm

Hi, Linda and Bengt.

I would like to simulate NMAR data using Mplus. Can you point me to any example MODEL MONTECARLO syntax that illustrates how to generate NMAR data using Mplus?

With best wishes and many thanks,

Tor

Bengt O. Muthen posted on Monday, April 17, 2006 - 8:30 am

We don't have a specific example available, but you can start from the User's Guide ex 11.2 and modify the Model Missing part. For example, you can have missing on y4 as a function of the value of y4 that would have been observed by saying in Model Missing:

y4 on y4*c;

where c is a value such as 0.5.
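
To see what this kind of NMAR mechanism does to the observed data, the idea can be imitated outside Mplus. The sketch below is not Mplus syntax and makes several assumptions for illustration: y4 is drawn from a standard normal distribution, the missingness intercept is -1, and c = 0.5 as suggested above.

```python
import math
import random

def logistic(value):
    return 1.0 / (1.0 + math.exp(-value))

def simulate_nmar(n, intercept, c, seed=1):
    """Draw n values of y4 and delete each one with probability
    logistic(intercept + c * y4), so missingness depends on y4 itself."""
    rng = random.Random(seed)
    values = [rng.gauss(0.0, 1.0) for _ in range(n)]
    observed = [
        y for y in values
        if rng.random() >= logistic(intercept + c * y)  # kept when not flagged missing
    ]
    return values, observed

values, observed = simulate_nmar(n=20000, intercept=-1.0, c=0.5)

full_mean = sum(values) / len(values)
observed_mean = sum(observed) / len(observed)
missing_rate = 1.0 - len(observed) / len(values)

print(round(missing_rate, 3), round(full_mean, 3), round(observed_mean, 3))
```

With a positive c, larger values of y4 are more likely to be missing, so the mean of the observed values falls below the mean of the full data.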

Shu Xu posted on Friday, January 26, 2007 - 12:16 am

Dear Linda and Bengt,

I am wondering what mechanism is used to generate the data sets for a two-part semicontinuous growth model for a continuous outcome (Ex 11.9). Is there a technical report on the data-generating method for the two-part growth model?

Violxu

Linda K. Muthen posted on Friday, January 26, 2007 - 6:46 am

See the DATATWOPART command. The method is described and a reference is given.

Patricienn Kaponson Moreno posted on Thursday, June 28, 2007 - 8:22 am

Dr. Muthen,

How do I generate differing missing data patterns in Mplus? I want to create several patterns of missing data to see whether they affect the robustness of the imputation method.

Linda K. Muthen posted on Thursday, June 28, 2007 - 8:37 am

The options of the MONTECARLO command are described in Chapter 18 of the Mplus User's Guide. Examples 11.1 and 11.2 show how to generate missing data.

Scott R. Colwell posted on Tuesday, October 27, 2009 - 2:32 pm

If you only want 1 variable say X1 of a set of variables (x1-x12) to have missing values and you only want 1 pattern where 10% of X1 are missing across 25% of the cases (the other 75% are not missing any), is this the correct way to specify this:

PATMISS = X1(.10) | X1(.00);
PATPROBS = .25 | .75;

Linda K. Muthen posted on Tuesday, October 27, 2009 - 4:04 pm

Try it. If it does not work, contact support.
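
Before running such a specification, it can help to work out the overall missing rate it implies. The sketch below assumes the usual reading of these options: each PATMISS pattern gives the proportion missing per variable within that pattern, and PATPROBS gives the share of cases in each pattern. The function `marginal_missing_rates` is a hypothetical helper, not part of Mplus.

```python
def marginal_missing_rates(patterns, pattern_probs):
    """Overall missing rate per variable: the sum over patterns of
    (share of cases in the pattern) * (missing proportion in the pattern)."""
    if abs(sum(pattern_probs) - 1.0) > 1e-9:
        raise ValueError("pattern probabilities must sum to 1")
    rates = {}
    for pattern, weight in zip(patterns, pattern_probs):
        for variable, p_missing in pattern.items():
            rates[variable] = rates.get(variable, 0.0) + weight * p_missing
    return rates

# The specification from the post: PATMISS = X1(.10) | X1(.00); PATPROBS = .25 | .75;
patterns = [{"x1": 0.10}, {"x1": 0.00}]
pattern_probs = [0.25, 0.75]

rates = marginal_missing_rates(patterns, pattern_probs)
print(rates)
```

Under this reading, the specification yields an overall missing rate of 2.5% on x1 (10% of the 25% of cases in the first pattern).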

Hyeyoung, Ahn posted on Friday, October 09, 2015 - 4:17 am

Dear, Muthen.

I have x1-x3, y1-y3, u1-u10 variables.
I would like to generate MAR missing pattern on y1-y3 and MCAR on u1-u10.

Can I use just "PATMISS" and "PATPROBS"? Otherwise, do I have to use the "MODEL MISSING" syntax?

It seems to me that it's not possible to use "MISSING" in MONTECARLO and "PATMISS" at the same time.

Bengt O. Muthen posted on Friday, October 09, 2015 - 8:52 am

I recommend using Model Missing.

Fred posted on Wednesday, August 02, 2017 - 2:19 am

Dr. Muthen,

I've tried to model missing data under a MAR mechanism with the MODEL MISSING command. The model is a relatively simple CFA with five indicators (each a 5-category scale).

The missing values should occur only on y3-y5 and depend on y1 (for y3 and y4) and on y2 (for y5).

I tried to set up my equations with the formulas you provided, but I cannot figure out why it does not work.

The following is my MODEL MISSING command:
[y3@-3.94];
[y4@-3.94];
[y5@-3.94];
y3 ON y1@1;
y4 ON y1@1;
y5 ON y2@1;

With that I get about 10% missing on each of my indicators.
Now, is there a way to determine the correct equations so that I get the desired missing rates?

Thanks for the answer.

Fred posted on Thursday, August 03, 2017 - 12:25 am

Update to post above:

with the following model missing command I get the desired 5% missing for the three variables each:
[y3@-5.38];
[y4@-5.38];
[y5@-5.38];
y3 ON y1@1;
y4 ON y1@1;
y5 ON y2@1;

Now the question: Can I be sure, that this is a MAR mechanism?

Thank you for the answer
Fred

Bengt O. Muthen posted on Thursday, August 03, 2017 - 2:52 pm

Yes, that is MAR because the missing data probability depends on y1 which doesn't have missing data.
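
The earlier question of how to find an intercept that gives a desired overall missing rate can be handled numerically instead of by trial and error. The sketch below assumes a slope of 1 and a predictor with a known discrete distribution; the uniform distribution over categories 1 to 5 is hypothetical and is not meant to reproduce the intercepts reported in the posts above.

```python
import math

def logistic(value):
    return 1.0 / (1.0 + math.exp(-value))

def marginal_rate(intercept, slope, distribution):
    """Overall missing rate: the missing probability at each predictor
    value, weighted by how often that value occurs."""
    return sum(
        prob * logistic(intercept + slope * value)
        for value, prob in distribution.items()
    )

def solve_intercept(target, slope, distribution, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection search; works because the marginal rate increases
    monotonically with the intercept."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if marginal_rate(mid, slope, distribution) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# ASSUMPTION: y1 takes the values 1..5 with equal probability.
y1_distribution = {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}

intercept = solve_intercept(target=0.05, slope=1.0, distribution=y1_distribution)
print(round(intercept, 3), round(marginal_rate(intercept, 1.0, y1_distribution), 4))
```

Since the missing probability depends only on the fully observed predictor, changing the intercept alters the amount of missing data but not the MAR character of the mechanism.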

Joao Garcez posted on Wednesday, November 15, 2017 - 4:59 am

Hello Linda and Bengt Muthen,

I have the following question that, hopefully, you'll be able to help with:

I have a dataset with N = 160 and 60% missing data (MCAR) on some of the variables, and I wanted to do a Monte Carlo study for power analysis, but I am a bit confused as to whether I should specify this pattern of missing data or not, because I intend to run my analyses using multiple imputation. So even if the full data (N = 160) is used in the analysis due to MI or FIML, do I still need to specify PATMISS and PATPROBS when doing Monte Carlo for power analysis?

Thank you for whatever help you can provide,

Joao

Bengt O. Muthen posted on Wednesday, November 15, 2017 - 1:09 pm

Yes, you should specify the pattern of missingness in your simulation.

Joao Garcez posted on Sunday, November 19, 2017 - 4:30 am

Dear Dr. Muthen,

Thank you for your help and prompt answer.

Best,

Joao.

Christoph Herde posted on Sunday, July 01, 2018 - 12:00 pm

Dear Dr. Muthén,
I am working on a complex data set in Mplus with 19 items in the item pool, but participants only responded to a random set of 8 items. However, I want to examine a model with one latent factor that explains responses to all 19 items.
I read in the data as they are and let FIML cope with the "missing by design".
Whereas I get good results at the total sample level, results are bad if I use a limited number of observations.
I wanted to investigate whether the "missing by design" issue may prevent the model fit indicators (e.g., CFI, RMSEA) from working reliably if the sample size is too low.
I therefore prepared Monte Carlo simulations in which I vary the number of observations. I paste my input below. Is the specification of missing data appropriate for my kind of design?
Is it then correct to investigate whether the "expected" and "observed" percentiles for chi-square, RMSEA, and SRMR are close to each other to answer my research question?

PATMISS =
ao1(.58)
ao2(.58)
ao3(.58)
ao4(.58)
ao5(.58)
ao6(.58)
ao7(.58)
ao8(.58)
ao9(.58)
ao10(.58)
ao11(.58)
ao12(.58)
ao13(.58)
ao14(.58)
ao15(.58)
ao16(.58)
ao17(.58)
ao18(.58)
ao19(.58);
PATPROBS = 1;

Bengt O. Muthen posted on Monday, July 02, 2018 - 10:08 am

If you have a large N to begin with you shouldn't have any problems. The real question is how many observations you have per variable and pairs of variables.

How many patterns of different sets of variables do you have? Note that UG ex 12.1 shows 2 patterns. I don't think you can approach this as only 1 pattern, as your input suggests.
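
The point about observations per variable and per pair of variables can be made concrete for this design. The sketch below assumes that exactly 8 of the 19 items are shown to each participant, chosen uniformly at random without replacement; the subgroup size of 400 is only an example.

```python
import math

n_items = 19
n_shown = 8

# Share of cases missing a given item (about .58, as in the PATMISS input).
p_missing = 1.0 - n_shown / n_items

# Number of distinct item sets, i.e. possible missing data patterns.
n_patterns = math.comb(n_items, n_shown)

# Probability that two specific items are both shown to the same person.
p_pair_shown = (n_shown / n_items) * ((n_shown - 1) / (n_items - 1))

def expected_counts(n):
    """Expected number of cases per item and per pair of items."""
    return n * n_shown / n_items, n * p_pair_shown

per_item, per_pair = expected_counts(400)
print(round(p_missing, 2), n_patterns, round(p_pair_shown, 4))
print(round(per_item, 1), round(per_pair, 1))
```

Under these assumptions a subgroup of 400 has roughly 168 cases per item but only about 65 per item pair. A single pattern with independent .58 missingness per item is a different mechanism: the number of observed items then varies from case to case instead of being fixed at 8.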

Christoph Herde posted on Tuesday, July 03, 2018 - 9:18 am

Thanks for your quick reply!

Because the items were presented randomly by a software, all possible combinations of missings per item are possible. If I read in the data for the total sample, I receive the report of 4029 missing data patterns.

My approach to specify missings for the monte carlo simulation was to specify the relative frequency that each item was displayed to the participant (equal for each participant because randomly chosen by a software). Then, the patprobs was set at 1.

If this may be a bad idea, do you have a good solution how to specify it more appropriately in Mplus?

Thanks a lot for your support in advance!

Bengt O. Muthen posted on Tuesday, July 03, 2018 - 3:07 pm

Perhaps there are fewer patterns in the real data - your N is only 160.

I get no quick idea for how to otherwise generate the data, but I am also not sure you need to do the Monte Carlo study at all.

Christoph Herde posted on Wednesday, July 04, 2018 - 12:28 pm

May I ask what you mean by "N is only 160"?

If I read the data into the software, it tells me that there are around 4,100 observations with at least 1 "non-missing" value across the 19 items.

The background for the Monte Carlo study is that we want to examine the properties of a test at the total sample level, but also for each of several subgroups. We get very good results for subgroups with large sample sizes (e.g., 1000), but implausibly bad results for subgroups with smaller sample sizes (e.g., 400). That's why I wanted to examine whether the missing by design issue may cause problems for the model fit indicators if sample sizes are below a specific cut-off.

Thanks again for your valuable support!

Bengt O. Muthen posted on Wednesday, July 04, 2018 - 4:26 pm

I misread your N. I would think that subgroup sample size of 400 would be enough for good testing unless the items are not continuous or the model very complex. Could it be that those subgroups are not randomly obtained from the same population as those with N=1000?

Christoph Herde posted on Friday, July 06, 2018 - 12:00 pm

All the subgroups are not randomly sampled. The subgroups represent different regional groups (continents). For some regional groups, N is large (around 1000), for some lower (< 400).
The basic model is very simple (one-factor model), but all indicators are continuous scores between 2 and 10.

If there is no straightforward way in Mplus to specify the complex missing data pattern, should I assume that I would have to rely on rules of thumb for the number of observations at which the model fit indicators work reliably?

(quick explanation for the "strange" results: I do get fantastic model fit for the sample with all European people, for example, but horrible model fit for each sub-sample that describes people from one European country each).

Thanks again!

Bengt O. Muthen posted on Friday, July 06, 2018 - 5:39 pm

Model tests of fit for a 1-factor model for continuous outcomes should be fine for N around 400. Perhaps explore with Modindices why the fit is poor for each European country.

Christoph Herde posted on Tuesday, July 10, 2018 - 1:50 pm

Thanks a lot for your support!

Margarita posted on Thursday, August 02, 2018 - 6:59 am

Hi Dr. Muthén,

Following up some of the previous discussions, I was wondering if it's possible to bypass the pattern option with categorical items issue within multilevel Monte Carlo? I am trying to do a multilevel multigroup Montecarlo as per webnote 16 and during Step1 the variable pattern is saved which unfortunately I cannot use during step2 because I have categorical indicators. Is there an alternative within Mplus 8.1?

Thank you

Tihomir Asparouhov posted on Thursday, August 02, 2018 - 2:28 pm

There are two ways to generate missing data in montecarlo. The first one is PATMISS (see User's Guide example 12.1). The second one is based on MODEL MISSING (see User's Guide example 12.2). Usually MODEL MISSING is more flexible and can be tailored to specific needs.

Daniel Lee posted on Monday, April 01, 2019 - 11:30 am

Hi, is it possible to indicate missing values for the independent variables when doing a monte carlo power estimation?

For example, while all our study participants will probably self-report their gender, some may report their relationship with mom and dad (if they are from a 2-parent led household), while others might just report their relationship with either mom or dad (if they are from a single parent family). If the analysis has mother and father relationship quality as independent variables (predicting an outcome for the child), how would we account for some of the missingness in the independent variable when doing a MC simulation power analysis?

Thank you!

Bengt O. Muthen posted on Monday, April 01, 2019 - 5:13 pm

Only by treating them as Y's (dependent variables; mentioning their variance).

Daniel Lee posted on Wednesday, April 03, 2019 - 6:58 am

Thank you!

Salina Whitaker posted on Monday, April 06, 2020 - 2:07 pm

Hi. I am running a Monte Carlo simulation using external data. I want to impose different missing data patterns on the data before the simulation. Specifically:

1. Probability of missingness on a variable increases linearly with the value of the covariate and
2. Probability of missingness on a variable changes in a convex manner--with higher probability at the high and low ends of the covariate.

I have not been able to figure out how to do this in Mplus. Any help/insight would be appreciated. Thank you

Bengt O. Muthen posted on Monday, April 06, 2020 - 5:56 pm

You can include a binary variable in your model where the 2 categories correspond to missing or not. And regress that variable on the variables you have in mind. Then use the category value for missing as a missing data flag.
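
The two mechanisms in the question can also be prototyped outside Mplus before they are built into a model. The sketch below is an illustration, not Mplus syntax: all intercepts, slopes, and the centering value are made-up, and "linear" here means linear on the logit scale, not on the probability scale.

```python
import math
import random

def logistic(value):
    return 1.0 / (1.0 + math.exp(-value))

def p_missing_linear(x, intercept=-2.0, slope=1.0):
    """Missingness logit rises linearly with the covariate."""
    return logistic(intercept + slope * x)

def p_missing_u_shaped(x, intercept=-3.0, curvature=1.0, center=0.0):
    """Missingness logit rises with the squared distance from center,
    so both low and high covariate values are more likely to be missing."""
    return logistic(intercept + curvature * (x - center) ** 2)

def flag_missing(covariate, prob_fn, seed=1):
    """Return one 0/1 flag per case; 1 means the outcome is set to missing."""
    rng = random.Random(seed)
    return [1 if rng.random() < prob_fn(x) else 0 for x in covariate]

covariate = [-2.0, -1.0, 0.0, 1.0, 2.0]
for x in covariate:
    print(x, round(p_missing_linear(x), 3), round(p_missing_u_shaped(x), 3))

flags = flag_missing(covariate, p_missing_u_shaped)
```

The flags play the role of the binary missing/not missing variable described above, with the category 1 marking which outcome values to set to missing.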

Rolf Gjestad posted on Tuesday, August 04, 2020 - 12:38 am

Hello

I am somewhat confused here. I am going to simulate power in a growth curve model based on 8 time points with increasing dropout over time. Should I use the PATMISS and PATPROBS options or MODEL MISSING?

best,
Rolf Gjestad

Bengt O. Muthen posted on Tuesday, August 04, 2020 - 5:32 pm

I would instead go by UG ex12.2.

Rolf Gjestad posted on Tuesday, August 04, 2020 - 10:17 pm

Thank you very much !
Rolf