I am trying to use a Monte Carlo simulation to estimate the power, given the amount of missing data that I have, in a particular study.
So far I have used saved parameter estimates (from my sample) as true population values. Using MODEL MISSING, the output shows the missing data pattern from the first replication only. Is it possible to get the missing data patterns from every replication saved, so that I can check that the proportions of missing data generated are similar to those in the sample?
I have tried to use PATMISS and PATPROBS to specify the missing data patterns and proportions, but ran into difficulty specifying a pattern (and its relative proportion of the population) that is complete, i.e., has no missing data. Is this possible?
The Monte Carlo output prints only the missing data patterns from the first replication. If you want it for each replication, you would need to save the data sets and analyze each one separately. It is usually not necessary to do this as the first replication is usually enough to show that you have gotten the patterns correct. You can increase the sample size for one replication to 100,000 to see if the missing value patterns are what you expect.
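If the generated data sets are saved to disk, the pattern check can also be done outside Mplus. The following is a minimal Python sketch rather than an Mplus feature. It assumes each saved data set has already been read into rows of numbers and that missing values carry a numeric flag; the flag value 999 and the helper name `pattern_proportions` are illustrative assumptions.

```python
from collections import Counter

MISSING_FLAG = 999.0  # assumed missing-value code; adjust to your saved files


def pattern_proportions(rows, missing_flag=MISSING_FLAG):
    """Return {pattern: proportion}, where a pattern is a string of
    'x' (observed) and '.' (missing), one character per variable."""
    counts = Counter(
        "".join("." if value == missing_flag else "x" for value in row)
        for row in rows
    )
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


# Tiny made-up replication with three variables (illustration only).
replication = [
    [1.2, 0.4, 2.1],
    [0.7, 999.0, 999.0],
    [1.5, 0.9, 999.0],
    [0.3, 1.1, 1.8],
]
proportions = pattern_proportions(replication)
```

Running the same function over every saved replication and comparing the result with the proportions in the real sample gives the check asked about above.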
Regarding PATMISS and PATPROBS, you need to send the output to email@example.com along with a description of what you want the missing data to look like.
Anonymous posted on Sunday, January 09, 2005 - 10:24 am
With Mplus, model parameters estimated from real data can be saved and used as the true population parameter values for a Monte Carlo simulation. For a Monte Carlo study of latent growth modeling with missing values and covariates, can Mplus save the missing data patterns that exist in the real data and then automatically use these patterns for data generation in the simulation? If we have to specify the missing data patterns in the MODEL MISSING command, it would be very difficult to match the specified patterns to those in the real data. Any comment or suggestion will be highly appreciated.
bmuthen posted on Sunday, January 09, 2005 - 12:55 pm
It is currently not possible to save the missing data patterns for use in MC. I would try to mimic the major (most frequent) patterns. Missing data is hard to simulate in a realistic way because the cause of missingness (predicted by x's, earlier y's, missing y's, latent continuous variables, latent categorical variables) is not known. So experimentation with likely scenarios is needed.
I have two questions about Monte Carlo simulations with missing data:
1) I used "Montecarlo" to generate missing data and wanted to analyze the data with "listwise deletion". Therefore, I used "ANALYSIS: TYPE=BASIC;". However, I got the following error message:
*** ERROR in Montecarlo command TYPE = MISSING must be specified when using MONTECARLO to generate missing data.
How can I generate missing data without using the function of "missing" in the analysis?
2) In Monte Carlo simulations with TECH9 in OUTPUT, the program prints the error message for the replications. I would like to see if these problematic replications are included in calculating the summary statistics and parameter estimates or not.
If yes, how can I exclude these problematic replications in the simulations?
If no, the number of "convergent" and "good" replications is likely smaller than the number of replications stated in the design. Is it possible to have a fixed number, say 1000, of "convergent" and "good" replications?
It sounds like you want to generate data sets that contain missing data but then analyze them using listwise deletion. You cannot do this in one step. You will have to generate the data in the first step and analyze them using external Monte Carlo. See Example 11.6 and Chapter 18 of the Mplus User's Guide. To generate data with missingness, you need to include MISSING in the TYPE option.
The results are for the number of replications that were completed. This number is printed in the output. It is not possible to specify the number of completed replications. You would have to keep increasing the number of requested replications until you get the number of completed replications that you want.
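One rough way to choose the next number of requested replications is to scale the target by the completion rate seen in an earlier run. This is only a back-of-the-envelope Python sketch; the function name and the pilot numbers are hypothetical, and the completion rate can differ from run to run.

```python
import math


def replications_to_request(target_completed, completed, requested):
    """Scale the requested replications by the observed completion rate."""
    if completed <= 0:
        raise ValueError("no completed replications to estimate a rate from")
    rate = completed / requested
    return math.ceil(target_completed / rate)


# Hypothetical pilot run: 1000 requested, 940 completed.
next_request = replications_to_request(1000, completed=940, requested=1000)
```

Because the rate is itself an estimate, the completed count printed in the new output still has to be checked and the request increased again if it falls short.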
Anonymous posted on Sunday, August 21, 2005 - 7:37 am
I would like to analyze many external "mean and covariance matrices" by MONTECARLO. I used something like below:
DATA: FILE IS covlist.dat;
      TYPE = MONTECARLO;
      TYPE IS MEANS COVARIANCE;
      NOBSERVATIONS = 100;
I got an error message saying that the input covariance matrix could not be inverted. However, it worked perfectly okay if I analyzed the covariance matrices one by one.
It seems to me that it's not possible to use "TYPE = MONTECARLO" and "TYPE IS MEANS COVARIANCE" at the same time. Am I correct? Thanks!
In your excellent and very helpful 2002 SEM Journal paper on Monte Carlo simulation of statistical power and effect size, on page 605 you provide an example of simulating a missing data process with missing at random data for a linear growth model with four indicators.
You note that when the dichotomous covariate has a value of zero, the first indicator has 12% missing data, the second indicator has 18% missing, the third has 27%, and the fourth indicator has 50% missing. When the covariate is 1.00 in value, the respective amounts of missing data are 12% again for the first indicator, 38%, 50%, and 78% for the remaining indicators.
The syntax for this model is shown on the top of page 616 of the appendix of the article. The missing data section is stipulated as follows:
I'm curious to learn how the values of -2, -1.5, -1, and zero map onto the proportions of missingness that you report on page 605. In other words, how does one work backwards from an expected amount of missing data per wave of measurement to obtain the proper regression weights to use in the MODEL MISSING section of the syntax?
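For reference, MODEL MISSING describes logistic regressions for the probability of a value being missing, so an intercept is a logit and the inverse logit turns it into a proportion. The Python sketch below checks this against the numbers quoted in this post. The slopes it computes are only those implied by the reported percentages for a covariate value of 1; they are not quoted from the article.

```python
import math


def inv_logit(value):
    """Logistic function: convert a logit into a probability."""
    return 1.0 / (1.0 + math.exp(-value))


def logit(probability):
    """Convert a probability into a logit."""
    return math.log(probability / (1.0 - probability))


# Intercepts quoted in the question (covariate = 0).
intercepts = [-2.0, -1.5, -1.0, 0.0]
percent_missing_x0 = [round(100 * inv_logit(a)) for a in intercepts]

# Proportions reported for covariate = 1; implied slope = logit(p) - intercept.
reported_x1 = [0.12, 0.38, 0.50, 0.78]
implied_slopes = [logit(p) - a for p, a in zip(reported_x1, intercepts)]
```

Here `percent_missing_x0` reproduces the 12%, 18%, 27%, and 50% from page 605. Working backwards is the same calculation in reverse: take the logit of the target proportion at covariate 0 for the intercept, and the difference in logits between covariate 1 and covariate 0 for the slope.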
I have a follow-up question. I am simulating power for a latent growth curve model with four observed indicators per time point and missing data at times 2-4. There are four time points, so there are 16 Y measures.
I want to create "wave missing" data that captures a typical attrition scenario where once a subject drops out of the study she no longer contributes data to that wave or subsequent waves. That is, I want Mplus to generate MAR missing data for 10% of the sample at time 2, 15% at time 3, and 20% at time 4. The catch is that the missing data generation should yield four patterns:
- complete data for all 16 measures
- missing data for all time 2, 3, & 4 measures (10% of the sample)
- missing data for all time 3 & 4 measures (5% of the sample)
- missing data for all time 4 measures (5% of the sample)
So, 20% of the total sample will have some form of missing data.
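The four patterns and their cumulative wave-level rates can be sanity-checked outside Mplus. The Python sketch below is an illustration only: it assigns patterns completely at random, which mirrors a pattern-with-probability setup rather than the MAR mechanism asked about, and every name in it is hypothetical.

```python
import random

WAVES = 4
MEASURES_PER_WAVE = 4

# Dropout patterns from the post: first missing wave (None = completer) -> share.
PATTERN_PROBS = {None: 0.80, 2: 0.10, 3: 0.05, 4: 0.05}


def missing_mask(first_missing_wave):
    """Return 16 booleans (True = missing), waves in order, 4 measures each."""
    mask = []
    for wave in range(1, WAVES + 1):
        dropped = first_missing_wave is not None and wave >= first_missing_wave
        mask.extend([dropped] * MEASURES_PER_WAVE)
    return mask


def wave_missing_rates(pattern_probs):
    """Cumulative share of the sample missing at each wave."""
    return {
        wave: sum(p for start, p in pattern_probs.items()
                  if start is not None and wave >= start)
        for wave in range(1, WAVES + 1)
    }


def draw_masks(n, pattern_probs, seed=2024):
    """Sample one dropout pattern per subject (patterns drawn at random)."""
    rng = random.Random(seed)
    starts = list(pattern_probs)
    weights = list(pattern_probs.values())
    return [missing_mask(s) for s in rng.choices(starts, weights=weights, k=n)]


rates = wave_missing_rates(PATTERN_PROBS)
masks = draw_masks(1000, PATTERN_PROBS)
```

`rates` gives 0%, 10%, 15%, and 20% missing at times 1 to 4, and every mask is all-or-nothing within a wave and monotone across waves, which is the structure described in the list.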
This is my current MODEL MISSING syntax. MODEL MISSING:
This yields measure-specific missingness that is correct within wave, but many missing data patterns are generated. Many of these are not realistic for my situation (e.g., one measure within a wave will be missing, but others will have data; our interviewer-based protocols will result in virtually complete data for any participant that is not lost to attrition).
You and Bengt are both amazing to respond in this forum over the weekend. Much appreciation to you both for your prompt and helpful replies.
I will continue to experiment with MODEL MISSING.
In your SEM Journal paper, you converted portions of the Mplus Monte Carlo output into Cohen-metric effect sizes. Could you describe what portions of the Mplus output you used to obtain the Cohen effect sizes you reported for the two growth model scenarios (slope regression weight = .2; slope regression weight = .1)? I'm unclear on how you were able to use the Mplus results to obtain the Cohen metric effect sizes.
I don't know what you mean by converting portions of the Mplus output to Cohen effect sizes. Perhaps you refer to page 604 in the SEM article where we say that we get different effect sizes for different regression coefficients for s regressed on the covariate. The effect size computations are described on pp. 604-605. One can also compute it with respect to observed outcomes.
Just divide the regression coefficient of the growth slope factor regressed on the binary x by the SD (the square root of the variance) of the growth slope factor. This is a Cohen-like quantity in that the regression coefficient is the mean difference between the two x groups in the slope growth factor.
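As a small worked illustration of that division, here is a Python sketch. The coefficients 0.2 and 0.1 are the ones mentioned in the question; the slope factor variance of 0.25 is a made-up value chosen for round numbers, not a figure from the article.

```python
import math


def slope_effect_size(coefficient, slope_variance):
    """Regression of the slope factor on binary x, divided by the slope SD."""
    if slope_variance <= 0:
        raise ValueError("slope_variance must be positive")
    return coefficient / math.sqrt(slope_variance)


# Coefficients 0.2 and 0.1 come from the question; the variance of 0.25 is a
# hypothetical value for illustration, not a figure from the article.
HYPOTHETICAL_SLOPE_VARIANCE = 0.25
effect_sizes = [
    slope_effect_size(b, HYPOTHETICAL_SLOPE_VARIANCE) for b in (0.2, 0.1)
]
```

With that assumed variance the SD is 0.5, so the two coefficients correspond to effect sizes of 0.4 and 0.2.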
We don't have a specific example available, but you can start from the User's Guide ex 11.2 and modify the Model Missing part. For example, you can have missing on y4 as a function of the value of y4 that would have been observed by saying in Model Missing:
y4 on y4*c;
where c is a value such as 0.5.
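To see what such a specification implies, the mechanism can be imitated in a few lines of Python. This is a sketch under stated assumptions, not Mplus output: y4 is drawn as standard normal, the intercept is set to 0, c = 0.5, and the function name is hypothetical.

```python
import math
import random


def simulate_self_dependent_missingness(n, slope, intercept=0.0, seed=1):
    """Draw y4 ~ N(0, 1) and delete each value with probability
    inv_logit(intercept + slope * y4), i.e. missingness depends on y4 itself."""
    rng = random.Random(seed)
    full, observed = [], []
    for _ in range(n):
        y4 = rng.gauss(0.0, 1.0)
        p_missing = 1.0 / (1.0 + math.exp(-(intercept + slope * y4)))
        full.append(y4)
        if rng.random() >= p_missing:
            observed.append(y4)
    return full, observed


full, observed = simulate_self_dependent_missingness(n=20000, slope=0.5)
full_mean = sum(full) / len(full)
observed_mean = sum(observed) / len(observed)
```

Because higher values of y4 are more likely to be deleted, the mean of the observed values falls below the mean of the full data, which is the kind of distortion this type of missingness is meant to produce in a simulation.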
Shu Xu posted on Friday, January 26, 2007 - 12:16 am
Dear Linda and Bengt,
I am wondering what mechanism is used to generate the data sets for a two-part (semicontinuous) growth model for a continuous outcome (Ex 11.9). Is there a technical report on the data generation method for the two-part growth model?
If you only want 1 variable say X1 of a set of variables (x1-x12) to have missing values and you only want 1 pattern where 10% of X1 are missing across 25% of the cases (the other 75% are not missing any), is this the correct way to specify this:
Yes, that is MAR because the missing data probability depends on y1 which doesn't have missing data.
Joao Garcez posted on Wednesday, November 15, 2017 - 4:59 am
Hello Linda and Bengt Muthen,
I have the following question that, hopefully, you'll be able to help with:
I have a dataset with N = 160 and 60% missing data (MCAR) on some of the variables, and I want to run a Monte Carlo study for power analysis. I am a bit confused as to whether I should specify this pattern of missing data, because I intend to run my analyses using multiple imputation. So even if the full data (N = 160) are used in the analysis thanks to MI or FIML, do I still need to specify PATMISS and PATPROBS when doing the Monte Carlo power analysis?
Dear Dr. Muthén, I am working on a complex data set in Mplus with 19 items in the item pool, but participants only responded to a random set of 8 items. However, I want to examine a model with one latent factor that explains responses to all 19 items. I read in the data as they are and let FIML cope with the "missing by design". Whereas I get good results at the total sample level, results are bad if I use a limited number of observations. I wanted to investigate whether the "missing by design" issue may keep the model fit indicators (e.g., CFI, RMSEA) from working reliably if the sample size is too low. I therefore prepared Monte Carlo simulations in which I vary the number of observations. I paste my input below. Is the specification of missing data appropriate for my kind of design? Is it then correct to investigate whether the "expected" and "observed" percentiles for chi-square, RMSEA, and SRMR are close to each other to answer my research question?
Because the items were presented randomly by software, all possible combinations of missing items can occur. If I read in the data for the total sample, I receive a report of 4029 missing data patterns.
My approach to specifying missingness for the Monte Carlo simulation was to specify the relative frequency with which each item was displayed to the participant (equal for each participant because the items were chosen randomly by software). Then PATPROBS was set to 1.
If this is a bad idea, do you have a good solution for specifying it more appropriately in Mplus?
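A little arithmetic helps in judging this setup. The Python sketch below uses only the 19 and 8 from the post. It assumes, without confirmation in this thread, that a single pattern with per-item probabilities drops each item independently; under that assumption the simulated data would differ from the real design in how many items each person answers.

```python
import math

ITEMS = 19
SHOWN = 8

# Probability that a given item is missing for a given respondent.
p_missing_per_item = (ITEMS - SHOWN) / ITEMS

# Number of distinct patterns when exactly 8 of 19 items are shown.
possible_patterns = math.comb(ITEMS, SHOWN)

# If each item were instead dropped independently with p_missing_per_item,
# the number of observed items would be Binomial(19, 8/19), so only some
# simulated respondents would end up with exactly 8 observed items.
p_observed = SHOWN / ITEMS
p_exactly_eight = (
    possible_patterns * p_observed**SHOWN * (1 - p_observed) ** (ITEMS - SHOWN)
)
```

The per-item missing probability is 11/19, and there are 75,582 possible patterns with exactly 8 items shown, far too many to list one by one. `p_exactly_eight` is the share of respondents who would have exactly 8 observed items if items were dropped independently.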
If I read the data into the software, it tells me that there are around 4,100 observations with at least 1 "non-missing" value across the 19 items.
The background for the Monte Carlo study is that we want to examine the properties of a test at the total sample level, but also for each of several subgroups. We get very good results for subgroups with large sample sizes (e.g., 1000), but implausible bad results for subgroups with lower sample sizes (e.g., 400). That's why I wanted to examine whether the missing by design issue may cause problems for the model fit indicators if sample sizes are below a specific cut-off.
I misread your N. I would think that subgroup sample size of 400 would be enough for good testing unless the items are not continuous or the model very complex. Could it be that those subgroups are not randomly obtained from the same population as those with N=1000?
None of the subgroups is randomly sampled. The subgroups represent different regional groups (continents). For some regional groups, N is large (around 1000); for others it is lower (< 400). The basic model is very simple (a one-factor model), and all indicators are continuous scores between 2 and 10.
If there is no straightforward way in Mplus to specify the complex missing data pattern, should I assume that I have to rely on rules of thumb for the number of observations at which the model fit indicators work reliably?
(quick explanation for the "strange" results: I do get fantastic model fit for the sample with all European people, for example, but horrible model fit for each sub-sample that describes people from one European country each).
Margarita posted on Thursday, August 02, 2018 - 6:59 am
Hi Dr. Muthén,
Following up on some of the previous discussions, I was wondering if it's possible to bypass the issue with the pattern option and categorical items within a multilevel Monte Carlo study? I am trying to do a multilevel multigroup Monte Carlo as per webnote 16. During Step 1 the pattern variable is saved, which unfortunately I cannot use during Step 2 because I have categorical indicators. Is there an alternative within Mplus 8.1?
There are two ways to generate missing data in Monte Carlo. The first one is PATMISS (see User's Guide example 12.1). The second one is based on MODEL MISSING (see User's Guide example 12.2). Usually MODEL MISSING is more flexible and can be tailored to specific needs.
Daniel Lee posted on Monday, April 01, 2019 - 11:30 am
Hi, is it possible to specify missing values for the independent variables when doing a Monte Carlo power estimation?
For example, while all our study participants will probably self-report their gender, some may report their relationship with mom and dad (if they are from a 2-parent led household), while others might just report their relationship with either mom or dad (if they are from a single parent family). If the analysis has mother and father relationship quality as independent variables (predicting an outcome for the child), how would we account for some of the missingness in the independent variable when doing a MC simulation power analysis?