Example 2 on page 24 of the Addendum to the Mplus User's Guide (at www.statmodel.com under Product Support) is a monte carlo for a two-group three-level growth model. You can use that as a starting point.
Daniel posted on Thursday, August 28, 2003 - 9:10 am
Thanks very much. You and your husband give the greatest support for any product of any type I have ever purchased.
Daniel posted on Friday, August 29, 2003 - 8:06 am
Hi, I'm having quite a bit of difficulty getting started. I have an LGM with two parallel processes. I want to see if I can find an interaction for high and low levels of depression. I want to cut my groups into high and low levels of depression. I am not using multi-level modeling. How should I get started? I'd appreciate any help possible. I looked at the example on page 24, like you suggested, but could not make heads or tails of it because I am not doing multi-level modeling. It keeps on asking me for data.
MODEL MONTECARLO-G2: [i*1 s*.5];
ANALYSIS: TYPE = MEANSTRUCTURE;
MODEL: i BY y1-y4@1;
  s BY y1@0 y2@1 y3@2 y4@3;
  y1-y4*.5 i*.7 s*.2;
  [y1-y4@0 i*0 s*.2];
MODEL G1: [i*0 s*.2];
MODEL G2: [i*1 s*.5];
OUTPUT: TECH9;
Actually, the input I posted will not run in Version 2 of Mplus. It will run in Version 3. In Version 2, you will need to use the FILE option to enter your population values instead of MODEL MONTECARLO. Or you will have to modify the input to be TYPE=MIXTURE, NCLASSES=1, GCLASSES=1; If I get time, I will modify this. But it won't be until next week.
Daniel posted on Friday, August 29, 2003 - 12:30 pm
Thanks again. I figured out a way to run the model. I look forward to version 3.
I would like to use the Monte Carlo facility (Version 3 of Mplus) to generate 2 separate populations, and then disproportionately sample a certain number from each population. The samples from these 2 populations need to be combined so that I can then analyze them using GMM. Can this all be done in Mplus, or do I need to do some of it in SAS?
See Example 11.6, which explains how to generate and save the superpopulation files and analyze them with a separate input. Example mcex5.14, which comes on the Mplus CD and is put on the hard disk as part of the installation, shows how to use Monte Carlo to generate multiple-group data. In the generation of the population, include your binary subpopulation inclusion indicator U. In the first group [U$1*0] and in the second group [U$1*1] would give disproportionate sampling. In the second input, USEOBSERVATIONS = u eq 1 would make sure that you analyze the subpopulation.
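For reference, the inclusion probabilities implied by those thresholds follow from the logistic parameterization: with no covariates, P(U = 1) = 1/(1 + exp(tau)), where tau is the value given for [U$1]. A quick plain-Python check (this assumes Mplus's usual logistic link for a binary outcome):

```python
import math

def p_u1(threshold):
    """P(U = 1) implied by a logistic threshold [U$1] with no covariates."""
    return 1.0 / (1.0 + math.exp(threshold))

p_group1 = p_u1(0.0)   # [U$1*0] in the first group
p_group2 = p_u1(1.0)   # [U$1*1] in the second group
print(round(p_group1, 3), round(p_group2, 3))
```

So [U$1*0] gives a 50% inclusion rate while [U$1*1] gives roughly 27%, which is what makes the sampling disproportionate across the two groups.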
I would like to do a simulation using a Growth Mixture Model (with latent continuous intercept and slope factors, and a categorical latent class factor). I would like to simulate a complex sample design, where individuals are sampled at different rates from 2 strata. Later, I would like to add clustering into the sample design. Thanks again for all of your help!
Anonymous posted on Monday, May 10, 2004 - 11:03 am
I modified example mcex8.1.inp to generate two strata. To add clustering into the design you have to modify mcex10.4.inp the same way. Tihomir.
title: this is an example of a GMM for a continuous outcome using automatic starting values and random starts
two strata inclusion variable for subsequent sampling U
montecarlo:
  names are y1-y4 x U;
  genclasses = c(4);
  classes = c(2);
  nobs = 500;
  seed = 3454367;
  nrep = 10;
  REPSAVE = ALL;
  SAVE = ex8.1rep*.dat;
  categorical = U;
  generate = U(1);
One more thing regarding the previous question. Is there a way to specify the number of observations sampled from the population separately in each stratum? So if nobs = 500, 300 would come from stratum one and 200 would come from stratum two?
bmuthen posted on Saturday, May 29, 2004 - 10:50 am
Seems like you can do that by specifying values of fixed probabilities for the classes corresponding to strata, so using the logit parameters of [c#]. I guess the strata could be handled by the V3 knownclass option.
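For the 300/200 split asked about above, the fixed logit would be ln(300/200) ≈ .405, since Mplus class logits map to class probabilities via a softmax with the last class as the reference (logit 0). A plain-Python check (note this fixes the expected proportions, not the exact counts in any one replication):

```python
import math

def class_probs(logits):
    """Class probabilities from Mplus-style [c#] logits (last class fixed at 0)."""
    exps = [math.exp(l) for l in logits] + [1.0]  # reference class contributes exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

# For an expected 300/200 split over nobs = 500, the stratum-1 logit is ln(300/200):
logit = math.log(300 / 200)
p1, p2 = class_probs([logit])
print(round(p1, 2), round(p2, 2))  # 0.6 0.4
```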
Annica posted on Wednesday, November 03, 2004 - 1:05 am
What information can be saved when using TYPE=MONTECARLO in the DATA statement, i.e. when performing a Monte Carlo study and the data has been generated outside Mplus? Using the RESULTS option in the SAVEDATA statement I obtain the parameter estimates for each data set. However, some of the replications do not reach convergence and I would like to know which. I know this information is possible to save when performing an internal Monte Carlo study in Mplus.
Is it possible to specify an alternative significance level at which each replication in a Monte Carlo analysis is tested? I did not see anything in the documentation for Monte Carlo that allows significance level to be specified. I would like to determine the power for detecting differences in linear growth between two groups, but I want to test at alpha = .03125 as a way of approximating a multiplicity adjustment.
There is no way to specify an alternative significance level.
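Although the internal summary is fixed at the 5% level, power at another alpha can be computed externally from the per-replication estimates and standard errors (e.g., saved with the RESULTS option of the SAVEDATA command, which an earlier post notes writes out the estimates for each data set). A plain-Python sketch; the estimates and ses arrays below are hypothetical stand-ins for the saved values:

```python
from statistics import NormalDist

alpha = 0.03125
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value, about 2.15

# Hypothetical per-replication estimates and SEs for the parameter of interest,
# as would be read from the file written by the SAVEDATA RESULTS option:
estimates = [0.31, 0.42, 0.18, 0.55, 0.29]
ses       = [0.12, 0.15, 0.14, 0.16, 0.13]

# Power = proportion of replications whose z-ratio exceeds the critical value:
significant = [abs(est / se) > z_crit for est, se in zip(estimates, ses)]
power = sum(significant) / len(significant)
print(round(z_crit, 3), power)
```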
Emily Blood posted on Thursday, September 27, 2007 - 12:47 pm
If the data has been generated outside of Mplus, is it possible to enter the true values of the population parameters? In the manual I only see described how to enter true population values if the population values were generated in another Mplus run or if the data were generated within Mplus. I have generated the data in Splus and would like to specify the true population values used so I can use the 95% cover probabilities in the Mplus output from the MC run. Thanks for your help! Emily
You would do external Monte Carlo if the data are generated outside of Mplus. See Step 2 of Example 11.6. You would enter the population values in the MODEL command.
Emily Blood posted on Thursday, September 27, 2007 - 1:22 pm
Thank you. What is the syntax for entering a population value in the MODEL command? I know that, for example, i BY y@2 sets the value of the regression coefficient for y to 2, [i@1] sets the intercept for the i factor to 1, and f ON x*1 sets the starting value for the regression parameter to 1. It is not clear from the manual how to enter the true population values in the MODEL command. Can you please let me know? Thanks, Emily
In the MODEL command you should use an * followed by the true population value for free parameters and an @ followed by the true population value for fixed parameters. Please see Step 2 of Example 11.6. The true population values are given in the MODEL command.
Emily Blood posted on Friday, September 28, 2007 - 6:44 am
Thank you. Since this is the same syntax for setting starting values for free parameters, does putting these true values in automatically make those the starting values in the estimation of the models? Emily
Emily Blood posted on Friday, September 28, 2007 - 7:34 am
Thank you for all of your help. The MC facility in Mplus is really great!!! It seems to do everything I need for my simulation study.
Emily Blood posted on Tuesday, October 09, 2007 - 4:57 pm
What is the method of computing the standard error of indirect effects? I have the following code in my model (it is a variation on a latent growth curve model where i is a latent intercept). MODEL INDIRECT: i IND tx; There is a mediating variable between tx and i. The output shows a standard error estimate, but I am not clear on what method is used to compute this since the indirect effect involves a product of two estimated coefficients (a+b*c is the form of the indirect effect, where a is the direct effect of tx on i and b is the effect of tx on a mediating variable and c is the effect of the mediating variable on i). If you could let me know that would be great! Thanks, Emily
The default is to compute standard errors using the Delta method. Bootstrap standard errors are also available.
Emily Blood posted on Wednesday, October 10, 2007 - 1:04 pm
I'm sorry to keep bothering you, but for the a+b*c total effect, the delta method std err would be: sqrt[var(a) + b^2*var(c) + c^2*var(b)], is this correct? If so, this doesn't seem to match the value from the output. I can match the value for the std err of the indirect part of the effect (se(b*c)), but for the total effect, I am getting something larger than what is in the output. I am trying to figure out what I am missing. If you could let me know I would really appreciate it. Thanks! Emily
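One plausible source of the discrepancy is that the full Delta-method variance of a + b*c also includes covariance terms among the three estimates, which the formula above drops. With gradient g = (1, c, b) and V the covariance matrix of (a, b, c), Var(a + b*c) ≈ g'Vg. A sketch in plain Python (the numbers are purely illustrative, not from any actual Mplus output):

```python
import math

# Hypothetical estimates and their symmetric covariance matrix, ordered (a, b, c):
a, b, c = 0.40, 0.30, 0.50
V = [[0.010, 0.002, 0.001],
     [0.002, 0.020, 0.003],
     [0.001, 0.003, 0.015]]

g = [1.0, c, b]  # gradient of a + b*c with respect to (a, b, c)
var_total = sum(g[i] * V[i][j] * g[j] for i in range(3) for j in range(3))
se_total = math.sqrt(var_total)

# The no-covariance formula from the post above, for comparison:
se_naive = math.sqrt(V[0][0] + b**2 * V[2][2] + c**2 * V[1][1])
print(round(se_total, 4), round(se_naive, 4))
```

With positive covariances, the naive formula understates the SE; with negative ones it overstates it, so it will generally not match software output.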
I am trying to run a simulation in Mplus for a growth model to estimate power. The proposed model has three covariates (one binary, two continuous) and three time points. I am assuming complete data. How do you specify the parameter variances of the three intercepts and the three slopes? Is there an example you can direct me to?
The data for the user's guide examples are generated by Monte Carlo simulations. Find the Monte Carlo counterpart of the growth model example in Chapter 6 that is closest to what you are doing to see how this is done. These inputs come on the Mplus CD and are put on the hard disk as part of the default installation.
I am running a Monte Carlo simulation for a 2-class nonlinear growth model where a binary covariate has an effect on the slope in one class but not the other. I am using the following syntax in the MODEL POPULATION as well as in the MODEL block to create a binary variable with a 50/50 split, simulating a treatment variable: [x@0]; x@1; When I look at the estimated sample statistics for the first replication, the mean is not .5. Am I doing something wrong, or is it because the statistics shown are from the first replication only? Thank you for your advice.
model population:
  %overall%
  [c#1*-1];
  [x@0]; x@1;
  y1-y5*.5;
  i s q | y1@-1 y2@0 y3@1 y4@4 y5@7;
  q@0; i*.5; s*.1;
  i on x@0;
  s on x*.3;
  q on x*.01;
  s with q*0;
  i with q*0;
  %c#1%
  [i*2.2 s*.5 q*-.1];
  s on x*.3;
  q on x*.01;
  %c#2%
  [i*2.5 s*.5 q*-.02];
  s on x*0;
  q on x*0;
You need CUTPOINTS (0). If you read the example, you will see that we did not do a 50/50 split.
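For reference, CUTPOINTS (0) dichotomizes the generated standard-normal covariate at zero (values above the cutpoint coded 1, assuming the usual Mplus coding), which is what produces an approximate 50/50 split; [x@0]; x@1; by itself leaves x continuous. The mechanics in plain Python:

```python
import random

random.seed(3454367)  # any seed will do; this one echoes the thread's example

# Generate a standard-normal covariate and cut it at 0, as CUTPOINTS (0) does:
n = 100_000
x_continuous = [random.gauss(0.0, 1.0) for _ in range(n)]
x_binary = [1 if x > 0 else 0 for x in x_continuous]

prop = sum(x_binary) / n
print(round(prop, 2))  # close to 0.50
```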
Erika Wolf posted on Wednesday, October 27, 2010 - 10:24 am
What do you make of Monte Carlo power analyses where power appears to be more than adequate, yet some replications generated in the analysis do not converge and/or some replications include non-positive definite matrices? Are the power estimates accurate in this situation, given that the majority of replications are fine? I've noticed this tends to occur as I decrease sample size in an effort to determine the minimum sample size needed, yet power estimates in the output are still well above 80% and the average parameter estimates do not differ meaningfully from the population values.
That is a potential problem of Monte Carlo based power estimation - it relies on good parameter and SE estimation. An alternative is the Satorra-Saris pop cov matrix approach (see our Topic 3 handout), but that does not cover as many model situations.
A good question is how well the parameter estimates and SEs are recovered.
You can attempt to avoid non-pos def by various tricks (Cholesky decomp or inequality constraints, i.e. var >0).
Erika Wolf posted on Thursday, October 28, 2010 - 8:15 am
Thanks for your quick response. A follow-up question from yesterday's thread: Can you please clarify how Mplus calculates the mean parameter estimates, SD, and SE when some solutions in the Monte Carlo run fail to converge or produce improper solutions? Does it eliminate those instances and change the denominator for the mean values, or does it factor them in somehow? Also, are those non-replications reflected in any way in the calculation of the 95% coverage or power estimates? Thanks again!
Hi, I'm trying to estimate sample sizes for a two-level SEM with a Monte Carlo simulation. I have a couple of questions which are not really clear to me from the User's Guide and Muthén & Muthén (2002). Given that I actually rely on real data, should I run the Monte Carlo from the estimates of my model (obtained with SAVEDATA: ESTIMATES), or on artificial data generated with the Monte Carlo method? And a second question: in Muthén & Muthén (2002) you fix the variance of the factor to 1, saying that this makes it easier to interpret results. Should I do the same in my case (a simulation based on real data)? Thank you very much in advance.
No, because the real data may have features that are different from the simulated data. The fact that the simulated data uses the model parameters estimated from the real data is not sufficient because the model may not capture all aspects of the real data - even when it fits well.
If you see a bias in SEs, you may instead try changing from ML to MLR.
Ok, thanks. In my ML-SEM model I already use MLR, but given that the second-level clusters are countries, it ends up that the number of parameters is much higher than the number of clusters. Can I consider the SE estimates with MLR reliable in this case too? I think I remember that in one of your courses you said not to worry about the Mplus warning about the "trustworthiness of SEs" if you use MLR. Thank you very much, your help is really appreciated.
If the number of between-level parameters is greater than the number of between-level cluster units, this is a problem. If it is the total number of parameters in the model that is greater than the number of clusters, this may also be a problem. This has not been thoroughly studied as far as I know. Note that a minimum of 30-50 clusters is recommended.
Hi Linda, I want to use the non-parametric GMM in a simulation but I don’t know how to specify the model. For example, how can I get a model with csub (3) cnp(2)? Please can you tell me what is wrong with my syntax below ? Thank you in advance.
I am working through a modified version of Example 12.4 for simulating a growth model. The default code runs as expected, but when I specify a correlation between iw/sw and ib/sb greater than .11, I receive a fatal error message (population covariance matrix is NPD). Is this due to the starting values of iw/sw and ib/sb? Any help is appreciated.
MONTECARLO:
  NAMES ARE y1-y3;
  NOBSERVATIONS = 244;
  NREPS = 500;
  NCSIZES = 1;
  CSIZES = 61 (4);
MODEL POPULATION:
  %WITHIN%
  iw sw | y1@0 y2@1 y3@2;
  y1-y3*.2;
  iw*1; sw*.2;
  iw with sw*.21;
  %BETWEEN%
  ib sb | y1@0 y2@1 y3@2;
  y1-y3@0;
  [ib*1 sb*.5];
  ib*.2; sb*.1;
  ib with sb*.21;
ANALYSIS: TYPE IS TWOLEVEL;
MODEL:
  %WITHIN%
  iw sw | y1@0 y2@1 y3@2;
  y1-y3*.2;
  iw*1; sw*.2;
  iw with sw*.21;
  %BETWEEN%
  ib sb | y1@0 y2@1 y3@2;
  y1-y3@0;
  [ib*1 sb*.5];
  ib*.2; sb*.1;
  ib with sb*.21;
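A likely culprit is visible in the population values themselves: on the between level, ib WITH sb*.21 combined with variances ib*.2 and sb*.1 implies a correlation of .21/sqrt(.2 x .1) ≈ 1.48, which exceeds 1, so the population covariance matrix cannot be positive definite (the within-level value is fine: .21/sqrt(1 x .2) ≈ .47). A quick check in plain Python:

```python
import math

# Between-level growth-factor population values from the input above:
var_ib, var_sb, cov = 0.2, 0.1, 0.21

corr = cov / math.sqrt(var_ib * var_sb)
print(round(corr, 2))  # a correlation above 1 is impossible

# A 2x2 covariance matrix is positive definite only if its determinant is positive:
det = var_ib * var_sb - cov ** 2
print(det)  # negative here, hence the NPD error
```

Either shrinking the covariance below sqrt(.2 x .1) ≈ .14 or raising the between-level variances should remove the error.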
After getting a Monte Carlo simulation of a bivariate growth-curve model with multiple indicators to run, the program noted that many of the replications derived estimates from non-positive definite matrices. I'm wondering whether these are included in the final results presented, and if so, if I should just disregard the overall results from the power analysis then.
All converging replications are included. Non-pos-def matrices can occur with negative residual variances for the outcomes if population values are small or sample size is small, and also if growth factor correlations in the population are high or sample size small.
Hello! 1. I am trying to start with a Monte Carlo analysis to estimate power for a simple growth model with 6 assessments. I'd like to specify a constant correlation across time, but this doesn't work (the right-hand sides of the WITH statements were garbled in the original post and could not be recovered):
y1 WITH ... ;
y2 WITH ... ;
y3 WITH ... ;
y4 WITH ... ;
y5 WITH ... ;
What am I doing wrong? 2. Since the DV is created as standard normal, should the effect of a predictor at baseline be in SD units (Cohen's d)? 3. And shouldn't the slope parameters be in standard units? Thanks! bac
Now I'm learning how to carry out a Monte Carlo analysis for a growth model using an ordinal outcome: a simple pre-/posttest design, testing the difference in change for two randomized groups (equal Ns), and controlling for two continuous covariates as nuisance variables. I'm modeling equal proportions of cases in each of the 4 ordinal categories. Do you see any problems with the specified thresholds? Or any other problems I should be asking about?
Thanks for any help you can provide!
i s | u1@0 u2@1 ;
[ i@0 s*.1 ] ;
i*.3 ; s@0 ; i WITH s@0 ;
[u1$1*-1.1 u1$2*0 u1$3*1.1] ;
[u2$1*-1.1 u2$2*0 u2$3*1.1] ;
i ON x@0 z1*.18 z2*.1 ; s ON x*.824 z1*0 z2*0 ; ! OR for s ON x = 2.28, medium
First post: Thank you for your recommendation, Bengt! That brings back memories of articles long ago that debated the best way to analyze data from a pretest-posttest control group design, and if I recall, your suggested method was one of the top two winners! (The other being the ANCOVA of the change score on the pretest!). I have used the commands below to do the analysis the latent growth/multilevel way, and obtained results for both the LGM and a MC analysis for the model. This model tests the 3-way interaction of two fixed predictors on the (linear) change from pre to post. I think this is analogous to the classic 2-way ANOVA. There was no problem with identification. Am I missing something? -bac
Second post: Aside from what the best method is for the pre-post analysis, I'm not sure how to do the Monte Carlo analysis using your suggestion, since it treats the pretest as a predictor. The analysis I wrote about is for an ordinal outcome at pre- and posttest, with 4 points on the scale. If I use the ANCOVA with posttest on pretest, how can I specify that the pretest predictor is ordinal? I thought that Monte Carlo predictors could only be normal or binary.
Next, in this MC analysis, if I want (say) equal proportions in the four categories, with the threshold defined as Ln[p/(1-p)], does this give the correct thresholds? [u1$1*-1.1 u1$2*0 u1$3*1.1] ; ! Equal proportions of 25%
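Those thresholds check out, with one caveat worth making explicit: in Ln[p/(1-p)], p must be the cumulative proportion at or below each threshold (.25, .50, .75), not the 25% category proportion. In plain Python:

```python
import math

def logit(p):
    """Ln[p / (1 - p)] for a cumulative probability p."""
    return math.log(p / (1 - p))

# Cumulative probabilities below each of the three thresholds for four
# equal categories of 25% each:
thresholds = [logit(p) for p in (0.25, 0.50, 0.75)]
print([round(t, 2) for t in thresholds])  # [-1.1, 0.0, 1.1]
```

This reproduces the -1.1, 0, 1.1 values used in the input above.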
I am running a semi-continuous growth model for a continuous outcome as in Example 12.9. You note that the exponential function must be applied to continuous variables saved for subsequent two-part analysis. For our data, the means in the output for replication 1 for the continuous variable appear to correspond with the means from the generated data. Am I correct in assuming, then, that the output is on the base-e log scale as well? Also, if I am inputting means for the continuous variable in the MODEL POPULATION command, should these already be log transformed or in their raw form? Thanks so much for your help!
When you say data saved for the DATA TWOPART command, are you talking about if we had done the analysis on real data and then used that for generation and coverage? This is my fault for not expressing the issue clearly, but I think maybe I am still missing something. If in the Monte Carlo I use REPSAVE to save all the replications, the values for the continuous variable have been elog transformed according to example 12.9. What confuses me is if I input intercepts and mean values for the continuous variable in the Monte Carlo in their original form, they seem to be untransformed in the data files for the replications. For example, if I code [y1*20], then the mean of y1 in the replication .dat files, and in the output, is closer to 20, and not the elog of 20 (somewhere around 3) as I would expect. This made me think that I should be putting in elog transformed values in the model population command, but you seem to be saying otherwise, so I can't quite figure out what I'm missing. I've probably misunderstood the notes to example 12.9, but really appreciate any assistance in clearing up my confusion.
I agree that this can be presented more clearly. What the text on page 436 tries to convey is how a "Step 1" ("internal") Monte Carlo run relates to a "Step 2" ("external", DATA TYPE=MONTECARLO) Monte Carlo run, where I use the words Step 1 and Step 2 in line with the chapter 12 examples.
The Step 1 setup is shown in UG ex 12.9. The data generated are not obtained by a transformation (like log) but correspond to the parameter values given. But if you submit those data to a second run, say a Step 2 Monte Carlo run, the continuous variables will be by default log transformed when using DATA TWOPART. So to get the same parameter estimate results in Step 2 as in Step 1, you want to anti-log (exponentiate) the continuous variables in the data generated by Step 1 and that can be done using DEFINE in the Step 2 run. So the exponentiation is just to counteract the DATA TWOPART action in Step 2. Alternatively, the DATA TWOPART command in the Step 2 run can use the option TRANSFORM = NONE. I hope that makes it clearer.
Anna King posted on Monday, September 15, 2014 - 8:06 am
Sorry about my stupid question. When I'm doing a simulation study to compare the influence of different estimation methods (e.g., one-step and three-step methods) on covariate effects, should I generate one set of data for later analyses? Thanks!
Anne Black posted on Monday, August 08, 2016 - 10:01 am
Chunhua Cao posted on Wednesday, September 20, 2017 - 11:20 am
Dear Dr. Muthén,
My colleague and I are working on a simulation study to estimate treatment effects in interrupted time series designs using a latent growth model and Bayesian estimation (default priors). We are interested in small sample sizes. We varied the total series length for baseline and treatment phases at 6, 8, 10, ..., 60 (same length across phases). For each series length, sample size was fixed to be one more than the total series length (i.e., 7 to 61), which we believe is the minimal sample size needed to ensure convergence. The treatment effect is estimated as the difference in the intercept factor mean between phases (assuming no slope) with MODEL CONSTRAINT. We ran into two problems:
a) Series length 10 had a large negative bias (-.129) in the treatment effect, but other lengths had negligible bias (ranging from -.022 to .034). b) When we examined the 95% CI coverage, some series lengths had much lower coverage (e.g., series lengths 52 and 30 had coverages of .863 and .918, respectively) than others (ranging from .939 to 1.000).
We wonder why these particular series lengths behaved in this way. We tried increasing the number of replications and Mplus 8 but results remained close to what we originally got with 1000 replications and Mplus 7.3. It would be great if you can share your insights. Thank you for your time and help!
Why so few subjects - e.g. having only 11 subjects observed over 10 time points?
Chunhua Cao posted on Friday, September 22, 2017 - 9:06 am
Thank you for your question, Dr. Muthén. As far as we know, one more than the total series length is what we need to ensure a proper posterior distribution of factor variances and covariances, if we use the default Inverse Wishart prior. Earlier we tried even smaller sample sizes (e.g., 8 subjects with 10 time points), but got this error message, "THE SAMPLE SIZE PLUS THE PRIOR DEGREES OF FREEDOM OF PSI MUST BE GREATER THAN THE NUMBER OF LATENT VARIABLES".
We also have larger sample sizes (e.g., 80 subjects over 10 time points). We want to examine the bias and 95% CI coverage of the treatment effect estimate under extremely small and relatively large sample sizes.
We observed outliers (in terms of bias and 95% CI coverage) for some series lengths (e.g., 10, 30, 52 time points). But there is no clear pattern to explain why they occurred. Do you have any recommendations on this issue?
You can explore what's going on by saving the results and data from all reps with REPSAVE=ALL; and SAVEDATA. Look at histograms of the parameter of interest from all reps - look for extreme values in a few reps that could drive the average estimate.
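Once the per-replication results are saved, flagging extreme replications is mechanical. A plain-Python sketch, with a hypothetical list of estimates standing in for the values read from the saved results files (here an outlier is anything more than 2 SDs from the mean):

```python
from statistics import mean, stdev

# Hypothetical per-replication estimates of the parameter of interest,
# as would be read from the files saved via REPSAVE = ALL and SAVEDATA:
estimates = [0.31, 0.28, 0.33, 0.30, 0.29, 0.27, 1.90, 0.32, 0.30, 0.31]

m, s = mean(estimates), stdev(estimates)
outliers = [(i, e) for i, e in enumerate(estimates) if abs(e - m) > 2 * s]
print(round(m, 3), outliers)  # flags replication 6, whose estimate drives the mean up
```

A histogram of the same saved values shows the same thing visually, which is the check suggested above.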
Chunhua Cao posted on Wednesday, October 11, 2017 - 5:19 am
This is a follow-up post on the treatment effect estimates with small sample sizes. I asked why a particular series length had much larger bias than others.
Thank you very much for your suggestions, Dr. Muthén. Following your suggestion, my colleague and I checked the distribution of the treatment effect estimate across all reps for the problematic condition (series length 10, sample size 11). The distribution was normal with its mean centered at -.129. Looking across several adjacent series lengths, the biases for series lengths 6, 8, 10, 12, and 14 were .028, -.022, -.129, .003, and .002, respectively.
We tried a couple of things: 1) in addition to series lengths 8, 10, and 12, we checked series lengths 9 and 11 for a closer examination of the bias; 2) for series length 10, we used informative priors for the variance-covariance matrix, where the priors came from the FIML estimation of the same series length but more subjects; 3) in addition to 2), we added informative priors for the intercept factor means of the baseline and treatment phases, with priors coming from the same FIML estimation; 4) we changed the population treatment effect to a different value (e.g., zero).
However, series length 10 was still the only condition that had much larger negative bias compared with other conditions. Could you please provide some insights and suggestions? Thank you very much!
I suggested checking if a particular replication obtained a much larger value than other replications. It sounds like that was not the case. I have no other suggestion or insight; it is very hard to understand sampling behavior for such a small number of subjects.
Chunhua Cao posted on Thursday, October 12, 2017 - 10:53 am
Thank you very much for your time, Dr. Muthén. I completely agree with you. It is hard to understand the behavior for such extremely small sample sizes. I appreciate your help.