I am simulating data for a mixture model using the MONTECARLO command with the syntax below. I am using Mplus, Version 3.0. In each of the four classes, the probabilities of solving the seven items are given (the ‘model population’ part) according to assumptions about how these items relate to two attributes and to error probabilities. In the ‘model’ and ‘model constraint’ parts, the thresholds of the binary items are named and set equal according to the assumptions of the population model. As a result, Mplus creates a data file. Mplus reports that the 6000 cases in this file are distributed across the four classes like this: N(C1) = 932, N(C2) = 201, N(C3) = 273 and N(C4) = 4592. But if I open the dat file in SPSS and check the frequencies of the class variable, I get N(C1) = 1515, N(C2) = 1463, N(C3) = 1513 and N(C4) = 1509.
Now, three questions:
1) How would you explain this? My feeling is that Mplus fixes the proportion of cases per class to be equal by default. But the Mplus Ns above contradict this, don’t they?
2) Is there a way to change it, that is, to have an unequal distribution of cases across classes (maybe by adding something to the ‘model population’ command)?
3) The POPULATION VALUES FOR LATENT CLASS REGRESSION MODEL PART of tech1 are zero (‘0’) for all four classes. As I understand it, this means that no population values are given for the means of the latent class variable in the four groups, so they are estimated freely in groups one to three (the fourth is set to zero). Now, if I specified different means in the four groups (via ‘model population’), the proportions of cases in the latent classes should change, right? But how do I specify population values for these means?
I apologize if these questions are addressed in your handbook and I am just not able to find the right pages...
Thanks in advance and greetings from Berlin, Germany Michael Leucht
NOW HERE IS MY SYNTAX:
TITLE: Monte-Carlo Simulation for DINA Model, 2 Attribute;
I assume the number of people in each class using SPSS is determined using most likely class membership. Mplus uses posterior probabilities. These can be very similar or very different depending on the entropy. Your results do look strange. There have been many changes since Version 3.0. I suggest downloading Version 4.0 and rerunning your analysis. If you continue to have questions about the results, please send the input, output, and your license number to email@example.com. When it is necessary to post an input, it is most likely an Mplus support question and should be sent to support.
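To illustrate the distinction drawn above between class counts based on most likely class membership and counts based on posterior probabilities, here is a minimal Python sketch. The posterior probabilities are made-up illustration values, not output from the analysis in this thread, and the function name is hypothetical.

```python
def class_counts(post):
    """Return (modal_counts, posterior_counts) for a list of posterior rows.

    post: one row per case, one posterior probability per class.
    """
    n_classes = len(post[0])
    modal = [0] * n_classes
    expected = [0.0] * n_classes
    for row in post:
        # Most likely class membership: assign the case to its top class.
        modal[row.index(max(row))] += 1
        # Posterior-based count: add the fractional membership to every class.
        for k, p in enumerate(row):
            expected[k] += p
    return modal, expected


# Made-up posteriors for four cases and two classes (illustration only).
post = [[0.90, 0.10], [0.60, 0.40], [0.55, 0.45], [0.20, 0.80]]
modal, expected = class_counts(post)
print(modal)     # counts from most likely class membership
print(expected)  # counts from summed posterior probabilities
```

When the posteriors are close to 0 or 1 (high entropy), the two sets of counts nearly coincide; when they are fuzzy, they can differ noticeably.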
You really need to send this to support (input, output, license number), otherwise we'd just be guessing what you are seeing.
Note that your Model statement does not give any true values for the parameters and since Monte Carlo by default does not use random starts, you will not get a good solution. And yes, Mplus defaults to equal class sizes. The way to specify the class sizes you want is shown in the User's Guide example 11.3.
Hi, In the Mplus guide there is an example (11.3) where you simulate two latent classes and recover a model with one class. I am trying to do something similar, but I have 5 latent categorical variables (A B C D AND E), 4 with 6 classes and one with 2 classes. However, Mplus gives me an error: Unknown class label in MODEL POPULATION: %A#1%
Is it actually possible in Mplus to simulate a mixture model with more than one latent categorical variable? Is there a command like MODEL POPULATION A:?
Sean F posted on Wednesday, October 07, 2009 - 12:26 pm
Dear Dr. Muthén,
In one of your posts above, you mentioned that example 11.3 shows how to run a Monte Carlo simulation where latent class sizes are unequal. Unfortunately, I cannot seem to figure out how to vary the class sizes from the example. I also tried running the syntax for 11.3, but the simulated data gives me equal class sizes. I am hoping to run a MC simulation where the population has three latent classes of unequal proportions. I appreciate your help.
Assuming no "c ON x", the class sizes are given by the values for the [c#.] logits. You translate from 3-class probabilities p1, p2, p3 to logits for c#1 and c#2 by using the UG chapter 13 formulas for intercepts in multinomial logistic regression as follows.
[c#1] = log(p1/p3)
[c#2] = log(p2/p3)
where log is the natural (base e) logarithm.
So for example with probs = .5, .25, .25 you get
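The arithmetic implied by the formulas above can be sketched in a few lines of Python. This assumes no "c ON x" and that the last class is the reference class; the function name is hypothetical and not part of Mplus.

```python
import math


def class_logits(probs):
    """Convert class probabilities to logits relative to the last class."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1")
    ref = probs[-1]  # last class is the reference (its logit is fixed at 0)
    return [math.log(p / ref) for p in probs[:-1]]


logits = class_logits([0.5, 0.25, 0.25])
print([round(v, 3) for v in logits])  # logits for c#1 and c#2
```

For probabilities .5, .25, .25 this gives log(2), about 0.693, for c#1 and log(1) = 0 for c#2.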
Ross Crosby posted on Tuesday, April 20, 2010 - 2:01 am
I am also interested in manipulating class sizes, but for a four-class solution. Can you please help me understand what to put in to get 0.2, 0.2, 0.5, and 0.1 class probabilities? Following the suggestion above does not seem to produce the sizes predictably. Thanks.
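One way to check a set of logits for the four-class case is to convert them back to probabilities. The sketch below assumes no covariates and that class 4 (probability 0.1) is the reference class, following the multinomial logit formulas quoted earlier in this thread; the function name is hypothetical.

```python
import math


def class_probs(logits):
    """Convert logits for classes 1..K-1 back to K class probabilities.

    The last class is the reference class with logit 0.
    """
    expo = [math.exp(v) for v in logits] + [1.0]  # exp(0) = 1 for the reference
    total = sum(expo)
    return [e / total for e in expo]


# Logits for target probabilities 0.2, 0.2, 0.5, 0.1 (class 4 as reference).
logits4 = [math.log(0.2 / 0.1), math.log(0.2 / 0.1), math.log(0.5 / 0.1)]
probs4 = class_probs(logits4)
print([round(v, 3) for v in logits4])
print([round(p, 3) for p in probs4])
```

The logits come out at about 0.693, 0.693 and 1.609. Note that these are population values; the class counts in any single generated data set will vary around them by sampling error.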
I am having trouble setting up a monte carlo simulation study for a LVMM study. I want to determine my sample size prior to collecting data. I have 7 continuous variables and 2 dichotomous variables that will be used to determine classes (my latent variable will be categorical). This is exploratory, so I don't know how many classes will be generated.
I have read many discussion boards on this site, the Mplus user's guide (version 6), and the Muthen & Muthen, 2002 and Nylund et al., 2007 papers, but I am still confused. Many of the examples are more complicated than my model, and I am not sure how to deconstruct the syntax.
Do you have an example you could share for a monte carlo simulation of a LVMM with both continuous and categorical variables? Or perhaps there is a discussion with this information that I am overlooking?
Most examples in the user's guide come with a Monte Carlo counterpart that generated the data for the example. Find the example in Chapter 7 that is closest to what you are doing, and use the Monte Carlo counterpart as a starting point.
I simulated the performance of a 2-class Poisson LCGA where the generated data are overdispersed. Data were generated using the NegBin model. What I find is that with increasing degrees of overdispersion, the Poisson model yields increasingly worse class counts based on estimated posterior probabilities, but entropy remains very high. Conversely, the NegBin model yields increasingly worse entropy values, but retains good results regarding class counts based on estimated posterior probabilities.
This could mean that the Poisson model implies a very distinct classification, even if the model requirements (equidispersion) are more and more violated and the classification itself (compared to the 'true model') is wrong.
The NegBin model, on the other hand, seems to imply that the distinctness of the classification suffers with increasing degrees of overdispersion, although the classification based on posterior probabilities remains good.
So my question is: Shouldn't the NegBin model be able to replicate the data that were initially generated under NegBin conditions with both good results regarding class counts based on posterior probabilities AND high entropy?
I generated data by a (linear) 2-class LCGA, with different conditions of overdispersion (low/medium/high) in the 3 repeated measures (y1 y2 y3). The remaining model parameters were held constant across dispersion conditions.
The different data/dispersion conditions were reanalyzed with a Poisson LCGA (implying equidispersion) and a NegBin LCGA (allowing dispersion) as a control.
With an increase in the level of dispersion in the data (conditions medium & high), the results of the Poisson-based replications (1) show that the class counts/proportions based on estimated posterior probabilities do not match the 'true' model. Surprisingly, entropy remains very high throughout all dispersion conditions.
Reanalyzing with a NegBin specification (2) yields good results regarding class counts/proportions based on estimated posterior probabilities compared to the 'true model'. Surprisingly, entropy gets worse with an increase in dispersion.
I see now what you are asking. My experience with say GMM is that entropy isn't necessarily higher for the model that is more appropriate for the data - so I am not surprised by your results. Entropy isn't a measure of goodness of fit - it is just a descriptive of how clear the classification is. A model that allows more dispersion flexibility may have a harder time clearly classifying certain responses - but that may be how reality is.
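The point that entropy only describes how clear the classification is can be seen from how it is computed. The sketch below uses the relative entropy formula commonly reported for mixture models, 1 - sum(-p*ln p) / (n*ln K), which I am assuming matches what Mplus prints; the posterior values are made up for illustration.

```python
import math


def relative_entropy(post):
    """Relative entropy in [0, 1] from posterior class probabilities.

    post: one row per case, one posterior probability per class.
    Only the posteriors enter; the true class labels play no role.
    """
    n = len(post)
    k = len(post[0])
    total = 0.0
    for row in post:
        for p in row:
            if p > 0.0:  # treat 0 * ln(0) as 0
                total -= p * math.log(p)
    return 1.0 - total / (n * math.log(k))


crisp = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # every case assigned with certainty
fuzzy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]   # every case completely ambiguous
print(relative_entropy(crisp))
print(relative_entropy(fuzzy))
```

Because the true classes never appear in the formula, a model can assign cases confidently (entropy near 1) and still assign them to the wrong classes.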
Helen Li posted on Tuesday, September 09, 2014 - 9:46 am
Dear Dr. Muthén,
I'm simulating two latent classes with two covariates included. When I manipulate the relation between latent class and the covariates, is it correct that I use the exp(regression coefficient) for data generation? For example, if the effect of one covariate on the log odds of membership in class 1 relative to class 2 is 1.5, should I transform the value to exp(1.5) to be used in the data generation (i.e., c#1 on x1*0.405;)?
Also, should I add "ALGORITHM=INTEGRATION;" to the analysis part, which looks like: analysis: type=mixture; ALGORITHM=INTEGRATION;
Q1. The data generation in Mplus Monte Carlo uses the logit parameterization, not odds parameterization. So, no.
Q2. Algo = int; is only needed if you have a continuous latent variable.
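As a small numeric aside on the logit versus odds scales in Q1: the sketch below assumes, purely for illustration, that a value of 1.5 is meant as an odds ratio. On that reading the corresponding logit-scale slope is its natural log, which is where a value like 0.405 comes from. If 1.5 is already a log-odds slope, it would be used as is.

```python
import math

odds_ratio = 1.5                       # assumed interpretation, for illustration
logit_slope = math.log(odds_ratio)     # slope on the logit (log-odds) scale
back_to_odds = math.exp(logit_slope)   # exponentiating recovers the odds ratio

print(round(logit_slope, 3))
print(round(back_to_odds, 3))
```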
Helen Li posted on Tuesday, September 09, 2014 - 4:46 pm
Thanks, Dr. Muthén! I'm sorry I'm still confused. So, as per Question 1, if I have a logistic model expressed in logit form as: logit(pie1)=log(pie1/pie2)=gamma01+gamma11x1+gamma21x2, can I use gamma11, for example, as the relation between latent class1 and x1 (i.e., c#1 on x1*gamma11;) for data generation?
As per Question 2, could you please see if my understanding is correct -- If my repeated measures are all continuous (for a growth mixture model), I may still use "Algo = int;". Or I can simply ignore this syntax and just include "Auxiliary = x1 x2;" (one continuous and one categorical covariate) in the Montecarlo part of the code?
Q2. I am not quite sure what your full model is. Perhaps it is a growth mixture model with covariates predicting the latent class variable. If so, you should not need algo=int. I don't know what you want to accomplish by putting the covariates on the auxiliary list.
Remember that every example in the UG has a Monte Carlo counterpart on our website. So if you find a model in the UG example set, you may want to start from that.
Helen Li posted on Tuesday, September 09, 2014 - 7:43 pm
Thank you so much for your quick response and patient explanation! This is really helpful! Thank you very much!
Helen Li posted on Friday, September 12, 2014 - 6:35 am
Dear Dr. Muthén,
I generated the data for my simulation study following your suggestion. My class proportion is 30:70 and I used logit value to manipulate it. The two covariates are correlated, so I include PWITH in the MODEL POPULATION. Below is the code:
Helen Li posted on Friday, September 12, 2014 - 7:18 pm
But if it is the correlation between x1 and x2 that I want to manipulate, do I have to transform it to a covariance? My understanding is that "x1 with x2*0.3" indicates the covariance, not the correlation. I just ran the program using the two different statements and found that the data generated are exactly the same. May I know why "PWITH" is not recommended here? Thanks!
Both WITH and PWITH refer to covariances, not correlations (but if variances are 1, covariances are the same as correlations). PWITH is meant for more than 2 variables such as pairs of adjacent variables:
y1-y4 PWITH y2-y5;
which is the same as
y1 with y2; y2 with y3; y3 with y4; y4 with y5;
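Since WITH takes a covariance, a target correlation has to be rescaled by the two standard deviations whenever the variances are not 1. A minimal Python sketch of that conversion follows; the function name and the example variances are illustrative assumptions.

```python
import math


def covariance_from_correlation(r, var1, var2):
    """Covariance implied by correlation r and the two variances."""
    return r * math.sqrt(var1 * var2)


# With unit variances the covariance equals the correlation.
print(covariance_from_correlation(0.3, 1.0, 1.0))
# With variances 4 and 9 the same correlation needs a larger covariance.
print(covariance_from_correlation(0.3, 4.0, 9.0))
```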
Helen Li posted on Monday, September 15, 2014 - 9:59 am
Thanks for your explanation. I really appreciate it! May I also know whether including two covariates in the GMM changes the proportion of each latent class? Specifically, if I want to create a 30:70 class proportion using the code "[c#1*-0.8472];" and then add two covariates (c#1 on x1*0.3; c#1 on x2*2; x1 with x2*0.3;), will the class proportions of the generated data still be the same (30:70) with the covariates included?
This is just like regression where the mean of Y changes. When you add covariates, the mean is a function of the intercept ( [c#1*...]) and the slope times the mean of the covariate. With c on x it's a little more complicated but same principle.
So you have to do it by trial and error using a very large sample to get the class proportions when there are covariates.
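The trial-and-error idea can be mimicked outside Mplus. The rough Python sketch below draws a large sample of covariates, averages the class-1 probabilities, and searches for the intercept that gives a 30% class. It assumes two normal covariates with mean 0, variance 1 and covariance 0.3, and slopes 0.3 and 2 as in the question above; the distributional assumptions and all function names are mine, so treat the result as a starting value to confirm with a large-sample Mplus run.

```python
import math
import random


def simulate_covariates(n, cov12, seed=12345):
    """Draw n pairs (x1, x2): standard normal with covariance cov12 (assumed)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        x2 = cov12 * z1 + math.sqrt(1.0 - cov12 ** 2) * z2
        draws.append((z1, x2))
    return draws


def class1_proportion(intercept, slopes, draws):
    """Average P(c = 1 | x) over the draws for a 2-class logit model."""
    b1, b2 = slopes
    total = 0.0
    for x1, x2 in draws:
        eta = intercept + b1 * x1 + b2 * x2
        total += 1.0 / (1.0 + math.exp(-eta))
    return total / len(draws)


def solve_intercept(target, slopes, draws, lo=-10.0, hi=10.0, tol=1e-8):
    """Bisection: the proportion increases monotonically with the intercept."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if class1_proportion(mid, slopes, draws) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


draws = simulate_covariates(20000, 0.3)
slopes = (0.3, 2.0)
naive = class1_proportion(-0.8472, slopes, draws)   # intercept chosen without covariates
intercept = solve_intercept(0.30, slopes, draws)
print(round(naive, 3), round(intercept, 3))
```

Under these assumptions the no-covariate intercept of -0.8472 gives a class-1 proportion above 30%, and a more negative intercept is needed to bring it back to 30:70.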
Helen Li posted on Monday, September 15, 2014 - 7:52 pm
Dear Dr. Muthén,
I really appreciate your time and your answers to my questions. They are very helpful! Regarding your last response, I still have one question: Is there any calculation that I can rely on to make sure the generated data satisfy the 30:70 class proportions when two covariates are included? The "[c#1*-0.8472];" was based solely on the log odds of 30 to 70, and I guess that when two covariates are included in the model, the logit value should be changed. Is my understanding correct? If there is no available method to get that proportion, should I try different logit values using a large sample size until it happens? If so, which part of the output should I look at? Many, many thanks!