

Dear Mr. and Mrs. Muthén, I am simulating data for a mixture model using the MONTECARLO command with the syntax below (Mplus, Version 3.0). In each of the four classes, probabilities for solving the seven items are given in the MODEL POPULATION part according to assumptions about how these items relate to two attributes and to error probabilities. In the MODEL and MODEL CONSTRAINT parts, the thresholds of the binary items are named and set equal according to the assumptions of the population model. As the result, Mplus creates a data file. Mplus reports that the 6000 cases in this file are distributed across the four classes as follows: N(C1) = 932, N(C2) = 201, N(C3) = 273, and N(C4) = 4592. But if I open the .dat file in SPSS and check the frequencies of the class variable, I get N(C1) = 1515, N(C2) = 1463, N(C3) = 1513, and N(C4) = 1509. Now, three questions: 1) How would you explain this? My feeling is that Mplus fixes the class proportions to be equal by default, but the Mplus Ns above contradict this, don't they? 2) Is there a way to change this, that is, to have an unequal distribution of cases across classes (maybe by adding something to the MODEL POPULATION command)? 3) The POPULATION VALUES FOR LATENT CLASS REGRESSION MODEL PART of TECH1 is zero ('0') for all four classes. As I understand it, this means no population values are given for the means of the latent class variable in the four groups, so they are estimated freely in classes one to three (the fourth is set to zero). Now, if I specified different means in the four groups (via MODEL POPULATION), the class proportions should change, right? But how do I specify population values for these means? I apologize if these questions are addressed in your handbook and I am just not able to find the right pages... 
Thanks in advance and greetings from Berlin, Germany. Michael Leucht

NOW HERE IS MY SYNTAX:

TITLE: MonteCarlo Simulation for DINA Model, 2 Attributes;
MONTECARLO:
  NAMES ARE x1-x7;
  GENERATE = x1-x7(1);
  CATEGORICAL = x1-x7;
  GENCLASSES = c(4);
  CLASSES = c(4);
  NOBS = 6000;
  NREPS = 5;
  SEED = 1234567;
  SAVE = E:\iqb\kognitive modellierung\modellierung\dina_A2_Ziel_6000.dat;
ANALYSIS:
  TYPE = Mixture;
  MCONVERGENCE = .001;
MODEL POPULATION:
  %OVERALL%
  %c#1%
  [x1$1@0.405]; [x2$1@0.847]; [x3$1@0.847]; [x4$1@1.386];
  [x5$1@1.386]; [x6$1@2.197]; [x7$1@2.197];
  %c#2%
  [x1$1@2.197]; [x2$1@2.197]; [x3$1@0.847]; [x4$1@1.386];
  [x5$1@1.386]; [x6$1@2.197]; [x7$1@2.197];
  %c#3%
  [x1$1@0.405]; [x2$1@0.847]; [x3$1@1.386]; [x4$1@1.386];
  [x5$1@1.386]; [x6$1@2.197]; [x7$1@2.197];
  %c#4%
  [x1$1@2.197]; [x2$1@2.197]; [x3$1@1.386]; [x4$1@1.386];
  [x5$1@0.847]; [x6$1@0.847]; [x7$1@0.405];
MODEL:
  %OVERALL%
  %c#1%
  [x1$1] (p1_1); [x2$1] (p2_1); [x3$1] (p3_1); [x4$1] (p4_1);
  [x5$1] (p5_1); [x6$1] (p6_1); [x7$1] (p7_1);
  %c#2%
  [x1$1] (p1_2); [x2$1] (p2_2); [x3$1] (p3_2); [x4$1] (p4_2);
  [x5$1] (p5_2); [x6$1] (p6_2); [x7$1] (p7_2);
  %c#3%
  [x1$1] (p1_3); [x2$1] (p2_3); [x3$1] (p3_3); [x4$1] (p4_3);
  [x5$1] (p5_3); [x6$1] (p6_3); [x7$1] (p7_3);
  %c#4%
  [x1$1] (p1_4); [x2$1] (p2_4); [x3$1] (p3_4); [x4$1] (p4_4);
  [x5$1] (p5_4); [x6$1] (p6_4); [x7$1] (p7_4);
MODEL CONSTRAINT:
  p1_3=p1_1; p1_4=p1_2;
  p2_3=p2_1; p2_4=p2_2;
  p3_2=p3_1; p3_4=p3_3;
  p4_2=p4_1; p4_4=p4_3;
  p5_2=p5_1; p5_3=p5_1;
  p6_2=p6_1; p6_3=p6_1;
  p7_2=p7_1; p7_3=p7_1;
OUTPUT: Tech8; Tech9; 


I assume the class counts from SPSS are determined using most likely class membership, whereas Mplus uses posterior probabilities. These can be very similar or very different depending on the entropy. Your results do look strange, however. There have been many changes since Version 3.0; I suggest downloading Version 4.0 and rerunning your analysis. If you continue to have questions about the results, please send the input, output, and your license number to support@statmodel.com. When it is necessary to post an input, it is most likely an Mplus support question and should be sent to support. 
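The distinction between the two counting methods can be illustrated on a tiny made-up posterior matrix (the numbers here are purely illustrative, not from the poster's run):

```python
import numpy as np

# Purely illustrative posterior probabilities for 4 cases and 3 classes
# (each row sums to 1).
post = np.array([
    [0.6, 0.3, 0.1],
    [0.4, 0.4, 0.2],
    [0.5, 0.2, 0.3],
    [0.3, 0.5, 0.2],
])

# Posterior-probability counts: sum the posterior probabilities per class.
posterior_counts = post.sum(axis=0)

# Most-likely-class counts: tally each case's highest-probability class.
most_likely = post.argmax(axis=1)
hard_counts = np.bincount(most_likely, minlength=post.shape[1])

print(posterior_counts)  # [1.8 1.4 0.8]
print(hard_counts)       # [3 1 0]
```

With fuzzy posteriors like these (low entropy), the two tallies diverge sharply; with posteriors near 0 and 1 they nearly coincide.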


Well, that does not really solve the case... What SPSS does is read out the class variable that was created by Mplus, so the results should not differ that much. Don't you think so? 


You really need to send this to support (input, output, license number), otherwise we'd just be guessing what you are seeing. Note that your Model statement does not give any true values for the parameters and since Monte Carlo by default does not use random starts, you will not get a good solution. And yes, Mplus defaults to equal class sizes. The way to specify the class sizes you want is shown in the User's Guide example 11.3. 


Hi, In the Mplus guide there is an example (11.3) where you simulate two latent classes and recover a model with one class. I am trying to do something similar, but I have 5 latent categorical variables (A, B, C, D, and E), four with 6 classes and one with 2 classes. However, Mplus gives me an error: Unknown class label in MODEL POPULATION: %A#1% Is it actually possible in Mplus to simulate a mixture model with more than one latent categorical variable? Is there a command like MODEL POPULATION A:? Thanks in advance, Irene 


You use MODEL POPULATION-A: So include a dash. 

Sean F posted on Wednesday, October 07, 2009 - 12:26 pm



Dear Dr. Muthén, In one of your posts above, you mentioned that example 11.3 shows how to run a Monte Carlo simulation where latent class sizes are unequal. Unfortunately, I cannot seem to figure out how to vary the class sizes from the example. I also tried running the syntax for 11.3, but the simulated data give me equal class sizes. I am hoping to run an MC simulation where the population has three latent classes of unequal proportions. I appreciate your help. 


Assuming no "c ON x", the class sizes are given by the values of the [c#.] logits. You translate from 3-class probabilities p1, p2, p3 to logits for c#1 and c#2 by using the UG Chapter 13 formulas for intercepts in multinomial logistic regression, as follows:

[c#1] = log(p1/p3)
[c#2] = log(p2/p3)

So, for example, with probs = .5, .25, .25 you get

[c#1*0.69]; [c#2*0]; 
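The translation can be checked numerically (Python used just for the arithmetic; the helper name is ours):

```python
import math

def class_logits(probs):
    """Translate class probabilities into [c#k] logit intercepts,
    with the last class as the reference category."""
    ref = probs[-1]
    return [math.log(p / ref) for p in probs[:-1]]

logits = class_logits([0.50, 0.25, 0.25])
print(logits)  # [0.693..., 0.0] -> [c#1*0.69]; [c#2*0];
```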

Ross Crosby posted on Tuesday, April 20, 2010 - 2:01 am



I am also interested in manipulating class sizes, but for a four-class solution. Can you please help me understand what to put in to get class probabilities of 0.2, 0.2, 0.5, and 0.1? Following the suggestion above does not seem to produce the sizes predictably. Thanks. 


The draws are random so you can expect some variability. Try increasing the sample size. As sample size increases, the stability should also improve. 
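For the four-class case, applying the multinomial-logit translation from the earlier reply (logit_k = log(pk/pK), with the last class as reference) gives the starting values; any remaining deviation in a finite sample is just sampling variability:

```python
import math

# Desired class proportions; the last class serves as the reference.
probs = [0.2, 0.2, 0.5, 0.1]
logits = [math.log(p / probs[-1]) for p in probs[:-1]]
print([round(v, 2) for v in logits])  # [0.69, 0.69, 1.61]
# i.e. [c#1*0.69]; [c#2*0.69]; [c#3*1.61];
```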


I am having trouble setting up a Monte Carlo simulation study for an LVMM. I want to determine my sample size prior to collecting data. I have 7 continuous variables and 2 dichotomous variables that will be used to determine classes (my latent variable will be categorical). This is exploratory, so I don't know how many classes will be generated. I have read many discussion boards on this site, the Mplus User's Guide (Version 6), and the Muthén & Muthén (2002) and Nylund et al. (2007) papers, but I am still confused. Many of the examples are more complicated than my model, and I am not sure how to deconstruct the syntax. Do you have an example you could share for a Monte Carlo simulation of an LVMM with both continuous and categorical variables? Or perhaps there is a discussion with this information that I am overlooking? 


Most examples in the user's guide come with a Monte Carlo counterpart that generated the data for the example. Find the example in Chapter 7 that is closest to what you are doing, and use the Monte Carlo counterpart as a starting point. 


I simulated the performance of a 2-class Poisson LCGA where the generated data are overdispersed. Data were generated using the NegBin model. What I find is that, with increasing degrees of overdispersion, the Poisson model yields increasingly worse class counts based on estimated posterior probabilities, but retains very high entropy. Conversely, the NegBin model yields increasingly worse entropy values, but preserves good results regarding class counts based on estimated posterior probabilities. This could mean that the Poisson model implies a very distinct classification even when the model requirements (equidispersion) are more and more violated and the classification itself (compared to the 'true' model) is wrong. The NegBin model, on the other hand, seems to imply that the clarity of the classification suffers with increasing overdispersion, although the classification based on posterior probabilities remains good. So my question is: Shouldn't the NegBin model be able to replicate data that were initially generated under NegBin conditions with both good class counts based on posterior probabilities AND high entropy? 


We are having difficulty understanding your question. For example, what does the following mean: "the Poisson model reveals increasingly worse class counts based on estimated posterior probabilities"? 


Thank you. I'll try to clarify: I generated data by a (linear) 2-class LCGA with different conditions of overdispersion (low/medium/high) in the 3 repeated measures (y1 y2 y3). The remaining model parameters were held constant across dispersion conditions. The different dispersion conditions were then reanalyzed with a Poisson LCGA (implying equidispersion) and, as a control, a NegBin LCGA (allowing overdispersion). With an increase in the level of dispersion in the data (conditions medium and high), (1) the Poisson-based replications show that the class counts/proportions based on estimated posterior probabilities do not match the 'true' model; surprisingly, entropy remains very high throughout all dispersion conditions. (2) Reanalyzing with a NegBin specification gives good results for the class counts/proportions based on estimated posterior probabilities compared to the 'true' model; surprisingly, entropy gets worse with an increase in dispersion. 


I see now what you are asking. My experience with, say, GMM is that entropy isn't necessarily higher for the model that is more appropriate for the data, so I am not surprised by your results. Entropy isn't a measure of goodness of fit; it is just a descriptive measure of how clear the classification is. A model that allows more dispersion flexibility may have a harder time clearly classifying certain responses, but that may be how reality is. 
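For concreteness, here is a sketch of relative entropy as a descriptive measure of classification clarity, computed from a posterior-probability matrix (a common formulation scaled to [0, 1]; the toy matrices are made up):

```python
import numpy as np

def relative_entropy(post):
    """Relative entropy of a classification: 1 = perfectly clear,
    0 = completely uninformative. post is an n-by-K matrix of
    posterior class probabilities (rows sum to 1)."""
    n, K = post.shape
    p = np.clip(post, 1e-12, 1.0)  # guard against log(0)
    avg_uncertainty = -(p * np.log(p)).sum() / (n * np.log(K))
    return 1.0 - avg_uncertainty

sharp = np.array([[0.99, 0.01], [0.02, 0.98]])  # clear classification
fuzzy = np.array([[0.55, 0.45], [0.48, 0.52]])  # unclear classification
print(relative_entropy(sharp))  # close to 1
print(relative_entropy(fuzzy))  # close to 0
```

Note that both matrices could come from models that fit the data equally well (or badly): the measure only describes how sharply cases are assigned, not how well the model fits.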

Helen Li posted on Tuesday, September 09, 2014 - 9:46 am



Dear Dr. Muthén, I'm simulating two latent classes with two covariates included. When I manipulate the relation between the latent class variable and the covariates, is it correct that I use exp(regression coefficient) for data generation? For example, if the effect of one covariate on the log odds of membership in class 1 relative to class 2 is 1.5, should I transform the value to exp(1.5) for the data generation (i.e., c#1 on x1*0.405;)? Also, should I add "ALGORITHM=INTEGRATION;" to the analysis part, which would look like: analysis: type=mixture; ALGORITHM=INTEGRATION; Thank you for your time! Helen 


Q1. The data generation in Mplus Monte Carlo uses the logit parameterization, not odds parameterization. So, no. Q2. Algo = int; is only needed if you have a continuous latent variable. 

Helen Li posted on Tuesday, September 09, 2014 - 4:46 pm



Thanks, Dr. Muthén! I'm sorry, I'm still confused. So, as per Question 1, if I have a logistic model expressed in logit form as logit(pi1) = log(pi1/pi2) = gamma01 + gamma11*x1 + gamma21*x2, can I use gamma11, for example, as the relation between latent class 1 and x1 (i.e., c#1 on x1*gamma11;) for data generation? As per Question 2, could you please check whether my understanding is correct: if my repeated measures are all continuous (for a growth mixture model), I may still use "Algo = int;". Or can I simply omit this syntax and just include "Auxiliary = x1 x2;" (one continuous and one categorical covariate) in the MONTECARLO part of the code? Many thanks, Helen 


Q1. Yes. Q2. I am not quite sure what your full model is. Perhaps it is a growth mixture model with covariates predicting the latent class variable. If so, you should not need algo=int. I don't know what you want to accomplish by putting the covariates on the auxiliary list. Remember that every example in the UG has a Monte Carlo counterpart on our website, so if you find a model in the UG example set, you may want to start from that. 

Helen Li posted on Tuesday, September 09, 2014 - 7:43 pm



Thank you so much for your quick response and patient explanation! This is really helpful! Thank you very much! Helen 

Helen Li posted on Friday, September 12, 2014 - 6:35 am



Dear Dr. Muthén, I generated the data for my simulation study following your suggestion. My class proportion is 30:70 and I used the logit value to manipulate it. The two covariates are correlated, so I included PWITH in MODEL POPULATION. Below is the code:

Montecarlo:
  names are y1-y6 x1 x2;
  nobservations = 1000;
  nreps = 200;
  seed = 26385;
  CUTPOINTS = x1(0.5244);
  Repsave = ALL;
  Save = sim*.dat;
  genclasses = c(2);
  classes = c(2);
analysis:
  type = mixture;
Model population:
  %overall%
  [x1-x2@0];
  x1-x2@1;
  i s | y1@0 y2@1 y3@2 y4@3 y5@4 y6@5;
  i*1 s*.2;
  i with s*.2;
  y1-y6*.75 (ve);
  [c#1*-0.8472];    ! class proportion
  c#1 on x1*0.3;    ! x1 & class 1 relation
  c#1 on x2*2;      ! x2 & class 1 relation
  x1 pwith x2*0.3;  ! correlated x1 & x2
  %c#1%
  [i*2.5]; [s*.05];
  %c#2%
  [i*4.5]; [s*.85];

May I know if this is correct? Thanks! Helen 


Looks ok as far as you show. I would change x1 pwith x2*0.3; to x1 with x2*0.3; 

Helen Li posted on Friday, September 12, 2014 - 7:18 pm



But if it is the correlation between x1 and x2 that I want to manipulate, do I have to transform it to a covariance? My understanding is that "x1 with x2*0.3" indicates the covariance, not the correlation. I just ran the program using these two different syntaxes and found that the data generated are exactly the same. May I know why PWITH is not recommended here? Thanks! Helen 


Both WITH and PWITH refer to covariances, not correlations (but if the variances are 1, covariances are the same as correlations). PWITH is meant for more than 2 variables, such as pairs of adjacent variables: y1-y4 PWITH y2-y5; which means y1 with y2; y2 with y3; y3 with y4; y4 with y5; 

Helen Li posted on Monday, September 15, 2014 - 9:59 am



Thanks for your interpretation. I really appreciate it! May I also ask whether there will be any influence on the proportion of each latent class when the two covariates are included in the GMM? Specifically, if I want to create a 30:70 class proportion using the code "[c#1*0.8472];" and then add two covariates (c#1 on x1*0.3; c#1 on x2*2; x1 with x2*0.3;), will the class proportions of the generated data still be 30:70 with the covariates included? Helen 


No, the class proportions change. This is just like regression, where the mean of Y changes: when you add covariates, the mean is a function of the intercept ([c#1*...]) and the slope times the mean of the covariate. With c ON x it's a little more complicated, but the same principle applies. So when there are covariates, you have to find the class proportions by trial and error, using a very large sample. 
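The trial-and-error check can be sketched outside Mplus. This is a rough illustration only, assuming the poster's generation values and, for simplicity, treating both covariates as continuous standard normals (the CUTPOINTS dichotomization of x1 is ignored here):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000  # very large sample, as the reply suggests

# Covariates: standard normal with correlation 0.3, as in the
# poster's MODEL POPULATION.
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=n)

# Two-class logit for class 1, using the poster's generation values.
intercept, b1, b2 = -0.8472, 0.3, 2.0
logit = intercept + b1 * x[:, 0] + b2 * x[:, 1]
p_class1 = 1.0 / (1.0 + np.exp(-logit))

# The marginal proportion of class 1 is the average of p_class1 over
# the covariate distribution; it no longer equals the ~0.30 implied by
# the intercept alone, which is why the intercept must be retuned.
print(p_class1.mean())
```

In this sketch the marginal proportion comes out well above 0.30 (the strong x2 slope spreads the logit widely), so the intercept would have to be adjusted and the simulation rerun until the target proportion is hit.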

Helen Li posted on Monday, September 15, 2014 - 7:52 pm



Dear Dr. Muthén, I really appreciate your time and your answers to my questions. They are very helpful! Following your last response, I still have one question: Is there any calculation I can rely on to make sure the generated data satisfy the 30:70 class proportions when the two covariates are included? The "[c#1*0.8472];" was based solely on the log odds of 30 to 70, and I guess that when the two covariates are included in the model, the logit value should change. Is my understanding correct? If there is no available method to get that proportion, should I try different logit values using a large sample size to make sure it happens? If so, which part of the output should I look at? Many, many thanks! Helen 


Try different logit values using a large sample size to make sure it happens. Look at the model-based final class proportions. 
