

Montecarlo for Mixture Modeling 

Message/Author 


Dear Mr. and Mrs. Muthén, I am simulating data for a mixture model using the MONTECARLO command with the syntax below. I am using Mplus, Version 3.0. In each of the four classes, probabilities for solving the seven items are given (MODEL POPULATION part) according to assumptions about how these items relate to two attributes and to error probabilities. In the MODEL and MODEL CONSTRAINT parts, the thresholds of the binary items are named and set equal according to the assumptions of the population model. As the result, Mplus creates a data file. Mplus says the 6000 cases in this file are distributed across the four classes like this: N(C1) = 932, N(C2) = 201, N(C3) = 273, and N(C4) = 4592. But if I open the dat file in SPSS and check the frequencies of the class variable, I get N(C1) = 1515, N(C2) = 1463, N(C3) = 1513, and N(C4) = 1509. Now, three questions: 1) How would you explain this? My feeling is that Mplus fixes the proportion of cases per class to be equal by default. But the Mplus Ns above contradict this, don't they? 2) Is there a way to change this, that is, to have an unequal distribution of cases across classes (maybe by adding something to the MODEL POPULATION command)? 3) The POPULATION VALUES FOR LATENT CLASS REGRESSION MODEL PART of TECH1 is zero ('0') for all four classes. As I understand it, that means no population values are given for the means of the latent class variable in the four groups, so they are estimated freely in classes one to three (the fourth is set to zero). Now, if I specified different means in the four groups (via MODEL POPULATION), the proportions of cases in the latent classes should change, right? But how do I specify population values for these means? I apologize if these questions are addressed in your handbook and I am just not able to find the right pages... 
Thanks in advance and greetings from Berlin, Germany. Michael Leucht

NOW HERE IS MY SYNTAX:

TITLE: MonteCarlo Simulation for DINA Model, 2 Attributes;
MONTECARLO:
  NAMES ARE x1-x7;
  GENERATE = x1-x7(1);
  CATEGORICAL = x1-x7;
  GENCLASSES = c(4);
  CLASSES = c(4);
  NOBS = 6000;
  NREPS = 5;
  SEED = 1234567;
  SAVE = E:\iqb\kognitive modellierung\modellierung\dina_A2_Ziel_6000.dat;
ANALYSIS:
  TYPE = MIXTURE;
  MCONVERGENCE = .001;
MODEL POPULATION:
  %OVERALL%
  %c#1%
  [x1$1@0.405]; [x2$1@0.847]; [x3$1@0.847]; [x4$1@1.386];
  [x5$1@1.386]; [x6$1@2.197]; [x7$1@2.197];
  %c#2%
  [x1$1@2.197]; [x2$1@2.197]; [x3$1@0.847]; [x4$1@1.386];
  [x5$1@1.386]; [x6$1@2.197]; [x7$1@2.197];
  %c#3%
  [x1$1@0.405]; [x2$1@0.847]; [x3$1@1.386]; [x4$1@1.386];
  [x5$1@1.386]; [x6$1@2.197]; [x7$1@2.197];
  %c#4%
  [x1$1@2.197]; [x2$1@2.197]; [x3$1@1.386]; [x4$1@1.386];
  [x5$1@0.847]; [x6$1@0.847]; [x7$1@0.405];
MODEL:
  %OVERALL%
  %c#1%
  [x1$1] (p1_1); [x2$1] (p2_1); [x3$1] (p3_1); [x4$1] (p4_1);
  [x5$1] (p5_1); [x6$1] (p6_1); [x7$1] (p7_1);
  %c#2%
  [x1$1] (p1_2); [x2$1] (p2_2); [x3$1] (p3_2); [x4$1] (p4_2);
  [x5$1] (p5_2); [x6$1] (p6_2); [x7$1] (p7_2);
  %c#3%
  [x1$1] (p1_3); [x2$1] (p2_3); [x3$1] (p3_3); [x4$1] (p4_3);
  [x5$1] (p5_3); [x6$1] (p6_3); [x7$1] (p7_3);
  %c#4%
  [x1$1] (p1_4); [x2$1] (p2_4); [x3$1] (p3_4); [x4$1] (p4_4);
  [x5$1] (p5_4); [x6$1] (p6_4); [x7$1] (p7_4);
MODEL CONSTRAINT:
  p1_3 = p1_1; p1_4 = p1_2;
  p2_3 = p2_1; p2_4 = p2_2;
  p3_2 = p3_1; p3_4 = p3_3;
  p4_2 = p4_1; p4_4 = p4_3;
  p5_2 = p5_1; p5_3 = p5_1;
  p6_2 = p6_1; p6_3 = p6_1;
  p7_2 = p7_1; p7_3 = p7_1;
OUTPUT: TECH8; TECH9; 


I assume the number of people in each class in SPSS is determined using most likely class membership. Mplus uses posterior probabilities. These can be very similar or very different depending on the entropy. Your results do look strange, however. There have been many changes since Version 3.0. I suggest downloading Version 4.0 and rerunning your analysis. If you continue to have questions about the results, please send the input, output, and your license number to support@statmodel.com. When it is necessary to post an input, it is most likely an Mplus support question and should be sent to support. 
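To illustrate the distinction the answer draws, here is a minimal Python sketch with a made-up posterior probability matrix (not the poster's actual data). Posterior-probability-based counts are the column sums of the posteriors, while most-likely-class counts assign each case wholly to its modal class; when posteriors are fuzzy (low entropy), the two can diverge noticeably.

```python
# Hypothetical posterior probabilities for 4 cases and 2 classes.
post = [
    [0.6, 0.4],
    [0.6, 0.4],
    [0.6, 0.4],
    [0.1, 0.9],
]

# Posterior-probability-based counts: column sums of the posteriors.
expected = [sum(row[k] for row in post) for k in range(2)]

# Most-likely-class counts: each case goes entirely to its modal class.
modal = [0, 0]
for row in post:
    modal[row.index(max(row))] += 1

print(expected)  # [1.9, 2.1]
print(modal)     # [3, 1]
```

With sharp posteriors (entropy near 1) the two tallies would nearly coincide; here they clearly do not.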


Well, that does not really solve the case... What SPSS does is simply read out the class variable that Mplus created, so the results should not differ that much. Don't you think so? 


You really need to send this to support (input, output, license number); otherwise we'd just be guessing about what you are seeing. Note that your MODEL statement does not give any true values for the parameters, and since Monte Carlo by default does not use random starts, you will not get a good solution. And yes, Mplus defaults to equal class sizes. The way to specify the class sizes you want is shown in the User's Guide, Example 11.3. 


Hi, in the Mplus guide there is an example (11.3) where you simulate two latent classes and recover a model with one class. I am trying to do something similar, but I have 5 latent categorical variables (A, B, C, D, and E), 4 with 6 classes and one with 2 classes. However, Mplus gives me an error: Unknown class label in MODEL POPULATION: %A#1% Is it actually possible in Mplus to simulate a mixture model with more than one latent categorical variable? Is there a command like MODEL POPULATION A:? Thanks in advance, Irene 


You use MODEL POPULATION-a: That is, include a dash before the latent class variable name. 

Sean F posted on Wednesday, October 07, 2009  12:26 pm



Dear Dr. Muthén, In one of your posts above, you mentioned that example 11.3 shows how to run a Monte Carlo simulation where latent class sizes are unequal. Unfortunately, I cannot seem to figure out how to vary the class sizes from the example. I also tried running the syntax for 11.3, but the simulated data gives me equal class sizes. I am hoping to run a MC simulation where the population has three latent classes of unequal proportions. I appreciate your help. 


Assuming no "c ON x", the class sizes are given by the values of the [c#.] logits. You translate from 3-class probabilities p1, p2, p3 to logits for c#1 and c#2 using the UG Chapter 13 formulas for intercepts in multinomial logistic regression, as follows: [c#1] = log(p1/p3), [c#2] = log(p2/p3). So, for example, with probabilities .5, .25, .25 you get [c#1*0.69]; [c#2*0]; 
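The conversion described in this answer can be sketched in Python (the function name `class_logits` is made up for illustration; the formula is the one in the post, with the last class as reference):

```python
import math

def class_logits(probs):
    """Convert class probabilities to Mplus [c#k] intercept logits,
    using the last class as the reference class."""
    ref = probs[-1]
    return [math.log(p / ref) for p in probs[:-1]]

# The example in the post: probabilities .5, .25, .25
vals = class_logits([0.5, 0.25, 0.25])
print(vals)  # [0.693..., 0.0] -- matching [c#1*0.69]; [c#2*0];
```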

Ross Crosby posted on Tuesday, April 20, 2010  2:01 am



I am also interested in manipulating class sizes, but for a four-class solution. Can you please help me understand what to put in to get class probabilities of 0.2, 0.2, 0.5, and 0.1? Following the suggestion above does not seem to produce the sizes predictably. Thanks. 


The draws are random so you can expect some variability. Try increasing the sample size. As sample size increases, the stability should also improve. 
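For what it's worth, applying the log(pk/pK) formula from the earlier answer to the probabilities .2, .2, .5, .1 (with class 4 as the reference) would give the following values, sketched here in Python:

```python
import math

# Target class probabilities; the last class is the reference.
probs = [0.2, 0.2, 0.5, 0.1]
logits = [math.log(p / probs[-1]) for p in probs[:-1]]
print(logits)  # roughly [0.693, 0.693, 1.609]
```

If I read the earlier post correctly, that would correspond to [c#1*0.693]; [c#2*0.693]; [c#3*1.609]; in the %OVERALL% part of MODEL POPULATION, with realized class sizes still varying around the targets because the draws are random.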


I am having trouble setting up a Monte Carlo simulation study for an LVMM. I want to determine my sample size prior to collecting data. I have 7 continuous variables and 2 dichotomous variables that will be used to determine classes (my latent variable will be categorical). This is exploratory, so I don't know how many classes will be generated. I have read many discussion boards on this site, the Mplus User's Guide (Version 6), and the Muthén & Muthén (2002) and Nylund et al. (2007) papers, but I am still confused. Many of the examples are more complicated than my model, and I am not sure how to deconstruct the syntax. Do you have an example you could share for a Monte Carlo simulation of an LVMM with both continuous and categorical variables? Or perhaps there is a discussion with this information that I am overlooking? 


Most examples in the user's guide come with a Monte Carlo counterpart that generated the data for the example. Find the example in Chapter 7 that is closest to what you are doing, and use the Monte Carlo counterpart as a starting point. 


I simulated the performance of a 2-class Poisson LCGA where the generated data are overdispersed. Data were generated using the NegBin model. What I find is that, with increasing degrees of overdispersion, the Poisson model shows increasingly worse class counts based on estimated posterior probabilities, while entropy remains very high. Conversely, the NegBin model shows increasingly worse entropy values, but maintains good class counts based on estimated posterior probabilities. This could mean that the Poisson model implies a very distinct classification even as the model requirements (equidispersion) are more and more violated and the classification itself (compared to the 'true' model) is wrong. The NegBin model, on the other hand, seems to imply that the distinctness of the classification suffers with increasing degrees of overdispersion, although the classification based on posterior probabilities remains good. So my question is: Shouldn't the NegBin model be able to replicate data that were initially generated under NegBin conditions with both good class counts based on posterior probabilities AND high entropy? 


We are having difficulty understanding your question; for example, what does the following mean: "the Poisson model reveals increasingly worse class counts based on estimated posterior probabilities"? 


Thank you. I'll try to clarify: I generated data from a (linear) 2-class LCGA with different conditions of overdispersion (low/medium/high) in the 3 repeated measures (y1 y2 y3). The remaining model parameters were held constant across dispersion conditions. The different dispersion conditions were then reanalyzed with a Poisson LCGA (implying equidispersion) and, as a control, with a NegBin LCGA (allowing overdispersion). With an increase in the level of dispersion in the data (medium and high conditions), (1) the results of the Poisson-based replications show that the class counts/proportions based on estimated posterior probabilities do not match the 'true' model; surprisingly, entropy remains very high throughout all dispersion conditions. (2) Reanalyzing with a NegBin specification gives good class counts/proportions based on estimated posterior probabilities compared to the 'true' model; surprisingly, entropy gets worse with increasing dispersion. 


I see now what you are asking. My experience with, say, GMM is that entropy isn't necessarily higher for the model that is more appropriate for the data, so I am not surprised by your results. Entropy isn't a measure of goodness of fit; it is just a descriptive of how clear the classification is. A model that allows more dispersion flexibility may have a harder time clearly classifying certain responses, but that may be how reality is. 
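As a side note on the statistic being discussed: the relative entropy commonly reported for mixture models (as I understand it, the Ramaswamy-style formula) is 1 minus the average per-case classification uncertainty, scaled by ln K. A small Python sketch with made-up posteriors shows how it behaves as a pure description of classification sharpness:

```python
import math

def relative_entropy(post):
    """Relative entropy for a mixture classification: 1 means a
    perfectly clear classification, 0 means uniform posteriors.
    Computed as 1 - (sum over cases and classes of -p*ln p) / (n ln K)."""
    n = len(post)
    k = len(post[0])
    total = 0.0
    for row in post:
        total += -sum(p * math.log(p) for p in row if p > 0)
    return 1.0 - total / (n * math.log(k))

sharp = [[0.99, 0.01], [0.01, 0.99]]   # near-certain assignments
fuzzy = [[0.55, 0.45], [0.45, 0.55]]   # near-uniform assignments
print(relative_entropy(sharp))  # close to 1
print(relative_entropy(fuzzy))  # close to 0
```

Note that nothing in the computation involves model fit; a badly misspecified model can still produce sharp (high-entropy) posteriors, which is consistent with the answer above.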


