Clustering and stratification
Mplus Discussion > Multilevel Data/Complex Sample >
 ann feng posted on Thursday, November 11, 2010 - 1:23 pm
I am attempting a multilevel analysis on survey data collected annually over 3 years. Data were collected with geographic stratification (no cluster sampling) within each state, and county of residence is my chosen independent cluster (data on a continental scale). My main interest is the relationship between the outcome and the between-group predictor. Since the 3-year combined sample is used, I could fit a hierarchical model using year of interview as a within-group predictor and specify the stratification to reflect the sampling feature. But there are more than 4000 strata in the data, so alternatively I fitted the model ignoring the sampling strata and instead specified stratification=year of interview to allow some time dependence. I am a little perplexed by the fact that I didn't account for the sampling strata and instead "post"-stratified the data by year. The between-group covariate is time-invariant, by the way, and although computations seem not to be a problem for my model of choice, I wonder if I correctly adjusted for the year index here. I guess I could always allow year as a within-group predictor having a random slope across clusters (i.e. county), but when this specification was attempted computation failed (maybe my inexperience with syntax was to blame). So I am leaning towards the model using year as stratification but am very uneasy about the implications... Many thanks for any input from this resourceful community!
 Tihomir Asparouhov posted on Friday, November 12, 2010 - 10:00 am
I would suggest that you use year (two dummy predictors) as a within-group predictor with a fixed slope (instead of a random slope). Keep in mind that the strata specification only affects the SEs, not the model estimates. A strata specification that doesn't reflect the true sampling method would not be correct.
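As a minimal sketch of the "two dummy predictors" coding for a three-year survey, the following Python snippet builds one 0/1 indicator per non-reference year before the data are passed to Mplus. The year values, the choice of reference year, and the `yr...` names are hypothetical and only serve as an illustration.

```python
def year_dummies(years, reference):
    """Return one 0/1 indicator list per non-reference year."""
    levels = sorted(set(years) - {reference})
    return {f"yr{lvl}": [1 if y == lvl else 0 for y in years] for lvl in levels}


# Hypothetical interview years for five respondents; 2007 is the reference.
years = [2007, 2008, 2009, 2008, 2007]
dummies = year_dummies(years, reference=2007)
print(dummies)
```

With three survey years this yields exactly two indicator columns, which can then be used as within-level covariates with fixed slopes.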
 ann feng posted on Friday, November 12, 2010 - 10:14 am
Thanks a lot Tihomir for your advice! I do have a follow-up question if you don't mind. Both type=complex and type=twolevel allow/require the cluster option, but do they have different meanings? From reading your Mplus notes 'Stratification in Multivariate Modeling' I surmise that units within a cluster (if cluster sampling is used) should probably be as heterogeneous as possible in order to improve estimation precision, but from a random effects modeling standpoint (allowing for intracluster correlation) the units are supposed to be similar wrt the variable of interest. So I am a bit confused as to whether the clustering option is treated differently in the complex and twolevel methods. Again, thanks for sharing your expertise!
 Tihomir Asparouhov posted on Saturday, November 13, 2010 - 1:32 pm
Take a look at
section "Factor Analysis with cluster Sampling".

The difference between complex and two-level is about what the model describes. Complex modeling gives a model for the entire population while taking into account the sampling scheme that causes non-independence between the observations. Twolevel modeling on the other hand describes the exact clustering effect and yields models that are specific to each cluster. In many cases both models can be used but the interpretation is different.
 ann feng posted on Sunday, November 14, 2010 - 2:04 pm
Thank you Tihomir for explaining the distinction between the two approaches and for the reference. I will take a thorough read through. One other concern I have is that the between-group predictor in my model (at the county level) might not be independent, since spatial clustering wrt this variable might be present, yet the assumption is that the random effect (a random slope predicted by this 2nd-level covariate) is normal and independent. Should I formally test the spatial independence assumption on the random effects? But then adjusting for the spatial correlation may pose a problem, since there are more than 2000 clusters or counties and using indicator variables to specify a cluster boundary seems unwieldy to say the least. So, to pose a more general question, how do I account for autocorrelation in random effects in Mplus? Also, is it correct to say that autocorrelation in my outcome variable (a categorical latent class variable), if present at the 2nd or between-group level, is already accounted for by the random effects? I hope I didn't sidetrack too much from multilevel modeling to discuss the minute details of my model. I am really grateful for any pointers and advice you may have to help me wrap my head around this. Thanks!
 Tihomir Asparouhov posted on Monday, November 15, 2010 - 9:05 am
You can use type=complex twolevel, where you can introduce an additional level of clustering (such as states or regions) that can account for additional dependence between the random effects of neighboring counties. See
This would be a bit ad hoc though if the sampling of the counties is not based on such clustering.

Alternatively you can use the Bayes estimator in Mplus to estimate the two-level model, then generate plausible values, see
The plausible values can then be used in a spatial model to test spatial dependence. Mplus currently does not estimate spatial models however.
 ann feng posted on Monday, November 15, 2010 - 10:56 am
Thank you so much Tihomir for your guidance and references!! I believe R is capable of some spatial independence testing, like computing Moran's I, so I will probably resort to that once I am able to obtain the Bayesian results. You have been tremendously helpful. Thanks again for your insight and help!
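To make concrete what Moran's I measures, here is a minimal Python sketch of the statistic itself (no significance test). It assumes one value per county, such as a plausible value for a random effect, and a symmetric 0/1 neighbor matrix; both inputs below are made-up toy data, not anything from this analysis.

```python
def morans_i(values, weights):
    """Moran's I for `values` given a square spatial weight matrix `weights`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # total of all weights
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


# Toy example: four counties in a row, each neighboring the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
trend = morans_i([1.0, 2.0, 3.0, 4.0], chain)          # similar neighbors
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # dissimilar neighbors
print(trend, alternating)
```

Positive values indicate that neighboring units tend to have similar values, and negative values indicate the opposite. A dedicated spatial package would normally be used for inference.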
 Malte Jansen posted on Monday, January 16, 2012 - 8:17 am

I've also got a somewhat similar question: I'm trying to analyze data from students who are nested within classes that are nested within schools. My research questions don't really concern school- or class-level variables, but I want to correct the standard errors to account for clustering. Could I use stratification=school with cluster=class and type=complex to achieve this?

Many Thanks in advance!
 Bengt O. Muthen posted on Monday, January 16, 2012 - 11:11 am
You can use

Cluster = school class;


Type = Complex Twolevel;

where Twolevel refers to modeling that takes into account clustering within classes and Complex refers to correcting SE's/chi-2 for clustering within schools.
 Student 09 posted on Tuesday, January 17, 2012 - 12:25 am
If I understand this correctly, then model parameters such as random intercepts and slopes for a model using

Cluster = school class;


Type = Complex Twolevel;

refer to differences of students within and between classes, not to differences of students within and between schools.

But suppose a researcher is merely interested in controlling for clustering of students within classes, while her major interest is in the model parameters referring to differences of students within and between schools (and not to students within and between classes). Is there a syntax to adequately deal with such a research question?

Many thanks for your reply
 Anna-Maria Fall posted on Tuesday, January 17, 2012 - 6:38 pm
We fit a multi-group 2 PL logistic model (ex. 5.5) to estimate treatment effects (we are using a 46-item measure). The design is blocked on teachers, and random assignment is of classes within teachers, with students nested in classes. We have modeled this as STRATIFICATION=TEACHER and CLUSTER=CLASSES. With the variance fixed at 1 across the two conditions and the C group latent mean fixed at 0, the freely estimated T group mean is .35. We interpret this as a 35% of a standard deviation difference, or an effect of .35. A nested models comparison indicated no significant differences when the nestedness of the data was modeled. Without the nested structure, the parameter differed significantly and considerably from 0. The questions are: 1) does it make sense to adjust SEs in the two-group IRT model where treatment effects are of interest; 2) if so, and if the assumption about conceptualizing the latent variable difference as an effect size is correct, an effect of .35 in a properly structured multilevel sample of 450 should be statistically significant…thoughts on why it isn’t? Also, we get a lot of these: WARNING: THE BIVARIATE TABLE OF X AND Y HAS AN EMPTY CELL. I understand that this indicates a correlation of 1.0. However, the binomial correlations are not 1.0 or even close in most cases. Ideas about why we are getting these messages and what can be done (short of dropping the items)?
 Linda K. Muthen posted on Wednesday, January 18, 2012 - 9:49 am

I would aggregate the data to the classroom level and have classroom as my unit of analysis. Then your cluster variable in a multilevel analysis can be schools.
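As a minimal sketch of the aggregation step suggested above, the following Python snippet collapses student records into one mean score per classroom, so that classroom becomes the unit of analysis and school can serve as the cluster variable. The column layout and the values are hypothetical.

```python
from collections import defaultdict


def aggregate_by_class(rows):
    """rows: (school, classroom, score) tuples -> mean score per classroom."""
    sums = defaultdict(lambda: [0.0, 0])  # running [total, count] per classroom
    for school, classroom, score in rows:
        acc = sums[(school, classroom)]
        acc[0] += score
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical student-level records.
students = [
    ("s1", "a", 10.0), ("s1", "a", 14.0),
    ("s1", "b", 8.0),
    ("s2", "c", 12.0), ("s2", "c", 16.0),
]
class_means = aggregate_by_class(students)
print(class_means)
```

Each key keeps the school identifier alongside the classroom, so the aggregated file still carries the school-level cluster variable.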
 Linda K. Muthen posted on Wednesday, January 18, 2012 - 4:01 pm

1. Yes. When clustering is ignored, standard errors are too small.

2. It may be too few classrooms.

A zero cell implies a correlation of one. The fact that the correlation is not estimated at one is the problem.
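To see which item pairs trigger the empty-cell warning, one option is to tabulate each pair of binary items before running the model. The sketch below is only an illustration in Python with made-up responses; the function names are hypothetical and it does not reproduce anything Mplus computes internally.

```python
from collections import Counter


def bivariate_table(x, y):
    """2x2 table of counts for two binary (0/1) items."""
    counts = Counter(zip(x, y))
    return {(a, b): counts.get((a, b), 0) for a in (0, 1) for b in (0, 1)}


def empty_cells(table):
    """List the (x, y) response combinations that never occur."""
    return [cell for cell, n in table.items() if n == 0]


# Hypothetical responses of six students to two items.
item_x = [0, 0, 1, 1, 1, 0]
item_y = [0, 0, 1, 1, 0, 0]
table = bivariate_table(item_x, item_y)
print(table, empty_cells(table))
```

Here nobody answers item X incorrectly and item Y correctly, so the cell (0, 1) is empty even though the two items do not agree perfectly.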
 Philip Parker posted on Thursday, November 15, 2012 - 7:48 pm
I have a model as follows:
Level 1 (students): Variables = educational aspirations (expect) and socioeconomic status (ses)
Level 2 (schools): not interested in this level just controlling for it
Level 3 (countries): Here I want the effect of ses on expect to be random at level three.

Essentially I want to model:
expect~B0+SES where the intercept is random at Level 2 and the intercept and slope are random at Level 3. In R I would do:

glmer(expect~ses+(1|SCHOOL)+(1+ses|COUNTRY), family=binomial (link=probit), data=AchData2, weights=W_FSTUWT)

I cannot quite see how to go about this in Mplus.
 Philip Parker posted on Thursday, November 15, 2012 - 8:46 pm
So far I have tried


categorical = Expect;
within = Zses;
!weight = W_FSTUWT; !weights not allowed in bayes?


S1| Expect on ses;

Expect; !I still get slope variances with this
!s1@0;!maybe try this? seem dubious to me.
!with or without this constraint results are similar to glmer in R

[S1]; S1; Expect;
 Bengt O. Muthen posted on Thursday, November 15, 2012 - 8:52 pm
Your idea of s1@0 on the SCHOOLID level seems right to me.

And, you are right that weights have not been developed in the Bayesian literature as far as I know. Perhaps use weights-related variables as covariates.
 Philip Parker posted on Thursday, November 15, 2012 - 9:25 pm
Many thanks Bengt. Two more questions if that is ok. I plan on extending this model by adding a mediator. So:

1. My understanding is that the default link function in Mplus is probit?

2. I have tried to label my parameters, i.e. [S1] (p1); [s2](p2); etc. However, when I refer to these labels in model constraint no output is produced. Is there a reason for this?
 Bengt O. Muthen posted on Friday, November 16, 2012 - 8:01 am
1. For Bayes and WLSMV it is. With ML you have probit and logit.

2. Please send output to Support.
 Allison Tracy posted on Sunday, March 02, 2014 - 12:38 pm
I am working with a dataset with random sampling of counties and then stratified random sampling of households within these counties. Does the STRATIFICATION= command assume that the stratification is done at the PSU level?
 Tihomir Asparouhov posted on Monday, March 03, 2014 - 1:51 pm
 Tihomir Asparouhov posted on Monday, March 03, 2014 - 1:56 pm
So technically speaking the assumptions of the estimator are not the same as the actual sampling mechanism. In some cases it is ok to ignore this mismatch ... but the conservative thing to do is probably to ignore the stratification and accept the bigger SE without it. This is not specific to Mplus - most software uses the same method.
 Allison Tracy posted on Tuesday, March 04, 2014 - 7:55 am
Thanks, Tihomir. Good suggestion. I'm glad it is not a fatal flaw in the sampling design. Such are the vicissitudes of secondary data analysis!
 Danica Cruz posted on Sunday, August 03, 2014 - 6:45 pm
I'm working with a nationally representative data set that instructs users as follows: "A 1-stage sampling plan should be set up using STATE and SAMPLE variables as strata, ID as the cluster and WGHT as the weight." However, if I enter both STATE and SAMPLE in the STRATIFICATION option, I get an error:

*** ERROR in VARIABLE command
Unrecognized variable in STRATIFICATION option: STATE SAMPLE

If I remove one variable from the STRATIFICATION option, the program runs with no error.

I'm new to using complex survey data in Mplus, so please tell me if there is another way to use both strata variables in Mplus. Thank you.
 Linda K. Muthen posted on Monday, August 04, 2014 - 11:57 am
You can combine state and sample into one variable and use the new variable with the STRATIFICATION option.
 Danica Cruz posted on Tuesday, August 05, 2014 - 1:42 pm
Do you mean create a new variable to represent each possible combination of the two original stratification variables?

For example:
sample  state  new_variable
1       48     1
2       48     2
1       38     3
2       38     4
...
1       51     101
2       51     102    (50 states + DC) * 2

Thank you.
 Linda K. Muthen posted on Wednesday, August 06, 2014 - 9:35 am
No, create a new variable like


new = (10*sample) + state;
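A quick way to check that a combined stratum code is unique is sketched below in Python. It assumes `sample` takes the values 1 and 2 and that `state` is coded 1 to 51, as in the example above; both ranges are assumptions for illustration. Whatever multiplier is used has to exceed the largest state code, otherwise two different (sample, state) pairs can map to the same value. For example, with a multiplier of 10, (1, 48) and (2, 38) both give 58.

```python
def combine_strata(sample, state, multiplier):
    """Combine two stratification variables into a single code."""
    return multiplier * sample + state


# All hypothetical (sample, state) combinations: 2 samples x 51 states.
pairs = [(s, st) for s in (1, 2) for st in range(1, 52)]

codes_10 = {combine_strata(s, st, 10) for s, st in pairs}
codes_100 = {combine_strata(s, st, 100) for s, st in pairs}

print(len(pairs), len(codes_10), len(codes_100))
```

The combined variable identifies every stratum only when the number of distinct codes equals the number of pairs, which holds here for the multiplier of 100 but not for 10.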
 Yoosoo posted on Tuesday, October 20, 2015 - 1:05 pm
Dear Drs. Muthen,

I have a question about analysing a complex survey with a binary outcome, using a three-level model.

I am analysing surveys obtained from two-stage sampling.
The sampling method involved:
1. stratifying population by districts
2. selecting PSUs from each district
3. selecting households from each PSU.

A three-level model is used to combine 30 national surveys in a single model; the three levels are household, PSU, and nation. My main outcome is a binary variable at level 1, and the main explanatory variable is a continuous variable at level 2.

I read that the Bayesian estimator used for three-level logistic analysis does not allow incorporating weights and stratification in Mplus. Could you suggest any alternatives I can use to incorporate stratification and weights in Mplus?
 Tihomir Asparouhov posted on Tuesday, October 20, 2015 - 2:55 pm
I would suggest the following approach. Convert the 3-level model to a 2-level model using the long to wide approach and transforming "household" to a multivariate observation. This will require entering missing data for the situation when household has varying size. Search the manual and the web site for "long to wide" and "wide to long" if you need help doing this. You can then use the ML estimator.
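The long-to-wide step can be sketched as follows in Python. This is only an illustration under the assumption that the level-1 units (households) are spread into columns within each PSU, with PSUs that have fewer households padded with a missing-value placeholder. The data, the function name, and the `None` placeholder are hypothetical; in practice a numeric missing-data flag would be written to the file.

```python
def long_to_wide(rows, missing=None):
    """rows: (nation, psu, y), one per household -> one padded row per PSU."""
    grouped = {}
    for nation, psu, y in rows:
        grouped.setdefault((nation, psu), []).append(y)
    width = max(len(ys) for ys in grouped.values())  # largest PSU size
    return [
        (nation, psu, *ys, *([missing] * (width - len(ys))))
        for (nation, psu), ys in grouped.items()
    ]


# Hypothetical long-format data: one row per household.
households = [
    ("A", 1, 1), ("A", 1, 0), ("A", 1, 1),
    ("A", 2, 0),
    ("B", 3, 1), ("B", 3, 1),
]
wide = long_to_wide(households)
for row in wide:
    print(row)
```

After the conversion each PSU is a single multivariate observation, so nation can play the role of the cluster variable in a two-level analysis.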

Other options could be using a 30-group two-level model or treating the weights and stratification as covariates. The weights and stratification variables are incompatible with the Bayes estimator on a theoretical level.
 Ann-Renee Blais posted on Thursday, September 29, 2016 - 8:48 am
Good morning,

I'm working with a dataset originating from two populations (Regular and Reserve Forces Army members). Within each population, we used a stratified (4 strata) random sampling technique. Hence I have 2 populations, 4 strata within each population, and sampling weights. I'm hoping to run a latent profile analysis on the overall data set (with strat is strat, weight is weight, etc.) and use Reg/Res as a predictor of the latent profiles. How should I go about this in order for the weights to remain accurate?

Thank you!!
 Tihomir Asparouhov posted on Friday, September 30, 2016 - 10:10 am
The weights should be proportional to "1/probability of selection". Without having a complete description of the weights and the sampling process one cannot verify that this is the case. If so, I would recommend using the weights in the data set without modifying them.
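As a small illustration of what "proportional to 1/probability of selection" means, the Python sketch below computes inverse-probability weights and rescales them to sum to the sample size, which leaves them proportional. The selection probabilities are made up, and the rescaling is a convenience assumed here, not something required by the advice above.

```python
def inverse_probability_weights(selection_probs):
    """Weights proportional to 1/p, rescaled to sum to the sample size."""
    raw = [1.0 / p for p in selection_probs]
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


# Hypothetical respondents: two from a stratum sampled at 50%,
# two from a stratum sampled at 25%.
probs = [0.5, 0.5, 0.25, 0.25]
weights = inverse_probability_weights(probs)
print(weights)
```

A respondent sampled at half the rate receives twice the weight, regardless of how the weights are scaled.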
 Ann-Renee Blais posted on Friday, September 30, 2016 - 3:17 pm
Thank you Tihomir!
 Stefan Kulakow posted on Friday, March 30, 2018 - 3:53 am
Good morning,

I have a question about clustering, too. In my study, I clustered classes in schools. Some of those schools have mixed age classrooms. Is it valid to cluster the single age-groups within one classroom addressing the developmental differences? Or should I cluster only the class itself without regard to the age-differences?

Thank you in advance.
 Bengt O. Muthen posted on Friday, March 30, 2018 - 1:41 pm
Good analysis strategy question for SEMNET.