SEM and sampling weights
 Anonymous posted on Wednesday, March 24, 2004 - 11:19 am

I am estimating a SEM using a 50% sample of the total population. The model fits and all path coefficients are statistically significant at p<=0.05. The model fits in both 50% subpopulations, and it also fits the full sample. When I introduce survey weights, things get difficult: some path coefficients become non-significant in one of the two 50% subpopulations, while nothing changes in the other. The model still fits the full data. What does that mean for my model? Which results should I pay attention to: the results for the full model or the results from the two 50% subpopulations?

 Linda K. Muthen posted on Wednesday, March 24, 2004 - 3:20 pm
It sounds strange that the weighted samples behave so differently. Are these random samples? If the data were generated with unequal probabilities of selection, then weights should be used.
 Anonymous posted on Thursday, March 25, 2004 - 6:42 am
Thank you very much for your response. I should have paid more attention to my programming. I failed to include an error covariance in one of my groups. Including this error covariance makes a world of difference.

Here is another question. There appears to be an open discussion as to whether or not to include error covariances to increase model fit. I included error covariances for variables that were highly correlated within each latent factor. The highly correlated variables had similar word stems or represented a series of questions on the same specific topic. I only included the error covariances after I had explored alternative ways of creating the latent factor(s). In other words, I explored whether or not the latent factor(s) that contained the highly correlated variables was/were robust. Does this appear to be a valid approach, or am I just fitting my model to the data?
 Anonymous posted on Tuesday, March 29, 2005 - 7:47 am
I'd like to weight both levels of analysis: individuals and neighbourhoods (racially diverse neighbourhoods have been oversampled). How do I specify that? It seems that the WEIGHT option can be used only once.
 Tihomir Asparouhov posted on Monday, April 04, 2005 - 8:42 am
Enter the product of the individual and neighborhood weights in the WEIGHT option. Individual weights should be scaled to add up to the neighborhood sample size. See Mplus Web Note #8 for more details.
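
A minimal sketch of the scaling described above, done outside Mplus before the data are read in. It assumes each record carries a neighbourhood ID, a raw individual weight, and a neighbourhood weight; the function name `combined_weights` and the example values are hypothetical and not taken from any Mplus documentation.

```python
from collections import defaultdict


def combined_weights(rows):
    """Return one weight per row: scaled individual weight x cluster weight.

    rows is a list of (cluster_id, individual_weight, cluster_weight).
    Individual weights are rescaled so that they sum to the number of
    sampled individuals in each cluster, as suggested in the post above.
    """
    totals = defaultdict(float)  # sum of raw individual weights per cluster
    counts = defaultdict(int)    # number of sampled individuals per cluster
    for cluster, w_ind, _ in rows:
        totals[cluster] += w_ind
        counts[cluster] += 1

    combined = []
    for cluster, w_ind, w_clu in rows:
        scaled = w_ind * counts[cluster] / totals[cluster]
        combined.append(scaled * w_clu)
    return combined


# Hypothetical data: two neighbourhoods with two respondents each.
rows = [("a", 1.0, 2.0), ("a", 3.0, 2.0), ("b", 5.0, 0.5), ("b", 5.0, 0.5)]
weights = combined_weights(rows)
```

The resulting column is what would be named in the WEIGHT option. Whether this particular scaling suits a given design is something to check against Web Note #8.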
 Natalia Letki posted on Tuesday, April 05, 2005 - 12:38 am
Tihomir, thanks a lot for this advice, I'll look into it. I have another question though: I'd like to construct a cross-level interaction term, as I'd like to test whether neighbourhood's low status amplifies the negative effect of individual-level deprivation on interpersonal trust. I'll be very grateful for your advice on how to do this (neighbourhood status is a latent level-2 variable).
 Tihomir Asparouhov posted on Tuesday, April 05, 2005 - 8:00 am
The random slope model does exactly that.

s | interpersonal trust ON individual-level deprivation;

s ON neighbourhood status;
 Anonymous posted on Tuesday, April 26, 2005 - 9:13 am
I just read over the tech 9 report, and we're working with data that have sampling weights and stratification at 2 levels. There are roughly 15,000 primary strata (PSU) and 2 replicates (rep) within each stratum.

I'm having a problem fitting a CFA with ordinal measured variables.

I used the following setup to estimate a CFA with 6 factors. I also used the grouping function to identify the replicates, and the cluster and type=complex options. The program works for continuous observed variables, but does not produce estimates when I add the categorical option. Given that these data are Likert-type, perhaps the ordinal option is better.

Title: all groups 4/6/05;
File is c:\obs_short.dat;
variable: names are rep psu var1 var2 var3
var4 var5 var6 wt;
usevar= psu rep wt var1 var2 var3 var4 var5 var6;
missing= var1 var2 var3 var4 var5 var6 (-999);
categorical are var1 var2 var3 var4 var5 var6;
grouping is rep (1=repone,2=reptwo);
cluster is psu;
weight is wt;
analysis: type=complex;
model: latent by var1*1 var2 var3 var4 var5 var6;
output: sampstat tech1 standardized;

The error message is:

Cluster ID cannot appear in more than one group.
Problem with cluster ID: 10101

I looked at the data and here are the first 6 or so lines:

01 10101 5 4 2 3 3 4 1 1855.822633
01 10101 2 2 2 2 2 2 1 1655.3630801
01 10101 4 5 5 5 5 5 1 6519.8840498
01 10101 4 4 3 2 3 3 1 3407.0678367
02 10101 4 4 4 4 4 4 1 1136.8711335

I'm confused as to why the program won't run. I removed the categorical option and the program produces results! I also added estimator=MLR along with the categorical option, and the error was that I needed to specify type=mixture.

 Linda K. Muthen posted on Tuesday, April 26, 2005 - 10:49 am
You would need to send your input/output, data, and license number to for us to see what is happening.
 Linda K. Muthen posted on Thursday, April 28, 2005 - 11:18 am
Following is an explanation of what you are experiencing. In multiple group analysis, the groups should be composed of independent observations. When observations from the same cluster are in more than one group, this is violated. For continuous outcomes using TYPE=COMPLEX;, Mplus takes this lack of independence into account. For categorical outcomes, it does not and therefore stops the analysis. You can trick the program so that it does not complain as described below. In our experience, this violation does not result in large differences in the results.

The trick is to define new cluster values that are unique for each group with the following:

psu = rep*100000 + psu;
! 100000 should be a number with more digits than the cluster value
! in this case, 100000 is used because the original cluster ID has 5 digits
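
The same recoding can also be done while preparing the data file. The following is a small sketch, assuming integer group and cluster codes; the helper name `unique_cluster_id` is hypothetical, and the example values mirror the data lines quoted earlier in the thread.

```python
def unique_cluster_id(group, cluster, multiplier=100000):
    """Combine a group code and a cluster ID into one ID unique per group.

    The multiplier must be larger than every cluster ID, otherwise two
    different (group, cluster) pairs could map to the same new ID.
    """
    if cluster < 0 or cluster >= multiplier:
        raise ValueError("multiplier must exceed every cluster ID")
    return group * multiplier + cluster


# (rep, psu) pairs: cluster 10101 appears in both replicate 1 and 2.
records = [(1, 10101), (1, 10101), (2, 10101)]
new_ids = [unique_cluster_id(rep, psu) for rep, psu in records]
```

After recoding, cluster 10101 in replicate 1 and cluster 10101 in replicate 2 carry different IDs, so no cluster ID appears in more than one group.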
 Tor Neilands posted on Wednesday, March 01, 2006 - 2:19 pm

A colleague has asked me an interesting question about a situation that I suspect is rather common when one analyzes weighted data in SEM: If the weighting variable is computed for the whole population of respondents to a survey, yet the analysis model considers only a subset of those respondents (e.g., sexually active female adolescents rather than sexually inactive and active males and females), what is the recommended course of action? Would one include the original weight variable for those cases in the Mplus analysis as the WEIGHT variable? Or must the weight variable be somehow recomputed?
 Linda K. Muthen posted on Wednesday, March 01, 2006 - 3:04 pm
I think this is dealt with by the new SUBPOPULATION option of the VARIABLE command. Following is an excerpt from the user's guide (see page 403):

The SUBPOPULATION option is used with TYPE=COMPLEX to select observations for an analysis when a subpopulation (domain) is analyzed. When the SUBPOPULATION option is used, all observations are included in the analysis although observations not in the subpopulation are assigned weights of zero (see Korn & Graubard, 1999, pp. 207-211).
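
As a rough illustration of the zero-weight idea in the excerpt, the sketch below compares a weighted mean computed on the full file with out-of-domain weights set to zero against the same mean computed on the subset alone. The data and the name `weighted_mean` are hypothetical. The sketch covers the point estimate only; it does not compute design-based standard errors, which is presumably where keeping all observations matters.

```python
def weighted_mean(values, weights):
    """Weighted mean; observations with weight zero contribute nothing."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical data: four respondents, the first two are in the domain.
values = [2.0, 4.0, 6.0, 8.0]
weights = [1.0, 3.0, 2.0, 2.0]
in_domain = [True, True, False, False]

# Full file, with weights outside the subpopulation set to zero.
domain_weights = [w if d else 0.0 for w, d in zip(weights, in_domain)]
full_file_estimate = weighted_mean(values, domain_weights)

# Subset only, with the original weights.
subset_estimate = weighted_mean(
    [v for v, d in zip(values, in_domain) if d],
    [w for w, d in zip(weights, in_domain) if d],
)
```

Both calls give the same point estimate, which suggests the original weights can be kept and need not be recomputed for the subset.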
 Matthew Diemer posted on Wednesday, March 15, 2006 - 12:46 pm
I had previously addressed this issue (examining subpopulations) with the Useobservations command and defined my subpopulation this way. I had also excluded participants with missing values for the weight variable. [NELS data]

With the new upgrade to version 4, I am interested in using the subpopulation command to examine the same subpopulation.

weight is f3f1pnwt;
cluster is sch_id;
stratification is sstratid;

subpopulation = f2race1 ne 4
and ses1band eq 1 and f2evdost eq 0;

However, when I do so, I receive the following error message:

Weight variable has missing value at observation 19.
Data set contains unknown or missing values for GROUPING,
PATTERN, COHORT and/or CLUSTER variables.
Number of cases with unknown or missing values: 759
Data set contains cases with missing on all variables.
These cases were not included in the analysis.
Number of cases with missing on all variables: 13510

Any guidance? I have tried defining my subpopulation to those cases without missing values for the weight variable as well, with similar results.
 Linda K. Muthen posted on Wednesday, March 15, 2006 - 12:55 pm
If you are using the same input and data as you used with USEOBSERVATIONS and are just substituting SUBPOPULATION for USEOBSERVATIONS, this should not be happening. If you are using a similar input and different data, you may be reading your data incorrectly. If this does not help, send your input, data, output, and license number to
 Scott R. Colwell posted on Friday, April 07, 2006 - 8:20 am
I am running a multigroup model with TYPE=COMPLEX. I am testing for measurement invariance across group a and group b.

When I run group a only I get 40 degrees of freedom. Logically I also get the same 40 degrees of freedom for group b only (using subpopulation command).

When I test for complete measurement non-invariance using for example:

Model: f1 by x1 x2 x3 x4;

Model b: f1 by x2 x3 x4;

I get 88 degrees of freedom. I was under the impression that the low-group chi-square* and df plus the high-group chi-square* and df should equal the chi-square* and df for the complete measurement non-invariance model. The 88 degrees of freedom should be (from what I calculated) the number I get for just factor loading invariance. Is TYPE=COMPLEX changing the factor loadings in the complete measurement invariance model?

*assuming converting the MLR chi-square to the ML chi-square via the scale correction factor.
 Linda K. Muthen posted on Friday, April 07, 2006 - 9:41 am
I would need to see the three outputs to comment. TYPE=COMPLEX; doesn't change anything. Please send the outputs, data if you don't have TECH1 in the outputs, and your license number to
 Matthew Diemer posted on Thursday, June 29, 2006 - 1:20 pm
I'm using the strata, psu and weight options, along with type=complex, to analyze a subpopulation with NELS data. I'm having a problem with preliminary Mplus analyses where clusters are nested within more than one stratum (I think this is what this error message means). I do have both continuous and categorical indicators/variables in the model.

This is the Mplus error message I am receiving:

Each stratum must contain unique cluster IDs.
Clusters are not nested within strata.

My understanding is that this may have happened because the PSUs (clusters) in the current data file are not unique; i.e., the same identifier for a cluster (school), such as 29, could be nested in two different strata. I think that Mplus assumes that each cluster can only be contained within one stratum, while NELS data might use the same cluster identifier across different strata.

Following the earlier post and helpful response about this issue earlier in this thread, I've tried to address this issue by creating a new unique id for cluster (school) as follows (following is SPSS code):


[Multiplied by 1000 because the largest cluster value is 999]

1. Is this the correct procedure to create unique cluster identifiers?
2. What unintended consequences might this procedure have in the estimation of standard errors?

Thank you for any suggestions.
 Bengt O. Muthen posted on Sunday, July 02, 2006 - 5:13 am
This is the correct procedure. As long as the clusters are not the same in any strata, the consequences are none.
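
Before and after recoding, it can help to check whether any cluster ID still appears in more than one stratum. The sketch below is one way to do that; the function names are hypothetical, and the formula stratum * 1000 + cluster is an assumption based on the bracketed note above (the original SPSS code is not shown in the thread).

```python
from collections import defaultdict


def clusters_in_multiple_strata(pairs):
    """Return the sorted cluster IDs that occur in more than one stratum."""
    strata_by_cluster = defaultdict(set)
    for stratum, cluster in pairs:
        strata_by_cluster[cluster].add(stratum)
    return sorted(c for c, s in strata_by_cluster.items() if len(s) > 1)


def make_unique(pairs, multiplier=1000):
    """New cluster ID = stratum * multiplier + cluster (assumed formula)."""
    return [stratum * multiplier + cluster for stratum, cluster in pairs]


# Hypothetical (stratum, school) pairs: school 29 appears in two strata.
pairs = [(1, 29), (1, 30), (2, 29), (2, 31)]
shared_before = clusters_in_multiple_strata(pairs)

new_ids = make_unique(pairs)
new_pairs = [(stratum, new_id) for (stratum, _), new_id in zip(pairs, new_ids)]
shared_after = clusters_in_multiple_strata(new_pairs)
```

An empty `shared_after` list indicates that every recoded cluster ID belongs to exactly one stratum.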
 Orla Mc Bride posted on Monday, July 17, 2006 - 6:16 am

I have a similar problem to Matthew Diemer (March 15 2006). I am trying to compute a CFA model using the subpopulation command. I am using data from NESARC and I only want to use current or former drinkers (u4). When I use the following options:

missing are all (-9);
weight is weight;
stratification is stratum;
cluster is psu;
subpopulation is u4 == 1 or 2;
type = complex missing;

I get the following error message:

Weight variable has missing value at observation 259.
Data set contains unknown or missing values for GROUPING,PATTERN, COHORT and/or CLUSTER variables.
Number of cases with unknown or missing values: 8238

Could you please explain what this error message means and how to amend it.

Thanking you in advance.
 Linda K. Muthen posted on Monday, July 17, 2006 - 8:08 am
It sounds like you may be reading your data incorrectly or declaring your missing value flags incorrectly. Please send your input, data, output, and license number to
 David Bard posted on Friday, December 08, 2006 - 7:05 am
Regarding the April 26, 2005 Anonymous post & Linda's response on the 28th:
Has the procedure for multiple-group SEM with categorical data and complex sampling changed with version 4.0? I noticed that one of the new 4.0 features was improved MG complex sampling analysis for the WLS estimators. Does this mean that dependency across multiple groups with categorical DVs can now be handled using the cluster option, TYPE=COMPLEX, and one of the WLS estimators?
 Linda K. Muthen posted on Saturday, December 09, 2006 - 9:45 am
This was implemented in Version 4.1 we think. It is certainly in the most recent version.
 AD posted on Thursday, January 04, 2007 - 1:17 pm
Can we use sampling weights in AMOS or only Mplus can do this?

Thank you very much!
 Linda K. Muthen posted on Sunday, January 07, 2007 - 8:53 am
You would have to check with AMOS support to ask this question.
 Chenshu Zhang posted on Monday, October 11, 2010 - 1:20 pm
Is there an option to test informativeness of sample weights using the MPTI test?
 Bengt O. Muthen posted on Monday, October 11, 2010 - 4:31 pm
The Modified Pfeffermann Test of Informativeness is not yet available.
 Cecily Na posted on Tuesday, February 15, 2011 - 9:16 am
Dear professors,
Does Mplus incorporate sampling weights in SEM? If so, do I just include the weights within the data file (variable wt) and use the syntax "weight is wt", and Mplus will automatically incorporate them in the SEM?
Thanks a lot!
 Linda K. Muthen posted on Tuesday, February 15, 2011 - 9:21 am
 Bengt O. Muthen posted on Tuesday, February 15, 2011 - 9:48 am
Also see papers at

For example Asparouhov (2005).
 Agnes Stancel-Piatak posted on Monday, February 06, 2012 - 4:44 am
Hi, I'm using the PIRLS Data for an estimation of a two level SEM. L1=students, L2=classes. The data set has two population weights:
1. totwgt inflates the sample to the population and as a result underestimates the standard error
2. houwgt is created for population analysis as well, but it doesn't underestimate the standard error
and a school weight, which controls for the stratification factors due to the sampling procedure.
My question is: How can I implement the weights using the "complex twolevel random" models?
1. Should I use two cluster variables: IDCLASS and IDSCHOOL (because the students are clustered in classes, which are clustered in schools)?
2. Should I then use weights on L1 and L2? (I'm modeling L2 effects on L1.)

Thanks a lot in advance
 Bengt O. Muthen posted on Tuesday, February 07, 2012 - 12:28 pm
You say that students are clustered in classes that are clustered in schools, and it sounds like you want to do 2-level modeling of students within classes, while taking into account clustering in schools. If I am understanding that correctly, you then say

cluster = school class;

where school goes with the Complex part of Type=Complex Twolevel and class goes with the Twolevel part of Type= Complex Twolevel.

Mplus can use weights for both level 1 (students) and level 2 (classes): Weight and Bweight. There is no weight option to go with the Complex part of Type=Complex Twolevel. You have to decide from the PIRLS data description which choice of weights is the most suitable.
 Jinni Su posted on Tuesday, August 21, 2012 - 11:48 am
Hi Dr.Muthen,

I am running a multilevel regression analysis with students clustering in 35 schools (all schools in a county). During data collection, census strategy was used in some schools whereas in other schools random sampling strategy was used. There are individual weights available in the data set but in some schools the individual weights are all 1 whereas in other schools the weights vary. In multilevel analysis, is it sufficient/correct to just use weight=weight to take into account the sample weights? or do I need to do something more such as WTSCALE?

Thank you very much!

 Tihomir Asparouhov posted on Wednesday, August 22, 2012 - 10:03 am

Weights of 1 should not be a problem at all and you should not need to use WTSCALE. Combining random sampling and unequal probability sampling is not an issue you need to worry about.

 Tihomir Asparouhov posted on Wednesday, August 22, 2012 - 10:05 am
So yes, just use the weight=weight command.
 Rachel V posted on Tuesday, April 09, 2013 - 2:57 pm

I am messaging because I am doing a path analysis over 5 time points. I have sampling weights (from a publicly available data set) to be used, however, these weights vary over time, and so when I reshaped my data to be wide (to model the same variable at 5 different time points within students), this weight also had to be reshaped. As such, I do not have a single weight to be applied, but a weight that corresponds to the specific time period in question (e.g. wgt1-wgt5). The WEIGHT= samplwgt command seems to only work for a stable weight. Is there an alternative I should be considering?

Thank you very much!

Best, Rachel
 Linda K. Muthen posted on Tuesday, April 09, 2013 - 5:14 pm
If you have multiple weights, you must use long format.
 Rachel V posted on Tuesday, April 09, 2013 - 6:09 pm
Thank you. This is helpful. I see the description of the long option on page 522. I originally had my data in long format and reshaped it wide in Stata before importing it into Mplus for the sake of path analysis over time. Is there a way to do the path analysis in non-wide format? Or is there a way to reshape the data wide again after specifying the weight? for example, my models currently have the following kinds of statements:

math4 ON skill3;
skill4 ON math3;

To represent cross-predictors (the numbers at the end of the variable specifying the time of measurement). I don't know how the analysis would be written in long format.
 Linda K. Muthen posted on Wednesday, April 10, 2013 - 1:27 pm
In the long format, you cannot do the model you describe if the 4 and 3 refer to repeated measures. In the long format you have access to only math and skill.
 Rachel V posted on Wednesday, April 10, 2013 - 4:30 pm
Thanks Linda. Yes, I understand long and wide. I guess what I am trying to figure out is how I can apply a time-varying sample weight to a path model. Is that possible? It seems like perhaps not given what you are telling me.
 Linda K. Muthen posted on Wednesday, April 10, 2013 - 4:40 pm
It seems impossible to me also.
 Dan Feaster posted on Wednesday, April 10, 2013 - 5:46 pm
If you only need one-time-period lags, there is no reason you could not use the long format approach. You would need to include lagged variables at each time point. You have 5 time points, so you would have 4 long records: times 5 & 4, times 4 & 3, times 3 & 2, and times 2 & 1. I would call this a stacked autoregressive/cross-lagged model. I would recommend that, in addition to math4 on skill3, you also include autoregressive terms, i.e., math4 on math3. You would use the weighting variable associated with the dependent measure at each time (i.e., 5, 4, 3, and 2). The time 1 weight should not be needed, since you would only condition on time 1.
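
A sketch of how such stacked records might be built from a wide file before importing into Mplus. It assumes wide columns named math1-math5, skill1-skill5, and wgt1-wgt5 as in the earlier posts; the function name `stack_lagged` and the output column names are hypothetical. It only restructures the data and does not address dependence among the records that come from the same student.

```python
def stack_lagged(wide_rows, n_times=5):
    """Turn one wide record per student into n_times - 1 long records.

    Each long record pairs the outcomes at time t with the same variables
    at time t - 1 and carries the weight of the dependent time point t.
    """
    long_rows = []
    for row in wide_rows:
        for t in range(2, n_times + 1):
            long_rows.append({
                "id": row["id"],
                "time": t,
                "math": row[f"math{t}"],
                "skill": row[f"skill{t}"],
                "math_lag": row[f"math{t - 1}"],
                "skill_lag": row[f"skill{t - 1}"],
                "weight": row[f"wgt{t}"],  # weight of the dependent time
            })
    return long_rows


# Hypothetical student with made-up scores and weights at 5 time points.
student = {"id": 1}
for t in range(1, 6):
    student[f"math{t}"] = 10 + t
    student[f"skill{t}"] = 20 + t
    student[f"wgt{t}"] = 100.0 * t

long_rows = stack_lagged([student])
```

With records like these, the cross-lagged and autoregressive paths would be written once, for example math ON skill_lag math_lag, with the single weight column named in the WEIGHT option.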
 Rachel V posted on Thursday, April 11, 2013 - 6:24 pm
Thank you. This is helpful.
 Liz C posted on Saturday, March 12, 2016 - 7:35 pm

I'm conducting a subpopulation analysis from a study with complex survey data. The dataset has 2 weights: one for the Latino subsample and one for the Asian subsample. Since I am only conducting the analysis with the Latino sample, I only included the Latino weight. But I get the following error message: "Weight variable has missing value at observation 2." Is this happening because I need to include both weights? How do I fix this?

Thank you,

 Linda K. Muthen posted on Sunday, March 13, 2016 - 3:14 pm
I do not know how you would fix this. Weights are not allowed to have missing values.
 Ads posted on Monday, June 18, 2018 - 7:48 am
I am working with NESARC-III data. When using complex survey weighting for an LCA, I receive a warning when running the following code:


CLASSES = c (2);



STARTS 500 125;


*** WARNING in VARIABLE command
"Clusters with the same IDs have been found in different strata. These clusters are assumed to be different because clusters are not allowed to appear in more than one stratum."

I receive model results despite this warning, but was wondering about validity of the results given the warning. Isn't it common/expected to have multiple PSUs within each stratum? I applied the weight/strat/cluster variables as instructed in the NESARC-III "Data Notes" manual (p.7-8 of link below). Many thanks!
 Tihomir Asparouhov posted on Monday, June 18, 2018 - 11:21 am
The way clusters are coded in the data set is the reason for the message. Probably your data set looks like this

stratum cluster
1 1
1 2
1 3
2 1
2 2
2 3

Mplus would prefer if you code the clusters as

1 1
1 2
1 3
2 11
2 12
2 13

so there is no confusion about what is meant by clusters. The assumption of this estimation is that each stratum consists of clusters/PSUs and that there is no overlap where one PSU is in two or more strata. Each PSU should be entirely contained in exactly one stratum. You don't need to change anything, however, since Mplus is treating the PSU with code name 1 as two separate PSUs, one in stratum 1 and one in stratum 2.
 Ads posted on Monday, June 18, 2018 - 12:57 pm
Thank you Tihomir - my dataset does look like this. Much appreciated!