

Hello Drs. Muthen, I am attempting to conduct an EFA, subsequently estimate several first- and second-order CFA models, compare these models in terms of fit (with absolute fit statistics, nested and non-nested comparisons) across subsamples (using multiple-group comparisons), and estimate factor scores. Most of these procedures I can already do in Mplus, but I am now also dealing with a complex sample with a sample weight, a strata variable and a cluster variable. The strata variable has 42 values, and there are 84 clusters. In the data set, however, the cluster variable has only two values (i.e., 2 cluster values within 42 strata = 84 clusters). I am very new to the area of complex sample design. First, are all of the analyses I proposed above feasible with a complex sample using Mplus? Second, how do I adapt my EFA/CFA models to adjust for the sample design (note: I am only looking to adjust for the sample design, not obtain detailed information about the strata or clusters)? For the EFA, do I simply apply the sample weight (and not use the cluster and strata information)? Are there any other issues I need to keep in mind for this analysis when using a complex sample? For the CFA, how do I adjust my syntax to properly model the weight, cluster, and strata variables (especially the latter two)?
Following is an example of my syntax (excluding data and title):
VARIABLE: NAMES ARE hypins fatigue retard agit conc indeci death watedec wateinc insom anhed guilt lworth suic;
USEVARIABLES ARE hypins fatigue retard agit conc indeci death watedec wateinc insom anhed guilt lworth suic;
CATEGORICAL ARE hypins-suic;
ANALYSIS: TYPE IS GENERAL; ESTIMATOR IS WLSMV; ITERATIONS = 10000; CONVERGENCE = 0.00005;
OUTPUT: SAMPSTAT RESIDUAL STANDARDIZED TECH3 TECH4 MODINDICES;
MODEL: f1 BY fatigue retard agit conc indeci watedec wateinc insom anhed hypins;
f2 BY death guilt lworth suic;
I looked in the manual and was confused regarding how to properly model both the cluster and strata variables simultaneously with my sample. And I am assuming that I can apply the sample weight to any analysis as long as I specify the weight command. Any help would be greatly appreciated. Thank you in advance, Jim

bmuthen posted on Friday, June 03, 2005  6:19 pm



Your proposed analyses certainly seem feasible in Mplus. The complex survey data syntax simply amounts to adding the options weight = swght; strat = stratum; cluster = psu; in the Variable command and using Type = Complex in the Analysis command to get the correct SEs and chi-square. For EFA, only the weight option is available and Type = Complex is not available. But the cluster and strata information is less important in EFA since no SEs are given and you don't really need the chi-square test of fit. You can read more about the techniques in Mplus Web Note #7, forthcoming in the SEM journal. Acknowledging stratification can give important reductions in the SEs.


Thank you Bengt. As a follow-up: if I apply the sampling weight to the CFA models I spoke of, I assume that the model parameters will be adjusted according to these weights. When I create factor scores from these "weighted" models, is the sampling weight involved in any additional aspect of factor score estimation, or is it simply that the factor scores will be derived from a model that has been estimated from weighted data? Essentially, I am trying to use the factor scores created from the CFA models in subsequent linear and logistic regression models (in which the sample weight will also be applied) and want to be sure that I am properly weighting (i.e., not "overweighting," if such a thing is possible) the data. Also, are there any portions of the EFA output that WILL be affected by not including the cluster and strata information? Thanks, Jim

bmuthen posted on Monday, June 06, 2005  4:19 pm



The factor scores are not affected by weighting beyond the fact that the parameter estimates are affected. 

bmuthen posted on Monday, June 06, 2005  4:34 pm



The only EFA output affected by not including cluster and strata info is the chi-square test of model fit and related fit indices.


Thank you Bengt. I have tried estimating the CFA models I spoke of earlier and am getting the following error:
*** ERROR Each stratum must contain unique cluster IDs. Clusters are not nested within strata.
As I mentioned on 6/3, "The strata variable has 42 values, and there are 84 clusters. In the data set however, the cluster variable has only two values (i.e., 2 cluster values [1 & 2] within 42 strata = 84 clusters)." Is there any way I can work around the fact that my cluster variable is nested within my strata variable (aside from recoding the cluster variable)? For example, is there a way I can specify in my syntax that cluster is to be nested within strata? If there is no other way, would it make sense to recode the cluster variable as follows: within stratum #1, cluster 1 = 1 and 2 = 2; within stratum #2, cluster 1 = 3 and 2 = 4; within stratum #3, cluster 1 = 5 and 2 = 6; etc.? Will this strategy handle the complex data appropriately (again, if such recoding is necessary)? Thanks, Jim

bmuthen posted on Wednesday, June 15, 2005  7:46 am



The recoding you suggest is needed. 
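The recoding described in the question can be done in a data-preparation step before the file is read into Mplus. The following is a minimal Python sketch, not part of the original replies. It assumes strata are numbered 1 to 42 and clusters 1 to 2 within each stratum, as in the post. The function name `unique_cluster_id` and the example rows are hypothetical.

```python
def unique_cluster_id(stratum, cluster, clusters_per_stratum=2):
    """Map (stratum, within-stratum cluster code) to an ID unique across strata.

    Assumes strata are numbered 1, 2, ... and cluster codes run from
    1 to clusters_per_stratum inside every stratum.
    """
    if stratum < 1:
        raise ValueError("stratum codes are assumed to start at 1")
    if not 1 <= cluster <= clusters_per_stratum:
        raise ValueError("cluster code outside the expected range")
    return (stratum - 1) * clusters_per_stratum + cluster


# Hypothetical (stratum, cluster) pairs as they might appear in the data set.
rows = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (42, 2)]
recoded = [unique_cluster_id(s, c) for s, c in rows]
print(recoded)  # [1, 2, 3, 4, 5, 84]
```

With 42 strata and 2 clusters per stratum, this yields the 84 distinct cluster IDs the error message asks for, in the same order as the scheme proposed in the question (1, 2 in stratum 1; 3, 4 in stratum 2; and so on).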


Dear Dr. Muthen, I would like to explore the dimensionality of a new scale on collective efficacy with a set of hierarchical, non-independent data. Therefore I have a question about the following answer you posted about the use of EFA on clustered data: "The only EFA output affected by not including cluster and strata info is the chi-square test of model fit and related fit indices." (June 06, 2005) I understand that the chi-square statistic is biased, but are RMSEA and RMSR also biased? If so, I don't see how to use the remaining information (factor loadings) to explore the data, since there is no way to determine a reasonable number of factors. Are there any other options to conduct EFA on clustered data? Many thanks in advance!


I would suggest saving the pooled-within sample correlation matrix using the SAVEDATA command and using it as your data for the EFA. The sample size for this analysis would be n minus the number of clusters.


Dear Linda, thanks a lot, this solution works very well. 

Marc Reis posted on Thursday, October 13, 2005  11:06 am



Hello Linda, I would like to do the same as Sandra (EFA on a pooled-within correlation matrix with non-independent data). Would you please give a recommendation for the best estimator? With summary data, the choice is between ML and ULS. ML provides the chi-square statistic and RMSEA, so I would prefer it. Does the use of the pooled-within correlation matrix lead to unbiased estimates of both indices? If not, is RMSR trustworthy (with ML and/or ULS)? Many thanks for your help.


I would use ML. Note that the sample size is n minus the number of clusters. The fit indices should be reliable. 


Dear Linda, my question is related to your last posting. What's the correct N for the analysis on the pooled-within (1) and between (2) correlation matrix? I'd guess it's (1) N - G and (2) G, with G being the number of clusters. Is that right so far? Thanks a lot for your help.


Yes, this is correct. 
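To make the sample sizes concrete, here is a minimal pure-Python sketch of the textbook pooled-within matrix: within-cluster deviations are summed over all clusters and divided by N - G. It is an illustration only. In the workflow recommended in this thread the matrix comes from the Mplus SAVEDATA command, and the function name `pooled_within_corr` and the toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pooled_within_corr(rows, cluster_ids):
    """Return (pooled-within correlation matrix, N - G).

    rows: one list of numeric scores per observation (equal lengths).
    cluster_ids: one cluster label per observation.
    Assumes every variable has non-zero within-cluster variance.
    """
    groups = defaultdict(list)
    for row, cid in zip(rows, cluster_ids):
        groups[cid].append(row)

    p = len(rows[0])
    n, g = len(rows), len(groups)
    # Within-cluster sums of squares and cross-products, accumulated over clusters.
    sscp = [[0.0] * p for _ in range(p)]
    for members in groups.values():
        means = [sum(col) / len(members) for col in zip(*members)]
        for row in members:
            dev = [x - m for x, m in zip(row, means)]
            for i in range(p):
                for j in range(p):
                    sscp[i][j] += dev[i] * dev[j]

    df = n - g  # sample size to report for the pooled-within analysis
    cov = [[value / df for value in line] for line in sscp]
    sd = [sqrt(cov[i][i]) for i in range(p)]
    corr = [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]
    return corr, df


# Toy data: 4 observations on 2 variables in 2 clusters (hypothetical values).
rows = [[1, 2], [3, 4], [10, 9], [12, 13]]
clusters = ["a", "a", "b", "b"]
corr, n_within = pooled_within_corr(rows, clusters)
n_between = len(set(clusters))
print(n_within, n_between)  # 2 2
```

Here N = 4 and G = 2, so the pooled-within analysis would use a sample size of 2 and the between analysis a sample size of 2 (the number of clusters).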


Dear Prof. Muthen, I am currently conducting EFA using NESARC. I have read that deleting missing data can adversely affect the weighting variable. I am aware of the subpopulation option in Mplus version 4, I know however that you cannot use it in EFA and that others have resolved this problem by using the USEROBS option. Is it possible to employ the USEROBS option in EFA? Thanking you in advance, 


The USEOBSERVATIONS option is available for all analyses. 


Dear Prof. Muthen, I am currently doing some 2-level EFAs. I am unsure what the "unrestricted within/between covariance" specification means. Can you provide some reference on multilevel EFA? Thank you, Oliver


The unrestricted model is the H1 model of unrestricted correlations. If you have that model on one level, tests of fit apply to the model on the other level. See the following technical appendix on the website: Two-Level Weighted Least Squares Estimation. Proceedings of the Joint Statistical Meeting, August 2007, Biometrics Section.


Hi, I'm estimating CFA models on clustered data for which only the correlation matrix is available. How do I incorporate/account for the nonindependence of observations when I'm using DATA: TYPE IS CORRELATION MEANS STDEVIATIONS? Thank you, Jamie 


You need individual-level data for this type of analysis.


Thanks! 


Hi Dr. Muthen, I would also like to run an EFA (with categorical indicators) on complex data using the suggestion you posted previously (EFA on a pooled-within correlation matrix with non-independent data, October 11th, 2005). I read that you recommended ML as the estimator. However, my data are non-normally distributed, so I was considering using MLM or WLSMV. Any recommendations? Thanks in advance, itziar


Dear Dr. Muthen, As a follow-up to my previous post, I tried running the EFA with the pooled-within sample correlation matrix as data (from the SAVEDATA command), but I get the following error:
*** ERROR Unexpected end of file reached in data file.
I think I'm not specifying the source of the data correctly, but I'm not sure how to incorporate both the correlation matrix and my original data in the syntax. Thanks again, itziar


Use TYPE=TWOLEVEL and weighted least squares. The SAMPLE option of the SAVEDATA command will give you the pooled-within matrix. See Example 13.1 for the correct way to read a covariance matrix. The sample size for a pooled-within matrix is the number of observations minus the number of clusters.


Dear Dr. Muthen, Thanks for your response. I tried your suggestion and obtained a covariance matrix. However, I have categorical factor indicators. Is this correct? Shouldn't I obtain a correlation matrix instead? Thanks! itziar 


You can only obtain a correlation matrix with categorical outcomes. Please send your output and license number to support@statmodel.com. 

Ashleigh M posted on Tuesday, March 04, 2014  4:36 pm



Hello, I am doing a two-level EFA with categorical mood symptom data for 67 people, with a range of 1-23 (mean of 5.239) timepoints per person. I am hoping this will produce factors for how symptoms covary (within level), as well as how symptoms covary across people (between level). In my two-level EFA, I found nice fit for a few different models (RMSEA < .05). However, I get the following error messages for other models:
NO CONVERGENCE. NUMBER OF ITERATIONS EXCEEDED. PROBLEM OCCURRED IN EXPLORATORY FACTOR ANALYSIS WITH 4 WITHIN FACTOR(S) AND 1 BETWEEN FACTOR(S). THE PROBLEM MAY BE RESOLVED BY USING THE STARTS OPTION TO GENERATE RANDOM SETS OF STARTING VALUES.
Should I increase starts to attempt convergence for these models, or change the input, which was:
TYPE = TWOLEVEL EFA 1 5 UW 1 5 UB;
Finally, a hypothetical question: if eventually I learn that the best fit is unrestricted between and 2 factors within, does this signify that within has good fit, but there is no underlying factor between? I've been reading a lot but still do not seem to understand the significance of unrestricted covariance. Thank you very much


I am not sure how your data structure is two-level. You say that Within is symptoms and Between is people, but why not then do a regular (1-level) factor analysis of the symptoms?

Ashleigh M posted on Wednesday, March 05, 2014  10:22 am



Thank you for your prompt response! I have read that doing a 1-level analysis on these data is problematic as it ignores dependencies within the data (symptom sets within individuals across time are no doubt non-independent), and the structure is contaminated by two sources of variance, both within- and between-individuals. The following citation makes a case for using multilevel factor analysis with similar data: Reise, S. P., Ventura, J., Nuechterlein, K. H., & Kim, K. H. (2005). An illustration of multilevel factor analysis. Journal of Personality Assessment, 84(2), 126-132. Especially since the range of symptom sets per person is so large (1-23), I do not want to confound the factor structure by weighting some individuals more heavily than others and not accounting for the nested nature of the data.


I see. So Within is not symptoms but time (with Between being person). So the number of variables that you analyze is the number of symptoms. Convergence problems for 4 within factors and 1 between factor may be due to negative residual variances with 4 within factors. Perhaps 3 factors is sufficient. 

Ashleigh M posted on Friday, March 07, 2014  2:22 pm



Thank you very much, and I'm sorry my wording was misleading. I still have a lot to learn! Now I need to decide between the models I have, but I'm struggling with understanding unrestricted covariance. One of my models with good fit has unrestricted within and 3 factors between; does this signify that between has good fit, but there is no underlying factor within? I've been reading a lot but still do not seem to understand the significance of unrestricted covariance. Thank you so much for all your help! 


I think you are more interested in your level-2 factor structure since that concerns covariation among the variables across subjects. So an unrestricted Within could make sense in that it focuses the analysis on Between. Having fit this model, you can then see if you significantly worsen the fit when trying 1-m factors on Within, keeping Between at 3 factors.

Ashleigh M posted on Wednesday, March 19, 2014  2:21 pm



Thank you very much, that was helpful. My unrestricted within and 3 factors between works better given what you suggested. However, I would now like to choose between two models that have great fit indices and are supported by theory (unrestricted within and 2 factors between, as well as unrestricted within and 3 factors between). I learned I cannot use the DIFFTEST option with EFA, so is there anything else you would suggest? Thank you! 


You can use BIC if ML estimation is used. 
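For reference, BIC combines the maximized loglikelihood with a penalty for the number of free parameters, BIC = -2 logL + p ln(n), and the model with the smaller value is preferred. The sketch below only illustrates that arithmetic. It applies when ML estimation is used, and the loglikelihoods, parameter counts, and sample size are hypothetical numbers, not results from this thread.

```python
import math


def bic(log_likelihood, n_free_parameters, n_observations):
    """Bayesian information criterion: -2 logL + p * ln(n)."""
    return -2.0 * log_likelihood + n_free_parameters * math.log(n_observations)


# Hypothetical ML results for two competing models: (loglikelihood, free parameters).
candidates = {
    "2 between factors": (-1520.4, 25),
    "3 between factors": (-1510.9, 35),
}
n = 300  # hypothetical sample size

scores = {name: bic(ll, p, n) for name, (ll, p) in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # 2 between factors
```

In this made-up case the richer model improves the loglikelihood, but not by enough to offset the penalty for its ten extra parameters.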

Ashleigh M posted on Wednesday, March 19, 2014  4:20 pm



My data are dichotomous, so the estimator WLSMV is the default. Thank you 


Then I don't know what to suggest, except present both solutions. 

Ashleigh M posted on Wednesday, March 19, 2014  4:28 pm



I will do that, thank you very much! 

JW posted on Tuesday, July 08, 2014  8:41 am



Hi, I have data collected across 27 teams and I would like to run a CFA on the 22 measures collected. I initially ran the CFA without clustering and then, after seeing this thread, decided to account for it by specifying cluster = team; and analysis: estimator = mlr; type = complex; However, I receive the warning message below, though I must say that the estimates are the same as when I did not account for clustering and that overall the figures are sensible. Should I worry? And if so, could you suggest what to do, please?
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS 0.299D-16. PROBLEM INVOLVING THE FOLLOWING PARAMETER: Parameter 27, B1 BY BWEM_TOT
THIS IS MOST LIKELY DUE TO HAVING MORE PARAMETERS THAN THE NUMBER OF CLUSTERS MINUS THE NUMBER OF STRATA WITH MORE THAN ONE CLUSTER.


Independence of observations is at the cluster level. We are just reminding you of that. This is probably not a problem given what you describe. 
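The quantity named in the last sentence of the warning is easy to compute by hand. The sketch below takes the warning's wording at face value (number of clusters minus number of strata with more than one cluster). It is not a description of Mplus internals, and treating an unstratified sample as a single stratum is an assumption made here for illustration. The function name `cluster_based_limit` is hypothetical.

```python
def cluster_based_limit(clusters_per_stratum):
    """Number of clusters minus the number of strata with more than one cluster.

    clusters_per_stratum: mapping from stratum label to its number of clusters.
    """
    counts = list(clusters_per_stratum.values())
    if any(count < 1 for count in counts):
        raise ValueError("every stratum needs at least one cluster")
    total_clusters = sum(counts)
    strata_with_several = sum(1 for count in counts if count > 1)
    return total_clusters - strata_with_several


# Design from the first post in this thread: 42 strata with 2 clusters each.
print(cluster_based_limit({stratum: 2 for stratum in range(1, 43)}))  # 42

# 27 teams and no strata, treated here as one stratum (an assumption).
print(cluster_based_limit({"all": 27}))  # 26
```

A model with more free parameters than this number would be expected to trigger the message, which fits the reply's point that independence is at the cluster level.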

JW posted on Thursday, July 10, 2014  6:35 am



Thank you! 

Yuan Zhang posted on Monday, June 12, 2017  5:31 pm



Hi Linda and Bengt: Working with survey data and conducting a 3-factor CFA, I'm puzzled by the inconsistent RMSEA values in the output for the same models with different survey specifications. Model A has sampling weight, cluster, and strata specified and Model B has sampling weight and replicate weights specified; all other specifications remain the same. The RMSEA values are observed to increase in Model B (e.g., from 0.052 to 0.067). Could you advise if the increasing RMSEA in Model B is something that I should be concerned about? Did I misspecify anything? Appreciatively, Yuan
Partial (due to post size limit) code follows below:
title: Model_A;
data: file=XXXX;
variable: (omitted due to post size limit) Missing are all (999); weight = s_wgt; strat = strata; cluster=cluster;
analysis: type=complex;
model: int BY ...; eff BY ...; belief BY ...;
____________________________________
title: Model_B;
data: file=XXXX;
variable: (omitted due to post size limit) Missing are all (999); weight = s_wgt; repweight=jkn_1-jkn_150;
analysis: type=complex; repse=JACKKNIFE2;
model: int BY ...; eff BY ...; belief BY ...;


In Mplus version 8 we do not compute RMSEA with replicate weights. It should not have been printed in the earlier versions either. Use the RMSEA from the run without replicate weights.

Yuan Zhang posted on Saturday, June 17, 2017  1:08 pm



Thanks, Tihomir! For your record, I used Mplus 7 and did get an RMSEA.
