I am attempting to conduct an EFA, subsequently estimate several first- and second-order CFA models, compare these models in terms of fit (with absolute fit statistics, nested and non-nested comparisons) across subsamples (using multiple group comparisons), and estimate factor scores. Most of these procedures I can already do in Mplus, but I am now also dealing with a complex sample with a sample weight, a strata variable and a cluster variable. The strata variable has 42 values, and there are 84 clusters. In the data set, however, the cluster variable has only two values (i.e., 2 cluster values within each of 42 strata = 84 clusters).
I am very new to the area of complex sample design. First, are all of the analyses I proposed above feasible with a complex sample using Mplus? Second, how do I adapt my EFA/CFA models to adjust for the sample design (note: I am only looking to adjust for the sample design, not obtain detailed information about the strata or clusters)?
So first, for the EFA, do I simply apply the sample weight (and not use the cluster and strata information)? Are there any other issues I need to keep in mind for this analysis when using a complex sample?
Second, how do I adjust my CFA syntax to properly model the weight, cluster, and strata variables (especially the latter two)? Following is an example of my syntax (excluding data and title):
VARIABLE:
  NAMES ARE hypins fatigue retard agit conc indeci death
    watedec wateinc insom anhed guilt lworth suic;
  USEVARIABLES ARE hypins fatigue retard agit conc indeci death
    watedec wateinc insom anhed guilt lworth suic;
  CATEGORICAL ARE hypins-suic;
ANALYSIS:
  TYPE IS GENERAL;
  ESTIMATOR IS WLSMV;
  ITERATIONS = 10000;
  CONVERGENCE = 0.00005;
MODEL:
  f1 BY fatigue retard agit conc indeci watedec wateinc insom anhed hypins;
  f2 BY death guilt lworth suic;
I looked in the manual but was confused about how to specify the cluster and strata variables together for my sample. I am also assuming that I can apply the sample weight to any analysis as long as I specify the weight option. Any help would be greatly appreciated.
Your proposed analyses certainly seem feasible in Mplus. The complex survey data syntax simply amounts to adding the options
weight = swght; strat = stratum; cluster = psu;
in the Variable command and using Type = Complex in the Analysis command to get the correct SEs and chi-square.
For EFA, only the weight option is available and Type = Complex is not available. But the cluster and strata information is less important in EFA since no SEs are given and you don't really need the chi-square test of fit.
You can read more about the techniques in Mplus Web note #7, forthcoming in the SEM journal. Acknowledging stratification can give important reductions in the SEs.
If I apply the sampling weight to the CFA models I spoke of, I assume that the model parameters will be adjusted according to these weights. When I create factor scores from these "weighted" models, is the sampling weight involved in any additional aspect of factor score estimation, or is it simply that the factor scores will be derived from a model that has been estimated from weighted data? Essentially, I want to use the factor scores created from the CFA models in subsequent linear and logistic regression models (in which the sample weight will also be applied), and I want to be sure that I am weighting the data properly (i.e., not "overweighting," if such a thing is possible).
Also, are there any portions of the EFA output that WILL be affected by not including the cluster and strata information?
I have tried estimating the CFA models I spoke of earlier and am getting the following error:
*** ERROR Each stratum must contain unique cluster IDs. Clusters are not nested within strata.
As I mentioned on 6/3, "The strata variable has 42 values, and there are 84 clusters. In the data set however, the cluster variable has only two values (i.e., 2 cluster values [1 & 2] within 42 strata = 84 clusters)."
Is there any way I can work around the fact that my cluster variable is nested within my strata variable (aside from recoding the cluster variable)? For example, is there a way I can specify in my syntax that cluster is to be nested within strata? If there is no other way, would it make sense to recode the cluster variable as follows: within stratum #1, cluster 1 = 1 and 2 = 2; within stratum #2, cluster 1 = 3 and 2 = 4; within stratum #3, cluster 1 = 5 and 2 = 6; and so on? Will this strategy handle the complex data appropriately (again, if such recoding is necessary)?
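The recoding proposed above can be sketched outside Mplus, before the data file is written. The Python below is only an illustration under stated assumptions: strata are numbered 1 to 42, clusters are coded 1 or 2 within each stratum, and the function name unique_cluster_id is hypothetical. It gives every stratum-cluster pair its own ID, which is what the error message asks for.

```python
def unique_cluster_id(stratum: int, cluster: int, clusters_per_stratum: int = 2) -> int:
    """Map a within-stratum cluster code to an ID that is unique across strata.

    Assumes strata are numbered 1, 2, 3, ... and clusters are coded
    1..clusters_per_stratum inside every stratum (an assumption about the
    data layout, not something taken from the Mplus documentation).
    """
    if stratum < 1 or not 1 <= cluster <= clusters_per_stratum:
        raise ValueError("stratum or cluster code outside the assumed range")
    return (stratum - 1) * clusters_per_stratum + cluster


# All 84 stratum-cluster pairs from the design described in the thread.
pairs = [(s, c) for s in range(1, 43) for c in (1, 2)]
new_ids = [unique_cluster_id(s, c) for s, c in pairs]
```

With these assumptions, stratum 1 keeps IDs 1 and 2, stratum 2 gets 3 and 4, stratum 3 gets 5 and 6, and so on up to 84. The recoded column would then replace the original cluster variable in the data file.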
bmuthen posted on Wednesday, June 15, 2005 - 7:46 am
I would like to explore the dimensionality of a new scale on collective efficacy with a set of hierarchical, non-independent data. I therefore have a question about the following answer you posted about the use of EFA on clustered data: "The only EFA output affected by not including cluster and strata info is the chi-square test of model fit and related fit indices." (June 06, 2005)
I understand that the chi-square statistic is biased, but are RMSEA and RMSR also biased? If so, I don't see how to use the remaining information (factor loadings) to explore the data, since there is no way to determine a reasonable number of factors. Are there any other options for conducting EFA on clustered data?
I would suggest saving the pooled-within sample correlation matrix using the SAVEDATA command and using it as your data for the EFA. The sample size for this analysis would be n minus the number of clusters.
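To make the idea of a pooled-within matrix concrete, here is a minimal sketch for continuous variables. It is not the Mplus computation, and it does not cover categorical outcomes; the function name pooled_within_correlation and the toy data are hypothetical. Each score is centered on its own cluster mean, the cross-products are pooled over clusters, and the divisor is n minus the number of clusters, which matches the sample size mentioned above.

```python
from collections import defaultdict
from math import sqrt


def pooled_within_correlation(rows, cluster_ids):
    """Return (correlation matrix, divisor) for the pooled-within matrix.

    rows        -- list of equal-length lists of numeric scores
    cluster_ids -- one cluster label per row
    The divisor is n minus the number of clusters. Every variable is
    assumed to vary within at least one cluster.
    """
    groups = defaultdict(list)
    for row, cid in zip(rows, cluster_ids):
        groups[cid].append(row)

    p = len(rows[0])
    sscp = [[0.0] * p for _ in range(p)]  # pooled within-cluster cross-products
    for members in groups.values():
        means = [sum(r[j] for r in members) / len(members) for j in range(p)]
        for r in members:
            dev = [r[j] - means[j] for j in range(p)]
            for j in range(p):
                for k in range(p):
                    sscp[j][k] += dev[j] * dev[k]

    divisor = len(rows) - len(groups)
    cov = [[sscp[j][k] / divisor for k in range(p)] for j in range(p)]
    corr = [[cov[j][k] / sqrt(cov[j][j] * cov[k][k]) for k in range(p)]
            for j in range(p)]
    return corr, divisor


# Toy data: two clusters with different means but the same within-cluster pattern.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [11.0, 5.0], [12.0, 7.0], [13.0, 9.0]]
clusters = ["a", "a", "a", "b", "b", "b"]
corr, n_within = pooled_within_correlation(rows, clusters)
```

In the toy data the two variables are perfectly related inside each cluster, so the pooled-within correlation reflects that relationship regardless of how far apart the cluster means are. With 6 observations in 2 clusters the divisor is 4.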
Dear Linda, thanks a lot, this solution works very well.
Marc Reis posted on Thursday, October 13, 2005 - 11:06 am
I would like to do the same as Sandra (EFA on a pooled-within correlation matrix with non-independent data). Could you please recommend the best estimator? With summary data, the choice is between ML and ULS. ML provides the chi-square statistic and RMSEA, so I would prefer it. Does the use of the pooled-within correlation matrix lead to unbiased estimates of both indices? If not, is RMSR trustworthy (with ML and/or ULS)?
I am currently conducting EFA using NESARC. I have read that deleting cases with missing data can adversely affect the weighting variable. I am aware of the subpopulation option in Mplus version 4. However, I know that it cannot be used in EFA and that others have resolved this problem by using the USEROBS option. Is it possible to employ the USEROBS option in EFA?
The unrestricted model is the H1 model of unrestricted correlations. If you have that model on one level, tests of fit apply to the model on the other level. See the following technical appendix on the website:
Two-Level Weighted Least Squares Estimation. Proceedings of the Joint Statistical Meeting, August 2007, Biometrics Section
I'm estimating CFA models on clustered data for which only the correlation matrix is available. How do I incorporate/account for the non-independence of observations when I'm using DATA: TYPE IS CORRELATION MEANS STDEVIATIONS?
I would also like to run an EFA (with categorical indicators) on complex data using the suggestion you posted previously (EFA on a pooled-within correlation matrix with non-independent data, October 11th, 2005). I read that you recommended ML as the estimator. However, my data are non-normally distributed, so I was considering using MLM or WLSMV. Any recommendations? Thanks in advance,
As a follow-up to my previous post, I tried running the EFA with the pooled-within sample correlation matrix as data (from the SAVEDATA command), but I get the following error: *** ERROR Unexpected end of file reached in data file. I think I am not specifying the source of the data correctly, but I am not sure how to refer to both the correlation matrix and my original data in the syntax. Thanks again, itziar
Dear Dr. Muthen, Thanks for your response. I tried your suggestion and obtained a covariance matrix. However, I have categorical factor indicators. Is this correct? Shouldn't I obtain a correlation matrix instead? Thanks! itziar
You can only obtain a correlation matrix with categorical outcomes. Please send your output and license number to email@example.com.
Ashleigh M posted on Tuesday, March 04, 2014 - 4:36 pm
Hello, I am doing a two-level EFA with categorical mood symptom data for 67 people, with a range of 1-23 (mean of 5.239) timepoints per person. I am hoping this will produce factors for how symptoms covary (within level), as well as how symptoms covary across people (between level).
In my two-level EFA, I found good fit for a few different models (RMSEA < .05). However, I get the following error message for other models:
NO CONVERGENCE. NUMBER OF ITERATIONS EXCEEDED. PROBLEM OCCURRED IN EXPLORATORY FACTOR ANALYSIS WITH 4 WITHIN FACTOR(S) AND 1 BETWEEN FACTOR(S). THE PROBLEM MAY BE RESOLVED BY USING THE STARTS OPTION TO GENERATE RANDOM SETS OF STARTING VALUES.
Should I increase starts to attempt convergence for these models, or change the input, which was: TYPE = TWOLEVEL EFA 1 5 UW 1 5 UB;
Finally, a hypothetical question: if eventually I learn that the best fit is unrestricted between and 2 factors within, does this signify that within has good fit, but there is no underlying factor between? I've been reading a lot but still do not seem to understand the significance of unrestricted covariance.
I am not sure how your data structure is two-level. You say that Within is symptoms and Between is people, but why not then do a regular (1-level) factor analysis of the symptoms?
Ashleigh M posted on Wednesday, March 05, 2014 - 10:22 am
Thank you for your prompt response! I have read that doing a 1-level analysis on these data is problematic because it ignores dependencies within the data (symptom sets within individuals across time are no doubt nonindependent), and the structure is contaminated by two sources of variance, within and between individuals. The following citation makes a case for using multilevel factor analysis with similar data: Reise, S. P., Ventura, J., Nuechterlein, K. H., & Kim, K. H. (2005). An illustration of multilevel factor analysis. Journal of Personality Assessment, 84(2), 126-132. Especially since the range in the number of symptom sets is so large (1-23), I do not want to confound the factor structure by weighting some individuals more heavily than others and by not accounting for the nested nature of the data.
I see. So Within is not symptoms but time (with Between being person). So the number of variables that you analyze is the number of symptoms.
Convergence problems for 4 within factors and 1 between factor may be due to negative residual variances with 4 within factors. Perhaps 3 factors is sufficient.
Ashleigh M posted on Friday, March 07, 2014 - 2:22 pm
Thank you very much, and I'm sorry my wording was misleading. I still have a lot to learn! Now I need to decide between the models I have, but I'm struggling with understanding unrestricted covariance. One of my models with good fit has unrestricted within and 3 factors between; does this signify that between has good fit, but there is no underlying factor within? I've been reading a lot but still do not seem to understand the significance of unrestricted covariance.
I think you are more interested in your level-2 factor structure since that concerns co-variation among the variables across subjects. So an unrestricted Within could make sense in that it focuses the analysis on Between. Having fit this model, you can then see if you significantly worsen the fit when trying 1-m factors on Within, keeping Between at 3 factors.
Ashleigh M posted on Wednesday, March 19, 2014 - 2:21 pm
Thank you very much, that was helpful. My unrestricted within and 3 factors between works better given what you suggested. However, I would now like to choose between two models that have great fit indices and are supported by theory (unrestricted within and 2 factors between, as well as unrestricted within and 3 factors between). I learned I cannot use the DIFFTEST option with EFA, so is there anything else you would suggest?
I have data collected across 27 teams and I would like to run a CFA on the 22 measures collected.
I initially ran the CFA without clustering and then after seeing this thread decided to account for it by specifying
cluster = team;
estimator = mlr; type = complex;
However, I receive the warning message below. The estimates are the same as when I did not account for clustering, and overall the figures are sensible.
Should I worry? If so, could you suggest what to do?
THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS -0.299D-16. PROBLEM INVOLVING THE FOLLOWING PARAMETER: Parameter 27, B1 BY BWEM_TOT
THIS IS MOST LIKELY DUE TO HAVING MORE PARAMETERS THAN THE NUMBER OF CLUSTERS MINUS THE NUMBER OF STRATA WITH MORE THAN ONE CLUSTER.
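The quantity named in the last line of the warning is simple arithmetic and can be checked by hand. The sketch below only restates the wording of the message; the function name independent_cluster_units is hypothetical, and treating an unstratified sample as a single stratum is an assumption.

```python
def independent_cluster_units(clusters_per_stratum):
    """Number of clusters minus the number of strata holding more than one cluster.

    clusters_per_stratum -- list with the cluster count of each stratum;
    an unstratified sample is treated here as a single stratum (assumption).
    """
    n_clusters = sum(clusters_per_stratum)
    n_multi_cluster_strata = sum(1 for c in clusters_per_stratum if c > 1)
    return n_clusters - n_multi_cluster_strata


teams_only = independent_cluster_units([27])      # 27 teams, no strata variable
stratified = independent_cluster_units([2] * 42)  # 42 strata with 2 clusters each
```

Under these assumptions the 27-team design gives 26, so a model with more than 26 free parameters would exceed the quantity the warning refers to. Whether that is what triggered the message for this particular model is not something the output shown here confirms.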
Yuan Zhang posted on Monday, June 12, 2017 - 5:31 pm
Hi Linda and Bengt:
I am working with survey data and fitting a 3-factor CFA, and I’m puzzled by the inconsistent RMSEA values in the output for the same model under different survey specifications.
Model A has the sampling weight, cluster, and strata specified, and Model B has the sampling weight and replicate weights specified; all other specifications remain the same. The RMSEA increases in Model B (e.g., from 0.052 to 0.067). Could you advise whether the higher RMSEA in Model B is something that I should be concerned about? Did I misspecify anything?
Partial code (shortened due to the post size limit) follows:

title: Model_A;
data: file=XXXX;
variable: (omitted due to post size limit)
  Missing are all (999);
  weight = s_wgt;
  strat = strata;
  cluster = cluster;
analysis:
  type = complex;
model:
  int BY ...;
  eff BY ...;
  belief BY ...;

title: Model_B;
data: file=XXXX;
variable: (omitted due to post size limit)
  Missing are all (999);
  weight = s_wgt;
  repweight = jkn_1-jkn_150;
analysis:
  type = complex;
  repse = JACKKNIFE2;
model:
  int BY ...;
  eff BY ...;
  belief BY ...;