EFA/CFA with complex sample data

Jim Prisciandaro posted on Friday, June 03, 2005 - 9:40 am

Hello Drs. Muthen,

I am attempting to conduct an EFA, subsequently estimate several first and second order CFA models, compare these models in terms of fit (with absolute fit statistics, nested and non-nested comparisons) across subsamples (using multiple group comparisons), and estimate factor scores. Most of these procedures I can already do in Mplus, but I am now also dealing with a complex sample with a sample weight, a strata variable and a cluster variable. The strata variable has 42 values, and there are 84 clusters. In the data set however, the cluster variable has only two values (i.e., 2 cluster values within 42 strata = 84 clusters).

I am very new to the area of complex sample design. First, are all of the analyses I proposed above feasible with a complex sample using Mplus? Second, how do I adapt my EFA/CFA models to adjust for the sample design (note: I am only looking to adjust for the sample design, not obtain detailed information about the strata or clusters)?

So first, for the EFA, do I simply apply the sample weight (and not use the cluster and strata information)? Are there any other issues I need to keep in mind for this analysis when using a complex sample?

Second, how do I adjust my CFA syntax to properly model the weight, cluster, and strata variables (especially the latter two)? Following is an example of my syntax (excluding data and title):

VARIABLE:
NAMES ARE hypins fatigue retard agit conc indeci death watedec wateinc
insom anhed guilt lworth suic;
USEVARIABLES ARE hypins fatigue retard agit conc indeci death watedec wateinc
insom anhed guilt lworth suic;
CATEGORICAL ARE hypins-suic;

ANALYSIS:
TYPE IS GENERAL;
ESTIMATOR IS WLSMV;
ITERATIONS = 10000;
CONVERGENCE = 0.00005;

OUTPUT: SAMPSTAT RESIDUAL STANDARDIZED TECH3 TECH4 MODINDICES;

MODEL: f1 BY fatigue retard agit conc indeci watedec wateinc
insom anhed hypins;
f2 BY death guilt lworth suic;

I looked in the manual and was confused regarding how to properly model both the cluster and strata variables simultaneously with my sample. And I am assuming that with the sample weight, I can apply that to any analysis as long as I specify the weight command. Any help would be greatly appreciated.

Thank you in advance,

Jim

bmuthen posted on Friday, June 03, 2005 - 6:19 pm

Your proposed analyses certainly seem feasible in Mplus. The complex survey data syntax simply amounts to adding the options

weight = swght;
strat = stratum;
cluster= psu;

in the Variable command and using Type = Complex in the Analysis command to get the correct SEs and chi-square.

For EFA, only the weight option is available and Type = Complex is not available. But the cluster and strata information is less important in EFA since no SEs are given and you don't really need the chi-square test of fit.

You can read more about the techniques in Mplus Web note #7, forthcoming in the SEM journal. Acknowledging stratification can give important reductions in the SEs.

Jim Prisciandaro posted on Sunday, June 05, 2005 - 3:47 pm

Thank you Bengt,

As a followup:

If I apply the sampling weight to the CFA models I described, I assume that the model parameters will be adjusted according to these weights. When I create factor scores from these "weighted" models, is the sampling weight involved in any additional aspect of factor score estimation, or is it simply that the factor scores are derived from a model that has been estimated from weighted data? Essentially, I want to use the factor scores from the CFA models in subsequent linear and logistic regression models (in which the sample weight will also be applied), and I want to be sure that I am weighting the data properly (i.e., not "overweighting," if such a thing is possible).

Also, are there any portions of the EFA output that WILL be affected by not including the cluster and strata information?

Thanks,
Jim

bmuthen posted on Monday, June 06, 2005 - 4:19 pm

The factor scores are not affected by weighting beyond the fact that the parameter estimates are affected.

bmuthen posted on Monday, June 06, 2005 - 4:34 pm

The only EFA output affected by not including cluster and strata info is the chi-square test of model fit and related fit indices.

Jim Prisciandaro posted on Tuesday, June 14, 2005 - 10:35 am

Thank you Bengt,

I have tried estimating the CFA models I spoke of earlier and am getting the following error:

*** ERROR
Each stratum must contain unique cluster IDs.
Clusters are not nested within strata.

As I mentioned on 6/3, "The strata variable has 42 values, and there are 84 clusters. In the data set however, the cluster variable has only two values (i.e., 2 cluster values [1 & 2] within 42 strata = 84 clusters)."

Is there any way I can work around the fact that my cluster variable is nested within my strata variable (aside from recoding the cluster variable)? For example, is there a way to specify in my syntax that cluster is nested within strata? If not, would it make sense to recode the cluster variable as follows: within stratum #1, cluster 1 = 1 and 2 = 2; within stratum #2, cluster 1 = 3 and 2 = 4; within stratum #3, cluster 1 = 5 and 2 = 6; and so on? Will this strategy handle the complex data appropriately (again, if such recoding is necessary)?

Thanks,
Jim

bmuthen posted on Wednesday, June 15, 2005 - 7:46 am

The recoding you suggest is needed.
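
A minimal sketch of that recoding, written in Python as a data-preparation step before the file is read into Mplus. It assumes strata are numbered consecutively from 1 and that clusters within each stratum are coded 1 and 2, as described in the post; the function name `unique_cluster_id` and the parameter `clusters_per_stratum` are hypothetical.

```python
def unique_cluster_id(stratum: int, cluster: int, clusters_per_stratum: int = 2) -> int:
    """Map a within-stratum cluster code to an ID that is unique across strata.

    Assumes strata are numbered 1, 2, 3, ... and clusters within each
    stratum are numbered 1..clusters_per_stratum.
    """
    if not 1 <= cluster <= clusters_per_stratum:
        raise ValueError("cluster code outside the expected range")
    return (stratum - 1) * clusters_per_stratum + cluster


# Hypothetical design matching the post: 42 strata, 2 clusters in each.
design = [(s, c) for s in range(1, 43) for c in (1, 2)]
new_ids = [unique_cluster_id(s, c) for s, c in design]
```

With this mapping, stratum 1 receives IDs 1 and 2, stratum 2 receives 3 and 4, and so on, so all 84 clusters carry distinct IDs. If the strata codes are not consecutive integers, building the ID from the (stratum, cluster) pair in some other one-to-one way would serve the same purpose.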

Sandra Mihailovic posted on Tuesday, October 11, 2005 - 1:25 pm

Dear Dr. Muthen,

I would like to explore the dimensionality of a new scale on collective efficacy with a set of hierarchical, non-independent data. Therefore I have a question about the following answer you posted about the use of EFA on clustered data: "The only EFA output affected by not including cluster and strata info is the chi-square test of model fit and related fit indices." (June 06, 2005)

I understand that the chi-square statistic is biased, but are RMSEA and RMSR also biased? If so, I don't see how to use the remaining information (factor loadings) to explore the data, since there is no way to determine a reasonable number of factors. Are there any other options for conducting EFA on clustered data?

Many thanks in advance!

Linda K. Muthen posted on Tuesday, October 11, 2005 - 4:31 pm

I would suggest saving the pooled-within sample correlation matrix using the SAVEDATA command and using it as your data for the EFA. The sample size for this analysis would be n minus the number of clusters.
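
For readers wondering where "n minus the number of clusters" comes from, the following Python sketch computes a pooled-within covariance matrix using the textbook definition: deviations from each cluster's own mean are summed and divided by N - G. It illustrates the idea only, assuming continuous variables and no sampling weights; it is not a description of how Mplus computes the matrix it saves. The function name and the toy data are hypothetical.

```python
def pooled_within_cov(clusters):
    """Pooled-within covariance matrix from a list of clusters.

    Each cluster is a list of observations; each observation is a list of
    p numeric values. The divisor is N - G (observations minus clusters).
    """
    n_obs = sum(len(rows) for rows in clusters)
    n_clusters = len(clusters)
    p = len(clusters[0][0])
    sums = [[0.0] * p for _ in range(p)]
    for rows in clusters:
        # Cluster-specific means, so between-cluster differences drop out.
        means = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        for r in rows:
            dev = [r[j] - means[j] for j in range(p)]
            for j in range(p):
                for k in range(p):
                    sums[j][k] += dev[j] * dev[k]
    dof = n_obs - n_clusters
    return [[v / dof for v in row] for row in sums], dof


# Hypothetical toy data: 2 clusters, 2 variables.
toy = [
    [[1.0, 2.0], [3.0, 6.0]],                  # cluster 1, 2 observations
    [[10.0, 1.0], [12.0, 3.0], [14.0, 5.0]],   # cluster 2, 3 observations
]
s_pw, n_for_efa = pooled_within_cov(toy)
```

One mean per cluster is estimated and removed, which is why the divisor, and the sample size to report for the EFA, is N - G. A correlation matrix follows by dividing each element by the square roots of the corresponding diagonal entries.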

Sandra Mihailovic posted on Wednesday, October 12, 2005 - 2:16 am

Dear Linda, thanks a lot, this solution works very well.

Marc Reis posted on Thursday, October 13, 2005 - 11:06 am

Hello Linda,

I would like to do the same as Sandra (EFA on a pooled-within correlation matrix with non-independent data). Could you please recommend the best estimator? With summary data, the choice is between ML and ULS. ML provides the chi-square statistic and RMSEA, so I would prefer it. Does using the pooled-within correlation matrix lead to unbiased estimates of both indices? If not, is RMSR trustworthy (with ML and/or ULS)?

Many thanks for your help.

Linda K. Muthen posted on Thursday, October 13, 2005 - 1:32 pm

I would use ML. Note that the sample size is n minus the number of clusters. The fit indices should be reliable.

Florian Fiedler posted on Monday, November 28, 2005 - 7:20 am

Dear Linda,

My question is related to your last posting. What is the correct N for analyses of the pooled-within (1) and between (2) correlation matrices?

I'd guess it's

(1) N-G
(2) G

with G being the number of clusters.

Is that right so far?

Thanks a lot for your help.

Linda K. Muthen posted on Monday, November 28, 2005 - 4:42 pm

Yes, this is correct.

Orla Mc Bride posted on Tuesday, July 18, 2006 - 2:53 am

Dear Prof. Muthen,

I am currently conducting EFA using NESARC. I have read that deleting missing data can adversely affect the weighting variable. I am aware of the subpopulation option in Mplus version 4; however, I know that it cannot be used in EFA and that others have resolved this problem by using the USEOBSERVATIONS option. Is it possible to employ the USEOBSERVATIONS option in EFA?

Thanking you in advance,

Bengt O. Muthen posted on Tuesday, July 18, 2006 - 5:51 am

The USEOBSERVATIONS option is available for all analyses.

Oliver Arranz-Becker posted on Monday, April 26, 2010 - 6:38 am

Dear Prof. Muthen,

I am currently doing some 2-level EFAs. I am unsure what the "unrestricted within/between covariance" specification means. Can you provide some reference on multilevel EFA?

Thank you,
Oliver

Linda K. Muthen posted on Monday, April 26, 2010 - 8:55 am

The unrestricted model is the H1 model of unrestricted correlations. If you have that model on one level, tests of fit apply to the model on the other level. See the following technical appendix on the website:

Two-Level Weighted Least Squares Estimation. Proceedings of the Joint Statistical Meeting, August 2007, Biometrics Section

Jamie Marincic posted on Monday, May 09, 2011 - 8:52 pm

Hi,

I'm estimating CFA models on clustered data for which only the correlation matrix is available. How do I incorporate/account for the non-independence of observations when I'm using DATA: TYPE IS CORRELATION MEANS STDEVIATIONS?

Thank you,
Jamie

Linda K. Muthen posted on Tuesday, May 10, 2011 - 6:15 am

You need individual-level data for this type of analysis.

Jamie Marincic posted on Tuesday, May 10, 2011 - 10:36 am

Thanks!

Itziar Familiar posted on Tuesday, August 23, 2011 - 2:58 pm

Hi Dr. Muthen,

I would also like to run an EFA (with categorical indicators) on complex data using the suggestion you posted previously (EFA on a pooled-within correlation matrix with non-independent data, October 11, 2005).
I read that you recommended ML as the estimator. However, my data are non-normally distributed, so I was considering MLM or WLSMV. Any recommendations?
Thanks in advance,

itziar

Itziar Familiar posted on Wednesday, August 24, 2011 - 7:22 am

Dear Dr. Muthen,

As a follow-up to my previous post, I tried running the EFA with the pooled within sample corr matrix as data (from the SAVEDATA command) but I get the following error:
*** ERROR
Unexpected end of file reached in data file.
I think I am not specifying the source of the data correctly, but I am not sure how to use both the correlation matrix and my original data in the syntax.
Thanks again,
itziar

Linda K. Muthen posted on Wednesday, August 24, 2011 - 1:04 pm

Use TYPE=TWOLEVEL and weighted least squares. The SAMPLE option of the SAVEDATA command will give you the pooled-within matrix.

See Example 13.1 for the correct way to read a covariance matrix. The sample size for a pooled-within matrix is the number of observations minus the number of clusters.

Itziar Familiar posted on Wednesday, August 24, 2011 - 2:37 pm

Dear Dr. Muthen,
Thanks for your response. I tried your suggestion and obtained a covariance matrix. However, I have categorical factor indicators. Is this correct? Shouldn't I obtain a correlation matrix instead?
Thanks!
itziar

Linda K. Muthen posted on Thursday, August 25, 2011 - 2:01 pm

With categorical outcomes, you can obtain only a correlation matrix. Please send your output and license number to support@statmodel.com.

Ashleigh M posted on Tuesday, March 04, 2014 - 4:36 pm

Hello,
I am doing a two-level EFA with categorical mood symptom data for 67 people, with a range of 1-23 (mean of 5.239) timepoints per person. I am hoping this will produce factors for how symptoms covary (within level), as well as how symptoms covary across people (between level).

In my two-level EFA, I found good fit for a few different models (RMSEA < .05). However, I get the following error message for other models:

NO CONVERGENCE. NUMBER OF ITERATIONS EXCEEDED. PROBLEM OCCURRED IN EXPLORATORY FACTOR ANALYSIS WITH 4 WITHIN FACTOR(S) AND 1 BETWEEN FACTOR(S). THE PROBLEM MAY BE RESOLVED BY USING THE STARTS OPTION TO GENERATE RANDOM SETS OF STARTING VALUES.

Should I increase starts to attempt convergence for these models, or change the input, which was:
TYPE = TWOLEVEL EFA 1 5 UW 1 5 UB;

Finally, a hypothetical question: if eventually I learn that the best fit is unrestricted between and 2 factors within, does this signify that within has good fit, but there is no underlying factor between? I've been reading a lot but still do not seem to understand the significance of unrestricted covariance.

Thank you very much

Bengt O. Muthen posted on Wednesday, March 05, 2014 - 9:26 am

I am not sure how your data structure is two-level. You say that Within is symptoms and Between is people, but why not then do a regular (1-level) factor analysis of the symptoms?

Ashleigh M posted on Wednesday, March 05, 2014 - 10:22 am

Thank you for your prompt response!
I have read that doing a 1-level analysis on these data is problematic as it ignores dependencies within data (symptom sets within individuals across time are no doubt nonindependent), and the structure is contaminated by two sources of variance, both within- and between-individuals. The following citation makes a case for using multilevel factor analysis with similar data.
Reise, S.P., Ventura, J., Nuechterlein, K. H., & Kim, K. H. (2005). An illustration of multilevel factor analysis. Journal of Personality Assessment, 84(2), 126-132.
Especially since the range of symptom sets per person is so large (1-23), I do not want to confound the factor structure by weighting some individuals more heavily than others and not accounting for the nested nature of the data.

Bengt O. Muthen posted on Friday, March 07, 2014 - 2:02 pm

I see. So Within is not symptoms but time (with Between being person). So the number of variables that you analyze is the number of symptoms.

Convergence problems for 4 within factors and 1 between factor may be due to negative residual variances with 4 within factors. Perhaps 3 factors is sufficient.

Ashleigh M posted on Friday, March 07, 2014 - 2:22 pm

Thank you very much, and I'm sorry my wording was misleading. I still have a lot to learn!
Now I need to decide between the models I have, but I'm struggling with understanding unrestricted covariance. One of my models with good fit has unrestricted within and 3 factors between; does this signify that between has good fit, but there is no underlying factor within? I've been reading a lot but still do not seem to understand the significance of unrestricted covariance.

Thank you so much for all your help!

Bengt O. Muthen posted on Friday, March 07, 2014 - 4:47 pm

I think you are more interested in your level-2 factor structure since that concerns co-variation among the variables across subjects. So an unrestricted Within could make sense in that it focuses the analysis on Between. Having fit this model, you can then see if you significantly worsen the fit when trying 1-m factors on Within, keeping Between at 3 factors.

Ashleigh M posted on Wednesday, March 19, 2014 - 2:21 pm

Thank you very much, that was helpful. My unrestricted within and 3 factors between works better given what you suggested. However, I would now like to choose between two models that have great fit indices and are supported by theory (unrestricted within and 2 factors between, as well as unrestricted within and 3 factors between). I learned I cannot use the DIFFTEST option with EFA, so is there anything else you would suggest?

Thank you!

Bengt O. Muthen posted on Wednesday, March 19, 2014 - 4:02 pm

You can use BIC if ML estimation is used.
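
As a reminder of how such a comparison works, here is a small Python sketch using the standard definition BIC = -2 logL + k ln(n), where smaller values are preferred. The loglikelihoods, parameter counts, and sample size below are hypothetical placeholders, not results from this thread; in practice the values would be read from the ML output of each model.

```python
import math


def bic(loglik, n_parameters, n_obs):
    """Bayesian information criterion: -2*logL + k*ln(n); smaller is better."""
    return -2.0 * loglik + n_parameters * math.log(n_obs)


# Hypothetical values for illustration only; not taken from the thread.
bic_2f = bic(loglik=-1500.0, n_parameters=40, n_obs=300)
bic_3f = bic(loglik=-1490.0, n_parameters=50, n_obs=300)
preferred = "2 factors" if bic_2f < bic_3f else "3 factors"
```

In this made-up case the three-factor model gains little in loglikelihood relative to its ten extra parameters, so the two-factor model has the smaller BIC. The comparison requires a loglikelihood, which is why it depends on ML estimation.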

Ashleigh M posted on Wednesday, March 19, 2014 - 4:20 pm

My data are dichotomous, so the estimator WLSMV is the default. Thank you

Bengt O. Muthen posted on Wednesday, March 19, 2014 - 4:23 pm

Then I don't know what to suggest, except present both solutions.

Ashleigh M posted on Wednesday, March 19, 2014 - 4:28 pm

I will do that, thank you very much!

JW posted on Tuesday, July 08, 2014 - 8:41 am

Hi,

I have data collected across 27 teams and I would like to run a CFA on the 22 measures collected.

I initially ran the CFA without clustering and then after seeing this thread decided to account for it by specifying

cluster = team;

and

analysis:

estimator = mlr;
type = complex;

However, I receive the warning message below but I must say that the estimates are the same as when I did not account for clustering and that overall the figures are sensible.

Should I worry? If so, could you please suggest what to do?

THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE
TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE
FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING
VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE
CONDITION NUMBER IS -0.299D-16. PROBLEM INVOLVING THE FOLLOWING PARAMETER:
Parameter 27, B1 BY BWEM_TOT

THIS IS MOST LIKELY DUE TO HAVING MORE PARAMETERS THAN THE NUMBER
OF CLUSTERS MINUS THE NUMBER OF STRATA WITH MORE THAN ONE CLUSTER.

Linda K. Muthen posted on Wednesday, July 09, 2014 - 5:01 pm

Independence of observations is at the cluster level. We are just reminding you of that. This is probably not a problem given what you describe.
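
The condition quoted in the warning can be checked by hand. The Python sketch below counts clusters minus the number of strata that contain more than one cluster and compares the result with the number of free parameters. Treating an unstratified sample as a single stratum is an assumption of this sketch, and the function names are hypothetical.

```python
def independent_units(clusters_per_stratum):
    """Total clusters minus the number of strata that have more than one cluster."""
    total_clusters = sum(clusters_per_stratum.values())
    multi_cluster_strata = sum(1 for n in clusters_per_stratum.values() if n > 1)
    return total_clusters - multi_cluster_strata


def exceeds_limit(n_parameters, clusters_per_stratum):
    """True when the model has more parameters than independent units."""
    return n_parameters > independent_units(clusters_per_stratum)


# 27 teams and no STRATIFICATION variable; treating the whole sample as a
# single stratum is an assumption of this sketch.
teams = {"all": 27}

# Design from the start of the thread: 42 strata with 2 clusters each.
survey = {s: 2 for s in range(1, 43)}
```

Under that assumption, 27 teams give 26 independent units. The warning flags parameter number 27, so the model has at least 27 free parameters, which is consistent with the message being triggered.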

JW posted on Thursday, July 10, 2014 - 6:35 am

Thank you!

Yuan Zhang posted on Monday, June 12, 2017 - 5:31 pm

Hi Linda and Bengt:

Working with survey data and conducting a 3-factor CFA, I'm puzzled by the inconsistent RMSEAs in the output for the same models with different survey specifications.

Model A has sampling weight, cluster, and strata specified and Model B has sampling weight and replicate weights specified; all other specifications remain the same. The RMSEAs are observed to increase in Model B (e.g., from 0.052 to 0.067). Could you advise whether the increased RMSEA in Model B is something that I should be concerned about? Did I misspecify anything?

Appreciatively,
Yuan

Partial code (due to the post size limit) follows:
title: Model_A;
data: file=XXXX;
variable: (omitted due to post size limit)
Missing are all (999);
weight = s_wgt;
strat = strata;
cluster=cluster;
analysis: type=complex;
model: int BY ...;
eff BY ...;
belief BY ...;
____________________________________

title: Model_B;
data: file=XXXX;
variable: (omitted due to post size limit)
Missing are all (999);
weight = s_wgt;
repweight=jkn_1-jkn_150;
analysis: type=complex;
repse=JACKKNIFE2;
model: int BY ...;
eff BY ...;
belief BY ...;

Tihomir Asparouhov posted on Tuesday, June 13, 2017 - 2:36 pm

In Mplus version 8 we do not compute RMSEA with replicate weights. It should not have been printed in the earlier versions either. Use the RMSEA from the run without replicate weights.

Yuan Zhang posted on Saturday, June 17, 2017 - 1:08 pm

Thanks, Tihomir! For your records, I used Mplus 7 and did get an RMSEA.

Yoosun Chu posted on Sunday, August 06, 2017 - 12:18 pm

Hello,
I am running two-level EFA with complex data and my indicators are ordinal variables.

I got the following message:
NO CONVERGENCE. NUMBER OF ITERATIONS EXCEEDED. PROBLEM OCCURRED IN EXPLORATORY FACTOR ANALYSIS WITH 4 WITHIN FACTOR(S) AND UNRESTRICTED BETWEEN COVARIANCE. THE PROBLEM MAY BE RESOLVED BY USING THE STARTS OPTION TO GENERATE RANDOM SETS OF STARTING VALUES.

I write the analysis part like this:
Analysis: TYPE = TWOLEVEL EFA 4 4 UW* UB*;

Also, although I use WLSMV estimation, the estimation takes quite long, more than one hour. My sample size is almost 10,000. Is there any other way to reduce the estimation time?

Would you give some advice?

Bengt O. Muthen posted on Sunday, August 06, 2017 - 5:19 pm

Try STARTS = 10;

The computing time for WLSMV depends more on the number of items than on the sample size, so you must have a lot of items.

Yoosun Chu posted on Sunday, August 06, 2017 - 7:03 pm

Thank you. I tried that but had the same issue. I have 20 items. Is there any other option to achieve convergence? Thanks.

Linda K. Muthen posted on Monday, August 07, 2017 - 5:45 am

Please send the output and your license number to support@statmodel.com.

Joseph McFall posted on Thursday, October 12, 2017 - 9:11 pm

Hello. I ran a categorical twolevel CFA for one factor and found that one of the three unstandardized estimates was nonsignificant; however, the standardized estimate for this item was statistically significant; this pattern held true for both the between and within levels. This item had a larger estimate and much larger SE than the other items, but very similar variance. I also performed the same CFA with a COMPLEX analysis - the item had a significant loading, p<.001 and had similar unstandardized estimates and SE as the other items. What might explain such discrepancy?

Linda K. Muthen posted on Friday, October 13, 2017 - 7:03 am

Standardized and unstandardized parameters have different sampling distributions so their significance values may differ.