A question came up about references discussing identification in SEM and I thought I'd post my answer here.
No really comprehensive and thorough references come to mind, perhaps because it is difficult to say anything general about the topic. In practice, people accept a non-singular information matrix as evidence of empirical identification. But the following classic references have bits and pieces of the puzzle:
- Wiley, D (1973). The identification problem for structural equation models... In Goldberger & Duncan (eds) Structural Equation Modeling in the Social Sciences (Seminar Press, NY) 69-73. ---This talks about the column rank of the derivative matrix for covariance elements by parameters.
- Joreskog, KG (1977). Structural equation models in the social sciences. In Krishnaiah (ed) Applications of Statistics. Amsterdam: North-Holland Publishing, 265-287. ---Some general statements mostly.
- Joreskog, KG (1979). Author's Addendum (to Confirmatory Factor Analysis, the 1969 Psychometrika article). In Magidson (ed), Advances in Factor Analysis and Structural Equation Models. Cambridge, Mass: Abt Books. ---This has a little more detail than is typical.
- Bartholomew's book Latent Variable Models and Factor Analysis (Griffin, 2nd ed.) has some more general identifiability sections.
I invite others to add more good references.
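The column-rank idea mentioned for the Wiley reference can be sketched numerically. The following is only an illustration under stated assumptions: a one-factor model with the first loading fixed at 1, arbitrary made-up parameter values, and hypothetical helper names (`implied_cov_one_factor`, `numerical_jacobian`, `matrix_rank`, `is_locally_identified`). It differentiates the distinct covariance elements with respect to the free parameters and compares the rank of that derivative matrix with the number of parameters; it is not how Mplus itself checks identification.

```python
from typing import Callable, List, Sequence


def implied_cov_one_factor(params: Sequence[float], n_ind: int) -> List[float]:
    """Distinct elements (lower triangle) of the implied covariance matrix.

    params = [lambda_2..lambda_n, psi, theta_1..theta_n]; lambda_1 is fixed at 1.
    """
    loadings = [1.0] + list(params[: n_ind - 1])
    psi = params[n_ind - 1]
    thetas = params[n_ind:]
    elements = []
    for i in range(n_ind):
        for j in range(i + 1):
            value = loadings[i] * loadings[j] * psi
            if i == j:
                value += thetas[i]  # residual variance on the diagonal
            elements.append(value)
    return elements


def numerical_jacobian(func: Callable[[Sequence[float]], List[float]],
                       params: Sequence[float], step: float = 1e-6) -> List[List[float]]:
    """Central-difference derivatives of each covariance element by each parameter."""
    n_rows = len(func(params))
    jac = [[0.0] * len(params) for _ in range(n_rows)]
    for k in range(len(params)):
        up, down = list(params), list(params)
        up[k] += step
        down[k] -= step
        f_up, f_down = func(up), func(down)
        for r in range(n_rows):
            jac[r][k] = (f_up[r] - f_down[r]) / (2 * step)
    return jac


def matrix_rank(rows: List[List[float]], tol: float = 1e-6) -> int:
    """Rank by Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == n_rows:
            break
    return rank


def is_locally_identified(n_ind: int, params: Sequence[float]) -> bool:
    """Full column rank of the derivative matrix at the given parameter values."""
    jac = numerical_jacobian(lambda p: implied_cov_one_factor(p, n_ind), params)
    return matrix_rank(jac) == len(params)


# Two indicators: 3 covariance elements, 4 free parameters.
print(is_locally_identified(2, [0.8, 1.5, 0.5, 0.7]))
# Three indicators: 6 covariance elements, 6 free parameters.
print(is_locally_identified(3, [0.8, 1.2, 1.5, 0.5, 0.7, 0.6]))
```

A rank check of this kind is local: it speaks only to the parameter values at which the derivatives are evaluated, which is one reason general statements about identification are hard to make.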
Anonymous posted on Tuesday, April 16, 2002 - 3:29 pm
What methods does Mplus use to solve overidentified models / SEMs?
I've been running a rather large SEM with latent variables and have noticed that one or two portions of the model are overidentified. Mplus doesn't return any error messages, the model converges without difficulty, and all the parameter estimates appear to be well-behaved. I'm wondering if the estimates for the coefficients are valid.
I assume you mean underidentified, that is, not identified. A part of the model can be not identified if analyzed alone but can become identified when it is part of a larger model. For example, a factor model with two indicators is not identified, but a model with two factors, each with two indicators, is identified because the two factors borrow information from each other. If this does not answer your question, let me know. If Mplus does not complain about the model not being identified, it is most likely identified.
Anonymous posted on Tuesday, April 16, 2002 - 8:04 pm
I am using Hanushek and Jackson's definition of "overidentified".
In a "just identified" model, you have an equal number of equations and unknowns, thus you can get a unique solution for each unknown for all the information provided in the model.
In an "overidentified" model, you have more equations than unknowns, which at first blush appears to be a good thing -- you have many different ways of obtaining estimates for your unknown quantities. A problem occurs with overidentified models because with sample data, the errors and variances are not the same across variables. Recommended ways of obtaining solutions include two-stage least squares (TSLS), indirect least squares (ILS), etc.
Thus, I'm wondering how Mplus obtains parameter estimates in the overidentified case.
A model with more unknowns (parameters) than equations (sample statistics) is not identified and will not be estimated by Mplus. A model with fewer unknowns than equations will, provided it is identified, be estimated using maximum likelihood or weighted least squares.
Jason posted on Thursday, November 06, 2003 - 10:15 am
Very simple question from a new user. If I have a single indicator of a construct to be used in a larger SEM mediation model that has other latent constructs with multiple indicators, what is the best way to identify it? I tried fixing the factor loading of the single indicator to "1", but the model said it could not estimate the error variances. Suggestions?
You do not need to create a latent variable if you have a single indicator. Just use the observed variable in the analysis and Mplus will take care of it. The only reason you would want to create a latent variable behind a single indicator is if you wanted to fix the residual variance to a value that reflects the reliability of the measure.
Following on from this last posting, I was wondering if you could tell me how one goes about fixing the residual variance to reflect a previous estimate of reliability. I have tried the following (fixing error variance at 0.3) and it does not seem to work (model does not converge):
where a is the error variance in y chosen as a = (1 – reliability) * sample variance
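The arithmetic behind that formula can be sketched as follows. The function name and the numbers are purely illustrative assumptions (a reliability of 0.8 and a sample variance of 2.5 are made up, not taken from the posting). The point is that the fixed value is on the scale of the indicator's variance, so it should come out smaller than the sample variance.

```python
def single_indicator_residual_variance(reliability: float, sample_variance: float) -> float:
    """Residual variance to fix for a single indicator: (1 - reliability) * sample variance."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if sample_variance < 0.0:
        raise ValueError("sample variance cannot be negative")
    return (1.0 - reliability) * sample_variance


# Hypothetical values: reliability 0.8, sample variance 2.5.
a = single_indicator_residual_variance(0.8, 2.5)
print(round(a, 4))  # roughly 0.5, i.e. 20% of the sample variance
```

If the value being fixed is larger than the sample variance of the indicator, the implied factor variance would have to be negative, which is one possible source of estimation trouble worth ruling out.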
Anonymous posted on Thursday, March 04, 2004 - 8:57 pm
I’d like to pick up the discussion of singular information matrices and identifiability again.
You said above that the model is probably identified if Mplus does not report any error messages, but I am wondering what types of error messages to look for/will come up when a model is not identified.
1. Is it correct to conclude that even if a model estimation terminates normally and yields fit statistics, class counts and memberships, model results (e.g., thresholds and/or proportions, the regression model part, etc.) … WHENEVER you get an error message in Tech11 that states that the information matrix with one less class is singular, this is always empirical evidence that the model is unidentified?
2. If this statement is true, when I encountered this error message I noted that Mplus still reported Estimates, S.E., and Est./S.E. when the model was not identified, but did not report Std or StdYX. Is the absence of values for Std and StdYX in the output another way to tell that a model is not identified?
3. Is Tech11 the only place in the output that indicates whether the information matrix with one less class is singular (or that indicates in some way that the model is not identified)? If so, is it true that the only way to determine that the information matrix is non-singular is to make sure that there are no error messages in Tech11?
Thanks in advance. This discussion has been very helpful.
bmuthen posted on Friday, March 05, 2004 - 12:21 am
1. The usual non-identifiability message in the regular results section refers to the k-class model - use that. Tech11 refers to the k-1 class model.
2. My answer to 1. also answers 2 - SE's are not reported when the model (the k-class model) is not identified.
3. A better way to check if the k-1 class model is identified is to run it (without Tech11).
Anonymous posted on Thursday, September 30, 2004 - 5:49 pm
I have the following model and I am using Mplus version 3.
A ON C D;
B ON C D;
E ON A B;
F ON A B;
G ON A B;
H ON A B E F G;
Modification indices indicate that the model can be improved by adding
A WITH B;
E WITH F;
F WITH G;
Now, looking at this model: is it identified after correlating these errors of the mediating variables?
You should add one term at a time because the modification indices for the other terms may change when a term is added. Mplus will notify you if the model is not identified.
Anonymous posted on Tuesday, November 16, 2004 - 3:24 am
I have two questions-
First, I am interested in estimating a model where: y1* = w1-w5+u1 where y* is a categorical variable while also estimating: y2 = y1*+x1-x5+u2. There is some concern that the first equation may be over-specified. Is there a way to test this?
Second, assuming that y1* was continuous, in which of the Tech files would I find the data to compute a Hausman test to see if I needed to instrument y1* at all?
bmuthen posted on Tuesday, November 16, 2004 - 6:19 pm
Let me ask some questions so I understand this setup - are the w1 and w5 exogenous variables and if so what do you mean by over-specified? Remind me, the Hausman test concerns left-out exogenous variables that might make residuals correlate with the included exogenous variables, right? I am not sure the Mplus output has the information needed for this test (but it is a test we are interested in adding).
Anonymous posted on Monday, April 18, 2005 - 11:48 pm
Can someone please help? How can the model be unidentified when I have 52 degrees of freedom? See below
Computation of degrees of freedom
Number of distinct sample moments = 91
Number of distinct parameters to be estimated = 39
Degrees of freedom = 91 - 39 = 52
The model is probably unidentified. In order to achieve identifiability, it will probably be necessary to impose 3 additional constraints.
The (probably) unidentified parameters are marked.
Can you send the full output to firstname.lastname@example.org along with your license number? This is not enough information to answer your question. Note also that a model with positive degrees of freedom may still not be identified if a part of the model is not identified.
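The counting in the quoted output, and the caveat about positive degrees of freedom, can be sketched as follows. Everything here is an illustrative assumption: the helper names are hypothetical, 13 observed variables with no mean structure is simply the count consistent with 91 moments, and the six-variable model (a four-indicator factor plus an uncorrelated two-indicator factor) is made up to show the point, not taken from the poster's model.

```python
def distinct_moments(n_observed: int, with_means: bool = False) -> int:
    """Number of distinct variances and covariances (plus means, if requested)."""
    moments = n_observed * (n_observed + 1) // 2
    return moments + n_observed if with_means else moments


def degrees_of_freedom(n_observed: int, n_free_parameters: int,
                       with_means: bool = False) -> int:
    """Sample moments minus free parameters (a necessary count, not a proof)."""
    return distinct_moments(n_observed, with_means) - n_free_parameters


# The quoted output: 91 moments matches 13 observed variables without means.
print(distinct_moments(13), degrees_of_freedom(13, 39))

# Hypothetical model: factor A with 4 indicators, factor B with 2 indicators,
# the two factors uncorrelated, first loading of each factor fixed at 1.
params_a = 3 + 4 + 1   # free loadings + residual variances + factor variance
params_b = 1 + 2 + 1
overall_df = degrees_of_freedom(6, params_a + params_b)

# With uncorrelated factors, only B's own 3 moments inform B's 4 parameters.
block_b_df = degrees_of_freedom(2, params_b)
print(overall_df, block_b_df)
```

In this made-up case the overall count is comfortably positive while the two-indicator block has more parameters than the moments that carry information about it, which is the situation the reply warns about.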
Matt Moehr posted on Monday, December 04, 2006 - 1:42 pm
The data I have come from a study where babies were given 3 different toys for sixty seconds each. The babies' behavior was timed and coded as Focused Attention (f), Casual Attention (c), or Not Looking (n). A sub-sample of the babies was coded by two raters and a reliability score, given as percent agreement, was calculated for each of the three categories: Focused = 55.8% Casual = 79.6% Not Look = 95.8%
I found Linda's post from 1/23/2004, where she recommends using the formula (1-reliability)*sample variance to specify the error variance. My question is, can this same method be applied to latent variables with more than one measure? I tried the model below, but some of the results are confusing:
ANALYSIS: type = general missing h1; estimator = ml;
MODEL: mood1 BY f1@1 c1 n1*-1; mood2 BY f2@1 c2 n2*-1; mood3 BY f3@1 c3 n3*-1;
You would only want to fix the residual variance to correct for reliability when you have a single indicator. When you have a factor with several indicators, this captures the unreliability. You should remove the following statements:
Matt Moehr posted on Monday, December 04, 2006 - 8:50 pm
My strategy on this analysis was to run three separate models with just one category of attention as the indicators. This made 3 simple hidden Markov models, but I had to build in some type of assumption because they were all underidentified. I made the choice to bring in the inter-rater reliability because it seemed like the least restrictive assumption I could make. Then I began to wonder if I could make a model that used all three indicators for each trial, but I could never get this model to converge. When I went back and fixed the error variances, the model estimated just fine and most of the structural and measurement paths looked great. However, the standardized error variances were no longer equal to the inter-rater reliability, so I agree that this model is probably misspecified. Do you have any suggestions to improve model convergence in hidden Markov models? Is there an inherent problem in using multiple indicators, which are almost perfectly (negatively) correlated due to the mutually exclusive coding groups?
I am a little confused by your mention of both factor analysis and hidden Markov models, but let me focus on the latter. With a single observed binary indicator at each of 3 time points, a hidden Markov model is identifiable if you estimate measurement error that is constrained to be invariant across time. If you have problems with this, send input, output, data and license number to email@example.com.
Yes, you don't want to use multiple indicators created from mutually exclusive coding groups. Such variables could be treated as nominal and then LTA (hidden Markov) is possible.
Matt Moehr posted on Friday, December 08, 2006 - 4:36 pm
I'm also a little confused by the use of factor analysis and hidden Markov, but I was handed this data long after the study was designed and executed. The theory we're working with says that 6-month-old babies will show habituation to a novel stimulus, in this case new toys. However, interwoven with the habituation, each baby should show early signs of innate temperament, or "personality" if you like. The main goal of the project is to measure temperament at 6 months and relate it to follow-up measures of temperament and psychosocial adjustment at age 24 months. Seems like a good idea, but my models don't like the variables for time-spans of attention. I think this is because they are mutually exclusive categories. For instance, if a baby spends a lot of time "Not Looking" during the trial, that baby is going to spend less time in "Focused Attention". There are only sixty seconds in each trial, so more time in one category means less time in another. Simple correlations among the residuals could account for this, but then the habituation that occurs between trials would be lost. I think what I'm trying to do is a basic LTA, but with a panel analysis (or hidden Markov?) stuck on the other side of the variables.
I'll try to clean up my syntax and send you all of the files.
You can see formulas 167, 168, and 169 in Technical Appendix 8 which is on the website. Basically, an information matrix that is used in the computation of the standard errors is singular. It does not need to be inverted for the MLR estimator so we provide standard errors. This is a warning that we are not sure if they are accurate when there are fewer clusters than parameters. The effect of this matrix being singular has not been studied.
Hi, I am using the following model in a paper:
…
CATEGORICAL IS Y1 Y2;
ANALYSIS: PARAMETERIZATION=THETA; TYPE=MEANSTRUCTURE;
MODEL:
Y1 ON Y2 Y3 X1 X2 X3;
Y2 ON Y3 X1 X4 X5 X6 X7;
Y3 ON Y1 X1 X4 X5 X2 X3 X8 X9;
Y1, Y2 and Y3 are endogenous variables and X1 … X9 are exogenous variables.
The paper received an R&R from a major journal and one of the reviewers questioned whether MPLUS can solve the causation (feedback effect) between Y1 and Y3.
(1) The reviewer said, “I do not see how your model can be identified without some instruments for the variables “Y3” and “Y1.””
Do I really need instrumental variables? I think if the model cannot be identified, MPLUS will generate an error message (it didn’t in my case). If I don’t need instrumental variables, how should I respond to the reviewer?
(2) The reviewer said, “Your very schematic description of the model (you need to provide its graphic representation) does not give me any idea how your model is identified.”
For “graphical representation,” do you think the reviewer just wants a graph indicating possible causal paths between all the variables?
(3) It seems that the reviewer wants more technical discussion, so could you please give me suggestions on methodological details of model identification in MPLUS (for the procedure I used)?
(1) The reciprocal interaction between y1 and y3 is identified because your model fulfills the rule of having at least one x variable that influences y1 and not y3 and at least one x variable that influences y3 and not y1. I think Bollen's SEM book covers this. Empirically, you are right that Mplus would complain if the model were not identified.
(2) Draw the model and also refer to a section in some SEM book for identification.
(3) Identification works the same way in Mplus as in other SEM software, so Bollen's book is a good resource.
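For a response to the reviewer, a related textbook check is the order condition from the simultaneous-equations literature: each equation should exclude at least as many exogenous variables as it has endogenous variables on its right-hand side. The sketch below applies that count to the three ON statements in the posting. It is only an illustration under assumptions: the function name is hypothetical, the condition is necessary rather than sufficient, it is not the exact rule quoted in the reply, and the categorical outcomes are treated as if the equations were linear in underlying continuous variables.

```python
from typing import Dict, List

# Right-hand sides copied from the MODEL statements in the posting.
EQUATIONS: Dict[str, List[str]] = {
    "Y1": ["Y2", "Y3", "X1", "X2", "X3"],
    "Y2": ["Y3", "X1", "X4", "X5", "X6", "X7"],
    "Y3": ["Y1", "X1", "X4", "X5", "X2", "X3", "X8", "X9"],
}


def order_condition(equations: Dict[str, List[str]]) -> Dict[str, dict]:
    """Per equation: excluded exogenous count versus endogenous regressors."""
    endogenous = set(equations)
    exogenous = {v for rhs in equations.values() for v in rhs} - endogenous
    report = {}
    for outcome, rhs in equations.items():
        included_endogenous = [v for v in rhs if v in endogenous]
        excluded_exogenous = sorted(exogenous - set(rhs))
        report[outcome] = {
            "endogenous_regressors": len(included_endogenous),
            "excluded_exogenous": excluded_exogenous,
            "satisfied": len(excluded_exogenous) >= len(included_endogenous),
        }
    return report


for outcome, info in order_condition(EQUATIONS).items():
    print(outcome, info)
```

Listing the excluded exogenous variables for each equation in this way is one concrete piece of "technical discussion" that could accompany the path diagram; the fuller treatment is in the SEM textbooks referred to above.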
Dear Linda and/or Bengt, I know that if a portion of a model is under-identified then the entire model is under-identified (e.g., a second-order CFA with two first-order factors and one second-order factor is always under-identified in the absence of a constraint on the higher-order loadings because the second-order portion of the model is underidentified regardless of whether the first-order portion is identified or over-identified). What I am wondering is whether an overall model is still overidentified if a portion of it is just identified? As one example, consider a second-order CFA with three first-order factors and one second-order factor and four observed indicators of each first-order factor. The first-order portion of the model is over-identified but the second-order portion is just identified. Is the overall model over-identified? Thanks! Rick Zinbarg
I've fitted 2 different models but I think I've obtained the same results. Are they equivalent? I ask because I expected to find some difference. (All observed variables are continuous and normally distributed.)
model_1 -> f1 BY y1 y2 y3 x;
model_2 -> f1 BY y1 y2 y3; x ON f1;
I thought that if the error distribution of the "structural part" (x ON f1) is different to the error-distribution of the "measurement part" then both models could be different, but if the error distribution is the same then it makes no difference, and both models are equivalent. Is it correct?
f1 BY x is the same as x ON f1 so these models are the same.
Maša Vidmar posted on Tuesday, June 02, 2009 - 12:32 am
I am running a CFA with several constructs, one of which has only 1 indicator. I fixed error variance to (1 – reliability) * sample variance. Would you happen to know the reference for that formula? I tried looking in some multivariate and SEM textbooks, but I could not find it. Thx!
I don't know offhand. I would think the Bollen book might discuss this.
Maša Vidmar posted on Tuesday, June 02, 2009 - 11:31 pm
I cannot get the Bollen book (Structural equations with latent variables)...it is not on Psyinfo nor in any library in our country. I also looked at his paper ''Latent variables in psychology and the social sciences'', no success. Any other suggestions? Thx.
Maša, as Linda says, Bollen's book, chapter 5 ("The consequences of measurement error") is the standard reference. Other easy-to-find ebook references are:
- Schumacker & Lomax: A Beginner's Guide to Structural Equation Modeling (2nd ed.), pp. 198-199
- Kline: Principles and Practice of Structural Equation Modeling (2nd ed.), pp. 229-231
(Or try Google with: latent variable with single indicator.)
Vlad posted on Wednesday, October 21, 2009 - 2:48 pm
Hello, I am estimating a model with 3 classes and always get this message from Mplus: WARNING: WHEN ESTIMATING A MODEL WITH MORE THAN TWO CLASSES, IT MAY BE NECESSARY TO INCREASE THE NUMBER OF RANDOM STARTS USING THE STARTS OPTION TO AVOID LOCAL MAXIMA. I've tried to use starts but nothing has changed. Any suggestions?
Hi, I have a question about an underidentified model with 2 indicators. I equated both loadings for this model. However, the results give me one loading=1.00, se=0.000 so that est/se=999.00 and p=999. The other loading is 1.42 with se=.056, est/se=25.35. My question: what does est/se=999 and p=999 mean?
After reading through some of the earlier postings, it looks as though a globally identified model will estimate even if there are some locally non-identified parameters. Is that correct, or is there a warning if local under-identification is present?
If there is no warning, does this mean that it is safe to interpret all model parameters or the locally identified parameters?
Thank you for your response. Yes, I was asking about how Mplus would handle a situation in which some parameters are not identified and others are. From your response, it sounds like Mplus will notice this and give a warning message in almost all circumstances. Is that correct?
Kip Sorgen posted on Tuesday, February 08, 2011 - 4:37 pm
I have a latent construct with three indicators that is just identified in CFA and explains 92% of the variance. What are some considerations of having a factor with zero degrees of freedom when including it in the full SEM model?
Hi, I am a very new user of Mplus and need a question answered about identification. I have a latent factor with 3 indicators that will be part of a larger SEM. I wanted to check my measurement models before I added them to the larger model. This is my code:
My model has 0 degrees of freedom for the Chi-Square Test of Model Fit, so I don't get estimates for the test of fit or RMSEA, and my TLI/CFI=1. Does this mean I need to fix my factor variance? And if so, how do I do that? Thanks!
A factor with three indicators is just-identified. Model fit cannot be assessed for a factor with fewer than four indicators.
Cecily Na posted on Wednesday, October 31, 2012 - 5:26 pm
Hello, I have a model with two latent factors and other observed covariates. Each latent factor has two indicators. One latent factor together with other covariates cause the other latent factor. No errors are correlated. Is this model identified? How do these two factors borrow information from each other? Thanks a lot!
Sounds like an identified model given that a model with 2 factors that are correlated and each has 2 indicators is identified. The covariances between the 2 sets of factor indicators make it identified. And your model also has covariates.
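How the cross-factor covariances do the identifying can be sketched with a small recovery exercise. This is an illustration under assumptions, not Mplus output: two correlated factors, indicators y1, y2 on the first and y3, y4 on the second, the first loading of each factor fixed at 1, made-up parameter values, and hypothetical function names. The covariances between the two sets of indicators give the factor covariance and the free loadings, and the within-factor covariances then give the factor variances.

```python
from typing import Dict, List


def implied_covariances(l2: float, l4: float, psi1: float, psi2: float,
                        phi: float, thetas: List[float]) -> List[List[float]]:
    """Model-implied 4x4 covariance matrix for y1..y4."""
    loadings = [(1.0, 0.0), (l2, 0.0), (0.0, 1.0), (0.0, l4)]
    factor_cov = [[psi1, phi], [phi, psi2]]
    sigma = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            value = sum(loadings[i][a] * factor_cov[a][b] * loadings[j][b]
                        for a in range(2) for b in range(2))
            if i == j:
                value += thetas[i]  # residual variance
            sigma[i][j] = value
    return sigma


def recover_parameters(sigma: List[List[float]]) -> Dict[str, object]:
    """Solve for the parameters from covariances; needs a nonzero factor covariance."""
    phi = sigma[0][2]                      # cov(y1, y3)
    if phi == 0.0:
        raise ValueError("factors are uncorrelated: nothing to borrow")
    l2 = sigma[1][2] / phi                 # cov(y2, y3) / cov(y1, y3)
    l4 = sigma[0][3] / phi                 # cov(y1, y4) / cov(y1, y3)
    if l2 == 0.0 or l4 == 0.0:
        raise ValueError("a zero loading leaves a factor variance undetermined")
    psi1 = sigma[0][1] / l2                # cov(y1, y2) / l2
    psi2 = sigma[2][3] / l4                # cov(y3, y4) / l4
    thetas = [sigma[0][0] - psi1, sigma[1][1] - l2 ** 2 * psi1,
              sigma[2][2] - psi2, sigma[3][3] - l4 ** 2 * psi2]
    return {"l2": l2, "l4": l4, "psi1": psi1, "psi2": psi2,
            "phi": phi, "thetas": thetas}


# Made-up population values, then recovery from the implied covariances alone.
sigma = implied_covariances(0.8, 1.2, 1.5, 2.0, 0.6, [0.5, 0.7, 0.4, 0.9])
print(recover_parameters(sigma))
```

The recovery breaks down when the covariance between the factors is zero, which is the algebraic counterpart of a two-indicator factor being unidentified on its own.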
Paula Vagos posted on Friday, December 14, 2012 - 10:12 am
Hello, I am trying to test a second-order model with two first-order factors, each with 8 indicators. I fixed the paths between the two first-order factors and the higher-order factor. Would this model be possible/identified?
If both factor loadings are fixed, then the model would be identified. But I wonder what the point of this is. If they are fixed at one, the one parameter that is estimated is the factor variance which is the covariance between the two factor indicators.