IRT Models in Mplus
Mplus Discussion > Categorical Data Modeling >
 gianbattista FLEBUS posted on Monday, November 22, 1999 - 3:37 pm
I would like to understand what the difference is between the Mplus estimation method for categorical items and the Rasch model for scale (or test) construction. In other words, are the several models (rating scale, partial credit, and equal dispersion models) extraneous to the Mplus estimation method or not? Thanks for the answer.
 Bengt O. Muthen posted on Wednesday, November 24, 1999 - 3:36 pm
With a single factor, Mplus estimates a 2-parameter normal ogive model, to use IRT terms. A transformation to the IRT parameters a and b is given in Muthen, Kao, Burstein (1991) - see the IRT reference list under the Mplus Discussion topic References. Guessing is not taken into account. This means that results are close to those of the 2-parameter logistic model, using the usual conversion factor to make probit results close to logit results. Differences also arise because Mplus uses weighted least squares (WLS) and not maximum likelihood (ML). Mislevy had an article in the Journal of Educational Statistics in 1986 showing that WLS and ML gave close results in multi-factor situations. A Rasch model can be estimated by holding loadings equal across items. Differences with Rasch programs would again be due to WLS versus ML; note also the custom of average difficulty = 0 in Rasch. I don't think the Partial Credit or Rating Scale models in line with Masters and Andrich can be handled in Mplus, but maybe other users would know. Mplus handles ordered polytomous outcomes, which is in line with Samejima's graded response model.
 John Fleishman posted on Thursday, December 02, 1999 - 2:35 pm
I'd like to continue the discussion of IRT-like analyses in Mplus. Specifically, I have analyzed the same data set using Mplus, R. McDonald's NOHARM program, and the BILOG-MG IRT program. The results differ across programs. My question is whether I should worry about such differences, or if they're within the range of what one would expect, given differences in estimation procedures (e.g., ML vs. WLS).

The details are as follows: I'm analyzing 11 dichotomous items, with a sample size of 3091. In Mplus, I specified Model: F1 by I1* I2-I11;
F1 @ 1; (thereby constraining the factor variance to be 1.0).

I won't overwhelm everyone with all the results, but here are the estimates for two items:

Item #1 Mplus WLSMV: loading .839 threshold .289
Mplus WLS: loading .950 threshold .229
NOHARM: loading .814 threshold -.289
BILOG: slope 1.674 threshold .295

Item #2 Mplus WLSMV: loading .515 threshold -.029
Mplus WLS: loading .655 threshold -.126
NOHARM: loading .513 threshold .029
BILOG: slope .450 threshold -.087

My understanding is that estimates from Mplus can be transformed into the parameterization used in BILOG by the following:
BILOG slope = loading/(sqrt(1-loading**2))
BILOG Threshold = threshold/loading

For item #1, this calculation yields 1.542 for the slope/loading, while BILOG reports 1.674. For the threshold, the formula yields .344 while BILOG says .295.
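The conversion quoted above can be checked numerically. A minimal sketch (the function name is mine; the formulas are the ones given in the post):

```python
import math

def mplus_to_bilog(loading, threshold):
    """Convert a Mplus probit loading/threshold pair (y* variance
    standardized to 1) to BILOG-style discrimination (a) and
    difficulty (b), using the formulas quoted in the post."""
    a = loading / math.sqrt(1.0 - loading**2)  # discrimination
    b = threshold / loading                    # difficulty
    return a, b

# Item #1 from the post: Mplus WLSMV loading .839, threshold .289
a1, b1 = mplus_to_bilog(0.839, 0.289)
print(round(a1, 3), round(b1, 3))  # 1.542 0.344, vs BILOG's 1.674 and .295
```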

Again, the issue for me is whether such differences are cause for concern, or if that's just the way it is. Thanks in advance to anyone who can provide illumination!
 Bengt O. Muthen posted on Monday, December 06, 1999 - 2:53 pm
I think the magnitude of the differences between BILOG and Mplus is not a cause for concern. They are due to using ML instead of WLS and using logit instead of probit (the constant 1.7 in the logit gives only an approximate closeness to the normal). Note also that the item response curves may be close even when the thresholds and the slopes are a bit different (these are correlated quantities). As for thresholds, Mplus WLS and Mplus WLSMV differ a bit due to the difference in the weight matrix. With WLS the thresholds are not simply inverse normal transforms of the sample proportions, but are a function of the weight-matrix covariances between the thresholds and the correlations and of the fit of the correlations; see Muthen (1978) Psychometrika, equation (20).
 Rich Jones posted on Friday, April 06, 2001 - 11:11 am
Is there a standard method for quantifying the 'bias' in the estimate of mean differences in latent ability according to group membership when measurement non-invariance (DIF) is not taken into account?

Here's what I tried. I built a single-group MIMIC model and detected significant direct effects yada-yada-yada. Then I estimated a mis-specified model with previously detected direct effects fixed to 0.

Now comparing overall model fit isn't really interesting because I already know those direct effects significantly improve model fit. What I'm really interested in is how wrong would I be about inferences made on subject ability if I ignored the direct effects?

I chose to compare the Std parameter estimates for the indirect effects for each model, because the residual variances of ability are different across models and the background variable is binary.

The full model returns a Std indirect effect of 0.490, the mis-specified model 0.498. Thus I conclude that ignoring bias results in over-estimating group differences by a whopping 1.6%.

So does that seem to make sense? Anyone with other ideas?
 Bengt O. Muthen posted on Friday, April 06, 2001 - 6:53 pm
I think your approach makes sense. In addition, with several x's, you could look at the standardized mean differences for different groups, e.g. consider how differently, in a standardized metric, black females compare to white females with and without a direct effect. As a complement, I guess one could also ask what happens to the factor score for a given individual with and without a direct effect.
 Anonymous posted on Thursday, April 12, 2001 - 11:05 am
Based on the above discussion and Muthen and Christoffersson (1981), I was wondering why it might still make sense to perform an Mplus multigroup CFA for categorical variables if one doesn't render it as a 2P IRT model.

Unless the scale factors are constrained across groups, it doesn't seem that differences in group means and variances are necessarily meaningful, since the definition of the measures could still be quite different for each group in spite of equal thresholds and loadings.
 Bengt O. Muthen posted on Saturday, April 14, 2001 - 5:59 pm
Differences across groups in factor means and variances are meaningful if thresholds and loadings have (partial) invariance across groups even if scale factors are different across groups. A useful analogy is with continuous indicators where in order to compare factor means and variances across groups you do not need invariance across groups of error variances in addition to invariant intercepts and loadings. With categorical outcomes, allowing scale factors to differ across groups in the presence of invariant thresholds and loadings can be thought of as allowing noninvariant error variances.
 Anonymous posted on Monday, April 16, 2001 - 10:46 am
Within the Mplus CFA framework then is there no meaningful way of "correcting" for DIF by imposing a direct effect between a covariate and an indicator (as there would be if the CFA was rendered as an IRT) ?
 bmuthen posted on Monday, April 16, 2001 - 12:25 pm
I did not realize that you were asking about direct effect DIF modeling in our last message, but that does not change my answer. I conclude the opposite from our discussion: including a direct effect is a meaningful way to correct for DIF, just as it is in regular CFA with covariates. Please let me know how I can help us reach agreement on this issue.
 Anonymous posted on Monday, April 16, 2001 - 1:54 pm
My second question was a follow-up rather than a restatement of the first.

I am unclear as to how a direct effect between a covariate and an indicator adjusts for DIF for an Mplus CFA with categorical indicators. This may be due to confusion regarding the nature of scale factors in the Mplus CFA and the correspondence between the Mplus CFA and a 2P IRT.

The interpretation of adjusting for DIF in the case of a 2P IRT is straightforward: the DE adjusts for differences in item difficulties _within_ and _across_ (via the imposition of equal thresholds, loadings, and scale factors) groups of a multigroup model.

The extension to the Mplus CFA with categorical indicators approach is less clear: would DE's only adjust for differences in error variances _within_ groups of a multigroup model (and in so doing, adjust the loadings and thresholds as well -- which are _still_ constrained to be equal across groups even though the scale factors are allowed to vary) ?
 bmuthen posted on Tuesday, April 17, 2001 - 4:15 pm
First of all, there is no difference between the two-parameter normal ogive IRT model and the Mplus model.

To summarize my view, there are two ways to capture DIF in Mplus modeling: (1) CFA with covariates and (2) multi-group analysis. To me, DIF means that for a given item you have different item characteristics curves for different subject groupings and both approaches capture this.

In approach (1), DIF for a certain item is handled by allowing a binary (say) x variable to have a direct influence on the item in question, thereby making its threshold (difficulty) parameter different for the two groups. Direct effects are only needed for items with DIF. Scale factors are not involved here.

In approach (2), considering the same two groups, DIF is handled by a two-group analysis where the DIF item is allowed to have different threshold and loading for the two groups. So this is a more general DIF form. Scale factors are fixed at 1 for this item (free scale factors are used for items that are assumed to not have DIF).

For further details, please see the Muthen articles given on this web site under References-Categorical Outcomes-IRT, especially papers 35, 18, and 15. These references also show the relationships between Mplus and IRT parameterization; see also the IRT paper by McIntosh listed on the home page.
 Bill Farmer posted on Sunday, May 20, 2001 - 5:19 pm
I am trying to do a cross-validation of the model I've settled on for a particular set of all categorical data and was wondering how I would apply the same thresholds/loadings to the cross-validation sample to derive factor scores.
 Linda K. Muthen posted on Monday, May 21, 2001 - 4:26 pm
To get factor scores from the same model but on a different sample, fix all parameters to the values estimated in the first sample. These can be fixed using the @ sign. For example,


f1 BY y1@1 y2@.4 y3@.4;
y1@.2 y2@.3 y3@.12;
[y1$1@0 y2$1@.2 y3$1@.12];
 Bill Farmer posted on Friday, July 13, 2001 - 12:30 pm
If a one-factor model with categorical indicators can be transformed quite easily to yield a unidimensional logistic IRT model, what does this mean in the multidimensional case? Specifically, the majority of multidimensional IRT models that I have seen are compensatory in nature, with each dimension providing its own discrimination parameter to the item characteristic function (yielding an "a" vector) and a scalar-defined level of difficulty/threshold (see Ackerman, 1994). If we extend the same procedures that are used to transform the loading/discrimination and threshold parameters for the unidimensional case in Mplus to what we might expect to get if there were a BILOG/PARSCALE analog for multidimensional IRT, we end up with not only multiple "a" parameters as in the compensatory model, but also multiple thresholds. To me this suggests that the multidimensional model derived in Mplus would in effect be noncompensatory in nature. Is this in fact the case, or am I way off?
 bmuthen posted on Friday, July 13, 2001 - 5:57 pm
In Mplus, single-factor and multi-factor models have a single threshold for each binary item and this translates into a single "b" value in IRT. Please let me know if this doesn't answer the question.
 Bill Farmer posted on Monday, July 16, 2001 - 9:07 am
I was a bit unclear. In order to transform the Mplus derived parameter estimates to what would be expected with a logistic model we:

divide the Mplus loading by the square root of the quantity 1 - (Mplus loading squared) to get the analogous "a" parameter from BILOG/PARSCALE; and divide the Mplus-derived threshold by the Mplus-derived loading to get the analogous difficulty/threshold parameter.

For a one-dimensional model this is fairly straightforward - yielding only one discrimination parameter and one set of thresholds per item.

For a multidimensional item, however, if you do the transformation for the thresholds for each dimension, you end up with as many sets of transformed thresholds as you have dimensions.

Is there another way to approach the transformation issue for the multidimensional case?
 bmuthen posted on Monday, July 16, 2001 - 10:35 am
Good question. You are right about the division by the square root of the quantity 1 - (loading squared). Explicating what this quantity is makes the generalization to multiple factors clear. The quantity is the residual variance of y*, the continuous latent variable underlying the binary observed y. In the Mplus factor analysis with categorical y's, the y* variances are standardized to one. The loading squared inside the parentheses is the variance in y* explained by the factor when we have a single factor with variance one. In the more general case, the variance explained can be calculated in the usual way for continuous y's to cover cases with several factors that may be correlated and/or have variances different from one. So, for instance, with two correlated factors, the variance explained in item j equals V(f_1)*lambda_j1**2 + V(f_2)*lambda_j2**2 + 2*lambda_j1*lambda_j2*cov(f_1,f_2).
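The two-factor formula above, and the generalized discrimination it implies, can be sketched as follows (the function names and numeric values are illustrative, not from the thread):

```python
import math

def explained_variance_2f(l1, l2, v1=1.0, v2=1.0, cov12=0.0):
    """Variance in y*_j explained by two possibly correlated factors:
    V(f_1)*l1^2 + V(f_2)*l2^2 + 2*l1*l2*cov(f_1,f_2)."""
    return v1 * l1**2 + v2 * l2**2 + 2.0 * l1 * l2 * cov12

def discrimination(loading, explained):
    """Generalizes loading/sqrt(1 - loading^2): divide the loading by the
    square root of the residual y* variance (y* variance standardized to 1)."""
    return loading / math.sqrt(1.0 - explained)

expl = explained_variance_2f(0.6, 0.4, cov12=0.3)  # hypothetical loadings
a1 = discrimination(0.6, expl)  # "a" for the first dimension of this item
```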
 John Fleishman posted on Monday, July 16, 2001 - 12:35 pm
I'd like to return to the discussion of quantifying DIF for a scale. Several points were raised in the exchange between Rich Jones and Bengt back on April 6. I'd like to ask for a little more explication, if possible. The basic situation is one in which one runs a MIMIC model with DIF effects and a MIMIC model without such effects, to assess the overall magnitude of DIF.

1. Rich Jones proposed comparing standardized parameter estimates of the indirect effects for a MIMIC model with no DIF and a model that contained DIF effects (i.e., direct effects from the background variable to the item(s) in question). I assume that the standardized indirect effect would be the product of the (STD) effect of the background (x) variable on the factor times the (STD) loading of an indicator on the factor. My question: If one has several indicators, does it make sense to assess *overall* DIF for the set of indicators by looking at the *sum* of the loadings of the indicators times the effect of the x variable on the factor, and comparing these quantities in the DIF and no-DIF models?

2. Bengt also suggested that one could look at "standardized mean differences" for different groups, defined by the background factors (x variables). Perhaps it's the summer heat, but I'm not sure how one would calculate standardized mean differences from the Mplus output. Suppose one had dummy variables for each combination of background variables. Would one compare the magnitude of the (STD) effects of each dummy variable on the factor in the DIF and no-DIF models? Or does one calculate standardized mean differences in some other manner?

3. Bengt also suggested comparing factor scores in the DIF and no-DIF models. When comparing the mean factor scores for two groups differing in their background characteristics, would it make sense to incorporate some information on the within-group standard deviation of the estimated factor scores, to provide a scale for the factor score comparison (e.g., something analogous to a measure of effect size)?

4. Finally, suppose one had two background variables in the MIMIC model (e.g., age and gender). One could form dummy x variables for each age/gender combination. If one had 4 groups, one would use 3 dummy variables in the model. The question here is: How can one recover the predicted mean for the excluded group? In a standard regression, one would use the intercept, but there's no intercept in the Mplus formulation. Is there a way to use the overall predicted factor mean from the TECH 4 output for the purpose of recovering the mean of the excluded group? (And is there somewhere in the Appendices that provides the formula used to obtain the estimated factor mean in the TECH 4 output?)

These questions may have obvious answers; I'm just not seeing them right now. Thanks in advance for clarifying these points.
 Bill Farmer posted on Tuesday, July 17, 2001 - 2:45 pm
Thank you for the explanation of establishing the variance in an item explained by two correlated factors. In terms of the item thresholds, however, would we end up with multiple sets of thresholds per item (one set for each dimension), or is there a way to transform to one set?
 bmuthen posted on Tuesday, July 17, 2001 - 4:19 pm
Mplus has just one threshold for each item, even when there are several factors. This is because the threshold is defined on the scale of the latent response variable y*, and there is only one y*. In achievement testing, y* is the specific ability needed to solve a certain item correctly. This ability may consist of several factors.

The relationship between IRT parameterizations and Mplus is most clearly seen in P(y=1|f), the probability of y=1 given the factor (or factors) f, expressing this as a normal distribution function with argument arg, say. For a single theta in IRT,

arg = a(theta - b)

whereas in Mplus with a single factor f,

arg = (-tau + lambda*f)*c,

where the inverted value of c is the square root of the residual variance that we discussed earlier. This gives the relationship

a = lambda*c,
b = tau/lambda.

With several factors in Mplus,

arg = (-tau + lambda_1*f_1 + lambda_2*f_2 + ...)*c.

This formula can then be used to derive the relationship to the various IRT formulations. Hope this helps.
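As a sketch of that derivation: the multi-factor arg maps to the compensatory slope-intercept form arg = a_1*f_1 + a_2*f_2 + ... + d, with a_k = lambda_k*c and d = -tau*c. The function name below is mine; the single-factor case reproduces a = lambda*c and b = tau/lambda from above:

```python
import math

def mplus_to_mirt(tau, loadings, residual_var):
    """Map one item's Mplus threshold and loadings to slope-intercept
    form arg = sum_k a_k*f_k + d, where c = 1/sqrt(residual y* variance)."""
    c = 1.0 / math.sqrt(residual_var)
    a = [lam * c for lam in loadings]  # one discrimination per factor
    d = -tau * c                       # single intercept: one threshold per item
    return a, d

# single-factor check with item #1 from earlier in the thread
a, d = mplus_to_mirt(0.289, [0.839], 1.0 - 0.839**2)
# -d/a[0] recovers b = tau/lambda = .289/.839
```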
 bmuthen posted on Tuesday, July 17, 2001 - 6:57 pm
Here are some thoughts on John Fleishman's questions. Rich Jones may have further ideas.

1. It would be nice to have some overall indicator of the DIF, for all items involved. Perhaps the sum of the squared differences between the standardized indirect effects with and without direct effects. Perhaps root mean square.

2. One can get the factor means and factor standard deviations from TECH4, then compute mean differences divided by sd.

3. Sounds alright.

4. The estimated mean for the excluded group would be zero since a single-group analysis fixes the intercept for the factor at zero (in line with mean fixed at zero). TECH4 would only give the marginal factor mean (so averaged over all background variable values). In line with equation (38) in version 2's appendix, the factor mean conditional on x covariates is

B_0 * alpha + B_0 * Gamma * x,

where B_0 is the inverse of (I - B).
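The conditional mean formula is just linear algebra; a minimal sketch with hypothetical numbers (for a single factor with no factor-on-factor regressions, B = 0 and B_0 = I):

```python
import numpy as np

def conditional_factor_mean(B, alpha, Gamma, x):
    """Factor mean conditional on covariates x:
    B0 @ alpha + B0 @ Gamma @ x, with B0 = inv(I - B)."""
    B0 = np.linalg.inv(np.eye(B.shape[0]) - B)
    return B0 @ alpha + B0 @ Gamma @ x

B = np.zeros((1, 1))             # no factor-on-factor regressions
alpha = np.array([0.0])          # factor intercept fixed at zero
Gamma = np.array([[0.3, -0.2]])  # hypothetical effects of two dummy x's
mean_g1 = conditional_factor_mean(B, alpha, Gamma, np.array([1.0, 0.0]))
mean_excluded = conditional_factor_mean(B, alpha, Gamma, np.array([0.0, 0.0]))
# the excluded group's mean is alpha itself, i.e. zero, as noted above
```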
 Anonymous posted on Monday, July 23, 2001 - 9:35 pm
Is there a convenient way to generate something like reliabilities for multifactor CFA models for categorical indicators in Mplus ?
 bmuthen posted on Tuesday, July 24, 2001 - 9:32 am
The R-square values that are printed if you request a standardized solution would serve this purpose. The R-square describes the proportion of variation in the latent response variable y* accounted for by the multiple factors influencing this y*. See Appendix 1 of the User's Guide for more details about y*.
 Rich Jones posted on Friday, August 03, 2001 - 1:20 pm
In response to John Fleishman's request for an explication of my Single-Number Summary of DIF...

Direct and Indirect Effects in Mplus MIMIC models

In the case of a single-factor MIMIC model (single factor CFA with covariates), what I refer to as the indirect effect can also be conceptualized as a regression of the latent factor (eta) on a given x. When the x is dichotomous, these regressions are analogous to dummy variable regressions in ordinary linear regression or mean differences as expressed in ANCOVA models. So while the parameter describes the indirect relationship between the x (e.g., group membership) and the item(s), it also captures group mean differences in the underlying construct (eta) (this is a powerful feature of DIF detection with a MIMIC model that is difficult to get at with more usual DIF detection approaches).

Mplus will produce regression parameters standardized with respect to the variance of eta (STD), and also standardized to the variances of eta, the x's, and the y's (STDYX). The indirect effect, when standardized with respect to the variance of eta (STD; STDYX is harder to interpret in the case of dichotomous x's), describes a kind of effect-size difference for group membership: that is, the standard deviation increase in eta associated with a unit increase in the covariate (i.e., group membership).

I am very interested in the study of DIF, but I also believe that DIF is often at best a nuisance. What I am really interested in is how cognition, depression, functioning, whatever, is distributed within a sample, across groups, or how it relates to some other characteristic of subjects. The presence of DIF might lead to spurious inferences of group differences or exaggerated relation to other correlates if the DIF is of significant magnitude. Most studies of DIF that I have seen conduct an analysis of DIF, interpret (some) of the findings, and stop there without going on to explore how the DIF might impact findings or relationships with other variables. Often, DIF studies only interpret evidence of bias in the direction favored/suspected a priori by the investigator. I was just trying to go one step further and try to conceptualize/express the overall magnitude of DIF and the importance of modeling DIF or, conversely, the cost of ignoring DIF.

Overall Summary of DIF

In my posted example using the single-group one-factor MIMIC approach, I built a MIMIC model in a forward stepwise fashion, examining model fit derivatives for evidence of DIF and sequentially freeing up direct effects etc., as described in Muthén, B., in Test Validity, H. Wainer & H. Braun, eds. (1988). The initial model (without any DIF/direct effects) suggested significant and large group differences in the underlying construct (eta) - in other words, the regression of eta on x was large and significant: a significant indirect effect. The final model, one that included DIF according to group membership, suggested many items with DIF (a.k.a. direct effects), and group differences in the underlying construct remained (indirect effect).

I was interested in trying to describe how much of the observed group difference in eta was due to bias (DIF) by group membership. As in other areas of statistical inquiry, large samples may lead to an analytic finding of statistically significant DIF, but the magnitude of the effect is of little practical importance. Also, sometimes you'll find DIF favoring one group on some items, and suggesting a disadvantage on other items, and given different difficulty and perhaps even discrimination across indicators, it's hard to gauge the overall importance of DIF.

The approach I took was to compare from the final model (with DIF) the STD indirect effect (regressions of eta on group membership) for the group membership dummy covariate (x) to a model otherwise equivalent with the exception of the direct effects (DIF) - a purposefully mis-specified model. You can get an omnibus chi-square difference test and p-value for all DIF parameters this way, but you already know this will be significant given the way the MIMIC model was built. What I was trying to do was get a handle on how large the group differences would be if the DIF was ignored.

In my posted example, the final model returned a STD indirect effect of 0.490. That is, the standardized (with respect to the variance of eta) difference in eta was 0.490. This value was slightly lower than the standardized group difference in eta found for the purposefully mis-specified model: 0.498. I further expressed the discrepancy in group difference estimates as a fraction of the mis-specified group difference ((0.498-0.490)/0.498)=1.6%. This result is very interesting because although my analysis demonstrated significant DIF, the practical importance of this DIF in terms of obtaining an un-biased estimate of eta seems to be relatively small. Therefore, I concluded that most of the group differences in eta were not due to possible item bias (but may be due to constant bias - but that's another topic, a matter of substantive interpretation of the indirect effect).

So I use the figure 1.6% as a single number to quantify the discrepancy between the DIF and no-DIF models in terms of the underlying construct. Bengt suggested other ways, for example estimating factor scores for the final and mis-specified models and plotting them, computing their correlation, or estimating the mean of the differences, etc.

Other DIF Summaries

I have considered other summaries, for example something analogous to a sum of the area between the item characteristic curves (ICCs) for focal and referent groups. Other postings on the Mplus discussion list describe how MIMIC model parameters can be used to obtain IRT parameters, and Raju (1988; Psychometrika 53:495-502) demonstrated that the area between two groups' ICCs is equal to their difference in IRT difficulty parameters for 1-P and 2-P models. So you could convert direct effects, thresholds, and loadings to difficulty parameters, and compute the sum of group differences in item difficulty across groups as another single-number expression of the total effect of DIF (but notice that direct effects are the only things that vary across groups in a single-group MIMIC model).

Limitations of this area approach are that it does not take into consideration the distribution of ability in the sample of interest. If the items are highly skewed -- all very difficult or very easy (as they often are in fields outside of educational testing, such as psychology, medicine, and epidemiology) -- this sum of areas may misrepresent group differences. It is possible that the sum of the areas between the curves is very large, but if weighted by the distribution of ability in the population, and with all of the items most discriminating at the tails of the ability distribution, there will be very small differences in estimated group differences due to DIF. I believe this is what is happening in my example, where large and significant DIF explains little of the overall group difference in estimated ability because very few respondents have a level of ability that matches the difficulty level of the test items.
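The 1.6% single-number summary described in this thread works out as follows (the function name is mine; the 0.490 and 0.498 values are the ones from the posted example):

```python
def dif_discrepancy(effect_with_dif, effect_ignoring_dif):
    """Fraction of the estimated group difference attributable to ignoring
    DIF: (mis-specified - correct) / mis-specified STD indirect effect."""
    return (effect_ignoring_dif - effect_with_dif) / effect_ignoring_dif

print(round(100.0 * dif_discrepancy(0.490, 0.498), 1))  # 1.6
```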
 Denny Borsboom posted on Friday, September 14, 2001 - 7:20 am
When estimating the correlation between two continuous latent variables with binary observed variables, does Mplus incorporate a correction for attenuation due to unreliability (like a factor analysis model does for continuous observed variables)? Or does it compute correlations between latent variable estimates (as I think is the case for IRT programs like BILOG)?
 bmuthen posted on Friday, September 14, 2001 - 11:23 am
The correction for attenuation is built into the modeling. Mplus does not compute the correlation from estimates of each individual's latent variable value.
 Patrick Malone posted on Friday, December 28, 2001 - 10:35 am

I'm trying to develop an IRT-style scale for several polytomous items that are measured longitudinally. I'm somewhat surprised by some of my results, so I'm hoping you can check my logic. I have too many items/response categories to do a full longitudinal CFA (MPlus insists there's not enough memory in the workspace, no matter how much I try to give it). So I did a one-factor CFA with one year's data, using WLSMV, and was satisfied with the outcome. I'm willing to make the assumption that the items have the same measurement relations to the factor across years.

I then ran the model for different years, fixing all of the loading and threshold values to be equal to those from the above CFA, and saved factor scores. My thinking was that this would give me a score for each year, all on the same scale. However, the factor means I'm getting out of these runs are surprising to me. It may be that they're right and my expectations were simply wrong.

However, I noticed a pattern. Some of the years don't have all of the items, so I omitted those items from the scoring runs for those years. With all of the thresholds and loadings fixed, I'm working from the idea that that's as if those variables are missing (completely at random, in fact), and shouldn't affect the scaling. But the years with missing items all have the highest factor scores.

Is this a reasonable approach? Other suggestions are welcome.

 bmuthen posted on Tuesday, January 01, 2002 - 5:56 pm
Off-hand, I don't see that this approach has any gross error, although I may be overlooking some scaling issue. As a check of reasonableness of the results I would compare this to treating the items as continuous and study the mean development for the average at each time point (average to take into account missing items).
 Patrick Malone posted on Thursday, January 03, 2002 - 4:19 am
Thanks. I discovered the key problem was in my assumption that the scaling was constant across years -- there were certain items that were phrased differently in different years, making them much "easier items" in some years than others. I appreciate the check on the logic.

 Michael Conley posted on Friday, February 07, 2003 - 5:11 pm
Just starting to get into measurement equivalence assessment using two group CFA. Based on reading so far, I think that when using Mplus on binary math item indicators if I have partial measurement invariance between M and F on Math it is reasonable to assume that the factor means and factor scores can be estimated and that I am measuring the same factor in both groups. True so far? But I am interested in these factor scores compared to 2P IRT derived scores. I have seen on this board the formulas for converting Mplus parameters to BILOG parameters. I am wondering if the Mplus factor scores will correlate with other variables in the same way as BILOG parameter based scores. Maybe an obvious answer, but a stretch for me. Likewise would differences in group means relative to variances be the same for Mplus factor scores and BILOG scores?
 bmuthen posted on Saturday, February 08, 2003 - 6:41 am
Mplus Web Note #4 goes through different parameterizations including IRT and shows that the relationship between the item and the factor is the same. This means that the factor scores are the same. The only difference between BILOG and current Mplus is the difference due to BILOG using a logistic function and Mplus using a probit function; this should produce only minor differences wrt factor scores.
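The closeness of the two link functions is easy to verify numerically; a quick sketch of the usual 1.7 scaling-constant approximation (standard library only):

```python
import math

def probit(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic17(x):
    """Logistic function with the conventional 1.7 scaling constant."""
    return 1.0 / (1.0 + math.exp(-1.7 * x))

# the two curves never differ by more than about 0.01 over a wide range
max_gap = max(abs(probit(i / 100.0) - logistic17(i / 100.0))
              for i in range(-400, 401))
```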
 Michael Conley posted on Saturday, February 08, 2003 - 7:51 am
Thanks for your very helpful response. Is it correct that partial measurement invariance is somewhat subjective as to whether you have it? If you have it, or enough of it, am I correctly interpreting your comments on the board that you can estimate factor scores for the two groups and DIF is controlled? This is extremely helpful. Thanks.
 Linda K. Muthen posted on Saturday, February 08, 2003 - 8:33 am
Yes and yes.
 Svend Kreiner posted on Tuesday, February 11, 2003 - 7:47 am
I have been unable to find the Muthen, Kao, Burstein (1991) reference mentioned above. Could you provide some information on where I can find it.
 Linda K. Muthen posted on Tuesday, February 11, 2003 - 8:30 am
Following is the complete reference:

Muthén, B., Kao, Chih-Fen, & Burstein, L. (1991). Instructional sensitivity in mathematics achievement test items: Applications of a new IRT-based detection technique. Journal of Educational Measurement, 28, 1-22. (#35)

If you don't have access to the journal, you can request the paper; refer to it by #35.
 Michael Conley posted on Tuesday, February 11, 2003 - 10:49 am
RE: Rich Jones on Friday, April 06, 2001 - 11:11 am: suggestion on an effect size for DIF. Would I be looking at the more general DIF by using a two-group analysis? Would I then compare the factor mean for the withDIF group computed from a model recognizing some noninvariance (but still retaining partial measurement invariance) with the withDIF group factor mean computed from a model assuming invariance? Correct? I guess there is no scale problem as long as I use the same item as reference indicator throughout, even though I free up some parameters? I wonder if there is any way to get a standard error on that difference, or an upper bound on the standard error?
 michael conley posted on Tuesday, February 11, 2003 - 1:53 pm
Clarification of my question. My term "withDIF group" makes sense in my context. That is, a test is administered and scaled under standard conditions and then administered to others under nonstandard conditions. This last group is the withDIF group I was referring to. Upon reflection, my issue may be yet more complex. Getting the scores of the withDIF group controlling for DIF is no problem, I think. However, the comparison would be with their scores if they were scored using the item parameters derived from the standard group only. Nothing in my described analysis gives me that?
 bmuthen posted on Wednesday, February 12, 2003 - 4:40 pm
Yes, when moving from a mimic-type analysis to a 2-group analysis I think comparing the factor mean from the correctly specified model (with the non-invariance in question included in the model) with the factor mean from the incorrectly specified model makes sense. I don't know about the s.e. of the difference in means. In a sense the s.e. for the mean for the correctly specified model would be somewhat useful - for example, it is informative if the incorrect mean is more than 2 s.e.'s away from the correct mean.
 Rich Jones posted on Friday, February 14, 2003 - 9:20 am
In Re Magnitude of DIF: Comment, a shameless plug, and a proposed rule of thumb for interpreting magnitude of difference

I like to think of the parameter estimate for the indirect effect (or the mean difference in the multiple-group case) and its associated standard error as a test of significance, and the comparison of parameters from fitted and mis-specified models as a kind of effect size measure.

I discuss the mis-specified model comparison approach in a little greater detail in Jones & Gallo (2002, J Gerontol B Psychol Sci Soc Sci 57B:P548-558).

I've recently learned that Maldonado and Greenland (1993, Am J Epidemiol 138:923-936) describe a simulation study used to evaluate different strategies for identifying important confounders in observational studies. Their strategy might be adapted for evaluating the model mis-specification approach to detecting "confounding" in the ability estimate due to DIF. These authors ultimately recommend a "change in estimate" approach similar to what I propose in Jones & Gallo (2002), but using a pre-determined threshold (e.g., |b-b'|/b' > 0.10, where b and b' are parameter estimates from the mis-specified and fitted models, respectively) as a criterion for marking 'important' confounding.

This 10% difference rule of thumb might be as good a rule of thumb as any, short of simulation studies or other indications that the detected DIF makes an important or substantial impact. BTW: I should mention that I came to the Maldonado and Greenland work by way of Crane, Jolley and Van Belle, who use it as a criterion in assessing the presence of uniform DIF in their DIFdetect procedure (see

However, I can see that it would be nice to have some indication as to how confident we can be that the difference between fitted and mis-specified parameter estimates for mean difference in underlying ability is less than 10%.
 bmuthen posted on Saturday, February 15, 2003 - 10:44 am
Thanks, Rich. Could you send me a copy of your article? I have a feeling you did already, but I just reorganized my office and can't seem to find it.
 Rich Jones posted on Wednesday, February 26, 2003 - 9:14 am
Re: Scale Equating

Setup: I am trying to use the Mplus factor score estimator to produce equivalent latent trait estimates for two sequential administrations of the same symptom inventory. I need to generate equated, or linked, scores because the wording of response options (but not symptom stems) changed between administrations. All items are treated as dichotomous (symptom present/absent).

Approach: I am linking the two models by (1) estimating factor loadings and thresholds in the first administration, with the variance fixed at 1 and the mean at 0 for the single latent factor, and saving factor scores; and (2) estimating latent trait estimates (factor scores) at the second administration, constraining the loadings for all items, and the thresholds for the items that are (assumed to be) equal across administrations, to be equal to those estimated at the first administration. I've estimated two separate models, so that by default Delta=I in both administrations. Further, in the second model, the only free parameters are the mean of the latent trait, the thresholds for the items for which the wording changed, and ...

(1) Should I hold the variances of the latent factor to be equal (i.e., 1) at the second administration?

(2) Do you think this would be more appropriately parameterized as a multiple-group model, bringing Delta into the picture? (i.e., fix Psi to 1 and free Delta for group 2, where group 2 is really administration 2?)

I realize that if all items were exactly the same and I did not constrain the variances to be equal (along with all loadings and thresholds), the metric of the latent trait would change, and I would get a different latent trait estimate for identical response patterns. I'm just not sure if I can expect (assume) that the latent trait variance would/should be equal when the thresholds for the items are very different.

- Rich
 bmuthen posted on Wednesday, February 26, 2003 - 9:31 am
Just a clarification - are the two administrations given to the same group of people or different people? I assume different, but I want to make sure.
 Rich Jones posted on Wednesday, February 26, 2003 - 11:15 am
They are the same people at the two administrations.
 Rich Jones posted on Wednesday, February 26, 2003 - 2:29 pm
...but my idea was to treat them as separate groups in a scale equating phase of the analysis, and then look at longitudinal changes in a separate set of analyses. I realize this is not necessary with Mplus, but a secondary goal is to provide a set of equated scores for other investigators to use (who might not use Mplus).
 bmuthen posted on Thursday, February 27, 2003 - 8:54 pm
You can do this by a "longitudinal factor analysis", that is, using a single-group analysis with one factor per time point (I would not use a multiple-group approach since you have the same people at the two administrations, so not independent samples). The standard setup is to hold thresholds and loadings invariant across time to the extent that is realistic, and to let the factor variances differ across time (with one loading = 1 to set the metric), having a zero factor mean at time 1 and a free factor mean at time 2, and letting Delta = 1 for time 1 and free for time 2 (see Mplus Web Note #4). Then estimate factor scores from this model.

Or, do the above to get thresholds and loadings for each administration, and then run each administration separately with parameters held fixed at the solution from the joint analysis, and estimate factor scores for that administration. This approach is perhaps somewhat less prone to misspecification of across-time relationships.
 Rich Jones posted on Monday, March 03, 2003 - 6:07 am
Thanks for the suggestion. I ran these models and I am sure there is something I still do not 'get' about the use of scale factors. I will re-read Web Note #4 more carefully.

I actually have more than two administrations of this questionnaire: only one of the administrations differs from the others and needs to be linked. Running each repeated administration separately, I find that if I do not constrain both the psi and delta matrices, the estimated factor scores for identical response patterns are not equal across time (when the items are the same, and lambda and tau are also constrained to be equal). I find this sample-invariant scoring intuitively pleasing, and it seems to be consistent with the IRT model.

 bmuthen posted on Monday, March 03, 2003 - 6:36 am
Having different Psi matrices (and therefore different Delta) influences the factor score estimation even when tau and lambda are the same. Psi is the "prior" factor cov matrix and therefore should have an influence. Substantively it seems like Psi can change over time and this should be allowed even when tau and lambda remain invariant.
 Michael Conley posted on Saturday, March 29, 2003 - 12:50 pm
I am doing two group CFA with one factor. The indicators consist of some binary items and some ordered polytomous items. If I wanted to refer to the Mplus factor scores in IRT terms, would I say they are similar to or the same as IRT graded response scores? Thanks.
 Bengt O. Muthen posted on Saturday, March 29, 2003 - 1:19 pm
They are the same as IRT theta scores when using a two parameter normal ogive model and estimating the scores using a Bayes modal, that is, maximum a posteriori, estimator.
 Michael Conley posted on Saturday, March 29, 2003 - 4:05 pm
I'm slow absorbing sometimes. That's so even if some of the items are ordered polytomous and not binary?
 bmuthen posted on Saturday, March 29, 2003 - 5:14 pm
That's right.
 Michael Conley posted on Sunday, March 30, 2003 - 6:00 am
So if items are all ordered polytomous then it's like graded response? All or part binary items, then 2P normal ogive MAP? In that case, are item scores other than 0, e.g., 1 2 3, treated as 1 to compute probit coefficients?
 bmuthen posted on Saturday, April 05, 2003 - 10:35 am
As I understand it, graded response (when using a normal ogive) is the same as ordered polytomous in Mplus. Binary is 2P normal ogive. In both cases factor scores are MAP in Mplus.
 Anonymous posted on Tuesday, December 16, 2003 - 10:07 am
I am estimating a graded response model (Samejima, 1969) in Mplus. The structure of the scale turns out to be multidimensional (4 factors). I wonder, in this case, can I still use the same transformation, i.e., a = loading/sqrt(1 - loading**2); b = threshold/loading, to convert the Mplus estimates of thresholds for an item into the IRT model's b? And the same way for a?
 Linda K. Muthen posted on Wednesday, December 17, 2003 - 9:00 am
Yes, as long as all factor indicators load on only one factor, that is, there are no cross-loadings.
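For concreteness, the conversion discussed here (a = loading/sqrt(1 - loading**2), b = threshold/loading, for an item loading on a single factor) can be sketched in Python; the loading and threshold values below are hypothetical, for illustration only:

```python
import math

def probit_to_irt(loading, thresholds):
    """Convert a standardized probit loading and its thresholds to
    normal-ogive IRT discrimination a and difficulties b_k, using
    a = loading / sqrt(1 - loading**2) and b_k = threshold_k / loading."""
    a = loading / math.sqrt(1.0 - loading ** 2)
    b = [t / loading for t in thresholds]
    return a, b

# hypothetical estimates for one 4-category item (3 thresholds)
a, b = probit_to_irt(0.8, [-0.4, 0.2, 1.0])
```

Note this applies per item only when the item loads on exactly one factor, per the reply above.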
 Anonymous posted on Wednesday, December 17, 2003 - 11:16 am
Unfortunately I do have three items that load on two factors. Therefore, shall I use: b1 = threshold1/sqrt(1 - (var(f1)*lambda_f1**2 + var(f2)*lambda_f2**2 + 2*lambda_f1*lambda_f2*cov(f1,f2))) as the transformation for these three items?
 Anonymous posted on Thursday, December 18, 2003 - 11:55 am
A correction to my last inquiry: for the three items that load on two factors, a_f1 = lambda_f1/sqrt(1 - (var(f1)*lambda_f1**2 + var(f2)*lambda_f2**2 + 2*lambda_f1*lambda_f2*cov(f1,f2))) and a_f2 = lambda_f2/sqrt(1 - (var(f1)*lambda_f1**2 + var(f2)*lambda_f2**2 + 2*lambda_f1*lambda_f2*cov(f1,f2)));
as such, for these three items, each will have two sets of transformed thresholds: one set for factor 1, b1_f1 = threshold1_f1/lambda_f1, ..., and one set for factor 2, b1_f2 = threshold1_f2/lambda_f2. Is that correct?
 bmuthen posted on Thursday, December 18, 2003 - 1:42 pm
This article may give you the answers:

MacIntosh, R. & Hashim, S. (2003). Variance estimation for converting MIMIC model parameters to IRT parameters in DIF analysis. Applied Psychological Measurement, 27, 372-379.

Note that you don't get more thresholds because you have more than one factor - the thresholds relate to the item, not the factor.
 Anonymous posted on Monday, December 22, 2003 - 1:05 pm
I obtained the paper of MacIntosh and Hashim (2003), but they didn't mention the case when items load on multiple factors.
 BMuthen posted on Monday, December 22, 2003 - 5:03 pm
Too bad. Perhaps the Takane and de Leeuw paper in Psychometrika. I don't know offhand. I would have to work it out and don't have time right now.
 Anonymous posted on Tuesday, January 27, 2004 - 8:36 am
When estimating IRT models via Mplus, two things are unclear to me:

First, am I correct in saying that Mplus only models the means, variances, and covariances, and not the higher-order moments? In that case, the estimation of IRT models via Mplus is not full-information, but a good approximation instead.

Second, is it still not possible to estimate the rating scale model or the partial credit model (with a probit link instead of a logit link) in Mplus? I ask this because the possibilities of Mplus grow rapidly, and the previous question about this topic dates from 1999.

Thanks for the answer
 Linda K. Muthen posted on Tuesday, January 27, 2004 - 10:40 am
The current version of Mplus uses weighted least squares with a probit link. This is not a full-information estimator. Version 3 will include a full-information maximum likelihood estimator with a logit link for categorical outcomes.

I am not familiar with what the rating scale model or the partial credit model is. If you can explain it in a simple way, I can try to answer this.
 Anonymous posted on Tuesday, January 27, 2004 - 1:30 pm
The rating scale model is a model for ordered polytomous data. It states that:
ln [ P_vij / P_vi(j-1) ] = Theta_v - Beta_i + Tau_j
with v = person, i = item, j = category; Theta is a person parameter, Beta an item parameter, and Tau a category parameter. As such, this model assumes equal distances between categories across all items.

The partial credit model is an extension of the rating scale model in that it relaxes the assumption of equal distances between categories:
ln [ P_vij / P_vi(j-1) ] = Theta_v - Delta_ij
with Delta an item- and category-specific parameter.

Hope this is clear
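A minimal numeric sketch of the adjacent-category formulation above (the partial credit version, with hypothetical step parameters delta_ij): cumulating the logits ln[P_j/P_(j-1)] = theta - delta_j, with an empty sum for category 0, and normalizing gives the category probabilities.

```python
import math

def pcm_probs(theta, deltas):
    """Category probabilities for one item under the partial credit model:
    cumulative sums of the adjacent-category logits (theta - delta_j),
    exponentiated and normalized."""
    logits = [0.0]                      # empty sum for category 0
    for d in deltas:
        logits.append(logits[-1] + (theta - d))
    expz = [math.exp(v) for v in logits]
    total = sum(expz)
    return [v / total for v in expz]

# hypothetical 4-category item with step parameters -1, 0, 1
p = pcm_probs(theta=0.0, deltas=[-1.0, 0.0, 1.0])
```

The rating scale version is the special case delta_ij = Beta_i - Tau_j, i.e., the same category spacing for every item.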
 Linda K. Muthen posted on Tuesday, January 27, 2004 - 5:18 pm
No, we don't do these models as far as I know. There may be some way to specify it but I am not aware of it.
 Anonymous posted on Thursday, September 02, 2004 - 12:26 am
There have been a couple of mentions of the relationship between Mplus modelling of ordered polytomous data and Samejima's graded response model: is this spelt out in detail anywhere?
 bmuthen posted on Thursday, September 02, 2004 - 8:24 am
With ML estimation, Mplus uses a logistic regression of the ordered polytomous item on the factor, where the logistic regression model is a proportional-odds model in the language of Agresti's categorical data book. I am told by IRT experts that this is Samejima's model. I haven't seen this spelled out in writing, but you can check Agresti and compare to Samejima.
 Anonymous posted on Sunday, September 05, 2004 - 11:20 pm
Just in follow-up to the graded response question. I've compared estimates from MULTILOG and Mplus in three different data sets and they are essentially in agreement as follows: MULTILOG's A = Mplus' standardized loading; MULTILOG's B(k) = Mplus' k'th threshold/standardized loading. The relationship for A looks different to what has been stated in some earlier questions/answers: is it different?
 Bengt O. Muthen posted on Monday, September 06, 2004 - 7:43 am
The relationship you express is with logit. Another relationship holds for probit. Is this what you mean?
 Anonymous posted on Monday, September 06, 2004 - 5:01 pm
Re: Mplus/MULTILOG. I think I haven't understood the answer to "Anonymous Tuesday, December 16, 2003 - 10:07 am" which describes a different transformation from Mplus to graded response. But to be clear: a 1-factor model for polytomous variables in Mplus fits the equivalent of a logit form proportional-odds? And since MULTILOG does a logit version of graded response the above simple relationship holds? (And Mplus can/can't do normal ogive versions?)
 bmuthen posted on Monday, September 06, 2004 - 5:09 pm
Yes and yes. The Dec 16 statement is true for probit (i.e. normal ogive), which Mplus can also do.
 Anonymous posted on Tuesday, September 07, 2004 - 3:31 pm
Thank you for your replies regarding graded response. Finally for now (I hope): how is a probit invoked? (I couldn't find anything in the manual for 3.0).
 Linda K. Muthen posted on Wednesday, September 29, 2004 - 4:19 pm
Probit is obtained with the WLS, WLSM, and WLSMV estimators and categorical outcomes.
 Judy posted on Thursday, December 16, 2004 - 8:14 am
I have a question about generating the scaling factor for an IRT analysis. My items are categorical - with 4 response options and I have 15 total items. How do I generate a scaling factor for each of the 15 items?

I have read through this discussion and I think I've figured out how to generate the a and b parameters - but does the newest version of Mplus allow for graphing of the IRT curves? If so, how do I do that?

 Linda K. Muthen posted on Thursday, December 16, 2004 - 10:16 am
Mplus does not use scale factors to generate categorical data. See mcex5.2.inp which comes as part of the Mplus installation for an example of how to generate categorical data.
 Anonymous posted on Tuesday, January 18, 2005 - 3:46 pm
A quick question. As I read through all the IRT conversations, I am left needing some clarification. Mplus will perform Samejima's model, similar to Multilog? However, Mplus Version 3 does not perform IRT for rating scale data? Am I correct on these?
 Linda K. Muthen posted on Tuesday, January 18, 2005 - 3:50 pm
I don't know what you mean by rating scale. If it is ordered categorical (polytomous), then Mplus can handle it.
 Anonymous posted on Tuesday, January 18, 2005 - 3:56 pm
By rating scale, I mean a scale that is strongly agree, agree, disagree, and strongly disagree. I know that Mplus can handle this type of data in other formats (i.e., SEM), but will it work for an IRT model using the Likert-Type format?
 Linda K. Muthen posted on Thursday, January 20, 2005 - 8:16 pm
It will work for all models in Mplus. IRT is just a special case of an SEM model. Use the CATEGORICAL option to specify which dependent variables are rating scales.
 Eveline Gebhardt posted on Saturday, January 29, 2005 - 12:56 am
I have noticed that person abilities estimated by the MLR method in Mplus are continuous while I expected discrete values like MLE or WLE ability estimates. What causes this difference in the MLR method? Do you have a reference where I can find information about this? Many thanks in advance.
 Linda K. Muthen posted on Sunday, January 30, 2005 - 3:46 pm
I think by ability measures you are referring to factor scores. They are continuous for all maximum likelihood estimators and weighted least squares estimators. Factors are continuous.
 Eveline Gebhardt posted on Sunday, January 30, 2005 - 4:45 pm
My understanding is that if the model is a logit model and with constrained variances a Rasch model, then it is in the exponential family and the student raw scores are sufficient statistics. Therefore there should be a one-to-one match between abilities (factor scores) and raw scores, but that is not happening in Mplus.
 Linda K. Muthen posted on Monday, January 31, 2005 - 8:50 pm
Have you considered that in the Mplus modeling the prior for the ability distribution is normal? The scale of the ability estimates and the raw scores are therefore different.
 Eveline Gebhardt posted on Monday, January 31, 2005 - 11:15 pm
I am using MLR with dichotomous items and I am constraining the item loadings to 1. My understanding is that this will result in a Rasch model with a normal prior. My understanding is also that in this case the raw scores and the ability estimates will have a one-to-one match; the metrics will be different, and the transformation from one to the other will be non-linear, but nevertheless there should be just one estimate for each possible raw score. This does not appear to be happening, and I am not sure why. Do you have an explanation? What do you recommend as the best reference for understanding the MLR estimation algorithm in this context?
 Linda K. Muthen posted on Tuesday, February 01, 2005 - 7:03 pm
To answer this I would need to see your Mplus output and your data. Please send them to

Just to make sure: when fixing the factor variance to one, you should hold the loadings equal to each other, not fix them to one. If you fix the factor loadings to one, you should allow the factor variance to be free.
 Anonymous posted on Saturday, February 05, 2005 - 3:47 am
I'm wondering about how to compute the estimates needed for a test information function in MPLUS. I think this was done in

Muthén, B.O. (1996). Psychometric evaluation of diagnostic criteria: application to a two-dimensional model of alcohol abuse and dependence. Drug and Alcohol Dependence,41(2), 101-112.

Any pointers would help


Andrew Baillie
andrew.baillie at
 bmuthen posted on Saturday, February 05, 2005 - 1:46 pm
I found it helpful to go by the Hambleton-Swaminathan IRT book which I think was referred to in that article.
 Laura Piersol posted on Monday, March 07, 2005 - 12:44 pm
I am wondering how to incorporate estimates from one item-response model into a second item-response model and still get good standard errors. More specifically:

“Stage 1” is a multilevel item-response model for individual’s ordinal responses to questions at time 1. Level 1 = person and level 2 = items within person. This will give coefficient estimates and cut-points from which we can obtain probabilities of an individual scoring between any two cut-points.

“Stage 2” would be a multilevel item-response model for individual’s responses to questions at time 2, but this time we would like to incorporate estimates from “stage 1” as predictors.

Any suggestions on how to model this in Mplus?
Thank you so much for your help.
Laura Piersol
 Linda K. Muthen posted on Tuesday, March 08, 2005 - 7:35 am
You can have a model that contains both of your item-response models, a two factor model.
 bmuthen posted on Tuesday, March 08, 2005 - 2:56 pm
To add to this discussion, it sounds like your "cutpoints" are thresholds for ordinal outcomes and perhaps your "coefficient estimates" are the loadings (discriminations). If so, Linda's 2-factor suggestion refers to a longitudinal factor analysis with 1 factor at each time point. Instead of having stage 1 estimates as "predictor", the idea is then to hold the thresholds and loadings equal across time. Perhaps this is something you want to do - assuming we have understood you correctly.
 Anonymous posted on Wednesday, March 09, 2005 - 11:48 am
I'm a new user, so I'm still learning the program. I'm trying to generate an IRT analysis of my data (1 latent trait explaining 10 categorical variables) modeling my program on example 5.5 from the manual.

Can Mplus generate Item Characteristic Curves using the Plot command? (The program gives me the options only of Histograms, Scatterplots, Sample Proportions, and Estimated Probabilities, none of which produce ICCs.)

Thanks for any suggestions or guidance.
 Linda K. Muthen posted on Wednesday, March 09, 2005 - 12:26 pm
No, ICC's are not yet available. They are on our to do list.
 Laura Piersol posted on Tuesday, March 15, 2005 - 2:26 pm
Thank you for your 3/8/05 responses to my question. I have been trying to get a better handle on these types of models, including factor analysis and IRT as I am relatively new to these topics.

In regard to holding the thresholds and loadings equal across time- We are following a group over two time points. We are happy to assume that the thresholds are equal across items for each time point. However, the set of items at the two time points are not identical so equating thresholds doesn’t seem correct. In this case, would you think of the first factor as a “predictor” of the second factor? This is the model I have in mind:
MODEL: f1 BY u1-u7;
f2 BY u8-u14;
f1 ON x1-x10;
f2 ON f1 x11-x20;
Am I missing something?

I also would like to confirm that we don’t need to run a multilevel model. In terms of variables, we have individuals’ responses to the items and individual-level covariates. From reading the Mplus documentation, it sounds that this can be handled as a single level model.

Do you have a literature suggestion for better understanding the intricacies of these models and interpreting the output?

Many thanks,
 Linda K. Muthen posted on Tuesday, March 15, 2005 - 3:00 pm
If I understand you correctly now, you do not have the same items at two timepoints but different items at two time points representing two different dimensions. Then there is no need to hold thresholds and factor loadings equal. You would want to hold them equal if you have the same items at two time points representing the same dimension. The equalities specify measurement invariance. I think your MODEL command looks good given what you want to do.

If you have no clustering in your data, that is, children were not sampled from classrooms, for example, then a single-level analysis is appropriate.

I don't know of any one piece of literature, but a good SEM book would probably help. See our website, where there are a plethora of references. I think many like the Bollen book. Maybe someone else can make a suggestion. There are also some papers that compare IRT and SEM.
 Anonymous posted on Thursday, March 17, 2005 - 4:11 am
On Wednesday, March 09, 2005 - 11:48 am, Anonymous asked about plotting ICCs in Mplus. While I haven't found a way of doing it within Mplus, it is relatively easy to plot ICCs in your favourite graphing program (I like gnuplot, but I've done it in Excel as well). In gnuplot you can use the norm() function with the reparameterised estimates (see above on this page for how to reparameterise) and plot y = norm(a*(x-b))

Andrew Baillie
andrew.baillie at
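The same curve can also be computed without gnuplot; here is a stdlib-only Python sketch of the normal-ogive ICC y = Phi(a*(theta - b)), with hypothetical a and b values:

```python
import math

def normal_ogive_icc(theta, a, b):
    """Probit (normal-ogive) ICC, y = Phi(a*(theta - b)) -- the same
    quantity as gnuplot's norm(a*(x-b)); Phi is computed via erf."""
    z = a * (theta - b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# probability of endorsement at a few ability values (hypothetical a, b)
curve = [normal_ogive_icc(t, a=1.2, b=0.5) for t in (-2.0, 0.5, 3.0)]
```

At theta = b the curve passes through 0.5, as expected for a model without guessing.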
 Andrew Baillie posted on Thursday, March 17, 2005 - 4:38 am
Some time back I asked about plotting test information functions. For the benefit of others here is what I've found. Hambleton & Swaminathan (1985) got me off to a good start but Frank Baker's online book on IRT had all the answers I was looking for.

The essential points are

1. The test information function is the sum of the item information functions

2. The item information function for the 2 parameter logistic model is

I(theta) = a^2 P(theta) Q(theta)

where P(theta) = 1/(1 + exp(-a(theta - b))) and

Q(theta)= 1 - P(theta)

a and b being the discrimination and difficulty parameters, and theta the latent "ability"

(see Baker, 2001 eqn 6.3 on p 106)

Note that for simplicity I've left the i subscript off these formulae.

I'm yet to find the item information function for the two parameter probit model.

Thanks again for the excellent software.

Andrew Baillie
andrew.baillie at


Baker, Frank (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD.

Hambleton, R.K. & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Kluwer.
 Linda K. Muthen posted on Thursday, March 17, 2005 - 5:42 am
Thank you for the information. ICC's cannot be plotted in Mplus at the present time but we will be adding this in the future.
 CMW posted on Tuesday, April 12, 2005 - 7:32 am

I fitted a 2PL IRT model with Multilog. Then I used Mplus to fit a one-factor model, fixing the variance of the factor to 1, specifying that the variables are categorical, and using the MLR estimator.

The Mplus loadings are nearly identical to the Multilog discrimination parameters, but the Mplus thresholds are not close to the Multilog thresholds. Can you help with understanding this discrepancy?

 BMuthen posted on Thursday, April 14, 2005 - 12:00 am
Mplus uses the logit parameterization:

- threshold + loading*factor

whereas Multilog uses:

a (factor - b)

From this you can deduce the relationship.
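Working out the deduction: setting -threshold + loading*theta equal to a*(theta - b) = a*theta - a*b and matching coefficients gives a = loading and b = threshold/loading. A small Python check, with hypothetical numbers:

```python
def mplus_logit_to_irt(threshold, loading):
    """Match Mplus's logit  -threshold + loading*theta  with Multilog's
    a*(theta - b): the slope gives a = loading, the intercept gives
    b = threshold / loading."""
    return loading, threshold / loading

a, b = mplus_logit_to_irt(threshold=1.2, loading=2.0)

# both parameterizations give the same logit at any theta
same = all(
    abs((-1.2 + 2.0 * t) - a * (t - b)) < 1e-9
    for t in (-1.0, 0.0, 2.5)
)
```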
 Anonymous posted on Monday, April 18, 2005 - 5:50 am
I used Mplus to fit a one-factor model; the variance was fixed at 1.0, and the variables were categorical. I used the WLSMV estimator.

I'd like to transform the estimates from Mplus into the parameterization used in BILOG and Multilog.

I know that the BILOG slope = loading/sqrt(1 - loading**2), and the BILOG threshold = threshold/loading.

But, my question is: are the abovementioned loadings unstandardized or standardized?

I highly appreciate your answer

 bmuthen posted on Tuesday, April 19, 2005 - 12:18 pm
You should work with the unstandardized values. But note that with WLSMV you have probit, not logit results. This means that you have to involve the factor 1.7 in your slope/loading comparison with BILOG.
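As a sketch of what "involving the factor 1.7" might look like: convert the WLSMV probit loading to a normal-ogive discrimination, then scale by D = 1.7 to approximate a logistic metric. The placement of D here and the loading value are assumptions for illustration, to be verified against the BILOG documentation.

```python
import math

def wlsmv_loading_to_logistic_slope(loading, D=1.7):
    """Sketch (assumed scaling): probit loading -> normal-ogive
    discrimination a = loading/sqrt(1 - loading**2), then multiply by
    D = 1.7 to put the slope on an approximate logistic metric."""
    a_probit = loading / math.sqrt(1.0 - loading ** 2)
    return D * a_probit

# hypothetical unstandardized loading
slope = wlsmv_loading_to_logistic_slope(0.6)
```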
 Allison Tracy posted on Monday, May 08, 2006 - 2:37 pm
Can the Mplus framework incorporate multidimensional scaling analysis and/or multidimensional unfolding analysis? I understand that there are formulations of these models that are based in IRT.

 Bengt O. Muthen posted on Monday, May 08, 2006 - 3:16 pm
Don't know - can you suggest some pertinent references?
 Allison Tracy posted on Tuesday, May 09, 2006 - 8:08 am
Here is a webpage with a number of references - I'm not sure if these references all apply to unidimensional unfolding or not but I believe it has been applied multidimensionally. I am just starting out in this literature, so I may run across more recent articles. If so, I will pass them along.

If the Mplus framework can incorporate these types of models, it would represent a significant advance in the field of MDS, allowing MDS and unfolding solutions to be more rigorously tested and incorporated into more general latent variable models. I believe that the possibility of state-of-the-art missing data handling and the use of complex sampling designs would be a major advance, as well.

 Tan Teck Kiang posted on Monday, May 29, 2006 - 2:24 am
Should we look at the standardized or the unstandardized coefficient for the IRT difficulty parameter?

The following model converges in the usual way:

Model: f by Q1-Q40;
f on Male SES;

But the following, despite increasing MITERATIONS and MCONVERGENCE, does not terminate normally:

Model: f by Q1-Q40;
HumanCap by FEdu MEdu;
EconCap by Maid Car ResType ClubM;
f on Male HumanCap EconCap;

Is there any text or reference on building IRT models using Mplus?

 Bengt O. Muthen posted on Monday, May 29, 2006 - 10:39 am
For ML estimation of single-factor models for binary indicators, Mplus Version 4 gives results not only in the regular factor analysis metric but also in the metric of the classic 2PL model with difficulty and discrimination estimates. The classic IRT estimates are given in the usual (0,1) metric for the factor.

The relationship between the factor model parameterization and the IRT parameterization is given in Day 3 of our course handouts. This will also be posted this week as part of the new Web Note #10. The factor model estimates are reported both as raw estimates and standardized, and the choice - or reporting both - is up to the user.

For the problem you are having with your 3-factor model, to make a diagnosis we need you to send input, output, data and license number to

The only relevant IRT references I know are listed on our web site under References, Categorical outcomes, IRT - see also the forthcoming Web Note #10.
 yang posted on Friday, June 02, 2006 - 7:01 am
Is ICC available in Mplus now? Thanks.
 Linda K. Muthen posted on Friday, June 02, 2006 - 8:49 am
Item Characteristic Curves and Information curves are now available as part of the PLOT2 option of the PLOT command.
 Tor Neilands posted on Saturday, June 10, 2006 - 11:15 am
Dear Bengt,

Last year I submitted a scale validation manuscript to a journal. The centerpiece of this manuscript was a confirmatory factor analysis conducted in Mplus 3 using WLSMV estimation. The input items were 136 binary personality inventory items. Previous research using PCA and EFA methods suggested three second-order factors and 17 first-order factors, so we fit that hypothesized factor structure to our data. We have some 6,000 research participants, which we randomly split into two samples: an initial model validation sample (which we used to obtain a "brief" 48-item pared-down version of the original instrument) and a cross-validation sample on which we successfully refit the factor structure from the first sample.

A reviewer of the manuscript has called into question our use of factor analytic methods, arguing that we should instead use IRT methodology. The reviewer states, "To reduce the number of items measuring the three clinical dimensions of the 136-item inventory should be performed according to modern psychometrics outside the frame of factor analysis, namely with item response theory models. In this respect, the authors should consult, e.g., Borsboom, D: Measuring the Mind. Conceptual issues in contemporary psychometrics. Cambridge University Press 2005".

I have read the Borsboom book as well as an earlier paper he published in 2003 that delineates some of the philosophical conundra involved in using latent variable models to infer the presence of latent factors from correlation matrices. Given how enthusiastically the reviewer endorsed IRT over CFA, I was initially surprised to see that Borsboom's criticisms seem to apply with equal force to IRT and CFA/SEM models. Paul Barrett made mention of this on SEMNET in March of 2005, citing the following paper:

Michell, J. (2004) Item Response Models, pathological science, and the
shape of error. Theory and Psychology, 14, 1, 121-129.

When I read through the SEMNET archives and this discussion board, as well as the helpful posts in the new IRT sections of the Mplus Web site, I found myself less surprised given how closely related the two methods appear to be, with identical results possible under some conditions (e.g., ML estimation of models containing a single latent factor).

In crafting my response to this reviewer's comments, it would be helpful for me to know the scope of available IRT models in general and what is available in Mplus. First, to your knowledge, is it even possible to fit higher-order factor models within the IRT framework? If it isn't, then clearly IRT would not be a suitable tool for our purposes given that our theory clearly stipulates a higher-order factor structure a priori. On the other hand, if it is in fact possible to fit higher-order latent variable models under the IRT umbrella, is it feasible to do it using Mplus? I'd guess that if one of the requirements is ML estimation, then the answer is probably "No" because of the computational burden involved with this many variables and subjects.

Finally, given how closely related the factor analytic and IRT approaches are, even if it is conceptually possible to fit a higher-order IRT model and it's computationally feasible, is it even worth bothering to recast the analyses in this manner given how similar the IRT and CFA results are likely to be? My intuition tells me that at such large samples the WLSMV estimates originating from Mplus would be unlikely to differ markedly from those produced by an IRT model. What do you think?

As always, references and any additional comments (including any thoughts you have on the overall utility of what can be learned from fitting CFA models to tetrachoric correlation matrices in scale validation studies) are most welcome.

Gratefully and with best wishes,

 Bengt O. Muthen posted on Monday, June 12, 2006 - 9:06 am
I am disappointed that there are still some journal reviewers who do not understand the relationship between factor analysis of categorical outcomes and IRT - that it's all the same. It's been a long time now since articles like

Takane, Y. & DeLeeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408.

were published and long before then it was clear that this is all the same.

Perhaps the early focus on correlation matrices in factor analysis throws people off. But that should be seen merely as a matter of estimator choice, not model choice. Tetrachoric correlations belong with (weighted) least-squares estimation using limited information (first- and second-order moments), whereas with ML you work with the raw data (full information from all moments). The model is the same, however - if you assume normal factors and probit regressions, you fulfill the assumptions of underlying normality for continuous latent response variables that tetrachorics rely on. This is IRT's 2-parameter normal ogive model. Going to 2-parameter logistic is a trivial model variation. Both the IRT and the factor analysis traditions now work with multiple factors, although I haven't seen explicit use of second-order factors in IRT - mostly because IRT uses ML almost exclusively and the necessary numerical integration is heavy in situations where second-order factors are used - namely with many first-order factors. A program like Bock and Muraki's TESTFACT is limited, I think, to 5 dimensions. For the same model, least-squares and ML estimation typically give very similar results, as the 1986 JEBS article by Mislevy already showed.

The Mplus facilities for IRT are outlined at

showing that both least-squares and ML techniques are available for both probit and logit. ML can in principle be used for the same complex models as least-squares, but is again limited by computational demands with many first-order factors.

Note also that Mplus can do IRT modeling in mixture (latent class), multilevel, and multilevel mixture situations.
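The probit equivalence described in the post above can be checked numerically. Below is a minimal sketch (the loading and threshold values are made up for illustration) showing that the factor-analytic parameterization with an underlying normal response variable and the 2-parameter normal ogive give identical item probabilities:

```python
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Factor-analytic (delta) parameterization: underlying response
# y* = lambda*f + e with Var(e) = 1 - lambda**2, and u = 1 if y* > tau.
lam, tau = 0.7, 0.3   # made-up loading and threshold

def p_factor(f):
    return Phi((lam * f - tau) / sqrt(1.0 - lam ** 2))

# Equivalent 2-parameter normal ogive: P(u=1 | f) = Phi(a * (f - b))
a = lam / sqrt(1.0 - lam ** 2)
b = tau / lam

def p_irt(f):
    return Phi(a * (f - b))

# The two forms agree at any factor value
for f in (-2.0, -1.0, 0.0, 1.0, 2.0):
    assert abs(p_factor(f) - p_irt(f)) < 1e-12
```

The algebra is just a rescaling: (lam*f - tau)/sqrt(1 - lam**2) equals a*(f - b), which is the Takane-DeLeeuw point that the model is the same and only the parameterization differs.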
 Thomas Olino posted on Thursday, October 19, 2006 - 5:02 pm
I was interested in looking at comorbidity between two disorders using an IRT framework. However, I am running into a problem in examining the model. I want the variable to be coded as 0 = no diagnosis; 1 = diagnosis a; 2 = diagnosis b; and 3 = diagnosis a AND b. If I treat the variable as an ordered categorical variable, the model works by treating the 4 level variable as a graded response. However, there is no reason to think that diagnosis b is more severe than diagnosis a, which is implied in the graded response model.

Is there a way to handle the data in a nominal analytic approach?
 Linda K. Muthen posted on Friday, October 20, 2006 - 7:09 am
I am a little confused because when you say IRT, I think factors but then you talk about a nominal variable. What is this nominal variable used for in the analysis? Give me your MODEL command.
 Thomas Olino posted on Friday, October 20, 2006 - 10:55 am
I have repeated diagnostic assessments so the 4-level nominal variables would be the indicators for the factor. At the initial stages, the model looks like:

F BY dxt1* dxt2 dxt3 dxt4 (1);

Although it may be more appropriate to constrain the thresholds to be equal across the levels of the nominal variable.
 Linda K. Muthen posted on Friday, October 20, 2006 - 11:32 am
The NOMINAL option is not available for factor indicators. Perhaps you should use your original items for the factor analysis.
 Thomas Olino posted on Friday, October 20, 2006 - 11:41 am
By 'original items' - do you mean disorder a - present/absent and disorder b - present/absent? Or do you mean to treat the variable as ordered categorical?
 Linda K. Muthen posted on Friday, October 20, 2006 - 1:24 pm
I mean the symptom/disorder items. Treating a nominal variable as ordered doesn't make sense.
 Thomas Olino posted on Monday, November 06, 2006 - 1:52 pm
I am fitting a model of the following form using WLSMV:

f1 BY u1*-u8;
f2 BY u1*-u4;
f3 BY u5*-u8;
f1 WITH f2@0;
f1 WITH f3@0;
f2 WITH f3@0;

u1-u4 reflects diagnostic status for disorder A and u5-u8 reflects diagnostic status for disorder B. I am not interested in growth per se, but I am interested in examining some aspects of measurement invariance. Is it appropriate to interpret the threshold parameters in the same way that you would if there were only one factor?

 Linda K. Muthen posted on Tuesday, November 07, 2006 - 8:39 am
 Gareth posted on Monday, November 27, 2006 - 2:56 pm

I would like to find out how 10 questionnaire items differ between two groups of people.

Could you offer any guidance on sample size? How many cases are required?

Many thanks
 Linda K. Muthen posted on Monday, November 27, 2006 - 3:17 pm
Sample size depends on many things including the model, reliability of the data, scale of the dependent variables, etc. To know for sure how many observations you would need, you could do a Monte Carlo simulation study.
 yang posted on Thursday, November 30, 2006 - 7:39 pm
I am fitting an MIMIC model with binary indicators loaded on a single factor, and several DIF effects are detected. I know that the parameters in MIMIC model can be converted into the parameters for a 2-PL IRT model (refer to Dr. Bengt Muthen's response on May 29, 2006), but I do not know how to get them in Mplus.

I have 3 questions:
1. What are the corresponding commands in Mplus in order to get these converted parameters and their standard errors under this situation (single factor, binary indicators, ML estimation)?

2. Does it make any difference if the estimator is not ML, e.g., WLSMV?

3. How about an MIMIC model with multiple factors, binary indicators and several DIF effects?

Thank you very much.
 Linda K. Muthen posted on Friday, December 01, 2006 - 9:34 am
The IRT conversions are given only for models without covariates.
 Michelle Williams posted on Monday, December 04, 2006 - 7:08 am
I apologize if this is a redundant question, given previous posts, but I am still relatively new to both the IRT literature and MPlus. I am fitting IRT models to binary data. I have no problem fitting the Rasch and 2PL models unidimensionally. However, I believe that I need to fit a multidimensional IRT model because I think that there are two factors that would represent my dataset. Am I correct in thinking that I can use the same basic code for a multidimensional IRT as for a unidimensional IRT? In doing so, I have merely added the second factor into the syntax. The model runs; however, I only get thresholds, rather than difficulties and discriminations.

Am I fitting the model correctly?
If so, is there a way to get discriminations and difficulties out of my output?

 Linda K. Muthen posted on Monday, December 04, 2006 - 8:32 am
We make the IRT translation only for traditional one-factor models. You need to do this yourself for multi-factor models.
 Ilona posted on Friday, January 26, 2007 - 6:24 pm
I apologize for this Stat 101 question -- I have looked at many references referred to in this (and other related) discussions, but I have not found the explicit answer:

How exactly do you convert the probit lambda to a logit lambda, and a probit threshold to a logit threshold? I assume it's not a simple multiplication by the 1.7 conversion factor?

I had attempted to figure this out on my own from running the same data two ways: with the link=probit for one run and link=logit for another run, both using MLR. I was hoping to see if simple multiplication seemed to work (I had set the threshold@.71 in the link=probit run, and set the threshold to 1.2 in the link=logit run). But I was further confused when in both runs, the first threshold (item difficulty) in the IRT parameterization sections was shown to be set to -1 in the output for both. I expected it would have been the same as the link=logit value of 1.2.

 Ilona posted on Saturday, January 27, 2007 - 7:24 am
Sorry, please answer the first question, but let me fix my second question:

My second question should have said:

I attempted to figure some of this out on my own from running the same data and model (single factor model with 60 binary items) two ways: with the link=Probit for one run and link=Logit for another run, both using MLR.

What I assumed from Bengt's response Dec 6, 1999, was that the conversions (mentioned Dec 2, 1999) from FA to IRT parameters are for converting from the link=Probit FA parameters to the Probit IRT parameters. These were:
i) IRT discrimination/slope: a = lambda/sqrt(1 - lambda**2)
ii) IRT difficulty: b = threshold/lambda

But, I can't seem to convert the Mplus FA output to the Mplus IRT output regardless of the link I use.

So, Question 2:
How can I convert from my Mplus link=Probit FA thresholds and lambdas to the IRT discrimination and difficulty estimates Mplus outputs?
I have:
lambda1 set to .71
threshold 1 set to -.71
Mplus IRT discrim (a1) output=.71
Mplus IRT difficulty (b1) output=-1.0.
(the IRT output says the parameterization is Probit)
 Ilona posted on Saturday, January 27, 2007 - 7:24 am
And Question 3:
How can I convert from my Mplus link=Logit FA thresholds and lambdas to the IRT discrim & difficulty estimates that Mplus outputs?

I have:
lambda1 set to 1.208
threshold1 set to -1.208
Mplus IRT discrim (a1) is output as .71
Mplus IRT difficulty (b1) is output as -1.0.
(the IRT output says the parameterization is Logistic)

 Bengt O. Muthen posted on Sunday, January 28, 2007 - 12:28 pm
Answers to your questions are in the IRT documentation on our web site:
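For readers puzzling over the same numbers: the sketch below reproduces the output Ilona reports, under two assumptions that should be checked against the IRT documentation - that the ML probit run uses the theta parameterization (residual variances fixed at 1, so the slope is the loading itself), and that the logit run's IRT metric divides the slope by the approximate 1.7 scaling constant:

```python
# Probit ML (theta parameterization): P(u=1 | f) = Phi(lambda*f - tau),
# so a = lambda and b = tau/lambda.
lam_probit, tau_probit = 0.71, -0.71
a_probit = lam_probit                   # 0.71, as in the reported output
b_probit = tau_probit / lam_probit      # -1.0, as in the reported output

# Logit ML: P(u=1 | f) = 1/(1 + exp(-(lambda*f - tau))); the slope is
# rescaled by ~1.7 toward a normal-ogive-like metric, b is unchanged.
lam_logit, tau_logit = 1.208, -1.208
a_logit = lam_logit / 1.7               # ~0.71
b_logit = tau_logit / lam_logit         # -1.0
```

Note the a = lambda/sqrt(1 - lambda**2) formula from the earlier posts belongs to the delta parameterization (standardized underlying response), which is why it did not reproduce these ML results.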
 Thomas Olino posted on Wednesday, February 28, 2007 - 5:52 am
In deciding if observed item scores are continous or ordered categorical, would it be appropriate to run CFA models (one specifying continuous indicators and the other specifying categorical (4 response levels)) and then compare the BIC values?

 Linda K. Muthen posted on Wednesday, February 28, 2007 - 7:48 am
Statistics cannot determine the nature of the measurement scale.
 Scott Weaver posted on Saturday, April 21, 2007 - 11:35 am
I am conducting an ordinal CFA with the wlsmv estimator. How do I obtain output of parameter estimates in IRT metric that is now available with v. 4.2 of Mplus? I cannot find IRT metric estimates in my output nor can I find the command to request them in the Mplus documentation.


 Linda K. Muthen posted on Saturday, April 21, 2007 - 1:44 pm
They are available for binary dependent variables only. They are printed automatically when available.
 Yaacov Petscher posted on Thursday, June 28, 2007 - 7:42 am

I am currently running a Graded Response Model with 836 participants over 80 items. There are 5 dimensions, and each item is measured on a 5-point Likert scale. The model is running on a server with substantial memory (16GB) and disk space (60GB). I'm using 10 integration points in code, and once the model began I received the message that, "THIS MODEL REQUIRES A LARGE AMOUNT OF MEMORY AND DISK SPACE. IT MAY NEED A SUBSTANTIAL AMOUNT OF TIME TO COMPLETE..." The MS-DOS window indicates that the total number of integration points is 100,000. Currently, it has been running for close to 30 hours, and not having run these multiple times, I wasn't sure if the model is indeed running, or if it's 'stuck' (as can happen in LISREL and SPSS). Any suggestions would be appreciated! Thanks!
 Linda K. Muthen posted on Thursday, June 28, 2007 - 9:50 am
It's probably still running but this is not a realistic analysis with so many integration points. You can change the integration to Monte Carlo integration (INTEGRATION=MONTECARLO;). Alternatively, you can use the weighted least squares estimator, WLSMV.
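The 100,000 figure in the run log is simply the size of the rectangular quadrature grid - points per dimension raised to the number of dimensions - which is why Monte Carlo integration (which avoids this exponential growth) is the usual workaround:

```python
# A rectangular quadrature grid has m**d total points: m points per
# dimension raised to the power of d dimensions (factors).
points_per_dim, dims = 10, 5
total = points_per_dim ** dims   # 100000, the count in the run log above
```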
 Laura Stapleton posted on Thursday, August 02, 2007 - 6:44 am
I'm comparing Parscale and Mplus (v3.1) 1PL and 2PL models with ML estimation. I obtain the same difficulty and (for the 2PL) discrimination parameter estimates across the two programs (after using the conversion equations in webnote4).

However, I find that the distribution of the theta (factor) scores differ across the two programs. Parscale provides approximately normal theta values, while the factor scores in Mplus have a std deviation of .91 in the 1PL and .93 in the 2PL. The latent scores themselves are correlated .99 across the two programs and are centered at 0, but are just distributed differently. Is there any reason that you can think of why Mplus provides scores that do not have a st dev of 1? Or am I missing something?

Thank you!
 Linda K. Muthen posted on Thursday, August 02, 2007 - 8:22 am
One reason might be that there are two ways of estimating these factor scores - EAP and MAP. Mplus uses EAP. Perhaps Parscale uses MAP. Also, the variances of estimated factor scores do not in general agree with the maximum likelihood estimated factor variances due to shrinkage. A third reason could be that there was a problem in Version 3.1. ML for categorical outcomes was introduced in Version 3 and there have been many changes since then. I can't think of any problem offhand but there may have been one.
 Ilona posted on Friday, August 24, 2007 - 10:14 am
In using SEM to parallel IRT, I believe:
getting Factor Scores (SAVE = FSCORES;) is parallel to IRT Scale Scores (on a Z-score scale).

My question is, what is the parallel to getting the IRT standard error of measurement for the scale scores? Can this be output per each subject's estimated Factor Score, or per Factor Score value (if using ML/MLR or WLSMV)?

Thank you!
 Bengt O. Muthen posted on Saturday, August 25, 2007 - 10:26 am
The factor scores you get with categorical items and continuous factors when using ML estimation in Mplus are the "theta-hat" scores you obtain in IRT (using the Bayesian Expected A Posteriori approach). They will be on a z-score scale if you have the factor variance fixed at 1, freeing all the loadings (the first loading is otherwise fixed by default and the factor variance free).

IRT standard errors of measurement are typically expressed as the inverse, namely the information curves for items and sums of items. You can request information curves in the Mplus plot command. See also the web site description of IRT in Mplus.
 T. Cheong posted on Tuesday, October 16, 2007 - 1:42 pm
This is a follow-up question on two queries posted to this list in 1998 and 2004:

Could Partial Credit or the Rating Scale model in line with Master's and Andrich be handled in Mplus now? Thank you.
 Linda K. Muthen posted on Wednesday, October 17, 2007 - 10:07 am
I am not sure. This may be possible using MODEL CONSTRAINT.
 Thomas F. Northrup posted on Wednesday, December 19, 2007 - 7:02 am

After a full day of reading MPlus discussion boards, web notes, and a few other articles about how IRT operates in MPlus, I wanted to make sure that my understanding was clear on a few points:

1. If I am interested in running an IRT (2PL) model, on a 16-item unidimensional measure with 5 response categories for each item (i.e., items are ordered categorical), while testing for DIF across gender, do I start by running a CFA using the 2-step procedure outlined on p. 399 of the MPlus User's Guide for Multi-group invariance testing with categorical outcomes (using WLSMV and delta parameterization). Correct so far?
2. Am I correct in my understanding that single-df chi-squared difference testing in WLSMV (for individual parameters such as item 1's factor loading or first threshold) will help me determine statistical DIF based on gender (assuming sufficient power)?
3. Does MPlus have other statistical tests for DIF? I have read about CFI change tests in the invariance testing literature, but I was not sure if these had made it into MPlus.

More in next posting...
 Thomas F. Northrup posted on Wednesday, December 19, 2007 - 7:03 am
4. To discuss the difficulty in IRT terms, the thresholds can be converted to traditional IRT difficulties (i.e., b) using the simple formula b = threshold/factor loading. Then if I wish to convert it from a probit scale to a logit scale (commonly used in PARSCALE and other IRT programs) I need to multiply the number by 1.7 or 1.76. Correct?
5. Can I calculate difficulties for each threshold using this formula? Since each item has 4 thresholds, I plan to calculate a difficulty for each threshold (as number of responses per category allows). In other words, does the basic formula (b = threshold/factor loading), which seemed to have only been talked about for dichotomous items in everything I could find, generalize to polytomous items?
6. To convert the loadings to IRT slopes (i.e., a; item discriminations), do I use the basic formula a = factor loading/(sqrt(1-factor loading^2)), and follow the same multiplication procedure (by 1.7) if I need to convert the probit to a logit. Correct?

Happy holidays and thanks!
 Bengt O. Muthen posted on Wednesday, December 19, 2007 - 6:09 pm
1. Yes, or to test only item difficulty DIF, you could use gender as a covariate in a single-group analysis, looking for direct effects (we teach that at Hopkins in our March course). See, e.g., the Muthen 1989 Psychometrika article (on my UCLA web site).

2. Yes.

3. Just multiple-group tests or test of direct covariate effects.

4. Right. Or use ML with logit link right away.

5. Yes, I think so.

6. Yes, or let Mplus do the conversion (given in the output for single-factor models)

These matters will be discussed during day 2 of our upcoming Hopkins course in March.
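Answers 4-6 above can be collected into a short sketch. The estimates below are invented, the formulas assume the WLSMV delta parameterization with factor variance fixed at 1, and the 1.7 constant is only approximate:

```python
from math import sqrt

# Invented delta-parameterization estimates for one 5-category item:
# one loading, four thresholds.
lam = 0.6
taus = [-1.2, -0.3, 0.4, 1.1]

# Discrimination: probit metric, then rescaled by the approximate 1.7
# constant toward a logistic metric.
a_probit = lam / sqrt(1.0 - lam ** 2)     # 0.75
a_logit = 1.7 * a_probit                  # 1.275

# One difficulty per threshold: b_k = tau_k / lambda
bs = [t / lam for t in taus]
```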
 Richard E. Zinbarg posted on Friday, February 15, 2008 - 10:40 pm
With polytomous items it seems somewhat arbitrary to me as to how many possible values each item can take on before we consider the assumption that the items are continuous to be reasonable. That is, if we let subjects use a 100-point response scale for each item, probably few if any would object to analyzing the items assuming they are continuous, but there seems to be little principled reason for a priori considering a 100-point response scale to be continuous and a 5-point response scale to be categorical. Thus, when I have 5-point response scales I usually analyze the data both ways and often find that indices of fit such as CFI look a lot better in the IRT approach. My question is whether there are any valid tests of whether the apparent increment in model fit resulting from treating the items categorically is statistically significant?
 Linda K. Muthen posted on Saturday, February 16, 2008 - 6:14 am
I think there have been some studies that suggest with five or more categories and no floor or ceiling effects, treating a categorical variable as continuous does not make much of a difference. If the categorical variable has floor or ceiling effects, the categorical methodology can handle that better. You could do a Monte Carlo simulation where you generate categorical data that look like your data and then analyze them as continuous variables in one analysis and categorical variables in another analysis and see if one way is superior in reproducing the population values.
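The simulation idea can be sketched outside Mplus as well. The sketch below (pure Python, with invented thresholds and loading) generates 5-category items from a one-factor model and shows how a heavy ceiling attenuates the linear correlation between the observed score and the factor - the distortion the categorical methodology handles better:

```python
import random
from math import sqrt

random.seed(12345)

# Simulate 5-category items from a one-factor model and compare the
# linear correlation between the observed score and the factor under
# symmetric thresholds vs. a heavy ceiling (invented cut points).

def categorize(ystar, cuts):
    return sum(ystar > c for c in cuts)          # score 0..4

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate(cuts, n=20000, lam=0.7):
    fs, scores = [], []
    for _ in range(n):
        f = random.gauss(0, 1)
        ystar = lam * f + sqrt(1.0 - lam ** 2) * random.gauss(0, 1)
        fs.append(f)
        scores.append(categorize(ystar, cuts))
    return corr(fs, scores)

r_symmetric = simulate([-1.5, -0.5, 0.5, 1.5])   # no floor/ceiling effect
r_ceiling = simulate([-3.0, -2.0, -1.0, 0.0])    # most mass near the top
```

With symmetric cut points the coarsened score tracks the factor nearly as well as the continuous response; with the ceiling, the correlation drops noticeably, which is the kind of distortion a full Monte Carlo study of the estimates would quantify.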
 Richard E. Zinbarg posted on Monday, February 18, 2008 - 11:29 am
thanks Linda, we might just do a simulation study of this (we already have one planned in which we test whether treating a categorical variable as continuous makes much difference when estimating omega_hierarchical) and I am sure that would be helpful. Apart from a simulation study, though, I am wondering if it would be legitimate to conduct a test of the difference in fit for an individual data set. I don't know enough about IRT and categorical data analysis to know the answer to this question. My guess is that there are no legitimate tests for this purpose. When I run the same model with the items treated as being either categorical or continuous, the chi-square values and dfs seem to be on different orders of magnitude (e.g., in one model in which the items are treated as being categorical the model df = 191, whereas the exact same model with the items treated as continuous has model df = 2137).
 Linda K. Muthen posted on Monday, February 18, 2008 - 11:45 am
I would need to see the two outputs you are comparing to comment but you are comparing different estimators and models with a different number of parameters. In addition, if you are using WLSMV, the degrees of freedom do not have the same meaning as with ML for example. The continuous model will contain linear regression coefficients while WLSMV will provide probit regressions. With five-category items that have no floor or ceiling effects, I would expect similar p-values for the chi-square test of model fit and also similar ratios of parameter estimates to standard errors (column 3 of the output). If these are not similar, then I would use the categorical methodology.
 Richard E. Zinbarg posted on Monday, February 18, 2008 - 12:07 pm
Right - I am assuming, given the different estimators and the different meanings of the dfs, that there is not any test that could be validly used to compare the difference in fit. (In the particular example I am referring to, the p-values are indeed the same and the ratios of parameter estimates are similar for the most part, though in some cases the values are as different as, say, 6.9 vs. 9.8; the RMSEA estimates are highly similar - .053 vs. .042 - but the CFIs seem meaningfully different to me - .937 for the categorical model vs. .851 for the continuous model.)
P.S. thanks for the very speedy replies!
 Lily posted on Sunday, April 06, 2008 - 7:31 pm
Dear Dr. Muthen,

Does Mplus provide callable built-in functions for cumulative distribution function of standard normal, erf and integration where we can input the arguments?
I am doing my masters project and my supervisor recommended me to use Mplus. However, both of us are extremely new to Mplus and we are still learning from scratch.

Greatly appreciated,
 Linda K. Muthen posted on Monday, April 07, 2008 - 9:05 am
No, Mplus has no callable functions.
 Lily posted on Tuesday, May 13, 2008 - 5:13 pm
Hi Dr. Muthen,
I am fitting a model of the following form using WLSMV:
F BY Y1* (P1)
Y2-Y5 (P2-P5);

[Y1$1] (P6);
[Y2$1] (P7);
[Y3$1] (P8);
[Y4$1] (P9);
[Y5$1] (P10);
....some constraints..
And Mplus the output given:

Estimate S.E. Est./S.E.....

Y1 0.745 0.040 18.511
Y2 0.334 0.040 8.350

And also

Item Discriminations...

Y1 1.099 0.133 8.280 ..
Y2 0.355 0.045 7.904 ..

I am not sure how to interpret the results - specifically, which parameterization the first block of the output (the one directly under MODEL RESULTS) uses, i.e., what the relationship is between the estimates under Item Discriminations and those immediately under MODEL RESULTS.

Your help is greatly appreciated.
 Linda K. Muthen posted on Wednesday, May 14, 2008 - 10:07 am
The information under Model Results shows the results from the estimation of the model. The information under IRT PARAMETERIZATION is a translation of the results into the IRT parameters of discrimination and difficulty. See IRT under Special Mplus Topics on the website for details about the translation.
 Daniel E Bontempo posted on Tuesday, September 09, 2008 - 1:18 pm
For polytomous item models, where IRT parameter conversions are not provided, is there some difficulty or issue to be aware of? The discussion above refers to manual computation, but I also assume if there were no issues MPlus would just do it.

Do I convert the multiple thresholds to multiple difficulty parameters using the same conversion as in the dichotomous case?
 Bengt O. Muthen posted on Tuesday, September 09, 2008 - 6:51 pm
I don't think there is a particular difficulty involved. We will think about it, write it out, and include it in the tech doc.
 Anna Brown posted on Friday, September 12, 2008 - 8:16 am
Dear Bengt
One thing is puzzling me. When obtaining test information curves for a simple 1 factor model, results depend on the link used with ML estimation. With logit link, values are just over 3 times larger than for probit link. The shapes of TIC are pretty much the same.
Is this to do with the scaling constant 1.7? But when squared it gives 2.89, not 3?
Which is the "correct" information? It is important for estimating standard errors.
In my model factor variance is fixed to 1 and factor loadings are free.

Thank you
 Linda K. Muthen posted on Saturday, September 13, 2008 - 10:57 am
The scaling factor is approximately 1.7 to 1.8. It is not exact.
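A quick arithmetic check of Anna's observation: if the logit slopes are roughly c times the probit slopes, the information (proportional to the squared slope) scales by roughly c squared, and the commonly quoted range of the constant brackets the "just over 3" she sees:

```python
# If the logit slopes are roughly c times the probit slopes, information
# (proportional to the squared slope) scales by roughly c**2.
squares = {c: c * c for c in (1.7, 1.75, 1.8)}
# 1.7**2 = 2.89 and 1.8**2 = 3.24, bracketing a ratio of "just over 3"
```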
 Yaacov Petscher posted on Sunday, September 14, 2008 - 6:26 am

I have been doing some IRT modeling for the purpose of a large-scale assessment development, and have found some interesting things I was hoping to get clarification on. I generated a 2PL model in MPlus using maximum likelihood with the logit link, then a 2PL model in BILOG, and a 2-parameter normal ogive model in BILOG. When comparing the three methods, the MPlus results are nearly identical to the normal ogive results in BILOG, but are nowhere near the logistic results in BILOG. Though the correlations are of an expected magnitude (1.0), the parameter values differ greatly. Further, the discriminations are quite different (MPlus = 1.24, BILOG-ogive = 1.27, BILOG-log = 2.02). BILOG uses marginal maximum likelihood, but why do 2PL results from MPlus match the BILOG normal ogive model and not the BILOG 2PL model? Thanks!
 Linda K. Muthen posted on Sunday, September 14, 2008 - 10:54 am
If you are using maximum likelihood and the logit link, you should get the same results. This is not the default so you would have to specify it in the ANALYSIS command. The difference may be due to BILOG using the constant of 1.7 in its computations and Mplus not doing this in the model results; Mplus also gives the results in IRT metric, using the 1.7 constant. If none of this helps, send the files and your license number to
 Anna Brown posted on Wednesday, September 17, 2008 - 5:22 am
Thanks Linda
You did not answer which information is the "correct" one for computation of standard errors.
 Linda K. Muthen posted on Wednesday, September 17, 2008 - 10:48 am
The results differ by rescaling using a constant so there should be no difference except a difference in scale.
 Diana Clarke posted on Monday, October 06, 2008 - 2:52 pm
I am using MPLUS 5 to run a 2-parameter IRT model. I used the graph feature in the program to plot the overall ICC curves as well as the group (gender) differences in the curves. Can the size of the plot lines and symbols be modified? If not, is this something that is being worked on? It would certainly improve the look of the plots that are generated.

Also, is there a way to import the labels and titles saved in a previous plot to a new graph? These are some things that would improve the user friendliness of the software.
 Linda K. Muthen posted on Monday, October 06, 2008 - 4:20 pm
The size of the lines cannot be changed. Symbols can be changed by using the Line Series option under the Graph menu. At the present time, labels and titles cannot be saved. This is on our list of things to add.
 Mell Mckitty posted on Tuesday, October 07, 2008 - 5:10 am
I am a novice to IRT but am trying to examine item endorsement invariance across gender on a test with 10 dichotomously scored items (y vs. n). I am currently using Mplus 5. First I ran a CFA to confirm the unidimensionality of my scale, and then I ran a second model that included gender as a covariate (see code used below). Is significant DIF indicated simply by a significant direct effect of my covariate on the item?

codes used:
for testing of the unidimensionality:


MODEL: angst BY a-j*;

for item invariance by gender:

MODEL: angst BY a-j*;
angst on gender;
 Linda K. Muthen posted on Tuesday, October 07, 2008 - 6:50 am
A significant direct effect of a covariate on an item represents DIF. In your example it would be, for example,

j ON gender;

You may find the slides that discuss measurement invariance and population heterogeneity from our Topic 1 course handout helpful. Also, the Topic 2 course handout contains information specific to categorical outcomes. The video for these topics is also available on the website.
 Mell Mckitty posted on Monday, October 13, 2008 - 7:09 pm
Hi Linda,
Thanks for your help above.

My follow-up question relates to the issue of having a covariate that is a 5-level nominal variable (e.g., 5-level age group: 1=18-24, 2=25-34, 3=35-44, 4=45-64, 5=65+ with group 2 as my referent category): How would I plot the curves to look at DIF across age groups in comparison to the referent category? I know I would create 4 dummy variables with the referent category left out. However, when I try to plot the relationship I cannot figure out how to get the plot for the referent category. Can you make any suggestions?

Here is what I have done so far:

MODEL: angst BY a-j*;
angst on age1 age3 age4 age5;
j ON age1 age3 age4 age5;
 Linda K. Muthen posted on Tuesday, October 14, 2008 - 9:56 am
The referent model would be the model with all covariates equal to zero.
 Mell Mckitty posted on Tuesday, October 14, 2008 - 1:17 pm
Thank you.
 Mell Mckitty posted on Thursday, October 16, 2008 - 1:32 pm
Hi Linda,
Regarding my model above:

when I ran the model without the covariates included:

MODEL: angst BY a-j*;

I obtain the Item Discrimination and Item Difficulty information in the MPLUS output. However, once I include the covariates:

MODEL: angst BY a-j*;
angst on age1 age3 age4 age5;
j ON age1 age3 age4 age5;

The item discrimination and item difficulty information is no longer included in the Mplus output. Mplus gives me the odds ratios indicating significant or nonsignificant DIF for the age groups relative to the referent category (age2, which is left out). Is there a way for me to obtain the item discrimination and item difficulty information for each level of the covariate?
 Linda K. Muthen posted on Friday, October 17, 2008 - 9:11 am
Mplus does not provide these. You would have to compute them yourself using information from the IRT Technical Note that is on the website.
 Mell Mckitty posted on Monday, October 20, 2008 - 2:43 pm
Hi Linda/Bengt;
As Linda suggested, I downloaded the slides that discuss measurement invariance and population heterogeneity from your Topic 1 and Topic 2 course handouts. On slide 161, in discussing the interpretation of the effects, it was concluded that shoplift was not invariant. Would this also be true if the direct effect of gender on shoplift was statistically significant and positive (instead of negative) but all other effects remained the same as in the slide? That is, if, as expected, for a given factor value males had a higher probability of shoplifting than females. I am assuming that it would be, but this is not clear from what is written in the slides.

Another question relates to the calculation of the item discrimination and item difficulty for different levels of my covariate (see my post of Thursday, October 16, 2008 - 1:32 pm): would I use the model estimates or the standardized estimates? And which are the alpha and psi values in the Mplus output?

 Mell Mckitty posted on Monday, October 20, 2008 - 2:45 pm
Another related question:
How do you determine if DIF is uniform versus non-uniform? I am assuming that the inclusion of an interaction term would work. That is, a significant interaction term would indicate non-uniform DIF. Is this correct?

 Linda K. Muthen posted on Tuesday, October 21, 2008 - 10:34 am
A statistically significant direct effect implies DIF whether the coefficient is positive or negative. Only the interpretation would change.

Raw coefficients should be used. Alpha refers to factor means/intercepts. Psi refers to factor variances/residual variances and covariances/residual covariances.

Yes, adding an interaction term could do this.
 Thomas Scotto posted on Tuesday, October 21, 2008 - 11:34 am

A colleague and I are attempting to construct and interpret a polytomous item response model.

I want to make sure I am obtaining the slope and category thresholds that are commonly reported and are consistent with what you would obtain through running a graded response model in Multilog.

Following Example 5.5 in the manual, I begin by designating my estimator as robust maximum likelihood (MLR). I continue by fixing the variance of the latent construct at 1 and making sure I am using the logit link.

To obtain the thresholds, do I take the logit thresholds reported in Mplus and divide by the standardized factor loadings?

To obtain the slopes, do I take the standardized factor loadings and divide by the square root of (1 - factor loading^2)?

Is this correct?

Many thanks,

 Mell Mckitty posted on Tuesday, October 21, 2008 - 12:08 pm
Thanks, Linda.

And I would obtain the factor means/intercepts and factor variance/residual variances from TECH4, correct?
 Linda K. Muthen posted on Tuesday, October 21, 2008 - 12:29 pm
TECH4 contains factor means, variances, and covariances.
 Mell Mckitty posted on Wednesday, October 22, 2008 - 5:58 am
Once an interaction term is in the model, the plots of the ICCs are no longer available in Mplus. If a statistically significant interaction term is observed, thus indicating non-uniform DIF, how would one go about obtaining this plot in Mplus?
 Bengt O. Muthen posted on Wednesday, October 22, 2008 - 7:34 am
Is your interaction between 2 covariates or between a covariate and the latent variable?
 Mell Mckitty posted on Wednesday, October 22, 2008 - 8:16 am
My model was as follows:


MODEL: angst BY a-j*;
int | gender xwith angst;
j on angst gender int;

the results showed that:
estimate se 2-tail p
j on
int -0.783 0.312 0.012

j on
gender -0.619 0.311 0.047

so the interaction term was with the latent variable.
 Bengt O. Muthen posted on Wednesday, October 22, 2008 - 8:41 am
I don't see a way to plot this in Mplus currently.
 Thomas Scotto posted on Wednesday, October 22, 2008 - 8:56 am
Before it gets lost in the shuffle, is my logic for conversion that I lay out above correct?

Anyone have any thoughts?


 Mell Mckitty posted on Wednesday, October 22, 2008 - 10:54 am
Thanks for your quick reply. Is the use of the interaction term as indicated in my model (Mell Mckitty posted on Wednesday, October 22, 2008 - 8:16 am ) a sufficient way of assessing for non-uniform DIF?
 Bengt O. Muthen posted on Wednesday, October 22, 2008 - 11:12 am
Yes. As far as I can see, it is the same as doing a multiple-group analysis (corresponding to the dummy covariate) where the loading (the discrimination) also varies over the groups.
 Bengt O. Muthen posted on Wednesday, October 22, 2008 - 11:23 am

Comparing to equations (17), (18), and (19) in our IRT tech appendix document

your setup gives alpha=0 and psi=1, so it looks like the IRT "a" is the loading and the IRT "b" is the threshold/loading.

I don't believe Multilog uses the D=1.7 constant unless you request the "L2" option.

Check it out to see if you get agreement that way. The parameter estimates should be identical.
 Thomas Scotto posted on Thursday, October 23, 2008 - 8:44 am
To report--I got it!

The Mplus loadings under MLR were the same as those estimated by Multilog. To get the thresholds, the transformation is Mplus threshold/factor loading = Multilog threshold.
 Mell Mckitty posted on Thursday, October 23, 2008 - 2:40 pm
I ran a series of mimic models
1. without covariate
2. with binary covariate + interaction term to rule out non-uniform DIF,
if the interaction term was not sig
3. a model with the binary covariate
I tested the formulas from the technical notes by replicating the item discrimination (a) and item difficulty (b) values printed in the Mplus output for Model 1; there were only minor differences between the calculated b's and those from Mplus. I am now trying to use the formulas to calculate a and b for the different levels of my covariate (Model 3) for the item with DIF: How would the formulas be modified to do this?
I think that for the item of interest I would take the following information from the output of Model 3:
lambda = unstandardized estimate;
tau = unstandardized threshold;
alpha = estimated means for the latent variable from TECH4;

I am not clear on the value of psi:
1. Is psi the residual variance for the latent variable from the model output, or the covariance estimate for the latent variable in the TECH4 output?

2. how does the estimate of the effect of the covariate fit into this formula?

3. If uniform DIF is present, I believe that a would be constant across different levels of the covariate but that b would vary, so the estimate of the effect of the covariate should affect tau. Is this correct? And how?
 Bengt O. Muthen posted on Thursday, October 23, 2008 - 6:28 pm
This subject is discussed under Topic 2 in the Mplus course sequence. See slides 162-164 of the Topic 2 handout under

The DIF is expressed in terms of probit regressions (normal ogive in IRT language) using the WLSMV estimator. That can be used to translate into IRT metric using our IRT tech doc at

where the relationship between WLSMV probit and IRT is given. The corresponding translation from ML logit to IRT should follow from this.
 Mell Mckitty posted on Friday, October 24, 2008 - 9:47 am
Is there any way to specify an interaction term in a CFA with covariates when the WLSMV estimator is used? I tried but keep getting an error message. If not, is it sufficient to use the multiple-group method (i.e., grouping is gender (1=male, 2=female))?
 Bengt O. Muthen posted on Friday, October 24, 2008 - 10:09 am
No, WLSMV does not allow such interactions. But multiple-group analysis accomplishes this - letting the loadings vary over groups in addition to the thresholds.
 Shayne Piasta posted on Monday, October 27, 2008 - 2:54 pm
I have 2 questions re an IRT model w/ item predictors similar to Muthen, Kao, & Burstein (1991). Items a-j are binary with dummy coded item predictors specific to each item (flfn a-j, values of 1, 0), very similar to the OTL item predictors in Muthen et al. Truncated input follows:

F BY item_a* item_b-item_j;
F ON flfn_A-flfn_J;
item_a ON flfn_a;
item_j ON flfn_j;

I want to compare the 2 sets of difficulty parameters per item (e.g., the item_A parameters when flfn_A=1 and flfn_A=0.) I was somewhat unclear as to how to apply the Muthen et al. (1991, p. 10) formula to compute these, or whether computation would differ when using ML estimation. Would I simply (a) use (item threshold - regression weight)/item loading for predictor=1 and (b) use item threshold/item loading for predictor=0?

Secondly, I wonder whether effects of the item predictors must be modeled on both F as well as the items themselves. None of the predictor-factor loadings were significant, nor would I hypothesize that they would be. Rather, I would expect that item predictors only affect ability levels indirectly, through adjusting the likelihoods of correct responses. Could I constrain the loadings for F ON flfn_A-flfn_J to be zero or leave them out of the model completely, or must these paths be included for the model to be estimated/interpreted correctly?
 Bengt O. Muthen posted on Monday, October 27, 2008 - 6:39 pm
The slope for the item regressed on the binary covariate simply shifts the threshold for that item (the slope is of opposite sign of the threshold), so all formulas we have in our IRT tech doc follow once you have computed the 2 threshold alternatives.

I would let the covariates predict f as well. It would seem that, considering them all together, there might be an effect on f, although each covariate has a small effect.
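As a numerical sketch of the first point (the loading, threshold, and slope values here are hypothetical, and a standardized factor is assumed), the two difficulty alternatives per item could be computed as:

```python
def difficulty_pair(tau, lam, kappa):
    # The item-on-covariate slope kappa shifts the item's threshold,
    # so compute both threshold alternatives and divide by the loading:
    b0 = tau / lam            # item predictor = 0: b = tau / lambda
    b1 = (tau - kappa) / lam  # item predictor = 1: b = (tau - kappa) / lambda
    return b0, b1

# Hypothetical estimates for one item
b0, b1 = difficulty_pair(1.0, 1.2, 0.4)
```

The remaining translations in the IRT tech doc then apply to each threshold alternative in turn.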
 Mell Mckitty posted on Saturday, November 01, 2008 - 2:08 pm
From the model:


MODEL: angst BY a-j*;
int | gender xwith angst;
j on angst gender int;

would you agree with the following calculations:

a = loading + interaction*x

Since the interaction term allows the loading (the discrimination) to also vary over the groups,


b = (threshold + direct effects*x)/D
 Bengt O. Muthen posted on Saturday, November 01, 2008 - 3:20 pm
I agree that the loading in Mplus metric is modified as

loading + interaction*x,

and that the Mplus threshold is modified as

threshold + direct effect*x.

But the translation from Mplus parameters to ML IRT parameters a and b are given in

as (18) and (19). Here alpha=0, but psi and D play into it.
 Mell Mckitty posted on Saturday, November 01, 2008 - 5:35 pm
Sorry, I made a mistake in typing my equations above. The equations I used are:

1. a= (loading + interaction*x)/D; and
2. b= (threshold + direct effects*x)/loading

So D is taken into account in the calculation of a. Also, psi refers to the factor variance, and angst@1 indicates that psi is set at 1. Therefore, equation 18 from the IRT technical notes, taking into account the interaction term, becomes

a=((lambda +interaction*x)*sqrt(psi))/D, which is the same as 1 above.

Equation 19 with alpha=0 and psi = 1 and taking into account the direct effects of the covariate becomes:

b = ((tau + direct effect*x) - lambda*alpha)/(lambda*sqrt(psi)) = (tau + direct effect*x)/lambda, which is the same as equation 2 above.

 Bengt O. Muthen posted on Sunday, November 02, 2008 - 7:34 am
That looks correct.
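The two confirmed equations can be sketched as code (the loading and threshold values below are hypothetical; the interaction and direct-effect estimates are the ones quoted earlier in the thread, and alpha = 0, psi = 1 are assumed):

```python
D = 1.7  # constant bringing logit and probit close

def irt_a(lam, inter, x):
    # Equation 1: a = (loading + interaction*x) / D
    return (lam + inter * x) / D

def irt_b(tau, direct, lam, x):
    # Equation 2: b = (threshold + direct effect*x) / loading
    return (tau + direct * x) / lam

# Hypothetical loading (1.10) and threshold (0.50) for item j at x = 1
a_x1 = irt_a(1.10, -0.783, 1)
b_x1 = irt_b(0.50, -0.619, 1.10, 1)
```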
 Mell Mckitty posted on Sunday, November 02, 2008 - 1:47 pm

My next question relates to the calculation of the probability, which needs to take the indirect effects into account as well. So, how would the indirect effect in my model be incorporated into equation 17 (from the IRT Technical Notes) to calculate the probability P(Ui = 1|f)?

I know that the indirect effect (i.e., angst ON x) affects the mean of the latent trait (i.e., theta). As such, I am assuming that the indirect effect would be added to theta in equation 17. Is this reasoning correct? I have done this, and the probability and the related plot of the ICC curves for my indicated item at the different levels of my covariate seem correct, but I would like some confirmation.

 Bengt O. Muthen posted on Monday, November 03, 2008 - 7:53 am
I think you are referring to the linear regression equation

f on x;

where in your case f is angst. This equation is estimated as

f-predicted = beta0 + beta1*x,

where beta0 is the factor intercept fixed at zero and beta1 is the estimated slope. So, this equation is only used to compute relevant f values for the ICC plot, as a function of the x values.
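As a sketch of how these pieces fit together for plotting (all parameter values hypothetical; a logistic ICC of the form used in the IRT tech doc is assumed):

```python
import math

def f_predicted(beta1, x, beta0=0.0):
    # f-predicted = beta0 + beta1*x, with the factor intercept beta0 fixed at zero
    return beta0 + beta1 * x

def icc(f, lam, tau):
    # Logistic item characteristic curve: P(u = 1 | f) = 1 / (1 + exp(tau - lam*f))
    return 1.0 / (1.0 + math.exp(tau - lam * f))

# Hypothetical: slope of f on x is 0.3; item has loading 1.2 and threshold 0.5
f_val = f_predicted(0.3, x=1)
p = icc(f_val, 1.2, 0.5)
```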
 Mell Mckitty posted on Monday, November 03, 2008 - 9:39 am
Sorry, I think I am getting a number of different questions all mixed up:

1. Correct, I wanted to know the equation to calculate angst on x and to plot the ICC curves.

2. I wanted to figure out how to identify theta in my output under MLR estimation with the interaction included. I just went over your notes again and realized that theta is the residual variance, which gets printed when STANDARDIZED is requested in the OUTPUT command. However, the STANDARDIZED (STD, STDY, STDYX) options are not available for TYPE=RANDOM, and TYPE=RANDOM is necessary if an interaction term is specified in the model. Is there any other way to get the residual variance for my indicated item once an interaction term is specified in the model?

3. I am assuming that a combination of the direct, indirect, and interaction effects must be taken into account in the calculation of P(Ui = 1|f). So, how does the indirect effect fit into equation 17?
 Bengt O. Muthen posted on Tuesday, November 04, 2008 - 2:11 pm
Regarding point 2., the residual variance parameter theta is not present in your maximum-likelihood estimation using the logistic form. That's why you don't see it in (1) and (2) of our IRT document that we are discussing, nor in (18) and (19).

I'm sorry, I can't go further than I already have on points 1 and 3 because it turns into statistical consulting which we don't have time for. Perhaps you want to discuss with your local statistical consultation center.
 Rick Sawatzky posted on Thursday, November 06, 2008 - 1:52 pm
I need to calculate the first and second derivatives of the log-likelihood function with respect to the factor scores for an IRT model similar to ex5.5 so as to be able to calculate the maximum likelihood IRT scores and the information function. In conventional IRT, the first and second derivatives can be obtained by adding equations 1 and 2 respectively across all the items:

1) D1(theta) = a*(u - P(theta))
2) D2(theta) = -a^2*P(theta)*Q(theta)
where P(theta) = 1/(1 + EXP(-a*(theta - b))), Q(theta) = 1 - P(theta), a is the discrimination parameter, and u is the item response (1 or 0)

These calculations, however, do not yield the anticipated results when using the IRT parameters estimated in Mplus (version 5.1). Could you please tell me how to obtain these derivatives based on the parameters estimated in Mplus (e.g., as in ex5.5)? Thank you.
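For reference, a minimal Newton-Raphson scorer built from the standard summed 2PL derivatives (D1 = sum of a*(u - P), D2 = -sum of a^2*P*Q) might look like the sketch below (hypothetical item parameters; this is a generic 2PL ML scorer, not a reproduction of what Mplus computes):

```python
import math

def p2pl(theta, a, b):
    # 2PL probability: P(theta) = 1 / (1 + exp(-a*(theta - b)))
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_theta(resp, a, b, theta=0.0, iters=50):
    # Newton-Raphson on the log-likelihood using the summed derivatives:
    # D1 = sum_i a_i*(u_i - P_i), D2 = -sum_i a_i^2 * P_i * Q_i
    for _ in range(iters):
        d1 = d2 = 0.0
        for u, ai, bi in zip(resp, a, b):
            p = p2pl(theta, ai, bi)
            d1 += ai * (u - p)
            d2 -= ai ** 2 * p * (1.0 - p)
        step = d1 / d2
        theta -= step
        if abs(step) < 1e-10:
            break
    return theta

# Hypothetical 3-item example with a mixed response pattern
theta_hat = ml_theta([1, 0, 1], [1.2, 1.0, 0.8], [-0.5, 0.0, 0.5])
```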
 Bengt O. Muthen posted on Friday, November 07, 2008 - 9:14 am
Are you considering the "maximum likelihood" estimator of the latent factor score "theta," or the "expected a posteriori" (EAP) estimator? With the ML estimator for the parameter estimates, Mplus uses the latter, which implies that a normal "prior" is used in the calculations. See IRT books.
 dkim posted on Friday, November 07, 2008 - 3:42 pm
I have a question about the definition of “linear” CFA vs “nonlinear” CFA.

According to MPLUS manual (Example 5.7 non-linear CFA), it seems that the term “nonlinearity” is defined in terms of how factors are specified (e.g., interaction, quadratic).

I think the 2PL (Mplus Example 5.5) IRT model belongs to the class of nonlinear one-factor CFA models. Am I correct? I tried to run a 2PL IRT model with 60 binary items, but it took a really long time to get results. I changed the estimator from MLR to WLSMV, which gave results relatively quickly. If I use WLSMV instead of MLR, can I still say that I am estimating an IRT model? I heard that if I use estimators other than MLR, I am estimating a 2-parameter normal ogive model (not a logistic model).

MODEL: f BY u1-u60;

Can this model be considered nonlinear CFA model?
 Bengt O. Muthen posted on Friday, November 07, 2008 - 5:34 pm
With continuous outcomes, a model is non-linear if it has non-linear functions of factors. With categorical outcomes, the model is always non-linear because the conditional expectation function (the item characteristic curve) is non-linear.

2PL IRT with 60 binary items should go very quickly using ML because you use only unidimensional integration over the single factor.

WLSMV uses probit which in IRT language is the "normal ogive". This is still an IRT model.
 Jason Bond posted on Monday, November 10, 2008 - 3:14 pm
I was wondering about Differential Criterion Functioning (DCF), ala:

Saha, T.D., Stinson, F.S., & Grant, B.F. (2007). The role of alcohol consumption in future classifications of alcohol use disorders. Drug and Alcohol Dependence, 89, 82-92.

which seems to be an analog of using IRT and multiple-group analysis to test for differences between the a and b parameters (in a 2-parameter logistic version of the model) across groups. Could the DCF be computed in this way? Similarly, when they plot Test Response Curves (TRCs), which are supposed to indicate DCF differences, they plot Expected Raw Scores against the severity factor. Is it clear what these Expected Raw Scores should be? Thanks much in advance,

 Linda K. Muthen posted on Wednesday, November 12, 2008 - 8:15 am
Neither of us is familiar with this paper, so we cannot comment.
 Thomas Scotto posted on Monday, November 17, 2008 - 2:45 am
Bengt and Linda,

To follow up on a post above, I successfully transformed the thresholds of my polytomous IRT model reported by Mplus after employing the MLR estimator into those reported by Multilog, using (Mplus threshold/factor loading) = Multilog threshold.

However, I'm off when I try to transform the standard errors. Is there something I'm missing? It looks like it should be a simple transformation. The z-scores reported are close, but not spot on.


 Bengt O. Muthen posted on Monday, November 17, 2008 - 6:18 am
If you use MODEL CONSTRAINT to describe the transformation, you get the correct delta method SEs. Note that such SEs involve not only the SEs for the threshold and factor loading but also their covariance.
 Tammy Tolar posted on Thursday, December 04, 2008 - 9:06 am
Are you working on capability for estimating 3 parameter IRT models? If so, can we get a beta version?
 Bengt O. Muthen posted on Thursday, December 04, 2008 - 9:27 am
Not right now, although it is on our list of things to add.
 Jennifer Rose posted on Tuesday, March 10, 2009 - 8:16 am

I've read through the postings on conversion of factor analytic parameters to IRT parameters. In my case, I've run a multiple-group CFA model with categorical observed variables, regressing the single latent factor on covariates, using WLSMV with the Delta parameterization. The model uses the default constraint alpha=0, and loadings and thresholds for variables showing evidence of DIF in previous analyses are free to vary across groups.
What I need clarification on is which values of alpha and psi I should use to convert the loadings and thresholds to IRT discrimination and difficulty parameters using equations 19 and 22 in the IRT technical appendix. Should I use the TECH4 estimates, or should I use alpha=0 and the residual variance estimates in the output?

Either way, I would get different IRT parameters for variables that were constrained to be equal across groups (i.e., noninvariant). Is this because I have regressed covariates on the latent factor, and would I just explain this when I present the results?

Thanks so much for your help,
 Bengt O. Muthen posted on Tuesday, March 10, 2009 - 4:08 pm
A couple of points here.

With multiple-group CFA (or IRT), the default is alpha=0 in the first group and free in the other groups (so not fixed at zero in all groups).

To get the standard IRT metric you would use the TECH4 means and variances for the factor.

But if you do that, then your IRT curves will be different even though the thresholds and loadings are equal - that's a function of the standard IRT metric using a different standardization (different alpha and psi) in each group. So I would just use the alpha, psi standardization in, say, the first group and not the other groups - you can then see invariance in the item curves.

Note also that WLSMV uses probit, not logit.
 Jennifer Rose posted on Tuesday, March 10, 2009 - 6:06 pm
Hi Bengt,

Thanks for the clarifications. When I ask for IRT curves using the Mplus graph option after running the model, are those curves calculated based on the approach you suggested, where just the alpha, psi standardization from the first group is used to calculate the IRT parameters? When I use the graph option, I do get IRT curves that are the same across groups for the invariant items but that differ for the noninvariant items.

 Darya posted on Wednesday, March 11, 2009 - 11:11 am

I am fitting a 3-factor CFA model with ordered categorical items using the WLSMV estimator. I let all factor loadings and item thresholds be free in the model and fixed factor variances and means to 1 and 0, respectively, for identification purposes. I want to fit information curves (IIC) to these items given the 3-factor structure.

1. Is that OK to do (given IRT analyses typically fit IICs for unidimensional models)?

2. I know it is feasible to do in Mplus, but I was wondering if these information curves are correct. That is, do they have the same interpretation as the IICs fitted in a one-factor IRT graded response model?

3. And, would you tell me if there is documentation regarding this application of IICs (e.g. Mplus technical notes and citations/references)?

Thank you very much for your help!
 Darya posted on Wednesday, March 11, 2009 - 11:59 am
And, just one more question:

4. Are these IICs plotted on a probit scale?

 Bengt O. Muthen posted on Thursday, March 12, 2009 - 12:30 pm
Mplus computes information curves also in the multifactorial case. The curve for items loading on a factor draws on the full multivariate information using the second-order derivative with respect to the factor in question. The remaining factors are substituted by their means. See also our IRT tech note:
 Bengt O. Muthen posted on Thursday, March 12, 2009 - 5:32 pm
Answer to Jen of March 10. I was being confusing - the translation to IRT parameter values uses alpha and psi to bring them to the N(0,1) metric used in IRT. The IRT curves that Mplus plots, however, use the Mplus factor parameterization, and because what is drawn is the probability given the factor, the factor mean (alpha) and factor variance (psi) do not enter into the curve (only into the location and range of the x-axis). So, yes, invariant items will show up as invariant even across groups with different alpha and psi.
 Cameron McIntosh posted on Tuesday, March 31, 2009 - 7:52 pm

I am working on a latent growth curve model where the items for assessing the construct (social support) changed after the second wave of data. In particular, new items were added, and binary yes/no response scales were changed to 4-point Likert scales.

Thus the assumption of measurement invariance over time is surely violated. The only glimmer of hope I see is some form of IRT score equating across the different versions of the social support instrument. This probably could not be done in Mplus, but rather in another program, followed by importing the IRT scores for analysis in the LGC model. Any comments you might have on the reasonableness/feasibility of such a procedure would be greatly appreciated.

 Bengt O. Muthen posted on Wednesday, April 01, 2009 - 10:33 am
If there are at least some items that are repeated in the same format, there would be hope for equating - which could be done in Mplus in a single modeling step (unless data called for a 3PL). Otherwise not, I don't think. Changing from binary to 4-point scales can make a big difference I would imagine.
 Michelle Little posted on Friday, May 22, 2009 - 8:16 am

I have two questions about using a bifactor model to obtain IRT estimates and factor scores (i.e theta scores) based in the graded response model (using 3-cat ordinal scale observed variables).

Question 1
I've seen that folks report using MPLUS and other specialized full information programs to derive IRT estimates from bifactor graded response models(Gibbons, Rush et al., 2009; Reise et al 2007).

I am finding that a bifactor model fits my data best in many cases (child externalizing dimensions). But after reading several postings, I'm not clear whether bifactor loadings and thresholds may be used to derive IRT discrimination and difficulty parameters in the same way they are used in the one-dimensional case, because items load on multiple factors.

Question 2
If I derive factor scores in Mplus from a bifactor model, what are the resulting factor scores analogous to, in terms of the information provided?

Would the factor scores provide theta estimates based on all factors (averaged out)?
Or, can I derive a factor score that provides info on the general factor with specific dimensionality factored out?
Is it possible to pull multiple factor scores when using multiple factors?

Any help would be appreciated.
Thank you!
 Linda K. Muthen posted on Friday, May 22, 2009 - 2:27 pm
1. If you send us the paper, we will see what they do.

2. Each person gets a factor score for each factor.
 Tony LI posted on Friday, June 12, 2009 - 9:34 am
Dear Linda,

Just wondering do you have any insights RE: Michelle Little May 22, 2009 question?

Thanks !
 Linda K. Muthen posted on Friday, June 12, 2009 - 9:42 am
We never received the paper.
 Ying Li posted on Thursday, June 25, 2009 - 1:09 pm
Hi Linda,

I am using Mplus to do a 2-factor CFA, or 2-dimensional IRT. I am wondering if I can get factor scores for the 2 factors respectively.

I tried SAVEDATA: file is...;

But no factor scores were computed.

Thanks a lot for your time.

 Linda K. Muthen posted on Thursday, June 25, 2009 - 4:08 pm
If you are using Version 5.21, you should get these. If you are and still do not, please send your files and license number to
 Ying Li posted on Friday, June 26, 2009 - 1:49 pm

Thanks. I am wondering if Mplus can do marginal maximum likelihood estimation. Is there an example of it?

Thanks a lot for your time.

 Bengt O. Muthen posted on Friday, June 26, 2009 - 2:35 pm
Yes - use estimator = ML. See UG ex 5.5.

Mplus can also do weighted least squares estimation - see UG ex 5.2.
 Scott R. Colwell posted on Friday, October 02, 2009 - 6:27 am
Does this make sense for correcting correlations for attenuation?

I am trying to correct a 4x4 correlation matrix of observed variables for attenuation and I think that I can do this quite easily with Mplus.

If I model each of the four observed variables as single reflective indicators of four latent variables (1 indicator per LV), and set the loading to 1 for each, wouldn't the correlation of the latent variables be the corrected correlation of the observed variables?
 Linda K. Muthen posted on Friday, October 02, 2009 - 6:39 am
When you create a latent variable that is identical to an observed variable, the correlations among the latent variables will be the same as the correlations among the observed variables.

For continuous outcomes you do this as follows:

f BY y@1;

For categorical outcomes you do this as follows:

f BY u@1;

For continuous outcomes, the residual variance of y is not identified unless you fix it at zero.
 Scott R. Colwell posted on Friday, October 02, 2009 - 7:09 am
Of course....I didn't think that through very well. Too early in the morning I guess.

Are there any ways of correcting a correlation matrix for attenuation in Mplus?

Thank you,

 Linda K. Muthen posted on Friday, October 02, 2009 - 3:35 pm
See the Topic 1 course handout under the topic Measurement Errors And Multiple Indicators Of Latent Variables.
 George Bohrnstedt posted on Saturday, October 24, 2009 - 1:39 pm
Linda and Bengt: Hello from an "old" friend! A colleague and I are using Mplus to do a graded response IRT model. We have four items, each with four response categories. The most direct question is whether the results for the discrimination and threshold parameter estimates need to be rescaled, and if so, how? I ask because you do rescale estimates for 1PL and 2PL models, taking into account what I assume is the probit/logit distinction in estimation. But I see no such rescaling option for the graded response model. Without rescaling, the results seem literally to be "off the map."

Thanks for your help.
Very best,

Geo B.
 Bengt O. Muthen posted on Saturday, October 24, 2009 - 3:50 pm
Good to hear from you, George.

These matters are covered in our Mplus Short Course on "Topic 2" - see the Topic 2 handout at

on slides 93-94. In short, Mplus operates in a factor analysis metric considering the probit/logit argument

(1) - tau_jk + lambda_j*eta

for item j, item category k, threshold tau and factor eta. In contrast, IRT considers

(2) D*a_j (theta - b_jk)

using the Samejima graded response model where D is chosen to make logit and probit close (1.7), a is the discrimination, theta is the "factor", and b are the difficulties.

You go from the Mplus results in factor metric to the IRT metric as follows. The Mplus IRT tech doc on our web site implies that when you run Mplus with the factor standardized to zero mean and unit variance (as is typical in IRT), a comparison of (1) and (2) gives

(3) a_j = lambda_j/D,
(4) b_jk = tau_jk/lambda_j

Check if that doesn't get you results in a metric seen in IRT. You can do the translation (3) and (4) in Model Constraint using parameter labeling so a_j and b_jk get estimates and SEs.
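A small sketch of the translation in (3) and (4) (the loading and threshold values are hypothetical; the factor is assumed standardized to mean 0 and variance 1):

```python
D = 1.7  # scaling constant making logit close to probit

def graded_to_irt(lam, taus):
    # (3) a_j = lambda_j / D
    # (4) b_jk = tau_jk / lambda_j
    a = lam / D
    bs = [tau / lam for tau in taus]
    return a, bs

# Hypothetical item with loading 1.36 and three thresholds (four categories)
a, bs = graded_to_irt(1.36, [-1.0, 0.2, 1.5])
```

In Mplus itself, the same translation can be done in MODEL CONSTRAINT, which also yields SEs for a_j and b_jk.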
 Michael Thomas posted on Thursday, December 03, 2009 - 6:52 am

I would like to ask a follow-up question regarding IRT parameter estimates for bifactor models. I am wondering whether the item parameter estimates provided by Mplus are appropriate for multidimensional/bifactor models. That is, can I apply the usual IRT transformations to factor loadings and thresholds estimated with WLSMV, and does MLR produce the correct IRT parameterization on its own?

Also, do the plots of information functions have the same meaning as they do for unidimensional models [SE = 1/sqrt(info)]?

Thank you for any advice,

 Bengt O. Muthen posted on Thursday, December 03, 2009 - 6:05 pm
You can use Mplus for bifactor models with categorical outcomes. Because you have more than one factor, you don't get the IRT translation but you can do it yourself by hand.

The answer is yes to your information function question, although the actual details are more complex. Mplus provides information functions also for multiple factors, but the information function for a given factor depends on which value of the other factor you consider. Because of this, Mplus lets you plot the information function for one factor at a value of the other factor that you choose (such as the mean). You can also condition on covariates, so this plotting is quite general.
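For the unidimensional case the question refers to, the information function and its SE relationship can be sketched as follows (a 2PL logistic form and hypothetical parameters are assumed; the multifactor case is more involved):

```python
import math

def item_info(theta, a, b):
    # 2PL item information: I_j(theta) = a^2 * P * (1 - P)
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def se_theta(theta, items):
    # Test information sums the item informations; SE(theta) = 1 / sqrt(I)
    total = sum(item_info(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(total)

# Hypothetical four-item test, all items centered at theta = 0
se0 = se_theta(0.0, [(1.0, 0.0)] * 4)
```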
 Anna-Maria Fall posted on Monday, January 11, 2010 - 10:43 am
Dear Drs. Muthen,

I would like to ask a question about the fit indices for an IRT model. A colleague and I are using Mplus to fit a graded response IRT model. Our items have four response categories (Likert scale). We are interested in the absolute fit of the model. Since there are problems with using ML chi-square values to assess differences in fit between models, we decided to use the WLSMV estimator to assess model fit.

The output we got was quite a surprise (see below). We don't know why our CFI value is lower than our TLI value. Should we be concerned about this outcome? Do you have an explanation for it?
Thank you for your time,


Chi-Square Test of Model Fit

Value 1835.004*
Degrees of Freedom 213**
P-Value 0.0000

Chi-Square Test of Model Fit for the Baseline Model

Value 16181.845
Degrees of Freedom 38
P-Value 0.0000

CFI 0.900
TLI 0.982

Number of Free Parameters 143

RMSEA (Root Mean Square Error Of Approximation) Estimate 0.059

WRMR (Weighted Root Mean Square Residual)
Value 1.921
 Linda K. Muthen posted on Tuesday, January 12, 2010 - 9:30 am
These discrepancies can occur. See the Yu dissertation on the website for information about fit statistics and their behavior. This would make me suspicious about my model. Try alternative specifications.
 Jeff Williams posted on Wednesday, March 10, 2010 - 11:31 am
On Wednesday, December 17, 2003 - 9:00 am, Linda stated that the formula


was only valid if there were no cross-loadings. Is that because it is scaling the loading by the variance not explained by the target factor instead of the variance not explained by any factor? If so, could you include that influence with this adjustment:


to scale the parameter by the overall residual variance? If not, what is the correct formula when there are cross-loadings? Finally, could you point me to an article that discusses this issue?

 Linda K. Muthen posted on Wednesday, March 10, 2010 - 2:08 pm
You would need to include also the covariance between the two factors.
 Jeff Williams posted on Wednesday, March 10, 2010 - 3:05 pm
Yes, of course. I forgot to mention that one of mine is a method factor. In the correlated case, it would be


Does anyone know of a paper that addresses the issue of cross-loadings specifically in terms of IRT and/or item/test information functions?
 Linda K. Muthen posted on Thursday, March 11, 2010 - 6:32 am
That looks correct.

I am not aware of any such paper.
 liutiechuan posted on Monday, April 19, 2010 - 9:45 pm
Dear Drs. Muthen

I've found something that confuses me.

I ran a categorical CFA (five binary items, one factor) and obtained a reasonable solution. For the second item, all parameters are positive.

I then reversed the scoring of the second item and analyzed the data again using the same model setup.

The result confuses me. The absolute values of all parameters are the same as before, but the signs of the loading and threshold become negative. In the output, the discrimination parameter also becomes negative, yet the difficulty parameter is still positive, as before.

I have had IRT training and know something about the conversion formulas between factor analysis and IRT, but I'm still confused by this output, especially the negative discrimination parameter.

Can you give me some suggestions or references to explain this output? Thank you very much!
 Linda K. Muthen posted on Tuesday, April 20, 2010 - 8:59 am
Please send the two outputs and your license number to
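For readers following the sign question above: with the standard conversions a = lambda/D and b = tau/lambda, reverse-scoring an item negates both the loading and the threshold, so the discrimination changes sign while the difficulty does not. A minimal numeric check with made-up values:

```python
D = 1.7
lam, tau = 0.9, 0.5   # hypothetical loading and threshold

# Original scoring
a1, b1 = lam / D, tau / lam

# Reverse-scored item: both loading and threshold flip sign
a2, b2 = -lam / D, -tau / -lam

assert a2 == -a1   # discrimination changes sign
assert b2 == b1    # difficulty is unchanged
print(a1, a2, b1, b2)
```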
 Vivia McCutcheon posted on Thursday, May 20, 2010 - 10:42 am
Dear Drs. Muthen,

I have estimated a multi-group IRT (4-groups) using the MLR estimator. The means are set to zero in one group and allowed to vary in the others. Variances are constrained to 1. I first tested differences in thresholds for each item (there are 9) individually by comparing model fit. After accounting for all differences in thresholds I next tested differences in item difficulty.

My question regards the significance of thresholds. Two items have non-significant thresholds in one or more groups according to the p-values and confidence intervals, although the discrimination parameters are significant. Is something wrong here? If so, how do I troubleshoot? If not, how does one interpret, if at all, a non-significant threshold?

Thank you.
 Jon Heron posted on Friday, May 21, 2010 - 12:18 am
Hi Vivia,

I've just been doing this very same thing so have looked back at my output.

The p-values for the thresholds indicate whether they are significantly different from zero. Assuming that you have binary data and are not modelling a guessing parameter, a non-significant threshold just tells you that, for this item and within that group, positive and negative endorsement of the item are about equally likely at the centre of your trait - you could verify that by plotting the ICC. It is certainly not something to worry about.

In your model I don't think the tests for the thresholds are particularly informative, although the parameter SEs might be useful to get a better handle on how these parameters differ across groups following your omnibus test.

bw, Jon
 Bengt O. Muthen posted on Friday, May 21, 2010 - 6:28 pm

I am a bit unclear on what you are doing here. When you talk about means and variance I assume this is for the factor. If you have multiple groups you want to test for measurement invariance. So assuming this is what you do, the factor variance should not be fixed at one in all groups.

And then you say

"After accounting for all differences in thresholds I next tested differences in item difficulty."

which confuses me because the item difficulties are functions of the thresholds.
 Melanie Wall posted on Wednesday, July 21, 2010 - 4:38 am
Is it possible in Mplus to obtain a summary measure of the area under the total information curve from an IRT model?
 Linda K. Muthen posted on Wednesday, July 21, 2010 - 10:11 am
We give only a plot. We don't give a summary measure.
 Lesa Hoffman posted on Friday, September 03, 2010 - 3:58 pm
What is the difference between ML and MLR for IRT models (i.e., assuming categorical data and full information estimation)? For continuous data, I understand that MLR is supposed to help adjust SEs and test statistics for non-normality, but if normality is not assumed for categorical data to begin with, then what does MLR have to offer over ML? I apologize if this topic is already addressed elsewhere, but I could not find it. Thanks in advance for any direction you can provide!
 nanda mooij posted on Monday, September 27, 2010 - 2:38 am
I have conducted an IRT analysis with Mplus and saved the factor scores. But these factor scores don't correlate with the observed scores - how is this possible? They are supposed to correlate, right? I used the WLSMV estimator.

Thanks a lot,
 Linda K. Muthen posted on Monday, September 27, 2010 - 10:23 am
It sounds like you may be using an old version of Mplus where there were problems with the factor scores. If not, please send your input, data, output, and license number to
 nanda mooij posted on Tuesday, October 05, 2010 - 12:07 pm
Hi, I have another question. I am trying to fit a very large model, but when I do this I get the following warnings:

I get this warning for many more variables, but not all.

I checked the correlations, but the variables are not correlated that highly, and no correlation is below zero. What does this mean? What is meant by zero cells in the bivariate table?

 Linda K. Muthen posted on Wednesday, October 06, 2010 - 10:00 am
Zero cells in the bivariate table of two dichotomous variables imply a correlation of plus or minus one. The two variables should not both be used, because one of them contributes no additional information. This can happen with small samples and skewed variables.
 Sigrid Bloemeke posted on Thursday, October 21, 2010 - 10:12 am
Dear Drs. Muthen,
We analyzed the same data set using Mplus and CONQUEST. We applied a single-factor model with 72 categorical items (68 dichotomous, 4 partial-credit items with 3 categories; sample size 13,004).

In Mplus we specified the model f1 by I1-I72@1 with:
We expected the same item difficulty parameters as in a CONQUEST analysis with the standard model specification item+item*step (constraints=cases, score (0 1 2) (0 0.5 1)).

For all dichotomously scored items this was in fact true. But for all 4 partial-credit items the threshold parameter estimates differed remarkably. Using Mplus, we got e.g. the threshold parameter estimates I69$1=0.697 and I69$2=2.276. Using CONQUEST we got an item difficulty of 1.84734 and a step parameter of 0.11344. This corresponds to the threshold parameters I69$1=1.96078 and I69$2=1.7339; that is, the threshold parameters are unordered! The results for the other partial-credit items were similar.

Both programs yield the same response category proportions. Increasing the number of integration nodes doesn't change the results. Inspecting the residual statistics reveals that the differences between the observed and the model-implied response category proportions seem small. The item fit statistics do not indicate any misfit, either.

We’d appreciate any help, Sigrid
 Linda K. Muthen posted on Friday, October 22, 2010 - 9:53 am
Mplus estimates Samejima's graded response model, not a partial credit model.
 Melanie Wall posted on Wednesday, November 10, 2010 - 12:18 pm
I am fitting a single-factor model with both continuous and dichotomous indicators. If I have all dichotomous outcomes, Mplus will output "IRT parameter" estimates and standard errors (the estimates are obtained through the conversion formulas, and I assume the standard errors are obtained by the delta method, as in MacIntosh and Hashim). My question is: is there a way to get Mplus to output these "IRT parameters" for the dichotomous indicators when I have a mix of dichotomous and continuous indicators?

I know I could use ML and get the IRT parameters directly but I am particularly interested in using WLS.
 Bengt O. Muthen posted on Wednesday, November 10, 2010 - 4:16 pm
I don't think so currently.
 James L. Lewis posted on Friday, December 03, 2010 - 10:49 am

Example 5.5 in the manual gives the program for a 2PL Graded Response Model. I am specifying CATEGORICAL (ordered polytomous) indicators. If I am understanding correctly, using the WLSMV (instead of MLR) estimator gives me the same model but it is 2 parameter normal ogive and not 2PL.

1. Is this correct?

I have read elsewhere on this board that the loadings that are output for the WLSMV estimator (normal ogive(?) above) (A) do not correspond directly to IRT "a" parameters but (B) require a transformation to correspond to the "a" parameters, and that (C) the transformation is only possible if items only load onto a single factor.

2. Are (A), (B), and (C) correct? If so and especially if (C) is also correct, has anything been worked out yet anywhere to your knowledge that allows for the transformation if items that load onto >1 factor (I am doing a bi-factor model with WLSMV and want factor scores and "a" parameters for the general factor).

3. Can I get the standard errors of the factor scores when using the WLSMV estimator? They are automatically given in the FSCORES file with the (more computationally intensive) 2PL model, but do not appear when using WLSMV. Is there a way to get them?

Thanks so much.

 Linda K. Muthen posted on Sunday, December 05, 2010 - 11:31 am
1. Yes.
2a. Yes.
2b. Yes.
2c. Mplus gives these only when the model has one factor.

3. No.
 Rick Sawatzky posted on Wednesday, December 29, 2010 - 10:42 am
For a two-parameter logistic item response theory (IRT) model (as in Example 5.5), could you please confirm whether the standard error of the factor scores (in the SAVEDATA file) is the inverse of the square root of the information function, as defined in formula 14 of the MplusIRT2 document? Thank you very much.
 Bengt O. Muthen posted on Wednesday, December 29, 2010 - 5:05 pm
Yes, it is.
 Stefan Schneider posted on Saturday, May 21, 2011 - 2:51 pm
Dear Drs. Muthen,
We would like to examine temporal measurement (non-)invariance in an IRT model with “dense” repeated measurement: participants (n = 100) completed the same 6 items (5 ordered categorical response options per item) every day over 28 days. Instead of estimating 28 factors simultaneously, we think about estimating a single factor for all 2800 days, using the TYPE=COMPLEX option to adjust for the non-independence of observations. In this case, would it make sense to introduce “temporal” covariates (e.g., day of assessment as continuous covariate, week 1 versus weeks 2-4, weekend-days versus weekdays) in MIMIC models to examine DIF based on time, or is there something about this model that would not be accounted for by the COMPLEX option? Thanks very much for your support.
 Bengt O. Muthen posted on Sunday, May 22, 2011 - 1:16 pm
It sounds like you can view this as two-level data, where level 1 is time (28 days) and level 2 is subject (100), and where you have 6 outcomes. So you could model it via Type=Twolevel or Type=Complex. The two-level structure is similar to growth modeling done as a two-level model, where time-varying covariates can be handled as in UG ex 9.16.
 D C posted on Sunday, May 22, 2011 - 11:00 pm

Can Mplus produce item discrimination and difficulty parameters for a 2-PL IRT model with 6-category items? It seems Mplus produces these parameters only when the items are dichotomous.

Thank you,
 Linda K. Muthen posted on Monday, May 23, 2011 - 6:04 am
No, we only provide these when the items are binary. We will be preparing a FAQ about this in a couple of weeks.
 D C posted on Tuesday, May 24, 2011 - 5:37 pm
Hello Linda,

Thank you for your response. Will this FAQ include information on how I could compute the discrimination and difficulty parameters for such a model myself using the given loadings and thresholds? Or, if possible, would you tell me how that could be done in your next response in this thread?

Thank you,
 Linda K. Muthen posted on Tuesday, May 24, 2011 - 6:02 pm
The FAQ will cover this.
 Holmes Finch posted on Thursday, June 02, 2011 - 12:08 pm

I am interested in generating 2PL IRT model for dichotomous indicators where the latent trait is not normally distributed. I see how to generate non-normal indicators, but am not sure about the generation of factors with a particular skewness. I wonder if you could point me in the right direction. Thanks very much.

Holmes Finch
 Bengt O. Muthen posted on Thursday, June 02, 2011 - 4:34 pm
I would try to do that via mixtures. For instance, a 2-class mixture of two normal factors can represent a log-normal factor distribution - see pages 14-16 of the McLachlan-Peel (2000) book on mixtures.

There are probably exact results to be found, but otherwise you can take a trial and error approach to choosing means, variances, and class sizes to get the non-normality that you want.
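As a quick illustration of the trial-and-error idea, the skewness of a two-class normal mixture can be computed directly from the class moments; the class probabilities, means, and SDs below are arbitrary choices, not recommendations:

```python
# Two-class normal mixture: class probabilities, means, SDs (hypothetical).
p1, mu1, s1 = 0.7, 0.0, 1.0
p2, mu2, s2 = 0.3, 2.0, 1.5

# Raw moments of the mixture follow from the class moments.
m1 = p1 * mu1 + p2 * mu2
m2 = p1 * (mu1**2 + s1**2) + p2 * (mu2**2 + s2**2)
m3 = (p1 * (mu1**3 + 3 * mu1 * s1**2)
      + p2 * (mu2**3 + 3 * mu2 * s2**2))

var = m2 - m1**2
skew = (m3 - 3 * m1 * m2 + 2 * m1**3) / var**1.5
print(round(skew, 3))   # positive: this mixture is right-skewed
```

Varying the means, variances, and class sizes in this way shows how much skewness a given two-class specification implies before fitting anything.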
 D C posted on Thursday, June 02, 2011 - 8:51 pm
Hello Professors,

If I am interested in looking at group DIF or group differences in item endorsement using a 2-PL IRT model, what is the difference in Mplus between these two scenarios:
(1) a stratified IRT model fit in each group, then plotting the ICCs of common items by group in the same figure, versus
(2) an IRT model with a covariate (see *example below), then plotting the ICCs by group (using the "name a set of values" command in plots)?

I saw a much more substantial difference in the stratified analysis but nearly identical curves in (2).

perception BY item1* item2 item3;
perception ON sex;
item1-item3 ON sex;

Isn't scenario (1) a more accurate picture of how differently the items function in each group, since all model parameters (loadings, thresholds, etc.) are estimated separately?

Thank you!
 Holmes Finch posted on Friday, June 03, 2011 - 4:17 am
Thanks very much. I will give that a try.

 Bengt O. Muthen posted on Friday, June 03, 2011 - 9:06 am
Answer to D C:

In your *Example you show a model that is not identified because you can't have all direct effects of a covariate on the items and also a covariate effect on the factor. Perhaps you mean that this is just a part of the model where there are other items that aren't directly affected by the covariate.
 Emily Lai posted on Friday, September 23, 2011 - 12:43 pm
My colleague and I are trying to compare results from a traditional CFA to a multidimensional IRT analysis using the same data. We have polytomously-scored items and we are using MLR, specifying that our data are categorical.

We are also running the multidimensional IRT analysis using Conquest software in order to confirm our IRT results. When we do so, we get item fit indices in the output: the squared standardized residuals and a variance-weighted version for each item. That is, Conquest computes the average of the squared standardized model-based residuals for each item. We are wondering if Mplus would give us something like an item fit index that is comparable to this. We first thought of modification indices, but it is my understanding that you cannot get modification indices when you have categorical indicators. Is this true? Can you think of another index that would constitute a comparable indicator of individual item fit? Could we use standardized residuals from Mplus output in a similar way?

 Linda K. Muthen posted on Friday, September 23, 2011 - 4:58 pm
The CFA model for categorical indicators without the guessing parameter is the same model as IRT without the guessing parameter. You should not find differences when both are estimated using the same estimator.

We do not provide the fit index you mention. I don't think you will be helped by modification indices. Try TECH10 where univariate and bivariate fit is shown.
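As a rough sketch of the univariate-fit idea (my own illustration, not Mplus's exact TECH10 computation): compare observed and model-estimated category proportions with a Pearson-type statistic, where large per-category contributions flag items to inspect.

```python
# Hypothetical observed vs. model-estimated category proportions
# for one 4-category item, sample size n. Illustrative values only.
n = 500
observed = [0.10, 0.25, 0.40, 0.25]
expected = [0.12, 0.22, 0.41, 0.25]

# Pearson-type univariate chi-square: n * sum (obs - exp)^2 / exp
chi2 = n * sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 3))
```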
 sailor cai posted on Sunday, September 25, 2011 - 5:36 pm
Dear Dr Muthens,I would like to ask two questions:

Background information: I am looking at the predictive effects of three IVs (2 dichotomous + 1 polytomous) on one DV (dichotomous). My planned analytical steps are: 1) use IRT models to transform the values in the whole response matrix (3 IVs + 1 DV) into probability values; 2) conduct EFA, CFA, and SEM based on a product-moment approach. My questions are:

1. Does Mplus directly provide the transformed values for my Step 1 analysis? How do I obtain them? Or do you have any other recommendations?

2. If I am to use plausible values before running EFA, CFA, and SEM, does this mean I should obtain plausible values before the IRT calibration, or after my Step 1 analysis? Further, if plausible values are to be used, how many sets of data would be acceptable? Five seems to be too many for my case.

It would also be much appreciated if you could recommend works using IRT and SEM in combination.

Thanks in advance!
 Linda K. Muthen posted on Monday, September 26, 2011 - 8:47 am
I'm not clear on how IRT is involved here. It sounds like all of your variables are observed.
 sailor cai posted on Tuesday, September 27, 2011 - 6:18 am
Yes, I have 117 observed indicators in my whole dataset. But putting them all directly into a full model seems to make things too complicated.

It seems using composite variables would make things easier. I am wondering whether it is appropriate to weight each observed variable by its IRT discrimination parameter before grouping them.

So my question reduces to one: does Mplus provide a transformed response matrix based on the IRT discrimination parameters? That is, the matrix with each item response vector multiplied by its corresponding IRT discrimination value?
 Linda K. Muthen posted on Tuesday, September 27, 2011 - 9:30 am
It sounds like you have about 40 indicators for each of three factors. I would use IRT scores in this case. See the FSCORES option of the SAVEDATA command.
 sailor cai posted on Monday, October 03, 2011 - 1:48 am
Yes, I have about 20 to 35 indicators for each of the four (second-order) latent variables. I will try FSCORES.

Many thanks!
 Leslie Rutkowski posted on Wednesday, October 12, 2011 - 2:02 pm
Apologies if this is an overly simple question. I am interested in fitting a 2-level IRT model where item responses (L1) are nested within individuals (L2). I’m stuck on setting up the data in a way that will be correctly analyzed. So, my current thinking is that the data should be assembled such that item responses are in 1 column vector stacked for all examinees. So, for a 3 item test taken by 10 examinees, my data would be 30 rows long. In de Boeck & Wilson (2007), the authors augment the data matrix with an identity matrix to facilitate model specification. Accordingly, the three item (i=1,2,3) 10 person (j = 1,..10) case would be specified something like

F1 by ITEM*i1 ITEM*i2 ITEM*i3

Where ITEM corresponds to the column vector of item responses and i(i) =1 when row_ji = i and 0 otherwise.

Is this also necessary in Mplus? Or is there another way to organize the data and specify the model? Specifically, I'm wondering if specifying the items as 'within' and using examinee id as the cluster variable in the usual 'wide' format is sufficient?

Thanks in advance.
 Linda K. Muthen posted on Friday, October 14, 2011 - 10:39 am
You should use wide format data where each of your three variables is one column of the data set. This becomes a single-level analysis because clustering is taken care of by multivariate analysis. See Example 5.5.
 Yan Zhou posted on Tuesday, November 01, 2011 - 1:22 pm
I want to test the difference between a constrained Rasch model and a free Rasch model, both with multiple groups of examinees, so I am wondering which estimator I should use: WLSM, WLSMV, or some other estimator? By the way, I tried to use the MLR estimator to estimate IRT parameters across multiple groups of examinees, and it didn't work. Could you please tell me why?
 Linda K. Muthen posted on Tuesday, November 01, 2011 - 1:56 pm
What do you mean by free and constrained Rasch models?
 Yan Zhou posted on Tuesday, November 01, 2011 - 5:59 pm
I am sorry for the confusion. I need to use the Rasch model to estimate item difficulties and examinee abilities. For the difficulty estimation, I want to separate my examinees into four groups and then estimate the item difficulties in each group. I want to use two methods to estimate each group's item difficulties. One is constraining all groups' item difficulties to be equal, such as b11=b21=b31=b41, b12=b22=b32=b42 (the first subscript of b stands for group, the second for item); the other is freeing the item difficulties in one of the four groups while constraining the other three groups' item difficulties to be equal.

Now, what I want to know is which estimator I should use when estimating item difficulties with the two methods above: MLR, WLSM, WLSMV, or some other estimator?

My other question is whether we should use the same estimator when using Mplus to run the Rasch model several times from different perspectives, when all these analyses are for the same paper?

Thank you so much for your help!
 Linda K. Muthen posted on Wednesday, November 02, 2011 - 1:27 pm
The Rasch model is usually estimated using maximum likelihood estimation. That would be MLR. You could estimate the Rasch model using WLSM or WLSMV, in which case you would use the DIFFTEST option to compare models. Unless you are writing a paper that compares different estimators, I would stick with one estimator.
 Mark LaVenia posted on Thursday, November 03, 2011 - 2:03 pm
I am using Mplus to generate thetas for subjects who took a test with known item parameters (generated by BILOG-MG and published by the test developers). By running a Two-Parameter Logistic IRT Model and fixing the item loadings (from BILOG output) and item thresholds (computed as BILOG_Threshold*Loading), the model runs great and generates thetas that seem plausible. As long as the variance is fixed @1, the reported Item Difficulty parameters in the Mplus output are exactly the same as the BILOG_Threshold, but the Item Discrimination parameters in the Mplus output are about half the size of the BILOG_Slope parameters. I was hoping to use a match between the parameters (Mplus_Discrimination & BILOG_Slope; Mplus_Difficulty & BILOG_Threshold) as a check that I did it right. I'm thinking maybe I didn't. Any input would be greatly appreciated. Thank you.

I am pasting below my truncated syntax:

VARIABLE: NAMES ARE id i1_Crt-i17d_Crt;

CATEGORICAL ARE i1_Crt-i17d_Crt;





 Bengt O. Muthen posted on Thursday, November 03, 2011 - 6:32 pm
Slide 94 of the Topic 2 handout may be helpful. For the discrimination there is also the D factor which some of the IRT programs set at 1.7. So if BILOG doesn't use D in its 2PL, you probably need to multiply the Mplus "a" by 1.7.

See also the IRT technical appendix at
 Mark LaVenia posted on Friday, November 04, 2011 - 7:29 am
Bengt – Thank you for the prompt reply, your thoughts, and the reference to further resources. Discrimination*1.7 brings me closer to the BILOG slope, but still not on the money. To complicate things a bit more, when I allow the variance to be estimated freely, the "a"*1.7 is closer to the BILOG slope than when the variance is fixed @1, which is a little frustrating: when I fix the variance @1, the difficulty is spot on, but not so when the variance is free (note: fixing or freely estimating the factor mean has no effect either way). For example, the parameters for the first three items under each condition are as follows:

BILOG Slope: 0.732, 1.105, 0.786
BILOG Threshold: -1.414, 0.015, -0.322

!Factor variance = free
Mplus Discrimination*1.7: 0.716, 0.899, 0.750
Mplus Difficulty: -1.166, 0.012, -0.266

!Factor variance = 1
Mplus Discrimination*1.7: 0.590, 0.741, 0.619
Mplus Difficulty: -1.414, 0.015, -0.322

It seems to me that the spot-on "b" parameters with the fixed variance are preferable to the appreciably better match on the "a" (and worsened match on the "b") found with the free variance. I would appreciate your thoughts. Maybe this is just an inappropriate use of Mplus (i.e., for scoring tests/theta generation rather than model fitting and parameter estimation); do you recommend a different approach? Thank you again for your time and insight.
 Bengt O. Muthen posted on Friday, November 04, 2011 - 8:48 pm
Are you using BILOG 2PL or normal ogive?

I am not sure I follow your different steps of analysis; where in the run you show do you fix the item parameters?

I think you'd better send the relevant BILOG and the Mplus outputs to Support, explaining your steps.
 Mark LaVenia posted on Saturday, November 05, 2011 - 2:02 pm
Will do. Thank you so much for being willing to look at it.
 Salma Ayis posted on Thursday, December 22, 2011 - 5:59 pm
I have used Mplus Version 3, in 2009.
The program below was used to produce item characteristic curves.
I ran this program today, but when I tried to view the graph, it wasn't possible.
I used Graph > View graph, which as far as I remember allows viewing, but it didn't seem to work.
Advice is much appreciated!
Here is the program code (the FORMAT statement belongs under the DATA command, and the items need a CATEGORICAL declaration for a 2PL; the AUXILIARY list should not repeat variables already on USEV):

TITLE: this is an example of a two-parameter
logistic item response theory (IRT) model
DATA: FILE IS ADL21_May24_05f.txt;
FORMAT IS 21F5.0, F15;
VARIABLE: USEV ARE u1-u21;
CATEGORICAL ARE u1-u21;
AUXILIARY = g;
MODEL: f BY u1-u21;
 Linda K. Muthen posted on Thursday, December 22, 2011 - 8:21 pm
Mplus Version 3 is not supported. If you have a current upgrade and support contract you should download Version 6.12 and send problems to
 nanda mooij posted on Saturday, March 17, 2012 - 7:35 am
Dear dr. Muthen,

I have a few questions about my model; I have fitted a IRT model by fixing the factor variances to 1, and freeing the factor loadings like this:
f1 BY u1* u2-u16;
f2 BY u17* u18-u32;
f3 BY u33* u34-u48;
f4 BY u49* u50-u64;
f5 BY u65* u66-u80;
f6 BY u81* u82-u96;
f7 BY u97* u98-u112;
f8 BY u113* u114-u128;
f9 BY u129* u130-u144;
h1 BY f1-f3;
h2 BY f4-f6;
h3 BY f7-f9;
p BY h1-h3;
My question is, is this right or must I also free the loadings of f1, f4, f7 and h1?
Besides this, I want to look at the fit of this model. To do this, I thought it wouldn't be necessary to fix the variances and free the loadings, because I am only interested in the fit indices. Is this right? I ask because the fit indices of the IRT model and the normal CFA model are very different, especially the CFI, TLI, and WRMR. So which model should I use to determine model fit?

Thanks in advance
 Linda K. Muthen posted on Saturday, March 17, 2012 - 7:41 am
You need to free the first factor loading of h1, h2, h3, and p if you fix the factor variances to one. Model fit will be the same whether you fix the first factor loadings to one or free all factor loadings and fix the factor variances to one. If you don't get the same fit with these two parameterizations, you are doing something wrong.
 Yen posted on Tuesday, March 20, 2012 - 3:15 pm
I apologize that I don't have much knowledge about IRT...
Would it be possible to run one IRT analysis with one ordinal response item and the remaining items binary?

Thank you.
 Bengt O. Muthen posted on Tuesday, March 20, 2012 - 6:31 pm
Yes, no problem. Just put them all on the Categorical= list. Mplus automatically figures out which are binary and which are ordered polytomous (ordinal).
 Hamdollah Ravand posted on Monday, April 23, 2012 - 8:28 am
Does Mplus do IRT simulation? In the simulation chapter of the user's guide I didn't see anything on this subject. Could you please refer me to an article that teaches IRT simulation step by step with Mplus, the same way your 2002 article in Structural Equation Modeling taught SEM simulation?
 Linda K. Muthen posted on Monday, April 23, 2012 - 1:53 pm
There is no article that shows this. Example 5.5 in the user's guide is an IRT model. See the Monte Carlo counterpart of this. It is mcex5.5.inp.
 Simon Denny posted on Sunday, June 10, 2012 - 4:20 pm
Hi - I am trying to convert an analysis with several latent factors to an IRT framework, and I am trying to understand the conversion to IRT parameters (difficulty and discrimination). I have been reading and re-reading the web notes and discussion posts. You mention in Web Note 4 that an increased residual variance (theta) gives rise to a flatter conditional probability curve. But standardising y* under the LRV formulation with theta = 1 - loading^2*psi doesn't seem to take the residual variance into account? There is a footnote suggesting that the R-square output can be used to estimate theta in the Delta parameterization. Does this apply only in multiple-group comparisons?
 Bengt O. Muthen posted on Sunday, June 10, 2012 - 4:48 pm
The extra residual variance parameters, which go beyond what is available in conventional IRT, are only relevant in multiple-group or multiple-timepoint settings. You need to fix them in a referent group, whereas they are free in the other groups.
 Miaoyun Li posted on Thursday, September 06, 2012 - 6:43 am

I'm a new user of Mplus, and I have two questions about IRT.
The first is that I am confused about the transformation of the a and b parameter estimates.
As I understand it, there are three kinds of formulas:
(1) a = loading/sqrt(1 - loading^2); b = threshold/loading.
(2) a = loading; b = threshold/loading.
(3) a = loading/1.7; b = threshold/loading.
When I use the ML estimation method, which formula is right? When the WLSMV method is used, which one is appropriate? Also, do the 'loading' and 'threshold' in the formulas refer to the standardized or the unstandardized values?
The second question is about interpreting the TECH10 results, especially the "Univariate Pearson Chi-Square" and "Univariate Log-Likelihood Chi-Square". How can I tell that an item misfits?
I would much appreciate suggestions for related references.
 Bengt O. Muthen posted on Thursday, September 06, 2012 - 3:51 pm
See our technical appendix for IRT at
 Dexin Shi posted on Tuesday, March 26, 2013 - 8:46 pm

I have a question about the transformation between IRT and CCFA. Based on the technical appendix for IRT, using CCFA with a logit link, if we fix the factor variance to 1 and the factor mean to 0, we have discrimination a = loading lambda/1.7 and difficulty b = threshold tau/loading lambda. However, from other references (e.g., Wirth & Edwards, 2007; Kim & Yoon, 2011), the discrimination is a = 1.7*(lambda/sqrt(1 - lambda^2)) and the difficulty is b = threshold tau/loading lambda. Can you please explain the source of the disagreement for the parameter a? Thank you very much for your help.
 Bengt O. Muthen posted on Wednesday, March 27, 2013 - 3:33 pm
The difference stems from using logit or probit. With WLSMV probit you use the 1-lambda^2 version because that is the theta (residual) variance. See our Topic 2 handout on our web site, slides 93-94.
 Dexin Shi posted on Wednesday, March 27, 2013 - 8:59 pm
Thanks, Bengt; that makes a lot of sense. But based on slide 94, if we fix the factor variance and mean, the logit link gives a = lambda/1.7, and the probit link gives a = lambda/sqrt(1 - lambda^2). It seems that with the probit link the scaling constant D should not be included in the equation; however, in others' notation a = 1.7*(lambda/sqrt(1 - lambda^2)). Is this an error, or am I missing something? Thanks again for your help.
 Bengt O. Muthen posted on Thursday, March 28, 2013 - 9:23 am
I agree that 1.7 does not belong with the expression using /sqrt(1-lambda^2) since that is probit (WLSMV with Delta parameterization). The constant 1.7 was introduced to make logit close to probit IRT estimates. Furthermore, these days it seems that 1.7 has been dropped from the logit expression.
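[Editor's note: the role of the constant 1.7 can be checked numerically; a small sketch comparing the normal ogive against the rescaled logistic curve.]

```python
import math

def probit(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Scan a grid of x values and find the largest gap between the normal
# ogive Phi(x) and the logistic curve with scaling constant D = 1.7.
max_gap = max(abs(probit(x / 100.0) - logistic(1.7 * x / 100.0))
              for x in range(-500, 501))
print(round(max_gap, 3))  # roughly 0.01: the two curves nearly coincide
```

This is why logit estimates divided by 1.7 line up closely with probit estimates, and why 1.7 has no place in the purely probit expression lambda/sqrt(1 - lambda^2).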
 Carlos Fernando Collares posted on Tuesday, April 09, 2013 - 12:50 am
I tried to find this information on the forum but could not find it.

Instead of letting the "a" parameters float freely or fixing them to 1 (and thus obtaining a Rasch model), I would like to set specific values for the mean and standard deviation of "a" (e.g., mean 0.8 and std. dev. 0.3).

Does anyone have an idea for a script to obtain this? My attempts have all failed.

Another question: will it be possible to choose how to center the scale (either on b, or on the theta scores, so that mean = 0 and std. dev. = 1)?


 Linda K. Muthen posted on Tuesday, April 09, 2013 - 1:50 pm
You can use Bayesian analysis and use the specific values as priors.

You can fix the mean to one and the variance to one as follows. The standard deviation is not a model parameter.

 Carlos Fernando Collares posted on Tuesday, April 09, 2013 - 2:05 pm
Good idea!

Just a minor comment: for the IRT bunch the a parameters are not as important in the diagram as the b parameters. There should be an option to show b instead of a in the diagram.


 Yaacov Petscher posted on Thursday, April 25, 2013 - 1:05 pm

I was wondering if you all had any recommendations about the best way to evaluate Rasch and 2PL IRT models when using the Bayes estimator. I see that the DIC and pD are not given. Thanks for any input!

 Tihomir Asparouhov posted on Friday, April 26, 2013 - 8:58 am
The overall posterior predictive p-value (PPP) is a good way to evaluate these models. Several more detailed tests are reported in the tech10 output option.

 Carlos Fernando Collares posted on Wednesday, May 01, 2013 - 6:11 am
Any idea on if and/or when mPlus will have other logistic models beyond 2PLM (e.g. 3, 4 and 5-PLM)?

Best regards,

 Linda K. Muthen posted on Wednesday, May 01, 2013 - 8:39 am
We plan to have the 3PL model perhaps in a year.
 Carlos Fernando Collares posted on Wednesday, May 01, 2013 - 8:44 am
Ok, many thanks, Linda!

 Julien Morizot posted on Wednesday, May 01, 2013 - 11:38 am
Linda and Bengt, I jump in with a suggestion: if you're planning to add a 3PL, you should explore the possibility of adding a 4PL at the same time, as some studies have suggested the upper asymptote may also be useful for modeling some kinds of items. Apart from some R scripts, almost no common IRT software package offers the 4PL.

Reise, S.P., & Waller, N.G. (2003). How many IRT parameters does it take to model psychopathology items? Psychological Methods, 8, 164–184.
 Linda K. Muthen posted on Wednesday, May 01, 2013 - 1:29 pm
Thank you for the suggestion. We will take this into account in our development.
 Carlos Fernando Collares posted on Monday, May 06, 2013 - 5:24 am
Julien, that's a great idea! Apparently, modeling the fourth parameter would not require much additional engineering effort compared to the 3PL, and it would indeed help.

Here I have students from all academic years taking a test pitched at end-of-course level (intended for new graduates), so the upper asymptote is indeed crucial for obtaining better fit.

I tried the R packages catR and irtProb; they can be useful for several tasks, but not for item parameter calibration. Once you have the parameters, you can estimate standard errors, test information curves, etc.

What I have tried so far for item calibration, with limited success, is a WinBUGS script (Loken and Rulison, 2010; see below).

If you guys have any alternative idea whatsoever to estimate the 4th parameter with less hassle, I would be much grateful.

Best regards,



for (i in 1:nstud) {
  for (j in 1:nqs) {
    p[i,j] <- c[j] + (d[j] - c[j]) *
      (exp(1.7*a[j]*(theta[i] - b[j])) /
      (1 + exp(1.7*a[j]*(theta[i] - b[j]))))
    r[i,j] ~ dbern(p[i,j])
  }
  theta[i] ~ dnorm(0, 1)
}
for (j in 1:nqs) {
  a[j] ~ dlnorm(0, 8)
  b[j] ~ dnorm(0, .25)
  c[j] ~ dbeta(5, 17)
  d[j] ~ dbeta(17, 5)
}
 Chen, Lillian posted on Friday, May 31, 2013 - 12:45 am
I am conducting a differential item functioning analysis using the Rasch model in Mplus. I simulated 40 items and 1,000 cases for this analysis.

My syntax is as follows:
VARIABLE: NAMES ARE group y1-y40;
categorical are y1-y40;
grouping is group (0 = male 1 = female);
MODEL:
f1 by y1-y20@1;
f2 by y21-y40@1;
f1 with f2;
Model male:
[y39$1] (d1);
[y40$1] (d2);
Model female:
[y39$1] (d3);
[y40$1] (d4);
Model constraint:
NEW (a);
NEW (b);

My questions are:
(1) I was trying to use MLR, but an error message showed. Did I miss anything?
(2) Although the overall model already stated "f1 with f2", the covariances between f1 and f2 are different in the male and the female groups. Why?
(3) Item difficulty estimates are biased more in items of f2, and very different from true values. Did I do anything wrong in my syntax?

Thanks for your response.
 Bengt O. Muthen posted on Friday, May 31, 2013 - 9:26 am
(1) We have to see your output.

(2) You have to say

f1 with f2 (1);

to hold that equal across groups.

(3) We would have to see how you obtained your "true values" and see the output from the run where you see this bias.
 Michael Lorber posted on Thursday, November 07, 2013 - 12:41 pm
Dear Linda and Bengt,

I've conducted parallel unifactorial graded response models in Multilog/IRTPRO and ordinal CFAs in Mplus. The estimates are nearly identical, but the item information curves (IICs) and test information curves (TICs) look quite different.

What's the difference in how the IICs and TICs are calculated in Mplus versus Multilog/IRTPRO?

 Bengt O. Muthen posted on Thursday, November 07, 2013 - 1:33 pm
Are you using Mplus with the ML estimator and logit link with the latent variable mean fixed at zero and variance at 1? If so, the curves should be the same.

Which Mplus version are you using?
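[Editor's note: for reference, the item and test information curves being compared here can be computed directly from the estimates. A minimal sketch for the binary 2PL case, with hypothetical item parameters and D absorbed into a:]

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a keyed response, logistic metric."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of one 2PL item: a^2 * P * (1 - P)."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical (a, b) pairs; the TIC is the sum of the IICs.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]
tic_at_zero = sum(item_info(0.0, a, b) for a, b in items)
print(round(tic_at_zero, 3))  # → 0.825
```

Because information is quadratic in a, programs that report a in different metrics (logistic vs. normal-ogive) will show curves of visibly different height even when the item parameters agree.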
 Michael Lorber posted on Thursday, November 07, 2013 - 2:54 pm
Thanks, Bengt.

MPlus 7.11

The height of the curves is lower in Mplus. The shapes are somewhat different too.

I double-checked, and the slopes/discrimination parameters are really close. The thresholds are of course different, as you explained in an earlier post. The numbers of free parameters, AIC, and BIC are also the same. I'm pretty sure the same things are being estimated.
 Linda K. Muthen posted on Friday, November 08, 2013 - 9:01 am
Please send the Mplus and IRTPRO outputs and the data to
 Gabriella Melis posted on Thursday, November 28, 2013 - 4:12 am
I was wondering whether Mplus handles the 1-parameter rating scale model (RSM) (Andrich, 1978)?

In this model, the slope parameter (discrimination) Ajx equals 1 for all items (same as in the 1-parameter logistic), however the location (difficulty) parameter Bjx is divided into an item component (DELTAj) and a category, or step component (TAUx), so that Bjx = DELTAj + TAUx. It is used for polytomous items, and is a special case of the partial credit model (PCM): in the RSM the logits equal (THETA - DELTAj - TAUx), whilst in the PCM the logits equal (THETA - Bjx).

Please, could you suggest how to translate this model into an Mplus syntax, if possible?

Many thanks.
 Bengt O. Muthen posted on Thursday, November 28, 2013 - 8:13 am
I haven't looked into this, but I wonder if one could use the Mplus "nu" parameter to capture DELTAj. I assume TAUx implies that x varies over the different categories of the item, so they are the Mplus threshold parameters.

With categorical items, the "nu" parameters are not activated (since they wouldn't all be identified together with the thresholds), but you essentially get them by placing a factor behind each item and letting that factor's mean (alpha) pick up the nu.
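[Editor's note: the rating scale model category probabilities follow directly from the adjacent-category logits Gabriella describes (theta - DELTAj - TAUx). A sketch with hypothetical values:]

```python
import math

def rsm_probs(theta, delta, taus):
    """RSM category probabilities for one item with steps taus.
    Category x in 0..m; adjacent-category logit = theta - delta - tau_x."""
    # Cumulative sums of the step logits give unnormalized log-probabilities.
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - delta - tau))
    denom = sum(math.exp(l) for l in logits)
    return [math.exp(l) / denom for l in logits]

# Hypothetical item: location delta = 0.2 with three step parameters.
probs = rsm_probs(theta=0.0, delta=0.2, taus=[-0.5, 0.0, 0.5])
print(round(sum(probs), 6))  # the four category probabilities sum to 1
```

The PCM is the same computation with unrestricted step parameters per item; the RSM constrains all items to share the same tau vector.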
 Gabriella Melis posted on Friday, November 29, 2013 - 4:31 am
Thank you, I will work on it!
 Gavin T L Brown posted on Monday, December 16, 2013 - 9:48 pm
We analysed a set of dichotomously scored MCQ items using
MODEL: f BY M1-M50*;

One item struck us as extremely odd in the resulting 2PL analysis.

Item M9 was answered correctly by 0.8% (i.e., n = 3) of the 375 students:
Category 1 0.992 372.000
Category 2 0.008 3.000
The item discrimination was
M9 -0.324 0.457 -0.708 0.479
and the item difficulty was
M9$1 -15.062 20.868 -0.722 0.470
This suggests the item is very easy. However, the item was answered correctly by less than 1% of the students, so such a low difficulty estimate makes no sense.
I checked this in both version 6 and 7. We have run the same data in RUMM2010 (1PL Rasch) and ICL (3PL) and get more plausible location values for this item.
My concern is whether there is an error in the code or the setup, or whether it is possible to have such a low logit despite such high difficulty. Advice appreciated.
 Bengt O. Muthen posted on Tuesday, December 17, 2013 - 3:56 pm
Because the loading is negative, a Y=1 answer implies a poor answer, not a good one. Perhaps you need to reverse score the item. Or, given that the loading is insignificant, delete the item.
 William Hula posted on Thursday, January 30, 2014 - 10:10 am
I've estimated a bifactor mimic model and want to convert the parameter estimates into IRT terms. I'd like to confirm that I have the generalization of equations (4) and (5) given by MacIntosh & Hashim (2003) to the bifactor case correct. I'm using the WLSMV estimator with delta parameterization, and I followed the steps described by MacIntosh and Hashim to set the means and variances of the latent variables to 0 and 1. Specifically, I centered the values of the dichotomous covariates about their means and set the residual variances of the latents to values estimated from an initial run.

(1) Would the correct equation for the discrimination of item j on the general factor be:
a_j = lambda_jGen / sqrt(1 - lambda_jGen**2*psi_Gen - lambda_jLoc**2*psi_Loc),
where psi_Gen and psi_Loc refer to the residual variances for the general and a local factor, respectively?

(2) Would the b-value for item j in group k on the general factor be:
b_jk = (tau_j - beta_j*z_k) / lambda_jGen,
where beta_j is the direct effect of a covariate on item j and z is the group indicator dummy variable?

and (3) is the dummy variable z appropriately coded 0 and 1 for the reference and focal groups, respectively, or should it be coded with the deviation scores from the mean, given that the covariate was centered for model estimation?

Thank you
 Bengt O. Muthen posted on Thursday, January 30, 2014 - 3:14 pm
(1) and (2) look correct. Re (3), I wouldn't center z for model estimation. The factor mean shift due to changing from z=0 to z=1 is taken care of in

b_jk = (tau_j - beta_j*z_k) / lambda_jGen,
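[Editor's note: a numeric sketch of the bifactor conversions (1) and (2) confirmed above, for the WLSMV Delta parameterization; all parameter values are hypothetical.]

```python
import math

# Hypothetical estimates for one item.
lam_gen, lam_loc = 0.6, 0.4       # loadings on general and local factors
psi_gen, psi_loc = 1.0, 1.0       # factor (residual) variances
tau, beta, z = 0.5, 0.3, 1.0      # threshold, covariate effect, group dummy

# (1) discrimination on the general factor: the residual of standardized y*
# removes the variance explained by both factors.
a_gen = lam_gen / math.sqrt(1.0 - lam_gen**2 * psi_gen - lam_loc**2 * psi_loc)

# (2) difficulty for group k, with the covariate shift in the numerator.
b_k = (tau - beta * z) / lam_gen

print(round(a_gen, 3), round(b_k, 3))  # → 0.866 0.333
```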
 William Hula posted on Friday, January 31, 2014 - 2:08 am
Thank you for your help. I have a follow-up question. If z is coded 0/1 for estimation, then the estimated means for the latents may be shifted away from zero. In that case, would b_jk be

b_jk = (tau_j - beta_j*z_k) / lambda_jGen - muGen,

where mu is the latent mean estimate from the tech4 output?
 Bengt O. Muthen posted on Friday, January 31, 2014 - 2:38 pm
No, you get the mean of the latent variable from the beta_j*z_k term, with z being either 0 or 1 (assuming all your other covariates are centered).
 Emily posted on Friday, February 28, 2014 - 8:41 am
I have run a CFA with binary indicator variables on a single factor using MLR and the default delta parameterization. I have fixed my factor with a mean of zero and a variance of 1.

I am unsure of what conversion equation to utilize to transform the Mplus estimates into IRT parameters. I found this article

Equation 21 in this article gives the formula for the discrimination parameter as follows: a = factor loading*sqrt(factor variance).

I have also seen this equation mentioned previously: a= factor loading/sqrt(1-factor loading^2).

I am wondering which conversion equation is appropriate.
 Bengt O. Muthen posted on Friday, February 28, 2014 - 3:55 pm
You should use that equation 21 formula for MLR.

The other formula you mention is geared towards WLSMV.
 Yoonjeong Kang posted on Sunday, March 02, 2014 - 7:28 pm
Dear Drs. Muthen,

I want to run 2-group 1PL IRT model using Mplus. That is, I want to constrain the discrimination parameters within and across groups but do not want to constrain the difficulty parameters across groups. When I ran the code below, I got the error message (see below). Could you tell me what are wrong in my code? Without [u1-u25$1], I got the equal difficulty parameters across groups.

DATA: FILE IS Test_1PL.csv;
GROUPING IS g (1 = male 2 = female);
f BY u1-u25* (1);
Model male:
Model female:
*** ERROR in MODEL command
Unknown variables: U1-U25$1
in line: U1-U25$1
 Linda K. Muthen posted on Sunday, March 02, 2014 - 10:44 pm

 Yoonjeong Kang posted on Monday, March 03, 2014 - 11:03 pm
It works!!

 J.D. Haltigan posted on Tuesday, April 01, 2014 - 3:31 pm

I have a quick question re: discrimination values in a 2PL IRT model. I have read that these values can theoretically take any value, yet in practice they typically fall between 0.50 and 2.50. I have run some models in which I get discrimination parameters such as 36.26 and other large values. However, these models converge and there are no warnings. Is such a value simply impossible, or, if the item occurs infrequently (binary), is such a high discrimination value possible?
 Linda K. Muthen posted on Wednesday, April 02, 2014 - 11:18 am
It is a sign that the item response is rare.
 J.D. Haltigan posted on Wednesday, April 02, 2014 - 11:26 am
Thank you. That being said, it still speaks to the item's discriminatory power, correct? In other words, an event could occur rarely and yet not discriminate?
 Bengt O. Muthen posted on Wednesday, April 02, 2014 - 5:12 pm
I guess that is possible but a flat curve (low discrimination) tends to give a non-ignorable probability. Large thresholds often go together with large loadings.
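[Editor's note: a sketch of why large thresholds and large discriminations tend to go together for rare items; the parameter values are hypothetical.]

```python
import math

def p2pl(theta, a, b):
    """2PL probability of endorsement, logistic metric."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A very steep curve with a high difficulty makes endorsement vanishingly
# rare over most of the theta distribution, matching a rare item.
a, b = 36.0, 1.5
print(p2pl(0.0, a, b) < 1e-6)       # essentially no endorsement at theta = 0
print(round(p2pl(2.0, a, b), 3))    # but near-certain just above b

# A flat curve with the same difficulty cannot reproduce that rarity:
# it leaves a non-ignorable endorsement probability at average theta.
print(round(p2pl(0.0, 0.5, 1.5), 3))
```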
 Tim Stump posted on Thursday, August 28, 2014 - 2:18 pm
I apologize if the answer to my question is in this discussion board somewhere, but it seemed most appropriate to ask here.

I have a bifactor model with ordered categorical items I'm trying to estimate: 1 general factor and 2 subfactors. Can someone tell me the difference between the unstandardized and standardized coefficients when requesting MLR versus WLSMV? With WLSMV, the unstandardized and standardized estimates are the same and look like factor loadings to me. If I use MLR, it looks like I'm getting discrimination and threshold parameters from an IRT model. Would this be a correct assessment?

 Bengt O. Muthen posted on Thursday, August 28, 2014 - 3:50 pm
MLR with link=probit should give similar results to WLSMV when you compare to standardized MLR.
 Hillary Bayer posted on Thursday, December 11, 2014 - 3:46 pm

I have two groups and for now I am using the theta parameterization and have fixed the residual variances at one in both groups. I have constrained the thresholds and the factor loadings to be invariant over groups and the factor means and variances are free.

I notice that in the IRT PARAMETERIZATION part of the output, that the item discriminations and item difficulties are not the same across groups. This confuses me. Above, Dr. Muthen mentions: "To summarize my view, there are two ways to capture DIF in Mplus modeling: (1) CFA with covariates and (2) multi-group analysis. To me, DIF means that for a given item you have different item characteristics curves for different subject groupings and both approaches capture this. "

Don't the differences between groups in item discriminations and difficulties given in the IRT parameterization portion of the output suggest that the ICCs would be different? But the factor loadings and thresholds are fixed to be invariant? Am I testing the invariance of the wrong parameters? Should I be testing the invariance of the parameters in the IRT parameterization if I am interested in dif?
 Bengt O. Muthen posted on Friday, December 12, 2014 - 6:40 pm
Our Topic 2 handout, slide 94 shows how the default Mplus factor model parameterization using thresholds and loadings relate to the IRT parameterization with a and b. This shows that even with invariant thresholds and loadings, a and b will vary across groups when the factor mean and/or factor variance vary across groups.

Note that the IRT parameterization output refers to a factor with mean zero and variance 1.

My writing has been about DIF using the factor model parameterization.
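[Editor's note: the point about invariant thresholds/loadings still producing group-specific a and b can be seen numerically. A sketch in a probit metric, ignoring residual-variance scale factors, with hypothetical group factor means (alpha) and variances (psi):]

```python
import math

# Invariant measurement parameters for one item.
lam, tau = 0.8, 0.3

def irt_params(alpha_g, psi_g):
    """a and b re-expressed for a factor standardized within group g."""
    a = lam * math.sqrt(psi_g)
    b = (tau - lam * alpha_g) / a
    return a, b

print([round(v, 3) for v in irt_params(0.0, 1.0)])   # reference group: [0.8, 0.375]
print([round(v, 3) for v in irt_params(0.5, 1.44)])  # higher mean/variance group
```

Same tau and lambda, yet a and b differ across groups purely because each group's IRT output refers to a factor re-standardized to mean 0 and variance 1 within that group.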
 Hillary Bayer posted on Tuesday, January 06, 2015 - 2:17 pm
Thank you for your response and I would like to ask a follow-up question if it is not too much trouble. In the IRT literature it is not uncommon to hear that an item is unbiased between groups A and B if and only if the two item characteristic curves are identical. I think Mellenbergh more generally states this as: An item is unbiased with respect to the variable G and given the variable Z if and only if
f(X|g,z) = f(X|z)
for all values g and z of the variables G and Z, where f(X|g,z) is the distribution of the item response given g and z and f(X|z) the distribution of the item responses given z; otherwise the item is biased.

I am wondering if you are thinking about dif differently, or if invariant factor loadings and thresholds in CFA with dichotomous items, with factor means and variances free, is consistent with the above definition.
 Bengt O. Muthen posted on Tuesday, January 06, 2015 - 6:08 pm
I think it is consistent; I am not thinking about dif differently.
 Cheng-Hsien Li posted on Thursday, February 12, 2015 - 2:36 pm
Hi Dr. Muthen,

One quick question related to your message on Dec 12, 2014:

Even with invariant loadings and factor variances, a will still vary across groups when "scaling factors" vary across groups, is it correct?

That is, a can be equal across groups only when loadings, factor variances, and "scaling factors" are invariant (using WLSMV with probit link and delta parameterization), correct?
 Bengt O. Muthen posted on Thursday, February 12, 2015 - 4:51 pm
Right. The IRT parameterization with a and b ignores the possibility of varying scale factors due to varying residual variances.
 Darrin Grelle posted on Wednesday, March 11, 2015 - 6:59 am
I am trying to estimate IRT parameters for a large bank of cognitive ability items. I have about 600 items total with about 200 items loading on one of three correlated factors. My data is rather sparse because I had my trial sample take 36 items randomly drawn from the full item bank. My sample had over 12,000 people, so each item was seen by at least 300 people.

The data runs just fine as a unidimensional model, but as a multidimensional model, it has been running for 7 days.

Do you have any idea as to why the estimation is taking so long or suggestions as to what I can do to reach a valid solution?

Thank you!
 Bengt O. Muthen posted on Wednesday, March 11, 2015 - 9:48 am
Which estimator are you using?
 Darrin Grelle posted on Wednesday, March 11, 2015 - 10:49 am
I am using MLR.
 Bengt O. Muthen posted on Wednesday, March 11, 2015 - 6:45 pm
If you ask for TECH8 you will see screen printing of the iterations. The screen printing will tell you how long each iteration takes and how fast or slow the convergence is so you can get an idea of total time required. I assume your items are declared as categorical so that numerical integration is called for. The screen should show 3 dimensions of integration which with the default 15 points per dimension gives over 3000 points and can take a while for n=12,000 (time is linear in n).

But it shouldn't be bad for a fast computer with say processors = 8 and i7 CPU.

To speed it up you can use integration=10. Or, you can say NOSERR in the output to not compute SEs (they take a while with so many parameters). Or, you can use Estimator=Bayes. Or, you can take a smaller random sample of your data and use those estimates as starting values for the full-sample run.
 J.D. Haltigan posted on Monday, March 23, 2015 - 2:05 pm
After having read through this thread in depth again a question came to mind...apologies if the answer is either obvious or already somewhere here or on SEMNET which I also searched.

Why is it that for binary indicators in CFA it is generally recommended to avoid ML or MLR (because of the non-normality of discrete/binary indicators), whereas in the IRT setup ML(R) estimation is (based on my reading) a somewhat better choice than limited-information estimators? I understand the advantage of full-information estimation, but I am missing why full-information methods would not simply be favored regardless of the metric of the indicator.
 Bengt O. Muthen posted on Monday, March 23, 2015 - 2:19 pm
I think your question reflects a common misunderstanding that I have seen several times also in the literature, so I am glad you ask it to help clear it up.

When it is said that ML should be avoided with binary indicators in CFA, the speaker/writer is thinking of ML as synonymous with analysis treating the indicators as continuous-normal outcomes. So the mistake is to think that ML means continuous variables. Treating binary variables as continuous is of course not optimal. But ML does not imply analyzing your variables as if they were continuous. The fact that ML (or MLR) can be used for variables other than continuous is clear from a study of logistic regression and it is clear from IRT (which is logistic regression with factors as IVs), and it is clear from regression with a count DV.
 Mengyao Hu posted on Monday, March 30, 2015 - 11:12 am
Apologies if the answers to these questions are obvious or already somewhere in the forum.

I am trying to run a multidimensional IRT model with nominal indicators (two continuous latent factors with nominal indicators, where several indicators have cross-loadings). I am just wondering whether Mplus can handle this? I know Mplus has the NOMINAL option, but I am not sure if I can define the observed indicators as NOMINAL in a measurement model. Thank you.
 Bengt O. Muthen posted on Monday, March 30, 2015 - 12:56 pm
This works fine.
 Billy Brooks posted on Thursday, April 09, 2015 - 9:20 am
I am trying to generate the ICC plots for my data and it appears to be plotting the curves for the wrong category. I indicate the non-referent category (cat 1) and it plots what is clearly the referent category (cat 2). Is there something I am doing wrong or misunderstanding?

 Bengt O. Muthen posted on Thursday, April 09, 2015 - 9:49 am
Please send output to support and explain the steps you take in the plot menu.
 Mi-young Webb posted on Tuesday, April 21, 2015 - 12:39 pm
I am using binary data for second-order item factor analysis (four first-order factors and one second-order factor). I would like to present the results in IRT framework. Is there a way to convert parameters from second-order factor model to IRT parameters?

Thank you.
 Bengt O. Muthen posted on Tuesday, April 21, 2015 - 6:43 pm
If you express how each item relates to the second-order factor (that is, as a product of the loadings on the two levels), you are in the framework of the standard single-factor case that IRT typically considers, and you can therefore use the regular translations between our factor model parameterization and the IRT parameterization that we describe in our Topic 2 handout as well as in our IRT document at
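[Editor's note: a sketch of the product-of-loadings point, with hypothetical values and the single-factor WLSMV Delta translation; from the second-order factor's perspective, the first-order disturbance is absorbed into the residual.]

```python
import math

# Hypothetical loadings: item on its first-order factor, and that
# first-order factor on the second-order factor.
lam_item, gamma = 0.7, 0.6
lam_total = lam_item * gamma   # the item's loading on the second-order factor

# Plug into the usual single-factor translation (standardized y*).
a = lam_total / math.sqrt(1.0 - lam_total ** 2)
print(round(lam_total, 3), round(a, 3))  # → 0.42 0.463
```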
 Mi-young Webb posted on Wednesday, April 22, 2015 - 6:37 am
Thank you much, Dr. Muthen.
 shaun goh posted on Friday, August 14, 2015 - 1:21 am

I was wondering if Mplus can plot the results of a 2PL IRT model, i.e., the probability of a correct response on an item as a function of latent ability.

I have specified a non-uniform DIF MIMIC model, where latent ability, a continuous covariate, and the interaction of latent ability X continuous covariate are regressed onto a binary item. In other words,

variable :
categorical are bexp20-bexp37;
L by bexp20-bexp37;
interact | L xwith bilingual ;
bexp35 on bilingual (a1);
bexp35 on interact (a2);

Thank you for your assistance,
 Bengt O. Muthen posted on Friday, August 14, 2015 - 2:03 pm
That's the item characteristic curve option in the plot menu.
 shaun goh posted on Friday, August 14, 2015 - 4:22 pm
Thanks for the reply Bengt.

Somehow I do not get the option for the ICC when I use the syntax in the previous post. I do get the ICCs when I do not specify covariates, or when my covariates are observed. However, the option for ICCs disappears once I specify LMS/XWITH. Is this expected?

 Bengt O. Muthen posted on Sunday, August 16, 2015 - 11:16 am
Ah, yes, the plot is not available with XWITH.
 shaun goh posted on Sunday, August 16, 2015 - 7:06 pm
Thanks Bengt.

I read you would be improving IRT and extending XWITH commands in v7.4.Any chance that these will allow for plotting of ICCs with XWITH?

I was wondering if you thought it would be worth manually calculating the probabilities of correct responses from the logits, as well as deriving standardised theta from the saved factor scores? (I understand if you choose not to comment, as this is not strictly an Mplus question.)

Thanks again
 Bengt O. Muthen posted on Tuesday, August 18, 2015 - 10:57 am
These icc's shouldn't be hard to manually calculate and plot by R. You don't need to get into factor score matters.

We will put your request on our to-do list, but the to-do list for 7.4 is already full.
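[Editor's note: following up on the suggestion to compute these curves by hand, a sketch (in Python rather than R) of the conditional item probability for the non-uniform DIF model above, assuming a logit link; all parameter values and names are hypothetical, not estimates from the model.]

```python
import math

def p_item(theta, z, tau, lam, a1, a2):
    """P(u=1 | theta, z) for an item regressed on the factor (lam),
    the covariate z (a1), and the factor-by-covariate interaction (a2)."""
    logit = -tau + lam * theta + a1 * z + a2 * theta * z
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical values standing in for the bexp35 estimates.
tau, lam, a1, a2 = 0.5, 1.1, 0.4, -0.3

# Evaluating the curve at two covariate values shows both a shift (a1)
# and a slope change (a2), i.e., non-uniform DIF.
print(round(p_item(1.0, 0.0, tau, lam, a1, a2), 3))  # → 0.646
print(round(p_item(1.0, 1.0, tau, lam, a1, a2), 3))  # → 0.668
```

Evaluating p_item over a grid of theta values at each covariate level gives the two ICCs to plot.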
 Michael Lorber posted on Wednesday, October 07, 2015 - 6:19 pm
I'm estimating a saturated covariance model among dichotomous variables using the WLSMV estimator. I understand from your prior posts and handouts that the covariance estimates are probit coefficients. I would like to compute odds ratios from these probit coefficients. Would it be fair to first translate the probit coefficients into log odds (log odds = probit coefficient*1.86), and then exponentiate the newly computed log odds to get the odds ratios?

For example 0.427 (probit) = 2.17 (odds ratio).

Thanks as usual!
 Linda K. Muthen posted on Thursday, October 08, 2015 - 11:51 am
This cannot be done with probit coefficients. Odds ratios are for logistic regression coefficients.
 Michael Lorber posted on Thursday, October 08, 2015 - 2:41 pm
Thank you! Is there any common effect size metric that probit coefficients from Mplus output can be converted to? I'm aware of a probit-to-d transformation (Sánchez-Meca et al., 2003, Psychological Methods, 8, 448-467), but it is based on observed data and my model is latent, so it's unclear how to apply it.
 Bengt O. Muthen posted on Thursday, October 08, 2015 - 6:41 pm
Not sure which context you have here given that you are looking at a correlation matrix among binary variables - effect size is usually a standardized mean difference between 2 groups.
 KwH Kim posted on Sunday, April 17, 2016 - 9:37 am
I tried modeling the Rating Scale Model.
My syntax is below.



MODEL : f BY u1-u3* (a1-a3);

[u1$1] (p11);
[u1$2] (p12);
[u1$3] (p13);
[u1$4] (p14);

[u2$1] (p21);
[u2$2] (p22);
[u2$3] (p23);
[u2$4] (p24);

[u3$1] (p31);
[u3$2] (p32);
[u3$3] (p33);
[u3$4] (p34);

NEW(c1 c2 c3);
c1 = (p11 + p12 + p13 + p14)/4;
c2 = (p21 + p22 + p23 + p24)/4;
c3 = (p31 + p32 + p33 + p34)/4;

0= (p21-c2)/a2 - (p11-c1)/a1 ;
0= (p22-c2)/a2 - (p12-c1)/a1;
0= (p23-c2)/a2-(p13-c1)/a1;
0= (p24-c2)/a2-(p14-c1)/a1;

0= (p31-c3)/a3-(p11-c1)/a1;
0= (p32-c3)/a3-(p12-c1)/a1;
0= (p33-c3)/a3-(p13-c1)/a1;
0= (p34-c3)/a3-(p14-c1)/a1;


And I got the result,


Do you have any idea what's wrong and how to fix it?
 Linda K. Muthen posted on Sunday, April 17, 2016 - 10:10 am
Please send the output and your license number to
 J.D. Haltigan posted on Friday, June 24, 2016 - 11:12 pm
Quick question when using the Bayes estimator to estimate a 2PLM IRT model.

I see that the output provides the usual factor loadings and thresholds. Is the conversion to the IRT parameterization in this case done in the standard way?

Thank you!
 Bengt O. Muthen posted on Saturday, June 25, 2016 - 6:08 am
Bayes uses probit link. See also
 Katy Roche posted on Thursday, July 21, 2016 - 2:09 pm
Is it possible to conduct IRT with the TYPE= IMPUTATION command?
 Linda K. Muthen posted on Thursday, July 21, 2016 - 3:10 pm
 Katy Roche posted on Wednesday, August 17, 2016 - 10:48 am
I am running into a problem where my univariate statistics in Mplus for the 0/1 items do not align with those in SPSS. In fact, in one case, Mplus lists the variable as having 4 categories when it is clearly coded 0/1 in SPSS. Any idea what could be happening that the results are so different?
 Linda K. Muthen posted on Wednesday, August 17, 2016 - 2:29 pm
This is most likely due to you reading the data incorrectly or having blanks in the data set. If you can't see the problem, send the output, data set, and your license number to
 Katy Roche posted on Wednesday, August 31, 2016 - 1:28 pm
For multiple-group IRT, I am struggling with what to modify in the syntax below in order to free the difficulty parameters (as I've done for the discrimination parameters). Would you mind indicating what syntax change needs to be made?

GROUPING IS atRisk (0 = notpoor 1 = poor);

TYPE = mgroup ;

f@1; [F@0];

model poor: f BY ATODES1_c* ATODES2_c ATODES4_c ATODES5_c;
f@1; [F@0];
 Linda K. Muthen posted on Wednesday, August 31, 2016 - 2:20 pm
See the Topic 2 course handout on the website under Multiple Group Analysis. You will find inputs that show what you want. You also need to take factor means and scale factors into account.
 Katy Roche posted on Tuesday, September 06, 2016 - 10:20 am
A final (let's hope) question on multiple group IRT with the IMPUTATION command.

In this case, what is an appropriate estimator and fit statistic for comparing nested models? Thank you for the support!
 Bengt O. Muthen posted on Tuesday, September 06, 2016 - 3:36 pm
Fit statistics for comparing nested models aren't developed for imputation.

See also our handout for the Topic 9 short course of 06/09/2011, slide 212 and onward.
 Elizabeth Peterson posted on Sunday, September 18, 2016 - 4:24 pm
We are ultimately trying to set up a DIF analysis for a four-group IRT model using a large sample (N = 5000).

We have a constrained model for x1-x18 as follows:

THETA BY *x1-x18;

We also have a free parameter model to look across multiple groups


THETA BY * x1-x18;
{ x1-x18@1};


THETA BY * x1-x18;
{ x1-x18@1};


THETA BY *x1-x18;
{ x1-x18@1};


THETA BY * x1-x18;
{x1-x18 @1};

The model runs, but we end up with the same degrees of freedom in both models, even though model fit improves in the multigroup model.
We don’t understand why the DF are the same and what is going on here.
We also would like to run a DIF test on the IRT scores and are not sure what the appropriate command is.
Thanks in advance for your help
 Bengt O. Muthen posted on Monday, September 19, 2016 - 11:56 am
Your statement

THETA BY *x1-x18;

is incorrect. You should say

THETA BY x1-x18*;
 Mike Stoolmiller posted on Sunday, October 09, 2016 - 9:30 am
I have fit an Olsen and Schafer (2001) style 2-part, 2-level model using ML estimation. It uses a logit link for the binary indicators of the latent variable in the first part of the model; in the second, conditional part, I use a log link for the count indicators that define the second latent factor. Now that I have the model parameter estimates, my question is whether Mplus can estimate factor scores on the 2 latent factors for data from a new sample that wasn't used to fit the model, without having to re-fit the model on the new sample. In particular, can Mplus compute estimated factor scores for a single individual from a new sample?
 Bengt O. Muthen posted on Sunday, October 09, 2016 - 5:12 pm
That should work fine by fixing all parameters at the estimated values. Let us know if you have any problems.
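[One way to set this up — a sketch; the variable names, loading values, and file name below are hypothetical. The SVALUES output option writes the estimates as model statements, which can be pasted into a second input with every * changed to @ so that nothing is re-estimated.]

```
! Run 1 (estimation sample): request model statements at the estimates.
OUTPUT: SVALUES;

! Run 2 (new sample): paste the SVALUES statements with * replaced by @
! so all parameters are fixed, then save factor scores.
MODEL:
  f BY y1@1 y2@0.85 y3@1.10;   ! values copied from run 1 (hypothetical)
  [f@0]; f@0.52;
SAVEDATA: FILE IS newscores.dat; SAVE = FSCORES;
```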
 Mike Stoolmiller posted on Wednesday, October 12, 2016 - 9:57 am
I followed the procedure in, "Computing the Strictly Positive Satorra-Bentler Chi-Square Test in Mplus", Tihomir Asparouhov and Bengt Muthen, Mplus Web Notes: No. 12, April 17, 2013 to get Mplus to not do any optimization and instead just compute factor scores. In the same data, I had Mplus compute factor scores following optimization and then in the second run, I had it compute factor scores when all parameters were fixed at their optimized values for the previous run. I verified using Tech5 and Tech8 that no optimization happened in the second run. I noticed, however, that the loglikelihoods were substantially different, -134457.160 for the first run and -134495.608 for the second run, despite no optimization iterations. In addition, although the two sets of factor scores were very close, good enough for what I want to do, the standard errors for the factor scores were far enough off to make me worry. Any ideas? It wasn't my goal to compute the strictly positive SB chi-square test but it would appear that this might not work either given the discrepancy in the loglikelihoods. Any ideas? Be glad to send output and data if you want to take a closer look.
 Bengt O. Muthen posted on Wednesday, October 12, 2016 - 11:28 am
Yes, please send the output and data so we can have a look.
 Owis Eilayyan posted on Monday, November 14, 2016 - 9:34 am
Hello Dr. Muthen,
I am running a few IRT models and I have 2 concerns. My first concern is that I have a very large value for the degrees of freedom. For example,

Chi-Square Test of Model Fit for the Binary and Ordered Categorical (Ordinal) Outcomes**
Pearson Chi-Square Value 2462.635
Degrees of Freedom 19601
P-Value 1.0000
Likelihood Ratio Chi-Square Value 938.836
Degrees of Freedom 19601
P-Value 1.0000
** Of the 88641 cells in the frequency table, 54 were deleted in the calculation of chi-square due to extreme values.

My second concern is that one of the models' outputs says “THE CHI-SQUARE TEST CANNOT BE COMPUTED BECAUSE THE FREQUENCY TABLE FOR THE LATENT CLASS INDICATOR MODEL PART IS TOO LARGE”. The model includes 14 items rated on different scales ranging from 2 to 4 categories. So I am not sure whether I can trust the results! If not, do you have any suggestions I can consider?

Thank you,
 Bengt O. Muthen posted on Monday, November 14, 2016 - 5:55 pm
In our teachings we point out that these frequency table tests should be ignored when there are many cells and the two chi-squares disagree. Use Tech10 instead.
 Owis Eilayyan posted on Monday, November 14, 2016 - 6:46 pm
Thank you Dr. Muthen for your response.
I used TECH10 too, but when I present the results I need to report overall model fit rather than univariate item fit.
So is there no way to decrease the number of degrees of freedom?

Thank you,
 Bengt O. Muthen posted on Tuesday, November 15, 2016 - 5:30 pm
I would not try to use a frequency table chi-2 test when you have that many variables.

I would report the bivariate TECH10 results.

Or, use WLSMV with its model test.
 Owis Eilayyan posted on Tuesday, November 22, 2016 - 12:57 pm
Thank you Dr. Muthen

I would prefer to use ML estimator and report the univariate item fit

Best regards,
 Corina Owens posted on Wednesday, February 08, 2017 - 8:06 am
Hi Dr. Muthen,

I want to calculate IRT scale scores from summed scores using EAP. I know that I can request and obtain factor scores from the GRM in MPlus using the MLR estimator, but are my obtained factor scores (thetas) based on response patterns or summed scores? If based on response patterns, is there a way to obtain thetas (factor scores) based on summed scores in MPlus?

Thank you,
 Bengt O. Muthen posted on Wednesday, February 08, 2017 - 4:11 pm
Mplus uses patterns. Mplus cannot give thetas based on summed scores.
 Paul A.TIffin posted on Monday, April 03, 2017 - 4:30 am
Dear Mplus team,

I have previously used Bayesian estimation to estimate factor models with informative priors specified for particular item factor loadings. I wondered if it is possible to specify informative priors for item difficulties and person traits/abilities (i.e. theta) in Mplus (i.e. a factor model reparameterised as an IRT model). If so, how would the syntax work for this (for the specific priors)? Your advice is much appreciated.

 Bengt O. Muthen posted on Wednesday, April 05, 2017 - 4:03 pm
You can put priors on parameters such as difficulties but not on random variables such as factors.

See Model Priors in the UG.
 Paul A.TIffin posted on Wednesday, April 05, 2017 - 5:04 pm
Thanks for clarifying
 sung eun posted on Monday, April 10, 2017 - 4:29 pm

Could we estimate Graded Response Model parameters using summary data? Suppose we have 6 Likert-type items and each item has four scale points. We tried to feed a polychoric correlation matrix to Mplus and fit a one-factor model, so we could transform the parameters back to the GRM. However, it only reports factor loadings and residual variances. In order to estimate step difficulty parameters, we also need to estimate category thresholds. If it is possible, what information, and in what format, should we feed Mplus so we can estimate factor loadings, residual variances, and category thresholds for categorical variables based on summary data (we don't have access to individual raw data)?

Thank you!
 Bengt O. Muthen posted on Monday, April 10, 2017 - 7:11 pm
You need threshold information. The polychorics are not enough. Mplus needs raw data.
 Weizi Wang posted on Monday, June 12, 2017 - 2:11 am
I am using Mplus for a 3PL model and can find the item parameter estimates in the output file, but I am wondering whether it is possible to also generate an output file with the ability estimate for each response record in the data (i.e., for each person)?

Thank you!
 Bengt O. Muthen posted on Monday, June 12, 2017 - 6:13 pm
You get that if you request Save = Fscores in the Savedata command.
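[A minimal sketch of that command; the output file name is hypothetical.]

```
SAVEDATA:
  FILE IS abilities.dat;   ! hypothetical output file name
  SAVE = FSCORES;          ! writes each person's estimated theta(s)
                           ! alongside their response record
```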
 Jason P posted on Monday, December 04, 2017 - 2:00 am
Hi Bengt,

My question is about the measurement of differential item functioning at level-2 in a multilevel IRT model.

My data is relatively simple. Six dichotomous indicators of a latent trait for 15,000 individual respondents (level-1) clustered within 150 interviewers (level-2). I am trying to estimate interviewer effects, i.e. whether there is any residual variance at level-2 in any of the six items, conditioning on latent ability. I am not interested (at least not yet) in the level-2 variance of latent ability itself, which is my understanding of ex9.7's use of:

fb by x1-x4

I'm not sure whether it is possible to simultaneously estimate the level-2 variance on each item of an IRT, but my immediate assumption is that a separate level-2 latent variable is needed. Something like:

fw by x1-x6

fb1 by x1
fb2 by x2
fb3 by x3
fb4 by x4
fb5 by x5
fb6 by x6

Apologies if this has been answered elsewhere. I just haven't yet found a clear example which mirrors my own interest in level-2 variance of each individual item in an IRT as a measure of DIF.

Thanking you in advance.
 Bengt O. Muthen posted on Monday, December 04, 2017 - 2:55 pm
Just add


on Between.
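[A hedged sketch of this kind of specification — an assumption about the intended setup using the item names from the question above, not confirmed syntax. Mentioning the items on the Between level gives each item's random intercept a free interviewer-level variance, without needing six between-level factors.]

```
! Hedged sketch:
ANALYSIS: TYPE = TWOLEVEL;
MODEL:
  %WITHIN%
  fw BY x1-x6;     ! latent trait at the individual level
  %BETWEEN%
  x1-x6;           ! free variance for each item's random intercept
                   ! (interviewer-level residual variance)
```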
 Jason P posted on Tuesday, December 19, 2017 - 12:48 pm
Perfect! Thanks Bengt!

If I wanted to add a level-2 predictor of the level-2 item variances, it would look like this (?):

theta by d1-d6

d1 on x1
d2 on x1
d3 on x1
d4 on x1
d5 on x1
d6 on x1

This seemed to run fine, but I wasn't sure that this was the correct specification. FYI, x1 in this case is the interviewer's sex. The question is whether the interviewer's sex explains any of the level-2 variance in IRT response probabilities.

Thanks again!

 Bengt O. Muthen posted on Tuesday, December 19, 2017 - 3:33 pm
Yes, d1-d6 on x1.
 Jason P posted on Saturday, December 23, 2017 - 1:06 am
Thanks Bengt. Your commitment to continue assisting Mplus users is extra-ordinary!

My final question is about interpretation - and may be one that is beyond the scope of this forum, so my apologies in advance if that's the case.

I am trying to interpret the %Between% level parameters. Focusing on just D1 (one of six items 0/1 in the IRT) I have:

Between Level

D1 ON INTSEX 0.185 0.086 2.146 0.032

D1$1 0.083 0.050 1.659 0.097

Residual Variances
D1 0.062 0.017 3.635 0.000

My interpretation of this is that the threshold D1$1 represents the difficulty when INTSEX==0. (I would convert these to difficulty parameters since the two-level command doesn't produce an IRT version of the parameters.) D1 ON INTSEX thus represents the threshold for when INTSEX==1. Technically this means that D1 is more "difficult" to endorse when INTSEX==1. The residual variance is an estimate of the remaining %Between% level variance after INTSEX has been accounted for.

I saved the fscores and examined them by INTSEX, and this seems a reasonable interpretation, but I just wasn't sure in the case of an IRT.

As ever, grateful for your advice/assistance.

 Bengt O. Muthen posted on Saturday, December 23, 2017 - 2:46 pm
Note that on Between, the name of the variable refers to the random intercept of that variable. So not the random threshold (the threshold is the negative of the intercept). The random intercept is the intercept of the regression for the y*, the continuous latent response variable underlying the categorical outcome. You can think of it as the specific ability needed to solve this specific item. So a positive D1 ON Intsex coefficient means that the specific ability is higher for this sex category, that is, the threshold is lower - the item is less difficult.
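[In symbols — a probit sketch in generic notation, not Mplus output: with intercept \(\nu_j\), loading \(\lambda_j\), and threshold \(\tau_j\) for item \(j\),]

```latex
% Underlying-variable formulation: y = 1 iff y* > tau,
% with y* = nu + lambda*theta + eps, so tau = -nu.
\[
P(y_{ij}=1 \mid \theta_i) \;=\; \Phi(\nu_j + \lambda_j \theta_i),
\qquad \tau_j = -\nu_j .
\]
```

So a positive coefficient in D1 ON INTSEX raises the random intercept \(\nu_j\), which lowers the threshold \(\tau_j\) and makes the item easier to endorse, as described in the reply.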
 Olev Must posted on Tuesday, March 13, 2018 - 9:09 pm
I have several ability subtests. Binary items (15-30 items in each subtest). My aim is to obtain the fit indices of the models and a and b values. I used the WLSMV method.
Why does the IRT parameterization (the a and b values) appear for only some subtests?
How can I compare a and b values across different subtests?
Thank you for advice.
 Bengt O. Muthen posted on Wednesday, March 14, 2018 - 3:23 pm
Send your output to Support along with your license number so we can see why you don't get the IRT parameterization for all subtests.
 peter birch posted on Monday, April 02, 2018 - 10:31 am
Dear Prof. Muthen,

I have some experience with CFA and am new to IRT methodology. I learned from the literature and the Mplus Topic 2 handout that CFA loadings (lambda) can be converted to item discrimination parameters (a) using the equation:

a= lambda/sqrt(1-lambda^2)

Sometimes also a scaling factor of 1.7 is used.
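[As a worked illustration of this conversion — the loading value is invented: with a standardized loading \(\lambda = 0.6\),]

```latex
\[
a \;=\; \frac{\lambda}{\sqrt{1-\lambda^{2}}}
  \;=\; \frac{0.6}{\sqrt{1-0.36}}
  \;=\; \frac{0.6}{0.8} \;=\; 0.75,
\qquad 1.7 \times 0.75 = 1.275 ,
\]
```

where the factor 1.7 converts the probit (normal-ogive) slope to an approximately logistic metric.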

However, when I try to convert loadings from publications that provide information on both loadings and discrimination parameters, I get inconsistent results.

I also came across your paper “Psychometric evaluation of diagnostic criteria: application to a two-dimensional model of alcohol abuse and dependence” (Drug and Alcohol Dependence, 41, 1996, 101-112).

In Tab. 2 you provide information on loadings and optimal weights, the latter of which can be interpreted as “item discrimination” as far as I understand. At least the formula of the optimal weights described in the last paragraph of Section 2.1 corresponds to “a=lambda/sqrt(1-lambda^2)” since communality should be lambda^2, as far as I know.

However, using that formula I cannot calculate the optimal weights from the table.
Am I doing something wrong, or does other information need to be considered as well to calculate “a” based on “lambda”?

 Bengt O. Muthen posted on Monday, April 02, 2018 - 12:26 pm
Note that the model in Table 2 uses 2 factors so the communality is the sum of the square loading for the 2 factors.
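[So with two factors the denominator uses the full communality. A worked sketch with invented loadings \(\lambda_{j1}=\lambda_{j2}=0.5\), and — for this simple sum-of-squares form — assuming uncorrelated factors:]

```latex
\[
a_{j1} \;=\; \frac{\lambda_{j1}}{\sqrt{1-(\lambda_{j1}^{2}+\lambda_{j2}^{2})}}
  \;=\; \frac{0.5}{\sqrt{1-0.5}}
  \;\approx\; \frac{0.5}{0.707} \;\approx\; 0.71 .
\]
```

(With correlated factors the communality is \(\lambda'\Phi\lambda\) rather than the plain sum of squared loadings.)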
 peter birch posted on Saturday, April 07, 2018 - 8:31 am
Dear Prof. Muthen,

Thank you very much. Thanks to your reply I could figure it out.


 Michael Sciffer posted on Tuesday, January 29, 2019 - 2:04 am
Is it possible to combine 2PL and GPCM in a single model?

I am trying to fit a set of binary and ordinal indicators onto the same single factor but get an error about my first binary indicator saying "variables specified as GPCM must be ordinal."

I have tried separating the binary from the ordinal variables under the VARIABLE command by using the CATEGORICAL option twice e.g.
CATEGORICAL ARE x4 x5 x6 (gpcm);

But this does not address the error.

thanks in advance.
 Michael Sciffer posted on Tuesday, January 29, 2019 - 11:59 am
Just to add to the previous post. I have also noticed that you get the same error if you try to run a 2PL fit on one factor and a GPCM fit on another in the same analysis.
 Tihomir Asparouhov posted on Tuesday, January 29, 2019 - 1:06 pm
CATEGORICAL ARE x4 x5 x6 (gpcm) x1 x2 x3;

The (gpcm) label applies to all variables in front of it, even when they are on a different line.

This should also work
CATEGORICAL ARE x4 x5 x6 (gpcm);
 Melissa Reynolds Weber posted on Friday, February 15, 2019 - 9:04 am
Hello Drs. Muthen,

We are attempting to conduct a multilevel threshold analysis to assess the threshold ordering of a set of ordinal items with 7 response categories. We attempted to estimate a GPCM to handle ordered polytomous response options through a constrained nominal response model. However, the model is failing to converge despite a very high n. Is it possible that very high ICCs could be the cause?

Thanks so much
 Bengt O. Muthen posted on Friday, February 15, 2019 - 5:31 pm
Note that you can do GPCM directly in Mplus, not having to constrain multinomials. See UG ex 5.5.
 Jeongwook CHOI posted on Monday, March 25, 2019 - 6:01 am
Dear Dr. Muthen,

I have some question about Generalized Partial Credit Model.

I want to analyse GPCM step parameters in Mplus 8. Should I use a 'BY' statement if my data are unidimensional?

And does Mplus 8 use Muraki's (1997) formulation for GPCM analysis?

And can the Graded Response Model be analysed in Mplus?

Thank you in advance for any feedback and advice!
 Friedrich Platz posted on Tuesday, March 26, 2019 - 4:49 am
Dear Dr. Muthen,

I have a question concerning the output obtained from correlating two latent variables, each estimated from subjects' responses to 5 dichotomous items.
My model input was:
au BY item1-item5@1;
au@1; !Rasch-Model

tm BY item6-item10@1;

au WITH tm;

AU WITH TM 0.565

Later in the output the correlation between au and tm is quantified as: 0.815

My question is: which of these is the correlation between the two latent variables that should be reported?
 Friedrich Platz posted on Tuesday, March 26, 2019 - 4:53 am
Dear Dr. Muthen,

my second question is how I can fix the item difficulty parameters to values obtained in an earlier study. However, those estimates were not obtained using Mplus; since they are parametrized in the "traditional" IRT language, do I have to convert them for Mplus?

Thank you in advance!
 Tihomir Asparouhov posted on Tuesday, March 26, 2019 - 9:37 am
Here is the answer to Jeongwook CHOI:

The following document describes the IRT models and parameterizations Mplus uses
 Bengt O. Muthen posted on Tuesday, March 26, 2019 - 5:27 pm
Answers for Platz:

I think that Rasch modeling holds the loadings equal, not fixed at 1.

Those conflicting estimates would only be seen if there are other variables in the model so that one estimate is the residual covariance and the other the total covariance.

If this doesn't help, send your output to Support along with your license number.
 Bengt O. Muthen posted on Tuesday, March 26, 2019 - 5:28 pm
Answer for Platz' second question:

See the translations shown in
 Jeongwook CHOI posted on Wednesday, March 27, 2019 - 7:54 pm
Thank you for your prompt reply.

I know that it is possible to analyse the Graded Response Model in Mplus.

There is no scaling factor D in the document you posted.

Does Mplus use the scaling factor?

If Mplus does use it, is there a way to exclude the factor when computing the IRT parameters?

Thank you in advance!
 Tihomir Asparouhov posted on Friday, March 29, 2019 - 10:37 am
Take a look at user's guide example 5.5. We don't use D. The IRT parameterization is given in formulas (21) and (22) in
 Jeongwook CHOI posted on Friday, April 12, 2019 - 6:25 am

I have a question about item estimation.

Mplus offers MLR as the default estimator.

Is MLR marginal maximum likelihood (MML) estimation in Mplus, or does Mplus not offer MML estimation?

And does Mplus use the EM algorithm by default when estimating IRT item parameters?

Thank you in advance
 Bengt O. Muthen posted on Friday, April 12, 2019 - 2:02 pm
ML and MLR give MML estimates. MLR gives robust SEs for those estimates.

Mplus uses a combination of algorithms. Check the screen printing to see which algo is used when.
 Jeongwook CHOI posted on Thursday, April 18, 2019 - 5:51 am

I wonder whether the EM algorithm in Mplus is BAEM (the Bock–Aitkin approach with the expectation–maximization algorithm), or a different EM algorithm.

Could you explain the difference between them, or point me to an article about it?

Thanks in advance.
 Bengt O. Muthen posted on Thursday, April 18, 2019 - 4:33 pm
Here is a useful reference clarifying this:
 Leslie Rutkowski posted on Tuesday, December 10, 2019 - 11:39 am
I want to confirm that Mplus *does not* have functionality for concurrent calibration with non-equivalent groups and an anchor test. This gives the situation where there are 3 sets of variables: unique to time 1 (x), unique to time 2 (y), and common to both (z). In "long form" (one row per subject), cov(x,z) and cov(y,z) exist, but cov(x,y) does not because these are unique to each time point.

Then fit a model like:
f1 by x1 x2 x3 ... xn !unique t1
z11 z21 z31 ... zp1 (a1-ap) !common to both
f2 by y1 y2 y3 ... yj !unique to t2
z12 z22 z32 ... zp2 (a1-ap) !common to both

Naturally, the fact that the covariance coverage between all x and y is zero causes a problem in finding a solution.

Am I correct that this problem is not suited to Mplus?

 Bengt O. Muthen posted on Tuesday, December 10, 2019 - 5:31 pm
Sounds like you can do this in Mplus using a 2-group approach where one group has the variables x and z and the other group has y and z. Then zero coverage issues don't arise.
 Leslie Rutkowski posted on Wednesday, December 11, 2019 - 6:12 am
Thanks, Bengt. That was my thinking; however, when I set it up like:

f1 BY x1 x2 x3 x4;
x1@1 x2@1 x3@1 x4@1;

MODEL foc:
f1 BY x1 x2 x3 x4
y1 y2 y3;
x1* x2* x3* x4*;

I get the error:

Categorical variable y1 contains less than 2 categories in Group 1.

Notice that unique variables are only in the "foc" group (Group 2).
 Bengt O. Muthen posted on Wednesday, December 11, 2019 - 3:41 pm
I thought you have 1 group of people observed on x and z and another group of people observed on y and z. If so, you should organize the columns of your data so that (e.g.) the z variables come first followed by the x or y variables depending on group. Those last columns can be called say v variables and can be referred to as such in the Names and Usev lists. But they are different variables. This means you don't have variables not showing variance in either group. - But I may be misunderstanding your setup.
 Leslie Rutkowski posted on Thursday, December 12, 2019 - 7:55 am
Thanks, Bengt. This is helpful. One last point for clarification: if the numbers of variables in x and y differ, will this solution only work when the column dimensions of x and y are equal? That is, e.g., if col(x) > col(y), must some x variables be left out?
 Bengt O. Muthen posted on Thursday, December 12, 2019 - 3:02 pm
We have a FAQ on that called:

Different number of variables in different groups
 Cesar Daniel Costa Ball posted on Wednesday, February 19, 2020 - 9:30 am
Dear professor,

I need the syntax to run an item response theory analysis. The questionnaire has 94 binary items (correct/incorrect).

I want to evaluate the fit of a unidimensional model and obtain:
difficulty and discrimination indices
factor loadings
the information function
the item characteristic curves

I want to know if the syntax below is adequate, or if I have to modify it.

TITLE: this is an example IRT


f1 BY y1-y94*;


FILE IS IADL_2PLThetas.dat
 Bengt O. Muthen posted on Thursday, February 20, 2020 - 4:32 pm
This is correct and you get all the information you mention, either in the output or in plots.
 Daniel Lee posted on Sunday, May 17, 2020 - 8:30 pm

I am trying to assess item fit statistics for a uni-dimensional graded response model (for a 9 item scale). For example, in the R package MIRT, they offer the following to assess item-level fit: Orlando and Thissen (2000, 2003) and Kang and Chen's (2007) signed chi-squared test.

Can a test of item fit be implemented in Mplus?

Moreover, I am using MLR and I am trying to assess local dependence. Does Mplus generate a local dependence matrix? I realized that WLSMV gets you a matrix of the residual correlations, but I'm wondering if this is possible when using MLR.

Thank you for your help.
 Bengt O. Muthen posted on Monday, May 18, 2020 - 5:26 pm
Use TECH10 and examine bivariate information.
 Daniel Lee posted on Tuesday, May 19, 2020 - 7:39 am
Thank you. In the last question, I also asked about item-level fit. Is there a way to implement this in Mplus, or were you saying that item-level fit & local dependence can all be evaluated using TECH10? I appreciate your help.
 Bengt O. Muthen posted on Wednesday, May 20, 2020 - 5:00 pm
Item-level fit is captured by TECH10.
 Jeongwook CHOI posted on Monday, May 25, 2020 - 6:26 pm

I have some questions about item parameters.

I know that Mplus doesn't directly report the item parameters: discrimination, difficulty, or boundary parameters.

I want to obtain boundary parameters under polytomous models, so I calculated them in the following ways for each model:

boundary = threshold / lambda (factor loading)

boundary = step / lambda (factor loading)

boundary = location - (step / lambda)

Are the above ways correct for obtaining the boundary parameters under the three models?

thank you for your help.
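[For what it's worth, the first of these conversions matches the usual slope–threshold form for the graded-response case; a worked sketch with invented values \(\tau_{jk}=0.4\), \(\lambda_j=0.8\):]

```latex
\[
b_{jk} \;=\; \frac{\tau_{jk}}{\lambda_j} \;=\; \frac{0.4}{0.8} \;=\; 0.5 .
\]
```

Whether the step and location versions are correct depends on the exact GPCM parameterization Mplus uses, which is not verified here.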
 Bengt O. Muthen posted on Tuesday, May 26, 2020 - 10:37 am
 Helen Norman posted on Monday, September 07, 2020 - 1:04 am

I have a very simple question about terminology. I have run a CFA with a set of ordinal (observed) indicators (as per example 5.2 in chapter 5 of the user guide). The user guide states that when the observed variables are categorical, CFA is also referred to as item response theory (IRT) analysis.

So does this mean I have run (1) a CFA with categorical indicators or (2) an IRT model? Or do they both essentially mean the same thing so either term can be used?

 Bengt O. Muthen posted on Monday, September 07, 2020 - 3:44 pm
They are the same so either term can be used. In IRT terms, this is a graded response model.
 Rumana Aktar posted on Sunday, October 25, 2020 - 8:47 am
Dear Dr. Muthen,

I am a beginner using Mplus for IRT analysis. I am using version 8.4.

I ran a multidimensional IRT model using the WLSMV-Theta analysis option (since the model has 4 dimensions with 24 indicators, and MLR estimation was taking too long to complete).

The output did not contain item discrimination and difficulty values.
I really need your guidance on how I can get these values. Is my syntax correct?

Here is the syntax:

DATA: FILE IS Mother_MPLUS.dat;

MODEL: Cold by I1_M* I3_M I9_M I12_M I13_M I17_M I19_M I22_M I24_M;
Hos by I4_M* I6_M I10_M I14_M I18_M I20_M;
Ind by I2_M* I7_M I11_M I15_M I23_M;
Und by I5_M* I8_M I16_M I21_M;



Should I report?
a = discrimination = loading
b = difficulty = threshold

Thank you!
 Bengt O. Muthen posted on Sunday, October 25, 2020 - 5:23 pm
When you have more than one factor, the conventional IRT parameterization is not used but instead the Mplus parameterization. See the FAQ on our website:

IRT parameterization using Mplus thresholds

Note the statements there:

"Unfortunately, the slope-threshold form does not generalize well to truly multidimensional models, so we adopt the slope–intercept parameterization for this model and all remaining IRT models."

The slope-intercept parameterization is also used in the Reckase (2009) book “Multidimensional IRT”;
see section
 Rumana Aktar posted on Tuesday, October 27, 2020 - 12:16 am
Thank you Dr. Muthen for your guidance.

Just to clarify, for the slope-intercept parameterization should I run the model with MLR estimation? Then, for estimating the item difficulty, should I divide the intercept of a given category of an item by the factor loading of the item?

Thank you once again!
 Bengt O. Muthen posted on Tuesday, October 27, 2020 - 10:50 am
The slope-intercept parameterization does not deal with item difficulty but instead an intercept (which is the negative of our threshold). Read about it in the references I gave.