

I would like to understand the difference between the Mplus estimation method for categorical items and the Rasch model for scale (or test) construction. In other words, are the various models (rating scale, partial credit, and equal dispersion model) extraneous to the Mplus estimation method or not? Thanks for the answer.


With a single factor, Mplus estimates a 2-parameter normal ogive model, to use IRT terms. A transformation to the IRT parameters a and b is given in Muthen, Kao, Burstein (1991); see the IRT reference list under the Mplus Discussion topic References. Guessing is not taken into account. This means that results are close to those of the 2-parameter logistic model, using the usual conversion factor to make probit results close to logit results. Differences also arise because Mplus uses weighted least squares (WLS) and not maximum likelihood (ML). Mislevy had an article in the Journal of Educational Statistics in 1986 showing that WLS and ML gave close results in multifactor situations. A Rasch model can be estimated by holding loadings equal across items. Differences from Rasch programs would again be due to WLS versus ML; note also the custom of setting average difficulty = 0 in Rasch. I don't think the Partial Credit or Rating Scale models in line with Masters and Andrich can be handled in Mplus, but maybe other users would know. Mplus handles ordered polytomous outcomes, which is in line with Samejima's graded response model.


I'd like to continue the discussion of IRT-like analyses in Mplus. Specifically, I have analyzed the same data set using Mplus, R. McDonald's NOHARM program, and the BILOG-MG IRT program. The results differ across programs. My question is whether I should worry about such differences, or whether they're within the range of what one would expect, given differences in estimation procedures (e.g., ML vs. WLS). The details are as follows: I'm analyzing 11 dichotomous items, with a sample size of 3091. In Mplus, I specified Model: F1 by I1* I2-I11; F1@1; (thereby constraining the factor variance to be 1.0). I won't overwhelm everyone with all the results, but here are the estimates for two items:

Item #1
  Mplus WLSMV: loading .839, threshold .289
  Mplus WLS:   loading .950, threshold .229
  NOHARM:      loading .814, threshold .289
  BILOG:       slope 1.674,  threshold .295

Item #2
  Mplus WLSMV: loading .515, threshold .029
  Mplus WLS:   loading .655, threshold .126
  NOHARM:      loading .513, threshold .029
  BILOG:       slope .450,   threshold .087

My understanding is that estimates from Mplus can be transformed into the parameterization used in BILOG by the following:

  BILOG slope = loading/sqrt(1 - loading**2)
  BILOG threshold = threshold/loading

For item #1 (using the WLSMV estimates), this calculation yields 1.542 for the slope, while BILOG reports 1.674. For the threshold, the formula yields .344 while BILOG says .295. Again, the issue for me is whether such differences are cause for concern, or if that's just the way it is. Thanks in advance to anyone who can provide illumination!


I think the magnitude of the differences between BILOG and Mplus is not a cause for concern. The differences are due to using ML instead of WLS and using logit instead of probit (the constant 1.7 in the logit gives only an approximate closeness to the normal). Note also that the item response curves may be close even when the thresholds and the slopes are a bit different (these are correlated quantities). As for thresholds, Mplus WLS and Mplus WLSMV differ a bit due to the difference in the weight matrix. With WLS the thresholds are not simply inverse normal transforms of the sample proportions, but are a function of the weight matrix covariance between thresholds and correlations and of the fit of the correlations; see Muthen (1978), Psychometrika, equation (20).
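
As a rough check on the conversion quoted in the question, here is a minimal Python sketch. It assumes a single factor with variance 1 and y* variances standardized to 1; the function name probit_to_irt is made up for illustration and is not part of Mplus or BILOG.

```python
import math

def probit_to_irt(loading, threshold):
    """Convert a standardized loading and threshold to a normal-ogive
    slope (a) and difficulty (b).

    Assumes one factor with variance 1 and y* variance 1, so the
    residual variance of y* is 1 - loading**2.
    """
    if not 0.0 < abs(loading) < 1.0:
        raise ValueError("loading must be nonzero and less than 1 in absolute value")
    residual_sd = math.sqrt(1.0 - loading ** 2)
    slope = loading / residual_sd
    difficulty = threshold / loading
    return slope, difficulty

# Item #1, Mplus WLSMV estimates quoted in the question.
a1, b1 = probit_to_irt(0.839, 0.289)
print(round(a1, 3), round(b1, 3))
```

For item #1 this reproduces the values of about 1.542 and .344 given in the question. The result is on the probit (normal ogive) metric; no 1.7 scaling to a logistic metric is applied here.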

Rich Jones posted on Friday, April 06, 2001  11:11 am



Is there a standard method for quantifying the 'bias' in the estimate of mean differences in latent ability according to group membership when measurement noninvariance (DIF) is not taken into account? Here's what I tried. I built a single-group MIMIC model and detected significant direct effects, yada yada yada. Then I estimated a misspecified model with the previously detected direct effects fixed to 0. Now, comparing overall model fit isn't really interesting because I already know those direct effects significantly improve model fit. What I'm really interested in is how wrong I would be about inferences made on subject ability if I ignored the direct effects. I chose to compare the Std parameter estimates for the indirect effects for each model, because the residual variances of ability are different across models and the background variable is binary. The full model returns a Std indirect effect of 0.490, the misspecified model 0.498. Thus I conclude that ignoring bias results in overestimating group differences by a whopping 1.6%. So does that seem to make sense? Anyone with other ideas?


I think your approach makes sense. In addition, with several x's, you could look at the standardized mean differences for different groups, e.g. consider how differently, in a standardized metric, black females compare to white females with and without a direct effect. As a complement, I guess one could also ask what happens to the factor score for a given individual with and without a direct effect. 

Anonymous posted on Thursday, April 12, 2001  11:05 am



Based on the above discussion and Muthen and Christoffersson (1981), I was wondering why it might still make sense to perform an Mplus multigroup CFA for categorical variables if one doesn't render it as a 2P IRT model. Unless the scale factors are constrained across groups, it doesn't seem that differences in group means and variances are necessarily meaningful, since the definition of the measures could still be quite different for each group in spite of having equal thresholds and loadings.


Differences across groups in factor means and variances are meaningful if thresholds and loadings have (partial) invariance across groups even if scale factors are different across groups. A useful analogy is with continuous indicators where in order to compare factor means and variances across groups you do not need invariance across groups of error variances in addition to invariant intercepts and loadings. With categorical outcomes, allowing scale factors to differ across groups in the presence of invariant thresholds and loadings can be thought of as allowing noninvariant error variances. 

Anonymous posted on Monday, April 16, 2001  10:46 am



Within the Mplus CFA framework then is there no meaningful way of "correcting" for DIF by imposing a direct effect between a covariate and an indicator (as there would be if the CFA was rendered as an IRT) ? 

bmuthen posted on Monday, April 16, 2001  12:25 pm



I did not realize that you were asking about direct effect DIF modeling in your last message, but that does not change my answer. I conclude the opposite from our discussion: including a direct effect is a meaningful way to correct for DIF, just like it is with regular CFA with covariates. Please let me know how I can help us reach agreement on this issue.

Anonymous posted on Monday, April 16, 2001  1:54 pm



My second question was a follow-up rather than a restatement of the first. I am unclear as to how a direct effect between a covariate and an indicator adjusts for DIF in an Mplus CFA with categorical indicators. This may be due to confusion regarding the nature of scale factors in the Mplus CFA and the correspondence between the Mplus CFA and a 2P IRT model. The interpretation of adjusting for DIF in the case of a 2P IRT model is straightforward: the direct effect (DE) adjusts for differences in item difficulties _within_ and _across_ (via the imposition of equal thresholds, loadings, and scale factors) groups of a multigroup model. The extension to the Mplus CFA with categorical indicators is less clear: would DEs only adjust for differences in error variances _within_ groups of a multigroup model (and in so doing, adjust the loadings and thresholds as well, which are _still_ constrained to be equal across groups even though the scale factors are allowed to vary)?

bmuthen posted on Tuesday, April 17, 2001  4:15 pm



First of all, there is no difference between the two-parameter normal ogive IRT model and the Mplus model. To summarize my view, there are two ways to capture DIF in Mplus modeling: (1) CFA with covariates and (2) multigroup analysis. To me, DIF means that for a given item you have different item characteristic curves for different subject groupings, and both approaches capture this. In approach (1), DIF for a certain item is handled by allowing a binary (say) x variable to have a direct influence on the item in question, thereby making its threshold (difficulty) parameter different for the two groups. Direct effects are only needed for items with DIF. Scale factors are not involved here. In approach (2), considering the same two groups, DIF is handled by a two-group analysis where the DIF item is allowed to have a different threshold and loading for the two groups. So this is a more general form of DIF. Scale factors are fixed at 1 for this item (free scale factors are used for items that are assumed not to have DIF). For further details, please see the Muthen articles given on this web site under References - Categorical Outcomes - IRT, especially papers 35, 18, and 15. These references also show the relationships between the Mplus and IRT parameterizations; see also the IRT paper by McIntosh listed on the home page.


I am trying to do a cross-validation of the model I've settled on for a particular set of all-categorical data and was wondering how I would apply the same thresholds/loadings to the cross-validation sample to derive factor scores.


To get factor scores from the same model but on a different sample, fix all parameters to the values estimated in the first sample. These can be fixed using the @ sign. For example,

MODEL: f1 BY y1@1 y2@.4 y3@.4;
       f1@.5;
       y1@.2 y2@.3 y3@.12;
       [y1$1@0 y2$1@.2 y3$1@.12];


If a one-factor model with categorical indicators can be transformed quite easily to yield a unidimensional logistic IRT model, what does this mean in the multidimensional case? Specifically, the majority of multidimensional IRT models that I have seen are compensatory in nature, with each dimension providing its own discrimination parameter to the item characteristic function (yielding an "a" vector) and a scalar-defined level of difficulty/threshold (see Ackerman, 1994). If we extend the same procedures that are used to transform the loadings/discrimination and threshold parameters for the unidimensional case in Mplus to what we might expect to get if there were a BILOG/PARSCALE analog for multidimensional IRT, we end up with not only multiple "a" parameters as in the compensatory model, but also multiple thresholds. To me this suggests that the multidimensional model derived in Mplus would in effect be noncompensatory in nature. Is this in fact the case, or am I way off?

bmuthen posted on Friday, July 13, 2001  5:57 pm



In Mplus, single-factor and multi-factor models have a single threshold for each binary item, and this translates into a single "b" value in IRT. Please let me know if this doesn't answer the question.


I was a bit unclear. In order to transform the Mplus-derived parameter estimates to what would be expected with a logistic model, we divide the Mplus loading by the square root of the quantity 1 - (Mplus loading squared) to get the analogous "a" parameter from BILOG/PARSCALE, and divide the Mplus-derived threshold by the Mplus-derived loading to get the analogous difficulty/threshold parameter. For a one-dimensional model this is fairly straightforward, yielding only one discrimination parameter and one set of thresholds per item. For a multidimensional item, however, if you do the transformation for the thresholds for each dimension, you end up with as many sets of transformed thresholds as you have dimensions. Is there another way to approach the transformation issue for the multidimensional case?

bmuthen posted on Monday, July 16, 2001  10:35 am



Good question. You are right about the division by the square root of the quantity 1 - (loading squared). Explicating what this quantity is makes the generalization to multiple factors clear. The quantity is the residual variance of y*, the continuous latent variable underlying the binary observed y. In the Mplus factor analysis with categorical y's, the y* variances are standardized to one. The loading squared inside the parentheses is the variance in y* explained by the factor when we have a single factor with variance one. In the more general case, the variance explained can be calculated in the usual way for continuous y's, to include cases with several factors that may be correlated and/or have variances different from one. So, for instance, with two correlated factors, you have variance explained in item j equal to V(f_1)*lambda_j1**2 + V(f_2)*lambda_j2**2 + 2*lambda_j1*lambda_j2*Cov(f_1,f_2).
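
The explained-variance formula above can be sketched in a few lines of Python. This is only an illustration under the stated assumption that y* has total variance one; the loadings and factor covariance matrix are made-up numbers, and the function names are hypothetical.

```python
import math

def explained_variance(loadings, factor_cov):
    """Variance in y* explained by the factors: lambda' * Psi * lambda.

    For two factors this expands to
    V(f_1)*l1**2 + V(f_2)*l2**2 + 2*l1*l2*Cov(f_1, f_2).
    """
    k = len(loadings)
    return sum(
        loadings[r] * factor_cov[r][c] * loadings[c]
        for r in range(k)
        for c in range(k)
    )

def residual_variance(loadings, factor_cov):
    """Residual variance of y*, assuming the total y* variance is one."""
    residual = 1.0 - explained_variance(loadings, factor_cov)
    if residual <= 0.0:
        raise ValueError("explained variance must be below the total variance of 1")
    return residual

# Made-up example: two factors with unit variances and covariance 0.3.
lam = [0.5, 0.4]
psi = [[1.0, 0.3],
       [0.3, 1.0]]
explained = explained_variance(lam, psi)       # 0.25 + 0.16 + 0.12
resid = residual_variance(lam, psi)
scale = 1.0 / math.sqrt(resid)                 # divisor used in the IRT conversion
print(explained, resid, scale)
```

With a single factor of variance one, explained_variance reduces to the loading squared, which recovers the unidimensional conversion discussed earlier in the thread.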


I'd like to return to the discussion of quantifying DIF for a scale. Several points were raised in the exchange between Rich Jones and Bengt back on April 6. I'd like to ask for a little more explication, if possible. The basic situation is one in which one runs a MIMIC model with DIF effects and a MIMIC model without such effects, to assess the overall magnitude of DIF.

1. Rich Jones proposed comparing standardized parameter estimates of the indirect effects for a MIMIC model with no DIF and a model that contained DIF effects (i.e., direct effects from the background variable to the item(s) in question). I assume that the standardized indirect effect would be the product of the (STD) effect of the background (x) variable on the factor times the (STD) loading of an indicator on the factor. My question: If one has several indicators, does it make sense to assess *overall* DIF for the set of indicators by looking at the *sum* of the loadings of the indicators times the effect of the x variable on the factor, and comparing these quantities in the DIF and no-DIF models?

2. Bengt also suggested that one could look at "standardized mean differences" for different groups, defined by the background factors (x variables). Perhaps it's the summer heat, but I'm not sure how one would calculate standardized mean differences from the Mplus output. Suppose one had dummy variables for each combination of background variables. Would one compare the magnitude of the (STD) effects of each dummy variable on the factor in the DIF and no-DIF models? Or does one calculate standardized mean differences in some other manner?

3. Bengt also suggested comparing factor scores in the DIF and no-DIF models. When comparing the mean factor scores for two groups differing in their background characteristics, would it make sense to incorporate some information on the within-group standard deviation of the estimated factor scores, to provide a scale for the factor score comparison (e.g., something analogous to a measure of effect size)?

4. Finally, suppose one had two background variables in the MIMIC model (e.g., age and gender). One could form dummy x variables for each age/gender combination. If one had 4 groups, one would use 3 dummy variables in the model. The question here is: How can one recover the predicted mean for the excluded group? In a standard regression, one would use the intercept, but there's no intercept in the Mplus formulation. Is there a way to use the overall predicted factor mean from the TECH4 output for the purpose of recovering the mean of the excluded group? (And is there somewhere in the Appendices that provides the formula used to obtain the estimated factor mean in the TECH4 output?)

These questions may have obvious answers; I'm just not seeing them right now. Thanks in advance for clarifying these points.


Thank you for the explanation of establishing the variance in an item explained by two correlated factors. In terms of the item thresholds, however, would we end up with multiple sets of thresholds per item (one set for each dimension), or is there a way to transform to one set?

bmuthen posted on Tuesday, July 17, 2001  4:19 pm



Mplus has just one threshold for each item, even when there are several factors. This is because the threshold is defined on the scale of the latent response variable y*, and there is only one y*. In achievement testing, y* is the specific ability needed to solve a certain item correctly. This ability may consist of several factors. The relationship between IRT parameterizations and Mplus is most clearly seen in P(y=1|f), the probability of y=1 given the factor (or factors) f, expressing this as a normal distribution function with argument arg, say. For a single theta in IRT, arg = a*(theta - b), whereas in Mplus with a single factor f, arg = (-tau + lambda*f)*c, where the inverse of c is the square root of the residual variance that we discussed earlier. This gives the relationship a = lambda*c, b = tau/lambda. With several factors in Mplus, arg = (-tau + lambda_1*f_1 + lambda_2*f_2 + ...)*c. This formula can then be used to derive the relationship to the various IRT formulations. Hope this helps.
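
To illustrate the relationship a = lambda*c, b = tau/lambda, the following Python sketch evaluates P(y=1|f) under both parameterizations and confirms they agree. It assumes a single factor with variance one and uses the item #1 WLSMV values from earlier in the thread; the function names are hypothetical, not Mplus or BILOG functions.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_mplus(factors, tau, loadings, c):
    """P(y=1|f) with arg = (-tau + lambda_1*f_1 + lambda_2*f_2 + ...)*c."""
    linear = sum(lam * f for lam, f in zip(loadings, factors))
    return normal_cdf((-tau + linear) * c)

def p_irt(theta, a, b):
    """P(y=1|theta) for the 2-parameter normal ogive: arg = a*(theta - b)."""
    return normal_cdf(a * (theta - b))

# Single-factor case with factor variance 1 (item #1 values from the thread).
lam, tau = 0.839, 0.289
c = 1.0 / math.sqrt(1.0 - lam ** 2)   # inverse of the residual standard deviation
a, b = lam * c, tau / lam

# Largest discrepancy between the two forms over a few factor values.
max_gap = max(
    abs(p_mplus([f], tau, [lam], c) - p_irt(f, a, b))
    for f in (-2.0, -0.5, 0.0, 1.0, 2.5)
)
print(max_gap)
```

Because p_mplus takes a list of factors and loadings, the same function covers the multi-factor case: there is still only one tau per item, and c is then computed from the residual variance implied by all factors together.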

bmuthen posted on Tuesday, July 17, 2001  6:57 pm



Here are some thoughts on John Fleishman's questions. Rich Jones may have further ideas.

1. It would be nice to have some overall indicator of the DIF, for all items involved. Perhaps the sum of the squared differences between the standardized indirect effects with and without direct effects. Perhaps root mean square.

2. One can get the factor means and factor standard deviations from TECH4, then compute mean differences divided by sd.

3. Sounds alright.

4. The estimated mean for the excluded group would be zero, since a single-group analysis fixes the intercept for the factor at zero (in line with the mean being fixed at zero). TECH4 would only give the marginal factor mean (so averaged over all background variable values). In line with equation (38) in version 2's appendix, the factor mean conditional on x covariates is B_0 * alpha + B_0 * Gamma * x, where B_0 is the inverse of (I - B).
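
As a small illustration of point 4, here is a Python sketch of the conditional factor mean for the simplest case: one factor and no regressions among factors, so B = 0 and B_0 = 1, and the formula reduces to alpha + Gamma * x. The gamma values are invented for the example and are not from any analysis in this thread.

```python
def conditional_factor_mean(alpha, gammas, x):
    """Factor mean given covariates x for a single factor with B = 0,
    i.e. B_0*alpha + B_0*Gamma*x with B_0 = 1."""
    return alpha + sum(g * xi for g, xi in zip(gammas, x))

alpha = 0.0                    # factor intercept fixed at zero in a single-group analysis
gammas = [0.30, -0.20, 0.45]   # hypothetical effects of three dummy variables

# Four groups coded with three dummies; the excluded group has all zeros.
groups = {
    "excluded": [0, 0, 0],
    "group 2": [1, 0, 0],
    "group 3": [0, 1, 0],
    "group 4": [0, 0, 1],
}
means = {name: conditional_factor_mean(alpha, gammas, x) for name, x in groups.items()}
print(means)
```

Under these assumptions the excluded group's mean is zero and each other group's mean equals its dummy coefficient, so the coefficients can be read directly as differences from the excluded group.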

Anonymous posted on Monday, July 23, 2001  9:35 pm



Is there a convenient way to generate something like reliabilities for multifactor CFA models for categorical indicators in Mplus ? 

bmuthen posted on Tuesday, July 24, 2001  9:32 am



The R-square values that are printed if you request a standardized solution would serve this purpose. The R-square describes the proportion of variation in the latent response variable y* accounted for by the multiple factors influencing this y*. See Appendix 1 of the User's Guide for more details about y*.

Rich Jones posted on Friday, August 03, 2001  1:20 pm



In response to John Fleishman's request for an explication of my single-number summary of DIF...

Direct and Indirect Effects in Mplus MIMIC models

In the case of a single-factor MIMIC model (single-factor CFA with covariates), what I refer to as the indirect effect can also be conceptualized as a regression of the latent factor (eta) on a given x. When the x is dichotomous, these regressions are analogous to dummy variable regressions in ordinary linear regression or mean differences as expressed in ANCOVA models. So while the parameter describes the indirect relationship between the x (e.g., group membership) and the item(s), it also captures group mean differences in the underlying construct (eta). This is a powerful feature of DIF detection with a MIMIC model that is difficult to get at with more usual DIF detection approaches. Mplus will produce regression parameters standardized with respect to the variance of eta (STD), and also standardized to the variances of eta, the x's, and the y's (STDYX). The indirect effect, when standardized with respect to the variance of eta (STD; STDYX is harder to interpret in the case of dichotomous x's), describes a kind of effect size difference for group membership: that is, the standard deviation increase in eta associated with a unit increase in the covariate (i.e., group membership). I am very interested in the study of DIF, but I also believe that DIF is often at best a nuisance. What I am really interested in is how cognition, depression, functioning, whatever, is distributed within a sample, across groups, or how it relates to some other characteristic of subjects. The presence of DIF might lead to spurious inferences of group differences or exaggerated relations to other correlates if the DIF is of significant magnitude.
Most studies of DIF that I have seen conduct an analysis of DIF, interpret (some of) the findings, and stop there without going on to explore how the DIF might impact findings or relationships with other variables. Often, DIF studies only interpret evidence of bias in the direction favored/suspected a priori by the investigator. I was just trying to go one step further and conceptualize/express the overall magnitude of DIF and the importance of modeling DIF or, conversely, the cost of ignoring DIF.

Overall Summary of DIF

In my posted example using the single-group one-factor MIMIC approach, I built a MIMIC model in a forward stepwise fashion, examining model fit derivatives for evidence of DIF and sequentially freeing up direct effects etc. as described in Muthén, B., in Test Validity, H. Wainer & H. Braun, eds. (1988). The initial model (without any DIF/direct effects) suggested significant and large group differences in the underlying construct (eta): in other words, the regression of eta on x was large and significant, a significant indirect effect. The final model, one that included DIF according to group membership, suggested many items with DIF (a.k.a. direct effects), and group differences in the underlying construct remained (indirect effect). I was interested in trying to describe how much of the observed group difference in eta was due to bias (DIF) by group membership. As in other areas of statistical inquiry, large samples may lead to an analytic finding of statistically significant DIF when the magnitude of the effect is of little practical importance. Also, sometimes you'll find DIF favoring one group on some items and suggesting a disadvantage on other items, and given different difficulty and perhaps even discrimination across indicators, it's hard to gauge the overall importance of DIF.
The approach I took was to compare the STD indirect effect (regression of eta on group membership) for the group membership dummy covariate (x) from the final model (with DIF) to that from a model otherwise equivalent with the exception of the direct effects (DIF), that is, a purposefully misspecified model. You can get an omnibus chi-square difference test and p-value for all DIF parameters this way, but you already know this will be significant given the way the MIMIC model was built. What I was trying to do was get a handle on how large the group differences would be if the DIF was ignored. In my posted example, the final model returned a STD indirect effect of 0.490. That is, the standardized (with respect to the variance of eta) difference in eta was 0.490. This value was slightly lower than the standardized group difference in eta found for the purposefully misspecified model: 0.498. I further expressed the discrepancy in group difference estimates as a fraction of the misspecified group difference: (0.498 - 0.490)/0.498 = 1.6%. This result is very interesting because although my analysis demonstrated significant DIF, the practical importance of this DIF in terms of obtaining an unbiased estimate of eta seems to be relatively small. Therefore, I concluded that most of the group differences in eta were not due to possible item bias (but may be due to constant bias, which is another topic, a matter of substantive interpretation of the indirect effect). So I use the figure 1.6% as a single number to quantify the discrepancy between DIF and no-DIF models in terms of the underlying construct. Bengt suggested other ways, for example estimating factor scores for the final and misspecified models and plotting them, computing their correlation, or estimating the mean of the differences, etc.

Other DIF Summaries

I have considered other summaries, for example something analogous to a sum of the area between the item characteristic curves (ICCs) for focal and referent groups.
Other postings on the Mplus discussion list describe how MIMIC model parameters can be used to obtain IRT parameters, and Raju (1988; Psychometrika 53:495-502) demonstrated that the area between two groups' ICCs is equal to their difference in IRT difficulty parameters for 1P and 2P models. So you could convert direct effects, thresholds, and loadings to difficulty parameters, and compute the sum of group differences in item difficulty across items as another single-number expression of the total effect of DIF (but notice that direct effects are the only things that vary across groups in a single-group MIMIC model). A limitation of this area approach is that it does not take into consideration the distribution of ability in the sample of interest. If the items are highly skewed, all very difficult or all very easy (as they often are in fields outside of educational testing such as psychology, medicine, epidemiology), this sum of areas may misrepresent group differences. It's possible that the sum of the areas between the curves is very large, but if weighted by the distribution of ability in the population, and with all of the items most discriminating at the tails of the ability distribution, there will be very small differences in estimated group differences due to DIF. I believe this is what is happening in my example, where large and significant DIF explains little of the overall group difference in estimated ability because very few respondents have a level of ability that matches the difficulty level of the test items.
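
The area idea, and the limitation just described, can be checked numerically with a short Python sketch. It assumes logistic 2P curves with the same discrimination in both groups (the case where the signed area equals the difficulty difference) and a standard normal ability distribution; all parameter values and function names are invented for illustration.

```python
import math

def icc_2pl(theta, a, b):
    """Logistic 2-parameter item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def normal_pdf(theta, mean=0.0, sd=1.0):
    z = (theta - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def integrate(func, lo=-60.0, hi=60.0, steps=120000):
    """Composite trapezoid rule over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * h)
    return total * h

def signed_area(a, b_ref, b_focal):
    """Unweighted signed area between the referent and focal ICCs."""
    return integrate(lambda t: icc_2pl(t, a, b_ref) - icc_2pl(t, a, b_focal))

def weighted_gap(a, b_ref, b_focal, mean=0.0, sd=1.0):
    """ICC difference averaged over a normal ability distribution."""
    return integrate(
        lambda t: (icc_2pl(t, a, b_ref) - icc_2pl(t, a, b_focal)) * normal_pdf(t, mean, sd)
    )

# Same difficulty shift (0.8) for an item near the ability mean and one far in the tail.
area = signed_area(a=1.2, b_ref=-0.3, b_focal=0.5)
gap_targeted = weighted_gap(a=1.2, b_ref=-0.4, b_focal=0.4)
gap_extreme = weighted_gap(a=1.2, b_ref=3.0, b_focal=3.8)
print(area, gap_targeted, gap_extreme)
```

The unweighted area matches the difficulty difference regardless of where the item sits, while the ability-weighted gap shrinks for the item located far from where respondents are, which is the point made above about very easy or very difficult items.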


When estimating the correlation between two continuous latent variables with binary observed variables, does Mplus incorporate a correction for attenuation due to unreliability (like a factor analysis model does for continuous observed variables)? Or does it compute correlations between latent variable estimates (as I think is the case for IRT programs like BILOG)? 

bmuthen posted on Friday, September 14, 2001  11:23 am



The correction for attenuation is built into the modeling. Mplus does not compute the correlation from estimates of each individual's latent variable value. 


Greetings. I'm trying to develop an IRT-style scale for several polytomous items that are measured longitudinally. I'm somewhat surprised by some of my results, so I'm hoping you can check my logic. I have too many items/response categories to do a full longitudinal CFA (Mplus insists there's not enough memory in the workspace, no matter how much I try to give it). So I did a one-factor CFA with one year's data, using WLSMV, and was satisfied with the outcome. I'm willing to make the assumption that the items have the same measurement relations to the factor across years. I then ran the model for different years, fixing all of the loading and threshold values to be equal to those from the above CFA, and saved factor scores. My thinking was that this would give me a score for each year, all on the same scale. However, the factor means I'm getting out of these runs are surprising to me. It may be that they're right and my expectations were simply wrong. However, I noticed a pattern. Some of the years don't have all of the items, so I omitted those items from the scoring runs for those years. With all of the thresholds and loadings fixed, I'm working from the idea that it's as if those variables are missing (completely at random, in fact), and that this shouldn't affect the scaling. But the years with missing items all have the highest factor scores. Is this a reasonable approach? Other suggestions are welcome. Thanks, Pat

bmuthen posted on Tuesday, January 01, 2002  5:56 pm



Offhand, I don't see that this approach has any gross error, although I may be overlooking some scaling issue. As a check of reasonableness of the results I would compare this to treating the items as continuous and study the mean development for the average at each time point (average to take into account missing items). 


Thanks. I discovered the key problem was in my assumption that the scaling was constant across years: there were certain items that were phrased differently in different years, making them much "easier items" in some years than others. I appreciate the check on the logic. Pat


Just starting to get into measurement equivalence assessment using two-group CFA. Based on reading so far, I think that when using Mplus on binary math item indicators, if I have partial measurement invariance between M and F on Math, it is reasonable to assume that the factor means and factor scores can be estimated and that I am measuring the same factor in both groups. True so far? But I am interested in these factor scores compared to 2P IRT-derived scores. I have seen on this board the formulas for converting Mplus parameters to BILOG parameters. I am wondering if the Mplus factor scores will correlate with other variables in the same way as BILOG parameter-based scores. Maybe an obvious answer, but a stretch for me. Likewise, would differences in group means relative to variances be the same for Mplus factor scores and BILOG scores?

bmuthen posted on Saturday, February 08, 2003  6:41 am



Mplus Web Note #4 goes through different parameterizations including IRT and shows that the relationship between the item and the factor is the same. This means that the factor scores are the same. The only difference between BILOG and current Mplus is that BILOG uses a logistic function and Mplus uses a probit function; this should produce only minor differences with respect to factor scores.


Thanks for your very helpful response. Is it correct that partial measurement invariance is somewhat subjective as to whether you have it? If you have it, or enough of it, am I correctly interpreting your comments on the board that you can estimate factor scores for the two groups and DIF is controlled? This is extremely helpful. Thanks. 


Yes and yes. 


I have been unable to find the Muthen, Kao, Burstein (1991) reference mentioned above. Could you provide some information on where I can find it?


Following is the complete reference: Muthén, B., Kao, Chih-Fen, & Burstein, L. (1991). Instructional sensitivity in mathematics achievement test items: Applications of a new IRT-based detection technique. Journal of Educational Measurement, 28, 1-22. (#35) If you don't have access to the journal, you can request the paper from bmuthen@ucla.edu. Reference the paper by #35.


RE: Rich Jones on Friday, April 06, 2001 - 11:11 am: suggestion on an effect size for DIF. Would I be looking at the more general DIF by using a two-group analysis? Would I then compare the factor mean for the with-DIF group computed from a model recognizing some noninvariance (but still retaining partial measurement invariance) and the with-DIF group factor mean computed from a model assuming invariance? Correct? I guess there is no scale problem as long as I use the same item as reference indicator throughout, even though I free up some parameters? I wonder if there is any way to get a standard error on that difference or an upper bound on the standard error?


Clarification of my Q. My term with-DIF group makes sense in my context. That is, a test is administered and scaled under standard conditions and then administered to others under nonstandard conditions. So this last group is the with-DIF group I was referring to. Upon reflection, my issue may be yet more complex. Getting the scores of the with-DIF group controlling for DIF is no problem, I think. However, the comparison would be with their scores if they were scored using the item parameters derived from the standard group only. Nothing in my described analysis gives me that?

bmuthen posted on Wednesday, February 12, 2003  4:40 pm



Yes, when moving from a MIMIC-type analysis to a 2-group analysis I think comparing the factor mean from the correctly specified model (with the noninvariance in question included in the model) with the factor mean from the incorrectly specified model makes sense. I don't know about the s.e. of the difference in means. In a sense the s.e. for the mean for the correctly specified model would be somewhat useful - for example, it is informative if the incorrect mean is more than 2 s.e.'s away from the correct mean.

Rich Jones posted on Friday, February 14, 2003  9:20 am



In Re Magnitude of DIF: Comment, a shameless plug, and a proposed rule of thumb for interpreting magnitude of difference. I like to think of the parameter estimate for the indirect effect (or mean difference in the multiple-group case) and its associated standard error as a test of significance, and the comparison of parameters from fitted and misspecified models as a kind of effect size measure. I discuss the misspecified model comparison approach in a little greater detail in Jones & Gallo (2002, J Gerontol B Psychol Sci Soc Sci 57B:P548-P558). I've recently learned that Maldonado and Greenland (1993, Am J Epidemiol 138:923-936) describe a simulation study used to evaluate different strategies for identifying important confounders in observational studies. Their strategy might be adapted for evaluating the model misspecification approach to detecting "confounding" in the ability estimate due to DIF. These authors ultimately recommend a "change in estimate approach" similar to what I propose in Jones & Gallo (2002), but using a predetermined threshold (e.g., |b - b'|/b' > 0.10, where b and b' are parameter estimates from misspecified and fitted models, respectively) as a criterion for marking 'important' confounding. This 10% difference rule of thumb might be as good a rule of thumb as any, short of simulation studies or other indications that the detected DIF makes an important or substantial impact. BTW: I should mention that I came to the Maldonado and Greenland work by way of Crane, Jolley and Van Belle, who use it as a criterion in assessing the presence of uniform DIF in their DIFdetect procedure (see http://www.alz.washington.edu/DIFDETECT/welcome.html). However, I can see that it would be nice to have some indication as to how confident we can be that the difference between fitted and misspecified parameter estimates for mean difference in underlying ability is less than 10%.
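The change-in-estimate rule described in this post can be sketched in a few lines of Python. This is only an illustration: the function names and the example values are hypothetical, and taking the absolute value of b' in the denominator is an assumption made here so that the rule also behaves sensibly for negative estimates.

```python
def relative_change(b_misspecified: float, b_fitted: float) -> float:
    """Relative change |b - b'| / |b'| between the estimate from the
    misspecified model (b) and the estimate from the fitted model (b')."""
    if b_fitted == 0:
        raise ValueError("fitted estimate b' must be nonzero")
    return abs(b_misspecified - b_fitted) / abs(b_fitted)


def flags_important_dif(b_misspecified: float, b_fitted: float,
                        cutoff: float = 0.10) -> bool:
    """True when the relative change exceeds the cutoff (10% by default)."""
    return relative_change(b_misspecified, b_fitted) > cutoff


if __name__ == "__main__":
    # Hypothetical factor mean differences, not taken from any analysis in this thread.
    print(relative_change(0.60, 0.50))
    print(flags_important_dif(0.52, 0.50))
```

The cutoff is left as a parameter because, as the post notes, 10% is a rule of thumb rather than a tested criterion.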

bmuthen posted on Saturday, February 15, 2003  10:44 am



Thanks, Rich. Could you send me a copy of your article? I have a feeling you did already, but I just reorganized my office and can't seem to find it. 

Rich Jones posted on Wednesday, February 26, 2003  9:14 am



Re: Scale Equating Setup: I am trying to use the Mplus factor score estimator to produce equivalent latent trait estimates for two sequential administrations of the same symptom inventory. I need to generate equated, or linked, scores because the wording of response options (but not symptom stems) changed between administrations. All items are treated as dichotomous (symptom present/absent). Approach: I am linking the two models by (1) estimating factor loadings and thresholds in the first administration, with the variance fixed at 1 and mean 0 for the single latent factor, and saving factor scores; and (2) estimating latent trait estimates (factor scores) at the second administration, constraining the loadings for all items and the thresholds for the items that are (assumed to be) equal across administration to be equal to those estimated at the first administration. I've estimated two separate models, so that by default Delta=I in both administrations. Further, in the second model, the only free parameters are the mean of the latent trait, the thresholds for the items for which the wording changed, and ... Questions: (1) Should I hold the variances of the latent factor to be equal (i.e., 1) at the second administration? (2) Do you think this would be more appropriately parameterized as a multiple-group model, bringing Delta into the picture (i.e., fix Psi to 1 and free Delta for group 2, where group 2 is really administration 2)? I realize that if all items were exactly the same, and I did not constrain the variances to be equal (along with all loadings and thresholds), the metric of the latent trait would change, and I would get a different latent trait estimate for identical response patterns. I'm just not sure if I can expect (assume) the latent trait variance would/should be equal when the thresholds for the items are very different. - Rich

bmuthen posted on Wednesday, February 26, 2003  9:31 am



Just a clarification  are the two administrations given to the same group of people or different people? I assume different, but I want to make sure. 

Rich Jones posted on Wednesday, February 26, 2003  11:15 am



They are the same people at the two administrations. 

Rich Jones posted on Wednesday, February 26, 2003  2:29 pm



...but my idea was to treat them as separate groups in a scale equating phase of the analysis, and then look at longitudinal changes in a separate set of analyses. I realize this is not necessary with Mplus, but a secondary goal is to provide a set of equated scores for other investigators to use (who might not use Mplus).

bmuthen posted on Thursday, February 27, 2003  8:54 pm



You can do this by a "longitudinal factor analysis", that is, using a single-group analysis with one factor per time point (I would not use a multiple-group approach since you have the same people at the two admin's, so not independent samples). The standard setup is to hold thresholds and loadings invariant across time to the extent that is realistic, and let the factor variances be different across time (with one loading = 1 to set the metric), having a zero factor mean at time 1 and free at time 2, and letting Delta = 1 for time 1 and free for time 2 (see Mplus Web Note #4). Then estimate factor scores from this model. Or, do the above to get thresholds and loadings for each admin and then run each admin separately with parameters held fixed at the solution from the joint analysis, and estimate factor scores for that admin. This approach perhaps is somewhat less prone to misspecification of across-time relationships.

Rich Jones posted on Monday, March 03, 2003  6:07 am



Thanks for the suggestion. I ran these models and I am sure there is something I still do not 'get' about the use of scale factors. I will reread Web Note #4 more carefully. I actually have more than two administrations of this questionnaire: only one of the administrations differs from the others and needs to be linked. Running each repeated administration separately, I find that if I do not constrain both the psi and delta matrices, the estimated factor scores for identical response patterns are not equal across time (when the items are the same, and lambda and tau are also constrained to be equal). I find this sample-invariant scoring intuitively pleasing and it seems to be consistent with the IRT model. Rich

bmuthen posted on Monday, March 03, 2003  6:36 am



Having different Psi matrices (and therefore different Delta) influences the factor score estimation even when tau and lambda are the same. Psi is the "prior" factor cov matrix and therefore should have an influence. Substantively it seems like Psi can change over time and this should be allowed even when tau and lambda remain invariant. 


I am doing two group CFA with one factor. The indicators consist of some binary items and some ordered polytomous items. If I wanted to refer to the Mplus factor scores in IRT terms, would I say they are similar to or the same as IRT graded response scores? Thanks. 


They are the same as IRT theta scores when using a two parameter normal ogive model and estimating the scores using a Bayes modal, that is, maximum a posteriori, estimator. 


I'm slow absorbing sometimes. That's so even if some of the items are ordered polytomous and not binary? 

bmuthen posted on Saturday, March 29, 2003  5:14 pm



That's right. 


So if items are all ordered polytomous then its like graded response? All or part binary items then its 2P normal ogive MAP? In that case item scores other than 0, eg 1 2 3, are treated as 1 to compute probit coefficients? 

bmuthen posted on Saturday, April 05, 2003  10:35 am



As I understand it, graded response (when using a normal ogive) is the same as ordered polytomous in Mplus. Binary is 2P normal ogive. In both cases factor scores are MAP in Mplus. 

Anonymous posted on Tuesday, December 16, 2003  10:07 am



I am estimating a graded response model (Samejima, 1979) in Mplus. The structure of the scale turns out to be multidimensional (4 factors). I wonder, in this case, can I still use the same transformation, i.e., a = loading/sqrt(1 - loading**2); b = threshold/loading, to convert the Mplus estimates of thresholds for an item into the IRT model's b, and the same way for a?


Yes, as long as all factor indicators load on only one factor, that is, there are no cross-loadings.
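A minimal Python sketch of the probit-metric transformation quoted in the question, for an item that loads on a single factor. It assumes the factor variance is fixed at 1 and the default Delta parameterization, so that the residual variance of the latent response variable is 1 - loading**2; the function name is made up for illustration.

```python
import math


def probit_to_irt(loading, thresholds):
    """Convert a factor loading and its thresholds (probit metric, factor
    variance 1, no cross-loadings) to IRT discrimination a and difficulties b."""
    if loading == 0 or abs(loading) >= 1:
        raise ValueError("loading must be nonzero and smaller than 1 in absolute value")
    a = loading / math.sqrt(1.0 - loading ** 2)
    b = [t / loading for t in thresholds]  # one b per threshold of the item
    return a, b


if __name__ == "__main__":
    # Item #1 (Mplus WLSMV) from the start of this thread: loading .839, threshold .289
    a, b = probit_to_irt(0.839, [0.289])
    print(round(a, 3), [round(x, 3) for x in b])
```

For a polytomous item, each threshold gives its own b while the item keeps a single a.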

Anonymous posted on Wednesday, December 17, 2003  11:16 am



Unfortunately I do have three items that load on two factors. Therefore, shall I use: b1 = threshold1/sqrt(1 - (var(f1)*lamda_f1**2 + var(f2)*lamda_f2**2 + 2*lamda_f1*lamda_f2*cov(f1, f2))) as the transformation for these three items?

Anonymous posted on Thursday, December 18, 2003  11:55 am



A correction for my last submitted inquiry: for the three items that load on two factors, a_f1 = lamda_f1/sqrt(1 - (var(f1)*lamda_f1**2 + var(f2)*lamda_f2**2 + 2*lamda_f1*lamda_f2*cov(f1, f2))) and a_f2 = lamda_f2/sqrt(1 - (var(f1)*lamda_f1**2 + var(f2)*lamda_f2**2 + 2*lamda_f1*lamda_f2*cov(f1, f2))); as such, for these three items, each will have two sets of transformed thresholds, one set for factor 1, b1_f1 = threshold1_f1/lamda_f1, ... and one set for factor 2, b1_f2 = threshold1_f2/lamda_f2. Is that correct?

bmuthen posted on Thursday, December 18, 2003  1:42 pm



This article may give you the answers: MacIntosh, R. & Hashim, S. (2003). Variance estimation for converting MIMIC model parameters to IRT parameters in DIF analysis. Applied Psychological Measurement, 27, 372-379. Note that you don't get more thresholds because you have more than one factor - the thresholds relate to the item, not the factor.

Anonymous posted on Monday, December 22, 2003  1:05 pm



I obtained the paper of MacIntosh and Hashim (2003), but they didn't mention the case when items load on multiple factors. 

BMuthen posted on Monday, December 22, 2003  5:03 pm



Too bad. Perhaps the Takane and de Leeuw paper in Psychometrika. I don't know offhand. I would have to work it out and don't have time right now. 

Anonymous posted on Tuesday, January 27, 2004  8:36 am



When estimating IRT models via Mplus, two things are unclear to me: First, am I correct in saying that Mplus only models the means, variances and covariances, and not the higher-order moments? In this case, the estimation of IRT models via Mplus is not full-information, but a good approximation instead. Second, is it still not possible to estimate the rating-scale model or the partial credit model (with a probit link instead of a logit link) with Mplus? I ask this because the possibilities of Mplus grow rapidly, and the previous question about this topic is dated in 1999. Thanks for the answ


The current version of Mplus uses weighted least squares with a probit link. This is not a full-information estimator. Version 3 will include a full-information maximum likelihood estimator with a logit link for categorical outcomes. I am not familiar with what the rating scale model or the partial credit model is. If you can explain it in a simple way, I can try to answer this.

Anonymous posted on Tuesday, January 27, 2004  1:30 pm



The rating-scale model is a model for ordered polytomous data. It states that: ln [ P_vij / P_vi(j-1) ] = Theta_v - Beta_i + Tau_j, with v = person, i = item, j = category; Theta is a person parameter, Beta an item parameter, and Tau a category parameter. As such, this model assumes equal distance between two categories for all items. The partial credit model is an extension of the rating-scale model in that it relaxes the assumption of equal distance between categories: ln [ P_vij / P_vi(j-1) ] = Theta_v - Delta_ij, with Delta an item- and category-specific parameter. Hope this is clear.
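To make the two adjacent-category models in this post concrete, here is a hedged Python sketch that turns the log-odds of adjacent categories into category probabilities. The sign convention for Tau follows the post as written (Theta - Beta + Tau_j); the second, item-and-category-specific form is the one usually called the partial credit model. All function names and parameter values are illustrative assumptions.

```python
import math


def category_probs(step_logits):
    """Turn adjacent-category logits ln[P_j / P_(j-1)], j = 1..m, into
    probabilities for categories 0..m."""
    cumulative = [0.0]
    for step in step_logits:
        cumulative.append(cumulative[-1] + step)
    top = max(cumulative)  # subtract the max before exponentiating, for stability
    weights = [math.exp(c - top) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def rating_scale_probs(theta, beta, taus):
    """Rating-scale form: the category parameters taus are shared by all items."""
    return category_probs([theta - beta + tau for tau in taus])


def partial_credit_probs(theta, deltas):
    """Partial credit form: deltas are specific to the item and the category."""
    return category_probs([theta - delta for delta in deltas])


if __name__ == "__main__":
    # Hypothetical parameter values for a 4-category item
    print(rating_scale_probs(0.5, 0.2, [0.8, 0.0, -0.8]))
    print(partial_credit_probs(0.0, [-1.0, 1.0]))
```

Both functions share the same helper because the models differ only in how the step logits are parameterized.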


No, we don't do these models as far as I know. There may be some way to specify it but I am not aware of it. 

Anonymous posted on Thursday, September 02, 2004  12:26 am



There have been a couple of mentions of the relationship between Mplus modelling of ordered polytomous data and Samejima's graded response model: is this spelt out in detail anywhere? 

bmuthen posted on Thursday, September 02, 2004  8:24 am



With ML estimation, Mplus uses a logistic regression of the ordered polytomous item on the factor, where the logistic regression model is a proportional-odds model in the language of Agresti's categorical data book. I am told by IRT experts that this is Samejima's model. I haven't seen this spelled out in writing, but you can check Agresti and compare to Samejima.
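As a rough illustration of the proportional-odds (graded response) structure described here, the Python sketch below computes category probabilities from cumulative logits. It assumes the cumulative logit for "category k or higher" is -threshold_k + loading*factor, with one loading shared by all thresholds of the item; the function names and values are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def graded_response_probs(factor, loading, thresholds):
    """Category probabilities for an ordered item with categories
    0..len(thresholds); thresholds must be strictly increasing."""
    if any(hi <= lo for lo, hi in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    # P(u >= k) for k = 0..m+1; the same loading at every split is what
    # makes this a proportional-odds model.
    at_or_above = [1.0]
    at_or_above += [logistic(-t + loading * factor) for t in thresholds]
    at_or_above.append(0.0)
    return [at_or_above[k] - at_or_above[k + 1] for k in range(len(thresholds) + 1)]


if __name__ == "__main__":
    # Hypothetical 4-category item
    print(graded_response_probs(0.0, 1.2, [-1.0, 0.0, 1.5]))
```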

Anonymous posted on Sunday, September 05, 2004  11:20 pm



Just in followup to the graded response question. I've compared estimates from MULTILOG and Mplus in three different data sets and they are essentially in agreement as follows: MULTILOG's A = Mplus' standardized loading; MULTILOG's B(k) = Mplus' k'th threshold/standardized loading. The relationship for A looks different to what has been stated in some earlier questions/answers: is it different? 


The relationship you express is with logit. Another relationship holds for probit. Is this what you mean? 

Anonymous posted on Monday, September 06, 2004  5:01 pm



Re: Mplus/MULTILOG. I think I haven't understood the answer to "Anonymous Tuesday, December 16, 2003  10:07 am" which describes a different transformation from Mplus to graded response. But to be clear: a 1factor model for polytomous variables in Mplus fits the equivalent of a logit form proportionalodds? And since MULTILOG does a logit version of graded response the above simple relationship holds? (And Mplus can/can't do normal ogive versions?) 

bmuthen posted on Monday, September 06, 2004  5:09 pm



Yes and yes. The Dec 16 statement is true for probit (i.e. normal ogive), which Mplus can also do. 

Anonymous posted on Tuesday, September 07, 2004  3:31 pm



Thank you for your replies regarding graded response. Finally for now (I hope): how is a probit invoked? (I couldn't find anything in the manual for 3.0). 


Probit is obtained with the WLS, WLSM, and WLSMV estimators and categorical outcomes. 

Judy posted on Thursday, December 16, 2004  8:14 am



I have a question about generating the scaling factor for an IRT analysis. My items are categorical  with 4 response options and I have 15 total items. How do I generate a scaling factor for each of the 15 items? I have read through this discussion and I think I've figured out how to generate the a and b parameters  but does the newest version of Mplus allow for graphing of the IRT curves? If so, how do I do that? Thanks 


Mplus does not use scale factors to generate categorical data. See mcex5.2.inp which comes as part of the Mplus installation for an example of how to generate categorical data. 

Anonymous posted on Tuesday, January 18, 2005  3:46 pm



A quick question. As I am reading all the conversations on IRT, I am left needing some clarification. Mplus will fit Samejima's model, similar to Multilog? However, Mplus version 3 does not perform IRT for rating scale data? Am I correct on these?


I don't know what you mean by rating scale. If it is ordered categorical (polytomous), then Mplus can handle it. 

Anonymous posted on Tuesday, January 18, 2005  3:56 pm



By rating scale, I mean a scale that is strongly agree, agree, disagree, and strongly disagree. I know that Mplus can handle this type of data in other formats (i.e., SEM), but will it work for an IRT model using the Likert-type format?


It will work for all models in Mplus. IRT is just a special case of an SEM model. Use the CATEGORICAL option to specify which dependent variables are rating scales. 


I have noticed that person abilities estimated by the MLR method in Mplus are continuous while I expected discrete values like MLE or WLE ability estimates. What causes this difference in the MLR method? Do you have a reference where I can find information about this? Many thanks in advance. 


I think by ability measures you are referring to factor scores. They are continuous for all maximum likelihood estimators and weighted least squares estimators. Factors are continuous. 


My understanding is that if the model is a logit model and, with constrained loadings, a Rasch model, then it is in the exponential family and the student raw scores are sufficient statistics. Therefore there should be a one-to-one match between abilities (factor scores) and raw scores, but that is not happening in Mplus.


Have you considered that in Mplus modeling the prior for the ability distribution is normal? The scales of the ability estimates and the raw scores are therefore different.


I am using MLR with dichotomous items and I am constraining the item loadings to 1. My understanding is that this will result in a Rasch model with a normal prior. My understanding is also that in this case the raw scores and the ability estimates will have a one-to-one match; the metrics will be different and the transformation from one to the other will be nonlinear, but nevertheless there should be just one estimate for each possible raw score. This does not appear to be happening and I am not sure why. Do you have an explanation? What do you recommend as the best reference for understanding the MLR estimation algorithm in this context?


To answer this I would need to see your Mplus output and your data. Please send them to support@statmodel.com. Just to make sure, you should allow the loadings to be equal when fixing the factor variance to one, not fixing the factor loadings to one. If you fix the factor loadings to one, you should allow the factor variance to be free. 

Anonymous posted on Saturday, February 05, 2005  3:47 am



I'm wondering about how to compute the estimates needed for a test information function in Mplus. I think this was done in Muthén, B.O. (1996). Psychometric evaluation of diagnostic criteria: application to a two-dimensional model of alcohol abuse and dependence. Drug and Alcohol Dependence, 41(2), 101-112? Any pointers would help. Thanks, Andrew Baillie andrew.baillie at mq.edu.au

bmuthen posted on Saturday, February 05, 2005  1:46 pm



I found it helpful to go by the Hambleton-Swaminathan IRT book which I think was referred to in that article.


I am wondering how to incorporate estimates from one item-response model into a second item-response model and still get good standard errors. More specifically: “Stage 1” is a multilevel item-response model for individuals’ ordinal responses to questions at time 1. Level 1 = person and level 2 = items within person. This will give coefficient estimates and cutpoints from which we can obtain probabilities of an individual scoring between any two cutpoints. “Stage 2” would be a multilevel item-response model for individuals’ responses to questions at time 2, but this time we would like to incorporate estimates from “stage 1” as predictors. Any suggestions on how to model this in Mplus? Thank you so much for your help. Laura Piersol


You can have a model that contains both of your itemresponse models, a two factor model. 

bmuthen posted on Tuesday, March 08, 2005  2:56 pm



To add to this discussion, it sounds like your "cutpoints" are thresholds for ordinal outcomes and perhaps your "coefficient estimates" are the loadings (discriminations). If so, Linda's 2-factor suggestion refers to a longitudinal factor analysis with 1 factor at each time point. Instead of having stage 1 estimates as "predictor", the idea is then to hold the thresholds and loadings equal across time. Perhaps this is something you want to do - assuming we have understood you correctly.

Anonymous posted on Wednesday, March 09, 2005  11:48 am



I'm a new user, so I'm still learning the program. I'm trying to generate an IRT analysis of my data (1 latent trait explaining 10 categorical variables) modeling my program on example 5.5 from the manual. Can Mplus generate Item Characteristic Curves using the Plot command? (The program gives me the options only of Histograms, Scatterplots, Sample Proportions, and Estimated Probabilities, none of which produce ICCs.) Thanks for any suggestions or guidance. 


No, ICC's are not yet available. They are on our to do list. 


Thank you for your 3/8/05 responses to my question. I have been trying to get a better handle on these types of models, including factor analysis and IRT, as I am relatively new to these topics. In regard to holding the thresholds and loadings equal across time: we are following a group over two time points. We are happy to assume that the thresholds are equal across items for each time point. However, the sets of items at the two time points are not identical, so equating thresholds doesn’t seem correct. In this case, would you think of the first factor as a “predictor” of the second factor? This is the model I have in mind: MODEL: f1 BY u1-u7; f2 BY u8-u14; f1 ON x1-x10; f2 ON f1 x11-x20; Am I missing something? I also would like to confirm that we don’t need to run a multilevel model. In terms of variables, we have individuals’ responses to the items and individual-level covariates. From reading the Mplus documentation, it sounds like this can be handled as a single-level model. Do you have a literature suggestion for better understanding the intricacies of these models and interpreting the output? Many thanks, Laura


If I understand you correctly now, you do not have the same items at two time points but different items at two time points representing two different dimensions. Then there is no need to hold thresholds and factor loadings equal. You would want to hold them equal if you have the same items at two time points representing the same dimension. The equalities specify measurement invariance. I think your MODEL command looks good given what you want to do. If you have no clustering in your data, that is, children were not sampled from classrooms, for example, then a single-level analysis is appropriate. I don't know of any one piece of literature but a good SEM book would probably help. See our website where there are a plethora of references. I think many like the Bollen book. Maybe someone else can make a suggestion. There are also some papers that compare IRT and SEM.

Anonymous posted on Thursday, March 17, 2005  4:11 am



On Wednesday, March 09, 2005 - 11:48 am Anonymous asked about plotting ICCs in Mplus. While I haven't found a way of doing it within Mplus, it is relatively easy to plot ICCs in your favourite graphical program (I like gnuplot but I've done it in Excel as well). In gnuplot you can use the norm() function with the reparameterised estimates (see above on this page for how to reparameterise) and plot y = norm(a*(x - b)). Andrew Baillie andrew.baillie at mq.edu.au
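For readers without gnuplot, the same normal-ogive curve can be tabulated in plain Python and passed to any plotting tool. This is a sketch under the assumption that a and b have already been converted to the IRT metric; the function names are made up, and the example values are illustrative only.

```python
import math


def norm_cdf(z):
    """Standard normal CDF, the counterpart of gnuplot's norm()."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def icc_normal_ogive(theta, a, b):
    """Probability of endorsing the item at ability theta."""
    return norm_cdf(a * (theta - b))


def icc_points(a, b, lo=-4.0, hi=4.0, n=81):
    """Evenly spaced (theta, probability) pairs, ready for plotting."""
    step = (hi - lo) / (n - 1)
    return [(lo + i * step, icc_normal_ogive(lo + i * step, a, b)) for i in range(n)]


if __name__ == "__main__":
    # Illustrative values only: a = 1.542, b = 0.344
    for theta, prob in icc_points(1.542, 0.344, n=9):
        print(f"{theta:5.1f}  {prob:.3f}")
```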


Some time back I asked about plotting test information functions. For the benefit of others, here is what I've found. Hambleton & Swaminathan (1985) got me off to a good start but Frank Baker's online book on IRT had all the answers I was looking for. The essential points are: 1. The test information function is the sum of the item information functions. 2. The item information function for the 2-parameter logistic model is I(theta) = a^2 P(theta) Q(theta), where P(theta) = 1/(1 + EXP(-a(theta - b))) and Q(theta) = 1 - P(theta), a and b being the discrimination and difficulty parameters, and theta the latent "ability" (see Baker, 2001, eqn 6.3 on p. 106). Note that for simplicity I've left the i subscript off these formulae. I'm yet to find the item information function for the two-parameter probit model. Thanks again for the excellent software. Andrew Baillie andrew.baillie at mq.edu.au References: Baker, Frank (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD. http://edres.org/irt/baker/ Hambleton, R.K. & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Kluwer.
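The two points in this post translate directly into code. The Python sketch below implements the 2PL item information and sums it over items. It also includes a probit (normal ogive) version based on the general Fisher information for a binary item, (dP/dtheta)^2 / (P Q); that formula is a standard result rather than something taken from this thread, and the function names are hypothetical.

```python
import math


def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_info_2pl(theta, a, b):
    """I(theta) = a^2 * P * Q for the 2-parameter logistic model."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


def item_info_probit(theta, a, b):
    """Normal ogive analogue: a^2 * pdf(z)^2 / (cdf(z) * (1 - cdf(z))),
    with z = a * (theta - b). Not numerically safe for very large |z|."""
    z = a * (theta - b)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return a * a * pdf * pdf / (cdf * (1.0 - cdf))


def total_information(theta, items, item_info=item_info_2pl):
    """Test information: the sum of item informations over (a, b) pairs."""
    return sum(item_info(theta, a, b) for a, b in items)


if __name__ == "__main__":
    items = [(1.0, -0.5), (2.0, 0.0), (1.5, 1.0)]  # hypothetical (a, b) pairs
    for theta in (-2.0, 0.0, 2.0):
        print(theta, round(total_information(theta, items), 3))
```

Passing `item_info=item_info_probit` to `total_information` gives the probit-metric curve for the same items.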


Thank you for the information. ICC's cannot be plotted in Mplus at the present time but we will be adding this in the future. 

CMW posted on Tuesday, April 12, 2005  7:32 am



Greetings, I fitted a 2PL IRT model with Multilog. Then I used Mplus to fit a one-factor model, fixing the variance of the factor to 1, specifying that the variables are categorical, and using the MLR estimator. The Mplus loadings are nearly identical to the Multilog discrimination parameters, but the Mplus thresholds are not close to the Multilog thresholds. Can you help with understanding this discrepancy? CMW

BMuthen posted on Thursday, April 14, 2005  12:00 am



Mplus uses the logit parameterization: -threshold + loading*factor, whereas Multilog uses: a*(factor - b). From this you can deduce the relationship.
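Spelling out the deduction: matching the two linear predictors term by term gives a = loading and b = threshold/loading. A small Python sketch, assuming a binary item, a logit link, and a factor with mean 0 and variance 1 (the function name and the numbers are illustrative):

```python
def mplus_logit_to_irt(loading, threshold):
    """Return (a, b) such that a * (f - b) equals -threshold + loading * f."""
    if loading == 0:
        raise ValueError("loading must be nonzero")
    return loading, threshold / loading


if __name__ == "__main__":
    # Hypothetical estimates for one binary item
    a, b = mplus_logit_to_irt(1.5, 0.6)
    f = 0.7
    print(-0.6 + 1.5 * f, a * (f - b))  # the two linear predictors agree
```

This also accounts for the pattern reported in the question: the loadings match the discriminations directly, while the thresholds differ by a factor of the loading.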

Anonymous posted on Monday, April 18, 2005  5:50 am



I used Mplus to fit a one-factor model; the variance was fixed at 1.0, and the variables were categorical. I used the WLSMV estimator. I would like to transform the estimates from Mplus into the parameterization used in BILOG and Multilog. I know that the BILOG slope = loading/sqrt(1 - loading**2), and BILOG threshold = threshold/loading. But my question is: are the above-mentioned loadings unstandardized or standardized? I highly appreciate your answer. HJD

bmuthen posted on Tuesday, April 19, 2005  12:18 pm



You should work with the unstandardized values. But note that with WLSMV you have probit, not logit results. This means that you have to involve the factor 1.7 in your slope/loading comparison with BILOG. 
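The role of the factor 1.7 can be checked numerically: a logistic curve with its argument multiplied by 1.7 stays close to the normal ogive over the whole ability range. The Python sketch below is illustrative only. Whether the 1.7 is applied to the Mplus-derived slope or is already absorbed by the other program depends on which metric that program reports, which is an assumption to verify for each run.

```python
import math

D = 1.7  # the usual probit-to-logit scaling constant


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def max_gap(d=D, lo=-4.0, hi=4.0, step=0.01):
    """Largest absolute difference between norm_cdf(z) and logistic(d * z)
    on a grid of z values."""
    n = int(round((hi - lo) / step))
    return max(abs(norm_cdf(lo + i * step) - logistic(d * (lo + i * step)))
               for i in range(n + 1))


def probit_slope_to_logit(a_probit, d=D):
    """Rescale a normal-ogive slope to the logistic metric."""
    return d * a_probit


if __name__ == "__main__":
    print(max_gap())
    print(probit_slope_to_logit(1.0))
```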


Can the Mplus framework incorporate multidimensional scaling analysis and/or multidimensional unfolding analysis? I understand that there are formulations of these models that are based in IRT. AJT 


Don't know  can you suggest some pertinent references? 


Here is a webpage with a number of references: http://www.psychology.gatech.edu/unfolding/Publications.html. I'm not sure if these references all apply to unidimensional unfolding or not, but I believe it has been applied multidimensionally. I am just starting out in this literature, so I may run across more recent articles. If so, I will pass them along. If the Mplus framework can incorporate these types of models, it would represent a significant advance in the field of MDS, allowing MDS and unfolding solutions to be more rigorously tested and incorporated into more general latent variable models. I believe that the possibility of state-of-the-art missing data handling and the use of complex sampling designs would be a major advance, as well. Allison


Should we look at the standardized or the unstandardized coefficients for the IRT difficulty parameter? The following model converges in the usual way: Model: f by Q1-Q40; f on Male SES; But the following model, despite increasing MIteration and MConverge, does not terminate normally: Model: f by Q1-Q40; HumanCap by FEdu MEdu; EconCap by Maid Car ResType ClubM; f on Male HumanCap EconCap; Is there any text or reference on building IRT models using Mplus? Thanks.


For ML estimation of single-factor models for binary indicators, Mplus Version 4 gives results not only in the regular factor analysis metric but also in the metric of the classic 2PL model with difficulty and discrimination estimates. The classic IRT estimates are given in the usual (0,1) metric for the factor. The relationship between the factor model parameterization and the IRT parameterization is given in Day 3 of our course handouts. This will also be posted this week as part of the new Web Note #10. The factor model estimates are reported both as raw estimates and standardized, and the choice - or reporting both - is up to the user. For the problem you are having with your 3-factor model, to make a diagnosis we need you to send input, output, data and license number to support@statmodel.com. The only relevant IRT references I know are listed on our web site under References, Categorical outcomes, IRT - see also the forthcoming Web Note #10.

yang posted on Friday, June 02, 2006  7:01 am



Is ICC available in Mplus now? Thanks. 


Item Characteristic Curves and Information curves are now available as part of the PLOT2 option of the PLOT command. 


Dear Bengt, Last year I submitted a scale validation manuscript to a journal. The centerpiece of this manuscript was a confirmatory factor analysis conducted in Mplus 3 using WLSMV estimation. The input items were 136 binary personality inventory items. Previous research using PCA and EFA methods suggested three second-order factors and 17 first-order factors, so we fit that hypothesized factor structure to our data. We have some 6,000 research participants, which we randomly split into two samples, an initial model validation sample (which we used to obtain a "brief" 48-item pared-down version of the original instrument) and a cross-validation sample on which we successfully refit the factor structure from the first sample. A reviewer of the manuscript has called into question our use of factor analytic methods, arguing that we should instead use IRT methodology. The reviewer states, "To reduce the number of items measuring the three clinical dimensions of the 136-item inventory should be performed according to modern psychometrics outside the frame of factor analysis, namely with item response theory models. In this respect, the authors should consult, e.g., Borsboom, D: Measuring the Mind. Conceptual issues in contemporary psychometrics. Cambridge University Press 2005". I have read the Borsboom book as well as an earlier paper he published in 2003 that delineates some of the philosophical conundra involved in using latent variable models to infer the presence of latent factors from correlation matrices. Given how enthusiastically the reviewer endorsed IRT over CFA, I was initially surprised to see that Borsboom's criticisms seem to apply with equal force to IRT and CFA/SEM models. Paul Barrett made mention of this on SEMNET in March of 2005, citing the following paper: Michell, J. (2004). Item Response Models, pathological science, and the shape of error. Theory and Psychology, 14(1), 121-129.
When I read through the SEMNET archives and this discussion board, as well as the helpful posts in the new IRT sections of the Mplus Web site, I found myself less surprised given how closely related the two methods appear to be, with identical results possible under some conditions (e.g., ML estimation of models containing a single latent factor). In crafting my response to this reviewer's comments, it would be helpful for me to know the scope of available IRT models in general and what is available in Mplus. First, to your knowledge, is it even possible to fit higher-order factor models within the IRT framework? If it isn't, then clearly IRT would not be a suitable tool for our purposes given that our theory clearly stipulates a higher-order factor structure a priori. On the other hand, if it is in fact possible to fit higher-order latent variable models under the IRT umbrella, is it feasible to do it using Mplus? I'd guess that if one of the requirements is ML estimation, then the answer is probably "No" because of the computational burden involved with this many variables and subjects. Finally, given how closely related the factor analytic and IRT approaches are, even if it is conceptually possible to fit a higher-order IRT model and it's computationally feasible, is it even worth bothering to recast the analyses in this manner given how similar the IRT and CFA results are likely to be? My intuition tells me that at such large samples the WLSMV estimates originating from Mplus would be unlikely to differ markedly from those produced by an IRT model. What do you think? As always, references and any additional comments (including any thoughts you have on the overall utility of what can be learned from fitting CFA models to tetrachoric correlation matrices in scale validation studies) are most welcome. Gratefully and with best wishes, Tor


I am disappointed that there are still some journal reviewers who do not understand the relationship between factor analysis of categorical outcomes and IRT: it's all the same. It's been a long time now since articles like Takane, Y. & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408. were published, and long before then it was clear that this is all the same. Perhaps the early focus on correlation matrices in factor analysis throws people off. But that should be seen merely as a matter of estimator choice, not model choice. Tetrachoric correlations belong with (weighted) least-squares estimation of limited information from first- and second-order moments, whereas with ML you work with the raw data (full information from all moments). The model is the same, however: if you assume normal factors and probit regressions, you fulfill the assumptions of underlying normality for continuous latent response variables that tetrachorics rely on. This is IRT's 2-parameter normal ogive model. Going to 2-parameter logistic is a trivial model variation. Both the IRT and the factor analysis traditions now work with multiple factors, although I haven't seen explicit use of second-order factors in IRT, mostly because IRT uses ML almost exclusively and the necessary numerical integration is heavy in situations where second-order factors are used, namely with many first-order factors. Programs like Bock-Muraki's TESTFACT are limited, I think, to 5 dimensions. For the same model, least-squares and ML estimation typically give very similar results, as the 1986 JEBS article by Mislevy already showed. The Mplus facilities for IRT are outlined at http://www.statmodel.com/irtanalysis.shtml showing that both least-squares and ML techniques are available for both probit and logit.
ML can in principle be used for the same complex models as least-squares, but is again limited by computational demands with many first-order factors. Note also that Mplus can do IRT modeling in mixture (latent class), multilevel, and multilevel mixture situations.


I was interested in looking at comorbidity between two disorders using an IRT framework. However, I am running into a problem in examining the model. I want the variable to be coded as 0 = no diagnosis; 1 = diagnosis a; 2 = diagnosis b; and 3 = diagnosis a AND b. If I treat the variable as an ordered categorical variable, the model works by treating the 4 level variable as a graded response. However, there is no reason to think that diagnosis b is more severe than diagnosis a, which is implied in the graded response model. Is there a way to handle the data in a nominal analytic approach? 


I am a little confused because when you say IRT, I think factors but then you talk about a nominal variable. What is this nominal variable used for in the analysis? Give me your MODEL command. 


I have repeated diagnostic assessments, so the 4-level nominal variables would be the indicators for the factor. At the initial stages, the model looks like: F BY dxt1* dxt2 dxt3 dxt4 (1); F@1; It may, however, be more appropriate to constrain the thresholds to be equal across the levels of the nominal variable.


The NOMINAL option is not available for factor indicators. Perhaps you should use your original items for the factor analysis. 


By 'original items'  do you mean disorder a  present/absent and disorder b  present/absent? Or do you mean to treat the variable as ordered categorical? 


I mean the symptom/disorder items. Treating a nominal variable as ordered doesn't make sense. 


I am fitting a model of the following form using WLSMV: f1 BY u1*u8; f2 BY u1*u4; f3 BY u5*u8; f1@1; f2@1; f3@1; f1 WITH f2@0; f1 WITH f3@0; f2 WITH f3@0; u1-u4 reflect diagnostic status for disorder A and u5-u8 reflect diagnostic status for disorder B. I am not interested in growth per se, but I am interested in examining some aspects of measurement invariance. Is it appropriate to interpret the threshold parameters in the same way that you would if there was only one factor? Thanks!


Yes. 

Gareth posted on Monday, November 27, 2006  2:56 pm



Hello I would like to find out how 10 questionnaire items differ between two groups of people. Could you offer any guidance on sample size? How many cases are required? Many thanks 


Sample size depends on many things including the model, reliability of the data, scale of the dependent variables, etc. To know for sure how many observations you would need, you could do a Monte Carlo simulation study. 

yang posted on Thursday, November 30, 2006  7:39 pm



I am fitting an MIMIC model with binary indicators loaded on a single factor, and several DIF effects are detected. I know that the parameters in MIMIC model can be converted into the parameters for a 2PL IRT model (refer to Dr. Bengt Muthen's response on May 29, 2006), but I do not know how to get them in Mplus. I have 3 questions: 1. What are the corresponding commands in Mplus in order to get these converted parameters and their standard errors under this situation (single factor, binary indicators, ML estimation)? 2. Does it make any difference if the estimator is not ML, e.g., WLSMV? 3. How about an MIMIC model with multiple factors, binary indicators and several DIF effects? Thank you very much. 


The IRT conversions are given only for models without covariates. 


I apologize if this is a redundant question, given previous posts, but I am still relatively new to both the IRT literature and MPlus. I am fitting IRT models to binary data. I have no problem fitting the Rasch and 2PL models, unidimensionally. However, I believe that I need to fit a multidimensional IRT because I think that there are two factors that would represent my dataset. Am I correct in thinking that I can use the same basic code for a multidimensional IRT as a unidimensional IRT? In doing so, I have merely added the second factor into the syntax. The model runs, however I only get thresholds, rather than difficulties and discriminations. Am I fitting the model correctly? If so, is there a way to get discriminations and difficulties out of my output? Thanks! 


We make the IRT translation only for traditional one-factor models. You need to do this yourself for multi-factor models.

Ilona posted on Friday, January 26, 2007  6:24 pm



I apologize for this Stat 101 question  I have looked at many references referred to in this (and other related) discussions, but I have not found the explicit answer: How exactly do you convert the probit lambda to a logit lambda, and a probit threshold to a logit threshold? I assume it's not a simple multiplication by the 1.7 conversion factor? I had attempted to figure this out on my own from running the same data two ways: with the link=probit for one run and link=logit for another run, both using MLR. I was hoping to see if simple multiplication seemed to work (I had set the threshold@.71 in the link=probit run, and set the threshold to 1.2 in the link=logit run). But I was further confused when in both runs, the first threshold (item difficulty) in the IRT parameterization sections was shown to be set to 1 in the output for both. I expected it would have been the same as the link=logit value of 1.2. Thanks! 

Ilona posted on Saturday, January 27, 2007  7:24 am



Sorry, please answer the first question, but let me fix my second question. It should have said: I attempted to figure some of this out on my own by running the same data and model (single factor model with 60 binary items) two ways: with link=Probit for one run and link=Logit for another run, both using MLR. What I assumed from Bengt's response of Dec 6, 1999, was that the conversions (mentioned Dec 2, 1999) from FA to IRT parameters are for converting from the link=Probit FA parameters to the Probit IRT parameters. These were: i) IRT discrimination/slope: a = lambda/sqrt(1 - lambda**2) and ii) IRT difficulty: b = threshold/lambda. But I can't seem to convert the Mplus FA output to the Mplus IRT output regardless of the link I use. So, Question 2: How can I convert from my Mplus link=Probit FA thresholds and lambdas to the IRT discrimination and difficulty estimates Mplus outputs? I have: lambda1 set to .71, threshold 1 set to .71, Mplus IRT discrimination (a1) output = .71, Mplus IRT difficulty (b1) output = 1.0. (The IRT output says the parameterization is Probit.)

Ilona posted on Saturday, January 27, 2007  7:24 am



And Question 3: How can I convert from my Mplus link=Logit FA thresholds and lambdas to the IRT discrim & difficulty estimates that Mplus outputs? I have: lambda1 set to 1.208 threshold1 set to 1.208 Mplus IRT discrim (a1) is output as .71 Mplus IRT difficulty (b1) is output as 1.0. (the IRT output says the parameterization is Logistic) Thanks! 


Answers to your questions are in the IRT documentation on our web site: http://www.statmodel.com/irtanalysis.shtml 
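For readers following along, the figures quoted in the posts above can be reproduced with a short Python sketch. It assumes a single factor with mean 0 and variance 1 estimated by ML, in which case the conversion appears to reduce to a = lambda (probit link) or a = lambda/D with D = 1.7 (logit link), and b = threshold/lambda. That reading is inferred from the posted numbers and is an assumption of this sketch; the IRT documentation linked above remains the authoritative source. The function name is made up for illustration and is not part of Mplus.

```python
D = 1.7  # conventional logit-to-probit scaling constant (approximate)


def ml_to_irt(loading, threshold, link="probit"):
    """Convert one loading/threshold pair to IRT a (slope) and b (difficulty).

    Assumes a single factor with mean 0 and variance 1.
    Hypothetical helper, not an Mplus function.
    """
    if link == "probit":
        a = loading
    elif link == "logit":
        a = loading / D  # put the logit slope on the probit-like IRT metric
    else:
        raise ValueError("link must be 'probit' or 'logit'")
    b = threshold / loading
    return a, b


# Values quoted in the posts above.
print(ml_to_irt(0.71, 0.71, link="probit"))
print(ml_to_irt(1.208, 1.208, link="logit"))
```

Both calls give a slope of about .71 and a difficulty of 1.0, which is what the posts report from the IRT section of the output.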

Thomas Olino posted on Wednesday, February 28, 2007  5:52 am



In deciding if observed item scores are continuous or ordered categorical, would it be appropriate to run CFA models (one specifying continuous indicators and the other specifying categorical indicators with 4 response levels) and then compare the BIC values? Thanks!


Statistics cannot determine the nature of the measurement scale. 


I am conducting an ordinal CFA with the wlsmv estimator. How do I obtain output of parameter estimates in IRT metric that is now available with v. 4.2 of Mplus? I cannot find IRT metric estimates in my output nor can I find the command to request them in the Mplus documentation. Thanks! Scott 


They are available for binary dependent variables only. They are printed automatically when available. 


Greetings! I am currently running a Graded Response Model with 836 participants over 80 items. There are 5 dimensions, and the items are measured on a 5-point Likert scale. The model is running on a server with substantial memory (16GB) and disk space (60GB). I'm using 10 integration points in the code, and once the model began I received the message, "THIS MODEL REQUIRES A LARGE AMOUNT OF MEMORY AND DISK SPACE. IT MAY NEED A SUBSTANTIAL AMOUNT OF TIME TO COMPLETE..." The MS-DOS window indicates that the total number of integration points is 100,000. Currently, it has been running for close to 30 hours, and not having run these models before, I wasn't sure if the model is indeed running, or if it's 'stuck' (as can happen in LISREL and SPSS). Any suggestions would be appreciated! Thanks!


It's probably still running but this is not a realistic analysis with so many integration points. You can change the integration to Monte Carlo integration (INTEGRATION=MONTECARLO;). Alternatively, you can use the weighted least squares estimator, WLSMV. 
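The 100,000 figure is what one would expect if the integration grid is a full rectangular product of the per-dimension points, which is an assumption here rather than something stated in the thread. A tiny Python sketch (the function name is illustrative only) shows how quickly that count grows with the number of factors:

```python
def total_integration_points(points_per_dimension, n_dimensions):
    """Size of a full rectangular grid: points ** dimensions."""
    return points_per_dimension ** n_dimensions


# 10 points per dimension, 1 to 5 factors.
for dims in range(1, 6):
    print(dims, total_integration_points(10, dims))
```

With 10 points and 5 dimensions this gives 100,000 grid points, each of which has to be evaluated for every observation at every iteration, which is why Monte Carlo integration or WLSMV is suggested instead.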


Folks I'm comparing Parscale and Mplus (v3.1) 1PL and 2PL models with ML estimation. I obtain the same difficulty and (for the 2PL) discrimination parameter estimates across the two programs (after using the conversion equations in webnote4). However, I find that the distribution of the theta (factor) scores differ across the two programs. Parscale provides approximately normal theta values, while the factor scores in Mplus have a std deviation of .91 in the 1PL and .93 in the 2PL. The latent scores themselves are correlated .99 across the two programs and are centered at 0, but are just distributed differently. Is there any reason that you can think of why Mplus provides scores that do not have a st dev of 1? Or am I missing something? Thank you! 


One reason might be that there are two ways of estimating these factor scores: EAP and MAP. Mplus uses EAP. Perhaps Parscale uses MAP. Also, the variances of estimated factor scores do not in general agree with the maximum likelihood estimated factor variances, due to shrinkage. A third reason could be that there was a problem in Version 3.1. ML for categorical outcomes was introduced in Version 3 and there have been many changes since then. I can't think of any problem offhand, but there may have been one.

Ilona posted on Friday, August 24, 2007  10:14 am



In using SEM to parallel IRT, I believe that getting factor scores (SAVE = FSCORES;) is parallel to getting IRT scale scores (on a z-score scale). My question is, what is the parallel to getting the IRT standard error of measurement for the scale scores? Can this be output for each subject's estimated factor score, or for each factor score value (if using ML/MLR or WLSMV)? Thank you!


The factor scores you get with categorical items and continuous factors when using ML estimation in Mplus are the "theta-hat" scores you obtain in IRT (using the Bayesian Expected A Posteriori approach). They will be on a z-score scale if you have the factor variance fixed at 1, freeing all the loadings (the first is otherwise fixed by default and the variance free). IRT standard errors of measurement are typically expressed as the inverse, namely the information curves for items and sums of items. You can request information curves in the Mplus PLOT command. See also the web site description of IRT in Mplus.
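To make the link between information and standard error concrete, here is a minimal Python sketch. It assumes a 2-parameter logistic model with the slope already in logit metric (no D constant) and uses the textbook relation SE(theta) = 1/sqrt(test information). The item parameters are hypothetical, and the result is an approximation to, not a copy of, whatever Mplus reports for EAP scores.

```python
import math


def item_information(theta, a, b):
    """2PL logistic item information: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def standard_error(theta, items):
    """Standard error of measurement from summed item information."""
    info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(info)


# Hypothetical (slope, difficulty) pairs, for illustration only.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.7)]
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(standard_error(theta, items), 3))
```

The standard error is smallest where the items are most informative, typically near their difficulties, and grows toward the extremes of the scale.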

T. Cheong posted on Tuesday, October 16, 2007  1:42 pm



This is a follow-up question on two queries posted to this list in 1998 and 2004: Can Partial Credit or the Rating Scale model in line with Masters and Andrich be handled in Mplus now? Thank you.


I am not sure. This may be possible using MODEL CONSTRAINT. 


Greetings, After a full day of reading MPlus discussion boards, web notes, and a few other articles about how IRT operates in MPlus, I wanted to make sure that my understanding was clear on a few points: 1. If I am interested in running an IRT (2PL) model on a 16-item unidimensional measure with 5 response categories for each item (i.e., items are ordered categorical), while testing for DIF across gender, do I start by running a CFA using the 2-step procedure outlined on p. 399 of the MPlus User's Guide for multigroup invariance testing with categorical outcomes (using WLSMV and delta parameterization)? Correct so far? 2. Am I correct in my understanding that single-df chi-square difference testing in WLSMV (for individual parameters such as item 1's factor loading or first threshold) will help me determine statistical DIF based on gender (assuming sufficient power)? 3. Does MPlus have other statistical tests for DIF? I have read about CFI change tests in the invariance testing literature, but I was not sure if these had made it into MPlus. More in next posting...


4. To discuss the difficulty in IRT terms, the thresholds can be converted to traditional IRT difficulties (i.e., b) using the simple formula b = threshold/factor loading. Then if I wish to convert it from a probit scale to a logit scale (commonly used in PARSCALE and other IRT programs) I need to multiply the number by 1.7 or 1.76. Correct? 5. Can I calculate difficulties for each threshold using this formula? Since each item has 4 thresholds, I plan to calculate a difficulty for each threshold (as the number of responses per category allows). In other words, does the basic formula (b = threshold/factor loading), which seems to have been discussed only for dichotomous items in everything I could find, generalize to polytomous items? 6. To convert the loadings to IRT slopes (i.e., a; item discriminations), do I use the basic formula a = factor loading/sqrt(1 - factor loading^2), and follow the same multiplication procedure (by 1.7) if I need to convert the probit to a logit? Correct? Happy holidays and thanks!


1. Yes, or to test only item difficulty DIF, you could use gender as a covariate in a single-group analysis, looking for direct effects (we teach that at Hopkins in our March course). See, e.g., the Muthen 1989 Psychometrika article (on my UCLA web site). 2. Yes. 3. Just multiple-group tests or tests of direct covariate effects. 4. Right. Or use ML with logit link right away. 5. Yes, I think so. 6. Yes, or let Mplus do the conversion (given in the output for single-factor models). These matters will be discussed during day 2 of our upcoming Hopkins course in March.
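Points 5 and 6 can be sketched in a few lines of Python. The sketch assumes WLSMV with the delta parameterization and a factor with mean 0 and variance 1, and simply applies the formulas quoted in the question to every threshold of one item. The loading and thresholds are invented for illustration, and the function name is hypothetical.

```python
import math


def delta_to_irt(loading, thresholds):
    """Probit-metric slope and one difficulty per threshold.

    a   = loading / sqrt(1 - loading^2)
    b_k = threshold_k / loading
    Assumes delta parameterization, factor mean 0, factor variance 1.
    """
    a = loading / math.sqrt(1.0 - loading ** 2)
    difficulties = [t / loading for t in thresholds]
    return a, difficulties


# Hypothetical 5-category item: one loading, four ordered thresholds.
a, bs = delta_to_irt(0.6, [-1.2, -0.3, 0.45, 1.5])
print(round(a, 3), [round(b, 3) for b in bs])
```

Because the thresholds are ordered and are all divided by the same positive loading, the resulting difficulties stay ordered, as a graded response model requires.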


With polytomous items it seems somewhat arbitrary to me how many possible values each item can take on before we consider the assumption that the items are continuous to be reasonable. That is, if we let subjects use a 100-point response scale for each item, probably few if any would object to analyzing the items assuming they are continuous, but there seems to be little principled reason for a priori considering a 100-point response scale to be continuous and a 5-point response scale to be categorical. Thus, when I have 5-point response scales I usually analyze the data both ways and often find that indices of fit such as CFI look a lot better in the IRT approach. My question is whether there are any valid tests of whether the apparent increment in model fit resulting from treating the items categorically is statistically significant.


I think there have been some studies that suggest with five or more categories and no floor or ceiling effects, treating a categorical variable as continuous does not make much of a difference. If the categorical variable has floor or ceiling effects, the categorical methodology can handle that better. You could do a Monte Carlo simulation where you generate categorical data that look like your data and then analyze them as continuous variables in one analysis and categorical variables in another analysis and see if one way is superior in reproducing the population values. 


Thanks Linda, we might just do a simulation study of this (we already have one planned in which we test whether treating a categorical variable as continuous makes much difference when estimating omega_hierarchical) and I am sure that would be helpful. Apart from a simulation study, though, I am wondering if it would be legitimate to conduct a test of the difference in fit for an individual data set. I don't know enough about IRT and categorical data analysis to know the answer to this question. My guess is that there are no legitimate tests for this purpose. When I run the same model with the items treated as being either categorical or continuous, the chi-square values and dfs seem to be on different orders of magnitude (e.g., in one model in which the items are treated as being categorical the model df = 191, whereas in the exact same model with the items treated as continuous the model df = 2137).


I would need to see the two outputs you are comparing to comment, but you are comparing different estimators and models with a different number of parameters. In addition, if you are using WLSMV, the degrees of freedom do not have the same meaning as with ML, for example. The continuous model will contain linear regression coefficients while WLSMV will provide probit regressions. With five-category items that have no floor or ceiling effects, I would expect similar p-values for the chi-square test of model fit and also similar ratios of parameter estimates to standard errors (column 3 of the output). If these are not similar, then I would use the categorical methodology.


Right, and I am assuming, given the different estimators and the different meanings of the dfs, that there is not any test that could be validly used to compare the difference in fit. (In the particular example I am referring to, the p values are indeed the same and the ratios of parameter estimates are similar for the most part, though in some cases the values are as different as, say, 6.9 vs 9.8. The RMSEA estimates are highly similar, .053 vs .042, but the CFIs seem meaningfully different to me: .937 for the categorical model vs .851 for the continuous model.) P.S. Thanks for the very speedy replies!

Lily posted on Sunday, April 06, 2008  7:31 pm



Dear Dr. Muthen, Does Mplus provide callable built-in functions for the cumulative distribution function of the standard normal, erf, and integration, where we can input the arguments? I am doing my master's project and my supervisor recommended that I use Mplus. However, both of us are extremely new to Mplus and we are still learning from scratch. Greatly appreciated, Lily


No, Mplus has no callable functions. 

Lily posted on Tuesday, May 13, 2008  5:13 pm



Hi Dr. Muthen, I am fitting a model of the following form using WLSMV: F BY Y1* (P1) Y2-Y5 (P2-P5); F@1; [Y1$1] (P6); [Y2$1] (P7); [Y3$1] (P8); [Y4$1] (P9); [Y5$1] (P10); ....some constraints.. And the Mplus output gives: MODEL RESULTS Estimate S.E. Est./S.E..... F BY Y1 0.745 0.040 18.511 Y2 0.334 0.040 8.350 And also ........ Item Discriminations... F BY Y1 1.099 0.133 8.280 .. Y2 0.355 0.045 7.904 .. I am not sure how to interpret the results, in particular which parameterization the first block of the output uses (the one directly under MODEL RESULTS), i.e., what the relationship is between the estimates in Item Discriminations and those immediately under MODEL RESULTS. Your help is greatly appreciated.


The information under Model Results shows the results from the estimation of the model. The information under IRT PARAMETERIZATION is a translation of the results into the IRT parameters of discrimination and difficulty. See IRT under Special Mplus Topics on the website for details about the translation. 
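As a rough check on how the two blocks relate, the Python sketch below applies the probit conversion a = loading/sqrt(1 - loading^2) that is quoted elsewhere in this thread to the Y2 loading from the post above. It assumes WLSMV with the delta parameterization and factor variance 1; whether Mplus used exactly this formula here is an assumption, and the website's IRT documentation is the authoritative reference.

```python
import math


def loading_to_discrimination(loading):
    """Probit-metric discrimination implied by a standardized loading.

    Assumes delta parameterization with factor variance fixed at 1.
    """
    return loading / math.sqrt(1.0 - loading ** 2)


# Y2 loading as posted (rounded to three decimals in the output).
a_y2 = loading_to_discrimination(0.334)
print(round(a_y2, 3))
```

For Y2 the result is within rounding of the reported 0.355. Y1 does not reproduce as closely (about 1.117 versus the reported 1.099), and the thread does not say why.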


For polytomous item models, where IRT parameter conversions are not provided, is there some difficulty or issue to be aware of? The discussion above refers to manual computation, but I also assume if there were no issues MPlus would just do it. Do I convert the multiple thresholds to multiple difficulty parameters using the same conversion as in the dichotomous case? 


I don't think there is a particular difficulty involved. We will think about it, write it out, and include it in the tech doc. 

Anna Brown posted on Friday, September 12, 2008  8:16 am



Dear Bengt One thing is puzzling me. When obtaining test information curves for a simple 1 factor model, results depend on the link used with ML estimation. With logit link, values are just over 3 times larger than for probit link. The shapes of TIC are pretty much the same. Is this to do with the scaling constant 1.7? But when squared it gives 2.89, not 3? Which is the "correct" information? It is important for estimating standard errors. In my model factor variance is fixed to 1 and factor loadings are free. Thank you 


The scaling factor is approximately 1.7 to 1.8. It is not exact. 


Greetings, I have been doing some IRT modeling for the purpose of a large-scale assessment development, and have found some interesting things I was hoping to get clarification on. I estimated a 2PL model in MPlus using maximum likelihood with the logit link, and then a 2PL model in BILOG, and a 2P normal ogive model in BILOG. When comparing the three methods, the MPlus results are nearly identical to the normal ogive results in BILOG, but are nowhere near the logistic results in BILOG. Though the correlations are of an expected magnitude (1.0), the differences between them vary greatly. Further, the discriminations are quite different (MPlus = 1.24, BILOG-ogive = 1.27, BILOG-log = 2.02). BILOG uses marginal maximum likelihood, but why do 2PL results from MPlus match the BILOG normal ogive and not the BILOG 2PL model? Thanks!


If you are using maximum likelihood and the logit link, you should get the same results. This is not the default so you would have to specify it in the ANALYSIS command. The difference may be due to them using or not using the constant of 1.7 in their computations and Mplus not doing this. Mplus gives the results in IRT metric as well, using the 1.7 constant. If none of this helps, send the files and your license number to support@statmodel.com. 

Anna Brown posted on Wednesday, September 17, 2008  5:22 am



Thanks Linda You did not answer which information is the "correct" one for computation of standard errors. 


The results differ by rescaling using a constant so there should be no difference except a difference in scale. 


I am using MPLUS 5 to run a 2-parameter IRT model. I used the graph feature in the program to plot the overall ICC curves as well as the group (gender) differences in the curves. Can the size of the plot lines and symbols be modified? If not, is this something that is being worked on? It would certainly improve the look of the plots that are generated. Also, is there a way to import the labels and titles saved in a previous plot to a new graph? These are some things that would improve the user friendliness of the software.


The size of line cannot be changed. Symbols can be changed by using the Line Series option under Graph menu. At the present time, labels and titles cannot be saved. This is on our list of things to add. 


I am a novice to IRT but am trying to examine item endorsement invariance across gender on a test with 10 dichotomously scored items (y vs. n). I am currently using Mplus 5. First I ran a CFA to confirm the unidimensionality of my scale and then I ran a second model that included gender as a covariate (see code used below). Is significant DIF indicated simply by the significance of my covariate's effect on the item indicator? Code used for testing unidimensionality: ANALYSIS: ESTIMATOR = mlr; MODEL: angst BY a-j*; angst@1; Code used for item invariance by gender: ANALYSIS: ESTIMATOR = mlr; MODEL: angst BY a-j*; angst on gender; angst@1;


A significant direct effect of a covariate on an item represents DIF. In your example it would be, for example, j ON gender; You may find the slides that discuss measurement invariance and population heterogeneity from our Topic 1 course handout helpful. Also, the Topic 2 course handout contains information specific to categorical outcomes. The video for these topics is also available on the website.


Hi Linda, Thanks for your help above. My follow-up question relates to the issue of having a covariate that is a 5-level nominal variable (e.g., 5-level age group: 1=18-24, 2=25-34, 3=35-44, 4=45-64, 5=65+, with group 2 as my referent category): How would I plot the curves to look at DIF across age groups in comparison to the referent category? I know I would create 4 dummy variables with the referent category left out. However, when I try to plot the relationship I cannot figure out how to get the plot for the referent category. Can you make any suggestions? Here is what I have done so far: MODEL: angst BY a-j*; angst on age1 age3 age4 age5; angst@1; j ON age1 age3 age4 age5;


The referent model would be the model with all covariates equal to zero. 


Thank you. 


Hi Linda, Regarding my model above: when I ran the model without the covariates included: MODEL: angst BY a-j*; angst@1; I obtained the Item Discrimination and Item Difficulty information in the MPLUS output. However, once I include the covariates: MODEL: angst BY a-j*; angst on age1 age3 age4 age5; angst@1; j ON age1 age3 age4 age5; the Item Discrimination and Item Difficulty information is no longer included in the MPLUS output. MPLUS gives me the OR indicating significant or nonsignificant DIF for the age groups relative to the referent category (age2, which is left out). Is there a way for me to obtain the Item Discrimination and Item Difficulty information for each level of the covariate?


Mplus does not provide these. You would have to compute them yourself using information from the IRT Technical Note that is on the website. 


Hi Linda/Bengt; As Linda suggested, I downloaded the slides that discuss measurement invariance and population heterogeneity from your Topic 1 and 2 course handout. On slide 161, in discussing the interpretation of the effects it was concluded that shoplift was not invariant. Would this also be true if the direct effect of gender on shoplift was statistically significant and positive (instead of negative) but all other effects remained the same as in the slide? That is, as expected, for a given factor value, males had a higher probability of shoplifting than females. I am assuming that it would be but this is not clear from what is written in the slides. Another question related to the calculation of the item discrimination and item difficulty for different levels of my covariate (Mell Mckitty posted on Thursday, October 16, 2008  1:32 pm ), would I use the model estimates or the standardized estimates? Which is the alpha and PSI value in the MPLUS output? thanks. 


Another related question: How do you determine if DIF is uniform versus nonuniform? I am assuming that the inclusion of an interaction term would work. That is, a significant interaction term would indicate nonuniform DIF. Is this correct? thanks. 


A statistically significant direct effect implies DIF whether the coefficient is positive or negative. Only the interpretation would change. Raw coefficients should be used. Alpha refers to factor means/intercepts. Psi refers to factor variances/residual variances and covariances/residual covariances. Yes, adding an interaction term could do this. 


Hello, A colleague and I are attempting to construct and interpret a polytomous item response model. I want to make sure I am obtaining the slope and category thresholds that are commonly reported and are consistent with what you would obtain by running a graded response model in Multilog. Following example 5.5 in the manual, I begin by designating my estimator as Robust Maximum Likelihood. I continue by fixing the variance of the latent construct to 1, and make sure I am using the logit link. To obtain the thresholds, do I take the logit thresholds reported in MPLUS and divide by the standardized factor loadings? To obtain the slopes, do I take the standardized factor loadings and divide by the square root of (1 - factor loading^2)? Is this correct? Many thanks, Tom


Thanks, Linda. And I would obtain the factor means/intercepts and factor variance/residual variances from TECH4, correct? 


TECH4 contains factor means, variances, and covariances. 

Mell Mckitty posted on Wednesday, October 22, 2008  5:58 am



Once an interaction term is in the model, the plots of the ICC are no longer available in MPLUS. If a statistically significant interaction term is observed, thus indicating nonuniform DIF, how would one go about obtaining this plot in MPLUS. 


Is your interaction between 2 covariates or between a covariate and the latent variable? 

Mell Mckitty posted on Wednesday, October 22, 2008  8:16 am



My model was as follows: ANALYSIS: ESTIMATOR = mlr; MODEL: angst BY a-j*; angst@1; int | gender xwith angst; j on angst gender int; The results showed (estimate, se, 2-tail p): j on int 0.783 0.312 0.012; j on gender 0.619 0.311 0.047. So the interaction term was with the latent variable.


I don't see a way to plot this in Mplus currently. 


Before it gets lost in the shuffle, is my logic for conversion that I lay out above correct? Anyone have any thoughts? Thanks, Tom 

Mell Mckitty posted on Wednesday, October 22, 2008  10:54 am



Thanks for your quick reply. Is the use of the interaction term as indicated in my model (Mell Mckitty posted on Wednesday, October 22, 2008  8:16 am ) a sufficient way of assessing for nonuniform DIF? 


Yes. As far as I can see, it is the same as doing a multiple-group analysis (corresponding to the dummy covariate) where the loading (the discrimination) also varies over the groups.


Tom, Comparing to equations (17), (18), and (19) in our IRT tech appendix document http://www.statmodel.com/download/MplusIRT1.pdf your setup gives alpha=0 and psi=1, so it looks like the IRT "a" is the loading and the IRT "b" is the threshold/loading. I don't believe Multilog uses the D=1.7 constant unless you request the "L2" option. Check it out to see if you get agreement that way. The parameter estimates should be identical.


To report: I got it! The MPLUS loadings under MLR were the same as those estimated by Multilog. To get the thresholds, the transformation is MPLUS threshold/factor loading = Multilog threshold.


Bengt/Linda, I ran a series of MIMIC models: 1. without a covariate; 2. with a binary covariate plus an interaction term, to rule out nonuniform DIF if the interaction term was not significant; 3. with the binary covariate only. I tested the formulas from the technical notes by replicating the item discriminations (a) and item difficulties (b) printed in the MPLUS output for Model 1, and found only minor differences between the calculated b's and those from MPLUS. I am now trying to use the formulas to calculate a and b for the different levels of my covariate (Model 3) for the item with DIF. How would the formulas be modified to do this? I think that for the item of interest I would take the following information from the output of Model 3: lambda = unstandardized estimate; tau = unstandardized threshold; alpha = estimated mean for the latent variable from TECH4. I am not clear on the value of psi: 1. Is psi the residual variance for the latent variable from the model output, or the covariance estimate for the latent variable in the TECH4 output? 2. How does the estimate of the effect of the covariate fit into this formula? 3. If uniform DIF is present, I believe that a would be constant across the levels of the covariate but that b would vary, so the estimate of the effect of the covariate should affect tau. Is this correct, and if so, how?


This subject is discussed under Topic 2 in the Mplus course sequence. See slides 162-164 of the Topic 2 handout under http://www.statmodel.com/newhandouts.shtml The DIF is expressed in terms of probit regressions (normal ogive in IRT language) using the WLSMV estimator. That can be used to translate into IRT metric using our IRT tech doc at http://www.statmodel.com/download/MplusIRT1.pdf where the relationship between WLSMV probit and IRT is given. The corresponding translation from ML logit to IRT should follow from this.


Is there any way to specify an interaction term in a CFA with covariates when the WLSMV estimator is used? I tried but keep getting an error message. If not, is it sufficient to use the multigroup method (i.e., grouping is (1=male, 2=female))?


No, WLSMV does not allow such interactions. But multiple-group analysis accomplishes this, letting the loadings vary over groups in addition to the thresholds.


I have 2 questions re an IRT model w/ item predictors similar to Muthen, Kao, & Burstein (1991). Items a-j are binary with dummy-coded item predictors specific to each item (flfn a-j, values of 1, 0), very similar to the OTL item predictors in Muthen et al. Truncated input follows: F BY item_a* item_b-item_j; F@1; F ON flfn_A-flfn_J; item_a ON flfn_a; ... item_j ON flfn_j; I want to compare the 2 sets of difficulty parameters per item (e.g., the item_A parameters when flfn_A=1 and flfn_A=0). I was somewhat unclear as to how to apply the Muthen et al. (1991, p. 10) formula to compute these, or whether computation would differ when using ML estimation. Would I simply (a) use (item threshold - regression weight)/item loading for predictor=1 and (b) use item threshold/item loading for predictor=0? Secondly, I wonder whether effects of the item predictors must be modeled on both F and the items themselves. None of the predictor-factor loadings were significant, nor would I hypothesize that they would be. Rather, I would expect that item predictors only affect ability levels indirectly, through adjusting the likelihoods of correct responses. Could I constrain the loadings for F ON flfn_A-flfn_J to be zero or leave them out of the model completely, or must these paths be included for the model to be estimated/interpreted correctly?


The slope for the item regressed on the binary covariate simply shifts the threshold for that item (the slope is of opposite sign of the threshold), so all formulas we have in our IRT tech doc follow once you have computed the 2 threshold alternatives. I would let the covariates predict f as well. It would seem that considering them all together, there might be an effect on f, although each covariate has a small effect.
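To make the "2 threshold alternatives" concrete, here is a minimal sketch of the arithmetic. It assumes the factor is in the standard metric (alpha=0, psi=1), so that difficulty reduces to threshold/loading, and it uses made-up parameter values; the function name `difficulty` is purely illustrative and not anything Mplus provides.

```python
def difficulty(threshold, loading, slope=0.0, x=0):
    """Item difficulty when a direct effect of x shifts the threshold.

    The direct effect enters with the opposite sign of the threshold,
    so the effective threshold is (threshold - slope * x).
    Assumes factor mean 0 and factor variance 1.
    """
    return (threshold - slope * x) / loading

# Hypothetical estimates, for illustration only
tau, lam, beta = 0.50, 1.20, 0.30

b_x0 = difficulty(tau, lam, beta, x=0)  # predictor = 0: tau / lambda
b_x1 = difficulty(tau, lam, beta, x=1)  # predictor = 1: (tau - beta) / lambda
print(b_x0, b_x1)
```

With a positive slope, the item comes out easier (lower b) when the predictor equals 1, which is the direction one would expect from an opportunity-to-learn type of predictor.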

Mell Mckitty posted on Saturday, November 01, 2008  2:08 pm



From the model: ANALYSIS: ESTIMATOR = mlr; MODEL: angst BY a-j*; angst@1; int | gender xwith angst; j on angst gender int; would you agree with the following calculations: a = loading + interaction*x, since the interaction term allows the loading (the discrimination) to also vary over the groups, and b = (threshold + direct effects*x)/D?


I agree that the loading in Mplus metric is modified as loading + interaction*x and that the Mplus threshold is modified as threshold + direct effect*x. But the translation from Mplus parameters to ML IRT parameters a and b is given in http://www.statmodel.com/download/MplusIRT1.pdf as (18) and (19). Here alpha=0, but psi and D play into it.

Mell Mckitty posted on Saturday, November 01, 2008  5:35 pm



Sorry, I made a mistake in typing my equations above. The equations I used are: 1. a = (loading + interaction*x)/D; and 2. b = (threshold + direct effects*x)/loading. So D is taken into account in the calculation of a. Also, psi refers to the factor variance, and angst@1 indicates that psi is set at 1. Therefore, equation 18 from the IRT technical notes, taking into account the interaction term, becomes a = ((lambda + interaction*x)*sqrt(psi))/D, which is the same as 1 above. Equation 19 with alpha=0 and psi=1, taking into account the direct effects of the covariate, becomes b = ((tau + direct effect*x) - lambda*alpha)/(lambda*sqrt(psi)) = (tau + direct effect*x)/lambda, which is the same as equation 2 above. Correct?


That looks correct. 


Thanks. My next question relates to the calculation of the probability, which needs to take into account the indirect effects as well. So, how would the indirect effect of my model be incorporated into equation 17 (from the IRT Technical Notes) to calculate the probability P(Ui = 1 | f)? I know that the indirect effect (i.e., angst <- x) affects the mean of the latent trait (i.e., theta). As such, I am assuming that the indirect effect would be added to theta in equation 17. Is this reasoning correct? I have done this, and the probability and related plot of the ICC curves for my indicated item for the different levels of my covariate seem correct, but I would like some confirmation. Thanks.


I think you are referring to the linear regression equation f on x; where in your case f is angst. This equation is estimated as fpredicted = beta0 + beta1*x, where beta0 is the factor intercept fixed at zero and beta1 is the estimated slope. So, this equation is only used to compute relevant f values for the ICC plot, as a function of the x values. 


Sorry, I think I am getting a number of different questions all mixed up: 1. Correct, I wanted to know the equation to calculate angst on x and to plot the ICC curves. 2. I wanted to figure out how to identify theta from my output with MLR estimation with the interaction included. I just went over your notes again and realize that theta is the residual variance, which gets printed out when STANDARDIZED is requested in the output. However, the STANDARDIZED (STD, STDY, STDYX) options are not available for TYPE=RANDOM, and TYPE=RANDOM is necessary if an interaction term is specified in the model. Is there any other way to get the residual variance for my indicated item once an interaction term is specified in the model? 3. I am assuming that a combination of the direct, indirect and interaction effects must be taken into account in the calculation of P(Ui = 1 | f). So, how does the indirect effect fit into equation 17?


Regarding point 2., the residual variance parameter theta is not present in your maximumlikelihood estimation using the logistic form. That's why you don't see it in (1) and (2) of our IRT document that we are discussing, nor in (18) and (19). I'm sorry, I can't go further than I already have on points 1 and 3 because it turns into statistical consulting which we don't have time for. Perhaps you want to discuss with your local statistical consultation center. 


I need to calculate the first and second derivatives of the log-likelihood function with respect to the factor scores for an IRT model similar to ex5.5 so as to be able to calculate the maximum likelihood IRT scores and the information function. In conventional IRT, the first and second derivatives can be obtained by adding equations 1 and 2 respectively across all the items: 1) D1(theta) = a*(u - P(theta)) 2) D2(theta) = D1^2*Q*P where P(theta) = 1/(1+EXP(-a(theta - b))), Q(theta) = 1 - P(theta), a is the discrimination parameter, and u is the item response (1 or 0). These calculations, however, do not yield the anticipated results when using the IRT parameters estimated in MPlus (version 5.1). Could you please inform me how to obtain these derivatives based on the parameters estimated in MPlus (e.g., as in ex5.5)? Thank you.


Are you considering the "maximumlikelihood" estimator of the latent factor score "theta", or are you considering the "Expected a posteriori" estimator? With the ML estimator for the parameter estimates Mplus uses the latter estimator, which implies that a normal "prior" is used in the calculations. See IRT books. 
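The distinction between the two score estimators can be sketched numerically. The code below uses the standard textbook 2PL logistic model (first derivative a*(u - P), second derivative -a^2*P*Q, summed over items) with Newton-Raphson for the ML score, and a simple grid approximation with a N(0,1) prior for the EAP score. The item parameters and responses are made up, and this is only an illustration of the two definitions, not a description of how Mplus computes its scores internally.

```python
import math

def p2pl(theta, a, b):
    """2PL logistic item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def loglik_derivs(theta, items, responses):
    """First and second derivatives of the 2PL log-likelihood in theta."""
    d1 = d2 = 0.0
    for (a, b), u in zip(items, responses):
        p = p2pl(theta, a, b)
        d1 += a * (u - p)            # score contribution of one item
        d2 -= a * a * p * (1.0 - p)  # minus the item information
    return d1, d2

def ml_theta(items, responses, theta=0.0, iters=50):
    """Newton-Raphson ML score (needs a mixed response pattern)."""
    for _ in range(iters):
        d1, d2 = loglik_derivs(theta, items, responses)
        step = d1 / d2
        theta -= step
        if abs(step) < 1e-10:
            break
    return theta

def eap_theta(items, responses, n_points=201, lo=-6.0, hi=6.0):
    """Posterior mean of theta under a N(0,1) prior, on an even grid."""
    num = den = 0.0
    for k in range(n_points):
        t = lo + (hi - lo) * k / (n_points - 1)
        like = 1.0
        for (a, b), u in zip(items, responses):
            p = p2pl(t, a, b)
            like *= p if u == 1 else 1.0 - p
        w = like * math.exp(-0.5 * t * t)  # likelihood times prior kernel
        num += t * w
        den += w
    return num / den

# Hypothetical (a, b) pairs and a hypothetical response pattern
items = [(1.0, -1.0), (1.5, 0.0), (0.8, 0.5), (1.2, 1.0)]
responses = [1, 1, 0, 0]

theta_ml = ml_theta(items, responses)
theta_eap = eap_theta(items, responses)
print(theta_ml, theta_eap)
```

Because the prior pulls the estimate toward zero, the EAP score is typically closer to the prior mean than the ML score, which is one reason hand-computed ML scores need not match the saved factor scores.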

dkim posted on Friday, November 07, 2008  3:42 pm



I have a question about the definition of “linear” CFA vs “nonlinear” CFA. According to the MPLUS manual (Example 5.7 nonlinear CFA), it seems that the term “nonlinearity” is defined in terms of how factors are specified (e.g., interaction, quadratic). I think the 2PL (Mplus example 5.5) IRT model belongs to nonlinear one-factor CFA. Am I correct? I tried to run a 2PL IRT model with 60 binary items but it took a really long time to get results. I changed the estimator from MLR to WLSMV, which gave results relatively quickly. If I use WLSMV instead of MLR, can I still say that I am estimating an IRT model? I heard that if I am using estimators other than MLR I am estimating a 2-parameter normal ogive model (not a logistic model). VARIABLE: NAMES ARE u1-u60; CATEGORICAL ARE u1-u60; ANALYSIS: ESTIMATOR = WLSMV; TYPE = MEANSTRUCTURE; MODEL: f BY u1-u60; Can this model be considered a nonlinear CFA model?


With continuous outcomes a model is nonlinear if it has nonlinear functions of factors. With categorical outcomes, the model is always nonlinear because the conditional expectation function (the item characteristic curve) is nonlinear. 2PL IRT with 60 binary items should go very quickly using ML because you use only unidimensional integration over the single factor. WLSMV uses probit which in IRT language is the "normal ogive". This is still an IRT model.

Jason Bond posted on Monday, November 10, 2008  3:14 pm



I was wondering about Differential Criterion Functioning (DCF), a la: Saha T.D., Stinson F.S., Grant B.F. (2007). The Role of Alcohol Consumption in Future Classifications of Alcohol Use Disorders. Drug and Alcohol Dependence, 89, 82-92. which seems to be an analog of using IRT and multiple group analysis in order to test for differences between the a and b parameters (in a 2-parameter logistic version of the model) across groups. Could the DCF be computed in this way? Similarly, when they plot Test Response Curves (TRC), which are supposed to indicate DCF differences, they plotted Expected Raw Scores by the severity factor. Is it clear what these Expected Raw Scores should be? Thanks much in advance, Jason


Neither of us is familiar with this paper so we cannot comment.


Bengt and Linda, To follow up on a post above, I successfully transformed the thresholds of my POLYTOMOUS IRT MODEL reported by MPLUS under the MLR estimator into those reported by MULTILOG using (MPLUS threshold/factor loading) = Multilog threshold. However, I'm off when I try to transform the standard errors. Is there something I'm missing? It looks like it should be a simple transformation. The Z-scores reported are close, but not spot on. Thanks, Tom


If you use Model constraint to describe the transformation, you get the right Delta method SEs. Note that such SEs involve not only the SEs for the threshold and factor loading, but also their covariance.
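To show why the covariance matters, here is a hand calculation of the first-order Delta method SE for b = threshold/loading. This is a sketch under stated assumptions: the numbers are hypothetical, and in practice the variances and the covariance would have to be taken from the estimated parameter covariance matrix of the actual run.

```python
import math

def delta_se_ratio(tau, lam, var_tau, var_lam, cov_tau_lam):
    """First-order Delta method SE of b = tau / lam."""
    g_tau = 1.0 / lam          # partial derivative of b with respect to tau
    g_lam = -tau / lam ** 2    # partial derivative of b with respect to lam
    var_b = (g_tau ** 2 * var_tau
             + g_lam ** 2 * var_lam
             + 2.0 * g_tau * g_lam * cov_tau_lam)
    return math.sqrt(var_b)

# Hypothetical estimates: threshold, loading, their variances and covariance
tau, lam = 0.6, 1.5
var_tau, var_lam, cov_tau_lam = 0.01, 0.04, 0.005

se_with_cov = delta_se_ratio(tau, lam, var_tau, var_lam, cov_tau_lam)
se_without_cov = delta_se_ratio(tau, lam, var_tau, var_lam, 0.0)
print(se_with_cov, se_without_cov)
```

Dropping the covariance term changes the SE noticeably in this example, which is the kind of "close, but not spot on" discrepancy one would expect from transforming the two SEs separately.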

Tammy Tolar posted on Thursday, December 04, 2008  9:06 am



Are you working on capability for estimating 3 parameter IRT models? If so, can we get a beta version? 


Not right now, although it is on our list of things to add. 


Hi, I've read through the postings on conversion of factor analytic parameters to IRT parameters. In my case, I've run a multiple group CFA model with categorical observed variables and regressing covariates on the single latent factor using WLSMV with delta parameterization. The model uses the default constraint alpha=0, and loadings and thresholds for variables showing evidence of DIF in previous analyses are free to vary across groups. What I need clarification on is what values of alpha and psi I should use to convert the loadings and thresholds to IRT discrimination and difficulty parameters using equations 19 and 22 in the IRT technical appendix. Should I use the TECH4 estimates or should I use alpha=0 and the residual variance estimates in the output? Either way, I would get different IRT parameters for variables that were constrained to be equal across groups (i.e., invariant). Is this because I have regressed covariates on the latent factor, and would I just explain this when I present the results? Thanks so much for your help, Jen


A couple of points here. With multiple-group CFA (or IRT), the default is alpha=0 in the first group and free in the other groups (so not fixed at zero in all groups). To get the standard IRT metric you would use the TECH4 means and variances for the factor. But if you do that then your IRT curves will be different even though the thresholds and loadings are equal; that's a function of the standard IRT metric using different standardization (different alpha and psi) in the different groups. So I would just use the alpha, psi standardization in, say, the first group and not the other groups. You can then see invariance in the item curves. Note also that WLSMV uses probit, not logit.


Hi Bengt, Thanks for the clarifications. When I ask for IRT curves using the Mplus graph option after running the model, are those curves calculated based on the approach you suggested where just the alpha, psi standardization from the first group is used to calculate the IRT parameters? When I use the graph option, I do get IRT curves that are the same across groups for the invariant items, but differ for the noninvariant items. Thanks, Jen 

Darya posted on Wednesday, March 11, 2009  11:11 am



Hello, I am fitting a 3-factor CFA model with ordered categorical items using the WLSMV estimator. I let all factor loadings and item thresholds be free in the model and fixed factor variances and means to 1 and 0, respectively, for identification purposes. I want to fit item information curves (IICs) to these items given the 3-factor structure. 1. Is that OK to do (given IRT analyses typically fit IICs for unidimensional models)? 2. I know it is feasible to do in Mplus, but I was wondering if these information curves are correct. That is, do they have the same interpretation as the IICs fitted in a one-factor IRT graded response model? 3. And, would you tell me if there is documentation regarding this application of IICs (e.g., Mplus technical notes and citations/references)? Thank you very much for your help! Darya

Darya posted on Wednesday, March 11, 2009  11:59 am



And, just one more question: 4. Are these IICs plotted on a probit scale? Thanks! 


Mplus computes information curves also in the multifactorial case. The curve for items loading on a factor draws on the full multivariate information using the second-order derivative with respect to the factor in question. The remaining factors are substituted by their means. See also our IRT tech note: http://www.statmodel.com/download/MplusIRT1.pdf


Answer to Jen of March 10. I was being confusing. The translation to IRT parameter values uses alpha and psi to bring them to the N(0,1) metric used in IRT. The IRT curves that Mplus plots, however, use the Mplus factor parameterization, and because what is drawn is the probability given the factor, the factor mean (alpha) and factor variance (psi) do not enter into the curve (only in terms of the location and range of the "x axis"). So, yes, invariant items will show up as invariant even across groups with different alpha and psi.


Hello, I am working on a latent growth curve model where the items for assessing the construct (social support) changed after the second wave of data. In particular, new items were added, and binary yes/no response scales were changed to 4-point Likert scales. Thus the assumption of measurement invariance over time is surely violated. The only glimmer of hope I see is some form of IRT score equating across the different versions of the social support instrument. Probably this could not be done in Mplus, but in another program followed by importing the IRT scores for analysis in the LGC model. Any comments you might have on the reasonableness/feasibility of such a procedure would be greatly appreciated. Cam


If there are at least some items that are repeated in the same format, there would be hope for equating, which could be done in Mplus in a single modeling step (unless the data called for a 3PL). Otherwise not, I don't think. Changing from binary to 4-point scales can make a big difference, I would imagine.


Hello, I have two questions about using a bifactor model to obtain IRT estimates and factor scores (i.e., theta scores) based on the graded response model (using 3-category ordinal observed variables). Question 1: I've seen that folks report using MPLUS and other specialized full-information programs to derive IRT estimates from bifactor graded response models (Gibbons, Rush et al., 2009; Reise et al., 2007). I am finding that a bifactor model fits my data best in many cases (child externalizing dimensions). But I'm not clear, after reading several postings, whether bifactor loadings and thresholds may be used to derive IRT discrimination and difficulty parameters in the same way they are used in the one-dimensional case, because items load on multiple factors. Question 2: If I derive factor scores in MPLUS from a bifactor model, what are the resulting factor scores analogous to, in terms of the info provided? Would the factor scores provide theta estimated based on all factors (averaged out)? Or can I derive a factor score that provides info on the general factor with specific dimensionality factored out? Is it possible to pull multiple factor scores when using multiple factors? Any help would be appreciated. Thank you!


1. If you send us the paper at support@statmodel.com, we will see what they do. 2. Each person gets a factor score for each factor. 

Tony LI posted on Friday, June 12, 2009  9:34 am



Dear Linda, Just wondering do you have any insights RE: Michelle Little May 22, 2009 question? Thanks ! 


We never received the paper. 

Ying Li posted on Thursday, June 25, 2009  1:09 pm



Hi Linda, I am using M plus to do a 2-factor CFA, or 2-dimensional IRT. I am wondering if I can get factor scores for the 2 factors respectively. I tried SAVEDATA: file is...; save=fscores; But no factor scores were computed. Thanks a lot for your time. Ying


If you are using Version 5.21, you should get these. If you are using Version 5.21 and still do not get them, please send your files and license number to support@statmodel.com.

Ying Li posted on Friday, June 26, 2009  1:49 pm



Linda, Thanks. I am wondering if M plus can do Marginal Maximum Likelihood Estimation. Is there an example for it? Thanks a lot for your time. Ying 


Yes, use estimator = ML. See UG ex 5.5. Mplus can also do weighted least squares estimation; see UG ex 5.2.


Does this make sense for correcting correlations for attenuation? I am trying to correct a 4x4 correlation matrix of observed variables for attenuation and I think that I can do this quite easily with Mplus. If I model each of the four observed variables as single reflective indicators of four latent variables (1 indicator per LV), and set the loading to 1 for each, wouldn't the correlation of the latent variables be the corrected correlation of the observed variables? 


When you create a latent variable that is identical to an observed variable, the correlations among the latent variables will be the same as the correlations among the observed variables. For continuous outcomes you do this as follows: f BY y@1; y@0; For categorical outcomes you do this as follows: f BY u@1; For continuous outcomes if you do not fix the residual variance of y at zero, it is not identified. 


Of course....I didn't think that through very well. Too early in the morning I guess. Are there any ways of correcting a correlation matrix for attenuation in Mplus? Thank you, SC 


See the Topic 1 course handout under the topic Measurement Errors And Multiple Indicators Of Latent Variables. 


Linda and Bengt: Hello from an "old" friend! A colleague and I are using MPLUS to do a graded IRT model. We have four items, each with four response categories. The most direct question is whether the results for the discrimination and threshold parameter estimates need to be rescaled, and if so, how? I ask because you do rescale estimates for 1PL and 2PL models, taking into account what I assume is the probit/logit distinction in estimation. But I see no such rescaling option for the graded response model. Without rescaling the results seem literally to be "off the map." Thanks for your help. Very best, Geo B.


Good to hear from you, George. These matters are covered in our Mplus Short Course on "Topic 2"; see the Topic 2 handout at http://www.statmodel.com/newhandouts.shtml on slides 93-94. In short, Mplus operates in a factor analysis metric considering the probit/logit argument
(1) -tau_jk + lambda_j*eta
for item j, item category k, threshold tau and factor eta. In contrast, IRT considers
(2) D*a_j*(theta - b_jk)
using the Samejima graded response model, where D is chosen to make logit and probit close (1.7), a is the discrimination, theta is the "factor", and b are the difficulties. You go from the Mplus results in factor metric to the IRT metric as follows. The Mplus IRT tech doc on our web site implies that when you run Mplus with the factor standardized to zero mean and unit variance (as is typical in IRT), a comparison of (1) and (2) gives
(3) a_j = lambda_j/D,
(4) b_jk = tau_jk/lambda_j
Check if that doesn't get you results in a metric seen in IRT. You can do the translation (3) and (4) in Model Constraint using parameter labeling so a_j and b_jk get estimates and SEs.
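The translation in (3) and (4) is simple enough to check by hand. The sketch below applies it to one item with three thresholds, assuming the factor is standardized to mean 0 and variance 1; the loading and threshold values are invented for illustration, and `grm_to_irt` is just a hypothetical helper name.

```python
D = 1.7  # scaling constant that brings logit and probit close

def grm_to_irt(loading, thresholds, d=D):
    """Translate factor-metric estimates to the IRT metric.

    (3) a_j  = lambda_j / D
    (4) b_jk = tau_jk / lambda_j
    Assumes factor mean 0 and factor variance 1.
    """
    a = loading / d
    b = [tau / loading for tau in thresholds]
    return a, b

# Hypothetical loading and thresholds for one 4-category item
a, b = grm_to_irt(2.04, [-1.02, 0.51, 2.55])
print(a, b)
```

Note that the ordering of the thresholds carries over to the difficulties whenever the loading is positive, since each b_jk is the threshold divided by the same loading.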


Hello, I would like to ask a followup question regarding IRT parameter estimates for bifactor models. I am wondering if the item parameter estimates provided by Mplus are appropriate for multidimensional/bifactor models? That is, can I apply the usual IRT transformations of factor loadings and thresholds estimated with WLSMV, and does MLR produce the correct IRT parameterization on its own? Also, do the plots of information functions have the same meanings as they do for unidimensional models [SE = 1 / sqrt (info)]? Thank you for any advice, Michael 


You can use Mplus for bifactor models with categorical outcomes. Because you have more than one factor, you don't get the IRT translation but you can do it yourself by hand. The answer is yes to your information function question, although the actual details are more complex. Mplus provides information functions also for multiple factors but the information function for a given factor depends on which factor value for the other factor that you consider. Because of this, Mplus lets you plot the information function for one factor at a value of the other factor that you choose (such as the mean). You can also condition on covariates, so this plotting is quite general. 


Dear Drs. Muthen, I would like to ask a question related to the fit indices in an IRT model. A colleague and I are using Mplus to do a graded response IRT model. Our items have four response categories (Likert scale). We are interested in the absolute fit of the model. Since there are problems with using the ML chi-square values to assess differences in fit between models, we decided to use the WLSMV estimator to assess the model fit. The output we got was quite a surprise (see below). We don't know why our CFI value is lower than TLI. Should we be concerned with this outcome? Do you have an explanation for it? Thank you for your time, AnnaMari
TESTS OF MODEL FIT
Chi-Square Test of Model Fit
  Value 1835.004*
  Degrees of Freedom 213**
  P-Value 0.0000
Chi-Square Test of Model Fit for the Baseline Model
  Value 16181.845
  Degrees of Freedom 38
  P-Value 0.0000
CFI 0.900
TLI 0.982
Number of Free Parameters 143
RMSEA (Root Mean Square Error Of Approximation)
  Estimate 0.059
WRMR (Weighted Root Mean Square Residual)
  Value 1.921


These discrepancies can occur. See the Yu dissertation on the website for information about fit statistics and their behavior. This would make me suspicious about my model. Try alternative specifications. 
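For readers who want to see where such a discrepancy can come from, the usual textbook definitions of CFI and TLI can be applied to the chi-square values posted above. This is only a sketch: it assumes the standard formulas apply to this WLSMV output, which is not confirmed in the thread.

```python
def cfi(chi2, df, chi2_base, df_base):
    """Comparative fit index from model and baseline chi-square."""
    num = max(chi2 - df, 0.0)
    den = max(chi2_base - df_base, chi2 - df, 0.0)
    return 1.0 - num / den if den > 0 else 1.0

def tli(chi2, df, chi2_base, df_base):
    """Tucker-Lewis index, based on chi-square/df ratios."""
    r_base = chi2_base / df_base
    r_model = chi2 / df
    return (r_base - r_model) / (r_base - 1.0)

# Values taken from the output posted above
cfi_value = cfi(1835.004, 213, 16181.845, 38)
tli_value = tli(1835.004, 213, 16181.845, 38)
print(round(cfi_value, 3), round(tli_value, 3))
```

Under these formulas CFI works with chi-square minus df, while TLI works with chi-square divided by df. In this output the baseline model has far fewer degrees of freedom (38) than the fitted model (213), so the baseline ratio is very large and TLI comes out high even though CFI does not.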


On Wednesday, December 17, 2003  9:00 am, Linda stated that the formula a=loading/sqrt(1 - loading**2) was only valid if there were no cross-loadings. Is that because it is scaling the loading by the variance not explained by the target factor instead of the variance not explained by any factor? If so, could you include that influence with this adjustment: a=loading/sqrt(1 - loading**2 - loading2**2) to scale the parameter by the overall residual variance? If not, then what is the correct formula when there are cross-loadings? Finally, could you point me to an article that discusses this issue? Thanks.


You would need to include also the covariance between the two factors. 


Yes, of course. I forgot to mention that one of mine is a method factor. In the correlated case, it would be a=loading/sqrt(1 - loading**2 - loading2**2 - loading*factor_corr*loading2) Does anyone know of a paper that addresses the issue of cross-loadings specifically in terms of IRT and/or item/test information functions?


That looks correct. I am not aware of any such paper. 


Dear Drs. Muthen, I've found something that confuses me. I ran a categorical CFA (five binary items, one factor) and got a desirable solution. For the second item, all parameters were positive. Then I reversed the scoring of the second item and analyzed it again using the same model setup. The result confuses me. The absolute values of all parameters are the same as before, but the signs of the loading and threshold become negative. And in the output, the discrimination parameter becomes negative, but the difficulty parameter is still positive as before. I have had IRT training and have some knowledge of the conversion formulas between FA and IRT, but I'm still confused by this output, especially the negative discrimination parameter. Can you give me some suggestions or references to explain this output? Thank you very much!


Please send the two outputs and your license number to support@statmodel.com. 


Dear Drs. Muthen, I have estimated a multigroup IRT (4 groups) using the MLR estimator. The means are set to zero in one group and allowed to vary in the others. Variances are constrained to 1. I first tested differences in thresholds for each item (there are 9) individually by comparing model fit. After accounting for all differences in thresholds I next tested differences in item difficulty. My question regards significance of thresholds. 2 items have nonsignificant thresholds in one or more groups according to p-values and confidence intervals, although the discrimination parameters are significant. Is something wrong here? If so, how do I troubleshoot? If not, how does one interpret, if at all, a nonsignificant threshold? Thank you.

Jon Heron posted on Friday, May 21, 2010  12:18 am



Hi Vivia, I've just been doing this very same thing so have looked back at my output. The p-values for the thresholds indicate whether they are significantly different from zero. Assuming that you have binary data and are not modelling a guessing parameter, this just tells you that for this item and within that group, a positive/negative endorsement of the item is equally likely at the centre of your trait. You could verify that by plotting the ICC, i.e., certainly not something to worry about. In your model I don't think the tests for thresholds are particularly informative, although the parameter SEs might be useful to get a better handle on how these parameters differ across groups following your omnibus test. bw, Jon


Vivia, I am a bit unclear on what you are doing here. When you talk about means and variance I assume this is for the factor. If you have multiple groups you want to test for measurement invariance. So assuming this is what you do, the factor variance should not be fixed at one in all groups. And then you say "After accounting for all differences in thresholds I next tested differences in item difficulty." which confuses me because the item difficulties are functions of the thresholds. 


Is it possible in Mplus to obtain a summary measure of the area under the total information curve from an IRT model?


We give only a plot. We don't give a summary measure. 


What is the difference between ML and MLR for IRT models (i.e., assuming categorical data and full information estimation)? For continuous data, I understand that MLR is supposed to help adjust SEs and test statistics for nonnormality, but if normality is not assumed for categorical data to begin with, then what does MLR have to offer over ML? I apologize if this topic is already addressed elsewhere, but I could not find it. Thanks in advance for any direction you can provide! 

nanda mooij posted on Monday, September 27, 2010  2:38 am



Hi, I have conducted an IRT analysis with Mplus, and saved the factor scores. But these factor scores don't correlate with the observed scores. How is this possible? They are supposed to correlate, right? I used the WLSMV estimator. Thanks a lot, Nanda


It sounds like you may be using an old version of Mplus where there were problems with the factor scores. If not, please send your input, data, output, and license number to support@statmodel.com. 

nanda mooij posted on Tuesday, October 05, 2010  12:07 pm



Hi, I have another question. I am trying to fit a very large model, but when I do this I get the following warnings: WARNING: THE SAMPLE CORRELATION OF V97 AND V9 IS 0.999 DUE TO ZERO CELLS IN THE BIVARIATE TABLE I get this warning for many more variables, but not all. I checked the correlations, but these are not correlated this high, and not one correlation is below zero. What does this mean? What is meant by zero cells in the bivariate table? Thanks, Nanda 


Zero cells in the bivariate table of two dichotomous variables imply a correlation of plus or minus one. Both variables should not be used as they do not contribute any additional information. This can happen with small samples and skewed variables.
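A quick way to see what "zero cells in the bivariate table" means is to cross-tabulate two binary items and look for an empty cell. The sketch below does this on made-up data and also computes the ordinary Pearson (phi) correlation for comparison; the data and helper names are hypothetical. It is meant to illustrate why an ordinary correlation can look moderate while the table still has an empty cell, which is the situation the warning refers to.

```python
import math
from collections import Counter

def bivariate_table(x, y):
    """2x2 table of counts for two 0/1 variables (rows: x, columns: y)."""
    counts = Counter(zip(x, y))
    return [[counts[(i, j)] for j in (0, 1)] for i in (0, 1)]

def has_zero_cell(table):
    """True if any cell of the table is empty."""
    return any(c == 0 for row in table for c in row)

def phi(table):
    """Pearson correlation of two binary variables from their 2x2 table."""
    (n00, n01), (n10, n11) = table
    r0, r1 = n00 + n01, n10 + n11
    c0, c1 = n00 + n10, n01 + n11
    return (n00 * n11 - n01 * n10) / math.sqrt(r0 * r1 * c0 * c1)

# Hypothetical data: nobody scores 0 on x and 1 on y
x = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

table = bivariate_table(x, y)
print(table, has_zero_cell(table), phi(table))
```

Here the Pearson correlation is only moderate, yet the empty cell is what drives the latent (tetrachoric) correlation used with categorical outcomes to the boundary.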


Dear Drs. Muthen, We analyzed the same data set using MPlus and CONQUEST. We applied a single factor model with 72 categorical items (68 dichotomous, 4 partial credit items with 3 categories; sample size 13,004). In MPlus we specified the model f1 by I1-I72@1 with: ALGORITHM IS EM; ESTIMATOR IS MLR; INTEGRATION = 10; We expected the same item difficulty parameters as in a CONQUEST analysis with the standard model specification item+item*step (constraints=cases, score (0 1 2) (0 0.5 1)). For all dichotomously scored items this was in fact true. But for all 4 partial credit items the threshold parameter estimates differed remarkably. Using MPlus, we got e.g. the threshold parameter estimates I69$1=0.697 and I69$2=2.276. Using CONQUEST we got the item difficulty 1.84734 and 0.11344 for the step parameter. This corresponds to the threshold parameters I69$1=1.96078 and I69$2=1.7339; that is, the threshold parameters are unordered! The results for the other partial credit items were similar. Both programs yield the same response category proportions. Increasing the number of nodes doesn't change the results. Inspecting the residual statistics reveals that the differences between the observed and the model-implied response category proportions seem to be small. The item fit statistics do not indicate any kind of misfit, either. We'd appreciate any help, Sigrid


Mplus estimates Samejima's graded response model, not a partial credit model. 
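The difference between the two models can be sketched in a few lines of Python. This is an illustration only, using the textbook forms of the models (cumulative logits for the graded response model, adjacent-category logits for the partial credit model); the function names and the parameter values are hypothetical and are not taken from either program's output.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def grm_probs(theta, a, thresholds):
    """Graded response model: P(Y >= k) = logistic(a*theta - tau_k)."""
    cum = [1.0] + [logistic(a * theta - t) for t in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def pcm_probs(theta, steps):
    """Partial credit model: log(P_k / P_{k-1}) = theta - delta_k."""
    logits = [0.0]
    for d in steps:
        logits.append(logits[-1] + (theta - d))
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(v - m) for v in logits]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical unordered step parameters: fine for the PCM ...
print([round(p, 3) for p in pcm_probs(0.0, [1.0, -0.5])])
# ... but the same values used as GRM thresholds give a negative
# "probability" for the middle category, so the GRM needs ordered thresholds.
print([round(p, 3) for p in grm_probs(0.0, 1.0, [1.0, -0.5])])
```

Because the two models define their category parameters differently, step parameters from one program and thresholds from the other are not expected to agree, even when both reproduce the same category proportions.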

Melanie Wall posted on Wednesday, November 10, 2010  12:18 pm



I am fitting a single-factor model with both continuous and dichotomous indicators. If I have all dichotomous outcomes, Mplus will output "IRT parameter" estimates and standard errors (the estimates are obtained through the conversion formula, and I assume the standard errors are obtained by the delta method as in MacIntosh and Hashim). My question is: Is there a way to get Mplus to output these "IRT parameters" for the dichotomous indicators when I have a mix of dichotomous and continuous indicators? I know I could use ML and get the IRT parameters directly, but I am particularly interested in using WLS. 


I don't think so currently. 


Hello, Example 5.5 in the manual gives the program for a 2PL Graded Response Model. I am specifying CATEGORICAL (ordered polytomous) indicators. If I am understanding correctly, using the WLSMV (instead of MLR) estimator gives me the same model, but it is a 2-parameter normal ogive and not 2PL. 1. Is this correct? I have read elsewhere on this board that the loadings that are output for the WLSMV estimator (normal ogive(?) above) (A) do not correspond directly to IRT "a" parameters but (B) require a transformation to correspond to the "a" parameters, and that (C) the transformation is only possible if items only load onto a single factor. 2. Are (A), (B), and (C) correct? If so, and especially if (C) is also correct, has anything been worked out yet anywhere to your knowledge that allows for the transformation if items load onto >1 factor (I am doing a bifactor model with WLSMV and want factor scores and "a" parameters for the general factor)? 3. Can I get the standard errors for the factor scores when using the WLSMV estimator? They are automatically given in the FSCORES file with the (more computationally intensive) 2PL model, but do not appear when using WLSMV. Is there a way to get them? Thanks so much. James 


1. Yes. 2a. Yes. 2b. Yes. 2c. Mplus gives these only when the model has one factor. 3. No. 


For a TWO-PARAMETER LOGISTIC ITEM RESPONSE THEORY (IRT) MODEL (as in example 5.5), could you please confirm whether the standard error of the factor scores (in the SAVEDATA file) is the inverse of the square root of the information function as defined in formula 14 of the MplusIRT2 document (http://www.statmodel.com/download/MplusIRT2.pdf)? Thank you very much. 


Yes, it is. 
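For readers who want to see the relation numerically, a minimal Python sketch follows. It uses the standard 2PL item information, a^2 * P * (1 - P), summed over items; the exact form of formula 14 in the Mplus document is not reproduced here, so whether a prior term enters the total information is left as an optional, assumed argument (prior_precision). Function names and item values are hypothetical.

```python
import math

def item_information(theta, a, b, D=1.0):
    """Standard 2PL item information at theta; D is an optional scaling constant."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * p * (1.0 - p)

def theta_se(theta, items, D=1.0, prior_precision=0.0):
    """SE = 1 / sqrt(total information).

    prior_precision is an assumption: set it to 1/psi if the information
    function you are matching includes the factor's prior, else leave at 0.
    """
    info = prior_precision + sum(item_information(theta, a, b, D) for a, b in items)
    return 1.0 / math.sqrt(info)

# Hypothetical (a, b) pairs for three items.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.7)]
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(theta_se(theta, items), 3))
```

The standard error is smallest where the items are most informative, which is typically near the item difficulties.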


Dear Drs. Muthen, We would like to examine temporal measurement (non)invariance in an IRT model with “dense” repeated measurement: participants (n = 100) completed the same 6 items (5 ordered categorical response options per item) every day over 28 days. Instead of estimating 28 factors simultaneously, we are thinking about estimating a single factor for all 2800 person-days, using the TYPE=COMPLEX option to adjust for the nonindependence of observations. In this case, would it make sense to introduce “temporal” covariates (e.g., day of assessment as a continuous covariate, week 1 versus weeks 2-4, weekend days versus weekdays) in MIMIC models to examine DIF based on time, or is there something about this model that would not be accounted for by the COMPLEX option? Thanks very much for your support. 


It sounds like you can view this as 2-level data, where level 1 is time (28) and level 2 is subject (100), and where you have 6 outcomes. So you could model it via Type=Twolevel or Complex. The Twolevel structure is similar to doing growth modeling as 2-level, where time-varying covariates can be handled as in UG ex9.16. 

D C posted on Sunday, May 22, 2011  11:00 pm



Hello, Can Mplus produce item discrimination and difficulty parameters for a 2PL IRT model with 6-category items? It seems Mplus produces these parameters only when the items are dichotomous. Thank you, D 


No, we only provide these when the items are binary. We will be preparing a FAQ about this in a couple of weeks. 

D C posted on Tuesday, May 24, 2011  5:37 pm



Hello Linda, Thank you for your response. Will this FAQ include information on how I could compute the discrimination and difficulty parameters for such a model myself using the given loadings and thresholds? Or, if possible, would you tell me how that could be done in your next response in this thread? Thank you, D 


The FAQ will cover this. 


Hi, I am interested in generating 2PL IRT model for dichotomous indicators where the latent trait is not normally distributed. I see how to generate nonnormal indicators, but am not sure about the generation of factors with a particular skewness. I wonder if you could point me in the right direction. Thanks very much. Holmes Finch 


I would try to do that via mixtures. For instance, a 2-class mixture of two normal factors can represent a lognormal factor distribution; see pages 14-16 of the McLachlan-Peel (2000) book on mixtures. There are probably exact results to be found, but otherwise you can take a trial-and-error approach to choosing means, variances, and class sizes to get the nonnormality that you want. 
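The trial-and-error approach can be tried outside Mplus first. The Python sketch below is an illustration under assumed values, not a recipe from the book: it draws a factor from a 2-class mixture of normals (class proportion, means, and standard deviations are hypothetical choices that give an overall mean of zero), checks the skewness, and then generates 2PL responses for a few made-up items.

```python
import math
import random

def draw_factor(rng, p1=0.8, mean1=-0.3, sd1=0.7, mean2=1.2, sd2=1.3):
    """Draw one factor value from a 2-class normal mixture (hypothetical settings)."""
    if rng.random() < p1:
        return rng.gauss(mean1, sd1)
    return rng.gauss(mean2, sd2)

def draw_2pl(rng, theta, a, b):
    """Draw a 0/1 response from a 2PL item with discrimination a and difficulty b."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return 1 if rng.random() < p else 0

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5

rng = random.Random(2011)
thetas = [draw_factor(rng) for _ in range(20000)]
items = [(1.0, -0.5), (1.5, 0.0), (0.8, 0.7)]  # hypothetical (a, b) pairs
data = [[draw_2pl(rng, t, a, b) for a, b in items] for t in thetas]
print("sample skewness of the factor:", round(skewness(thetas), 2))
```

Changing the class sizes, means, and variances and re-checking the skewness is one way to home in on the degree of nonnormality wanted before setting up the corresponding mixture in a Monte Carlo run.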

D C posted on Thursday, June 02, 2011  8:51 pm



Hello Professors, Suppose I am interested in looking at group DIF, or group differences in item endorsement, using a 2PL IRT model. What is the difference in Mplus between these two scenarios: (1) fitting a stratified IRT model in each group and then plotting ICCs from common items by group in the same figure VS. (2) fitting an IRT model with a covariate (see *example below) and then plotting the ICCs by group (using the "name a set of values" command in plots)? I saw a much more substantial difference in the stratified analysis, while the curves in (2) were nearly identical. *Example: perception BY item1* item2 item3; perception ON sex; item1-item3 ON sex; perception@1; [perception@0]; Isn't scenario (1) more accurately showing how differently the items function in each group, since all model parameters are estimated separately (loadings, thresholds, etc.)? Thank you! 


Thanks very much. I will give that a try. Holmes 


Answer to D C: In your *Example you show a model that is not identified because you can't have all direct effects of a covariate on the items and also a covariate effect on the factor. Perhaps you mean that this is just a part of the model where there are other items that aren't directly affected by the covariate. 

Emily Lai posted on Friday, September 23, 2011  12:43 pm



My colleague and I are trying to compare results from a traditional CFA to a multidimensional IRT analysis using the same data. We have polytomously scored items and we are using MLR, specifying that our data are categorical. We are also running the multidimensional IRT analysis using Conquest software in order to confirm our IRT results. When we do so, we get item fit indices in the output: the squared standardized residuals and a variance-weighted version for each item. That is, Conquest computes the average of the squared standardized model-based residuals for each item. We are wondering if MPlus would give us something like an item fit index that is comparable to this. We first thought of modification indices, but it is my understanding that you cannot get modification indices when you have categorical indicators. Is this true? Can you think of another index that would constitute a comparable indicator of individual item fit? Could we use standardized residuals from MPlus output in a similar way? Thanks. 


The CFA model for categorical indicators without the guessing parameter is the same model as IRT without the guessing parameter. You should not find differences when both are estimated using the same estimator. We do not provide the fit index you mention. I don't think you will be helped by modification indices. Try TECH10 where univariate and bivariate fit is shown. 

sailor cai posted on Sunday, September 25, 2011  5:36 pm



Dear Drs. Muthen, I would like to ask two questions. Background information: I am looking at the predictive effects of three IVs (2 dichotomous + 1 polytomous) on one DV (dichotomous). My planned analytical steps: 1) using IRT models to transform values in the whole response matrix (3 IVs + 1 DV) into probability values; 2) conducting EFA, CFA, and SEM based on the product-moment approach. My questions are: 1. Does Mplus directly provide transformed values for my Step 1 analysis? How do I obtain them? Or are there any other recommendations? 2. If I am to use plausible values before running EFA, CFA, and SEM, does this mean I should obtain plausible values before IRT calibration or after my Step 1 analysis? Further, if plausible values are to be used, at least how many sets of data will be acceptable for Mplus practitioners? 5 seems to be too many for my case. It would also be much appreciated if you could recommend works using IRT and SEM in combination. Thanks in advance! 


I'm not clear on how IRT is involved here. It sounds like all of your variables are observed. 

sailor cai posted on Tuesday, September 27, 2011  6:18 am



Yes, I have 117 observed indicators in my whole dataset. But putting them all together directly into a full model seems to make things too complicated. It seems using composite variables would make things easier. I am wondering whether it is appropriate to weight each observed variable by its IRT discrimination parameter before grouping them. So my question is reduced to one: Does Mplus provide a transformed response matrix based on IRT discrimination parameters? Or the matrix with each item response vector multiplied by its corresponding IRT discrimination value? 


It sounds like you have about 40 indicators for each of three factors. I would use IRT scores in this case. See the FSCORES option of the SAVEDATA command. 

sailor cai posted on Monday, October 03, 2011  1:48 am



Yes, I have about 20 to 35 indicators for each of the four latent variables (second order). I will try FSCORES. Many thanks! 


Apologies if this is an overly simple question. I am interested in fitting a 2level IRT model where item responses (L1) are nested within individuals (L2). I’m stuck on setting up the data in a way that will be correctly analyzed. So, my current thinking is that the data should be assembled such that item responses are in 1 column vector stacked for all examinees. So, for a 3 item test taken by 10 examinees, my data would be 30 rows long. In de Boeck & Wilson (2007), the authors augment the data matrix with an identity matrix to facilitate model specification. Accordingly, the three item (i=1,2,3) 10 person (j = 1,..10) case would be specified something like F1 by ITEM*i1 ITEM*i2 ITEM*i3 Where ITEM corresponds to the column vector of item responses and i(i) =1 when row_ji = i and 0 otherwise. Is this also necessary in Mplus? Or is there another way to organize the data and specify the model? Specifically, I'm wondering if specifying the items as 'within' and using examinee id as the cluster variable in the usual 'wide' format is sufficient? Thanks in advance. 


You should use wide format data where each of your three variables is one column of the data set. This becomes a single-level analysis because clustering is taken care of by multivariate analysis. See Example 5.5. 
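If the responses are already stacked in long format (one row per person-item pair), they can be reshaped to wide format before being read into Mplus. A minimal Python sketch, with a hypothetical function name and toy data for two examinees and three items:

```python
def long_to_wide(rows, n_items):
    """rows: iterable of (person_id, item_number, response), items numbered from 1.

    Returns one record per person: [person_id, resp_item1, ..., resp_itemN].
    Items a person did not answer stay as None.
    """
    wide = {}
    for person, item, resp in rows:
        wide.setdefault(person, [None] * n_items)[item - 1] = resp
    return [[person] + resps for person, resps in sorted(wide.items())]

# Toy long-format data (hypothetical).
long_rows = [(1, 1, 0), (1, 2, 1), (1, 3, 1),
             (2, 1, 1), (2, 2, 0), (2, 3, 0)]
for record in long_to_wide(long_rows, 3):
    print(*record)
```

Each printed line is one examinee with the item responses as separate columns, which is the layout the answer above describes. Missing responses would still need to be recoded to whatever missing-value flag the analysis uses.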

Yan Zhou posted on Tuesday, November 01, 2011  1:22 pm



I want to test the difference between a constrained Rasch model and a free Rasch model, both with multiple groups of examinees, so I am wondering which estimator I should use: WLSM, WLSMV, or some other estimator? By the way, I tried to use the MLR estimator to estimate IRT parameters in multiple groups of examinees, and it didn't work. Could you please tell me the reason? 


What do you mean by free and constrained Rasch models? 

Yan Zhou posted on Tuesday, November 01, 2011  5:59 pm



I am sorry to confuse you. I need to use the Rasch model to estimate items' difficulties and examinees' abilities. For the difficulty estimation, I want to separate my examinees into four groups and then estimate the items' difficulties in each group. I want to use two methods to estimate each group's item difficulties. One is to constrain all groups' item difficulties to be equal, such as b11=b21=b31=b41, b12=b22=b32=b42 (the first subscript of b stands for group, the second for item); the other is to free the item difficulties in one of the four groups while constraining the other three groups' item difficulties to be equal. What I want to know is which estimator I should use when estimating the item difficulties with the two methods above: MLR, WLSM, WLSMV, or some other estimator? My other question is whether we should use the same estimator when we use Mplus to run the Rasch model many times from different perspectives, when all these analyses are for the same paper. Thank you so much for your help! 


The Rasch model usually is estimated using maximum likelihood estimation. That would be MLR. You could estimate the Rasch model using WLSV or WLSMV in which case you would use the DIFFTEST option to compare models. Unless you are writing a paper that compares different estimators, I would stick with one estimator. 

Mark LaVenia posted on Thursday, November 03, 2011  2:03 pm



I am using Mplus to generate thetas for subjects who took a test with known item parameters (generated by BILOG-MG and published by the test developers). When I run a two-parameter logistic IRT model and fix the item loadings (from BILOG output) and item thresholds (computed as BILOG_Threshold*Loading), the model runs great and generates thetas that seem plausible. As long as the variance is fixed @1, the reported Item Difficulty parameters in the Mplus output are exactly the same as the BILOG_Threshold, but the Item Discrimination parameters in the Mplus output are about half the size of the BILOG_Slope parameters. I was hoping to use a match between the parameters (Mplus_Discrimination & BILOG_Slope; Mplus_Difficulty & BILOG_Threshold) as a check that I did it right. I'm thinking maybe I didn't. Any input would be greatly appreciated. Thank you. I am pasting below my truncated syntax: VARIABLE: NAMES ARE id i1_Crt-i17d_Crt; CATEGORICAL ARE i1_Crt-i17d_Crt; ANALYSIS: ESTIMATOR = MLR; MODEL: MKT_Pre BY i1_Crt@0.590 . . . i17d_Crt@0.412; [i1_Crt$1@0.835]; . . . [i17d_Crt$1@0.255]; [MKT_Pre@0]; MKT_Pre@1; 


Slide 94 of the Topic 2 handout may be helpful. For the discrimination there is also the D factor which some of the IRT programs set at 1.7. So if BILOG doesn't use D in its 2PL, you probably need to multiply the Mplus "a" by 1.7. See also the IRT technical appendix at http://www.statmodel.com/download/MplusIRT2.pdf 
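The arithmetic can be checked by hand. The Python sketch below assumes the logit-link conversion quoted elsewhere in this thread for a factor with mean 0 and variance 1 (a = loading / 1.7, b = threshold / loading) and applies it to the first item's fixed values from the syntax above (loading 0.590, threshold 0.835). The function name is made up for illustration.

```python
D = 1.7  # scaling constant discussed in the reply above

def logit_to_irt(loading, threshold, D=D):
    """Convert a logit-link loading/threshold to IRT a and b.

    Assumes a single factor with mean 0 and variance 1.
    """
    a = loading / D
    b = threshold / loading
    return a, b

a, b = logit_to_irt(0.590, 0.835)
print("a =", round(a, 3), " b =", round(b, 3), " a*D =", round(a * D, 3))
```

Under these assumptions the difficulty comes out at about 1.415 and a*1.7 returns the loading itself, 0.590, which agrees, up to rounding of the inputs, with the values reported for the fixed-variance run later in the thread. A remaining gap to the BILOG slope therefore has to come from somewhere other than this conversion.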


Bengt, thank you for the prompt reply, your thoughts, and the reference to further resources. Discrimination*1.7 brings me closer to the BILOG Slope, but still not on the money. To complicate things a bit more, when I allow the variance to be estimated freely the “a”*1.7 is closer to the BILOG slope than when the variance is fixed @1, which is a little frustrating because when I fix the variance @1, the Difficulty is spot on, but not so when free (Note: Fixing or freely estimating the factor mean has no effect either way). For example, the parameters for the first three items under each condition are as follows: BILOG Slope: 0.732, 1.105, 0.786; BILOG Threshold: 1.414, 0.015, 0.322. !Factor variance = free: Mplus Discrimination*1.7: 0.716, 0.899, 0.750; Mplus Difficulty: 1.166, 0.012, 0.266. !Factor variance = 1: Mplus Discrimination*1.7: 0.590, 0.741, 0.619; Mplus Difficulty: 1.414, 0.015, 0.322. It seems to me that the spot-on “b” parameters with the fixed variance are better than the appreciably better match on the “a” and worsened match on the “b” found with the free variance. I would appreciate your thoughts. Maybe this is just an inappropriate use of Mplus (i.e., for scoring tests/theta generation rather than model fitting & parameter estimation); do you recommend a different approach? Thank you again for your time and insight. 


Are you using BILOG 2PL or normal ogive? I am not sure I follow your different steps of analysis, where in the run you show you have fixed item parameters. I think you'd better send the relevant BILOG and the Mplus outputs to Support, explaining your steps. 

Mark LaVenia posted on Saturday, November 05, 2011  2:02 pm



Will do. Thank you so much for being willing to look at it. 

Salma Ayis posted on Thursday, December 22, 2011  5:59 pm



I used MPlus version 3 in 2009. The program below was used to produce item characteristic curves. I ran this program today, but when I tried to view the graph, it wasn't possible. I used Graph, View graph, which as far as I remember allows the viewing, but it didn't seem to do so. Advice is much appreciated! Here is the program code: TITLE: this is an example of a two-parameter logistic item response theory (IRT) model DATA: FILE IS ADL21_May24_05f.txt; VARIABLE: NAMES ARE u1-u21 g; USEV ARE u1-u21; AUXILIARY= u1-u21 g; MISSING ARE ALL(99); CATEGORICAL ARE u1-u21; ANALYSIS: ESTIMATOR = MLR; MODEL: f BY u1-u21; OUTPUT: TECH1 TECH8; PLOT: TYPE IS PLOT2; SAVEDATA: FILE IS ADL_21_Feb07_IRT.SAV; FORMAT IS 21F5.0, F15; SAVE=FSCORES; 


Mplus Version 3 is not supported. If you have a current upgrade and support contract you should download Version 6.12 and send problems to support@statmodel.com. 

nanda mooij posted on Saturday, March 17, 2012  7:35 am



Dear Dr. Muthen, I have a few questions about my model. I have fitted an IRT model by fixing the factor variances to 1 and freeing the factor loadings like this: f1 BY u1* u2-u16; f2 BY u17* u18-u32; f3 BY u33* u34-u48; f4 BY u49* u50-u64; f5 BY u65* u66-u80; f6 BY u81* u82-u96; f7 BY u97* u98-u112; f8 BY u113* u114-u128; f9 BY u129* u130-u144; h1 BY f1-f3; h2 BY f4-f6; h3 BY f7-f9; p BY h1-h3; f1-f9@1; h1-h3@1; p@1; My question is, is this right, or must I also free the loadings of f1, f4, f7 and h1? Besides this, I want to look at the fit of this model. In order to do this, I thought that it wouldn't be necessary to fix the variances and free the loadings, because I am only interested in the fit indices. Is this right? I ask because the fit indices of the IRT model and the normal CFA model are very different, especially the CFI, TLI and WRMR. So which model should I choose to determine the fit of the model? Thanks in advance 


You need to free the first factor loading of h1, h2, h3, and p if you fix the factor variances to one. Model fit will be the same whether you fix the first factor loadings to one or free all factor loadings and fix the factor variances to one. If you don't get the same fit with these two parameterizations, you are doing something wrong. 

Yen posted on Tuesday, March 20, 2012  3:15 pm



I apologize that I don't have much knowledge about the irt topic... Would it be possible to run one irt analysis with one ordinal response item and all binary items? Thank you. 


Yes, no problem. Just put them all on the Categorical= list. Mplus automatically figures out which are binary and which are ordered polytomous (ordinal). 


Hi, Does Mplus do IRT simulation? In the simulation chapter of the user's guide I didn't see anything on this subject. Could you please refer me to an article that teaches IRT simulation step by step with Mplus, the same way your 2002 article in the Structural Equation Modeling journal taught SEM simulation? 


There is no article that shows this. Example 5.5 in the user's guide is an IRT model. See the Monte Carlo counterpart of this. It is mcex5.5.inp. 


Hi, I am trying to convert an analysis with several latent factors to an IRT framework and I am trying to understand the conversion to IRT parameters (difficulty and discrimination). I have been reading and rereading the web notes and discussion posts. You mention in Web Note 4 that an increased residual variance (theta) gives rise to a flatter conditional probability curve. But standardising y* under the LRV formulation with theta = 1 - factor loading^2*psi doesn't seem to take the residual variance into account at all? There is a footnote that suggests the R^2 output can be used to estimate theta in the Delta parameterization. Does this only apply in multiple-group comparisons? 


The extra residual variance parameters, which go beyond what is available in conventional IRT, are only relevant in multiple-group or multiple-timepoint settings. You need to fix them in a referent group, whereas they are free in the other groups. 

Miaoyun Li posted on Thursday, September 06, 2012  6:43 am



Hello, I'm a new user of Mplus, and I have two questions about the use of IRT. The first is that I am confused about the transformation to the a and b parameters. As far as I know, there are three formulas: (1) a = loading/sqrt(1 - loading**2); b = threshold/loading. (2) a = loading; b = threshold/loading. (3) a = loading/1.7; b = threshold/loading. When I use the ML estimation method, which formula is right? When the WLSMV method is used, which one is appropriate? Besides, are the 'loading' and 'threshold' in the formulas the "standardized" or the "unstandardized" estimates? The second question is about the interpretation of the TECH10 results, especially the "Univariate Pearson Chi-Square" and "Univariate Log-Likelihood Chi-Square". How can I tell that an item misfits? I would very much appreciate suggestions for related references. 


See our technical appendix for IRT at http://www.statmodel.com/download/MplusIRT2.pdf 

Dexin Shi posted on Tuesday, March 26, 2013  8:46 pm



Hello, I have a question about the transformation between IRT and CCFA. Based on the technical appendix for IRT, using CCFA with the logit link, if we fix the factor variance to be 1 and the factor mean to be 0, we have discrimination a = loading lambda/1.7 and difficulty b = threshold tau/loading lambda. However, in other references (e.g. Wirth & Edwards, 2007; Kim & Yoon, 2011), the discrimination is a = 1.7*(lambda/sqrt(1 - lambda^2)) and the difficulty is b = threshold tau/loading lambda. Can you please explain the source of disagreement for the parameter a? Thank you very much for your help. 


The difference stems from using logit or probit. With WLSMV probit you use the 1 - lambda^2 version because that is the theta (residual) variance. See our Topic 2 handout on our web site, slides 93-94. 

Dexin Shi posted on Wednesday, March 27, 2013  8:59 pm



Thanks, Bengt; that makes a lot of sense. But based on slide 94, if we fix the factor variance and mean, using the logit link a = lambda/1.7; using the probit link, a = lambda/sqrt(1 - lambda^2). It seems that with the probit link the scaling constant D should not be included in the equation; however, in others' notation a = 1.7*(lambda/sqrt(1 - lambda^2)). Is this an error, or am I missing something? Thanks again for your help 


I agree that 1.7 does not belong with the expression using /sqrt(1 - lambda^2) since that is probit (WLSMV with Delta parameterization). The constant 1.7 was introduced to make logit close to probit IRT estimates. Furthermore, these days it seems that 1.7 has been dropped from the logit expression. 
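Both points can be checked numerically. The Python sketch below is an illustration only: it applies the probit (WLSMV, Delta parameterization) conversion to the WLSMV estimates quoted at the start of this thread (loading .839, threshold .289), and then measures how closely a logistic curve with the 1.7 constant tracks the normal ogive. Function names are made up.

```python
import math

def probit_delta_to_irt(loading, threshold):
    """WLSMV probit, Delta parameterization, one factor with variance 1."""
    a = loading / math.sqrt(1.0 - loading ** 2)  # 1 - lambda^2 is the residual variance
    b = threshold / loading
    return a, b

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

a, b = probit_delta_to_irt(0.839, 0.289)
print("a =", round(a, 3), " b =", round(b, 3))

# Largest gap between the normal ogive and the 1.7-scaled logistic on a grid.
grid = [i / 100.0 for i in range(-400, 401)]
gap = max(abs(normal_cdf(z) - logistic(1.7 * z)) for z in grid)
print("max |Phi(z) - logistic(1.7 z)| below 0.01:", gap < 0.01)
```

The conversion gives a slope of about 1.542 and a threshold of about 0.344, the values the original poster computed by hand, and the two curves stay within 0.01 of each other, which is the sense in which 1.7 makes logit and probit results close.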


I tried to find this information on the forum but could not. Instead of letting the "a" parameters float freely or fixing them to 1 and thus obtaining a Rasch model, I would like to set specific values for the mean and standard deviation of "a" (e.g. mean 0.8 and std. dev. 0.3). Does anyone have an idea for a script to do this? My attempts have all failed. I think I am dumb. Another question is whether it is possible to choose how to center the scale (either on b or on theta scores, to have mean = 0 and std. dev. = 1). Thanks, Carlos 


You can use Bayesian analysis and use the specific values as priors. You can fix the factor mean to zero and the variance to one as follows. The standard deviation is not a model parameter. [f@0]; f@1; 


Good idea! Just a minor comment: for the IRT bunch the a parameters are not as important in the diagram as the b parameters. There should be an option to show b instead of a in the diagram. Thanks, Carlos 


Hello, I was wondering if you all had any recommendations about the best way to evaluate Rasch and 2PL IRT models when using the Bayes estimator. I see that the DIC and pD are not given. Thanks for any input! Yaacov 


The overall posterior predictive p-value (PPP) is a good way to evaluate these models. Several more detailed tests are reported in the tech10 output option. output:tech10; 


Any idea if and/or when Mplus will have other logistic models beyond the 2PLM (e.g. 3, 4 and 5PLM)? Best regards, Carlos 


We plan to have the 3PL model perhaps in a year. 


Ok, many thanks, Linda! Carlos 


Linda and Bengt, I jump in with a suggestion: if you're thinking to have a 3PL, you should explore the possibility to have a 4PL at the same time, as some studies suggested the higher asymptote may also be useful for modeling some kinds of items. Apart from some R scripts, almost all common IRT software packages do not have the 4PL. Reise, S.P., & Waller, N.G. (2003). How many IRT parameters does it take to model psychopathology items? Psychological Methods, 8, 164–184. 


Thank you for the suggestion. We will take this into account in our development. 


Julien, that's a great idea! Apparently the modeling of the fourth parameter would not require too much additional engineering effort compared to the 3PL and would indeed help. Here I have students from all academic years taking a test at end-of-course level (designed for new graduates), so the upper asymptote is indeed crucial to obtain better fit. I tried to use the R packages catR and irtProb, and they can be useful for several tasks, but not for item parameter calibration. Once you have the parameters, you can estimate standard errors, test information curves, etc. What I have tried to explore for item calibration so far, with limited success, was a script in WinBUGS (Loken and Rulison, 2010, see below). If you guys have any alternative idea whatsoever to estimate the 4th parameter with less hassle, I would be most grateful. Best regards, Carlos  model { for (i in 1:nstud) { for (j in 1:nqs) { p[i,j] <- c[j] + (d[j] - c[j]) * (exp(1.7*a[j]*(theta[i] - b[j])) / (1 + exp(1.7*a[j]*(theta[i] - b[j])))) r[i,j] ~ dbern(p[i,j]) } theta[i] ~ dnorm(0,1) } for (j in 1:nqs) { a[j] ~ dlnorm(0,8); b[j] ~ dnorm(0,.25); c[j] ~ dbeta(5,17); d[j] ~ dbeta(17,5); } } 
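For reference, the response function in the WinBUGS script above can be written as a small Python function. This sketch only evaluates the 4PL curve for given parameters (it does not calibrate anything), and the parameter values in the example are hypothetical.

```python
import math

def p_4pl(theta, a, b, c, d, D=1.7):
    """4PL probability of a correct response.

    c is the lower asymptote (guessing) and d the upper asymptote;
    with c = 0 and d = 1 this reduces to the 2PL.
    """
    z = D * a * (theta - b)
    return c + (d - c) / (1.0 + math.exp(-z))

# Hypothetical item: even very able examinees top out at 0.9.
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(p_4pl(theta, a=1.0, b=0.0, c=0.2, d=0.9), 3))
```

At theta equal to b the probability sits halfway between the two asymptotes, which is a quick sanity check on any fitted set of parameters.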


I am conducting a differential item functioning analysis using the Rasch model in Mplus. I simulated 40 items and 1000 samples for this data analysis. My syntax is as follows: DATA: FILE=DIF99.txt; VARIABLE: NAMES ARE group y1-y40; categorical are y1-y40; grouping is group (0 = male 1=female); MODEL: f1 by y1-y20@1; f2 by y21-y40@1; f1@1; f2@1; f1 with f2; Model male: [y39$1] (d1); [y40$1] (d2); Model female: [y39$1] (d3); [y40$1] (d4); Model constraint: NEW (a); NEW (b); d3=d1+a; d4=d2+b; My questions are: (1) I was trying to use MLR, but an error message showed. Did I miss anything? (2) Although the overall model already stated "f1 with f2", the covariances between f1 and f2 are different in the male and the female groups. Why? (3) Item difficulty estimates are more biased for the items of f2, and very different from the true values. Did I do anything wrong in my syntax? Thanks for your response. 


(1) We have to see your output. (2) You have to say f1 with f2 (1); to hold that equal across groups. (3) We would have to see how you obtained your "true values" and see the output from the run where you see this bias. 


Dear Linda and Bengt, I've conducted parallel unifactorial graded response models in Multilog/IRTPRO and ordinal CFAs in Mplus. The estimates are nearly identical, but the item information curves (IIC) and test information curves (TIC) look quite different. What's the difference in how the IICs and TICs are calculated in Mplus vs. Multilog/IRTPRO? Thanks! 


Are you using Mplus with the ML estimator and logit link with the latent variable mean fixed at zero and variance at 1? If so, the curves should be the same. Which Mplus version are you using? 


Thanks, Bengt. MPlus 7.11, with estimator=ml; LINK = LOGIT; f@1; [f@0]; The height of the curves is lower in Mplus. The shapes are somewhat different too. I double-checked and the slopes/discrimination parameters are really close. The thresholds are of course different, as you explained in an earlier post. The numbers of free parameters, AIC, and BIC are also the same. I'm pretty sure the same things are being estimated. 


Please send the Mplus and IRTPRO outputs and the data to support@statmodel.com. 


Hi, I was wondering whether Mplus handles the 1-parameter rating scale model (RSM) (Andrich, 1978)? In this model, the slope parameter (discrimination) Ajx equals 1 for all items (same as in the 1-parameter logistic); however, the location (difficulty) parameter Bjx is divided into an item component (DELTAj) and a category, or step, component (TAUx), so that Bjx = DELTAj + TAUx. It is used for polytomous items, and is a special case of the partial credit model (PCM): in the RSM the logits equal (THETA - DELTAj - TAUx), whilst in the PCM the logits equal (THETA - Bjx). Could you please suggest how to translate this model into Mplus syntax, if possible? Many thanks. 


I haven't looked into this, but I wonder if one could use the Mplus "nu" parameter to capture DELTAj. I assume TAUx implies that x varies over the different categories of the item, so they are the Mplus threshold parameters. With categorical items, the "nu" parameters are not activated (since they wouldn't all be identified together with the thresholds), but you essentially get them by placing a factor behind each item and letting that factor's mean (alpha) pick up the nu. 
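As a reference point for whatever Mplus setup is tried, the target category probabilities of the rating scale model described in the question can be computed directly. The Python sketch below uses the logits given there (THETA - DELTAj - TAUx for each step) and is not Mplus syntax; the function name and parameter values are hypothetical.

```python
import math

def rsm_probs(theta, delta_j, taus):
    """Rating scale model category probabilities for one item.

    delta_j is the item location; taus are step parameters shared by all items.
    The logit for moving from category x-1 to x is theta - delta_j - tau_x.
    """
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + (theta - delta_j - tau))
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(v - m) for v in logits]
    total = sum(w)
    return [v / total for v in w]

shared_taus = [-1.0, 0.0, 1.0]  # hypothetical steps for a 4-category scale
for delta in (-0.5, 0.8):        # two hypothetical items
    print(delta, [round(p, 3) for p in rsm_probs(0.0, delta, shared_taus)])
```

Items differ only through delta_j while sharing the same steps, which is the constraint any Mplus parameterization of this model would have to reproduce.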


Thank you, I will work on it! 


We analysed a set of dichotomously scored MCQ items using VARIABLE: NAMES ARE M1-M50; CATEGORICAL ARE M1-M50; ANALYSIS: ESTIMATOR = MLR; MODEL: f BY M1-M50*; F@1; OUTPUT: TECH1 TECH8; PLOT: TYPE = PLOT3; One item struck us as extremely odd in the resulting 2PL analysis: item M9 was answered correctly by 0.8% (i.e., n=3) of the 375 students. M9 Category 1 0.992 372.000 Category 2 0.008 3.000. The Item Discrimination was F BY M9 -0.324 0.457 -0.708 0.479 and the Item Difficulty was M9$1 -15.062 20.868 -0.722 0.470. This means the item is very easy. However, the item was answered correctly by less than 1% of the students, so such a low difficulty estimate makes no sense. I checked this in both version 6 and 7. We have run the same data in RUMM2010 (1PL Rasch) and ICL (3PL) and get more plausible location values for this item. My concern is whether there is an error in the code or the setup, or whether it is possible to have such a low logit despite such high difficulty. Advice appreciated. 


Because the loading is negative, a Y=1 answer implies a poor answer, not a good one. Perhaps you need to reverse score the item. Or, given that the loading is insignificant, delete the item. 
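A quick numerical check shows why the estimates are not contradictory. The Python sketch below assumes a plain logistic item characteristic curve, P(Y=1 | theta) = 1 / (1 + exp(-a*(theta - b))), with no extra scaling constant, and takes the discrimination as negative (as the reply states) and the difficulty as negative (the poster reads the item as very easy), i.e. a = -0.324 and b = -15.062. These signs and the curve form are assumptions for illustration.

```python
import math

def icc(theta, a, b):
    """Logistic item characteristic curve: probability of Y = 1 at theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a, b = -0.324, -15.062  # assumed signs, see the note above
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(icc(theta, a, b), 4))
```

With a negative slope the curve falls as theta rises, so b is the point where a declining curve crosses 0.5, not an "easiness" location. At theta = 0 the implied probability is under 1%, in line with the 0.8% observed, so the large negative difficulty is simply what a near-flat, downward-sloping curve produces.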

William Hula posted on Thursday, January 30, 2014  10:10 am



I've estimated a bifactor MIMIC model and want to convert the parameter estimates into IRT terms. I'd like to confirm that I have the generalization of equations (4) and (5) given by MacIntosh & Hashim (2003) to the bifactor case correct. I'm using the WLSMV estimator with Delta parameterization, and I followed the steps described by MacIntosh and Hashim to set the means and variances of the latent variables to 0 and 1. Specifically, I centered the values of the dichotomous covariates about their means and set the residual variances of the latents to values estimated from an initial run. (1) Would the correct equation for the discrimination of item j on the general factor be: a_j = lambda_jGen / sqrt(1 - lambda_jGen**2*psi_Gen - lambda_jLoc**2*psi_Loc), where psi_Gen and psi_Loc refer to the residual variances for the general and a local factor, respectively? (2) Would the b-value for item j in group k on the general factor be: b_jk = (tau_j - beta_j*z_k) / lambda_jGen, where beta_j is the direct effect of a covariate on item j and z is the group indicator dummy variable? and (3) is the dummy variable z appropriately coded 0 and 1 for the reference and focal groups, respectively, or should it be coded with the deviation scores from the mean, given that the covariate was centered for model estimation? Thank you 


(1) and (2) look correct. Re (3), I wouldn't center z for model estimation. The factor mean shift due to changing from z=0 to z=1 is taken care of in b_jk = (tau_j - beta_j*z_k) / lambda_jGen. 
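
As a minimal sketch, the two conversions confirmed in this reply can be written out with the subtractions made explicit. The function names and all numeric inputs below are hypothetical, chosen only to show the arithmetic; they are not estimates from the poster's model.

```python
import math

def bifactor_discrimination(lam_gen, lam_loc, psi_gen, psi_loc):
    """a_j = lam_gen / sqrt(1 - lam_gen^2 * psi_gen - lam_loc^2 * psi_loc)."""
    residual = 1.0 - lam_gen**2 * psi_gen - lam_loc**2 * psi_loc
    if residual <= 0:
        raise ValueError("implied residual variance must be positive")
    return lam_gen / math.sqrt(residual)

def bifactor_difficulty(tau, beta, z, lam_gen):
    """b_jk = (tau_j - beta_j * z_k) / lam_gen, with z coded 0/1."""
    return (tau - beta * z) / lam_gen

# Hypothetical values for one item (illustration only).
a_j = bifactor_discrimination(lam_gen=0.6, lam_loc=0.3, psi_gen=1.0, psi_loc=1.0)
b_ref = bifactor_difficulty(tau=0.5, beta=0.2, z=0, lam_gen=0.6)  # reference group
b_foc = bifactor_difficulty(tau=0.5, beta=0.2, z=1, lam_gen=0.6)  # focal group
print(f"a_j = {a_j:.3f}, b (z=0) = {b_ref:.3f}, b (z=1) = {b_foc:.3f}")
```

With z left as a 0/1 dummy, the difference between the two b-values is beta_j / lambda_jGen, which is the direct-effect shift the reply refers to.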


Thank you for your help. I have a follow-up question. If z is coded 0/1 for estimation, then the estimated means for the latents may be shifted away from zero. In that case, would b_jk be b_jk = (tau_j - beta_j*z_k) / lambda_jGen - mu_Gen, where mu_Gen is the latent mean estimate from the TECH4 output? 


No, you get the mean of the latent variable from the beta_j*z_k term, with z being either 0 or 1 (assuming all your other covariates are centered). 

Emily posted on Friday, February 28, 2014 - 8:41 am



I have run a CFA with binary indicator variables on a single factor using MLR and the default delta parameterization. I have fixed my factor to have a mean of zero and a variance of 1. I am unsure which conversion equation to use to transform the Mplus estimates into IRT parameters. I found this article: http://www.statmodel.com/download/MplusIRT2.pdf Equation 21 in this article gives the formula for the discrimination parameter as follows: a = factor loading*sqrt(factor variance). I have also seen this equation mentioned previously: a = factor loading/sqrt(1 - factor loading^2). I am wondering which conversion equation is appropriate. 


You should use that equation 21 formula for MLR. The other formula you mention is geared towards WLSMV. 
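
The two conversions discussed here can be put side by side. This is a sketch rather than Mplus's own code: the function names are made up, and the loading .839 is the WLSMV estimate for Item #1 quoted earlier in this thread, where the poster reports a converted slope of 1.542.

```python
import math

def a_from_ml(loading, factor_variance=1.0):
    """Equation 21 form, for ML/MLR estimates: a = loading * sqrt(factor variance)."""
    return loading * math.sqrt(factor_variance)

def a_from_wlsmv_delta(loading):
    """WLSMV (delta parameterization) form: a = loading / sqrt(1 - loading^2)."""
    return loading / math.sqrt(1.0 - loading**2)

# With the factor variance fixed at 1, the ML/MLR loading is already the discrimination.
print(f"MLR,   loading 1.20, variance 1: a = {a_from_ml(1.20):.3f}")

# WLSMV loading for Item #1 from the earlier post in this thread.
print(f"WLSMV, loading 0.839:            a = {a_from_wlsmv_delta(0.839):.3f}")
```

The WLSMV formula needs the extra square-root term because the delta parameterization scales the latent response variable to unit variance, so the residual variance is 1 minus the squared loading rather than a fixed constant.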


Dear Drs. Muthen, I want to run a 2-group 1PL IRT model using Mplus. That is, I want to constrain the discrimination parameters to be equal within and across groups but do not want to constrain the difficulty parameters across groups. When I ran the code below, I got the error message shown below. Could you tell me what is wrong in my code? Without [u1-u25$1], I got equal difficulty parameters across groups. DATA: FILE IS Test_1PL.csv; VARIABLE: NAMES ARE g u1-u25; CATEGORICAL ARE u1-u25; GROUPING IS g (1 = male 2 = female); Analysis: parameterization=theta; MODEL: f BY u1-u25* (1); f@1; [f@0]; Model male: u1-u25@1; [u1-u25$1]; Model female: u1-u25@1; [u1-u25$1]; *** ERROR in MODEL command Unknown variables: U1-U25$1 in line: U1-U25$1 


Try U1$1-U25$1 


It works!! Thanks!! Yoonjeong 


Hello: I have a quick question re: discrimination values in a 2PL IRT model. I have read that these values can theoretically take any value, yet in practice they typically fall between .50 and 2.50. I have estimated some models in which I get alpha parameters such as 36.26 and other large values. However, these models converge and there are no warnings. Is such a value simply impossible or, if the (binary) item response occurs infrequently, is such a high discrimination value possible? 


It is a sign that the item response is rare. 


Thank you. That being said, it still speaks to the item's discriminatory power, correct? In other words, an event could be rarely occurring but not discriminate? 


I guess that is possible, but a flat curve (low discrimination) tends to give a non-ignorable probability. Large thresholds often go together with large loadings. 

Tim Stump posted on Thursday, August 28, 2014 - 2:18 pm



I apologize if the answer to my question is in this discussion board somewhere, but it seemed most appropriate to ask here. I have a bifactor model with ordered categorical items (1 general factor and 2 subfactors) that I'm trying to estimate. Can someone tell me the difference between the unstandardized and standardized coefficients when requesting MLR versus WLSMV? With WLSMV, unstandardized and standardized are the same and look like factor loadings to me. If I use MLR, it looks like I'm getting discrimination and threshold parameters from an IRT model. Would this be a correct assessment? Thanks. 


MLR with link=probit should give similar results to WLSMV when you compare to standardized MLR. 


I have a question about the IRT PARAMETERIZATION IN TWO-PARAMETER PROBIT METRIC WHERE THE PROBIT IS DISCRIMINATION*(THETA - DIFFICULTY) portion of the output. I have two groups and for now I am using the theta parameterization and have fixed the residual variances at one in both groups. I have constrained the thresholds and the factor loadings to be invariant over groups and the factor means and variances are free. I notice that in the IRT PARAMETERIZATION part of the output, the item discriminations and item difficulties are not the same across groups. This confuses me. Above, Dr. Muthen mentions: "To summarize my view, there are two ways to capture DIF in Mplus modeling: (1) CFA with covariates and (2) multigroup analysis. To me, DIF means that for a given item you have different item characteristics curves for different subject groupings and both approaches capture this. " Don't the differences between groups in item discriminations and difficulties given in the IRT parameterization portion of the output suggest that the ICCs would be different? But the factor loadings and thresholds are fixed to be invariant? Am I testing the invariance of the wrong parameters? Should I be testing the invariance of the parameters in the IRT parameterization if I am interested in DIF? 


Our Topic 2 handout, slide 94 shows how the default Mplus factor model parameterization using thresholds and loadings relate to the IRT parameterization with a and b. This shows that even with invariant thresholds and loadings, a and b will vary across groups when the factor mean and/or factor variance vary across groups. Note that the IRT parameterization output refers to a factor with mean zero and variance 1. My writing has been about DIF using the factor model parameterization. 
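
A small numeric sketch may help here. It assumes the probit index is written as -tau + lambda*f in the factor model parameterization (theta parameterization, residual variance 1) and that the IRT output rescales each group's factor to mean 0 and variance 1 via f = alpha + sqrt(psi)*theta. Substituting gives a = lambda*sqrt(psi) and b = (tau - lambda*alpha) / (lambda*sqrt(psi)). The function name and all parameter values are hypothetical.

```python
import math

def irt_from_factor(tau, lam, alpha, psi):
    """Rescale to a factor with mean 0 and variance 1:
    a = lam * sqrt(psi), b = (tau - lam * alpha) / (lam * sqrt(psi))."""
    a = lam * math.sqrt(psi)
    b = (tau - lam * alpha) / a
    return a, b

# The same (invariant) threshold and loading in both groups.
tau, lam = 0.6, 1.2

# Group 1: factor mean 0, variance 1. Group 2: factor mean 0.5, variance 2.
a1, b1 = irt_from_factor(tau, lam, alpha=0.0, psi=1.0)
a2, b2 = irt_from_factor(tau, lam, alpha=0.5, psi=2.0)
print(f"group 1: a = {a1:.3f}, b = {b1:.3f}")
print(f"group 2: a = {a2:.3f}, b = {b2:.3f}")

# On the common factor scale f the probit index is identical in both groups.
f = 1.0
index_common = -tau + lam * f
theta2 = (f - 0.5) / math.sqrt(2.0)   # same person, expressed on group 2's standardized scale
index_group2 = a2 * (theta2 - b2)
print(f"index on common scale: {index_common:.3f}, via group 2's a and b: {index_group2:.3f}")
```

In this sketch a and b differ between the groups only because each group's theta is standardized separately. For a given value of f, the response probability is the same in both groups.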


Thank you for your response and I would like to ask a follow-up question if it is not too much trouble. In the IRT literature it is not uncommon to hear that an item is unbiased between groups A and B if and only if the two item characteristic curves are identical. I think Mellenbergh more generally states this as: An item is unbiased with respect to the variable G and given the variable Z if and only if f(X|g,z) = f(X|z) for all values g and z of the variables G and Z, where f(X|g,z) is the distribution of the item response given g and z and f(X|z) the distribution of the item responses given z; otherwise the item is biased. I am wondering if you are thinking about DIF differently, or if invariant factor loadings and thresholds in CFA with dichotomous items, with factor means and variances free, is consistent with the above definition. 


I think it is consistent; I am not thinking about DIF differently. 


Hi Dr. Muthen, One quick question related to your message on Dec 12, 2014: Even with invariant loadings and factor variances, a will still vary across groups when "scaling factors" vary across groups, is that correct? That is, a can be equal across groups only when loadings, factor variances, and "scaling factors" are invariant (using WLSMV with probit link and delta parameterization), correct? 


Right. The IRT parameterization with a and b ignores the possibility of varying scale factors due to varying residual variances. 


I am trying to estimate IRT parameters for a large bank of cognitive ability items. I have about 600 items total, with about 200 items loading on each of three correlated factors. My data are rather sparse because I had my trial sample take 36 items randomly drawn from the full item bank. My sample had over 12,000 people, so each item was seen by at least 300 people. The model runs just fine as a unidimensional model, but as a multidimensional model, it has been running for 7 days. Do you have any idea as to why the estimation is taking so long or suggestions as to what I can do to reach a valid solution? Thank you! 


Which estimator are you using? 


I am using MLR. 


If you ask for TECH8 you will see screen printing of the iterations. The screen printing will tell you how long each iteration takes and how fast or slow the convergence is so you can get an idea of total time required. I assume your items are declared as categorical so that numerical integration is called for. The screen should show 3 dimensions of integration which with the default 15 points per dimension gives over 3000 points and can take a while for n=12,000 (time is linear in n). But it shouldn't be bad for a fast computer with say processors = 8 and i7 CPU. To speed it up you can use integration=10. Or, you can say NOSERR in the output to not compute SEs (they take a while with so many parameters). Or, you can use Estimator=Bayes. Or, you can take a smaller random sample of your data and use those estimates as starting values for the full-sample run. 
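
The point count mentioned here is just the number of points per dimension raised to the number of dimensions. A quick sketch of that arithmetic follows. The helper name is made up, and treating the work per iteration as proportional to sample size times number of points is a rough simplification, not a statement about Mplus internals.

```python
def integration_points(dimensions, points_per_dimension=15):
    """Total grid points for a full rectangular integration grid."""
    return points_per_dimension ** dimensions

default_grid = integration_points(3)        # 15 points per dimension
reduced_grid = integration_points(3, 10)    # integration=10

print(f"3 dimensions, 15 points each: {default_grid} points")
print(f"3 dimensions, 10 points each: {reduced_grid} points")

# Rough relative cost per iteration, assuming work scales with n * points.
n = 12_000
print(f"relative work, reduced vs default: {n * reduced_grid / (n * default_grid):.2f}")
```

Going from 15 to 10 points per dimension cuts the grid from 3375 to 1000 points, roughly a 70% reduction under this simplification.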


After having read through this thread in depth again, a question came to mind. Apologies if the answer is either obvious or already somewhere here or on SEMNET, which I also searched. Why is it generally recommended to avoid ML or MLR for binary indicators in CFA (because of the non-normality of discrete/binary indicators), whereas in the IRT setup ML(R) estimation is (based on my reading) a somewhat better choice than limited-information estimators? I understand the advantage of full-information estimation, but I am missing something as to why full-information methods would not just be favored regardless of the metric of your indicators. 


I think your question reflects a common misunderstanding that I have seen several times also in the literature, so I am glad you ask it to help clear it up. When it is said that ML should be avoided with binary indicators in CFA, the speaker/writer is thinking of ML as synonymous with analysis treating the indicators as continuous-normal outcomes. So the mistake is to think that ML means continuous variables. Treating binary variables as continuous is of course not optimal. But ML does not imply analyzing your variables as if they were continuous. The fact that ML (or MLR) can be used for variables other than continuous is clear from a study of logistic regression and it is clear from IRT (which is logistic regression with factors as IVs), and it is clear from regression with a count DV. 

Mengyao Hu posted on Monday, March 30, 2015 - 11:12 am



Apologies if the answer to this question is too obvious or already somewhere in the forum. I am trying to run a multidimensional IRT model with nominal indicators (two continuous latent factors with nominal indicators, and several indicators have cross-"loadings"). I am just wondering if Mplus can handle it? I know Mplus has the NOMINAL option, but I am not sure if I can define the observed indicators as NOMINAL in a measurement model. Thank you. 


This works fine. 


I am trying to generate the ICC plots for my data and it appears to be plotting the curves for the wrong category. I indicate the nonreferent category (cat 1) and it plots what is clearly the referent category (cat 2). Is there something I am doing wrong or misunderstanding? Thanks 


Please send output to support and explain the steps you take in the plot menu. 


I am using binary data for second-order item factor analysis (four first-order factors and one second-order factor). I would like to present the results in an IRT framework. Is there a way to convert parameters from the second-order factor model to IRT parameters? Thank you. 


If you express how each item relates to the second-order factor (that is, as a product of loadings on the two levels) you are in the framework of the standard single-factor case that IRT typically considers and can therefore use the regular translations between our factor model parameterization and the IRT parameterization that we describe in our Topic 2 handout as well as in our IRT document at http://www.statmodel.com/irtanalysis.shtml 
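
One way to read this suggestion is sketched below. It assumes fully standardized probit (WLSMV, delta parameterization) loadings, so that the product of an item's first-order loading and its factor's second-order loading is the correlation between the latent response variable and the second-order factor. That product is then passed through the usual single-factor conversion a = loading / sqrt(1 - loading^2). The function name and the loading values are hypothetical.

```python
import math

def second_order_discrimination(first_order_loading, second_order_loading):
    """Item's standardized loading on the second-order factor is the product
    of the two standardized loadings; convert it as in the single-factor case."""
    implied = first_order_loading * second_order_loading
    return implied / math.sqrt(1.0 - implied**2)

# Hypothetical standardized loadings (illustration only).
lam_item_on_f1 = 0.8    # item on its first-order factor
gamma_f1_on_g = 0.7     # first-order factor on the second-order factor

a_item = second_order_discrimination(lam_item_on_f1, gamma_f1_on_g)
print(f"implied loading = {lam_item_on_f1 * gamma_f1_on_g:.2f}, a = {a_item:.3f}")
```

In this reading, everything not explained by the second-order factor, including the first-order factor's own disturbance, is absorbed into the item's residual, which is why the single-factor formula applies.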


Thank you much, Dr. Muthen. 
