jan mod posted on Thursday, June 18, 2015 - 3:57 pm
Dear Muthen & Muthen,
I'm fitting an IRT model with two waves of dichotomous mathematics items (single-group design). I want to do a concurrent calibration with 10 common items and unique items for each wave, putting both waves on a common scale. 1) Is this syntax correct? 2) How can I obtain an IRT-calibrated score for each wave of math (two scores on a common scale)?
I would think you use 2 theta variables, one for each time point, each with its own indicators. So something like
theta1 by math1-math10
       math111-math120* (a1-a20);
theta2 by math211-math220* (a1-a20);
math21-math45;
theta1-theta2@1;
where you have equality labels a1-a20 for the items that are in common and administered at both time points (hence the prefixes 1 and 2). The 2 theta variables will be correlated by default. You ask for factor scores to get the individual theta values.
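Put together, a fuller input might look like the sketch below. The item-name layout and equality labels follow the model statements above (exact label assignment depends on your item layout); the SAVEDATA section is the standard Mplus way to save individual factor scores.

```
VARIABLE:
  CATEGORICAL = math1-math10 math21-math45
                math111-math120 math211-math220;
ANALYSIS:
  ESTIMATOR = MLR;              ! ML-based IRT estimation
MODEL:
  theta1 by math1-math10
         math111-math120* (a1-a20);
  theta2 by math211-math220* (a1-a20);
  math21-math45;
  theta1-theta2@1;              ! fix both factor variances to 1
SAVEDATA:
  FILE = fscores.dat;
  SAVE = FSCORES;               ! writes individual theta estimates
```

The saved file will contain an estimated theta1 and theta2 for each person, both on the common scale established by the equality-constrained common items.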
I’m also working on a longitudinal IRT equating/calibration project, but with a single-group design. Forms A and B have the exact same items, but A is self-report and B is from a clinical interview. I estimate item parameters for scoring a combined measure of A+B (my “gold standard”), a calibrated A only (CA), and a calibrated B only (CB). I’m looking at 1) testing for first- and second-order equity between CA and CB, and 2) whether the treatment effect size for A+B is contained in the sample ES CI for CA and CB (coverage, treating A+B as the true value). The only way I could meet FOE/SOE and achieve d CI “coverage” was to impose N(0,1) across *all* timepoints – a subtle-but-critical difference in parameterization from the standard longitudinal IRT setup it appears you’ve suggested here; if you have a reference for this, I’d love for you to share it. There isn’t anything formal in the equating literature on this (all of it is cross-sectional), and in other literatures with an interest in longitudinal calibration (e.g., IDA), the standard mean-and-variance (M&V) structure is always used; no one seems interested in equity in IDA because the interest is only in the combined measure.
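In Mplus terms, the difference between the two structural parameterizations might be sketched as follows (hypothetical factor names theta1/theta2 for T1/T2; measurement statements omitted; @ fixes a parameter, * frees it):

```
! Standard longitudinal structure: N(0,1) at T1, free mean/variance at T2
[theta1@0];  theta1@1;
[theta2*];   theta2*;

! Alternative that met FOE/SOE here: N(0,1) imposed at every timepoint
[theta1@0];  theta1@1;
[theta2@0];  theta2@1;
```

Under the second setup, change over time is absorbed into the item parameters rather than the latent distribution, which is the subtle difference from the standard structure.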
Hopefully I can clarify what I’ve done. I have u1-u34 measured at Time 1 and Time 2; u1-u17 are from Form A and u18-u34 are from Form B. I examine item-specific loading and threshold DIF over time and across reporters for each item (e.g., u1 = u18 but for different reporters…). I fit a final calibration model (with DIF where necessary) and also score across u1-u34 using the standard structure (N(0,1) at T1, free M&V at T2). These are my “gold standard” (GS) scores. I then have two other scoring models: a) one where I use the item parameter estimates from GS as fixed parameters to score based only on u1-u17, and b) another where I score based only on u18-u34. The IRT linking literature suggests that the score distributions for the non-GS scores should be equivalent within sampling error under tests for first- and second-order equity (e.g., Hanson et al., 2001; Kim & DeCarlo, 2016). With a free M&V at T2, my scores fail these tests of equity. I then came across this thread from 2015 and saw you had a constraint of N(0,1) across all timepoints, and that actually worked for me.
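For concreteness, a minimal sketch of scoring model (a) – fixing item parameters at the GS estimates – with three illustrative items and made-up parameter values (in Mplus, @ fixes a parameter and the $ suffix indexes thresholds):

```
MODEL:
  theta by u1@1.20 u2@0.85 u3@1.05;   ! loadings fixed at GS estimates
  [u1$1@-0.40];                       ! thresholds fixed at GS estimates
  [u2$1@0.10];
  [u3$1@0.65];
  [theta@0];  theta@1;                ! N(0,1) latent scale for scoring
SAVEDATA:
  FILE = scoresA.dat;
  SAVE = FSCORES;                     ! theta estimated from u1-u17 only
```

Scoring model (b) is the same idea with the u18-u34 items and their GS parameter values.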