Count and categorical indicators

Jen Bailey posted on Monday, April 19, 2004 - 4:39 pm

Dear Drs. Muthen:

I am using Version 3 and trying to run a CFA in preparation for looking at the stability of general, latent substance use within individuals across 3 time points (high school, early adulthood, and age 27). My indicators are either zero-inflated Poisson (e.g., frequency of binge drinking in the past month) or ordered categorical. At each time point, I'm trying to create a latent factor from cigarette use (categ), binge drinking frequency (z-ip), and pot use frequency (z-ip). I have 2 questions. First, I hypothesize autocorrelation in substance-specific residuals across time (i.e., the high school marijuana use residual will be correlated with the age 27 marijuana use residual). I read in the manual, however, that residuals are not calculated for count variables. How can I specify the hypothesized autocorrelations? In my syntax below, you can see some of the ways I've tried to do this.

My second question concerns large fully standardized estimates for the means of my inflation variables. Several of the estimates are 999.00. I am wondering if this could be due to the fact that my count variables have a much larger variance than my other variables (e.g., 32 vs 1). Any other ideas as to why these values might be showing up? I've checked the data file for out of range values, unspecified "missing" values, etc., and found nothing out of order.

I would greatly appreciate any thoughts on either question. Thank you!

DATA:
FILE IS C:\JENNIFER\PSUBEH3.DAT;
TYPE IS INDIVIDUAL;
FORMAT IS 40F8.2;

VARIABLE:
NAMES ARE STUID SEX G1AFRQ7 G1AFRQ8 G1BINGE7
G1BINGE8 G1ALAMT8 G1CIG7 G1CIG8
G1POT7 G1POT8 G1POTFR7 G1POTFR8 HSALQT
HSBIN HSALFR HSCIG HSTOB HSPTQTY
HSPTFR EAALQTY EABIN EAALFR EACIG EATOB
EAPTQTY EAPTFR SSAL27 SSBN27
SSALQTY SSCG27 SSTOB27 Q364 SSPT27 G2CDt
G2ODDt G2ADHDt CDT ODDT ADHDT ;
USEVARIABLES G2CDt G2ODDT G2ADHDT
SSBN27 SSPT27 HSCIG HSBIN HSPTFR
SSCG27 ;
MISSING =BLANK;
CATEGORICAL ARE SSCG27 HSCIG;
COUNT ARE SSBN27 (I) SSPT27 (I)
HSBIN (I) HSPTFR (I);

MODEL:
G2PROB BY G2CDt* G2ODDT* G2ADHDT* ;
G2ASU BY HSBIN* HSCIG* HSPTFR*;
!EASU BY EABIN* EACIG* EAPTFR*;
G2SU27 BY SSBN27* SSPT27* SSCG27*;

G2PROB@1;
G2ASU@1;
!EASU@1;
G2SU27@1;

!HSPTFR#1 WITH SSPT27#1; - using this line gives the error message "hsptfr#1 is not observed."
![HSPTFR#1 WITH SSPT27#1]; - using this line gives the error message "unknown variable [ ."
!HSPTFR WITH SSPT27; - using this line gives the error message "interaction problem."
!SSPT27#1 ON HSPTFR#1; - using this line gives the error message "sspt27#1 is not observed."
!SSPT27 ON HSPTFR; - using this line causes a fully standardized loading of sspt27 on its factor that is greater than 1.

ANALYSIS:
ESTIMATOR= ML;
TYPE = MEANSTRUCTURE;
!ADAPTIVE = OFF;
!MITERATIONS = 1000;
!MCITERATIONS = 5;
!MUITERATIONS = 5;
ITERATIONS=30000;
!MATRIX= COVA;

OUTPUT:
STAND sampstat TECH1 TECH8;

Linda K. Muthen posted on Tuesday, April 20, 2004 - 7:49 am

The estimates of 999 are most likely caused by negative residual variances of the categorical outcomes or factor correlations greater than one.

See Example 7.16 to see how to estimate a residual covariance for a categorical outcome using maximum likelihood. You define a factor that influences the two outcomes for which you want a residual covariance. You can use the same approach for a count outcome. Note that you are using numerical integration which can be computationally heavy when there are many factors.
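
As a rough sketch of this approach applied to the model posted above (an illustration under assumptions, not the syntax of Example 7.16): the factor name FPOT is hypothetical, G2PROB is left out for brevity, and fixing both loadings at 1 while leaving the factor variance free is only one possible parameterization. It restricts the implied residual covariance to be non-negative.

```mplus
MODEL:
  G2ASU  BY HSBIN* HSCIG* HSPTFR*;
  G2SU27 BY SSBN27* SSPT27* SSCG27*;
  G2ASU@1;
  G2SU27@1;

  ! FPOT (hypothetical name) influences only the two pot-use counts,
  ! so its variance stands in for their residual covariance.
  FPOT BY HSPTFR@1 SSPT27@1;

  ! Keep FPOT uncorrelated with the substantive factors.
  ! If G2PROB is in the model, it would need the same treatment.
  FPOT WITH G2ASU@0 G2SU27@0;
```

Each extra factor of this kind adds one dimension of numerical integration, so adding one per substance can become slow.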

Jen Bailey posted on Tuesday, April 20, 2004 - 10:41 am

Thanks very much. The example you pointed me to was quite helpful!

sara perry posted on Saturday, May 19, 2007 - 10:58 am

I have a couple of related questions to this one.

We want to run a measurement model in MPLUS of a latent variable with factor indicators that are all count data, in a zero-inflated negative binomial distribution. We have a few questions about this:

1) Should we consider the latent variable as a count variable as well, since all the factors are counts? If so, is this considered mixture modeling, and is there a section in the manual for the code to do that?

2) Regardless, we know we need to tell MPLUS that the factor indicators are count data. We tried to do this using the code in the manual and came across an error regarding residual variance. What is the limitation of estimating residual variance in a measurement model w/ count factor indicators? Is there something else we should be considering regarding residual variance/covariance that is different from a normal measurement model?

Thanks in advance!

Linda K. Muthen posted on Saturday, May 19, 2007 - 5:39 pm

Mplus does not handle zero-inflated negative binomial modeling. Instead it uses zero-inflated Poisson modeling with or without mixtures.

If your factor indicators are count variables, the factor itself is continuous.

Poisson regression does not include residual variance parameters.
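
A minimal input fragment along these lines might look as follows. The variable names u1-u4 and the factor name f are hypothetical, the DATA command is omitted, and the choice of MLR is an assumption rather than something stated in the thread.

```mplus
VARIABLE:
  NAMES ARE u1-u4;        ! hypothetical count indicators
  COUNT ARE u1-u4 (i);    ! (i) requests the zero-inflated Poisson model

ANALYSIS:
  ESTIMATOR = MLR;        ! assumption; counts are estimated by maximum likelihood

MODEL:
  f BY u1-u4;             ! f is a continuous factor
  ! No residual variance statements (such as u1-u4;) are given,
  ! because the Poisson model has no residual variance parameters.
```

Mentioning a variance for a count indicator in the MODEL command is the likely source of the residual variance error described in the question.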

Alexandre Morin posted on Tuesday, March 04, 2008 - 7:35 am

Greetings,

When testing measurement invariance in a CFA with categorical indicators, the delta parametrisation allows us to work with scale factors and the theta parametrisation allows us to work with residuals (scale factors and residuals being a function of one another, both cannot be estimated simultaneously). Is that it?

Then, my question is: how can we estimate intercepts in Mplus in a CFA with categorical indicators? I read somewhere that it is possible but did not find out how. I might be wrong, but are intercepts not directly related to scale factors and residuals?

Thank you in advance

Linda K. Muthen posted on Tuesday, March 04, 2008 - 8:29 am

The answer to your first paragraph is yes.

If you want to estimate intercepts, you need to put a factor behind each observed variable such that the factor is equivalent to the observed variable. There is no relationship between intercepts and scale factors and residuals.

Alexandre Morin posted on Thursday, March 06, 2008 - 2:39 pm

Thank you!

I was not too far.

If I understand correctly, that would give a higher-order CFA model in which each item loads on a factor (1 item per factor) and then all factors themselves load on the "higher-order" factors of interest? Or would that give a kind of bifactor model in which each item simultaneously loads on two factors? I believe the first possibility is the right one, but the "behind each observed variable" wording got me confused.

Then the invariance would be at the level of the relationship between the lower and higher order factors (treated as continuous) : baseline, loadings, intercepts, residuals, etc. Without the possibility of testing the invariance of scale factors ?

Linda K. Muthen posted on Thursday, March 06, 2008 - 5:40 pm

When you put a factor behind each observed variable as follows, you are simply making the factor identical to the observed variable:

f1 BY y1@1; y1@0;

This simply opens up the alpha matrix so that intercepts can be estimated because the nu matrix is not opened with categorical outcomes. It is a trick.

Alexandre Morin posted on Friday, March 07, 2008 - 7:52 am

Then, if my original model is:
f1 BY wp1 wp2 wp7 ;
f2 BY wp3 wp4 wp9 ;
I will have to redefine it as:
f10 BY wp1@1; wp1@0;
f11 BY wp2@1; wp2@0;
f12 BY wp7@1; wp7@0;
f13 BY wp3@1; wp3@0;
f14 BY wp4@1; wp4@0;
f15 BY wp9@1; wp9@0;
f1 BY f10 f11 f12;
f2 BY f13 f14 f15;
And then for the tests of invariance, I only work with the f10-f15 "factors", for which I can constrain or relax, as with wp1-wp9 in the preceding model: thresholds, loadings, residuals/scale (theta/delta), intercepts, etc. In other words, I then conduct my tests of invariance as I did before (let's suppose I did it right), but replacing wp1-wp9 with f10-f15? Or do I, for instance, constrain thresholds on wp1-wp9 and loadings/residuals/intercepts on f10-f15?
And, if you have time, a follow-up question: why is the nu matrix not opened by default in Mplus?
Thank you very much for taking time to answer our questions!

Linda K. Muthen posted on Sunday, March 09, 2008 - 10:11 am

You are correct.

We don't open the nu matrix as the default because both tau and nu cannot be identified at the same time. We don't see a strong need to work with nu parameters.

Alexandre Morin posted on Monday, March 10, 2008 - 7:32 am

Thank you again,
However, this means I can't be completely right... If the nu and tau matrices cannot be identified at the same time, does it mean that opening the nu matrix closes the tau matrix and that the invariance of thresholds cannot be estimated in such a model? Or is it still possible to evaluate threshold invariance at the level of the "pseudo factors"?

Bengt O. Muthen posted on Monday, March 10, 2008 - 8:15 am

Using pseudo factors your nu's will be put into alpha (the factor means) and you will in this way have access to both nu and tau (the thresholds). The question is which restrictions do you have to place on the model to identify nu and tau parameters. Roger Millsap has written (in MBR?) about parameterizations different from the Mplus defaults where nu and tau are used, but with certain restrictions on other parameters.

Alexandre Morin posted on Monday, March 10, 2008 - 8:26 am

Thank you very much,
Yes, I was planning to work from the Millsap paper (2004, MBR). In this case, if I'm right, I will have to work with the thresholds at the items level (wp1-wp9) and on the other parameters at the pseudo factors level (f10-f15: residuals, intercepts, loadings) ?

Alexandre Morin posted on Monday, March 10, 2008 - 8:41 am

Yes, it works this way. No need to answer me, I tried it. Thanks for your time.

Jennifer M. Jester posted on Wednesday, May 13, 2009 - 10:33 am

I am using a zero-inflated Poisson model for onset of cigarette smoking, using measures of inattention to predict whether or not an adolescent has started smoking and the level of smoking. My confusion is that when I look at the 3 different standardized models, I see different results and p-values for the relationship of the predictor to smoking. Can you tell me which of the standardizations I should be using?
Here is some of the output:
MODEL RESULTS

                    Estimate       S.E.  Est./S.E.    P-Value

STDYX Standardization

 CIGDAYT6   ON
    INATTEN            1.000      0.000  *********      0.000

 CIGDAYT6#1 ON
    INATTEN            0.076      0.366      0.207      0.836

 YSRATTNT WITH
    SRINAT45           0.478      0.045     10.687      0.000

STDY Standardization

 CIGDAYT6   ON
    INATTEN            1.000      0.000  *********      0.000

 CIGDAYT6#1 ON
    INATTEN            0.076      0.366      0.207      0.836

 YSRATTNT WITH
    SRINAT45           0.478      0.045     10.687      0.000

STD Standardization

 CIGDAYT6   ON
    INATTEN            1.864      0.532      3.502      0.000

 CIGDAYT6#1 ON
    INATTEN            0.138      0.670      0.205      0.837

 YSRATTNT WITH
    SRINAT45           0.243      0.037      6.471      0.000

Linda K. Muthen posted on Thursday, May 14, 2009 - 9:39 am

STDYX is used for continuous covariates. STDY is used for binary covariates. See pages 577-579 of the user's guide for more information.

QianLi Xue posted on Thursday, September 10, 2009 - 8:07 pm

I ran Example 5.2 in the User's Guide. Here are the fit statistics:
CFI 1.000
TLI 1.001
RMSEA 0.000

WRMR 0.342

Why do CFI, TLI, and RMSEA all give indication of a good fit, while WRMR suggests the opposite?

Bengt O. Muthen posted on Friday, September 11, 2009 - 9:44 am

We have observed several instances where WRMR does not seem to work well. It did, however, work well in most of the simulations of the Yu dissertation - see our web site. If most of the other fit indices are good, I think you should ignore WRMR.

Chris Stride posted on Tuesday, November 10, 2009 - 4:04 am

Hi
I am testing a proposed 5-factor measurement model using CFA. The observed variables are questionnaire items of an ordinal categorical form, 5 cats, coded 0-4, that would often be treated as continuous. However...

Whilst half of them (which ask about likelihood of behaving well in different aspects of one's job) have a roughly uniform distn across the 5 categories, the other half, (which ask about likelihood of behaving badly) have a very skewed distribution; with 70-80% of cases selecting category 0.

As such, the most honest way of defining these vars seemed to be to treat the positive items as continuous, and the negative as count data, since whilst they don't actually represent a count, they are a measure of the occurrence of rare events.

I therefore ran the model in Mplus 5.
Whilst the model runs OK, there are very limited measures of model fit given in the output; just the chi-sq (1910.744 on 49947 df!), AIC and BIC.

So...

What are your thoughts re: the allocation of variable types; would it be better to treat all items as continuous or all as categorical?

And if my choices are correct, how can I get an indication of model fit? The chi-sq statistic above looks very strange, with a huge df relative to the actual chi-sq figure.

Linda K. Muthen posted on Tuesday, November 10, 2009 - 5:52 am

I would not treat ordered categorical variables as count variables. I would use the CATEGORICAL option for all of the ordered categorical variables both those with and without floor effects. The default estimator in this situation is weighted least squares which gives you chi-square and the related fit measures that you are used to.
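
For illustration only, a fragment that follows this advice could look like the one below. The item names good1-good4 and bad1-bad4 and the two factor names are hypothetical stand-ins for the 5-factor model in the question, and the DATA command is omitted.

```mplus
VARIABLE:
  NAMES ARE good1-good4 bad1-bad4;        ! hypothetical item names
  CATEGORICAL ARE good1-good4 bad1-bad4;  ! all ordinal items, skewed or not

ANALYSIS:
  ESTIMATOR = WLSMV;   ! the weighted least squares default, stated for clarity

MODEL:
  fgood BY good1-good4;
  fbad  BY bad1-bad4;
```

No COUNT statement appears, so the output should include the chi-square test and the related fit measures mentioned above.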

The Pearson and likelihood ratio chi-square statistics that you obtain with count data are for the frequency table.

Kätlin Peets posted on Thursday, April 08, 2010 - 7:19 am

I am wondering whether to use EFA before conducting CFA (e.g., to see cross-loadings) or to start with CFA right away. The problem is that when I conduct CFA, the model (4-factor) has a good fit (and loadings are high). However, when I conduct EFA, several items seem to be actually loading on several factors (and none of the loadings are particularly high). And, when I start taking items out, the model seems to be unstable (one of the residual variances becomes very high and negative. so...a 3-factor solution might be actually enough). So, I am a bit puzzled. Should my choice be based on whether we have a good theory vs. whether we want to explore the number of factors underlying the variables?

Linda K. Muthen posted on Thursday, April 08, 2010 - 8:00 am

If the variables are not behaving as you expect, this points to them not being valid measures. You should think about why this might be the case. Ultimately, the meaning of the factors based on theory needs to be considered.

Kätlin Peets posted on Tuesday, April 13, 2010 - 7:27 am

Hi,

I am running a CFA with categorical indicators. I get a message that some of the bivariate tables have an empty cell. When I remove these problematic items, the model fit is pretty much the same. Is it absolutely necessary to remove these items?

Linda K. Muthen posted on Tuesday, April 13, 2010 - 10:57 am

I would not use them.

Alexander Kapeller posted on Monday, April 18, 2011 - 1:11 pm

Hello,

I have count indicators for a latent factor, with missing data.
1) Is FIML applicable to this problem?
2) To correct for non-normality I want to use MLR. Is this correct?
3) Can you offer a reference dealing with the missing data problem for count data?

thanks a lot.

alex

Linda K. Muthen posted on Tuesday, April 19, 2011 - 9:28 am

1. Yes.
2. MLR is robust against non-normality of continuous outcomes. For count outcomes, the statistical model takes into account the nature of the data. In this case, MLR can help with model misspecification, for example, using a Poisson model when a negative binomial model is needed.
3. It is the same for all variable types. See the Little and Rubin reference in the user's guide.

Alexander Kapeller posted on Monday, April 25, 2011 - 10:36 am

hello.

I have three follow-up questions.
1) Can I censor a ZIP-distributed indicator variable in Mplus?

2) I have a continuous variable that is distributed similarly to the one in 1). Which distributional model would you suggest using in Mplus? I want to form a factor out of these two variables.

3) Is there a source that describes which empirical distribution functions I can use for observed variables in Mplus?

Thanks.

Bengt O. Muthen posted on Monday, April 25, 2011 - 6:01 pm

1) If you mean a count variable where a set of high counts have been combined, I would not use ZIP, but perhaps instead an ordinal categorical approach.

2) If you mean a continuous variable which is censored, you can use a censored-normal approach in Mplus.

The variable types that Mplus handles are shown in the UG.

Alexander Kapeller posted on Tuesday, April 26, 2011 - 1:05 am

Hello Bengt,

I would like to clarify point 2).

My continuous variable is a percentage measure and has a lot of zeros. Can I just inflate normal-theory-based ML at zero by putting (i) or something like it after the variable?

Best
Alexander

Bengt O. Muthen posted on Tuesday, April 26, 2011 - 6:06 pm

Mplus provides both censored-normal and censored-normal inflated modeling for such variables.
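
As a hedged sketch of how such a variable might be declared: the variable name pct is hypothetical, and the choice between the two settings depends on the data. In the CENSORED option, b marks censoring from below and i adds the inflation part.

```mplus
VARIABLE:
  NAMES ARE pct u1;        ! hypothetical: pct is the percentage measure
  CENSORED ARE pct (bi);   ! censored from below, with inflation at the limit
  ! Use pct (b) instead for the plain censored-normal model.
  COUNT ARE u1 (i);        ! the zero-inflated Poisson indicator from the question
```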

Cecily Na posted on Thursday, June 14, 2012 - 12:33 pm

Hello Professors,
I have a latent variable with several indicators which are either count or categorical. Should I convert the count indicator into a categorical indicator to perform CFA? Does it matter? Note that the count indicator has a range of 0-200, but the categorical indicator has a range of 0-6.

Greatly appreciate your advice!
Cecily

Linda K. Muthen posted on Thursday, June 14, 2012 - 1:14 pm

You should treat the count variable as a count variable by putting it on the COUNT list and the categorical variable should be put on the CATEGORICAL list.

Cecily Na posted on Friday, June 15, 2012 - 11:58 am

Hello, thank you. I want to follow up on my previous post. When outcomes are categorical, the slopes are probit. When outcomes are counts, does Mplus use Poisson regression? Thanks.

Linda K. Muthen posted on Friday, June 15, 2012 - 12:54 pm

For variables on the CATEGORICAL list, weighted least squares gives probit regression. Maximum likelihood gives logistic regression as the default. Probit regression can be requested.

Poisson models are used for variables on the COUNT list.
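
Putting the two answers together, a fragment for one factor with both indicator types might be sketched as follows. The variable names c1 and u1-u3 are hypothetical, and the DATA command is omitted.

```mplus
VARIABLE:
  NAMES ARE c1 u1-u3;      ! hypothetical: c1 is the count, u1-u3 are ordinal
  COUNT ARE c1;            ! Poisson model for the count indicator
  CATEGORICAL ARE u1-u3;   ! ordered categorical indicators

ANALYSIS:
  ESTIMATOR = ML;
  LINK = PROBIT;           ! optional; the maximum likelihood default is LOGIT

MODEL:
  f BY u1-u3 c1;
```

The LINK setting applies to the variables on the CATEGORICAL list. The count indicator keeps its Poisson model either way.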

laura skriner posted on Sunday, July 15, 2012 - 10:11 am

Hello Professors,

I'm testing for measurement invariance on a depression measure between 4 ethnic groups. The measure has 20 items scored on a 4-point scale. I have two questions regarding fitting the unconstrained factor loading and unconstrained thresholds models.

1. I first ran separate CFAs for each of the 4 ethnic groups and all of the models fit well. I next ran the fully constrained multi-group model, which ran fine and also fit the data well. When I tried to run the model with unconstrained thresholds, I did not get fit indices because the model is underidentified. I am not sure if there is something wrong in my syntax or if having 3 thresholds for 20 items is too many. I am wondering: (a) would one solution be to fix or constrain some of the model parameters, and (b) are there certain parameters that I should try to fix or constrain?

2. My second issue occurred when trying to run the unconstrained factor loadings model with MLR estimation (I did this using an LCA framework). I was able to get fit indices when I first ran the WLSMV model. However, when I tried to run the MLR model in order to get AIC and BIC, my model did not converge. I am confused as to why it would converge using one estimator versus another, and am wondering if you have any suggestions for how to get it to converge.

Would greatly appreciate any advice.

best,
Laura

Bengt O. Muthen posted on Sunday, July 15, 2012 - 7:05 pm

1. I assume you treat the items as polytomous, categorical in which case you need to constrain some parameters - see Millsap-Tein (2004) in Multiv Behav Res.

2. The output for this needs to be looked at by Support.

laura skriner posted on Monday, July 16, 2012 - 12:26 pm

thanks so much for the advice.

I will look at Millsap-Tein (2004) and try constraining some parameters

I will send the output to Support.

Laura

Kathy Xiao posted on Sunday, April 09, 2017 - 5:22 am

I am doing an LCA where one of the variables is the number of disorders the participant has had before, with values 0, 1, 2, 3. The value 3 combines those having 3 or more disorders.

Shall I treat it as a categorical variable instead of a continuous one?

Thanks!

Bengt O. Muthen posted on Monday, April 10, 2017 - 6:12 pm

Yes, if you have a strong floor effect.

Daniel Rodriguez posted on Sunday, October 29, 2017 - 4:37 am

Good morning!
I have a question about the scale of a latent variable with categorical indicator variables. I am conducting an ESEM analysis and using categorical indicator variables. I understand that when I have predictor variables with direct paths to my observed variables, I can interpret their effects as the log of odds. However, I am not clear on what the scale is for my latent variable underlying the observed categorical variables. My model fits best with a single factor, and I have significant effects from my predictor variables to the single factor. I understand in CFA with continuous variables, the scale is determined by the indicators if I set one path to equal 1. However, I am unclear on the scale for my latent variable with categorical indicators. Should I attempt an interpretation at all, or is it better to just report if the effects are significant? I greatly appreciate your guidance on this matter.

Bengt O. Muthen posted on Sunday, October 29, 2017 - 6:00 pm

The regression of the factor on its predictors is a linear regression because the factor is continuous even when the factor indicators are categorical.

Because the indicators are categorical DVs, you have logistic regression if the Link=logit, otherwise probit regression.
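
The two kinds of regression can be seen side by side in a small CFA-style fragment (not ESEM syntax). All variable names here are hypothetical, and the direct effect on u3 is included only to show the contrast.

```mplus
VARIABLE:
  NAMES ARE u1-u6 x1 x2;   ! hypothetical indicators and predictors
  CATEGORICAL ARE u1-u6;

ANALYSIS:
  ESTIMATOR = MLR;
  LINK = LOGIT;            ! the maximum likelihood default, stated for clarity

MODEL:
  f BY u1-u6;
  f ON x1 x2;              ! linear regression: the factor is continuous
  u3 ON x1;                ! direct effect on an indicator: logistic regression

OUTPUT:
  STANDARDIZED;
```

The f ON slopes are in the factor's metric, which is set here by fixing the first loading at 1 by default.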

Daniel Rodriguez posted on Monday, October 30, 2017 - 3:06 am

Thank you very much for your response. I appreciate it very much. In terms of reporting results for the effect of predictors on the latent variable, is standardization a wise option? Or is it better to just report the effects as significant or not significant only? Again, I greatly appreciate your time.

Bengt O. Muthen posted on Monday, October 30, 2017 - 4:25 pm

I would report both unstandardized and standardized estimates with their significance.

You may want to ask these general analysis questions on SEMNET to get broader input.