eric baumer posted on Wednesday, January 02, 2002 - 10:05 am
I have a situation in which I really want to estimate a fairly complex structural equation model, yet the key endogenous variable is probably best represented as a count variable with a Poisson distribution (i.e., a Poisson or negative binomial count model). Does Mplus handle this type of estimation in an SEM context? Thanks, Eric Baumer
I'm also working with count data, where I have several count items (number of occasions on which the respondent has performed certain behaviors) which I would like to use in a factor model. Unfortunately, I have incomplete data.
Previously, I imputed the count data (using NORM -- yes, I know, it's already a distortion), categorized the counts into ordered polytomous variables, and ran a categorical data model in Mplus, combining across imputations.
I'd like to avoid the multiple imputation step, if possible, not least because I have a lot of individual items and the imputation takes quite a while. Would using the MLR estimator in 2.1 seem like a reasonable approach for these data?
Separately, following up Eric's question, is work being done on Poisson links for this kind of model? I have no idea what the complexities would be, but I'm used to y'all pulling rabbits out of hats :-).
bmuthen posted on Tuesday, June 25, 2002 - 2:33 pm
Yes, I think MLR or MLM can be seen as a rough approximation. Mplus is developing missing data facilities for more types of outcomes, and Poisson outcomes are also planned.
I am also working on a relatively complex growth curve model that is based on count data. Do you know of any resources that might save me some time writing simulations by discussing the biases that might result from analyzing Poisson-distributed data in Mplus, which would treat them as ordered polytomous? I assume that this would entail violating the assumption that there is a continuous latent dimension underlying the categorical polytomous data.
bmuthen posted on Sunday, September 22, 2002 - 9:41 am
I am not aware of any writings on the analysis of Poisson count outcomes using methods for ordered polytomous outcomes. Personally, I would not worry too much because for the simple purpose of regression I don't think the ordered model makes important violations of the nature of count data, although I may be wrong. That is, the ordered polytomous model would certainly not estimate the same parameters as Poisson, but probably fit the data well and point to the same important predictors. Perhaps other Mplus Discussion readers have an opinion here. I don't think of the ordered polytomous model as requiring an assumption of a continuous underlying dimension (a "y*" latent response variable), but merely a model based on proportional odds (see Agresti's book), so that would not be an important consideration to me in this choice.
I am preparing to fit a path analysis model in which there are three sets of observed variables: (a) exogenous variables (either dummy-coded or continuous, approximately normally distributed); (b) mediators that are continuous and approximately normally distributed; and (c) a single outcome that is a count and which appears Poisson distributed (perhaps with zero-inflation).
We will be using Mplus 3.11 to analyze these data. Of interest are the indirect effects of the (a) variable set on (c). I notice that Mplus 3.11 does not support computation of indirect and total effects for models involving count outcomes. Can you tell me how to compute these by hand?
More generally, I'm also curious how much the calculations would change under different, but similar scenarios. For instance, how would the computations change if one of the mediating variables was binary or ordered categorical? What about if continuous latent variables are involved at the exogenous, mediating, or outcome stage of the model?
Thank you for any tips you can offer on this topic.
bmuthen posted on Friday, October 01, 2004 - 1:12 pm
With an ultimate count outcome, the indirect effect could pertain to the log rate that is modeled - so that this is the "y*" variable (in my terms) that we are used to when modeling categorical outcomes. The indirect effect would then simply be the usual product of regression coefficients and SEs computed using the Delta method (see Bollen's book). For count outcomes, Mplus uses ML and in the ML context a mediating variable that is categorical (binary or ordinal) is entering the prediction of the count outcome as an observed variable (not y*), i.e. a score that is treated as continuous. The same approach as above applies. And also for continuous latent variables.
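The by-hand calculation described in this reply can be sketched as follows. This is only an illustration under stated assumptions: the coefficient values are hypothetical, `a` is the slope of the mediator on the predictor, `b` is the slope of the log rate on the mediator, and the covariance between the two estimates is taken to be zero unless it is supplied from the estimated parameter covariance matrix.

```python
import math

def indirect_effect(a, se_a, b, se_b, cov_ab=0.0):
    """Product-of-coefficients indirect effect on the log-rate scale.

    The standard error uses the first-order delta method:
    Var(a*b) ~= b^2 * Var(a) + a^2 * Var(b) + 2*a*b*Cov(a, b).
    """
    estimate = a * b
    variance = b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2 + 2.0 * a * b * cov_ab
    se = math.sqrt(variance)
    return estimate, se, estimate / se

# Hypothetical estimates, not taken from any real output.
est, se, z = indirect_effect(a=0.50, se_a=0.10, b=0.30, se_b=0.05)
print(f"indirect effect = {est:.3f}, SE = {se:.4f}, z = {z:.2f}")
```

The result is an effect on the log rate, so it is read on the same scale as the count regression slopes themselves.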
Anonymous posted on Thursday, October 14, 2004 - 1:04 pm
Does Mplus version 3 handle mediating count variables? Thanks a lot.
bmuthen posted on Thursday, October 14, 2004 - 1:18 pm
Yes. This is done in the new ML estimation framework. Note that when used in the equation where it is an exogenous variable, as opposed to in the equation where it is endogenous, the count variable is treated as a continuous variable (an observed variable rather than the underlying y*-type log rate).
Anonymous posted on Friday, July 15, 2005 - 2:16 pm
I am glad to see that Mplus 3 can run a zero-inflated Poisson model, using cross-sectional or longitudinal data. I have a couple of questions on this issue:
1) It looks like Mplus only provides loglikelihood values and information criteria for the zero-inflated Poisson model. Is there any way to tell whether a model fits the data? 2) Can I use the loglikelihood values Mplus provides to conduct an LR test comparing the standard Poisson model with the zero-inflated Poisson model? 3) Can I run a zero-inflated negative binomial model in Mplus? 4) Can I run other zero-modified (e.g., zero-deflated and zero-truncated) Poisson models in the current version of Mplus?
I am reading Bengt's posting above on September 22, 2002 where he wrote "I am not aware of any writings on the analysis of Poisson count outcomes using methods for ordered polytomous outcomes."
My question is about the other way around: the analysis of discrete (i.e., 6 response options) data representing unequal count intervals (e.g., 0, 1-2, 3-5, 6-9, 10-19, 20-39, 40+) using continuous Poisson models.
The choice would appear to be between an ordered polytomous outcome (i.e., 0, 1, 2, 3, 4, 5, 6) or a recoded, somewhat discrete count approximation using interval midpoints (or endpoints).
A colleague told me Bengt had written a paper on adolescent alcohol use where a Poisson model was used, although there was some discussion as to whether response options or interval mid-points should have been used. I have been unable to clearly identify this paper, or the subsequent critique. Does this ring any bell?
Also, what about the more general question as the best approach for modeling data obtained from a survey item about number of drinks in the past month ... etc?
The zero-inflated Poisson is attractive because many of these high-school students did not drink and reported zero drinks in the past month. But I am worried that the data is not continuous.
Our respondents are nested within school, so with either the discrete or the continuous model, I would want to appropriately handle the nesting.
Any advice, references to your own work, pointers to Mplus examples, or other references would be greatly appreciated.
My 1999 Biometrics paper with Shedden worked with frequency of heavy drinking as the outcome. The outcome was a categorized count outcome such as 0, 1-2 times, 3-5 times etc. One could argue that this should be handled via some generalized Poisson model - earlier in this thread a "grouped zero-inflated Poisson" model was mentioned, but Mplus does not support that yet. I tend to want to treat such data as ordered categorical. This then takes care of the strong floor effect. Another approach is 2-part modeling which Mplus also supports and that modeling has the advantage of letting covariates have different impact on the probability of engaging in an activity at all vs how much.
When you refer to two-part modeling, do you refer to Example 7.25, which uses two classes? Or do you mean a manual split of the data into drinkers and non-drinkers, where I predict frequency of drinking using only the drinkers?
Or perhaps I misunderstand and you mean something else?
I have a mediating and outcome variable that are both count variables. The outcome variable is a categorical count variable. The mediating variable is continuous. Below is the syntax that I used to try to run my model, but for some reason the operation cannot be performed. Can you let me know what part of the syntax needs to be adjusted? This is a straightforward regression model with only one LV.
TITLE: 1-2-07 Instr mediation model2
DATA: FILE IS "C:\1-2-07.dat";
  VARIANCES=CHECK;
VARIABLE: NAMES ARE id a5 b2 d1 h5 h9 h17 needdepr recgende forsrv
    lngsocco lngmedia contlang socethn relginf2;
  MISSING are all (999);
  USEVARIABLES ARE a5 b2 d1 h5 h9 h17 needdepr recgende forsrv
    lngsocco lngmedia contlang socethn relginf2;
  COUNT ARE h5 h9 h17 forsrv;
  CATEGORICAL ARE forsrv;
MODEL:
  cultural_integration by lngsocco lngmedia contlang socethn b2;
  h9 on a5 cultural_integration recgende relginf2;
  forsrv on a5 cultural_integration recgende relginf2 h9;
  d1 with a5 cultural_integration recgende relginf2 forsrv needdepr;
  needdepr with a5 cultural_integration recgende relginf2 forsrv;
OUTPUT: SAMPSTAT RESIDUAL STANDARDIZED;
On an earlier post, you mentioned that "Mplus is developing missing data facilities for more types of outcomes". (posted on Tuesday, June 25, 2002 - 2:33 pm). I am developing some models with a count outcome. These models will be analyzed with longitudinal data, so missing data is an issue. Are there capabilities in the latest version of Mplus for handling missing data on count variables?
Yes. Mplus provides maximum likelihood estimation under MCAR (missing completely at random) and MAR (missing at random; Little & Rubin, 2002) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types.
I have a zero-inflated count outcome variable, 2 IVs and 3 mediators. My dataset has missing data. I am trying to compare the strength of different parameters using bootstrapping to derive confidence intervals. None of the estimators that are available for bootstrapping, however, work with count data or missing data. Can you please advise me of my options?
I had a brief question. I created a measurement model using both categorical and continuous variables. My final model has correlated errors between the indicators. I would like to create a full model using a count outcome and my measurement model as the exposure. However, I cannot seem to do this in Mplus. Is there any way in the current version of Mplus to include WITH statements with a count outcome?
Not enough information to say - please send your input, output, data, and license number to email@example.com.
Mary Campa posted on Tuesday, April 21, 2009 - 9:21 am
Hello. I am trying to estimate a multiple mediator path model with a negative binomial distributed Y, one continuous M, one Poisson distributed M, and a three-level X.
The convention in my discipline is to present standardized coefficients; however, Mplus will not provide these with the count mediating variables. Is there any way to get these estimates, or a reference I can provide as to why these estimates are not valid?
Regression with a count dependent variable does not involve a residual variance parameter. There is also not an underlying continuous DV for which a residual variance can be conceptualized like with logit or probit regression. That's why you don't see standardized count regression coefficients in the literature.
You can standardize with respect to the other variables.
Note that mediational models are perfectly valid in their raw, unstandardized form.
Mary Campa posted on Thursday, April 23, 2009 - 6:03 am
Thank you for your prompt response. Could you tell me what situation would create a significant raw effect and a very non-significant standardized effect (in both STDYX and STDY)? How should I interpret this?
Such big differences are rare, so there is probably something peculiar about this example. Please send the output and license number to firstname.lastname@example.org.
Rob Dvorak posted on Thursday, October 22, 2009 - 11:56 am
Hi Drs. Muthen,
I am running a model in which I have two latent variables and their interaction predicting two zero inflated negative binomial distribution outcomes, with one of the zinb outcomes predicting the final zinb outcome as well. Mplus ran the model and terminated normally, however, I just want to be sure that what I am getting is valid. I am wondering, because I am only getting a dispersion test for one of the zinbs (the one that serves as a mediator).
I have a SEM with a latent variable mediating the effect of an intervention on 3 outcome variables (2 count outcomes and 1 binary outcome) and want to be sure I’m calculating and interpreting the effects correctly. For the count outcomes, I used an approach recommended in a previous post and calculated the indirect effects by simply calculating the product of the unstandardized regression coefficient of the intervention on the latent mediator times the unstandardized regression coefficient of the latent mediator on the count outcome variable. Then, I calculated the SE using the delta method; this was also done for the binary outcome. Here are my questions: 1) Did I follow your recommended approach? 2) For the binary outcome, is it appropriate to calculate the % mediation by dividing the indirect effect B by the direct effect (without the mediator) B? For example, x-> m-> y B=-1.969 and x->y B=-0.697; -0.697/-1.969= .354, or 35.4% of the effect of the intervention was mediated by the mediator. 3) Could the approach used in question 2 be used to calculate the % mediation for the count outcomes? 4) Your previous post states that WITH statements can’t be included with count variables, and I’m unclear why? Is there a reference available for not including the correlations? 5) Would it be useful to exponentiate the product terms (i.e., indirect effects) to get a fractional interpretation? Thank you
1. It sounds like you did. Remember that the final dependent variable is the log rate. 2-3. This sounds questionable to me. 4. There is no model-estimated variance for a count variable. See one of the Agresti books on categorical data analysis. 5. I think this would work. When you exponentiate a log rate, it becomes a rate.
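Point 5 can be illustrated with a short sketch. The numbers are hypothetical, and the interval is formed on the log-rate scale and then exponentiated, which is a common convention rather than something prescribed in this thread.

```python
import math

def to_rate_ratio(log_effect, se, z_crit=1.96):
    """Exponentiate a log-rate effect and its symmetric confidence limits."""
    lower = log_effect - z_crit * se
    upper = log_effect + z_crit * se
    return math.exp(log_effect), math.exp(lower), math.exp(upper)

# Hypothetical indirect effect on the log rate and its delta-method SE.
rr, rr_lower, rr_upper = to_rate_ratio(log_effect=-0.20, se=0.08)
print(f"rate ratio = {rr:.3f}, 95% CI = ({rr_lower:.3f}, {rr_upper:.3f})")
```

A rate ratio below 1 is read as a proportional reduction in the expected count, and an interval that excludes 1 corresponds to a log-scale interval that excludes 0.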
Hi - I'm new to path modeling and Mplus (so forgive me if this question is simple!), but I was wondering if there is a way to get fit statistics for path models that have an outcome variable with a negative binomial distribution?
Chi-square and related fit statistics are not available for these models because means, variances, and covariances are not sufficient statistics for model estimation. In these cases, people compare nested models using loglikelihood difference testing and also look at BIC.
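As a rough sketch of the loglikelihood difference test and BIC comparison mentioned here: the loglikelihoods, parameter counts, and sample size below are hypothetical. The scaling correction factors `c0` and `c1` matter only when a robust estimator such as MLR reports them; leaving both at 1 gives the ordinary likelihood ratio test. The chi-square tail probability is computed with the standard library so the example is self-contained.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                 # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))      # Q(1/2, y)
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1).
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def lr_difference_test(ll0, p0, ll1, p1, c0=1.0, c1=1.0):
    """Nested model (ll0, p0 parameters) against comparison model (ll1, p1)."""
    cd = (p0 * c0 - p1 * c1) / (p0 - p1)   # difference-test scaling correction
    trd = -2.0 * (ll0 - ll1) / cd
    df = p1 - p0
    return trd, df, chi2_sf(trd, df)

def bic(ll, p, n):
    """BIC = -2*loglikelihood + (number of parameters) * ln(sample size)."""
    return -2.0 * ll + p * math.log(n)

# Hypothetical values for a restricted and a less restricted model.
trd, df, p_value = lr_difference_test(ll0=-1520.0, p0=8, ll1=-1514.0, p1=10)
bic_restricted = bic(-1520.0, 8, 500)
bic_full = bic(-1514.0, 10, 500)
print(trd, df, p_value, bic_restricted, bic_full)
```

The two criteria need not agree: the difference test asks whether the added parameters improve the loglikelihood significantly, while BIC also penalizes them by ln(n) each.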
I have a multi-wave sample of 5000 children involved with the child welfare system. I have measurements of behavior problems (approximately normally distributed) at three times, T2, T3, and T4. I have counts of the number of out-of-home placements experienced by the children at three time intervals: from T1 to T2, from T2 to T3, and from T3 to T4. The counts are highly skewed. For any given interval, about 85% of children experience zero placements, about 7-8% experience one placement only, and about 7-8% experience multiple placements. I am modeling the effects of: behavior on placement, placement on behavior, behavior on behavior, and placement on placement.
For the regressions on placement counts, I used a zero-inflated negative binomial model. This model worked reasonably well (after appropriate starting values were supplied) and provided sensible/interpretable results. Due to the skewness issue, a journal reviewer has recommended that the placement counts be modeled either as binary or ordered categorical variables. My question is: Are the placement count data so skewed that the zero-inflated negative binomial is not recommended and, thus, a binary or ordered categorical model would be preferred?
Just hoping to get an opinion on this. I realize there may not be an “answer.”
Thanks in advance.
Rob Dvorak posted on Monday, May 03, 2010 - 4:39 pm
Hi Drs. Muthen, I have a latent variable interaction predicting a zinb outcome. There is a significant path from the latent variable interaction to the count portion of the model. I computed the simple slopes à la Aiken & West (i.e., +/- 1 SD on one of the latent variables), and computed the SEs of these slopes using asymptotic covariances. I then exponentiated the simple slope coefficients, making them into incidence rate ratios. Is this an appropriate way to compute simple slopes for non-linear models? If not, do you know of a reference for computing these? Thanks in advance.
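For readers following along, the calculation described in this post can be sketched as below. It illustrates only the arithmetic and is not a ruling on whether the approach is appropriate. All values are hypothetical: `b1` is the slope for the focal predictor, `b3` the slope for the interaction, `w` the chosen moderator value (here +/- 1, assuming the moderator's SD is 1), and the variances and covariance come from the asymptotic covariance matrix of the estimates.

```python
import math

def simple_slope(b1, b3, w, var_b1, var_b3, cov_b1_b3):
    """Simple slope b1 + b3*w on the log-rate scale, its SE, and the
    exponentiated slope (an incidence rate ratio)."""
    slope = b1 + b3 * w
    variance = var_b1 + w ** 2 * var_b3 + 2.0 * w * cov_b1_b3
    se = math.sqrt(variance)
    return slope, se, math.exp(slope)

# Hypothetical estimates and asymptotic (co)variances.
hi_slope, hi_se, hi_irr = simple_slope(0.40, 0.20, 1.0, 0.01, 0.0049, 0.001)
lo_slope, lo_se, lo_irr = simple_slope(0.40, 0.20, -1.0, 0.01, 0.0049, 0.001)
print(hi_slope, hi_se, hi_irr)
print(lo_slope, lo_se, lo_irr)
```

The standard error applies to the slope on the log-rate scale. An interval for the rate ratio would be obtained by exponentiating the interval endpoints, not the SE.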
Yes. In the y on m regression it is treated as a continuous variable. For issues related to the computation of indirect effects, see the following paper which is available on the website:
Muthén, B. (2011). Applications of causally defined direct and indirect effects in mediation analysis using SEM in Mplus.
Chris posted on Sunday, September 02, 2012 - 5:51 pm
Thank you Linda. Much appreciated.
Tom Booth posted on Thursday, August 01, 2013 - 12:27 am
I have a CFA model which forms part of a larger SEM which uses both count and categorical variables.
I have been asked by a reviewer to provide some information on how well the model fits. As we only have one model, AIC and BIC are not useful. Also, our sample size is moderate to large (700+) so I am not sure how useful the chi-square for the categorical and count portions of the model will be.
I was considering the following as options:
1 - fix all loadings in the CFA to nominally small values (.01), and compare AIC and BIC from a pseudo-model of no association to the CFA with free loadings.
I have a path model with three outcomes measured over three time points. Two of the outcomes are continuous and one is a count, and I am using numerical integration. I need to test whether some parameters are significantly different (e.g. stronger) than others. I know that with count outcomes that standardized estimates are not available, and I also read above that bootstrapping is not available when numerical integration is required. Would it be feasible to answer this question by comparing the fit of two models, one where the parameters to be compared are constrained to be equal, and another where they are free?
Yvonne LEE posted on Tuesday, April 22, 2014 - 8:54 am
As a novice to Mplus, I am conducting modeling on 'rape tendency' on a group of offenders and have made enquiries earlier.
I have 2 indicators for my rape tendency DV i.e. self-report number of rape and official record of rape count. The former indicator scored 0 on 64% of the cases and the latter scored 0 on 79%. Because of the very low frequency on higher count, I collapse these 2 indicators into 4 levels. For the self-report rape, I collapse into 0, 1, 2, >=3 and the same for the official rape count.
Question 1: Should I treat the 2 indicators as categorical or count data? The estimation model terminated normally when treating as count data but error message 'THE RESIDUAL COVARIANCE MATRIX (THETA) IS NOT POSITIVE DEFINITE.' appears if treating as categorical data. Pls advise.
Question 2: Should I use two-part modeling ? I have tried and the estimation model terminated normally.
You should treat this as a categorical variable not a count variable.
I would not recommend two-part modeling.
Yvonne LEE posted on Wednesday, April 23, 2014 - 5:16 pm
Thanks for your advice. I have run analysis treating them as categorical variable but error message of 'non-positive definite covariance matrices' was shown, involving variable PAbuse. Pls advise how to go about.
MODEL: f1 by EAbuse PAbuse SAbuse ENeglect CTS_PV; f2 by Rape0123 SES0123; f2 on f1;
One thing is that both categorical indicators have a U-shaped distribution. Does it matter? Major problem observed in TECH4 is correlation between F2 and SES0123 is 1.14. Also, Rsquare for SES0123 is undefined.
I would like to seek your advice on a convergence problem. I have a count outcome, with three latent mediators, with survey data. I am unable to get the model to converge when I treat the mediators as latent variables. However, the model converges with summary measured variables (treating them as observed).
Could you take a look at my code? Please let me know anything that I could try.
Thank you very much.
count is v201;
Categorical are hlt purc visit gooutr neglr arguer refsr burnfr negsex negcon;
Cluster = v021;
Weight = weigh;
Stratification = v022;
Subpopulation = subpop_fa eq 1;
Analysis: Type = complex;
  integration = montecarlo (75);
  Parameterization=theta;
Model:
  v511 ON v133 v012 work2 head urban house2 house3 house4 house5
    polyfir polysec reage reeduc2;
  f1 by hlt purc visit;
  f1 ON v133 v511 v012 work2 head urban house2 house3 house4 house5
    polyfir polysec reage reeduc2;
  f2 by negsex negcon;
  f2 ON v133 v511 v012 work2 head urban house2 house3 house4 house5
    polyfir polysec reage reeduc2;
  f3 by gooutr neglr arguer refsr burnfr;
  f3 ON v133 v511 v012 work2 head urban house2 house3 house4 house5
    polyfir polysec reage reeduc2;
  v201 ON v133 v511 f1 f2 f3 v012 work2 head urban house2 house3 house4 house5
    polyfir polysec access reage reeduc2;
  f1 with f2;
  f2 with f3;
  f1 with f3;
Kim Kiely posted on Tuesday, March 17, 2015 - 8:52 pm
Hi Linda and Bengt,
Many years ago (Sept 22 2002), Bengt posted in this thread:
"I am not aware of any writings on the analysis of Poisson count outcomes using methods for ordered polytomous outcomes. Personally, I would not worry too much because for the simple purpose of regression I don't think the ordered model makes important violations of the nature of count data, although I may be wrong. That is, the ordered polytomous model would certainly not estimate the same parameters as Poisson, but probably fit the data well and point to the same important predictors. "
I was wondering if your thoughts on this remain the same?
I have the following outcome data, which strictly speaking are counts (I also expect are underdispersed with a mean=1.17 and variance=0.78):
I have analyzed these as counts (with NegBin and Poisson) and as ordinal in Stata, and am about to start replicating these models in Mplus v7. Generally my ordinal models have a relatively better fit, and given the frequencies it seems reasonable to treat the outcome as such. But I would be interested to get a second opinion.
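The underdispersion mentioned in this post can be checked with a one-line ratio. The sketch below uses the mean and variance quoted above and relies on the fact that a Poisson variable has variance equal to its mean. The function name is only illustrative.

```python
def dispersion_index(mean, variance):
    """Variance-to-mean ratio: 1 under Poisson, below 1 if underdispersed."""
    return variance / mean

ratio = dispersion_index(mean=1.17, variance=0.78)
is_underdispersed = ratio < 1.0
print(f"variance/mean = {ratio:.3f}, underdispersed: {is_underdispersed}")
```

Since the negative binomial variance is the mean plus a non-negative dispersion term, that model can accommodate overdispersion but not a ratio below 1, which may be one reason the ordinal specification fits these data relatively better.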
I am building a cross-lagged panel model with one count variable and one continuous variable across four longitudinal waves. How should I specify the within-wave correlations (T1, T234) between these two variables? Are there any Mplus examples?
Following up on my previous question, I will need to use a ZIP model for my data. The relationship between the continuous variable and the zero-inflated part has to be estimated as well (after creating a latent variable for the count variable, the correlation between part of the count variable and the continuous variable can be estimated, but the model does not recognize the zero-inflated part of the count variable). Any suggestions?
I assume you specified a factor as measured by the continuous outcome and the count outcome where the factor variance is fixed at 1 and one loading is free to capture the residual covariance. Seems like you can do the same using the inflation part as an indicator of a second factor.
Given that my outcome count data are highly zero-inflated, I am interested in comparing different count-based distribution models of my data to determine the best-fitting model so that I may specify my data accordingly for other analyses. Specifically, I would like to compare Olsen and Schafer's two-part model to Poisson, NB, ZIP, ZINB, and NBH.
I was wondering if you could please tell me which information criterion in Mplus I should use to compare the two-part model to these other distributional models. BIC? Vuong test?
Thanks for your suggestion. I have since purchased your new book and read the relevant section. It was helpful in answering most of my questions. The section of the chapter on model comparison says a Vuong test should be used to compare non-nested models, which I need to do. I was wondering if you could please tell me whether Mplus can compute the Vuong test and, if so, how?
If Mplus cannot compute it, could you please let me know what information from the Mplus output is needed to put in the formula to do the Vuong test by hand?
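As a hedged sketch of what a by-hand Vuong test needs: the statistic is built from per-observation loglikelihood contributions under each of the two non-nested models, so the summary loglikelihood alone is not enough. Whether and how those casewise values can be exported from Mplus is not established in this thread. The inputs below are hypothetical lists, and this is the uncorrected form of the statistic (no adjustment for the number of parameters).

```python
import math

def vuong_statistic(ll_a, ll_b):
    """Uncorrected Vuong z for two non-nested models fitted to the same cases.

    ll_a and ll_b hold each observation's loglikelihood under models A and B,
    in the same case order. Positive z favors A, negative z favors B.
    """
    n = len(ll_a)
    diffs = [a - b for a, b in zip(ll_a, ll_b)]
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / n
    z = math.sqrt(n) * mean / math.sqrt(variance)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided

# Hypothetical casewise loglikelihoods for four observations.
ll_model_a = [-1.0, -2.0, -1.5, -0.5]
ll_model_b = [-1.2, -2.1, -1.9, -0.8]
z, p = vuong_statistic(ll_model_a, ll_model_b)
print(f"Vuong z = {z:.3f}, two-sided p = {p:.4f}")
```

With only four cases the normal approximation is not meaningful. The toy data are there only to show the arithmetic.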
Hello, I read in the Mplus User's Guide that residual variances are not estimated for count variables: "The inflated part of censored outcomes, binary outcomes, ordered categorical (ordinal) outcomes, count outcomes, and the inflated part of count outcomes have no variance parameters" (p. 639), and indeed I have seen this in my own Mplus output before--that no R-square is given, for this reason.
However, I am regressing count variables in a latent growth model on time-varying covariates and it is working.
How can I regress a variable that does not have residual variances defined on a TVC, if it is already loading onto the I, S, and Q LGM factors? (I am not using Theta parameterization.) If its residual variance is undefined, what does the regression parameter mean? I am unsure how to interpret this.
Can you explain whether it is possible to regress a count indicator in a factor model on another variable? How can this be possible? Thank you.
Regression analysis with a count DV indeed has no residual variance parameter in the regular Poisson version of count modeling. You can still estimate an intercept and slopes. See for example Chapter 6 in our new book, which also discusses more elaborate models.
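To make the interpretation concrete, here is a minimal sketch with hypothetical coefficient values. In Poisson regression the intercept and slope describe the log of the expected count, and the variance equals that expected count, so there is no separate residual variance to estimate.

```python
import math

def expected_count(intercept, slope, x):
    """Poisson regression mean: exp(intercept + slope * x)."""
    return math.exp(intercept + slope * x)

# Hypothetical estimates on the log-rate scale.
intercept, slope = 0.2, 0.35
mu_at_0 = expected_count(intercept, slope, 0.0)
mu_at_1 = expected_count(intercept, slope, 1.0)
rate_ratio = mu_at_1 / mu_at_0   # equals exp(slope) for a one-unit change
print(mu_at_0, mu_at_1, rate_ratio)
```

So a slope for a time-varying covariate is read as a multiplicative change in the expected count per unit of the covariate, holding the growth factors fixed.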
Bengt, the quote from the User's Guide above suggests that the (zero) inflated portion of a count outcome does not have a defined residual variance; but that the COUNT portion (if it has both inflated and count portions) does. Is that true?
This seems intuitive because there is a dispersion parameter for a zero-inflated negative binomial variable.