I was wondering whether a hurdle model with a Poisson distribution can be estimated in Mplus. Secondly, I am comparing models (Poisson, NB, ZIP, ZINB, hurdle NB) within an SEM context. I find that ZIP is the most appropriate model when there is a correlation between the dependent count variable and another dependent (continuous) variable. But if I fit the same SEM model without that correlation, I find that ZINB is more appropriate. Is it OK to assume that the full model is the one I should base my choice on? Or is there something I am missing? The correlation is fitted like this:
A hurdle model with a normal distribution can be done in Mplus. This splits the model into a binary part (event or not) and a continuous part (how much of the event), followed by parallel process growth modeling. This is automated in Mplus via DATA TWOPART (see UG and also the handouts for Topic 3 or 4; and see the Olsen-Schafer JASA article referred to).
The analogous approach can be used for counts with Poisson (but I'm not sure DATA TWOPART can be used). The "continuous" part is then a truncated Poisson, where the zero value is not included. Mplus handles the truncated Poisson.
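As a minimal sketch of the two parts described above (not Mplus code, and not how Mplus computes it internally), the probability function of a Poisson hurdle model can be written in plain Python. The names `p_event` and `lam` are illustrative assumptions: `p_event` is the probability of crossing the hurdle and `lam` is the rate of the underlying Poisson before truncation.

```python
import math

def poisson_pmf(k, lam):
    """Ordinary Poisson probability P(Y = k)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def truncated_poisson_pmf(k, lam):
    """Zero-truncated Poisson: P(Y = k | Y > 0), defined for k >= 1."""
    if k < 1:
        return 0.0
    return poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

def hurdle_poisson_pmf(k, p_event, lam):
    """Binary part decides zero vs. positive; truncated part gives the count."""
    if k == 0:
        return 1.0 - p_event
    return p_event * truncated_poisson_pmf(k, lam)

# Hypothetical parameter values, chosen only for illustration.
total = sum(hurdle_poisson_pmf(k, 0.3, 2.0) for k in range(60))
print(round(total, 6))  # the probabilities sum to 1 (up to rounding)
```

All zeros come from the binary part, which is what distinguishes this from a zero-inflated model.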
I would use the full model where the 2 DVs are correlated as you have done. It is likely that both equations leave out covariates that are correlated.
Rob Dvorak posted on Saturday, July 23, 2011 - 10:24 am
Is it possible to do a multilevel hurdle model in Mplus?
The term hurdle model seems to come from regression with a count DV. This is possible in Mplus. Here, a binary variable describes whether the event occurred or not, and a zero-truncated Poisson variable describes how many times the event occurred. Another term is two-part modeling, which seems to be used with continuous DVs. Mplus has special functions to set this model up easily (DATA TWOPART). In both cases one can add a 2-level structure with random intercepts and slopes.
Rob Dvorak posted on Saturday, July 23, 2011 - 11:15 am
Excellent! Thanks for your help.
Rob Dvorak posted on Monday, November 19, 2012 - 10:45 am
I have a question about the centering option, which may or may not be relevant. I am running a multilevel negative binomial hurdle model with centering at the group mean. I'm predicting the probability of alcohol use on any given day (the logistic portion), followed by the number of drinks consumed on drinking days (the count portion following the hurdle). I'm predicting these outcomes based on mean daily mood. In the logistic model, mood is centered at the group (i.e., person) level using all available days for that person. My issue is with the hurdle portion. Is centering reconfigured for this portion of the model based on the number of drinking days for each person, or is it using the centering from the logistic model (i.e., total days)? It seems, methodologically, that I should re-center the data based only on drinking days for the count portion if I want to look at cross-level interactions. Is that correct?
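To make the two centering choices in this question concrete, here is a small Python sketch using made-up daily records for one person. It only illustrates the arithmetic of the two options; it does not show which one Mplus applies, and the data and the helper `center` are hypothetical.

```python
from statistics import mean

# Hypothetical daily records for one person: (mood, number of drinks).
days = [(2.0, 0), (4.0, 3), (3.0, 0), (5.0, 2), (1.0, 0)]

def center(values, reference):
    """Subtract the mean of `reference` from each element of `values`."""
    m = mean(reference)
    return [v - m for v in values]

all_mood = [mood for mood, _ in days]
drinking_mood = [mood for mood, drinks in days if drinks > 0]

# Option 1: mood on drinking days, centered on the person's mean over all days.
centered_on_all_days = center(drinking_mood, all_mood)

# Option 2: mood on drinking days, centered on the mean of drinking days only.
centered_on_drinking_days = center(drinking_mood, drinking_mood)

print(centered_on_all_days)       # [1.0, 2.0]
print(centered_on_drinking_days)  # [-0.5, 0.5]
```

Under option 1 the centered predictor in the count part no longer averages to zero within person, which is the concern raised above.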
I gather from the above that in a hurdle model, those who cross the hurdle take on a value higher than zero. The model takes into account who the second-stage DV pertains to, and in the second-stage, zero is not a valid response option.
In our data, we have whether a person drank alcohol (0/1), and then in the second stage, a count of problems experienced due to drinking. But zero is a response option for the number of problem indicators (i.e., in the second stage).
This seems problematic to me because it deviates from the assumptions of the hurdle model, right?
I was considering other modeling options such as a knownclass model within the mixture modeling framework, but that doesn't seem appropriate either, because there are no responses for nondrinkers.
If we can do these models in Mplus, which model would be appropriate?
Or rather, Bengt and Linda, does it sound that our question would be best addressed with a zero-inflated negative binomial (or Poisson), given this explanation from the manual:
"With a zero-inflated Poisson model, two regressions are estimated. The first ON statement describes the Poisson regression of the count part of u1 on the covariates x1 and x3. This regression predicts the value of the count dependent variable for individuals who are able to assume values of zero and above. The second ON statement describes the logistic regression of the binary latent inflation variable u1#1 on the covariates x1 and x3. This regression predicts the probability of being unable to assume any value except zero" (Mplus Manual, p. 28).
So the first stage would distinguish zeros due to being a nondrinker (structural zeros) from zeros due to being a drinker but having no problems (true zeros on the final DV)?
Thanks, Linda. Does Mplus offer a statistic in ZINB output that reveals the quality of distinction between structural and true zeros?
That is, if both nondrinkers and drinkers can take on zeros for having alcohol-related problems (and the drinkers can also take on higher scores), can Mplus tell us the quality of the distinction between the two types of zeros?
This might be similar to the entropy statistic in latent class analysis, which gives information about the quality of distinction between classes.
Hi Linda, I do like the idea of getting the classification table when modeling our data according to Example 7.25.
However, might we also do this as a KNOWNCLASS model where the nondrinker class has the probability of taking on a value above zero for drinking problems of essentially zero, in the nbi portion of the model? This would be done using the -15 constraint for the intercept of the DV, as shown in the example.
Basically, since we know our classes, should we do something just like Example 7.25, but as a KNOWNCLASS?
Thanks, Bengt. We do know who the members of the two classes are (drinkers and nondrinkers). The Example 7.25 seems very appropriate--we really like it. But it occurred to us that in Example 7.25, class membership is unknown, and a classification table is produced.
Would it be best to alter Example 7.25's model and treat it as KNOWNCLASS?
In ZINB and ZIP you consider class membership as unknown - zero reported drinking in a certain period can come from 2 different classes of people: non-drinkers and drinkers who happened to not drink at that point in time. In this case the classification table can be of interest to study.
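The idea that an observed zero can come from either class can be sketched numerically. The snippet below is an illustration in Python under a plain ZIP model, not Mplus output; `pi` (the probability of the always-zero class) and `lam` (the Poisson mean in the other class) are assumed names with hypothetical values.

```python
import math

def zip_prob_zero(pi, lam):
    """P(Y = 0) under ZIP: structural zeros plus Poisson zeros."""
    return pi + (1.0 - pi) * math.exp(-lam)

def prob_structural_given_zero(pi, lam):
    """Posterior probability that an observed zero belongs to the zero class."""
    return pi / zip_prob_zero(pi, lam)

# Hypothetical values: the larger the Poisson mean, the less likely it is
# that a zero came from the count class, so the two kinds of zeros are
# easier to tell apart.
for lam in (0.5, 3.0):
    print(lam, round(prob_structural_given_zero(0.4, lam), 3))
```

Posterior probabilities near 0 or 1 correspond to a clean classification table; values near 0.5 mean the two sources of zeros are hard to distinguish.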
If you consider the class membership to be known there is no misclassification - the classification table will have zero off-diagonal elements. Where does "the quality of the distinction between the two types of zeros" that you mention come into play in this case?
You know their drinking status at one point in time. They report not drinking at that time. That does not mean they are non-drinkers. This is why you want to do the analysis with latent classes. Then one class is the true non-drinkers.
Linda, can you tell me why the model would not run as ZINB, but would run as a ZIP? Mplus seemed to not even give a full output file when I tried to run it as ZINB--as if it stopped the analysis prematurely, and created just the beginning of an output file.
But when I changed (nbi) to just (i), the model ran fine and produced a full output file. Why would that occur?
We see the Loglikelihood values and scaling parameters for comparing 2 nested models (with the H0 label).
How do we also compare against the null model H1? I don't see the fit of the null model (H1) expressed as a Loglikelihood in the output, against which we can compare. Should we think of this H1 model as the "unrestricted model"--would that be the proper term?
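For the nested comparison mentioned above (two H0 models, each with a loglikelihood and a scaling correction factor), the commonly documented scaled difference test can be sketched in Python as follows. This covers only the nested H0 comparison, not the H1 question. The function name and the numbers are hypothetical, and the formula should be checked against the Mplus documentation on difference testing before use.

```python
def scaled_lr_difference(l0, c0, p0, l1, c1, p1):
    """Scaled likelihood-ratio difference test for nested models.

    l0, c0, p0: loglikelihood, scaling correction factor, and number of
                parameters of the nested (more restricted) model.
    l1, c1, p1: the same quantities for the comparison (less restricted) model.
    Returns the scaled test statistic and its degrees of freedom.
    """
    cd = (p0 * c0 - p1 * c1) / (p0 - p1)  # difference-test scaling correction
    trd = -2.0 * (l0 - l1) / cd           # scaled chi-square difference
    return trd, p1 - p0

# Hypothetical values, for illustration only.
trd, df = scaled_lr_difference(l0=-1000.0, c0=1.2, p0=5,
                               l1=-990.0, c1=1.1, p1=7)
print(round(trd, 2), df)
```

The statistic is referred to a chi-square distribution with `df` degrees of freedom. When both correction factors equal 1 it reduces to the ordinary likelihood-ratio statistic.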
We are using mixture model 7.25 for our ZINB analyses, with class predicting who can and cannot take on a positive score for the count regression. I am surprised to see that some cases are being dropped due to missing data on the dependent variable. This is surprising because the Mplus manual states that even with MLR estimation and count DVs, Mplus handles missing data on the DVs with ML methods.
The message we are receiving is: Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis. Number of cases with missing on all variables except x-variables: 52
Why is Mplus excluding these cases, who according to the message are missing their score on the DV, rather than employing ML treatment of this missing data?
ML handling of missing data does not use a person who has missing on all of his DVs. Each person needs at least one DV that he is not missing on. Then you can borrow information from that DV when considering the DV with missingness.
Take the case of regression of y on x with some people having missing data on y. They are not used in ML estimation of the slope and intercept. They contribute to the estimation of the marginal distribution of x, but those parameters are not part of the regression model.
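The regression example can be illustrated with a few lines of Python. This is a sketch with made-up data, and `fit_line` is a hypothetical helper. For a normal linear regression the ML estimates of intercept and slope coincide with least squares on the cases where y is observed, so the helper simply drops cases with missing y, which mirrors the point that such cases carry no information about these two parameters.

```python
def fit_line(pairs):
    """Least-squares intercept and slope using only cases with observed y."""
    obs = [(x, y) for x, y in pairs if y is not None]
    n = len(obs)
    mx = sum(x for x, _ in obs) / n
    my = sum(y for _, y in obs) / n
    sxx = sum((x - mx) ** 2 for x, _ in obs)
    sxy = sum((x - mx) * (y - my) for x, y in obs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data: the last two cases have x but are missing y.
complete = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
with_missing = complete + [(5.0, None), (6.0, None)]

# Adding the y-missing cases leaves the estimates unchanged.
print(fit_line(complete) == fit_line(with_missing))  # True
```

The extra x values would only matter for the marginal distribution of x, which is not part of the regression model.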
In Example 7.25, when there are multiple predictors of class and score on the DV (as with variables x1 and x3), I have the following questions:
1. x1 and x3 are correlated by default, without this needing to be written explicitly into the code, right?
2. These correlations are set to be equal across classes by default, then, right? So, to allow them to differ between class 1 and class 2, we would need to code explicitly for these differing correlations in the two classes, if that is what we hypothesize. Is that correct?
1. Yes. The model is estimated conditioned on these variables. 2. No, they are not equal across classes. The correlations are not model parameters.
Moin Syed posted on Thursday, February 13, 2014 - 12:02 pm
Is it possible to use TWOPART (or some other procedure) with a predictor rather than a DV? I want to separate out the binary vs. continuous aspects of a predictor, but I get an error about lack of variance in the binary component. Any help would be appreciated.
If you have a growth model that is two-part I can see predicting from the growth factors, but with a single variable I think you might just as well use a binary dummy predictor (event observed or not) - assuming that the DV has variation for each of those categories.
Thank you Prof. Muthen. Yes, I did think about using growth curves for the multiple time points. However, given that I have many variables that vary over time, I am not sure how to go about it, as having too many parallel growth curves in the model may become complex.
Thank you Prof. Muthen. Let me try this out. Many thanks.
Lois Downey posted on Wednesday, February 15, 2017 - 9:28 am
I have a very elementary question related to hurdle and zero-inflated models.
I naively expected the coefficient for the binary part of these models to match the coefficient that would have been produced with a logistic regression model of an outcome in which the original values of 1 and above were recoded to 0, and the original value of 0 was recoded to 1. But that doesn't seem to be the case.
The discrepancy you see is because a zero-inflated model uses a mixture at y=0, that is, some people with y=0 are in the binary part of the model but some others are in the regular count part of the model including 0. See chapter 6 of our new book where different count models are compared.
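A small numeric sketch in Python (hypothetical parameter values, not Mplus output) shows why the two coefficients need not match. In a zero-inflated Poisson, the probability of observing y=0 is larger than the inflation probability, so a logistic regression on the observed zero indicator targets a different quantity than the binary part of the model. The names `pi` and `lam` are assumptions for the inflation probability and the Poisson mean.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def observed_zero_prob_zip(pi, lam):
    """P(Y = 0) in ZIP: inflation-class zeros plus zeros from the count part."""
    return pi + (1.0 - pi) * math.exp(-lam)

pi, lam = 0.2, 1.5  # hypothetical values
p_zero = observed_zero_prob_zip(pi, lam)

# The logit modeled by the inflation part vs. the logit of an observed zero.
print(round(logit(pi), 3), round(logit(p_zero), 3))
```

The gap between the two logits shrinks as `lam` grows, because the count part then produces very few zeros.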
Lois Downey posted on Wednesday, February 15, 2017 - 1:28 pm
And how about the hurdle model, where cases with y=0 can't be in the regular count part?
Is the discrepancy there caused by the fact that some people with y>0 are in the binary part, and others with y>0 are in the regular count part?
The hurdle model is a two-part model so I think there should not be a difference. If you like, you can send the two outputs to Support along with your license number.
anonymous Z posted on Tuesday, November 07, 2017 - 1:53 pm
Dear Drs. Muthen,
I am working on the syntax for a hurdle model. The User Guide gives the example of two-part model for continuous variable. I wondered how the syntax should be if the outcome variable is a count variable.
Below is my analysis with H_RISK being a count variable. How should I make changes to the syntax to reflect H_RISK being a count variable?
Thanks so much!
Title: HIV two part
Data: FILE = OP_HIVriskbeh_LONG.dat;
DATA TWOPART:
  NAMES = H_RISK;
  BINARY = binRisk;
  CONTINUOUS = contRisk;
VARIABLE:
  NAMES = ID time H_RISK H_BEH H_KNOW;
  USEVARIABLES ARE ID time binRisk contRisk;
  CATEGORICAL ARE binRisk;
  MISSING = ALL (-999999);
  cluster = ID;
  within = time;
Analysis: type = twolevel random;
MODEL:
  %within%
  S1 | binRisk on time;
  S2 | contRisk on time;
  binRisk with S1;
  contRisk WITH S2;
  binRisk with contRisk;
  binRisk with s2;
  contRisk with s1;
It says no dispersion but you may still need the hurdle part.
anonymous Z posted on Friday, November 10, 2017 - 5:48 pm
Dear Dr. Muthen,
Could you explain further? I assume that the reason to use the hurdle model is because the data are dispersed due to excess of zeros. If the data are not dispersed, is the hurdle model still necessary?
The negative binomial dispersion parameter is one more parameter than what the Poisson has. This handles a certain degree of overdispersion. But this is not the same as having a special model part for zeros like zero-inflated Poisson, zero-inflated negbin, or hurdle negbin. Zero-inflated adds a class of subjects at zero and hurdle also allows a special model part for zeros. Also, the zero class and the hurdle part can each be regressed on their own covariates. In sum, negbin's dispersion parameter is different from hurdle modeling.
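The distinction can be illustrated with a short Python sketch. It assumes the common NB2 parameterization, in which the variance is mu + alpha * mu^2; whether this matches the parameterization in a given Mplus output should be checked against the documentation. The function names and values are hypothetical.

```python
import math

def poisson_p0(mu):
    """P(Y = 0) under a Poisson with mean mu."""
    return math.exp(-mu)

def negbin_p0(mu, alpha):
    """P(Y = 0) under NB2 with mean mu and dispersion alpha > 0."""
    return (1.0 + alpha * mu) ** (-1.0 / alpha)

def negbin_variance(mu, alpha):
    """NB2 variance: the Poisson variance mu plus an extra alpha * mu**2."""
    return mu + alpha * mu ** 2

mu = 2.0
print(round(poisson_p0(mu), 3))      # Poisson share of zeros
print(round(negbin_p0(mu, 1.0), 3))  # NB share of zeros, here 1/3
print(negbin_variance(mu, 1.0))      # 6.0, versus 2.0 for the Poisson
```

The dispersion parameter raises the share of zeros, but that share stays tied to the mean and to a single alpha. A zero-inflated or hurdle part instead gives the zeros their own equation, with their own covariates.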