Mplus Discussion >> Hurdle models

Hurdle models

Mplus Discussion > Categorical Data Modeling >

Brondeel Ruben posted on Thursday, February 17, 2011 - 8:37 am

Dear,

I was wondering if a hurdle model with a Poisson distribution can be estimated in Mplus.
Secondly, I am comparing models (Poisson, NB, ZIP, ZINB, hurdle NB) within a SEM context. I find that ZIP is the most appropriate method in a model with a correlation between the dependent count variable and another dependent (continuous) variable. But if I fit the same SEM model without that correlation, I find that ZINB is more appropriate. Is it OK to assume that the full model is the one I should base my choice on? Or is there something I am missing?
The correlation is fitted like this:

f BY y1 y2;
f@1;

Finally, I am also (for different data) interested if a hurdle model with a normal distribution exists. And if so, can it be done in Mplus?

Thanks in advance.

Bengt O. Muthen posted on Thursday, February 17, 2011 - 4:14 pm

A hurdle model with a normal distribution can be done in Mplus. This splits the model into a binary part (event or not) and a continuous part (how much of the event), followed by parallel process growth modeling. This is automated in Mplus via DATA TWOPART (see UG and also the handouts for Topic 3 or 4; and see the Olsen-Schafer JASA article referred to).

The analogous approach can be used for counts with Poisson (but I'm not sure DATA TWOPART can be used). The "continuous" part is then a truncated Poisson, where the zero value is not included. Mplus handles the truncated Poisson.

I would use the full model where the 2 DVs are correlated as you have done. It is likely that both equations leave out covariates that are correlated.
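
As an illustrative aside, the hurdle Poisson structure described above (a binary part for event or no event, plus a zero-truncated Poisson for how many) can be sketched numerically. This is plain Python, not Mplus input; the function names and the values p_event = 0.4 and lam = 2.0 are hypothetical choices made only for the example.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Ordinary Poisson probability P(Y = k)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def hurdle_poisson_pmf(k: int, p_event: float, lam: float) -> float:
    """Hurdle Poisson: a binary part decides zero vs. positive,
    and a zero-truncated Poisson describes the positive counts."""
    if k == 0:
        return 1.0 - p_event
    # Zero-truncated Poisson: rescale by the probability of a positive count.
    return p_event * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

# Hypothetical values: 40% of cases cross the hurdle, rate 2.0 in the count part.
probs = [hurdle_poisson_pmf(k, p_event=0.4, lam=2.0) for k in range(60)]
total = sum(probs)  # should be very close to 1
```

The zero probability is set entirely by the binary part, so it does not depend on lam.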

Rob Dvorak posted on Saturday, July 23, 2011 - 10:24 am

Is it possible to do a multilevel hurdle model in Mplus?

Bengt O. Muthen posted on Saturday, July 23, 2011 - 10:52 am

Short answer, yes.

Long answer:

The term hurdle model seems to come from regression with a count DV. This is possible in Mplus. Here, a binary variable describes whether the event occurred or not and a zero-truncated Poisson variable describes how many times the event occurred. Another term is two-part modeling, which seems to be used with continuous DVs. Mplus has special functions to set this model up easily (DATA TWOPART). In both cases one can add a 2-level structure with random intercepts and slopes.

Rob Dvorak posted on Saturday, July 23, 2011 - 11:15 am

Excellent! Thanks for your help.

Rob Dvorak posted on Monday, November 19, 2012 - 10:45 am

I have a question about the centering option, which may or may not be relevant. I am running a multilevel negative binomial hurdle model with centering at the group mean. I'm predicting the probability of alcohol use on any given day (the logistic portion) followed by number of drinks consumed on drinking days (the count portion following the hurdle). I'm predicting these outcomes based on mean daily mood. In the logistic model, mood is centered at the group (i.e., person) level using all available days for that person. My issue is with the hurdle portion. Is centering reconfigured for this portion of the model based on the number of drinking days for each person, or is it using the centering from the logistic model (i.e., total days)? It seems, methodologically, that I should re-center the data based only on drinking days for the count portion if I want to look at cross-level interactions. Is that correct?

Bengt O. Muthen posted on Monday, November 19, 2012 - 5:10 pm

The Mplus centering applies to both parts. I think you are right that you may want to do separate centering for the two parts.
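
To make the two centering choices concrete, here is a minimal sketch in plain Python (not Mplus input). The diary records are made-up values for a single person; the point is only that the person mean over all days and the person mean over drinking days differ, so a predictor centered for the binary part is generally not centered for the count part.

```python
from statistics import mean

# Hypothetical daily diary records for one person: (mood, drinks that day).
days = [(3.0, 0), (5.0, 2), (4.0, 0), (6.0, 4), (2.0, 0)]

# Centering used for the binary part: person mean over all available days.
mean_all = mean(m for m, _ in days)
centered_all = [m - mean_all for m, _ in days]

# Re-centering for the count part: person mean over drinking days only.
drinking = [(m, d) for m, d in days if d > 0]
mean_drinking = mean(m for m, _ in drinking)
centered_drinking = [m - mean_drinking for m, _ in drinking]

# Drinking-day moods under the all-days centering do not average to zero.
drinking_under_all = [m - mean_all for m, _ in drinking]
```

One way to apply this in practice would be to compute the two centered versions of the predictor in data preparation and use each with its own model part.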

Lisa M. Yarnell posted on Wednesday, May 01, 2013 - 8:40 am

Hi Bengt and Linda,

I gather from the above that in a hurdle model, those who cross the hurdle take on a value higher than zero. The model takes into account who the second-stage DV pertains to, and in the second-stage, zero is not a valid response option.

In our data, we have whether a person drank alcohol (0/1), and then in the second stage, a count of problems experienced due to drinking. But zero is a response option for the number of problem indicators (i.e., in the second stage).

This seems problematic to me because it deviates from the assumptions of the hurdle model, right?

I was considering other modeling options such as a knownclass model within the mixture modeling framework, but that doesn't seem appropriate either, because there are no responses for nondrinkers.

If we can do these models in Mplus, which model would be appropriate?

Thank you.

Lisa M. Yarnell posted on Wednesday, May 01, 2013 - 9:48 am

Bengt and Linda, could the above-mentioned DATA TWOPART be appropriate for our research question? Thank you.

Lisa M. Yarnell posted on Wednesday, May 01, 2013 - 10:22 am

Or rather, Bengt and Linda, does it sound that our question would be best addressed with a zero-inflated negative binomial (or Poisson), given this explanation from the manual:

"With a zero-inflated Poisson model, two regressions are estimated. The first ON statement describes the Poisson regression of the count part of u1 on the covariates x1 and x3. This regression predicts the value of the count dependent variable for individuals who are able to assume values of zero and above. The second ON statement describes the logistic regression of the binary latent inflation variable u1#1 on the covariates x1 and x3. This regression predicts the probability of being unable to assume any value except zero" (Mplus Manual, p. 28).

So the first stage would distinguish zeros due to being a nondrinker (structural zeros) from zeros due to being a drinker but having no problems (true zeros on the final DV)?

Linda K. Muthen posted on Wednesday, May 01, 2013 - 1:31 pm

I would try the zero-inflated Poisson or the negative binomial model for the count variable.

Please limit posts to one window.

Lisa M. Yarnell posted on Wednesday, May 01, 2013 - 3:57 pm

Thanks, Linda. Does Mplus offer a statistic in ZINB output that reveals the quality of distinction between structural and true zeros?

That is, if both nondrinkers and drinkers can take on zeros for having alcohol-related problems (and the drinkers can also take on higher scores), can Mplus tell us the quality of the distinction between the two types of zeros?

This might be similar to the entropy statistic in latent class analysis, which gives information about the quality of distinction between classes.

Linda K. Muthen posted on Wednesday, May 01, 2013 - 4:17 pm

You could do Example 7.25 using nb which would estimate nbi. Then you get the classification table.

Lisa M. Yarnell posted on Wednesday, May 01, 2013 - 4:49 pm

This looks like what we are searching for. Is there an nbi option?

Linda K. Muthen posted on Wednesday, May 01, 2013 - 4:55 pm

Yes. See the COUNT option in the user's guide.

Lisa M. Yarnell posted on Thursday, May 02, 2013 - 5:01 pm

Hi Linda, I do like the idea of getting the classification table when modeling our data according to Example 7.25.

However, might we also do this as a KNOWNCLASS model where the nondrinker class has the probability of taking on a value above zero for drinking problems of essentially zero, in the nbi portion of the model? This would be done using the -15 constraint for the intercept of the DV, as shown in the example.

Basically, since we know our classes, should we do something just like Example 7.25, but as a KNOWNCLASS?

Bengt O. Muthen posted on Thursday, May 02, 2013 - 9:04 pm

Typically you want to find out who is in the two classes.

Lisa M. Yarnell posted on Thursday, May 02, 2013 - 9:40 pm

Thanks, Bengt. We do know who the members of the two classes are (drinkers and nondrinkers). The Example 7.25 seems very appropriate--we really like it. But it occurred to us that in Example 7.25, class membership is unknown, and a classification table is produced.

Would it be best to alter Example 7.25's model and treat it as KNOWNCLASS?

Bengt O. Muthen posted on Friday, May 03, 2013 - 8:51 am

I don't see what you are after.

In ZINB and ZIP you consider class membership as unknown - zero reported drinking in a certain period can come from 2 different classes of people: non-drinkers and drinkers who happened to not drink at that point in time. In this case the classification table can be of interest to study.

If you consider the class membership to be known there is no misclassification - the classification table will have zero off-diagonal elements. Where does "the quality of the distinction between the two types of zeros? " that you mention come into play in this case?
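
The idea that a reported zero can come from either class can be illustrated with Bayes' rule in a ZIP model. The sketch below is plain Python, not Mplus output; the inflation probability pi and the Poisson mean lam are hypothetical values, and the function name is made up for the example.

```python
import math

def prob_structural_zero(pi: float, lam: float) -> float:
    """P(inflation class | y = 0) in a ZIP model with
    inflation probability pi and Poisson mean lam."""
    p_zero_count = math.exp(-lam)  # chance of a zero from the count class
    return pi / (pi + (1.0 - pi) * p_zero_count)

# Hypothetical values: the same inflation probability, two different count means.
post_low_rate = prob_structural_zero(pi=0.3, lam=0.5)
post_high_rate = prob_structural_zero(pi=0.3, lam=4.0)

# With known membership (pi = 1) there is no uncertainty left to classify.
post_known = prob_structural_zero(pi=1.0, lam=0.5)
```

When the count class rarely produces zeros (large lam), an observed zero is more clearly a structural zero, which is when the classification table looks sharper.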

Lisa M. Yarnell posted on Friday, May 03, 2013 - 9:25 am

If we do know their drinker status, should we run this as a latent class or knownclass model?

Linda K. Muthen posted on Friday, May 03, 2013 - 9:39 am

You know their drinking status at one point in time. They report not drinking at that time. That does not mean they are non-drinkers. This is why you want to do the analysis with latent classes. Then one class is the true non-drinkers.

Lisa M. Yarnell posted on Friday, May 03, 2013 - 9:45 am

Great--yes, I agree, Linda. I think that makes perfect sense! Thank you.

Lisa M. Yarnell posted on Monday, May 06, 2013 - 6:01 pm

Linda and Bengt, thanks again for your good advice to base our modeling on Example 7.25. The models run, with 2 nuances.

They run only as ZIP (not as ZINB, where the output seems stuck and doesn't come out fully). I think this could be OK, however; we can perhaps use ZIP.

They also have low entropy (0.589), with an estimated 14.5% of class 2 persons being misclassified as class 1, and an estimated 11.5% of class 1 persons being misclassified as class 2.

Is there a way to improve our entropy statistic and get lower rates of estimated misclassification?

Thank you,
Lisa

Linda K. Muthen posted on Tuesday, May 07, 2013 - 9:44 am

Adding covariates to the model can sometimes improve entropy.
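
For readers wondering what drives the entropy value, the sketch below computes the commonly cited relative-entropy summary from posterior class probabilities. It is plain Python under the assumption that this formula matches the statistic reported in the output; the posterior probabilities are invented for illustration.

```python
import math

def relative_entropy(posteriors: list[list[float]]) -> float:
    """Relative entropy 1 - sum_i sum_k(-p_ik ln p_ik) / (n ln K);
    values near 1 indicate clear classification."""
    n = len(posteriors)
    k = len(posteriors[0])
    uncertainty = 0.0
    for row in posteriors:
        for p in row:
            if p > 0.0:  # treat 0 * ln(0) as 0
                uncertainty -= p * math.log(p)
    return 1.0 - uncertainty / (n * math.log(k))

# Hypothetical posterior probabilities for three people and two classes.
sharp = relative_entropy([[0.99, 0.01], [0.02, 0.98], [0.97, 0.03]])
fuzzy = relative_entropy([[0.70, 0.30], [0.40, 0.60], [0.55, 0.45]])
```

Anything that pushes individual posterior probabilities toward 0 or 1, such as informative covariates, raises this value.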

Lisa M. Yarnell posted on Tuesday, May 07, 2013 - 12:40 pm

Linda, can you tell me why the model would not run as ZINB, but would run as a ZIP? Mplus seemed to not even give a full output file when I tried to run it as ZINB--as if it stopped the analysis prematurely, and created just the beginning of an output file.

But when I changed (nbi) to just (i), the model ran fine and produced a full output file. Why would that occur?

Linda K. Muthen posted on Tuesday, May 07, 2013 - 1:46 pm

Please send the input, data, and your license number to support@statmodel.com.

Lisa M. Yarnell posted on Thursday, May 09, 2013 - 2:23 pm

Hi Linda,

Example 7.25 is working great for us.

We see the Loglikelihood values and scaling parameters for comparing 2 nested models (with the H0 label).

How do we also compare against the null model H1? I don't see the fit of the null model (H1) expressed as a Loglikelihood in the output, against which we can compare. Should we think of this H1 model as the "unrestricted model"--would that be the proper term?

Thank you.

Linda K. Muthen posted on Thursday, May 09, 2013 - 2:25 pm

There is no H1 model when means, variances, and covariances are not sufficient statistics for model estimation.

Lisa M. Yarnell posted on Monday, June 03, 2013 - 5:52 pm

Hi Linda,

We are using mixture model 7.25 for our ZINB analyses, with class predicting who can and cannot take on a positive score for the count regression. I am surprised to see that some cases are being dropped due to missing data on the dependent variable. This is surprising because the Mplus manual states that even with MLR estimation and count DVs, Mplus handles missing data on the DVs with ML methods.

The message we are receiving is:
Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis.
Number of cases with missing on all variables except x-variables: 52

Why is Mplus excluding these cases, who according to the message are missing their score on the DV, rather than employing ML treatment of this missing data?

Thank you.

Linda K. Muthen posted on Monday, June 03, 2013 - 5:56 pm

Yes, cases are excluded if they have missing values on all dependent variables. They have nothing to contribute.

Bengt O. Muthen posted on Monday, June 03, 2013 - 6:00 pm

ML handling of missing data does not use a person who has missing on all of his DVs. Each person needs at least one DV that he is not missing on. Then you can borrow information from that DV when considering the DV with missingness.

Take the case of regression of y on x with some people having missing data on y. They are not used in ML estimation of the slope and intercept. They contribute to the estimation of the marginal distribution of x, but those parameters are not part of the regression model.
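
The regression example can be checked with a few lines of plain Python. The data are hypothetical, and ordinary least squares is used as the ML solution under normal residuals; cases with missing y simply never enter the computation of the slope and intercept.

```python
def ols_fit(pairs):
    """ML/OLS intercept and slope from (x, y) pairs with y observed."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical data; None marks a missing y.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, None), (5.0, None)]

# Only cases with an observed y enter the likelihood for y given x.
observed = [(x, y) for x, y in data if y is not None]
intercept, slope = ols_fit(observed)

# Adding more cases with missing y leaves the estimates unchanged.
more = data + [(6.0, None), (7.0, None)]
intercept2, slope2 = ols_fit([(x, y) for x, y in more if y is not None])
```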

Lisa M. Yarnell posted on Monday, June 03, 2013 - 6:13 pm

OK--thank you Linda and Bengt for clarifying this!

Lisa M. Yarnell posted on Tuesday, June 04, 2013 - 11:38 pm

Hi again Linda and Bengt,

In Example 7.25, when there are multiple predictors of class and score on the DV (as with variables x1 and x3), I have the following questions:

1. x1 and x3 are correlated by default, without this needing to be written explicitly into the code, right?

2. These correlations are set to be equal across classes by default, then, right? So, to allow them to differ between class 1 and class 2, we would need to code explicitly for these differing correlations in the two classes, if that is what we hypothesize. Is that correct?

Thank you for all of your help.

Linda K. Muthen posted on Wednesday, June 05, 2013 - 6:17 am

1. Yes. The model is estimated conditioned on these variables.
2. No, they are not equal across classes. The correlations are not model parameters.

Moin Syed posted on Thursday, February 13, 2014 - 12:02 pm

Hello,

Is it possible to use TWOPART (or some other procedure) with a predictor rather than a DV? I want to separate out the binary vs. continuous aspects of a predictor, but I get an error about lack of variance in the binary component. Any help would be appreciated.

Thanks!

moin

Bengt O. Muthen posted on Friday, February 14, 2014 - 11:22 am

If you have a growth model that is two-part I can see predicting from the growth factors, but with a single variable I think you might just as well use a binary dummy predictor (event observed or not) - assuming that the DV has variation for each of those categories.

S.Arunachalam posted on Tuesday, June 07, 2016 - 9:57 am

Respected Prof. Muthen

May I kindly request your advice on how to setup a 3-level negative binomial hurdle model in Mplus please.

The three level structure is country (level 3) ; category (level 2) ; and time (level 1). That is multiple time points of the category variable(s).

Level 1 variable is a 'count' variable.

A basic outlay of the Mplus code will be great help. Many many thanks in advance.

Bengt O. Muthen posted on Wednesday, June 08, 2016 - 10:47 am

Mplus does not yet allow 3-level count modeling. Perhaps you can express the multiple timepoints in wide format and thereby reduce the analysis to 2-level.

S.Arunachalam posted on Thursday, June 09, 2016 - 8:48 am

Thank you Prof. Muthen. Yes, I did think about using growth curves for the multiple time points. However, given that I have many variables that vary over time, I am not sure how to go about it, as having too many parallel growth curves in the model may become complex.

Bengt O. Muthen posted on Thursday, June 09, 2016 - 6:31 pm

I see what you are saying. If it makes it any easier, note that you don't have to apply a growth model when data are in wide format - any model including a saturated model is possible.
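
The wide-format idea can be sketched as a data-preparation step in plain Python. The rows, the column names (y1, y2), and the country and category labels are all hypothetical; the point is that once each time point becomes its own column, a category is a single row and country is the only remaining cluster level.

```python
from collections import defaultdict

# Hypothetical long-format rows: (country, category, time, count).
long_rows = [
    ("A", "c1", 1, 0), ("A", "c1", 2, 3),
    ("A", "c2", 1, 1), ("A", "c2", 2, 0),
    ("B", "c1", 1, 2), ("B", "c1", 2, 5),
]

# One wide row per (country, category); each time point becomes a column.
wide = defaultdict(dict)
for country, category, time, count in long_rows:
    wide[(country, category)][f"y{time}"] = count

wide_rows = [
    {"country": c, "category": g, **cols}
    for (c, g), cols in sorted(wide.items())
]
```

Each resulting row carries all time points for one category, so any model for y1, y2, and so on can be specified, not only a growth model.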

S.Arunachalam posted on Friday, June 10, 2016 - 5:19 am

Thank you Prof. Muthen. Let me try this out. Many thanks.

Lois Downey posted on Wednesday, February 15, 2017 - 9:28 am

I have a very elementary question related to hurdle and zero-inflated models.

I naively expected the coefficient for the binary part of these models to match the coefficient that would have been produced with a logistic regression model of an outcome in which the original values of 1 and above were recoded to 0, and the original value of 0 was recoded to 1. But that doesn't seem to be the case.

Would you please explain why not.

THANKS

Bengt O. Muthen posted on Wednesday, February 15, 2017 - 11:19 am

The discrepancy you see is because a zero-inflated model uses a mixture at y=0, that is, some people with y=0 are in the binary part of the model but some others are in the regular count part of the model including 0. See chapter 6 of our new book where different count models are compared.

Lois Downey posted on Wednesday, February 15, 2017 - 1:28 pm

Thank you.

And how about the hurdle model, where cases with y=0 can't be in the regular count part?

Is the discrepancy there caused by the fact that some people with y>0 are in the binary part, and others with y>0 are in the regular count part?

Bengt O. Muthen posted on Wednesday, February 15, 2017 - 3:09 pm

The hurdle model is a two-part model so I think there should not be a difference. If you like, you can send the two outputs to Support along with your license number.
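
The contrast between the two models at y=0 can be seen in an intercept-only illustration. This is a plain Python sketch with hypothetical values for pi (the probability attached to the binary part) and lam (the Poisson mean); it says nothing about the sign convention used in any particular software output.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def zip_prob_zero(pi: float, lam: float) -> float:
    """Marginal P(y = 0) in a ZIP model: inflation zeros plus Poisson zeros."""
    return pi + (1.0 - pi) * math.exp(-lam)

def hurdle_prob_zero(pi: float) -> float:
    """Marginal P(y = 0) in a hurdle model: only the binary part produces zeros."""
    return pi

pi, lam = 0.3, 1.5  # hypothetical values
zip_zero = zip_prob_zero(pi, lam)
hurdle_zero = hurdle_prob_zero(pi)

# A logistic regression on the recoded outcome targets logit(P(y = 0)).
gap_zip = logit(zip_zero) - logit(pi)        # positive: the two logits differ
gap_hurdle = logit(hurdle_zero) - logit(pi)  # exactly zero
```

In the ZIP case the observed zeros overstate the inflation probability, so the recoded logistic regression and the inflation part estimate different quantities; in the hurdle case they coincide.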

anonymous Z posted on Tuesday, November 07, 2017 - 1:53 pm

Dear Drs. Muthen,

I am working on the syntax for a hurdle model. The User Guide gives the example of two-part model for continuous variable. I wondered how the syntax should be if the outcome variable is a count variable.

Below is my analysis with H_RISK being a count variable. How should I make changes to the syntax to reflect H_RISK being a count variable?

Thanks so much!

Title: HIV two part
Data: FILE = OP_HIVriskbeh_LONG.dat;
DATA TWOPART:
NAMES = H_RISK;
BINARY = binRisk;
CONTINUOUS = contRisk;

VARIABLE: NAMES = ID time H_RISK H_BEH H_KNOW;

USEVARIABLES ARE ID time binRisk contRisk;

CATEGORICAL ARE binRisk;

MISSING = ALL (-999999);

cluster = ID;
within = time;

Analysis:
type = twolevel random;

MODEL:
%within%
S1| binRisk on time;

S2| contRisk on time;

%between%

binRisk with S1;

contRisk WITH S2;

binRisk with contRisk;
binRisk with s2;
contRisk with s1;

Bengt O. Muthen posted on Tuesday, November 07, 2017 - 3:14 pm

You can do it two ways as explained in our Regression and Mediation book:

1) don't use Data Twopart but say (using negative binomial as an example):

Count = y(NBH);

2) use Data Twopart and zero-truncated count modeling. See book Table 6.8 script using the y(NBT) setting.

anonymous Z posted on Thursday, November 09, 2017 - 9:13 am

Dear Dr. Muthen,

Based on your advice, I did a hurdle model (below is the syntax). What does p=0.977 for "dispersion H_BEH" mean? Does it mean that the data are not dispersed, and a hurdle is not necessary?

USEVARIABLES ARE ID time H_BEH;

COUNT = H_BEH (NBH);

MISSING = ALL (-999999);

cluster = ID;
within = time;

Analysis:
type = twolevel random;

MODEL:
%within%
s| H_BEH ON time;

MODEL RESULTS
                                              Two-Tailed
               Estimate     S.E.  Est./S.E.      P-Value

Within Level

 Dispersion
    H_BEH         0.000    0.000      0.029        0.977

Thanks so much!

Bengt O. Muthen posted on Friday, November 10, 2017 - 4:36 pm

It says no dispersion but you may still need the hurdle part.

anonymous Z posted on Friday, November 10, 2017 - 5:48 pm

Dear Dr. Muthen,

Could you explain further? I assume that the reason to use the hurdle model is because the data are dispersed due to excess of zeros. If the data are not dispersed, is the hurdle model still necessary?

Thanks so much!

Bengt O. Muthen posted on Saturday, November 11, 2017 - 8:56 am

The negative binomial dispersion parameter is one more parameter than what the Poisson has. This handles a certain degree of overdispersion. But this is not the same as having a special model part for zeros like zero-inflated Poisson, zero-inflated negbin, or hurdle negbin. Zero-inflated adds a class of subjects at zero and hurdle also allows a special model part for zeros. Also, the zero class and the hurdle part can be regressed on its own covariates. In sum, negbin's dispersion parameter is different from hurdle modeling.
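
A small numeric sketch may help separate the two ideas. It uses the common NB2 parameterization (mean mu, dispersion alpha, variance mu + alpha*mu^2), which is assumed here to correspond to the dispersion parameter in the output; the values of mu and alpha are hypothetical and the code is plain Python, not Mplus input.

```python
import math

def poisson_pmf(k: int, mu: float) -> float:
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def negbin_pmf(k: int, mu: float, alpha: float) -> float:
    """NB2 pmf with mean mu and dispersion alpha (variance mu + alpha * mu**2)."""
    r = 1.0 / alpha
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

mu = 2.0
p0_poisson = poisson_pmf(0, mu)
p0_nb_small = negbin_pmf(0, mu, alpha=1e-6)  # dispersion near zero: close to Poisson
p0_nb_large = negbin_pmf(0, mu, alpha=1.0)   # overdispersion raises P(y = 0)
nb_total = sum(negbin_pmf(k, mu, 1.0) for k in range(200))
```

With dispersion near zero the negative binomial collapses to the Poisson, yet the zero probability is still tied to mu. A hurdle or zero-inflation part frees the zero probability from the count mean, which is why a dispersion estimate of zero does not by itself rule out the need for a hurdle part.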

Amanda Lemmon posted on Thursday, September 17, 2020 - 12:40 pm

Hello -

I wanted to ask about the difference between the hurdle (DATA TWOPART) approach and the censored-inflated approach.

I am running a regression where the binary part is essentially the N/A option on the survey question (y) and the continuous part is actual responses on the survey question (y).

From the description in the Mplus guide, I thought that both approaches should work, but I am getting different results. To be more concrete, here are the codes I tried:

Censored-Inflated:

USEVARIABLES ARE y x1 x2;

CENSORED ARE y (bi);

MODEL:

y ON x1 x2;
y#1 ON x1 x2;

Hurdle model:

USEVARIABLES ARE y_cont y_binary x1 x2;

CATEGORICAL ARE y_binary;

MODEL:

y_cont ON x1 x2;
y_binary ON x1 x2;

Bengt O. Muthen posted on Thursday, September 17, 2020 - 6:04 pm

The automatic two-part model in Mplus uses a logY transformation so results will be on a different scale than censored(inflated) which does not transform. You can also do Heckman modeling. Two-part may be more conceptually suitable given that the floor isn't really a true zero. We discuss all this and compare many approaches in Chapter 7 of our RMA book at

http://www.statmodel.com/Mplus_Book.shtml
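
The scale difference mentioned above can be illustrated with a sketch of the kind of split a two-part setup performs. This is plain Python written under the assumption that the default behaviour is a cutpoint of zero and a log transformation of the positive values (please check the User's Guide for the actual options); the function name and the data are hypothetical.

```python
import math

def two_part_split(y, cutpoint=0.0):
    """Return (binary, continuous) parts for one observation.
    binary: 1 if y > cutpoint, 0 otherwise, None if y is missing.
    continuous: log(y) if y > cutpoint, otherwise None (treated as missing)."""
    if y is None:
        return None, None
    if y > cutpoint:
        return 1, math.log(y)
    return 0, None

# Hypothetical semicontinuous outcome with a missing value.
raw = [0.0, 2.5, 0.0, 10.0, None]
parts = [two_part_split(y) for y in raw]
binary_part = [b for b, _ in parts]
continuous_part = [c for _, c in parts]
```

Because the continuous part is on the log scale, its regression coefficients are not directly comparable with coefficients from a model fitted to the untransformed outcome.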