

Hello, I was wondering if a hurdle model with a Poisson distribution can be estimated in Mplus. Secondly, I am comparing models (Poisson, NB, ZIP, ZINB, hurdle NB) within an SEM context. I find that ZIP is the most appropriate method in a model with a correlation between the dependent count variable and another dependent (continuous) variable. But if I fit the same SEM model without that correlation, I find that ZINB is more appropriate. Is it OK to assume that the full model is the one I should base my choice on? Or is there something I am missing? The correlation is fitted like this: f BY y1 y2; f@1; Finally, I am also interested (for different data) in whether a hurdle model with a normal distribution exists. And if so, can it be done in Mplus? Thanks in advance. 


A hurdle model with a normal distribution can be done in Mplus. This splits the model into a binary part (event or not) and a continuous part (how much of the event), followed by parallel process growth modeling. This is automated in Mplus via DATA TWOPART (see the UG and also the handouts for Topic 3 or 4; and see the Olsen-Schafer JASA article referred to there). The analogous approach can be used for counts with Poisson (but I'm not sure DATA TWOPART can be used). The "continuous" part is then a truncated Poisson, where the zero value is not included. Mplus handles the truncated Poisson. I would use the full model where the 2 DVs are correlated, as you have done. It is likely that both equations leave out covariates that are correlated. 

Rob Dvorak posted on Saturday, July 23, 2011  10:24 am



Is it possible to do a multilevel hurdle model in Mplus? 


Short answer, yes. Long answer: The term hurdle model seems to come from regression with a count DV. This is possible in Mplus. Here, a binary variable describes whether the event occurred or not, and a zero-truncated Poisson variable describes how many times the event occurred. Another term is two-part modeling, which seems to be used with continuous DVs. Mplus has special functions to set this model up easily (DATA TWOPART). In both cases one can add a 2-level structure with random intercepts and slopes. 
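As a rough illustration of the two parts described above (plain Python, not Mplus syntax), the sketch below writes out the probability function of an intercept-only hurdle Poisson model. The function names and the values p_event = 0.3 and lam = 2.0 are hypothetical and chosen only for the example.

```python
import math

def poisson_pmf(k, lam):
    """Ordinary Poisson probability P(Y = k)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def hurdle_poisson_pmf(k, p_event, lam):
    """Hurdle Poisson: a binary part decides zero vs. positive,
    and a zero-truncated Poisson describes the positive counts."""
    if k == 0:
        return 1.0 - p_event
    # Zero-truncated Poisson: renormalize over k = 1, 2, ...
    return p_event * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

# Hypothetical parameter values, for illustration only.
total = sum(hurdle_poisson_pmf(k, 0.3, 2.0) for k in range(60))
print(round(total, 10))  # probabilities over 0..59 sum to 1 (up to rounding)
```

Note that the probability of a zero depends only on the binary part; the count part never produces zeros.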

Rob Dvorak posted on Saturday, July 23, 2011  11:15 am



Excellent! Thanks for your help. 

Rob Dvorak posted on Monday, November 19, 2012  10:45 am



I have a question about the centering option, which may or may not be relevant. I am running a multilevel negative binomial hurdle model with centering at the group mean. I'm predicting the probability of alcohol use on any given day (the logistic portion), followed by the number of drinks consumed on drinking days (the count portion following the hurdle). I'm predicting these outcomes based on mean daily mood. In the logistic model, mood is centered at the group (i.e., person) level using all available days for that person. My issue is with the hurdle portion. Is centering reconfigured for this portion of the model based on the number of drinking days for each person, or does it use the centering from the logistic model (i.e., total days)? It seems, methodologically, that I should recenter the data based only on drinking days for the count portion if I want to look at cross-level interactions. Is that correct? 


The Mplus centering applies to both parts. I think you are right that you may want to do separate centering for the two parts. 
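For anyone who wants to prepare the two centered versions of the predictor outside Mplus before the analysis, here is a minimal sketch in plain Python. The data, the variable names (id, mood, drinks), and the helper person_mean_center are all hypothetical; the sketch only illustrates the difference between centering on all days and centering on drinking days.

```python
from collections import defaultdict

def person_mean_center(rows, keep=lambda r: True):
    """Center 'mood' at each person's mean, computed only over rows
    for which keep(row) is True. Rows not kept get None."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        if keep(r):
            sums[r["id"]] += r["mood"]
            counts[r["id"]] += 1
    return [
        r["mood"] - sums[r["id"]] / counts[r["id"]] if keep(r) else None
        for r in rows
    ]

# Made-up daily records: person id, daily mood, number of drinks.
days = [
    {"id": 1, "mood": 2.0, "drinks": 0},
    {"id": 1, "mood": 4.0, "drinks": 3},
    {"id": 1, "mood": 6.0, "drinks": 1},
    {"id": 2, "mood": 1.0, "drinks": 0},
    {"id": 2, "mood": 3.0, "drinks": 2},
]

# Centering for the binary part: person mean over all available days.
mood_c_all = person_mean_center(days)
# Centering for the count part: person mean over drinking days only.
mood_c_drink = person_mean_center(days, keep=lambda r: r["drinks"] > 0)

print(mood_c_all)    # [-2.0, 0.0, 2.0, -1.0, 1.0]
print(mood_c_drink)  # [None, -1.0, 1.0, None, 0.0]
```

The two centered variables could then be saved as separate columns, with one used as the predictor in each part of the model.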


Hi Bengt and Linda, I gather from the above that in a hurdle model, those who cross the hurdle take on a value higher than zero. The model takes into account who the second-stage DV pertains to, and in the second stage, zero is not a valid response option. In our data, we have whether a person drank alcohol (0/1), and then in the second stage, a count of problems experienced due to drinking. But zero is a response option for the number of problem indicators (i.e., in the second stage). This seems problematic to me because it deviates from the assumptions of the hurdle model, right? I was considering other modeling options such as a known-class model within the mixture modeling framework, but that doesn't seem appropriate either, because there are no responses for nondrinkers. If we can do these models in Mplus, which model would be appropriate? Thank you. 


Bengt and Linda, could the above-mentioned DATA TWOPART be appropriate for our research question? Thank you. 


Or rather, Bengt and Linda, does it sound as though our question would be best addressed with a zero-inflated negative binomial (or Poisson), given this explanation from the manual: “With a zero-inflated Poisson model, two regressions are estimated. The first ON statement describes the Poisson regression of the count part of u1 on the covariates x1 and x3. This regression predicts the value of the count dependent variable for individuals who are able to assume values of zero and above. The second ON statement describes the logistic regression of the binary latent inflation variable u1#1 on the covariates x1 and x3. This regression predicts the probability of being unable to assume any value except zero” (Mplus Manual, p. 28). So the first stage would distinguish zeros due to being a nondrinker (structural zeros) from zeros due to being a drinker but having no problems (true zeros on the final DV)? 


I would try the zero-inflated Poisson or the negative binomial model for the count variable. Please limit posts to one window. 


Thanks, Linda. Does Mplus offer a statistic in ZINB output that reveals the quality of distinction between structural and true zeros? That is, if both nondrinkers and drinkers can take on zeros for having alcohol-related problems (and the drinkers can also take on higher scores), can Mplus tell us the quality of the distinction between the two types of zeros? This might be similar to the entropy statistic in latent class analysis, which gives information about the quality of distinction between classes. 


You could do Example 7.25 using nb which would estimate nbi. Then you get the classification table. 


This looks like what we are searching for. Is there an nbi option? 


Yes. See the COUNT option in the user's guide. 


Hi Linda, I do like the idea of getting the classification table when modeling our data according to Example 7.25. However, might we also do this as a KNOWNCLASS model where the nondrinker class has essentially zero probability of taking on a value above zero for drinking problems, in the nbi portion of the model? This would be done using the -15 constraint for the intercept of the DV, as shown in the example. Basically, since we know our classes, should we do something just like Example 7.25, but as a KNOWNCLASS? 


Typically you want to find out who is in the two classes. 


Thanks, Bengt. We do know who the members of the two classes are (drinkers and nondrinkers). Example 7.25 seems very appropriate; we really like it. But it occurred to us that in Example 7.25, class membership is unknown, and a classification table is produced. Would it be best to alter Example 7.25's model and treat it as KNOWNCLASS? 


I don't see what you are after. In ZINB and ZIP you consider class membership as unknown: zero reported drinking in a certain period can come from 2 different classes of people, nondrinkers and drinkers who happened not to drink at that point in time. In this case the classification table can be of interest to study. If you consider the class membership to be known, there is no misclassification, and the classification table will have zero off-diagonal elements. Where does "the quality of the distinction between the two types of zeros" that you mention come into play in this case? 


If we do know their drinker status, should we run this as a latent class or knownclass model? 


You know their drinking status at one point in time. They report not drinking at that time. That does not mean they are nondrinkers. This is why you want to do the analysis with latent classes. Then one class is the true nondrinkers. 


Great, yes, I agree, Linda. I think that makes perfect sense! Thank you. 


Linda and Bengt, thanks again for your good advice to base our modeling on Example 7.25. The models run, with 2 nuances. They run only as ZIP (not as ZINB, where the output seems stuck and doesn't come out fully). I think this could be OK, however; we can perhaps use ZIP. They also have low entropy (0.589), with an estimated 14.5% of class 2 persons being misclassified as class 1, and an estimated 11.5% of class 1 persons being misclassified as class 2. Is there a way to improve our entropy statistic and get lower rates of estimated misclassification? Thank you, Lisa 


Adding covariates to the model can sometimes improve entropy. 
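For readers unfamiliar with the entropy statistic mentioned above, the sketch below computes the commonly used relative entropy from posterior class probabilities. It is plain Python with made-up probabilities; the assumption here is that this standard formula, 1 minus the summed -p*ln(p) divided by n*ln(K), is the quantity being discussed, and the function name is hypothetical.

```python
import math

def relative_entropy(posteriors):
    """Relative entropy for n cases and K classes:
    1 - sum_i sum_k (-p_ik * ln p_ik) / (n * ln K).
    Values near 1 mean clear classification; near 0, very unclear."""
    n = len(posteriors)
    k = len(posteriors[0])
    total = 0.0
    for row in posteriors:
        for p in row:
            if p > 0:  # treat 0 * ln(0) as 0
                total -= p * math.log(p)
    return 1.0 - total / (n * math.log(k))

# Made-up posterior probabilities for two classes.
sharp = [[0.99, 0.01], [0.02, 0.98], [0.97, 0.03]]
fuzzy = [[0.70, 0.30], [0.40, 0.60], [0.55, 0.45]]

print(relative_entropy(sharp) > relative_entropy(fuzzy))  # True
```

Covariates that help separate the classes push the posterior probabilities toward 0 and 1, which is why they can raise this value.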


Linda, can you tell me why the model would not run as ZINB, but would run as ZIP? Mplus seemed not to give even a full output file when I tried to run it as ZINB, as if it stopped the analysis prematurely and created just the beginning of an output file. But when I changed (nbi) to just (i), the model ran fine and produced a full output file. Why would that occur? 


Please send the input, data, and your license number to support@statmodel.com. 


Hi Linda, Example 7.25 is working great for us. We see the Loglikelihood values and scaling parameters for comparing 2 nested models (with the H0 label). How do we also compare against the null model H1? I don't see the fit of the null model (H1) expressed as a Loglikelihood in the output, against which we can compare. Should we think of this H1 model as the "unrestricted model"? Would that be the proper term? Thank you. 


There is no H1 model when means, variances, and covariances are not sufficient statistics for model estimation. 


Hi Linda, We are using mixture model 7.25 for our ZINB analyses, with class predicting who can and cannot take on a positive score for the count regression. I am surprised to see that some cases are being dropped due to missing data on the dependent variable. This is surprising because the Mplus manual states that even with MLR estimation and count DVs, Mplus handles missing data on the DVs with ML methods. The message we are receiving is: "Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis. Number of cases with missing on all variables except x-variables: 52". Why is Mplus excluding these cases, who according to the message are missing their score on the DV, rather than employing ML treatment of this missing data? Thank you. 


Yes, cases are excluded if they have missing values on all dependent variables. They have nothing to contribute. 


ML handling of missing data does not use a person who has missing on all of his DVs. Each person needs at least one DV that he is not missing on. Then you can borrow information from that DV when considering the DV with missingness. Take the case of regression of y on x with some people having missing data on y. They are not used in ML estimation of the slope and intercept. They contribute to the estimation of the marginal distribution of x, but those parameters are not part of the regression model. 


OK, thank you Linda and Bengt for clarifying this! 


Hi again Linda and Bengt, In Example 7.25, when there are multiple predictors of class and score on the DV (as with variables x1 and x3), I have the following questions: 1. x1 and x3 are correlated by default, without this needing to be written explicitly into the code, right? 2. These correlations are set to be equal across classes by default, then, right? So, to allow them to differ between class 1 and class 2, we would need to code explicitly for these differing correlations in the two classes, if that is what we hypothesize. Is that correct? Thank you for all of your help. 


1. Yes. The model is estimated conditioned on these variables. 2. No, they are not equal across classes. The correlations are not model parameters. 

Moin Syed posted on Thursday, February 13, 2014  12:02 pm



Hello, Is it possible to use TWOPART (or some other procedure) with a predictor rather than a DV? I want to separate out the binary vs. continuous aspects of a predictor, but I get an error about lack of variance in the binary component. Any help would be appreciated. Thanks! moin 


If you have a growth model that is two-part, I can see predicting from the growth factors, but with a single variable I think you might just as well use a binary dummy predictor (event observed or not), assuming that the DV has variation for each of those categories. 


Respected Prof. Muthen, may I kindly request your advice on how to set up a 3-level negative binomial hurdle model in Mplus? The three-level structure is country (level 3), category (level 2), and time (level 1). That is, there are multiple time points of the category variable(s). The level 1 variable is a count variable. A basic outline of the Mplus code would be a great help. Many thanks in advance. 


Mplus does not yet allow 3-level count modeling. Perhaps you can express the multiple time points in wide format and thereby reduce the analysis to 2-level. 


Thank you Prof. Muthen. Yes, I did think about using growth curves for the multiple time points. However, given that I have many variables that vary over time, I am not sure how to go about it, as having too many parallel growth curves in the model may become complex. 


I see what you are saying. If it makes it any easier, note that you don't have to apply a growth model when data are in wide format; any model, including a saturated model, is possible. 


Thank you Prof. Muthen. Let me try this out. Many thanks. 

Lois Downey posted on Wednesday, February 15, 2017  9:28 am



I have a very elementary question related to hurdle and zero-inflated models. I naively expected the coefficient for the binary part of these models to match the coefficient that would have been produced with a logistic regression model of an outcome in which the original values of 1 and above were recoded to 0, and the original value of 0 was recoded to 1. But that doesn't seem to be the case. Would you please explain why not? THANKS 


The discrepancy you see is because a zero-inflated model uses a mixture at y=0, that is, some people with y=0 are in the binary part of the model but some others are in the regular count part of the model, which includes 0. See chapter 6 of our new book, where different count models are compared. 
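A small numerical sketch of this mixture at zero, in plain Python with hypothetical values (pi = 0.2 for the inflation probability, lam = 1.5 for the Poisson mean): the overall probability of observing a zero is larger than the inflation probability, so a logistic regression on the observed zero indicator targets a different quantity than the inflation part.

```python
import math

def zip_prob_zero(pi, lam):
    """P(Y = 0) in a zero-inflated Poisson: structural zeros (pi)
    plus zeros produced by the regular Poisson part."""
    return pi + (1.0 - pi) * math.exp(-lam)

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical values, for illustration only.
pi, lam = 0.2, 1.5
p0 = zip_prob_zero(pi, lam)

print(round(p0, 4))         # observed-zero probability, larger than pi
print(round(logit(pi), 4))  # logit targeted by the inflation part
print(round(logit(p0), 4))  # logit targeted by a plain logistic regression on y == 0
```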

Lois Downey posted on Wednesday, February 15, 2017  1:28 pm



Thank you. And how about the hurdle model, where cases with y=0 can't be in the regular count part? Is the discrepancy there caused by the fact that some people with y>0 are in the binary part, and others with y>0 are in the regular count part? 


The hurdle model is a two-part model so I think there should not be a difference. If you like, you can send the two outputs to Support along with your license number. 
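The reason no difference is expected can be sketched numerically: in a hurdle model the log-likelihood splits into a binary piece and a zero-truncated count piece with no shared parameters. The plain-Python sketch below uses an intercept-only hurdle Poisson with made-up data and hypothetical function names. Note that if the logistic regression is coded with zero as the event (as in the recoding described above), the coefficients would be expected to have the opposite sign.

```python
import math

def binary_loglik(y, p):
    """Log-likelihood of the indicator y > 0 with P(y > 0) = p."""
    return sum(math.log(p) if v > 0 else math.log(1.0 - p) for v in y)

def truncated_poisson_loglik(y, lam):
    """Zero-truncated Poisson log-likelihood over the positive counts only."""
    return sum(
        v * math.log(lam) - lam - math.lgamma(v + 1) - math.log(1.0 - math.exp(-lam))
        for v in y if v > 0
    )

def hurdle_loglik(y, p, lam):
    """Full hurdle Poisson log-likelihood, written case by case."""
    total = 0.0
    for v in y:
        if v == 0:
            total += math.log(1.0 - p)
        else:
            total += (math.log(p) + v * math.log(lam) - lam
                      - math.lgamma(v + 1) - math.log(1.0 - math.exp(-lam)))
    return total

y = [0, 0, 0, 2, 1, 4, 0, 3]           # made-up counts
p_hat = sum(v > 0 for v in y) / len(y)  # ML estimate of P(y > 0)

full = hurdle_loglik(y, p_hat, 2.0)
parts = binary_loglik(y, p_hat) + truncated_poisson_loglik(y, 2.0)
print(abs(full - parts) < 1e-12)  # True: the two parts separate
```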

anonymous Z posted on Tuesday, November 07, 2017  1:53 pm



Dear Drs. Muthen, I am working on the syntax for a hurdle model. The User's Guide gives an example of a two-part model for a continuous variable. I wondered what the syntax should be if the outcome variable is a count variable. Below is my analysis, with H_RISK being a count variable. How should I change the syntax to reflect that H_RISK is a count variable? Thanks so much!

Title: HIV two part
Data: FILE = OP_HIVriskbeh_LONG.dat;
DATA TWOPART:
  NAMES = H_RISK;
  BINARY = binRisk;
  CONTINUOUS = contRisk;
VARIABLE:
  NAMES = ID time H_RISK H_BEH H_KNOW;
  USEVARIABLES ARE ID time binRisk contRisk;
  CATEGORICAL ARE binRisk;
  MISSING = ALL (999999);
  cluster = ID;
  within = time;
Analysis: type = twolevel random;
MODEL:
  %within%
  S1 | binRisk on time;
  S2 | contRisk on time;
  %between%
  binRisk with S1;
  contRisk WITH S2;
  binRisk with contRisk;
  binRisk with s2;
  contRisk with s1;


You can do it two ways, as explained in our Regression and Mediation book: 1) don't use Data Twopart but say (using negative binomial as an example): Count = y(NBH); 2) use Data Twopart and zero-truncated count modeling. See the book's Table 6.8 script using the y(NBT) setting. 

anonymous Z posted on Thursday, November 09, 2017  9:13 am



Dear Dr. Muthen, Based on your advice, I did a hurdle model (the syntax is below). What does p=0.977 for "Dispersion H_BEH" mean? Does it mean that the data are not dispersed, and a hurdle is not necessary?

USEVARIABLES ARE ID time H_BEH;
COUNT = H_BEH (NBH);
MISSING = ALL (999999);
cluster = ID;
within = time;
Analysis: type = twolevel random;
MODEL:
  %within%
  s | H_BEH ON time;

MODEL RESULTS
                                             Two-Tailed
               Estimate    S.E.   Est./S.E.    P-Value
Within Level
 Dispersion
    H_BEH         0.000   0.000       0.029      0.977

Thanks so much! 


It says no dispersion but you may still need the hurdle part. 

anonymous Z posted on Friday, November 10, 2017  5:48 pm



Dear Dr. Muthen, Could you explain further? I assume that the reason to use the hurdle model is because the data are dispersed due to excess of zeros. If the data are not dispersed, is the hurdle model still necessary? Thanks so much! 


The negative binomial dispersion parameter is one more parameter than what the Poisson has. This handles a certain degree of overdispersion. But this is not the same as having a special model part for zeros like zero-inflated Poisson, zero-inflated negbin, or hurdle negbin. Zero-inflated adds a class of subjects at zero, and hurdle also allows a special model part for zeros. Also, the zero class and the hurdle part can each be regressed on their own covariates. In sum, negbin's dispersion parameter is different from hurdle modeling. 
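To make the distinction concrete, here is a small plain-Python sketch. It assumes the common parameterization in which the negative binomial variance is mu + alpha*mu^2 (whether this matches the Mplus parameterization exactly is an assumption). The function names and values are hypothetical. With alpha = 0 the negative binomial reduces to the Poisson, yet the data can still contain more zeros than the Poisson predicts, which is what a separate zero part handles.

```python
import math

def nb_variance(mu, alpha):
    """Negative binomial variance under the assumed NB2 form."""
    return mu + alpha * mu**2

def nb_prob_zero(mu, alpha):
    """P(Y = 0) for a negative binomial with mean mu and dispersion alpha.
    alpha = 0 gives the Poisson value exp(-mu)."""
    if alpha == 0:
        return math.exp(-mu)
    return (1.0 + alpha * mu) ** (-1.0 / alpha)

mu = 2.0  # hypothetical mean

print(nb_variance(mu, 0.0))             # 2.0: no overdispersion, same as Poisson
print(round(nb_prob_zero(mu, 0.0), 4))  # Poisson zero probability, exp(-2)
print(round(nb_prob_zero(mu, 1.0), 4))  # more zeros, but tied to mu and alpha

# In a hurdle model the zero probability is a free parameter of its own,
# for example 0.6 here, regardless of mu or alpha in the count part.
hurdle_p_zero = 0.6
print(hurdle_p_zero > nb_prob_zero(mu, 0.0))  # True
```

So a dispersion estimate near zero says the positive counts look Poisson-like; it does not by itself say whether the zeros need their own model part.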
