

Zero inflated continuous data, three ... 



I have a data set in which occasions of measurement are nested within days and days are nested within persons. Some of the level 1 measures are zero-inflated. For example, treating the occasion-level data as a flat file, 95% of the observations have a value of 0. The other 5% are distributed reasonably normally. My present interest is the relationship between level 3 measures and measures collected at level 1. It seems that the appropriate model is a hurdle model of some kind in which one parameter would distinguish level 3 units (persons) who have all zeroes from those who do not, and another parameter would model the variability in the level 1 measure among those who do not have all zeroes. In browsing the forum, I have seen references to the DATA TWOPART command. This has usually been mentioned in terms of single-level or two-level models. I have not seen a discussion of this in terms of three-level models. Is it possible to use the DATA TWOPART command for a three-level model? If so, how? If it is not possible, I can always drop the middle level of nesting and run a two-level model. Or, I can reframe the research question and analyze the level 1 outcome as a binomial. Would the results of such an analysis be dramatically different from the zero/nonzero parameter estimated in a formal hurdle model? Any hints, advice, comments, suggestions (and leads to examples) would be greatly appreciated. Best regards, John 


I would first ask: Are your outcomes counts or continuous scores? How many time points are there? Why distinguish between within-day and between-day variation? Why not single-level wide growth modeling? 


There are 150 level 3 units (persons), who provided an average of 13 days of data (level 2). At level 1, there are 17000 observations, an average of just under 9 observations per person/day. The level 1 units of analysis are descriptions of events, and the level 1 observations (data) are the results of the coding of these descriptions, represented by the relative frequency (percent) of different linguistic categories (e.g., first person pronouns). So, although these are technically count data, because they are expressed as percents, they are distributed relatively normally (aside from the zero inflation). There is no rationale for examining changes across time, hence no interest in growth curve modeling. Days (level 2) are included as a level to account for day level variability, and eventually, there may be predictors at this level. For the present analyses, days do not need to be included, but I would like to set up the models to allow for this possibility in the future. 


Does level 1 have a univariate DV or is the response multivariate? I am asking because TWOPART modeling with a single DV boils down to nothing more than regression for each part (binary and continuous) separately. In principle, I think Mplus can do 3-level 2-part modeling. 


If we think of multivariate vs. univariate in an ANOVA vs MANOVA sense (i.e., analyzing multiple DVs simultaneously), the level 1 DV could be treated as univariate or multivariate. Given that this is the first analysis of its kind and the selection of the individual measures that could be joined to do a multivariate analysis would be somewhat arbitrary, I would prefer to start with univariate analyses and move to multivariate analyses once I understand the phenomenon better and have a better sense of the modeling. In this data structure, multivariate would not refer to nesting DVs within observations. 


You could then do a separate 3-level analysis of the binary outcome of engaging in the activity or not and a 3-level analysis of the continuous outcome of the frequency for those who engage in the activity. This kind of modeling (single-level) was discussed in Duan, N., Manning, W. G., Morris, C. N., and Newhouse, J. P. (1983). "A Comparison of Alternative Models for the Demand for Medical Care." Journal of Business and Economic Statistics, 1, 115-126. This work is referred to in Olsen, M. K., and Schafer, J. L. (2001). "A Two-Part Random-Effects Model for Semicontinuous Longitudinal Data." Journal of the American Statistical Association, 96(454), p. 730. 
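As a rough sketch of what the two separate 3-level analyses might look like, write occasion i in day j for person k. The notation here is assumed for illustration and does not come from the cited papers: w_k is a hypothetical person-level predictor, a_k and c_k are person-level random intercepts, b_jk and d_jk are day-level random intercepts, and e_ijk is the occasion-level residual.

```latex
\begin{align*}
u_{ijk} &= \mathbf{1}\,[\,y_{ijk} > 0\,] \\
\operatorname{logit} P(u_{ijk} = 1) &= \beta_0 + \beta_1 w_k + a_k + b_{jk} \\
y_{ijk} \mid (u_{ijk} = 1) &= \gamma_0 + \gamma_1 w_k + c_k + d_{jk} + e_{ijk} \\
a_k \sim N(0, \sigma_a^2), \quad b_{jk} &\sim N(0, \sigma_b^2), \quad
c_k \sim N(0, \sigma_c^2), \quad d_{jk} \sim N(0, \sigma_d^2), \quad
e_{ijk} \sim N(0, \sigma_e^2)
\end{align*}
```

The binary part uses every observation; the continuous part uses only the nonzero ones. Fitting the two parts separately amounts to treating the random effects of one part as independent of those of the other.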


Dear Bengt, Thank you for your prompt and very helpful set of replies to my questions. At the risk of "taxing the carrying capacity of the host", I have one more question. I read both the Duan and Olsen articles. What was not entirely clear to me was if the two models (engaging in or not and frequency) were estimated separately. For example, all observations could be included in a binomial analysis because the zeroes define the baseline. In contrast, for the analyses of frequency, the large number of zeroes would seem to create the problems associated with "zero inflation". Assuming this, could I eliminate the zeroes in the second analysis? Olsen mentioned something along these lines in the discussion. Thanks, John 


The zeros in the second analysis of the continuous part are rescored to missing when using DATA TWOPART, and that is also the approach used in Olsen-Schafer. We talk about this in our short course handout and video, Topic 4, slides 22-33 (see slide 24 for the recode). In two-part modeling the 2 processes (binary and continuous) are correlated, and they typically correlate pretty highly. In the single-DV case the correlation cannot be identified, so it is set to zero, and the analysis is the same as doing each process separately. 
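For readers who want to prepare the two variables by hand, the recode described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `two_part_recode`, the cutpoint of 0, and the example values are hypothetical, and no transformation is applied to the continuous part.

```python
def two_part_recode(values, cutpoint=0.0):
    """Split one semicontinuous variable into a binary and a continuous part.

    Hypothetical helper: values above the cutpoint give binary = 1 and keep
    their value in the continuous part; values at or below the cutpoint give
    binary = 0 and are rescored to missing (None) in the continuous part.
    Values that are already missing stay missing in both parts.
    """
    binary = []
    continuous = []
    for v in values:
        if v is None:
            binary.append(None)
            continuous.append(None)
        elif v > cutpoint:
            binary.append(1)
            continuous.append(v)
        else:
            binary.append(0)
            continuous.append(None)
    return binary, continuous


# Made-up occasion-level percents for illustration only
percents = [0.0, 0.0, 4.2, 0.0, 7.5, None]
u, y_pos = two_part_recode(percents)
print(u)      # [0, 0, 1, 0, 1, None]
print(y_pos)  # [None, None, 4.2, None, 7.5, None]
```

The binary variable keeps all non-missing observations, so the zeros define the baseline category, while the continuous variable contributes only the nonzero observations to the second analysis.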


Bengt, Thanks again. This resolves the issue for me. John 


