Zero inflated continuous data, three ...
 John B. Nezlek posted on Sunday, May 31, 2015 - 9:33 am
I have a data set in which occasions of measurement are nested within days and days are nested within persons. Some of the level 1 measures are zero inflated. For example, treating the occasion level data as a flat file, 95% of the observations have a value of 0. The other 5% are distributed reasonably normally. My present interest is the relationship between level 3 measures and measures collected at level 1.
It seems that the appropriate model is a hurdle model of some kind in which one parameter would distinguish level 3 units (persons) who have all zeroes from those who do not, and another parameter would model the variability in the level 1 measure among those who do not have all zeroes.
In browsing the forum, I have seen references to the DATA TWOPART command. This has usually been mentioned in terms of single-level or two-level models. I have not seen it discussed in terms of three-level models. Is it possible to use DATA TWOPART for a three-level model? If so, how?
If it is not possible, I can always drop the middle level of nesting and run a two level model. Or, I can reframe the research question and analyze the level 1 outcome as a binomial. Would the results of such an analysis be dramatically different than the zero/nonzero parameter estimated in a formal hurdle model?

Best regards,
John
 Bengt O. Muthen posted on Sunday, May 31, 2015 - 12:04 pm
I would first ask: Are your outcomes counts or continuous scores? How many time points are there? Why distinguish between within-day and between-day variation? Why not single-level wide growth modeling?
 John B. Nezlek posted on Sunday, May 31, 2015 - 1:29 pm
There are 150 level 3 units (persons), who provided an average of 13 days of data (level 2). At level 1, there are 17000 observations, an average of just under 9 observations per person/day.
The level 1 units of analysis are descriptions of events, and the level 1 observations (data) are the results of the coding of these descriptions, represented by the relative frequency (percent) of different linguistic categories (e.g., first person pronouns). So, although these are technically count data, because they are expressed as percents, they are distributed relatively normally (aside from the zero inflation).
There is no rationale for examining changes across time, hence no interest in growth curve modeling. Days (level 2) are included as a level to account for day level variability, and eventually, there may be predictors at this level.
For the present analyses, days do not need to be included, but I would like to set up the models to allow for this possibility in the future.
 Bengt O. Muthen posted on Sunday, May 31, 2015 - 1:52 pm
Does level 1 have a univariate DV or is the response multivariate? I am asking because TWOPART modeling with a single DV boils down to nothing more than regression for each part (binary and continuous) separately. In principle, I think Mplus can do 3-level 2-part modeling.
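
One way to see why the single-DV case reduces to two separate regressions is to write out the likelihood. This is only a sketch: the notation is illustrative rather than taken from the thread, and it assumes the binary and continuous parts share no parameters and no correlated random effects. Let u_i = 1 if y_i > 0 and u_i = 0 otherwise, let pi_i(alpha) = P(u_i = 1), and let f be the density of the positive scores with parameters beta.

```latex
\begin{align}
L(\alpha, \beta)
  &= \prod_i \bigl[1 - \pi_i(\alpha)\bigr]^{1 - u_i}
     \bigl[\pi_i(\alpha)\, f(y_i \mid y_i > 0;\, \beta)\bigr]^{u_i} \\
\log L(\alpha, \beta)
  &= \underbrace{\sum_i \bigl[(1 - u_i)\log\bigl(1 - \pi_i(\alpha)\bigr)
       + u_i \log \pi_i(\alpha)\bigr]}_{\text{binary part, all observations}}
   + \underbrace{\sum_{i:\, u_i = 1} \log f(y_i \mid y_i > 0;\, \beta)}_{\text{continuous part, nonzero observations only}}
\end{align}
```

Under these assumptions the first sum involves only alpha and the second only beta, so maximizing the two sums separately gives the same estimates as maximizing them jointly.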
 John B. Nezlek posted on Sunday, May 31, 2015 - 2:06 pm
If we think of multivariate vs. univariate in an ANOVA vs MANOVA sense (i.e., analyzing multiple DVs simultaneously), the level 1 DV could be treated as univariate or multivariate.
Given that this is the first analysis of its kind and the selection of the individual measures that could be joined to do a multivariate analysis would be somewhat arbitrary, I would prefer to start with univariate analyses and move to multivariate analyses once I understand the phenomenon better and have a better sense of the modeling.
In this data structure, multivariate would not refer to nesting DVs within observations.
 Bengt O. Muthen posted on Sunday, May 31, 2015 - 5:00 pm
You could then do a separate 3-level analysis of the binary outcome of engaging in the activity or not and a 3-level analysis of the continuous outcome of the frequency for those who engage in the activity. This kind of modeling (single-level) was discussed in

Duan, N., Manning, W. G., Morris, C. N., and Newhouse, J. P. (1983). "A Comparison of Alternative Models for the Demand for Medical Care." Journal of Business and Economic Statistics, 1, 115-126.

This work is referred to in

Olsen, M. K., and Schafer, J. L. (2001). "A Two-Part Random-Effects Model for Semicontinuous Longitudinal Data." Journal of the American Statistical Association, 96(454), 730-745.
 John B. Nezlek posted on Sunday, May 31, 2015 - 11:39 pm
Dear Bengt,
Thank you for your prompt and very helpful set of replies to my questions. At the risk of "taxing the carrying capacity of the host", I have one more question.
I read both the Duan et al. and Olsen and Schafer articles. What was not entirely clear to me was whether the two models (engaging in the activity or not, and frequency) were estimated separately.
For example, all observations could be included in a binomial analysis because the zeroes define the baseline. In contrast, for the analyses of frequency, the large number of zeroes would seem to create the problems associated with "zero inflation". Assuming this, could I eliminate the zeroes in the second analysis? Olsen mentioned something along these lines in the discussion.

Thanks,
John
 Bengt O. Muthen posted on Monday, June 01, 2015 - 2:32 pm
The zeros in the second analysis, that of the continuous part, are re-scored as missing when using DATA TWOPART, which is also the approach used in Olsen-Schafer. We talk about this in our short course handout and video, Topic 4, slides 22-33 (see slide 24 for the recode). In two-part modeling, the two processes (binary and continuous) are correlated, and they typically correlate quite highly. In the single-DV case, the correlation cannot be identified, so it is set to zero, and the analysis is then the same as doing each process separately.
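
The recode described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Mplus code or output: the records, the function name two_part_recode, and the cutpoint of 0 are hypothetical, and no transformation is applied to the positive scores.

```python
import math

# Hypothetical occasion-level records: (person, day, score).
records = [
    (1, 1, 0.0),
    (1, 1, 4.2),
    (1, 2, 0.0),
    (2, 1, 0.0),
    (2, 1, 0.0),
    (2, 2, 6.8),
]


def two_part_recode(y, cutpoint=0.0):
    """Split one score into (binary indicator, continuous part).

    Scores at or below the cutpoint give u = 0 and a missing (NaN)
    continuous value; scores above it give u = 1 and keep their value.
    """
    if y > cutpoint:
        return 1, y
    return 0, math.nan


# Keep the person and day identifiers so both parts retain the nesting.
recoded = [(person, day, *two_part_recode(y)) for person, day, y in records]

# Binary part: every observation contributes a 0 or a 1.
binary_part = [u for _, _, u, _ in recoded]

# Continuous part: only the non-missing (originally nonzero) scores remain.
continuous_part = [c for _, _, _, c in recoded if not math.isnan(c)]
```

All six hypothetical observations enter the binary analysis, while only the two nonzero scores enter the continuous analysis, so the zeros no longer distort the distribution of the continuous part.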
 John B. Nezlek posted on Tuesday, June 02, 2015 - 2:17 am
Bengt,

Thanks again. This resolves the issue for me.

John