I have a couple of questions related to use of count variables for LCA. I am using data on 30-day frequency of use for 3 substances among adolescents. The data was collected on an 8-point scale from no use to daily (i.e., some ranges are grouped, with broader groupings at the higher end of the scale).
My first question -- is it appropriate to transform this type of scale into a count (e.g., take the mean of each range) and model it as a count of days from 0-30, given that the lower end of the scale is more fine-grained and the scale is zero-inflated? The resulting models appear to make more sense than modeling from 1-7 (i.e., you don't have to work to convert the means for the domains after the fact). If not, what is a better alternative?
My second question -- the examples for modeling zero-inflated counts suggest that both the dichotomous and frequency portions of the variable be modeled as outcomes. However, one of the identified classes is a non-use class. Is it appropriate to simply model class as the DV when including covariates, or must I also include the binary and frequency portions as DVs? Does the fact that there are 3 separate count variables make a difference in this regard?
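For the first question, the recoding being described might look like the following minimal sketch. The day ranges attached to each of the 8 scale points are hypothetical (the actual survey ranges are not given in this thread), and `midpoint_count` is an illustrative helper name, not part of any package.

```python
# Hypothetical day ranges (past 30 days) for each point of an 8-point scale.
# The real survey ranges are not stated in this thread; substitute your own.
CATEGORY_RANGES = {
    1: (0, 0),    # no use
    2: (1, 2),
    3: (3, 5),
    4: (6, 9),
    5: (10, 14),
    6: (15, 19),
    7: (20, 29),
    8: (30, 30),  # daily use
}


def midpoint_count(category: int) -> int:
    """Return the midpoint of the category's day range, rounded half up."""
    low, high = CATEGORY_RANGES[category]
    return (low + high + 1) // 2


# Recode a few example responses from scale points to approximate day counts.
responses = [1, 1, 3, 8, 5]
counts = [midpoint_count(c) for c in responses]
print(counts)  # [0, 0, 4, 30, 12]
```

Note that the recoded variable still takes only 8 distinct values, so it is a relabeled ordinal scale rather than a true count of days.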
Running as an ordered categorical variable altered results fairly dramatically -- going from 4 patterns of use that are consistent with the literature to only 2 that do not seem to capture the range of variation in adolescent use very completely.
Four follow-up questions:
1) is there a way to statistically compare the count vs. categorical approach to see which more accurately depicts the observed classes?
2) if I want to force a zero-use class (i.e., restrict one of the classes to represent youth with no use), do I need to specify start values for each threshold on the 8-point scale, or can I just fix the lowest threshold?
3) Is there any value in transforming the skewed frequency scales and running them as continuous variables rather than having an ordered categorical variable with so many thresholds?
4) For future reference -- what are your thoughts on the second part of my question re: modeling class as the DV vs. class and zero-inflation components (i.e., yes/no and frequency for yes)?
I increased the number of starts to 500, and get the same results. There is no indication of problems replicating the best loglikelihood value, though I do have some warnings re: one or more thresholds needing to be set at extreme values.
When running the data as counts I get 4 classes -- a non-use class, a low frequency class, a regular smoker class, and a moderate poly use class. When running as an 8-level categorical I get 2 classes that appear to be a low-use class and a moderate use class. I tried fixing threshold values to include a 3rd no-use class, but Tech 11 results do not appear to support 3 classes even with this restriction.
Ok, looking at your outputs I get a fuller picture. I see that not only do you have 3 outcomes, but you also have 14 covariates. The comparison is between a count-inflated model and an ordered polytomous model.
One reason for different results from the two approaches may be that the covariates may have direct influence on some of the 3 indicators (measurement non-invariance) and this is picked up differently in the two approaches. If possible, I would start with an analysis without covariates. Having only 3 indicators, however, is limiting here.
Now to your questions.
1. This is not straightforward to do. The loglikelihoods are on different scales and not comparable. The fit to data with the categorical approach can be studied using TECH10, looking especially at the number of significant bivariate residuals. With counts, this output is not produced. You can, however, use the model estimates to compute the probabilities estimated by the count model for each pattern (or for particularly interesting patterns) of counts across the 3 outcomes yourself, and compare them to the observed frequencies.
The Poisson-inflated approach may be ok as an approximation, but is not well justified when counts are categorized as in your case. Other modeling alternatives are "two-part" modeling and censored-normal modeling (see the User's Guide).
2. All thresholds.
3. No, I don't think so, but you could consider the two-part or censored approaches.
4. I would think both class and frequency would be influenced by covariates - this is the approach taken with two-part modeling.
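The hand computation suggested in point 1 could be sketched roughly as follows. This assumes a standard zero-inflated Poisson latent class model with the 3 indicators conditionally independent within class. All parameter values below are made up for illustration, and the inputs are taken on the probability and mean scale, so estimates reported as logits or log means would need to be converted first.

```python
import math


def zip_pmf(y, inflation_prob, mean):
    """P(Y = y) under a zero-inflated Poisson distribution."""
    poisson = math.exp(-mean) * mean ** y / math.factorial(y)
    # A structural zero occurs with probability inflation_prob.
    return inflation_prob * (y == 0) + (1.0 - inflation_prob) * poisson


def pattern_probability(pattern, class_probs, inflation, means):
    """Model-implied probability of one pattern of counts.

    Sums over latent classes, assuming the indicators are independent
    within each class.
    """
    total = 0.0
    for c, pi_c in enumerate(class_probs):
        within = 1.0
        for j, y in enumerate(pattern):
            within *= zip_pmf(y, inflation[c][j], means[c][j])
        total += pi_c * within
    return total


# Hypothetical 2-class estimates for 3 count indicators (not from any output).
class_probs = [0.6, 0.4]
inflation = [[0.9, 0.9, 0.9], [0.1, 0.2, 0.3]]
means = [[0.5, 0.5, 0.5], [5.0, 3.0, 2.0]]
n_obs = 1000  # hypothetical sample size

# Expected frequency of the all-zero pattern, to set against the observed one.
p_zero = pattern_probability((0, 0, 0), class_probs, inflation, means)
print(n_obs * p_zero)
```

Repeating this for the most frequent observed patterns gives a rough, informal fit check for the count model.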
Actually, I think I must correct one statement I made - the loglikelihoods should be in the same metric because the outcome values are the same. Which means that the categorical LL = -2533 with 72 parameters and BIC = 5550 clearly beats the count-inflated LL = -3830 with 60 parameters and BIC = 8063, even when taking into account that there are 12 more parameters in the categorical model.
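As a quick arithmetic check of the comparison above, the usual definition BIC = -2 LL + p ln(n) can be applied to the reported values. The function names here are illustrative, and the sample size is not stated in this thread, so the sketch backs out the implied ln(n) from each model instead of assuming it.

```python
import math


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2*LL + p*ln(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def implied_log_n(loglik, n_params, reported_bic):
    """Solve the BIC formula for ln(n), given LL, p, and the reported BIC."""
    return (reported_bic + 2.0 * loglik) / n_params


# Values quoted in the post above.
cat_ll, cat_k, cat_bic = -2533.0, 72, 5550.0   # ordered categorical model
cnt_ll, cnt_k, cnt_bic = -3830.0, 60, 8063.0   # count-inflated model

# Both models should imply the same ln(n) if the figures are consistent.
print(implied_log_n(cat_ll, cat_k, cat_bic))  # about 6.72
print(implied_log_n(cnt_ll, cnt_k, cnt_bic))  # about 6.72

# Lower BIC is better; the gap favors the categorical model.
print(cnt_bic - cat_bic)  # 2513.0
```

The two implied ln(n) values agree up to rounding of the reported loglikelihoods, so the quoted BICs are internally consistent with a single sample size.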
I appreciate all of your thoughts -- I'm going to take a look at your last option. I ran without covariates, but the results with the 8-level categorical approach were not good (i.e., did not converge).
Using your suggestion of collapsing some of the upper-level categories and forcing a non-use class, I get a strong 3-class model that makes conceptual sense. I appreciate your assistance on this over the past couple of days!
In our study we want to investigate the existence of distinct classes of health care utilization in 6741 diabetes patients using LCA. We have several variables measuring utilization (5 count variables, 9 binary variables, 2 categorical variables and a continuous variable), so the indicators have many different distributions. When we specify the count variables as count in the input file, the estimates in the output file do not relate back to the health care utilization characteristics of the total patient population (we calculated the thresholds of the different classes back to the total population). When we specify the count variables as continuous in the input file, we do not see these problems. The "best" model (based on e.g. BIC, BLRT), a 5-class model, actually shows interesting, clinically interpretable classes; posterior probabilities are all >0.9 and there are no visible problems with the estimation.
I know the estimation of the LCAs is complex because of the mix of variables. Has anyone experienced similar problems, and is specifying the count variables as count a "correct" way of handling such complex analyses, or do you recommend a different method? I am trying to figure out what is happening.
We can send the output files if needed. Thanks for your help!