LCA with Count Variables
 Christian M. Connell posted on Tuesday, November 28, 2006 - 9:56 am
I have a couple of questions related to use of count variables for LCA. I am using data on 30-day frequency of use for 3 substances among adolescents. The data was collected on an 8-point scale from no use to daily (i.e., some ranges are grouped, with broader groupings at the higher end of the scale).

My first question -- is it appropriate to transform this type of scale into a count (e.g., take the mean for each range), given that the lower end of the scale is more fine-grained and the scale is zero-inflated, modeling this as a count from 0-30 for days? The resulting models appear to make more sense than modeling from 1-7 (i.e., you don't have to work to convert the means for the domains after the fact). If not, what is a better alternative?

My second question -- the examples for modeling zero-inflated counts suggest that both the dichotomous and frequency portions of the variable be modeled as outcomes. However, one of the identified classes is a non-use class. Is it appropriate to simply model class as the DV when including covariates, or must I also include the binary and frequency portions as DVs? Does the fact that there are 3 separate count variables make a difference in this regard?

Thanks for your thoughts.
Christian
 Linda K. Muthen posted on Tuesday, November 28, 2006 - 1:33 pm
I don't think you should treat the variable you describe as a count variable using the approach you describe. I think you should treat it as an ordered categorical variable.
 Christian M. Connell posted on Tuesday, November 28, 2006 - 3:59 pm
Running as an ordered categorical variable altered results fairly dramatically -- going from 4 patterns of use that are consistent with the literature to only 2 that do not seem to capture the range of variation in adolescent use very completely.

Four follow-up questions:

1) is there a way to statistically compare the count vs. categorical approach to see which more accurately depicts the observed classes?

2) if I want to force a zero-use class (i.e., restrict one of the classes to represent youth with no use), do I need to specify start values for each threshold on the 8-point scale, or can I just fix the lowest threshold?

3) Any value in transforming the skewed frequency scales and running as a continuous variable rather than having an ordered categorical variable with so many thresholds?

4) For future reference -- what are your thoughts on the second part of my question re: modeling class as the DV vs. class and zero-inflation components (i.e., yes/no and frequency for yes)?

Thanks, again.
Christian
 Bengt O. Muthen posted on Tuesday, November 28, 2006 - 4:06 pm
How many random starts are you using?
 Christian M. Connell posted on Tuesday, November 28, 2006 - 4:24 pm
I was using the following:

ANALYSIS:
type = mixture missing;
starts = 200 20;
iterations = 20;

I increased the number of starts to 500, and get the same results. There is no indication of problems replicating the best loglikelihood value, though I do have some warnings re: one or more thresholds needing to be set at extreme values.

Thank you,
Christian
 Bengt O. Muthen posted on Tuesday, November 28, 2006 - 5:13 pm
Are you considering 3 observed outcomes, or do you also have repeated measures over time?

I'll get to your 4 questions after hearing that.
 Christian M. Connell posted on Tuesday, November 28, 2006 - 5:37 pm
The study is cross-sectional. The three measures represent # of days using alcohol, tobacco, and marijuana during the past month.

Also, I think I misspoke for question #2 -- I actually meant to fix the values for the thresholds rather than specify start values.

Christian
 Bengt O. Muthen posted on Tuesday, November 28, 2006 - 6:00 pm
So you get 4 classes with only 3 outcomes when treating the outcomes as counts, or when treating them as continuous?
 Christian M. Connell posted on Tuesday, November 28, 2006 - 6:12 pm
When running the data as counts I get 4 classes -- a non-use class, a low frequency class, a regular smoker class, and a moderate poly use class. When running as an 8-level categorical I get 2 classes that appear to be a low-use class and a moderate use class. I tried fixing threshold values to include a 3rd no-use class, but Tech 11 results do not appear to support 3 classes even with this restriction.

Christian
 Bengt O. Muthen posted on Tuesday, November 28, 2006 - 6:22 pm
Instead of me asking so many questions, why don't you send your counts 4-class analysis input, output, data (if permitted) and license number to support@statmodel.com.
 Bengt O. Muthen posted on Wednesday, November 29, 2006 - 9:03 am
Ok, looking at your outputs I get a fuller picture. I see that not only do you have 3 outcomes, but you also have 14 covariates. The comparison is between a count-inflated model and an ordered polytomous model.

One reason for different results from the two approaches may be that the covariates may have direct influence on some of the 3 indicators (measurement non-invariance) and this is picked up differently in the two approaches. If possible, I would start with an analysis without covariates. Having only 3 indicators, however, is limiting here.

Now to your questions.

1. This is not straightforward to do. The loglikelihoods are on different scales and not comparable. The fit to data with the categorical approach can be studied using TECH10 and looking especially at the number of significant bivariate residuals. But with counts, this is not produced - you can however compute the probabilities estimated by the count model yourself from the model estimates, either for every pattern of counts across the 3 outcomes or for particularly interesting patterns, and compare them to the observed frequencies of those patterns.

The Poisson-inflated approach may be ok as an approximation, but is not well justified when counts are categorized as in your case. Other modeling alternatives are "two-part" modeling and censored-normal modeling (see UG).

2. All thresholds.

3. No, I don't think so, but you could consider the two-part or censored approaches.

4. I would think both class and frequency would be influenced by covariates - this is the approach taken with two-part modeling.
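The hand calculation suggested in point 1 can be sketched in Python. This is only an illustration under stated assumptions: each outcome follows a zero-inflated Poisson within class, the outcomes are independent within class, and the Poisson means and inflation probabilities have already been converted from the scale on which the program reports them. The function names and all parameter values below are hypothetical, not estimates from this thread.

```python
import math

def zip_pmf(y, lam, pi_zero):
    """P(Y = y) under a zero-inflated Poisson: with probability pi_zero the
    value is a structural zero, otherwise it is drawn from Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson

def pattern_probability(pattern, class_weights, class_params):
    """Model-implied probability of one pattern of counts, assuming the
    outcomes are independent within each latent class."""
    total = 0.0
    for weight, params in zip(class_weights, class_params):
        within = 1.0
        for y, (lam, pi_zero) in zip(pattern, params):
            within *= zip_pmf(y, lam, pi_zero)
        total += weight * within
    return total

# Hypothetical two-class model with three outcomes; each tuple is
# (Poisson mean, probability of a structural zero).
class_weights = [0.6, 0.4]
class_params = [
    [(0.2, 0.90), (0.1, 0.90), (0.1, 0.95)],  # mostly non-use
    [(4.0, 0.10), (8.0, 0.20), (2.0, 0.30)],  # regular use
]
p_000 = pattern_probability((0, 0, 0), class_weights, class_params)
```

Multiplying a pattern probability by the sample size gives the expected frequency of that pattern, which can then be set against the observed frequency.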
 Bengt O. Muthen posted on Wednesday, November 29, 2006 - 10:58 am
Actually, I think I must correct one statement I made - the loglikelihoods should be in the same metric because the outcome values are the same. Which means that the categorical LL = -2533 with 72 parameters and BIC = 5550 clearly beats the count-inflated LL = -3830 with 60 parameters and BIC = 8063, even when taking into account that there are 12 more parameters in the categorical model.
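As a rough check on the figures above, the comparison can be reproduced with the usual BIC formula, BIC = -2 LL + p ln(n). The sample size is not stated in this thread, so the sketch below backs out the n implied by each reported BIC rather than assuming one; the function names are illustrative.

```python
import math

def bic(loglik, n_params, n_obs):
    """Schwarz BIC: -2 * loglikelihood + n_params * ln(n_obs). Lower is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def implied_n(loglik, n_params, reported_bic):
    """Back out the sample size implied by a reported BIC."""
    return math.exp((reported_bic + 2.0 * loglik) / n_params)

# Values quoted in the post above.
n_categorical = implied_n(-2533, 72, 5550)
n_count = implied_n(-3830, 60, 8063)
bic_gap = 8063 - 5550  # how much lower the categorical model's BIC is
```

The two implied sample sizes come out close to each other, differing only because the reported loglikelihoods and BICs are rounded. That is what one would expect if both models were fit to the same data.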
 Bengt O. Muthen posted on Wednesday, November 29, 2006 - 11:31 am
In the categorical approach, you can also try to trichotomize (or 4-chotomize) the variable since your higher categories are rather infrequent. This can have an impact on class formation.
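One way to prepare such a collapsed variable before the analysis is sketched below in Python. The 0 to 7 coding of the 8-point scale and the cut points are hypothetical. In practice the cut points would be chosen from the observed category frequencies.

```python
def collapse_categories(code, cutpoints):
    """Map an ordinal code to a coarser category.

    cutpoints holds the lowest original code of each new category above the
    first, e.g. (1, 3) sends 0 -> 0, 1-2 -> 1, and 3 or more -> 2.
    """
    return sum(code >= cut for cut in cutpoints)

# Hypothetical coding: original scale stored as 0 (no use) through 7 (daily).
original = [0, 0, 1, 2, 5, 7, 3, 0]
trichotomized = [collapse_categories(c, (1, 3)) for c in original]
four_level = [collapse_categories(c, (1, 2, 4)) for c in original]
```

Keeping "no use" as its own lowest category preserves the distinction needed for a non-use class.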
 Christian M. Connell posted on Wednesday, November 29, 2006 - 11:40 am
I appreciate all of your thoughts -- I'm going to take a look at your last option. I ran without covariates, but the results with the 8-level categorical approach were not good (i.e., did not converge).
 Christian M. Connell posted on Wednesday, November 29, 2006 - 1:31 pm
Using your suggestion of collapsing some of the upper-level categories and forcing a non-use class, I get a strong 3-class model that makes conceptual sense. I appreciate your assistance on this over the past couple of days!

Christian
 Trynke Hoekstra posted on Wednesday, August 25, 2010 - 8:05 am
In our study we want to investigate the existence of distinct classes of health care utilization in 6741 diabetes patients using LCA. We have several variables measuring utilization (5 count variables, 9 binary variables, 2 categorical variables and a continuous variable), so the indicators have different distributions. When we specify the count variables as count in the input file, the estimates in the output file do not relate back to the health care utilization of the total patient population (we calculated the thresholds of the different classes back to the total population). When we specify the count variables as continuous in the input file, we do not see these problems. The "best" model (based on e.g. BIC, BLRT), a 5-class model, actually shows interesting, clinically interpretable classes; posterior probs are all >0.9 and there are no visible problems with the estimation.

I know the estimation of the LCAs is complex because of the mix of variables. Has anyone experienced similar problems? Is specifying the count variables as count a "correct" way of handling such complex analyses, or do you recommend a different method? I am trying to figure out what is happening.

We can send the output files if needed. Thanks for your help!

Christel van Dijk and Trynke Hoekstra
 Linda K. Muthen posted on Wednesday, August 25, 2010 - 3:51 pm
The count parameter estimates are in the log rate metric. You need to exponentiate them to bring them back to the metric of the data.
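A minimal Python sketch of that back-transformation follows. The class labels and log-rate values are hypothetical, not taken from any output in this thread.

```python
import math

def log_rate_to_mean(log_rate):
    """Convert a Poisson mean reported on the log scale to the count scale."""
    return math.exp(log_rate)

# Hypothetical class-specific estimates for one count variable.
log_rates = {"class 1": -1.2, "class 2": 0.4, "class 3": 1.9}
means = {label: log_rate_to_mean(b) for label, b in log_rates.items()}
```

A negative estimate therefore corresponds to an expected count below 1, and an estimate of 0 to an expected count of exactly 1.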