LCA with Count Variables
Mplus Discussion > Latent Variable Mixture Modeling >
 Christian M. Connell posted on Tuesday, November 28, 2006 - 9:56 am
I have a couple of questions related to use of count variables for LCA. I am using data on 30-day frequency of use for 3 substances among adolescents. The data was collected on an 8-point scale from no use to daily (i.e., some ranges are grouped, with broader groupings at the higher end of the scale).

My first question -- is it appropriate to transform this type of scale into a count (e.g., take the mean for each range), given that the lower end of the scale is more fine-grained and the scale is zero-inflated, modeling this as a count from 0-30 for days? The resulting models appear to make more sense than modeling from 1-7 (i.e., you don't have to work to convert the means for the domains after the fact). If not, what is a better alternative?

My second question -- the examples for modeling zero-inflated counts suggest that both the dichotomous and frequency portions of the variable be modeled as outcomes. However, one of the identified classes is a non-use class. Is it appropriate to simply model class as the DV when including covariates, or must I also include the binary and frequency portions as DVs? Does the fact that there are 3 separate count variables make a difference in this regard?

Thanks for your thoughts.
 Linda K. Muthen posted on Tuesday, November 28, 2006 - 1:33 pm
I don't think you should treat the variable you describe as a count variable using the approach you describe. I think you should treat it as an ordered categorical variable.
 Christian M. Connell posted on Tuesday, November 28, 2006 - 3:59 pm
Running as an ordered categorical variable altered results fairly dramatically -- going from 4 patterns of use that are consistent with the literature to only 2 that do not seem to capture the range of variation in adolescent use very completely.

Four follow-up questions:

1) is there a way to statistically compare the count vs. categorical approach to see which more accurately depicts the observed classes?

2) if I want to force a zero-use class (i.e., restrict one of the classes to represent youth with no use), do I need to specify start values for each threshold on the 8-point scale, or can I just fix the lowest threshold?

3) Any value in transforming the skewed frequency scales and running as a continuous variable rather than having an ordered categorical variable with so many thresholds?

4) For future reference -- what are your thoughts on the second part of my question re: modeling class as the DV vs. class and zero-inflation components (i.e., yes/no and frequency for yes)?

Thanks, again.
 Bengt O. Muthen posted on Tuesday, November 28, 2006 - 4:06 pm
How many random starts are you using?
 Christian M. Connell posted on Tuesday, November 28, 2006 - 4:24 pm
I was using the following:

type = mixture missing;
starts = 200 20;
iterations = 20;

I increased the number of starts to 500, and get the same results. There is no indication of problems replicating the best loglikelihood value, though I do have some warnings re: one or more thresholds needing to be set at extreme values.

Thank you,
 Bengt O. Muthen posted on Tuesday, November 28, 2006 - 5:13 pm
Are you considering 3 observed outcomes, or do you also have repeated measures over time?

I'll get to your 4 questions after hearing that.
 Christian M. Connell posted on Tuesday, November 28, 2006 - 5:37 pm
The study is cross-sectional. The three measures represent # of days using alcohol, tobacco, and marijuana during the past month.

Also, I think I misspoke for question #2 -- I actually meant to fix the values for the thresholds rather than specify start values.

 Bengt O. Muthen posted on Tuesday, November 28, 2006 - 6:00 pm
So you get 4 classes with only 3 outcomes when treating the outcomes as counts, or when treating them as continuous?
 Christian M. Connell posted on Tuesday, November 28, 2006 - 6:12 pm
When running the data as counts I get 4 classes -- a non-use class, a low frequency class, a regular smoker class, and a moderate poly use class. When running as an 8-level categorical I get 2 classes that appear to be a low-use class and a moderate use class. I tried fixing threshold values to include a 3rd no-use class, but Tech 11 results do not appear to support 3 classes even with this restriction.

 Bengt O. Muthen posted on Tuesday, November 28, 2006 - 6:22 pm
Instead of me asking so many questions, why don't you send your counts 4-class analysis input, output, data (if permitted) and license number to
 Bengt O. Muthen posted on Wednesday, November 29, 2006 - 9:03 am
Ok, looking at your outputs I get a fuller picture. I see that not only do you have 3 outcomes, but you also have 14 covariates. The comparison is between a count-inflated model and an ordered polytomous model.

One reason for different results from the two approaches may be that the covariates may have direct influence on some of the 3 indicators (measurement non-invariance) and this is picked up differently in the two approaches. If possible, I would start with an analysis without covariates. Having only 3 indicators, however, is limiting here.

Now to your questions.

1. This is not straightforward to do. The loglikelihoods are on different scales and not comparable. The fit to data with the categorical approach can be studied using TECH10, looking especially at the number of significant bivariate residuals. With counts, this is not produced. You can, however, use the model estimates to compute the probability the count model assigns to each pattern of counts across the 3 outcomes (or to particularly interesting patterns) and compare the implied frequencies to the observed ones.

The Poisson-inflated approach may be ok as an approximation, but is not well justified when counts are categorized as in your case. Other modeling alternatives are "two-part" modeling and censored-normal modeling (see UG).

2. All thresholds.

3. No, I don't think so, but you could consider the two-part or censored approaches.

4. I would think both class and frequency would be influenced by covariates - this is the approach taken with two-part modeling.
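The hand computation suggested in point 1 can be sketched roughly as follows. This is a minimal illustration, assuming a zero-inflated Poisson mixture with outcomes independent within class; the class proportions, rates, inflation probabilities, and sample size are hypothetical placeholders, and the rates and inflation probabilities are assumed to have already been converted from the metric of the output to the rate and probability scales.

```python
import math


def zip_pmf(y, rate, p_zero):
    """Zero-inflated Poisson probability of observing the count y."""
    poisson = math.exp(-rate) * rate ** y / math.factorial(y)
    # Structural zeros add extra mass at y == 0 only.
    return (p_zero if y == 0 else 0.0) + (1.0 - p_zero) * poisson


def pattern_probability(pattern, class_probs, rates, p_zeros):
    """Model-implied probability of one pattern of counts.

    Sums over classes, multiplying the per-outcome probabilities
    within each class (conditional independence given class).
    """
    total = 0.0
    for k, pi_k in enumerate(class_probs):
        within = 1.0
        for j, y in enumerate(pattern):
            within *= zip_pmf(y, rates[k][j], p_zeros[k][j])
        total += pi_k * within
    return total


# Hypothetical two-class estimates for three outcomes (not from the thread).
class_probs = [0.6, 0.4]
rates = [[0.2, 0.1, 0.1], [4.0, 6.0, 2.0]]
p_zeros = [[0.5, 0.5, 0.5], [0.1, 0.2, 0.3]]
n = 800  # hypothetical sample size

# Expected number of respondents reporting no use of any substance.
expected_all_zero = n * pattern_probability((0, 0, 0), class_probs, rates, p_zeros)
print(expected_all_zero)
```

The expected frequency for each pattern of interest can then be set beside the observed frequency of that pattern in the data.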
 Bengt O. Muthen posted on Wednesday, November 29, 2006 - 10:58 am
Actually, I think I must correct one statement I made - the loglikelihoods should be in the same metric because the outcome values are the same. Which means that the categorical LL = -2533 with 72 parameters and BIC = 5550 clearly beats the count-inflated LL = -3830 with 60 parameters and BIC = 8063, even when taking into account that there are 12 more parameters in the categorical model.
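As a rough arithmetic check on these figures, the standard definition BIC = -2*LL + p*ln(n) can be solved for ln(n). The sample size is not stated in the thread, so the sketch below only backs out the value implied by the (rounded) numbers reported above; both models imply a sample of roughly 830.

```python
import math


def implied_log_n(loglik, n_params, bic):
    """Solve BIC = -2*LL + p*ln(n) for ln(n)."""
    return (bic + 2.0 * loglik) / n_params


def bic(loglik, n_params, n):
    """Bayesian information criterion for a model with n_params parameters."""
    return -2.0 * loglik + n_params * math.log(n)


# Values reported in the post above (rounded there, so results are approximate).
cat_n = math.exp(implied_log_n(-2533, 72, 5550))  # ordered categorical model
cnt_n = math.exp(implied_log_n(-3830, 60, 8063))  # count-inflated model
print(cat_n, cnt_n)

# Lower BIC is better; the gap favors the categorical model.
bic_gap = 8063 - 5550
print(bic_gap)
```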
 Bengt O. Muthen posted on Wednesday, November 29, 2006 - 11:31 am
In the categorical approach, you can also try to trichotomize (or 4-chotomize) the variable since your higher categories are rather infrequent. This can have an impact on class formation.
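Collapsing the upper categories before analysis can be sketched as below. The cut points are hypothetical and assume the 8-point scale is coded 0 to 7; the thread does not say which categories were eventually merged.

```python
def collapse(value, cutpoints):
    """Map an ordered category code to a coarser code.

    cutpoints holds the lowest original code of each new category
    after the first, in increasing order.
    """
    return sum(value >= c for c in cutpoints)


# Hypothetical trichotomy: 0 -> 0 (no use), 1-2 -> 1 (low), 3-7 -> 2 (higher).
cutpoints = (1, 3)
recoded = [collapse(v, cutpoints) for v in range(8)]
print(recoded)
```

Missing values would need to be handled before recoding; this sketch assumes every value is an integer code.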
 Christian M. Connell posted on Wednesday, November 29, 2006 - 11:40 am
I appreciate all of your thoughts -- I'm going to take a look at your last option. I ran without covariates, but the results with the 8-level categorical approach were not good (i.e., did not converge).
 Christian M. Connell posted on Wednesday, November 29, 2006 - 1:31 pm
Using your suggestion of collapsing some of the upper-level categories and forcing a non-use class, I get a strong 3-class model that makes conceptual sense. I appreciate your assistance on this over the past couple of days!

 Trynke Hoekstra posted on Wednesday, August 25, 2010 - 8:05 am
In our study we want to investigate the existence of distinct classes of health care utilization in 6741 Diabetes patients using LCA. We have several variables measuring utilization (5 count variables, 9 binary variables, 2 categorical variables and a continuous variable). We therefore have many variables with different distributions (count, binary, categorical and continuous). When we specify the count variables as count in the input file, the estimates in the output file do not relate back to the health care utilization of the total patient population (characteristics). (We calculated the threshold of the different classes back to the total population.) When we specify the count variables as continuous in the input file, we do not see these problems. The "best" model (based on e.g. BIC, BLRT), a 5-class model, actually shows interesting, clinically interpretable classes, posterior probs are all >0.9, and there are no visible problems with the estimation.

I know the estimation of the LCAs is complex because of the mix of variables. Has anyone experienced similar problems, and is specifying the count variables as count a "correct" way of handling such complex analyses, or do you recommend a different method? I am trying to figure out what is happening.

We can send the output files if needed. Thanks for your help!

Christel van Dijk and Trynke Hoekstra
 Linda K. Muthen posted on Wednesday, August 25, 2010 - 3:51 pm
The count parameter estimates are in the log rate metric. You need to exponentiate them to bring them back to the metric of the data.
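A small sketch of that back-transformation, using made-up class-specific estimates (the values below are hypothetical, not from the poster's output):

```python
import math

# Hypothetical class-specific estimates for one count variable, in the log rate metric.
log_rate_estimates = {"class 1": -1.2, "class 2": 0.4, "class 3": 2.1}

# Exponentiating gives the expected count in the metric of the data.
expected_counts = {name: math.exp(est) for name, est in log_rate_estimates.items()}

for name, value in expected_counts.items():
    print(name, round(value, 2))
```

A negative estimate therefore corresponds to an expected count below 1, not to a negative count.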
 Lorna Roe posted on Thursday, March 23, 2017 - 10:19 am
Dear Drs. Muthen,

In our study we want to (a) investigate the existence of distinct classes of health care utilization in the year preceding survey among 783 frail older people; (b) identify transitions in service use profiles over time; and (c) examine if profile membership is associated with differences in outcomes at follow-up. We have two types of variables measuring utilization (6 count variables, 17 binary variables) and three time-points.

I wanted to investigate a LCA model with count and binary data and I get the following error code “WARNING: COUNT VARIABLE HAS LARGE VALUES. IT MAY BE MORE APPROPRIATE TO TREAT SUCH VARIABLES AS CONTINUOUS.” Some advice on this would be gratefully received.

With regard to transitions over time, I am intending to use latent transition analysis (LTA) with distal outcomes. However, can I use LTA with binary and count variables? And, is it appropriate to run LTA when the time-points are not consecutive? I.e. T0 is 2010/11, T1 is 2013/14 and T2 is 2015/16.

Finally, how is missing data managed? Particularly:
(a) missing responses to particular questions at any time-point
(b) participants present in T0, not present in T1 and present in T2,
(c) participants present in T0 and T1 but died by T2 (or present in T0 and died by T1 and not present in T2).

With thanks
 Bengt O. Muthen posted on Thursday, March 30, 2017 - 9:32 am
Regarding the Warning, you should check your data and see how big and why the count variable has such large values. Are they outliers, mis-coding, or a separate sub-pop?

LTA can be done using count and binary outcomes.

Time points don't need to be equally spaced.

Missing data is handled by ML under MAR, sometimes also referred to as FIML, in all 3 of your cases - that is, using all available data.
 J.D. Haltigan posted on Tuesday, April 10, 2018 - 3:12 am
Hoping this question/observation has not been broached before and I missed it:

In short, I am specifying a latent class model using 5 count outcomes. If I do not specify algorithm=integration, the model is not trustworthy and I get the warning to increase MITERATIONS; however, specifying algorithm=integration the model converges, best LL replicated etc.

I guess my question is: isn't algorithm=integration the default for count outcomes?
 Bengt O. Muthen posted on Tuesday, April 10, 2018 - 3:43 pm
If you are sure that the 2 runs have the same model, e.g. by checking that the number of parameters is the same, send your 2 outputs to Support along with your license number.
 HwaYoung Lee posted on Thursday, November 29, 2018 - 12:52 pm
I ran a couple of latent class analyses with count variables, such as # of drinks, # of restaurant use, and # of smoking and so on.
When I used the option "count=drinks (pi) # of restaurant use, # of smoking (pi)", the entropy value was poor (.4). In addition, there was a warning message: "WARNING: COUNT VARIABLE HAS LARGE VALUES. IT MAY BE MORE APPROPRIATE TO TREAT SUCH VARIABLES AS CONTINUOUS."
When using count=drinks... without (pi), the entropy value was okay, but I also got the same message as I mentioned earlier.
Some variables ranged from 0 to 80.

So, I've treated them as continuous variables. It worked well.

I am wondering if it is okay to treat these variables as continuous.

Any suggestions would be greatly appreciated!!!
 Bengt O. Muthen posted on Thursday, November 29, 2018 - 2:38 pm
If you don't have a large percentage at zero count and these high counts, I would treat the variable as continuous.
 Geilson Lima Santana Junior posted on Saturday, February 09, 2019 - 5:57 pm
Dear Drs. Muthén,

I want to run a latent class model including type and number of lifetime adverse events, and afterwards explore its associations to mental disorders.

I have a “basic” question: Should I consider the number of events subjects were exposed to as a count variable (and, therefore, use Poisson) or can I take it as an ordinal variable? It ranges from 0 to 8.

Thank you very much!
 Bengt O. Muthen posted on Sunday, February 10, 2019 - 5:22 pm
A Poisson type model (there are several) is suitable if you have the exact count (not categorized) with a large percentage at zero, that is, it reflects a rare event.
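One simple way to look at the "large percentage at zero" condition is to compare the observed share of zeros with the share a plain Poisson model with the same mean would imply, exp(-mean). The data below are hypothetical, and this is only a descriptive check, not a formal test or a rule for choosing among the Poisson-type models.

```python
import math


def zero_summary(counts):
    """Return the mean, the observed zero share, and the Poisson-implied zero share."""
    n = len(counts)
    mean = sum(counts) / n
    observed_zero = sum(1 for c in counts if c == 0) / n
    poisson_zero = math.exp(-mean)  # P(Y = 0) under a Poisson with this mean
    return mean, observed_zero, poisson_zero


# Hypothetical exact counts of lifetime events for 20 respondents.
counts = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 5, 0, 1, 0, 8, 0, 0, 3, 0]
mean, observed_zero, poisson_zero = zero_summary(counts)
print(mean, observed_zero, poisson_zero)
```

An observed zero share well above the Poisson-implied share suggests looking at a zero-inflated variant.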