

I have a couple of questions related to the use of count variables for LCA. I am using data on 30-day frequency of use for 3 substances among adolescents. The data were collected on an 8-point scale from no use to daily (i.e., some ranges are grouped, with broader groupings at the higher end of the scale). My first question: is it appropriate to transform this type of scale into a count (e.g., take the mean for each range), given that the lower end of the scale is more fine-grained and the scale is zero-inflated, modeling this as a count from 0-30 for days? The resulting models appear to make more sense than modeling from 1-7 (i.e., you don't have to work to convert the means for the domains after the fact). If not, what is a better alternative? My second question: the examples for modeling zero-inflated counts suggest that both the dichotomous and frequency portions of the variable be modeled as outcomes. However, one of the identified classes is a non-use class. Is it appropriate to simply model class as the DV when including covariates, or must I also include the binary and frequency portions as DVs? Does the fact that there are 3 separate count variables make a difference in this regard? Thanks for your thoughts. Christian


I don't think you should treat the variable you describe as a count variable using the approach you describe. I think you should treat it as an ordered categorical variable. 


Running as an ordered categorical variable altered results fairly dramatically, going from 4 patterns of use that are consistent with the literature to only 2 that do not seem to capture the range of variation in adolescent use very completely. Four follow-up questions: 1) Is there a way to statistically compare the count vs. categorical approach to see which more accurately depicts the observed classes? 2) If I want to force a zero-use class (i.e., restrict one of the classes to represent youth with no use), do I need to specify start values for each threshold on the 8-point scale, or can I just fix the lowest threshold? 3) Is there any value in transforming the skewed frequency scales and running them as continuous variables rather than having an ordered categorical variable with so many thresholds? 4) For future reference, what are your thoughts on the second part of my question re: modeling class as the DV vs. class and zero-inflation components (i.e., yes/no and frequency for yes)? Thanks, again. Christian


How many random starts are you using? 


I was using the following: ANALYSIS: type = mixture missing; starts = 200 20; iterations = 20; I increased the number of starts to 500 and get the same results. There is no indication of problems replicating the best log-likelihood value, though I do have some warnings re: one or more thresholds needing to be set at extreme values. Thank you, Christian


Are you considering 3 observed outcomes, or do you also have repeated measures over time? I'll get to your 4 questions after hearing that. 


The study is cross-sectional. The three measures represent # of days using alcohol, tobacco, and marijuana during the past month. Also, I think I misspoke for question #3: I actually meant to fix the values for the thresholds rather than specify start values. Christian


So you get 4 classes with only 3 outcomes when treating the outcomes as counts, or when treating them as continuous? 


When running the data as counts I get 4 classes: a non-use class, a low-frequency class, a regular smoker class, and a moderate poly-use class. When running as an 8-level categorical I get 2 classes that appear to be a low-use class and a moderate-use class. I tried fixing threshold values to include a 3rd no-use class, but TECH11 results do not appear to support 3 classes even with this restriction. Christian


Instead of me asking so many questions, why don't you send your counts 4class analysis input, output, data (if permitted) and license number to support@statmodel.com. 


Ok, looking at your outputs I get a fuller picture. I see that not only do you have 3 outcomes, but you also have 14 covariates. The comparison is between a count-inflated model and an ordered polytomous model. One reason for different results from the two approaches may be that the covariates have direct influence on some of the 3 indicators (measurement non-invariance) and this is picked up differently in the two approaches. If possible, I would start with an analysis without covariates. Having only 3 indicators, however, is limiting here. Now to your questions. 1. This is not straightforward to do. The log-likelihoods are on different scales and not comparable. The fit to data with the categorical approach can be studied using TECH10, looking especially at the number of significant bivariate residuals. With counts, this is not produced. You can, however, compute the probabilities estimated by the count model for each (or each particularly interesting) pattern of counts across the 3 outcomes yourself from the model estimates and compare them to the observed counts. The Poisson-inflated approach may be ok as an approximation, but is not well justified when counts are categorized as in your case. Other modeling alternatives are "two-part" modeling and censored-normal modeling (see the User's Guide). 2. All thresholds. 3. No, I don't think so, but you could consider the two-part or censored approaches. 4. I would think both class and frequency would be influenced by covariates; this is the approach taken with two-part modeling.
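As a rough illustration of the hand calculation suggested under point 1, the Python sketch below computes the model-implied probability of one pattern of counts from a zero-inflated Poisson mixture, assuming the 3 outcomes are independent within class. All numbers (class proportions, rates, zero-inflation probabilities, sample size) are hypothetical and not taken from the outputs discussed here. The sketch also assumes the estimates have already been converted to the natural scale: a rate as exp(log rate) and an inflation probability as 1 / (1 + exp(-logit)).

```python
import math

def zip_pmf(y, rate, p_zero):
    """P(Y = y) for a zero-inflated Poisson: a structural zero with
    probability p_zero, otherwise a Poisson(rate) draw."""
    poisson = math.exp(-rate) * rate ** y / math.factorial(y)
    return p_zero * (y == 0) + (1.0 - p_zero) * poisson

def pattern_probability(pattern, class_probs, rates, p_zeros):
    """Mixture probability of a pattern of counts, assuming the outcomes
    are conditionally independent within each latent class."""
    total = 0.0
    for k, weight in enumerate(class_probs):
        within = 1.0
        for j, y in enumerate(pattern):
            within *= zip_pmf(y, rates[k][j], p_zeros[k][j])
        total += weight * within
    return total

# Hypothetical estimates: 2 classes x 3 outcomes (alcohol, tobacco, marijuana).
class_probs = [0.6, 0.4]
rates = [[0.5, 0.2, 0.1], [6.0, 10.0, 3.0]]
p_zeros = [[0.7, 0.8, 0.9], [0.1, 0.2, 0.3]]

n_obs = 800  # hypothetical sample size
p_000 = pattern_probability((0, 0, 0), class_probs, rates, p_zeros)
expected_000 = n_obs * p_000  # compare with the observed frequency of (0, 0, 0)
```

Repeating this for the most frequent observed patterns gives a table of observed versus model-implied frequencies, which is an informal substitute for the bivariate residuals that TECH10 provides in the categorical case.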


Actually, I think I must correct one statement I made: the log-likelihoods should be in the same metric because the outcome values are the same. This means that the categorical model, with LL = -2533, 72 parameters and BIC = 5550, clearly beats the count-inflated model, with LL = -3830, 60 parameters and BIC = 8063, even when taking into account that there are 12 more parameters in the categorical model.
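The arithmetic behind this comparison can be checked with a short Python sketch. It assumes the standard definition BIC = -2 LL + p ln(n) and reads the reported log-likelihoods as negative values (-2533 and -3830). The sample size is not stated in this thread, so the sketch backs out the n implied by each reported BIC instead of assuming one.

```python
import math

def bic(loglik, n_params, n_obs):
    """Standard BIC: -2 * log-likelihood + (number of parameters) * ln(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)

def implied_n(bic_value, loglik, n_params):
    """Sample size implied by a reported BIC, log-likelihood, and parameter count."""
    return math.exp((bic_value + 2.0 * loglik) / n_params)

# Values as reported in the post (rounded, so the implied n is approximate).
n_categorical = implied_n(5550, -2533, 72)
n_count = implied_n(8063, -3830, 60)

# Extra BIC penalty paid by the categorical model for its 12 additional parameters.
extra_penalty = 12 * math.log(n_categorical)
```

Both models imply a sample size of roughly 830, which suggests the reported figures are internally consistent. The penalty for the 12 extra parameters is on the order of 80 BIC points, which is small next to the BIC difference of about 2500.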


In the categorical approach, you can also try to trichotomize (or 4chotomize) the variable since your higher categories are rather infrequent. This can have an impact on class formation. 
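A minimal Python sketch of such a recode, done before the data are passed to the model. The original coding (1 to 8) and the cut points are hypothetical; the cut points should be chosen from the observed category frequencies.

```python
from collections import Counter

def collapse(code, cut_points=(1, 2, 4)):
    """Map an original category code to a collapsed code 0..len(cut_points)
    by counting how many cut points the code exceeds."""
    return sum(code > c for c in cut_points)

# Hypothetical responses on the original 8-point scale (coded 1..8).
responses = [1, 1, 1, 2, 3, 1, 5, 8, 2, 1, 4, 7]
collapsed = [collapse(r) for r in responses]

# Check that no collapsed category is left nearly empty.
frequencies = Counter(collapsed)
```

With these cut points, code 1 (no use) stays a category of its own, which keeps a non-use class easy to define, while codes 5 through 8 are merged into a single top category.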


I appreciate all of your thoughts. I'm going to take a look at your last option. I ran without covariates, but the results with the 8-level categorical approach were not good (i.e., did not converge).


Using your suggestion of collapsing some of the upper-level categories and forcing a non-use class, I get a strong 3-class model that makes conceptual sense. I appreciate your assistance on this over the past couple of days! Christian


In our study we want to investigate the existence of distinct classes of health care utilization in 6741 diabetes patients using LCA. We have several variables measuring utilization (5 count variables, 9 binary variables, 2 categorical variables and a continuous variable). We therefore have many variables with different distributions (count, binary, categorical and continuous). When we specify the count variables as count in the input file, the estimates in the output file do not relate back to the health care utilization of the total patient population (characteristics). (We calculated the thresholds of the different classes back to the total population.) When we specify the count variables as continuous in the input file, we do not see these problems. The "best" model (based on e.g. BIC, BLRT), a 5-class model, actually shows interesting, clinically interpretable classes; posterior probs are all >0.9 and there are no visible problems with the estimation. I know the estimation of the LCAs is complex because of the mix of variables. Has anyone experienced similar problems, and is specifying the count variables as count a "correct" way of handling such complex analyses, or do you recommend a different method? I am trying to figure out what is happening. We can send the output files if needed. Thanks for your help! Christel van Dijk and Trynke Hoekstra


The count parameter estimates are in the log rate metric. You need to exponentiate them to bring them back to the metric of the data. 
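As a small illustration in Python, with hypothetical class-specific estimates that are not taken from any output in this thread:

```python
import math

# Hypothetical class-specific count means as reported in the log rate metric.
log_rates = {"class 1": -1.2, "class 2": 0.4, "class 3": 2.1}

# Exponentiating gives the expected count in the original metric of the data.
expected_counts = {name: math.exp(value) for name, value in log_rates.items()}
```

A negative estimate therefore does not mean a negative count: an estimate of -1.2 corresponds to an expected count of about 0.3, and an estimate of 2.1 to about 8.2.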

Lorna Roe posted on Thursday, March 23, 2017  10:19 am



Dear Drs. Muthen, In our study we want to (a) investigate the existence of distinct classes of health care utilization in the year preceding survey among 783 frail older people; (b) identify transitions in service use profiles over time; and (c) examine if profile membership is associated with differences in outcomes at follow-up. We have two types of variables measuring utilization (6 count variables, 17 binary variables) and three timepoints. I wanted to investigate an LCA model with count and binary data and I get the following warning: “WARNING: COUNT VARIABLE HAS LARGE VALUES. IT MAY BE MORE APPROPRIATE TO TREAT SUCH VARIABLES AS CONTINUOUS.” Some advice on this would be gratefully received. With regard to transitions over time, I am intending to use latent transition analysis (LTA) with distal outcomes. However, can I use LTA with binary and count variables? And is it appropriate to run LTA when the timepoints are not equally spaced, i.e., T0 is 2010/11, T1 is 2013/14 and T2 is 2015/16? Finally, how is missing data managed? Particularly: (a) missing responses to particular questions at any timepoint, (b) participants present in T0, not present in T1 and present in T2, (c) participants present in T0 and T1 but died by T2 (or present in T0 and died by T1 and not present in T2). With thanks, Lorna


Regarding the warning, you should check your data and see how big the values of the count variable are and why they are so large. Are they outliers, miscoding, or a separate subpopulation? LTA can be done using count and binary outcomes. Time points don't need to be equally spaced. Missing data is handled by ML under MAR, sometimes also referred to as FIML, in all 3 of your cases; that is, using all available data.


Hoping this question/observation has not been broached before and I missed it: In short, I am specifying a latent class model using 5 count outcomes. If I do not specify algorithm=integration, the model is not trustworthy and I get the warning to increase MITERATIONS; however, specifying algorithm=integration the model converges, best LL replicated etc. I guess my question is: isn't algorithm=integration the default for count outcomes? 


If you are sure that the 2 runs have the same model, e.g. by checking that the number of parameters is the same, send your 2 outputs to Support along with your license number. 

HwaYoung Lee posted on Thursday, November 29, 2018  12:52 pm



I ran a couple of latent class analyses with count variables, such as # of drinks, # of restaurant use, and # of smoking and so on. When I used the option "count=drinks (pi) # of restaurant use, # of smoking (pi)", the entropy value was very poor (.4). In addition, there was a warning message: "WARNING: COUNT VARIABLE HAS LARGE VALUES. IT MAY BE MORE APPROPRIATE TO TREAT SUCH VARIABLES AS CONTINUOUS." When using count=drinks... without (pi), the entropy value was okay, but I got the same message as I mentioned earlier. Some variables ranged from 0 to 80, so I've treated them as continuous variables, and it worked well. I am wondering if it is okay to treat these variables as continuous. Any suggestions would be greatly appreciated!!!


If you don't have a large percentage at zero count and you have these high counts, I would treat the variable as continuous.


Dear Drs. Muthén, I want to run a latent class model including type and number of lifetime adverse events, and afterwards explore its associations with mental disorders. I have a “basic” question: Should I consider the number of events subjects were exposed to as a count variable (and, therefore, use Poisson) or can I take it as an ordinal variable? It ranges from 0 to 8. Thank you very much!


A Poisson type model (there are several) is suitable if you have the exact count (not categorized) with a large percentage at zero, that is, it reflects a rare event. 
