Chapter 6 of the Mplus User's Guide contains many examples of growth models for data in the wide format where data collected for each individual at different time points is represented by a column in the data set. All of these examples and data come on the Mplus CD and are also available on our website under excerpts from the user's guide.
It looks like you have set your data up in the long format, where the data collected for each individual at each time point are represented by a different record. Growth models for data in the long format are estimated in Mplus using TYPE=TWOLEVEL with CLUSTER=ID;.
In Chapter 13, there is a description of how to handle multiple cohort data under the section called Missing. I'm not sure if this is what you are interested in but you can read that also.
Does Mplus allow for Growth Mixture Modelling with longitudinal data collected from multiple cohorts?
I have annual data over five years collected from 8-, 10-, 12-, 14-, and 16-year-olds at time 1. I was hoping to look at trajectories of count outcomes using the full dataset. Any suggestions would be greatly appreciated.
bmuthen posted on Monday, January 02, 2006 - 2:36 pm
There are 2 ways to do this. One is to do a multiple-group analysis - which in the case of mixtures uses Knownclass - where cohort is group. The other is to string out the data across all ages represented so that data from each cohort has missing at some ages.
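As a rough illustration of the second approach, the sketch below strings wave-based records out by age. It is done in Python outside Mplus, and the function name `string_out_by_age`, the record layout, and the use of `None` as the missing marker are assumptions made for illustration only.

```python
def string_out_by_age(age_at_t1, outcomes, ages=range(8, 21)):
    """Place annual wave outcomes into age-indexed slots.

    age_at_t1: age at the first wave (e.g. 8, 10, 12, 14, or 16)
    outcomes:  outcomes at successive annual waves
    Returns a dict mapping age -> outcome, with None at ages where
    the person was not observed (missing by design).
    """
    row = {age: None for age in ages}
    for wave, y in enumerate(outcomes):
        age = age_at_t1 + wave  # annual waves: one year per wave
        if age in row:
            row[age] = y
    return row

# Hypothetical person from the 12-year-old cohort with five annual waves.
person = string_out_by_age(12, [0, 1, 3, 2, 4])
print([person[a] for a in range(8, 21)])
```

Each cohort fills a different stretch of the age axis, so the combined file has one column per age and a block of missing values for every person.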
Sarah Dauber posted on Wednesday, November 08, 2006 - 12:40 pm
Hello, I am interested in conducting growth curve analysis with cohort data. Each person was measured at 3 timepoints, and there are 10 different cohorts. So, altogether I have data from age 12 to age 28. I would like to use the approach of stringing out the data so all ages are represented, rather than using the multiple group method, but there is a lot of missing data this way and I am getting very low covariance coverage rates. Is there a coverage rate that is considered optimal? Also, can you point me to some readings that would provide more info on using Mplus with cohort data?
Hi. I'm wanting to do growth modeling working with data from Add Health, in which youth were recruited in an agespan from 11-20 (or thereabouts), and interviewed at years 0, 1, and 7.
I know this is a situation for the cohort analysis, but there is substantial missingness that is not by design, which, if I understand it, is treated by listwise deletion using the cohort commands in Mplus.
I'm trying the knownclass = age approach, but the 6 year gap seems to be causing problems. So there is a lot of coverage of zero, as no respondent was measured at age 11 along with any of age 13, 14, 15, 16, or 17.
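To see where the zero coverage comes from, here is a minimal sketch that computes the proportion of respondents observed at both ages in a pair. It assumes interviews at years 0, 1, and 7 and one hypothetical respondent per baseline age from 11 to 20; the function names are made up for illustration.

```python
def observed_ages(baseline_age, wave_years=(0, 1, 7)):
    """Ages at which a respondent is interviewed, given the interview years."""
    return {baseline_age + y for y in wave_years}

def coverage(sample, age_a, age_b):
    """Proportion of respondents observed at both age_a and age_b."""
    both = sum(1 for ages in sample if age_a in ages and age_b in ages)
    return both / len(sample)

# Hypothetical sample: one respondent per baseline age 11 through 20.
sample = [observed_ages(b) for b in range(11, 21)]

for other in range(12, 19):
    print(11, other, coverage(sample, 11, other))
```

In this toy sample, age 11 is paired only with ages 12 and 18, so every pair involving age 11 and one of ages 13 to 17 has coverage of zero.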
Didn't think of that, thanks. I'm really hoping for a solution that takes advantage of the accelerated longitudinal design, though, so I can get beyond linear growth. I think I'd still be limited, wouldn't I?
Matt Moehr posted on Monday, March 19, 2007 - 10:25 am
I used the following code in an accelerated design. X1-X3 are the same variables measured at successive time points. Cohort tells the program when the person started, and then I keep it as type = missing. I think you would need to change the lines where I specified the cohort models, e.g.:
Model Cohort13: i s | x1@0 x2@1 x3@7;
**** Begin Code ****
TITLE: GROWTH CURVE MODEL;
DATA: file is ...;
VARIABLE: names are id cohort x1-x3;
usevariables are x1-x3 cohort;
grouping is cohort (1=3-0 2=3-9 3=4-6);
missing are all(-99);
I think you said you had 3 occasions but with a lot of missing data. Growth modeling with random intercept and slope should still be supported if enough people have observations on all 3 occasions. I assume you are handling individually-varying ages of observation using the AT option per our earlier communication, and holding residual variances equal across time (since different people are at different ages at a given occasion). I don't see offhand how adding covariates would make it more difficult unless the covariates have a lot of missingness too.
Hello, I am trying to model growth in depression scores over time using the Add Health data. People were assessed 3 times, baseline, one year later, and five years later. However, ages ranged from 12-21 at baseline, so that stringing the data out using age as the time variable would theoretically allow you to look at growth from age 12 to 28. When I try to run this (with age as the time variable, data strung out over time), the model doesn't converge b/c there is so much missing data. I have also tried modeling it with just the 3 variables for each person (t1 t2 and t3) and using AT to indicate individually varying times of observation. The model converges this way, but I can't get a plot of the curve across all ages. Is this the correct way to model change in growth across all ages? And if so, how would I get a plot of the curve across all ages?
The two approaches you are using are the same except with the wide format the residual variances are free across time. With AT they are held equal across time. I would try holding them equal across time in the wide analysis and see if that helps.
I am using DATA COHORT to rearrange my longitudinal data so that I can investigate development over age. My data were collected at 6-month intervals (baseline, 6m, 12m, 18m, 24m). Mplus requires birth year and measurement year, and the actual ages that I'd like represented are 13-18. However, since the waves are at 6-month intervals, I'm not sure how to do that. For now I have set up some arbitrary integers that present the data in the age range of 13-20, and I'll just redo the graph so that the X axis shows the correct age. I'm wondering if there's another way of correctly capturing the 6-month intervals, or if I'm doing it the correct way? Here's the current code I'm using:
Instead of working in years (see the code above), I worked in months by multiplying all the year values by 12. This way I captured all the 6 month intervals in age growth from 13 to 18 years old (cope13 cope13_5 cope14 cope14_5 cope15 ...).
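The month-based grid can be sketched as follows. This is only an illustration in Python of how the half-year ages map to variable names; the helper `age_label` is hypothetical, and the naming pattern is taken from the variable names listed above.

```python
def age_label(prefix, age_months):
    """Build a variable name such as cope13 or cope13_5 from age in months."""
    years, months = divmod(age_months, 12)
    if months not in (0, 6):
        raise ValueError("only whole and half years are supported")
    return f"{prefix}{years}" if months == 0 else f"{prefix}{years}_5"

# Ages 13 to 18 years in 6-month steps, expressed in months (156 to 216).
grid = list(range(13 * 12, 18 * 12 + 1, 6))
labels = [age_label("cope", m) for m in grid]
print(labels)
```

Working in months keeps every time value an integer, which is what the birth-year and measurement-year arithmetic needs.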
Matthew Cole posted on Saturday, September 01, 2007 - 7:28 pm
I've been using the DATA COHORT to rearrange my longitudinal data, and I received the nonconvergence message below. I am curious if there is a way to get my data to converge so that I can plot the means. If not, at least Mplus is providing the savedata file so I'll be able to put a figure together using another program.
THE MISSING DATA EM ALGORITHM FOR THE H1 MODEL HAS NOT CONVERGED WITH RESPECT TO THE PARAMETER ESTIMATES. THIS MAY BE DUE TO SPARSE DATA LEADING TO A SINGULAR COVARIANCE MATRIX ESTIMATE.
INCREASE THE NUMBER OF H1 ITERATIONS.
NOTE THAT THE NUMBER OF H1 PARAMETERS(MEANS, VARIANCES, AND COVARIANCES) IS GREATER THAN THE NUMBER OF OBSERVATIONS. NUMBER OF H1 PARAMETERS : 209 NUMBER OF OBSERVATIONS : 166
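For reference, an unrestricted (H1) model for p observed variables has p means plus p(p+1)/2 variances and covariances. The sketch below checks that arithmetic; that the run above involved 19 age-specific variables is inferred from the reported count of 209, not stated in the output.

```python
def h1_parameter_count(p):
    """Means plus variances and covariances for p observed variables."""
    return p + p * (p + 1) // 2

# 19 means + 190 variances and covariances.
print(h1_parameter_count(19))
```

With 166 observations and this many free H1 parameters, the unrestricted covariance matrix cannot be estimated reliably, which is what the message warns about.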
I’ve done a multiple cohorts growth curve. In particular I have two cohorts:
Older Cohort: 15 years old 1998/ 17 yo 2000/ 19 yo 2002/ 21 yo 2004
Younger Cohort: 16 years old 1998/ 18 yo 2000/ 20 yo 2002/ 22 yo 2004
I now want to add predictors and outcomes. Can I add them even if each variable is not measured at the same age? For example, I want to add as a predictor “qda” measured in ’98 (15 years old for the younger cohort and 16 for the older one). Is this procedure correct?
This is part of the input
USEV ARE sex md98 md20 md02 md04 qdr98 qdd98;
Im Sm | md98@0 md20@2 md02@4 md04@6;
im with sm (21);
Sm (22);
im (23);
[sm] (24);
[im] (25);
sm on sex (26);
im on sex (27);
qda by qdd98 qdr98;
I think you have a good start here. Note that the introduction to your message has older and younger reversed (those who are 16 in 1998 are older than those who are 15 in 1998). You have it right in the model. Your equality restrictions look right.
Since you don't measure your qda indicators at the same age for the 2 cohorts, you may want to test that the equalities across cohorts related to this factor actually fit well by also running the model with them unequal. Even if they are unequal, the parameters related to the growth factors may be equal. It is of interest to test if parameters related to the growth factors are invariant across cohorts.
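The cohort-specific time scores follow directly from age. As a minimal sketch in Python, assuming the corrected labels (15 in 1998 is the younger cohort) and using the youngest observed age as the origin:

```python
def time_scores(ages, origin):
    """Time scores as years elapsed since the chosen origin age."""
    return [age - origin for age in ages]

# Ages at the 1998, 2000, 2002, and 2004 assessments.
cohort_ages = {
    "younger": [15, 17, 19, 21],
    "older": [16, 18, 20, 22],
}
origin = min(min(ages) for ages in cohort_ages.values())  # age 15

for name, ages in cohort_ages.items():
    print(name, time_scores(ages, origin))
```

The two cohorts interleave on a common age axis, so with the growth factor parameters held equal they describe one curve from age 15 to 22.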
I'VE RUN THIS MODEL AGAIN WITH THE EQUALITIES AND ADDING THE OUTCOME. ALL THE PARAMETERS ARE INVARIANT ACROSS COHORTS. IN THIS WAY I CAN ALSO DISCUSS THE IMPACT ON THE OUTCOME CONSIDERING THE PREDICTOR.
USEV ARE sex md98 md20 md02 md04 qdd98 qdr98 VAS04R;
missing are all (99.00);
grouping is coorte (1=younger 2=older);
Analysis: Type = MEANSTRUCTURE MISSING;
ESTIMATOR = mlR;
model:
Im Sm | md98@0 md20@2 md02@4 md04@6;
im with sm (21);
Sm (22);
im (23);
[sm] (24);
[im] (25);
sm on sex (26);
im on sex (27);
qda by qdd98;
qda BY qdr98 (31);
VAS04R ON IM (32);
VAS04R ON SM (33);
VAS04R ON SEX (34);
VAS04R ON QDA (35);
im on qda (28);
sm on qda (29);
qda on sex (30);
model older:
Im Sm | md98@1 md20@3 md02@5 md04@7;
I am working on a multiple cohorts growth curve, but there is substantial missingness that is not by design. I know your recommendation is to do listwise deletion first, but when I do that I will only keep about 30% of my cases, which is not really an option.
My questions are: 1. Is it really necessary to do listwise deletion first? 2. With both missingness by design and missingness at random, will Mplus estimate the model incorrectly (even though it runs properly)? If so, in what way? 3. If listwise deletion is the only solution, does it matter if I perform listwise deletion myself in SPSS first, or is it better (for model estimation) to let Mplus do that using the DATA COHORT command?
I know TSCORES is a good way to overcome this problem (that also runs properly), but then I am not able to make a plot.
I'm not sure why you think we recommend listwise deletion. We don't. You can use the multiple group multiple cohort approach shown in the new Example 6.18 or you can string the data out by age and not use TSCORES.
I mean that when I string the data out by age, I will get missingness by design next to missingness at random. Is this allowed? (or is listwise deletion necessary?)
Using the DATA COHORT command, each observation that does not have complete data is deleted from the data set (page 350 of the user's guide, version 4.1). That was the reason why I thought you recommend listwise deletion.
I hope you can clear this up for me. Thanks in advance.
You can string the data out with or without doing listwise deletion within each pattern of variables. This is your choice. If you don't want listwise deletion within each pattern, you would have to do the analysis in two steps. In the first step, save the data without using the MISSING option. In the second step, use the MISSING option and do the analysis.
Example 6.18 is quite helpful, however then I get separate growth curves for all cohorts instead of 1 overall growth curve. The two-step analysis you mention is maybe a better solution. But what do you mean by saving the data without using the missing option? What is the difference then between that file and the raw data file? Sorry if these are stupid questions, but I just don't get what you're saying.
You do not get separate growth curves for each cohort. If you look at the example, you will see that each cohort contributes to part of the growth curve for which you obtain one intercept and one slope growth factor mean, variance, and covariance due to the equalities that are imposed on these parameters. Take a moment to thoroughly go through the input and also look at the output to see which parameters are estimated.
If you do not use the MISSING option in the first step where you use MODEL COHORT to string out the data, then there will be no listwise deletion because no value will be considered missing.
Hi, I’ve done a growth model for two parallel processes for continuous outcomes, with regressions among the random effects and predictors, using a multiple cohort growth curve approach. In particular I have two cohorts:
Younger Cohort: 15 years old 1998/ 17 yo 2000/ 19 yo 2002/
Older Cohort: 16 years old 1998/ 18 yo 2000/ 20 yo 2002/
thus the two growth curves are from age 15 to 20 years.
Could you suggest some references in which this approach was used, or references that can help me describe the results?
Dear Bengt and Linda, I would like to do a multiple group multiple cohort growth model for two parallel processes.
The reason that I want to use parallel processes is that I want to estimate a model for two different instruments (child and adolescent).
I want to use multiple cohort because I have an accelerated design.
This means that for each cohort, the measurement points for the parallel processes will differ. In some cohorts, not all variables will have observations because the cohort is ‘too old’ for the instrument.
You should be able to do this. You would need to have the same set of variables in each group which may result in a problem with zero variances. I think this can be overcome by including VARIANCES=NOCHECK in the DATA command.
This looks correct with the exception that in the last two groups you have the intercept growth factor loadings fixed to zero instead of one. I would also use the special | growth language instead of BY because the defaults are more appropriate to a growth model.
Thanks for taking a look, but I guess I'm still a bit unclear on whether changing those final two models to have intercept factor loadings of 1 allows us to maximize the strength of our cohort-sequential design. Even though we only sample each participant three times, we obtain 8 data points across the four cohorts (0, 1, 2, 3, 4, 5, 6, 7, 8). Is it possible to model growth AS IF there were 8 points across all the participants?
In a growth model, the loadings for the intercept growth factor are one. That is part of the model parameterization. The way you have the model set up is as if the data are across 8 timepoints. See Example 6.18 for a full description of the multiple group multiple cohort model.
C. Sullivan posted on Friday, December 12, 2008 - 12:59 pm
I have three cohorts measured at three waves each and would like to be able to also assess neighborhood variance (and potential effects) on growth factors. Is it possible to estimate a multiple cohort growth model within a multilevel framework? Specifically, could a model like that shown in example 6.18 be run in the multilevel framework (like ex. 9.12)?
The GROUPING option is available with TYPE=TWOLEVEL when outcomes are continuous.
C. Sullivan posted on Tuesday, December 23, 2008 - 12:13 pm
Two other quick questions on the multiple cohort, multilevel growth model
In the time structuring...if I have two ages with no coverage, would I just set the rest of the scores as usual (i.e., y1@0, y2@2, y3@3 if there was no coverage at the second interval)?
I'm trying to run a MC model for the Twolevel, grouped growth model, but I'm not getting any estimates and the output is telling me that none of the repetitions that I requested were completed. Would that more likely be the result of a setting being incorrect or model misspecification?
Yes but you would need to use TYPE=MIXTURE and the KNOWNCLASS option to do this. The GROUPING option is not available with TYPE=RANDOM;
csulliva posted on Thursday, June 17, 2010 - 3:22 pm
1. Is the known class multiple cohort approach to growth modeling equivalent to the cohort group based approach (ex. 6.18)? I received some warnings on the model’s identification with the latter-- but not the former--and was a bit unclear on the potential source for that discrepancy. 2. Also, in response to a question above it is mentioned that “each cohort contributes to part of the growth curve for which you obtain one intercept and one slope growth factor…” Does this mean that it is appropriate to plot the outcome across age for the full sample—as opposed to a series of separate plotted lines for each known class (cohort)?
Nicolas M posted on Saturday, July 17, 2010 - 4:47 pm
I'm doing a growth curve analysis on 9-wave panel data. Individuals in it have very different ages (going from 16 to 80). I think I'm using what you call the "wide" format, where for each individual I have
age1 outcome1 age2 outcome2 age3 outcome3 ...
20   1        21   5        22   6
I defined the ages as TSCORES using "TSCORES are age1-age9;". However, I have convergence problems. I managed to solve them by standardizing all the age variables using the following operation:
age1standard. = (age1 - mean(all_ages))/sd(all_ages)
age2standard. = (age2 - mean(all_ages))/sd(all_ages)
etc.
Now, the model converges. Do you think this is a proper way to solve this problem? Can you see any reason for not doing that?
I would not standardize. I would divide age by a constant such as ten.
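A small sketch of what dividing by a constant does to the time scores, written in Python with hypothetical ages and a made-up slope value. The helper names are assumptions for illustration.

```python
def rescale(age, constant=10):
    """Express age in units of `constant` years (decades by default)."""
    return age / constant

def slope_per_year(slope_per_unit, constant=10):
    """Convert a slope estimated on the rescaled axis back to a per-year slope."""
    return slope_per_unit / constant

ages = [16, 25, 40, 62, 80]  # hypothetical ages spanning the panel's range
print([rescale(a) for a in ages])

# A hypothetical slope of 0.5 per decade corresponds to 0.05 per year.
print(slope_per_year(0.5))
```

Unlike standardizing, the unit stays fixed (one unit equals ten years) and does not depend on the sample mean or standard deviation, so estimates convert back to the original scale with a single division.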
Nicolas M posted on Tuesday, July 20, 2010 - 2:13 pm
Thank you for your answer. I did try to divide the age by a constant, but I still have major convergence problems. I was thinking, as all observations are equally spaced, is it reasonable to simply use:
i s | f1@0 f2@1 f3@2 ... and then control for the starting age: i s ON age0
instead of using the TSCORES option? Or is it not a good idea? Actually, what I like about this method is that Mplus doesn't need numerical integration, so it is much faster. The model has a good fit. But I need to be sure it is statistically correct...
I think using TSCORES is preferred. You could also consider multiple group multiple cohort as shown in Example 6.18. If you send the output where you failed with TSCORES and your license number to email@example.com, we can see if we can help.
Melvin C Y posted on Monday, September 06, 2010 - 8:39 pm
Dear Dr Muthen,
I have similar measures obtained from two cohorts (group1=10-12 years; group2=13-16 years). As there is no common age or linking data between cohorts, would I still be able to use the multiple cohort LGM (i.e., 10-17 years)? Would you suggest a piecewise model instead?
You can use piece-wise and see if the two pieces align.
Andy Ross posted on Thursday, February 24, 2011 - 2:13 am
Dear Prof Muthen
I’d like to estimate a linear growth model for a categorical outcome and wanted to use the multiple cohort option. However, I'm under the impression that this option is not available with categorical outcomes. Is that correct?
I'm modelling gang membership over three time points using data that contains young people aged 11, 12, 13, 14, 15, 16, 17 at time point one. I wanted to use this approach as the alternatives appear severely limited by the number of time points - i.e. the standard LGM would only allow a linear model which is not supported by the data - not to mention the fact that I would like to capture the age crime curve.
It would also be useful to estimate different trajectories, i.e. adolescent limited and persistent offenders, as far as they may exist for gang membership - does the multiple cohort option allow this? I did also consider using LCA as an alternative but am limited by the number of classes I can estimate before the model is not identified.
Can you offer any suggestions or am I simply asking too much of the data?
Jing Zhang posted on Tuesday, May 03, 2011 - 12:11 pm
Dear Professor Muthen,
You mentioned that there are two ways to handle multiple cohort data: 1) a multiple group approach to cohort analysis; and 2) having the program string the data out over time.
I have several questions: 1) If the data are missing by design, e.g. for some cohorts the data were not collected at certain time points of the survey, can I still use a multiple group approach to cohort analysis as indicated in Example 6.18? 2) I am doing a three-level multiple cohort growth curve model for my research. The data are missing by design, and an example follows. Can I still follow Example 6.18, or should I string the data out over time? Do you have an example of the syntax for multilevel multiple cohort growth curve modeling with the data strung out over time?
Note: cohort 1 does not have data on y1 and y2, and cohort 2 doesn’t have data on y2.
y1 y2 y3 x1 x2 x3 cohort
x  x  7  2  5  3  1
x  x  6  1  3  4  1
x  x  5  2  5  6  1
x  3  5  1  6  7  2
x  2  3  4  3  2  2
x  1  3  6  7  3  2
5  3  2  1  8  9  3
3  4  7  5  1  8  3
3  5  8  2  4  6  3
The multiple-cohort, multiple-group approach is not straightforward in Mplus when the cohorts have different numbers of observed time points. So I would string out the data. Multilevel does not cause any extra difficulty as far as I can see; I don't have an example.
I conducted a multiple cohort growth model using the known class option (three cohort groups) and found that the model with equality constraints was of poorer fit than the unconstrained model.
1. This would suggest that I would need to account for those groups (cohort) effects throughout my analysis. Is this correct?
2. Does that necessitate freeing the estimates for the growth factors, residual variances/covariances, and any covariate effects across groups?
3. If so, are there any tractability/estimation issues in particular that need attention in this process? I have run a test model freeing those parameters and a covariate effect and have had difficulty with convergence. Would this just be a matter of increasing the number of random starts or MIterations?
Drs. Muthen, I have a longitudinal dataset with seven waves (base, six months, 12 months, 18 months, 4.5 years, 5.5 years, 6.5 years), with youth who were aged 13-17 at baseline. I need to use TSCORES to deal with individual variability around each wave, and a zero-inflated Poisson distribution to deal with the high number of zeros in the outcome. I would like to use age, rather than time since baseline, as the time variable, so I created TSCORES variables (age0, age1, age2, etc.) representing their age at each wave of data collection. In addition, I am trying to test for cohort effects given the 13-17 range at baseline. Does the syntax below make sense? In particular, I’m concerned about whether I am structuring the TSCORES correctly. Thanks in advance, Carolyn
I am trying to develop a multivariate MLGM (using Bayesian estimation). The challenge is that data consist of different cohorts.
- I have two cohorts with data at three time points each (school grades 8, 9, and 10).
- I have two cohorts with data at two time points (school grades 8 and 9 OR school grades 9 and 10, so these cohorts have missingness by design).
My idea is that some of the analyses should combine all four cohorts into one analysis of school grades 8 to 10 to increase sample size.
I see there are a few options for cohort analyses in Mplus. But I should try to do this wisely. I would want to test for time effects (e.g. simulating the possibility of unknown historical events). This means that a measurement in any of the three specific years where measurements were conducted can affect scores, for one cohort this effect will be in grade 8, for another cohort the effect will be in grade 9 while for yet another cohort the effect will be in grade 10.
Anyone of the wonderful Mplus team -- or any other in the Mplus community -- do you have suggestions on how best to develop this model?
I would do a multiple-group analysis, with cohort as group and grade as time axis for the growth model. With Bayes, you do multiple-group via Knownclass in Type=Mixture. Testing for time effects may be more tricky. Although an event in a certain year influences subjects in different grades for different cohorts, we don't know if that same event has a different influence for students in different grades. There is a large literature on age-period-cohort analyses. But in principle you can let the event effect be restricted to have the same magnitude in the different cohorts (for the different grades), for instance by letting an intercept of the outcome at that point jump out of line of a linear growth model.
Thanks, Bengt. So you would not (also) do a model with all cohorts and try to account for cohorts effects within that model. A multiple-group analysis seems to give me only data for the two cohorts with measurements at all three time points, I wanted to add an analysis with all my data (four cohorts) to check whether increasing the number of cohorts and the sample size changed anything (but this gives missingness by design).
You can do a single-group run of all cohorts as well, although then investigation of cohort differences is not as flexible. The multi-group approach can handle different number of observed variables per our FAQ, but I haven't tried something like that with Bayes.
I have data from an accelerated cohort design with 4 cohorts, measured at three time points. I want to look at temporal antecedents for my outcome. Is the only way to do this to use X1 and X2 to predict Y2 and Y3, or is there a way I can use the accelerated cohort design too?
Hello, I am interested in running a multiple-group multiple-cohort model similar to example 6.18, although I would like to use time scores rather than measurement occasion for my time points. In a multiple-group multiple-cohort model, do I need to center the individually-varying time-scores (age) for each of the groups (cohorts) at initial measurement, or would I grand-mean center the time scores for all groups? Thanks for your assistance.
In response to a recent post in this thread, Linda suggested that the user should center time scores representing age on the mean age taken across all observation points. I have been under the assumption that time scores representing age should be centered on the mean age at the first observation. Is one of these methods "correct" or are there certain situations where one method is preferable to the other? As always, thank you for your guidance.
I think centering choice is largely determined substantively. That is, which age do you want the intercept factor to represent? But in some cases the correlations between the growth factors can get uncomfortably close to 1 in which case average age centering can help to make them less correlated.
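The two centering choices discussed here can be compared with a minimal sketch. The ages are hypothetical and the helper `center` is an assumption for illustration; the point is only which age the intercept factor ends up referring to.

```python
def center(ages, reference):
    """Subtract the reference age so that time score 0 falls at that age."""
    return [a - reference for a in ages]

# Hypothetical ages at three occasions for three people.
ages_by_person = [[12, 13, 19], [15, 16, 22], [18, 19, 25]]

# Option 1: center on the mean age at the first observation.
baseline_mean = sum(p[0] for p in ages_by_person) / len(ages_by_person)

# Option 2: center on the mean age across all observation points.
all_ages = [a for p in ages_by_person for a in p]
grand_mean = sum(all_ages) / len(all_ages)

print(baseline_mean, [center(p, baseline_mean) for p in ages_by_person])
print(grand_mean, [center(p, grand_mean) for p in ages_by_person])
```

With the first option the intercept refers to the average starting age; with the second it refers to an age in the middle of the observed range, which is the choice that tends to reduce the correlation between intercept and slope.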
I've run a multiple group multiple cohort analysis, using KNOWNCLASS due to the use of time scores. I am not sure if there are problems with my model or if I need to rethink my interpretation of group-specific intercepts.
My main question is regarding the interpretation of the intercept for each cohort. Currently the time scores are age at each observation centered on the grand mean for age at time 0, with this value divided by 10 to reduce the range of time scores.
The problem is that I am receiving estimated intercepts that are outside the range of possible values for the outcomes. For example, the estimated intercept for my outcome for the oldest cohort (cohort 1) was 24, which is far off the possible range of values for the outcome (mean = 3, sd = 3, range = 0-11). The mean value of the time score for the oldest cohort was 1.4. My understanding is that the intercept should be interpreted as the mean value of my outcome at the mean age at time 0. Should the group-specific intercepts be interpreted in this manner? If so, would these intercepts suggest non-linearity that may require quadratic terms?
FYI, here is the mean age of each cohort at time 0:
Thank you again for the guidance. I now have a question regarding the class specific output. When using the KNOWNCLASS option in mixture modeling, my assumption was that the "CLASSIFICATION OF INDIVIDUALS BASED ON THEIR MOST LIKELY LATENT CLASS MEMBERSHIP" output should provide the same group n-counts as the frequency of my cohort grouping variable, but it does not. Is there a simple explanation for this difference?
Observed class counts: a = 4753 b = 2096 c = 9396 d = 2277
Estimated class counts and proportions n-counts: a = 4480, b = 2239, c = 7596, d = 4206
You need to follow Example 6.18. This is also described in either the Topic 3 or Topic 4 course handouts. The key is arranging your data by age not cohort. Age is the time variable. The time scores come from this. This is described in Example 6.18. Follow these steps for your example.