Anonymous posted on Monday, October 22, 2001 - 12:19 pm
May I ask you one question? I am trying to use the missing-by-design method following example 24.2 in the Mplus manual. My data have two time points and my subjects' ages range from 4 to 9. I am planning to develop a latent growth model using age rather than the data collection times. My question is whether it is OK to conduct latent growth modeling in this situation. That is, I have 6 time points (ages) in my LGM but each subject has only two observations and four missing points. Does my plan sound OK? Do I have too many missing points relative to the number of observations? Thanks so much.
bmuthen posted on Tuesday, October 23, 2001 - 9:41 am
It is possible to do growth modeling in your case. The growth model that you can fit is essentially determined by the number of time points a given individual has - that is, 2 in your case. With 2 time points you can only fit a restricted model, such as random intercept but fixed (not random) slope. This can be specified as zero slope variance and zero intercept-slope covariance.
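As a minimal sketch outside Mplus, the restricted model described here (random intercept, fixed slope) implies means that change linearly with age and a covariance matrix in which every pair of ages shares only the intercept variance. The Python below uses hypothetical parameter values and assumes time scores 0 to 5 for ages 4 to 9; neither comes from the thread.

```python
def implied_moments(times, int_mean, slope_mean, int_var, resid_var):
    """Model-implied means and covariances for a growth model with a
    random intercept and a fixed slope (slope variance and
    intercept-slope covariance both zero)."""
    means = [int_mean + slope_mean * t for t in times]
    n = len(times)
    # Off-diagonal: intercept variance only. Diagonal: plus residual variance.
    cov = [
        [int_var + (resid_var if r == c else 0.0) for c in range(n)]
        for r in range(n)
    ]
    return means, cov


# Assumed time scores for ages 4..9 and hypothetical parameter values.
times = [0, 1, 2, 3, 4, 5]
means, cov = implied_moments(times, int_mean=10.0, slope_mean=2.0,
                             int_var=4.0, resid_var=1.0)
print(means)
print(cov[0])
```

Because the slope has no variance, the time scores enter only the means; the covariance between any two ages is the same constant.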
Anonymous posted on Tuesday, March 15, 2005 - 8:15 am
We are fitting a growth curve model to cohort-sequential data covering ages 3-22, which we are handling with FIML estimation using type = missing (there are many different missing data patterns). We have time-varying covariates that follow the same patterns of missingness. But because these are exogenous observed variables, they are not allowed to have missing values. Is it possible with Mplus to estimate the effects of time-varying covariates that themselves have multiple patterns of missingness (without doing imputation)?
You would have to bring them into the model to do this. This would mean that they would take on distributional assumptions as if they were endogenous. To do this, you can mention their variances in the model command.
Anonymous posted on Monday, May 30, 2005 - 11:47 pm
Is it correct that it is not possible to use the pattern statement for defining data missing by design when using categorical variables in an EFA or CFA? If so, is there a way to get around this other than treating the variables as interval variables?
I am assuming that you must have tried this and gotten an error message to that effect. This seems possible given that missing was originally developed for continuous outcomes. You can get around this by doing your analysis in two steps. In the first step, do not treat the variables as categorical. Use the PATTERN option and do a TYPE = MISSING BASIC; in conjunction with the SAVEDATA command. In the second step, analyze the saved data while declaring the outcomes as categorical.
Anonymous posted on Thursday, June 09, 2005 - 4:55 pm
That's right. Thank you very much for your response. I will try this.
I would like to estimate a Multiple Cohort Growth Model similar to the example you have on your Special Analyses page. However, my dataset is already arranged by age (not wave). That is, I don't need to rearrange my data. How do I set up the model in this case? I tried using the PATTERN option in combination with my cohort variable to define the missing by design pattern, but I had to set the covariance coverage value to zero in order for Mplus to estimate my model. How would you set up the model in this case?
In that case, you don't need to do anything special. The special functions are just for rearranging the data. Use TYPE=MISSING;
Eveline posted on Tuesday, June 14, 2005 - 8:15 pm
I think I have followed your advice on using the PATTERN statement for CATEGORICAL variables in two steps. I don't think it worked, but I might have misunderstood you. This is what I did.
(1) TYPE=MISSING BASIC in analysis statement, using the PATTERN statement, treating variables as continuous, 1 iteration and saving the response data.
(2) I've tried two versions of the second step. I defined the indicators as categorical and tried to run it both with and without PATTERN statement:
- When I did not use the PATTERN statement the following statement appeared in the output "COMPUTATIONAL PROBLEMS ESTIMATING THE CORRELATION FOR S404QRA AND S269QRA. THE MISSING DATA COVARIANCE COVERAGE FOR THIS PAIR IS ZERO." Same for many other pairs of variables that do not appear in the same booklet.
- When I did use the PATTERN statement the output gave the error "*** ERROR in Variable command. Analysis with categorical variables is not available with PATTERN, COHORT, COPATTERN features."
Is this what you meant I could try? Thanks!
bmuthen posted on Wednesday, June 15, 2005 - 7:49 am
fati posted on Friday, October 07, 2005 - 11:26 am
I have used the PATTERN option with TYPE = MISSING BASIC; in conjunction with the SAVEDATA command, in order to analyze the saved file in a mixture model in the second step. However, the number of observations is counted incorrectly. I start with 978 observations. As the pattern variable I used a variable with no missing values, which has 3 categories, but within each category of the pattern variable there are also missing values, so the number of observations is not correct (just 783). How can I use the pattern variable and still count the observations that have other missing values?
You need to send your input, data, output, and license number to firstname.lastname@example.org. It is not clear from your description exactly what is happening. There will be cases deleted if they do not have the variables listed using the PATTERN option for their pattern.
fati posted on Thursday, October 20, 2005 - 9:29 am
I would like to do an LCA with data missing by design, so I have done the analysis in two steps. In the first step, I use the PATTERN option and do a TYPE = MISSING BASIC; in conjunction with the SAVEDATA command. In the second step, I analyze the saved data while declaring the outcomes as categorical. But in my first output, I get the following message: THE MISSING DATA EM ALGORITHM FOR THE H1 MODEL HAS NOT CONVERGED WITH RESPECT TO THE PARAMETER ESTIMATES. THIS MAY BE DUE TO SPARSE DATA LEADING TO A SINGULAR COVARIANCE MATRIX ESTIMATE. INCREASE THE NUMBER OF H1 ITERATIONS.
RESULTS FOR BASIC ANALYSIS
ESTIMATED SAMPLE STATISTICS
NO CONVERGENCE IN THE MISSING DATA ESTIMATION OF THE SAMPLE STATISTICS.
I have tried increasing the number of H1 iterations to 2000, but I always get the same message. What can I do to resolve this? Thank you for your help.
bmuthen posted on Thursday, October 20, 2005 - 6:52 pm
If you have missing by design, MCAR (missing completely at random) holds for this part of the missingness. I would not do the analysis in two steps, but in one step using Type = Missing. This takes care of both missing by design and other missingness giving ML estimation under MAR. This may solve your H1 model problem.
If your H1 problem persists, you can either drop H1 from the estimation or try a multiple-group analysis where each missing-by-design pattern is a group. Here you should also use Type = Missing.
fati posted on Thursday, October 27, 2005 - 5:48 am
I think it is not possible to use the PATTERN option with categorical variables, which is why I used two steps in my analysis. Is that correct?
The PATTERN option refers to one design variable. If you have more than one design variable, you could create a variable that reflects this and then specify which variables individuals with each condition should have.
I am trying to do a multigroup CFA model, but my grouping variable is defined by missingness. I have used the PATTERN command in order to allow for the missingness, but the only way that I can get the model to run is if I leave out the grouping variable, thus having only one model. If I leave the grouping command in the model the program gives me an error saying that the grouping/pattern variable has multiple uses. I have included a sample of my syntax to clarify what I am trying:
GROUPING IS patgrp(1=1 2=2 3=3);
PATTERN IS patgrp(1=a b c d e f 2=a b c d e 3=a b c d);
ANALYSIS:
  TYPE=General MISSING h1;
  ESTIMATOR=ML;
MODEL:
  f1 by a b;
  f2 BY c d;
MODEL 1:
  f2 by e f;
  f1 WITH f2 (cv1);
MODEL 2:
  f2 by e;
  f1 WITH f2 (cv2);
MODEL 3:
  f1 WITH f2 (cv3);
If I remove the MODEL 1 - MODEL 3 statements and the GROUPING statement, and include the variables with missingness in the by part of the MODEL statement it runs, but it is only one model, which is not what I want.
I would greatly appreciate any suggestions you might have for rectifying this problem, or if you can point me in the direction of some literature that I might read.
Thank you for your response. I initially tried running the model without the PATTERN statement, but the missing data gave me problems. Specifically, I kept getting errors saying that a variable had no non-missing values (meaning the variable that is missing for a group), even if I did not invoke that variable in that group's model statement.
For example, the following code gives me problems because for group 2 there are no non-missing values for variable f and for group 3 there are no non-missing values for variables e or f.
GROUPING IS patgrp(1=1 2=2 3=3);
ANALYSIS:
  TYPE=General MISSING h1;
  ESTIMATOR=ML;
MODEL:
  f1 by a b;
  f2 BY c d;
MODEL 1:
  f2 by e f;
  f1 WITH f2 (cv1);
MODEL 2:
  f2 by e;
  f1 WITH f2 (cv2);
MODEL 3:
  f1 WITH f2 (cv3);
I guess the question is then, how do I get Mplus to ignore the missing values for group 2 and group 3?
Thank you for your response, again. I am sorry to be a pain about this, but I think that I may not be making my problem clear. I need to do a multigroup CFA. I have three groups and they are defined by missingness. For example, group 1 has data for all variables, group 2 has data for all but variable A, and group 3 has data for all but variables A and B. I am constructing my CFA so that I have three models: MODEL 1 has all variables in it, MODEL 2 has all but variable A in it, and MODEL 3 has all but variables A and B in it. When I use the GROUPING option, Mplus gives me errors saying that there are no non-missing values, even though I do not invoke the missing variables in their corresponding model statements. I tried using the PATTERN option to fix this, and then I could get a single, overall model to run. However, as mentioned in my first post, I need separate models for each group. As far as I can tell, there is no way to do a multigroup CFA with the groups defined by missingness, since I cannot use the GROUPING and PATTERN options at the same time.
To boil it down, is there a way that I can use the PATTERN option and run multiple group analyses, or alternatively, can I use the GROUPING option and tell Mplus to ignore the missingness by design so that I don't get errors when I define the groups by missing variables?
I assume you want to do multiple group analysis to test for the invariance of certain parameters across groups. You can do the same thing in a single group analysis using a set of dummy variables to represent the three groups. Usually when groups represent patterns of missing data, equalities are placed such that the results represent a single-group analysis.
In your post from May 31, 2005, you propose a two-step procedure for using the PATTERN option in combination with CATEGORICAL data, using the SAVEDATA command in the first step. It is not clear to me which data I should save in the first step. I want to establish a SEM for one test. Subjects worked on one of three different test booklets, leading to missing by design in my data. Thank you very much for your help.
You would run the analysis with your full set of data. The PATTERN option would specify for each value of the pattern variable, the variable for which listwise deletion would be carried out. You save that data and use it in your analysis.
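The listwise deletion described in this reply can be sketched in Python. This is only an illustration of the idea as stated in this thread (a case is dropped if it lacks any variable listed for its own pattern value), not Mplus's actual implementation; the variable names, the data, and the choice to drop cases with an unlisted pattern value are all assumptions.

```python
# Hypothetical cases: "form" is the pattern variable; None marks a missing value.
cases = [
    {"form": 1, "a": 1, "b": 0, "c": 1, "d": None},
    {"form": 1, "a": 1, "b": None, "c": 0, "d": None},     # lacks b, required for form 1
    {"form": 2, "a": 0, "b": None, "c": None, "d": 1},
    {"form": 2, "a": None, "b": None, "c": None, "d": 0},  # lacks a, required for form 2
]

# Hypothetical analogue of a PATTERN specification:
# pattern value -> variables those cases should have by design.
pattern = {1: ["a", "b", "c"], 2: ["a", "d"]}


def listwise_by_pattern(cases, pattern_var, pattern):
    """Keep a case only if every variable listed for its pattern value is observed."""
    kept = []
    for case in cases:
        required = pattern.get(case[pattern_var])
        if required is None:
            continue  # assumption: cases with an unlisted pattern value are dropped
        if all(case[name] is not None for name in required):
            kept.append(case)
    return kept


kept = listwise_by_pattern(cases, "form", pattern)
print(len(cases), "cases in,", len(kept), "cases kept")
```

Variables that a case does not have by design (for example d for form 1) do not cause deletion; only a missing value on a variable that the case's own pattern lists does. This is one way the saved file can end up with fewer observations than the original data.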
Thanks for your last advice. It helped a lot with one dataset. Now I've encountered a new problem with two other datasets. In these two studies, we had three groups. All students worked on the same six items. Additionally, each group of students worked on 8 to 9 unique items. Therefore, we cannot estimate correlations between unique items from different versions, e.g. between an item from version A and an item from version B. I tried the two-step procedure for this data. In step 2, I don't get any results, only the following message: "THE MISSING DATA COVARIANCE COVERAGE FOR THIS PAIR IS ZERO" for all pairs of variables from two different test versions. Is there a way to circumvent this problem, or does the data simply not allow estimation of a SEM? Thank you very much for your help.
I have an R&R on a paper where I have reorganized 3 waves of data into 14 ages and fit a quadratic LGM trajectory. Everything looks good: no flags, good fit indices, and nice parameter estimates. Unfortunately, I have a reviewer who doesn't comprehend using FIML to address data which is missing by design due to the wave->age rearrangement. The reviewer claims that I have used 3 waves of data to impute 8 additional repeated measures and accordingly argues that this is an invalid method. Do you know of a good citation supporting the use of FIML to address data which is missing by design in such an "accelerated" longitudinal design? I know the relevant passages in the Mplus user's manual and also the Bollen and Curran (2006) book. Any other citations you'd recommend?
If you are interested in growth over age and you have collected data on several occasions, you need to arrange the data by age to study this development or use age as an individually-varying time of observation. See the following paper where this type of analysis was done:
Muthén, B. & Muthén, L. (2000). The development of heavy drinking and alcohol-related problems from ages 18 to 37 in a U.S. national sample. Journal of Studies on Alcohol, 61, 290-300.
Missing by design is MCAR. See the Little and Rubin book for background.
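To make the rearrangement by age concrete, here is a minimal Python sketch done outside Mplus. The records, the age range 4 to 9, and the use of None as the missing flag are assumptions for illustration; a real Mplus data file would need a numeric missing-value code instead.

```python
# Hypothetical long-format records: (person id, age at measurement, score).
records = [
    (1, 4, 10.0), (1, 5, 12.0),   # person 1 was observed at ages 4 and 5
    (2, 6, 15.0), (2, 7, 16.5),
    (3, 8, 19.0), (3, 9, 21.0),
]
ages = list(range(4, 10))  # one column per age, 4 through 9


def arrange_by_age(records, ages):
    """Turn wave-ordered records into one row per person with one slot per age.
    Ages at which a person was never measured stay None (missing by design)."""
    wide = {}
    for person, age, score in records:
        row = wide.setdefault(person, {a: None for a in ages})
        row[age] = score
    return wide


wide = arrange_by_age(records, ages)
for person in sorted(wide):
    print(person, [wide[person][a] for a in ages])
```

No values are imputed in this step: each person still contributes only the scores that were actually observed, and the empty age columns simply record where the design did not measure that person.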
Could you suggest a reference that would allow me to substantiate the claim that missing by design is MCAR? I'm familiar with Rubin's work, but do not recall him specifically addressing missing by design. I could be wrong though.
I have only categorical (binary) variables, and it's one of those cases where the data are missing by design. I have demographic data for all examinees, a set of common items for all examinees, some examinees taking two test forms A and B (but not C and D), and the rest taking two separate test forms C and D (but not A and B). I have tried the PATTERN option but I know that I am not doing it right. For now, I am just computing the tetrachoric correlations, and any help as to how I can specify the missing by design data will be appreciated.
You should set your data up with four columns representing A, B, C, and D. Individuals who have not taken C and D will have missing data on C and D. Individuals who have not taken A and B will have missing data on A and B. Then use the default of TYPE=MISSING;
But Mplus 3 does not run that for me, and it doesn't give me a reason why. It just acts like it is running and returns an output file with only my script in it. It does run well, with good output, without the "PATTERN IS" line, however. Do you have any idea why it is not running with the PATTERN IS line?
No, I can't say. You should first update to version 5.1 and if the problem persists then send it to email@example.com with your license number.
Abdel posted on Saturday, August 09, 2008 - 11:04 am
Thanks for the quick reply! It will be a little difficult for me to get that update in time, since I have a deadline. Are there any readings you know of that can tell me how large the bias is without the PATTERN IS option? Or some paper(s) about how Mplus handles missing by design in a case like this?
I have used the two-step procedure you've suggested on May 31, 2005 to analyse my categorical data with missing by design. In step 1, I have the following warning: "THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES MAY NOT BE TRUSTWORTHY FOR SOME PARAMETERS DUE TO A NON-POSITIVE DEFINITE FIRST-ORDER DERIVATIVE PRODUCT MATRIX. THIS MAY BE DUE TO THE STARTING VALUES BUT MAY ALSO BE AN INDICATION OF MODEL NONIDENTIFICATION. THE CONDITION NUMBER IS 0.154D-13. PROBLEM INVOLVING PARAMETER 36."
Nevertheless, I proceeded with step 2 where I did not get this message. However, I'm not sure whether I can trust my results now.
A collaborator has collected data for a joint project via a website with large numbers of subjects but with a lot of missing data by design (he administers a relatively small subset of the item pool to each subject to reduce subject burden). In fact, the covariance coverage is below the 10% Mplus default cutoff for every pair of items. I would very much appreciate being pointed in the direction of information re: how the 10% cutoff was derived. Are there simulations, for example, that suggested this cutoff? And might a different value be acceptable with a very large overall sample? (That is, is it truly the % that is important or is it the number of cases each covariance is based on that is important?). Many thanks in advance!
Missing by design usually involves zero coverage, which is not a problem when it arises by design. Low coverage is a problem. The 10% cutoff is not suggested as an okay amount. Much higher coverage is recommended. It is just where we draw the line. Lower coverage typically gives problems with the unrestricted H1 model. See the Schafer book in the user's guide reference list for further information about coverage.
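For readers who want to see what covariance coverage measures, the Python sketch below computes, for each pair of variables, the proportion of cases with both values observed, and flags pairs under a cutoff. The data and variable names are hypothetical, and the 0.10 default simply mirrors the cutoff discussed in this thread; this is an illustration, not Mplus's own code.

```python
from itertools import combinations


def covariance_coverage(rows, names):
    """Proportion of cases in which both variables of each pair are observed."""
    n = len(rows)
    coverage = {}
    for i, j in combinations(range(len(names)), 2):
        both = sum(1 for row in rows if row[i] is not None and row[j] is not None)
        coverage[(names[i], names[j])] = both / n
    return coverage


def below_cutoff(coverage, cutoff=0.10):
    """Pairs whose coverage falls below the cutoff (zero coverage included)."""
    return sorted(pair for pair, value in coverage.items() if value < cutoff)


# Hypothetical two-booklet design: everyone answers "common";
# "a1" appears only in booklet A and "b1" only in booklet B. None = not administered.
names = ["common", "a1", "b1"]
rows = [
    (1, 0, None),
    (0, 1, None),
    (1, None, 1),
    (0, None, 0),
]

coverage = covariance_coverage(rows, names)
for pair, value in sorted(coverage.items()):
    print(pair, value)
print("below cutoff:", below_cutoff(coverage))
```

In this toy design the pair of booklet-specific items has zero coverage by design, while each booklet-specific item shares half of the sample with the common item.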
Would you please direct me to syntax examples of MISSING BY DESIGN? I have 14 booklets, and subjects worked on 2 of the 14 different test booklets, leading to missing data. These are categorical indicators in a latent class analysis. I am not able to find syntax examples with explanations of the codes.
I also have another question. I tried the two-step procedure and I want to assess DIF. When I use the MLR estimator it works, but when I use WLSMV I get this error (for the items that are missing by design):
THE WEIGHT MATRIX PART OF VARIABLE I9T1 IS NON-INVERTIBLE. THIS MAY BE DUE TO ONE OR MORE CATEGORIES HAVING TOO FEW OBSERVATIONS. CHECK YOUR DATA AND/OR COLLAPSE THE CATEGORIES FOR THIS VARIABLE. PROBLEM INVOLVING THE REGRESSION OF I9T1 ON AGEGROUP. THE PROBLEM MAY BE CAUSED BY AN EMPTY CELL IN THE BIVARIATE TABLE.
I should have said that this error only appears for the variables which are missing by design. I'm trying to regress the indicators on the grouping variable (the one used in the PATTERN option in the first step) to see if there's DIF for the items that appear in both groups. I get the error even though I don't regress the variables missing by design on the grouping variable, but this doesn't happen when using MLR.
This makes sense. Variables missing by design may have sparse cells in the bivariate tables. MLR does not use univariate and bivariate frequencies for model estimation so this message would not come up although the issue of sparse cells is still there. I would look at the CROSSTABS.
I am trying to analyse data that both has a nested structure (students in classes in schools) and an incomplete design.
I tried using PATTERN to specify which variables are missing by design (there were 3 different designs; thus each student was administered 2/3 of the total items) and using STRATIFICATION IS school; CLUSTER IS class; to adjust standard errors to the nested structure.
However, PATTERN doesn't seem to work with TYPE=Complex and STRATIFICATION only works with TYPE=Complex. I would be very grateful if you could help me address this problem.
You can do the analysis in two steps. In the first step, do not use COMPLEX and STRATIFICATION. Simply use the PATTERN option and TYPE=BASIC, putting the CLUSTER and STRATIFICATION variables on the AUXILIARY list. Save the data in this step. In the second step, use the saved data and the COMPLEX and STRATIFICATION options.
I have data with missing by design, like x1-x8 with x6-7 being missing in group 1 and x5-6 being missing in group 2. So some pairs of variables have zero covariance coverage.
I tried the "pattern" option (with Mplus 6) in the VARIABLE command with ML, and got a result. Then I omitted this option and ran the model again. The results were identical. What does the "pattern" option specifically do?
The PATTERN option is for use with listwise deletion not the TYPE=MISSING default in Mplus.
Sandra N. posted on Thursday, March 08, 2012 - 1:44 am
Hi Linda, we collected data using a multi-matrix design (35 booklets, balanced incomplete block design, N=1200). We tried running a CFA based on these data and had to lower the covariance coverage limit, as coverage is around .03 for the variables we are interested in. The fit indices were implausibly high, and some latent correlations were above 1. We checked whether the PATTERN option would help, but it did not. Is there any way we can deal with these problems? Is there a minimum covariance coverage for running the analyses? Many thanks in advance!
Dear Dr. Muthen, I'm running an IRT model with planned missing data. I have 4 booklets; 8 items occur in each booklet, and 16 items vary from booklet to booklet. I tried to use the PATTERN option, but an error in the output indicates that it can't be used in conjunction with categorical data.
Is it correct to use FIML-estimation and not to specify the missing by design pattern?
Dear Dr. Muthen, another question: I want to compare a 1p and a 2p IRT Model (45 Items). Mplus can't compute Chi² because "THE CHI-SQUARE TEST CANNOT BE COMPUTED BECAUSE THE FREQUENCY TABLE FOR THE LATENT CLASS INDICATOR MODEL PART IS TOO LARGE." Should I use AIC and BIC to compare the models, or is there another possibility?
In this case you can use the likelihood ratio chi-square difference test. Two times the difference in loglikelihoods for the two models is chi-square with df equal to the difference in number of parameters.
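The arithmetic of this difference test is simple enough to sketch in Python. The loglikelihoods and parameter counts below are hypothetical placeholders, not values from any real output, and the sketch assumes plain ML loglikelihoods; with MLR, the scaling correction factors reported by Mplus would also need to be taken into account, which this sketch ignores.

```python
def lr_difference_test(loglik_restricted, npar_restricted,
                       loglik_general, npar_general):
    """Likelihood ratio chi-square difference test for nested models.
    Returns the test statistic and its degrees of freedom."""
    statistic = 2.0 * (loglik_general - loglik_restricted)
    df = npar_general - npar_restricted
    return statistic, df


# Hypothetical values for a 1-parameter (restricted) and a 2-parameter
# (general) IRT model fitted to the same data.
statistic, df = lr_difference_test(
    loglik_restricted=-12050.0, npar_restricted=46,
    loglik_general=-12010.0, npar_general=90,
)
print("chi-square difference:", statistic, "df:", df)
```

The statistic is then referred to a chi-square distribution with the returned degrees of freedom, for example with a printed table or a statistics package.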
Dear Dr. Muthen, just another question. What would be the best way to do item selection using Mplus? I have 45 items and want to select those which conform to the 2p or 1p model. Sorry for these questions, but it's my first time doing IRT models.
Should I start by deleting items based on factor loadings and then use ICCs and information curves? Then further inspect the TECH10 output, starting with the univariate fit information and deleting those items with z > 1.96, and so on?
Is there a possibility to evaluate person fit? Can I use the standardised residuals for the response pattern?
Hi, I want to run a cohort-sequential LGM. I have 20 cohorts, each with 4 measurement occasions, across 20 time-structured timepoints, set up in a wide data format. I have a fairly small sample (N = 152), and each of my cohorts is small. I also have missing data due to dropout, and have 71 missing data. I set the residuals of all timepoints equal. Even after setting COVERAGE = 0 and increasing my H1ITERATIONS, I receive convergence errors: THE STANDARD ERRORS OF THE MODEL PARAMETER ESTIMATES COULD NOT BE COMPUTED. THE MODEL MAY NOT BE IDENTIFIED. CHECK YOUR MODEL. PROBLEM INVOLVING PARAMETER 5. Parameter 5 is the Psi between intercept and slope, and I do not notice any blatant problems with that value. How do you suggest that I proceed? I have also considered running this model in a multilevel framework, or using TSCORES. Do you suggest either of these approaches instead of the LGM described above, keeping in mind that I intend to eventually conduct a parallel process model using this LGM? Thanks very much for your input.
Samuli Helle posted on Wednesday, February 06, 2013 - 6:34 am
I'm rather new to Mplus and to missing data procedures. I think I have a case where part of my data is missing by design: I have measured three DVs from three cohorts of women and for one DV, there are missing values (by design since it was not recorded?) for all women belonging to a specific cohort. My impression is that as I'm using Mplus 7.0 I actually shouldn't do anything "special" because FIML can handle this sort of missingness (and I have used e.g. MISSING ARE ALL(-99))? Is this correct?
You don't need to do anything special. FIML is the default in Mplus.
Samuli Helle posted on Wednesday, February 06, 2013 - 1:06 pm
Thank you. So I can ignore the following note:
ONE OR MORE PARAMETERS WERE FIXED TO AVOID SINGULARITY OF THE INFORMATION MATRIX. THE SINGULARITY IS MOST LIKELY DUE TO THE MODEL IS NOT IDENTIFIED, OR DUE TO A LARGE OR A SMALL PARAMETER ON THE LOGIT SCALE. THE FOLLOWING PARAMETERS WERE FIXED: 8
...and that this parameter has estimate and SE of 0.0000.
Is the approach implemented via the PATTERN option of the VARIABLE command the same as the multiple-groups SEM methodology described in Muthen, Kaplan, and Hollis (1987)? I am interested in replicating their approach using Mplus with a toy example to demonstrate how FIML works "under the hood" for an upcoming presentation I'm delivering.
I am trying to run a CFA with categorical indicators. The data have both designed missingness (approximately 60%) and truly omitted responses (approximately 3.4% of valid data). As many people have noted previously, the “pattern” statement does not work with categorical variables, so I tried the two-step approach you suggested on May 31, 2005. In this approach, can you elaborate on how missing data (missing by design) are treated? I am using the default estimator (WLSMV). Also, is there any way I can model both designed missingness and truly omitted responses when running a CFA with categorical indicators?
Hello, I have a question about how factor scores are computed when data are missing by design.
I ran a CFA (3-factor structure) with 14 variables (ordinal variables) using the WLSMV estimation, and data are missing by design. For group 1, all 14 variables were administered. For group 2, 10 out of 14 variables were administered. For group 3, 4 out of 14 variables were administered. Group 2 and 3 do not share any common items.
When I looked at the correlation among the factor scores by each group, the correlations were 1.0 for group 3 (those who got 4 out of 14 items).
So, I am wondering how factor scores were computed for group 2 (missing 4 variables) and 3 (missing 10 variables) for factors representing the variables that were not administered.
Is there a formula for computing factor scores that I can read?
I assume you use the Mplus default of measurement parameter invariance across groups. The factor scores for a person are estimated using the estimated model parameter values and the person's observed scores. Perhaps group 3 didn't have items measuring all three factors, or in any case the factors must be poorly measured by only the 4 variables in group 3.
The Technical Appendix for Version 2 is on our website and appendix 11 describes WLSMV factor score estimation.