I am estimating a model with several observed dichotomous outcomes. To simplify, let's say that I have one exogenous variable (X) and two endogenous variables (Y1 and Y2), all of which are dichotomous. X affects both Y1 and Y2, and Y1 affects Y2. I would like to be able to say something about the extent to which X affects the probability of Y2 (both directly and indirectly through Y1), rather than limiting my discussion to how the underlying latent variables are related. Reviewers have requested that I go beyond the sign and direction of coefficients in my interpretation (a reasonable request); I am just not sure how to do it.
Any help on calculating/interpreting predicted probabilities or using some other approach to interpretation would be greatly appreciated.
bmuthen posted on Sunday, July 08, 2001 - 10:53 am
You can study how x affects y2 probabilities directly and indirectly as follows. Assume
y*_1 = g_1*x + e_1, y*_2 = b*y*_1 + g_2*x + e_2,
where y* denotes the underlying latent response variable and b and g are regression coefficients. It follows that
y*_2 = b*g_1*x + g_2*x + b*e_1 + e_2.
Mplus assumes that V(y* | x) = 1, so that V(b*e_1 + e_2) = 1. Then
P(y_2 = 1 | x) = P(y*_2 > tau_2 | x) = F(-tau_2 + b*g_1*x + g_2*x),
where tau_2 is the threshold for y_2 and F is the standard normal distribution function found in tables. The second term in the argument of F is the indirect effect and the third term is the direct effect. Using different values of x, the effects of x on y_2 = 1 probabilities via these two terms can be computed.
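A minimal sketch of this calculation, using the formula P(y_2 = 1 | x) = F(-tau_2 + b*g_1*x + g_2*x) and assuming V(y*_2 | x) = 1. The numeric values of tau_2, b, g_1 and g_2 are hypothetical and chosen only for illustration; the function names are not part of Mplus.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal distribution function F."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_y2(x, tau_2, b, g_1, g_2):
    """P(y_2 = 1 | x) = F(-tau_2 + b*g_1*x + g_2*x), assuming V(y*_2 | x) = 1."""
    indirect = b * g_1 * x  # part of the effect of x that passes through y*_1
    direct = g_2 * x        # part of the effect of x that does not
    return normal_cdf(-tau_2 + indirect + direct)

# Hypothetical estimates (not taken from any output in this thread).
tau_2, b, g_1, g_2 = 0.5, 0.6, 0.8, 0.3

p_x0 = prob_y2(0, tau_2, b, g_1, g_2)          # probability when x = 0
p_x1 = prob_y2(1, tau_2, b, g_1, g_2)          # probability when x = 1
p_x1_direct_only = normal_cdf(-tau_2 + g_2)    # x = 1 with the indirect term dropped

print(round(p_x0, 3), round(p_x1, 3), round(p_x1_direct_only, 3))
```

Comparing p_x1 with p_x1_direct_only gives a feel for how much of the change in probability is carried by the indirect term at these particular values.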
Anonymous posted on Monday, September 06, 2004 - 5:11 pm
I have two (unrelated) models from which I am trying to calculate the probability of a binary outcome (labeled u3 in Model 1 and u2 in Model 2) for given values of the other variables. I have the Day 3 MPlus handouts, which are proving quite helpful for this, but I still have a few questions.
Model 1
MODEL: f1 BY u1 u2;
u3 ON f1 x1 x2;
ANALYSIS: TYPE = general MISSING h1;
Model 2
MODEL: y1 ON x1;
u1 ON y1;
u2 ON u1;
ANALYSIS: TYPE = MEANSTRUCTURE;
For Model 1, MPlus outputs a residual variance for the outcome of interest (as in page 19 of the handout). I was planning to plug this into the probability equation as shown on pages 21-22. However, the Model 2 output does not contain a residual variance. Is this 1, as you imply in your response above, or do I need to calculate it using other items from the output (and, if so, how?)
Finally, for Model 1, to compare the effects of f1, x1, and x2 on u3, are the Std or StdYX estimates most appropriate?
I would need to see your two outputs to understand why you are getting a residual variance in one and not the other. With categorical outcomes, residual variances are printed at the end of the results with r-square when you request a standardized solution.
I think you would look at stdyx.
Anonymous posted on Tuesday, January 11, 2005 - 9:55 am
I have two unrelated questions. First, is there anywhere in the output that specifies whether a model was estimated using logit or probit? I thought I read in the user manual (or in the discussion) that random models using MLR are estimated as probit models. But in the output, I noticed that default logit thresholds were mentioned.
Second, I estimated a logit model using Mplus and STATA. While the coefficients on the independent variables were all virtually identical, the intercepts were quite different. Does Mplus calculate intercepts differently?
I am modeling continuous mediators with a categorical outcome. I asked for the IND effects for each mediator on the outcome and get the specific indirect and sum of indirects. Can I add these to the direct effect to get the total effect of each mediator on the outcome? Peggy Tonkin
See MODEL INDIRECT in the Mplus User's Guide for a full description of IND. Say "y IND x1", not "y IND x2 x1", to get all possible indirect effects and a total effect.
peggy tonkin posted on Wednesday, February 02, 2005 - 8:00 am
Thank You. Peggy Tonkin
Anonymous posted on Wednesday, June 01, 2005 - 1:11 pm
I am working on a path analysis with categorical dependent variables using MPLUS (including indirect effect), but I don't know how to interpret the coefficients of the direct and indirect effects. I have seen your answer regarding this, but still feel a bit confused by the formula you gave. Could you please give a concrete example? For instance, how to interpret the following probit regression (direct and indirect effects):
...
CATEGORICAL IS y x1
MODEL: y ON x1 x2 x3
x1 ON x2 x4
MODEL INDIRECT: y x1 x2

          Estimates    S.E.   Est./S.E.
y ON
  x1        -0.463    0.107     -4.328
  x2        -0.295    0.161     -1.838
  x3         0.063    0.383      0.165
x1 ON
  x2         0.309    0.161      1.913
  x4         0.004    0.067      0.062

Effects from x2 to y
  Sum of indirect   -0.143    0.084     -1.705
  y x1 x2           -0.143    0.084     -1.705
bmuthen posted on Wednesday, June 01, 2005 - 6:09 pm
I think you are in the WLSMV - probit framework where you can think in terms of continuous latent response variables underlying the categorical outcomes. So for
y on x1 x2 x3; x1 on x2 x4;
where y and x1 are categorical, the continuous latent response variables can be called x1* and y*. The indirect effect of say x2 on y via x1 is therefore viewed as an indirect effect of x2 on y* via x1* and is obtained as the product of the coefficients for x1* regressed on x2 and y* on x1* (which are the coefficients printed in the regular output), and this product is interpreted exactly the way you would interpret it had x1* and y* been observed (continuous) variables. And as you say, more has been said in earlier posts.
Anonymous posted on Wednesday, June 01, 2005 - 7:20 pm
Thanks for answering my question above. Three quick questions:
1. In MPLUS's probit regression, is threshold the constant term in STATA's probit regression (the sign of MPLUS's threshold is opposite to the sign of STATA's constant term)?
2. To get the threshold value, I add "TYPE=MEANSTRUCTURE" in ANALYSIS. In the results:
Does -2.524 refer to the threshold in the equation where y1 is the dependent variable?
3. BTW, how does MPLUS obtain standard errors for indirect effect in path analysis, when dependent variables are categorical?
Anonymous posted on Wednesday, June 01, 2005 - 7:59 pm
Sorry. one more question:
In the answer regarding interpreting coefficients you gave in 2001 (first message in this section), it seems you were addressing the case when there are two endogenous variables (y*_1 and y*_2) and one exogenous variable (x). If I have more exogenous variables, when I interpret the coefficient of one particular variable, do I need to take the mean value of the other exogenous variables? Or can I disregard the values of the other exogenous variables and only use the formula you gave, which is P(y_2 = 1 | x) = P(y*_2 > t_2 | x) = F(- t_2 + b*g_1*x + g_2*x)?
I guess I need to control the value of other variables, but I want to make sure. Thanks!
bmuthen posted on Thursday, June 02, 2005 - 9:12 am
Answers to your questions:
3. Two ways: Delta method and bootstrap (see User's Guide). The Delta method considers the product of slope estimates; the principle is the same as with continuous outcomes.
4. You need to use values for all your exogenous variables, since each slope refers to a partial effect just like in standard regression analysis.
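To make the Delta method idea concrete, here is a rough sketch of the first-order approximation for the standard error of a product of two slopes, applied to the x2 -> x1 -> y estimates posted earlier (0.309 and -0.463). It assumes the covariance between the two slope estimates is zero because that covariance is not shown in the posted output; Mplus uses the full estimated covariance matrix, so its printed S.E. (0.084) need not match this approximation exactly. The function names are illustrative.

```python
from math import sqrt

def indirect_effect(a, b):
    """Indirect effect as the product of the two slopes on the path."""
    return a * b

def delta_method_se(a, se_a, b, se_b, cov_ab=0.0):
    """First-order Delta method S.E. of a*b.

    cov_ab is the covariance between the two slope estimates; it defaults
    to 0 here only because it is not available from the posted output.
    """
    variance = b**2 * se_a**2 + a**2 * se_b**2 + 2.0 * a * b * cov_ab
    return sqrt(variance)

a, se_a = 0.309, 0.161    # x1 ON x2
b, se_b = -0.463, 0.107   # y ON x1

estimate = indirect_effect(a, b)
se = delta_method_se(a, se_a, b, se_b)

print(round(estimate, 3), round(se, 3), round(estimate / se, 3))
```

The product reproduces the printed indirect effect of -0.143; the approximate S.E. is in the same neighborhood as the printed one, with the gap attributable to the omitted covariance term and any estimator-specific details.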
Anonymous posted on Friday, July 22, 2005 - 12:06 pm
Hi, I have two questions:
1. What's the major difference between a probit model estimated by WLSMV in MPLUS and a probit model estimated by ML?
2. In SEM, when the dependent variables are ordered categorical, you said MPLUS takes them as continuous latent variables by using WLSMV estimation. Is the "latent variable" here the same as "latent variable" in factor analysis? I ask this because my impression is that a latent variable is often based on several observed variables. But when you take ONE categorical variable as a latent variable, there is only one observed variable. Are you saying that in this case, a latent variable is actually based on observed categories in the observed categorical variable?
bmuthen posted on Friday, July 22, 2005 - 12:22 pm
1. The results of those two estimators would probably be very similar (we plan to have probit ML in Mplus in the future).
2. Yes, the latent variable here is a continuous latent response variable underlying a single observed categorical variable. It is not a factor with multiple indicators but is specific to a certain observed variable. It can be thought of as what you really want to measure, whereas your measurement is a crude reflection of the response variable - the observed categories inform about which range (between neighboring thresholds) the response variable is in, but not its specific value. It is sometimes called a response propensity.
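The relation between the latent response variable and the observed categories can be illustrated with a few lines of code. This is only a sketch of the idea that the observed category tells you which interval between neighboring thresholds y* falls in; the threshold values and the function name are hypothetical.

```python
from bisect import bisect_right

def observed_category(y_star, thresholds):
    """Return the category index (0, 1, 2, ...) implied by y*.

    The category equals the number of ordered thresholds that lie at or
    below y*, i.e. the interval between neighboring thresholds containing y*.
    """
    return bisect_right(thresholds, y_star)

thresholds = [-0.5, 0.8]   # hypothetical thresholds for a 3-category variable

for y_star in (-1.0, 0.3, 2.0):
    print(y_star, observed_category(y_star, thresholds))
```

Two quite different y* values (say 0.9 and 2.0) give the same observed category, which is the sense in which the observed variable is a crude reflection of the response propensity.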
Anonymous posted on Saturday, August 20, 2005 - 8:17 am
Hi, I am estimating a SEM model by using probit estimation (WLSMV). In one of the equations, the dependent variable (Y1) is binary, and in this equation, the coefficient of a continuous independent variable (X1) is .23 (p<.001). I have two questions:
1. Can I interpret the coefficient as: for one unit increase of X1, the latent continuous variable underlying Y1 increases by .23?
2. How can I interpret the coefficient in terms of probability? Readers are used to seeing that the effects of independent variables on a dichotomous dependent variable are interpreted in a way of probability change. Since this is a probit model by WLSMV, not a logit one by ML, I am not sure how to get this alternative interpretation.
1. Yes. 2. The probability you ask for is computed as P,
P = 1 - probability ((threshold - z)/sqrt(theta)),
threshold = the threshold of the dichotomous event,
theta = the residual variance for y* of the dichotomous event obtained from the standardized solution,
and, for example
z = a*eta1 + b*eta2 + c*x,
where a, b, and c are the estimated regression coefficients of y* for the dichotomous event, regressed on two factors and one x. P is the conditional probability of the event given those factor values and x value.
To compute P you choose values of eta1, eta2, and x that you are interested in and evaluate z for those values. You then use a normal probability table to obtain
probability ((threshold - z)/sqrt(theta)),
from which you obtain the desired P.
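Instead of a normal probability table, the same steps can be carried out in a few lines of code. The sketch below implements P = 1 - F((threshold - z)/sqrt(theta)) with z = a*eta1 + b*eta2 + c*x; all numeric values are hypothetical and the function names are illustrative.

```python
from math import erf, sqrt

def normal_cdf(v):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

def event_probability(threshold, z, theta):
    """P = 1 - F((threshold - z) / sqrt(theta)).

    threshold: threshold of the dichotomous event
    z:         linear predictor for y*
    theta:     residual variance of y*
    """
    return 1.0 - normal_cdf((threshold - z) / sqrt(theta))

# Hypothetical estimates and chosen values of the factors and x.
a, b, c = 0.7, 0.4, 0.23
eta1, eta2, x = 0.0, 1.0, 2.0
z = a * eta1 + b * eta2 + c * x

p = event_probability(threshold=0.5, z=z, theta=0.64)
print(round(p, 3))
```

Evaluating the function at several x values (holding the factor values fixed) shows how the probability changes with x.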
Anonymous posted on Saturday, August 20, 2005 - 2:44 pm
Thanks a lot for your response. I can only find the threshold of the dichotomous event, but don't know how to find theta and eta1/eta2.
1. How do I obtain "the residual variance for y* of the dichotomous event" from the standardized solution?
2. What do you mean by standardized solution above?
3. What are eta1 and eta2?
Or, Could you please give me a real example?
bmuthen posted on Sunday, August 21, 2005 - 1:41 pm
1. Theta is the residual variance which is found in the output next to the R-square values (at least if you request a standardized solution).
2. If you type "Standardized" in the OUTPUT command you get slopes standardized to unit variance.
3. In this example, eta1 and eta2 are factors used to illustrate the case where you have not only x's but also factors that influence the categorical outcome. If you don't have factors, then you drop that part.
Referring to the discussion of the last few postings (Aug 2005): How can I calculate the predicted probabilities of a probit when I am doing a path model using multiple imputation? When type=imputation, standardized output is not available, so it seems I don't get the residual variance of y*.
I notice that I do get a matrix of thetas in the TECH1 output; will that contain the resid var of y*, though (i.e. are these the same thetas, or is there a homonym here)? Anyway, in my output the theta matrix contains only zeros (so they are unusable for the probability calculation).
If I may make a suggestion: It would be quite nice to have estimated probabilities as an output option in MPlus - similar to the postestimation programs people like Gary King or Scott Long have written for STATA. But maybe that's asking too much? Mplus is a brilliant programme as it is, of course.
bmuthen posted on Friday, October 14, 2005 - 10:12 am
To get the y* residual variance you have to use the parameter estimates printed (which have been averaged over the imputed data sets) and the formulas of Appendix 2 of the Version 2 Tech Appendix on the web site - see especially formula (43). We have it on our list to add more output features for imputed runs and also the estimated probabilities for individuals.
Hi Bengt, 1. In a SEM model with categorical main independent variable (5-level nominal so 4 dummy variable created), categorical/ordinal endogenous variables (mediators) and a binary outcome, would one report the standardized or the unstandardized coefficients? Can you clearly explain the pros and cons of each?
2. What are the benefits of calculating and reporting the probabilities?
1. I would not standardize wrt x here, only wrt y. Standardization wrt x is only suitable when x is continuous; it does not make sense to talk about a standard deviation (SD) change for a categorical variable (you want to consider changing categories). Standardization wrt categorical mediators or ultimate outcomes may or may not be done. I personally feel that the rush to standardization is often not necessary; I like raw coefficients. Certainly, in logistic regression one typically does not standardize. But it is possible to do so, taking as the variance the variance of the latent response variable underlying the categorical variable.
2. I think reporting key estimated probabilities for categorical dependent variables is much better than standardizations. This clearly shows what the model implies.
Hi Bengt, Thanks for the response above (March 29, 2006 - 9:06 am). I have a few follow-up questions related to the calculation of the probabilities using the scenario below:
in a model: y1 on d2 d3 d4 x1 x2 x3; y2 on y1 d2 d3 d4 x1 x3; y3 on y1 y2 d2 d3 d4 x1 x2; y4 on y1 y3 d2 d3 d4 x1 x2 x3;
where y1-y3 are 4-level ordinal variables, y4 is binary, x1 and x2 are dichotomous variables, and d2-d4 are dummy variables that represent my main independent variable, with d1 the referent category left out.
1. How would I calculate the probability of y4=1 for different categories of my main independent variable (i.e., the probability of an event (y4=1) for d2=1 compared to d4=1)?
2. Would I have to do this at each threshold value for each endogenous variable in the model (i.e., 3 threshold values for each)?
3. With respect to the continuous exogenous variable, is it sufficient to just include the group mean value for the variable?
I assume you use the WLSMV estimator (so probit and u* variables used for mediation), and not ML (logit and u variables used for mediation). Then it is simple:
1. You would express y4* in terms of the "reduced form", that is, in terms of the x variables d2, d3, d4, x1, x2, x3 (just like you would in a regular mediational path model for continuous outcomes). Then you are looking at a regular probit regression for which our V4 UG chapter 13 gives prob formulas.
2. No, because your y1-y3 are ordinal variables which each have only 1 slope and therefore the category does not have an influence; this makes it simple.
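The reduced-form step can be sketched by substituting each equation into the later ones in causal order. The example below uses a cut-down, hypothetical version of the model (one dummy d2, one covariate x1, and made-up slopes) purely to show the bookkeeping; it returns the reduced-form slope of each exogenous variable for each outcome. Turning a reduced-form slope into a probability would additionally need the threshold and the variance of the reduced-form residual, as in the probit formulas discussed earlier in the thread.

```python
def reduced_form(equations, exogenous):
    """Reduced-form (total) slopes for a recursive path model.

    equations: list of (outcome, {predictor: slope}) given in causal order,
               so every endogenous predictor appears as an earlier outcome.
    exogenous: names of the exogenous variables.
    Returns {outcome: {exogenous variable: reduced-form slope}}.
    """
    total = {}
    for outcome, slopes in equations:
        effects = {name: 0.0 for name in exogenous}
        for predictor, slope in slopes.items():
            if predictor in total:
                # Endogenous predictor: pass its own reduced-form slopes through.
                for name in exogenous:
                    effects[name] += slope * total[predictor][name]
            else:
                # Exogenous predictor: the slope enters directly.
                effects[predictor] += slope
        total[outcome] = effects
    return total

# Hypothetical slopes, not estimates from any real output.
equations = [
    ("y1", {"d2": 0.5, "x1": 0.2}),
    ("y2", {"y1": 0.4, "d2": 0.1, "x1": 0.3}),
    ("y4", {"y1": 0.3, "y2": 0.6, "d2": -0.2, "x1": 0.1}),
]

total = reduced_form(equations, ["d2", "x1"])
print(total["y4"])
```

Here the reduced-form slope of d2 on y4* combines the direct slope with every path through y1 and y2.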
Hi Bengt, Thanks for your response above. However, if my model contains a correlation between two of my mediating variables, how is this taken into account in calculating the probability. That is, the complete model is:
y1 on d2 d3 d4 x1 x2 x3; y2 on y1 d2 d3 d4 x1 x3; y3 on y1 d2 d3 d4 x1 x2; y4 on y1 y3 d2 d3 d4 x1 x2 x3; y2 with y3;
and I am trying to calculate the effect of d2 on y4.
Having that correlation in the model changes the parameter estimates and therefore the indirect effects, but not the procedure for calculating the probabilities (as a function of indirect and direct effects).
For instance, if you have
x --> y1 --> z with slopes a1 and b1
x --> y2 --> z with slopes a2 and b2
then z expressed as a function of x is
E(z | x) = (b1*a1 + b2*a2)*x
irrespective of y1 and y2 having correlated residuals.
Hi Bengt, Since one can obtain beta coefficients for the specific indirect paths with MPLUS once your model is recursive, I am assuming that one could calculate the probability of X on Y through a specific indirect path (as opposed to the overall indirect path). Is this correct?
Kris Preacher (@ U. Kansas) has some nice SPSS macros (and corresponding papers) to calculate indirect effects in single mediator models, multiple mediator models and med mod/mod med models @ http://www.psych.ku.edu/preacher/
Yes, we have a free demo which is exactly the same as the regular version except for a limitation on the number of variables that can be analyzed. The full user's guide is also on the website. You can access both via www.statmodel.com.
Sorry for the silly question, but doing a logistic regression in Mplus (using ON) I get an estimate of 4.6. How should this be interpreted (i.e., what is it)? You also get an odds ratio, so it is not an odds ratio. How do you use it when tracing a path?
It is a logit, that is, a log odds. If you are asking about an indirect effect, you can use the probit link and then the indirect effect is the product of the two regression coefficients in the indirect effect.
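For readers unsure how a log odds relates to the printed odds ratio, here is a small sketch. The slope of 4.6 is the value from the question; the log odds at x = 0 is a hypothetical number used only to show the conversion to probabilities, and the function names are illustrative.

```python
from math import exp

def odds_ratio(logit_slope):
    """Odds ratio for a one-unit increase in the predictor."""
    return exp(logit_slope)

def logistic(log_odds):
    """Convert a log odds into a probability."""
    return 1.0 / (1.0 + exp(-log_odds))

slope = 4.6                # logit slope from the question
log_odds_at_x0 = -2.0      # hypothetical log odds of the outcome when x = 0

print(round(odds_ratio(slope), 2))
p_x0 = logistic(log_odds_at_x0)
p_x1 = logistic(log_odds_at_x0 + slope)
print(round(p_x0, 3), round(p_x1, 3))
```

The slope is additive on the log odds scale, while the odds ratio is its exponential and is multiplicative on the odds scale.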
Assuming you are using WLSMV, the threshold is a z-score indicating a probability of greater than .5.
Hossein posted on Thursday, March 12, 2009 - 5:22 am
I have a dependent variable (Y) which is nominal with three levels (degradation, constant, improvement), and several independent variables (Xs) which are interval. Can I estimate any regression here? If so, what kind? And is it possible with SPSS?
Six mediating nominal variables M1-M6, each of them with different number of groups (each mediating variable describes a membership to different clusters in one of six measures)
One binary dependent variable Y (a decision of the subjects)
The idea is to show that the changes in behavior Y (decision) caused by a different treatment (X: condition) are mediated by some of the changes (belonging to one cluster and not another) within the six mediators.
I also used Hayes' (beta) indirect script for binary outcomes but then tried to do this with Mplus. The problem is that in both approaches the mediators have to be defined as categorical. But if I do this, mediation depends on the ordering of the nominal variables. Is it possible to force Mplus to do a multinomial regression of M on X and of Y on M? Doing this manually with SPSS shows that multinomial regression for some of the mediators yields significant dependencies in both directions.
With the bootstrapping procedure I am forced to set variables to categorical. Using MonteCarlo I cannot build IND or VIA effects. Grouping of course only works with one mediating variable, which makes it impossible to show specific indirect effects.
Do you have an answer for me? It would be so great; it's my dissertation.
So it sounds like you want a multinomial logistic regression of M on X and a binary logistic regression of the Y on M. The latter of course needs to be interpreted as Y probabilities shifting as a function of the nominal M categories. The way I can see this done (using ML) is to represent M by a latent class variable c, making M and c the same by using the M intercepts to connect its categories with those of c. Y prob's (thresholds) would then shift as a function of the c classes. I don't know how one would think of indirect effects in this context, however.
Thank you for your answer. I will try it this way. I got two further questions: 1. Is it a problem that I have six latent variables each with two, three or four nominal categories at the end? And 2. (to your last sentence) Do you mean there is no way to think about indirect effects or is it a technical problem? There is a strong direct effect of changes in X changing probabilities to be in one or the other state of Y. But the mediators may carry some of the changes meaning a specific pattern within the six mediators may sig. change the probability of changing state in Y. Some of the mediators can be interpreted as categorical and it can be shown that such significant specific indirect effects exist for them. So I wonder if the sig. nominal connection of M and X and the sig nominal connection of Y on M carry such effect. Thank you again, Thomas
2. I mean that an indirect effect is not simply the product of 2 slopes in this case. As you say, the mediator class probability is influenced by x and the mediator class influences the mean of the y so there is certainly an indirect effect of x, but a more complex one.
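One way to see why the effect is not a simple product of two slopes is to write out the probability of Y as a mixture over the mediator classes: P(Y = 1 | x) = sum over classes k of P(c = k | x) * P(Y = 1 | c = k). The sketch below does this for a single nominal mediator with three classes, assuming a multinomial logistic regression of c on x and class-specific logit thresholds for Y, with no direct effect of x on Y. All parameter values and function names are hypothetical, and this is an illustration of the mixture formula, not a definition of an indirect effect.

```python
from math import exp

def class_probabilities(x, intercepts, slopes):
    """Multinomial logit P(c = k | x); the last class is the reference class."""
    logits = [i + s * x for i, s in zip(intercepts, slopes)] + [0.0]
    denom = sum(exp(v) for v in logits)
    return [exp(v) / denom for v in logits]

def prob_y(x, intercepts, slopes, y_thresholds):
    """P(Y = 1 | x) as a mixture over classes.

    y_thresholds holds one logit threshold per class, with
    P(Y = 1 | c = k) = 1 / (1 + exp(threshold_k)).
    """
    p_class = class_probabilities(x, intercepts, slopes)
    p_y_by_class = [1.0 / (1.0 + exp(t)) for t in y_thresholds]
    return sum(pc * py for pc, py in zip(p_class, p_y_by_class))

# Hypothetical parameters for a 3-class mediator.
intercepts = [0.0, -0.5]        # classes 1 and 2 versus reference class 3
slopes = [1.0, 0.5]             # effect of x on the class logits
y_thresholds = [-1.0, 0.0, 1.0]

p_x0 = prob_y(0, intercepts, slopes, y_thresholds)
p_x1 = prob_y(1, intercepts, slopes, y_thresholds)
print(round(p_x0, 3), round(p_x1, 3))
```

The change from p_x0 to p_x1 arises entirely from x shifting the class probabilities, which is the more complex kind of mediated effect described above.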
Quick question re how to interpret coefficients from a regression of a continuous latent on an observed binary (and an observed ordinal).
In the case of a binary outcome, what is the reference category 0 or 1 (eg. if my equation is y=.345eta1 then do I interpret this as for every one unit increase in eta1 my log odds of y being 0 increase by .345 or do I interpret it as my log odds of y being 1 increases by .345).
Similarly, for an ordinal outcome, do I interpret y as the likelihood of being in the next lower category or the likelihood of being in the next higher category.
Could someone clarify the interpretation of indirect effects using predicted probabilities in a simple mediation model with a binary dependent variable? (I've carefully read through multiple threads, the user guides, and scoured the Internet--a concrete example would help me fix ideas.)
The path model is X --> M --> Y, where X is a binary treatment variable, M is a continuous mediating variable, and Y is a binary outcome. Below are the unstandardized coeff. (using WLSMV):
M ON X: a = .089 (.035)
Y ON M: b = 1.99 (.17)
Y ON X: c = .054 (.142)
Intercepts (for model with M): .673 (.021)
Thresholds (Y$1): 1.355 (.139)
Indirect Effect (X to Y): .176 (.071)
Using the formula provided in the user guides and this thread to calculate predicted probabilities:
P(Y=1|X) = F(-t + a*b*X + c*X)
So, when X = 0: P(Y=1|X) = F(-1.355 + .177(0) + .054(0)) = .087
And, when X = 1: P(Y=1|X) = F(-1.355 + .177(1) + .054(1)) = .130
1) Do these calculations look correct? My concern is that these predicted probabilities seem pretty low when I look at the raw data. For instance, the mean value of Y when X is 0 is .49, and it is .59 when X equals 1. Am I missing an intercept or something?
Even when you condition on X there is some variation left in M, namely its residual. This means that to get the probability of Y you have to integrate out this residual and the formula is more complex than what you have.
You can see how this is done in the paper
Muthén, B. (2011). Applications of causally defined direct and indirect effects in mediation analysis using SEM in Mplus. Submitted for publication.
which is on our web site under Papers, Mediational Modeling. The Tech Appendix goes through the formulas in Section 13.2. Note that you may be better off presenting causal effects as discussed in the paper - you will find Mplus scripts there.
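As a rough numerical check on the point about the mediator, the sketch below substitutes M = intercept + a*X + residual into the equation for y*, so that E(y* | X) = b*(intercept + a*X) + c*X, and divides by the standard deviation of y* given X. It assumes this conditional variance equals 1 (the V(y* | x) = 1 scaling mentioned at the top of the thread); under a different parameterization the argument var_ystar_given_x would have to be set to b^2 times the residual variance of M plus the residual variance of y*. The estimates are those quoted in the question; the function and argument names are illustrative, and this is not the causal-effect computation from the paper.

```python
from math import erf, sqrt

def normal_cdf(v):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

def prob_y1_given_x(x, tau, a, b, c, m_intercept, var_ystar_given_x=1.0):
    """P(Y = 1 | X = x) with the continuous mediator M integrated out.

    The mean of y* given x includes the mediator intercept, and the
    mediator residual is absorbed into var_ystar_given_x.
    """
    mean_ystar = b * (m_intercept + a * x) + c * x
    return normal_cdf((mean_ystar - tau) / sqrt(var_ystar_given_x))

# Estimates quoted in the question.
a, b, c = 0.089, 1.99, 0.054
m_intercept, tau = 0.673, 1.355

p_x0 = prob_y1_given_x(0, tau, a, b, c, m_intercept)
p_x1 = prob_y1_given_x(1, tau, a, b, c, m_intercept)
print(round(p_x0, 3), round(p_x1, 3))
```

Under that scaling assumption the two probabilities come out at roughly .49 and .59, close to the raw proportions reported in the question; the main difference from the original calculation is the term b * m_intercept.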
Ah, that makes sense. Thanks so much for your help--everything works beautifully now for a path model with a binary outcome.
What about a path model with an ordinal outcome? Seems to be a common situation and my hope is to use Mplus exclusively rather than having to switch back and forth between different software to do all of these types of analyses (like Imai et al.'s 'mediation' package in R). Is there a straightforward way of modifying the formulas/scripts to calculate causal indirect effects for a particular category of an ordinal outcome?
For instance, for a 4-category outcome variable (1,2,3,4), would it be possible to substitute 'mbeta2' for the ordinal outcome (from 'mbeta0' for the binary threshold) to get the indirect effect of moving from a value of 3 to 4?
I am glad you are moving ahead on it. Please send any paper you write on it - these techniques need to be more widely used.
The effect formulas generalize directly to an ordered categorical (ordinal) and an unordered categorical (nominal) outcome. For a 0/1 binary outcome, the expected value for the outcome is the same as the probability of category 1. With an ordered categorical outcome, the expected value for the outcome is a sum over the non-zero categories, weighted by their probabilities. This, however, assumes a certain scoring for the categories. For example, an equidistant scoring such as 0, 1, 2,... may not be substantively motivated due to the difference between two adjacent categories representing a substantively larger difference than two other adjacent categories. As an alternative, the probability for each category can be considered, an approach that is also suitable for a nominal outcome.
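A small sketch of the two summaries mentioned here, for a probit model with an ordinal outcome: the probability of each category, and the expected value under a chosen scoring of the categories. The thresholds, the two values of the mean of y*, and the function names are hypothetical; the equidistant scoring 0, 1, 2, ... is the default only for illustration.

```python
from math import erf, sqrt

def normal_cdf(v):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

def category_probabilities(thresholds, mean_ystar, sd_ystar=1.0):
    """Category probabilities for an ordinal outcome with ordered thresholds."""
    # Cumulative probabilities P(y <= k) at each threshold.
    cumulative = [normal_cdf((t - mean_ystar) / sd_ystar) for t in thresholds]
    bounds = [0.0] + cumulative + [1.0]
    return [bounds[k + 1] - bounds[k] for k in range(len(bounds) - 1)]

def expected_score(probs, scores=None):
    """Expected value of the outcome; scores default to 0, 1, 2, ..."""
    if scores is None:
        scores = range(len(probs))
    return sum(s * p for s, p in zip(scores, probs))

thresholds = [-1.0, 0.0, 1.0]   # hypothetical thresholds, 4 categories

probs_low = category_probabilities(thresholds, mean_ystar=0.0)
probs_high = category_probabilities(thresholds, mean_ystar=0.5)

print([round(p, 3) for p in probs_low], round(expected_score(probs_low), 3))
print([round(p, 3) for p in probs_high], round(expected_score(probs_high), 3))
```

Passing a different scores list to expected_score shows how much the expected-value summary depends on the scoring, whereas the per-category probabilities do not depend on it.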
I did CFA for binary indicators using WLSMV estimator. I have questions on the interpretation of the factor loading and threshold. 1. Interpretation of the factor loading, is it correct? "probit (y = 1) = -τ + λη. Factor loading λ can be interpreted in the linear form as 1 unit increase in η results in λ units increase in the probit of getting the observed indicator as 1."
2. Can factor loading also be interpreted as probability? I found in a book that "the change in the probability is difficult to interpret when in the nonlinear form of the normal cumulative distribution function as it varies depending on the value of predictor η". Is it correct?
3. Interpretation of the threshold? How to interpret the threshold for binary data? I found an interpretation "The thresholds, or cut points, reflect the predicted cumulative probabilities at covariate values of zero." Is it correct?
(1) For binary dependent variables, is this the default equation that Mplus uses for regression? LN(P/(1-P)) =Intercept+beta1*X1+beta2*X2+...+error-term Where LN=natural logarithm, P=probability of the incident, X1 and X2 are independent variables, and beta1 and beta2 are coefficients.
(2) What is the difference between threshold reported by Mplus and intercept?
I am using MPlus for cross-lagged panel analyses using SEM, I have several baseline covariates in Year 1 and then two variables (A and B, where A is a latent factor and B is an observed dichotomous variable) each of which is measured in Years 1, 2, and 3. A and B are endogenous dependent variables. So, A in Year 2 is regressed on B in Year 1 and B in Year 2 regressed on A in Year 1 (and the same two relationships are modeled between Years 2 and 3).
To compute predicted probabilities for the risk of B in Year 2 at the mean of A in Year 1 (or, given a 1 SD change in A), can I use the same formula indicated earlier in this thread (and as illustrated on slide #163 of Topic 2)?
If I were not using Year 1 baseline covariates, then I believe the computation of the risk of B in Year 2 at the mean of A in Year 1 would involve: tau, the threshold for B in Year 2, lambda, the coefficient of B in Year 2 regressed on A in Year 1, and k, the coefficient of A in Year 2 on A in Year 1. Is that correct? And would that also work when including the baseline Year 1 covariates?
Also, is theta the SE of the Estimate under R-square? (In the example in Topic 2, slide #160, the output looks different than the MPlus output I get.)
And, finally, should one use the STDYX Standardized results for tau, lambda and k?
1. It sounds as if this is not quite right for my model--so, I should not include lambda_j. Is that correct? (Yet, slide #163 seemed to use an example with a dichotomous outcome and used lambda.) Other than removing lambda, does the formula remain the same?
(So that would leave threshold for B in Year 2 - mean of A in Year 1 (i.e., 0) - coefficient of B in Year 2 on A in Year 1 times mean of B in Year 1 times 1/sqrt of theta.)
2. Does the formula account for covariates included at baseline in Year 1? To obtain a predicted probability in Stata with a regression model, I can set other covariates to their means--can that be done here?
3. What is the residual variance from the standardized section? As illustrated on slides 160 and 163 in Topic 2, theta is the residual variance under R-square. In my output under STD standardization, I have a value of .535 for the estimate and 0.018 for the SE of the latent variable A in Year 1.
4. When displaying standardized model output, should I use STDYX Standardization?
Lambda refers to a factor influencing its indicator which I don't think was your situation so you would find the value perhaps in Beta (you can tell what the names are by looking at the output and TECH1).
I worry that me giving you piece-wise advice on the formulas via quick posts on Mplus Discussion will not get things exactly right since I won't be digging into your model. Instead, I suggest that you take this to a statistical consultant who can look at it carefully. But you have to say which estimate goes with which relationship.