So far, the models that we have considered were relatively simple. In this chapter, we apply the same workflow, framework, and software introduced in Parts I and II of the book to interpret estimates from slightly more complex specifications. Our goal is to address two related issues: heterogeneity and flexibility.
We say that there is “heterogeneity” when the association between an independent variable and a dependent variable varies based on the context, or based on other characteristics of the objects of study. For instance, a new medical treatment might significantly reduce blood pressure in younger adults, but have a weaker effect on older ones. Or a marketing campaign may increase sales in rural areas but not urban ones. Applied statisticians use different expressions to describe such situations: heterogeneity, moderation, interaction, effect modification, context-conditionality, or strata-specific effects. In this chapter, we use these expressions interchangeably to describe situations where the strength of association between two variables depends on a third: the moderator.
The second issue that this chapter tackles is flexibility. Many of the statistical models that data analysts fit are relatively simple and rigid; they often rely on linear approximations of the data generating process. Such approximations may not be appropriate when the relationships of interest are complex and (potentially) non-linear.1 For example, in environmental science, the relationship between temperature and crop yield is often non-linear: as temperatures rise, crop yields may initially increase due to optimal growing conditions, but eventually decline as heat stress becomes detrimental.
Complex relationships and heterogeneous effects are present in virtually all empirical domains. This chapter shows that the conceptual framework and software tools presented in earlier parts of this book can be applied in straightforward fashion to gain insights into complex phenomena. We will focus on three modeling strategies to account for heterogeneity and increase the flexibility of our models: multiplicative interactions, polynomials (a special case of interaction), and splines.
We say that there is heterogeneity when the strength of the association between an explanator \(X\) and an outcome \(Y\) varies based on the value of a moderator \(M\). Typically, \(M\) is a variable which measures contextual elements, or characteristics of the individuals, groups, or units under observation. The key characteristic of a moderator is that it modifies the nature of the relationship between two other variables. \(M\) can strengthen, weaken, or even reverse the association between an independent variable \(X\) and the dependent variable \(Y\).
One common strategy to study moderation (aka effect modification) is to fit a statistical model with multiplicative interactions (Brambor, Clark, and Golder 2006; Kam and Franzese Jr. 2009; Clark and Golder 2023). This involves creating a new composite variable by multiplying the explanator (\(X\)) by the moderator (\(M\)). Then, we insert this new interacted variable as a predictor in the model, alongside its individual components.2 One popular specification for this type of moderation analysis is a simple linear model with one interaction.
\[ Y = \beta_1 + \beta_2 \cdot X + \beta_3 \cdot M + \beta_4 \cdot X \cdot M + \varepsilon, \tag{8.1}\]
where \(Y\) is the outcome, \(X\) is the cause of interest, and \(M\) is a contextual variable which moderates the relationship between \(X\) and \(Y\).
To characterize the strength of association between \(X\) and \(Y\), depending on the value of \(M\), we can check how the predicted value of \(Y\) changes in response to a change in \(X\). Imagine that \(M\) is equal to 0. In that context, a change from \(X=0\) to \(X=1\) results in a change of \(\beta_2\) in the predicted value of \(Y\). We see this by plugging the values of \(X\) and \(M\) into Equation 8.1.
\[\begin{align*} Y_{X=1,M=0} &= \beta_1 + \beta_2 \cdot 1 + \beta_3 \cdot 0 + \beta_4 \cdot 1 \cdot 0\\ Y_{X=0,M=0} &= \beta_1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 0 \cdot 0\\ Y_{X=1,M=0} - Y_{X=0,M=0} &= \beta_2 \end{align*}\]
Now, imagine that \(M=1\). In that context, a change from \(X=0\) to \(X=1\) results in a different change in the predicted value of \(Y\): \(\beta_2 + \beta_4\).
\[\begin{align*} Y_{X=1,M=1} &= \beta_1 + \beta_2 \cdot 1 + \beta_3 \cdot 1 + \beta_4 \cdot 1 \cdot 1\\ Y_{X=0,M=1} &= \beta_1 + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 0 \cdot 1\\ Y_{X=1,M=1} - Y_{X=0,M=1} &= \beta_2 + \beta_4 \end{align*}\]
The difference between these two results shows exactly what we mean by heterogeneity, moderation, or effect modification. The association between \(X\) and \(Y\) is different when \(M=0\) than when \(M=1\). In Equation 8.1, \(\beta_2\) characterizes the association between \(X\) and \(Y\) when \(M=0\). If the moderator \(M\) is non-zero, the strength of association between \(X\) and \(Y\) is more complex. Generically, we can see this by inspecting the partial derivative of Equation 8.1 with respect to \(X\) (see Chapter 7 on slopes).
\[ \frac{\partial Y}{\partial X} = \beta_2 + \beta_4 \cdot M \tag{8.2}\]
Equation 8.2 represents the slope of \(Y\) with respect to \(X\), that is, the extent to which \(Y\) is expected to change in response to a small change in \(X\). Clearly, this relationship now depends on more than just \(X\) itself: the strength of association between \(X\) and \(Y\) is driven in part by the value of \(M\). Thus, this model specification allows the effect of an explanator to vary based on the value of a moderator.
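To make Equation 8.2 concrete, here is a minimal sketch on simulated data (the sample size, variable names, and coefficient values are all assumptions, not taken from the book's datasets). For a linear model, the slope of \(Y\) with respect to \(X\) should be approximately \(\beta_2\) when \(M=0\) and \(\beta_2 + \beta_4\) when \(M=1\):
library(marginaleffects)
set.seed(48103)
n <- 1000
sim <- data.frame(X = rnorm(n), M = rbinom(n, 1, 0.5))
# true coefficients: b1 = 1, b2 = 2, b3 = 3, b4 = 4
sim$Y <- 1 + 2 * sim$X + 3 * sim$M + 4 * sim$X * sim$M + rnorm(n)
fit <- lm(Y ~ X * M, data = sim)
# slope of Y with respect to X, evaluated at M = 0 and at M = 1
slopes(fit, variables = "X", newdata = datagrid(M = c(0, 1)))
# the first slope is roughly b2 (about 2); the second is roughly b2 + b4 (about 6)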
The parameters in Equation 8.1 are fairly straightforward to grasp, and we can interpret them using a little algebra. However, as soon as a model includes non-linear components or more than one interaction, it quickly becomes very difficult to interpret raw coefficient estimates directly. This book has argued that, in most cases, it is best to ignore raw parameter values to focus on more interpretable quantities, like predictions, counterfactual comparisons, and slopes. This advice is particularly important here.
In the next few sections, we will illustrate how to model heterogeneity using multiplicative interactions, and how to interpret our parameter estimates using marginaleffects, along with the software tools and conceptual framework introduced in Parts I and II.
The first case to consider is when the association between a categorical explanator X and an outcome Y is moderated by a categorical variable M. This situation could occur when the effect of a binary treatment (treatment vs. control) on patient recovery (outcome) varies across different age groups (young, middle-aged, elderly).
To illustrate, we consider simulated data with three variables: the outcome Y is a binary variable; the treatment X is a binary variable; and the moderator M is a categorical variable with 3 levels (a, b, and c).
library(marginaleffects)
library(modelsummary)
library(ggplot2)
library(patchwork)
dat <- read.csv("data/interaction_01.csv")
head(dat)
Y X M
1 1 0 a
2 1 1 b
3 1 0 a
4 0 0 a
5 1 0 c
6 1 1 a
To fit a regression model using multiplicative interactions, we use a colon (:) character in the model formula.
mod <- glm(Y ~ X + M + X : M, data = dat, family = binomial)
Alternatively, we can use the asterisk (*) shortcut, which will automatically include the multiplicative interaction, as well as each of the individual variables. The next command will thus estimate the same model as above.
mod <- glm(Y ~ X * M, data = dat, family = binomial)
As is the case for many of the more complex models that we will consider, the coefficient estimates for this logit model with interactions are difficult to interpret on their own.
modelsummary(mod, coef_rename = TRUE, gof_map = "nobs")
            | (1)     |
------------|---------|
(Intercept) | 0.436   |
            | (0.072) |
X           | 0.260   |
            | (0.103) |
M [b]       | 0.566   |
            | (0.107) |
M [c]       | 0.970   |
            | (0.111) |
X × M [b]   | 0.895   |
            | (0.173) |
X × M [c]   | 1.412   |
            | (0.215) |
Num.Obs.    | 5000    |
Thankfully, we can rely on the framework and tools introduced in Parts I and II of this book to make these results intelligible.
Our first cut is to compute the average predicted outcome for each combination of X and M. As explained in Chapter 5, this is equivalent to computing fitted values for every row in the original data, and then aggregating those fitted values by subgroups.3
avg_predictions(mod, by = c("X", "M"))
X | M | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
0 | a | 0.607 | 0.01716 | 35.4 | <0.001 | 0.574 | 0.641 |
0 | b | 0.732 | 0.01555 | 47.0 | <0.001 | 0.701 | 0.762 |
0 | c | 0.803 | 0.01334 | 60.2 | <0.001 | 0.777 | 0.829 |
1 | a | 0.667 | 0.01647 | 40.5 | <0.001 | 0.635 | 0.700 |
1 | b | 0.896 | 0.01058 | 84.7 | <0.001 | 0.876 | 0.917 |
1 | c | 0.956 | 0.00707 | 135.2 | <0.001 | 0.942 | 0.970 |
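As a check on this logic, the point estimates in the table above can be reproduced by hand. The following is a minimal sketch, assuming dat and mod as defined above (marginaleffects also reports delta-method standard errors, which this sketch omits):
# compute a fitted probability for every row of the original data,
# then average those fitted values within each combination of X and M
dat$p_hat <- predict(mod, type = "response")
aggregate(p_hat ~ X + M, data = dat, FUN = mean)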
There is considerable variation in the predicted value of Y across subgroups, going from 61% to 96%. This variation can be made starker by plotting our results with the plot_predictions() function:
plot_predictions(mod, by = c("M", "X"))
Figure 8.1 shows that, on average, the predicted probability that Y=1 is considerably higher when X=1. Moreover, the difference in predicted outcomes \(P(Y=1|X=1,M=a)-P(Y=1|X=0,M=a)\) seems smaller than \(P(Y=1|X=1,M=c)-P(Y=1|X=0,M=c)\). This is a hint that we may want to formally check for effect moderation.
Does X affect Y?

We can build on these preliminary findings by adopting a more explicitly counterfactual approach, using the comparisons() family of functions. Recall, from Chapter 6, that we can compute an average counterfactual comparison by following these steps:

1. Set X to 0 for all observations, and compute predictions for every row.
2. Set X to 1 for all observations, and compute predictions for every row.
3. Compute the average difference between these two sets of predictions.

These three steps can be taken with a single line of code:
avg_comparisons(mod, variables = "X")
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0.127 | 0.0112 | 11.3 | <0.001 | 0.105 | 0.149 |
On average, moving from 0 to 1 on the X variable is associated with an increase of 0.127 on the outcome scale. Since we fit a logistic regression model, predictions are expressed on the probability scale. Thus, the estimate printed above suggests that the average treatment effect of X on Y is about 12.7 percentage points. This estimate is statistically distinguishable from zero, as the small \(p\) value attests.
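To make the three-step counterfactual logic explicit, here is a minimal sketch that reproduces this point estimate manually (assuming dat and mod as defined above; standard errors are omitted):
d0 <- transform(dat, X = 0)   # step 1: set X to 0 for every row
d1 <- transform(dat, X = 1)   # step 2: set X to 1 for every row
p0 <- predict(mod, newdata = d0, type = "response")
p1 <- predict(mod, newdata = d1, type = "response")
mean(p1 - p0)                 # step 3: average difference across all rows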
Is the effect of X on Y moderated by M?

Now we can dive deeper, exploiting the multiplicative interaction in our model to interrogate heterogeneity. To see if the effect of X on Y depends on M, we make the same function call as above, but add the by argument:
avg_comparisons(mod, variables = "X", by = "M")
Term | M | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
X | a | 0.0601 | 0.0238 | 2.53 | 0.0115 | 0.0135 | 0.107 |
X | b | 0.1649 | 0.0188 | 8.77 | <0.001 | 0.1280 | 0.202 |
X | c | 0.1529 | 0.0151 | 10.13 | <0.001 | 0.1233 | 0.182 |
On average, moving from the control (X=0) to the treatment group (X=1) is associated with an increase of 6 percentage points for individuals in category A. The average estimated effect of X for individuals in category C is about 15 percentage points.
At first glance, these two estimated effects look different. But is the difference between 6 and 15 percentage points statistically significant? To answer this question, we can use the hypothesis argument and conduct a test of equality between the 1st and the 3rd estimate:
avg_comparisons(mod, variables = "X", by = "M", hypothesis = "b3 - b1 = 0")
Warning:
It is essential to check the order of estimates when specifying hypothesis tests using positional indices like b1, b2, etc. The indices of estimates can change depending on the order of rows in the original dataset, user-supplied arguments, model-fitting package, and version of `marginaleffects`.
It is also good practice to use assertions that ensure the order of estimates is consistent across different runs of the same code. Example:
```r
mod <- lm(mpg ~ am * carb, data = mtcars)
# assertion for safety
p <- avg_predictions(mod, by = 'carb')
stopifnot(p$carb[1] != 1 || p$carb[2] != 2)
# hypothesis test
avg_predictions(mod, by = 'carb', hypothesis = 'b1 - b2 = 0')
```
Disable this warning with: `options(marginaleffects_safe = FALSE)`
This warning appears once per session.
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
b3-b1=0 | 0.0928 | 0.0282 | 3.29 | <0.001 | 0.0376 | 0.148 |
The difference between the average estimated effect of X in categories C and A is: \(0.1529 - 0.0601 = 0.0928\). This difference is associated with a large \(z\) statistic and a small \(p\) value. Therefore, we can conclude that the difference is statistically significant; we can reject the null hypothesis that the effect of X is the same in sub-populations A and C.
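Following the advice in the warning above, here is a sketch that guards the positional indices with an assertion before re-running the same test (assuming mod as defined above):
cmp <- avg_comparisons(mod, variables = "X", by = "M")
stopifnot(cmp$M[1] == "a", cmp$M[3] == "c")  # confirm that b1 is category "a" and b3 is "c"
avg_comparisons(mod, variables = "X", by = "M", hypothesis = "b3 - b1 = 0")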
The second case to consider is an interaction between a categorical explanator (X) and a continuous moderator (M). To illustrate, we fit a new model to simulated data.
dat <- read.csv("data/interaction_02.csv")
mod <- glm(Y ~ X * M, data = dat, family = binomial)
modelsummary(mod, coef_rename = TRUE, gof_map = "nobs")
            | (1)     |
------------|---------|
(Intercept) | -0.840  |
            | (0.046) |
X           | 0.192   |
            | (0.063) |
M           | -0.759  |
            | (0.050) |
X × M       | 0.352   |
            | (0.068) |
Num.Obs.    | 5000    |
In the previous section, we started by computing average predictions for each combination of the interacted variables. When one of the variables is continuous and takes on many values (like M), it is not practical to report averages for every combination of X and M. Therefore, we focus on “conditional” estimates, obtained by calling the predictions() function. We use the datagrid() and fivenum() functions to create a grid of predictors based on Tukey’s five number summary of M:4
predictions(mod, newdata = datagrid(X = c(0, 1), M = fivenum))
X | M | Estimate | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0 | -3.14813 | 0.8248 | <0.001 | 0.7766 | 0.8645 |
0 | -0.65902 | 0.4159 | <0.001 | 0.3921 | 0.4401 |
0 | 0.00407 | 0.3009 | <0.001 | 0.2821 | 0.3204 |
0 | 0.68359 | 0.2045 | <0.001 | 0.1848 | 0.2256 |
0 | 3.53494 | 0.0287 | <0.001 | 0.0198 | 0.0414 |
1 | -3.14813 | 0.6532 | <0.001 | 0.5871 | 0.7139 |
1 | -0.65902 | 0.4063 | <0.001 | 0.3830 | 0.4300 |
1 | 0.00407 | 0.3432 | <0.001 | 0.3244 | 0.3624 |
1 | 0.68359 | 0.2838 | <0.001 | 0.2623 | 0.3063 |
1 | 3.53494 | 0.1105 | <0.001 | 0.0820 | 0.1473 |
The results show considerable variation in the predicted \(Pr(Y=1)\), ranging from about 0.03 to 0.82.
Instead of making predictions for discrete values of the continuous moderator M, we can also draw a plot with that variable on the x-axis:
plot_predictions(mod, condition = c("M", "X"))
Figure 8.2 shows that predicted values of Y tend to be lower when M is large. That figure also suggests that the relationship between X and Y has a different character for different values of \(M\). When \(M\) is small, we see \(P(Y=1|X=1,M)<P(Y=1|X=0,M)\). When \(M\) is large, the converse seems true.
Does X affect Y?

Moving to the counterfactual analysis, we call avg_comparisons() to get an overall estimate of the effect of X on the predicted \(Pr(Y=1)\):
avg_comparisons(mod, variables = "X")
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0.0284 | 0.0129 | 2.21 | 0.0273 | 0.00318 | 0.0536 |
On average, moving from 0 to 1 on X increases the predicted probability that Y=1 by 2.8 percentage points.
Is the effect of X on Y moderated by M?

As explained in Chapter 6, we can estimate the effect of X for different values of M by using the newdata argument and the datagrid() function. Here, we measure the strength of association between X and Y for two different values of M: its minimum and maximum.
comparisons(mod, variables = "X", newdata = datagrid(M = range))
M | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
-3.15 | -0.1716 | 0.0395 | -4.35 | <0.001 | -0.2489 | -0.0943 |
3.53 | 0.0818 | 0.0174 | 4.70 | <0.001 | 0.0477 | 0.1159 |
Moving from 0 to 1 on the X variable is associated with a change of -0.172 in the predicted Y when the moderator M is at its minimum. Moving from 0 to 1 on the X variable is associated with a change of 0.082 in the predicted Y when the moderator M is at its maximum. Both of these estimates are associated with small \(p\) values, so we can reject the null hypotheses that they are equal to zero.
Both estimates are different from zero, but are they different from one another? Is the effect of X on Y different when M takes on different values? To check this, we can add the hypothesis argument to the previous call:
comparisons(mod,
hypothesis = "b2 - b1 = 0",
variables = "X",
newdata = datagrid(M = range))
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0.253 | 0.0546 | 4.64 | <0.001 | 0.146 | 0.36 |
This confirms that the estimates are statistically distinguishable. We can reject the null hypothesis that M has no moderating effect on the relationship between X and Y.
The third case to consider is an interaction between two continuous numeric variables: X and M. To illustrate, we fit a new model to simulated data.
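The code that loads the data and fits the model for this example does not appear above. Here is a plausible sketch, assuming the simulated data follow the same pattern as the previous examples (the file name interaction_03.csv is an assumption):
dat <- read.csv("data/interaction_03.csv")   # assumed file name
mod <- glm(Y ~ X * M, data = dat, family = binomial)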
As in the previous cases, we begin by computing the predicted outcomes for different values of the predictors. In practice, the analyst should report predictions for predictor values that are meaningful to the domain of application. Here, we hold X and M to fixed arbitrary values:
predictions(mod, newdata = datagrid(X = c(-2, 2), M = c(-1, 0, 1)))
X | M | Estimate | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
-2 | -1 | 0.2586 | <0.001 | 0.2215 | 0.2995 |
-2 | 0 | 0.1365 | <0.001 | 0.1189 | 0.1563 |
-2 | 1 | 0.0669 | <0.001 | 0.0531 | 0.0839 |
2 | -1 | 0.2177 | <0.001 | 0.1837 | 0.2561 |
2 | 0 | 0.4855 | 0.425 | 0.4500 | 0.5212 |
2 | 1 | 0.7619 | <0.001 | 0.7213 | 0.7982 |
Rather than focus on arbitrary point estimates, we can plot predicted values to communicate a richer set of estimates. When calling plot_predictions() with these data, we obtain a plot of predicted outcomes with the primary variable of interest (X) on the x-axis, and different lines representing different values of the moderator (M).
plot_predictions(mod, condition = c("X", "M"))
We can draw two preliminary conclusions from Figure 8.3. First, the predicted values of Y depend strongly on the value of X. Moving from left to right in the plot often has a strong effect on the heights of predicted probability curves. Second, M strongly moderates the relationship between X and Y. Indeed, for some values of M the relationship of interest completely flips. For example, when M is around -3, the relationship between X and Y is negative: an increase in X is associated with a decrease in Pr(Y=1). However, for all the other values of M that we considered, the relationship between X and Y is positive: an increase in X is associated with an increase in Pr(Y=1).
Does X affect Y?

To measure the “effect” of X on the predicted outcome, we can compute the average slope with respect to our predictor of interest:
avg_slopes(mod, variables = "X")
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0.0857 | 0.00571 | 15 | <0.001 | 0.0745 | 0.0969 |
On average, across all observed values of the moderator M, increasing X by one unit increases the predicted outcome by 0.086.5 This is interesting, but as suggested by Figure 8.3, there is strong heterogeneity in the relationship of interest, as a function of the moderator M. Indeed, this is what we observe below, by computing slopes for different values of M.
Is the effect of X on Y moderated by M?

To answer this question, we estimate slopes of Y with respect to X, for different values of the moderator M:
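The call that produces the table below is not shown above; a plausible sketch, assuming the grid is built from Tukey’s five-number summary of M:
slopes(mod, variables = "X", newdata = datagrid(M = fivenum))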
M | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
-3.0788 | -0.1525 | 0.01899 | -8.03 | < 0.001 | -0.18969 | -0.1153 |
-0.6759 | 0.0200 | 0.00756 | 2.65 | 0.00815 | 0.00519 | 0.0348 |
0.0137 | 0.0913 | 0.00695 | 13.14 | < 0.001 | 0.07768 | 0.1049 |
0.6906 | 0.1698 | 0.00950 | 17.87 | < 0.001 | 0.15113 | 0.1884 |
3.4711 | 0.5426 | 0.03362 | 16.14 | < 0.001 | 0.47673 | 0.6085 |
The results from this command confirm the intuition we developed based on Figure 8.3. When M is strongly negative (-3), the slope is negative: increasing X results in a reduction of Y. However, for the other 4 values of M we consider, the slope is positive. This is consistent with Figure 8.3, which shows one line with downward slope, and four lines with upward slopes.
For a more fine-grained analysis, we can plot the slope of Y with respect to X for all observed values of the moderator M:
plot_slopes(mod, variables = "X", condition = "M") +
geom_hline(yintercept = 0, linetype = "dotted")
Figure 8.4 shows that when the moderator M is below -1, the relationship between X and Y is negative: increasing X decreases Y. However, when M rises above about -1, the relationship between X and Y becomes positive: increasing X increases Y.
We can confirm that this moderation effect is statistically significant using the hypothesis argument:
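The call that produces the table below is not shown above; a plausible sketch, comparing the slope of Y with respect to X at the maximum of M (b2) against the slope at its minimum (b1):
slopes(mod,
  hypothesis = "b2 - b1 = 0",
  variables = "X",
  newdata = datagrid(M = range))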
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0.695 | 0.0475 | 14.6 | <0.001 | 0.602 | 0.788 |
The \(\frac{\partial Y}{\partial X}\) slope is larger when evaluated at the maximum of M than at its minimum. Therefore, we can reject the null hypothesis that M has no moderating effect on the relationship between X and Y.
The fourth case to consider is when more than two variables are included in multiplicative interactions. Such models have serious downsides: they can overfit the data, and they impose major costs in terms of statistical power, typically requiring considerably larger sample sizes than models without interaction. On the upside, models with multiple interactions allow more flexibility in modelling, and they can capture complex patterns of moderation between regressors.
Models with several multiplicative interactions do not pose any particular interpretation challenge, since the tools and workflows introduced in this book can be applied to these models in straightforward fashion. Consider this model, fit to simulated data, with three binary variables multiplied together:
dat <- read.csv("data/interaction_04.csv")
mod <- glm(Y ~ X * M1 * M2, data = dat, family = binomial)
modelsummary(mod, coef_rename = TRUE, gof_map = "nobs")
            | (1)     |
------------|---------|
(Intercept) | -1.021  |
            | (0.091) |
X           | 0.463   |
            | (0.123) |
M1          | -0.795  |
            | (0.147) |
M2          | 0.579   |
            | (0.122) |
X × M1      | 0.675   |
            | (0.189) |
X × M2      | -0.965  |
            | (0.172) |
M1 × M2     | 0.405   |
            | (0.189) |
X × M1 × M2 | 0.198   |
            | (0.254) |
Num.Obs.    | 5000    |
Once again, the coefficient estimates of this logistic regression are difficult to interpret on their own, so we use functions from the marginaleffects package.
As before, we can compute and display marginal predicted outcomes in any subgroup of interest, using the avg_predictions() or plot_predictions() functions:
plot_predictions(mod, by = c("X", "M1", "M2"))
Does X affect Y?

As before, we can estimate the average change in Y associated with a change from 0 to 1 in the X variable using the avg_comparisons() function:
avg_comparisons(mod, variables = "X")
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0.0657 | 0.0129 | 5.1 | <0.001 | 0.0405 | 0.091 |
This suggests that, on average, moving from the control (0) to the treatment (1) group is associated with an increase of 6.6 percentage points in the probability that Y equals 1. The \(p\) value is small, which implies that we can reject the null hypothesis that X has no effect on the predicted outcome.
Is the effect of X on Y moderated by M1?

We can also estimate how the effect of X varies based on different values of the moderator M1:
avg_comparisons(mod, variables = "X", by = "M1")
Term | M1 | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
X | 0 | -0.00703 | 0.0185 | -0.38 | 0.704 | -0.0433 | 0.0292 |
X | 1 | 0.14026 | 0.0179 | 7.83 | <0.001 | 0.1052 | 0.1754 |
The results suggest that the change in Y associated with a change in X differs based on the value of M1: -0.0070 vs. 0.1403. By using the hypothesis argument, we can confirm that the difference between these two estimated effect sizes is statistically significant:
avg_comparisons(mod,
variables = "X",
by = "M1",
hypothesis = "b2 - b1 = 0")
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
b2-b1=0 | 0.147 | 0.0258 | 5.72 | <0.001 | 0.0968 | 0.198 |
Does the moderating effect of M1 depend on M2?

The last question that we pose is more complex. Above, we established that:

1. X affects the predicted value of Y.
2. M1 modifies the strength of association between X and Y.

Now we ask if M2 changes the way in which M1 moderates the effect of X on Y. The difference is subtle but important: we are asking if the moderation effect of M1 is itself moderated by M2.
The following code computes the average difference in predicted Y associated with a change in X, for every combination of the moderators M1 and M2. Each row represents the average effect of X at different points in the sample space:
avg_comparisons(mod,
variables = "X",
by = c("M2", "M1"))
Term | M2 | M1 | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|---|
X | 0 | 0 | 0.0992 | 0.0261 | 3.80 | < 0.001 | 0.0481 | 0.1504 |
X | 0 | 1 | 0.1967 | 0.0236 | 8.35 | < 0.001 | 0.1505 | 0.2429 |
X | 1 | 0 | -0.1111 | 0.0262 | -4.24 | < 0.001 | -0.1625 | -0.0597 |
X | 1 | 1 | 0.0834 | 0.0270 | 3.09 | 0.00203 | 0.0304 | 0.1363 |
When we hold the moderators fixed at M1=0 and M2=0, changing the value of X from 0 to 1 changes the predicted value of Y by 9.9 percentage points. When we hold the moderators fixed at M1=0 and M2=1, changing the value of X from 0 to 1 changes the predicted value of Y by -11.1 percentage points.
Now, imagine that we hold M2 constant at 0. We can determine if the effect of X is moderated by M1 by using the hypothesis argument to compare estimates in rows 1 and 2. This shows that the estimated effect size of X is larger when M1=1 than when M1=0, holding M2 at 0.
avg_comparisons(mod,
hypothesis = "b2 - b1 = 0",
variables = "X",
by = c("M2", "M1"))
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
b2-b1=0 | 0.0975 | 0.0351 | 2.77 | 0.00555 | 0.0286 | 0.166 |
Similarly, imagine that we hold M2 constant at 1. We can determine if the effect of X is moderated by M1 by comparing estimates in rows 3 and 4. We use the syntax introduced in Chapter 4, where b3 represents the 3rd estimate and b4 the fourth.
avg_comparisons(mod,
hypothesis = "b4 - b3 = 0",
variables = "X",
by = c("M2", "M1"))
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
b4-b3=0 | 0.194 | 0.0377 | 5.16 | <0.001 | 0.121 | 0.268 |
The hypothesis test above shows that we can reject the null hypothesis that the 4th estimate is equal to the 3rd (i.e., that the difference between them is zero). Therefore, we can conclude that the effect size of X is larger when M1=1 than when M1=0, holding M2 at 1.
So far, we have assessed the extent to which M1 acts as a moderator, holding M2 at different values. To answer the question of whether M2 moderates the moderation effect of M1, we can specify the hypothesis as a difference in differences:
avg_comparisons(mod,
hypothesis = "(b2 - b1) - (b4 - b3) = 0",
variables = "X",
by = c("M2", "M1"))
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
(b2-b1)-(b4-b3)=0 | -0.097 | 0.0515 | -1.88 | 0.0597 | -0.198 | 0.00396 |
This suggests that M2 may have a second-order moderation effect, but we cannot completely rule out the null hypothesis, because the \(p\) value does not cross conventional thresholds of statistical significance (\(p\)=0.0597). Note that this estimate is simply the difference between the two moderation effects computed above: \(0.0975 - 0.194 \approx -0.097\).
TODO: Two research questions are often conflated:
Polynomial regression is an extension of linear regression that allows for modeling the relationship between a dependent variable \(Y\) and an independent variable \(X\) as an nth-degree polynomial. While the model specification remains linear in the coefficients, it is polynomial in the value of \(X\). This type of regression is useful when the data shows a non-linear relationship that a straight line cannot adequately capture.
The general form of a polynomial regression model is:
\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \cdots + \beta_n X^n + \varepsilon,\]
where \(Y\) is the dependent variable, \(X\) is the independent variable, \(\beta_0, \beta_1, \beta_2, \ldots, \beta_n\) are the coefficients to be estimated, \(n\) is the degree of the polynomial, and \(\varepsilon\) represents the error term. For instance, a second-degree (quadratic) polynomial regression equation can be written as:
\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon\]
A polynomial regression can be seen as a special case of the multiplicative interactions discussed earlier in this chapter, where predictors are simply interacted with themselves. As before, they can be treated as a simple linear regression problem by constructing new variables \(Z_1=X\), \(Z_2=X^2\), etc. The model then becomes \(Y=\beta_0+\beta_1\cdot Z_1+\beta_2\cdot Z_2+\varepsilon\), which can be estimated using standard methods like ordinary least squares.
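To illustrate the point that a polynomial model remains linear in its coefficients, here is a minimal sketch on simulated data (the variable names, sample size, and coefficient values are all assumptions):
set.seed(48103)
x <- runif(500, -3, 3)
y <- 1 - 2 * x + 0.5 * x^2 + rnorm(500)
quad <- lm(y ~ x + I(x^2))  # quadratic in x, but linear in the coefficients
coef(quad)                  # approximately 1, -2, and 0.5
Note that the quadratic term is written directly in the formula with I() rather than hard-coded as a new column in the data; as a footnote at the end of this chapter explains, this matters for marginaleffects to work properly.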
Polynomial regression offers several key advantages, particularly in its flexibility and ability to fit a wide range of curves, simply by adjusting the degree of the polynomial. As a result, polynomial regression can reveal underlying patterns in the data that are not immediately apparent with simpler models.
This approach also has notable disadvantages. One significant issue is its potential for overfitting, especially when the degree \(n\) is high. Moreover, polynomial regression can suffer from unreliable extrapolation: predictions made outside the range of the training data can become erratic and unrealistic. Consequently, while polynomial regression can be powerful, careful consideration must be given to the degree of the polynomial to balance fit and generalization effectively.
Polynomial regression can be viewed simply as a model specification with several variables interacted with themselves. As such, it can be interpreted using exactly the same tools discussed in the earlier part of this chapter. To illustrate, we consider two simple data generating processes adapted from Hainmueller, Mummolo, and Xu (2019). The first is:
\[\begin{align*} Y = 2.5 - X^2 + \nu, && \text{where }\nu\sim N(0,1),X\sim U(-3,3) \end{align*}\]
If we fit a linear model with only \(X\) as predictor, the line of best fit will not be a good representation of the data. However, a cubic polynomial regression can easily detect the curvilinear relationship between \(X\) and \(Y\). In R and Python, we can use similar syntax to specify polynomials directly in the model formula:6
library(patchwork)
dat <- read.csv("data/polynomial_01.csv")
mod_linear <- lm(Y ~ X, data = dat)
mod_cubic <- lm(Y ~ X + I(X^2) + I(X^3), data = dat)
p1 <- plot_predictions(mod_linear, condition = "X", points = .05) + ggtitle("Linear")
p2 <- plot_predictions(mod_cubic, condition = "X", points = .05) + ggtitle("Cubic")
p1 + p2
Clearly, the model with polynomials makes much better predictions, ones that capture the curvilinear relationship between \(X\) and \(Y\). We can now evaluate the strength of association between \(X\) and \(Y\) by computing the slope of the outcome equation with respect to \(X\), for different values of \(X\):
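The call that produces the table below is not shown above; a plausible sketch, evaluating the slope of the cubic model at three values of X:
slopes(mod_cubic, variables = "X", newdata = datagrid(X = c(-2, 0, 2)))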
Term | X | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
X | -2 | 3.98616 | 0.0352 | 113.237 | <0.001 | 3.9172 | 4.0552 |
X | 0 | 0.00478 | 0.0226 | 0.211 | 0.833 | -0.0396 | 0.0491 |
X | 2 | -3.97639 | 0.0360 | -110.367 | <0.001 | -4.0470 | -3.9058 |
When \(X\) is negative (-2), the slope is positive, which indicates that an increase in \(X\) is associated with an increase in \(Y\). When \(X\) is around 0, the slope is null, which indicates that the strength of association between \(X\) and \(Y\) is null (or very weak): changing \(X\) by a small amount will have almost no effect on \(Y\). When \(X\) is positive (2), the slope is negative, which indicates that increasing \(X\) will result in a decrease in \(Y\). These estimates are consistent with the data generating process above, where the slope of \(Y\) with respect to \(X\) is \(-2X\).
Now, consider a slightly different data generating process, where a binary moderator \(M\) changes the nature of the relationship between \(X\) and \(Y\):7
\[\begin{align*} Y = 2.5 - X^2 - 5 \cdot M + 2 \cdot M \cdot X^2 + \nu && \text{where }\nu\sim N(0,1),X\sim U(-3,3) \end{align*}\]
If we simply fit a cubic regression, without accounting for \(M\), our predictions will be inaccurate. However, if we interact the moderator \(M\) with all polynomial terms (using parentheses as a shortcut for the distributive property), we can get an excellent fit for the curvilinear and differentiated relationship between \(X\) and \(Y\):
dat <- read.csv("data/polynomial_02.csv")
# cubic
mod_cubic <- lm(Y ~ X + I(X^2) + I(X^3), data = dat)
# cubic + interaction
mod_cubic_int <- lm(Y ~ M * (X + I(X^2) + I(X^3)), data = dat)
p1 <- plot_predictions(mod_cubic, condition = "X", points = .05)
p2 <- plot_predictions(mod_cubic_int, condition = c("X", "M"), points = .1)
p1 + p2
Of course, we can also estimate the slope of the outcome equation for different values of M and X:
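The call that produces the table below is not shown above; a plausible sketch, combining both values of the binary moderator M with Tukey’s five-number summary of X:
slopes(mod_cubic_int, variables = "X", newdata = datagrid(M = c(0, 1), X = fivenum))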
Term | M | X | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|---|
X | 0 | -2.9974 | 5.7529 | 0.2587 | 22.240 | <0.001 | 5.24595 | 6.2599 |
X | 0 | -1.4845 | 2.9009 | 0.0582 | 49.874 | <0.001 | 2.78693 | 3.0149 |
X | 0 | -0.0586 | 0.1324 | 0.0667 | 1.986 | 0.047 | 0.00174 | 0.2631 |
X | 0 | 1.4788 | -2.9405 | 0.0575 | -51.098 | <0.001 | -3.05324 | -2.8277 |
X | 0 | 2.9843 | -6.0373 | 0.2540 | -23.769 | <0.001 | -6.53517 | -5.5395 |
X | 1 | -2.9974 | -6.1017 | 0.2528 | -24.137 | <0.001 | -6.59716 | -5.6062 |
X | 1 | -1.4845 | -2.9061 | 0.0563 | -51.628 | <0.001 | -3.01646 | -2.7958 |
X | 1 | -0.0586 | -0.0461 | 0.0634 | -0.727 | 0.467 | -0.17026 | 0.0781 |
X | 1 | 1.4788 | 2.8730 | 0.0595 | 48.285 | <0.001 | 2.75638 | 2.9896 |
X | 1 | 2.9843 | 5.5654 | 0.2594 | 21.454 | <0.001 | 5.05695 | 6.0738 |
TODO: avg_comparisons(): https://stats.stackexchange.com/questions/654784/how-to-decide-which-spline-to-use-when-conducting-g-computation-after-weighting/654791#654791
library(mgcv)
# mgcv::gam should treat M as categorical (factor)
mod <- gam(Y ~ s(X, by = factor(M)) + M, data = dat)
plot_predictions(mod, condition = c("X", "M"), points = .1)
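The call that produces the table below is not shown above; a plausible sketch, evaluating the slope of the fitted GAM with respect to X at a few combinations of M and X (the grid values are inferred from the table):
slopes(mod, variables = "X", newdata = datagrid(M = c(0, 1), X = c(-1, 0, 1)))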
Term | M | X | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|---|
X | 0 | -1 | 2.134 | 0.212 | 10.087 | <0.001 | 1.719 | 2.548 |
X | 0 | 0 | -0.141 | 0.212 | -0.662 | 0.508 | -0.557 | 0.276 |
X | 0 | 1 | -1.883 | 0.201 | -9.386 | <0.001 | -2.276 | -1.489 |
X | 1 | -1 | -1.879 | 0.201 | -9.340 | <0.001 | -2.274 | -1.485 |
X | 1 | 0 | 0.113 | 0.215 | 0.526 | 0.599 | -0.308 | 0.535 |
X | 1 | 1 | 2.025 | 0.211 | 9.594 | <0.001 | 1.612 | 2.439 |
Two downsides of more flexible models are that they can sometimes reduce statistical power or overfit the data.↩︎
In most cases, it is important to include all constitutive terms in addition to interactions. For example, if a model includes a multiplication between three variables \(X\cdot W \cdot Z\), one would typically want to also include \(X\cdot W, X\cdot Z, W\cdot Z, X, W,\) and \(Z\). See Clark and Golder (2023) for details.↩︎
An alternative would be to compute predictions on a different grid (ex: “balanced”) before aggregating. This could be achieved by calling: avg_predictions(mod, newdata="balanced", by=c("X","M"))↩︎
These five numbers correspond to elements of a standard boxplot: minimum, lower-hinge, median, upper-hinge, and maximum.↩︎
Recall, from Chapter 7, that this interpretation is valid for small changes in the neighborhoods where slopes are evaluated.↩︎
For marginaleffects to work properly in this context, it is important to specify the polynomials in the model-fitting formula. Users should not hard-code the values by creating new variables in the dataset before fitting the model.↩︎
This data generating process is adapted from Hainmueller, Mummolo, and Xu (2019).↩︎