10  Interactions and polynomials

So far, the models that we have considered have been relatively simple. In this chapter, we apply the same workflow, framework, and software introduced in Parts I and II of the book to interpret estimates from slightly more complex specifications. Our goal is to address two related issues: heterogeneity and flexibility.

We say that there is “heterogeneity” when the association between an independent variable and a dependent variable varies across contexts, or based on the characteristics of our objects of study. For instance, a new medical treatment might significantly reduce blood pressure in younger adults, but have a weaker effect on older ones. Or a marketing campaign may increase sales in rural areas but not urban ones. Applied statisticians use different expressions to describe such situations: heterogeneity, moderation, interaction, effect modification, context-conditionality, or strata-specific effects. In this chapter, we use these expressions interchangeably to describe situations where the strength of association between two variables depends on a third: the moderator.

The second issue that this chapter tackles is flexibility. Many of the statistical models that data analysts fit are relatively simple and rigid; they often rely on linear approximations of the data generating process. Such approximations may not be appropriate when the relationships of interest are complex or non-linear.1 In environmental science, for example, the relationship between temperature and crop yield is often non-linear: as temperatures rise, crop yields may initially increase due to optimal growing conditions, but eventually decline as heat stress becomes detrimental.

Complex relationships and heterogeneous effects are present in virtually all empirical domains. This chapter shows that the conceptual framework and software tools presented in earlier parts of this book can be applied in straightforward fashion to gain insights into complex phenomena. We will focus on two modeling strategies to account for heterogeneity and increase the flexibility of our models: multiplicative interactions and polynomials.

Of course, these two approaches do not exhaust the set of techniques that researchers may want to deploy to study complex phenomena. But the same ideas and workflows illustrated here can be applied to interpret even more flexible models. Readers who wish to go further in that direction can turn to Chapter 13 on Machine Learning, or visit the marginaleffects.com website, which hosts tutorials on Generalized Additive Models, splines, and other strategies.

10.1 Multiplicative interactions

We say that there is heterogeneity when the strength of the association between an explanator \(X\) and an outcome \(Y\) varies based on the value of a moderator \(M\). Typically, \(M\) is a variable which measures contextual elements, or features of the individuals, groups, or units under observation. The key characteristic of a moderator is that it modifies the nature of the relationship between two other variables. \(M\) can strengthen, weaken, or even reverse the association between an independent variable \(X\) and the dependent variable \(Y\).

One common strategy to study moderation—or effect modification—is to fit a statistical model with multiplicative interactions (Brambor, Clark, and Golder 2006; Kam and Franzese 2009; Clark and Golder 2023). This involves creating a new composite variable by multiplying the explanator (\(X\)) by the moderator (\(M\)). Then, we insert this new interacted variable as a predictor in the model, alongside its individual components.2 One popular specification for this type of moderation analysis is a simple linear model with one interaction.

\[ Y = \beta_1 + \beta_2 \cdot X + \beta_3 \cdot M + \beta_4 \cdot X \cdot M + \varepsilon, \tag{10.1}\]

where \(Y\) is the outcome, \(X\) is the predictor of interest, and \(M\) is a contextual variable which moderates the relationship between \(X\) and \(Y\).
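
To make this concrete, here is a minimal sketch of two equivalent ways to specify Equation 10.1 in R, assuming a hypothetical data frame dat with numeric columns Y, X, and M.

# Hypothetical sketch: build the composite variable by hand, or let the
# formula interface construct the interaction.
dat$XM = dat$X * dat$M
mod_manual = lm(Y ~ X + M + XM, data = dat)
mod_formula = lm(Y ~ X + M + X:M, data = dat)
# Both fits estimate the coefficients of Equation 10.1. When post-processing
# with marginaleffects, prefer mod_formula: a hand-built XM column would not
# be updated when X is manipulated to compute counterfactual quantities.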

To characterize the strength of association between \(X\) and \(Y\), we can check how predicted \(Y\) changes in response to a change in \(X\). When \(M=0\), a change from \(X=0\) to \(X=1\) results in a change of \(\beta_2\) in the predicted value of \(Y\). We can see this by plugging the values of \(X\) and \(M\) into Equation 10.1.

\[\begin{align*} Y_{X=1,M=0} &= \beta_1 + \beta_2 \cdot 1 + \beta_3 \cdot 0 + \beta_4 \cdot 1 \cdot 0\\ Y_{X=0,M=0} &= \beta_1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 0 \cdot 0\\ Y_{X=1,M=0} - Y_{X=0,M=0} &= \beta_2 \end{align*}\]

Now, imagine that \(M=1\). In that context, a change from \(X=0\) to \(X=1\) has a different effect on the predicted value of \(Y\).

\[\begin{align*} Y_{X=1,M=1} &= \beta_1 + \beta_2 \cdot 1 + \beta_3 \cdot 1 + \beta_4 \cdot 1 \cdot 1\\ Y_{X=0,M=1} &= \beta_1 + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 0 \cdot 1\\ Y_{X=1,M=1} - Y_{X=0,M=1} &= \beta_2 + \beta_4 \end{align*}\]

When \(M=0\), an increase from 0 to 1 on the \(X\) variable is associated with an increase of \(\beta_2\) in predicted \(Y\). When \(M=1\), an increase from 0 to 1 on the \(X\) variable is associated with an increase of \(\beta_2 + \beta_4\) in predicted \(Y\). The difference between these two results shows what we mean by heterogeneity, moderation, or effect modification. The strength of association between a dependent and an independent variable varies based on the value of a moderator.

The parameters in Equation 10.1 are fairly straightforward to grasp, and we can interpret them using a little algebra. However, as soon as a model includes non-linear components or more than one interaction, it quickly becomes very difficult to interpret coefficient estimates directly. This book has argued that it is usually best to ignore raw coefficients, and to focus on more interpretable quantities, like predictions, counterfactual comparisons, and slopes.
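
For instance, applied to the hypothetical linear model sketched above (mod_formula), the comparisons() function recovers the same quantities that we derived algebraically: \(\beta_2\) when \(M=0\), and \(\beta_2+\beta_4\) when \(M=1\).

library(marginaleffects)
comparisons(mod_formula,
  variables = "X",
  newdata = datagrid(M = c(0, 1)))
# The estimate in the row where M = 0 equals the coefficient on X (beta_2);
# the estimate in the row where M = 1 equals beta_2 + beta_4.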

When we fit models which include interactions, how we present the findings depends on the nature of \(X\) and \(M\). In what follows, we consider different scenarios in turn (e.g., \(X\) is categorical and \(M\) is categorical; \(X\) is continuous and \(M\) is categorical) and illustrate the corresponding presentation.

10.1.1 Categorical-by-categorical

The first case to consider is when the association between a categorical explanator \(X\) and an outcome \(Y\) is moderated by a categorical variable \(M\). For example, this could occur if the effect of a drug (\(X\)) on patient recovery (\(Y\)) varies across different age groups (\(M\)).

Let’s consider a simulated dataset with three variables: the outcome \(Y\) and the treatment \(X\) are binary variables, and the moderator \(M\) is a categorical variable with 3 levels (a, b, and c).

# A data frame: 6 × 3
      Y     X M    
* <int> <int> <chr>
1     1     0 a    
2     1     1 b    
3     1     0 a    
4     0     0 a    
5     1     0 c    
6     1     1 a    

We fit a logistic regression model using the glm() function. This model includes \(X\) and \(M\) as standalone predictors, and also includes a multiplicative interaction between those two variables. To create this interaction, we insert a colon (:) character in the model formula.

mod = glm(Y ~ X + M + X:M, data = dat, family = binomial)

Alternatively, one could use the asterisk (*) shortcut, which automatically includes the multiplicative interaction, as well as each of the individual variables. The next command is thus equivalent to the previous one.

mod = glm(Y ~ X * M, data = dat, family = binomial)

As is the case for many of the more complex models that we will consider, the coefficient estimates for this logit model with interactions are difficult to interpret on their own.

summary(mod)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.43643    0.07195   6.065   <0.001
X            0.26039    0.10337   2.519   0.0118
Mb           0.56596    0.10699   5.290   <0.001
Mc           0.96967    0.11087   8.746   <0.001
X:Mb         0.89492    0.17300   5.173   <0.001
X:Mc         1.41219    0.21462   6.580   <0.001

Thankfully, we can rely on the framework and tools introduced in Parts I and II of this book to make these results intelligible.

Marginal predictions

A good first step toward interpretation is to compute the average predicted outcome for each combination of predictor \(X\) and moderator \(M\). We can do this by calling the avg_predictions() function with its by argument. As explained in Section 5.3, this is equivalent to computing fitted values for every row in the original data, and then taking the average of fitted values by subgroup.3

avg_predictions(mod, by = c("X", "M"))
X M Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0 a 0.607 0.01716 35.4 <0.001 0.574 0.641
0 b 0.732 0.01555 47.0 <0.001 0.701 0.762
0 c 0.803 0.01334 60.2 <0.001 0.777 0.829
1 a 0.667 0.01647 40.5 <0.001 0.635 0.700
1 b 0.896 0.01058 84.7 <0.001 0.876 0.917
1 c 0.956 0.00707 135.2 <0.001 0.942 0.970

When \(X=0\) and \(M=a\), the average prediction is 0.607. Since the outcome is a binary variable, predicted values for this specific model are expressed on the \([0,1]\) interval, and they can be interpreted as probabilities.
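
As a quick check, we can reproduce these subgroup averages manually with base R. This is only a sketch of what avg_predictions() does internally; it omits standard errors.

# Minimal sketch: average fitted probabilities by subgroup.
fit = transform(dat, fitted = predict(mod, type = "response"))
aggregate(fitted ~ X + M, data = fit, FUN = mean)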

There appears to be considerable variation in the average predicted probability that \(Y=1\) across subgroups, going from 61% to 96%. This stark variation can be underscored visually using the plot_predictions() function.

plot_predictions(mod, by = c("M", "X"))
Figure 10.1: Average predicted probability that Y=1 for different values of M and X.

The triangles in this figure are systematically above the corresponding squares. On average, the predicted probability that \(Y=1\) is considerably higher when \(X=1\) than when \(X=0\). Figure 10.1 also gives us a first hint that the relationship between \(X\) and \(Y\) may differ across values of \(M\). For some values of \(M\), the triangles are farther apart from the squares, suggesting that the effect of \(X\) on \(Y\) may be stronger when \(M\) takes on those values.




Does \(X\) affect \(Y\)?

We can build on these preliminary findings by adopting a more explicitly counterfactual approach, using the comparisons() family of functions.4 Recall, from Chapter 6, that we can compute an average counterfactual comparison by taking three steps.

  1. Modify the original dataset to fix \(X=0\) for all observations, and compute predictions for every row.
  2. Modify the original dataset to fix \(X=1\) for all observations, and compute predictions for every row.
  3. Calculate the average difference between the counterfactual predictions computed in steps 1 and 2.

This can be done with a single line of code; the three steps are also sketched manually below.

avg_comparisons(mod, variables = "X")
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.127 0.0112 11.3 <0.001 0.105 0.149

On average, moving from 0 to 1 on the \(X\) variable is associated with an increase of 0.127 on the outcome scale. Since we fit a logistic regression model, predictions are expressed on the probability scale. Thus, the estimate printed above suggests that the average treatment effect of \(X\) on \(Y\) is about 12.7 percentage points. This estimate is statistically distinguishable from zero, as the small \(p\) value and narrow confidence interval attest.
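
For reference, here is a minimal sketch of those three steps carried out manually with base R. It reproduces the point estimate, but not the standard error, which avg_comparisons() obtains via the delta method.

d0 = transform(dat, X = 0)   # step 1: set X = 0 for every row
d1 = transform(dat, X = 1)   # step 2: set X = 1 for every row
p0 = predict(mod, newdata = d0, type = "response")
p1 = predict(mod, newdata = d1, type = "response")
mean(p1 - p0)                # step 3: average difference in predictions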

Is the effect of \(X\) on \(Y\) moderated by \(M\)?

Now we can exploit the multiplicative interaction in our model to interrogate heterogeneity. To see if the effect of \(X\) on \(Y\) depends on \(M\), we make the same function call as above, but add the by argument.

avg_comparisons(mod, variables = "X", by = "M")
M Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
a 0.0601 0.0238 2.53 0.0115 0.0135 0.107
b 0.1649 0.0188 8.77 <0.001 0.1280 0.202
c 0.1529 0.0151 10.13 <0.001 0.1233 0.182

On average, moving from the control (\(X=0\)) to the treatment group (\(X=1\)) is associated with an increase of 6 percentage points for individuals in category \(a\). The average estimated effect of \(X\) for individuals in category \(c\) is 15 percentage points.

At first glance, these two estimates look different. But is the difference between 6 and 15 percentage points statistically significant? Can we reject the null hypothesis that these two estimates are equal? To answer these questions, we can use the hypothesis argument.

avg_comparisons(mod,
  variables = "X",
  by = "M",
  hypothesis = "b3 - b1 = 0")
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b3-b1=0 0.0928 0.0282 3.29 <0.001 0.0376 0.148

The difference between the average estimated effect of \(X\) in categories \(c\) and \(a\) is: \(0.1529 - 0.0601 = 0.0928\). This difference is associated with a large \(z\) statistic and a small \(p\) value. Therefore, we can conclude that the difference is statistically significant; we can reject the null hypothesis that the effect of \(X\) on \(Y\) is the same in sub-populations \(a\) and \(c\).

10.1.2 Categorical-by-continuous

The second case to consider is an interaction between a categorical explanator (\(X\)) and a continuous moderator (\(M\)). To illustrate, we consider a new simulated dataset.

dat = get_dataset("interaction_02")
head(dat)
# A data frame: 6 × 3
      Y     X      M
* <int> <int>  <dbl>
1     1     0 -0.521
2     0     1  0.136
3     1     0 -0.412
4     0     0 -0.567
5     0     0 -0.979
6     1     1 -1.13 

As before, we fit a logistic regression model with a multiplicative interaction, using the asterisk operator.

mod = glm(Y ~ X * M, data = dat, family = binomial)
summary(mod)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.83981    0.04644 -18.083  < 0.001
X            0.19222    0.06334   3.035  0.00241
M           -0.75891    0.05049 -15.030  < 0.001
X:M          0.35209    0.06753   5.213  < 0.001

Conditional predictions

In the previous section, we started by computing average predictions for each combination of the interacted variable. When one of the variables is continuous and takes on many values (like \(M\)), it is not practical to report averages for every combination of \(X\) and \(M\). Therefore, we focus on “conditional” estimates, obtained by calling the predictions() function. We use the datagrid() and fivenum() functions to create a grid of predictors based on Tukey’s five number summary of \(M\).5

predictions(mod, newdata = datagrid(X = c(0, 1), M = fivenum))
X M Estimate Pr(>|z|) 2.5 % 97.5 %
0 -3.14813 0.8248 <0.001 0.7766 0.8645
0 -0.65902 0.4159 <0.001 0.3921 0.4401
0 0.00407 0.3009 <0.001 0.2821 0.3204
0 0.68359 0.2045 <0.001 0.1848 0.2256
0 3.53494 0.0287 <0.001 0.0198 0.0414
1 -3.14813 0.6532 <0.001 0.5871 0.7139
1 -0.65902 0.4063 <0.001 0.3830 0.4300
1 0.00407 0.3432 <0.001 0.3244 0.3624
1 0.68359 0.2838 <0.001 0.2623 0.3063
1 3.53494 0.1105 <0.001 0.0820 0.1473

The results show considerable variation in the predicted probability that \(Y\) equals 1, ranging from about 0.03 to 0.82.

Instead of making predictions for discrete values of the continuous moderator \(M\), we can also draw a plot with that variable on the x-axis.

plot_predictions(mod, condition = c("M", "X"))
Figure 10.2: Predicted probability that Y=1, for different values of X and M

Figure 10.2 shows that predicted values of \(Y\) tend to be lower when \(M\) is large. That figure also suggests that the relationship between \(X\) and \(Y\) has a different character for different values of \(M\). When \(M\) is small (left side of the plot), we see

\[\begin{align*} P(Y=1|X=1,M) < P(Y=1|X=0,M) && \mbox{For small values of } M \end{align*}\]

When \(M\) is large, the converse seems true.

Does \(X\) affect \(Y\)?

Moving to the counterfactual analysis, we call avg_comparisons() to get an overall estimate of the effect of \(X\) on the predicted \(Pr(Y=1)\).

avg_comparisons(mod, variables = "X")
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.0284 0.0129 2.21 0.0273 0.00318 0.0536

On average, moving from 0 to 1 on \(X\) increases the predicted probability that \(Y=1\) by 2.8 percentage points.

Is the effect of \(X\) on \(Y\) moderated by \(M\)?

As explained in Chapter 6, we can estimate the effect of \(X\) for different values of \(M\) by using the newdata argument and datagrid() function. Here, we measure the strength of association between \(X\) and \(Y\) for two different values of \(M\): its minimum and maximum.

comparisons(mod, variables = "X", newdata = datagrid(M = range))
M Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
-3.15 -0.1716 0.0395 -4.35 <0.001 -0.2489 -0.0943
3.53 0.0818 0.0174 4.70 <0.001 0.0477 0.1159

Moving from 0 to 1 on the \(X\) variable is associated with a change of -0.172 in the predicted \(Y\) when the moderator \(M\) is at its minimum. Moving from 0 to 1 on the \(X\) variable is associated with a change of 0.082 in the predicted \(Y\) when the moderator \(M\) is at its maximum. Both of these estimates are associated with small \(p\) values, so we can reject the null hypotheses that they are equal to zero.

Both estimates are different from zero, but are they different from one another? Is the effect of \(X\) on \(Y\) different when \(M\) takes on different values? To check this, we can add the hypothesis argument to the previous call:

comparisons(mod, 
  hypothesis = "b2 - b1 = 0",
  variables = "X",
  newdata = datagrid(M = range))
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b2-b1=0 0.253 0.0546 4.64 <0.001 0.146 0.36

This confirms that the estimates are statistically distinguishable. We can reject the null hypothesis that \(M\) has no moderating effect on the relationship between \(X\) and \(Y\).

10.1.3 Continuous-by-continuous

The third case to consider is an interaction between two continuous numeric variables: \(X\) and \(M\). To illustrate, we fit a new model to simulated data.

dat = get_dataset("interaction_03")
head(dat)
# A data frame: 6 × 3
      Y      X       M
* <int>  <dbl>   <dbl>
1     0 -0.779 -0.761 
2     0 -0.389 -0.199 
3     0 -2.03   1.46  
4     1 -0.982 -0.0254
5     1  0.248 -0.664 
6     0 -2.10  -0.314 
mod = glm(Y ~ X * M, data = dat, family = binomial)
summary(mod)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.95119    0.03320 -28.650   <0.001
X            0.44657    0.03470  12.868   <0.001
M            0.21494    0.03435   6.257   <0.001
X:M          0.50298    0.03660  13.744   <0.001

Conditional predictions

As in the previous cases, we begin by computing the predicted outcomes for different values of the predictors. In practice, the analyst should report predictions for predictor values that are meaningful to the domain of application. Here, we hold \(X\) and \(M\) at fixed, arbitrary values:

p <- predictions(mod, newdata = datagrid(
  X = c(-2, 2),
  M = c(-1, 0, 1)))
p
X M Estimate Pr(>|z|) 2.5 % 97.5 %
-2 -1 0.2586 <0.001 0.2215 0.2995
-2 0 0.1365 <0.001 0.1189 0.1563
-2 1 0.0669 <0.001 0.0531 0.0839
2 -1 0.2177 <0.001 0.1837 0.2561
2 0 0.4855 0.425 0.4500 0.5212
2 1 0.7619 <0.001 0.7213 0.7982

When \(X=-2\) and \(M=-1\), the predicted probability that \(Y\) equals 1 is 25.86%.

Rather than focus on arbitrary values of \(X\) and \(M\), we can plot predicted values to communicate a richer set of estimates. When calling plot_predictions() with these data, we obtain a plot of predicted outcomes with the focal variable on the x-axis, and lines representing different values of the moderator.

plot_predictions(mod, condition = c("X", "M"))
Figure 10.3: Predicted value of Y for different values of X and M

We can draw two preliminary conclusions from Figure 10.3. First, the predicted values of \(Y\) depend strongly on the value of \(X\). Moving from left to right in the plot often has a strong effect on the heights of predicted probability curves. Second, \(M\) strongly moderates the relationship between \(X\) and \(Y\). Indeed, for some values of \(M\) the relationship of interest completely flips. For example, when \(M\) is around -3, the relationship between \(X\) and \(Y\) is negative: an increase in \(X\) is associated with a decrease in \(Pr(Y=1)\). However, for all other displayed values of \(M\), the relationship between \(X\) and \(Y\) is positive: an increase in \(X\) is associated with an increase in \(Pr(Y=1)\).

Does \(X\) affect \(Y\)?

To measure the association between \(X\) and \(Y\), we have two basic options. First, we can use the comparisons() function to compute the estimated effect of a discrete change in \(X\) on the predicted value of \(Y\). Second, we can use the slopes() function to compute the estimated slope of \(Y\) with respect to \(X\). Since the focal variable and moderator are both continuous, the rest of this section focuses on slopes. However, it is useful to note that slopes and counterfactual comparisons both offer valid answers to the research question, both families of functions operate in nearly identical ways, and both sets of results would lead us to the same substantive conclusions.
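
For instance, the counterfactual-comparison analogue of the average slope reported below could be obtained with a call along these lines (results omitted); it would point to the same substantive conclusion.

avg_comparisons(mod, variables = "X")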

avg_slopes(mod, variables = "X")
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.0857 0.00571 15 <0.001 0.0745 0.0969

On average, across all observed values of the moderator \(M\), increasing \(X\) by one unit increases the predicted outcome by 0.086.6 This is interesting, but as Figure 10.3 suggests, there is strong heterogeneity in the relationship of interest. The association between \(X\) and \(Y\) may be very different for different values of \(M\).

Is the effect of \(X\) on \(Y\) moderated by \(M\)?

To answer this question, we estimate slopes of \(Y\) with respect to \(X\), for five values of the moderator \(M\).

slopes(mod, variables = "X", newdata = datagrid(M = fivenum))
Table 10.1
M Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
-3.0788 -0.1525 0.01899 -8.03 < 0.001 -0.18969 -0.1153
-0.6759 0.0200 0.00756 2.65 0.00815 0.00519 0.0348
0.0137 0.0913 0.00695 13.14 < 0.001 0.07768 0.1049
0.6906 0.1698 0.00950 17.87 < 0.001 0.15113 0.1884
3.4711 0.5426 0.03363 16.14 < 0.001 0.47672 0.6085

The results in this table confirm our intuition. When \(M\approx -3\), the slope is negative: increasing \(X\) results in a reduction of \(Y\). However, for the other values of \(M\) in Table 10.1, the average slope is positive. This is consistent with Figure 10.3, which shows one line with downward slope, and four lines with upward slopes.

For a more fine-grained analysis, we can plot the slope of \(Y\) with respect to \(X\) for all observed values of the moderator \(M\).

plot_slopes(mod, variables = "X", condition = "M") +
    geom_hline(yintercept = 0, linetype = "dotted")
Figure 10.4: Slope of \(Y\) with respect to \(X\), for different values of the moderator \(M\).

Figure 10.4 shows that when the moderator \(M\) is below -1, the relationship between \(X\) and \(Y\) is negative: increasing \(X\) decreases \(Y\). However, when \(M\) rises above -1, the relationship between \(X\) and \(Y\) becomes positive: increasing \(X\) increases \(Y\).

We can confirm that this moderation effect is statistically significant using the hypothesis argument. First, we estimate the slope of \(Y\) with respect to \(X\) for two different values of the moderator \(M\): its minimum and maximum.

slopes(mod,
  variables = "X",
  newdata = datagrid(M = range))
M Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
-3.08 -0.152 0.0190 -8.03 <0.001 -0.190 -0.115
3.47 0.543 0.0336 16.14 <0.001 0.477 0.609

Then, we use the hypothesis argument to compare the two estimates.

slopes(mod,
  variables = "X",
  newdata = datagrid(M = range),
  hypothesis = "b2 - b1 = 0")
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b2-b1=0 0.695 0.0475 14.6 <0.001 0.602 0.788

The \(\frac{\partial Y}{\partial X}\) slope is larger when evaluated at maximum \(M\), than at minimum \(M\). Therefore, we can reject the null hypothesis that \(M\) has no moderating effect on the relationship between \(X\) and \(Y\).

10.1.4 Multiple interactions

The fourth case to consider is when more than two variables are included in multiplicative interactions. Such models have serious downsides: they can overfit the data, and they impose major costs in terms of statistical power, typically requiring considerably larger sample sizes than models without interactions. On the upside, models with multiple interactions allow more flexibility in modelling, and they can capture complex patterns of moderation between regressors.

Models with several multiplicative interactions do not pose any particular interpretation challenge, since the tools and workflows introduced in this book can be applied to these models in straightforward fashion. Consider this simulated dataset with four binary variables.

dat = get_dataset("interaction_04")
head(dat)
# A data frame: 6 × 4
      Y     X    M1    M2
* <int> <int> <int> <int>
1     0     0     0     0
2     1     1     0     1
3     0     0     1     0
4     1     0     1     0
5     0     0     0     1
6     0     1     0     1

We fit a logistic regression model with \(Y\) as the outcome, and multiplicative interactions between all three predictors.

mod = glm(Y ~ X * M1 * M2, data = dat, family = binomial)
summary(mod)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.0209     0.0908 -11.244   <0.001
X             0.4632     0.1229   3.768   <0.001
M1           -0.7954     0.1470  -5.411   <0.001
M2            0.5788     0.1217   4.755   <0.001
X:M1          0.6746     0.1890   3.569   <0.001
X:M2         -0.9649     0.1716  -5.623   <0.001
M1:M2         0.4046     0.1890   2.141   0.0323
X:M1:M2       0.1976     0.2542   0.777   0.4369

Once again, the coefficient estimates of this logistic regression are difficult to interpret on their own, so we use functions from the marginaleffects package.

Marginal predictions

As before, we can compute and display marginal predicted outcomes in any subgroup of interest, using the avg_predictions() or plot_predictions() functions.

plot_predictions(mod, by = c("X", "M1", "M2"))
Figure 10.5: Average predicted outcomes for different combinations of \(X\), \(M_1\), and \(M_2\).

Figure 10.5 shows the predicted probability that \(Y\) equals 1, for different combinations of \(X\), \(M_1\), and \(M_2\). The x-axis represents the value of \(X\). The shape of points represents the value of \(M_1\), with triangles indicating a value of \(M_1=1\). The left facet shows average predictions where \(M_2=0\), and the right facet shows estimates for the subset of data where \(M_2=1\).

In the left facet, we see that the average predicted probability that \(Y\) equals 1 is about 26.5% when \(X=0\), \(M_1=0\), and \(M_2=0\). We can extract this numeric estimate by calling avg_predictions() with a data grid that fixes the predictors at those values.

avg_predictions(mod, newdata = datagrid(
  X = 0, M1 = 0, M2 = 0
))
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.265 0.0177 15 <0.001 0.23 0.299

Equivalently, we can use the by argument to compute average predicted probabilities for every combination of the predictors (results omitted for brevity).

avg_predictions(mod, by = c("X", "M1", "M2"))

Does \(X\) affect \(Y\)?

As before, we can estimate the average change in \(Y\) associated with a change from 0 to 1 in the \(X\) variable using the avg_comparisons() function.

avg_comparisons(mod, variables = "X")
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.0657 0.0129 5.1 <0.001 0.0405 0.091

This suggests that, on average, moving from the control (0) to the treatment (1) group is associated with an increase of 6.6 percentage points in the probability that \(Y\) equals 1. The \(p\) value is small, which implies that we can reject the null hypothesis that \(X\) has no effect on the predicted outcome.

Is the effect of \(X\) on \(Y\) moderated by \(M_1\)?

To check whether the effect of \(X\) varies depending on the value of the moderator \(M_1\), we call avg_comparisons() with the by argument.

avg_comparisons(mod, variables = "X", by = "M1")
M1 Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0 -0.00703 0.0185 -0.38 0.704 -0.0433 0.0292
1 0.14026 0.0179 7.83 <0.001 0.1052 0.1754

This shows that the expected change in \(Y\) associated with a change in \(X\) differs based on the value of \(M_1\): -0.0070 vs. 0.1403. Using the hypothesis argument, we can confirm that the difference between these two estimated effect sizes is statistically significant. In other words, we can reject the null hypothesis that the effect of \(X\) is the same for all values of \(M_1\).

avg_comparisons(mod, 
  variables = "X",
  by = "M1",
  hypothesis = "b2 - b1 = 0")
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b2-b1=0 0.147 0.0258 5.72 <0.001 0.0968 0.198

Does the moderating effect of \(M_1\) depend on \(M_2\)?

The last question that we pose is more complex. Above, we established two patterns.

  1. On average, the value of \(X\) affects the predicted value of \(Y\).
  2. On average, the value of \(M_1\) modifies the strength of association between \(X\) and \(Y\).

Now, we ask if \(M_2\) changes the way in which \(M_1\) moderates the effect of \(X\) on \(Y\). The difference is subtle but important: we are asking if the moderation effect of \(M_1\) is itself moderated by \(M_2\).
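
To state the question precisely, let \(\Delta(m_1,m_2)\) denote the average change in predicted \(Y\) associated with a change in \(X\) from 0 to 1, among observations where \(M_1=m_1\) and \(M_2=m_2\). The quantity of interest is a difference in differences:

\[\left[\Delta(1,0) - \Delta(0,0)\right] - \left[\Delta(1,1) - \Delta(0,1)\right]\]

The first bracket measures how strongly \(M_1\) moderates the effect of \(X\) when \(M_2=0\); the second bracket measures the same moderation when \(M_2=1\).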

The following code computes the average difference in predicted \(Y\) associated with a change in \(X\), for every combination of moderators \(M_1\) and \(M_2\). Each row represents the average effect of \(X\) at different points in the sample space.

avg_comparisons(mod, 
    variables = "X", 
    by = c("M2", "M1"))
M2 M1 Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0 0 0.0992 0.0261 3.80 < 0.001 0.0481 0.1504
0 1 0.1967 0.0236 8.35 < 0.001 0.1505 0.2429
1 0 -0.1111 0.0262 -4.24 < 0.001 -0.1625 -0.0597
1 1 0.0834 0.0270 3.09 0.00203 0.0304 0.1363

When we hold the moderators fixed at \(M_1=0\) and \(M_2=0\), changing the value of \(X\) from 0 to 1 changes the average predicted probability of \(Y=1\) by 9.9 percentage points. When we hold the moderators fixed at \(M_1=0\) and \(M_2=1\), changing the value of \(X\) from 0 to 1 changes the average predicted probability of \(Y=1\) by -11.1 percentage points.

Now, imagine we hold \(M_2\) constant at 0. We can determine if the effect of \(X\) is moderated by \(M_1\) by using the hypothesis argument to compare estimates in rows 1 and 2. This shows that the estimated effect size of \(X\) is larger when \(M_1=1\) than when \(M_1=0\), holding \(M_2=0\).

avg_comparisons(mod, 
    hypothesis = "b2 - b1 = 0",
    variables = "X", 
    by = c("M2", "M1"))
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b2-b1=0 0.0975 0.0351 2.77 0.00555 0.0286 0.166

Similarly, imagine that we hold \(M_2\) constant at 1. We can determine if the effect of \(X\) is moderated by \(M_1\) by comparing estimates in rows 3 and 4. We use the syntax introduced in Chapter 4, where b3 represents the 3rd estimate and b4 the fourth.

avg_comparisons(mod, 
    hypothesis = "b4 - b3 = 0",
    variables = "X", 
    by = c("M2", "M1"))
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b4-b3=0 0.194 0.0377 5.16 <0.001 0.121 0.268

The hypothesis test above shows that we can reject the null hypothesis that the 4th estimate is equal to the 3rd (i.e., that the difference between them is zero). Therefore, we can conclude that the effect size of \(X\) is larger when \(M_1=1\) than when \(M_1=0\), holding \(M_2\) at 1.

So far, we have assessed the extent to which \(M_1\) acts as a moderator, holding \(M_2\) at different values. To answer the question of whether \(M_2\) moderates the moderation effect of \(M_1\), we can specify the hypothesis as a difference in differences:

avg_comparisons(mod, 
    hypothesis = "(b2 - b1) - (b4 - b3) = 0",
    variables = "X", 
    by = c("M2", "M1"))
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
(b2-b1)-(b4-b3)=0 -0.097 0.0515 -1.88 0.0597 -0.198 0.00396

This suggests that \(M_2\) may have a second-order moderation effect, but we cannot rule out the null hypothesis, because the \(p\) value does not cross conventional thresholds of statistical significance (\(p\)=0.0597). In other words, we cannot quite reject the null hypothesis that \(M_2\) has no moderation effect on the moderation effect of \(M_1\).

10.2 Polynomial regression

Polynomial regression is an extension of the standard additive regression framework that allows us to model the relationship between a dependent variable \(Y\) and an independent variable \(X\) as an \(n\)th-degree polynomial. While the model specification remains linear in the coefficients, it is polynomial in the value of \(X\). This type of regression is useful when the data show a non-linear relationship that a straight line cannot adequately capture.

The general form of a polynomial regression model is

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \cdots + \beta_n X^n + \varepsilon,\]

where \(Y\) is the dependent variable, \(X\) is the independent variable, \(\beta_0, \beta_1, \beta_2, \ldots, \beta_n\) are the coefficients to be estimated, \(n\) is the degree of the polynomial, and \(\varepsilon\) represents the error term. For instance, a second-degree (quadratic) polynomial regression equation can be written as:

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon\]

A polynomial regression can be seen as a special case of the multiplicative interactions discussed in Section 10.1.4, where predictors are simply interacted with themselves. As before, it can be treated as a simple linear regression problem by constructing new variables \(Z_1=X\), \(Z_2=X^2\), etc. The model then becomes \(Y=\beta_0+\beta_1\cdot Z_1+\beta_2\cdot Z_2+\varepsilon\), which can be estimated using standard methods like ordinary least squares.
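
Here is a minimal sketch of this equivalence, using simulated data for illustration (the variable names and seed are arbitrary).

set.seed(48103)
X = runif(1000, -3, 3)
Y = 2.5 - X^2 + rnorm(1000)
d = data.frame(Y, X, Z1 = X, Z2 = X^2)
mod_constructed = lm(Y ~ Z1 + Z2, data = d)   # regression on constructed variables
mod_formula = lm(Y ~ X + I(X^2), data = d)    # same coefficients, specified in the formula
# The two fits are identical, but only the formula-based version tells
# marginaleffects that X and X^2 must move together (see the footnotes).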

Polynomial regression offers several key advantages. It is flexible and can fit a wide range of curves, simply by adjusting the degree of the polynomial. As a result, polynomial regression can reveal underlying patterns in the data that are not immediately apparent with simpler models.

This approach also has notable disadvantages. One significant issue is its potential for overfitting, especially when the degree \(n\) of the polynomial is high. Moreover, polynomial regression can suffer from unreliable extrapolation: predictions made outside the range of the observed sample can become erratic and unrealistic. Consequently, while polynomial regression can be powerful, careful consideration must be given to the degree of the polynomial to balance fit and generalization effectively.

Polynomial regression can be viewed simply as a model specification with several variables interacted with themselves. As such, it can be interpreted using exactly the same tools discussed in the earlier part of this chapter. To illustrate, we consider two simple data generating processes adapted from Hainmueller, Mummolo, and Xu (2019). The first is

\[\begin{align*} Y = 2.5 - X^2 + \varepsilon, && \text{where }\varepsilon\sim N(0,1) \text{ and } X\sim U(-3,3) \end{align*}\]

If we fit a linear model with only \(X\) as predictor, the line of best fit will not be a good representation of these data. However, a cubic polynomial regression can easily detect the curvilinear relationship between \(X\) and \(Y\). In R and Python, we can use similar syntax to specify polynomials directly in the model formula.7 Then, we call plot_predictions() to visualize the predictions of the model.

library(patchwork)
dat = get_dataset("polynomial_01")

mod_linear = lm(Y ~ X, data = dat)
p1 = plot_predictions(mod_linear, condition = "X", points = .05) + 
  ggtitle("Linear")

mod_cubic = lm(Y ~ X + I(X^2) + I(X^3), data = dat)
p2 = plot_predictions(mod_cubic, condition = "X", points = .05) + 
  ggtitle("Cubic")

p1 + p2
Figure 10.6: Modelling a curvilinear relationship with linear or polynomial regression.

On the left-hand side, we see the predictions of a linear model. The prediction line runs straight through the cloud of observed data points, and fails to capture the curvilinear relationship between \(X\) and \(Y\). On the right-hand side, the prediction line drawn based on cubic terms does a very good job of capturing the curve in the data generating process.

To evaluate the strength of association between \(X\) and \(Y\), we can compute the slope of the outcome equation with respect to \(X\), at different values of \(X\).

slopes(mod_cubic, variables = "X", newdata = datagrid(X = c(-2, 0, 2)))
X Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
-2 3.98616 0.0351 113.532 <0.001 3.9173 4.0550
0 0.00478 0.0226 0.211 0.833 -0.0396 0.0491
2 -3.97639 0.0359 -110.644 <0.001 -4.0468 -3.9059

When \(X\) is negative, the slope is positive, which indicates that an increase in \(X\) is associated with an increase in \(Y\). When \(X\) is around 0, the slope is essentially null: the association between \(X\) and \(Y\) is very weak in that region, and changing \(X\) by a small amount has almost no effect on \(Y\). When \(X\) is positive, the slope is negative, which indicates that increasing \(X\) should result in a decrease in \(Y\).
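
These estimates are consistent with the data generating process. Since the simulated conditional mean is \(2.5 - X^2\), the true slope is

\[\frac{\partial Y}{\partial X} = -2X,\]

which equals 4, 0, and -4 at \(X=-2\), \(X=0\), and \(X=2\). The estimated slopes above are close to these values.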

Now, consider a slightly different data generating process, where a binary moderator \(M\) changes the nature of the relationship between \(X\) and \(Y\).8

\[\begin{align*} Y = 2.5 - X^2 - 5 \cdot M + 2 \cdot M \cdot X^2 + \varepsilon && \text{where }\varepsilon\sim N(0,1) \text{ and } X\sim U(-3,3) \end{align*}\]

If we simply fit a cubic regression, without accounting for \(M\), our predictions will be inaccurate. However, if we interact the moderator \(M\) with all polynomial terms (using parentheses as a shortcut for the distributive property), we can get an excellent fit for the curvilinear and differentiated relationship between \(X\) and \(Y\).

dat = get_dataset("polynomial_02")

mod_cubic = lm(Y ~ X + I(X^2) + I(X^3), data = dat)
p1 = plot_predictions(mod_cubic, condition = "X",
  points = .05)

mod_cubic_interaction = lm(Y ~ M * (X + I(X^2) + I(X^3)), data = dat)
p2 = plot_predictions(mod_cubic_interaction, condition = c("X", "M"),
  points = .1)

p1 + p2
Figure 10.7: Modelling a curvilinear relationship moderated by a binary variable, with and without interacting \(M\) with the polynomial terms.

As expected, the fit lines drawn by the model with cubic polynomials and an interaction do a good job of capturing the distinct patterns in the two \(M\) subsets. And as before, we can estimate the slope of the outcome equation for different values of \(M\) and \(X\):

s = slopes(mod_cubic_interaction,
  variables = "X",
  newdata = datagrid(M = c(0, 1), X = fivenum))
s
M X Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0 -2.9974 5.7529 0.2582 22.281 <0.001 5.24686 6.2590
0 -1.4845 2.9009 0.0582 49.874 <0.001 2.78693 3.0149
0 -0.0586 0.1324 0.0667 1.986 0.047 0.00174 0.2631
0 1.4788 -2.9405 0.0575 -51.106 <0.001 -3.05322 -2.8277
0 2.9843 -6.0373 0.2545 -23.724 <0.001 -6.53611 -5.5386
1 -2.9974 -6.1017 0.2531 -24.109 <0.001 -6.59774 -5.6057
1 -1.4845 -2.9061 0.0563 -51.628 <0.001 -3.01646 -2.7958
1 -0.0586 -0.0461 0.0634 -0.727 0.467 -0.17026 0.0781
1 1.4788 2.8730 0.0595 48.283 <0.001 2.75638 2.9896
1 2.9843 5.5654 0.2594 21.454 <0.001 5.05695 6.0738

When \(M=0\) and \(X\approx -3\), the estimated slope is 5.753, which means that the prediction curve is rising steeply. Indeed, this is what we see in the right-hand side panel of Figure 10.7, where the solid line rises quickly for low values of \(X\). When \(M=0\) and \(X\approx 3\), the estimated slope is negative (-6.037): increasing \(X\) by a small amount will result in a substantial (and statistically significant) decrease in \(Y\).
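
More generally, these slope estimates track the data generating process. Differentiating the simulated outcome equation with respect to \(X\) gives

\[\frac{\partial Y}{\partial X} = -2X + 4 \cdot M \cdot X,\]

which equals \(-2X\) when \(M=0\) and \(2X\) when \(M=1\). At \(X\approx\pm 3\), this implies slopes near \(\mp 6\) when \(M=0\) and near \(\pm 6\) when \(M=1\), consistent with the estimates reported above.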


  1. Two downsides of more flexible models are that they can sometimes reduce statistical power or overfit the data.↩︎

  2. In most cases, it is important to include all constitutive terms in addition to interactions. For example, if a model includes a multiplication between three variables \(X\cdot W \cdot Z\), one would typically want to also include \(X\cdot W, X\cdot Z, W\cdot Z, X, W,\) and \(Z\). See Clark and Golder (2023) for details.↩︎

  3. An alternative would be to report “marginal means” by computing on a balanced grid before aggregating. This can be achieved by calling: avg_predictions(mod, newdata="balanced", by=c("X","M"))↩︎

  4. For a discussion of the difference between hypothesis tests conducted on average predictions and average comparisons, see Section 6.3.1.↩︎

  5. These five numbers correspond to elements of a standard boxplot: minimum, lower-hinge, median, upper-hinge, and maximum.↩︎

  6. Recall from Chapter 7 that this interpretation is valid for small changes in the neighborhoods where slopes are evaluated.↩︎

  7. For marginaleffects to work properly in this context, it is important to specify the polynomials in the model-fitting formula. Users should not hard-code the values by creating new variables in the dataset before fitting the model.↩︎

  8. This data generating process is adapted from Hainmueller, Mummolo, and Xu (2019).↩︎