6  Predictions

The parameter estimates obtained by fitting a statistical model are rarely the main object of interest in a data analysis. Instead of focusing on those raw estimates, a good starting point is often to compute model-based predictions for different combinations of predictor values. This allows an analyst to report results on a scale that makes intuitive sense to their readers, colleagues, and stakeholders.

What is a model-based prediction? In this book, we consider that

A prediction is the outcome expected by a fitted model for a given combination of predictor values.

This definition is in line with the familiar concept of “fitted value,”1 but it differs from a “forecast” or “out-of-sample prediction” (Hyndman and Athanasopoulos 2018). For our purposes, the word “prediction” need not imply that we hope to forecast the future, or that we are trying to extrapolate to unseen data.

Model-based predictions are often the main quantity of interest in a data analysis. They allow us to answer a wide variety of questions, such as:

All of these descriptive questions can be answered using model-based predictions. This highlights the fact that predictions are an intrinsically interesting quantity. In chapters 7 and 8, we will see that they are also a fundamental building block to analyze the effects of interventions.

The current chapter illustrates how to compute and report predictions for models estimated with the Thornton (2008) data. We proceed in order, through each component of the conceptual framework laid out in Chapter 4: (1) quantity, (2) predictors, (3) aggregation, (4) uncertainty, and (5) tests. Then, we conclude by showing different ways to visualize predictions.

6.1 Quantity

To begin, it is useful to consider how predictions are built in one particular case. Let’s consider a logistic regression model estimated using the Thornton (2008) HIV dataset:

\[ Pr(\text{Outcome}=1) = \Phi \left (\beta_1 + \beta_2 \cdot \text{Incentive} + \beta_3 \cdot \text{Age}_{18-35} + \beta_4 \cdot \text{Age}_{>35} \right ), \tag{6.1}\]

where Outcome is a binary variable equal to 1 if the study participant travelled to the test center to learn their HIV status; Incentive is a binary variable equal to 1 if the participant received a monetary incentive; and the other two predictors are indicators for the age category to which a participant belongs, with omitted category Age\(_{<18}\). The letter \(\Phi\) represents the standard logisitic function \(\Phi(x) = \frac{1}{1 + e^{-x}}\), which ensures that the linear component inside the parentheses of Equation 6.1 gets scaled to the \([0,1]\) interval. This allows the model to respect the natural scale of the binary outcome variable.

We load the marginaleffects package, read the data, and estimate a logistic regression model using the glm() function:

library(marginaleffects)
dat <- readRDS("data/hiv.rds")
mod <- glm(outcome ~ incentive + agecat, data = dat, family = binomial)

The estimated coefficients are:

b <- coef(mod)
b
(Intercept)   incentive agecat18-35   agecat>35 
-0.78232923  1.99229719  0.04368393  0.24780479 

For clarity of presentation, we substitute these estimates into the model equation:

\[ Pr(\text{Outcome}=1) = \Phi \left (-0.782 + 1.992 \cdot \text{Incentive} + 0.044 \cdot \text{Age}_{18-35} + 0.248 \cdot \text{Age}_{>35} \right ) \]

To make a prediction for a particular individual, we simply plug-in the characteristics of a person into this equation. For example, the predicted probability that Outcome equals 1 for an 18 to 35 year-old in the treatment group is:

\[\begin{align*} Pr(\text{Outcome}=1) = \Phi \left (-0.782 + 1.992 \cdot 1 + 0.044 \cdot 1 + 0.248 \cdot 0 \right )\\ \end{align*}\]

The predicted probability that Outcome equals 1 for someone above 35 years-old in the control group is:

\[\begin{align*} Pr(\text{Outcome}=1) = \Phi \left (-0.782 + 1.992 \cdot 0 + 0.044 \cdot 0 + 0.248 \cdot 1 \right ) \end{align*}\]

These expressions can be evaluated using any calculator. First, we compute the part in parentheses. This is the “linear” or “link scale” prediction:

linpred_treatment_younger <- b[1] + b[2] * 1 + b[3] * 1 + b[4] * 0
linpred_treatment_younger

[1] 1.253652

linpred_control_older <- b[1] + b[2] * 0 + b[3] * 0 + b[4] * 1
linpred_control_older

[1] -0.5345244

Link scale predictions from a logit model are expressed on the log odds scale. In this example, they include a negative value and a value greater than one. To many, this will feel incongruous, because the outcome variable is a binary variable, with a probability bounded by 0 and 1. To ensure that our predictions respect the natural scale of the data, we transform the linear component of Equation 6.1 using the logistic function:

logistic <- \(x) 1 / (1 + exp(-x))
logistic(linpred_treatment_younger)

[1] 0.7779314

logistic(linpred_control_older)

[1] 0.3694623

Our model expects that the probability of seeking information about one’s HIV status is 78% for a young adult who receives a monetary incentive, and 37% for an older participant who does not receive an incentive.

Computing predictions manually is useful for pedagogical purposes, but it is a labour-intensive and error-prone process. The commands above are also limiting, because they only apply to one very specific model.

Instead of manual computation, we can use the predictions() function from the marginaleffects package. This function can be applied in consistent fashion across more than 100 different classes of statistical models.

First, we build a data frame of predictor values—a grid—where each row represents a different individual:

grid <- data.frame(agecat = c("18-35", ">35"), incentive = c(1, 0))
grid
  agecat incentive
1  18-35         1
2    >35         0

Then, we call the predictions() function, using the newdata argument to specify the predictor values, and the type argument to set the scale (link or response):

predictions(mod, newdata = grid, type = "link")
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % agecat incentive
1.254 0.0691 18.15 1.118 1.389 {18-35} 1
-0.535 0.1013 -5.28 -0.733 -0.336 {>35 } 0

These results are exactly identical to the link scale predictions that we computed manually above. This is reassuring. However, in a logit model, link scale predictions are hard to interpret.

To communicate our results clearly, it is usually best to make predictions on the same scale as the outcome variable. The resulting estimates are easier to interpret, since they can be compared to observed values of the outcome variable in our dataset. For this reason, the default behavior of predictions() is to return predictions on the response scale:

predictions(mod, newdata = grid)
Estimate Pr(>|z|) 2.5 % 97.5 % agecat incentive
0.778 0.754 0.800 {18-35} 1
0.369 0.325 0.417 {>35 } 0

In the rest of this chapter, we show that the marginaleffects package makes it easy to compute various types of predictions, aggregate, and conduct statistical tests on them.

6.2 Predictors

Predictions are “conditional” quantities, in the sense that they depend on the values of all the predictor variables in a model. To compute a prediction, the analyst must fix all the variables on the right-hand side of the model equation; they must choose a grid (see Section 4.2).

The choice of grid depends on the researcher’s goals. The profiles it holds could correspond to actual observations in the original data, or they could represent unseen, hypothetical, or representative units. To illustrate, let’s consider a slight modification of the model estimated in Section 6.1. In addition to the incentive and agecat predictors, we now include a numeric predictor to account for the distance between a study participant’s home and the test center where they can learn their HIV status:

mod <- glm(outcome ~ incentive + agecat + distance, data = dat, family = binomial)

With this model, we can make predictions on various grids: empirical, interesting, representative, balanced, or counterfactual.

6.2.1 Empirical grid

By default, the predictions() function uses the full original dataset as a grid, that is, it uses the empirical distribution of predictors (Section 4.2.1). This means that predictions() will compute fitted values for each of the individuals observed in the dataset used to fit the model:

p <- predictions(mod)
p
Estimate Pr(>|z|) 2.5 % 97.5 %
0.365 0.297 0.439
0.354 0.288 0.426
0.265 0.209 0.330
0.300 0.241 0.365
0.334 0.271 0.402
0.833 0.809 0.855
0.840 0.815 0.862
0.833 0.809 0.855
0.827 0.803 0.849
0.789 0.761 0.814

The p object created by predictions() includes the fitted values for each observation in the dataset, along with test statistics like p values and confidence intervals. p is a standard data frame, which means that we can manipulate it using standard R functions.

For example, we can check that the data frame includes 2825 and 10 columns:

dim(p)
[1] 2825   10

We can list the available column names:

 [1] "rowid"     "estimate"  "p.value"   "s.value"   "conf.low"  "conf.high"
 [7] "outcome"   "incentive" "agecat"    "distance" 

And we can extract individual columns and cells using the standard $ or [] syntaxes, or using data manipulation packages like dplyr or data.table:

p[1:4, "estimate"]
[1] 0.3652679 0.3540617 0.2653133 0.2997423

Users should be mindful of the fact that, by default, the p values held in this data frame correspond to a hypothesis test against a null of zero. In Section 6.5, we will see that it is easy to change this default null using the hypothesis argument.

6.2.2 Interesting grid

In many cases, analysts are not interested in model-based predictions for each observation in their sample. Instead, they may prefer to construct a customized grid of predictors which includes specific values of scientific or commercial interest (Section 4.2.2).

In marginaleffects, the main strategy to define custom grids is to use the newdata argument and the datagrid() function. This function creates a “typical” dataset with all variables at their means or modes, except those we explicitly define:

datagrid(agecat = "18-35", incentive = c(0, 1), model = mod)
  distance agecat incentive rowid
1 2.014541  18-35         0     1
2 2.014541  18-35         1     2

We can feed this datagrid() function to the newdata argument of predictions():2

predictions(mod,
  newdata = datagrid(agecat = "18-35", incentive = c(0, 1))
)
agecat incentive Estimate Pr(>|z|) 2.5 % 97.5 % distance
{18-35} 0 0.318 0.279 0.361 2.01
{18-35} 1 0.780 0.755 0.802 2.01

This shows that the estimated probability of seeking one’s HIV status is about 32% for a participant who is between 18 and 35 years old, did not receive a monetary incentive, and lives the average distance from the center.

We can also make predictions on a custom grid by supplying functions to datagrid(). These functions will be applied to the named variables, and the output used to construct the grid.

predictions(mod, 
  newdata = datagrid(distance = 2, agecat = unique, incentive = max)
)
distance agecat incentive Estimate Pr(>|z|) 2.5 % 97.5 %
2 { 1 0.774 0.725 0.816
2 {18-35} 1 0.780 0.756 0.802
2 {>35 } 1 0.815 0.791 0.837

6.2.3 Representative grid

Sometimes, analysts do not want fine-grained control over the values of each predictor, but would rather compute predictions for some “representative” individual (Section 4.2.3). For example, we can compute a “Prediction at the Mean,” that is, a prediction for a hypothetical representative individual whose personal caracteristics are exactly average (numeric) or modal (categorical).

To achieve this, we can either set the values of the grid manually in datagrid(), or we can use the "mean" shortcut:

predictions(mod, newdata = "mean")
Estimate Pr(>|z|) 2.5 % 97.5 % incentive agecat distance
0.78 0.755 0.802 1 {18-35} 2.01

Representative grids can be useful in some contexts, but they are not always the best choice. Sometimes there is simply noone in our sample who is exactly average on all relevant dimensions. When this “average individual” is fictional, predictions made for this profile may not be scientifically interesting or practically relevant.

6.2.4 Balanced grid

A common strategy in the analysis of experiments is to compute estimates on a “balanced grid” (Section 4.2.4). This type of grid includes one row for each combination of unique values for the categorical (or binary) predictors, holding numeric variables at their means. To achieve this, we can either call datagrid() or use the "balanced" shortcut. These two calls are equivalent:

predictions(mod,
  newdata = datagrid(agecat = unique, incentive = unique, distance = mean)
)
predictions(mod, newdata = "balanced")

Estimate Pr(>|z|) 2.5 % 97.5 % incentive agecat distance
0.311 0.251 0.377 0 { 2.01
0.318 0.279 0.361 0 {18-35} 2.01
0.367 0.322 0.415 0 {>35 } 2.01
0.773 0.724 0.816 1 { 2.01
0.780 0.755 0.802 1 {18-35} 2.01
0.815 0.791 0.837 1 {>35 } 2.01

A balanced grid is often used with randomized experiments, when the analyst wishes to give equal weights to each combination of treatment conditions in the calculation of marginal means (Section 6.3).

6.2.5 Counterfactual grid

Yet another set of predictor profiles to consider is the “counterfactual grid.” The predictions made on such a grid allow us to answer questions such as:

What would the predicted outcomes be in our observed sample if everyone had received the treatment, or if everyone had received the control?

To create a counterfactual grid, we duplicate the full dataset, once for every value of the focal variable. For instance, if our dataset includes 2884 rows, and we want to compute predictions for each combination of the incentive variable (0 and 1), the counterfactual grid will include 5768.

To make predictions on a counterfactual grid, we can call the datagrid() function, or we can use the variables argument:

p <- predictions(mod, variables = list(incentive = 0:1))
dim(p)
[1] 5650   11

These predictions are interesting, because they give us a first look at the kinds of counterfactual (potentially causal) queries that we will explore in Chapter 7. We can ask:

For each individual in the Thornton (2008) sample, what is the predicted probability of seeking information about HIV status in the counterfactual worlds where they receive a monetary incentive, and where they do not?

To answer this question, we rearrange the data and plot it:

library(ggplot2)

p <- data.frame(
  Control = p[p$incentive == 0, "estimate"],
  Treatment = p[p$incentive == 1, "estimate"])

ggplot(p, aes(Control, Treatment)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  geom_point() +
  labs(x = "Pr(outcome=1) when incentive = 0", 
       y = "Pr(outcome=1) when incentive = 1") +
  xlim(0, 1) + ylim(0, 1) + coord_equal()
Figure 6.1: Predicted probabilities for counterfactual values of incentive.

On this graph, each point represents a single study participant.3 The x-axis shows the predicted probability that Outcome equals 1 for an individual with the same socio-demographic characteristics, in the control group. The y-axis shows the predicted probability that Outcome equals 1 for an individual with the same socio-demographic characteristics, in the treatment group.

Every point is well above the 45 degree line. This means that, for every observed combination of predictor values, for every participant in the study, our model says that changing the incentive variable from 0 to 1 increases the predicted probability that the person will seek to learn their HIV status.

6.3 Aggregation

Computing predictions for a large grid or for every observation in a dataset is useful, but the results can feel unwieldy. This section makes two principal points. First, it often makes sense to compute aggregated statistics, such as the average predicted outcome across the whole dataset, or by subgroups of the data. Second, the grid across which we aggregate can make a big difference to the results.

An “average prediction” is the outcome of a simple two step process. First, we compute predictions (fitted values) for each row in the original dataset. Then, we take the average of those predictions. This can be done manually by calling the predictions() function and taking the mean of estimates:

p <- predictions(mod)
mean(p$estimate)
[1] 0.6916814

Alternatively, we can use the avg_predictions() function, which is a wrapper around predictions() that computes the average prediction directly:

Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.692 0.00791 87.5 0.676 0.707

This shows that the average predicted probability of seeking information about HIV status, across all the study participants in the Thornton (2008) sample, is about 69%.

Now, imagine we want to check if the predicted probability of the Outcome variable differs across age categories. To see this, we can make the same function call, but add the by argument:

avg_predictions(mod, by = "agecat")
agecat Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
{ 0.670 0.0240 27.9 0.623 0.717
{18-35} 0.673 0.0116 58.1 0.650 0.695
{>35 } 0.720 0.0121 59.5 0.697 0.744

The average predicted probability of seeking one’s test result is about 67% for minors, and 72% for those above 35 years old. In Section 6.5 we will formally test if the difference between those two average predictions is statistically significant.

So far, we have taken averages over the empirical distribution of covariates, but analysts are not limited to that grid. One common alternative is to compute “marginal means” by averaging predictions across a balanced grid of predictors.4 This is useful in experimental settings, when the observed sample is not representative of the population, and when we want to marginalize while giving equal weight to each treatment conditions.

To compute marginal means, we call the same function, using the newdata and by arguments:

avg_predictions(mod, newdata = "balanced", by = "agecat")
agecat Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
{ 0.542 0.0261 20.7 0.491 0.593
{18-35} 0.549 0.0135 40.6 0.522 0.575
{>35 } 0.591 0.0151 39.0 0.561 0.621

Notice that the results are considerably different from the average predictions computed on the empirical grid. Now, the average predicted probability of seeking one’s test result is estimated at 54% for minors, and 59% for those above 35 years old.

What explains this difference is that the balanced grid gives equal weight to each combination of categorical variables, while the empirical grid gives more weight to the combinations that are more frequent in the data. In the Thornton (2008) dataset, more participants belonged to the treatment than to the control group:

table(dat$incentive)

   0    1 
 621 2204 

Therefore, when we compute an average prediction on the empirical distribution, predicted outcomes in the incentive=1 group are given more weight. This matters, because the average predicted probability that Outcome equals 1 is much higher in the treatment group than in the control group:

avg_predictions(mod, by = "incentive")
incentive Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0 0.340 0.01890 18.0 0.303 0.377
1 0.791 0.00862 91.7 0.774 0.808

Thus, the group-wise averages for each age categories are smaller when computed over a balanced grid than when they are computed over the empirical distribution.

We have already shown that the probably of seekingIn the Thornton (2008) dataset, the number of participants in each age category is not equal, and the marginal means computed on the empirical grid are biased towards the more frequent categories.

table(dat$incentive)

   0    1 
 621 2204 

In the next example, we create a “counterfactual” data grid where each observation of the dataset is repeated twice, with different values of the incentive variable, and all other variables held at the observed values. We also show the equivalent results using standard R commands:

avg_predictions(
    mod,
    by = "incentive",
    newdata = datagrid(incentive = c(0, 1), grid_type = "counterfactual"))
incentive Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0 0.339 0.01888 18.0 0.302 0.376
1 0.791 0.00862 91.8 0.774 0.808
p <- predictions(
    mod,
    type = "response",
    newdata = datagrid(incentive = 0:1, grid_type = "counterfactual"))
aggregate(estimate ~ incentive, FUN = mean, data = p)
  incentive  estimate
1         0 0.3390492
2         1 0.7910508

6.4 Uncertainty

As in the rest of the marginaleffects package, the predictions() family of functions accept a vcov argument which can be used to specify the type of standard errors to compute and report. We can also control the size of confidence intervals with conf_level. For instance, to compute heteroskedasticity-consistent standard errors (Type 3) with 90% confidence intervals, we simply call:

avg_predictions(mod,
  by = "incentive",
  vcov = "HC3",
  conf_level = .9)
incentive Estimate Std. Error z Pr(>|z|) 5.0 % 95.0 %
0 0.340 0.01892 18.0 0.309 0.371
1 0.791 0.00864 91.5 0.777 0.805

We can also report clustered standard errors by village, or use the inferences() function to compute bootstrap intervals:

avg_predictions(mod,
  by = "incentive",
  vcov = ~ village,
  conf_level = .9)
incentive Estimate Std. Error z Pr(>|z|) 5.0 % 95.0 %
0 0.340 0.0235 14.5 0.301 0.378
1 0.791 0.0102 77.6 0.774 0.808
avg_predictions(mod,
  by = "incentive",
  conf_level = .9) |>
  inferences(method = "boot", R = 1000)
incentive Estimate Std. Error 5.0 % 95.0 %
0 0.340 0.01832 0.309 0.370
1 0.791 0.00829 0.777 0.804

Notice that the intervals are all slightly different, but still remain in the same ballpark.

6.5 Test

Above, we computed average predictions by age subgroups, and noted that there appeared to be differences in the likelihood that younger and older people would seek their HIV test results. That observation was only based on the point estimates of the average predictions, and did not rely on a statistical test. Now, let’s consider how analysts can compare predictions more formally.

6.5.1 Null hypothesis tests

To begin, we compute the average predicted outcome for each age subgroup:

p <- avg_predictions(mod, by = "agecat")
p
agecat Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
{ 0.670 0.0240 27.9 0.623 0.717
{18-35} 0.673 0.0116 58.1 0.650 0.695
{>35 } 0.720 0.0121 59.5 0.697 0.744

The average predicted outcome is 67% for young adults and 72% for participants above 35 years old. The difference between these two averages is:

p$estimate[3] - p$estimate[2]
[1] 0.04782993

To see if this risk difference is statistically significant, we can use the hypothesis argument, as we did in Chapter 5.

p <- avg_predictions(mod, by = "agecat", hypothesis = "b3 - b2 = 0")
p
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.0478 0.0167 2.86 0.00428 0.015 0.0806

The estimated difference between the 3rd and 2nd groups is about 5 percentage points, and the p value associated with this estimate is 0.004. This crosses the conventional (but arbitrary) statistical significance threshold of \(\alpha=0.05\). Accordingly, many analysts would reject the null hypothesis that the average predicted probability of seeking one’s HIV test results is the same in the 18-35 and >35 groups.

A more convenient way to conduct the same test is to use the formula interface. On the left-hand side, we set the comparison function (difference, ratio, etc.). On the right hand side, we specify which estimates to compare to one another (sequential, reference, etc.). Here, we choose sequential comparisons: 2nd level vs. 1st level, 3rd level vs. 2nd level, and so on.

p <- avg_predictions(mod, by = "agecat", hypothesis = difference ~ sequential)
p
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
(18-35) - ( 0.00274 0.0267 0.103 0.91810 -0.0495 0.0550
(>35) - (18-35) 0.04783 0.0167 2.856 0.00428 0.0150 0.0806

p <- avg_predictions(mod, by = "agecat", hypothesis = difference ~ sequential)
p
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
(18-35) - ( 0.00274 0.0267 0.103 0.91810 -0.0495 0.0550
(>35) - (18-35) 0.04783 0.0167 2.856 0.00428 0.0150 0.0806

The hypothesis argument also allows us to specify hypothesis groups by subgroups. For example, consider this command, which computes average predicted outcomes for each observed combination of incentive and agecat:

avg_predictions(mod,
    by = c("incentive", "agecat"))
incentive agecat Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0 { 0.312 0.0321 9.72 0.249 0.375
0 {18-35} 0.324 0.0209 15.46 0.283 0.365
0 {>35 } 0.370 0.0235 15.74 0.324 0.416
1 { 0.771 0.0234 32.99 0.725 0.816
1 {18-35} 0.778 0.0119 65.43 0.755 0.801
1 {>35 } 0.811 0.0117 69.04 0.788 0.834

We can use the hypothesis argument in similar fashion as before, but add a vertical bar to specify that we want to compute sequential risk differences within subgroups:

p <- avg_predictions(mod,
    by = c("incentive", "agecat"),
    hypothesis = difference ~ sequential | incentive)
p
Hypothesis incentive Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
(18-35) - ( 0 0.01111 0.0311 0.357 0.7213 -0.04994 0.0722
(>35) - (18-35) 0 0.04605 0.0216 2.131 0.0331 0.00369 0.0884
(18-35) - ( 1 0.00717 0.0254 0.283 0.7775 -0.04258 0.0569
(>35) - (18-35) 1 0.03330 0.0154 2.156 0.0311 0.00302 0.0636

This shows that, in the control group (incentive=0), the difference between the average predicted outcome for participants over 35 and for those between 18 and 35 is about 5 percentage points. However, in the treatment group (incentive=1), this difference is about 3 percentage points. Both of these differences are associated with relatively large \(z\) statistics, and are thus statistically distinguishable from zero.

6.5.2 Equivalence tests

Flipping the logic around, the analyst could run an equivalence test to determine if the difference between average predicted outcomes in the two subgroups is small enough to be considered negligible (Section 5.2). Imagine that, for domain-specific reasons, a risk difference smaller than 10 percentage points is considered “uninteresting,” “negligible,” or “equivalent to zero”. All we need to do is call the same function with the equivalence argument and the \([-0.1,0.1]\) interval of practical equivalence:

avg_predictions(mod,
    by = "agecat",
    hypothesis = "b3 - b1 = 0",
    equivalence = c(-0.1, 0.1))
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)
0.0506 0.0269 1.88 0.0601 -0.00215 0.103 0.0331 0.0331

The p value associated with our test of equivalence is small. This suggests that we can reject the null hypothesis that the difference lies outside the interval of practical equivalence. The difference is thus likely to be small enough to be ignored.

6.6 Visualization

In many cases, data analysts will want to visualize (potentially aggregated) predictions rather than report raw numeric values. This is easy to do with the plot_predictions() function, which a similar syntax that closely parallels that of the other functions in the marginaleffects package.

6.6.1 Unit predictions

As discussed in Chapter 4, the quantities derived from statistical models—predictions, counterfactual comparisons, and slopes—are typically conditional, in the sense that they depend on the values of all covariates in the model. This implies that each unit in our sample will be associated with its own prediction (fitted value) or effect estimate. In Avoiding One-Number Summaries, Harrell (2021) argues that data analysts should avoid the temptation to summarize these individual-level estimates. Rather, Harrell argues, they should display the full distribution of estimates to convey a sense of the heterogeneity of our quantity of interest across different combinations of predictor values.

Histograms and Empirical Cumulative Distribution Function (ECDF) plots are two common ways to visualize such a distribution. Since the output generated by the predictions() function is a standard data frame, it is easy to feed that object to any plotting function in R or Python, in order to craft good-looking visualizations.

library(patchwork)
p <- predictions(mod)

# Histogram
p1 <- ggplot(p) +
  geom_histogram(aes(estimate, fill = factor(incentive))) +
  labs(x = "Pr(outcome = 1)", y = "Count", fill = "Incentive")

# Empirical Cumulative Distribution Function
p2 <- ggplot(p) +
  stat_ecdf(aes(estimate, colour = factor(incentive))) +
  labs(x = "Pr(outcome = 1)", y = "Cumulative Probability", colour = "Incentive")

p1 + p2
Figure 6.2: Distribution of unit-level predictions (fitted values), by treatment group.

The left side of Figure 6.2 presents a histogram showing the distribution of predicted probabilities that individual study participants choose to travel to the test center in order to learn their HIV status. As usual, the x-axis represents the range of predicted outcomes, while the y-axis shows the number of study participants each bin. By assigning different colors to the bins based on the treatment arm (incentive equal 0 or 1), we highlight one of the key features of the distribution: predicted outcomes for people in the treatment group tend to be much higher than predicted outcomes for people in the control group. Indeed, the distribution of outcome probabilities without an incentive (orange) is concentrated between 0.2 and 0.4, indicating a low probability of travelling to the test center. In contrast, the distribution of predicted outcome for participants who received a monetary incentive (blue) is concentrated between 0.6 and 0.8. This suggests that those who received an incentive are considerably more likely to seek their test results.

The right side of Figure 6.2 presents an ECDF plot. Again, the x-axis represents the range of predicted outcomes. This time, however, the y-axis indicates the cumulative probability, which is the proportion of data points that are less than or equal to a specific value. For any given value on the x-axis, the height of the curve indicates the proportion of data points that are less than or equal to that value. For example, at 0.3 on the x-axis, we see that the incentive=0 line is close to 0.25. This suggests that about 25% of the participants in our sample have predicted outcome smaller than 30%. When the ECDF curve is steep, we know that a lot of the our data is concentrated in that part of the distribution. With this in mind, we see clearly that many of our predicted outcome values are clustered near 0.3 in the control group, and near 0.8 in the treatment group.

6.6.2 Marginal predictions

The first approach to display “marginal” predictions using the by argument. The underlying process is to (1) compute predictions for each observation in the actually observed dataset, and then (2) average these predictions across some variable(s). This is equivalent to plotting the results of calling avg-predictions() using the by argument.

For example, if we want to compute the average predicted probability that outcome equals 1, by subgroup, we call:

avg_predictions(mod, by = "incentive")
incentive Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0 0.340 0.01890 18.0 0.303 0.377
1 0.791 0.00862 91.7 0.774 0.808

We plot the same results using the plot_predictions() function:

p1 <- plot_predictions(mod, by = "incentive")
p2 <- plot_predictions(mod, by = c("incentive", "agecat"))
p1 + p2
Figure 6.3: Marginal predicted probabilities that outcome equals 1.

Note that that the plot_predictions() function also accepts a newdata argument. This means that we can, for example, plot marginal means constructed by marginalizing across a balanced grid of predictors:5

plot_predictions(mod, by = "incentive", newdata = "balanced", draw = FALSE)

6.6.3 Conditional predictions

In some contexts, plotting marginal predictions may not be appropriate. For instance, when one of the predictors of interest is continuous, there are many predictors, or much heterogeneity, the commands presented in the previous section may generate jagged plots which are difficult to read. In such cases, it can be useful to plot “conditional” predictions instead. In this context, the word “conditional” means that we are computing predictions, conditional on the values of the predictors in a constructed grid of “representative” values. However, unlike in the previous section, we do not average over several predictions before displaying the estimates. We fix the grid and display the predictions made for that grid immediately.

The condition argument of the plot_predictions() function does just that: Build a grid of representative predictor values, compute predictions for each combination of predictor values, and plot the results. In the following examples, we fix one or more predictor to its unique values (categorical) or to an equally spaced grid from minimum to maximum (continuous). The other predictors in the model are held to their mean or model.

p1 <- plot_predictions(mod, condition = "distance")
p2 <- plot_predictions(mod, condition = c("distance", "incentive"))
p3 <- plot_predictions(mod, condition = c("distance", "incentive", "agecat"))
(p1 + p2) / p3

Figure 6.4: Predicted probability that outcome equals 1, conditional on incentive, age categories, and distance. Other variables are held at their means or modes.

We can also set the value of some variables explicitly by setting condition to a named list. For example, to plot the predicted outcome for an individual above 35 years old, who did not receive a monetary incentive, for different values of distance:

plot_predictions(mod, condition = list(
    "distance", "agecat" = ">35", "incentive" = 0
))

6.6.4 Customization

Since the output of plot_predictions() is a ggplot2 object, it is very easy to customize. For example, we can add points for the actual observations of our dataset like so:

plot_predictions(mod, condition = "distance", rug = TRUE) +
    theme_grey() +
    ylim(c(.65, .81)) + xlim(c(2, 4.5))

A more powerful but less convenient way to customize plots is to call the draw=FALSE argument. This will return a data frame with the raw values used to create plots. You can then use these data to create your own plots with base R graphics, ggplot2, or any other plotting functions you like:

plot_predictions(mod, by = "incentive", draw = FALSE)
  incentive  estimate   std.error statistic      p.value  s.value  conf.low
1         0 0.3397746 0.018899194  17.97826 2.884182e-72 237.6507 0.3027328
2         1 0.7908348 0.008620357  91.74038 0.000000e+00      Inf 0.7739393
  conf.high
1 0.3768163
2 0.8077304


  1. A “fitted value” typically refers to the prediction made for one of the rows in the dataset used to fit the model.↩︎

  2. When datagrid() is called as an argument to a marginaleffects function, we can omit the model argument.↩︎

  3. Since the points are tightly clustered, they look like a curve at low resolution.↩︎

  4. This is the default approach taken by software packages like emmeans (Lenth 2024).↩︎

  5. The plot is omitted because, in this particular case, it looks very similar to the one in Figure 6.3↩︎