library(marginaleffects)
dat <- get_dataset("thornton")
mod <- glm(outcome ~ incentive + agecat, data = dat, family = binomial)
5 Predictions
The parameter estimates obtained by fitting a statistical model are rarely the main object of interest in a data analysis. Instead of focusing on those raw estimates, a good starting point is often to compute model-based predictions for different combinations of predictor values. This allows an analyst to report results on a scale that makes intuitive sense to their readers, colleagues, and stakeholders.
What is a model-based prediction? In this book, we consider that
A prediction is the outcome expected by a fitted model for a given combination of predictor values.
This definition is in line with the familiar concept of “fitted value,”1 but it differs from a “forecast” or “out-of-sample prediction” (Hyndman and Athanasopoulos 2018). For our purposes, the word “prediction” need not imply that we hope to forecast the future, or that we are trying to extrapolate to unseen data.
Model-based predictions are often the main quantity of interest in a data analysis. They allow us to answer a wide variety of questions, such as:
- What is the expected probability that a 50-year-old smoker develops heart disease, adjusting for diet, exercise, and family history?
- What is the expected probability that a football team wins a game, considering the team’s recent performance, injuries, and opponent strength?
- What is the expected turnout in municipal elections, accounting for national trends and local demographic characteristics?
- What is the expected price of a three-bedroom house in a suburban area, controlling for floor area and market conditions?
All of these descriptive questions can be answered using model-based predictions. This highlights the fact that predictions are an intrinsically interesting quantity. In chapters 6 and 7, we will see that they are also a fundamental building block to analyze the effects of interventions.
The current chapter illustrates how to compute and report predictions for models estimated with the Thornton (2008) data. We proceed in order, through each component of the conceptual framework laid out in Chapter 3: (1) quantity, (2) predictors, (3) aggregation, (4) uncertainty, and (5) tests. Then, we conclude by showing different ways to visualize predictions.
5.1 Quantity
To begin, it is useful to see how predictions are built in one particular case. Consider a logistic regression model estimated using the Thornton (2008) HIV dataset:
\[ Pr(\text{Outcome}=1) = \Phi \left (\beta_1 + \beta_2 \cdot \text{Incentive} + \beta_3 \cdot \text{Age}_{18-35} + \beta_4 \cdot \text{Age}_{>35} \right ), \tag{5.1}\]
where Outcome is a binary variable equal to 1 if the study participant travelled to the test center to learn their HIV status; Incentive is a binary variable equal to 1 if the participant received a monetary incentive; and the other two predictors are indicators for the age category to which a participant belongs, with omitted category Age\(_{<18}\). The letter \(\Phi\) represents the standard logisitic function \(\Phi(x) = \frac{1}{1 + e^{-x}}\), which ensures that the linear component inside the parentheses of Equation 5.1 gets scaled to the \([0,1]\) interval. This allows the model to respect the natural scale of the binary outcome variable.
We load the marginaleffects
package, read the data, and estimate a logistic regression model using the glm()
function:
The estimated coefficients are:
b <- coef(mod)
b
(Intercept) incentive agecat18 to 35 agecat>35
-0.78232923 1.99229719 0.04368393 0.24780479
For clarity of presentation, we substitute these estimates into the model equation:
\[ Pr(\text{Outcome}=1) = \Phi \left (-0.782 + 1.992 \cdot \text{Incentive} + 0.044 \cdot \text{Age}_{18-35} + 0.248 \cdot \text{Age}_{>35} \right ) \tag{5.2}\]
To make a prediction for a particular individual, we simply plug-in the characteristics of that person into Equation 5.2. For example, the predicted probability that Outcome equals 1 for an 18 to 35 year-old in the treatment group is
\[\begin{align*} Pr(\text{Outcome}=1) = \Phi \left (-0.782 + 1.992 \cdot 1 + 0.044 \cdot 1 + 0.248 \cdot 0 \right )\\ \end{align*}\]
The predicted probability that Outcome equals 1 for someone above 35 years-old in the control group is
\[\begin{align*} Pr(\text{Outcome}=1) = \Phi \left (-0.782 + 1.992 \cdot 0 + 0.044 \cdot 0 + 0.248 \cdot 1 \right ) \end{align*}\]
These expressions can be evaluated using any calculator. First, we compute the part in parentheses. This is the “linear” or “link scale” prediction.
linpred_treatment_younger <- b[1] + b[2] * 1 + b[3] * 1 + b[4] * 0
linpred_treatment_younger
[1] 1.253652
linpred_control_older <- b[1] + b[2] * 0 + b[3] * 0 + b[4] * 1
linpred_control_older
[1] -0.5345244
Link scale predictions from a logit model are expressed on the log odds scale. In this example, they include a negative value and a value greater than one. To many, this will feel incongruous, because the outcome variable is a binary variable, with a probability bounded by 0 and 1. To ensure that our predictions respect the natural scale of the data, we transform the linear component of Equation 5.1 using the logistic function:2
logistic <- \(x) 1 / (1 + exp(-x))
logistic(linpred_treatment_younger)
[1] 0.7779314
logistic(linpred_control_older)
[1] 0.3694623
Our model expects that the probability of seeking information about one’s HIV status is 78% for a young adult who receives a monetary incentive, and 37% for an older participant who does not receive an incentive.
Computing predictions manually is useful for pedagogical purposes, but it is a labour-intensive and error-prone process. The commands above are also limiting, because they only apply to one very specific model.
Instead of manual computation, we can use the predictions()
function from the marginaleffects
package. This function can be applied in consistent fashion across more than 100 different classes of statistical models.
First, we build a data frame of predictor values—a grid—where each row represents a different individual:
grid <- data.frame(agecat = c("18 to 35", ">35"), incentive = c(1, 0))
grid
agecat incentive
1 18 to 35 1
2 >35 0
Then, we call the predictions()
function, using the newdata
argument to specify the predictor values, and the type
argument to set the scale (link or response):
predictions(mod, newdata = grid, type = "link")
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % | agecat | incentive |
---|---|---|---|---|---|---|---|
1.254 | 0.0691 | 18.15 | <0.001 | 1.118 | 1.389 | 18 to 35 | 1 |
-0.535 | 0.1013 | -5.28 | <0.001 | -0.733 | -0.336 | >35 | 0 |
These results are exactly identical to the link scale predictions that we computed manually above. This is reassuring. However, in a logit model, link scale predictions are hard to interpret.
To communicate our results clearly, it is usually best to make predictions on the same scale as the outcome variable. The resulting estimates are easier to interpret, since they can be compared to observed values of the outcome variable in our dataset. For this reason, the default behavior of predictions()
is to return predictions on the response scale:
predictions(mod, newdata = grid)
Estimate | Pr(>|z|) | 2.5 % | 97.5 % | agecat | incentive |
---|---|---|---|---|---|
0.778 | <0.001 | 0.754 | 0.800 | 18 to 35 | 1 |
0.369 | <0.001 | 0.325 | 0.417 | >35 | 0 |
In the rest of this chapter, we show that the marginaleffects
package makes it easy to compute various types of predictions, aggregate, and conduct statistical tests on them.
5.2 Predictors
Predictions are “conditional” quantities, in the sense that they depend on the values of all the predictor variables in a model. To compute a prediction, the analyst must fix all the variables on the right-hand side of the model equation; they must choose a grid (see Section 3.2).
The choice of grid depends on the researcher’s goals. The profiles it holds could correspond to actual observations in the original data, or they could represent unseen, hypothetical, or representative units. To illustrate, let’s consider a slight modification of the model estimated in Section 5.1. In addition to the incentive
and agecat
predictors, we now include a numeric predictor to account for the distance
between a study participant’s home and the test center where they can learn their HIV status:
mod <- glm(outcome ~ incentive + agecat + distance,
data = dat, family = binomial)
With this model, we can make predictions on various grids: empirical, interesting, representative, balanced, or counterfactual.
5.2.1 Empirical grid
By default, the predictions()
function uses the full original dataset as a grid, that is, it uses the empirical distribution of predictors (Section 3.2.1). This means that predictions()
will compute fitted values for each of the rows in the dataset used to fit the model.
p <- predictions(mod)
p
Estimate | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|
0.365 | <0.001 | 0.297 | 0.439 |
0.354 | <0.001 | 0.288 | 0.426 |
0.265 | <0.001 | 0.209 | 0.330 |
0.300 | <0.001 | 0.241 | 0.365 |
0.334 | <0.001 | 0.271 | 0.402 |
0.833 | <0.001 | 0.809 | 0.855 |
0.840 | <0.001 | 0.815 | 0.862 |
0.833 | <0.001 | 0.809 | 0.855 |
0.827 | <0.001 | 0.803 | 0.849 |
0.789 | <0.001 | 0.761 | 0.814 |
The p
object created by predictions()
includes the fitted values for each observation in the dataset, along with test statistics like p values and confidence intervals. p
is a standard data frame, which means that we can manipulate it using the usual R
or Python
commands.
For example, we can check that the data frame includes 2825 and 10 columns:
dim(p)
[1] 2825 10
We can list the available column names:
colnames(p)
[1] "rowid" "estimate" "p.value" "s.value" "conf.low" "conf.high"
[7] "outcome" "incentive" "agecat" "distance"
And we can extract individual columns and cells using the standard $
or []
syntaxes, or using data manipulation packages like dplyr
or data.table
:
p[1:4, "estimate"]
[1] 0.3652679 0.3540617 0.2653133 0.2997423
Users should be mindful of the fact that, by default, the p values held in this data frame correspond to a hypothesis test against a null of zero. In Section 5.5, we will see that it is easy to change this default null using the hypothesis
argument.
5.2.2 Interesting grid
In many cases, analysts are not interested in model-based predictions for each observation in their sample. Instead, they may prefer to construct a customized grid of predictors which includes specific values of scientific or commercial interest (Section 3.2.2).
In marginaleffects
, the main strategy to define custom grids is to use the newdata
argument and the datagrid()
function. This function creates a “typical” dataset with all variables at their means or modes, except for those we explicitly define.
distance agecat incentive rowid
1 2.014541 18 to 35 0 1
2 2.014541 18 to 35 1 2
We can feed this datagrid()
function to the newdata
argument of predictions()
.3
predictions(mod,
newdata = datagrid(agecat = "18 to 35", incentive = c(0, 1))
)
agecat | incentive | Estimate | Pr(>|z|) | 2.5 % | 97.5 % | distance |
---|---|---|---|---|---|---|
18 to 35 | 0 | 0.318 | <0.001 | 0.279 | 0.361 | 2.01 |
18 to 35 | 1 | 0.780 | <0.001 | 0.755 | 0.802 | 2.01 |
This shows that the estimated probability of seeking one’s HIV status is about 32% for a participant who is between 18 and 35 years old, did not receive a monetary incentive, and lives the average distance from the center.
We can also make predictions on a custom grid by supplying functions to datagrid()
. These functions will be applied to the named variables, and the output used to construct the grid.
predictions(mod,
newdata = datagrid(distance = 2, agecat = unique, incentive = max)
)
distance | agecat | incentive | Estimate | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
2 | <18 | 1 | 0.774 | <0.001 | 0.725 | 0.816 |
2 | 18 to 35 | 1 | 0.780 | <0.001 | 0.756 | 0.802 |
2 | >35 | 1 | 0.815 | <0.001 | 0.791 | 0.837 |
5.2.3 Representative grid
Sometimes, analysts do not want fine-grained control over the values of each predictor, but would rather compute predictions for some “representative” individual (Section 3.2.3). For example, we can compute a “Prediction at the Mean,” that is, a prediction for a hypothetical representative individual whose personal caracteristics are exactly average (numeric) or modal (categorical).
To achieve this, we can either set the values of the grid manually in datagrid()
, or we can use the "mean"
shortcut.
predictions(mod, newdata = "mean")
Estimate | Pr(>|z|) | 2.5 % | 97.5 % | incentive | agecat | distance |
---|---|---|---|---|---|---|
0.78 | <0.001 | 0.755 | 0.802 | 1 | 18 to 35 | 2.01 |
Representative grids can be useful in some contexts, but they are not always the best choice. Sometimes there is simply noone in our sample who is exactly average on all relevant dimensions. When this average individual is fictional, the associated prediction may not be scientifically interesting or practically relevant.
5.2.4 Balanced grid
A common strategy in the analysis of experiments is to compute estimates on a “balanced grid” (Section 3.2.4). This type of grid includes one row for each combination of unique values for the categorical (or binary) predictors, holding numeric variables at their means. To achieve this, we can either call datagrid()
or use the "balanced"
shortcut. These two calls are equivalent:
predictions(mod, newdata = datagrid(
agecat = unique, incentive = unique, distance = mean
))
predictions(mod, newdata = "balanced")
Estimate | Pr(>|z|) | 2.5 % | 97.5 % | incentive | agecat | distance |
---|---|---|---|---|---|---|
0.311 | <0.001 | 0.251 | 0.377 | 0 | <18 | 2.01 |
0.318 | <0.001 | 0.279 | 0.361 | 0 | 18 to 35 | 2.01 |
0.367 | <0.001 | 0.322 | 0.415 | 0 | >35 | 2.01 |
0.773 | <0.001 | 0.724 | 0.816 | 1 | <18 | 2.01 |
0.780 | <0.001 | 0.755 | 0.802 | 1 | 18 to 35 | 2.01 |
0.815 | <0.001 | 0.791 | 0.837 | 1 | >35 | 2.01 |
A balanced grid is often used with randomized experiments, when the analyst wishes to give equal weight to each combination of treatment conditions in the calculation of marginal means (Section 5.3).
5.2.5 Counterfactual grid
Yet another set of predictor profiles to consider is the “counterfactual grid.” The predictions made on such a grid allow us to answer questions such as:
What would the predicted outcomes be in our observed sample if everyone had received the treatment, or if everyone had received the control?
To create a counterfactual grid, we duplicate the full dataset, once for every value of the focal variable. For instance, if our dataset includes 2884 rows, and we want to compute predictions for each combination of the incentive
variable (0 and 1), the counterfactual grid will include 5768.
To make predictions on a counterfactual grid, we can call the datagrid()
function, or we can use the variables
argument:
p <- predictions(mod, variables = list(incentive = 0:1))
dim(p)
[1] 5650 11
These predictions are interesting, because they give us a first look at the kinds of counterfactual (potentially causal) queries that we will explore in Chapter 6. We can ask:
For each individual in the Thornton (2008) sample, what is the predicted probability of seeking information about HIV status in the counterfactual worlds where they receive a monetary incentive, and where they do not?
To answer this question, we rearrange the data and plot it:
library(ggplot2)
p <- data.frame(
Control = p[p$incentive == 0, "estimate"],
Treatment = p[p$incentive == 1, "estimate"])
ggplot(p, aes(Control, Treatment)) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
geom_point() +
labs(x = "Pr(outcome=1) when incentive = 0",
y = "Pr(outcome=1) when incentive = 1") +
xlim(0, 1) + ylim(0, 1) + coord_equal()
On this graph, each point represents a single study participant.4 The x-axis shows the predicted probability that Outcome equals 1 for an individual with the same socio-demographic characteristics, in the control group. The y-axis shows the predicted probability that Outcome equals 1 for an individual with the same socio-demographic characteristics, in the treatment group.
Every point is well above the 45 degree line. This means that, for every observed combination of predictor values, for every participant in the study, our model says that changing the incentive
variable from 0 to 1 increases the predicted probability that the person will seek to learn their HIV status.
5.3 Aggregation
Computing predictions for a large grid or for every observation in a dataset is useful, but the results can feel unwieldy. This section makes two principal arguments. First, it often makes sense to compute aggregated statistics, such as the average predicted outcome across the whole dataset, or by subgroups of the data. Second, the grid across which we aggregate can make a big difference to the results.
An “average prediction” is the outcome of a two step process. First, we compute predictions (fitted values) for each row in the original dataset. Then, we take the average of those predictions. This can be done manually by calling the predictions()
function and taking the mean of estimates.
p <- predictions(mod)
mean(p$estimate)
[1] 0.6916814
Alternatively, we can use the avg_predictions()
function, which is a wrapper around predictions()
that computes the average prediction directly.
avg_predictions(mod)
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0.692 | 0.00791 | 87.5 | <0.001 | 0.676 | 0.707 |
This shows that the average predicted probability of seeking information about HIV status, across all the study participants in the Thornton (2008) sample, is about 69%.
Now, imagine we want to check if the average predicted probability of the Outcome variable differs across age categories. To see this, we can make the same function call, but add the by
argument.
avg_predictions(mod, by = "agecat")
agecat | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
<18 | 0.670 | 0.0240 | 27.9 | <0.001 | 0.623 | 0.717 |
18 to 35 | 0.673 | 0.0116 | 58.1 | <0.001 | 0.650 | 0.695 |
>35 | 0.720 | 0.0121 | 59.5 | <0.001 | 0.697 | 0.744 |
The average predicted probability of seeking one’s test result is about 67% for minors, and 72% for those above 35 years old. In Section 5.5 we will formally test if the difference between those two average predictions is statistically significant.
So far, we have taken averages over the empirical distribution of covariates, but analysts are not limited to that grid. One common alternative is to compute “marginal means” by averaging predictions across a balanced grid of predictors.5 This is useful in experimental settings, when the observed sample is not representative of the population, and when we want to marginalize while giving equal weight to each treatment condition.
To compute marginal means, we call the same function, using the newdata
and by
arguments.
avg_predictions(mod, newdata = "balanced", by = "agecat")
agecat | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
<18 | 0.542 | 0.0261 | 20.7 | <0.001 | 0.491 | 0.593 |
18 to 35 | 0.549 | 0.0135 | 40.6 | <0.001 | 0.522 | 0.575 |
>35 | 0.591 | 0.0151 | 39.0 | <0.001 | 0.561 | 0.621 |
Notice that the results are considerably different from the average predictions computed on the empirical grid. Now, the average predicted probability of seeking one’s test result is estimated at 54% for minors, and 59% for those above 35 years old.
What explains this difference is that the balanced grid gives equal weight to each combination of categorical variables, while the empirical grid gives more weight to predictor values that are more frequent in the observed data. In the Thornton (2008) dataset, more participants belonged to the treatment than to the control group:
table(dat$incentive)
0 1
621 2204
Therefore, when we compute an average prediction on the empirical distribution, predicted outcomes in the incentive=1
group are given more weight. This matters, because the average predicted probability that Outcome equals 1 is much higher in the treatment group than in the control group:
avg_predictions(mod, by = "incentive")
incentive | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
0 | 0.340 | 0.01890 | 18.0 | <0.001 | 0.303 | 0.377 |
1 | 0.791 | 0.00862 | 91.7 | <0.001 | 0.774 | 0.808 |
Thus, the group-wise averages for each age categories are smaller when computed over a balanced grid than when they are computed over the empirical distribution.
The last example of aggregation that we consider is done across a “counterfactual grid” (Section 3.2.5). To build such a grid, every observation of the dataset is duplicated, and the incentive
variable is fixed to different values in the different counterfactual datasets. We can achieve this with the datagrid()
function or with the variables
argument.
The following commands yield equivalent results.
avg_predictions(
mod,
by = "incentive",
newdata = datagrid(incentive = c(0, 1), grid_type = "counterfactual"))
avg_predictions(
mod,
variables = list(incentive = c(0, 1)),
by = "incentive")
incentive | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
0 | 0.339 | 0.01888 | 18.0 | <0.001 | 0.302 | 0.376 |
1 | 0.791 | 0.00862 | 91.8 | <0.001 | 0.774 | 0.808 |
This is equivalent to making unit-level predictions on a counterfactual grid, and then averaging those predictions manually.
p <- predictions(
mod,
type = "response",
newdata = datagrid(incentive = 0:1, grid_type = "counterfactual"))
aggregate(estimate ~ incentive, FUN = mean, data = p)
incentive estimate
1 0 0.3390492
2 1 0.7910508
5.4 Uncertainty
As in the rest of the marginaleffects
package, the predictions()
family of functions accepts two arguments to control how to express the uncertainty around our estimates. First, the conf_level
argument controls the size the confidence interval (default: 95%). Second, the vcov
argument allows us to specify the type of standard errors to compute and report.
By default, marginaleffects
functions return “classical” standard errors. The procedure used to justify these estimates of uncertainty typically assumes that errors are independently and identically distributed, and it rules out common empirical patterns like autocorrelation, clustering, or heteroskedasticity.6 Statisticians have developed a variety of techniques to accomodate these situations, to report standard errors that are “robust” or “clustered.”
If we think there are reasons to believe that there is heterogeneity in the variance of our prediction errors—which is almost always the case—users can compute and report heteroskedasticity-consistent (HC) standard errors. There are several version of HC standard errors available, and the different options are documented in the package manual.7 Here, we use the vcov
argument to specify the type of standard errors, and the conf_level
to define the width of the confidence interval.
avg_predictions(mod,
by = "incentive",
vcov = "HC3",
conf_level = .9)
incentive | Estimate | Std. Error | z | Pr(>|z|) | 5.0 % | 95.0 % |
---|---|---|---|---|---|---|
0 | 0.340 | 0.01892 | 18.0 | <0.001 | 0.309 | 0.371 |
1 | 0.791 | 0.00864 | 91.5 | <0.001 | 0.777 | 0.805 |
In the Thornton dataset, we know that participants were recruited from different villages, and our dataset includes a village
variable that records the place of origin of each participant. If the sampling or treatment assignment mechanisms is related to these groups, it may be appropriate to report clustered standard errors (AbaAthImbWoo2022?). To do this, we use the vcov
argument and specify the clustering unit on the right-hand side of the tilde (~
) sign in a formula.
avg_predictions(mod,
by = "incentive",
vcov = ~ village,
conf_level = .9)
incentive | Estimate | Std. Error | z | Pr(>|z|) | 5.0 % | 95.0 % |
---|---|---|---|---|---|---|
0 | 0.340 | 0.0235 | 14.5 | <0.001 | 0.301 | 0.378 |
1 | 0.791 | 0.0102 | 77.6 | <0.001 | 0.774 | 0.808 |
The last class of uncertainty estimates that we consider relies on resampling or simulation: bootstrap, simulation-based inference, or conformal inference. Chapter Chapter 13 devotes substantial space to explain each of these approaches. For now, it suffices to point out that these methods are implemented in the inferences()
function of the marginaleffects
package.
avg_predictions(mod,
by = "incentive",
conf_level = .9) |>
inferences(method = "boot", R = 1000)
incentive | Estimate | Std. Error | 5.0 % | 95.0 % |
---|---|---|---|---|
0 | 0.340 | 0.01891 | 0.310 | 0.371 |
1 | 0.791 | 0.00851 | 0.777 | 0.805 |
Notice that the intervals are all slightly different, but still remain in the same ballpark.
5.5 Test
Above, we computed average predictions by age subgroups, and noted that there appeared to be differences in the likelihood that younger and older people would seek their HIV test results. That observation was only based on the point estimates of the average predictions, and did not rely on a statistical test. Now, let’s consider how analysts can compare predictions more formally.
5.5.1 Null hypothesis tests
To begin, we compute the average predicted outcome for each age subgroup:
p <- avg_predictions(mod, by = "agecat")
p
agecat | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
<18 | 0.670 | 0.0240 | 27.9 | <0.001 | 0.623 | 0.717 |
18 to 35 | 0.673 | 0.0116 | 58.1 | <0.001 | 0.650 | 0.695 |
>35 | 0.720 | 0.0121 | 59.5 | <0.001 | 0.697 | 0.744 |
The average predicted outcome is 67% for young adults and 72% for participants above 35 years old. The difference between these two averages is:
p$estimate[3] - p$estimate[2]
[1] 0.04782993
To see if this risk difference is statistically significant, we can use the hypothesis
argument, as we did in Chapter 4.
p <- avg_predictions(mod, by = "agecat",
hypothesis = "b3 - b2 = 0")
p
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
b3-b2=0 | 0.0478 | 0.0167 | 2.86 | 0.00428 | 0.015 | 0.0806 |
The estimated difference between the 3rd and 2nd groups is about 5 percentage points, and the p value associated with this estimate is 0.004. This crosses the conventional (but arbitrary) statistical significance threshold of \(p=0.05\). Accordingly, many analysts would reject the null hypothesis that the average predicted probability of seeking one’s HIV test results is the same in the 18 to 35 and >35 groups.
In many cases, we would like to make more than one comparison at a time. For instance, imagine an experiment with three groups: placebo, low dose, high dose. In that context, we may want to compare each treatment group to the baseline or “reference” placebo group. Alternatively, we may want to do “sequential” comparisons, comparing one group to the one that directly precedes it. This can be helpful when there is a clear logical order between categories, and when we aree interested in the progression across groups.
In marginaleffects
, this can be done easily by supplying a formula to the hypothesis
argument. On the left-hand side, we set the comparison function (difference, ratio, etc.). On the right hand side, we specify which estimates to compare to one another (sequential, reference, etc.). Here, we choose sequential comparisons: 2nd level vs. 1st level, 3rd level vs. 2nd level, and so on.
p <- avg_predictions(mod, by = "agecat",
hypothesis = difference ~ sequential)
p
Hypothesis | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
(18 to 35) - (<18) | 0.00274 | 0.0267 | 0.103 | 0.91810 | -0.0495 | 0.0550 |
(>35) - (18 to 35) | 0.04783 | 0.0167 | 2.856 | 0.00428 | 0.0150 | 0.0806 |
Hypothesis | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
(18 to 35) - (<18) | 0.00274 | 0.0267 | 0.103 | 0.91810 | -0.0495 | 0.0550 |
(>35) - (18 to 35) | 0.04783 | 0.0167 | 2.856 | 0.00428 | 0.0150 | 0.0806 |
By modifying the formula, we can compare each group to the “reference” or baseline category, that is, participants under 18 years old.
p <- avg_predictions(mod, by = "agecat",
hypothesis = difference ~ reference)
p
Hypothesis | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
(18 to 35) - (<18) | 0.00274 | 0.0267 | 0.103 | 0.9181 | -0.04952 | 0.055 |
(>35) - (<18) | 0.05057 | 0.0269 | 1.880 | 0.0601 | -0.00215 | 0.103 |
The hypothesis
argument also allows us to specify hypothesis groups by subgroups. For example, consider this command, which computes average predicted outcomes for each observed combination of incentive
and agecat
.
avg_predictions(mod,
by = c("incentive", "agecat"))
incentive | agecat | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
0 | <18 | 0.312 | 0.0321 | 9.72 | <0.001 | 0.249 | 0.375 |
0 | 18 to 35 | 0.324 | 0.0209 | 15.46 | <0.001 | 0.283 | 0.365 |
0 | >35 | 0.370 | 0.0235 | 15.74 | <0.001 | 0.324 | 0.416 |
1 | <18 | 0.771 | 0.0234 | 32.99 | <0.001 | 0.725 | 0.816 |
1 | 18 to 35 | 0.778 | 0.0119 | 65.43 | <0.001 | 0.755 | 0.801 |
1 | >35 | 0.811 | 0.0117 | 69.04 | <0.001 | 0.788 | 0.834 |
We can use the hypothesis
argument in similar fashion as before, but add a vertical bar to specify that we want to compute sequential risk differences within subgroups.
p <- avg_predictions(mod,
by = c("incentive", "agecat"),
hypothesis = difference ~ sequential | incentive)
p
Hypothesis | incentive | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
(18 to 35) - (<18) | 0 | 0.01111 | 0.0311 | 0.357 | 0.7213 | -0.04994 | 0.0722 |
(>35) - (18 to 35) | 0 | 0.04605 | 0.0216 | 2.131 | 0.0331 | 0.00369 | 0.0884 |
(18 to 35) - (<18) | 1 | 0.00717 | 0.0254 | 0.283 | 0.7775 | -0.04258 | 0.0569 |
(>35) - (18 to 35) | 1 | 0.03330 | 0.0154 | 2.156 | 0.0311 | 0.00302 | 0.0636 |
This shows that, in the control group (incentive=0
), the difference between the average predicted outcome for participants over 35 and for those between 18 and 35 is about 5 percentage points. However, in the treatment group (incentive=1
), this difference is about 3 percentage points. Both of these differences are associated with relatively large \(z\) statistics, and are thus statistically distinguishable from zero.
5.5.2 Equivalence tests
Flipping the logic around, the analyst could run an equivalence test to determine if the difference between average predicted outcomes in the two subgroups is small enough to be considered negligible (Section 4.2). Imagine that, for domain-specific reasons, a risk difference smaller than 10 percentage points is considered “uninteresting,” “negligible,” or “equivalent to zero”. All we need to do is call the same function with the equivalence
argument and the \([-0.1,0.1]\) interval of practical equivalence:
avg_predictions(mod,
by = "agecat",
hypothesis = "b3 - b1 = 0",
equivalence = c(-0.1, 0.1))
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % | p (NonSup) | p (NonInf) | p (Equiv) |
---|---|---|---|---|---|---|---|---|---|
b3-b1=0 | 0.0506 | 0.0269 | 1.88 | 0.0601 | -0.00215 | 0.103 | 0.0331 | <0.001 | 0.0331 |
The p value associated with our test of equivalence is small. This suggests that we can reject the null hypothesis that the difference lies outside the interval of practical equivalence. The difference is thus likely to be small enough to be ignored.
5.6 Visualization
In many cases, data analysts will want to visualize (potentially aggregated) predictions rather than report raw numeric values. This is easy to do with the plot_predictions()
function, which a similar syntax that closely parallels that of the other functions in the marginaleffects
package.
5.6.1 Unit predictions
As discussed in Chapter 3, the quantities derived from statistical models—predictions, counterfactual comparisons, and slopes—are typically conditional, in the sense that they depend on the values of all covariates in the model. This implies that each unit in our sample will be associated with its own prediction (fitted value) or effect estimate. In Avoiding One-Number Summaries, Harrell (2021) argues that data analysts should avoid the temptation to summarize these individual-level estimates. Rather, Harrell argues, they should display the full distribution of estimates to convey a sense of the heterogeneity of our quantity of interest across different combinations of predictor values.
Histograms and Empirical Cumulative Distribution Function (ECDF) plots are two common ways to visualize such a distribution. Since the output generated by the predictions()
function is a standard data frame, it is easy to feed that object to any plotting function in R
or Python
, in order to craft good-looking visualizations.
library(patchwork)
p <- predictions(mod)
# Histogram
p1 <- ggplot(p) +
geom_histogram(aes(estimate, fill = factor(incentive))) +
labs(x = "Pr(outcome = 1)", y = "Count", fill = "Incentive") +
scale_fill_grey()
# Empirical Cumulative Distribution Function
p2 <- ggplot(p) +
stat_ecdf(aes(estimate, colour = factor(incentive))) +
labs(x = "Pr(outcome = 1)",
y = "Cumulative Probability",
colour = "Incentive") +
scale_colour_grey()
p1 + p2
The left side of Figure 5.2 presents a histogram showing the distribution of predicted probabilities that individual study participants choose to travel to the test center in order to learn their HIV status. As usual, the x-axis represents the range of predicted outcomes, while the y-axis shows the number of study participants each bin. By assigning different colors to the bins based on the treatment arm (incentive
equal 0 or 1), we highlight one of the key features of the distribution: predicted outcomes for people in the treatment group tend to be much higher than predicted outcomes for people in the control group. Indeed, the distribution of outcome
probabilities without an incentive (orange) is concentrated between 0.2 and 0.4, indicating a low probability of travelling to the test center. In contrast, the distribution of predicted outcome
for participants who received a monetary incentive (blue) is concentrated between 0.6 and 0.8. This suggests that those who received an incentive are considerably more likely to seek their test results.
The right side of Figure 5.2 presents an ECDF plot. Again, the x-axis represents the range of predicted outcomes. This time, however, the y-axis indicates the cumulative probability, which is the proportion of data points that are less than or equal to a specific value. For any given value on the x-axis, the height of the curve indicates the proportion of data points that are less than or equal to that value. For example, at 0.3 on the x-axis, we see that the incentive=0
line is close to 0.25. This suggests that about 25% of the participants in our sample have predicted outcome
smaller than 30%. When the ECDF curve is steep, we know that a lot of the our data is concentrated in that part of the distribution. With this in mind, we see clearly that many of our predicted outcome
values are clustered near 0.3 in the control group, and near 0.8 in the treatment group.
5.6.2 Marginal predictions
The first approach to display “marginal” predictions using the by
argument. The underlying process is to (1) compute predictions for each observation in the actually observed dataset, and then (2) average these predictions across some variable(s). This is equivalent to plotting the results of calling avg-predictions()
using the by
argument.
For example, if we want to compute the average predicted probability that outcome
equals 1, by subgroup, we call:
avg_predictions(mod, by = "incentive")
incentive | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
0 | 0.340 | 0.01890 | 18.0 | <0.001 | 0.303 | 0.377 |
1 | 0.791 | 0.00862 | 91.7 | <0.001 | 0.774 | 0.808 |
We plot the same results using the plot_predictions()
function:
p1 <- plot_predictions(mod, by = "incentive")
p2 <- plot_predictions(mod, by = c("incentive", "agecat"))
p1 + p2
Note that that the plot_predictions()
function also accepts a newdata
argument. This means that we can, for example, plot marginal means constructed by marginalizing across a balanced grid of predictors:8
plot_predictions(mod, by = "incentive", newdata = "balanced", draw = FALSE)
5.6.3 Conditional predictions
In some contexts, plotting marginal predictions may not be appropriate. For instance, when one of the predictors of interest is continuous, there are many predictors, or much heterogeneity, the commands presented in the previous section may generate jagged plots which are difficult to read. In such cases, it can be useful to plot “conditional” predictions instead. In this context, the word “conditional” means that we are computing predictions, conditional on the values of the predictors in a constructed grid of “representative” values. However, unlike in the previous section, we do not average over several predictions before displaying the estimates. We fix the grid and display the predictions made for that grid immediately.
The condition
argument of the plot_predictions()
function does just that: Build a grid of representative predictor values, compute predictions for each combination of predictor values, and plot the results. In the following examples, we fix one or more predictor to its unique values (categorical) or to an equally spaced grid from minimum to maximum (continuous). The other predictors in the model are held to their mean or model.
p1 <- plot_predictions(mod, condition = "distance")
p2 <- plot_predictions(mod, condition = c("distance", "incentive"))
p3 <- plot_predictions(mod, condition = c("distance", "incentive", "agecat"))
(p1 + p2) / p3
We can also set the value of some variables explicitly by setting condition
to a named list. For example, to plot the predicted outcome
for an individual above 35 years old, who did not receive a monetary incentive, for different values of distance:
plot_predictions(mod, condition = list(
"distance", "agecat" = ">35", "incentive" = 0
))
5.6.4 Customization
Since the output of plot_predictions()
is a ggplot2
object, it is very easy to customize. For example, we can add points for the actual observations of our dataset like so:
plot_predictions(mod, condition = "distance", rug = TRUE) +
theme_grey() +
ylim(c(.65, .81)) + xlim(c(2, 4.5))
A more powerful but less convenient way to customize plots is to call the draw=FALSE
argument. This will return a data frame with the raw values used to create plots. You can then use these data to create your own plots with base
R
graphics, ggplot2
, or any other plotting functions you like:
plot_predictions(mod, by = "incentive", draw = FALSE)
incentive estimate std.error statistic p.value s.value conf.low
1 0 0.3397746 0.018899194 17.97826 2.884182e-72 237.6507 0.3027328
2 1 0.7908348 0.008620357 91.74038 0.000000e+00 Inf 0.7739393
conf.high
1 0.3768163
2 0.8077304
5.7 Summary
This chapter defined a “prediction” as the outcome expected by a fitted model for a given combination of predictor values. A prediction is a useful descriptive quantity: an “expectation” or “best guess” for different individuals, units, or subgroups of the population of interest.
The predictions()
function from the marginaleffects
package computes predictions for a wide range of models. avg_predictions()
aggregates unit-level predictions. plot_predictions()
displays predictions visually. The main arguments of these functions are summarized in Table 5.1.
predictions()
, avg_predictions()
, and plot_predictions()
functions.
Argument | |
---|---|
model |
Fitted model used to make predictions. |
newdata |
Grid of predictor values. |
variables |
Make predictions on a counterfactual grid. |
vcov |
Standard errors: Classical, robust, clustered, etc. |
conf_level |
Size of the confidence intervals. |
type |
Type of predictions: response, link, etc. |
by |
Grouping variable for average predictons. |
byfun |
Custom aggregation functions. |
wts |
Weights used to compute average predictions. |
transform |
Post-hoc transformations. |
hypothesis |
Compare predictions to one another, conduct linear or non-linear hypothesis tests, or specify null hypotheses. |
equivalence |
Equivalence tests |
df |
Degrees of freedom for hypothesis tests. |
numderiv |
Algorithm used to compute numerical derivatives. |
... |
Additional arguments are passed to the predict() method supplied by the modelling package. |
To clearly define predictions and attendant tests, analysts must make five decisions.
First, the Quantity.
- Predictions can be computed on different scales, depending on the type of statistical model. In generalized linear models (GLM), for example, one can make predictions on the “link” or “response” scales. In most cases, analysts should report predictions on the same scale as the outcome variable, since this is most natural for readers.
- The scale of predictions is controlled by the
type
argument.
Second, the Predictors.
- Predictions are conditional quantities, that is, they depend on the values of all the predictors in a model.
- Analysts can make predictions for different combinations of predictor values—or grids: empirical, interesting, representative, balanced, or counterfactual.
- The predictor grid is defined by the
newdata
argument and thedatagrid()
function.
Third, the Aggregation.
- To simplify the presentation of results, it often makes sense to report average predictions. Analysts can choose between different aggregation schemes:
- Unit-level predictions (no aggregation)
- Average predictions
- Average predictions by subgroup
- Weighted average of predictions
- Predictions can be aggregated using the
avg_predictions()
function and theby
argument.
Fourth, the Uncertainty.
- In
marginaleffects
, thevcov
argument allows analysts to report classical, robust, or clustered standard errors around predictions. - The
inferences()
function can compute uncertainty intervals via bootstrapping, simulation-based inference, or conformal prediction.
Fifth, the Test.
- A null hypothesis test aims to determine if a prediction (or a function of predictions) is different from a null hypothesis value. For example, an analyst may wish to check if two predictions are different from one another. Null hypothesis tests can be conducted using the
hypothesis
argument. - An equivalence test aims to determine if a prediction (or a function of predictions) is similar to a reference value. Equivalence tests can be conducted using the
equivalence
argument.
A “fitted value” typically refers to the prediction made for one of the rows in the dataset used to fit the model.↩︎
Instead of defining our own
logistic()
function, we could use built-in theplogis()
function inR
, or thescipy.stats.logistic.cdf()
inPython
.↩︎When
datagrid()
is called as an argument to amarginaleffects
function, we can omit themodel
argument.↩︎Since the points are tightly clustered, they look like a curve at low resolution.↩︎
This is the default approach taken by software packages like
emmeans
(Lenth 2024).↩︎Autocorrelation often arises in time series estimation when the prediction errors we make for a unit of observation at time \(t\) are correlated to the prediction errors we make for the same unit at time \(t+1\). Clustering sometimes occurs when errors are similar for study participants drawn from well-defined groups (ex: classrooms, neighborhoods, or provinces). Heteroskedasticity arises when the variance of the errors is inconsistent across observations, such as when our model makes bigger prediction errors (in either direction) for some kinds of observations.↩︎
Under the hood,
marginaleffects
uses the excellentsandwich
package to compute robust standard errors.↩︎The plot is omitted because, in this particular case, it looks very similar to the one in Figure 5.3↩︎