The conceptual framework introduced in this book builds on the idea that “a parameter is just a resting stone on the way to prediction.” This chapter shows that predictions can themselves act as a springboard for further analysis. In particular, we will see how model-based predictions can be transformed into “counterfactual comparisons” that allow us to measure the strength of association between two variables, or to quantify the effects of a cause.
Counterfactual thinking is fundamental to scientific inquiry and data analysis. Indeed, many of our most important research questions can be expressed as comparisons between hypothetical worlds. For example,
Would survival be more likely if patients received a new medication rather than a placebo?
Would standardized test scores be higher if class sizes were smaller in high schools?
Do conservation policies improve forest coverage?
Does participating in a microfinance program increase household income?
To answer each of those questions, the researcher must conduct a thought experiment. They need to compare two different “worlds”: one where the intervention occurred, and one where it did not. In that spirit, we can define a broad class of quantities of interest as follows:
A counterfactual comparison is a function of two or more model-based predictions, made with different predictor values.
This definition is intimately linked to the theories of causal inference surveyed in Section 1.1.3. Indeed, a natural way to estimate the effect of an intervention is to use a statistical model to make predictions in two counterfactual worlds—for different predictor values—and to compare those predictions. When the conditions for causal identification are satisfied (see Section 1.1.3), our counterfactual comparison can be interpreted as a measure of the “effect” of \(X\) on \(Y\). When the conditions for causal identification are not satisfied by the model, counterfactual comparisons can still be useful as a descriptive measure of the strength of association between two variables.
In this chapter, we will see that a vast array of statistical quantities of interest can be expressed as functions of two (or more) predictions: contrasts, risk differences, ratios, odds, lift, etc. Moreover, by averaging counterfactual comparisons across different grids, we can compute standard estimands such as the average treatment effect or the average treatment effect on the treated.
Using the marginaleffects package, we will be able to compute all of those quantities easily, and we will also see how to do comparisons between comparisons, in order to answer more complex questions about interactions or treatment effect heterogeneity.
6.1 Quantity
The concept of “counterfactual comparison” was defined as a function of two or more model-based predictions, made with different predictor values. To operationalize this quantity, the analyst must decide (1) what change in predictors they are interested in, and (2) what function to use to compare counterfactual predicted outcomes.
Consider a simple case where the analyst fits a statistical model with outcome \(Y\) and predictor \(X\). They use the parameter estimates to compute predictions. When the variable \(X\) is set to a specific value \(x\), we denote the prediction of the model \(\hat{Y}_{X=x}\).
An analyst who is interested in model description, data description, or causal inference (Section 1.1) may want to estimate how the predicted outcome \(\hat{Y}_{X}\) changes when we manipulate the predictor \(X\). For example, one could ask: what is the effect of an increase of 1 unit, of an increase of one standard deviation, or of a change between two specific values \(a\) and \(b\), on the predicted outcome?
\[\begin{align*}
\hat{Y}_{X=x+1} - \hat{Y}_{X=x} && \text{Increase of one unit}\\
\hat{Y}_{X=x+\sigma_X} - \hat{Y}_{X=x} && \text{Increase of one standard deviation}\\
\hat{Y}_{X=\max(X)} - \hat{Y}_{X=\min(X)} && \text{Increase from minimum to maximum}\\
\hat{Y}_{X=b} - \hat{Y}_{X=a} && \text{Change between specific values $a$ and $b$}
\end{align*}\]
In each of those examples, we calculate the difference between two predicted outcomes, evaluated for different values of the focal predictor \(X\). These counterfactual comparisons were all taken to be simple differences, but we can compare predictions using other functions. Common ones include:
\[\begin{align*}
\hat{Y}_{X=b} - \hat{Y}_{X=a} && \text{Difference}\\
\hat{Y}_{X=b} \big/ \hat{Y}_{X=a} && \text{Ratio}\\
\left(\hat{Y}_{X=b} - \hat{Y}_{X=a}\right) \big/ \hat{Y}_{X=a} && \text{Lift}
\end{align*}\]
If the predicted outcome \(\hat{Y}_X\) is a probability, the difference defined above becomes a “risk difference,” the ratio a “risk ratio,” etc. We can define more complex comparisons like the odds ratio, as \(\frac{\hat{Y}_{X=b}}{1 - \hat{Y}_{X=b}} \bigg/ \frac{\hat{Y}_{X=a}}{1 - \hat{Y}_{X=a}}\).
If there is more than one predictor in the statistical model, we can also define “cross-comparisons,” where two variables change simultaneously: \(X\) changes from \(a\) to \(b\), and \(Z\) simultaneously changes from \(c\) to \(d\):
\[\hat{Y}_{X=b,Z=d} - \hat{Y}_{X=a,Z=c}\]
This kind of cross-comparison can be useful when trying to assess the effect of simultaneous changes in multiple predictors. For example, what happens to the predicted outcome if a patient starts taking a medication and doing exercise at the same time?
In the rest of this section, we will illustrate how to compute and interpret all of these quantities using the marginaleffects package, in the context of the Thornton (2008) study introduced in Chapter 4. We fit a logistic regression model in which the outcome is the probability that a participant will seek to learn their HIV status, and the randomized treatment is whether the participant received a monetary incentive. To make the specification more flexible and improve precision, we interact the treatment indicator (incentive) with two predictors: the age category to which participants belong (agecat) and their distance from the test center.
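A minimal sketch of this specification, assuming the Thornton (2008) data are loaded in a data frame called dat with columns outcome, incentive, agecat, and distance:

```r
library(marginaleffects)

# Assumed: `dat` holds the Thornton (2008) data
mod <- glm(
  outcome ~ incentive * (agecat + distance),
  data = dat,
  family = binomial)
```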
This model is relatively simple. Still, because it includes interactions and a non-linear link function, interpreting raw model coefficients is difficult for most analysts, and essentially impossible for lay people.
Counterfactual comparisons are a compelling alternative to coefficient estimates, with many advantages for interpretation. First, counterfactual comparisons can be expressed directly on the scale of the outcome variable, rather than as complex functions like log-odds ratios. Second, counterfactual comparisons map directly onto what many people have in mind when they think of the effect of a treatment: what change can we expect in the outcome when the treatment changes? Finally, as the following sections show, the marginaleffects package makes it trivial to compute counterfactual comparisons. Data analysts can embrace the same workflow in a model-agnostic fashion, applying similar post-estimation steps to interpret their results (almost) regardless of the statistical model they choose to estimate.
6.1.1 First steps: Risk difference with a binary treatment
To begin, let’s consider a simple estimand: the risk difference associated with a change in a binary treatment. Specifically, we will use the Thornton (2008) data to estimate the expected change in outcome when a participant moves from the control group to the treatment group, that is, when they move from 0 to 1 on the incentive variable.
An important factor to consider, when estimating such a quantity, is that counterfactual comparisons are conditional quantities. Except in the simplest cases, comparisons will depend on the values of all the predictors in a model. Each individual in a dataset (i.e., each predictor profile) may be associated with a different counterfactual comparison. Therefore, whenever the analyst computes a counterfactual comparison, they must explicitly define the predictor values of interest. Section 6.2 explores different ways to define the grid of predictor profiles to use for this purpose. For now, it suffices to focus on a single individual with typical (average or modal) characteristics.
```r
grid <- data.frame(distance = 2, agecat = "18 to 35", incentive = 1)
grid
```
distance agecat incentive
1 2 18 to 35 1
Our goal is to estimate the effect of a binary predictor—whether participants received a monetary incentive—on the predicted probability that they will seek to learn their HIV status.
\[\hat{Y}_{i=1,d=2,a=\text{18 to 35}} - \hat{Y}_{i=0,d=2,a=\text{18 to 35}}\]
To do this, we can compare model-based predicted outcomes with and without the incentive, holding all other unit characteristics constant. Using simple base R commands, we manipulate the grid, make predictions, and compare those predictions.
```r
# Grids of predictor values
g_treatment <- transform(grid, incentive = 1)
g_control <- transform(grid, incentive = 0)

# Counterfactual predictions
p_treatment <- predict(mod, newdata = g_treatment, type = "response")
p_control <- predict(mod, newdata = g_control, type = "response")

# Counterfactual comparison
p_treatment - p_control
```
[1] 0.465402
The same estimate can be obtained more easily, along with standard errors and test statistics, using the comparisons() function from the marginaleffects package:
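For instance, a sketch reusing the grid defined above:

```r
comparisons(mod, variables = "incentive", newdata = grid)
```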
Our model suggests that for a participant with typical characteristics, moving from the control group to the treatment group increases the predicted probability that outcome equals one by \(0.465\times 100=46.5\) percentage points.
6.1.2 Comparison functions
So far, we have measured the effect of a change in predictor solely by looking at differences in predicted outcomes. Differences are typically the best starting point for interpretation, because they are simple and easy to grasp intuitively. Nevertheless, in some contexts it can make sense to use different functions to compare counterfactual predictions, such as ratios, lift, or odds ratios.
To compute the ratio of predicted outcomes associated with a change in incentive, we use the comparison="ratio" argument:
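A sketch of the call, reusing the same grid; hypothesis=1 sets the null value for the test discussed below:

```r
comparisons(mod,
  variables = "incentive",
  comparison = "ratio",
  hypothesis = 1,
  newdata = grid)
```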
The predicted outcome is nearly 4.5 times as large for a participant in the treatment group, who is between 18 and 35 years old and lives at a distance of 2 from the test center, as for a participant in the control group with the same socio-demographic characteristics. Note that, in the code above, we set hypothesis=1 to test against the null hypothesis that these two predicted probabilities are identical (i.e., a ratio of 1). The standard error is small and the \(z\) statistic large. We can therefore reject the null hypothesis that the ratio between the predicted probabilities that someone in the treatment group (incentive=1) and someone in the control group (incentive=0) will get their test results is 1.
To compute the lift, we would proceed in the same way, setting comparison="lift":
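For example, a sketch:

```r
comparisons(mod,
  variables = "incentive",
  comparison = "lift",
  newdata = grid)
```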
Finally, it is useful to note that the comparison argument accepts arbitrary R or Python functions. This is an extremely powerful feature, as it allows analysts to specify fully customized comparisons between a hi prediction (e.g., treatment) and a lo prediction (e.g., control). To illustrate, we compute a log odds ratio based on average predictions:1
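One possible sketch: the custom function receives two vectors of counterfactual predictions (hi and lo) and returns the log of the odds ratio computed from their averages:

```r
comparisons(mod,
  variables = "incentive",
  comparison = function(hi, lo) {
    # Odds of the average predictions in each counterfactual world
    odds_hi <- mean(hi) / (1 - mean(hi))
    odds_lo <- mean(lo) / (1 - mean(lo))
    # Log odds ratio of the averages (see footnote 1)
    log(odds_hi / odds_lo)
  })
```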
6.2 Grid
The predictors in a model can be divided into two categories: focal and adjustment variables. Focal variables are the key predictors in a counterfactual analysis. They are the variables whose effect on (or association with) the outcome we wish to quantify. In contrast, adjustment—or control—variables are incidental to the principal analysis. They can be included in a model to increase its flexibility, improve fit, meet a conditional independence assumption, or check if treatment effects vary across subgroups of the population. However, the effect of an adjustment variable is not, in and of itself, of interest in a counterfactual analysis.2
Counterfactual comparisons are conditional quantities, in the sense that they typically depend on the values of all the predictors in a model. Therefore, when we compute a comparison, we need to decide where to evaluate it in the predictor space. We need to decide what values to assign to the focal and adjustment variables.
6.2.1 Focal variables
When estimating counterfactual comparisons, our goal is to determine what happens to the predicted outcome when one or more focal predictors change. Obviously, the kind of change we are interested in depends on the nature of the focal predictors. Let’s consider four common cases in turn: binary, categorical, numeric, and cross-comparisons.
To illustrate each of these cases, we will treat each of the predictors in our model as a focal variable in turn (incentive, agecat, and distance). Note, however, that in typical real-world applications, there is usually one focal variable per statistical model.3
6.2.1.1 Change in binary predictors
By default, when the focal predictor is binary, marginaleffects returns the value of the difference in predicted outcomes associated with a change from the control (0) to the treatment (1) group:
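A sketch, reusing the grid of typical characteristics defined in Section 6.1.1:

```r
comparisons(mod, variables = "incentive", newdata = grid)
```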
Moving from the control to the treatment group on the incentive variable is associated with a change of about 47 percentage points in the predicted probability that outcome equals 1.
6.2.1.2 Change in categorical predictors
The same approach can be used when we are interested in changes in a categorical variable with multiple levels. For example, if we want to know how changes in the agecat variable would affect the predicted probability that outcome equals 1, we simply type:
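```r
comparisons(mod, variables = "agecat", newdata = grid)
```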
We find that moving from the <18 to the 18 to 35 age bracket increases the predicted probability that outcome equals 1 by 0.4 percentage points. Moving from the <18 to the >35 age bracket increases the predicted probability by 0.4 percentage points.
By default, the comparisons() function returns comparisons between every level of the categorical predictor and its “reference” or first level, here: <18. We can modify the variables argument to compare specific categories, or to compare each category to its preceding level sequentially (<18 to 18 to 35, and 18 to 35 to >35):
```r
# Specific comparison
comparisons(mod, variables = list("agecat" = c("18 to 35", ">35")))

# Sequential comparisons
comparisons(mod, variables = list("agecat" = "sequential"))
```
6.2.1.3 Change in numeric predictors
When the focal predictor is numeric, we can once again follow the same steps and call:
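```r
comparisons(mod, variables = "distance", newdata = grid)
```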
This result says that increasing the distance variable by 1 unit changes the predicted value of outcome by -2.9 percentage points. Importantly, this represents a 1 unit increase in distance from the value of distance in the predictor grid: 2.00.
One may be interested in different magnitudes of change in the focal predictor distance. For example, the effect of a 5 unit (or 1 standard deviation) increase in distance on the predicted value of the outcome. Alternatively, the analyst may want to assess the effect of a change between two specific values of distance, or across the interquartile (or full) range of the data. All of these options are easy to implement using the variables argument:
```r
# Increase of 5 units
comparisons(mod, variables = list("distance" = 5))

# Increase of 1 standard deviation
comparisons(mod, variables = list("distance" = "sd"))

# Change between specific values
comparisons(mod, variables = list("distance" = c(0, 3)))

# Change across the interquartile range
comparisons(mod, variables = list("distance" = "iqr"))

# Change across the full range
comparisons(mod, variables = list("distance" = "minmax"))
```
6.2.1.4 Cross-comparisons
Sometimes, an analyst wants to assess the joint or combined effect of manipulating two predictors. In a medical study, for example, we may be interested in the change in survival rates for people who both receive a new treatment and make a dietary change. In our running example, it may be interesting to know how much the probability of outcome would change if we modified both the distance and incentive variables simultaneously. To check this, we use the cross argument:
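A sketch; the newdata value is an assumption carried over from the examples above:

```r
comparisons(mod,
  variables = c("incentive", "distance"),
  cross = TRUE,
  newdata = grid)
```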
These results show that a simultaneous increase of 1 unit in the distance variable and a change from 0 to 1 on the incentive variable is associated with a change of 0.437 in the predicted outcome.
6.2.2 Adjustment variables
In a typical counterfactual analysis, the researcher is not interested in a change in the adjustment variables themselves. Nevertheless, since the value of a counterfactual comparison depends on where it is evaluated in the predictor space, it is important to specify the values of all the predictors in a grid.
Much like in Chapter 5, where we computed predictions for different individual profiles, we now estimate counterfactual comparisons based on empirical, interesting, representative, and balanced grids.
6.2.2.1 Empirical distribution
By default, the comparisons() function returns estimates for every single row of the original data frame which was used to fit the model. The Thornton (2008) dataset includes 2825 complete observations (after dropping missing data), so this command will yield 2825 estimates:
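For example:

```r
cmp <- comparisons(mod, variables = "incentive")
nrow(cmp)
```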
If we do not specify the variables argument, comparisons() computes distinct differences for all the variables. Here, there are 4 possible differences, so we get \(4 \times 2825=11300\) rows:
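```r
cmp <- comparisons(mod)
nrow(cmp)
```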
The x-axes in Figure 6.1 show the estimated effects of changes in the predictors on the predicted value of the outcome. The y-axes show the prevalence of each estimate across the full sample. There appears to be considerable heterogeneity.
For example, consider the right-most panel, which shows the distribution of unit-level contrasts for the incentive variable. This panel shows that for some participants, the model predicts that moving from the control to the treatment condition would increase the predicted probability that outcome equals one by about 0.5 points. For others, the estimated effect of this change can be as low as 0.4.
6.2.2.2 Interesting grid
In some contexts, we want to estimate a contrast for a specific individual with characteristics of interest. To achieve this, we can supply a data frame to the newdata argument.
The code below shows the expected change in the predicted probability of the outcome associated with a change in incentive, for a few individuals with interesting characteristics. As in Chapter 5, we use the datagrid() function as a convenient mechanism to create a grid of profiles of interest:
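A sketch, where the chosen profiles are illustrative assumptions:

```r
comparisons(mod,
  variables = "incentive",
  newdata = datagrid(agecat = c("18 to 35", ">35"), distance = 2))
```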
Notice that the estimated effects of incentive on the predicted probability that outcome=1 differ depending on the age bin to which a study participant belongs. Indeed, our model estimates that the difference between treatment and control would be about 46.5 percentage points in the middle age bin, but 43.8 percentage points in the older age bin.
6.2.2.3 Representative grids
Yet another common alternative is to compute a comparison or risk difference “at the mean.” The idea is to create a “representative” or “synthetic” profile for an individual whose characteristics are completely average or modal. Then, we report the comparison for this specific hypothetical individual. For example:
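Setting newdata = "mean" builds a one-row grid in which numeric predictors are held at their means and categorical predictors at their modes:

```r
comparisons(mod, variables = "incentive", newdata = "mean")
```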
The main advantage of this approach is that it is fast and cheap from a computational standpoint. The disadvantage is that the interpretation is somewhat ambiguous. Often, the population includes no individual who is perfectly average across all dimensions, and it is not always clear why the analyst should be especially interested in such an individual. This matters because, in some cases, a “comparison at the mean” can differ significantly from an “average comparison” (Section 6.3).
6.2.2.4 Balanced grids
Balanced grids, introduced in Section 3.2, include all unique combinations of categorical variables, while holding numeric variables at their means. This is particularly useful in experimental contexts, where the sample is not representative of the target population, and where we want to treat each combination of treatment conditions similarly.
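A sketch, using datagrid() to build such a balanced grid:

```r
comparisons(mod,
  variables = "incentive",
  newdata = datagrid(grid_type = "balanced"))
```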
6.3 Aggregation
As discussed above, the default behavior of comparisons() is to estimate quantities of interest for all the actually observed units in our dataset, or for each row of the dataset supplied to the newdata argument. Sometimes, it is convenient to marginalize those conditional estimates to obtain an average (or marginal) estimate.
Several key quantities of interest can be expressed as average counterfactual comparisons. For example, when certain assumptions are satisfied, the average treatment effect of \(X\) on \(Y\) can be defined as the expected difference between outcomes under treatment and under control:
\[
E[Y_{X=1} - Y_{X=0}],
\]
where the expectation is taken over the distribution of adjustment variables. In the chapter on G-computation, we will see that estimands like the Average Treatment Effect on the Treated (ATT) or the Average Treatment Effect on the Untreated (ATU) can be defined analogously.
To compute an average counterfactual comparison, we proceed in four steps (sketched in code below):
Compute predictions for every row of the dataset in the counterfactual world where all observations belonged to the treatment condition.
Compute predictions for every row of the dataset in the counterfactual world where all observations belonged to the control condition.
Take the differences between the two vectors of predictions.
Average the unit-level estimates across the whole dataset, or within subgroups.
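A minimal base R sketch of these four steps, assuming dat holds the estimation data:

```r
# 1) Predictions in the counterfactual world where everyone is treated
d_treatment <- transform(dat, incentive = 1)
p_treatment <- predict(mod, newdata = d_treatment, type = "response")

# 2) Predictions in the counterfactual world where no one is treated
d_control <- transform(dat, incentive = 0)
p_control <- predict(mod, newdata = d_control, type = "response")

# 3) Unit-level differences; 4) average across the whole dataset
mean(p_treatment - p_control)
```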
Previously, we called the comparisons() function to compute unit-level counterfactual comparisons. To return average estimates, we call the same function, with the same arguments, but add the avg_ prefix:
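```r
avg_comparisons(mod, variables = "incentive")
```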
On average, across all participants in the study, moving from the control to the treatment group is associated with a change of 45.2 percentage points in the predicted probability that outcome equals one. This result is equivalent to computing unit-level estimates and taking their mean:
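A sketch of the equivalence:

```r
cmp <- comparisons(mod, variables = "incentive")
mean(cmp$estimate)
```

To compute average comparisons within subgroups, we can use the by argument:

```r
avg_comparisons(mod, variables = "incentive", by = "agecat")
```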
On average, for participants in the <18 age bracket, moving from the control to the treatment group is associated with a change of 47.5 percentage points in the predicted probability that outcome equals one. This average risk difference is estimated at 43.5 percentage points for participants above 35 years old.
We can also compute an average risk difference for individuals with specific profiles, by specifying the grid of predictors using the newdata argument. For example, if we are interested in an average treatment effect that only applies to study participants who actually belonged to the treatment group, we can call (see the chapter on G-computation):4
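A sketch, again assuming dat holds the estimation data:

```r
# Average comparison computed only over the grid of treated units (ATT-style)
avg_comparisons(mod,
  variables = "incentive",
  newdata = subset(dat, incentive == 1))
```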
Note the distinction between average predictions and average comparisons. Average predictions are taken over the observed distribution of covariates in the subsets of interest; if the subsets have different characteristics, a difference between average predictions may be due to those characteristics rather than to the focal variable. An average comparison, by contrast, is obtained by counterfactual manipulation: we change the focal predictor while holding each unit’s other characteristics constant. It is a ceteris paribus comparison.
6.4 Uncertainty
By default, the standard errors around contrasts are estimated using the delta method (Section 3.4.1) and the classical variance-covariance matrix supplied by the modelling software. For many common statistical models, we can rely on the sandwich or clubSandwich packages to report “robust” standard errors, confidence intervals, and \(p\) values (Zeileis, Köll, and Graham 2020; Pustejovsky 2023). We can also use the inferences() function to compute bootstrap or simulation-based estimates of uncertainty.
Using the modelsummary package, we can report the same estimates with different measures of uncertainty in a single table, displayed side-by-side (Table 6.1).5
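A sketch of how such a table could be built; the cluster variable village and the argument values are assumptions (see footnote 5 for what each argument does):

```r
library(modelsummary)

# Three versions of the same average comparisons, with different variance estimators
cmp_hc <- avg_comparisons(mod, vcov = "HC3")       # heteroskedasticity-robust
cmp_cl <- avg_comparisons(mod, vcov = ~village)    # clustered by village (assumed column)
cmp_bs <- avg_comparisons(mod) |>
  inferences(method = "boot")                      # bootstrap

modelsummary(
  list("Heteroskedasticity" = cmp_hc, "Clustered" = cmp_cl, "Bootstrap" = cmp_bs),
  statistic = "conf.int",
  fmt = 4,
  gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
  shape = term + contrast ~ model)
```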
Table 6.1: Alternative ways to compute uncertainty about counterfactual comparisons.

|                             | Heteroskedasticity | Clustered          | Bootstrap          |
|-----------------------------|--------------------|--------------------|--------------------|
| agecat                      |                    |                    |                    |
| mean(18 to 35) - mean(<18)  | 0.0065             | 0.0065             | 0.0065             |
|                             | [-0.0457, 0.0587]  | [-0.0468, 0.0599]  | [-0.0421, 0.0608]  |
| mean(>35) - mean(<18)       | 0.0449             | 0.0449             | 0.0449             |
|                             | [-0.0079, 0.0977]  | [-0.0033, 0.0931]  | [-0.0038, 0.0996]  |
| distance                    |                    |                    |                    |
| mean(+1)                    | -0.0303            | -0.0303            | -0.0303            |
|                             | [-0.0427, -0.0179] | [-0.0449, -0.0157] | [-0.0426, -0.0177] |
| incentive                   |                    |                    |                    |
| mean(1) - mean(0)           | 0.4519             | 0.4519             | 0.4519             |
|                             | [0.4110, 0.4928]   | [0.4094, 0.4945]   | [0.4100, 0.4932]   |
| Num.Obs.                    | 2825               | 2825               | 2825               |
6.5 Test
In Section 6.3, we estimated average counterfactual comparisons for different subgroups of the data. Now, we will see how to conduct hypothesis tests on these estimates, in order to compare them to one another.
Imagine one is interested in the following question:
Does moving from incentive=0 to incentive=1 have a bigger effect on the predicted probability that outcome=1 for younger or older participants?
To answer this, we can start by using the avg_comparisons() and its by argument to estimate sub-group specific average risk differences:
```r
cmp <- avg_comparisons(mod, variables = "incentive", by = "agecat")
cmp
```
    agecat Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
       <18    0.475     0.0605  7.85   <0.001 0.356  0.593
  18 to 35    0.461     0.0290 15.89   <0.001 0.404  0.518
       >35    0.435     0.0338 12.84   <0.001 0.368  0.501
At first glance, it looks like the estimated difference is larger for participants in the first age bin (0.475) than for participants in the last age bin (0.435). In other words, it looks like the effect of incentive on the probability of seeking to learn one’s HIV status is stronger for younger participants than for older ones.
The difference between estimated treatment effects in those two groups is:
\[
47.46 - 43.46 = 3.99
\]
But is this difference statistically significant? Does our data allow us to conclude that incentive has different effects in age subgroups?
To answer this question, we follow the process laid out in Chapter 4, and express a test of equality as a string formula, where b1 identifies the estimate in the first row and b3 in the third row:
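A sketch of the corresponding call:

```r
avg_comparisons(mod,
  variables = "incentive",
  by = "agecat",
  hypothesis = "b1 - b3 = 0")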
The \(p\) value is large, which means that we cannot reject the null hypothesis that the effect of incentive on predicted outcome is the same for people in both age brackets. Our estimated risk differences are not statistically distinguishable from one another.
6.6 Visualization
The plot_comparisons() function can plot comparisons, differences, risk ratios, etc.
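For example, a sketch plotting the estimated risk difference for incentive across values of distance:

```r
plot_comparisons(mod, variables = "incentive", condition = "distance")
```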
Goldsmith-Pinkham, Paul, Peter Hull, and Michal Kolesár. Forthcoming. “Contamination Bias in Linear Regressions.” American Economic Review.
Pustejovsky, James E. 2023. clubSandwich: Cluster-Robust (Sandwich) Variance Estimators with Small-Sample Corrections. R package.
Thornton, Rebecca L. 2008. “The Demand for, and Impact of, Learning HIV Status.” American Economic Review 98 (5): 1829–63.
Westreich, Daniel, and Sander Greenland. 2013. “The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients.” American Journal of Epidemiology 177 (4): 292–98. https://doi.org/10.1093/aje/kws412.
Zeileis, Achim, Susanne Köll, and Nathaniel Graham. 2020. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” Journal of Statistical Software 95 (1): 1–36. https://doi.org/10.18637/jss.v095.i01.
We take the log odds ratio of the averages, rather than the average log odds ratio, because odds ratios are non-collapsible.↩︎
Interpreting the parameters associated with adjustment variables as measures of association or effect is generally not recommended. See the discussion of the Table 2 fallacy in Section 1.2 and Westreich and Greenland (2013).↩︎
See Section 1.1.3 for a discussion of the problems that can arise when multiple focal variables are included in a single model. Also see Goldsmith-Pinkham, Hull, and Kolesár (Forthcoming).↩︎
In this example, computing the average across the full empirical distribution or in the subset of actually treated units does not make much difference. This will not always be the case.↩︎
The statistic argument allows us to display confidence intervals instead of the default standard errors. The fmt argument determines the number of digits to display. gof_omit omits goodness-of-fit statistics from the bottom of the table. shape indicates that models should be displayed as columns and terms and contrasts as structured rows.↩︎