The conceptual framework introduced in this book builds on the idea that “a parameter is just a resting stone on the way to prediction.” This chapter shows that predictions can themselves act as a springboard for further analysis. In particular, we will see how model-based predictions can be transformed into “counterfactual comparisons” that allow us to measure the strength of association between two variables, or to quantify the effects of a cause.
Counterfactual thinking is fundamental to scientific inquiry and data analysis. Indeed, many of our most important research questions can be expressed as comparisons between hypothetical worlds. For example,
Would survival be more likely if patients received a new medication rather than a placebo?
Would standardized test scores be higher if class sizes were smaller in high schools?
Do conservation policies improve forest coverage?
Does participating in a microfinance program increase household income?
To answer each of those questions, the researcher must conduct a thought experiment. They need to compare two different “worlds”: one where the intervention occurred, and one where it did not. In that spirit, we can define a broad class of quantities of interest as follows:
A counterfactual comparison is a function of two or more model-based predictions, made with different predictor values.
This definition is intimately linked to the theories of causal inference surveyed in Section 1.1.3. Indeed, a natural way to estimate the effect of an intervention is to use a statistical model to make predictions in two counterfactual worlds—for different predictor values—and to compare those predictions. When the conditions for causal identification are satisfied (see Section 1.1.3), our counterfactual comparison can be interpreted as a measure of the “effect” of \(X\) on \(Y\). When the conditions for causal identification are not satisfied by the model, counterfactual comparisons can still be useful as a descriptive measure of the strength of association between two variables.
In this chapter, we will see that a vast array of statistical quantities of interest can be expressed as functions of two (or more) predictions: contrasts, risk differences, ratios, odds, lift, etc. Moreover, by averaging counterfactual comparisons across different grids, we can compute standard estimands such as the average treatment effect or the average treatment effect on the treated.
Using the marginaleffects package, we will be able to compute all of those quantities easily, and we will also see how to do comparisons between comparisons, in order to answer more complex questions about interactions or treatment effect heterogeneity.
6.1 Quantity
The concept of “counterfactual comparison” was defined as a function of two or more model-based predictions, made with different predictor values. To operationalize this quantity, the analyst must decide (1) what change in predictors they are interested in, and (2) what function to use to compare counterfactual predicted outcomes.
Consider a simple case where the analyst fits a statistical model with outcome \(Y\) and predictor \(X\). They use the parameter estimates to compute predictions. When the variable \(X\) is set to a specific value \(x\), we denote the prediction of the model \(\hat{Y}_{X=x}\).
An analyst who is interested in model description, data description, or causal inference (Section 1.1) may want to estimate how the predicted outcome \(\hat{Y}_{X}\) changes when we manipulate the predictor \(X\). For example, one could ask: what is the effect of an increase of 1 unit, of an increase of one standard deviation, or of a change between two specific values \(a\) and \(b\), on the predicted outcome?
\[\begin{align*}
\hat{Y}_{X=x+1} - \hat{Y}_{X=x} && \text{Increase of one unit}\\
\hat{Y}_{X=x+\sigma_X} - \hat{Y}_{X=x} && \text{Increase of one standard deviation}\\
\hat{Y}_{X=\max(X)} - \hat{Y}_{X=\min(X)} && \text{Increase from minimum to maximum}\\
\hat{Y}_{X=b} - \hat{Y}_{X=a} && \text{Change between specific values $a$ and $b$}
\end{align*}\]
In each of those examples, we calculate the difference between two predicted outcomes, evaluated for different values of the focal predictor \(X\). These counterfactual comparisons were all taken to be simple differences, but we can compare predictions using other functions. Common ones include:
\[\begin{align*}
\hat{Y}_{X=b} - \hat{Y}_{X=a} && \text{Difference}\\
\hat{Y}_{X=b} \big/ \hat{Y}_{X=a} && \text{Ratio}\\
\left(\hat{Y}_{X=b} - \hat{Y}_{X=a}\right) \big/ \hat{Y}_{X=a} && \text{Lift}
\end{align*}\]
If the predicted outcome \(\hat{Y}_X\) is a probability, the difference defined above becomes a “risk difference,” the ratio a “risk ratio,” etc. We can define more complex comparisons like the odds ratio, as \(\frac{\hat{Y}_{X=b}}{1 - \hat{Y}_{X=b}} \bigg/ \frac{\hat{Y}_{X=a}}{1 - \hat{Y}_{X=a}}\).
If there is more than one predictor in the statistical model, we can also define “cross-comparisons,” where two variables change simultaneously: \(X\) changes from \(a\) to \(b\), and \(Z\) simultaneously changes from \(c\) to \(d\):
\[\hat{Y}_{X=b,Z=d} - \hat{Y}_{X=a,Z=c}\]
This kind of cross-comparison can be useful when trying to assess the effect of simultaneous changes in multiple predictors. For example, what happens to the predicted outcome if a patient starts taking a medication and doing exercise at the same time?
In the rest of this section, we will illustrate how to compute and interpret all of these quantities using the marginaleffects package, in the context of the Thornton (2008) study introduced in Chapter 4. We fit a logistic regression model in which the outcome is the probability that a participant will seek to learn their HIV status, and the randomized treatment is whether the participant received a monetary incentive. To make the specification more flexible and improve precision, we interact the treatment indicator (incentive) with two predictors: the age category to which participants belong (agecat) and their distance from the test center.
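A minimal sketch of this specification, assuming the Thornton (2008) data are loaded in a data frame called dat with columns outcome, incentive, agecat, and distance:

```r
library(marginaleffects)

# Assumed: `dat` holds the Thornton (2008) data
mod <- glm(
  outcome ~ incentive * (agecat + distance),
  data = dat,
  family = binomial)
```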
This model is relatively simple. Still, because it includes interactions and a non-linear link function, interpreting raw model coefficients is difficult for most analysts, and essentially impossible for lay people.
Counterfactual comparisons are a compelling alternative to coefficient estimates, with many advantages for interpretation. First, counterfactual comparisons can be expressed directly on the scale of the outcome variable, rather than as complex functions like log-odds ratios. Second, counterfactual comparisons map directly onto what many people have in mind when they think of the effect of a treatment: what change can we expect in the outcome when the treatment changes? Finally, as the following sections show, the marginaleffects package makes it trivial to compute counterfactual comparisons. Data analysts can embrace the same workflow in a model-agnostic fashion, applying similar post-estimation steps to interpret their results (almost) regardless of the statistical model they choose to estimate.
6.1.1 First steps: Risk difference with a binary treatment
To begin, let’s consider a simple estimand: the risk difference associated with a change in a binary treatment. Specifically, we will use the Thornton (2008) data to estimate the expected change in outcome when a participant moves from the control group to the treatment group, that is, when they move from 0 to 1 on the incentive variable.
An important factor to consider, when estimating such a quantity, is that counterfactual comparisons are conditional quantities. Except in the simplest cases, comparisons will depend on the values of all the predictors in a model. Each individual in a dataset (i.e., each predictor profile) may be associated with a different counterfactual comparison. Therefore, whenever the analyst computes a counterfactual comparison, they must explicitly define the predictor values of interest. Section 6.2 explores different ways to define the grid of predictor profiles to use for this purpose. For now, it suffices to focus on a single individual with typical (average or modal) characteristics.
```r
grid <- data.frame(distance = 2, agecat = "18 to 35", incentive = 1)
grid
```
distance agecat incentive
1 2 18 to 35 1
Our goal is to estimate the effect of a binary predictor—whether participants received a monetary incentive—on the predicted probability that they will seek to learn their HIV status.
\[\hat{Y}_{i=1,d=2,a=\text{18 to 35}} - \hat{Y}_{i=0,d=2,a=\text{18 to 35}}\]
To do this, we can compare model-based predicted outcomes with and without the incentive, holding all other unit characteristics constant. Using simple base R commands, we manipulate the grid, make predictions, and compare those predictions.
```r
# Grids of predictor values
g_treatment <- transform(grid, incentive = 1)
g_control <- transform(grid, incentive = 0)

# Counterfactual predictions
p_treatment <- predict(mod, newdata = g_treatment, type = "response")
p_control <- predict(mod, newdata = g_control, type = "response")

# Counterfactual comparison
p_treatment - p_control
```
[1] 0.465402
The same estimate can be obtained more easily, along with standard errors and test statistics, using the comparisons() function from the marginaleffects package:
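For instance, a sketch reusing the grid defined above:

```r
comparisons(mod, variables = "incentive", newdata = grid)
```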
Our model suggests that for a participant with typical characteristics, moving from the control group to the treatment group increases the predicted probability that outcome equals one by \(0.465\times 100=46.5\) percentage points.
6.1.2 Comparison functions
So far, we have measured the effect of a change in predictor solely by looking at differences in predicted outcomes. Differences are typically the best starting point for interpretation, because they are simple and easy to grasp intuitively. Nevertheless, in some contexts it can make sense to use different functions to compare counterfactual predictions, such as ratios, lift, or odds ratios.
To compute the ratio of predicted outcomes associated with a change in incentive, we use the comparison="ratio" argument:
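A sketch of the call, reusing the same grid; hypothesis=1 sets the null value for the test discussed below:

```r
comparisons(mod,
  variables = "incentive",
  comparison = "ratio",
  hypothesis = 1,
  newdata = grid)
```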
The predicted outcome is nearly 4.5 times as large for a participant in the treatment group, who is between 18 and 35 years old and lives at a distance of 2 from the test center, as for a participant in the control group with the same socio-demographic characteristics. Note that, in the code above, we set hypothesis=1 to test against the null hypothesis that these two predicted probabilities are identical (i.e., a ratio of 1). The standard error is small and the \(z\) statistic large. We can therefore reject the null hypothesis that the ratio between the predicted probabilities that someone in the treatment group (incentive=1) and someone in the control group (incentive=0) will get their test results is 1.
To compute the lift, we would proceed in the same way, setting comparison="lift":
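For example, a sketch:

```r
comparisons(mod,
  variables = "incentive",
  comparison = "lift",
  newdata = grid)
```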
Finally, it is useful to note that the comparison argument accepts arbitrary R or Python functions. This is an extremely powerful feature, as it allows analysts to specify fully customized comparisons between a hi prediction (e.g., treatment) and a lo prediction (e.g., control). To illustrate, we compute a log odds ratio based on average predictions:1
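One possible sketch: the custom function receives two vectors of counterfactual predictions (hi and lo) and returns the log of the odds ratio computed from their averages:

```r
comparisons(mod,
  variables = "incentive",
  comparison = function(hi, lo) {
    # Odds of the average predictions in each counterfactual world
    odds_hi <- mean(hi) / (1 - mean(hi))
    odds_lo <- mean(lo) / (1 - mean(lo))
    # Log odds ratio of the averages (see footnote 1)
    log(odds_hi / odds_lo)
  })
```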
6.2 Grid
The predictors in a model can be divided into two categories: focal and adjustment variables. Focal variables are the key predictors in a counterfactual analysis. They are the variables whose effect on (or association with) the outcome we wish to quantify. In contrast, adjustment—or control—variables are incidental to the principal analysis. They can be included in a model to increase its flexibility, improve fit, meet a conditional independence assumption, or check if treatment effects vary across subgroups of the population. However, the effect of an adjustment variable is not, in and of itself, of interest in a counterfactual analysis.2
Counterfactual comparisons are conditional quantities, in the sense that they typically depend on the values of all the predictors in a model. Therefore, when we compute a comparison, we need to decide where to evaluate it in the predictor space. We need to decide what values to assign to the focal and adjustment variables.
6.2.1 Focal variables
When estimating counterfactual comparisons, our goal is to determine what happens to the predicted outcome when one or more focal predictors change. Obviously, the kind of change we are interested in depends on the nature of the focal predictors. Let’s consider four common cases in turn: binary, categorical, numeric, and cross-comparisons.
To illustrate each of these cases, we will treat each of the predictors in our model as a focal variable in turn (incentive, agecat, and distance). Note, however, that in typical real-world applications, there is usually one focal variable per statistical model.3
6.2.1.1 Change in binary predictors
By default, when the focal predictor is binary, marginaleffects returns the value of the difference in predicted outcomes associated with a change from the control (0) to the treatment (1) group:
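A sketch, reusing the grid of typical characteristics defined in Section 6.1.1:

```r
comparisons(mod, variables = "incentive", newdata = grid)
```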
Moving from the control to the treatment group on the incentive variable is associated with a change of about 47 percentage points in the predicted probability that outcome equals 1.
6.2.1.2 Change in categorical predictors
The same approach can be used when we are interested in changes in a categorical variable with multiple levels. For example, if we want to know how changes in the agecat variable would affect the predicted probability that outcome equals 1, we simply type:
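```r
comparisons(mod, variables = "agecat", newdata = grid)
```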
We find that moving from the <18 to the 18 to 35 age bracket increases the predicted probability that outcome equals 1 by 0.4 percentage points. Moving from the <18 to the >35 age bracket increases the predicted probability by 0.4 percentage points.
By default, the comparisons() function returns comparisons between every level of the categorical predictor and its “reference” or first level, here: <18. We can modify the variables argument to compare specific categories, or to compare each category to its preceding level sequentially (<18 to 18 to 35, and 18 to 35 to >35):
```r
# Specific comparison
comparisons(mod, variables = list("agecat" = c("18 to 35", ">35")))

# Sequential comparisons
comparisons(mod, variables = list("agecat" = "sequential"))
```
6.2.1.3 Change in numeric predictors
When the focal predictor is numeric, we can once again follow the same steps and call:
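```r
comparisons(mod, variables = "distance", newdata = grid)
```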
This result says that increasing the distance variable by 1 unit changes the predicted value of outcome by -2.9 percentage points. Importantly, this represents a 1 unit increase in distance from the value of distance in the predictor grid: 2.00.
One may be interested in different magnitudes of change in the focal predictor distance. For example, the effect of a 5 unit (or 1 standard deviation) increase in distance on the predicted value of the outcome. Alternatively, the analyst may want to assess the effect of a change between two specific values of distance, or across the interquartile (or full) range of the data. All of these options are easy to implement using the variables argument:
```r
# Increase of 5 units
comparisons(mod, variables = list("distance" = 5))

# Increase of 1 standard deviation
comparisons(mod, variables = list("distance" = "sd"))

# Change between specific values
comparisons(mod, variables = list("distance" = c(0, 3)))

# Change across the interquartile range
comparisons(mod, variables = list("distance" = "iqr"))

# Change across the full range
comparisons(mod, variables = list("distance" = "minmax"))
```
6.2.1.4 Cross-comparisons
Sometimes, an analyst wants to assess the joint or combined effect of manipulating two predictors. In a medical study, for example, we may be interested in the change in survival rates for people who both receive a new treatment and make a dietary change. In our running example, it may be interesting to know how much the probability of outcome would change if we modified both the distance and incentive variables simultaneously. To check this, we use the cross argument:
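A sketch; the newdata value is an assumption carried over from the examples above:

```r
comparisons(mod,
  variables = c("incentive", "distance"),
  cross = TRUE,
  newdata = grid)
```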
These results show that a simultaneous increase of 1 unit in the distance variable and a change from 0 to 1 on the incentive variable is associated with a change of 0.437 in the predicted outcome.
6.2.2 Adjustment variables
In a typical counterfactual analysis, the researcher is not interested in a change in the adjustment variables themselves. Nevertheless, since the value of a counterfactual comparison depends on where it is evaluated in the predictor space, it is important to specify the values of all the predictors in a grid.
Much like in Chapter 5, where we computed predictions for different individual profiles, we now estimate counterfactual comparisons based on empirical, interesting, representative, and balanced grids.
6.2.2.1 Empirical distribution
By default, the comparisons() function returns estimates for every single row of the original data frame which was used to fit the model. The Thornton (2008) dataset includes 2825 complete observations (after dropping missing data), so this command will yield 2825 estimates:
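For example:

```r
cmp <- comparisons(mod, variables = "incentive")
nrow(cmp)
```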
If we do not specify the variables argument, comparisons() computes distinct differences for all the variables. Here, there are 4 possible differences, so we get \(4 \times 2825=11300\) rows:
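```r
cmp <- comparisons(mod)
nrow(cmp)
```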
The x-axes in Figure 6.1 show the estimated effects of changes in the predictors on the predicted value of the outcome. The y-axes show the prevalence of each estimate across the full sample. There appears to be considerable heterogeneity.
For example, consider the right-most panel, which shows the distribution of unit-level contrasts for the incentive variable. This panel shows that for some participants, the model predicts that moving from the control to the treatment condition would increase the predicted probability that outcome equals one by about 0.5 points. For others, the estimated effect of this change can be as low as 0.4.
6.2.2.2 Interesting grid
In some contexts, we want to estimate a contrast for a specific individual with characteristics of interest. To achieve this, we can supply a data frame to the newdata argument.
The code below shows the expected change in the predicted probability of the outcome associated with a change in incentive, for a few individuals with interesting characteristics. As in Chapter 5, we use the datagrid() function as a convenient mechanism to create a grid of profiles of interest:
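A sketch, where the chosen profiles are illustrative assumptions:

```r
comparisons(mod,
  variables = "incentive",
  newdata = datagrid(agecat = c("18 to 35", ">35"), distance = 2))
```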
Notice that the estimated effects of incentive on the predicted probability that outcome=1 differ depending on the age bin to which a study participant belongs. Indeed, our model estimates that the difference between treatment and control would be about 46.5 percentage points in the middle age bin, but 43.8 percentage points in the older age bin.
6.2.2.3 Representative grids
Yet another common alternative is to compute a comparison or risk difference “at the mean.” The idea is to create a “representative” or “synthetic” profile for an individual whose characteristics are completely average or modal. Then, we report the comparison for this specific hypothetical individual. For example:
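Setting newdata = "mean" builds a one-row grid in which numeric predictors are held at their means and categorical predictors at their modes:

```r
comparisons(mod, variables = "incentive", newdata = "mean")
```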
The main advantage of this approach is that it is fast and cheap from a computational standpoint. The disadvantage is that the interpretation is somewhat ambiguous. Often, the population includes no individual who is perfectly average across all dimensions, and it is not always clear why the analyst should be especially interested in such an individual. This matters because, in some cases, a “comparison at the mean” can differ significantly from an “average comparison” (Section 6.3).
6.2.2.4 Balanced grids
Balanced grids, introduced in Section 3.2, include all unique combinations of categorical variables, while holding numeric variables at their means. This is particularly useful in experimental contexts, where the sample is not representative of the target population, and where we want to treat each combination of treatment conditions similarly.
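A sketch, using datagrid() to build such a balanced grid:

```r
comparisons(mod,
  variables = "incentive",
  newdata = datagrid(grid_type = "balanced"))
```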
6.3 Aggregation
As discussed above, the default behavior of comparisons() is to estimate quantities of interest for all the actually observed units in our dataset, or for each row of the dataset supplied to the newdata argument. Sometimes, it is convenient to marginalize those conditional estimates to obtain an average (or marginal) estimate.
Several key quantities of interest can be expressed as average counterfactual comparisons. For example, when certain assumptions are satisfied, the average treatment effect of \(X\) on \(Y\) can be defined as the expected difference between outcomes under treatment and under control:
\[
E[Y_{X=1} - Y_{X=0}],
\]
where the expectation is taken over the distribution of adjustment variables. In the chapter on G-computation, we will see that estimands like the Average Treatment Effect on the Treated (ATT) or the Average Treatment Effect on the Untreated (ATU) can be defined analogously.
To compute an average counterfactual comparison, we proceed in four steps (sketched in code below):
Compute predictions for every row of the dataset in the counterfactual world where all observations belonged to the treatment condition.
Compute predictions for every row of the dataset in the counterfactual world where all observations belonged to the control condition.
Take the differences between the two vectors of predictions.
Average the unit-level estimates across the whole dataset, or within subgroups.
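A minimal base R sketch of these four steps, assuming dat holds the estimation data:

```r
# 1) Predictions in the counterfactual world where everyone is treated
d_treatment <- transform(dat, incentive = 1)
p_treatment <- predict(mod, newdata = d_treatment, type = "response")

# 2) Predictions in the counterfactual world where no one is treated
d_control <- transform(dat, incentive = 0)
p_control <- predict(mod, newdata = d_control, type = "response")

# 3) Unit-level differences; 4) average across the whole dataset
mean(p_treatment - p_control)
```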
Previously, we called the comparisons() function to compute unit-level counterfactual comparisons. To return average estimates, we call the same function, with the same arguments, but add the avg_ prefix:
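```r
avg_comparisons(mod, variables = "incentive")
```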
On average, across all participants in the study, moving from the control to the treatment group is associated with a change of 45.2 percentage points in the predicted probability that outcome equals one. This result is equivalent to computing unit-level estimates and taking their mean:
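A sketch of the equivalence:

```r
cmp <- comparisons(mod, variables = "incentive")
mean(cmp$estimate)
```

To compute average comparisons within subgroups, we can use the by argument:

```r
avg_comparisons(mod, variables = "incentive", by = "agecat")
```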
On average, for participants in the <18 age bracket, moving from the control to the treatment group is associated with a change of 47.5 percentage points in the predicted probability that outcome equals one. This average risk difference is estimated at 43.5 percentage points for participants above 35 years old.
We can also compute an average risk difference for individuals with specific profiles, by specifying the grid of predictors using the newdata argument. For example, if we are interested in an average treatment effect that only applies to study participants who actually belonged to the treatment group, we can call (see the chapter on G-computation):4
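A sketch, again assuming dat holds the estimation data:

```r
# Average comparison computed only over the grid of treated units (ATT-style)
avg_comparisons(mod,
  variables = "incentive",
  newdata = subset(dat, incentive == 1))
```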
Note the distinction between average predictions and average comparisons. Average predictions are taken over the observed distribution of covariates in the subsets of interest; if the subsets have different characteristics, a difference between average predictions may be due to those characteristics rather than to the focal variable. An average comparison, by contrast, is obtained by counterfactual manipulation: we change the focal predictor while holding each unit’s other characteristics constant. It is a ceteris paribus comparison.
6.4 Uncertainty
By default, the standard errors around contrasts are estimated using the delta method (Section 3.4.1) and the classical variance-covariance matrix supplied by the modelling software. For many common statistical models, we can rely on the sandwich or clubSandwich packages to report “robust” standard errors, confidence intervals, and \(p\) values (Zeileis, Köll, and Graham 2020; Pustejovsky 2023). We can also use the inferences() function to compute bootstrap or simulation-based estimates of uncertainty.
Using the modelsummary package, we can report the same estimates with different measures of uncertainty in a single table, displayed side-by-side (Table 6.1).5
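A sketch of how such a table could be built; the cluster variable village and the argument values are assumptions (see footnote 5 for what each argument does):

```r
library(modelsummary)

# Three versions of the same average comparisons, with different variance estimators
cmp_hc <- avg_comparisons(mod, vcov = "HC3")       # heteroskedasticity-robust
cmp_cl <- avg_comparisons(mod, vcov = ~village)    # clustered by village (assumed column)
cmp_bs <- avg_comparisons(mod) |>
  inferences(method = "boot")                      # bootstrap

modelsummary(
  list("Heteroskedasticity" = cmp_hc, "Clustered" = cmp_cl, "Bootstrap" = cmp_bs),
  statistic = "conf.int",
  fmt = 4,
  gof_omit = "AIC|BIC|Log.Lik|F|RMSE",
  shape = term + contrast ~ model)
```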
Table 6.1: Alternative ways to compute uncertainty about counterfactual comparisons.

|                             | Heteroskedasticity | Clustered          | Bootstrap          |
|-----------------------------|--------------------|--------------------|--------------------|
| agecat                      |                    |                    |                    |
| mean(18 to 35) - mean(<18)  | 0.0065             | 0.0065             | 0.0065             |
|                             | [-0.0457, 0.0587]  | [-0.0468, 0.0599]  | [-0.0421, 0.0608]  |
| mean(>35) - mean(<18)       | 0.0449             | 0.0449             | 0.0449             |
|                             | [-0.0079, 0.0977]  | [-0.0033, 0.0931]  | [-0.0038, 0.0996]  |
| distance                    |                    |                    |                    |
| mean(+1)                    | -0.0303            | -0.0303            | -0.0303            |
|                             | [-0.0427, -0.0179] | [-0.0449, -0.0157] | [-0.0426, -0.0177] |
| incentive                   |                    |                    |                    |
| mean(1) - mean(0)           | 0.4519             | 0.4519             | 0.4519             |
|                             | [0.4110, 0.4928]   | [0.4094, 0.4945]   | [0.4100, 0.4932]   |
| Num.Obs.                    | 2825               | 2825               | 2825               |
6.5 Test
In Section 6.3, we estimated average counterfactual comparisons for different subgroups of the data. Now, we will see how to conduct hypothesis tests on these estimates, in order to compare them to one another.
Imagine one is interested in the following question:
Does moving from incentive=0 to incentive=1 have a bigger effect on the predicted probability that outcome=1 for younger or older participants?
To answer this, we can start by using the avg_comparisons() and its by argument to estimate sub-group specific average risk differences:
```r
cmp <- avg_comparisons(mod, variables = "incentive", by = "agecat")
cmp
```
    agecat Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
       <18    0.475     0.0605  7.85   <0.001 0.356  0.593
  18 to 35    0.461     0.0290 15.89   <0.001 0.404  0.518
       >35    0.435     0.0338 12.84   <0.001 0.368  0.501
At first glance, it looks like the estimated difference is larger for participants in the first age bin (0.475) than for participants in the last age bin (0.435). In other words, it looks like the effect of incentive on the probability of seeking to learn one’s HIV status is stronger for younger participants than for older ones.
The difference between estimated treatment effects in those two groups is:
\[
47.46 - 43.46 = 3.99
\]
But is this difference statistically significant? Does our data allow us to conclude that incentive has different effects in age subgroups?
To answer this question, we follow the process laid out in Chapter 4, and express a test of equality as a string formula, where b1 identifies the estimate in the first row and b3 in the third row:
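A sketch of the corresponding call:

```r
avg_comparisons(mod,
  variables = "incentive",
  by = "agecat",
  hypothesis = "b1 - b3 = 0")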
The \(p\) value is large, which means that we cannot reject the null hypothesis that the effect of incentive on predicted outcome is the same for people in both age brackets. Our estimated risk differences are not statistically distinguishable from one another.
6.6 Visualization
The plot_comparisons() function can plot comparisons, differences, risk ratios, etc.
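For example, a sketch plotting the estimated risk difference for incentive across values of distance:

```r
plot_comparisons(mod, variables = "incentive", condition = "distance")
```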
Goldsmith-Pinkham, Paul, Peter Hull, and Michal Kolesár. Forthcoming. “Contamination Bias in Linear Regressions.” American Economic Review.
Pustejovsky, James E. 2023. clubSandwich: Cluster-Robust (Sandwich) Variance Estimators with Small-Sample Corrections. R package.
Thornton, Rebecca L. 2008. “The Demand for, and Impact of, Learning HIV Status.” American Economic Review 98 (5): 1829–63.
Westreich, Daniel, and Sander Greenland. 2013. “The Table 2 Fallacy: Presenting and Interpreting Confounder and Modifier Coefficients.” American Journal of Epidemiology 177 (4): 292–98. https://doi.org/10.1093/aje/kws412.
Zeileis, Achim, Susanne Köll, and Nathaniel Graham. 2020. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” Journal of Statistical Software 95 (1): 1–36. https://doi.org/10.18637/jss.v095.i01.
We take the log odds ratio of the averages, rather than the average log odds ratio, because odds ratios are non-collapsible.↩︎
Interpreting the parameters associated with adjustment variables as measures of association or effect is generally not recommended. See the discussion of the Table 2 fallacy in Section 1.2 and Westreich and Greenland (2013).↩︎
See Section 1.1.3 for a discussion of the problems that can arise when multiple focal variables are included in a single model. Also see Goldsmith-Pinkham, Hull, and Kolesár (Forthcoming).↩︎
In this example, computing the average across the full empirical distribution or in the subset of actually treated units does not make much difference. This will not always be the case.↩︎
The statistic argument allows us to display confidence intervals instead of the default standard errors. The fmt argument determines the number of digits to display. gof_omit omits goodness-of-fit statistics from the bottom of the table. shape indicates that models should be displayed as columns and terms and contrasts as structured rows.↩︎