7  Counterfactual comparisons

The conceptual framework introduced in this book builds on the idea that “a parameter is just a resting stone on the way to prediction.” This chapter shows that predictions can themselves act as a springboard for further analysis. In particular, we will see how model-based predictions can be transformed into “counterfactual comparisons” that allow us to measure the strength of association between two variables, or to quantify the effects of a cause.

Counterfactual thinking is fundamental to scientific inquiry and data analysis. Indeed, many of our most important research questions can be expressed as comparisons between hypothetical worlds.

To answer such questions, the researcher must conduct a thought experiment: they must compare two different “worlds,” one where the intervention occurred, and one where it did not. In that spirit, we can define a broad class of quantities of interest as follows:

A counterfactual comparison is a function of two or more model-based predictions, made with different predictor values.

This definition is intimately linked to the theories of causal inference surveyed in Section 1.1.3. Indeed, a natural way to estimate the effect of an intervention is to use a statistical model to make predictions in two counterfactual worlds—for different predictor values—and to compare those predictions. When the conditions for causal identification are satisfied (see Section 1.1.3), our counterfactual comparison can be interpreted as a measure of the “effect” of \(X\) on \(Y\). When the conditions for causal identification are not satisfied by the model, counterfactual comparisons can still be useful as a descriptive measure of the strength of association between two variables.

In this chapter, we will see that a vast array of statistical quantities of interest can be expressed as functions of two (or more) predictions: contrasts, risk differences, ratios, odds, lift, etc. Moreover, by averaging counterfactual comparisons across different grids, we can compute standard estimands such as the average treatment effect, or the average treatment effect on the treated.

Using the marginaleffects package, we will be able to compute all of those quantities easily, and we will also see how to compare comparisons to one another, in order to answer more complex questions about interactions or treatment effect heterogeneity.

7.1 Quantity

The concept of “counterfactual comparison” was defined as a function of two or more model-based predictions, made with different predictor values. To operationalize this quantity, the analyst must decide (1) what change in predictors they are interested in, and (2) what function to use to compare counterfactual predicted outcomes.

Consider a simple case where an analyst fits a statistical model with outcome \(Y\) and predictor \(X\). They use the parameter estimates to compute predictions. When the variable \(X\) is set to a specific value \(x\), we denote the prediction of the model \(\hat{Y}_{X=x}\).

An analyst who is interested in model description, data description, or causal inference (Section 1.1) may want to estimate how the predicted outcome \(\hat{Y}_{X}\) changes when we manipulate the predictor \(X\). For example, one could ask: what is the effect of an increase of 1 unit, of an increase of one standard deviation, or of a change between two specific values \(a\) and \(b\), on the predicted outcome?

\[\begin{align*} \hat{Y}_{X=x+1} - \hat{Y}_{X=x} && \text{Increase of one unit}\\ \hat{Y}_{X=x+\sigma_X} - \hat{Y}_{X=x} && \text{Increase of one standard deviation}\\ \hat{Y}_{X=\max(X)} - \hat{Y}_{X=\min(X)} && \text{Increase from minimum to maximum}\\ \hat{Y}_{X=b} - \hat{Y}_{X=a} && \text{Change between specific values $a$ and $b$} \end{align*}\]

In each of those examples, we calculate the difference between two predicted outcomes, evaluated for different values of the focal predictor \(X\). These counterfactual comparisons were all taken to be simple differences, but we can compare predictions using other functions. Common ones include:

\[\begin{align*} \hat{Y}_{X=b} - \hat{Y}_{X=a} && \text{Difference}\\ \frac{\hat{Y}_{X=b}}{\hat{Y}_{X=a}} && \text{Ratio}\\ \frac{\hat{Y}_{X=b} - \hat{Y}_{X=a}}{\hat{Y}_{X=a}} && \text{Lift}\\ \end{align*}\]

If the predicted outcome \(\hat{Y}_X\) is a probability, the difference defined above becomes a “risk difference,” the ratio a “risk ratio,” etc. We can define more complex comparisons like the odds ratio, as \(\frac{\hat{Y}_{X=b}}{1 - \hat{Y}_{X=b}} \bigg/ \frac{\hat{Y}_{X=a}}{1 - \hat{Y}_{X=a}}\).
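To fix ideas, here is a minimal base R sketch of these comparison functions, applied to two hypothetical predicted probabilities (the values 0.3 and 0.4 are invented for illustration):

p_a <- 0.3 # hypothetical prediction when X = a
p_b <- 0.4 # hypothetical prediction when X = b

p_b - p_a                             # risk difference: 0.10
p_b / p_a                             # risk ratio: about 1.33
(p_b - p_a) / p_a                     # lift: about 0.33
(p_b / (1 - p_b)) / (p_a / (1 - p_a)) # odds ratio: about 1.56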

If there is more than one predictor in the statistical model, we can also define “cross-comparisons,” where two variables change simultaneously: \(X\) changes from \(a\) to \(b\), and \(Z\) simultaneously changes from \(c\) to \(d\):

\[\begin{align*} \hat{Y}_{X=b,Z=d} - \hat{Y}_{X=a,Z=c} && \text{Cross-comparison} \end{align*}\]

This kind of cross-comparison can be useful when trying to assess the effect of simultaneous changes in multiple predictors. For example, what happens to the predicted outcome if a patient starts taking a medication and doing exercise at the same time?

In the rest of this section, we will illustrate how to compute and interpret all of these quantities using the marginaleffects package, in the context of the Thornton (2008) study introduced in Chapter 5. We fit a logistic regression model in which the outcome is the probability that a participant will seek to learn their HIV status, and the randomized treatment is whether the participant received a monetary incentive. To make the specification more flexible and improve precision, we interact the treatment indicator with two control variables: the age category to which each participant belongs (agecat) and their distance from the test center (distance).

library(marginaleffects)
dat <- readRDS(here::here("chapters/data/hiv.rds"))
mod <- glm(outcome ~ incentive * (agecat + distance), 
    data = dat, family = binomial)
summary(mod)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)
(Intercept)           -0.48617    0.29621  -1.641   0.1007
incentive              2.05760    0.34421   5.978   <0.001
agecat18-35            0.07937    0.28872   0.275   0.7834
agecat>35              0.34522    0.29467   1.172   0.2414
distance              -0.18440    0.07236  -2.548   0.0108
incentive:agecat18-35 -0.05850    0.33500  -0.175   0.8614
incentive:agecat>35   -0.12468    0.34242  -0.364   0.7158
incentive:distance     0.02304    0.08256   0.279   0.7802

This model is relatively simple. Still, because it includes interactions and a non-linear link function, interpreting raw model coefficients is difficult for most analysts, and essentially impossible for lay people.

Counterfactual comparisons are a compelling alternative to coefficient estimates, with many advantages for interpretation. First, counterfactual comparisons can be expressed directly on the scale of the outcome variable, rather than as complex functions like log-odds ratios. Second, counterfactual comparisons map directly onto what many people have in mind when they think of the “effect” of a treatment: what change can we expect in the response when the treatment changes? Finally, as the following sections show, the marginaleffects package makes it trivial to compute counterfactual comparisons. Data analysts can embrace the same workflow in a model-agnostic fashion, applying similar post-estimation steps to interpret their results (almost) regardless of the statistical model they choose to estimate.

7.1.1 First steps: Risk difference with a binary treatment

To begin, let’s consider a simple estimand: the risk difference associated with a change in a binary treatment. An important factor to consider when estimating that quantity is that counterfactual comparisons are conditional quantities. Except in the simplest cases, comparisons will depend on the values of all the predictors in a model. Therefore, whenever the analyst computes a counterfactual comparison, they must explicitly define the predictor values of interest. Section 7.2 explores different ways to define the grid of predictor profiles to use for this purpose. For now, it suffices to focus on a single individual with typical (average or modal) characteristics:

grid <- data.frame(distance = 2, agecat = "18-35", incentive = 1)
grid
  distance agecat incentive
1        2  18-35         1

Our goal is to estimate the effect of a binary predictor—whether participants received a monetary incentive—on the predicted probability that they will seek to learn their HIV status. To do this, we can compare model-based predicted outcomes with and without the incentive, holding all other unit characteristics constant.

We can do this using simple base R commands, by manipulating the grid, making predictions, and comparing those predictions:

# Grids of predictor values
g_treatment <- transform(grid, incentive = 1)
g_control <- transform(grid, incentive = 0)

# Counterfactual predictions
p_treatment <- predict(mod, newdata = g_treatment, type = "response")
p_control <- predict(mod, newdata = g_control, type = "response")

# Counterfactual comparison
p_treatment - p_control

[1] 0.465402

The same estimate can be obtained more easily, along with standard errors and test statistics, using the comparisons() function from the marginaleffects package:

comparisons(mod, variables = "incentive", newdata = grid)
 Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 % distance agecat
    0.465     0.0293 15.9   <0.001 0.408  0.523        2  18-35

Our model suggests that for a participant with typical characteristics, moving from the control group to the treatment group increases the predicted probability that outcome equals one by 47 percentage points.

7.1.2 Comparison functions

So far, we have measured the effect of a change in predictor solely by looking at differences in predicted outcomes. Differences are typically the best starting point for interpretation, because they are simple and easy to grasp intuitively. Nevertheless, in some contexts it can make sense to use different functions to compare counterfactual predictions, such as ratios, lift, or odds ratios.

To compute the ratio of predicted outcomes associated with a change in incentive (\(i\)), holding distance (\(d\)) and age category (\(a\)) at their grid values, we use the comparison="ratio" argument:

\[\frac{\hat{Y}_{i=1,d=2,a=\text{18-35}}}{\hat{Y}_{i=0,d=2,a=\text{18-35}}}\]

comparisons(mod,
  variables = "incentive",
  comparison = "ratio",
  hypothesis = 1,
  newdata = grid)
 Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 % distance agecat
     2.48      0.211 7   <0.001  2.06   2.89        2  18-35

The predicted outcome is nearly 2.5 times as large for a participant in the treatment group, who is between 18 and 35 years old and lives at a distance of 2 from the test center, as for a participant in the control group with the same socio-demographic characteristics. Note that, in the code above, we set hypothesis=1 to test against the null hypothesis that these two predicted probabilities are identical (i.e., a ratio of 1). The standard error is small and the \(z\) statistic large. Therefore, we can reject the null hypothesis that the predicted outcomes are the same in the treatment and control arms of this trial.
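As a quick sanity check, the \(z\) statistic reported above is the distance between the estimate and the null value, measured in standard error units:

\[ z = \frac{2.48 - 1}{0.211} \approx 7 \]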

To compute the lift, we proceed in the same way, setting comparison="lift":

\[\frac{\hat{Y}_{i=1,d=2,a=\text{18-35}}-\hat{Y}_{i=0,d=2,a=\text{18-35}}}{\hat{Y}_{i=0,d=2,a=\text{18-35}}}\]

comparisons(mod,
  variables = "incentive",
  comparison = "lift",
  hypothesis = 1,
  newdata = grid)
 Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 % distance agecat
     1.48      0.211 2.26    0.024  1.06   1.89        2  18-35

Finally, it is useful to note that the comparison argument accepts arbitrary R or Python functions. This is an extremely powerful feature, as it allows analysts to specify fully customized comparisons between a hi prediction (e.g., treatment) and a lo prediction (e.g., control). To illustrate, we compute a log odds ratio based on average predictions:1

lnor <- function(hi, lo) {
  log((mean(hi) / (1 - mean(hi))) / (mean(lo) / (1 - mean(lo))))
}
comparisons(mod,
  variables = "incentive",
  comparison = lnor,
  hypothesis = 1)
 Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
        2     0.0991 10.1   <0.001   1.8   2.19

7.2 Predictors

The predictors in a model can be divided into two categories: focal and adjustment variables. Focal variables are the key predictors in a counterfactual analysis. They are the variables whose effect on (or association with) the outcome we wish to quantify. In contrast, adjustment—or control—variables are incidental to the principal analysis. They can be included in a model to increase its flexibility, improve fit, meet a conditional independence assumption, or to check whether treatment effects vary across subgroups of the population. However, the effect of an adjustment variable is not, in and of itself, of interest in a counterfactual analysis.2

Counterfactual comparisons are conditional quantities, in the sense that they typically depend on the values of all the predictors in a model. Therefore, when we compute a comparison, we need to decide where to evaluate it in the predictor space. We need to decide what values to assign to the focal and adjustment variables.

7.2.1 Focal variables

When estimating counterfactual comparisons, our goal is to determine what happens to the predicted outcome when one or more focal predictors change. Obviously, the kind of change we are interested in depends on the nature of the focal predictors. Let’s consider four common cases in turn: binary, categorical, numeric, and cross-comparisons.

To illustrate each of these cases, we will treat each of the predictors in our model as a focal variable in turn (incentive, agecat, and distance). Note, however, that in typical real-world applications, there is usually one focal variable per statistical model.3

7.2.1.1 Change in binary predictors

By default, when the focal predictor is binary, marginaleffects returns the difference in predicted outcomes associated with a change from the control (0) to the treatment (1) group:

comparisons(mod, variables = "incentive", newdata = grid)
 Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 % distance agecat
    0.465     0.0293 15.9   <0.001 0.408  0.523        2  18-35

If an analyst wants to compute the effect of a change in the opposite direction, they can specify that change explicitly using the list syntax:

comparisons(mod, variables = list("incentive" = c(1, 0)), newdata = grid)
 Estimate Std. Error     z Pr(>|z|)  2.5 % 97.5 % distance agecat
   -0.465     0.0293 -15.9   <0.001 -0.523 -0.408        2  18-35

Moving from the treatment to the control group on the incentive variable is associated with a change of -47 percentage points in the predicted probability that outcome equals 1.

7.2.1.2 Change in categorical predictors

The same approach can be used when we are interested in changes in a categorical variable with multiple levels. For example, if we want to know how changes in the agecat variable would affect the predicted probability that outcome equals 1, we simply type:

comparisons(mod, variables = "agecat", newdata = grid)
    Contrast Estimate Std. Error     z Pr(>|z|)   2.5 %  97.5 % distance incentive
 18-35 - <18  0.00359     0.0294 0.122    0.903 -0.0540  0.0611        2         1
   >35 - <18  0.03587     0.0294 1.221    0.222 -0.0217  0.0935        2         1

We find that moving from the <18 to the 18-35 age bracket increases the predicted probability that outcome equals 1 by 0.4 percentage points, while moving from the <18 to the >35 age bracket increases it by 3.6 percentage points.

By default, the comparisons() function returns comparisons between every level of the categorical predictor and its “reference” or first level, here: <18. We can modify the variables argument to compare specific categories, or to compare each category to its preceding level sequentially (<18 to 18-35, and 18-35 to >35):

# Specific comparison
comparisons(mod, variables = list("agecat" = c("18-35", ">35")))

# Sequential comparisons
comparisons(mod, variables = list("agecat" = "sequential"))

7.2.1.3 Change in numeric predictors

When the focal predictor is numeric, we can once again follow the same steps and call:

comparisons(mod, variables = "distance", newdata = grid)
 Estimate Std. Error     z Pr(>|z|)   2.5 %  97.5 % agecat incentive
  -0.0289    0.00742 -3.89   <0.001 -0.0434 -0.0143  18-35         1

This result says that increasing the distance variable by 1 unit changes the predicted value of outcome by -2.9 percentage points. Importantly, this represents a 1 unit increase in distance from the value of distance in the predictor grid: 2.00.
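To see exactly which counterfactual this represents, we can replicate the estimate with base R commands, as before. This is a sketch which assumes, per the description above, that distance moves from its grid value of 2 to 3:

# Counterfactual grids: distance at its grid value, and one unit higher
g_lo <- grid
g_hi <- transform(grid, distance = 3)

# Difference between the two counterfactual predictions
predict(mod, newdata = g_hi, type = "response") -
  predict(mod, newdata = g_lo, type = "response")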

One may be interested in different magnitudes of change in the focal predictor distance. For example, the effect of a 5 unit (or 1 standard deviation) increase in distance on the predicted value of the outcome. Alternatively, the analyst may want to assess the effect of a change between two specific values of distance, or across the interquartile (or full) range of the data. All of these options are easy to implement using the variables argument:

# Increase of 5 units
comparisons(mod, variables = list("distance" = 5))

# Increase of 1 standard deviation
comparisons(mod, variables = list("distance" = "sd"))

# Change between specific values
comparisons(mod, variables = list("distance" = c(0, 3)))

# Change across the interquartile range
comparisons(mod, variables = list("distance" = "iqr"))

# Change across the full range
comparisons(mod, variables = list("distance" = "minmax"))

7.2.1.4 Cross-comparisons

Sometimes, an analyst wants to assess the joint or combined effect of manipulating two predictors. In a medical study, for example, we may be interested in the change in survival rates for people who both receive a new treatment and make a dietary change. In our running example, it may be interesting to know how much the probability of outcome would change if we modified both the distance and incentive variables simultaneously. To check this, we use the cross argument:

cmp <- comparisons(mod,
  variables = c("incentive", "distance"), 
  cross = TRUE,
  newdata = grid)
cmp
 C: distance C: incentive Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 % distance agecat incentive
          +1        1 - 0    0.437     0.0305 14.3   <0.001 0.377  0.496        2  18-35         1

These results show that a simultaneous increase of 1 unit in the distance variable and a change from 0 to 1 in the incentive variable is associated with a change of 0.437 in the predicted outcome.

7.2.2 Adjustment variables

In a typical counterfactual analysis, the researcher is not interested in changes in the adjustment variables themselves. Nevertheless, since the value of a counterfactual comparison depends on where it is evaluated in the predictor space, it is important to specify the values of all the predictors in a grid.

Much like in Chapter 6, where we computed predictions for different individual profiles, we now estimate counterfactual comparisons based on empirical, interesting, representative, and balanced grids.

7.2.2.1 Empirical distribution

By default, the comparisons() function returns estimates for every single row of the original data frame which was used to fit the model. The Thornton (2008) dataset includes 2825 complete observations (after dropping missing data), so this command will yield 2825 estimates:

comparisons(mod, variables = "incentive")
 Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
    0.458     0.0689  6.64   <0.001 0.323  0.593
    0.463     0.0666  6.95   <0.001 0.332  0.593
    0.488     0.0611  7.99   <0.001 0.368  0.608
    0.482     0.0603  7.99   <0.001 0.364  0.600
    0.471     0.0632  7.45   <0.001 0.347  0.595
    0.423     0.0370 11.43   <0.001 0.350  0.495
    0.416     0.0393 10.59   <0.001 0.339  0.493
    0.423     0.0370 11.43   <0.001 0.350  0.495
    0.428     0.0355 12.05   <0.001 0.358  0.498
    0.454     0.0377 12.02   <0.001 0.380  0.528

If we do not specify the variables argument, comparisons() computes distinct differences for all the variables. Here, there are 4 possible differences (one for incentive, one for distance, and two for agecat), so we get \(4 \times 2825=11300\) rows:

cmp <- comparisons(mod)
nrow(cmp)
[1] 11300

Since the output of comparisons() is a simple data frame, we can easily plot the full distribution of unit-specific risk differences:

library(ggplot2)

cmp <- comparisons(mod)
ggplot(cmp, aes(x = estimate)) +
  geom_histogram(bins = 30) +
  facet_grid(. ~ term + contrast, scales = "free") +
  labs(x = "Estimated change in predicted outcome", y = "Count")
Figure 7.1: Distribution of unit-level risk differences associated with changes in each of the predictors.

The x-axes in Figure 7.1 show the estimated effects of changes in the predictors on the predicted value of the outcome. The y-axes show the prevalence of each estimate across the full sample. There appears to be considerable heterogeneity.

For example, consider the right-most panel, which shows the distribution of unit-level contrasts for the incentive variable. This panel shows that for some participants, the model predicts that moving from the control to the treatment condition would increase the predicted probability that outcome equals one by about 0.5 points. For others, the estimated effect of this change can be as low as 0.4.
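Since cmp is an ordinary data frame, this heterogeneity can also be summarized numerically. For example, a short sketch of the quantiles of the unit-level incentive contrasts:

# Quantiles of unit-level risk differences for the incentive contrast
inc <- subset(cmp, term == "incentive")
quantile(inc$estimate, probs = c(0.05, 0.50, 0.95))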

7.2.2.2 Interesting grid

In some contexts, we want to estimate a contrast for a specific individual with characteristics of interest. To achieve this, we can supply a data frame to the newdata argument.

The code below shows the expected change in the predicted probability of the outcome associated with a change in incentive, for a few individuals with interesting characteristics. As in Chapter 6, we use the datagrid() function as a convenient mechanism to create a grid of profiles of interest:

comparisons(mod,
  variables = "incentive",
  newdata = datagrid(agecat = unique, distance = mean))

 agecat distance Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
    <18     2.01    0.479     0.0608  7.87   <0.001 0.360  0.598
  18-35     2.01    0.466     0.0293 15.87   <0.001 0.408  0.523
    >35     2.01    0.438     0.0342 12.79   <0.001 0.371  0.505

Notice that the estimated effects of incentive on the predicted probability that outcome=1 differ depending on the age bin to which a study participant belongs. Indeed, our model estimates that the difference between treatment and control would be 46.6 percentage points in the middle age bin, but 43.8 percentage points in the older age bin.

7.2.2.3 Representative grid

Yet another common alternative is to compute a comparison or risk difference “at the mean.” The idea is to create a “representative” or “synthetic” profile for an individual whose characteristics are completely average or modal. Then, we report the comparison for this specific hypothetical individual. For example:

comparisons(mod, variables = "incentive", newdata = "mean")

 Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 % agecat distance
    0.466     0.0293 15.9   <0.001 0.408  0.523  18-35     2.01

The main advantage of this approach is that it is fast and cheap from a computational standpoint. The disadvantage is that the interpretation is somewhat ambiguous. Often, the population includes no individual at all who is perfectly average across all dimensions, and it is not always clear why the analyst should be especially interested in such an individual. This matters because, in some cases, a “comparison at the mean” can differ significantly from an “average comparison” (Section 7.3).
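To see the practical difference, one can compute both quantities side by side. A brief sketch, using the avg_comparisons() function introduced in Section 7.3:

# Comparison "at the mean": one synthetic, perfectly average individual
comparisons(mod, variables = "incentive", newdata = "mean")

# Average comparison: unit-level comparisons, averaged over the sample
avg_comparisons(mod, variables = "incentive")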

7.2.2.4 Balanced grids

Balanced grids, introduced in Section 4.2, include all unique combinations of categorical variables, while holding numeric variables at their means. This is particularly useful in experimental contexts, where the sample is not representative of the target population, and where we want to treat each combination of treatment conditions similarly.

comparisons(mod, variables = "incentive", newdata = "balanced")

 Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 % agecat distance
    0.479     0.0608  7.87   <0.001 0.360  0.598    <18     2.01
    0.466     0.0293 15.87   <0.001 0.408  0.523  18-35     2.01
    0.438     0.0342 12.79   <0.001 0.371  0.505    >35     2.01
    0.479     0.0608  7.87   <0.001 0.360  0.598    <18     2.01
    0.466     0.0293 15.87   <0.001 0.408  0.523  18-35     2.01
    0.438     0.0342 12.79   <0.001 0.371  0.505    >35     2.01

7.3 Aggregation

As discussed above, the default behavior of comparisons() is to estimate quantities of interest for all the actually observed units in our dataset, or for each row of the dataset supplied to the newdata argument. Sometimes, it is convenient to marginalize those conditional estimates to obtain average (or marginal) estimates.

Several key quantities of interest can be expressed as average counterfactual comparisons. For example, when certain assumptions are satisfied, the average treatment effect of \(X\) on \(Y\) can be defined as the expected difference between outcomes under treatment and under control:

\[ E[Y_{X=1} - Y_{X=0}], \]

where the expectation is taken over the distribution of adjustment variables. In the chapter on G-computation, we will see that estimands like the Average Treatment Effect on the Treated (ATT) or the Average Treatment Effect on the Untreated (ATU) can be defined analogously.

To compute an average counterfactual comparison, we proceed in four steps, sketched in base R after the list:

  1. Compute predictions for every row of the dataset in the counterfactual world where all observations belonged to the treatment condition.
  2. Compute predictions for every row of the dataset in the counterfactual world where all observations belonged to the control condition.
  3. Take the differences between the two vectors of predictions.
  4. Average the unit-level estimates across the whole dataset, or within subgroups.
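Here is a minimal base R sketch of those four steps, using the model and data from above. Rows with missing predictor values would yield NA predictions, hence na.rm = TRUE:

# Steps 1 and 2: predictions for every row, in each counterfactual world
p_treatment <- predict(mod, newdata = transform(dat, incentive = 1), type = "response")
p_control <- predict(mod, newdata = transform(dat, incentive = 0), type = "response")

# Steps 3 and 4: unit-level differences, averaged across the whole dataset
mean(p_treatment - p_control, na.rm = TRUE)

This should essentially reproduce the avg_comparisons() estimate shown below.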

Previously, we called the comparisons() function to compute unit-level counterfactual comparisons. To return average estimates, we call the same function, with the same arguments, but add the avg_ prefix:

avg_comparisons(mod, variables = "incentive")
 Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
    0.452     0.0208 21.8   <0.001 0.411  0.493

On average, across all participants in the study, moving from the control to the treatment group is associated with a change of 45.2 percentage points in the predicted probability that outcome equals one. This result is equivalent to computing unit-level estimates and taking their mean:

cmp <- comparisons(mod, variables = "incentive")
mean(cmp$estimate)
[1] 0.45192

Using the by argument, we can compute the average risk difference with respect to incentive, for each age subcategory:

avg_comparisons(mod, variables = "incentive", by = "agecat")
 agecat Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
    <18    0.475     0.0605  7.85   <0.001 0.356  0.593
  18-35    0.461     0.0290 15.89   <0.001 0.404  0.518
    >35    0.435     0.0338 12.84   <0.001 0.368  0.501

On average, for participants in the <18 age bracket, moving from the control to the treatment group is associated with a change of 47.5 percentage points in the predicted probability that outcome equals one. This average risk difference is estimated at 43.5 percentage points for participants above 35 years old.

We can also compute an average risk difference for individuals with specific profiles, by specifying the grid of predictors with the newdata argument. For example, if we are interested in an average treatment effect that only applies to study participants who actually belonged to the treatment group, we can call (see the chapter on G-computation):4

avg_comparisons(mod, variables = "incentive", newdata = subset(dat, incentive == 1))
 Estimate Std. Error    z Pr(>|z|) 2.5 % 97.5 %
    0.452     0.0208 21.8   <0.001 0.411  0.493

7.3.1 Average predictions vs. average comparisons

It is important not to confuse average comparisons with differences in average predictions. Average predictions are taken over the observed distribution of covariates in each subset of interest; if the subsets have different characteristics, a difference in average predictions may reflect those characteristics rather than the effect of the grouping variable itself. An average comparison, by contrast, is a counterfactual, ceteris paribus quantity: we change the focal predictor for each unit while holding all of its other characteristics fixed, and then average the unit-level estimates.
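To make the distinction concrete, here is a short sketch comparing the two quantities with the model from above; the avg_predictions() call is an illustration rather than a recommended analysis:

# Average predictions within the observed treatment and control groups:
# differences may partly reflect the composition of each group
avg_predictions(mod, by = "incentive")

# Average counterfactual comparison: the same units appear in both
# counterfactual worlds, so only incentive changes (ceteris paribus)
avg_comparisons(mod, variables = "incentive")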

7.4 Uncertainty

By default, the standard errors around contrasts are estimated using the delta method (Section 4.4.1) and the classical variance-covariance matrix supplied by the modelling software. For many common statistical models, we can rely on the sandwich or clubSandwich packages to report “robust” standard errors, confidence intervals, and \(p\) values (Zeileis, Köll, and Graham 2020; Pustejovsky 2023). We can also use the inferences() function to compute bootstrap or simulation-based estimates of uncertainty.

Using the modelsummary package, we can report estimates with different uncertainty estimates in a single table, with models displayed side-by-side (Table 7.1).5

library(modelsummary)

models <- list(
  "Heteroskedasticity" = avg_comparisons(mod, vcov = "HC3"),
  "Clustered" = avg_comparisons(mod, vcov = ~village),
  "Bootstrap" = avg_comparisons(mod) |> inferences("boot")
)

modelsummary(models,
  statistic = "conf.int", fmt = 4, gof_map = "nobs",
  shape = term + contrast ~ model
)
Table 7.1: Alternative ways to compute uncertainty about counterfactual comparisons.
                                    Heteroskedasticity   Clustered            Bootstrap
agecat    mean(18-35) - mean(<18)     0.0065               0.0065               0.0065
                                     [-0.0457, 0.0587]    [-0.0468, 0.0599]    [-0.0421, 0.0608]
          mean(>35) - mean(<18)       0.0449               0.0449               0.0449
                                     [-0.0079, 0.0977]    [-0.0033, 0.0931]    [-0.0038, 0.0996]
distance  mean(+1)                   -0.0303              -0.0303              -0.0303
                                     [-0.0427, -0.0179]   [-0.0449, -0.0157]   [-0.0426, -0.0177]
incentive mean(1) - mean(0)           0.4519               0.4519               0.4519
                                     [0.4110, 0.4928]     [0.4094, 0.4945]     [0.4100, 0.4932]
Num.Obs.                              2825                 2825                 2825

7.5 Test

Earlier in this chapter, we estimated average counterfactual comparisons for different subgroups of the data. Now, we will see how to conduct hypothesis tests on these estimates, in order to compare them to one another.

Imagine one is interested in the following question:

Does moving from incentive=0 to incentive=1 have a bigger effect on the predicted probability that outcome=1 for younger or older participants?

To answer this, we can start by using avg_comparisons() and its by argument to estimate subgroup-specific average risk differences:

cmp <- avg_comparisons(mod,
  variables = "incentive",
  by = "agecat")
cmp
 agecat Estimate Std. Error     z Pr(>|z|) 2.5 % 97.5 %
    <18    0.475     0.0605  7.85   <0.001 0.356  0.593
  18-35    0.461     0.0290 15.89   <0.001 0.404  0.518
    >35    0.435     0.0338 12.84   <0.001 0.368  0.501


At first glance, it looks like the estimated difference is larger for participants in the first age bin (0.475) than for participants in the last age bin (0.435). In other words, it looks like the effect of incentive on the probability of seeking to learn one’s HIV status is stronger for younger participants than for older ones.

The difference between estimated treatment effects in those two groups is:

\[ 47.46 - 43.46 = 4.00 \]

But is this difference statistically significant? Does our data allow us to conclude that incentive has different effects in age subgroups?

To answer this question, we follow the process laid out in Chapter 5, and express a test of equality as a string formula, where b1 identifies the estimate in the first row and b3 in the third row:

avg_comparisons(mod,
  hypothesis = "b1 - b3 = 0",
  variables = "incentive",
  by = "agecat")
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.04 0.0693 0.577 0.564 -0.0958 0.176
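This estimate is the difference between the two subgroup risk differences (0.475 - 0.435 = 0.04), and the \(z\) statistic is the estimate divided by its standard error:

\[ z = \frac{0.04}{0.0693} \approx 0.58 \]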

The \(p\) value is large, which means that we cannot reject the null hypothesis that the effect of incentive on predicted outcome is the same for people in both age brackets. Our estimated risk differences are not statistically distinguishable from one another.

7.6 Visualization

The plot_comparisons() function can plot comparisons, differences, risk ratios, etc. Its by argument displays average comparisons for subgroups, and its condition argument displays comparisons across the range of a continuous predictor:

plot_comparisons(mod, variables = "incentive", by = "agecat")

plot_comparisons(mod, variables = "incentive", condition = "distance")


  1. We take the log odds ratio of the averages, rather than the average log odds ratio, because odds ratios are non-collapsible.↩︎

  2. Interpreting the parameters associated with adjustment variables as measures of association or effect is generally not recommended. See the discussion of the Table 2 fallacy in Section 1.2 and Westreich and Greenland (2013).↩︎

  3. See Section 1.1.3 for a discussion of the problems that can arise when multiple focal variables are included in a single model. Also see Goldsmith-Pinkham, Hull, and Kolesár (Forthcoming).↩︎

  4. In this example, computing the average across the full empirical distribution or in the subset of actually treated units does not make much difference. This will not always be the case.↩︎

  5. The statistic argument allows us to display confidence intervals instead of the default standard errors. The fmt argument determines the number of digits to display. gof_map = "nobs" keeps only the number of observations at the bottom of the table, omitting other goodness-of-fit statistics. shape indicates that models should be displayed as columns and terms and contrasts as structured rows.↩︎