```r
library(marginaleffects)
library(tinytable)
set.seed(1024)
N <- 10
dat <- data.frame(
  Num = rnorm(N),
  Bin = rbinom(N, size = 1, prob = 0.5),
  Cat = sample(c("A", "B", "C"), size = N, replace = TRUE)
)
```
3 Conceptual framework
This chapter introduces a conceptual framework to interpret results from a wide variety of statistical and machine learning models. The framework is model-agnostic and based on a simple idea: analysts should transform parameter estimates or predictions from fitted models into quantities that are readily interpretable and make intuitive sense to stakeholders. By applying a simple and consistent post-estimation workflow, researchers can easily go from model to meaning.
To define our estimands we will ask three critical questions:
- Quantity: What is the quantity of interest?
- Predictors: What predictor values are we interested in?
- Aggregation: Do we report unit-level or aggregated estimates?
With a clear description of the empirical estimand in hand, we will then ask two further questions:
- Uncertainty: How do we quantify uncertainty about our estimates?
- Test: Which hypothesis or equivalence tests are relevant?
These five questions give us a structured way to think about the quantities and tests that matter. They lead us to clear definitions of those quantities, dispel terminological ambiguity, and indicate which software is needed to run appropriate calculations. The rest of this chapter surveys each of the five questions.
3.1 Quantity
The parameter estimates obtained by fitting a statistical model can often be difficult to interpret. In many contexts, it helps to transform them into quantities with a more natural and domain-relevant meaning.
For example, an analyst who fits a logistic regression model obtains coefficient estimates expressed as log odds ratios. For most people, this scale is extremely difficult to reason about (see Section 1.3). Instead of working with logit coefficients directly, we can convert them into more intuitive quantities, like predicted probabilities or risk differences.
This section discusses the statistical justification for this kind of post-estimation transformation, and introduces three classes of transformed quantities of interest: predictions, counterfactual comparisons, and slopes. These quantities are model-agnostic; they can help make sense of the estimates produced by a wide variety of statistical and machine learning models.
3.1.1 Post-estimation transformations are justified
Statistical theory offers two major justifications for post-estimation transformations: the plug-in principle and the invariance property of maximum likelihood estimators (MLE).
The intuition behind the plug-in principle is simple: to infer some feature of a population, we can study the same feature in a sample, and “plug in” our sample estimate in lieu of the population value (Aronow and Miller 2019).
To formalize this idea a bit, consider a sample \(Y_1, \ldots, Y_n\) drawn from a distribution \(F\). For example, \(F\) could represent the distribution of ages in a population, and \(Y_1\) could be the observed age of a single individual.
Imagine that we care about a statistical functional1 \(\psi\) of this distribution function: \(\theta=\psi(F)\). In this context, \(\theta\) could represent the average age of the population, the distribution variance, or the coefficient of a regression model designed to capture the data generating process for \(F\). We cannot measure every individual in the population, so we cannot observe the true \(F\), nor can we compute \(\psi(F)\) directly. Thankfully, we can observe the empirical distribution function \(\hat{F}_n\) in a sample,2 that is, we can observe the actual distribution of ages in a randomly selected group of \(n\) individuals.
The plug-in principle states that if \(\theta = \psi(F)\) is a statistical functional of the probability distribution \(F\), and if some mild regularity conditions are satisfied,3 we can estimate \(\theta\) using the sample analogue \(\hat{\theta} = \psi(\hat{F}_n)\). As Aronow and Miller (2019) note, the implications for statistical practice are profound: as the number of observations increases, the empirical distribution function tends to approximate the population distribution function, and our estimate \(\hat{\theta}\) tends to approach \(\theta\).
These ideas empower us to target a broad array of quantities of interest: \(\theta\) could be as simple as the distribution mean; \(\theta\) could be a regression coefficient; or \(\theta\) could be a more interesting quantity, defined as a function of regression coefficients. The plug-in principle thus justifies the workflow proposed in this book. First, we fit a regression model. Then, we transform coefficient estimates into more meaningful quantities. Finally, we interpret those quantities as sample analogues to population characteristics.
A second way to motivate the transformation of parameter estimates is to draw on the invariance property of MLE. This is a fundamental concept in statistical theory, which highlights the efficiency and flexibility of MLEs. The invariance property can be stated as follows: if \(\hat{\beta}\) is the MLE estimate of \(\beta\), then for any function \(\lambda(\beta)\), the MLE of \(\lambda(\beta)\) is \(\lambda(\hat{\beta})\) (Berger and Casella 2024, 259). In other words, the desirable properties of MLEs—consistency, efficiency, and asymptotic normality—are preserved under transformation.
This property is useful in practice because it simplifies the estimation of functions of parameters. For example, let’s say that we specify a regression model to capture the effect of nutrition on children’s heights. We estimate the coefficient \(\beta\) of this model via maximum likelihood, and obtain an estimate \(\hat{\beta}\). Chapter 5 shows how one could apply a function \(\lambda\) to the coefficient \(\beta\), in order to compute the predicted (or expected) value of the outcome variable (height) for a given value of the explanatory variable (nutrition). The invariance property says that if the coefficient estimate \(\hat{\beta}\) has the desirable properties of a maximum likelihood estimate, then the prediction \(\lambda(\hat{\beta})\) inherits those qualities.
In sum, it is often useful to apply post-estimation transformations to the parameter estimates obtained by fitting a statistical model, in order to obtain more meaningful and interpretable quantities. This approach is well-grounded in statistical theory, via the plug-in principle and the invariance property of MLE.
3.1.2 Predictions, counterfactual comparisons, and slopes
To interpret the results of statistical and machine learning models, we will compute and report three broad classes of quantities of interest. This section offers a brief survey of these quantities. A full chapter is dedicated to each of them in Part II of this book (chapters 5, 6, and 7).
Predictions
The first class of transformations to consider is the prediction:
A prediction is the expected outcome of a fitted model for a given combination of predictor values.
Imagine that we fit a statistical model by minimizing some loss function \(\mathcal{L}\). For example, we may choose an estimator that minimizes the sum of squared errors or the negative log-likelihood. The estimated coefficients that allow us to find this minimum can be written as follows.
\[ \hat{\beta} = \argmin_{\beta} \; \mathcal{L}(Y, X, \mathbf{Z}; \beta), \tag{3.1}\]
where \(\beta\) is a vector of coefficients; \(Y\) is the outcome variable; \(X\) is a focal predictor which holds particular scientific interest; and \(\mathbf{Z}\) is a vector of control variables. In Equation 3.1, the \(\argmin\) operator simply means that our estimation procedure is designed to find the values of \(\beta\) that minimize the errors or loss.
Once we obtain estimates of the \(\hat{\beta}\) parameters, we can use them to compute predictions for particular values of the focal predictor \(X\) and control variables \(\mathbf{Z}\):
\[ \hat{Y} = \Phi(X=x, \mathbf{Z}=\mathbf{z}; \hat{\beta}), \tag{3.2}\]
where \(\Phi\) is a function that transforms parameters and predictors into expected outcomes (i.e., predictions).
Imagine that we fit a model to describe children’s heights (\(Y\)) as a function of age (\(X\)), calorie intake (\(Z\)), and three regression coefficients: \(\beta_0\), \(\beta_1\), and \(\beta_2\). Using the estimated coefficients and an appropriate function \(\Phi\), we could compute the predicted (or expected) height of an 11 year-old child who eats 1800 calories per day, by plugging that child’s characteristics into Equation 3.2.
\[ \Phi(\beta_0 + \beta_1 \cdot 11 + \beta_2 \cdot 1800) \tag{3.3}\]
In Chapter 5, we will work through an example like this in the context of logistic regression, using real data and analysis code.
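To make this concrete, here is a minimal sketch in R with simulated data; the kids dataset and model below are hypothetical and serve only to illustrate the computation in Equation 3.3 (for a linear model, \(\Phi\) is simply the identity function applied to the linear predictor).

```r
# hypothetical data: children's heights as a function of age and calorie intake
set.seed(48103)
kids <- data.frame(
  age = sample(5:15, 100, replace = TRUE),
  calories = runif(100, min = 1200, max = 2500)
)
kids$height <- 80 + 5 * kids$age + 0.01 * kids$calories + rnorm(100, sd = 5)

# fit the model, then compute the predicted height of an 11 year-old child
# who eats 1800 calories per day
mod <- lm(height ~ age + calories, data = kids)
predictions(mod, newdata = data.frame(age = 11, calories = 1800))
```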
Predictions are a fundamental building block of the framework and workflow developed in this book. Indeed, the epigraph of this chapter could be extended to say: “parameters are a resting stone on the road to prediction, and predictions are a resting stone on the way to counterfactual comparisons.”
Counterfactual comparisons
The second class of transformations to consider is the counterfactual comparison:
A counterfactual comparison is a function of two predictions made with different predictor values.
Imagine that we are interested in the effect of a change from 0 (control group) to 1 (treatment group) in the focal predictor \(X\) on the predicted outcome \(\hat{Y}\), holding control variables \(\mathbf{Z}\) to some fixed values \(\mathbf{z}\). First, we compute two counterfactual predictions:
\[\begin{align*} \hat{Y}_{X=1,\mathbf{Z}=\mathbf{z}} = \Phi(X=1, \mathbf{Z}=\mathbf{z}; \hat{\beta}) && \mbox{Treatment}\\ \hat{Y}_{X=0,\mathbf{Z}=\mathbf{z}} = \Phi(X=0, \mathbf{Z}=\mathbf{z}; \hat{\beta}) && \mbox{Control} \end{align*}\]
Then, we choose a function to compare the relative size of these two predictions. In the simplest case, the comparison function could be a simple difference:
\[\begin{align*} \hat{Y}_{X=1,\mathbf{Z}=\mathbf{z}} - \hat{Y}_{X=0,\mathbf{Z}=\mathbf{z}} \end{align*}\]
Chapter 6 shows that this is a versatile strategy, which can be extended in at least three directions. First, depending on the model type, the predicted outcome \(\hat{Y}\) can be expressed on a variety of scales. In linear regression, \(\hat{Y}\) is a real number. In logistic or Poisson models, the outcome would typically be a predicted probability or expected count.4
Second, the analyst can select different counterfactual values of the focal predictor \(X\). In the example above, \(X\) was fixed to 0 or 1. In Chapter 6, we will use different counterfactual values to estimate how the predicted value \(\hat{Y}\) reacts when \(X\) changes by 1 or more units (e.g. one standard deviation), or from one category to another.
Third, a counterfactual comparison can be expressed as any function of two predictions. It can be a difference \(\hat{Y}_{X=1,\mathbf{Z}=\mathbf{z}}-\hat{Y}_{X=0,\mathbf{Z}=\mathbf{z}}\), a ratio \(\frac{\hat{Y}_{X=1,\mathbf{Z}=\mathbf{z}}}{\hat{Y}_{X=0,\mathbf{Z}=\mathbf{z}}}\), the lift \(\frac{\hat{Y}_{X=1,\mathbf{Z}=\mathbf{z}}-\hat{Y}_{X=0,\mathbf{Z}=\mathbf{z}}}{\hat{Y}_{X=0,\mathbf{Z}=\mathbf{z}}}\), or even more complex (and unintelligible) functions like odds ratios.
Importantly, a counterfactual comparison need not be causal. The analyst can give it a purely descriptive or associational interpretation if they wish. Indeed, a counterfactual comparison can be seen simply as a measure of the association between changes in \(X\) and \(\hat{Y}\), that is, as the relationship between a model’s inputs and outputs. However, when the conditions for causal identification are satisfied, a counterfactual comparison can be interpreted as a measure of the effect of \(X\) on \(Y\) (see Section 1.1.3).
To sum up, the idea of a counterfactual comparison covers a broad class of quantities of interest, which includes many of the most common ways to quantify the association between \(X\) and \(Y\), or the effect of \(X\) on \(Y\): contrasts, differences, risk ratios, lift, etc.
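As a minimal sketch, assume an illustrative model fitted to the simulated dat from the top of this chapter, with Num as the outcome, Bin as the focal (treatment) predictor, and Cat as a control; this model is not part of the book's running examples. The difference described above can be computed by hand from two predictions, or in one step with the comparisons() function:

```r
# illustrative model: Num as outcome, Bin as focal predictor, Cat as control
mod <- lm(Num ~ Bin + Cat, data = dat)

# two counterfactual predictions, holding the control Cat at "A"
treated <- predictions(mod, newdata = data.frame(Bin = 1, Cat = "A"))
control <- predictions(mod, newdata = data.frame(Bin = 0, Cat = "A"))

# counterfactual comparison expressed as a difference
treated$estimate - control$estimate

# the same difference, computed directly
comparisons(mod, variables = "Bin", newdata = data.frame(Bin = 0, Cat = "A"))
```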
Slopes
The third class of transformations to consider is the slope:
A slope is the partial derivative of the regression equation with respect to a focal predictor.
The slope is the rate at which a model’s predictions change when we modify a focal predictor by a small amount, while keeping other predictors constant. It measures the strength of association between the outcome variable and a predictor, ceteris paribus. When causal identification conditions are satisfied, the slope will often be interpreted as the effect of a focal predictor \(X\) on outcome \(Y\) (Section 1.1.3).
In some disciplines like economics and political science, slopes are known as “marginal effects,” where the word “marginal” indicates that we are looking at the effect of a very small change in one of the predictors (see Section 1.2 for an attempt to disambiguate). Chapter 7 offers a lot more intuition about the nature of slopes or partial derivatives, and it works through several concrete examples.
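A minimal sketch, again with an illustrative model fitted to the simulated dat (a logistic regression of Bin on Num, chosen purely to demonstrate the functions):

```r
# illustrative logistic regression: probability that Bin = 1 as a function of Num
mod <- glm(Bin ~ Num, data = dat, family = binomial)

# slope of the predicted probability with respect to Num, evaluated at Num = 0
slopes(mod, variables = "Num", newdata = data.frame(Num = 0))

# average of the unit-level slopes across the observed data
avg_slopes(mod, variables = "Num")
```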
3.2 Predictors
Predictions, counterfactual comparisons, and slopes are conditional quantities, which means that their values typically depend on all the predictors in a model.5 Whenever an analyst reports one of these statistics, they must disclose where it was evaluated in the predictor space. They must answer questions like:
Is the prediction, comparison, or slope computed for a 20 year old student in the treatment group, or for a 50 year old literary critic in the control group?
Answers to questions like this one can be expressed in terms of “profiles” and “grids.” A profile is a combination of predictor values: \(\{X,\mathbf{Z}\}\).6 A grid is a collection of one or more profiles.
Profiles can be observed, synthetic, or partially synthetic. An observed profile is a combination of predictor values that actually appears in the sample. It represents the characteristics of an observed individual or unit in our dataset. A synthetic profile represents a purely hypothetical individual with representative or interesting characteristics, like sample means or modes. A partially synthetic profile combines the two approaches. Here, we take an actual observation from our dataset, and modify a focal predictor so that it takes on some meaningful value. For example, imagine that we draw one observation with these characteristics from a dataset: {Age=25, City=Montreal, Group=Control}. We can use this information and a fitted model to compute a prediction for this observed individual. Alternatively, we could modify the Group value to “Treatment” rather than “Control”, and compute a prediction for the modified, partially synthetic, profile. Doing so would allow us to answer a counterfactual “what if” question: What would have been the expected outcome for the 25 year old Montrealer, if they had been part of the treatment group rather than the control group?
A grid is a collection of one or more profiles. It can hold the empirical distribution of actually observed profiles, or a set of synthetic profiles that hold particular scientific interest. Defining the grid of predictors is a crucial step in model interpretation. Unless we specify a grid, we cannot clearly define the quantity of interest—or estimand—that sheds light on a research question (see Section 1.2). By defining a grid, the analyst gives an explicit answer to this question: what kinds of people or units do estimates apply to?
In the rest of this section, we use the datagrid() function from the marginaleffects package to construct and discuss a variety of grids. We will build these grids by reference to a simulated dataset with 10 observations on three variables: one numeric, one binary, and one categorical. The code examples below rely on the “pipe” operator |> to send the output of datagrid() to the tt() function from the tinytable package, which displays a nicely formatted table for each grid.
3.2.1 Empirical grid
The first type of grid to consider is the most obvious: the empirical distribution, which is simply the observed dataset. It is a grid composed of all the actually observed profiles in our sample.
```r
dat |> tt()
```
| Num     | Bin | Cat |
|---------|-----|-----|
| -0.7787 | 0   | C   |
| -0.3895 | 0   | C   |
| -2.0338 | 0   | C   |
| -0.9824 | 0   | A   |
| 0.2479  | 1   | A   |
| -2.1039 | 1   | C   |
| -0.3814 | 1   | B   |
| 2.0749  | 0   | C   |
| 1.0271  | 0   | A   |
| 0.473   | 1   | B   |
Computing predictions on the empirical distribution is common practice in data analysis, as it yields one “fitted value” for each observation in the sample. Estimating counterfactual comparisons or slopes on the empirical distribution allows us to study how manipulating a focal predictor affects the predicted outcome, for observed units with different characteristics.
Using the empirical distribution as a grid makes most sense when the observed sample is representative of the population that the analyst targets for inference. When working with convenience samples with very different characteristics from the population, it may make sense to use one of the grids described below or apply weights as described in Chapter 11.
3.2.2 Interesting grid
If the analyst is interested in units with specific profiles, they can use the datagrid() function to create fully customized grids of “interesting” predictor values. This is useful when one wants to compute a prediction or slope for an individual with given characteristics, such as a 50 year-old engineer from Belgium. By default, datagrid() fixes all variables to their means or modes, unless the user explicitly specifies a value for a variable.
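For instance, a call of the following form builds a one-row grid in which only Bin is set explicitly (the value 1 is illustrative):

```r
# fix Bin explicitly; the other variables default to their mean or mode
datagrid(Bin = 1, newdata = dat) |> tt()
```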
Note that the continuous Num variable was fixed to its mean value, and the categorical Cat variable to its modal value, because neither variable was specified explicitly by the user.
datagrid() also accepts functions to be applied to the variables in the original dataset. In R, the range function returns a vector of two values: the minimum and the maximum of a variable; the mean function returns a single value; and unique returns a vector of all unique values from a variable. With the following function call, we get a grid with \(2\times 1 \times 3=6\) rows.
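A call consistent with that description applies range to Num, mean to Bin, and unique to Cat:

```r
# 2 x 1 x 3 = 6 rows: range of Num, mean of Bin, unique values of Cat
datagrid(Num = range, Bin = mean, Cat = unique, newdata = dat) |> tt()
```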
3.2.3 Representative grid
Analysts often want to use representative grids, where predictors are fixed at representative values such as their mean or mode. This is useful when the analyst wants to compute predictions, comparisons, or slopes for a “typical” or “average” unit in the sample.
In Part II of the book, we will see how representative grids allow us to compute useful quantities such as “fitted value at the mean” or “marginal effect at the mean.” In both cases, the quantities of interest are evaluated at a “central” or “representative” point in the predictor space.
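Since fixing variables at their means or modes is the default behaviour of datagrid(), a call with no explicit values returns a one-row representative grid:

```r
# numeric variables at their means, categorical variables at their modes
datagrid(newdata = dat) |> tt()
```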
3.2.4 Balanced grid
A “balanced grid” is a grid built from all unique combinations of categorical variables, and where all numeric variables are held at their means.
To create a balanced grid, we fix numeric variables at their means, and create rows for each combination of values of the categorical variables (i.e., the cartesian product). In the next table, Num is held at its mean, and the grid includes all combinations of unique values of Bin and Cat.
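One way to request such a grid is through the grid_type argument of datagrid(); the call below is a sketch along those lines:

```r
# balanced grid: all combinations of Bin and Cat, with Num held at its mean
datagrid(newdata = dat, grid_type = "balanced") |> tt()
```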
| Num     | Bin | Cat | rowid |
|---------|-----|-----|-------|
| -0.2847 | 0   | C   | 1     |
| -0.2847 | 0   | A   | 2     |
| -0.2847 | 0   | B   | 3     |
| -0.2847 | 1   | C   | 4     |
| -0.2847 | 1   | A   | 5     |
| -0.2847 | 1   | B   | 6     |
Balanced grids are often used to analyze the results of factorial experiments in convenience samples, where the empirical distribution of predictor values is not representative of the target population. In that context, the analyst may wish to report estimates for each combination of treatments, that is, for each cell in a balanced grid. We will see balanced grids in action in Section 5.3, when computing marginal means.7
3.2.5 Counterfactual grid
The last type of grid to consider is the counterfactual. Here, we duplicate the entire dataset, and create one copy for every combination of values that the analyst supplies.
In this example, we specify that Bin must take on values of 0 or 1. Therefore, we create two new copies of the full dataset. In the first copy, Bin is fixed to 0; in the second copy, Bin is fixed to 1. All other variables are held at their observed values. Since the original data had 10 rows, the counterfactual grid will have 20 rows.
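A call along these lines builds the counterfactual grid (the object name cf is ours):

```r
# duplicate the full dataset, once with Bin = 0 and once with Bin = 1
cf <- datagrid(Bin = c(0, 1), newdata = dat, grid_type = "counterfactual")
nrow(cf)
```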
We can inspect the first three rows of each counterfactual version of the data by filtering on the rowidcf column, which holds the original row indices.
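Continuing with the cf object created above:

```r
# keep the first three original rows of each counterfactual copy
cf |> subset(rowidcf <= 3) |> tt()
```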
| rowidcf | Num     | Cat | Bin |
|---------|---------|-----|-----|
| 1       | -0.7787 | C   | 0   |
| 2       | -0.3895 | C   | 0   |
| 3       | -2.0338 | C   | 0   |
| 1       | -0.7787 | C   | 1   |
| 2       | -0.3895 | C   | 1   |
| 3       | -2.0338 | C   | 1   |
Notice that each row has an exact duplicate which differs only in terms of the counterfactual Bin variable. This kind of duplication is useful for the counterfactual analyses we will conduct in the chapters on counterfactual comparisons (Chapter 6) and on causality and G-computation.
3.3 Aggregation
As noted above, predictions, counterfactual comparisons, and slopes are conditional quantities, in the sense that they typically depend on the values of all predictors in a model. When we compute these quantities over a grid of predictor values, we obtain a collection of point estimates: one for each row of the grid.
If a grid has many rows, the many estimates that we generate can become unwieldy, making it more difficult to extract clear and meaningful insights from our models. To simplify the presentation of results, analysts might want to aggregate unit-level estimates to report more macro level summaries. We can think about several aggregation strategies:
- No aggregation: Report all unit-level estimates.
- Overall average: Report the average of all unit-level estimates.
- Subgroup averages: Report the average of unit-level estimates within subgroups.
- Weighted averages: Report the average of unit-level estimates, weighted by sampling probability or inverse probability weights.
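To make these strategies concrete, here is a brief sketch using an illustrative model fitted to the simulated dat from the top of this chapter; the model itself is arbitrary, the point is the aggregation interface:

```r
# illustrative model: Num as outcome, Bin as focal predictor, Cat as control
mod <- lm(Num ~ Bin + Cat, data = dat)

# no aggregation: one counterfactual comparison per row of the original data
comparisons(mod, variables = "Bin")

# overall average of the unit-level comparisons
avg_comparisons(mod, variables = "Bin")

# subgroup averages, one per level of the categorical variable
avg_comparisons(mod, variables = "Bin", by = "Cat")
```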
Aggregated estimates are useful and common in most scientific fields. In Chapter 5, we will compute average predictions, or “marginal means.” In Chapter 6, we will compute average counterfactual comparisons which, under the right conditions, can be interpreted as average treatment effects. In the chapter on causality and G-computation, we will see that aggregating comparisons over different grids allows us to compute quantities such as the Average Treatment effect on the Treated (ATT) or the Average Treatment effect on the Untreated (ATU). In Chapter 7, we will compute average slopes, or “average marginal effects.”
Such aggregated estimates tend to be easier to interpret, and they can often be estimated with greater precision than unit-level quantities. However, they also come with important downsides. Indeed, aggregated estimates can mask interesting variation across the sample. For example, imagine that the effect of a treatment is strongly negative for some individuals, but strongly positive for others. In that case, an estimate of the average treatment effect could be close to 0 because positive and negative estimates cancel out. This aggregated quantity would not adequately capture the nature of the data generating process, and it could lead to misleading conclusions.
3.4 Uncertainty
Whenever we report quantities of interest derived from a statistical model, it is essential to provide estimates of our uncertainty. Without standard errors or confidence intervals, readers cannot confidently assess if the reported values or relationships are genuine, or if they could merely be the product of chance.
The marginaleffects package offers four primary methods for quantifying uncertainty: the delta method (the default), the bootstrap, simulation-based inference, and conformal prediction.
The delta method is a statistical technique which can be used to approximate the variance of a function of random variables. Since all the quantities of interest described in this book can be seen as transformations of model parameters, the delta method can be used to compute standard errors for predictions, counterfactual comparisons, and slopes. The delta method has many advantages: it is fast, flexible, and it can be paired with “robust” variance estimates to account for heteroskedasticity, autocorrelation, or clustering. The main disadvantages of the delta method are that it relies on a coarse linear approximation; that it assumes that a model’s parameters are asymptotically normal; and that it requires functions of parameters to be continuously differentiable.
The second strategy for uncertainty quantification is the bootstrap, a resampling-based technique that generates empirical distributions for estimated quantities by repeatedly sampling from the observed data. The bootstrap is an extremely flexible approach, which can be very useful when the delta method’s assumptions are not met.
The third method, simulation-based inference, involves drawing simulated parameter values from the estimated model to compute quantities of interest and their uncertainty. This strategy is particularly effective when working with complex models or when estimating functions of multiple parameters. It also provides an intuitive way to visualize uncertainty through predictive distributions.
The fourth method, conformal prediction, is a flexible framework for uncertainty quantification that provides valid prediction—rather than confidence—intervals under minimal assumptions. Unlike traditional approaches that rely heavily on model-specific assumptions or asymptotic approximations, conformal prediction uses split sample strategies to construct intervals that are guaranteed to cover a given share of out-of-sample data points. This method is particularly appealing for its simplicity, adaptability to complex models, and compatibility with modern machine learning algorithms.
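As a brief sketch of how these options are selected, again with an illustrative model fitted to the simulated dat: the delta method is the default, the vcov argument swaps in robust variance estimates, and the inferences() helper covers the resampling- and simulation-based alternatives (see its documentation for the full set of options).

```r
# illustrative model
mod <- lm(Num ~ Bin + Cat, data = dat)

# delta method standard errors (default)
avg_comparisons(mod, variables = "Bin")

# delta method with a heteroskedasticity-robust variance estimate
# (requires the sandwich package)
avg_comparisons(mod, variables = "Bin", vcov = "HC3")

# bootstrap standard errors via the inferences() helper
avg_comparisons(mod, variables = "Bin") |> inferences(method = "boot")
```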
Chapters 5, 6, and 7 show how we can apply these strategies to construct standard errors and confidence intervals around predictions, comparisons, and slopes. Readers who seek more intuition, technical detail, and hands-on demonstration can refer to Chapter 13, which is entirely dedicated to uncertainty quantification.
3.5 Test
Chapter 4 introduces two essential statistical testing procedures: null hypothesis and equivalence tests. These procedures are widely used in data analysis, providing powerful tools for making informed decisions based on model outputs.
Null hypothesis tests allow us to determine if there is sufficient evidence to reject a presumed statement about a quantity of interest. An analyst could check if there is enough evidence to reject statements such as:
- Parameter estimates: The first regression coefficient in a model equals 1.
- Predictions: The predicted number of goals in a game is two.
- Counterfactual comparison: The effect of a new medication on blood pressure is null.
- Slope: The association between education and income is zero.
Equivalence tests, on the other hand, are used to provide evidence that the difference between an estimate and some reference set is “negligible” or “unimportant.” For instance, we could use an equivalence test to support statements such as:
- Parameter estimates: The first regression coefficient in a model is practically equivalent to 1.
- Predictions: The predicted probability of scoring a goal is not meaningfully different from 2.
- Counterfactual comparisons: The effect of a new medication on blood pressure is essentially the same as the effect of an existing treatment.
- Slope: The association between education and income is close to zero.
In sum, null hypothesis tests allow us to establish a difference, whereas equivalence tests allow us to establish a similarity or practical equivalence.
The marginaleffects package allows us to compute null hypothesis and equivalence tests on raw parameter estimates, as well as on all the quantities estimated by the package: predictions, counterfactual comparisons, and slopes. Crucially, marginaleffects can conduct both types of tests on arbitrary (non-)linear functions of multiple estimates.
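A brief sketch with the same illustrative model; the equivalence bounds and the coefficient labels (b2, b3) are arbitrary choices for illustration:

```r
# illustrative model
mod <- lm(Num ~ Bin + Cat, data = dat)

# null hypothesis test: is the average comparison different from 0? (default)
avg_comparisons(mod, variables = "Bin")

# equivalence test: is the average comparison negligibly different from 0,
# given bounds of -0.1 and 0.1?
avg_comparisons(mod, variables = "Bin", equivalence = c(-0.1, 0.1))

# null hypothesis test on a function of two coefficients
hypotheses(mod, hypothesis = "b2 = b3")
```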
In Part II of this book, we dive deeper into each of the components of the framework, and show how to use the marginaleffects package to operationalize them.
3.6 Summary
This chapter was written to address two problems:
- The parameters of a statistical model are often difficult to interpret.
- Analysts often fail to rigorously define the statistical quantities (estimands) and tests that can shed light on their research questions.
We can solve these problems by transforming parameter estimates into quantities with a straightforward interpretation and a direct link to our research goals. In practice, this requires us to answer five questions:
- Quantity: What is the quantity of interest?
- Predictors: What predictor values are we interested in?
- Aggregation: Do we report unit-level estimates or aggregate them?
- Uncertainty: How do we quantify uncertainty about our estimates?
- Test: What hypothesis test do we wish to conduct?
The analysis workflow implied by these five questions is extremely flexible; it allows us to define and compute a wide variety of quantities and tests. To see this, we can refer to visual aids adapted from Heiss (2022).
The Data in Figure 3.1 represents a dataset with five rows and two variables: a numeric and a categorical one. Each of the light gray cells in the first column corresponds to a value of the numeric variable. The second column of the Data represents a categorical variable, with gray and light gray cells showing distinct categories.
Using these two variables, we fit a statistical model. Then, we transform the parameters of that statistical model to compute quantities of interest. These quantities could be predictions, or fitted values (Chapter 5). They could be counterfactual comparisons, such as risk differences (Chapter 6). Or they could be slopes (Chapter 7).
All of these quantities are conditional: their values typically depend on all the predictors in the model. Thus, each row of the original dataset is associated with a different estimate of the quantity of interest. We represent unit-level estimates of the quantity of interest as black cells, on the right-hand side of Figure 3.1.
Often, the analyst does not wish to compute or report one estimate for every row of the original dataset. Instead, they prefer aggregating—or averaging—the estimates across the full dataset. For example, instead of reporting one prediction (or fitted value) for each row of the dataset, we could first compute the individual fitted values, and then aggregate these estimates to obtain an “average predicted outcome.”
Figure 3.2 illustrates this extended workflow. We follow the same process as above, but add an extra step at the end, to aggregate the unit-level estimates into a one-number summary.
Recall that our Data includes a categorical variable, with distinct values illustrated by gray and light gray cells. Instead of aggregating the unit-level estimates across the full dataset, to obtain a single number, we could instead compute the average estimate in subgroups defined by that categorical variable. In Figure 3.3, the first aggregated estimate (black cell) is the average estimate in the gray subgroup. The second aggregated estimate is the average estimate in the light gray subgroup.
Instead of aggregating unit-level estimates made on the empirical distribution (i.e., the observed data), we can build a smaller grid of predictors, and compute estimates directly on that smaller grid. The values in that grid could represent units with particularly interesting features. For example, a medical researcher could be interested in computing estimates for a specific patient profile, between the ages of 18 and 35 with a particular biomarker (see Section 3.2.2 on “interesting grids”). Alternatively, the analyst could want to compute an estimate for a typical individual, whose characteristics are exactly average or modal on all dimensions (see Section 3.2.3 on “representative grids”).
Figure 3.4 illustrates this approach. Based on the original data, we create a one-row grid for an individual with interesting or representative values of the numeric (light gray) and categorical (gray) variables. Then, we use the model to compute an estimate (prediction, comparison, or slope) for that hypothetical, representative, or typical individual.
The final approach, illustrated by Figure 3.5, combines both the grid creation and the aggregation steps. First, we duplicate the entire dataset, and we fix one of the variables at different values in each of the datasets. In Section 3.2.5, we described this process as creating a “counterfactual grid.” This is illustrated by the duplicated grids in Figure 3.5.
In the top half of the duplicated grid, the categorical variable is set to a single value: light gray. In the bottom half of the duplicated grid, the categorical variable is set to a single value: gray. Using this grid and the fitted model, we compute estimates (predictions, comparisons, or slopes) for every row. Finally, we compute average estimates for subgroups defined by the categorical (manipulated) variable. In the chapter on causality and G-computation, we will see that Figure 3.5 is a visual representation of the G-computation algorithm.
In sum, estimands and tests can be defined by focusing on their constituent components: the quantity of interest, predictor grid, aggregation level, uncertainty quantification, and hypothesis testing strategy. Taken together, these components form a flexible framework for data analysis that can be adapted to many scenarios.
This conceptual framework is also easy to operationalize via the consistent user interface of the marginaleffects package for R and Python.
A statistical functional is a map from distribution \(F\) to a real number or vector.↩︎
\(\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i \leq x), \quad \forall x \in \mathbb{R}.\)↩︎
See Aronow and Miller (2019, sec. 3.3.1) for an informal discussion of those conditions and Wasserman (2006) for more details.↩︎
In generalized linear models, we can also make predictions on the link or response scale. The marginaleffects package allows both.↩︎
In this book, the words “predictor,” “treatment,” “regressor,” and “explanator” are used as synonyms to refer generically to variables on the right-hand side of a regression equation.↩︎
As in the previous section, \(X\) represents a “focal” predictor of interest, and \(\mathbf{Z}\) a vector of control variables.↩︎
Balanced grids are used by default in some post-estimation software, such as emmeans (Lenth 2024).↩︎