3 Conceptual framework

This chapter introduces a conceptual framework to aid the interpretation of a wide variety of statistical and machine learning models. This framework is motivated by a simple argument: instead of focusing on the parameters of a fitted model, analysts should convert those parameters into quantities that make more intuitive sense to readers and stakeholders. By applying post-hoc transformations, researchers can easily go from model to meaning.

The proposed workflow is both model-agnostic and consistent. Every single analysis starts from the same place: defining the quantity of interest. To do this, we ask three critical questions:

Quantity: Do we wish to estimate the level of a variable, the association between two (or more) variables, or the effect of a cause?
Predictors: What predictor values are we interested in?
Aggregation: Do we care about unit-level or aggregated estimates?

With a clear description of the estimand in hand, we ask two further questions:

Uncertainty: How do we quantify uncertainty about our estimates?
Test: Which hypothesis or equivalence tests are relevant?

These five questions give us a structured way to think about the quantities and tests that matter. They lead us to clear definitions of those quantities, and dispel terminological ambiguity. Importantly, the answers to those questions also point to the specific software commands that one needs to execute in order to run appropriate calculations. The rest of the chapter surveys each of the five questions.

3.1 Quantity

The parameters of a statistical model are often difficult to interpret, and they do not always shed direct light onto the research questions that interest us. In many contexts, it helps to transform parameter estimates into quantities with a more natural and domain-relevant meaning.

For example, the analyst who fits a logistic regression model obtains coefficient estimates expressed as log odds ratios. For most people, this scale is very difficult to reason about. Instead of struggling with complex amalgams of probabilities, analysts should transform estimates into more intuitive quantities, like predicted probabilities or risk differences.

Section 3.1.1 exposes the theoretical underpinnings of post hoc transformations: the plug-in principle and the invariance property maximum likelihood. Empirically-minded readers who are less interested in statistical theory may skip that part of the text.

Section 3.1.2 surveys the three classes of quantities of interest at the heart of this book: predictions, counterfactual comparisons, and slopes. These quantities are introduced briefly here, but given chapter-length treatments in Part II.

3.1.1 Theoretical background

There are two primary theoretical justifications for post-estimation transformations: the plug-in principle and the invariance property of maximum likelihood estimators (MLE).

The intuition behind the plug-in principle is simple. To infer some feature of a population, we can study the same feature in a sample, and plug-in our sample estimate in lieu of the population value (Aronow and Miller 2019).

To formalize this idea a bit, consider a sample \(Y_1, \ldots, Y_n\) drawn from a distribution \(F\). For example, \(F\) could represent the distribution of ages in a population, and \(Y_1\) could be the observed age of a single individual. Imagine that we care about a statistical functional¹ \(\psi\) of this distribution, \(\theta=\psi(F)\). In this context, \(\theta\) could represent the average age of the population, the distribution variance, or the coefficient of a regression model designed to capture the data generating process for \(F\). Since we do not know the age of every individual in the population, we cannot trace the full distribution \(F\) and we cannot compute the true \(\psi(F)\) directly.

The plug-in principle says that we do not need to know the exact distribution \(F\) to learn something about \(\psi(F)\). Instead, we can estimate the quantity of interest by studying the empirical distribution² of a random sample of \(n\) individuals, \(\hat{F}_n\). More specifically, the plug-in principle states that if \(\theta = \psi(F)\) is a statistical functional of the probability distribution \(F\), and if some mild regularity conditions are satisfied,³ we can estimate \(\theta\) using the sample analogue \(\hat{\theta} = \psi(\hat{F}_n)\).

As Aronow and Miller (2019) note, the implications for statistical practice are profound. When the number of observations increases, the empirical distribution function will tend to approximate the population distribution function. When the number of observations increases, our estimate \(\hat{\theta}\) will tend to approach \(\theta\).

These ideas empower us to target a broad array of quantities of interest. \(\theta\) could be as simple as the distribution mean; \(\theta\) could be a regression coefficient; or \(\theta\) could be a more interesting quantity, defined as a function of regression coefficients. More concretely, we can estimate the average age of a population by computing the average age of the individuals in our sample, or we can guess the value of a regression parameter in the population by computing the same parameter in-sample.

The plug-in principle thus justifies the workflow proposed in this book. First, we fit a regression model and obtain coefficient estimates that characterize the empirical distribution of the data. Then, we apply a function to coefficient estimates and transform them into more interpretable quantities. Finally, we interpret those quantities as sample analogues to population characteristics.

A different way to motivate post hoc transformations is to draw on the invariance property of MLE. This is a fundamental concept in statistical theory, which highlights the efficiency and flexibility of MLE. The invariance property can be stated as follows: if \(\hat{\theta}\) is the MLE estimate of \(\theta\), then for any function \(\psi(\theta)\), the MLE of \(\psi(\theta)\) is \(\psi(\hat{\theta})\) (Berger and Casella 2024, 259). In other words, the desirable properties of MLEs—consistency, efficiency, and asymptotic normality—are preserved under transformation.

This property is useful for practice, because it simplifies the estimation of functions of parameters. For example, let’s say that we specify a regression model to capture the effect of nutrition on children’s heights. We estimate the coefficient \(\theta\) of this model via maximum likelihood, and obtain an estimate \(\hat{\theta}\). Chapter 5 shows how one could apply a function \(\psi\) to the coefficient \(\hat{\theta}\), in order to compute the predicted (or expected) value of the outcome variable (height) for a given value of the explanatory variable (nutrition). The invariance property says that if the coefficient estimate \(\hat{\theta}\) has the desirable properties of a maximum likelihood estimate, then the prediction \(\psi(\hat{\theta})\) inherits those qualities.

In sum, it is often useful to apply post-estimation transformations to the parameter estimates obtained by fitting a statistical model, in order to obtain more meaningful and interpretable quantities. This kind of post hoc transformation is well-grounded in statistical theory, via the plug-in principle and the invariance property of MLE.

3.1.2 Predictions, counterfactual comparisons, and slopes

In this book, we will target three broad classes of estimands: predictions, counterfactual comparisons, and slopes. Part II dedicates a full chapter to each of them, with many concrete examples drawing on real-world datasets. The present section foreshadows that in-depth analysis by briefly introducing each quantity.

Predictions

The first class of transformations to consider is the prediction.

A prediction is the expected outcome of a fitted model for a given combination of predictor values.

Consider the general setting where we fit a statistical model by minimizing some loss function \(\mathcal{L}\). For a standard linear regression model, \(\mathcal{L}\) could represent the sum of squared errors; for a model estimated by maximum likelihood, \(\mathcal{L}\) could be the negative log-likelihood. The coefficient estimates that allow us to minimize loss can be written

\[ \hat{\beta} = argmin_{\beta} \; \mathcal{L}(Y, X, \mathbf{Z}; \beta), \tag{3.1}\]

where \(\beta\) is a vector of parameters; \(Y\) the outcome variable; \(X\) a focal predictor which holds particular scientific interest; and \(\mathbf{Z}\) a vector of control variables. In Equation 3.1, the \(argmin\) operator simply means that the estimator is designed to find the value of \(\beta\) that minimize \(\mathcal{L}\).

Once we obtain estimates of the \(\beta\) parameters, we can use them to compute predictions for particular values of the predictors \(X\) and \(\mathbf{Z}\). To do this, we plug-in the predictors and estimates into an appropriate function \(\Phi\).

\[ \hat{Y} = \Phi(X=x, \mathbf{Z}=\mathbf{z}; \hat{\beta}), \tag{3.2}\]

The choice of \(\Phi\), of course, depends on the model specification. When fitting a model using ordinary least squares, \(\Phi\) can be a simple linear combination of coefficients and predictors. When \(Y\) is a probability, bounded by 0 and 1, \(\Phi\) can be the logistic or normal distribution functions.⁴

To make things more concrete, imagine that we wish to describe children’s heights (\(Y\)) as a function of age (\(X\)) and caloric intake (\(Z\)). The dependent variable is numeric and continuous, so we specify a linear regression model with an intercept and two coefficients. After fitting the model, we obtain the following estimates: the intercept is estimated to be \(\hat{\beta}_0=60\), the coefficient for age is \(\hat{\beta}_1=6.5\), and the coefficient for caloric intake is \(\hat{\beta}_2=0.01\).

Now, let’s transform these estimates into a prediction. More specifically, let’s use our fitted model to compute the expected height of an 11-year-old child who eats 1800 calories per day. To achieve this, we define \(\Phi\) as a linear combination of coefficients and predictors, and we plug-in the child’s personal characteristics and our parameter estimates into Equation 3.2.

\[\begin{align*} \hat{Y} &= \Phi(\text{Age}=11, \text{Calories}=1800; \hat{\beta}_0=60, \hat{\beta}_1=6.5, \hat{\beta}_2=0.01) \\ &= \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{Age} + \hat{\beta}_2 \cdot \text{Calories} \\ &= 60 + 6.5 \cdot 11 + 0.01 \cdot 1800 \\ &= 149.5\text{ cm} \end{align*}\]

For this specific combination of predictor values, and given the estimated values of the coefficients, our model predicts a height of 149.5cm. This example shows how easy it is to transform abstract-seeming parameters into a concrete, meaningful, and interpretable quantity: the expected height of an 11-year-old who consumes 1800 calories a day.

Throughout the book, we will perform these operations repeatedly, applying them to more complex models and various \(\Phi\) functions. At times, we will generate multiple predictions and aggregate them to obtain an average estimate. In other instances, we will compare counterfactual predictions to estimate treatment effects.

Counterfactual comparisons

If parameters are a resting stone on the way to prediction, predictions are a springboard to comparison and causal inference. The second class of transformations to consider is thus the counterfactual comparison.

A counterfactual comparison is a function of two predictions made with different predictor values.

Suppose we want to evaluate the effect of changing the focal predictor \(X\) from 0 (control group) to 1 (treatment group) on the predicted outcome \(\hat{Y}\), all else equal. To do this, we compute two predictions: one with \(X=0\) and another with \(X=1\), while keeping all control variables at fixed values \(\mathbf{Z}=\mathbf{z}\).

\[\begin{align*} \hat{Y}_{X=1,\mathbf{Z}=\mathbf{z}} = \Phi(X=1, \mathbf{Z}=\mathbf{z}; \hat{\beta}) && \mbox{Treatment}\\ \hat{Y}_{X=0,\mathbf{Z}=\mathbf{z}} = \Phi(X=0, \mathbf{Z}=\mathbf{z}; \hat{\beta}) && \mbox{Control} \end{align*}\]

These two \(\hat{Y}\) predictions are called “counterfactual,” because they represent the expected outcomes for hypothetical scenarios where the focal predictor \(X\) is different from what we factually observe in the data. Here, the key idea is to use a statistical model to extrapolate, to predict what would happen to \(Y\) if \(X\) took on different values.

The next step in the analysis is to choose a function to compare the relative size of the two counterfactual predictions. One natural choice is to simply subtract one from the other. This gives us the difference in predicted outcome between the treatment and control groups, for individuals with identical background characteristics \(\mathbf{z}\).

\[\begin{align*} \hat{Y}_{X=1,\mathbf{Z}=\mathbf{z}} - \hat{Y}_{X=0,\mathbf{Z}=\mathbf{z}} \end{align*}\]

If \(\hat{Y}_{X=1,\mathbf{Z}=\mathbf{z}} > \hat{Y}_{X=0,\mathbf{Z}=\mathbf{z}}\), our model tells us that being part of the treatment group is associated with higher values of \(Y\), for individuals with background characteristics \(\mathbf{Z}=\mathbf{z}\).

Chapter 6 shows that this is but one example of a versatile strategy, which can be extended in at least three directions. First, depending on the model type, the predicted outcome \(\hat{Y}\) can be expressed on a variety of scales. In linear regression, \(\hat{Y}\) is a real number. Using a logistic regression, one could predict and compare predicted probabilities. With Poisson models, we could compare expected outcomes expressed as counts.⁵

Second, the analyst can select different counterfactual values of the focal predictor \(X\). In the example above, \(X\) was fixed to 0 or 1. In Chapter 6, we will use different counterfactual values to estimate how the predicted value \(\hat{Y}\) reacts when \(X\) changes by 1 or more units (e.g., one standard deviation), or from one category to another.

Third, a counterfactual comparison can be expressed as any function of two predictions. It can be a difference \(Y_{X=1,\mathbf{Z}=\mathbf{z}}-Y_{X=0,\mathbf{Z}=\mathbf{z}}\), a ratio \(\frac{Y_{X=1,\mathbf{Z}=\mathbf{z}}}{Y_{X=0,\mathbf{Z}=\mathbf{z}}}\), the lift \(\frac{Y_{X=1,\mathbf{Z}=\mathbf{z}}-Y_{X=0,\mathbf{Z}=\mathbf{z}}}{Y_{X=0,\mathbf{Z}=\mathbf{z}}}\), or even more complex (and unintelligible) functions like odds ratios.

Counterfactual comparisons are a fundamental building block for causal inference. Indeed, when conditions for causal identification are satisfied, they will often be interpreted as measures of the effect of \(X\) on \(Y\).⁶

That said, it is important to emphasize that counterfactual comparisons need not have a causal interpretation. In some contexts, it is perfectly reasonable to interpret them purely in associational terms, as measures of the statistical relationship between two variables \(X\) and \(Y\), holding other variables constant.

To sum up, counterfactual comparisons are a broad class of estimands designed to measure the association between two variables, or the effect of one variable on another: contrasts, differences, risk ratios, lift, etc.

Slopes

The third class of transformations to consider is the slope.

A slope is the partial derivative of the regression equation with respect to a focal predictor.

The slope is the rate at which a model’s predictions change when we modify a focal predictor by a small amount, while keeping other predictors constant. Slopes fall in the same toolbox as counterfactual comparisons; they help us measure the strength of association between two variables, or the effect of one variable on another, ceteris paribus.

In some disciplines like economics and political science, slopes are known as “marginal effects,” where the word marginal indicates that we are looking at the effect of a very small change in one of the predictors.⁷ Chapter 7 offers a lot more intuition about the nature of slopes or partial derivatives, and it works through several concrete examples.

3.2 Predictors

Predictions, counterfactual comparisons, and slopes are conditional quantities, which means that their values typically depend on all the predictors in a model. Whenever an analyst reports one of these statistics, they must imperatively disclose where it was evaluated in the predictor space. They must answer questions like:

Is the prediction, comparison, or slope computed for a 20-year-old student in the treatment group or for a 50-year-old literary critic in the control group?

Answers to questions like this one can be expressed in terms of profiles and grids.

A profile is a specific combination of values for a focal predictor \(X\) and a vector of control variables \(\mathbf{Z}\). It is the set of predictor values for one observed or hypothetical individual. Profiles can be observed, synthetic, or partially synthetic. An observed profile is a combination of predictor values that actually appears in the sample. It represents the characteristics of an observed individual or unit in our dataset. A synthetic profile represents a purely hypothetical individual with representative or interesting characteristics, like sample means or modes. A partially synthetic profile combines the two approaches. Here, we take an actual observation from our dataset, and modify a focal predictor to take on some meaningful value.

For example, imagine that we draw one observation with these characteristics from a dataset: {Age=25, City=Montreal, Group=Control}. We can use this information and a fitted model to compute a prediction for this observed individual. Alternatively, we could modify the Group value to Treatment rather than Control, and compute a prediction for the modified, partially synthetic, profile. Doing so would allow us to answer a counterfactual what if question: What would have happened to a 25 year-old Montrealer, if they had been part of the treatment group instead of the control group.

A grid is a collection of one or more profiles. For example, it could hold the empirical distribution of predictors, a grid with one row for each of the individuals that we have actually observed. Using this grid, we could compute one prediction (fitted value) for each row of our dataset. Alternatively, we could construct a grid of synthetic profiles, that is, a grid with combinations of predictors that we may not have actually observed, but that hold scientific interest. With such a grid, we could make predictions for a specific type of individual (ex, 13 year-old boy), under different treatment regimes (ex, placebo or treatment).

Defining the grid of predictors is a crucial step in model interpretation. Unless we specify a grid, we cannot clearly define the quantity of interest—or estimand—that sheds light on a research question Section 2.2. By defining a grid, the analyst gives an explicit answer to this question: what kinds of people or units do estimates apply to?

In the rest of this section, we use the datagrid() function from the marginaleffects package to construct and discuss a variety of grids. We will build these grids by reference to a simulated dataset with 10 observations on three variables: numeric, binary and categorical.

R
Python

library(marginaleffects)
library(tinytable)

set.seed(48103)
N = 10
dat = data.frame(
  Num = rnorm(N),
  Bin = rbinom(N, size = 1, prob = 0.5),
  Cat = sample(c("A", "B", "C"), size = N, replace = TRUE)
)

import numpy as np
import polars as pl
from marginaleffects import *

rng = np.random.default_rng(seed=48103)
N = 10
dat = pl.DataFrame({
  "Num": rng.normal(loc = 0, scale = 1, size = N),
  "Bin": rng.binomial(n = 1, p = 0.5, size = N),
  "Cat": rng.choice(["A","B","C"], size = N)
})

The code examples below rely on the pipe operator |> to send the output of datagrid() to the tt() function from the tinytable package. This allows us to display nicely formatted tables for each grid.

3.2.1 Empirical grid

The first type of grid to consider is most obvious. The empirical distribution is simply the observed dataset. It is a grid composed of all the actually observed profiles in our sample.

R
Python

dat |> tt()

Num	Bin	Cat
-0.65471	0	A
-0.49825	1	A
0.33231	0	B
1.12147	0	C
-0.06395	1	C
1.5376	1	B
-0.11144	1	B
0.24231	1	C
-0.16921	1	B
0.57632	0	C

dat

shape: (10, 3)

Num	Bin	Cat
f64	i64	str
0.689273	0	"C"
0.542308	0	"A"
0.176669	1	"B"
-0.260732	0	"B"
-2.0586	0	"C"
-1.220433	0	"A"
-0.483258	1	"A"
-0.756905	1	"A"
0.298359	0	"B"
0.579213	1	"C"

Computing predictions on the empirical distribution is common practice in data analysis, as it yields one fitted value for each observation in the sample. When estimating counterfactual comparisons or slopes, analysts will often start with the empirical grid, manipulate one focal predictor, and see how predicted outcomes are affected.

Studying an empirical grid makes most sense when the observed sample is representative of the population that the analyst targets for inference. When working with convenience samples with very different characteristics from the population, it may make sense to use one of the grids described below, or to apply weights as described in Chapter 12.

3.2.2 Interesting grid

If the analyst cares about units with specific profiles, they can use the datagrid() function to create customized grids of “interesting” predictor values. This is useful when one wants to compute a prediction or slope for an individual with given characteristics, such as a 50 year-old engineer from Belgium.

By default, datagrid() fixes all variables to their means or modes, except for those variables that the analyst has explicitly defined.

R
Python

datagrid(Bin = c(0, 1), newdata = dat) |> tt()

rowid	Num	Cat	Bin
1	0.2312	B	0
2	0.2312	B	1

datagrid(Bin=[0,1], newdata = dat)

shape: (2, 3)

Num	Cat	Bin
f64	str	i64
-0.249411	"A"	0
-0.249411	"A"	1

Note that the Num variable was fixed to its mean value, and the Cat variable to its mode, because neither variable was specified explicitly by the user.

datagrid() also accepts functions to be applied to the variables in the original dataset. In R, the range function returns a vector of two values: the minimum and the maximum of a variable; the mean function returns a single value; and unique returns a vector of all unique values in a variable. With the following function call, we get a grid with \(2\times 1 \times 3=6\) rows.

R
Python

datagrid(Num = range, Bin = mean, Cat = unique, newdata = dat) |> tt()

rowid	Num	Bin	Cat
1	-0.6547	0.6	A
2	-0.6547	0.6	B
3	-0.6547	0.6	C
4	1.5376	0.6	A
5	1.5376	0.6	B
6	1.5376	0.6	C

datagrid(
  Num=[dat["Num"].max(), dat["Num"].min()], 
  Bin=dat["Bin"].mean(),
  Cat=dat["Cat"].unique(),
  newdata = dat)

shape: (6, 3)

Num	Bin	Cat
f64	f64	str
0.689273	0.4	"A"
0.689273	0.4	"C"
0.689273	0.4	"B"
-2.0586	0.4	"A"
-2.0586	0.4	"C"
-2.0586	0.4	"B"

3.2.3 Representative grid

A representative grid is one where predictors are fixed to representative values, such as means, medians, or modes. This kind of grid is useful when the analyst wants to compute predictions, comparisons, or slopes for a typical or average individual.

R
Python

datagrid(grid_type = "mean_or_mode", newdata = dat) |> tt()

rowid	Num	Bin	Cat
1	0.2312	1	B

datagrid(grid_type = "mean_or_mode", newdata = dat)

shape: (1, 3)

Num	Bin	Cat
f64	i64	str
-0.249411	0	"A"

In Part II of the book, we will see how representative grids allow us to compute useful quantities such as “fitted value at the median” or “marginal effect at the mean.” In both cases, the quantities of interest are evaluated at a central point in the predictor space.

Studying representative grids may be useful when looking for a measure of central tendency. It can also be beneficial for computational reasons, because calculating a single statistic for a single profile is cheaper than computing one statistic per row for a large dataset. On the downside, it is important to keep in mind that, in some cases, nobody in the population is exactly average on all dimensions. In that context, the interpretation of quantities computed on representative grids is somewhat ambiguous.

3.2.4 Balanced grid

A balanced grid is a grid built from all unique combinations of categorical variables, and where all numeric variables are held at their means.

To create a balanced grid, we fix numeric variables at their means and create rows for each combination of values for the categorical variables (i.e., the cartesian product). In the next table, Num is held at its mean, and rows show all combinations of unique Bin and Cat.

R
Python

datagrid(grid_type = "balanced", newdata = dat) |> tt()

rowid	Num	Bin	Cat
1	0.2312	0	A
2	0.2312	0	B
3	0.2312	0	C
4	0.2312	1	A
5	0.2312	1	B
6	0.2312	1	C

datagrid(grid_type = "balanced", newdata = dat)

shape: (6, 3)

Num	Bin	Cat
f64	i64	str
-0.249411	0	"A"
-0.249411	0	"B"
-0.249411	0	"C"
-0.249411	1	"A"
-0.249411	1	"B"
-0.249411	1	"C"

Balanced grids are often used to analyze the results of factorial experiments in convenience samples, where the empirical distribution of predictor values is not representative of the target population. In that context, the analyst may wish to report estimates for each combination of treatments, that is, for each cell in a balanced grid. We will see balanced grids in action in Section 5.3, when computing marginal means.⁸

3.2.5 Counterfactual grid

The last type of grid to consider is counterfactual. Here, we duplicate the entire dataset, creating one copy for every combination of values that the analyst supplies.

In the example that follows, we specify that Bin must take on values of 0 or 1. Then, we create two copies of the full dataset. In the first copy, Bin is fixed to 0; in the second copy, it is set to 1. All other variables are held at their observed values. Since the original data had 10 rows, the counterfactual datasets has 20 rows.

R
Python

g = datagrid(
  Bin = c(0, 1),
  grid_type = "counterfactual",
  newdata = dat
)
nrow(g)

[1] 20

g = datagrid(
  Bin=[0,1],
  grid_type = "counterfactual",
  newdata = dat
)
g.shape

(20, 4)

We can inspect the first three rows of each counterfactual version of the data by filtering on the rowidcf column, which holds row indices. Notice how each of the original rows has an exact duplicate, which differs only in terms of the counterfactual Bin variable.

R
Python

subset(g, rowidcf %in% 1:3) |> tt()

rowid	rowidcf	Num	Cat	Bin
1	1	-0.6547	A	0
2	2	-0.4982	A	0
3	3	0.3323	B	0
11	1	-0.6547	A	1
12	2	-0.4982	A	1
13	3	0.3323	B	1

g.filter(pl.col("rowidcf")<3)

shape: (6, 4)

Num	Cat	rowidcf	Bin
f64	str	i64	i64
0.689273	"C"	0	0
0.689273	"C"	0	1
0.542308	"A"	1	0
0.542308	"A"	1	1
0.176669	"B"	2	0
0.176669	"B"	2	1

This kind of duplication will be essential in chapters 6 and 8, where we explore counterfactual analysis and causal inference.

3.3 Aggregation

As noted above, predictions, counterfactual comparisons, and slopes are conditional quantities, in the sense that they typically depend on the values of all predictors in a model. When we compute these quantities over a grid of predictor values, we obtain a collection of different point estimates: one for each row of the grid.

If a grid has many rows, the large number of estimates that we generate can be unwieldy, making it more difficult to extract clear and meaningful insights from our models. To simplify the presentation of results, analysts might want to aggregate unit-level estimates into more macro-level summaries. We can think about several aggregation strategies.

No aggregation: Unit-level estimates.
Overall average: Average of unit-level estimates.
Subgroup averages: Average of unit-level estimates within subgroups of the data.
Weighted averages: Weighted average of unit-level estimates, where weights could be sampling weights or the inverse probability of treatment assignment.

Aggregated estimates are very common in all fields of research. In Chapter 5, we will compute average predictions or marginal means. In Chapter 6, we will compute average counterfactual comparisons. In Chapter 8, we will see that aggregating comparisons over different grids allows us to compute quantities such as the Average Treatment Effect (ATE), the Average Treatment effect on the Treated (ATT), or the Average Treatment effect on the Untreated (ATU). In Chapter 7, we will compute average slopes (a.k.a. average marginal effects).

Aggregated estimates tend to be easier to interpret, and they can often be estimated with greater precision than unit-level quantities. However, they also come with an important downside, since they can mask interesting variation across the sample. For example, imagine that the effect of a treatment is strongly negative for some individuals, but strongly positive for others. In that case, an estimate of the average treatment effect could be close to 0 because positive and negative estimates cancel out. This aggregated quantity would not adequately capture the nature of the data generating process, and it could lead to misleading conclusions.

3.4 Uncertainty

Whenever we report quantities of interest derived from a statistical model, it is essential to provide estimates of our uncertainty. Without standard errors or confidence intervals, readers cannot confidently assess if the reported values or relationships are genuine, or if they could merely be the product of chance.

The marginaleffects package offers four primary methods for quantifying uncertainty: the delta method (default), bootstrap, simulation-based inference, and conformal prediction. Chapter 14 is entirely dedicated to uncertainty quantification. It offers intuition, technical details, and hands-on demonstration for all four of the approaches listed above. Here, we survey them briefly.

The delta method is the default and most common way to compute standard errors for the kinds of quantities considered in this book. It is a statistical technique to approximate the variance of a function of random variables. Since all the quantities of interest described in this book can be seen as transformations of model parameters, the delta method can be used to compute standard errors for predictions, counterfactual comparisons, and slopes. The delta method has many advantages: it is fast, flexible, and it can be paired with robust variance estimates to account for heteroskedasticity, autocorrelation, or clustering. The main disadvantages of the delta method are that it relies on a coarse linear approximation, that it assumes a model’s parameters are asymptotically normal, and that it requires functions of parameters to be continuously differentiable.

The second strategy for uncertainty quantification is the bootstrap, a resampling-based technique that generates empirical distributions for estimated quantities by repeatedly sampling from the observed data. The bootstrap is an extremely flexible approach, which can be useful when the delta method’s assumptions are not met.

The third method, simulation-based inference, involves drawing simulated coefficients from an assumed distribution, and then using those simulated values repeatedly to compute quantities of interest. This strategy is particularly effective when working with complex models or when estimating functions of multiple parameters. It also provides an intuitive way to visualize uncertainty through predictive distributions.

The fourth method, conformal prediction, is a flexible framework for uncertainty quantification that provides valid prediction—rather than confidence—intervals under minimal assumptions. Unlike traditional approaches that rely heavily on model-specific assumptions or asymptotic approximations, conformal prediction uses split sample strategies to construct intervals that are guaranteed to cover a given share of out-of-sample data points. This method is particularly appealing for its simplicity and adaptability to complex models.

Chapters 5, 6, and 7 show how we can apply these strategies to construct standard errors and confidence intervals around predictions, comparisons, and slopes.

3.5 Test

Once we have computed a quantity of interest and its standard error, we can finally conduct a test to see if our hypothesis or conjecture is correct. This book covers two important testing procedures: null hypothesis and equivalence tests. These procedures are widely used in data analysis, providing powerful tools for making informed decisions based on model outputs.

Null hypothesis tests allow us to determine if there is sufficient evidence to reject a presumed statement about a quantity of interest. For example, an analyst could check if there is enough evidence to reject statements such as:

Parameter estimates: The difference between the first and second regression coefficients is equal to one.
Predictions: The predicted number of goals in a game is two.
Counterfactual comparison: The effect of a new medication on blood pressure is null.
Slope: The association between education and income is zero.

Equivalence tests, on the other hand, are used to provide evidence that the difference between an estimate and some reference set is “negligible” or “unimportant.” For instance, we could use an equivalence test to support statements such as:

Parameter estimates: The first regression coefficient is practically equivalent to the second regression coefficient.
Predictions: The predicted probability of scoring a goal is not meaningfully different from 2.
Counterfactual comparisons: The effect of a new medication on blood pressure is essentially the same as the effect of an existing treatment.
Slope: The association between education and income is close to zero.

In sum, null hypothesis tests allow us to establish a difference, whereas equivalence tests allow us to establish a similarity or practical equivalence.

The marginaleffects package allows us to compute null hypothesis and equivalence tests on raw parameter estimates, (non-)linear combinations of those parameters, as well as on all the quantities estimated by the package: predictions, counterfactual comparisons, and slopes. Chapter 4 is entirely dedicated to hypothesis and equivalence testing, and the rest of Part II shows how these tests can be applied to other quantities of interest.

3.6 Summary

This chapter was written to address two problems:

The parameters of a statistical model are often difficult to interpret.
Analysts often fail to rigorously define the statistical quantities (estimands) and tests that can shed light on their research questions.

We can solve these problems by transforming parameter estimates into quantities with a straightforward interpretation and a direct link to our research goals. In practice, this requires us to answer five questions:

Quantity: Do we wish to estimate the level of a variable, the association between two (or more) variables, or the effect of a cause?
Predictors: What predictor values are we interested in?
Aggregation: Do we care about unit-level or aggregated estimates?
Uncertainty: How do we quantify uncertainty about our estimates?
Test: Which hypothesis or equivalence tests are relevant?

The analysis workflow implied by these five questions is extremely flexible; it allows us to define and compute a wide variety of quantities and tests. To see this, we can refer to visual aids adapted from Heiss (2022).

The Data in Figure 3.1 represents a dataset with five rows and two variables: a numeric and a categorical one. Each of the light gray cells in the first column corresponds to a value of the numeric variable. The second column of the Data represents a categorical variable, with gray and light gray cells showing distinct categories.

Using these two variables, we fit a statistical model. Then, we transform the parameters of that statistical model to compute quantities of interest. These quantities could be predictions or fitted values.⁹ They could be counterfactual comparisons, such as risk differences.¹⁰ Or they could be slopes.¹¹

All of these quantities are conditional: their values typically depend on all the predictors in the model. Thus, each row of the original dataset is associated to a different estimate of the quantity of interest. We represent unit-level estimates of the quantity of interest as black cells, on the right-hand side of Figure 3.1.

Often, the analyst does not wish to compute or report one estimate for every row of the original dataset. Instead, they prefer aggregating estimates across the full dataset. For example, instead of reporting one prediction for each row of the dataset, they could first compute individual fitted values, and then marginalize them to obtain an average predicted outcome.

Figure 3.2 illustrates this extended workflow. We follow the same process as above, but add an extra aggregation step at the end to convert unit-level estimates into a one-number summary.

Recall that our Data includes a categorical variable, with distinct values illustrated by gray and light gray cells. Instead of aggregating the unit-level estimates across the full dataset, as in Figure 3.2, we could compute average estimates in subgroups defined by the categorical variable. On the right side of Figure 3.3, the first black cell represents an average estimate in the gray subgroup. The second black cell represents the average estimate in the light gray subgroup.

So far, we have aggregated unit-level estimates computed on the empirical grid, that is, we have taken averages across the set of 10 units of observation. One alternative approach would be to build a smaller grid of predictors, and to compute estimates directly on that smaller grid. The values in that grid could represent units with particularly interesting features. For example, a medical researcher could be interested in computing estimates for a specific patient profile, between the ages of 18 and 35 with a particular biomarker.¹² Alternatively, the analyst could want to compute an estimate for a typical individual, whose characteristics are exactly average or modal on all dimensions.¹³

Figure 3.4 illustrates this approach. Based on the original data, we create a one-row grid which holds the characteristics of a “typical” or “representative” individual. Then, we use the model to compute an estimate—prediction, comparison, or slope—for that synthetic individual.

Figure 3.4: Estimates at representative values.

The last approach we consider, illustrated by Figure 3.5, combines both the grid creation and the aggregation steps. First, we duplicate the entire dataset, and we fix one of the variables at different values in each of the datasets. In Section 3.2.5, we described this process as creating a counterfactual grid.

Figure 3.5: Counterfactual average estimates (G-computation).

In the top half of the duplicated grid, the categorical variable is set to a single counterfactual value: light gray. In the bottom half of the duplicated grid, the categorical variable is set to a different counterfactual value: gray. Using this grid and the fitted model, we compute estimates for every row. Finally, we marginalize estimates across subgroups defined by the categorical (counterfactual) variable. In Chapter 8, we will see that Figure 3.5 is a visual representation of the G-computation algorithm.

To sum up, estimands and tests can be defined by focusing on their constituent components: the quantity of interest, predictor grid, aggregation level, uncertainty quantification, and hypothesis testing strategy. Taken together, these components form a flexible framework for data analysis that can be adapted to many scenarios. This conceptual framework is also easy to operationalize via the consistent user interface of the marginaleffects package for R and Python.

A statistical functional is a map from distribution \(F\) to a real number or vector.↩︎
\(\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i \leq x), \quad \forall x \in \mathbb{R}.\)↩︎
See Aronow and Miller (2019, sec. 3.3.1) for an informal discussion of those conditions and Wasserman (2006) for more details.↩︎
The former would match a logit regression and the latter a probit regression.↩︎
In generalized linear models, we could also make predictions on the link scale, rather than the response scale. The marginaleffects package allows both, but it is usually better to use the response scale, because its interpretation is more intuitive.↩︎
Section 2.1.3 and Chapter 8 ↩︎
Section 2.2 ↩︎
Balanced grids are used by default in the emmeans post-estimation software (Lenth 2024).↩︎
Chapter 5 ↩︎
Chapter 6 ↩︎
Chapter 7 ↩︎
Section 3.2.2 ↩︎
Section 3.2.3 ↩︎