14  Uncertainty

This chapter introduces four approaches to quantify uncertainty: the delta method, bootstrap, simulation-based inference, and conformal prediction. We discuss the intuition behind each of these strategies, and reinforce that intuition by implementing simple versions of each strategy “by hand.” The code examples are presented for pedagogical purposes; more robust and convenient implementations are supplied by the marginaleffects package. Readers who are not familiar with the basics of multivariable calculus may want to skip this chapter.

14.1 Delta method

The delta method is a statistical technique used to approximate the variance of a function of random variables. For example, data analysts often want to compare regression coefficients or predicted probabilities by subtracting them from one another. Such comparisons can be expressed as functions of a model’s parameters, but the variances (and standard errors) of such transformed quantities can be difficult to derive analytically. The delta method is a convenient way to approximate those variances. It allows us to quantify uncertainty around most of the quantities of interest discussed in this book.

The delta method relies on the observation that if the distribution of a parameter is asymptotically normal, then the limiting distribution of a smooth function of this parameter is also normal (Wasserman 2004). This is a powerful result, which is especially useful in the context of maximum likelihood estimation, where parameters are expected to be asymptotically normal.
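
In the univariate case, this result can be stated compactly. If \(\hat{\beta}\) is asymptotically normal and \(g\) is continuously differentiable at \(\beta\) with \(g'(\beta)\neq 0\), then

\[ \sqrt{n}\left(\hat{\beta} - \beta\right) \xrightarrow{d} \mathcal{N}\left(0, \sigma^2\right) \quad\Longrightarrow\quad \sqrt{n}\left(g(\hat{\beta}) - g(\beta)\right) \xrightarrow{d} \mathcal{N}\left(0, \left[g'(\beta)\right]^2 \sigma^2\right) \]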

The delta method is a useful way to quantify uncertainty. It is versatile and fast. Yet, the technique also has important limitations: it relies on a coarse linear approximation, its desirable properties depend on the asymptotic normality of the estimator, and it is only valid when a function is continuously differentiable in a neighborhood of its parameters. When analysts do not believe that these conditions hold, they can turn to the bootstrap or toward simulation-based inference.1

The theoretical properties of the delta method are proved in many statistical textbooks (Wasserman 2004; Hansen 2022b; Berger and Casella 2024). Instead of demonstrating them anew, the rest of this section provides a hands-on tutorial, with a focus on computation. The goal is to reinforce intuitions by illustrating how to calculate standard errors manually in two cases: univariate and multivariate.

14.1.1 Univariate delta method

To start, we use the delta method to compute the standard error of a function with one parameter: the natural logarithm of a single regression coefficient. Consider a setting with three measured variables: \(Y\), \(X\), and \(Z\). With these variables in hand, the analyst estimates a linear regression model.

\[ Y = \beta_1 + \beta_2 X + \beta_3 Z + \varepsilon \tag{14.1}\]

Let \(\hat{\beta}_2\) be an estimator of the true parameter \(\beta_2\). We know that the variance of \(\hat{\beta}_2\) is \(\text{Var}(\hat{\beta}_2)\), but we are not interested in this particular statistic. Rather, our goal is to compute the variance of a function of the estimator: \(\text{Var}\left [\log(\hat{\beta}_2) \right ]\).

To get there, we apply a Taylor series expansion. This mathematical tool allows us to rewrite smooth functions like \(\log(\hat{\beta}_2)\) as infinite sums of derivatives and polynomials. This is a useful way to simplify and approximate complex functions.

The Taylor series of a function \(g(\alpha)\) is always expressed by reference to a particular point \(a\), where we wish to study the behavior of \(g\). Generically, the approximation can be written as

\[ g(\alpha) = g(a) + g'(a)(\alpha - a) + \frac{g''(a)}{2!}(\alpha - a)^2 + \cdots + \frac{g^{(n)}(a)}{n!}(\alpha - a)^n + \cdots, \tag{14.2}\]

where \(g^{(n)}(a)\) denotes the \(n\)th derivative of \(g(\alpha)\), evaluated at \(\alpha=a\).

Equation 14.2 is an infinite series, so it is not practical (or possible) to compute all its terms. In real-world applications, we thus truncate the series by dropping higher order terms. For instance, we can construct a first-order Taylor series by keeping only the first two terms on the right-hand side of Equation 14.2.

We can now approximate the function that interests us, \(\log(\hat{\beta}_2)\), as a first-order Taylor series, evaluated at the true value of the parameter \(\beta_2\). Recall that the derivative of \(\log(\beta_2)\) is \(1/\beta_2\), and substitute appropriate values into the first two terms of Equation 14.2.

\[\begin{align*} g(\alpha) \approx g(a) + g'(a)(\alpha - a) \\ \log(\hat{\beta}_2) \approx \log(\beta_2) + \frac{1}{\beta_2}(\hat{\beta}_2 - \beta_2) \end{align*}\]

Now, insert the approximation in the variance operator.

\[ \text{Var}\left [\log(\hat{\beta}_2)\right ] \approx \text{Var} \left [\log(\beta_2) + \frac{1}{\beta_2}(\hat{\beta}_2 - \beta_2)\right ] \tag{14.3}\]

This linearized approximation of the variance is useful, because it allows further simplification. First, we know that the variance of a constant \(\beta_2\) is zero, which implies that some of the terms in Equation 14.3 can drop out. Second, we can apply the scaling property of variances2 to rewrite the expression as

\[ \text{Var}\left [\log(\hat{\beta}_2)\right ] \approx \frac{1}{\beta_2^2}\text{Var} \left [\hat{\beta}_2\right ] \tag{14.4}\]

Equation 14.4 gives us a formula to approximate the variance of \(\log(\hat{\beta}_2)\). This formula is expressed in terms of \(\text{Var}(\hat{\beta}_2)\), that is, the square of the estimated standard error. Equation 14.4 also includes an unknown term: \(1/\beta_2\). Since we do not know the true value of that term, we fall back to using \(1/\hat{\beta}_2\) as a plug-in estimate.3

To show how this formula can be applied in practice, let’s simulate data that conform to Equation 14.1, and use that data to fit a linear regression model.

X = rnorm(100)
Z = rnorm(100)
Y = 1 * X + 0.5 * Z + rnorm(100)
dat = data.frame(Y, X, Z)
mod = lm(Y ~ X + Z, data = dat)
summary(mod)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.08587    0.11478  -0.748    0.456
X            1.10159    0.12434   8.859   <0.001
Z            0.48986    0.10550   4.643   <0.001

The true \(\beta_2\) is 1, but the estimated \(\hat{\beta}_2\) is

b_2 = coef(mod)[2]
b_2

[1] 1.101589

The estimated variance of \(\hat{\beta}_2\) is

v = vcov(mod)[2, 2]
v
[1] 0.0154611

The natural logarithm of our estimate, \(\log(\hat{\beta}_2)\), is

log(b_2)
[1] 0.09675333

Using Equation 14.4, we compute the delta method estimate of \(\text{Var}\left [\log(\hat{\beta}_2)\right ]\). Then, we take the square root to obtain the standard error of \(\log(\hat{\beta}_2)\).

sqrt(v / b_2^2)

[1] 0.1128758

To verify that this result is correct, we can use the hypotheses() function from the marginaleffects package. As noted in Chapter 4, this function can compute arbitrary functions of model parameters, along with associated standard errors.

library(marginaleffects)
hypotheses(mod, "log(b2) = 0")
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
log(b2)=0 0.0968 0.113 0.857 0.391 -0.124 0.318

Reassuringly, the estimate and standard error that we computed “manually” for \(\log(\hat{\beta}_2)\) are the same as the ones obtained “automatically” by calling hypotheses().

14.1.2 Multivariate delta method

The univariate version of the delta method can be generalized to the multivariate case. Let \(\mathcal{B}=\{\beta_1,\beta_2,\ldots,\beta_k\}\) be a vector of input parameters, and let \(\Theta=\{\theta_1,\theta_2,\ldots,\theta_n\}\) represent the vector-valued output of a transformation \(h\). For example, \(\mathcal{B}\) could be a vector of regression coefficients. \(\Theta\) could be a single summary statistic, or it could be a vector of length \(n\) with one fitted value per observation in the dataset.4

Our goal is to compute standard errors for each element of \(\Theta\). To do this, we express the multivariate delta method in matrix notation

\[ \operatorname{Var}\left(h(\mathcal{B})\right) = J^T \cdot \operatorname{Var}\left (\mathcal{B}\right ) \cdot J, \tag{14.5}\]

where \(h(\mathcal{B})\) is a function of the parameter vector \(\mathcal{B}\) and \(\text{Var}(h(\mathcal{B}))\) is the variance of that function.

\(J\) is the Jacobian of \(h\). The number of rows in that matrix is equal to the number of parameters in the input vector \(\mathcal{B}\). The number of columns in \(J\) is equal to the number of elements in the output vector \(\Theta\), that is, equal to the number of transformed quantities of interest that \(h\) produces. The element in the \(i\)th row and \(j\)th column of \(J\) is a derivative which characterizes the effect of a small change in \(\beta_i\) on the value of \(\theta_j\).

\[ J = \begin{bmatrix} \frac{\partial \theta_1}{\partial \beta_1} & \frac{\partial \theta_2}{\partial \beta_1} & \cdots & \frac{\partial \theta_n}{\partial \beta_1} \\ \frac{\partial \theta_1}{\partial \beta_2} & \frac{\partial \theta_2}{\partial \beta_2} & \cdots & \frac{\partial \theta_n}{\partial \beta_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial \theta_1}{\partial \beta_k} & \frac{\partial \theta_2}{\partial \beta_k} & \cdots & \frac{\partial \theta_n}{\partial \beta_k} \end{bmatrix} \]

Now, let’s go back to the linear model defined by Equation 14.1, with coefficients \(\beta_1\), \(\beta_2\), and \(\beta_3\). Imagine that we want to estimate the difference between the 2nd and 3rd coefficients of this model. This difference can be expressed as the single output \(\theta\) from a function \(h\) of three coefficients:

\[ \theta = h(\beta_1,\beta_2,\beta_3) = \beta_2 - \beta_3, \tag{14.6}\]

or

b = coef(mod)
b[2] - b[3]

[1] 0.6117329

To quantify the uncertainty around \(\theta\), we use the multivariate delta method formula. In Equation 14.5, \(\text{Var}(\mathcal{B})\) refers to the variance-covariance matrix of the regression coefficients, which we can extract using the vcov() function.

V = vcov(mod)
V
              (Intercept)             X             Z
(Intercept)  0.0131734661 -0.0005546706 -0.0017475913
X           -0.0005546706  0.0154611041 -0.0002355777
Z           -0.0017475913 -0.0002355777  0.0111296508

The \(J\) matrix is defined by reference to the \(h\) function. In Equation 14.6, \(h\) accepts three parameters as inputs and returns a single value. Therefore, the corresponding \(J\) matrix must have 3 rows and 1 column. We populate this matrix by taking partial derivatives of Equation 14.6 with respect to each of the input parameters.

\[ J = \left [ \begin{matrix} \frac{\partial \theta}{\partial \beta_1} \\ \frac{\partial \theta}{\partial \beta_2} \\ \frac{\partial \theta}{\partial \beta_3} \end{matrix} \right ] = \left [ \begin{matrix} 0 \\ 1 \\ -1 \end{matrix} \right ] \]

Having extracted \(\text{Var}(\mathcal{B})\) and defined an appropriate \(J\) for the quantity of interest \(\theta\), we now use Equation 14.5 to compute the standard error.

J = matrix(c(0, 1, -1), ncol = 1)
variance = t(J) %*% V %*% J
sqrt(variance)

[1] 0.164505

We have thus estimated that the difference between \(\beta_2\) and \(\beta_3\) is 0.612, and that the standard error associated with this difference is 0.165. Again, we can check this “manual” result against the output of the hypotheses() function.

hypotheses(mod, hypothesis = "b2 - b3 = 0")
Hypothesis Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b2-b3=0 0.612 0.165 3.72 <0.001 0.289 0.934

The results obtained manually and automatically are identical.

14.1.3 Robust or clustered standard errors

The calculations in the previous sections relied on “classical” estimates of the variance-covariance matrix. These estimates are useful, but they rely on strong assumptions that rule out ubiquitous phenomena like autocorrelation, heteroskedasticity, and clustering.5 If there is heterogeneity in the variance of our prediction errors—which is almost always the case—classical standard errors may inaccurately characterize the uncertainty in our estimates.

To see how the assumptions that underpin classical standard errors can be relaxed, let’s consider a linear regression model expressed in matrix notation.

\[ Y = X \beta + \varepsilon, \]

where \(Y\) is an \(n \times 1\) vector of outcomes; \(X\) is an \(n \times k\) matrix of covariates, including an intercept; \(\beta\) is a \(k \times 1\) vector of coefficients; and \(\varepsilon\) is an \(n \times 1\) vector of errors. We can estimate regression coefficients by ordinary least squares, a method which can be expressed in a few matrix operations.

\[ \hat{\beta}=(X^TX)^{-1}(X^TY) \tag{14.7}\]

The standard way to compute the variance of \(\hat{\beta}\) is to use the “sandwich” formula, so-called because it includes a “meat” expression, squeezed between identical “bread” components. The classical variance estimate for ordinary least squares can thus be written as

\[ Var(\hat{\beta}) = (X^TX)^{-1}(\sigma^2_\varepsilon X^TX)(X^TX)^{-1}, \tag{14.8}\]

where \(\sigma^2_\varepsilon\) can be estimated by taking the variance of estimated residuals.

To illustrate the use of these formulas, let’s load the Thornton (2008) data and fit a linear regression model with one predictor and an intercept.

library(sandwich)
library(marginaleffects)
dat = get_dataset("thornton")
dat = na.omit(dat[, c("incentive", "outcome")])
mod = lm(outcome ~ incentive, data = dat)
summary(mod)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.33977    0.01695   20.04   <0.001
incentive    0.45106    0.01919   23.50   <0.001

The same estimates can be obtained manually by constructing a design matrix \(X\), a response matrix \(Y\), and by applying Equation 14.7.

X = cbind(1, dat$incentive)
Y = dat$outcome
solve(crossprod(X, X)) %*% crossprod(X, Y)
          [,1]
[1,] 0.3397746
[2,] 0.4510603

The variance-covariance matrix for these coefficients is obtained via Equation 14.8.

e = Y - predict(mod)
var_e = sum((e - mean(e))^2) / df.residual(mod)
bread = solve(crossprod(X))
meat = var_e * crossprod(X)
V = bread %*% meat %*% bread
V
              [,1]          [,2]
[1,]  0.0002874265 -0.0002874265
[2,] -0.0002874265  0.0003684119

This is the same as the matrix returned by the vcov() function from base R.

vcov(mod)
              (Intercept)     incentive
(Intercept)  0.0002874265 -0.0002874265
incentive   -0.0002874265  0.0003684119

And the square roots of the diagonal elements of this matrix are the standard errors printed in the model summary at the top of this section.

sqrt(diag(V))
[1] 0.01695366 0.01919406

One key assumption about classical standard errors is encoded explicitly and clearly in Equation 14.8. In that formula, and in the code that implements it, \(\sigma^2_\varepsilon\) is a single number, treated as a constant that reflects the uncertainty in each observation uniformly. But in many applied contexts, the variance is not uniform, that is, there is heteroskedasticity.6

One popular strategy to deal with this challenge is to compute “robust” standard errors, also known as “heteroskedasticity-consistent” or “Huber-White” standard errors. There are many types of robust standard errors available, designed to address different data issues (Zeileis, Köll, and Graham 2020). The code below shows one of the simplest ways to account for heteroskedasticity. It implements the same sandwich formula as in the classical case, but the “meat” component includes a matrix with squared residuals on the diagonal. This gives us more flexibility to account for changes in the variance of prediction errors across the sample.

meat = t(X) %*% diag(e^2) %*% X
bread %*% meat %*% bread
              [,1]          [,2]
[1,]  0.0003612364 -0.0003612364
[2,] -0.0003612364  0.0004362886

A more convenient way to compute the same variance-covariance matrix is to use the vcovHC() function from the powerful sandwich package, which supports many models and estimators beyond linear regression.

vcovHC(mod, type = "HC0")
              (Intercept)     incentive
(Intercept)  0.0003612364 -0.0003612364
incentive   -0.0003612364  0.0004362886

All marginaleffects functions have a vcov argument that allows us to report various types of robust standard errors. When that argument is omitted, avg_comparisons() returns counterfactual comparisons with classical standard errors.

avg_comparisons(mod)
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.451 0.0192 23.5 <0.001 0.413 0.489

To report heteroskedasticity consistent standard errors instead, we simply execute the code below.

avg_comparisons(mod, vcov = "HC0")
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.451 0.0209 21.6 <0.001 0.41 0.492

The vcov argument accepts several different string shortcuts like the one above, but it can also take a variance-covariance matrix, a function that returns such a matrix, or a formula to specify clusters.

V = sandwich::vcovHC(mod)
avg_comparisons(mod, vcov = V)         # heteroskedasticity-consistent
avg_comparisons(mod, vcov = ~ village) # clustered by village

In sum, the delta method is a simple, fast, and flexible approach to obtain standard errors for a vast array of quantities of interest. It is the default approach implemented by all the functions in the marginaleffects package. By combining the delta method with robust estimates of the variance-covariance matrix, analysts can report uncertainty estimates that account for common data features like heteroskedasticity or clustering.

14.2 Bootstrap

To set the stage, we regenerate simulated data that conform to Equation 14.1 and refit the linear regression model from Section 14.1.1.

X = rnorm(100)
Z = rnorm(100)
Y = 1 * X + 0.5 * Z + rnorm(100)
dat = data.frame(Y, X, Z)
mod = lm(Y ~ X + Z, data = dat)
summary(mod)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06695    0.09975  -0.671    0.504
X            0.99036    0.10437   9.489   <0.001
Z            0.42283    0.09294   4.549   <0.001

The bootstrap is a method pioneered by Bradley Efron and colleagues in the 1970s and 1980s (Efron and Tibshirani 1994). It is a powerful strategy to quantify uncertainty about statistics of interest.7

Consider \(T_n\), a statistic we wish to compute. \(T_n\) is a function of an outcome \(y_n\), some regressors \(\mathbf{x}\), and the unknown population distribution function \(F\):

\[T_n = T_n ((y_1,\mathbf{x}_1),\dots,(y_n,\mathbf{x}_n),F)\]

In our running example, \(y_n\) is the outcome \(Y\), \(\mathbf{x}_n\) is a vector containing the regressors \(X\) and \(Z\), and \(T_n\) is the statistic obtained by regressing \(Y\) on \(X\) and \(Z\), and by taking a linear combination of the resulting coefficient estimates.

The statistic of interest \(T_n\) is characterized by its own distribution function \(G\), itself a function of \(F\). If we knew \(F\), then \(G\) would immediately follow. The key idea of the bootstrap is to use simulations to approximate \(F\), and then to use that approximation when quantifying uncertainty around the statistic of interest.

How do we approximate \(F\) in practice? Since the 1980s, many different variations on bootstrap simulations have been proposed. A survey of this rich literature lies outside the scope of this book, but it is instructive to illustrate one of the simplest approaches: the nonparametric bootstrap algorithm. The idea is to build several simulated datasets by drawing observations with replacement from the observed sample.8 Then, with those datasets in hand, we can estimate the statistic of interest in all of those samples.

To illustrate, here is a minimalist implementation of the bootstrap algorithm in R. First, we define a statistic function which extracts coefficient estimates from a model object and takes a linear combination of them: \(\beta_2-\beta_3\).9

# statistic of interest: linear combination of model coefficients
statistic = \(model) drop(c(0, 1, -1) %*% coef(model))

Then, we implement the bootstrap algorithm with 10,000 replicates:

# 10000 bootstrap replicates
theta = vector(length = 10000)
for (i in seq_along(theta)) {
  # draw bootstrap sample
  idx = sample(1:nrow(dat), nrow(dat), replace = TRUE)
  boot_dat = dat[idx,]
  # estimate statistic
  boot_mod = lm(Y ~ X + Z, data = boot_dat)
  theta[i] = statistic(boot_mod)
}

Finally, we take quantiles of the bootstrap distribution of the statistic to construct a confidence interval:

sprintf("%.3f [%.3f, %.3f]", 
    statistic(mod),
    quantile(theta, .025),
    quantile(theta, .975)
)
[1] "0.568 [0.338, 0.807]"

A comparable bootstrap interval can be obtained without writing the resampling loop ourselves, by piping the output of hypotheses() into the inferences() function from marginaleffects. The code below is a minimal sketch: it assumes that a hypotheses() object can be passed to inferences() with method = "boot", which resamples the data and refits the model in the background. The resulting interval will differ slightly from the manual one, because each run draws different bootstrap samples.
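
# bootstrap interval for b2 - b3 (sketch; assumes inferences() accepts hypotheses() output)
library(marginaleffects)
hypotheses(mod, hypothesis = "b2 - b3 = 0") |>
    inferences(method = "boot", R = 10000)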

Bootstrapping, particularly in its nonparametric form, is a versatile technique that estimates the sampling distribution of a statistic by resampling the data. It is widely applicable, especially when the underlying distribution is unknown or complex, making it a powerful tool for deriving standard errors and confidence intervals. However, bootstrapping can be computationally intensive, particularly for large datasets or complex models. In addition, selecting an adequate resampling scheme can be challenging when the data-generating process is complex, as improper choices can lead to biased or invalid results.

14.3 Simulation

Yet another strategy to compute uncertainty estimates for quantities of interest is simulation-based inference. This method is an alternative to traditional analytical approaches, leveraging the power of computational simulations to estimate the variability of a given parameter. One of the earliest applications of this approach can be found in Krinsky and Robb (1986), who use simulations to quantify the sampling uncertainty around estimates of elasticities in a regression modeling context. King, Tomz, and Wittenberg (2000) subsequently popularized the strategy and its implementation in the CLARIFY software.

This approach rests on the assumption that the regression-based estimator follows a normal distribution. This assumption allows us to use the properties of the multivariate normal distribution to simulate numerous potential outcomes. The steps involved in this process are as follows:

  1. Simulate parameters: Begin by drawing many sets of parameters from a multivariate normal distribution. The mean of this distribution is the estimated vector of coefficients obtained from the regression analysis, and the variance corresponds to the estimated covariance matrix of these coefficients.
  2. Transform parameters: Next, each simulated set of coefficients is transformed to generate multiple versions of the quantity of interest.
  3. Summarize draws: Finally, construct a confidence interval for the quantity of interest by taking quantiles of the simulated quantities.

Intuitively, this process of drawing simulated quantities can be thought of as an “informal” Bayesian posterior simulation, as suggested by Rainey (2024). While it does not strictly follow Bayesian methods, it shares a similar philosophy of using the distribution of simulated parameters to infer the uncertainty around the estimated quantity. By repeatedly sampling from the parameter space, we can gain a clearer picture of the possible variability and thereby construct more reliable confidence intervals.

Implementing this form of simulation-based inference is relatively easy:

# Step 1: Simulate alternative parameters
library(mvtnorm)
draws = rmvnorm(10000, coef(mod), vcov(mod))

# Step 2: Transform the parameters
theta = draws %*% c(0, 1, -1)

# Step 3: Summarize draws
sprintf("%.3f [%.3f, %.3f]", 
    statistic(mod),
    quantile(theta, .025),
    quantile(theta, .975)
)
[1] "0.568 [0.299, 0.841]"

Note that the simulation-based confidence interval is slightly different from the intervals obtained via the bootstrap or the delta method, but it is of a similar magnitude.

Simulation-based inference is easy to implement, convenient, and it applies to a broad range of models. It trades the analytic approximation of the delta method, which relies on Taylor-series expansions, for a numerical approximation through simulation. Although simulation-based inference tends to be more computationally expensive than the delta method, it is usually faster than bootstrapping, which requires refitting the model multiple times.

Simulation-based intervals can also be obtained by chaining marginaleffects functions. The code below is a minimal sketch: it assumes that a hypotheses() object can be passed to inferences() with method = "simulation", which draws coefficient vectors from a multivariate normal distribution, as in the manual example above.
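
# simulation-based interval for b2 - b3 (sketch; assumes inferences() accepts hypotheses() output)
library(marginaleffects)
hypotheses(mod, hypothesis = "b2 - b3 = 0") |>
    inferences(method = "simulation", R = 10000)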

14.4 Conformal prediction

In Chapter 5, we used the predictions() function to compute confidence intervals around fitted values. These intervals are designed to characterize the uncertainty about the model-based expected value of the response. One common misunderstanding is that such confidence intervals should be calibrated to cover a certain percentage of unseen data points. This is not the case. In fact, the confidence interval reported by default by the predictions() function will typically cover a much smaller share of out-of-sample outcomes than its nominal coverage.

To illustrate, let’s conduct a simulation. We draw two samples of 25 observations randomly from a normal distribution with mean \(\pi\): \(Y_{train}\) and \(Y_{test}\). Then, we use the lm() function to fit a linear model on the training set \(Y_{train}\), and we use the predictions() function to compute a 90% confidence interval for the expected value of the response. Finally, we check how many observations in the test set \(Y_{test}\) are covered by this interval. We repeat the experiment 1000 times:

library(marginaleffects)
set.seed(48103)

simulation = function(...) {
  Y_train = rnorm(25, mean = pi)
  Y_test = rnorm(25, mean = pi)
  m = lm(Y_train ~ 1)
  p = predictions(m, conf_level = .90)
  out = c(
    "3.1416" = mean(p$conf.low <= pi & p$conf.high >= pi),
    "Test" = mean(p$conf.low <= Y_test & p$conf.high >= Y_test)
  )
  return(out)
}

coverage = apply(replicate(1000, simulation()), 1, mean)
coverage
 3.1416    Test 
0.88100 0.24668 

We see that the confidence intervals cover the true mean of the outcome (\(\pi\) or 3.1416) nearly 90% of the time, which matches the confidence level that we selected with conf_level. However, the confidence intervals only cover out-of-sample observations (\(Y_{test}\)) about 25% of the time. Clearly, if we care about out-of-sample predictions, that is, if we want our interval to cover a specific share of out-of-sample data points, we must compute “prediction intervals” instead of “confidence intervals.”
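
For a correctly specified model, classical prediction intervals are available in base R. The code below is a minimal sketch that reuses the simulation setup above: it fits an intercept-only model on one training sample and checks how often the 90% prediction interval returned by predict.lm() covers the test observations. Such intervals are only well calibrated when the model is correctly specified, which motivates the more general approach introduced next.

# classical prediction interval from predict.lm() (illustrative sketch)
set.seed(48103)
Y_train = rnorm(25, mean = pi)
Y_test = rnorm(25, mean = pi)
m = lm(Y_train ~ 1)
# intercept-only model: a one-row data frame is enough for prediction
pred_int = predict(m, newdata = data.frame(dummy = 1),
    interval = "prediction", level = 0.90)
# share of test observations covered by the prediction interval
mean(Y_test >= pred_int[, "lwr"] & Y_test <= pred_int[, "upr"])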

How can we compute prediction intervals with adequate coverage? Conformal prediction is one—very flexible and powerful—approach. In their excellent tutorial, Angelopoulos and Bates (2022) write that conformal prediction is

“a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions.”

These are extraordinary claims which deserve to be underlined: In principle, conformal prediction can offer well-calibrated intervals, regardless of the prediction model we use, and even if that model is misspecified.

There are four main caveats to this.

  1. The conformal prediction algorithms implemented in marginaleffects are designed for exchangeable data.10 They do not offer coverage guarantees in contexts where exchangeability is violated, such as in time series data, when there is spatial dependence between observations, or when there is distribution drift between the training and test data.
  2. These algorithms offer marginal coverage guarantees, that is, they guarantee that a random test point will fall within the interval with a given probability. Below, we show an example where the prediction interval covers the right number of test points overall, but is not well calibrated locally, in different strata of the predictors. Different algorithms have recently been proposed to offer class-conditional coverage guarantees (see Ding et al. (2023) for an example).
  3. The width of the conformal prediction interval will typically depend on the quality of the prediction model and of the score function.
  4. The score functions implemented in marginaleffects simply take the residual, that is, the difference between the observed outcome and the predicted value. This means that the type argument must ensure that observations and predictions are on commensurable scales (usually type="response" or type="prob").

A detailed technical explanation of conformal prediction lies outside the scope of this book, but outlining a few key elements can provide some intuition. Roughly speaking, conformal prediction starts with a trained model that produces some notion of confidence or uncertainty, like softmax probabilities for classification or residuals for regression. Then, it evaluates how well this model performs on a separate calibration dataset, which includes data points that were not used to train the model. The key step is to calculate a “conformal score” for each calibration example. This score measures how unusual or poorly predicted a data point is. These scores are ranked, and the algorithm sets a threshold based on a quantile of the scores (adjusted for sample size). At prediction time, the model computes a new score for the test point, and the prediction set includes all possible answers whose score is within the threshold. This works because even if the model is misspecified or biased, the quantile-based thresholding ensures that the prediction set will still cover the true value with the desired probability, making the method “distribution-free.” If the model is confident, the set is small; if the model is uncertain, the set is larger—directly reflecting the model’s uncertainty in a calibrated way.
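
To make this recipe concrete, here is a minimal sketch of the simplest variant, split conformal prediction, for a regression outcome. This code is illustrative only: it uses the absolute residual as the conformal score and a single train/calibration split, whereas the CV+ algorithm used below relies on cross-validation.

# minimal sketch of split conformal prediction (illustrative; not the CV+ algorithm)
set.seed(1)
n = 1000
X = rnorm(n)
Y = 1 + 2 * X + rnorm(n)
d = data.frame(X, Y)
d_train = d[1:500, ]    # used to fit the model
d_calib = d[501:1000, ] # used to compute conformal scores

m = lm(Y ~ X, data = d_train)

# conformal score: absolute residual on the calibration set
scores = abs(d_calib$Y - predict(m, newdata = d_calib))

# threshold: quantile of the scores, adjusted for the calibration sample size
alpha = 0.1 # target coverage: 90%
n_calib = nrow(d_calib)
q_hat = quantile(scores, probs = ceiling((n_calib + 1) * (1 - alpha)) / n_calib)

# prediction interval for a new observation: point prediction +/- threshold
x_new = data.frame(X = 0.5)
predict(m, newdata = x_new) + c(-1, 1) * q_hat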

There are several alternative conformal prediction algorithms. For the purposes of this tutorial, we will only use one called “CV+.”

TODO: more details on the CV+ algorithm

14.4.1 Numeric outcome

To illustrate the use of conformal prediction, let’s consider a dataset which includes demographic information on over 1 million members of the US military (TODO: CITE):

dat = get_dataset("military")
head(dat)
# A data frame: 6 × 8
  rownames grade   branch gender race    hisp   rank officer
*    <int> <chr>   <chr>  <chr>  <chr>   <lgl> <int>   <dbl>
1        1 officer army   male   ami/aln TRUE      2       1
2        2 officer army   male   ami/aln TRUE      2       1
3        3 officer army   male   ami/aln TRUE      5       1
4        4 officer army   male   ami/aln TRUE      5       1
5        5 officer army   male   ami/aln TRUE      5       1
6        6 officer army   male   ami/aln TRUE      5       1

We split the data into two sets of 10,000 randomly selected observations each. The training set is used to fit the model, and the test set is used to check the coverage of our interval:

idx = sample(seq_len(nrow(dat)), 20000)
test = dat[idx[1:10000],]
train = dat[idx[10001:20000],]

Now, we fit a linear regression model with gender and race as predictors. Our goal is to predict the rank of each individual in the test set.

mod = lm(rank ~ gender * race, data = train)

Finally, we use the predictions() and inferences() functions to compute unit-level predictions and conformal prediction intervals, using 5 cross-validation folds:

p = predictions(mod, conf_level = .8) |> 
  inferences(
    R = 5,
    method = "conformal_cv+",
    conformal_test = test)

covered = with(p, rank > pred.low & rank <= pred.high)
mean(covered)
[1] 0.8346

We find that 83% of observations in the test set are covered by the conformal prediction intervals. This is quite close to our target coverage rate of 80%.

14.4.2 Categorical outcome

Now, let’s illustrate an analogous strategy in the context of a classification task. We use a multinomial logit model to predict which branch of the military each observation belongs to: air force, army, marine corps, or navy.

library(nnet)
mod = multinom(branch ~ gender * race, data = train, trace = FALSE)

When the outcome is categorical, we use conformal_score="softmax". With this argument, inferences() generates “conformal prediction sets,” that is, sets of possible outcome classes with coverage guarantees:

p = predictions(mod, conf_level = .7) |>
    inferences(
        R = 5,
        method = "conformal_cv+",
        conformal_score = "softmax",
        conformal_test = test)
head(p)
Pred Set
army, navy
air force, army
air force, army
air force, army , navy
air force, army
army, navy

As shown above, inferences() returns a list column of sets for each observation. On average, those sets should cover the true value about 70% of the time. For example, for the first observation in the dataset, the conformal prediction is {army, navy} and the true value is army. The conformal prediction set thus covers the true value.

We can compute the overall coverage rate in the test set as follows:

mean(sapply(seq_len(nrow(p)), function(i) p$branch[i] %in% p$pred.set[[i]]))
[1] 0.7056

As expected, the conformal sets include the true value about 70% of the time.

14.4.3 Misspecification

As noted above, a conformal prediction interval should be valid even if the model is misspecified. We now illustrate this property with two examples.

14.4.3.1 Polynomials

First, we generate data from a model with polynomial terms, but estimate a linear model without them. Then, we plot the results and compute the coverage of the prediction interval:

library("ggplot2")

# simulate data
N = 1000
X = rnorm(N * 2)
dat = data.frame(
    X = X,
    Y = X + X^2 + X^3 + rnorm(N * 2))
train = dat[1:N,]
test = dat[(N + 1):nrow(dat),]

# fit misspecified model
m = lm(Y ~ X, data = train)

# conformal predictions
p = predictions(m) |>
  inferences(
    R = 5,
    method = "conformal_cv+",
    conformal_test = test)

# plot both the confidence and prediction intervals
ggplot(p, aes(X, Y)) +
  geom_point(alpha = .05) +
  geom_ribbon(aes(X, ymin = pred.low, ymax = pred.high), alpha = .2) +
  geom_ribbon(aes(X, ymin = conf.low, ymax = conf.high), alpha = .4)

Clearly, the prediction intervals are much wider than the confidence intervals. This is normal, because the confidence intervals are only designed to cover the expected mean of the outcome:

# prediction interval coverage
with(p, mean(Y <= pred.high & Y >= pred.low))
[1] 0.944
# confidence interval coverage
with(p, mean(Y <= conf.high & Y >= conf.low))
[1] 0.094

This example is interesting because it shows that the prediction interval has adequate marginal coverage. However, the intervals are not necessarily well calibrated “locally”, in different strata of \(X\). In the figure above, our model is misspecified, so we make more mistakes in the tails, where predictions are bad. In contrast, the interval catches more observations in the middle of the distribution, which ensures that the overall error rate is adequate.
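
To see this local miscalibration directly, we can compute the coverage rate within bins of \(X\). The short sketch below is illustrative; it assumes the object p from the previous code block is still in memory and splits the test observations into quartiles of X:

# coverage of the conformal prediction interval within quartiles of X
bins = cut(p$X, breaks = quantile(p$X, probs = seq(0, 1, by = 0.25)),
    include.lowest = TRUE)
tapply(p$Y >= p$pred.low & p$Y <= p$pred.high, bins, mean)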

14.4.3.2 Poisson vs. negative binomial

Here is a second example of model misspecification. We generate data from a negative binomial model, but estimate a Poisson model. Nevertheless, the conformal prediction interval has good coverage:

library("MASS")
n = 10000
X = rnorm(n)
eta = -1 + 2*X
mu = exp(eta)
Y = rnegbin(n, mu = mu, theta = 1)
dat = data.frame(X = X, Y = Y)
train = dat[1:5000,]
test = dat[5001:nrow(dat),]

mod = glm(Y ~ X, data = train, family = poisson)

p = predictions(mod, conf_level = .9) |>
    inferences(
        method = "conformal_cv+",
        R = 10,
        conformal_test = test)

with(p, mean(Y >= pred.low & Y <= pred.high))
[1] 0.9084


  1. Sections 14.2 and 14.3↩︎

  2. Scaling property of variance: Let \(a\) be a constant and \(\lambda\) be a random variable, then \(\text{Var}[a\lambda]=a^2\text{Var}[\lambda]\).↩︎

  3. Section 3.1.1↩︎

  4. Chapter 5↩︎

  5. Autocorrelation often arises in time series estimation when the prediction errors we make for a unit of observation at time \(t\) are correlated to the prediction errors we make for the same unit at time \(t+1\). Clustering sometimes occurs when errors are similar for study participants drawn from well-defined groups (e.g., classrooms, neighborhoods, or provinces). Heteroskedasticity arises when the variance of the errors is inconsistent across observations, such as when our model makes bigger prediction errors (in either direction) for units with certain characteristics. ↩︎

  6. See the definition of heteroskedasticity in footnote 5.↩︎

  7. The notation and ideas in this section follow Hansen (2022a).↩︎

  8. We draw with replacement, otherwise each bootstrap sample would exactly replicate the original data.↩︎

  9. We use the drop() function to convert the default output—a one-cell matrix—to a scalar.↩︎

  10. The usual “independent and identically distributed” assumption is a special case of exchangeability.↩︎