This chapter introduces four approaches to quantify uncertainty: the delta method, the bootstrap, simulation-based inference, and conformal prediction. We discuss the intuition behind each of these strategies, and reinforce that intuition by implementing simple versions of each strategy “by hand.” The code examples presented for pedagogical purposes; more robust and convenient implementations are supplied by the marginaleffects package. Readers who are not familiar with the basics of multivariable calculus may want to skip this chapter.
13.1 Delta method
The delta method is a statistical technique used to approximate the variance of a function of random variables. For example, data analysts often want to compare regression coefficients or predicted probabilities by subtracting them from one another. Such comparisons can be expressed as functions of a model’s parameters, but the exact asymptotic distributions of those functions can be difficult to derive analytically. The delta method offers a convenient way to approximate those distributions. It allows us to compute standard errors for many of the quantities of interest that we will discuss in this book.
Roughly speaking, the delta method relies on the observation that if the distribution of a parameter is asymptotically normal, then the limiting distribution of a smooth function of this parameter is also normal (Wasserman 2004). This is a powerful result, which is especially useful in the context of maximum likelihood estimation, where parameters are expected to be asymptotically normal.
The delta method is a useful way to quantify uncertainty. It is versatile and fast. Yet, the technique also has important limitations: it relies on a coarse linear approximation, its desirable properties depend on the asymptotic normality of the estimator, and it is only valid when a function is continuously differentiable in a neighborhood of its parameters. When analysts do not believe that these conditions hold, they can turn to the bootstrap or toward simulation-based inference instead (Sections 13.2 and 13.3).
The theoretical properties of the delta method are proved in many statistical textbooks (Wasserman 2004; Hansen 2022b; Berger and Casella 2024). Instead of demonstrating them anew, the rest of this section provides a hands-on tutorial, with a focus on computation. The goal is to reinforce intuitions by illustrating how to compute standard errors in two cases: univariate and multivariate. Readers who are not interested in the manual computation of standard errors, and those who have forgotten too much calculus, may want to skip ahead to Section 3.5.
13.1.1 Univariate delta method
To start, we will use the delta method to compute the standard error of a function with one parameter: the natural logarithm of a single regression coefficient. Consider a setting with three measured variables: \(Y\), \(X\), and \(Z\). With these variables in hand, the analyst estimates a linear regression model of this form:
\[
Y = \beta_1 + \beta_2 X + \beta_3 Z + \varepsilon
\tag{13.1}\]
Let \(\hat{\beta}_2\) be an estimator of the true parameter \(\beta_2\). Assume \(\hat{\beta}_2\) is unbiased, such that \(E[\hat{\beta}_2] = \beta_2\). We know that the variance of \(\hat{\beta}_2\) is \(\text{Var}(\hat{\beta}_2)\), but our ultimate goal is to compute the variance of a specific function of the estimator: \(\text{Var}\left [\log(\hat{\beta}_2) \right ]\).
We start by applying a Taylor series expansion. This mathematical tool allows us to rewrite smooth functions as infinite sums of derivatives and polynomials. This is a useful way to simplify and approximate complex functions. Generically, the Taylor series of a function \(g(\alpha)\) about a point \(a\) is given by:
where \(g^{(n)}(a)\) denotes the \(n\)th derivative of \(g(\alpha)\) evaluated at \(\alpha=a\).
Equation 13.2 is an infinite series, so it is not practical (or possible) to compute all its terms. In real-world applications, we thus truncate the series by dropping higher order terms. For instance, we can construct a first-order Taylor series by keeping only the first two terms on the right-hand side of Equation 13.2.
We leverage this idea by approximating \(\log(\hat{\beta}_2)\) using a first-order Taylor series about the mean (\(\beta_2\)), and by inserting this approximation in the variance operator:1
This linearized approximation is useful because it allows further simplification. First, we know that constants like \(\beta_2\) contribute nothing to a variance, which implies that some of the terms in Equation 13.3 drop out. Second, we can apply the scaling property of variances2 to rewrite the expression as:
Equation 13.4 gives us a formula to approximate the variance of \(\log(\hat{\beta}_2)\). This formula is expressed in terms of two quantities that are easy to compute: the regression coefficient estimate and the square of its standard error.
To show how this formula can be applied, let’s simulate data that conform to Equation 13.1, and use it to fit a linear regression model:
Using Equation 13.4, we compute the delta method estimate of \(\text{Var}\left [\log(\hat{\beta}_2)\right ]\). Then, we take the square root to obtain the standard error of \(\log(\hat{\beta}_2)\):
To verify that this result is correct, we can use the hypotheses() function from the marginaleffects package. Chapter 4 will offer a detailed tutorial on the use of that function. For now, it suffices to note that hypotheses() allows us to compute arbitrary functions of model parameters, along with associated standard errors.
Reassuringly, the estimate and standard error that we computed “manually” for \(\log(\hat{\beta}_2)\) are the same as the ones obtained “automatically” by calling hypotheses().
13.1.2 Multivariate delta method
The univariate version of the delta method can be generalized to the multivariate case. Let \(\mathcal{B}=\{\beta_1,\beta_2,\ldots,\beta_k\}\) be a vector of input parameters, and \(\Theta=\{\theta_1,\theta_2,\ldots,\theta_n\}\) represent the vector-valued output of an \(h\) transformation. For example, \(\mathcal{B}\) could a vector of regression coefficients. \(\Theta\) could be a single summary statistic, or it could be a vector of length \(n\) with one fitted value per observation in the dataset (see Chapter 5).
Our goal is to compute standard errors for each element of \(\Theta\). To do this, we express the multivariate delta method in matrix notation:
where \(h(\mathcal{B})\) is a function of the parameter vector \(\mathcal{B}\) and \(\text{Var}(h(\mathcal{B}))\) is the variance of that function.
\(J\) is the Jacobian of \(h\). The number of rows in that matrix is equal to the number of parameters in the input vector \(\mathcal{B}\). The number of columns in \(J\) is equal to the number of elements in the output vector \(\Theta\), that is, equal to the number of transformed quantities of interest that \(h\) produces. The element in the \(i\)th row and \(j\)th column of \(J\) is a derivative which characterizes the effect of a small change in \(\beta_i\) on the value of \(\theta_j\).
Now, let’s go back to the linear model defined by Equation 13.1, with coefficients \(\beta_1\), \(\beta_2\), and \(\beta_3\). Imagine that we want to estimate the difference between the 2nd and 3rd coefficients of this model. This difference can be expressed as the single output \(\theta\) from a function \(h\) of three coefficients:
To quantify the uncertainty around \(\theta\), we use the multivariate delta method formula. In Equation 13.5, \(\text{Var}(\mathcal{B})\) refers to the variance-covariance matrix of the regression coefficients, which we can extract using the vcov() function:
(Intercept) X Z
(Intercept) 0.0113412244 0.0002144091 -0.0009195640
X 0.0002144091 0.0111630283 0.0002884508
Z -0.0009195640 0.0002884508 0.0089812565
The \(J\) matrix is defined by reference to the \(h\) function. In Equation 13.6, \(h\) accepts three parameters as inputs and returns a single value. Therefore, the corresponding \(J\) matrix must have 3 rows and 1 column. We populate this matrix by taking partial derivatives of Equation 13.6 with respect to each of the input parameters:
Having extracted \(\text{Var}(\mathcal{B})\) and defined an appropriate \(J\) for the quantity of interest \(\theta\), we now use Equation 13.5 to compute the standard error:
We have thus estimated that the difference between \(\beta_2\) and \(\beta_3\) is 0.551, and that the standard error associated with this difference is 0.14. Again, we can check this “manual” result against the output of the hypotheses() function:
The calculations above rely on a “classical” estimate of the variance-covariance matrix, which assumes that the variance of the errors have constant variance (“homoskedastic”). In some cases, users may want alternative variance estimators to report standard errors that are “robust” to heteroskedasticity and autocorrelation, or that take into account the “clustered” nature of a dataset.
To achieve this, all we have to do is substitute the \(\text{Var}(\mathcal{B})\) matrix by a suitable alternative. For example, using the heteroskedasticity-robust variance-covariance matrix produced by the sandwich package for R(Zeileis, Köll, and Graham 2020), we get a slightly different standard error:
All estimation functions in the marginaleffects package have a vcov argument which allows us to report robust or clustered standard errors automatically:
In sum, the delta method is a simple, fast, and flexible approach to obtain standard errors for a vast array of quantities of interest. It is the default approach implemented by all the functions in the marginaleffects package. In the next sections, we consider two alternatives: the bootstrap and simulation-based inference.
13.2 Bootstrap
The Bootstrap is a method pioneered by Bradley Efron and colleagues in the 1970s and 1980s (Efron and Tibshirani 1994). It is a powerful strategy to quantify uncertainty about statistics of interest.3
Consider \(T_n\), a statistic we wish to compute. \(T_n\) is a function of an outcome \(y_n\), some regressors \(\mathbf{x}\), and the unknown population distribution function \(F\):
In our running example, \(y_n\) is bill_length_mm, \(\mathbf{x}_n\) is a vector of binary variables representing different values of island, and \(T_n\) is the statistic obtained by regressing \(y_n\) on \(\mathbf{x}_n\), and by taking a linear combination of the resulting coefficient estimates.
The statistic of interest \(T_n\) is characterized by its own distribution function \(G\), itself a function of \(F\). If we knew \(F\), then \(G\) would immediately follow. The key idea of the bootstrap is to use simulations to approximate \(F\), and then to use that approximation when quantifying uncertainty around the statistic of interest.
How do we approximate \(F\) in practice? Since the 1980s, many different variations on bootstrap simulations have been proposed. A survey of this rich literature lies outside the scope of this book, but it is instructive to illustrate one of the simplest approaches: the nonparametric bootstrap algorithm. The idea is to build several simulated datasets by drawing observations with replacement from the observed sample.4 Then, with those datasets in hand, we can estimate the statistic of interest in all of those samples.
To illustrate, here is a minimalist implementation of the bootstrap algorithm in R. First, we define a statistic function which extracts coefficient estimates from a model object and takes a linear combination of them: \(\beta_2-\beta_3\).5
# statistic of interest: linear combination of model coefficientsstatistic<- \(model)drop(c(0, 1, -1)%*%coef(model))
Then, we implement the bootstrap algorithm with 10,000 replicates:
Bootstrapping, particularly in its nonparametric form, is a versatile technique that estimates the sampling distribution of a statistic by resampling the data. It is widely applicable, especially when the underlying distribution is unknown or complex, making it a powerful tool for deriving standard errors and confidence intervals. However, bootstrapping can be computationally intensive, particularly for large datasets or complex models. In addition, selecting an adequate resampling scheme can be challenging when the data-generating process is complex, as improper choices can lead to biased or invalid results.
13.3 Simulation
Yet another strategy to compute uncertainty estimates for quantities of inference is simulation-based inference. This method is an alternative to traditional analytical approaches, leveraging power of computational simulations to estimate the variability of a given parameter. One of the earliest applications of this approach can be found in Krinsky and Robb (1986), who use simulations to quantify the sampling uncertainty around estimates of elasiticities in a regression modelling context. King, Tomz, and Wittenberg (2000) subsequentily popularized the strategy and its implementation in the CLARIFY software.
The foundation of this approach assumes that the regression-based estimator follows a normal distribution. This assumption allows us to use the properties of the multivariate normal distribution to simulate numerous potential outcomes. The steps involved in this process are as follows:
Simulate parameters: Begin by drawing many sets of parameters from a multivariate normal distribution. The mean of this distribution is the estimated vector of coefficients obtained from the regression analysis, and the variance corresponds to the estimates covariance matrix of these coefficients.
Transform parameters: Next, each simulated set of coefficients is transformed to generate multiple versions of the quantity of interest.
Summarize draws: Finally, construct a confidence interval for the quantity of interest by taking quantiles of the simulated quantities.
Intuitively, this process of drawing simulated quantities can be thought of as an “informal” Bayesian posterior simulation, as suggested by Rainey (2024). While it does not strictly follow Bayesian methods, it shares a similar philosophy of using the distribution of simulated parameters to infer the uncertainty around the estimated quantity. By repeatedly sampling from the parameter space, we can gain a clearer picture of the possible variability and thereby construct more reliable confidence intervals.
Implementing this form of simulation-based inference is relatively easy:
Note that the simulation-based confidence interval is slightly different than those obtained via bootstrap or delta method, but generally of the same magnitude.
Simulation-based inference is easy to implement, convenient, and it applies to a broad range of models. It trades the analytic approximation of the delta method, which relies on Taylor-series expansions, for a numerical approximation through simulation. Although simulation-based inference tends to be more computationally expensive than the delta method, it is usually faster than bootstrapping, which requires refitting the model multiple times.
13.4 Conformal prediction
In Chapter 5, we used the predictions() function compute confidence intervals around fitted values. These intervals are designed to characterize the uncertainty about the model-based expected value of the response. One common misunderstanding is that such confidence intervals should be calibrated to cover a certain percentage of unseen data points. This is not the case. In fact, the confidence interval reported by default by the predictions() function will typically cover a much smaller share of out-of-sample outcomes than its nominal coverage.
To illustrate, let’s conduct a simulation. We draw two samples of 25 observations randomly from a normal distribution with mean \(\pi\): \(Y_{train}\) and \(Y_{test}\). Then, we use the lm() function to fit a linear model on the training set \(Y_{train}\), and we use the predictions() function to compute a 90% confidence interval for the expected value of the response. Finally, we check how many observations in the test set \(Y_{test}\) are covered by this interval. We repeat the experiment 1000 times:
library(marginaleffects)set.seed(48103)simulation<-function(...){Y_train<-rnorm(25, mean =pi)Y_test<-rnorm(25, mean =pi)m<-lm(Y_train~1)p<-predictions(m, conf_level =.90)out<-c("3.1416"=mean(p$conf.low<=pi&p$conf.high>=pi),"Test"=mean(p$conf.low<=Y_test&p$conf.high>=Y_test))return(out)}coverage<-apply(replicate(1000, simulation()), 1, mean)coverage
3.1416 Test
0.88100 0.24668
We see that the confidence intervals cover the true mean of the outcome (\(\pi\) or 3.1416) nearly 90% of the time, which matches the significance level that we selected with conf_level. However, the confidence intervals only cover out-of-sample observations (\(Y_{test}\)) about 25% of the time. Clearly, if we care about out of sample predictions, that is, if we want our interval to cover a specific share of out-of-sample data points, we must compute “prediction intervals” instead of “confidence intervals.”
How can we compute prediction intervals with adequate coverage? Conformal prediction is one—very flexible and powerful—approach. In their excellent tutorial, Angelopoulos and Bates (2022) write that conformal prediction is
“a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions.”
These are extraordinary claims which deserve to be underlined: In principle, conformal prediction can offer well-calibrated intervals, regardless of the prediction model we use, and even if that model is misspecified.
There are four main caveats to this. First, the conformal prediction algorithms implemented in marginaleffects are designed for exchangeable data.6 They do not offer coverage guarantees in contexts where exchangeability is violated, such as in time series data, when there is spatial dependence between observations, or when there is distribution drift between the training and test data. Second, these algorithms offer marginal coverage guarantees, that is, they guarantee that a random test point will fall within the interval with a given probability. Below, we show an example where the prediction interval covers the right number of test points overall, but is not well calibrated locally, in different strata of the predictors. Different algorithms have recently been proposed to offer class-conditional coverage guarantees (see (Din2023?) for an example). Third, the width of the conformal prediction interval will typically depend on the quality of the prediction model and of the score function. Finally, the score functions implemented in marginaleffects simply take the residual—or difference between the observed outcome and predicted value. This means that the type argument must ensure that observations and predictions are on commensurable scales (usually type="response" or type="prob").
There are several alternative conformal prediction algorithms. For the purposes of this tutorial, we will only use one: CV+.
TODO: more details on the CV+ algorithm
13.4.1 Numeric outcome
To illustrate the use of conformal prediction, let’s consider a dataset which includes demographic information on over 1 million members of the US military (TODO: CITE):
# A tibble: 6 × 8
rownames grade branch gender race hisp rank officer
<int> <chr> <chr> <chr> <chr> <lgl> <int> <dbl>
1 1 officer army male ami/aln TRUE 2 1
2 2 officer army male ami/aln TRUE 2 1
3 3 officer army male ami/aln TRUE 5 1
4 4 officer army male ami/aln TRUE 5 1
5 5 officer army male ami/aln TRUE 5 1
6 6 officer army male ami/aln TRUE 5 1
We split the data in two sets with 10000 randomly selected observations each. The training set is used to fit the model, and the test set is used to check the coverage of our interval:
Finally, we use the predictions() and inferences() to compute unit-level predictions and conformal prediction intervals, using 5 cross-validation folds:
p<-predictions(mod, conf_level =.8)|>inferences( R =5, method ="conformal_cv+", conformal_test =test)covered<-with(p, rank>pred.low&rank<=pred.high)mean(covered)
[1] 0.8346
We find that 83% of observations in the test set are covered by the conformal prediction intervals. This is quite close to our target coverage rate of 80%.
13.4.2 Categorical outcome
Now, let’s illustrate an analogous strategy in the context of a classification task. We use a multinomial logit model to predict what brank of the military each observation belongs to: air force, army, marine corps, or navy.
When the outcome is categorical, we use conformal_score="softmax". With this argument, inferences() generates “conformal prediction sets,” that is, sets of possible outcome classes with coverage guarantees:
As shown above, inferences() returns a list column of sets for each observation. On average, those sets should cover the true value about 70% of the time. For example, for the first observation in the dataset, the conformal prediction is {army, navy} and the true value is army. The conformal prediction set thus covers the true value.
We can compute the overal coverage rate in the test set as:
As expected, conformal sets include the true value of about 70% of the time.
13.4.3 Misspecification
As noted above, a conformal prediction interval should be valid even if the model is misspecified. We now illustrate this property with two examples.
13.4.3.1 Polynomials
To illustrate this, we generate data from a linear model with polynomials, but estimate a linear model without polynomials. Then, we plot the results and compute the coverage of the prediction interval:
library("ggplot2")# simulate dataN<-1000X<-rnorm(N*2)dat<-data.frame( X =X, Y =X+X^2+X^3+rnorm(N*2))train<-dat[1:N,]test<-dat[(N+1):nrow(dat),]# fit misspecified modelm<-lm(Y~X, data =train)# conformal predictionsp<-predictions(m)|>inferences( R =5, method ="conformal_cv+", conformal_test =test)# plot both the confidence and prediction intervalsggplot(p, aes(X, Y))+geom_point(alpha =.05)+geom_ribbon(aes(X, ymin =pred.low, ymax =pred.high), alpha =.2)+geom_ribbon(aes(X, ymin =conf.low, ymax =conf.high), alpha =.4)
Clearly, the prediction intervals are much wider than the confidence intervals. This is normal, because the confidence intervals are only to cover the expected mean of the outcome:
# prediction intervals have good converage in test setwith(p, mean(Y<=pred.high&Y>=pred.low))
[1] 0.944
# prediction intervals have bad converage in test setwith(p, mean(Y<=conf.high&Y>=conf.low))
[1] 0.094
This example is interesting, because it shows that the prediction interval has adquate marginal coverage. However, the intervals are not necessarily well calibrated “locally”, in different strata of \(X\). In the figure above, our model is misspecified, so we make more mistakes in the tails, where predictions are bad. In contrast, the interval catches more observations in the middle of the distribution, which ensures that the overall error rate is adequate.
13.4.3.2 Poisson vs. negative binomial
Here is a second example of model misspecification. We generate data from a negative binomial model, but estimate a Poisson model. Nevertheless, the conformal prediction interval has good coverage:
library("MASS")n<-10000X<-rnorm(n)eta<--1+2*Xmu<-exp(eta)Y<-rnegbin(n, mu =mu, theta =1)dat<-data.frame(X =X, Y =Y)train<-dat[1:5000,]test<-dat[5001:nrow(dat),]mod<-glm(Y~X, data =train, family =poisson)p<-predictions(mod, conf_level =.9)|>inferences( method ="conformal_cv+", R =10, conformal_test =test)with(p, mean(Y>=pred.low&Y<=pred.high))
[1] 0.9084
Angelopoulos, Anastasios N., and Stephen Bates. 2022. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification,” no. arXiv:2107.07511 (September). https://doi.org/10.48550/arXiv.2107.07511.
Berger, Roger, and George Casella. 2024. Statistical Inference. 2nd ed. CRC Press.
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.”American Journal of Political Science, 347–61.
Krinsky, I., and A. L. Robb. 1986. “On Approximating the Statistical Properties of Elasticities.”Review of Economics and Statistics 68 (4): 715–19.
Rainey, Carlisle. 2024. “A Careful Consideration of CLARIFY: Simulation-Induced Bias in Point Estimates of Quantities of Interest.”Political Science Research and Methods 12 (3): 614–23. https://doi.org/10.1017/psrm.2023.8.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. New York, NY: Springer. https://doi.org/10.1007/978-0-387-21736-9.
Zeileis, Achim, Susanne Köll, and Nathaniel Graham. 2020. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.”Journal of Statistical Software 95 (1): 1–36. https://doi.org/10.18637/jss.v095.i01.
Recall that the derivative of \(\log(\beta_2)\) is \(1/\beta_2\).↩︎
Scaling property of variance: Let \(a\) be a constant and \(\lambda\) be a random variable, then \(\text{Var}[a\lambda]=a^2\text{Var}[\lambda]\).↩︎
The notation and ideas in this section follow Hansen (2022a).↩︎
We draw with replacement, otherwise each bootstrap sample would exactly replicate the original data.↩︎
We use the drop() function to convert the default output—a one-cell matrix—to a scalar.↩︎
The usual “independent and identically distributed” assumption is a special case of exchangeability.↩︎