This chapter introduces four approaches to quantify uncertainty: the delta method, bootstrap, simulation-based inference, and conformal prediction. We discuss the intuition behind each of these strategies, and reinforce that intuition by implementing simple versions of each strategy “by hand.” The code examples are presented for pedagogical purposes; more robust and convenient implementations are supplied by the marginaleffects package. Readers who are not familiar with the basics of multivariable calculus may want to skip this chapter.
14.1 Delta method
The delta method is a statistical technique used to approximate the variance of a function of random variables. For example, data analysts often want to compare regression coefficients or predicted probabilities by subtracting them from one another. Such comparisons can be expressed as functions of a model’s parameters, but the variances (and standard errors) of such transformed quantities can be difficult to derive analytically. The delta method is a convenient way to approximate those variances. It allows us to quantify uncertainty around most of the quantities of interest discussed in this book.
The delta method relies on the observation that if the distribution of a parameter is asymptotically normal, then the limiting distribution of a smooth function of this parameter is also normal (Wasserman 2004). This is a powerful result, which is especially useful in the context of maximum likelihood estimation, where parameters are expected to be asymptotically normal.
The delta method is a useful way to quantify uncertainty. It is versatile and fast. Yet, the technique also has important limitations: it relies on a coarse linear approximation, its desirable properties depend on the asymptotic normality of the estimator, and it is only valid when a function is continuously differentiable in a neighborhood of its parameters. When analysts do not believe that these conditions hold, they can turn to the bootstrap or to simulation-based inference.1
The theoretical properties of the delta method are proved in many statistical textbooks (Wasserman 2004; Hansen 2022b; Berger and Casella 2024). Instead of demonstrating them anew, the rest of this section provides a hands-on tutorial, with a focus on computation. The goal is to reinforce intuitions by illustrating how to calculate standard errors manually in two cases: univariate and multivariate.
14.1.1 Univariate delta method
To start, we use the delta method to compute the standard error of a function with one parameter: the natural logarithm of a single regression coefficient. Consider a setting with three measured variables: \(Y\), \(X\), and \(Z\). With these variables in hand, the analyst estimates a linear regression model.
\[
Y = \beta_1 + \beta_2 X + \beta_3 Z + \varepsilon
\tag{14.1}\]
Let \(\hat{\beta}_2\) be an estimator of the true parameter \(\beta_2\). We know that the variance of \(\hat{\beta}_2\) is \(\text{Var}(\hat{\beta}_2)\), but we are not interested in this particular statistic. Rather, our goal is to compute the variance of a function of the estimator: \(\text{Var}\left [\log(\hat{\beta}_2) \right ]\).
To get there, we apply a Taylor series expansion. This mathematical tool allows us to rewrite smooth functions like \(\log(\hat{\beta}_2)\) as infinite sums of derivatives and polynomials. This is a useful way to simplify and approximate complex functions.
The Taylor series of a function \(g(\alpha)\) is always expressed by reference to a particular point \(a\), where we wish to study the behavior of \(g\). Generically, the approximation can be written as
\[
g(\alpha) = \sum_{n=0}^{\infty} \frac{g^{(n)}(a)}{n!}(\alpha - a)^n = g(a) + g'(a)(\alpha - a) + \frac{g''(a)}{2!}(\alpha - a)^2 + \cdots
\tag{14.2}\]
where \(g^{(n)}(a)\) denotes the \(n\)th derivative of \(g(\alpha)\), evaluated at \(\alpha=a\).
Equation 14.2 is an infinite series, so it is not practical (or possible) to compute all its terms. In real-world applications, we thus truncate the series by dropping higher order terms. For instance, we can construct a first-order Taylor series by keeping only the first two terms on the right-hand side of Equation 14.2.
We can now approximate the function that interests us, \(\log(\hat{\beta}_2)\), as a first-order Taylor series, evaluated at the true value of the parameter \(\beta_2\). Recall that the derivative of \(\log(\beta_2)\) is \(1/\beta_2\), and substitute appropriate values into the first two terms of Equation 14.2. Taking the variance of both sides yields
\[
\text{Var}\left[\log(\hat{\beta}_2)\right] \approx \text{Var}\left[\log(\beta_2) + \frac{1}{\beta_2}\left(\hat{\beta}_2 - \beta_2\right)\right]
\tag{14.3}\]
This linearized approximation of the variance is useful, because it allows further simplification. First, we know that the variance of a constant like \(\beta_2\), or of any term that involves only \(\beta_2\), is zero, which implies that some of the terms in Equation 14.3 drop out. Second, we can apply the scaling property of variances2 to rewrite the expression as
\[
\text{Var}\left[\log(\hat{\beta}_2)\right] \approx \left(\frac{1}{\beta_2}\right)^2 \text{Var}\left(\hat{\beta}_2\right)
\tag{14.4}\]
Equation 14.4 gives us a formula to approximate the variance of \(\log(\hat{\beta}_2)\). This formula is expressed in terms of \(\text{Var}(\hat{\beta}_2)\), that is, the square of the estimated standard error. Equation 14.4 also includes an unknown term: \(1/\beta_2\). Since we do not know the true value of that term, we fall back to using \(1/\hat{\beta}_2\) as a plug-in estimate.3
To show how this formula can be applied in practice, let’s simulate data that conform to Equation 14.1, and use that data to fit a linear regression model.
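The original code is not reproduced here, but a minimal sketch might look like the following. The seed, sample size, and true coefficient values are assumptions chosen for illustration, so the numbers printed later in this section will not be matched exactly.

library(marginaleffects)

set.seed(1024)                          # assumed seed
n = 1000                                # assumed sample size
X = rnorm(n)
Z = rnorm(n)
Y = 1 + 0.6 * X + 0.2 * Z + rnorm(n)    # illustrative coefficient values
dat = data.frame(Y, X, Z)

# fit the linear regression model from Equation 14.1
mod = lm(Y ~ X + Z, data = dat)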
Using Equation 14.4, we compute the delta method estimate of \(\text{Var}\left [\log(\hat{\beta}_2)\right ]\). Then, we take the square root to obtain the standard error of \(\log(\hat{\beta}_2)\).
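In code, this amounts to plugging the estimated coefficient and its variance into Equation 14.4. This sketch assumes the model object from the simulation above is named mod.

b2 = coef(mod)["X"]              # estimate of beta_2
v_b2 = vcov(mod)["X", "X"]       # estimated variance of beta_2
v_log_b2 = (1 / b2)^2 * v_b2     # Equation 14.4
log(b2)                          # point estimate
sqrt(v_log_b2)                   # delta method standard error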
To verify that this result is correct, we can use the hypotheses() function from the marginaleffects package. As noted in Chapter 4, this function can compute arbitrary functions of model parameters, along with associated standard errors.
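With the model object assumed above, the call might look like this. The b2 label refers to the second coefficient (the coefficient on X); the exact hypothesis syntax can vary across marginaleffects versions.

hypotheses(mod, hypothesis = "log(b2) = 0")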
Reassuringly, the estimate and standard error that we computed “manually” for \(\log(\hat{\beta}_2)\) are the same as the ones obtained “automatically” by calling hypotheses().
14.1.2 Multivariate delta method
The univariate version of the delta method can be generalized to the multivariate case. Let \(\mathcal{B}=\{\beta_1,\beta_2,\ldots,\beta_k\}\) be a vector of input parameters, and \(\Theta=\{\theta_1,\theta_2,\ldots,\theta_n\}\) represent the vector-valued output of a transformation \(h\). For example, \(\mathcal{B}\) could be a vector of regression coefficients. \(\Theta\) could be a single summary statistic, or it could be a vector of length \(n\) with one fitted value per observation in the dataset.4
Our goal is to compute standard errors for each element of \(\Theta\). To do this, we express the multivariate delta method in matrix notation:
\[
\text{Var}\left(h(\mathcal{B})\right) = J^T \; \text{Var}(\mathcal{B}) \; J
\tag{14.5}\]
where \(h(\mathcal{B})\) is a function of the parameter vector \(\mathcal{B}\) and \(\text{Var}(h(\mathcal{B}))\) is the variance of that function.
\(J\) is the Jacobian of \(h\). The number of rows in that matrix is equal to the number of parameters in the input vector \(\mathcal{B}\). The number of columns in \(J\) is equal to the number of elements in the output vector \(\Theta\), that is, equal to the number of transformed quantities of interest that \(h\) produces. The element in the \(i\)th row and \(j\)th column of \(J\) is a derivative which characterizes the effect of a small change in \(\beta_i\) on the value of \(\theta_j\).
Now, let’s go back to the linear model defined by Equation 14.1, with coefficients \(\beta_1\), \(\beta_2\), and \(\beta_3\). Imagine that we want to estimate the difference between the 2nd and 3rd coefficients of this model. This difference can be expressed as the single output \(\theta\) from a function \(h\) of three coefficients:
\[
\theta = h(\beta_1, \beta_2, \beta_3) = \beta_2 - \beta_3
\tag{14.6}\]
To quantify the uncertainty around \(\theta\), we use the multivariate delta method formula. In Equation 14.5, \(\text{Var}(\mathcal{B})\) refers to the variance-covariance matrix of the regression coefficients, which we can extract using the vcov() function.
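Assuming the fitted model object is named mod, the matrix printed below is obtained with a call like this one.

vcov(mod)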
(Intercept) X Z
(Intercept) 0.0131734661 -0.0005546706 -0.0017475913
X -0.0005546706 0.0154611041 -0.0002355777
Z -0.0017475913 -0.0002355777 0.0111296508
The \(J\) matrix is defined by reference to the \(h\) function. In Equation 14.6, \(h\) accepts three parameters as inputs and returns a single value. Therefore, the corresponding \(J\) matrix must have 3 rows and 1 column. We populate this matrix by taking partial derivatives of Equation 14.6 with respect to each of the input parameters.
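Since \(h(\beta_1, \beta_2, \beta_3) = \beta_2 - \beta_3\), the partial derivatives are 0, 1, and -1, and \(J\) can be built directly. A sketch:

J = matrix(c(0, 1, -1), ncol = 1)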
Having extracted \(\text{Var}(\mathcal{B})\) and defined an appropriate \(J\) for the quantity of interest \(\theta\), we now use Equation 14.5 to compute the standard error.
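A sketch of that calculation, reusing the mod and J objects defined above:

var_theta = t(J) %*% vcov(mod) %*% J
sqrt(var_theta)    # standard error of beta_2 - beta_3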
We have thus estimated that the difference between \(\beta_2\) and \(\beta_3\) is 0.612, and that the standard error associated with this difference is 0.165. Again, we can check this “manual” result against the output of the hypotheses() function.
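Using the same hypothesis syntax as before, where b2 and b3 index the second and third coefficients, the call might look like this.

hypotheses(mod, hypothesis = "b2 - b3 = 0")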
The results obtained manually and automatically are identical.
14.1.3 Robust or clustered standard errors
The calculations in the previous sections relied on “classical” estimates of the variance-covariance matrix. These estimates are useful, but they rely on strong assumptions that rule out ubiquitous phenomena like autocorrelation, heteroskedasticity, and clustering.5 If there is heterogeneity in the variance of our prediction errors (which is almost always the case), classical standard errors may inaccurately characterize the uncertainty in our estimates.
To see how the assumptions that underpin classical standard errors can be relaxed, let’s consider a linear regression with only one predictor.
\[
Y = X \beta + \varepsilon,
\]
where \(Y\) is an \(n \times 1\) vector of outcomes; \(X\) is an \(n \times k\) matrix of covariates, including an intercept; \(\beta\) is a \(k \times 1\) vector of coefficients; and \(\varepsilon\) is an \(n \times 1\) vector of errors. We can estimate the regression coefficients by ordinary least squares, a method which can be expressed in a few matrix operations.
\[
\hat{\beta}=(X^TX)^{-1}(X^TY)
\tag{14.7}\]
The standard way to compute the variance of \(\hat{\beta}\) is to use the “sandwich” formula, so-called because it includes a “meat” expression, squeezed between identical “bread” components. The classical variance estimate for ordinary least squares can thus be written as
\[
\text{Var}(\hat{\beta}) = (X^TX)^{-1} \; X^T \left(\sigma^2_\varepsilon I_n\right) X \; (X^TX)^{-1}
\tag{14.8}\]
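The following sketch implements Equation 14.8 manually. It assumes the fitted model is still named mod; the object names are illustrative.

X_mat = model.matrix(mod)      # design matrix, including the intercept
e = residuals(mod)
sigma2 = sum(e^2) / (nrow(X_mat) - ncol(X_mat))    # estimated error variance

bread = solve(t(X_mat) %*% X_mat)
meat = t(X_mat) %*% (sigma2 * diag(nrow(X_mat))) %*% X_mat
V_classical = bread %*% meat %*% bread
sqrt(diag(V_classical))        # classical standard errors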
One key assumption about classical standard errors is encoded explicitly and clearly in Equation 14.8. In that formula, and in the code that implements it, \(\sigma^2_\varepsilon\) is a single number, treated as a constant that reflects the uncertainty in each observation uniformly. But in many applied contexts, the variance is not uniform, that is, there is heteroskedasticity.6
One popular strategy to deal with this challenge is to compute “robust” standard errors, also known as “heteroskedasticity-consistent” or “Huber-White” standard errors. There are many types of robust standard errors available, designed to address different data issues (Zeileis, Köll, and Graham 2020). The code below shows one of the simplest ways to account for heteroskedasticity. It implements the same sandwich formula as in the classical case, but the “meat” component includes a matrix with squared residuals on the diagonal. This gives us more flexibility to account for changes in the variance of prediction errors across the sample.
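A sketch of that calculation, reusing the design matrix, residuals, and bread defined in the previous block (the object names are assumptions):

meat_robust = t(X_mat) %*% diag(e^2) %*% X_mat    # squared residuals on the diagonal
V_robust = bread %*% meat_robust %*% bread
sqrt(diag(V_robust))                              # robust ("HC0") standard errors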
A more convenient way to compute the same variance-covariance matrix is to use the vcovHC() function from the powerful sandwich package, which supports many models and estimators beyond linear regression.
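For example, the following call should reproduce the manual calculation above; the type = "HC0" argument selects one of the simplest variants, and several others are available.

library(sandwich)

V_sandwich = vcovHC(mod, type = "HC0")
sqrt(diag(V_sandwich))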
In all marginaleffects functions, there is a vcov argument that allows us to report various types of robust standard errors. When that argument is omitted, these functions report classical standard errors by default.
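For instance, a call along these lines requests heteroskedasticity-consistent standard errors for average counterfactual comparisons. The avg_comparisons() call and the "HC3" shortcut are illustrative choices, not the only options.

avg_comparisons(mod, vcov = "HC3")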
The vcov argument accepts several different string shortcuts like the one above, but it can also take a variance-covariance matrix, a function that returns such a matrix, or a formula to specify clusters.
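The following calls sketch those alternatives. The cluster variable region is hypothetical and does not exist in the simulated data.

avg_comparisons(mod, vcov = V_sandwich)          # a pre-computed matrix
avg_comparisons(mod, vcov = sandwich::vcovHC)    # a function that returns a matrix
avg_comparisons(mod, vcov = ~ region)            # a formula to specify clusters (hypothetical variable)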
In sum, the delta method is a simple, fast, and flexible approach to obtain standard errors for a vast array of quantities of interest. It is the default approach implemented by all the functions in the marginaleffects package. By combining the delta method with robust estimates of the variance-covariance matrix, analysts can report uncertainty estimates that account for common data features like heteroskedasticity or clustering.
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06695 0.09975 -0.671 0.504
X 0.99036 0.10437 9.489 <0.001
Z 0.42283 0.09294 4.549 <0.001
14.2 Bootstrap
The bootstrap is a method pioneered by Bradley Efron and colleagues in the 1970s and 1980s (Efron and Tibshirani 1994). It is a powerful strategy to quantify uncertainty about statistics of interest.7
Consider \(T_n\), a statistic we wish to compute. \(T_n\) is a function of an outcome \(y_n\), some regressors \(\mathbf{x}_n\), and the unknown population distribution function \(F\):
\[
T_n = T\left(y_n, \mathbf{x}_n; F\right)
\]
In our running example, \(y_n\) is bill_length_mm, \(\mathbf{x}_n\) is a vector of binary variables representing different values of island, and \(T_n\) is the statistic obtained by regressing \(y_n\) on \(\mathbf{x}_n\), and by taking a linear combination of the resulting coefficient estimates.
The statistic of interest \(T_n\) is characterized by its own distribution function \(G\), itself a function of \(F\). If we knew \(F\), then \(G\) would immediately follow. The key idea of the bootstrap is to use simulations to approximate \(F\), and then to use that approximation when quantifying uncertainty around the statistic of interest.
How do we approximate \(F\) in practice? Since the 1980s, many different variations on bootstrap simulations have been proposed. A survey of this rich literature lies outside the scope of this book, but it is instructive to illustrate one of the simplest approaches: the nonparametric bootstrap algorithm. The idea is to build several simulated datasets by drawing observations with replacement from the observed sample.8 Then, with those datasets in hand, we can estimate the statistic of interest in all of those samples.
To illustrate, here is a minimalist implementation of the bootstrap algorithm in R. First, we define a statistic function which extracts coefficient estimates from a model object and takes a linear combination of them: \(\beta_2-\beta_3\).9
# statistic of interest: linear combination of model coefficients
statistic = \(model) drop(c(0, 1, -1) %*% coef(model))
Then, we implement the bootstrap algorithm with 10,000 replicates:
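A minimal sketch of that loop is shown below. It assumes the estimation sample described above is stored in a data frame called penguins, with columns bill_length_mm and island; the seed is arbitrary.

set.seed(48103)

boot_estimates = replicate(10000, {
  # draw observations with replacement from the observed sample
  idx = sample(nrow(penguins), replace = TRUE)
  boot_model = lm(bill_length_mm ~ island, data = penguins[idx, ])
  statistic(boot_model)
})

# bootstrap standard error and percentile confidence interval
sd(boot_estimates)
quantile(boot_estimates, probs = c(0.025, 0.975))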
Bootstrapping, particularly in its nonparametric form, is a versatile technique that estimates the sampling distribution of a statistic by resampling the data. It is widely applicable, especially when the underlying distribution is unknown or complex, making it a powerful tool for deriving standard errors and confidence intervals. However, bootstrapping can be computationally intensive, particularly for large datasets or complex models. In addition, selecting an adequate resampling scheme can be challenging when the data-generating process is complex, as improper choices can lead to biased or invalid results.
14.3 Simulation
Yet another strategy to compute uncertainty estimates for quantities of interest is simulation-based inference. This method is an alternative to traditional analytical approaches, which leverages the power of computational simulation to estimate the variability of a given parameter. One of the earliest applications of this approach can be found in Krinsky and Robb (1986), who use simulations to quantify the sampling uncertainty around estimates of elasticities in a regression modelling context. King, Tomz, and Wittenberg (2000) subsequently popularized the strategy and its implementation in the CLARIFY software.
This approach rests on the assumption that the regression-based estimator follows a normal distribution. This assumption allows us to use the properties of the multivariate normal distribution to simulate numerous potential outcomes. The steps involved in this process are as follows:
Simulate parameters: Begin by drawing many sets of parameters from a multivariate normal distribution. The mean of this distribution is the estimated vector of coefficients obtained from the regression analysis, and the variance corresponds to the estimated covariance matrix of these coefficients.
Transform parameters: Next, each simulated set of coefficients is transformed to generate multiple versions of the quantity of interest.
Summarize draws: Finally, construct a confidence interval for the quantity of interest by taking quantiles of the simulated quantities.
Intuitively, this process of drawing simulated quantities can be thought of as an “informal” Bayesian posterior simulation, as suggested by Rainey (2024). While it does not strictly follow Bayesian methods, it shares a similar philosophy of using the distribution of simulated parameters to infer the uncertainty around the estimated quantity. By repeatedly sampling from the parameter space, we can gain a clearer picture of the possible variability and thereby construct more reliable confidence intervals.
Implementing this form of simulation-based inference is relatively easy:
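Here is a minimal sketch. It assumes the fitted model is named mod and that the quantity of interest is the same linear combination of coefficients as before; the number of draws and the seed are arbitrary.

library(MASS)

set.seed(48103)

# 1) simulate parameters from a multivariate normal distribution
draws = MASS::mvrnorm(10000, mu = coef(mod), Sigma = vcov(mod))

# 2) transform each set of simulated coefficients into the quantity of interest
theta_draws = draws %*% c(0, 1, -1)

# 3) summarize the draws with quantiles to build a confidence interval
quantile(theta_draws, probs = c(0.025, 0.975))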
Note that the simulation-based confidence interval is slightly different from those obtained via the bootstrap or the delta method, but it is generally of the same magnitude.
Simulation-based inference is easy to implement, convenient, and it applies to a broad range of models. It trades the analytic approximation of the delta method, which relies on Taylor-series expansions, for a numerical approximation through simulation. Although simulation-based inference tends to be more computationally expensive than the delta method, it is usually faster than bootstrapping, which requires refitting the model multiple times.
TODO: Example with marginaleffects
14.4 Conformal prediction
In Chapter 5, we used the predictions() function to compute confidence intervals around fitted values. These intervals are designed to characterize the uncertainty about the model-based expected value of the response. One common misunderstanding is that such confidence intervals should be calibrated to cover a certain percentage of unseen data points. This is not the case. In fact, the confidence interval reported by default by the predictions() function will typically cover a much smaller share of out-of-sample outcomes than its nominal coverage.
To illustrate, let’s conduct a simulation. We draw two samples of 25 observations randomly from a normal distribution with mean \(\pi\): \(Y_{train}\) and \(Y_{test}\). Then, we use the lm() function to fit a linear model on the training set \(Y_{train}\), and we use the predictions() function to compute a 90% confidence interval for the expected value of the response. Finally, we check how many observations in the test set \(Y_{test}\) are covered by this interval. We repeat the experiment 1000 times:
library(marginaleffects)
set.seed(48103)

simulation = function(...) {
  Y_train = rnorm(25, mean = pi)
  Y_test = rnorm(25, mean = pi)
  m = lm(Y_train ~ 1)
  p = predictions(m, conf_level = .90)
  out = c(
    "3.1416" = mean(p$conf.low <= pi & p$conf.high >= pi),
    "Test" = mean(p$conf.low <= Y_test & p$conf.high >= Y_test)
  )
  return(out)
}

coverage = apply(replicate(1000, simulation()), 1, mean)
coverage
3.1416 Test
0.88100 0.24668
We see that the confidence intervals cover the true mean of the outcome (\(\pi\) or 3.1416) nearly 90% of the time, which matches the confidence level that we selected with conf_level. However, the confidence intervals only cover out-of-sample observations (\(Y_{test}\)) about 25% of the time. Clearly, if we care about out-of-sample predictions, that is, if we want our interval to cover a specific share of out-of-sample data points, we must compute “prediction intervals” instead of “confidence intervals.”
How can we compute prediction intervals with adequate coverage? Conformal prediction is one—very flexible and powerful—approach. In their excellent tutorial, Angelopoulos and Bates (2022) write that conformal prediction is
“a user-friendly paradigm for creating statistically rigorous uncertainty sets/intervals for the predictions of such models. Critically, the sets are valid in a distribution-free sense: they possess explicit, non-asymptotic guarantees even without distributional assumptions or model assumptions.”
These are extraordinary claims which deserve to be underlined: In principle, conformal prediction can offer well-calibrated intervals, regardless of the prediction model we use, and even if that model is misspecified.
There are four main caveats to this. First, the conformal prediction algorithms implemented in marginaleffects are designed for exchangeable data.10 They do not offer coverage guarantees in contexts where exchangeability is violated, such as in time series data, when there is spatial dependence between observations, or when there is distribution drift between the training and test data. Second, these algorithms offer marginal coverage guarantees, that is, they guarantee that a random test point will fall within the interval with a given probability. Below, we show an example where the prediction interval covers the right number of test points overall, but is not well calibrated locally, in different strata of the predictors. Different algorithms have recently been proposed to offer class-conditional coverage guarantees (see Ding et al. (2023) for an example). Third, the width of the conformal prediction interval will typically depend on the quality of the prediction model and of the score function. Finally, the score functions implemented in marginaleffects simply take the residual—or difference between the observed outcome and predicted value. This means that the type argument must ensure that observations and predictions are on commensurable scales (usually type="response" or type="prob").
A detailed technical explanation of conformal prediction lies outside the scope of this book, but outlining a few key elements can provide some intuition. Roughly speaking, conformal prediction starts with a trained model that produces some notion of confidence or uncertainty, like softmax probabilities for classification or residuals for regression. Then, it evaluates how well this model performs on a separate calibration dataset, which includes data points that were not used to train the model. The key step is to calculate a “conformal score” for each calibration example. This score measures how unusual or poorly predicted a data point is. These scores are ranked, and the algorithm sets a threshold based on a quantile of the scores (adjusted for sample size). At prediction time, the model computes a new score for the test point, and the prediction set includes all possible answers whose score is within the threshold. This works because even if the model is misspecified or biased, the quantile-based thresholding ensures that the prediction set will still cover the true value with the desired probability, making the method “distribution-free.” If the model is confident, the set is small; if the model is uncertain, the set is larger—directly reflecting the model’s uncertainty in a calibrated way.
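As a toy illustration of the scoring and thresholding steps, here is a sketch of the simpler "split conformal" variant for a regression model. Every name and number in it is illustrative; the marginaleffects examples below rely on the CV+ algorithm instead.

set.seed(48103)
toy = data.frame(x = rnorm(200))
toy$y = 1 + 2 * toy$x + rnorm(200)

fit = lm(y ~ x, data = toy[1:100, ])    # training half
calib = toy[101:200, ]                  # calibration half

# conformal scores: absolute residuals on the calibration set
scores = abs(calib$y - predict(fit, newdata = calib))

# threshold: quantile of the scores, adjusted for calibration sample size (90% target)
n_cal = length(scores)
q_hat = quantile(scores, ceiling((n_cal + 1) * 0.9) / n_cal)

# prediction interval for a new observation: point prediction plus or minus the threshold
predict(fit, newdata = data.frame(x = 0)) + c(-1, 1) * q_hat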
There are several alternative conformal prediction algorithms. For the purposes of this tutorial, we will only use one called “CV+.”
TODO: more details on the CV+ algorithm
14.4.1 Numeric outcome
To illustrate the use of conformal prediction, let’s consider a dataset which includes demographic information on over 1 million members of the US military (TODO: CITE):
# A data frame: 6 × 8
rownames grade branch gender race hisp rank officer
* <int> <chr> <chr> <chr> <chr> <lgl> <int> <dbl>
1 1 officer army male ami/aln TRUE 2 1
2 2 officer army male ami/aln TRUE 2 1
3 3 officer army male ami/aln TRUE 5 1
4 4 officer army male ami/aln TRUE 5 1
5 5 officer army male ami/aln TRUE 5 1
6 6 officer army male ami/aln TRUE 5 1
We split the data into two sets of 10,000 randomly selected observations each. The training set is used to fit the model, and the test set is used to check the coverage of our interval:
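The original code is not shown, but the split and the model might look like the sketch below. The dataset name (military), the seed, and the model specification are assumptions made for illustration; only the outcome variable (rank) is taken from the code that follows.

set.seed(48103)
idx = sample(nrow(military), 20000)
train = military[idx[1:10000], ]
test = military[idx[10001:20000], ]

# illustrative specification: the original model is not shown
mod = lm(rank ~ grade + branch + gender + race + hisp, data = train)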
Finally, we use the predictions() and inferences() functions to compute unit-level predictions and conformal prediction intervals, using 5 cross-validation folds:
p = predictions(mod, conf_level = .8) |>
  inferences(
    R = 5,
    method = "conformal_cv+",
    conformal_test = test)
covered = with(p, rank > pred.low & rank <= pred.high)
mean(covered)
[1] 0.8346
We find that 83% of observations in the test set are covered by the conformal prediction intervals. This is quite close to our target coverage rate of 80%.
14.4.2 Categorical outcome
Now, let’s illustrate an analogous strategy in the context of a classification task. We use a multinomial logit model to predict which branch of the military each observation belongs to: air force, army, marine corps, or navy.
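Here is a sketch of the model fit, using the multinom() function from the nnet package. The predictors are assumptions, since the original specification is not shown.

library(nnet)

mod_cat = multinom(branch ~ gender + race + rank + officer, data = train, trace = FALSE)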
When the outcome is categorical, we use conformal_score="softmax". With this argument, inferences() generates “conformal prediction sets,” that is, sets of possible outcome classes with coverage guarantees:
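A sketch of the call, reusing the test set defined earlier and requesting 70% coverage via conf_level:

p = predictions(mod_cat, conf_level = .7) |>
  inferences(
    R = 5,
    method = "conformal_cv+",
    conformal_score = "softmax",
    conformal_test = test)
head(p)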
As shown above, inferences() returns a list column of sets for each observation. On average, those sets should cover the true value about 70% of the time. For example, for the first observation in the dataset, the conformal prediction is {army, navy} and the true value is army. The conformal prediction set thus covers the true value.
We can compute the overall coverage rate in the test set as follows:
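A sketch of that check. The name of the list column holding the sets (pred.set) and the presence of the branch column in the output are assumptions based on recent versions of marginaleffects.

covered = mapply(function(truth, set) truth %in% set, p$branch, p$pred.set)
mean(covered)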
As expected, the conformal sets include the true value about 70% of the time.
14.4.3 Misspecification
As noted above, a conformal prediction interval should be valid even if the model is misspecified. We now illustrate this property with two examples.
14.4.3.1 Polynomials
To illustrate this, we generate data from a linear model with polynomials, but estimate a linear model without polynomials. Then, we plot the results and compute the coverage of the prediction interval:
library("ggplot2")# simulate dataN=1000X=rnorm(N*2)dat=data.frame( X =X, Y =X+X^2+X^3+rnorm(N*2))train=dat[1:N,]test=dat[(N+1):nrow(dat),]# fit misspecified modelm=lm(Y~X, data =train)# conformal predictionsp=predictions(m)|>inferences( R =5, method ="conformal_cv+", conformal_test =test)# plot both the confidence and prediction intervalsggplot(p, aes(X, Y))+geom_point(alpha =.05)+geom_ribbon(aes(X, ymin =pred.low, ymax =pred.high), alpha =.2)+geom_ribbon(aes(X, ymin =conf.low, ymax =conf.high), alpha =.4)
Clearly, the prediction intervals are much wider than the confidence intervals. This is normal, because the confidence intervals are only designed to cover the expected mean of the outcome:
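A sketch of that check, using the test-set observations contained in p:

with(p, mean(Y >= pred.low & Y <= pred.high))    # prediction interval coverage
with(p, mean(Y >= conf.low & Y <= conf.high))    # confidence interval coverage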
This example is interesting, because it shows that the prediction interval has adequate marginal coverage. However, the intervals are not necessarily well calibrated “locally,” in different strata of \(X\). In the figure above, our model is misspecified, so we make more mistakes in the tails, where predictions are poor. In contrast, the interval catches more observations in the middle of the distribution, which ensures that the overall error rate is adequate.
14.4.3.2 Poisson vs. negative binomial
Here is a second example of model misspecification. We generate data from a negative binomial model, but estimate a Poisson model. Nevertheless, the conformal prediction interval has good coverage:
library("MASS")n=10000X=rnorm(n)eta=-1+2*Xmu=exp(eta)Y=rnegbin(n, mu =mu, theta =1)dat=data.frame(X =X, Y =Y)train=dat[1:5000,]test=dat[5001:nrow(dat),]mod=glm(Y~X, data =train, family =poisson)p=predictions(mod, conf_level =.9)|>inferences( method ="conformal_cv+", R =10, conformal_test =test)with(p, mean(Y>=pred.low&Y<=pred.high))
[1] 0.9084
Angelopoulos, Anastasios N., and Stephen Bates. 2022. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” arXiv. https://doi.org/10.48550/arXiv.2107.07511.
Berger, Roger, and George Casella. 2024. Statistical Inference. 2nd ed. CRC Press.
Ding, Tiffany, Anastasios N. Angelopoulos, Stephen Bates, Michael I. Jordan, and Ryan J. Tibshirani. 2023. “Class-Conditional Conformal Prediction with Many Classes.” arXiv. https://doi.org/10.48550/arXiv.2306.09335.
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science, 347–61.
Krinsky, I., and A. L. Robb. 1986. “On Approximating the Statistical Properties of Elasticities.” Review of Economics and Statistics 68 (4): 715–19.
Rainey, Carlisle. 2024. “A Careful Consideration of CLARIFY: Simulation-Induced Bias in Point Estimates of Quantities of Interest.” Political Science Research and Methods 12 (3): 614–23. https://doi.org/10.1017/psrm.2023.8.
Thornton, Rebecca L. 2008. “The Demand for, and Impact of, Learning HIV Status.” American Economic Review 98 (5): 1829–63.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. New York, NY: Springer. https://doi.org/10.1007/978-0-387-21736-9.
Zeileis, Achim, Susanne Köll, and Nathaniel Graham. 2020. “Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R.” Journal of Statistical Software 95 (1): 1–36. https://doi.org/10.18637/jss.v095.i01.
Autocorrelation often arises in time series estimation when the prediction errors we make for a unit of observation at time \(t\) are correlated to the prediction errors we make for the same unit at time \(t+1\). Clustering sometimes occurs when errors are similar for study participants drawn from well-defined groups (ex: classrooms, neighborhoods, or provinces). Heteroskedasticity arises when the variance of the errors is inconsistent across observations, such as when our model makes bigger prediction errors (in either direction) for units with certain characteristics. ↩︎