32  Power Analysis

Power analysis is fundamental to good research design. It helps us determine whether our study is likely to detect an effect if one truly exists, and guides decisions about sample size and study design.

This post demonstrates power analysis in the context of estimating quantities of interest with marginaleffects. We show that focusing on meaningful estimands—like average treatment effects computed using avg_comparisons(model, variables = "treatment")—does not unduly complicate the conduct of power analyses. We can use standard simulation-based approaches to execute them effectively.

DeclareDesign is an extremely powerful package for conducting simulations and assessing research design properties like power, bias, and RMSE. While the package offers impressive flexibility, this tutorial uses only its most basic functionality to provide a clear introduction to power analysis via simulation.

Before jumping in, we load some R packages.
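library(DeclareDesign)
library(marginaleffects)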

32.1 What is Power Analysis?

Statistical power is the probability that a statistical test will correctly reject a false null hypothesis. In other words, it’s the probability of detecting an effect when it truly exists. Power depends on four key factors:

  1. Effect size: How large the true effect is
  2. Sample size: How many observations we collect
  3. Significance level (α): Our threshold for rejecting the null (typically 0.05)
  4. Measurement error/noise: How much variability exists in our data

The minimum effect of interest (MEI) is the smallest effect size that we consider practically meaningful. This is a substantive decision that should be made before data collection based on theory, prior research, or policy relevance.
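For intuition, the closed-form version of this calculation for a simple two-group comparison of means is available in base R via power.t.test(). The example below is only a point of reference; it does not generalize to the richer designs considered later in this chapter. If the minimum effect of interest is 0.2 standard deviations, we can solve for the per-group sample size needed to reach 80% power at a 5% significance level:

# Closed-form power analysis for a two-sample t-test.
# Leaving n unspecified tells power.t.test() to solve for it.
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8)

With these inputs, power.t.test() reports that roughly 394 observations per group are needed, a useful reminder that small effects require large samples.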

32.2 Components of Power Analysis

A simulation-based power analysis has four core components:

  1. Data Generating Process (DGP): How we simulate realistic data under different assumptions
  2. Estimand: The true quantity we want to estimate (e.g., average treatment effect)
  3. Estimator: The statistical procedure we use to estimate the estimand (e.g., regression, difference-in-means)
  4. Simulation: Repeatedly applying the estimator to data generated randomly from the DGP to assess statistical properties like power, bias, and coverage (see the sketch after this list)
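Before introducing DeclareDesign, it may help to see how these four components fit together as a bare-bones loop in plain R. The sketch below makes illustrative assumptions: a true treatment effect of 0.2, 100 observations, standard normal noise, a 5% significance level, and OLS as the estimator. DeclareDesign automates and greatly extends this kind of loop.

set.seed(2024)
nsims = 500
pvals = replicate(nsims, {
  D = rbinom(100, size = 1, prob = 0.5)   # 1. DGP: randomized binary treatment
  Y = 0.2 * D + rnorm(100)                # 2. Estimand: true ATE is 0.2
  model = lm(Y ~ D)                       # 3. Estimator: OLS
  summary(model)$coefficients["D", "Pr(>|t|)"]
})
mean(pvals < 0.05)                        # 4. Power: share of simulations that reject

The last line is the power estimate: the proportion of simulated studies in which the null hypothesis of no effect is rejected at the 5% level.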

32.3 Example: Two-Arm Randomized Controlled Trial

We now illustrate the four components of power analysis using a simple randomized controlled trial.

32.3.0.0.1 1. Data Generating Process (DGP)

The first data generating process that we consider is an ultra-simple randomized experiment with binary treatment \(D\), outcome \(Y\), and noise \(\varepsilon\).

\[ Y = \beta_0 + \beta_1 D + \varepsilon \tag{32.1}\]

To simulate data that conform to Equation 32.1, we call the declare_model() function. declare_model() is a convenience wrapper around the fabricate() function from the fabricatr package. It allows researchers to simulate very complex datasets, with hierarchical structures, weird sampling strategies, and complex treatment assignment schemes (e.g., blocks and/or clusters). In our case, the DGP is so simple that the code below is self-explanatory.

sample_size = 100
treatment_effect = 0.2
noise_sd = 1

dgp = declare_model(
  N = sample_size,

  # standard normal noise
  e = rnorm(N, mean = 0, sd = noise_sd),

  # random dummy treatment with equal probability of treatment and control
  D = rbinom(N, size = 1, prob = 0.5),

  # intercept
  b0 = 0,

  # outcome equation
  Y = b0 + treatment_effect * D + e
)
Tip: Our ultimate goal is to study how the statistical power of the research design varies when the sample size and treatment effect change. In DeclareDesign, the parameters that we intend to vary, like the sample_size and treatment_effect variables, must be defined before and outside the declare_model() call. See the code block above for an example.

declare_model() is what we call a “function factory”: it is a function that returns another function. Indeed, the dgp object that we created above can be called repeatedly to simulate new datasets from the same DGP.

Three rows from one simulation:

d = dgp()
head(d, 3)
   ID          e D b0          Y
1 001 -1.1100678 1  0 -0.9100678
2 002  0.3455761 1  0  0.5455761
3 003  0.9658841 1  0  1.1658841

Three different rows from another simulation:

d = dgp()
head(d, 3)
   ID        e D b0        Y
1 001 1.042664 0  0 1.042664
2 002 1.274613 0  0 1.274613
3 003 1.385862 0  0 1.385862
32.3.0.0.2 2. Estimand

Our estimand is the average treatment effect \(\beta_1\). In the code above, we set treatment_effect = 0.2, so the true value of our estimand is 0.2. This is the true quantity we want to estimate and the parameter we will vary to understand how statistical power changes with effect size.
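As a quick sanity check (not part of the DeclareDesign workflow), we can reconstruct both potential outcomes from a single simulated dataset and confirm that the unit-level effect, and hence the ATE, equals treatment_effect exactly. The Y0 and Y1 objects below are introduced purely for illustration:

d = dgp()
Y0 = 0 + treatment_effect * 0 + d$e   # potential outcome under control (intercept b0 = 0)
Y1 = 0 + treatment_effect * 1 + d$e   # potential outcome under treatment
mean(Y1 - Y0)                         # exactly 0.2: the true ATE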

32.3.0.0.3 3. Estimator

Now, we define a fit() function to serve as our estimator:

  • Input:
    • Data frame called data
  • Output:
    • 1-row data frame with mandatory and optional columns.
      • Mandatory: term, estimate
      • Optional: std.error, conf.low, conf.high, p.value, cost, etc.

The fit() function defined below accepts a data frame, fits a linear regression model, and returns a 1-row data frame of results.

fit = function(data) {
  model = lm(Y ~ D, data = data)
  out = data.frame(
    "term" = "OLS",
    "estimate" = coef(model)["D"]
  )
  return(out)
}

# simulate
d = dgp()

# estimate
fit(d)
  term  estimate
D  OLS 0.1956888

Instead of extracting the estimates from the lm() model manually, we can use the avg_comparisons() function from the marginaleffects package. In this simple model, the estimate produced by avg_comparisons() will be identical to the coefficient estimated by OLS. In more complex models, with interactions or non-linear components, avg_comparisons() is a convenient way to estimate the average treatment effect (ATE) and to return results in the “tidy” format required by DeclareDesign.

fit = function(data) {
  model = lm(Y ~ D, data = data)
  out = avg_comparisons(model, variables = "D")
  return(out)
}

fit(d)

 Estimate Std. Error     z Pr(>|z|)   S  2.5 % 97.5 %
    0.196      0.201 0.971    0.331 1.6 -0.199  0.591

Term: D
Type: response
Comparison: 1 - 0

Now, we use DeclareDesign to create an R object that represents a complete “research design”. This object is created by “adding” or “chaining” together calls to different DeclareDesign functions, using the + operator. In particular, we will call three functions:

  1. declare_model()
    • Data generating process
  2. declare_inquiry()
    • True value of the quantity of interest
  3. declare_estimator()
    • Estimation procedure
32.3.0.0.4 4. Simulation

Finally, once we have defined the research design, we can call the diagnose_design() function to assess its properties. This function will run simulations, estimate the treatment effect, and compute power, bias, etc. The simulation repeatedly generates random data from our DGP, applies our estimator, and summarizes the results across many iterations.

Putting it all together:

library(DeclareDesign)
library(marginaleffects)

sample_size = 100
treatment_effect = 0.2
noise_sd = 1

dgp = declare_model(
  N = sample_size,
  e = rnorm(N, mean = 0, sd = noise_sd),
  D = rbinom(N, size = 1, prob = 0.5),
  Y = treatment_effect * D + e
)

estimator = declare_estimator(handler = 
  function(data) {
    model = lm(Y ~ D, data = data)
    out = avg_comparisons(model, variables = "D")
    return(out)
  }
)

truth = declare_inquiry(ATE = treatment_effect)

design = dgp + estimator + truth

d = diagnose_design(design)
d

Research design diagnosis based on 500 simulations. Diagnosis completed in 3 secs.

 Design Inquiry Term Mean Estimand Mean Estimate Bias SD Estimate RMSE Power Coverage N Sims
 design     ATE    D          0.20          0.20 0.00        0.21 0.21  0.19     0.94    500

These simulation results show that our estimator yields unbiased estimates of the treatment effect. However, statistical power is very low, at roughly 19%: an effect of this size would be detected in only about one in five studies, even though it truly exists.
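As a rough cross-check, the closed-form calculation points in the same direction. Assuming roughly 50 units per arm (an approximation, since treatment assignment is random), an effect of 0.2, and unit-variance noise, power.t.test() reports power of about 17%, in the same ballpark as the Monte Carlo estimate:

# Analytical cross-check, assuming about 50 units per arm
power.t.test(n = 50, delta = 0.2, sd = 1, sig.level = 0.05)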

32.4 Power vs. Sample Size and Effect Size

To improve statistical power, we can increase the sample size, increase the treatment effect size, or reduce noise (i.e., improve measurement, etc.). The redesign() function lets us systematically vary these parameters and compare the resulting power across different scenarios.

design_list = redesign(design,
  sample_size = c(100, 500),
  treatment_effect = c(0.2, 0.5)
)

diagnose_design(design_list)

Research design diagnosis based on 500 simulations. Diagnosis completed in 6 secs.

   Design sample_size treatment_effect Inquiry Term Mean Estimand Mean Estimate  Bias SD Estimate RMSE Power Coverage N Sims
 design_1         100              0.2     ATE    D          0.20          0.20  0.00        0.19 0.19  0.17     0.96    500
 design_2         500              0.2     ATE    D          0.20          0.20 -0.00        0.09 0.09  0.59     0.95    500
 design_3         100              0.5     ATE    D          0.50          0.50 -0.00        0.19 0.19  0.68     0.96    500
 design_4         500              0.5     ATE    D          0.50          0.50 -0.00        0.09 0.09  1.00     0.95    500

As expected, larger effect sizes and sample sizes both increase statistical power. With a sample of 500 and an effect size of 0.5, power reaches nearly 100%.

We can also vary the noise level to see how measurement precision affects power:

noise_design_list = redesign(design,
  noise_sd = c(0.5, 1, 2)
)

diagnose_design(noise_design_list)

Research design diagnosis based on 500 simulations. Diagnosis completed in 4 secs.

   Design noise_sd Inquiry Term Mean Estimand Mean Estimate  Bias SD Estimate RMSE Power Coverage N Sims
 design_1      0.5     ATE    D          0.20          0.21  0.01        0.10 0.10  0.53     0.94    500
 design_2      1.0     ATE    D          0.20          0.20  0.00        0.20 0.20  0.21     0.95    500
 design_3      2.0     ATE    D          0.20          0.17 -0.03        0.41 0.41  0.07     0.94    500

Lower noise (better measurement precision) substantially increases power, while higher noise reduces it. Since the standard error of the treatment effect scales roughly with the noise standard deviation divided by the square root of the sample size, halving the noise buys about as much precision as quadrupling the sample size. This is why investing in better measurement instruments can be as valuable as collecting more data.

32.5 Comparing Estimators: Adjusted vs. Unadjusted

Controlling for covariates that predict the outcome can increase the precision of our estimates and shrink standard errors. Even in randomized experiments, it can therefore be valuable to adjust for covariates. We follow Lin (2013) and estimate a model in which the treatment is interacted with all covariates. The power analysis below compares the statistical properties of the regression-adjusted and unadjusted estimators.

The data generating process that we consider follows this equation:

\[ Y = \beta_0 + \beta_1 D + \beta_2 X_1 + \beta_3 X_2 + \varepsilon, \]

where \(Y\) is the outcome, \(D\) a randomized binary treatment, \(X_1\) and \(X_2\) are covariates drawn from a multivariate normal distribution, and \(\varepsilon\) is a standard normal noise term.

sample_size = 100
treatment_effect = 0.2
noise_sd = 1

dgp = declare_model(
  N = sample_size,
  e = rnorm(N, mean = 0, sd = noise_sd),
  D = rbinom(N, size = 1, prob = 0.5),

  # random normal covariates
  fabricatr::draw_multivariate(c(X1, X2) ~ MASS::mvrnorm(
    n = N,
    mu = c(0, 0),
    Sigma = matrix(c(1, 0.5, 0.5, 1), 2, 2)
  )),

  Y = 1 + treatment_effect * D + 1 * X1 + 1 * X2 + e
)
fit = function(data) {
  # Linear model with regression adjustment for covariates
  adjusted =  lm(Y ~ D * (X1 + X2), data = data)
  adjusted = avg_comparisons(adjusted, variables = "D", vcov = "HC3")

  # Linear model with a single binary predictor
  unadjusted =  lm(Y ~ D, data = data)
  unadjusted = avg_comparisons(unadjusted, variables = "D", vcov = "HC3")

  # Combine the results
  out = rbind(adjusted, unadjusted)

  # Label the results
  out$term = c("Adjusted", "Unadjusted")
  return(out)
}

We can fit this model directly to the output of dgp():

dgp() |> fit()

       Term Estimate Std. Error      z Pr(>|z|)   S  2.5 % 97.5 %
 Adjusted      0.188      0.238  0.791    0.429 1.2 -0.278  0.654
 Unadjusted   -0.064      0.392 -0.163    0.870 0.2 -0.832  0.704

Type: response
Comparison: 1 - 0

And we can diagnose the research design as before:

truth = declare_inquiry(ATE = treatment_effect)
estimators = declare_estimator(handler = fit)
design = dgp + truth + estimators
diagnose_design(design)

Research design diagnosis based on 500 simulations. Diagnosis completed in 4 secs.

 Design Inquiry       Term Mean Estimand Mean Estimate  Bias SD Estimate RMSE Power Coverage N Sims
 design     ATE   Adjusted          0.20          0.19 -0.01        0.21 0.21  0.16     0.95    500
 design     ATE Unadjusted          0.20          0.19 -0.01        0.42 0.42  0.08     0.94    500

Both estimators are unbiased, but the regression-adjusted estimator achieves higher power by reducing the standard error of the treatment effect estimate. This demonstrates how incorporating relevant covariates can improve statistical power even with the same sample size.
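The same redesign() approach used earlier applies here as well. For example, one could vary sample_size over a grid (the values below are arbitrary and purely illustrative) to see roughly how many observations each estimator needs to reach a conventional 80% power target:

design_list = redesign(design,
  sample_size = c(100, 250, 500, 1000)
)

diagnose_design(design_list)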