Equivalence Tests

In many contexts, analysts are less interested in rejecting a null hypothesis, and more interested in testing whether an estimate is “inferior”, “superior”, or “equivalent” to a given threshold or interval. For example, medical researchers may wish to determine if the estimated effect of a new treatment is larger than the effect of prior treatments, or larger than some threshold of “clinical significance.” Alternatively, researchers may wish to support a claim that an estimated parameter is “equivalent to” or “not meaningfully different from” a null hypothesis.

To answer these questions, we can use non-inferiority, non-superiority, or equivalence tests like the two-one-sided test (TOST). This article gives a primer and tutorial on TOST:

Lakens D, Scheel AM, Isager PM. Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science. 2018;1(2):259-269. doi:10.1177/2515245918770963

The hypotheses() function of the marginaleffects package includes an equivalence argument which allows users to apply these tests to any of the quantities generated by the package, as well as to arbitrary functions of a model’s parameters. To illustrate, we begin by estimating a simple linear regression model:

R
Python

library(marginaleffects)
mod <- lm(mpg ~ hp + factor(gear), data = mtcars)

import polars as pl
import statsmodels.formula.api as smf
from marginaleffects import *

mtcars = pl.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv")
mtcars = mtcars.cast({"gear" : pl.Utf8, "hp" : pl.Float64})

mod = smf.ols("mpg ~ hp + gear", data = mtcars.to_pandas()).fit()

The rest of this section considers several quantities estimated by marginaleffects.

Predictions

Consider a single prediction, where all predictors are held at their median or mode:

R
Python

p <- predictions(mod, newdata = "median")
p
#> 
#>  Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %  hp gear
#>      19.7          1 19.6   <0.001 281.3  17.7   21.6 123    3
#> 
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, hp, gear, mpg 
#> Type:  response

p = predictions(mod, newdata = "median")
print(p)
#> shape: (1, 7)
#> ┌──────────┬───────────┬──────┬─────────┬─────┬──────┬───────┐
#> │ Estimate ┆ Std.Error ┆ z    ┆ P(>|z|) ┆ S   ┆ 2.5% ┆ 97.5% │
#> │ ---      ┆ ---       ┆ ---  ┆ ---     ┆ --- ┆ ---  ┆ ---   │
#> │ str      ┆ str       ┆ str  ┆ str     ┆ str ┆ str  ┆ str   │
#> ╞══════════╪═══════════╪══════╪═════════╪═════╪══════╪═══════╡
#> │ 19.7     ┆ 1         ┆ 19.6 ┆ 0       ┆ inf ┆ 17.7 ┆ 21.6  │
#> └──────────┴───────────┴──────┴─────────┴─────┴──────┴───────┘
#> 
#> Columns: rowid, estimate, std_error, statistic, p_value, s_value, conf_low, conf_high, rownames, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

Now we specify an equivalence interval (or “region”) for predictions between 17 and 18:

R
Python

hypotheses(p, equivalence = c(17, 18))
#> 
#>  Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)  hp gear
#>      19.7          1 19.6   <0.001 281.3  17.7   21.6      0.951    0.00404     0.951 123    3
#> 
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, hp, gear, mpg, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv 
#> Type:  response

# Not implemented yet

The results allow us to draw three conclusions:

The p value for the non-inferiority test is 0.0040. This suggests that we can reject the null hypothesis that the parameter is below 17.
The p value for the non-superiority test is 0.9508. This suggests that we cannot reject the null hypothesis that the parameter (19.6589) is above 18.
The p value for the equivalence test is 0.9508. This suggests that we cannot reject the hypothesis that the parameter falls outside the equivalence interval.

Model coefficients

The hypotheses function also allows users to conduct equivalence, non-inferiority, and non-superiority tests for model coefficients, and for arbitrary functions of model coefficients.

Our estimate of the coefficient affecting factor(gear)5 in the model is:

R
Python

coef(mod)[4]
#> factor(gear)5 
#>      6.574763

mod.params['gear[T.5]']
#> 6.574762842180765

We can test if this parameter is likely to fall in the [5,7] interval by:

R
Python

hypotheses(mod, equivalence = c(5, 7))[4, ]
#> 
#>           Term Estimate Std. Error z Pr(>|z|)    S 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)
#>  factor(gear)5     6.57       1.64 4   <0.001 14.0  3.36   9.79      0.398      0.169     0.398
#> 
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

h = pl.DataFrame(hypotheses(mod, equivalence = [5., 7.]))[2,:]
print(h)
#> shape: (1, 13)
#> ┌───────────┬──────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
#> │ term      ┆ estimate ┆ std_error ┆ statistic ┆ … ┆ statistic ┆ p_value_n ┆ p_value_n ┆ p_value_e │
#> │ ---       ┆ ---      ┆ ---       ┆ ---       ┆   ┆ _nonsup   ┆ oninf     ┆ onsup     ┆ quiv      │
#> │ str       ┆ f64      ┆ f64       ┆ f64       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
#> │           ┆          ┆           ┆           ┆   ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
#> ╞═══════════╪══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
#> │ gear[T.5] ┆ 6.574763 ┆ 1.642684  ┆ 4.002452  ┆ … ┆ -0.258867 ┆ 0.168867  ┆ 0.397869  ┆ 0.397869  │
#> └───────────┴──────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘

The p value is 0.3979, so we cannot reject the hypothesis that the factor(gear)5 parameter falls outside the [5,7] interval.

Slopes

The same syntax can be used to conduct tests for all the quantities produced by the marginaleffects package. For example, imagine that, for substantive or theoretical reasons, an average slope between -0.1 and 0.1 is uninteresting. We can conduct an equivalence test to check if this is the case:

R
Python

avg_slopes(mod, variables = "hp", equivalence = c(-.1, .1))
#> 
#>  Term    Contrast Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 % p (NonSup) p (NonInf) p (Equiv)
#>    hp mean(dY/dX)  -0.0669      0.011 -6.05   <0.001 29.4 -0.0885 -0.0452     <0.001    0.00135   0.00135
#> 
#> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv 
#> Type:  response

s = pl.DataFrame(avg_slopes(mod, variables = "hp", equivalence = [-.1, .1]))
print(s)
#> shape: (1, 14)
#> ┌──────┬────────────┬───────────┬───────────┬───┬────────────┬────────────┬────────────┬───────────┐
#> │ term ┆ contrast   ┆ estimate  ┆ std_error ┆ … ┆ statistic_ ┆ p_value_no ┆ p_value_no ┆ p_value_e │
#> │ ---  ┆ ---        ┆ ---       ┆ ---       ┆   ┆ nonsup     ┆ ninf       ┆ nsup       ┆ quiv      │
#> │ str  ┆ str        ┆ f64       ┆ f64       ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---       │
#> │      ┆            ┆           ┆           ┆   ┆ f64        ┆ f64        ┆ f64        ┆ f64       │
#> ╞══════╪════════════╪═══════════╪═══════════╪═══╪════════════╪════════════╪════════════╪═══════════╡
#> │ hp   ┆ mean(dY/dX ┆ -0.066854 ┆ 0.011258  ┆ … ┆ -14.820613 ┆ 0.001619   ┆ 5.3898e-50 ┆ 0.001619  │
#> │      ┆ )          ┆           ┆           ┆   ┆            ┆            ┆            ┆           │
#> └──────┴────────────┴───────────┴───────────┴───┴────────────┴────────────┴────────────┴───────────┘

The p value is 0.0013, which suggests that we can reject the hypothesis that the parameter falls outside the region of “substantive equivalence” that we have defined by the interval.

Difference between comparisons (contrasts)

Consider a model with a multiplicative interaction:

R
Python

int <- lm(mpg ~ hp * factor(gear), data = mtcars)

inter = smf.ols("mpg ~ hp * gear", data = mtcars.to_pandas()).fit()

The average contrast for a change of 1 unit in hp differs based on the value of gear:

R
Python

avg_comparisons(int, variables = "hp", by = "gear")
#> 
#>  Term Contrast gear Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %
#>    hp mean(+1)    3  -0.0522     0.0146 -3.59   <0.001 11.6 -0.0808 -0.0237
#>    hp mean(+1)    4  -0.1792     0.0303 -5.92   <0.001 28.2 -0.2385 -0.1199
#>    hp mean(+1)    5  -0.0583     0.0126 -4.61   <0.001 17.9 -0.0830 -0.0335
#> 
#> Columns: term, contrast, gear, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
#> Type:  response

print(avg_comparisons(inter, variables = "hp", by = "gear"))
#> shape: (3, 10)
#> ┌──────┬──────┬──────────┬──────────┬───┬──────────┬──────┬─────────┬─────────┐
#> │ gear ┆ Term ┆ Contrast ┆ Estimate ┆ … ┆ P(>|z|)  ┆ S    ┆ 2.5%    ┆ 97.5%   │
#> │ ---  ┆ ---  ┆ ---      ┆ ---      ┆   ┆ ---      ┆ ---  ┆ ---     ┆ ---     │
#> │ str  ┆ str  ┆ str      ┆ str      ┆   ┆ str      ┆ str  ┆ str     ┆ str     │
#> ╞══════╪══════╪══════════╪══════════╪═══╪══════════╪══════╪═════════╪═════════╡
#> │ 3    ┆ hp   ┆ +1       ┆ -0.0522  ┆ … ┆ 0.000333 ┆ 11.6 ┆ -0.0808 ┆ -0.0237 │
#> │ 4    ┆ hp   ┆ +1       ┆ -0.179   ┆ … ┆ 3.16e-09 ┆ 28.2 ┆ -0.238  ┆ -0.12   │
#> │ 5    ┆ hp   ┆ +1       ┆ -0.0583  ┆ … ┆ 3.97e-06 ┆ 17.9 ┆ -0.083  ┆ -0.0335 │
#> └──────┴──────┴──────────┴──────────┴───┴──────────┴──────┴─────────┴─────────┘
#> 
#> Columns: gear, term, contrast, estimate, std_error, statistic, p_value, s_value, conf_low, conf_high

Are these contrasts different from one another? Let’s look at the pairwise differences between them:

R
Python

avg_comparisons(int, variables = "hp", by = "gear",
    hypothesis = "pairwise")
#> 
#>   Term Estimate Std. Error      z Pr(>|z|)    S   2.5 %  97.5 %
#>  3 - 4  0.12695     0.0336  3.781   <0.001 12.6  0.0611  0.1928
#>  3 - 5  0.00603     0.0193  0.313    0.754  0.4 -0.0318  0.0438
#>  4 - 5 -0.12092     0.0328 -3.688   <0.001 12.1 -0.1852 -0.0567
#> 
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

print(avg_comparisons(inter, variables = "hp", by = "gear",
    hypothesis = "pairwise"))
#> shape: (3, 8)
#> ┌───────────────┬──────────┬───────────┬───────┬──────────┬───────┬─────────┬─────────┐
#> │ Term          ┆ Estimate ┆ Std.Error ┆ z     ┆ P(>|z|)  ┆ S     ┆ 2.5%    ┆ 97.5%   │
#> │ ---           ┆ ---      ┆ ---       ┆ ---   ┆ ---      ┆ ---   ┆ ---     ┆ ---     │
#> │ str           ┆ str      ┆ str       ┆ str   ┆ str      ┆ str   ┆ str     ┆ str     │
#> ╞═══════════════╪══════════╪═══════════╪═══════╪══════════╪═══════╪═════════╪═════════╡
#> │ Row 1 - Row 2 ┆ 0.127    ┆ 0.0336    ┆ 3.78  ┆ 0.000156 ┆ 12.6  ┆ 0.0611  ┆ 0.193   │
#> │ Row 1 - Row 3 ┆ 0.00603  ┆ 0.0193    ┆ 0.313 ┆ 0.754    ┆ 0.407 ┆ -0.0317 ┆ 0.0438  │
#> │ Row 2 - Row 3 ┆ -0.121   ┆ 0.0328    ┆ -3.69 ┆ 0.000226 ┆ 12.1  ┆ -0.185  ┆ -0.0567 │
#> └───────────────┴──────────┴───────────┴───────┴──────────┴───────┴─────────┴─────────┘
#> 
#> Columns: term, estimate, std_error, statistic, p_value, s_value, conf_low, conf_high

We consider that these pairwise comparisons are “equivalent to zero” when they fall in the [-.1, .1] interval:

R
Python

avg_comparisons(int, variables = "hp", by = "gear",
    hypothesis = "pairwise",
    equivalence = c(-.1, .1))
#> 
#>   Term Estimate Std. Error      z Pr(>|z|)    S   2.5 %  97.5 % p (NonSup) p (NonInf) p (Equiv)
#>  3 - 4  0.12695     0.0336  3.781   <0.001 12.6  0.0611  0.1928      0.789     <0.001     0.789
#>  3 - 5  0.00603     0.0193  0.313    0.754  0.4 -0.0318  0.0438     <0.001     <0.001    <0.001
#>  4 - 5 -0.12092     0.0328 -3.688   <0.001 12.1 -0.1852 -0.0567     <0.001      0.738     0.738
#> 
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv 
#> Type:  response

c = pl.DataFrame(
    avg_comparisons(inter, variables = "hp", by = "gear",
        hypothesis = "pairwise",
        equivalence = [-.1, .1])
    )
print(c)
#> shape: (3, 13)
#> ┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
#> │ term      ┆ estimate  ┆ std_error ┆ statistic ┆ … ┆ statistic ┆ p_value_n ┆ p_value_n ┆ p_value_ │
#> │ ---       ┆ ---       ┆ ---       ┆ ---       ┆   ┆ _nonsup   ┆ oninf     ┆ onsup     ┆ equiv    │
#> │ str       ┆ f64       ┆ f64       ┆ f64       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
#> │           ┆           ┆           ┆           ┆   ┆ f64       ┆ f64       ┆ f64       ┆ f64      │
#> ╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
#> │ Row 1 -   ┆ 0.126946  ┆ 0.033575  ┆ 3.781029  ┆ … ┆ 0.80258   ┆ 6.9246e-1 ┆ 0.788891  ┆ 0.788891 │
#> │ Row 2     ┆           ┆           ┆           ┆   ┆           ┆ 2         ┆           ┆          │
#> │ Row 1 -   ┆ 0.006029  ┆ 0.019275  ┆ 0.312784  ┆ … ┆ -4.875192 ┆ 1.8908e-8 ┆ 5.4351e-7 ┆ 5.4351e- │
#> │ Row 3     ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆ 7        │
#> │ Row 2 -   ┆ -0.120917 ┆ 0.032784  ┆ -3.688273 ┆ … ┆ -6.738519 ┆ 0.738272  ┆ 8.0004e-1 ┆ 0.738272 │
#> │ Row 3     ┆           ┆           ┆           ┆   ┆           ┆           ┆ 2         ┆          │
#> └───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘

The p (Equiv) column shows that the difference between the average contrasts when gear is 3 and gear is 5 can be said to be equivalent to the specified interval. However, there are good reasons to think that the other two pairwise comparisons may fall outside the interval.

Marginal means and `emmeans`

This example shows the equivalence between results produced by the emmeans package and the predictions() function:

library(emmeans)

mod <- lm(log(conc) ~ source + factor(percent), data = pigs)

## {emmeans}
emmeans(mod, specs = "source") |>
    pairs() |>
    test(df = Inf,
         null = 0,
         delta = log(1.25),
         side = "equivalence",
         adjust = "none")
#>  contrast    estimate     SE  df z.ratio p.value
#>  fish - soy    -0.273 0.0529 Inf   0.937  0.8257
#>  fish - skim   -0.402 0.0542 Inf   3.308  0.9995
#>  soy - skim    -0.130 0.0530 Inf  -1.765  0.0388
#> 
#> Results are averaged over the levels of: percent 
#> Degrees-of-freedom method: user-specified 
#> Results are given on the log (not the response) scale. 
#> Statistics are tests of equivalence with a threshold of 0.22314 
#> P values are left-tailed

## {marginaleffects}
predictions(
    mod,
    newdata = "balanced",
    by = "source",
    hypothesis = "pairwise",
    equivalence = c(-log(1.25), log(1.25))
)
#> 
#>         Term Estimate Std. Error     z Pr(>|z|)    S  2.5 %  97.5 % p (NonSup) p (NonInf) p (Equiv)
#>  fish - soy    -0.273     0.0529 -5.15   <0.001 21.9 -0.377 -0.1690     <0.001     0.8257    0.8257
#>  fish - skim   -0.402     0.0542 -7.43   <0.001 43.0 -0.508 -0.2961     <0.001     0.9995    0.9995
#>  soy - skim    -0.130     0.0530 -2.44   0.0146  6.1 -0.233 -0.0255     <0.001     0.0388    0.0388
#> 
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv 
#> Type:  response

t-test

Now we show that the results produced by hypotheses() are identical to the results produced by the equivalence package in the case of a simple t-test:

library(equivalence)

set.seed(1024)

## simulate data data
N <- 20
dat <- data.frame(
    y = rnorm(N),
    x = sample(c(rep(0, N / 2), rep(1, N / 2)), N))

## fit model
mod <- lm(y ~ x, data = dat)

## test with the {equivalence} package
e <- tost(
    x = dat$y[dat$x == 0],
    y = dat$y[dat$x == 1],
    epsilon = 10)
e
#> 
#>  Welch Two Sample TOST
#> 
#> data:  dat$y[dat$x == 0] and dat$y[dat$x == 1]
#> df = 17.607
#> sample estimates:
#>  mean of x  mean of y 
#> -0.3788551 -0.2724594 
#> 
#> Epsilon: 10 
#> 95 percent two one-sided confidence interval (TOST interval):
#>  -1.058539  0.845747
#> Null hypothesis of statistical difference is: rejected 
#> TOST p-value: 4.248528e-13

## test with {marginaleffects} package
h <- hypotheses(mod, equivalence = c(-10, 10), df = e$parameter)[2, ]
h
#> 
#>  Term Estimate Std. Error     t Pr(>|t|)   S 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)   Df
#>     x    0.106      0.548 0.194    0.848 0.2 -1.05   1.26     <0.001     <0.001    <0.001 17.6
#> 
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, df, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

# identical p values
h$p.value.equiv |> as.vector()
#> [1] 4.248528e-13

e$tost.p.value |> as.vector()
#> [1] 4.248528e-13