# Equivalence Tests

In many contexts, analysts are less interested in rejecting a null hypothesis, and more interested in testing whether an estimate is “inferior”, “superior”, or “equivalent” to a given threshold or interval. For example, medical researchers may wish to determine if the estimated effect of a new treatment is larger than the effect of prior treatments, or larger than some threshold of “clinical significance.” Alternatively, researchers may wish to support a claim that an estimated parameter is “equivalent to” or “not meaningfully different from” a null hypothesis.

To answer these questions, we can use non-inferiority, non-superiority, or equivalence tests such as the two one-sided tests (TOST) procedure. This article gives a primer and tutorial on TOST:

Lakens D, Scheel AM, Isager PM. Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science. 2018;1(2):259-269. doi:10.1177/2515245918770963
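Under the hood, a TOST runs two one-sided tests against the bounds of the interval and declares equivalence only when both one-sided nulls are rejected; its p value is therefore the larger of the two one-sided p values. A minimal sketch of that logic, assuming a normally distributed estimate (`tost` is a hypothetical helper written for illustration, not part of any package API):

```python
# Sketch of the TOST logic for an estimate with standard error `se` and an
# equivalence interval [low, high]; assumes a normal sampling distribution.
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal CDF via the error function (stdlib only)
    return 0.5 * (1 + erf(z / sqrt(2)))

def tost(estimate, se, low, high):
    # Non-inferiority: H0 is that the parameter lies below `low`
    p_noninf = 1 - norm_cdf((estimate - low) / se)
    # Non-superiority: H0 is that the parameter lies above `high`
    p_nonsup = norm_cdf((estimate - high) / se)
    # Equivalence requires rejecting BOTH one-sided nulls, so the TOST
    # p value is the larger of the two one-sided p values
    return p_noninf, p_nonsup, max(p_noninf, p_nonsup)
```

For example, an estimate of 0 with standard error 1 tested against the interval [-3, 3] rejects both one-sided nulls (each p ≈ 0.00135), while the narrower interval [-1, 1] does not.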

The `hypotheses()` function of the `marginaleffects` package includes an `equivalence` argument which allows users to apply these tests to any of the quantities generated by the package, as well as to arbitrary functions of a model’s parameters. To illustrate, we begin by estimating a simple linear regression model:

```r
library(marginaleffects)
mod <- lm(mpg ~ hp + factor(gear), data = mtcars)
```

```python
import polars as pl
import statsmodels.formula.api as smf
from marginaleffects import *

mtcars = mtcars.cast({"gear" : pl.Utf8, "hp" : pl.Float64})

mod = smf.ols("mpg ~ hp + gear", data = mtcars.to_pandas()).fit()
```

The rest of this section considers several quantities estimated by `marginaleffects`.

## Predictions

Consider a single prediction, where all predictors are held at their median or mode:

```r
p <- predictions(mod, newdata = "median")
p
#>
#>  Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %  hp gear
#>      19.7          1 19.6   <0.001 281.3  17.7   21.6 123    3
#>
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, mpg, hp, gear
#> Type:  response
```
```python
p = predictions(mod, newdata = "median")
print(p)
#> shape: (1, 7)
#> ┌──────────┬───────────┬──────┬─────────┬─────┬──────┬───────┐
#> │ Estimate ┆ Std.Error ┆ z    ┆ P(>|z|) ┆ S   ┆ 2.5% ┆ 97.5% │
#> │ ---      ┆ ---       ┆ ---  ┆ ---     ┆ --- ┆ ---  ┆ ---   │
#> │ str      ┆ str       ┆ str  ┆ str     ┆ str ┆ str  ┆ str   │
#> ╞══════════╪═══════════╪══════╪═════════╪═════╪══════╪═══════╡
#> │ 19.7     ┆ 1         ┆ 19.6 ┆ 0       ┆ inf ┆ 17.7 ┆ 21.6  │
#> └──────────┴───────────┴──────┴─────────┴─────┴──────┴───────┘
#>
#> Columns: rowid, estimate, std_error, statistic, p_value, s_value, conf_low, conf_high, rownames, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
```

Now we specify an equivalence interval (or “region”) of [17, 18] for this prediction:

```r
hypotheses(p, equivalence = c(17, 18))
#>
#>  Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)  hp gear
#>      19.7          1 19.6   <0.001 281.3  17.7   21.6      0.951    0.00404     0.951 123    3
#>
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, mpg, hp, gear, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
#> Type:  response
```

```python
# Not implemented yet
```

The results allow us to draw three conclusions:

1. The p value for the non-inferiority test is 0.0040. This suggests that we can reject the null hypothesis that the parameter is below 17.
2. The p value for the non-superiority test is 0.9508. This suggests that we cannot reject the null hypothesis that the parameter (19.6589) is above 18.
3. The p value for the equivalence test is 0.9508. This suggests that we cannot reject the hypothesis that the parameter falls outside the equivalence interval.
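These three p values can be approximately reproduced from the printed output with nothing but the normal CDF. The standard error below is backed out of the rounded z statistic (se = estimate / z ≈ 1.003), so the agreement is only approximate:

```python
# Rough check of the three TOST p values above, using the printed estimate
# (19.6589) and a standard error recovered from the rounded z statistic.
from math import erf, sqrt

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

estimate = 19.6589
se = estimate / 19.6                           # ~1.003, from the rounded z
p_noninf = 1 - norm_cdf((estimate - 17) / se)  # ~0.004: reject "below 17"
p_nonsup = norm_cdf((estimate - 18) / se)      # ~0.95: cannot reject "above 18"
p_equiv = max(p_noninf, p_nonsup)              # ~0.95: equivalence not supported
```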

## Model coefficients

The `hypotheses` function also allows users to conduct equivalence, non-inferiority, and non-superiority tests for model coefficients, and for arbitrary functions of model coefficients.

Our estimate of the `factor(gear)5` coefficient in the model is:

```r
coef(mod)[4]
#> factor(gear)5
#>      6.574763
```

```python
mod.params['gear[T.5]']
#> 6.574762842180765
```

We can test whether this parameter is likely to fall in the [5, 7] interval:

```r
hypotheses(mod, equivalence = c(5, 7))[4, ]
#>
#>           Term Estimate Std. Error z Pr(>|z|)    S 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)
#>  factor(gear)5     6.57       1.64 4   <0.001 14.0  3.36   9.79      0.398      0.169     0.398
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
```

```python
h = pl.DataFrame(hypotheses(mod, equivalence = [5., 7.]))[2,:]
print(h)
#> shape: (1, 13)
#> ┌───────────┬──────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
#> │ term      ┆ estimate ┆ std_error ┆ statistic ┆ … ┆ statistic ┆ p_value_n ┆ p_value_n ┆ p_value_e │
#> │ ---       ┆ ---      ┆ ---       ┆ ---       ┆   ┆ _nonsup   ┆ oninf     ┆ onsup     ┆ quiv      │
#> │ str       ┆ f64      ┆ f64       ┆ f64       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
#> │           ┆          ┆           ┆           ┆   ┆ f64       ┆ f64       ┆ f64       ┆ f64       │
#> ╞═══════════╪══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
#> │ gear[T.5] ┆ 6.574763 ┆ 1.642684  ┆ 4.002452  ┆ … ┆ -0.258867 ┆ 0.168867  ┆ 0.397869  ┆ 0.397869  │
#> └───────────┴──────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴───────────┘
```

The p value is 0.3979, so we cannot reject the hypothesis that the `factor(gear)5` parameter falls outside the [5, 7] interval.
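The same kind of coefficient-level TOST can be reproduced end to end outside of any package. The sketch below simulates data whose true slope is 6, fits OLS by hand, and tests the slope against a [5, 7] interval; the data, seed, and interval are illustrative, not the `mtcars` model above:

```python
# Simulate, fit OLS via the normal equations, then TOST one coefficient.
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 6.0 * x + rng.normal(size=n)              # true slope = 6

X = np.column_stack([np.ones(n), x])          # intercept + slope design
beta = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta
sigma2 = resid @ resid / (n - 2)              # residual variance
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])

p_noninf = 1 - norm_cdf((beta[1] - 5) / se)   # H0: slope below 5
p_nonsup = norm_cdf((beta[1] - 7) / se)       # H0: slope above 7
p_equiv = max(p_noninf, p_nonsup)             # small: slope is inside [5, 7]
```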

## Slopes

The same syntax can be used to conduct tests for all the quantities produced by the `marginaleffects` package. For example, imagine that, for substantive or theoretical reasons, an average slope between -0.1 and 0.1 is uninteresting. We can conduct an equivalence test to check if this is the case:

```r
avg_slopes(mod, variables = "hp", equivalence = c(-.1, .1))
#>
#>  Term Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 % p (NonSup) p (NonInf) p (Equiv)
#>    hp  -0.0669      0.011 -6.05   <0.001 29.4 -0.0885 -0.0452     <0.001    0.00135   0.00135
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
#> Type:  response
```

```python
s = pl.DataFrame(avg_slopes(mod, variables = "hp", equivalence = [-.1, .1]))
print(s)
#> shape: (1, 14)
#> ┌──────┬────────────┬───────────┬───────────┬───┬────────────┬────────────┬────────────┬───────────┐
#> │ term ┆ contrast   ┆ estimate  ┆ std_error ┆ … ┆ statistic_ ┆ p_value_no ┆ p_value_no ┆ p_value_e │
#> │ ---  ┆ ---        ┆ ---       ┆ ---       ┆   ┆ nonsup     ┆ ninf       ┆ nsup       ┆ quiv      │
#> │ str  ┆ str        ┆ f64       ┆ f64       ┆   ┆ ---        ┆ ---        ┆ ---        ┆ ---       │
#> │      ┆            ┆           ┆           ┆   ┆ f64        ┆ f64        ┆ f64        ┆ f64       │
#> ╞══════╪════════════╪═══════════╪═══════════╪═══╪════════════╪════════════╪════════════╪═══════════╡
#> │ hp   ┆ mean(dY/dX ┆ -0.066854 ┆ 0.011258  ┆ … ┆ -14.820613 ┆ 0.001619   ┆ 5.3898e-50 ┆ 0.001619  │
#> │      ┆ )          ┆           ┆           ┆   ┆            ┆            ┆            ┆           │
#> └──────┴────────────┴───────────┴───────────┴───┴────────────┴────────────┴────────────┴───────────┘
```

The p value is 0.0013, so we can reject the hypothesis that the parameter falls outside the region of “substantive equivalence” defined by this interval.
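In a linear model without interactions, the average slope is simply the coefficient, but the generic recipe used by tools in this family (perturb the focal variable, re-predict, and average the unit-level derivatives) works for any model. A stylized sketch of that recipe on simulated data:

```python
# Average slope via centered finite differences of the fitted predictions.
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = -0.07 * x + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)      # OLS fit by hand

# Perturb x up and down, re-predict, and average the unit-level derivatives
eps = 1e-6
hi = np.column_stack([np.ones(n), x + eps]) @ beta
lo = np.column_stack([np.ones(n), x - eps]) @ beta
avg_slope = np.mean((hi - lo) / (2 * eps))    # equals beta[1] for a linear model
```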

## Difference between comparisons (contrasts)

Consider a model with a multiplicative interaction:

```r
int <- lm(mpg ~ hp * factor(gear), data = mtcars)
```

```python
inter = smf.ols("mpg ~ hp * gear", data = mtcars.to_pandas()).fit()
```

The average contrast for a change of 1 unit in `hp` differs based on the value of `gear`:

```r
avg_comparisons(int, variables = "hp", by = "gear")
#>
#>  Term Contrast gear Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %
#>    hp mean(+1)    3  -0.0522     0.0146 -3.59   <0.001 11.6 -0.0808 -0.0237
#>    hp mean(+1)    4  -0.1792     0.0303 -5.92   <0.001 28.2 -0.2385 -0.1199
#>    hp mean(+1)    5  -0.0583     0.0126 -4.61   <0.001 17.9 -0.0830 -0.0335
#>
#> Columns: term, contrast, gear, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
#> Type:  response
```

```python
print(avg_comparisons(inter, variables = "hp", by = "gear"))
#> shape: (3, 10)
#> ┌──────┬──────┬──────────┬──────────┬───┬──────────┬──────┬─────────┬─────────┐
#> │ gear ┆ Term ┆ Contrast ┆ Estimate ┆ … ┆ P(>|z|)  ┆ S    ┆ 2.5%    ┆ 97.5%   │
#> │ ---  ┆ ---  ┆ ---      ┆ ---      ┆   ┆ ---      ┆ ---  ┆ ---     ┆ ---     │
#> │ str  ┆ str  ┆ str      ┆ str      ┆   ┆ str      ┆ str  ┆ str     ┆ str     │
#> ╞══════╪══════╪══════════╪══════════╪═══╪══════════╪══════╪═════════╪═════════╡
#> │ 3    ┆ hp   ┆ +1       ┆ -0.0522  ┆ … ┆ 0.000333 ┆ 11.6 ┆ -0.0808 ┆ -0.0237 │
#> │ 4    ┆ hp   ┆ +1       ┆ -0.179   ┆ … ┆ 3.16e-09 ┆ 28.2 ┆ -0.238  ┆ -0.12   │
#> │ 5    ┆ hp   ┆ +1       ┆ -0.0583  ┆ … ┆ 3.97e-06 ┆ 17.9 ┆ -0.083  ┆ -0.0335 │
#> └──────┴──────┴──────────┴──────────┴───┴──────────┴──────┴─────────┴─────────┘
#>
#> Columns: gear, term, contrast, estimate, std_error, statistic, p_value, s_value, conf_low, conf_high
```

Are these contrasts different from one another? Let’s look at the pairwise differences between them:

```r
avg_comparisons(int, variables = "hp", by = "gear",
    hypothesis = "pairwise")
#>
#>   Term Estimate Std. Error      z Pr(>|z|)    S   2.5 %  97.5 %
#>  3 - 4  0.12695     0.0336  3.781   <0.001 12.6  0.0611  0.1928
#>  3 - 5  0.00603     0.0193  0.313    0.754  0.4 -0.0318  0.0438
#>  4 - 5 -0.12092     0.0328 -3.688   <0.001 12.1 -0.1852 -0.0567
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
#> Type:  response
```

```python
print(avg_comparisons(inter, variables = "hp", by = "gear",
    hypothesis = "pairwise"))
#> shape: (3, 8)
#> ┌───────────────┬──────────┬───────────┬───────┬──────────┬───────┬─────────┬─────────┐
#> │ Term          ┆ Estimate ┆ Std.Error ┆ z     ┆ P(>|z|)  ┆ S     ┆ 2.5%    ┆ 97.5%   │
#> │ ---           ┆ ---      ┆ ---       ┆ ---   ┆ ---      ┆ ---   ┆ ---     ┆ ---     │
#> │ str           ┆ str      ┆ str       ┆ str   ┆ str      ┆ str   ┆ str     ┆ str     │
#> ╞═══════════════╪══════════╪═══════════╪═══════╪══════════╪═══════╪═════════╪═════════╡
#> │ Row 1 - Row 2 ┆ 0.127    ┆ 0.0336    ┆ 3.78  ┆ 0.000156 ┆ 12.6  ┆ 0.0611  ┆ 0.193   │
#> │ Row 1 - Row 3 ┆ 0.00603  ┆ 0.0193    ┆ 0.313 ┆ 0.754    ┆ 0.407 ┆ -0.0317 ┆ 0.0438  │
#> │ Row 2 - Row 3 ┆ -0.121   ┆ 0.0328    ┆ -3.69 ┆ 0.000226 ┆ 12.1  ┆ -0.185  ┆ -0.0567 │
#> └───────────────┴──────────┴───────────┴───────┴──────────┴───────┴─────────┴─────────┘
#>
#> Columns: term, estimate, std_error, statistic, p_value, s_value, conf_low, conf_high
```

We consider these pairwise comparisons “equivalent to zero” when they fall in the [-0.1, 0.1] interval:

```r
avg_comparisons(int, variables = "hp", by = "gear",
    hypothesis = "pairwise",
    equivalence = c(-.1, .1))
#>
#>   Term Estimate Std. Error      z Pr(>|z|)    S   2.5 %  97.5 % p (NonSup) p (NonInf) p (Equiv)
#>  3 - 4  0.12695     0.0336  3.781   <0.001 12.6  0.0611  0.1928      0.789     <0.001     0.789
#>  3 - 5  0.00603     0.0193  0.313    0.754  0.4 -0.0318  0.0438     <0.001     <0.001    <0.001
#>  4 - 5 -0.12092     0.0328 -3.688   <0.001 12.1 -0.1852 -0.0567     <0.001      0.738     0.738
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
#> Type:  response
```

```python
c = pl.DataFrame(
    avg_comparisons(inter, variables = "hp", by = "gear",
        hypothesis = "pairwise",
        equivalence = [-.1, .1])
)
print(c)
#> shape: (3, 13)
#> ┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
#> │ term      ┆ estimate  ┆ std_error ┆ statistic ┆ … ┆ statistic ┆ p_value_n ┆ p_value_n ┆ p_value_ │
#> │ ---       ┆ ---       ┆ ---       ┆ ---       ┆   ┆ _nonsup   ┆ oninf     ┆ onsup     ┆ equiv    │
#> │ str       ┆ f64       ┆ f64       ┆ f64       ┆   ┆ ---       ┆ ---       ┆ ---       ┆ ---      │
#> │           ┆           ┆           ┆           ┆   ┆ f64       ┆ f64       ┆ f64       ┆ f64      │
#> ╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
#> │ Row 1 -   ┆ 0.126946  ┆ 0.033575  ┆ 3.781029  ┆ … ┆ 0.80258   ┆ 6.9245e-1 ┆ 0.788891  ┆ 0.788891 │
#> │ Row 2     ┆           ┆           ┆           ┆   ┆           ┆ 2         ┆           ┆          │
#> │ Row 1 -   ┆ 0.006029  ┆ 0.019275  ┆ 0.312784  ┆ … ┆ -4.875192 ┆ 1.8908e-8 ┆ 5.4351e-7 ┆ 5.4351e- │
#> │ Row 3     ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆ 7        │
#> │ Row 2 -   ┆ -0.120917 ┆ 0.032784  ┆ -3.688273 ┆ … ┆ -6.738519 ┆ 0.738272  ┆ 8.0004e-1 ┆ 0.738272 │
#> │ Row 3     ┆           ┆           ┆           ┆   ┆           ┆ 2         ┆          │
#> └───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘
```

The `p (Equiv)` column shows that the difference between the average contrasts when `gear` is 3 and `gear` is 5 can be said to be equivalent to the specified interval. However, there are good reasons to think that the other two pairwise comparisons may fall outside the interval.
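The standard error of each pairwise difference comes from the delta method: for a contrast vector c, it is sqrt(c' V c), where V is the covariance matrix of the estimates. Plugging in the rounded estimates printed above, and *assuming* near-zero covariances across `gear` groups (the full V is not shown in the output), the first pairwise standard error is recovered almost exactly:

```python
# Delta-method SE of a pairwise contrast; off-diagonals of V are assumed ~0.
import numpy as np

est = np.array([-0.0522, -0.1792, -0.0583])   # average contrasts for gear 3, 4, 5
se = np.array([0.0146, 0.0303, 0.0126])       # their printed standard errors
V = np.diag(se**2)                            # assumption: cross-covariances ~ 0

c = np.array([1.0, -1.0, 0.0])                # contrast "3 - 4"
diff = c @ est                                # ~0.127, as in the output
se_diff = np.sqrt(c @ V @ c)                  # ~0.0336, as in the output
z = diff / se_diff
```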

## Marginal means and `emmeans`

This example shows the equivalence between results produced by the `emmeans` package and the `predictions()` function:

```r
library(emmeans)

mod <- lm(log(conc) ~ source + factor(percent), data = pigs)

## {emmeans}
emmeans(mod, specs = "source") |>
    pairs() |>
    test(df = Inf,
        null = 0,
        delta = log(1.25),
        side = "equivalence")
#>  contrast    estimate     SE  df z.ratio p.value
#>  fish - soy    -0.273 0.0529 Inf   0.937  0.8257
#>  fish - skim   -0.402 0.0542 Inf   3.308  0.9995
#>  soy - skim    -0.130 0.0530 Inf  -1.765  0.0388
#>
#> Results are averaged over the levels of: percent
#> Degrees-of-freedom method: user-specified
#> Results are given on the log (not the response) scale.
#> Statistics are tests of equivalence with a threshold of 0.22314
#> P values are left-tailed

## {marginaleffects}
predictions(
    mod,
    newdata = datagrid(grid_type = "balanced"),
    by = "source",
    hypothesis = "pairwise",
    equivalence = c(-log(1.25), log(1.25))
)
#>
#>         Term Estimate Std. Error     z Pr(>|z|)    S  2.5 %  97.5 % p (NonSup) p (NonInf) p (Equiv)
#>  fish - soy    -0.273     0.0529 -5.15   <0.001 21.9 -0.377 -0.1690     <0.001     0.8257    0.8257
#>  fish - skim   -0.402     0.0542 -7.43   <0.001 43.0 -0.508 -0.2961     <0.001     0.9995    0.9995
#>  soy - skim    -0.130     0.0530 -2.44   0.0146  6.1 -0.233 -0.0255     <0.001     0.0388    0.0388
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
#> Type:  response
```
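The threshold `delta = log(1.25)` is the conventional bioequivalence bound: because the model is fit on `log(conc)`, an interval of ±log(1.25) on the coefficient scale maps to the familiar 0.80–1.25 bounds on the ratio (response) scale:

```python
# Why +-log(1.25)? Exponentiating the bounds recovers the classic
# 0.8-1.25 bioequivalence interval on the ratio scale.
from math import exp, log

delta = log(1.25)              # 0.22314..., the threshold printed by emmeans
lo, hi = exp(-delta), exp(delta)
# (lo, hi) is (0.8, 1.25)
```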

## t-test

Now we show that the results produced by `hypotheses()` are identical to the results produced by the `equivalence` package in the case of a simple t-test:

```r
library(equivalence)

set.seed(1024)

## simulate data
N <- 20
dat <- data.frame(
    y = rnorm(N),
    x = sample(c(rep(0, N / 2), rep(1, N / 2)), N))

## fit model
mod <- lm(y ~ x, data = dat)

## test with the {equivalence} package
e <- tost(
    x = dat$y[dat$x == 0],
    y = dat$y[dat$x == 1],
    epsilon = 10)
e
#>
#>  Welch Two Sample TOST
#>
#> data:  dat$y[dat$x == 0] and dat$y[dat$x == 1]
#> df = 17.607
#> sample estimates:
#>  mean of x  mean of y
#> -0.3788551 -0.2724594
#>
#> Epsilon: 10
#> 95 percent two one-sided confidence interval (TOST interval):
#>  -1.058539  0.845747
#> Null hypothesis of statistical difference is: rejected
#> TOST p-value: 4.248528e-13

## test with the {marginaleffects} package
h <- hypotheses(mod, equivalence = c(-10, 10), df = e$parameter)[2, ]
h
#>
#>  Term Estimate Std. Error     t Pr(>|t|)   S 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)   Df
#>     x    0.106      0.548 0.194    0.848 0.2 -1.05   1.26     <0.001     <0.001    <0.001 17.6
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, df, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

## identical p values
h$p.value.equiv |> as.vector()
#> [1] 4.248528e-13

e$tost.p.value |> as.vector()
#> [1] 4.248528e-13
```
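A Welch-style TOST of the same shape can be sketched in Python with `scipy` (`statsmodels` also ships a ready-made `statsmodels.stats.weightstats.ttost_ind`). The data and the deliberately wide ±10 interval below are simulated for illustration, not the `dat` object above:

```python
# Welch TOST for a two-sample mean difference: two one-sided t tests
# with Welch-Satterthwaite degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=20)
y = rng.normal(size=20)

low, high = -10.0, 10.0                       # a deliberately wide interval
diff = x.mean() - y.mean()
vx = x.var(ddof=1) / len(x)
vy = y.var(ddof=1) / len(y)
se = np.sqrt(vx + vy)
# Welch-Satterthwaite degrees of freedom
df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))

p_noninf = 1 - stats.t.cdf((diff - low) / se, df)   # H0: difference below -10
p_nonsup = stats.t.cdf((diff - high) / se, df)      # H0: difference above +10
p_equiv = max(p_noninf, p_nonsup)                   # tiny: "equivalent to zero"
```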