# 14 Equivalence Tests

In many contexts, analysts are less interested in rejecting a null hypothesis, and more interested in testing whether an estimate is “inferior”, “superior”, or “equivalent” to a given threshold or interval. For example, medical researchers may wish to determine if the estimated effect of a new treatment is larger than the effect of prior treatments, or larger than some threshold of “clinical significance.” Alternatively, researchers may wish to support a claim that an estimated parameter is “equivalent to” or “not meaningfully different from” a null hypothesis.

To answer these questions, we can use non-inferiority, non-superiority, or equivalence tests like the two-one-sided test (TOST). This article gives a primer and tutorial on TOST:

Lakens D, Scheel AM, Isager PM. Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science. 2018;1(2):259-269. doi:10.1177/2515245918770963

The `hypotheses()` function of the `marginaleffects` package includes an `equivalence` argument which allows users to apply these tests to any of the quantities generated by the package, as well as to arbitrary functions of a model’s parameters. To illustrate, we begin by estimating a simple linear regression model:

```r
library(marginaleffects)
mod <- lm(mpg ~ hp + factor(gear), data = mtcars)
```

The rest of this section considers several quantities estimated by `marginaleffects`.

## 14.1 Predictions

Consider a single prediction, where all predictors are held at their median or mode:

```r
p <- predictions(mod, newdata = "median")
p
#>
#>  Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %  hp gear
#>      19.7          1 19.6   <0.001 281.3  17.7   21.6 123    3
#>
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, mpg, hp, gear
#> Type:  response
```

Now we specify an equivalence interval (or “region”) for predictions between 17 and 18:

```r
hypotheses(p, equivalence = c(17, 18))
#>
#>  Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)  hp gear
#>      19.7          1 19.6   <0.001 281.3  17.7   21.6      0.951    0.00404     0.951 123    3
#>
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, mpg, hp, gear, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
#> Type:  response
```

The results allow us to draw three conclusions:

1. The p value for the non-inferiority test is 0.0040, so we can reject the null hypothesis that the parameter is below 17.
2. The p value for the non-superiority test is 0.9508, so we cannot reject the null hypothesis that the parameter (19.6589) is above 18.
3. The p value for the equivalence test is 0.9508, so we cannot reject the hypothesis that the parameter falls outside the [17, 18] equivalence interval.
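Under the hood, TOST reduces to two one-sided z tests, one against each bound of the interval, with the equivalence p value taken as the larger of the two. This base R sketch reproduces the arithmetic; the standard error 1.003 is an approximation inferred from the rounded output above, not a value reported by the package:

```r
## Minimal sketch of the TOST arithmetic (base R only)
tost_p <- function(estimate, std_error, lower, upper) {
    ## non-inferiority: H0 is that the parameter lies at or below `lower`
    p_noninf <- pnorm((estimate - lower) / std_error, lower.tail = FALSE)
    ## non-superiority: H0 is that the parameter lies at or above `upper`
    p_nonsup <- pnorm((estimate - upper) / std_error, lower.tail = TRUE)
    ## equivalence: reject only if both one-sided tests reject
    c(noninf = p_noninf, nonsup = p_nonsup, equiv = max(p_noninf, p_nonsup))
}

## estimate from the output above; standard error approximated at 1.003
tost_p(19.6589, 1.003, lower = 17, upper = 18)
```

With these rounded inputs, the three p values land close to the 0.00404, 0.951, and 0.951 reported above.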

## 14.2 Model coefficients

The `hypotheses()` function also allows users to conduct equivalence, non-inferiority, and non-superiority tests for model coefficients, and for arbitrary functions of model coefficients.

Our estimate of the 4th coefficient in the model is:

```r
coef(mod)[4]
#> factor(gear)5
#>      6.574763
```

We can test whether this parameter falls in the [5, 7] interval:

```r
hypotheses(mod, equivalence = c(5, 7))[4, ]
#>
#>           Term Estimate Std. Error z Pr(>|z|)    S 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)
#>  factor(gear)5     6.57       1.64 4   <0.001 14.0  3.36   9.79      0.398      0.169     0.398
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
```

The p value is 0.3979, so we cannot reject the hypothesis that the `factor(gear)5` parameter falls outside the [5, 7] interval.

## 14.3 Slopes

The same syntax can be used to conduct tests for all the quantities produced by the `marginaleffects` package. For example, imagine that, for substantive or theoretical reasons, an average slope between -0.1 and 0.1 is uninteresting. We can conduct an equivalence test to check if this is the case:

```r
avg_slopes(mod, variables = "hp", equivalence = c(-.1, .1))
#>
#>  Term Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 % p (NonSup) p (NonInf) p (Equiv)
#>    hp  -0.0669      0.011 -6.05   <0.001 29.4 -0.0885 -0.0452     <0.001    0.00135   0.00135
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
#> Type:  response
```

The p value is 0.0013, so we can reject the hypothesis that the parameter falls outside the [-0.1, 0.1] region of “substantive equivalence” that we defined.

## 14.4 Difference between comparisons (contrasts)

Consider a model with a multiplicative interaction:

```r
int <- lm(mpg ~ hp * factor(gear), data = mtcars)
```

The average contrast for a change of 1 unit in `hp` differs based on the value of `gear`:

```r
avg_comparisons(int, variables = "hp", by = "gear")
#>
#>  Term Contrast gear Estimate Std. Error     z Pr(>|z|)    S   2.5 %  97.5 %
#>    hp mean(+1)    3  -0.0522     0.0146 -3.59   <0.001 11.6 -0.0808 -0.0237
#>    hp mean(+1)    4  -0.1792     0.0303 -5.92   <0.001 28.2 -0.2385 -0.1199
#>    hp mean(+1)    5  -0.0583     0.0126 -4.61   <0.001 17.9 -0.0830 -0.0335
#>
#> Columns: term, contrast, gear, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
#> Type:  response
```

Are these contrasts different from one another? Let’s look at the pairwise differences between them:

```r
avg_comparisons(int, variables = "hp", by = "gear",
    hypothesis = "pairwise")
#>
#>   Term Estimate Std. Error      z Pr(>|z|)    S   2.5 %  97.5 %
#>  3 - 4  0.12695     0.0336  3.781   <0.001 12.6  0.0611  0.1928
#>  3 - 5  0.00603     0.0193  0.313    0.754  0.4 -0.0318  0.0438
#>  4 - 5 -0.12092     0.0328 -3.688   <0.001 12.1 -0.1852 -0.0567
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
#> Type:  response
```

We consider that these pairwise comparisons are “equivalent to zero” when they fall in the [-.1, .1] interval:

```r
avg_comparisons(int, variables = "hp", by = "gear",
    hypothesis = "pairwise",
    equivalence = c(-.1, .1))
#>
#>   Term Estimate Std. Error      z Pr(>|z|)    S   2.5 %  97.5 % p (NonSup) p (NonInf) p (Equiv)
#>  3 - 4  0.12695     0.0336  3.781   <0.001 12.6  0.0611  0.1928      0.789     <0.001     0.789
#>  3 - 5  0.00603     0.0193  0.313    0.754  0.4 -0.0318  0.0438     <0.001     <0.001    <0.001
#>  4 - 5 -0.12092     0.0328 -3.688   <0.001 12.1 -0.1852 -0.0567     <0.001      0.738     0.738
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv
#> Type:  response
```

The `p (Equiv)` column shows that the difference between the average contrasts for `gear` 3 and `gear` 5 can be considered equivalent to zero within the specified interval. However, there are good reasons to think that the other two pairwise comparisons fall outside it.
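A quick way to sanity-check these verdicts is the classic correspondence between TOST and confidence intervals: rejecting both one-sided tests at the 5% level is the same as the 90% confidence interval lying entirely inside the equivalence interval. A base R sketch, using the rounded estimates and standard errors printed above:

```r
## TRUE when the (1 - 2 * alpha) confidence interval lies inside [lower, upper],
## which matches a TOST equivalence rejection at level alpha
ci_inside <- function(estimate, std_error, lower, upper, alpha = 0.05) {
    z <- qnorm(1 - alpha)
    (estimate - z * std_error) >= lower && (estimate + z * std_error) <= upper
}

ci_inside(0.12695, 0.0336, -0.1, 0.1) # 3 - 4: FALSE, not equivalent
ci_inside(0.00603, 0.0193, -0.1, 0.1) # 3 - 5: TRUE, equivalent
```

This mirrors the `p (Equiv)` column: only the 3 - 5 comparison clears the equivalence threshold.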

## 14.5 Marginal means and `emmeans`

This example shows that the results produced by the `emmeans` package match those of the `marginal_means()` function:

```r
library(emmeans)

mod <- lm(log(conc) ~ source + factor(percent), data = pigs)

## {emmeans}
emmeans(mod, specs = "source") |>
    pairs() |>
    test(df = Inf,
        null = 0,
        delta = log(1.25),
        side = "equivalence")
#>  contrast    estimate     SE  df z.ratio p.value
#>  fish - soy    -0.273 0.0529 Inf   0.937  0.8257
#>  fish - skim   -0.402 0.0542 Inf   3.308  0.9995
#>  soy - skim    -0.130 0.0530 Inf  -1.765  0.0388
#>
#> Results are averaged over the levels of: percent
#> Degrees-of-freedom method: user-specified
#> Results are given on the log (not the response) scale.
#> Statistics are tests of equivalence with a threshold of 0.22314
#> P values are left-tailed

## {marginaleffects}
marginal_means(
    mod,
    variables = "source",
    hypothesis = "pairwise",
    equivalence = c(-log(1.25), log(1.25)))
#>
#>         Term   Mean Std. Error     z Pr(>|z|)    S  2.5 %  97.5 % p (NonSup) p (NonInf) p (Equiv)
#>  fish - soy  -0.273     0.0529 -5.15   <0.001 21.9 -0.377 -0.1690     <0.001     0.8257    0.8257
#>  fish - skim -0.402     0.0542 -7.43   <0.001 43.0 -0.508 -0.2961     <0.001     0.9995    0.9995
#>  soy - skim  -0.130     0.0530 -2.44   0.0146  6.1 -0.233 -0.0255     <0.001     0.0388    0.0388
#>
#> Results averaged over levels of: percent, source
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, p.value.equiv, p.value.noninf, p.value.nonsup, statistic.noninf, statistic.nonsup
#> Type:  response
```

## 14.6 t-test

Now we show that the results produced by `hypotheses()` are identical to the results produced by the `equivalence` package in the case of a simple t-test:

```r
library(equivalence)

set.seed(1024)

## simulate data
N <- 20
dat <- data.frame(
    y = rnorm(N),
    x = sample(c(rep(0, N / 2), rep(1, N / 2)), N))

## fit model
mod <- lm(y ~ x, data = dat)

## test with the {equivalence} package
e <- tost(
    x = dat$y[dat$x == 0],
    y = dat$y[dat$x == 1],
    epsilon = 10)
e
#>
#>  Welch Two Sample TOST
#>
#> data:  dat$y[dat$x == 0] and dat$y[dat$x == 1]
#> df = 17.607
#> sample estimates:
#>  mean of x  mean of y
#> -0.3788551 -0.2724594
#>
#> Epsilon: 10
#> 95 percent two one-sided confidence interval (TOST interval):
#>  -1.058539  0.845747
#> Null hypothesis of statistical difference is: rejected
#> TOST p-value: 4.248528e-13

## test with the {marginaleffects} package
h <- hypotheses(mod, equivalence = c(-10, 10), df = e$parameter)[2, ]
h
#>
#>  Term Estimate Std. Error     t Pr(>|t|)   S 2.5 % 97.5 % p (NonSup) p (NonInf) p (Equiv)   Df
#>     x    0.106      0.548 0.194    0.848 0.2 -1.05   1.26     <0.001     <0.001    <0.001 17.6
#>
#> Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, df, statistic.noninf, statistic.nonsup, p.value.noninf, p.value.nonsup, p.value.equiv

## identical p values
h$p.value.equiv |> as.vector()
#>  4.248528e-13

e$tost.p.value |> as.vector()
#>  4.248528e-13
```