This chapter introduces two classes of statistical testing procedures: null hypothesis tests and equivalence tests. As we will see, these tests are complementary and versatile technologies, which can be applied to parameter estimates, but also to any of the quantities explored in Chapters 5, 6, and 7: predictions, counterfactual comparisons, and slopes.
The first strategy to consider is the null hypothesis test. This test is designed to assess if there is enough evidence to reject the possibility that a population parameter (or function of parameters) takes on a specific value, such as zero. Null hypothesis tests are common in all fields of data analysis. They can help us answer questions such as:
Is the effect of a new drug different from the effect of an existing treatment?
Does cognitive-behavioral therapy have a non-zero effect in alleviating depression?
Is there a statistically significant difference in test scores between students who attend public or private schools?
Equivalence tests flip the logic around. Instead of establishing a difference, they are concerned with demonstrating similarity. For instance, in a null hypothesis test the analyst may be interested in showing that a new drug is effective, that its treatment effect is different from zero. In contrast, an equivalence test could show that the drug’s estimated effect is “equivalent” to zero, or not “meaningfully different” from the effect of another drug. Put differently, equivalence testing is used when the researcher aims to show that an observed difference between groups or parameter values is small enough to be negligible in practical terms. This approach is useful to answer questions such as:
Is the effect of a generic drug equivalent to that of the branded version?
Is the effect of a marketing campaign on consumption so small that we can consider it ineffective?
Are the levels of social trust in two different communities equivalent?
This chapter shows how to compute and interpret both null hypothesis and equivalence tests using the marginaleffects package.¹
To illustrate, we will analyze data from a randomized controlled trial conducted by Thornton (2008), “The Demand for, and Impact of, Learning HIV Status.” One goal of this trial was to find out if we could encourage people to seek information about their HIV status. The researchers administered HIV tests at home to many study participants in rural Malawi. Then, they randomly assigned some people to receive a small monetary incentive if they were willing to travel to a voluntary counseling and testing center and learn their HIV status.
The outcome of interest is a binary variable, outcome, equal to 1 if a study participant chose to travel to the center, and 0 otherwise. The treatment is a binary variable, incentive, equal to 1 if the participant was part of the treatment arm and received an incentive. In addition, the researchers collected information about people’s distance from the test center, and a numeric identifier for the village in which they live. Finally, our dataset includes a measure of the participants’ age, divided into three groups in the agecat column.
We use the read.csv() function to read the dataset into memory, the head() function to extract the first few rows, and the tt() function from the tinytable package to display results in a good-looking table:
After analyzing these data, Thornton (2008) concluded that 34% of participants in the control group sought to learn their HIV status. In contrast, a small monetary incentive doubled this proportion. Simply put, the intervention proved to be highly successful and cost-effective.
Over the next few chapters, we will use the marginaleffects package to analyze various aspects of Thornton’s data. Here, we ask: Do minors, young adults, and older adults have different propensities to seek information about their HIV status?
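To answer this question, let us consider a linear probability model with the binary outcome as the dependent variable and each level of the agecat variable as predictors:
\[
\text{Outcome} = \beta_1 \text{Age}_{<18} + \beta_2 \text{Age}_{18 \text{ to } 35} + \beta_3 \text{Age}_{>35} + \varepsilon \tag{4.1}
\]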
We use the lm() function to estimate the model via ordinary least squares, adding -1 to the model formula to suppress the usual intercept. We then call coef() to extract the vector of coefficient estimates:
     agecat<18 agecat18 to 35      agecat>35
     0.6718750      0.6787004      0.7277354
Because there is no other predictor in the model, and since we intentionally dropped the intercept, the coefficients associated with agecat levels measure the average outcome in each age category. Indeed, the estimated coefficients printed above are exactly identical to subgroup means calculated in-sample using the aggregate() function:
At first glance, it looks like the probability that a young adult will seek information about their HIV status is smaller than the probability for older adults: 67.9% for participants between 18 and 35 years old, and 72.8% for those above 35 years old.
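The equivalence between the coefficients of a no-intercept model and the subgroup means can be checked on any dataset. The sketch below uses simulated data rather than Thornton's: the data frame `dat`, the seed, the group shares, and the subgroup probabilities are all assumptions made for illustration, chosen only to resemble the estimates reported above.

```r
# Simulated stand-in for the study data (NOT Thornton's data).
set.seed(2008)
n <- 3000
levels_age <- c("<18", "18 to 35", ">35")
agecat <- factor(
  sample(levels_age, n, replace = TRUE, prob = c(0.1, 0.5, 0.4)),
  levels = levels_age
)

# Hypothetical probabilities of travelling to the center, by age group
p <- c(0.67, 0.68, 0.73)[as.integer(agecat)]
dat <- data.frame(outcome = rbinom(n, 1, p), agecat = agecat)

# Linear probability model without intercept: one coefficient per age group
mod <- lm(outcome ~ agecat - 1, data = dat)

# Subgroup means computed directly from the data
means <- aggregate(outcome ~ agecat, data = dat, FUN = mean)

# The two sets of numbers agree up to floating point error
all.equal(unname(coef(mod)), means$outcome)
```

Because the three indicator columns are mutually exclusive, ordinary least squares fits each coefficient using only the observations in its own group, which is why the estimates coincide with the group averages.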
Before conducting hypothesis and equivalence tests on these quantities, two points deserve to be highlighted. First, the concepts and techniques surveyed in this chapter apply to all the quantities that we study in this book: parameter estimates, predictions, counterfactual comparisons, slopes, and more. When you are done reading Part II of the book, you will not only be able to compute these quantities, but also to conduct a wide variety of meaningful statistical tests on them.
Second, it is important to underline the key distinction between statistical and practical significance. We say that a result is “statistically significant” if it would have been unlikely to occur by pure chance (i.e., sampling variation) in a hypothetical world where the null hypothesis and model hold true. We say that a result has “practical significance” when it has important implications for the real world. Whether a result is practically significant is not dictated purely by statistical considerations; it depends on the field, the research question, and on theory. Many results are statistically significant without having much practical significance.
Often, the magnitude of a treatment effect is distinguishable from zero, but it is too small to be of use to practitioners. In those cases, data analysts will typically report small \(p\) values for both the null hypothesis test and the equivalence test.
4.1 Null hypothesis
The null hypothesis test is a fundamental statistical method used to determine if there is sufficient evidence to reject a presumed statement about a population parameter. The null hypothesis \(H_0\) represents a default or initial claim, usually suggesting no effect or no difference in the parameter of interest. For example, \(H_0\) might state that the mean of a population is equal to a specific value, or that there is no association between two variables.
After choosing \(H_0\), the analyst calculates a test statistic from the sample data, and compares it to a critical value derived from the sampling distribution of that test statistic under \(H_0\). If the test statistic falls in a critical region, typically in the tails of its distribution, we conclude that there is enough evidence to reject \(H_0\).
Most statistics textbooks discuss the theory of null hypothesis testing.² The present chapter is more practical: it illustrates how to use the marginaleffects package to conduct linear or non-linear tests on model parameters or on functions of those parameters. We use the standard Wald approach and construct \(z\) statistics of this form:
\[
z = \frac{h(\hat{\theta}) - H_0}{\sqrt{\hat{V}[h(\hat{\theta})]}} \tag{4.2}
\]
where \(\hat{\theta}\) is a vector of parameter estimates, and \(h(\hat{\theta})\) is a function of those estimates, such as a prediction, counterfactual comparison, or slope. \(H_0\) is our null hypothesis and \(\hat{V}[h(\hat{\theta})]\) is the estimated variance of the quantity of interest.³
When \(|z|\) is large, we can reject the null hypothesis that \(h(\theta)=H_0\). The intuition is straightforward. First, the numerator of Equation 4.2 measures the distance between our estimate and the null hypothesis. When that distance is large, the observed data are far from what the null hypothesis predicts, which makes the null seem less plausible. Second, the denominator quantifies the uncertainty in our estimate. When that uncertainty is small, our estimate is precise, and thus more likely to allow us to distinguish it from the null value. In other words, when the numerator is large and/or the denominator is small, the \(z\) statistic will be large (in absolute value), and we can reject the null hypothesis \(H_0\).
When we estimated the model in Equation 4.1, we obtained these results:
                 Estimate  Std. Error  t value  Pr(>|t|)
agecat<18         0.67188     0.02564    26.20    <0.001
agecat18 to 35    0.67870     0.01233    55.06    <0.001
agecat>35         0.72774     0.01336    54.48    <0.001
By default, the summary functions in R and Python report null hypothesis tests against a very specific null hypothesis: that a coefficient is equal to zero. For example, in the results printed above, R reported the estimate and standard error for the first coefficient (\(\hat{\beta}_1\)), along with a test statistic⁴ associated with the null hypothesis that this coefficient is equal to zero (\(H_0: \beta_1=0\)):
\[
z = \frac{\hat{\beta}_1 - H_0}{\sqrt{\hat{V}[\hat{\beta}_1]}} = \frac{0.67188 - 0}{0.02564} = 26.20 \tag{4.3}
\]
Equation 4.3 shows how to compute the test statistic reported by our software package. But does the corresponding test make sense from a substantive perspective? Is it interesting? Do we really need a formal test to reject the null hypothesis that 0% of people below the age of 18 are willing to retrieve their HIV test result from the clinic? If the answer to any of those questions is “no”, we can easily construct alternative test statistics with the marginaleffects package.
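As a quick check, the test statistics in the regression table can be recomputed by hand from the reported estimates and standard errors. This is a minimal sketch: the inputs are the rounded values copied from the table, so the results should agree with the printed statistics only up to rounding, and the variable names are illustrative.

```r
# Estimates and standard errors copied (rounded) from the regression table
estimate <- c("agecat<18" = 0.67188, "agecat18 to 35" = 0.67870, "agecat>35" = 0.72774)
std_error <- c(0.02564, 0.01233, 0.01336)

# Wald statistic for H0: coefficient = 0
null <- 0
statistic <- (estimate - null) / std_error
round(statistic, 2)

# Two-sided p values based on the normal approximation
p_value <- 2 * pnorm(-abs(statistic))
p_value < 0.001
```

Changing `null` to another value is all it takes to test against a different benchmark.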
4.1.1 Choice of null hypothesis
In many cases, including ours, a null hypothesis of zero hardly makes sense. Instead, analysts may want to specify a different value of \(H_0\) to test against a more meaningful benchmark. For example, we could ask: Can we reject the null hypothesis that the probability of retrieving one’s HIV test result is the same as a coin flip, that is, equal to 0.5?
To answer this question, we use the hypotheses() function and its hypothesis argument:
The results show that all three \(z\) statistics are large (in absolute terms). Therefore, we can reject the null hypotheses that these coefficients are equal to 0.5. If the true chances of seeking information about HIV status were 50/50, we would be very unlikely to observe data like these.
This conclusion is consistent with Wald-style \(p\) values, which we compute by measuring the area under the tails of the test statistic’s distribution. In R, the pnorm(x) function measures the area under the standard normal density to the left of x. The two-tailed \(p\) value associated with the first coefficient can thus be computed as:
# First coefficient
b <- coef(mod)[1]

# The standard error is the square root of the diagonal element of the
# variance-covariance matrix
se <- sqrt(diag(vcov(mod)))[1]

# The z statistic for a Wald test with null hypothesis of b = 0.5
z <- (b - 0.5) / se

# The p value is the area under the curve, in the tails of
# the normal distribution beyond |z|
pnorm(-abs(z)) * 2
agecat<18
2.043492e-11
\(p\) is extremely small, which means that we can reject the null hypothesis of \(H_0: \beta_1=0.5\).
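The same calculation extends to all coefficients at once. The sketch below compares a manual computation to a call to hypotheses() with a numeric hypothesis argument. It relies on simulated data, not Thornton's: the data frame `dat`, the seed, and the subgroup probabilities are assumptions for illustration. The marginaleffects call is guarded so that the code still runs when the package is not installed; when it is, the two sets of statistics should match under the package's default normal approximation.

```r
# Simulated stand-in for the study data (NOT Thornton's data).
set.seed(2008)
n <- 3000
levels_age <- c("<18", "18 to 35", ">35")
agecat <- factor(
  sample(levels_age, n, replace = TRUE, prob = c(0.1, 0.5, 0.4)),
  levels = levels_age
)
dat <- data.frame(
  outcome = rbinom(n, 1, c(0.67, 0.68, 0.73)[as.integer(agecat)]),
  agecat = agecat
)
mod <- lm(outcome ~ agecat - 1, data = dat)

# Wald tests of H0: coefficient = 0.5, computed by hand for every coefficient
null <- 0.5
se <- sqrt(diag(vcov(mod)))
z <- (coef(mod) - null) / se
manual <- data.frame(
  estimate = coef(mod),
  std.error = se,
  statistic = z,
  p.value = 2 * pnorm(-abs(z))
)
print(manual)

# Same tests via marginaleffects, if the package is available
if (requireNamespace("marginaleffects", quietly = TRUE)) {
  print(marginaleffects::hypotheses(mod, hypothesis = 0.5))
}
```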
4.1.2 Linear and non-linear hypothesis tests
In many contexts, analysts are not solely interested in testing against a simple numeric null hypothesis like 0 or 0.5. Instead, they might be interested in comparing different estimated quantities. For instance, we may want to test if the coefficient associated with the first age category is equal to the coefficient associated with the third age category, \(H_0:\beta_1=\beta_3\).
To conduct this test, all we need to do is supply an equation-style string to the hypothesis argument. The terms of this equation start with b, followed by the position (or index) of the estimate. If we are interested in comparing the first and third coefficients, the equation must include b1 and b3:
This is equivalent to computing the difference between the third and first estimated coefficients:
\[
0.7277354 - 0.671875 = 0.0558604
\]
The \(p\) value for this test is 0.053, just above one conventional threshold of statistical significance: 0.05. At that threshold we cannot reject the null hypothesis, and researchers who are especially sensitive to Type I errors⁵ may select an even more stringent threshold and reach the same conclusion. In other words, we cannot reject the possibility that the probability of seeking one’s HIV result is the same in the <18 and >35 groups.
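The reported \(p\) value can be approximately recovered from the regression table alone. The sketch below rests on one property of this particular model: with mutually exclusive indicator variables and no intercept, the classical OLS covariance between coefficients is zero, so the variance of a difference is the sum of the two variances. Inputs are rounded values from the table, and the variable names are illustrative.

```r
# Estimates and standard errors for the <18 and >35 groups, from the table
b1 <- 0.67188; se1 <- 0.02564
b3 <- 0.72774; se3 <- 0.01336

# Zero covariance between the two coefficients: the variances simply add
estimate <- b3 - b1
std_error <- sqrt(se1^2 + se3^2)

# Wald test of H0: beta_3 - beta_1 = 0
z <- estimate / std_error
p_value <- 2 * pnorm(-abs(z))
round(c(estimate = estimate, std.error = std_error, z = z, p.value = p_value), 3)

# Decision at two conventional significance levels (TRUE means "reject")
p_value < c("alpha = 0.05" = 0.05, "alpha = 0.01" = 0.01)
```

In models where estimates are correlated, the covariance term cannot be dropped, and the full variance-covariance matrix returned by vcov() is needed.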
Instead of a difference, we could also conduct a test against the null hypothesis that the ratio of \(\beta_3\) to \(\beta_1\) is equal to 0:
The estimated ratio is \(\hat{\beta}_3 / \hat{\beta}_1 =1.08\). The \(z\) statistic is large, which gives us license to reject the null hypothesis that the ratio is equal to 0. Of course, this null hypothesis is not particularly meaningful in the ratio case.
A more relevant null hypothesis would be \(H_0: \beta_3 / \beta_1 = 1\). If the estimated ratio is sufficiently far from 1, we can reject the null hypothesis that the two coefficients are the same. We can test this by modifying the hypothesis argument slightly:
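A ratio is a non-linear function of the coefficients, so its standard error requires an approximation such as the delta method. The sketch below applies a first-order delta method by hand to the rounded estimates from the regression table, again assuming zero covariance between coefficients in this model. It is an illustration of the technique, not a reproduction of the package's output, and small numerical differences are to be expected.

```r
# Estimates and standard errors for the <18 and >35 groups, from the table
b1 <- 0.67188; se1 <- 0.02564
b3 <- 0.72774; se3 <- 0.01336

ratio <- b3 / b1

# Delta method: gradient of b3 / b1 with respect to (b1, b3)
gradient <- c(-b3 / b1^2, 1 / b1)

# Covariance matrix of (b1, b3); off-diagonal terms are zero in this model
V <- diag(c(se1^2, se3^2))
std_error <- sqrt(drop(t(gradient) %*% V %*% gradient))

# Wald statistics against two different null values
z_null0 <- (ratio - 0) / std_error
z_null1 <- (ratio - 1) / std_error
p_null1 <- 2 * pnorm(-abs(z_null1))
round(c(ratio = ratio, std.error = std_error,
        z_null0 = z_null0, z_null1 = z_null1, p_null1 = p_null1), 3)
```

The statistic against a null of 0 is very large simply because a ratio of two positive probabilities is far from zero; the statistic against a null of 1 is the one that speaks to whether the two groups differ.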
The equations supported by the hypothesis argument are not limited to simple tests of equality, differences, or ratios. Indeed, the user can write equations with more than two estimates, or with various (potentially non-linear) transformations. For example:
marginaleffects also offers a formula-based interface which acts as a shortcut to some of the more common hypothesis tests. For example, if we want to compute the difference between every coefficient and the “reference” quantity (i.e., the first estimate), we supply a formula with the word “reference” on the right side of the tilde symbol (~) and the word “difference” on the left side:
Now, let’s say we want to compare each coefficient to the one that immediately precedes it: the young adults to the minors, and the older adults to the young adults. Further suppose we want to compute ratios of coefficients, instead of differences. We can achieve this by setting ratio on the left-hand side, and sequential on the right-hand side of the formula.
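To make the two formula shortcuts concrete, the point estimates they target can be computed directly from the coefficient vector printed earlier. This base R sketch covers only the estimates; it does not compute the standard errors or test statistics that the package would report, and the object names are illustrative.

```r
# Coefficients copied from the output of coef() shown earlier
b <- c("<18" = 0.6718750, "18 to 35" = 0.6787004, ">35" = 0.7277354)

# difference ~ reference: each later estimate minus the first one
diff_reference <- b[-1] - b[1]
diff_reference

# ratio ~ sequential: each estimate divided by the one just before it
ratio_sequential <- b[-1] / b[-length(b)]
ratio_sequential
```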
4.1.3 Multiple comparisons and joint hypothesis tests
The goal of null hypothesis testing is to assess if observed data provide enough evidence to reject a null hypothesis. When conducting a single hypothesis test, the probability of Type I error—falsely rejecting the null hypothesis when it is true—is controlled at a predefined significance level, usually 5%. However, when multiple hypothesis tests are performed, the likelihood of at least one Type I error increases with the number of tests. This phenomenon is known as the multiple comparisons problem.
Statisticians have proposed many procedures to adjust hypothesis tests for multiple comparisons, including the Bonferroni, Holm, and Westfall corrections. The hypotheses() function in the marginaleffects package can apply many such strategies, and report corrected \(p\) values as well as family-wise confidence intervals. All we need to do is use the multcomp argument.
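The following base R sketch illustrates both the problem and two of the corrections. The first part assumes independent tests, which is a simplification. The second part applies p.adjust() from base R to the three pairwise differences between age groups, computed from the rounded values in the regression table under the zero-covariance property of this model. It does not use the multcomp argument and does not cover the Westfall procedure.

```r
# Probability of at least one false rejection across k independent tests
alpha <- 0.05
k <- c(1, 3, 10)
fwer <- 1 - (1 - alpha)^k
round(fwer, 3)

# Estimates and standard errors from the regression table
est <- c(0.67188, 0.67870, 0.72774)
se <- c(0.02564, 0.01233, 0.01336)

# All pairwise differences: b2 - b1, b3 - b1, b3 - b2
pairs <- rbind(c(2, 1), c(3, 1), c(3, 2))
diff <- est[pairs[, 1]] - est[pairs[, 2]]
diff_se <- sqrt(se[pairs[, 1]]^2 + se[pairs[, 2]]^2)
p_raw <- 2 * pnorm(-abs(diff / diff_se))

# Unadjusted and adjusted p values, side by side
data.frame(
  comparison = c("b2 - b1", "b3 - b1", "b3 - b2"),
  p_raw = round(p_raw, 3),
  p_bonferroni = round(p.adjust(p_raw, method = "bonferroni"), 3),
  p_holm = round(p.adjust(p_raw, method = "holm"), 3)
)
```

Adjusted \(p\) values are never smaller than the raw ones, and the Holm correction is never more conservative than Bonferroni.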
The hypotheses() function also supports joint hypothesis tests, via the joint and joint_test arguments. For example, we could test if several parameters are jointly/simultaneously equal to zero. The marginaleffects.com website includes documentation and examples on how to conduct such tests.
4.2 Equivalence
In many contexts, analysts are less interested in rejecting a null hypothesis, and more interested in testing whether an estimate is “inferior”, “superior”, or “equivalent” to a given threshold or interval. For example, medical researchers may wish to determine if the estimated effect of a new treatment is similar to the effect of prior treatments, or if it can be considered “negligible” in terms of “clinical significance.” To answer such questions, we can use non-inferiority, non-superiority, or equivalence tests like the two one-sided tests procedure, or TOST (Wellek 2010; Rainey 2014; Lakens, Scheel, and Isager 2018).
The TOST equivalence test is a statistical method used to determine if an estimate is “practically equivalent” to a null hypothesis within a specified margin of equivalence. Unlike traditional null hypothesis significance testing, which aims to detect a significant difference between groups, the TOST procedure tests two complementary hypotheses: that the true difference between treatments is either greater than a positive equivalence margin or less than a negative equivalence margin. If both one-sided tests reject these null hypotheses, it provides evidence that the true difference falls within the predefined equivalence bounds, thereby concluding practical equivalence.
To conduct a TOST, one proceeds in five steps:
1. Quantity of interest: Estimate a quantity of interest \(\theta\), which can be a coefficient, function of coefficients, prediction, counterfactual comparison, slope, etc.
2. Interval: Use subject matter knowledge to define an interval of equivalence \([a,b]\). If the quantity of interest \(\theta\) falls between \(a\) and \(b\), it is considered clinically or practically irrelevant.
3. Non-inferiority: Compute the \(p\) value associated with a one-tailed null hypothesis test to determine if we can reject the null hypothesis that \(\theta < a\).
4. Non-superiority: Compute the \(p\) value associated with a one-tailed null hypothesis test to determine if we can reject the null hypothesis that \(\theta > b\).
5. Equivalence: Check if the maximum of the non-inferiority and non-superiority \(p\) values is lower than the chosen level of statistical significance (e.g., \(\alpha=0.05\)).
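In Wald form, the non-inferiority, non-superiority, and equivalence steps can be sketched as follows. This notation is introduced here for illustration: \(\Phi\) denotes the standard normal cumulative distribution function, \(z_a\) and \(z_b\) are the two one-sided test statistics, and \(p_a\), \(p_b\), and \(p_{\text{TOST}}\) are the corresponding \(p\) values.

\[
\begin{aligned}
\text{Non-inferiority:} \quad & H_0: \theta < a, & z_a &= \frac{\hat{\theta} - a}{\sqrt{\hat{V}[\hat{\theta}]}}, & p_a &= 1 - \Phi(z_a) \\
\text{Non-superiority:} \quad & H_0: \theta > b, & z_b &= \frac{\hat{\theta} - b}{\sqrt{\hat{V}[\hat{\theta}]}}, & p_b &= \Phi(z_b) \\
\text{Equivalence:} \quad & & & & p_{\text{TOST}} &= \max(p_a, p_b)
\end{aligned}
\]

An estimate well above \(a\) yields a large positive \(z_a\) and a small \(p_a\); an estimate well below \(b\) yields a large negative \(z_b\) and a small \(p_b\). Equivalence is established only when both \(p\) values fall below \(\alpha\).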
To illustrate, let’s revisit the model we fitted above and compare the probability that people in the 18 to 35 and >35 age brackets will travel to learn their HIV status:
The results above show that the estimated difference in coefficients for the two groups is equal to 0.0490, and that this difference is statistically significant (i.e., likely different from zero). This difference may be “statistically significant”, but is it “meaningful,” “clinically relevant,” or “practically important”?
The first step to answer this question is to define exactly what we mean by “meaningful” or “important”. Specifically, the researcher must define an “interval of equivalence,” in which estimates are considered unimportant. There is no purely statistical criterion to construct this interval; the decision depends entirely on domain expertise and subject matter knowledge.
In our running example, the researcher could decide that if the difference in \(Pr(\text{outcome}=1)\) between the young and older adults is between -5 and 5 percentage points, we can ignore it. If the difference falls in the \([-0.05,0.05]\) interval, it is “clinically irrelevant” or “equivalent to zero.”
To conduct a TOST on this equivalence range, we simply add the equivalence argument to the previous call:
These results allow us to reach three main conclusions:
Non-inferiority: The \(p\) value associated with this test is very small (\(p<0.001\)). We can reject the null hypothesis that the difference between coefficients is lower than \(-0.05\).
Non-superiority: The \(p\) value associated with this test is large (0.479). We cannot reject the null hypothesis that the difference between coefficients is larger than \(0.05\).
Equivalence: The \(p\) value associated with the TOST of equivalence corresponds to the maximum of the non-inferiority and non-superiority \(p\) values: 0.479. Because this value is large, we cannot reject the null hypothesis that the difference between the two coefficients lies outside the interval of equivalence. In other words, we cannot conclude that the two coefficients are practically equivalent to each other.
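These three \(p\) values can be approximately recovered by hand. The sketch below uses the rounded estimates and standard errors from the regression table, the zero-covariance property of this particular model, and the \([-0.05, 0.05]\) interval chosen above. It is a manual illustration of the TOST logic rather than a reproduction of the package's internals, and the variable names are illustrative.

```r
# Estimates and standard errors for the 18 to 35 and >35 groups, from the table
b2 <- 0.67870; se2 <- 0.01233
b3 <- 0.72774; se3 <- 0.01336

# Difference between coefficients and its standard error (zero covariance)
estimate <- b3 - b2
std_error <- sqrt(se2^2 + se3^2)

# Interval of equivalence: differences inside it are treated as negligible
interval <- c(-0.05, 0.05)

# Standard two-sided test of H0: difference = 0
p_null <- 2 * pnorm(-abs(estimate / std_error))

# Non-inferiority: one-sided test of H0: difference < lower bound
p_noninf <- pnorm((estimate - interval[1]) / std_error, lower.tail = FALSE)

# Non-superiority: one-sided test of H0: difference > upper bound
p_nonsup <- pnorm((estimate - interval[2]) / std_error)

# TOST: equivalence requires that BOTH one-sided tests reject
p_equiv <- max(p_noninf, p_nonsup)

round(c(estimate = estimate, p.null = p_null, p.noninf = p_noninf,
        p.nonsup = p_nonsup, p.equiv = p_equiv), 3)
```

The estimated difference sits almost exactly on the upper bound of the interval, which is why the non-superiority \(p\) value is close to 0.5.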
This procedure is useful when we want to determine if an estimate is practically equivalent to a benchmark value. Importantly, equivalence tests can be seen as a special case of “interval tests.” Indeed, the interval of equivalence need not be centered around zero, and we can conduct a TOST for any interval of interest by setting different bounds in the equivalence argument of a marginaleffects function.
4.3 Summary
This chapter introduced two classes of statistical testing procedures: null hypothesis and equivalence tests.
A null hypothesis test allows us to determine if there is enough evidence to reject the hypothesis that a parameter (or function of parameters) is equal to a given value.
Examples of statements that could be rejected by a null hypothesis test include:
The predicted wages of college and high school graduates are equal.
The effect of a new drug on a health outcome is zero.
A marketing campaign has the same effect on sales in rural and urban areas.
When a null hypothesis test indicates that we can reject statements like these (small \(p\) value), we establish a difference.
An equivalence test allows us to determine if there is enough evidence to reject the hypothesis that a parameter (or function of parameters) is meaningfully different from a benchmark value.
Examples of statements that could be rejected by an equivalence test include:
The difference in wages between college and high school graduates is considerable.
The effect of a new drug on a health outcome is meaningfully different from the effect of an existing treatment.
The effect of a marketing campaign on consumption is much larger than zero.
When an equivalence test indicates that we can reject statements like these (small \(p\) value), we establish a similarity.
The main marginaleffects functions include both a hypothesis and an equivalence argument. This makes it easy to conduct tests on any of the quantities estimated by the package—predictions, counterfactual comparisons, and slopes—as well as on arbitrary functions of those quantities.
In the next chapters, we will show how null hypothesis and equivalence tests can be applied beyond simple coefficient estimates, to quantities like predictions, counterfactual comparisons, and slopes.
Aronow, Peter M., and Benjamin T. Miller. 2019. Foundations of Agnostic Statistics. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316831762.
Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge University Press.
Lakens, Daniël, Anne M. Scheel, and Peder M. Isager. 2018. “Equivalence Testing for Psychological Research: A Tutorial.” Advances in Methods and Practices in Psychological Science 1 (2): 259–69. https://doi.org/10.1177/2515245918770963.
Rainey, Carlisle. 2014. “Arguing for a Negligible Effect.” American Journal of Political Science 58 (4): 1083–91.
Thornton, Rebecca L. 2008. “The Demand for, and Impact of, Learning HIV Status.” American Economic Review 98 (5): 1829–63.
Wasserman, Larry. 2004. All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. New York, NY: Springer. https://doi.org/10.1007/978-0-387-21736-9.
¹ Wald-style null hypothesis tests are described in most statistical textbooks. Readers who want to learn more about equivalence testing can refer to the book-length treatment by Wellek (2010), or to articles by Rainey (2014) and Lakens, Scheel, and Isager (2018).
³ As described in Chapter 13, the default strategy for null hypothesis tests in marginaleffects is to compute standard errors using the delta method. That chapter also explains how to use bootstrap or simulations instead.
⁴ By default, R reports \(t\), which is equivalent to \(z\) in large samples.
⁵ A Type I error, or false positive, occurs when we reject the null hypothesis when it is actually true.