library(marginaleffects)
dat <- arrow::read_parquet("https://marginaleffects.com/data/thornton.parquet")
mod <- lm(outcome ~ incentive, data = dat)
coef(mod)
(Intercept) incentive
0.3397746 0.4510603
Experimental analysis is a common use case for the marginaleffects
package, which provides researchers with a powerful toolkit for estimating treatment effects, interpreting interactions, and visualizing results across a variety of experimental designs.
This chapter illustrates the application of marginaleffects
through three widely used experimental setups: the average treatment effect with a single randomized treatment, a 2-by-2 factorial experiment, and a forced choice conjoint experiment. These examples will demonstrate the flexibility and ease of using marginaleffects
to analyze experimental data.
In this section, we explain regression adjustment, a method that helps improve the precision of treatment effect estimates in experiments. We will use the same HIV experiment data from Thornton (2008) as in previous chapters. In this experiment, the outcome
is a binary variable that shows whether individuals travelled to the clinic to learn their HIV status. The treatment is a financial incentive
, randomly given to participants, with the goal of encouraging testing.
Since the treatment was randomized, we can estimate the average treatment effect (ATE) using a linear regression on the binary outcome. This is also called a linear probability model:
library(marginaleffects)
dat <- arrow::read_parquet("https://marginaleffects.com/data/thornton.parquet")
mod <- lm(outcome ~ incentive, data = dat)
coef(mod)
(Intercept) incentive
0.3397746 0.4510603
The coefficient on incentive
tells us the difference in the probability of seeing one’s test result between those who received the incentive and those who did not. This is the same result you would get from the G-computation estimate using the avg_comparisons()
function. The benefit of this function is that it allows us to easily report heteroskedasticity-consistent (or other robust) standard errors:
avg_comparisons(mod,
variables = "incentive",
vcov = "HC2")
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0.451 | 0.0209 | 21.6 | <0.001 | 0.41 | 0.492 |
When the treatment is randomized, it is not strictly necessary to include control variables in the model to eliminate bias. However, including covariates can sometimes improve the precision of our estimates by reducing the unexplained variance in the outcome (TODO: cite).
As Freedman (2008) points out, however, naively adding covariates to the regression model introduce small-sample bias and degrade asymptotic precision. To avoid this problem, Lin (2013) recommends a simple solution: interacting all covariates with the treatment indicator, and reporting heteroskedasticity-robust standard errors.
mod <- lm(outcome ~ incentive * (age + distance + hiv2004), data = dat)
Intepreting the results of this model is not as easy as in the previous example, since there are multiple coefficients, with several multiplicative interactions.
coef(mod)
(Intercept) incentive age distance
0.309537645 0.496482649 0.003495124 -0.042323583
hiv2004 incentive:age incentive:distance incentive:hiv2004
0.013562711 -0.002177778 0.014571883 -0.070766068
Thankfully, we can simply call avg_comparisons()
again to get the regression adjusted estimate, along with robust standard errors.
avg_comparisons(mod,
variables = "incentive",
vcov = "HC2")
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
0.449 | 0.0209 | 21.5 | <0.001 | 0.408 | 0.49 |
This result suggests that receiving an incentive
increases by 45 percentage points the probability of seeking one’s test result.
A factorial experiment is a type of study design that allows researchers to assess the effects of two or more randomized treatments simultaneously, with each treatment having multiple levels or conditions. A common example is the 2-by-2 design, where two variables with two levels are randomized simultaneously and independently. This setup enables the evaluator to quantify the effects of each treatment, but also to examine if the treatments interact with one another.
In medicine, factorial experiments are widely used to explore complex phenomena like drug interactions, where researchers test different combinations of treatments to see how they affect patient outcomes. In plant phyisiology, they could be used to see how combinations of temperature and humidity affect photosynthesis. In a business context, a company could use a factorial design to test if different advertising strategies and price points drive sales.
To illustrate, let’s consider a simple data set with 32 observations, a numeric outcome \(Y\), and two treatments with two levels: \(T_a=\{0,1\}\) and \(T_b=\{0,1\}\). We fit a linear regression model with each treatment entered individually, along with their multiplicative interaction (see Chapter 8 for more on interactions).
library(marginaleffects)
dat <- arrow::read_parquet("https://marginaleffects.com/data/factorial.parquet")
mod <- lm(Y ~ Ta + Tb + Ta:Tb, data = dat)
coef(mod)
(Intercept) Ta Tb Ta:Tb
15.050000 5.692857 4.700000 2.928571
We can plot the model’s predictions for each combination of \(T_a\) and \(T_b\) using the plot_predictions
function.
plot_predictions(mod, by = c("Ta", "Tb"))
Now, imagine we want to estimate the effect of changing \(T_a\) from 0 to 1 on \(Y\), while holding \(T_b=0\). We define the two counterfactual values of \(Y\) and take their difference:
\[\begin{align*} \hat{Y} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot T_a + \hat{\beta}_3 \cdot T_b + \hat{\beta}_4 \cdot T_a \cdot T_b\\ \hat{Y}_{T_a=0,T_b=0} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot 0 + \hat{\beta}_3 \cdot 0 + \hat{\beta}_4 \cdot 0 \cdot 0\\ \hat{Y}_{T_a=1,T_b=0} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot 1 + \hat{\beta}_3 \cdot 0 + \hat{\beta}_4 \cdot 1 \cdot 0\\ \hat{Y}_{T_a=1,T_b=0} - \hat{Y}_{T_a=0,T_b=0} &= \hat{\beta}_2 = 5.693 \end{align*}\]
We can use a similar approach to estimate a cross-contrast, that is, the effect of changing both \(T_a\) and \(T_b\) from 0 to 1 simultaneously.
\[\begin{align*} \hat{Y}_{T_a=0,T_b=0} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot 0 + \hat{\beta}_3 \cdot 0 + \hat{\beta}_4 \cdot 0 \cdot 0\\ \hat{Y}_{T_a=1,T_b=1} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot 1 + \hat{\beta}_3 \cdot 1 + \hat{\beta}_4 \cdot 1 \cdot 1\\ \hat{Y}_{T_a=1,T_b=1} - \hat{Y}_{T_0=1,T_b=0} &= \hat{\beta}_2 + \hat{\beta}_3 + \hat{\beta}_4 = 13.321 \end{align*}\]
In a simple 2-by-2 experiment, this arithmetic is fairly simple. However, things can quickly get complicated when more treatment conditions are introduced, or when regression models include non-linear components. Moreover, even if we can add coefficient estimates to obtain the quantity of interest, computing standard errors (classical or robust) is not straightforward. Thus, it is often convenient to rely on software like marginaleffects
to do the hard work for us.
To estimate the effect of \(T_a\) on \(Y\), holding \(T_b=0\) constant, we use the avg_comparisons()
function. The variables
argument specifies the treatment of interest, and the newdata
argument specifies the levels of the other treatment(s) at which we want to evaluate the contrast. In this case, we use newdata=subset(Tb == 0)
to compute the quantity of interest only in the subset of observations where \(T_b=0\).
avg_comparisons(mod,
variables = "Ta",
newdata = subset(Tb == 0))
Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|
5.69 | 1.65 | 3.45 | <0.001 | 2.46 | 8.93 |
To estimate a cross-contrast, we specify each treatment of interest in the variables
argument, and set cross=TRUE
. This gives us an estimate of the effect of simultaneously moving \(T_a\) and \(T_b\) from 0 to 1.
avg_comparisons(mod,
variables = c("Ta", "Tb"),
cross = TRUE)
C: Ta | C: Tb | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
1 - 0 | 1 - 0 | 13.3 | 1.65 | 8.07 | <0.001 | 10.1 | 16.6 |
Figure 10.1 shows six contrasts of interest in a 2-by-2 experimental design. The vertical axis shows the value of the outcome variable in each treatment group The horizontal axis shows the value of the first treatment \(T_a\). The shape of points—0 or 1—represeents the value of the \(T_b\) treatment. For example, the “1” symbol in the top left shows the average value of \(Y\) in the \(T_a=1,T_b=1\) group.
In each panel, the arrow links two treatment groups that we want to compare. For example, in the top-left panel we compare \(\hat{Y}_{T_a=0,T_b=0}\) to \(\hat{Y}_{T_a=1,T_b=0}\). In the bottom-right panel, we compare \(\hat{Y}_{T_a=0,T_b=1}\) to \(\hat{Y}_{T_a=1,T_b=0}\). The avg_comparisons()
call at the top of each panel shows how to use marginaleffects
to compute the difference in \(Y\) that results from the illustrated change in treatment conditions.
One reason factorial experiments are so popular is that they allow researchers to assess the possibility that treatments interact with one another. Indeed, in somes cases, the effect of a treatment \(T_a\) will be stronger (or weaker) at different values of treatment \(T_b\). To test this possibility, we use the strategies described in Chapter 8 on interactions.
First, we estimate the effect of a change from 0 to 1 in \(T_a\), in the subsets of observations with different values of \(T_b\).
avg_comparisons(mod, variables = "Ta", by = "Tb")
Term | Tb | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
Ta | 0 | 5.69 | 1.65 | 3.45 | <0.001 | 2.46 | 8.93 |
Ta | 1 | 8.62 | 1.93 | 4.46 | <0.001 | 4.84 | 12.41 |
The estimated effect of \(T_a\) on \(Y\) is equal to 5.69 when \(T_b=0\), but equal to 8.62 when \(T_b=1\). Is the difference between those estimates statistically significant? Does \(T_b\) moderate the effect the effect of \(T_a\)? Is there treatment heterogeneity or interaction between \(T_a\) and \(T_b\)? To answer these questions, we use the hypothesis
argument and check if the difference between subgroup estimates is significant:
avg_comparisons(mod, variables = "Ta", by = "Tb", hypothesis = "b2 - b1 = 0")
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
b2-b1=0 | 2.93 | 2.54 | 1.15 | 0.249 | -2.05 | 7.91 |
While there is a difference between our two estimates, the \(z\) statistic and \(p\) value indicate that this difference is not statistically significant. Therefore, we cannot reject the null hypothesis that the effect of \(T_a\) is the same, regardless of the value of \(T_b\).
A forced-choice conjoint experiment is a research methodology used in fields such as marketing and political science to understand how people make decisions between “profiles” characterized by multiple “attributes.” Survey respondents are presented with a series of choices between different products, services, or scenarios. Each option is described by a set of attributes (e.g., price, quality, brand, features), and the levels of these attributes are varied randomly across the options presented.
Consider an experiment where researchers ask survey respondents “to act as immigration officials and to decide which of a pair of immigrants they would choose for admission” (Hainmueller, Hopkins, and Yamamoto 2014). The researchers display a table with two columns that represent distinct immigrant profiles with randomized attributes. For example:
Attributes | Profile 1 | Profile 2 |
---|---|---|
Language Skills | Fluent in English | Broken English |
Job | Construction worker | Nurse |
The survey respondent has the “task” of choosing one of the two profiles. Then, the researchers display a new task, including profiles with different randomized attributes, and the respondent chooses again. By analyzing the choices made by participants, researchers can estimate the relative importance of different attributes in the decision-making process.
To illustrate, we use data published alongside the Political Analysis article by Hainmueller, Hopkins, and Yamamoto (2014). In this experiment, each survey respondent \(i\) receives several tasks \(k\), in which they select one of two profiles \(j\), characterized by attributes \(l\). For simplicity, we consider a subset of the data with 5 tasks per respondent, 2 profiles per task, and 2 attributes per profile.
The data is structured in “long” format, with one respondent-task-profile combination per row. The dependent variable is choice
, a binary variable which indicates if the profile in a given row was selected during a given task. The predictors are language
skills and job
type.
Since there are 5 tasks per respondent, and two profiles per task, the dataset includes 10 rows per respondent.
dat <- arrow::read_parquet("https://marginaleffects.com/data/immigration.parquet")
dat[dat$respondent == 4, ]
# A tibble: 10 × 6
respondent task profile choice job language
<int> <int> <dbl> <dbl> <fct> <fct>
1 4 1 1 1 nurse tried but unable
2 4 1 2 0 child care provider used interpreter
3 4 2 1 0 gardener fluent
4 4 2 2 1 construction worker fluent
5 4 3 1 1 nurse broken
6 4 3 2 0 child care provider fluent
7 4 4 1 0 teacher used interpreter
8 4 4 2 1 construction worker fluent
9 4 5 1 1 teacher used interpreter
10 4 5 2 0 nurse used interpreter
To analyze this dataset, we estimate a linear regression model with choice
as the outcome, and in which all predictors are interacted:
mod <- lm(choice ~ job * language, data = dat)
A common strategy to interpret the results of a conjoint experiment is to compute marginal means. As described by Leeper, Hobolt, and Tilley (2020), a “marginal mean describes the level of favorability toward profiles that have a particular feature level, ignoring all other features.”
To compute marginal means, we proceed in two steps:
This is easy to do with the avg_predictions()
function. Note that we use the vcov
argument to report clustered standard errors at the survey respondent-level.
avg_predictions(mod,
newdata = "balanced",
by = "language",
vcov = ~respondent)
language | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
fluent | 0.618 | 0.00831 | 74.4 | <0.001 | 0.602 | 0.635 |
broken | 0.550 | 0.00894 | 61.5 | <0.001 | 0.532 | 0.568 |
tried but unable | 0.490 | 0.00879 | 55.7 | <0.001 | 0.473 | 0.507 |
used interpreter | 0.453 | 0.00927 | 48.8 | <0.001 | 0.434 | 0.471 |
These results suggest that, ignoring (or averaging over) the job
attribute, the “fluent” English speakers are chosen more often than profiles with other language
values.
To see if the average probability of selection is higher when a candidate is fluent in English, relative to when they require an interpreter, we use the hypothesis
argument.
avg_predictions(mod,
hypothesis = "b1 = b4",
newdata = "balanced",
by = "language",
vcov = ~respondent)
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
b1=b4 | 0.166 | 0.0138 | 12 | <0.001 | 0.139 | 0.193 |
This shows that the difference between estimates in those two categories is relatively large (0.2 percentage points), and that this difference is statistically significant.
Average Marginal Component Effects (AMCE) are defined and analyzed in Hainmueller, Hopkins, and Yamamoto (2014). Roughly speaking, they can be viewed as the average effects of changing one attribute on choice, while holding all other attributes of the profile constant. To compute an AMCE, we use the avg_comparisons()
function that we already explored in Chapters 6 and ?sec-gcomputation.
avg_comparisons(mod, vcov = ~respondent, newdata = "balanced")
Term | Contrast | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
job | child care provider - janitor | 0.01044 | 0.0171 | 0.612 | 0.5404 | -0.02300 | 0.0439 |
job | computer programmer - janitor | 0.13576 | 0.0250 | 5.429 | <0.001 | 0.08675 | 0.1848 |
job | construction worker - janitor | 0.03841 | 0.0175 | 2.195 | 0.0282 | 0.00411 | 0.0727 |
job | doctor - janitor | 0.21502 | 0.0243 | 8.850 | <0.001 | 0.16740 | 0.2626 |
job | financial analyst - janitor | 0.11647 | 0.0258 | 4.518 | <0.001 | 0.06594 | 0.1670 |
job | gardener - janitor | 0.01266 | 0.0174 | 0.729 | 0.4660 | -0.02138 | 0.0467 |
job | nurse - janitor | 0.08283 | 0.0169 | 4.894 | <0.001 | 0.04966 | 0.1160 |
job | research scientist - janitor | 0.19308 | 0.0241 | 7.999 | <0.001 | 0.14577 | 0.2404 |
job | teacher - janitor | 0.06742 | 0.0176 | 3.834 | <0.001 | 0.03296 | 0.1019 |
job | waiter - janitor | -0.00511 | 0.0176 | -0.291 | 0.7714 | -0.03959 | 0.0294 |
language | broken - fluent | -0.06835 | 0.0136 | -5.035 | <0.001 | -0.09495 | -0.0417 |
language | tried but unable - fluent | -0.12844 | 0.0134 | -9.559 | <0.001 | -0.15478 | -0.1021 |
language | used interpreter - fluent | -0.16584 | 0.0138 | -11.995 | <0.001 | -0.19294 | -0.1387 |
Abramson et al. (2024) introduce an alternative estimand for forced-choice conjoint experiments: the Average Feature Choice Probability (AFCP). The main difference between AMCE and AFCP lies in their approach to handling attribute comparisons.
AMCE incorporates comparisons and averages over both direct and indirect attribute comparisons. For example, the estimated effect of “fluent” vs. “broken English” on choice depends not only on these two specific characteristics, but also on how they compare to “used interpreter” or “tried but unable”. Thusly, Abramson et al. (2024) argue that the AMCE considers information about irrelevant attributes, and imposes a strong transitivity assumption. In some cases, AMCE can suggest positive effects even when, in direct comparisons, respondents are on average less likely to choose a profile with the feature of interest over the baseline. In contrast, AFCP focuses solely on direct comparisons between attributes, offering a more accurate representation of respondents’ direct preferences without the influence of irrelevant attributes.
To estimate AFCP, we first need to create new columns: language.alt
and job.alt
. These columns record the attributes of the alternative against which each profile was paired, in every task. We use by()
to split the data frame into subgroups for each combination of respondent
and task
. Each subgroup has two rows, because there are two profiles. Then, we use rev()
to create new variables with the opposite attributes.
In respondent 4’s first task, the first profile was “nurse” and the alternative profile was “child care provider”. Thus, “nurse” appears in the first row as job
and in the second row as job.alt
.
subset(dat, respondent == 4 & task == 1)
respondent task profile choice job language
1 4 1 1 1 nurse tried but unable
2 4 1 2 0 child care provider used interpreter
language.alt job.alt
1 used interpreter child care provider
2 tried but unable nurse
To estimate the AFCP, we once again fit a linear regression model. This time, the model is even more flexible. Specifically, the model allows the effect of language skills in the first profile to depend on the value of language skills in the second profile. Likewise, other attributes can influence the probability of selection differently based on attributes in the comparison profile. To achieve this, we interact language
with language.alt
, and job
with job.alt
.
mod <- lm(choice ~ language * language.alt + job * job.alt, data = dat)
As noted above, and detailed in Abramson et al. (2024), the AFCP is a choice pair-specific quantity. This means that we need to average predictions (fitted values) across covariates, within each unique pair of attributes of interest. We thus Thus,
Now we compute the AFCP by averaging fitted values by unique pairs of attributes for language (the combination of language
and alternative
). Since we are not interested comparison pairs where both profiles have the same language skills, we use the subset()
function with the newdata
argument to select an appropriate grid.
avg_predictions(mod,
by = c("language", "language.alt"),
newdata = subset(language.alt == "fluent"),
vcov = ~respondent)
language | language.alt | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
fluent | fluent | 0.500 | 3.09e-07 | 1.62e+06 | <0.001 | 0.500 | 0.500 |
broken | fluent | 0.407 | 1.65e-02 | 2.46e+01 | <0.001 | 0.374 | 0.439 |
tried but unable | fluent | 0.376 | 1.66e-02 | 2.27e+01 | <0.001 | 0.344 | 0.409 |
used interpreter | fluent | 0.361 | 1.57e-02 | 2.30e+01 | <0.001 | 0.330 | 0.392 |
A powerful feature of marginaleffects
is that all its functions include a hypothesis
argument which can be used to conduct hypothesis tests on arbitrary functions of estimates. This allows us to answer questions such as: Is the AFCP for “used intepreter vs. fluent” different from the AFCP for “broken vs. fluent”?
avg_predictions(mod,
by = c("language", "language.alt"),
hypothesis = "b4 - b2 = 0",
newdata = subset(language.alt == "fluent"),
vcov = ~respondent)
Term | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
b4-b2=0 | -0.0457 | 0.0226 | -2.02 | 0.0434 | -0.09 | -0.00135 |
These findings suggest that the difference in selection probability between “interpreter” and “fluent” is smaller than the difference in selection probability between “broken” and “fluent”: estimated gap of -0.046, with a \(z\) statistic of -2.02.