10  Experiments

Experimental analysis is a common use case for the marginaleffects package, which provides researchers with a powerful toolkit for estimating treatment effects, interpreting interactions, and visualizing results across a variety of experimental designs.

This chapter illustrates the application of marginaleffects through three widely used experimental setups: the average treatment effect with a single randomized treatment, a 2-by-2 factorial experiment, and a forced choice conjoint experiment. These examples will demonstrate the flexibility and ease of using marginaleffects to analyze experimental data.

10.1 Regression adjustment

In this section, we explain regression adjustment, a method that can improve the precision of treatment effect estimates in experiments. We will use the same HIV experiment data from Thornton (2008) as in previous chapters. In this experiment, the outcome is a binary variable indicating whether individuals travelled to the clinic to learn their HIV status. The treatment is a financial incentive, randomly assigned to participants, designed to encourage them to seek their test results.

Since the treatment was randomized, we can estimate the average treatment effect (ATE) using a linear regression on the binary outcome. This is also called a linear probability model:

library(marginaleffects)
dat <- arrow::read_parquet("https://marginaleffects.com/data/thornton.parquet")

mod <- lm(outcome ~ incentive, data = dat)
coef(mod)
(Intercept)   incentive 
  0.3397746   0.4510603 

The coefficient on incentive tells us the difference in the probability of seeing one’s test result between those who received the incentive and those who did not. This is the same estimate we would obtain via G-computation, using the avg_comparisons() function. A benefit of this function is that it allows us to easily report heteroskedasticity-consistent (or other robust) standard errors:

avg_comparisons(mod, 
    variables = "incentive",
    vcov = "HC2")
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.451 0.0209 21.6 <0.001 0.41 0.492

When the treatment is randomized, it is not strictly necessary to include control variables in the model to eliminate bias. However, including covariates can sometimes improve the precision of our estimates by reducing the unexplained variance in the outcome.

As Freedman (2008) points out, however, naively adding covariates to the regression model can introduce small-sample bias and degrade asymptotic precision. To avoid this problem, Lin (2013) recommends a simple solution: interacting all covariates with the treatment indicator, and reporting heteroskedasticity-robust standard errors.

mod <- lm(outcome ~ incentive * (age + distance + hiv2004), data = dat)

Interpreting the results of this model is not as easy as in the previous example, since there are multiple coefficients and several multiplicative interactions.

coef(mod)
       (Intercept)          incentive                age           distance 
       0.309537645        0.496482649        0.003495124       -0.042323583 
           hiv2004      incentive:age incentive:distance  incentive:hiv2004 
       0.013562711       -0.002177778        0.014571883       -0.070766068 

Thankfully, we can simply call avg_comparisons() again to obtain the regression-adjusted estimate, along with robust standard errors.

avg_comparisons(mod, 
    variables = "incentive",
    vcov = "HC2")
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
0.449 0.0209 21.5 <0.001 0.408 0.49

This result suggests that receiving an incentive increases the probability of seeking one’s test result by about 45 percentage points.

10.2 Factorial experiments

A factorial experiment is a type of study design that allows researchers to assess the effects of two or more randomized treatments simultaneously, with each treatment having multiple levels or conditions. A common example is the 2-by-2 design, where two variables with two levels each are randomized simultaneously and independently. This setup enables the researcher not only to quantify the effect of each treatment, but also to examine whether the treatments interact with one another.

In medicine, factorial experiments are widely used to explore complex phenomena like drug interactions, where researchers test different combinations of treatments to see how they affect patient outcomes. In plant physiology, they could be used to see how combinations of temperature and humidity affect photosynthesis. In a business context, a company could use a factorial design to test how different advertising strategies and price points affect sales.

To illustrate, let’s consider a simple data set with 32 observations, a numeric outcome \(Y\), and two treatments with two levels: \(T_a=\{0,1\}\) and \(T_b=\{0,1\}\). We fit a linear regression model with each treatment entered individually, along with their multiplicative interaction (see Chapter 8 for more on interactions).

library(marginaleffects)
dat <- arrow::read_parquet("https://marginaleffects.com/data/factorial.parquet")
mod <- lm(Y ~ Ta + Tb + Ta:Tb, data = dat)
coef(mod)
(Intercept)          Ta          Tb       Ta:Tb 
  15.050000    5.692857    4.700000    2.928571 

We can plot the model’s predictions for each combination of \(T_a\) and \(T_b\) using the plot_predictions() function.

plot_predictions(mod, by = c("Ta", "Tb"))

Now, imagine we want to estimate the effect of changing \(T_a\) from 0 to 1 on \(Y\), while holding \(T_b=0\). We define the two counterfactual values of \(Y\) and take their difference:

\[\begin{align*} \hat{Y} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot T_a + \hat{\beta}_3 \cdot T_b + \hat{\beta}_4 \cdot T_a \cdot T_b\\ \hat{Y}_{T_a=0,T_b=0} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot 0 + \hat{\beta}_3 \cdot 0 + \hat{\beta}_4 \cdot 0 \cdot 0\\ \hat{Y}_{T_a=1,T_b=0} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot 1 + \hat{\beta}_3 \cdot 0 + \hat{\beta}_4 \cdot 1 \cdot 0\\ \hat{Y}_{T_a=1,T_b=0} - \hat{Y}_{T_a=0,T_b=0} &= \hat{\beta}_2 = 5.693 \end{align*}\]
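
This arithmetic can be checked directly with base R. The sketch below reports point estimates only and assumes that Ta and Tb are stored as numeric 0/1 indicators, which is consistent with the coefficient names printed above.

# Predict Y at the two counterfactual settings and take the difference.
# The result should match the coefficient on Ta (about 5.693).
p0 <- predict(mod, newdata = data.frame(Ta = 0, Tb = 0))
p1 <- predict(mod, newdata = data.frame(Ta = 1, Tb = 0))
p1 - p0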

We can use a similar approach to estimate a cross-contrast, that is, the effect of changing both \(T_a\) and \(T_b\) from 0 to 1 simultaneously.

\[\begin{align*} \hat{Y}_{T_a=0,T_b=0} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot 0 + \hat{\beta}_3 \cdot 0 + \hat{\beta}_4 \cdot 0 \cdot 0\\ \hat{Y}_{T_a=1,T_b=1} &= \hat{\beta}_1 + \hat{\beta}_2 \cdot 1 + \hat{\beta}_3 \cdot 1 + \hat{\beta}_4 \cdot 1 \cdot 1\\ \hat{Y}_{T_a=1,T_b=1} - \hat{Y}_{T_a=0,T_b=0} &= \hat{\beta}_2 + \hat{\beta}_3 + \hat{\beta}_4 = 13.321 \end{align*}\]
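
The same quantity can be recovered by summing the relevant coefficients directly:

# beta_2 + beta_3 + beta_4, using the coefficient names printed above
b <- coef(mod)
unname(b["Ta"] + b["Tb"] + b["Ta:Tb"])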

In a 2-by-2 experiment, this arithmetic is straightforward. However, things can quickly get complicated when more treatment conditions are introduced, or when the regression model includes non-linear components. Moreover, even when we can add up coefficient estimates to obtain the quantity of interest, computing the associated standard errors (classical or robust) by hand is tedious and error-prone. Thus, it is often convenient to rely on software like marginaleffects to do the hard work for us.
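
For instance, the hypotheses() function from marginaleffects can compute a linear combination of coefficients along with a delta-method standard error. A minimal sketch, assuming the coefficients keep the order printed above (b2, b3, and b4 refer to the second, third, and fourth coefficients):

# Delta-method standard error for beta_2 + beta_3 + beta_4
hypotheses(mod, hypothesis = "b2 + b3 + b4 = 0")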

To estimate the effect of \(T_a\) on \(Y\) while holding \(T_b\) fixed at 0, we use the avg_comparisons() function. The variables argument specifies the treatment of interest, and the newdata argument specifies the levels of the other treatment(s) at which we want to evaluate the contrast. In this case, we use newdata=subset(Tb == 0) to compute the quantity of interest only in the subset of observations where \(T_b=0\).

avg_comparisons(mod,
  variables = "Ta",
  newdata = subset(Tb == 0))
Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
5.69 1.65 3.45 <0.001 2.46 8.93

To estimate a cross-contrast, we specify each treatment of interest in the variables argument, and set cross=TRUE. This gives us an estimate of the effect of simultaneously moving \(T_a\) and \(T_b\) from 0 to 1.

avg_comparisons(mod, 
  variables = c("Ta", "Tb"),
  cross = TRUE)
C: Ta C: Tb Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
1 - 0 1 - 0 13.3 1.65 8.07 <0.001 10.1 16.6

Figure 10.1 shows six contrasts of interest in a 2-by-2 experimental design. The vertical axis shows the value of the outcome variable in each treatment group. The horizontal axis shows the value of the first treatment \(T_a\). The shape of each point (0 or 1) represents the value of the \(T_b\) treatment. For example, the “1” symbol in the top left shows the average value of \(Y\) in the \(T_a=1,T_b=1\) group.

In each panel, the arrow links two treatment groups that we want to compare. For example, in the top-left panel we compare \(\hat{Y}_{T_a=0,T_b=0}\) to \(\hat{Y}_{T_a=1,T_b=0}\). In the bottom-right panel, we compare \(\hat{Y}_{T_a=0,T_b=1}\) to \(\hat{Y}_{T_a=1,T_b=0}\). The avg_comparisons() call at the top of each panel shows how to use marginaleffects to compute the difference in \(Y\) that results from the illustrated change in treatment conditions.

Figure 10.1: Six contrasts of interest in a 2-by-2 experimental design.
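
To give one example, the comparison in the bottom-right panel, between the \(T_a=0,T_b=1\) and \(T_a=1,T_b=0\) groups, could be computed along the following lines. This is a sketch: we pass a two-row data frame to newdata and use the hypothesis argument to difference the two predictions, which in this model equals \(\hat{\beta}_2 - \hat{\beta}_3\).

# Compare the predicted outcome at (Ta=1, Tb=0) to the prediction at (Ta=0, Tb=1)
predictions(mod,
  newdata = data.frame(Ta = c(1, 0), Tb = c(0, 1)),
  hypothesis = "b1 - b2 = 0")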

One reason factorial experiments are so popular is that they allow researchers to assess the possibility that treatments interact with one another. Indeed, in some cases, the effect of a treatment \(T_a\) will be stronger (or weaker) at different values of treatment \(T_b\). To test this possibility, we use the strategies described in Chapter 8 on interactions.

First, we estimate the effect of a change from 0 to 1 in \(T_a\), in the subsets of observations with different values of \(T_b\).

avg_comparisons(mod, variables = "Ta", by = "Tb")
Term Tb Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
Ta 0 5.69 1.65 3.45 <0.001 2.46 8.93
Ta 1 8.62 1.93 4.46 <0.001 4.84 12.41

The estimated effect of \(T_a\) on \(Y\) is equal to 5.69 when \(T_b=0\), but equal to 8.62 when \(T_b=1\). Is the difference between those estimates statistically significant? Does \(T_b\) moderate the effect of \(T_a\)? Is there treatment heterogeneity or interaction between \(T_a\) and \(T_b\)? To answer these questions, we use the hypothesis argument and check if the difference between subgroup estimates is significant:

avg_comparisons(mod, variables = "Ta", by = "Tb", hypothesis = "b2 - b1 = 0")
Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b2-b1=0 2.93 2.54 1.15 0.249 -2.05 7.91

While there is a difference between our two estimates, the \(z\) statistic and \(p\) value indicate that it is not statistically significant. Therefore, we cannot reject the null hypothesis that the effect of \(T_a\) is the same regardless of the value of \(T_b\).
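
Note that in this simple linear model, the estimate of this difference (2.93) is identical to the coefficient on the Ta:Tb interaction term reported earlier, which we can confirm directly:

# the interaction coefficient matches the difference between subgroup estimates
coef(mod)["Ta:Tb"]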

10.3 Conjoint experiments

A forced-choice conjoint experiment is a research methodology used in fields such as marketing and political science to understand how people make decisions between “profiles” characterized by multiple “attributes.” Survey respondents are presented with a series of choices between different products, services, or scenarios. Each option is described by a set of attributes (e.g., price, quality, brand, features), and the levels of these attributes are varied randomly across the options presented.

Consider an experiment where researchers ask survey respondents “to act as immigration officials and to decide which of a pair of immigrants they would choose for admission” (Hainmueller, Hopkins, and Yamamoto 2014). The researchers display a table with two columns that represent distinct immigrant profiles with randomized attributes. For example:

A forced-choice task in a conjoint experiment.

Attributes       Profile 1            Profile 2
Language Skills  Fluent in English    Broken English
Job              Construction worker  Nurse

The survey respondent has the “task” of choosing one of the two profiles. Then, the researchers display a new task, including profiles with different randomized attributes, and the respondent chooses again. By analyzing the choices made by participants, researchers can estimate the relative importance of different attributes in the decision-making process.

To illustrate, we use data published alongside the Political Analysis article by Hainmueller, Hopkins, and Yamamoto (2014). In this experiment, each survey respondent \(i\) receives several tasks \(k\), in which they select one of two profiles \(j\), characterized by attributes \(l\). For simplicity, we consider a subset of the data with 5 tasks per respondent, 2 profiles per task, and 2 attributes per profile.

The data is structured in “long” format, with one respondent-task-profile combination per row. The dependent variable is choice, a binary variable which indicates if the profile in a given row was selected during a given task. The predictors are language skills and job type.

Since there are 5 tasks per respondent, and two profiles per task, the dataset includes 10 rows per respondent.

dat <- arrow::read_parquet("https://marginaleffects.com/data/immigration.parquet")
dat[dat$respondent == 4, ]
# A tibble: 10 × 6
   respondent  task profile choice job                 language        
        <int> <int>   <dbl>  <dbl> <fct>               <fct>           
 1          4     1       1      1 nurse               tried but unable
 2          4     1       2      0 child care provider used interpreter
 3          4     2       1      0 gardener            fluent          
 4          4     2       2      1 construction worker fluent          
 5          4     3       1      1 nurse               broken          
 6          4     3       2      0 child care provider fluent          
 7          4     4       1      0 teacher             used interpreter
 8          4     4       2      1 construction worker fluent          
 9          4     5       1      1 teacher             used interpreter
10          4     5       2      0 nurse               used interpreter

To analyze this dataset, we estimate a linear regression model with choice as the outcome, and in which all predictors are interacted:

mod <- lm(choice ~ job * language, data = dat)

10.3.1 Marginal means

A common strategy to interpret the results of a conjoint experiment is to compute marginal means. As described by Leeper, Hobolt, and Tilley (2020), a “marginal mean describes the level of favorability toward profiles that have a particular feature level, ignoring all other features.”

To compute marginal means, we proceed in two steps:

  1. Compute the predicted (i.e., fitted) values for each row in a balanced grid of predictors (see Section 3.2.4).
  2. Marginalize (average) those predictions with respect to the variable of interest.
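
To build intuition, these two steps can also be carried out “by hand” with the predictions() function: compute fitted values on a balanced grid, then average them by language. A minimal sketch, with point estimates only and no standard errors:

# Step 1: fitted values on a balanced grid of predictors
p <- predictions(mod, newdata = "balanced")
# Step 2: average the fitted values with respect to the variable of interest
aggregate(estimate ~ language, data = p, FUN = mean)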

The avg_predictions() function carries out both of these steps in a single call. Note that we use the vcov argument to report standard errors clustered at the respondent level.

avg_predictions(mod, 
  newdata = "balanced",
  by = "language", 
  vcov = ~respondent)
language Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
fluent 0.618 0.00831 74.4 <0.001 0.602 0.635
broken 0.550 0.00894 61.5 <0.001 0.532 0.568
tried but unable 0.490 0.00879 55.7 <0.001 0.473 0.507
used interpreter 0.453 0.00927 48.8 <0.001 0.434 0.471

These results suggest that, ignoring (or averaging over) the job attribute, profiles of fluent English speakers are chosen more often than profiles with other levels of language skill.

To see if the average probability of selection is higher when a candidate is fluent in English, relative to when they require an interpreter, we use the hypothesis argument.

avg_predictions(mod, 
  hypothesis = "b1 = b4",
  newdata = "balanced",
  by = "language", 
  vcov = ~respondent)
Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b1=b4 0.166 0.0138 12 <0.001 0.139 0.193

This shows that the difference between the marginal means in those two categories is substantial (nearly 17 percentage points), and that this difference is statistically significant.

10.3.2 Average Marginal Component Effects

Average Marginal Component Effects (AMCE) are defined and analyzed in Hainmueller, Hopkins, and Yamamoto (2014). Roughly speaking, an AMCE can be viewed as the average effect on choice of changing one attribute, while holding all other attributes of the profile constant. To compute an AMCE, we use the avg_comparisons() function that we already explored in Chapter 6 and in the chapter on G-computation.

avg_comparisons(mod, vcov = ~respondent, newdata = "balanced")
Term Contrast Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
job child care provider - janitor 0.01044 0.0171 0.612 0.5404 -0.02300 0.0439
job computer programmer - janitor 0.13576 0.0250 5.429 <0.001 0.08675 0.1848
job construction worker - janitor 0.03841 0.0175 2.195 0.0282 0.00411 0.0727
job doctor - janitor 0.21502 0.0243 8.850 <0.001 0.16740 0.2626
job financial analyst - janitor 0.11647 0.0258 4.518 <0.001 0.06594 0.1670
job gardener - janitor 0.01266 0.0174 0.729 0.4660 -0.02138 0.0467
job nurse - janitor 0.08283 0.0169 4.894 <0.001 0.04966 0.1160
job research scientist - janitor 0.19308 0.0241 7.999 <0.001 0.14577 0.2404
job teacher - janitor 0.06742 0.0176 3.834 <0.001 0.03296 0.1019
job waiter - janitor -0.00511 0.0176 -0.291 0.7714 -0.03959 0.0294
language broken - fluent -0.06835 0.0136 -5.035 <0.001 -0.09495 -0.0417
language tried but unable - fluent -0.12844 0.0134 -9.559 <0.001 -0.15478 -0.1021
language used interpreter - fluent -0.16584 0.0138 -11.995 <0.001 -0.19294 -0.1387

10.3.3 Average Feature Choice Probability

Abramson et al. (2024) introduce an alternative estimand for forced-choice conjoint experiments: the Average Feature Choice Probability (AFCP). The main difference between AMCE and AFCP lies in their approach to handling attribute comparisons.

The AMCE averages over both direct and indirect attribute comparisons. For example, the estimated effect of “fluent” vs. “broken English” on choice depends not only on these two specific characteristics, but also on how they compare to “used interpreter” or “tried but unable”. Thus, Abramson et al. (2024) argue that the AMCE incorporates information about irrelevant attributes and imposes a strong transitivity assumption. In some cases, the AMCE can suggest positive effects even when, in direct comparisons, respondents are on average less likely to choose a profile with the feature of interest over the baseline. In contrast, the AFCP focuses solely on direct comparisons between attributes, offering a more accurate representation of respondents’ direct preferences, without the influence of irrelevant attributes.

To estimate the AFCP, we first need to create new columns: language.alt and job.alt. These columns record the attributes of the alternative profile against which each profile was paired in every task. We use by() to split the data frame into subgroups for each combination of respondent and task. Each subgroup has two rows, because there are two profiles. Then, we use rev() to create new columns holding the attributes of the other profile in the pair.

dat <- by(dat, ~ respondent + task, \(x) transform(x,
  language.alt = rev(language),
  job.alt = rev(job)
))
dat <- do.call(rbind, dat)

In respondent 4’s first task, the first profile was “nurse” and the alternative profile was “child care provider”. Thus, “nurse” appears in the first row as job and in the second row as job.alt.

subset(dat, respondent == 4 & task == 1)
  respondent task profile choice                 job         language
1          4    1       1      1               nurse tried but unable
2          4    1       2      0 child care provider used interpreter
      language.alt             job.alt
1 used interpreter child care provider
2 tried but unable               nurse

To estimate the AFCP, we once again fit a linear regression model. This time, the model is even more flexible. Specifically, the model allows the effect of language skills in the first profile to depend on the value of language skills in the second profile. Likewise, other attributes can influence the probability of selection differently based on attributes in the comparison profile. To achieve this, we interact language with language.alt, and job with job.alt.

mod <- lm(choice ~ language * language.alt + job * job.alt, data = dat)

As noted above, and detailed in Abramson et al. (2024), the AFCP is a choice pair-specific quantity. This means that we need to average predictions (fitted values) across covariates, within each unique pair of attributes of interest.

We now compute the AFCP by averaging fitted values within each unique pair of language attributes (the combination of language and language.alt). Since we are mainly interested in comparisons against a profile that is fluent in English, we use the subset() function in the newdata argument to restrict the grid to rows where language.alt equals “fluent”.

avg_predictions(mod, 
  by = c("language", "language.alt"), 
  newdata = subset(language.alt == "fluent"),
  vcov = ~respondent)
language language.alt Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
fluent fluent 0.500 3.09e-07 1.62e+06 <0.001 0.500 0.500
broken fluent 0.407 1.65e-02 2.46e+01 <0.001 0.374 0.439
tried but unable fluent 0.376 1.66e-02 2.27e+01 <0.001 0.344 0.409
used interpreter fluent 0.361 1.57e-02 2.30e+01 <0.001 0.330 0.392

A powerful feature of marginaleffects is that all its functions include a hypothesis argument which can be used to conduct hypothesis tests on arbitrary functions of estimates. This allows us to answer questions such as: Is the AFCP for “used interpreter vs. fluent” different from the AFCP for “broken vs. fluent”?

avg_predictions(mod, 
  by = c("language", "language.alt"), 
  hypothesis = "b4 - b2 = 0",
  newdata = subset(language.alt == "fluent"),
  vcov = ~respondent)
Term Estimate Std. Error z Pr(>|z|) 2.5 % 97.5 %
b4-b2=0 -0.0457 0.0226 -2.02 0.0434 -0.09 -0.00135

These findings suggest that, when facing a fluent alternative, respondents are less likely to choose a profile that used an interpreter than a profile with broken English: the estimated gap is -0.046, with a \(z\) statistic of -2.02, which is statistically significant at the 5% level.