18 Conjoint experiments
A forced-choice conjoint experiment is a research methodology used in many fields such as marketing and political science to understand how people make decisions between “profiles” characterized by multiple “attributes.” In this type of experiment, participants are presented with a series of choices between different products, services, or scenarios. Each option is described by a set of attributes (e.g., price, quality, brand, features), and the levels of these attributes are varied randomly across the options presented.
Consider an experiment where researchers ask survey respondents “to act as immigration officials and to decide which of a pair of immigrants they would choose for admission” (Hainmueller, Hopkins, and Yamamoto 2014). They display a table in which each column represents a distinct immigrant “profile” with randomized attributes. For example:
Attributes | Profile 1 | Profile 2 |
---|---|---|
Language Skills | Fluent in English | Broken English |
Job | Construction worker | Nurse |
The survey respondent has the “task” of choosing one of the two profiles. Then, the researchers display a new task, including profiles with different randomized attributes, and the respondent chooses again.
By analyzing the choices made by participants, researchers can estimate the relative importance of different attributes in the decision-making process.
The rest of this vignette shows how to use the `marginaleffects` package to estimate the main quantities of interest in such experiments.
19 Data
To illustrate, we use data published alongside the Political Analysis article by Hainmueller, Hopkins, and Yamamoto (2014). In this experiment, each survey respondent \(i\) receives several tasks \(k\), in which they select one of two profiles \(j\), characterized by attributes \(l\).
For simplicity, we consider a subset of the data with 5 tasks per respondent, 2 profiles per task, and 2 attributes per profile. The data is structured in “long” format, with one respondent-task-profile combination per row.
These are the entries for survey respondent number 4:
library(marginaleffects)
library(data.table)
library(tinytable)
dat <- readRDS(url("https://marginaleffects.com/data/conjoint_immigration.rds", "rb"))
dat[dat$respondent == 4, ]
choice job language respondent task profile
<num> <fctr> <fctr> <int> <int> <num>
1: 1 nurse tried but unable 4 1 1
2: 0 child care provider used interpreter 4 1 2
3: 0 gardener fluent 4 2 1
4: 1 construction worker fluent 4 2 2
5: 1 nurse broken 4 3 1
6: 0 child care provider fluent 4 3 2
7: 0 teacher used interpreter 4 4 1
8: 1 construction worker fluent 4 4 2
9: 1 teacher used interpreter 4 5 1
10: 0 nurse used interpreter 4 5 2
The `choice` column indicates if the profile in each row was selected by the respondent.
As noted above, these data are in "long" format. Later in the vignette, it will be useful to have the same information in "wide" format, with new columns indicating the attributes of every profile in each task. To create these new columns, we use `dcast()` from the `data.table` package. This function is analogous to `reshape()` in base `R` or `pivot_wider()` in the `tidyverse`:
dat <- data.table(dat)

# one row per respondent-task, with separate columns for each profile's attributes
wide <- dcast(dat,
    respondent + task ~ profile,
    value.var = c("language", "job")
)

# merge the wide columns back into the long data (keyed on respondent and task)
dat <- merge(dat, wide)
We now have new columns called `language_1` and `language_2`, indicating the language skills of profiles 1 and 2 respectively in each task.
dat[dat$respondent == 4, ]
Key: <respondent, task>
respondent task choice job language profile language_1 language_2 job_1 job_2
<int> <int> <num> <fctr> <fctr> <num> <fctr> <fctr> <fctr> <fctr>
1: 4 1 1 nurse tried but unable 1 tried but unable used interpreter nurse child care provider
2: 4 1 0 child care provider used interpreter 2 tried but unable used interpreter nurse child care provider
3: 4 2 0 gardener fluent 1 fluent fluent gardener construction worker
4: 4 2 1 construction worker fluent 2 fluent fluent gardener construction worker
5: 4 3 1 nurse broken 1 broken fluent nurse child care provider
6: 4 3 0 child care provider fluent 2 broken fluent nurse child care provider
7: 4 4 0 teacher used interpreter 1 used interpreter fluent teacher construction worker
8: 4 4 1 construction worker fluent 2 used interpreter fluent teacher construction worker
9: 4 5 1 teacher used interpreter 1 used interpreter used interpreter teacher nurse
10: 4 5 0 nurse used interpreter 2 used interpreter used interpreter teacher nurse
20 Marginal means
As described by Leeper, Hobolt, and Tilley (2020), a “marginal mean describes the level of favorability toward profiles that have a particular feature level, ignoring all other features.”
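In the notation introduced above, the marginal mean for level \(l\) of an attribute can be written informally as

\[
\textrm{MM}(l) = \textrm{E}\left[ Y_{ijk} \mid L_{ijk} = l \right],
\]

where \(Y_{ijk}\) indicates whether respondent \(i\) chose profile \(j\) in task \(k\), \(L_{ijk}\) is the profile's level on the attribute of interest, and the expectation averages over the randomized distribution of all other attributes. This is a simplified statement of the estimand; see Leeper, Hobolt, and Tilley (2020) for a formal treatment.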
To compute marginal means, we proceed in 3 steps:

- Estimate a regression model with `choice` as outcome and the attributes of interest as predictors.
- Compute the predicted (i.e., fitted) values for each row in the original dataset.
- Marginalize (average) those predictions with respect to the variable of interest.
To illustrate, we estimate a linear regression model with interactions between the `language` and `job` variables:
mod <- lm(choice ~ job * language, data = dat)
Then, we use the `avg_predictions()` function to compute and marginalize predicted values. Note that we use the `vcov` argument to report standard errors clustered at the respondent level.
avg_predictions(mod, by = "language", vcov = ~respondent)
language Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
tried but unable 0.459 0.00733 62.7 <0.001 Inf 0.445 0.474
used interpreter 0.424 0.00749 56.6 <0.001 Inf 0.410 0.439
fluent 0.588 0.00726 81.0 <0.001 Inf 0.573 0.602
broken 0.526 0.00746 70.6 <0.001 Inf 0.512 0.541
Type: response
Columns: language, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
The results above suggest that, ignoring (or averaging over) the `job` attribute, profiles of "fluent" English speakers are chosen more often than profiles with other `language` values.
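To see what `avg_predictions()` does under the hood, the same point estimates can be computed "by hand" by following the three steps listed above. This minimal sketch reproduces the point estimates only; it does not compute clustered standard errors:

# Step 2: compute fitted values for every row of the original data
dat$fitted <- predict(mod)

# Step 3: average the fitted values within each level of `language`
aggregate(fitted ~ language, data = dat, FUN = mean)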
20.1 Hypothesis tests
Using the `hypothesis` argument, we can easily conduct various null hypothesis tests on the estimated marginal means. For example, is the probability of choosing a "fluent" profile equal to the probability of choosing a "tried but unable" profile? The `b1` and `b3` labels refer to the first and third rows of the marginal means output above ("tried but unable" and "fluent", respectively):
avg_predictions(mod,
hypothesis = "b1 = b3",
by = "language",
vcov = ~respondent)
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
-0.128 0.0119 -10.8 <0.001 87.6 -0.152 -0.105
Term: b1=b3
Type: response
Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Is the difference in probability between "fluent" and "broken" (`b3 - b4`) equal to the difference in probability between "tried but unable" and "used interpreter" (`b1 - b2`)?
avg_predictions(mod,
hypothesis = "b1 - b2 = b3 - b4",
by = "language",
vcov = ~respondent)
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
-0.0264 0.0168 -1.57 0.116 3.1 -0.0594 0.00657
Term: b1-b2=b3-b4
Type: response
Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
20.2 Subgroups
Modifying the `by` argument allows analysts to report marginal means for different subgroups of their data, and `newdata` can be used to exclude uninteresting profiles:
avg_predictions(mod,
by = c("language", "job"),
newdata = subset(dat, job %in% c("doctor", "gardener")),
vcov = ~respondent)
language job Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
fluent gardener 0.534 0.0243 22.0 <0.001 353.5 0.486 0.581
fluent doctor 0.772 0.0332 23.2 <0.001 394.4 0.707 0.837
used interpreter doctor 0.602 0.0427 14.1 <0.001 147.5 0.518 0.685
tried but unable gardener 0.416 0.0237 17.5 <0.001 226.5 0.370 0.463
used interpreter gardener 0.387 0.0244 15.8 <0.001 185.1 0.339 0.435
broken gardener 0.509 0.0245 20.8 <0.001 317.0 0.461 0.557
tried but unable doctor 0.570 0.0443 12.9 <0.001 123.5 0.483 0.657
broken doctor 0.712 0.0394 18.1 <0.001 239.6 0.635 0.789
Type: response
Columns: language, job, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
21 Average Marginal Component Effects
Average Marginal Component Effects (AMCE) are defined and analyzed in Hainmueller, Hopkins, and Yamamoto (2014). Roughly speaking, an AMCE can be viewed as the average effect of changing one attribute on choice, while holding all other attributes of the profile constant. To compute an AMCE, we can proceed in 4 steps (sketched manually below):

- Create a dataset A where the `language` column is equal to "fluent English" in every row.
- Create a dataset B where the `language` column is equal to "broken English" in every row.
- Compute predicted (fitted) values for every row in datasets A and B.
- Compute the average difference between predicted values in the two datasets.
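For intuition, here is a minimal by-hand sketch of these four steps. It reproduces only the point estimate of the "broken" vs. "fluent" contrast; `avg_comparisons()` below computes all contrasts along with standard errors:

# Step 1: dataset A, where every profile is fluent in English
A <- copy(dat)
A$language <- factor("fluent", levels = levels(dat$language))

# Step 2: dataset B, where every profile speaks broken English
B <- copy(dat)
B$language <- factor("broken", levels = levels(dat$language))

# Steps 3 and 4: predicted values in both datasets, then the average difference
mean(predict(mod, newdata = B) - predict(mod, newdata = A))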
This can be achieved easily using the `avg_comparisons()` function from the `marginaleffects` package:
avg_comparisons(mod, variables = "language", vcov = ~respondent)
Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
mean(broken) - mean(fluent) -0.0592 0.0120 -4.95 <0.001 20.4 -0.0826 -0.0357
mean(tried but unable) - mean(fluent) -0.1271 0.0119 -10.66 <0.001 85.7 -0.1504 -0.1037
mean(used interpreter) - mean(fluent) -0.1608 0.0121 -13.27 <0.001 131.2 -0.1845 -0.1370
Term: language
Type: response
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
21.1 Empirical vs. balanced grid
In the example above, `avg_comparisons()` marginalized across the realized distribution of attributes observed in the actual dataset. An alternative would be to marginalize over a perfectly balanced ("uniform") grid of treatment conditions. Of course, empirical and uniform grids will yield nearly identical results if the sample is large and randomization was successful.

The uniform approach is the default in the `amce()` function from the `cjoint` package, and its behavior can be replicated using the `datagrid()` function and the `newdata` argument of `avg_comparisons()`:
library(cjoint)
amce_results <- amce(
choice ~ language * job,
data = dat,
cluster = TRUE,
respondent.id = "respondent")
summary(amce_results)$amce |> subset(Attribute == "language")
Attribute Level Estimate Std. Err z value Pr(>|z|)
11 language broken -0.06834528 0.01357451 -5.034826 4.782835e-07 ***
12 language tried but unable -0.12843978 0.01343677 -9.558832 1.190935e-21 ***
13 language used interpreter -0.16584337 0.01382564 -11.995352 3.758141e-33 ***
avg_comparisons(mod,
newdata = datagrid(FUN_factor = unique, FUN_character = unique),
variables = "language",
vcov = ~respondent)
Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
mean(broken) - mean(fluent) -0.0683 0.0136 -5.03 <0.001 21.0 -0.095 -0.0417
mean(tried but unable) - mean(fluent) -0.1284 0.0134 -9.56 <0.001 69.5 -0.155 -0.1021
mean(used interpreter) - mean(fluent) -0.1658 0.0138 -12.00 <0.001 107.7 -0.193 -0.1387
Term: language
Type: response
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
22 Average Feature Choice Probability
Abramson et al. (2024) introduce an alternative estimand for forced-choice conjoint experiments: the Average Feature Choice Probability (AFCP). The main difference between AMCE and AFCP lies in their approach to handling attribute comparisons.
AMCE averages over both direct and indirect attribute comparisons, potentially incorporating information about irrelevant attributes, and thus imposes a strong transitivity assumption. In some cases, AMCE can suggest positive effects even when, in direct comparisons, respondents are on average less likely to choose a profile with the feature of interest over the baseline. In contrast, AFCP focuses solely on direct comparisons between attributes, offering a more accurate representation of respondents' direct preferences without the influence of irrelevant attributes.
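In the notation used earlier, the AFCP for levels \(l\) and \(l'\) of an attribute can be stated informally as

\[
\textrm{AFCP}(l, l') = \Pr\left( Y_{ijk} = 1 \mid L_{ijk} = l, \; L_{ij'k} = l' \right), \qquad j' \neq j,
\]

that is, the probability of choosing a profile with level \(l\) when the competing profile \(j'\) in the same task has level \(l'\). This is a simplified statement; Abramson et al. (2024) give the formal definition.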
To estimate AFCP, we once again start by estimating a linear regression model. This time, the model is even more flexible. Specifically, the model allows the effect of language skills in the first profile to depend on the value of language skills in the second profile. Likewise, other attributes can influence the probability of selection differently based on attributes in the comparison profile. To achieve this, we interact `language_1` with `language_2`, and `job_1` with `job_2`.

Moreover, since the data is in "long" format, with one profile per row, we must also allow each variable to have different coefficients based on the `profile` number: the effect of `language_1` on the probability of selection is obviously different for `profile=1` and for `profile=2`. Indeed, when `profile=1`, the `language_1` column records the profile's own language skills. When `profile=2`, the same column records the alternative profile's language skills.

Thankfully, it is trivial to allow this flexibility by interacting the attributes with the `profile` variable:
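One specification consistent with this description is sketched below (equivalent parameterizations exist; with only two profiles per task, factor and numeric codings of `profile` fit identically):

# attribute effects may vary with the competing profile's attributes
# and with the profile's position in the task
mod <- lm(
    choice ~ factor(profile) * (language_1 * language_2 + job_1 * job_2),
    data = dat
)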
As explained above and detailed in Abramson et al. (2024), the AFCP is a choice pair-specific quantity. This means that we need to average predictions (fitted values) across covariates, within each unique pair of attributes of interest. To allow this, we create a new variable called `alternative`:
# `alternative` records the language skills of the competing profile in the same task
dat$alternative <- fifelse(
    dat$language == dat$language_1,
    dat$language_2,
    dat$language_1
)
This table shows the language skills and job of both profiles in the first task faced by respondent number 4. The `alternative` column simply shows the language skills of the alternative profile within each task:
subset(dat,
task == 1 & respondent == 4,
select = c("respondent", "task", "profile", "job", "language", "alternative")) |>
tt()
respondent | task | profile | job | language | alternative |
---|---|---|---|---|---|
4 | 1 | 1 | nurse | tried but unable | used interpreter |
4 | 1 | 2 | child care provider | used interpreter | tried but unable |
Now we compute the AFCP by averaging fitted values within each unique pair of language attributes (the combination of `language` and `alternative`). Since we are not interested in comparison pairs where both profiles have the same language skills, we use `subset()` in the `newdata` argument to supply an appropriate grid.
p <- avg_predictions(mod,
by = c("language", "alternative"),
newdata = subset(dat, language != alternative),
vcov = ~respondent)
Display the results in a nice `tinytable`, highlighting the rows where the alternative profile is "fluent":
library(tinytable)
idx <- which(p$alternative == "fluent")
print(p, "tinytable") |> style_tt(i = idx, background = "pink")
language | alternative | Estimate | Std. Error | z | Pr(>|z|) | S | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|---|
tried but unable | used interpreter | 0.541 | 0.0183 | 29.7 | <0.001 | 639.5 | 0.505 | 0.577 |
used interpreter | tried but unable | 0.459 | 0.0183 | 25.1 | <0.001 | 460.9 | 0.423 | 0.495 |
broken | fluent | 0.407 | 0.0164 | 24.8 | <0.001 | 447.7 | 0.375 | 0.439 |
fluent | broken | 0.593 | 0.0164 | 36.1 | <0.001 | 947.4 | 0.561 | 0.625 |
used interpreter | fluent | 0.361 | 0.0157 | 22.9 | <0.001 | 384.5 | 0.330 | 0.392 |
fluent | used interpreter | 0.639 | 0.0157 | 40.6 | <0.001 | Inf | 0.608 | 0.670 |
broken | tried but unable | 0.577 | 0.0166 | 34.7 | <0.001 | 874.9 | 0.544 | 0.609 |
tried but unable | broken | 0.423 | 0.0166 | 25.5 | <0.001 | 474.1 | 0.391 | 0.456 |
used interpreter | broken | 0.381 | 0.0163 | 23.3 | <0.001 | 396.6 | 0.349 | 0.413 |
broken | used interpreter | 0.619 | 0.0163 | 37.9 | <0.001 | Inf | 0.587 | 0.651 |
fluent | tried but unable | 0.624 | 0.0165 | 37.7 | <0.001 | Inf | 0.591 | 0.656 |
tried but unable | fluent | 0.376 | 0.0165 | 22.8 | <0.001 | 378.4 | 0.344 | 0.409 |

Type: response
Columns: language, alternative, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
The highlighted results (rows where `alternative` is "fluent") are the same as those produced by the `afcp` package:
library(afcp)
afcp_results <- afcp(
amce_results,
respondent.id = "respondent",
task.id = "task",
profile.id = "profile",
attribute = "language")
afcp_results$afcp
level baseline afcp se zstat pval conf_high conf_low conf_level
1 broken fluent 0.4067416 0.01658803 -5.622030 1.887265e-08 0.4392535 0.3742296 0.95
2 triedbutunable fluent 0.3763066 0.01685034 -7.340705 2.124724e-13 0.4093327 0.3432806 0.95
3 usedinterpreter fluent 0.3610799 0.01592112 -8.725525 2.649493e-18 0.3922847 0.3298750 0.95
22.1 Hypothesis tests
A powerful feature of `marginaleffects` is that all its functions include a `hypothesis` argument, which can be used to conduct hypothesis tests on arbitrary functions of estimates. For example, let's compute the AFCP for a subset of profile comparisons:
p <- avg_predictions(mod,
by = c("language", "alternative"),
newdata = subset(dat, language != alternative & alternative == "fluent"),
vcov = ~respondent)
p
language alternative Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
broken fluent 0.407 0.0164 24.8 <0.001 447.7 0.375 0.439
used interpreter fluent 0.361 0.0157 22.9 <0.001 384.5 0.330 0.392
tried but unable fluent 0.376 0.0165 22.8 <0.001 378.4 0.344 0.409
Type: response
Columns: language, alternative, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Now, let's say we would like to test if any of the pairwise differences between these AFCP estimates is different from zero. All we need to do is add a `hypothesis="pairwise"` argument:
avg_predictions(mod,
hypothesis = "pairwise",
by = c("language", "alternative"),
newdata = subset(dat, language != alternative & alternative == "fluent"),
vcov = ~respondent)
Term Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
broken - used interpreter 0.0457 0.0226 2.02 0.0429 4.5 0.00146 0.0899
broken - tried but unable 0.0304 0.0228 1.34 0.1811 2.5 -0.01417 0.0750
used interpreter - tried but unable -0.0152 0.0224 -0.68 0.4968 1.0 -0.05915 0.0287
Type: response
Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high