17 Conjoint experiments
17.1 Conjoint experiments
A forced-choice conjoint experiment is a research methodology used in fields such as marketing and political science to understand how people make decisions between “profiles” characterized by multiple “attributes.” Survey respondents are presented with a series of choices between different products, services, or scenarios. Each option is described by a set of attributes (e.g., price, quality, brand, features), and the levels of these attributes are varied randomly across the options presented.
Consider an experiment where researchers ask survey respondents “to act as immigration officials and to decide which of a pair of immigrants they would choose for admission” (Hainmueller, Hopkins, and Yamamoto 2014). The researchers display a table with two columns that represent distinct immigrant profiles with randomized attributes. For example:

| Attributes | Profile 1 | Profile 2 |
|---|---|---|
| Language Skills | Fluent in English | Broken English |
| Job | Construction worker | Nurse |
The survey respondent has the “task” of choosing one of the two profiles. Then, the researchers display a new task, including profiles with different randomized attributes, and the respondent chooses again. By analyzing the choices made by participants, researchers can estimate the relative importance of different attributes in the decision-making process.
To illustrate, we use data published alongside the Political Analysis article by Hainmueller, Hopkins, and Yamamoto (2014). In this experiment, each survey respondent \(i\) receives several tasks \(k\), in which they select one of two profiles \(j\), characterized by attributes \(l\). For simplicity, we consider a subset of the data with 5 tasks per respondent, 2 profiles per task, and 2 attributes per profile.
The data is structured in “long” format, with one respondent-task-profile combination per row. The dependent variable is `choice`, a binary variable which indicates whether the profile in a given row was selected during a given task. The predictors are `language` skills and `job` type. Since there are 5 tasks per respondent and 2 profiles per task, the dataset includes 10 rows per respondent.
```r
library(marginaleffects)
dat <- get_dataset("immigration")
dat[dat$respondent == 4, ]
```

```
   respondent task profile choice                 job         language
1           4    1       1      1               nurse tried but unable
2           4    1       2      0 child care provider used interpreter
3           4    2       1      0            gardener           fluent
4           4    2       2      1 construction worker           fluent
5           4    3       1      1               nurse           broken
6           4    3       2      0 child care provider           fluent
7           4    4       1      0             teacher used interpreter
8           4    4       2      1 construction worker           fluent
9           4    5       1      1             teacher used interpreter
10          4    5       2      0               nurse used interpreter
```
To analyze this dataset, we estimate a linear regression model in which `choice` is the outcome and all predictors are interacted:
```r
mod <- lm(choice ~ job * language, data = dat)
```
17.1.1 Marginal means
A common strategy to interpret the results of a conjoint experiment is to compute marginal means. As described by Leeper, Hobolt, and Tilley (2020), a “marginal mean describes the level of favorability toward profiles that have a particular feature level, ignoring all other features.”
To compute marginal means, we proceed in two steps:
- Compute the predicted (i.e., fitted) values for each row in a balanced grid of predictors (see Section 3.2.4).
- Marginalize (average) those predictions with respect to the variable of interest.
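To make these two steps concrete, here is a minimal sketch that reproduces the point estimates by hand, assuming the `mod` and `dat` objects defined above (standard errors omitted):

```r
# Step 1: fitted values for every row of a balanced grid of predictors
grid <- datagrid(model = mod, grid_type = "balanced")
grid$fit <- predict(mod, newdata = grid)

# Step 2: average those predictions by language skill
aggregate(fit ~ language, data = grid, FUN = mean)
```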
Both steps are handled by the `avg_predictions()` function, which also supplies standard errors. Note that we use the `vcov` argument to report standard errors clustered at the survey-respondent level.
```r
avg_predictions(mod,
    newdata = "balanced",
    by = "language",
    vcov = ~respondent)
```

```
         language Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
 fluent              0.618    0.00831 74.4   <0.001 Inf 0.602  0.635
 broken              0.550    0.00894 61.5   <0.001 Inf 0.532  0.568
 tried but unable    0.490    0.00879 55.7   <0.001 Inf 0.473  0.507
 used interpreter    0.453    0.00927 48.8   <0.001 Inf 0.434  0.471

Type: response
```
These results suggest that, ignoring (or averaging over) the `job` attribute, “fluent” English speakers are chosen more often than profiles with other `language` values.
To see if the average probability of selection is higher when a candidate is fluent in English, relative to when they require an interpreter, we use the `hypothesis` argument.
```r
avg_predictions(mod,
    hypothesis = "b1 = b4",
    newdata = "balanced",
    by = "language",
    vcov = ~respondent)
```
Warning:
It is essential to check the order of estimates when specifying hypothesis tests using positional indices like b1, b2, etc. The indices of estimates can change depending on the order of rows in the original dataset, user-supplied arguments, model-fitting package, and version of `marginaleffects`.
It is also good practice to use assertions that ensure the order of estimates is consistent across different runs of the same code. Example:
```r
mod <- lm(mpg ~ am * carb, data = mtcars)
# assertion for safety
p <- avg_predictions(mod, by = 'carb')
stopifnot(p$carb[1] == 1, p$carb[2] == 2)
# hypothesis test
avg_predictions(mod, by = 'carb', hypothesis = 'b1 - b2 = 0')
```
Disable this warning with: `options(marginaleffects_safe = FALSE)`
This warning appears once per session.
```
 Hypothesis Estimate Std. Error  z Pr(>|z|)     S 2.5 % 97.5 %
 b1=b4         0.166     0.0138 12   <0.001 107.7 0.139  0.193

Type: response
```
This shows that the difference between the estimates in those two categories is relatively large (0.618 − 0.453 = 0.166, or roughly 17 percentage points), and that this difference is statistically significant.
17.1.2 Average Marginal Component Effects
Average Marginal Component Effects (AMCE) are defined and analyzed in Hainmueller, Hopkins, and Yamamoto (2014). Roughly speaking, they can be viewed as the average effects of changing one attribute on choice, while holding all other attributes of the profile constant. To compute an AMCE, we use the `avg_comparisons()` function that we already explored in Chapters 6 and 8.
```r
avg_comparisons(mod, vcov = ~respondent, newdata = "balanced")
```

```
     Term                      Contrast Estimate Std. Error       z Pr(>|z|)     S    2.5 %  97.5 %
 job      child care provider - janitor  0.01044     0.0171   0.612   0.5404   0.9 -0.02300  0.0439
 job      computer programmer - janitor  0.13576     0.0250   5.429   <0.001  24.1  0.08675  0.1848
 job      construction worker - janitor  0.03841     0.0175   2.195   0.0282   5.1  0.00411  0.0727
 job      doctor - janitor               0.21502     0.0243   8.850   <0.001  60.0  0.16740  0.2626
 job      financial analyst - janitor    0.11647     0.0258   4.518   <0.001  17.3  0.06594  0.1670
 job      gardener - janitor             0.01266     0.0174   0.729   0.4660   1.1 -0.02138  0.0467
 job      nurse - janitor                0.08283     0.0169   4.894   <0.001  19.9  0.04966  0.1160
 job      research scientist - janitor   0.19308     0.0241   7.999   <0.001  49.5  0.14577  0.2404
 job      teacher - janitor              0.06742     0.0176   3.834   <0.001  13.0  0.03296  0.1019
 job      waiter - janitor              -0.00511     0.0176  -0.291   0.7714   0.4 -0.03959  0.0294
 language broken - fluent               -0.06835     0.0136  -5.035   <0.001  21.0 -0.09495 -0.0417
 language tried but unable - fluent     -0.12844     0.0134  -9.559   <0.001  69.5 -0.15478 -0.1021
 language used interpreter - fluent     -0.16584     0.0138 -11.995   <0.001 107.7 -0.19294 -0.1387

Type: response
```
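Because the model is linear and the predictions are averaged over the same balanced grid, each `language` AMCE is identical to the difference between the corresponding marginal means from the previous section. As a check, we can reuse the `hypothesis` syntax introduced above; recall that b1 and b2 index the “fluent” and “broken” rows of the marginal means table:

```r
# Difference in marginal means: broken - fluent
# This should match the "broken - fluent" AMCE of -0.068 reported above.
avg_predictions(mod,
    newdata = "balanced",
    by = "language",
    hypothesis = "b2 - b1 = 0",
    vcov = ~respondent)
```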
17.1.3 Average Feature Choice Probability
Abramson et al. (2024) introduce an alternative estimand for forced-choice conjoint experiments: the Average Feature Choice Probability (AFCP). The main difference between AMCE and AFCP lies in their approach to handling attribute comparisons.
The AMCE averages over both direct and indirect attribute comparisons. For example, the estimated effect of “fluent” vs. “broken English” on choice depends not only on these two specific characteristics, but also on how they compare to “used interpreter” or “tried but unable”. Thus, Abramson et al. (2024) argue that the AMCE considers information about irrelevant attributes, and imposes a strong transitivity assumption. In some cases, the AMCE can suggest positive effects even when, in direct comparisons, respondents are on average less likely to choose a profile with the feature of interest over the baseline. In contrast, the AFCP focuses solely on direct comparisons between attributes, offering a more accurate representation of respondents’ direct preferences without the influence of irrelevant attributes.
To estimate AFCP, we first need to create two new columns: `language.alt` and `job.alt`. These columns record the attributes of the alternative profile against which each profile was paired, in every task. We use `by()` to split the data frame into subgroups for each combination of `respondent` and `task`. Each subgroup has two rows, because there are two profiles per task. Then, we use `rev()` to create new variables with the other profile’s attributes, as sketched below.
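The following is a minimal sketch of this step, assuming the `dat` object loaded above; the published replication code may implement it differently:

```r
# Split the data by respondent-task pair (two rows per subgroup) and
# record the attributes of the other profile by reversing each
# two-element attribute vector.
dat <- do.call(rbind, by(dat, list(dat$respondent, dat$task), function(x) {
    x$language.alt <- rev(x$language)
    x$job.alt <- rev(x$job)
    x
}))
```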
In respondent 4’s first task, the first profile was “nurse” and the alternative profile was “child care provider”. Thus, “nurse” appears in the first row as `job` and in the second row as `job.alt`.
```r
subset(dat, respondent == 4 & task == 1)
```

```
  respondent task profile choice                 job         language     language.alt             job.alt
1          4    1       1      1               nurse tried but unable used interpreter child care provider
2          4    1       2      0 child care provider used interpreter tried but unable               nurse
```
To estimate the AFCP, we once again fit a linear regression model. This time, the model is even more flexible. Specifically, it allows the effect of language skills in the first profile to depend on the value of language skills in the second profile. Likewise, other attributes can influence the probability of selection differently based on the attributes of the comparison profile. To achieve this, we interact `language` with `language.alt`, and `job` with `job.alt`.
```r
mod <- lm(choice ~ language * language.alt + job * job.alt, data = dat)
```
As noted above, and detailed in Abramson et al. (2024), the AFCP is a choice pair-specific quantity. This means that we need to average predictions (fitted values) across covariates, within each unique pair of attributes of interest. We thus compute the AFCP by averaging fitted values within each unique pair of language attributes (the combination of `language` and `language.alt`). Here, we focus on profiles whose alternative was “fluent” in English, using the `subset()` function with the `newdata` argument to select the appropriate grid.
```r
avg_predictions(mod,
    by = c("language", "language.alt"),
    newdata = subset(language.alt == "fluent"),
    vcov = ~respondent)
```
```
         language language.alt Estimate Std. Error        z Pr(>|z|)     S 2.5 % 97.5 %
 fluent           fluent          0.500   9.66e-08 5.18e+06   <0.001   Inf 0.500  0.500
 broken           fluent          0.407   1.65e-02 2.46e+01   <0.001 442.8 0.374  0.439
 tried but unable fluent          0.376   1.66e-02 2.27e+01   <0.001 375.1 0.344  0.409
 used interpreter fluent          0.361   1.57e-02 2.30e+01   <0.001 385.2 0.330  0.392

Type: response
```

The estimate in the first row is 0.5 essentially by construction: when both profiles in a pair have the same language skills, exactly one of the two is chosen, so neither is favored on average.
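Since attributes are randomly assigned, the AFCP for a direct comparison can also be approximated from the raw data. As a rough check, the share of “broken” English profiles chosen when paired against a “fluent” alternative should be close to (though not identical to) the model-based estimate of 0.407, because the model also adjusts for job pairings:

```r
# Raw share of "broken" profiles chosen against "fluent" alternatives
idx <- dat$language == "broken" & dat$language.alt == "fluent"
mean(dat$choice[idx])
```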
A powerful feature of `marginaleffects` is that all its functions include a `hypothesis` argument, which can be used to conduct hypothesis tests on arbitrary functions of estimates. This allows us to answer questions such as: is the AFCP for “used interpreter vs. fluent” different from the AFCP for “broken vs. fluent”?
```r
avg_predictions(mod,
    by = c("language", "language.alt"),
    hypothesis = "b4 - b2 = 0",
    newdata = subset(language.alt == "fluent"),
    vcov = ~respondent)
```

```
 Hypothesis Estimate Std. Error     z Pr(>|z|)   S 2.5 %   97.5 %
 b4-b2=0     -0.0457     0.0226 -2.02   0.0434 4.5 -0.09 -0.00135

Type: response
```
These findings suggest that the probability of choosing a profile over a “fluent” alternative is smaller when that profile “used interpreter” than when it speaks “broken” English: the estimated gap is -0.046, with a \(z\) statistic of -2.02.