18 Conjoint experiments
A forced-choice conjoint experiment is a research methodology used in many fields such as marketing and political science to understand how people make decisions between “profiles” characterized by multiple “attributes.” In this type of experiment, participants are presented with a series of choices between different products, services, or scenarios. Each option is described by a set of attributes (e.g., price, quality, brand, features), and the levels of these attributes are varied randomly across the options presented.
Consider an experiment where researchers ask survey respondents “to act as immigration officials and to decide which of a pair of immigrants they would choose for admission” (Hainmueller, Hopkins, and Yamamoto 2014). They display a table in which each column represents a distinct immigrant “profile” with randomized attributes. For example:
Attributes | Profile 1 | Profile 2 |
---|---|---|
Language Skills | Fluent in English | Broken English |
Job | Construction worker | Nurse |
The survey respondent has the “task” of choosing one of the two profiles. Then, the researchers display a new task, including profiles with different randomized attributes, and the respondent chooses again.
By analyzing the choices made by participants, researchers can estimate the relative importance of different attributes in the decision-making process.
The rest of this vignette shows how to use the `marginaleffects` package to estimate the main quantities of interest in such experiments.
19 Data
To illustrate, we use data published alongside the Political Analysis article by Hainmueller, Hopkins, and Yamamoto (2014). In this experiment, each survey respondent \(i\) receives several tasks \(k\), in which they select one of two profiles \(j\), characterized by attributes \(l\).
For simplicity, we consider a subset of the data with 5 tasks per respondent, 2 profiles per task, and 2 attributes per profile. The data is structured in “long” format, with one respondent-task-profile combination per row.
These are the entries for survey respondent number 4:
library(marginaleffects)
library(data.table)
library(tinytable)
dat <- readRDS(url("https://marginaleffects.com/data/conjoint_immigration.rds", "rb"))
dat[dat$respondent == 4, ]
choice job language respondent task profile
<num> <fctr> <fctr> <int> <int> <num>
1: 1 nurse tried but unable 4 1 1
2: 0 child care provider used interpreter 4 1 2
3: 0 gardener fluent 4 2 1
4: 1 construction worker fluent 4 2 2
5: 1 nurse broken 4 3 1
6: 0 child care provider fluent 4 3 2
7: 0 teacher used interpreter 4 4 1
8: 1 construction worker fluent 4 4 2
9: 1 teacher used interpreter 4 5 1
10: 0 nurse used interpreter 4 5 2
The `choice` column indicates if the profile in each row was selected by the respondent.
As noted above, these data are in "long" format. Later in the vignette, it will be useful to have the same information in "wide" format, with new columns indicating the attributes of every profile in each task. To create these new columns, we use `dcast()` from the `data.table` package. This function is analogous to `reshape()` in base `R` or `pivot_wider()` in the `tidyverse`:
dat <- data.table(dat)

# one row per respondent-task, with separate columns for each profile's attributes
wide <- dcast(dat,
    respondent + task ~ profile,
    value.var = c("language", "job")
)

# merge the wide columns back into the long data (keyed on respondent and task)
dat <- merge(dat, wide)
We now have new columns called `language_1` and `language_2`, indicating the language skills of profiles 1 and 2 respectively in each task.
dat[dat$respondent == 4, ]
Key: <respondent, task>
respondent task choice job language profile language_1 language_2 job_1 job_2
<int> <int> <num> <fctr> <fctr> <num> <fctr> <fctr> <fctr> <fctr>
1: 4 1 1 nurse tried but unable 1 tried but unable used interpreter nurse child care provider
2: 4 1 0 child care provider used interpreter 2 tried but unable used interpreter nurse child care provider
3: 4 2 0 gardener fluent 1 fluent fluent gardener construction worker
4: 4 2 1 construction worker fluent 2 fluent fluent gardener construction worker
5: 4 3 1 nurse broken 1 broken fluent nurse child care provider
6: 4 3 0 child care provider fluent 2 broken fluent nurse child care provider
7: 4 4 0 teacher used interpreter 1 used interpreter fluent teacher construction worker
8: 4 4 1 construction worker fluent 2 used interpreter fluent teacher construction worker
9: 4 5 1 teacher used interpreter 1 used interpreter used interpreter teacher nurse
10: 4 5 0 nurse used interpreter 2 used interpreter used interpreter teacher nurse
20 Marginal means
As described by Leeper, Hobolt, and Tilley (2020), a “marginal mean describes the level of favorability toward profiles that have a particular feature level, ignoring all other features.”
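In the notation introduced above, the marginal mean for level \(l\) of an attribute can be written informally as

\[
\textrm{MM}(l) = \textrm{E}\left[ Y_{ijk} \mid L_{ijk} = l \right],
\]

where \(Y_{ijk}\) indicates whether respondent \(i\) chose profile \(j\) in task \(k\), \(L_{ijk}\) is the profile's level on the attribute of interest, and the expectation averages over the randomized distribution of all other attributes. This is a simplified statement of the estimand; see Leeper, Hobolt, and Tilley (2020) for a formal treatment.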
To compute marginal means, we proceed in 3 steps:

- Estimate a regression model with `choice` as outcome and the attributes of interest as predictors.
- Compute the predicted (i.e., fitted) values for each row in the original dataset.
- Marginalize (average) those predictions with respect to the variable of interest.
To illustrate, we estimate a linear regression model with interactions between the `language` and `job` variables:
mod <- lm(choice ~ job * language, data = dat)
Then, we use the `avg_predictions()` function to compute and marginalize predicted values. Note that we use the `vcov` argument to report standard errors clustered at the respondent level.
avg_predictions(mod, by = "language", vcov = ~respondent)
language Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
tried but unable 0.459 0.00733 62.7 <0.001 Inf 0.445 0.474
used interpreter 0.424 0.00749 56.6 <0.001 Inf 0.410 0.439
fluent 0.588 0.00726 81.0 <0.001 Inf 0.573 0.602
broken 0.526 0.00746 70.6 <0.001 Inf 0.512 0.541
Type: response
Columns: language, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
The results above suggest that, ignoring (or averaging over) the `job` attribute, profiles of "fluent" English speakers are chosen more often than profiles with other `language` values.
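To see what `avg_predictions()` does under the hood, the same point estimates can be computed "by hand" by following the three steps listed above. This minimal sketch reproduces the point estimates only; it does not compute clustered standard errors:

# Step 2: compute fitted values for every row of the original data
dat$fitted <- predict(mod)

# Step 3: average the fitted values within each level of `language`
aggregate(fitted ~ language, data = dat, FUN = mean)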
20.1 Hypothesis tests
Using the `hypothesis` argument, we can easily conduct various null hypothesis tests on the estimated marginal means. For example, is the probability of choosing a "fluent" profile equal to the probability of choosing a "tried but unable" profile? The `b1` and `b3` labels refer to the first and third rows of the marginal means output above ("tried but unable" and "fluent", respectively):
avg_predictions(mod,
hypothesis = "b1 = b3",
by = "language",
vcov = ~respondent)
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
-0.128 0.0119 -10.8 <0.001 87.6 -0.152 -0.105
Term: b1=b3
Type: response
Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Is the difference in probability between "fluent" and "broken" (`b3 - b4`) equal to the difference in probability between "tried but unable" and "used interpreter" (`b1 - b2`)?
avg_predictions(mod,
hypothesis = "b1 - b2 = b3 - b4",
by = "language",
vcov = ~respondent)
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
-0.0264 0.0168 -1.57 0.116 3.1 -0.0594 0.00657
Term: b1-b2=b3-b4
Type: response
Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
20.2 Subgroups
Modifying the `by` argument allows analysts to report marginal means for different subgroups of their data, and `newdata` can be used to exclude uninteresting profiles:
avg_predictions(mod,
by = c("language", "job"),
newdata = subset(dat, job %in% c("doctor", "gardener")),
vcov = ~respondent)
language job Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
fluent gardener 0.534 0.0243 22.0 <0.001 353.5 0.486 0.581
fluent doctor 0.772 0.0332 23.2 <0.001 394.4 0.707 0.837
used interpreter doctor 0.602 0.0427 14.1 <0.001 147.5 0.518 0.685
tried but unable gardener 0.416 0.0237 17.5 <0.001 226.5 0.370 0.463
used interpreter gardener 0.387 0.0244 15.8 <0.001 185.1 0.339 0.435
broken gardener 0.509 0.0245 20.8 <0.001 317.0 0.461 0.557
tried but unable doctor 0.570 0.0443 12.9 <0.001 123.5 0.483 0.657
broken doctor 0.712 0.0394 18.1 <0.001 239.6 0.635 0.789
Type: response
Columns: language, job, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
21 Average Marginal Component Effects
Average Marginal Component Effects (AMCE) are defined and analyzed in Hainmueller, Hopkins, and Yamamoto (2014). Roughly speaking, an AMCE can be viewed as the average effect of changing one attribute on choice, while holding all other attributes of the profile constant. To compute an AMCE, we can proceed in 4 steps (sketched manually below):

- Create a dataset A where the `language` column is equal to "fluent English" in every row.
- Create a dataset B where the `language` column is equal to "broken English" in every row.
- Compute predicted (fitted) values for every row in datasets A and B.
- Compute the average difference between predicted values in the two datasets.
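For intuition, here is a minimal by-hand sketch of these four steps. It reproduces only the point estimate of the "broken" vs. "fluent" contrast; `avg_comparisons()` below computes all contrasts along with standard errors:

# Step 1: dataset A, where every profile is fluent in English
A <- copy(dat)
A$language <- factor("fluent", levels = levels(dat$language))

# Step 2: dataset B, where every profile speaks broken English
B <- copy(dat)
B$language <- factor("broken", levels = levels(dat$language))

# Steps 3 and 4: predicted values in both datasets, then the average difference
mean(predict(mod, newdata = B) - predict(mod, newdata = A))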
This can be achieved easily using the `avg_comparisons()` function from the `marginaleffects` package:
avg_comparisons(mod, variables = "language", vcov = ~respondent)
Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
mean(broken) - mean(fluent) -0.0592 0.0120 -4.95 <0.001 20.4 -0.0826 -0.0357
mean(tried but unable) - mean(fluent) -0.1271 0.0119 -10.66 <0.001 85.7 -0.1504 -0.1037
mean(used interpreter) - mean(fluent) -0.1608 0.0121 -13.27 <0.001 131.2 -0.1845 -0.1370
Term: language
Type: response
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
21.1 Empirical vs. balanced grid
In the example above, `avg_comparisons()` marginalized across the realized distribution of attributes observed in the actual dataset. An alternative would be to marginalize over a perfectly balanced ("uniform") grid of treatment conditions. Of course, empirical and uniform grids will yield nearly identical results if the sample is large and randomization was successful.

The uniform approach is the default in the `amce()` function from the `cjoint` package, and its behavior can be replicated using the `datagrid()` function and the `newdata` argument of `avg_comparisons()`:
library(cjoint)
amce_results <- amce(
choice ~ language * job,
data = dat,
cluster = TRUE,
respondent.id = "respondent")
summary(amce_results)$amce |> subset(Attribute == "language")
Attribute Level Estimate Std. Err z value Pr(>|z|)
11 language broken -0.06834528 0.01357451 -5.034826 4.782835e-07 ***
12 language tried but unable -0.12843978 0.01343677 -9.558832 1.190935e-21 ***
13 language used interpreter -0.16584337 0.01382564 -11.995352 3.758141e-33 ***
avg_comparisons(mod,
newdata = datagrid(FUN_factor = unique, FUN_character = unique),
variables = "language",
vcov = ~respondent)
Contrast Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
mean(broken) - mean(fluent) -0.0683 0.0136 -5.03 <0.001 21.0 -0.095 -0.0417
mean(tried but unable) - mean(fluent) -0.1284 0.0134 -9.56 <0.001 69.5 -0.155 -0.1021
mean(used interpreter) - mean(fluent) -0.1658 0.0138 -12.00 <0.001 107.7 -0.193 -0.1387
Term: language
Type: response
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
22 Average Feature Choice Probability
Abramson et al. (2024) introduce an alternative estimand for forced-choice conjoint experiments: the Average Feature Choice Probability (AFCP). The main difference between AMCE and AFCP lies in their approach to handling attribute comparisons.
AMCE averages over both direct and indirect attribute comparisons, potentially incorporating information about irrelevant attributes, and thus imposes a strong transitivity assumption. In some cases, AMCE can suggest positive effects even when, in direct comparisons, respondents are on average less likely to choose a profile with the feature of interest over the baseline. In contrast, AFCP focuses solely on direct comparisons between attributes, offering a more accurate representation of respondents' direct preferences without the influence of irrelevant attributes.
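In the notation used earlier, the AFCP for levels \(l\) and \(l'\) of an attribute can be stated informally as

\[
\textrm{AFCP}(l, l') = \Pr\left( Y_{ijk} = 1 \mid L_{ijk} = l, \; L_{ij'k} = l' \right), \qquad j' \neq j,
\]

that is, the probability of choosing a profile with level \(l\) when the competing profile \(j'\) in the same task has level \(l'\). This is a simplified statement; Abramson et al. (2024) give the formal definition.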
To estimate AFCP, we once again start by estimating a linear regression model. This time, the model is even more flexible. Specifically, the model allows the effect of language skills in the first profile to depend on the value of language skills in the second profile. Likewise, other attributes can influence the probability of selection differently based on attributes in the comparison profile. To achieve this, we interact `language_1` with `language_2`, and `job_1` with `job_2`.

Moreover, since the data is in "long" format, with one profile per row, we must also allow each variable to have different coefficients based on the `profile` number: the effect of `language_1` on the probability of selection is obviously different for `profile=1` and for `profile=2`. Indeed, when `profile=1`, the `language_1` column records the profile's own language skills. When `profile=2`, the same column records the alternative profile's language skills.

Thankfully, it is trivial to allow this flexibility by interacting the attributes with the `profile` variable:
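One specification consistent with this description is sketched below (equivalent parameterizations exist; with only two profiles per task, factor and numeric codings of `profile` fit identically):

# attribute effects may vary with the competing profile's attributes
# and with the profile's position in the task
mod <- lm(
    choice ~ factor(profile) * (language_1 * language_2 + job_1 * job_2),
    data = dat
)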
As explained above and detailed in Abramson et al. (2024), the AFCP is a choice pair-specific quantity. This means that we need to average predictions (fitted values) across covariates, within each unique pair of attributes of interest. To allow this, we create a new variable called `alternative`:
# `alternative` records the language skills of the competing profile in the same task
dat$alternative <- fifelse(
    dat$language == dat$language_1,
    dat$language_2,
    dat$language_1
)
This table shows the language skills and job of both profiles in the first task faced by respondent number 4. The `alternative` column simply shows the language skills of the alternative profile within each task:
subset(dat,
task == 1 & respondent == 4,
select = c("respondent", "task", "profile", "job", "language", "alternative")) |>
tt()
respondent | task | profile | job | language | alternative |
---|---|---|---|---|---|
4 | 1 | 1 | nurse | tried but unable | used interpreter |
4 | 1 | 2 | child care provider | used interpreter | tried but unable |
Now we compute the AFCP by averaging fitted values within each unique pair of language attributes (the combination of `language` and `alternative`). Since we are not interested in comparison pairs where both profiles have the same language skills, we use `subset()` in the `newdata` argument to supply an appropriate grid.
p <- avg_predictions(mod,
by = c("language", "alternative"),
newdata = subset(dat, language != alternative),
vcov = ~respondent)
Display the results in a nice `tinytable`, highlighting the rows where the alternative profile is "fluent":
library(tinytable)
idx <- which(p$alternative == "fluent")
print(p, "tinytable") |> style_tt(i = idx, background = "pink")
language | alternative | Estimate | Std. Error | z | Pr(>|z|) | S | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|---|
tried but unable | used interpreter | 0.541 | 0.0183 | 29.7 | <0.001 | 639.5 | 0.505 | 0.577 |
used interpreter | tried but unable | 0.459 | 0.0183 | 25.1 | <0.001 | 460.9 | 0.423 | 0.495 |
broken | fluent | 0.407 | 0.0164 | 24.8 | <0.001 | 447.7 | 0.375 | 0.439 |
fluent | broken | 0.593 | 0.0164 | 36.1 | <0.001 | 947.4 | 0.561 | 0.625 |
used interpreter | fluent | 0.361 | 0.0157 | 22.9 | <0.001 | 384.5 | 0.330 | 0.392 |
fluent | used interpreter | 0.639 | 0.0157 | 40.6 | <0.001 | Inf | 0.608 | 0.670 |
broken | tried but unable | 0.577 | 0.0166 | 34.7 | <0.001 | 874.9 | 0.544 | 0.609 |
tried but unable | broken | 0.423 | 0.0166 | 25.5 | <0.001 | 474.1 | 0.391 | 0.456 |
used interpreter | broken | 0.381 | 0.0163 | 23.3 | <0.001 | 396.6 | 0.349 | 0.413 |
broken | used interpreter | 0.619 | 0.0163 | 37.9 | <0.001 | Inf | 0.587 | 0.651 |
fluent | tried but unable | 0.624 | 0.0165 | 37.7 | <0.001 | Inf | 0.591 | 0.656 |
tried but unable | fluent | 0.376 | 0.0165 | 22.8 | <0.001 | 378.4 | 0.344 | 0.409 |

Type: response
Columns: language, alternative, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
The highlighted results (rows where `alternative` is "fluent") are the same as those produced by the `afcp` package:
library(afcp)
afcp_results <- afcp(
amce_results,
respondent.id = "respondent",
task.id = "task",
profile.id = "profile",
attribute = "language")
afcp_results$afcp
level baseline afcp se zstat pval conf_high conf_low conf_level
1 broken fluent 0.4067416 0.01658803 -5.622030 1.887265e-08 0.4392535 0.3742296 0.95
2 triedbutunable fluent 0.3763066 0.01685034 -7.340705 2.124724e-13 0.4093327 0.3432806 0.95
3 usedinterpreter fluent 0.3610799 0.01592112 -8.725525 2.649493e-18 0.3922847 0.3298750 0.95
22.1 Hypothesis tests
A powerful feature of `marginaleffects` is that all its functions include a `hypothesis` argument, which can be used to conduct hypothesis tests on arbitrary functions of estimates. For example, let's compute the AFCP for a subset of profile comparisons:
p <- avg_predictions(mod,
by = c("language", "alternative"),
newdata = subset(dat, language != alternative & alternative == "fluent"),
vcov = ~respondent)
p
language alternative Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
broken fluent 0.407 0.0164 24.8 <0.001 447.7 0.375 0.439
used interpreter fluent 0.361 0.0157 22.9 <0.001 384.5 0.330 0.392
tried but unable fluent 0.376 0.0165 22.8 <0.001 378.4 0.344 0.409
Type: response
Columns: language, alternative, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
Now, let's say we would like to test if any of the pairwise differences between these AFCP estimates is different from zero. All we need to do is add a `hypothesis="pairwise"` argument:
avg_predictions(mod,
hypothesis = "pairwise",
by = c("language", "alternative"),
newdata = subset(dat, language != alternative & alternative == "fluent"),
vcov = ~respondent)
Term Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
broken - used interpreter 0.0457 0.0226 2.02 0.0429 4.5 0.00146 0.0899
broken - tried but unable 0.0304 0.0228 1.34 0.1811 2.5 -0.01417 0.0750
used interpreter - tried but unable -0.0152 0.0224 -0.68 0.4968 1.0 -0.05915 0.0287
Type: response
Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high