Conjoint experiments

A forced-choice conjoint experiment is a research methodology used in many fields such as marketing and political science to understand how people make decisions between “profiles” characterized by multiple “attributes.” In this type of experiment, participants are presented with a series of choices between different products, services, or scenarios. Each option is described by a set of attributes (e.g., price, quality, brand, features), and the levels of these attributes are varied randomly across the options presented.

Consider an experiment where researchers ask survey respondents “to act as immigration officials and to decide which of a pair of immigrants they would choose for admission” (Hainmueller, Hopkins, and Yamamoto 2014). They display a table in which each column represents a distinct immigrant “profile” with randomized attributes. For example:

A single forced-choice task in a conjoint experiment. The survey respondent must choose one of the two profiles before seeing a second task.

Attributes       Profile 1            Profile 2
Language Skills  Fluent in English    Broken English
Job              Construction worker  Nurse

The survey respondent has the “task” of choosing one of the two profiles. Then, the researchers display a new task, including profiles with different randomized attributes, and the respondent chooses again.

By analyzing the choices made by participants, researchers can estimate the relative importance of different attributes in the decision-making process.

The rest of this vignette shows how to use the marginaleffects package to estimate the main quantities of interest in such experiments.

Data

To illustrate, we use data published alongside the Political Analysis article by Hainmueller, Hopkins, and Yamamoto (2014). In this experiment, each survey respondent \(i\) receives several tasks \(k\), in which they select one of two profiles \(j\), characterized by attributes \(l\).

For simplicity, we consider a subset of the data with 5 tasks per respondent, 2 profiles per task, and 2 attributes per profile. The data is structured in “long” format, with one respondent-task-profile combination per row.

These are the entries for survey respondent number 4:

library(marginaleffects)
library(data.table)
library(tinytable)

dat <- readRDS(url("https://marginaleffects.com/data/conjoint_immigration.rds", "rb"))

dat[dat$respondent == 4, ]
    choice                 job         language respondent  task profile
     <num>              <fctr>           <fctr>      <int> <int>   <num>
 1:      1               nurse tried but unable          4     1       1
 2:      0 child care provider used interpreter          4     1       2
 3:      0            gardener           fluent          4     2       1
 4:      1 construction worker           fluent          4     2       2
 5:      1               nurse           broken          4     3       1
 6:      0 child care provider           fluent          4     3       2
 7:      0             teacher used interpreter          4     4       1
 8:      1 construction worker           fluent          4     4       2
 9:      1             teacher used interpreter          4     5       1
10:      0               nurse used interpreter          4     5       2

The choice column indicates whether the profile in each row was selected by the respondent.
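
Because this is a forced-choice design, exactly one of the two profiles in each task should be selected. A quick sanity check that choice sums to 1 within every respondent-task pair:

# should return TRUE: one chosen profile per task
all(aggregate(choice ~ respondent + task, data = dat, FUN = sum)$choice == 1)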

As noted above, these data are in “long” format. Later in the vignette, it will be useful to have the same information in “wide” format, with new columns recording the attributes of every profile in each task. To create these new columns, we use dcast() from the data.table package. This function is analogous to reshape() in base R or pivot_wider() in the tidyverse:

dat <- data.table(dat)
wide <- dcast(dat, 
  respondent + task ~ profile,
  value.var = c("language", "job")
)
dat <- merge(dat, wide)
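
For reference, here is a tidyverse sketch of the same reshape, assuming the tidyr package is installed (it is not needed for the rest of this vignette):

library(tidyr)
wide <- pivot_wider(dat,
  id_cols = c(respondent, task),
  names_from = profile,
  values_from = c(language, job)
)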

We now have new columns called language_1 and language_2 indicating the language skills of profiles 1 and 2 respectively in each task.

dat[dat$respondent == 4, ]
Key: <respondent, task>
    respondent  task choice                 job         language profile       language_1       language_2    job_1               job_2
         <int> <int>  <num>              <fctr>           <fctr>   <num>           <fctr>           <fctr>   <fctr>              <fctr>
 1:          4     1      1               nurse tried but unable       1 tried but unable used interpreter    nurse child care provider
 2:          4     1      0 child care provider used interpreter       2 tried but unable used interpreter    nurse child care provider
 3:          4     2      0            gardener           fluent       1           fluent           fluent gardener construction worker
 4:          4     2      1 construction worker           fluent       2           fluent           fluent gardener construction worker
 5:          4     3      1               nurse           broken       1           broken           fluent    nurse child care provider
 6:          4     3      0 child care provider           fluent       2           broken           fluent    nurse child care provider
 7:          4     4      0             teacher used interpreter       1 used interpreter           fluent  teacher construction worker
 8:          4     4      1 construction worker           fluent       2 used interpreter           fluent  teacher construction worker
 9:          4     5      1             teacher used interpreter       1 used interpreter used interpreter  teacher               nurse
10:          4     5      0               nurse used interpreter       2 used interpreter used interpreter  teacher               nurse

Marginal means

As described by Leeper, Hobolt, and Tilley (2020), a “marginal mean describes the level of favorability toward profiles that have a particular feature level, ignoring all other features.”

To compute marginal means, we proceed in 3 steps:

  1. Estimate a regression model with choice as outcome and the attributes of interest as predictors.
  2. Compute the predicted (i.e., fitted) values for each row in the original dataset.
  3. Marginalize (average) those predictions with respect to the variable of interest.

To illustrate, we estimate a linear regression model with interactions between the language and job variables:

mod <- lm(choice ~ job * language, data = dat)

Then, we use the avg_predictions() function to compute and marginalize predicted values. Note that we use the vcov argument to report standard errors clustered at the respondent level.

avg_predictions(mod, by = "language", vcov = ~respondent)

         language Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
 tried but unable    0.459    0.00733 62.7   <0.001 Inf 0.445  0.474
 used interpreter    0.424    0.00749 56.6   <0.001 Inf 0.410  0.439
 fluent              0.588    0.00726 81.0   <0.001 Inf 0.573  0.602
 broken              0.526    0.00746 70.6   <0.001 Inf 0.512  0.541

Columns: language, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

The results above suggest that, ignoring (or averaging over) the job attribute, profiles fluent in English are chosen more often than profiles with any other language value.
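
As a sanity check, these marginal means are simply averages of the model’s fitted values within each level of language. A minimal by-hand sketch (point estimates only, without the clustered standard errors that avg_predictions() reports):

fit_vals <- predict(mod, newdata = dat)  # fitted value for every row
tapply(fit_vals, dat$language, mean)     # average by language level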

Hypothesis tests

Using the hypothesis argument, we can easily conduct various null hypothesis tests on the estimated marginal means. For example, is the probability of choosing a “fluent” profile equal to the probability of choosing a “tried but unable” profile?

avg_predictions(mod, 
  hypothesis = "b1 = b3",
  by = "language", 
  vcov = ~respondent)

  Term Estimate Std. Error     z Pr(>|z|)    S  2.5 % 97.5 %
 b1=b3   -0.128     0.0119 -10.8   <0.001 87.6 -0.152 -0.105

Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 
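
This estimate is simply the difference between the first (“tried but unable”) and third (“fluent”) marginal means reported above: 0.459 − 0.588 ≈ −0.128.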

Is the difference in probability between “fluent” and “broken” equal to the difference in probability between “tried but unable” and “used interpreter”?

avg_predictions(mod, 
  hypothesis = "b1 - b2 = b3 - b4",
  by = "language", 
  vcov = ~respondent)

        Term Estimate Std. Error     z Pr(>|z|)   S   2.5 %  97.5 %
 b1-b2=b3-b4  -0.0264     0.0168 -1.57    0.116 3.1 -0.0594 0.00657

Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

Subgroups

Modifying the by argument allows analysts to report marginal means for different subgroups of their data, and newdata can be used to exclude uninteresting profiles:

avg_predictions(mod, 
  by = c("language", "job"),
  newdata = subset(dat, job %in% c("doctor", "gardener")),
  vcov = ~respondent)

         language      job Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
 fluent           gardener    0.534     0.0243 22.0   <0.001 353.5 0.486  0.581
 fluent           doctor      0.772     0.0332 23.2   <0.001 394.4 0.707  0.837
 used interpreter doctor      0.602     0.0427 14.1   <0.001 147.5 0.518  0.685
 tried but unable gardener    0.416     0.0237 17.5   <0.001 226.5 0.370  0.463
 used interpreter gardener    0.387     0.0244 15.8   <0.001 185.1 0.339  0.435
 broken           gardener    0.509     0.0245 20.8   <0.001 316.9 0.461  0.557
 tried but unable doctor      0.570     0.0443 12.9   <0.001 123.5 0.483  0.657
 broken           doctor      0.712     0.0394 18.1   <0.001 239.6 0.635  0.789

Columns: language, job, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

Average Marginal Component Effects

Average Marginal Component Effects (AMCE) are defined and analyzed in Hainmueller, Hopkins, and Yamamoto (2014). Roughly speaking, they can be viewed as the average effects of changing one attribute on choice, while holding all other attributes of the profile constant. To compute an AMCE, we can proceed in 4 steps:

  1. Create a dataset A where the language column is equal to “fluent” in every row.
  2. Create a dataset B where the language column is equal to “broken” in every row.
  3. Compute predicted (fitted) values for every row in datasets A and B.
  4. Compute the average difference between the predicted values in datasets B and A.
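
For intuition, here is a minimal by-hand version of these four steps (point estimates only; the package function below also provides standard errors):

A <- transform(dat, language = "fluent")  # step 1: every profile fluent
B <- transform(dat, language = "broken")  # step 2: every profile broken
pA <- predict(mod, newdata = A)           # step 3: fitted values under A...
pB <- predict(mod, newdata = B)           # ...and under B
mean(pB - pA)                             # step 4: AMCE of "broken" relative to "fluent"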

This can be achieved easily using the avg_comparisons() function from the marginaleffects package:

avg_comparisons(mod, variables = "language", vcov = ~respondent)

     Term                  Contrast Estimate Std. Error      z Pr(>|z|)     S   2.5 %  97.5 %
 language broken - fluent            -0.0592     0.0120  -4.95   <0.001  20.4 -0.0826 -0.0357
 language tried but unable - fluent  -0.1271     0.0119 -10.66   <0.001  85.7 -0.1504 -0.1037
 language used interpreter - fluent  -0.1608     0.0121 -13.27   <0.001 131.2 -0.1845 -0.1370

Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

Empirical vs. balanced grid

In the examples above, predictions were marginalized across the realized distribution of attributes observed in the actual dataset. An alternative is to marginalize over a perfectly balanced (“uniform”) grid of treatment conditions. Of course, empirical and uniform grids will yield nearly identical results if the sample is large and randomization was successful.

The uniform approach is the default in the amce() function from the cjoint package, and the behavior can be replicated using the datagrid() function and the newdata argument in avg_comparisons():

library(cjoint)

amce_results <- amce(
  choice ~ language * job,
  data = dat,
  cluster = TRUE,
  respondent.id = "respondent")

summary(amce_results)$amce |> subset(Attribute == "language")
   Attribute            Level    Estimate   Std. Err    z value     Pr(>|z|)    
11  language           broken -0.06834528 0.01357451  -5.034826 4.782835e-07 ***
12  language tried but unable -0.12843978 0.01343677  -9.558832 1.190935e-21 ***
13  language used interpreter -0.16584337 0.01382564 -11.995352 3.758141e-33 ***

avg_comparisons(mod,
  newdata = datagrid(FUN_factor = unique, FUN_character = unique),
  variables = "language",
  vcov = ~respondent)

     Term                  Contrast Estimate Std. Error      z Pr(>|z|)     S  2.5 %  97.5 %
 language broken - fluent            -0.0683     0.0136  -5.03   <0.001  21.0 -0.095 -0.0417
 language tried but unable - fluent  -0.1284     0.0134  -9.56   <0.001  69.5 -0.155 -0.1021
 language used interpreter - fluent  -0.1658     0.0138 -12.00   <0.001 107.7 -0.193 -0.1387

Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 
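
The balanced grid can also be built by hand with expand.grid(). This sketch reproduces the uniform-grid point estimate for the “broken” vs. “fluent” contrast (standard errors omitted):

# one row per combination of language and job levels
grid <- expand.grid(
  language = unique(dat$language),
  job = unique(dat$job)
)
gA <- transform(grid, language = "fluent")
gB <- transform(grid, language = "broken")
mean(predict(mod, newdata = gB) - predict(mod, newdata = gA))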

Average Feature Choice Probability

Abramson et al. (2024) introduce an alternative estimand for forced-choice conjoint experiments: the Average Feature Choice Probability (AFCP). The main difference between AMCE and AFCP lies in their approach to handling attribute comparisons.

The AMCE averages over both direct and indirect attribute comparisons, potentially incorporating information about irrelevant attributes, and thus imposes a strong transitivity assumption. In some cases, the AMCE can suggest positive effects even when, in direct comparisons, respondents are on average less likely to choose a profile with the feature of interest over the baseline. In contrast, the AFCP focuses solely on direct comparisons between attributes, offering a more accurate representation of respondents’ direct preferences, free from the influence of irrelevant attributes.

To estimate AFCP, we once again start by estimating a linear regression model. This time, the model is even more flexible. Specifically, the model allows the effect of language skills in the first profile to depend on the value of language skills in the second profile. Likewise, other attributes can influence the probability of selection differently based on attributes in the comparison profile. To achieve this, we interact language_1 with language_2, and job_1 with job_2.

Moreover, since the data is in “long” format, with one profile per row, we must also allow each variable to have different coefficients based on profile number: the effect of language_1 on the probability of selection is obviously different for profile=1 and for profile=2. Indeed, when profile=1 the language_1 column records the profile’s own language skills. When profile=2, the same column records the alternative profile’s language skills.

Thankfully, it is trivial to allow this flexibility by interacting the attributes with the profile variable:

mod <- lm(choice ~ factor(profile) * (language_1 * language_2 + job_1 * job_2), data = dat)

As explained above and detailed in Abramson et al. (2024), the AFCP is a choice-pair-specific quantity. This means that we need to average predictions (fitted values) across covariates, within each unique pair of attributes of interest. To allow this, we create a new variable called alternative:¹

# For each row, record the language skills of the other profile in the same task
dat$alternative <- fifelse(
  dat$language == dat$language_1, 
  dat$language_2, 
  dat$language_1
)

This table shows the language skills and job of both profiles in the first task faced by respondent number 4. The alternative column simply shows the language skills of the alternative profile within each task:

subset(dat,
  task == 1 & respondent == 4, 
  select = c("respondent", "task", "profile", "job", "language", "alternative")) |>
  tt()
respondent  task  profile  job                  language          alternative
         4     1        1  nurse                tried but unable  used interpreter
         4     1        2  child care provider  used interpreter  tried but unable

Now we compute the AFCP by averaging fitted values within each unique pair of language attributes, that is, within each combination of language and alternative. Since we are not interested in comparison pairs where both profiles have the same language skills, we use subset() to supply an appropriate grid via the newdata argument.

p <- avg_predictions(mod, 
  by = c("language", "alternative"), 
  newdata = subset(dat, language != alternative),
  vcov = ~respondent)

We display the results in a tinytable, highlighting the rows where the alternative profile is “fluent”:

library(tinytable)
idx <- which(p$alternative == "fluent")
print(p, "tinytable") |> style_tt(i = idx, background = "pink")
language          alternative       Estimate  Std. Error     z  Pr(>|z|)      S  2.5 %  97.5 %
tried but unable  used interpreter     0.541      0.0183  29.7    <0.001  639.5  0.505   0.577
used interpreter  tried but unable     0.459      0.0183  25.1    <0.001  460.9  0.423   0.495
broken            fluent               0.407      0.0164  24.8    <0.001  447.7  0.375   0.439
fluent            broken               0.593      0.0164  36.1    <0.001  947.4  0.561   0.625
used interpreter  fluent               0.361      0.0157  22.9    <0.001  384.5  0.330   0.392
fluent            used interpreter     0.639      0.0157  40.6    <0.001    Inf  0.608   0.670
broken            tried but unable     0.577      0.0166  34.7    <0.001  874.9  0.544   0.609
tried but unable  broken               0.423      0.0166  25.5    <0.001  474.1  0.391   0.456
used interpreter  broken               0.381      0.0163  23.3    <0.001  396.6  0.349   0.413
broken            used interpreter     0.619      0.0163  37.9    <0.001    Inf  0.587   0.651
fluent            tried but unable     0.624      0.0165  37.7    <0.001    Inf  0.591   0.656
tried but unable  fluent               0.376      0.0165  22.8    <0.001  378.4  0.344   0.409

The rows highlighted in pink (those where alternative is “fluent”) are the same as the results produced by the afcp package:

library(afcp)

afcp_results <- afcp(
  amce_results,
  respondent.id = "respondent",
  task.id = "task",
  profile.id = "profile",
  attribute = "language")

afcp_results$afcp
            level baseline      afcp         se     zstat         pval conf_high  conf_low conf_level
1          broken   fluent 0.4067416 0.01658803 -5.622030 1.887265e-08 0.4392535 0.3742296       0.95
2  triedbutunable   fluent 0.3763066 0.01685034 -7.340705 2.124724e-13 0.4093327 0.3432806       0.95
3 usedinterpreter   fluent 0.3610799 0.01592112 -8.725525 2.649493e-18 0.3922847 0.3298750       0.95
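
Because the model interacts each attribute with the profile indicator and with the alternative profile’s attribute, the model-based AFCP within each (language, alternative) cell should coincide with the raw frequency with which one feature is chosen over the other in direct comparisons. A quick check for the “broken” vs. “fluent” pair (compare to the 0.407 estimate above):

with(subset(dat, language == "broken" & alternative == "fluent"), mean(choice))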

Hypothesis tests

A powerful feature of marginaleffects is that all its functions include a hypothesis argument which can be used to conduct hypothesis tests on arbitrary functions of estimates. For example, let’s compute the AFCP for a subset of profile comparisons:

p <- avg_predictions(mod, 
  by = c("language", "alternative"), 
  newdata = subset(dat, language != alternative & alternative == "fluent"),
  vcov = ~respondent)
p

         language alternative Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
 broken                fluent    0.407     0.0164 24.8   <0.001 447.7 0.375  0.439
 used interpreter      fluent    0.361     0.0157 22.9   <0.001 384.5 0.330  0.392
 tried but unable      fluent    0.376     0.0165 22.8   <0.001 378.4 0.344  0.409

Columns: language, alternative, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

Now, let’s say we would like to test whether any of the pairwise differences between these AFCP estimates is different from zero. All we need to do is add a hypothesis = "pairwise" argument:

avg_predictions(mod, 
  hypothesis = "pairwise",
  by = c("language", "alternative"), 
  newdata = subset(dat, language != alternative & alternative == "fluent"),
  vcov = ~respondent)

                                Term Estimate Std. Error     z Pr(>|z|)   S    2.5 % 97.5 %
 broken - used interpreter             0.0457     0.0226  2.02   0.0429 4.5  0.00146 0.0899
 broken - tried but unable             0.0304     0.0228  1.34   0.1811 2.5 -0.01417 0.0750
 used interpreter - tried but unable  -0.0152     0.0224 -0.68   0.4968 1.0 -0.05915 0.0287

Columns: term, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
Type:  response 

References

Abramson, Scott F, Korhan Kocak, Asya Magazinnik, and Anton Strezhnev. 2024. “Detecting Preference Cycles in Forced-Choice Conjoint Experiments.” SocArXiv. https://doi.org/10.31235/osf.io/xjre9.
Hainmueller, Jens, Daniel J. Hopkins, and Teppei Yamamoto. 2014. “Causal Inference in Conjoint Analysis: Understanding Multidimensional Choices via Stated Preference Experiments.” Political Analysis 22 (1): 1–30.
Leeper, Thomas J., Sara B. Hobolt, and James Tilley. 2020. “Measuring Subgroup Preferences in Conjoint Experiments.” Political Analysis 28 (2): 207–21. https://doi.org/10.1017/pan.2019.30.

Footnotes

  1. The fifelse() function from data.table preserves factor levels; the standard ifelse() does not.