9 Categorical and ordinal outcomes
This chapter shows how the framework and tools introduced in Chapter 3 can help us give meaningful interpretation to the estimates obtained by fitting a categorical or ordinal outcome model.
We say that the dependent variable of a regression model is categorical when it is discrete, involves a finite number of categories, but does not have a natural ordering. Categorical outcomes are very common in data analysis, for example, when one wishes to model the choice of a mode of transportation (e.g., car, bus, bike, walk). A popular way to approach such data is to specify and fit a multinomial logit model. In such a model, the probability of each category \(j\) of the outcome variable is modeled as:
\[P(Y = j) = \frac{\exp(\mathbf{X}\beta_j)}{\sum_{k=1}^{J} \exp(\mathbf{X}\beta_k)}\]
where \(\mathbf{X}\) represents the vector of predictors and \(\beta_j\) a vector of coefficients specific to category \(j\).
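Although this chapter does not fit a multinomial logit, readers curious about what such a model looks like in code can consider this minimal sketch using the nnet package; the data frame, variables, and seed are hypothetical and purely illustrative.
library(nnet)
# Hypothetical toy data: transportation mode choice (illustration only)
set.seed(48103)
d <- data.frame(
  mode = sample(c("car", "bus", "bike", "walk"), 300, replace = TRUE),
  income = rnorm(300),
  distance = rexp(300))
# multinom() estimates one set of coefficients per non-baseline category
fit <- multinom(mode ~ income + distance, data = d, trace = FALSE)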
We say that the dependent variable of a regression model is ordinal when it is discrete, involves a finite number of categories, and has a natural ordering. Ordinal outcomes are very common in data analysis, for example, when one wishes to model levels of satisfaction (e.g., very dissatisfied, dissatisfied, neutral, satisfied, very satisfied). A popular way to approach such data is to specify and fit an ordered probit model. In such a model, the probability of each category \(j\) of the outcome variable is modeled as (Cameron and Trivedi 2005):
\[P(Y = j) = \Phi(\theta_j - \mathbf{X}\beta) - \Phi(\theta_{j-1} - \mathbf{X}\beta)\]
where \(\mathbf{X}\) represents the vector of predictors, \(\beta\) the vector of coefficients, \(\theta_j\) and \(\theta_{j-1}\) the threshold parameters defining the boundaries between categories \(j\) and \(j-1\), and \(\Phi\) the standard normal cumulative distribution function.
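To make the formula concrete, here is a small sketch in R with hypothetical threshold and linear predictor values (none of these numbers come from the model fitted below):
# Hypothetical thresholds for a 4-category outcome, padded with -Inf and Inf
theta <- c(-Inf, 0.0, 0.7, 1.5, Inf)
xb <- 0.5  # assumed value of X %*% beta for one observation
# P(Y = j) = pnorm(theta_j - xb) - pnorm(theta_{j-1} - xb)
p_j <- pnorm(theta[-1] - xb) - pnorm(theta[-length(theta)] - xb)
p_j       # one probability per category
sum(p_j)  # the probabilities sum to 1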
The distinguishing feature of models like the multinomial logit or ordered probit is that they estimate different parameters for each level of the outcome variable. This allows the analyst to derive different estimates of the quantities of interest (predictions, counterfactual comparisons, and slopes) for each level of the dependent variable. Since many categorical and ordered outcome models share this property, the framework introduced in this book applies in roughly the same way. Therefore, this chapter only considers the ordered probit case, but the same workflow applies to a wide class of models, including other multinomial and ordered approaches.
Let’s consider the Extramarital Affairs dataset, collected through a survey by Psychology Today in 1969 and analyzed in Fair (1978). These data include 601 observations on various demographic, marital, and personal characteristics: gender, age, years married, presence of children, religiousness, education, occupation, self-rated happiness in marriage, and the frequency of extramarital sexual encounters in the past year. The affairs variable records this self-reported frequency in six categories: 0, 1, 2, 3, 4-10, or >10.
To start, we load the data and plot the distribution of the outcome:
library(MASS)
library(marginaleffects)
dat <- read.csv("data/affairs.csv")
# The outcome must be a factor; polr() treats its level order as the ordinal scale
dat$affairs <- factor(dat$affairs, unique(dat$affairs))
barplot(table(dat$affairs), ylab = "N", xlab = "Affairs")
Now, we fit an ordered probit regression model and summarize the results:
# Fit and summarize the model
mod <- polr(
  affairs ~ children + yearsmarried + gender,
  method = "probit", data = dat, Hess = TRUE)
summary(mod)
Coefficients:
               Value Std. Error t value
childrenyes  0.17290     0.1506  1.1477
yearsmarried 0.03457     0.0116  2.9793
gendermale   0.08852     0.1075  0.8234

Intercepts:
         Value  Std. Error t value
0|1      1.1438 0.1313      8.7115
1|2      1.3409 0.1336     10.0358
2|3      1.4532 0.1352     10.7500
3|4-10   1.5942 0.1373     11.6074
4-10|>10 2.0248 0.1472     13.7524
The coefficients in this model can be interpreted as measuring changes in a latent variable, but it is not straightforward to build intuition about our model simply by looking at coefficient estimates. Instead, we focus on predictions and counterfactual comparisons, as we have done throughout this book.
9.1 Predictions
The first step in our post-hoc workflow is to compute average predictions, that is, the expected probabilities of each category of the dependent variable, averaged over the observed data. This provides a summary measure of the model’s predicted outcomes, offering a more interpretable understanding of the model’s overall behavior.
p <- avg_predictions(mod)
p
Group | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
0 | 0.7497 | 0.01739 | 43.12 | <0.001 | 0.7157 | 0.7838 |
1 | 0.0567 | 0.00944 | 6.01 | <0.001 | 0.0382 | 0.0752 |
2 | 0.0285 | 0.00680 | 4.19 | <0.001 | 0.0152 | 0.0418 |
3 | 0.0317 | 0.00715 | 4.44 | <0.001 | 0.0177 | 0.0457 |
4-10 | 0.0702 | 0.01039 | 6.76 | <0.001 | 0.0498 | 0.0906 |
>10 | 0.0631 | 0.00986 | 6.40 | <0.001 | 0.0438 | 0.0824 |
Since we are working with a(n ordered) categorical outcome model, the avg_predictions() function automatically returns one average predicted probability per outcome level. The group identifiers are available in the group column of the output data frame:
colnames(p)
[1] "group" "estimate" "std.error" "statistic" "p.value" "s.value"
[7] "conf.low" "conf.high"
p$group
[1] 0 1 2 3 4-10 >10
Levels: 0 1 2 3 4-10 >10
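As a quick manual check (not part of the original output), these averages are simply the column means of the unit-level predicted probabilities returned by predict(), and they sum to one across outcome levels:
probs <- predict(mod, type = "probs")  # one row per observation, one column per level
colMeans(probs)       # matches the avg_predictions() estimates
sum(colMeans(probs))  # equals 1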
Unsurprisingly, given Figure 9.1, the expected probability that affairs=0 is much higher than that of any of the other outcome levels.
As we saw in Chapter 5, it is easy to compute average predictions for different subgroups. For instance, if we want to know the expected probability of each outcome level, for survey respondents with and without children, we simply call:
avg_predictions(mod, by = "children")
Group | children | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
0 | no | 0.8388 | 0.02748 | 30.53 | <0.001 | 0.78496 | 0.8927 |
0 | yes | 0.7143 | 0.02135 | 33.45 | <0.001 | 0.67245 | 0.7562 |
1 | no | 0.0431 | 0.00856 | 5.03 | <0.001 | 0.02632 | 0.0599 |
1 | yes | 0.0622 | 0.01039 | 5.98 | <0.001 | 0.04180 | 0.0825 |
2 | no | 0.0205 | 0.00554 | 3.71 | <0.001 | 0.00967 | 0.0314 |
2 | yes | 0.0316 | 0.00758 | 4.18 | <0.001 | 0.01679 | 0.0465 |
3 | no | 0.0219 | 0.00581 | 3.77 | <0.001 | 0.01052 | 0.0333 |
3 | yes | 0.0356 | 0.00807 | 4.42 | <0.001 | 0.01982 | 0.0514 |
4-10 | no | 0.0442 | 0.00992 | 4.46 | <0.001 | 0.02479 | 0.0637 |
4-10 | yes | 0.0806 | 0.01212 | 6.65 | <0.001 | 0.05680 | 0.1043 |
>10 | no | 0.0314 | 0.00907 | 3.46 | <0.001 | 0.01362 | 0.0492 |
>10 | yes | 0.0757 | 0.01210 | 6.26 | <0.001 | 0.05199 | 0.0994 |
And we can plot these estimates with an analogous call. We use the by argument to marginalize across combinations of the group outcome and the children predictor:
plot_predictions(mod, by = c("group", "children"))
Figure 9.2 shows again that the most likely answer, by far, is zero. Moreover, the graph suggests that the differences in predicted probabilities between the “yes” and “no” children groups are quite small. This impression will be confirmed in the next section, where we discuss counterfactual comparisons.
In some cases, it may be useful to aggregate or combine different outcome categories. For example, we may want to compute the sum of all predicted probabilities for group categories above 0.
One powerful strategy is to define a custom function to use in the hypothesis argument of avg_predictions(). According to the function manual (?avg_predictions), a custom hypothesis function must accept the same data frame that we obtain when calling the function without the hypothesis argument, i.e., a data frame with group, term, and estimate columns. The custom function can then process this input and return a data frame with a term and an estimate column.
For example, this custom function accepts a data frame x, collapses the predicted probabilities by group categories, and returns a data frame of new estimates:
hyp <- function(x) {
  x$term <- ifelse(x$group == "0", "0", ">1")
  aggregate(estimate ~ term + children, FUN = sum, data = x)
}
avg_predictions(mod, by = "children", hypothesis = hyp)
Term | children | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|---|
>1 | no | 0.161 | 0.0275 | 5.87 | <0.001 | 0.107 | 0.215 |
0 | no | 0.839 | 0.0275 | 30.53 | <0.001 | 0.785 | 0.893 |
>1 | yes | 0.286 | 0.0214 | 13.38 | <0.001 | 0.244 | 0.328 |
0 | yes | 0.714 | 0.0214 | 33.45 | <0.001 | 0.672 | 0.756 |
The custom function above applied a sum by subgroup using the aggregate() function, which has been available in base R since the 20th century. An equivalent tidyverse version of this custom function would be:
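Here is one possible sketch with dplyr; the hyp_tidy name, and the choice of group_by() plus summarize(), are ours rather than a unique translation:
library(dplyr)
# dplyr equivalent of hyp(): relabel groups, then sum estimates by subgroup
hyp_tidy <- function(x) {
  x |>
    mutate(term = ifelse(group == "0", "0", ">1")) |>
    group_by(term, children) |>
    summarize(estimate = sum(estimate), .groups = "drop")
}
avg_predictions(mod, by = "children", hypothesis = hyp_tidy)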
9.2 Counterfactual comparisons
After looking at predictions, we can now move on to counterfactual comparisons. Our first objective is to answer this question: what is the effect of increasing the yearsmarried variable by 5 years on the average predicted probability of each level of the outcome variable?
avg_comparisons(mod, variables = list(yearsmarried = 5))
Group | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
0 | -0.05617 | 0.01940 | -2.89 | 0.00379 | -0.094199 | -0.01814 |
1 | 0.00687 | 0.00236 | 2.91 | 0.00363 | 0.002241 | 0.01150 |
2 | 0.00427 | 0.00168 | 2.54 | 0.01094 | 0.000981 | 0.00756 |
3 | 0.00551 | 0.00215 | 2.56 | 0.01040 | 0.001296 | 0.00973 |
4-10 | 0.01590 | 0.00580 | 2.74 | 0.00607 | 0.004543 | 0.02726 |
>10 | 0.02362 | 0.00923 | 2.56 | 0.01053 | 0.005521 | 0.04171 |
# Store the comparisons and express the estimates as percentage points,
# to report them in the text below
cmp <- avg_comparisons(mod, variables = list(yearsmarried = 5))
cmp <- sprintf("%.1f", abs(cmp$estimate) * 100)
On average, increasing the yearsmarried variable by 5 reduces the predicted probability that affairs=0 by 5.6 percentage points, and it increases the predicted probability that affairs>10 by 2.4 percentage points. These counterfactual comparisons are associated with small \(p\) values, so we can reject the null hypothesis that yearsmarried has no effect on predicted affairs.
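To see what this comparison computes under the hood, here is a manual sketch: we predict unit-level probabilities on the observed data and on a counterfactual copy of the data where yearsmarried is increased by 5, then average the differences for the affairs=0 level. This assumes the contrast is observed value versus observed value plus 5.
# Unit-level probabilities on observed and counterfactual data
p_lo <- predict(mod, newdata = dat, type = "probs")
p_hi <- predict(mod,
  newdata = transform(dat, yearsmarried = yearsmarried + 5),
  type = "probs")
# Average difference for the affairs = "0" level; under the assumption above,
# this should be close to the avg_comparisons() estimate (about -0.056)
mean(p_hi[, "0"] - p_lo[, "0"])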
In contrast, if we estimate the effect of manipulating the gender variable, we find that differences in predicted probabilities are small and statistically insignificant. We cannot reject the null hypothesis that gender has no effect on affairs.
avg_comparisons(mod, variables = "gender")
Group | Estimate | Std. Error | z | Pr(>|z|) | 2.5 % | 97.5 % |
---|---|---|---|---|---|---|
0 | -0.02736 | 0.03324 | -0.823 | 0.411 | -0.09251 | 0.0378 |
1 | 0.00373 | 0.00458 | 0.815 | 0.415 | -0.00524 | 0.0127 |
2 | 0.00225 | 0.00278 | 0.808 | 0.419 | -0.00320 | 0.0077 |
3 | 0.00284 | 0.00350 | 0.811 | 0.417 | -0.00402 | 0.0097 |
4-10 | 0.00786 | 0.00959 | 0.819 | 0.413 | -0.01095 | 0.0267 |
>10 | 0.01068 | 0.01305 | 0.818 | 0.413 | -0.01491 | 0.0363 |