12  Machine learning

The concepts and post-estimation tools introduced in earlier chapters—predictions, counterfactual comparisons, and slopes—are largely model-agnostic; they are applicable to both statistical and machine learning models. These tools are especially effective for model description (Section 1.1.1), a task that is very important in machine learning applications, where analysts need to audit and understand how models respond to different inputs.

Auditing and describing machine learning models is essential to ensure that their predictions are fair, and that they are driven by factors compatible with the substantive knowledge of domain experts. For instance, in credit scoring systems, evaluating how variations in applicant characteristics—such as income, employment status, or ethnicity—influence creditworthiness assessments helps detect and mitigate potential biases. Similarly, in hiring algorithms, understanding how models weight different candidate attributes—like education level or years of experience—can help recruiters use models responsibly in decision-making. Audits and model descriptions are crucial to improve the transparency and interpretability of data analyses.

The marginaleffects package facilitates model description and auditing by allowing analysts to compute and visualize predictions, counterfactual predictions, and slopes. It integrates seamlessly with two of the most prominent machine learning frameworks in R: tidymodels and mlr3. tidymodels is a collection of R packages designed for modeling and machine learning using tidyverse principles, offering a cohesive interface for data preprocessing, modeling, and validation. mlr3 is a modern, object-oriented framework that provides a comprehensive suite of tools for machine learning, including a wide array of algorithms, resampling methods, and performance measures. By supporting both tidymodels and mlr3, marginaleffects enables users to interpret a wide variety of machine learning models.

A comprehensive introduction to machine learning in general, or to the tidymodels and mlr3 frameworks in particular, lies outside the scope of this book.1 Instead, this chapter presents a simple example to demonstrate that the workflow built up in previous chapters carries over to this new context in straightforward fashion.

Let us consider the data on Airbnb rental properties in London, collected and distributed by (BekKez2021?). This dataset includes information on over 50,000 units, with features such as the unit type (private room or entire unit), the number of bedrooms, parking, and internet access. The primary outcome of our analysis is the rental price of each unit.

To begin, we load the tidymodels and marginaleffects libraries, read the data, and display the first few rows and columns.

library(tidymodels)
library(marginaleffects)

airbnb <- arrow::read_parquet("https://marginaleffects.com/data/airbnb.parquet")
airbnb[1:5, 1:6]
# A tibble: 5 × 6
  price bathrooms bedrooms  beds unit_type    `24-hour check-in`
  <int>     <dbl>    <int> <int> <chr>                     <int>
1    23       1          1     1 Private room                  0
2    50       1          1     1 Private room                  0
3    24       1          1     1 Private room                  0
4    50       1.5        1     1 Private room                  1
5    25       1          1     1 Private room                  0

Now, we split the dataset into a training set, used to fit the model, and a testing set, used to make predictions and evaluate the model’s behavior.

airbnb_split <- initial_split(airbnb)
train <- training(airbnb_split)
test <- testing(airbnb_split)
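By default, initial_split() assigns three quarters of the rows to the training set and reserves the rest for testing. A quick sanity check on the proportions:

```r
# Share of rows in each set (roughly 3/4 and 1/4 by default)
nrow(train) / nrow(airbnb)
nrow(test) / nrow(airbnb)
```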

The next block of code holds the core tidymodels commands.

The boost_tree() function specifies the model type. In this case, we use the XGBoost implementation of boosted trees to predict a continuous outcome (i.e., “regression”). Users who prefer a different prediction algorithm could swap this line for linear_reg(), rand_forest(), or bart(). The recipe() function identifies the outcome variable (price) and initiates a data pre-processing “recipe,” that is, a series of steps to transform the raw data into a suitable format for model fitting. step_dummy() converts categorical predictors into dummy variables. Finally, the workflow() function combines the model and the pre-processing recipe, and the fit() function fits the model to the training data.

xgb <- boost_tree(mode = "regression", engine = "xgboost")

mod <- recipe(airbnb, price ~ .) |>
  step_dummy(all_nominal_predictors()) |>
  workflow(spec = xgb) |>
  fit(train)
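To illustrate how little needs to change when swapping algorithms, here is a sketch that substitutes a random forest for XGBoost. This assumes the ranger package is installed; mod_rf is a hypothetical object name. Only the model specification line differs:

```r
# Same recipe and workflow; only the model specification changes
rf <- rand_forest(mode = "regression", engine = "ranger")

mod_rf <- recipe(airbnb, price ~ .) |>
  step_dummy(all_nominal_predictors()) |>
  workflow(spec = rf) |>
  fit(train)
```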

With the fitted model in hand, we now use the predictions() function to generate predictions in the test set. As usual, predictions() returns a simple data frame with the quantity of interest in the estimate column, and the original data in separate columns. Thus, we can easily check the quality of our predictions in the test set by plotting the predicted values of the outcome (estimate) against the actually observed values (price).

p <- predictions(mod, newdata = test)

ggplot(p, aes(x = estimate, y = price)) +
  geom_point(alpha = .2) +
  geom_abline(linetype = 3) +
  labs(x = "Predicted Price", y = "Actual Price") +
  xlim(0, 500) + ylim(0, 500) +
  coord_equal()
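The visual check can be complemented with numerical summaries. Since the yardstick package is loaded as part of tidymodels, we can, for example, compute the root mean squared error and R² of the test-set predictions directly from the data frame returned by predictions():

```r
# Accuracy metrics comparing predicted (estimate) to observed (price) values
metrics <- metric_set(rmse, rsq)
metrics(p, truth = price, estimate = estimate)
```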

All the standard functions of the marginaleffects package are available. For instance, to compute the average predicted price of private rooms and entire homes in the test set, we call avg_predictions() with the by argument.

avg_predictions(mod,
  by = "unit_type",
  newdata = test)
       unit_type Estimate
 Entire home/apt    135.6
    Private room     50.3

Unsurprisingly, our model predicts that entire homes will be more expensive than private rooms. One important thing to note in the results printed above is that fitting a machine learning model with tidymodels or mlr3 does not typically yield standard errors. Therefore, marginaleffects cannot compute uncertainty estimates via the delta method.

As in Chapter 6, we can use the avg_comparisons() function to answer counterfactual queries such as: On average, how does the predicted price change when we increase the number of bedrooms by 2, holding all other variables constant?

avg_comparisons(mod,
  variables = list(bedrooms = 2),
  newdata = test)
Estimate
18.4

Our model predicts that a unit with two extra bedrooms will rent for about 18 pounds more. Furthermore, we may inquire about the combined effect of increasing the number of bedrooms by one and of moving from a unit without wireless internet access to one with such access. For this, we use the cross argument.

avg_comparisons(mod,
  variables = c("bedrooms", "Wireless.Internet"),
  cross = TRUE,
  newdata = airbnb)
 C: bedrooms C: Wireless.Internet Estimate
          +1                   +1     13.2
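Counterfactual comparisons need not be expressed as differences. The comparison argument of avg_comparisons() can report other functions of the two counterfactual predictions; for instance, a ratio of average predicted prices:

```r
# Ratio of predicted prices: 2 extra bedrooms vs. observed bedrooms
avg_comparisons(mod,
  variables = list(bedrooms = 2),
  comparison = "ratio",
  newdata = test)
```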

Finally, we can describe how predicted prices vary with the number of bedrooms. As a first step, plot_predictions() displays the average predicted price across the observed values of bedrooms and unit_type:

plot_predictions(mod, by = c("bedrooms", "unit_type"), newdata = airbnb)

This plot averages predictions among the units actually observed at each number of bedrooms, so the composition of units changes along the horizontal axis. A partial dependence plot avoids this by holding unit profiles fixed: we duplicate each unit in a random subsample once for every combination of bedrooms and unit_type, predict, and average. The datagrid() function with grid_type = "counterfactual" builds such a grid.

# Subsample 10,000 rows at random; the full counterfactual grid is very memory-intensive
grid <- datagrid(
  bedrooms = unique,
  unit_type = unique,
  newdata = airbnb[sample(1:nrow(airbnb), 10000), ],
  grid_type = "counterfactual")

# Partial dependence plot
plot_predictions(mod,
  newdata = grid,
  by = c("bedrooms", "unit_type")) +
  labs(x = "# Bedrooms", y = "Predicted Price", color = "")
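Although this chapter used tidymodels, the same post-estimation functions accept models fitted with mlr3. The following is a minimal sketch, assuming the mlr3, mlr3learners, and xgboost packages are installed; the column subset is illustrative, chosen because mlr3 tasks require syntactic column names:

```r
library(mlr3)
library(mlr3learners)

# Keep a few predictors with syntactic names
dat <- airbnb[, c("price", "bathrooms", "bedrooms", "beds")]

task <- as_task_regr(dat, target = "price")
learner <- lrn("regr.xgboost")
learner$train(task)

# mlr3 learners do not store the data, so newdata is required
avg_comparisons(learner, variables = list(bedrooms = 2), newdata = dat)
```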


  1. See (KuhSil2022?), (Jam2023?), (Bis2024?)↩︎