Scikit-learn

This case study demonstrates how to use the marginaleffects package with Scikit-learn to fit and interpret machine learning models.

The strategies illustrated here offer several major benefits. As we will see, marginaleffects can be used to improve the interpretability of opaque machine learning algorithms. marginaleffects also offers a formula API to fit Scikit-learn models, which streamlines the analysis workflow by obviating the need to build X and y matrices from data frames manually.

The first code block imports necessary libraries and functions for the analysis.

from marginaleffects import *
from statsmodels.formula.api import ols
import polars.selectors as cs
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from xgboost import XGBRegressor

The next code block uses the fit_sklearn() function to fit a linear regression model. We specify the dependent and independent variables using a simple formula interface, with a ~ symbol.1

The type of model we fit is determined by the engine argument. Here, we call LinearRegression() from Scikit-learn. The important requirement for engine is that it must be an object with a fit() method that accepts y and X as inputs.
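Because the engine is duck-typed, even a hand-rolled estimator can play that role, as long as it exposes a suitable fit() method. The class below is a hypothetical, minimal sketch of such an engine (it is not part of marginaleffects or Scikit-learn, and exists only to illustrate the interface):

```python
import numpy as np

class MeanRegressor:
    """Hypothetical engine: predicts the mean of y, ignoring X."""

    def fit(self, X, y):
        # Store the training mean; this is all the "model" learns.
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Return the stored mean for every row of X.
        return np.full(len(X), self.mean_)

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
engine = MeanRegressor().fit(X, y)
engine.predict(X[:2])  # array([5., 5.]) — the mean of y
```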

The avg_predictions() function is then used to compute average predictions by the branch variable.

military = get_dataset("military")

mod_sk = fit_sklearn(
    "rank ~ officer + hisp + branch",
    data=military,
    engine=LinearRegression(),
)

Finally, we can feed the resulting object to any marginaleffects function. For instance, this code computes average predicted rank for each branch of the military.

avg_predictions(mod_sk, by="branch")
shape: (4, 2)
branch           estimate
enum             f64
"air force"      0.19714
"army"           0.165827
"marine corps"   0.105712
"navy"           0.159498

The following code block shows how to obtain analogous results with statsmodels: we fit the linear regression model with ols() and compute average predictions by branch. Note that statsmodels also reports standard errors and confidence intervals for the average predictions.

mod_sm = ols(
    "rank ~ officer + hisp + branch",
    data=military.to_pandas()).fit()
avg_predictions(mod_sm, by="branch")
shape: (4, 8)
branch          estimate  std_error  statistic    p_value  s_value  conf_low  conf_high
enum            f64       f64        f64          f64      f64      f64       f64
"air force"     6.331187  0.002972   2130.172269  0.0      inf      6.325362  6.337013
"army"          6.221938  0.002295   2710.822476  0.0      inf      6.217439  6.226437
"marine corps"  5.781668  0.003798   1522.165211  0.0      inf      5.774223  5.789112
"navy"          6.264098  0.003005   2084.604412  0.0      inf      6.258209  6.269988

Now, we switch to using XGBoost to make predictions and counterfactual comparisons in a test set. Instead of the formula interface, we will build a Scikit-learn pipeline and define a custom selector function.

To begin, we load a dataset on Airbnb rental properties and split it into training and testing sets.

airbnb = get_dataset("airbnb")

train, test = train_test_split(airbnb)
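As an aside, train_test_split() shuffles rows at random by default, so the split above will differ from run to run. A small toy sketch, independent of the Airbnb data, shows how the optional random_state argument makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: six rows, two columns.
data = np.arange(12).reshape(6, 2)

# Passing the same random_state yields the same shuffled split each time.
train_a, test_a = train_test_split(data, test_size=1/3, random_state=42)
train_b, test_b = train_test_split(data, test_size=1/3, random_state=42)
# train_a and train_b are identical arrays; test sets hold 2 of 6 rows.
```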

The categorical variables need to be one-hot encoded: each category is converted into its own binary indicator column, a format that machine learning algorithms can consume. We create a Scikit-learn pipeline, which is a sequence of data processing steps that ends with a model-fitting step.
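To see concretely what one-hot encoding does, here is a small standalone sketch on toy data (not the Airbnb set):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Four observations of a single categorical variable.
branches = np.array([["army"], ["navy"], ["army"], ["air force"]])

enc = OneHotEncoder()
encoded = enc.fit_transform(branches).toarray()

enc.categories_  # categories sorted: ['air force', 'army', 'navy']
encoded.shape    # (4, 3): one 0/1 indicator column per category
```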

catvar = airbnb.select(~cs.numeric()).columns
preprocessor = make_column_transformer(
    (OneHotEncoder(), catvar),
    remainder=FunctionTransformer(lambda x: x.to_numpy()),
)
pipeline = make_pipeline(preprocessor, XGBRegressor())

Scikit-learn pipelines expect data to be passed as two matrices y and X. In contrast, marginaleffects works with data frames. Therefore, we need a function that accepts a data frame and returns two matrices. This will be used internally to fit the model and make predictions on new data.

def selector(data):
    y = data.select(cs.by_name("price", require_all=False))
    X = data.select(~cs.by_name("price", require_all=False))
    return y, X

mod = fit_sklearn(selector, data=train, engine=pipeline)

avg_predictions(mod, newdata=test, by="unit_type")
shape: (2, 2)
unit_type          estimate
str                f32
"Entire home/apt"  136.3396
"Private room"     50.439476
avg_comparisons(mod, variables={"bedrooms": 2}, newdata=test)
shape: (1, 3)
term        contrast  estimate
str         str       f64
"bedrooms"  "+2"      12.970476
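To build intuition for what avg_comparisons() reports, the "+2" contrast can be computed by hand: predict with the focal variable shifted up by 2, predict at the observed values, and average the difference. A toy sketch with simulated data and a plain Scikit-learn model (all names here are illustrative, not the Airbnb analysis):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulate data where the first feature has a true coefficient of 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] + X[:, 1] + rng.normal(size=100)
mod = LinearRegression().fit(X, y)

# Counterfactual: increase the first feature by 2 for every row.
X_hi = X.copy()
X_hi[:, 0] += 2

# Average difference between counterfactual and observed predictions.
comparison = np.mean(mod.predict(X_hi) - mod.predict(X))
# For a linear model this equals 2 * estimated coefficient, roughly 6.
```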

  1. Under the hood, the formula is parsed using the Formulaic package.↩︎