# Scikit-learn

This case study demonstrates how to use the marginaleffects package with Scikit-learn to fit machine learning models. The strategy illustrated here offers two major benefits. First, as we will see, marginaleffects can be used to improve the interpretability of opaque machine learning algorithms. Second, marginaleffects offers a formula API to fit Scikit-learn models, which can streamline the analysis workflow by obviating the need to build `X` and `y` matrices from data frames manually.

The first code block imports the libraries and functions needed for the analysis.

```python
from marginaleffects import *
from statsmodels.formula.api import ols
import polars.selectors as cs
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from xgboost import XGBRegressor
```
The next code block uses the `fit_sklearn()` function to fit a linear regression model. We specify the dependent and independent variables using a simple formula interface, with a `~` symbol.[^1] The type of model we fit is determined by the `engine` argument. Here, we pass `LinearRegression()` from Scikit-learn. The important requirement for `engine` is that it must be an object with a `fit()` method that accepts `y` and `X` as inputs.
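To make this duck-typing requirement concrete, here is a minimal sketch of a custom engine. `MeanEngine` is hypothetical and purely illustrative; it assumes that the model object receives array-likes in Scikit-learn's usual `fit(X, y)` / `predict(X)` style.

```python
import numpy as np

class MeanEngine:
    """Hypothetical engine: ignores the features and always predicts
    the mean outcome observed during training."""

    def fit(self, X, y):
        # Memorize the average outcome (assumes y is array-like).
        self.mean_ = float(np.asarray(y).mean())
        return self

    def predict(self, X):
        # Return the memorized mean for each row of X.
        return np.full(len(X), self.mean_)
```

Any object with this interface, not just estimators shipped by Scikit-learn, should in principle be usable as an `engine`.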
The `avg_predictions()` function is then used to compute average predictions by the `branch` variable.
```python
military = get_dataset("military")

mod_sk = fit_sklearn(
    "rank ~ officer + hisp + branch",
    data=military,
    engine=LinearRegression(),
)
```
Finally, we can feed the resulting object to any marginaleffects function. For instance, this code computes the average predicted rank for each branch of the military.

```python
avg_predictions(mod_sk, by="branch")
```
| branch | estimate |
|---|---|
| enum | f64 |
| "air force" | 0.19714 |
| "army" | 0.165827 |
| "marine corps" | 0.105712 |
| "navy" | 0.159498 |
The following code block shows how to achieve equivalent results using statsmodels to fit the linear regression model and compute average predictions. Note that statsmodels also reports standard errors, test statistics, and confidence intervals.

```python
mod_sm = ols(
    "rank ~ officer + hisp + branch",
    data=military.to_pandas(),
).fit()

avg_predictions(mod_sm, by="branch")
```
| branch | estimate | std_error | statistic | p_value | s_value | conf_low | conf_high |
|---|---|---|---|---|---|---|---|
| enum | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| "air force" | 6.331187 | 0.002972 | 2130.172269 | 0.0 | inf | 6.325362 | 6.337013 |
| "army" | 6.221938 | 0.002295 | 2710.822476 | 0.0 | inf | 6.217439 | 6.226437 |
| "marine corps" | 5.781668 | 0.003798 | 1522.165211 | 0.0 | inf | 5.774223 | 5.789112 |
| "navy" | 6.264098 | 0.003005 | 2084.604412 | 0.0 | inf | 6.258209 | 6.269988 |
Now, we switch to using XGBoost to make predictions and counterfactual comparisons in a test set. Instead of using the formula interface, we will define a Scikit-learn pipeline and a custom selector function.

To begin, we load a dataset on Airbnb rental properties and split it into training and testing sets.
```python
airbnb = get_dataset("airbnb")

train, test = train_test_split(airbnb)
```
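Note that `train_test_split()` shuffles rows at random, so the numbers below will vary from run to run. For reproducibility, a seed can be fixed; a minimal sketch (the `test_size` value is only an example):

```python
# Fix the random seed so the train/test split is reproducible.
train, test = train_test_split(airbnb, test_size=0.25, random_state=42)
```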
The categorical variables need to be one-hot encoded, that is, expanded into binary indicator columns that machine learning algorithms can process; a toy illustration follows the next code block. We create a Scikit-learn pipeline, which is a sequence of data processing steps that ends with a model fitting step.
```python
catvar = airbnb.select(~cs.numeric()).columns

preprocessor = make_column_transformer(
    (OneHotEncoder(), catvar),
    remainder=FunctionTransformer(lambda x: x.to_numpy()),
)

pipeline = make_pipeline(preprocessor, XGBRegressor())
```
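For readers unfamiliar with one-hot encoding, here is a toy example, independent of the pipeline above. It assumes scikit-learn ≥ 1.2, where the dense-output argument is named `sparse_output`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Each category becomes its own 0/1 indicator column.
toy = np.array([["army"], ["navy"], ["army"]])
enc = OneHotEncoder(sparse_output=False).fit(toy)
print(enc.get_feature_names_out())  # ['x0_army' 'x0_navy']
print(enc.transform(toy))
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]
```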
Scikit-learn pipelines expect data to be passed as two matrices, `y` and `X`. In contrast, marginaleffects works with data frames. Therefore, we need a function that accepts a data frame and returns the two matrices. This function is used internally to fit the model and to make predictions on new data.
```python
def selector(data):
    y = data.select(cs.by_name("price", require_all=False))
    X = data.select(~cs.by_name("price", require_all=False))
    return y, X
```
We can now fit the pipeline and compute average predicted prices by unit type in the test set.

```python
mod = fit_sklearn(selector, data=train, engine=pipeline)

avg_predictions(mod, newdata=test, by="unit_type")
```
| unit_type | estimate |
|---|---|
| str | f32 |
| "Entire home/apt" | 136.3396 |
| "Private room" | 50.439476 |
={"bedrooms": 2}, newdata=test) avg_comparisons(mod, variables
| term | contrast | estimate |
|---|---|---|
| str | str | f64 |
| "bedrooms" | "+2" | 12.970476 |
[^1]: Under the hood, the formula is parsed using the Formulaic package.