Fit a sklearn model with output that is compatible with pymarginaleffects.
This function streamlines the process of fitting sklearn models by:

- Parsing the formula
- Handling missing values
- Creating model matrices
- Fitting the model with specified options
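A minimal sketch of that workflow, assuming a hypothetical DataFrame `df` that contains columns named "outcome", "distance", and "incentive" (the names used in the formula example under Parameters):

```python
from sklearn.linear_model import LinearRegression
from marginaleffects import avg_predictions, fit_sklearn

# `df` is a hypothetical DataFrame holding the outcome and predictors.
mod = fit_sklearn(
    "outcome ~ distance + incentive",  # the formula is parsed for you
    data=df,                           # missing values are handled for you
    engine=LinearRegression(),         # an unfitted sklearn estimator
)
avg_predictions(mod)  # the returned wrapper plugs into marginaleffects functions
```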
## Parameters
- **formula** (str): Model formula. Example: `"outcome ~ distance + incentive"`
- **data** (pandas.DataFrame or polars.DataFrame): Dataframe with the response variable and predictors. **Important:** all categorical variables must be explicitly converted to Categorical or Enum dtype before fitting; string columns are not accepted in model formulas.
For Polars DataFrames:
```python
import polars as pl

# Option 1: Cast to Categorical (simplest)
df = df.with_columns(pl.col("region").cast(pl.Categorical))

# Option 2: Cast to Enum with explicit category order (recommended for control)
categories = ["<18", "18 to 35", ">35"]
df = df.with_columns(pl.col("age_group").cast(pl.Enum(categories)))
```
For pandas DataFrames:
df["region"] = df["region"].astype("category")
- **engine** (callable): sklearn model class or instance (e.g., `LinearRegression`, `LogisticRegression`); the Examples below pass estimator instances and pipelines.
- **kwargs_engine** (dict, default={}): Additional arguments passed to the model initialization. Example: `{'weights': weights_array}`
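For instance, a hedged sketch of forwarding an estimator option through `kwargs_engine`, assuming it is applied at model initialization exactly as described above (`fit_intercept` is a standard LinearRegression parameter; the engine is passed as a class here so the constructor arguments have somewhere to go, whereas the Examples below pass instances):

```python
from sklearn.linear_model import LinearRegression
from marginaleffects import fit_sklearn

# Assumption: kwargs_engine is forwarded to the model initialization, as the
# parameter description states. `df` is the hypothetical DataFrame from above.
mod = fit_sklearn(
    "outcome ~ distance + incentive",
    data=df,
    engine=LinearRegression,
    kwargs_engine={"fit_intercept": False},
)
```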
## Returns
(ModelSklearn) A fitted model wrapped in the ModelSklearn class for compatibility with marginaleffects.
## Examples
```python
from marginaleffects import *
from statsmodels.formula.api import ols
import polars as pl
import polars.selectors as cs
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_transformer
from xgboost import XGBRegressor

# Linear regression: Scikit-learn
military = get_dataset("military")

# Convert categorical variables to proper dtypes
military = military.with_columns(pl.col("branch").cast(pl.Categorical))

mod_sk = fit_sklearn(
    "rank ~ officer + hisp + branch",
    data=military,
    engine=LinearRegression(),
)
avg_predictions(mod_sk, by="branch")

# Linear regression: Statsmodels
mod_sm = ols("rank ~ officer + hisp + branch", data=military.to_pandas()).fit()
avg_predictions(mod_sm, by="branch")

# XGBoost: Scikit-learn
airbnb = get_dataset("airbnb")

# Convert categorical variables to proper dtypes
catvar = airbnb.select(~cs.numeric()).columns
airbnb = airbnb.with_columns([pl.col(c).cast(pl.Categorical) for c in catvar])

train, test = train_test_split(airbnb)

def selector(data):
    y = data.select(cs.by_name("price", require_all=False))
    X = data.select(~cs.by_name("price", require_all=False))
    return y, X

preprocessor = make_column_transformer(
    (OneHotEncoder(), catvar),
    remainder=FunctionTransformer(lambda x: x.to_numpy()),
)
pipeline = make_pipeline(preprocessor, XGBRegressor())
mod = fit_sklearn(selector, data=train, engine=pipeline)
avg_predictions(mod, newdata=test, by="unit_type")
avg_comparisons(mod, variables={"bedrooms": 2}, newdata=test)
```
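Because the `data` parameter also accepts pandas DataFrames, the first fit above can be reproduced with pandas input; a sketch (categorical columns must still be converted explicitly, as noted under Parameters):

```python
# Same scikit-learn fit as above, but with pandas input.
military_pd = military.to_pandas()
military_pd["branch"] = military_pd["branch"].astype("category")

mod_pd = fit_sklearn(
    "rank ~ officer + hisp + branch",
    data=military_pd,
    engine=LinearRegression(),
)
avg_predictions(mod_pd, by="branch")
```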