Python

The Python programming language offers several powerful libraries for (bayesian) statistical analysis, such as NumPyro and PyMC. This vignette shows how to use the the full power of marginaleffects to analyze and interpret the results of models estimated by Markov Chain Monte Carlo using the NumPyro Python library.

Fitting a NumPyro model

To begin, we load the reticulate package which allows us to interact with the Python interpreter from an R session. Then, we write a NumPyro model and we load it to memory using the source_python() function. The important functions to note in the Python code are:

  • load_df() downloads data on pulmonary fibrosis.
  • model() defines the NumPyro model.
  • fit_mcmc_model() fits the model using Markov Chain Monte Carlo.
  • predict_mcmc(): accepts a data frame and returns a matrix of draws from the posterior distribution of adjusted predictions (fitted values).
library(reticulate)
library(marginaleffects)

model <- '
## Model code adapted from the NumPyro documtation under Apache License:
## https://num.pyro.ai/en/latest/tutorials/bayesian_hierarchical_linear_regression.html

import pandas as pd
import numpy as np
import numpyro
from numpyro.infer import SVI, Predictive, MCMC,NUTS, autoguide, TraceMeanField_ELBO
import numpyro.distributions as dist
from numpyro.infer.initialization import init_to_median, init_to_uniform,init_to_sample
from jax import random
from sklearn.preprocessing import LabelEncoder
import pickle

def load_df():
    train = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/modelarchive/main/data-raw/osic_pulmonary_fibrosis.csv")
    return train


def model(data, predict = False):
    FVC_obs = data["FVC"].values  if predict == False else None
    patient_encoder = LabelEncoder()
    Age_obs = data["Age"].values
    patient_code = patient_encoder.fit_transform(data["Patient"].values)
    μ_α = numpyro.sample("μ_α", dist.Normal(0.0, 500.0))
    σ_α = numpyro.sample("σ_α", dist.HalfNormal(100.0))

    age = numpyro.sample("age", dist.Normal(0.0, 500.0))

    n_patients = len(np.unique(patient_code))

    with numpyro.plate("plate_i", n_patients):
        α = numpyro.sample("α", dist.Normal(μ_α, σ_α))

    σ = numpyro.sample("σ", dist.HalfNormal(100.0))
    FVC_est = α[patient_code] + age * Age_obs

    with numpyro.plate("data", len(patient_code)):
        numpyro.sample("obs", dist.Normal(FVC_est, σ), obs=FVC_obs)


def fit_mcmc_model(train_df, samples = 1000):
    numpyro.set_host_device_count(4)
    rng_key = random.PRNGKey(0)
    mcmc = MCMC(
        NUTS(model),
        num_samples=samples,
        num_warmup=1000,
        progress_bar=True,
        num_chains = 4
        )
    
    mcmc.run(rng_key, train_df)

    posterior_draws = mcmc.get_samples()

    with open("mcmc_posterior_draws.pickle", "wb") as handle:
        pickle.dump(posterior_draws, handle, protocol=pickle.HIGHEST_PROTOCOL)

def predict_mcmc(data):

    with open("mcmc_posterior_draws.pickle", "rb") as handle:
        posterior_draws = pickle.load(handle)

    predictive = Predictive(model = model,posterior_samples=posterior_draws)
    samples = predictive(random.PRNGKey(1), data, predict = True)
    y_pred = samples["obs"]
    # transpose so that each column is a draw and each row is an observation
    y_pred = np.transpose(np.array(y_pred))

    return y_pred 
'

## save python script to temp file
tmp <- tempfile()
cat(model, file = tmp)

## load functions
source_python(tmp)

## download data
df <- load_df()

## fit model
fit_mcmc_model(df)

Analyzing the results in marginaleffects

Each of the functions in the marginaleffects package requires that users supply a model object on which the function will operate. When estimating models outside R, we do not have such a model object. We thus begin by creating a “fake” model object: an empty data frame which we define to be of class “custom”. Then, we set a global option to tell marginaleffects that this “custom” class is supported.

mod <- data.frame()
class(mod) <- "custom"

options("marginaleffects_model_classes" = "custom")

Next, we define a get_predict method for our new custom class. This method must accept three arguments: model, newdata, and .... The get_predict method must return a data frame with one row for each of the rows in newdata, two columns (rowid and estimate), and an attribute called posterior_draws which hosts a matrix of posterior draws with the same number of rows as newdata.

The method below uses reticulate to call the predict_mcmc() function that we defined in the Python script above. The predict_mcmc() function accepts a data frame and returns a matrix with the same number of rows.

get_predict.custom <- function(model, newdata, ...) {
    pred <- predict_mcmc(newdata)
    out <- data.frame(
        rowid = seq_len(nrow(newdata)),
        predicted = apply(pred, 1, stats::median)
    )
    attr(out, "posterior_draws") <- pred
    return(out)
}

Now we can use most of the marginaleffects package functions to analyze our results. Since we use a “fake” model object, marginaleffects cannot retrieve the original data from the model object, and we always need to supply a newdata argument:

## predictions on the original dataset
predictions(mod, newdata = df) |> head()

## predictions for user-defined predictor values
predictions(mod, newdata = datagrid(newdata = df, Age = c(60, 70)))

predictions(mod, newdata = datagrid(newdata = df, Age = range))

## average predictions by group
predictions(mod, newdata = df, by = "Sex")

## contrasts (average)
avg_comparisons(mod, variables = "Age", newdata = df)

avg_comparisons(mod, variables = list("Age" = "sd"), newdata = df)

## slope (elasticity)
avg_slopes(mod, variables = "Age", slope = "eyex", newdata = df)