library(reticulate)
library(marginaleffects)
model <- '
## Model code adapted from the NumPyro documtation under Apache License:
## https://num.pyro.ai/en/latest/tutorials/bayesian_hierarchical_linear_regression.html
import pandas as pd
import numpy as np
import numpyro
from numpyro.infer import SVI, Predictive, MCMC,NUTS, autoguide, TraceMeanField_ELBO
import numpyro.distributions as dist
from numpyro.infer.initialization import init_to_median, init_to_uniform,init_to_sample
from jax import random
from sklearn.preprocessing import LabelEncoder
import pickle
def load_df():
train = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/modelarchive/main/data-raw/osic_pulmonary_fibrosis.csv")
return train
def model(data, predict = False):
FVC_obs = data["FVC"].values if predict == False else None
patient_encoder = LabelEncoder()
Age_obs = data["Age"].values
patient_code = patient_encoder.fit_transform(data["Patient"].values)
μ_α = numpyro.sample("μ_α", dist.Normal(0.0, 500.0))
σ_α = numpyro.sample("σ_α", dist.HalfNormal(100.0))
age = numpyro.sample("age", dist.Normal(0.0, 500.0))
n_patients = len(np.unique(patient_code))
with numpyro.plate("plate_i", n_patients):
α = numpyro.sample("α", dist.Normal(μ_α, σ_α))
σ = numpyro.sample("σ", dist.HalfNormal(100.0))
FVC_est = α[patient_code] + age * Age_obs
with numpyro.plate("data", len(patient_code)):
numpyro.sample("obs", dist.Normal(FVC_est, σ), obs=FVC_obs)
def fit_mcmc_model(train_df, samples = 1000):
numpyro.set_host_device_count(4)
rng_key = random.PRNGKey(0)
mcmc = MCMC(
NUTS(model),
num_samples=samples,
num_warmup=1000,
progress_bar=True,
num_chains = 4
)
mcmc.run(rng_key, train_df)
posterior_draws = mcmc.get_samples()
with open("mcmc_posterior_draws.pickle", "wb") as handle:
pickle.dump(posterior_draws, handle, protocol=pickle.HIGHEST_PROTOCOL)
def predict_mcmc(data):
with open("mcmc_posterior_draws.pickle", "rb") as handle:
posterior_draws = pickle.load(handle)
predictive = Predictive(model = model,posterior_samples=posterior_draws)
samples = predictive(random.PRNGKey(1), data, predict = True)
y_pred = samples["obs"]
# transpose so that each column is a draw and each row is an observation
y_pred = np.transpose(np.array(y_pred))
return y_pred
'
## save python script to temp file
tmp <- tempfile()
cat(model, file = tmp)
## load functions
source_python(tmp)
## download data
df <- load_df()
## fit model
fit_mcmc_model(df)
34 NumPyro
The Python programming language offers several powerful libraries for (bayesian) statistical analysis, such as NumPyro
and PyMC
. This vignette shows how to use the the full power of marginaleffects
to analyze and interpret the results of models estimated by Markov Chain Monte Carlo using the NumPyro
Python library.
34.1 Fitting a NumPyro
model
To begin, we load the reticulate
package which allows us to interact with the Python interpreter from an R
session. Then, we write a NumPyro
model and we load it to memory using the source_python()
function. The important functions to note in the Python code are:
-
load_df()
downloads data on pulmonary fibrosis. -
model()
defines theNumPyro
model. -
fit_mcmc_model()
fits the model using Markov Chain Monte Carlo. -
predict_mcmc()
: accepts a data frame and returns a matrix of draws from the posterior distribution of adjusted predictions (fitted values).
34.2 Analyzing the results in marginaleffects
Each of the functions in the marginaleffects
package requires that users supply a model
object on which the function will operate. When estimating models outside R
, we do not have such a model object. We thus begin by creating a “fake” model object: an empty data frame which we define to be of class “custom”. Then, we set a global option to tell marginaleffects
that this “custom” class is supported.
mod <- data.frame()
class(mod) <- "custom"
options("marginaleffects_model_classes" = "custom")
Next, we define a get_predict
method for our new custom class. This method must accept three arguments: model
, newdata
, and ...
. The get_predict
method must return a data frame with one row for each of the rows in newdata
, two columns (rowid
and estimate
), and an attribute called posterior_draws
which hosts a matrix of posterior draws with the same number of rows as newdata
.
The method below uses reticulate
to call the predict_mcmc()
function that we defined in the Python script above. The predict_mcmc()
function accepts a data frame and returns a matrix with the same number of rows.
Now we can use most of the marginaleffects
package functions to analyze our results. Since we use a “fake” model object, marginaleffects
cannot retrieve the original data from the model object, and we always need to supply a newdata
argument:
## predictions on the original dataset
predictions(mod, newdata = df) |> head()
## predictions for user-defined predictor values
predictions(mod, newdata = datagrid(newdata = df, Age = c(60, 70)))
predictions(mod, newdata = datagrid(newdata = df, Age = range))
## average predictions by group
predictions(mod, newdata = df, by = "Sex")
## contrasts (average)
avg_comparisons(mod, variables = "Age", newdata = df)
avg_comparisons(mod, variables = list("Age" = "sd"), newdata = df)
## slope (elasticity)
avg_slopes(mod, variables = "Age", slope = "eyex", newdata = df)