40.2 Calling marginaleffects in functions, loops, environments, or after re-assigning variables
Functions from the marginaleffects package can sometimes fail when they are called inside functions, loops, or other environments. To see why, it is important to know that marginaleffects often needs to operate on the original data that was used to fit the model. To extract this original data, we use the get_data() function from the insight package.
In most cases, get_data() can extract the data which is stored inside the model object created by the modeling package. However, some modeling packages do not save the original data in the model object (in order to save memory). In those cases, get_data() will parse the call to find the name of the data object, and will search for that data object in the global environment. When users fit models in a different environment (e.g., function calls), get_data() may not be able to retrieve the original data.
A related problem can arise if users fit a model, but then assign a new value to the variable that used to store the dataset.
Recommendations:
Supply your dataset explicitly to the newdata argument of marginaleffects functions such as slopes().
Avoid assigning a new value to a variable that you use to store a dataset for model fitting.
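The first recommendation can be sketched as follows; the model and variable names here are illustrative, not taken from the original text:

```r
# Sketch: pass the dataset explicitly via `newdata`, so marginaleffects
# does not need to look the data up in the calling environment.
library(marginaleffects)

fit_in_function <- function(d) {
    mod <- glm(am ~ vs, data = d, family = binomial)
    avg_comparisons(mod, variables = "vs", newdata = d)
}

fit_in_function(mtcars)
```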
Another alternative is to re-assign relevant variables explicitly inside the function, to ensure that marginaleffects gets access to them. For instance,
```r
get_int_eff <- function(formula, data, boot = FALSE) {
    assign("formula", formula, envir = parent.frame())
    mod <- glm(formula = formula, data = data)
    int_eff <- avg_comparisons(
        model = mod,
        variables = "am",
        type = "response")
    if (boot) {
        int_eff <- inferences(int_eff, method = "boot", R = 100, conf_type = "perc")
    }
    return(int_eff)
}
```
40.3 Equivalence between avg_comparisons() and avg_predictions()
Say we estimate a simple logit regression and want to compute the average log-odds ratio associated with a change from 0 to 1 on the vs variable. We can proceed as follows using the avg_comparisons() function:
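The model-fitting call and the marginaleffects invocation are not shown in this excerpt. A hedged sketch of what such a call could look like follows; the formula is an assumption for illustration only, so its numbers will not match the output shown here:

```r
# Hypothetical sketch: the exact model from the original text is not shown.
# The formula below is an illustrative assumption.
library(marginaleffects)
mod <- glm(am ~ vs + hp, data = mtcars, family = binomial)
# "lnoravg" computes the log of the ratio of average odds,
# i.e. ln(odds(1) / odds(0)) for a change in `vs` from 0 to 1
comparisons(mod, variables = "vs", comparison = "lnoravg", type = "response")
```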
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
-0.901 0.368 -2.45 0.0143 6.1 -1.62 -0.18
Term: vs
Type: response
Comparison: ln(odds(1) / odds(0))
We can specify the function to compare “hi” predictions to “lo” predictions manually and get the same results:
```r
fun <- function(hi, lo) {
    log((mean(hi) / (1 - mean(hi))) / (mean(lo) / (1 - mean(lo))))
}
comparisons(mod, variables = "vs", comparison = fun, type = "response")
```
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
-0.901 0.368 -2.45 0.0143 6.1 -1.62 -0.18
Term: vs
Type: response
Comparison: 1, 0
The same results can be obtained using the predictions() function. First, we replicate the entire dataset, and substitute the values of vs. Then, we make predictions. Finally, we compute the log odds ratios:
```r
dat0 <- transform(mtcars, vs = 0)
dat1 <- transform(mtcars, vs = 1)
p0 <- predictions(mod, newdata = dat0, type = "response")
p1 <- predictions(mod, newdata = dat1, type = "response")
fun(p1$estimate, p0$estimate)
```
[1] -0.9006949
Notice that the following command gives us a different result. This is because, instead of duplicating the full dataset and computing the “hi” and “lo” terms as mean predictions over the full data, we only compute the averages on the subsets of rows where vs is 0 or 1, respectively:
```r
p <- avg_predictions(mod, by = "vs", type = "response")
fun(p$estimate[2], p$estimate[1])
```
[1] 0.6931472
This is equivalent to:
```r
dat0 <- subset(mtcars, vs == 0)
dat1 <- subset(mtcars, vs == 1)
p0 <- predictions(mod, newdata = dat0, type = "response")
p1 <- predictions(mod, newdata = dat1, type = "response")
fun(p1$estimate, p0$estimate)
```
[1] 0.6931472
40.4 Dates
Date predictors are not supported by default in marginaleffects. There are two main reasons for this. First, many quantities of interest are only formally defined for numeric variables (e.g., derivatives). Second, an object like “2024-11-04” does not have a natural arithmetic. What does it mean to take a sum like \(\pi\) + “2024-11-04”? In that spirit, marginaleffects requires that the user transform the date variable into a numeric variable before fitting the model. This makes operations more explicit.
In this example, we fit one model with a date and another with an integer version of the date. We call predict() to illustrate that the two models yield exactly identical predictions:
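The original code is not reproduced in this excerpt; here is a minimal sketch of the idea with simulated data (variable names are illustrative):

```r
# Simulate a date predictor and a numeric outcome
set.seed(48103)
dat <- data.frame(
    date = as.Date("2024-01-01") + 0:99,
    y = rnorm(100))
# Integer version of the same predictor (days since 1970-01-01)
dat$date_num <- as.numeric(dat$date)
m1 <- lm(y ~ date, data = dat)      # date predictor
m2 <- lm(y ~ date_num, data = dat)  # integer predictor
# The two models yield identical predictions
all.equal(as.numeric(predict(m1)), as.numeric(predict(m2)))
```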
40.5 Customizing plots
There are three main ways to customize the plots produced by marginaleffects:
1. Using the arguments provided by plot_*() functions in the marginaleffects package.
2. Modifying the plot objects using ggplot2 or an extension package like ggExtra.
3. Setting draw=FALSE to extract the plotting data, and then feeding that data to standalone plotting software like ggplot2, plot, or lattice.
Options 2 and 3 are extremely flexible, and make it trivial to reproduce the default plots or to draw highly customized visualizations.
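As an illustration of option 3, here is a sketch; the model and variables are ours, not from the original text:

```r
library(marginaleffects)
library(ggplot2)

mod <- lm(mpg ~ hp + wt, data = mtcars)

# draw = FALSE returns the underlying plotting data instead of a ggplot object
d <- plot_predictions(mod, condition = "hp", draw = FALSE)

# Rebuild (or freely customize) the plot with ggplot2
ggplot(d, aes(hp, estimate, ymin = conf.low, ymax = conf.high)) +
    geom_ribbon(alpha = 0.2) +
    geom_line()
```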
Of course, some users might still prefer to use the function arguments directly available in marginaleffects::plot_*(), and the unavailability of some arguments can be frustrating. Nevertheless, a deliberate decision was made to keep the number of arguments in plot_*() functions quite low. This is because ggplot2 functions like geom_*(), facet_*(), and aes() each accept a very large number of options, and because users’ needs are incredibly varied. This makes it difficult to accommodate all needs and possibilities without developing unwieldy functions with dozens of arguments.1
Thus, the marginaleffects package is designed with limited scope, focusing on computation while offering a few visualization options for quick exploration. Users who need highly customized or publication-ready figures are encouraged to use the draw=FALSE argument with standalone plotting software.
In an ideal world, someone else would build a richer visualization package on top of marginaleffects. I think this would be a very cool project, and I would be happy to collaborate, advise, and help if I can. In the meantime, feel free to keep the discussion going by opening an issue on GitHub. I will be happy to consider counter-arguments, suggestions, or pull requests.
40.6 What’s the difference between condition and by in plot_*() functions?
The two arguments control how other variables in the model are handled when plotting predictions, comparisons, or slopes.
40.6.0.1 condition: Conditional estimates for a single (average or modal) hypothetical individual
Answers this question:
How would the estimated quantity change if we changed the focal predictors specified in condition, while holding all other covariates at their mean or modal values?
Builds a grid over your focal predictors, while holding all other variables (including random effects in mixed models) at a fixed reference value, typically their mean, mode, or a user-defined value.
Equivalent to newdata = datagrid(...) in predictions().
Appropriate when:
We are substantively interested in the “average person,” that is, in a person or unit that is exactly average or modal on all dimensions.
Computation is expensive and we need fast results.
The focal predictor we want to display on the x-axis is continuous, which makes averaging estimates for each unique value of that continuous predictor difficult. (This often results in a jagged-looking plot.)
Inappropriate when:
The hypothetical average individual does not exist, or is not particularly interesting.
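To make the equivalence with datagrid() concrete, here is a hedged sketch (the model and values are illustrative): the condition mechanism behaves like predicting on a grid where the focal predictor varies and non-focal covariates are held at central values.

```r
library(marginaleffects)

mod <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)

# plot_predictions(mod, condition = "hp") is like predicting on a grid:
# `hp` varies while `wt` is held at its mean and `cyl` at its mode
predictions(mod, newdata = datagrid(hp = seq(52, 335, length.out = 25)))
```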
40.6.0.2 by: Average (marginal) predictions, slopes, and comparisons
Answers this question:
What is the average prediction/comparison/slope in the subgroups defined by the focal predictors specified in by?
Computes predictions for each row in the original data, using their actual covariate values, then aggregates/averages within subgroups defined by by.
Appropriate when:
The sample is roughly representative of the target population.
We are interested in an average prediction or an average effect, marginalized for different subgroups of the variables specified in by.
Inappropriate when:
The x-axis variable is continuous. Since we average estimates for every value of the x-axis variable, we get averages computed in very small bins, with different “mix” of control variables in each bin. This can lead to jagged-looking plots, which are not very informative.
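A minimal sketch of the by workflow (the model and grouping variable are illustrative):

```r
library(marginaleffects)

mod <- lm(mpg ~ hp + factor(cyl), data = mtcars)

# Predictions for every row of the original data, averaged within
# the subgroups defined by `cyl`
avg_predictions(mod, by = "cyl")
# plot_predictions(mod, by = "cyl") would display these same averages
```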
FWIW, I (Vincent) personally tend to nearly always use by, unless the x-axis variable is continuous and I want a clean, smooth line.
Note that we cannot pass visual arguments to ..., because that mechanism is already used to pass arguments to the predict() methods, which is required for some model types.↩︎