40.2 Calling marginaleffects in functions, loops, environments, or after re-assigning variables
Functions from the marginaleffects package can sometimes fail when they are called inside functions, loops, or other environments. To see why, it is important to know that marginaleffects often needs to operate on the original data that was used to fit the model. To extract this original data, we use the get_data() function from the insight package.
In most cases, get_data() can extract the data which is stored inside the model object created by the modeling package. However, some modeling packages do not save the original data in the model object (in order to save memory). In those cases, get_data() will parse the call to find the name of the data object, and will search for that data object in the global environment. When users fit models in a different environment (e.g., function calls), get_data() may not be able to retrieve the original data.
A related problem can arise if users fit a model, but then assign a new value to the variable that used to store the dataset.
Recommendations:
Supply your dataset explicitly to the newdata argument of marginaleffects functions such as slopes().
Avoid assigning a new value to a variable that you use to store a dataset for model fitting.
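The first recommendation can be sketched as follows; the model and variable names here are illustrative, not taken from the original text:

```r
# Sketch: pass the dataset explicitly via `newdata`, so marginaleffects
# does not need to look the data up in the calling environment.
library(marginaleffects)

fit_in_function <- function(d) {
    mod <- glm(am ~ vs, data = d, family = binomial)
    avg_comparisons(mod, variables = "vs", newdata = d)
}

fit_in_function(mtcars)
```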
Another alternative is to re-assign relevant variables explicitly inside the function, to ensure that marginaleffects gets access to them. For instance,
```r
get_int_eff <- function(formula, data, boot = FALSE) {
    assign("formula", formula, envir = parent.frame())
    mod <- glm(formula = formula, data = data)
    int_eff <- avg_comparisons(
        model = mod,
        variables = "am",
        type = "response")
    if (boot) {
        int_eff <- inferences(int_eff, method = "boot", R = 100, conf_type = "perc")
    }
    return(int_eff)
}
```
40.3 Equivalence between avg_comparisons() and avg_predictions()
Say we estimate a simple logit regression and want to compute the average log-odds ratio associated with a change from 0 to 1 on the vs variable. We can proceed as follows using the avg_comparisons() function:
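The model-fitting call and the marginaleffects invocation are not shown in this excerpt. A hedged sketch of what such a call could look like follows; the formula is an assumption for illustration only, so its numbers will not match the output shown here:

```r
# Hypothetical sketch: the exact model from the original text is not shown.
# The formula below is an illustrative assumption.
library(marginaleffects)
mod <- glm(am ~ vs + hp, data = mtcars, family = binomial)
# "lnoravg" computes the log of the ratio of average odds,
# i.e. ln(odds(1) / odds(0)) for a change in `vs` from 0 to 1
comparisons(mod, variables = "vs", comparison = "lnoravg", type = "response")
```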
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
-0.901 0.368 -2.45 0.0143 6.1 -1.62 -0.18
Term: vs
Type: response
Comparison: ln(odds(1) / odds(0))
We can specify the function to compare “hi” predictions to “lo” predictions manually and get the same results:
```r
fun <- function(hi, lo) {
    log((mean(hi) / (1 - mean(hi))) / (mean(lo) / (1 - mean(lo))))
}
comparisons(mod, variables = "vs", comparison = fun, type = "response")
```
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
-0.901 0.368 -2.45 0.0143 6.1 -1.62 -0.18
Term: vs
Type: response
Comparison: 1, 0
The same results can be obtained using the predictions() function. First, we replicate the entire dataset, and substitute the values of vs. Then, we make predictions. Finally, we compute the log odds ratios:
```r
dat0 <- transform(mtcars, vs = 0)
dat1 <- transform(mtcars, vs = 1)
p0 <- predictions(mod, newdata = dat0, type = "response")
p1 <- predictions(mod, newdata = dat1, type = "response")
fun(p1$estimate, p0$estimate)
```
[1] -0.9006949
Notice that the following command gives us a different result. This is because, instead of duplicating the full dataset and computing the “hi” and “lo” terms as mean predictions over the full data, we only compute the averages on the subsets of rows where vs is 0 or 1, respectively:
```r
p <- avg_predictions(mod, by = "vs", type = "response")
fun(p$estimate[2], p$estimate[1])
```
[1] 0.6931472
This is equivalent to:
```r
dat0 <- subset(mtcars, vs == 0)
dat1 <- subset(mtcars, vs == 1)
p0 <- predictions(mod, newdata = dat0, type = "response")
p1 <- predictions(mod, newdata = dat1, type = "response")
fun(p1$estimate, p0$estimate)
```
[1] 0.6931472
40.4 Dates
Date predictors are not supported by default in marginaleffects. There are two main reasons for this. First, many quantities of interest are only formally defined for numeric variables (e.g., derivatives). Second, an object like “2024-11-04” does not have a natural arithmetic. What does it mean to take a sum like \(\pi\) + “2024-11-04”? In that spirit, marginaleffects requires that the user transform the date variable into a numeric variable before fitting the model. This makes operations more explicit.
In this example, we fit one model with a date and another with an integer version of the date. We call predict() to illustrate that the two models yield exactly identical predictions:
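The original code is not reproduced in this excerpt; here is a minimal sketch of the idea with simulated data (variable names are illustrative):

```r
# Simulate a date predictor and a numeric outcome
set.seed(48103)
dat <- data.frame(
    date = as.Date("2024-01-01") + 0:99,
    y = rnorm(100))
# Integer version of the same predictor (days since 1970-01-01)
dat$date_num <- as.numeric(dat$date)
m1 <- lm(y ~ date, data = dat)      # date predictor
m2 <- lm(y ~ date_num, data = dat)  # integer predictor
# The two models yield identical predictions
all.equal(as.numeric(predict(m1)), as.numeric(predict(m2)))
```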
40.5 Customizing plots
There are three main ways to customize the plots produced by marginaleffects:
1. Using the arguments provided by plot_*() functions in the marginaleffects package.
2. Modifying the plot objects using ggplot2 or an extension package like ggExtra.
3. Setting draw=FALSE to extract the plotting data, and then feeding that data to standalone plotting software like ggplot2, plot, or lattice.
Options 2 and 3 are extremely flexible, and make it trivial to reproduce the default plots or to draw highly customized visualizations.
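As an illustration of option 3, here is a sketch; the model and variables are ours, not from the original text:

```r
library(marginaleffects)
library(ggplot2)

mod <- lm(mpg ~ hp + wt, data = mtcars)

# draw = FALSE returns the underlying plotting data instead of a ggplot object
d <- plot_predictions(mod, condition = "hp", draw = FALSE)

# Rebuild (or freely customize) the plot with ggplot2
ggplot(d, aes(hp, estimate, ymin = conf.low, ymax = conf.high)) +
    geom_ribbon(alpha = 0.2) +
    geom_line()
```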
Of course, some users might still prefer to use the function arguments directly available in marginaleffects::plot_*(), and the unavailability of some arguments can be frustrating. Nevertheless, a deliberate decision was made to keep the number of arguments in plot_*() functions quite low. This is because ggplot2 functions like geom_*(), facet_*(), and aes() each accept a very large number of options, and because users’ needs are incredibly varied. This makes it difficult to accommodate all needs and possibilities without developing unwieldy functions with dozens of arguments.1
Thus, the marginaleffects package is designed with limited scope, focusing on computation while offering a few visualization options for quick exploration. Users who need highly customized or publication-ready figures are encouraged to use the draw=FALSE argument with standalone plotting software.
In an ideal world, someone else would build a richer visualization package on top of marginaleffects. I think this would be a very cool project, and I would be happy to collaborate, advise, and help if I can. In the meantime, feel free to keep the discussion going by opening an issue on GitHub. I will be happy to consider counter-arguments, suggestions, or pull requests.
40.6 What’s the difference between condition and by in plot_*() functions?
The two arguments control how other variables in the model are handled when plotting predictions, comparisons, or slopes.
40.6.0.1 condition: Conditional estimates for a single (average or modal) hypothetical individual
Answers this question:
How would the estimated quantity change if we changed the focal predictors specified in condition, while holding all other covariates at their mean or modal values?
Builds a grid over your focal predictors, while holding all other variables (including random effects in mixed models) at a fixed reference value, typically their mean, mode, or a user-defined value.
Equivalent to newdata = datagrid(...) in predictions().
Appropriate when:
We are substantively interested in the “average person,” that is, in a person or unit that is exactly average or modal on all dimensions.
Computation is expensive and we need fast results.
The focal predictor we want to display on the x-axis is continuous, which makes averaging estimates for each unique value of that continuous predictor difficult. (This often results in a jagged-looking plot.)
Inappropriate when:
The hypothetical average individual does not exist, or is not particularly interesting.
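To make the equivalence with datagrid() concrete, here is a hedged sketch (the model and values are illustrative): the condition mechanism behaves like predicting on a grid where the focal predictor varies and non-focal covariates are held at central values.

```r
library(marginaleffects)

mod <- lm(mpg ~ hp + wt + factor(cyl), data = mtcars)

# plot_predictions(mod, condition = "hp") is like predicting on a grid:
# `hp` varies while `wt` is held at its mean and `cyl` at its mode
predictions(mod, newdata = datagrid(hp = seq(52, 335, length.out = 25)))
```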
40.6.0.2 by: Average (marginal) predictions, slopes, and comparisons
Answers this question:
What is the average prediction/comparison/slope in the subgroups defined by the focal predictors specified in by?
Computes predictions for each row in the original data, using their actual covariate values, then aggregates/averages within subgroups defined by by.
Appropriate when:
The sample is roughly representative of the target population.
We are interested in an average prediction or an average effect, marginalized for different subgroups of the variables specified in by.
Inappropriate when:
The x-axis variable is continuous. Since we average estimates for every value of the x-axis variable, we get averages computed in very small bins, with different “mix” of control variables in each bin. This can lead to jagged-looking plots, which are not very informative.
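A minimal sketch of the by workflow (the model and grouping variable are illustrative):

```r
library(marginaleffects)

mod <- lm(mpg ~ hp + factor(cyl), data = mtcars)

# Predictions for every row of the original data, averaged within
# the subgroups defined by `cyl`
avg_predictions(mod, by = "cyl")
# plot_predictions(mod, by = "cyl") would display these same averages
```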
FWIW, I (Vincent) personally tend to nearly always use by, unless the x-axis variable is continuous and I want a clean, smooth line.
Note that we cannot pass visual arguments to ..., because that mechanism is already used to pass arguments to the predict() methods, which is required for some model types.↩︎