19 Missing Data
The marginaleffects
package offers convenience functions to compute and display predictions, contrasts, and marginal effects from models with multiple imputation from the mice
and Amelia
packages. The workflow follows Rubin’s rules (Rubin, 1987, p. 76), via the following steps:
- Impute \(M\) data sets.
- Fit a model in each of the \(M\) imputed data sets.
- Compute marginal effects in each of the \(M\) data sets.
- Pool results.
To highlight the workflow, we consider a simple linear regression model, although the same workflow should work with any model type that is fit using a formula interface and a data
argument.
marginaleffects
directly supports the mice
and Amelia
imputation packages, as well as any other package that can return a list of imputed data frames. This is demonstrated below using the iris
dataset, in which we insert missing observations randomly and then impute missing values using several packages.
19.1 mice
First, we impute the dataset using the mice
package:
Then, we use the standard mice
syntax to produce an object of class mira
with all the models:
Finally, we feed the mira
object to a marginaleffects
function:
mfx_mice <- avg_slopes(mod_mice, by = "Species")
mfx_mice
#>
#> Term Contrast Species Estimate Std. Error t Pr(>|t|) S 2.5 % 97.5 % Df
#> Sepal.Length mean(dY/dX) setosa 0.0684 0.0560 1.222 0.22414 2.2 -0.0424 0.179 120.0
#> Sepal.Length mean(dY/dX) versicolor 0.0540 0.0558 0.968 0.33550 1.6 -0.0567 0.165 93.6
#> Sepal.Length mean(dY/dX) virginica 0.0582 0.0512 1.137 0.25818 2.0 -0.0434 0.160 101.2
#> Sepal.Width mean(dY/dX) setosa 0.1890 0.0836 2.260 0.02436 5.4 0.0246 0.353 400.5
#> Sepal.Width mean(dY/dX) versicolor 0.2092 0.0772 2.710 0.00807 7.0 0.0558 0.363 89.0
#> Sepal.Width mean(dY/dX) virginica 0.2242 0.1041 2.155 0.03506 4.8 0.0162 0.432 61.8
#> Species mean(versicolor) - mean(setosa) setosa 1.1399 0.0977 11.668 < 0.001 68.1 0.9464 1.333 114.8
#> Species mean(versicolor) - mean(setosa) versicolor 1.1399 0.0977 11.668 < 0.001 68.1 0.9464 1.333 114.8
#> Species mean(versicolor) - mean(setosa) virginica 1.1399 0.0977 11.668 < 0.001 68.1 0.9464 1.333 114.8
#> Species mean(virginica) - mean(setosa) setosa 1.7408 0.1108 15.709 < 0.001 100.7 1.5214 1.960 121.6
#> Species mean(virginica) - mean(setosa) versicolor 1.7408 0.1108 15.709 < 0.001 100.7 1.5214 1.960 121.6
#> Species mean(virginica) - mean(setosa) virginica 1.7408 0.1108 15.709 < 0.001 100.7 1.5214 1.960 121.6
#>
#> Columns: term, contrast, Species, estimate, std.error, s.value, predicted_lo, predicted_hi, predicted, df, statistic, p.value, conf.low, conf.high
#> Type: response
19.2 Amelia
With Amelia
, the workflow is essentially the same. First, we impute using Amelia
:
Then, we use Amelia
syntax to produce an object of class amest
with all the models:
Finally, we feed the amest
object to a marginaleffects
function:
mfx_amelia <- avg_slopes(mod_amelia, by = "Species")
mfx_amelia
#>
#> Term Contrast Species Estimate Std. Error t Pr(>|t|) S 2.5 % 97.5 % Df
#> Sepal.Length mean(dY/dX) setosa 0.3878 0.0907 4.278 < 0.001 13.3 0.205 0.5705 43.9
#> Sepal.Length mean(dY/dX) versicolor 0.3231 0.0802 4.029 < 0.001 12.5 0.162 0.4838 56.0
#> Sepal.Length mean(dY/dX) virginica 0.3467 0.0799 4.340 < 0.001 13.6 0.186 0.5077 44.7
#> Sepal.Width mean(dY/dX) setosa -0.2079 0.1491 -1.394 0.16878 2.6 -0.507 0.0909 55.0
#> Sepal.Width mean(dY/dX) versicolor -0.1157 0.1168 -0.991 0.32647 1.6 -0.350 0.1187 51.8
#> Sepal.Width mean(dY/dX) virginica -0.0452 0.1272 -0.355 0.72323 0.5 -0.298 0.2079 82.1
#> Species mean(versicolor) - mean(setosa) setosa 0.6127 0.1731 3.541 0.00111 9.8 0.262 0.9635 36.7
#> Species mean(versicolor) - mean(setosa) versicolor 0.6127 0.1731 3.541 0.00111 9.8 0.262 0.9635 36.7
#> Species mean(versicolor) - mean(setosa) virginica 0.6127 0.1731 3.541 0.00111 9.8 0.262 0.9635 36.7
#> Species mean(virginica) - mean(setosa) setosa 1.0364 0.2004 5.171 < 0.001 16.6 0.629 1.4436 34.2
#> Species mean(virginica) - mean(setosa) versicolor 1.0364 0.2004 5.171 < 0.001 16.6 0.629 1.4436 34.2
#> Species mean(virginica) - mean(setosa) virginica 1.0364 0.2004 5.171 < 0.001 16.6 0.629 1.4436 34.2
#>
#> Columns: term, contrast, Species, estimate, std.error, s.value, predicted_lo, predicted_hi, predicted, df, statistic, p.value, conf.low, conf.high
#> Type: response
19.3 Other imputation packages: missRanger
, or lists of imputed data frames.
Several R
packages can impute missing data. Indeed, the Missing Data CRAN View
lists at least a dozen alternatives. Since user interfaces change a lot from package to package, marginaleffects
supports a single workflow that can be used, with some adaptation, with all imputation packages:
- Use an external package to create a list of imputed data frames.
- Apply the
datalist2mids()
function from themiceadds
package to convert the list of imputed data frames to amids
object. - Use the
with()
function to fit models to createmira
object, as illustrated in themice
andAmelia
sections above. - Pass the
mira
object to amarginaleffects
function.
Consider the imputation package missRanger
, which generates a list of imputed datasets:
library(miceadds)
library(missRanger)
## convert lists of imputed datasets to `mids` objects
dat_missRanger <- replicate(20, missRanger(dat, verbose = 0), simplify = FALSE)
mids_missRanger <- datlist2mids(dat_missRanger)
## fit models
mod_missRanger <- with(mids_missRanger, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))
## `missRanger` slopes
mfx_missRanger <- avg_slopes(mod_missRanger, by = "Species")
mfx_missRanger
#>
#> Term Contrast Species Estimate Std. Error t Pr(>|t|) S 2.5 % 97.5 % Df
#> Sepal.Length mean(dY/dX) setosa 0.0586 0.0414 1.42 0.15672 2.7 -0.02249 0.140 2780689
#> Sepal.Length mean(dY/dX) versicolor 0.0675 0.0392 1.72 0.08514 3.6 -0.00934 0.144 721339
#> Sepal.Length mean(dY/dX) virginica 0.0643 0.0367 1.75 0.07986 3.6 -0.00766 0.136 1020604
#> Sepal.Width mean(dY/dX) setosa 0.2314 0.0690 3.35 < 0.001 10.3 0.09612 0.367 1911621
#> Sepal.Width mean(dY/dX) versicolor 0.2186 0.0551 3.97 < 0.001 13.8 0.11066 0.327 246676
#> Sepal.Width mean(dY/dX) virginica 0.2089 0.0687 3.04 0.00237 8.7 0.07420 0.344 194885
#> Species mean(versicolor) - mean(setosa) setosa 1.1589 0.0704 16.46 < 0.001 199.8 1.02091 1.297 1115135
#> Species mean(versicolor) - mean(setosa) versicolor 1.1589 0.0704 16.46 < 0.001 199.8 1.02091 1.297 1115135
#> Species mean(versicolor) - mean(setosa) virginica 1.1589 0.0704 16.46 < 0.001 199.8 1.02091 1.297 1115135
#> Species mean(virginica) - mean(setosa) setosa 1.7781 0.0822 21.64 < 0.001 342.5 1.61703 1.939 1547086
#> Species mean(virginica) - mean(setosa) versicolor 1.7781 0.0822 21.64 < 0.001 342.5 1.61703 1.939 1547086
#> Species mean(virginica) - mean(setosa) virginica 1.7781 0.0822 21.64 < 0.001 342.5 1.61703 1.939 1547086
#>
#> Columns: term, contrast, Species, estimate, std.error, s.value, predicted_lo, predicted_hi, predicted, df, statistic, p.value, conf.low, conf.high
#> Type: response
19.4 Comparing results with different imputation software
We can use the modelsummary
package to compare the results with listwise deletion to the results using different imputations software:
library(modelsummary)
## listwise deletion slopes
mod_lwd <- lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species, data = dat)
mfx_lwd <- avg_slopes(mod_lwd, by = "Species")
## regression table
models <- list(
"LWD" = mfx_lwd,
"mice" = mfx_mice,
"Amelia" = mfx_amelia,
"missRanger" = mfx_missRanger)
modelsummary(models, shape = term : contrast + Species ~ model)
Species | LWD | mice | Amelia | missRanger | |
---|---|---|---|---|---|
Sepal.Length mean(dY/dX) | setosa | 0.033 | 0.068 | 0.388 | 0.059 |
setosa | (0.061) | (0.056) | (0.091) | (0.041) | |
versicolor | 0.050 | 0.054 | 0.323 | 0.067 | |
versicolor | (0.061) | (0.056) | (0.080) | (0.039) | |
virginica | 0.043 | 0.058 | 0.347 | 0.064 | |
virginica | (0.058) | (0.051) | (0.080) | (0.037) | |
Sepal.Width mean(dY/dX) | setosa | 0.274 | 0.189 | −0.208 | 0.231 |
setosa | (0.091) | (0.084) | (0.149) | (0.069) | |
versicolor | 0.255 | 0.209 | −0.116 | 0.219 | |
versicolor | (0.074) | (0.077) | (0.117) | (0.055) | |
virginica | 0.234 | 0.224 | −0.045 | 0.209 | |
virginica | (0.083) | (0.104) | (0.127) | (0.069) | |
Species mean(versicolor) - mean(setosa) | setosa | 1.157 | 1.140 | 0.613 | 1.159 |
setosa | (0.097) | (0.098) | (0.173) | (0.070) | |
versicolor | 1.157 | 1.140 | 0.613 | 1.159 | |
versicolor | (0.097) | (0.098) | (0.173) | (0.070) | |
virginica | 1.157 | 1.140 | 0.613 | 1.159 | |
virginica | (0.097) | (0.098) | (0.173) | (0.070) | |
Species mean(virginica) - mean(setosa) | setosa | 1.839 | 1.741 | 1.036 | 1.778 |
setosa | (0.123) | (0.111) | (0.200) | (0.082) | |
versicolor | 1.839 | 1.741 | 1.036 | 1.778 | |
versicolor | (0.123) | (0.111) | (0.200) | (0.082) | |
virginica | 1.839 | 1.741 | 1.036 | 1.778 | |
virginica | (0.123) | (0.111) | (0.200) | (0.082) | |
Num.Obs. | 60 | 150 | 150 | 150 | |
Num.Imp. | 20 | 20 | 20 | ||
R2 | 0.953 | 0.930 | 0.853 | 0.947 | |
R2 Adj. | 0.949 | 0.928 | 0.848 | 0.945 | |
AIC | −34.0 | ||||
BIC | −19.3 | ||||
Log.Lik. | 23.997 | ||||
F | 220.780 | ||||
RMSE | 0.16 |
19.5 Passing new data arguments: newdata
, wts
, by
, etc.
Sometimes we want to pass arguments changing or specifying the data on which we will do our analysis using marginaleffects
. This can be for reasons such as wanting to specify the values or weights at which we evaluate e.g. avg_slopes()
, or due to the underlying models not robustly preserving all the original data columns (such as fixest
objects not saving their data in the fit object making it potentially challenging to retrieve, and even if retrievable it will not include the weights used during fitting as a column as wts
expects when given a string).
If we are not using multiple imputation, or if we want to just pass a single dataset to the several fitted models after multiple imputation, we can pass a single dataset to the newdata
argument. However, if we wish to supply each model in our list resulting after multiple imputation with a /different/ dataset on which to calculate results, we cannot use newdata
. Instead, in this case it can be useful to revert to a more manual (but still very easy) approach. Here is an example calculating avg_slopes()
using a different set of weights for each of the fixest
models which we fit after multiple imputation.
set.seed(1024)
library(mice)
library(fixest)
library(marginaleffects)
dat <- mtcars
## insert missing values
dat$hp[sample(seq_len(nrow(mtcars)), 10)] <- NA
dat$mpg[sample(seq_len(nrow(mtcars)), 10)] <- NA
dat$gear[sample(seq_len(nrow(mtcars)), 10)] <- NA
## multiple imputation
dat <- mice(dat, m = 5, method = "sample", printFlag = FALSE)
dat <- complete(dat, action = "all")
## fit models
mod <- lapply(dat, \(x)
feglm(am ~ mpg * cyl + hp,
weight = ~gear,
family = binomial,
data = x))
## slopes without weights
lapply(seq_along(mod), \(i)
avg_slopes(mod[[i]], newdata = dat[[i]])) |>
mice::pool()
#> Class: mipo m = 5
#> term m estimate ubar b t dfcom df riv lambda fmi
#> 1 cyl 5 -0.134279397 7.096230e-04 2.347149e-03 3.526201e-03 29 2.921624 3.969119 0.7987571 0.8667259
#> 2 hp 5 0.001649797 5.709333e-07 1.375524e-06 2.221563e-06 29 3.557012 2.891107 0.7430037 0.8213920
#> 3 mpg 5 0.006083006 1.080610e-04 2.722367e-04 4.347450e-04 29 3.458500 3.023146 0.7514383 0.8284103
## slopes with weights
lapply(seq_along(mod), \(i)
avg_slopes(mod[[i]], newdata = dat[[i]], wts = "gear")) |>
mice::pool()
#> Class: mipo m = 5
#> term m estimate ubar b t dfcom df riv lambda fmi
#> 1 cyl 5 -0.135839559 7.280346e-04 2.480980e-03 3.705211e-03 29 2.868608 4.089333 0.8035106 0.8704735
#> 2 hp 5 0.001671181 5.697784e-07 1.424665e-06 2.279376e-06 29 3.474886 3.000461 0.7500288 0.8272414
#> 3 mpg 5 0.006251264 1.056056e-04 2.705362e-04 4.302490e-04 29 3.422454 3.074113 0.7545478 0.8309834