31 Missing Data
The marginaleffects
package offers convenience functions to compute and display predictions, contrasts, and marginal effects from models with multiple imputation from the mice
and Amelia
packages. The workflow follows Rubin’s rules (Rubin, 1987, p. 76), via the following steps:
- Impute \(M\) data sets.
- Fit a model in each of the \(M\) imputed data sets.
- Compute marginal effects in each of the \(M\) data sets.
- Pool results.
To highlight the workflow, we consider a simple linear regression model, although the same workflow should work with any model type that is fit using a formula interface and a data
argument.
marginaleffects
directly supports the mice
and Amelia
imputation packages, as well as any other package that can return a list of imputed data frames. This is demonstrated below using the iris
dataset, in which we insert missing observations randomly and then impute missing values using several packages.
31.1 mice
First, we impute the dataset using the mice
package:
Then, we use the standard mice
syntax to produce an object of class mira
with all the models:
Finally, we feed the mira
object to a marginaleffects
function:
mfx_mice <- avg_slopes(mod_mice, by = "Species")
mfx_mice
Term Contrast Species Estimate Std. Error t
Sepal.Length dY/dX setosa 0.0684 0.0560 1.222
Sepal.Length dY/dX versicolor 0.0540 0.0558 0.968
Sepal.Length dY/dX virginica 0.0582 0.0512 1.137
Sepal.Width dY/dX setosa 0.1890 0.0836 2.260
Sepal.Width dY/dX versicolor 0.2092 0.0772 2.710
Sepal.Width dY/dX virginica 0.2242 0.1041 2.154
Species versicolor - setosa setosa 1.1399 0.0977 11.668
Species virginica - setosa setosa 1.7408 0.1108 15.709
Species versicolor - setosa versicolor 1.1399 0.0977 11.668
Species virginica - setosa versicolor 1.7408 0.1108 15.709
Species versicolor - setosa virginica 1.1399 0.0977 11.668
Species virginica - setosa virginica 1.7408 0.1108 15.709
Pr(>|t|) S 2.5 % 97.5 % Df
0.22747 2.1 -0.0440 0.181 49.9
0.33859 1.6 -0.0585 0.166 42.6
0.26152 1.9 -0.0449 0.161 44.8
0.02631 5.2 0.0228 0.355 87.0
0.00976 6.7 0.0533 0.365 41.1
0.03901 4.7 0.0120 0.436 31.3
< 0.001 49.7 0.9435 1.336 48.6
< 0.001 67.5 1.5182 1.963 50.3
< 0.001 49.7 0.9435 1.336 48.6
< 0.001 67.5 1.5182 1.963 50.3
< 0.001 49.7 0.9435 1.336 48.6
< 0.001 67.5 1.5182 1.963 50.3
Type: response
31.2 Amelia
With Amelia
, the workflow is essentially the same. First, we impute using Amelia
:
Then, we use Amelia
syntax to produce an object of class amest
with all the models:
Finally, we feed the amest
object to a marginaleffects
function:
mfx_amelia <- avg_slopes(mod_amelia, by = "Species")
mfx_amelia
Term Contrast Species Estimate Std. Error t
Sepal.Length dY/dX setosa 0.3878 0.0907 4.278
Sepal.Length dY/dX versicolor 0.3231 0.0802 4.030
Sepal.Length dY/dX virginica 0.3467 0.0799 4.340
Sepal.Width dY/dX setosa -0.2079 0.1491 -1.395
Sepal.Width dY/dX versicolor -0.1157 0.1168 -0.991
Sepal.Width dY/dX virginica -0.0452 0.1272 -0.355
Species versicolor - setosa setosa 0.6127 0.1731 3.541
Species virginica - setosa setosa 1.0364 0.2004 5.171
Species versicolor - setosa versicolor 0.6127 0.1731 3.541
Species virginica - setosa versicolor 1.0364 0.2004 5.171
Species versicolor - setosa virginica 0.6127 0.1731 3.541
Species virginica - setosa virginica 1.0364 0.2004 5.171
Pr(>|t|) S 2.5 % 97.5 % Df
< 0.001 11.8 0.200 0.5753 23.1
< 0.001 11.4 0.159 0.4872 28.8
< 0.001 12.1 0.182 0.5118 23.5
0.17400 2.5 -0.513 0.0973 28.4
0.33067 1.6 -0.355 0.1239 26.9
0.72421 0.5 -0.303 0.2121 38.8
0.00217 8.8 0.251 0.9748 19.1
< 0.001 13.8 0.615 1.4582 17.6
0.00217 8.8 0.251 0.9748 19.1
< 0.001 13.8 0.615 1.4582 17.6
0.00217 8.8 0.251 0.9748 19.1
< 0.001 13.8 0.615 1.4582 17.6
Type: response
31.3 Other imputation packages: missRanger
, or lists of imputed data frames.
Several R
packages can impute missing data. Indeed, the Missing Data CRAN View
lists at least a dozen alternatives. Since user interfaces change a lot from package to package, marginaleffects
supports a single workflow that can be used, with some adaptation, with all imputation packages:
- Use an external package to create a list of imputed data frames.
- Apply the
datalist2mids()
function from themiceadds
package to convert the list of imputed data frames to amids
object. - Use the
with()
function to fit models to createmira
object, as illustrated in themice
andAmelia
sections above. - Pass the
mira
object to amarginaleffects
function.
Consider the imputation package missRanger
, which generates a list of imputed datasets:
library(miceadds)
library(missRanger)
## convert lists of imputed datasets to `mids` objects
dat_missRanger <- replicate(20, missRanger(dat, verbose = 0), simplify = FALSE)
mids_missRanger <- datlist2mids(dat_missRanger)
## fit models
mod_missRanger <- with(mids_missRanger, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))
## `missRanger` slopes
mfx_missRanger <- avg_slopes(mod_missRanger, by = "Species")
mfx_missRanger
Term Contrast Species Estimate Std. Error t Pr(>|t|)
Sepal.Length dY/dX setosa 0.0589 0.0434 1.36 0.17662
Sepal.Length dY/dX versicolor 0.0613 0.0392 1.56 0.12029
Sepal.Length dY/dX virginica 0.0605 0.0374 1.62 0.10747
Sepal.Width dY/dX setosa 0.2291 0.0726 3.16 0.00195
Sepal.Width dY/dX versicolor 0.2257 0.0561 4.02 < 0.001
Sepal.Width dY/dX virginica 0.2231 0.0689 3.24 0.00150
Species versicolor - setosa setosa 1.1625 0.0726 16.01 < 0.001
Species virginica - setosa setosa 1.7861 0.0851 21.00 < 0.001
Species versicolor - setosa versicolor 1.1625 0.0726 16.01 < 0.001
Species virginica - setosa versicolor 1.7861 0.0851 21.00 < 0.001
Species versicolor - setosa virginica 1.1625 0.0726 16.01 < 0.001
Species virginica - setosa virginica 1.7861 0.0851 21.00 < 0.001
S 2.5 % 97.5 % Df
2.5 -0.0269 0.145 142
3.1 -0.0162 0.139 142
3.2 -0.0133 0.134 142
9.0 0.0856 0.373 142
13.4 0.1148 0.337 142
9.4 0.0869 0.359 142
109.2 1.0190 1.306 142
148.2 1.6179 1.954 142
109.2 1.0190 1.306 142
148.2 1.6179 1.954 142
109.2 1.0190 1.306 142
148.2 1.6179 1.954 142
Type: response
31.4 Comparing results with different imputation software
We can use the modelsummary
package to compare the results with listwise deletion to the results using different imputations software:
library(modelsummary)
## listwise deletion slopes
mod_lwd <- lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species, data = dat)
mfx_lwd <- avg_slopes(mod_lwd, by = "Species")
## regression table
models <- list(
"LWD" = mfx_lwd,
"mice" = mfx_mice,
"Amelia" = mfx_amelia,
"missRanger" = mfx_missRanger)
modelsummary(models, shape = term : contrast + Species ~ model)
Species | LWD | mice | Amelia | missRanger | |
---|---|---|---|---|---|
Sepal.Length dY/dX | setosa | 0.033 | 0.068 | 0.388 | 0.059 |
(0.061) | (0.056) | (0.091) | (0.043) | ||
versicolor | 0.050 | 0.054 | 0.323 | 0.061 | |
(0.061) | (0.056) | (0.080) | (0.039) | ||
virginica | 0.043 | 0.058 | 0.347 | 0.061 | |
(0.058) | (0.051) | (0.080) | (0.037) | ||
Sepal.Width dY/dX | setosa | 0.274 | 0.189 | -0.208 | 0.229 |
(0.091) | (0.084) | (0.149) | (0.073) | ||
versicolor | 0.255 | 0.209 | -0.116 | 0.226 | |
(0.074) | (0.077) | (0.117) | (0.056) | ||
virginica | 0.234 | 0.224 | -0.045 | 0.223 | |
(0.083) | (0.104) | (0.127) | (0.069) | ||
Species versicolor - setosa | setosa | 1.157 | 1.140 | 0.613 | 1.162 |
(0.097) | (0.098) | (0.173) | (0.073) | ||
versicolor | 1.157 | 1.140 | 0.613 | 1.162 | |
(0.097) | (0.098) | (0.173) | (0.073) | ||
virginica | 1.157 | 1.140 | 0.613 | 1.162 | |
(0.097) | (0.098) | (0.173) | (0.073) | ||
Species virginica - setosa | setosa | 1.839 | 1.741 | 1.036 | 1.786 |
(0.123) | (0.111) | (0.200) | (0.085) | ||
versicolor | 1.839 | 1.741 | 1.036 | 1.786 | |
(0.123) | (0.111) | (0.200) | (0.085) | ||
virginica | 1.839 | 1.741 | 1.036 | 1.786 | |
(0.123) | (0.111) | (0.200) | (0.085) | ||
Num.Obs. | 60 | 150 | 150 | 150 | |
Num.Imp. | 20 | 20 | 20 | ||
R2 | 0.953 | 0.930 | 0.853 | 0.948 | |
R2 Adj. | 0.949 | 0.928 | 0.848 | 0.946 | |
AIC | -34.0 | ||||
BIC | -19.3 | ||||
Log.Lik. | 23.997 | ||||
F | 220.780 | ||||
RMSE | 0.16 |
31.5 mice::pool()
In some cases, it may be useful to impute, fit, and post-process our models “manually”, to finally combine results using the mice::pool
function. This works well in some cases like this one:
library(marginaleffects)
library(mice)
set.seed(1024)
dat <- iris
dat$Sepal.Length[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Sepal.Width[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Species[sample(seq_len(nrow(iris)), 40)] <- NA
# impute
imp <- mice(dat, m = 20, printFlag = FALSE)
imp <- lapply(seq_along(imp), function(i) complete(imp, i))
# fit
fit <- lapply(imp[1:20], function(d) {
lm(Sepal.Length ~ Petal.Width + Petal.Length, d)
})
# post processing
mfx <- lapply(fit, avg_slopes)
# Rubin's rules
mice::pool(mfx)
Class: mipo m = 20
term contrast m estimate ubar b t dfcom
1 Petal.Length dY/dX 20 0.6063614 0.004468414 0.001838516 0.006398856 147
2 Petal.Width dY/dX 20 -0.4814243 0.023965740 0.010061356 0.034530164 147
df riv lambda fmi
1 68.1966 0.4320194 0.3016854 0.3213019
2 67.2928 0.4408136 0.3059477 0.3256952
One tricky thing about mice::pool()
is that it expects the term
column to be present in the data frames that it combines. That column must uniquely identify rows. But in some cases, the output of marginaleffects
functions does not include the term
column, which can lead to an error when using mice::pool()
.
For example,
mfx <- avg_slopes(fit[[1]], hypothesis = ~pairwise)
mfx
Hypothesis Estimate Std. Error z Pr(>|z|) S 2.5 %
(Petal.Width) - (Petal.Length) -0.833 0.224 -3.72 <0.001 12.3 -1.27
97.5 %
-0.395
Type: response
colnames(mfx)
[1] "hypothesis" "estimate" "std.error" "statistic" "p.value"
[6] "s.value" "conf.low" "conf.high"
We can work around this problem by adding a term
column manually to the output of marginaleffects
, using standard R
accessors.
mfx <- lapply(fit, \(f) {
out <- avg_slopes(f, hypothesis = ~pairwise)
# add the missing `term` column
out$term <- out$hypothesis
return(out)
})
mice::pool(mfx)
Class: mipo m = 20
term m estimate ubar b t
1 (Petal.Width) - (Petal.Length) 20 -1.087786 0.04836235 0.02034648 0.06972616
dfcom df riv lambda fmi
1 147 67.19845 0.4417446 0.3063959 0.3261571