31  Missing Data

The marginaleffects package offers convenience functions to compute and display predictions, contrasts, and marginal effects from models with multiple imputation from the mice and Amelia packages. The workflow follows Rubin’s rules (Rubin, 1987, p. 76), via the following steps:

  1. Impute \(M\) data sets.
  2. Fit a model in each of the \(M\) imputed data sets.
  3. Compute marginal effects in each of the \(M\) data sets.
  4. Pool results.

To highlight the workflow, we consider a simple linear regression model, although the same workflow should work with any model type that is fit using a formula interface and a data argument.

marginaleffects directly supports the mice and Amelia imputation packages, as well as any other package that can return a list of imputed data frames. This is demonstrated below using the iris dataset, in which we insert missing observations randomly and then impute missing values using several packages.

library(marginaleffects)
set.seed(1024)

dat <- iris
dat$Sepal.Length[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Sepal.Width[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Species[sample(seq_len(nrow(iris)), 40)] <- NA

31.1 mice

First, we impute the dataset using the mice package:

library(mice)

dat_mice <- mice(dat, m = 20, printFlag = FALSE, .Random.seed = 1024)

Then, we use the standard mice syntax to produce an object of class mira with all the models:

mod_mice <- with(dat_mice, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

Finally, we feed the mira object to a marginaleffects function:

mfx_mice <- avg_slopes(mod_mice, by = "Species")
mfx_mice

         Term            Contrast    Species Estimate Std. Error      t
 Sepal.Length dY/dX               setosa       0.0684     0.0560  1.222
 Sepal.Length dY/dX               versicolor   0.0540     0.0558  0.968
 Sepal.Length dY/dX               virginica    0.0582     0.0512  1.137
 Sepal.Width  dY/dX               setosa       0.1890     0.0836  2.260
 Sepal.Width  dY/dX               versicolor   0.2092     0.0772  2.710
 Sepal.Width  dY/dX               virginica    0.2242     0.1041  2.154
 Species      versicolor - setosa setosa       1.1399     0.0977 11.668
 Species      virginica - setosa  setosa       1.7408     0.1108 15.709
 Species      versicolor - setosa versicolor   1.1399     0.0977 11.668
 Species      virginica - setosa  versicolor   1.7408     0.1108 15.709
 Species      versicolor - setosa virginica    1.1399     0.0977 11.668
 Species      virginica - setosa  virginica    1.7408     0.1108 15.709
 Pr(>|t|)    S   2.5 % 97.5 %   Df
  0.22747  2.1 -0.0440  0.181 49.9
  0.33859  1.6 -0.0585  0.166 42.6
  0.26152  1.9 -0.0449  0.161 44.8
  0.02631  5.2  0.0228  0.355 87.0
  0.00976  6.7  0.0533  0.365 41.1
  0.03901  4.7  0.0120  0.436 31.3
  < 0.001 49.7  0.9435  1.336 48.6
  < 0.001 67.5  1.5182  1.963 50.3
  < 0.001 49.7  0.9435  1.336 48.6
  < 0.001 67.5  1.5182  1.963 50.3
  < 0.001 49.7  0.9435  1.336 48.6
  < 0.001 67.5  1.5182  1.963 50.3

Type: response

31.2 Amelia

With Amelia, the workflow is essentially the same. First, we impute using Amelia:

library(Amelia)

dat_amelia <- amelia(dat, m = 20, noms = "Species", p2s = 0)

Then, we use Amelia syntax to produce an object of class amest with all the models:

mod_amelia <- with(dat_amelia, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

Finally, we feed the amest object to a marginaleffects function:

mfx_amelia <- avg_slopes(mod_amelia, by = "Species")
mfx_amelia

         Term            Contrast    Species Estimate Std. Error      t
 Sepal.Length dY/dX               setosa       0.3878     0.0907  4.278
 Sepal.Length dY/dX               versicolor   0.3231     0.0802  4.030
 Sepal.Length dY/dX               virginica    0.3467     0.0799  4.340
 Sepal.Width  dY/dX               setosa      -0.2079     0.1491 -1.395
 Sepal.Width  dY/dX               versicolor  -0.1157     0.1168 -0.991
 Sepal.Width  dY/dX               virginica   -0.0452     0.1272 -0.355
 Species      versicolor - setosa setosa       0.6127     0.1731  3.541
 Species      virginica - setosa  setosa       1.0364     0.2004  5.171
 Species      versicolor - setosa versicolor   0.6127     0.1731  3.541
 Species      virginica - setosa  versicolor   1.0364     0.2004  5.171
 Species      versicolor - setosa virginica    0.6127     0.1731  3.541
 Species      virginica - setosa  virginica    1.0364     0.2004  5.171
 Pr(>|t|)    S  2.5 % 97.5 %   Df
  < 0.001 11.8  0.200 0.5753 23.1
  < 0.001 11.4  0.159 0.4872 28.8
  < 0.001 12.1  0.182 0.5118 23.5
  0.17400  2.5 -0.513 0.0973 28.4
  0.33067  1.6 -0.355 0.1239 26.9
  0.72421  0.5 -0.303 0.2121 38.8
  0.00217  8.8  0.251 0.9748 19.1
  < 0.001 13.8  0.615 1.4582 17.6
  0.00217  8.8  0.251 0.9748 19.1
  < 0.001 13.8  0.615 1.4582 17.6
  0.00217  8.8  0.251 0.9748 19.1
  < 0.001 13.8  0.615 1.4582 17.6

Type: response

31.3 Other imputation packages: missRanger, or lists of imputed data frames.

Several R packages can impute missing data. Indeed, the Missing Data CRAN View lists at least a dozen alternatives. Since user interfaces change a lot from package to package, marginaleffects supports a single workflow that can be used, with some adaptation, with all imputation packages:

  1. Use an external package to create a list of imputed data frames.
  2. Apply the datalist2mids() function from the miceadds package to convert the list of imputed data frames to a mids object.
  3. Use the with() function to fit models to create mira object, as illustrated in the mice and Amelia sections above.
  4. Pass the mira object to a marginaleffects function.

Consider the imputation package missRanger, which generates a list of imputed datasets:

library(miceadds)
library(missRanger)

## convert lists of imputed datasets to `mids` objects
dat_missRanger <- replicate(20, missRanger(dat, verbose = 0), simplify = FALSE)
mids_missRanger <- datlist2mids(dat_missRanger)

## fit models
mod_missRanger <- with(mids_missRanger, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

## `missRanger` slopes
mfx_missRanger <- avg_slopes(mod_missRanger, by = "Species")
mfx_missRanger

         Term            Contrast    Species Estimate Std. Error     t Pr(>|t|)
 Sepal.Length dY/dX               setosa       0.0589     0.0434  1.36  0.17662
 Sepal.Length dY/dX               versicolor   0.0613     0.0392  1.56  0.12029
 Sepal.Length dY/dX               virginica    0.0605     0.0374  1.62  0.10747
 Sepal.Width  dY/dX               setosa       0.2291     0.0726  3.16  0.00195
 Sepal.Width  dY/dX               versicolor   0.2257     0.0561  4.02  < 0.001
 Sepal.Width  dY/dX               virginica    0.2231     0.0689  3.24  0.00150
 Species      versicolor - setosa setosa       1.1625     0.0726 16.01  < 0.001
 Species      virginica - setosa  setosa       1.7861     0.0851 21.00  < 0.001
 Species      versicolor - setosa versicolor   1.1625     0.0726 16.01  < 0.001
 Species      virginica - setosa  versicolor   1.7861     0.0851 21.00  < 0.001
 Species      versicolor - setosa virginica    1.1625     0.0726 16.01  < 0.001
 Species      virginica - setosa  virginica    1.7861     0.0851 21.00  < 0.001
     S   2.5 % 97.5 %  Df
   2.5 -0.0269  0.145 142
   3.1 -0.0162  0.139 142
   3.2 -0.0133  0.134 142
   9.0  0.0856  0.373 142
  13.4  0.1148  0.337 142
   9.4  0.0869  0.359 142
 109.2  1.0190  1.306 142
 148.2  1.6179  1.954 142
 109.2  1.0190  1.306 142
 148.2  1.6179  1.954 142
 109.2  1.0190  1.306 142
 148.2  1.6179  1.954 142

Type: response

31.4 Comparing results with different imputation software

We can use the modelsummary package to compare the results with listwise deletion to the results using different imputations software:

library(modelsummary)

## listwise deletion slopes
mod_lwd <- lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species, data = dat)
mfx_lwd <- avg_slopes(mod_lwd, by = "Species")

## regression table
models <- list(
    "LWD" = mfx_lwd,
    "mice" = mfx_mice,
    "Amelia" = mfx_amelia,
    "missRanger" = mfx_missRanger)
modelsummary(models, shape = term : contrast + Species ~ model)
Species LWD mice Amelia missRanger
Sepal.Length dY/dX setosa 0.033 0.068 0.388 0.059
(0.061) (0.056) (0.091) (0.043)
versicolor 0.050 0.054 0.323 0.061
(0.061) (0.056) (0.080) (0.039)
virginica 0.043 0.058 0.347 0.061
(0.058) (0.051) (0.080) (0.037)
Sepal.Width dY/dX setosa 0.274 0.189 -0.208 0.229
(0.091) (0.084) (0.149) (0.073)
versicolor 0.255 0.209 -0.116 0.226
(0.074) (0.077) (0.117) (0.056)
virginica 0.234 0.224 -0.045 0.223
(0.083) (0.104) (0.127) (0.069)
Species versicolor - setosa setosa 1.157 1.140 0.613 1.162
(0.097) (0.098) (0.173) (0.073)
versicolor 1.157 1.140 0.613 1.162
(0.097) (0.098) (0.173) (0.073)
virginica 1.157 1.140 0.613 1.162
(0.097) (0.098) (0.173) (0.073)
Species virginica - setosa setosa 1.839 1.741 1.036 1.786
(0.123) (0.111) (0.200) (0.085)
versicolor 1.839 1.741 1.036 1.786
(0.123) (0.111) (0.200) (0.085)
virginica 1.839 1.741 1.036 1.786
(0.123) (0.111) (0.200) (0.085)
Num.Obs. 60 150 150 150
Num.Imp. 20 20 20
R2 0.953 0.930 0.853 0.948
R2 Adj. 0.949 0.928 0.848 0.946
AIC -34.0
BIC -19.3
Log.Lik. 23.997
F 220.780
RMSE 0.16

31.5 mice::pool()

In some cases, it may be useful to impute, fit, and post-process our models “manually”, to finally combine results using the mice::pool function. This works well in some cases like this one:

library(marginaleffects)
library(mice)
set.seed(1024)
dat <- iris
dat$Sepal.Length[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Sepal.Width[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Species[sample(seq_len(nrow(iris)), 40)] <- NA

# impute
imp <- mice(dat, m = 20, printFlag = FALSE)
imp <- lapply(seq_along(imp), function(i) complete(imp, i))

# fit
fit <- lapply(imp[1:20], function(d) {
              lm(Sepal.Length ~ Petal.Width + Petal.Length, d)
})

# post processing
mfx <- lapply(fit, avg_slopes)

# Rubin's rules
mice::pool(mfx)
Class: mipo    m = 20 
          term contrast  m   estimate        ubar           b           t dfcom
1 Petal.Length    dY/dX 20  0.6063614 0.004468414 0.001838516 0.006398856   147
2  Petal.Width    dY/dX 20 -0.4814243 0.023965740 0.010061356 0.034530164   147
       df       riv    lambda       fmi
1 68.1966 0.4320194 0.3016854 0.3213019
2 67.2928 0.4408136 0.3059477 0.3256952

One tricky thing about mice::pool() is that it expects the term column to be present in the data frames that it combines. That column must uniquely identify rows. But in some cases, the output of marginaleffects functions does not include the term column, which can lead to an error when using mice::pool().

For example,

mfx <- avg_slopes(fit[[1]], hypothesis = ~pairwise)
mfx

                     Hypothesis Estimate Std. Error     z Pr(>|z|)    S 2.5 %
 (Petal.Width) - (Petal.Length)   -0.833      0.224 -3.72   <0.001 12.3 -1.27
 97.5 %
 -0.395

Type: response
[1] "hypothesis" "estimate"   "std.error"  "statistic"  "p.value"   
[6] "s.value"    "conf.low"   "conf.high" 

We can work around this problem by adding a term column manually to the output of marginaleffects, using standard R accessors.

mfx <- lapply(fit, \(f) {
    out <- avg_slopes(f, hypothesis = ~pairwise)
    # add the missing `term` column
    out$term <- out$hypothesis
    return(out)
})

mice::pool(mfx)
Class: mipo    m = 20 
                            term  m  estimate       ubar          b          t
1 (Petal.Width) - (Petal.Length) 20 -1.087786 0.04836235 0.02034648 0.06972616
  dfcom       df       riv    lambda       fmi
1   147 67.19845 0.4417446 0.3063959 0.3261571