Missing Data

The marginaleffects package offers convenience functions to compute and display predictions, contrasts, and marginal effects from models with multiple imputation from the mice and Amelia packages. The workflow follows Rubin’s rules (Rubin, 1987, p. 76), via the following steps:

  1. Impute \(M\) data sets.
  2. Fit a model in each of the \(M\) imputed data sets.
  3. Compute marginal effects in each of the \(M\) data sets.
  4. Pool results.

To highlight the workflow, we consider a simple linear regression model, although the same workflow should work with any model type that is fit using a formula interface and a data argument.

marginaleffects directly supports the mice and Amelia imputation packages, as well as any other package that can return a list of imputed data frames. This is demonstrated below using the iris dataset, in which we insert missing observations randomly and then impute missing values using several packages.

library(marginaleffects)
set.seed(1024)

dat <- iris
dat$Sepal.Length[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Sepal.Width[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Species[sample(seq_len(nrow(iris)), 40)] <- NA

mice

First, we impute the dataset using the mice package:

library(mice)

dat_mice <- mice(dat, m = 20, printFlag = FALSE, .Random.seed = 1024)

Then, we use the standard mice syntax to produce an object of class mira with all the models:

mod_mice <- with(dat_mice, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

Finally, we feed the mira object to a marginaleffects function:

mfx_mice <- avg_slopes(mod_mice, by = "Species")
mfx_mice
#> 
#>          Term                        Contrast    Species Estimate Std. Error      t Pr(>|t|)     S   2.5 % 97.5 %    Df
#>  Sepal.Length mean(dY/dX)                     setosa       0.0684     0.0560  1.222  0.22412   2.2 -0.0424  0.179 120.0
#>  Sepal.Length mean(dY/dX)                     versicolor   0.0540     0.0558  0.968  0.33553   1.6 -0.0567  0.165  93.6
#>  Sepal.Length mean(dY/dX)                     virginica    0.0582     0.0512  1.137  0.25814   2.0 -0.0433  0.160 101.2
#>  Sepal.Width  mean(dY/dX)                     setosa       0.1890     0.0836  2.260  0.02436   5.4  0.0246  0.353 400.5
#>  Sepal.Width  mean(dY/dX)                     versicolor   0.2092     0.0772  2.710  0.00807   7.0  0.0558  0.363  89.0
#>  Sepal.Width  mean(dY/dX)                     virginica    0.2242     0.1041  2.155  0.03505   4.8  0.0162  0.432  61.8
#>  Species      mean(versicolor) - mean(setosa) setosa       1.1399     0.0977 11.668  < 0.001  68.1  0.9464  1.333 114.8
#>  Species      mean(versicolor) - mean(setosa) versicolor   1.1399     0.0977 11.668  < 0.001  68.1  0.9464  1.333 114.8
#>  Species      mean(versicolor) - mean(setosa) virginica    1.1399     0.0977 11.668  < 0.001  68.1  0.9464  1.333 114.8
#>  Species      mean(virginica) - mean(setosa)  setosa       1.7408     0.1108 15.709  < 0.001 100.7  1.5214  1.960 121.6
#>  Species      mean(virginica) - mean(setosa)  versicolor   1.7408     0.1108 15.709  < 0.001 100.7  1.5214  1.960 121.6
#>  Species      mean(virginica) - mean(setosa)  virginica    1.7408     0.1108 15.709  < 0.001 100.7  1.5214  1.960 121.6
#> 
#> Type:  response 
#> Columns: term, contrast, Species, estimate, std.error, s.value, predicted_lo, predicted_hi, predicted, df, statistic, p.value, conf.low, conf.high

Amelia

With Amelia, the workflow is essentially the same. First, we impute using Amelia:

library(Amelia)

dat_amelia <- amelia(dat, m = 20, noms = "Species", p2s = 0)

Then, we use Amelia syntax to produce an object of class amest with all the models:

mod_amelia <- with(dat_amelia, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

Finally, we feed the amest object to a marginaleffects function:

mfx_amelia <- avg_slopes(mod_amelia, by = "Species")
mfx_amelia
#> 
#>          Term                        Contrast    Species Estimate Std. Error      t Pr(>|t|)    S  2.5 % 97.5 %   Df
#>  Sepal.Length mean(dY/dX)                     setosa       0.3878     0.0907  4.278  < 0.001 13.3  0.205 0.5705 44.0
#>  Sepal.Length mean(dY/dX)                     versicolor   0.3231     0.0802  4.030  < 0.001 12.5  0.163 0.4838 55.9
#>  Sepal.Length mean(dY/dX)                     virginica    0.3467     0.0799  4.340  < 0.001 13.6  0.186 0.5077 44.7
#>  Sepal.Width  mean(dY/dX)                     setosa      -0.2079     0.1491 -1.395  0.16877  2.6 -0.507 0.0909 55.0
#>  Sepal.Width  mean(dY/dX)                     versicolor  -0.1157     0.1168 -0.991  0.32646  1.6 -0.350 0.1187 51.8
#>  Sepal.Width  mean(dY/dX)                     virginica   -0.0452     0.1272 -0.355  0.72325  0.5 -0.298 0.2079 82.2
#>  Species      mean(versicolor) - mean(setosa) setosa       0.6127     0.1731  3.541  0.00111  9.8  0.262 0.9635 36.7
#>  Species      mean(versicolor) - mean(setosa) versicolor   0.6127     0.1731  3.541  0.00111  9.8  0.262 0.9635 36.7
#>  Species      mean(versicolor) - mean(setosa) virginica    0.6127     0.1731  3.541  0.00111  9.8  0.262 0.9635 36.7
#>  Species      mean(virginica) - mean(setosa)  setosa       1.0364     0.2004  5.171  < 0.001 16.6  0.629 1.4436 34.2
#>  Species      mean(virginica) - mean(setosa)  versicolor   1.0364     0.2004  5.171  < 0.001 16.6  0.629 1.4436 34.2
#>  Species      mean(virginica) - mean(setosa)  virginica    1.0364     0.2004  5.171  < 0.001 16.6  0.629 1.4436 34.2
#> 
#> Type:  response 
#> Columns: term, contrast, Species, estimate, std.error, s.value, predicted_lo, predicted_hi, predicted, df, statistic, p.value, conf.low, conf.high

Other imputation packages: missRanger, or lists of imputed data frames.

Several R packages can impute missing data. Indeed, the Missing Data CRAN View lists at least a dozen alternatives. Since user interfaces change a lot from package to package, marginaleffects supports a single workflow that can be used, with some adaptation, with all imputation packages:

  1. Use an external package to create a list of imputed data frames.
  2. Apply the datalist2mids() function from the miceadds package to convert the list of imputed data frames to a mids object.
  3. Use the with() function to fit models to create mira object, as illustrated in the mice and Amelia sections above.
  4. Pass the mira object to a marginaleffects function.

Consider the imputation package missRanger, which generates a list of imputed datasets:

library(miceadds)
library(missRanger)

## convert lists of imputed datasets to `mids` objects
dat_missRanger <- replicate(20, missRanger(dat, verbose = 0), simplify = FALSE)
mids_missRanger <- datlist2mids(dat_missRanger)

## fit models
mod_missRanger <- with(mids_missRanger, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

## `missRanger` slopes
mfx_missRanger <- avg_slopes(mod_missRanger, by = "Species")
mfx_missRanger
#> 
#>          Term                        Contrast    Species Estimate Std. Error     t Pr(>|t|)     S    2.5 % 97.5 %      Df
#>  Sepal.Length mean(dY/dX)                     setosa       0.0589     0.0415  1.42  0.15558   2.7 -0.02238  0.140 6733392
#>  Sepal.Length mean(dY/dX)                     versicolor   0.0682     0.0393  1.74  0.08228   3.6 -0.00873  0.145 1145162
#>  Sepal.Length mean(dY/dX)                     virginica    0.0649     0.0368  1.76  0.07792   3.7 -0.00726  0.137 2020701
#>  Sepal.Width  mean(dY/dX)                     setosa       0.2300     0.0692  3.32  < 0.001  10.1  0.09434  0.366 3128781
#>  Sepal.Width  mean(dY/dX)                     versicolor   0.2166     0.0551  3.93  < 0.001  13.5  0.10854  0.325  687227
#>  Sepal.Width  mean(dY/dX)                     virginica    0.2063     0.0687  3.00  0.00266   8.6  0.07174  0.341  458858
#>  Species      mean(versicolor) - mean(setosa) setosa       1.1572     0.0706 16.38  < 0.001 197.9  1.01871  1.296 3882586
#>  Species      mean(versicolor) - mean(setosa) versicolor   1.1572     0.0706 16.38  < 0.001 197.9  1.01871  1.296 3882586
#>  Species      mean(versicolor) - mean(setosa) virginica    1.1572     0.0706 16.38  < 0.001 197.9  1.01871  1.296 3882586
#>  Species      mean(virginica) - mean(setosa)  setosa       1.7763     0.0825 21.52  < 0.001 338.9  1.61452  1.938 5384985
#>  Species      mean(virginica) - mean(setosa)  versicolor   1.7763     0.0825 21.52  < 0.001 338.9  1.61452  1.938 5384985
#>  Species      mean(virginica) - mean(setosa)  virginica    1.7763     0.0825 21.52  < 0.001 338.9  1.61452  1.938 5384985
#> 
#> Type:  response 
#> Columns: term, contrast, Species, estimate, std.error, s.value, predicted_lo, predicted_hi, predicted, df, statistic, p.value, conf.low, conf.high

Comparing results with different imputation software

We can use the modelsummary package to compare the results with listwise deletion to the results using different imputations software:

library(modelsummary)

## listwise deletion slopes
mod_lwd <- lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species, data = dat)
mfx_lwd <- avg_slopes(mod_lwd, by = "Species")

## regression table
models <- list(
    "LWD" = mfx_lwd,
    "mice" = mfx_mice,
    "Amelia" = mfx_amelia,
    "missRanger" = mfx_missRanger)
modelsummary(models, shape = term : contrast + Species ~ model)
Species LWD mice Amelia missRanger
Sepal.Length mean(dY/dX) setosa 0.033 0.068 0.388 0.059
(0.061) (0.056) (0.091) (0.041)
versicolor 0.050 0.054 0.323 0.068
(0.061) (0.056) (0.080) (0.039)
virginica 0.043 0.058 0.347 0.065
(0.058) (0.051) (0.080) (0.037)
Sepal.Width mean(dY/dX) setosa 0.274 0.189 -0.208 0.230
(0.091) (0.084) (0.149) (0.069)
versicolor 0.255 0.209 -0.116 0.217
(0.074) (0.077) (0.117) (0.055)
virginica 0.234 0.224 -0.045 0.206
(0.083) (0.104) (0.127) (0.069)
Species mean(versicolor) - mean(setosa) setosa 1.157 1.140 0.613 1.157
(0.097) (0.098) (0.173) (0.071)
versicolor 1.157 1.140 0.613 1.157
(0.097) (0.098) (0.173) (0.071)
virginica 1.157 1.140 0.613 1.157
(0.097) (0.098) (0.173) (0.071)
Species mean(virginica) - mean(setosa) setosa 1.839 1.741 1.036 1.776
(0.123) (0.111) (0.200) (0.083)
versicolor 1.839 1.741 1.036 1.776
(0.123) (0.111) (0.200) (0.083)
virginica 1.839 1.741 1.036 1.776
(0.123) (0.111) (0.200) (0.083)
Num.Obs. 60 150 150 150
Num.Imp. 20 20 20
R2 0.953 0.930 0.853 0.947
R2 Adj. 0.949 0.928 0.848 0.945
AIC -34.0
BIC -19.3
Log.Lik. 23.997
F 220.780
RMSE 0.16

Passing new data arguments: newdata, wts, by, etc.

Sometimes we want to pass arguments changing or specifying the data on which we will do our analysis using marginaleffects. This can be for reasons such as wanting to specify the values or weights at which we evaluate e.g. avg_slopes(), or due to the underlying models not robustly preserving all the original data columns (such as fixest objects not saving their data in the fit object making it potentially challenging to retrieve, and even if retrievable it will not include the weights used during fitting as a column as wts expects when given a string).

If we are not using multiple imputation, or if we want to just pass a single dataset to the several fitted models after multiple imputation, we can pass a single dataset to the newdata argument. However, if we wish to supply each model in our list resulting after multiple imputation with a /different/ dataset on which to calculate results, we cannot use newdata. Instead, in this case it can be useful to revert to a more manual (but still very easy) approach. Here is an example calculating avg_slopes using a different set of weights for each of the fixest models which we fit after multiple imputation.

set.seed(1024)
library(mice)
library(fixest)
library(marginaleffects)

dat <- mtcars

## insert missing values
dat$hp[sample(seq_len(nrow(mtcars)), 10)] <- NA
dat$mpg[sample(seq_len(nrow(mtcars)), 10)] <- NA
dat$gear[sample(seq_len(nrow(mtcars)), 10)] <- NA

## multiple imputation
dat <- mice(dat, m = 5, method = "sample", printFlag = FALSE)
dat <- complete(dat, action = "all")

## fit models
mod <- lapply(dat, \(x) 
    feglm(am ~ mpg * cyl + hp,
        weight = ~gear,
        family = binomial,
        data = x))

## slopes without weights
lapply(seq_along(mod), \(i) 
    avg_slopes(mod[[i]], newdata = dat[[i]])) |>
    mice::pool()
#> Class: mipo    m = 5 
#>   term    contrast m     estimate         ubar            b            t dfcom       df      riv    lambda       fmi
#> 1  cyl mean(dY/dX) 5 -0.134280454 7.097467e-04 2.347331e-03 3.526544e-03    29 2.921797 3.968736 0.7987416 0.8667137
#> 2   hp mean(dY/dX) 5  0.001649773 5.709036e-07 1.375452e-06 2.221446e-06    29 3.557014 2.891105 0.7430036 0.8213918
#> 3  mpg mean(dY/dX) 5  0.006082804 1.080647e-04 2.722234e-04 4.347329e-04    29 3.458682 3.022893 0.7514227 0.8283973

## slopes with weights
lapply(seq_along(mod), \(i) 
    avg_slopes(mod[[i]], newdata = dat[[i]], wts = "gear")) |>
    mice::pool()
#> Class: mipo    m = 5 
#>   term    contrast m     estimate         ubar            b            t dfcom       df      riv    lambda       fmi
#> 1  cyl mean(dY/dX) 5 -0.135839444 7.281041e-04 2.481021e-03 3.705329e-03    29 2.868747 4.089010 0.8034981 0.8704636
#> 2   hp mean(dY/dX) 5  0.001671173 5.697747e-07 1.424648e-06 2.279352e-06    29 3.474898 3.000446 0.7500278 0.8272405
#> 3  mpg mean(dY/dX) 5  0.006251144 1.056103e-04 2.705239e-04 4.302390e-04    29 3.422648 3.073835 0.7545310 0.8309696

newdata with imputed datasets

In some contexts (ex: Average Treatment effects on the Treated), users want to apply a function like subset() to the data in newdata. Unfortunately, there is no subset() method applicable to mice-generated objects. One alternative is to generate estimates for each imputed datasets separately and pool the results after. For example:

library(mice)
library(marginaleffects)
data("lalonde_mis", package = "cobalt")

imp <- mice(lalonde_mis, m = 5, print = FALSE)
est <- lapply(1:5, \(i) {
    data <- complete(imp, i)
    mod <- lm(re78 ~ treat * (age + educ + re74), data = data)
    avg_predictions(mod, variables = "treat", newdata = subset(treat == 1))
})
mice::pool(est)
#> Class: mipo    m = 5 
#>   term m estimate     ubar            b        t dfcom       df          riv       lambda        fmi
#> 1    0 5 5558.439 142974.4 3.331491e+03 146972.2   612 534.7224 2.796156e-02 2.720098e-02 0.03081920
#> 2    1 5 6349.144 263028.0 4.652891e-22 263028.0   612 609.9478 2.122766e-27 2.122766e-27 0.00326292