30 Missing Data

The marginaleffects package offers convenience functions to compute and display predictions, contrasts, and marginal effects from models with multiple imputation from the mice and Amelia packages. The workflow follows Rubin’s rules (Rubin, 1987, p. 76), via the following steps:

Impute \(M\) data sets.
Fit a model in each of the \(M\) imputed data sets.
Compute marginal effects in each of the \(M\) data sets.
Pool results.

To highlight the workflow, we consider a simple linear regression model, although the same workflow should work with any model type that is fit using a formula interface and a data argument.

marginaleffects directly supports the mice and Amelia imputation packages, as well as any other package that can return a list of imputed data frames. This is demonstrated below using the iris dataset, in which we insert missing observations randomly and then impute missing values using several packages.

library(marginaleffects)
set.seed(1024)

dat <- iris
dat$Sepal.Length[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Sepal.Width[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Species[sample(seq_len(nrow(iris)), 40)] <- NA

30.1 `mice`

First, we impute the dataset using the mice package:

library(mice)

dat_mice <- mice(dat, m = 20, printFlag = FALSE, .Random.seed = 1024)

Then, we use the standard mice syntax to produce an object of class mira with all the models:

mod_mice <- with(dat_mice, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

Finally, we feed the mira object to a marginaleffects function:

mfx_mice <- avg_slopes(mod_mice, by = "Species")
mfx_mice
#> 
#>          Term            Contrast    Species Estimate Std. Error      t Pr(>|t|)    S   2.5 % 97.5 %   Df
#>  Sepal.Length dY/dX               setosa       0.0684     0.0560  1.222  0.22747  2.1 -0.0440  0.181 49.9
#>  Sepal.Length dY/dX               versicolor   0.0540     0.0558  0.968  0.33850  1.6 -0.0585  0.166 42.6
#>  Sepal.Length dY/dX               virginica    0.0582     0.0512  1.137  0.26149  1.9 -0.0449  0.161 44.8
#>  Sepal.Width  dY/dX               setosa       0.1890     0.0836  2.260  0.02632  5.2  0.0228  0.355 87.0
#>  Sepal.Width  dY/dX               versicolor   0.2092     0.0772  2.710  0.00977  6.7  0.0533  0.365 41.1
#>  Sepal.Width  dY/dX               virginica    0.2242     0.1041  2.155  0.03896  4.7  0.0121  0.436 31.3
#>  Species      versicolor - setosa setosa       1.1399     0.0977 11.668  < 0.001 49.7  0.9435  1.336 48.6
#>  Species      virginica - setosa  setosa       1.7408     0.1108 15.709  < 0.001 67.5  1.5182  1.963 50.3
#>  Species      versicolor - setosa versicolor   1.1399     0.0977 11.668  < 0.001 49.7  0.9435  1.336 48.6
#>  Species      virginica - setosa  versicolor   1.7408     0.1108 15.709  < 0.001 67.5  1.5182  1.963 50.3
#>  Species      versicolor - setosa virginica    1.1399     0.0977 11.668  < 0.001 49.7  0.9435  1.336 48.6
#>  Species      virginica - setosa  virginica    1.7408     0.1108 15.709  < 0.001 67.5  1.5182  1.963 50.3
#> 
#> Type: response

30.2 `Amelia`

With Amelia, the workflow is essentially the same. First, we impute using Amelia:

library(Amelia)

dat_amelia <- amelia(dat, m = 20, noms = "Species", p2s = 0)

Then, we use Amelia syntax to produce an object of class amest with all the models:

mod_amelia <- with(dat_amelia, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

Finally, we feed the amest object to a marginaleffects function:

mfx_amelia <- avg_slopes(mod_amelia, by = "Species")
mfx_amelia
#> 
#>          Term            Contrast    Species Estimate Std. Error      t Pr(>|t|)    S  2.5 % 97.5 %   Df
#>  Sepal.Length dY/dX               setosa       0.3878     0.0907  4.278  < 0.001 11.8  0.200 0.5753 23.1
#>  Sepal.Length dY/dX               versicolor   0.3231     0.0802  4.030  < 0.001 11.4  0.159 0.4872 28.7
#>  Sepal.Length dY/dX               virginica    0.3467     0.0799  4.340  < 0.001 12.1  0.182 0.5118 23.5
#>  Sepal.Width  dY/dX               setosa      -0.2079     0.1491 -1.395  0.17399  2.5 -0.513 0.0973 28.4
#>  Sepal.Width  dY/dX               versicolor  -0.1157     0.1168 -0.991  0.33068  1.6 -0.355 0.1239 26.9
#>  Sepal.Width  dY/dX               virginica   -0.0452     0.1272 -0.355  0.72425  0.5 -0.303 0.2121 38.9
#>  Species      versicolor - setosa setosa       0.6127     0.1731  3.541  0.00217  8.8  0.251 0.9748 19.1
#>  Species      virginica - setosa  setosa       1.0364     0.2004  5.171  < 0.001 13.8  0.615 1.4582 17.6
#>  Species      versicolor - setosa versicolor   0.6127     0.1731  3.541  0.00217  8.8  0.251 0.9748 19.1
#>  Species      virginica - setosa  versicolor   1.0364     0.2004  5.171  < 0.001 13.8  0.615 1.4582 17.6
#>  Species      versicolor - setosa virginica    0.6127     0.1731  3.541  0.00217  8.8  0.251 0.9748 19.1
#>  Species      virginica - setosa  virginica    1.0364     0.2004  5.171  < 0.001 13.8  0.615 1.4582 17.6
#> 
#> Type: response

30.3 Other imputation packages: `missRanger`, or lists of imputed data frames.

Several R packages can impute missing data. Indeed, the Missing Data CRAN View lists at least a dozen alternatives. Since user interfaces change a lot from package to package, marginaleffects supports a single workflow that can be used, with some adaptation, with all imputation packages:

Use an external package to create a list of imputed data frames.
Apply the datalist2mids() function from the miceadds package to convert the list of imputed data frames to a mids object.
Use the with() function to fit models to create mira object, as illustrated in the mice and Amelia sections above.
Pass the mira object to a marginaleffects function.

Consider the imputation package missRanger, which generates a list of imputed datasets:

library(miceadds)
library(missRanger)

## convert lists of imputed datasets to `mids` objects
dat_missRanger <- replicate(20, missRanger(dat, verbose = 0), simplify = FALSE)
mids_missRanger <- datlist2mids(dat_missRanger)

## fit models
mod_missRanger <- with(mids_missRanger, lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species))

## `missRanger` slopes
mfx_missRanger <- avg_slopes(mod_missRanger, by = "Species")
mfx_missRanger
#> 
#>          Term            Contrast    Species Estimate Std. Error     t Pr(>|t|)     S   2.5 % 97.5 %  Df
#>  Sepal.Length dY/dX               setosa       0.0586     0.0434  1.35  0.17907   2.5 -0.0272  0.144 142
#>  Sepal.Length dY/dX               versicolor   0.0615     0.0392  1.57  0.11884   3.1 -0.0160  0.139 142
#>  Sepal.Length dY/dX               virginica    0.0605     0.0373  1.62  0.10724   3.2 -0.0133  0.134 142
#>  Sepal.Width  dY/dX               setosa       0.2302     0.0725  3.17  0.00185   9.1  0.0868  0.374 142
#>  Sepal.Width  dY/dX               versicolor   0.2260     0.0561  4.03  < 0.001  13.4  0.1152  0.337 142
#>  Sepal.Width  dY/dX               virginica    0.2229     0.0688  3.24  0.00148   9.4  0.0869  0.359 142
#>  Species      versicolor - setosa setosa       1.1629     0.0725 16.04  < 0.001 109.4  1.0196  1.306 142
#>  Species      virginica - setosa  setosa       1.7867     0.0849 21.03  < 0.001 148.5  1.6188  1.955 142
#>  Species      versicolor - setosa versicolor   1.1629     0.0725 16.04  < 0.001 109.4  1.0196  1.306 142
#>  Species      virginica - setosa  versicolor   1.7867     0.0849 21.03  < 0.001 148.5  1.6188  1.955 142
#>  Species      versicolor - setosa virginica    1.1629     0.0725 16.04  < 0.001 109.4  1.0196  1.306 142
#>  Species      virginica - setosa  virginica    1.7867     0.0849 21.03  < 0.001 148.5  1.6188  1.955 142
#> 
#> Type: response

30.4 Comparing results with different imputation software

We can use the modelsummary package to compare the results with listwise deletion to the results using different imputations software:

library(modelsummary)

## listwise deletion slopes
mod_lwd <- lm(Petal.Width ~ Sepal.Length * Sepal.Width + Species, data = dat)
mfx_lwd <- avg_slopes(mod_lwd, by = "Species")

## regression table
models <- list(
    "LWD" = mfx_lwd,
    "mice" = mfx_mice,
    "Amelia" = mfx_amelia,
    "missRanger" = mfx_missRanger)
modelsummary(models, shape = term : contrast + Species ~ model)

	Species	LWD	mice	Amelia	missRanger
Sepal.Length dY/dX	setosa	0.033	0.068	0.388	0.059
		(0.061)	(0.056)	(0.091)	(0.043)
	versicolor	0.050	0.054	0.323	0.061
		(0.061)	(0.056)	(0.080)	(0.039)
	virginica	0.043	0.058	0.347	0.061
		(0.058)	(0.051)	(0.080)	(0.037)
Sepal.Width dY/dX	setosa	0.274	0.189	-0.208	0.230
		(0.091)	(0.084)	(0.149)	(0.073)
	versicolor	0.255	0.209	-0.116	0.226
		(0.074)	(0.077)	(0.117)	(0.056)
	virginica	0.234	0.224	-0.045	0.223
		(0.083)	(0.104)	(0.127)	(0.069)
Species versicolor - setosa	setosa	1.157	1.140	0.613	1.163
		(0.097)	(0.098)	(0.173)	(0.073)
	versicolor	1.157	1.140	0.613	1.163
		(0.097)	(0.098)	(0.173)	(0.073)
	virginica	1.157	1.140	0.613	1.163
		(0.097)	(0.098)	(0.173)	(0.073)
Species virginica - setosa	setosa	1.839	1.741	1.036	1.787
		(0.123)	(0.111)	(0.200)	(0.085)
	versicolor	1.839	1.741	1.036	1.787
		(0.123)	(0.111)	(0.200)	(0.085)
	virginica	1.839	1.741	1.036	1.787
		(0.123)	(0.111)	(0.200)	(0.085)
Num.Obs.		60	150	150	150
R2		0.953	0.944	0.845	0.947
R2 Adj.		0.949	0.942	0.839	0.946
AIC		-34.0	-76.0	77.7	-84.7
BIC		-19.3	-55.0	98.8	-63.6
Log.Lik.		23.997	45.018	-31.845	49.360
F		220.780	488.599	156.870	519.432
RMSE		0.16	0.18	0.30	0.17

30.5 `mice::pool()`

In some cases, it may be useful to impute, fit, and post-process our models “manually”, to finally combine results using the mice::pool function. This works well in some cases like this one:

library(marginaleffects)
library(mice)
set.seed(1024)
dat <- iris
dat$Sepal.Length[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Sepal.Width[sample(seq_len(nrow(iris)), 40)] <- NA
dat$Species[sample(seq_len(nrow(iris)), 40)] <- NA

# impute
imp <- mice(dat, m = 20, printFlag = FALSE)
imp <- lapply(seq_along(imp), function(i) complete(imp, i))

# fit
fit <- lapply(imp[1:20], function(d) {
              lm(Sepal.Length ~ Petal.Width + Petal.Length, d)
})

# post processing
mfx <- lapply(fit, avg_slopes)

# Rubin's rules
mice::pool(mfx)
#> Class: mipo    m = 20 
#>           term contrast  m   estimate        ubar           b           t dfcom       df       riv    lambda       fmi
#> 1 Petal.Length    dY/dX 20  0.6063614 0.004468437 0.001838516 0.006398878   147 68.19683 0.4320172 0.3016844 0.3213009
#> 2  Petal.Width    dY/dX 20 -0.4814243 0.023966690 0.010061356 0.034531114   147 67.29457 0.4407961 0.3059393 0.3256865

One tricky thing about mice::pool() is that it expects the term column to be present in the data frames that it combines. That column must uniquely identify rows. But in some cases, the output of marginaleffects functions does not include the term column, which can lead to an error when using mice::pool().

For example,

mfx <- avg_slopes(fit[[1]], hypothesis = ~pairwise)
mfx
#> 
#>                      Hypothesis Estimate Std. Error     z Pr(>|z|)    S 2.5 % 97.5 %
#>  (Petal.Width) - (Petal.Length)   -0.833      0.224 -3.72   <0.001 12.3 -1.27 -0.395
#> 
#> Type: response

colnames(mfx)
#> [1] "hypothesis" "estimate"   "std.error"  "statistic"  "p.value"    "s.value"    "conf.low"   "conf.high"

We can work around this problem by adding a term column manually to the output of marginaleffects, using standard R accessors.

mfx <- lapply(fit, \(f) {
    out <- avg_slopes(f, hypothesis = ~pairwise)
    # add the missing `term` column
    out$term <- out$hypothesis
    return(out)
})

mice::pool(mfx)
#> Class: mipo    m = 20 
#>                             term  m  estimate       ubar          b          t dfcom       df       riv    lambda       fmi
#> 1 (Petal.Width) - (Petal.Length) 20 -1.087786 0.04836376 0.02034648 0.06972757   147 67.19976 0.4417317 0.3063897 0.3261507

30.1 mice

30.2 Amelia

30.3 Other imputation packages: missRanger, or lists of imputed data frames.

30.4 Comparing results with different imputation software

30.5 mice::pool()

30.1 `mice`

30.2 `Amelia`

30.3 Other imputation packages: `missRanger`, or lists of imputed data frames.

30.5 `mice::pool()`