22 Data grids

This notebook presents some advanced uses of datagrid(). Readers should be familiar with the content of Section 3.2 before diving in.

22.1 The by argument

Sometimes we want to hold variables at constant values within groups. The by argument in datagrid() allows us to create grids where certain variables are fixed at specific values within subgroups.

For example, we might want to compare predictions at the mean of a continuous variable. In this example, the Sepal.Width column is held at the global mean in every row of the grid:

library(marginaleffects)
mod <- lm(Petal.Width ~ Sepal.Width + Species, data = iris)

predictions(mod,
  newdata = datagrid(Species = unique, Sepal.Width = mean)
)

    Species Sepal.Width Estimate Std. Error     z Pr(>|z|)    S  2.5 % 97.5 %
 setosa            3.06    0.141     0.0304  4.64   <0.001 18.1 0.0814   0.20
 versicolor        3.06    1.407     0.0286 49.26   <0.001  Inf 1.3515   1.46
 virginica         3.06    2.050     0.0259 79.18   <0.001  Inf 1.9989   2.10

Type: response

Now, we hold Sepal.Width at its mean value within each Species group. This allows us to compare predictions across species, with each species evaluated at the typical sepal width of its own group:

predictions(mod,
  newdata = datagrid(Species = unique, Sepal.Width = mean, by = "Species")
)

    Species Sepal.Width Estimate Std. Error    z Pr(>|z|)    S 2.5 % 97.5 %
 setosa            3.43    0.246     0.0256  9.6   <0.001 70.1 0.196  0.296
 versicolor        2.77    1.326     0.0256 51.7   <0.001  Inf 1.276  1.376
 virginica         2.97    2.026     0.0256 79.1   <0.001  Inf 1.976  2.076

Type: response

We can check that the Sepal.Width values in this grid match the within-species means:

aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)
     Species Sepal.Width
1     setosa       3.428
2 versicolor       2.770
3  virginica       2.974
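
To make the role of by more concrete, here is a minimal base R sketch that builds the same kind of grid by hand, without datagrid(). It refits the model from above so that it runs on its own. The name grid_by is hypothetical and used only for illustration, and this is not how datagrid() is implemented internally.

```r
# Refit the model from this section (base R only)
mod <- lm(Petal.Width ~ Sepal.Width + Species, data = iris)

# One row per species, with Sepal.Width set to the within-species mean
grid_by <- aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)

# Evaluate the fitted model on the hand-built grid
grid_by$Estimate <- predict(mod, newdata = grid_by)
grid_by
```

The Estimate column of this hand-built grid should match the estimates reported by predictions() with by = "Species", since both evaluate the same model at the same covariate values.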

22.2 Sampling from large grids with grid_type = "dataframe"

When working with many categorical variables, a balanced grid can become prohibitively large. For instance, with 5 variables each taking 26 possible values, a full factorial grid would have 26^5 = 11,881,376 rows.
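
The arithmetic is easy to check in base R. The object names n_levels and full_grid_rows below are hypothetical, chosen only for this illustration.

```r
# Number of levels for each of the 5 categorical variables
n_levels <- rep(26, 5)

# A full factorial grid has one row per combination of levels
full_grid_rows <- prod(n_levels)
format(full_grid_rows, big.mark = ",")
```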

The grid_type = "dataframe" argument combined with a custom FUN allows us to sample uniformly from the full factorial space. The resulting grid will “approach” the true factorial grid, but have a much smaller footprint.

# Simulate data with 5 categorical variables
set.seed(48103)
n <- 10000
dat <- data.frame(
  x1 = sample(letters, n, replace = TRUE),
  x2 = sample(letters, n, replace = TRUE),
  x3 = sample(letters, n, replace = TRUE),
  x4 = sample(letters, n, replace = TRUE),
  x5 = sample(letters, n, replace = TRUE),
  y = rnorm(n)
)

# Fit model
mod <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)

# Sample 5 values uniformly from each variable's unique entries
sample_unique <- \(x, n) sample(unique(x), n, replace = TRUE)
grid_sampled <- datagrid(
  model = mod,
  grid_type = "dataframe",
  FUN = \(x) sample_unique(x, 5)
)

grid_sampled
  rowid x1 x2 x3 x4 x5          y
1     1  u  z  b  j  b -0.6469524
2     2  h  l  g  m  p  0.1698563
3     3  p  f  c  h  t -0.5320531
4     4  b  h  d  f  z -0.8316959
5     5  g  c  h  n  u  0.4131603

Now we can make predictions on the much smaller sampled grid:

predictions(mod, newdata = grid_sampled)

 Estimate Std. Error      z Pr(>|z|)   S   2.5 % 97.5 %
  -0.0192      0.111 -0.173   0.8628 0.2 -0.2365  0.198
  -0.0742      0.113 -0.654   0.5128 1.0 -0.2964  0.148
   0.0336      0.111  0.303   0.7620 0.4 -0.1838  0.251
   0.0158      0.114  0.139   0.8894 0.2 -0.2075  0.239
   0.2458      0.113  2.179   0.0293 5.1  0.0247  0.467

Type: response

Note that grid_type = "dataframe" tells datagrid() to construct the grid by applying FUN to each variable individually, and then binding the resulting vectors side by side as columns with cbind(). The grid thus has as many rows as FUN returns values (5 in the example above). This is different from grid types like "balanced" or "counterfactual", which take the Cartesian product of the supplied values with expand.grid().
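
The difference between the two construction strategies can be sketched in base R. This is only an illustration of column-binding versus a Cartesian product, not the internal datagrid() code. The names sampled, grid_columns, and grid_product are hypothetical.

```r
set.seed(48103)

# Two categorical variables, as in the simulation above but smaller
dat_small <- data.frame(
  x1 = sample(letters, 100, replace = TRUE),
  x2 = sample(letters, 100, replace = TRUE)
)

# Draw 3 values from the unique entries of each variable
sample_unique <- function(x, n) sample(unique(x), n, replace = TRUE)
sampled <- lapply(dat_small, sample_unique, n = 3)

# Column-binding: the i-th draw of x1 is paired with the i-th draw of x2
grid_columns <- as.data.frame(sampled)

# Cartesian product: every draw of x1 is paired with every draw of x2
grid_product <- expand.grid(sampled)

nrow(grid_columns)  # 3 rows
nrow(grid_product)  # 3 * 3 = 9 rows
```

With column-binding, the number of rows stays fixed at the number of sampled values no matter how many variables there are, whereas the Cartesian product grows multiplicatively with each additional variable.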