Bayesian Statistics

10. Marginal Effects and Model Comparison

Dominique Makowski
D.Makowski@sussex.ac.uk

Penguins - Pick your fighter!

  • Adelie: The pebble thieves. They steal pebbles from other penguins to build their nests.
  • Chinstrap: The power-nappers. They take thousands of ~4-second microsleeps a day.
  • Gentoo: The speedos. They can reach speeds of up to 36 km/h in the water.
head(penguins)
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
4  Adelie Torgersen       NA       NA          NA        NA   <NA> 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007

How Big are these bad boys?

  • What is the average body_mass value for each species?
empirical_means <- penguins |> 
  summarise(Empirical_Mean = mean(body_mass, na.rm = TRUE), .by = "species")

empirical_means
    species Empirical_Mean
1    Adelie       3700.662
2    Gentoo       5076.016
3 Chinstrap       3733.088
  • Gentoo (speedos) > Chinstrap (power-nappers) > Adelie (Pebble thieves)
  • But are they “significantly” different?

Marginal Means & Contrasts

Model 1: Species as Predictor

  • Fit the model with species as a predictor and interpret the parameters
model1 <- brms::brm(body_mass ~ species, 
                    data = penguins, refresh=0)
parameters::parameters(model1)
# Fixed Effects

Parameter        |  Median |             95% CI |     pd |  Rhat |  ESS
-----------------------------------------------------------------------
(Intercept)      | 3700.63 | [3626.80, 3773.23] |   100% | 1.000 | 3402
speciesChinstrap |   33.41 | [-101.80,  168.81] | 67.70% | 1.000 | 3970
speciesGentoo    | 1375.30 | [1266.66, 1485.19] |   100% | 1.001 | 3721

# sigma Parameters

Parameter | Median |           95% CI |   pd |  Rhat |  ESS
-----------------------------------------------------------
sigma     | 463.31 | [430.49, 501.39] | 100% | 0.999 | 3936
  • When a factor is used as a predictor, the first level is used as the reference level (i.e., becomes the intercept)
  • The other levels are compared to the reference level
  • Question: What is the average body_mass for “Chinstrap” penguins?
  • Can we easily get the value of the outcome at each level?
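For the “Chinstrap” question, the answer can be pieced together by hand from the coefficients above (summing the posterior medians is only an approximation; strictly, the two parameters should be summed within each posterior draw, e.g., via brms::as_draws_df()):

```r
# Chinstrap's average body mass = intercept (reference level, Adelie)
# + the speciesChinstrap difference (posterior medians from the table)
intercept      <- 3700.63
chinstrap_diff <- 33.41
intercept + chinstrap_diff
#> [1] 3734.04
```

But doing this by hand quickly becomes tedious, which is where marginal means come in.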

Marginal Means

  • A statistical model defines a relationship between the outcome and any predictor values
  • It is straightforward to compute the model-based value (i.e., the prediction) of the outcome at each level of a factor
    • estimate_prediction(), estimate_relation()
  • Model predictions can in turn be leveraged to compute marginal means (see Makowski et al., 2025)
  • Marginal means are typically the average predicted values of the outcome at each level of a factor, averaged over all other predictors / random effects (i.e., marginalized for other stuff)
modelbased::estimate_means(model1, by="species")
Estimated Marginal Means

species   |  Median |             95% CI |   pd |          ROPE | % in ROPE
---------------------------------------------------------------------------
Adelie    | 3700.63 | [3626.80, 3773.23] | 100% | [-0.10, 0.10] |        0%
Chinstrap | 3732.79 | [3622.35, 3843.17] | 100% | [-0.10, 0.10] |        0%
Gentoo    | 5076.78 | [4996.39, 5158.25] | 100% | [-0.10, 0.10] |        0%

Variable predicted: body_mass
Predictors modulated: species
  • Marginal means are conceptually (and in practice) different from empirical means (i.e., direct average of the observed values) because they are based on a model (and thus can better generalize, extrapolate, and take into account other predictors)
  • Marginal means are a powerful tool to estimate any quantities of interest in a statistical model
empirical_means
    species Empirical_Mean
1    Adelie       3700.662
2    Gentoo       5076.016
3 Chinstrap       3733.088
  • In this case, they are very similar to the empirical means because this is what this linear model does (estimating the mean of the outcome as a function of predictors)
  • Benefit 1 of marginal means: they provide inferential information (CIs, etc.)

Marginal Contrasts

  • Benefit 2: One very useful application of marginal means is to compute marginal contrasts (contrasts of marginal means)
modelbased::estimate_contrasts(model1, contrast="species")
Marginal Contrasts Analysis

Level1    | Level2    |  Median |             95% CI |     pd |          ROPE | % in ROPE
-----------------------------------------------------------------------------------------
Chinstrap | Adelie    |   33.41 | [-101.80,  168.81] | 67.70% | [-0.10, 0.10] |     0.13%
Gentoo    | Adelie    | 1375.30 | [1266.66, 1485.19] |   100% | [-0.10, 0.10] |        0%
Gentoo    | Chinstrap | 1343.50 | [1202.88, 1486.06] |   100% | [-0.10, 0.10] |        0%

Variable predicted: body_mass
Predictors contrasted: species
  • What can we conclude in terms of differences?

Visualization

  • Most outputs of modelbased functions can be directly visualized with plot()
means1 <- modelbased::estimate_means(model1, by="species", test = NULL)
plot(means1, show_data = TRUE)

  • (But often better to plot it yourself for publication-ready figures)

Model 2: “Adjusting” for another Predictor

model2 <- brms::brm(body_mass ~ species + bill_len, 
                    data = penguins, refresh=0)
parameters::parameters(model2)
# Fixed Effects

Parameter        |  Median |              95% CI |     pd |  Rhat |  ESS
------------------------------------------------------------------------
(Intercept)      |  156.95 | [ -367.84,  697.73] | 71.97% | 1.000 | 1992
speciesChinstrap | -886.61 | [-1059.17, -706.53] |   100% | 1.002 | 1924
speciesGentoo    |  579.82 | [  432.96,  730.45] |   100% | 1.002 | 1861
bill_len         |   91.26 | [   77.57,  104.72] |   100% | 1.000 | 1882

# sigma Parameters

Parameter | Median |           95% CI |   pd |  Rhat |  ESS
-----------------------------------------------------------
sigma     | 375.80 | [348.21, 408.39] | 100% | 1.002 | 3357

  • How to interpret these parameters?
  • The model now estimates the effect of species on body_mass while holding bill_len constant

Model 2: Marginal Means

means2 <- modelbased::estimate_means(model2, by="species", test = NULL)
means2
Estimated Marginal Means

species   |  Median |             95% CI
----------------------------------------
Adelie    | 4169.46 | [4076.55, 4263.62]
Chinstrap | 3283.81 | [3173.92, 3398.72]
Gentoo    | 4748.61 | [4665.76, 4835.20]

Variable predicted: body_mass
Predictors modulated: species
Predictors averaged: bill_len (44)
plot(means2)

means1
Estimated Marginal Means

species   |  Median |             95% CI
----------------------------------------
Adelie    | 3700.63 | [3626.80, 3773.23]
Chinstrap | 3732.79 | [3622.35, 3843.17]
Gentoo    | 5076.78 | [4996.39, 5158.25]

Variable predicted: body_mass
Predictors modulated: species
plot(means1)

  • Why is it different?
  • Marginal means here answer a different question: “What would be the average body mass if all penguins had the same bill length (specifically, the average bill length of the dataset), assuming that the relationship between bill length and body mass is the same across species?”
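The “(44)” printed under means2 (“Predictors averaged: bill_len (44)”) is simply the sample mean bill length that predictions are held at; assuming the penguins data frame with a bill_len column (as shown above), it can be recovered directly:

```r
# The single value bill_len is held at when computing marginal means:
# its (rounded) sample mean
round(mean(penguins$bill_len, na.rm = TRUE))
#> [1] 44
```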

Model 3: Interaction

model3 <- brms::brm(body_mass ~ species * bill_len, 
                    data = penguins, refresh=0)
  • The interaction term models a different relationship between bill length and body mass across species
means3 <- modelbased::estimate_means(model3, by="species", test = NULL)
means3
Estimated Marginal Means

species   |  Median |             95% CI
----------------------------------------
Adelie    | 4185.08 | [4054.67, 4315.75]
Chinstrap | 3440.01 | [3276.19, 3603.14]
Gentoo    | 4681.73 | [4579.06, 4788.59]

Variable predicted: body_mass
Predictors modulated: species
Predictors averaged: bill_len (44)
plot(means3)

  • Marginal means answer the question: “What would be the average body mass of different species, ‘controlling’ for bill length, and taking into account the specific relationship between bill length and body mass of each species?”
  • In general, in modern statistics, there is little reason not to include interactions in a model.

Marginal Contrasts: New Answers

estimate_contrasts(model3, contrast="species", test = "pd")
Marginal Contrasts Analysis

Level1    | Level2    |  Median |             95% CI |   pd
-----------------------------------------------------------
Chinstrap | Adelie    | -744.04 | [-949.96, -529.54] | 100%
Gentoo    | Adelie    |  498.51 | [ 335.69,  667.93] | 100%
Gentoo    | Chinstrap | 1242.32 | [1043.32, 1443.52] | 100%

Variable predicted: body_mass
Predictors contrasted: species
Predictors averaged: bill_len (44)
  • Contrary to our initial conclusion, all species differ significantly on body mass when controlling for bill length.

Visualisation

pred <- estimate_relation(model3, by = c("species", "bill_len"))

ggplot(penguins, aes(x = bill_len, y = body_mass)) +
  geom_point(aes(color  = species)) +
  geom_ribbon(data = pred, aes(x = bill_len, y = Predicted, ymin = CI_low, ymax = CI_high, fill = species), alpha = 0.2) +
  geom_line(data = pred, aes(x = bill_len, y = Predicted, color = species)) +
  theme_minimal()
  • Would the difference in body mass be significant for penguins with the same tiny bill?

Varying Marginal Contrasts

  • Contrasts can be computed explicitly at any value of the predictor (e.g., bill length = 30mm)
  • In estimate_contrasts() (like in all other modelbased functions), the length argument can be used to easily specify the number of points on continuous predictors
estimate_contrasts(model3, contrast = "species", by = "bill_len", length = 3, test = "pd")
Marginal Contrasts Analysis

Level1    | Level2    | bill_len |   Median |              95% CI |     pd
--------------------------------------------------------------------------
Chinstrap | Adelie    |    32.10 |  -330.64 | [ -820.05,  159.12] | 90.28%
Gentoo    | Adelie    |    32.10 |   320.15 | [  -67.85,  712.99] | 94.77%
Gentoo    | Chinstrap |    32.10 |   651.48 | [   70.80, 1225.42] | 98.80%
Chinstrap | Adelie    |    45.85 |  -812.10 | [-1016.17, -602.30] |   100%
Gentoo    | Adelie    |    45.85 |   526.66 | [  337.58,  717.51] |   100%
Gentoo    | Chinstrap |    45.85 |  1338.88 | [ 1188.53, 1486.44] |   100%
Chinstrap | Adelie    |    59.60 | -1293.82 | [-1863.94, -714.80] |   100%
Gentoo    | Adelie    |    59.60 |   727.46 | [  190.22, 1297.79] | 99.38%
Gentoo    | Chinstrap |    59.60 |  2022.25 | [ 1621.95, 2433.07] |   100%

Variable predicted: body_mass
Predictors contrasted: species
  • Useful to understand interactions

Marginal Effects

Estimated Slopes

  • What is the effect of bill length on body mass for each species?
  • The same model-based estimation logic used for means & contrasts can be applied to “slopes” (effects).
  • We can technically compute effects at any level of the predictors.
estimate_slopes(model3, slope = "bill_len", by = "species")
Estimated Marginal Effects

species   | Median |          95% CI |   pd |          ROPE | % in ROPE
-----------------------------------------------------------------------
Adelie    |  94.43 | [71.50, 116.97] | 100% | [-0.10, 0.10] |        0%
Chinstrap |  59.37 | [32.61,  86.32] | 100% | [-0.10, 0.10] |        0%
Gentoo    | 109.25 | [88.02, 130.85] | 100% | [-0.10, 0.10] |        0%

Marginal effects estimated for bill_len
Type of slope was dY/dX

Model Comparison

Models Recap

model1 <- brms::brm(body_mass ~ species,  data = penguins, refresh=0)
model2 <- brms::brm(body_mass ~ species + bill_len,  data = penguins, refresh=0)
model3 <- brms::brm(body_mass ~ species * bill_len,  data = penguins, refresh=0)
  • Which one is better?

Coefficient of Determination - R2

  • R2 measures how well the model explains the data already observed, focusing on variance reduction
  • R2 can sometimes give misleading impressions in cases of overfitting; a model might appear to perform very well on the training data but poorly on new data
  • R2 is primarily applicable (and intuitive) to linear models where it corresponds to “variance explained”
performance::r2(model1)
# Bayesian R2 with Compatibility Interval

  Conditional R2: 0.669 (95% CI [0.635, 0.700])
performance::r2(model2)
# Bayesian R2 with Compatibility Interval

  Conditional R2: 0.782 (95% CI [0.762, 0.800])
performance::r2(model3)
# Bayesian R2 with Compatibility Interval

  Conditional R2: 0.786 (95% CI [0.765, 0.804])

ELPD

  • Other “relative” indices of fit can be used that measure how well the model predicts each data point and how well it could, in theory, generalize to new data
    • This quality of fit metric is called the ELPD (Expected Log Pointwise Predictive Density)
  • It can be computed using 2 main procedures:
    • WAIC (Widely Applicable Information Criterion)
    • LOO (Leave-One-Out Cross-Validation)

ELPD - WAIC

  • WAIC (Widely Applicable Information Criterion)
    • An index of prediction error adjusted for the number of parameters
    • It provides a balance between model fit and complexity, penalizing models that have too many parameters (similar to the AIC).
    • Computationally more straightforward than LOO, but might not be as accurate (more sensitive to outliers)
loo::waic(model1)

Computed from 4000 by 342 log-likelihood matrix.

          Estimate   SE
elpd_waic  -2586.2 11.4
p_waic         3.6  0.3
waic        5172.4 22.7
loo::waic(model2)

Computed from 4000 by 342 log-likelihood matrix.

          Estimate   SE
elpd_waic  -2515.7 12.9
p_waic         4.9  0.5
waic        5031.3 25.8
loo::waic(model3)

Computed from 4000 by 342 log-likelihood matrix.

          Estimate   SE
elpd_waic  -2513.3 13.0
p_waic         6.5  0.7
waic        5026.6 26.0

What to do with these numbers?

  • Note that \(ELPD = -(\frac{WAIC}{2})\)
  • Use loo::loo_compare() to get the difference in ELPD between models
results1 <- loo::loo_compare(loo::waic(model1), loo::waic(model2), loo::waic(model3))
results1
       elpd_diff se_diff
model3   0.0       0.0  
model2  -2.4       2.7  
model1 -72.9       9.6  
  • Shows the “best” model first
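The \(ELPD = -(\frac{WAIC}{2})\) relation can be checked directly against the printed output above, e.g., using model3’s rounded values:

```r
# Check: elpd_waic should equal -(waic / 2)
waic3 <- 5026.6    # waic estimate printed for model3
-(waic3 / 2)
#> [1] -2513.3
```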

ELPD - LOOIC

  • Leave-One-Out (LOO) Cross-Validation is a method to assess model performance by estimating how well a model predicts each observation, one at a time, using the rest of the data
  • Instead of refitting the model n times, each time leaving out one of the n data points, approximations like PSIS (Pareto Smoothed Importance Sampling) are used to avoid extensive computation
  • Provides a robust measure of the model’s predictive accuracy (without a direct complexity adjustment, but an indirect one through sensitivity to overfitting)
results2 <- loo::loo_compare(loo::loo(model1), loo::loo(model2), loo::loo(model3))
results2
       elpd_diff se_diff
model3   0.0       0.0  
model2  -2.3       2.7  
model1 -72.8       9.6  

Better by how much? Interpretation rules of thumb

results1
       elpd_diff se_diff
model3   0.0       0.0  
model2  -2.4       2.7  
model1 -72.9       9.6  
results2
       elpd_diff se_diff
model3   0.0       0.0  
model2  -2.3       2.7  
model1 -72.8       9.6  
  • We can focus on the ELPD difference and its SE
  • The “standardized” difference (\(\Delta_{ELPD} / SE\)) can be interpreted as a z-score
    • \(> |1.96|\) ~ p < .05
    • \(> |3.29|\) ~ p < .001
    • Use 1-pnorm(|z-score|) to get the one-tailed p-value
  • But the standardized difference does not make much sense if the absolute difference is small (\(|\Delta_{ELPD}| < 4\))
    • It’s not useful to say one model is likely to provide better predictions than another if the predictions are almost the same anyway
  • Reporting should include the raw ELPD difference and its SE (and possibly the z-score or p-value if it makes sense)
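As a sketch, using the rounded values from the loo_compare() table above:

```r
# Standardized ELPD difference (z-score) and its one-tailed p-value,
# for model1 relative to the best model (model3)
elpd_diff <- -72.9
se_diff   <- 9.6
z <- elpd_diff / se_diff
z                    # about -7.59 -> well past |3.29|, so p < .001
1 - pnorm(abs(z))    # one-tailed p-value (essentially 0)
# For model2 (diff = -2.4, SE = 2.7), |z| is only ~0.89 - and since
# |diff| < 4, the comparison is not very meaningful anyway
```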
report::report(results2)
The difference in predictive accuracy, as indexed by Expected Log Predictive
Density (ELPD-LOO), suggests that 'model3' is the best model (ELPD = -2513.33),
followed by 'model2' (diff-ELPD = -2.35 +- 2.68, p = 0.380) and 'model1'
(diff-ELPD = -72.85 +- 9.62, p < .001)

Exercise

  • A researcher is interested in the relationship between the two dimensions (verbal and quantitative) of the SAT (a standardized test for college admissions), in particular whether the quantitative score is predicted by the verbal score. A colleague suggests that gender is important.
  • Using the real dataset sat.act from the psych package, analyze the data and answer the question
  • Load the dataset as follows:
df <- psych::sat.act
df$gender <- as.factor(df$gender)  # gender=1: males, 2: females
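One possible starting point (a sketch only, not the full analysis; SATQ and SATV are the quantitative and verbal SAT scores in sat.act, and fitting these models takes a little while):

```r
# Compare a verbal-only model with one that also includes gender
# (and its interaction), then contrast their predictive accuracy
m1 <- brms::brm(SATQ ~ SATV, data = df, refresh = 0)
m2 <- brms::brm(SATQ ~ SATV * gender, data = df, refresh = 0)
loo::loo_compare(loo::loo(m1), loo::loo(m2))
```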

Conclusion

  • You have (hopefully) learned:
    • What Bayesian statistics are about: the philosophy and applications of Bayes’ theorem
    • The strengths and drawbacks, and the differences and similarities with other approaches
    • How to apply it using simple and less simple models
  • This background knowledge helped you gain confidence about knowing where to look to learn more
  • Bayesian statistics is a very rapidly evolving field
    • Challenge to teach & learn (lots of uncertainty)
  • Important to keep up-to-date with the latest developments
    • Follow the development of new and existing packages (e.g., brms, rstan, …)
    • Follow forums, blogs and developers (e.g., Twitter, LinkedIn, …)
  • Learn by engaging with the community
    • Ask for help - your questions will often help others with the same problem
    • Help others (share tips and tricks, little successes)
  • Spread the Bayesian gospel to your fellow academics
    • But don’t be a blind zealot - think for yourselves

The End (for good)

Thank you!