Marginal means are the average predicted values of the outcome at each level of a factor, averaged over all other predictors / random effects (i.e., marginalized over the other terms in the model)
Marginal means are conceptually (and in practice) different from empirical means (i.e., the direct average of the observed values) because they are model-based (and can thus better generalize or extrapolate)
Marginal means are a powerful tool to estimate any quantity of interest in a statistical model
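For instance, the marginal mean of the outcome for each Species can be obtained with modelbased::estimate_means() (a minimal sketch; it assumes model3 is the Bayesian model used in the examples below, with Species among its predictors):
# Marginal mean of the outcome per Species, averaged over the other predictors
modelbased::estimate_means(model3, by = "Species")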
Marginal Contrasts
One very useful application of marginal means is to compute marginal contrasts (contrasts of marginal means)
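For example, pairwise differences between Species can be estimated with modelbased::estimate_contrasts() (a sketch, again assuming model3 includes Species as a predictor):
# Pairwise contrasts of the marginal means between Species levels
modelbased::estimate_contrasts(model3, contrast = "Species")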
The model-based framework can also be used to compute marginal effects
I.e., the slope of a predictor at any given value of the other predictors
modelbased::estimate_slopes(model3, trend = "Petal.Width", by = "Species")
Estimated Marginal Effects
Species | Median | 95% CI | pd | ROPE | % in ROPE
------------------------------------------------------------------------
setosa | 0.57 | [-0.43, 1.56] | 85.75% | [-0.10, 0.10] | 8.39%
versicolor | 1.87 | [ 1.32, 2.40] | 100% | [-0.10, 0.10] | 0%
virginica | 0.65 | [ 0.28, 1.02] | 99.98% | [-0.10, 0.10] | 0%
Marginal effects estimated for Petal.Width
Type of slope was dY/dX
Exercise
Interpret and visualize the effect of Sepal.Width on Sepal.Length (depending on the Species)
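A possible starting point (a sketch; the model formula and the use of brms are assumptions, and plotting relies on the see package):
# Fit a model with an interaction between Sepal.Width and Species,
# then estimate and plot the slope of Sepal.Width within each Species
model_sw <- brms::brm(Sepal.Length ~ Sepal.Width * Species, data = iris)
slopes <- modelbased::estimate_slopes(model_sw, trend = "Sepal.Width", by = "Species")
plot(slopes)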
Model Comparison
Coefficient of Determination - R2
R2 measures how well the model explains the data already observed, focusing on variance reduction
R2 can sometimes give misleading impressions in cases of overfitting; a model might appear to perform very well on the training data but poorly on new data
R2 is primarily applicable (and intuitive) to linear models where the relationship between variables is the primary interest
performance::r2(model)
# Bayesian R2 with Compatibility Interval
Conditional R2: 0.927 (95% CI [0.920, 0.932])
performance::r2(model2)
# Bayesian R2 with Compatibility Interval
Conditional R2: 0.941 (95% CI [0.936, 0.945])
performance::r2(model3)
# Bayesian R2 with Compatibility Interval
Conditional R2: 0.958 (95% CI [0.955, 0.961])
ELPD
Other “relative” indices of fit can be used that measure how well the model predicts each data point and how well it could, in theory, generalize to new data
This quality-of-fit metric is called the ELPD (Expected Log Pointwise Predictive Density)
It can be computed using two main methods:
WAIC (Widely Applicable Information Criterion)
LOO (Leave-One-Out cross-validation)
ELPD - WAIC
WAIC (Widely Applicable Information Criterion)
An index of prediction error adjusted for the number of parameters
It provides a balance between model fit and complexity, penalizing models that have too many parameters (similar to the AIC).
Computationally more straightforward than LOO, but might not be as accurate (more sensitive to outliers)
loo::waic(model)
Computed from 4000 by 150 log-likelihood matrix.
Estimate SE
elpd_waic -104.3 9.5
p_waic 3.2 0.5
waic 208.6 19.0
loo::waic(model2)
Computed from 4000 by 150 log-likelihood matrix.
Estimate SE
elpd_waic -89.3 10.6
p_waic 4.5 0.8
waic 178.5 21.3
1 (0.7%) p_waic estimates greater than 0.4. We recommend trying loo instead.
loo::waic(model3)
Computed from 4000 by 150 log-likelihood matrix.
Estimate SE
elpd_waic -64.5 11.5
p_waic 7.0 1.2
waic 129.0 23.0
3 (2.0%) p_waic estimates greater than 0.4. We recommend trying loo instead.
loo_compare()
Note that \(ELPD = -(\frac{WAIC}{2})\)
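For instance, for the first model above, \(-\frac{208.6}{2} = -104.3\), which matches its elpd_waic.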
Use loo::loo_compare() to get the difference in ELPD between models
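For example (a sketch, assuming the three models fitted earlier):
# Compare the models' ELPD (models are ordered best-first; differences are relative to the best)
loo::loo_compare(loo::loo(model), loo::loo(model2), loo::loo(model3))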
ELPD - LOO
Leave-One-Out (LOO) Cross-Validation is a method to assess model performance by estimating how well a model predicts each observation, one at a time, using the rest of the data
Instead of refitting the model n times, each time leaving out one of the n data points, approximations like PSIS (Pareto Smoothed Importance Sampling) are used to avoid extensive computation
Provides a robust measure of the model’s predictive accuracy (without a direct complexity adjustment, but an indirect one through its sensitivity to overfitting)
loo::loo(model)
Computed from 4000 by 150 log-likelihood matrix.
Estimate SE
elpd_loo -104.3 9.5
p_loo 3.2 0.5
looic 208.6 19.0
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.2]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
loo::loo(model2)
Computed from 4000 by 150 log-likelihood matrix.
Estimate SE
elpd_loo -89.3 10.6
p_loo 4.6 0.8
looic 178.6 21.3
------
MCSE of elpd_loo is 0.0.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.1]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
loo::loo(model3)
Computed from 4000 by 150 log-likelihood matrix.
Estimate SE
elpd_loo -64.5 11.5
p_loo 7.0 1.2
looic 129.1 23.0
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.3, 1.0]).
All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.
The difference in predictive accuracy, as indexed by Expected Log Predictive
Density (ELPD-LOO), suggests that 'model3' is the best model (ELPD = -64.54),
followed by 'model2' (diff-ELPD = -24.75 +- 6.63, p < .001) and 'model'
(diff-ELPD = -39.79 +- 10.30, p < .001)
Exercise
Make a conclusion about which model is better and report the analysis
Exercise
A researcher is interested in the relationship between the two dimensions (verbal and quantitative) of the SAT (a standardized test for college admissions), in particular whether the quantitative score is predicted by the verbal score. A colleague suggests that gender is important.
Using the real dataset sat.act from the psych package, analyze the data and answer the question
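A possible starting point (a sketch; the variable names come from psych::sat.act, but the model specifications, the use of brms, and the comparison strategy are assumptions):
# Load the data; gender is coded numerically (1 = male, 2 = female), so treat it as a factor
data("sat.act", package = "psych")
sat.act$gender <- as.factor(sat.act$gender)
# Fit a model with the verbal score only, and one adding gender (and its interaction)
m1 <- brms::brm(SATQ ~ SATV, data = sat.act)
m2 <- brms::brm(SATQ ~ SATV * gender, data = sat.act)
# Compare their predictive accuracy via ELPD-LOO
loo::loo_compare(loo::loo(m1), loo::loo(m2))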