Bayesian Statistics

8. Linear Models (Priors)

Dominique Makowski
D.Makowski@sussex.ac.uk

Goal

  • Last week, we were interested in the correlation between mpg and qsec in the mtcars dataset
BayesFactor::correlationBF(mtcars$mpg, mtcars$qsec)
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 4.416463 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) +
  stat_ellipse(color="blue", linewidth=1)

  • We learned how to:
    • Use Bayes Factors in simple tests (e.g., t-tests and correlations)
    • Run Bayesian Correlations
    • Sample from the posterior using MCMC
    • Describe and report the posterior distribution
  • This week, we are interested in the relationship itself (i.e., not just the strength of association, but the actual characteristics of the regression line) between these two variables

Exercise

  • Fit a Frequentist linear model to predict qsec from mpg. Write down its parameters.
model <- lm(qsec ~ mpg, data = mtcars) 
parameters::parameters(model)
Parameter   | Coefficient |   SE |         95% CI | t(30) |      p
------------------------------------------------------------------
(Intercept) |       15.35 | 1.03 | [13.25, 17.46] | 14.91 | < .001
mpg         |        0.12 | 0.05 | [ 0.02,  0.22] |  2.53 | 0.017 
  • Visualize the relationship
ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) +
  geom_smooth(method="lm", color="red")

modelbased::estimate_relation(model) |> 
  plot(show_data = TRUE)

Bayesian Models: Quick n’ Dirty

Installing brms

  • From now until the end of the module, we will use the brms package to fit Bayesian models
  • We took some time to do the installation in the 2nd module, but if you missed it, let me know during the break
  • It is a package that is complicated to install. Why?
  • Because it needs to install another programming language…

Stan, JAGS, and BUGS

  • As mentioned, performing Bayesian inference is complex
  • Modern full Bayesian inference (using MCMC) is known to be resource-intensive and slow; it is currently performed fastest and most flexibly by dedicated languages, aka probabilistic programming languages
    • Stan (the most popular one): Named in honour of Stanislaw Ulam, pioneer of the Monte Carlo method
    • JAGS: “Just Another Gibbs Sampler”
    • BUGS: “Bayesian Inference Using Gibbs Sampling”
    • PyMC (Python)
    • Turing (Julia)
  • These languages are designed to perform Bayesian inference, and are optimized for that purpose
  • They are not general-purpose programming languages
  • What a hassle to learn a new language just to do Bayesian inference!

About brms

  • There are “wrappers” around these languages that allow us to use them from other general purpose languages like R or Python
  • In R:
    • brms = Bayesian Regression Models using ‘Stan’
    • rstanarm = R interface to Stan for Applied Regression Modeling
  • rstanarm
    • Has different functions for different models (e.g., stan_lm(), stan_glm(), stan_lmer(), stan_glmer(), stan_gamm4() etc.)
    • Each model is implemented with precompiled Stan code
    • It is slightly easier to install, but less flexible
  • brms
    • Has become the most popular
    • The main difference is that brms has only one main function, brm(), which can fit almost any model
    • These models are translated into Stan code, and then compiled
    • There is compilation overhead when fitting new models, but it is more flexible

Doing a Bayesian Regression

  • Frequentist model
model <- lm(qsec ~ mpg, data = mtcars) 
parameters::parameters(model)
Parameter   | Coefficient |   SE |         95% CI | t(30) |      p
------------------------------------------------------------------
(Intercept) |       15.35 | 1.03 | [13.25, 17.46] | 14.91 | < .001
mpg         |        0.12 | 0.05 | [ 0.02,  0.22] |  2.53 | 0.017 
  • Bayesian model
model <- brms::brm(qsec ~ mpg, data = mtcars, refresh=0) 
Running MCMC with 4 sequential chains...

Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.6 seconds.
parameters::parameters(model)
# Fixed Effects

Parameter   | Median |         95% CI |     pd |  Rhat |  ESS
-------------------------------------------------------------
(Intercept) |  15.37 | [13.20, 17.39] |   100% | 1.001 | 3707
mpg         |   0.12 | [ 0.02,  0.23] | 99.28% | 1.000 | 3670

# sigma Parameters

Parameter | Median |       95% CI |   pd |  Rhat |  ESS
-------------------------------------------------------
sigma     |   1.68 | [1.33, 2.23] | 100% | 1.002 | 3308
  • refresh=0 silences the printing of the sampling (e.g., to avoid cluttering documents in quarto)
  • R-hat: Convergence diagnostic index. Should be less than 1.05.
  • ESS: Effective Sample Size: how many “good iterations” (truly independent draws) we have. Should be > 400 (the maximum being the number of iterations that we draw).
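As a sketch (assuming brms is installed and `model` is the brmsfit from above), these diagnostics can also be extracted programmatically; `rhat()` and `neff_ratio()` are generics re-exported by brms:

```r
# Assuming `model` is the brmsfit fitted above
rhats <- brms::rhat(model)        # one R-hat value per parameter
all(rhats < 1.05, na.rm = TRUE)   # TRUE if convergence looks fine

brms::neff_ratio(model)           # effective / total sample size per parameter
```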

Voilà!

  • You’ve fitted and reported a Bayesian model.
  • The results are very consistent with the Frequentist version.
  • This is a valid Bayesian analysis…
  • But because we are on our way to becoming Bayesian Masters, we can spend more time thinking about certain aspects of this model.

Bayesian Models: Priors

brms formula

  • The brm() function’s first argument is a formula, like in lm()
  • Because formulas can become complex, it is useful to specify the formula in a separate variable
f <- brms::brmsformula(qsec ~ mpg)
  • We could then run brms::brm(f, data=mtcars) to fit the Bayesian linear model

Regression formulas: the intercept

  • In most model functions in R, the intercept is included by default:
    • lm(qsec ~ mpg) is equivalent to lm(qsec ~ 1 + mpg)
coef(lm(qsec ~ mpg, data=mtcars))
(Intercept)         mpg 
 15.3547689   0.1241366 
coef(lm(qsec ~ 1 + mpg, data=mtcars))
(Intercept)         mpg 
 15.3547689   0.1241366 

No-intercept models

  • One can remove the intercept by using lm(qsec ~ 0 + mpg)
coef(lm(qsec ~ 0 + mpg, data=mtcars))
     mpg 
0.827125 
  • It forces the regression line to go through the origin (0, 0)
newdata <- data.frame(mpg = seq(-5, 35, 5))

lm(qsec ~ 0 + mpg, data=mtcars) |> 
  insight::get_predicted(data=newdata, ci=0.95) |> 
  as.data.frame() |> 
  cbind(newdata) |> 
  ggplot(aes(x = mpg, y = Predicted)) +
  geom_ribbon(aes(ymin=CI_low, ymax=CI_high), fill="red", alpha=0.2) +
  geom_line(color="red") +
  geom_point(data=mtcars, aes(x = mpg, y = qsec), size=3) 

brms formula exception: the intercept

  • In brms, there is a twist. For reasons that we will clarify later, it is often good to explicitly include the intercept as a “variable”
    • y ~ 0 + Intercept is equivalent to y ~ 1 (= “remove the automatic intercept, but add a custom intercept named ‘Intercept’”)
    • This treats the intercept as a normal coefficient rather than a special term, which makes prior specification easier
  • Instead of the previous formula, we will use:
f <- brms::brmsformula(qsec ~ 0 + Intercept + mpg)
  • The two formulas result in the same model, but explicitly including the intercept makes it easier to specify priors
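To see the difference the explicit intercept makes, we can compare the default prior structure of the two formulas (a sketch, assuming brms is installed):

```r
# With the standard formula, the intercept gets its own special class "Intercept"
brms::get_prior(qsec ~ mpg, data = mtcars)

# With 0 + Intercept, it becomes a regular coefficient of class "b",
# so it can be given a prior the same way as any other coefficient
brms::get_prior(qsec ~ 0 + Intercept + mpg, data = mtcars)
```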

Get Default Priors

  • Bayesian inference is the process of reallocating credibility over parameters, from priors to posteriors
  • In the correlation example, we had a prior over the correlation coefficient
  • In Bayesian regression models, we have priors over the coefficients (regression parameters)
  • How many parameters do we have in this linear model?
  • The get_prior() function returns the priors associated with a given model or a formula (i.e., the default priors made by brms)
brms::get_prior(f, data = mtcars)
                prior class      coef group resp dpar nlpar lb ub       source
               (flat)     b                                            default
               (flat)     b Intercept                             (vectorized)
               (flat)     b       mpg                             (vectorized)
 student_t(3, 0, 2.5) sigma                                  0         default

Understanding Priors Parameters

brms::get_prior(f, data = mtcars)
                prior class      coef group resp dpar nlpar lb ub       source
               (flat)     b                                            default
               (flat)     b Intercept                             (vectorized)
               (flat)     b       mpg                             (vectorized)
 student_t(3, 0, 2.5) sigma                                  0         default
  • brms has set a default prior for each parameter
  • 2 types of parameters for linear models:
    • Coefficients (b) and Distributional (sigma)
  • For coefficients b, brms uses “flat” priors for any coefficient
    • The first row just indicates this “default”, and then it shows that this default is applied to all the following coefficients (vectorized): “This is the default prior used for all parameters of class X”
    • Changing the prior in the first row would be a convenient way to apply the same prior to all coefficients
    • We can ignore the first row because we want to be specific with our priors
  • We will ignore the last row sigma (for now) to focus on the coefficients

Flat priors

  • What are “flat” priors?
    • Uniform priors
  • These are the “least informative” priors: all possible values have the same probability, i.e., we consider a slope of -1 million and a slope of +1 million equally likely
    • Using this prior gives results equivalent to Maximum Likelihood Estimation (because the prior has no influence, so only the likelihood matters)
  • It might seem like a good idea to use flat priors, but it is not
    • Not realistic: we often don’t expect the slope to be -1 million or +1 million
    • Not good for the inference algorithm: we waste energy exploring regions of the parameter space that are not realistic
    • Can lead to convergence issues
  • In general, avoid flat priors
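The “flat priors give Maximum Likelihood results” point can be illustrated with a small conjugate-normal sketch in plain R (illustrative only; the helper name and the choice of prior parameters are ours, not from brms):

```r
# Conjugate normal model with known variance: the posterior mean is a
# precision-weighted average of the prior mean and the sample mean.
# As the prior SD grows, the weight on the prior vanishes and the
# posterior mean converges to the sample mean (the MLE).
y <- mtcars$qsec
posterior_mean <- function(prior_sd, prior_mean = 0, sigma = sd(y)) {
  data_precision  <- length(y) / sigma^2
  prior_precision <- 1 / prior_sd^2
  w <- data_precision / (data_precision + prior_precision)  # weight on the data
  w * mean(y) + (1 - w) * prior_mean
}
posterior_mean(prior_sd = 1)    # shrunk towards the prior mean of 0
posterior_mean(prior_sd = 1e6)  # essentially mean(y): the prior is "flat"
mean(y)                         # the Maximum Likelihood estimate
```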

Alternative to Flat Priors

  • Instead of flat priors, we can set weakly informative priors
  • Priors that are “weakly informative” give some information to the model, but not too much
  • For example, we can use a normal distribution with a very large standard deviation
    • It helps the model to converge by avoiding unrealistic regions of the parameter space
    • At the same time, it doesn’t provide so much information to the model that it would “bias” the estimation

What priors to use? Intercept (1)

  • The Intercept represents the predicted value of the dependent variable (qsec) when the independent variable (mpg) is 0
  • We can plot our data to visualize the unit and range of our data
p <- ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) + 
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_hline(yintercept = 0, linetype="dashed") +
  coord_cartesian(xlim = c(-1, 35), ylim = c(-1, 30))
p

  • What would be a good range for the Intercept?

What priors to use? Intercept (2)

  • We can visualize regression lines with different intercepts (and a slope of 0)
    • e.g., 10? 15? 17? 20? 25?
p + geom_abline(intercept = c(10, 15, 17, 20, 25), slope = 0, 
                color=c("purple", "red", "orange", "green", "blue"), linewidth=1)

  • Which one feels the most probable? Which one the least?

What priors to use? Intercept (3)

  • We could use a normal distribution for the intercept, with its mean at the “best guess”
  • The center could be set to the average y value
    • Because if the slope = 0 (no effect), the intercept would be the average y value (cf. “constant” models)
    • mean(mtcars$qsec) = 17.85
  • What precision? In the example above, we used \(SD=4\)

What priors to use? Intercept (4)

  • Such a prior (\(Normal(17.85, 4)\)) is actually quite narrow (i.e., “strongly informative”)
    • The actual SD of the response variable is sd(mtcars$qsec) = 1.79
  • You can express more or less strong beliefs by changing the precision
  • It is good to use informative priors, but it requires thought and understanding of the parameters and the data
  • However, since we want weakly informative priors, we can use a wider distribution
    • e.g., \(Normal(17.85, 40)\)
    • Or \(Normal(M_{qsec}, 10*SD_{qsec})\) = \(Normal(17.85, 17.9)\)
  • It can be convenient to express the prior intercept as a function of the outcome variable’s mean and SD
    • But the details don’t actually have a huge impact (e.g., taking 10 SD vs. 12 SD)
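This rule of thumb can be computed directly from the data (plain R; nothing brms-specific):

```r
# Centre the intercept prior on the outcome mean, with a deliberately
# wide SD (here 10 times the outcome SD), as suggested above
prior_mean <- mean(mtcars$qsec)     # ~17.85
prior_sd   <- 10 * sd(mtcars$qsec)  # ~17.87
round(c(mean = prior_mean, sd = prior_sd), 2)
```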

Priors for Coefficients (1)

  • In psychology, we are often interested in testing effects
    • Coefficients (slopes) often represent some effects of interest
    • We are often interested in assessing whether a coefficient is different from 0
  • Thus, in order to avoid “biasing” the effects in one or the other direction, we often use a symmetric distribution centered at 0
    • This means that we are not assuming that the effect is positive or negative
    • In other words, this prior is agnostic (doesn’t provide information) about the direction of the effect, only its absolute size, penalizing very large effects
    • We consider that the most likely value is 0, but we equally allow for the possibility of positive or negative effects
    • This kind of prior centered at 0 is conservative (more or less, depending on its width): it pushes the effects towards 0
  • That’s one of the reasons why the idea that we can “cheat” with priors to make things significant is not true
    • Using wide symmetric priors centered at 0 is a common default choice and, if anything, is more conservative than Frequentist estimation

Priors for Coefficients (2)

  • Assuming the intercept is the mean of qsec (17.85), what do various slopes look like?
    • e.g., -1, -0.5, 0, 0.5, 1

  • Values of \(|1|\) are quite extreme
  • We can pick a distribution that does not give very high plausibility to values \(> |1|\)

Priors for Coefficients (3)

  • E.g., \(Normal(0, 0.5)\)

Priors for Coefficients (4)

  • Because we want to provide even weaker information, we are going to pick \(Normal(0, 5)\)
  • What does this prior say?
    • Any slope between -3 SD and +3 SD (i.e., between -15 and +15) is possible
    • A slope of 1,000,000 is very unlikely
  • Note that other priors could have been suited
  • For instance, if the data were particularly noisy or prone to outliers (extreme values) that could make the slope “wobbly”, we could use a t-distribution
    • Red: \(Normal(\mu=0, \sigma=0.5)\)
    • Purple: \(Normal(\mu=0, \sigma=5)\)
    • Blue: \(t(df=2, \mu=0, \sigma=5)\)
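One way to compare these three candidate priors numerically is the prior mass they place beyond \(|slope| = 1\), computed in plain R (using the location-scale form of the t-distribution, i.e., dividing by the scale):

```r
# Prior probability that |slope| > 1 under each candidate prior
2 * pnorm(-1, mean = 0, sd = 0.5)  # Normal(0, 0.5): ~0.046
2 * pnorm(-1, mean = 0, sd = 5)    # Normal(0, 5):   ~0.84
2 * pt(-1 / 5, df = 2)             # t(df=2, 0, 5):  ~0.86
```

The narrow \(Normal(0, 0.5)\) treats slopes beyond \(|1|\) as rare, while the two wide priors barely constrain them.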

Chill out, it’s just a prior

  • The choice of priors might seem like a big deal, tedious, daunting and scary
  • But it’s not:
    • If you understand your model, you can easily and quickly make reasonable choices
    • It’s forgiving: even if you put a “bad” prior (e.g., too informative), if the effect is there the data will overwhelm the prior (unless it’s mis-specified, e.g., putting a \(Beta\) prior on a negative relationship)
    • In many cases, the priors don’t matter as much
    • In many cases, even using “flat” priors is fine and won’t make any difference
  • But it is useful to think about priors
    • Especially when you can have strong prior knowledge
    • If you want to be particularly conservative (“error control”)
    • If you have noisy or little data
    • If you care about efficiency (slow models that have trouble converging)

Set Priors in brms

  • Now that we know what priors we want to set:
    • Intercept: \(Normal(17.85, 17.9)\)
    • Slope: \(Normal(0, 5)\)
  • How to actually set them?
brms::get_prior(f, data=mtcars)
                prior class      coef group resp dpar nlpar lb ub       source
               (flat)     b                                            default
               (flat)     b Intercept                             (vectorized)
               (flat)     b       mpg                             (vectorized)
 student_t(3, 0, 2.5) sigma                                  0         default
  • These are the default priors

Using set_prior()

  • We create new priors using set_prior(), but we need to identify them using class and coef
brms::set_prior("normal(17.85, 17.9)", class = "b", coef = "Intercept")
b_Intercept ~ normal(17.85, 17.9)

Validate Priors in brms

  • A collection of priors is created using the c() function
  • This collection of priors is then “validated” using validate_prior()
    • Check whether all the specified priors exist in the list of priors
    • Whether the replacement of default priors is successful
    • Create a new list of priors (default + replaced ones) that we can provide to the model
priors <- c(
  brms::set_prior("normal(17.85, 17.9)", class = "b", coef = "Intercept"),
  brms::set_prior("normal(0, 5)", class = "b", coef = "mpg")
) |> 
  brms::validate_prior(f, data=mtcars)

priors
                prior class      coef group resp dpar nlpar lb ub  source
               (flat)     b                                       default
  normal(17.85, 17.9)     b Intercept                                user
         normal(0, 5)     b       mpg                                user
 student_t(3, 0, 2.5) sigma                                  0    default
  • Our new object named priors is a collection of priors that we can provide to the model

Check Priors

  • Are these priors reasonable? i.e., not too narrow, not too wide?
  • In the case of a simple linear model, where we have a good idea of the range of the coefficients, we can gauge that just from the prior parameters themselves
  • In more complex models, it can become complicated
    • Because parameters don’t have an intuitive meaning or are not directly interpretable
    • Because parameters interact with one another
  • One way to check if priors are appropriate is to do a “Prior Predictive Check”
    • This means that we will generate model predictions using the priors only
    • We can then see if these predictions are reasonable

Make a model with priors only

  • We can fit a model in brms using the brm function
    • Enter the formula, the data, and priors
    • If the prior argument is not specified, it will use the default that we got from get_prior()
  • Here we want to only use the priors (as if we didn’t see the data)
    • Using the argument: sample_prior="only"
  • The model is first compiled (into Stan and C++ code), then sampled from
model_priors <- brms::brm(
  formula=f,
  data = mtcars,
  prior = priors,
  sample_prior="only"
)

Predicting Relation

p <- ggplot(mtcars, aes(x=mpg, y=qsec)) +
  geom_point(size=3) 
p

  • In order to add predicted lines, we need to make “predictions” at 2 or more values of the predictor mpg (because our model estimated straight lines, and straight lines can be defined by 2 points)
  • Predictions can be obtained using estimate_relation()
  • This gives us predicted values for an arbitrary spread of predictors
pred <- modelbased::estimate_relation(model_priors, keep_iterations = 250)
ncol(pred)
[1] 255
pred[, 1:8]  # Only show the 8 first columns
mpg   | Predicted |     SE |                CI | iter_1 | iter_2 | iter_3
-------------------------------------------------------------------------
10.40 |     17.43 |  54.20 | [ -87.90, 123.63] |  18.23 |   8.48 |   3.46
13.01 |     17.25 |  66.60 | [-113.16, 146.49] |  22.20 |  10.42 |  -3.44
15.62 |     17.07 |  79.19 | [-137.54, 170.90] |  26.18 |  12.36 | -10.34
18.23 |     16.89 |  91.89 | [-163.09, 193.62] |  30.15 |  14.30 | -17.25
20.84 |     16.72 | 104.66 | [-188.72, 218.13] |  34.12 |  16.24 | -24.15
23.46 |     16.54 | 117.47 | [-213.30, 242.21] |  38.09 |  18.18 | -31.05
26.07 |     16.36 | 130.32 | [-238.32, 268.24] |  42.06 |  20.12 | -37.96
28.68 |     16.18 | 143.18 | [-263.62, 294.80] |  46.04 |  22.06 | -44.86
31.29 |     16.00 | 156.07 | [-289.37, 321.17] |  50.01 |  24.00 | -51.76
33.90 |     15.82 | 168.97 | [-314.33, 346.22] |  53.98 |  25.94 | -58.66
p + 
 geom_line(data=pred, aes(y=iter_1), color="red") + 
 geom_line(data=pred, aes(y=iter_2), color="red") + 
 geom_line(data=pred, aes(y=iter_3), color="red") +
 geom_line(data=pred, aes(y=iter_4), color="red") +
 geom_line(data=pred, aes(y=iter_5), color="red")

Iterations

  • Bayesian sampling is done in iterations
  • Each iteration represents one possibility of model parameters
  • More plausible possibilities will appear more frequently
  • In our case, this means that we have one line (Intercept + Slope) for each iteration, which yields different prediction lines
  • We can visualize these various lines, which correspond to possibilities emerging from our prior parameters (i.e., before seeing the data)
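A minimal hand-rolled version of this idea in plain R (the seed, number of draws, and mpg grid are arbitrary choices for illustration, not anything brms computes):

```r
# Draw (intercept, slope) pairs from the priors chosen earlier and
# compute one prediction line per draw
set.seed(42)
n_iter     <- 250
intercepts <- rnorm(n_iter, mean = 17.85, sd = 17.9)
slopes     <- rnorm(n_iter, mean = 0, sd = 5)

mpg_grid <- seq(10, 34, length.out = 10)
# one row per iteration, one column per grid value of mpg
lines <- sapply(mpg_grid, function(x) intercepts + slopes * x)
dim(lines)  # 250 draws x 10 grid points
```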

Prior Predictive Check - Relationship

  • We can reshape the dataframe (pivot into longer format) to have one column for all iterations and another to identify the iteration number
pred <- reshape_iterations(pred)
head(pred)
mpg   | Predicted |     SE |                CI | iter_index | iter_group | iter_value
-------------------------------------------------------------------------------------
10.40 |     17.43 |  54.20 | [ -87.90, 123.63] |          1 |          1 |      18.23
13.01 |     17.25 |  66.60 | [-113.16, 146.49] |          2 |          1 |      22.20
15.62 |     17.07 |  79.19 | [-137.54, 170.90] |          3 |          1 |      26.18
18.23 |     16.89 |  91.89 | [-163.09, 193.62] |          4 |          1 |      30.15
20.84 |     16.72 | 104.66 | [-188.72, 218.13] |          5 |          1 |      34.12
23.46 |     16.54 | 117.47 | [-213.30, 242.21] |          6 |          1 |      38.09
  • We can visualize the relationship between mpg and qsec for each iteration
p + 
  geom_line(data=pred, aes(y=iter_value, group=iter_group), alpha=0.2) +
  coord_cartesian(ylim=c(-400, 400))

  • What can we say?
  • This is rather good
    • As we can see, the priors don’t predict the exact relationship (the priors don’t have knowledge about what we want to conclude)
    • They don’t predict crazy numbers either (e.g., a billion)

Prior Predictive Check - Distribution

  • We can also visualize the distribution of the predictions to see whether it predicts the outcome in a reasonable range of values
pred <- modelbased::estimate_prediction(model_priors, keep_iterations=250) |>
  reshape_iterations()

head(pred)
mpg   | Predicted |     SE |                CI | Residuals | iter_index
-----------------------------------------------------------------------
21.00 |     16.68 | 105.59 | [-191.15, 219.97] |     -0.22 |          1
21.00 |     16.69 | 105.49 | [-192.64, 220.60] |      0.33 |          2
22.80 |     16.54 | 114.20 | [-209.39, 237.56] |      2.07 |          3
21.40 |     16.52 | 107.54 | [-195.98, 223.62] |      2.92 |          4
18.70 |     16.78 |  94.26 | [-169.60, 197.82] |      0.24 |          5
18.10 |     16.88 |  91.34 | [-163.94, 192.07] |      3.34 |          6

mpg   | iter_group | iter_value
-------------------------------
21.00 |          1 |      33.89
21.00 |          1 |      35.16
22.80 |          1 |      37.04
21.40 |          1 |      35.81
18.70 |          1 |      30.94
18.10 |          1 |      30.49
  • The difference between estimate_relation() and estimate_prediction() is that the former gives the expected value of the outcome variable (i.e., the regression line) at arbitrary values of the predictor (useful for visualizing relationships), while the latter gives the predicted values of the outcome variable for each point in the dataset (useful for visualizing the residuals - the difference between observed and predicted - or the predicted distributions)
ggplot(mtcars, aes(x=qsec)) +
  geom_density(fill="skyblue") +
  geom_density(data=pred, aes(x=iter_value, group=iter_group), color="orange", linewidth=0.5) +
  coord_cartesian(xlim=c(-100, 100), ylim=c(0, 0.3)) + 
  theme_minimal()

Exercise

  • Set very narrow priors for the Intercept and mpg parameters that are centered around the values that we know from the Frequentist model
  • Run prior predictive checks. Try to obtain a plot that looks like this:

The Art of Priors

Prior choice can depend on:

  • Mathematical constraints
    • “sigma is strictly positive”
    • “correlations are between -1 and 1”
  • Theoretical knowledge
    • Expert knowledge (“this is a plausible range”)
    • Model knowledge (e.g., Intercept to the mean of the outcome)
  • Empirical knowledge
    • From previous studies
    • From descriptive stats
  • Types of Priors (different terms, no established taxonomy)
    • Weakly informative, mildly informative, strongly informative
    • Uninformative, regularizing, principled, informative
    • Skeptical, agnostic, believer

The Practical Side

  • In general, don’t panic and follow the rule of “good enough”
    • If you have reasons to care about priors (e.g., complex models, convergence issues, having ideas about priors), then it’s worth spending time to think about them
    • If not, going with the default / flat / very weakly informative priors is fine
    • But don’t forget to assess the convergence of the model (e.g., MCMC sampling process) to make sure the posterior has been well explored…

Homework

  • Watch McElreath’s lecture on MCMC
    • https://www.youtube.com/watch?v=rZk2FqX2XnY&t=788s&ab_channel=RichardMcElreath

The End (for now)

Thank you!