Bayesian Statistics

8. Linear Models (Priors)

Dominique Makowski
D.Makowski@sussex.ac.uk

Goal

  • Last week, we were interested in the correlation between mpg and qsec in the mtcars dataset
BayesFactor::correlationBF(mtcars$mpg, mtcars$qsec)
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 4.416463 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) +
  stat_ellipse(color="blue", linewidth=1)

  • We learned how to:
    • Use Bayes Factors in simple tests (e.g., t-tests and correlations)
    • Run Bayesian Correlations
    • Sample from the posterior using MCMC
    • Describe and report the posterior distribution
  • This week, we are interested in the relationship itself (i.e., not just the strength of association, but the actual characteristics of the regression line) between these two variables

Exercise

  • Fit a Frequentist linear model to predict qsec from mpg. Write down its parameters.
model <- lm(qsec ~ mpg, data = mtcars) 
parameters::parameters(model)
Parameter   | Coefficient |   SE |         95% CI | t(30) |      p
------------------------------------------------------------------
(Intercept) |       15.35 | 1.03 | [13.25, 17.46] | 14.91 | < .001
mpg         |        0.12 | 0.05 | [ 0.02,  0.22] |  2.53 | 0.017 
  • Visualize the relationship
ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) +
  geom_smooth(method="lm", color="red")

modelbased::estimate_relation(model) |> 
  plot(show_data = TRUE)

Bayesian Models: Quick n’ Dirty

Installing brms

  • From now until the end of the module, we will use the brms package to fit Bayesian models
  • We took some time to do the installation in the 2nd module, but if you missed it, let me know during the break
  • It is a package that is complicated to install. Why?
  • Because it needs to install another programming language…

Stan, JAGS, and BUGS

  • As mentioned, performing Bayesian inference is complex
  • Modern full Bayesian inference (using MCMC) is known to be resource-intensive and slow; it is currently performed fastest and most flexibly by dedicated languages, aka probabilistic programming languages
    • Stan (the most popular one): Named in honour of Stanislaw Ulam, pioneer of the Monte Carlo method
    • JAGS: “Just Another Gibbs Sampler”
    • BUGS: “Bayesian Inference Using Gibbs Sampling”
    • PyMC (Python)
    • Turing (Julia)
  • These languages are designed to perform Bayesian inference, and are optimized for that purpose
  • They are not general-purpose programming languages
  • What a hassle to learn a new language just to do Bayesian inference!

About brms

  • There are “wrappers” around these languages that allow us to use them from other general purpose languages like R or Python
  • In R:
    • brms = Bayesian Regression Models using ‘Stan’
    • rstanarm = R interface to Stan for Applied Regression Modeling
  • rstanarm
    • Has different functions for different models (e.g., stan_lm(), stan_glm(), stan_lmer(), stan_glmer(), stan_gamm4() etc.)
    • Each model is implemented with precompiled Stan code
    • It is slightly easier to install, but less flexible
  • brms
    • Has become the most popular
    • The main difference is that brms has only one main function, brm(), which can fit almost any model
    • These models are translated into Stan code, and then compiled
    • There is compilation overhead when fitting new models, but it is more flexible

Doing a Bayesian Regression

  • Frequentist model
model <- lm(qsec ~ mpg, data = mtcars) 
parameters::parameters(model)
Parameter   | Coefficient |   SE |         95% CI | t(30) |      p
------------------------------------------------------------------
(Intercept) |       15.35 | 1.03 | [13.25, 17.46] | 14.91 | < .001
mpg         |        0.12 | 0.05 | [ 0.02,  0.22] |  2.53 | 0.017 
  • Bayesian model
model <- brms::brm(qsec ~ mpg, data = mtcars, refresh=0) 
Running MCMC with 4 sequential chains...

Chain 1 finished in 0.0 seconds.
Chain 2 finished in 0.0 seconds.
Chain 3 finished in 0.0 seconds.
Chain 4 finished in 0.0 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.0 seconds.
Total execution time: 0.6 seconds.
parameters::parameters(model)
# Fixed Effects

Parameter   | Median |         95% CI |     pd |  Rhat |  ESS
-------------------------------------------------------------
(Intercept) |  15.37 | [13.20, 17.39] |   100% | 1.001 | 3707
mpg         |   0.12 | [ 0.02,  0.23] | 99.28% | 1.000 | 3670

# sigma Parameters

Parameter | Median |       95% CI |   pd |  Rhat |  ESS
-------------------------------------------------------
sigma     |   1.68 | [1.33, 2.23] | 100% | 1.002 | 3308
  • refresh=0 silences the printing of the sampling (e.g., to avoid cluttering documents in quarto)
  • R-hat: Convergence diagnostic index. Should be less than 1.05.
  • ESS: Effective Sample Size: how many “good iterations” (truly independent draws) we have. Should be > 400 (the maximum being the number of iterations that we draw).
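As a sketch (assuming brms is installed and `model` is the brmsfit from above), these diagnostics can also be extracted programmatically; `rhat()` and `neff_ratio()` are generics re-exported by brms:

```r
# Assuming `model` is the brmsfit fitted above
rhats <- brms::rhat(model)        # one R-hat value per parameter
all(rhats < 1.05, na.rm = TRUE)   # TRUE if convergence looks fine

brms::neff_ratio(model)           # effective / total sample size per parameter
```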

Voilà!

  • You’ve fitted and reported a Bayesian model.
  • The results are very consistent with the Frequentist version.
  • This is a valid Bayesian analysis…
  • But because we are on our way to becoming Bayesian Masters, we can spend more time thinking about certain aspects of this model.

Bayesian Models: Priors

brms formula

  • The brm() function’s first argument is a formula, like in lm()
  • Because formulas can become complex, it is useful to specify the formula in a separate variable
f <- brms::brmsformula(qsec ~ mpg)
  • We could then run brms::brm(f, data=mtcars) to fit the Bayesian linear model

Regression formulas: the intercept

  • In most model functions in R, the intercept is included by default:
    • lm(qsec ~ mpg) is equivalent to lm(qsec ~ 1 + mpg)
coef(lm(qsec ~ mpg, data=mtcars))
(Intercept)         mpg 
 15.3547689   0.1241366 
coef(lm(qsec ~ 1 + mpg, data=mtcars))
(Intercept)         mpg 
 15.3547689   0.1241366 

No-intercept models

  • One can remove the intercept by using lm(qsec ~ 0 + mpg)
coef(lm(qsec ~ 0 + mpg, data=mtcars))
     mpg 
0.827125 
  • It forces the regression line to go through the origin (0, 0)
newdata <- data.frame(mpg = seq(-5, 35, 5))

lm(qsec ~ 0 + mpg, data=mtcars) |> 
  insight::get_predicted(data=newdata, ci=0.95) |> 
  as.data.frame() |> 
  cbind(newdata) |> 
  ggplot(aes(x = mpg, y = Predicted)) +
  geom_ribbon(aes(ymin=CI_low, ymax=CI_high), fill="red", alpha=0.2) +
  geom_line(color="red") +
  geom_point(data=mtcars, aes(x = mpg, y = qsec), size=3) 

brms formula exception: the intercept

  • In brms, there is a twist. For reasons that we will clarify later, it is often good to explicitly include the intercept as a “variable”
    • y ~ 0 + Intercept is equivalent to y ~ 1 (= “remove the automatic intercept, but add a custom intercept named ‘Intercept’”)
    • This treats the intercept as a normal coefficient rather than a special term, which makes prior specification easier
  • Instead of the previous formula, we will use:
f <- brms::brmsformula(qsec ~ 0 + Intercept + mpg)
  • The two formulas result in the same model, but explicitly including the intercept makes it easier to specify priors
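To see the difference the explicit intercept makes, we can compare the default prior structure of the two formulas (a sketch, assuming brms is installed):

```r
# With the standard formula, the intercept gets its own special class "Intercept"
brms::get_prior(qsec ~ mpg, data = mtcars)

# With 0 + Intercept, it becomes a regular coefficient of class "b",
# so it can be given a prior the same way as any other coefficient
brms::get_prior(qsec ~ 0 + Intercept + mpg, data = mtcars)
```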

Get Default Priors

  • Bayesian inference is the process of reallocating credibility over parameters, from priors to posteriors
  • In the correlation example, we had a prior over the correlation coefficient
  • In Bayesian regression models, we have priors over the coefficients (regression parameters)
  • How many parameters do we have in this linear model?
  • The get_prior() function returns the priors associated with a given model or a formula (i.e., the default priors made by brms)
brms::get_prior(f, data = mtcars)
                prior class      coef group resp dpar nlpar lb ub       source
               (flat)     b                                            default
               (flat)     b Intercept                             (vectorized)
               (flat)     b       mpg                             (vectorized)
 student_t(3, 0, 2.5) sigma                                  0         default

Understanding Priors Parameters

brms::get_prior(f, data = mtcars)
                prior class      coef group resp dpar nlpar lb ub       source
               (flat)     b                                            default
               (flat)     b Intercept                             (vectorized)
               (flat)     b       mpg                             (vectorized)
 student_t(3, 0, 2.5) sigma                                  0         default
  • brms has set a default prior for each parameter
  • 2 types of parameters for linear models:
    • Coefficients (b) and Distributional (sigma)
  • For coefficients b, brms uses “flat” priors for any coefficient
    • The first row just indicates this “default”, and then it shows that this default is applied to all the following coefficients (vectorized): “This is the default prior used for all parameters of class X”
    • Changing the prior in the first row would be a convenient way to apply the same prior to all coefficients
    • We can ignore the first row because we want to be specific with our priors
  • We will ignore the last row sigma (for now) to focus on the coefficients

Flat priors

  • What are “flat” priors?
    • Uniform priors
  • These are the “least informative” priors: all possible values have the same probability, i.e., we consider a slope of -1 million and a slope of +1 million equally likely
    • Using this prior gives results equivalent to Maximum Likelihood Estimation (because the prior has no influence, so only the likelihood matters)
  • It might seem like a good idea to use flat priors, but it is not
    • Not realistic: we often don’t expect the slope to be -1 million or +1 million
    • Not good for the inference algorithm: we waste energy exploring regions of the parameter space that are not realistic
    • Can lead to convergence issues
  • In general, avoid flat priors
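The “flat priors give Maximum Likelihood results” point can be illustrated with a small conjugate-normal sketch in plain R (illustrative only; the helper name and the choice of prior parameters are ours, not from brms):

```r
# Conjugate normal model with known variance: the posterior mean is a
# precision-weighted average of the prior mean and the sample mean.
# As the prior SD grows, the weight on the prior vanishes and the
# posterior mean converges to the sample mean (the MLE).
y <- mtcars$qsec
posterior_mean <- function(prior_sd, prior_mean = 0, sigma = sd(y)) {
  data_precision  <- length(y) / sigma^2
  prior_precision <- 1 / prior_sd^2
  w <- data_precision / (data_precision + prior_precision)  # weight on the data
  w * mean(y) + (1 - w) * prior_mean
}
posterior_mean(prior_sd = 1)    # shrunk towards the prior mean of 0
posterior_mean(prior_sd = 1e6)  # essentially mean(y): the prior is "flat"
mean(y)                         # the Maximum Likelihood estimate
```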

Alternative to Flat Priors

  • Instead of flat priors, we can set weakly informative priors
  • Priors that are “weakly informative” give some information to the model, but not too much
  • For example, we can use a normal distribution with a very large standard deviation
    • It helps the model to converge by avoiding unrealistic regions of the parameter space
    • At the same time, it doesn’t provide so much information to the model that it would “bias” the estimation

What priors to use? Intercept (1)

  • The Intercept represents the predicted value of the dependent variable (qsec) when the independent variable (mpg) is 0
  • We can plot our data to visualize the unit and range of our data
p <- ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) + 
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_hline(yintercept = 0, linetype="dashed") +
  coord_cartesian(xlim = c(-1, 35), ylim = c(-1, 30))
p

  • What would be a good range for the Intercept?

What priors to use? Intercept (2)

  • We can visualize regression lines with different intercepts (and a slope of 0)
    • e.g., 10? 15? 17? 20? 25?
p + geom_abline(intercept = c(10, 15, 17, 20, 25), slope = 0, 
                color=c("purple", "red", "orange", "green", "blue"), linewidth=1)

  • Which one feels the most probable? Which one the least?

What priors to use? Intercept (3)

  • We could use a normal distribution for the intercept, with its mean at the “best guess”
  • The center could be set to the average y value
    • Because if the slope = 0 (no effect), the intercept would be the average y value (cf. “constant” models)
    • mean(mtcars$qsec) = 17.85
  • What precision? In the example above, we used \(SD=4\)

What priors to use? Intercept (4)

  • Such a prior (\(Normal(17.85, 4)\)) is actually quite narrow (i.e., “strongly informative”)
    • The actual SD of the response variable is sd(mtcars$qsec) = 1.79
  • You can express more or less strong beliefs by changing the precision
  • It is good to use informative priors, but it requires thought and understanding of the parameters and the data
  • However, since we want weakly informative priors, we can use a wider distribution
    • e.g., \(Normal(17.85, 40)\)
    • Or \(Normal(M_{qsec}, 10*SD_{qsec})\) = \(Normal(17.85, 17.9)\)
  • It can be convenient to express the prior intercept as a function of the outcome variable’s mean and SD
    • But the details don’t actually have a huge impact (e.g., taking 10 SD vs. 12 SD)
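This rule of thumb can be computed directly from the data (plain R; nothing brms-specific):

```r
# Centre the intercept prior on the outcome mean, with a deliberately
# wide SD (here 10 times the outcome SD), as suggested above
prior_mean <- mean(mtcars$qsec)     # ~17.85
prior_sd   <- 10 * sd(mtcars$qsec)  # ~17.87
round(c(mean = prior_mean, sd = prior_sd), 2)
```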

Priors for Coefficients (1)

  • In psychology, we are often interested in testing effects
    • Coefficients (slopes) often represent some effects of interest
    • We are often interested in assessing whether a coefficient is different from 0
  • Thus, in order to avoid “biasing” the effects in one or the other direction, we often use a symmetric distribution centered at 0
    • This means that we are not assuming that the effect is positive or negative
    • In other words, this prior is agnostic (doesn’t provide information) about the direction of the effect, only its absolute size, penalizing very large effects
    • We consider that the most likely value is 0, but we equally allow for the possibility of positive or negative effects
    • This kind of prior centered at 0 is conservative (more or less, depending on its width): it pushes the effects towards 0
  • That’s one of the reasons why the idea that we can “cheat” with priors to make things significant is not true
    • Using wide symmetric priors centered at 0 is a common default choice and, if anything, is more conservative than Frequentist estimation

Priors for Coefficients (2)

  • Assuming the intercept is the mean of qsec (17.85), what do various slopes look like?
    • e.g., -1, -0.5, 0, 0.5, 1

  • Values of \(|1|\) are quite extreme
  • We can pick a distribution that does not give very high plausibility to values \(> |1|\)

Priors for Coefficients (3)

  • E.g., \(Normal(0, 0.5)\)

Priors for Coefficients (4)

  • Because we want to provide even weaker information, we are going to pick \(Normal(0, 5)\)
  • What does this prior say?
    • Any slope between -3 SD and +3 SD (i.e., between -15 and +15) is possible
    • A slope of 1,000,000 is very unlikely
  • Note that other priors could have been suited
  • For instance, if the data were particularly noisy or prone to outliers (extreme values) that could make the slope “wobbly”, we could use a t-distribution
    • Red: \(Normal(\mu=0, \sigma=0.5)\)
    • Purple: \(Normal(\mu=0, \sigma=5)\)
    • Blue: \(t(df=2, \mu=0, \sigma=5)\)
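One way to compare these three candidate priors numerically is the prior mass they place beyond \(|slope| = 1\), computed in plain R (using the location-scale form of the t-distribution, i.e., dividing by the scale):

```r
# Prior probability that |slope| > 1 under each candidate prior
2 * pnorm(-1, mean = 0, sd = 0.5)  # Normal(0, 0.5): ~0.046
2 * pnorm(-1, mean = 0, sd = 5)    # Normal(0, 5):   ~0.84
2 * pt(-1 / 5, df = 2)             # t(df=2, 0, 5):  ~0.86
```

The narrow \(Normal(0, 0.5)\) treats slopes beyond \(|1|\) as rare, while the two wide priors barely constrain them.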

Chill out, it’s just a prior

  • The choice of priors might seem like a big deal, tedious, daunting and scary
  • But it’s not:
    • If you understand your model, you can easily and quickly make reasonable choices
    • It’s forgiving: even if you put a “bad” prior (e.g., too informative), if the effect is there the data will overwhelm the prior (unless it’s mis-specified, e.g., putting a \(Beta\) prior on a negative relationship)
    • In many cases, the priors don’t matter as much
    • In many cases, even using “flat” priors is fine and won’t make any difference
  • But it is useful to think about priors
    • Especially when you can have strong prior knowledge
    • If you want to be particularly conservative (“error control”)
    • If you have noisy or little data
    • If you care about efficiency (slow models that have trouble converging)

Set Priors in brms

  • Now that we know what priors we want to set:
    • Intercept: \(Normal(17.85, 17.9)\)
    • Slope: \(Normal(0, 5)\)
  • How to actually set them?
brms::get_prior(f, data=mtcars)
                prior class      coef group resp dpar nlpar lb ub       source
               (flat)     b                                            default
               (flat)     b Intercept                             (vectorized)
               (flat)     b       mpg                             (vectorized)
 student_t(3, 0, 2.5) sigma                                  0         default
  • These are the default priors

Using set_prior()

  • We create new priors using set_prior(), but we need to identify them using class and coef
brms::set_prior("normal(17.85, 17.9)", class = "b", coef = "Intercept")
b_Intercept ~ normal(17.85, 17.9)

Validate Priors in brms

  • A collection of priors is created using the c() function
  • This collection of priors is then “validated” using validate_prior()
    • Check whether all the specified priors exist in the list of priors
    • Whether the replacement of default priors is successful
    • Create a new list of priors (default + replaced ones) that we can provide to the model
priors <- c(
  brms::set_prior("normal(17.85, 17.9)", class = "b", coef = "Intercept"),
  brms::set_prior("normal(0, 5)", class = "b", coef = "mpg")
) |> 
  brms::validate_prior(f, data=mtcars)

priors
                prior class      coef group resp dpar nlpar lb ub  source
               (flat)     b                                       default
  normal(17.85, 17.9)     b Intercept                                user
         normal(0, 5)     b       mpg                                user
 student_t(3, 0, 2.5) sigma                                  0    default
  • Our new object named priors is a collection of priors that we can provide to the model

Check Priors

  • Are these priors reasonable? i.e., not too narrow, not too wide?
  • In the case of a simple linear model, where we have a good idea of the range of the coefficients, we can gauge that just from the prior parameters themselves
  • In more complex models, it can become complicated
    • Because parameters don’t have an intuitive meaning or are not directly interpretable
    • Because parameters interact with one another
  • One way to check if priors are appropriate is to do a “Prior Predictive Check”
    • This means that we will generate model predictions using the priors only
    • We can then see if these predictions are reasonable

Make a model with priors only

  • We can fit a model in brms using the brm function
    • Enter the formula, the data, and priors
    • If the prior argument is not specified, it will use the default that we got from get_prior()
  • Here we want to only use the priors (as if we didn’t see the data)
    • Using the argument: sample_prior="only"
  • The model is first compiled (into Stan and C++ code), then sampled from
model_priors <- brms::brm(
  formula=f,
  data = mtcars,
  prior = priors,
  sample_prior="only"
)

Predicting Relation

p <- ggplot(mtcars, aes(x=mpg, y=qsec)) +
  geom_point(size=3) 
p

  • In order to add predicted lines, we need to make “predictions” at 2 or more values of the predictor mpg (because our model estimated straight lines, and straight lines can be defined by 2 points)
  • Predictions can be obtained using estimate_relation()
  • This gives us predicted values for an arbitrary spread of predictors
pred <- modelbased::estimate_relation(model_priors, keep_iterations = 250)
ncol(pred)
[1] 255
pred[, 1:8]  # Only show the 8 first columns
mpg   | Predicted |     SE |                CI | iter_1 | iter_2 | iter_3
-------------------------------------------------------------------------
10.40 |     17.43 |  54.20 | [ -87.90, 123.63] |  18.23 |   8.48 |   3.46
13.01 |     17.25 |  66.60 | [-113.16, 146.49] |  22.20 |  10.42 |  -3.44
15.62 |     17.07 |  79.19 | [-137.54, 170.90] |  26.18 |  12.36 | -10.34
18.23 |     16.89 |  91.89 | [-163.09, 193.62] |  30.15 |  14.30 | -17.25
20.84 |     16.72 | 104.66 | [-188.72, 218.13] |  34.12 |  16.24 | -24.15
23.46 |     16.54 | 117.47 | [-213.30, 242.21] |  38.09 |  18.18 | -31.05
26.07 |     16.36 | 130.32 | [-238.32, 268.24] |  42.06 |  20.12 | -37.96
28.68 |     16.18 | 143.18 | [-263.62, 294.80] |  46.04 |  22.06 | -44.86
31.29 |     16.00 | 156.07 | [-289.37, 321.17] |  50.01 |  24.00 | -51.76
33.90 |     15.82 | 168.97 | [-314.33, 346.22] |  53.98 |  25.94 | -58.66
p + 
 geom_line(data=pred, aes(y=iter_1), color="red") + 
 geom_line(data=pred, aes(y=iter_2), color="red") + 
 geom_line(data=pred, aes(y=iter_3), color="red") +
 geom_line(data=pred, aes(y=iter_4), color="red") +
 geom_line(data=pred, aes(y=iter_5), color="red")

Iterations

  • Bayesian sampling is done in iterations
  • Each iteration represents one possibility of model parameters
  • More plausible possibilities will appear more frequently
  • In our case, this means that we have one line (Intercept + Slope) for each iteration, which yields different prediction lines
  • We can visualize these various lines, which correspond to possibilities emerging from our prior parameters (i.e., before seeing the data)
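A minimal hand-rolled version of this idea in plain R (the seed, number of draws, and mpg grid are arbitrary choices for illustration, not anything brms computes):

```r
# Draw (intercept, slope) pairs from the priors chosen earlier and
# compute one prediction line per draw
set.seed(42)
n_iter     <- 250
intercepts <- rnorm(n_iter, mean = 17.85, sd = 17.9)
slopes     <- rnorm(n_iter, mean = 0, sd = 5)

mpg_grid <- seq(10, 34, length.out = 10)
# one row per iteration, one column per grid value of mpg
lines <- sapply(mpg_grid, function(x) intercepts + slopes * x)
dim(lines)  # 250 draws x 10 grid points
```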

Prior Predictive Check - Relationship

  • We can reshape the dataframe (pivot into longer format) to have one column for all iterations and another to identify the iteration number
pred <- reshape_iterations(pred)
head(pred)
mpg   | Predicted |     SE |                CI | iter_index | iter_group | iter_value
-------------------------------------------------------------------------------------
10.40 |     17.43 |  54.20 | [ -87.90, 123.63] |          1 |          1 |      18.23
13.01 |     17.25 |  66.60 | [-113.16, 146.49] |          2 |          1 |      22.20
15.62 |     17.07 |  79.19 | [-137.54, 170.90] |          3 |          1 |      26.18
18.23 |     16.89 |  91.89 | [-163.09, 193.62] |          4 |          1 |      30.15
20.84 |     16.72 | 104.66 | [-188.72, 218.13] |          5 |          1 |      34.12
23.46 |     16.54 | 117.47 | [-213.30, 242.21] |          6 |          1 |      38.09
  • We can visualize the relationship between mpg and qsec for each iteration
p + 
  geom_line(data=pred, aes(y=iter_value, group=iter_group), alpha=0.2) +
  coord_cartesian(ylim=c(-400, 400))

  • What can we say?
  • This is rather good
    • As we can see, the priors don’t predict the exact relationship (the priors don’t have knowledge about what we want to conclude)
    • They don’t predict crazy numbers either (e.g., a billion)

Prior Predictive Check - Distribution

  • We can also visualize the distribution of the predictions to see whether it predicts the outcome in a reasonable range of values
pred <- modelbased::estimate_prediction(model_priors, keep_iterations=250) |>
  reshape_iterations()

head(pred)
mpg   | Predicted |     SE |                CI | Residuals | iter_index
-----------------------------------------------------------------------
21.00 |     16.68 | 105.59 | [-191.15, 219.97] |     -0.22 |          1
21.00 |     16.69 | 105.49 | [-192.64, 220.60] |      0.33 |          2
22.80 |     16.54 | 114.20 | [-209.39, 237.56] |      2.07 |          3
21.40 |     16.52 | 107.54 | [-195.98, 223.62] |      2.92 |          4
18.70 |     16.78 |  94.26 | [-169.60, 197.82] |      0.24 |          5
18.10 |     16.88 |  91.34 | [-163.94, 192.07] |      3.34 |          6

mpg   | iter_group | iter_value
-------------------------------
21.00 |          1 |      33.89
21.00 |          1 |      35.16
22.80 |          1 |      37.04
21.40 |          1 |      35.81
18.70 |          1 |      30.94
18.10 |          1 |      30.49
  • The difference between estimate_relation() and estimate_prediction() is that the former gives the expected value of the outcome variable (i.e., the regression line) at arbitrary values of the predictor (useful for visualizing relationships), while the latter gives the predicted values of the outcome variable for each point in the dataset (useful for visualizing the residuals - the difference between observed and predicted - or the predicted distributions)
ggplot(mtcars, aes(x=qsec)) +
  geom_density(fill="skyblue") +
  geom_density(data=pred, aes(x=iter_value, group=iter_group), color="orange", linewidth=0.5) +
  coord_cartesian(xlim=c(-100, 100), ylim=c(0, 0.3)) + 
  theme_minimal()

Exercise

  • Set very narrow priors for the Intercept and mpg parameters that are centered around the values that we know from the Frequentist model
  • Run prior predictive checks. Try to obtain a plot that looks like this:

The Art of Priors

Prior choice can depend on:

  • Mathematical constraints
    • “sigma is strictly positive”
    • “correlations are between -1 and 1”
  • Theoretical knowledge
    • Expert knowledge (“this is a plausible range”)
    • Model knowledge (e.g., Intercept to the mean of the outcome)
  • Empirical knowledge
    • From previous studies
    • From descriptive stats
  • Types of Priors (different terms, no established taxonomy)
    • Weakly informative, mildly informative, strongly informative
    • Uninformative, regularizing, principled, informative
    • Skeptical, agnostic, believer

The Practical Side

  • In general, don’t panic and follow the rule of “good enough”
    • If you have reasons to care about priors (e.g., complex models, convergence issues, having ideas about priors), then it’s worth spending time to think about them
    • If not, going with the default / flat / very weakly informative priors is fine
    • But don’t forget to assess the convergence of the model (e.g., MCMC sampling process) to make sure the posterior has been well explored…

Homework

  • Watch McElreath’s lecture on MCMC
    • https://www.youtube.com/watch?v=rZk2FqX2XnY&t=788s&ab_channel=RichardMcElreath

The End (for now)

Thank you!