Bayesian Statistics

8. Linear Models (Priors)

Dominique Makowski
D.Makowski@sussex.ac.uk

Recap

  • You know how to:
    • Use Bayes Factors in simple tests (e.g., t-tests and correlations)
    • Run Bayesian Correlations
    • Sample from the posterior using MCMC
    • Describe and report the posterior distribution

Goal

  • Last week, we were interested in the correlation between mpg and qsec in the mtcars dataset
BayesFactor::correlationBF(mtcars$mpg, mtcars$qsec)
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 4.416463 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) +
  stat_ellipse(color="blue", linewidth=1)

  • This week, we are interested in the relationship itself (i.e., not just the strength of the association, but the characteristics of the regression line) between these 2 variables

1. Visualizing Model Predictions

Exercise

  • Fit a Frequentist linear model to predict qsec from mpg. Write down its parameters.
model <- lm(qsec ~ mpg, data = mtcars) 
summary(model)

Call:
lm(formula = qsec ~ mpg, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8161 -1.0287  0.0954  0.8623  4.7149 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15.35477    1.02978  14.911 2.05e-15 ***
mpg          0.12414    0.04916   2.525   0.0171 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.65 on 30 degrees of freedom
Multiple R-squared:  0.1753,    Adjusted R-squared:  0.1478 
F-statistic: 6.377 on 1 and 30 DF,  p-value: 0.01708
  • Visualize the relationship
ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) +
  geom_smooth(method="lm", color="red")

Slope and Intercept

  • Using geom_smooth(method="lm") is convenient, but we don’t know what’s happening under the hood
  • What is happening under the hood?
    • A linear model is computed between x and y (\(y = ax + b\))
    • The slope a (aka \(\beta\)) and the intercept b (aka \(\beta_{0}\) or the Intercept) are extracted
    • We draw a segment of this line between the minimum and the maximum of x
ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) +
  geom_abline(intercept = 15, slope = 0.12, color="red")

How to Visualize a Model

  • There is another way to get this regression line
  • A model (i.e., an equation with a set of parameters, e.g., intercept and slope) is a “function”
    • \(y = 2 + 3x\)
    • It expresses y “as a function” of x
    • For any value of x, we can predict a value of y
  • A statistical model can be used to generate predictions
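  • For instance, with made-up parameters (a minimal sketch):
f_linear <- function(x) 2 + 3 * x  # the model y = 2 + 3x expressed as an R function
f_linear(c(0, 1, 2))  # predicts y = 2, 5, 8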

Making Predictions

  • One can get predictions from almost any model (using the predict() or the insight::get_predicted() function - the latter being slightly more user-friendly and comprehensive)
pred <- insight::get_predicted(model, data=mtcars) 
length(pred)
[1] 32
  • What is pred? And why is its length 32? (One predicted value of qsec per row of mtcars)
  • Let’s add it to the dataframe and visualize it
data <- mtcars  # Copy the dataframe (it's not good to modify built-in datasets)
data$Predicted <- pred
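  • As a sanity check, base R’s predict() gives the same fitted values here:
head(predict(model))  # should match the first values of pred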

Visualizing Predicted Distribution

  • We can compare the actual distribution of the outcome variable qsec with that of the predicted outcome
  • A good model where predictions are close to the actual values should reproduce the same distribution of the outcome variable
  • However, in general, a linear model is geared toward predicting the “mean” of the outcome variable (by definition)
ggplot(data) +
  geom_density(aes(x=qsec), fill="green", alpha=0.5) +
  geom_density(aes(x=Predicted), fill="red", alpha=0.5)

Visualizing Predicted Relationship (1)

  • But we can also visualize the relationship between the predictor and the outcome using the predictions
ggplot(data, aes(x = mpg, y = qsec)) +
  geom_point(size=3) +
  geom_point(aes(y=Predicted), size=3, color="red", alpha=0.5) 

Visualizing Predicted Relationship (2)

  • Interesting pattern… what if we connect the points with a line?
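  • For example (a sketch reusing the Predicted column created above):
ggplot(data, aes(x = mpg, y = qsec)) +
  geom_point(size=3) +
  geom_line(aes(y = Predicted), color="red")  # connect the predicted points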

  • It looks like the line we got using the equation

Datagrids

  • The relationship estimated by a model can be visualized using predictions (rather than directly using the parameters), which is convenient
  • We can make predictions for non-observed values, for example, equally spaced values, simply to visualize the relationship
newdata <- data.frame(mpg = c(10, 20, 30, 40))
  • This is called a “datagrid” (a grid of new data points used for model predictions and visualizations)
  • We can also predict the confidence interval around the predictions
pred <- insight::get_predicted(model, data=newdata, ci=0.95) |> 
  as.data.frame() |> 
  cbind(newdata)
head(pred)
  Predicted        SE   CI_low  CI_high mpg
1  16.59613 0.5754098 15.42099 17.77128  10
2  17.83750 0.2916457 17.24188 18.43312  20
3  19.07887 0.5677467 17.91937 20.23836  30
4  20.32023 1.0212413 18.23458 22.40588  40

Visualize Predictions with Confidence Intervals

  • Plot everything
mtcars |> 
  ggplot(aes(x = mpg)) +
  geom_point(aes(y = qsec), size=3) +
  geom_ribbon(data=pred, aes(ymin=CI_low, ymax=CI_high), fill="red", alpha=0.2) +
  geom_line(data=pred, aes(y=Predicted), color="red") 

2. Bayesian Models: Quick n’ Dirty

Installing brms

  • From now until the end of the module, we will use the brms package to fit Bayesian models
  • We took some time to do the installation in the 2nd module, but if you missed it, let me know during the break
  • It is a package that is complicated to install. Why?
  • Because it needs to install another programming language…

Stan, JAGS, and BUGS

  • As mentioned, performing Bayesian inference is complex
  • Modern full Bayesian inference (using MCMC) is known to be resource-intensive and slow; it is currently best performed (fastest and most flexibly) by dedicated languages, aka probabilistic programming languages
    • Stan (the most popular one): Named in honour of Stanislaw Ulam, pioneer of the Monte Carlo method
    • JAGS: “Just Another Gibbs Sampler”
    • BUGS: “Bayesian Inference Using Gibbs Sampling”
    • PyMC (Python)
    • Turing (Julia)
  • These languages are designed to perform Bayesian inference, and are optimized for that purpose
  • They are not general-purpose programming languages
  • What a hassle to learn a new language just to do Bayesian inference!

About brms

  • There are “wrappers” around these languages that allow us to use them from other general purpose languages like R or Python
  • In R:
    • brms = Bayesian Regression Models using ‘Stan’
    • rstanarm = R interface to Stan for Applied Regression Modeling
  • rstanarm
    • Has different functions for different models (e.g., stan_lm(), stan_glm(), stan_lmer(), stan_glmer(), stan_gamm4() etc.)
    • Each model is implemented with precompiled Stan code
    • It is slightly easier to install, but less flexible
  • brms
    • Became the most popular
    • Main difference is that brms has only one main function, brm(), which can fit almost any model
    • These models are translated into Stan code, and then compiled
    • Overhead time when fitting new models, but it is more flexible

Doing a Bayesian Regression

  • Frequentist model
model <- lm(qsec ~ mpg, data = mtcars) 
parameters::parameters(model)
Parameter   | Coefficient |   SE |         95% CI | t(30) |      p
------------------------------------------------------------------
(Intercept) |       15.35 | 1.03 | [13.25, 17.46] | 14.91 | < .001
mpg         |        0.12 | 0.05 | [ 0.02,  0.22] |  2.53 | 0.017 
  • Bayesian model
model <- brms::brm(qsec ~ mpg, data = mtcars, refresh=0) 
parameters::parameters(model)
# Fixed Effects

Parameter   | Median |         95% CI |     pd |  Rhat |     ESS
----------------------------------------------------------------
(Intercept) |  15.34 | [13.13, 17.40] |   100% | 1.001 | 3582.00
mpg         |   0.12 | [ 0.03,  0.23] | 99.30% | 1.000 | 3549.00

# Sigma

Parameter | Median |       95% CI |   pd |  Rhat |     ESS
----------------------------------------------------------
sigma     |   1.68 | [1.33, 2.21] | 100% | 1.001 | 2972.00
  • refresh=0 silences the printing of the sampling progress (to avoid cluttering this slide)
  • R-hat: Convergence diagnostic index. Should be less than 1.05.
  • ESS: Effective Sample Size: how many “good iterations” (roughly independent draws) we have. Should be > 400 (the maximum being the total number of iterations drawn).
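  • These diagnostics can also be retrieved on their own, e.g., with bayestestR (a sketch):
bayestestR::diagnostic_posterior(model)  # returns ESS and Rhat for each parameter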

Voilà!

  • You’ve fitted and reported a Bayesian model.
  • The results are very coherent with the Frequentist version.
  • This is a valid Bayesian analysis…
  • But because we are on our way to becoming Bayesian Masters, we can spend more time and thought on aspects of this model.

3. Bayesian Models: Priors

brms formula

  • The brm() function’s first argument is a formula, like in lm()
  • Because formulas can become complex, it is useful to specify the formula in a separate variable
f <- brms::brmsformula(qsec ~ mpg)
  • We could then run brms::brm(f, data=mtcars) to fit the Bayesian linear model

Regression formulas: the intercept

  • In most model functions in R, the intercept is included by default:
    • lm(qsec ~ mpg) is equivalent to lm(qsec ~ 1 + mpg)
coef(lm(qsec ~ mpg, data=mtcars))
(Intercept)         mpg 
 15.3547689   0.1241366 
coef(lm(qsec ~ 1 + mpg, data=mtcars))
(Intercept)         mpg 
 15.3547689   0.1241366 
  • One can remove the intercept by using lm(qsec ~ 0 + mpg)
coef(lm(qsec ~ 0 + mpg, data=mtcars))
     mpg 
0.827125 

Exercise

  • Visualize the relationship predicted by the No Intercept model
newdata <- data.frame(mpg = seq(-5, 35, 5))

lm(qsec ~ 0 + mpg, data=mtcars) |> 
  insight::get_predicted(data=newdata, ci=0.95) |> 
  as.data.frame() |> 
  cbind(newdata) |> 
  ggplot(aes(x = mpg, y = Predicted)) +
  geom_ribbon(aes(ymin=CI_low, ymax=CI_high), fill="red", alpha=0.2) +
  geom_line(color="red") +
  geom_point(data=mtcars, aes(x = mpg, y = qsec), size=3) 

  • It forces the regression line to go through the origin (0, 0)

brms formula exception: the intercept

  • In brms, there is a twist. For reasons that we will clarify later, it is often good to explicitly include the intercept as a “variable”
    • y ~ 0 + Intercept is equivalent to y ~ 1
    • This treats the intercept as a regular coefficient rather than a special term, which makes prior specification easier
  • Instead of the previous formula, we will use:
f <- brms::brmsformula(qsec ~ 0 + Intercept + mpg)
  • The 2 formulas would have resulted in the same model, but explicitly including the intercept makes it easier to…
  • Specify priors

Get Default Priors

  • Bayesian inference is the process of reallocating credibility over parameters, from priors to posteriors
  • In the correlation example, we had a prior over the correlation coefficient
  • In Bayesian regression models, we have priors over the coefficients (regression parameters)
  • How many parameters do we have in this linear model?
  • The get_prior() function returns the priors associated with a given model or a formula (i.e., the default priors made by brms)
brms::get_prior(f, data = mtcars)
                prior class      coef group resp dpar nlpar lb ub       source
               (flat)     b                                            default
               (flat)     b Intercept                             (vectorized)
               (flat)     b       mpg                             (vectorized)
 student_t(3, 0, 2.5) sigma                                  0         default

Understanding Priors Parameters

brms::get_prior(f, data = mtcars)
                prior class      coef group resp dpar nlpar lb ub       source
               (flat)     b                                            default
               (flat)     b Intercept                             (vectorized)
               (flat)     b       mpg                             (vectorized)
 student_t(3, 0, 2.5) sigma                                  0         default
  • brms has set a default prior for each parameter
  • 2 types of parameters for linear models:
    • Coefficients (b) and Distributional (sigma)
  • For coefficients b, brms uses “flat” priors for any coefficient
    • The first row just indicates this “default”, and then it shows that this default is applied to all the following coefficients (vectorized)
    • Changing the prior in the first row is a convenient way to apply the same prior to all coefficients (see the sketch below)
    • We can ignore the first row because we want to be specific with our priors
  • We will ignore the last row sigma (for now) to focus on the coefficients
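  • For example, omitting coef applies the same prior to every coefficient of the class (a sketch):
brms::set_prior("normal(0, 1)", class = "b")  # one prior for all b coefficients at once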

Flat priors

  • What are “flat” priors?
    • Uniform priors
  • These are the “least informative” priors: we consider that all possible values have the same probability, i.e., that slopes of -1 million and +1 million are equally likely
    • Using this prior gives results equivalent to Maximum Likelihood Estimation (the prior has no influence, so only the likelihood matters)
  • It might seem like a good idea to use flat priors, but it is not
    • Not realistic: we often don’t expect the slope to be -1 million or +1 million
    • Not good for the inference algorithm: we waste energy exploring regions of the parameter space that are not realistic
    • Can lead to convergence issues
  • In general, avoid flat priors

Alternative to Flat Priors

  • Instead of flat priors, we can set weakly informative priors
  • Priors that are “weakly informative” give some information to the model, but not too much
  • For example, we can use a normal distribution with a very large standard deviation
    • It helps the model converge by avoiding unrealistic regions of the parameter space
    • At the same time, it doesn’t provide so much information that it would “bias” the estimation
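  • For instance, we can check how much room such a prior actually leaves (a quick sketch):
qnorm(c(0.025, 0.975), mean = 0, sd = 5)  # 95% of a Normal(0, 5) prior lies within about ±9.8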

What priors to use? Intercept (1)

  • The Intercept represents the predicted value of the dependent variable (qsec) when the independent variable (mpg) is 0
  • We can plot our data to visualize the unit and range of our data
p <- ggplot(mtcars, aes(x = mpg, y = qsec)) +
  geom_point(size=3) + 
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_hline(yintercept = 0, linetype="dashed") +
  coord_cartesian(xlim = c(-1, 35), ylim = c(-1, 30))
p

  • What would be a good range for the Intercept?

What priors to use? Intercept (2)

  • We can visualize regression lines with different intercepts (and a slope of 0)
    • e.g., 10? 15? 17? 20? 25?
p + geom_abline(intercept = c(10, 15, 17, 20, 25), slope = 0, 
                color=c("purple", "red", "orange", "green", "blue"), linewidth=1)

  • Which one feels the most probable? Which one the least?

What priors to use? Intercept (3)

  • We could use a normal distribution for the intercept, centered on a “best guess”
  • The center could be set to the average y value
    • Because if the slope = 0 (no effect), the intercept would be the average y value (cf. “constant” models)
    • mean(mtcars$qsec) = 17.85
  • What precision? In the example above, we used \(SD=4\)

What priors to use? Intercept (4)

  • Such a prior (\(Normal(17.85, 4)\)) is actually quite narrow (i.e., “strongly informative”)
    • The actual SD of the response variable is sd(mtcars$qsec) = 1.79
  • You can express more or less strong beliefs by changing the precision
  • It is good to use informative priors, but it requires thought and understanding of the parameters and the data
  • However, since we want weakly informative priors, we can use a wider distribution
    • e.g., \(Normal(17.85, 40)\)
    • Or \(Normal(M_{qsec}, 10*SD_{qsec})\) = \(Normal(17.85, 17.9)\)
  • It can be convenient to express the prior on the intercept as a function of the outcome variable’s mean and SD
    • But the details don’t actually have a huge impact (e.g., taking 10 SD vs. 12 SD)
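  • In practice, these values can be computed directly from the data (a sketch):
m <- mean(mtcars$qsec)     # 17.85
s <- 10 * sd(mtcars$qsec)  # ~17.9
paste0("normal(", round(m, 2), ", ", round(s, 1), ")")  # "normal(17.85, 17.9)"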

Priors for Coefficients (1)

  • In psychology, we are often interested in testing effects
    • Coefficients (slopes) often represent some effects of interest
    • We are often interested in assessing whether a coefficient is different from 0
  • Thus, in order to avoid “biasing” the effects in one or the other direction, we often use a symmetric distribution centered at 0
    • This means that we are not assuming that the effect is positive or negative
    • In other words, this prior is agnostic (doesn’t provide information) about the direction of the effect, only its absolute size, penalizing very large effects
    • We consider that the most likely value is 0, but we equally allow for the possibility of positive or negative effects
    • This kind of prior centered at 0 is conservative (more or less, depending on its width): it pushes the effects towards 0
  • That’s one of the reasons why the idea that we can “cheat” with priors to make things significant is not true
    • Using wide symmetric priors centered at 0 is a common default choice and, if anything, is more conservative than Frequentist estimation

Priors for Coefficients (2)

  • Assuming the intercept is the mean of qsec (17.85), what do various slopes look like? (see the sketch below)
    • e.g., -1, -0.5, 0, 0.5, 1
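  • A sketch mirroring the intercept plot above (reusing the scatter plot p):
p + geom_abline(intercept = 17.85, slope = c(-1, -0.5, 0, 0.5, 1),
                color = c("purple", "red", "orange", "green", "blue"), linewidth = 1)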

  • Slope values of \(\pm 1\) are quite extreme
  • We can pick a distribution that does not give very high plausibility to values larger than \(|1|\)

Priors for Coefficients (3)

  • E.g., \(Normal(0, 0.5)\)

Priors for Coefficients (4)

  • Because we want the information to be even weaker, we are going to pick \(Normal(0, 5)\)
  • What does this prior say?
    • Any slope between -3 SD and +3 SD (i.e., between -15 and +15) is deemed possible
    • A slope of 1,000,000 is very unlikely
  • Note that other priors could have been suitable
  • For instance, if the data were particularly noisy or prone to outliers (extreme values) that could make the slope “wobbly”, we could use a t-distribution
    • Red: \(Normal(\mu=0, \sigma=0.5)\)
    • Purple: \(Normal(\mu=0, \sigma=5)\)
    • Blue: \(t(df=2, \mu=0, \sigma=5)\)
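  • A sketch to draw these three prior shapes with base R:
curve(dnorm(x, mean = 0, sd = 0.5), from = -20, to = 20, col = "red", ylab = "Density")
curve(dnorm(x, mean = 0, sd = 5), col = "purple", add = TRUE)
curve(dt(x / 5, df = 2) / 5, col = "blue", add = TRUE)  # t density rescaled to scale 5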

Chill out, it’s just a prior

  • The choice of priors might seem like a big deal, tedious, daunting and scary
  • But it’s not:
    • If you understand your model, you can easily and quickly make reasonable choices
    • It’s forgiving: even if you put a “bad” prior (e.g., one that is too informative), if the effect is there, the data will overwhelm the prior (unless the prior is mis-specified, e.g., a \(Beta\) prior on a negative relationship)
    • In many cases, the priors don’t matter as much
    • In many cases, even using “flat” priors is fine and won’t make any difference
  • But it is useful to think about priors
    • Especially when you can have strong prior knowledge
    • If you want to be particularly conservative (“error control”)
    • If you have noisy or little data
    • If you care about efficiency (slow models that have trouble converging)

Set Priors in brms

  • Now that we know what priors we want to set:
    • Intercept: \(Normal(17.85, 17.9)\)
    • Slope: \(Normal(0, 5)\)
  • How to actually set them?
brms::get_prior(f, data=mtcars)
                prior class      coef group resp dpar nlpar lb ub       source
               (flat)     b                                            default
               (flat)     b Intercept                             (vectorized)
               (flat)     b       mpg                             (vectorized)
 student_t(3, 0, 2.5) sigma                                  0         default
  • These are the default priors

Using set_prior()

  • We create new priors using set_prior(), but we need to identify them using class and coef
brms::set_prior("normal(17.85, 17.9)", class = "b", coef = "Intercept")
b_Intercept ~ normal(17.85, 17.9)

Validate Priors in brms

  • A collection of priors is created using the c() function
  • This collection of priors is then “validated” using validate_prior()
    • It checks whether all the specified priors match existing parameters
    • It checks whether the replacement of the default priors was successful
    • It returns a new set of priors (defaults + replacements) that we can pass to the model
priors <- c(
  brms::set_prior("normal(17.85, 17.9)", class = "b", coef = "Intercept"),
  brms::set_prior("normal(0, 5)", class = "b", coef = "mpg")
) |> 
  brms::validate_prior(f, data=mtcars)

priors
                prior class      coef group resp dpar nlpar lb ub  source
               (flat)     b                                       default
  normal(17.85, 17.9)     b Intercept                                user
         normal(0, 5)     b       mpg                                user
 student_t(3, 0, 2.5) sigma                                  0    default
  • Our new object named priors is a collection of priors that we can provide to the model

Check Priors

  • Are these priors reasonable? i.e., not too narrow, not too wide?
  • In the case of a simple linear model, where we have a good idea of the range of the coefficients, we can gauge that just from the prior parameters themselves
  • In more complex models, it can become complicated
    • Because parameters may not have an intuitive meaning or be directly interpretable
    • Because parameters interact with one another
  • One way to check if priors are appropriate is to do a “Prior Predictive Check”
    • This means that we will generate model predictions using the priors only
    • We can then see if these predictions are reasonable

Make a model with priors only

  • We can fit a model in brms using the brm function
    • Enter the formula, the data, and priors
    • If the prior argument is not specified, it will use the default that we got from get_prior()
  • Here we want to only use the priors (as if we didn’t see the data)
    • Using the argument: sample_prior="only"
  • The model is first compiled (into Stan and C++ code), then sampled from
model_priors <- brms::brm(
  formula=f,
  data = mtcars,
  prior = priors,
  sample_prior="only"
)

Generate predictions

  • We want to see what this “prior-only” model predicts
  • We want to make predictions on a datagrid (to visualize the regression lines that it predicts)
newdata <- data.frame(mpg = seq(10, 35, length.out=6))
newdata
  mpg
1  10
2  15
3  20
4  25
5  30
6  35
pred <- get_predicted(model_priors, data=newdata, iterations=250) |> 
  as.data.frame() |> 
  cbind(newdata)
nrow(pred)
[1] 6
ncol(pred)
[1] 252
names(pred)
  [1] "Predicted" "iter_1"    "iter_2"    "iter_3"    "iter_4"    "iter_5"   
  [7] "iter_6"    "iter_7"    "iter_8"    "iter_9"    "iter_10"   "iter_11"  
 [13] "iter_12"   "iter_13"   "iter_14"   "iter_15"   "iter_16"   "iter_17"  
 [19] "iter_18"   "iter_19"   "iter_20"   "iter_21"   "iter_22"   "iter_23"  
 [25] "iter_24"   "iter_25"   "iter_26"   "iter_27"   "iter_28"   "iter_29"  
 [31] "iter_30"   "iter_31"   "iter_32"   "iter_33"   "iter_34"   "iter_35"  
 [37] "iter_36"   "iter_37"   "iter_38"   "iter_39"   "iter_40"   "iter_41"  
 [43] "iter_42"   "iter_43"   "iter_44"   "iter_45"   "iter_46"   "iter_47"  
 [49] "iter_48"   "iter_49"   "iter_50"   "iter_51"   "iter_52"   "iter_53"  
 [55] "iter_54"   "iter_55"   "iter_56"   "iter_57"   "iter_58"   "iter_59"  
 [61] "iter_60"   "iter_61"   "iter_62"   "iter_63"   "iter_64"   "iter_65"  
 [67] "iter_66"   "iter_67"   "iter_68"   "iter_69"   "iter_70"   "iter_71"  
 [73] "iter_72"   "iter_73"   "iter_74"   "iter_75"   "iter_76"   "iter_77"  
 [79] "iter_78"   "iter_79"   "iter_80"   "iter_81"   "iter_82"   "iter_83"  
 [85] "iter_84"   "iter_85"   "iter_86"   "iter_87"   "iter_88"   "iter_89"  
 [91] "iter_90"   "iter_91"   "iter_92"   "iter_93"   "iter_94"   "iter_95"  
 [97] "iter_96"   "iter_97"   "iter_98"   "iter_99"   "iter_100"  "iter_101" 
[103] "iter_102"  "iter_103"  "iter_104"  "iter_105"  "iter_106"  "iter_107" 
[109] "iter_108"  "iter_109"  "iter_110"  "iter_111"  "iter_112"  "iter_113" 
[115] "iter_114"  "iter_115"  "iter_116"  "iter_117"  "iter_118"  "iter_119" 
[121] "iter_120"  "iter_121"  "iter_122"  "iter_123"  "iter_124"  "iter_125" 
[127] "iter_126"  "iter_127"  "iter_128"  "iter_129"  "iter_130"  "iter_131" 
[133] "iter_132"  "iter_133"  "iter_134"  "iter_135"  "iter_136"  "iter_137" 
[139] "iter_138"  "iter_139"  "iter_140"  "iter_141"  "iter_142"  "iter_143" 
[145] "iter_144"  "iter_145"  "iter_146"  "iter_147"  "iter_148"  "iter_149" 
[151] "iter_150"  "iter_151"  "iter_152"  "iter_153"  "iter_154"  "iter_155" 
[157] "iter_156"  "iter_157"  "iter_158"  "iter_159"  "iter_160"  "iter_161" 
[163] "iter_162"  "iter_163"  "iter_164"  "iter_165"  "iter_166"  "iter_167" 
[169] "iter_168"  "iter_169"  "iter_170"  "iter_171"  "iter_172"  "iter_173" 
[175] "iter_174"  "iter_175"  "iter_176"  "iter_177"  "iter_178"  "iter_179" 
[181] "iter_180"  "iter_181"  "iter_182"  "iter_183"  "iter_184"  "iter_185" 
[187] "iter_186"  "iter_187"  "iter_188"  "iter_189"  "iter_190"  "iter_191" 
[193] "iter_192"  "iter_193"  "iter_194"  "iter_195"  "iter_196"  "iter_197" 
[199] "iter_198"  "iter_199"  "iter_200"  "iter_201"  "iter_202"  "iter_203" 
[205] "iter_204"  "iter_205"  "iter_206"  "iter_207"  "iter_208"  "iter_209" 
[211] "iter_210"  "iter_211"  "iter_212"  "iter_213"  "iter_214"  "iter_215" 
[217] "iter_216"  "iter_217"  "iter_218"  "iter_219"  "iter_220"  "iter_221" 
[223] "iter_222"  "iter_223"  "iter_224"  "iter_225"  "iter_226"  "iter_227" 
[229] "iter_228"  "iter_229"  "iter_230"  "iter_231"  "iter_232"  "iter_233" 
[235] "iter_234"  "iter_235"  "iter_236"  "iter_237"  "iter_238"  "iter_239" 
[241] "iter_240"  "iter_241"  "iter_242"  "iter_243"  "iter_244"  "iter_245" 
[247] "iter_246"  "iter_247"  "iter_248"  "iter_249"  "iter_250"  "mpg"      

Iterations

  • Bayesian sampling is done in iterations
  • Each iteration represents one possibility of model parameters
  • More plausible possibilities will appear more frequently
  • In our case, this means that we have one line (Intercept + Slope) for each iteration
  • We can visualize these various lines, which correspond to possibilities generated by our priors (i.e., before seeing the data)
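  • We can inspect these iterations directly (a sketch; as_draws_df() comes from the posterior package and is re-exported by brms):
draws <- brms::as_draws_df(model_priors)
head(draws[, c("b_Intercept", "b_mpg")])  # one candidate intercept and slope per row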

Prior Predictive Check - Relationship

p <- ggplot(mtcars, aes(x=mpg, y=qsec)) +
  geom_point(size=3) 
p

p + geom_line(data=pred, aes(y=iter_1))

p + 
  geom_line(data=pred, aes(y=iter_1)) + 
  geom_line(data=pred, aes(y=iter_2)) +
  geom_line(data=pred, aes(y=iter_3)) +
  geom_line(data=pred, aes(y=iter_4)) +
  geom_line(data=pred, aes(y=iter_5)) 

Reshaping iterations

  • We can reshape the dataframe (pivot it into long format) to get one column containing all iteration values and another identifying the iteration number
pred <- reshape_iterations(pred)
head(pred)
  Predicted mpg iter_index iter_group iter_value
1  16.91122  10          1          1   41.22650
2  16.38351  15          2          1   46.27231
3  15.85580  20          3          1   51.31811
4  15.32809  25          4          1   56.36392
5  14.80038  30          5          1   61.40972
6  14.27267  35          6          1   66.45553

Visualizing iterations

  • We can visualize the relationship between mpg and qsec for each iteration
p + 
  geom_line(data=pred, aes(y=iter_value, group=iter_group), alpha=0.2) +
  coord_cartesian(ylim=c(-400, 400))

  • What can we say?
  • This is rather good
    • As we can see, the priors don’t predict the exact relationship (they have no knowledge of what we want to conclude)
    • They don’t predict crazy numbers either (e.g., a billion)

Prior Predictive Check - Distribution

  • We can also visualize the distribution of the predictions to see whether the model predicts the outcome in a reasonable range of values
pred <- get_predicted(model_priors, data=mtcars, iterations=250) |> 
  as.data.frame() |> 
  reshape_iterations()

head(pred)
  Predicted iter_index iter_group iter_value
1  16.40507          1          1  -73.97298
2  16.40507          2          1  -73.97298
3  16.34220          3          1  -83.88313
4  16.39110          4          1  -76.17524
5  16.48541          5          1  -61.31002
6  16.50636          6          1  -58.00664
ggplot(mtcars, aes(x=qsec)) +
  geom_density(fill="skyblue") +
  geom_density(data=pred, aes(x=iter_value, group=iter_group), color="orange", alpha=0.5) +
  coord_cartesian(xlim=c(-100, 100), ylim=c(0, 0.3))

Exercise

  • Set very narrow priors for the Intercept and mpg parameters, centered on the values that we know from the Frequentist model
  • Run prior predictive checks. Try to obtain a plot that looks like this:
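  • One possible starting point (a sketch; the SDs are deliberately tiny and somewhat arbitrary):
narrow_priors <- c(
  brms::set_prior("normal(15.35, 0.2)", class = "b", coef = "Intercept"),
  brms::set_prior("normal(0.12, 0.01)", class = "b", coef = "mpg")
) |> 
  brms::validate_prior(f, data = mtcars)

model_narrow <- brms::brm(f, data = mtcars, prior = narrow_priors, sample_prior = "only")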

The Art of Priors

Prior choice can depend on:

  • Mathematical constraints
    • “sigma is strictly positive”
    • “correlations are between -1 and 1”
  • Theoretical knowledge
    • Expert knowledge (“this is a plausible range”)
    • Model knowledge (e.g., Intercept to the mean of the outcome)
  • Empirical knowledge
    • From previous studies
    • From descriptive stats
  • Types of Priors (different terms, no established taxonomy)
    • Weakly informative, mildly informative, strongly informative
    • Uninformative, regularizing, principled, informative
    • Skeptical, agnostic, believer

The Practical Side

  • In general, don’t panic and follow the rule of “good enough”
    • If you have reasons to care about priors (e.g., complex models, convergence issues, having ideas about priors), then it’s worth spending time to think about them
    • If not, going with the default / flat / very weakly informative priors is fine
    • But don’t forget to assess the convergence of the model (e.g., MCMC sampling process) to make sure the posterior has been well explored…

Homework

  • Watch McElreath’s lecture on MCMC
    • https://www.youtube.com/watch?v=rZk2FqX2XnY&t=788s&ab_channel=RichardMcElreath

The End (for now)

Thank you!