Introduction to Bayesian Statistics

@ LaPsyDé

Dominique Makowski
D.Makowski@sussex.ac.uk

Managing expectations

  • Learn Bayesian by understanding Frequentist
  • Build on existing knowledge and “demystify” Bayesian statistics
  • Provide a bird’s eye view of Bayesian statistics applied to statistical models

Introductions

  • Internship at LaPsyDé (2012)
  • PhD 2018 (Université Paris Descartes)
    • Emotion Regulation through “Fiction”
    • Software development
  • Postdoc 2019-2023 (NTU Singapore)
    • EEG and Physiological markers of Deception
  • Lecturer (2023-) (University of Sussex, Brighton)
    • Reality Bending Lab (ReBeL) website
      • Role of emotions and cognitive control in the perception of reality
      • Illusions, fake news, altered states of consciousness, …
    • Bayesian Statistics

Frequentist Refresher

Quiz

  • What is wrong with this Correlation plot?
  • It is more a “regression” plot (scatter plot + regression line)
Show figure code
library(ggplot2)

df <- bayestestR::simulate_correlation(n=500, r=0.7)
df |> 
  ggplot(aes(x=V1, y=V2)) +
  geom_point() +
  geom_smooth(method="lm", formula = 'y ~ x', se=FALSE, linewidth=2, color="red") +
  theme_bw()

Solution

  • A correlation is a measure of the strength of association [-1, 1]
    • Better represented by an ellipse
  • The line is a regression line
    • The slope of the line is the regression coefficient (can be > 1)
Show figure code
df |> 
  ggplot(aes(x=V1, y=V2)) +
  geom_point() +
  geom_smooth(method="lm", formula = 'y ~ x', se=FALSE, linewidth=2, color="red") +
  stat_ellipse(type = "norm", color="blue", linewidth=2) +
  theme_bw()

Tests vs. Models

  • Psychology uses a lot of tests (t-tests, correlation tests, ANOVAs, etc.)
  • Tests != Models
    • We can’t predict the value of y given x with a test
  • However, tests are usually based on models
  • In particular, tests are related to specific parameters of models
cor.test(mtcars$wt, mtcars$qsec) |> 
  parameters::parameters()
Pearson's product-moment correlation

Parameter1 |  Parameter2 |     r |        95% CI | t(30) |     p
----------------------------------------------------------------
mtcars$wt  | mtcars$qsec | -0.17 | [-0.49, 0.19] | -0.97 | 0.339

Alternative hypothesis: true correlation is not equal to 0
lm(wt ~ qsec, data=mtcars) |> 
  parameters::parameters()
Parameter   | Coefficient |   SE |        95% CI | t(30) |     p
----------------------------------------------------------------
(Intercept) |        4.92 | 1.77 | [ 1.32, 8.53] |  2.79 | 0.009
qsec        |       -0.10 | 0.10 | [-0.30, 0.11] | -0.97 | 0.339
  • What are Parameters?

Linear Model

  • Linear function: \(y = ax + b\)
  • Regression formula: \(\hat{y} = Intercept + \beta_{1}x\)
    • \(\hat{y}\) (Y-hat): predicted response/outcome/dependent variable
    • \(Intercept\) (\(\beta_{0}\)): “starting point”. Value of \(\hat{y}\) when \(x=0\)
    • \(\beta_{1}\) (beta) : slope/“effect”. Change of \(\hat{y}\) from the intercept when \(x\) increases by 1
    • \(x\): predictor/independent variable
  • What are the parameters of the regression line below? Intercept = 1 and beta = 3
Code
ggplot(df, aes(x=V1, y=V2)) +
  geom_point(alpha=0) +
  geom_vline(xintercept = 0, linetype="dotted") +
  geom_abline(intercept = 1, slope = 3, color="red", linewidth=2)  +
  geom_segment(aes(x = 0, y = 1, xend = 1, yend = 1), linewidth=1, 
               color="green", linetype="dashed") +
  geom_segment(aes(x = 1, y = 1, xend = 1, yend = 4), linewidth=1, 
               color="green", linetype="solid", arrow=arrow(length = unit(0.1, "inches"))) +
  geom_segment(aes(x = 0, y = 0, xend = 0, yend = 1), linewidth=1, 
               color="blue", linetype="solid", arrow=arrow(length = unit(0.1, "inches"))) +
  geom_point(aes(x = 0, y = 0), color="purple", size=8, shape="+") +
  labs(x="x", y="y") +
  theme_bw() +
  coord_cartesian(xlim = c(-1, 1.5), ylim = c(-3, 4))

Impact of Parameters

  • What happens when we change the slope and the intercept?
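  • A minimal sketch below (with arbitrary illustrative values) overlays a few lines to visualize it
Code
library(ggplot2)

# Hypothetical intercept/slope combinations to compare visually
params <- data.frame(intercept = c(0, 1, 1), slope = c(1, 1, 3))

ggplot() +
  geom_abline(data = params,
              aes(intercept = intercept, slope = slope,
                  color = paste0("Intercept=", intercept, ", beta=", slope)),
              linewidth = 1) +
  geom_vline(xintercept = 0, linetype = "dotted") +
  coord_cartesian(xlim = c(-1, 1.5), ylim = c(-3, 4)) +
  labs(x = "x", y = "y", color = "Parameters") +
  theme_bw()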

Linear Model = “Gaussian” Model

  • Solving the linear model equation means finding the best-fitting normal distribution of the residuals
Code
library(patchwork)  # For the | operator used to combine the plots below

p <- mtcars |> 
  ggplot(aes(x=qsec, y=mpg)) +
  geom_point(alpha=0.7, size=6) +
  geom_smooth(method="lm", formula = 'y ~ x', se=FALSE, linewidth=2) +
  geom_segment(aes(x = qsec, y = mpg, xend = qsec, yend = predict(lm(mpg ~ qsec, data=mtcars))), 
               color="red", linetype="dotted", linewidth=1) +
  theme_bw() +
  labs(x="x", y="y", title="The residuals are the vertical distances between each point and the line.")

p2 <- data.frame(Error=insight::get_residuals(lm(mpg ~ qsec, data=mtcars))) |> 
  ggplot(aes(x=Error)) +
  geom_histogram(bins=10, fill="grey", color="black") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_density(data=data.frame(Error=bayestestR::distribution_normal(n=100, sd = 2)),
               aes(y=after_stat(density)*40), color="#F44336", linewidth=1, adjust=1) +
  geom_density(data=data.frame(Error=bayestestR::distribution_normal(n=100, sd = 3)),
               aes(y=after_stat(density)*50), color="#FF5722", linewidth=1, adjust=1) +
  geom_density(data=data.frame(Error=bayestestR::distribution_normal(n=100, sd = 4)),
               aes(y=after_stat(density)*60), color="#FF9800", linewidth=1, adjust=1) +
  geom_point(aes(y=0), size=10, shape=16, alpha=0.3) +
  theme_bw() +
  coord_flip() +
  labs(y = "Density")

p + theme(plot.title = element_blank()) | p2

Homoscedasticity

  • The line that goes through the Normal distribution’s locations \(\mu\) also minimizes the sum of squared residuals (OLS)
Code
model <- lm(mpg ~ qsec, data=mtcars)

p <- mtcars |> 
  ggplot(aes(x=qsec, y=mpg)) +
  geom_smooth(method="lm", formula = 'y ~ x', se=FALSE, linewidth=2) +
  theme_bw()

# Function to add normal distribution curves
add_normals <- function(p, model) {
  sigma <- summary(model)$sigma  # Standard deviation of residuals
  n <- 100  # Number of points for each curve
  
  for(i in 1:nrow(mtcars)) {
    x_val <- mtcars$qsec[i]
    y_pred <- predict(model, newdata = data.frame(qsec = x_val))
    
    # Create a sequence of y values for the normal curve
    y_seq <- seq(y_pred - 3*sigma, y_pred + 3*sigma, length.out = n)
    density <- dnorm(y_seq, mean = y_pred, sd = sigma)
    
    # Adjust density to match the scale of the plot
    max_width <- 1  # Max width of areas
    density_scaled <- (density / max(density)) * max_width
    
    # Create a dataframe for each path
    path_df <- data.frame(x = x_val + density_scaled, y = y_seq)
    path_dfv <- data.frame(x=path_df$x[1], ymin=min(path_df$y), ymax=max(path_df$y))
    
    # Add the path to the plot
    p <- p + 
      geom_segment(data = path_dfv, aes(x = x, xend=x, y = ymin, yend=ymax), 
                   color = "#FF9800", linewidth = 0.7, alpha=0.8, linetype="dotted") +
      geom_path(data = path_df, aes(x = x, y = y), 
                color = "#FF9800", linewidth = 1, alpha=0.8) 
  }
  p
}


# Create the final plot
p <- add_normals(p, model) +
  geom_segment(aes(x = qsec, y = mpg, xend = qsec, yend = predict(lm(mpg ~ qsec, data=mtcars))), 
               color="red", linetype="solid", linewidth=1) +
  geom_point(alpha=0.8, size=6) 
p

Model Parameters

  • In regression models, 2 types of parameters are estimated (using OLS/MLE)
    • The coefficients
      • Intercept
      • The slope(s)
    • “Distributional” (aka “Auxiliary”) parameters (about the distribution of errors)
      • E.g., the standard deviation of the errors \(\sigma\) (sigma) in linear models
      • Most people tend to ignore these… but they are important!
model <- lm(mpg ~ qsec, data=mtcars)

summary(model)

Call:
lm(formula = mpg ~ qsec, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8760 -3.4539 -0.7203  2.2774 11.6491 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -5.1140    10.0295  -0.510   0.6139  
qsec          1.4121     0.5592   2.525   0.0171 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.564 on 30 degrees of freedom
Multiple R-squared:  0.1753,    Adjusted R-squared:  0.1478 
F-statistic: 6.377 on 1 and 30 DF,  p-value: 0.01708
insight::get_sigma(model)
[1] 5.563738
attr(,"class")
[1] "insight_aux" "numeric"    
  • The remaining indices (SE, t-value, p-value) are calculated using them

Frequentist Inference

Parameter   | Coefficient |    SE |          95% CI |  t(30) |     p
--------------------------------------------------------------------
(Intercept) |       4.925 | 1.765 | [ 1.319, 8.530] |  2.790 | 0.009
qsec        |      -0.096 | 0.098 | [-0.297, 0.105] | -0.972 | 0.339
  • The t-value is the ratio between the coefficient and its standard error (SE)
    • It can be seen as a “standardized” index of the coefficient’s precision (independent of the scale/unit of the variable)
    • It corresponds to a point on a t-distribution of corresponding degrees of freedom (df), centered at 0
    • This t-distribution is the assumption made in Frequentist statistics about the distribution of coefficients (effects) under the null hypothesis (Not to be confused with the distribution of residuals, which is assumed to be Normal)
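  • We can verify this by hand (a minimal sketch, refitting the model from the table above):
model <- lm(wt ~ qsec, data=mtcars)
coef_table <- summary(model)$coefficients  # Columns: Estimate, Std. Error, t value, Pr(>|t|)

# The t-value is simply the coefficient divided by its standard error
coef_table[, "Estimate"] / coef_table[, "Std. Error"]  # Matches the t(30) column above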

The t-value

  • Is our coefficient likely to be observed under the null hypothesis?
    • What is the probability of obtaining a coefficient at least as precise as ours (in either direction) under the null hypothesis?
    • \(Prob > |t|\)
Code
# Plot  t-distribution
x <- seq(-5, 5, length.out = 1000)
y <- dt(x, df=30)
df <- data.frame(x = x, y = y)

t_value <- 2.790 

df$Probability <- ifelse(df$x < -t_value, "< -t", "Smaller")
df$Probability <- ifelse(df$x > t_value, "> +t", df$Probability)
df |>
  ggplot(aes(x = x, y = y)) +
  geom_line() +
  geom_area(aes(x = x, y = y, fill = Probability), alpha = 0.5) +
  geom_segment(data=data.frame(x=t_value, y=0),
               aes(xend = t_value, yend = dt(t_value, df=30)), 
               color="red", linetype="solid", linewidth=1) +
  geom_point(data=data.frame(x=t_value, y=dt(t_value, df=30)), color="red", size=3) +
  theme_minimal() +
  scale_fill_manual(values = c("red", "red", "grey")) +
  labs(x = "\nt-values - standardized coefficient under the null hypothesis", y = "Probability", 
       title = paste0("t-Distribution (df=30); ", "t-value = ", round(t_value, 2), "\n"))

p-value

  • The p-value is the probability of observing a value greater than or equal to the t-value under the null hypothesis. In other words, it is the probability of obtaining test results at least as “big” (as precisely away from 0) as the result actually observed, if we repeated the experiment an infinite number of times and there were no true effect (see the numeric check after this list).
    • It is quite convoluted…
    • It is not a “trivial” value. It has some delicate aspects.
  • Estimation: It is the product of several complicated steps:
    • Estimate the SE of the coefficients (this is not a straightforward process and gets complicated in more complex models)
    • Estimate the df (degrees of freedom) of the model (this is also problematic for complex models, e.g., mixed models)
    • Make a strong assumption about the distribution of the coefficients under the null hypothesis (e.g., the t-distribution)
  • Interpretation:
    • We tend to focus on a dichotomous interpretation of the p-value (significant or not), based on an arbitrary threshold \(\alpha\) typically set to 5% (.05) for literally no reason in particular
    • We often interpret it as the probability of the null hypothesis being true, or as an index of effect magnitude, which is not correct (it’s more related to the width of the certainty around the estimate than its absolute size)
  • However, p-values are not fundamentally bad. They are very good at one thing: controlling how often, if we repeat an experiment infinitely, we’ll make a particular kind of error. It’s just that it’s often not what we care about in practice.
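  • As a sanity check, the p-value can be recomputed from the t-value and the df (a minimal sketch, using the qsec coefficient from the table above):
t_value <- -0.972  # t-value of the qsec coefficient
dof <- 30          # Residual degrees of freedom

# Two-sided p-value: probability of a t-value at least as extreme (in both directions)
2 * pt(abs(t_value), df = dof, lower.tail = FALSE)  # ~0.339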

Alternatives?

  • Can we do better and get more intuitive quantities?
    • The answer starts with “B”
    • And ends with “…ootstrapping”

Bootstrapping

Parent Population

  • Frequentist “inferential” statistics are used to make inferences about the parent population from which the data are sampled (this is a difference from the Bayesian framework, but more on that later)
  • We make a lot of assumptions about how the effects (and absence thereof) are distributed in the parent population; these assumptions allow us to estimate things like confidence intervals and p-values.
  • If we had access to the parent populations, we wouldn’t need to compute these indices; we would just say this is the “true” value of the effect.
  • But we usually don’t have access to the parent population, and we have to rely on the data at hand (assumed to be a random sample from the parent distribution) and infer/assume the shape of the parent distribution

“Data are random”

  • This assumption has an interesting consequence…
  • Data points can be seen as “interchangeable”!
  • We could make a new sample from the same data points
df <- mtcars
head(df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
new_rows <- sample(1:nrow(mtcars), replace = TRUE)
new_rows
 [1] 26 14  3 24  4 17  3 28 31 30 18 14 11 20 27  8 19 31  3 20 12 19 12  7 20
[26]  4 20 16  1 11  5 30
new_sample <- mtcars[new_rows, ] 
  • This sample is theoretically “as good” as the original one
  • We could compute the model again on this new sample…

Rinse and repeat

  • Let’s do it n times and store all the coefficients in a vector
model <- lm(mpg ~ qsec, data=mtcars)  # Fit linear regression
true_coef <- coef(model)[2]  # Extract slope coefficient

coefs <- c()  # Initialize an empty vector of coefs
for (i in 1:5000) {  # Repeat the process 5000 times
  new_sample <- mtcars[sample(1:nrow(mtcars), replace = TRUE), ]  # Sample new data
  new_model <- lm(mpg ~ qsec, data=new_sample)  # recompute the model
  coefs <- c(coefs, coef(new_model)[2])  # Append the coef to the vector
}
Code
data.frame(coefs = coefs) |>
  ggplot(aes(x = coefs)) +
  geom_density(fill="orange") +
  geom_vline(xintercept = true_coef, color="red", linetype="dashed", linewidth=2) +
  theme_minimal() +
  labs(x = "Coefficient", y = "Frequency", 
       title = "Distribution of coefficients from 5000 bootstrapped samples\n") +
  coord_cartesian(xlim = c(-0.5, NA))

Why bootstrapping?

  • Instead of having a point-estimate of a statistical parameter, we now have a distribution of the statistic
  • This distribution is valid (flows from the premises of frequentist statistics) and can be used to directly estimate the uncertainty related to the statistic of interest
  • From now on, you need to shift your thinking from “point-estimates” (single values) to start thinking in terms of “distribution” of statistical values
  • How to describe this distribution?

Indices of Centrality

  • We usually still compute a point-estimate, useful to give an idea of the “central tendency” of the distribution (as our little brains don’t process uncertainty and ranges well); see the sketch after this list
  • Mean (average) of the distribution
    • Pros: Easy to compute, Easy to interpret (“average” value)
    • Cons: Sensitive to outliers, Not appropriate for non-symmetric distributions
  • Median (middle value) of the distribution
    • Pros: Easy to compute, Robust to outliers, Consistent interpretation in terms of probabilities
    • Cons: Too robust? (to variability in the tails)
  • Mode (most frequent value - peak - of the distribution) - aka Maximum A Posteriori (MAP)
    • Pros: Easy to interpret
    • Cons: Complex to compute, Problematic for multimodal distributions, Not really robust, Not always defined (e.g. uniform distributions)
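  • A minimal sketch computing these three indices on the bootstrapped coefficients from before:
mean(coefs)                      # Mean
median(coefs)                    # Median
bayestestR::map_estimate(coefs)  # Mode (MAP), estimated from the smoothed density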

Indices of Uncertainty

  • Standard deviation (SD), Median Absolute Deviation (MAD)
  • Credible Intervals (CI)
    • Equal-tailed Interval (ETI)
      • Easy to compute
      • Potentially problematic for skewed distributions
    • Highest Density Interval (HDI)
      • More complex to compute
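  • A minimal sketch computing these indices on the bootstrapped coefficients:
sd(coefs)   # Standard deviation
mad(coefs)  # Median Absolute Deviation

bayestestR::eti(coefs, ci = 0.95)  # Equal-tailed interval (quantile-based)
bayestestR::hdi(coefs, ci = 0.95)  # Highest Density Interval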

Why 95% CI?

  • Why do we use “95%” (0.95) for the CI?
  • Because of the .05 alpha level (used for the p-value)
  • Indeed, confidence intervals have a direct relationship with the p-value:
    • if the 95% CI is \([0.00, x]\) or \([x, 0.00]\), then the p-value = .05.
    • In other words, a 0.95 CI excluding 0 has a p-value < .05.
  • But this .05 alpha level is arbitrary…
    • Yes. And so is 95%. It is mostly a convention.
  • Recently, a trend emerged to “shake up” the conventions and highlight the arbitrary nature of the .05 alpha level
    • McElreath (2018) suggested using 89% CI instead of 95% CI. Why? Because why not!
  • Some software also use 90% or 94% CI as default
  • However, I still personally recommend 95%
    • See https://github.com/easystats/bayestestR/discussions/250
  • Importantly, the choice of the CI is for you to decide.

Indices of Significance

  • What does significance even mean?
  • Big?
  • Certain?
  • Reliable?

Probability of Direction (pd)

  • The Probability of Direction (pd) is defined as the probability that the effect is positive or negative (whichever is the largest)
  • It varies between 0.50 (50%) and 1.00 (100%), with 0.50 indicating that the effect is equally likely to be positive or negative, and 100% indicating that the effect is for sure positive or negative
  • We can conceptualize it as an index of effect existence (i.e., how likely is the effect to exist in a given direction regardless of its size)
    • If \(pd \rightarrow 100\%\), it means that there is certainly a positive|negative effect (but it can be small, large, or its size can be uncertain)
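  • A minimal sketch computing the pd of the bootstrapped coefficients:
# Proportion of the distribution that shares the sign of its median
bayestestR::p_direction(coefs)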

pd vs. p-value

  • Although conceptually distinct, the pd typically very strongly correlates with one-sided p-values
    • But it is not an index of significance (the interpretation is more straightforward)
  • For example, if the pd is > 97.5% (0.975), then the p-value is typically < .05
    • \(p_{onesided} = 1 - pd\)
    • \(p_{twosided} = 2(1 - pd)\)
  • Functions in bayestestR exist to convert between the two indices.
    • Use them only to gain an intuition; it is not a formal conversion
bayestestR::p_to_pd(c(0.05, 0.01, 0.001), direction = "two-sided")
[1] 0.9750 0.9950 0.9995
bayestestR::pd_to_p(c(0.90, 0.95), direction = "two-sided")
[1] 0.2 0.1

Probability of null effect?

  • The pd tells us the probability that a distribution (of an effect) is positive (> 0) or negative (< 0)
  • But can we know if the effect is either larger or smaller than zero (i.e., not null)?
  • In other words, can we know the probability of the effect being different from 0?
    • Which is the goal of Null Hypothesis Significance Testing (NHST)
  • Well, not really
    • The probability of a continuous distribution being different from a single value is always 100%: a single point has zero probability mass
    • An infinity of possible values vs. a single one, exactly 0.0000
  • What can we do?
    • What if instead of considering “no effect” as being exactly 0.0000, we consider it as being “close to 0.0000”?
    • What if we defined a range of “negligible” values, equivalent to 0 for practical purposes?
    • This is the idea behind the Region of Practical Equivalence (ROPE)

Region of Practical Equivalence (ROPE)

  • The idea of the ROPE is to explicitly define a range that we consider as “negligible” (i.e., equivalent to 0 for practical purposes)
  • For instance, say I decide to define a parameter smaller than |0.42| as negligible because why not?
    • Justified based on e.g., prior studies, preliminary data, theoretical knowledge, …
    • I preregister this hypothesis (so that I am not accused of making this choice after seeing the results)
  • What is the probability that \(|parameter| < 0.42\)?
data.frame(Coefficient = coefs) |> 
  ggplot(aes(x=Coefficient)) +
  geom_density(fill="blue") +
  geom_area(data=data.frame(x=c(-0.42, 0.42)), aes(x=x, y=2), fill="red", alpha=0.5) +
  coord_cartesian(xlim=c(-0.6, 3.5))

Compute ROPE

bayestestR::rope(coefs, range = c(-0.42, 0.42), ci = 1.00)
# Proportion of samples inside the ROPE [-0.42, 0.42]:

inside ROPE
-----------
0.66 %     
  • We often compute the proportion of a given Credible Interval that lies within the ROPE
  • The ROPE index is often computed as the proportion of the 95% CI of the distribution that lies within the ROPE bounds
    • Because the CI is considered more stable than the whole distribution, the resulting index is less prone to small random fluctuations
  • Be careful when reading and presenting results involving the ROPE: what ROPE range and CI were used?
# ci=0.95 is the default setting for rope()
result <- bayestestR::rope(coefs, range = c(-0.42, 0.42), ci = 0.95)
result
# Proportion of samples inside the ROPE [-0.42, 0.42]:

inside ROPE
-----------
0.00 %     
# The results can be plotted
plot(result)

Is p-ROPE an index of significance?

  • This index corresponds to the proportion of the distribution/Credible Interval that is within the ROPE
  • p-ROPE (\(P_{x~\in~ROPE}\)) can be interpreted as an index of significance
    • The probability that an effect is non-negligible
    • i.e., “significant” is here meant in the “large enough to matter” sense
    • Note that the p-ROPE index is neither equivalent to, nor highly correlated with, p-values
  • Note that “Equivalence Testing” is sometimes presented as an alternative to Null-Hypothesis Significance Testing (NHST), and can also be used with more traditional tests (with the idea of testing against a range of pre-specified “negligible” values instead of a point-null).

ROPE range

  • The main challenge with ROPE is to define the range of negligible values
  • It is possible to use rule-of-thumb guidelines for effect size classification
    • For correlations, Cohen (1988) suggests that r < 0.1 is a very small (i.e., negligible) effect
    • For standardized differences, Cohen’s d guidelines can be used similarly
  • It gets complicated for interaction parameters, or when the effect is not standardized
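  • bayestestR provides a helper implementing such rules of thumb; for linear models, the default range is based on \(\pm 0.1 \times SD_{y}\) (a minimal sketch):
model <- lm(mpg ~ qsec, data=mtcars)
bayestestR::rope_range(model)  # Default range of "negligible" values for this model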

Describing parameter distributions: Summary

bayestestR::describe_posterior(coefs)
Summary of Posterior Distribution

Parameter | Median |       95% CI |     pd |          ROPE | % in ROPE
----------------------------------------------------------------------
Posterior |   1.41 | [0.65, 2.63] | 99.92% | [-0.10, 0.10] |        0%
  • Describe the “centrality” (e.g., Median), the “uncertainty” (CI), and the “significance” (e.g., ROPE, pd, …)
bayestestR::describe_posterior(coefs, 
                               centrality="mean", 
                               dispersion=TRUE, 
                               ci=0.90, 
                               ci_method="hdi",
                               rope_range =c(-0.9, 0.9), 
                               test="rope")
Summary of Posterior Distribution

Parameter | Mean |   SD |       90% CI |          ROPE | % in ROPE
------------------------------------------------------------------
Posterior | 1.46 | 0.50 | [0.72, 2.30] | [-0.90, 0.90] |     8.51%
  • Choices must be made and justified: it is important to understand what the indices are and what they mean

Bootstrapping-specific indices?

  • Are these new indices (median, MAD, credible intervals, p-direction, ROPE) bootstrapping-specific?
    • NO! Bootstrapping is just a method that provides a parameter distribution instead of point-values
  • These indices are “simply” descriptive summaries of parameter distributions
  • Thus, they can be used whenever we have a distribution of parameters
  • ⚠️Spoiler Alert⚠️ Such as in Bayesian stats…

Bayesian Statistics

Bayes’ Theorem

  • \(P(A|B) = \frac{P(A)~P(B|A)}{P(B)}\)
  • P(A|B): How likely A is, given information about B
  • P(A) and P(B): Probability of A or B occurring on its own, without any information about the other
    • It’s our initial (prior) beliefs about A and B
    • P(B) is the marginal probability of B, and is often just a normalizing constant (so that the total probability = 1). The important part of the formula is the numerator
  • P(B|A): Probability of B occurring given that A has already occurred. It quantifies how likely B is when we know A has happened
    • It is the likelihood of B given A
  • Bayes’ Theorem allows us to update our belief about the probability of A happening (P(A|B)) based on new evidence provided by the occurrence of B. It combines our initial belief (P(A)), the likelihood of B happening when A is true (P(B|A)), and the overall likelihood of B (P(B)) to calculate this (a posteriori) updated probability
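  • A worked numeric example (with hypothetical probabilities chosen for illustration):
p_A <- 0.10          # Prior belief: P(A)
p_B_given_A <- 0.80  # Likelihood: P(B|A)
p_B <- 0.25          # Marginal probability of the evidence: P(B)

# Posterior: P(A|B) = P(A) * P(B|A) / P(B)
p_A * p_B_given_A / p_B  # 0.32: observing B updates our belief in A from 10% to 32%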

Bayesian inference

  • Bayesian inference is the application of the product rule of probability to real problems of inference
  • “Bayesian inference is really just counting and comparing of possibilities […] in order to make good inference about what actually happened, it helps to consider everything that could have happened” (McElreath, 2016)
  • “The first idea is that Bayesian inference is reallocation of credibility across possibilities. The second foundational idea is that the possibilities [i.e., the ”hypotheses”], over which we allocate credibility, are parameter values in meaningful mathematical models” (Kruschke, 2015)
    • Bayesian inference allows us to compute or approximate the plausibility (think density plot) of different parameter values of a statistical model given the data

Bayesian inference in practice

  • In science, we are interested in the probability of a hypothesis \(H\), given the data \(X\)
    • \(= P(H|X)\)
  • Before any data is collected, the researcher has some level of prior belief \(P(H)\) in this hypothesis
  • \(P(X|H)\): The researcher also has a model that makes specific predictions about the outcome given the data X (e.g., a linear model)
    • This model tells us how strongly the data are implied by a hypothesis
  • Bayes’ theorem tells us that \(P(H|X) =\frac{P(H)~P(X|H)}{P(X)}\)
  • In words: The probability of a hypothesis given the data is equal to the probability of the hypothesis before seeing the data, multiplied by the probability that the data occur if that hypothesis is true, divided by the probability of the observed data

In short

  • Bayesian statistics are an elegant framework to update probabilities based on evidence
  • \(P(H|X) \propto P(H)~P(X|H)\)
    • The probability of a hypothesis given the data is proportional to the prior probability of the hypothesis times the likelihood of the data given the hypothesis
  • It allows us to quantify the probability of a hypothesis given the data, whereas in Frequentist statistics, we can only quantify the probability of the data (data is random!) given a (null) hypothesis. Basically, it answers the opposite question.
  • While stemming from a simple theorem, it has deep consequences and powerful applications…

Bayesian inference for stats models

  • \(P(H|X) \propto P(H)~P(X|H)\)
  • In statistical models, P(H) is not a single probability but a (set of) prior distribution(s) corresponding to beliefs about some parameter(s)
  • P(X|H) corresponds to the data likelihood model (the link between the outcome and the predictor given some value of the parameter)
    • E.g., in a linear model \(y = 0 + 2x\), if \(x = 3\), then the predicted outcome is \(2 \times 3 = 6\)
  • P(H|X) corresponds to the posterior distribution of a parameter (given the prior and the data)
  • Resolving this equation empirically is actually very complex
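  • To get an intuition of the problem, here is a toy “grid approximation” sketch (only tractable for very simple cases; the intercept and sigma are fixed at hypothetical values to keep it one-dimensional):
x <- mtcars$mpg
y <- mtcars$qsec

grid <- seq(-1, 1, length.out = 1000)   # Candidate values for the slope
prior <- dnorm(grid, mean = 0, sd = 1)  # Prior belief: slope ~ Normal(0, 1)

# Likelihood of the data for each candidate slope (intercept fixed at 15, sigma at 2)
likelihood <- sapply(grid, function(b) prod(dnorm(y, mean = 15 + b * x, sd = 2)))

# Bayes' theorem: the posterior is proportional to prior * likelihood
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)  # Normalizing (the role of P(X))

grid[which.max(posterior)]  # Posterior mode of the slope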

Stan, JAGS, and BUGS

  • Modern full Bayesian inference (using MCMC) is known to be resource-intensive and slow; it is currently best performed (most quickly and flexibly) by dedicated languages, aka probabilistic programming languages
    • Stan (the most popular one): Named in honour of Stanislaw Ulam, pioneer of the Monte Carlo method
    • JAGS: “Just Another Gibbs Sampler”
    • BUGS: “Bayesian Inference Using Gibbs Sampling”
    • PyMC (Python)
    • Turing (Julia)
  • These languages are designed to perform Bayesian inference, and are optimized for that purpose
  • They are not general-purpose programming languages

About brms

  • There are “wrappers” around these languages that allow us to use them from other general purpose languages like R or Python
  • In R:
    • brms = Bayesian Regression Models using ‘Stan’
    • rstanarm = R interface to Stan for Applied Regression Modeling
  • rstanarm
    • Has different functions for different models (e.g., stan_lm(), stan_glm(), stan_lmer(), stan_glmer(), stan_gamm4() etc.)
    • Each model is implemented with precompiled Stan code
    • It is slightly easier to install, but less flexible
  • brms
    • Has become the most popular
    • Main difference is that brms has only one main function, brm(), which can fit almost any model
    • These models are translated into Stan code, and then compiled
    • Overhead time when fitting new models, but it is more flexible

Doing a Bayesian Regression

  • Frequentist model
model <- lm(qsec ~ mpg, data = mtcars) 
parameters::parameters(model)
Parameter   | Coefficient |   SE |         95% CI | t(30) |      p
------------------------------------------------------------------
(Intercept) |       15.35 | 1.03 | [13.25, 17.46] | 14.91 | < .001
mpg         |        0.12 | 0.05 | [ 0.02,  0.22] |  2.53 | 0.017 
  • Bayesian model
model <- brms::brm(qsec ~ mpg, data = mtcars, refresh=0) 
parameters::parameters(model)
# Fixed Effects

Parameter   | Median |         95% CI |     pd |  Rhat |     ESS
----------------------------------------------------------------
(Intercept) |  15.34 | [13.25, 17.47] |   100% | 1.000 | 3454.00
mpg         |   0.12 | [ 0.02,  0.22] | 99.30% | 1.000 | 3394.00

# Sigma

Parameter | Median |       95% CI |   pd |  Rhat |     ESS
----------------------------------------------------------
sigma     |   1.68 | [1.34, 2.22] | 100% | 1.000 | 3387.00
  • refresh=0 silences the printing of the sampling (to avoid cluttering this slide)
  • R-hat: Convergence diagnostic index. Should be < 1.05.
  • ESS: Effective Sample Size: how many “good iterations” (truly independent draws) we have. Should be > 400 (the maximum being the number of iterations that we draw).

Mixed Model

  • One function to rule them all.
model <- glmmTMB::glmmTMB(qsec ~ mpg + (1|gear), data = mtcars) 
parameters::parameters(model, effects="fixed")
# Fixed Effects

Parameter   | Coefficient |   SE |         95% CI |     z |      p
------------------------------------------------------------------
(Intercept) |       14.81 | 1.26 | [12.34, 17.28] | 11.75 | < .001
mpg         |        0.13 | 0.05 | [ 0.03,  0.23] |  2.64 | 0.008 
  • Bayesian model
model <- brms::brm(qsec ~ mpg + (1|gear), data = mtcars, refresh=0) 
parameters::parameters(model, effects="fixed")
# Fixed Effects

Parameter   | Median |         95% CI |     pd |  Rhat |     ESS
----------------------------------------------------------------
(Intercept) |  14.83 | [11.70, 17.92] |   100% | 1.001 |  887.00
mpg         |   0.13 | [ 0.03,  0.23] | 99.33% | 1.001 | 2213.00

# Sigma

Parameter | Median |       95% CI |   pd |  Rhat |     ESS
----------------------------------------------------------
sigma     |   1.35 | [1.05, 1.83] | 100% | 1.002 | 1658.00

How about Bayes factors?

  • Bayes factors are intuitive indices measuring the strength of evidence for one hypothesis over another
    • They have a natural interpretation in terms of odds (e.g., 3:1 odds), and have wide applications (comparing models, null-hypothesis testing, …)
    • They have well established interpretation guidelines (> 3; > 10; > 30; > 100)
  • They are “easy” to compute for simple statistical tests, but for more complex models, they are notoriously hard to compute
  • They are highly sensitive to the choice of priors, finding an “alternative” model is not at all trivial, and computing them is often very hard; these limitations have made them the focal point of debates (even between Bayesians)
  • There are different schools:
    • Some Bayesians believe that BFs are the best indices ever (assuming one understands them) and the Holy Grail of Bayesian statistics
    • Some believe that other indices are better and that BFs reproduce the same culture of binary thinking as Frequentist NHST
    • We’ll adopt a middle view: BFs are useful, in particular for simple tests (t-tests, correlations, comparing models), but for modeling, analyzing the posterior distributions is better
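  • For simple tests, dedicated packages make them straightforward to obtain (a minimal sketch, assuming the BayesFactor package is installed):
# Bayes factor for a correlation, comparing H1 (r != 0) against the null (r = 0)
BayesFactor::correlationBF(mtcars$wt, mtcars$qsec)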

Bayes Factors: Misconceptions

  • BFs do not tell us the absolute probability of a hypothesis being true. They quantify how much we should update our belief in a hypothesis. If this hypothesis were extremely unlikely a priori, we might still believe it to be very unlikely, even after computing a large BF.
  • BFs provide relative evidence between 2 hypotheses, and it is possible that one hypothesis is supported more than another hypothesis, while both hypotheses are false. Saying “BF shows evidence for the null (H0)” is incorrect. It should be “BF shows evidence for the H0 as compared to H1” (what H1 is matters).
  • BFs tell us nothing about the size of effects.
  • See Lakens’ post for details

A rabbit hole

  • There is a lot more to Bayesian statistics
    • Inference algorithms (MCMC, …)
    • Priors
    • Model comparison
    • Usage in neuroscientific theories
  • But it is easy to start using it

Questions?

Thank you!