Introduction to Bayesian Statistics

@ LaPsyDé

Dominique Makowski
D.Makowski@sussex.ac.uk

Managing expectations

  • Learn Bayesian by understanding Frequentist
  • Build on existing knowledge and “demystify” Bayesian statistics
  • Provide a bird’s eye view of Bayesian statistics applied to statistical models

Introductions

  • Internship at LaPsyDé (2012)
  • PhD 2018 (Université Paris Descartes)
    • Emotion Regulation through “Fiction”
    • Software development
  • Postdoc 2019-2023 (NTU Singapore)
    • EEG and Physiological markers of Deception
  • Lecturer (2023-) (University of Sussex, Brighton)
    • Reality Bending Lab (ReBeL) website
      • Role of emotions and cognitive control in the perception of reality
      • Illusions, fake news, altered states of consciousness, …
    • Bayesian Statistics

Frequentist Refresher

Quiz

  • What is wrong with this Correlation plot?
  • It is more a “regression” plot (scatter plot + regression line)
Show figure code
library(ggplot2)

df <- bayestestR::simulate_correlation(n=500, r=0.7)
df |> 
  ggplot(aes(x=V1, y=V2)) +
  geom_point() +
  geom_smooth(method="lm", formula = 'y ~ x', se=FALSE, linewidth=2, color="red") +
  theme_bw()

Solution

  • A correlation is a measure of the strength of association [-1, 1]
    • Better represented by an ellipse
  • The line is a regression line
    • The slope of the line is the regression coefficient (can be > 1)
Show figure code
df |> 
  ggplot(aes(x=V1, y=V2)) +
  geom_point() +
  geom_smooth(method="lm", formula = 'y ~ x', se=FALSE, linewidth=2, color="red") +
  stat_ellipse(type = "norm", color="blue", linewidth=2) +
  theme_bw()

Tests vs. Models

  • Psychology uses a lot of tests (t-tests, correlation tests, ANOVAs, etc.)
  • Tests != Models
    • We can’t predict the value of y given x with a test
  • However, tests are usually based on models
  • In particular, tests are related to specific parameters of models
cor.test(mtcars$wt, mtcars$qsec) |> 
  parameters::parameters()
Pearson's product-moment correlation

Parameter1 |  Parameter2 |     r |        95% CI | t(30) |     p
----------------------------------------------------------------
mtcars$wt  | mtcars$qsec | -0.17 | [-0.49, 0.19] | -0.97 | 0.339

Alternative hypothesis: true correlation is not equal to 0
lm(wt ~ qsec, data=mtcars) |> 
  parameters::parameters()
Parameter   | Coefficient |   SE |        95% CI | t(30) |     p
----------------------------------------------------------------
(Intercept) |        4.92 | 1.77 | [ 1.32, 8.53] |  2.79 | 0.009
qsec        |       -0.10 | 0.10 | [-0.30, 0.11] | -0.97 | 0.339
  • What are Parameters?

Linear Model

  • Linear function: \(y = ax + b\)
  • Regression formula: \(\hat{y} = Intercept + \beta_{1}x\)
    • \(\hat{y}\) (Y-hat): predicted response/outcome/dependent variable
    • \(Intercept\) (\(\beta_{0}\)): “starting point”. Value of \(\hat{y}\) when \(x=0\)
    • \(\beta_{1}\) (beta) : slope/“effect”. Change of \(\hat{y}\) from the intercept when \(x\) increases by 1
    • \(x\): predictor/independent variable
  • What are the parameters of the regression line below? Intercept = 1 and beta = 3
Code
ggplot(df, aes(x=V1, y=V2)) +
  geom_point(alpha=0) +
  geom_vline(xintercept = 0, linetype="dotted") +
  geom_abline(intercept = 1, slope = 3, color="red", linewidth=2)  +
  geom_segment(aes(x = 0, y = 1, xend = 1, yend = 1), linewidth=1, 
               color="green", linetype="dashed") +
  geom_segment(aes(x = 1, y = 1, xend = 1, yend = 4), linewidth=1, 
               color="green", linetype="solid", arrow=arrow(length = unit(0.1, "inches"))) +
  geom_segment(aes(x = 0, y = 0, xend = 0, yend = 1), linewidth=1, 
               color="blue", linetype="solid", arrow=arrow(length = unit(0.1, "inches"))) +
  geom_point(aes(x = 0, y = 0), color="purple", size=8, shape="+") +
  labs(x="x", y="y") +
  theme_bw() +
  coord_cartesian(xlim = c(-1, 1.5), ylim = c(-3, 4))

Impact of Parameters

  • What happens when we change the slope and the intercept?
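  • A minimal sketch below (with arbitrary illustrative values) overlays a few lines to visualize it
Code
library(ggplot2)

# Hypothetical intercept/slope combinations to compare visually
params <- data.frame(intercept = c(0, 1, 1), slope = c(1, 1, 3))

ggplot() +
  geom_abline(data = params,
              aes(intercept = intercept, slope = slope,
                  color = paste0("Intercept=", intercept, ", beta=", slope)),
              linewidth = 1) +
  geom_vline(xintercept = 0, linetype = "dotted") +
  coord_cartesian(xlim = c(-1, 1.5), ylim = c(-3, 4)) +
  labs(x = "x", y = "y", color = "Parameters") +
  theme_bw()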

Linear Model = “Gaussian” Model

  • Solving the linear model equation means finding the best-fitting normal distribution of the residuals
Code
library(patchwork)  # For the | operator used to combine the plots below

p <- mtcars |> 
  ggplot(aes(x=qsec, y=mpg)) +
  geom_point(alpha=0.7, size=6) +
  geom_smooth(method="lm", formula = 'y ~ x', se=FALSE, linewidth=2) +
  geom_segment(aes(x = qsec, y = mpg, xend = qsec, yend = predict(lm(mpg ~ qsec, data=mtcars))), 
               color="red", linetype="dotted", linewidth=1) +
  theme_bw() +
  labs(x="x", y="y", title="The residuals are the vertical distances between each point and the line.")

p2 <- data.frame(Error=insight::get_residuals(lm(mpg ~ qsec, data=mtcars))) |> 
  ggplot(aes(x=Error)) +
  geom_histogram(bins=10, fill="grey", color="black") +
  geom_vline(xintercept = 0, linetype="dashed") +
  geom_density(data=data.frame(Error=bayestestR::distribution_normal(n=100, sd = 2)),
               aes(y=after_stat(density)*40), color="#F44336", linewidth=1, adjust=1) +
  geom_density(data=data.frame(Error=bayestestR::distribution_normal(n=100, sd = 3)),
               aes(y=after_stat(density)*50), color="#FF5722", linewidth=1, adjust=1) +
  geom_density(data=data.frame(Error=bayestestR::distribution_normal(n=100, sd = 4)),
               aes(y=after_stat(density)*60), color="#FF9800", linewidth=1, adjust=1) +
  geom_point(aes(y=0), size=10, shape=16, alpha=0.3) +
  theme_bw() +
  coord_flip() +
  labs(y = "Density")

p + theme(plot.title = element_blank()) | p2

Homoscedasticity

  • The line that goes through the Normal distribution’s locations \(\mu\) also minimizes the sum of squared residuals (OLS)
Code
model <- lm(mpg ~ qsec, data=mtcars)

p <- mtcars |> 
  ggplot(aes(x=qsec, y=mpg)) +
  geom_smooth(method="lm", formula = 'y ~ x', se=FALSE, linewidth=2) +
  theme_bw()

# Function to add normal distribution curves
add_normals <- function(p, model) {
  sigma <- summary(model)$sigma  # Standard deviation of residuals
  n <- 100  # Number of points for each curve
  
  for(i in 1:nrow(mtcars)) {
    x_val <- mtcars$qsec[i]
    y_pred <- predict(model, newdata = data.frame(qsec = x_val))
    
    # Create a sequence of y values for the normal curve
    y_seq <- seq(y_pred - 3*sigma, y_pred + 3*sigma, length.out = n)
    density <- dnorm(y_seq, mean = y_pred, sd = sigma)
    
    # Adjust density to match the scale of the plot
    max_width <- 1  # Max width of areas
    density_scaled <- (density / max(density)) * max_width
    
    # Create a dataframe for each path
    path_df <- data.frame(x = x_val + density_scaled, y = y_seq)
    path_dfv <- data.frame(x=path_df$x[1], ymin=min(path_df$y), ymax=max(path_df$y))
    
    # Add the path to the plot
    p <- p + 
      geom_segment(data = path_dfv, aes(x = x, xend=x, y = ymin, yend=ymax), 
                   color = "#FF9800", linewidth = 0.7, alpha=0.8, linetype="dotted") +
      geom_path(data = path_df, aes(x = x, y = y), 
                color = "#FF9800", linewidth = 1, alpha=0.8) 
  }
  p
}


# Create the final plot
p <- add_normals(p, model) +
  geom_segment(aes(x = qsec, y = mpg, xend = qsec, yend = predict(lm(mpg ~ qsec, data=mtcars))), 
               color="red", linetype="solid", linewidth=1) +
  geom_point(alpha=0.8, size=6) 
p

Model Parameters

  • In regression models, 2 types of parameters are estimated (using OLS/MLE)
    • The coefficients
      • Intercept
      • The slope(s)
    • “Distributional” (aka “Auxiliary”) parameters (about the distribution of errors)
      • E.g., the standard deviation of the errors \(\sigma\) (sigma) in linear models
      • Most people tend to ignore these… but they are important!
model <- lm(mpg ~ qsec, data=mtcars)

summary(model)

Call:
lm(formula = mpg ~ qsec, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8760 -3.4539 -0.7203  2.2774 11.6491 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -5.1140    10.0295  -0.510   0.6139  
qsec          1.4121     0.5592   2.525   0.0171 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.564 on 30 degrees of freedom
Multiple R-squared:  0.1753,    Adjusted R-squared:  0.1478 
F-statistic: 6.377 on 1 and 30 DF,  p-value: 0.01708
insight::get_sigma(model)
[1] 5.563738
attr(,"class")
[1] "insight_aux" "numeric"    
  • The remaining indices (SE, t-value, p-value) are calculated using them

Frequentist Inference

Parameter   | Coefficient |    SE |          95% CI |  t(30) |     p
--------------------------------------------------------------------
(Intercept) |       4.925 | 1.765 | [ 1.319, 8.530] |  2.790 | 0.009
qsec        |      -0.096 | 0.098 | [-0.297, 0.105] | -0.972 | 0.339
  • The t-value is the ratio between the coefficient and its standard error (SE)
    • It can be seen as a “standardized” index of the coefficient’s precision (independent of the scale/unit of the variable)
    • It corresponds to a point on a t-distribution of corresponding degrees of freedom (df), centered at 0
    • This t-distribution is the assumption made in Frequentist statistics about the distribution of coefficients (effects) under the null hypothesis (Not to be confused with the distribution of residuals, which is assumed to be Normal)
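  • We can verify this by hand (a minimal sketch, refitting the model from the table above):
model <- lm(wt ~ qsec, data=mtcars)
coef_table <- summary(model)$coefficients  # Columns: Estimate, Std. Error, t value, Pr(>|t|)

# The t-value is simply the coefficient divided by its standard error
coef_table[, "Estimate"] / coef_table[, "Std. Error"]  # Matches the t(30) column above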

The t-value

  • Is our coefficient likely to be observed under the null hypothesis?
    • What is the probability of obtaining a coefficient at least as precise as ours (in either direction) under the null hypothesis?
    • \(Prob > |t|\)
Code
# Plot  t-distribution
x <- seq(-5, 5, length.out = 1000)
y <- dt(x, df=30)
df <- data.frame(x = x, y = y)

t_value <- 2.790 

df$Probability <- ifelse(df$x < -t_value, "< -t", "Smaller")
df$Probability <- ifelse(df$x > t_value, "> +t", df$Probability)
df |>
  ggplot(aes(x = x, y = y)) +
  geom_line() +
  geom_area(aes(x = x, y = y, fill = Probability), alpha = 0.5) +
  geom_segment(data=data.frame(x=t_value, y=0),
               aes(xend = t_value, yend = dt(t_value, df=30)), 
               color="red", linetype="solid", linewidth=1) +
  geom_point(data=data.frame(x=t_value, y=dt(t_value, df=30)), color="red", size=3) +
  theme_minimal() +
  scale_fill_manual(values = c("red", "red", "grey")) +
  labs(x = "\nt-values - standardized coefficient under the null hypothesis", y = "Probability", 
       title = paste0("t-Distribution (df=30); ", "t-value = ", round(t_value, 2), "\n"))

p-value

  • The p-value is the probability of observing a value greater than or equal to the t-value under the null hypothesis. In other words, it is the probability of obtaining test results at least as “big” (as precisely away from 0) as the result actually observed, if we repeated the experiment an infinite number of times and there were no true effect (see the numeric check after this list).
    • It is quite convoluted…
    • It is not a “trivial” value. It has some delicate aspects.
  • Estimation: It is the product of several complicated steps:
    • Estimate the SE of the coefficients (this is not a straightforward process and gets complicated in more complex models)
    • Estimate the df (degrees of freedom) of the model (this is also problematic for complex models, e.g., mixed models)
    • Make a strong assumption about the distribution of the coefficients under the null hypothesis (e.g., the t-distribution)
  • Interpretation:
    • We tend to focus on a dichotomous interpretation of the p-value (significant or not), based on an arbitrary threshold \(\alpha\) typically set to 5% (.05) for literally no reason in particular
    • We often interpret it as the probability of the null hypothesis being true, or as an index of effect magnitude, which is not correct (it’s more related to the width of the certainty around the estimate than its absolute size)
  • However, p-values are not fundamentally bad. They are very good at one thing: controlling how often, if we repeat an experiment infinitely, we’ll make a particular kind of error. It’s just that it’s often not what we care about in practice.
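  • As a sanity check, the p-value can be recomputed from the t-value and the df (a minimal sketch, using the qsec coefficient from the table above):
t_value <- -0.972  # t-value of the qsec coefficient
dof <- 30          # Residual degrees of freedom

# Two-sided p-value: probability of a t-value at least as extreme (in both directions)
2 * pt(abs(t_value), df = dof, lower.tail = FALSE)  # ~0.339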

Alternatives?

  • Can we do better and get more intuitive quantities?
    • The answer starts with “B”
    • And ends with “…ootstrapping”

Bootstrapping

Parent Population

  • Frequentist “inferential” statistics are used to make inferences about the parent population from which the data are sampled (this is a difference from the Bayesian framework, but more on that later)
  • We make a lot of assumptions about how the effects (and absence thereof) are distributed in the parent population; these assumptions allow us to estimate things like confidence intervals and p-values.
  • If we had access to the parent populations, we wouldn’t need to compute these indices; we would just say this is the “true” value of the effect.
  • But we usually don’t have access to the parent population, and we have to rely on the data at hand (assumed to be a random sample from the parent distribution) and infer/assume the shape of the parent distribution

“Data are random”

  • This assumption has an interesting consequence…
  • Data points can be seen as “interchangeable”!
  • We could make a new sample from the same data points
df <- mtcars
head(df)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
new_rows <- sample(1:nrow(mtcars), replace = TRUE)
new_rows
 [1] 26 14  3 24  4 17  3 28 31 30 18 14 11 20 27  8 19 31  3 20 12 19 12  7 20
[26]  4 20 16  1 11  5 30
new_sample <- mtcars[new_rows, ] 
  • This sample is theoretically “as good” as the original one
  • We could compute the model again on this new sample…

Rinse and repeat

  • Let’s do it n times and store all the coefficients in a vector
model <- lm(mpg ~ qsec, data=mtcars)  # Fit linear regression
true_coef <- coef(model)[2]  # Extract slope coefficient

coefs <- c()  # Initialize an empty vector of coefs
for (i in 1:5000) {  # Repeat the process 5000 times
  new_sample <- mtcars[sample(1:nrow(mtcars), replace = TRUE), ]  # Sample new data
  new_model <- lm(mpg ~ qsec, data=new_sample)  # recompute the model
  coefs <- c(coefs, coef(new_model)[2])  # Append the coef to the vector
}
Code
data.frame(coefs = coefs) |>
  ggplot(aes(x = coefs)) +
  geom_density(fill="orange") +
  geom_vline(xintercept = true_coef, color="red", linetype="dashed", linewidth=2) +
  theme_minimal() +
  labs(x = "Coefficient", y = "Frequency", 
       title = "Distribution of coefficients from 5000 bootstrapped samples\n") +
  coord_cartesian(xlim = c(-0.5, NA))

Why bootstrapping?

  • Instead of having a point-estimate of a statistical parameter, we now have a distribution of the statistic
  • This distribution is valid (flows from the premises of frequentist statistics) and can be used to directly estimate the uncertainty related to the statistic of interest
  • From now on, you need to shift your thinking from “point-estimates” (single values) to start thinking in terms of “distribution” of statistical values
  • How to describe this distribution?

Indices of Centrality

  • We usually still compute a point-estimate, useful to give an idea of the “central tendency” of the distribution (as our little brains don’t process uncertainty and ranges well); see the sketch after this list
  • Mean (average) of the distribution
    • Pros: Easy to compute, Easy to interpret (“average” value)
    • Cons: Sensitive to outliers, Not appropriate for non-symmetric distributions
  • Median (middle value) of the distribution
    • Pros: Easy to compute, Robust to outliers, Consistent interpretation in terms of probabilities
    • Cons: Too robust? (to variability in the tails)
  • Mode (most frequent value - peak - of the distribution) - aka Maximum A Posteriori (MAP)
    • Pros: Easy to interpret
    • Cons: Complex to compute, Problematic for multimodal distributions, Not really robust, Not always defined (e.g. uniform distributions)
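  • A minimal sketch computing these three indices on the bootstrapped coefficients from before:
mean(coefs)                      # Mean
median(coefs)                    # Median
bayestestR::map_estimate(coefs)  # Mode (MAP), estimated from the smoothed density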

Indices of Uncertainty

  • Standard deviation (SD), Median Absolute Deviation (MAD)
  • Credible Intervals (CI)
    • Equal-tailed Interval (ETI)
      • Easy to compute
      • Potentially problematic for skewed distributions
    • Highest Density Interval (HDI)
      • More complex to compute
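  • A minimal sketch computing these indices on the bootstrapped coefficients:
sd(coefs)   # Standard deviation
mad(coefs)  # Median Absolute Deviation

bayestestR::eti(coefs, ci = 0.95)  # Equal-tailed interval (quantile-based)
bayestestR::hdi(coefs, ci = 0.95)  # Highest Density Interval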

Why 95% CI?

  • Why do we use “95%” (0.95) for the CI?
  • Because of the .05 alpha level (used for the p-value)
  • Indeed, confidence intervals have a direct relationship with the p-value:
    • if the 95% CI is \([0.00, x]\) or \([x, 0.00]\), then the p-value = .05.
    • In other words, a 0.95 CI excluding 0 has a p-value < .05.
  • But this .05 alpha level is arbitrary…
    • Yes. And so is 95%. It is mostly a convention.
  • Recently, a trend emerged to “shake up” the conventions and highlight the arbitrary nature of the .05 alpha level
    • McElreath (2018) suggested using 89% CI instead of 95% CI. Why? Because why not!
  • Some software also use 90% or 94% CI as default
  • However, I still personally recommend 95%
    • See https://github.com/easystats/bayestestR/discussions/250
  • Importantly, the choice of the CI is for you to decide.

Indices of Significance

  • What does significance even mean?
  • Big?
  • Certain?
  • Reliable?

Probability of Direction (pd)

  • The Probability of Direction (pd) is defined as the probability that the effect is positive or negative (whichever is the largest)
  • It varies between 0.50 (50%) and 1.00 (100%), with 0.50 indicating that the effect is equally likely to be positive or negative, and 100% indicating that the effect is for sure positive or negative
  • We can conceptualize it as an index of effect existence (i.e., how likely is the effect to exist in a given direction regardless of its size)
    • If \(pd \rightarrow 100\%\), it means that there is certainly a positive|negative effect (but it can be small, large, or its size can be uncertain)
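  • A minimal sketch computing the pd of the bootstrapped coefficients:
# Proportion of the distribution that shares the sign of its median
bayestestR::p_direction(coefs)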

pd vs. p-value

  • Although conceptually distinct, the pd typically very strongly correlates with one-sided p-values
    • But it is not an index of significance (the interpretation is more straightforward)
  • For example, if the pd is > 97.5% (0.975), then the p-value is typically < .05
    • \(p_{onesided} = 1 - pd\)
    • \(p_{twosided} = 2(1 - pd)\)
  • Functions in bayestestR exist to convert between the two indices.
    • Use them only to gain an intuition; it is not a formal conversion
bayestestR::p_to_pd(c(0.05, 0.01, 0.001), direction = "two-sided")
[1] 0.9750 0.9950 0.9995
bayestestR::pd_to_p(c(0.90, 0.95), direction = "two-sided")
[1] 0.2 0.1

Probability of null effect?

  • The pd tells us the probability that a distribution (of an effect) is positive (> 0) or negative (< 0)
  • But can we know if the effect is either larger or smaller than zero (i.e., not null)?
  • In other words, can we know the probability of the effect being different from 0?
    • Which is the goal of Null Hypothesis Significance Testing (NHST)
  • Well, not really
    • The probability of a continuous distribution being different from a single value is always 100%: a single point has zero probability mass
    • An infinity of possible values vs. a single one, exactly 0.0000
  • What can we do?
    • What if instead of considering “no effect” as being exactly 0.0000, we consider it as being “close to 0.0000”?
    • What if we defined a range of “negligible” values, equivalent to 0 for practical purposes?
    • This is the idea behind the Region of Practical Equivalence (ROPE)

Region of Practical Equivalence (ROPE)

  • The idea of the ROPE is to explicitly define a range that we consider as “negligible” (i.e., equivalent to 0 for practical purposes)
  • For instance, say I decide to define a parameter smaller than |0.42| as negligible because why not?
    • Justified based on e.g., prior studies, preliminary data, theoretical knowledge, …
    • I preregister this hypothesis (so that I am not accused of making this choice after seeing the results)
  • What is the probability that \(|parameter| < 0.42\)?
data.frame(Coefficient = coefs) |> 
  ggplot(aes(x=Coefficient)) +
  geom_density(fill="blue") +
  geom_area(data=data.frame(x=c(-0.42, 0.42)), aes(x=x, y=2), fill="red", alpha=0.5) +
  coord_cartesian(xlim=c(-0.6, 3.5))

Compute ROPE

bayestestR::rope(coefs, range = c(-0.42, 0.42), ci = 1.00)
# Proportion of samples inside the ROPE [-0.42, 0.42]:

inside ROPE
-----------
0.66 %     
  • We often compute the proportion of a given Credible Interval that lies within the ROPE
  • The ROPE index is often computed as the proportion of the 95% CI of the distribution that lies within the ROPE bounds
    • Because the CI is considered more stable than the whole distribution, the resulting index is less prone to small random fluctuations
  • Be careful when reading and presenting results involving the ROPE: what ROPE range and CI were used?
# ci=0.95 is the default setting for rope()
result <- bayestestR::rope(coefs, range = c(-0.42, 0.42), ci = 0.95)
result
# Proportion of samples inside the ROPE [-0.42, 0.42]:

inside ROPE
-----------
0.00 %     
# The results can be plotted
plot(result)

Is p-ROPE an index of significance?

  • This index corresponds to the proportion of the distribution/Credible Interval that is within the ROPE
  • p-ROPE (\(P_{x~\in~ROPE}\)) can be interpreted as an index of significance
    • The probability that an effect is non-negligible
    • i.e., “significant” is here meant in the “large enough to matter” sense
    • Note that the p-ROPE index is neither equivalent to, nor highly correlated with, p-values
  • Note that “Equivalence Testing” is sometimes presented as an alternative to Null-Hypothesis Significance Testing (NHST), and can also be used with more traditional tests (with the idea of testing against a range of pre-specified “negligible” values instead of a point-null).

ROPE range

  • The main challenge with ROPE is to define the range of negligible values
  • It is possible to use rule-of-thumb guidelines for effect size classification
    • For correlations, Cohen (1988) suggests that r < 0.1 is a very small (i.e., negligible) effect
    • For standardized differences, Cohen’s d guidelines can be used similarly
  • It gets complicated for interaction parameters, or when the effect is not standardized
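  • bayestestR provides a helper implementing such rules of thumb; for linear models, the default range is based on \(\pm 0.1 \times SD_{y}\) (a minimal sketch):
model <- lm(mpg ~ qsec, data=mtcars)
bayestestR::rope_range(model)  # Default range of "negligible" values for this model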

Describing parameter distributions: Summary

bayestestR::describe_posterior(coefs)
Summary of Posterior Distribution

Parameter | Median |       95% CI |     pd |          ROPE | % in ROPE
----------------------------------------------------------------------
Posterior |   1.41 | [0.65, 2.63] | 99.92% | [-0.10, 0.10] |        0%
  • Describe the “centrality” (e.g., Median), the “uncertainty” (CI), and the “significance” (e.g., ROPE, pd, …)
bayestestR::describe_posterior(coefs, 
                               centrality="mean", 
                               dispersion=TRUE, 
                               ci=0.90, 
                               ci_method="hdi",
                               rope_range =c(-0.9, 0.9), 
                               test="rope")
Summary of Posterior Distribution

Parameter | Mean |   SD |       90% CI |          ROPE | % in ROPE
------------------------------------------------------------------
Posterior | 1.46 | 0.50 | [0.72, 2.30] | [-0.90, 0.90] |     8.51%
  • Choices must be made and justified: it is important to understand what the indices are and what they mean

Bootstrapping-specific indices?

  • Are these new indices (median, MAD, credible intervals, p-direction, ROPE) bootstrapping-specific?
    • NO! Bootstrapping is just a method that provides a parameter distribution instead of point-values
  • These indices are “simply” descriptive summaries of parameter distributions
  • Thus, they can be used whenever we have a distribution of parameters
  • ⚠️Spoiler Alert⚠️ Such as in Bayesian stats…

Bayesian Statistics

Bayes’ Theorem

  • \(P(A|B) = \frac{P(A)~P(B|A)}{P(B)}\)
  • P(A|B): How likely A is, given information about B
  • P(A) and P(B): Probability of A or B occurring on its own, without any information about the other
    • It’s our initial (prior) beliefs about A and B
    • P(B) is the marginal probability of B, and is often just a normalizing constant (so that the total probability = 1). The important part of the formula is the numerator
  • P(B|A): Probability of B occurring given that A has already occurred. It quantifies how likely B is when we know A has happened
    • It is the likelihood of B given A
  • Bayes’ Theorem allows us to update our belief about the probability of A happening (P(A|B)) based on new evidence provided by the occurrence of B. It combines our initial belief (P(A)), the likelihood of B happening when A is true (P(B|A)), and the overall likelihood of B (P(B)) to calculate this (a posteriori) updated probability
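  • A worked numeric example (with hypothetical probabilities chosen for illustration):
p_A <- 0.10          # Prior belief: P(A)
p_B_given_A <- 0.80  # Likelihood: P(B|A)
p_B <- 0.25          # Marginal probability of the evidence: P(B)

# Posterior: P(A|B) = P(A) * P(B|A) / P(B)
p_A * p_B_given_A / p_B  # 0.32: observing B updates our belief in A from 10% to 32%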

Bayesian inference

  • Bayesian inference is the application of the product rule of probability to real problems of inference
  • “Bayesian inference is really just counting and comparing of possibilities […] in order to make good inference about what actually happened, it helps to consider everything that could have happened” (McElreath, 2016)
  • “The first idea is that Bayesian inference is reallocation of credibility across possibilities. The second foundational idea is that the possibilities [i.e., the ”hypotheses”], over which we allocate credibility, are parameter values in meaningful mathematical models” (Kruschke, 2015)
    • Bayesian inference allows us to compute or approximate the plausibility (think density plot) of different parameter values of a statistical model given the data

Bayesian inference in practice

  • In science, we are interested in the probability of a hypothesis \(H\), given the data \(X\)
    • \(= P(H|X)\)
  • Before any data is collected, the researcher has some level of prior belief \(P(H)\) in this hypothesis
  • \(P(X|H)\): The researcher also has a model that makes specific predictions about the outcome given the data X (e.g., a linear model)
    • This model tells us how strongly the data are implied by a hypothesis
  • Bayes’ theorem tells us that \(P(H|X) =\frac{P(H)~P(X|H)}{P(X)}\)
  • In words: The probability of a hypothesis given the data is equal to the probability of the hypothesis before seeing the data, multiplied by the probability that the data occur if that hypothesis is true, divided by the probability of the observed data

In short

  • Bayesian statistics are an elegant framework to update probabilities based on evidence
  • \(P(H|X) \propto P(H)~P(X|H)\)
    • The probability of a hypothesis given the data is proportional to the prior probability of the hypothesis times the likelihood of the data given the hypothesis
  • It allows us to quantify the probability of a hypothesis given the data, whereas in Frequentist statistics, we can only quantify the probability of the data (data is random!) given a (null) hypothesis. Basically, it answers the opposite question.
  • While stemming from a simple theorem, it has deep consequences and powerful applications…

Bayesian inference for stats models

  • \(P(H|X) \propto P(H)~P(X|H)\)
  • In statistical models, P(H) is not a single probability but a (set of) prior distribution(s) corresponding to beliefs about some parameter(s)
  • P(X|H) corresponds to the data likelihood model (the link between the outcome and the predictor given some value of the parameter)
    • E.g., in a linear model \(y = 0 + 2x\), if \(x = 3\), then the predicted outcome is \(2 \times 3 = 6\)
  • P(H|X) corresponds to the posterior distribution of a parameter (given the prior and the data)
  • Resolving this equation empirically is actually very complex
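  • To get an intuition of the problem, here is a toy “grid approximation” sketch (only tractable for very simple cases; the intercept and sigma are fixed at hypothetical values to keep it one-dimensional):
x <- mtcars$mpg
y <- mtcars$qsec

grid <- seq(-1, 1, length.out = 1000)   # Candidate values for the slope
prior <- dnorm(grid, mean = 0, sd = 1)  # Prior belief: slope ~ Normal(0, 1)

# Likelihood of the data for each candidate slope (intercept fixed at 15, sigma at 2)
likelihood <- sapply(grid, function(b) prod(dnorm(y, mean = 15 + b * x, sd = 2)))

# Bayes' theorem: the posterior is proportional to prior * likelihood
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)  # Normalizing (the role of P(X))

grid[which.max(posterior)]  # Posterior mode of the slope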

Stan, JAGS, and BUGS

  • Modern full Bayesian inference (using MCMC) is known to be resource-intensive and slow; it is currently best performed (most quickly and flexibly) by dedicated languages, aka probabilistic programming languages
    • Stan (the most popular one): Named in honour of Stanislaw Ulam, pioneer of the Monte Carlo method
    • JAGS: “Just Another Gibbs Sampler”
    • BUGS: “Bayesian Inference Using Gibbs Sampling”
    • PyMC (Python)
    • Turing (Julia)
  • These languages are designed to perform Bayesian inference, and are optimized for that purpose
  • They are not general-purpose programming languages

About brms

  • There are “wrappers” around these languages that allow us to use them from other general purpose languages like R or Python
  • In R:
    • brms = Bayesian Regression Models using ‘Stan’
    • rstanarm = R interface to Stan for Applied Regression Modeling
  • rstanarm
    • Has different functions for different models (e.g., stan_lm(), stan_glm(), stan_lmer(), stan_glmer(), stan_gamm4() etc.)
    • Each model is implemented with precompiled Stan code
    • It is slightly easier to install, but less flexible
  • brms
    • Has become the most popular
    • Main difference is that brms has only one main function, brm(), which can fit almost any model
    • These models are translated into Stan code, and then compiled
    • Overhead time when fitting new models, but it is more flexible

Doing a Bayesian Regression

  • Frequentist model
model <- lm(qsec ~ mpg, data = mtcars) 
parameters::parameters(model)
Parameter   | Coefficient |   SE |         95% CI | t(30) |      p
------------------------------------------------------------------
(Intercept) |       15.35 | 1.03 | [13.25, 17.46] | 14.91 | < .001
mpg         |        0.12 | 0.05 | [ 0.02,  0.22] |  2.53 | 0.017 
  • Bayesian model
model <- brms::brm(qsec ~ mpg, data = mtcars, refresh=0) 
parameters::parameters(model)
# Fixed Effects

Parameter   | Median |         95% CI |     pd |  Rhat |     ESS
----------------------------------------------------------------
(Intercept) |  15.34 | [13.25, 17.47] |   100% | 1.000 | 3454.00
mpg         |   0.12 | [ 0.02,  0.22] | 99.30% | 1.000 | 3394.00

# Sigma

Parameter | Median |       95% CI |   pd |  Rhat |     ESS
----------------------------------------------------------
sigma     |   1.68 | [1.34, 2.22] | 100% | 1.000 | 3387.00
  • refresh=0 silences the printing of the sampling (to avoid cluttering this slide)
  • R-hat: Convergence diagnostic index. Should be < 1.05.
  • ESS: Effective Sample Size: how many “good iterations” (truly independent draws) we have. Should be > 400 (the maximum being the number of iterations that we draw).

Mixed Model

  • One function to rule them all.
model <- glmmTMB::glmmTMB(qsec ~ mpg + (1|gear), data = mtcars) 
parameters::parameters(model, effects="fixed")
# Fixed Effects

Parameter   | Coefficient |   SE |         95% CI |     z |      p
------------------------------------------------------------------
(Intercept) |       14.81 | 1.26 | [12.34, 17.28] | 11.75 | < .001
mpg         |        0.13 | 0.05 | [ 0.03,  0.23] |  2.64 | 0.008 
  • Bayesian model
model <- brms::brm(qsec ~ mpg + (1|gear), data = mtcars, refresh=0) 
parameters::parameters(model, effects="fixed")
# Fixed Effects

Parameter   | Median |         95% CI |     pd |  Rhat |     ESS
----------------------------------------------------------------
(Intercept) |  14.83 | [11.70, 17.92] |   100% | 1.001 |  887.00
mpg         |   0.13 | [ 0.03,  0.23] | 99.33% | 1.001 | 2213.00

# Sigma

Parameter | Median |       95% CI |   pd |  Rhat |     ESS
----------------------------------------------------------
sigma     |   1.35 | [1.05, 1.83] | 100% | 1.002 | 1658.00

How about Bayes factors?

  • Bayes factors are intuitive indices measuring the strength of evidence for one hypothesis over another
    • They have a natural interpretation in terms of odds (e.g., 3:1 odds), and have wide applications (comparing models, null-hypothesis testing, …)
    • They have well established interpretation guidelines (> 3; > 10; > 30; > 100)
  • They are “easy” to compute for simple statistical tests, but for more complex models, they are notoriously hard to compute
  • They are highly sensitive to the choice of priors, finding an “alternative” model is not at all trivial, and computing them is often very hard; these limitations have made them the focal point of debates (even between Bayesians)
  • There are different schools:
    • Some Bayesians believe that BFs are the best indices ever (assuming one understands them) and the Holy Grail of Bayesian statistics
    • Some believe that other indices are better and that BFs reproduce the same culture of binary thinking as Frequentist NHST
    • We’ll adopt a middle view: BFs are useful, in particular for simple tests (t-tests, correlations, comparing models), but for modeling, analyzing the posterior distributions is better
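  • For simple tests, dedicated packages make them straightforward to obtain (a minimal sketch, assuming the BayesFactor package is installed):
# Bayes factor for a correlation, comparing H1 (r != 0) against the null (r = 0)
BayesFactor::correlationBF(mtcars$wt, mtcars$qsec)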

Bayes Factors: Misconceptions

  • BFs do not tell us the absolute probability of a hypothesis being true. They quantify how much we should update our belief in a hypothesis. If this hypothesis were extremely unlikely a priori, we might still believe it to be very unlikely, even after computing a large BF.
  • BFs provide relative evidence between 2 hypotheses, and it is possible that one hypothesis is supported more than another hypothesis, while both hypotheses are false. Saying “BF shows evidence for the null (H0)” is incorrect. It should be “BF shows evidence for the H0 as compared to H1” (what H1 is matters).
  • BFs tell us nothing about the size of effects.
  • See Lakens’ post for details

A rabbit hole

  • There is a lot more to Bayesian statistics
    • Inference algorithms (MCMC, …)
    • Priors
    • Model comparison
    • Usage in neuroscientific theories
  • But it is easy to start using it

Questions?

Thank you!