Bayesian Statistics

5. Bayes Factors

Dominique Makowski
D.Makowski@sussex.ac.uk

Recap

  • We learned about Bayes’ theorem:
    • \(P(H|X) =\frac{P(H)~P(X|H)}{P(X)}\)
  • And how it can be used to update our beliefs about a hypothesis given new data

Final nail in the coffin for the p-value

  • Another thing often heard about the p-value is that it can only be used to reject the null hypothesis (effect = 0), not as an index of evidence in favour of the null
  • This is because the p-value, by definition, is uniformly distributed under the null hypothesis
    • This means that, if the null hypothesis is true, all values of the p-value are equally likely; p = .001 is just as likely as p = 1
p <- numeric(10000)  # Preallocate the vector of p-values
for(i in 1:10000) {
  # Two unrelated random variables: the null hypothesis is true
  x <- rnorm(100)
  y <- rnorm(100)
  p[i] <- cor.test(x, y)$p.value
}

library(ggplot2)

data.frame(p = p) |> 
  ggplot(aes(x=p)) +
  geom_histogram(bins=40, color="white", fill="grey") +
  labs(x="p-values", y="Count") +
  theme_bw()

Evidence against the null hypothesis?

  • Although Frequentists have recently come up with workarounds (e.g., region of practical equivalence testing), this is still seen as a limitation of the Frequentist framework
  • This is problematic in science because we are often interested in disproving a hypothesis, or proving that there is no effect
  • Does Uncle Bayes have a trick in his hat?

Bayes Factor (BF)

  • Bayes’ Theorem: \(P(H|data) =\frac{P(H)~P(data|H)}{P(data)}\)
    • Bayes’ theorem relates the probability of a hypothesis (H) given some observed evidence (data X) to the probability of data given the hypothesis, as well as the prior probability of the hypothesis
    • \(P(data|H)\) is the likelihood of the data given a hypothesis
    • It corresponds to a “model of the world”
    • In the previous example, we had “the model” of coin tosses. We know the likelihood of outcomes given various coin types because it corresponds to a physical reality (a fair coin gives 50% heads). So the “model” was obvious and easy to define
    • But often, the model’s likelihood depends on uncertain parameters over which we have priors (e.g., instead of fixing the probability of heads to 0.7, we define it as an unknown parameter within a range)
    • In real-life scenarios, the likelihood is complex to compute/approximate, but it is not impossible. And accessing the likelihood is very useful…
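To make this concrete, here is a minimal sketch (a hypothetical coin example, not from the slides): the likelihood of observing 7 heads in 10 tosses under a model H where the probability of heads is itself uncertain, with a Beta(2, 2) prior, is obtained by averaging the binomial likelihood over that prior:

# P(data | H) = integral over theta of P(data | theta) * P(theta | H)
# Hypothetical model H: P(heads) = theta, with theta ~ Beta(2, 2)
marginal_likelihood <- integrate(
  function(theta) dbinom(7, size=10, prob=theta) * dbeta(theta, 2, 2),
  lower=0, upper=1
)
marginal_likelihood$value  # the marginal likelihood of the data under H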

Bayes Factor (BF)

  • Bayes’ theorem provides a natural way to test hypotheses. Suppose we have two competing hypotheses: a hypothesis H1 and a null hypothesis H0
  • What if we compared the likelihood of the data given two competing hypotheses?
  • \(\frac{P(data|H_{0})}{P(data|H_{1})}\)
  • This ratio of likelihoods is known as the Bayes Factor (BF)
    • It quantifies the relative strength of evidence in favor of one hypothesis over another based on the observed evidence (data)
    • It tells us how much more likely the data (evidence) is under one hypothesis compared to another (often taking into account prior probabilities in the definition of the likelihood model)
  • \(BF_{10} = \frac{P(data|H_{1})}{P(data|H_{0})}\)
  • \(BF_{01} = \frac{P(data|H_{0})}{P(data|H_{1})}\)
    • Note the subscript convention: the first index indicates which hypothesis is in the numerator
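As an illustration (a hypothetical coin example), if H1 says the coin gives heads with probability 0.7 and H0 says it is fair, the Bayes Factor for 7 heads out of 10 tosses is simply the ratio of the two binomial likelihoods:

# Hypothetical example: 7 heads in 10 tosses
# H1: P(heads) = 0.7 (rigged coin); H0: P(heads) = 0.5 (fair coin)
likelihood_h1 <- dbinom(7, size=10, prob=0.7)  # P(data | H1)
likelihood_h0 <- dbinom(7, size=10, prob=0.5)  # P(data | H0)
likelihood_h1 / likelihood_h0  # BF10 of ~2.28: the data are ~2.3 times more likely under H1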

BFs with BIC approximation

  • The Bayes Factor is a ratio of likelihoods, which is often difficult to compute, especially for Bayesian models where parameters are themselves probabilistic
  • Wagenmakers (2007) (one of the principal proponents of Bayes factors) proposed an approximation of the Bayes Factor based on the Bayesian Information Criterion (BIC), which can easily be computed for traditional Frequentist models (e.g., linear models)
  • \(BF_{01} \approx exp(\frac{BIC_{1} - BIC_{0}}{2})\)

Model 0: qsec ~ mpg

head(mtcars[c("mpg", "qsec", "wt")])
                   mpg  qsec    wt
Mazda RX4         21.0 16.46 2.620
Mazda RX4 Wag     21.0 17.02 2.875
Datsun 710        22.8 18.61 2.320
Hornet 4 Drive    21.4 19.44 3.215
Hornet Sportabout 18.7 17.02 3.440
Valiant           18.1 20.22 3.460
mtcars |> 
  ggplot(aes(x=mpg, y=qsec)) +
  geom_point() +
  geom_smooth(method="lm", formula = "y ~ x") +
  theme_bw()

model0 <- lm(qsec ~ mpg, data = mtcars)
summary(model0)

Call:
lm(formula = qsec ~ mpg, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8161 -1.0287  0.0954  0.8623  4.7149 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15.35477    1.02978  14.911 2.05e-15 ***
mpg          0.12414    0.04916   2.525   0.0171 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.65 on 30 degrees of freedom
Multiple R-squared:  0.1753,    Adjusted R-squared:  0.1478 
F-statistic: 6.377 on 1 and 30 DF,  p-value: 0.01708
  • mpg is a significant predictor of qsec (\(\beta\) = 0.12, p < .05)
  • Is it worth it to add another predictor? Does it improve the model?

Model 1: qsec ~ mpg + wt

mtcars |> 
  ggplot(aes(x=wt, y=qsec)) +
  geom_point() +
  geom_smooth(method="lm", formula = "y ~ x") +
  theme_bw()

  • Do you think adding wt will improve the model?
model1 <- lm(qsec ~ mpg + wt, data = mtcars)
summary(model1)

Call:
lm(formula = qsec ~ mpg + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1092 -0.9617 -0.2343  0.8399  4.2769 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  6.92949    3.53431   1.961   0.0596 . 
mpg          0.32039    0.09138   3.506   0.0015 **
wt           1.39324    0.56286   2.475   0.0194 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.524 on 29 degrees of freedom
Multiple R-squared:  0.3191,    Adjusted R-squared:  0.2722 
F-statistic: 6.797 on 2 and 29 DF,  p-value: 0.003796
  • Be careful! A model with multiple predictors is different from the individual models taken separately (because the variables interact with each other)
  • Is model1 better than model0?
  • Higher R2, but more parameters (a more complex model). Comparing BICs is an alternative that takes the number of parameters into account

Extract BICs

perf0 <- performance::performance(model0)
perf0
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
126.781 | 127.638 | 131.178 | 0.175 |     0.148 | 1.597 | 1.650
bic0 <- perf0$BIC
bic0
[1] 131.1783
bic1 <- performance::performance(model1)$BIC
bic1
[1] 128.5105
  • Normally, the lower the BIC, the better the model
  • Hard to interpret in absolute terms (the unit is meaningless), but we can compare models (if they are fitted on the same data)

Compare BICs

  • \(BF_{10} \approx exp(\frac{BIC_{0} - BIC_{1}}{2})\)
  • We can apply the formula to bic1 and bic0
bf10 <- exp((bic0 - bic1) / 2)
bf10
[1] 3.795896
  • You computed an approximation of the Bayes Factor “by hand”!

Using bayestestR

bayestestR::bayesfactor_models(model1, denominator=model0)
Bayes Factors for Model Comparison

         Model      BF
[model1] mpg + wt 3.80

* Against Denominator: [model0] mpg
*   Bayes Factor Type: BIC approximation

BF10 vs. BF01

  • Bayes factors are a ratio of the likelihoods of the data under two competing hypotheses
  • They quantify “how much” the data is more likely under one model compared to another
  • \(BF_{10} = 3.80\) means the data is 3.80 times more likely under model1 than model0
  • Importantly, BFs are “reversible”:
  • \(BF_{01} = \frac{1}{BF_{10}} = 1 / 3.80 = 0.263\)
  • The data is 0.26 times more likely under model0 than model1
  • We gather evidence both ways (in favour or against a hypothesis)
bayestestR::bayesfactor_models(model0, denominator=model1)
Bayes Factors for Model Comparison

         Model    BF
[model0] mpg   0.263

* Against Denominator: [model1] mpg + wt
*   Bayes Factor Type: BIC approximation
  • Is this “big”? How to interpret BFs?

Bayes Factor Interpretation

  • The conventional thresholds for a “significant” BF are 3 and 10
    • Jeffreys, H. (1961), Theory of Probability, 3rd ed., Oxford University Press, Oxford
Bayes Factor (\(BF_{10}\)) | Interpretation
> 100                      | Extreme evidence for H1
30 – 100                   | Very strong evidence for H1
10 – 30                    | Strong evidence for H1
3 – 10                     | Moderate evidence for H1
1 – 3                      | Anecdotal evidence for H1
1                          | No evidence
1/3 – 1                    | Anecdotal evidence for H0
1/10 – 1/3                 | Moderate evidence for H0
1/30 – 1/10                | Strong evidence for H0
1/100 – 1/30               | Very strong evidence for H0
< 1/100                    | Extreme evidence for H0

Bayes Factor Interpretation

  • The effectsize package provides many automated interpretation functions
effectsize::interpret_bf(bf10)
[1] "moderate evidence in favour of"
(Rules: jeffreys1961)
  • Note that because BFs can be very big or very small, they are sometimes presented on the log scale (\(log~BF_{10}\)), in which case BFs smaller than 1 become negative numbers
log(3.8)
[1] 1.335001
log(1/3.8)
[1] -1.335001
  • To “unlog” a logged BF, use exp()
exp(1.335001)
[1] 3.8

Exercise

  • Compute the Bayes Factor for the following models: qsec ~ wt vs. a constant model
  • A constant model, aka an “intercept-only” model, is a model with no predictors (only an intercept)
  • A constant model basically predicts the mean of the outcome variable for all observations
mean(mtcars$qsec)
[1] 17.84875
model0 <- lm(qsec ~ 1, data = mtcars)
model0

Call:
lm(formula = qsec ~ 1, data = mtcars)

Coefficients:
(Intercept)  
      17.85  
  • Compute the BF, interpret it, and draw a conclusion

Solution

model1 <- lm(qsec ~ wt, data = mtcars)
model1

Call:
lm(formula = qsec ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
    18.8753      -0.3191  
bayestestR::bayesfactor_models(model1, denominator=model0)
Bayes Factors for Model Comparison

         Model    BF
[model1] wt    0.290

* Against Denominator: [model0] (Intercept only)
*   Bayes Factor Type: BIC approximation
  • BF < 1/3 (0.3333)
effectsize::interpret_bf(0.290)
[1] "moderate evidence against"
(Rules: jeffreys1961)

Exercise 2

  • “I try to analyze the linear relationship between the sepal length and the sepal width of the iris flowers (iris built in dataset). Should I control for Species?”
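One way to set this up (a sketch; we reuse the names model0 and model1 so that the pizza plot below can be applied to these models):

# Hypothetical setup: does adding Species improve the model?
model0 <- lm(Sepal.Length ~ Sepal.Width, data=iris)
model1 <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
bayestestR::bayesfactor_models(model1, denominator=model0)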

  • Pizzas can help us get an intuitive feeling for BFs


Pizza plot

library(see)  # The plot() methods for bayestestR objects come from the {see} package

plot(bayestestR::bayesfactor_models(model1, denominator=model0)) +
  scale_fill_manual(values = c("red", "yellow")) 

t-Tests

  • True Bayes factors (i.e., not approximations based on Frequentist models) can also be computed for “simple” tests, such as t-tests
  • t-tests test for the difference of means of a variable between two groups (or against a constant value)
  • This is a frequentist t-test:
# `am` is a binary variable (0 and 1)
t.test(mtcars$mpg ~ mtcars$am)

    Welch Two Sample t-test

data:  mtcars$mpg by mtcars$am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean in group 0 mean in group 1 
       17.14737        24.39231 
mtcars |> 
  ggplot(aes(x=as.factor(am), y=mpg)) +
  geom_jitter(width=0.1, size=3)  

  • We conclude that there is a significant difference between the two groups (am == 1 > am == 0, p < .01)

Bayesian t-Tests

  • They are implemented in the BayesFactor package (by Richard Morey), where BFs are computed using a technique called Gaussian quadrature, which is possible for simple models (the details are not important)
  • The Bayes factor compares two hypotheses: that the standardized effect size is 0, or that the standardized effect size is not 0.
library(BayesFactor)  # Case sensitive!

# Note the use of `formula` and `data` arguments
BayesFactor::ttestBF(formula=mpg ~ am, data=mtcars)
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 86.58973 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS
  • What can we conclude?
  • \(BF_{10}\) = 86.59
  • Very strong evidence in favour of the hypothesis that there is a non-null difference between the two groups
  • What about priors?
    • In this particular case, let’s just say that they are good enough “default” priors (there is a lot of literature justifying them). They can be used “as-is” without too much worry*
    • *Some people might lose it if they see this
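The r=0.707 in the output above is the scale of the Cauchy prior that the JZS approach places on the standardized effect size. A quick sketch of what this default prior looks like (the plotting code is ours, not from the original slides):

# Default prior on the standardized difference (Cohen's d) in BayesFactor's
# t-tests: a Cauchy distribution centred on 0 with scale r = 0.707
xspace <- seq(-3, 3, by=0.01)
data.frame(d = xspace, density = dcauchy(xspace, location=0, scale=0.707)) |> 
  ggplot(aes(x=d, y=density)) +
  geom_line(color="blue", linewidth=2) +
  theme_bw()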

Correlation Test

  • The BayesFactor package also implements correlation tests
library(BayesFactor)  # Case sensitive!

BayesFactor::correlationBF(mtcars$drat, mtcars$qsec)
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.4322745 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
  • What can we conclude?
  • \(BF_{10}\) = 0.43
  • No conclusive evidence in favour of either hypothesis (null vs. non-null correlation)

Priors

  • Remember: the Bayes Factor is a ratio of the likelihood of the data under the two models
  • \(\underbrace{\frac{P(H_1 | D)}{P(H_0 | D)}}_{posterior~odds} = \underbrace{\frac{P(D | H_1)}{P(D | H_0)}}_{Bayes~Factor} \cdot \underbrace{\frac{P(H_1)}{P(H_0)}}_{prior~odds}\)
  • The Bayes Factor (a ratio of likelihoods) can be interpreted as an updating factor: how much the data updated our prior beliefs
    • In other words, how much did we move from Prior to Posterior
  • The stronger the BF, the stronger the “difference” between the prior and the posterior
  • But what is the prior?
    • Remember, we are trying to estimate the correlation coefficient rho \(\rho\) after seeing the data
    • But we need to have a prior idea of what \(\rho\) could be
  • How to specify a prior for \(\rho\)?

Prior Distribution

  • In the previous examples, we had priors in the form of discrete categories (e.g., 1/3 that the coin is fair, 1/3 that it’s rigged, …)

  • But how to specify a prior for a continuous variable like a correlation coefficient?
  • Using a distribution!
  • Let’s imagine taking a normal distribution as a prior for \(\rho\) (i.e., we think that \(\rho\) is normally distributed)
  • Imagine that someone proposes the following parameters
    • e.g., \(\mu = 0.42, \sigma = 0.333\)
xspace <- seq(-1.5, 1.5, by=0.01)
prob <- dnorm(xspace, mean=0.42, sd=0.333)

data <- data.frame(x = xspace, y = prob) 

data |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(color="blue", linewidth=3) 

Exercise

  • Do you think this is a good prior?
  • Why is this not a very good prior?
  • Try to come up with a suitable prior for the correlation coefficient \(\rho\)

Solution

  • Priors represent your beliefs before seeing the data
    • They can come from previous studies, from your own experience, from your intuition, …
  • Do we have a hypothesis about the direction (positive vs. negative) of the correlation between drat and qsec?
    • No. The correlation could be positive or negative.
  • Do we have reasons to expect a non-null correlation?
    • No. 
    • Our prior could be symmetric around 0, with a higher probability on 0 (i.e., we think that the correlation is more likely to be 0 than any other value)
  • What about \(Normal(0, 0.5)\)?
data$y <- dnorm(xspace, mean=0, sd=0.5) 

data |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(color="blue", linewidth=3) 

  • What’s the problem?

Priors for Correlation Coefficients?

  • Bayesian inference is, fundamentally, the process of reallocating credibility across candidate parameter values
    • From prior to posterior
  • Priors should incorporate “domain knowledge” (critical for good Bayesian models)
    • For example, knowledge about the range of possible values
  • We know that the correlation coefficient is bounded between -1 and 1 by definition. We should incorporate this knowledge in our prior and assign a probability of 0 to values outside of this range
  • How can we do that?
    • We could use a truncated normal distribution (e.g., \(Normal(0, 0.5)\) truncated between -1 and 1; see the sketch after this list)
    • Or pick a distribution that has bounds
  • What is a distribution that has negative and positive bounds?
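Before answering, here is a minimal sketch of the truncated option mentioned above (our own illustration: a \(Normal(0, 0.5)\) cut at ±1 and renormalized so that it integrates to 1):

# Truncated Normal(0, 0.5) on [-1, 1]: zero out the density outside the
# bounds, then renormalize by the probability mass inside [-1, 1]
xspace <- seq(-1.5, 1.5, by=0.01)
dens <- dnorm(xspace, mean=0, sd=0.5)
dens[abs(xspace) > 1] <- 0
dens <- dens / (pnorm(1, 0, 0.5) - pnorm(-1, 0, 0.5))

data.frame(x = xspace, y = dens) |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(color="blue", linewidth=3)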

What about Uniform?

  • \(Uniform(-1, 1)\) means that all values between -1 and 1 are equally likely
# Compute the uniform distribution
data$y <- dunif(xspace, -1, 1) 

data |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(color="blue", linewidth=3) +
  theme_bw()

  • This is a non-informative prior
    • i.e., we don’t introduce any prior knowledge about the correlation coefficient (other than domain knowledge about what a correlation coefficient is)
    • As we will see later, uniform priors tend to give results that are similar to frequentist maximum likelihood estimates (i.e., Bayesian models with uniform priors tend to have very similar results to frequentist models)
  • But they are not recommended.
    • They are often unrealistic: we tend to have some expectations about our parameters, and priors should reflect these beliefs

What other prior?

  • What is another distribution that has bounds?
    • The beta distribution!
  • But what’s the problem?
    • It is defined on the interval [0, 1]
  • Solution?
    • Distributions can often be scaled and shifted

Shifted and Scaled Beta

  • We can have a shifted & scaled beta distribution that is defined on the interval [-1, 1] with a peak at 0
  • How about \(Beta(1.5, 1.5)\) rescaled and shifted to [-1, 1]?
# Compute the beta distribution
data$y <- dbeta(xspace, 1.5, 1.5)
# Rescale its x-axis from [0, 1] to [-1, 1]
# (strictly, the density should also be halved after stretching;
# omitted here for illustration)
data$x2 <- (data$x - 0.5) * 2

data |>
  ggplot(aes(y=y)) +
  geom_line(aes(x=x), color="blue", linewidth=3) +
  geom_line(aes(x=x2), color="orange", linewidth=2, alpha=0.9) +
  coord_cartesian(xlim=c(-1.2, 1.2)) +
  theme_bw()

  • Original distribution in blue, shifted and scaled version in orange

Meaning

  • This means that we think that the correlation coefficient is more likely to be 0, and then “gradually” less likely as we move away from 0, with very low likelihood for values close to -1 or 1

BayesFactor’s defaults

  • What does BayesFactor use?
  • We can read the documentation of correlationBF to find out
?BayesFactor::correlationBF

Noninformative priors are assumed for the population means and variances of the two population; a shifted, scaled beta(1/rscale,1/rscale) prior distribution is assumed for \(\rho\) […]

For the rscale argument, several named values are recognized: “medium.narrow”, “medium” [default], “wide”, and “ultrawide”. These correspond to r scale values of \(1/\sqrt{27}\), 1/3, \(1/\sqrt{3}\), and \(1\), respectively.

Visualize BayesFactor’s defaults

# List of rscales
rscales <- list(
  "medium.narrow" = 1/sqrt(27),
  "medium" = 1/3,
  "wide" = 1/sqrt(3),
  "ultrawide" = 1
)
# Compute the beta distribution following the formula 
# And store in the same dataframe

xspace <- seq(0, 1, by=0.01)  # Original beta is ]0, 1[

data <- data.frame()  # Initialize an empty dataframe
for(scale in names(rscales)) {
  rscale <- rscales[[scale]]
  dat <- data.frame(
    y = dbeta(xspace, 1/rscale, 1/rscale),
    x = (xspace - 0.5) * 2, # Rescale its x-axis
    scale = scale
  ) 
  data <- rbind(data, dat)
}
  • Exercise: plot that!
data |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(aes(color=scale), linewidth=2) +
  coord_cartesian(xlim=c(-1.2, 1.2)) 

Priors impact Bayes Factors

  • The prior we choose has an impact on the Bayes Factor
BayesFactor::correlationBF(mtcars$drat, mtcars$qsec, 
                           rscale="medium.narrow")
Bayes factor analysis
--------------
[1] Alt., r=0.192 : 0.5409961 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
BayesFactor::correlationBF(mtcars$drat, mtcars$qsec, 
                           rscale="medium")
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.4322745 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
BayesFactor::correlationBF(mtcars$drat, mtcars$qsec, 
                           rscale="wide")
Bayes factor analysis
--------------
[1] Alt., r=0.577 : 0.3330176 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
BayesFactor::correlationBF(mtcars$drat, mtcars$qsec, 
                           rscale="ultrawide")
Bayes factor analysis
--------------
[1] Alt., r=1 : 0.2473768 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
  • The Bayes Factor is sensitive to the choice of prior
  • In this case, the wider the prior, the lower the Bayes factor
    • Why? A wider prior spreads mass over larger correlations; since the data do not support them, the data are on average less likely under the alternative
  • It’s not always the case

Misconceptions about Bayes Factors

  • BFs do not tell us the absolute probability of a hypothesis being true. They quantify how much we should update our belief in a hypothesis: if this hypothesis was extremely unlikely a priori, we might still believe it to be very unlikely, even after computing a large BF (see the worked example after this list).
  • BFs provide relative evidence between 2 hypotheses, and it is possible that one hypothesis is supported more than another while both are false. Saying “the BF shows evidence for the null (H0)” is incorrect; it should be “the BF shows evidence for H0 as compared to H1” (what H1 is matters).
  • BFs tell us nothing about the size of effects.
  • See Lakens’ post for details
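A hypothetical worked example of the first point, using the posterior odds formula from earlier:

# Hypothetical: H1 is deemed very implausible a priori (1:100 odds)
prior_odds <- 1/100
bf10 <- 30  # "strong evidence" for H1 relative to H0
posterior_odds <- bf10 * prior_odds
posterior_odds  # 0.3: despite a strong BF, H0 remains ~3 times more plausible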

Bayes Factors: Summary

  • Bayes factors are intuitive indices measuring the strength of evidence for one hypothesis over another
    • They have a natural interpretation in terms of odds (e.g., 3:1 odds), and have wide applications (comparing models, null-hypothesis testing, …)
    • They have well established interpretation guidelines (> 3; > 10; > 30; > 100)
  • They are “easy” to compute for simple statistical tests, but for more complex models, they are notoriously hard to compute
  • Their high sensitivity to the choice of priors, the fact that finding an “alternative” model is not at all trivial, and the difficulty of computing them have been seen as limitations, and have made them a focal point of debates (even between Bayesians)
  • There are different schools:
    • Some Bayesians believe that BFs are the best indices ever (assuming one understands them) and the Holy Grail of Bayesian statistics
    • Some believe that other indices are better and that BFs reproduce the same culture of binary thinking as Frequentist NHST
    • We’ll adopt a middle view: BFs are useful, in particular for simple tests, but they are not the only good thing in Bayesian statistics

Exercise

  • Compute and report the Bayes Factor for the correlation between Sepal.Length and Sepal.Width from the iris dataset.

Recap

  • \(\underbrace{\frac{P(H_1 | D)}{P(H_0 | D)}}_{posterior~odds} = \underbrace{\frac{P(D | H_1)}{P(D | H_0)}}_{Bayes~Factor} \cdot \underbrace{\frac{P(H_1)}{P(H_0)}}_{prior~odds}\)
  • We have seen what Bayes Factors are
  • We have seen how we can place priors on parameters (e.g., for correlations)
  • But… what about the posterior?

The End (for now)

Thank you!