Bayesian Statistics

5. Bayes Factors

Dominique Makowski
D.Makowski@sussex.ac.uk

Recap

  • We learned about Bayes’ theorem:
    • \(P(H|X) =\frac{P(H)~P(X|H)}{P(X)}\)
  • And how it can be used to update our beliefs about a hypothesis given new data

Final nail in the coffin for the p-value

  • Another thing often heard about the p-value is that it can only be used to reject the null hypothesis (effect = 0), not as an index of evidence in favour of the null
  • This is because the p-value, by definition, is uniformly distributed under the null hypothesis
    • This means that, if the null hypothesis is true, all values of the p-value are equally likely; p = .001 is just as likely as p = 1
p <- numeric(10000)  # Preallocate the vector of p-values
for(i in 1:10000) {
  # Two unrelated random variables: the null hypothesis is true
  x <- rnorm(100)
  y <- rnorm(100)
  p[i] <- cor.test(x, y)$p.value
}

library(ggplot2)

data.frame(p = p) |> 
  ggplot(aes(x=p)) +
  geom_histogram(bins=40, color="white", fill="grey") +
  labs(x="p-values", y="Count") +
  theme_bw()

Evidence against the null hypothesis?

  • Although Frequentists have recently come up with workarounds (e.g., region of practical equivalence testing), this is still seen as a limitation of the Frequentist framework
  • This is problematic in science because we are often interested in disproving a hypothesis, or proving that there is no effect
  • Does Uncle Bayes have a trick in his hat?

Bayes Factor (BF)

  • Bayes’ Theorem: \(P(H|data) =\frac{P(H)~P(data|H)}{P(data)}\)
    • Bayes’ theorem relates the probability of a hypothesis (H) given some observed evidence (data X) to the probability of data given the hypothesis, as well as the prior probability of the hypothesis
    • \(P(data|H)\) is the likelihood of the data given a hypothesis
    • It corresponds to a “model of the world”
    • In the previous example, we had “the model” of coin tosses. We know the likelihood of outcomes given various coin types because it corresponds to a physical reality (a fair coin gives 50% heads). So the “model” was obvious and easy to define
    • But often, the model’s likelihood depends on uncertain parameters over which we have priors (e.g., instead of fixing the probability of heads to 0.7, we define it as an unknown parameter within a range)
    • In real-life scenarios, the likelihood is complex to compute/approximate, but it is not impossible. And accessing the likelihood is very useful…
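To make this concrete, here is a minimal sketch (a hypothetical coin example, not from the slides): the likelihood of observing 7 heads in 10 tosses under a model H where the probability of heads is itself uncertain, with a Beta(2, 2) prior, is obtained by averaging the binomial likelihood over that prior:

# P(data | H) = integral over theta of P(data | theta) * P(theta | H)
# Hypothetical model H: P(heads) = theta, with theta ~ Beta(2, 2)
marginal_likelihood <- integrate(
  function(theta) dbinom(7, size=10, prob=theta) * dbeta(theta, 2, 2),
  lower=0, upper=1
)
marginal_likelihood$value  # the marginal likelihood of the data under H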

Bayes Factor (BF)

  • Bayes’ theorem provides a natural way to test hypotheses. Suppose we have two competing hypotheses: a hypothesis H1 and a null hypothesis H0
  • What if we compared the likelihood of the data given two competing hypotheses?
  • \(\frac{P(data|H_{0})}{P(data|H_{1})}\)
  • This ratio of likelihoods is known as the Bayes Factor (BF)
    • It quantifies the relative strength of evidence in favor of one hypothesis over another based on the observed evidence (data)
    • It tells us how much more likely the data (evidence) is under one hypothesis compared to another (often taking into account prior probabilities in the definition of the likelihood model)
  • \(BF_{10} = \frac{P(data|H_{1})}{P(data|H_{0})}\)
  • \(BF_{01} = \frac{P(data|H_{0})}{P(data|H_{1})}\)
    • Note the subscript convention: the first index indicates which hypothesis is in the numerator
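As an illustration (a hypothetical coin example), if H1 says the coin gives heads with probability 0.7 and H0 says it is fair, the Bayes Factor for 7 heads out of 10 tosses is simply the ratio of the two binomial likelihoods:

# Hypothetical example: 7 heads in 10 tosses
# H1: P(heads) = 0.7 (rigged coin); H0: P(heads) = 0.5 (fair coin)
likelihood_h1 <- dbinom(7, size=10, prob=0.7)  # P(data | H1)
likelihood_h0 <- dbinom(7, size=10, prob=0.5)  # P(data | H0)
likelihood_h1 / likelihood_h0  # BF10 of ~2.28: the data are ~2.3 times more likely under H1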

BFs with BIC approximation

  • The Bayes Factor is a ratio of likelihoods, which is often difficult to compute, especially for Bayesian models where parameters are themselves probabilistic
  • Wagenmakers (2007) (one of the principal proponents of Bayes factors) proposed an approximation of the Bayes Factor based on the Bayesian Information Criterion (BIC), which can easily be computed for traditional Frequentist models (e.g., linear models)
  • \(BF_{01} \approx exp(\frac{BIC_{1} - BIC_{0}}{2})\)

Model 0: qsec ~ mpg

head(mtcars[c("mpg", "qsec", "wt")])
                   mpg  qsec    wt
Mazda RX4         21.0 16.46 2.620
Mazda RX4 Wag     21.0 17.02 2.875
Datsun 710        22.8 18.61 2.320
Hornet 4 Drive    21.4 19.44 3.215
Hornet Sportabout 18.7 17.02 3.440
Valiant           18.1 20.22 3.460
mtcars |> 
  ggplot(aes(x=mpg, y=qsec)) +
  geom_point() +
  geom_smooth(method="lm", formula = "y ~ x") +
  theme_bw()

model0 <- lm(qsec ~ mpg, data = mtcars)
summary(model0)

Call:
lm(formula = qsec ~ mpg, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8161 -1.0287  0.0954  0.8623  4.7149 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15.35477    1.02978  14.911 2.05e-15 ***
mpg          0.12414    0.04916   2.525   0.0171 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.65 on 30 degrees of freedom
Multiple R-squared:  0.1753,    Adjusted R-squared:  0.1478 
F-statistic: 6.377 on 1 and 30 DF,  p-value: 0.01708
  • mpg is a significant predictor of qsec (\(\beta\) = 0.12, p < .05)
  • Is it worth it to add another predictor? Does it improve the model?

Model 1: qsec ~ mpg + wt

mtcars |> 
  ggplot(aes(x=wt, y=qsec)) +
  geom_point() +
  geom_smooth(method="lm", formula = "y ~ x") +
  theme_bw()

  • Do you think adding wt will improve the model?
model1 <- lm(qsec ~ mpg + wt, data = mtcars)
summary(model1)

Call:
lm(formula = qsec ~ mpg + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1092 -0.9617 -0.2343  0.8399  4.2769 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  6.92949    3.53431   1.961   0.0596 . 
mpg          0.32039    0.09138   3.506   0.0015 **
wt           1.39324    0.56286   2.475   0.0194 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.524 on 29 degrees of freedom
Multiple R-squared:  0.3191,    Adjusted R-squared:  0.2722 
F-statistic: 6.797 on 2 and 29 DF,  p-value: 0.003796
  • Be careful! A model with multiple predictors is different from the individual models taken separately (because the variables interact with each other)
  • Is model1 better than model0?
  • Higher R2, but more parameters (a more complex model). Comparing BICs is an alternative that takes the number of parameters into account

Extract BICs

perf0 <- performance::performance(model0)
perf0
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
126.781 | 127.638 | 131.178 | 0.175 |     0.148 | 1.597 | 1.650
bic0 <- perf0$BIC
bic0
[1] 131.1783
bic1 <- performance::performance(model1)$BIC
bic1
[1] 128.5105
  • Normally, the lower the BIC, the better the model
  • Hard to interpret in absolute terms (the unit is meaningless), but we can compare models (if they are fitted on the same data)

Compare BICs

  • \(BF_{10} \approx exp(\frac{BIC_{0} - BIC_{1}}{2})\)
  • We can apply the formula to bic1 and bic0
bf10 <- exp((bic0 - bic1) / 2)
bf10
[1] 3.795896
  • You computed an approximation of the Bayes Factor “by hand”!

Using bayestestR

bayestestR::bayesfactor_models(model1, denominator=model0)
Bayes Factors for Model Comparison

         Model      BF
[model1] mpg + wt 3.80

* Against Denominator: [model0] mpg
*   Bayes Factor Type: BIC approximation

BF10 vs. BF01

  • Bayes factors are a ratio of the likelihoods of the data under two competing hypotheses
  • They quantify “how much” the data is more likely under one model compared to another
  • \(BF_{10} = 3.80\) means the data is 3.80 times more likely under model1 than model0
  • Importantly, BFs are “reversible”:
  • \(BF_{01} = \frac{1}{BF_{10}} = 1 / 3.80 = 0.263\)
  • The data is 0.26 times more likely under model0 than model1
  • We gather evidence both ways (in favour or against a hypothesis)
bayestestR::bayesfactor_models(model0, denominator=model1)
Bayes Factors for Model Comparison

         Model    BF
[model0] mpg   0.263

* Against Denominator: [model1] mpg + wt
*   Bayes Factor Type: BIC approximation
  • Is this “big”? How to interpret BFs?

Bayes Factor Interpretation

  • The conventional thresholds for a “significant” BF are 3 and 10
    • Jeffreys, H. (1961), Theory of Probability, 3rd ed., Oxford University Press, Oxford
Bayes Factor (\(BF_{10}\)) | Interpretation
> 100                      | Extreme evidence for H1
30 – 100                   | Very strong evidence for H1
10 – 30                    | Strong evidence for H1
3 – 10                     | Moderate evidence for H1
1 – 3                      | Anecdotal evidence for H1
1                          | No evidence
1/3 – 1                    | Anecdotal evidence for H0
1/10 – 1/3                 | Moderate evidence for H0
1/30 – 1/10                | Strong evidence for H0
1/100 – 1/30               | Very strong evidence for H0
< 1/100                    | Extreme evidence for H0

Bayes Factor Interpretation

  • The effectsize package provides many automated interpretation functions
effectsize::interpret_bf(bf10)
[1] "moderate evidence in favour of"
(Rules: jeffreys1961)
  • Note that because BFs can be very big or very small, they are sometimes presented on the log scale (\(log~BF_{10}\)), in which case BFs smaller than 1 become negative numbers
log(3.8)
[1] 1.335001
log(1/3.8)
[1] -1.335001
  • To “unlog” a logged BF, use exp()
exp(1.335001)
[1] 3.8

Exercise

  • Compute the Bayes Factor for the following models: qsec ~ wt vs. a constant model
  • A constant model, aka an “intercept-only” model, is a model with no predictors (only an intercept)
  • A constant model basically predicts the mean of the outcome variable for all observations
mean(mtcars$qsec)
[1] 17.84875
model0 <- lm(qsec ~ 1, data = mtcars)
model0

Call:
lm(formula = qsec ~ 1, data = mtcars)

Coefficients:
(Intercept)  
      17.85  
  • Compute the BF, interpret it, and draw a conclusion

Solution

model1 <- lm(qsec ~ wt, data = mtcars)
model1

Call:
lm(formula = qsec ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
    18.8753      -0.3191  
bayestestR::bayesfactor_models(model1, denominator=model0)
Bayes Factors for Model Comparison

         Model    BF
[model1] wt    0.290

* Against Denominator: [model0] (Intercept only)
*   Bayes Factor Type: BIC approximation
  • BF < 1/3 (0.3333)
effectsize::interpret_bf(0.290)
[1] "moderate evidence against"
(Rules: jeffreys1961)

Exercise 2

  • “I try to analyze the linear relationship between the sepal length and the sepal width of the iris flowers (iris built in dataset). Should I control for Species?”
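One way to set this up (a sketch; we reuse the names model0 and model1 so that the pizza plot below can be applied to these models):

# Hypothetical setup: does adding Species improve the model?
model0 <- lm(Sepal.Length ~ Sepal.Width, data=iris)
model1 <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
bayestestR::bayesfactor_models(model1, denominator=model0)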

  • Pizzas can help us get an intuitive feeling for BFs


Pizza plot

library(see)  # The plot() methods for bayestestR objects come from the {see} package

plot(bayestestR::bayesfactor_models(model1, denominator=model0)) +
  scale_fill_manual(values = c("red", "yellow")) 

t-Tests

  • True Bayes factors (i.e., not approximations based on Frequentist models) can also be computed for “simple” tests, such as t-tests
  • t-tests test for the difference of means of a variable between two groups (or against a constant value)
  • This is a frequentist t-test:
# `am` is a binary variable (0 and 1)
t.test(mtcars$mpg ~ mtcars$am)

    Welch Two Sample t-test

data:  mtcars$mpg by mtcars$am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean in group 0 mean in group 1 
       17.14737        24.39231 
mtcars |> 
  ggplot(aes(x=as.factor(am), y=mpg)) +
  geom_jitter(width=0.1, size=3)  

  • We conclude that there is a significant difference between the two groups (am == 1 > am == 0, p < .01)

Bayesian t-Tests

  • They are implemented in the BayesFactor package (by Richard Morey), where BFs are computed using a technique called Gaussian quadrature, which is possible for simple models (the details are not important)
  • The Bayes factor compares two hypotheses: that the standardized effect size is 0, or that the standardized effect size is not 0.
library(BayesFactor)  # Case sensitive!

# Note the use of `formula` and `data` arguments
BayesFactor::ttestBF(formula=mpg ~ am, data=mtcars)
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 86.58973 ±0%

Against denominator:
  Null, mu1-mu2 = 0 
---
Bayes factor type: BFindepSample, JZS
  • What can we conclude?
  • \(BF_{10}\) = 86.59
  • Very strong evidence in favour of the hypothesis that there is a non-null difference between the two groups
  • What about priors?
    • In this particular case, let’s just say that they are good enough “default” priors (there is a lot of literature justifying them). They can be used “as-is” without too much worry*
    • *Some people might lose it if they see this
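The r=0.707 in the output above is the scale of the Cauchy prior that the JZS approach places on the standardized effect size. A quick sketch of what this default prior looks like (the plotting code is ours, not from the original slides):

# Default prior on the standardized difference (Cohen's d) in BayesFactor's
# t-tests: a Cauchy distribution centred on 0 with scale r = 0.707
xspace <- seq(-3, 3, by=0.01)
data.frame(d = xspace, density = dcauchy(xspace, location=0, scale=0.707)) |> 
  ggplot(aes(x=d, y=density)) +
  geom_line(color="blue", linewidth=2) +
  theme_bw()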

Correlation Test

  • The BayesFactor package also implements correlation tests
library(BayesFactor)  # Case sensitive!

BayesFactor::correlationBF(mtcars$drat, mtcars$qsec)
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.4322745 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
  • What can we conclude?
  • \(BF_{10}\) = 0.43
  • No conclusive evidence in favour of either hypothesis (null vs. non-null correlation)

Priors

  • Remember: the Bayes Factor is a ratio of the likelihood of the data under the two models
  • \(\underbrace{\frac{P(H_1 | D)}{P(H_0 | D)}}_{posterior~odds} = \underbrace{\frac{P(D | H_1)}{P(D | H_0)}}_{Bayes~Factor} \cdot \underbrace{\frac{P(H_1)}{P(H_0)}}_{prior~odds}\)
  • The Bayes Factor (a ratio of likelihoods) can be interpreted as an updating factor: how much the data updated our prior beliefs
    • In other words, how much did we move from Prior to Posterior
  • The stronger the BF, the stronger the “difference” between the prior and the posterior
  • But what is the prior?
    • Remember, we are trying to estimate the correlation coefficient rho \(\rho\) after seeing the data
    • But we need to have a prior idea of what \(\rho\) could be
  • How to specify a prior for \(\rho\)?

Prior Distribution

  • In the previous examples, we had priors in the form of discrete categories (e.g., 1/3 that the coin is fair, 1/3 that it’s rigged, …)

  • But how to specify a prior for a continuous variable like a correlation coefficient?
  • Using a distribution!
  • Let’s imagine taking a normal distribution as a prior for \(\rho\) (i.e., we think that \(\rho\) is normally distributed)
  • Imagine that someone proposes the following parameters
    • e.g., \(\mu = 0.42, \sigma = 0.333\)
xspace <- seq(-1.5, 1.5, by=0.01)
prob <- dnorm(xspace, mean=0.42, sd=0.333)

data <- data.frame(x = xspace, y = prob) 

data |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(color="blue", linewidth=3) 

Exercise

  • Do you think this is a good prior?
  • Why is this not a very good prior?
  • Try to come up with a suitable prior for the correlation coefficient \(\rho\)

Solution

  • Priors represent your beliefs before seeing the data
    • They can come from previous studies, from your own experience, from your intuition, …
  • Do we have a hypothesis about the direction (positive vs. negative) of the correlation between drat and qsec?
    • No. The correlation could be positive or negative.
  • Do we have reasons to expect a non-null correlation?
    • No. 
    • Our prior could be symmetric around 0, with a higher probability on 0 (i.e., we think that the correlation is more likely to be 0 than any other value)
  • What about \(Normal(0, 0.5)\)?
data$y <- dnorm(xspace, mean=0, sd=0.5) 

data |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(color="blue", linewidth=3) 

  • What’s the problem?

Priors for Correlation Coefficients?

  • Bayesian inference is, fundamentally, the process of reallocating credibility across candidate parameter values
    • From prior to posterior
  • Priors should incorporate “domain knowledge” (critical for good Bayesian models)
    • For example, knowledge about the range of possible values
  • We know that the correlation coefficient is bounded between -1 and 1 by definition. We should incorporate this knowledge in our prior and assign a probability of 0 to values outside of this range
  • How can we do that?
    • We could use a truncated normal distribution (e.g., \(Normal(0, 0.5)\) truncated between -1 and 1; see the sketch after this list)
    • Or pick a distribution that has bounds
  • What is a distribution that has negative and positive bounds?
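Before answering, here is a minimal sketch of the truncated option mentioned above (our own illustration: a \(Normal(0, 0.5)\) cut at ±1 and renormalized so that it integrates to 1):

# Truncated Normal(0, 0.5) on [-1, 1]: zero out the density outside the
# bounds, then renormalize by the probability mass inside [-1, 1]
xspace <- seq(-1.5, 1.5, by=0.01)
dens <- dnorm(xspace, mean=0, sd=0.5)
dens[abs(xspace) > 1] <- 0
dens <- dens / (pnorm(1, 0, 0.5) - pnorm(-1, 0, 0.5))

data.frame(x = xspace, y = dens) |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(color="blue", linewidth=3)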

What about Uniform?

  • \(Uniform(-1, 1)\) means that all values between -1 and 1 are equally likely
# Compute the uniform distribution
data$y <- dunif(xspace, -1, 1) 

data |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(color="blue", linewidth=3) +
  theme_bw()

  • This is a non-informative prior
    • i.e., we don’t introduce any prior knowledge about the correlation coefficient (other than domain knowledge about what a correlation coefficient is)
    • As we will see later, uniform priors tend to give results that are similar to frequentist maximum likelihood estimates (i.e., Bayesian models with uniform priors tend to have very similar results to frequentist models)
  • But they are not recommended.
    • They are often unrealistic: we tend to have some expectations about our parameters, and priors should reflect these beliefs

What other prior?

  • What is another distribution that has bounds?
    • The beta distribution!
  • But what’s the problem?
    • It is defined on the interval [0, 1]
  • Solution?
    • Distributions can often be scaled and shifted

Shifted and Scaled Beta

  • We can have a shifted & scaled beta distribution that is defined on the interval [-1, 1] with a peak at 0
  • How about \(Beta(1.5, 1.5)\) rescaled and shifted to [-1, 1]?
# Compute the beta distribution
data$y <- dbeta(xspace, 1.5, 1.5)
# Rescale its x-axis from [0, 1] to [-1, 1]
# (strictly, the density should also be halved after stretching;
# omitted here for illustration)
data$x2 <- (data$x - 0.5) * 2

data |>
  ggplot(aes(y=y)) +
  geom_line(aes(x=x), color="blue", linewidth=3) +
  geom_line(aes(x=x2), color="orange", linewidth=2, alpha=0.9) +
  coord_cartesian(xlim=c(-1.2, 1.2)) +
  theme_bw()

  • Original distribution in blue, shifted and scaled version in orange

Meaning

  • This means that we think that the correlation coefficient is more likely to be 0, and then “gradually” less likely as we move away from 0, with very low likelihood for values close to -1 or 1

BayesFactor’s defaults

  • What does BayesFactor use?
  • We can read the documentation of correlationBF to find out
?BayesFactor::correlationBF

Noninformative priors are assumed for the population means and variances of the two population; a shifted, scaled beta(1/rscale,1/rscale) prior distribution is assumed for \(\rho\) […]

For the rscale argument, several named values are recognized: “medium.narrow”, “medium” [default], “wide”, and “ultrawide”. These correspond to r scale values of \(1/\sqrt{27}\), 1/3, \(1/\sqrt{3}\), and \(1\), respectively.

Visualize BayesFactor’s defaults

# List of rscales
rscales <- list(
  "medium.narrow" = 1/sqrt(27),
  "medium" = 1/3,
  "wide" = 1/sqrt(3),
  "ultrawide" = 1
)
# Compute the beta distribution following the formula 
# And store in the same dataframe

xspace <- seq(0, 1, by=0.01)  # Original beta is ]0, 1[

data <- data.frame()  # Initialize an empty dataframe
for(scale in names(rscales)) {
  rscale <- rscales[[scale]]
  dat <- data.frame(
    y = dbeta(xspace, 1/rscale, 1/rscale),
    x = (xspace - 0.5) * 2, # Rescale its x-axis
    scale = scale
  ) 
  data <- rbind(data, dat)
}
  • Exercise: plot that!
data |> 
  ggplot(aes(x=x, y=y)) +
  geom_line(aes(color=scale), linewidth=2) +
  coord_cartesian(xlim=c(-1.2, 1.2)) 

Priors impact Bayes Factors

  • The prior we choose has an impact on the Bayes Factor
BayesFactor::correlationBF(mtcars$drat, mtcars$qsec, 
                           rscale="medium.narrow")
Bayes factor analysis
--------------
[1] Alt., r=0.192 : 0.5409961 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
BayesFactor::correlationBF(mtcars$drat, mtcars$qsec, 
                           rscale="medium")
Bayes factor analysis
--------------
[1] Alt., r=0.333 : 0.4322745 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
BayesFactor::correlationBF(mtcars$drat, mtcars$qsec, 
                           rscale="wide")
Bayes factor analysis
--------------
[1] Alt., r=0.577 : 0.3330176 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
BayesFactor::correlationBF(mtcars$drat, mtcars$qsec, 
                           rscale="ultrawide")
Bayes factor analysis
--------------
[1] Alt., r=1 : 0.2473768 ±0%

Against denominator:
  Null, rho = 0 
---
Bayes factor type: BFcorrelation, Jeffreys-beta*
  • The Bayes Factor is sensitive to the choice of prior
  • In this case, the wider the prior, the lower the Bayes factor
    • Why? A wider prior spreads mass over larger correlations; since the data do not support them, the data are on average less likely under the alternative
  • It’s not always the case

Misconceptions about Bayes Factors

  • BFs do not tell us the absolute probability of a hypothesis being true. They quantify how much we should update our belief in a hypothesis: if this hypothesis was extremely unlikely a priori, we might still believe it to be very unlikely, even after computing a large BF (see the worked example after this list).
  • BFs provide relative evidence between 2 hypotheses, and it is possible that one hypothesis is supported more than another while both are false. Saying “the BF shows evidence for the null (H0)” is incorrect; it should be “the BF shows evidence for H0 as compared to H1” (what H1 is matters).
  • BFs tell us nothing about the size of effects.
  • See Lakens’ post for details
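A hypothetical worked example of the first point, using the posterior odds formula from earlier:

# Hypothetical: H1 is deemed very implausible a priori (1:100 odds)
prior_odds <- 1/100
bf10 <- 30  # "strong evidence" for H1 relative to H0
posterior_odds <- bf10 * prior_odds
posterior_odds  # 0.3: despite a strong BF, H0 remains ~3 times more plausible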

Bayes Factors: Summary

  • Bayes factors are intuitive indices measuring the strength of evidence for one hypothesis over another
    • They have a natural interpretation in terms of odds (e.g., 3:1 odds), and have wide applications (comparing models, null-hypothesis testing, …)
    • They have well established interpretation guidelines (> 3; > 10; > 30; > 100)
  • They are “easy” to compute for simple statistical tests, but for more complex models, they are notoriously hard to compute
  • Their high sensitivity to the choice of priors, the fact that finding an “alternative” model is not at all trivial, and the difficulty of computing them have been seen as limitations, and have made them a focal point of debates (even between Bayesians)
  • There are different schools:
    • Some Bayesians believe that BFs are the best indices ever (assuming one understands them) and the Holy Grail of Bayesian statistics
    • Some believe that other indices are better and that BFs reproduce the same culture of binary thinking as Frequentist NHST
    • We’ll adopt a middle view: BFs are useful, in particular for simple tests, but they are not the only good thing in Bayesian statistics

Exercise

  • Compute and report the Bayes Factor for the correlation between Sepal.Length and Sepal.Width from the iris dataset.

Recap

  • \(\underbrace{\frac{P(H_1 | D)}{P(H_0 | D)}}_{posterior~odds} = \underbrace{\frac{P(D | H_1)}{P(D | H_0)}}_{Bayes~Factor} \cdot \underbrace{\frac{P(H_1)}{P(H_0)}}_{prior~odds}\)
  • We have seen what Bayes Factors are
  • We have seen how we can place priors on parameters (e.g., for correlations)
  • But… what about the posterior?

The End (for now)

Thank you!