Bayesian Statistics

5. Bayes Factors (BIC approximation)

Dominique Makowski
D.Makowski@sussex.ac.uk

Recap

  • We learned about Bayes’ theorem:
    • \(P(H|X) =\frac{P(H)~P(X|H)}{P(X)}\)
  • And how it can be used to update our beliefs about a hypothesis given new data

Final nail in the coffin for the p-value

  • Another thing often heard about the p-value is that it can only be used to reject the null hypothesis (effect = 0), not as an index of evidence in favour of the null
  • This is because the p-value, by definition, is uniformly distributed under the null hypothesis
    • This means that, if the null hypothesis is true, all values of the p-value are equally likely; p = .001 is just as likely as p = 1
# Simulate 10,000 correlation tests under the null (no true correlation)
p <- c()
for(i in 1:10000) {
  x <- rnorm(100)
  y <- rnorm(100)
  test <- cor.test(x, y)
  p <- c(p, test$p.value)
}

library(ggplot2)

data.frame(p = p) |> 
  ggplot(aes(x=p)) +
  geom_histogram(bins=40, color="white", fill="grey") +
  labs(x="p-values", y="Count") +
  theme_bw()

Evidence against the null hypothesis?

  • Although Frequentists have recently come up with some workarounds (e.g., equivalence testing against a region of practical equivalence), this is still seen as a limitation of the Frequentist framework
  • This is problematic in science because we are often interested in disproving a hypothesis, or in showing that there is no effect
  • Does Uncle Bayes have a trick in his hat?

The “Likelihood” as a known model

  • Bayes’ Theorem: \(P(H|data) = \frac{P(H)~P(data|H)}{P(data)}\)
    • Bayes’ theorem relates the probability of a hypothesis (H) given some observed evidence (data X) to the probability of data given the hypothesis, as well as the prior probability of the hypothesis
  • \(P(data|H)\) is the likelihood of the data given a hypothesis
  • It typically corresponds to a “model of the world”
  • In the previous example, we had “the model” of coin tosses. We know the likelihood of outcomes given various coin types because it corresponds to a physical reality (a fair coin gives 50% heads). So the “model” was obvious and easy to define.

  • Conceptually, in this experiment:
    • The likelihood was a statistical model linking a parameter x (the probability by which a coin is biased towards heads: 0.3, 0.4, 0.5, 0.6, 0.7) to the outcome (heads).
    • The prior corresponded to the distribution of our credibility across the different possible x values.
    • The posterior corresponded to the updated distribution of our credibility across the different possible x values after observing the data (coin tosses).
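These three ingredients can be sketched numerically; a minimal example (the observed outcome, 7 heads out of 10 tosses, is made up for illustration):

```r
# Candidate coin biases (probability of heads), as in the example
x <- c(0.3, 0.4, 0.5, 0.6, 0.7)
prior <- rep(1/5, 5)  # uniform credibility across the five coin types

# Likelihood: probability of observing 7 heads in 10 tosses under each coin
likelihood <- dbinom(7, size = 10, prob = x)

# Posterior: prior times likelihood, normalised to sum to 1
posterior <- prior * likelihood / sum(prior * likelihood)
round(posterior, 3)
```

After seeing 7 heads, most of the credibility shifts towards the coin biased at 0.7.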

The “Likelihood” as a model with probabilistic parameters

  • We often have general beliefs about probabilistic models.
    • What are the odds that more types of coins exist?
    • What are the odds that my model of the coin is true?
  • In practice, the model’s likelihood depends on uncertain parameters over which we have other priors, or a combination of priors (e.g., Intercept + Slope)

  • The “likelihood”, as well as the “hypothesis”, can represent different things (a statistical model, beliefs about whether a general hypothesis is True, etc.)
  • Problems can be specified in different ways (it is a flexible framework)
  • In typical real-life statistical scenarios, the likelihood term \(P(data|H)\) is very complex to establish mathematically (i.e., it is hard to compute/approximate it)
  • But it is not impossible. And accessing the likelihood is very useful…

The “Likelihood”: Summary

  • The likelihood \(P(data|H)\) represents “the probability of the data given a hypothesis”
  • This hypothesis can be:
    • A statistical model with fixed and known parameters (e.g., \(y = 2x + 3\))
    • A statistical model with unknown (probabilistic) parameters (e.g., \(y = \beta x + Intercept\) with \(\beta \sim Normal(0, 1)\) and \(Intercept \sim Normal(3, 2)\))
    • Any other statement about the world (e.g., a hypothesis like “this model is true”)
  • Mathematically computing the value of \(P(data|H)\) is hard (aside from the first case)
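For the first case (fixed, known parameters), the likelihood is directly computable; a minimal sketch assuming Gaussian residuals with a known standard deviation (the data and sigma are made up):

```r
# Fully specified model: y = 2x + 3, Gaussian noise with known sd (sigma = 1)
x <- c(1, 2, 3, 4)
y <- c(5.1, 6.8, 9.2, 10.9)  # hypothetical observations
predicted <- 2 * x + 3

# P(data | H): product of the residuals' densities under the noise model
likelihood <- prod(dnorm(y, mean = predicted, sd = 1))
likelihood
```

As soon as the parameters become probabilistic, this product turns into an integral over the priors, which is why the later cases are hard.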

Bayes Factor (BF)

  • Bayes’ theorem provides a natural way to test hypotheses and compare models. Suppose we have two competing hypotheses: a hypothesis H1 and an alternative (e.g., null) hypothesis H0
  • What if we compared the likelihood of the data given two competing hypotheses (i.e., models)?
    • \(\frac{P(data|H_{0})}{P(data|H_{1})}\)
  • This ratio of likelihoods is known as the Bayes Factor (BF)
    • It quantifies the relative strength of evidence in favor of one hypothesis over another based on the observed evidence (data)
    • It tells us how much more likely the data (evidence) is under one hypothesis compared to another (often taking into account prior probabilities in the definition of the likelihood model)
  • \(BF_{10} = \frac{P(data|H_{1})}{P(data|H_{0})}\)
  • \(BF_{01} = \frac{P(data|H_{0})}{P(data|H_{1})}\)
    • Note the subscript convention: the first index indicates which hypothesis is in the numerator

BFs with BIC approximation

  • The Bayes Factor is a ratio of likelihoods, which is often difficult to compute, especially for Bayesian models where parameters are themselves probabilistic
  • Wagenmakers (2007) (one of the principal proponents of Bayes factors) proposed an approximation of the Bayes Factor based on the Bayesian Information Criterion (BIC), which can easily be computed for traditional Frequentist models (e.g., linear models)
  • \(BF_{01} \approx \exp(\frac{BIC_{1} - BIC_{0}}{2})\)
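The formula maps directly onto base R’s `BIC()`; a minimal sketch on simulated data (the two models here are placeholders for any pair of models fitted to the same data):

```r
# BIC approximation of the Bayes Factor, using base R only
set.seed(1)
d <- data.frame(x = rnorm(50))
d$y <- 0.5 * d$x + rnorm(50)  # true effect of x exists by construction

m0 <- lm(y ~ 1, data = d)  # null model (H0)
m1 <- lm(y ~ x, data = d)  # model with the predictor (H1)

bf01 <- exp((BIC(m1) - BIC(m0)) / 2)  # BF01: evidence for H0 over H1
bf10 <- 1 / bf01                      # BF10: evidence for H1 over H0
```

Note that flipping the subscript only flips the fraction: BF10 is simply 1/BF01.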

Model 0: qsec ~ mpg

head(mtcars[c("mpg", "qsec", "wt")])
                   mpg  qsec    wt
Mazda RX4         21.0 16.46 2.620
Mazda RX4 Wag     21.0 17.02 2.875
Datsun 710        22.8 18.61 2.320
Hornet 4 Drive    21.4 19.44 3.215
Hornet Sportabout 18.7 17.02 3.440
Valiant           18.1 20.22 3.460
mtcars |> 
  ggplot(aes(x=mpg, y=qsec)) +
  geom_point() +
  geom_smooth(method="lm", formula = "y ~ x") +
  theme_bw()

model0 <- lm(qsec ~ mpg, data = mtcars)
summary(model0)

Call:
lm(formula = qsec ~ mpg, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8161 -1.0287  0.0954  0.8623  4.7149 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 15.35477    1.02978  14.911 2.05e-15 ***
mpg          0.12414    0.04916   2.525   0.0171 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.65 on 30 degrees of freedom
Multiple R-squared:  0.1753,    Adjusted R-squared:  0.1478 
F-statistic: 6.377 on 1 and 30 DF,  p-value: 0.01708
  • mpg is a significant predictor of qsec (\(\beta\) = 0.12, p < .05)
  • Is it worth it to add another predictor? Does it improve the model?

Model 1: qsec ~ mpg + wt

mtcars |> 
  ggplot(aes(x=wt, y=qsec)) +
  geom_point() +
  geom_smooth(method="lm", formula = "y ~ x") +
  theme_bw()

  • Do you think adding wt will improve the model?
model1 <- lm(qsec ~ mpg + wt, data = mtcars)
summary(model1)

Call:
lm(formula = qsec ~ mpg + wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1092 -0.9617 -0.2343  0.8399  4.2769 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  6.92949    3.53431   1.961   0.0596 . 
mpg          0.32039    0.09138   3.506   0.0015 **
wt           1.39324    0.56286   2.475   0.0194 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.524 on 29 degrees of freedom
Multiple R-squared:  0.3191,    Adjusted R-squared:  0.2722 
F-statistic: 6.797 on 2 and 29 DF,  p-value: 0.003796
  • Be careful! A model with multiple predictors is different from the individual models taken separately (because predictors are correlated with each other)
  • Is model1 better than model0?
  • Higher R², but more parameters (a more complex model). Comparing BICs is an alternative that takes the number of parameters into account
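As a reminder, BIC = k·log(n) − 2·log-likelihood, so each extra parameter costs log(n) points; a quick check against base R’s `BIC()` (refitting model0 from above):

```r
# BIC by hand: k * log(n) - 2 * log-likelihood
m <- lm(qsec ~ mpg, data = mtcars)   # same model as model0 above
ll <- logLik(m)
k <- attr(ll, "df")                  # number of estimated parameters (incl. residual sd)
n <- nobs(m)
by_hand <- k * log(n) - 2 * as.numeric(ll)

all.equal(by_hand, BIC(m))  # matches base R's BIC()
```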

Extract BICs

perf0 <- performance::performance(model0)
perf0
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
126.781 | 127.638 | 131.178 | 0.175 |     0.148 | 1.597 | 1.650
bic0 <- perf0$BIC
bic0
[1] 131.1783
bic1 <- performance::performance(model1)$BIC
bic1
[1] 128.5105
  • The lower the BIC, the better the model
  • BIC is hard to interpret in absolute terms (the unit is meaningless), but we can compare models (provided they are fitted to the same data)

Compare BICs

  • \(BF_{10} \approx \exp(\frac{BIC_{0} - BIC_{1}}{2})\)
  • We can apply the formula to bic1 and bic0
bf10 <- exp((bic0 - bic1) / 2)
bf10
[1] 3.795896
  • You computed an approximation of the Bayes Factor “by hand”!

Using bayestestR

bayestestR::bayesfactor_models(model1, denominator=model0)
Bayes Factors for Model Comparison

         Model      BF
[model1] mpg + wt 3.80

* Against Denominator: [model0] mpg
*   Bayes Factor Type: BIC approximation

BF10 vs. BF01

  • Bayes factors are a ratio of the likelihoods of the data under two hypotheses
  • They quantify how much more likely the data is under one model compared to another
  • \(BF_{10} = 3.80\) means the data is 3.80 times more likely under model1 than model0
  • Importantly, BFs are “reversible”:
  • \(BF_{01} = \frac{1}{BF_{10}} = 1 / 3.80 = 0.263\)
  • The data is 0.26 times as likely under model0 as under model1
  • We can gather evidence both ways (in favour of or against a hypothesis)
bayestestR::bayesfactor_models(model0, denominator=model1)
Bayes Factors for Model Comparison

         Model    BF
[model0] mpg   0.263

* Against Denominator: [model1] mpg + wt
*   Bayes Factor Type: BIC approximation
  • Is this “big”? How to interpret BFs?

Bayes Factor Interpretation

  • The conventional thresholds for a “significant” BF are 3 and 10
    • Jeffreys, H. (1961), Theory of Probability, 3rd ed., Oxford University Press, Oxford
Bayes Factor (\(BF_{10}\))   Interpretation
> 100                        Extreme evidence for H1
30 – 100                     Very strong evidence for H1
10 – 30                      Strong evidence for H1
3 – 10                       Moderate evidence for H1
1 – 3                        Anecdotal evidence for H1
1                            No evidence
1/3 – 1                      Anecdotal evidence for H0
1/10 – 1/3                   Moderate evidence for H0
1/30 – 1/10                  Strong evidence for H0
1/100 – 1/30                 Very strong evidence for H0
< 1/100                      Extreme evidence for H0

Bayes Factor Interpretation

  • The effectsize package provides many automated interpretation functions
effectsize::interpret_bf(bf10)
[1] "moderate evidence in favour of"
(Rules: jeffreys1961)
  • Note that because BFs can be very big or very small, they are sometimes presented on the log scale (\(log~BF_{10}\)), in which case BFs smaller than 1 map to negative values
log(3.8)
[1] 1.335001
log(1/3.8)
[1] -1.335001
  • To “unlog” a logged BF, use exp()
exp(1.335001)
[1] 3.8

Exercise

  • Compute the Bayes Factor for the following models: qsec ~ wt vs. a constant model
  • A constant model, aka an “intercept-only” model, is a model with no predictors (only an intercept)
  • A constant model basically predicts the mean of the outcome variable for all observations
mean(mtcars$qsec)
[1] 17.84875
model0 <- lm(qsec ~ 1, data = mtcars)
model0

Call:
lm(formula = qsec ~ 1, data = mtcars)

Coefficients:
(Intercept)  
      17.85  
  • Compute the BF and interpret it. Make a conclusion

Solution

model1 <- lm(qsec ~ wt, data = mtcars)
model1

Call:
lm(formula = qsec ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
    18.8753      -0.3191  
bayestestR::bayesfactor_models(model1, denominator=model0)
Bayes Factors for Model Comparison

         Model    BF
[model1] wt    0.290

* Against Denominator: [model0] (Intercept only)
*   Bayes Factor Type: BIC approximation
  • BF < 1/3 (0.3333)
effectsize::interpret_bf(0.290)
[1] "moderate evidence against"
(Rules: jeffreys1961)

Exercise 2

  • “I try to analyze the linear relationship between the sepal length and the sepal width of the iris flowers (iris built in dataset). Should I control for Species?”
m1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)
m2 <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

bayestestR::bayesfactor_models(m1, denominator=m2)
Bayes Factors for Model Comparison

     Model             BF
[m1] Sepal.Width 2.96e-40

* Against Denominator: [m2] Sepal.Width + Species
*   Bayes Factor Type: BIC approximation
  • Pizzas can help us get an intuitive feeling for BFs


Pizza plot

plot(bayestestR::bayesfactor_models(model1, denominator=model0)) +
  scale_fill_manual(values = c("red", "yellow")) 

Summary

  • Bayes Factors are indices that compare two “hypotheses”
  • They correspond to the ratio of “likelihoods”
  • They can be approximated for frequentist models via the BIC-approximation
  • There are interpretation rules of thumb (> 3, > 10, > 30, > 100)
  • They can be expressed on the log scale

The End (for now)

Thank you!