This module is about understanding what we are doing, and why we are doing it.
Why Bayesian Statistics?
The Bayesian framework for statistics has quickly gained popularity among scientists, associated with the general shift towards open, transparent, and more rigorous science. Reasons to prefer this approach are:
The possibility of introducing prior knowledge into the analysis (Andrews & Baguley, 2013; Kruschke et al., 2012)
Intuitive results and their straightforward interpretation (Kruschke, 2010; Wagenmakers et al., 2018)
Learning Bayes? Back to the Bayesics first
This module adopts a slightly unorthodox approach: instead of starting with Bayesian theory and equations, we will first consolidate various concepts and notions from Frequentist statistics that will help us understand and apply the Bayesian framework.
We won’t be talking about Bayes until a few sessions in (“awww 😞”)
But everything we do until then will be in preparation for it
Understanding Bayes painlessly requires being very clear about some fundamental concepts, in particular probability distributions and model parameters
How to successfully attend this module
Goal: Become a master* of Bayesian statistics
Master User: be comfortable using and reading Bayesian statistics
\(\neq\) becoming a master mathematician
Right level of understanding: not too superficial, not too deep
Code shown in the slides should in general be understood
But you don’t need to memorize it
Best to follow along by trying and running the code in your own system
(If you need me to slow down, let me know!)
Ideally, make a Quarto file and write down notes and code examples there
Slides will be available online
Equations are not generally important
No need to memorize them, but you should understand the concepts
Memorizing a few Greek symbols will be useful
In particular beta (\(\beta\)), sigma (\(\sigma\)), and mu (\(\mu\))
R Refresher
Setup
Make sure you have R and RStudio on your computer
Why local installation? Independence, flexibility, and power
Follow the instructions on Canvas (in /Syllabus)
If you have any problem, let me know
Why R?
Why not SPSS? Matlab? Python?
The best for statistics (by far)
The best for visualization
Free (independence)
Flexible (packages)
Powerful (create websites, presentations, …)
In demand (💰)
Gold standard across science
Cutting-edge statistics and methods (e.g., Bayesian statistics)
Vocabulary
R = programming language
RStudio = the “editor” that we use to work with R
Posit = The company that created RStudio and that provides cloud services (the company itself used to be called RStudio)
Posit cloud = The online version of RStudio
Quarto = The successor to R Markdown. A system to combine code, text, and code output into nice documents (similar to Jupyter)
Markdown = A simple syntax to *format* text used on the **internet**
Panels
Source (text editor)
Console (interactive)
Environment (objects)
Files (navigate)
Interacting
Create a new document (file)
.R (R script) or .qmd (quarto document)
Write some code in the script
Run the code
Click somewhere on the same line that you want to execute
Or select the code that you want to execute
Hit Ctrl+Enter
2+2
[1] 4
Classes
In R, each thing has a class (type)
Numeric (aka integers and floats; numbers)
Character (aka string; text)
Logical (aka booleans; TRUE/FALSE)
Note: TRUE and FALSE are equivalent to 1 and 0
Try: TRUE + TRUE
Factors (aka categorical; e.g. experimental conditions)
Comments (with hash #, CTRL + SHIFT + C)
“Functions” (end with (); e.g. mean())
Many more…
You can check the class of an object with class()
You can access a function’s documentation with ?mean or clicking on the function and pressing F1
Types
# Number
3

# Character ("quotations")
"text"

# Logical
TRUE
Check class
x <- 3
class(x)
[1] "numeric"
Vectors vs. Lists (1)
A vector is a “list” of elements of the same class, indexed by their position
In R, most operations are by default vectorized (i.e. applied to each element of the vector)
Create and concatenate vectors with the combine function c()
# Vector
x <- c(0, 1, 2)
x + 3
[1] 3 4 5
c(x, 3)
[1] 0 1 2 3
Warning
R starts counting at 1, not 0.
x[2]
[1] 1
Vectors vs. Lists (2)
A list is a container of named elements of any kind, indexed by their name
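For instance, a minimal sketch (the names and values are made up for illustration):

# A list can hold named elements of different classes
my_list <- list(name = "Alice", scores = c(1, 2, 3))
my_list$name         # Access an element by its name
my_list[["scores"]]  # Equivalent bracket syntax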
tidyverse and easystats are actually collections of packages
Load packages with library()
This simply makes the functions of the package available in the current session
You can still call functions from packages that are not loaded by explicitly mentioning the package name pkg::fun()
Tip
It is good practice to explicitly mention a function’s package when using it, e.g. dplyr::select(), especially when using less popular functions.
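For instance, assuming dplyr is installed, both of these calls are equivalent (a minimal illustration):

library(dplyr)
select(iris, Sepal.Length)         # Requires dplyr to be loaded
dplyr::select(iris, Sepal.Length)  # Works even without library(dplyr)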
ggplot basics (1)
ggplot2 is the main R package for data visualization
It is based on the Grammar of Graphics (Wilkinson, 2005)
The main function is ggplot()
Takes a data frame as first argument
Followed by a mapping of variables to aesthetic characteristics (x, y, color, shape, etc.)
We can then add layers to the plot with +
Note: In ggplot (and most tidyverse packages), variables are not quoted (x=Sepal.Length, not x="Sepal.Length")
This is not typically the case (in other packages and languages)
library(tidyverse)

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_density_2d() +
  theme_classic()
ggplot basics (2)
The arguments passed to ggplot() are inherited by the layers
One can specify different data & aesthetics for each layer
ggplot() +
  geom_point(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_density_2d(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  theme_classic()
ggplot basics (3)
Aside from aesthetics and data, other arguments can be used to customize the plot
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "yellow", size = 4, shape = "triangle") +
  geom_density_2d(color = "red") +
  see::theme_abyss()  # Package in easystats
ggplot basics (4)
Warning
Do NOT confuse “aesthetics” inside aes() (i.e., mappings of variable names to aesthetic features) with arguments outside aes() that set a fixed appearance for the whole layer
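To make the contrast concrete, here is a minimal sketch with the same iris data:

# Inside aes(): color is mapped to a variable (varies with Species)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

# Outside aes(): color is a fixed appearance setting for all points
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(color = "red")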
As we have seen, we can estimate the probability density of a random variable (e.g., a sample of data) and visualize it using a density plot or a histogram
Most common distributions have an analytical solution (i.e., a formula) to compute the probability density over a range of values.
It is called the Probability Density Function (PDF) and can be obtained using the d*() functions
E.g., dnorm() = density + normal
It requires a vector of values (x), and the parameters of the distribution (e.g., mean, sd)
# Get 7 evenly-spaced values between -4 and 4
x <- seq(-4, 4, length.out = 7)
x
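We can then evaluate the Normal PDF at these values, here assuming a standard Normal (mean 0, SD 1):

# Density of the standard Normal at each value of x
dnorm(x, mean = 0, sd = 1)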
ggplot(data, aes(x = time, y = trajectory)) +
  geom_line() +
  coord_flip()  # Flip x and y axes
Exercise!
Can you simulate 20 different random walks and visualize them as different colors?
Tip
You can loop over a sequence of iterations with for(i in 1:20) {...}
The data.frame() function can be used to initialize an empty data frame
The rbind() (“row-bind”) function can be used to concatenate data frames vertically
Solution (1)
Can you simulate 20 different random walks and visualize them as different colors?
data <- data.frame()  # Initialize empty data frame
for(i in 1:20) {
  walk_data <- data.frame(
    trajectory = random_walk(10),
    time = seq(0, 10),
    iteration = i
  )
  data <- rbind(data, walk_data)
}
data
Despite its (relative) complexity, the Normal distribution naturally emerges from very simple processes!
This is known as the Central Limit Theorem, which states that the distribution of the sums/means of many random variables tends to a Normal distribution
This is why the Normal distribution is so ubiquitous in nature and statistics!
Because many measurements result from the amalgamation of many random mechanisms
A Galton Board
On Uniform Distributions
Why did the uniform distribution get hired as a referee? Because it always calls it fair and square, giving every player an equal chance!
Uniform Distribution
The Uniform distribution is the simplest distribution
It is defined by two parameters: a lower and upper bound
All values between the bounds are equally likely (the PDF is flat)
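A minimal illustration with dunif() (density + uniform), using bounds of -10 and 10:

# Same density everywhere inside the bounds...
dunif(c(-5, 0, 5), min = -10, max = 10)
# ...and 0 outside of them
dunif(15, min = -10, max = 10)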
Exercise: generate 50,000 random values between -10 and 10 and plot the histogram
Tip: use runif()
# Use runif(): random + uniform
data.frame(x = runif(50000, min = -10, max = 10)) |>
  ggplot(aes(x = x)) +
  geom_histogram(bins = 50, color = "black") +
  coord_cartesian(xlim = c(-15, 15))
Uniform Distribution - Applications
Often used in experimental designs
E.g., to jitter the Inter-Stimulus Intervals, to randomly select between various conditions, etc. (see the sketch below)
It is the least informative distribution
Can be used when we want to make no assumptions about the distribution of the data
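A sketch of the experimental-design use above (the interval bounds are made up):

# Jitter 10 inter-stimulus intervals uniformly between 1 and 2 seconds
isi <- runif(10, min = 1, max = 2)
round(isi, 2)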
On Beta Distributions
Why do beta distributions love to go to the gym? So that they are not out of shape!
Beta Distribution
The Beta distribution can be used to model probabilities
Defined by two shape parameters, α and β (shape1 & shape2)
Only defined on the interval \(]0, 1[\) (i.e., the density is null outside of this range)
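For instance, we can visualize its PDF with dbeta() (the shape parameters below are arbitrary illustrative values):

# Plot the Beta PDF over its (0, 1) support
x <- seq(0, 1, length.out = 200)
data.frame(x = x, density = dbeta(x, shape1 = 2, shape2 = 5)) |>
  ggplot(aes(x = x, y = density)) +
  geom_line()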
Make groups. Each group picks a distribution (Normal, Uniform, Beta, Gamma) and a set of parameters
Then:
Draw a random sample of 100 values from that distribution
Compute the mean of this sample
Store results in a vector
Repeat 10,000 times
Plot the distribution of the means
Solution
means <- c()  # Initialize an empty vector
for(i in 1:10000) {  # Iterate
  x <- rbeta(100, shape1 = 10, shape2 = 1.5)
  means <- c(means, mean(x))
}
data.frame(x = means) |>
  ggplot(aes(x = x)) +
  geom_histogram(bins = 40, color = "black")
🤯🤯🤯🤯🤯🤯🤯🤯
A NORMAL DISTRIBUTION
🤯🤯🤯🤯🤯🤯🤯🤯
Central Limit Theorem (2)
The Central Limit Theorem strikes again: “the distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population’s distribution”
Practical Implications: The Central Limit Theorem is crucial for inferential statistics. It underpins many statistical methods, such as frequentist hypothesis testing and confidence intervals. It allows for the use of normal probability models to make inferences about population parameters even when the population distribution is not normal.
Standard Error (SE) vs. Standard Deviation (SD)
The standard deviation is a measure of the variability of a single sample of observations
The standard error is a measure of the variability of many sample means (it is the SD of the averages of many samples drawn from the same parent distribution). Following the Central Limit Theorem, the distribution of these sample means is often assumed to be normal (even if the underlying distribution is not normal).
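A minimal sketch of the difference (sample size and parameters are arbitrary):

x <- rnorm(100, mean = 0, sd = 1)  # One sample of 100 observations
sd(x)              # SD: variability within this one sample
sd(x) / sqrt(100)  # SE of the mean: SD divided by sqrt(n)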
On Cauchy Distributions
Why don’t statisticians play hide and seek with Cauchy distributions? Because they never know where they’re going to show up and how far away they could be!
Cauchy Distribution
The Cauchy distribution is known for its “heavy tails” (aka “fat tails”)
Characterized by a location parameter (the median) and a scale parameter (the spread)
The Cauchy distribution is one notable exception to the Central Limit Theorem (CLT): the distribution of the sample means of a Cauchy distribution remains a Cauchy distribution (instead of Normal). This is because the heavy tails of the Cauchy distribution significantly influence the sample mean, preventing it from settling into a normal distribution.
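We can see this with a simulation similar to the one above (parameters are illustrative):

# Means of Cauchy samples are still Cauchy-distributed
means <- c()
for(i in 1:10000) {
  means <- c(means, mean(rcauchy(100, location = 0, scale = 1)))
}
range(means)  # Extreme means keep occurring, unlike with a Normal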
On t-Distributions
What do you call the PhD diploma of a Student’s t-distribution? A degree of freedom!
t-Distribution
Both Cauchy and Normal are extreme cases of the Student’s t-distribution
Student’s t-distribution becomes the Cauchy distribution when the degrees of freedom equal 1, and converges to the Normal distribution as the degrees of freedom go to infinity
Defined by its degrees of freedom \(df\) (location and scale usually fixed to 0 and 1)
Tends to have heavier tails than the normal distribution (but less than Cauchy)
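A sketch comparing the three densities (df = 3 is an arbitrary illustrative value):

# Compare the tails of the Normal, Student's t, and Cauchy PDFs
x <- seq(-5, 5, length.out = 200)
data.frame(x = x, Normal = dnorm(x), t_df3 = dt(x, df = 3), Cauchy = dcauchy(x)) |>
  tidyr::pivot_longer(-x, names_to = "Distribution", values_to = "density") |>
  ggplot(aes(x = x, y = density, color = Distribution)) +
  geom_line()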