4 Binary Data and Choices
4.1 The Data
For this chapter, we will be using the data from Wagenmakers et al. (2008) - Experiment 1 (also reanalyzed by Heathcote and Love 2012), which contains responses and response times for several participants in two conditions (where instructions emphasized either speed or accuracy). Following the same procedure as the authors, we excluded all trials with uninterpretable response times, i.e., responses that were too fast (<180 ms) or too slow (>2 sec instead of >3 sec; see Thériault et al. 2024 for a discussion of outlier removal).
In this chapter, we will focus on the proportion of errors in the two conditions (response times will be the focus of the next chapter).
#| code-fold: false
using Downloads, CSV, DataFrames, Random
using Turing, Distributions, StatsFuns, SubjectiveScalesModels
using GLMakie
Random.seed!(123) # For reproducibility
df = CSV.read(Downloads.download("https://raw.githubusercontent.com/DominiqueMakowski/CognitiveModels/main/data/wagenmakers2008.csv"), DataFrame)
# Show 10 first rows
first(df, 10)
Let us first compute the average number of errors for each condition.
#| code-fold: false
combine(groupby(df, :Condition), :Error => mean)
dat = combine(groupby(df, :Condition), :Error => mean)
fig = Figure()
ax = Axis(fig[1, 1], ylabel="Average error rate", xticks=([0, 1], ["Speed", "Accuracy"]))
barplot!(ax, [0], [dat.Error_mean[1]], label="Speed", color=:red)
barplot!(ax, [1], [dat.Error_mean[2]], label="Accuracy", color=:green)
axislegend()
fig
Let us then format the data for the model by transforming the Condition column into a binary variable (0 for Speed and 1 for Accuracy). This means that the intercept of the model will represent the error rate in the Speed condition, and we will be able to estimate the effect of the Accuracy condition on the error rate.
#| code-fold: false
df.Accuracy = ifelse.(df.Condition .== "Accuracy", 1, 0)
Our hypothesis is that the Accuracy condition will decrease the error rate compared to the Speed condition.
4.2 Logistic models for Binary Data
The statistical models that we use aim at uncovering the parameters of the distribution that best predicts the data. In a standard linear regression, this means estimating the mean \(\mu\) (which is often expressed as a function of another variable, such as the Condition) and the standard deviation \(\sigma\) of the normal distribution that best fits the data.
However, when the dependent variable is binary (e.g., 0 or 1), we cannot use a normal distribution, as it would predict values outside the [0, 1] range. We saw in the previous chapter how to model “probabilities” (values between 0 and 1) using \(Beta\) distributions, but in this part we will use another distribution: the Bernoulli distribution. The Bernoulli distribution generates only zeros and ones (or true and false) from a single parameter: the probability of success p.
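As a standalone illustration (not part of the analysis below), we can simulate draws from a Bernoulli distribution and verify that the proportion of ones matches its p parameter:

```julia
using Distributions, Random, Statistics

Random.seed!(42)

# A Bernoulli distribution with p = 0.11 produces 1s ("errors") about 11% of the time
dist = Bernoulli(0.11)
errors = rand(dist, 100_000)
mean(errors)  # close to 0.11
```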
In the data above, we saw that the error rate (the proportion - or the “probability” - of errors) in the Speed condition was 0.11 (11%). This would potentially translate into a distribution \(Bernoulli(0.11)\). In our model, we will estimate this probability in the reference condition (aka the intercept), as well as the effect of the Accuracy condition (which, in this case, we expect to be negative, as that condition seems to lower the error rate). We will set priors for these two parameters (intercept and effect of accuracy) that will be explored by the sampling algorithm.
But here is the catch. Imagine the sampling algorithm picks a value of \(0.11\) for the intercept (close to the true value) and then decides to explore a value of \(-0.20\) for the effect. This would lead to a tentative value of \(0.11 - 0.20 = -0.09\)… but this parameter is impossible (the p parameter of the \(Bernoulli\) distribution must be between 0 and 1).
Similarly to the previous chapter, we will avoid these issues by expressing our probability p on the logit scale. This way, the effect of Accuracy will be expressed as a log-odds (i.e., the log of the odds ratio), which can take any value between \(-\infty\) and \(+\infty\). We will then convert the log-odds back to a probability using the logistic function (the “inverse” of the logit transformation), which maps any value to the [0, 1] range.
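The logit/logistic round trip can be checked directly with the functions from StatsFuns (a quick standalone check):

```julia
using StatsFuns

# logit maps probabilities in (0, 1) to the whole real line;
# logistic is its inverse and maps any real number back to (0, 1)
p = 0.11
log_odds = logit(p)         # negative, since p < 0.5
logistic(log_odds)          # back to 0.11

# Adding effects on the logit scale always yields a valid probability,
# whereas 0.11 - 0.20 on the probability scale would be impossible
logistic(log_odds - 0.20)
```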
#| code-fold: false
#| output: false
@model function model_logistic(y; condition=df.Accuracy)
    # Priors
    p_intercept ~ Normal(0, 1)
    p_condition ~ Normal(0, 1)

    # Likelihood
    for i in 1:length(y)
        # Linear model equation
        p = p_intercept + p_condition * condition[i]
        # Inference
        y[i] ~ BernoulliLogit(p)
    end
end
fit_logistic = model_logistic(df.Error, condition=df.Accuracy)
posteriors_logistic = sample(fit_logistic, NUTS(), 500)
Note the use of BernoulliLogit(p), which is equivalent to Bernoulli(logistic(p)) but more efficient.
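We can verify this equivalence numerically (a standalone check, independent of the fitted model):

```julia
using Turing, Distributions, StatsFuns

θ = -2.0  # a value on the logit (log-odds) scale

# Both formulations assign the same probability to an error (y = 1),
# but BernoulliLogit works directly on the logit scale
p_logit  = pdf(BernoulliLogit(θ), 1)
p_direct = pdf(Bernoulli(logistic(θ)), 1)
```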
#| code-fold: false
# 95% CI
hpd(posteriors_logistic)
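These estimates are on the logit scale; to interpret them as probabilities, we can pass them through the logistic function. A sketch with hypothetical values (plug in your own posterior means from the summary above):

```julia
using StatsFuns

# Hypothetical posterior means on the logit scale (for illustration only)
intercept = -2.1   # Speed condition (reference)
condition = -1.1   # effect of the Accuracy condition

p_speed    = logistic(intercept)              # error rate in the Speed condition
p_accuracy = logistic(intercept + condition)  # error rate in the Accuracy condition
```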
In order to visualize the results, we will plot the observed error rates and then overlay, for each posterior draw, the predicted error rates in each condition.
#| output: false
pred = predict(model_logistic(fill(missing, length(df.Error)), condition=df.Accuracy), posteriors_logistic)
pred = Array(pred)
for i in 1:length(posteriors_logistic)
    mean_speed = mean(pred[i, :][df.Accuracy.==0])
    mean_accuracy = mean(pred[i, :][df.Accuracy.==1])
    lines!([0, 1], [mean_speed, mean_accuracy], color=(:black, 0.1), linewidth=1)
end
fig
The model appears to successfully recover the average error rates for each condition.
4.3 Mixed Logistic models
However, this model overlooks the fact that the data has repeated measures. In other words, each individual observation is nested within an overarching cluster that is a “participant”. Taking this structure into account can lead to more accurate estimates of the fixed effects (the intercept and the effect of Accuracy), as well as a quantification of the variability between participants.
Setting up a mixed model in Turing requires, for each “random” effect (which can be added to any of the “fixed” parameters):
- A prior for the standard deviation of the random effects (i.e., the variability between participants). This determines how much each participant is assumed (and allowed) to deviate from the average effect.
- A series of priors for each participant. These priors are typically centered at 0 (i.e., no deviation from the average effect) and have a standard deviation equal to the one defined above.
#| code-fold: false
#| output: false
@model function model_logistic(y; condition=df.Accuracy, participant=df.Participant)
    # Priors for the fixed effects
    p_intercept_fixed ~ Normal(0, 1)
    p_condition_fixed ~ Normal(0, 1)

    # Priors for the SD of the random effects
    p_intercept_participant_sd ~ truncated(Normal(0, 0.5); lower=0)
    p_condition_participant_sd ~ truncated(Normal(0, 0.5); lower=0)

    # Priors for each participant
    p_intercept_participant ~ filldist(Normal(0, p_intercept_participant_sd), length(unique(participant)))
    p_condition_participant ~ filldist(Normal(0, p_condition_participant_sd), length(unique(participant)))

    # Likelihood
    for i in 1:length(y)
        # Random components
        p_intercept = p_intercept_fixed + p_intercept_participant[participant[i]]
        p_condition = p_condition_fixed + p_condition_participant[participant[i]]
        # Linear model equation
        p = p_intercept + p_condition * condition[i]
        # Inference
        y[i] ~ BernoulliLogit(p)
    end
end
fit_logistic = model_logistic(df.Error, condition=df.Accuracy, participant=df.Participant)
posteriors_logistic = sample(fit_logistic, NUTS(), 500)
# using MicroCanonicalHMC
# posteriors_logistic = sample(fit_logistic, externalsampler(MCHMC(1000, 0.001)), 500)
Unfortunately, sampling from mixed models in Julia is currently quite slow.
4.4 Choice-Confidence (Choco) Models
4.4.1 Real Data Example
Data Preprocessing
#| code-fold: false
using Downloads, CSV, DataFrames, Random
using Turing, Distributions, StatsFuns, SubjectiveScalesModels
using GLMakie
Random.seed!(123)
df = CSV.read(Downloads.download("https://raw.githubusercontent.com/RealityBending/FakeFace/main/data/data.csv"), DataFrame)
df = df[:, [:Participant, :Stimulus, :Real, :Attractive]]
# Show 10 first rows
first(df, 10)
Our variable of interest is Real, for which participants had to indicate on an analog scale whether they thought the face stimulus was fake (on the left end) or real (on the right end).
#| code-fold: false
println("min: ", minimum(df.Real), " max: ", maximum(df.Real))
Because the data was originally modelled using a \(Beta\) regression, values were nudged to avoid exact 0 and 1. As extreme values are not a problem for the Choco model, we can rescale the data back to its original [0, 1] range.
#| code-fold: false
df.Real = data_rescale(df.Real; new_range=[0, 1])
#| code-fold: false
hist(df.Real, bins=beta_bins(30), normalization=:pdf, color=:darkred)
We can see that the distribution is quite particular: it seems bimodal (with mass on both sides of the scale) with a lot of zeros and ones (“zero-and-one inflation”). This type of data, while common in psychology, is not well captured by standard models (e.g., linear regression) and requires more advanced models, such as the Choco model.
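To build intuition for this shape (a simulation sketch using generic distributions, not the Choco model itself), such data can be generated as a mixture of two Beta distributions plus zero-and-one inflation:

```julia
using Distributions, Random

Random.seed!(123)

# Bimodal data on [0, 1]: a mixture of two Beta distributions,
# one with mass near 0 ("fake") and one with mass near 1 ("real")
n = 1000
y = [rand() < 0.5 ? rand(Beta(2, 8)) : rand(Beta(8, 2)) for _ in 1:n]

# Zero-and-one inflation: push a fraction of responses to the exact extremes,
# on whichever side of the scale they already fall
y = [rand() < 0.15 ? float(v > 0.5) : v for v in y]

extrema(y)  # exact zeros and ones are now present
```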
The Choco Model
See here. TODO. Present the Choco model.
Bayesian Choco
# @model function model_choco(y)
# p1 ~ Normal(0, 2)
# μ0 ~ Normal(0, 1)
# μ1 ~ Normal(0, 1)
# ϕ0 ~ Normal(0, 1)
# ϕ1 ~ Normal(0, 1)
# k0 ~ -Gamma(3, 3)
# k1 ~ Gamma(3, 3)
# for i in 1:length(y)
# y[i] ~ Choco(logistic(p1), logistic(μ0), exp(ϕ0), logistic(μ1), exp(ϕ1), 0.0, 200, logistic(k0), logistic(k1))
# end
# end
# fit_choco = model_choco(df.Real)
# posteriors = sample(fit_choco, NUTS(; init_ϵ=0.1), 1000,
# initial_params=[0, 0, 0, 0, 0, -9, 9])
# # 95% CI
# hpd(posteriors)