4  Binary Data and Choices

4.1 The Data

For this chapter, we will be using the data from Wagenmakers et al. (2008) - Experiment 1 (also reanalyzed by Heathcote and Love 2012), which contains responses and response times for several participants in two conditions (where instructions emphasized either speed or accuracy). Following the same procedure as the authors, we excluded all trials with uninterpretable response times, i.e., responses that were too fast (<180 ms) or too slow (>2 sec instead of >3 sec; see Thériault et al. 2024 for a discussion of outlier removal).

In this chapter, we will focus on the error rate (the proportion of errors) in the two conditions (response times will be the focus of the next chapter).

#| code-fold: false

using Downloads, CSV, DataFrames, Random
using Turing, Distributions, StatsFuns, SubjectiveScalesModels
using GLMakie

Random.seed!(123)  # For reproducibility

df = CSV.read(Downloads.download("https://raw.githubusercontent.com/DominiqueMakowski/CognitiveModels/main/data/wagenmakers2008.csv"), DataFrame)

# Show the first 10 rows
first(df, 10)

Let us first compute the average number of errors for each condition.

#| code-fold: false

dat = combine(groupby(df, :Condition), :Error => mean)

fig = Figure()
ax = Axis(fig[1, 1], ylabel="Average error rate", xticks=([0, 1], ["Speed", "Accuracy"]))
barplot!(ax, [0], [dat.Error_mean[1]], label="Speed", color=:red)
barplot!(ax, [1], [dat.Error_mean[2]], label="Accuracy", color=:green)
axislegend()
fig

Let us then format the data for the model by transforming the Condition column into a binary variable (0 for Speed and 1 for Accuracy). This means that the intercept of the model will represent the error rate in the Speed condition, and we will be able to estimate the effect of the Accuracy condition on the error rate.

#| code-fold: false

df.Accuracy = ifelse.(df.Condition .== "Accuracy", 1, 0)

Our hypothesis is that the Accuracy condition will decrease the error rate compared to the Speed condition.

4.2 Logistic models for Binary Data

The statistical models that we use aim at uncovering the parameters of the distribution that best predicts the data. In a standard linear regression, this means estimating the mean \(\mu\) (which is often expressed as a function of another variable, such as the Condition) and the standard deviation \(\sigma\) of the normal distribution that best fits the data.

However, when the dependent variable is binary (e.g., 0 or 1), we cannot use a normal distribution, as it would predict values outside the [0, 1] range. We saw in the previous chapter how to model “probabilities” (values between 0 and 1) using \(Beta\) distributions, but in this part we will use another distribution: the Bernoulli distribution. The Bernoulli distribution generates only zeros and ones (or True and False) and has a single parameter: the probability of success \(p\).
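As a quick illustration (using the Distributions.jl package already loaded above), sampling from a Bernoulli distribution produces a stream of 0s and 1s whose average converges to \(p\):

```julia
using Distributions, Random

Random.seed!(123)

# Draw 10,000 samples from a Bernoulli distribution with p = 0.11
samples = rand(Bernoulli(0.11), 10_000)

unique(samples)  # Only two possible values: false (0) and true (1)
mean(samples)    # Close to 0.11
```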

In the data above, we saw that the error rate (the proportion - or the “probability” - of errors) in the Speed condition was 0.11 (11%). This would potentially translate into a distribution \(Bernoulli(0.11)\). In our model, we will estimate this probability at the reference condition (i.e., the intercept), as well as the effect of the Accuracy condition (which, in this case, we expect to be negative, as that condition seems to lower the error rate). We will set priors for these two parameters (intercept and the effect of accuracy) that will get explored by the sampling algorithm.

But here is the catch. Imagine the sampling algorithm picks a value of \(0.11\) for the intercept (close to the true value) and then decides to explore a value of \(-0.20\) for the effect. This would lead to a tentative value of \(0.11 - 0.20 = -0.09\)… but this parameter is impossible (the \(p\) parameter of the \(Bernoulli\) distribution must be between 0 and 1).

As in the previous chapter, we will avoid these issues by expressing our probability \(p\) on the logit scale. This way, the effect of Accuracy will be expressed as a log-odds (i.e., the log of the odds ratio), which can take any value between \(-\infty\) and \(+\infty\). We will then convert this log-odds back to a probability using the logistic function (the “inverse” of the logit transformation), which maps any value to the [0, 1] range.
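To make this concrete, here is a small sketch (using `logit` and `logistic` from StatsFuns, loaded above) showing that a shift applied on the logit scale always maps back to a valid probability:

```julia
using StatsFuns

p_speed = 0.11              # Error rate in the reference (Speed) condition
intercept = logit(p_speed)  # ≈ -2.09 on the log-odds scale

# A negative effect of Accuracy, expressed on the logit scale...
effect = -0.20

# ...still converts back to a valid probability in [0, 1]
logistic(intercept + effect)  # ≈ 0.09
```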

#| code-fold: false
#| output: false

@model function model_logistic(y; condition=df.Accuracy)
    # Priors
    p_intercept ~ Normal(0, 1)
    p_condition ~ Normal(0, 1)

    # Likelihood
    for i in 1:length(y)
        # Linear model equation
        p = p_intercept + p_condition * condition[i]

        # Inference
        y[i] ~ BernoulliLogit(p)
    end
end

fit_logistic = model_logistic(df.Error, condition=df.Accuracy)
posteriors_logistic = sample(fit_logistic, NUTS(), 500)
::: {.callout-tip title="Code Tip"}
Note the use of `BernoulliLogit(p)`, which is equivalent to `Bernoulli(logistic(p))` but more efficient.
:::
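We can verify this equivalence directly (a minimal check; `BernoulliLogit` is provided by Turing):

```julia
using Turing, Distributions, StatsFuns

x = -2.0  # An arbitrary value on the logit scale

# Both formulations assign the same log-probability to an observation
logpdf(BernoulliLogit(x), 1) ≈ logpdf(Bernoulli(logistic(x)), 1)  # true
```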

#| code-fold: false

# 95% CI
hpd(posteriors_logistic)

In order to visualize the results, we will plot the observed error rates and then overlay the predicted error rates for each condition for each posterior draw.

#| output: false

pred = predict(model_logistic(fill(missing, length(df.Error)), condition=df.Accuracy), posteriors_logistic)
pred = Array(pred)

for i in 1:size(pred, 1)
    mean_speed = mean(pred[i, :][df.Accuracy.==0])
    mean_accuracy = mean(pred[i, :][df.Accuracy.==1])
    lines!([0, 1], [mean_speed, mean_accuracy], color=(:black, 0.1), linewidth=1)
end

fig

The model appears to successfully recover the average error rates for each condition.

4.3 Mixed Logistic models

However, this model overlooks the fact that the data has repeated measures. In other words, each individual observation is nested within an overarching cluster that is a “participant”. Taking into account this structure can lead to more accurate estimates of the fixed effects (the intercept and the effect of Accuracy), as well as a quantification of the variability between participants.

Setting up a mixed model in Turing requires, for each “random” effect (which can be added to any of the “fixed” parameters):

- A prior for the standard deviation of the random effects (i.e., the variability between participants). This determines how much each participant is assumed (and allowed) to deviate from the average effect.
- A prior for each participant's deviation. These priors are typically centered at 0 (i.e., no deviation from the average effect) and have a standard deviation equal to the one defined above.

#| code-fold: false
#| output: false

@model function model_logistic(y; condition=df.Accuracy, participant=df.Participant)
    # Priors for the fixed effects
    p_intercept_fixed ~ Normal(0, 1)
    p_condition_fixed ~ Normal(0, 1)

    # Priors for the SD of the random effects 
    p_intercept_participant_sd ~ truncated(Normal(0, 0.5); lower=0)
    p_condition_participant_sd ~ truncated(Normal(0, 0.5); lower=0)

    # Priors for each participant
    p_intercept_participant ~ filldist(Normal(0, p_intercept_participant_sd), length(unique(participant)))
    p_condition_participant ~ filldist(Normal(0, p_condition_participant_sd), length(unique(participant)))

    # Likelihood
    for i in 1:length(y)
        # Random components
        p_intercept = p_intercept_fixed + p_intercept_participant[participant[i]]
        p_condition = p_condition_fixed + p_condition_participant[participant[i]]

        # Linear model equation
        p = p_intercept + p_condition * condition[i]

        # Inference
        y[i] ~ BernoulliLogit(p)
    end
end

fit_logistic = model_logistic(df.Error, condition=df.Accuracy, participant=df.Participant)
posteriors_logistic = sample(fit_logistic, NUTS(), 500)
# using MicroCanonicalHMC
# posteriors_logistic = sample(fit_logistic, externalsampler(MCHMC(1000, 0.001)), 500)

Unfortunately, mixed models are currently quite slow to sample in Julia.

4.4 Choice-Confidence (Choco) Models

4.4.1 Real Data Example

Data Preprocessing

#| code-fold: false

using Downloads, CSV, DataFrames, Random
using Turing, Distributions, StatsFuns, SubjectiveScalesModels
using GLMakie


Random.seed!(123)

df = CSV.read(Downloads.download("https://raw.githubusercontent.com/RealityBending/FakeFace/main/data/data.csv"), DataFrame)
df = df[:, [:Participant, :Stimulus, :Real, :Attractive]]

# Show the first 10 rows
first(df, 10)

Our variable of interest is Real, for which participants had to indicate on an analog scale whether they thought the face stimulus was fake (on the left) or real (on the right).

#| code-fold: false

println("min: ", minimum(df.Real), " max: ", maximum(df.Real))

As the data had originally been modelled using a \(Beta\) regression, the values were nudged away from the boundaries to avoid exact 0s and 1s. Since extreme values are not a problem for the Choco model, we can re-scale the data back to its original [0, 1] range.

#| code-fold: false

df.Real = data_rescale(df.Real; new_range=[0, 1])

#| code-fold: false

hist(df.Real, bins=beta_bins(30), normalization=:pdf, color=:darkred)

We can see that the distribution is quite particular: it seems bimodal (with mass on both sides of the scale) with a lot of zeros and ones (“zero-and-one inflation”). This type of data, while common in psychology, is not well captured by standard models (e.g., linear regression) and requires more advanced models, such as the Choco model.

The Choco Model

See here. TODO. Present the Choco model.
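In the meantime, here is a minimal sketch of what simulating choice-confidence data from a Choco distribution might look like. The parameter values below are arbitrary illustrations, and the argument order is assumed from the draft model further down; check the SubjectiveScalesModels.jl documentation for the actual signature.

```julia
using SubjectiveScalesModels, Random

Random.seed!(123)

# Hypothetical parameters (illustration only): the probability of the
# right-side ("real") choice, the location and precision of each Beta
# component, and the boundary-inflation parameters
dist = Choco(0.5, 0.7, 3.0, 0.7, 3.0, 0.0, 200, 0.05, 0.05)

# Simulate responses on the [0, 1] scale
y = rand(dist, 1_000)
```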

Bayesian Choco

# @model function model_choco(y)
#     p1 ~ Normal(0, 2)
#     μ0 ~ Normal(0, 1)
#     μ1 ~ Normal(0, 1)
#     ϕ0 ~ Normal(0, 1)
#     ϕ1 ~ Normal(0, 1)
#     k0 ~ -Gamma(3, 3)
#     k1 ~ Gamma(3, 3)

#     for i in 1:length(y)
#         y[i] ~ Choco(logistic(p1), logistic(μ0), exp(ϕ0), logistic(μ1), exp(ϕ1), 0.0, 200, logistic(k0), logistic(k1))
#     end
# end

# fit_choco = model_choco(df.Real)
# posteriors = sample(fit_choco, NUTS(; init_ϵ=0.1), 1000,
#     initial_params=[0, 0, 0, 0, 0, -9, 9])

# # 95% CI
# hpd(posteriors)