Appendix B — Simulated Datasets
While we have made an effort to include real datasets wherever possible in this mini-book, for certain hypothesis tests we will rely on simulated data to demonstrate the application of the test workflow from Chapter 1. This simulation-based approach gives us datasets tailored to illustrate how each test’s modelling assumptions must be satisfied to ensure that we deliver robust inferential conclusions to our stakeholders. Therefore, this appendix explains the generative modelling process used to create these simulated datasets.

Heads-up on the language chosen to simulate the datasets in the mini-book’s main chapters!
To maintain a bilingual approach using both R and Python, we will provide the code for generating the datasets in both languages within this appendix. However, in the main chapters, we will rely on the data points from the R-generated set while conducting the inferential analysis using both programming languages. This decision is made because there is a discrepancy between the datasets simulated by the two languages, even when using the same simulation seed, due to their distinct pseudo-random number generators.
For each of the datasets listed below, besides providing the simulation code, we will elaborate on the dataset context along with the relevant equations (if necessary) that generate this data.
B.1 \(t\)-test for Paired Samples Dataset
This dataset is used in Chapter 3, more specifically in Section 3.4, to demonstrate the \(t\)-test for paired samples via a hypothetical scenario from medical research. Consider a clinical investigation examining whether an innovative medication can decrease low-density lipoprotein (LDL) cholesterol levels in adults diagnosed with hypercholesterolemia, a condition characterized by elevated cholesterol levels in the blood. Elevated cholesterol increases the risk of heart problems because it can build up in blood vessels and block blood flow.

Heads-up on the use of this dataset!
This data is simulated and by no means should be considered for medical research or advice.
This simulated study monitors a group of \(n = 120\) participants over time, recording LDL cholesterol concentrations prior to treatment and again after an eight-week course of the drug. The outcome of interest is the LDL cholesterol measurement (in \(\text{mg/dL}\)), a widely used biomarker for cardiovascular risk. Since each participant provides two observations (i.e., one before and one after treatment), the data are naturally paired (the same individual is measured twice), allowing the analysis to focus on changes within individuals rather than differences between unrelated groups.
B.1.1 Generative Modelling Process
The simulated dataset reflects the following study characteristics:
- Baseline LDL levels: Participants in the simulated population have a mean pre-treatment LDL cholesterol level of \(\mu = 160 \text{ mg/dL}\), with a standard deviation of \(\sigma = 5 \text{ mg/dL}\) to account for individual variation.
- Expected treatment effect: On average, in this simulated population of participants, LDL cholesterol levels decline by \(\Delta = 2.5 \text{ mg/dL}\) after the intervention using the new medication.
- Measurement variation: Biological variability and measurement error introduce random fluctuations in LDL values, even within the same patient.
Formally, for the \(i\)th participant, the LDL cholesterol level before treatment is assumed to follow a Normal distribution with \(\mu = 160 \text{ mg/dL}\) and \(\sigma = 5 \text{ mg/dL}\), which we can express as:
\[\text{LDL}_{\text{before}, i} \sim \text{Normal}(\mu = 160, \sigma^2= 5^2).\]
Then, again for the \(i\)th participant, the LDL cholesterol level after the treatment is another random variable that is a combination of three components as follows:
\[\text{LDL}_{\text{after}, i} = \text{LDL}_{\text{before}, i} - \Delta + \varepsilon_i,\]
where:
- \(\Delta = 2.5 \text{ mg/dL}\) is the average decrease in LDL cholesterol level due to this new treatment in the simulated population.
- \(\varepsilon_i \sim \text{Normal}(0, 1)\) represents additional variation specific to the after-treatment measurement for the \(i\)th participant. This is what we already defined as the measurement variation.
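Before turning to the full simulation code, the two equations above can be sanity-checked with a quick sketch (Python here; the seed and the use of numpy's `default_rng` are arbitrary choices for this illustration, not the book's official simulation). Since the paired difference collapses to \(d_i = \text{LDL}_{\text{before}, i} - \text{LDL}_{\text{after}, i} = \Delta - \varepsilon_i\), the differences should average near \(2.5 \text{ mg/dL}\) with a standard deviation near \(1\), regardless of the \(5 \text{ mg/dL}\) baseline spread:

```python
import numpy as np

# Sketch of the generative model above: LDL_before ~ Normal(160, 5^2),
# LDL_after = LDL_before - Delta + eps, with eps ~ Normal(0, 1).
rng = np.random.default_rng(123)  # seed arbitrary for this sketch

n, mu, sigma, delta = 120, 160.0, 5.0, 2.5
ldl_before = rng.normal(loc=mu, scale=sigma, size=n)
eps = rng.normal(loc=0.0, scale=1.0, size=n)
ldl_after = ldl_before - delta + eps

# Paired differences d_i = Delta - eps_i: mean near 2.5, sd near 1
d = ldl_before - ldl_after
print(round(d.mean(), 2), round(d.std(ddof=1), 2))
```

This is exactly why the paired \(t\)-test works on the differences: the large person-to-person variability in baseline LDL cancels out, leaving only the treatment effect plus the measurement noise.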
B.1.2 Code
Let us check the corresponding code to simulate this data. We are simulating 120 participants in this study, which gives us a sample size of \(n = 120\). Furthermore, note that Python additionally uses the {numpy} (Harris et al. 2020) and {pandas} libraries. The final data frame will be stored in cholesterol_data, which will have the following columns: patient_id, LDL_before, and LDL_after.
# Set seed for reproducibility
set.seed(123)
# Set number of participants
n_patients <- 120
# Set the average decrease in LDL cholesterol levels due to new treatment
Delta <- 2.5
# Step 1: Generate LDL cholesterol levels before treatment
ldl_before <- rnorm(n_patients, mean = 160, sd = 5)
# Step 2: Generate LDL cholesterol levels after treatment
ldl_after <- ldl_before - Delta + rnorm(n_patients, mean = 0, sd = 1)
# Create final dataset
cholesterol_data <- data.frame(
patient_id = 1:n_patients,
LDL_before = round(ldl_before, 1),
LDL_after = round(ldl_after, 1)
)
# Showing the first 20 participants
head(cholesterol_data, 20)

# Importing libraries
import numpy as np
import pandas as pd
# Set seed for reproducibility
np.random.seed(123)
# Set number of participants
n_patients = 120
# Set the average decrease in LDL cholesterol levels due to new treatment
Delta = 2.5
# Step 1: Generate LDL cholesterol levels before treatment
ldl_before = np.random.normal(loc=160, scale=5, size=n_patients)
# Step 2: Generate LDL cholesterol levels after treatment
ldl_after = ldl_before - Delta + np.random.normal(loc=0, scale=1, size=n_patients)
# Create final dataset
cholesterol_data = pd.DataFrame({
"patient_id": np.arange(1, n_patients + 1),
"LDL_before": np.round(ldl_before, 1),
"LDL_after": np.round(ldl_after, 1)
})
# Showing the first 20 participants
print(cholesterol_data.head(20))

B.2 Tests for Two Population Proportions
B.2.1 Generative Modelling Process
- Under development
B.2.2 Code
- Under development
# Set seed
set.seed(123)
# Set sample sizes
n_A <- 150
n_B <- 200
# Set the values for true population proportions
p_A <- 0.57
p_B <- 0.43
# Generate binary outcomes for survey responses
support_A <- rbinom(n_A, size = 1, prob = p_A)
support_B <- rbinom(n_B, size = 1, prob = p_B)
# Create dataset
policy_data <- data.frame(
region = c(rep("A", n_A), rep("B", n_B)),
support = c(support_A, support_B) )
# Display first few rows
head(policy_data, 10)

   region support
1       A       1
2       A       0
3       A       1
4       A       0
5       A       0
6       A       1
7       A       1
8       A       0
9       A       1
10      A       1
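The section above currently provides only the R chunk. As a sketch of what the Python counterpart could look like, mirroring the same sample sizes and true proportions (recall from the start of this appendix that numpy's pseudo-random generator will not reproduce R's draws even with the same seed):

```python
import numpy as np
import pandas as pd

# Python counterpart of the R chunk above (illustrative sketch).
# Set seed for reproducibility
np.random.seed(123)

# Set sample sizes and true population proportions
n_A, n_B = 150, 200
p_A, p_B = 0.57, 0.43

# Generate binary outcomes for survey responses (1 = support, 0 = no support)
support_A = np.random.binomial(n=1, p=p_A, size=n_A)
support_B = np.random.binomial(n=1, p=p_B, size=n_B)

# Create dataset
policy_data = pd.DataFrame({
    "region": ["A"] * n_A + ["B"] * n_B,
    "support": np.concatenate([support_A, support_B])
})

# Display first few rows
print(policy_data.head(10))
```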
B.3 ANOVA Datasets
These datasets are used in Chapter 4 to elaborate on analysis of variance (ANOVA) and pertain to an experimental context. Suppose a data-driven marketing team at a well-known tech company, which operates a global online store, is conducting two different A/B/n tests aimed at increasing the customer conversion score (i.e., the outcome). In these experiments, the customer conversion score is defined as a unitless and standardized engagement index. This index combines various elements, such as clicks, time spent on the webpage, and the probability of making a purchase, with a baseline mean set at \(50\). This score measures customer responsiveness on the online store: the higher the score, the greater the customer responsiveness.

B.3.1 One-way ANOVA
Let us start by outlining the generative modelling process for a one-way ANOVA experiment. In this context, we will simulate a dataset that represents a continuous response variable (in this case, the customer conversion score) affected by a single experimental factor, specifically the design of a webpage. This setup allows us to assess whether the average response systematically varies across different design options.
Generative Modelling Process
The experiment involves webpage design as a controllable factor determined by the experimenter. This factor includes three different layouts: \(D_1\) (the current layout), \(D_2\) (a new and experimental layout), and \(D_3\) (another new and experimental layout), making it a three-level factor. This experimental study will be characterized by the following elements:
- One factor, namely, webpage design.
- There are \(3\) treatments (i.e., the above three factor levels), which classifies this study as A/B/n testing.
- We will simulate 200 customers (i.e., replicates) per treatment in our final dataset.
- The outcome variable \(Y\) is the customer conversion score, which has been previously explained.
Our data structure will be an additive model conceptually depicted as:
\[ \begin{align*} \text{Outcome} &= \text{Main Effect} + \text{Random Error}. \end{align*} \tag{B.1}\]
Then, for the data generation process, let \(Y_{i,k}\) represent the customer conversion score for the \(k\)th replicate of the treatment related to the \(i\)th webpage design level. Equation B.1 is translated as:
\[ Y_{i,k} = \alpha_i + \varepsilon_{i,k}, \tag{B.2}\]
where
- \(\alpha_i\) is the fixed main effect corresponding to the \(i\)th level of webpage design for \(i = D_1, D_2, D_3\);
- \(\varepsilon_{i,k}\) is the random error associated with each \(Y_{i,k}\), capturing the variability and measurement error that introduce randomness into the response \(Y_{i,k}\).
In this case, Equation B.2 breaks down the outcome on the right-hand side into two additive components, which serve as the foundation for how ANOVA models the data. Unlike the random error \(\varepsilon_{i,k}\), the term \(\alpha_i\) is assumed to be fixed within the data-generating process, as we are using a frequentist approach. Since \(\varepsilon_{i,k}\) is random, we will assume it follows a Normal distribution with a mean of \(0\) and a variance of \(\sigma^2\), which is another fixed parameter in the simulation:
\[ \varepsilon_{i,k} \sim \text{Normal}(0, \sigma^2). \] In terms of our simulation, imagine you have a population of customers with the following fixed parameters:
- A vector of webpage design effects (i.e., the main effect)
\[\boldsymbol{\alpha} = \begin{bmatrix} \alpha_{D_1} \\ \alpha_{D_2} \\ \alpha_{D_3} \end{bmatrix} = \begin{bmatrix} 60 \\ 60 \\ 60 \end{bmatrix}.\]
- An overall variance
\[ \sigma^2 = 10. \]
The rationale for this simulation setup lies in constructing a baseline generative process that explicitly embodies the null hypothesis of a one-way ANOVA. By assigning identical mean effects to all three webpage designs, we model a population in which, on average, each webpage design yields the same customer conversion performance (that is, all group means are equal as specified in the vector \(\boldsymbol{\alpha}\)). Consequently, any observed differences among sample means arise purely from random sampling variability, not from systematic treatment effects.
In addition, this generative design encodes two fundamental assumptions of the ANOVA framework: normality and homoscedasticity. The normality assumption on the random component \(\varepsilon_{i,k}\) ensures that each group’s response distribution is symmetric and bell-shaped, supporting the validity of parametric inference represented by the \(F\)-test in ANOVA. The homoscedasticity assumption (expressed here as a constant variance \(\sigma^2 = 10\)) asserts that the variability in customer responses is identical across all webpage designs. This condition guarantees that any detected mean differences can be attributed to true design effects rather than unequal levels of random noise.
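To make the null behaviour described above concrete, the following sketch (illustrative only, not the book's official simulation; the seed and the use of numpy's `default_rng` are arbitrary choices here) draws one dataset from this generative process and computes the one-way ANOVA \(F\) statistic by hand from the between-group and within-group sums of squares:

```python
import numpy as np

# Sketch: one-way ANOVA F statistic computed by hand for data drawn from
# the null generative process above (all group means 60, variance 10).
rng = np.random.default_rng(123)  # seed arbitrary for this sketch

k, n_per, sigma2 = 3, 200, 10.0
groups = [rng.normal(loc=60.0, scale=np.sqrt(sigma2), size=n_per)
          for _ in range(k)]

grand_mean = np.mean(np.concatenate(groups))
ss_between = sum(n_per * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between, df_within = k - 1, k * n_per - k  # 2 and 597
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(f_stat, 3))
```

Under this null process, \(F\) fluctuates around \(1\), and the within-group mean square estimates the fixed \(\sigma^2 = 10\); a value well above the \(5\%\) critical value of \(F(2, 597)\) (roughly \(3\)) would be needed to reject equality of the design means.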
Code
Let us move to the corresponding code to simulate this data. Recall we are simulating 200 customers for each one of the three treatments, which will give us an overall sample size of \(n = 600\). Furthermore, note that Python additionally uses the {numpy} and {pandas} libraries. The final data frame will be stored in ABn_customer_data_one_factor whose columns will be webpage_design and conversion_score.
# Set seed for reproducibility
set.seed(123)
# Factor levels and sampled customers per treatment
webpage_design_levels <- c("D1", "D2", "D3")
n_per_treatment <- 200
# Population fixed additive parameters
alpha <- c(60, 60, 60)
# Simulating data
data_list <- list()
for (i in 1:3) {
mean_i <- alpha[i]
y <- rnorm(n_per_treatment, mean = mean_i, sd = sqrt(10))
df_i <- data.frame(
webpage_design = as.factor(webpage_design_levels[i]),
conversion_score = round(y, 2)
)
data_list[[length(data_list) + 1]] <- df_i
}
ABn_customer_data_one_factor <- do.call(rbind, data_list)
# Showing the first 100 customers of the A/B/n testing
head(ABn_customer_data_one_factor, n = 100)

# Importing libraries
import numpy as np
import pandas as pd
# Set seed for reproducibility
np.random.seed(123)
# Factor levels and sampled customers per treatment
webpage_design_levels = ["D1", "D2", "D3"]
n_per_treatment = 200
# Population fixed additive parameters
alpha = [60, 60, 60]
# Simulating data
data_list = []
for i in range(3):
mean_i = alpha[i]
y = np.random.normal(loc=mean_i, scale=np.sqrt(10), size = n_per_treatment)
y_rounded = np.round(y, 2)
df_i = pd.DataFrame({
'webpage_design': [webpage_design_levels[i]] * n_per_treatment,
'conversion_score': y_rounded
})
data_list.append(df_i)
# Concatenate all groups into one DataFrame
ABn_customer_data_one_factor = pd.concat(data_list, ignore_index = True)
# Showing the first 100 customers of the A/B/n testing
print(ABn_customer_data_one_factor.head(100))

B.3.2 Two-way ANOVA
Let us extend our generative modelling framework to a two-way ANOVA experiment. In this context, we will simulate a dataset with a continuous response variable (i.e., the customer conversion score) that is affected by two experimental factors: webpage design and discount framing strategy. Each factor has its own main effect on the response, and their combination may also yield an interaction effect, indicating how the impact of one factor depends on the level of the other. This setup allows us to assess whether variations in customer conversion arise from the webpage design, the discount framing, or their combined influence. Note that all of these effects enter the model additively. This approach offers a more comprehensive understanding of experimental variation and aids in practical decision-making.
Generative Modelling Process
This second experiment has the following controllable factors by the experimenter:
- Webpage design: Three different layouts \(D_1\) (the current layout), \(D_2\) (a new layout), and \(D_3\) (another new layout). This makes a three-level factor.
- Discount framing: \(\text{Low}\) (i.e., “Save 10% today”) or \(\text{High}\) (i.e., “Save up to 40% today”). This makes a two-level factor.
This study will be a full factorial experiment characterized by the following elements:
- Two factors: webpage design and discount framing.
- There are \(3 \times 2 = 6\) treatments (i.e., six different combinations of all the factor levels), which classifies this study as A/B/n testing.
- We will simulate 200 customers (i.e., replicates) per treatment in our final dataset.
- The outcome variable \(Y\) is the customer conversion score, which has been previously explained.
Our data structure will be an additive model conceptually depicted as:
\[ \begin{align*} \text{Outcome} &= \text{Overall Effect} + \\ & \qquad \text{First Main Effect} + \text{Second Main Effect} + \\ & \qquad \quad \text{Interaction Effect} + \text{Random Error}. \end{align*} \tag{B.3}\]
Then, for the data generation process, let \(Y_{i,j,k}\) represent the customer conversion score for the \(k\)th replicate of the treatment related to the \(i\)th webpage design and the \(j\)th discount framing levels. Equation B.3 is translated as:
\[ Y_{i,j,k} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{i,j} + \varepsilon_{i,j,k}, \tag{B.4}\]
where
- \(\mu\) is the grand mean of all observations (i.e., the overall effect).
- \(\alpha_i\) is the fixed first main effect corresponding to the \(i\)th level of webpage design for \(i = D_1, D_2, D_3\);
- \(\beta_j\) is the second fixed main effect corresponding to the \(j\)th level of discount framing for \(j = \text{Low}, \text{High}\);
- \((\alpha \beta)_{i,j}\) is the fixed interaction effect between the \(i\)th and \(j\)th levels of webpage design and discount framing respectively, and
- \(\varepsilon_{i,j,k}\) is the random error associated with each \(Y_{i,j,k}\), capturing the variability and measurement error that introduce randomness into the response \(Y_{i,j,k}\).
Heads-up on the mathematical representation of the interaction term!
The \((\alpha \beta)_{i,j}\) in Equation B.4 does not indicate that the main effects are multiplying each other. Mathematically, it is just another additive term on the right-hand side of the equation.
In this case, Equation B.4 breaks down the outcome on the right-hand side into five additive components, which form the basis of how ANOVA models the data. With the exception of the random error \(\varepsilon_{i,j,k}\), the other four terms are assumed to be fixed within the data-generating process, given that we are using a frequentist approach. Since \(\varepsilon_{i,j,k}\) is random, we will assume that it follows a Normal distribution with a mean of \(0\) and a variance of \(\sigma^2\) (which is another fixed parameter in the simulation):
\[ \varepsilon_{i,j,k} \sim \text{Normal}(0, \sigma^2). \]
In terms of our simulation, imagine you have a population of customers with the following fixed parameters:
- A scalar denoting the grand mean of all observations (i.e., the overall effect)
\[ \mu = 15. \]
- A vector of webpage design effects (i.e., the first main effect)
\[\boldsymbol{\alpha} = \begin{bmatrix} \alpha_{D_1} \\ \alpha_{D_2} \\ \alpha_{D_3} \end{bmatrix} = \begin{bmatrix} 40 \\ 50 \\ 60 \end{bmatrix}.\]
- A vector of discount framing effects (i.e., the second main effect)
\[\boldsymbol{\beta} = \begin{bmatrix} \beta_{\text{Low}} \\ \beta_{\text{High}} \end{bmatrix} = \begin{bmatrix} -6 \\ 6 \end{bmatrix}.\]
- A matrix of interaction effects, whose rows correspond to the levels of webpage design and columns to the levels of discount framing,
\[ \boldsymbol{(\alpha \beta)} = \begin{bmatrix} (\alpha \beta)_{D_1,\text{Low}} & (\alpha \beta)_{D_1,\text{High}} \\ (\alpha \beta)_{D_2,\text{Low}} & (\alpha \beta)_{D_2,\text{High}} \\ (\alpha \beta)_{D_3,\text{Low}} & (\alpha \beta)_{D_3,\text{High}} \end{bmatrix} = \begin{bmatrix} 0 & 5 \\ 4 & 6 \\ 8 & -12 \end{bmatrix}. \]
- An overall variance
\[ \sigma^2 = 16. \]
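Before simulating, it can help to tabulate the expected conversion score for each of the six treatment cells implied by these fixed parameters, \(\mu + \alpha_i + \beta_j + (\alpha \beta)_{i,j}\). A minimal numpy sketch (illustrative only) assembles the \(3 \times 2\) matrix of cell means via broadcasting:

```python
import numpy as np

# Expected score per treatment cell: mu + alpha_i + beta_j + (alpha*beta)_ij
mu = 15
alpha = np.array([40, 50, 60]).reshape(3, 1)  # webpage design (rows)
beta = np.array([-6, 6])                      # discount framing (columns)
interaction = np.array([
    [0, 5],
    [4, 6],
    [8, -12],
])

cell_means = mu + alpha + beta + interaction
print(cell_means)
# [[49 66]
#  [63 77]
#  [77 69]]
```

Note how the negative interaction effect at \((D_3, \text{High})\) makes \(D_3\) the only design for which the \(\text{Low}\) framing yields a higher expected score (\(77\)) than the \(\text{High}\) framing (\(69\)); this is precisely the kind of pattern the interaction term in Equation B.4 is meant to capture.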
The purpose of this simulation is to define a generative process that captures the essential structure of a two-way ANOVA design. In this setting, both webpage design and discount framing are treated as controlled experimental factors, each contributing to the overall pattern of customer conversion scores. By specifying fixed main effects for each factor and an accompanying matrix of interaction effects, we represent a population where conversion performance arises from both individual influences and their combined interplay. This design enables us to study how the effect of one factor may depend on the level of the other, which is an idea central to interpreting two-way ANOVA outcomes.
Moreover, the simulation is grounded in the core distributional assumptions of the ANOVA framework: normality and homoscedasticity. The random term \(\varepsilon_{i,j,k}\) is assumed to follow a Normal distribution, ensuring that the response within each treatment combination is approximately symmetric and well-behaved. The homoscedasticity condition, specified through a constant variance \(\sigma^2 = 16\), implies that the degree of variability in customer responses remains consistent across all combinations of factors. These assumptions provide the stability necessary for the \(F\)-test to reliably separate true factor effects from random fluctuations.
Code
Let us move to the corresponding code to simulate this data. Recall we are simulating 200 customers for each one of the six treatments, which will give us an overall sample size of \(n = 1,200\). Furthermore, note that Python additionally uses the {numpy} and {pandas} libraries. The final data frame will be stored in ABn_customer_data_two_factors whose columns will be webpage_design, discount_framing, and conversion_score.
# Set seed for reproducibility
set.seed(123)
# Factor levels and sampled customers per treatment
webpage_design_levels <- c("D1", "D2", "D3")
discount_framing_levels <- c("Low", "High")
n_per_treatment <- 200
# Population fixed additive parameters
mu <- 15
alpha <- c(40, 50, 60)
beta <- c(-6, 6)
interaction <- matrix(
c(
0, 5,
4, 6,
8, -12
),
nrow = 3, byrow = TRUE
)
# Simulating data
data_list <- list()
for (i in 1:3) {
for (j in 1:2) {
mean_ij <- mu + alpha[i] + beta[j] + interaction[i, j]
y <- rnorm(n_per_treatment, mean = mean_ij, sd = sqrt(16))
df_ij <- data.frame(
webpage_design = as.factor(webpage_design_levels[i]),
discount_framing = as.factor(discount_framing_levels[j]),
conversion_score = round(y, 2)
)
data_list[[length(data_list) + 1]] <- df_ij
}
}
ABn_customer_data_two_factors <- do.call(rbind, data_list)
# Showing the first 100 customers of the A/B/n testing
head(ABn_customer_data_two_factors, n = 100)

# Importing libraries
import numpy as np
import pandas as pd
# Set seed for reproducibility
np.random.seed(123)
# Factor levels and sampled customers per treatment
webpage_design_levels = ["D1", "D2", "D3"]
discount_framing_levels = ["Low", "High"]
n_per_treatment = 200
# Population fixed additive parameters
mu = 15
alpha = [40, 50, 60]
beta = [-6, 6]
interaction = np.array([
[0, 5],
[4, 6],
[8, -12]
])
# Simulating data
data_list = []
for i in range(3):
for j in range(2):
mean_ij = mu + alpha[i] + beta[j] + interaction[i, j]
y = np.random.normal(loc=mean_ij, scale=np.sqrt(16), size = n_per_treatment)
y_rounded = np.round(y, 2)
df_ij = pd.DataFrame({
'webpage_design': [webpage_design_levels[i]] * n_per_treatment,
'discount_framing': [discount_framing_levels[j]] * n_per_treatment,
'conversion_score': y_rounded
})
data_list.append(df_ij)
# Concatenate all groups into one DataFrame
ABn_customer_data_two_factors = pd.concat(data_list, ignore_index = True)
# Showing the first 100 customers of the A/B/n testing
print(ABn_customer_data_two_factors.head(100))