Introduction to Multi–armed Bandit Algorithms
Homework Assignment 1
Instructions
• There are three problems in this Assignment that are related to each other.
• Attach your code to receive full credit. We recommend using Python but the choice is yours.
• It is recommended that you use a Document Scanning application, e.g., iOS Notes, CamScanner, to convert your solutions to PDF format.
Comparing the Performance of Multi-armed Bandit Algorithms
Through a set of assignments, we will learn how to apply multi–armed bandit algorithms in real–life applications. We will also compare the performance of three multi–armed bandit (MAB) algorithms on real datasets. The datasets we will use to that end are as follows:
• Movie Lens Dataset: https://grouplens.org/datasets/movielens/1m/
• Goodreads Dataset: https://github.com/MengtingWan/chainRec/tree/master
• Open Bandits Dataset: https://arxiv.org/pdf/2008.07146.pdf
Choose one of these datasets. Movie Lens and Goodreads datasets contain user ratings for movies and books, respectively. Your main goal is to compare the performance of the fol– lowing MAB algorithms: Explore–then–commit (ETC), Upper Confidence Bound (UCB), and Thompson Sampling (TS). The comparison will be in terms of the expected cumulative Regret an algorithm incurs until round t, where t = 1, . . . , n. Here, n defines the horizon, i.e., the total number of rounds the algorithm is used. Depending on the data set, decide on what “arms” may represent and what “rewards” may represent. For example, each movie genre can be an arm, and user ratings can be considered as the reward received when movie from a genre is rated by a user. Please stick with the same set–up and data–set for all questions below.
Problem 1
This question aims to demonstrate the fact that theoretical results presented in the lectures are for the mean regret. Thus, to check their validity in real-world experiments, we need to run a sufficient number of experiments and look at the average value of the cumulative regret.
Consider the ETC algorithm. Choose the horizon as n = 100, 000 and set the length m * k of the exploration phase as ≈ 10% of n, i.e., m * k ≈ 10, 000. In Problem 3, we will explore the impact of m on the performance. Using your chosen dataset, run the ETC algorithm and record the cumulative regret at each round t = 1, . . . , n. Repeat this process 10 times, reshuffling the data each time before the experiment is run.
Plot the result of ten experiments separately on the same figure. Namely, your figure should have ten curves, one for each experiment, representing the cumulative regret in round t as t varies from 1 to 100, 000. Comment on what you see from these figures. For example, do you get the same regret value in each experiment?
Now, run 100 experiments with the same setting and plot the average regret together with error bars indicating one standard deviation above and below the mean. See Figure 1 below for an example plot.
For all remaining parts, plots should be obtained by averaging over 100 experiments and errors bar should be shown as indicated above (unless otherwise indicated).
Problem 2
This question is about setting the value of horizon n appropriately. Some of the theoretical results presented in lectures, e.g., the fact that cumulative regret scales as the logarithm of n for the algorithms presented, are valid in the limit of n → ∞ . In practice, how large n should be to see this logarithmic behavior. may depend on the particular setting used including the algorithm and the reward distributions.
We will consider five different values for the horizon: n = 500, n = 5, 000, n = 50, 000, n = 500, 000, and n = 5, 000, 000. Consider the ETC algorithm with the length m * k of the exploration phase set as 10% of n. For each nvalue, plot the average regret of the ETC algorithm as a function of t = 1, . . . , n, i.e., you will produce five figures, one for each n value. For this question, error bars are not needed.
Comment on the results.
Problem 3
In this question, we will explore how to set the parameter m for the ETC algorithm, i.e., to see how long should we explore.
Set n = 100, 000. For your chosen dataset, plot the performance of the ETC algorithm for four different m values: m * k = 100, m * k = 1, 000, m * k = 10, 000, m * k = 20, 000. As before, average over 100 experiments and show the error bars. Comment on the results you are seeing.
Figure 1: Example plots showing cumulative regret as a function of the number of rounds for different algorithms.