Hands-On Reinforcement Learning with Python
Sudharsan Ravichandiran
1st Edition · Packt · June 2018 · ISBN-13: 9781788836524

Sudharsan Ravichandiran is a data scientist and artificial intelligence enthusiast. He holds a Bachelor's in Information Technology from Anna University. His research focuses on practical implementations of deep learning and reinforcement learning, including natural language processing and computer vision. He is an open-source contributor and loves answering questions on Stack Overflow.

Multi-Armed Bandit Problem

In the previous chapters, we learned about the fundamental concepts of reinforcement learning (RL) and several RL algorithms, as well as how RL problems can be modeled as a Markov Decision Process (MDP). We also saw the different model-based and model-free algorithms that are used to solve an MDP. In this chapter, we will look at one of the classical problems in RL, called the multi-armed bandit (MAB) problem. We will see what the MAB problem is, how to solve it with different algorithms, and then how to use MAB to identify the advertisement banner that will receive the most clicks. We will also learn about contextual bandits, which are widely used for building recommendation systems.

In this chapter, you will learn about the following:

  • The MAB problem
  • The epsilon-greedy algorithm
  • The softmax exploration algorithm
  • The upper confidence bound (UCB) algorithm
  • Thompson sampling (TS)

The MAB problem

The MAB problem is one of the classical problems in RL. An MAB is essentially a slot machine, a gambling game played in a casino where you pull an arm (lever) and get a payout (reward) drawn from a probability distribution. A single slot machine is called a one-armed bandit; when there are multiple slot machines, the setup is called a multi-armed bandit or k-armed bandit.

[Figure: multi-armed bandits, depicted as a row of slot machines]

Each slot machine gives rewards from its own probability distribution, and our goal is to find out which slot machine will give us the maximum cumulative reward over time. At each time step t, the agent performs an action a_t, that is, pulls an arm of a slot machine, and receives a reward r_t; the goal of the agent is to maximize the cumulative reward.

We define the value of an arm, Q(a), as the average reward received by pulling the arm, that is, the sum of rewards obtained from the arm divided by the number of times the arm has been pulled.
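
To make this concrete, here is a minimal sketch (not the book's listing) that estimates Q(a) as a running average while choosing arms epsilon-greedily; the arm probabilities in true_means are made up for illustration:

import numpy as np

np.random.seed(0)

n_arms = 5
true_means = np.random.uniform(0, 1, n_arms)  # hidden reward probability of each arm (assumed)

Q = np.zeros(n_arms)       # estimated value Q(a) of each arm
counts = np.zeros(n_arms)  # number of times each arm has been pulled
epsilon = 0.1

def pull(arm):
    # Bernoulli reward drawn from the arm's hidden distribution
    return float(np.random.rand() < true_means[arm])

for t in range(10000):
    if np.random.rand() < epsilon:
        arm = np.random.randint(n_arms)  # explore: pick a random arm
    else:
        arm = int(np.argmax(Q))          # exploit: pick the best arm so far
    reward = pull(arm)
    counts[arm] += 1
    # incremental running-average update of Q(a)
    Q[arm] += (reward - Q[arm]) / counts[arm]

print("true means:", true_means.round(2))
print("estimates :", Q.round(2))

After enough pulls, the estimates Q(a) converge toward the true means, and the argmax identifies the best arm.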

Applications of MAB

So far, we have looked at the MAB problem and how we can solve it using various exploration strategies. But bandits are not just used for playing slot machines; they have many applications.

Bandits can be used as a replacement for A/B testing, one of the most commonly used classical testing methods. Say you have two versions of your website's landing page. How do you know which version most users prefer? You run an A/B test to find out.

In A/B testing, we allocate separate periods for exploration and exploitation, that is, one dedicated phase purely for exploration followed by one purely for exploitation. The problem with this method is that it incurs a lot of regret, and we can minimize that regret using the same exploration strategies we use to solve the MAB problem. Instead...
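
As a rough illustration of this point, here is a small simulation sketch (the click rates, horizon, and phase length are made up) comparing a two-phase A/B test against an epsilon-greedy bandit in terms of cumulative regret:

import numpy as np

np.random.seed(1)
probs = np.array([0.45, 0.55])  # hypothetical click rates of versions A and B
best = probs.max()
horizon = 10000

def ab_test(explore_steps=2000):
    # Dedicated exploration phase, then pure exploitation of the winner
    clicks = np.zeros(2)
    pulls = np.zeros(2)
    regret = 0.0
    for t in range(horizon):
        if t < explore_steps:
            arm = t % 2  # alternate the two versions during the test phase
        else:
            arm = int(np.argmax(clicks / pulls))
        reward = float(np.random.rand() < probs[arm])
        clicks[arm] += reward
        pulls[arm] += 1
        regret += best - probs[arm]
    return regret

def epsilon_greedy(eps=0.1):
    # Interleave exploration and exploitation on every step
    clicks = np.zeros(2)
    pulls = np.ones(2)  # start at one to avoid division by zero
    regret = 0.0
    for t in range(horizon):
        if np.random.rand() < eps:
            arm = np.random.randint(2)
        else:
            arm = int(np.argmax(clicks / pulls))
        reward = float(np.random.rand() < probs[arm])
        clicks[arm] += reward
        pulls[arm] += 1
        regret += best - probs[arm]
    return regret

print("A/B test regret      :", ab_test())
print("epsilon-greedy regret:", epsilon_greedy())

Because the bandit stops sampling the weaker version as soon as the evidence allows, its cumulative regret is typically lower than that of a fixed exploration phase.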

Identifying the right advertisement banner using MAB

Let us say you are running a website and you have five different banners for the same ad, and you want to know which banner attracts the most users. We can model this as a bandit problem: the five banners are the five arms of the bandit, and we award one point if a user clicks the ad and zero points if they do not.

In normal A/B testing, we would perform a complete exploration of all five banners before deciding which one is best, but that costs a lot of time and energy. Instead, we will use a good exploration strategy to decide which banner will give us the most rewards (the most clicks).

First, let us import the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let us simulate a dataset...
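
As a sketch of what such a simulation might look like (the click probabilities are made up, and this continues with the libraries imported above rather than reproducing the book's listing), we can generate click (1) / no-click (0) outcomes for five banners and then run an epsilon-greedy strategy over them:

np.random.seed(42)
click_probs = [0.05, 0.13, 0.08, 0.16, 0.11]  # assumed, unknown to the agent

# one column of simulated click outcomes per banner
df = pd.DataFrame({
    'Banner_type_' + str(i): np.random.binomial(1, p, size=10000)
    for i, p in enumerate(click_probs)
})

num_banners = 5
epsilon = 0.1
count = np.zeros(num_banners)  # times each banner was shown
Q = np.zeros(num_banners)      # estimated click-through rate per banner
banner_selected = []

for t in range(len(df)):
    if np.random.rand() < epsilon:
        banner = np.random.randint(num_banners)  # explore
    else:
        banner = int(np.argmax(Q))               # exploit
    reward = df.iloc[t, banner]                  # observed click / no-click
    count[banner] += 1
    Q[banner] += (reward - Q[banner]) / count[banner]
    banner_selected.append(banner)

print("estimated CTRs:", Q.round(3))
sns.countplot(x=banner_selected)
plt.xlabel('Banner')
plt.ylabel('Times selected')
plt.show()

Over time the strategy concentrates its selections on the banner with the highest click-through rate while still occasionally exploring the others.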

Contextual bandits

We just saw how bandits can be used to recommend the right ad banner to users. But banner preferences vary from user to user: user A may like banner type 1, while user B may like banner type 3. So we have to personalize ad banners according to user behavior. How can we do that? We introduce a new bandit type called the contextual bandit.

In the normal MAB problem, we just perform the action and receive a reward. With contextual bandits, instead of taking actions alone, we also take in the state of the environment, and the state holds the context. Here, the state describes the user's behavior, so we take actions (show ads) according to the state (user behavior) so as to maximize the reward (ad clicks). Contextual bandits are therefore widely used for personalizing content according to a user's preferences and behavior. They are also used to solve cold-start problems in recommendation systems.
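
A minimal way to make the banner example contextual is to keep a separate set of estimates per user segment, the segment being the context. This sketch assumes two made-up segments with different click rates; it is an illustration, not the book's implementation:

import numpy as np

np.random.seed(7)
n_contexts, n_banners = 2, 5
# assumed click rate of each banner for each user segment (context)
ctr = np.array([[0.05, 0.15, 0.08, 0.03, 0.10],
                [0.12, 0.04, 0.06, 0.16, 0.07]])

Q = np.zeros((n_contexts, n_banners))       # per-context value estimates
counts = np.zeros((n_contexts, n_banners))  # per-context pull counts
epsilon = 0.1

for t in range(50000):
    context = np.random.randint(n_contexts)  # observe which user segment arrives
    if np.random.rand() < epsilon:
        banner = np.random.randint(n_banners)  # explore
    else:
        banner = int(np.argmax(Q[context]))    # exploit for this context
    reward = float(np.random.rand() < ctr[context, banner])
    counts[context, banner] += 1
    Q[context, banner] += (reward - Q[context, banner]) / counts[context, banner]

print("best banner per context:", Q.argmax(axis=1))  # expect [1 3] here

Each segment ends up with its own preferred banner, which is exactly the personalization a plain MAB cannot provide.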

Summary

In this chapter, we learned about the MAB problem and how it can be applied to different domains. We covered several methods for handling the explore-exploit dilemma. First, we looked at the epsilon-greedy policy, which explores with probability epsilon and exploits (picks the best-known action) with probability 1-epsilon. We then looked at the UCB algorithm, where we pick the action with the maximum upper confidence bound value, followed by the TS algorithm, where we pick actions by sampling from beta distributions fitted to each arm's reward history.
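
For reference, here are sketches of the two selection rules recapped above; the arm probabilities, exploration constant, and horizon are illustrative rather than taken from the book:

import numpy as np

np.random.seed(3)
probs = np.array([0.3, 0.5, 0.8])  # hidden Bernoulli arm probabilities (made up)
T = 5000

# --- UCB1: pick the arm with the highest upper confidence bound ---
counts = np.ones(len(probs))  # pull each arm once to initialize
rewards = np.array([float(np.random.rand() < p) for p in probs])
for t in range(len(probs), T):
    ucb = rewards / counts + np.sqrt(2 * np.log(t) / counts)
    arm = int(np.argmax(ucb))
    r = float(np.random.rand() < probs[arm])
    counts[arm] += 1
    rewards[arm] += r

# --- Thompson sampling: sample each arm's beta posterior, act greedily ---
alpha = np.ones(len(probs))  # Beta(1, 1) priors
beta = np.ones(len(probs))
for t in range(T):
    samples = np.random.beta(alpha, beta)  # one posterior draw per arm
    arm = int(np.argmax(samples))
    r = float(np.random.rand() < probs[arm])
    alpha[arm] += r     # success count
    beta[arm] += 1 - r  # failure count

print("UCB pull counts          :", counts.astype(int))
print("TS posterior mean rewards:", (alpha / (alpha + beta)).round(2))

Both rules concentrate pulls on the best arm, but UCB does so through an explicit confidence bonus while Thompson sampling does so through posterior randomness.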

In the upcoming chapters, we will learn about deep learning and how deep learning is used to solve RL problems.

Questions

The question list is as follows:

  1. What is an MAB problem?
  2. What is an explore-exploit dilemma?
  3. What is the significance of epsilon in the epsilon-greedy policy?
  4. How do we solve an explore-exploit dilemma?
  5. What is a UCB algorithm?
  6. How does Thompson sampling differ from the UCB algorithm?
