Chapter 3: Contextual Bandits

A more advanced version of the multi-armed bandit is the contextual bandit (CB) problem, where decisions are tailored to the context they are made in. In the previous chapter, we identified the best-performing ad in an online advertising scenario. In doing so, we did not use any information about the user, such as persona, age, gender, location, or previous visits, although leveraging such information could have increased the likelihood of a click. Contextual bandits allow us to use this information, which is why they play a central role in commercial personalization and recommendation applications.

Context is similar to a state in a multi-step reinforcement learning (RL) problem, with one key difference. In a multi-step RL problem, the action an agent takes affects the states it is likely to visit in the subsequent steps. For example, while playing tic-tac-toe, an agent's action in the current state changes the board configuration (state) in a particular way, which in turn influences the states the agent can encounter later. In a CB problem, on the other hand, the next context arrives independently of the agent's earlier actions.

Why we need function approximations

While solving (contextual) multi-armed bandit problems, our goal is to learn the action value of each arm (action) from our observations, which we have denoted by Q(a). In the online advertising example, Q(a) represented our estimate of the probability that a user clicks the ad if we display ad a. Now, assume that we have two pieces of information about the user seeing the ad, namely:

  • Device type (e.g. mobile vs. desktop), and
  • Location (e.g. domestic / U.S. vs. international / non-U.S.)

It is quite likely that ad performance will differ by device type and location, which together make up the context in this example. A CB model therefore leverages this information, estimates the action values for each context, and chooses actions accordingly.

This would look like filling in a table for each ad, with one action value estimate per device-location combination, similar to the one below:

                 Domestic (U.S.)       International (non-U.S.)
    Mobile       Q(D; mobile, dom.)    Q(D; mobile, intl.)
    Desktop      Q(D; desktop, dom.)   Q(D; desktop, intl.)

Table 1 – Sample action values for ad D

This means solving four MAB problems, one for each context. This works when there are only a few contexts, but as the number of context dimensions and their possible values grows, the number of such tables explodes, and we collect too little data for each individual context to estimate its action values reliably. This is why we need function approximations.
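
To make this concrete, here is a minimal sketch of the tabular approach; the epsilon-greedy policy, the reward signal (1 for a click, 0 otherwise), and all names are illustrative assumptions, not code from this book:

    import random

    # Illustrative setup: five ads and the 2 x 2 context grid described above.
    ADS = ["A", "B", "C", "D", "E"]
    CONTEXTS = [(device, location)
                for device in ("mobile", "desktop")
                for location in ("domestic", "international")]

    # One independent MAB per context: Q[ctx][ad] is the estimated click
    # probability of the ad, and N[ctx][ad] counts how often it was shown.
    Q = {ctx: {ad: 0.0 for ad in ADS} for ctx in CONTEXTS}
    N = {ctx: {ad: 0 for ad in ADS} for ctx in CONTEXTS}

    def choose_ad(ctx, epsilon=0.1):
        # Epsilon-greedy selection within this context's bandit.
        if random.random() < epsilon:
            return random.choice(ADS)
        return max(ADS, key=lambda ad: Q[ctx][ad])

    def update(ctx, ad, reward):
        # Incremental sample-average update: Q <- Q + (r - Q) / N.
        N[ctx][ad] += 1
        Q[ctx][ad] += (reward - Q[ctx][ad]) / N[ctx][ad]

Note that each table is learned only from the observations that fall into its own context, which is exactly what makes the tabular approach impractical when the number of distinct contexts is large.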

Using function approximation for context

Function approximations allow us to model the dynamics of a process from which we have observed data, such as contexts and ad clicks. As in the previous chapter, consider an online advertising scenario with five different ads (namely A, B, C, D, and E), where the context comprises user device, location, and age. In this section, our agent will learn five different Q functions, one per ad, each receiving a context x and returning an action value estimate. This is illustrated in Figure 3.1.

Figure 3.1 – We learn a function for each action that receives the context and returns the action value.

At this point, we have a supervised machine learning problem to solve for each action. We can use different models to obtain the Q functions, such as logistic regression or a neural network (the latter actually allows us to use a single network that estimates the values of all actions). Once we choose the type of function approximation...
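
As a minimal sketch of the logistic regression option, assuming scikit-learn's SGDClassifier as an online logistic regression and an illustrative context encoding (neither of which is prescribed here), each ad gets its own model, and its predicted click probability serves as the action value estimate:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    ADS = ["A", "B", "C", "D", "E"]

    # One logistic regression per ad; predict_proba plays the role of the
    # Q function: the estimated click probability given the context x.
    models = {ad: SGDClassifier(loss="log_loss") for ad in ADS}

    def encode_context(device, location, age):
        # Illustrative encoding: two binary indicators plus a scaled age.
        return np.array([[float(device == "mobile"),
                          float(location == "domestic"),
                          age / 100.0]])

    def learn(ad, x, clicked):
        # Online update from a single (context, click) observation.
        models[ad].partial_fit(x, [int(clicked)], classes=[0, 1])

    def act(x, epsilon=0.1):
        # Epsilon-greedy over the per-ad click probability estimates.
        if np.random.rand() < epsilon:
            return np.random.choice(ADS)
        scores = {ad: m.predict_proba(x)[0, 1] for ad, m in models.items()}
        return max(scores, key=scores.get)

    # Warm-start each model once so predict_proba works from the first step.
    x0 = encode_context("mobile", "domestic", 30)
    for ad in ADS:
        learn(ad, x0, clicked=False)
        learn(ad, x0, clicked=True)

A single neural network with one output per ad would follow the same pattern, with the dictionary of models replaced by one model.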

Using function approximation for action

In our online advertising examples so far, we have assumed a fixed set of ads (actions/arms) to choose from. However, in many applications of contextual bandits, the set of available actions changes over time. Take the example of a modern advertising network that uses an ad server to match ads to websites/apps. This is a very dynamic operation that, leaving pricing aside, involves three major components:

  • Website/app content,
  • Viewer/user profile,
  • Ad inventory.

Previously, we considered only the user profile for the context. An ad server additionally needs to take the website/app content into account, but this does not really change the structure of the problem we solved before. Now, however, we cannot use a separate model for each ad, since the ad inventory is dynamic. We handle this by using a single model to which we feed the ad features. This is illustrated in Figure 3.5.

Figure 3.5 –...
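
As a minimal sketch of this idea, assuming a linear model (scikit-learn's SGDRegressor) and illustrative feature sizes rather than anything specified here, a single model scores the concatenation of context features and ad features, so newly arriving ads need no model of their own:

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    N_CONTEXT_FEATURES = 3  # illustrative: e.g., device, location, age
    N_AD_FEATURES = 2       # illustrative: e.g., ad category, ad size

    # A single model scores any (context, ad) pair from their joint features.
    model = SGDRegressor()
    # Warm-start once so predict() works before any feedback has arrived.
    model.partial_fit(np.zeros((1, N_CONTEXT_FEATURES + N_AD_FEATURES)), [0.0])

    def featurize(context_features, ad_features):
        return np.concatenate([context_features, ad_features]).reshape(1, -1)

    def choose_ad(context_features, inventory, epsilon=0.1):
        # inventory: {ad_id: ad_feature_vector}; its contents may change anytime.
        ad_ids = list(inventory)
        if np.random.rand() < epsilon:
            return ad_ids[np.random.randint(len(ad_ids))]
        scores = {ad: model.predict(featurize(context_features,
                                              inventory[ad]))[0]
                  for ad in ad_ids}
        return max(scores, key=scores.get)

    def learn(context_features, ad_features, reward):
        # One online update per observed impression.
        model.partial_fit(featurize(context_features, ad_features), [reward])

Because an ad enters the model only through its features, ads can be added to or removed from the inventory without creating or retraining per-ad models.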

Other applications of multi-armed and contextual bandits

So far, we have used online advertising as our running example. If you are wondering how commonly bandit algorithms are used in practice for such problems, the answer is: quite commonly. For example, Microsoft has a service called Personalizer that is based on bandit algorithms (disclaimer: the author is a Microsoft employee at the time of writing this book). The example here is itself inspired by the work at HubSpot, a marketing solutions company (Collier & Llorens, 2018). Moreover, bandit problems have a vast array of practical applications beyond advertising. In this section, we briefly go over some of those applications.

Recommender systems

The bandit problems we formulated and solved in this chapter are a type of recommender system: they recommend which ad to display, potentially leveraging the information available about the users. There are many other recommender systems that use bandits in a...

Summary

In this chapter, we've concluded our discussion of bandit problems with contextual bandits. As we mentioned, bandit problems have many practical applications, so it would not be a surprise if you already have a problem in your business or research that can be modeled as a bandit problem. Now that you know how to formulate and solve one, go out and apply what you have learned! Bandit problems are also important for developing intuition about the exploration-exploitation dilemma, which exists in almost every RL setting.

Now that you have a solid understanding of how to solve one-step RL, it is time to move on to full-blown multi-step RL. In the next chapter, we will go into the theory behind multi-step RL with Markov Decision Processes, and build the foundation for modern deep RL methods that we will cover in the subsequent chapters.

References

  1. Bouneffouf, D., & Rish, I. (2019). A Survey on Practical Applications of Multi-Armed and Contextual Bandits. arXiv: https://arxiv.org/abs/1904.10040
  2. Chandrashekar, A., Amat, F., Basilico, J., & Jebara, T. (2017, December 7). Artwork Personalization at Netflix. Netflix Technology Blog: https://netflixtechblog.com/artwork-personalization-c589f074ad76
  3. Chapelle, O., & Li, L. (2011). An Empirical Evaluation of Thompson Sampling. Advances in Neural Information Processing Systems 24 (pp. 2249-2257).
  4. Collier, M., & Llorens, H. U. (2018). Deep Contextual Multi-Armed Bandits. arXiv: https://arxiv.org/abs/1807.09809
  5. Gal, Y., Hron, J., & Kendall, A. (2017). Concrete Dropout. Advances in Neural Information Processing Systems 30 (pp. 3581-3590).
  6. Marmerola, G. D. (2017, November 28). Thompson Sampling for Contextual Bandits. Guilherme's Blog: https://gdmarmerola.github.io/ts...