Developing the ESBAS Algorithm

By now, you are capable of approaching RL problems in a systematic and concise way. You are able to design and develop RL algorithms specifically for the problem at hand and get the most from the environment. Moreover, in the previous two chapters, you learned about algorithms that go beyond RL, but that can be used to solve the same set of tasks.

At the beginning of this chapter, we'll present a dilemma that we have already encountered in many of the previous chapters; namely, the exploration-exploitation dilemma. We have already presented potential solutions for the dilemma throughout the book (such as the ε-greedy strategy), but we want to give you a more comprehensive outlook on the problem, and a more concise view of the algorithms that solve it. Many of them, such as the upper confidence bound (UCB) algorithm, are more sophisticated and...

Exploration versus exploitation

The exploration-exploitation trade-off, also known as the exploration-exploitation dilemma or problem, affects many important domains. Indeed, it's not restricted to the RL context; it applies to everyday life as well. The idea behind this dilemma is to establish whether it is better to take the best solution known so far, or whether it's worth trying something new. Let's say you are buying a new book. You could either choose a title from your favorite author, or buy a book of the same genre that Amazon is suggesting to you. In the first case, you are confident about what you're getting, but in the second, you don't know what to expect. In the latter case, however, you could be pleasantly surprised and end up reading a book that is even better than the one written by your favorite author.

This conflict between...

Approaches to exploration

Put simply, the multi-armed bandit problem, and in general every exploration problem, can be solved either through random strategies or through smarter techniques. The best-known algorithm in the first category is ε-greedy, whereas optimistic exploration, such as UCB, and posterior exploration, such as Thompson sampling, belong to the second category. In this section, we'll look in particular at the ε-greedy and UCB strategies.
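
To make the two strategies concrete, here is a minimal sketch of both action-selection rules for a multi-armed bandit, in plain NumPy. This is illustrative code, not the book's implementation; the function names and the exploration constant `c` are our own choices.

```python
import numpy as np

def eps_greedy(q_values, eps=0.1):
    """With probability eps pick a random arm (explore),
    otherwise pick the arm with the highest estimated value (exploit)."""
    if np.random.rand() < eps:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def ucb1(q_values, counts, t, c=2.0):
    """UCB1: add to each estimate an optimism bonus that grows with
    total time t and shrinks with how often the arm has been pulled."""
    if np.any(counts == 0):
        # Pull every arm at least once before trusting the bonus
        return int(np.argmin(counts))
    bonus = np.sqrt(c * np.log(t) / counts)
    return int(np.argmax(q_values + bonus))
```

In both functions, `q_values` holds the running average reward of each arm. ε-greedy explores uniformly at random a fraction ε of the time, while UCB directs exploration toward arms whose estimates are still uncertain because they have been pulled only a few times.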

It's all about balancing the risk and the reward. But how can we measure the quality of an exploration algorithm? Through regret. Regret is defined as the opportunity lost in one step; that is, the regret, $l_t$, at time $t$, is as follows:

$$l_t = \mathbb{E}\left[ V^* - Q(a_t) \right]$$

Here, $V^*$ denotes the optimal value, and $Q(a_t)$ the action-value of action $a_t$.
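
As a toy illustration (with made-up reward means), per-step regret can be computed directly whenever the true action values are known:

```python
import numpy as np

# Hypothetical 3-armed bandit whose true action values are known
# (in practice they are unknown; we use them here only to measure regret).
true_q = np.array([0.2, 0.5, 0.8])
v_star = true_q.max()          # optimal value V*

for a in range(len(true_q)):
    print(f"pulling arm {a}: per-step regret = {v_star - true_q[a]:.2f}")

# Total regret over T steps is the sum of the per-step regrets;
# good algorithms such as UCB keep it growing only logarithmically in T.
```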

Thus, the goal is to find a trade-off between exploration and exploitation, by minimizing the...

Epochal stochastic bandit algorithm selection

The main use of exploration strategies in reinforcement learning is to help the agent explore the environment. We saw this use case in DQN with ε-greedy, and in other algorithms with the injection of additional noise into the policy. However, there are other ways of using exploration strategies. So, to better grasp the exploration concepts that have been presented so far, and to introduce an alternative use case of these algorithms, we will present and develop an algorithm called epochal stochastic bandit algorithm selection (ESBAS). This algorithm was introduced in the paper Reinforcement Learning Algorithm Selection.

ESBAS is a meta-algorithm for online algorithm selection (AS) in the context of reinforcement learning. It uses exploration methods to choose the best algorithm to employ for each trajectory, so as to maximize the expected reward.
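
The following is a rough sketch of the epoch-based selection loop as we understand it from the description above: meta-time is divided into epochs of doubling length, every algorithm in the portfolio is retrained at the start of an epoch, and within the epoch a UCB1 bandit picks which frozen policy runs each trajectory. The `train`, `policy`, and `run_episode` interfaces are assumptions made for illustration, not the paper's API.

```python
import numpy as np

def esbas(portfolio, run_episode, num_epochs):
    """Epoch-based algorithm selection in the style of ESBAS.

    `portfolio` is a list of RL algorithms, each assumed to expose a
    .train(data) method and a .policy attribute; `run_episode(policy)`
    is assumed to return (trajectory, episode_return).
    """
    data = []                                  # all trajectories gathered so far
    for beta in range(num_epochs):
        # 1) At the start of each epoch, retrain every algorithm on the
        #    data gathered so far, then keep its policy frozen.
        for algo in portfolio:
            algo.train(data)
        # 2) Within the epoch, a UCB1 bandit picks which frozen policy
        #    controls each trajectory; epochs have doubling length.
        counts = np.zeros(len(portfolio))
        means = np.zeros(len(portfolio))
        for t in range(2 ** beta):
            if np.any(counts == 0):            # try every algorithm once first
                k = int(np.argmin(counts))
            else:
                bonus = np.sqrt(2 * np.log(t + 1) / counts)
                k = int(np.argmax(means + bonus))
            trajectory, ret = run_episode(portfolio[k].policy)
            data.append(trajectory)
            counts[k] += 1
            means[k] += (ret - means[k]) / counts[k]   # incremental mean of returns
    return portfolio
```

Freezing the policies within an epoch is what makes the inner loop a stochastic bandit problem: each algorithm behaves like an arm with a fixed (if noisy) expected return, so UCB1's guarantees apply.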

In order to...

Summary

In this chapter, we addressed the exploration-exploitation dilemma. This problem had already been tackled in previous chapters, but only lightly, by employing simple strategies. Here, we studied the dilemma in more depth, starting from the classic multi-armed bandit problem. We saw how more sophisticated count-based algorithms, such as UCB, can reach near-optimal performance, with the expected logarithmic regret.

We then used exploration algorithms for AS. AS is an interesting application of exploratory algorithms, because the meta-algorithm has to choose the algorithm that best performs the task at hand. AS also has applications in reinforcement learning. For example, it can be used to pick, from a portfolio of policies trained with different algorithms, the best one with which to run the next trajectory. That's also what ESBAS does...

Questions

  1. What's the exploration-exploitation dilemma?
  2. What are two exploration strategies that we have already used in previous RL algorithms?
  3. What's UCB?
  4. Which problem is more difficult to solve: Montezuma's Revenge or the multi-armed bandit problem?
  5. How does ESBAS tackle the problem of online RL algorithm selection?

Further reading
