Advanced Exploration

In this chapter, we will talk about exploration in reinforcement learning (RL). It has been mentioned several times in the book that the exploration/exploitation dilemma is fundamental to RL and very important for efficient learning. However, in the previous examples, we used quite a trivial approach to exploring the environment: in most cases, ε-greedy action selection. Now it's time to go deeper into the exploration subfield of RL.

In this chapter, we will:

  • Discuss why exploration is such a fundamental topic in RL
  • Explore the effectiveness of the epsilon-greedy (ε-greedy) approach
  • Take a look at alternatives and try them on different environments

Why exploration is important

In this book, lots of environments and methods have been discussed, and exploration has been mentioned in almost every chapter. Very likely, you already have an idea of why it's important to explore the environment effectively, so I'm just going to list the main reasons.

Before that, it might be useful to agree on the term "effective exploration." A strict definition of this exists in theoretical RL, but the high-level idea is simple and intuitive: exploration is effective when we don't waste time in states of the environment that the agent has already seen and is familiar with. Rather than taking the same actions again and again, the agent needs to look for new experience. As we've already discussed, exploration has to be balanced with exploitation, which is the opposite: using our knowledge to get the best reward in the most efficient way. Let's now quickly discuss why we might be interested...

What's wrong with ε-greedy?

Throughout the book, we have used the ε-greedy exploration strategy as a simple, but still acceptable, approach to exploring the environment. The underlying idea behind ε-greedy is to take a random action with probability ε; otherwise (with probability 1-ε), we act greedily. By varying the ε hyperparameter, we can change the exploration ratio. This approach was used in most of the value-based methods described in the book.
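
As a quick illustration, here is a minimal sketch of ε-greedy action selection over a set of Q-values (the function name and the NumPy-based signature are choices made for this example, not code from the book):

    import random
    import numpy as np

    def select_action(q_values: np.ndarray, epsilon: float) -> int:
        # Epsilon-greedy: with probability epsilon take a random action,
        # otherwise take the action with the largest Q-value.
        if random.random() < epsilon:
            return random.randrange(len(q_values))   # explore
        return int(np.argmax(q_values))              # exploit

In the book's DQN examples, ε is typically annealed from 1.0 down to a small final value during the first stage of training, so exploration dominates early on and fades as the Q-values become more reliable.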

Quite a similar idea is used in policy-based methods, where our network returns a probability distribution over the actions to take. To prevent the network from becoming too certain about actions (by returning a probability of 1 for a specific action and 0 for the others), we added an entropy loss, which is just the entropy of the probability distribution multiplied by some hyperparameter. In the early stages of training, this entropy loss pushes the network toward taking random actions (by regularizing the probability distribution...
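
As a rough sketch of that entropy bonus in PyTorch (the function name and the ENTROPY_BETA value are assumptions for this example; a variant of this loss appears in the book's policy gradient chapters):

    import torch
    import torch.nn.functional as F

    ENTROPY_BETA = 0.01   # assumed entropy coefficient

    def policy_loss_with_entropy(logits: torch.Tensor, actions: torch.Tensor,
                                 advantages: torch.Tensor) -> torch.Tensor:
        log_probs = F.log_softmax(logits, dim=1)
        probs = F.softmax(logits, dim=1)
        # policy gradient part: -advantage * log pi(a|s) for the actions taken
        pg_loss = -(advantages * log_probs[range(len(actions)), actions]).mean()
        # entropy of the action distribution; subtracting it from the loss
        # penalizes overly confident (low-entropy) policies
        entropy = -(probs * log_probs).sum(dim=1).mean()
        return pg_loss - ENTROPY_BETA * entropy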

Alternative ways of exploration

In this section, we will give an overview of a set of alternative approaches to the exploration problem. This won't be an exhaustive list of the approaches that exist, but rather an outline of the landscape.

We're going to check three different approaches to exploration:

  • Randomness in the policy, when stochasticity is added to the policy that we use to get samples. One method in this family is noisy networks, which we have already covered.
  • Count-based methods, which keep track of how many times the agent has seen a particular state. We will check two methods: direct counting of states and the pseudo-count method. (A simple count-based bonus is sketched after this list.)
  • Prediction-based methods, which try to predict something from the state; from the quality of that prediction, we can judge how familiar the agent is with the state. To illustrate this approach, we will take a look at the network distillation method, which has shown state-of...
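
As an illustrative sketch of the count-based idea (the class name and the 1/sqrt(N(s)) form of the bonus are choices made for this example; the chapter's actual implementation may differ in details):

    import collections
    import math

    class CountExplorationBonus:
        # Intrinsic reward proportional to 1/sqrt(N(s)) for a discretized state:
        # rarely visited states get a large bonus, familiar states a small one.
        def __init__(self, scale: float = 1.0):
            self.counts = collections.Counter()
            self.scale = scale

        def __call__(self, state) -> float:
            key = tuple(state)   # assumes the state is already discretized
            self.counts[key] += 1
            return self.scale / math.sqrt(self.counts[key])

    # hypothetical usage: reward = extrinsic_reward + bonus(discretize(obs))

The prediction-based variant replaces the explicit count with the error of a trained predictor network against a fixed, randomly initialized target network: states the agent has visited often are predicted well and therefore receive a smaller bonus.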

MountainCar experiments

In this section, we will implement and compare the effectiveness of different exploration approaches on MountainCar: a simple, but still challenging, environment that could be classified as a "classical RL" problem, very similar to the familiar CartPole. In contrast to CartPole, however, the MountainCar problem is quite challenging from an exploration point of view.

The problem is illustrated in the following figure: a small car starts at the bottom of a valley. The car can move left and right, and the goal is to reach the top of the mountain on the right.

Figure 21.3: The MountainCar environment

The trick lies in the environment's dynamics and the action space. To reach the top, the actions need to be applied in a particular way to swing the car back and forth and build up speed. In other words, the agent needs to apply actions consistently over several time steps to make the car go faster and eventually...
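
To get a feel for why this is hard, here is a minimal sketch using the classic Gym API (the API form used throughout the book; newer Gym/Gymnasium releases changed the reset/step signatures). A purely random policy almost never reaches the flag, and every step costs -1 until the time limit ends the episode:

    import gym

    env = gym.make("MountainCar-v0")
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()        # purely random exploration
        obs, reward, done, _ = env.step(action)   # reward is -1 on every step
        total_reward += reward
    print("Episode reward:", total_reward)        # usually -200: time limit hit, flag not reached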

Atari experiments

The MountainCar environment is a nice and fast way to experiment with exploration methods, but to conclude the chapter, I've included Atari versions of the DQN and PPO methods with the exploration tweaks we described. As the primary environment, I've used Seaquest, a game in which a submarine needs to shoot fish and enemy submarines and rescue aquanauts. This game is not as famous as Montezuma's Revenge, but it still might be considered a medium-hard exploration problem, because to keep playing, you need to manage the level of oxygen: when it becomes low, the submarine needs to rise to the surface for a while. Without this, the episode ends after 560 steps with a maximum reward of 20. But once the agent learns how to replenish the oxygen, the game can continue almost indefinitely and bring the agent a score of 10k-100k. Surprisingly, traditional exploration methods struggle to discover this; normally, training gets stuck at 560...

Summary

In this chapter, we discussed why ε-greedy exploration is not the best choice in some cases and checked alternative, modern approaches to exploration. The topic of exploration is much wider than this, and lots of interesting methods are left uncovered, but I hope you were able to get an overall impression of the new methods and the way they can be implemented.

In the next chapter, we will take a look at a different sphere of modern RL development: model-based methods.

References

  1. Strehl and Littman, An Analysis of Model-Based Interval Estimation for Markov Decision Processes, 2008: https://www.sciencedirect.com/science/article/pii/S0022000008000767
  2. Meire Fortunato, et al., Noisy Networks for Exploration, 2017, arXiv:1706.10295
  3. Georg Ostrovski, et al., Count-Based Exploration with Neural Density Models, 2017, arXiv:1703.01310v2
  4. Yuri Burda, et al., Exploration by Random Network Distillation, 2018, arXiv:1810.12894