Continuous Action Space

This chapter kicks off the advanced reinforcement learning (RL) part of the book by taking a look at a problem that has only been briefly mentioned: working with environments when our action space is not discrete. In this chapter, you will become familiar with the challenges that arise in such cases and learn how to solve them.

Continuous action space problems are an important subfield of RL, both theoretically and practically, because they have essential applications in robotics (which will be the subject of the next chapter), control problems, and other fields in which we communicate with physical objects.

In this chapter, we will:

  • Cover the continuous action space, why it is important, how it differs from the already familiar discrete action space, and the way it is implemented in the Gym API
  • Discuss the domain of continuous control using RL methods
  • Check three different algorithms on the problem of a four-legged robot

Why a continuous space?

All the examples that we have seen so far in the book had a discrete action space, so you might have gotten the wrong impression that discrete actions dominate the field. This is a very biased view, of course, and just reflects the selection of domains from which we picked our test problems. Besides Atari games and simple, classic RL problems, there are many tasks that require more than just making a selection from a small, discrete set of things to do.

To give you an example, just imagine a simple robot with only one controllable joint that can be rotated in some range of degrees. Usually, to control a physical joint, you have to specify either the desired position or the force applied.

In both cases, you need to make a decision about a continuous value. This is fundamentally different from a discrete action space, as the set of values you can choose from is potentially infinite. For instance, you could ask the joint to move to a 13.5°...
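
To make this concrete, here is how a continuous action space shows up in the Gym API. The Pendulum-v0 environment and the four-value step signature used below are my choices for illustration (matching a Gym version contemporary with the book); the key point is that the action space is a Box with explicit bounds rather than a Discrete set.

```python
import gym

# Pendulum-v0 is a classic control problem with a continuous action space
# (the environment choice here is just an illustration).
env = gym.make("Pendulum-v0")
print(env.action_space)       # Box(1,): a single real-valued action
print(env.action_space.low)   # [-2.]
print(env.action_space.high)  # [ 2.]

# A continuous action is a NumPy array within those bounds,
# not an index into a discrete set of choices.
env.reset()
action = env.action_space.sample()
obs, reward, done, info = env.step(action)
```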

The A2C method

The first method that we will apply to our walking robot problem is A2C, which we experimented with in part three of the book. This choice of method is quite obvious, as A2C is very easy to adapt to the continuous action domain. As a quick refresher, A2C's idea is to estimate the gradient of our policy as $\nabla J \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a|s)\,\big(Q(s,a) - V_\theta(s)\big)\big]$. The policy $\pi_\theta(a|s)$ is supposed to provide the probability distribution of actions given the observed state. The quantity $V_\theta(s)$ is called a critic; it approximates the value of the state and is trained using the mean squared error (MSE) loss between the critic's prediction and the value estimated by the Bellman equation. To improve exploration, an entropy bonus is usually added to the loss.

Obviously, the value head of the actor-critic will be unchanged for continuous actions. The only thing that is affected is the representation of the policy. In the discrete cases that you have seen, we had only one action with several mutually exclusive discrete values. For such a case...
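
A common way to parameterize a policy over continuous actions is to output the mean and the variance of a normal distribution for each action component. Below is a minimal sketch of this idea; the hidden layer size of 128 and the Tanh/Softplus activations are my assumptions for illustration, not necessarily the chapter's exact architecture.

```python
import torch
import torch.nn as nn

class ContinuousA2C(nn.Module):
    """Sketch of an actor-critic network for a continuous action space.
    The policy head outputs the mean and variance of a normal
    distribution instead of logits over discrete actions."""
    def __init__(self, obs_size, act_size, hid_size=128):
        super().__init__()
        self.base = nn.Sequential(
            nn.Linear(obs_size, hid_size),
            nn.ReLU(),
        )
        # Mean of the action distribution, squashed to [-1, 1]
        self.mu = nn.Sequential(nn.Linear(hid_size, act_size), nn.Tanh())
        # Variance must be positive, hence the Softplus activation
        self.var = nn.Sequential(nn.Linear(hid_size, act_size), nn.Softplus())
        # The value head (critic) is the same as in the discrete case
        self.value = nn.Linear(hid_size, 1)

    def forward(self, x):
        base = self.base(x)
        return self.mu(base), self.var(base), self.value(base)
```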

Deterministic policy gradients

The next method that we will take a look at is called deterministic policy gradients, which is an actor-critic method with the very nice property of being off-policy. The following is my very relaxed interpretation of the strict proofs. If you are interested in understanding the core of this method deeply, you can always refer to the article by David Silver and others called Deterministic Policy Gradient Algorithms, published in 2014 (http://proceedings.mlr.press/v32/silver14.pdf), and the paper by Timothy P. Lillicrap and others called Continuous Control with Deep Reinforcement Learning, published in 2015 (https://arxiv.org/abs/1509.02971).

The simplest way to illustrate the method is through comparison with the already familiar A2C method. In A2C, the actor estimates a stochastic policy, which returns a probability distribution over discrete actions or, as we have just covered in the previous section, the parameters of a normal distribution...
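
In contrast, a deterministic actor maps the state directly to a concrete action vector, and the critic estimates Q(s, a) for that state-action pair. The following is a minimal DDPG-style sketch of such a pair of networks; the 400/300 layer sizes and the Tanh output squashing are my assumptions for illustration rather than the chapter's exact code.

```python
import torch
import torch.nn as nn

class DDPGActor(nn.Module):
    """Sketch of a deterministic actor: it maps the observation directly
    to an action vector instead of to distribution parameters."""
    def __init__(self, obs_size, act_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, act_size),
            nn.Tanh(),   # actions are bounded to [-1, 1]
        )

    def forward(self, x):
        return self.net(x)

class DDPGCritic(nn.Module):
    """Sketch of the critic Q(s, a): it takes both the observation and the
    action and returns a single Q-value, which is what allows the actor
    to be trained off-policy."""
    def __init__(self, obs_size, act_size):
        super().__init__()
        self.obs_net = nn.Sequential(
            nn.Linear(obs_size, 400),
            nn.ReLU(),
        )
        self.out_net = nn.Sequential(
            nn.Linear(400 + act_size, 300),
            nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, obs, act):
        return self.out_net(torch.cat([self.obs_net(obs), act], dim=1))
```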

Distributional policy gradients

As the last method of this chapter, we will take a look at the very recent paper by Gabriel Barth-Maron, Matthew W. Hoffman, and others, called Distributed Distributional Deterministic Policy Gradients, published in 2018 (https://arxiv.org/abs/1804.08617).

The full name of the method is distributed distributional deep deterministic policy gradients, or D4PG for short. The authors proposed several modifications to the DDPG method that improve stability, convergence, and sample efficiency.

First of all, they adapted the distributional representation of the Q-value proposed in the paper by Marc G. Bellemare and others called A Distributional Perspective on Reinforcement Learning, published in 2017 (https://arxiv.org/abs/1707.06887). We discussed this approach in Chapter 8, DQN Extensions, when we talked about DQN improvements, so refer to it or to the original Bellemare paper for details. The core idea is to replace a single Q-value from the critic with...
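
To illustrate the idea, here is a minimal sketch of a critic that returns a categorical distribution over a fixed support of atoms instead of a single Q-value; the 51 atoms, the [-10, 10] value range, and the 400/300 layer sizes are my assumptions for illustration, not the chapter's exact hyperparameters.

```python
import torch
import torch.nn as nn

class DistributionalCritic(nn.Module):
    """Sketch of a distributional critic: instead of one Q-value, it
    outputs a categorical distribution over a fixed set of support
    points (atoms) spanning the expected value range."""
    def __init__(self, obs_size, act_size, n_atoms=51, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size + act_size, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, n_atoms),
        )
        # Fixed support of the value distribution
        self.register_buffer("supports", torch.linspace(v_min, v_max, n_atoms))

    def forward(self, obs, act):
        # Logits of the categorical distribution over the supports
        return self.net(torch.cat([obs, act], dim=1))

    def q_value(self, obs, act):
        # Expected Q-value: probability-weighted sum over the supports
        probs = torch.softmax(self.forward(obs, act), dim=1)
        return (probs * self.supports).sum(dim=1, keepdim=True)
```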

Things to try

Here is a list of things you can do to improve your understanding of the topic:

  1. In the D4PG code, I used a simple replay buffer, which was enough to get good improvement over DDPG. You can try to switch the example to the prioritized replay buffer in the same way as we did in Chapter 8, DQN Extensions, and check the effect.
  2. There are lots of interesting and challenging environments around. For example, you can start with other PyBullet environments, but there is also the DeepMind Control Suite (Tassa, Yuval, et al., DeepMind Control Suite, arXiv abs/1801.00690 (2018)), MuJoCo-based environments in Gym, and many others.
  3. You can request a trial license for MuJoCo and compare its stability, performance, and the resulting policy with PyBullet.
  4. You can play with the very challenging Learning to Run competition from NIPS-2017 (which also took place in 2018 and 2019 with more challenging problems), where you are given a simulator of the human body and your agent...

Summary

In this chapter, we quickly skimmed through the very interesting domain of continuous control using RL methods, and we checked three different algorithms on the problem of a four-legged robot. In our training, we used an emulator, but there are real models of this robot made by the Ghost Robotics company. (You can check out the cool video on YouTube: https://youtu.be/bnKOeMoibLg.) We applied three training methods to this environment: A2C, DDPG, and D4PG (which showed the best results).

In the next chapter, we will continue exploring the continuous action domain and check a different set of improvements: trust region extension.
