You're reading from Deep Reinforcement Learning Hands-On, Second Edition

Product type: Book
Published in: Jan 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781838826994
Edition: 2nd Edition
Author (1)
Maxim Lapan

Maxim has been working as a software developer for more than 20 years and has been involved in various areas: distributed scientific computing, distributed systems, and big data processing. Since 2014, he has been actively using machine learning and deep learning to solve practical industrial tasks, such as NLP problems and RL for web crawling and web page analysis. He lives in Germany with his family.

Policy Gradients – an Alternative

In this first chapter of part three of the book, we will consider an alternative way to handle Markov decision process (MDP) problems: a whole family of methods called policy gradient methods.

In this chapter, we will:

  • Cover an overview of the methods, their motivations, and their strengths and weaknesses in comparison to the already familiar Q-learning
  • Start with a simple policy gradient method called REINFORCE and try to apply it to our CartPole environment, comparing this with the deep Q-network (DQN) approach

Values and policy

Before we start talking about policy gradients, let's refresh our minds with the common characteristics of the methods covered in part two of this book. The central topic in value iteration and Q-learning is the value of the state (V) or value of the state and action (Q). Value is defined as the discounted total reward that we can gather from this state or by issuing this particular action from the state. If we know the value, our decision on every step becomes simple and obvious: we just act greedily in terms of value, and that guarantees us a good total reward at the end of the episode. So, the values of states (in the case of the value iteration method) or state + action (in the case of Q-learning) stand between us and the best reward. To obtain these values, we have used the Bellman equation, which expresses the value on the current step via the values on the next step.
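As a reminder of what acting greedily on values and expressing a value via the next step look like, here is a minimal sketch in plain Python. The function names greedy_action and bellman_target, the discount factor, and the sample numbers are illustrative assumptions, not code from the book's repository.

```python
from typing import Sequence

GAMMA = 0.99  # discount factor; the value is an assumption for this sketch


def greedy_action(q_values: Sequence[float]) -> int:
    """Return the index of the action with the largest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def bellman_target(reward: float, next_q_values: Sequence[float],
                   done: bool, gamma: float = GAMMA) -> float:
    """One-step Bellman target: r + gamma * max over a' of Q(s', a')."""
    if done:
        # No future reward can be gathered after a terminal state.
        return reward
    return reward + gamma * max(next_q_values)


q_s = [0.5, 1.2, -0.3]    # hypothetical Q(s, a) for three actions
q_next = [2.0, 1.0, 0.0]  # hypothetical Q(s', a') in the next state
action = greedy_action(q_s)
target = bellman_target(1.0, q_next, done=False)
```

Here the greedy choice is the action with the highest estimated value, and the target for the current step is built from the best value available on the next step.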

In Chapter 1, What Is Reinforcement Learning?, we defined the entity that tells us...

The REINFORCE method

The formula of policy gradient that you have just seen is used by most of the policy-based methods, but the details can vary. One very important point is how exactly gradient scales, Q(s, a), are calculated. In the cross-entropy method from Chapter 4, The Cross-Entropy Method, we played several episodes, calculated the total reward for each of them, and trained on transitions from episodes with a better-than-average reward. This training procedure is a policy gradient method with Q(s, a) = 1 for state and action pairs from good episodes (with a large total reward) and Q(s, a) = 0 for state and action pairs from worse episodes.
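To make this connection concrete, the following sketch computes a loss of the commonly used form, the negative mean of scale × log π(a|s), in plain Python. It assumes this standard form of the policy gradient loss; the function name pg_loss and all of the numbers are hypothetical.

```python
import math
from typing import Sequence


def pg_loss(action_probs: Sequence[float], scales: Sequence[float]) -> float:
    """Policy gradient loss: -mean(scale * log pi(a|s)).

    action_probs[i] is the probability the policy assigned to the action
    actually taken in transition i; scales[i] is its gradient scale.
    """
    terms = [s * math.log(p) for p, s in zip(action_probs, scales)]
    return -sum(terms) / len(terms)


# Hypothetical batch: probabilities of the taken actions in four transitions.
probs = [0.9, 0.6, 0.3, 0.5]

# Cross-entropy method: the first two transitions come from "good"
# episodes (scale 1), the last two from "worse" ones (scale 0).
ce_scales = [1.0, 1.0, 0.0, 0.0]
ce_loss = pg_loss(probs, ce_scales)

# The same value, computed directly from the elite transitions only.
elite_nll = -(math.log(probs[0]) + math.log(probs[1])) / len(probs)
```

With scales of 1 and 0, only the transitions from good episodes contribute, so the loss equals the negative log-likelihood of the elite actions up to a constant factor.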

The cross-entropy method worked even with those simple assumptions, but the obvious improvement will be to use Q(s, a) for training instead of just 0 and 1. Why should it help? The answer is a more fine-grained separation of episodes. For example, transitions of the episode with the total reward of 10 should contribute to the gradient...
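One common way to obtain such per-step scales is to accumulate the discounted reward backward from the end of the episode. The sketch below assumes that choice; the helper name discounted_returns, the discount factor, and the reward values are made up for illustration.

```python
from typing import List, Sequence

GAMMA = 0.99  # assumed discount factor


def discounted_returns(rewards: Sequence[float],
                       gamma: float = GAMMA) -> List[float]:
    """Discounted total reward from every step to the end of the episode."""
    result: List[float] = []
    running = 0.0
    # Walk the episode backward so that each step reuses the sum
    # already accumulated for the steps that follow it.
    for r in reversed(rewards):
        running = r + gamma * running
        result.append(running)
    result.reverse()
    return result


# A hypothetical three-step episode with a reward of 1 on every step.
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```

For this episode, the result is [1.75, 1.5, 1.0]: earlier steps receive a larger scale because more reward still lies ahead of them.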

REINFORCE issues

In the previous section, we discussed the REINFORCE method, which is a natural extension of the cross-entropy method. Unfortunately, both REINFORCE and the cross-entropy method still suffer from several problems, which make both of them limited to simple environments.

Full episodes are required

First of all, we still need to wait for the full episode to complete before we can start training. Even worse, both REINFORCE and the cross-entropy method behave better with more episodes used for training (just from the fact that more episodes mean more training data, which means more accurate policy gradients). This situation is fine for short episodes in the CartPole, when in the beginning, we can barely handle the bar for more than 10 steps; but in Pong, it is completely different: every episode can last for hundreds or even thousands of frames. It's equally bad from the training perspective, as our training batch becomes very large, and from the sample efficiency...

Policy gradient methods on CartPole

Nowadays, almost nobody uses the vanilla policy gradient method, as the much more stable actor-critic method exists. However, I still want to show the policy gradient implementation, as it establishes very important concepts and metrics to check the policy gradient method's performance.

Implementation

So, we will start with the much simpler CartPole environment, and in the next section, we will check the method's performance on our favorite Pong environment.

The complete code for the following example is available in Chapter11/04_cartpole_pg.py.

GAMMA = 0.99
LEARNING_RATE = 0.001
ENTROPY_BETA = 0.01
BATCH_SIZE = 8
REWARD_STEPS = 10

Besides the already familiar hyperparameters, we have two new ones: the ENTROPY_BETA value is the scale of the entropy bonus and the REWARD_STEPS value specifies how many steps ahead the Bellman equation is unrolled to estimate the discounted total reward of every transition.
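To illustrate what these two new hyperparameters control, here is a minimal sketch in plain Python. It is not the code from Chapter11/04_cartpole_pg.py: the helpers n_step_reward and entropy are hypothetical names, the sketch only sums the rewards over the unrolled steps, and subtracting the scaled entropy from the loss is an assumption about how an entropy bonus is typically applied.

```python
import math
from typing import Sequence

GAMMA = 0.99
ENTROPY_BETA = 0.01
REWARD_STEPS = 10


def n_step_reward(rewards: Sequence[float], gamma: float = GAMMA,
                  steps: int = REWARD_STEPS) -> float:
    """Discounted sum of at most `steps` rewards, starting at the first."""
    return sum(r * gamma ** i for i, r in enumerate(rewards[:steps]))


def entropy(probs: Sequence[float]) -> float:
    """Entropy of a discrete action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical distributions for a two-action policy such as CartPole's.
uniform = entropy([0.5, 0.5])      # maximally uncertain policy
confident = entropy([0.99, 0.01])  # nearly deterministic policy

# The entropy bonus rewards uncertainty: it lowers the loss more
# when the policy is closer to uniform.
policy_loss = 1.0                  # placeholder policy gradient loss
total_loss = policy_loss - ENTROPY_BETA * uniform
```

A larger ENTROPY_BETA pushes the policy harder toward exploration, while a larger REWARD_STEPS makes the reward estimate look further ahead at the cost of waiting for more steps before a transition can be used.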

class PGN...

Policy gradient methods on Pong

As we covered in the previous section, the vanilla policy gradient method works well on a simple CartPole environment, but it works surprisingly badly on more complicated environments.

For the relatively simple Atari game Pong, our DQN was able to completely solve it in 1M frames and showed positive reward dynamics in just 100k frames, whereas the policy gradient method failed to converge. Due to the instability of policy gradient training, it was very hard to find good hyperparameters, and training remained very sensitive to initialization.

This doesn't mean that the policy gradient method is bad, because, as you will see in the next chapter, just one tweak of the network architecture to get a better baseline in the gradients will turn the policy gradient method into one of the best methods (the asynchronous advantage actor-critic method). Of course, there is a good chance that my hyperparameters are completely wrong or the code has some hidden...

Summary

In this chapter, you saw an alternative way of solving RL problems: policy gradient methods, which are different in many ways from the familiar DQN method. We explored a basic method called REINFORCE, which is a generalization of our first method in the RL domain, the cross-entropy method. This policy gradient method is simple, but when applied to the Pong environment, it didn't show good results.

In the next chapter, we will consider ways to improve the stability of policy gradient methods by combining the value-based and policy-based families of methods.
