The Actor-Critic Method

In Chapter 11, Policy Gradients—an Alternative, we started to investigate a policy-based alternative to the familiar family of value-based methods. In particular, we focused on the method called REINFORCE and its modification, which uses the discounted reward to obtain the gradient of the policy (which gives us the direction in which to improve the policy). Both methods worked well for the small CartPole problem, but for the more complicated Pong environment, the convergence dynamics were painfully slow.

Next, we will discuss another extension to the vanilla policy gradient method, which magically improves the stability and convergence speed of that method. Despite the modification being only minor, the new method has its own name, actor-critic, and it's one of the most powerful methods in deep reinforcement learning (RL).

In this chapter, we will:

  • Explore how the baseline impacts statistics and the convergence of gradients
  • Cover an extension...

Variance reduction

In the previous chapter, I briefly mentioned that one of the ways to improve the stability of policy gradient methods is to reduce the variance of the gradient. Now let's try to understand why this is important and what it means to reduce the variance. In statistics, variance is the expected square deviation of a random variable from the expected value of that variable.
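In symbols, this is the standard statistical definition:

\mathrm{Var}[x] = E\big[(x - E[x])^2\big]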

Variance shows us how far values are dispersed from the mean. When the variance is high, the random variable can take values that deviate widely from the mean. The following plot shows normal (Gaussian) distributions with the same mean, but with different values for the variance.

Figure 12.1: The effect of variance on Gaussian distribution

Now let's return to policy gradients. It was stated in the previous chapter that the idea is to increase the probability of good actions and decrease the chance of bad ones. In math notation, our policy gradient was...
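For reference, the vanilla policy gradient from the previous chapter scales the gradient of the log probability of the action taken by Q(s, a), the discounted total reward:

\nabla J \approx E\big[\, Q(s, a)\, \nabla \log \pi(a \mid s) \,\big]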

CartPole variance

To check this theoretical conclusion in practice, let's plot our policy gradient variance during the training for both the baseline version and the version without the baseline. The complete example is in Chapter12/01_cartpole_pg.py, and most of the code is the same as in Chapter 11, Policy Gradients – an Alternative. The differences in this version are the following:

  • It now accepts the command-line option --baseline, which enables the mean subtraction from the reward. By default, no baseline is used.
  • On every training loop iteration, we gather the gradients produced by the policy loss and use this data to calculate their variance.

To gather only the gradients from the policy loss and exclude the gradients from the entropy bonus added for exploration, we need to calculate the gradients in two stages. Luckily, PyTorch allows this to be done easily. In the following code, only the relevant part of the training loop is included to illustrate the idea:

...
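A minimal, self-contained sketch of this two-stage pattern looks as follows. The tiny network, the fake batch, and the variable names below are illustrative stand-ins, not the exact code of 01_cartpole_pg.py:

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

ENTROPY_BETA = 0.01

# Tiny stand-in for the CartPole policy network and a fake batch of data.
net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
states_v = torch.randn(16, 4)            # observations
actions_t = torch.randint(0, 2, (16,))   # actions that were taken
scales_v = torch.randn(16)               # discounted rewards (minus the baseline, if enabled)

# Stage 1: backpropagate only the policy loss and record its gradients.
optimizer.zero_grad()
logits_v = net(states_v)
log_prob_v = F.log_softmax(logits_v, dim=1)
log_prob_actions_v = scales_v * log_prob_v[range(len(actions_t)), actions_t]
loss_policy_v = -log_prob_actions_v.mean()
loss_policy_v.backward(retain_graph=True)   # keep the graph alive for the second pass
grads = np.concatenate([p.grad.detach().numpy().flatten()
                        for p in net.parameters() if p.grad is not None])
print("variance of policy gradients:", np.var(grads))

# Stage 2: add the entropy bonus gradients, then make the optimization step.
prob_v = F.softmax(logits_v, dim=1)
entropy_loss_v = ENTROPY_BETA * (prob_v * log_prob_v).sum(dim=1).mean()
entropy_loss_v.backward()
optimizer.step()

Because the gradients from both backward passes accumulate in the parameters, the optimizer step still uses the combined policy and entropy gradients, while the variance statistic is computed from the policy loss alone.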

Actor-critic

The next step in reducing the variance is making our baseline state-dependent (which is a good idea, as different states could have very different baselines). Indeed, to decide on the suitability of a particular action in some state, we use the discounted total reward of the action. However, the total reward itself could be represented as a value of the state plus the advantage of the action: Q(s, a) = V(s) + A(s, a). You saw this in Chapter 8, DQN Extensions, when we discussed DQN modifications, particularly dueling DQN.

So, why can't we use V(s) as a baseline? In that case, the scale of our gradient will be just the advantage, A(s, a), which shows how much better the action taken is than the average value of the state. In fact, we can do this, and it is a very good idea for improving the policy gradient method. The only problem here is that we don't know the value, V(s), of the state, which we need to subtract from the discounted total reward, Q(s, a). To solve...
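Written out, this simply swaps Q(s, a) for the advantage in the gradient scale (the standard actor-critic formulation, shown here for reference):

\nabla J \approx E\big[\, (Q(s, a) - V(s))\, \nabla \log \pi(a \mid s) \,\big] = E\big[\, A(s, a)\, \nabla \log \pi(a \mid s) \,\big]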

A2C on Pong

In the previous chapter, you saw a (not very successful) attempt to solve our favorite Pong environment with policy gradient methods. Let's try it again with the actor-critic method at hand.

GAMMA = 0.99            # discount factor for rewards
LEARNING_RATE = 0.001   # learning rate of the optimizer
ENTROPY_BETA = 0.01     # scale of the entropy bonus
BATCH_SIZE = 128        # transitions per optimization step
NUM_ENVS = 50           # parallel environments used to gather experience
REWARD_STEPS = 4        # steps of reward unrolling (explained below)
CLIP_GRAD = 0.1         # gradient clipping threshold (explained below)

We start, as usual, by defining hyperparameters (imports are omitted). These values are not tuned, as we will do this in the next section of this chapter. We have one new value here: CLIP_GRAD. This hyperparameter specifies the threshold for gradient clipping, which basically prevents our gradients from becoming too large at the optimization stage and pushing our policy too far. Clipping is implemented using PyTorch functionality, but the idea is very simple: if the L2 norm of the gradient is larger than this hyperparameter, the gradient vector is scaled down so that its norm equals this value.
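In PyTorch, this is a single call placed between the backward pass and the optimizer step. A minimal sketch follows; the linear layer and the squared-output loss are stand-ins, not the chapter's actual model or loss:

import torch
import torch.nn as nn

CLIP_GRAD = 0.1

net = nn.Linear(8, 4)                                     # stand-in network
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

loss = net(torch.randn(16, 8)).pow(2).mean()              # stand-in loss
optimizer.zero_grad()
loss.backward()
# Rescale all gradients so that their combined L2 norm does not exceed CLIP_GRAD.
nn.utils.clip_grad_norm_(net.parameters(), CLIP_GRAD)
optimizer.step()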

The REWARD_STEPS hyperparameter determines how many...
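For reference, with REWARD_STEPS = 4, the n-step bootstrapped estimate that A2C typically uses in place of Q(s, a) is:

Q(s_t, a_t) \approx r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \gamma^4 V(s_{t+4})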

A2C on Pong results

To start the training, run 02_pong_a2c.py with the --cuda flag and the -n option (which provides a name for the run in TensorBoard):

      rl_book_samples/Chapter12$ ./02_pong_a2c.py --cuda -n t2
      AtariA2C (
         (conv): Sequential (
           (0): Conv2d(4, 32, kernel_size=(8, 8), stride=(4, 4))
           (1): ReLU ()
           (2): Conv2d(32, 64, kernel_size=(4, 4), stride=(2, 2))
           (3): ReLU ()
           (4): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1))
           (5): ReLU ()
         )
         (policy): Sequential (
           (0): Linear (3136 -> 512)
           (1): ReLU ()
           (2): Linear (512 -> 6)
         )
         (value): Sequential (
           (0): Linear (3136 -> 512)
           (1): ReLU ()
           (2): Linear (512 -> 1)
         )
      )
      37799: done 1 games, mean reward -21.000, speed 722.89 f/s
      39065: done 2 games, mean reward -21.000, speed 749.92 f/s
      39076: done 3 games, mean...

Tuning hyperparameters

In the previous section, we had Pong solved in three hours of optimization and 9 million frames. Now is a good time to tweak our hyperparameters to speed up convergence. The golden rule here is to tweak one option at a time and make conclusions carefully, as the whole process is stochastic.

In this section, we will start with the original hyperparameters and perform the following experiments:

  • Increase the learning rate
  • Increase the entropy beta
  • Change the count of environments that we are using to gather experience
  • Tweak the size of the batch

Strictly speaking, the following experiments weren't proper hyperparameter tuning, but just an attempt to get a better understanding of how the A2C convergence dynamics depend on the parameters. To find the best set of parameters, a full grid search or random sampling of values could give much better results, but they would require much more time and resources.

Learning rate

Our starting...

Summary

In this chapter, you learned about one of the most widely used methods in deep RL: A2C, which wisely combines the policy gradient update with an approximation of the state value. We analyzed the effect of the baseline on the statistics and convergence of gradients. Then, we checked an extension of the baseline idea: A2C, where a separate network head provides us with the baseline for the current state.

In the next chapter, we will look at ways to perform the same algorithm in a distributed way.
