Trust Regions – PPO, TRPO, ACKTR, and SAC

In this chapter, we will take a look at approaches used to improve the stability of the stochastic policy gradient method. Several attempts have been made to make policy updates more stable, and we will focus on three methods:

  • Proximal policy optimization (PPO)
  • Trust region policy optimization (TRPO)
  • Advantage actor-critic (A2C) using Kronecker-factored trust region (ACKTR)

In addition, we will compare those methods to a relatively new off-policy method called soft actor-critic (SAC), which is a development of the deep deterministic policy gradients (DDPG) method described in Chapter 17, Continuous Action Space. To compare them to the A2C baseline, we will use several environments from the Roboschool library created by OpenAI.

The overall motivation of the methods that we will look at is to improve the stability of the policy update during the training. There is a dilemma: on the one hand, we...

Roboschool

To experiment with the methods in this chapter, we will use Roboschool, which uses PyBullet as a physics engine and has 13 environments of varying complexity. PyBullet has similar environments, but at the time of writing, it isn't possible to create several instances of the same environment due to an internal OpenGL issue.

In this chapter, we will explore two problems: RoboschoolHalfCheetah-v1, which models a two-legged creature, and RoboschoolAnt-v1, which has four legs. Their state and action spaces are very similar to those of the Minitaur environment that we saw in Chapter 17, Continuous Action Space: the state includes characteristics of the joints, and the actions are activations of those joints. The goal for both is to move as far as possible while minimizing the energy spent. Figure 19.1 shows screenshots of the two environments.

Figure 19.1: Screenshots of two Roboschool environments: RoboschoolHalfCheetah and RoboschoolAnt

To install Roboschool, you need to follow...
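
Once Roboschool is installed, a minimal sketch like the following can be used to check that the environments are available; importing the roboschool package is what registers its environments in Gym (this is my own illustration, not code from the book's repository):

    import gym
    import roboschool  # noqa: F401 -- the import registers the Roboschool environments in Gym

    env = gym.make("RoboschoolHalfCheetah-v1")
    print("Observations:", env.observation_space.shape)  # joint-related characteristics
    print("Actions:", env.action_space.shape)            # activations of the joints
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())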

The A2C baseline

To establish the baseline results, we will use the A2C method in a very similar way to the code in the previous chapter.

Implementation

The complete source is in the files Chapter19/01_train_a2c.py and Chapter19/lib/model.py. There are a few differences between this baseline and the version we used in the previous chapter. First of all, 16 parallel environments are used to gather experience during training. The second difference is the model structure and the way that we perform exploration. To illustrate them, let's look at the model and agent classes.

Both the actor and the critic are placed in separate networks without sharing weights. They follow the approach used in the previous chapter, with our actor estimating the mean and the variance for the actions. However, now the variance is not a separate head of the base network; it is just a single parameter of the model. This parameter will be adjusted during training by SGD, but it doesn...
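
As a rough sketch of this arrangement (a simplification of mine, not the book's exact model.py code), the network outputs only the mean of the action distribution, while the log standard deviation lives in a free nn.Parameter trained by SGD together with the other weights:

    import torch
    import torch.nn as nn

    HID_SIZE = 64  # assumed hidden layer size

    class ActorNet(nn.Module):
        def __init__(self, obs_size, act_size):
            super().__init__()
            # The network predicts only the mean of the action distribution
            self.mu = nn.Sequential(
                nn.Linear(obs_size, HID_SIZE),
                nn.Tanh(),
                nn.Linear(HID_SIZE, act_size),
                nn.Tanh(),
            )
            # The variance is not a head of the network, just a trainable parameter
            self.logstd = nn.Parameter(torch.zeros(act_size))

        def forward(self, x):
            return self.mu(x)

With such a model, exploration can be performed by sampling actions as mu + exp(logstd) * noise, with the noise drawn from a standard normal distribution.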

PPO

Historically, the PPO method came from the OpenAI team and it was proposed long after TRPO, which is from 2015. However, PPO is much simpler than TRPO, so we will start with it. The 2017 paper in which it was proposed, by John Schulman et al., is called Proximal Policy Optimization Algorithms (arXiv:1707.06347).

The core improvement over the classic A2C method is a change to the formula used to estimate the policy gradients. Instead of using the gradient of the log probability of the action taken, the PPO method uses a different objective: the ratio between the new and the old policy, scaled by the advantage.

In math form, the old A2C objective could be written as $J_{\theta} = \mathbb{E}_t\left[\log \pi_{\theta}(a_t \mid s_t) A_t\right]$. The new objective proposed by PPO is $J_{\theta} = \mathbb{E}_t\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} A_t\right]$.

The reason behind changing the objective is the same as with the cross-entropy method covered in Chapter 4, The Cross-Entropy Method: importance sampling. However, if we just start to blindly maximize this value, it may lead to a very large update to the policy...
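
The PPO paper deals with this by clipping the ratio to the interval $[1-\varepsilon, 1+\varepsilon]$ and taking the minimum of the clipped and unclipped terms. A minimal PyTorch sketch of this clipped surrogate loss follows; the function and argument names are mine, and the clip range of 0.2 is the paper's default:

    import torch

    def ppo_policy_loss(new_logprob, old_logprob, advantage, clip_eps=0.2):
        # Ratio of the new and the old policy probabilities for the taken actions,
        # computed in log space for numerical stability
        ratio = torch.exp(new_logprob - old_logprob)
        surr_obj = ratio * advantage
        clipped_obj = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        # The minus sign turns maximization of the objective into a loss to minimize
        return -torch.min(surr_obj, clipped_obj).mean()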

TRPO

TRPO was proposed in 2015 by Berkeley researchers in a paper by John Schulman et al. called Trust Region Policy Optimization (arXiv:1502.05477). This paper was a step towards improving the stability and consistency of stochastic policy gradient optimization, and it showed good results on various control tasks.

Unfortunately, the paper and the method are quite math-heavy, so it can be hard to understand the details. The same could be said about the implementation, which uses the conjugate gradient method to efficiently solve the constrained optimization problem.

As the first step, the TRPO method defines the discounted visitation frequencies of the state: $\rho_{\pi}(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \ldots$ In this equation, $P(s_i = s)$ equals the sampled probability of state $s$ being met at position $i$ of the sampled trajectories. Then, TRPO defines the optimization objective as $L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s) A_{\pi}(s, a)$, where $\eta(\pi) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\big]$ is the expected discounted reward of the policy and $\tilde{\pi}(s) = \arg\max_{a} A_{\pi}(s, a)$ defines the deterministic policy.

To address the issue of large policy...
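
As mentioned above, practical TRPO implementations rely on the conjugate gradient method to solve the constrained optimization step without ever materializing the full matrix; the matrix (the Fisher information matrix in TRPO) is accessed only through matrix-vector products. A generic, bare-bones sketch of such a solver (not the book's code) looks like this:

    import torch

    def conjugate_gradient(mat_vec_product, b, n_iters=10, eps=1e-10):
        # Approximately solves A x = b using only the function x -> A x
        x = torch.zeros_like(b)
        r = b.clone()   # residual
        p = b.clone()   # search direction
        rs_old = torch.dot(r, r)
        for _ in range(n_iters):
            Ap = mat_vec_product(p)
            alpha = rs_old / (torch.dot(p, Ap) + eps)
            x += alpha * p
            r -= alpha * Ap
            rs_new = torch.dot(r, r)
            p = r + (rs_new / (rs_old + eps)) * p
            rs_old = rs_new
        return x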

ACKTR

The third method that we will compare, ACKTR, uses a different approach to address SGD stability. In the paper by Yuhuai Wu and others called Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation, published in 2017 (arXiv:1708.05144), the authors combined second-order optimization methods with the trust region approach.

The idea of second-order methods is to improve traditional SGD by using the second derivatives of the optimized function (in other words, its curvature) to improve the convergence of the optimization process. To make things more complicated, working with second derivatives usually requires you to build and invert a Hessian matrix, which can be prohibitively large, so practical methods typically approximate it in some way. This is currently a very active area of research, as developing robust, scalable optimization methods is very important for the whole machine learning domain.
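
As a toy illustration of what a second-order update does, a single Newton step on a simple quadratic function is shown below (a generic numerical example, not ACKTR code); for a network with millions of parameters, the Hessian would have millions-squared entries, which is exactly why it has to be approximated:

    import numpy as np

    def newton_step(params, grad, hessian):
        # Second-order update: rescale the gradient by the inverse curvature
        return params - np.linalg.inv(hessian) @ grad

    # For f(x, y) = x^2 + 10 * y^2, evaluated at the point (3, 3):
    params = np.array([3.0, 3.0])
    grad = np.array([2 * params[0], 20 * params[1]])   # gradient: [6, 60]
    hessian = np.array([[2.0, 0.0], [0.0, 20.0]])      # constant curvature
    print(newton_step(params, grad, hessian))          # jumps straight to [0. 0.]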

One of the...

SAC

In the final section, we will check the latest state-of-the-art method, called SAC, on our environments. It was proposed by a group of Berkeley researchers and introduced in the 2018 paper Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, by Tuomas Haarnoja et al. (arXiv:1801.01290).

At the moment, it's considered to be one of the best methods for continuous control problems. The core idea of the method is closer to DDPG than to the A2C policy gradient methods, so SAC might have been more logically described in Chapter 17, Continuous Action Space. However, in this chapter, we have the chance to compare it directly with PPO, which for a long time was considered the de facto standard for continuous control problems.

The central idea in the SAC method is entropy regularization, which adds a bonus reward at each timestep that is proportional to the entropy of the policy at that timestep. In mathematical...
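
For reference, the SAC paper writes this maximum entropy objective (in the paper's notation) as

$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right],$$

where $\alpha$ is the temperature coefficient that controls the trade-off between the reward and the entropy bonus $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}\left[\log \pi(a \mid s_t)\right]$.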

Summary

In this chapter, we checked three different methods that aim to improve the stability of the stochastic policy gradient, and we compared them to the A2C implementation on two continuous control problems. Together with the methods from the previous chapter (DDPG and D4PG), they form the basic toolkit for working with the continuous control domain. Finally, we checked a relatively new off-policy method that is an extension of DDPG: SAC.

In the next chapter, we will switch to a different set of RL methods that have been gaining popularity recently: black-box, or gradient-free, methods.
