Asynchronous Methods - A3C and A2C

We looked at the DDPG algorithm in the previous chapter. One of the main drawbacks of the DDPG algorithm (as well as the DQN algorithm that we saw earlier) is the use of a replay buffer to obtain independent and identically distributed samples of data for training. Using a replay buffer consumes a lot of memory, which is undesirable for robust RL applications. To overcome this problem, researchers at Google DeepMind came up with an on-policy algorithm called Asynchronous Advantage Actor Critic (A3C). A3C does not use a replay buffer; instead, it uses parallel workers, each of which interacts with its own instance of the environment to collect experience samples. Once a finite and fixed number of samples has been collected, they are used to compute the policy gradients, which are asynchronously sent to a central processor that updates...
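
The following is a minimal sketch of the update cycle that each A3C worker repeats. It only illustrates the asynchronous pattern described above; the names worker_loop, global_net, local_net, n_steps, and the network methods used here are hypothetical placeholders, not the book's actual code:

def worker_loop(env, global_net, local_net, n_steps):
    # each worker runs this loop in its own thread
    s = env.reset()
    while True:
        # synchronize the local copy with the current global parameters
        local_net.copy_params_from(global_net)

        # collect a small, fixed batch of on-policy experience
        batch = []
        for _ in range(n_steps):
            a = local_net.sample_action(s)
            s1, r, done, _ = env.step(a)
            batch.append((s, a, r))
            s = env.reset() if done else s1

        # compute the policy and value gradients from this batch only
        grads = local_net.compute_gradients(batch)

        # apply them to the global network asynchronously, without
        # waiting for the other workers to finish
        global_net.apply_gradients(grads)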

Technical requirements

To successfully complete this chapter, some knowledge of the following will be of great help:

  • TensorFlow (version 1.4 or higher)
  • Python (version 2 or 3)
  • NumPy

The A3C algorithm

As we mentioned earlier, we have parallel workers in A3C, and each worker computes the policy gradients and passes them on to the central (or master) processor. The A3C paper also uses the advantage function to reduce the variance of the policy gradients. The loss function consists of three losses, which are weighted and added together: the value loss, the policy loss, and an entropy regularization term. The value loss, Lv, is an L2 loss between the state value and the target value, where the latter is computed as a discounted sum of rewards. The policy loss, Lp, is the product of the logarithm of the policy distribution and the advantage function, A. The entropy regularization, Le, is the Shannon entropy, which is computed as the product of the policy distribution and its logarithm, with a minus sign included. The entropy regularization term is like a bonus for...
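
To make these three terms concrete, here is a minimal TensorFlow 1.x sketch of how they can be constructed for a discrete-action policy such as CartPole's. The tensor names and the loss weights (0.5 and 0.01) are illustrative assumptions rather than the book's exact code; in the real network, value and policy would be the outputs of the critic and actor heads rather than placeholders:

import tensorflow as tf

n_actions = 2                                            # CartPole has two discrete actions
value = tf.placeholder(tf.float32, [None])               # V(s) from the critic head
policy = tf.placeholder(tf.float32, [None, n_actions])   # pi(a|s), softmax output of the actor head
actions = tf.placeholder(tf.float32, [None, n_actions])  # one-hot actions that were taken
target_v = tf.placeholder(tf.float32, [None])            # discounted sum of rewards
advantage = tf.placeholder(tf.float32, [None])           # A = target_v - V(s)

# value loss: L2 distance between the state value and the target value
value_loss = tf.reduce_sum(tf.square(target_v - value))

# policy loss: log pi(a|s) weighted by the advantage (negated, since we minimize)
log_pi_a = tf.log(tf.reduce_sum(policy * actions, axis=1) + 1e-10)
policy_loss = -tf.reduce_sum(log_pi_a * advantage)

# entropy regularization: -sum_a pi(a|s) log pi(a|s), encourages exploration
entropy = -tf.reduce_sum(policy * tf.log(policy + 1e-10))

# weighted sum of the three terms; the entropy enters with a minus sign
# so that higher entropy (more exploration) lowers the total loss
total_loss = 0.5 * value_loss + policy_loss - 0.01 * entropy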

The A3C algorithm applied to CartPole

Here, we will code A3C in TensorFlow and apply it to train an agent on the CartPole problem. The following code files are required:

  • cartpole.py: This will start the training or testing process
  • a3c.py: This is where the A3C algorithm is coded
  • utils.py: This includes utility functions

Coding cartpole.py

We will now code cartpole.py. Follow these steps to get started:

  1. First, we import the packages:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import gym
import os
import threading
import multiprocessing

from random import choice
from time import sleep
from time import time

from a3c import *
from utils import *
  2. Next, we set the parameters...
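
The parameter values themselves are set in the book's code, which is truncated here. Purely as an illustration of how the imports above fit together, the following is a hedged sketch of how the worker threads could then be created and launched; the Worker class and its work() method are hypothetical placeholders, not the book's actual implementation:

# one worker per available CPU core (an illustrative choice)
num_workers = multiprocessing.cpu_count()

with tf.Session() as sess:
    # hypothetical Worker objects, each holding a local copy of the network
    workers = [Worker(i, 'CartPole-v0', sess) for i in range(num_workers)]
    sess.run(tf.global_variables_initializer())

    worker_threads = []
    for w in workers:
        t = threading.Thread(target=w.work)   # each worker runs in its own thread
        t.start()
        sleep(0.5)                            # stagger the thread start-up
        worker_threads.append(t)

    for t in worker_threads:
        t.join()                              # wait for all workers to finish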

The A3C algorithm applied to LunarLander

We will extend the same code to train an agent on the LunarLander problem, which is harder than CartPole. Most of the code is the same as before, so we will only describe the changes that need to be made to the preceding code. First, the reward shaping is different for the LunarLander problem. So, we will include a function called reward_shaping() in the a3c.py file. It will check if the lander has crashed on the lunar surface; if so, the episode will be terminated and there will be a -1.0 penalty. If the lander is not moving, the episode will be terminated and a -0.5 penalty will be paid:

def reward_shaping(r, s, s1):
    # check if y-coord < 0; implies lander crashed
    if (s1[1] < 0.0):
        print('-----lander crashed!----- ')
        d = True
        r -= 1.0

    # check if lander is stuck
    xx = s[0] - s1[0]
    yy...
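
The remainder of the function is truncated above. As a hedged illustration of how such a shaping function fits into the rollout loop, assuming it returns the shaped reward and the done flag, it could be called right after each environment step (the variable names below are illustrative):

# hypothetical usage inside a worker's rollout loop; assumes reward_shaping
# returns the shaped reward and the (possibly overridden) done flag
s1, r, done, _ = env.step(a)
r, done = reward_shaping(r, s, s1)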

The A2C algorithm

The difference between A2C and A3C is that A2C performs synchronous updates. Here, all the workers wait until they have finished collecting experiences and computing their gradients; only then are the global (or master) network's parameters updated. This differs from A3C, where the update is performed asynchronously, that is, the worker threads do not wait for the others to finish. A2C is easier to code than A3C, but we will not implement it here. If you are interested, you are encouraged to take the preceding A3C code and convert it to A2C, after which the performance of the two algorithms can be compared.
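
For comparison with the asynchronous loop sketched earlier, the following is a minimal illustration of the synchronous A2C update pattern; the workers, collect_experience, compute_gradients, and global_net names are hypothetical placeholders, not code from the book:

def a2c_update(workers, global_net):
    # synchronous: every worker must finish its rollout and gradient
    # computation before any parameters are touched
    all_grads = []
    for w in workers:
        batch = w.collect_experience()
        all_grads.append(w.compute_gradients(batch))

    # average the gradients across workers and apply a single global update
    mean_grads = [sum(g) / len(all_grads) for g in zip(*all_grads)]
    global_net.apply_gradients(mean_grads)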

Summary

In this chapter, we introduced the A3C algorithm, which is an on-policy algorithm that's applicable to both discrete and continuous action problems. You saw how three different loss terms are combined into one and optimized. Python's threading library is useful for running multiple threads, with a copy of the policy network in each thread. These different workers compute the policy gradients and pass them on to the master to update the neural network parameters. We applied A3C to train agents for the CartPole and LunarLander problems, and the agents learned them very well. A3C is a very robust algorithm and does not require a replay buffer, although it does require a local buffer for collecting a small number of experiences, which are then used to update the networks. Lastly, a synchronous version of the algorithm, called A2C, was also introduced.

This...

Questions

  1. Is A3C an on-policy or off-policy algorithm?
  2. Why is the Shannon entropy term used?
  3. What are the problems with using a large number of worker threads?
  4. Why is softmax used in the policy neural network?
  5. Why do we need an advantage function?
  6. This is left as an exercise: For the LunarLander problem, repeat the training without reward shaping and see if the agent learns faster/slower than what we saw in this chapter.

Further reading
