Asynchronous Advantage Actor-Critic

This chapter is dedicated to the extension of the advantage actor-critic (A2C) method that we discussed in detail in Chapter 12, The Actor-Critic Method. The extension adds true asynchronous environment interaction, and its full name is asynchronous advantage actor-critic, which is normally abbreviated to A3C. This method is one of the most widely used by reinforcement learning (RL) practitioners.

We will take a look at two approaches for adding asynchronous behavior to the basic A2C method: data-level and gradient-level parallelism. They have different resource requirements and characteristics, which makes them applicable to different situations.

In this chapter, we will:

  • Discuss why it is important for policy gradient methods to gather training data from multiple environments
  • Implement two different approaches to A3C

Correlation and sample efficiency

One of the approaches to improving the stability of the policy gradient family of methods is to use multiple environments in parallel. The reason behind this is the fundamental problem we discussed in Chapter 6, Deep Q-Networks, when we talked about the correlation between samples, which breaks the independent and identically distributed (i.i.d.) assumption that is critical for stochastic gradient descent (SGD) optimization. The negative consequence of such correlation is very high variance in the gradients: our training batch contains very similar examples, all of them pushing our network in the same direction.

However, this may be totally the wrong direction in the global sense, as all those examples may come from a single lucky or unlucky episode.

With our deep Q-network (DQN), we solved the issue by storing a large number of previous states in the replay buffer and sampling our training batch from this buffer. If the buffer...

Adding an extra A to A2C

From the practical point of view, communicating with several parallel environments is simple. We already did this in the previous chapter, but it wasn't explicitly stated. In the A2C agent, we passed an array of Gym environments into the ExperienceSource class, which switched it into round-robin data gathering mode. This means that every time we ask the experience source for a transition, it uses the next environment from our array (keeping, of course, a separate state for every environment). This simple approach is equivalent to parallel communication with environments, with a single difference: the communication is not parallel in the strict sense, but performed serially. However, the samples from our experience source end up interleaved across environments. This idea is shown in the following diagram:

Figure 13.1: An agent training from multiple environments in parallel
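
To make the round-robin idea concrete, here is a minimal sketch of stepping several environments in turn. It is not the ExperienceSource code itself: the environment name, the number of steps, and the random stand-in agent are illustrative choices, and the classic Gym reset/step API used throughout the book is assumed.

import gym

ENV_NAME = "CartPole-v1"   # arbitrary environment for the sketch
N_ENVS = 4

def round_robin_transitions(envs, agent_fn, steps):
    # Yield (obs, action, reward, done) tuples, cycling over the environments,
    # so that consecutive samples in the stream come from different episodes.
    obs = [e.reset() for e in envs]
    for step in range(steps):
        idx = step % len(envs)                   # the next environment in the cycle
        action = agent_fn(obs[idx])
        new_obs, reward, done, _ = envs[idx].step(action)
        yield obs[idx], action, reward, done
        obs[idx] = envs[idx].reset() if done else new_obs

if __name__ == "__main__":
    envs = [gym.make(ENV_NAME) for _ in range(N_ENVS)]
    random_agent = lambda _: envs[0].action_space.sample()   # stand-in for the policy
    for transition in round_robin_transitions(envs, random_agent, 100):
        pass   # in real training, transitions would be batched for the A2C update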

This method works fine and helped us to get convergence in the A2C method, but it is still...

Multiprocessing in Python

Python includes the multiprocessing (most of the time abbreviated to just mp) module to support process-level parallelism and the required communication primitives. In our example, we will use the two main classes from this module:

  • mp.Queue: A concurrent multi-producer, multi-consumer FIFO (first in, first out) queue with transparent serialization and deserialization of objects placed in the queue
  • mp.Process: A piece of code run in a child process, together with methods to control it from the parent process (a minimal usage sketch of both classes follows this list)
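
Here is a minimal usage sketch of these two primitives working together: a child process created with mp.Process puts items into an mp.Queue, and the parent consumes them. The function name and the None sentinel are just illustrative choices.

import multiprocessing as mp

def producer(queue):
    for i in range(5):
        queue.put(i * i)        # transparently serialized and sent to the parent
    queue.put(None)             # sentinel: tells the parent we are done

if __name__ == "__main__":
    queue = mp.Queue(maxsize=10)
    proc = mp.Process(target=producer, args=(queue,))
    proc.start()                # run producer() in a child process
    while True:
        item = queue.get()      # blocks until the child puts something
        if item is None:
            break
        print(item)
    proc.join()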

PyTorch provides its own thin wrapper around the multiprocessing module, which adds the proper handling of tensors and variables on CUDA devices and shared memory. It provides exactly the same functionality as the multiprocessing module from the standard library, so all you need to do is use import torch.multiprocessing instead of import multiprocessing.
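
In practice, the switch looks like this: only the import changes, and the Queue/Process API stays the same. The 'spawn' start method shown here is the usual choice when CUDA tensors may cross process boundaries; treat this as a sketch rather than a required setup.

import torch.multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")    # fork is unsafe once CUDA is involved
    queue = mp.Queue(maxsize=10)    # same API as the standard library version
    # mp.Process is used exactly as before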

A3C with data parallelism

The first version of A3C parallelization that we will check (outlined in Figure 13.2) has one main process that carries out the training and several child processes that communicate with environments and gather experience to train on.

Implementation

For simplicity and efficiency, broadcasting of the NN weights from the trainer process is not implemented. Instead of explicitly gathering and sending weights to the child processes, the network is shared between all processes using PyTorch's built-in capabilities, which allow us to use the same nn.Module instance, with all its weights, in different processes by calling the share_memory() method on NN creation. Under the hood, this method has zero overhead for CUDA (as GPU memory is shared among all the host's processes) and uses shared memory inter-process communication (IPC) in the case of CPU computation. In both cases, the method improves performance, but it limits our example to a single machine using one...
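
A condensed sketch of this scheme might look as follows. It is not the book's full source: the tiny policy network, the CartPole environment, and the fixed number of trainer iterations are stand-ins, and the classic Gym API is assumed. The key call is share_memory(): after it, every child process works with the very same weights that the trainer updates.

import gym
import torch
import torch.nn as nn
import torch.multiprocessing as mp

ENV_NAME = "CartPole-v1"
N_PROCS = 4

def data_func(net, queue):
    # Child process: play episodes with the shared network and ship raw
    # transitions to the trainer through the queue.
    env = gym.make(ENV_NAME)
    obs = env.reset()
    while True:
        obs_v = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
        with torch.no_grad():
            logits = net(obs_v)                  # uses the shared weights
        action = torch.distributions.Categorical(logits=logits).sample().item()
        new_obs, reward, done, _ = env.step(action)
        queue.put((obs, action, reward, done))
        obs = env.reset() if done else new_obs

if __name__ == "__main__":
    mp.set_start_method("spawn")
    env = gym.make(ENV_NAME)
    net = nn.Sequential(nn.Linear(env.observation_space.shape[0], 64),
                        nn.ReLU(),
                        nn.Linear(64, env.action_space.n))
    net.share_memory()                           # weights become visible to the children
    queue = mp.Queue(maxsize=N_PROCS)
    procs = [mp.Process(target=data_func, args=(net, queue)) for _ in range(N_PROCS)]
    for p in procs:
        p.start()
    for _ in range(100):                         # the real trainer loop would build
        transition = queue.get()                 # batches and run the A2C update here
    for p in procs:
        p.terminate()
        p.join()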

A3C with gradients parallelism

The next approach that we will consider for parallelizing the A2C implementation also has several child processes, but instead of feeding training data to the central training loop, they calculate gradients using their local training data and send those gradients to the central master process.

This process is responsible for combining the gradients (which basically just means summing them) and performing an SGD update on the shared network.

The difference might look minor, but this approach is much more scalable, especially if you have several powerful nodes with multiple GPUs connected by a network. In this case, the central process in the data-parallel model quickly becomes a bottleneck, as the loss calculation and backpropagation are computationally demanding. Gradient parallelization allows the load to be spread over several GPUs, with only the relatively simple operation of gradient combination performed in the central place...
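
A condensed sketch of the gradient-parallel scheme might look as follows (again, not the book's full source). Each child computes gradients on its own data and ships the raw gradient tensors to the master, which sums them per parameter, installs the result into the shared network's .grad fields, and takes the SGD step. The toy linear network and the dummy_loss() function are hypothetical stand-ins for the A2C model and loss.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

N_CHILDREN = 2
TRAIN_STEPS = 10

def dummy_loss(net):
    # Stand-in for the A2C loss computed on a child's local batch of experience
    x = torch.randn(16, 4)
    return net(x).pow(2).mean()

def grads_func(net, queue):
    # Child process: compute gradients locally and send them to the master
    while True:
        net.zero_grad()
        loss = dummy_loss(net)
        loss.backward()
        queue.put([p.grad.clone() for p in net.parameters()])   # ship gradients, not data

if __name__ == "__main__":
    mp.set_start_method("spawn")
    net = nn.Linear(4, 2)
    net.share_memory()
    optimizer = optim.SGD(net.parameters(), lr=0.01)
    queue = mp.Queue(maxsize=N_CHILDREN)
    procs = [mp.Process(target=grads_func, args=(net, queue)) for _ in range(N_CHILDREN)]
    for p in procs:
        p.start()
    for _ in range(TRAIN_STEPS):
        summed = [torch.zeros_like(p) for p in net.parameters()]
        for _ in range(N_CHILDREN):              # one gradient package per child
            for s, g in zip(summed, queue.get()):
                s += g
        for param, grad in zip(net.parameters(), summed):
            param.grad = grad                    # install the combined gradients
        optimizer.step()                         # single update on the shared network
    for p in procs:
        p.terminate()
        p.join()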

Summary

In this chapter, we discussed why it is important for policy gradient methods to gather training data from multiple environments, due to their on-policy nature. We also implemented two different approaches to A3C, in order to parallelize and stabilize the training process. Parallelization will come up once again in this book, when we discuss black-box methods (Chapter 20, Black-Box Optimization in RL).

In the next three chapters, we will take a look at practical problems that can be solved using policy gradient methods, which will wrap up the policy gradient methods part of the book.
