Chapter 6. Asynchronous Methods

So far, we have covered most of the important topics, such as Markov Decision Processes, Value Iteration, Q-learning, Policy Gradients, deep Q-networks, and Actor Critic Algorithms. These form the core of reinforcement learning algorithms. In this chapter, we will continue from where we left off with Actor Critic Algorithms and delve into the advanced asynchronous methods used in deep reinforcement learning and their most famous variant, the asynchronous advantage actor-critic algorithm, better known as the A3C algorithm.

But before we start with the A3C algorithm, let's revisit the basics of the Actor Critic Algorithm covered in Chapter 4, Policy Gradients. If you remember, the Actor Critic Algorithm has two components:

  • Actor
  • Critic

The Actor takes in the current environment state and determines the best action to take, while the Critic plays a policy-evaluation role by taking in the environment state and action, and returns a score depicting how good...
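As a quick refresher, a minimal TensorFlow sketch of these two components might look like the following; the layer sizes, scope names, and placeholder shapes are illustrative, not the ones used later in this chapter:

import tensorflow as tf

num_state_features = 4   # illustrative size of the state vector
num_actions = 2          # illustrative number of discrete actions

state = tf.placeholder(tf.float32, [None, num_state_features], name='state')
action = tf.placeholder(tf.float32, [None, num_actions], name='action_one_hot')

# Actor: maps the state to a probability distribution over actions
with tf.variable_scope('actor'):
    hidden_a = tf.layers.dense(state, 64, tf.nn.relu)
    action_probs = tf.layers.dense(hidden_a, num_actions, tf.nn.softmax)

# Critic: maps the (state, action) pair to a scalar score of how good that action is
with tf.variable_scope('critic'):
    hidden_c = tf.layers.dense(tf.concat([state, action], axis=1), 64, tf.nn.relu)
    action_score = tf.layers.dense(hidden_c, 1)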

Why asynchronous methods?


Asynchronous Methods for Deep Reinforcement Learning was published in June 2016 by a combined team from Google DeepMind and MILA (https://arxiv.org/pdf/1602.01783.pdf). It was faster and was able to show good results on a multi-core CPU rather than requiring a GPU. Asynchronous methods also work on both continuous and discrete action spaces.

If we recall the approach of a deep Q-network, we use an experience replay buffer to store all the experiences, and then use a random sample from it to train the deep neural network, which in turn predicts the maximum Q-value for the most favorable action. However, this has the drawbacks of high memory usage and heavy computation over time. The basic idea behind asynchronous methods was to overcome this issue. Therefore, instead of using experience replay, multiple instances of the environment are created and multiple agents asynchronously execute actions in parallel (shown in the following diagram):

High-level diagram of the asynchronous method in deep...
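As a minimal sketch of this idea, the following spawns one worker thread per CPU core, each owning its own copy of the environment; the environment name and the random placeholder policy are illustrative, not the ones used later in this chapter:

import multiprocessing
import threading
import gym

def worker(worker_id, game_env='CartPole-v0', num_steps=1000):
    env = gym.make(game_env)                  # each thread gets its own environment copy
    state = env.reset()
    for _ in range(num_steps):
        action = env.action_space.sample()    # placeholder for the learned policy
        state, reward, done, _ = env.step(action)
        if done:
            state = env.reset()

threads = []
for worker_id in range(multiprocessing.cpu_count()):
    t = threading.Thread(target=worker, args=(worker_id,))
    t.start()
    threads.append(t)
for t in threads:
    t.join()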

Asynchronous one-step Q-learning


The architecture of asynchronous one-step Q-learning is very similar to that of DQN. In DQN, an agent was represented by a set of primary and target networks, where the one-step loss is calculated as the square of the difference between the state-action value of the current state s predicted by the primary network and the target state-action value of the current state calculated by the target network. The gradients of the loss are calculated with respect to the parameters of the primary (policy) network, and the loss is then minimized using a gradient descent optimizer, leading to parameter updates of the primary network.
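In symbols, the one-step target is y = r + gamma * max_a' Q(s', a'; θ⁻), and the loss is the square of the difference between y and Q(s, a; θ). A minimal TensorFlow sketch of this loss, assuming the Q-value outputs of both networks are already available (all names and shapes are illustrative):

import tensorflow as tf

gamma = 0.9        # discount factor (illustrative)
num_actions = 6    # illustrative

reward = tf.placeholder(tf.float32, [None])
done = tf.placeholder(tf.float32, [None])                        # 1.0 if s' is terminal, else 0.0
action = tf.placeholder(tf.int32, [None])
q_primary = tf.placeholder(tf.float32, [None, num_actions])      # Q(s, .; theta) from the primary network
q_target_next = tf.placeholder(tf.float32, [None, num_actions])  # Q(s', .; theta_target) from the target network

# y = r + gamma * max_a' Q(s', a'; theta_target), with no bootstrap at terminal states
y = reward + gamma * (1.0 - done) * tf.reduce_max(q_target_next, axis=1)

# Q(s, a; theta) for the action actually taken
q_sa = tf.reduce_sum(q_primary * tf.one_hot(action, num_actions), axis=1)

# One-step loss: square of the difference between the target and the prediction
one_step_loss = tf.reduce_mean(tf.square(y - q_sa))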

The difference in asynchronous one-step Q-learning is that there are multiple such learning agents, that is, learners running and calculating this loss in parallel. Thus, the gradient calculation also occurs in parallel, in different threads, where each learning agent interacts with its own copy of the environment. The accumulation of these gradients in...
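A common way to implement this sharing in TensorFlow is to give each worker a local copy of the network, compute the gradients of its loss with respect to the local parameters, and apply them to the globally shared parameters through a shared optimizer. The following is a minimal sketch of that pattern; the scope names and the tiny network are illustrative:

import tensorflow as tf

def build_net(scope):
    # A tiny network, just to have parameters to share (sizes are illustrative)
    with tf.variable_scope(scope):
        s = tf.placeholder(tf.float32, [None, 4], name='state')
        q = tf.layers.dense(s, 2, name='q_values')
    params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope)
    return s, q, params

_, q_global, global_params = build_net('globalnet')
s_local, q_local, local_params = build_net('worker_0')

# Some loss computed by this worker on its own rollouts (here just a dummy squared error)
target = tf.placeholder(tf.float32, [None, 2])
local_loss = tf.reduce_mean(tf.square(target - q_local))

optimizer = tf.train.RMSPropOptimizer(learning_rate=0.0001)

# Gradients are computed with respect to the worker's local parameters...
local_grads = tf.gradients(local_loss, local_params)
# ...but applied to the globally shared parameters
update_global = optimizer.apply_gradients(list(zip(local_grads, global_params)))

# Periodically the worker copies the shared parameters back into its local copy
sync_local = [l.assign(g) for l, g in zip(local_params, global_params)]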

Asynchronous one-step SARSA


The architecture of asynchronous one-step SARSA is almost identical to the architecture of asynchronous one-step Q-learning, except for the way the target state-action value of the current state is calculated by the target network. Instead of using the maximum Q-value of the next state s' given by the target network, SARSA uses ε-greedy exploration to choose the action a' for the next state s', and the Q-value of the next state-action pair, that is, Q(s', a'; θ⁻), is used to calculate the target state-action value of the current state.
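In symbols, the SARSA target is y = r + gamma * Q(s', a'; θ⁻), where a' is chosen ε-greedily from the target network's Q-values. A minimal sketch of forming this target (the names below are illustrative):

import numpy as np

gamma = 0.9
epsilon = 0.1

def epsilon_greedy(q_values, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: random action
    return int(np.argmax(q_values))               # exploit: greedy action

def sarsa_target(reward, q_target_next, done):
    # q_target_next: Q-values of the next state s' from the target network, shape [num_actions]
    if done:
        return reward                              # no bootstrap at terminal states
    a_next = epsilon_greedy(q_target_next, epsilon)  # a' chosen epsilon-greedily
    return reward + gamma * q_target_next[a_next]    # y = r + gamma * Q(s', a'; theta_target)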

The pseudo-code for asynchronous one-step SARSA is shown below. Here, the following are the global parameters:

  • θ : the parameters (weights and biases) of the policy network
  • θ⁻ : the parameters (weights and biases) of the target network
  • T : overall time step counter

// Globally shared parameters θ, θ⁻ and T
// θ is initialized arbitrarily
// T is initialized to 0

Pseudo-code for each learner running in parallel in each of the threads:

Initialize thread-level time step counter t=0...

Asynchronous n-step Q-learning


The architecture of asynchronous n-step Q-learning is, to an extent, similar to that of asynchronous one-step Q-learning. The difference is that the learning agent's actions are selected using the exploration policy for up to n steps, or until a terminal state is reached, in order to compute a single update of the policy network parameters. This process collects n rewards from the environment since its last update. Then, for each time step, the loss is calculated as the difference between the discounted future rewards at that time step and the estimated Q-value. The gradient of this loss with respect to the thread-specific network parameters is calculated and accumulated for each time step. There are multiple such learning agents running and accumulating these gradients in parallel. These accumulated gradients are used to perform asynchronous updates of the policy network parameters.
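A minimal sketch of how the n-step targets can be computed for a short rollout, working backwards from the bootstrap value of the last state (the function and variable names are illustrative):

gamma = 0.9

def n_step_targets(rewards, bootstrap_value, done):
    """Compute discounted n-step targets for each step of a rollout.

    rewards         : list of rewards [r_t, ..., r_{t+n-1}] collected since the last update
    bootstrap_value : max_a Q(s_{t+n}, a; theta_target) of the last state reached
    done            : True if the rollout ended in a terminal state
    """
    R = 0.0 if done else bootstrap_value
    targets = []
    for r in reversed(rewards):           # work backwards: R <- r + gamma * R
        R = r + gamma * R
        targets.append(R)
    return list(reversed(targets))        # targets[i] pairs with the i-th state of the rollout

# Example: three rewards, non-terminal rollout bootstrapped with value 1.0
print(n_step_targets([0.0, 0.0, 1.0], bootstrap_value=1.0, done=False))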

The pseudo-code for asynchronous n-step Q-learning is shown below. Here, the following are the global...

Asynchronous advantage actor critic


In the architecture of asynchronous advantage actor-critic, each learning agent contains an actor-critic learner that combines the benefits of both value- and policy-based methods. The actor network takes in the state as input and predicts the best action for that state, while the critic network takes in the state and action as inputs and outputs the action score to quantify how good the action is for that state. The actor network updates its weight parameters using policy gradients, while the critic network updates its weight parameters using TD(0), in other words, the difference of value estimates between two time steps, as discussed in Chapter 4, Policy Gradients.
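As a sketch of what each worker optimizes, the following shows the two losses, assuming the critic estimates the state value and the advantage is the difference between the observed discounted return and that estimate; the entropy bonus weighted by beta is the usual A3C regularizer, and all names and shapes here are illustrative:

import tensorflow as tf

num_actions = 6   # illustrative (Pong-v0 has 6 discrete actions)
beta = 0.0001     # entropy regularization weight

action_probs = tf.placeholder(tf.float32, [None, num_actions])  # output of the actor network
state_value = tf.placeholder(tf.float32, [None])                # output of the critic network
action_taken = tf.placeholder(tf.int32, [None])
discounted_return = tf.placeholder(tf.float32, [None])

# Advantage: how much better the observed return was than the critic's estimate
advantage = discounted_return - state_value

# Critic loss: squared TD error
critic_loss = tf.reduce_mean(tf.square(advantage))

# Actor loss: policy gradient weighted by the advantage (treated as a constant),
# plus an entropy bonus that encourages exploration
log_prob = tf.log(tf.reduce_sum(action_probs * tf.one_hot(action_taken, num_actions), axis=1) + 1e-8)
entropy = -tf.reduce_sum(action_probs * tf.log(action_probs + 1e-8), axis=1)
actor_loss = -tf.reduce_mean(log_prob * tf.stop_gradient(advantage) + beta * entropy)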

In Chapter 4, Policy Gradients, we studied how subtracting a baseline function from the expected future rewards in the policy gradient update reduces the variance without affecting the expected value of the gradient. The difference between the expected future...

A3C for Pong-v0 in OpenAI gym


We have already discussed the Pong environment in Chapter 4, Policy Gradients. We will use the following code to create the A3C for Pong-v0 in OpenAI gym:

import multiprocessing
import threading
import tensorflow as tf
import numpy as np
import gym
import os
import shutil
import matplotlib.pyplot as plt

game_env = 'Pong-v0'
num_workers = multiprocessing.cpu_count()  # one worker thread per CPU core
max_global_episodes = 100000               # total episodes across all workers
global_network_scope = 'globalnet'         # variable scope of the shared global network
global_iteration_update = 20               # steps between updates pushed to the global network
gamma = 0.9                                # discount factor
beta = 0.0001                              # entropy regularization weight
lr_actor = 0.0001                          # learning rate for actor
lr_critic = 0.0001                         # learning rate for critic
global_running_rate = []
global_episode = 0

env = gym.make(game_env)

num_actions = env.action_space.n           # Pong-v0 has 6 discrete actions


tf.reset_default_graph()

The input state image preprocessing function:

def preprocessing_image(obs): # obs is the single frame of the game given as the input
    """ preprocess a 210x160x3 uint8 frame into a 6400 (80x80) 1D float vector """
#the values below have been precomputed through trial...
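A common Pong preprocessing along these lines crops the playing field, downsamples by a factor of two, erases the background colors, and binarizes the remaining pixels. The following sketch shows that standard recipe; the crop bounds and pixel values are the widely used ones for Pong, not necessarily the book's exact choices:

import numpy as np

def preprocess_pong_frame(obs):
    """Typical Pong preprocessing: 210x160x3 uint8 frame -> 6400 (80x80) float vector."""
    img = obs[35:195]              # crop to the playing field
    img = img[::2, ::2, 0].copy()  # downsample by a factor of 2, keep one color channel
    img[img == 144] = 0            # erase background (background type 1)
    img[img == 109] = 0            # erase background (background type 2)
    img[img != 0] = 1              # everything else (paddles, ball) set to 1
    return img.astype(np.float32).ravel()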

Summary


We saw that using parallel learners to update a shared model greatly improved the learning process. We learned the reasoning behind the use of asynchronous methods in deep reinforcement learning and their different variants, including asynchronous one-step Q-learning, asynchronous one-step SARSA, asynchronous n-step Q-learning, and asynchronous advantage actor-critic. We also learned how to implement the A3C algorithm, where we made an agent learn to play Pong.

In the coming chapters, we will focus on different domains and how deep reinforcement learning is being, and can be, applied.

 
