Chapter 4. Policy Gradients

So far, we have seen how to derive an implicit policy from a value function using the value-based approach. In this chapter, the agent will try to learn the policy directly, much as an experienced agent adjusts its policy after observing the outcomes of its actions.

Value iteration, policy iteration, and Q-learning come under the value-based approach and are solved by dynamic programming, while the policy optimization approach involves policy gradients; uniting the two ideas, value estimation and direct policy learning, gives rise to actor-critic algorithms.

In the dynamic programming method, the Q and V values must satisfy a set of self-consistent equations (the Bellman equations). Policy optimization is different: the policy is learned directly rather than being derived from the value function.
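In standard notation (a sketch of the usual Bellman self-consistency conditions, not necessarily the exact formulas the book uses), the value-based view requires:

    V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s') \Big]

    Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a')

Policy optimization, by contrast, parameterizes the policy as $\pi_\theta(a \mid s)$ and searches over $\theta$ directly.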

Thus, value-based methods learn a value function from which we derive an implicit policy, whereas with policy-based methods no value function is learned and the policy is learned directly. The actor-critic method is more advanced...

The policy optimization method


The goal of the policy optimization method is to find the stochastic policy $\pi_\theta(a \mid s)$, that is, a distribution over actions for a given state, that maximizes the expected sum of rewards. It aims to find the policy directly. The basic overview is to create a neural network (that is, a policy network) that processes some state information and outputs a distribution over the possible actions that the agent might take.
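As a concrete illustration, here is a minimal sketch of such a policy network in TensorFlow 2.x (the layer sizes and the state/action dimensions are illustrative assumptions, not values from the book):

    import numpy as np
    import tensorflow as tf

    state_dim, n_actions = 4, 2                      # illustrative sizes for a small control task

    # Policy network: state in, probability distribution over actions out.
    policy_net = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(n_actions, activation="softmax"),
    ])

    state = np.random.rand(1, state_dim).astype(np.float32)   # a dummy state
    action_probs = policy_net(state).numpy()[0]               # e.g. array([0.48, 0.52])
    p = action_probs.astype(np.float64)
    action = np.random.choice(n_actions, p=p / p.sum())       # sample an action from the policy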

The two major components of policy optimization are:

  • The weight parameters of the neural network are collected in the vector $\theta$, which is also the parameter of our control policy. Thus, our aim is to train the weight parameters to obtain the best policy. We value a policy by the expected sum of rewards obtained while following it. For different values of $\theta$, the policy differs, and the optimal policy is the one having the maximum overall reward. Therefore, the $\theta$ with the maximum expected reward gives the optimal policy. Following is the...
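One common way to write this objective (a sketch assuming an episodic, discounted setting; the notation is the usual one rather than necessarily the book's) is:

    J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1} \gamma^{t}\, r_{t+1}\Big], \qquad \theta^{*} = \arg\max_{\theta} J(\theta)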

Why policy optimization methods?


In this section, we will cover the pros and cons of policy optimization methods over value-based methods. The advantages are as follows:

  • They provide better convergence properties.
  • They are highly effective in the case of high-dimensional or continuous state-action spaces. If the action space is very big, the max operation in a value-based method becomes computationally expensive, so the policy-based method directly improves the policy by changing its parameters instead of solving a max at each step.
  • They can learn stochastic policies.

The disadvantages associated with policy-based methods are as follows:

  • They typically converge to a local rather than the global optimum
  • Evaluating a policy is inefficient and has high variance

We will discuss the approaches to tackle these disadvantages later in this chapter. For now, let's focus on the need for stochastic policies.

Why stochastic policy?

Let's go through two examples that will explain the importance of incorporating a stochastic policy compared...

Policy objective functions


Let's now discuss how to optimize a policy. In policy methods, our main objective is to find, for a given policy $\pi_\theta(a \mid s)$ with parameter vector $\theta$, the best values of that parameter vector. In order to decide which values are best, we measure the quality of the policy $\pi_\theta$ for different values of the parameter vector $\theta$.

Before discussing the optimization methods, let's first figure out the different ways to measure the quality of a policy $\pi_\theta$:

  • If it's an episodic environment, the quality $J_1(\theta)$ can be the value function of the start state; that is, if the episode starts from state $s_1$, the measure is the expected sum of rewards from that state onwards. Therefore, $J_1(\theta) = V^{\pi_\theta}(s_1)$.
  • If it's a continuing environment, the quality can be the average value function of the states. So, if the environment goes on forever, then the measure of the quality of the policy can be the summation, over states, of the probability of being in a state $s$ times the value of that state, that is, the expected reward from that state onward. Therefore...
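In the usual notation (a sketch; $d^{\pi_\theta}(s)$ denotes the stationary distribution of states under $\pi_\theta$, and the notation may differ slightly from the book's), the continuing-environment measures are commonly written as:

    J_{avV}(\theta) = \sum_{s} d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad J_{avR}(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, R(s, a)

where $J_{avR}$ is the closely related average reward per time step.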

Temporal difference rule


Firstly, temporal difference (TD) is the difference between the value estimates at two time steps. It differs from the outcome-based Monte Carlo approach, where a full look-ahead to the end of the episode is done in order to update the learning parameters. In temporal difference learning, only a one-step look-ahead is done, and the value estimate of the state at the next step is used to update the current state's value estimate; thus, the learning parameters are updated along the way. The different rules for temporal difference learning are the TD(1), TD(0), and TD($\lambda$) rules. The basic notion in all these approaches is that the value estimate of the next step is used to update the current state's value estimate.
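For example, the one-step TD(0) update of a state-value estimate is usually written as follows (standard notation, shown here for reference):

    V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]

where the bracketed term is the TD error and $\alpha$ is the learning rate.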

TD(1) rule

TD(1) incorporates the concept of an eligibility trace. Let's go through the pseudocode of the approach and then we will discuss it in detail:

Episode T
    For all s, at the start of the episode: e(s) = 0 and V_T(s) = V_{T-1}(s)
    After s_{t-1} --(r_t)--> s_t : (at step t)
...
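To make the idea concrete, here is a minimal tabular sketch of TD learning with eligibility traces in Python (an online variant with illustrative names and a toy state space; an assumption-laden sketch, not the book's code):

    import numpy as np

    n_states, alpha, gamma, lam = 5, 0.1, 0.99, 1.0   # lam = 1 gives TD(1)
    V = np.zeros(n_states)                            # value estimates, kept across episodes

    def run_td_lambda_episode(transitions):
        """transitions: list of (s_prev, reward, s_next) tuples observed in one episode.
        Terminal-state handling is omitted for brevity."""
        e = np.zeros(n_states)                        # eligibility trace, reset per episode
        for s_prev, r, s_next in transitions:
            e[s_prev] += 1.0                          # mark the state we just left
            td_error = r + gamma * V[s_next] - V[s_prev]
            V[:] += alpha * td_error * e              # every eligible state shares the update
            e *= gamma * lam                          # decay the traces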

Policy gradients


As per the policy gradient theorem, for the previously specified policy objective functions and any differentiable policy $\pi_\theta(s, a)$, the policy gradient is as follows:

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\big]
The steps to update the parameters using the Monte Carlo policy gradient approach are shown in the following section.

The Monte Carlo policy gradient

In the Monte Carlo policy gradient approach, we update the parameters by stochastic gradient ascent, using the update given by the policy gradient theorem and $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$. Here, $v_t$ is the cumulative reward from that time step onward.

The Monte Carlo policy gradient approach is as follows:

Initialize θ arbitrarily
for each episode {s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T} sampled as per the current policy π_θ do
    for step t = 1 to T-1 do
        θ ← θ + α ∇_θ log π_θ(s_t, a_t) v_t
    end for
end for
Output: final θ
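As a hedged illustration (a minimal sketch, not the book's implementation), the following Python/TensorFlow 2.x code applies this Monte Carlo policy gradient update to OpenAI Gym's CartPole-v1, assuming the classic Gym API where env.reset() returns an observation and env.step() returns a 4-tuple:

    import gym
    import numpy as np
    import tensorflow as tf

    env = gym.make("CartPole-v1")
    obs_dim = env.observation_space.shape[0]
    n_actions = env.action_space.n

    # Policy network: softmax distribution over the discrete actions.
    policy = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(obs_dim,)),
        tf.keras.layers.Dense(n_actions, activation="softmax"),
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    gamma = 0.99

    def discounted_returns(rewards):
        # v_t: cumulative discounted reward from time step t onward.
        returns, running = np.zeros(len(rewards), dtype=np.float32), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    for episode in range(300):
        states, actions, rewards = [], [], []
        obs, done = env.reset(), False
        while not done:
            probs = policy(np.array([obs], dtype=np.float32)).numpy()[0].astype(np.float64)
            action = int(np.random.choice(n_actions, p=probs / probs.sum()))
            next_obs, reward, done, _ = env.step(action)
            states.append(obs)
            actions.append(action)
            rewards.append(reward)
            obs = next_obs

        v = discounted_returns(rewards)
        with tf.GradientTape() as tape:
            all_probs = policy(np.array(states, dtype=np.float32))
            idx = tf.stack([tf.range(len(actions)), tf.constant(actions, dtype=tf.int32)], axis=1)
            log_probs = tf.math.log(tf.gather_nd(all_probs, idx) + 1e-8)
            # Gradient ascent on sum_t log pi(a_t|s_t) * v_t  ==  descent on its negative.
            loss = -tf.reduce_sum(log_probs * v)
        grads = tape.gradient(loss, policy.trainable_variables)
        optimizer.apply_gradients(zip(grads, policy.trainable_variables))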

Actor-critic algorithms

The preceding policy optimization using the Monte Carlo policy gradient approach leads to high variance. In order to tackle this issue, we use a critic to estimate the state-action value function, that is:

    Q_w(s, a) \approx Q^{\pi_\theta}(s, a)
This gives...
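In the standard actor-critic scheme (a sketch in common notation with assumed learning rates $\alpha$ and $\beta$; the book's exact update rules may differ), the critic updates its parameters $w$ by TD learning while the actor updates the policy parameters $\theta$ in the direction suggested by the critic:

    \delta = r + \gamma Q_w(s', a') - Q_w(s, a)
    w \leftarrow w + \beta\, \delta\, \nabla_w Q_w(s, a)
    \theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)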

Agent learning pong using policy gradients


In this section, we will create a policy network that takes raw pixels from our Pong environment, Pong-v0 from OpenAI Gym, as the input. The policy network is a neural network with a single hidden layer, fully connected to the raw pixels of Pong at the input layer, and with an output layer containing a single node that returns the probability of moving the paddle up. I would like to thank Andrej Karpathy for coming up with a solution to make the agent learn using policy gradients. We will try to implement a similar kind of approach.

We use a pixel image of size 80*80 in grayscale (we will not use RGB, which would be 80*80*3). Thus, we have an 80*80 binary grid that tells us the positions of the paddles and the ball, which we feed as input to the neural network. The neural network consists of the following:

  • Input layer (X): 80*80, flattened to 6400*1, that is, 6400 nodes
  • Hidden layer: 200 nodes
  • Output layer: 1 node

Therefore, the total parameters...
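As a rough sketch of this architecture (a hypothetical, simplified version in plain NumPy rather than the book's exact code; ignoring bias terms, this layout has 6400*200 + 200*1 = 1,280,200 weights):

    import numpy as np

    D, H = 80 * 80, 200                             # flattened input size and hidden layer size
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((H, D)) / np.sqrt(D)   # input -> hidden weights
    W2 = rng.standard_normal(H) / np.sqrt(H)        # hidden -> output weights

    def policy_forward(x):
        """x: 6400-dimensional flattened binary frame; returns P(paddle up) and hidden activations."""
        h = np.maximum(0.0, W1 @ x)                 # ReLU hidden layer (200 nodes)
        logit = float(W2 @ h)
        p_up = 1.0 / (1.0 + np.exp(-logit))         # sigmoid output: probability of moving up
        return p_up, h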

Summary


In this chapter, we covered some of the most famous algorithms in reinforcement learning: policy gradients and actor-critic algorithms. There is a lot of ongoing research on policy gradients aimed at benchmarking better results in reinforcement learning. Further studies of policy gradients include Trust Region Policy Optimization (TRPO), Natural Policy Gradients, and Deep Deterministic Policy Gradient (DDPG), which are beyond the scope of this book.

In the next chapter, we will take a look at the building blocks of Q-Learning, applying deep neural networks, and many more techniques. 
