Chapter 4. Policy Gradients

So far, we have seen how to derive an implicit policy from a value function using the value-based approach. In this chapter, the agent will try to learn the policy directly, much as an experienced agent adjusts its policy after observing the outcomes of its actions.

Value iteration, policy iteration, and Q-learning come under the value-based approach and are solved by dynamic programming, while the policy optimization approach involves policy gradients; uniting the two ideas, value estimation and direct policy learning, gives rise to actor-critic algorithms.

In the dynamic programming method, the Q and V values must satisfy a set of self-consistent equations (the Bellman equations). Policy optimization is different: the policy is learned directly rather than being derived from the value function.
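In standard notation (a sketch of the usual Bellman self-consistency conditions, not necessarily the exact formulas the book uses), the value-based view requires:

    V^{\pi}(s) = \sum_{a} \pi(a \mid s) \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s') \Big]

    Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^{\pi}(s', a')

Policy optimization, by contrast, parameterizes the policy as $\pi_\theta(a \mid s)$ and searches over $\theta$ directly.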

Thus, value-based methods learn a value function from which we derive an implicit policy, whereas with policy-based methods no value function is learned and the policy is learned directly. The actor-critic method is more advanced...

The policy optimization method


The goal of the policy optimization method is to find the stochastic policy $\pi_\theta(a \mid s)$, that is, a distribution over actions for a given state, that maximizes the expected sum of rewards. It aims to find the policy directly. The basic overview is to create a neural network (that is, a policy network) that processes some state information and outputs a distribution over the possible actions that the agent might take.
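As a concrete illustration, here is a minimal sketch of such a policy network in TensorFlow 2.x (the layer sizes and the state/action dimensions are illustrative assumptions, not values from the book):

    import numpy as np
    import tensorflow as tf

    state_dim, n_actions = 4, 2                      # illustrative sizes for a small control task

    # Policy network: state in, probability distribution over actions out.
    policy_net = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(n_actions, activation="softmax"),
    ])

    state = np.random.rand(1, state_dim).astype(np.float32)   # a dummy state
    action_probs = policy_net(state).numpy()[0]               # e.g. array([0.48, 0.52])
    p = action_probs.astype(np.float64)
    action = np.random.choice(n_actions, p=p / p.sum())       # sample an action from the policy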

The two major components of policy optimization are:

  • The weight parameters of the neural network are collected in the vector $\theta$, which is also the parameter of our control policy. Thus, our aim is to train the weight parameters to obtain the best policy. We value a policy by the expected sum of rewards obtained while following it. For different values of $\theta$, the policy differs, and the optimal policy is the one having the maximum overall reward. Therefore, the $\theta$ with the maximum expected reward gives the optimal policy. Following is the...
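One common way to write this objective (a sketch assuming an episodic, discounted setting; the notation is the usual one rather than necessarily the book's) is:

    J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{T-1} \gamma^{t}\, r_{t+1}\Big], \qquad \theta^{*} = \arg\max_{\theta} J(\theta)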

Why policy optimization methods?


In this section, we will cover the pros and cons of policy optimization methods over value-based methods. The advantages are as follows:

  • They provide better convergence properties.
  • They are highly effective in the case of high-dimensional or continuous state-action spaces. If the action space is very big, the max operation in a value-based method becomes computationally expensive, so the policy-based method directly improves the policy by changing its parameters instead of solving a max at each step.
  • They can learn stochastic policies.

The disadvantages associated with policy-based methods are as follows:

  • They typically converge to a local rather than the global optimum
  • Evaluating a policy is inefficient and has high variance

We will discuss the approaches to tackle these disadvantages later in this chapter. For now, let's focus on the need for stochastic policies.

Why stochastic policy?

Let's go through two examples that will explain the importance of incorporating a stochastic policy compared...

Policy objective functions


Let's now discuss how to optimize a policy. In policy methods, our main objective is to find, for a given policy $\pi_\theta(a \mid s)$ with parameter vector $\theta$, the best values of that parameter vector. In order to decide which values are best, we measure the quality of the policy $\pi_\theta$ for different values of the parameter vector $\theta$.

Before discussing the optimization methods, let's first figure out the different ways to measure the quality of a policy $\pi_\theta$:

  • If it's an episodic environment, the quality $J_1(\theta)$ can be the value function of the start state; that is, if the episode starts from state $s_1$, the measure is the expected sum of rewards from that state onwards. Therefore, $J_1(\theta) = V^{\pi_\theta}(s_1)$.
  • If it's a continuing environment, the quality can be the average value function of the states. So, if the environment goes on forever, then the measure of the quality of the policy can be the summation, over states, of the probability of being in a state $s$ times the value of that state, that is, the expected reward from that state onward. Therefore...
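In the usual notation (a sketch; $d^{\pi_\theta}(s)$ denotes the stationary distribution of states under $\pi_\theta$, and the notation may differ slightly from the book's), the continuing-environment measures are commonly written as:

    J_{avV}(\theta) = \sum_{s} d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad J_{avR}(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s)\, R(s, a)

where $J_{avR}$ is the closely related average reward per time step.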

Temporal difference rule


Firstly, temporal difference (TD) is the difference between the value estimates at two time steps. It differs from the outcome-based Monte Carlo approach, where a full look-ahead to the end of the episode is done in order to update the learning parameters. In temporal difference learning, only a one-step look-ahead is done, and the value estimate of the state at the next step is used to update the current state's value estimate; thus, the learning parameters are updated along the way. The different rules for temporal difference learning are the TD(1), TD(0), and TD($\lambda$) rules. The basic notion in all these approaches is that the value estimate of the next step is used to update the current state's value estimate.
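For example, the one-step TD(0) update of a state-value estimate is usually written as follows (standard notation, shown here for reference):

    V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]

where the bracketed term is the TD error and $\alpha$ is the learning rate.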

TD(1) rule

TD(1) incorporates the concept of an eligibility trace. Let's go through the pseudocode of the approach and then we will discuss it in detail:

Episode T
    For all s, at the start of the episode: e(s) = 0 and V_T(s) = V_{T-1}(s)
    After s_{t-1} --(r_t)--> s_t : (at step t)
...
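To make the idea concrete, here is a minimal tabular sketch of TD learning with eligibility traces in Python (an online variant with illustrative names and a toy state space; an assumption-laden sketch, not the book's code):

    import numpy as np

    n_states, alpha, gamma, lam = 5, 0.1, 0.99, 1.0   # lam = 1 gives TD(1)
    V = np.zeros(n_states)                            # value estimates, kept across episodes

    def run_td_lambda_episode(transitions):
        """transitions: list of (s_prev, reward, s_next) tuples observed in one episode.
        Terminal-state handling is omitted for brevity."""
        e = np.zeros(n_states)                        # eligibility trace, reset per episode
        for s_prev, r, s_next in transitions:
            e[s_prev] += 1.0                          # mark the state we just left
            td_error = r + gamma * V[s_next] - V[s_prev]
            V[:] += alpha * td_error * e              # every eligible state shares the update
            e *= gamma * lam                          # decay the traces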

Policy gradients


As per the policy gradient theorem, for the previously specified policy objective functions and any differentiable policy $\pi_\theta(s, a)$, the policy gradient is as follows:

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\big]
The steps to update the parameters using the Monte Carlo policy gradient approach are shown in the following section.

The Monte Carlo policy gradient

In the Monte Carlo policy gradient approach, we update the parameters by stochastic gradient ascent, using the update given by the policy gradient theorem and $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$. Here, $v_t$ is the cumulative reward from that time step onward.

The Monte Carlo policy gradient approach is as follows:

Initialize θ arbitrarily
for each episode {s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T} sampled as per the current policy π_θ do
    for step t = 1 to T-1 do
        θ ← θ + α ∇_θ log π_θ(s_t, a_t) v_t
    end for
end for
Output: final θ
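As a hedged illustration (a minimal sketch, not the book's implementation), the following Python/TensorFlow 2.x code applies this Monte Carlo policy gradient update to OpenAI Gym's CartPole-v1, assuming the classic Gym API where env.reset() returns an observation and env.step() returns a 4-tuple:

    import gym
    import numpy as np
    import tensorflow as tf

    env = gym.make("CartPole-v1")
    obs_dim = env.observation_space.shape[0]
    n_actions = env.action_space.n

    # Policy network: softmax distribution over the discrete actions.
    policy = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(obs_dim,)),
        tf.keras.layers.Dense(n_actions, activation="softmax"),
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    gamma = 0.99

    def discounted_returns(rewards):
        # v_t: cumulative discounted reward from time step t onward.
        returns, running = np.zeros(len(rewards), dtype=np.float32), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    for episode in range(300):
        states, actions, rewards = [], [], []
        obs, done = env.reset(), False
        while not done:
            probs = policy(np.array([obs], dtype=np.float32)).numpy()[0].astype(np.float64)
            action = int(np.random.choice(n_actions, p=probs / probs.sum()))
            next_obs, reward, done, _ = env.step(action)
            states.append(obs)
            actions.append(action)
            rewards.append(reward)
            obs = next_obs

        v = discounted_returns(rewards)
        with tf.GradientTape() as tape:
            all_probs = policy(np.array(states, dtype=np.float32))
            idx = tf.stack([tf.range(len(actions)), tf.constant(actions, dtype=tf.int32)], axis=1)
            log_probs = tf.math.log(tf.gather_nd(all_probs, idx) + 1e-8)
            # Gradient ascent on sum_t log pi(a_t|s_t) * v_t  ==  descent on its negative.
            loss = -tf.reduce_sum(log_probs * v)
        grads = tape.gradient(loss, policy.trainable_variables)
        optimizer.apply_gradients(zip(grads, policy.trainable_variables))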

Actor-critic algorithms

The preceding policy optimization using the Monte Carlo policy gradient approach leads to high variance. In order to tackle this issue, we use a critic to estimate the state-action value function, that is:

    Q_w(s, a) \approx Q^{\pi_\theta}(s, a)
This gives...
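In the standard actor-critic scheme (a sketch in common notation with assumed learning rates $\alpha$ and $\beta$; the book's exact update rules may differ), the critic updates its parameters $w$ by TD learning while the actor updates the policy parameters $\theta$ in the direction suggested by the critic:

    \delta = r + \gamma Q_w(s', a') - Q_w(s, a)
    w \leftarrow w + \beta\, \delta\, \nabla_w Q_w(s, a)
    \theta \leftarrow \theta + \alpha\, \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)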

Agent learning pong using policy gradients


In this section, we will create a policy network that takes raw pixels from our Pong environment, Pong-v0 from OpenAI Gym, as the input. The policy network is a neural network with a single hidden layer, fully connected to the raw pixels of Pong at the input layer, and with an output layer containing a single node that returns the probability of moving the paddle up. I would like to thank Andrej Karpathy for coming up with a solution to make the agent learn using policy gradients. We will try to implement a similar kind of approach.

We use a pixel image of size 80*80 in grayscale (we will not use RGB, which would be 80*80*3). Thus, we have an 80*80 binary grid that tells us the positions of the paddles and the ball, which we feed as input to the neural network. The neural network consists of the following:

  • Input layer (X): 80*80, flattened to 6400*1, that is, 6400 nodes
  • Hidden layer: 200 nodes
  • Output layer: 1 node

Therefore, the total parameters...
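As a rough sketch of this architecture (a hypothetical, simplified version in plain NumPy rather than the book's exact code; ignoring bias terms, this layout has 6400*200 + 200*1 = 1,280,200 weights):

    import numpy as np

    D, H = 80 * 80, 200                             # flattened input size and hidden layer size
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((H, D)) / np.sqrt(D)   # input -> hidden weights
    W2 = rng.standard_normal(H) / np.sqrt(H)        # hidden -> output weights

    def policy_forward(x):
        """x: 6400-dimensional flattened binary frame; returns P(paddle up) and hidden activations."""
        h = np.maximum(0.0, W1 @ x)                 # ReLU hidden layer (200 nodes)
        logit = float(W2 @ h)
        p_up = 1.0 / (1.0 + np.exp(-logit))         # sigmoid output: probability of moving up
        return p_up, h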

Summary


In this chapter, we covered some of the most famous algorithms in reinforcement learning: policy gradients and actor-critic algorithms. There is a lot of ongoing research on policy gradients aimed at benchmarking better results in reinforcement learning. Further studies of policy gradients include Trust Region Policy Optimization (TRPO), Natural Policy Gradients, and Deep Deterministic Policy Gradient (DDPG), which are beyond the scope of this book.

In the next chapter, we will take a look at the building blocks of Q-Learning, applying deep neural networks, and many more techniques. 
