Reader small image

You're reading from  TensorFlow Reinforcement Learning Quick Start Guide

Product typeBook
Published inMar 2019
PublisherPackt
ISBN-139781789533583
Edition1st Edition
Right arrow
Author (1)
Kaushik Balakrishnan
Kaushik Balakrishnan
author image
Kaushik Balakrishnan

Kaushik Balakrishnan works for BMW in Silicon Valley, and applies reinforcement learning, machine learning, and computer vision to solve problems in autonomous driving. Previously, he also worked at Ford Motor Company and NASA Jet Propulsion Laboratory. His primary expertise is in machine learning, computer vision, and high-performance computing, and he has worked on several projects involving both research and industrial applications. He has also worked on numerical simulations of rocket landings on planetary surfaces, and for this he developed several high-fidelity models that run efficiently on supercomputers. He holds a PhD in aerospace engineering from the Georgia Institute of Technology in Atlanta, Georgia.
Read more about Kaushik Balakrishnan

Right arrow

Identifying reward functions and the concept of discounted rewards

Rewards in RL are no different from real world rewards – we all receive good rewards for doing well, and bad rewards (aka penalties) for inferior performance. Reward functions are provided by the environment to guide an agent to learn as it explores the environment. Specifically, it is a measure of how well the agent is performing.

The reward function defines what the good and bad things are that can happen to the agent. For instance, a mobile robot that reaches its goal is rewarded, but is penalized for crashing into obstacles. Likewise, an industrial robot arm is rewarded for putting a peg into a hole, but is penalized for being in undesired poses that can be catastrophic by causing ruptures or crashes. Reward functions are the signal to the agent regarding what is optimum and what isn't. The agent's long-term goal is to maximize rewards and minimize penalties.

Rewards

In RL literature, rewards at a time instant t are typically denoted as Rt. Thus, the total rewards earned in an episode is given by R = r1+ r2 + ... + rt, where t is the length of the episode (which can be finite or infinite).

The concept of discounting is used in RL, where a parameter called the discount factor is used, typically represented by γ and 0 ≤ γ ≤ 1 and a power of it multiplies rt. γ = 0, making the agent myopic, and aiming only for the immediate rewards. γ = 1 makes the agent far-sighted to the point that it procrastinates the accomplishment of the final goal. Thus, a value of γ in the 0-1 range (0 and 1 exclusive) is used to ensure that the agent is neither too myopic nor too far-sighted. γ ensures that the agent prioritizes its actions to maximize the total discounted rewards, Rt, from time instant t, which is given by the following:

Since 0 ≤ γ ≤ 1, the rewards into the distant future are valued much less than the rewards that the agent can earn in the immediate future. This helps the agent to not waste time and to prioritize its actions. In practice, γ = 0.9-0.99 is typically used in most RL problems.

Previous PageNext Page
You have been reading a chapter from
TensorFlow Reinforcement Learning Quick Start Guide
Published in: Mar 2019Publisher: PacktISBN-13: 9781789533583
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Kaushik Balakrishnan

Kaushik Balakrishnan works for BMW in Silicon Valley, and applies reinforcement learning, machine learning, and computer vision to solve problems in autonomous driving. Previously, he also worked at Ford Motor Company and NASA Jet Propulsion Laboratory. His primary expertise is in machine learning, computer vision, and high-performance computing, and he has worked on several projects involving both research and industrial applications. He has also worked on numerical simulations of rocket landings on planetary surfaces, and for this he developed several high-fidelity models that run efficiently on supercomputers. He holds a PhD in aerospace engineering from the Georgia Institute of Technology in Atlanta, Georgia.
Read more about Kaushik Balakrishnan