Deep Q-Network

Deep Q-Networks (DQNs) revolutionized the field of reinforcement learning (RL). I am sure you have heard of Google DeepMind, which used to be a British company called DeepMind Technologies until Google acquired it in 2014. In 2013, DeepMind published a paper titled Playing Atari with Deep Reinforcement Learning, in which they used Deep Neural Networks (DNNs) in the context of RL, giving rise to what are referred to as DQNs, an idea that proved seminal to the field. This paper revolutionized the field of deep RL, and the rest is history! Later, in 2015, they published a second paper, titled Human-Level Control Through Deep Reinforcement Learning, in Nature, which introduced further ideas that improved on the former paper. Together, the two papers led to a Cambrian explosion in the field of deep RL, with several new algorithms that have improved the training of agents using neural networks, and...

Technical requirements

Knowledge of the following will help you to better understand the concepts presented in this chapter:

  • Python (version 2 or above)
  • NumPy
  • TensorFlow (version 1.4 or higher)

Learning the theory behind a DQN

In this section, we will look at the theory behind a DQN, including the math involved, and learn how neural networks are used to approximate the value function.

Previously, we looked at Q-learning, where Q(s,a) was stored and evaluated as a multi-dimensional array, with one entry for each state-action pair. This worked well for the grid-world and cliff-walking problems, both of which are low-dimensional in both the state and action spaces. So, can we apply this approach to higher-dimensional problems? Well, no, due to the curse of dimensionality, which makes it infeasible to store a very large number of states and actions. Moreover, in continuous control problems, the actions are real numbers in a bounded range; since infinitely many such values are possible, they cannot be represented in a tabular Q array. This gave rise to function approximation in RL...
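
To make the idea of function approximation concrete, here is a minimal sketch in TensorFlow 1.x that replaces the tabular Q array with a small neural network mapping a state to one Q-value per action. The layer sizes, scope name, and function name are illustrative assumptions for this example, not necessarily the architecture used later in this book:

import tensorflow as tf

def build_q_network(state_dim, num_actions, scope):
    # One network per scope: a batch of states goes in, and the network
    # outputs one Q-value per possible action for each state.
    with tf.variable_scope(scope):
        states = tf.placeholder(tf.float32, [None, state_dim], name='states')
        h1 = tf.layers.dense(states, 64, activation=tf.nn.relu)
        h2 = tf.layers.dense(h1, 64, activation=tf.nn.relu)
        q_values = tf.layers.dense(h2, num_actions, activation=None)
    return states, q_values

# The greedy action for a single state s is then simply the arg-max of the
# network's outputs, for example:
# action = np.argmax(sess.run(q_values, feed_dict={states: s[None, :]}))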

Understanding target networks

An interesting feature of a DQN is the use of a second network during the training procedure, referred to as the target network. This second network is used to generate the target Q-values that are used to compute the loss function during training. Why not just use one network for both estimations, that is, for choosing the action a to take as well as for updating the Q-network? The issue is that, at every step of training, the Q-network's values change, and if we use a constantly changing set of values to update our network, then the estimations can easily become unstable; the network can fall into feedback loops between the target and estimated Q-values. In order to mitigate this instability, the target network's weights are kept fixed for a number of steps and only periodically (or slowly) updated to the primary Q-network's values. This...
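
As a rough sketch of how this can be done in TensorFlow 1.x, the snippet below builds an operation that copies the primary Q-network's weights into the target network (or, with tau < 1, moves them slowly toward the primary weights). The scope names q_primary and q_target and the tau parameter are assumptions made for this illustration, not names from this book's code:

import tensorflow as tf

def make_target_update_op(primary_scope='q_primary', target_scope='q_target',
                          tau=1.0):
    # Collect the trainable variables of each network by variable scope.
    primary_vars = sorted(
        tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=primary_scope),
        key=lambda v: v.name)
    target_vars = sorted(
        tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=target_scope),
        key=lambda v: v.name)
    # tau = 1.0 performs a hard copy (typically run every N training steps);
    # tau < 1.0 makes the target track the primary network slowly.
    ops = [tf.assign(t, tau * p + (1.0 - tau) * t)
           for p, t in zip(primary_vars, target_vars)]
    return tf.group(*ops)

Running the returned operation only every few thousand training steps (or with a small tau at every step) keeps the target Q-values stable between gradient updates.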

Learning about replay buffer

We need the tuple (s, a, r, s', done) to update the DQN, where s and a are, respectively, the state and action at time t; r is the reward received; s' is the new state at time t+1; and done is a Boolean value that is True if the episode has ended and False otherwise, also referred to as the terminal value in the literature. This Boolean done or terminal variable is used so that, in the Bellman update, the last (terminal) state of an episode is handled properly, since we cannot compute r + γ max Q(s',a') for a terminal state. One problem in DQNs is that if we train on consecutive samples of the (s, a, r, s', done) tuple, they are highly correlated, and so the training can overfit.
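
To make the role of the done flag concrete, here is a small NumPy sketch (the function and variable names are illustrative) of how the bootstrap term γ max Q(s',a') is dropped for terminal transitions when forming the Bellman targets for a mini-batch:

import numpy as np

def bellman_targets(rewards, q_next, dones, gamma=0.99):
    # rewards: shape [batch]; q_next: target-network Q-values of s',
    # shape [batch, num_actions]; dones: 0/1 terminal flags, shape [batch].
    # For terminal transitions (done = 1), the target is just the reward r.
    return rewards + gamma * (1.0 - dones) * np.max(q_next, axis=1)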

To mitigate this issue, a replay buffer is used, where the tuple (s, a, r, s', done) is stored from experience, and a mini-batch of such experiences...
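
A minimal replay-buffer sketch is shown below, assuming a simple deque-based implementation with illustrative names and capacity. It stores (s, a, r, s', done) tuples as they are experienced and returns a randomly sampled mini-batch, so that training does not repeatedly see correlated, consecutive transitions:

import random
from collections import deque

import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=100000):
        # Oldest experiences are discarded automatically once full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, float(done)))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation between
        # the transitions used in a single gradient update.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)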

Getting introduced to the Atari environment

The Atari 2600 game suite was originally released in the 1970s, and it was a big hit at the time. It involves several games that users play by entering actions, for example via the keyboard. These games inspired many computer game players of the 1970s and 1980s, but they are considered too primitive by today's video game standards. However, they are popular today in the RL community as a suite of games on which RL agents can be trained.
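
As a quick example of how such a game can be created and stepped through with OpenAI Gym (this assumes gym with the Atari extras installed, for example via pip install gym[atari]; BreakoutNoFrameskip-v4 is just one example environment ID, not necessarily the one used later in this book):

import gym

env = gym.make('BreakoutNoFrameskip-v4')
print(env.observation_space.shape)   # raw RGB frames, (210, 160, 3) here
print(env.action_space.n)            # number of discrete joystick actions

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()          # random policy, just a demo
    obs, reward, done, info = env.step(action)  # classic (obs, r, done, info) API
    total_reward += reward
print('Episode reward:', total_reward)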

Summary of Atari games

Here is a summary of a select few games from Atari (we won't present screenshots of the games for copyright reasons, but will provide links to them).

...

Summary

In this chapter, we looked at our very first deep RL algorithm, DQN, which is probably the most popular RL algorithm in use today. We learned the theory behind a DQN, and also looked at the concept and use of target networks to stabilize training. We were also introduced to the Atari environment, which is the most popular environment suite in RL. In fact, many of the RL papers published today apply their algorithms to games from the Atari suite and report the episodic rewards obtained, comparing them with the corresponding values reported by other researchers using other algorithms. So, the Atari environment is a natural suite of games on which to train RL agents and compare algorithms to ascertain their robustness. We also looked at the use of a replay buffer, and learned why it is used in off-policy algorithms.

This chapter has laid the foundation for us to delve deeper into deep...

Questions

  1. Why is a replay buffer used in a DQN?
  2. Why do we use target networks?
  3. Why do we stack four frames into one state? Will one frame alone suffice to represent one state?
  4. Why is the Huber loss sometimes preferred over L2 loss?
  5. We converted the RGB input image into grayscale. Can we instead use the RGB image as input to the network? What are the pros and cons of using RGB images instead of grayscale?

Further reading

  • Playing Atari with Deep Reinforcement Learning, by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller, arXiv:1312.5602: https://arxiv.org/abs/1312.5602
  • Human-Level Control Through Deep Reinforcement Learning, by Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, Nature, 2015: https://www.nature.com/articles/nature14236