DQN Extensions

Since DeepMind published its paper on the deep Q-network (DQN) model (https://deepmind.com/research/publications/playing-atari-deep-reinforcement-learning) in 2015, many improvements have been proposed, along with tweaks to the basic architecture, that significantly improve the convergence, stability, and sample efficiency of DeepMind's basic DQN. In this chapter, we will take a deeper look at some of those ideas.

Very conveniently, in October 2017, DeepMind published a paper called Rainbow: Combining Improvements in Deep Reinforcement Learning ([1] Hessel and others, 2017), which presented the seven most important improvements to DQN; some were invented in 2015, while others were much more recent. In this paper, state-of-the-art results on the Atari games suite were reached just by combining those seven methods. This chapter will go through all of those methods. We will analyze the ideas behind them, alongside how they can be implemented and compared to the...

Basic DQN

To get started, we will implement the same DQN method as in Chapter 6, Deep Q-Networks, but leveraging the high-level libraries described in Chapter 7, Higher-Level RL Libraries. This will make our code much more compact, which is good, as non-relevant details won't distract us from the method's logic.

At the same time, the purpose of this book is not to teach you how to use the existing libraries, but rather how to develop intuition about RL methods and, if necessary, implement everything from scratch. From my perspective, this is a much more valuable skill, as libraries come and go, but true understanding of the domain will allow you to quickly make sense of other people's code and apply it consciously.

In the basic DQN implementation, we have three modules:

  • Chapter08/lib/dqn_model.py: the DQN neural network (NN), which is the same as in Chapter 6, so I won't repeat it
  • Chapter08/lib/common.py: common functions and declarations shared by...

N-step DQN

The first improvement that we will implement and evaluate is quite an old one. It was first introduced in the paper Learning to Predict by the Methods of Temporal Differences, by Richard Sutton ([2] Sutton, 1988). To get the idea, let's look at the Bellman update used in Q-learning once again:

Q(s_t, a_t) = r_t + γ max_a Q(s_{t+1}, a)

This equation is recursive, which means that we can expand Q(s_{t+1}, a_{t+1}) using the same formula, which gives us this result:

Q(s_t, a_t) = r_t + γ max_a [r_{a,t+1} + γ max_{a'} Q(s_{t+2}, a')]

The value r_{a,t+1} means the local reward at time t+1, after issuing action a. However, if we assume that action a at the step t+1 was chosen optimally, or close to optimally, we can omit the max_a operation and obtain this:

Q(s_t, a_t) = r_t + γ r_{t+1} + γ² max_{a'} Q(s_{t+2}, a')

This value can be unrolled again and again any number of times. As you may guess, this unrolling can be easily applied to our DQN update by replacing one-step transition sampling with longer transition sequences of n steps. To understand why this unrolling will help us to speed up training, let's consider the example illustrated...
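
To make this concrete, the following is a minimal sketch (not the code from the book's library) of how several consecutive one-step transitions can be collapsed into a single n-step transition; the Transition namedtuple and the function name are illustrative only:

```python
import collections

# Illustrative container; the book's library uses its own experience classes.
Transition = collections.namedtuple(
    "Transition", ["state", "action", "reward", "done", "next_state"])


def n_step_transition(steps, gamma=0.99):
    """Collapse consecutive one-step transitions into one n-step transition.

    The accumulated reward is r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1},
    and next_state is the state observed after the last step (or the terminal
    state if the episode ended inside the window).
    """
    total_reward = 0.0
    for idx, step in enumerate(steps):
        total_reward += (gamma ** idx) * step.reward
        if step.done:
            # Episode ended inside the window: stop accumulating.
            return Transition(steps[0].state, steps[0].action,
                              total_reward, True, step.next_state)
    last = steps[-1]
    return Transition(steps[0].state, steps[0].action,
                      total_reward, False, last.next_state)
```

During training, the Bellman target for such a transition bootstraps with gamma**n instead of gamma, which is the only other change needed in the loss calculation.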

Double DQN

The next fruitful idea on how to improve a basic DQN came from DeepMind researchers in the paper titled Deep Reinforcement Learning with Double Q-Learning ([3] van Hasselt, Guez, and Silver, 2015). In the paper, the authors demonstrated that the basic DQN tends to overestimate values for Q, which may be harmful to training performance and sometimes can lead to suboptimal policies. The root cause of this is the max operation in the Bellman equation, but the strict proof is too complicated to write down here. As a solution to this problem, the authors proposed modifying the Bellman update a bit.

In the basic DQN, our target value for Q looked like this:

Q(s_t, a_t) = r_t + γ max_a Q'(s_{t+1}, a)

Q'(s_{t+1}, a) is the Q-value calculated using our target network, which we sync with the trained network every n steps. The authors of the paper proposed choosing actions for the next state using the trained network, but taking the values of Q from the target network. So, the new expression for the target Q-values will look...
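
As a quick illustration, here is a minimal sketch of the Double DQN target calculation in PyTorch; the function name is mine, and net and tgt_net are assumed to be the online and target Q-networks:

```python
import torch


@torch.no_grad()
def double_dqn_targets(next_states, rewards, dones, net, tgt_net, gamma=0.99):
    """Compute target Q-values the Double DQN way.

    Actions for the next states are selected by the online (trained) network,
    but their values are taken from the target network.
    """
    # Action selection by the online network
    next_actions = net(next_states).argmax(dim=1)
    # Action evaluation by the target network
    next_q = tgt_net(next_states).gather(
        1, next_actions.unsqueeze(-1)).squeeze(-1)
    # dones is assumed to be a boolean tensor; no bootstrapping after terminals
    next_q[dones] = 0.0
    return rewards + gamma * next_q
```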

Noisy networks

The next improvement that we are going to look at addresses another RL problem: exploration of the environment. The paper that we will draw from is called Noisy Networks for Exploration ([4] Fortunato and others, 2017) and it has a very simple idea for learning exploration characteristics during training instead of having a separate schedule related to exploration.

Classical DQN achieves exploration by choosing random actions with a specially defined hyperparameter, epsilon, which is slowly decreased over time from 1.0 (fully random actions) to some small value such as 0.1 or 0.02. This process works well for simple environments with short episodes and without much non-stationarity during the game; but even in such simple cases, it requires tuning to make the training process efficient.
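
For reference, a typical linear epsilon schedule looks like the following sketch (the class and parameter names are illustrative, not the book's library API):

```python
class EpsilonTracker:
    """Linearly decay epsilon from eps_start to eps_final over eps_frames frames."""

    def __init__(self, eps_start=1.0, eps_final=0.02, eps_frames=100_000):
        self.eps_start = eps_start
        self.eps_final = eps_final
        self.eps_frames = eps_frames

    def epsilon(self, frame_idx):
        # Fully random at the beginning, mostly greedy after eps_frames steps
        return max(self.eps_final,
                   self.eps_start - frame_idx / self.eps_frames)
```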

In the Noisy Networks paper, the authors proposed a quite simple solution that, nevertheless, works well. They add noise to the weights of fully connected layers of the network and adjust...
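
The following is a sketch of the simpler of the two variants from the paper, a linear layer with independent Gaussian noise on every weight; it is close in spirit to the chapter's implementation, but treat the exact names and the sigma_init constant as illustrative:

```python
import torch
import torch.nn as nn


class NoisyLinear(nn.Linear):
    """Fully connected layer with learnable, per-weight Gaussian noise.

    Each weight w is replaced by w + sigma * epsilon, where sigma is trained
    by SGD and epsilon is resampled on every forward pass, so the amount of
    exploration is learned instead of being scheduled externally.
    """

    def __init__(self, in_features, out_features, sigma_init=0.017, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.sigma_weight = nn.Parameter(
            torch.full((out_features, in_features), sigma_init))
        self.register_buffer("epsilon_weight",
                             torch.zeros(out_features, in_features))
        if bias:
            self.sigma_bias = nn.Parameter(
                torch.full((out_features,), sigma_init))
            self.register_buffer("epsilon_bias", torch.zeros(out_features))

    def forward(self, x):
        # Sample fresh noise for every forward pass
        self.epsilon_weight.normal_()
        bias = self.bias
        if bias is not None:
            self.epsilon_bias.normal_()
            bias = bias + self.sigma_bias * self.epsilon_bias
        weight = self.weight + self.sigma_weight * self.epsilon_weight
        return nn.functional.linear(x, weight, bias)
```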

Prioritized replay buffer

The next very useful idea on how to improve DQN training was proposed in 2015 in the paper Prioritized Experience Replay ([7] Schaul and others, 2015). This method tries to improve the efficiency of samples in the replay buffer by prioritizing those samples according to the training loss.

The basic DQN used the replay buffer to break the correlation between immediate transitions in our episodes. As we discussed in Chapter 6, Deep Q-Networks, the examples we experience during the episode will be highly correlated, as most of the time, the environment is "smooth" and doesn't change much according to our actions. However, the stochastic gradient descent (SGD) method assumes that the data we use for training has an i.i.d. property. To solve this problem, the classic DQN method uses a large buffer of transitions, randomly sampled to get the next training batch.

The authors of the paper questioned this uniform random sample policy and proved...
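
A minimal sketch of proportional prioritization is shown below; it uses a plain array instead of the more efficient segment (sum) tree, and all names are illustrative rather than the book's library code:

```python
import numpy as np


class PrioritizedReplayBuffer:
    """Replay buffer that samples transitions proportionally to their priority.

    New transitions get the current maximum priority so that they are seen at
    least once; after a training step, priorities are set to |loss| + eps.
    """

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities skew sampling
        self.buffer = []
        self.priorities = np.zeros(capacity, dtype=np.float32)
        self.pos = 0

    def append(self, transition):
        max_prio = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        prios = self.priorities[:len(self.buffer)].astype(np.float64)
        probs = prios ** self.alpha
        probs /= probs.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        samples = [self.buffer[i] for i in indices]
        # Importance-sampling weights compensate for the non-uniform sampling
        weights = (len(self.buffer) * probs[indices]) ** (-beta)
        weights /= weights.max()
        return samples, indices, weights.astype(np.float32)

    def update_priorities(self, indices, losses):
        for idx, loss in zip(indices, losses):
            self.priorities[idx] = abs(loss) + 1e-5
```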

Dueling DQN

This improvement to DQN was proposed in 2015, in the paper called Dueling Network Architectures for Deep Reinforcement Learning ([8] Wang et al., 2015). The core observation of this paper is that the Q-values, Q(s, a), that our network is trying to approximate can be divided into two quantities: the value of the state, V(s), and the advantage of actions in this state, A(s, a).

You have seen the quantity V(s) before, as it was the core of the value iteration method from Chapter 5, Tabular Learning and the Bellman Equation. It is just equal to the discounted expected reward achievable from this state. The advantage A(s, a) is supposed to bridge the gap from V(s) to Q(s, a), as, by definition, Q(s, a) = V(s) + A(s, a). In other words, the advantage A(s, a) is just the delta, saying how much extra reward some particular action from the state brings us. The advantage could be positive or negative and, in general, can have any magnitude. For example, at some tipping point, the...
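
A sketch of how the two paths can be combined at the output is shown below (the convolutional feature extractor is assumed to exist elsewhere; names and sizes are illustrative). Subtracting the mean advantage keeps V(s) and A(s, a) identifiable, since only their sum is constrained by the Q-learning loss:

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Combine a value path and an advantage path into Q-values:
    Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, feature_size, n_actions, hidden=256):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feature_size, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feature_size, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, features):
        val = self.value(features)          # shape (batch, 1)
        adv = self.advantage(features)      # shape (batch, n_actions)
        return val + (adv - adv.mean(dim=1, keepdim=True))
```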

Categorical DQN

The last, and the most complicated, method in our DQN improvements toolbox is from a very recent paper, published by DeepMind in June 2017, called A Distributional Perspective on Reinforcement Learning ([9] Bellemare, Dabney, and Munos, 2017).

In the paper, the authors questioned the fundamental piece of Q-learning, the Q-value, and tried to replace it with a more generic Q-value probability distribution. Let's try to understand the idea. Both the Q-learning and value iteration methods work with the values of actions or states represented as simple numbers showing how much total reward we can achieve from a state, or from an action taken in a state. However, is it practical to squeeze all possible future rewards into one number? In complicated environments, the future could be stochastic, giving us different values with different probabilities.

For example, imagine the commuter scenario when you regularly drive from home to work. Most of the time...
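
As a preview of what "replacing Q-values with distributions" means in code, here is a sketch of a C51-style output head; the atom count and value range follow the common defaults from the paper, and all names are illustrative:

```python
import torch
import torch.nn as nn

N_ATOMS = 51              # number of support atoms (the "C51" in the agent's name)
V_MIN, V_MAX = -10, 10    # range of returns covered by the support


class DistributionalHead(nn.Module):
    """Output a probability distribution over fixed return values per action.

    The expected value of each distribution recovers an ordinary Q-value,
    which is still what the agent uses to pick the greedy action.
    """

    def __init__(self, feature_size, n_actions):
        super().__init__()
        self.n_actions = n_actions
        self.fc = nn.Linear(feature_size, n_actions * N_ATOMS)
        self.register_buffer("support",
                             torch.linspace(V_MIN, V_MAX, N_ATOMS))

    def forward(self, features):
        logits = self.fc(features).view(-1, self.n_actions, N_ATOMS)
        probs = torch.softmax(logits, dim=2)          # one distribution per action
        q_values = (probs * self.support).sum(dim=2)  # expected returns
        return probs, q_values
```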

Combining everything

You have now seen all the DQN improvements mentioned in the paper Rainbow: Combining Improvements in Deep Reinforcement Learning, but we covered them incrementally, which helped you to understand the idea and implementation of each improvement. The main point of the paper was to combine those improvements and check the results. In the final example, I've decided to exclude categorical DQN and double DQN from the final system, as they haven't shown much improvement on our guinea pig environment. If you want, you can add them and try using a different game. The complete example is available in Chapter08/08_dqn_rainbow.py.

First of all, we need to define our network architecture and the methods that have contributed to it:

  • Dueling DQN: our network will have two separate paths, one for the distribution of the state value and one for the distribution of advantages. On the output, both paths will be summed together, providing the final value probability distributions...

Summary

In this chapter, we have walked through and implemented a lot of DQN improvements that have been discovered by researchers since the first DQN paper was published in 2015. This list is far from complete. First of all, for the list of methods, I used the paper Rainbow: Combining Improvements in Deep Reinforcement Learning, which was published by DeepMind, so the list of methods is definitely biased toward DeepMind papers. Secondly, RL is so active nowadays that new papers come out almost every day, which makes it very hard to keep up, even if we limit ourselves to one kind of RL model, such as DQN. The goal of this chapter was to give you a practical view of the different ideas that the field has developed.

In the next chapter, we will continue discussing practical DQN applications from an engineering perspective by talking about ways to improve DQN performance without touching the underlying method.

References

  1. Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, David Silver, 2017, Rainbow: Combining Improvements in Deep Reinforcement Learning. arXiv:1710.02298
  2. Richard S. Sutton, 1988, Learning to Predict by the Methods of Temporal Differences. Machine Learning 3(1):9-44
  3. Hado Van Hasselt, Arthur Guez, David Silver, 2015, Deep Reinforcement Learning with Double Q-Learning. arXiv:1509.06461v3
  4. Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg, 2017, Noisy Networks for Exploration. arXiv:1706.10295v1
  5. Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, Remi Munos, 2016, Unifying Count-Based Exploration and Intrinsic Motivation. arXiv:1606.01868v2
  6. Jarryd Martin, Suraj Narayanan Sasikumar, Tom Everitt, Marcus Hutter, 2017, Count-Based...