Atari Games with Deep Q Network

Deep Q Network (DQN) is one of the most popular and widely used deep reinforcement learning (DRL) algorithms. In fact, it created a lot of buzz in the reinforcement learning (RL) community after its release. The algorithm was proposed by researchers at Google's DeepMind and achieved human-level performance on many Atari games by taking only the game screen as input.

In this chapter, we will explore how DQN works and also learn how to build a DQN that plays any Atari game by taking only the game screen as input. We will look at some of the improvements made to the DQN architecture, such as double DQN and the dueling network architecture.

In this chapter, you will learn about:

  • Deep Q Networks (DQNs)
  • Architecture of DQN
  • Building an agent to play Atari games
  • Double DQN
  • Prioritized experience replay
...

What is a Deep Q Network?

Before going ahead, let us first recap the Q function. What is a Q function? A Q function, also called the state-action value function, specifies how good an action a is in the state s. We store the values of all possible actions in each state in a table called a Q table, and we pick the action that has the maximum value in a state as the optimal action. Remember how we learned this Q function? We used Q learning, which is an off-policy temporal difference learning algorithm for estimating the Q function. We looked at this in Chapter 5, Temporal Difference Learning.
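
As a quick refresher, here is a minimal tabular Q learning update (a sketch of my own, not code from the book; the state and action counts, learning rate, and discount factor are illustrative assumptions):

import numpy as np

n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.99                  # learning rate and discount factor
Q = np.zeros((n_states, n_actions))       # the Q table

def q_learning_update(s, a, r, s_next):
    # Off-policy TD update: the target uses the max over next-state actions
    td_target = r + gamma * np.max(Q[s_next])
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error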

So far, we have seen environments with a finite number of states and a limited set of actions, where we performed an exhaustive search through all possible state-action pairs to find the optimal Q values. Think of an environment where we have a very large number of states and, in each state, we have...

Architecture of DQN

Now that we have a basic understanding of DQN, we will go into detail about how DQN works and the architecture of DQN for playing Atari games. We will look at each component and then we will view the algorithm as a whole.

Convolutional network

The first layer of DQN is the convolutional network, and the input to the network will be a raw frame of the game screen. So, we take a raw frame and pass it to the convolutional layers to understand the game state. But the raw frames are 210 x 160 pixels with a 128-color palette, and it would clearly take a lot of computation and memory if we fed the raw pixels directly. So, we downsample the frames to 84 x 84 and convert the RGB values to grayscale values...
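
To make the preprocessing concrete, here is a minimal sketch (my own illustration, not the book's exact code) that converts a raw 210 x 160 RGB frame to an 84 x 84 grayscale image using channel averaging and a simple nearest-neighbour resize:

import numpy as np

def preprocess_observation(obs):
    # obs is the raw 210 x 160 x 3 RGB frame returned by the environment
    img = obs.astype(np.float32).mean(axis=2)            # grayscale by averaging the channels
    rows = np.linspace(0, img.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, img.shape[1] - 1, 84).astype(int)
    img = img[rows][:, cols]                             # nearest-neighbour resize to 84 x 84
    return img / 255.0                                   # scale pixel values to [0, 1]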

Building an agent to play Atari games

Now we will see how to build an agent to play any Atari game. You can get the complete code as a Jupyter notebook with the explanation here (https://github.com/sudharsan13296/Hands-On-Reinforcement-Learning-With-Python/blob/master/08.%20Atari%20Games%20with%20DQN/8.8%20Building%20an%20Agent%20to%20Play%20Atari%20Games.ipynb).

First, we import all the necessary libraries:

import numpy as np
import gym
import tensorflow as tf
from tensorflow.contrib.layers import flatten, conv2d, fully_connected
from collections import deque, Counter
import random
from datetime import datetime

We can use any of the Atari gaming environments given here: http://gym.openai.com/envs/#atari.

In this example, we use the Pac-Man game environment:

env = gym.make("MsPacman-v0")
n_outputs = env.action_space.n
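
As a quick sanity check (not part of the book's code), we can run a few random steps to confirm the environment and the action space behave as expected with the classic gym API used here:

obs = env.reset()
for _ in range(10):
    action = env.action_space.sample()        # pick a random action
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
print(obs.shape, n_outputs)                   # (210, 160, 3) and 9 for MsPacman-v0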

The Pac-Man environment is shown here:

Now we define a...

Double DQN

Deep Q learning is pretty cool, right? It has generalized its learning to play any Atari game. But the problem with DQN is that it tends to overestimate Q values. This is because of the max operator in the Q learning equation. The max operator uses the same value for both selecting and evaluating an action. What do I mean by that? Let's suppose we are in a state s and we have five actions, a1 to a5. Let's say a3 is the best action. When we estimate the Q values for all these actions in the state s, the estimated Q values will have some noise and differ from the actual values. Due to this noise, action a2 might get a higher value than the optimal action a3. Now, if we select the best action as the one that has the maximum value, we will end up selecting the suboptimal action a2 instead of the optimal action a3.
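
A tiny numerical illustration of this effect (my own sketch, not from the book): even when the true Q values of all five actions are identical, taking the max over noisy estimates is biased upwards:

import numpy as np

np.random.seed(0)
true_q = np.zeros(5)                                    # actions a1 to a5 are equally good
noisy_q = true_q + np.random.normal(size=(100000, 5))   # noisy estimates of the Q values
print(true_q.max())                                     # 0.0, the true maximum
print(noisy_q.max(axis=1).mean())                       # roughly 1.16, a clear overestimate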

We can solve this problem by having two separate Q functions, each...

Prioritized experience replay

In the DQN architecture, we use experience replay to remove correlations between the training samples. However, uniformly sampling transitions from the replay memory is not an optimal method. Instead, we can prioritize transitions and sample according to priority. Prioritizing transitions helps the network learn swiftly and effectively. How do we prioritize the transitions? We prioritize the transitions that have a high TD error. We know that the TD error specifies the difference between the estimated Q value and the actual Q value. So, transitions with a high TD error are the transitions we have to focus on and learn from, because those are the transitions that deviate most from our estimate. Intuitively, let us say you try to solve a set of problems, but you fail to solve two of them. You then give priority to those two problems alone to focus...
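
Here is a minimal sketch of proportional prioritization (an illustration of the idea described above, not the book's implementation), where the probability of sampling a transition grows with its absolute TD error:

import numpy as np

alpha = 0.6                                    # how strongly the TD error shapes the priority
epsilon = 1e-5                                 # keeps every transition sampleable

td_errors = np.array([0.1, 2.0, 0.5, 0.05])    # one TD error per stored transition
priorities = (np.abs(td_errors) + epsilon) ** alpha
probs = priorities / priorities.sum()          # sampling probabilities

batch = np.random.choice(len(td_errors), size=2, p=probs, replace=False)
print(batch)                                   # transitions with larger TD error appear more often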

Dueling network architecture

We know that the Q function specifies how good it is for an agent to perform an action a in the state s, and the value function specifies how good it is for an agent to be in a state s. Now we introduce a new function called the advantage function, which can be defined as the difference between the Q function and the value function. The advantage function specifies how good it is for an agent to perform an action a compared to the other actions.

Thus, the value function specifies the goodness of a state and the advantage function specifies the goodness of an action. What would happen if we were to combine the value function and the advantage function? It would tell us how good it is for an agent to perform an action a in a state s, which is exactly our Q function. So we can define our Q function as the sum of a value function and an advantage function, as...
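
The following small numeric sketch (my own illustration, not the book's code) shows how the two streams are typically combined; in practice the advantage is centred by subtracting its mean over actions so that the decomposition is identifiable:

import numpy as np

value = np.array([[1.5]])                        # value stream output, shape (batch, 1)
advantage = np.array([[0.2, -0.1, 0.4, -0.5]])   # advantage stream output, shape (batch, n_actions)

# Q(s, a) = V(s) + (A(s, a) - mean over actions of A(s, a))
q_values = value + (advantage - advantage.mean(axis=1, keepdims=True))
print(q_values)                                  # one Q value per action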

Summary

In this chapter, we have learned about one of the most popular deep reinforcement learning algorithms, called DQN. We saw how deep neural networks are used to approximate the Q function. We also learned how to build an agent to play Atari games. Later, we looked at several advancements to DQN, such as double DQN, which is used to avoid overestimating Q values. We then looked at prioritized experience replay, which prioritizes transitions by their TD error, and the dueling network architecture, which breaks down the Q function computation into two streams, called the value stream and the advantage stream.

In the next chapter, Chapter 9, Playing Doom with a Deep Recurrent Q Network, we will look at a really cool variant of DQN called DRQN, which makes use of an RNN for approximating the Q function.

Questions

The question list is as follows:

  1. What is DQN?
  2. What is the need for experience replay?
  3. Why do we keep a separate target network?
  4. Why is DQN overestimating?
  5. How does double DQN avoid overestimating the Q value?
  6. How are experiences prioritized in prioritized experience replay?
  7. What is the need for the dueling architecture?