In the last few chapters, we learned how deep Q learning works by approximating the Q function with a neural network. We then looked at several improvements to the Deep Q Network (DQN), such as double Q learning, dueling network architectures, and the Deep Recurrent Q Network (DRQN). We saw how a DQN uses a replay buffer to store the agent's experience and trains the network on minibatches sampled from that buffer. We also implemented DQNs for playing Atari games and a DRQN for playing Doom. In this chapter, we walk through a detailed implementation of a dueling DQN, which is essentially the same as a regular DQN, except that the final fully connected layer is split into two streams, namely a value stream and an advantage stream, and these two streams...
You're reading from Hands-On Reinforcement Learning with Python
Environment wrapper functions
The code used in this chapter is credited to Giacomo Spigler's GitHub repository (https://github.com/spiglerg/DQN_DDQN_Dueling_and_DDPG_Tensorflow). This chapter explains the code line by line. For the complete, structured code, see that repository.
First, we import all the necessary libraries:
import numpy as np
import tensorflow as tf
import gym
from gym.spaces import Box
from scipy.misc import imresize  # removed in SciPy 1.3; this import needs an older SciPy
import random
import cv2
import time
import logging
import os
import sys
We define the EnvWrapper class and define some of the environment wrapper functions:
class EnvWrapper:
We define the __init__ method and initialize variables:
def __init__(self, env_name, debug=False):
Initialize the gym environment:
self.env = gym.make(env_name)
Get the action_space:
self.action_space = self.env.action_space...
Dueling network
Now, we build our dueling DQN. The network has three convolutional layers followed by two fully connected layers, and the fully connected part is split into two separate streams: a value stream and an advantage stream. An aggregate layer then combines the value stream and the advantage stream to compute the Q value. The layers are as follows:
- Layer 1: 32 8x8 filters with stride 4 + ReLU
- Layer 2: 64 4x4 filters with stride 2 + ReLU
- Layer 3: 64 3x3 filters with stride 1 + ReLU
- Layer 4a: 512-unit fully connected layer + ReLU
- Layer 4b: 512-unit fully connected layer + ReLU
- Layer 5a: 1-unit fully connected layer + ReLU (state value)
- Layer 5b: fully connected layer with one unit per action + ReLU (advantage value)
- Layer 6: Aggregate V(s)+A(s,a)
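The aggregate step in the last list item can be illustrated with a small, framework-free sketch. The list gives the combination as V(s)+A(s,a); the dueling network paper additionally subtracts the mean advantage so that the two streams are identifiable. Whether this chapter's network does so is not shown in this excerpt, so the mean subtraction below is an assumption, and the function name `dueling_aggregate` is purely illustrative.

```python
def dueling_aggregate(state_value, advantages):
    """Combine a scalar V(s) with per-action advantages A(s, a) into Q values.

    Assumption: the mean advantage is subtracted, as in the dueling network
    paper; the plain sum V(s) + A(s, a) would simply omit that term.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


# One state with V(s) = 2.0 and three actions.
q_values = dueling_aggregate(2.0, [1.0, 0.0, -1.0])

# Shifting every advantage by the same constant leaves the Q values unchanged.
q_shifted = dueling_aggregate(2.0, [11.0, 10.0, 9.0])
```

With the mean subtracted, only the relative advantages matter, which is why `q_values` and `q_shifted` come out identical.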
class QNetworkDueling(QNetwork):
We define the __init__ method to initialize all layers:
def __init__(self, input_size, output_size...
Replay memory
Now, we build the experience replay buffer, which is used for storing all the agent's experience. We sample a minibatch of experience from the replay buffer for training the network:
class ReplayMemoryFast:
First, we define the __init__ method and initialize the buffer:
    def __init__(self, memory_size, minibatch_size):
        # max number of samples to store
        self.memory_size = memory_size
        # minibatch size
        self.minibatch_size = minibatch_size
        self.experience = [None]*self.memory_size
        self.current_index = 0
        self.size = 0
Next, we define the store function for storing the experiences:
def store(self, observation, action, reward, newobservation, is_terminal):
Store the experience as a tuple (current state, action, reward, next state, is it a terminal state):
self.experience[self.current_index] = (observation...
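To make the circular-buffer idea concrete, here is a minimal, self-contained sketch. It reuses the attribute names from the __init__ method above, but the class name `RingReplayBuffer`, the wrap-around logic in `store`, and the `sample` method are assumptions written for illustration, not the chapter's implementation.

```python
import random


class RingReplayBuffer:
    """Fixed-size circular replay buffer (illustrative sketch)."""

    def __init__(self, memory_size, minibatch_size):
        self.memory_size = memory_size
        self.minibatch_size = minibatch_size
        self.experience = [None] * memory_size
        self.current_index = 0
        self.size = 0

    def store(self, observation, action, reward, newobservation, is_terminal):
        # Overwrite the oldest slot once the buffer is full.
        self.experience[self.current_index] = (
            observation, action, reward, newobservation, is_terminal)
        self.current_index = (self.current_index + 1) % self.memory_size
        self.size = min(self.size + 1, self.memory_size)

    def sample(self):
        # Assumption: return an empty list until enough samples are stored.
        if self.size < self.minibatch_size:
            return []
        indices = random.sample(range(self.size), self.minibatch_size)
        return [self.experience[i] for i in indices]


# Store five transitions in a buffer that holds three; the two oldest are overwritten.
buffer = RingReplayBuffer(memory_size=3, minibatch_size=2)
for step in range(5):
    buffer.store(step, 0, 1.0, step + 1, False)
batch = buffer.sample()
```

Preallocating the list and tracking a write index keeps both storing and sampling cheap, at the cost of silently discarding the oldest experience.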
Training the network
Now, we will see how to train the network.
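Before looking at the class, it may help to recall the quantity a DQN is trained toward. The sketch below computes the standard one-step Q-learning target for a single transition. The function name `td_target` and its signature are illustrative and do not come from the chapter's code; only `discount_factor` mirrors a parameter name used in this chapter, and the assumption is that the next-state Q values come from the target network.

```python
def td_target(reward, next_q_values, is_terminal, discount_factor=0.99):
    """One-step Q-learning target for a single transition.

    next_q_values: Q values of the next state (assumed to come from the
    target network), one entry per action.
    """
    if is_terminal:
        # No future reward after a terminal state.
        return reward
    return reward + discount_factor * max(next_q_values)


# Non-terminal and terminal transitions with the same reward.
y_running = td_target(1.0, [0.5, 2.0], is_terminal=False, discount_factor=0.5)
y_terminal = td_target(1.0, [0.5, 2.0], is_terminal=True, discount_factor=0.5)
```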
First, we define the DQN class and initialize all variables in the __init__ method:
class DQN(object):
    def __init__(self, state_size,
                 action_size,
                 session,
                 summary_writer=None,
                 exploration_period=1000,
                 minibatch_size=32,
                 discount_factor=0.99,
                 experience_replay_buffer=10000,
                 target_qnet_update_frequency=10000,
                 initial_exploration_epsilon=1.0,
                 final_exploration_epsilon=0.05,
                 reward_clipping=-1,
                 ):
Initialize all variables:
        self.state_size = state_size
        self.action_size = action_size
        self.session = session
        ...
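The exploration parameters in the constructor suggest an epsilon that is annealed from its initial to its final value over `exploration_period` steps. The sketch below shows one common way to do this, a linear schedule followed by epsilon-greedy action selection. The linear shape, the function names, and the signatures are assumptions for illustration; the schedule the chapter's code uses is not shown in this excerpt.

```python
import random


def linear_epsilon(step, exploration_period=1000,
                   initial_epsilon=1.0, final_epsilon=0.05):
    """Linearly anneal epsilon, then hold it at final_epsilon (assumed schedule)."""
    if step >= exploration_period:
        return final_epsilon
    fraction = step / exploration_period
    return initial_epsilon + fraction * (final_epsilon - initial_epsilon)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Epsilon at the start, halfway through, and after the exploration period.
eps_start = linear_epsilon(0)
eps_half = linear_epsilon(500)
eps_end = linear_epsilon(5000)

# With epsilon = 0 the choice is always the action with the highest Q value.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```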
Car racing
So far, we have seen how to build a dueling DQN. Now, we will see how to make use of our dueling DQN when playing the car racing game.
First, let's import our necessary libraries:
import gym
import time
import logging
import os
import sys
import tensorflow as tf
Initialize all of the necessary variables:
ENV_NAME = 'Seaquest-v0'
TOTAL_FRAMES = 20000000
MAX_TRAINING_STEPS = 20*60*60//3  # integer division keeps the step count an int
TESTING_GAMES = 30
MAX_TESTING_STEPS = 5*60*60//3  # integer division keeps the step count an int
TRAIN_AFTER_FRAMES = 50000
epoch_size = 50000
MAX_NOOP_START = 30
LOG_DIR = 'logs'
outdir = 'results'
# tf.train.SummaryWriter was replaced by tf.summary.FileWriter in TensorFlow 1.x
logger = tf.summary.FileWriter(LOG_DIR)
# Initialize tensorflow session
session = tf.InteractiveSession()
Build the agent:
agent = DQN(state_size=env.observation_space.shape,
            action_size=env.action_space.n,
            session=session,
            summary_writer = logger,
            exploration_period = 1000000,
            minibatch_size = 32,
            discount_factor = 0...
Summary
In this chapter, we learned how to implement a dueling DQN in detail. We started with the basic environment wrapper functions for preprocessing the game screens, and then defined the QNetworkDueling class. There, we implemented a dueling Q network, which splits the final fully connected layer of a DQN into a value stream and an advantage stream and then combines the two streams to compute the Q value. Following this, we saw how to create a replay buffer, which stores the agent's experience and from which we sample minibatches for training the network. Finally, we initialized our car racing environment using OpenAI's Gym and trained our agent. In the next chapter, Chapter 13, Recent Advancements and Next Steps, we will see some of the recent advancements in RL.
Questions
The question list is as follows:
- What is the difference between a DQN and a dueling DQN?
- Write the Python code for a replay buffer.
- What is a target network?
- Write the Python code for a prioritized experience replay buffer.
- Create a Python function to decay an epsilon-greedy policy.
- How does a dueling DQN differ from a double DQN?
- Create a Python function for updating primary network weights to the target network.
Further reading
The following links will help expand your knowledge:
- Flappy Bird using DQN: https://github.com/yenchenlin/DeepLearningFlappyBird
- Super Mario using DQN: https://github.com/JSDanielPark/tensorflow_dqn_supermario