Deep Q-Networks in Action

Deep Q-learning, or learning with deep Q-networks, is among the most widely used modern reinforcement learning techniques. In this chapter, we will develop various deep Q-network models step by step and apply them to solve several reinforcement learning problems. We will start with vanilla Q-networks and enhance them with experience replay. We will improve robustness by using an additional target network and demonstrate how to fine-tune a deep Q-network. We will also experiment with dueling deep Q-networks and see how their value functions differ from those of other types of deep Q-networks. In the last two recipes, we will solve complex Atari game problems by incorporating convolutional neural networks into deep Q-networks.

The following recipes will be covered in this chapter:

  • Developing deep Q-networks
  • Improving DQNs with experience replay
  • Developing double deep Q-networks...

Developing deep Q-networks

You will recall that Function Approximation (FA) approximates the state space using a set of features generated from the original states. Deep Q-Networks (DQNs) are very similar to FA with neural networks, but they use a neural network to map states to action values directly, rather than going through a set of generated features as an intermediate representation.

In Deep Q-learning, a neural network is trained to output the appropriate Q(s,a) values for each action given the input state, s. The action, a, of the agent is chosen based on the output Q(s,a) values following the epsilon-greedy policy. The structure of a DQN with two hidden layers is depicted in the following diagram:
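As a complement to the diagram, here is a minimal PyTorch sketch of a two-hidden-layer DQN together with epsilon-greedy action selection (the class name, layer sizes, and helper function are illustrative assumptions, not the recipe's exact code):

```python
import random
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a state vector directly to one Q(s, a) value per action."""
    def __init__(self, n_state, n_action, n_hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_state, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, n_action),
        )

    def forward(self, s):
        return self.net(s)

def epsilon_greedy(model, state, n_action, epsilon):
    """Explore with probability epsilon; otherwise act greedily on Q-values."""
    if random.random() < epsilon:
        return random.randint(0, n_action - 1)
    with torch.no_grad():
        q_values = model(torch.as_tensor(state, dtype=torch.float32))
    return int(torch.argmax(q_values).item())
```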

You will recall that Q-learning is an off-policy learning algorithm and that it updates the Q-function based on the following equation:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

Here, s' is the resulting state after taking action, a, in state...
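When a neural network approximates Q, this update becomes a regression step: the predicted Q(s, a) is pushed toward the TD target r + γ max_a' Q(s', a'). A minimal sketch, reusing the illustrative DQN class above (the done flag marking terminal states is an assumed convention):

```python
import torch
import torch.nn.functional as F

def q_learning_step(model, optimizer, s, a, r, s_next, gamma, done):
    """One gradient step pushing Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    q_value = model(s)[a]  # current estimate Q(s, a)
    with torch.no_grad():  # the TD target is treated as a constant
        target = r + gamma * model(s_next).max() * (1 - done)
    loss = F.mse_loss(q_value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```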

Improving DQNs with experience replay

The approximation of Q-values using neural networks with one sample at a time is not very stable. You will recall that, in FA, we incorporated experience replay to improve stability. Similarly, in this recipe, we will apply experience replay to DQNs.

With experience replay, we store the agent's experiences (an experience is composed of an old state, a new state, an action, and a reward) during episodes of a training session in a memory queue. Once we have gained sufficient experience, batches of experiences are randomly sampled from the memory and used to train the neural network. Learning with experience replay thus consists of two phases: gaining experience, and updating the model based on randomly selected past experiences. Otherwise, the model would keep learning only from the most recent experiences, and the neural network model could get stuck...
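A minimal sketch of such a memory queue (the class name and default capacity are illustrative assumptions):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size queue of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are evicted first

    def push(self, experience):
        self.memory.append(experience)

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```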

Developing double deep Q-networks

In the deep Q-learning algorithms we have developed so far, the same neural network is used to calculate both the predicted values and the target values. This can cause the training to diverge, as the target values keep changing and the predictions have to chase them. In this recipe, we will develop a new algorithm using two neural networks instead of one.

In double DQNs, we use a separate network, rather than the prediction network, to estimate the target values. The separate network has the same structure as the prediction network, and its weights are fixed for T episodes at a time (T is a hyperparameter we can tune), meaning they are only updated once every T episodes. The update is done simply by copying the weights of the prediction network. In this way, the target function stays fixed for a while, which results in a more stable training process.
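A minimal sketch of the periodic weight copy just described, reusing the illustrative DQN class from earlier (the helper names are assumptions):

```python
import copy
import torch

def sync_target(model, target_model):
    """Copy the prediction network's weights into the frozen target network."""
    target_model.load_state_dict(model.state_dict())

def td_target(target_model, r, s_next, gamma, done):
    """Estimate r + gamma * max_a' Q_target(s', a') with the fixed network."""
    with torch.no_grad():
        return r + gamma * target_model(s_next).max() * (1 - done)

# The target network starts as a structural copy of the prediction network:
model = DQN(n_state=4, n_action=2)
target_model = copy.deepcopy(model)
# ...and sync_target(model, target_model) is then called once every T episodes.
```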

Mathematically...

Tuning double DQN hyperparameters for CartPole

In this recipe, let's solve the CartPole environment using double DQNs. We will demonstrate how to fine-tune the hyperparameters in a double DQN to achieve the best performance.

In order to fine-tune the hyperparameters, we can apply the grid search technique to explore a set of different combinations of values and pick the combination that achieves the best average performance. We can start with a coarse range of values and gradually narrow it down. And don't forget to fix the random number generators for all of the following in order to ensure reproducibility (a seeding sketch follows the list):

  • The Gym environment random number generator
  • The epsilon-greedy random number generator
  • The initial weights for the neural network in PyTorch
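A minimal seeding sketch covering all three sources of randomness, assuming the classic gym API in which env.seed() is available (the function name and default seed are illustrative):

```python
import random
import torch
import gym

def make_reproducible(seed=1):
    """Fix the three random number generators listed above."""
    env = gym.make('CartPole-v0')
    env.seed(seed)           # the Gym environment's RNG
    random.seed(seed)        # the RNG behind epsilon-greedy exploration
    torch.manual_seed(seed)  # the RNG behind PyTorch weight initialization
    return env
```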

How to do it...

...

Developing dueling deep Q-networks

In this recipe, we are going to develop another advanced type of DQN, the dueling DQN (DDQN). In particular, we will see how the computation of the Q-value is split into two parts in DDQNs.

In DDQNs, the Q-value is computed with the following two functions:

Q(s, a) = V(s) + A(s, a)

Here, V(s) is the state-value function, calculating the value of being at state s; A(s, a) is the state-dependent action advantage function, estimating how much better it is to take an action, a, rather than taking other actions at a state, s. By decoupling the value and advantage functions, we are able to accommodate the fact that our agent may not necessarily look at both the value and advantage at the same time during the learning process. In other words, the agent using DDQNs can efficiently optimize either or both functions as it prefers.
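In practice, a dueling head typically also subtracts the mean advantage so that V and A remain identifiable. A minimal PyTorch sketch under that common convention (the class name and layer sizes are illustrative, and the centering term may differ from the recipe's exact formulation):

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Splits the Q-value estimate into a state value and action advantages."""
    def __init__(self, n_state, n_action, n_hidden=50):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(n_state, n_hidden), nn.ReLU())
        self.value = nn.Linear(n_hidden, 1)             # V(s)
        self.advantage = nn.Linear(n_hidden, n_action)  # A(s, a)

    def forward(self, s):
        x = self.feature(s)
        v = self.value(x)
        a = self.advantage(x)
        # Center the advantages so the value/advantage split is unique
        return v + a - a.mean(dim=-1, keepdim=True)
```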

...

Applying Deep Q-Networks to Atari games

The problems we have worked with so far are fairly simple, and applying DQNs is sometimes overkill. In this and the next recipe, we'll use DQNs to solve Atari games, which are far more complicated problems.

We will use Pong (https://gym.openai.com/envs/Pong-v0/) as an example in this recipe. It simulates the Atari 2600 game Pong, where the agent plays table tennis with another player. The observation in this environment is an RGB image of the screen (refer to the following screenshot):

This is an array of shape (210, 160, 3), which means the image is 210 × 160 pixels with three RGB channels.
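Before such a screen is fed to a network, it is usually reduced in size. Here is a hedged sketch of one plausible preprocessing step (the recipe's exact cropping and scaling may differ):

```python
import numpy as np

def preprocess(obs):
    """Shrink a (210, 160, 3) RGB screen to a small grayscale array."""
    gray = obs.mean(axis=2)                   # collapse the three RGB channels
    small = gray[::2, ::2]                    # downsample by 2 -> (105, 80)
    return small.astype(np.float32) / 255.0   # scale pixel values to [0, 1]
```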

The agent (on the right-hand side) moves up and down during the game to hit the ball. If it misses it, the other player (on the left-hand side) will get 1 point; similarly, if the other player misses it, the agent will get 1 point. The...

Using convolutional neural networks for Atari games

In the previous recipe, we treated each observed image in the Pong environment as a flattened grayscale array and fed it to a fully connected neural network. Flattening an image in this way can result in the loss of spatial information. Why don't we use the two-dimensional image as input directly? In this recipe, we will incorporate convolutional neural networks (CNNs) into the DQN model.

A CNN is one of the best neural network architectures for dealing with image inputs. In a CNN, the convolutional layers effectively extract features from images, which are then passed on to downstream fully connected layers. An example of a CNN with two convolutional layers is depicted here:
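A minimal PyTorch sketch of such an architecture, assuming an 84 × 84 preprocessed input (the layer sizes are illustrative, not necessarily the recipe's exact choices):

```python
import torch
import torch.nn as nn

class CNNDQN(nn.Module):
    """Convolutional layers extract features; fc layers map them to Q-values."""
    def __init__(self, n_channel, n_action):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_channel, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
        )
        # 64 * 9 * 9 assumes an 84 x 84 input after preprocessing
        self.fc = nn.Sequential(
            nn.Linear(64 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_action),
        )

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)  # flatten only after feature extraction
        return self.fc(x)
```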

As you can imagine, if we simply flatten an image into a vector, we will lose some information on where the ball is located, and where the two players are. Such information is significant...
