PyTorch 1.x Reinforcement Learning Cookbook

Getting Started with Reinforcement Learning and PyTorch

We kick off our journey of practical reinforcement learning and PyTorch with the basic, yet important, reinforcement learning algorithms, including random search, hill climbing, and policy gradient. We will start by setting up the working environment and OpenAI Gym, and you will become familiar with reinforcement learning environments through the Atari and CartPole playgrounds. We will also demonstrate how to develop algorithms to solve the CartPole problem step by step. Also, we will review the essentials of PyTorch and prepare for the upcoming learning examples and projects.

This chapter contains the following recipes:

Setting up the working environment
Installing OpenAI Gym
Simulating Atari environments
Simulating the CartPole environment
Reviewing the fundamentals of PyTorch
Implementing and evaluating a random search policy
Developing the hill-climbing algorithm
Developing a policy gradient algorithm

Setting up the working environment

Let's get started with setting up the working environment, including the correct versions of Python and Anaconda, and PyTorch as the main framework that is used throughout the book.

Python is the language we use to implement all reinforcement learning algorithms and techniques throughout the book. In this book, we will be using Python 3, or more specifically, 3.6 or above. If you are a Python 2 user, now is the best time for you to switch to Python 3, as Python 2 will no longer be supported after 2020. The transition is very smooth, though, so don't panic.

Anaconda is an open source Python distribution (www.anaconda.com/distribution/) for data science and machine learning. We will be using Anaconda's package manager, conda, to install Python packages, along with pip.

PyTorch (https://pytorch.org/), primarily developed by the Facebook AI Research (FAIR) Group, is a trendy machine learning library based on Torch (http://torch.ch/). Tensors in PyTorch replace NumPy's ndarrays, which provides more flexibility and compatibility with GPUs. Because of the powerful computational graphs and the simple and friendly interface, the PyTorch community is expanding on a daily basis, and it has seen heavy adoption by more and more tech giants.

Let's see how to properly set up all of these components.

How to do it...

We will begin by installing Anaconda. You can skip this if you already have Anaconda for Python 3.6 or 3.7 running on your system. Otherwise, you can follow the instructions at https://docs.anaconda.com/anaconda/install/ for your operating system, as follows:

Feel free to play around with PyTorch once the setup is done. To verify that you have the right setup of Anaconda and Python, you can enter the following line in your Terminal in Linux/Mac or Command Prompt in Windows (from now on, we will just call it Terminal):

python

It will display your Python Anaconda environment. You should see something similar to the following screenshot:

If Anaconda and Python 3.x are not mentioned, please check the system path or the path Python is running from.

The next thing to do is to install PyTorch. First, go to https://pytorch.org/get-started/locally/ and pick the description of your environment from the following table:

Here, we use Mac, Conda, Python 3.7, and running locally (no CUDA) as an example, and enter the resulting command line in the Terminal:

conda install pytorch torchvision -c pytorch

To confirm PyTorch is installed correctly, run the following lines of code in Python:

>>> import torch
>>> x = torch.empty(3, 4)
>>> print(x)
tensor([[ 0.0000e+00,  2.0000e+00, -1.2750e+16, -2.0005e+00],
        [ 9.8742e-37,  1.4013e-45, 9.9222e-37,  1.4013e-45],
        [ 9.9220e-37,  1.4013e-45, 9.9225e-37,  2.7551e-40]])

If a 3 x 4 matrix is displayed, that means PyTorch is installed correctly.

Now we have successfully set up the working environment.

How it works...

We have just created a tensor of size 3 x 4 in PyTorch. It is an empty matrix. By saying empty, this doesn't mean all elements are of the value Null. Instead, they are a bunch of meaningless floats that are considered placeholders. Users are required to set all the values later. This is very similar to NumPy's empty array.

There's more...

Some of you may question the necessity of installing Anaconda and using conda to manage packages since it is easy to install packages with pip. In fact, conda is a better packaging tool than pip. We mainly use conda for the following four reasons:

It handles library dependencies nicely: Installing a package with conda will automatically download all of its dependencies. However, doing so with pip will lead to a warning, and installation will be aborted.
It solves conflicts of packages gracefully: If installing a package requires another package of a specific version (let's say 2.3 or after, for example), conda will update the version of the other package automatically.
It creates a virtual environment easily: A virtual environment is a self-contained package directory tree. Different applications or projects can use different virtual environments. All virtual environments are isolated from each other. It is recommended to use virtual environments so that whatever we do for one application doesn't affect our system environment or any other environment.
It is also compatible with pip: We can still use pip in conda with the following command:

conda install pip

Installing OpenAI Gym

After setting up the working environment, we can now install OpenAI Gym. You can't work on reinforcement learning without using OpenAI Gym, which gives you a variety of environments in which to develop your learning algorithms.

OpenAI (https://openai.com/) is a non-profit research company that is focused on building safe artificial general intelligence (AGI) and ensuring that it benefits humans. OpenAI Gym is a powerful and open source toolkit for developing and comparing reinforcement learning algorithms. It provides an interface to varieties of reinforcement learning simulations and tasks, from walking to moon landing, from car racing to playing Atari games. See https://gym.openai.com/envs/ for the full list of environments.We can write agents to interact with OpenAI Gym environments using any numerical computation library, such as PyTorch, TensorFlow, or Keras.

How to do it...

There are two ways to install Gym. The first one is to use pip, as follows:

pip install gym

For conda users, remember to install pip first in conda using the following command before installing Gym using pip:

conda install pip

This is because Gym is not officially available in conda as of early 2019.

Another approach is to build from source:

First, clone the package directly from its Git repository:

git clone https://github.com/openai/gym

Go to the downloaded folder and install Gym from there:

cd gym
pip install -e .

And now you are good to go. Feel free to play around with gym.

You can also check the available gym environment by typing the following lines of code:

>>> from gym import envs
>>> print(envs.registry.all())
dict_values([EnvSpec(Copy-v0), EnvSpec(RepeatCopy-v0), EnvSpec(ReversedAddition-v0), EnvSpec(ReversedAddition3-v0), EnvSpec(DuplicatedInput-v0), EnvSpec(Reverse-v0), EnvSpec(CartPole-v0), EnvSpec(CartPole-v1), EnvSpec(MountainCar-v0), EnvSpec(MountainCarContinuous-v0), EnvSpec(Pendulum-v0), EnvSpec(Acrobot-v1), EnvSpec(LunarLander-v2), EnvSpec(LunarLanderContinuous-v2), EnvSpec(BipedalWalker-v2), EnvSpec(BipedalWalkerHardcore-v2), EnvSpec(CarRacing-v0), EnvSpec(Blackjack-v0)
...
...

This will give you a long list of environments if you installed Gym properly. We will play around with some of them in the next recipe, Simulating Atari environments.

How it works...

Compared to the simple pip approach for installing Gym, the second approach provides more flexibility if you want to add new environments and modify Gym itself.

There's more...

You may wonder why we need to test reinforcement learning algorithms on Gym's environments since the actual environments we work in can be a lot different. You will recall that reinforcement learning doesn't make many assumptions about the environment, but it gets to know more about the environment by interacting with it. Also, when comparing the performance of different algorithms, we need to apply them to standardized environments. Gym is a perfect benchmark, covering many versatile and easy-to-use environments. This is similar to the datasets that we often use as benchmarks in supervised and unsupervised learning, such as MNIST, Imagenet, MovieLens, and Thomson Reuters News.

Simulating Atari environments

To get started with Gym, let's play some Atari games with it.

The Atari environments (https://gym.openai.com/envs/#atari) are a variety of Atari 2600 video games, such as Alien, AirRaid, Pong, and Space Race. If you have ever played Atari games, this recipe should be fun for you, as you will play an Atari game, Space Invaders. However, an agent will act on your behalf.

How to do it...

Let's simulate the Atari environments by following these steps:

To run any atari environment for the first time, we need to install the atari dependencies by running this command in the Terminal:

pip install gym[atari]

Alternatively, if you used the second approach in the previous recipe to install gym, you can run the following command instead:

pip install -e '.[atari]'

After installing the Atari dependencies, we import the gym library in Python:

>>> import gym

Create an instance of the SpaceInvaders environment:

>>> env = gym.make('SpaceInvaders-v0')

Reset the environment:

>>> env.reset()
 array([[[ 0,  0, 0],
         [ 0, 0,  0],
         [ 0, 0,  0],
         ...,
         ...,
         [80, 89, 22],
         [80, 89, 22],
         [80, 89, 22]]], dtype=uint8)

As you can see, this also returns the initial state of the environment.

Render the environment:

>>> env.render()
True

You will see a small window popping up, as follows:

As you can see from the game window, the spaceship starts with three lives (the red spaceships).

Randomly pick one possible move and execute the action:

>>> action = env.action_space.sample()
>>> new_state, reward, is_done, info = env.step(action)

The step() method returns what happens after an action is taken, including the following:

New state: The new observation.
Reward: The reward associated with that action in that state.
Is done: A flag indicating whether the game ends. In a SpaceInvaders environment, this will be True if the spaceship has no more lives left or all the aliens are gone; otherwise, it will remain False.
Info: Extra information pertaining to the environment. This is about the number of lives left in this case. This is useful for debugging.

Let’s take a look at the is_done flag and info:

>>> print(is_done)
False
>>> print(info)
{'ale.lives': 3}

Now we render the environment:

>>> env.render()
 True

The game window becomes the following:

You won't notice much difference in the game window, because the spaceship just made a move.

Now, let's make a while loop and let the agent perform as many actions as it can:

>>> is_done = False
>>> while not is_done:
...     action = env.action_space.sample()
...     new_state, reward, is_done, info = env.step(action)
...     print(info)
...     env.render()
{'ale.lives': 3}
True
{'ale.lives': 3}
True
……
……
{'ale.lives': 2}
True
{'ale.lives': 2}
True
……
……
{'ale.lives': 1}
True
{'ale.lives': 1}
True

Meanwhile, you will see that the game is running, and the spaceship keeps moving and shooting, and so do the aliens. And it is pretty fun to watch, too. At the end, when the game ends, the window looks like the following:

As you can see, we scored 150 points in this game. You may get a higher or lower score than this because the actions the agent performs are all randomly selected.

We also confirm that no lives are left with the last piece of info:

>>> print(info)
{'ale.lives': 0}

How it works...

Using Gym, we can easily create an environment instance by calling the make() method with the name of the environment as the parameter.

As you may have noticed, the actions that the agent performs are randomly chosen using the sample() method.

Note that, normally, we would have a more sophisticated agent guided by reinforcement learning algorithms. Here, we just demonstrated how to simulate an environment, and how an agent takes actions regardless of the outcome.

Run this a few times and see what we get:

>>> env.action_space.sample()
0
>>> env.action_space.sample()
3
>>> env.action_space.sample()
0
>>> env.action_space.sample()
4
>>> env.action_space.sample()
2
>>> env.action_space.sample()
1
>>> env.action_space.sample()
4
>>> env.action_space.sample()
5
>>> env.action_space.sample()
1
>>> env.action_space.sample()
0

There are six possible actions in total. We can also see this by running the following command:

>>> env.action_space
Discrete(6)

Actions from 0 to 5 stand for No Operation, Fire, Up, Right, Left, and Down, respectively, which are all the moves the spaceship in the game can do.

The step() method will let the agent take the action that is specified as its parameter. The render() method will update the display window based on the latest observation of the environment.

The observation of the environment, new_state, is represented by a 210 x 160 x 3 matrix, as follows:

>>> print(new_state.shape)
(210, 160, 3)

This means that each frame of the display screen is an RGB image of size 210 x 160.

There's more...

You may wonder why we need to install Atari dependencies. In fact, there are a few more environments that do not accompany the installation of gym, such as Box2d, Classic control, MuJoCo, and Robotics.

Take the Box2d environments, for example; we need to install the Box2d dependencies before we first run the environments. Again, two installation approaches are as follows:

pip install gym[box2d]
pip install -e '.[box2d]'

After that, we can play around with the LunarLander environment, as follows:

>>> env = gym.make('LunarLander-v2')
>>> env.reset()
array([-5.0468446e-04,  1.4135642e+00, -5.1140346e-02,  1.1751971e-01,
       5.9164839e-04,  1.1584054e-02, 0.0000000e+00,  0.0000000e+00],
     dtype=float32)
>>> env.render()

A game window will pop up:

Simulating the CartPole environment

In this recipe, we will work on simulating one more environment in order to get more familiar with Gym. The CartPole environment is a classic one in reinforcement learning research.

CartPole is a traditional reinforcement learning task in which a pole is placed upright on top of a cart. The agent moves the cart either to the left or to the right by 1 unit in a timestep. The goal is to balance the pole and prevent it from falling over. The pole is considered to have fallen if it is more than 12 degrees from the vertical, or the cart moves 2.4 units away from the origin. An episode terminates when any of the following occurs:

The pole falls over
The number of timesteps reaches 200

How to do it...

Let's simulate the CartPole environment by following these steps:

To run the CartPole environment, let's first search for its name in the table of environments at https://github.com/openai/gym/wiki/Table-of-environments. We get 'CartPole-v0' and also learn that the observation space is represented in a 4-dimensional array, and that there are two possible actions (which makes sense).
We import the Gym library and create an instance of the CartPole environment:

 >>> import gym
 >>> env = gym.make('CartPole-v0')

Reset the environment:

 >>> env.reset()
 array([-0.00153354,  0.01961605, -0.03912845, -0.01850426])

As you can see, this also returns the initial state represented by an array of four floats.

Render the environment:

 >>> env.render()
 True

You will see a small window popping up, as follows:

Now, let's make a while loop and let the agent perform as many random actions as it can:

 >>> is_done = False
 >>> while not is_done:
 ...     action = env.action_space.sample()
 ...     new_state, reward, is_done, info = env.step(action)
 ...     print(new_state)
 ...     env.render()
 ...
 [-0.00114122 -0.17492355 -0.03949854  0.26158095]
 True
 [-0.00463969 -0.36946006 -0.03426692  0.54154857]
 True
 ……
 ……
 [-0.11973207 -0.41075106  0.19355244 1.11780626]
 True
 [-0.12794709 -0.21862176  0.21590856 0.89154351]
 True

Meanwhile, you will see that the cart and pole are moving. At the end, you will see they both stop. The window looks like the following:

The episode only lasts several steps because the left or right actions are chosen randomly. Can we record the whole process so we can replay it afterward? We can do so with just two lines of code in Gym, as shown in Step 7. If you are using a Mac or Linux system, you need to complete Step 6 first; otherwise, you can jump to Step 7.

To record video, we need to install the ffmpeg package. For Mac, it can be installed via the following command:

brew install ffmpeg

For Linux, the following command should do it:

sudo apt-get install ffmpeg

After creating the CartPole instance, add these two lines:

>>> video_dir = './cartpole_video/'
>>> env = gym.wrappers.Monitor(env, video_dir)

This will record what is displayed in the window and store it in the specified directory.

Now re-run the codes from Step 3 to Step 5. After an episode terminates, we can see that an .mp4 file is created in the video_dir folder. The video is quite short; it may last 1 second or so.

How it works...

In this recipe, we print out the state array for every step. But what does each float in the array mean? We can find more information about CartPole on Gym's GitHub wiki page: https://github.com/openai/gym/wiki/CartPole-v0. It turns out that those four floats represent the following:

Cart position: This ranges from -2.4 to 2.4, and any position beyond this range will trigger episode termination.
Cart velocity.
Pole angle: Any value less than -0.209 (-12 degrees) or greater than 0.209 (12 degrees) will trigger episode termination.
Pole velocity at the tip.

In terms of the action, it is either 0 or 1, which corresponds to pushing the cart to the left and to the right, respectively.

The reward in this environment is +1 for every timestep before the episode terminates. We can also verify this by printing out the reward for every step. And the total reward is simply the number of timesteps.

There's more...

So far, we've run only one episode. In order to assess how well the agent performs, we can simulate many episodes and then average the total rewards for an individual episode. The average total reward will tell us about the performance of the agent that takes random actions.

Let’s set 10,000 episodes:

 >>> n_episode = 10000

In each episode, we compute the total reward by accumulating the reward in every step:

 >>> total_rewards = []
 >>> for episode in range(n_episode):
 ...     state = env.reset()
 ...     total_reward = 0
 ...     is_done = False
 ...     while not is_done:
 ...         action = env.action_space.sample()
 ...         state, reward, is_done, _ = env.step(action)
 ...         total_reward += reward
 ...     total_rewards.append(total_reward)

Finally, we calculate the average total reward:

 >>> print('Average total reward over {} episodes: {}'.format(
           n_episode, sum(total_rewards) / n_episode))
 Average total reward over 10000 episodes: 22.2473

On average, taking a random action scores 22.25.

We all know that taking random actions is not sophisticated enough, and we will implement an advanced policy in upcoming recipes. But for the next recipe, let's take a break and review the basics of PyTorch.

Reviewing the fundamentals of PyTorch

As we’ve already mentioned, PyTorch is the numerical computation library we use to implement reinforcement learning algorithms in this book.

PyTorch is a trendy scientific computing and machine learning (including deep learning) library developed by Facebook. Tensor is the core data structure in PyTorch, which is similar to NumPy's ndarrays. PyTorch and NumPy are comparable in scientific computing. However, PyTorch is faster than NumPy in array operations and array traversing. This is mainly due to the fact that array element access is faster in PyTorch. Hence, more and more people believe PyTorch will replace NumPy.

How to do it...

Let's do a quick review of the basic programming in PyTorch to get more familiar with it:

We created an uninitialized matrix in an earlier recipe. How about a randomly initialized one? See the following commands:

 >>> import torch
 >>> x = torch.rand(3, 4)
 >>> print(x)
 tensor([[0.8052, 0.3370, 0.7676, 0.2442],
        [0.7073, 0.4468, 0.1277, 0.6842],
        [0.6688, 0.2107, 0.0527, 0.4391]])

Random floats from a uniform distribution in the interval (0, 1) are generated.

We can specify the desired data type of the returned tensor. For example, a tensor of the double type (float64) is returned as follows:

 >>> x = torch.rand(3, 4, dtype=torch.double)
 >>> print(x)
 tensor([[0.6848, 0.3155, 0.8413, 0.5387],
        [0.9517, 0.1657, 0.6056, 0.5794],
        [0.0351, 0.3801, 0.7837, 0.4883]], dtype=torch.float64)

By default, float is the returned data type.

Next, let's create a matrix full of zeros and a matrix full of ones:

    >>> x = torch.zeros(3, 4)
    >>> print(x)
    tensor([[0., 0., 0., 0.],
           [0., 0., 0., 0.],
           [0., 0., 0., 0.]])
    >>> x = torch.ones(3, 4)
    >>> print(x)
    tensor([[1., 1., 1., 1.],
           [1., 1., 1., 1.],
           [1., 1., 1., 1.]])

To get the size of a tensor, use this code:

 >>> print(x.size())
 torch.Size([3, 4])

torch.Size is actually a tuple.

To reshape a tensor, we can use the view() method:

 >>> x_reshaped = x.view(2, 6)
 >>> print(x_reshaped)
 tensor([[1., 1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1., 1.]])

We can create a tensor directly from data, including a single value, a list, and a nested list:

 >>> x1 = torch.tensor(3)
 >>> print(x1)
 tensor(3)
 >>> x2 = torch.tensor([14.2, 3, 4])
 >>> print(x2)
 tensor([14.2000,  3.0000, 4.0000])
 >>> x3 = torch.tensor([[3, 4, 6], [2, 1.0, 5]])
 >>> print(x3)
 tensor([[3., 4., 6.],
         [2., 1., 5.]])

To access the elements in a tensor of more than one element, we can use indexing in a similar way to NumPy:

 >>> print(x2[1])
 tensor(3.)
 >>> print(x3[1, 0])
 tensor(2.)
 >>> print(x3[:, 1])
 tensor([4., 1.])
 >>> print(x3[:, 1:])
 tensor([[4., 6.],
         [1., 5.]])

As with a one-element tensor, we do so by using the item() method:

 >>> print(x1.item())
 3

Tensor and NumPy arrays are mutually convertible. Convert a tensor to a NumPy array using the numpy() method:

 >>> x3.numpy()
 array([[3., 4., 6.],
        [2., 1., 5.]], dtype=float32)

Convert a NumPy array to a tensor with from_numpy():

>>> import numpy as np
>>> x_np = np.ones(3)
>>> x_torch = torch.from_numpy(x_np)
>>> print(x_torch)
tensor([1., 1., 1.], dtype=torch.float64)

Note that if the input NumPy array is of the float data type, the output tensor will be of the double type. Typecasting may occasionally be needed.

Take a look at the following example, where a tensor of the double type is converted to a float:

 >>> print(x_torch.float())
 tensor([1., 1., 1.])

Operations in PyTorch are similar to NumPy as well. Take addition as an example; we can simply do the following:

>>> x4 = torch.tensor([[1, 0, 0], [0, 1.0, 0]])
>>> print(x3 + x4)
tensor([[4., 4., 6.],
         [2., 2., 5.]])

Or we can use the add() method as follows:

 >>> print(torch.add(x3, x4))
 tensor([[4., 4., 6.],
         [2., 2., 5.]])

PyTorch supports in-place operations, which mutate the tensor object. For example, let's run this command:

 >>> x3.add_(x4)
 tensor([[4., 4., 6.],
         [2., 2., 5.]])

You will see that x3 is changed to the result of the original x3plus x4:

 >>> print(x3)
 tensor([[4., 4., 6.],
         [2., 2., 5.]])

There's more...

Any method with _ indicates that it is an in-place operation, which updates the tensor with the resulting value.

Implementing and evaluating a random search policy

After some practice with PyTorch programming, starting from this recipe, we will be working on more sophisticated policies to solve the CartPole problem than purely random actions. We start with the random search policy in this recipe.

A simple, yet effective, approach is to map an observation to a vector of two numbers representing two actions. The action with the higher value will be picked. The linear mapping is depicted by a weight matrix whose size is 4 x 2 since the observations are 4-dimensional in this case. In each episode, the weight is randomly generated and is used to compute the action for every step in this episode. The total reward is then calculated. This process repeats for many episodes and, in the end, the weight that enables the highest total reward will become the learned policy. This approach is called random search because the weight is randomly picked in each trial with the hope that the best weight will be found with a large number of trials.

How to do it...

Let's go ahead and implement a random search algorithm with PyTorch:

Import the Gym and PyTorch packages and create an environment instance:

>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')

Obtain the dimensions of the observation and action space:

>>> n_state = env.observation_space.shape[0]
>>> n_state
 4
>>> n_action = env.action_space.n
>>> n_action
 2

These will be used when we define the tensor for the weight matrix, which is size 4 x 2 in size.

Define a function that simulates an episode given the input weight and returns the total reward:

 >>> def run_episode(env, weight):
 ...     state = env.reset()
 ...     total_reward = 0
 ...     is_done = False
 ...     while not is_done:
 ...         state = torch.from_numpy(state).float()
 ...         action = torch.argmax(torch.matmul(state, weight))
 ...         state, reward, is_done, _ = env.step(action.item())
 ...         total_reward += reward
 ...     return total_reward

Here, we convert the state array to a tensor of the float type because we need to compute the multiplication of the state and weight tensor, torch.matmul(state, weight), for linear mapping. The action with the higher value is selected using the torch.argmax() operation. And don't forget to take the value of the resulting action tensor using .item() because it is a one-element tensor.

Specify the number of episodes:

>>> n_episode = 1000

We need to keep track of the best total reward on the fly, as well as the corresponding weight. So, we specify their starting values:

>>> best_total_reward = 0
>>> best_weight = None

We will also record the total reward for every episode:

>>> total_rewards = []

Now, we can run n_episode. For each episode, we do the following:

Randomly pick the weight
Let the agent take actions according to the linear mapping
An episode terminates and returns the total reward
Update the best total reward and the best weight if necessary
Also, keep a record of the total reward

Put this into code as follows:

 >>> for episode in range(n_episode):
 ...     weight = torch.rand(n_state, n_action)
 ...     total_reward = run_episode(env, weight)
 ...     print('Episode {}: {}'.format(episode+1, total_reward))
 ...     if total_reward > best_total_reward:
 ...         best_weight = weight
 ...         best_total_reward =  total_reward
 ...     total_rewards.append(total_reward)
 ...
 Episode 1: 10.0
 Episode 2: 73.0
 Episode 3: 86.0
 Episode 4: 10.0
 Episode 5: 11.0
 ……
 ……
 Episode 996: 200.0
 Episode 997: 11.0
 Episode 998: 200.0
 Episode 999: 200.0
 Episode 1000: 9.0

We have obtained the best policy through 1,000 random searches. The best policy is parameterized by best_weight.

Before we test out the best policy in the testing episodes, we can calculate the average total reward achieved by random linear mapping:

 >>> print('Average total reward over {} episode: {}'.format(
           n_episode, sum(total_rewards) / n_episode))
 Average total reward over 1000 episode: 47.197

This is more than twice what we got from the random action policy (22.25).

Now, let's see how the learned policy performs on 100 new episodes:

 >>> n_episode_eval = 100
 >>> total_rewards_eval = []
 >>> for episode in range(n_episode_eval):
 ...     total_reward = run_episode(env, best_weight)
 ...     print('Episode {}: {}'.format(episode+1, total_reward))
 ...     total_rewards_eval.append(total_reward)
 ...
 Episode 1: 200.0
 Episode 2: 200.0
 Episode 3: 200.0
 Episode 4: 200.0
 Episode 5: 200.0
 ……
 ……
 Episode 96: 200.0
 Episode 97: 188.0
 Episode 98: 200.0
 Episode 99: 200.0
 Episode 100: 200.0
 >>> print('Average total reward over {} episode: {}'.format(
           n_episode, sum(total_rewards_eval) / n_episode_eval))
 Average total reward over 1000 episode: 196.72

Surprisingly, the average reward for the testing episodes is close to the maximum of 200 steps with the learned policy. Be aware that this value may vary a lot. It could be anywhere from 160 to 200.

How it works...

The random search algorithm works so well mainly because of the simplicity of our CartPole environment. Its observation state is composed of only four variables. You will recall that the observation in the Atari Space Invaders game is more than 100,000 (which is 210 * 160 * 3) . The number of dimensions of the action state in CartPole is a third of that in Space Invaders. In general, simple algorithms work well for simple problems. In our case, we simply search for the best linear mapping from the observation to the action from a random pool.

Another interesting thing we've noticed is that before we select and deploy the best policy (the best linear mapping), random search also outperforms random action. This is because random linear mapping does take the observations into consideration. With more information from the environment, the decisions made in the random search policy are more intelligent than completely random ones.

There's more...

We can also plot the total reward for every episode in the training phase:

>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()

This will generate the following plot:

If you have not installed matplotlib, you can do so via the following command:

conda install matplotlib

We can see that the reward for each episode is pretty random, and that there is no trend of improvement as we go through the episodes. This is basically what we expected.

In the plot of reward versus episodes, we can see that there are some episodes in which the reward reaches 200. We can end the training phase whenever this occurs since there is no room to improve. Incorporating this change, we now have the following for the training phase:

 >>> n_episode = 1000
 >>> best_total_reward = 0
 >>> best_weight = None
 >>> total_rewards = []
 >>> for episode in range(n_episode):
 ...     weight = torch.rand(n_state, n_action)
 ...     total_reward = run_episode(env, weight)
 ...     print('Episode {}: {}'.format(episode+1, total_reward))
 ...     if total_reward > best_total_reward:
 ...         best_weight = weight
 ...         best_total_reward = total_reward
 ...     total_rewards.append(total_reward)
 ...     if best_total_reward == 200:
 ...         break
 Episode 1: 9.0
 Episode 2: 8.0
 Episode 3: 10.0
 Episode 4: 10.0
 Episode 5: 10.0
 Episode 6: 9.0
 Episode 7: 17.0
 Episode 8: 10.0
 Episode 9: 43.0
 Episode 10: 10.0
 Episode 11: 10.0
 Episode 12: 106.0
 Episode 13: 8.0
 Episode 14: 32.0
 Episode 15: 98.0
 Episode 16: 10.0
 Episode 17: 200.0

The policy achieving the maximal reward is found in episode 17. Again, this may vary a lot because the weights are generated randomly for each episode. To compute the expectation of training episodes needed, we can repeat the preceding training process 1,000 times and take the average of the training episodes:

 >>> n_training = 1000
 >>> n_episode_training = []
 >>> for _ in range(n_training):
 ...     for episode in range(n_episode):
 ...         weight = torch.rand(n_state, n_action)
 ...         total_reward = run_episode(env, weight)
 ...         if total_reward == 200:
 ...             n_episode_training.append(episode+1)
 ...             break
 >>> print('Expectation of training episodes needed: ',
            sum(n_episode_training) / n_training)
 Expectation of training episodes needed:  13.442

On average, we expect that it takes around 13 episodes to find the best policy.

Developing the hill-climbing algorithm

As we can see in the random search policy, each episode is independent. In fact, all episodes in random search can be run in parallel, and the weight that achieves the best performance will eventually be selected. We've also verified this with the plot of reward versus episode, where there is no upward trend. In this recipe, we will develop a different algorithm, a hill-climbing algorithm, to transfer the knowledge acquired in one episode to the next episode.

In the hill-climbing algorithm, we also start with a randomly chosen weight. But here, for every episode, we add some noise to the weight. If the total reward improves, we update the weight with the new one; otherwise, we keep the old weight. In this approach, the weight is gradually improved as we progress through the episodes, instead of jumping around in each episode.

How to do it...

Let's go ahead and implement the hill-climbing algorithm with PyTorch:

As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:

>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n

We will reuse the run_episode function we defined in the previous recipe, so we will not repeat it here. Again, given the input weight, it simulates an episode and returns the total reward.
Let's make it 1,000 episodes for now:

>>> n_episode = 1000

We need to keep track of the best total reward on the fly, as well as the corresponding weight. So, let's specify their starting values:

>>> best_total_reward = 0
>>> best_weight = torch.rand(n_state, n_action)

We will also record the total reward for every episode:

>>> total_rewards = []

As we mentioned, we will add some noise to the weight for each episode. In fact, we will apply a scale to the noise so that the noise won't overwhelm the weight. Here, we will choose 0.01 as the noise scale:

>>> noise_scale = 0.01

Now, we can run the n_episode function. After we randomly pick an initial weight, for each episode, we do the following:

Add random noise to the weight
Let the agent take actions according to the linear mapping
An episode terminates and returns the total reward
If the current reward is greater than the best one obtained so far, update the best reward and the weight
Otherwise, the best reward and the weight remain unchanged
Also, keep a record of the total reward

Put this into code as follows:

 >>> for episode in range(n_episode):
 ...     weight = best_weight +
                     noise_scale * torch.rand(n_state, n_action)
 ...     total_reward = run_episode(env, weight)
 ...     if total_reward >= best_total_reward:
 ...         best_total_reward = total_reward
 ...         best_weight = weight
 ...     total_rewards.append(total_reward)
 ...     print('Episode {}: {}'.format(episode + 1, total_reward))
 ...
 Episode 1: 56.0
 Episode 2: 52.0
 Episode 3: 85.0
 Episode 4: 106.0
 Episode 5: 41.0
 ……
 ……
 Episode 996: 39.0
 Episode 997: 51.0
 Episode 998: 49.0
 Episode 999: 54.0
 Episode 1000: 41.0

We also calculate the average total reward achieved by the hill-climbing version of linear mapping:

 >>> print('Average total reward over {} episode: {}'.format(
          n_episode, sum(total_rewards) / n_episode))
 Average total reward over 1000 episode: 50.024

To assess the training using the hill-climbing algorithm, we repeat the training process multiple times (by running the code from Step 4 to Step 6 multiple times). We observe that the average total reward fluctuates a lot. The following are the results we got when running it 10 times:

Average total reward over 1000 episode: 9.261   
Average total reward over 1000 episode: 88.565
Average total reward over 1000 episode: 51.796
Average total reward over 1000 episode: 9.41
Average total reward over 1000 episode: 109.758
Average total reward over 1000 episode: 55.787
Average total reward over 1000 episode: 189.251
Average total reward over 1000 episode: 177.624
Average total reward over 1000 episode: 9.146
Average total reward over 1000 episode: 102.311

What could cause such variance? It turns out that if the initial weight is bad, adding noise at a small scale will have little effect on improving the performance. This will cause poor convergence. On the other hand, if the initial weight is good, adding noise at a big scale might move the weight away from the optimal weight and jeopardize the performance. How can we make the training of the hill-climbing model more stable and reliable? We can actually make the noise scale adaptive to the performance, just like the adaptive learning rate in gradient descent. Let's see Step 8 for more details.

To make the noise adaptive, we do the following:

Specify a starting noise scale.
If the performance in an episode improves, decrease the noise scale. In our case, we take half of the scale, but set 0.0001 as the lower bound.
If the performance in an episode drops, increase the noise scale. In our case, we double the scale, but set 2 as the upper bound.

Put this into code:

 >>> noise_scale = 0.01
 >>> best_total_reward = 0
 >>> total_rewards = []
 >>> for episode in range(n_episode):
 ...     weight = best_weight +
                       noise_scale * torch.rand(n_state, n_action)
 ...     total_reward = run_episode(env, weight)
 ...     if total_reward >= best_total_reward:
 ...         best_total_reward = total_reward
 ...         best_weight = weight
 ...         noise_scale = max(noise_scale / 2, 1e-4)
 ...     else:
 ...         noise_scale = min(noise_scale * 2, 2)
 ...     print('Episode {}: {}'.format(episode + 1, total_reward))
 ...     total_rewards.append(total_reward)
 ...
 Episode 1: 9.0
 Episode 2: 9.0
 Episode 3: 9.0
 Episode 4: 10.0
 Episode 5: 10.0
 ……
 ……
 Episode 996: 200.0
 Episode 997: 200.0
 Episode 998: 200.0
 Episode 999: 200.0
 Episode 1000: 200.0

The reward is increasing as the episodes progress. It reaches the maximum of 200 within the first 100 episodes and stays there. The average total reward also looks promising:

 >>> print('Average total reward over {} episode: {}'.format(
          n_episode, sum(total_rewards) / n_episode))
 Average total reward over 1000 episode: 186.11

We also plot the total reward for every episode as follows:

>>> import matplotlib.pyplot as plt
>>> plt.plot(total_rewards)
>>> plt.xlabel('Episode')
>>> plt.ylabel('Reward')
>>> plt.show()

In the resulting plot, we can see a clear upward trend before it plateaus at the maximum value:

Feel free to run the new training process a few times. The results are very stable compared to learning with a constant noise scale.

Now, let's see how the learned policy performs on 100 new episodes:

 >>> n_episode_eval = 100
 >>> total_rewards_eval = []
 >>> for episode in range(n_episode_eval):
 ...     total_reward = run_episode(env, best_weight)
 ...     print('Episode {}: {}'.format(episode+1, total_reward))
 ...     total_rewards_eval.append(total_reward)
 ...
 Episode 1: 200.0
 Episode 2: 200.0
 Episode 3: 200.0
 Episode 4: 200.0
 Episode 5: 200.0
 ……
 ……
 Episode 96: 200.0
 Episode 97: 200.0
 Episode 98: 200.0
 Episode 99: 200.0
 Episode 100: 200.0

Let's see the average performance:

>>> print('Average total reward over {} episode: {}'.format(n_episode, sum(total_rewards) / n_episode))
 Average total reward over 1000 episode: 199.94

The average reward for the testing episodes is close to the maximum of 200 that we obtained with the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.

How it works...

We are able to achieve much better performance with the hill-climbing algorithm than with random search by simply adding adaptive noise to each episode. We can think of it as a special kind of gradient descent without a target variable. The additional noise is the gradient, albeit in a random way. The noise scale is the learning rate, and it is adaptive to the reward from the previous episode. The target variable in hill climbing becomes achieving the highest reward. In summary, rather than isolating each episode, the agent in the hill-climbing algorithm makes use of the knowledge learned from each episode and performs a more reliable action in the next episode. As the name hill climbing implies, the reward moves upwards through the episodes as the weight gradually moves towards the optimum value.

There's more...

We can observe that the reward can reach the maximum value within the first 100 episodes. Can we just stop training when the reward reaches 200, as we did with the random search policy? That might not be a good idea. Remember that the agent is making continuous improvements in hill climbing. Even if it finds a weight that generates the maximum reward, it can still search around this weight for the optimal point. Here, we define the optimal policy as the one that can solve the CartPole problem. According to the following wiki page, https://github.com/openai/gym/wiki/CartPole-v0, "solved" means the average reward over 100 consecutive episodes is no less than 195.

We refine the stopping criterion accordingly:

 >>> noise_scale = 0.01
 >>> best_total_reward = 0
 >>> total_rewards = []
 >>> for episode in range(n_episode):
 ...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
 ...     total_reward = run_episode(env, weight)
 ...     if total_reward >= best_total_reward:
 ...         best_total_reward = total_reward
 ...         best_weight = weight
 ...         noise_scale = max(noise_scale / 2, 1e-4)
 ...     else:
 ...         noise_scale = min(noise_scale * 2, 2)
 ...     print('Episode {}: {}'.format(episode + 1, total_reward))
 ...     total_rewards.append(total_reward)
 ...     if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
 ...         break
 ...
 Episode 1: 9.0
 Episode 2: 9.0
 Episode 3: 10.0
 Episode 4: 10.0
 Episode 5: 9.0
 ……
 ……
 Episode 133: 200.0
 Episode 134: 200.0
 Episode 135: 200.0
 Episode 136: 200.0
 Episode 137: 200.0

At episode 137, the problem is considered solved.

Developing a policy gradient algorithm

The last recipe of the first chapter is about solving the CartPole environment with a policy gradient algorithm. This may be more complicated than we need for this simple problem, in which the random search and hill-climbing algorithms suffice. However, it is a great algorithm to learn, and we will use it in more complicated environments later in the book.

In the policy gradient algorithm, the model weight moves in the direction of the gradient at the end of each episode. We will explain the computation of gradients in the next section. Also, in each step, it samples an action from the policy based on the probabilities computed using the state and weight. It no longer takes an action with certainty, in contrast with random search and hill climbing (by taking the action achieving the higher score). Hence, the policy switches from deterministic to stochastic.

How to do it...

Now, it is time to implement the policy gradient algorithm with PyTorch:

As before, import the necessary packages, create an environment instance, and obtain the dimensions of the observation and action space:

>>> import gym
>>> import torch
>>> env = gym.make('CartPole-v0')
>>> n_state = env.observation_space.shape[0]
>>> n_action = env.action_space.n

We define the run_episode function, which simulates an episode given the input weight and returns the total reward and the gradients computed. More specifically, it does the following tasks in each step:

Calculates the probabilities, probs, for both actions based on the current state and input weight
Samples an action, action, based on the resulting probabilities
Computes the derivatives, d_softmax, of the softmax function with the probabilities as input
Divides the resulting derivatives, d_softmax, by the probabilities, probs, to get the derivatives, d_log, of the log term with respect to the policy
Applies the chain rule to compute the gradient, grad, of the weights
Records the resulting gradient, grad
Performs the action, accumulates the reward, and updates the state

Putting all of this into code, we have the following:

 >>> def run_episode(env, weight):
 ...     state = env.reset()
 ...     grads = []
 ...     total_reward = 0
 ...     is_done = False
 ...     while not is_done:
 ...         state = torch.from_numpy(state).float()
 ...         z = torch.matmul(state, weight)
 ...         probs = torch.nn.Softmax()(z)
 ...         action = int(torch.bernoulli(probs[1]).item())
 ...         d_softmax = torch.diag(probs) -
                             probs.view(-1, 1) * probs
 ...         d_log = d_softmax[action] / probs[action]
 ...         grad = state.view(-1, 1) * d_log
 ...         grads.append(grad)
 ...         state, reward, is_done, _ = env.step(action)
 ...         total_reward += reward
 ...         if is_done:
 ...             break
 ...     return total_reward, grads

After an episode finishes, it returns the total reward obtained in this episode and the gradients computed for the individual steps. These two outputs will be used to update the weight.

Let's make it 1,000 episodes for now:

>>> n_episode = 1000

This means we will run run_episode and n_episodetimes.

Initiate the weight:

>>> weight = torch.rand(n_state, n_action)

We will also record the total reward for every episode:

>>> total_rewards = []

At the end of each episode, we need to update the weight using the computed gradients. For every step of the episode, the weight moves by learning rate * gradient calculated in this step * total reward in the remaining steps. Here, we choose 0.001 as the learning rate:

>>> learning_rate = 0.001

Now, we can run n_episodeepisodes:

 >>> for episode in range(n_episode):
 ...     total_reward, gradients = run_episode(env, weight)
 ...     print('Episode {}: {}'.format(episode + 1, total_reward))
 ...     for i, gradient in enumerate(gradients):
 ...         weight += learning_rate * gradient * (total_reward - i)
 ...     total_rewards.append(total_reward)
 ……
 ……
 Episode 101: 200.0
 Episode 102: 200.0
 Episode 103: 200.0
 Episode 104: 190.0
 Episode 105: 133.0
 ……
 ……
 Episode 996: 200.0
 Episode 997: 200.0
 Episode 998: 200.0
 Episode 999: 200.0
 Episode 1000: 200.0

Now, we calculate the average total reward achieved by the policy gradient algorithm:

 >>> print('Average total reward over {} episode: {}'.format(
          n_episode, sum(total_rewards) / n_episode))
 Average total reward over 1000 episode: 179.728

We also plot the total reward for every episode as follows:

 >>> import matplotlib.pyplot as plt
 >>> plt.plot(total_rewards)
 >>> plt.xlabel('Episode')
 >>> plt.ylabel('Reward')
 >>> plt.show()

In the resulting plot, we can see a clear upward trend before it stays at the maximum value:

We can also see that the rewards oscillate even after it converges. This is because the policy gradient algorithm is a stochastic policy.

Now, let's see how the learned policy performs on 100 new episodes:

 >>> n_episode_eval = 100
 >>> total_rewards_eval = []
 >>> for episode in range(n_episode_eval):
 ...     total_reward, _ = run_episode(env, weight)
 ...     print('Episode {}: {}'.format(episode+1, total_reward))
 ...     total_rewards_eval.append(total_reward)
 ...
 Episode 1: 200.0
 Episode 2: 200.0
 Episode 3: 200.0
 Episode 4: 200.0
 Episode 5: 200.0
 ……
 ……
 Episode 96: 200.0
 Episode 97: 200.0
 Episode 98: 200.0
 Episode 99: 200.0
 Episode 100: 200.0

Let's see the average performance:

>>> print('Average total reward over {} episode: {}'.format(n_episode, sum(total_rewards) / n_episode))
 Average total reward over 1000 episode: 199.78

The average reward for the testing episodes is close to the maximum value of 200 for the learned policy. You can re-run the evaluation multiple times. The results are pretty consistent.

How it works...

The policy gradient algorithm trains an agent by taking small steps and updating the weight based on the rewards associated with those steps at the end of an episode. The technique of having the agent run through an entire episode and then updating the policy based on the rewards obtained is called Monte Carlo policy gradient.

The action is selected based on the probability distribution computed based on the current state and the model’s weight. For example, if the probabilities for the left and right actions are [0.6, 0.4], this means the left action is selected 60% of the time; it doesn't mean the left action is chosen, as in the random search and hill-climbing algorithms.

We know that the reward is 1 for each step before an episode terminates. Hence, the future reward we use to calculate the policy gradient at each step is the number of steps remaining. After each episode, we feed the gradient history multiplied by the future rewards to update the weight using the stochastic gradient ascent method. In this way, the longer an episode is, the bigger the update of the weight. This will eventually increase the chance of getting a larger total reward.

As we mentioned at the start of this section, the policy gradient algorithm might be overkill for a simple environment such as CartPole, but it should get us ready for more complicated problems.

There's more...

If we examine the reward/episode plot, it seems that we can also stop early during training when it has been solved – the average reward over 100 consecutive episodes is no less than 195. We just add the following lines of code to the training session:

 >>> if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
 ...     break

Re-run the training session. You should get something similar to the following, which stops after several hundred episodes:

Episode 1: 10.0
Episode 2: 27.0
Episode 3: 28.0
Episode 4: 15.0
Episode 5: 12.0
……
……
Episode 549: 200.0
Episode 550: 200.0
Episode 551: 200.0
Episode 552: 200.0
Episode 553: 200.0