PyTorch 1.x Reinforcement Learning Cookbook

Author: Yuxi (Hayden) Liu
Product type: Book
Published: Oct 2019
Publisher: Packt
ISBN-13: 9781838551964
Pages: 340
Edition: 1st

Table of Contents (11 chapters)

  • Preface
  • Getting Started with Reinforcement Learning and PyTorch
  • Markov Decision Processes and Dynamic Programming
  • Monte Carlo Methods for Making Numerical Estimations
  • Temporal Difference and Q-Learning
  • Solving Multi-armed Bandit Problems
  • Scaling Up Learning with Function Approximation
  • Deep Q-Networks in Action
  • Implementing Policy Gradients and Policy Optimization
  • Capstone Project – Playing Flappy Bird with DQN
  • Other Books You May Enjoy

Implementing Policy Gradients and Policy Optimization

In this chapter, we will focus on policy gradient methods, which have been among the most popular reinforcement learning techniques in recent years. We will start by implementing the fundamental REINFORCE algorithm and then proceed to an improved version, REINFORCE with baseline. We will also implement the more powerful actor-critic algorithm and its variations, and apply it to solve the CartPole and Cliff Walking problems. We will then work with an environment with a continuous action space and resort to a Gaussian distribution to solve it. As a fun section at the end, we will train an agent based on the cross-entropy method to play the CartPole game.

The following recipes will be covered in this chapter:

  • Implementing the REINFORCE algorithm
  • Developing the REINFORCE algorithm with baseline
  • Implementing the actor-critic algorithm
  • Solving...

Implementing the REINFORCE algorithm

Policy gradient methods have become more and more popular in recent years. Their learning goal is to optimize the probability distribution of actions so that, given a state, a more rewarding action will have a higher probability value. In the first recipe of the chapter, we will talk about the REINFORCE algorithm, which is foundational to the more advanced policy gradient methods.

The REINFORCE algorithm is also known as the Monte Carlo policy gradient, as it optimizes the policy based on Monte Carlo methods. Specifically, it collects trajectory samples from one episode using its current policy and uses them to update the policy parameters, θ. The learning objective function for policy gradients is as follows:

Its gradient can be derived as follows:

Here, G_t is the return, which is the cumulative discounted reward until time, t...
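In standard notation, the policy gradient used by REINFORCE takes the form ∇θ J(θ) = E[Σt ∇θ log πθ(at|st) Gt]. As a minimal PyTorch sketch of the resulting update (a rough illustration under assumptions, not the book's exact implementation; the policy network and episode-collection code are assumed to exist elsewhere):

import torch

def reinforce_update(optimizer, log_probs, rewards, gamma=0.99):
    """One REINFORCE update from a single finished episode.

    log_probs: list of log pi_theta(a_t | s_t) tensors collected while acting
    rewards:   list of rewards r_t observed at each step
    """
    # Compute the discounted return G_t for every time step, working backwards.
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Policy gradient loss: -sum_t log pi(a_t|s_t) * G_t
    loss = -torch.sum(torch.stack(log_probs) * returns)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()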

Developing the REINFORCE algorithm with baseline

In the REINFORCE algorithm, Monte Carlo plays out the whole trajectory of an episode, which is then used to update the policy afterward. However, the stochastic policy may take different actions in the same state in different episodes. This can confuse the training, since one sampled experience wants to increase the probability of choosing an action while another sampled experience may want to decrease it. To reduce this high-variance problem in vanilla REINFORCE, we will develop a variation algorithm, REINFORCE with baseline, in this recipe.

In REINFORCE with baseline, we subtract the baseline state-value from the return, G. As a result, we use an advantage function A in the gradient update, which is described as follows:

Here, V(s) is the value function, which estimates the value of a given state. Typically, we can use a linear function...
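As a minimal sketch of how the baseline changes the update (assuming a hypothetical value_net mapping states to scalar values, plus the same episode format as in the previous recipe), the only difference from plain REINFORCE is that the advantage G_t - V(s_t) replaces the raw return:

import torch
import torch.nn.functional as F

def reinforce_with_baseline_update(policy_optimizer, value_optimizer,
                                   value_net, states, log_probs, returns):
    """Policy update with a state-value baseline.

    states:    tensor of shape (T, state_dim)
    log_probs: list of T log-probability tensors from the policy
    returns:   tensor of T discounted returns G_t
    """
    values = value_net(states).squeeze(-1)     # V(s_t), shape (T,)
    advantages = returns - values.detach()     # A_t = G_t - V(s_t)

    # Policy loss uses the advantage instead of the raw return.
    policy_loss = -torch.sum(torch.stack(log_probs) * advantages)
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()

    # Fit the baseline by regressing V(s_t) onto the observed returns.
    value_loss = F.mse_loss(values, returns)
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()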

Implementing the actor-critic algorithm

In the REINFORCE with baseline algorithm, there are two separate components, the policy model and the value function. We can actually combine the learning of these two components, since the goal of learning the value function is to update the policy network. This is exactly what the actor-critic algorithm does, and it is what we are going to develop in this recipe.

The network for the actor-critic algorithm consists of the following two parts:

  • Actor: This takes in the input state and outputs the action probabilities. Essentially, it learns the optimal policy by updating the model using information provided by the critic.
  • Critic: This evaluates how good it is to be in the input state by computing the value function. The value guides the actor on how it should adjust.

These two components share parameters of input and hidden layers in the network, as...
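As a rough sketch of this shared-parameter layout (the layer sizes here are illustrative assumptions, not the book's exact configuration), the actor head and the critic head can branch off a common hidden layer:

import torch.nn as nn
import torch.nn.functional as F

class ActorCriticNetwork(nn.Module):
    """A small network with a shared hidden layer and two heads."""
    def __init__(self, n_state, n_action, n_hidden=64):
        super().__init__()
        self.shared = nn.Linear(n_state, n_hidden)   # shared input/hidden layer
        self.actor = nn.Linear(n_hidden, n_action)   # actor head: action probabilities
        self.critic = nn.Linear(n_hidden, 1)         # critic head: state value

    def forward(self, x):
        h = F.relu(self.shared(x))
        action_probs = F.softmax(self.actor(h), dim=-1)
        state_value = self.critic(h)
        return action_probs, state_value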

Solving Cliff Walking with the actor-critic algorithm

In this recipe, let's solve a more complicated Cliff Walking environment using the A2C algorithm.

Cliff Walking is a typical Gym environment with long episodes and no guarantee of termination. It is a grid problem with a 4 * 12 board. At each step, the agent moves up, right, down, or left by one tile. The bottom-left tile is the starting point for the agent, and the bottom-right tile is the goal, where the episode ends once it is reached. The remaining tiles in the last row are cliffs; the agent is reset to the starting position after stepping on any of them, but the episode continues. Each step the agent takes incurs a -1 reward, with the exception of stepping on the cliffs, where a -100 reward is incurred.

The state is an integer from 0 to 47, indicating where the agent is located, as illustrated:

Such value does...
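A quick way to set up and inspect this environment is sketched below (assuming the gym 0.x API of the book's era, where reset() returns only the state; the one_hot helper is an illustrative assumption for feeding the integer state into a network):

import gym
import torch

env = gym.make('CliffWalking-v0')
n_state = env.observation_space.n    # 48 tiles on the 4 x 12 board
n_action = env.action_space.n        # 4 moves: up, right, down, left

def one_hot(state):
    """Encode the integer tile index as a length-48 one-hot vector."""
    x = torch.zeros(n_state)
    x[state] = 1.0
    return x

state = env.reset()                  # the bottom-left starting tile, index 36
print(state, one_hot(state).shape)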

Setting up the continuous Mountain Car environment

So far, the environments we have worked with have had discrete action values, such as 0 or 1, representing up or down, or left or right. In this recipe, we will experience a Mountain Car environment with continuous actions.

Continuous Mountain Car (https://github.com/openai/gym/wiki/MountainCarContinuous-v0) is a Mountain Car environment with continuous actions whose value is from -1 to 1. As shown in the following screenshot, its goal is to get the car to the top of the hill on the right-hand side:

In a one-dimensional track, the car is positioned between -1.2 (leftmost) and 0.6 (rightmost), and the goal (yellow flag) is located at 0.5. The engine of the car is not strong enough to drive it to the top in a single pass, so it has to drive back and forth to build up momentum. Hence, the action is a float that represents the force of pushing...
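A short sketch of setting up this environment and inspecting its spaces (again assuming the gym 0.x API, where step() returns a 4-tuple):

import gym

env = gym.make('MountainCarContinuous-v0')

print(env.action_space)        # Box(-1.0, 1.0, (1,)): a single continuous force value
print(env.observation_space)   # Box(2,): car position in [-1.2, 0.6] and velocity

state = env.reset()                                 # [position, velocity]
next_state, reward, done, info = env.step([0.5])    # push right with force 0.5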

Solving the continuous Mountain Car environment with the advantage actor-critic network

In this recipe, we are going to solve the continuous Mountain Car problem using the advantage actor-critic algorithm, this time a continuous version, of course. You will see how it differs from the discrete version.

As we have seen in A2C for environments with discrete actions, we sample actions based on the estimated probabilities. How can we model continuous control, given that we can't sample in this way from countless continuous actions? We can actually resort to the Gaussian distribution. We assume that the action values follow a Gaussian distribution:

Here, the mean, μ, and standard deviation, σ, are computed by the policy network. With this tweak, we can sample actions from the Gaussian distribution constructed with the current mean and deviation. The loss function in continuous A2C is similar to...
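As a minimal sketch of this tweak (assuming a hypothetical policy network that outputs a mean and a positive standard deviation for the single action dimension), PyTorch's torch.distributions.Normal handles both the sampling and the log-probability needed for the loss:

import torch
from torch.distributions import Normal

def select_continuous_action(policy_net, state):
    """Sample an action from a Gaussian whose parameters come from the policy net."""
    mu, sigma = policy_net(state)        # mean and standard deviation (sigma > 0)
    dist = Normal(mu, sigma)             # Gaussian policy pi(a|s) = N(mu, sigma)
    action = dist.sample()               # draw a continuous action
    log_prob = dist.log_prob(action)     # needed for the policy-gradient loss
    return action.clamp(-1.0, 1.0), log_prob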

Playing CartPole through the cross-entropy method

In this last recipe, by way of a bonus (and fun) section, we will develop a simple, yet powerful, algorithm to solve CartPole. It is based on cross-entropy, and it directly maps input states to output actions. In fact, it is more straightforward than all the other policy gradient algorithms in this chapter.

We have applied several policy gradient algorithms to solve the CartPole environment. They use complicated neural network architectures and loss functions, which may be overkill for simple environments such as CartPole. Why don't we directly predict the actions for given states? The idea behind this is straightforward: we model the mapping from state to action, and train it ONLY with the most successful experiences from the past. We are only interested in what the correct actions should be. The objective function, in this...
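As a rough sketch of this idea (the network, the environment loop, and the elite percentile threshold are illustrative assumptions rather than the book's exact recipe), we keep only the highest-reward episodes and fit a plain state-to-action classifier on them with a cross-entropy loss:

import torch
import torch.nn as nn

def train_on_elite_episodes(net, optimizer, episodes, percentile=70):
    """Keep only the most successful episodes and fit state -> action on them.

    episodes: list of (states, actions, total_reward) tuples, where states is a
              (T, state_dim) float tensor and actions a (T,) long tensor of action indices
    """
    rewards = torch.tensor([ep[2] for ep in episodes], dtype=torch.float)
    threshold = rewards.quantile(percentile / 100.0)

    # Collect (state, action) pairs only from the elite episodes.
    elite_states, elite_actions = [], []
    for states, actions, total_reward in episodes:
        if total_reward >= threshold:
            elite_states.append(states)
            elite_actions.append(actions)

    states = torch.cat(elite_states)
    actions = torch.cat(elite_actions)

    # Plain classification: cross-entropy between predicted and taken actions.
    logits = net(states)
    loss = nn.functional.cross_entropy(logits, actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()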
