PyTorch 1.x Reinforcement Learning Cookbook

Scaling Up Learning with Function Approximation

So far, we have represented the value function in the form of a lookup table in the MC and TD methods. The TD method is able to update the Q-function on the fly during an episode, which is considered an advancement on the MC method. However, the TD method is still not sufficiently scalable for problems with many states and/or actions: learning a separate value for every individual state-action pair becomes extremely slow.

This chapter will focus on function approximation, which can overcome the scaling issues in the TD method. We will begin by setting up the Mountain Car environment playground. After developing the linear function estimator, we will incorporate it into the Q-learning and SARSA algorithms. We will then improve the Q-learning algorithm using experience replay, and experiment with using...

Setting up the Mountain Car environment playground

The TD method can learn the Q-function during an episode, but it is not scalable. For example, the number of states in a chess game is around 10^40, and around 10^70 in a Go game. Moreover, it is infeasible to learn the values for continuous states using a lookup table. Hence, we need to solve such problems using function approximation (FA), which approximates the state space using a set of features.

In this first recipe, we will begin by getting familiar with the Mountain Car environment, which we will solve with the help of FA methods in upcoming recipes.

Mountain Car (https://gym.openai.com/envs/MountainCar-v0/) is a typical Gym environment with continuous states. As shown in the following diagram, its goal is to get the car to the top of the hill:

On a one-dimensional track, the car is positioned between -1.2 (leftmost) and 0.6 (rightmost...
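For reference, a minimal sketch of creating and inspecting this environment could look like the following (assuming the classic Gym reset/step API that the book's code targets; the random-action loop is purely illustrative):

import gym

# Create the Mountain Car environment and inspect its continuous state space
env = gym.make('MountainCar-v0')

print(env.observation_space.low)   # [-1.2, -0.07]: lower bounds for position and velocity
print(env.observation_space.high)  # [0.6, 0.07]: upper bounds for position and velocity
print(env.action_space)            # Discrete(3): push left, no push, push right

# Run one episode with random actions, just to watch the continuous state evolve
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)
print(state)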

Estimating Q-functions with gradient descent approximation

Starting from this recipe, we will develop FA algorithms to solve environments with continuous state variables. We will begin by approximating Q-functions using linear functions and gradient descent.

The main idea of FA is to use a set of features to estimate Q values. This is extremely useful for processes with a large state space, where the Q table becomes huge. There are several ways to map the features to the Q values; for example, linear approximation, which is a linear combination of the features, and neural networks. With linear approximation, the value function for an action is expressed as a weighted sum of the features:

V(s) = θ1F1(s) + θ2F2(s) + ... + θnFn(s)

Here, F1(s), F2(s), ..., Fn(s) are a set of features given the input state, s, and θ1, θ2, ..., θn are the weights applied to the corresponding features. Or we can put...
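As a rough sketch of what such a linear estimator might look like in PyTorch (the class name, interface, and learning rate here are illustrative assumptions, not the book's exact code), we can keep one linear model per action and train it with gradient descent:

import torch

class LinearEstimator:
    def __init__(self, n_feat, n_action, lr=0.05):
        # One linear model per action: V(s, a) = theta_a . F(s)
        self.criterion = torch.nn.MSELoss()
        self.models, self.optimizers = [], []
        for _ in range(n_action):
            model = torch.nn.Linear(n_feat, 1)
            self.models.append(model)
            self.optimizers.append(torch.optim.SGD(model.parameters(), lr))

    def predict(self, features):
        # Estimated Q value of every action for a feature vector (a 1-D float tensor)
        with torch.no_grad():
            return torch.cat([model(features) for model in self.models])

    def update(self, features, action, target):
        # One gradient descent step toward the target value for the chosen action
        prediction = self.models[action](features)
        loss = self.criterion(prediction, torch.tensor([target], dtype=torch.float32))
        self.optimizers[action].zero_grad()
        loss.backward()
        self.optimizers[action].step()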

Developing Q-learning with linear function approximation

In the previous recipe, we developed a value estimator based on linear regression. We will employ the estimator in Q-learning, as part of our FA journey.

As we have seen, Q-learning is an off-policy learning algorithm and it updates the Q-function based on the following equation:

Q(s, a) = Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]

Here, s' is the resulting state after taking action a in state s; r is the associated reward; α is the learning rate; and γ is the discount factor. The max_a' Q(s', a') term means that the learning target is generated greedily, by selecting the highest Q value among those available in state s', while the actions actually taken during an episode follow the epsilon-greedy policy. Similarly, Q-learning with FA has the following error term:

Our learning goal is to minimize the error term to zero, which means the estimated V(st) should satisfy...
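A hedged sketch of the resulting training loop is shown below; it assumes an estimator and a featurize function with the illustrative interfaces sketched in the previous recipe, and the hyperparameters are placeholders rather than tuned values:

import torch

def q_learning(env, estimator, featurize, n_episode, gamma=1.0, epsilon=0.1):
    for episode in range(n_episode):
        state = env.reset()
        done = False
        while not done:
            features = featurize(state)
            q_values = estimator.predict(features)
            # Epsilon-greedy behavior policy
            if torch.rand(1).item() < epsilon:
                action = env.action_space.sample()
            else:
                action = torch.argmax(q_values).item()
            next_state, reward, done, _ = env.step(action)
            # Greedy (max) learning target, as in the Q-learning update rule
            q_next = estimator.predict(featurize(next_state))
            td_target = reward if done else reward + gamma * torch.max(q_next).item()
            estimator.update(features, action, td_target)
            state = next_state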

Developing SARSA with linear function approximation

We've just solved the Mountain Car problem using the off-policy Q-learning algorithm in the previous recipe. Now, we will do so with the on-policy State-Action-Reward-State-Action (SARSA) algorithm (the FA version, of course).

In general, the SARSA algorithm updates the Q-function based on the following equation:

Q(s, a) = Q(s, a) + α[r + γQ(s', a') - Q(s, a)]

Here, s' is the resulting state after taking action a in state s; r is the associated reward; α is the learning rate; and γ is the discount factor. We simply pick the next action, a', by also following the epsilon-greedy policy, and use its Q value to update the Q function. The action a' is then taken in the next step. Accordingly, SARSA with FA has the following error term:

Our learning goal is to minimize the error term to zero, which means that the estimated V(st) should satisfy the following equation...
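A comparable sketch of the on-policy version follows; the estimator and featurize interfaces and the hyperparameters are again illustrative assumptions:

import torch

def sarsa(env, estimator, featurize, n_episode, gamma=1.0, epsilon=0.1):
    def epsilon_greedy(q_values):
        if torch.rand(1).item() < epsilon:
            return env.action_space.sample()
        return torch.argmax(q_values).item()

    for episode in range(n_episode):
        state = env.reset()
        features = featurize(state)
        action = epsilon_greedy(estimator.predict(features))
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_features = featurize(next_state)
            # The next action is chosen by the same epsilon-greedy policy (on-policy)
            next_action = epsilon_greedy(estimator.predict(next_features))
            q_next = estimator.predict(next_features)[next_action].item()
            td_target = reward if done else reward + gamma * q_next
            estimator.update(features, action, td_target)
            features, action = next_features, next_action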

Incorporating batching using experience replay

In the previous two recipes, we developed two FA-based learning algorithms: off-policy Q-learning and on-policy SARSA. In this recipe, we will improve the performance of off-policy Q-learning by incorporating experience replay.

With experience replay, we store the agent's experiences during an episode instead of running a Q-learning update at every step. Learning then consists of two phases: gaining experience, and updating models based on the experience obtained once an episode finishes. Specifically, the experience (also called the buffer, or memory) includes the past state, the action taken, the reward received, and the next state for each individual step in an episode.

In the learning phase, a certain number of data points are randomly sampled from the experience and are used to train the learning models. Experience replay can stabilize...
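One simple way to sketch this (the buffer size, batch size, and estimator interface below are assumptions for illustration) is a deque-based memory plus a replay step that samples past transitions at random:

import random
from collections import deque

import torch

memory = deque(maxlen=10000)

def store_experience(state, action, next_state, reward, done):
    # Each step of an episode is recorded as one experience tuple
    memory.append((state, action, next_state, reward, done))

def replay(estimator, featurize, batch_size, gamma=1.0):
    # After an episode, sample past experiences and take one update step per sample
    if len(memory) < batch_size:
        return
    for state, action, next_state, reward, done in random.sample(memory, batch_size):
        q_next = estimator.predict(featurize(next_state))
        td_target = reward if done else reward + gamma * torch.max(q_next).item()
        estimator.update(featurize(state), action, td_target)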

Developing Q-learning with neural network function approximation

As we mentioned before, we can also use neural networks as the approximating function. In this recipe, we will solve the Mountain Car environment using Q-learning with a neural network for approximation.

The goal of FA is to use a set of features to estimate the Q values via a regression model. Using a neural network as the estimation model, we increase the regression power thanks to the flexibility of multiple layers and the non-linearity introduced by the non-linear activations in the hidden layers. The rest of the Q-learning model is very similar to the one with linear approximation. We also use gradient descent to train the network. The ultimate goal of learning is to find the optimal weights of the network so that it best approximates the state-value function, V(s), for each possible action. The loss function...
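For illustration, a neural-network counterpart of the earlier linear estimator might look as follows; the hidden layer size, optimizer, and one-model-per-action layout are illustrative choices rather than the book's exact architecture:

import torch

class NeuralEstimator:
    def __init__(self, n_feat, n_action, n_hidden=50, lr=0.001):
        self.criterion = torch.nn.MSELoss()
        self.models, self.optimizers = [], []
        for _ in range(n_action):
            # A small network per action; non-linearity comes from the ReLU hidden layer
            model = torch.nn.Sequential(
                torch.nn.Linear(n_feat, n_hidden),
                torch.nn.ReLU(),
                torch.nn.Linear(n_hidden, 1),
            )
            self.models.append(model)
            self.optimizers.append(torch.optim.Adam(model.parameters(), lr))

    def predict(self, features):
        with torch.no_grad():
            return torch.cat([model(features) for model in self.models])

    def update(self, features, action, target):
        loss = self.criterion(self.models[action](features),
                              torch.tensor([target], dtype=torch.float32))
        self.optimizers[action].zero_grad()
        loss.backward()
        self.optimizers[action].step()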

Solving the CartPole problem with function approximation

This is a bonus recipe in this chapter, where we will solve the CartPole problem using FA.

As we saw in Chapter 1, Getting Started with Reinforcement Learning and PyTorch, we simulated the CartPole environment in the Simulating the CartPole environment recipe, and solved it using random search, hill climbing, and policy gradient algorithms in the Implementing and evaluating the random search policy, Developing the hill climbing algorithm, and Developing the policy gradient algorithm recipes, respectively. Now, let's try to solve CartPole using what we've covered in this chapter.

How to do it...

We demonstrate the solution...
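As a hedged sketch of how the pieces from this chapter could be combined for CartPole (feeding the raw four-dimensional state directly as the feature vector is an illustrative simplification, and the hyperparameters are placeholders), reusing the q_learning routine and the neural estimator sketched earlier:

import gym
import torch

env = gym.make('CartPole-v0')

def featurize(state):
    # CartPole's state is already a small continuous vector, so use it directly as features
    return torch.tensor(state, dtype=torch.float32)

n_feat = env.observation_space.shape[0]   # 4 state variables
n_action = env.action_space.n             # 2 actions: push left or right
estimator = NeuralEstimator(n_feat, n_action)
q_learning(env, estimator, featurize, n_episode=300, gamma=0.99, epsilon=0.1)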
