In the previous chapter, Chapter 4, Gaming with Monte Carlo Methods, we learned about the interesting Monte Carlo method, which solves the Markov Decision Process (MDP) when the model dynamics of the environment are not known in advance, unlike dynamic programming. We looked at the Monte Carlo prediction method, which predicts value functions, and at control methods, which further optimize those value functions. But the Monte Carlo method has some pitfalls: it applies only to episodic tasks, and if an episode is very long, we have to wait a long time to compute value functions. So, in this chapter we will look at another interesting algorithm called temporal-difference (TD) learning, a model-free learning algorithm: it doesn't require the model dynamics to be known in advance, and it can be applied to non-episodic tasks as well.
You're reading from Hands-On Reinforcement Learning with Python
TD learning
The TD learning algorithm was introduced by Sutton in 1988. It takes the benefits of both the Monte Carlo method and dynamic programming (DP) into account: like the Monte Carlo method, it doesn't require the model dynamics, and like DP, it doesn't need to wait until the end of the episode to make an estimate of the value function. Instead, it improves the current estimate based on a previously learned estimate, which is called bootstrapping. In Monte Carlo methods there is no bootstrapping, as we make an estimate only at the end of the episode, but in TD methods we bootstrap at every step.
TD prediction
As we did in Monte Carlo prediction, in TD prediction we try to predict the state values. In Monte Carlo prediction, we estimate the value function by simply taking the mean return. But in TD learning, we update the value of the previous state using the value of the current state. How can we do this? TD learning uses something called the TD update rule for updating the value of a state, as follows:
V(s) = V(s) + α (r + γ V(s') − V(s))

That is, the value of the previous state s is moved toward the target r + γ V(s'), where r is the reward received on the transition to the current state s', α is the learning rate, and γ is the discount factor.
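The update rule above can be sketched as a short Python function. This is a minimal illustration, not the book's own code: it assumes a Gym-style environment with `reset`/`step` methods and a `policy` callable that maps a state to an action, and all names are illustrative.

```python
def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """TD(0) prediction: estimate state values V under a fixed policy.

    `env` is assumed to follow a Gym-style reset/step interface and
    `policy` maps a state to an action; both names are illustrative.
    """
    V = {}  # state-value estimates, defaulting to 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            v_s = V.get(state, 0.0)
            # TD update: move V(s) toward the bootstrapped target r + gamma * V(s')
            target = reward + gamma * (0.0 if done else V.get(next_state, 0.0))
            V[state] = v_s + alpha * (target - v_s)
            state = next_state
    return V
```

Note that the update happens inside the episode loop, on every single step; this is exactly the point where TD prediction departs from Monte Carlo prediction, which would only update after the episode ends.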
What does this equation actually mean?
If you think of this equation intuitively, it is the difference between the actual value (r + γV(s')) and the expected value (V(s)), multiplied by the learning rate α. This difference is known as the TD error. What does the learning rate signify? The learning rate, also called the step size, controls how far we move toward the target on each update and is useful for convergence.
TD control
In TD prediction, we estimated the value function. In TD control, we optimize the value function. For TD control, we use two kinds of control algorithm:
- Off-policy learning algorithm: Q learning
- On-policy learning algorithm: SARSA
Q learning
We will now look at the very popular off-policy TD control algorithm called Q learning. Q learning is a very simple and widely used TD algorithm. In control algorithms, we don't care about the state value alone; here, in Q learning, our concern is the state-action value pair: the effect of performing an action A in the state S.
We will update the Q value based on the following equation:

Q(s, a) = Q(s, a) + α (r + γ max_a' Q(s', a') − Q(s, a))

The preceding equation is similar to the TD prediction update rule, with a little difference: the target uses the maximum Q value over the actions available in the next state, max_a' Q(s', a'), rather than the value of the next state.
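The update rule can be sketched in tabular form as follows. This is an illustrative sketch, not the book's implementation: it assumes a Gym-style environment whose discrete action space exposes `env.action_space.n`, and all names are assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q learning; `env` is assumed Gym-style with a discrete
    action space exposing `env.action_space.n` (illustrative names)."""
    Q = defaultdict(float)          # Q[(state, action)], defaults to 0.0
    n_actions = env.action_space.n
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy over the current Q values
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done, _ = env.step(action)
            # Off-policy target: the greedy (max) Q value in the next state
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in range(n_actions))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

The behaviour policy that selects actions is epsilon-greedy, but the target in the update line takes the max over next-state actions, which is why Q learning is called off-policy.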
The difference between Q learning and SARSA
Q learning and SARSA often confuse many folks. Let us break down the differences between these two. Look at the flowchart here:
Can you spot the difference? In Q learning, we take the action using an epsilon-greedy policy but, while updating the Q value, we simply pick the maximum action. In SARSA, we take the action using the epsilon-greedy policy and, while updating the Q value, we also pick the action using the epsilon-greedy policy.
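The difference lives in a single line of the update: how the target for the next state is computed. Here is a hedged side-by-side sketch (function and variable names are illustrative, not from the book's code):

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon):
    """With probability epsilon pick a random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def q_learning_target(Q, reward, next_state, n_actions, gamma):
    # Off-policy: back up the maximum Q value in the next state,
    # regardless of which action the behaviour policy will actually take.
    return reward + gamma * max(Q[(next_state, a)] for a in range(n_actions))

def sarsa_target(Q, reward, next_state, n_actions, gamma, epsilon):
    # On-policy: first select the next action with the same epsilon-greedy
    # behaviour policy, then back up that action's Q value.
    next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
    return reward + gamma * Q[(next_state, next_action)]
```

With epsilon set to 0 the two targets coincide; the moment exploration is possible, SARSA's target can back up an exploratory action while Q learning's target never does.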
Summary
In this chapter, we learned about a different model-free learning algorithm, TD learning, which overcomes the limitations of the Monte Carlo methods. We saw both prediction and control methods. In TD prediction, we updated the value of a state based on the value of the next state. As for the control methods, we saw two different algorithms: Q learning and SARSA.
Questions
The question list is as follows:
- How does TD learning differ from the Monte Carlo method?
- What exactly is a TD error?
- What is the difference between TD prediction and control?
- How do we build an intelligent agent using Q learning?
- What is the difference between Q learning and SARSA?
Further reading
Sutton's original TD paper: https://pdfs.semanticscholar.org/9c06/865e912788a6a51470724e087853d7269195.pdf