You're reading from Reinforcement Learning Algorithms with Python

Product type: Book
Published in: Oct 2019
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781789131116
Edition: 1st Edition

Author: Andrea Lonza

Andrea Lonza is a deep learning engineer with a great passion for artificial intelligence and a desire to create machines that act intelligently. He has acquired expert knowledge in reinforcement learning, natural language processing, and computer vision through academic and industrial machine learning projects. He has also participated in several Kaggle competitions, achieving high results. He is always looking for compelling challenges and loves to prove himself.

DDPG and TD3 Applications

In the previous chapter, we completed a comprehensive overview of the major policy gradient algorithms. Thanks to their ability to deal with continuous action spaces, they are applied to very complex and sophisticated control systems. Policy gradient methods can also use second-order derivatives, as is done in TRPO, or other strategies that constrain the policy update to prevent unexpected bad behaviors. However, the main concern with this type of algorithm is their poor sample efficiency, that is, the large amount of experience needed to master a task. This drawback comes from their on-policy nature, which requires new experience every time the policy is updated. In this chapter, we will introduce a new type of off-policy actor-critic algorithm that learns a target deterministic policy, while exploring...
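One common way such algorithms keep exploring while learning a deterministic policy is to perturb the deterministic action with noise when acting in the environment. The snippet below is only a minimal sketch of that idea, using Gaussian noise with an illustrative scale (the original DDPG paper used Ornstein-Uhlenbeck noise instead); the function name and bounds are hypothetical.

import numpy as np

def exploratory_action(deterministic_action, noise_std=0.1, low=-1.0, high=1.0):
    # Add exploration noise to the deterministic action and keep it within bounds.
    noise = np.random.normal(0.0, noise_std, size=deterministic_action.shape)
    return np.clip(deterministic_action + noise, low, high)

# Example: perturb a 2-dimensional deterministic action
print(exploratory_action(np.array([0.3, -0.7])))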

Combining policy gradient optimization with Q-learning

Throughout this book, we have approached two main types of model-free algorithms: those based on the gradient of the policy, and those based on the value function. From the first family, we saw REINFORCE, actor-critic, PPO, and TRPO. From the second, we saw Q-learning, SARSA, and DQN. Besides the way in which the two families learn a policy (policy gradient algorithms use stochastic gradient ascent toward the steepest increase in the estimated return, while value-based algorithms learn an action value for each state-action pair and then derive a policy from it), there are other key differences that lead us to prefer one family over the other. These are the on-policy or off-policy nature of the algorithms, and their ability to handle large action spaces. We already discussed the differences between on-policy and off-policy in the previous...
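To make the contrast concrete, here is a minimal sketch (written in PyTorch; the network sizes and names are hypothetical, and the book's own examples may use a different framework). A value-based agent scores every discrete action with a Q-network and takes the argmax, while a policy-based agent evaluates its policy network to obtain a continuous action directly; it is the argmax step that becomes impractical when the action space is continuous or very large.

import torch
import torch.nn as nn

obs_dim, n_discrete_actions, act_dim = 8, 4, 2   # hypothetical sizes

# Value-based: the Q-network outputs one value per discrete action,
# and the greedy policy is the argmax over those values.
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_discrete_actions))

# Policy-based: the policy network outputs the (continuous) action directly.
pi_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                       nn.Linear(64, act_dim), nn.Tanh())

obs = torch.randn(1, obs_dim)
discrete_action = q_net(obs).argmax(dim=1)    # feasible only for discrete action spaces
continuous_action = pi_net(obs)               # works for continuous action spaces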

Deep deterministic policy gradient

If you implemented DPG with the deep neural networks that were presented in the previous section, the algorithm would be very unstable and incapable of learning anything. We encountered a similar problem when we extended Q-learning with deep neural networks: to combine DNNs and Q-learning in the DQN algorithm, we had to employ a few additional tricks to stabilize learning. The same holds true for DPG algorithms. These methods are off-policy, just like Q-learning, and as we'll soon see, some of the ingredients that make deterministic policies work with DNNs are similar to those used in DQN.
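As a hedged illustration of the kind of stabilization borrowed from DQN, the following sketch (hypothetical names and values, written in PyTorch) shows the two main ingredients: a replay buffer that stores off-policy transitions, and a target network that tracks the online network slowly through Polyak averaging so that the bootstrapped targets do not move too quickly.

import copy
import random
from collections import deque

import torch

buffer = deque(maxlen=100_000)     # replay buffer with an illustrative capacity

def store(transition):
    buffer.append(transition)      # transition = (obs, action, reward, next_obs, done)

def sample(batch_size=64):
    return random.sample(buffer, batch_size)

def soft_update(online_net, target_net, tau=0.005):   # tau is an illustrative value
    # Polyak averaging: target <- (1 - tau) * target + tau * online
    with torch.no_grad():
        for p, p_targ in zip(online_net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)

critic = torch.nn.Linear(10, 1)        # stand-in for the real critic network
critic_target = copy.deepcopy(critic)
soft_update(critic, critic_target)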

DDPG (Continuous Control with Deep Reinforcement Learning by Lillicrap and others: https://arxiv.org/pdf/1509.02971.pdf) is the first deterministic actor-critic algorithm that employs deep neural networks for learning both the actor and the critic...
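To give a feel for the two updates, here is a minimal DDPG-style training step (a sketch with hypothetical network sizes and learning rates, written in PyTorch; the chapter's own implementation may differ): the critic is regressed toward a bootstrapped target computed with the target networks, and the actor is updated by ascending the critic's value of its own actions.

import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99    # hypothetical dimensions and discount factor

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_targ, critic_targ = copy.deepcopy(actor), copy.deepcopy(critic)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(obs, act, rew, next_obs, done):
    # Critic: minimize the TD error against a target built with the
    # *target* actor and critic, as in DQN.
    with torch.no_grad():
        next_act = actor_targ(next_obs)
        target = rew + gamma * (1 - done) * critic_targ(torch.cat([next_obs, next_act], dim=1))
    q = critic(torch.cat([obs, act], dim=1))
    critic_loss = ((q - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, that is, maximize Q(s, mu(s)).
    actor_loss = -critic(torch.cat([obs, actor(obs)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

# Example call on a random batch of 64 placeholder transitions:
b = 64
ddpg_update(torch.randn(b, obs_dim), torch.randn(b, act_dim),
            torch.randn(b, 1), torch.randn(b, obs_dim), torch.zeros(b, 1))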

Twin delayed deep deterministic policy gradient (TD3)

DDPG is regarded as one of the most sample-efficient actor-critic algorithms, but it has been shown to be brittle and sensitive to hyperparameters. Further studies have tried to alleviate these problems by introducing novel ideas or by applying tricks from other algorithms on top of DDPG. Recently, one algorithm has taken over as a replacement for DDPG: twin delayed deep deterministic policy gradient, or TD3 for short (the paper is Addressing Function Approximation Error in Actor-Critic Methods: https://arxiv.org/pdf/1802.09477.pdf). We use the word replacement because TD3 is actually a continuation of the DDPG algorithm, with a few additional ingredients that make it more stable and more performant.

TD3 focuses on some of the problems that are also common in other off-policy algorithms. These problems are the...
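As a preview of the ingredients TD3 adds on top of DDPG, the following sketch (hypothetical names, written in PyTorch; the noise and delay settings are illustrative values in the spirit of the TD3 paper) shows the three key ideas: clipped double Q-learning with twin critics, target policy smoothing, and delayed policy updates.

import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99                  # hypothetical sizes and discount
policy_noise, noise_clip, policy_delay = 0.2, 0.5, 2  # illustrative TD3-style settings

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim), nn.Tanh())
q1 = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q2 = copy.deepcopy(q1)                                # twin critics
actor_targ, q1_targ, q2_targ = copy.deepcopy(actor), copy.deepcopy(q1), copy.deepcopy(q2)

def td3_target(rew, next_obs, done):
    with torch.no_grad():
        # 1) Target policy smoothing: perturb the target action with clipped noise.
        noise = (torch.randn(next_obs.shape[0], act_dim) * policy_noise).clamp(-noise_clip, noise_clip)
        next_act = (actor_targ(next_obs) + noise).clamp(-1.0, 1.0)
        sa = torch.cat([next_obs, next_act], dim=1)
        # 2) Clipped double Q-learning: bootstrap from the minimum of the twin
        #    target critics to counteract overestimation.
        q_min = torch.min(q1_targ(sa), q2_targ(sa))
        return rew + gamma * (1 - done) * q_min

# 3) Delayed policy updates: in the training loop, the actor and all target networks
#    are refreshed only once every `policy_delay` critic updates, for example:
#    if step % policy_delay == 0: update_actor(); soft_update_all_targets()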

Summary

In this chapter, we approached two different ways of solving an RL problem. The first is through the estimation of state-action values that are used to choose the best next action; these are the so-called Q-learning algorithms. The second involves maximizing the expected return directly through the gradient of the policy; these methods are called policy gradient methods. We showed the advantages and disadvantages of both approaches, and demonstrated that in many respects they are complementary. For example, Q-learning algorithms are sample efficient but cannot deal with continuous action spaces. Policy gradient algorithms, on the other hand, require more data but can control agents with continuous actions. We then introduced DPG methods, which combine Q-learning and policy gradient techniques. In particular, these methods overcome the global maximization step of the Q-learning algorithms...
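As a compact reminder of how that maximization is sidestepped (standard DPG notation, not quoted from the chapter): instead of computing a = argmax_a Q(s, a), the deterministic policy a = mu_theta(s) is trained by following the deterministic policy gradient:

\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\left[ \left. \nabla_a Q^{\mu}(s, a) \right|_{a = \mu_\theta(s)} \, \nabla_\theta \mu_\theta(s) \right]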

Questions

  1. What is the primary limitation of Q-learning algorithms?
  2. Why are stochastic gradient algorithms sample inefficient?
  3. How does DPG overcome the maximization problem?
  4. How does DPG guarantee enough exploration?
  5. What does DDPG stand for? And what is its main contribution?
  6. What problems does TD3 propose to minimize?
  7. What new mechanisms does TD3 employ?

Further reading

You can use the following links to learn more:
