Chapter 6: Deep Q-Learning at Scale

In the previous chapter, we covered dynamic programming (DP) methods to solve Markov decision processes, and then mentioned that they suffer from two important limitations: DP i) assumes complete knowledge of the environment's reward and transition dynamics, and ii) uses tabular representations of states and actions, which does not scale, since the number of possible state-action combinations is prohibitively large in many realistic applications. We addressed the former by introducing the Monte Carlo (MC) and temporal-difference (TD) methods, which learn from their interactions with the environment (often in simulation) without needing to know the environment dynamics. The latter, on the other hand, is yet to be addressed, and this is where deep learning comes in. Deep reinforcement learning (deep RL or DRL) is about utilizing the representational power of neural networks to learn policies for a wide variety of situations.

As great as it sounds, though, it is...

From tabular Q-learning to deep Q-learning

When we covered the tabular Q-learning method in Chapter 5, Solving the Reinforcement Learning Problem, it should have been obvious that we cannot really extend those methods to most real-life scenarios. Think about an RL problem that uses images as input. Even a modestly sized image with three 8-bit color channels corresponds to an astronomically large number of possible images, a number that your calculator won't be able to handle. For this very reason, we need to use function approximators to represent the value function. Given their success in supervised and unsupervised learning, neural networks/deep learning emerge as the clear choice here. On the other hand, as we mentioned in the introduction, the convergence guarantees of tabular Q-learning fall apart when function approximators come in. This section introduces two deep Q-learning algorithms, Neural Fitted Q-iteration and online Q-learning, and then discusses their shortcomings. With that, we set the...
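To make this concrete, here is a minimal sketch of a Q-network as a function approximator, written with tf.keras; the build_q_network name, layer sizes, and fully connected architecture are illustrative assumptions rather than the chapter's exact model:

import tensorflow as tf

def build_q_network(state_dim, n_actions):
    # Maps a state vector to one Q-value estimate per discrete action,
    # replacing a row lookup in a Q-table with a parameterized function
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_actions),
    ])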

Deep Q-networks

DQN is a seminal work by Mnih et al. (2015) that made deep RL a viable approach to complex sequential control problems. The authors demonstrated that a single DQN architecture can achieve superhuman performance in many Atari games without any feature engineering, which created a lot of excitement about the progress of AI. Let's look into what makes DQN so effective compared to the algorithms we mentioned earlier.

Key concepts in deep Q-networks

DQN modifies online Q-learning with two important concepts, experience replay and a target network, which greatly stabilize learning. We describe these concepts next.

Experience replay

As mentioned earlier, simply using the experience sampled sequentially from the environment leads to highly correlated gradient steps. DQN, on the other hand, stores those experience tuples, (s, a, r, s'), in a replay buffer (memory), an idea that was introduced back in 1993 (Lin, 1993). During learning, the samples...
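To make the idea concrete, the following is a minimal sketch of a replay buffer; the ReplayBuffer class, its capacity, and the batch size are illustrative assumptions rather than the chapter's implementation:

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100000):
        # Oldest transitions are evicted once the buffer is full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        # Store one (s, a, r, s', done) transition
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions before the gradient step
        return random.sample(self.buffer, batch_size)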

Extensions to DQN: Rainbow

The Rainbow improvements bring a significant performance boost over vanilla DQN, and they have become standard in most Q-learning implementations. In this section, we discuss what those improvements are, how they help, and what their relative importance is. At the end, we discuss how DQN and these extensions collectively overcome the deadly triad.

The extensions

There are six extensions to DQN included in the Rainbow algorithm. These are: i) double Q-learning, ii) prioritized replay, iii) dueling networks, iv) multi-step learning, v) distributional RL, and vi) noisy nets. Let's start with double Q-learning.

Double Q-learning

One of the well-known issues in Q-learning is that the Q-value estimates we obtain during learning are higher than the true Q-values because of the maximization operation, max_a' Q(s', a'). This phenomenon is called maximization bias, and the reason we run into it is that we do a maximization operation over noisy observations...
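To illustrate how double Q-learning addresses this bias, here is a hedged sketch of the target computation for a mini-batch; the function names and gamma value are illustrative, and terminal-state masking is omitted for brevity:

import numpy as np

def dqn_targets(rewards, next_q_target, gamma=0.99):
    # Vanilla DQN: the target network both selects and evaluates the next
    # action, so the max operates directly on noisy estimates
    return rewards + gamma * next_q_target.max(axis=1)

def double_dqn_targets(rewards, next_q_online, next_q_target, gamma=0.99):
    # Double DQN: the online network selects the action and the target
    # network evaluates it, which reduces the maximization bias
    best_actions = next_q_online.argmax(axis=1)
    return rewards + gamma * next_q_target[np.arange(len(rewards)), best_actions]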

Distributed deep Q-learning

Deep learning models are notorious for their hunger for data. When it comes to reinforcement learning, the hunger for data is even greater, which mandates parallelizing data collection while training RL models. The original DQN model is a single-threaded process. Despite its great success, its scalability is limited. In this section, we present methods to parallelize deep Q-learning across many (possibly thousands of) processes.

The key insight behind distributed Q-learning is its off-policy nature, which virtually decouples the training from experience generation. In other words, the specific processes/policies that generate the experience do not matter to the training process (although there are caveats to this statement). Combined with the idea of using a replay buffer, this allows us to parallelize the experience generation and store the data in central or distributed replay buffers. In addition, we can parallelize how the data is sampled from these...
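As a rough illustration of this decoupling (not the chapter's implementation), the sketch below uses Ray, which we introduce in the next section, to generate experience in parallel actors while a central buffer collects it; the RolloutWorker class and the dummy transitions are purely illustrative:

import random
import ray

@ray.remote
class RolloutWorker:
    def collect(self, n=100):
        # In a real system this would step an environment with the current
        # policy; here we return dummy (s, a, r, s', done) transitions
        return [(None, 0, random.random(), None, False) for _ in range(n)]

ray.init()
workers = [RolloutWorker.remote() for _ in range(4)]
# Experience generation runs in parallel, independently of the learner,
# which is what the off-policy nature of Q-learning allows
batches = ray.get([w.collect.remote() for w in workers])
replay_buffer = [t for batch in batches for t in batch]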

Implementing scalable deep Q-learning algorithms using Ray

In this section, we will implement a parallelized DQN variant using the Ray library. Ray is a powerful, general-purpose, yet simple framework for building and running distributed applications on a single machine as well as on large clusters. Ray has been built with applications that have heterogeneous computational needs in mind. This is exactly what modern DRL algorithms require, as they involve a mix of long- and short-running tasks, usage of GPU and CPU resources, and more. In fact, Ray itself has a powerful RL library called RLlib. Both Ray and RLlib have been increasingly adopted in academia and industry.

Info

For a comparison of Ray to other distributed backend frameworks such as Spark and Dask, see https://bit.ly/2T44AzK. You will see that Ray is a very competitive alternative, even beating Python's own multiprocessing implementation in some benchmarks.
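As a quick taste of Ray's programming model (a generic example, not code from this chapter), turning an ordinary Python function into a parallel task only takes a decorator:

import ray

ray.init()  # start Ray on the local machine

@ray.remote
def square(x):
    # A regular Python function, now runnable as an asynchronous Ray task
    return x * x

# Launch eight tasks in parallel and block until all results are ready
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]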

Writing a production-grade distributed application...

RLlib: Production-grade deep reinforcement learning

As we mentioned at the beginning, one of the motivations of Ray's creators was to build an easy-to-use distributed computing framework that can handle complex and heterogeneous applications such as deep reinforcement learning. With that, they also created a widely used deep RL library based on Ray. Training a model similar to ours is very simple using RLlib. The main steps are:

  • Import the default training configs for Ape-X DQN as well as the trainer,
  • Customize the training configs,
  • Train the trainer.

That's it! The code necessary for that is very simple. All you need is the following:

Chapter06/rllib_apex_dqn.py

import pprint
from ray import tune
from ray.rllib.agents.dqn.apex import APEX_DEFAULT_CONFIG
from ray.rllib.agents.dqn.apex import ApexTrainer
if __name__ == '__main__':
    config = APEX_DEFAULT_CONFIG.copy()
    pp = pprint.PrettyPrinter(indent=4)
    pp.pprint(config)  # inspect the default Ape-X DQN hyperparameters
    # Customize the training configs; the env and worker count below
    # are placeholders, adjust them for your setup
    config["env"] = "CartPole-v0"
    config["num_workers"] = 4
    # Train the trainer with Tune
    tune.run(ApexTrainer, config=config)

Summary

In this chapter, we have come a long way from tabular Q-learning to implementing a modern, distributed deep Q-learning algorithm. Along the way, we covered the details of Neural Fitted Q-iteration, online Q-learning, DQN with the Rainbow improvements, and the Gorila and Ape-X DQN algorithms. We also introduced you to Ray and RLlib, which are powerful frameworks for distributed computing and deep reinforcement learning.

In the next chapter, we will go into another class of deep RL algorithms: policy-based methods. These methods will allow us to directly learn stochastic policies and work with continuous action spaces.

References

  1. Sutton, R. S., Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press. URL: http://incompleteideas.net/book/the-book.html
  2. Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
  3. Riedmiller, M. (2005). Neural Fitted Q Iteration – First Experiences with a Data Efficient Neural Reinforcement Learning Method. In: Gama, J., Camacho, R., Brazdil, P. B., Jorge, A. M., Torgo, L. (eds) Machine Learning: ECML 2005. Lecture Notes in Computer Science, vol 3720. Springer, Berlin, Heidelberg.
  4. Lin, L. (1993). Reinforcement learning for robots using neural networks.
  5. McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419–457.
  6. van Hasselt...