Deep Reinforcement Learning Hands-On

By Maxim Lapan

About this book

Deep Reinforcement Learning Hands-On is a comprehensive guide to the very latest DL tools and their limitations. You will evaluate methods including Cross-entropy and policy gradients, before applying them to real-world environments. Take on both the Atari set of virtual games and family favorites such as Connect4.

The book provides an introduction to the basics of RL, giving you the know-how to code intelligent learning agents to take on a formidable array of practical tasks. Discover how to implement Q-learning on 'grid world' environments, teach your agent to buy and trade stocks, and find out how natural language models are driving the boom in chatbots.

Publication date:
June 2018


Chapter 1. What is Reinforcement Learning?

Reinforcement Learning is a subfield of machine learning which addresses the problem of automatic learning of optimal decisions over time. This is a general and common problem studied in many scientific and engineering fields.

In our changing world, even problems which look like static input-output problems become dynamic in a larger perspective. For example, consider that you're solving the simple supervised learning problem of pet image classification with two target classes—dog and cat. You've gathered the training dataset and implemented the classifier using your favorite deep learning toolkit, and after a while, the model that has converged demonstrates excellent performance. Good? Definitely! You've deployed it and left it running for a while. Then, after a vacation on some seaside resort, you discover that dog haircut fashions have changed, and a significant portion of your queries are now misclassified, so you need to update your training images and repeat the process again. Good? Definitely not!

The preceding example is intended to show that even simple Machine Learning (ML) problems have a hidden time dimension, which is frequently overlooked, but it might become an issue in a production system.

Reinforcement Learning (RL) is an approach that natively incorporates this extra dimension (which is usually time, but not necessarily) into learning equations, which puts it much closer to the human perception of artificial intelligence. In this chapter, we will become familiar with the following:

  • How RL is related to and differs from other ML disciplines: supervised and unsupervised learning

  • What the main RL formalisms are and how they are related to each other

  • Theoretical foundations of RL: the Markov decision processes


Learning – supervised, unsupervised, and reinforcement

You may be familiar with the notion of supervised learning, which is the most studied and well-known machine learning problem. Its basic question is: how do you automatically build a function that maps some input into some output, when given a set of example pairs? It sounds simple in those terms, but the problem includes many tricky questions that computers have only recently started to handle with some success. There are lots of examples of supervised learning problems, including the following:

  • Text classification: Is this email message spam or not?

  • Image classification and object location: Does this image contain a picture of a cat, dog, or something else?

  • Regression problems: Given the information from weather sensors, what will be the weather tomorrow?

  • Sentiment analysis: What's the customer satisfaction level of this review?

These questions can look different, but they share the same idea: we have many examples of the input and desired output, and we want to learn how to generate the output for some future, currently unseen inputs. The name supervised comes from the fact that we learn from the known answers, which were obtained from some supervisor who has provided us with those labeled examples.

At the other extreme, we have so-called unsupervised learning, which assumes no supervision and has no known labels assigned to our data. The main objective is to learn some hidden structure of the dataset at hand. One common example of such an approach to learning is the clustering of data. This happens when our algorithm tries to combine data items into a set of clusters, which can reveal relationships in data.

Another unsupervised learning method that is becoming more and more popular is Generative Adversarial Networks (GANs). Here we have two competing neural networks: the first is trying to generate fake data to fool the second network, while the second is trying to discriminate artificially generated data from data sampled from our dataset. Over time, both of them become more and more skillful in their tasks by capturing subtle specific patterns of the dataset.

RL is the third camp and lies somewhere in between full supervision and a complete lack of predefined labels. On the one hand, it uses many well-established methods of supervised learning, such as deep neural networks for function approximation, stochastic gradient descent, and backpropagation, to learn data representation. On the other hand, it usually applies them in a different way.

In the next two sections of the chapter, we'll have the chance to explore specific details of the RL approach including its assumptions and abstractions in its strict mathematical form. For now, to compare RL to supervised and unsupervised learning, we'll take a less formal, but more intuitive description.

Imagine you have an agent that needs to take actions in some environment. A robot mouse in a maze is a good example, but we can also imagine an automatic helicopter trying to make a roll, or a chess program learning how to beat a grandmaster. Let's go with the robot mouse for simplicity.

Figure 1: Robot mouse maze world

Its environment is a maze with food at some points and electricity at others. The robot mouse can take actions such as turn left/right and move forward. Finally, at every moment it can observe the full state of the maze to make a decision about the actions it may take. It is trying to find as much food as possible, while avoiding an electric shock whenever possible. These food and electricity signals stand as a reward given to the agent by the environment as additional feedback about the agent's actions. The reward is a very important concept in RL, and we'll talk about it later in the chapter. For now, it will be enough to understand that the final goal of the agent is to get as much total reward as possible. In our particular example, the mouse could suffer a bit of an electric shock to get to the place with plenty of food—this will be a better result for the mouse than just standing still and gaining nothing.

We don't want to hard-code knowledge about the environment and the best actions to take in every specific situation into the robot—it will take too much effort and may become useless even with a slight maze change. What we want to do is to have some magic set of methods that will allow our robot to learn on its own how to avoid electricity and gather as much food as possible.

Reinforcement Learning is exactly this magic toolbox, which plays differently from supervised and unsupervised learning methods. It doesn't work with predefined labels as supervised learning does. Nobody labels all the images the robot sees as good or bad or gives it the best direction to turn in.

However, we're not completely blind as in an unsupervised learning setup—we have a reward system. Rewards can be positive from gathering the food, negative from electric shocks, or neutral when nothing special happens. By observing such a reward and relating it to the actions we've taken, our agent learns how to perform an action better, gather more food, and get fewer electric shocks.

Of course, RL's generality and flexibility come at a price. RL is considered to be a much more challenging area than supervised and unsupervised learning. Let's quickly discuss what makes Reinforcement Learning tricky.

The first thing to note is that observation in RL depends on an agent's behavior and, to some extent, is the result of that behavior. If your agent decides to do inefficient things, then the observations will tell you nothing about what it has done wrong and what should be done to improve the outcome (the agent will just get negative feedback all the time). If the agent is stubborn and keeps making mistakes, then the observations can create the false impression that there is no way to get a larger reward (life is suffering), which could be totally wrong. In machine learning terms, this can be rephrased as having non-i.i.d. data. The abbreviation i.i.d. stands for independent and identically distributed, a requirement for most supervised learning methods.

The second thing that complicates our agent's life is that they need to not only exploit the policy they have learned, but to actively explore the environment, because, who knows, maybe by doing things differently we can significantly improve the outcome we get. The problem is that too much exploration may also seriously decrease the reward (not to mention that the agent can actually forget what they have learned before), so, we need to find a balance between these two activities somehow. This exploration/exploitation dilemma is one of the open fundamental questions in RL.

People face this choice all the time: should I go to an already known place for dinner or try this new fancy restaurant? How frequently should you change jobs? Should you study a new field or keep working in your area? There are no universal answers to these questions.
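
To make the exploration/exploitation trade-off a bit more concrete, here is a minimal sketch of one simple heuristic, usually called epsilon-greedy, which the chapter has not introduced at this point. Everything in it is an assumption made for illustration: the three "restaurants" with hypothetical average payoffs, the Gaussian noise, and the function names. With probability epsilon the agent tries a random option; otherwise it goes to the place it currently believes is best.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit


def run_bandit(true_means, epsilon, steps=2000, seed=0):
    """Play a hypothetical multi-armed bandit and return the average reward."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # current belief about each option
    counts = [0] * len(true_means)       # how often each option was tried
    total = 0.0
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)  # noisy payoff
        counts[action] += 1
        # Incremental update of the running mean for the chosen option
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return total / steps


if __name__ == "__main__":
    restaurants = [1.0, 1.5, 3.0]  # hypothetical average enjoyment per place
    for eps in (0.0, 0.1, 1.0):
        print(f"epsilon={eps}: average reward {run_bandit(restaurants, eps):.2f}")
```

Setting epsilon to 0.0 means pure exploitation and 1.0 means pure exploration; running the script with values in between is a quick way to see how the balance affects the average reward in this toy setting.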

The third complicating factor lies in the fact that reward can be seriously delayed from actions. In the case of chess, it can be one single strong move in the middle of the game that shifts the balance. During learning, we need to discover such causal links, which can be tricky to do over the flow of time and our actions.

However, despite all these obstacles and complications, RL has made huge improvements over recent years and is becoming more and more active as a field of research and practical application.

Interested? Let's get to the details and look at RL formalisms and the rules of the game.


RL formalisms and relations

Every scientific and engineering field has its own assumptions and limitations. In the previous section, we discussed supervised learning, in which the assumption is that input-output pairs are known. No labels for your data? Sorry, you need to figure out how to obtain labels or try to use some other theory. It doesn't make supervised learning good or bad, it just makes it inapplicable to your problem. It's important to know and understand the rules of the game for various methods, as it can save you tons of time in advance. However, we know there are many examples of practical and theoretical breakthroughs, when somebody tried to challenge the rules in a creative way. To do this, you should first of all know the limitations.

Of course, such formalisms exist for RL, and now it is the right time to introduce them, as we'll spend the rest of the book analyzing them from various angles. The following diagram shows the two major RL entities, Agent and Environment, and their communication channels: Actions, Reward, and Observations:

Figure 2: RL entities and their communications


The first thing to discuss is the notion of reward. In RL, it's just a scalar value we obtain periodically from the environment. It can be positive or negative, large or small, but it's just a number. The purpose of reward is to tell our agent how well it has behaved. We don't define how frequently the agent receives this reward; it can be every second or once in a lifetime, although it's common practice to receive a reward at every fixed time step or at every environment interaction, just for convenience. In the case of once-in-a-lifetime reward systems, all rewards except the last one will be zero.

As I mentioned, the purpose of a reward is to give an agent feedback about its success, and it's an important central thing in RL. Basically, the term reinforcement comes from the fact that a reward obtained by an agent should reinforce its behavior in a positive or negative way. Reward is local, meaning, it reflects the success of the agent's recent activity, not all the successes achieved by the agent so far. Of course, getting a large reward for some action doesn't mean that a second later you won't face dramatic consequences from your previous decisions. It's like robbing a bank: it could look like a good idea until you think about the consequences.

What an agent is trying to achieve is the largest accumulated reward over its sequence of actions. To give you a more intuitive understanding of reward, let's list some concrete examples with their rewards:

  • Financial trading: An amount of profit is a reward for a trader buying and selling stocks.

  • Chess: Here, reward is obtained at the end of the game, as a win, loss, or draw. Of course, it's up to interpretation. For me, for example, having a draw in a match against a chess master would be a huge reward. In practice, we need to explicitly specify the exact reward value, but it could be a fairly complicated expression. For instance, in the case of chess, the reward could be proportional to the opponent's strength.

  • Dopamine system in a brain: There is a part in the brain (limbic system) that produces dopamine every time it needs to send a positive signal to the rest of the brain. Higher concentrations of dopamine lead to a sense of pleasure, which reinforces activities considered by this system as good. Unfortunately, the limbic system is ancient in terms of things it considers good: food, reproduction, and dominance, but this is a totally different story.

  • Computer games: They usually give obvious feedback to the player, which is either the number of enemies killed or a score gathered. Note in this example that reward is already accumulated, so the RL reward for arcade games should be the derivative of the score, that is, +1 every time a new enemy is killed and 0 at all other time steps.

  • Web navigation: There is a set of problems with high practical value, which is to be able to automatically extract information present on the web. Search engines are trying to solve this task in general, but sometimes, to get to the data you're looking for, you need to fill in some forms, navigate through a series of links, or complete captchas, which can be difficult for search engines to do. There is an RL-based approach to those tasks, in which the reward is the information or the outcome you need to get.

  • Neural network architecture search: RL has been successfully applied to the domain of NN architecture optimization, where the aim is to get the best performance metric on some dataset by tweaking the number of layers or their parameters, adding extra bypass connections, or making other changes to the neural network architecture. The reward in this case is the performance (accuracy or another measure showing how accurate the NN predictions are).

  • Dog training: If you have ever tried to train a dog, you know that you need to give it something tasty (but not too much) every time it does the thing you've asked. It's also common to punish your pet a bit (negative reward) when it doesn't follow your orders, although recent studies have shown this isn't as effective as positive rewards.

  • School marks: We all have experience here! School marks are a reward system to give pupils feedback about their studying.

As you can see from the preceding examples, the notion of reward is a very general indication of the agent's performance, and it can be found or artificially injected into lots of practical problems around us.
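
The computer games example above notes that an on-screen score is already accumulated, so the per-step reward is the change in score. A minimal sketch of that conversion follows; the score sequence and the function name are hypothetical and only serve the illustration.

```python
def rewards_from_scores(scores):
    """Turn a sequence of cumulative scores into per-step rewards."""
    return [curr - prev for prev, curr in zip(scores, scores[1:])]


# Hypothetical on-screen score observed at consecutive time steps
scores = [0, 0, 1, 1, 1, 2, 3]
rewards = rewards_from_scores(scores)

print(rewards)       # [0, 1, 0, 0, 1, 1]
print(sum(rewards))  # 3: the accumulated reward equals the gain in score
```

Summing the per-step rewards recovers the score gained over the episode, which is the accumulated reward the agent is trying to maximize.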

The agent

An agent is somebody or something that interacts with the environment by executing certain actions, taking observations, and receiving eventual rewards for this. In most practical RL scenarios, it's our piece of software that is supposed to solve some problem in a more-or-less efficient way. For our set of examples, the agents will be one of these:

  • Financial trading: A trading system or a trader making decisions about order execution

  • Chess: A player or a computer program

  • Dopamine system: The brain itself, according to sensory data, decides if it was a good experience or bad

  • Computer games: The player who enjoys the game or the computer program (Andrej Karpathy once stated in a tweet, "We were supposed to make AI do all the work and we play games but we do all the work and the AI is playing games!")

  • Web navigation: The software that tells the browser which links to click on, where to move the mouse, or which text to enter

  • Neural network architecture search: The software that controls the concrete architecture of the neural network being evaluated

  • Dog training: Your beloved pet

  • School: Student/pupil

The environment

The environment is everything outside of an agent. In the most general sense, it's the rest of the universe, but this goes slightly overboard and exceeds the capacity of even tomorrow's computers, so we usually follow common sense here.

The environment is external to an agent, and the agent's communication with the environment is limited to rewards (obtained from the environment), actions (executed by the agent and given to the environment), and observations (some information besides the rewards that the agent receives from the environment). We discussed rewards already, so let's talk about actions and observations.


Actions are things that an agent can do in the environment. Actions can be moves allowed by the rules of play (if it's some game), or it can be doing homework (in the case of school). They can be simple such as move pawn one space forward, or complicated such as fill the tax form in for tomorrow morning.

In RL, we distinguish between two types of actions: discrete and continuous. Discrete actions form a finite set of mutually exclusive things an agent could do, such as move left or right. Continuous actions have some value attached to them, such as a car's steer the wheel action, which has an angle and a direction of steering. Different angles could lead to a different scenario a second later, so just saying steer the wheel is definitely not enough.


Observations of the environment are the second information channel for an agent, with the first being the reward. You may be wondering, why do we need a separate data source? The answer is convenience. Observations are pieces of information that the environment provides the agent with, which say what's going on around it. They may be relevant to the upcoming reward (such as seeing a bank notification saying, You have been paid) or not. Observations can even include reward information in some vague or obfuscated form, such as score numbers on a computer game's screen. Score numbers are just pixels, but potentially we can convert them into reward values; it's not a big deal with modern deep learning at hand.

On the other hand, reward shouldn't be seen as a secondary or unimportant thing: the reward is the main force that drives the agent's learning process. If the reward is wrong, noisy, or just slightly off course from the primary objective, then there is a chance that training will go in the wrong direction.
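
The three communication channels described above (actions, rewards, and observations) can be sketched as a tiny agent/environment pair. This is only an illustration under stated assumptions: the class and method names, the fixed episode length, the dummy observation, and the random rewards are all hypothetical and are not taken from any library.

```python
import random


class Environment:
    """Hypothetical toy environment that ends after a fixed number of steps."""

    def __init__(self, steps=10, seed=0):
        self.steps_left = steps
        self.rng = random.Random(seed)

    def get_observation(self):
        # Observation channel: a dummy vector, since this toy has no real state
        return [0.0, 0.0, 0.0]

    def get_actions(self):
        # A discrete action space with two mutually exclusive actions
        return [0, 1]

    def is_done(self):
        return self.steps_left == 0

    def action(self, action):
        # Action channel in, reward channel out; the reward here is random
        # and ignores the action, which keeps the sketch as small as possible
        if self.is_done():
            raise RuntimeError("Episode is over")
        self.steps_left -= 1
        return self.rng.random()


class Agent:
    """Hypothetical agent that acts randomly and accumulates its reward."""

    def __init__(self, seed=1):
        self.total_reward = 0.0
        self.rng = random.Random(seed)

    def step(self, env):
        obs = env.get_observation()  # a real agent would base its choice on obs
        action = self.rng.choice(env.get_actions())
        reward = env.action(action)
        self.total_reward += reward


if __name__ == "__main__":
    env = Environment()
    agent = Agent()
    while not env.is_done():
        agent.step(env)
    print(f"Total reward: {agent.total_reward:.4f}")
```

The point of the sketch is the shape of the loop: observe, act, receive a scalar reward, and accumulate it. Nothing in it learns yet.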

It's also important to distinguish between an environment's state and observations. The state of an environment potentially includes every atom in the universe, which makes it impossible to measure everything about the environment. Even if we limit the environment's state to be small enough, most of the time it's either still not possible to get full information or our measurements will contain noise. This is completely fine though, and RL was created to support such cases natively. Once again, let's support our intuition with our set of examples to capture the difference:

  • Financial trading: Here the environment is the whole financial market and everything that influences it. This is a huge list of things such as the latest news, economic and political conditions, weather, food supplies, and Twitter trends. Even your decision to stay home today can potentially indirectly influence the world financial system. However, our observations are limited to stock prices, news, and so on. We don't have access to most of the environment's state, which makes trading such a nontrivial thing.

  • Chess: The environment here is your board plus your opponent, which includes their chess skills, mood, brain state, chosen tactics, and so on. Observation is what you see (your current chess position), but, I guess, at some levels of play mastery, the knowledge of psychology and ability to read an opponent's mood could increase your chances.

  • Dopamine system: The environment here is your brain plus the nervous system and organs' states, plus the whole world you can perceive. Observations are the inner brain state and signals coming from your senses.

  • Computer game: Here, the environment is your computer's state, including all memory and disk data. For networked games, you need to include other computers plus all the internet infrastructure between them and your machine. Observations are a screen's pixels and sound, that's it. A screen's pixels are not a tiny amount of information (somebody calculated that the total number of possible moderate-size 1024 × 768 images is significantly larger than the number of atoms in our galaxy), but the whole environment state is definitely larger.

  • Web navigation: The environment here is the internet, including all the network infrastructure between the computer on which our agent works and the web server, which is a really huge system that includes millions and millions of different components. Observation is normally the web page that is loaded at the current navigation step.

  • Neural network architecture search: In this example, the environment is fairly simple and includes the NN toolkit that performs the particular neural network evaluation and the dataset that is used to obtain the performance metric. In comparison to the internet, this looks like a tiny toy environment. Observations might be different and include some information about the testing, such as loss convergence dynamics or other metrics obtained from the evaluation step.

  • Dog training: Here the environment is your dog (including its hardly observable inner reactions, mood, and life experiences) and everything around it, including other dogs and a cat hiding in a bush. Observations are signals from your senses and memory.

  • School: The environment here is the school itself, the education system of the country, society, and the cultural legacy. Observations are the same as for the dog training: the student's senses and memory.

This is our mise en scène and we'll play around with it in the rest of the book. I think you've already noticed that the RL model is extremely flexible, general, and could be applied to a variety of scenarios. Let's look at how RL is related to other disciplines, before diving into the details of RL's model.

There are many other areas that contribute or relate to RL. The most significant are shown in the following diagram (taken from David Silver's RL course), which includes six large domains that heavily overlap each other on the methods and specific topics related to decision making (shown inside the inner gray circle). At the intersection of all those related but still different scientific areas sits RL, which is so general and flexible that it can take the best from these varying domains:

Figure 3: Various domains in RL

  • Machine learning (ML): RL, being a subfield of ML, borrows lots of its machinery, tricks, and techniques from ML. Basically, the goal of RL is to learn how an agent should behave when it is given imperfect observational data.

  • Engineering (especially optimal control): This helps in taking a sequence of optimal actions to get the best result.

  • Neuroscience: We saw the dopamine system as our example, and it has been shown that the human brain acts similarly to the RL model.

  • Psychology: This studies behavior in various conditions, such as how people react and adapt, which is close to the RL topic.

  • Economics: One of the important topics is how to maximize reward under imperfect knowledge and the changing conditions of the real world.

  • Mathematics: This works with idealized systems, and also devotes significant attention to finding and reaching the optimal conditions in the field of operations research.


Markov decision processes

In this part of the chapter, we'll get familiar with the theoretical foundation of RL, which makes it possible to start moving toward the methods used to solve the RL problem. This section is important to understand the rest of the book and will ensure that you familiarize yourself with RL. First, we introduce you to the mathematical representation and notation of the formalisms (reward, agent, actions, observations, and environment) we just discussed. Second, using this basis, we introduce you to the second-order notions of the RL language, including state, episode, history, value, and gain, which will be used repeatedly to describe different methods later in the book. Finally, our description of Markov decision processes is built like a Russian matryoshka doll: we start from the simplest case of a Markov Process (MP) (also known as a Markov chain), then extend it with rewards, which will turn it into a Markov reward process. Then we'll put this idea into one more envelope by adding actions, which will lead us to Markov Decision Processes (MDPs).

Markov processes and Markov decision processes are widely used in computer science and other engineering fields. So reading this chapter will be useful for you not only in RL contexts but also for a much wider range of topics.

If you're already familiar with MDPs, then you can quickly skim this chapter, paying attention only to the terminology definitions, as we'll use them later on.

Markov process

Let's start with the simplest child of the Markov family: the Markov process, also known as a Markov chain. Imagine that you have some system in front of you that you can only observe. What you observe is called states, and the system can switch between states according to some laws of dynamics. Again, you cannot influence the system, but only watch the states changing.

All possible states for a system form a set called the state space. In Markov processes, we require this set of states to be finite (but it can be extremely large to compensate for this limitation). Your observations form a sequence of states or a chain (that's why Markov processes are also called Markov chains). For example, looking at the simplest model of the weather in some city, we can observe the current day as sunny or rainy, which is our state space. A sequence of observations over time forms a chain of states, such as [sunny, sunny, rainy, sunny, …], and is called the history.

To call such a system an MP, it needs to fulfill the Markov property, which means that the future system dynamics from any state have to depend on this state only. The main point of the Markov property is to make every observable state self-contained to describe the future of the system. In other words, the Markov property requires the states of the system to be distinguishable from each other and unique. In this case, only one state is required to model the future dynamics of the system, not the whole history or, say, the last N states.
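
Written as a formula, the property described above says that conditioning on the whole history gives the same distribution over the next state as conditioning on the current state alone. The notation here is an assumption introduced for illustration: S_t denotes the state at step t, and s' is a candidate next state.

```latex
P(S_{t+1} = s' \mid S_t = s_t)
  = P(S_{t+1} = s' \mid S_1 = s_1, S_2 = s_2, \ldots, S_t = s_t)
```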

In the case of our toy weather example, the Markov property limits our model to represent only the cases when a sunny day can be followed by a rainy one with the same probability, regardless of the number of sunny days we've seen in the past. It's not a very realistic model, as from common sense we know that the chance of rain tomorrow depends not only on the current condition, but on a large number of other factors, such as the season, our latitude, and the presence of mountains and sea nearby. It was recently proven that even solar activity has a major influence on weather. So, our example is really naïve, but it's important to understand the limitations and make conscious decisions about them.

Of course, if we want to make our model more complex, we can always do this by extending our state space, which will allow us to capture more dependencies in the model at the cost of a larger state space. For example, if you want to capture separately the probability of rainy days during summer and winter, then you can include the season in your state. In this case, your state space will be [sunny+summer, sunny+winter, rainy+summer, rainy+winter] and so on.
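
As a small illustration of how quickly this extension grows, the sketch below builds the combined state space as the product of the weather and season values from the paragraph above and counts the entries a transition table over it would need. The variable names are arbitrary.

```python
from itertools import product

weather = ["sunny", "rainy"]
seasons = ["summer", "winter"]

# Every combination of weather and season becomes one state
states = [f"{w}+{s}" for w, s in product(weather, seasons)]

print(states)
# ['sunny+summer', 'sunny+winter', 'rainy+summer', 'rainy+winter']

# One probability per ordered pair of states
print(len(states) ** 2)  # 16
```

Two weather values alone need 4 transition probabilities; adding a two-valued season already requires 16.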

As your system model complies with the Markov property, you can capture transition probabilities with a transition matrix, which is a square matrix of size N×N, where N is the number of states in our model. The cell in row i and column j of the matrix contains the probability of the system transitioning from state i to state j.

For example, in our sunny/rainy example the transition matrix could be as follows:

  From \ To   Sunny   Rainy
  Sunny       0.8     0.2
  Rainy       0.1     0.9

In this case, if we have a sunny day, then there is an 80% chance that the next day will be sunny and a 20% chance that the next day will be rainy. If we observe a rainy day, then there is a 10% probability that the weather will become better and a 90% probability of the next day being rainy.
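
The probabilities just described are enough to simulate the weather model. The sketch below stores them as a nested dictionary and samples a chain of states; the function name, the fixed seed, and the dictionary layout are choices made for this illustration only.

```python
import random

STATES = ["sunny", "rainy"]

# T[current][next] is the probability of moving from current to next
T = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.1, "rainy": 0.9},
}


def sample_chain(start, length, seed=0):
    """Sample a sequence of `length` states, beginning with `start`."""
    rng = random.Random(seed)
    chain = [start]
    for _ in range(length - 1):
        row = T[chain[-1]]  # only the current state matters (Markov property)
        next_state = rng.choices(list(row), weights=list(row.values()))[0]
        chain.append(next_state)
    return chain


if __name__ == "__main__":
    print(sample_chain("sunny", 10))
```

Note that each row of the matrix sums to 1, because from any state the system must move to some state (possibly the same one).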

So, that's it. The formal definition of Markov process is as follows:

  • A set of states (S) that a system can be in

  • A transition matrix (T), with transition probabilities, which defines the system dynamics

A useful visual representation of an MP is a graph with nodes corresponding to system states and edges, labeled with probabilities, representing possible transitions from state to state. If the probability of a transition is 0, we don't draw an edge (there is no way to go from one state to the other). This kind of representation is also widely used for finite state machines, which are studied in automata theory. For our sunny/rainy weather model, the graph is as shown here:

Figure 4: Sunny/Rainy weather model
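
The same graph can be written down as a set of labeled edges. The following sketch converts a transition matrix into an edge dictionary and skips zero-probability transitions, mirroring the rule that such edges are not drawn. The function name and the edge representation are assumptions for this illustration.

```python
def to_edges(states, matrix):
    """Return {(source, target): probability} for all nonzero transitions."""
    return {
        (src, dst): p
        for src, row in zip(states, matrix)
        for dst, p in zip(states, row)
        if p > 0  # a zero probability means there is no edge to draw
    }


weather_states = ["sunny", "rainy"]
weather_matrix = [
    [0.8, 0.2],  # from sunny
    [0.1, 0.9],  # from rainy
]

for (src, dst), p in to_edges(weather_states, weather_matrix).items():
    print(f"{src} -> {dst}: {p}")
```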

Again, now we're talking about observation only. There is no way for us to influence the weather, so we just observe and record our observations.

To give you a more complicated example, we'll consider a model of an office worker (Dilbert, the main character in Scott Adams' famous cartoons, is a good example). His state space in our example has the following states:

  • Home: He's not at the office

  • Computer: He's working on his computer at the office

  • Coffee: He's drinking coffee at the office

  • Chatting: He's discussing something with colleagues at the office

The state transition graph looks like this:

Figure 5: State transition graph

We expect that his work day usually starts from the Home state and that he always starts his work day with Coffee, without exception (no Home → Computer edge and no Home → Chatting edge). The preceding diagram also shows that work days always end (that is, go to the Home state) from the Computer state. The transition matrix for the preceding diagram is as follows:


























The transition probabilities could be placed directly on the state transition graph, as shown here:

Figure 6: State transition graph with transition probabilities

In practice, we rarely have the luxury of knowing the exact transition matrix. A much more realistic situation is when we have only observations of our system's states, which are also called episodes:

  • home → coffee → coffee → chat → chat → coffee → computer → computer → home

  • computer → computer → chat → chat → coffee → computer → computer → computer

  • home → home → coffee → chat → computer → coffee → coffee

It's not complicated to estimate the transition matrix from our observations; we just count all the transitions from every state and normalize them to a sum of 1. The more observation data we have, the closer our estimate will be to the true underlying model.
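The counting-and-normalizing procedure can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_transitions` and the nested-dictionary layout are our own choices, and the episodes are the three listed above.

```python
from collections import Counter, defaultdict

# The three observed episodes from the text.
episodes = [
    ["home", "coffee", "coffee", "chat", "chat", "coffee",
     "computer", "computer", "home"],
    ["computer", "computer", "chat", "chat", "coffee",
     "computer", "computer", "computer"],
    ["home", "home", "coffee", "chat", "computer", "coffee", "coffee"],
]


def estimate_transitions(episodes):
    """Count observed transitions and normalize every row to a sum of 1."""
    counts = defaultdict(Counter)
    for episode in episodes:
        # Pair every state with its successor within the same episode.
        for source, target in zip(episode, episode[1:]):
            counts[source][target] += 1

    estimate = {}
    for source, target_counts in counts.items():
        total = sum(target_counts.values())
        estimate[source] = {t: n / total for t, n in target_counts.items()}
    return estimate


estimated = estimate_transitions(episodes)
for source, row in estimated.items():
    print(source, row)
```

With only three episodes the estimate is rough: the Chat row, for instance, comes out as 40% chat, 40% coffee, and 20% computer, rather than the 50%, 20%, and 30% of the underlying Dilbert process.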

It's also worth noting that the Markov property implies stationarity (that is, the underlying transition distribution for any state does not change over time). Nonstationarity means that there is some hidden factor that influences our system dynamics, and this factor is not included in observations. However, this contradicts the Markov property, which requires the underlying probability distribution to be the same for the same state regardless of the transition history. It's important to understand the difference between the actual transitions observed in an episode and the underlying distribution given in the transition matrix. Concrete episodes that we observe are randomly sampled from the distribution of the model, so they can differ from episode to episode. However, the probability of a concrete transition being sampled remains the same. If this is not the case, the Markov chain formalism becomes inapplicable.

Now we can go further and extend the Markov process model to make it closer to our RL problems. Let's add rewards to the picture!

Markov reward process

To introduce rewards, we need to extend our Markov process model a bit. First, we need to attach a value to each transition from state to state. We already have the probability, but it is used to capture the dynamics of the system, so the reward is simply an extra scalar number attached to the transition.

Reward can be represented in various forms. The most general way is to have another square matrix similar to the transition matrix with rewards for transitioning from state i to state j residing in row i and column j. Rewards can be positive or negative, large or small—it's just a number. In some cases, this representation is redundant and can be simplified. For example, if the reward is given for reaching the state regardless of the previous state, we can keep only state → reward pairs, which is a more compact representation. However, this is applicable only if the reward value depends only on the target state, which is not always the case.

The second thing we're adding to the model is the discount factor γ (gamma), a single number from 0 to 1 (inclusive). Its meaning will be explained later, after we define the extra characteristics of our Markov reward process.

As you remember, we observe a chain of state transitions in a Markov process. This is still the case for a Markov reward process, but for every transition, we have our extra quantity—reward. So now, all our observations have a reward value attached to every transition of the system.

For every episode, we define the return at time t as this quantity:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

Let's try to understand what this means. For every time point, we calculate the return as a sum of subsequent rewards, but more distant rewards are multiplied by the discount factor raised to the power of the number of steps we are away from the starting point at time t. The discount factor stands for the foresightedness of an agent. If gamma equals 1, then the return G_t is just the sum of all subsequent rewards and corresponds to an agent with perfect visibility of any subsequent rewards. If gamma equals 0, the return G_t is just the immediate reward without any subsequent rewards and corresponds to absolute short-sightedness.

These extreme values are rarely useful, and usually gamma is set to something in between, such as 0.9 or 0.99. In this case, we will look into future rewards, but not too far.

This gamma parameter is important in RL, and we'll meet it a lot in the subsequent chapters. For now, think about it as a measure of how far into the future we look to estimate the future return: the closer to 1, the more steps ahead of us we take into account.
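To get a feeling for how gamma changes the return, here is a minimal sketch that computes the discounted return of one reward sequence for several values of gamma. The function name `discounted_return` and the reward list are hypothetical and serve only as an illustration.

```python
def discounted_return(rewards, gamma):
    """Return the discounted sum of rewards, where rewards[k] is weighted by gamma ** k."""
    total = 0.0
    # Walking backwards lets us reuse the recursion G = r + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical rewards observed after time t in a single episode.
rewards = [1, 2, -1, 1, 3]

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(f"gamma={gamma}: return={discounted_return(rewards, gamma)}")
```

With gamma = 0 only the first reward survives, with gamma = 1 the result is the plain sum of all rewards, and values in between weight the later rewards less and less.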

This return quantity is not very useful in practice, as it is defined for every specific chain we observed from our Markov reward process, so it can vary widely even for the same state. However, if we go to the extreme and calculate the mathematical expectation of the return for any state (by averaging a large number of chains), we'll get a much more useful quantity, called the value of a state:

V(s) = \mathbb{E}[G_t \mid S_t = s]

The interpretation is simple: for every state s, the value V(s) is the average (or expected) return we get by following the Markov reward process.

To show how this theoretical stuff is related to practice, let's extend our Dilbert process with rewards and turn it into a Dilbert Reward Process (DRP). Our reward values will be as follows:

  • home → home: 1 (as it's good to be home)

  • home → coffee: 1

  • computer → computer: 5 (working hard is a good thing)

  • computer → chat: -3 (it's not good to be distracted)

  • chat → computer: 2

  • computer → coffee: 1

  • computer → home: 2

  • coffee → computer: 3

  • coffee → coffee: 1

  • coffee → chat: 2

  • chat → coffee: 1

  • chat → chat: -1 (long conversation becomes boring)

A diagram with rewards is shown here:

Figure 7: State transition graph with transition probabilities (dark) and rewards (light)

Let's return to our gamma parameter and think about the values of states with different values of gamma. We will start with a simple case: gamma = 0. How do you calculate the values of states here?

To answer this question, let's fix our state to Chat. What could the subsequent transition be? The answer is: it depends on chance. According to our transition matrix for the Dilbert process, there is a 50% probability that the next state will be Chat again, 20% that it will be Coffee, and in 30% of cases, we return to the Computer state. When gamma = 0, our return is equal only to the reward of the next immediate transition. So, if we want to calculate the value of the Chat state, we need to sum the rewards of all its outgoing transitions, each multiplied by its probability. Doing the same for the other states gives:

V(chat) = -1 * 0.5 + 2 * 0.3 + 1 * 0.2 = 0.3

V(coffee) = 2 * 0.7 + 1 * 0.1 + 3 * 0.2 = 2.1

V(home) = 1 * 0.6 + 1 * 0.4 = 1.0

V(computer) = 5 * 0.5 + (-3) * 0.1 + 1 * 0.2 + 2 * 0.2 = 2.8
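The same arithmetic can be checked with a short script. This is a sketch under a few assumptions: the dictionary layout and the function name `immediate_values` are ours, the 2 * 0.2 term in V(computer) is read as the computer → home transition, and the 0.6 and 0.4 in V(home) are assigned to home → home and home → coffee respectively (both carry a reward of 1, so this choice doesn't affect the result).

```python
# For every source state: target state -> (probability, reward).
drp = {
    "home": {"home": (0.6, 1), "coffee": (0.4, 1)},
    "coffee": {"chat": (0.7, 2), "coffee": (0.1, 1), "computer": (0.2, 3)},
    "chat": {"chat": (0.5, -1), "computer": (0.3, 2), "coffee": (0.2, 1)},
    "computer": {
        "computer": (0.5, 5),
        "chat": (0.1, -3),
        "coffee": (0.2, 1),
        "home": (0.2, 2),
    },
}


def immediate_values(process):
    """Expected immediate reward of every state, which is V(s) for gamma = 0."""
    return {
        state: sum(prob * reward for prob, reward in transitions.values())
        for state, transitions in process.items()
    }


values = immediate_values(drp)
for state, value in values.items():
    print(f"V({state}) = {value:.1f}")
```

Up to floating-point rounding, the script should reproduce the four values calculated by hand above.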

So, Computer is the most valuable state to be in (if we care only about immediate reward), which is not surprising as Computer → Computer is frequent, has a large reward, and the ratio of interruptions is not too high.

Now a trickier question: what's the value when gamma = 1? Think about this carefully.

The answer is: the value is infinite for all states. Our diagram doesn't contain sink states (states without outgoing transitions), and when our discount equals 1, we care about a potentially infinite number of transitions in the future. As we've seen in the case of gamma = 0, all our values are positive in the short term, so the sum of an infinite number of positive values will give us an infinite value, regardless of the starting state.

This infinite result shows us one of the reasons to introduce gamma into a Markov reward process, instead of just summing all future rewards. In most cases, the process can have an infinite (or large) number of transitions. As it is not very practical to deal with infinite values, we would like to limit the horizon we calculate values for. Gamma with a value less than 1 provides such a limitation, and we'll discuss this later in chapters about the value iteration family of methods. On the other hand, if you're dealing with finite-horizon environments (for example, the Tic-Tac-Toe game, which is limited to at most nine steps), then it will be fine to use gamma = 1. As another example, there is an important class of environments with only one step, called the Multi-Armed Bandit MDP. This means that on every step you select one of several alternative actions, which provides you with some reward, and the episode ends.

As I already said about the Markov reward process definition, gamma is usually set to a value between 0 and 1 (commonly used values for gamma are 0.9 and 0.99); however, with such values it becomes almost impossible to calculate the values accurately by hand, even for MRPs as small as our Dilbert example, because it requires summing hundreds of values. Computers are good at tedious tasks such as summing thousands of numbers, and there are several simple methods which can quickly calculate values for MRPs, given transition and reward matrices. We'll see and even implement one such method in Chapter 5, Tabular Learning and the Bellman Equation, when we start looking at Q-learning methods.
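As a rough preview, one possible way to let the computer do the summing is to repeatedly apply the relation "value of a state = expected immediate reward plus gamma times the expected value of the next state" until the numbers stop changing. The sketch below is our own illustration and not necessarily the method used in Chapter 5. The name `evaluate_mrp`, the tolerance, and the dictionary layout are assumptions, as is reading the 2 * 0.2 term of V(computer) as computer → home and assigning 0.6 to home → home and 0.4 to home → coffee.

```python
# For every source state: target state -> (probability, reward).
drp = {
    "home": {"home": (0.6, 1), "coffee": (0.4, 1)},
    "coffee": {"chat": (0.7, 2), "coffee": (0.1, 1), "computer": (0.2, 3)},
    "chat": {"chat": (0.5, -1), "computer": (0.3, 2), "coffee": (0.2, 1)},
    "computer": {
        "computer": (0.5, 5),
        "chat": (0.1, -3),
        "coffee": (0.2, 1),
        "home": (0.2, 2),
    },
}


def evaluate_mrp(process, gamma, tolerance=1e-10, max_sweeps=100_000):
    """Iteratively estimate state values of an MRP for gamma < 1."""
    values = {state: 0.0 for state in process}
    for _ in range(max_sweeps):
        # One sweep: recompute every state's value from the previous estimates.
        new_values = {
            state: sum(
                prob * (reward + gamma * values[target])
                for target, (prob, reward) in transitions.items()
            )
            for state, transitions in process.items()
        }
        change = max(abs(new_values[s] - values[s]) for s in process)
        values = new_values
        if change < tolerance:
            break
    return values


for gamma in (0.0, 0.9, 0.99):
    result = evaluate_mrp(drp, gamma)
    print(gamma, {state: round(value, 2) for state, value in result.items()})
```

For gamma = 0 this reduces to the hand calculation above, and the closer gamma gets to 1, the larger the values grow and the more sweeps are needed before they settle.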

For now, let's put another layer of complexity around our Markov reward processes and introduce the final missing piece: actions.

Markov decision process

You may already have ideas about how to extend our MRP to include actions into the picture. First, we must add a set of actions (A), which has to be finite. This is our agent's action space.

Then, we need to condition our transition matrix on the action, which basically means our matrix needs an extra action dimension, which turns it into a cube. If you remember, in the case of MPs and MRPs, the transition matrix had a square form, with the source state in rows and the target state in columns. So, every row i contained a list of probabilities to jump to every state:

Figure 8: Transition matrix

Now the agent no longer passively observes state transitions, but can actively choose an action to take at every step. So, for every source state, we don't have a list of numbers, but a matrix, where the depth dimension contains the actions that the agent can take, and the other dimension is the target state that the system will jump to after the action is performed. The following diagram shows our new transition table, which has become a cube with the source state as the height dimension (indexed by i), the target state as the width (j), and the action the agent can choose as the depth (k):

Figure 9: Transition probabilities for MDP

So, in general, by choosing an action, the agent can affect the probabilities of target states, which is a useful ability.

To give you an idea of why we need so many complications, let's imagine a small robot which lives in a 3 × 3 grid and can execute the actions turn left, turn right, and go forward. The state of the world is the robot's position plus orientation (up, down, left, and right), which gives us 3 × 3 × 4 = 36 states (the robot can be at any location in any orientation).

Also, imagine that the robot has imperfect motors (which is frequently the case in the real world), and when it executes turn left or turn right, there is a 90% chance that the desired turn happens, but sometimes, with 10% probability, the wheel slips and the robot's position stays the same. The same happens with go forward: in 90% of cases it works, but for the rest (10%) the robot stays at the same position.

In the following illustration, a small part of a transition diagram is shown, displaying the possible transitions from the state (1, 1, up), when the robot is in the center of the grid and facing up. If it tries to move forward, there is a 90% chance that it will end up in the state (0, 1, up), but there is a 10% probability that the wheels will slip and the target position will remain (1, 1, up).

To properly capture all these details about the environment and its possible reactions to the agent's actions, the general MDP has a 3D transition matrix with the dimensions source state, action, and target state.

Figure 10: Grid world environment
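To make the robot example more tangible, the following sketch builds the full transition table for the 3 × 3 grid. It stores the table as nested dictionaries rather than a dense 3D array, and all the names in it (`STATES`, `ACTIONS`, `intended_state`, `transition_probs`) are our own. The text doesn't say what happens when the robot drives into a wall, so we assume that it simply stays where it is.

```python
from itertools import product

SIZE = 3
ORIENTATIONS = ["up", "right", "down", "left"]  # clockwise order
ACTIONS = ["turn_left", "turn_right", "forward"]
# Row/column offsets for a forward step; "up" decreases the row index,
# so (1, 1, up) moves to (0, 1, up) as in the text.
MOVES = {"up": (-1, 0), "right": (0, 1), "down": (1, 0), "left": (0, -1)}
SUCCESS = 0.9  # probability that the motors do what was asked

STATES = list(product(range(SIZE), range(SIZE), ORIENTATIONS))


def intended_state(state, action):
    """State the robot reaches if the action works as desired."""
    row, col, orientation = state
    index = ORIENTATIONS.index(orientation)
    if action == "turn_left":
        return (row, col, ORIENTATIONS[(index - 1) % 4])
    if action == "turn_right":
        return (row, col, ORIENTATIONS[(index + 1) % 4])
    d_row, d_col = MOVES[orientation]
    new_row, new_col = row + d_row, col + d_col
    # Assumption: driving into a wall leaves the robot in place.
    if 0 <= new_row < SIZE and 0 <= new_col < SIZE:
        return (new_row, new_col, orientation)
    return state


def transition_probs(state, action):
    """Distribution over target states for one (state, action) pair."""
    probs = {}
    target = intended_state(state, action)
    probs[target] = probs.get(target, 0.0) + SUCCESS
    # With the remaining probability the wheels slip and nothing changes.
    probs[state] = probs.get(state, 0.0) + (1.0 - SUCCESS)
    return probs


# P[source_state][action] -> {target_state: probability}
P = {s: {a: transition_probs(s, a) for a in ACTIONS} for s in STATES}

print(len(STATES))
print(P[(1, 1, "up")]["forward"])
```

Most entries of the dense 36 × 3 × 36 table would be zero, which is why the sketch keeps only the reachable target states for every (state, action) pair.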

Finally, to turn our MRP into an MDP, we need to add actions to our reward matrix in the same way we did with the transition matrix: our reward matrix will depend not only on state but also on action. In other words, the reward the agent obtains now depends not only on the state it ends up in but also on the action that leads to this state. This is similar to putting effort into something: you usually gain skills and knowledge, even if the result of your efforts wasn't too successful. So, the reward could be better if you're doing something, rather than not doing something, even if the final result is the same.

Now, with a formally defined MDP, we're finally ready to introduce the central concept for MDPs and RL: policy.

The intuitive definition of policy is that it is some set of rules that controls the agent's behavior. Even for fairly simple environments, we can have a variety of policies. In the preceding example with the robot in the grid world, the agent can have different policies, which will lead to different sets of visited states. For example, the robot could follow any of these policies:

  • Blindly move forward regardless of anything

  • Try to go around obstacles by checking whether the previous forward action failed

  • Funnily spin around to entertain its creator

  • Choose an action randomly, modelling a drunk robot in the grid world scenario, and so on

You may remember that the main objective of the agent in RL is to gather as much return (which was defined as discounted cumulative reward) as possible. So, again, intuitively, different policies can give us different returns, which makes it important to find a good policy. This is why the notion of policy is important, and it's the central thing we're looking for.

Formally, a policy is defined as the probability distribution over actions for every possible state:

\pi(a \mid s) = P[A_t = a \mid S_t = s]

It is defined as a probability, not as a concrete action, to introduce randomness into an agent's behavior. We'll talk later about why this is important and useful. Finally, a deterministic policy is a special case of a probabilistic one, in which the needed action has a probability of 1.

Another useful notion is that if our policy is fixed and not changing, then our MDP becomes an MRP, as we can reduce the transition and reward matrices with the policy's probabilities and get rid of the action dimension.
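This reduction can be sketched on a tiny example. Everything below is hypothetical: a two-state, two-action MDP invented for illustration, with rewards simplified to depend only on the state and the action taken. Averaging over the actions with the policy's probabilities removes the action dimension and leaves an ordinary transition matrix plus an expected reward per state.

```python
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# P[s][a][s2]: probability of ending up in s2 after taking action a in state s.
P = {
    "s0": {"stay": {"s0": 0.9, "s1": 0.1}, "move": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.1, "s1": 0.9}, "move": {"s0": 0.8, "s1": 0.2}},
}

# R[s][a]: reward for taking action a in state s (a simplifying assumption).
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 2.0, "move": -1.0},
}

# policy[s][a]: probability of choosing action a in state s.
# The policy is stochastic in s0 and deterministic in s1.
policy = {
    "s0": {"stay": 0.5, "move": 0.5},
    "s1": {"stay": 1.0, "move": 0.0},
}


def reduce_to_mrp(P, R, policy):
    """Average transitions and rewards over actions using the policy."""
    transitions = {
        s: {
            s2: sum(policy[s][a] * P[s][a][s2] for a in ACTIONS)
            for s2 in STATES
        }
        for s in STATES
    }
    rewards = {s: sum(policy[s][a] * R[s][a] for a in ACTIONS) for s in STATES}
    return transitions, rewards


mrp_transitions, mrp_rewards = reduce_to_mrp(P, R, policy)
print(mrp_transitions)
print(mrp_rewards)
```

Every row of the reduced transition matrix still sums to 1, so the result is a valid Markov reward process that can be analyzed with the tools from the previous section.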

So, my congratulations on getting to this stage! This chapter was challenging, but it was important for subsequent practical material. After two more introductory chapters about OpenAI gym and deep learning, we can finally start tackling the question: how do I teach agents to solve practical tasks?



In this chapter, we started our journey into the RL world by learning what makes RL special and how it relates to the supervised and unsupervised learning paradigms. We then learned about the basic RL formalisms and how they interact with each other, after which we defined the Markov process, Markov reward process, and Markov decision process.

In the next chapter, we'll move away from the formal theory into the practice of RL. We'll cover the setup required, libraries, and write our first agent.

About the Author

  • Maxim Lapan

    Maxim Lapan is a deep learning enthusiast and independent researcher. His background and 15 years of work experience as a software developer and systems architect range from low-level Linux kernel driver development to performance optimization and the design of distributed applications working on thousands of servers. With vast work experience in big data, machine learning, and large parallel distributed HPC and non-HPC systems, he is able to explain a number of complicated concepts in simple words and vivid examples. His current areas of interest are in practical applications of deep learning, such as deep natural language processing and deep reinforcement learning. Maxim lives in Moscow, Russian Federation, with his family.


