Chapter 4: Makings of a Markov Decision Process

In the first chapter, we talked about many applications of Reinforcement Learning (RL), from robotics to finance. Before implementing any RL algorithms for such applications, we need to first model them mathematically. The Markov Decision Process (MDP) is the framework we use to model such sequential decision-making problems. MDPs have some special characteristics that make it easier for us to analyze them theoretically. Building on that theory, Dynamic Programming (DP) is the field that proposes solution methods for MDPs. RL, in some sense, is a collection of approximate DP approaches, which enable us to obtain good (but not necessarily optimal) solutions to very complex problems that are intractable to solve with exact DP methods.

In this chapter, we build up the MDP step by step, explain its characteristics, and lay down the mathematical foundation for the RL algorithms that come in the later chapters. In an MDP, the actions an agent takes...

Starting with Markov chains

We start this chapter with Markov chains, which do not involve any decision-making. They model only a special type of stochastic process that is governed by its internal transition dynamics. Therefore, we don't talk about an agent yet. Understanding how Markov chains work will lay the foundation for the MDPs that we will cover later.

Stochastic processes with Markov property

We already defined the state as the set of information that completely describes the situation an environment is in. If the next state that the environment will transition into depends only on the current state, not on the past, we say that the process has the Markov property. This is named after the Russian mathematician Andrey Markov.

Imagine a broken robot that randomly moves in a grid world. At any given step, the robot moves up, down, left, or right with probabilities 0.2, 0.3, 0.25, and 0.25, respectively. This is depicted in Figure 4.1, as follows:
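To make these dynamics concrete, here is a minimal simulation sketch of this Markov chain. The grid size, coordinates, and the rule that a move into a wall leaves the robot in place are illustrative assumptions, not the book's listing:

import numpy as np

m = 3  # illustrative grid size for a small square grid
# Moves (dx, dy) and their probabilities: up, down, left, right
moves = {(0, 1): 0.2, (0, -1): 0.3, (-1, 0): 0.25, (1, 0): 0.25}

def step(state, rng):
    """Sample the robot's next cell; bumping into a wall keeps it in place."""
    idx = rng.choice(len(moves), p=list(moves.values()))
    dx, dy = list(moves.keys())[idx]
    x, y = state
    nx, ny = x + dx, y + dy
    if 0 <= nx < m and 0 <= ny < m:
        return (nx, ny)
    return (x, y)  # hit a wall: stay in the same cell

rng = np.random.default_rng(0)
state = (1, 1)  # start at the center cell
for _ in range(5):
    state = step(state, rng)
    print(state)

Note that the next cell is sampled using only the current cell; nothing about the robot's earlier path enters the calculation, which is exactly the Markov property.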

...

Introducing the reward: Markov reward process

In our robot example so far, we have not really identified any situation/state that is "good" or "bad." In any system though, there are desired states to be in and there are other states that we want to avoid. In this section, we attach rewards to states/transitions, which gives us a Markov Reward Process (MRP). We then assess the "value" of each state.
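For reference, the standard way of writing what "value" will mean (following the notation of the Silver lecture listed in the References; the chapter develops these definitions formally) is via the discounted return and the state-value function, where γ ∈ [0, 1) is the discount factor:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad
v(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]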

Attaching rewards to the grid world example

Remember the version of the robot example in which the robot does not bounce back to the cell it was in when it hits a wall, but instead crashes in a way that is not recoverable. From now on, we will work with that version and attach rewards to the process. Now, let's build this example:

  1. We modify the transition probability matrix to add the "crashed" state and assign self-transition probabilities to it:
    import numpy as np

    # m2 is the number of cells in the m x m grid (m2 = m ** 2 in the earlier
    # listing); the extra row and column represent the new "crashed" state.
    P = np.zeros((m2 + 1, m2 + 1))
    P[:m2, :m2] = get_P(3, 0.2, 0.3, 0.25, 0.25)
    for i in...
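As a rough sketch of where this construction leads (not the book's listing), one way to make the added "crashed" state absorbing and then score the states is shown below. The grid dynamics, reward values, and discount factor here are illustrative placeholders, and the closed-form solution uses the standard MRP relation v = (I − γP)⁻¹R:

import numpy as np

m2 = 9                     # 3 x 3 grid; index m2 is the "crashed" state
n = m2 + 1
P = np.zeros((n, n))
# P[:m2, :m2] = get_P(3, 0.2, 0.3, 0.25, 0.25)  # as in the step above
P[:m2, :m2] = np.full((m2, m2), 1 / m2) * 0.9   # placeholder grid dynamics
P[:m2, m2] = 1 - P[:m2, :m2].sum(axis=1)        # leftover mass: robot crashes
P[m2, m2] = 1.0                                 # crashed state is absorbing

R = np.zeros(n)
R[:m2] = 1.0    # illustrative reward for being in a (non-crashed) grid cell
R[m2] = 0.0     # no reward once crashed

gamma = 0.9
# Closed-form MRP state values: v = (I - gamma * P)^-1 R
v = np.linalg.solve(np.eye(n) - gamma * P, R)
print(v.round(2))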

Bringing the action in: Markov decision process

A Markov reward process allowed us to model and study a Markov chain with rewards. Of course, our ultimate goal is to control such a system to achieve the maximum rewards. Now, we incorporate decisions into the MRP.

Definition

A Markov decision process (MDP) is simply a Markov reward process with decisions affecting transition probabilities and potentially the rewards.

Info

An MDP is characterized by a tuple ⟨S, A, P, R, γ⟩, where we have a finite set of actions, A, on top of the MRP.
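Concretely, in the standard notation (following the Silver lecture in the References), the chosen action now conditions both the transition probabilities and the expected rewards:

\mathcal{P}^{a}_{ss'} = \Pr\left[ S_{t+1} = s' \mid S_t = s, A_t = a \right],
\qquad
\mathcal{R}^{a}_{s} = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right]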

The MDP is the mathematical framework behind RL. So, this is a good time to recall the RL diagram that we introduced in Chapter 1, Introduction to Reinforcement Learning:

Figure 4.8 – Markov decision process diagram

Our goal in an MDP is to find a policy that maximizes the expected cumulative reward. A policy simply tells us which action(s) to take in a given state. In other words, it is a mapping from states to actions. More formally, a policy...
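As a quick illustration of this mapping (a sketch with made-up grid-world states and actions, not tied to the book's listings), a deterministic policy can be stored as a lookup from states to actions, and a stochastic one as a lookup from states to action probabilities:

# Deterministic policy: exactly one action per state (states are grid cells)
deterministic_policy = {
    (0, 0): "right",
    (0, 1): "up",
    (1, 1): "up",
}

# Stochastic policy: a probability distribution over actions per state
stochastic_policy = {
    (0, 0): {"right": 0.8, "up": 0.2},
    (1, 1): {"up": 0.5, "left": 0.5},
}

def act(policy, state):
    """Return the action prescribed by a deterministic policy."""
    return policy[state]

print(act(deterministic_policy, (0, 1)))  # -> "up"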

Partially observable Markov decision process

The definition of policy we have used in this chapter so far is that it is a mapping from the states of an environment to actions. Now, the question we should ask is whether the state is really known to the agent in all types of environments.

Remember the definition of the state: it describes everything in an environment that is relevant to the agent's decision-making (in the grid world example, for instance, if the color of the walls does not matter for the robot's decisions, it will not be part of the state).

If you think about it, this is a very strong definition. Consider the situation when someone is driving a car. Does the driver know everything about the world around them while making their driving decisions? Of course not! To begin with, other cars block the driver's view more often than not. Not knowing the precise state of the world does not stop anyone from driving, though. In such cases, we base our decision on our observations...
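To make the distinction concrete, here is a toy sketch (illustrative, not from the book) of how an observation can be a lossy function of the true state: the robot might only sense which neighboring directions are blocked by a wall, which does not uniquely identify where it is on the grid:

m = 4  # illustrative 4 x 4 grid

def observe(state):
    """Return only which neighboring directions are blocked by a wall,
    not the robot's actual coordinates (a partial observation)."""
    x, y = state
    return {
        "wall_up": y == m - 1,
        "wall_down": y == 0,
        "wall_left": x == 0,
        "wall_right": x == m - 1,
    }

# Two different true states can produce the exact same observation,
# so the observation alone cannot pin down the state:
print(observe((1, 1)))
print(observe((1, 1)) == observe((2, 2)))  # True: these states are aliased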

Summary

In this chapter, we have covered the mathematical framework in which we model the sequential decision-making problems we face in real life: Markov decision processes. To this end, we started with Markov chains, which do not involve any concept of reward or decision-making. Markov chains simply describe stochastic processes where the system transitions based on the current state, independently of the previously visited states. We then added the notion of a reward and started discussing which states are more advantageous to be in, in terms of expected future rewards. This gave rise to the concept of the "value" of a state. Then, we finally brought in the concept of "decision/action" and defined the Markov decision process. Subsequently, we finalized the definitions of state-value functions and action-value functions. Lastly, we discussed what a partially observable environment is and how it affects the decision-making of an agent.

The Bellman equation...

Exercises

  1. Calculate the n-step transition probabilities for the robot using the Markov chain model we introduced, with the state initialized at . You will notice that it takes a bit more time for the system to reach the steady state.
  2. Modify the Markov chain to include the absorbing state for the robot crashing into the wall. What does your P^n look like for a large n?
  3. Using the state values in Figure 4.7, calculate the value of a corner state using the estimates for the neighboring state values.
  4. Iteratively estimate the state values in the grid world MRP using matrix forms and operations instead of using a for loop.
  5. Calculate the action value of taking the action 'up' in a given state, where the policy π corresponds to taking no actions in any state, using the values in Figure 4.7. Based on how this action value compares to the corresponding state value, would you consider changing your policy to take the action 'up' instead of no action in that state?

References

  1. Silver, D. (2015). Lecture 2: Markov Decision Processes. Retrieved from UCL Course on RL: https://www.davidsilver.uk/wp-content/uploads/2020/03/MDP.pdf
  2. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. A Bradford Book.
  3. Ross, S. M. (1996). Stochastic Processes. 2nd ed, Wiley.