# 2. Markov Decision Processes and Bellman Equations

Overview

This chapter will cover more of the theory behind reinforcement learning. We will cover Markov chains, Markov reward processes, and Markov decision processes. We will learn about the concepts of state values and action values along with Bellman equations to calculate previous quantities. By the end of this chapter, you will be able to solve Markov decision processes using linear programming methods.

# Introduction

In the previous chapter, we studied the main elements of **Reinforcement Learning** (**RL**). We described an agent as an entity that can perceive an environment's state and act by modifying the environment state in order to achieve a goal. An agent acts through a policy that represents its behavior, and the way the agent selects an action is based on the environment state. In the second half of the previous chapter, we introduced Gym and Baselines, two Python libraries that simplify the environment representation and the algorithm implementation, respectively.

We mentioned that RL considers problems as **Markov Decision Processes** (**MDPs**), without entering into the details and without giving a formal definition.

In this chapter, we will formally describe what an MDP is, its properties, and its characteristics. When facing a new problem in RL, we have to ensure that the problem can be formalized as an MDP; otherwise, applying RL techniques is impossible.

Before presenting a formal definition of MDPs, we need to understand **Markov Chains** (**MCs**) and **Markov Reward Processes** (**MRPs**). MCs and MRPs are specific cases (simplified) of MDPs. An MC only focuses on state transitions without modeling rewards and actions. Consider the example of the game of snakes and ladders, where the next action is completely dependent on the number displayed on the dice. MRPs also include the reward component in the state transition. MRPs and MCs are useful in understanding the characteristics of MDPs gradually. We will be looking at specific examples of MCs and MRPs later in the chapter.

Along with MDPs, this chapter also presents the concepts of the state-value function and the action-value function, which are used to evaluate how good a state is for an agent and how good an action taken in a given state is. State-value functions and action-value functions are the building blocks of the algorithms used to solve real-world problems. The concepts of state-value functions and action-value functions are highly related to the agent's policy and the environment dynamics, as we will learn later in this chapter.

The final part of this chapter presents two **Bellman equations**, namely the **Bellman expectation equation** and the **Bellman optimality equation**. These equations are helpful in the context of RL in order to evaluate the behavior of an agent and find a policy that maximizes the agent's performance in an MDP.

In this chapter, we will practice with some MDP examples, such as the student MDP and Gridworld. We will implement the solution methods and equations explained in this chapter using Python, SciPy, and NumPy.

# Markov Processes

In the previous chapter, we described the RL loop as an agent observing a representation of the environment state, interacting with an environment through actions, and receiving a reward based on the action and the environment state. This interaction process is called an MDP. In this section, we will understand what an MDP is, starting with the simplest case of an MDP, an MC. Before describing the various types of MDPs, it is useful to formalize the underlying property of all these processes, the Markov property.

## The Markov Property

Let's start with two examples to help us to understand what the Markov property is. Consider a Rubik's cube. When formalizing the solving of a Rubik's cube as an RL task, we can define the environment state as the state of the cube. The agent can perform actions corresponding to the rotation of the cube's faces. The action results in a state transition that changes the cube. Here, the history is not important – that is, the sequence of actions yielding the current state – in determining the next state. The current state and the present action are the only components that influence the future state:

Looking at the preceding figure, suppose the current environment state is the cube with the `Present`

label. The current state can be reached by the two states to its left, with the labels `Past #1`

and `Past #2`

, using two different actions, represented as black arrows. By rotating the face on the left downwards, in the current state, we get the future state on the right, denoted by the label `Future`

. The next state, in this case, is independent of the past, in the sense that only the present state and action determine it. It does not matter what the former state was, whether it was `Past #1`

or `Past #2`

; in both cases, we end up with the same future state.

Let's now consider another classic example: the Breakout game.

Breakout is a classic Atari game. In the game, there is a layer of bricks at the top of the screen; the goal is to break the bricks using a ball, without allowing the ball to touch the bottom of the screen. The player can only move a paddle horizontally. When formalizing the Breakout game as an RL task, we can define the environment state as the image pixels at a certain moment. The agent has at its disposal three possible actions, "Left," "Right," and "None," corresponding to the paddle's movement.

Here, there is a difference with respect to the Rubik's cube example. *Figure 2.2* explains the difference visually. If we represent the environment state using only the current frame, the future is not determined only by the current state and the current action. We can easily visualize this problem by looking at the ball.

In the left part of *Figure 2.2*, we can see two possible past states yielding the same present state. With the arrow, we represent the ball movement. In both cases, the agent's action is "Left."

In the right part of the figure, we have two possible future states, `Future #1`

and `Future #2`

, starting from the present state and performing the same action (the "Left" action).

By looking only at the current state, it is not possible to decide with certainty which of the two future states will be the next one, as we cannot infer the ball's direction, whether it is going toward the top of the screen or the bottom. We need to know the history, that is, which of the two previous states was the actual previous state, in order to understand what the next state will be.

In this case, the future state is not independent of the past:

Note

Notice that the arrow is not actually present in the environment state. We have drawn it in the frame for ease of presentation.

In the Rubik's cube example, the current state contained enough information to determine, together with the current action, the next state. In the Atari example, this is not true. The current state does not contain a crucial piece of information: the movement component. In this case, we need not only the current state but also the past states to determine the next ones.

The Markov property explains exactly the difference between the two examples in mathematical terms. The Markov property states that "the future is independent of the past, given the present."

This means that the future state depends only on the present state, the present state is the only thing influencing the future state, and that we can get rid of the past states. The Markov property can be formalized in the following way:

The probability, , of the next state, , given the current one, , is equal to the probability of the next state given the state history, . This means that the past states, , have no influence over the next state distribution.

In other words, to describe the probability distribution of the next state, we only need the information contained in the current state. Almost all RL environments, being MDPs, assume that the Markov property holds true. We need to remember this property when designing RL tasks; otherwise, the main RL assumptions won't be true anymore, causing the algorithms to fail miserably.

Note

In statistical language, the term "given" means that the probability is influenced by some information. In other words, the probability function depends on some other information.

Most of the time, the Markov property holds true; however, there are cases in which we need to design the environment state to ensure the independence of the next state from the past states. This is exactly the case in Breakout. To restore the Markov property, we can define the state as multiple consequent frames so that it is possible to infer the ball direction. Refer to the following figure for a visual representation:

As you can see in the preceding figure, the state is represented by three consequent frames.

Note

There are other tricks you can use to restore the Markov property. One of these tricks consists of using policies represented as **Recurrent Neural Networks** (**RNNs**). Using RNNs, the agent can also take into account past states when determining the current action. The usage of RNNs as RL policies will be discussed later on in the book.

In the context of MDPs, the probability of the next state given the current one, is referred to as a transition function.

If the state space is finite, composed of *N* states, we can arrange the transition functions evaluated for each couple of states in an N x N matrix, where the sum of all the columns is 1, as we are summing a probability distribution over transition function elements:

In the rows, we have the source states, and in the columns, we have the destination states.

The probability matrix summarizes the transition function. It can be read as follows: is the probability of landing in state starting from state .

## Markov Chains

An MC, or, simply, a Markov process, is defined as a tuple of state space and transition function . The state space, together with the transition function, defines a memory-less sequence of random states, , satisfying the Markov property. A sample from a Markov process is simply a sequence of states, which is also called an episode in the context of RL:

Consider the preceding MC. As you can see, we have three states represented by circles. The probability function evaluated for the state pairs is reported on the edges connecting the different states. Looking at the edges starting from each state, we can see that the sum of the probabilities associated with each edge is 1, as it defines a probability distribution. The transition function for a couple of states that are not linked by an edge is 0.

The transition function can be arranged in a matrix, as follows:

The matrix form of the transition function is very convenient from a programming perspective as it allows us to perform calculations easily.

## Markov Reward Processes

An MRP is an MC with values associated with state transitions, called rewards. The reward function evaluates how useful it is to transition from one state to another.

An MRP is a tuple of such that the following is true:

- S is a finite set of states.
- P is the transition probability, where is the probability of transitioning from state to state .
- R is a reward function, where is the reward associated with the transition from state to state .
- is the discount factor associated with future rewards, :

As you can see in the previous figure, the rewards are represented by `r`

and are associated with state transitions.

Let's consider the MRP in *Figure 2.8*. The highest reward (`10`

) is associated with transitions *1->3* and the self-loop, *3->3*. The lowest reward is associated with transitions *3->2*, and it is equal to `-1`

.

In an MRP, it is possible to calculate the discounted return as the cumulative sum of discounted rewards.

In this context, we use the term "trajectory" or "episode" to denote a sequence of states traversed by the process.

Let's now calculate the discounted return for a given trajectory; for example, the trajectory of 1-2-3-3-3 with discount factor .

The discounted return is as follows:

We can also calculate the discounted return for a different trajectory, for example, 1-3-3-3-3:

In this example, the second trajectory is more convenient than the first one, having a higher return. This means that the associated path is better in comparison to the first one. The return does not represent an absolute feature of a trajectory; it represents the relative goodness with respect to the other trajectories. Trajectory returns of different MRPs are not comparable to each other.

Considering an MRP composed of N states, the reward function can be represented in an N x N matrix, similar to the transition matrix:

In the rows, we represent the source states, and in the columns, we represent the destination states.

The reward matrix can be read as follows: is the reward associated with the state transition, .

For the example in *Figure 2.11*, the reward function arranged in a matrix is as follows:

When a reward is not specified, we assume that the reward is `0`

.

Using Python and NumPy, we can represent the transition matrix in this way:

n_states = 3 P = np.zeros((n_states, n_states), np.float) P[0, 1] = 0.7 P[0, 2] = 0.3 P[1, 0] = 0.5 P[1, 2] = 0.5 P[2, 1] = 0.1 P[2, 2] = 0.9

In a similar way, the reward matrix can be represented like this:

R = np.zeros((n_states, n_states), np.float) R[0, 1] = 1 R[0, 2] = 10 R[1, 0] = 0 R[1, 2] = 1 R[2, 1] = -1 R[2, 2] = 10

We are now ready to introduce the concepts of value functions and Bellman equations for MRPs.

### Value Functions and Bellman Equations for MRPs

The **value function** in an MRP evaluates the long-term value of a given state, intended as the expected return starting from that state. In this way, the value function expresses a preference over states. A state with a higher value in comparison to another state represents a better state – in other words, a state that it is more rewarding to be in.

Mathematically, the value function is formalized as follows:

The value function of state is represented by . The expectation on the right side of the equation is the expected value, represented by of the return, , considering the fact that the current state is precisely equal to state – the state for which we are evaluating the value function. The expectation is taken according to the transition function.

The value function can be decomposed into two parts by considering the immediate reward and the discounted value function of the successor state:

The last equation is a recursive equation, known as the **Bellman expectation equation for MRPs**, in which the value function of given states depends on the value function of the successor states.

To highlight the dependency of the equation on the transition function, we can rewrite the expectation as a summation of the possible states weighted by the transition probability. We define with the expectation of the reward function in state , which can also be defined as the average reward.

We can write in the following ways:

We can now rewrite the value function in a more convenient way:

This expression can be translated into code, as follows:

R_expected = np.sum(P * R, axis=1, keepdims=True)

In the preceding code, we calculated the expected reward for each state by multiplying element-wise the probability matrix and the reward matrix. Please note that the `keepdims`

parameter is required to obtain a column vector.

This formulation makes it possible to rewrite the Bellman equation using matrix notation:

Here, `V`

is a column vector with state values, is the expected reward for each state, and `P`

is the transition matrix:

Using matrix notation, it is also possible to solve the Bellman equation for `V`

, finding the value function associated with each state:

Here, `I`

is an identity matrix of size N x N, and `N`

is the number of states in the MRP.

### Solving Linear Systems of an Equation Using SciPy

SciPy (https://github.com/scipy/scipy) is a Python library used for scientific computing based on NumPy. SciPy offers, inside the `linalg`

module (linear algebra), useful methods for solving systems of equations.

In particular, we can use `linalg.solve(A, b)`

to solve a system of equations in the form of . This is precisely the method we can use to solve the system , where is the matrix, `A`

; `V`

is the vector of variables, `x`

; and is the vector, `b`

.

When translated into code, it should look like this:

gamma = 0.9 A = np.eye(n_states) - gamma * P B = R_states # solve using scipy linalg solve V = linalg.solve(A, B)

As you can see, we have declared the elements of the Bellman equation and are using `scipy.linalg`

to calculate the `value`

function.

Let's now strengthen our understanding further by completing an exercise.

## Exercise 2.01: Finding the Value Function in an MRP

In this exercise, we are going to solve the Bellman expectation equation by finding the value function for the MRP in the following figure. We will use `scipy`

and the `linalg`

module to solve the linear equation presented in the previous section. We will also demonstrate how to define a transition probability matrix and how to calculate the expected reward for each state:

- Import the required NumPy and SciPy packages:
import numpy as np from scipy import linalg

- Define the transition probability matrix:
# define the Transition Probability Matrix n_states = 3 P = np.zeros((n_states, n_states), np.float) P[0, 1] = 0.7 P[0, 2] = 0.3 P[1, 0] = 0.5 P[1, 2] = 0.5 P[2, 1] = 0.1 P[2, 2] = 0.9 print(P)

You should obtain the following output:

array([[0. , 0.7, 0.3], [0.5, 0. , 0.5], [0. , 0.1, 0.9]])

Let's check the correctness of the matrix. The probability of going from state 1 to state 1 is . This is correct as there are no self-loops in state 1. The probability of going from state 1 to state 2 is as it's the probability associated with edge

*1->2*. This can be done for all elements of the transition matrix. Note that, here, the transition matrix elements are indexed by the state, not by their position in the matrix. This means that with , we refer to element 0,0 of the matrix. - Check that the sum of all the columns is exactly equal to
`1`

, being a probability matrix:# the sum over columns is 1 for each row being a probability matrix assert((np.sum(P, axis=1) == 1).all())

The

`assert`

function is used to ensure that a particular condition will return`true`

. In this case, the`assert`

function will make sure that the sum of all the columns is exactly`1`

. - We can calculate the expected immediate reward for each state using the reward matrix and the transition probability matrix:
# define the reward matrix R = np.zeros((n_states, n_states), np.float) R[0, 1] = 1 R[0, 2] = 10 R[1, 0] = 0 R[1, 2] = 1 R[2, 1] = -1 R[2, 2] = 10 """ calculate expected reward for each state by multiplying the probability matrix for each reward """ #keepdims is required to obtain a column vector R_expected = np.sum(P * R, axis=1, keepdims=True) # The matrix R_expected R_expected

You should obtain the following column vector:

array([[3.7], [0.5], [8.9]])

The

`R_expected`

vector is the expected immediate reward for each state. State 1 has an expected reward of`3.7`

, which is exactly equal to*0.7 * 1 + 0.3*10*. The same logic applies to state 2 and state 3. - Now we need to define
`gamma`

, and we are ready to solve the Bellman equation as a linear equation, . We have and :# define the discount factor gamma = 0.9 # Now it is possible to solve the Bellman Equation A = np.eye(n_states) - gamma * P B = R_expected # solve using scipy linalg V = linalg.solve(A, B) V

You should obtain the following output:

array([[65.540732 ], [64.90791027], [77.5879575 ]])

The vector,

`V`

, represents the value for each state. State 3 has the highest value (`77.58`

). This means that state 3 is the state providing the highest expected return. It is the best state in this MRP. Intuitively, state 3 is the best state because, with a high probability (0.9), the transition brings the agent to the same state, and the reward associated with the transition is high (+10).Note

To access the source code for this specific section, please refer to https://packt.live/37o5ZH4.

You can also run this example online at https://packt.live/3dU8cfW.

In this exercise, we solved the Bellman equation for an MRP by finding the state values for our toy problem. The state values describe quantitatively the benefit of being in each state. We described the MRP in terms of a transition probability matrix and a reward matrix. These two matrices permit us to solve the linear system associated with the Bellman equation.

Note

The computational complexity of the solution of the Bellman equation is `O(n`

3`)`

; it is cubic in the number of states. Therefore, it is only possible for small MRPs.

In the next section, we will consider an active agent that can perform actions, thus arriving at the description of an MDP.

## Markov Decision Processes

An MDP is an MRP with decisions. In this context, we have a set of actions available to an agent that can condition the transition probability to the next state. While, in MRPs, the transition probability depends only on the state of the environment, in MDPs, the agent can perform actions influencing the transition probability. In this way, the agent becomes an active entity in the framework, interacting with the environment through actions.

Formally, an MDP is a tuple, , in which the following is true:

- is the set of states.
- is the set of actions.
- is the reward function, . is the expected reward resulting in action and state .
- is the transition probability function in which is the probability of landing in state starting from the current state, , and performing an action, .
- is the discount factor associated with future rewards, .

The difference between an MRP and an MDP is the fact that the agent has at its disposal a set of actions from which it can choose to condition the transition probability to have a higher possibility of landing in good states. If an MRP and MC are only a description of Markov processes without an objective, an MDP contains the concept of a policy and a goal. In an MDP, the agent should take decisions about which action to take, maximizing the discounted return:

*Figure 2.21* is an example of an MDP representing the day of a university student. There are six possible states: `Class 1`

, `Class 2`

, `Class 3`

, `Social`

, `Bed`

, and `Pub`

. The edges between the states represent state transitions. On the edges, we have the action and the reward, denoted by `r`

. Possible actions are `Study`

, `Social`

, `Beer`

, and `Sleep`

. The initial state, represented by the incoming arrow, is `Class 1`

. The goal of the student is to select the best actions in each state, maximizing their return.

In the following paragraphs, we will discuss some possible strategies for this MDP.

A student agent starts from `Class 1`

. They can decide to study and complete all of the lessons. Each study decision comes with a small negative reward, `-2`

. If the student decides to sleep after `Class 3`

, they will land in the absorbing state, `Bed`

, with a high positive reward of `+10`

. This represents a very common situation in daily routines. You have to sacrifice some immediate reward in order to obtain a higher reward in the future. In this case, by deciding to study in `Class 1`

and `2`

, you obtain a negative reward but are compensated by the positive reward after `Class 3`

.

Another possible strategy in this MDP is to select a `Social`

action right after the `Class 1`

state. This action comes with a small negative reward. The student can continue doing the same action, and each time they get the same reward. The student can also decide to `Study`

from the `Social`

state (notice that `Social`

is both a state and an action) by returning to `Class 1`

. Feeling guilty, in `Class 1`

, the student can decide to study. After having studied a bit, they may feel tired and decide to sleep for a little while, ending up in the `Bed`

state. Having performed the `Social`

action, the agent has cumulated a negative return.

Let's evaluate the possible strategies for this example. We will assume a discount factor of , that is, no discount:

- Strategy: Good student. The good student strategy was the first strategy that was described. Supposing the student will end in
`Class 1`

, they can perform the following actions:`Study`

,`Study`

, and`Study`

. The associated sequence of states is thus`Class 1`

,`Class 2`

,`Class 3`

, and`Sleep`

. The associated return is, therefore, the sum of the rewards along the trajectory:

- Strategy: Social student. The social student strategy is the second strategy described. The student can perform the following actions:
`Social`

,`Social`

,`Social`

,`Study`

,`Study`

, and`Sleep`

. The associated sequence of states is`Class 1`

,`Social`

,`Social`

,`Social`

,`Class 1`

,`Class 2`

, and`Bed`

. The associated return is, in this case, as follows:

By looking at the associated return, we can see that the good student strategy is a better strategy in comparison to the social student strategy, having a higher return.

The question you may ask at this point is how can an agent decide which action to take in order to maximize the return? To answer the question, we need to introduce two useful functions: the state-value function and the action-value function.

### The State-Value Function and the Action-Value Function

In the context of MDPs, we can define a function by evaluating how good it is to be in a given state. However, we should take into account the agent's policy, as it defines the agent's decisions and conditions the probability over trajectories, that is, the sequence of future states. So, the value function depends on the agent policy, .

The **state-value function**, , of an MDP can be defined as the expected return that starts from state `s`

and follows the policy, :

In MDPs, we are also interested in defining the benefit of taking an action in a given state. This function is called the action-value function.

The **action-value function**, , (also called the q-function), can be termed as the expected return starting from state , which takes action and follows the policy :

The state-value function, as we will learn later in the book, provides information that is only useful when it comes to evaluating a policy. The action-value function also provides information about control, that is, for selecting an action in a state.

Suppose that we know the action-value function for an MDP. If we are in given state, , which action would be the best one?

Well, the best action is the one that yields the highest discounted return. The action-value function measures the discounted return that is obtained by starting from a state and performing an action. In this way, the action-value function provides an ordering (or a preference) over the actions in a state. The best action to perform is the one with the highest q-function:

Note that, in this case, we are only doing a one-step optimization of the current policy; that is, we are modifying, possibly, the action in a given state under the assumption that the following actions are taken with the current policy. If we do this, we do not select the best action in this state, but we select the best action under this policy.

Just like in an MRP, in an MDP, the state-value function and the action-value function can be decomposed in a recursive way:

These equations are known as the Bellman expectation equations for MDPs.

Bellman expectation equations are recursive as the state-value function of a given state depends on the state-value function of another state. This is also true for the action-value function.

In the action-value function equation, the action, , for which we are evaluating the function, is an arbitrary action. It is not taken from the action distribution defined by the policy. Instead, the action, , taken in the following step, is taken according to the action distribution defined in state .

Let's rewrite the state-value function and the action-value function to highlight the contribution of the agent's policy, :

Let's analyze the two terms of the equation:

- : This term is the expectation of the immediate rewards given the action distribution defined by the agent's policy. Each immediate reward for a state-action pair is weighted by the probability of the action given the state, which is defined as .
- is the discounted expected value of the state-value function, given the state distribution defined by the transition function. Note that here the action,
`a`

, is defined by the agent's policy. Being an expected value, every state value, , is weighed by the probability of the transition from state to state , given the action, . This is represented by .

The action-value function can be rewritten to highlight the dependency on the transition and value functions:

The action-value function, therefore, is given by the summation of the immediate reward and the expected value of the state-value function of the successor state under the environment dynamic (`P`

).

By comparing the two equations, we obtain an important relationship between the state value and the action value:

In other words, the state-value state, , under the policy, , is the expected value of the action-value function under the actions selected by . Each action-value function is weighted by the probability of the action given the state.

The state-value function can also be rewritten in matrix form, as in the MRP case:

There is a direct solution, as follows:

Here, you can see the following:

- (column vector) is the expected value of the immediate reward induced by the policy for each state:

- is the column vector of the state values for each state.
- is the transition matrix based on the action distribution. It is an matrix, where is the number of states in the MDP. Given two states, and , we have the following:

Therefore, the transition matrix is the probability of transitioning from state to state given the actions selected by the policy and the transition function defined by the MDP.

Following the same steps, we can also find the matrix form of the action-value function:

Here, is a column vector with entries. is the vector of immediate rewards with the same shape of . is the transition matrix with a shape of rows and columns. represents the state values for each state.

The explicit form of and is as follows:

Here, the number of actions associated with state is indicated by , thus . The number of actions of the MDP is obtained by summing up the actions associated with each state.

Let's now implement our understanding of the state- and action-value functions for our student MDP example. In this example, we will use the calculation of the state-value function and the action-value function for the student MDP in *Figure 2.21*. We will consider the case of an undecided student, that is, a student with a random policy for each state. This means that the probability of each action for each state is exactly 0.5.

We will examine a different case for a myopic student in the following example.

Import the required libraries as follows:

import numpy as np from scipy import linalg

Define the environment properties:

n_states = 6 # transition matrix together with policy P_pi = np.zeros((n_states, n_states)) R = np.zeros_like(P_pi)

`P_pi`

contains the contribution of the transition matrix and the policy of the agent. `R`

is the reward matrix.

We will use the following state encoding:

`0`

: Class 1`1`

: Class 2`2`

: Class 3`3`

: Social`4`

: Pub`5`

: Bed

Create the transition matrix by considering a random policy:

P_pi[0, 1] = 0.5 P_pi[0, 3] = 0.5 P_pi[1, 2] = 0.5 P_pi[1, 5] = 0.5 P_pi[2, 4] = 0.5 P_pi[2, 5] = 0.5 P_pi[4, 5] = 0.5 P_pi[4, 0] = 0.5 P_pi[3, 0] = 0.5 P_pi[3, 3] = 0.5 P_pi[5, 5] = 1

Print `P_pi`

:

P_pi

The output will be as follows:

array([[0. , 0.5, 0. , 0.5, 0. , 0. ], [0. , 0. , 0.5, 0. , 0. , 0.5], [0. , 0. , 0. , 0. , 0.5, 0.5], [0.5, 0. , 0. , 0.5, 0. , 0. ], [0.5, 0. , 0. , 0. , 0. , 0.5], [0. , 0. , 0. , 0. , 0. , 1. ]])

Create the reward matrix, `R`

:

R[0, 1] = -2 R[0, 3] = -1 R[1, 2] = -2 R[1, 5] = 0 R[2, 4] = 15 R[2, 5] = 10 R[4, 5] = 10 R[4, 0] = -10 R[3, 3] = -1 R[3, 0] = -3

Print `R`

:

R

The output will be as follows:

array([[ 0., -2., 0., -1., 0., 0.], [ 0., 0., -2., 0., 0., 0.], [ 0., 0., 0., 0., 15., 10.], [ -3., 0., 0., -1., 0., 0.], [-10., 0., 0., 0., 0., 10.], [ 0., 0., 0., 0., 0., 0.]])

Being a probability matrix, the sum of all the columns of `P_pi`

should be `1`

:

# check the correctness of P_pi assert((np.sum(P_pi, axis=1) == 1).all())

The assertion should be verified.

We can now calculate the expected reward for each state, using `R`

and `P_pi`

:

# expected reward for each state R_expected = np.sum(P_pi * R, axis=1, keepdims=True) R_expected

The expected reward, in this case, is as follows:

array([[-1.5], [-1. ], [12.5], [-2. ], [ 0. ], [ 0. ]])

The `R_expected`

vector contains the expected immediate reward for each state.

We are ready to solve the Bellman equation to find the value for each state. For this, we can use `scipy.linalg.solve`

:

# Now it is possible to solve the Bellman Equation gamma = 0.9 A = np.eye(n_states, n_states) - gamma * P_pi B = R_expected # solve using scipy linalg V = linalg.solve(A, B) V

The vector, `V`

, contains the following values:

array([[-1.78587056], [ 4.46226255], [12.13836121], [-5.09753046], [-0.80364175], [ 0. ]])

This is the vector of the state values. State `0`

has a value of `-1.7`

, state `1`

has a value of `4.4`

, and so on:

Let's examine how the results change with , which is the condition assumed for a myopic random student:

gamma = 0. A = np.eye(n_states, n_states) - gamma * P_pi B = R_expected # solve using scipy linalg V_gamma_zero = linalg.solve(A, B) V_gamma_zero

The output will be as follows:

array([[-1.5], [-1. ], [12.5], [-2. ], [ 0. ], [ 0. ]])

The visual representation is as follows:

As you can see, using , the value of each state is exactly equal to the expected immediate reward according to the policy.

Now we can calculate the action-value function. We need to use a different form of immediate reward using a matrix with a shape of . Each row corresponds to a state-action pair, and the value is the immediate reward for that pair:

Translate it into code as follows:

R_sa = np.zeros((n_states*2, 1)) R_sa[0] = -2 # study in state 0 R_sa[1] = -1 # social in state 0 R_sa[2] = -2 # study in state 1 R_sa[3] = 0 # sleep in state 1 R_sa[4] = 10 # sleep in state 2 R_sa[5] = +15 # beer in state 2 R_sa[6] = -1 # social in state 3 (social) R_sa[7] = -3 # study in state 3 (social) R_sa[8] = 10 # sleep in state 4 (pub) R_sa[9] = -10 # study in state 4 (pub) R_sa.shape

The output will be as follows:

(10, 1)

We now have to define the transition matrix of the student MDP. The transition matrix contains the probability of landing in a given state, starting from a state and an action. In the rows, we have the source state and action, and in the columns, we have the landing state:

When translating the probability transition matrix into code, you should see the following:

# Transition Matrix (states x action, states) P = np.zeros((n_states*2, n_states)) P[0, 1] = 1 # study in state 0 -> state 1 P[1, 3] = 1 # social in state 0 -> state 3 P[2, 2] = 1 # study in state 1 -> state 2 P[3, 5] = 1 # sleep in state 1 -> state 5 (bed) P[4, 5] = 1 # sleep in state 2 -> state 5 (bed) P[5, 4] = 1 # beer in state 2 -> state 4 (pub) P[6, 3] = 1 # social in state 3 -> state 3 (social) P[7, 0] = 1 # study in state 3 -> state 0 (Class 1) P[8, 5] = 1 # sleep in state 4 -> state 5 (bed) P[9, 0] = 1 # study in state 4 -> state 0 (class 1)

We can now calculate the action-value function using :

gamma = 0.9 Q_sa_pi = R_sa + gamma * P @ V Q_sa_pi

The action-value vector contains the following values:

array([[ 2.01603629], [ -5.58777741], [ 8.92452509], [ 0. ], [ 10. ], [ 14.27672242], [ -5.58777741], [ -4.60728351], [ 10. ], [-11.60728351]])

`Q_sa_pi`

is the action-value vector. For each state-action pair, we have the value of the action in that state. The action-value function is represented in the following figure. Action values are represented with :

We are now interested in extracting the best action for each state:

""" reshape the column so that we obtain a vector with shape (n_states, n_actions) """ n_actions = 2 Q_sa_pi2 = np.reshape(Q_sa_pi, (-1, n_actions)) Q_sa_pi2

The output will be as follows:

array([[ 2.01603629, -5.58777741], [ 8.92452509, 0. ], [ 10. , 14.27672242], [ -5.58777741, -4.60728351], [ 10. , -11.60728351]])

In this way, performing the `argmax`

function, we obtain the index of the best action in each state:

best_actions = np.reshape(np.argmax(Q_sa_pi2, -1), (-1, 1)) best_actions

The `best_actions`

vector contains the following values:

array([[0], [0], [1], [1], [0]])

The best actions can be visualized as follows:

In *Figure 2.43*, the dotted arrows are the best actions in each state. We can easily find them by looking at the action maximizing the `q`

function in each state.

From the action-value calculation, we can see that when , the action-value function is equal to the expected immediate reward:

Q_sa_pi_gamma_zero = R_sa Q_sa_pi_gamma_zero

The output will be as follows:

array([[ -2.], [ -1.], [ -2.], [ 0.], [ 10.], [ 15.], [ -1.], [ -3.], [ 10.], [-10.]])

Reshape the columns with `n_actions = 2`

, as follows:

n_actions = 2 Q_sa_pi_gamma_zero2 = np.reshape(Q_sa_pi_gamma_zero, \ (-1, n_actions)) Q_sa_pi_gamma_zero2

The output will be as follows:

array([[ -2., -1.], [ -2., 0.], [ 10., 15.], [ -1., -3.], [ 10., -10.]])

By performing the `argmax`

function, we obtain the index of the best action in each state as follows:

best_actions_gamma_zero = np.reshape(np.argmax\ (Q_sa_pi_gamma_zero2, -1), \ (-1, 1)) best_actions_gamma_zero

The output will be as follows:

array([[1], [1], [1], [0], [0]])

The state diagram can be visualized as follows:

It is interesting to note how the best actions are changed by only modifying the discount factor. Here, the best action the agent can take, starting from `Class 1`

, is `Social`

as it provides a bigger immediate reward compared to the `Study`

action. The `Social`

action brings the agent to state `Social`

. Here, the best the agent can do is to repeat the `Social`

action, cumulating negative rewards.

In this example, we learned how to calculate the state-value function using `scipy.linalg.solve`

and how to calculate the action-value function using the matrix form. We noticed that both the state values and the action values depend on the discount factor.

In the next section, we will illustrate the Bellman optimality equation, which makes it possible to solve MDPs by finding the best policy and the best state values.

### Bellman Optimality Equation

It is natural to ask whether it is possible to define an order for policies that determines whether one policy is better than another one. It turns out that the value function provides ordering over policies.

Policy can be considered better than or equal to policy if the expected return from that policy is greater than or equal to the expected return of for all states:

In this example, we substituted the expected return in a state with the state-value function, using the state-value function definition.

Following the previous definition, an optimal policy is a policy that is better than or equal to all other policies in all states. The optimal state-value function, , and the optimal action-value function, , are simply the ones associated with the best policy:

Some important properties of MDPs are as follows:

- There is always at least one optimal (deterministic) policy maximizing the state-value function in every state.
- All optimal policies share the same optimal state-value function.

An MDP is solved if we know the optimal state-value function and the optimal action-value function.

Knowing the optimal value function, , makes it possible to find the optimal policy of the MDP by maximizing over . We can define the optimal policy associated with the optimal action-value function as follows:

As you can see, this policy is simply telling us to perform the action, , with a probability of 1 (essentially, in a deterministic way) if the action, , maximizes the action-value function in this state. In other words, we need to take the action that guarantees the highest discounted return following the optimal policy. All other actions, being suboptimal, are taken with probability 0; therefore, they are, essentially, never taken. Notice that the policy obtained in this way is deterministic, not stochastic.

Analyzing this result, we uncover two essential facts:

- There is always a deterministic optimal policy for any MDP.
- The optimal policy is determined by the knowledge of the optimal action-value function, .

The optimal value functions are related to the Bellman optimality equation. The Bellman optimality equation states that the optimal state-value function in a state is equal to the overall maximum actions of the optimal action-value function in the same state:

Using the definition of the action-value function, we can expand the previous equation to a more explicit form:

The previous equation tells us that the optimal value function of a state is equal to the maximum over actions of the immediate reward, , plus the discounted , expected optimal value of the successor state, , where the expected value is determined by the transition function.

Also, the optimal action-value function has an explicit formulation, known as the Bellman optimality equation, for :

This can be rewritten only in terms of by using the relationship between and :

The Bellman optimality equation for expresses the fact that the optimal state-value function must equal the expected return for the best action in that state. Similarly, the Bellman optimality equation for expresses the fact that the optimal q-function must equal the immediate reward plus the discounted return of the best action in the next state according to the environment dynamic.

### Solving the Bellman Optimality Equation

The presence of a maximization makes the Bellman optimality equation non-linear. This means that we do not have a closed-form solution for these equations in the general case. However, there are many iterative solution methods that we will analyze in the next sections and chapters.

The main methods include value iteration, policy iteration, Q learning, and SARSA, which we will study in later chapters.

## Solving MDPs

Now that we have gained a fair understanding of all the important concepts and equations, let's move on to solving actual MDPs.

### Algorithm Categorization

Before considering the different algorithms for solving MDPs, it is beneficial for us to understand the family of algorithms along with their pros and cons. Knowing the main family of algorithms makes it possible for us to select the correct family based on our task:

The first main distinction is between model-based algorithms and model-free algorithms:

- A model-based algorithm requires knowledge of the environment dynamic (model). This is a strong requirement, as the environment model is usually unknown. Let's consider an autonomous driving problem. Here, knowing the environment dynamic means that we should know exactly how the agent's actions influence the environment and the next state distribution. This depends on many factors: the street state, weather conditions, car characteristics, and much more. For many problems, the dynamic is unknown, too complex, or too inaccurate to be used successfully. Nonetheless, the dynamic provides beneficial information for solving the task.
When the model is known (

`Model Exploiting`

), model-based algorithms are preferred over their counterparts for their sample efficiency, as they require fewer samples to learn good policies.The environment model in these cases can also be unknown; the algorithm itself explicitly learns an environment model (

`Model Learning`

) and uses it to plan its actions. Dynamic programming algorithms use this model knowledge to perform bootstrapping, which uses a previous estimation for the estimate of another quantity. - Model-free algorithms do not require a model of the environment. These types of algorithms are, therefore, preferred for real-world applications. Note that these algorithms may build an environment representation internally, taking into account the environment dynamic. However, usually, this process is implicit, and the users just don't care about these aspects.

Model-free algorithms can also be classified as value-based algorithms or policy-search algorithms.

### Value-Based Algorithms

A value-based algorithm focuses on learning the action-value function and the state-value function. Learning the value functions is done by using the Bellman equations presented in the previous sections. An example of a value-based algorithm is Q learning, where the objective is to learn the action-value function, which, in turn, is used for control. A deep Q network is an extension of Q learning in which a neural network is used to approximate the q-function. Value-based algorithms are usually off-policy, which means they can reuse previous samples collected with a different policy with respect to the policy being optimized at the moment. This is a very powerful property as it allows us to obtain more efficient algorithms in terms of samples. We will learn about Q learning and deep Q networks in more detail in *Chapter 9*, *What Is Deep Q-Learning?*.

### Policy Search Algorithms

**Policy Search** (**PS**) methods explore the policy space directly. In PS, the RL problem is formalized as the maximization of the performance measure depending on the policy parameters. You will study PS methods and policy gradients in more detail in *Chapter 11*, *Policy-Based Methods for Reinforcement Learning*.

### Linear Programming

Linear programming is an optimization technique that is used for problems with linear constraints and linear objective functions. The objective function describes the quantity to be optimized. In the case of RL, this quantity is the expected discounted return of all the states weighted by the initial state distribution, which is the probability of starting an episode in that state.

When the starting state is precisely one, this simplifies to the optimization of the expected discounted return starting from the initial state.

Linear programming is a model-based, model-exploiting technique. Solving an MDP with linear programming, therefore, requires perfect knowledge of the environment dynamics, which translates into knowledge of the transition probability matrix, . Using linear programming, we can solve MDPs by finding the best state values for each state. From our knowledge of state values, we can derive knowledge of the optimal action-value function. In this way, we can find a control policy for our agent and maximize its performance in the given task.

The basic idea follows on from the definition of ordering over policies; we want to find the state-value function by maximizing the value of each state weighted by the initial state distribution, , subject to a feasibility constraint:

Here, we have variables and constraints. The variables are the values, , for each state, `s`

, in the state space, `S`

.

Note that the maximization role is taken by the constraints, while we need to minimize the objective function because, otherwise, an optimal solution would have infinite values for all variables, .

The constraints are based on the idea that the value of a state must be greater than or equal to the immediate reward plus the discounted expected value of the successor states. This must be true for all states and all actions.

The huge number of variables and constraints makes it possible to use linear programming techniques for only finite-state and finite-action MDPs.

We will be using the following notation:

In the preceding notation, `c`

is the vector of coefficients of the objective function, is the matrix of the upper bound constraints, and is the associated coefficient vector.

In Python, SciPy offers the `linprog`

function (inside the `optimize`

module, `scipy.optimize.linprog`

), which optimizes linear programs given the objective function and the constraints.

The signature of the function is `scipy.optimize.linprog(c, A_ub, b_ub)`

.

To rephrase the problem using upper bounds, we have the following:

Note

For further reading on linear programming for MDPs, refer to the following paper from *de Farias, D. P. (2002): The Linear Programming Approach to Approximate Dynamic Programming: Theory and Application*: http://www.mit.edu/~pucci/discountedLP.pdf.

Let's now solve a quick exercise to strengthen our understanding of linear programming.

## Exercise 2.02: Determining the Best Policy for an MDP Using Linear Programming

The goal of this exercise is to solve the MDP in the following figure using linear programming. In this MDP, the environment model is straightforward and the transition function is deterministic, determined uniquely by the action. We will be finding the best action (the one with the maximum reward) taken by the agent, which determines the best policy of the environment, using linear programming:

The variables of the linear program are the state values. The coefficients are given by the initial state distribution, which, in our case, is a deterministic function, as state 1 is the initial state. Therefore, the coefficients of the objective function are `[1, 0, 0]`

.

We are now ready to tackle our problem:

- As always, import the required libraries:
import numpy as np import scipy.optimize

- Define the number of states and actions and the discount factor for this problem:
# number of states and number of actions n_states = 3 n_actions = 2

- Define the initial state distribution. In our case, it is a deterministic function:
# initial state distribution mu = np.array([[1, 0, 0]]).T # only state 1 mu

The output will be as follows:

array([[1], [0], [0]])

- Now we need to build the upper bound coefficients for action
`A`

:# Build the upper bound coefficients for the action A # define the reward matrix for action A R_A = np.zeros((n_states, 1), np.float) R_A[0, 0] = 1 R_A[1, 0] = 0 R_A[2, 0] = 0 R_A

The output will be as follows:

array([[1.], [0.], [0.]])

- Define the transition matrix for action
`A`

:# Define the transition matrix for action A P_A = np.zeros((n_states, n_states), np.float) P_A[0, 1] = 1 P_A[1, 0] = 1 P_A[2, 1] = 1 P_A

The output will be as follows:

array([[0., 1., 0.], [1., 0., 0.], [0., 1., 0.]])

- We are ready to build the upper bound matrix for action
`A`

:gamma = 0.9 # Upper bound A matrix for action A A_up_A = gamma * P_A - np.eye(3,3) A_up_A

The output will be as follows:

array([[-1. , 0.9, 0. ], [ 0.9, -1. , 0. ], [ 0. , 0.9, -1. ]])

- We need to do the same for action
`B`

:# The same for action B # define the reward matrix for action B R_B = np.zeros((n_states, 1), np.float) R_B[0, 0] = 10 R_B[1, 0] = 1 R_B[2, 0] = 10 # Define the transition matrix for action B P_B = np.zeros((n_states, n_states), np.float) P_B[0, 2] = 1 P_B[1, 2] = 1 P_B[2, 2] = 1 # Upper bound A matrix for action B A_up_B = gamma * P_B - np.eye(3,3) A_up_B

The output will be as follows:

array([[-1. , 0. , 0.9], [ 0. , -1. , 0.9], [ 0. , 0. , -0.1]])

- We are ready to concatenate the results for the two actions:
# Upper bound matrix for all actions and all states A_up = np.vstack((A_up_A, A_up_B)) """ verify the shape: number of constraints are equal to |actions| * |states| """ assert(A_up.shape[0] == n_states * n_actions) # Reward vector is obtained by stacking the two vectors R = np.vstack((R_A, R_B))

- The only thing we have to do now is to solve the linear program using
`scipy.optimize.linprog`

:c = mu b_up = -R # Solve the linear program res = scipy.optimize.linprog(c, A_up, b_up)

- Let's collect the results:
# Obtain the results: state values V_ = res.x V_ V = V_.reshape((-1, 1)) V np.savetxt("solution/V.txt", V)

Let's analyze the results. We can see that the value of state 2 is the lowest one, as expected. The values of states 1 and 3 are very close to each other and are approximately equal to 1e+2:

- Now we can calculate the optimal policy by calculating the optimal action-value function for each state-action pair:
""" transition matrix. On the rows, we have states and actions, and on the columns, we have the next states """ P = np.vstack((P_A, P_B)) P

The output will be as follows:

array([[0., 1., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [0., 0., 1.], [0., 0., 1.]])

- Use the action-value formula to calculate the action values for each state-action pair:
""" Use the action value formula to calculate the action values for each state action pair. """ Q_sa = R + gamma * P.dot(V) """ The first three rows are associated to action A, the last three are associated # to action B """ Q_sa

The output is as follows:

array([[ 88.32127683], [ 89.99999645], [ 87.32127683], [100.00000622], [ 91.00000622], [100.00000622]])

- Reshape and use the
`argmax`

function to better understand the best actions:Q_sa_2 = np.stack((Q_sa[:3, 0], Q_sa[3:, 0]), axis=1) Q_sa_2

The output will be as follows:

array([[ 88.32127683, 100.00000622], [ 89.99999645, 91.00000622], [ 87.32127683, 100.00000622]])

Use the following code to better understand the best actions:

best_actions = np.reshape(np.argmax(Q_sa_2, axis=1), (3, 1)) best_actions

The output will be as follows:

array([[1], [1], [1]])

By visually inspecting the result, we can see that action

`B`

is the best action for all states, having acquired the highest q values for all states. Thus, the optimal policy decides to always take action`B`

. Doing this, we will land in state`3`

, and we will follow the self-loop, cumulating high positive rewards:

The optimal policy is represented in *Figure 2.59*. The dotted arrows represent the best action for each state.

Note

To access the source code for this specific section, please refer to https://packt.live/2Arr9rO.

You can also run this example online at https://packt.live/2Ck6neR.

In this exercise, we used linear programming techniques to solve a simple MDP with finite states and actions. By using the correspondence between the state-value function and the action-value function, we extracted the value of each state-action pair. From this knowledge, we extracted the optimal policy for this environment. In this case, the best policy is always just to take action `B`

.

In the next activity, we will use the Bellman expectation equation to evaluate a policy for a more complex task. Before that, let's explore the environment that we are going to use in the activity, Gridworld.

## Gridworld

Gridworld is a classical RL environment with many variants. The following figure displays the visual representation of the environment:

As you can see, the states are represented by cells, and there are 25 states arranged in a 5 x 5 grid. There are four available actions: left, right, up, and down. These actions move the current state in the direction of the action, and the associated reward is 0 for all actions. The exceptions are as follows:

- Border cells: If an action takes the agent outside of the grid, the agent state does not change, and the agent receives a reward of -1.
- Good cells: and are good cells. For these cells, each action brings the agent to states and , respectively. The associated reward is +10 for going outside state and +5 for going outside state .
- Bad cells: and are bad cells. For these cells, the associated reward is -1 for all actions.

Now that we have an understanding of the environment, let's attempt an activity that implements it.

## Activity 2.01: Solving Gridworld

In this activity, we will be working on the Gridworld environment. The goal of the activity is to calculate and visually represent the state values for a random policy, in which the agent selects each action with an equal probability (1/4) in all states. The discount factor is assumed to be equal to 0.9.

The following steps will help you to complete the activity:

- Import the required libraries. Import
`Enum`

and`auto`

from`enum`

,`matplotlib.pyplot`

,`scipy`

, and`numpy`

, and import`tuple`

from`typing`

. - Define the visualization function and the possible actions for the agent.
- Write a policy class that returns the action probability in a given state; for a random policy, the state can be ignored.
- Write an
`Environment`

class with a step function that returns the next state and the associated reward given the current state and action. - Loop for all states and actions and build a transition matrix (width*height, width*height) and a reward matrix of the same dimension. The transition matrix contains the probability of going from one state to another, so the sum of the first axis should be equal to 1 for all rows.
- Use the matrix form of the Bellman expectation equation to compute the state values for each state. You can use
`scipy.linalg.solve`

or directly compute the inverse matrix and solve the system.The output will be as follows:

Note

It is useful to visualize the state values and the expected reward, so write a function visually representing the calculated matrices.

The solution for this activity can be found via this link.

## Summary

In this chapter, we learned the differences between MCs, MRPs, and MDPs. An MC is the most straightforward description of a generic process that is composed of states and a probability function that describes the transition between states. An MRP includes the concept of rewards as a measure of how good a transition is. The MDP is what we are most interested in; it includes the concept of actions, policies, and goals.

In the context of Markov processes, we introduced Bellman equations in different forms and also analyzed the relationship between the state-value function and the action-value function.

We discussed various methods for solving MDPs, categorizing algorithms based on the information they require and on the methods they use. These algorithms will be presented in more detail in the following chapters. We focused on linear programming, showing how it is possible to solve MDPs using these techniques.

In the next chapter, you will learn how to use TensorFlow 2 to implement deep learning algorithms and machine learning models.