Chapter 3. Markov Decision Process

The Markov decision process, better known as the MDP, is an approach in reinforcement learning for making decisions in a gridworld environment. A gridworld environment consists of states laid out as grid cells, such as the FrozenLake-v0 environment from OpenAI Gym, which we examined and solved in the last chapter.

The MDP captures such a world by dividing it into states, actions, transition models, and rewards. The solution to an MDP is called a policy, and the objective is to find the optimal policy for that MDP task.

Thus, any reinforcement learning task composed of a set of states, actions, and rewards that follows the Markov property would be considered an MDP.

In this chapter, we will dig deep into MDPs: states, actions, rewards, policies, and how to solve them using Bellman equations. Moreover, we will cover the basics of partially observable MDPs and the difficulty of solving them. We will also cover the exploration...

Markov decision processes


As already mentioned, an MDP is a reinforcement learning approach for a gridworld environment containing sets of states, actions, and rewards, which follows the Markov property in order to obtain an optimal policy. An MDP is defined as the collection of the following (a short code sketch of these components appears after the list):

  • States: S
  • Actions: A(s), A
  • Transition model: T(s,a,s') ~ P(s'|s,a)
  • Rewards: R(s), R(s,a), R(s,a,s')
  • Policy: π(s) → a, a mapping from each state to an action; π*(s) denotes the optimal policy
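
To make the tuple concrete, the following is a minimal sketch of a made-up two-state MDP written as plain Python structures; the states, actions, probabilities, and rewards are invented purely for illustration and are not part of FrozenLake-v0:

# a hypothetical two-state MDP written as plain Python structures
states = ['s0', 's1']
actions = ['stay', 'move']

# transition model T(s,a,s') ~ P(s'|s,a)
T = {
    ('s0', 'stay'): {'s0': 0.9, 's1': 0.1},
    ('s0', 'move'): {'s0': 0.2, 's1': 0.8},
    ('s1', 'stay'): {'s1': 1.0},
    ('s1', 'move'): {'s0': 0.7, 's1': 0.3},
}

# reward R(s): a bonus for reaching s1
R = {'s0': 0.0, 's1': 1.0}

# a deterministic policy maps each state to an action
policy = {'s0': 'move', 's1': 'stay'}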

In the case of an MDP, the environment is fully observable, that is, whatever observation the agent makes at any point in time is enough to make an optimal decision. In the case of a partially observable environment, the agent needs memory to store past observations in order to make the best possible decisions.

Let's try to break this into different Lego blocks to understand what this overall process means.

The Markov property

In short, as per the Markov property, in order to predict the near future (say, at time t+1), only the present information at time t matters; the history before t adds nothing further.

Given a sequence, s0, s1, ..., st, the first...
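
The following small sketch illustrates the property with a three-state Markov chain; the transition matrix is made up for illustration, and the point is simply that sampling the next state needs only the current state, not the whole trajectory:

import numpy as np

# P[i, j] = P(s_{t+1} = j | s_t = i); an invented 3x3 transition matrix
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

rng = np.random.default_rng(0)
state = 0
trajectory = [state]
for _ in range(10):
    # only the current state is needed to sample the next one;
    # the earlier part of the trajectory is irrelevant
    state = rng.choice(3, p=P[state])
    trajectory.append(state)

print(trajectory)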

Partially observable Markov decision processes


In an MDP, the observable quantities are the action set A, the state set S, the transition model T, and the reward set R. This is not the case for a partially observable MDP, also known as a POMDP. In a POMDP, there is an underlying MDP that is not directly observable to the agent, which has to take its decisions from whatever observations it makes.

In a POMDP, there is an observation set, Z, containing the different observable states, and an observation function, O, which takes a state s and an observation z as inputs and outputs the probability of seeing observation z in state s.

POMDPs are basically a generalization of MDPs:

  • MDP: {S,A,T,R}

  • POMDP: {S,A,Z,T,R,O}

  • where S, A, T, and R are the same. Therefore, for a POMDP to be a true MDP, the following condition must hold: Z = S and O(s, z) = 1 whenever z = s, that is, the agent fully observes all states

POMDPs are hugely intractable to solve optimally.

State estimation

If we expand the state space, it helps us convert the POMDP into an MDP, where Z contains the fully observable state space...
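
One common way to make this concrete is to maintain a belief state, a probability distribution over the hidden states, and update it after every action and observation. The sketch below assumes a tiny two-state POMDP with invented transition and observation tables; it is only meant to show the shape of the belief update b'(s') ∝ O(s', z) * Σ_s T(s, a, s') * b(s):

import numpy as np

# invented tables for a two-state POMDP:
# T[a][s, s'] = P(s'|s, a), O[z][s'] = P(z|s')
T = {'go': np.array([[0.8, 0.2],
                     [0.3, 0.7]])}
O = {'beep': np.array([0.9, 0.1]),
     'silence': np.array([0.1, 0.9])}

def update_belief(b, a, z):
    predicted = b @ T[a]             # predict: sum_s T(s,a,s') * b(s)
    unnormalized = O[z] * predicted  # weight by observation likelihood
    return unnormalized / unnormalized.sum()

b = np.array([0.5, 0.5])             # start with a uniform belief
b = update_belief(b, 'go', 'beep')
print(b)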

Training the FrozenLake-v0 environment using MDP


This is about a gridworld environment in OpenAI Gym called FrozenLake-v0, discussed in Chapter 2, Training Reinforcement Learning Agents Using OpenAI Gym. There, we implemented Q-learning and a Q-network (which we will discuss in future chapters) to get an understanding of an OpenAI Gym environment.

Now, let's try to implement value iteration to obtain the utility value of each state in the FrozenLake-v0 environment, using the following code:

# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import time

#Load the environment
env = gym.make('FrozenLake-v0')

s = env.reset()
print(s)
print()

env.render()
print()

print(env.action_space) #number of actions
print(env.observation_space) #number of states
print()

print("Number of actions : ",env.action_space.n)
print("Number of states : ",env.observation_space.n)
print()

# Value Iteration Implementation

#Initializing Utilities of all states with zeros...
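
A minimal value-iteration sketch along these lines is shown below; the discount factor, the convergence threshold, and the use of env.unwrapped.P (the transition model that gym's toy-text environments expose as lists of (probability, next_state, reward, done) tuples) are assumptions, not necessarily the book's exact code:

gamma = 0.99   # discount factor (assumed value)
theta = 1e-8   # convergence threshold (assumed value)

n_states = env.observation_space.n
n_actions = env.action_space.n
U = np.zeros(n_states)   # utility of each state, initialized to zero

while True:
    delta = 0.0
    for s in range(n_states):
        # Bellman backup: U(s) = max_a sum_s' T(s,a,s') * [R + gamma * U(s')]
        q_values = []
        for a in range(n_actions):
            q = sum(p * (r + gamma * U[s_next])
                    for p, s_next, r, done in env.unwrapped.P[s][a])
            q_values.append(q)
        best = max(q_values)
        delta = max(delta, abs(best - U[s]))
        U[s] = best
    if delta < theta:
        break

print("Utility values of all states:")
print(U.reshape(4, 4))   # default FrozenLake-v0 map is 4x4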

Summary


In this chapter, we covered the details of a gridworld type of environment and understood the basics of the Markov decision process, that is, states, actions, rewards, transition model, and policy. Moreover, we utilized this information to calculate the utility and optimal policy through value iteration and policy iteration approaches.

Apart from this, we got a basic understanding of what partially observable Markov decision processes look like and the challenges in solving them. Finally, we took our favorite gridworld environment from OpenAI Gym, FrozenLake-v0, and implemented a value iteration approach to make our agent learn to navigate that environment.

In the next chapter, we will start with policy gradients and move beyond FrozenLake to some other fascinating and complex environments.
