Markov Decision Processes and Dynamic Programming

In this chapter, we will continue our practical reinforcement learning journey with PyTorch by looking at Markov decision processes (MDPs) and dynamic programming. The chapter starts with the creation of a Markov chain and an MDP, which are at the core of most reinforcement learning algorithms. You will also become more familiar with Bellman equations by performing policy evaluation. We will then apply two approaches to solving an MDP: value iteration and policy iteration, using the FrozenLake environment as an example. At the end of the chapter, we will demonstrate how to solve the interesting coin-flipping gamble problem with dynamic programming, step by step.

The following recipes will be covered in this chapter:

  • Creating a Markov chain
  • Creating an MDP
  • Performing policy evaluation
  • Simulating the FrozenLake environment
  • Solving an MDP with a value iteration algorithm
  • Solving an MDP with a policy iteration algorithm
  • Solving the coin-flipping gamble problem

Technical requirements

You will need the following programs installed on your system to successfully execute the recipes in this chapter:

  • Python 3.6, 3.7, or above
  • Anaconda
  • PyTorch 1.0 or above
  • OpenAI Gym

Creating a Markov chain

Let's get started by creating a Markov chain, on which the MDP is developed.

A Markov chain describes a sequence of events that comply with the Markov property. It is defined by a set of possible states, S = {s0, s1, ... , sm}, and a transition matrix, T(s, s'), consisting of the probabilities of state s transitioning to state s'. With the Markov property, the future state of the process, given the present state, is conditionally independent of past states. In other words, the state of the process at t+1 is dependent only on the state at t. Here, we use a process of study and sleep as an example and create a Markov chain based on two states, s0 (study) and s1 (sleep). Let's say we have the following transition matrix:

In the next section, we will compute the transition matrix after k steps, and the probabilities of being in each state...
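To see how the chain behaves over multiple steps, we can store the transition matrix as a PyTorch tensor and raise it to the k-th power. The following is a minimal sketch; the probabilities and the initial state distribution here are placeholder values for illustration, not necessarily the ones used in the book's example:

import torch

# Hypothetical transition matrix for the two-state chain (s0 = study, s1 = sleep);
# these probabilities are placeholders, not necessarily the book's values.
T = torch.tensor([[0.4, 0.6],
                  [0.8, 0.2]])

k = 10
T_k = torch.matrix_power(T, k)   # (i, j) entry: probability of being in state j
print(T_k)                       # after k steps, starting from state i

# Distribution over states after k steps, from a hypothetical initial distribution.
v0 = torch.tensor([[0.7, 0.3]])
print(torch.mm(v0, T_k))

As k grows, the rows of T^k typically converge to the chain's stationary distribution, so the starting state matters less and less.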

Creating an MDP

Developed upon the Markov chain, an MDP involves an agent and a decision-making process. Let's go ahead with developing an MDP and calculating the value function under the optimal policy.

Besides a set of possible states, S = {s0, s1, ... , sm}, an MDP is defined by a set of actions, A = {a0, a1, ... , an}; a transition model, T(s, a, s'); a reward function, R(s); and a discount factor, γ. The transition matrix, T(s, a, s'), contains the probabilities of landing in state s' after taking action a in state s. The discount factor, γ, controls the trade-off between future rewards and immediate ones.

To make our MDP slightly more complicated, we extend the study and sleep process with one more state, s2 (play games). Let's say we have two actions, a0 (work) and a1 (slack). The 3 * 2 * 3 transition matrix, T(s, a, s'), is as follows:

...
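Before moving on, here is a rough sketch of how the value of a fixed policy can be computed in closed form with matrix inversion, V = (I - γT_π)^(-1) R, which is the approach referred to in the next recipe. The transition probabilities, rewards, discount factor, and policy below are hypothetical placeholders, not the book's exact numbers:

import torch

gamma = 0.5   # hypothetical discount factor

# Hypothetical 3 x 2 x 3 transition tensor T[s, a, s'] for the states
# (study, sleep, play games) and actions (work, slack); placeholder values only.
T = torch.tensor([[[0.8, 0.1, 0.1],
                   [0.1, 0.6, 0.3]],
                  [[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]],
                  [[0.6, 0.2, 0.2],
                   [0.1, 0.4, 0.5]]])
R = torch.tensor([1.0, 0.0, -1.0])   # hypothetical reward for each state

def policy_value(policy, T, R, gamma):
    """Closed-form value of a fixed policy: V = (I - gamma * T_pi)^-1 R."""
    num_states = T.shape[0]
    # Average the transition probabilities over actions according to the policy.
    T_pi = torch.einsum('sa,sat->st', policy, T)
    I = torch.eye(num_states)
    return torch.matmul(torch.inverse(I - gamma * T_pi), R)

# A hypothetical stochastic policy: probabilities of (work, slack) in each state.
policy = torch.tensor([[0.9, 0.1],
                       [0.5, 0.5],
                       [0.2, 0.8]])
print(policy_value(policy, T, R, gamma))

Because inverting an m * m matrix costs roughly O(m^3) operations, this closed-form approach becomes impractical for large state spaces, which is what motivates the iterative method in the next recipe.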

Performing policy evaluation

We have just developed an MDP and computed the value function of the optimal policy using matrix inversion. We also mentioned the limitation of inverting an m * m matrix with a large m value (let's say 1,000, 10,000, or 100,000). In this recipe, we will talk about a simpler approach called policy evaluation.

Policy evaluation is an iterative algorithm. It starts with arbitrary policy values and then iteratively updates the values based on the Bellman expectation equation until they converge. In each iteration, the value of a policy, π, for a state, s, is updated as follows:

Here, π(s, a) denotes the probability of taking action a in state s under policy π. T(s, a, s') is the transition probability from state s to state s' by taking action a, and R(s, a) is the reward received in state s by taking action a.
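Putting the update into code, a minimal sketch of policy evaluation might look as follows. It assumes the policy, transition model, and rewards are stored as tensors of shapes (states, actions), (states, actions, states), and (states, actions), respectively; this is an illustrative implementation, not the book's exact code:

import torch

def policy_evaluation(policy, T, R, gamma, threshold=1e-4):
    """Iteratively evaluate a policy until the values change by less than threshold.

    policy: tensor of shape (num_states, num_actions), policy[s, a] = pi(s, a)
    T:      tensor of shape (num_states, num_actions, num_states), T[s, a, s']
    R:      tensor of shape (num_states, num_actions), R[s, a]
    """
    num_states = T.shape[0]
    V = torch.zeros(num_states)            # arbitrary initial values
    while True:
        # Q[s, a] = R[s, a] + gamma * sum over s' of T[s, a, s'] * V[s']
        Q = R + gamma * torch.einsum('sat,t->sa', T, V)
        # V_new[s] = sum over a of pi(s, a) * Q[s, a]  (Bellman expectation equation)
        V_new = (policy * Q).sum(dim=1)
        if torch.max(torch.abs(V_new - V)) < threshold:
            return V_new
        V = V_new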

There are...

Simulating the FrozenLake environment

The optimal policies for the MDPs we have dealt with so far are pretty intuitive. However, it won't be that straightforward in most cases, such as the FrozenLake environment. In this recipe, let's play around with the FrozenLake environment and get ready for upcoming recipes where we will find its optimal policy.

FrozenLake is a typical Gym environment with a discrete state space. It is about moving an agent from the starting location to the goal location in a grid world, and at the same time avoiding traps. The grid is either four by four (https://gym.openai.com/envs/FrozenLake-v0/) or eight by eight (https://gym.openai.com/envs/FrozenLake8x8-v0/). The grid is made up of the following four types of tiles:

  • S: The starting location
  • G: The goal location, which terminates an episode
  • F: The frozen tile, which is a walkable location...
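To get a feel for the environment before we try to solve it, a minimal interaction loop looks roughly like the following. It assumes the classic Gym API (where env.step returns a four-tuple) and the FrozenLake-v0 environment ID from the URL above:

import gym

env = gym.make('FrozenLake-v0')      # the 4 x 4 grid; 'FrozenLake8x8-v0' for 8 x 8
n_state = env.observation_space.n    # 16 discrete states for the 4 x 4 grid
n_action = env.action_space.n        # 4 actions: left, down, right, up
print(n_state, n_action)

state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # random action, just to explore
    state, reward, done, info = env.step(action)
    env.render()
print('Episode finished with reward', reward)

Because the frozen tiles are slippery by default, the agent only moves in the intended direction some of the time, which is part of what makes the optimal policy non-obvious.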

Solving an MDP with a value iteration algorithm

An MDP is considered solved if its optimal policy is found. In this recipe, we will figure out the optimal policy for the FrozenLake environment using a value iteration algorithm.

The idea behind value iteration is quite similar to that of policy evaluation. It is also an iterative algorithm: it starts with arbitrary values and then iteratively updates them based on the Bellman optimality equation until they converge. In each iteration, instead of taking the expectation (average) of values across all actions, it picks the action that yields the maximal value:

Here, V*(s) denotes the optimal value, which is the value of the optimal policy; T(s, a, s') is the transition probability from state s to state s' by taking action a; and R(s, a) is the reward received in state s by taking action a.
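In code, the value iteration update V(s) ← max over a of [R(s, a) + γ Σ over s' of T(s, a, s') V(s')] can be sketched as follows, using the same tensor-shape conventions as the policy evaluation sketch earlier (again, an illustration rather than the book's exact implementation):

import torch

def value_iteration(T, R, gamma, threshold=1e-4):
    """Compute optimal state values with the Bellman optimality update."""
    num_states = T.shape[0]
    V = torch.zeros(num_states)            # arbitrary initial values
    while True:
        # Q[s, a] = R[s, a] + gamma * sum over s' of T[s, a, s'] * V[s']
        Q = R + gamma * torch.einsum('sat,t->sa', T, V)
        V_new, _ = torch.max(Q, dim=1)     # pick the best action in each state
        if torch.max(torch.abs(V_new - V)) < threshold:
            break
        V = V_new
    optimal_policy = torch.argmax(Q, dim=1)  # greedy policy extracted from Q
    return V_new, optimal_policy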

Once...

Solving an MDP with a policy iteration algorithm

Another approach to solving an MDP is by using a policy iteration algorithm, which we will discuss in this recipe.

A policy iteration algorithm can be subdivided into two components: policy evaluation and policy improvement. It starts with an arbitrary policy. In each iteration, it first computes the policy values for the latest policy, based on the Bellman expectation equation; it then extracts an improved policy from the resulting values, based on the Bellman optimality equation. It alternates between evaluating and improving the policy until the policy no longer changes.
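A minimal sketch of this loop, reusing the same tensor-shape conventions as the earlier sketches (illustrative only, not the book's exact implementation), might look like this:

import torch

def policy_evaluation(policy, T, R, gamma, threshold=1e-4):
    """Same iterative evaluation as in the earlier sketch."""
    V = torch.zeros(T.shape[0])
    while True:
        Q = R + gamma * torch.einsum('sat,t->sa', T, V)
        V_new = (policy * Q).sum(dim=1)
        if torch.max(torch.abs(V_new - V)) < threshold:
            return V_new
        V = V_new

def policy_improvement(V, T, R, gamma):
    """Extract the greedy (deterministic) policy from the current values."""
    Q = R + gamma * torch.einsum('sat,t->sa', T, V)
    return torch.argmax(Q, dim=1)

def policy_iteration(T, R, gamma, threshold=1e-4):
    num_states, num_actions = R.shape
    # Start from an arbitrary policy: uniformly random over actions.
    policy = torch.ones(num_states, num_actions) / num_actions
    while True:
        V = policy_evaluation(policy, T, R, gamma, threshold)
        best_actions = policy_improvement(V, T, R, gamma)
        new_policy = torch.zeros_like(policy)
        new_policy[torch.arange(num_states), best_actions] = 1.0
        if torch.equal(new_policy, policy):   # stop when the policy stops changing
            return V, new_policy
        policy = new_policy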

Let's develop a policy iteration algorithm and use it to solve the FrozenLake environment. After that, we will explain how it works.

...

Solving the coin-flipping gamble problem

Gambling on coin flipping should sound familiar to everyone. In each round of the game, the gambler can make a bet on whether a coin flip will show heads. If it turns out heads, the gambler will win the same amount they bet; otherwise, they will lose this amount. The game continues until the gambler loses (ends up with nothing) or wins (wins more than 100 dollars, let's say). Let's say the coin is unfair and it lands on heads 40% of the time. In order to maximize the chance of winning, how much should the gambler bet based on their current capital in each round? This will definitely be an interesting problem to solve.
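One common way to cast this as an MDP, following the description above, is to treat the gambler's capital (0 to 100 dollars) as the state, the bet size as the action, and to give a reward of 1 only upon reaching the goal, so that the undiscounted value of a state equals the probability of eventually winning from it. The following value iteration sketch uses the 40% heads probability and the 100-dollar target from the text; it is an illustration, not necessarily the book's exact formulation:

import torch

head_prob = 0.4      # probability of heads, as stated in the problem
capital_max = 100    # the gambler wins upon reaching 100 dollars
threshold = 1e-9

# States are the gambler's capital, 0..100; 0 (broke) and 100 (goal) are terminal.
V = torch.zeros(capital_max + 1)

# Value iteration over the capital states, with no discounting; reward 1 is
# received only on reaching the goal.
while True:
    delta = 0.0
    for capital in range(1, capital_max):
        # Possible bets: at most the current capital, and no more than is
        # needed to reach the goal.
        bets = range(1, min(capital, capital_max - capital) + 1)
        values = []
        for bet in bets:
            win_value = 1.0 if capital + bet == capital_max else V[capital + bet].item()
            lose_value = V[capital - bet].item()
            values.append(head_prob * win_value + (1 - head_prob) * lose_value)
        best = max(values)
        delta = max(delta, abs(best - V[capital].item()))
        V[capital] = best
    if delta < threshold:
        break

print(V[1:capital_max])  # estimated probability of winning from each capital level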

If the coin lands on heads more than 50% of the time, there is nothing to discuss. The gambler can just keep betting one dollar each round and should win the game most of the time. If it is a fair coin, the gambler...
