Chapter 5. Building Virtual Worlds in Minecraft
In the two previous chapters, we discussed the deep Q-network (DQN) algorithm for playing Atari games and the Trust Region Policy Optimization (TRPO) algorithm for continuous control tasks. Compared to traditional reinforcement learning algorithms, which do not use deep neural networks to approximate the value function or the policy function, these algorithms achieved great success in solving complex problems. Their main disadvantage, especially for DQN, is that training converges slowly; for example, training an agent to play Atari games takes about one week. For more complex games, even one week of training is insufficient.
This chapter will introduce a more complicated example, Minecraft, which is a popular online video game created by Swedish game developer Markus Persson and later developed by Mojang. You will learn how to launch a Minecraft environment using OpenAI Gym and play different missions. In order to build...
Introduction to the Minecraft environment
The original OpenAI Gym does not contain the Minecraft environment. We need to install a Minecraft environment bundle, available at https://github.com/tambetm/gym-minecraft. This bundle is built based on Microsoft's Malmö, which is a platform for AI experimentation and research built on top of Minecraft.
Before installing the gym-minecraft package, Malmö should first be downloaded from https://github.com/Microsoft/malmo. We can download the latest pre-built version from https://github.com/Microsoft/malmo/releases. After unzipping the package, go to the Minecraft folder and run launchClient.bat on Windows, or launchClient.sh on Linux/macOS, to launch a Minecraft environment. If it launches successfully, we can then install gym-minecraft via the following scripts:
python3 -m pip install gym
python3 -m pip install pygame
git clone https://github.com/tambetm/minecraft-py.git
cd minecraft-py
python setup.py install
git clone https://github.com/tambetm...
In the Atari environment, recall that there are three modes for each Atari game, for example, Breakout, BreakoutDeterministic, and BreakoutNoFrameskip, and each mode has two versions, for example, Breakout-v0 and Breakout-v4. The main difference between the three modes is the frameskip parameter, which indicates the number of frames (steps) over which one action is repeated. This is called the frame-skipping technique, and it allows the agent to play through more frames without a proportional increase in runtime.
However, in the Minecraft environment, there is only one mode, in which the frameskip parameter is equal to one. Therefore, in order to apply the frame-skipping technique, we need to explicitly repeat a chosen action frameskip times during one timestep. Besides this, the frame images returned by the step function are RGB images. As in the Atari environment, the observed frame images are converted to grayscale and then resized to 84x84. The following code provides the wrapper...
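Since the wrapper code itself is truncated above, here is a minimal, dependency-light sketch of what such a wrapper can look like. The names preprocess and FrameSkipWrapper are illustrative, not the book's exact code; grayscale conversion uses the standard luminosity weights, and the resize is a crude nearest-neighbour lookup standing in for a proper image resize:

```python
import numpy as np

def preprocess(frame, size=84):
    """Convert an RGB frame (H, W, 3) to an 84x84 grayscale image.

    Grayscale uses the standard luminosity weights; the resize is a
    simple nearest-neighbour lookup to avoid external dependencies.
    """
    gray = frame @ np.array([0.299, 0.587, 0.114])   # (H, W)
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return gray[rows][:, cols].astype(np.float32)

class FrameSkipWrapper:
    """Repeat each action `frameskip` times and return the last frame.

    The rewards collected over the skipped frames are summed, matching
    the behaviour of the frameskip parameter in the Atari environments.
    """
    def __init__(self, env, frameskip=4):
        self.env = env
        self.frameskip = frameskip

    def reset(self):
        return preprocess(self.env.reset())

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.frameskip):
            frame, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return preprocess(frame), total_reward, done, info
```

A real wrapper would follow the Gym Env interface more closely, but the two ingredients shown here (preprocessing and explicit action repetition) are the ones the Minecraft environment needs added on top.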
Asynchronous advantage actor-critic algorithm
In the previous chapters, we discussed the DQN for playing Atari games and the use of the DPG and TRPO algorithms for continuous control tasks. Recall that DQN has the following architecture:
At each timestep t, the agent observes the frame image s_t and selects an action a_t based on the current learned policy. The emulator (the Minecraft environment) executes this action and returns the next frame image s_{t+1} and the corresponding reward r_t. The quadruplet (s_t, a_t, r_t, s_{t+1}) is then stored in the experience memory and is taken as a sample for training the Q-network by minimizing the empirical loss function via stochastic gradient descent.
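As a refresher, the experience memory can be sketched in a few lines of plain Python. The class name ReplayMemory and the extra terminal flag are illustrative assumptions, not the book's exact implementation:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience memory storing (s_t, a_t, r_t, s_{t+1})
    quadruplets (plus a terminal flag, an assumed addition), with
    uniform random minibatch sampling for Q-network training."""

    def __init__(self, capacity=100000):
        # deque with maxlen discards the oldest transitions automatically
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions, which stabilizes training.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```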
Deep reinforcement learning algorithms based on experience replay have achieved unprecedented success in playing Atari games. However, experience replay has several disadvantages:
- It uses more memory and computation per real interaction
- It requires off-policy learning algorithms that can update from data generated by an older policy
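The asynchronous alternative replaces the replay memory with several parallel actor-learners that apply their gradients directly to globally shared parameters. The following toy sketch (plain numpy and threading; the "gradient" is a stand-in, not an actual policy gradient) illustrates only this shared-parameter update pattern:

```python
import threading
import numpy as np

# Globally shared parameters, updated asynchronously by several workers
# (the role played by the shared model in A3C).
shared_theta = np.zeros(4)
lock = threading.Lock()

def worker(n_updates, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        # Each worker would compute a gradient from its own on-policy
        # experience; here, a toy gradient pulling theta toward all-ones.
        with lock:
            grad = shared_theta - 1.0 + 0.01 * rng.standard_normal(4)
            shared_theta[:] = shared_theta - lr * grad

# Four asynchronous workers, each applying 500 updates.
threads = [threading.Thread(target=worker, args=(500, 0.01, s))
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every worker generates fresh, on-policy experience, no replay memory is needed, and the decorrelation that replay provided comes instead from the workers exploring different parts of the environment in parallel.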
In order...
We will now look at how to implement A3C using Python and TensorFlow. Here, the policy network and value network share the same feature representation. We implement two kinds of policies: one is based on the CNN architecture used in DQN, and the other is based on LSTM.
We implement the FFPolicy class for the policy based on CNN:
    import tensorflow as tf

    class FFPolicy:
        def __init__(self, input_shape=(84, 84, 4), n_outputs=4,
                     network_type='cnn'):
            self.width = input_shape[0]
            self.height = input_shape[1]
            self.channel = input_shape[2]
            self.n_outputs = n_outputs
            self.network_type = network_type
            self.entropy_beta = 0.01
            self.x = tf.placeholder(dtype=tf.float32,
                                    shape=(None, self.channel,
                                           self.width, self.height))
            self.build_model()
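The build_model method (not shown here) constructs the shared feature representation and, on top of it, a softmax policy head and a scalar value head. The following framework-free numpy sketch illustrates just this two-head structure on a flat feature vector; all names and sizes are illustrative, not the book's code:

```python
import numpy as np

def shared_heads_forward(features, W_pi, b_pi, W_v, b_v):
    """Given shared features, return the policy distribution and state value.

    In FFPolicy the features come from the CNN trunk; here they are a flat
    vector so the two-head structure stands on its own.
    """
    logits = features @ W_pi + b_pi
    logits = logits - logits.max()                # numerical stability
    policy = np.exp(logits) / np.exp(logits).sum()  # softmax policy head
    value = float(features @ W_v + b_v)             # scalar value head
    return policy, value

rng = np.random.default_rng(0)
n_features, n_actions = 256, 4
W_pi = rng.standard_normal((n_features, n_actions)) * 0.01
b_pi = np.zeros(n_actions)
W_v = rng.standard_normal(n_features) * 0.01
b_v = 0.0

policy, value = shared_heads_forward(rng.standard_normal(n_features),
                                     W_pi, b_pi, W_v, b_v)
```

Sharing the trunk means one set of convolutional features serves both heads, which is cheaper than training two separate networks and is the standard choice in A3C implementations.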
The constructor requires three arguments:

- input_shape
- n_outputs
- network_type

input_shape is the size of the input image. After data preprocessing, the input is an 84x84x4...
The full implementation of the A3C algorithm can be downloaded from our GitHub repository (https://github.com/PacktPublishing/Python-Reinforcement-Learning-Projects). There are three environments we can test in our implementation. The first one is the special game, demo, introduced in Chapter 3, Playing Atari Games. For this game, A3C only needs to launch two agents to achieve good performance. Run the following command in the src folder:

python3 train.py -w 2 -e demo

The first argument, -w, or --num_workers, indicates the number of launched agents. The second argument, -e, or --env, specifies the environment, for example, demo. The other two environments are Atari and Minecraft. For Atari games, A3C requires at least 8 agents running in parallel. Typically, launching 16 agents achieves better performance:
python3 train.py -w 8 -e Breakout
For Breakout, A3C takes about 2-3 hours to achieve a score of 300. If you have a decent PC with more than 8 cores, it is better to test it...
This chapter introduced the Gym Minecraft environment, available at https://github.com/tambetm/gym-minecraft. You have learned how to launch a Minecraft mission and how to implement an emulator wrapper for it. The most important part of this chapter was the asynchronous reinforcement learning framework. You learned what the shortcomings of DQN are, and why DQN is difficult to apply to complex tasks. Then, you learned how to apply the asynchronous reinforcement learning framework to the actor-critic extension of REINFORCE, which led us to the A3C algorithm. Finally, you learned how to implement A3C using TensorFlow and how to handle multiple terminals using tmux. The trickiest part of the implementation is handling the globally shared parameters, which involves creating a cluster of TensorFlow servers. Readers who want to learn more about this can visit https://www.tensorflow.org/deploy/distributed.
In the following chapters, you will learn more about how to apply reinforcement learning algorithms...