Chapter 5. Building Virtual Worlds in Minecraft
In the two previous chapters, we discussed the deep Q-network (DQN) algorithm for playing Atari games and the Trust Region Policy Optimization (TRPO) algorithm for continuous control tasks. Compared to traditional reinforcement learning algorithms, which do not use deep neural networks to approximate the value function or the policy function, these algorithms achieved great success in solving complex problems. Their main disadvantage, especially for DQN, is that training converges slowly; for example, training an agent to play Atari games takes about one week. For more complex games, even one week of training is insufficient.
This chapter will introduce a more complicated example, Minecraft, which is a popular online video game created by Swedish game developer Markus Persson and later developed by Mojang. You will learn how to launch a Minecraft environment using OpenAI Gym and play different missions. In order to build...
Introduction to the Minecraft environment
The original OpenAI Gym does not contain the Minecraft environment. We need to install a Minecraft environment bundle, available at https://github.com/tambetm/gym-minecraft. This bundle is built based on Microsoft's Malmö, which is a platform for AI experimentation and research built on top of Minecraft.
Before installing the gym-minecraft package, Malmö should first be downloaded from https://github.com/Microsoft/malmo. We can download the latest pre-built version from https://github.com/Microsoft/malmo/releases. After unzipping the package, go to the Minecraft folder and run launchClient.bat on Windows, or launchClient.sh on Linux/macOS, to launch a Minecraft environment. If it launches successfully, we can then install gym-minecraft via the following scripts:
python3 -m pip install gym
python3 -m pip install pygame
git clone https://github.com/tambetm/minecraft-py.git
cd minecraft-py
python setup.py install
git clone https://github.com/tambetm...
In the Atari environment, recall that there are three modes for each Atari game, for example, Breakout, BreakoutDeterministic, and BreakoutNoFrameskip, and each mode has two versions, for example, Breakout-v0 and Breakout-v4. The main difference between the three modes is the frameskip parameter, which indicates the number of frames (steps) over which one action is repeated. This is called the frame-skipping technique, and it allows the agent to play through more frames without a proportional increase in runtime.
However, in the Minecraft environment, there is only one mode, in which the frameskip parameter is equal to one. Therefore, in order to apply the frame-skipping technique, we need to explicitly repeat a chosen action frameskip times during one timestep. Besides this, the frame images returned by the step function are RGB images. As in the Atari environment, the observed frame images are converted to grayscale and then resized to 84x84. The following code provides the wrapper...
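Since the wrapper code itself is truncated above, here is a minimal, dependency-light sketch of what such a wrapper can look like. The names preprocess and FrameSkipWrapper are illustrative, not the book's exact code; grayscale conversion uses the standard luminosity weights, and the resize is a crude nearest-neighbour lookup standing in for a proper image resize:

```python
import numpy as np

def preprocess(frame, size=84):
    """Convert an RGB frame (H, W, 3) to an 84x84 grayscale image.

    Grayscale uses the standard luminosity weights; the resize is a
    simple nearest-neighbour lookup to avoid external dependencies.
    """
    gray = frame @ np.array([0.299, 0.587, 0.114])   # (H, W)
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return gray[rows][:, cols].astype(np.float32)

class FrameSkipWrapper:
    """Repeat each action `frameskip` times and return the last frame.

    The rewards collected over the skipped frames are summed, matching
    the behaviour of the frameskip parameter in the Atari environments.
    """
    def __init__(self, env, frameskip=4):
        self.env = env
        self.frameskip = frameskip

    def reset(self):
        return preprocess(self.env.reset())

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.frameskip):
            frame, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return preprocess(frame), total_reward, done, info
```

A real wrapper would follow the Gym Env interface more closely, but the two ingredients shown here (preprocessing and explicit action repetition) are the ones the Minecraft environment needs added on top.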
Asynchronous advantage actor-critic algorithm
In the previous chapters, we discussed the DQN for playing Atari games and the use of the DPG and TRPO algorithms for continuous control tasks. Recall that DQN has the following architecture:
At each timestep t, the agent observes the frame image s_t and selects an action a_t based on the current learned policy. The emulator (the Minecraft environment) executes this action and returns the next frame image s_{t+1} and the corresponding reward r_t. The quadruplet (s_t, a_t, r_t, s_{t+1}) is then stored in the experience memory and is taken as a sample for training the Q-network by minimizing the empirical loss function via stochastic gradient descent.
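As a refresher, the experience memory can be sketched in a few lines of plain Python. The class name ReplayMemory and the extra terminal flag are illustrative assumptions, not the book's exact implementation:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size experience memory storing (s_t, a_t, r_t, s_{t+1})
    quadruplets (plus a terminal flag, an assumed addition), with
    uniform random minibatch sampling for Q-network training."""

    def __init__(self, capacity=100000):
        # deque with maxlen discards the oldest transitions automatically
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive transitions, which stabilizes training.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```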
Deep reinforcement learning algorithms based on experience replay have achieved unprecedented success in playing Atari games. However, experience replay has several disadvantages:
- It uses more memory and computation per real interaction
- It requires off-policy learning algorithms that can update from data generated by an older policy
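The asynchronous alternative replaces the replay memory with several parallel actor-learners that apply their gradients directly to globally shared parameters. The following toy sketch (plain numpy and threading; the "gradient" is a stand-in, not an actual policy gradient) illustrates only this shared-parameter update pattern:

```python
import threading
import numpy as np

# Globally shared parameters, updated asynchronously by several workers
# (the role played by the shared model in A3C).
shared_theta = np.zeros(4)
lock = threading.Lock()

def worker(n_updates, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        # Each worker would compute a gradient from its own on-policy
        # experience; here, a toy gradient pulling theta toward all-ones.
        with lock:
            grad = shared_theta - 1.0 + 0.01 * rng.standard_normal(4)
            shared_theta[:] = shared_theta - lr * grad

# Four asynchronous workers, each applying 500 updates.
threads = [threading.Thread(target=worker, args=(500, 0.01, s))
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every worker generates fresh, on-policy experience, no replay memory is needed, and the decorrelation that replay provided comes instead from the workers exploring different parts of the environment in parallel.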
In order...
We will now look at how to implement A3C using Python and TensorFlow. Here, the policy network and value network share the same feature representation. We implement two kinds of policies: one is based on the CNN architecture used in DQN, and the other is based on LSTM.
We implement the FFPolicy class for the policy based on CNN:
    import tensorflow as tf

    class FFPolicy:
        def __init__(self, input_shape=(84, 84, 4), n_outputs=4,
                     network_type='cnn'):
            self.width = input_shape[0]
            self.height = input_shape[1]
            self.channel = input_shape[2]
            self.n_outputs = n_outputs
            self.network_type = network_type
            self.entropy_beta = 0.01
            self.x = tf.placeholder(dtype=tf.float32,
                                    shape=(None, self.channel,
                                           self.width, self.height))
            self.build_model()
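The build_model method (not shown here) constructs the shared feature representation and, on top of it, a softmax policy head and a scalar value head. The following framework-free numpy sketch illustrates just this two-head structure on a flat feature vector; all names and sizes are illustrative, not the book's code:

```python
import numpy as np

def shared_heads_forward(features, W_pi, b_pi, W_v, b_v):
    """Given shared features, return the policy distribution and state value.

    In FFPolicy the features come from the CNN trunk; here they are a flat
    vector so the two-head structure stands on its own.
    """
    logits = features @ W_pi + b_pi
    logits = logits - logits.max()                # numerical stability
    policy = np.exp(logits) / np.exp(logits).sum()  # softmax policy head
    value = float(features @ W_v + b_v)             # scalar value head
    return policy, value

rng = np.random.default_rng(0)
n_features, n_actions = 256, 4
W_pi = rng.standard_normal((n_features, n_actions)) * 0.01
b_pi = np.zeros(n_actions)
W_v = rng.standard_normal(n_features) * 0.01
b_v = 0.0

policy, value = shared_heads_forward(rng.standard_normal(n_features),
                                     W_pi, b_pi, W_v, b_v)
```

Sharing the trunk means one set of convolutional features serves both heads, which is cheaper than training two separate networks and is the standard choice in A3C implementations.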
The constructor requires three arguments:

- input_shape
- n_outputs
- network_type

input_shape is the size of the input image. After data preprocessing, the input is an 84x84x4...
The full implementation of the A3C algorithm can be downloaded from our GitHub repository (https://github.com/PacktPublishing/Python-Reinforcement-Learning-Projects). There are three environments we can test in our implementation. The first one is the special game, demo, introduced in Chapter 3, Playing Atari Games. For this game, A3C only needs to launch two agents to achieve good performance. Run the following command in the src folder:

python3 train.py -w 2 -e demo

The first argument, -w, or --num_workers, indicates the number of launched agents. The second argument, -e, or --env, specifies the environment, for example, demo. The other two environments are Atari and Minecraft. For Atari games, A3C requires at least 8 agents running in parallel. Typically, launching 16 agents achieves better performance:
python3 train.py -w 8 -e Breakout
For Breakout, A3C takes about 2-3 hours to achieve a score of 300. If you have a decent PC with more than 8 cores, it is better to test it...
This chapter introduced the Gym Minecraft environment, available at https://github.com/tambetm/gym-minecraft. You have learned how to launch a Minecraft mission and how to implement an emulator wrapper for it. The most important part of this chapter was the asynchronous reinforcement learning framework. You learned what the shortcomings of DQN are, and why DQN is difficult to apply to complex tasks. Then, you learned how to apply the asynchronous reinforcement learning framework to the actor-critic extension of REINFORCE, which led us to the A3C algorithm. Finally, you learned how to implement A3C using TensorFlow and how to handle multiple terminals using tmux. The trickiest part of the implementation is handling the globally shared parameters, which involves creating a cluster of TensorFlow servers. Readers who want to learn more about this can visit https://www.tensorflow.org/deploy/distributed.
In the following chapters, you will learn more about how to apply reinforcement learning algorithms...