Python Deep Learning

Authors: Valentino Zocca, Gianmario Spacagna, Daniel Slater, and Peter Roelants
Published: April 2017 by Packt
ISBN-13: 9781786464453
Pages: 406 (1st Edition)

Table of Contents

Python Deep Learning
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
Machine Learning – An Introduction
Neural Networks
Deep Learning Fundamentals
Unsupervised Feature Learning
Image Recognition
Recurrent Neural Networks and Language Models
Deep Learning for Board Games
Deep Learning for Computer Games
Anomaly Detection
Building a Production-Ready Intrusion Detection System
Index

Chapter 8. Deep Learning for Computer Games

The last chapter focused on solving board games. In this chapter, we will look at the more complex problem of training AI to play computer games. Unlike with board games, the rules of the game are not known ahead of time. The AI cannot tell what will happen if it takes an action. It cannot simulate a range of button presses and their effects on the state of the game to see which would yield the best scores. It must instead learn the rules and constraints of the game purely from watching, playing, and experimenting.

In this chapter, we will cover the following topics:

  • Q-learning

  • Experience replay

  • Actor-critic

  • Model-based approaches

A supervised learning approach to games


The challenge in reinforcement learning is working out a good target for our network. We saw one approach to this in the last chapter: policy gradients. If we can turn a reinforcement learning task into a supervised learning problem, it becomes a lot easier. So, if our aim is to build an AI agent that plays computer games, one thing we might try is to look at how humans play and get our agent to learn from them. We can make a recording of an expert human player playing a game, keeping track of both the screen image and the buttons the player is pressing.

As we saw in the chapter on computer vision, deep neural networks can identify patterns from images, so we can train a network that has the screen as input and the buttons the human pressed in each frame as the targets. This is similar to how AlphaGo was pretrained in the last chapter. This was tried on a range of complex 3D games, such as Super Smash Bros and Mario Tennis. Convolutional networks were...
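As a rough sketch of this supervised setup (not the authors' code), the following builds a small convolutional network that takes an 80 x 80 grayscale screen as input and predicts which button was pressed. The frame size, the number of buttons, and the recorded frames and button_labels arrays are all assumptions for illustration:

from tensorflow.keras import layers, models

NUM_BUTTONS = 6   # hypothetical number of distinct button choices

# Convolutional network: screen in, predicted button out.
model = models.Sequential([
    layers.Conv2D(32, 8, strides=4, activation='relu', input_shape=(80, 80, 1)),
    layers.Conv2D(64, 4, strides=2, activation='relu'),
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(NUM_BUTTONS, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# frames: (N, 80, 80, 1) recorded screens; button_labels: (N, NUM_BUTTONS) one-hot
# presses gathered from the expert recording. Training is then ordinary
# supervised learning:
# model.fit(frames, button_labels, batch_size=32, epochs=10)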

Applying genetic algorithms to playing games


For a long time, the best results and the bulk of the research into AI agents playing video game environments centered on genetic algorithms. This approach involves creating a set of modules that take parameters to control the behavior of the AI. The range of values for each parameter is then set by a selection of genes. A group of agents would then be created using different combinations of these genes and run on the game. The genes of the most successful agents would be selected, and a new generation of agents would be created using combinations of those successful genes. Those would again be run on the game, and so on, until a stopping criterion is reached, normally either a maximum number of iterations or a level of performance in the game. Occasionally, when creating a new generation, some of the genes can be mutated to create new genes. A good example of this is MarI/O, an AI that learnt to play the classic SNES game Super Mario World...
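To make the loop concrete, here is a toy sketch of the select-crossover-mutate cycle (illustrative only, not MarI/O itself). The gene length, population size, and the evaluate stand-in, which in practice would run a game with an agent controlled by the genes, are all hypothetical:

import random

GENE_LENGTH = 20        # hypothetical number of behavior parameters per agent
POPULATION_SIZE = 50
GENERATIONS = 100
MUTATION_RATE = 0.05

def evaluate(genes):
    # Stand-in for running one game with an agent controlled by these genes
    # and returning its score; in practice this would drive the game itself.
    return sum(genes)

def random_genes():
    return [random.uniform(-1.0, 1.0) for _ in range(GENE_LENGTH)]

def crossover(parent_a, parent_b):
    # Take each gene from one of the two parents at random.
    return [random.choice(pair) for pair in zip(parent_a, parent_b)]

def mutate(genes):
    # Occasionally replace a gene with a fresh random value.
    return [random.uniform(-1.0, 1.0) if random.random() < MUTATION_RATE else g
            for g in genes]

population = [random_genes() for _ in range(POPULATION_SIZE)]
for generation in range(GENERATIONS):
    ranked = sorted(population, key=evaluate, reverse=True)
    survivors = ranked[:POPULATION_SIZE // 5]        # keep the top 20%
    population = [mutate(crossover(random.choice(survivors),
                                   random.choice(survivors)))
                  for _ in range(POPULATION_SIZE)]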

Q-learning


Imagine that we have an agent moving through a maze environment, somewhere in which there is a reward. Our task is to find the best path for getting to the reward as quickly as possible. To help us think about this, let's start with a very simple maze environment:

Figure 2: A simple maze; the agent can move along the lines to go from one state to another. A reward of 4 is received if the agent reaches state D.

In the maze pictured, the agent can move between any of the nodes, in both directions, by following the lines. The node the agent is in is its state; moving along a line to a different node is an action. There is a reward of 4 if the agent gets to the goal in state D. We want to come up with the optimum path through the maze from any starting node.

Let's think about this problem for a moment. If moving along a line puts us in state D, then that will always be the path we want to take as that will give us the 4 reward in the next time step. Then going back a step...
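A maze this small can be solved with a tabular version of Q-learning. Below is a minimal sketch; the adjacency structure is a hypothetical four-node maze in the spirit of Figure 2, not its exact layout:

import random

# Each state maps to the states reachable from it in one move.
maze = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['A', 'D'], 'D': ['B', 'C']}
reward = {'D': 4}            # reaching state D pays out 4
gamma, alpha, epsilon = 0.9, 0.5, 0.1

q = {(s, a): 0.0 for s in maze for a in maze[s]}

for episode in range(500):
    state = random.choice(list(maze))
    while state != 'D':
        # Epsilon-greedy action selection.
        if random.random() < epsilon:
            action = random.choice(maze[state])
        else:
            action = max(maze[state], key=lambda a: q[(state, a)])
        next_state = action
        r = reward.get(next_state, 0)
        best_next = 0.0 if next_state == 'D' else max(
            q[(next_state, a)] for a in maze[next_state])
        # Standard Q-learning update toward the bootstrapped target.
        q[(state, action)] += alpha * (r + gamma * best_next - q[(state, action)])
        state = next_state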

Q-learning in action


A game may have in the region of 16-60 frames per second, and rewards will often be received based on actions taken many seconds earlier. Also, the state space is vast. In computer games, the state used as input is all the pixels on the screen. Even if we imagine a screen downsampled to, say, 80 x 80 pixels, all of them binary, black or white, that is still 2^6400 possible states. This makes a direct map from state to reward impractical.
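The downsampling just described might look something like the following crude sketch; the frame dimensions and the binarization threshold are assumptions:

import numpy as np

def preprocess(frame):
    # frame is assumed to be a raw RGB screen, e.g. 210 x 160 x 3 for Atari.
    gray = frame.mean(axis=2)                # collapse the color channels
    small = gray[::2, ::2][:80, :80]         # crude downsample to at most 80 x 80
    return (small > 128).astype(np.uint8)    # binarize to black/white pixels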

What we will need to do is learn an approximation of the Q-function. This is where neural networks can be used for their universal function approximation ability. To train our Q-function approximation, we will store all the game states, rewards, and actions our agent took as it plays through the game. The loss function for our network will be the square of the difference between its approximation of the reward in the previous state and the actual reward it got in the current state, plus its approximation of the reward...
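In code, that squared temporal-difference loss might be assembled like the following sketch; q_network is an assumed Keras-style model mapping a batch of states to one Q-value per action, and the replayed transition arrays are assumptions rather than the book's implementation:

import numpy as np

gamma = 0.99  # discount factor applied to future rewards

def q_learning_targets(q_network, rewards, next_states, dones):
    # The network's current estimate of each action's value in the next state.
    next_q = q_network.predict(next_states, verbose=0)
    # Bootstrapped target: observed reward plus the discounted value of the
    # best action available in the next state (zero if the episode ended).
    return rewards + gamma * np.max(next_q, axis=1) * (1.0 - dones)

# The network is then trained so that its prediction for the action actually
# taken moves toward this target; the loss is the square of the difference.
# predictions = q_network.predict(states, verbose=0)
# predictions[np.arange(len(actions)), actions] = q_learning_targets(
#     q_network, rewards, next_states, dones)
# q_network.fit(states, predictions, verbose=0)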

Dynamic games


Now that we have learned the world's simplest game, let's try learning something a bit more dynamic. The cart pole task is a classic reinforcement learning problem. The agent must control a cart, on which is balanced a pole, attached to the cart via a joint. At every step, the agent can choose to move the cart left or right, and it receives a reward of 1 every time step that the pole is balanced. If the pole ever deviates by more than 15 degrees from upright, then the game ends:

Figure 5: The cart pole task

To run the cart pole task, we will use OpenAI Gym, an open source project set up in 2015, which gives a way to run reinforcement learning agents against a range of environments in a consistent way. At the time of writing, OpenAI Gym has support for running a whole range of Atari games and even some more complex games, such as Doom, with minimal setup. It can be installed using pip by running this:

pip install gym[all]

Running cart pole in Python can be done as follows:

import...
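Since the listing is truncated here, the following is a minimal sketch of a random-action cart pole loop using the classic Gym API of the time (reset returning an observation and step returning four values); it is illustrative rather than the book's code:

import gym

env = gym.make('CartPole-v0')
for episode in range(10):
    observation = env.reset()
    total_reward, done = 0, False
    while not done:
        action = env.action_space.sample()    # move the cart left or right at random
        observation, reward, done, info = env.step(action)
        total_reward += reward                # +1 for every step the pole stays up
    print('Episode', episode, 'total reward:', total_reward)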

Atari Breakout


Breakout is a classic Atari game originally released in 1976. The player controls a paddle and must use it to bounce a ball into the colored blocks at the top of the screen. Points are scored whenever a block is hit. If the ball travels down past the paddle off the bottom of the screen, the player loses a life. The game ends either when all the blocks have been destroyed or when the player loses all three of the lives they start with:

Figure 8: Atari Breakout

Think about how much harder learning a game like Breakout is compared to the cart pole task we just looked at. For cart pole, if a bad move is made that leads to the pole tipping over, we will normally receive feedback within a couple of moves. In Breakout, such feedback is much rarer. If we position our paddle badly, the mistake may lie in 20 or more moves made while getting it into that position.

Atari Breakout random benchmark

Before we go any further, let's create an agent that will play Breakout by selecting moves randomly. That way...
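A random baseline might be sketched as follows (again using the classic Gym Atari API of the time; the environment name and episode count are assumptions):

import gym

env = gym.make('Breakout-v0')
scores = []
for episode in range(100):
    env.reset()
    score, done = 0, False
    while not done:
        action = env.action_space.sample()     # press a random button each frame
        _, reward, done, _ = env.step(action)
        score += reward
    scores.append(score)
print('Average score of the random agent:', sum(scores) / len(scores))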

Actor-critic methods


Approaches to reinforcement learning can be divided into three broad categories:

  • Value-based learning: This tries to learn the expected reward/value for being in a state. The desirability of getting into different states can then be evaluated based on their relative value. Q-learning is an example of value-based learning.

  • Policy-based learning: In this, no attempt is made to evaluate the state, but different control policies are tried out and evaluated based on the actual reward from the environment. Policy gradients are an example of that.

  • Model-based learning: In this approach, which will be discussed in more detail later in the chapter, the agent attempts to model the behavior of the environment and choose an action based on its ability to simulate the result of actions it might take by evaluating its model.

Actor-critic methods all revolve around the idea of using two neural networks for training. The first, the critic, uses value-based learning to learn a value function...
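A much-simplified sketch of the idea is shown below: the critic learns a state-value estimate, and the actor's policy-gradient update is weighted by the advantage, the difference between the return actually observed and the critic's estimate. The network sizes, the Monte-Carlo returns, and the Keras-style training calls are illustrative assumptions rather than the book's implementation:

import numpy as np
from tensorflow.keras import layers, models, optimizers

STATE_DIM, NUM_ACTIONS, GAMMA = 4, 2, 0.99   # cart pole sized, for illustration

# Critic: maps a state to a single estimated value for that state.
critic = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(STATE_DIM,)),
    layers.Dense(1),
])
critic.compile(optimizer=optimizers.Adam(1e-3), loss='mse')

# Actor: maps a state to a probability distribution over actions.
actor = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(STATE_DIM,)),
    layers.Dense(NUM_ACTIONS, activation='softmax'),
])
actor.compile(optimizer=optimizers.Adam(1e-3),
              loss='sparse_categorical_crossentropy')

def train_on_episode(states, actions, rewards):
    # states: (T, STATE_DIM) array, actions: (T,) ints, rewards: (T,) floats.
    # Discounted returns, accumulated backwards from the end of the episode.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + GAMMA * running
        returns[t] = running
    values = critic.predict(states, verbose=0).flatten()
    advantages = returns - values            # how much better than the critic expected
    critic.fit(states, returns, verbose=0)   # value-based update for the critic
    # Weighting the cross-entropy on the chosen actions by the advantage
    # turns the actor's update into a policy-gradient step.
    actor.fit(states, actions, sample_weight=advantages, verbose=0)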

Asynchronous methods


We have seen a lot of interesting methods in this chapter, but they all suffer from the constraint of being very slow to train. This isn't such a problem when we are running on basic control problems, such as the cart-pole task. But for learning Atari games, or the even more complex human tasks that we might want to learn in the future, the days to weeks of training time are far too long.

A big part of the time constraint, for both policy gradients and actor-critic, is that when learning online, we can only ever evaluate one policy at a time. We can get some speed improvement by using more powerful GPUs and bigger and bigger processors, but the speed of evaluating the policy online will always act as a hard limit on performance.

This is the problem that asynchronous methods aim to solve. The idea is to train multiple copies of the same neural networks across multiple threads. Each neural network trains online against a separate instance of the environment running on...
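Structurally, such a setup might be skeletonized as follows; make_env, make_network, compute_gradients, and apply_gradients are hypothetical placeholders standing in for the environment, the network, and the gradient computation:

import threading

NUM_WORKERS = 8
lock = threading.Lock()
shared_network = make_network()          # single shared set of parameters

def worker():
    env = make_env()                     # each thread gets its own environment
    local_network = make_network()       # and its own copy of the network
    while True:
        # Pull down the latest shared weights, gather some experience locally,
        # then push the resulting update back up under the lock.
        local_network.set_weights(shared_network.get_weights())
        gradients = compute_gradients(local_network, env)
        with lock:
            apply_gradients(shared_network, gradients)

threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
for thread in threads:
    thread.start()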

Model-based approaches


The approaches we've shown so far can do a good job of learning all kinds of tasks, but an agent trained in these ways can still suffer from significant limitations:

  • It trains very slowly; a human can learn a game like Pong from a couple of plays, while for Q-learning, it may take millions of playthroughs to get to a similar level.

  • For games that require long-term planning, all these techniques perform very badly. Imagine a platform game where a player must retrieve a key from one side of a room to open a door on the other side. There will rarely be a passage of play where this occurs, and even then, the chance of learning that it was the key that led to the extra reward from the door is minuscule.

  • It cannot formulate a strategy or in any way adapt to a novel opponent. It may do well against an opponent it was trained against, but when presented with an opponent showing some novelty in play, it will take a long time to learn to adapt.

  • If given a new goal within an environment...

Summary


In this chapter, we looked at building agents that play computer games using reinforcement learning. We went through the three main approaches: policy gradients, Q-learning, and model-based learning, and we saw how deep learning can be used with these approaches to achieve human-level or greater performance. We hope the reader comes out of this chapter with enough knowledge to apply these techniques to other games or problems they may want to solve. Reinforcement learning is an incredibly exciting area of research at the moment. Companies such as Google, DeepMind, OpenAI, and Microsoft are all investing heavily to unlock this future.

In the next chapter, we will take a look at anomaly detection and how deep learning methods can be applied to detect instances of fraud in financial transaction data.
