
You're reading from Deep Learning with Theano

Product type: Book
Published in: Jul 2017
Publisher: Packt
ISBN-13: 9781786465825
Edition: 1st Edition
Author: Christopher Bourez

Christopher Bourez graduated from École Polytechnique and École Normale Supérieure de Cachan in Paris in 2005 with a Master of Science in Math, Machine Learning and Computer Vision (MVA). For 7 years, he led a computer vision company that launched Pixee, a visual recognition application for iPhone, in 2007, in partnership with a major movie theater brand, the city of Paris, and a major ticket broker: with the snap of a picture, the user could get information about events and products, and access to purchases. While working on computer vision missions with Caffe, TensorFlow, or Torch, he helped other developers succeed by writing a blog on computer science. One of his blog posts, a tutorial on the Caffe deep learning technology, has become the most successful tutorial on the web after the official Caffe website. On the initiative of Packt Publishing, the same recipes that made his Caffe tutorial a success have been ported to write this book on the Theano technology. In the meantime, he has studied a wide range of deep learning problems to gain more practice with Theano and its applications.

Chapter 11. Learning from the Environment with Reinforcement

Supervised and unsupervised learning describe the presence or the absence of labels or targets during training. A more natural learning setting for an agent is to receive a reward when it takes the correct decision. Such a reward, for playing tennis correctly for example, may be attributed in a complex environment and be the result of multiple actions, delayed or cumulative.

In order to optimize the reward an artificial agent receives from its environment, the field of Reinforcement Learning (RL) has seen the emergence of many algorithms, such as Q-learning or Monte Carlo Tree Search, and, with the advent of deep learning, these algorithms have been revised into new methods such as deep Q-networks, policy networks, value networks, and policy gradients.

We'll begin with a presentation of the reinforcement learning framework and its potential application to virtual environments. Then, we'll develop its algorithms and their integration...

Reinforcement learning tasks


Reinforcement learning consists of training an agent that needs only occasional feedback from the environment, so that it learns to obtain the best feedback in the end. The agent performs actions that modify the state of the environment.

The actions available to navigate the environment can be represented as a graph, with directed edges from one state to another, as shown in the following figure:

A robot working in a real environment (a walking robot, motor control, and so on) or a virtual environment (a video game, an online game, a chat room, and so on) has to decide which movements (or which keys to strike) will bring it the maximum reward.

Simulation environments


Virtual environments make it possible to simulate thousands to millions of gameplays at no cost other than the computations. For the purpose of benchmarking different reinforcement learning algorithms, simulation environments have been developed by the research community.

In order to find solutions that generalize well, OpenAI, the non-profit artificial intelligence research company associated with business magnate Elon Musk, which aims to carefully promote and develop friendly AI in a way that benefits humanity as a whole, has gathered a collection of reinforcement learning tasks and environments in its open source Python toolkit, OpenAI Gym (https://gym.openai.com/), on which we can test our own approaches (a minimal interaction loop is sketched after the following list). Among these environments, you'll find:

  • Video games from the Atari 2600, a home video game console released by Atari, Inc. in 1977, wrapping the simulator from the Arcade Learning Environment, one of the most common RL benchmark environments.

  • MuJoCo...
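
As a quick illustration of how these environments are driven, here is a minimal interaction loop with the Gym Python API; the environment name CartPole-v0 is just one registered task among many, and the random policy is only a placeholder for the agents trained later in this chapter:

    import gym

    # Create one of the registered environments; CartPole-v0 is a simple control task
    env = gym.make('CartPole-v0')

    for episode in range(10):
        state = env.reset()                 # start a new episode, get the first observation
        total_reward, done = 0.0, False
        while not done:
            action = env.action_space.sample()            # random policy as a placeholder
            state, reward, done, info = env.step(action)  # apply the action to the environment
            total_reward += reward
        print('Episode %d finished with reward %.1f' % (episode, total_reward))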

Q-learning


A major approach to solving games has been Q-learning. In order to fully understand the approach, a basic example will illustrate a simplistic case where the number of states of the environment is limited to 6: state 0 is the entrance and state 5 is the exit. At each stage, some actions make it possible to jump to another state, as described in the following figure:

The reward is, let's say, 100 when the agent moves from state 4 to state 5. There isn't any reward for the other states, since the goal of the game in this example is to find the exit. The reward is time-delayed: the agent has to move through multiple states, from state 0 to state 4, before it finds the exit.

In this case, Q-learning consists of learning a matrix Q, representing the value of a state-action pair:

  • Each row in the Q-matrix corresponds to a state the agent would be in

  • Each column corresponds to a target state reachable from that state

The value represents how much choosing that action in that state will move us closer to the exit...
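
As a concrete sketch of this learning process, the following tabular Q-learning loop implements the 6-state example; since the exact graph comes from the figure, the reward matrix R below is only an assumption (a simple chain from state 0 to state 5, with a reward of 100 on the edge leaving state 4 for state 5, and -1 marking forbidden transitions), and the discount factor is a hypothetical choice:

    import numpy as np

    # Hypothetical reward matrix: R[s, t] is the reward for jumping from state s to state t,
    # and -1 marks transitions that are not allowed (the actual graph is given by the figure)
    R = np.array([[ -1,   0,  -1,  -1,  -1,  -1],
                  [  0,  -1,   0,  -1,  -1,  -1],
                  [ -1,   0,  -1,   0,  -1,  -1],
                  [ -1,  -1,   0,  -1,   0,  -1],
                  [ -1,  -1,  -1,   0,  -1, 100],   # leaving state 4 for the exit pays 100
                  [ -1,  -1,  -1,  -1,  -1,  -1]])  # state 5 is terminal

    gamma = 0.8                         # hypothetical discount factor
    Q = np.zeros_like(R, dtype=float)   # the state-action value matrix to learn

    for episode in range(1000):
        state = np.random.randint(0, 5)             # start in a random non-terminal state
        while state != 5:                            # state 5 is the exit
            actions = np.where(R[state] >= 0)[0]    # allowed transitions from this state
            action = np.random.choice(actions)      # explore randomly
            # Q-learning update: immediate reward plus the discounted best value of the next state
            Q[state, action] = R[state, action] + gamma * Q[action].max()
            state = action

    print(Q.round(1))   # the learned Q-matrix: higher values point towards the exit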

Deep Q-network


While the number of possible actions is usually limited (the number of keyboard keys or movements), the number of possible states can be dramatically huge and the search space enormous, for example, in the case of a robot equipped with cameras in a real-world environment or of a realistic video game. It becomes natural to use a computer vision neural network, such as the ones we used for classification in Chapter 7, Classifying Images with Residual Networks, to represent the value of an action given an input image (the state), instead of a matrix:

The Q-network is called a state-action value network and predicts action values given a state. To train the Q-network, one natural way of doing it is to have it fit the Bellman equation via gradient descent:

    Q(s, a) ← r + γ · max_a' Q(s', a')

Note that the target r + γ · max_a' Q(s', a') is evaluated and kept fixed, while the descent is computed for the derivatives in Q(s, a), and that the value of each state can be estimated as the maximum of all state-action values: V(s) = max_a Q(s, a).
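
To make this fitting step concrete, here is a minimal sketch in Theano of the squared Bellman error for a tiny fully connected Q-network; the layer sizes, the discount factor, and the variable names are assumptions for the sake of the example (and terminal states are ignored), not the chapter's exact implementation:

    import numpy as np
    import theano
    import theano.tensor as T

    n_inputs, n_actions, n_hidden = 4, 2, 16     # hypothetical sizes
    gamma = 0.99                                  # hypothetical discount factor

    # Parameters of a tiny fully connected Q-network
    W1 = theano.shared(np.random.randn(n_inputs, n_hidden) * 0.01)
    b1 = theano.shared(np.zeros(n_hidden))
    W2 = theano.shared(np.random.randn(n_hidden, n_actions) * 0.01)
    b2 = theano.shared(np.zeros(n_actions))
    params = [W1, b1, W2, b2]

    state = T.matrix('state')            # batch of current states
    next_state = T.matrix('next_state')  # batch of next states
    action = T.ivector('action')         # index of the action taken in each state
    reward = T.vector('reward')          # immediate rewards

    def q_values(x):
        h = T.tanh(T.dot(x, W1) + b1)
        return T.dot(h, W2) + b2

    q = q_values(state)[T.arange(action.shape[0]), action]
    # The target is evaluated but kept fixed: no gradient flows through it
    target = reward + gamma * theano.gradient.disconnected_grad(
        q_values(next_state).max(axis=1))
    loss = T.mean((target - q) ** 2)

    updates = [(p, p - 0.01 * g) for p, g in zip(params, T.grad(loss, params))]
    train = theano.function([state, action, reward, next_state], loss, updates=updates)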

After initializing the Q-network with random weights...

Training stability


Different methods are possible to improve stability during training. Online training, that is, training the model while playing the game, forgetting previous experiences and considering only the last one, is fundamentally unstable with deep neural networks: states that are close in time, such as the most recent states, are usually strongly similar or correlated, and training only on the most recent states does not converge well.

To avoid such a failure, one possible solution has been to store the experiences in a replay memory, or to use a database of human gameplays. Batching and shuffling random samples from the replay memory or the human gameplay database leads to more stable training, but to off-policy training.
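
A replay memory can be as simple as a bounded buffer of transitions from which random minibatches are drawn; the capacity and batch size below are hypothetical values:

    import random
    from collections import deque

    class ReplayMemory(object):
        def __init__(self, capacity=100000):          # hypothetical capacity
            self.buffer = deque(maxlen=capacity)       # oldest experiences are discarded first

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):               # hypothetical batch size
            # Random, shuffled samples break the temporal correlation between states
            return random.sample(list(self.buffer), batch_size)

    memory = ReplayMemory()
    # memory.push(s, a, r, s_next, done) after every step of play, then train on
    # memory.sample(32) once enough transitions have been stored.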

A second solution to improve stability is to fix the values of the parameters θ⁻ used in the target evaluation for several thousands of updates of the current parameters θ, reducing the correlations between the target and the Q-values:

    y = r + γ · max_a' Q(s', a'; θ⁻)
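
In code, this amounts to maintaining a second, frozen copy of the network parameters and copying the current weights into it only every few thousand updates; the sketch below assumes the list of Theano shared variables params from the Q-network sketch above, and the synchronization period is a hypothetical value:

    import theano

    # One frozen copy (theta-minus) per trainable parameter of the Q-network
    target_params = [theano.shared(p.get_value()) for p in params]

    def sync_target_network():
        # Copy the current parameters theta into the target parameters theta-minus
        for p, tp in zip(params, target_params):
            tp.set_value(p.get_value())

    SYNC_PERIOD = 10000   # hypothetical number of updates between synchronizations
    # for step in range(num_steps):
    #     ... perform one gradient update of the Q-network ...
    #     if step % SYNC_PERIOD == 0:
    #         sync_target_network()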

It is possible to train more efficiently with n-step Q-learning...
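
The n-step return accumulates the rewards of the last n transitions before bootstrapping on the Q-value of the state reached; the following helper is only a sketch with hypothetical names and values:

    import numpy as np

    def n_step_targets(rewards, bootstrap_value, gamma=0.99):
        # rewards: the last n rewards r_t ... r_{t+n-1}
        # bootstrap_value: max_a Q(s_{t+n}, a), the value of the state reached after n steps
        targets = np.zeros(len(rewards))
        running = bootstrap_value
        # Work backwards so that each target accumulates the discounted rewards that follow it
        for i in reversed(range(len(rewards))):
            running = rewards[i] + gamma * running
            targets[i] = running
        return targets

    # Example: three rewards and a bootstrap value of 10
    print(n_step_targets([1.0, 0.0, 1.0], bootstrap_value=10.0, gamma=0.9))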

Policy gradients with REINFORCE algorithms


The idea of Policy Gradients (PG) / REINFORCE algorithms is very simple: it consists of reusing the classification loss function for reinforcement learning tasks.

Let's remember that the classification loss is given by the negative log-likelihood, and that minimizing it with gradient descent follows the derivative of the negative log-likelihood with respect to the network weights:

    - d/dθ log p(y | X, θ)

Here, y is the selected action, and p(y | X, θ) is the predicted probability of this action given the inputs X and the weights θ.

The REINFORCE theorem introduces the equivalent for reinforcement learning, where r is the reward. The following derivative:

    r · d/dθ log p(y | X, θ)

represents an unbiased estimate of the derivative of the expected reward with respect to the network weights:

    d/dθ E[r] = E[ r · d/dθ log p(y | X, θ) ]

So, following the derivative will encourage the agent to maximize the reward.
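
To make this concrete, the following sketch (in Theano, with hypothetical layer sizes and variable names) builds the REINFORCE surrogate loss, the negative log-likelihood of each sampled action weighted by the reward it received, so that gradient descent on this loss performs gradient ascent on the expected reward:

    import numpy as np
    import theano
    import theano.tensor as T

    n_inputs, n_actions = 4, 2                    # hypothetical sizes
    W = theano.shared(np.random.randn(n_inputs, n_actions) * 0.01)
    b = theano.shared(np.zeros(n_actions))
    params = [W, b]

    states = T.matrix('states')     # batch of observed states X
    actions = T.ivector('actions')  # sampled actions y
    rewards = T.vector('rewards')   # rewards r obtained for those actions

    # Policy network: a probability distribution over legal actions
    probs = T.nnet.softmax(T.dot(states, W) + b)
    log_likelihood = T.log(probs[T.arange(actions.shape[0]), actions])

    # REINFORCE surrogate loss: reward-weighted negative log-likelihood
    loss = -T.mean(rewards * log_likelihood)

    updates = [(p, p - 0.01 * g) for p, g in zip(params, T.grad(loss, params))]
    train = theano.function([states, actions, rewards], loss, updates=updates)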

Such a gradient descent enables us to optimize a policy network for our agents: a policy is a probability distribution over legal actions, from which to sample the actions to execute...

Summary


Reinforcement learning describes the task of optimizing an agent that only stumbles upon rewards episodically. Online, offline, value-based, and policy-based algorithms have been developed with the help of deep neural networks for various games and simulation environments.

Policy gradients are a brute-force solution that requires the sampling of actions during training; they are better suited to small action spaces, although they also provide first solutions for continuous search spaces.

Policy gradients also work to train non-differentiable stochastic layers in a neural net and to backpropagate gradients through them. For example, when propagation through a model requires sampling from a parameterized submodel, the gradients coming from the top layers can be considered as a reward for the bottom network.

In more complex environments, when there is no obvious reward (for example, understanding and inferring possible actions from the objects present in the environment), reasoning helps humans optimize their...
