Multi-agent RL

In the last chapter, we dived into discrete optimization problems. In this final chapter, we will discuss a relatively new direction of reinforcement learning (RL) and deep RL, which deals with situations in which multiple agents interact in an environment.

In this chapter, we will:

  • Start with an overview of the similarities and differences between the classical single-agent RL problem and multi-agent RL
  • Cover the MAgent environment, which was implemented and open sourced by the Geek.AI UK/China research group
  • Use MAgent to train models in different environments with several groups of agents

Multi-agent RL explained

The multi-agent setup is a natural extension of the familiar RL model that we covered in Chapter 1, What Is Reinforcement Learning?. In the normal RL setup, we have one agent interacting with the environment through observations, rewards, and actions. But in some problems, which often arise in reality, we have several agents involved in the environment interaction. To give some concrete examples:

  • A chess game, in which our program tries to beat the opponent
  • A market simulation, such as product advertising or price changes, in which our actions might lead to counter-actions from other participants
  • Multiplayer games, such as Dota 2 or StarCraft II, in which the agent needs to control several units competing against other players' units

If other agents are outside of our control, we can treat them as part of the environment and still stick to the normal RL model with the single agent. But sometimes, that's too limited and not exactly what we want...
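Conceptually, the only change from the single-agent loop is that observations, actions, and rewards become per-agent collections. The following is a minimal sketch of such a loop, written against a hypothetical dictionary-based interface purely to illustrate the structure (MAgent, which we use below, organizes agents into groups instead):

# Hypothetical multi-agent interface, shown only to illustrate the loop structure
obs = env.reset()                                  # dict: agent_id -> observation
done = False
while not done:
    # every agent acts on its own observation...
    actions = {agent_id: policies[agent_id](ob)
               for agent_id, ob in obs.items()}
    # ...and receives its own reward and done flag
    obs, rewards, dones, _ = env.step(actions)
    done = all(dones.values())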

The MAgent environment

Before we jump into our first MARL example, I will describe the environment we will experiment with.

Installation

If you want to play with MARL, your choice of environments is a bit limited. All the environments that come with Gym support only one agent. There are some patches for Atari Pong that switch it into two-player mode, but they are not standard and are an exception rather than the rule.

DeepMind, together with Blizzard, has made StarCraft II publicly available (https://github.com/deepmind/pysc2), and it makes for a very interesting and challenging environment for experimentation. However, for somebody taking their first steps in MARL, it might be too complex. In that regard, I found the MAgent environment from Geek.AI (https://github.com/geek-ai/MAgent) perfectly suitable: it is simple, fast, and has minimal dependencies, but it still allows you to simulate different multi-agent scenarios for experimentation. It doesn't provide a Gym-compatible API, but...
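To give a feeling for the raw MAgent interface before we wrap it, the following sketch drives both agent groups of the built-in forest configuration with random actions. It follows the examples shipped with the MAgent repository, but treat it as an approximation: method names and signatures may differ slightly between versions.

import numpy as np
import magent

env = magent.GridWorld("forest", map_size=64)         # built-in config with deer and tigers
deer_handle, tiger_handle = env.get_handles()         # one handle per group of agents
env.reset()
env.add_agents(deer_handle, method="random", n=50)
env.add_agents(tiger_handle, method="random", n=10)

for handle in (deer_handle, tiger_handle):
    n_actions = env.get_action_space(handle)[0]
    acts = np.random.randint(n_actions, size=env.get_num(handle), dtype=np.int32)
    env.set_action(handle, acts)                      # queue random actions for the whole group
done = env.step()                                     # advance the simulation by one tick
rewards = env.get_reward(tiger_handle)                # per-agent rewards for the last step
view, features = env.get_observation(tiger_handle)   # spatial view + feature vector per agent
env.clear_dead()                                      # remove killed agents from the grid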

Deep Q-network for tigers

In the previous example, both groups of agents were behaving randomly, which is not very interesting. Now we will apply the deep Q-network (DQN) model to the tiger group of agents to check whether they can learn some interesting policy. All of the agents share the network, so their behavior will be the same.

The training code is in Chapter25/forest_tigers_dqn.py, and it doesn't differ much from the other DQN versions in the previous chapters. To make the MAgent environment work with our classes, a gym.Env wrapper was implemented in Chapter25/lib/data.py in the MAgentEnv class. Let's check it to understand how it fits into the rest of the code.

class MAgentEnv(VectorEnv):
    def __init__(self, env: magent.GridWorld, handle,
                 reset_env_func: Callable[[], None],
                 is_slave: bool = False,
                 steps_limit: Optional[int] = None):
        reset_env_func()
        action_space = self.handle_action_space(env...
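The core job of the wrapper is to turn MAgent's group-level observation (a stack of per-agent spatial views plus a per-agent feature vector) into a list of individual observations that the DQN machinery from the previous chapters can consume. A rough sketch of that conversion, not the exact code from data.py, could look like this:

import numpy as np

def observations_from_handle(env, handle):
    # MAgent returns group observations as two arrays:
    # a view of shape (N, H, W, C) and features of shape (N, F), where N is the group size
    view, features = env.get_observation(handle)
    obs = []
    for agent_view, agent_feats in zip(view, features):
        # channels-first layout for PyTorch convolutions, feature vector kept flat
        obs.append((np.moveaxis(agent_view, 2, 0), agent_feats))
    return obs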

Collaboration by the tigers

The second experiment that I implemented was designed to make the tigers' lives more complicated and encourage collaboration between them. The training and play code are the same; the only difference is in the MAgent environment's configuration. I took the double_attack configuration file from MAgent (https://github.com/geek-ai/MAgent/blob/master/python/magent/builtin/config/double_attack.py) and tweaked it to add a reward of 0.1 after every step for both tigers and deer. The following is the modified function config_double_attack() from Chapter25/lib/data.py:

def config_double_attack(map_size):
    gw = magent.gridworld
    cfg = gw.Config()
    cfg.set({"map_width": map_size, "map_height": map_size})
    cfg.set({"embedding_size": 10})

We create the configuration object and set the map dimensions. The embedding size is the dimensionality of the minimap, which is not enabled in this configuration...
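The rest of the function follows the structure of MAgent's stock double_attack config: it registers the two agent types, creates a group for each, and adds the reward rule for a coordinated attack. A hedged sketch of that continuation is shown below; the attribute values are illustrative, and the step_reward entries are the tweak described above rather than part of the original file:

    # continuation of config_double_attack(), sketched after MAgent's stock double_attack config
    deer = cfg.register_agent_type(
        "deer",
        {"width": 1, "length": 1, "hp": 5, "speed": 1,
         "view_range": gw.CircleRange(1), "attack_range": gw.CircleRange(0),
         "step_recover": 0.2, "kill_supply": 8,
         "step_reward": 0.1})                      # the added per-step survival reward

    tiger = cfg.register_agent_type(
        "tiger",
        {"width": 1, "length": 1, "hp": 10, "speed": 1,
         "view_range": gw.CircleRange(4), "attack_range": gw.CircleRange(1),
         "damage": 1, "step_recover": -0.2,
         "step_reward": 0.1})                      # tigers get the same per-step reward

    deer_group = cfg.add_group(deer)
    tiger_group = cfg.add_group(tiger)

    # reward two tigers that attack the same deer at the same time
    a = gw.AgentSymbol(tiger_group, index="any")
    b = gw.AgentSymbol(tiger_group, index="any")
    c = gw.AgentSymbol(deer_group, index="any")
    cfg.add_reward_rule(gw.Event(a, "attack", c) & gw.Event(b, "attack", c),
                        receiver=[a, b], value=[1, 1])
    return cfg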

Training both tigers and deer

The next example is a scenario in which both tigers and deer are controlled by different DQN models trained simultaneously. Tigers are rewarded for living longer, which means eating more deer, since they lose health points at every step in the simulation. Deer are also rewarded at every timestep.

The code is in Chapter25/forest_both_dqn.py, and it is quite a simple extension of the previous example. For each group of agents, we have a separate Agent class instance, which communicates with the environment. As the observations of the two groups are different, we have two separate networks, replay buffers, and experience sources. On every training step, we sample batches from both replay buffers and then train both networks independently.
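A minimal sketch of how such a two-model loop might be organized is shown below. The group bundles, attribute names, and the loss helper are illustrative, not the book's exact code; each group object is assumed to hold its own network, ptan-style replay buffer (with populate/sample), optimizer, and loss function:

import itertools

def train_two_groups(deer, tiger, batch_size=32, replay_initial=1000, gamma=0.99):
    # sketch: `deer` and `tiger` each bundle a net, a replay buffer fed by its own
    # experience source, an optimizer, and a DQN loss function
    for step_idx in itertools.count():
        deer.buffer.populate(1)                 # both experience sources step the environment
        tiger.buffer.populate(1)
        if len(deer.buffer) < replay_initial:
            continue
        for group in (deer, tiger):             # the two DQNs are trained independently
            group.optimizer.zero_grad()
            loss = group.calc_loss(group.buffer.sample(batch_size), gamma)
            loss.backward()
            group.optimizer.step()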

I'm not going to reproduce the full code here, as it differs from the previous example only in small details. If you are curious, you can check the examples on GitHub. The following plots show the convergence results.

...

The battle between equal actors

The final example in this chapter is a situation in which one policy drives the fighting between two groups of identical agents. This version is implemented in Chapter25/battle_dqn.py. The code is straightforward and won't be reproduced here.

I did only a couple of experiments with the code, so the hyperparameters could probably be improved. In addition, you can experiment with the training process. In the code, both groups are driven by the same policy that we are optimizing, which may not be the best approach. Instead, you can experiment with an AlphaGo Zero style of training, in which the best policy so far is used for one group and the other group is driven by the policy that we are currently optimizing. Once the best policy starts to consistently lose, it is updated. In this case, the optimized policy has time to learn all the tricks and weaknesses of the current best policy, which may start an improvement loop.
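A sketch of the promotion step such a scheme could use is shown below; the names are hypothetical, and in the actual battle_dqn.py both groups share a single policy instead:

def maybe_promote_best(best_net, train_net, win_ratio, threshold=0.7):
    # sketch: once the policy being optimized beats the current best policy in a
    # sufficiently large fraction of evaluation battles, it becomes the new best
    if win_ratio >= threshold:
        best_net.load_state_dict(train_net.state_dict())
        return True
    return False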

In my experiments, the training wasn't very stable...

Summary

In this chapter, we only scratched the surface of the very interesting and dynamic field of MARL. There are lots of things that you can try on your own using the MAgent environment or other environments (like PySC2).

My congratulations on reaching the end of the book! I hope that the book was useful and you enjoyed reading it as much as I enjoyed gathering the material and writing all the chapters. As a final word, I would like to wish you good luck in this exciting and dynamic area of RL. The domain is developing very rapidly, but with an understanding of the basics, it will become much simpler for you to keep track of the new developments and research in this field.

There are many very interesting topics left uncovered, such as partially observable Markov decision processes (where environment observations don't fulfill the Markov property) or recent approaches to exploration, such as the count-based methods. There has been a lot of recent activity around multi-agent methods...
