The Reinforcement Learning Workshop

By Alessandro Palmas , Emanuele Ghelfi , Dr. Alexandra Galina Petre and 6 more

About this book

Various intelligent applications such as video games, inventory management software, warehouse robots, and translation tools use reinforcement learning (RL) to make decisions and perform actions that maximize the probability of the desired outcome. This book will help you to get to grips with the techniques and the algorithms for implementing RL in your machine learning models.

Starting with an introduction to RL, you’ll be guided through different RL environments and frameworks. You’ll learn how to implement your own custom environments and use OpenAI Baselines to run RL algorithms. Once you’ve explored classic RL techniques such as Dynamic Programming, Monte Carlo, and TD Learning, you’ll understand when to apply the different deep learning methods in RL and advance to deep Q-learning. The book will even help you understand the different stages of machine-based problem-solving by using DARQN on the popular video game Breakout. Finally, you’ll find out when to use a policy-based method to tackle an RL problem.

By the end of The Reinforcement Learning Workshop, you’ll be equipped with the knowledge and skills needed to solve challenging problems using reinforcement learning.

Publication date:
August 2020


1. Introduction to Reinforcement Learning


This chapter introduces the Reinforcement Learning (RL) framework, which is one of the most exciting fields of machine learning and artificial intelligence. You will learn how to describe the characteristics and advanced applications of RL to show what can be achieved within this framework. You will also learn to differentiate between RL and other learning approaches. You will learn the main concepts of this discipline both from a theoretical point of view and from a practical point of view using Python and other useful libraries.

By the end of the chapter, you will understand what RL is and know how to use the Gym toolkit and Baselines, two popular libraries in this field, to interact with an environment and implement a simple learning loop.



Learning and adapting to new circumstances is a crucial process for humans and, in general, for all animals. Usually, learning is understood as a process of trial and error through which we improve our performance in particular tasks. Our life is a continuous learning process: we start from simple goals (for example, walking), and we end up pursuing difficult and complex tasks (for example, playing a sport). As humans, we are always driven by our reward mechanism, which rewards good behaviors and punishes bad ones.

Reinforcement Learning (RL), inspired by the human learning process, is a subfield of machine learning and deals with learning from interaction. With the term "interaction," we mean the process of trial and error through which we, as humans, understand the consequences of our actions and build up our own experiences.

RL, in particular, considers sequential decision-making problems. These are problems in which an agent has to take a sequence of decisions, that is, actions, to maximize a certain performance measure.

RL formalizes tasks as Markov Decision Processes (MDPs), which describe problems arising in many real-world scenarios. In this setting, the decision-maker, referred to as the agent, has to make decisions accounting for environmental uncertainty and experience. Agents are goal-directed; they need only a notion of a goal, such as a numerical signal to be maximized. Unlike supervised learning, RL does not require good examples to be provided; it is the agent that learns how to map situations to actions. The mapping from situations (states) to actions is called a "policy" in the literature, and it represents the agent's behavior or strategy. Solving an MDP means finding the policy that maximizes the desired outcome (that is, the total reward). We will study MDPs in more detail in future chapters.

RL has been successfully applied to various kinds of problems and domains, showing exciting results. This chapter is an introduction to RL. It aims to explain some applications and describe concepts both from an intuitive perspective and from a mathematical point of view. Both of these aspects are very important when learning new disciplines. Without intuitive understanding, it is impossible to make sense of formulas and algorithms; without mathematical background, it is tough to implement existing or new algorithms.

In this chapter, we will first compare the three main machine learning paradigms, namely supervised learning, RL, and unsupervised learning. We will discuss their differences and similarities and define some example problems.

Second, we will move on to a section that contains the theory of RL and its notations. We will learn about concepts such as what an agent is, what an environment is, and how to parameterize different policies. This section represents the fundamentals of this discipline.

Third, we will begin using two RL frameworks, namely Gym and Baselines. We will learn that interacting with a Gym environment is extremely simple, as is learning a task using Baselines algorithms.

Finally, we will explore some RL applications to motivate you to study this discipline, showing various techniques that can be used to face real-world problems. RL is not confined to the academic world; it is also crucial from an industrial point of view, allowing you to solve problems that are almost impossible to solve using other techniques.


Learning Paradigms

In this section, we will discuss the similarities and differences between the three main learning paradigms under the umbrella of machine learning. We will analyze some representative problems in order to understand the characteristics of these frameworks better.

Introduction to Learning Paradigms

A learning paradigm consists of a problem and a solution method. Usually, learning paradigms deal with data and rephrase the problem in a way that can be solved by finding parameters that maximize an objective function. In these settings, the problem can be tackled using mathematical and optimization tools, allowing a formal study. The term "learning" is often used to represent a dynamic process of adapting the algorithm's parameters in such a way as to optimize its performance (that is, to learn) on a given task. Tom Mitchell defined learning in a precise way, as follows:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Let's rephrase the preceding definition more intuitively. To determine whether a program is learning, we need to set a task, that is, the goal of the program. The task can be anything we want the program to do, for example, play a game of chess, drive autonomously, or classify images. The task should be accompanied by a performance measure, that is, a function that returns how well the program is performing on that task. For the chess game, a performance function can simply be represented by the following:

Figure 1.1: A performance function for a game of chess


In this context, the experience is the amount of data collected by the program at a specific moment. For chess, the experience is represented by the set of games played by the program.
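As a hedged illustration of Mitchell's definition, the following sketch treats the experience E as a list of finished chess games and uses the fraction of games won as one possible performance measure P. The function name `win_rate`, the result encoding, and the game results are assumptions made for this example; they are not necessarily the function shown in Figure 1.1.

```python
from typing import List

# Hypothetical encoding of a game result, assumed for this sketch.
WIN, DRAW, LOSS = "win", "draw", "loss"


def win_rate(results: List[str]) -> float:
    """Performance measure P: fraction of the games played that were won."""
    if not results:
        return 0.0
    return sum(1 for r in results if r == WIN) / len(results)


# Experience E: results of games played early and late in training (made-up data).
early_games = [LOSS, LOSS, DRAW, WIN]
late_games = [WIN, WIN, DRAW, WIN]

# The program learns the task T (playing chess) if P improves with E.
print(win_rate(early_games))  # 0.25
print(win_rate(late_games))   # 0.75
```

Any other measure, such as an average score that also counts draws, would fit the definition equally well; what matters is that it can be evaluated as the experience grows.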

The same input presented at the beginning of the learning phase or the end of the learning phase can result in different responses (that is, outputs) from the algorithm; the differences are caused by the algorithm's parameters being updated during the process.

In the following table, we can see some examples of the experience, task, and performance tuples to better understand their concrete instantiations:

Figure 1.2: Table for instantiations


It is possible to classify the learning algorithms based on the input they have and on the feedback they receive. In the following section, we will look at the three main learning paradigms in the context of machine learning based on this classification.

Supervised versus Unsupervised versus RL

The three main learning paradigms are supervised learning, unsupervised learning, and RL. The following figure represents the general schema of each of these learning paradigms:

Figure 1.3: Representation of learning paradigms


From the preceding figure, we can derive the following information:

  • Supervised learning minimizes the error of the output of the model with respect to a target specified in the training set.
  • RL maximizes the reward signal of the actions.
  • Unsupervised learning has no target and no reward; it tries to learn a data representation that can be useful.

Let's go more in-depth and elaborate on these concepts further, particularly from a mathematical perspective.

Supervised learning deals with learning a function that maps an input to an output when the correspondences between the input and output (sample, label) are given by an external teacher (supervisor) and are contained in a training set. The objective of supervised learning is to generalize to unseen samples that are not included in the dataset, resulting in a system (for example, a function) that is able to respond correctly in new situations. Here, the correspondences between the sample and label are known for the training set and given to the system. Examples of supervised learning tasks include regression and classification problems. In a regression task, the learner has to find a function, $f$, of the input, $x$, producing a real (or, in general, real-vector) output, $y$. In mathematical notation, we have to find $f$ such that:

$$y = f(x), \quad y \in \mathbb{R}$$

Figure 1.4: Regression

Here, $y$ is known for the examples in the training set. In a classification task, the function to be learned is a discrete mapping; $y$ belongs to a finite and discrete set. Formalizing the problem, we search for a discrete-valued function, $f$, such that:

$$y = f(x), \quad y \in \mathcal{C}$$

Figure 1.5: Classification

Here, the set $\mathcal{C}$ represents the set of possible classes or categories.
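To make the difference between regression and classification concrete, here is a minimal sketch in plain Python. It fits a straight line to a tiny, made-up training set by least squares (regression) and then thresholds the prediction to obtain a label from a finite set (classification). The names `fit_line`, `regress`, and `classify`, the data, and the threshold are all assumptions for illustration.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = w * x + b to the training pairs (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Training set: samples x with real-valued labels y (made-up data, y = 2x + 1).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(train_x, train_y)


def regress(x: float) -> float:
    """Regression: the learned function returns a real number."""
    return w * x + b


def classify(x: float, threshold: float = 4.0) -> int:
    """Classification: the learned function returns a class from {0, 1}."""
    return 1 if regress(x) >= threshold else 0


print(regress(4.0))                  # prediction for an unseen sample
print(classify(0.5), classify(4.0))  # labels for two unseen samples
```

In both cases the labels in the training set play the role of the supervisor; only the type of the output changes.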

Unsupervised learning deals with learning patterns in the data when the target label is not present or is unknown. The objective of unsupervised learning is to find a new, usually smaller, representation of data. Examples of unsupervised learning algorithms include clustering and Principal Component Analysis (PCA).

In a clustering task, the learner should split the dataset into clusters (a group of elements) according to some similarity measure. At first glance, clustering may seem very similar to classification; however, as an unsupervised learning task, the labels, or classes, are not given to the algorithm inside the training set. Indeed, it is the algorithm itself that should make sense of its inputs, by learning a representation of the input space in such a way that similar samples are close to each other.

For example, in the following figure, we have the original data on the left; on the right, we have the possible output of a clustering algorithm. Different colors denote different clusters:

Figure 1.6: An example of a clustering application


In the preceding example, the input space is two-dimensional, that is, each sample is a point $(x_1, x_2)$, and the algorithm found three clusters, or three groups of similar elements.
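A clustering result of this kind can be reproduced with a short sketch. The code below is a minimal version of k-means (Lloyd's algorithm) in plain Python; the text does not say which clustering algorithm produced the figure, so k-means is an assumption, as are the data points and the choice of initial centroids.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def squared_distance(p: Point, q: Point) -> float:
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def k_means(points: List[Point], centroids: List[Point],
            iterations: int = 10) -> List[int]:
    """Assign each point to one of len(centroids) clusters."""
    centroids = list(centroids)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignments


# Unlabeled two-dimensional samples (made-up data forming three loose groups).
data = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (0.1, 5.0), (0.0, 5.2)]
# Initial centroids: three of the data points, an arbitrary choice.
labels = k_means(data, centroids=[data[0], data[2], data[4]])
print(labels)  # [0, 0, 1, 1, 2, 2]
```

Note that no labels are passed to `k_means`; the cluster indices it returns are arbitrary names for the groups it found.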

PCA is an unsupervised algorithm used for dimensionality reduction and feature extraction. PCA tries to make sense of data by searching for a representation that contains most of the information from the given data.

RL is different from both supervised and unsupervised learning. RL deals with learning control actions in a sequential decision-making problem. The sequential structure of the problem makes RL challenging and different from the two other paradigms. Moreover, in supervised and unsupervised learning, the dataset is fixed. In RL, the dataset is continuously changing, and dataset creation is itself the agent's task. In RL, unlike in supervised learning, no teacher provides the correct value for a given sample or the right action for a given situation. RL is based on a different form of feedback, which is the environment's feedback evaluating the behavior of the agent. It is precisely the presence of feedback that also makes RL different from unsupervised learning.

We will explore these concepts in more detail in future sections:

Figure 1.7: Machine learning paradigms and their relationships


RL and supervised learning can also be combined. A common technique (also used by the original AlphaGo) is called imitation learning (or behavioral cloning). Instead of learning a task from scratch, we teach the agent in a supervised way how to behave (or which action to take) in a given situation. In this context, we have an expert (or multiple experts) demonstrating the desired behavior to the agent. In this way, the agent can start building its internal representation and its initial knowledge. Its actions will no longer be purely random when the RL part begins, and its behavior will be more focused on the actions shown by the expert.

Let's now look at a few scenarios that will help us to classify the problems in a better manner.

Classifying Common Problems into Learning Scenarios

In this section, we will understand how it is possible to frame some common real-world problems into a learning framework by defining the required elements.

Predicting Whether an Image Contains a Dog or a Cat

Predicting the content of an image is a standard classification example; therefore, it lies under the umbrella of supervised learning. Here, we are given a picture, and the algorithm should decide whether the image contains a dog or a cat. The input is the image, and the associated label can be 0 for cats and 1 for dogs.

For a human, this is a straightforward task, as we have an internal representation of dogs and cats (as well as an internal representation of the world), and we have been trained extensively throughout our lives to recognize them. Despite this, writing an algorithm that can identify whether an image contains a dog or a cat is difficult without machine learning techniques. Because the task is elementary for a human, it is also easy to create a simple labeled dataset of images of cats and dogs.

Why Not Unsupervised Learning?

Unsupervised learning is not suited to this type of task as we have a defined output we need to obtain from an input. Of course, supervised learning methods build an internal representation of the input data in which similarities are better exploited. This representation is only implicit; it is not the output of the algorithm as is the case in unsupervised learning.

Why Not RL?

RL, by definition, considers sequential decision-making problems. Predicting the content of an image is not a sequential problem, but instead a one-shot task.

Detecting and Classifying All Dogs and Cats in an Image

Detection and classification are two examples of supervised learning problems. However, this task is more complicated than the previous one. The detection part can be seen as both a regression and a classification problem at the same time. The input is always the image we want to analyze, and the output is the coordinates of the bounding boxes for each dog or cat in the picture. Associated with each bounding box, we have a label to classify the content in the region of interest as a dog or a cat:

Figure 1.8: Cat and dog detection and classification


Why Not Unsupervised Learning?

As in the previous example, here, we have a determined output given an input (an image). We do not want to extract unknown patterns in the data.

Why Not RL?

Detection and classification are not tasks that are suited to the RL framework. We do not have a set of actions the agent should take to solve a problem. Also, in this case, the sequential structure is absent.

Playing Chess

Playing chess can be seen as an RL problem. The program can perceive the current state of the board (for example, the positions and types of the pieces), and, based on that, it should decide which action to take. Here, the number of possible actions is vast. Selecting an action means understanding and anticipating the consequences of the move in order to defeat the opponent:

Figure 1.9: Chess as an RL problem


Why Not Supervised?

We can think of playing chess as a supervised learning problem, but we would need to have a dataset, and we should incorporate the sequential structure of the game into the supervised learning problem. In RL, there is no need to have a dataset; it is the algorithm itself that builds up a dataset through interaction and, possibly, self-play.

Why Not Unsupervised?

Unsupervised learning does not fit in this problem as we are not dealing with learning a representation of the data; we have a defined objective, which is winning the game.

In this section, we compared the three main learning paradigms. We saw the kind of data they have at their disposal, the type of interaction each algorithm has with the external world, and we analyzed some particular problems to understand which learning paradigm is best suited.

When facing a real-world problem, we always have to remember the distinction between these techniques, selecting the best one based on our goals, our data, and on the problem structure.


Fundamentals of Reinforcement Learning

In RL, the main goal is to learn from interaction. We want agents to learn a behavior, a way of selecting actions in given situations, to achieve some goal. The main difference from classical programming or planning is that we do not want to code the planning software explicitly on our own, as this would require great effort and can be very inefficient or even impossible. The RL discipline was born precisely for this reason.

RL agents start (usually) with no idea of what to do. They typically do not know the goal, they do not know the game's rules, and they do not know the dynamics of the environment or how their actions influence the state.

There are three main components of RL: perception, actions, and goals.

Agents should be able to perceive the current environment state to deal with a task. This perception, also called observation, might be different from the actual environment state, can be subject to noise, or can be partial.

For example, think of a robot moving in an unknown environment. For robotic applications, usually, the robot perceives the environment using cameras. Such a perception does not represent the environment state completely; it can be subject to occlusions, poor lighting, or adverse conditions. The system should be able to deal with this incomplete representation and learn a way of moving in the environment.

The other main component of an agent is the ability to act; the agent should be able to take actions that affect the environment state or the agent's state.

Agents should also have a goal defined through the environment state. Goals are described using high-level concepts such as winning a game, moving in an environment, or driving correctly.

One of the challenges of RL, a challenge that does not arise in other types of learning, is the exploration-exploitation trade-off. In order to improve, the agent has to exploit its knowledge; it should prefer actions that have proven useful in the past. There's a problem here: to discover better actions, the agent should continue exploring, trying moves it has never tried before. To estimate the effect of an action reliably, an agent has to perform each action many times. The critical thing to notice here is that neither exploration nor exploitation can be pursued exclusively in order to learn a task.

This trade-off is very similar to the challenge we face as babies when we have to learn how to walk. At first, we try different types of movement, and we start from a simple movement yielding satisfactory results: crawling. Then, we want to improve our behavior to become more efficient. To learn a new behavior, we have to try movements we have never made before: we try to walk. At first, we perform different actions yielding unsatisfactory results: we fall many times. Once we discover the correct way of moving our legs and balancing our body, we become more efficient at walking. If we did not explore further and we stopped at the first behavior that yields satisfactory results, we would crawl forever. By exploring, we learn that there can be different behaviors that are more efficient. Once we learn how to walk, we can stop exploring, and we can start exploiting our knowledge.
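One simple way of balancing exploration and exploitation is the epsilon-greedy rule: with a small probability the agent tries a random action, and otherwise it picks the action with the best estimated value. The sketch below applies it to a three-action problem with noisy rewards. The epsilon-greedy rule itself is not named in the text above, and the reward means, noise level, and step count are made-up values chosen for illustration.

```python
import random
from typing import List


def epsilon_greedy(estimates: List[float], epsilon: float,
                   rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit


def run(epsilon: float, steps: int = 2000, seed: int = 0) -> List[float]:
    """Estimate the value of each action from noisy rewards."""
    rng = random.Random(seed)
    true_means = [1.0, 2.0, 3.0]  # hidden from the agent (made-up values)
    estimates = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates


print(run(epsilon=0.1))
```

Because every action keeps being tried occasionally, each estimate is averaged over many samples, which is what allows the agent to identify the action with the highest mean reward.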

Elements of RL

Let's introduce the main elements of the RL framework intuitively.


In RL, the agent is the abstract concept of the entity that moves in the world, takes actions, and achieves goals. An agent can be a piece of autonomous driving software, a chess player, a Go player, an algorithmic trader, or a robot. The agent is everything that can perceive and influence the state of the environment and, therefore, can be used to accomplish goals.


An agent can perform actions based on the current situation. Actions can assume different forms depending on the specific task.

In an autonomous driving context, actions can be to steer, to push the accelerator pedal, or to push the brake pedal. In a chess context, examples of actions include moving the knight to the h5 square or moving the king to the a5 square.

Actions can be low-level, such as controlling the voltage of the motors of a vehicle, but they can also be high-level, or planning actions, such as deciding where to go. The decision on the action level is the responsibility of the algorithm's designer. Actions that are too high-level can be challenging to implement at a lower level; they might require extensive planning at lower levels. At the same time, low-level actions make the problem difficult to learn.


The environment represents the context in which the agent moves and takes decisions. An environment is composed of three main elements: states, dynamics, and rewards. They can be explained as follows:

  • State: This represents all of the information describing the environment at a particular timestep. The state is available to the agent through observations, which can be a partial or full representation.
  • Dynamics: The dynamics of an environment describe how actions influence the state of the environment. The environment dynamics are usually very complex or unknown. An RL algorithm that uses information about the environment dynamics to learn how to achieve a goal belongs to the category of model-based RL, where the model represents the mathematical description of the environment. Most of the time, the environment dynamics are not available to the agent. In this case, the algorithm belongs to the model-free category. Even if the environment model is not available, too complicated, or too approximate, the agent can learn a model of the environment during training. In this case, too, the algorithm is said to be model-based.
  • Rewards: Rewards are scalar values associated with each timestep describing the agent's goal. Rewards can also be described as environmental feedback, providing information to an agent about its behavior; it is, therefore, necessary for making learning possible. If the agent receives a high reward, it means that it performed a good move, a move bringing it closer to its goal.


A policy describes the behavior of the agent. Agents select actions by following their policies. Mathematically, a policy is a function mapping states to actions. What does this mean? Well, it means that the input of the policy is the current state, and its output is the action to take. A policy can have different forms. It can be a simple set of rules, a lookup table, a neural network, or any function approximator. A policy is the core of the RL framework, and the goal of all RL algorithms (implicit or explicit) is to improve the agent's policy to maximize the agent's performance on a task (or on a set of tasks). A policy can be stochastic, involving a distribution over actions, or it can be deterministic. In the latter case, the selected action is uniquely determined by the environment's state.
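The two kinds of policy can be sketched with lookup tables. In the following minimal example, the deterministic policy maps each state to a single action, while the stochastic policy maps each state to a probability distribution from which an action is sampled. The state names, action names, and probabilities are hypothetical and chosen only for illustration.

```python
import random
from typing import Dict, List

# Deterministic policy as a lookup table: state -> action.
deterministic_policy: Dict[str, str] = {
    "s0": "left",
    "s1": "right",
}

# Stochastic policy: state -> probability distribution over actions.
stochastic_policy: Dict[str, Dict[str, float]] = {
    "s0": {"left": 0.9, "right": 0.1},
    "s1": {"left": 0.2, "right": 0.8},
}


def act_deterministic(state: str) -> str:
    """The action is uniquely determined by the state."""
    return deterministic_policy[state]


def act_stochastic(state: str, rng: random.Random) -> str:
    """The action is sampled from the distribution associated with the state."""
    distribution = stochastic_policy[state]
    actions: List[str] = list(distribution)
    weights = [distribution[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(act_deterministic("s0"))    # left
print(act_stochastic("s0", rng))  # "left" about 90% of the time
```

A neural network policy follows the same interface: it receives a state and returns either an action or a distribution over actions.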

An Example of an Autonomous Driving Environment

To better understand the environment's role and its characteristics in the RL framework, let's formalize an autonomous driving environment, as shown in the following figure:

Figure 1.10: An autonomous driving scenario


Considering the preceding figure, let's now look at each of the components of the environment:

  • State: The state can be represented by the 360-degree image of the street around our car. In this case, the state is an image, that is, a matrix of pixels. It can also be represented by a series of images covering the whole space around the car. Another possibility is to describe the state using features and not images. The state can be the current velocity and acceleration of our vehicle, the distance from other cars, or the distance from the street border. In this case, we are using preprocessed information to represent the state more easily. These features can be extracted from images or other types of sensors (for example, Light Detection and Ranging – LIDAR).
  • Dynamics: The dynamics of the environment in an autonomous car scenario are represented by the equations describing how the system changes when the car accelerates, brakes, or steers. For instance, suppose the vehicle is going at 30 km/h, and the next vehicle is 100 meters away from it. The state is represented by the car's speed and the proximity information concerning the next vehicle. If the car accelerates, the speed changes according to the car's properties (included in the environment dynamics). Also, the proximity information changes since the next vehicle can be closer or further away (according to the speed). In this situation, at the next timestep, the car's speed can be 35 km/h, and the next vehicle can be closer, for example, only 90 meters away.
  • Reward: The reward can represent how well the agent is driving. It's not easy to formalize a reward function. A natural reward function should reward states in which the car is aligned to the street and should penalize states in which the car crashes or goes off the road. The reward function definition is an open problem, and researchers are putting effort into developing algorithms where the reward function is not needed (self-motivation or curiosity-driven agents), where the agent learns from demonstrations (imitation learning), and where the agent recovers the reward function from demonstrations (Inverse Reinforcement Learning or IRL).
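The three components above can be sketched in a few lines of Python. The code below uses a feature-based state, a deliberately crude one-step dynamics function, and a hand-written reward. The class `DrivingState`, the functions `step_dynamics` and `reward`, the update equations, and the numeric values of the reward are all assumptions made for illustration, not a realistic vehicle model.

```python
from dataclasses import dataclass


@dataclass
class DrivingState:
    speed_kmh: float      # speed of our car
    gap_m: float          # distance to the next vehicle
    lane_offset_m: float  # lateral distance from the lane center


def step_dynamics(state: DrivingState, accel_kmh_per_s: float,
                  lead_speed_kmh: float, dt_s: float = 1.0) -> DrivingState:
    """One-step dynamics: how acceleration changes the speed and the gap."""
    new_speed = max(0.0, state.speed_kmh + accel_kmh_per_s * dt_s)
    # The gap shrinks when we are faster than the next vehicle (km/h -> m/s).
    closing_speed_ms = (new_speed - lead_speed_kmh) / 3.6
    new_gap = state.gap_m - closing_speed_ms * dt_s
    return DrivingState(new_speed, new_gap, state.lane_offset_m)


def reward(state: DrivingState) -> float:
    """Hypothetical reward: penalize crashes, reward staying centered."""
    if state.gap_m <= 0.0:
        return -100.0                      # crash into the next vehicle
    return 1.0 - abs(state.lane_offset_m)  # highest when aligned to the lane


state = DrivingState(speed_kmh=30.0, gap_m=100.0, lane_offset_m=0.0)
next_state = step_dynamics(state, accel_kmh_per_s=2.5,
                           lead_speed_kmh=17.0, dt_s=2.0)
print(next_state, reward(next_state))
```

With these made-up parameters, the car goes from 30 km/h to 35 km/h and the gap shrinks from 100 meters to roughly 90 meters, mirroring the numbers used in the dynamics description.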


    For further reading on curiosity-driven agents, please refer to the following paper:

We are now ready to design and implement our first environment class using Python. We will demonstrate how to implement the state, the dynamics, and the reward of a toy problem in the following exercise.

Exercise 1.01: Implementing a Toy Environment Using Python

In this exercise, we will implement a simple toy environment using Python. The environment is illustrated in Figure 1.11. It is composed of three states (1, 2, 3) and two actions (A and B). The initial state is state 1. States are represented by nodes. Edges represent transitions between states. On the edges, we have an action causing the transition and the associated reward.

The representation of the environment in Figure 1.11 is the standard environment representation in the context of RL. In this exercise, we will become acquainted with the concept of the environment and its implementation:

Figure 1.11: A toy environment composed of three states (1, 2, 3) and two actions (A and B)


In the preceding figure, the reward is associated with each state-action pair.

The goal of this exercise is to implement an Environment class with a step() method that takes the agent's action as input and returns a (next state, reward) pair. In addition to this, we will write a reset() method that restores the environment's initial state:

  1. Create a new Jupyter notebook or a simple Python script to enter the code.
  2. Import the Tuple type from typing:
    from typing import Tuple
  3. Define the class constructor by initializing its properties:
    class Environment:
        def __init__(self):
            """
            Constructor of the Environment class.
            """
            self._initial_state = 1
            self._allowed_actions = [0, 1]  # 0: A, 1: B
            self._states = [1, 2, 3]
            self._current_state = self._initial_state


    The triple-quotes ( """ ) shown in the code snippet above are used to denote the start and end points of a multi-line code comment. Comments are added into code to help explain specific bits of logic.

    We have two allowed actions, action 0 and action 1, representing the actions A and B. We have three environment states: 1, 2, and 3. We set the _current_state attribute to be equal to the initial state (state 1).

  4. Define the step function, which is responsible for updating the current state based on the previous state and the action taken by the agent:
        def step(self, action: int) -> Tuple[int, int]:
            """
            Step function: compute the one-step dynamic from the
            given action.

            Args:
                action (int): the action taken by the agent.

            Returns:
                The tuple current_state, reward.
            """
            # check if the action is allowed
            if action not in self._allowed_actions:
                raise ValueError("Action is not allowed")
            reward = 0
            if action == 0 and self._current_state == 1:
                self._current_state = 2
                reward = 1
            elif action == 1 and self._current_state == 1:
                self._current_state = 3
                reward = 10
            elif action == 0 and self._current_state == 2:
                self._current_state = 1
                reward = 0
            elif action == 1 and self._current_state == 2:
                self._current_state = 3
                reward = 1
            elif action == 0 and self._current_state == 3:
                self._current_state = 2
                reward = 0
            elif action == 1 and self._current_state == 3:
                self._current_state = 3
                reward = 10
            return self._current_state, reward


    The # symbol in the code snippet above denotes a code comment. Comments are added into code to help explain specific bits of logic.

    We first check that the action is allowed. Then, we define the new current state and reward based on the action and the previous state by looking at the transition in the previous figure.

  5. Now, we need to define the reset function, which simply resets the environment state:
        def reset(self) -> int:
            """
            Reset the environment starting from the initial state.

            Returns:
                The environment state after reset (initial state).
            """
            self._current_state = self._initial_state
            return self._current_state
  6. We can use our environment class to check whether our implementation is correct for the specified environment. We can do this with a simple loop, using a predefined sequence of actions to test the transitions of our environment. A possible action sequence, in this case, is [0, 0, 1, 1, 0, 1]. Using this sequence, we will test all six of the environment's transitions:
    env = Environment()
    state = env.reset()
    actions = [0, 0, 1, 1, 0, 1]
    print(f"Initial state is {state}")
    for action in actions:
        next_state, reward = env.step(action)
        print(f"From state {state} to state {next_state} \
    with action {action}, reward: {reward}")
        state = next_state


    The code snippet shown here uses a backslash ( \ ) to split the logic across multiple lines. When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.

    The output should be as follows:

    Initial state is 1
    From state 1 to state 2 with action 0, reward: 1
    From state 2 to state 1 with action 0, reward: 0
    From state 1 to state 3 with action 1, reward: 10
    From state 3 to state 3 with action 1, reward: 10
    From state 3 to state 2 with action 0, reward: 0
    From state 2 to state 3 with action 1, reward: 1

To understand this better, compare the output with Figure 1.11 to discover whether the transitions and rewards are compatible with the selected actions.


To access the source code for this specific section, please refer to

You can also run this example online at

In this exercise, we implemented a simple RL environment by defining the step function and the reset function. These functions are at the core of every environment, representing the interaction between the agent and the environment.
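
The same transitions can also be written as a lookup table, which makes them easy to compare with the expected output. The following is a minimal sketch rather than the exercise's implementation: the names `TRANSITIONS`, `step`, and `rollout` are hypothetical, and the table entries are copied from the `if`/`elif` branches of the `step` function shown earlier.

```python
from typing import Dict, List, Tuple

# (state, action) -> (next_state, reward), copied from the branches of step()
TRANSITIONS: Dict[Tuple[int, int], Tuple[int, int]] = {
    (1, 0): (2, 1),
    (1, 1): (3, 10),
    (2, 0): (1, 0),
    (2, 1): (3, 1),
    (3, 0): (2, 0),
    (3, 1): (3, 10),
}


def step(state: int, action: int) -> Tuple[int, int]:
    """Return (next_state, reward) for a state-action pair."""
    if (state, action) not in TRANSITIONS:
        raise ValueError("Action is not allowed")
    return TRANSITIONS[(state, action)]


def rollout(initial_state: int, actions: List[int]) -> List[Tuple[int, int, int, int]]:
    """Apply the actions in order and record (state, action, next_state, reward)."""
    history = []
    state = initial_state
    for action in actions:
        next_state, reward = step(state, action)
        history.append((state, action, next_state, reward))
        state = next_state
    return history


# Same action sequence as in the exercise, starting from state 1
history = rollout(1, [0, 0, 1, 1, 0, 1])
total_reward = sum(reward for _, _, _, reward in history)
```

Each tuple in `history` corresponds to one printed line of the exercise output, so the two can be compared entry by entry.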

The Agent-Environment Interface

RL considers sequential decision-making problems. In this context, we can refer to the agent as the "decision-maker." In sequential decision-making problems, actions taken by the decision-maker do not only influence the immediate reward and the immediate environment's state, but they also affect future rewards and states. MDPs are a natural way of formalizing sequential decision-making problems. In MDPs, an agent interacts with an environment through actions and receives rewards based on the action, on the current state of the environment, and on the environment's dynamics. The goal of the decision-maker is to maximize the cumulative sum of rewards given a horizon (which is possibly infinite). The task the agent has to learn is defined through the rewards it receives, as you can see in the following figure:

Figure 1.12: The Agent-Environment interface

In RL, an episode is divided into a sequence of discrete timesteps: $t = 0, 1, 2, \dots, T$. Here, $T$ represents the horizon length, which is possibly infinite. The interaction between the agent and the environment happens at each timestep. At each timestep, the agent receives a representation of the current environment's state, $s_t$. Based on this state, it selects an action, $a_t$, belonging to the action space given the current state, $\mathcal{A}(s_t)$. The action affects the environment. As a result, the environment changes its state, transitioning to the next state, $s_{t+1}$, according to its dynamics. At the same time, the agent receives a scalar reward, $r_{t+1}$, quantifying how good the action taken in that state was.

Let's now try to understand the mathematical notations used in the preceding example:

  • Time horizon $T$: If a task has a finite time horizon, then $T$ is an integer number representing the maximum duration of an episode. In infinite tasks, $T$ can also be $\infty$.
  • Action $a_t$ is the action taken by the agent in the timestep, t. The action belongs to the action space, $\mathcal{A}(s_t)$, defined by the current state, $s_t$.
  • State $s_t$ is the representation of the environment's state received by the agent at time t. It belongs to the state space, $\mathcal{S}$, defined by the environment. It can be represented by an image, a sequence of images, or a simple vector assuming different shapes. Note that the actual environment state can be different and more complex than the state perceived by the agent.
  • Reward $r_{t+1}$ is represented by a real number, describing how good the action taken at timestep t was. A high reward corresponds to a good action. The reward is fundamental for the agent to understand how to achieve a goal.

In episodic RL, the agent-environment interaction is divided into episodes; the agent has to achieve the goal within the episode. The purpose of the interaction is to learn better behavior. After several episodes, the agent can decide to update its behavior by incorporating its knowledge of past interactions. Based on the effect of its actions on the environment and on the received rewards, the agent will more frequently perform the actions that yield higher rewards.
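
The interaction described above can be sketched as a simple loop. This is an illustrative sketch under stated assumptions: `CorridorEnv` is a hypothetical toy environment invented for this example, and its `step` method returns an extra `done` flag (an assumption not used in the previous exercise) to signal the end of the episode.

```python
import random
from typing import Callable, List, Tuple


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length: int = 4) -> None:
        self._length = length
        self._position = 0

    def reset(self) -> int:
        self._position = 0
        return self._position

    def step(self, action: int) -> Tuple[int, float, bool]:
        # action 1 moves right, any other action moves left (never below cell 0)
        move = 1 if action == 1 else -1
        self._position = max(0, self._position + move)
        done = self._position == self._length - 1
        reward = 1.0 if done else 0.0
        return self._position, reward, done


def run_episode(env: CorridorEnv, policy: Callable[[int], int], horizon: int) -> List[float]:
    """Run one episode of at most `horizon` timesteps and return the rewards."""
    rewards = []
    state = env.reset()
    for _ in range(horizon):
        action = policy(state)                  # the agent selects an action from the state
        state, reward, done = env.step(action)  # the environment returns next state and reward
        rewards.append(reward)
        if done:
            break
    return rewards


rng = random.Random(0)
random_rewards = run_episode(CorridorEnv(), lambda s: rng.choice([0, 1]), horizon=20)
always_right_rewards = run_episode(CorridorEnv(), lambda s: 1, horizon=20)
```

The policy that always moves right reaches the goal in three steps, whereas the random policy may or may not reach it within the horizon.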

What's the Agent? What's in the Environment?

An important aspect to take into account when dealing with RL is the difference between the agent and the environment. This difference is not typically defined in terms of a physical distinction. Usually, we model the environment as everything that's not under the control of the agent. The environment can include physical laws, other agents, or an agent's properties or characteristics.

However, this does not imply that the agent does not know the environment. The agent can also be aware of the environment and the effect of its actions on it, but it cannot change the way the environment reacts. Also, the reward computation belongs to the environment, as it must be entirely outside the agent's control. If this is not the case, the agent can learn how to modify the reward function in such a way as to maximize its performance without learning the task. The boundary between the agent and environment is a control boundary, meaning that the agent cannot control the reaction of the environment. It is not a knowledge boundary since the agent can know the environment model perfectly and still find difficulties in learning the task.

Environment Types

In this section, we will examine some possible environment dichotomies. The characterization of the environment depends on the state space (finite or continuous), on the type of transitions (deterministic or stochastic), on the information available to the agent (fully or partially observable), and the number of agents involved in the learning problem (single versus multi-agent).

Finite versus Continuous

The state space gives the first distinction. The state space can be divided into two main categories: a finite state space and a continuous state space. A finite state space has a finite number of possible states in which the agent can be, and it's the more straightforward case. An environment with a continuous state space has infinitely many possible states. In these types of environments, the generalization properties of the agent are fundamental to solve a task because the probability of arriving at the same state twice is almost zero. In continuous environments, an agent cannot rely on experience gathered during previous visits to exactly the same state; it has to generalize using some kind of similarity with respect to the previously experienced states. Note that generalization is also essential for finite state spaces with a considerable number of states (for example, when the state space is represented by the set of all possible images).

Consider the following examples:

  • Chess is finite. There is a finite number of possible states in which an agent can be. The state, for chess, is represented by the chessboard situation at a given time. We can calculate all the possible states by varying the situation of the chessboard. The number of states is very high but still finite.
  • Autonomous driving can be defined as a continuous problem. If we describe the autonomous driving problem as a problem in which the agent has to make driving decisions based on the sensors' input, we obtain a continuous problem. The sensors provide continuous input in a given range. The agent state, in this case, can be represented by the agent's speed, the agent's acceleration, or the rotation of the wheels per minute.

Deterministic versus Stochastic

A deterministic environment is an environment in which, given a state and an action performed by the agent, the following state and the following reward are uniquely determined. Deterministic environments are simple types of environments, but they are also rarely used due to their limited applicability in the real world.

Almost all real-world environments are stochastic. In stochastic environments, a state and an action performed by the agent determine a probability distribution over the next state and the next reward. The following state is not uniquely determined; it is uncertain. In these types of environments, the agent should try each action many times to obtain a reliable estimate of its consequences.

Notice that, in a deterministic environment, the agent could perform each action in each state exactly once, and based on the acquired knowledge, it can solve the task. Also, notice that solving the task does not mean taking actions that yield the highest immediate return, because this action can also bring the agent to an inconvenient part of the environment where future rewards are always low. To solve the task correctly, the agent should take actions with the highest associated future return (called a state-action value). The state-action value does not take into account only the immediate reward but also the future rewards, giving the agent a farsighted view. We will define later what a state-action value is.

Consider the following examples:

  • Rubik's Cube is deterministic. Each action corresponds to a well-defined state transition.
  • Chess is deterministic but opponent-dependent. The successive state does not depend only on the agent's action but also on the opponent's action.
  • Texas Hold'em is stochastic and opponent-dependent. The transition to the next state is stochastic and depends on the deck, which is not known by the agent.
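
The need to act many times in a stochastic environment can be illustrated with a small simulation. This is a hedged sketch: the dynamics in `stochastic_step` (the action succeeds with probability 0.8) are a hypothetical example and do not come from any of the environments mentioned above.

```python
import random
from typing import Tuple


def stochastic_step(state: int, action: int, rng: random.Random) -> Tuple[int, float]:
    """Hypothetical dynamics: the action succeeds with probability 0.8."""
    if rng.random() < 0.8:
        return state + 1, 1.0   # intended transition
    return state, 0.0           # the agent slips and stays in place


def estimate_expected_reward(state: int, action: int, n_trials: int, seed: int = 0) -> float:
    """Average the reward over repeated trials of the same state-action pair."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        _, reward = stochastic_step(state, action, rng)
        total += reward
    return total / n_trials


one_trial = estimate_expected_reward(state=0, action=1, n_trials=1)
many_trials = estimate_expected_reward(state=0, action=1, n_trials=10_000)
```

A single trial can only return 0 or 1, while the average over many trials approaches the true expected reward of 0.8.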

Fully Observable versus Partially Observable

The agent, to plan actions, has to receive a representation of the environment state, $s_t$ (refer to Figure 1.12, The Agent-Environment interface). If the state representation received by the agent completely defines the state of the environment, the environment is Fully Observable. If some parts of the environment are outside the representation observed by the agent, the environment is Partially Observable, and the problem is modeled as a Partially Observable Markov Decision Process (POMDP). Multi-agent environments are an example of partially observable environments. In the case of partially observable environments, the information perceived by the agents, together with the action taken, is not sufficient for determining the next state of the environment. A technique to improve the perception of the agent, making it more accurate, is to keep the history of taken actions and observations, but this requires some memory mechanism (such as a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network) embedded in the agent's policy.


For more information on LSTMs, please refer to

POMDP versus MDP

Consider the following figure:

Figure 1.13: A representation of a partially observable environment

In the preceding figure, the agent does not receive the full environment state but only an observation, $o_t$.

To better understand the differences between these two types of environments, let's look at Figure 1.13. In partially observable environments (POMDP), the representation given to the agent is only a part of the actual environment state, and it is not enough to understand the actual environment state without uncertainty.

In fully observable environments (MDPs), the state representation given to the agent is semantically equivalent to the state of the environment. Notice that, in this case, the state given to the agent can assume a different form (for example, an image, a vector, a matrix, or a tensor). However, from this representation, it is always possible to reconstruct the actual state of the environment. The meaning of the state is precisely the same, even if under a different form.

Consider the following examples:

  • Chess (and, in general, board games) is fully observable. The agent can perceive the whole environment state. In a chess game, the environment state is represented by the chessboard, and the agent can perceive the exact position of each piece.
  • Poker is partially observable. A poker agent cannot perceive the whole state of the game, which includes the opponents' cards and the cards in the deck.
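
The difference between a state and an observation can be made concrete with a toy example. This is a minimal sketch under stated assumptions: the state is a hypothetical (position, velocity) pair, the sensor in `observe` hides the velocity, and a finite difference over two observations stands in for the memory mechanisms (such as RNNs) mentioned above.

```python
from typing import List, Tuple

State = Tuple[float, float]  # (position, velocity): the full environment state


def observe(state: State) -> float:
    """Hypothetical sensor: the agent sees the position but not the velocity."""
    position, _velocity = state
    return position


def estimate_velocity(observations: List[float], dt: float = 1.0) -> float:
    """Recover the hidden velocity from the last two observations."""
    return (observations[-1] - observations[-2]) / dt


moving_right: State = (0.5, 0.1)
moving_left: State = (0.5, -0.1)

# Two different states produce the same observation
same_observation = observe(moving_right) == observe(moving_left)

# A short history of observations disambiguates them
history = [0.4, observe(moving_right)]
velocity_estimate = estimate_velocity(history)
```

A single observation cannot distinguish the two states, but keeping the previous observation is enough to recover the missing information in this simple case.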

Single Agents versus Multiple Agents

Another useful characteristic of environments is the number of agents involved in a task. If there is only one agent, the subject of our study, the environment is a single-agent environment. If the number of agents is more than one, the environment is a multi-agent environment. The presence of multiple agents increases the complexity of the problem since the action that influences the state becomes a joint action, the set of all the agents' actions. Usually, agents only know their individual actions and do not know another agent's actions. For this reason, the multi-agent environment is an instance of POMDP in which the partial visibility is due to the presence of other agents. Notice that each agent has its own observation, which can differ from the other agent's observation, as shown in the following figure:

Figure 1.14: A schematic representation of the multi-agent decentralized MDP

Consider the following examples:

  • Robot navigation is usually a single-agent task. We may have only one agent moving in a possibly unknown environment. The goal of the agent can be to reach a given position in the environment in the minimum amount of time while avoiding crashes as much as possible.
  • Poker is a multi-agent task in which two or more agents compete against each other. Each agent perceives a different state, and each agent also receives a different reward.

An Action and Its Types

The action set of an agent in an environment can be finite or continuous. If the action set is finite, the agent has at its disposal a finite number of actions. Consider the MountainCar-v0 (discrete) example, described in more detail later. This has a discrete action set; the agent only has to select the direction in which to accelerate, and the acceleration is constant.

If the action set is continuous, the agent has at its disposal infinite actions from which it should select the best actions in a given state. Usually, tasks with continuous action sets are more challenging to solve than those with finite actions.

Let's look at the example of MountainCar-v0:

Figure 1.15: A MountainCar-v0 task

As you can see in the preceding figure, a car is positioned in a valley between two mountains. The goal of the car is to arrive at the flag on the mountain to its right.

The MountainCar-v0 example is a standard RL benchmark in which there is a car trying to ramp itself up a mountain. The car's engine doesn't have enough strength to ramp upward. For this reason, the car should use the inertia given from the shape of the valley, that is, it should go to the left to gain speed. The state is composed of the car velocity, acceleration, and x position. There are two versions of this task based on the action set we define, as follows:

  • MountainCar-v0 discrete: We have only two possible actions, (-1, +1) or (0, 1), depending on the parameterization.
  • MountainCar-v0 continuous: A continuous set of actions from -1 to +1.


We define the policy as the behavior of the agent. Formally, a policy is a function that takes as input the history of the current episode and outputs the current action. The concept of policies has huge importance in RL; all RL algorithms focus on learning the best policy for a given task.

An example of a winning policy for the MountainCar-v0 task is a policy that brings the agent up on the left mountain and then uses the accumulated potential energy to ramp up the mountain on the right. For negative velocities, the optimal action is LEFT, as the agent should go as high as possible on the left mountain. For positive velocities, the agent should take the action RIGHT, as its goal is to ramp up the mountain on its right.

A Markovian policy is simply a policy depending only on the current state and not the whole history.

We denote a stationary Markovian policy with $\pi$ as follows:

$$\pi: \mathcal{S} \rightarrow \mathcal{A}$$

Figure 1.16: Stationary Markovian policy

The Markovian policy goes from the state space to the action space. If we evaluate the policy in a given state, $s_t$, we obtain the selected action, $a_t$, in that state:

$$a_t = \pi(s_t)$$

Figure 1.17: Stationary Markovian policy in state $s_t$

A policy can be implemented in different ways. The most straightforward policy is just a rule-based policy, which is essentially a set of rules or heuristics.
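
The MountainCar-v0 heuristic described earlier (LEFT for negative velocities, RIGHT for positive ones) is an example of such a rule-based policy. The sketch below assumes hypothetical action labels `LEFT` and `RIGHT` and treats a velocity of exactly zero as positive, which is an arbitrary choice made for this example.

```python
LEFT = 0   # hypothetical action label: accelerate to the left
RIGHT = 1  # hypothetical action label: accelerate to the right


def rule_based_policy(velocity: float) -> int:
    """Push in the direction in which the car is already moving."""
    return LEFT if velocity < 0 else RIGHT


action_when_rolling_back = rule_based_policy(-0.02)
action_when_moving_forward = rule_based_policy(0.03)
```

This policy has no parameters to learn: the rule is fixed in advance by the designer.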

Policies that are a subject of interest in RL are usually parametric. Parametric policies are (differentiable) functions depending on a set of parameters. Usually, the policy parameters are identified as $\theta$:

Figure 1.18: Parametric policies

The set of policy parameters can be represented by a vector in a d-dimensional space. The selected action is determined by the policy structure (we will explore some possible policy structures later on), by the policy parameters, and, of course, by the current environment state.

Stochastic Policies

The policies presented so far are merely deterministic policies because the output is precisely an action. Stochastic policies are policies that output a distribution over the action space. Stochastic policies are usually powerful policies that mix both exploration and exploitation. With stochastic policies, it is possible to obtain complex behaviors.

A stochastic policy assigns a certain probability to each action. The actions will be selected according to the associated probability.

Figure 1.19 explains, graphically, and with an example, the differences between a stochastic policy and a deterministic policy. The policy in the figure has three possible actions.

The stochastic policy (upper part) assigns to actions, respectively, a probability of 0.2, 0.7, and 0.1. The most probable action is the second action, which is associated with the highest probability. However, all of the actions could also be selected.

In the bottom part, we have the same set of actions with a deterministic policy. The policy, in this case, selects only one action (the second in the figure) with a probability of 1. In this case, actions 1 and 3 will not be selected, having an associated probability of 0.

Note that we can obtain a deterministic policy from a stochastic one by taking the action associated with the highest probability:

Figure 1.19: Stochastic versus deterministic policies
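
The example with probabilities 0.2, 0.7, and 0.1 can be reproduced with a few lines of code. This is an illustrative sketch: the function names are hypothetical, and actions are indexed from 0, so the "second action" of the example has index 1.

```python
import random
from collections import Counter
from typing import List

action_probabilities: List[float] = [0.2, 0.7, 0.1]  # values from the example


def sample_action(probabilities: List[float], rng: random.Random) -> int:
    """Stochastic policy: draw an action index according to its probability."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def greedy_action(probabilities: List[float]) -> int:
    """Deterministic policy: pick the most probable action."""
    return max(range(len(probabilities)), key=lambda a: probabilities[a])


rng = random.Random(0)
n_samples = 10_000
counts = Counter(sample_action(action_probabilities, rng) for _ in range(n_samples))
frequencies = [counts[a] / n_samples for a in range(len(action_probabilities))]
```

The empirical frequencies approach the assigned probabilities, while `greedy_action` always returns the same action.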

Policy Parameterizations

In this section, we will analyze some possible policy parameterizations. Parameterizing a policy means giving a structure to the policy function and considering how parameters affect our output actions. Based on the parameterization, it is possible to obtain simple policies or even complex stochastic policies starting from the same input state.

Linear (Deterministic)

The resulting action is a linear combination of the state features, $\phi(s)$:

$$a = \theta^\top \phi(s)$$

Figure 1.20: An expression of a linear policy

A linear policy is a very simple policy represented by a matrix multiplication.

Consider the example of MountainCar-v0. The state space is represented by the position, speed, and acceleration: $s = [x, \dot{x}, \ddot{x}]^\top$. We usually add a constant, 1, that corresponds to the bias term. Therefore, $s = [x, \dot{x}, \ddot{x}, 1]^\top$. Policy parameters are defined by $\theta = [\theta_1, \theta_2, \theta_3, \theta_4]^\top$. We can simply use as state features the identity function, $\phi(s) = s$.

The resulting policy is as follows:

$$a = \theta^\top \phi(s) = [\theta_1, \theta_2, \theta_3, \theta_4] \, [x; \dot{x}; \ddot{x}; 1] = \theta_1 x + \theta_2 \dot{x} + \theta_3 \ddot{x} + \theta_4$$

Figure 1.21: A linear policy for MountainCar-v0

Using a comma, ,, we can denote the column separator, and with a semicolon, ;, we can denote the row separator.

Therefore, $[\theta_1, \theta_2, \theta_3, \theta_4]$ is a row vector, and $[x; \dot{x}; \ddot{x}; 1]$ is a column vector that is equivalent to $[x, \dot{x}, \ddot{x}, 1]^\top$.

If the environment state is [1, 2, 0.1], the cart is in position $x = 1$ with velocity $\dot{x} = 2$ and acceleration $\ddot{x} = 0.1$, and the policy parameters are defined by [4, 5, 1, 1], we obtain an action, $a = 4 \cdot 1 + 5 \cdot 2 + 1 \cdot 0.1 + 1 \cdot 1 = 15.1$.

Since the action space of MountainCar-v0 is defined in the interval, [-1, +1], we need to squash the resulting action using a squashing function such as $\tanh$ (hyperbolic tangent). In our case, $\tanh$ applied to the output of the multiplication results in approximately +1:

Figure 1.22: A hyperbolic tangent plot; the hyperbolic tangent squashes the real numbers in the interval, [-1, +1]

Even though linear policies are simple, they are often sufficient to solve a task, provided that the state features represent the problem well.
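
The numerical example above (state [1, 2, 0.1] and parameters [4, 5, 1, 1]) can be checked with a short script. This is a dependency-free sketch that uses plain Python lists instead of NumPy arrays; the function name `linear_policy` is chosen for this example only.

```python
import math
from typing import List


def linear_policy(parameters: List[float], features: List[float]) -> float:
    """Dot product between the policy parameters and the state features."""
    assert len(parameters) == len(features)
    return sum(p * f for p, f in zip(parameters, features))


state = [1.0, 2.0, 0.1]            # position, velocity, acceleration
features = state + [1.0]           # identity features plus the bias term
parameters = [4.0, 5.0, 1.0, 1.0]

raw_action = linear_policy(parameters, features)  # 4*1 + 5*2 + 1*0.1 + 1*1
squashed_action = math.tanh(raw_action)           # mapped into [-1, +1]
```

The raw action is far outside the interval [-1, +1], so the hyperbolic tangent saturates and the squashed action is almost exactly +1.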

Gaussian Policy

In the case of Gaussian parameterization, the resulting action has a Gaussian distribution in which the mean, $\mu_\theta(s)$, and the variance, $\sigma_\theta^2(s)$, depend on state features:

$$\pi_\theta(a \mid s) = \mathcal{N}\left(\mu_\theta(s), \sigma_\theta^2(s)\right)$$

Figure 1.23: Expression for a Gaussian policy

Here, with the symbol $\mid$, we denote the conditional distribution; therefore, with $\pi_\theta(a \mid s)$, we denote the distribution over actions conditioned on state $s$.

Remember, the functional form of the Gaussian distribution, $\mathcal{N}(\mu, \sigma^2)$, is as follows:

$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right)$$

Figure 1.24: A Gaussian distribution

Note that the $\pi$ under the square root is the mathematical constant, not the policy.

In the case of a Gaussian policy, this becomes the following:

$$\pi_\theta(a \mid s) = \frac{1}{\sqrt{2 \pi \sigma_\theta^2(s)}} \exp\left(-\frac{(a - \mu_\theta(s))^2}{2 \sigma_\theta^2(s)}\right)$$

Figure 1.25: A Gaussian policy

Gaussian parameterization is useful for continuous action spaces. Note that we are giving the agent the possibility of also changing the variance of the distribution. This means that it can decide to increase the variance, enabling it to explore scenarios where it's not sure what the best action to take is, or it can reduce the variance by increasing the amount of exploitation when it's very sure about which action to take in a given state. The effect of the variance can be visualized as follows:

Figure 1.26: The effect of the variance on a Gaussian policy

In the preceding figure, if the variance increases (the lower, wider curve), the policy becomes more exploratory: actions that are very far from the mean still have a non-negligible probability. When the variance is small (the higher, narrower curve), the policy is highly exploitative: only actions that are very close to the mean have a non-negligible probability.

In other words, the lower curve represents a more explorative policy than the higher one. Here, we can see the effect of the variance on the policy exploration attitude.

While learning a task, in the first training episodes, the policy needs to have a high variance in order for it to explore different actions. The variance will be reduced once the agent gains some experience and becomes more and more confident about the best actions.
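
The effect of the variance can also be observed numerically. The following sketch makes a simplifying assumption: the mean and the standard deviation are fixed numbers rather than functions of the state features, and the function names are hypothetical.

```python
import math
import random
import statistics
from typing import List


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a Gaussian with the given mean and standard deviation."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * std ** 2)
    return coefficient * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))


def sample_actions(mean: float, std: float, n: int, seed: int = 0) -> List[float]:
    """Draw n actions from a Gaussian policy with fixed mean and std."""
    rng = random.Random(seed)
    return [rng.gauss(mean, std) for _ in range(n)]


exploratory = sample_actions(mean=0.0, std=1.0, n=5_000)    # high variance
exploitative = sample_actions(mean=0.0, std=0.1, n=5_000)   # low variance

spread_exploratory = statistics.pstdev(exploratory)
spread_exploitative = statistics.pstdev(exploitative)

# Density of an action far from the mean under the two policies
far_action_density_wide = gaussian_pdf(2.0, mean=0.0, std=1.0)
far_action_density_narrow = gaussian_pdf(2.0, mean=0.0, std=0.1)
```

With a large standard deviation, the sampled actions spread widely around the mean; with a small one, they concentrate near the mean and distant actions become practically impossible.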

The Boltzmann Policy

Boltzmann parameterization is used for discrete action spaces. The resulting action distribution is a softmax function acting on the weighted state features, as stated in the following expression:

$$\pi_\theta(a \mid s) = \frac{e^{\theta_a^\top \phi(s)}}{\sum_{a'} e^{\theta_{a'}^\top \phi(s)}}$$

Figure 1.27: Expression for a Boltzmann policy

Here, $\theta_a$ is the set of parameters associated with action $a$.

The Boltzmann policy is a stochastic policy. The motivation behind this is very simple; let's sum the policy over all the actions (the denominator does not depend on the action, $a$), as follows:

$$\sum_{a} \pi_\theta(a \mid s) = \frac{\sum_{a} e^{\theta_a^\top \phi(s)}}{\sum_{a'} e^{\theta_{a'}^\top \phi(s)}} = 1$$

Figure 1.28: A Boltzmann policy over all of the actions

The Boltzmann policy becomes deterministic if we select the action with the highest probability, which is analogous to selecting the mean action in a Gaussian distribution. What the Boltzmann parameterization represents is simply a normalization of the value, $\theta_a^\top \phi(s)$, corresponding to the score of action $a$. The score is thus normalized by considering the values of all the other actions, obtaining a distribution.

In all of these parameterizations, the state features might be non-linear features depending on several parameters, for example, features produced by a neural network, radial basis function (RBF) features, or tile coding features.
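
The Boltzmann parameterization can be sketched in a few lines. This is an illustrative, dependency-free example: the parameter values and the two-dimensional feature vector are hypothetical, and the maximum score is subtracted before exponentiating, a common numerical precaution that does not change the resulting probabilities.

```python
import math
from typing import List


def boltzmann_policy(parameters: List[List[float]], features: List[float]) -> List[float]:
    """Softmax over the per-action scores (parameters of action a times features)."""
    scores = [sum(p * f for p, f in zip(theta_a, features)) for theta_a in parameters]
    max_score = max(scores)  # subtracting the maximum avoids overflow in exp
    exp_scores = [math.exp(score - max_score) for score in scores]
    normalizer = sum(exp_scores)
    return [value / normalizer for value in exp_scores]


# Hypothetical example: three actions, two state features
parameters = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
features = [2.0, 1.0]

probabilities = boltzmann_policy(parameters, features)
most_probable_action = max(range(len(probabilities)), key=lambda a: probabilities[a])
```

The scores here are 2.0, 1.0, and 1.5, so the first action receives the highest probability, but every action keeps a nonzero probability of being selected.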

Exercise 1.02: Implementing a Linear Policy

In this exercise, we will practice with the implementation of a linear policy. The goal is to write the presented parameterizations in the case of a state composed of n components. In the first case, the features can be represented by the identity function; in the second case, the features are represented by a polynomial function of order 2:

  1. Open a new Jupyter notebook and import NumPy to implement all of the requested policies:
    from typing import Callable, List
    import matplotlib
    from matplotlib import pyplot as plt
    import numpy as np
    import scipy.stats
  2. Let's now implement the linear policy. A linear policy can be efficiently represented by a dot product between the policy parameters and state features. The first step is to write the constructor:
    class LinearPolicy:
        def __init__(
            self, parameters: np.ndarray,
            features: Callable[[np.ndarray], np.ndarray]):
            """
            Linear Policy Constructor.
            Args:
                parameters (np.ndarray): policy parameters
                as np.ndarray.
                features (Callable[[np.ndarray], np.ndarray]):
                function used to extract features from the
                state representation.
            """
            self._parameters = parameters
            self._features = features

    The constructor simply sets the parameters and features attributes. The features argument is a callable that takes, as input, a NumPy array and returns another NumPy array. The input is the environment state, whereas the output is the state features.

  3. Next, we will implement the call method. The __call__ method takes as input the state, and returns the selected action according to the policy parameters. The call represents a real policy implementation. What we have to do in the linear case is to first apply the feature function and then compute the dot product between the parameters and the features. A possible implementation of the call function is as follows:
        def __call__(self, state: np.ndarray) -> np.ndarray:
            """
            Call method of the Policy.
            Args:
                state (np.ndarray): environment state.
            Returns:
                The resulting action.
            """
            # calculate state features
            state_features = self._features(state)
            # the parameters shape [0] should be the same as the
            # state features as they must be multiplied
            assert state_features.shape[0] == self._parameters.shape[0]
            # dot product between parameters and state features
            return np.dot(self._parameters.T, state_features)
  4. Let's try the defined policy with a state composed of a 5-dimensional array. Sample a random set of parameters and a random state vector. Create the policy object. The constructor needs the callable features, which, in this case, is the identity function. Call the policy to obtain the resulting action:
    # sample a random set of parameters
    parameters = np.random.rand(5, 1)
    # define the state features as identity function
    features = lambda x: x
    # define the policy
    pi: LinearPolicy = LinearPolicy(parameters, features)
    # sample a state
    state = np.random.rand(5, 1)
    # Call the policy obtaining the action
    action = pi(state)

    The output will be similar to the following:

    [[1.33244481]]

This value is the action selected by our agent, given the state and the policy parameters. In this case, the selected action is [[1.33244481]]. The meaning of the action depends on the RL task.

Of course, you will obtain different results based on the sampled parameters and sampled state. It is always possible to seed the NumPy random number generator to obtain reproducible results.


To access the source code for this specific section, please refer to You can also refer to the Gaussian and Boltzmann policies that are implemented in the same notebook.

You can also run this example online at

In this exercise, we practiced with different policies and parameterizations. These are simple policies, but they are the building blocks of more complex policies. The trick is just to substitute the state features with a neural network or any other feature extractor.
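
The second case mentioned at the start of the exercise, polynomial features of order 2, can be sketched as follows. This is a dependency-free illustration that uses plain lists instead of NumPy arrays, and it assumes that "order 2" means each state component together with its square, without cross terms or a bias term; the function names are hypothetical.

```python
from typing import Callable, List

Features = Callable[[List[float]], List[float]]


def polynomial_features(state: List[float]) -> List[float]:
    """Order-2 features: every component followed by every squared component."""
    return list(state) + [x ** 2 for x in state]


def linear_policy(parameters: List[float], features: Features, state: List[float]) -> float:
    """Apply the feature function, then take the dot product with the parameters."""
    state_features = features(state)
    assert len(state_features) == len(parameters)
    return sum(p * f for p, f in zip(parameters, state_features))


state = [1.0, 2.0]
parameters = [0.5, -1.0, 2.0, 0.25]  # one parameter per feature
action = linear_policy(parameters, polynomial_features, state)
```

The policy is still linear in its parameters, but it is no longer linear in the state, because the features include the squared components.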

Goals and Rewards

In RL, the agent's goal is to maximize the total amount of reward it receives during an episode.

This is based on the famous reward hypothesis in Sutton & Barto 1998:

"That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)."

The important thing here is that the reward should not describe how to achieve the goal; instead, it should describe the goal of the agent. The reward function is an element of the environment, but it can also be designed for a specific task. In principle, there are infinitely many reward functions for each task. Usually, reward functions that carry a lot of information help the agent to learn. Sparse reward functions (which carry little information) make learning difficult or, sometimes, impossible. Sparse reward functions are functions in which, most of the time, the reward is constant (or zero).

Sutton's hypothesis, which we explained earlier, is the basis of the RL framework. This hypothesis may be wrong; a scalar reward signal (and its maximization) is probably not enough to define complex goals. Nevertheless, the hypothesis is flexible and simple, and it can be applied to a wide range of tasks. At the time of writing, reward function design is more art than engineering; there are no formal practices regarding how to write a reward function, only best practices based on experience. Usually, a simple reward function works very well. Usually, we associate positive values with good actions and behavior and negative values with bad actions or actions that are not important at that particular moment.

In a locomotion task (for example, teaching a robot how to move), the reward may be defined as proportional to the robot's forward movement. In chess, the reward may be defined as 0 for each timestep, with +1 at the end of the game if the agent wins and -1 if the agent loses. If we want our agent to solve Rubik's Cube, the reward may be defined similarly: 0 every step and +1 if the cube is solved.
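
The reward definitions above can be written as small functions. This is a hedged sketch: the function names, the string labels for the game outcome, and the `scale` parameter are hypothetical choices made for this example, and draws are mapped to 0 as an additional assumption.

```python
from typing import Optional


def chess_reward(outcome: Optional[str]) -> float:
    """Sparse reward: 0 while the game is running, +1 for a win, -1 for a loss."""
    if outcome == "win":
        return 1.0
    if outcome == "loss":
        return -1.0
    return 0.0  # game still in progress (draws are also mapped to 0 here)


def locomotion_reward(previous_x: float, current_x: float, scale: float = 1.0) -> float:
    """Dense reward: proportional to the forward movement in this timestep."""
    return scale * (current_x - previous_x)


reward_mid_game = chess_reward(None)
reward_after_win = chess_reward("win")
reward_step_forward = locomotion_reward(previous_x=1.0, current_x=1.5)
reward_step_backward = locomotion_reward(previous_x=1.0, current_x=0.5)
```

The chess reward is sparse, since it is zero for almost every timestep, whereas the locomotion reward gives the agent informative feedback at every step.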

Sometimes, as we learned earlier, defining a scalar reward function for a task is not easy, and, nowadays, it is more art than engineering or science.

In each of these tasks, the final objective is to learn a policy, a way of selecting actions, maximizing the total rewards received by the agent. Tasks can be episodic or continuous. Episodic tasks have a finite length, that is, a finite number of timesteps (for example, T is finite). Continuous tasks can last forever or until the agent reaches its goal. In the first case, we can simply define the total reward (return) received by an agent as the sum of the individual rewards:

Figure 1.29: Expression for a total reward


Usually, we are interested in the return from a certain timestep, t. In other words, the return, G_t, quantifies the agent's performance in the long term, and it can be calculated as the sum of immediate rewards following time t until the end of the episode (timestep T):

Figure 1.30: Expression for a return from timestep t


It is straightforward to see that, with this formulation, the return for continuing tasks diverges to infinity.

In order to deal with continuing tasks, we need to introduce the notion of a discounted return. This concept formalizes, in mathematical terms, the principle that an immediate reward is (sometimes) more valuable than the same amount of reward received after many steps. This principle is widely known in economics. The discount factor, γ, quantifies the present value of future rewards. We are now ready to present a unified notation for the return in episodic and continuing tasks.

The discounted return is the cumulative, discounted sum of rewards until the end of the episode. In mathematical terms, it can be formalized as follows:

Figure 1.31: Expression for the discounted return from timestep t
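
The expression in the figure is not reproduced in this text. As a sketch, one common way to write the discounted return is shown here, assuming the indexing convention in which the reward that follows timestep t is written r_{t+1}. This convention is an assumption, and the original figure may index the rewards differently:

```latex
G_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots
    = \sum_{k=0}^{T-t-1} \gamma^{k} \, r_{t+k+1},
\qquad 0 \le \gamma \le 1
```

Under this convention, an episodic task uses a finite T (and may use a discount factor of 1), while a continuing task lets T go to infinity with a discount factor strictly below 1.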


To understand how the discount affects the return, note that the value of receiving reward r after k+1 timesteps is γ^k r, which is never larger than r, since γ is less than or equal to 1. If γ is strictly less than 1 and the rewards are bounded, the return, even if composed of an infinite sum, has a bounded value. If γ = 0, the agent is myopic since it cares only about the immediate reward, and it does not care about future rewards. A myopic agent can cause problems: the only thing it learns is to select the action yielding the highest immediate reward. A myopic chess player can, for example, capture an opponent's pawn even when doing so loses the game. Notice that, for some tasks, this is not a problem. These are tasks in which the current action does not affect future rewards and has no consequences for the agent's future. They can be solved by finding, for each state independently, the action that yields the highest immediate reward. Most of the time, however, the current action influences the future of the agent and its rewards. If the discount factor is close to 1, the agent is farsighted; it can give up an action that yields a good immediate reward in exchange for a higher reward in future steps.
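
The effect of the discount factor can be illustrated with a minimal sketch. The function name discounted_return and the reward sequence below are hypothetical and chosen only for illustration; the first reward in the list is weighted by 1, the second by gamma, and so on:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a list of future rewards."""
    total = 0.0
    discount = 1.0  # holds gamma**k, updated incrementally
    for reward in rewards:
        total += discount * reward
        discount *= gamma
    return total


# Hypothetical reward sequence: small rewards now, a large reward at the end
rewards = [1.0, 1.0, 1.0, 10.0]

myopic = discounted_return(rewards, gamma=0.0)        # only the first reward counts
farsighted = discounted_return(rewards, gamma=0.9)    # later rewards still matter
undiscounted = discounted_return(rewards, gamma=1.0)  # plain sum of rewards

# With a constant reward of 1 and gamma below 1, the return of a very long
# sequence stays below the geometric-series bound 1 / (1 - gamma)
long_run = discounted_return([1.0] * 10_000, gamma=0.9)
bound = 1.0 / (1.0 - 0.9)

print(myopic, farsighted, undiscounted, long_run, bound)
```

With a discount factor of 0, the large final reward is invisible to the agent; as the discount factor approaches 1, it dominates the return.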

It is important to understand the relationship between returns at different timesteps, from both a theoretical and an algorithmic point of view, because many RL algorithms are based on this principle:

Figure 1.32: Relationship between returns at different timesteps


By following these simple steps, we can see that the return from timestep t equals the immediate reward plus the return at the following timestep scaled by gamma. This simple relationship will be used extensively in RL algorithms.
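
This recursive relationship suggests a simple way to compute the return for every timestep of a finished episode: start from the last reward and work backward. The following is a minimal sketch; the function name returns_to_go is hypothetical, and rewards[t] is assumed to hold the reward received after the action taken at timestep t:

```python
def returns_to_go(rewards, gamma):
    """Compute the return for every timestep with a single backward pass.

    Uses: return(t) = immediate reward + gamma * return(t + 1).
    """
    returns = [0.0] * len(rewards)
    next_return = 0.0  # the return after the final step is zero
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + gamma * next_return
        returns[t] = next_return
    return returns


# Hypothetical episode: a single reward of 1 at the last step
rewards = [0.0, 0.0, 1.0]
returns = returns_to_go(rewards, gamma=0.5)
print(returns)
```

The backward pass touches each reward once, whereas recomputing the discounted sum from scratch for every timestep would repeat most of the work.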

Why Discount?

The following describes the motivations as to why many RL problems are discounted:

  • It is convenient from a mathematical perspective to have a bounded return, even in the case of continuing tasks.
  • If the task is a financial task, immediate rewards may earn more interest than delayed rewards.
  • Animal and human behavior show a preference for immediate rewards.
  • A discounted reward may also represent uncertainty about the future.
  • It is also possible to use an undiscounted return (γ = 1) if all of the episodes terminate after a finite number of steps.

This section introduced the main elements of RL, including agents, actions, environments, transition functions, and policies. In the next section, we will practice with these concepts by defining agents, environments, and measuring the performance of agents on some tasks.


Reinforcement Learning Frameworks

In the previous sections, we learned the basic theory behind RL. In principle, an agent or an environment can be implemented in any way and in any language. For RL, the primary language used in both academia and industry is Python, as it allows you to focus on the algorithms rather than on language details. Implementing an algorithm or a complex environment (for example, an autonomous driving environment) from scratch can be very difficult and error-prone. For this reason, several well-established and well-tested libraries make RL much more accessible to newcomers. In this section, we will explore the main Python RL libraries. We will present OpenAI Gym, a set of environments that is ready to use and easy to modify, and OpenAI Baselines, a set of high-quality, state-of-the-art algorithms. By the end of this chapter, you will have learned about and practiced with environments and agents.

OpenAI Gym

OpenAI Gym is a Python library that provides a set of RL environments ranging from toy environments to Atari environments and more complex environments, such as MuJoCo and Robotics environments. Besides providing this large set of tasks, OpenAI Gym also provides a unified interface for interacting with RL tasks and a set of interfaces that are useful for describing the environment's characteristics, such as the action space and the state space. An important property of Gym is that it focuses only on environments; it makes no assumptions about the type of agent you have or the computational framework you use. For ease of presentation, we will not cover the installation details in this chapter. Instead, we will focus on the main concepts and learn how to interact with these libraries.

Getting Started with Gym – CartPole

CartPole is a classical control environment provided by Gym and used by researchers as a starting point for testing algorithms. It consists of a cart that moves along the horizontal axis (1-dimensional) and a pole anchored to the cart at one endpoint:

Figure 1.33: CartPole environment representation


The agent has to learn how to move the cart to balance the pole (that is, to stop the pole from falling). The episode ends when the magnitude of the pole angle (θ) exceeds a certain threshold. The state space is represented by the position of the cart along the axis, x; the velocity along the axis, ẋ; the pole angle, θ; and the pole angular velocity, θ̇. The state space is continuous in this case, but it can also be discretized to make learning simpler.

In the following steps, we will practice with Gym and its environments.

Let's create a CartPole environment using Gym and analyze its properties in a Jupyter notebook. Please refer to the Preface for Gym installation instructions:

# Import the gym Library
import gym
# Create the environment using gym.make(env_name)
env = gym.make('CartPole-v1')
# Analyze the action space of CartPole using the property action_space
print("Action Space:", env.action_space)
# Analyze the observation space of CartPole using the property observation_space
print("Observation Space:", env.observation_space)

If you run these lines, you will get the following output:

Action Space: Discrete(2)
Observation Space: Box(4,)

Discrete(2) means that the action space of CartPole is a discrete action space composed of two actions: Go Left and Go Right. These actions are the only actions available to the agent. The action of Go Left, in this case, is represented by action 0, and the action of Go Right by action 1.

Box(4,) means that the state space (the observation space) of the environment is represented by a 4-dimensional box, a subspace of ℝ⁴. Formally, it is a Cartesian product of n intervals (here, n = 4). The state space has a lower bound and an upper bound. The bounds may also be infinite, creating an unbounded box.

To inspect the observation space better, we can use the properties of high and low:

# Analyze the bounds of the observation space
print("Lower bound of the Observation Space:", \
      env.observation_space.low)
print("Upper bound of the Observation Space:", \
      env.observation_space.high)

This will print the following:

Lower bound of the Observation Space: [-4.8000002e+00 -3.4028235e+38 
-4.1887903e-01 -3.4028235e+38]
Upper bound of the Observation Space: [4.8000002e+00 3.4028235e+38 
4.1887903e-01 3.4028235e+38]

Here, we can see that upper and lower bounds are arrays of 4 elements; one element for each state dimension. The following are some observations:

  • The lower bound of the cart position (the first state dimension) is -4.8, while the upper bound is 4.8.
  • The lower bound of the velocity (the second state dimension) is approximately -3.4 × 10^38, basically -∞; and the upper bound is approximately +3.4 × 10^38, basically +∞.
  • The lower bound of the pole angle (the third state dimension) is approximately -0.42 radians, representing an angle of -24 degrees. The upper bound is approximately 0.42 radians, representing an angle of +24 degrees.
  • The lower and upper bounds of the pole angular velocity (the fourth state dimension) are, respectively, -∞ and +∞, similar to the lower and upper bounds for the cart's velocity.
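
The printed bounds can be interpreted with a few lines of standard-library Python. This is only a sketch for checking the numbers shown in the output above: 3.4028235e+38 is the largest finite 32-bit float, which is why it stands in for infinity, and 0.41887903 radians corresponds to 24 degrees:

```python
import math
import struct

# Largest finite 32-bit float, decoded from its bit pattern 0x7F7FFFFF
float32_max = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

# Pole-angle bound printed for the observation space, in radians
angle_bound_rad = 0.41887903
angle_bound_deg = math.degrees(angle_bound_rad)

print(f"float32 max: {float32_max:.7e}")
print(f"angle bound: {angle_bound_deg:.1f} degrees")
```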

Gym Spaces

The Gym Space class represents the way Gym describes actions and state spaces. The most used spaces are the Discrete and Box spaces.

A discrete space is composed of a fixed number of elements. It can represent either a state space or an action space, and it describes the number of elements through the n attribute. Its elements range from 0 to n-1.

A Box space describes its shape through the shape attribute. It can have an n-dimensional shape that corresponds to an n-dimensional box. A Box space can also be unbounded. Each interval has one of the following forms: [a, b], (-∞, b], [a, ∞), or (-∞, ∞).

It is possible to sample from the action space to gain insight into the elements it is composed of using the space.sample() method.


When sampling from a Box space, each coordinate is sampled from a distribution that depends on the form of its interval:

- [a, b]: A uniform distribution

- [a, ∞): A shifted exponential distribution

- (-∞, b]: A shifted negative exponential distribution

- (-∞, ∞): A normal distribution
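
The per-coordinate sampling rule can be sketched with the standard library alone. The following is an illustration of the rule, not Gym's implementation; the function name sample_coordinate, the unit rate of the exponential, and the standard normal parameters are assumptions made for this example:

```python
import math
import random


def sample_coordinate(low, high, rng):
    """Sample one coordinate according to the form of its interval."""
    low_bounded = not math.isinf(low)
    high_bounded = not math.isinf(high)
    if low_bounded and high_bounded:
        return rng.uniform(low, high)       # [a, b]: uniform
    if low_bounded:
        return low + rng.expovariate(1.0)   # [a, inf): shifted exponential
    if high_bounded:
        return high - rng.expovariate(1.0)  # (-inf, b]: shifted negative exponential
    return rng.gauss(0.0, 1.0)              # (-inf, inf): normal


rng = random.Random(0)
inf = math.inf
# One hypothetical coordinate for each of the four interval forms
lows = [-4.8, 0.0, -inf, -inf]
highs = [4.8, inf, 0.0, inf]
sample = [sample_coordinate(lo, hi, rng) for lo, hi in zip(lows, highs)]
print(sample)
```

Every sampled coordinate is guaranteed to fall inside its interval, which is the property that matters when a sample is used as an action or an observation.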

Let's now demonstrate how to create simple spaces and how to sample from spaces:

# Type hinting
from typing import Tuple
import gym
# Import the spaces module
from gym import spaces
# Create a discrete space composed by N-elements (5)
n: int = 5
discrete_space = spaces.Discrete(n=n)
# Sample from the space using .sample method
print("Discrete Space Sample:", discrete_space.sample())
# Create a Box space with a shape of (4, 4)
# Upper and lower bounds are 0 and 1
box_shape: Tuple[int, int] = (4, 4)
box_space = spaces.Box(low=0, high=1, shape=box_shape)
# Sample from the space using .sample method
print("Box Space Sample:", box_space.sample())

This will print the samples from our spaces:

Discrete Space Sample: 4
Box Space Sample: [[0.09071387 0.4223234  0.09272052 0.15551752]
 [0.8507258  0.28962377 0.98583364 0.55963445]
 [0.4308358  0.8658449  0.6882108  0.9076272 ]
 [0.9877584  0.7523759  0.96407163 0.630859  ]]

Of course, the samples will change according to your seeds.

As you can see, we have sampled element 4 from our discrete space composed of 5 elements (from 0 to 4). We sampled a random 4 x 4 matrix with elements between 0 and 1, the lower and the upper bound of our space.

To obtain reproducible results, it is also possible to set the seed of a space using the seed method:

# Seed spaces to obtain reproducible samples
discrete_space.seed(0)
box_space.seed(0)
# Sample from the seeded space
print("Discrete Space (seed=0) Sample:", discrete_space.sample())
# Sample from the seeded space
print("Box Space (seed=0) Sample:", box_space.sample())

This will print the following:

Discrete Space (seed=0) Sample: 0
Box Space (seed=0) Sample: [[0.05436005 0.9653909  0.63269097 0.29001734]
 [0.10248426 0.67307633 0.39257675 0.66984606]
 [0.05983897 0.52698725 0.04029069 0.9779441 ]
 [0.46293673 0.6296479  0.9470484  0.6992778 ]]

The previous statement will always print the same sample since we set the seed to 0. Seeding an environment is very important in order to guarantee reproducible results.

Exercise 1.03: Creating a Space for Image Observations

In this exercise, we will create a space to represent an image observation. Image-based observations are essential in RL since they allow the agent to learn from pixels and require minimal feature engineering, with no need for a separate feature extraction phase. The agent can focus on what is important for its task without being limited by manually decided heuristics. We will create a space representing RGB images with dimensions equal to 256 x 256:

  1. Open a new Jupyter notebook and import the desired modules – gym and NumPy:
    import gym
    from gym import spaces
    import matplotlib.pyplot as plt
    %matplotlib inline
    import numpy as np # used for the dtype of the space
  2. We are dealing with 256 x 256 RGB images, so the space has a shape of (256, 256, 3). In addition, the images range from 0 to 255 (if we consider the uint8 images):
    # since the space is RGB images with shape 256x256, the final shape is (256, 256, 3)
    shape = (256, 256, 3)
    # If we consider uint8 images the bounds are 0-255
    low = 0
    high = 255
    # Space type: unsigned int
    dtype = np.uint8
  3. We are now ready to create the space. An image is a Box space since it has defined bounds:
    # create the space
    space = spaces.Box(low=low, high=high, shape=shape, dtype=dtype)
    # Print space representation
    print("Space", space)

    This will print the representation of our space:

    Space Box(256, 256, 3)

    The first dimension is the image width, the second dimension is the image height, and the third dimension is the number of channels.

  4. Here is a sample from the space:
    # Sample from the space
    sample = space.sample()
    print("Space Sample", sample)

    This will return the space sample; in this case, it is a huge tensor of 256 x 256 x 3 unsigned integers (between 0 and 255). The output (fewer lines are presented now) should be similar to the following:

    Space Sample [[[ 37 254 243]
      [134 179  12]
      [238  32   0]
      [100  61  73]
      [103 164 131]
      [166  31  68]]
     [[218 109 213]
      [190  22 130]
      [ 56 235 167]
  5. To visualize the returned sample, use the following code:
    plt.imshow(sample)

    The output will be as follows:

    Figure 1.34: A sample from a Box space of (256, 256) RGB


    The preceding is not very informative because it is a random image.

  6. Now, suppose we want to give our agent the opportunity to see the last n=4 frames. By adding the temporal component, we can obtain a state representation composed of 4 dimensions. The first dimension is the temporal one, the second is the width, the third is the height, and the last one is the number of channels. This is a very useful technique that allows the agent to understand its movement:
    # we want a space representing the last n=4 frames
    n_frames = 4  # number of frames
    width = 256  # image width
    height = 256  # image height
    channels = 3  # number of channels (RGB)
    shape_temporal = (n_frames, width, height, channels)
    # create a new instance of space
    space_temporal = spaces.Box(low=low, high=high, \
                                shape=shape_temporal, dtype=dtype)
    print("Space with temporal component", space_temporal)

    This will print the following:

    Space with temporal component Box(4, 256, 256, 3)

As you can see, we have successfully created a space and, on inspecting the space representation, we notice that we have another dimension: the temporal dimension.
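
During interaction with an environment, a state made of the last n frames has to be updated at every step. One way to do this is with a fixed-length buffer. The following is a minimal sketch; the FrameStack class is a hypothetical helper, and the frames are placeholder strings standing in for image arrays:

```python
from collections import deque


class FrameStack:
    """Keep the most recent n_frames observations (hypothetical helper)."""

    def __init__(self, n_frames):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_frame):
        # Repeat the first frame so the stacked state always has n_frames entries
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(n_frames=4)
state = stack.reset("frame_0")
for i in range(1, 6):
    state = stack.push(f"frame_{i}")
print(state)
```

With real images, stacking the four buffered frames along a new first axis would give an array whose shape matches the temporal space defined in the exercise.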



Image-based environments are very important in RL. They allow the agent to learn salient features for solving the task directly from raw pixels, without any preprocessing. In this exercise, we learned how to create a Gym space for image observations and how to deal with image spaces.

Rendering an Environment

In the Getting Started with Gym – CartPole section, we saw a sample from the CartPole state space. However, visualizing or understanding the CartPole state from a vector representation is not an easy task, at least for a human. Gym also allows you to visualize a given task (if possible) through the env.render() function.


The env.render() function is usually slow. Rendering an environment is done primarily to understand the behavior learned by the agent after the training or on intervals of many training steps. Usually, we train agents without rendering the environment state to improve the training speed.

If we just call the env.render() function, we will always see the same scene, that is, the environment state does not change. To see the evolution of the environment in time, we must call the env.step() function, which takes as input an action belonging to the action space and applies the action in the environment.

Rendering CartPole

The following code demonstrates how to render the CartPole environment. The action is a sample from the action space. For RL algorithms, the action will be smartly selected from the policy:

# Create the environment using gym.make(env_name)
env = gym.make("CartPole-v1")
# reset the environment (mandatory)
env.reset()
# render the environment for 100 steps
n_steps = 100
for i in range(n_steps):
    action = env.action_space.sample()
    env.step(action)
    env.render()
# close the environment correctly
env.close()

If you run this script, you will see that gym opens a window and displays the CartPole environment with random actions, as shown in the following figure:

Figure 1.35: A CartPole environment rendered in Gym (the initial state)


A Reinforcement Learning Loop with Gym

To understand the consequences of an action, and to come up with a better policy, the agent observes its new state and a reward. Implementing this loop with gym is easy. The key element is the env.step() function. This function takes an action as input. It applies the action and returns four values, which are described as follows:

  • Observation: The observation is the next environmental state. This is represented as an element belonging to the observation space of the environment.
  • Reward: The reward associated with a step is a float value that is related to the action given as input to the function.
  • Done: This return value assumes the True value when the episode is finished, and it's time to call the env.reset() function to reset the environment state.
  • Info: This is a dictionary containing debugging information; usually, it is ignored.
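
The four return values can be seen in action without installing anything by using a toy stand-in environment. The CoinFlipEnv class below is hypothetical and is not part of Gym; it only mimics the reset() and step() interface described above. Note also that more recent releases of Gym, and its successor Gymnasium, return five values from step() because the done flag is split in two, so the four-value form here assumes the classic interface used in this chapter:

```python
import random


class CoinFlipEnv:
    """Toy stand-in environment (hypothetical, not part of Gym).

    The observation is the current coin face (0 or 1). Choosing the action
    equal to the observed face earns a reward of 1. The episode ends after
    max_steps steps.
    """

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        self.steps = 0
        self.coin = self.rng.randrange(2)
        return self.coin  # first observation

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randrange(2)  # next state
        done = self.steps >= self.max_steps
        info = {"steps": self.steps}
        return self.coin, reward, done, info


env = CoinFlipEnv()
observation = env.reset()
total_reward = 0.0
done = False
while not done:
    action = observation  # the observation reveals the coin, so copy it
    observation, reward, done, info = env.step(action)
    total_reward += reward
print(total_reward, info)
```

Because the agent here uses the observation to choose its action, it collects the maximum reward at every step; a random agent would collect about half of it.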

Let's now implement the RL loop within the Gym environment.

Exercise 1.04: Implementing the Reinforcement Learning Loop with Gym

In this exercise, we will implement a basic RL loop with episodes and timesteps using the CartPole environment. You can also use other environments; nothing changes, because the main goal of Gym is to unify the interfaces of all possible environments so that agents can be as environment-agnostic as possible. This independence from the environment is a distinctive feature of RL: algorithms are usually not tailored to a single task but are task-agnostic, so they can be applied successfully to a variety of environments.

We need to create the Gym CartPole environment as before using the gym.make() function. After that, we can loop for a defined number of episodes; for each episode, we loop for a defined number of steps or until the episode is terminated (by checking the done value). For each timestep, we have to call the env.step() function by passing an action (we will pass a random action for now), and then we collect the desired information:

  1. Open a new Jupyter notebook and define the import, the environment, and the desired number of steps:
    import gym
    import matplotlib.pyplot as plt
    %matplotlib inline
    env = gym.make("CartPole-v1")
    # define 10 episodes
    n_episodes = 10
    # each episode is composed of at most 100 timesteps
    n_timesteps = 100
  2. Loop for each episode:
    # loop for the episodes
    for episode_number in range(n_episodes):
        # here we are inside an episode
  3. Reset the environment and get the first observation:
        # the reset function resets the environment and returns
        # the first environment observation
        observation = env.reset()
  4. Loop for each timestep:
        # loop for the given number of timesteps or
        # until the episode is terminated
        for timestep_number in range(n_timesteps):
  5. Render the environment, select the action (randomly by using the env.action_space.sample() method), and then take the action:
            # render the environment
            env.render()
            # select the action
            action = env.action_space.sample()
            # apply the selected action by calling env.step
            observation, reward, done, info = env.step(action)
  6. Check whether the episode has been terminated using the done variable:
            # if done, the episode is terminated and we have to
            # reset the environment
            if done:
                print(f"Episode Number: {episode_number}, "
                      f"Timesteps: {timestep_number}")
                # break from the timestep loop
                break
  7. After the episode loop, close the environment in order to release the associated memory:
    # close the environment
    env.close()

    If you run the previous code, the output should, approximately, be like this:

    Episode Number: 0, Timesteps: 34
    Episode Number: 1, Timesteps: 10
    Episode Number: 2, Timesteps: 12
    Episode Number: 3, Timesteps: 21
    Episode Number: 4, Timesteps: 16
    Episode Number: 5, Timesteps: 17
    Episode Number: 6, Timesteps: 12
    Episode Number: 7, Timesteps: 15
    Episode Number: 8, Timesteps: 16
    Episode Number: 9, Timesteps: 16

We have the episode number and the number of steps taken in that episode. We can see that the average number of timesteps for an episode is approximately 17. This means that, using the random policy, the pole falls and the episode finishes after about 17 timesteps on average.
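
The average can be checked directly from the sample output. The following short sketch uses the timestep values printed above:

```python
from statistics import mean

# Timesteps printed for the ten episodes in the sample output
timesteps = [34, 10, 12, 21, 16, 17, 12, 15, 16, 16]
average_timesteps = mean(timesteps)
print(average_timesteps)
```

The printed value is the loop index at which the episode ended, so the number of steps actually taken in each episode is one more than the value shown.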



This section does not currently have an online interactive example, and will need to be run locally.

The goal of this exercise was to understand the bare bones of an RL algorithm. The only difference in a real algorithm is that the action selection phase takes the environment state into account, rather than being random, in order to be useful.

Let's now move toward completing an activity to measure the performance of an agent.

Activity 1.01: Measuring the Performance of a Random Agent

The measurement of the performance and the design of an agent is an essential phase of every RL experiment. The goal of this activity is to practice with these two concepts by designing an agent that is able to interact with an environment using a random policy and then measure the performance.

You need to design a random agent using a Python class to modularize and keep the agent independent from the main loop. After that, you have to measure the mean and the variance of the discounted return using a batch of 100 episodes. You can use any environment you want, taking into account that the agent's actions should be compatible with the environment. You can design two different types of agents for discrete action spaces and continuous action spaces. The following steps will help you to complete the activity:

  1. Import the required libraries: abc, numpy, and gym.
  2. Define the Agent abstract class in a very simple way, defining only the pi() function that represents the policy. The input should be an environment state. The __init__ method should take as input the action space and build the distribution accordingly.
  3. Define a ContinuousAgent deriving from the Agent abstract class. The agent should check that the action space is coherent with it, and it should be a continuous action space. The agent should also initialize a probability distribution for sampling actions (you can use NumPy to define probability distributions). The continuous agent can change the distribution type according to the distributions defined by the Gym spaces.
  4. Define a DiscreteAgent deriving from the Agent abstract class. The discrete agent should, of course, initialize a uniform distribution.
  5. Implement the pi() function for both agents. This function is straightforward and should only sample from the distribution defined in the constructor and return it, ignoring the environment state. Of course, this is a simplification. You can also implement the pi() function in the Agent base class.
  6. Define the main RL loop in another file by importing the agent.
  7. Instantiate the correct agent according to the selected environment. Examples of environments are "CartPole-v1" or "MountainCarContinuous-v0".
  8. Take actions according to the pi function of the agent.
  9. Measure the performance of the agent by collecting (in a list or a NumPy array) the discounted return for each episode. Then, take the average and the variance (you can use NumPy for this). Remember to apply the discount factor (user-defined) to the immediate reward. You have to keep a cumulated discount factor by multiplying it by the discount factor at each timestep.

    The output should be similar to the following:

    Episode Number: 0, Timesteps: 27, Return: 28.0
    Episode Number: 1, Timesteps: 9, Return: 10.0
    Episode Number: 2, Timesteps: 13, Return: 14.0
    Episode Number: 3, Timesteps: 16, Return: 17.0
    Episode Number: 4, Timesteps: 31, Return: 32.0
    Episode Number: 5, Timesteps: 10, Return: 11.0
    Episode Number: 6, Timesteps: 14, Return: 15.0
    Episode Number: 7, Timesteps: 11, Return: 12.0
    Episode Number: 8, Timesteps: 10, Return: 11.0
    Episode Number: 9, Timesteps: 30, Return: 31.0
    Statistics on Return: Average: 18.1, Variance: 68.89000000000001


    The solution for this activity can be found via this link.
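
As a starting point, the structure of the activity can be sketched with the standard library only. This is not the official solution: the constructor below takes a number of actions instead of a Gym action space, the names are hypothetical, and only the discrete agent is shown. The returns are copied from the sample output above:

```python
import random
from abc import ABC, abstractmethod
from statistics import mean, pvariance


class Agent(ABC):
    """Base class: subclasses implement the policy pi(state) -> action."""

    @abstractmethod
    def pi(self, state):
        ...


class DiscreteAgent(Agent):
    """Random policy over n_actions discrete actions (uniform distribution)."""

    def __init__(self, n_actions, seed=None):
        self.n_actions = n_actions
        self.rng = random.Random(seed)

    def pi(self, state):
        # The state is ignored: the policy is purely random
        return self.rng.randrange(self.n_actions)


def episode_return(rewards, gamma):
    """Accumulate the discounted return with a running discount factor."""
    total, discount = 0.0, 1.0
    for reward in rewards:
        total += discount * reward
        discount *= gamma
    return total


agent = DiscreteAgent(n_actions=2, seed=0)
actions = [agent.pi(state=None) for _ in range(10)]

# CartPole gives a reward of 1 per step, so 28 steps with gamma = 1 return 28
check = episode_return([1.0] * 28, gamma=1.0)

# Returns copied from the sample output
returns = [28.0, 10.0, 14.0, 17.0, 32.0, 11.0, 15.0, 12.0, 11.0, 31.0]
average, variance = mean(returns), pvariance(returns)
print(actions, check, average, variance)
```

The statistics match the sample output when the population variance is used, which is also what NumPy's var function computes by default.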

OpenAI Baselines

OpenAI Baselines is a set of state-of-the-art RL algorithms. The main goal of Baselines is to make it easier to reproduce results on a set of benchmarks, to evaluate new ideas, and to compare them to existing algorithms. In this section, we will learn how to use Baselines to run an existing algorithm on an environment taken from Gym (refer to the previous section) and how to visualize the behavior learned by the agent. As for Gym, we will not cover the installation instructions; these can be found in the Preface section. The Baselines algorithms are implemented in TensorFlow, one of the most popular libraries for machine learning.

Getting Started with Baselines – DQN on CartPole

Training a Deep Q Network (DQN) on CartPole is straightforward with Baselines; we can do it with just one line of Bash.

Just use the terminal and run this command:

# Train model and save the results to cartpole_model.pkl
python -m baselines.run --alg=deepq --env=CartPole-v0 --save_path=./cartpole_model.pkl --num_timesteps=1e5

Let's understand the parameters, as follows:

  • --alg=deepq specifies the algorithm to be used to train our agent. In our case, we selected deepq, that is, DQN.
  • --env=CartPole-v0 specifies the environment to be used. We selected CartPole, but we can also select many other environments.
  • --save_path=./cartpole_model.pkl specifies where to save the trained agent.
  • --num_timesteps=1e5 is the number of training timesteps.

After having trained the agent, it is also possible to visualize the learned behavior using the following:

# Load the model saved in cartpole_model.pkl 
# and visualize the learned policy
python -m baselines.run --alg=deepq --env=CartPole-v0 --load_path=./cartpole_model.pkl --num_timesteps=0 --play

DQN is a very powerful algorithm; using it for a simple task such as CartPole is almost overkill. We can see that the agent has learned a stable policy, and the pole almost never falls. We will explore DQN in more detail in the following chapters.

In the following steps, we will train a DQN agent on the CartPole environment using Baselines:

  1. First, we import gym and baselines:
    import gym
    # Import the desired algorithm from baselines
    from baselines import deepq
  2. Define a callback to inform baselines when to stop training. The callback should return True if the reward is satisfying:
    def callback(locals, globals):
        """
        function called at every step with state of the algorithm.
        If callback returns true training stops.
        stop training if average reward exceeds 199
        time should be greater than 100 and the average of 
        last 100 returns should be >= 199
        """
        is_solved = (locals["t"] > 100 and \
                     sum(locals["episode_rewards"]\
                               [-101:-1]) / 100 >= 199)
        return is_solved
  3. Now, let's create the environment and prepare the algorithm's parameters:
    # create the environment
    env = gym.make("CartPole-v0")
    # Prepare learning parameters: network and learning rate
    # the policy is a multi-layer perceptron
    network = "mlp"
    # set learning rate of the algorithm
    learning_rate = 1e-3
  4. We can use the deepq.learn() method to start the training and solve the task:
    # launch learning on this environment using DQN
    # ignore the exploration parameter for now
    actor = deepq.learn(env, network=network, lr=learning_rate, \
                        total_timesteps=100000, buffer_size=50000, \
                        exploration_fraction=0.1, \
                        exploration_final_eps=0.02, print_freq=10, \
                        callback=callback)

After some time, depending on your hardware (it usually takes a few minutes), the learning phase terminates, and you will have the CartPole agent saved to your current working directory.
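
The stopping rule described for the callback in step 2 can be checked in isolation, without running any training. The following is a minimal sketch; the function name is_solved and its parameters are hypothetical, and it assumes that the last entry of the list of episode rewards belongs to the episode still in progress, which is why the slice skips it:

```python
def is_solved(step, episode_rewards, window=100, target=199):
    """Return True when enough steps have elapsed and the mean of the last
    `window` completed episode returns reaches `target`."""
    # [-window - 1:-1] skips the last entry, assumed to be the running episode
    recent = episode_rewards[-window - 1:-1]
    return step > window and sum(recent) / window >= target


# Too early: fewer than 100 steps have elapsed
early = is_solved(step=50, episode_rewards=[200.0] * 150)
# Enough steps, but the average return is too low
weak = is_solved(step=5000, episode_rewards=[150.0] * 150)
# Enough steps and the last 100 completed episodes all reached 200
solved = is_solved(step=5000, episode_rewards=[200.0] * 150 + [3.0])
print(early, weak, solved)
```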

We should see the baselines logs reporting the agent's performance over time.

Consider the following example:

| % time spent exploring  | 2        |
| episodes                | 770      |
| mean 100 episode reward | 145      |
| steps                   | 6.49e+04 |

The following are the observations from the preceding logs:

  • The episodes parameter reports the episode number we are referring to.
  • mean 100 episode reward is the average return obtained in the last 100 episodes.
  • steps is the number of training steps the algorithm has performed.

Now we can save our actor so that we can reuse it without retraining it:

print("Saving model to cartpole_model.pkl")
actor.save("cartpole_model.pkl")

After this call, the "cartpole_model.pkl" file contains the trained model.

Now it is possible to use the model and visualize the agent's behavior.

The actor returned by deepq.learn is actually a callable that returns the action given the current observation – it is the agent policy. We can use it by passing the current observation, and it returns the selected action:

# Visualize the policy
n_episodes = 5
n_timesteps = 1000
for episode in range(n_episodes):
    observation = env.reset()
    episode_return = 0
    for timestep in range(n_timesteps):
        # render the environment
        env.render()
        # select the action according to the actor
        action = actor(observation[None])[0]
        # call env.step function
        observation, reward, done, _ = env.step(action)
        # since the reward is undiscounted, we can simply add
        # the reward to the cumulated return
        episode_return += reward
        if done:
            break
    # here an episode is terminated, print the return
    # and the number of steps
    print(f"Episode return {episode_return}, "
          f"Number of steps: {timestep}")
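The indexing in actor(observation[None])[0] can be illustrated with a stand-in policy. In this sketch, dummy_actor is a hypothetical replacement for the trained actor (it is not part of Baselines); it only shows that observation[None] adds a leading batch axis and that [0] unwraps the single action in the returned batch:

```python
import numpy as np


def dummy_actor(batch_of_observations):
    """Hypothetical stand-in for the trained actor: one action per batch row."""
    # Push the cart toward the side the pole leans to (index 2 is the pole angle).
    return (batch_of_observations[:, 2] > 0).astype(int)


# One CartPole-like observation: position, velocity, pole angle, angular velocity.
observation = np.array([0.01, -0.02, 0.03, 0.04])
batch = observation[None]      # adds a leading batch axis: shape (1, 4)
actions = dummy_actor(batch)   # shape (1,): one action for the single row
action = actions[0]            # unwrap the batch to get a scalar action
print(batch.shape, actions.shape, action)
```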

If you run the preceding code, you should see the agent's performance on the CartPole task.

You should get, as output, the return for each episode; it should be something similar to the following:

Episode return 200.0, Number of steps: 199
Episode return 200.0, Number of steps: 199
Episode return 200.0, Number of steps: 199
Episode return 200.0, Number of steps: 199
Episode return 200.0, Number of steps: 199 

This means our agent always reaches the maximum possible return for CartPole (200.0) and keeps the pole balanced for the maximum number of steps (the step counter starts at zero, so it stops at 199).

We can compare the return obtained by the trained DQN agent with the return obtained by a random agent (Activity 1.01, Measuring the Performance of a Random Agent). The random agent yields an average return of about 20.0, while DQN obtains the maximum return possible for CartPole, which is 200.0.

In this section, we presented OpenAI Gym and OpenAI Baselines, the two main frameworks for RL research and experiments. There are many other frameworks for RL, each with its pros and cons. Gym is particularly well suited to experimentation because it offers a unified interface to the RL loop, while OpenAI Baselines is very useful for understanding how to implement sophisticated state-of-the-art RL algorithms and how to compare new algorithms with existing ones.

In the following section, we will explore some interesting RL applications in order to better understand the possibilities offered by the framework as well as its flexibility.


Applications of Reinforcement Learning

RL has exciting and useful applications in many different contexts. Recently, the use of deep neural networks has considerably increased the number of possible applications.

When used in a deep learning context, RL can also be referred to as deep RL.

The applications vary from games and video games to real-world applications, such as robotics and autonomous driving. In each of these applications, RL is a game-changer, allowing you to solve tasks that are considered to be almost impossible (or, at least, very difficult) without these techniques.

In this section, we will present some RL applications, describe the challenges of each application, and begin to understand why RL is preferred over other methods, along with its advantages and its drawbacks.


Games

Nowadays, RL is widely used in video games and board games.

Games are used to benchmark RL algorithms because, usually, they are very complex to solve yet easy to implement and to evaluate. Games also represent a simulated reality in which the agent can freely move and behave without affecting the real environment:

Figure 1.36: Breakout – one of the most famous Atari games


The preceding screenshot has been sourced from the official documentation of OpenAI Gym. Please refer to the following link for more examples:

Despite appearing to be secondary or relatively limited-use applications, games represent a useful benchmark for RL and, in general, artificial intelligence algorithms. Very often, artificial intelligence algorithms are tested on games due to the significant challenges that arise in these scenarios.

The two main characteristics required to play games are planning and real-time control.

An algorithm that is not able to plan won't be able to win strategic games. Having a long-term plan is fundamental even in the early stages of a game. Planning is also fundamental in real-world applications, in which actions may have long-term consequences.

Real-time control is another fundamental challenge that requires an algorithm to be able to respond within a small timeframe. This challenge is similar to one an algorithm has to face when applied to real-world cases such as autonomous driving, robot control, and many others. In these cases, the algorithm can't evaluate all the possible actions or all the possible consequences of these actions; therefore, the algorithm should learn an efficient (and maybe compressed) state representation and should understand the consequences of its actions without simulating all of the possible scenarios.

Recently, RL has been able to exceed human performance in board games such as Go and in video games such as Dota 2 and StarCraft, thanks to work done by DeepMind and OpenAI.


Go

Go is a very complex, highly strategic board game in which two players compete against each other. The aim is to use the game pieces, also called stones, to surround more territory than the opponent. At each turn, a player can place a stone on a vacant intersection of the board. At the end of the game, when neither player can place a stone, the player surrounding more territory wins.

Go has been studied for many years to understand the strategies and moves necessary to lead a player to victory. Until recently, no algorithm succeeded in producing strong players, not even algorithms that work very well for similar games such as chess. This difficulty is due to Go's huge search space, the variety of possible moves, and the average length (in terms of moves) of Go games, which is longer than the average length of chess games. RL, and in particular AlphaGo by DeepMind, recently succeeded in beating professional human players on a standard-size board. AlphaGo is actually a mix of RL, supervised learning, and tree search algorithms trained on an extensive set of games from both human and artificial players. AlphaGo marked a real milestone in the history of artificial intelligence, made possible mainly by advances in RL algorithms and their improved efficiency.

The successor of AlphaGo is AlphaGo Zero. AlphaGo Zero has been trained fully in a self-play fashion, learning from itself with no human game data (Zero comes from this characteristic). It surpassed every earlier version of AlphaGo, and its generalization, AlphaZero, also reached superhuman strength at chess:

Figure 1.37: The Go board

Both AlphaGo and AlphaGo Zero used a deep Convolutional Neural Network (CNN) to learn a suitable game representation starting from the "raw" board. This peculiarity shows that a deep CNN can also extract features starting from a sparse representation such as the Go board. One of the main strengths of RL is that it can use, in a transparent way, machine learning models that are widely studied in other fields or problems.

Deep convolutional networks are usually used for classification or segmentation problems that, at first glance, might seem very different from RL problems. Actually, the way CNNs are used in RL is very similar to a classification or a regression problem. The CNN of AlphaGo Zero, for example, takes the raw board representation and outputs the probability of each possible action together with a value estimating the expected outcome from the current position. It can be seen as a classification and regression problem at the same time. The difference is that the labels, or actions in the case of RL, are not given in the training set; rather, it is the algorithm itself that has to discover the real labels through interaction. AlphaGo, the predecessor of AlphaGo Zero, used two different networks: one for action probabilities and another for value estimates. This technique is called actor-critic. The network tasked with predicting actions is called the actor, and the network that evaluates them is called the critic.
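The idea of a single network producing both action probabilities and a value can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: it uses small dense layers with random, untrained weights instead of a deep CNN, a 9x9 board, and hypothetical names throughout. It is not the architecture used by AlphaGo Zero:

```python
import numpy as np

rng = np.random.default_rng(0)
BOARD_SIZE = 9                              # small board, for illustration only
N_ACTIONS = BOARD_SIZE * BOARD_SIZE + 1     # one move per intersection, plus "pass"
HIDDEN = 32

# Randomly initialized (untrained) weights: a shared trunk and two heads.
w_trunk = rng.normal(scale=0.1, size=(BOARD_SIZE * BOARD_SIZE, HIDDEN))
w_policy = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))
w_value = rng.normal(scale=0.1, size=(HIDDEN, 1))


def policy_value(board):
    """Map a raw board to (action probabilities, scalar value)."""
    features = np.tanh(board.reshape(-1) @ w_trunk)   # shared representation
    logits = features @ w_policy
    exp = np.exp(logits - logits.max())               # numerically stable softmax
    probs = exp / exp.sum()                           # "classification" head
    value = np.tanh(features @ w_value)[0]            # "regression" head in [-1, 1]
    return probs, value


# Encode the board as -1 (opponent stone), 0 (empty), 1 (own stone).
board = rng.integers(-1, 2, size=(BOARD_SIZE, BOARD_SIZE))
probs, value = policy_value(board)
print(probs.shape, float(probs.sum()), float(value))
```

Splitting the two heads into separate networks, each with its own trunk, gives the two-network arrangement described for AlphaGo.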

Dota 2

Dota 2 is a complex, real-time strategy game in which there are two teams of five players competing, with each player controlling a "hero." The characteristics of Dota, from an RL perspective, are as follows:

  • Long Time Horizon: A Dota game can have around 20,000 moves and can last for 45 minutes. As a reference, a chess game usually ends before 40 moves and a Go game usually ends before 150 moves.
  • Partially Observed State: In Dota, agents can only see a small portion of the full map, that is, only the portion around them. A strong player should make predictions about the position of the enemies and their actions. As a reference, Go and Chess are fully observable games where agents can see the whole situation and the actions taken by the opponents.
  • High-Dimensional and Continuous Action Space: Dota has a vast number of actions available to each player at each step. Researchers have discretized the possible actions into around 170,000 actions, with an average of 1,000 possible actions at each step. In comparison, the average number of actions in chess is 35, and in Go, it is 250. With a huge action space, learning becomes very difficult.
  • High-Dimensional and Continuous Observation Space: While Chess and Go have a discretized observation space, Dota has a continuous state space with around 20,000 dimensions. The state space, as we will learn later in the book, includes all of the information available to players that must be taken into consideration when selecting an action. In a video game, the state space is represented by the characteristics and position of the enemies, the state of the current player, including its ability, its equipment, and its health status, and other domain-specific features.
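To get a feel for these figures, the following sketch combines the average number of actions per step with the typical game length quoted above. It estimates the order of magnitude of the number of possible action sequences as (actions per step) raised to the power of (number of moves). This is a crude upper-bound style estimate, not a precise count:

```python
import math

# (average actions per step, typical game length in moves), from the list above
games = {
    "chess": (35, 40),
    "Go": (250, 150),
    "Dota 2": (1000, 20000),
}


def log10_sequences(branching, horizon):
    """log10 of branching**horizon, a rough count of action sequences."""
    return horizon * math.log10(branching)


magnitudes = {name: log10_sequences(b, h) for name, (b, h) in games.items()}
for name, exponent in magnitudes.items():
    print(f"{name}: about 10^{exponent:.0f} action sequences")
```

Working in logarithms avoids building astronomically large integers and makes the gap between the three games easy to compare.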

OpenAI Five, the RL system able to exceed human performance at Dota 2, is composed of five neural networks that collaborate with each other. It learns to play through self-play, playing the equivalent of 180 years of games per day. The algorithm used for training the five neural networks is called Proximal Policy Optimization (PPO), one of the state-of-the-art RL algorithms at the time of writing.


To read more on OpenAI Five, refer to the following link:


StarCraft

StarCraft has characteristics that make it very similar to Dota, including a huge number of moves per game, imperfect information available to players, and high-dimensional state and action spaces. AlphaStar, the player developed by DeepMind, is the first artificial intelligence agent able to reach the top league without any game restrictions. AlphaStar uses machine learning techniques such as neural networks, self-play through RL, multi-agent learning methods, and imitation learning to learn from human players in a supervised way.


For further reading on AlphaStar, refer to the following paper:

Robot Control

Robots are becoming ubiquitous and are widely used in various industries because of their ability to perform repetitive tasks in a precise and efficient way. RL can be beneficial for robotics applications by simplifying the development of complex behaviors. At the same time, robotics applications provide benchmarks and real-world validation for RL algorithms. Researchers test their algorithms on robotic tasks such as locomotion (learning to move) or grasping (learning how to grasp an object). Robotics offers unique challenges, such as the curse of dimensionality, the effective usage of samples (also called sample efficiency), the possibility of transferring knowledge from similar or simulated tasks, and the need for safety:

Figure 1.38: A robotic task from the Gym robotics suite


The preceding diagram has been sourced from the official documentation for OpenAI Gym:

Please refer to the link for more examples of robot control.

The curse of dimensionality is a challenge that can also be found in supervised learning applications. Still, in these cases, it is softened by restricting the space of possible solutions to a limited class of functions or by injecting prior knowledge elements in the models through architectural decisions. Robots usually have many degrees of freedom, making the space of possible states and possible actions very large.

Robots interact with the physical environment by definition. The interaction of a real robot with an environment is usually time-consuming, and it can be dangerous. RL algorithms usually require millions of samples (or episodes) in order to become effective. Sample efficiency is a problem in this field, as the required time may be impractical. Using the collected samples in a smart way is the key to successful RL-based robotics applications. A technique that can be used in these cases is the so-called sim2real, in which an initial learning phase takes place in a simulated environment that is usually safer and faster than the real environment. After this phase, the learned behavior is transferred to the real robot in the real environment. This technique requires either a simulated environment that closely matches the real one or an algorithm that generalizes well.

Autonomous Driving

Autonomous driving is another exciting application of RL. The main challenge this task presents is the lack of precise specifications. In autonomous driving, it is challenging to formalize what it means to drive well, whether steering in a given situation is good or bad, or whether the driver should accelerate or brake. As with robotic applications, autonomous driving can also be hazardous. Testing an RL algorithm, or, in general, a machine learning algorithm, on a driving task is very problematic and raises many concerns.

Aside from these concerns, the autonomous driving scenario fits very well in the RL framework. As we will explore later in the book, we can think of the driver as the decision-maker. At each step, they receive an observation. The observation includes the road's state, the current velocity, the acceleration, and all of the car's characteristics. The driver, based on the current state, should decide what to do with the car's controls: steering, brakes, and acceleration. Designing a rule-based system that is able to drive in real situations is complicated, due to the practically infinite number of different situations to confront. For this reason, a learning-based system would be far more efficient and effective in tasks such as this.
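The mapping from observation to action described here can be sketched as follows. All of the names, fields, and thresholds are hypothetical and chosen only for illustration; the hand-written policy hints at why rule-based driving scales poorly, since every new situation needs another rule:

```python
from dataclasses import dataclass


@dataclass
class DrivingObservation:
    """Hypothetical, heavily simplified observation."""
    speed_mps: float               # current speed, meters per second
    distance_to_obstacle_m: float  # free distance ahead, meters
    lane_offset_m: float           # signed offset from the lane center, meters


@dataclass
class DrivingAction:
    steering: float  # -1 (full left) to 1 (full right)
    throttle: float  # 0 to 1
    brake: float     # 0 to 1


def rule_based_policy(obs: DrivingObservation) -> DrivingAction:
    """A hand-written decision-maker with two rules."""
    # Rule 1: brake hard if the obstacle is less than two seconds away.
    if obs.distance_to_obstacle_m < 2.0 * obs.speed_mps:
        return DrivingAction(steering=0.0, throttle=0.0, brake=1.0)
    # Rule 2: otherwise steer back toward the lane center and keep moving.
    steering = max(-1.0, min(1.0, -0.5 * obs.lane_offset_m))
    return DrivingAction(steering=steering, throttle=0.3, brake=0.0)


close_call = rule_based_policy(DrivingObservation(20.0, 30.0, 0.4))
open_road = rule_based_policy(DrivingObservation(20.0, 100.0, 0.4))
print(close_call)
print(open_road)
```

In an RL setting, the body of the policy function would be learned from interaction rather than written by hand, while the observation and action interfaces would stay the same.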


There are many simulated environments available for developing efficient algorithms in the context of autonomous driving, listed as follows:

Voyage Deepdrive:

AWS DeepRacer:

In this section, we analyzed some interesting RL applications, their main challenges, and the main techniques used by researchers. Games, robotics, and autonomous driving are just some examples of real-world RL applications, but there are many others. In the remainder of this book, we will dive deep into RL; we will understand its components and the techniques presented in this chapter.



Summary

RL is one of the fundamental paradigms under the umbrella of machine learning. The principles of RL are very general and interdisciplinary, and they are not bound to a specific application.

RL considers the interaction of an agent with an external environment, taking inspiration from the human learning process. RL explicitly targets the need to explore efficiently and the exploration-exploitation trade-off appearing in almost all human problems; this is a peculiarity that distinguishes this discipline from others.

We started this chapter with a high-level description of RL, showing some interesting applications. We then introduced the main concepts of RL, describing what an agent is, what an environment is, and how an agent interacts with its environment. Finally, we used Gym and Baselines, showing how these libraries simplify RL experiments.

In the next chapter, we will learn more about the theory behind RL, starting with Markov chains and arriving at MDPs. We will present the two functions at the core of almost all RL algorithms, namely the state-value function, which evaluates the goodness of states, and the action-value function, which evaluates the quality of the state-action pair.

About the Authors

  • Alessandro Palmas

    Alessandro Palmas is an aerospace engineer with more than 7 years of proven expertise in software development for advanced scientific applications and complex software systems. His main ML focus is on computer vision, 3D models, volumetric networks, and deep reinforcement learning. He also founded innovative initiatives, his last being Artificial Twin, which provides advanced technologies for machine learning, physical modeling, and computational geometry applications. The two key areas on which Artificial Twin's current deep RL work is focused are video game entertainment and guidance, navigation & control systems.

  • Emanuele Ghelfi

    Emanuele Ghelfi is a computer science and machine learning engineer. He received an M.Sc. degree in computer science and engineering at Politecnico di Milano in December 2018. In his thesis, he proposed a new RL algorithm for an MDP extension. The paper from the thesis got accepted at ICML 2019.

  • Dr. Alexandra Galina Petre

    Dr. Alexandra Galina Petre is a machine learning and data science expert, currently leading and teaching various engineering modules in Coventry, United Kingdom. Her leadership and management experience is linked to her work in quality management for the Airbus A380 and her IET membership. She received her Ph.D. in user feedback-based reinforcement learning for vehicle comfort control with a focus on revolutionary heating ventilation and air conditioning SARSA-based control systems that can learn from the driver’s preferential changes to the UI. Her research is focusing on how thermal comfort depends on the occupant’s inclination to manual control as outlined in the SAE paper published in 2019, and the development of a novel Java-based user model (UBL) integrated within a car cabin environment. She is working on deep RL implementations in Python and R-based statistical developments within various automation and control projects.

  • Mayur Kulkarni

    Mayur Kulkarni works in the Machine Learning research team at Microsoft and has previously been at IIT Bombay, and IIM Lucknow. He has also been an instructor for the postgraduate programs in Artificial Intelligence and Machine Learning at UpGrad and IIIT Bangalore, covering topics in Deep Reinforcement Learning. He is one of the contributors to DVC, torch, and scikit-learn, which are some of the most popular open-source machine learning libraries in Python.

  • Anand N.S.

    Anand N.S. has more than two decades of technology experience, with a strong hands-on track record of applying artificial intelligence, machine learning, and data science to create measurable business outcomes. He has been granted several US patents in the areas of data science, machine learning, and artificial intelligence. Anand has a B.Tech in Electrical Engineering from IIT Madras and an MBA with a Gold Medal from IIM Kozhikode.

  • Quan Nguyen

    Quan Nguyen is a programmer with a special interest in scientific computing, data analysis, and artificial intelligence. Before publishing his first book with Packt, he was a primary contributor to the book Python for Scientists and Engineers and various open-source projects on GitHub. He is also a writer for the Python Software Foundation and Oracle's AI and Data Science blog.

  • Aritra Sen

    Aritra Sen currently works as a data scientist at Ericsson. His current role includes building and deploying large-scale machine learning solutions for the telecom industry. He has around 10 years of experience in data science and business intelligence. He previously worked at Cognizant, KPMG, IBM, and TCS. Aritra also has a keen interest in blogging, and he regularly writes about machine learning, deep learning, and related topics. He has also filed a patent related to the telecom industry.

  • Anthony So

    Anthony So is an outstanding leader with more than 13 years of experience. He is recognized for his analytical skills and data-driven approach for solving complex business problems and driving performance improvements. He is also a successful coach and mentor with capabilities in statistical analysis and expertise in machine learning with Python.

  • Saikat Basak

    Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.
