1. Introduction to Reinforcement Learning
Overview
This chapter introduces the Reinforcement Learning (RL) framework, one of the most exciting fields of machine learning and artificial intelligence. You will learn about the characteristics of RL and some of its advanced applications to see what can be achieved within this framework, and you will learn to differentiate between RL and other learning approaches. You will also study the main concepts of this discipline, both from a theoretical point of view and from a practical one, using Python and other useful libraries.
By the end of the chapter, you will understand what RL is and know how to use the Gym toolkit and Baselines, two popular libraries in this field, to interact with an environment and implement a simple learning loop.
Introduction
Learning and adapting to new circumstances is a crucial process for humans and, in general, for all animals. Usually, learning is understood as a process of trial and error through which we improve our performance at particular tasks. Our life is a continuous learning process: we start from simple goals (for example, walking), and we end up pursuing difficult and complex tasks (for example, playing a sport). As humans, we are always driven by our reward mechanism, which rewards good behaviors and punishes bad ones.
Reinforcement Learning (RL), inspired by the human learning process, is a subfield of machine learning and deals with learning from interaction. With the term "interaction," we mean the process of trial and error through which we, as humans, understand the consequences of our actions and build up our own experiences.
RL, in particular, considers sequential decision-making problems. These are problems in which an agent has to take a sequence of decisions, that is, actions, to maximize a certain performance measure.
RL considers tasks to be Markov Decision Processes (MDPs), which are problems arising in many real-world scenarios. In this setting, the decision-maker, referred to as the agent, has to make decisions accounting for environmental uncertainty and experience. Agents are goal-directed; they only need a notion of a goal, such as a numerical signal to be maximized. Unlike supervised learning, in RL, there is no need to provide good examples; it is the agent that learns how to map situations to actions. The mapping from situations (states) to actions is called a "policy" in the literature, and it represents the agent's behavior or strategy. Solving an MDP means finding the agent's policy that maximizes the desired outcome (that is, the total reward). We will study MDPs in more detail in future chapters.
RL has been successfully applied to various kinds of problems and domains, showing exciting results. This chapter is an introduction to RL. It aims to explain some applications and describe concepts both from an intuitive perspective and from a mathematical point of view. Both of these aspects are very important when learning new disciplines. Without intuitive understanding, it is impossible to make sense of formulas and algorithms; without mathematical background, it is tough to implement existing or new algorithms.
In this chapter, we will first compare the three main machine learning paradigms, namely supervised learning, RL, and unsupervised learning. We will discuss their differences and similarities and define some example problems.
Second, we will move on to a section that contains the theory of RL and its notations. We will learn about concepts such as what an agent is, what an environment is, and how to parameterize different policies. This section represents the fundamentals of this discipline.
Third, we will begin using two RL frameworks, namely Gym and Baselines. We will learn that interacting with a Gym environment is extremely simple, as is learning a task using Baselines algorithms.
Finally, we will explore some RL applications to motivate you to study this discipline, showing various techniques that can be used to tackle real-world problems. RL is not confined to the academic world; it is also crucial from an industrial point of view, allowing you to solve problems that are almost impossible to solve using other techniques.
Learning Paradigms
In this section, we will discuss the similarities and differences between the three main learning paradigms under the umbrella of machine learning. We will analyze some representative problems in order to understand the characteristics of these frameworks better.
Introduction to Learning Paradigms
A learning paradigm consists of a problem formulation together with a family of solution methods. Usually, learning paradigms deal with data and rephrase the problem in such a way that it can be solved by finding parameters that maximize an objective function. In this setting, the problem can be tackled using mathematical and optimization tools, allowing a formal study. The term "learning" is often used to represent a dynamic process of adapting an algorithm's parameters in such a way as to optimize its performance (that is, to learn) on a given task. Tom Mitchell defined learning in a precise way, as follows:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Let's rephrase the preceding definition more intuitively. To decide whether a program is learning, we need to set a task, that is, the goal of the program. The task can be anything we want the program to do, for example, playing a game of chess, driving autonomously, or classifying images. The task should be accompanied by a performance measure, that is, a function that returns how well the program is performing on that task. For the game of chess, a performance function can simply be represented by the following:

Figure 1.1: A performance function for a game of chess
In this context, the experience is the amount of data collected by the program at a specific moment. For chess, the experience is represented by the set of games played by the program.
The same input presented at the beginning of the learning phase or the end of the learning phase can result in different responses (that is, outputs) from the algorithm; the differences are caused by the algorithm's parameters being updated during the process.
In the following table, we can see some examples of the experience, task, and performance tuples to better understand their concrete instantiations:

Figure 1.2: Table of example instantiations of experience, task, and performance
It is possible to classify the learning algorithms based on the input they have and on the feedback they receive. In the following section, we will look at the three main learning paradigms in the context of machine learning based on this classification.
Supervised versus Unsupervised versus RL
The three main learning paradigms are supervised learning, unsupervised learning, and RL. The following figure represents the general schema of each of these learning paradigms:

Figure 1.3: Representation of learning paradigms
From the preceding figure, we can derive the following information:
- Supervised learning minimizes the error of the output of the model with respect to a target specified in the training set.
- RL maximizes the reward signal of the actions.
- Unsupervised learning has no target and no reward; it tries to learn a data representation that can be useful.
Let's go more in-depth and elaborate on these concepts further, particularly from a mathematical perspective.
Supervised learning deals with learning a function by mapping an input to an output when the correspondences between the input and output (sample, label) are given by an external teacher (supervisor) and are contained in a training set. The objective of supervised learning is to generalize to unseen samples that are not included in the dataset, resulting in a system (for example, a function) that is able to respond correctly in new situations. Here, the correspondences between the sample and label are usually known (for example, in the training set) and given to the system. Examples of supervised learning tasks include regression and classification problems. In a regression task, the learner has to find a function, f, of the input, x, producing a real output, y (a scalar or, more generally, a vector). In mathematical notation, we have to find f such that:

Figure 1.4: Regression
Here, y is known for the examples in the training set. In a classification task, the function to be learned is a discrete mapping; y belongs to a finite and discrete set. Formalizing the problem, we search for a discrete-valued function, f, such that:

Figure 1.5: Classification
Here, the set, C, represents the set of possible classes or categories.
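To make the supervised setting concrete, the following minimal sketch (not part of the original example; it uses a purely synthetic dataset) fits a regression line from labeled (x, y) pairs and then derives a trivial classifier by thresholding its output:
import numpy as np

# synthetic training set: inputs x and noisy targets y = 2x + 1 + noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

# regression: find a function f (here, a line) mapping x to y
slope, intercept = np.polyfit(x, y, deg=1)
print(f"learned f(x) ~= {slope:.2f} * x + {intercept:.2f}")

# classification: the target is a discrete label instead of a real value
labels = (y > 1.0).astype(int)  # two classes: 0 and 1
# a trivial classifier: threshold the learned regression output
predictions = (slope * x + intercept > 1.0).astype(int)
print("training accuracy:", np.mean(predictions == labels))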
Unsupervised learning deals with learning patterns in the data when the target label is not present or is unknown. The objective of unsupervised learning is to find a new, usually smaller, representation of data. Examples of unsupervised learning algorithms include clustering and Principal Component Analysis (PCA).
In a clustering task, the learner should split the dataset into clusters (a group of elements) according to some similarity measure. At first glance, clustering may seem very similar to classification; however, as an unsupervised learning task, the labels, or classes, are not given to the algorithm inside the training set. Indeed, it is the algorithm itself that should make sense of its inputs, by learning a representation of the input space in such a way that similar samples are close to each other.
For example, in the following figure, we have the original data on the left; on the right, we have the possible output of a clustering algorithm. Different colors denote different clusters:

Figure 1.6: An example of a clustering application
In the preceding example, the input space is composed of two dimensions, that is, x ∈ ℝ², and the algorithm found three clusters, or three groups of similar elements.
PCA is an unsupervised algorithm used for dimensionality reduction and feature extraction. PCA tries to make sense of data by searching for a representation that contains most of the information from the given data.
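As an illustrative sketch of these two unsupervised algorithms (assuming scikit-learn is available; the data here is synthetic), the following lines run k-means clustering and PCA without ever providing labels:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# synthetic, unlabeled 2D data: three blobs around different centers
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# clustering: group similar points without any labels
cluster_ids = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print("cluster assignments of the first 10 points:", cluster_ids[:10])

# PCA: find a lower-dimensional representation of the data
X_reduced = PCA(n_components=1).fit_transform(X)
print("shape after PCA:", X_reduced.shape)  # (150, 1)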
RL is different from both supervised and unsupervised learning. RL deals with learning control actions in a sequential decision-making problem. The sequential structure of the problem makes RL challenging and different from the two other paradigms. Moreover, in supervised and unsupervised learning, the dataset is fixed. In RL, the dataset is continuously changing, and dataset creation is itself the agent's task. In RL, different from supervised learning, no teacher provides the correct value for a given sample or the right action for a given situation. RL is based on a different form of feedback, which is the environment's feedback evaluating the behavior of the agent. It is precisely the presence of feedback that also makes RL different from unsupervised learning.
We will explore these concepts in more detail in future sections:

Figure 1.7: Machine learning paradigms and their relationships
RL and supervised learning can also be combined. A common technique (used, for example, by AlphaGo) is called imitation learning (or behavioral cloning). Instead of learning a task from scratch, we teach the agent, in a supervised way, how to behave (that is, which action to take) in a given situation. In this context, we have an expert (or multiple experts) demonstrating the desired behavior to the agent. In this way, the agent can start building its internal representation and its initial knowledge. Its actions won't be random at all when the RL part begins, and its behavior will be more focused on the actions shown by the expert.
Let's now look at a few scenarios that will help us to classify the problems in a better manner.
Classifying Common Problems into Learning Scenarios
In this section, we will understand how it is possible to frame some common real-world problems into a learning framework by defining the required elements.
Predicting Whether an Image Contains a Dog or a Cat
Predicting the content of an image is a standard classification example; therefore, it lies under the umbrella of supervised learning. Here, we are given a picture, and the algorithm should decide whether the image contains a dog or a cat. The input is the image, and the associated label can be 0 for cats and 1 for dogs.
For a human, this is a straightforward task, as we have an internal representation of dogs and cats (as well as an internal representation of the world), and we have been trained extensively throughout our lives to recognize them. Despite this, writing an algorithm that is able to identify whether an image contains a dog or a cat is difficult without machine learning techniques. At the same time, it is easy for a human to label such images and, therefore, to create a simple dataset of images of cats and dogs.
Why Not Unsupervised Learning?
Unsupervised learning is not suited to this type of task, as we have a defined output that we need to obtain from an input. Admittedly, supervised learning methods also build an internal representation of the input data in which similarities are exploited; however, this representation is only implicit, and it is not the output of the algorithm, as is the case in unsupervised learning.
Why Not RL?
RL, by definition, considers sequential decision-making problems. Predicting the content of an image is not a sequential problem, but instead a one-shot task.
Detecting and Classifying All Dogs and Cats in an Image
Detection and classification are two examples of supervised learning problems. However, this task is more complicated than the previous one. The detection part can be seen as a regression and a classification problem at the same time. The input is still the image we want to analyze, and the output is the set of coordinates of the bounding boxes for each dog or cat in the picture. Associated with each bounding box, we have a label that classifies the content of the region of interest as a dog or a cat:

Figure 1.8: Cat and dog detection and classification
Why Not Unsupervised Learning?
As in the previous example, here, we have a determined output given an input (an image). We do not want to extract unknown patterns in the data.
Why Not RL?
Detection and classification are not tasks that are suited to the RL framework. We do not have a set of actions the agent should take to solve a problem. Also, in this case, the sequential structure is absent.
Playing Chess
Playing chess can be seen as an RL problem. The program can perceive the current state of the board (for example, the positions and types of the pieces) and, based on that, it should decide which action to take. Here, the number of possible actions is vast. Selecting an action means understanding and anticipating the consequences of the move in order to defeat the opponent:

Figure 1.9: Chess as an RL problem
Why Not Supervised?
We could frame playing chess as a supervised learning problem, but we would need a dataset of games, and we would have to incorporate the sequential structure of the game into the supervised learning problem. In RL, there is no need for a pre-existing dataset; it is the algorithm itself that builds up a dataset through interaction and, possibly, self-play.
Why Not Unsupervised?
Unsupervised learning does not fit in this problem as we are not dealing with learning a representation of the data; we have a defined objective, which is winning the game.
In this section, we compared the three main learning paradigms. We saw the kind of data they have at their disposal, the type of interaction each algorithm has with the external world, and we analyzed some particular problems to understand which learning paradigm is best suited.
When facing a real-world problem, we always have to remember the distinction between these techniques, selecting the best one based on our goals, our data, and on the problem structure.
Fundamentals of Reinforcement Learning
In RL, the main goal is to learn from interaction. We want agents to learn a behavior, that is, a way of selecting actions in given situations, to achieve some goal. The main difference from classical programming or planning is that we do not want to explicitly code the planning software ourselves, as this would require a great effort; it can be very inefficient and even impossible. The RL discipline was born precisely for this reason.
RL agents start (usually) with no idea of what to do. They typically do not know the goal, they do not know the game's rules, and they do not know the dynamics of the environment or how their actions influence the state.
There are three main components of RL: perception, actions, and goals.
Agents should be able to perceive the current environment state to deal with a task. This perception, also called observation, might be different from the actual environment state, can be subject to noise, or can be partial.
For example, think of a robot moving in an unknown environment. For robotic applications, usually, the robot perceives the environment using cameras. Such a perception does not represent the environment state completely; it can be subject to occlusions, poor lighting, or adverse conditions. The system should be able to deal with this incomplete representation and learn a way of moving in the environment.
The other main component of an agent is the ability to act; the agent should be able to take actions that affect the environment state or the agent's state.
Agents should also have a goal defined through the environment state. Goals are described using high-level concepts such as winning a game, moving in an environment, or driving correctly.
One of the challenges of RL, a challenge that does not arise in other types of learning, is the exploration-exploitation trade-off. In order to improve, the agent has to exploit its knowledge; it should prefer actions that have proven useful in the past. However, to discover better actions, the agent should continue exploring, trying actions it has never taken before. Moreover, to estimate the effect of an action reliably, an agent has to perform each action many times. The critical thing to notice here is that neither exploration nor exploitation can be pursued exclusively if the agent is to learn a task.
The aforementioned is very similar to the challenges we face as babies when we have to learn how to walk. At first, we try different types of movement, and we start from a simple movement yielding satisfactory results: crawling. Then, we want to improve our behavior to become more efficient. To learn a new behavior, we have to do movements we never did before: we try to walk. At first, we perform different actions yielding unsatisfactory results: we fall many times. Once we discover the correct way of moving our legs and balancing our body, we become more efficient in walking. If we did not explore further and we stopped at the first behavior that yields satisfactory results, we would crawl forever. By exploring, we learn that there can be different behaviors that are more efficient. Once we learn how to walk, we can stop exploring, and we can start exploiting our knowledge.
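One common way of balancing the two is the epsilon-greedy rule, which usually exploits the action with the highest estimated value but, with a small probability, explores a random one. The following is a minimal, illustrative sketch (the value estimates are invented numbers):
import numpy as np

def epsilon_greedy(action_values: np.ndarray, epsilon: float) -> int:
    """Select an action: explore with probability epsilon, otherwise exploit."""
    if np.random.rand() < epsilon:
        # exploration: try a random action
        return int(np.random.randint(len(action_values)))
    # exploitation: pick the action with the highest estimated value
    return int(np.argmax(action_values))

# estimated values for three actions (purely illustrative numbers)
values = np.array([1.0, 2.5, 0.3])
actions = [epsilon_greedy(values, epsilon=0.1) for _ in range(1000)]
print("fraction of greedy choices:", np.mean(np.array(actions) == 1))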
Elements of RL
Let's introduce the main elements of the RL framework intuitively.
Agent
In RL, the agent is the abstract concept of the entity that moves in the world, takes actions, and achieves goals. An agent can be a piece of autonomous driving software, a chess player, a Go player, an algorithmic trader, or a robot. The agent is everything that can perceive and influence the state of the environment and, therefore, can be used to accomplish goals.
Actions
An agent can perform actions based on the current situation. Actions can assume different forms depending on the specific task.
In an autonomous driving context, actions can be steering, pushing the accelerator pedal, or pushing the brake pedal. In a chess context, examples of actions include moving the knight to the H5 square or moving the king to the A5 square.
Actions can be low-level, such as controlling the voltage of the motors of a vehicle, but they can also be high-level, or planning actions, such as deciding where to go. The decision on the action level is the responsibility of the algorithm's designer. Actions that are too high-level can be challenging to implement at a lower level; they might require extensive planning at lower levels. At the same time, low-level actions make the problem difficult to learn.
Environment
The environment represents the context in which the agent moves and takes decisions. An environment is composed of three main elements: states, dynamics, and rewards. They can be explained as follows:
- State: This represents all of the information describing the environment at a particular timestep. The state is available to the agent through observations, which can be a partial or full representation.
- Dynamics: The dynamics of an environment describe how actions influence the state of the environment. The environment dynamics are usually very complex or unknown. An RL algorithm that uses knowledge of the environment dynamics to learn how to achieve a goal belongs to the category of model-based RL, where the model is the mathematical description of the environment. Most of the time, the environment dynamics are not available to the agent; in this case, the algorithm belongs to the model-free category. However, even if the environment model is not available, is too complicated, or is too approximate, the agent can learn a model of the environment during training. In this case, too, the algorithm is said to be model-based.
- Rewards: Rewards are scalar values, associated with each timestep, that describe the agent's goal. Rewards can also be seen as environmental feedback providing information to the agent about its behavior; this feedback is what makes learning possible. If the agent receives a high reward, it means that it performed a good move, one that brings it closer to its goal.
Policy
A policy describes the behavior of the agent. Agents select actions by following their policies. Mathematically, a policy is a function mapping states to actions. What does this mean? Well, it means that the input of the policy is the current state, and its output is the action to take. A policy can have different forms. It can be a simple set of rules, a lookup table, a neural network, or any function approximator. A policy is the core of the RL framework, and the goal of all RL algorithms (implicit or explicit) is to improve the agent's policy to maximize the agent's performance on a task (or on a set of tasks). A policy can be stochastic, involving a distribution over actions, or it can be deterministic. In the latter case, the selected action is uniquely determined by the environment's state.
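To make the idea concrete, here is a tiny sketch of two simple policy forms for a discrete problem, a rule-based policy and a lookup table; the states and actions are hypothetical and used only for illustration:
def rule_based_policy(state: int) -> int:
    """A policy expressed as a simple set of rules (hypothetical states/actions)."""
    if state == 1:
        return 0  # in state 1, take action A
    return 1      # in any other state, take action B

# the same behavior expressed as a lookup table from states to actions
lookup_table_policy = {1: 0, 2: 1, 3: 1}

state = 2
print(rule_based_policy(state), lookup_table_policy[state])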
An Example of an Autonomous Driving Environment
To better understand the environment's role and its characteristics in the RL framework, let's formalize an autonomous driving environment, as shown in the following figure:

Figure 1.10: An autonomous driving scenario
Considering the preceding figure, let's now look at each of the components of the environment:
- State: The state can be represented by the 360-degree image of the street around our car. In this case, the state is an image, that is, a matrix of pixels. It can also be represented by a series of images covering the whole space around the car. Another possibility is to describe the state using features and not images. The state can be the current velocity and acceleration of our vehicle, the distance from other cars, or the distance from the street border. In this case, we are using preprocessed information to represent the state more easily. These features can be extracted from images or other types of sensors (for example, Light Detection and Ranging – LIDAR).
- Dynamics: The dynamics of the environment in an autonomous car scenario are represented by the equations describing how the system changes when the car accelerates, brakes, or steers. For instance, suppose the vehicle is traveling at 30 km/h and the next vehicle is 100 meters away. The state is represented by the car's speed and the proximity information concerning the next vehicle. If the car accelerates, the speed changes according to the car's properties (included in the environment dynamics). The proximity information also changes, since the next vehicle can be closer or further away (according to the relative speed). In this situation, at the next timestep, the car's speed may be 35 km/h, and the next vehicle may be closer, for example, only 90 meters away.
- Reward: The reward can represent how well the agent is driving. It's not easy to formalize a reward function. A natural reward function should reward states in which the car is aligned with the street and penalize states in which the car crashes or goes off the road. The design of the reward function is an open problem, and researchers are putting effort into developing algorithms where the reward function is not needed (self-motivated or curiosity-driven agents), where the agent learns from demonstrations (imitation learning), and where the agent recovers the reward function from demonstrations (Inverse Reinforcement Learning, or IRL). A small illustrative sketch of a hand-crafted reward function is given after the following note.
Note
For further reading on curiosity-driven agents, please refer to the following paper: https://pathak22.github.io/large-scale-curiosity/resources/largeScaleCuriosity2018.pdf.
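As anticipated in the reward description above, here is a minimal, hypothetical sketch of a hand-crafted reward function for the driving example; the state fields and thresholds are invented purely for illustration:
def driving_reward(distance_from_lane_center: float,
                   speed_kmh: float,
                   crashed: bool) -> float:
    """A hypothetical reward: stay near the lane center, keep a reasonable
    speed, and heavily penalize crashes."""
    if crashed:
        return -100.0
    reward = 1.0 - abs(distance_from_lane_center)  # alignment term
    if speed_kmh > 130.0:                          # discourage speeding
        reward -= 1.0
    return reward

print(driving_reward(0.2, 80.0, False))  # well aligned, safe speed
print(driving_reward(0.0, 80.0, True))   # crash: large negative reward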
We are now ready to design and implement our first environment class using Python. We will demonstrate how to implement the state, the dynamics, and the reward of a toy problem in the following exercise.
Exercise 1.01: Implementing a Toy Environment Using Python
In this exercise, we will implement a simple toy environment using Python. The environment is illustrated in Figure 1.11. It is composed of three states (1, 2, 3) and two actions (A and B). The initial state is state 1. States are represented by nodes. Edges represent transitions between states. On the edges, we have an action causing the transition and the associated reward.
The representation of the environment in Figure 1.11 is the standard environment representation in the context of RL. In this exercise, we will become acquainted with the concept of the environment and its implementation:

Figure 1.11: A toy environment composed of three states (1, 2, 3) and two actions (A and B)
In the preceding figure, the reward is associated with each state-action pair.
The goal of this exercise is to implement an Environment class with a step() method that takes as input the agent's action and returns the (next state, reward) pair. In addition to this, we will write a reset() method that resets the environment to its initial state:
- Create a new Jupyter notebook or a simple Python script to enter the code.
- Import the Tuple type from typing:
from typing import Tuple
- Define the class constructor by initializing its properties:
class Environment:
    def __init__(self):
        """
        Constructor of the Environment class.
        """
        self._initial_state = 1
        self._allowed_actions = [0, 1]  # 0: A, 1: B
        self._states = [1, 2, 3]
        self._current_state = self._initial_state
Note
The triple-quotes (""") shown in the code snippet above are used to denote the start and end points of a multi-line code comment. Comments are added into code to help explain specific bits of logic.
We have two allowed actions, the action 0 and the action 1, representing the actions A and B. We have three environment states: 1, 2, and 3. We define the current_state variable to be equal to the initial state (state 1).
- Define the step function, which is responsible for updating the current state based on the previous state and the action taken by the agent:
def step(self, action: int) -> Tuple[int, int]:
    """
    Step function: compute the one-step dynamic from the given action.

    Args:
        action (int): the action taken by the agent.

    Returns:
        The tuple current_state, reward.
    """
    # check if the action is allowed
    if action not in self._allowed_actions:
        raise ValueError("Action is not allowed")
    reward = 0
    if action == 0 and self._current_state == 1:
        self._current_state = 2
        reward = 1
    elif action == 1 and self._current_state == 1:
        self._current_state = 3
        reward = 10
    elif action == 0 and self._current_state == 2:
        self._current_state = 1
        reward = 0
    elif action == 1 and self._current_state == 2:
        self._current_state = 3
        reward = 1
    elif action == 0 and self._current_state == 3:
        self._current_state = 2
        reward = 0
    elif action == 1 and self._current_state == 3:
        self._current_state = 3
        reward = 10
    return self._current_state, reward
Note
The # symbol in the code snippet above denotes a code comment. Comments are added into code to help explain specific bits of logic.
We first check that the action is allowed. Then, we define the new current state and reward based on the action and the previous state by looking at the transition in the previous figure.
- Now, we need to define the reset function, which simply resets the environment state:
def reset(self) -> int:
    """
    Reset the environment starting from the initial state.

    Returns:
        The environment state after reset (initial state).
    """
    self._current_state = self._initial_state
    return self._current_state
- We can use our environment class to understand whether our implementation is correct for the specified environment. We can do this with a simple loop, using a predefined set of actions to test the transitions of our environment. A possible action set, in this case, is [0, 0, 1, 1, 0, 1]. Using this set, we will test all of the environment's transitions:
env = Environment()
state = env.reset()
actions = [0, 0, 1, 1, 0, 1]
print(f"Initial state is {state}")
for action in actions:
    next_state, reward = env.step(action)
    print(f"From state {state} to state {next_state} \
with action {action}, reward: {reward}")
    state = next_state
Note
The code snippet shown here uses a backslash (\) to split the logic across multiple lines. When the code is executed, Python will ignore the backslash and treat the code on the next line as a direct continuation of the current line.
The output should be as follows:
Initial state is 1
From state 1 to state 2 with action 0, reward: 1
From state 2 to state 1 with action 0, reward: 0
From state 1 to state 3 with action 1, reward: 10
From state 3 to state 3 with action 1, reward: 10
From state 3 to state 2 with action 0, reward: 0
From state 2 to state 3 with action 1, reward: 1
To understand this better, compare the output with Figure 1.11 to discover whether the transitions and rewards are compatible with the selected actions.
Note
To access the source code for this specific section, please refer to https://packt.live/2Arr9rO.
You can also run this example online at https://packt.live/2zpMul0.
In this exercise, we implemented a simple RL environment by defining the step function and the reset function. These functions are at the core of every environment, representing the interaction between the agent and the environment.
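For comparison, the Gym toolkit mentioned at the beginning of this chapter exposes the same reset()/step() pattern. The following is a minimal sketch using the classic interface and, as an example, the CartPole-v1 environment (the exact return values of reset() and step() differ slightly between Gym versions):
import gym

# create a classic control environment (assumes Gym is installed)
env = gym.make("CartPole-v1")

# start a new episode and get the initial observation
observation = env.reset()

total_reward = 0.0
done = False
while not done:
    # sample a random action from the action space (no learning yet)
    action = env.action_space.sample()
    # apply the action: the environment returns the next observation,
    # the reward, a termination flag, and diagnostic info
    observation, reward, done, info = env.step(action)
    total_reward += reward

print(f"Episode finished with total reward {total_reward}")
env.close()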
The Agent-Environment Interface
RL considers sequential decision-making problems. In this context, we can refer to the agent as the "decision-maker." In sequential decision-making problems, actions taken by the decision-maker do not only influence the immediate reward and the immediate environment's state, but they also affect future rewards and states. MDPs are a natural way of formalizing sequential decision-making problems. In MDPs, an agent interacts with an environment through actions and receives rewards based on the action, on the current state of the environment, and on the environment's dynamics. The goal of the decision-maker is to maximize the cumulative sum of rewards given a horizon (which is possibly infinite). The task the agent has to learn is defined through the rewards it receives, as you can see in the following figure:

Figure 1.12: The Agent-Environment interface
In RL, an episode is divided into a sequence of discrete timesteps: t = 0, 1, 2, ..., T. Here, T represents the horizon length, which is possibly infinite. The interaction between the agent and the environment happens at each timestep. At each timestep, the agent receives a representation of the current environment's state, S_t. Based on this state, it selects an action, A_t, belonging to the action space given the current state, A(S_t). The action affects the environment. As a result, the environment changes its state, transitioning to the next state, S_{t+1}, according to its dynamics. At the same time, the agent receives a scalar reward, R_{t+1}, quantifying how good the action taken in that state was.
Let's now try to understand the mathematical notations used in the preceding example:
- Time horizon T: If a task has a finite time horizon, then T is an integer number representing the maximum duration of an episode. In infinite tasks, T can also be infinity.
- Action A_t: This is the action taken by the agent at timestep t. The action belongs to the action space, A(S_t), defined by the current state, S_t.
- State S_t: This is the representation of the environment's state received by the agent at time t. It belongs to the state space, S, defined by the environment. It can be represented by an image, a sequence of images, or a simple vector assuming different shapes. Note that the actual environment state can be different and more complex than the state perceived by the agent.
- Reward R_{t+1}: The reward is a real number describing how good the taken action was. A high reward corresponds to a good action. The reward is fundamental for the agent to understand how to achieve a goal.
In episodic RL, the agent-environment interaction is divided into episodes; the agent has to achieve the goal within the episode. The purpose of the interaction is to learn a better behavior. After several episodes, the agent can decide to update its behavior by incorporating the knowledge gained from past interactions. Based on the effect of its actions on the environment and the rewards received, the agent will more frequently perform the actions that yield higher rewards.
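Putting the interface into code, the following sketch runs a few episodes of a random policy on the toy Environment class from Exercise 1.01 (it assumes that class has already been defined) and accumulates the rewards received within each episode:
import random

# assumes the Environment class from Exercise 1.01 is defined in this session
n_episodes = 3
episode_length = 5

env = Environment()
for episode in range(n_episodes):
    state = env.reset()
    episode_return = 0
    for t in range(episode_length):
        action = random.choice([0, 1])    # a random policy
        state, reward = env.step(action)  # agent-environment interaction
        episode_return += reward          # accumulate the reward
    print(f"Episode {episode}: return = {episode_return}")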
What's the Agent? What's in the Environment?
An important aspect to take into account when dealing with RL is the difference between the agent and the environment. This difference is not typically defined in terms of a physical distinction. Usually, we model the environment as everything that's not under the control of the agent. The environment can include physical laws, other agents, or an agent's properties or characteristics.
However, this does not imply that the agent does not know the environment. The agent can also be aware of the environment and the effect of its actions on it, but it cannot change the way the environment reacts. Also, the reward computation belongs to the environment, as it must be entirely outside the agent's control. If this is not the case, the agent can learn how to modify the reward function in such a way as to maximize its performance without learning the task. The boundary between the agent and environment is a control boundary, meaning that the agent cannot control the reaction of the environment. It is not a knowledge boundary since the agent can know the environment model perfectly and still find difficulties in learning the task.
Environment Types
In this section, we will examine some possible environment dichotomies. The characterization of the environment depends on the state space (finite or continuous), on the type of transitions (deterministic or stochastic), on the information available to the agent (fully or partially observable), and the number of agents involved in the learning problem (single versus multi-agent).
Finite versus Continuous
The state space gives us the first distinction. The state space can be divided into two main categories: finite state spaces and continuous state spaces. A finite state space has a finite number of possible states in which the agent can be, and it is the more straightforward case. An environment with a continuous state space has infinitely many possible states. In these types of environments, the generalization capabilities of the agent are fundamental to solving a task, because the probability of arriving at exactly the same state twice is almost zero. In continuous environments, an agent cannot rely on having already visited a state; it has to generalize using some notion of similarity with respect to previously experienced states. Note that generalization is also essential for finite state spaces with a considerable number of states (for example, when the state space is represented by the set of all possible images).
Consider the following examples:
- Chess is finite. There is a finite number of possible states in which an agent can be. The state, for chess, is represented by the chessboard situation at a given time. We can calculate all the possible states by varying the situation of the chessboard. The number of states is very high but still finite.
- Autonomous driving can be defined as a continuous problem. If we describe the autonomous driving problem as a problem in which the agent has to make driving decisions based on the sensors' input, we obtain a continuous problem. The sensors provide continuous input in a given range. The agent state, in this case, can be represented by the agent's speed, the agent's acceleration, or the rotation of the wheels per minute.
Deterministic versus Stochastic
A deterministic environment is an environment in which, given a state and an action performed by the agent, the following state is uniquely determined, as is the following reward. Deterministic environments are the simplest type of environment, but they are rare due to their limited applicability to the real world.
Almost all real-world environments are stochastic. In stochastic environments, a state and an action performed by the agent determine a probability distribution over the next state and the next reward. The following state is not uniquely determined; it is uncertain. In these types of environments, the agent should perform each action many times to obtain a reliable estimate of its consequences.
Notice that, in a deterministic environment, the agent could perform each action in each state exactly once and, based on the acquired knowledge, solve the task. Also, notice that solving the task does not mean taking the action that yields the highest immediate reward, because this action could also bring the agent to an inconvenient part of the environment where future rewards are always low. To solve the task correctly, the agent should take the action with the highest associated future return (called the state-action value). The state-action value takes into account not only the immediate reward but also future rewards, giving the agent a farsighted view. We will define the state-action value precisely later. A small code sketch of the difference between deterministic and stochastic dynamics follows the examples below.
Consider the following examples:
- Rubik's Cube is deterministic. Each action corresponds to a well-defined state transition.
- Chess is deterministic but opponent-dependent. The successive state does not depend only on the agent's action but also on the opponent's action.
- Texas Hold'em is stochastic and opponent-dependent. The transition to the next state is stochastic and depends on the deck, which is not known by the agent.
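To illustrate this difference in code (a toy sketch, not tied to any real environment), a deterministic transition function returns a single next state, while a stochastic one samples the next state from a probability distribution:
import numpy as np

def deterministic_step(state: int, action: int) -> int:
    """Toy deterministic dynamics: the next state is uniquely determined."""
    return (state + action) % 3

def stochastic_step(state: int, action: int) -> int:
    """Toy stochastic dynamics: the next state is sampled from a probability
    distribution (here chosen based on the action, for simplicity)."""
    probabilities = [0.8, 0.1, 0.1] if action == 0 else [0.1, 0.1, 0.8]
    return int(np.random.choice([0, 1, 2], p=probabilities))

print(deterministic_step(1, 1))                   # always the same result
print([stochastic_step(1, 1) for _ in range(5)])  # results may differ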
Fully Observable versus Partially Observable
To plan actions, the agent has to receive a representation of the environment state (refer to Figure 1.12, The Agent-Environment interface). If the state representation received by the agent completely defines the state of the environment, the environment is Fully Observable. If some parts of the environment are outside the representation observed by the agent, the environment is Partially Observable; such a problem is also called a Partially Observable Markov Decision Process (POMDP). Multi-agent environments, for example, are partially observable. In partially observable environments, the information perceived by the agent, together with the action taken, is not sufficient for determining the next state of the environment. A technique to improve the perception of the agent, making it more accurate, is to keep a history of the actions taken and the observations received, but this requires some memory mechanism (such as a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network) embedded in the agent's policy.
Note
For more information on LSTMs, please refer to https://www.bioinf.jku.at/publications/older/2604.pdf.
POMDP versus MDP
Consider the following figure:

Figure 1.13: A representation of a partially observable environment
In the preceding figure, the agent does not receive the full environment state but only an observation, O_t.
To better understand the differences between these two types of environments, let's look at Figure 1.13. In partially observable environments (POMDP), the representation given to the agent is only a part of the actual environment state, and it is not enough to understand the actual environment state without uncertainty.
In fully observable environments (MDPs), the state representation given to the agent is semantically equivalent to the state of the environment. Notice that, in this case, the state given to the agent can assume a different form (for example, an image, a vector, a matrix, or a tensor). However, from this representation, it is always possible to reconstruct the actual state of the environment. The meaning of the state is precisely the same, even if under a different form.
Consider the following examples:
- Chess (and, in general, board games) is fully observable. The agent can perceive the whole environment state. In a chess game, the environment state is represented by the chessboard, and the agent can exactly perceive the position of each pawn.
- Poker is partially observable. A poker agent cannot perceive the whole state of the game, which includes the opponent cards and deck cards.
Single Agents versus Multiple Agents
Another useful characteristic of environments is the number of agents involved in a task. If there is only one agent, the subject of our study, the environment is a single-agent environment. If there is more than one agent, the environment is a multi-agent environment. The presence of multiple agents increases the complexity of the problem, since the action that influences the state becomes a joint action, the set of all the agents' actions. Usually, agents only know their individual actions and do not know the other agents' actions. For this reason, a multi-agent environment is an instance of a POMDP in which the partial visibility is due to the presence of other agents. Notice that each agent has its own observation, which can differ from the other agents' observations, as shown in the following figure:

Figure 1.14: A schematic representation of the multi-agent decentralized MDP
Consider the following examples:
- Robot navigation is usually a single-agent task. We may have only one agent moving in a possible unknown environment. The goal of the agent can be to reach a given position in the environment while avoiding crashes as much as possible in the minimum amount of time.
- Poker is a multi-agent task in which two or more agents compete against each other. In this case, each agent perceives a different state and, usually, receives a different reward.
An Action and Its Types
The action set of an agent in an environment can be finite or continuous. If the action set is finite, the agent has a finite number of actions at its disposal. Consider the MountainCar-v0 (discrete) example, described in more detail later. This has a discrete action set; the agent only has to select whether and in which direction to accelerate, and the acceleration magnitude is constant.
If the action set is continuous, the agent has at its disposal infinite actions from which it should select the best actions in a given state. Usually, tasks with continuous action sets are more challenging to solve than those with finite actions.
Let's look at the example of MountainCar-v0:

Figure 1.15: A MountainCar-v0 task
As you can see in the preceding figure, a car is positioned in a valley between two mountains. The goal of the car is to arrive at the flag on the mountain to its right.
The MountainCar-v0 example is a standard RL benchmark in which a car tries to drive up a mountain. The car's engine doesn't have enough strength to drive straight up. For this reason, the car should use the inertia given by the shape of the valley; that is, it should first go to the left to gain speed. The state is composed of the car's velocity, acceleration, and x position. There are two versions of this task, based on the action set we define, as follows:
- MountainCar-v0 (discrete): The agent selects from a small, finite set of actions: accelerate to the left, do not accelerate, or accelerate to the right.
- MountainCar continuous (MountainCarContinuous-v0): The action is a continuous value in the interval from -1 to +1.
Policy
We define the policy as the behavior of the agent. Formally, a policy is a function that takes as input the history of the current episode and outputs the current action. The concept of policies has huge importance in RL; all RL algorithms focus on learning the best policy for a given task.
An example of a winning policy for the MountainCar task is a policy that first drives the agent up the left mountain and then uses the accumulated potential energy to climb the mountain on the right. For negative velocities, the optimal action is LEFT, as the agent should go as high as possible on the left mountain. For positive velocities, the agent should take the action RIGHT, as its goal is to climb the mountain on its right.
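This heuristic can be written as a tiny rule-based policy. The sketch below assumes Gym's own observation layout for MountainCar-v0, [position, velocity], and its action encoding (0 = push left, 2 = push right), which differs slightly from the informal state description given above; adjust the indices if your version differs:
import gym

def heuristic_policy(observation):
    """Push in the direction of the current velocity (0 = left, 2 = right)."""
    position, velocity = observation
    return 0 if velocity < 0 else 2

env = gym.make("MountainCar-v0")
observation = env.reset()
done = False
while not done:
    action = heuristic_policy(observation)
    observation, reward, done, info = env.step(action)
env.close()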
A Markovian policy is simply a policy depending only on the current state and not the whole history.
We denote a stationary Markovian policy with π as follows:

Figure 1.16: Stationary Markovian policy
The Markovian policy maps the state space to the action space. If we evaluate the policy in a given state, s, we obtain the selected action, a = π(s), in that state:

Figure 1.17: A stationary Markovian policy evaluated in state s
A policy can be implemented in different ways. The most straightforward policy is just a rule-based policy, which is essentially a set of rules or heuristics.
Policies that are of interest in RL are usually parametric. Parametric policies are (differentiable) functions depending on a set of parameters. Usually, the policy parameters are denoted by θ:

Figure 1.18: Parametric policies
The set of policy parameters can be represented by a vector in a d-dimensional space. The selected action is determined by the policy structure (we will explore some possible policy structures later on), by the policy parameters, and, of course, by the current environment state.
Stochastic Policies
The policies presented so far are deterministic policies, because the output is exactly one action. Stochastic policies are policies that output a distribution over the action space. Stochastic policies are usually powerful policies that mix both exploration and exploitation. With stochastic policies, it is possible to obtain complex behaviors.
A stochastic policy assigns a certain probability to each action. The actions will be selected according to the associated probability.
Figure 1.19 explains, graphically, and with an example, the differences between a stochastic policy and a deterministic policy. The policy in the figure has three possible actions.
The stochastic policy (upper part) assigns to actions, respectively, a probability of 0.2, 0.7, and 0.1. The most probable action is the second action, which is associated with the highest probability. However, all of the actions could also be selected.
In the bottom part, we have the same set of actions with a deterministic policy. The policy, in this case, selects only one action (the second in the figure) with a probability of 1. In this case, actions 1 and 3 will not be selected, having an associated probability of 0.
Note that we can obtain a deterministic policy from a stochastic one by taking the action associated with the highest probability:

Figure 1.19: Stochastic versus deterministic policies
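The example in Figure 1.19 can be reproduced in a few lines (a minimal sketch): sampling from the probability vector gives the stochastic behavior, while taking the argmax gives the corresponding deterministic policy:
import numpy as np

action_probabilities = np.array([0.2, 0.7, 0.1])  # the policy in Figure 1.19

# stochastic policy: sample an action according to its probability
stochastic_action = np.random.choice(3, p=action_probabilities)

# deterministic policy obtained by taking the most probable action
deterministic_action = np.argmax(action_probabilities)

print(stochastic_action, deterministic_action)  # the second is always 1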
Policy Parameterizations
In this section, we will analyze some possible policy parameterizations. Parameterizing a policy means giving a structure to the policy function and considering how parameters affect our output actions. Based on the parameterization, it is possible to obtain simple policies or even complex stochastic policies starting from the same input state.
Linear (Deterministic)
The resulting action is a linear combination of the state features, φ(s):

Figure 1.20: An expression of a linear policy
A linear policy is a very simple policy represented by a matrix multiplication.
Consider the example of MountainCar-v0. The state space is represented by the position, speed, and acceleration: s = [x, v, a]. We usually add a constant, 1, that corresponds to the bias term. Therefore, s = [x, v, a, 1]. The policy parameters are defined by θ = [θ_1; θ_2; θ_3; θ_4]. We can simply use the identity function as the state features, φ(s) = s.
The resulting policy is as follows:

Figure 1.21: A linear policy for MountainCar-v0
Note
Using a comma (,), we denote the column separator, and with a semicolon (;), we denote the row separator.
Therefore, [x, v, a, 1] is a row vector, and [θ_1; θ_2; θ_3; θ_4] is a column vector, which is equivalent to the transpose of the row vector [θ_1, θ_2, θ_3, θ_4].
If the environment state is [1, 2, 0.1], that is, the cart is in position 1 with velocity 2 and acceleration 0.1, and the policy parameters are defined by [4, 5, 1, 1], we obtain the action 4*1 + 5*2 + 1*0.1 + 1*1 = 15.1.
Since the action space of MountainCar-v0 is defined in the interval [-1, +1], we need to squash the resulting action using a squashing function such as tanh (the hyperbolic tangent). In our case, tanh applied to the output of the multiplication results in approximately +1:
Figure 1.22: A hyperbolic tangent plot; the hyperbolic tangent squashes the real numbers in the interval, [-1, +1]
Even if linear policies are simple, they are often enough to solve many tasks, provided that the state features represent the problem well.
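We can quickly verify the worked example above with the numbers given in the text (a small sketch):
import numpy as np

theta = np.array([4.0, 5.0, 1.0, 1.0])            # policy parameters
state_features = np.array([1.0, 2.0, 0.1, 1.0])   # [x, v, a, bias]

linear_output = np.dot(theta, state_features)     # dot product: 15.1
action = np.tanh(linear_output)                   # squashed into [-1, +1]
print(linear_output, action)                      # roughly 15.1 and ~1.0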
Gaussian Policy
In the case of Gaussian parameterization, the resulting action has a Gaussian distribution in which the mean, μ, and the variance, σ², depend on the state features:

Figure 1.23: Expression for a Gaussian policy
Here, with the symbol |, we denote conditioning; therefore, with π(a|s), we denote the distribution over actions conditioned on the state, s.
Remember, the functional form of the Gaussian distribution, N(μ, σ²), is as follows:

Figure 1.24: A Gaussian distribution
In the case of a Gaussian policy, this becomes the following:

Figure 1.25: A Gaussian policy
Gaussian parameterization is useful for continuous action spaces. Note that we are giving the agent the possibility of also changing the variance of the distribution. This means that it can decide to increase the variance, enabling it to explore scenarios where it's not sure what the best action to take is, or it can reduce the variance by increasing the amount of exploitation when it's very sure about which action to take in a given state. The effect of the variance can be visualized as follows:

Figure 1.26: The effect of the variance on a Gaussian policy
In the preceding figure, if the variance increases (the lower, wider curve), the policy becomes more exploratory: actions that are far from the mean also have a non-negligible probability of being selected. When the variance is small (the higher, narrower curve), the policy is highly exploitative, meaning that only actions very close to the mean are likely to be selected.
In the preceding diagram, the wider Gaussian therefore represents a more explorative policy than the narrower one. Here, we can see the effect of the variance on the policy's exploration attitude.
While learning a task, in the first training episodes, the policy needs to have a high variance in order for it to explore different actions. The variance will be reduced once the agent gains some experience and becomes more and more confident about the best actions.
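The following sketch (with illustrative parameter values only) shows a Gaussian policy in code: the mean comes from a linear function of the state features, while the variance controls how spread out the sampled actions are:
import numpy as np

def gaussian_policy_sample(state_features: np.ndarray,
                           theta_mean: np.ndarray,
                           sigma: float) -> float:
    """Sample an action from a Gaussian whose mean depends on the state."""
    mean = float(np.dot(theta_mean, state_features))
    return float(np.random.normal(loc=mean, scale=sigma))

features = np.array([1.0, 2.0, 0.1, 1.0])
theta = np.array([0.1, 0.2, 0.0, 0.0])

# a small sigma exploits (actions close to the mean), a large sigma explores
exploit = [gaussian_policy_sample(features, theta, sigma=0.1) for _ in range(5)]
explore = [gaussian_policy_sample(features, theta, sigma=2.0) for _ in range(5)]
print(exploit)
print(explore)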
The Boltzmann Policy
Boltzmann parameterization is used for discrete action spaces. The resulting action is a softmax function acting on the weighted state features, as stated in the following expression:

Figure 1.27: Expression for a Boltzmann policy
Here, θ_a is the set of parameters associated with action a.
The Boltzmann policy is a stochastic policy. The motivation behind this is very simple; let's sum the policy over all of the actions (note that the denominator does not depend on the action, a), as follows:

Figure 1.28: A Boltzmann policy over all of the actions
The Boltzmann policy becomes deterministic if we select the action with the highest probability, which is equivalent to selecting the mean action in a Gaussian distribution. What the Boltzmann parameterization represents is simply a normalization of the value θ_a^T φ(s), corresponding to the score of action a. The score is normalized by considering the values of all the other actions, obtaining a distribution.
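A minimal sketch of the Boltzmann (softmax) parameterization is shown below; each action has its own parameter vector, and the scores are normalized into a probability distribution (the numbers are illustrative):
import numpy as np

def boltzmann_policy(state_features: np.ndarray,
                     theta_per_action: np.ndarray) -> np.ndarray:
    """Return a probability distribution over actions via a softmax of the
    per-action scores theta_a^T phi(s)."""
    scores = theta_per_action @ state_features    # one score per action
    exp_scores = np.exp(scores - np.max(scores))  # subtract max for stability
    return exp_scores / np.sum(exp_scores)

phi = np.array([1.0, 2.0])                 # state features
theta = np.array([[0.5, 0.1],              # parameters of action 0
                  [0.2, 0.4],              # parameters of action 1
                  [0.0, 0.0]])             # parameters of action 2
probabilities = boltzmann_policy(phi, theta)
print(probabilities, probabilities.sum())  # probabilities sum to 1

# sample an action according to the Boltzmann distribution
action = np.random.choice(len(probabilities), p=probabilities)
print("sampled action:", action)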
In all of these parameterizations, the state features might be non-linear functions of the state that depend on several parameters, for example, features coming from a neural network, radial basis function (RBF) features, or tile coding features.
Exercise 1.02: Implementing a Linear Policy
In this exercise, we will practice with the implementation of a linear policy. The goal is to write the presented parameterizations for a state composed of several components. In the first case, the features are represented by the identity function; in the second case, the features are represented by a polynomial function of order 2:
- Open a new Jupyter notebook and import NumPy to implement all of the requested policies:
from typing import Callable, List
import matplotlib
from matplotlib import pyplot as plt
import numpy as np
import scipy.stats
- Let's now implement the linear policy. A linear policy can be efficiently represented by a dot product between the policy parameters and state features. The first step is to write the constructor:
class LinearPolicy:
    def __init__(self, parameters: np.ndarray,
                 features: Callable[[np.ndarray], np.ndarray]):
        """
        Linear Policy Constructor.

        Args:
            parameters (np.ndarray): policy parameters as np.ndarray.
            features (Callable[[np.ndarray], np.ndarray]): function used to
                extract features from the state representation.
        """
        self._parameters = parameters
        self._features = features
The constructor simply sets the attribute's parameters and features. The feature parameter is actually a callable that takes, as input, a NumPy array and returns another NumPy array. The input is the environment state, whereas the output is the state features.
- Next, we will implement the call method. The __call__ method takes as input the state and returns the selected action according to the policy parameters. The call represents a real policy implementation. What we have to do in the linear case is to first apply the feature function and then compute the dot product between the parameters and the features. A possible implementation of the call function is as follows:
def __call__(self, state: np.ndarray) -> np.ndarray:
    """
    Call method of the Policy.

    Args:
        state (np.ndarray): environment state.

    Returns:
        The resulting action.
    """
    # calculate state features
    state_features = self._features(state)
    """
    the parameters shape [0] should be the same as the state features
    as they must be multiplied
    """
    assert state_features.shape[0] == self._parameters.shape[0]
    # dot product between parameters and state features
    return np.dot(self._parameters.T, state_features)
- Let's try the defined policy with a state composed of a 5-dimensional array. Sample a random set of parameters and a random state vector. Create the policy object. The constructor needs the callable features, which, in this case, is the identity function. Call the policy to obtain the resulting action:
# sample a random set of parameters
parameters = np.random.rand(5, 1)
# define the state features as identity function
features = lambda x: x
# define the policy
pi: LinearPolicy = LinearPolicy(parameters, features)
# sample a state
state = np.random.rand(5, 1)
# Call the policy obtaining the action
action = pi(state)
print(action)
The output will be as follows:
[[1.33244481]]
This value is the action selected by our agent, given the state and the policy parameters. In this case, the selected action is [[1.33244481]]. The meaning of the action depends on the RL task.
Of course, you will obtain different results based on the sampled parameters and sampled state. It is always possible to seed the NumPy random number generator to obtain reproducible results.
Note
To access the source code for this specific section, please refer to https://packt.live/2Yvrku7. You can also refer to the Gaussian and Boltzmann policies that are implemented in the same notebook.
You can also run this example online at https://packt.live/3dXc4Nc.
In this exercise, we practiced with different policies and parameterizations. These are simple policies, but they are the building blocks of more complex policies. The trick is just to substitute the state features with a neural network or any other feature extractor.
Goals and Rewards
In RL, the agent's goal is to maximize the total amount of reward it receives during an episode.
This is based on the famous reward hypothesis stated in Sutton and Barto (1998):
"That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward)."
The important thing here is that the reward should not describe how to achieve the goal; instead, it should describe the goal of the agent. The reward function is an element of the environment, but it can also be designed for a specific task. In principle, there are infinitely many reward functions for each task. Usually, informative reward functions help the agent to learn, while sparse reward functions make learning difficult or, sometimes, impossible. Sparse reward functions are functions in which, most of the time, the reward is constant (or zero), carrying little information.
Sutton's hypothesis, which we explained earlier, is the basis of the RL framework. This hypothesis may be wrong; perhaps a scalar reward signal (and its maximization) is not enough to define complex goals. Still, the hypothesis is very flexible and simple, and it can be applied to a wide range of tasks. At the time of writing, reward function design is more art than engineering; there are no formal procedures for writing a reward function, only best practices based on experience. Usually, a simple reward function works very well: we associate positive values with good actions and behaviors and negative values with bad actions or actions that are not important at that particular moment.
In a locomotion task (for example, teaching a robot how to move), the reward may be defined as proportional to the robot's forward movement. In chess, the reward may be defined as 0 for each timestep, +1 if the agent wins, and -1 if the agent loses. If we want our agent to solve Rubik's Cube, the reward may be defined similarly: 0 for every step and +1 when the cube is solved.
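To make the previous examples concrete, the chess-style sparse reward and the locomotion-style dense reward could be sketched as simple Python functions. The function names and arguments here are purely illustrative and not part of any library:

def chess_reward(game_over: bool, agent_won: bool) -> float:
    # sparse reward: 0 at every step, +1 for a win, -1 for a loss
    if not game_over:
        return 0.0
    return 1.0 if agent_won else -1.0

def locomotion_reward(previous_x: float, current_x: float) -> float:
    # dense reward: proportional to the robot's forward movement
    return current_x - previous_x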
Sometimes, as we learned earlier, defining a scalar reward function for a task is not easy, and, nowadays, it is more art than engineering or science.
In each of these tasks, the final objective is to learn a policy, a way of selecting actions, that maximizes the total reward received by the agent. Tasks can be episodic or continuing. Episodic tasks have a finite length, that is, a finite number of timesteps (the final timestep, $T$, is finite). Continuing tasks can last forever, or until the agent reaches its goal. In the first case (episodic tasks), we can simply define the total reward (return) received by an agent as the sum of the individual rewards:

$G = R_1 + R_2 + R_3 + \dots + R_T$

Figure 1.29: Expression for the total reward
Usually, we are interested in the return from a certain timestep, $t$. In other words, the return, $G_t$, quantifies the agent's performance in the long term, and it can be calculated as the sum of the immediate rewards following time $t$ until the end of the episode (timestep $T$):

$G_t = R_{t+1} + R_{t+2} + \dots + R_T$

Figure 1.30: Expression for the return from timestep t
It is straightforward to see that, with this formulation, the return for continuing tasks diverges to infinity.
In order to deal with continuing tasks, we need to introduce the notion of a discounted return. This concept formalizes, in mathematical terms, the principle that an immediate reward is (sometimes) more valuable than the same amount of reward received after many steps. This principle is widely known in economics. The discount factor, $\gamma \in [0, 1]$, quantifies the present value of future rewards. We are now ready to present the unified notation for the return in episodic and continuing tasks.

The discounted return is the cumulative, discounted sum of rewards until the end of the episode. In mathematical terms, it can be formalized as follows:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

Figure 1.31: Expression for the discounted return from timestep t
To understand how the discount affects the return, note that the value of receiving a reward $R$ after $k$ timesteps is $\gamma^{k-1} R$, since $\gamma$ is less than or equal to 1. It is worth describing the effect of the discount on the return. If $\gamma < 1$, the return, even though it is composed of an infinite sum, has a bounded value. If $\gamma = 0$, the agent is myopic: it cares only about the immediate reward and ignores future rewards. A myopic agent can cause problems: the only thing it learns is to select the action yielding the highest immediate reward. A myopic chess player may, for example, capture the opponent's pawn with a move that causes the loss of the game. Notice that, for some tasks, this is not a problem: these are tasks in which the current action does not affect future rewards and has no consequences for the agent's future. Such tasks can be solved by finding, for each state independently, the action that yields the highest immediate reward. Most of the time, however, the current action influences the future of the agent and its rewards. If the discount factor is close to 1, the agent is farsighted: it can sacrifice an action yielding a good immediate reward now for a higher reward in future steps.
It is important to understand the relationship between returns at different timesteps, both from a theoretical point of view and from an algorithmic point of view, because many RL algorithms are based on this principle:

$G_t = R_{t+1} + \gamma G_{t+1}$

Figure 1.32: Relationship between returns at different timesteps
From this simple derivation, we can see that the return from timestep $t$ equals the immediate reward plus the return from the following timestep scaled by gamma. This relationship will be used extensively in RL algorithms.
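A minimal NumPy sketch of these definitions follows, computing the discounted return for an arbitrary sequence of rewards both directly from the definition and through the recursive relationship shown in Figure 1.32:

import numpy as np

rewards = np.array([1.0, 0.0, 2.0, 3.0])  # R_{t+1}, R_{t+2}, ...
gamma = 0.9

# direct definition: G_t = sum_k gamma^k * R_{t+k+1}
discounts = gamma ** np.arange(len(rewards))
g_direct = np.sum(discounts * rewards)

# recursive definition: G_t = R_{t+1} + gamma * G_{t+1}
g_recursive = 0.0
for reward in reversed(rewards):
    g_recursive = reward + gamma * g_recursive

# both print 4.807 (up to floating-point error)
print(g_direct, g_recursive)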
Why Discount?
The following are the main motivations for discounting in RL problems:

- It is mathematically convenient to have a bounded return, even in the case of continuing tasks.
- In financial tasks, immediate rewards can earn interest, making them more valuable than delayed rewards.
- Animal and human behavior show a preference for immediate rewards.
- A discounted reward may also represent uncertainty about the future.
- It is also possible to use an undiscounted return ($\gamma = 1$) if all of the episodes terminate after a finite number of steps.
This section introduced the main elements of RL, including agents, actions, environments, transition functions, and policies. In the next section, we will practice with these concepts by defining agents, environments, and measuring the performance of agents on some tasks.
Reinforcement Learning Frameworks
In the previous sections, we learned the basic theory behind RL. In principle, an agent or an environment can be implemented in any way and in any language. For RL, the primary language used in both academia and industry is Python, as it allows you to focus on the algorithms rather than on language details, making it very simple to use. Implementing an algorithm or a complex environment (for example, an autonomous driving environment) from scratch can be very difficult and error-prone. For this reason, several well-established and well-tested libraries make RL very easy for newcomers. In this section, we will explore the main Python RL libraries. We will present OpenAI Gym, a set of environments that is ready to use and easy to modify, and OpenAI Baselines, a set of high-quality, state-of-the-art algorithms. By the end of this chapter, you will have learned about and practiced with environments and agents.
OpenAI Gym
OpenAI Gym (https://gym.openai.com) is a Python library that provides a set of RL environments, ranging from toy environments to Atari environments and more complex environments, such as MuJoCo and Robotics environments. Besides providing this large set of tasks, OpenAI Gym also provides a unified interface for interacting with RL tasks and a set of interfaces for describing an environment's characteristics, such as its action space and its state space. An important property of Gym is that its only focus is on environments; it makes no assumptions about the type of agent you have or the computational framework you use. We will not cover the installation details in this chapter for ease of presentation. Instead, we will focus on the main concepts and learn how to interact with these libraries.
Getting Started with Gym – CartPole
CartPole is a classic control environment provided by Gym and used by researchers as a starting point for testing algorithms. It consists of a cart that moves along the horizontal axis (one dimension) and a pole anchored to the cart at one end:

Figure 1.33: CartPole environment representation
The agent has to learn how to move the cart to balance the pole (that is, to stop the pole from falling). The episode ends when the pole angle ($\theta$) exceeds a certain threshold. The state space is represented by the cart position along the axis, $x$; the cart velocity, $\dot{x}$; the pole angle, $\theta$; and the pole angular velocity, $\dot{\theta}$. The state space is continuous in this case, but it can also be discretized to make learning simpler.
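As an illustration of the discretization mentioned above, each state component can be mapped to a finite set of bins, for example with np.digitize. The following is a minimal sketch; the number of bins and the bin edges are arbitrary choices, not values prescribed by Gym:

import numpy as np

# arbitrary bin edges for cart position, cart velocity,
# pole angle, and pole angular velocity
bins = [
    np.linspace(-4.8, 4.8, 10),
    np.linspace(-3.0, 3.0, 10),
    np.linspace(-0.42, 0.42, 10),
    np.linspace(-3.0, 3.0, 10),
]

def discretize(state) -> tuple:
    # map each continuous component to the index of its bin
    return tuple(int(np.digitize(s, b)) for s, b in zip(state, bins))

# example: a continuous state mapped to discrete bin indices
print(discretize([0.1, -0.5, 0.02, 1.0]))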
In the following steps, we will practice with Gym and its environments.
Let's create a CartPole environment using Gym and analyze its properties in a Jupyter notebook. Please refer to the Preface for Gym installation instructions:
# Import the gym Library
import gym

# Create the environment using gym.make(env_name)
env = gym.make('CartPole-v1')

"""
Analyze the action space of cart pole
using the property action_space
"""
print("Action Space:", env.action_space)

"""
Analyze the observation space of cartpole
using the property observation_space
"""
print("Observation Space:", env.observation_space)
If you run these lines, you will get the following output:
Action Space: Discrete(2)
Observation Space: Box(4,)
Discrete(2) means that the action space of CartPole is a discrete action space composed of two actions: Go Left and Go Right. These are the only actions available to the agent. In this case, Go Left is represented by action 0 and Go Right by action 1.
Box(4,) means that the state space (the observation space) of the environment is represented by a 4-dimensional box, a subspace of $\mathbb{R}^4$. Formally, it is a Cartesian product of n intervals. The state space has a lower bound and an upper bound; the bounds may also be infinite, creating an unbounded box.
To inspect the observation space better, we can use the low and high properties:
# Analyze the bounds of the observation space
print("Lower bound of the Observation Space:",
      env.observation_space.low)
print("Upper bound of the Observation Space:",
      env.observation_space.high)
This will print the following:
Lower bound of the Observation Space: [-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
Upper bound of the Observation Space: [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]
Here, we can see that the upper and lower bounds are arrays of 4 elements, one for each state dimension. The following are some observations:

- The lower bound of the cart position (the first state dimension) is -4.8, while the upper bound is 4.8.
- The lower bound of the cart velocity (the second state dimension) is -3.4e+38, basically $-\infty$, and the upper bound is +3.4e+38, basically $+\infty$.
- The lower bound of the pole angle (the third state dimension) is approximately -0.42 radians, representing an angle of -24 degrees. The upper bound is approximately 0.42 radians, representing an angle of +24 degrees.
- The lower and upper bounds of the pole angular velocity (the fourth state dimension) are, respectively, $-\infty$ and $+\infty$, just like the bounds of the cart velocity.
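As a quick check, we can verify that the observations returned by the environment actually lie within these bounds using the observation_space.contains() method. A small sketch, assuming env is the CartPole environment created earlier:

# reset the environment and get the initial observation
observation = env.reset()
# check that the observation lies within the declared bounds
print("Observation:", observation)
print("Inside bounds:", env.observation_space.contains(observation))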
Gym Spaces
The Gym Space class represents the way Gym describes action and state spaces. The most commonly used spaces are Discrete and Box.

A Discrete space is composed of a fixed number of elements. It can represent either a state space or an action space, and it exposes the number of elements through the n attribute. Its elements range from 0 to n-1.
A Box space describes its shape through the shape attribute. It can have an n-dimensional shape that corresponds to an n-dimensional box. A Box space can also be unbounded. Each interval has one of the forms $[a, b]$, $(-\infty, b]$, $[a, \infty)$, or $(-\infty, \infty)$.
It is possible to sample from a space to gain insight into the elements it is composed of by using the space.sample() method.
Note
For Box spaces, each coordinate of a sample is drawn according to the form of its interval, using the following distributions:

- $[a, b]$: a uniform distribution
- $[a, \infty)$: a shifted exponential distribution
- $(-\infty, b]$: a shifted negative exponential distribution
- $(-\infty, \infty)$: a normal distribution
Let's now demonstrate how to create simple spaces and how to sample from spaces:
# Type hinting
from typing import Tuple
import gym
# Import the spaces module
from gym import spaces

# Create a discrete space composed by N elements (5)
n: int = 5
discrete_space = spaces.Discrete(n=n)
# Sample from the space using the .sample method
print("Discrete Space Sample:", discrete_space.sample())

"""
Create a Box space with a shape of (4, 4).
Lower and upper bounds are 0 and 1.
"""
box_shape: Tuple[int, int] = (4, 4)
box_space = spaces.Box(low=0, high=1, shape=box_shape)
# Sample from the space using the .sample method
print("Box Space Sample:", box_space.sample())
This will print the samples from our spaces:
Discrete Space Sample: 4
Box Space Sample: [[0.09071387 0.4223234  0.09272052 0.15551752]
 [0.8507258  0.28962377 0.98583364 0.55963445]
 [0.4308358  0.8658449  0.6882108  0.9076272 ]
 [0.9877584  0.7523759  0.96407163 0.630859  ]]
Of course, the samples will change according to your seeds.
As you can see, we have sampled element 4 from our discrete space composed of 5 elements (from 0 to 4). We sampled a random 4 x 4 matrix with elements between 0 and 1, the lower and the upper bound of our space.
To obtain reproducible results, it is also possible to set the seed of a space (or of an environment) using the seed method:
# Seed spaces to obtain reproducible samples
discrete_space.seed(0)
box_space.seed(0)

# Sample from the seeded space
print("Discrete Space (seed=0) Sample:", discrete_space.sample())
# Sample from the seeded space
print("Box Space (seed=0) Sample:", box_space.sample())
This will print the following:
Discrete Space (seed=0) Sample: 0
Box Space (seed=0) Sample: [[0.05436005 0.9653909  0.63269097 0.29001734]
 [0.10248426 0.67307633 0.39257675 0.66984606]
 [0.05983897 0.52698725 0.04029069 0.9779441 ]
 [0.46293673 0.6296479  0.9470484  0.6992778 ]]
The previous statement will always print the same sample since we set the seed to 0. Seeding an environment is very important in order to guarantee reproducible results.
Exercise 1.03: Creating a Space for Image Observations
In this exercise, we will create a space to represent an image observation. Image-based observations are essential in RL since they allow the agent to learn directly from pixels, requiring minimal feature engineering and no hand-crafted feature extraction phase. The agent can focus on what is important for its task without being limited by manually designed heuristics. We will create a space representing RGB images with dimensions of 256 x 256:
- Open a new Jupyter notebook and import the desired modules – gym and NumPy:

import gym
from gym import spaces
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np  # used for the dtype of the space
- We are dealing with 256 x 256 RGB images, so the space has a shape of (256, 256, 3). In addition, the pixel values range from 0 to 255 (if we consider uint8 images):

"""
since the Space is RGB images with shape 256x256
the final shape is (256, 256, 3)
"""
shape = (256, 256, 3)
# If we consider uint8 images, the bounds are 0-255
low = 0
high = 255
# Space type: unsigned int
dtype = np.uint8
- We are now ready to create the space. An image is a Box space since it has defined bounds:

# create the space
space = spaces.Box(low=low, high=high, shape=shape, dtype=dtype)
# Print space representation
print("Space", space)
This will print the representation of our space:
Space Box(256, 256, 3)
The first dimension is the image width, the second dimension is the image height, and the third dimension is the number of channels.
- Here is a sample from the space:
# Sample from the space
sample = space.sample()
print("Space Sample", sample)
This will return the space sample; in this case, it is a huge tensor of 256 x 256 x 3 unsigned integers (between 0 and 255). The output (fewer lines are presented now) should be similar to the following:
Space Sample [[[ 37 254 243]
  [134 179  12]
  [238  32   0]
  ...
  [100  61  73]
  [103 164 131]
  [166  31  68]]

 [[218 109 213]
  [190  22 130]
  [ 56 235 167]
- To visualize the returned sample, use the following code:
plt.imshow(sample)
The output will be as follows:
Figure 1.34: A sample from a Box space of (256, 256) RGB
The preceding is not very informative because it is a random image.
- Now, suppose we want to give our agent the opportunity to see the last n=4 frames. By adding a temporal component, we obtain a state representation with 4 dimensions: the first dimension is the temporal one, the second is the width, the third is the height, and the last one is the number of channels. This is a very useful technique that allows the agent to understand its movement:

# we want a space representing the last n=4 frames
n_frames = 4   # number of frames
width = 256    # image width
height = 256   # image height
channels = 3   # number of channels (RGB)
shape_temporal = (n_frames, width, height, channels)

# create a new instance of space
space_temporal = spaces.Box(low=low, high=high,
                            shape=shape_temporal, dtype=dtype)
print("Space with temporal component", space_temporal)
This will print the following:
Space with temporal component Box(4, 256, 256, 3)
As you can see, we have successfully created a space and, on inspecting the space representation, we notice that we have another dimension: the temporal dimension.
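As a sketch of how such a temporal observation can be assembled in practice, the last n frames can be stacked along a new leading axis. Here, random samples stand in for real observations:

# collect the last n_frames observations (random images here)
frames = [space.sample() for _ in range(n_frames)]

# stack them along a new leading (temporal) axis
stacked_observation = np.stack(frames, axis=0)
print("Stacked shape:", stacked_observation.shape)  # (4, 256, 256, 3)
print("Valid for the temporal space:",
      space_temporal.contains(stacked_observation))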
Note
To access the source code for this specific section, please refer to https://packt.live/2AwJm7x.
You can also run this example online at https://packt.live/2UzxoAY.
Image-based environments are very important in RL. They allow the agent to learn salient features for solving the task directly from raw pixels, without any preprocessing. In this exercise, we learned how to create a Gym space for image observations and how to deal with image spaces.
Rendering an Environment
In the Getting Started with Gym – CartPole section, we saw a sample from the CartPole state space. However, visualizing or understanding the CartPole state from a vector representation is not an easy task, at least for a human. Gym also allows you to visualize a given task (when possible) through the env.render() function.
Note
The env.render() function is usually slow. Rendering is done primarily to inspect the behavior learned by the agent after training, or at intervals of many training steps. Usually, we train agents without rendering the environment state in order to improve training speed.
If we just call the env.render() function repeatedly, we will always see the same scene, since the environment state does not change. To see the evolution of the environment over time, we must call the env.step() function, which takes as input an action belonging to the action space and applies it to the environment.
Rendering CartPole
The following code demonstrates how to render the CartPole environment. The action is a sample from the action space. For RL algorithms, the action will be smartly selected from the policy:
# Create the environment using gym.make(env_name)
env = gym.make("CartPole-v1")
# reset the environment (mandatory)
env.reset()

# render the environment for 100 steps
n_steps = 100
for i in range(n_steps):
    action = env.action_space.sample()
    env.step(action)
    env.render()

# close the environment correctly
env.close()
If you run this script, you will see that gym opens a window and displays the CartPole environment with random actions, as shown in the following figure:

Figure 1.35: A CartPole environment rendered in Gym (the initial state)
A Reinforcement Learning Loop with Gym
To understand the consequences of an action, and to come up with a better policy, the agent observes its new state and a reward. Implementing this loop with gym is easy. The key element is the env.step() function: it takes an action as input, applies the action, and returns four values, which are described as follows:
- Observation: The observation is the next environmental state. This is represented as an element belonging to the observation space of the environment.
- Reward: The reward associated with a step is a float value that is related to the action given as input to the function.
- Done: This return value is True when the episode is finished, meaning it is time to call the env.reset() function to reset the environment state.
- Info: This is a dictionary containing debugging information; usually, it is ignored.
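Before building the full loop, a single call to env.step() can be inspected in isolation. The following is a minimal sketch using a random action:

import gym

env = gym.make("CartPole-v1")
# reset returns the initial observation
observation = env.reset()
# apply one random action and unpack the four returned values
observation, reward, done, info = env.step(env.action_space.sample())
print("Observation:", observation)
print("Reward:", reward)
print("Done:", done)
print("Info:", info)
env.close()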
Let's now implement the RL loop within the Gym environment.
Exercise 1.04: Implementing the Reinforcement Learning Loop with Gym
In this exercise, we will implement a basic RL loop with episodes and timesteps using the CartPole environment. You can use other environments as well; nothing changes, as the main goal of Gym is to unify the interfaces of all environments so that we can build agents that are as environment-agnostic as possible. This transparency with respect to the environment is a distinctive feature of RL: algorithms are usually not tailored to a specific task but are task-agnostic, so they can be applied to a variety of environments and still solve them.
We need to create the Gym CartPole environment as before using the gym.make() function. After that, we can loop for a defined number of episodes; for each episode, we loop for a defined number of steps or until the episode is terminated (by checking the done value). For each timestep, we have to call the env.step() function by passing an action (a random action for now), and then we collect the desired information:
- Open a new Jupyter notebook and define the import, the environment, and the desired number of steps:
import gym
import matplotlib.pyplot as plt
%matplotlib inline

env = gym.make("CartPole-v1")

# each episode is composed of at most 100 timesteps
# define 10 episodes
n_episodes = 10
n_timesteps = 100
- Loop for each episode:
# loop for the episodes
for episode_number in range(n_episodes):
    # here we are inside an episode
- Reset the environment and get the first observation:
""" the reset function resets the environment and returns the first environment observation """ observation = env.reset()
- Loop for each timestep:
""" loop for the given number of timesteps or until the episode is terminated """ for timestep_number in range(n_timesteps):
- Render the environment, select the action (randomly, by using the env.action_space.sample() method), and then take the action:

        # render the environment
        env.render()
        # select a random action
        action = env.action_space.sample()
        # apply the selected action by calling env.step
        observation, reward, done, info = env.step(action)
- Check whether the episode has been terminated using the done variable:

        # if done, the episode is terminated and we have to reset the environment
        if done:
            print(f"Episode Number: {episode_number}, Timesteps: {timestep_number}")
            # break from the timestep loop
            break
- After the episode loop, close the environment in order to release the associated memory:
# close the environment
env.close()
If you run the previous code, the output should, approximately, be like this:
Episode Number: 0, Timesteps: 34
Episode Number: 1, Timesteps: 10
Episode Number: 2, Timesteps: 12
Episode Number: 3, Timesteps: 21
Episode Number: 4, Timesteps: 16
Episode Number: 5, Timesteps: 17
Episode Number: 6, Timesteps: 12
Episode Number: 7, Timesteps: 15
Episode Number: 8, Timesteps: 16
Episode Number: 9, Timesteps: 16
We have the episode number and the number of timesteps taken in that episode. We can see that the average number of timesteps per episode is approximately 17. This means that, using the random policy, the pole falls after about 17 timesteps on average, and the episode terminates.
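The average reported above can be computed by collecting the episode lengths and averaging them with NumPy. A small sketch, using the values printed in the previous run (your numbers will differ):

import numpy as np

episode_lengths = [34, 10, 12, 21, 16, 17, 12, 15, 16, 16]
print("Average episode length:", np.mean(episode_lengths))  # about 17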
Note
To access the source code for this specific section, please refer to https://packt.live/2MOs5t5.
This section does not currently have an online interactive example, and will need to be run locally.
The goal of this exercise was to understand the bare bones of every RL algorithm. The only missing ingredient is that, to be useful, the action selection phase should take the environment state into account rather than being random.
Let's now move toward completing an activity to measure the performance of an agent.
Activity 1.01: Measuring the Performance of a Random Agent
Measuring the performance of an agent and designing it are essential phases of every RL experiment. The goal of this activity is to practice with these two concepts by designing an agent that is able to interact with an environment using a random policy and then measuring its performance.
You need to design a random agent using a Python class to modularize and keep the agent independent from the main loop. After that, you have to measure the mean and the variance of the discounted return using a batch of 100 episodes. You can use every environment you want, taking into account that the agent's action should be compatible with the environment. You can design two different types of agents for discrete action spaces and continuous action spaces. The following steps will help you to complete the activity:
- Import the required libraries: abc, numpy, and gym.
- Define the Agent abstract class in a very simple way, defining only the pi() function that represents the policy. The input should be an environment state. The __init__ method should take the action space as input and build the distribution accordingly.
- Define a ContinuousAgent deriving from the Agent abstract class. The agent should check that the action space is coherent with it, that is, that it is a continuous action space. The agent should also initialize a probability distribution for sampling actions (you can use NumPy to define probability distributions). The continuous agent can change the distribution type according to the distributions defined by the Gym spaces.
- Define a DiscreteAgent deriving from the Agent abstract class. The discrete agent should, of course, initialize a uniform distribution.
- Implement the pi() function for both agents. This function is straightforward: it should only sample from the distribution defined in the constructor and return the sample, ignoring the environment state. Of course, this is a simplification. You can also implement the pi() function in the Agent base class.
- Define the main RL loop in another file by importing the agent.
- Instantiate the correct agent according to the selected environment. Examples of environments are "CartPole-v1" or "MountainCarContinuous-v0."
- Take actions according to the pi function of the agent.
- Measure the performance of the agent by collecting (in a list or a NumPy array) the discounted return for each episode. Then, take the average and the standard deviation (you can use NumPy for this). Remember to apply the discount factor (user-defined) to the immediate reward; you have to keep a cumulative discount factor by multiplying it by the discount factor at each timestep.
The output should be similar to the following:
Episode Number: 0, Timesteps: 27, Return: 28.0
Episode Number: 1, Timesteps: 9, Return: 10.0
Episode Number: 2, Timesteps: 13, Return: 14.0
Episode Number: 3, Timesteps: 16, Return: 17.0
Episode Number: 4, Timesteps: 31, Return: 32.0
Episode Number: 5, Timesteps: 10, Return: 11.0
Episode Number: 6, Timesteps: 14, Return: 15.0
Episode Number: 7, Timesteps: 11, Return: 12.0
Episode Number: 8, Timesteps: 10, Return: 11.0
Episode Number: 9, Timesteps: 30, Return: 31.0
Statistics on Return: Average: 18.1, Variance: 68.89000000000001
Note
The solution for this activity can be found via this link.
OpenAI Baselines
OpenAI Baselines (https://github.com/openai/baselines) is a set of state-of-the-art RL algorithms. The main goal of Baselines is to make it easier to reproduce results on a set of benchmarks, to evaluate new ideas, and to compare them to existing algorithms. In this section, we will learn how to use Baselines to run an existing algorithm on an environment taken from Gym (refer to the previous section) and how to visualize the behavior learned by the agent. As with Gym, we will not cover the installation instructions; these can be found in the Preface. The implementation of the Baselines algorithms is based on TensorFlow, one of the most popular libraries for machine learning.
Getting Started with Baselines – DQN on CartPole
Training a Deep Q Network (DQN) on CartPole is straightforward with Baselines; we can do it with just one line of Bash.
Just use the terminal and run this command:
# Train model and save the results to cartpole_model.pkl
python -m baselines.run --alg=deepq --env=CartPole-v0 \
    --save_path=./cartpole_model.pkl --num_timesteps=1e5
Let's understand the parameters, as follows:
- --alg=deepq specifies the algorithm to be used to train our agent. In our case, we selected deepq, that is, DQN.
- --env=CartPole-v0 specifies the environment to be used. We selected CartPole, but we could also select many other environments.
- --save_path=./cartpole_model.pkl specifies where to save the trained agent.
- --num_timesteps=1e5 is the number of training timesteps.
After having trained the agent, it is also possible to visualize the learned behavior using the following:
# Load the model saved in cartpole_model.pkl
# and visualize the learned policy
python -m baselines.run --alg=deepq --env=CartPole-v0 \
    --load_path=./cartpole_model.pkl --num_timesteps=0 --play
DQN is a very powerful algorithm; using it for a simple task such as CartPole is almost overkill. We can see that the agent has learned a stable policy, and the pole almost never falls. We will explore DQN in more detail in the following chapters.
In the following steps, we will train a DQN agent on the CartPole environment using Baselines:
- First, we import gym and baselines:

import gym
# Import the desired algorithm from baselines
from baselines import deepq
- Define a callback to inform baselines when to stop training. The callback should return True when the reward is satisfactory:

def callback(locals, globals):
    """
    Function called at every step with the state of the algorithm.
    If the callback returns True, training stops.
    We stop training if the average reward exceeds 199:
    the time should be greater than 100 steps and the average
    of the last 100 returns should be >= 199.
    """
    is_solved = (locals["t"] > 100 and
                 sum(locals["episode_rewards"][-101:-1]) / 100 >= 199)
    return is_solved
- Now, let's create the environment and prepare the algorithm's parameters:

# create the environment
env = gym.make("CartPole-v0")

"""
Prepare learning parameters: network and learning rate.
The policy is a multi-layer perceptron.
"""
network = "mlp"
# set the learning rate of the algorithm
learning_rate = 1e-3
- We can use the deepq.learn() method to start the training and solve the task:

"""
launch learning on this environment using DQN
(ignore the exploration parameters for now)
"""
actor = deepq.learn(env, network=network, lr=learning_rate,
                    total_timesteps=100000, buffer_size=50000,
                    exploration_fraction=0.1,
                    exploration_final_eps=0.02, print_freq=10,
                    callback=callback)
After some time, depending on your hardware (it usually takes a few minutes), the learning phase terminates, and a trained CartPole actor is returned.
We should see the baselines logs reporting the agent's performance over time.
Consider the following example:
--------------------------------------
| % time spent exploring  | 2        |
| episodes                | 770      |
| mean 100 episode reward | 145      |
| steps                   | 6.49e+04 |
--------------------------------------
The following are the observations from the preceding logs:
- The episodes parameter reports the episode number we are referring to.
- mean 100 episode reward is the average return obtained over the last 100 episodes.
- steps is the number of training steps the algorithm has performed.
Now we can save our actor so that we can reuse it without retraining it:
print("Saving model to cartpole_model.pkl") actor.save("cartpole_model.pkl")
After calling the actor.save function, the cartpole_model.pkl file contains the trained model.
Now it is possible to use the model and visualize the agent's behavior.
The actor returned by deepq.learn is actually a callable that returns the action given the current observation – it is the agent's policy. We can call it by passing the current observation, and it returns the selected action:
# Visualize the policy
n_episodes = 5
n_timesteps = 1000

for episode in range(n_episodes):
    observation = env.reset()
    episode_return = 0
    for timestep in range(n_timesteps):
        # render the environment
        env.render()
        # select the action according to the actor
        action = actor(observation[None])[0]
        # call the env.step function
        observation, reward, done, _ = env.step(action)
        """
        since the reward is undiscounted, we can simply
        add it to the cumulated return
        """
        episode_return += reward
        if done:
            break
    # the episode is terminated: print the return and the number of steps
    print(f"Episode return {episode_return}, "
          f"Number of steps: {timestep}")
If you run the preceding code, you should see the agent's performance on the CartPole task.
You should get, as output, the return for each episode; it should be something similar to the following:
Episode return 200.0, Number of steps: 199
Episode return 200.0, Number of steps: 199
Episode return 200.0, Number of steps: 199
Episode return 200.0, Number of steps: 199
Episode return 200.0, Number of steps: 199
This means that our agent always reaches the maximum possible return for CartPole-v0 (200.0) and the maximum episode length (200 steps; the printed step count is 199 because timesteps are counted from 0).

We can compare the return obtained by the trained DQN agent with the return obtained by a random agent (Activity 1.01, Measuring the Performance of a Random Agent). The random agent yields an average return of about 20.0, while DQN obtains the maximum return possible for CartPole, which is 200.0.
In this section, we presented OpenAI Gym and OpenAI Baselines, the two main frameworks for RL research and experiments. There are many other frameworks for RL, each with its pros and cons. Gym is particularly valuable due to its unified environment interface, while OpenAI Baselines is very useful for understanding how sophisticated state-of-the-art RL algorithms are implemented and for comparing new algorithms with existing ones.
In the following section, we will explore some interesting RL applications in order to better understand the possibilities offered by the framework as well as its flexibility.
Applications of Reinforcement Learning
RL has exciting and useful applications in many different contexts. Recently, the usage of deep neural networks has augmented the number of possible applications considerably.
When used in a deep learning context, RL can also be referred to as deep RL.
The applications vary from games and video games to real-world applications, such as robotics and autonomous driving. In each of these applications, RL is a game-changer, allowing you to solve tasks that are considered to be almost impossible (or, at least, very difficult) without these techniques.
In this section, we will present some RL applications, describe the challenges of each application, and begin to understand why RL is preferred among other methods, along with its advantages and its drawbacks.
Games
Nowadays, RL is widely used in video games and board games.
Games are used to benchmark RL algorithms because, usually, they are very complex to solve yet easy to implement and to evaluate. Games also represent a simulated reality in which the agent can freely move and behave without affecting the real environment:

Figure 1.36: Breakout – one of the most famous Atari games
Note
The preceding screenshot has been sourced from the official documentation of OpenAI Gym. Please refer to the following link for more examples: https://gym.openai.com/envs/#atari.
Despite appearing to be secondary or relatively limited-use applications, games represent a useful benchmark for RL and, in general, artificial intelligence algorithms. Very often, artificial intelligence algorithms are tested on games due to the significant challenges that arise in these scenarios.
The two main characteristics required to play games are planning and real-time control.
An algorithm that is not able to plan cannot win strategic games; having a long-term plan is fundamental even in the early stages of a game. Planning is also crucial in real-world applications in which actions may have long-term consequences.
Real-time control is another fundamental challenge that requires an algorithm to be able to respond within a small timeframe. This challenge is similar to one an algorithm has to face when applied to real-world cases such as autonomous driving, robot control, and many others. In these cases, the algorithm can't evaluate all the possible actions or all the possible consequences of these actions; therefore, the algorithm should learn an efficient (and maybe compressed) state representation and should understand the consequences of its actions without simulating all of the possible scenarios.
Recently, RL has been able to exceed human performance in board games such as Go, and in video games such as Dota 2 and StarCraft, thanks to work done by DeepMind and OpenAI.
Go
Go is a very complex, highly strategic board game in which two players compete against each other. The aim is to use the game pieces, also called stones, to surround more territory than the opponent. At each turn, a player can place a stone on a vacant intersection of the board. At the end of the game, when neither player can place a stone, the player surrounding more territory wins.
Go has been studied for many years to understand the strategies and moves that lead a player to victory. Until recently, no algorithm succeeded in producing strong players – not even algorithms that work very well for similar games, such as chess. This difficulty is due to Go's huge search space, the variety of possible moves, and the average length (in terms of moves) of Go games, which is longer than that of chess games. RL, and in particular AlphaGo by DeepMind, recently succeeded in beating a professional human player on a full-sized board. AlphaGo is actually a mix of RL, supervised learning, and tree search algorithms, trained on an extensive set of games played by both human and artificial players. AlphaGo marked a real milestone in artificial intelligence history, made possible mainly by advances in RL algorithms and their improved efficiency.

The successor of AlphaGo is AlphaGo Zero. AlphaGo Zero was trained entirely through self-play, learning completely from itself with no human intervention (the name Zero comes from this characteristic). It is currently considered the strongest Go player in existence:

Figure 1.37: The Go board
Both AlphaGo and AlphaGo Zero used a deep Convolutional Neural Network (CNN) to learn a suitable game representation starting from the "raw" board. This peculiarity shows that a deep CNN can also extract features starting from a sparse representation such as the Go board. One of the main strengths of RL is that it can use, in a transparent way, machine learning models that are widely studied in other fields or problems.
Deep convolutional networks are usually used for classification or segmentation problems that, at first glance, might seem very different from RL problems. Actually, the way CNNs are used in RL is very similar to how they are used in classification or regression. The CNN of AlphaGo Zero, for example, takes the raw board representation and outputs the probabilities of the possible actions together with a value estimate; it can be seen as a classification and a regression problem at the same time. The difference is that the labels, or actions in the case of RL, are not given in the training set; rather, it is the algorithm itself that has to discover them through interaction. AlphaGo, the predecessor of AlphaGo Zero, used two different networks: one for action probabilities and another for value estimates. This technique is called actor-critic: the network tasked with predicting actions is called the actor, and the network that evaluates actions is called the critic.
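The following schematic NumPy sketch illustrates the actor-critic idea described above: a shared feature vector feeds two heads, one producing action probabilities (the actor) and one producing a scalar value estimate (the critic). The sizes and random weights are purely illustrative and do not correspond to the actual AlphaGo architecture:

import numpy as np

n_features = 8   # size of the shared feature vector
n_actions = 4    # number of possible actions

# shared features extracted from the board (random placeholder here)
features = np.random.rand(n_features)

# actor head: action probabilities via a linear layer and a softmax
actor_weights = np.random.rand(n_actions, n_features)
logits = actor_weights @ features
action_probabilities = np.exp(logits) / np.sum(np.exp(logits))

# critic head: a scalar value estimate via another linear layer
critic_weights = np.random.rand(1, n_features)
value_estimate = (critic_weights @ features).item()

print("Action probabilities:", action_probabilities)
print("Value estimate:", value_estimate)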
Dota 2
Dota 2 is a complex, real-time strategy game in which there are two teams of five players competing, with each player controlling a "hero." The characteristics of Dota, from an RL perspective, are as follows:
- Long Time Horizon: A Dota game involves around 20,000 moves and can last for 45 minutes. As a reference, a chess game typically ends within about 40 moves and a Go game within about 150 moves.
- Partially Observed State: In Dota, agents can only see a small portion of the full map, that is, only the portion around them. A strong player should make predictions about the position of the enemies and their actions. As a reference, Go and Chess are fully observable games where agents can see the whole situation and the actions taken by the opponents.
- High-Dimensional and Continuous Action Space: Dota has a vast number of actions available to each player at each step. The possible actions have been discretized by researchers into around 170,000 actions, with an average of about 1,000 valid actions per step. In comparison, the average number of actions in chess is 35 and, in Go, it is 250. With such a huge action space, learning becomes very difficult.
- High-Dimensional and Continuous Observation Space: While chess and Go have discrete observation spaces, Dota has a continuous state space with around 20,000 dimensions. The state space, as we will learn later in the book, includes all of the information available to players that must be taken into consideration when selecting an action. In a video game, the state space is represented by the characteristics and positions of the enemies, the state of the current player (including their abilities, equipment, and health status), and other domain-specific features.
OpenAI Five, the RL system able to exceed human performance at Dota, is composed of five neural networks collaborating together. The system learns to play by itself through self-play, playing the equivalent of 180 years of games per day. The algorithm used to train the five neural networks is Proximal Policy Optimization, which represents the current state of the art in RL algorithms.
Note
To read more on OpenAI Five, refer to the following link: https://openai.com/blog/openai-five/
StarCraft
StarCraft has characteristics that make it very similar to Dota, including a huge number of moves per game, imperfect information available to players, and high-dimensional state and action spaces. AlphaStar, the player developed by DeepMind, is the first artificial intelligence agent able to reach the top league without any game restrictions. AlphaStar uses machine learning techniques such as neural networks, self-play through RL, multi-agent learning methods, and imitation learning to learn from human players in a supervised way.
Note
For further reading on AlphaStar, refer to the following paper: https://arxiv.org/pdf/1902.01724.pdf
Robot Control
Robots are becoming ubiquitous nowadays and are widely used in various industries because of their ability to perform repetitive tasks precisely and efficiently. RL can be beneficial for robotics applications by simplifying the development of complex behaviors. At the same time, robotics applications represent a set of benchmarks and real-world validations for RL algorithms. Researchers test their algorithms on robotic tasks such as locomotion (for example, learning to move) or grasping (for example, learning how to grasp an object). Robotics offers unique challenges, such as the curse of dimensionality, the effective usage of samples (also called sample efficiency), the possibility of transferring knowledge from similar or simulated tasks, and the need for safety:

Figure 1.38: A robotic task from the Gym robotics suite
Note
The preceding diagram has been sourced from the official documentation for OpenAI Gym: https://gym.openai.com/envs/#robotics
Please refer to the link for more examples of robot control.
The curse of dimensionality is a challenge that can also be found in supervised learning applications, but there it is softened by restricting the space of possible solutions to a limited class of functions or by injecting prior knowledge into the models through architectural decisions. Robots usually have many degrees of freedom, making the space of possible states and actions very large.
Robots, by definition, interact with the physical environment. The interaction of a real robot with an environment is usually time-consuming, and it can be dangerous. RL algorithms usually require millions of samples (or episodes) to become effective, so sample efficiency is a problem in this field: the required interaction time may be impractical. Using the collected samples in a smart way is therefore key to successful RL-based robotics applications. A technique that can be used in these cases is so-called sim2real, in which an initial learning phase takes place in a simulated environment that is usually safer and faster than the real one. After this phase, the learned behavior is transferred to the real robot in the real environment. This technique requires either a simulated environment that is very similar to the real environment or strong generalization capabilities from the algorithm.
Autonomous Driving
Autonomous driving is another exciting application of RL. The main challenge this task presents is the lack of a precise specification. In autonomous driving, it is challenging to formalize what it means to drive well, whether steering in a given situation is good or bad, or whether the driver should accelerate or brake. As with robotic applications, autonomous driving can also be hazardous. Testing an RL algorithm, or, in general, a machine learning algorithm, on a driving task is very problematic and raises many concerns.
Aside from these concerns, the autonomous driving scenario fits very well into the RL framework. As we will explore later in the book, we can think of the driver as the decision-maker. At each step, they receive an observation, which includes the state of the road, the current velocity, the acceleration, and all of the car's characteristics. The driver, based on the current state, should decide how to act on the car's commands: steering, brakes, and acceleration. Designing a rule-based system that is able to drive in real situations is complicated due to the infinite number of different situations to confront. For this reason, a learning-based system would be far more efficient and effective in tasks such as this.
Note
There are many simulated environments available for developing efficient algorithms in the context of autonomous driving, listed as follows:
Voyage Deepdrive: https://news.voyage.auto/introducing-voyage-deepdrive-69b3cf0f0be6
AWS DeepRacer: https://aws.amazon.com/fr/deepracer/
In this section, we analyzed some interesting RL applications, their main challenges, and the main techniques used by researchers. Games, robotics, and autonomous driving are just some examples of real-world RL applications; there are many others. In the remainder of this book, we will dive deeper into RL; we will understand its components and the techniques presented in this chapter.
Summary
RL is one of the fundamental paradigms under the umbrella of machine learning. The principles of RL are very general and interdisciplinary, and they are not bound to a specific application.
RL considers the interaction of an agent with an external environment, taking inspiration from the human learning process. RL explicitly targets the need to explore efficiently and the exploration-exploitation trade-off appearing in almost all human problems; this is a peculiarity that distinguishes this discipline from others.
We started this chapter with a high-level description of RL, showing some interesting applications. We then introduced the main concepts of RL, describing what an agent is, what an environment is, and how an agent interacts with its environment. Finally, we practiced with Gym and Baselines, showing how these libraries make getting started with RL extremely simple.
In the next chapter, we will learn more about the theory behind RL, starting with Markov chains and arriving at MDPs. We will present the two functions at the core of almost all RL algorithms, namely the state-value function, which evaluates the goodness of states, and the action-value function, which evaluates the quality of the state-action pair.