TensorFlow Reinforcement Learning Quick Start Guide


The relationship between an agent and its environment

At a very basic level, RL involves an agent and an environment. An agent is an artificial intelligence entity that has certain goals; it must remain vigilant about anything that can get in the way of those goals and must, at the same time, pursue the things that help it attain them. An environment is everything that the agent can interact with. Let me explain further with an example involving an industrial mobile robot.

In a setting involving an industrial mobile robot navigating inside a factory, the robot is the agent, and the factory is the environment.

The robot has certain pre-defined goals, for example, to move goods from one side of the factory to the other without colliding with obstacles such as walls and/or other robots. The environment is the region available for the robot to navigate and includes all the places the robot can go, including the obstacles that it could crash into. So the primary task of the robot, or more precisely, the agent, is to explore the environment, understand how the actions it takes affect its rewards, be cognizant of the obstacles that can cause catastrophic crashes or failures, and then master the art of attaining its goals and improving its performance over time.

In this process, the agent inevitably interacts with the environment, and a given interaction can be good for the agent with respect to certain tasks but bad with respect to others. So, the agent must learn how the environment will respond to the actions it takes. This is a trial-and-error learning approach, and only after numerous such trials can the agent learn how the environment will respond to its decisions.
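This interaction loop can be sketched in a few lines of Python. The sketch below assumes the classic OpenAI Gym interface, where reset() returns an initial state and step() returns a four-tuple; the CartPole-v1 environment and the random choose_action placeholder are used purely for illustration and are not the factory-robot example from the text.

```python
import gym  # assumes the classic OpenAI Gym API

env = gym.make("CartPole-v1")  # illustrative environment, not the factory robot

def choose_action(state):
    """Placeholder policy: act randomly; RL replaces this with a learned policy."""
    return env.action_space.sample()

for episode in range(5):
    state = env.reset()          # start a new trial and observe the initial state
    done = False
    total_reward = 0.0
    while not done:
        action = choose_action(state)                 # agent decides in the current state
        state, reward, done, info = env.step(action)  # environment responds with next state and reward
        total_reward += reward
    print(f"Episode {episode}: total reward = {total_reward}")
```

Each pass through the inner loop is one interaction: the agent acts, the environment returns a new state and a reward, and over many episodes the agent can learn from these trial-and-error outcomes.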

Let's now understand what the state space of an agent is, and the actions that the agent performs to explore the environment.

Defining the states of the agent

In RL parlance, states represent the current situation of the agent. For example, in the case of the industrial mobile robot agent described previously, the state at a given time instant is the location of the robot inside the factory: where it is located and its orientation, or more precisely, the pose of the robot. For a robot that has joints and effectors, the state can also include the precise locations of the joints and effectors in three-dimensional space. For an autonomous car, its state can represent its speed, location on a map, distance to other obstacles, torques on its wheels, the rpm of the engine, and so on.

States are usually deduced from sensors in the real world; for instance, measurements from odometers, LIDARs, radars, and cameras. States can be a one-dimensional vector of real numbers or integers, or two-dimensional camera images, or even higher-dimensional, for instance, three-dimensional voxels. There are really no precise limitations on states; the state just represents the current situation of the agent.
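As a concrete illustration of these different shapes, the short NumPy sketch below builds a vector state, an image state, and a voxel state; the specific sizes and values are made up for the example.

```python
import numpy as np

# One-dimensional state: e.g., a mobile robot's pose (x, y, heading in radians)
robot_state = np.array([2.5, -1.0, 0.35], dtype=np.float32)

# Two-dimensional state: e.g., an 84x84 grayscale camera image
image_state = np.zeros((84, 84), dtype=np.float32)

# Higher-dimensional state: e.g., a 32x32x32 occupancy (voxel) grid
voxel_state = np.zeros((32, 32, 32), dtype=np.float32)

print(robot_state.shape, image_state.shape, voxel_state.shape)
```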

In RL literature, states are typically represented as s_t, where the subscript t is used to denote the time instant corresponding to the state.

Defining the actions of the agent

The agent performs actions to explore the environment. Determining which action to take in each state is the primary problem in RL; ideally, we strive to obtain optimal actions.

An action is the decision an agent takes in a certain state, s_t. Typically, it is represented as a_t, where, as before, the subscript t denotes the time instant. The actions that are available to an agent depend on the problem. For instance, an agent in a maze can decide to take a step north, south, east, or west. These are called discrete actions, as there is a fixed number of possibilities. On the other hand, for an autonomous car, actions can be the steering angle, throttle value, brake value, and so on; these are called continuous actions, as they can take real number values in a bounded range. For example, the steering angle can be 40 degrees from the north-south line, the throttle can be 60% down, and so on.

Thus, actions a_t can be either discrete or continuous, depending on the problem at hand. Some RL approaches handle discrete actions, while others are suited for continuous actions.
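The distinction is easy to see with Gym's action-space objects. In the sketch below, the four maze moves map to a Discrete space and a made-up [steering angle, throttle, brake] vector maps to a Box space; the bounds are illustrative values, not taken from the book.

```python
import numpy as np
from gym import spaces  # standard Gym space classes

# Discrete actions: e.g., a maze agent choosing one of north, south, east, west
maze_actions = spaces.Discrete(4)

# Continuous actions: e.g., [steering angle in degrees, throttle 0-1, brake 0-1]
car_actions = spaces.Box(
    low=np.array([-40.0, 0.0, 0.0], dtype=np.float32),
    high=np.array([40.0, 1.0, 1.0], dtype=np.float32),
    dtype=np.float32,
)

print(maze_actions.sample())  # an integer in {0, 1, 2, 3}
print(car_actions.sample())   # a real-valued 3-vector within the bounds
```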

A schematic of the agent and its interaction with the environment is shown in the following diagram:

Figure 1: Schematic showing the agent and its interaction with the environment

Now that we know what an agent is, we will look at the policies that the agent learns, what value and advantage functions are, and how these quantities are used in RL.

Understanding policy, value, and advantage functions

A policy defines the guidelines for an agent's behavior at a given state. In mathematical terms, a policy is a mapping from a state of the agent to the action to be taken at that state. It is like a stimulus-response rule that the agent follows as it learns to explore the environment. In RL literature, it is usually denoted as π(a_t|s_t); that is, it is a conditional probability distribution of taking an action a_t in a given state s_t. Policies can be deterministic, wherein the exact value of a_t is known at s_t, or stochastic, where a_t is sampled from a distribution; typically this is a Gaussian distribution, but it can also be any other probability distribution.
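To make the deterministic/stochastic distinction concrete, the sketch below contrasts the two for a one-dimensional continuous action. The linear mapping from state to action and the fixed standard deviation are assumptions made purely for illustration, not the book's policy parameterization.

```python
import numpy as np

def deterministic_policy(state, weights):
    """Deterministic policy: the action is a fixed function of the state."""
    return float(np.dot(weights, state))

def gaussian_policy(state, weights, std=0.1):
    """Stochastic policy: sample the action from a Gaussian whose mean depends on the state."""
    mean = float(np.dot(weights, state))
    return np.random.normal(loc=mean, scale=std)

state = np.array([0.5, -0.2])
weights = np.array([1.0, 2.0])

print(deterministic_policy(state, weights))  # same action every call for this state
print(gaussian_policy(state, weights))       # a sampled action; varies from call to call
```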

In RL, value functions are used to define how good a state of an agent is. They are typically denoted by V(s) at state s and represent the expected long-term average rewards for being in that state. V(s) is given by the following expression where E[.] is an expectation over samples:
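One common way to write this expression, under a policy π and writing γ for a discount factor in [0, 1) and r_t for the reward received at time t (symbols introduced here for illustration), is the expected discounted sum of future rewards:

```latex
V^{\pi}(s) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s \right]
```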

Note that V(s) does not care about the optimal actions that an agent needs to take at state s. Instead, it is a measure of how good a state is. So, how can an agent figure out the optimal action a_t to take in a given state s_t at time instant t? For this, you can also define an action-value function, given by the following expression:
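In the same notation, a common form of the action-value function fixes the first action to a and follows the policy π thereafter:

```latex
Q^{\pi}(s, a) = \mathbb{E}\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s,\; a_{0} = a \right]
```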

Note that Q(s,a) is a measure of how good it is to take action a in state s and follow the same policy thereafter. So, it is different from V(s), which is a measure of how good a given state is. We will see in the following chapters how the value function is used to train the agent under the RL setting.

The advantage function is defined as the following:

A(s,a) = Q(s,a) - V(s)

This advantage function is known to reduce the variance of policy gradients, a topic that will be discussed in depth in a later chapter.
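As a toy numeric illustration of the definition above, with made-up estimates of Q(s,a) for three actions in one state and of V(s) for that state:

```python
import numpy as np

# Made-up estimates for illustration only
q_values = np.array([1.2, 0.8, 1.5])  # Q(s, a) for three candidate actions in state s
v_value = 1.1                         # V(s) for that state

advantages = q_values - v_value       # A(s, a) = Q(s, a) - V(s)
print(advantages)  # positive entries mark actions better than the policy's average at s
```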

The classic RL textbook is Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto, The MIT Press, 1998.

We will now define what an episode is and its significance in an RL context.
