Nowadays, most computers are based on a symbolic elaboration, that is, the problem is first encoded in a set of variables and then processed using an explicit algorithm that, for each possible input of the problem, offers an adequate output. However, there are problems in which resolution with an explicit algorithm is inefficient or even unnatural, for example with a speech recognizer; tackling this kind of problem with the classic approach is inefficient. This and other similar problems, such as autonomous navigation of a robot or voice assistance in performing an operation, are part of a very diverse set of problems that can be addressed directly through solutions based on reinforcement learning.

Reinforcement learning is a very exciting part of machine learning, used in applications ranging from autonomous cars to playing games. Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. To do this, we use external feedback signals (reward signals) generated by the environment according to the choices made by the algorithm. A correct choice will result in a reward, while an incorrect choice will lead to a penalization of the system. All of this is in order to achieve the best result obtainable.

The topics covered in this chapter are the following:

- An overview of machine learning
- Reinforcement learning
**Markov Decision Process**(**MDP**)**Temporal difference**(**TD**) learning- Q-learning
- Deep Q-learning networks

At the end of the chapter, you will be fully introduced to the power of reinforcement learning and will learn the different approaches to this technique. Several reinforcement learning methods will be covered.

Machine learning is a multidisciplinary field created at the intersection of, and by the synergy between, computer science, statistics, neurobiology, and control theory. Its emergence has played a key role in several fields and has fundamentally changed the vision of software programming. If the question before was, how to program a computer, now the question becomes is how computers will program themselves. Thus, it is clear that machine learning is a basic method that allows a computer to have its own intelligence.

As might be expected, machine learning interconnects and coexists with the study of, and research on, human learning. Like humans, whose brain and neurons are the foundation of insight, **Artificial Neural Networks** (**ANNs**) are the basis of any decision-making activity of the computer.

Machine learning refers to the ability to learn from experience without any outside help, which is what we humans do, in most cases. Why should it not be the same for machines?

From a set of data, we can find a model that approximates the set by the use of machine learning. For example, we can identify a correspondence between input variables and output variables for a given system. One way to do this is to postulate the existence of some kind of mechanism for the parametric generation of data, which, however, does not know the exact values of the parameters. This process typically makes reference to statistical techniques, such as the following:

- Induction
- Deduction
- Abduction

The extraction of general laws from a set of observed data is called induction; it is opposed to deduction, in which, starting from general laws, we want to predict the value of a set of variables. Induction is the fundamental mechanism underlying the scientific method in which we want to derive general laws, typically described in a mathematical language, starting from the observation of phenomena.

This observation includes the measurement of a set of variables and, therefore, the acquisition of data that describes the observed phenomena. Then, the resultant model can be used to make predictions on additional data. The overall process in which, starting from a set of observations, we want to make predictions for new situations, is called inference. Therefore, inductive learning starts from observations arising from the surrounding environment and generalizes obtaining knowledge that will be valid for not-yet-observed cases; at least we hope so.

Inductive learning is based on learning by example: knowledge gained by starting from a set of positive examples that are instances of the concept to be learned, and negative examples that are non-instances of the concept. In this regard, Galileo Galilei's (1564-1642) phrase is particularly clear:

"Knowledge forms part of the experience from which hypotheses are derived, based on quantitative data, which must be verified through experiments, also mentally, understood as artificial relationships between quantified variables, to arrive at the formulation of the law in the form of an equation."

The following diagram consists of a flowchart showing inductive and deductive learning:

A question arises spontaneously: why do machine learning systems work where traditional algorithms fail? The reasons for the failure of traditional algorithms are numerous and typically include the following:

**Difficulty in problem formalization**: For example, each of us can recognize our friends from their voices. But probably none can describe a sequence of computational steps enabling them to recognize the speaker from the recorded sound.**High number of variables at play**: When considering the problem of recognizing characters from a document, specifying all parameters that are thought to be involved can be particularly complex. In addition, the same formalization applied in the same context but on a different idiom could prove inadequate.**Lack of theory**: Imagine you have to predict exactly the performance of financial markets in the absence of specific mathematical laws.**Need for customization**: The distinction between interesting and uninteresting features depends significantly on the perception of the individual user.

A quick analysis of these issues highlights the lack of experience in all cases.

The power of machine learning is due to the quality of its algorithms, which have been improved and updated over the years; these are divided into several main types depending on the nature of the signal used for learning or the type of feedback adopted by the system.

They are as follows:

**Supervised learning**: The algorithm generates a function that links input values to a desired output through the observation of a set of examples in which each data input has its relative output data, which is used to construct predictive models.**Unsupervised learning**: The algorithm tries to derive knowledge from a general input without the help of a set of pre-classified examples that are used to build descriptive models. A typical example of the application of these algorithms is found in search engines.**Reinforcement learning**: The algorithm is able to learn depending on the changes that occur in the environment in which it is performed. In fact, since every action has some effect on the environment concerned, the algorithm is driven by the same feedback environment. Some of these algorithms are used in speech or text recognition.

The subdivision that we have just proposed does not prohibit the use of hybrid approaches between some or all of these different areas, which have often recorded good results.

Supervised learning is a machine learning technique that aims to program a computer system so that it can resolve the relevant tasks automatically. To do this, the input data is included in a set *I* (typically vectors). Then, the set of output data is fixed as set *O*, and finally, it defines a function *f* that associates each input with the correct answer. Such information is called a training set. This workflow is presented in the following diagram:

All supervised learning algorithms are based on the following thesis: if an algorithm provides an adequate number of examples, it will be able to create a derived function*B*that will approximate the desired function*A*.

If the approximation of the desired function is adequate, then when the input data is offered to the derived function, this function should be able to provide output responses similar to those provided by the desired function and then acceptable. These algorithms are based on the following concept: similar inputs correspond to similar outputs.

Generally, in the real world, this assumption is not valid; however, some situations exist in which it is acceptable. Clearly, the proper functioning of such algorithms depends significantly on the input data. If there are only a few training inputs, the algorithm might not have enough experience to provide a correct output. Conversely, many inputs may make it excessively slow, since the derivative function generated by a large number of inputsincreases the training time. Hence the slowness.

Moreover, experience shows that this type of algorithm is very sensitive to noise; even a few pieces of incorrect data can make the entire system unreliable and lead to wrong decisions.

In supervised learning, it's possible to split problems based on the nature of the data. If the output value is categorical, such as membership/non-membership of a certain class, then it is a classification problem. If the output is a continuous real value in a certain range, then it is a regression problem.

The aim of unsupervised learning is to automatically extract information from databases. This process occurs without a priori knowledge of the contents to be analyzed. Unlike supervised learning, there is no information on the membership classes of examples, or more generally on the output corresponding to a certain input. The goal is to get a model that is able to discover interesting properties: groups with similar characteristics (clustering), for instance. Search engines are an example of an application of these algorithms. Given one or more keywords, they are able to create a list of links related to our search.

The validity of these algorithms depends on the usefulness of the information they can extract from the databases. These algorithms work by comparing data and looking for similarities or differences. Available data concerns only the set of features that describe each example.

The following diagram shows supervised learning (on the left) and unsupervised learning examples (on the right):

They show great efficiency with elements of numeric type, but are much less accurate with non-numeric data. Generally, they work properly in the presence of data that is clearly identifiable and contains an order or a clear grouping.

Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. This programming technique is based on the concept of receiving external stimuli, the nature of which depends on the algorithm choices. A correct choice will involve a reward, while an incorrect choice will lead to a penalty. The goal of the system is to achieve the best possible result, of course.

In supervised learning, there is a teacher that tells the system the correct output (learning with a teacher). This is not always possible. Often, we have only qualitative information (sometimes binary, right/wrong, or success/failure).

The information available is called reinforcement signals. But the system does not give any information on how to update the agent's behavior (that is, weights). You cannot define a cost function or a gradient. The goal of the system is to create smart agents that have machinery able to learn from their experience.

When developing an application that uses machine learning, we will follow a procedure characterized by the following steps:

**Collecting the data**: Everything starts from the data, no doubt about it; but one might wonder from where so much data comes. In practice, it is collected through lengthy procedures that may, for example, derive from measurement campaigns or face-to-face interviews. In all cases, the data is collected in a database so that it can then be analyzed to derive knowledge.**Preparing the data**: We have collected the data; now, we have to prepare it for the next step. Once we have this data, we must make sure it is in a format usable by the algorithm we want to use. To do this, you may need to do some formatting. Recall that some algorithms need data in an integer format, whereas others require data in the form of strings, and finally others need it to be in a special format. We will get to this later, but the specific formatting is usually simple compared to the data collection.**Exploring the data**: At this point, we can look at data to verify that it is actually working and that we do not have a bunch of empty values. In this step, through the use of plots, we can recognize patterns and whether or not there are some data points that are vastly different from the rest of the set. Plotting data in one, two, or three dimensions can also help.**Training the algorithm**: Now, let's get serious. In this step, the machine learning algorithm works on the definition of the model and therefore deals with the training. The model starts to extract knowledge from the large amounts of data that we had available, and from which nothing has been explained so far. For unsupervised learning, there's no training step because you don't have a target value.**Testing the algorithm**: In this step, we use the information learned in the previous step to see if the model actually works. The evaluation of an algorithm is for seeing how well the model approximates the real system. In the case of supervised learning, we have some known values that we can use to evaluate the algorithm. In unsupervised learning, we may need to use some other metrics to evaluate success. In both cases, if we are not satisfied, we can return to the previous steps, change some things, and retry the test.

**Evaluating the algorithm**: We have reached the point where we can apply what has been done so far. We can assess the approximation ability of the model by applying it to real data. The model, previously trained and tested, is then valued in this phase.**Improving algorithm performance**: Finally, we can focus on the finishing steps. We've verified that the model works, we have evaluated the performance, and now we are ready to analyze the whole process to identify any possible room for improvement.

Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. This programming technique is based on the concept of receiving external stimuli that depend on the actions chosen by the agent. A correct choice will involve a reward, while an incorrect choice will lead to a penalty. The goal of the system is to achieve the best possible result, of course.

These mechanisms derive from the basic concepts of machine learning (learning from experience), in an attempt to simulate human behavior. In fact, in our mind, we activate brain mechanisms that lead us to chase and repeat what, in us, produces feelings of gratification and wellbeing. Whenever we experience moments of pleasure (food, sex, love, and so on), some substances are produced in our brains that work by reinforcing that same stimulus, emphasizing it.

Along with this mechanism of neurochemical reinforcement, an important role is represented by memory. In fact, the memory collects the experience of the subject in order to be able to repeat it in the future. Evolution has endowed us with this mechanism to push us to repeat gratifying experiences in the direction of the best solutions.

This is why we so powerfully remember the most important experiences of our life: experiences, especially those that are powerfully rewarding, are impressed in memories and condition our future explorations. Previously, we have seen that learning from experience can be simulated by a numerical algorithm in various ways, depending on the nature of the signal used for learning and the type of feedback adopted by the system.

The following diagram shows a flowchart that displays an agent's interaction with the environment in a reinforcement learning setting:

Scientific literature has taken an uncertain stance on the classification of learning by reinforcement as a paradigm. In fact, in the initial phaseof the literature,it was considered a special case of supervised learning, after which it was fully promoted as the third paradigm of machine learning algorithms. It isappliedin different contexts in which supervised learning is inefficient: the problems of interaction with the environment are a clear example.

The following list shows the steps to follow to correctly apply a reinforcement learning algorithm:

- Preparation of the agent
- Observation of the environment
- Selection of the optimal strategy
- Execution of actions
- Calculation of the corresponding reward (or penalty)
- Development of updating strategies (if necessary)
- Repetition of steps 2 through 5 iteratively until the agent learns the optimal strategies

Reinforcement learning is based on a theory from psychology, elaborated following a series of experiments performed on animals. In particular, Edward Thorndike (American psychologist) noted that if a cat is given a reward immediately after the execution of a behavior considered correct, then this increases the probability that this behavior will repeat itself. On the other hand, in the face of unwanted behavior, the application of a punishment decreases the probability of a repetition of the error.

On the basis of this theory, after defining a goal to be achieved, reinforcement learning tries to maximize the rewards received for the execution of the action or set of actions that allows to reach the designated goal.

Reinforcement learning can be seen as a special case of the interaction problem, in terms of achieving a goal. The entity that must reach the goal is called an agent. The entity with which the agent must interact is called the environment, which corresponds to everything that is external to the agent.

So far, we are more focused on the term *agent*, but what does it represent? The agent (software) is a software entity that performs services on behalf of another program, usually automatically and invisibly. These pieces of software are also called smart agents.

What follows is a list of the most important features of an agent:

- It can choose between a continuous and a discrete set for an action on the environment.
- The action depends on the situation. The situation is summarized in the system state.
- The agent continuously monitors the environment (input) and continuously changes the status
- The choice of the action is not trivial and requires a certain degree of intelligence.
- The agent has a smart memory.

The agent has a goal-directed behavior, but acts in an uncertain environment that is not known a priori or only partially known. An agent learns by interacting with the environment. Planning can be developed while learning about the environment through the measurements made by the agent itself. This strategy is close to trial-and-error theory.

### Note

Trial and error is a fundamental method of problem solving. It is characterized by repeated, varied attempts that are continued until success, or until the agent stops trying.

The agent-environment interaction is continuous: the agent chooses an action to be taken, and in response, the environment changes state, presenting a new situation to be faced.

In the particular case of reinforcement learning, the environment provides the agent with a reward. It is essential that the source of the reward is the environment to avoid the formation, within the agent, of a personal reinforcement mechanism that would compromise learning.

The value of the reward is proportional to the influence that the action has in reaching the objective, so it is positive or high in the case of a correct action, or negative or low for an incorrect action.

In the following list are some examples of real life in which there is an interaction between agent and environment to solve a problem:

- A chess player, for each move, has information on the configurations of pieces that can be created, and on the possible countermoves of the opponent.
- A little giraffe, in just a few hours, learns to get up and run.
- A truly autonomous robot learns to move around a room to get out of it. For example: Roomba Robot Vacuum.
- The parameters of a refinery (oil pressure, flow, and so on) are set in real time, so as to obtain the maximum yield or maximum quality. For example, if particularly dense oil arrives, then the flow rate to the plant is modified to allow an adequate refining.

All the examples that we examined have the following characteristics in common:

- Interaction with the environment
- A specific goal that the agent wants to get
- Uncertainty or partial knowledge of the environment

From the analysis of these examples, it is possible to make the following observations:

- The agent learns from its own experience.
- The actions change the status (the situation), the possibilities of choice in the future change (delayed reward).
- The effect of an action cannot be completely predicted.
- The agent has a global assessment of its behavior.
- It must exploit this information to improve its choices. Choices improve with experience.
- Problems can have a finite or infinite time horizon.

Essentially, the agent receives sensations from the environment through its sensors. Depending on its feelings, the agent decides what actions to take in the environment. Based on the immediate result of its actions, the agent can be rewarded.

If you want to use an automatic learning method, you need to give a formal description of the environment. It is not important to know exactly how the environment is made; what is interesting is to make general assumptions about the properties that the environment has. In reinforcement learning, it is usually assumed that the environment can be described by a MDP.

To avoid load problems and computational difficulties, the agent-environment interaction is considered an MDP. An MDP is a discrete-time stochastic control process.

Stochastic processes are mathematical models used to study the evolution of phenomena following random or probabilistic laws. It is known that in all natural phenomena, both by their very nature and by observational errors, a random or accidental component is present. This component causes the following: at every instance of *t,* the result of the observation on the phenomenon is a random number or random variable *s _{t}*. It is not possible to predict with certainty what the result will be; one can only state that it will take one of several possible values, each of which has a given probability.

A stochastic process is called Markovian when, having chosen a certain instance of *t* for observation, the evolution of the process, starting with *t*, depends only on *t* and does not depend in any way on the previous instances. Thus, a process is Markovian when, given the moment of observation, only this instance determines the future evolution of the process, while this evolution does not depend on the past.

In a Markov process, at each time step, the process is in a state *s € S*, and the decision maker may choose any action *a € A* that is available in state *s*. The process responds at the next timestamp by randomly moving into a new state *s'*, and giving the decision maker a corresponding reward *r(s,s')*.

The following diagram shows the agent-environment interaction in a MDP:

The agent-environmentinteractionshown in theprecedingdiagram can be schematized as follows:

- The agent and the environment interact at discrete intervals over time,
*t = 0, 1, 2… n*. - At each interval, the agent receives a representation of the state
*s*of the environment._{t} - Each element
*s*, where_{t}∈ S*S*is the set of possible states. - Once the state is recognized, the agent must take an action
*a*, where_{t}∈ A(s_{t})*A(s*is the set of possible actions in the state_{t})*s*._{t} - The choice of the action to be taken depends on the objective to be achieved and is mapped through the policy indicated with the symbol
*π*(discounted cumulative reward), which associates the action with*a*for each state_{t }∈ A(s)*s*. The term*π*represents the probability that action_{t}(s,a)*a*is carried out in the state*s*. - During the next time interval
*t + 1*, as part of the consequence of the action*a*, the agent receives a numerical reward_{t}*r*corresponding to the action previously taken_{t + 1}∈ R*a*._{t} - The consequence of the action represents, instead, the new state
*s*. At this point the agent_{t} - This iteration repeats itself until the achievement of the objective by the agent.

The definition of the status *s _{t + 1}* depends on the previous state and the action taken (MDP), that is as follows:

*s _{t + 1} = δ (s_{t},a_{t})*

Here, *δ* represents the status function.

In summary:

- In an MDP, the agent can perceive the state
*s ∈ S*in which it is and has an*A*set of actions at its disposal - At each discrete interval of time
*t*, the agent detects the current status*s*and decides to implement an action_{t}*a*_{t}∈ A - The environment responds by providing a reward (a reinforcement)
*r*and moving into the state_{t}= r (st, at)*s*_{t + 1}= δ (st, at) - The
*r*and*δ*functions are part of the environment; they depend only on the current state and action (not the previous ones) and are not necessarily known to the agent - The goal of reinforcement learning is to learn a policy that, for each state
*s*in which the system is located, indicates to the agent an action to maximize the total reinforcement received during the entire action sequence

Let's go deeper into some of the terms used:

- A reward function defines the goal in a reinforcement learning problem. It maps the detected states of the environment into a single number, thereby defining a reward. As already mentioned, the only goal is to maximize the total reward it receives in the long term. The reward function then defines what the good and bad events are for the agent. The reward function has the need to be correct, and it can be used as a basis for changing the policy. For example, if an action selected by the policy is followed by a low reward, the policy can be changed to select other actions in that situation in the next step.
- A policy defines the behavior of the learning agent at a given time. It maps both the detected states of the environment and the actions to take when they are in those states. This corresponds to what, in psychology, would be called a set of rules or associations of stimulus response. The policy is the fundamental part of a reinforcing learning agent, in the sense that it alone is enough to determine behavior.
- A value function represents how good a state is for an agent. It is equal to the total reward expected for an agent from the status
*s*. The value function depends on the policy with which the agent selects the actions to be performed. - An action-value function returns the value, that is, the expected return (overall reward) for using action
*a*in a certain state*s*, following a policy.

In the previous section, we said that the goal of reinforcement learning is to learn a policy that, for each state *s* in which the system is located, indicates to the agent an action to maximize the total reward received during the entire action sequence. How can we maximize the total reinforcement received during the entire sequence of actions?

The total reinforcement derived from the policy is calculated as follows:

Here, *r _{T}* represents the reward of the action that drives the environment in the terminal state

*s*.

_{T}A possible solution to the problem is to associate the action that provides the highest reward to each individual state; that is, we must determine an optimal policy such that the previous quantity is maximized.

For problems that do not reach the goal or terminal state in a finite number of steps (continuing tasks), *R _{t}* tends to infinity.

In these cases, the sum of the rewards that one wants to maximize diverges at the infinite, so this approach is not applicable. Then, it is necessary to develop an alternative reinforcement technique.

The technique that best suits the reinforcement learning paradigm turns out to be the discounted cumulative reward, which tries to maximize the following quantity:

Here, *γ* is called a discount factor and represents the importance for future rewards. This parameter can take the values *0 ≤ γ ≤ 1*, with the following value:

- If
*γ <1*, the sequence*r*will converge to a finite value_{t} - If
*γ = 0*, the agent will have no interest in future rewards, but will try to maximize the reward only for the current state - If
*γ = 1*, the agent will try to increase future rewards even at the expense of the immediate ones

The discount factor can be modified during the learning process to highlight particular actions or states. An optimal policy can lead to the reinforcement obtained in performing a single action to be low (or even negative), provided that this leads to greater reinforcement overall.

Ideally, the agent must associate with each action *a _{t}* the respective reward

*r*, in order to then choose the most rewarding behavior for achieving the goal. This approach is therefore impracticable for complex problems in which the number of states is particularly high and, consequently, the possible associations increase exponentially.

This problem is called the exploration-exploitation dilemma. Ideally, the agent must explore all possible actions for each state, finding the one that is actually most rewarded for exploiting in achieving its goal.

Thus, decision-making involves a fundamental choice:

**Exploitation**: Make the best decision, given current information**Exploration**: Collect more information

In this process, the best long-term strategy can lead to considerable sacrifices in the short term. Therefore, it is necessary to gather enough information to make the best decisions.

The exploration-exploitation dilemma makes itself known whenever we try to learn something new. Often, we have to decide whether to choose what we already know (exploitation), leaving our cultural baggage unaltered, or choosing something new and learning more in this way (exploration). The second choice puts us at the risk of making the wrong choices. This is an experience that we often face; think, for example, about the choices we make in a restaurant when we are asked to choose between the dishes on the menu:

- We can choose something that we already know and that, in the past, has given us back a known reward with gratification (exploitation), such as pizza (who does not know the goodness of a margherita pizza?)
- We can try something new that we have never tasted before and see what we get (exploration), such as lasagna (alas, not everyone knows the magic taste of a plate of lasagna)

The choice we will make will depend on many boundary conditions: the price of the dishes, the level of hunger, knowledge of the dishes, and so on. What is important is that the study of the best way to make these kinds of choices has demonstrated that optimal learning sometimes requires us to make bad choices. This means that, sometimes, you have to choose to avoid the action you deem most rewarding and take an action that you feel is less rewarding. The logic is that these actions are necessary to obtain a long-term benefit: sometimes, you need to get your hands dirty to learn more.

The following are more examples of adopting this technique for real-life cases:

- Selection of a store:
**Exploitation**: Go to your favorite store**Exploration**: Try a new store

- Choice of a route:
**Exploitation**: Choose the best route so far**Exploration**: Try a new route

In practice, in very complex problems, convergence to a very good strategy would be too slow.

A good solution to the problem is to find a balance between exploration and exploitation:

- An agent that limits itself to exploring will always act in a casual way in every state, and it is evident that the convergence to an optimal strategy is impossible
- If an agent explores little, it will always use the usual actions, which may not be optimal ones

Finally, we can say that at every step the agent has to choose between repeating what it has done so far, or trying out new movements that could achieve better results.

As we have seen in the previous sections, reinforcement learning is a programming technique that aims to develop algorithms that can learn and adapt to changes in the environment. This programming technique is based on the assumption of the agent being able to receive stimuli from the outside and to change its actions according to these stimuli. So, a correct choice will result in a reward while an incorrect choice will lead to a penalization of the system.

The goal of the system is to achieve the highest possible reward and consequently the best possible result. This result can be obtained through two approaches:

- The first approach involves evaluating the choices of the algorithm and then rewarding or punishing the algorithm based on the result. These techniques can also adapt to substantial changes in the environment. An example is the image recognition programs that improve their performance with use. In this case we can say that learning takes place continuously.
- In the second approach, a first phase is applied in which the algorithm is previously trained, and when the system is considered reliable, it is crystallized and no longer modifiable. This derives from the observation that constantly evaluating the actions of the algorithm can be a process that cannot be automated or that is very expensive.

These are only implementation choices, so it may happen that an algorithm includes the newly analyzed approaches.

So far, we have introduced the basic concepts of reinforcement learning. Now, we can analyze the various ways in which these concepts have been transformed into algorithms. In this section, we will list them, providing an overview, and we will deepen them in the practical cases that we will address in the following chapters.

**Dynamic Programming** (**DP**) represents a set of algorithms that can be used to calculate an optimal policy given a perfect model of the environment in the form of an MDP. The fundamental idea of DP, as well as reinforcement learning in general, is the use of state values and actions to look for good policies.

The DP methods approach the resolution of MDP processes through the iteration of two processes called policy evaluation and policy improvement:

- The policy evaluation algorithm consists of applying an iterative method to the resolution of the Bellman equation. Since convergence is guaranteed to us only for
*k → ∞*, we must be content to have good approximations by imposing a stopping condition. - The policy improvement algorithm improves policy based on current values.

### Note

A Bellman equation, named after *Richard E. Bellman*, an American applied mathematician, is a necessary condition for the optimality associated with the DP method. It allows us to obtain the value of a decision problem at some point in time in terms of payoff from some initial choices and the value of the remaining decision problem resulting from those initial choices.

The iteration of the two aforementioned processes is shown in the following diagram:

A disadvantage of the policy iteration algorithm is that we have to evaluate the policy at every step. This involves an iterative process in which we do not know a priori the time of convergence, which will depend, among other things, on how the starting policy was chosen.

One way to overcome this drawback is to cut off the evaluation of the policy at a specific step. This operation does not change the guarantee of convergence to the optimal value. A special case in which the assessment of the policy is blocked step by step (also called sweep) defines the value iteration algorithm. In the value iteration algorithm, a single iteration of calculation of the values is performed between each step of the policy improvement. In the following snippet, a pseudocode for a value-iteration algorithm is shown:

```
initialize value function V
repeat
for all s
for all a
update Q function
V = max Q function
until V converge
```

Thus, in the value iteration algorithm, the system was initiated by setting a random value function. Starting from this value, a new function is sought in an iterative process which, compared to the previous one, has been improved, until reaching the optimal value function.

As we said previously, the DP algorithms are therefore essentially based on two processes that take place in parallel: policy evaluation and policy improvement. The repeated execution of these two processes makes the general process converge toward the optimal solution. In the policy iteration algorithm, the two phases alternate and one ends before the other begins.

In policy iteration algorithms, we start by initializing the system with a random policy, so we first must find the value function of that policy. This phase is called the policy evaluation step. So, we find a new policy, also improved compared to the previous one, based on the previous value function, and so on. In this process, each policy presents an improvement over the previous one until the optimal policy is reached. In the following snippet, a pseudocode for a policy iteration algorithm is shown:

initialize value function V and policy π repeat evaluate V using policy π improve π using V until convergence

DP methods operate through the entire set of states that can be assumed by the environment, performing a complete backup for each state at each iteration. Each update operation performed by the backup updates the value of a state based on the values of all possible successor states, weighed for their probability of occurrence, induced by the policy of choice and by the dynamics of the environment. Full backups are closely related to the Bellman equation; they are nothing more than the transformation of the equation into assignment instructions.

When a complete backup iteration does not bring any change to the state values, convergence is obtained and, therefore, the final state values fully satisfy the Bellman equation. The DP methods are applicable only if there is a perfect model of the environment, which must be equivalent to an MDP.

Precisely for this reason, the DP algorithms are of little use in reinforcement learning, both for their assumption of a perfect model of the environment, and for the high and expensive computation, but it is still opportune to mention them, because they represent the theoretical basis of reinforcement learning. In fact, all the methods of reinforcement learning try to achieve the same goal as that of the DP methods, only with lower computational cost and without the assumption of a perfect model of the environment.

The DP methods converge to the optimal solution with a number of polynomial operations with respect to the number of states *n* and actions *m*, against the number of exponential operations *m*n* required by methods based on direct search.

The DP methods update the estimates of the values of the states, based on the estimates of the values of the successor states, or update the estimates on the basis of past estimates. This represents a special property, which is called bootstrapping. Several methods of reinforcement learning perform bootstrapping, even methods that do not require a perfect model of the environment, as required by the DP methods.

The **Monte Carlo** (**MC**) methods for estimating the *value* function and discovering excellent policies do not require the presence of a model of the environment. They are able to learn through the use of the agent's experience alone or from samples of state sequences, actions, and rewards obtained from the interactions between agent and environment. The experience can be acquired by the agent in line with the learning process or emulated by a previously populated dataset. The possibility of gaining experience during learning (online learning) is interesting because it allows obtaining excellent behavior even in the absence of a priori knowledge of the dynamics of the environment. Even learning through an already populated experience dataset can be interesting, because, if combined with online learning, it makes automatic policy improvement induced by others' experiences possible.

In general, MC methods rely on repeated random sampling to obtain numerical results. To do this, they use randomness to solve deterministic problems. In our case, we will use random sampling of states and action-state pairs and we will look at the rewards and then we will review the policy in an iterative way. The iteration of the process will converge on optimal policy as we explore every possible action-state pair.

For example, we could take the following procedure:

- We will assign a reward of +1 to a right action, -1 to a wrong action, and 0 to a draw.
- We will establish a table in which each key corresponds to a particular state-action pair and each value is the value of that pair. This represents the average reward received for that action in that state.

To solve the reinforcement learning problems, MC methods estimate the value function on the basis of the total sum of rewards, obtained on average in the past episodes. This assumes that the experience is divided into episodes, and that all episodes are composed of a finite number of transitions. This is because, in MC methods, the estimation of new values, and the modification of policies, takes place at the end of each episode. MC methods iteratively estimate policy and value function. In this case, however, each iteration cycle is equivalent to completing an episode—the new estimates of policy and value function occur episode by episode.

The following is a pseudocode for MC policy evaluation:

```
Initialize
arbitrary policy π
arbitrary state-value function
Repeat
generate episode using π
for each state s in episode
the received reinforcement R is added to the set of rewards obtained so far
estimate the value function on the basis on the average of the total sum of rewards obtained
```

Usually, the term MC is used for estimation methods, the operations of which involve random components. In this case, the term MC refers to reinforcement learning methods based on total reward averages. Unlike the DP methods that calculate the values for each state, the MC methods calculate the values for each state-action pair, because, in the absence of a model, the only state values are not sufficient to decide which action is best performed in a certain state.

TD learning algorithms are based on reducing the differences between estimates made by the agent at different times. TD algorithms try to predict a quantity that depends on the future values of a given signal. Its name derives from the differences used in predictions on successive time steps to guide the learning process. The prediction at any time is updated to bring it closer to the prediction of the same quantity at the next time step. In reinforcement learning, they are used to predict a measure of the total amount of reward expected in the future.

It is a combination of the ideas of the MC method and the DP.

### Note

MC methods allow solving reinforcement learning problems based on the average of the results obtained. DP represents a set of algorithms that can be used to calculate an optimal policy given a perfect model of the environment in the form of a MDP.

TD algorithm can learn directly from raw data, without a model of the dynamics of the environment (such as MC). This algorithm updates the estimates based partly on previously learned estimates, without waiting for the final result (bootstrap, like DP). Converge (using a fixed policy) if the time step is sufficiently small, or if it reduces over time.

The consecutive predictions are often related to each other; the TD methods are based on this assumption. These methods try to minimize the error of consecutive time forecasts. To do this, calculate the value function update using the Bellman equation. As already mentioned, to improve the prediction, the bootstrap technique is used, thereby reducing the variance of the prediction in each update step.

The different types of algorithms based on time differences can be distinguished on the basis of the methodology of choosing an action adopted. There are methods of time differences on-policy, in which the update is made on the basis of the results of actions determined by the selected policy and off-policy methods, in which various policies can be assessed through hypothetical actions, not actually undertaken. Unlike on-policy methods, the latter can separate the problem of exploration from that of control, learning tactics not necessarily applied during the learning phase.

The most used TD learning algorithms are the following:

- SARSA
- Q-learning
- Deep Q-learning

In the following sections, we will analyze the main characteristics of the two algorithms and the substantial differences.

The **State-action-reward-state-action** (**SARSA**) algorithm implements an on-policy time differences method, in which the update of the action-value function is performed based on the outcome of the transition from state *s* to state *s'* through action a, based on a selected policy *π (s, a)*.

There are policies that always choose the action that provides the maximum reward and non-deterministic policies (*ε*-greedy, *ε*-soft, softmax), which ensure an element of exploration in the learning phase.

### Note

Greedy is a term used to represent a family of algorithms trying to get a global solution, through excellent local choices.

In SARSA, it is necessary to estimate the action-value function *q (s, a)*, because the total value of a state *v* (*s*) (value function) is not sufficient in the absence of an environment model to allow the policy to determine, given a state, which action is best performed. In this case, however, the values are estimated step by step following the Bellman equation with the update parameter *v* (*s*), considering, however, in place of a state, the state-action pair.

Being of an on-policy nature, SARSA estimates the action-value function based on the behavior of the *π* policy, and at the same time modifies the greedy behavior of the policy with respect to the updated estimates from the action-value function. The convergence of SARSA, and more generally of all TD methods, depends on the nature of policies.

The following is a pseudocode for the SARSA algorithm:

Initialize arbitrary action-value function Repeat (for each episode) Initialize s choose a from s using policy from action-value function Repeat (for each step in episode) take action a observe r, s' choose a' from s' using policy from action-value function update action-value function update s,a

The update rule of the action-value function uses all five elements *(s _{t}*,

*a*,

_{t}*r*

_{t + 1},

*s*

_{t+ 1},

*a*; for this reason, it is called SARSA.

_{t + 1})Q-learning is one of the most used reinforcement learning algorithms. This is due to its ability to compare the expected utility of the available actions without requiring an environment model. Thanks to this technique, it is possible to find an optimal action for every given state in a finished MDP.

A general solution to the reinforcement learning problem is to estimate, thanks to the learning process, an evaluation function. This function must be able to evaluate, through the sum of the rewards, the optimality/utility or otherwise of a particular policy. In fact, Q-learning tries to maximize the value of the *Q* function (action-value function), which represents the maximum discounted future reward when we perform actions *a* in the state *s*.

Q-learning, like SARSA, estimates the function value *q* (*s*, *a*) incrementally, updating the value of the state-action pair at each step of the environment, following the logic of updating the general formula for estimating the values for the TD methods. Q-learning, unlike SARSA, has off-policy characteristics, that is, while the policy is improved according to the values estimated by *q* (*s*, *a*), the value function updates the estimates following a strictly greedy secondary policy: given a state, the chosen action is always the one that maximizes the value max *q* (*s*, *a*). However, the *π* policy has an important role in estimating values because, through it, the state-action pairs to be visited and updated are determined.

The following is a pseudocode for a Q-learning algorithm:

Initialize arbitrary action-value function Repeat (for each episode) Initialize s choose a from s using policy from action-value function Repeat (for each step in episode) take action a observe r, s' update action-value function update s

Q-learning uses a table to store each state-action pair. At each step, the agent observes the current state of the environment and, using the *π* policy, selects and executes the action. By executing the action, the agent obtains the reward *R*_{t+1} and the new state *S _{t+1}*. At this point the agent is able to calculate

*Q(S*, updating the estimate.

_{t}, a_{t})Deep Q-learning represents an evolution of the basic Q-learning method the state-action is replaced by a neural network, with the aim of approximating the optimal value function.

Compared to the previous approaches, where it was used to structure the network in order to request both input and action and providing its expected return, Deep Q-learning revolutionizes the structure in order to request only the state of the environment and supply as many status-action values as there are actions that can be performed in the environment.

Reinforcement learning aims to create algorithms that can learn and adapt to environmental changes. This programming technique is based on the concept of receiving external stimuli depending on the algorithm choices. A correct choice will involve a reward, while an incorrect choice will lead to a penalty. The goal of the system is to achieve the best possible result, of course. In this chapter, we dealt with the basics of reinforcement learning.

To start, we explored the amazing world of machine learning and took a tour of the most popular machine learning algorithms to choose the right one for our needs. To understand what is most suitable for our needs, we learned to perform a preliminary analysis. Then we analyzed how to build machine learning models step by step.

In the central part of the chapter, we saw that the goal of learning with reinforcement is to create intelligent agents that are able to learn from their experience. So we analyzed the steps to follow to correctly apply a reinforcement learning algorithm. Later we explored the agent-environment interface. The entity that must achieve the goal is called an agent. The entity with which the agent must interact is called the environment, which corresponds to everything outside the agent.

To avoid load problems and computational difficulties, the agent-environment interaction is considered an MDP. An MDP is a stochastic control process. Then the discount factor concept was introduced. The discount factor is used during the learning process to highlight or not highlight particular actions or states. An optimal policy can cause the reinforcement obtained in performing a single action to be even low (or negative), provided that overall this leads to greater reinforcement.

Finally, we analyzed the most common reinforcement learning techniques. Q-learning, TD learning, and Deep Q-learning networks were covered.

In the next chapter, the reader will know the basic concepts of the Markov process, the basic concepts of random walks, understand how the random walk algorithms work, know how to use a Markov chain to forecast the weather, and learn how to simulate random walks using Markov chains.