"Machine Learning (CS229) is the most popular course at Stanford" –this is how a Forbes article by Laura Hamilton started, continuing- "Why? Because, increasingly, machine learning is eating the world".

Machine learning techniques are, indeed, being applied in a variety of fields, and data scientists are being sought after in many different industries. With machine learning, we identify the processes through which we gain knowledge that is not readily apparent from data, in order to be able to make decisions. Applications of machine learning techniques may vary greatly and are applicable in disciplines as diverse as medicine, finance, and advertising.

In this chapter, we will present different Machine learning approaches and techniques, and some of their applications to real-world problems, and we will introduce one of the major open source packages available in Python for machine learning, `scikit-learn`

. This will lay the background for later chapters in which we will focus on a particular type of machine learning approach using neural networks that aims at emulating brain functionality, and in particular deep learning. Deep learning makes use of more advanced neural networks than those used during the 80's, thanks not only to recent developments in the theory but also to advances in computer speed and the use of **GPUs** (**Graphical **
**Processing Units**) versus the more traditional use of **CPUs** (**Computing Processing Units**). This chapter is meant mostly as a summary of what machine learning is and can do, and to prepare the reader to better understand how deep learning differentiates itself from popular traditional machine learning techniques.

In particular, in this chapter we will cover:

What is machine learning?

Different machine learning approaches

Steps involved in machine learning systems

Brief description of popular techniques/algorithms

Applications in real-life

A popular open source package

Machine learning is often mentioned together with terms such as "big data" and "artificial intelligence", or A.I. for short, but it is quite different from both. In order to understand what machine learning is and why it is useful, it is important to understand what big data is and how machine learning applies to it. Big data is a term used to describe huge data sets that are created as the result of large increases in data gathered and stored, for example, through cameras, sensors, or Internet social sites. It is estimated that Google alone processes over 20 petabytes of information per day and this number is only going to increase. IBM estimated (http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html) that every day, 2.5 quintillion bytes are created and that 90% of all the data in the world has been created in the last two years.

Clearly, humans alone are unable to grasp, let alone analyze, such a huge amount of data, and machine learning techniques are used to make sense of these very large data sets. Machine learning is the tool used for large-scale data processing and is well suited for complex datasets with huge numbers of variables and features. One of the strengths of many machine learning techniques, and deep learning in particular, is that it performs best when it can be used on large data sets improving its analytic and predictive power. In other words, machine learning techniques, and especially deep learning neural networks, "learn" best when they can access large data sets in order to discover patterns and regularities hidden in the data.

On the other hand, machine learning's predictive ability can be well adapted to artificial intelligence systems. Machine learning can be thought of as "the brain" of an artificial intelligence system. Artificial intelligence can be defined (though this definition may not be unique) as a system that can interact with its environment: artificial intelligence machines are endowed with sensors that allow them to know about the environment they are in and tools with which they can relate back. Machine learning is therefore the brain that allows the machine to analyze the data ingested through its sensors to formulate an appropriate answer. A simple example is Siri on an iPhone. Siri hears the command through its microphone and outputs an answer through its speakers or through its display, but in order to do so it needs to "understand" what it is being said to formulate the correct answer. Similarly, driverless cars will be equipped with cameras, GPS systems, sonars and lidars, but all this information needs to be processed in order to provide a correct answer, that is, whether to accelerate, brake, turn, and so on. The information processing that leads to the answer represents what machine learning is.

The term machine learning, as we have seen, is used in a very general way and it refers to general techniques to extrapolate patterns from large sets or to the ability to make predictions on new data based on what is learnt by analyzing available known data. This is a very general and broad definition and it encompasses many different techniques. Machine learning techniques can be roughly divided into two large classes: Supervised and Unsupervised learning, though one more class is often added, and is referred to as Reinforcement Learning.

The first class of machine algorithms is named *supervised learning*. Supervised learning algorithms are a class of machine learning algorithms that use a set of labeled data in order to classify similar un-labeled data. Labeled data is data that has already been classified, while un-labeled data is data that has not yet been labeled. Labels, as we will see, can either be discrete or continuous. In order to better understand this concept, let's use an example.

Assume that a user receives a large amount of e-mails every day, some of which are important business e-mails and some of which are un-solicited junk e-mails, or spam. A supervised machine algorithm will be presented with a large body of e-mails that have already been labeled by the user as spam or not spam. The algorithm will run over all the labeled data and make predictions on whether the e-mail is spam or not. This means that the algorithm will examine each example and make a prediction for each example on whether the e-mail is spam or not. Typically, the first time the algorithm runs over all the un-labeled data, it will mislabel many of the e-mails and it may perform quite poorly. However, after each run, the algorithm will compare its prediction to the desired outcome (the label). In doing so, the algorithm will learn to improve its performance and accuracy. As noted above, an approach of this kind will benefit from large amounts of data on which it can better learn what characteristics (or features) cause each e-mail to be classified as spam or not.

After the algorithm has run for a while on the labeled data (often also called training data) and it stops improving its accuracy, it can then be used on new e-mails to test its accuracy on new un-labeled data.

In the example we have used, we have described a process in which an algorithm learns from labeled data (the e-mails that have been classified as spam or not spam) in order to make predictions on new unclassified e-mails. It is important to note, however, that we can generalize this process to more than simply two classes: for example, we could run the software and train it on a set of labeled e-mails where the labels are called **Personal**, **Business**/**Work**, **Social**, or **Spam**.

In fact, Gmail, the free e-mail service by Google, allows the user to select up to five categories, labeled as:

**Social**, which includes messages from social networks and media sharing sites**Promotions**, which includes marketing e-mails, offers, and discounts**Updates**, which includes bills, bank statements, and receipts**Forums**, which includes messages from online groups and mailing lists

In some cases, the outcome may not necessarily be discrete, and we may not have a finite number of classes to classify our data into. For example, we may be trying to predict the life expectancy of a group of people based on pre-determined health parameters. In this case, since the outcome is a continuous function (we can specify a life expectancy as a real number expressing the number of years that person is expected to live) we do not talk about a classification task but rather of a regression problem.

One way to think of supervised learning is by imagining we are trying to build a function *f *defined on the dataset. Our dataset will comprise information organized by *features*. In the example of e-mail classification, those features may be specific words that may appear more frequently than others in spam e-mails. The use of explicit sex-related words will most likely identify a spam e-mail rather than a business/work e-mail. On the contrary, words, such as "meeting", "business", and "presentation" will more likely describe a work e-mail. If we have access to metadata, the sender information may also be used to better classify e-mails. Each e-mail will then have associated a set of features, and each feature will have a value (in this case, how many times the specific word is present in the e-mail body). The machine learning algorithm will then seek to map those values to a discrete range which represents the set of classes, or a real value in the case of regression. The algorithm will run over many examples until it is able to define the best function that will allow matching most labeled data correctly. It can then be run over unlabeled data to make predictions without human intervention. This defines a function:

We can also think of classification as a process seeking to separate different groups of data points. Once we have defined our features, any example, for example, an e-mail, in our dataset can be thought of as a point in the space of features, where each point represents a different example (or e-mail). The machine algorithm task will be to draw a hyper-plane (that is a plane in a high dimensional space) to separate points with different characteristics, in the same way we want to separate spam from non-spam e-mails.

While, as in the picture below, this may look trivial in a two-dimensional case, this can turn out to be very complex in examples with hundreds or thousands of dimensions.

In later chapters, we will see several examples of either classification or regression problems. One such problem we will discuss is that of the classification of digits: given a set of images representing 0 to 9, the machine learning algorithm will try to classify each image assigning to it the digits it depicts. For such examples, we will make use of one of the most classic datasets, the MNIST dataset. In this example, each digit is represented by an image with 28 x 28 (=784) pixels, and we need to classify each of the ten digits, therefore we need to draw 9 separating hyper planes in a 784-dimensional space.

The second class of machine learning algorithms is named *unsupervised learning*. In this case, we do not label the data beforehand, rather we let the algorithm come to its conclusion. One of the most common and perhaps simplest examples of unsupervised learning is clustering. This is a technique that attempts to separate the data into subsets.

For example, in the previous case of spam/not spam e-mails, the algorithm may be able to find elements that are common to all spam e-mails (for example, the presence of misspelled words). While this may provide a better than random classification, it isn't clear that spam/not spam e-mails can be so easily separated. The subsets into which the algorithm separates the data are different classes for the dataset. For clustering to work, each element in each cluster should in principle have high intra-class similarity and low similarity with other classes. Clustering may work with any number of classes, and the idea behind clustering methods such as k-means is to find k-subsets of the original data whose elements are closer (more similar) to each other than they are to any other element outside their class. Of course, in order to do this, we need to define what *closer* or *more similar* means, that is, we need to define some kind of metric that defines the distance between points.

In the following graph, we show how a set of points can be classified to form three subsets:

Elements of a given dataset need not necessarily cluster together to form a finite set, but clustering may also include unbounded subsets of the given dataset as in the following picture:

Clustering is not the only unsupervised technique and we will see that deep learning's recent successes are related to it being so effective in unsupervised learning tasks.

New data is created every day, very quickly, and labeling all the new data is quite a laborious and time-consuming activity. One advantage of unsupervised learning algorithms is that they do not need labeled data. Unsupervised deep learning techniques and methods, such as Restricted Boltzmann machines, work by abstracting features from the data. For example, using the MNIST dataset, Restricted Boltzmann machines will abstract characteristics that are unique to each digit, detecting the shape of the lines and curves for each digit. Unsupervised learning works by revealing hidden structures in the data that allow us to classify it accordingly, not by matching it to a label.

In addition, for instance with deep belief nets, we can improve performance of an unsupervised method by refining it with supervised learning.

The third class of machine learning techniques is called *reinforcement learning*. This works differently from supervised learning though it still uses a feedback element to improve its performance. A common application of reinforcement learning techniques is in teaching machines how to play games: in this case, we do not label each move as good or bad but the feedback comes from the game, either through the outcome of the game or through signals during the game, such as scoring or losing points. Winning a game will reflect a positive outcome, similar to recognizing the right digit or whether an e-mail is spam or not, while losing the game would require further "learning". Reinforcement learning algorithms tend to reuse actions tried in the past that led to successful results, like winning in a game. However, when in uncharted territory, the algorithm must try new actions from which, depending on the result, it will learn the structure of the game more deeply. Since usually, actions are inter-related, it is not the single action that can be valued as "good" or "bad", but rather it is the whole dynamics of actions together that is evaluated. Similar to how in playing chess sometimes sacrificing a pawn may be considered a positive action if it brings a better positioning on the chessboard, even though the loss of a piece is, in general, a negative outcome, in reinforcement learning it is the whole problem and its goal that is explored. For example, a moving cleaning robot may have to decide whether to continue cleaning a room or to start to move back to its recharging station, and such a decision could be made on the basis of whether in similar circumstances it was able to find the charging station before the battery ran out. In reinforcement learning, the basic idea is that of *reward, *where the algorithm will seek to maximize the total reward it receives.

A simple example of reinforcement learning can be used to play the classic game of tic-tac-toe. In this case, each position on the board has associated a probability (a value), which is the probability of winning the game from that state based on previous experience. At the beginning, each state is set at 50%, which is we assume that at the beginning we have an equal probability of winning or losing from any position. In general, the machine will try to move towards positions with higher values in order to win the game, and will re-evaluate them if, instead, it loses. At each position, the machine will then make a choice based on the probable outcome, rather than on a fixed determined rule. As it keeps playing, these probabilities will get refined and output a higher or lower chance of success depending on the position.

So far, we have discussed different machine learning approaches, and we have roughly organized them in three different classes. Another important aspect of classical machine learning is understanding the data to better understand the problem at hand. The important aspects we need to define in order to apply machine learning can roughly be described as follows:

**Learner**: this represents the algorithm being used and its "learning philosophy". As we will see in the next paragraph, there are many different machine learning techniques that can be applied to different learning problems. The choice of learner is important, since different problems can be better suited to certain machine learning algorithms.**Training data**: This is the raw dataset that we are interested in. Such data may be unlabeled for unsupervised learning, or it may include labels for supervised learning. It is important to have enough sample data for the learner to understand the structure of the problem.**Representation**: This is how the data is expressed in terms of the features chosen so that it can be ingested by the learner. For example, if we are trying to classify digits using images, this will represent the array of values describing the pixels of the images. A good choice of representation of the data is important to achieve better results.**Goal**: This represents the reason to learn from the data for the problem at hand. This is strictly related to the target, and helps define how and what learner should be used and what representation to use. For example, the goal may be to clean our mailbox of unwanted e-mails, and the goal defines what the target of our learner is, for example, detection of spam e-mails.**Target**: This represents what is being learned and the final output. It may be a classification of the unlabeled data, it may be a representation of the input data according to hidden patterns or characteristics, it may be a simulator for future predictions, it may be a response to an outside stimulus, or it can be a strategy in the case of reinforcement learning.

It can never be emphasized enough, though, that any machine learning algorithm can only achieve an approximation of the target, not a perfect numerical description. Machine learning algorithms are not exact mathematical solutions to problems, rather they are just approximations. In the previous paragraph, we have defined learning as a function from the space of features (the input) into a range of classes; we will later see how certain machine learning algorithms, such as neural networks, can be proved to be able to approximate any function to any degree, in theory. This theorem is called the Universal Approximation Theorem, but it does not imply that we can get a precise solution to our problem. In addition, solutions to the problem can be better achieved by better understanding of the training data.

Typically, a problem solvable with classic machine learning techniques may require a thorough understanding and cleaning of the training data before deployment. If we are to state some of the steps required in approaching a machine learning problem, we may summarize them as follows:

**Data Collection**: This implies the gathering of as much data as possible and in the supervised learning problem also its correct labeling.**Data Processing**: This implies cleaning of the data (for example removing redundant or highly correlated features, or filling missing data) and understanding of the features defining the training data.**Creation of the test case**: Typically data can be divided into two or three sets: a training dataset, on which we train the algorithm, and a testing dataset, on which we test, after having trained the algorithm, the accuracy of the approach. Often, we also create a validation dataset, on which, after the training-testing procedure has been repeated many times and we are finally satisfied with the result, we make the final testing (or validation).

There are valid reasons to create a testing and a validation dataset. As we mentioned, machine learning techniques can only produce an approximation of the desired result. This is due to the fact that often, we can only include a finite and limited number of variables, and there may be many variables that are outside our own control. If we only used a single dataset, our model may end up "memorizing" the data and produce an extremely high accuracy value on the data it has memorized, but this result may not be reproducible on other similar datasets. One of the key desired goals of machine learning techniques is their ability to generalize. That is why we create both a testing dataset, used for tuning our model selection after training, and a final validation dataset only used at the end of the process to confirm the validity of the selected algorithm.

To understand the importance of selecting valid features in the data and the importance of avoiding "memorizing" the data (in more technical terms, this is what is referred to as "overfitting" in the literature, and this is what we will be calling it from now on), let's use a joke taken from an *xkcd* comic as an example (http://xkcd.com/1122): "Until 1996 no democratic US presidential candidate who was an incumbent and with no combat experience had ever beaten anyone whose first name was worth more in Scrabble". It is apparent in this example that such a "rule" is meaningless, however it underscores the importance of selecting valid features (does how much a name is worth in Scrabble bear any relevance to selecting a US president?) and that selecting random features as predictors, while it may predict the current data, cannot be used as a predictor for more general data, and that the fact that for 52 elections this had held true was simple coincidence. This is what is generally called overfitting, that is, making predictions that fit the data at hand perfectly, but do not generalize to larger datasets. Overfitting is the process of trying to make sense of what is generally called "noise" (that is, information that does not have any real meaning) and trying to fit the model to small perturbations.

Another example may be given by attempting to use machine learning to predict the trajectory of a ball thrown from the ground up in the air (not perpendicularly) until it reaches the ground again. Physics teaches us that the trajectory is shaped like a parabola, and we expect that a good machine learning algorithm observing thousands of such throws would come up with a parabola as a solution. However, if we were to zoom in on the ball and observe the smallest fluctuations in the air due to turbulence, we might notice that the ball does not hold a steady trajectory but may be subject to small perturbations. This is what we call "noise". A machine learning algorithm that tried to model these small perturbations would fail to see the big picture, and produce a result that is not satisfactory. In other words, overfitting is the process that makes the machine learning algorithm see the trees but forget about the forest.

This is why we separate the training data from the test data: if the accuracy of the test data were not similar to the result achieved on the training data, that would be a good indication that we have overfitted the model. Of course, we need to make sure we don't make the opposite error either, that is, underfit the model. In practice, though, if we are aiming at making our prediction model as accurate as possible on our training data, underfitting is much less of a risk than overfitting, and most care is therefore taken in order to avoid overfitting the model.

Besides grouping algorithms based upon their "learning style", that is, the three classes discussed at the beginning of the book, supervised learning, unsupervised learning, and reinforcement learning, we can also group them by their implementation. Clearly, each class discussed above may be implemented using different machine learning algorithms, as, for example, there are many different supervised learning techniques, where each of them may be best suited to the specific classifying or regression task at hand. In fact, the distinction between classification and regression is one of the most critical to make, and it is important to understand what we are trying to accomplish.

The following is by no means meant to be an exhaustive list or a thorough description of each machine learning method, for which we refer the reader to the book *Python Machine Learning*, Sebastian Raschka (https://www.packtpub.com/big-data-and-business-intelligence/python-machine-learning), rather it is meant as a simple review to provide the reader with a simple flavor of the different techniques and how deep learning differs from them. In the next chapters, we will see that deep learning is not just another machine learning algorithm, but it differs in substantial ways from classical machine learning techniques.

We will introduce a regression algorithm, linear regression, classical classifiers such as decision trees, naïve Bayes, and support vector machine, and unsupervised clustering algorithms such as k-means, and reinforcement learning techniques, the cross-entropy method, to give only a small glimpse of the variety of machine learning techniques that exist, and we will end this list by introducing neural networks, that is the main focus of this book.

Regression algorithms are a kind of supervised algorithm that use features of the input data to predict a value, for example the cost of a house given certain features such as size, age, number of bathrooms, number of floors, location, and so on. Regression analysis tries to find the value of the parameters for the function that best fits an input dataset. In a linear regression algorithm, the goal is to minimize a cost function by finding appropriate parameters for the function on the input data that best approximate the target values. A cost function is a function of the error, which is how far we are from getting a correct result. A typical cost function used is the mean squared error, where we take the square of the difference between the expected value and the predicted result. The sum over all the input examples gives the error of the algorithm and it represents the cost function.

Say we have a 100-square meter house, built 25 years ago, with three bathrooms, and two floors. Furthermore, assume that we divide the city, where the houses are in 10 different areas, that we denote with integers from 1 to 10, and say this house is located in the area denoted by a 7. We can then parameterize this house with a 5-dimensional vector *x = (100, 25, 3, 2, 7)*. Say that we also know that this house has an estimated value of €10,0000. What we want to achieve is to create a function *f* such that *f(x) = 100000*.

In linear regression, this means finding a vector *w= (w*_{1}*, w*_{2}*, w*_{3}*, w*_{4}*, w*_{5}*)* such that *100*w*_{1}* + 25*w*_{2}* + 3*w*_{3}* + 2*w*_{4}* + 7*w*_{5}* = 100000*. If we had a thousand houses, we could repeat the same process for every house, and ideally we would like to find a vector *w *that can predict the correct value (or close enough) for every house. Let's say that we initially pick some random value of *w. *In that case, we won't expect *f(x) = 100*w*_{1}* + 25*w*_{2}* + 3*w*_{3}* + 2*w*_{4}* + 7*w**5* to be equal to 1,00,000, so we can calculate the error ∆ = (100000 − * f(x))*_{2}. This is the squared error for one example *x, *and the mean of all the squared errors for all the examples represents our cost, that is, how much our function differs from the real value. Our aim is therefore to minimize this error, and to do so we calculate the derivative δ of the cost function with respect to *w*. The derivative indicates the direction where the function increases (or decreases), therefore, moving *w* in the opposite direction to the derivative will improve our function's accuracy. This is the main point of linear regression, moving towards the minimum of the cost function, which represents the error. Of course, we need to decide how fast we want to move in the direction of the derivative, as our derivative only indicates a direction. The cost function is not linear, therefore we need to make sure we only take small steps in the direction indicated by the derivative. Taking too large a step may possibly make us overshoot our minimum, and therefore not be able to converge to it. The magnitude of this step is what is called the *learning rate*, and lets us indicate its magnitude with the symbol "*lr*".

By setting *w = w - *δ**lr, *we are therefore improving the choice of *w *towards a better solution. Repeating this process many times will produce a value of *w *that represents the best possible choice for the function *f. *We should emphasize, however, that this process will only work locally, and it may not find a global best value if the space is not convex. As the image suggests, if many local minima exist, the algorithm may end up being stuck in one of these local minima and not be able to escape it to reach the global minimum of the error function, similar to how a small ball, moving down from a mountain, may get stuck in a small valley and never reach the bottom of the mountain.

Another widely used supervised algorithm is the decision tree algorithm. A decision tree algorithm creates a classifier in the form of a "tree". A decision tree is comprised of decision nodes where tests on specific attributes are performed, and leaf nodes that indicate the value of the target attribute. Decision trees are a type of classifier that works by starting at the root node and moving down through the decision nodes until a leaf is reached.

A classic application of this algorithm is the Iris flower dataset (http://archive.ics.uci.edu/ml/datasets/Iris) that contains data from 50 samples of three types of Irises (Iris setosa, Iris virginica, and Iris versicolor). Ronald Fisher, who created the dataset, measured four different features of these flowers, the length and width of their sepals and the length and width of their petals. Based on the different combinations of these features, it is possible to create a decision tree to decide to which species each flower belongs. We will here describe a simple simplified decision tree that will classify correctly, almost all the flowers only using two of these features, the petal length and width.

We start with the first node and we create the first test on petal length: if the petal length is less than 2.5, then the flower belongs to the Iris setosa species. This, in fact, classifies correctly, all the setosa flowers, which all have a petal length less than 2.5 cm. Therefore, we reach a leaf node, which is labeled by the outcome Iris setosa. If the petal length is greater than 2.5, we take a different branch and we reach a new decision node, and we test whether the petal width is larger than 1.8. If the petal width is larger or equal to 1.8, we reach a leaf node and we classify our flower as an Iris virginica, otherwise we reach a new decision node, where again we test whether the petal length is longer than 4.9. If it is, we reach a leaf node labeled by the Iris virginica flower, otherwise we reach another leaf node, labeled by the Iris versicolor flower.

The decision tree discussed can be shown as follows, where the left branch reflects the positive answer to the test in the decision node, while the right branch represents the negative answer to the test in the decision node. The end nodes for each branch are the leaf nodes:

This example shows how very different the decision tree algorithm is from linear regression. In addition, when we introduce neural nets, the reader will be able to see an example of how neural nets work by using this same dataset. In that example, we will also provide Python code and we will show a few pictures of how neural nets will try to separate the flowers based on their features.

Clustering algorithms, as we have already discussed, are a type of unsupervised machine learning method. The most common clustering technique is called k-means clustering and is a clustering technique that groups every element in a dataset by grouping them into k distinct subsets (hence the *k* in the name). K-means is a relatively simple procedure, and consists of choosing random k points that represent the distinct centers of the k subsets, which are called centroids. We then select, for each centroid, all the points closest to it. This will create k different subsets. At this point, for each subset, we will re-calculate the center. We have again, *k* new centroids, and we repeat the steps above, selecting for each centroid, the new subsets of points closest to the centroids. We continue this process until the centroids stop moving.

It is clear that for this technique to work, we need to be able to identify a metric that allows us to calculate the distance between points. This procedure can be summarized as follows:

Choose initial k-points, called the centroids.

To each point in the dataset, associate the closest centroid.

Calculate the new center for the sets of points associated to a particular centroid.

Define the new centers to be the new centroids.

Repeat steps 3 and 4 until the centroids stop moving.

It is important to notice that this method is sensitive to the initial choice of random centroids, and that it may be a good idea to repeat the process for different initial choices. Also, it is possible for some of the centroids not to be closest to any of the points in the dataset, reducing the number of subsets down from k. It is also worth mentioning that if we used k-means with *k=3* in the above example discussing decision trees, we may not be getting the same classification for the iris dataset that we found using decision trees, highlighting once more how important it is to carefully choose and use the correct machine learning method for each problem.

Now, let's discuss a practical example that uses k-means clustering. Let's say a pizza delivery place wants to open four new franchises in a new city, and they need to choose the location for the four new sites. This is a problem that can be solved easily using k-means clustering. The idea is to find the locations where pizza is ordered most often; these will be our data points. Next, we choose four random points where the site locations will be located. By using k-means clustering techniques, we can later identify the four best locations that minimize the distance to each delivery place. This is an example where k-means clustering can help solve a business problem.

Naïve Bayes is different from many other machine learning algorithms. Probabilistically, what most machine learning techniques try to evaluate is the probability of a certain event *Y* given conditions *X*, which we denote by *p(Y|X)*. For example, given the picture representing a digit (that is, a picture with a certain distribution of pixels), what is the probability that that number is 5? If the pixels' distribution is such that it is close to the pixel distribution of other examples that were labeled as 5, the probability of that event will be high, otherwise the probability will be low.

Sometimes we have the opposite information, that is, given we know that we have an event Y, we know the probability that our sample is *X*. The Bayes theorem states: *p(X|Y) = p(Y|X)*p(X)/p(Y)*, where *p(X|Y)* means the probability of generating instance X given Y, which is also why naïve Bayes is called a generative approach. In simple terms, we may calculate the probability that a certain pixel configuration represents the number 5, knowing what is the probability, given that we have a 5, that a random pixel configuration may match the given one.

This is best understood in the realm of medical testing. Let's say we test for a specific disease or cancer. We want to know what is the probability we may have a particular disease given that our test result was positive. Now, most tests have a reliability value, which is the percentage chance of the test being positive when administered on people with the particular disease. By reversing the expression *p(X|Y) = p(Y|X)*p(X)/p(Y)*, we have that:

*p(cancer | test=positive) = p(test=positive | cancer) * p(cancer)/p(test=positive)*

Assume that the test is 98% reliable, which means that in 98% of the cases, if the person has cancer, the test is positive, and likewise, if the person does not have cancer, the test result is negative. Also assume that this particular kind of cancer only affects older people, and only 2% of people below 50 have this kind of cancer, and the test administered on people under 50 is positive only on 3.9% of the population (we could have derived this fact from the data, but for simplicity we provide this information).

We could ask this question: if a test is 98% accurate for cancer and a 45-year-old person took the test, and it turned out to be positive, what is the probability that he/she may have cancer? Using the formula above we can calculate:

*p(cancer | test=positive) = 0.98 * 0.02/0.039 = 0.50*

So, despite the high accuracy of the test, naïve Bayes tells us we also need to take into account the fact that the cancer is quite rare under 50, therefore the positivity of the test alone does not give a 98% probability of cancer. The probability *p(cancer)*, or more in general the probability p for the outcome we are trying to estimate, is called the prior probability, because it represents the probability of the event without any additional information, therefore before we took the test.

At this point, we may wonder what would happen if we had more information, for example if we performed a different test with different reliability, or knew some information about the person such as recurrence of cancer in the family. In the preceding equation we used, as one of the factors in the computation, the probability *p(test=positive | cancer)*, and if we performed a second test, and it came positive, we would also have *p(test2=positive | cancer)*. The naïve Bayes technique makes the assumption that each piece of information is independent of each other (this means that the outcome of test 2 did not know about the outcome of test 1 and it was independent of it, that is, taking test 1 could not change the outcome of test 2, and therefore its result was not biased by the first test). naïve Bayes is a classification algorithm that assumes the independence of different events to calculate their probability. Hence:

*p(test1 and test2=pos | cancer) =p(test1=pos | cancer)*p(test2=pos | cancer)*

This equation is also called the likelihood *L(test1 and test2 = pos)* that *test1* and *test2* be positive given the fact that the person does have cancer.

We can then rewrite the equation as:

*p(cancer | both tests=pos) = *

*= p(both test=pos | cancer)*p(cancer)/p(both tests=pos) =*

* = p(test1=pos | cancer)*p(test2=pos | cancer) *p(cancer)/p(both tests=pos)*

Support vector machines is a supervised machine learning algorithm mainly used for classification. The advantage of support vector machines over other machine learning algorithms is that not only does it separate the data into classes, but it does so finding a separating hyper-plane (the analog of a plane in a space with more than three dimensions) that maximizes the margin separating each point from the hyper-plane. In addition, support vector machines can also deal with the case when the data is not linearly separable. There are two ways to deal with non-linearly separable data, one is by introducing soft margins, and another is by introducing the so-called kernel trick.

Soft margins work by allowing a few miss-classified elements while retaining most predictive ability of the algorithm. As we have discussed above, in practice it is always better not to overfit any machine learning model, and we could do so by relaxing some of the support vector machine hypotheses.

The kernel trick instead involves mapping the space of features into another space where we can define a hyper-plane that, when mapped back into the space of features, is not a linear hyper-plane anymore, allowing to separate elements in the dataset that do not appear to be separable. Since this book will be mainly concerned with deep learning, we will not discuss in detail how support vector machines are implemented, that may take too much time, but rather want to emphasize the concept that support vector machines have been quite popular and effective thanks to their ability to generalize to non-linear situations. As we have seen before, the task of a supervised machine learning algorithm is to find a function from the space of features to a set of classes. Each input *x= (x*
_{1
}
*, x*
*2*
*, …, x*
*n*) represents an input example, and each *x*
*i* represents the value of x for the *i*th feature. Earlier on we gave, as an example, trying to estimate the resell value of a certain house depending on some features, like number of bathrooms or location. If the *i*th feature corresponds to the number of bathrooms, *x*
*i* would correspond to the number of bathrooms present in house *x*. We can create a function *k* from the space of features to a different representation of this space, called a kernel: for example *k* may map *x*
*i* into *(x*
*i*
*)*
*2*, and in general map the space of features non-linearly into another space *W*. So, a separating hyper-plane in *W*, can be mapped back into the space of features, where it would not be a linear hyper-plane anymore. The exact conditions under which this is true are well defined but beyond the scope of this short introduction. However, this again highlights the importance of the choice of correct features in classical machine learning algorithms, a choice that can allow finding the solution to specific problems.

So far, we have introduced supervised and unsupervised learning algorithms. The cross-entropy method belongs, instead, to the reinforcement learning class of algorithms, which will be discussed in great detail in Chapter 7, *Deep Learning for Board Games* and Chapter 8, *Deep Learning for Computer Games* of this book. The cross-entropy method is a technique to solve optimization problems, that is, to find the best parameters to minimize or maximize a specific function.

In general, the cross-entropy method consists of the following phases:

Generate a random sample of the variables we are trying to optimize. For deep learning these variables might be the weights of a neural network.

Run the task and store the performance.

Identify the best runs and select the top performing variables.

Calculate new means and variances for each variable, based on the top performing runs, and generate a new sample of the variables.

Repeat steps until a stop condition is reached or the system stops improving.

Suppose we are trying to solve for a function that depends on many variables, for example we are trying to build a model plane that can fly the longest when launched from a specific altitude. The distance that the plane covers will be a function of the size of its wings, their angle, the weight, and so on. Each time, we can record each variable and then launch the plane and measure the distance it flies. However, rather than trying all possible combinations, we create statistics, we select the best and worst runs, and we note at what values the variables were set during the best runs and during the worst runs. For example, if we detect that for each of the best runs the plane had wings of a specific size, we can conclude that that particular size may be optimal for the plane to fly a long distance. Conversely, if for each of the worst runs, the plane's wings were at a certain angle, we would conclude that that particular angle would be a bad choice for our plane's wings. In general, we will produce a probability distribution for each value that should produce the optimal plane, probabilities that are not random anymore, but based on the feedback we have received.

This method, therefore, uses the feedback from the run (how far the plane has flown) to determine the best solution to the problem (the value for each variable) in a typical reinforcement learning process.

After having refreshed the reader with some of the popular classical machine learning algorithms, we will now introduce neural networks, and explain in deeper detail how they work and how they differ from the algorithms we have briefly summarized.

Neural networks are another machine learning algorithm and they have known periods of high popularity and periods during which they were rarely used. Understanding neural networks, to which we will dedicate the next and following chapters, is indeed key for following the content of this book.

The first example of a neural network was called the perceptron, which was invented by Frank Rosenblatt in 1957. The perceptron is a network comprised of only an input and an output layer. In case of binary classifications, the output layer has only one neuron or unit. The perceptron seemed to be very promising from the start, though it was quickly realized that it could only learn linearly separable patterns. For example, Marvin Minsky and Seymour Papert showed that it could not learn the XOR logical function. In its most basic representations, perceptrons are just simple representations of one neuron and its input, input that can be comprised of several neurons.

Given different inputs into a neuron, we define an activation value by the formula , where *x*
_{i} is the value for the input neuron, while *w*
_{i} is the value of the connection between the neuron *i *and the output. We will learn this in much deeper detail in the next chapter, for now we should just notice that perceptrons share many similarities with logistic regression algorithms, and are constrained by linear classifiers as well. If the activation value, which should be thought of as the neuron internal state, is greater than a fixed threshold *b*, then the neuron will activate, that is, it will fire, otherwise it will not.

The simple activation defined above can be interpreted as the dot product between the vector **w** and the vector **x**. The vector **w** is fixed, and it defines how the perceptron works, while **x** represents the input. A vector **x** is perpendicular to the weight vector **w** if *<*
**w,x**
*> = 0*, therefore all vectors **x** such that *<*
**w,x**
*> = 0* define a hyper-plane in **R**
^{3} (where *3* is the dimension of **x**, but it could be any integer in general). Hence, any vector **x** satisfying *<*
**w,x**
*> > 0* is a vector on the side of the hyper-plane defined by **w**. This makes it clear how a perceptron just defines a hyper-plane and it works as a classifier. In general, instead of *0* we can set the threshold to be any real number *b*, this has the effect of translating the hyper-plane away from the origin. However, rather than keeping track of this value, generally we include a bias unit in our network, which is an always on (*value = 1)* special neuron with connecting weight *-b*. In this case, if the connecting weight has value *–b*, the activation value becomes and setting *a(x) > 0* is equivalent to setting .

Perceptrons, while limited in their performance, are very important historically as they are the first examples of a neural network.

Neural networks, of course, do not need to have one single output neuron, and, in fact, in general they do not. If the network has more than one output neuron, for each output neuron we can repeat the same process. Each weight will then be labeled by two indices, *i *and *j, *to indicate that* *the weight is connecting the neuron *i *on the input layer to the neuron *j o*n the output layer. There will also be a connection from the bias unit, with value 1, to each neuron on the output layer. It should also be noted that we can define different activity functions on the activation value. We have defined the activation value as a(x) (from now on we will assume that the bias is included in this formula) and we have said that the neuron activates if the activation is greater than *0*. As we will see, this already defines an activity function, that is, a function defined on the activation, that is, on the neuron's internal state, and this is called the threshold activity, because the neuron activates when the activation is greater than *0*. However, we will see that neural nets can have many different activity functions that can be defined on their activation value, and we will discuss them in greater detail in the next chapter.

The previous paragraph introduced a very simple example of a neural network, a feed-forward 1-layer network. They are called feed-forward because the information proceeds from the input towards the output and it never loops back, and 1-layer because there is only 1-output layer besides the input layer. This is not the general case. We have already discussed the limitations of 1-layer feed-forward networks when we mentioned that they can only work on linearly separable data, and in particular we mentioned that they cannot approximate the logical XOR function. There are, however, networks that have extra layers between the input and the output layers: these layers are called the hidden layers. A feed-forward network with hidden layers will then move the information from the input through its hidden layer to the output layer, which defines a function that takes an input and it defines an output. There exists a theorem, called the Universal Theorem, which states that any function can be approximated by a neural network with at least one hidden layer, and we will give an intuition of why this is true in the next chapter.

For a long time, given this theorem and also the difficulty in working with complex networks, people have worked with shallow networks with only one hidden layer. However, recently people have realized that more complex networks with many hidden layers can understand levels of abstraction that shallow layers cannot. In addition, recurrent networks have been introduced where neurons can also feed information back into themselves. Some neural networks' structures can also permit to define an energy function that allows for the creation of memories. All of this exciting functionality will be discussed in the next chapters as we will delve through the most recent development in deep learning.

Machine learning in general, and deep learning in particular, are producing more and more astonishing results in terms of quality of predictions, feature detections, and classifications. Many of these recent results have made the news in recent years. Such is the pace of progress, that many experts are worrying that very soon machines will be more intelligent than humans. At a UN meeting, on October 14, 2015, artificial intelligence experts and researchers in many other fields warned about the need to define ethical guidelines to prevent the danger that a super-intelligent class of machines may pose to the human race. Such fears arise from some recent incredible results in which computers have been able to beat the best human champions in games where it was thought that intuition would give humans an advantage over machines.

AlphaGo is an artificial intelligence machine based on deep learning which made news in 2016 by beating world Go champion Lee Sedol. In January 2016, AlphaGo had already made the news when it beat the European champion Fan Hui, though, at the time, it seemed unlikely that it could go on to beat the world champion. Fast forward a couple of months and AlphaGo was able to achieve this remarkable feat by sweeping its opponent in a 4-1 victory series. The reason for celebration was due to the fact that Go has many more possible game variations than other games, such as chess, and it is impossible to be able to consider all possible moves in advance. In addition, in Go it is also very difficult to even judge the current position or value of a single stone on the board, unlike chess.

The strength of AlphaGo was that it had not been programmed to play, but it had *learned* to play by playing thousands of games against itself using reinforcement learning and deep learning techniques. The ability to learn is what renders machine learning, and deep learning especially, a completely different approach to solving problems. Deep learning is about creating programs that may learn by themselves with little or no help from humans.

However, the variety of fields in which deep learning has been applied with considerable success is not limited to games. Kaggle (http://www.kaggle.com) is a web site hosting many different machine learning competitions. These vary extensively in terms of the field they are used in and their applications. In 2013, the University of Oregon sponsored a competition in which it was asked to use machine learning techniques to detect and identify birds by using standard recording of real-world audio data. In order to gain better understanding of bird population trends, costly human effort is often required. Machine learning helps solve this problem by automatically identifying what birds are present by simply listening on an audio recording.

Amazon recently launched another competition for the problem of granting employees access to internal computers and networks, hoping that a successful solution would cut the costly delays of human supervisory intervention.

The Chicago Department of Health held a competition in 2015 where it asked "given weather location testing and spraying data … when and where different species of mosquitoes will test positive for West Nile virus".

In August 2015, a competition asked to predict rental prices across Western Australia, and in February 2016 BNP Paribas, a French bank, launched a competition to accelerate its claim management process.

This provides some idea of the variety of problems that can be solved using machine learning, and it should be noted that all these competitions offered prizes for the best solution. In 2009, Netflix launched a one million dollar competition to improve the accuracy of its prediction system on what movies a user may enjoy based on his/her previously ranked movies, and data scientist jobs are routinely ranked among the highest paid and most sought after work occupations.

Machine learning is routinely used in applications ranging from self-driving cars, military drones, and target reconnaissance systems, to medical applications, such as applications able to read doctors' notes to spot potential health problems, and surveillance systems that can provide facial recognition.

Optical character recognition is widely used, for example by post offices, to read addresses on envelopes, and we will show how to apply neural networks to digit recognition using the MNIST dataset. Unsupervised deep learning has also found many applications and great results in **Natural Language Processing** (**NLP**), and almost each one of us has an NLP application of deep learning on his/her smartphone, as both Apple and Android use deep learning applied to NLP for their virtual assistants (for example, Siri). Machine learning also finds applications in biometrics, for example in recognizing someone's physical characteristics, such as fingerprints, DNA or retina recognition. In addition, automobile autonomous driving has improved in recent years to the point that it is now a reality.

Machine learning can also be applied to catalog pictures in a photo album, or, more importantly, satellite images, allowing the description of different images according to whether they are an urban environment or not, whether they describe forests, ice regions, water extensions, and so on.

To summarize, machine learning has recently found applications in almost every aspect of our lives, and its accuracy and performance has seen a continuous improvement thanks also to increasingly better and faster computers.

Machine learning is a popular and competitive field, and there are many open source packages that implement most of the classic machine learning algorithms. One of the most popular is `scikit-learn`

(http://scikit-learn.org), a widely used open source library used in Python.

`scikit-learn`

offers libraries that implement most classical machine-learning classifiers, regressors and clustering algorithms such as **support vector machines** (**SVM**), nearest neighbors, random forests, linear regression, k-means, decision trees and neural networks, and many more machine learning algorithms

The base class `sklearn`

has several packages available, depending on the type of algorithm chosen, such as `sklearn.ensemble`

, `sklearn.linear_model`

, `sklearn.naive_bayes`

, `sklearn.neural_network`

, `sklearn.svm`

, and `sklearn.tree`

.

There are also helpers to do cross-validation and for helping select the best features. Rather than spending time describing all the functionality abstractly, we will instead start with one simple example using a multi-layer neural network. The `scikit-learn`

library uses methods with similar signatures for each machine learning algorithm, so classifiers share the same common functionality. In addition, we want the reader to be able to quickly start getting a flavor of what neural networks can do without the need to spend time creating a neural network from scratch. The following chapters will discuss other libraries and more complex implementations for many different types of deep learning neural nets, but for now, the user can start getting a quick idea of their functionality.

For example, if one wanted to use a multi-layer neural network with scikit-learn, it would be enough to import it in our program by typing:

from sklearn.neural_network.multilayer_perceptron import MLPClassifier

Each algorithm needs to be called using pre-defined parameters, though in most instances default values may be used. In the case of the MLPClassifier, no parameter is needed and one can use the default values (all parameters are described on the scikit-learn website, and in particular for the MLPClassifier one can find them at: http://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html).

The algorithm is then called on the training data, using the labels for tuning of the parameters, using the `fit`

function:

MLPClassifier().fit(data, labels)

Once the algorithm has been fit on the training data, it can be used to make predictions on the test data, using the `predict_proba`

function that will output probabilities for each class:

probabilities = MLPClassifier().predict_proba(data)

Let's write a full example of how to use a `MLPClassifier`

classifier on the `iris`

dataset that we discussed briefly when we introduced decision trees.

Scikit-learn makes it easy to load important classic datasets. To do this we only need:

from sklearn import datasets iris = datasets.load_iris() data = iris.data labels = iris.target

This will load the dataset. Now, to load the classifier we just need:

from sklearn.neural_network.multilayer_perceptron import MLPClassifier

Now we tune the parameters using the data:

mlp = MLPClassifier(random_state=1) mlp.fit(data, labels)

Now, since the weights are initialized randomly, the `random_state`

value is simply there to force the initialization to always use the same random values in order to get consistent results across different trials. It is completely irrelevant to understanding the process. The `fit`

function is the important method to call, it is the method that will find the best weights by training the algorithm using the data and labels provided, in a supervised fashion.

Now we can check our predictions and compare them to the actual result. Since the function `predict_proba`

outputs the probabilities, while `predict`

outputs the class with the highest probability, we will use the latter to make the comparison, and we will use one of sikit-learn helper modules to give us the accuracy:

pred = mlp.predict(data) from sklearn.metrics import accuracy_score print('Accuracy: %.2f' % accuracy_score(labels, pred))

And that's it. Of course, as we mentioned, it is usually better to split our data between training data and test data, and we can also improve on this simple code by using some regularization of the data. Scikit-learn provides some helper functions for this as well:

from sklearn.cross_validation import train_test_split from sklearn.preprocessing import StandardScaler data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.5, random_state=1) scaler = StandardScaler() scaler.fit(data) data_train_std = scaler.transform(data_train) data_test_std = scaler.transform(data_test) data_train_std = data_train data_test_std = data_test

This code is self-explanatory, we split the data and we normalize it, which means we subtract the mean and scale the data to unit variance. Then we fit our algorithm on the training data and we test on the test data:

mlp.fit(data_train, labels_train) pred = mlp.predict(data_test) print('Misclassified samples: %d' % (labels_test != pred).sum()) from sklearn.metrics import accuracy_score print('Accuracy: %.2f' % accuracy_score(labels_test, pred))

And we get the following output:

**Misclassified samples: 3 Accuracy: 0.96**

We can draw some pictures to show the data and how the neural net divides the space into three regions to separate the three types of flowers (since we can only draw two-dimensional pictures we will only draw two features at the time). The first graph shows how the algorithm tries to separate the flowers based on their petal width and length, without having normalized the data:

The second graph shows the same based only on the petal width and the sepal width, instead:

The third graph shows the same data as the first one, after normalizing the data:

And, finally, the fourth graph is the same as the second one with the data normalized:

We also show the code used to create these graphs. Note that the code to draw these pictures was adapted from similar code by Sebastian Raschka in his *Python Machine Learning* book, published by Packt Publishing, and we refer the reader to it for further details.

The code to make the preceding drawings is as follows. Note that before data must be set to only contain the data relative to two variables, for example `data = iris.data[:,[1,3]] `

for sepal and petal length, since we can only draw two-dimensional images.

import numpy from matplotlib.colors import ListedColormap import matplotlib.pyplot as plt markers = ('s', '*', '^') colors = ('blue', 'green', 'red') cmap = ListedColormap(colors) x_min, x_max = data[:, 0].min() - 1, data[:, 0].max() + 1 y_min, y_max = data[:, 1].min() - 1, data[:, 1].max() + 1 resolution = 0.01 x, y = numpy.meshgrid(numpy.arange(x_min, x_max, resolution), numpy.arange(y_min, y_max, resolution)) Z = mlp.predict(numpy.array([x.ravel(), y.ravel()]).T) Z = Z.reshape(x.shape) plt.pcolormesh(x, y, Z, cmap=cmap) plt.xlim(x.min(), x.max()) plt.ylim(y.min(), y.max()) # plot the data classes = ["setosa", "versicolor", "verginica"] for index, cl in enumerate(numpy.unique(labels)): plt.scatter(data[labels == cl, 0], data[labels == cl, 1], c=cmap(index), marker=markers[index], s=50, label=classes[index]) plt.xlabel('petal length') plt.ylabel('sepal length') plt.legend(loc='upper left') plt.show()

The `MLPClassifier`

, as we mentioned, does have many parameters that we can use; we will cite only the activation function and the number of hidden layers and how many neurons each may have, but the documentation for all possible parameters is available at http://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.htmlThe number of hidden layers and the number of neurons can be specified by adding `hidden_layer_sizes=(n`

`1`

`, n`

`2`

`, n`

`3`

`, …, n`

`m`

`)`

, where `n`

`i`

is the number of neurons in the *i*th layer.

For a neural net with two hidden layers with `200`

and `100`

neurons respectively, we would write:

mlp = MLPClassifier(random_state=1, hidden_layer_sizes=(200, 100,))

The other important parameter refers to the activation function, that we have called previously the activity function. This module supports three types defined below:

### Note

The **ReLU** is the easiest, but also one of the most popular (and the default activation function) and it is simply defined as

The **logistic** function is used when we are interested in calculating the probability of an event, in fact it has values between 0 and 1 and it is defined as:

Finally, the **tanh** is simply defined as:

For example, to use two hidden layers with 200 and 100 neurons respectively with a logistic activation function, the code would modify to be:

mlp = MLPClassifier(random_state=1, hidden_layer_sizes=(200, 100,), activation = "logistic")

We invite the reader to play with some of these parameters, and also to use the `max_iter`

parameter that will limit the number of iterations. The number of iterations refers to the number of passes over the training data. A small value, such as `max_iter=100`

, will not produce such good results, as the algorithm will not have the time to converge. Note, however, that on such a small dataset, more hidden layers will not necessarily result in better predictions, and they may instead degrade prediction accuracy.

That concludes this chapter, where we have introduced the reader to the importance of machine learning and the many applications in the real world. We have briefly mentioned some of the issues and problems, and we have touched on the topic of neural networks that will be the focus of the next chapter. We have also touched on how to use standard open source libraries such as `scikit-learn`

to start implementing some simple multi-layer feed-forward networks.

We now turn to discussing neural networks in depth and the motivations behind their use.

In this chapter, we have covered what machine learning is and why it is so important. We have provided several examples where machine learning techniques find applications and what kind of problems can be solved using machine learning. We have also introduced a particular type of machine learning algorithm, called neural networks, which is at the basis of deep learning, and have provided a coding example in which, using a popular machine learning library, we have solved a particular classification problem. In the next chapter, we will cover neural networks in better detail and will provide their theoretical justifications based on biological considerations drawn from observations of how our own brain works.