Hands-On Transfer Learning with Python

Machine Learning Fundamentals

One day the AIs are going to look back on us the same way we look at fossil skeletons on the plains of Africa. An upright ape living in dust with crude language and tools, all set for extinction.

– Nathan Bateman, Ex Machina (Movie 2014)

This quote may seem exaggerated to the core and difficult to digest, yet, with the pace at which technology and science are improving, who knows? We as a species have always dreamt of creating intelligent, self-aware machines. With recent advancements in research, technology, and the democratization of computing power, artificial intelligence (AI), machine learning (ML), and deep learning have gotten enormous attention and hype amongst technologists and the population in general. Though Hollywood's promised future is debatable, we have started to see and use glimpses of intelligent systems in our daily lives. From intelligent conversational engines, such as Google Now, Siri, Alexa, and Cortana, to self-driving cars, we are gradually accepting such smart technologies in our daily routines.

As we step into the new era of learning machines, it is important to understand that the fundamental ideas and concepts have existed for some time and have constantly been improved upon by intelligent people across the planet. It is often claimed that roughly 90% of the world's data has been created in just the last couple of years, and we continue to create data at ever-increasing rates. The realm of ML, deep learning, and AI helps us utilize these massive amounts of data to solve various real-world problems.

This book is divided into three sections. In this first section, we will get started with the basic concepts and terminologies associated with AI, ML, and deep learning, followed by in-depth details on deep learning architectures.

This chapter provides our readers with a quick primer on the basic concepts of ML before we get started with deep learning in subsequent chapters. This chapter covers the following aspects:

Introduction to ML
ML methodologies
CRISP-DM—workflow for ML projects
ML pipelines
Exploratory data analysis
Feature extraction and engineering
Feature selection

Every chapter of the book builds upon concepts and techniques from the previous chapters. Readers who are well-versed with the basics of ML and deep learning may pick and choose the topics as they deem necessary, yet it is advised to go through the chapters sequentially. The code for this chapter is available for quick reference in the Chapter 1 folder in the GitHub repository at https://github.com/dipanjanS/hands-on-transfer-learning-with-python which you can refer to as needed to follow along with the chapter.

Why ML?

We live in a world where our daily routine involves multiple contact points with the digital world. We have computers assisting us with communication, travel, entertainment, and more. The digital online products (apps, websites, software, and so on) that we use seamlessly all the time help us avoid mundane and repetitive tasks. This software has been developed using computer programming languages (like C, C++, Python, Java, and so on) by programmers who have explicitly programmed each instruction to enable it to perform defined tasks. A typical interaction between a compute device (computer, phone, and so on) and an explicitly programmed software application with inputs and defined outputs is depicted in the following diagram:

Traditional programming paradigm

Though the current paradigm has been helping us develop amazingly complex software/systems to address tasks from different domains and aspects in a pretty efficient way, it requires somebody to define and code explicit rules for such programs to work. Such tasks are easy for a computer to solve but difficult or time consuming for humans. For instance, performing complex calculations, storing massive amounts of data, searching through huge databases, and so on are tasks that can be performed efficiently by a computer once the rules are defined.

Yet, there is another class of problems that can be solved intuitively by humans but are difficult to program. Problems like object identification, playing games, and so on are natural to us yet difficult to define with a set of rules. Alan Turing, in his landmark paper Computing Machinery and Intelligence (https://www.csee.umbc.edu/courses/471/papers/turing.pdf), which introduced the Turing test, discussed general purpose computers and whether they could be capable of such tasks.

This new paradigm, which embodies the thoughts about general purpose computing, is what gave rise to AI in a broader sense. This new paradigm, better termed as an ML paradigm, is one where computers or machines learn from experience (analogous to human learning) to solve tasks rather than being explicitly programmed to do so.

AI is thus an encompassing field of research, with ML and deep learning being specific subfields of study within it. AI is a general field that includes other subfields as well, which may or may not involve learning (for instance, see symbolic AI). In this book we will concentrate our time and efforts upon ML and deep learning only. The scope of artificial intelligence, machine learning, and deep learning can be visualized as follows:

Scope of artificial intelligence, with machine learning and deep learning as its subfields

Formal definition

A formal definition of ML, as stated by Tom Mitchell, is explained as follows.

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

This definition beautifully captures the essence of what ML is in a very concise manner. Let's take an example from the real world to understand it better. Suppose the task (T) is to identify spam emails. We may present many examples (or experiences E) of spam and non-spam emails to a system, from which it learns rather than being explicitly programmed. The program or system may then be measured for its performance (P) on the learned task of identifying spam emails. Interesting, isn't it?
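The spam example can be sketched in a few lines of code. The following is only a minimal illustration of how T, E, and P relate, assuming a naive word-counting scorer and a handful of invented emails; the function names and the data are hypothetical and are not taken from the book's repository.

```python
from collections import Counter

def train(examples):
    """Experience E: count how often each word appears in spam and in non-spam."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def predict(counts, text):
    """Task T: label an email by the class that has seen its words more often."""
    words = text.lower().split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

def accuracy(counts, examples):
    """Performance P: fraction of emails labeled correctly."""
    hits = sum(predict(counts, text) == label for text, label in examples)
    return hits / len(examples)

# Hypothetical toy emails, invented purely for illustration.
experience = [
    ("win a free prize now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("free money claim your prize", "spam"),
    ("lunch on monday with the team", "ham"),
]
held_out = [
    ("claim your free prize", "spam"),
    ("agenda for the team meeting", "ham"),
    ("free money now", "spam"),
    ("monday lunch", "ham"),
]

p_before = accuracy(train([]), held_out)         # P with no experience
p_after = accuracy(train(experience), held_out)  # P after seeing E
print(p_before, p_after)
```

With no experience the scorer falls back to a single default label, so its performance improves only once it has seen labeled examples, which is the sense of "learning" in the definition.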

Shallow and deep learning

ML is thus the task of identifying patterns from training examples and applying these learned patterns (or representations) to new unseen data. ML is also sometimes termed shallow learning because, in most cases, it learns single-layered representations. This raises two questions: what are layers of representation, and what is deep learning? We will answer these questions in detail in subsequent chapters. For now, let's have a quick overview of deep learning.

Deep learning is a subfield of ML that is concerned with learning successive meaningful representations from training examples to solve a given task. Deep learning is closely associated with artificial neural networks that consist of multiple layers stacked one after the other, which capture successive representations.

Do not worry if this was difficult to digest; as mentioned, we will cover it in considerable depth in subsequent chapters.

ML has become a buzzword thanks to the amount of data we are generating and collecting along with faster compute. Let's look at ML in more depth in the following sections.

ML techniques

ML is a popular subfield of AI, one which covers a very wide scope. One of the reasons for this popularity is the comprehensive toolbox of sophisticated algorithms, techniques, and methodologies under its ambit. This toolbox has been developed and improved over the years, and new tools are being researched on an ongoing basis. To understand and use the ML toolbox wisely, consider the following few ways of categorizing it.

Categorization based on amount of human supervision:

Supervised learning: This class of learning involves high-human supervision. The algorithms under supervised learning utilize the training data and associated outputs to learn a mapping between the two and apply the same on unseen data. Classification and regression are two major types of supervised learning algorithms.
Unsupervised learning: This class of algorithms attempts to learn inherent latent structures, patterns, and relationships from the input data without any associated outputs/labels (human supervision). Clustering, dimensionality reduction, association rule mining, and so on are a few major types of unsupervised learning algorithms.
Semi-supervised learning: This class of algorithms is a hybrid of supervised and unsupervised learning. In this case, the algorithms work with small amounts of labeled training data and larger amounts of unlabeled data, thus making creative use of both supervised and unsupervised methods to solve a given task.
Reinforcement learning: This class of algorithms is a bit different from supervised and unsupervised learning methods. The central entity here is an agent, which trains over a period while interacting with its environment to maximize some reward. The agent iteratively learns and changes strategies/policies based on rewards/penalties from interacting with the environment.

Categorization based on data availability:

Batch learning: This is also termed as offline learning. This type of learning is utilized when the required training data is available, and a model can be trained and fine-tuned before deploying into production/real world.
Online learning: As the name suggests, in this case learning does not stop once the initial data has been used. Rather, data is fed into the system in mini-batches and the training process continues with new batches of data as they arrive.
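The difference between the two modes can be sketched with the simplest possible "model", a mean estimate. This is only an illustrative sketch: the OnlineMean class and its partial_fit method are hypothetical names chosen for this example, and the data is invented.

```python
class OnlineMean:
    """Keeps a running mean that is updated one mini-batch at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def partial_fit(self, batch):
        # Incremental update: no need to keep earlier batches in memory.
        for x in batch:
            self.count += 1
            self.mean += (x - self.mean) / self.count
        return self

def batch_mean(data):
    """Batch (offline) learning: all data is available up front."""
    return sum(data) / len(data)

data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical observations
offline = batch_mean(data)

online = OnlineMean()
history = []
for start in range(0, len(data), 2):  # mini-batches of size 2
    online.partial_fit(data[start:start + 2])
    history.append(online.mean)

print(offline, history)
```

The online estimate changes after every mini-batch and ends up agreeing with the batch estimate once it has seen the same data; the practical difference is that the online learner never needed the whole dataset at once.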

The previously discussed categorizations give us an abstract view of how ML algorithms can be organized, understood, and utilized. The most common way to categorize them is into supervised and unsupervised learning algorithms. Let's go into a bit more detail about these two categories as this should help us get started for further advanced topics to be introduced later.

Supervised learning

Supervised learning algorithms are a class of algorithms that utilize data samples (also called training samples) and corresponding outputs (or labels) to infer a mapping function between the two. The inferred mapping function or the learned function is the output of this training process. The learned function is then utilized to correctly map new and unseen data points (input elements) to test the performance of the learned function.

Some key concepts for supervised learning algorithms are as follows:

Training dataset: The training samples and corresponding outputs utilized during the training process are termed as training data. Formally, a training dataset is a two-element tuple consisting of an input element (usually a vector) and a corresponding output element or signal.
Test dataset: The unseen dataset that is utilized to test the performance of the learned function. This dataset is also a two-element tuple containing input data points and corresponding output signals. Data points in this set are not used for the training phase (this dataset is further divided into the validation set as well; we will discuss this in more detail in subsequent chapters).
Learned function: This is the output of the training phase. Also termed as inferred function or the model. This function is inferred based on the training examples (input data points and their corresponding outputs) from the training dataset. An ideal model/learned function would learn the mapping in such a way that the results can be generalized for unseen data as well.
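The separation between training and test data can be sketched as follows. This is a minimal illustration assuming a simple random split; the helper name train_test_split, the 25% test fraction, and the synthetic dataset are assumptions made for this example.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=42):
    """Shuffle (input, output) pairs and hold back a fraction for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical dataset: each sample is a two-element tuple
# (input vector, output signal).
dataset = [((x,), 2 * x + 1) for x in range(20)]

train_set, test_set = train_test_split(dataset)
print(len(train_set), len(test_set))
```

The key property is that no sample appears in both sets, so performance measured on the test set reflects how well the learned function generalizes to unseen data.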

There are various supervised learning algorithms available. Based on the use case requirements, they can be broadly categorized into classification and regression models.

Classification

In the simplest terms, these algorithms help us answer objective questions or make yes-no predictions. For instance, they are useful for questions such as "is it going to rain today?" or "can this tumour be cancerous?".

Formally, the key objective of classification algorithms is to predict output labels that are categorical in nature depending upon the input data points. The output labels are categorical in nature; namely, they each belong to a discrete class or category.

Logistic regression, Support Vector Machines (SVMs), Neural Networks, Random Forests, k-Nearest Neighbours (KNN), Decision Trees, and so on are some of the popular classification algorithms.

Suppose we have a real-world use case to evaluate different car models. To keep things simple, let's assume that the model is expected to predict an output for every car model as either acceptable or unacceptable based on multiple input training samples. The input training samples have attributes such as buying price, number of doors, capacity (in number of persons), and safety.

Apart from these attributes, each data point carries a class label that marks it as either acceptable or unacceptable. The following diagram depicts the binary classification problem at hand. The classification algorithm takes the training samples as input to prepare a supervised model. This model is then utilized to predict the evaluation label for a new data point:

Supervised learning: Binary classification for car model evaluation

Since output labels are discrete classes in case of classification problems, if there are only two possible output classes the task is termed as a binary classification problem, and a multi-class classification otherwise. Predicting whether it will rain tomorrow or not would be a binary classification problem (with output being a yes or a no) while predicting a numeric digit from scanned handwritten images would be multi-class classification with 10 labels (zero to nine possible output labels).
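The car evaluation use case can be sketched with a small k-nearest neighbours classifier. This is only an illustration: the numeric encoding of the attributes and the six training samples are invented, and k-NN is just one of the algorithms listed earlier, chosen here because it fits in a few lines.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the majority label among the k closest training samples."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(train, key=lambda s: sq_dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical samples: (buying price, doors, capacity, safety).
# Price and safety use an assumed ordinal encoding, 1 = low ... 3 = high.
cars = [
    ((1, 4, 4, 3), "acceptable"),
    ((2, 4, 5, 3), "acceptable"),
    ((1, 2, 4, 2), "acceptable"),
    ((3, 2, 2, 1), "unacceptable"),
    ((3, 4, 2, 1), "unacceptable"),
    ((2, 2, 2, 1), "unacceptable"),
]

label_a = knn_predict(cars, (1, 4, 5, 3))  # cheap, roomy, safe
label_b = knn_predict(cars, (3, 2, 2, 2))  # expensive, small, medium safety
print(label_a, label_b)
```

Because there are only two possible labels, this is a binary classification problem; the same function would handle a multi-class problem unchanged, since the vote simply picks the most common label among the neighbours.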

Regression

This class of supervised learning algorithms helps us answer quantitative questions of the type "how many?" or "how much?". Formally, the key objective for regression models is value estimation. In this case, the output labels are continuous in nature (as opposed to being discrete in classification).

In the case of regression problems, the input data points are termed as independent or explanatory variables, while the output is termed as a dependent variable. Regression models are also trained using training data samples consisting of input (or independent) data points along with output (or dependent) signals. Linear regression, multivariate regression, regression trees, and so on are a few supervised regression algorithms.

Regression models can be further categorized based on how they model the relationship between dependent and independent variables.

Simple linear regression models work with a single independent variable and a single dependent variable. Ordinary Least Squares (OLS) regression is a popular linear regression model. Multiple regression (often loosely called multivariate regression) is where there is a single dependent variable, while each observation is a vector composed of multiple explanatory variables.

Polynomial regression models are a special case of multiple regression. Here the dependent variable is modeled as an nth-degree polynomial of the independent variable. Since polynomial regression models fit nonlinear relationships between dependent and independent variables, they are also loosely termed nonlinear regression models, even though they remain linear in their coefficients.

The following is an example of linear regression:

Supervised learning: Linear regression

To understand different regression types, let's consider a real-world use case of estimating the stopping distance of a car, based on its speed. Here, based on the training data we have, we can model the stopping distance as a linear function of speed or as a polynomial function of the speed of the car. Remember that the main objective is to minimize the error without overfitting the training data itself.

The preceding graph depicts a linear fit while the following one depicts a polynomial fit for the same dataset:

Supervised learning: Polynomial regression
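The linear and polynomial fits for the stopping distance example can be compared in code. The sketch below assumes a least-squares fit solved through the normal equations, and the speed/distance values are synthetic: they are generated from an assumed quadratic relationship, not taken from the dataset shown in the figures.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) w = A^T y."""
    n = degree + 1
    # Augmented matrix of the normal equations.
    rows = [[sum(x ** (i + j) for x in xs) for j in range(n)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(n):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] / rows[i][i] for i in range(n)]  # w0 + w1*x + w2*x^2 ...

def predict(weights, x):
    return sum(w * x ** i for i, w in enumerate(weights))

def mse(weights, xs, ys):
    """Mean squared error of the fitted curve on the given samples."""
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data: distance = 0.5*speed + 0.05*speed^2 (assumed coefficients).
speeds = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
distances = [0.5 * v + 0.05 * v ** 2 for v in speeds]

linear = fit_polynomial(speeds, distances, degree=1)
quadratic = fit_polynomial(speeds, distances, degree=2)
mse_linear = mse(linear, speeds, distances)
mse_quadratic = mse(quadratic, speeds, distances)
print(mse_linear, mse_quadratic)
```

Since the synthetic data is noise-free and quadratic, the degree-2 fit has a far lower training error than the straight line. On real, noisy data a lower training error alone does not show that the higher-degree model is better; it should also be checked on held-out data to guard against overfitting.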

Unsupervised learning

As the name suggests, this class of algorithms learns/infers concepts without supervision. Unlike supervised learning algorithms, which infer a mapping function based on training dataset consisting of input data points and output signals, unsupervised algorithms are tasked with finding patterns and relationships in the training data without any output signals available in the training dataset. This class of algorithms utilizes the input dataset to detect patterns, and mine for rules or group/cluster data points so as to extract meaningful insights from the raw input dataset.

Unsupervised algorithms come in handy when we do not have the liberty of a training set that contains corresponding output signals or labels. In many real-world scenarios, datasets are available without output signals and it is difficult to manually label them. Thus, unsupervised algorithms are helpful in plugging such gaps.

Similar to supervised learning algorithms, unsupervised algorithms can also be categorized for ease of understanding and learning. The following are different categories of unsupervised learning algorithms.

Clustering

The unsupervised equivalent of classification is termed as clustering. These algorithms help us cluster or group data points into different groups or categories, without the availability of any output label in the input/training dataset. These algorithms try to find patterns and relationships from the input dataset, utilizing inherent features to group them into various groups based on some similarity measure, as shown in the following diagram:

Unsupervised learning: Clustering news articles

A real-world example to help understand clustering could be news articles. There are hundreds of news articles written daily, each catering to different topics ranging from politics and sports to entertainment, and so on. An unsupervised approach to group these articles together can be achieved using clustering, as shown in the preceding figure.

There are different approaches to perform the process of clustering. The most popular ones are:

Centroid based methods. Popular ones are K-means and K-medoids.
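Agglomerative and divisive hierarchical clustering methods. A popular one is Ward's method. Affinity propagation is often mentioned alongside these, although it is an exemplar-based method rather than a hierarchical one.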
Data distribution based methods, for instance, Gaussian mixture models.
Density based methods such as DBSCAN and so on.
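As an illustration of the centroid based family, the following is a minimal k-means sketch. It assumes two-dimensional points in place of the news articles from the example (real articles would first need to be converted into numeric feature vectors), and the initial centroids are picked by hand to keep the run deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid-update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
```

No labels are supplied at any point; the grouping emerges purely from the distance-based similarity measure, which is what makes this an unsupervised method.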

Dimensionality reduction

Data and ML are the best of friends, yet a lot of issues come with more and bigger data. A large number of attributes or a bloated-up feature space is one common problem. A large feature space poses problems in analyzing and visualizing the data along with issues related to training, memory, and space constraints. This is also known as the curse of dimensionality. Since unsupervised methods help us extract insights and patterns from unlabeled training datasets, they are also useful in helping us reduce dimensionality.

In other words, unsupervised methods help us reduce feature space by helping us select a representative set of features from the complete available list:

Unsupervised learning: Dimensionality reduction using PCA

Principal Component Analysis (PCA), nearest neighbors, and discriminant analysis are some of the popular dimensionality reduction techniques.

The preceding diagram is a famous depiction of the workings of the PCA based dimensionality reduction technique. It shows a swiss roll shape with data represented in three-dimensional space. Application of PCA results in transformation of the data into two-dimensional space, as shown on the right-hand side of the diagram.
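The idea behind PCA can be sketched on a much smaller problem than the one in the diagram: reducing two-dimensional points to one dimension. This is a simplified sketch that assumes power iteration to find the first principal component; the data is invented, and a production implementation would normally rely on a numerical library instead.

```python
import math

def first_principal_component(points, iterations=100):
    """Return the dominant direction of variance and the 1-D projections."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    projected = [sum(r[d] * v[d] for d in range(dims)) for r in centered]
    return v, projected

# Hypothetical points lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
component, projected = first_principal_component(points)
print(component)
```

Each point is replaced by a single coordinate along the direction of greatest variance, so the feature space shrinks from two dimensions to one while most of the spread in the data is preserved.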

Association rule mining

This class of unsupervised ML algorithms helps us understand and extract patterns from transactional datasets. Also termed as Market Basket Analysis (MBA), these algorithms help us identify interesting relationships and associations between items across transactions.

Using association rule mining, we can answer questions such as "what items are bought together by people at a given store?" or "do people who buy wine also tend to buy cheese?". FP-growth, ECLAT, and Apriori are some of the most widely used algorithms for association rule mining tasks.
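The wine and cheese question can be made concrete with the two basic measures used to score a rule, support and confidence. The sketch below computes them directly on a handful of invented transactions; it does not implement Apriori, FP-growth, or ECLAT, which are techniques for searching large sets of candidate rules efficiently.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, invented for illustration.
baskets = [
    {"wine", "cheese", "bread"},
    {"wine", "cheese"},
    {"wine", "crackers"},
    {"bread", "milk"},
    {"cheese", "bread"},
]

s = support(baskets, {"wine", "cheese"})
c = confidence(baskets, {"wine"}, {"cheese"})
print(s, c)
```

Here two of the five baskets contain both items, and two of the three baskets with wine also contain cheese, so the rule "wine implies cheese" has a support of 0.4 and a confidence of about 0.67.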

Anomaly detection

Anomaly detection is the task of identifying rare events/observations based on historical data. Anomaly detection is also termed as outlier detection. Anomalies or outliers usually have characteristics such as being infrequent or occurring in short sudden bursts over time.

For such tasks, we provide a historical dataset for the algorithm so it can identify and learn the normal behavior of data in an unsupervised manner. Once learned, the algorithm helps us identify patterns that differ from this learned behavior.
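A very simple way to sketch this idea is to summarize the historical data by its mean and standard deviation and flag values that fall far from it. The three-standard-deviation threshold, the function names, and the readings below are assumptions made for this example; real anomaly detectors are usually more elaborate.

```python
import statistics

def fit_normal_behavior(history):
    """Learn 'normal' as the mean and standard deviation of past values."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomaly(value, mean, std, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    return abs(value - mean) / std > threshold

# Hypothetical historical readings, for example requests per minute.
history = [50, 52, 49, 51, 50, 48, 53, 50, 49, 51]
mean, std = fit_normal_behavior(history)

typical = is_anomaly(51, mean, std)
burst = is_anomaly(75, mean, std)
print(typical, burst)
```

No labels marking past observations as normal or anomalous are needed; the notion of normal behavior is inferred from the historical data itself.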

CRISP-DM

Cross Industry Standard Process for Data Mining (CRISP-DM) is one of the most popular and widely used processes for data mining and analytics projects. CRISP-DM provides the required framework, which clearly outlines the necessary steps and workflows for executing a data mining and analytics project, from business requirements to the final deployment stages and everything in between.

More popularly known by the acronym itself, CRISP-DM is a tried, tested, and robust industry standard process model followed for data mining and analytics projects. CRISP-DM clearly depicts the necessary steps, processes, and workflows for executing any project, right from formalizing business requirements to testing and deploying a solution to transform data into insights. Data science, data mining, and ML are all about trying to run multiple iterative processes to extract insights and information from data. Hence, we can say that analyzing data is truly both an art as well as a science, because it is not always about running algorithms without reason; a lot of the major effort involves understanding the business, the actual value of the efforts being invested, and proper methods for articulating end results and insights.

Data science and data mining projects are iterative in nature to extract meaningful insights and information from data. Data science is as much art as science and thus a lot of time is spent understanding the business value and the data at hand before applying the actual algorithms (these again go through multiple iterations) and finally evaluations and deployment.

Similar to software engineering projects, which have different life cycle models, CRISP-DM helps us track a data mining and analytics project from start to end. This model is divided into six major steps that cover everything from business and data understanding to evaluation and finally deployment, all of which are iterative in nature. See the following diagram:

CRISP-DM model depicting workflow for ML projects

Let's now have a deeper look into each of the six stages to better understand the CRISP-DM model.

Business understanding

The first and foremost step is understanding the business. This crucial step begins with setting the business context and requirements for the problem. Defining the business requirements formally is important to transform them into a data science and analytics problem statement. This step is also used to set the expectations and success criteria so that business and data science teams are on the same page and can track the progress of the project.

The main deliverable of this step is a detailed plan consisting of major milestones, timelines, assumptions, constraints, caveats, issues expected, and success criteria.

Data understanding

Data collection and understanding is the second step in the CRISP-DM framework. In this step we take a deeper dive to understand and analyze the data for the problem statement formalized in the previous step. This step begins with investigating the various sources of data outlined in the detailed project plan previously. These sources of data are then used to collect data, analyze different attributes, and make a note of data quality. This step also involves what is generally termed as exploratory data analysis.

Exploratory data analysis (EDA) is a very important sub-step. It is during EDA that we analyze different attributes of data, their properties, and characteristics. We also visualize data during EDA for a better understanding and to uncover patterns that might previously have been unseen or ignored. This step lays the foundation for the coming steps and hence cannot be neglected.

Data preparation

This is the third and the most time-consuming step in any data science project. Data preparation takes place once we have understood the business problem and explored the data available. This step involves data integration, cleaning, wrangling, feature selection, and feature engineering. First and the foremost is data integration. There are times when data is available from various sources and hence needs to be combined based on certain keys or attributes for better usage.

Data cleaning and wrangling are very important steps. This involves handling missing values, data inconsistencies, fixing incorrect values, and converting data to ingestible formats such that they can be used by ML algorithms.

Data preparation is commonly estimated to take 60-70% of the overall time in a data science project. Apart from data integration and wrangling, this step involves selecting key features based on relevance, quality, assumptions, and constraints. This is also termed feature selection. There are also times when we have to derive or generate features from existing ones, for example, deriving age from date of birth, depending upon the use case requirements. This step is termed feature engineering and is again required based on the use case.
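The age example can be sketched as a tiny feature engineering function. The function name and the dates are hypothetical; the reference date is passed in explicitly so that the derived feature is reproducible.

```python
from datetime import date

def age_in_years(date_of_birth, on_date):
    """Derive an age feature (completed years) from a date-of-birth attribute."""
    had_birthday = (on_date.month, on_date.day) >= (date_of_birth.month, date_of_birth.day)
    return on_date.year - date_of_birth.year - (0 if had_birthday else 1)

dob = date(1990, 6, 15)  # hypothetical raw attribute
age_day_before = age_in_years(dob, date(2020, 6, 14))
age_on_birthday = age_in_years(dob, date(2020, 6, 15))
print(age_day_before, age_on_birthday)
```

The raw date of birth is rarely useful to a model directly, whereas the derived age is a numeric attribute that most algorithms can consume.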

Modeling

The fourth step or the modeling step is where the actual analysis and ML takes place. This step utilizes the clean and formatted data prepared in the previous step for modeling purposes. This is an iterative process and works in sync with the data preparation step as models/algorithms require data in different settings/formats with varying set of attributes.

This step involves selecting relevant tools and frameworks along with the selection of a modeling technique or algorithms. This step includes model building, evaluation, and fine-tuning of models, based on the expectations and criteria laid down during the business understanding phase.

Evaluation

Once the modeling step results in one or more models that satisfy the success criteria, performance benchmarks, and model evaluation metrics, a thorough evaluation step comes into the picture. In this step, we consider the following activities before moving ahead with the deployment stage:

Model result assessment based on quality and alignment with business objectives
Identifying any additional assumptions made or constraints relaxed
Data quality, missing information, and other feedback from data science team and/or subject matter experts (SMEs)
Cost of deployment of the end-to-end ML solution

Deployment

The final step of the CRISP-DM model is deployment to production. The models that have been developed, fine-tuned, validated, and tested during multiple iterations are saved and prepared for the production environment. A proper deployment plan is built, which includes details on hardware and software requirements. The deployment stage also includes putting in place checks and monitoring aspects to evaluate the model in production for results, performance, and other metrics.

Standard ML workflow

The CRISP-DM model provides a high-level workflow for management of ML and related projects. In this section, we will discuss the technical aspects and implementation of standard workflows for handling ML projects. Simply put, an ML pipeline is an end-to-end workflow consisting of various aspects of a data intensive project. Once the initial phases such as business understanding, risk assessments, and ML or data mining techniques selection have been covered, we proceed towards the solution space of driving the project. A typical ML pipeline or workflow with different sub-components is shown in the following diagram:

Typical ML pipeline

A standard ML pipeline broadly consists of the following stages.

Data retrieval

Data collection and extraction is where the story usually begins. Datasets come in all forms including structured and unstructured data that often includes missing or noisy data. Each data type and format needs special mechanisms for data handling as well as management. For instance, if a project concerns analysis of tweets, we need to work with Twitter APIs and develop mechanisms to extract the required tweets, which are usually in JSON format.

Other scenarios may involve existing structured or unstructured datasets, public or private; both may require additional permissions apart from developing extraction mechanisms. A fairly detailed account of working with diverse data formats is given in Chapter 3 of the book Practical Machine Learning with Python (Sarkar et al., Springer, 2017), in case you are interested in diving deeper.

Data preparation

It is worth reiterating that this is where the maximum time is spent in the whole pipeline. This is a fairly detailed step that involves fundamental and important sub-steps, which include:

Exploratory data analysis
Data processing and wrangling
Feature engineering and extraction
Feature scaling and selection

Exploratory data analysis

So far, all the initial steps in the project have revolved around business context, requirements, risks, and so on. This is the first touch point where we actually explore in depth the data that is collected/available. EDA helps us understand various facets of our data. In this step, we analyze different attributes of data, uncover interesting insights, and even visualize data on different dimensions to get a better understanding.

This step helps us gather important characteristics of the dataset at hand, which not only is useful in later stages of the project but also helps us identify and/or mitigate potential issues early in the pipeline. We cover an interesting example later on in this chapter for readers to understand the process and importance of EDA.

Data processing and wrangling

This step is concerned with the transformation of data into a usable form. The raw data retrieved in the first step is in most cases unusable by ML algorithms. Formally, data wrangling is the process of cleaning, transforming, and mapping data from one form to another for consumption in later stages of the project life cycle. This step includes missing data imputation, typecasting, handling duplicates and outliers, and so on. We will cover these steps in the context of use-case driven chapters for a better understanding.
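A few of these wrangling operations can be sketched on a small set of records. The field names, the choice of median imputation, and the data are assumptions made for this example; the right treatment of missing values and duplicates always depends on the use case.

```python
import statistics

def wrangle(records):
    """Typecast ages, drop duplicate rows, and impute missing ages."""
    seen, cleaned = set(), []
    for rec in records:
        # Typecasting: ages arrive as strings; empty or None means missing.
        age = int(rec["age"]) if rec["age"] not in (None, "") else None
        row = (rec["name"].strip(), age)
        if row not in seen:  # handling duplicates
            seen.add(row)
            cleaned.append({"name": row[0], "age": age})
    # Missing data imputation using the median of the known ages.
    median_age = statistics.median(r["age"] for r in cleaned if r["age"] is not None)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = median_age
    return cleaned

# Hypothetical raw records with typical quality problems.
raw = [
    {"name": "Ann ", "age": "34"},
    {"name": "Bob", "age": None},
    {"name": "Ann", "age": "34"},  # duplicate once whitespace is stripped
    {"name": "Cid", "age": "40"},
]
cleaned = wrangle(raw)
print(cleaned)
```

After wrangling, every record has a numeric age and each person appears once, which is the kind of consistent form that later stages of the pipeline expect.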

Feature engineering and extraction

Preprocessed and wrangled data reaches the state where it can be utilized by the feature engineering and extraction step. In this step, we utilize existing attributes to derive and extract context/use-case specific attributes or features that can be utilized by ML algorithms in the coming stages. We employ different techniques based on data types.

Feature engineering and extraction is a fairly involved step and hence is discussed in more detail in the later part of this chapter.

Feature scaling and selection

There are cases when the number of features available is so large that it adversely affects the overall solution. Not only is processing and handling a dataset with a huge number of attributes an issue, but it also leads to difficulties in interpretation, visualization, and so on. These issues are formally termed the curse of dimensionality.

Feature selection thus helps us identify representative sets of features that can be utilized in the modeling step without much loss of information. There are different techniques to perform feature selection; some of them are discussed in the later sections of the chapter.
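Scaling and a very basic form of selection can be sketched together. The example below assumes min-max scaling followed by a variance threshold, which is only one of many possible selection criteria; the column names, values, and threshold are invented.

```python
import statistics

def min_max_scale(column):
    """Rescale values to the [0, 1] range; constant columns map to 0."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in column]

def select_by_variance(columns, threshold=0.01):
    """Keep features whose scaled values vary by more than `threshold`."""
    return [name for name, values in columns.items()
            if statistics.pvariance(min_max_scale(values)) > threshold]

# Hypothetical feature columns on very different scales.
features = {
    "price": [10000, 20000, 30000, 40000],
    "doors": [4, 4, 4, 4],  # constant, carries no information
    "safety": [1, 2, 3, 2],
}

scaled_price = min_max_scale(features["price"])
selected = select_by_variance(features)
print(selected)
```

Scaling puts the attributes on a comparable footing, and the constant column is dropped because it cannot help distinguish one data point from another.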

Modeling

In the process of modeling, we usually feed the data features to an ML method or algorithm and train the model, typically to optimize a specific cost function, in most cases with the objective of reducing errors and generalizing the representations learned from the data.

Depending upon the dataset and project requirements, we apply one or a combination of different ML techniques. These can include supervised techniques such as classification or regression, unsupervised techniques such as clustering, or even a hybrid approach combining different techniques (as discussed earlier in the ML techniques sections).

Modeling is usually an iterative process, and we often leverage multiple algorithms or methods and choose the best model, based on model evaluation performance metrics. Since this is a book about transfer learning, we will mostly be building deep learning based models in subsequent chapters, but the basic principles of modeling are quite similar to ML models.

Model evaluation and tuning

Developing a model is just one portion of learning from data. Modeling, evaluation, and tuning are iterative steps that help us fine-tune and select the best performing models.

Model evaluation

A model is basically a generalized representation of data and the underlying algorithm used for learning this representation. Thus, model evaluation is the process of evaluating the built model against certain criteria to assess its performance. Model performance is usually a function defined to provide a numerical value to help us decide the effectiveness of any model. Often, cost or loss functions are optimized to build an accurate model based on these evaluation metrics.

Depending upon the modeling technique used, we leverage relevant evaluation metrics. For supervised methods, we usually leverage the following techniques:

Creating a confusion matrix based on model predictions versus actual values. This covers metrics such as True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) considering one of the classes as the positive class (which is usually a class of interest).
Metrics derived from the confusion matrix, which include accuracy (overall performance), precision (predictive power of the model), recall (hit rate), and the F1-score (harmonic mean of precision and recall).
The receiver operating characteristic (ROC) curve and the area under the curve (AUC) metric, which summarizes the ROC curve as a single number.
R-square (coefficient of determination), root mean square error (RMSE), F-statistic, Akaike information criterion (AIC), and p-values specifically for regression models.
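
To make the confusion matrix metrics concrete, the following minimal sketch computes them by hand for a binary problem. The labels y_true and y_pred are hypothetical values made up for illustration, with 1 treated as the positive class:

```python
# Hypothetical labels for a binary problem; 1 is the positive class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

# Cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

# Metrics derived from the confusion matrix.
accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, tn, fn)
print(accuracy, precision, recall, f1)
```

In practice, we would typically rely on a library implementation rather than hand-written counts; the sketch is only meant to show how each metric relates to the four cells.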

Popular metrics for evaluating unsupervised methods such as clustering include the following:

Silhouette coefficients
Sum of squared errors
Homogeneity, completeness, and the V-measure
Calinski-Harabasz index

Do note that this list depicts the most popular metrics, which are extensively used, but is by no means an exhaustive list of model evaluation metrics.

Cross-validation is also an important aspect of the model evaluation process, where we leverage validation sets based on cross-validation strategies to evaluate model performance while tuning various hyperparameters of the model. You can think of hyperparameters as knobs that can be used to tune the model to build efficient and better performing models. The usage and details of these evaluation techniques will become clearer when we use them for evaluating our models in subsequent chapters with extensive hands-on examples.

Bias variance trade-off

Supervised learning algorithms help us infer or learn a mapping from input data points to output signals. This learning results in a target or a learned function. Now, in an ideal scenario, the target function would learn the exact mapping between input and output variables. Unfortunately, there are no ideals.

As discussed while introducing supervised learning algorithms, we utilized a subset of data called the training dataset to learn the target function and then test the performance on another subset called the test dataset. Since the algorithm only sees a subset of all possible combinations of data, there arises an error between the predicted outputs and the observed outputs. This is called the total error or the prediction error:

Total Error = Bias Error + Variance Error + Irreducible Error

The irreducible error is the inherent error introduced due to noise, the way we have framed the problem, collected the data, and so on. As the name suggests, this error is irreducible, and we can do little from an algorithmic point of view to handle this.
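
For squared-error loss, this decomposition is usually written with the bias term squared. The following is a sketch under stated assumptions: the data follow y = f(x) + ε with zero-mean noise of variance σ², the learned function is written as f̂, and expectations are taken over training sets and noise. These symbols are introduced here for illustration and are not used elsewhere in the chapter:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```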

Bias

The term bias refers to the underlying assumptions made by a learning algorithm to infer the target function. High bias suggests that the algorithm makes more assumptions about the target function, while low bias suggests fewer assumptions.

The error due to bias is simply the difference between the expected (or average) prediction values and the actual observed values. To get an average of predictions, we repeat the learning step multiple times and then average the results. Bias error helps us understand how well the model generalizes. Low bias algorithms are usually non-parametric algorithms such as decision trees, SVMs, and so on, while parametric functions such as linear and logistic regression are high on bias.

Variance

Variance marks the sensitivity of a model towards the training dataset. As we know, the learning phase relies on a small subset of all possible data combinations called the training set. Thus, variance error captures the changes in the model's estimates as the training dataset changes.

Low variance suggests significantly fewer changes to prediction values as the underlying training dataset changes, while high variance points in the other direction. Non-parametric algorithms such as decision trees have high variance, while parametric algorithms such as linear regression are less flexible and hence low on variance.
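
The definitions of bias and variance can be illustrated with a small simulation. The following is a sketch rather than a definitive recipe: we assume a hypothetical true function (true_fn), a fixed set of inputs, and Gaussian noise, then repeat the learning step many times and study the predictions of a rigid model (a constant, polynomial degree 0) and a flexible one (polynomial degree 5) at a single point x0:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground truth used only for this simulation.
    return 3.0 * x

x_train = np.linspace(0.0, 1.0, 20)   # fixed inputs
x0 = 0.9                               # point at which we study predictions
noise_sd = 0.3
n_repeats = 500

preds = {0: [], 5: []}                 # polynomial degree -> predictions at x0
for _ in range(n_repeats):
    # Draw a fresh training set: same inputs, new noise.
    y_train = true_fn(x_train) + rng.normal(0.0, noise_sd, size=x_train.shape)
    for degree in preds:
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[degree].append(np.polyval(coeffs, x0))

results = {}
for degree, values in preds.items():
    values = np.asarray(values)
    # Squared difference between the average prediction and the truth.
    bias_sq = (values.mean() - true_fn(x0)) ** 2
    # Spread of the predictions across training sets.
    variance = values.var()
    results[degree] = (bias_sq, variance)
    print(f"degree={degree}: bias^2={bias_sq:.4f}, variance={variance:.4f}")
```

Under these assumptions, the constant model should show a large squared bias and a small variance, while the degree-5 polynomial should show the opposite pattern.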

Trade-off

The bias-variance trade-off is the problem of simultaneously reducing the bias and variance errors of a supervised learning algorithm, both of which prevent the target function from generalizing well beyond the training data points. Reducing one typically increases the other. Let's have a look at the following illustrations:

Bias variance trade-off

Readers are encouraged to visit the following links for a better and in-depth understanding of bias-variance trade-off: http://scott.fortmann-roe.com/docs/BiasVariance.html and https://elitedatascience.com/bias-variance-tradeoff.

Consider that we are given a problem statement as: given a person's height, determine his/her weight. We are also given a training dataset with corresponding values for height and weight. The data is shown in the following diagram:

Plot depicting height-weight dataset

Please note that this is a toy example to explain important concepts; we will use real-world cases in subsequent chapters while solving actual problems.

This is an instance of a supervised learning problem, more specifically a regression problem (see why?). Utilizing this training dataset, our algorithm would have to learn the target function to find a mapping between heights and weights of different individuals.

Underfitting

Based on our algorithm, there could be different outputs of the training phase. Let's assume that the learned target function is as shown in the following diagram:

Underfit model

This lazy function always predicts a constant output value. Since the target function is not able to learn the underlying structure of the data, it results in what is termed as underfitting. An underfit model has a poor predictive performance.

Overfitting

The other extreme of the training phase is termed as overfitting. The overfitting graph can be represented as follows:

Overfit model

This shows a target function that perfectly maps each data point in our training dataset. This is better known as model overfitting. In such cases, the algorithm tries to learn the exact data characteristics, including the noise, and thus fails to predict reliably on new unseen data points.

Generalization

The sweet spot between underfitting and overfitting is what we term as a good fit. The graph for a model which may generalize well for the given problem is as follows:

Well generalizing fit

A learned function that can perform well enough on unseen data points as well as on the training data is termed a generalizable function. Thus, generalization refers to how well a target function can perform on unseen data, based on the concepts learned during the training phase. The preceding diagram depicts a well generalizing fit.
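
The three situations described above can be reproduced with a small sketch. The height and weight values below are synthetic and generated from an assumed linear relationship with noise; they are not the data behind the diagrams. A degree-0 polynomial plays the role of the constant underfit function, degree 1 is a candidate good fit, and degree 9 is flexible enough to start chasing noise:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy data: height in metres, weight in kilograms.
height = rng.uniform(1.5, 1.9, size=30)
weight = 100.0 * height - 100.0 + rng.normal(0.0, 4.0, size=30)

# Hold out the last 10 points as unseen data.
h_train, h_test = height[:20], height[20:]
w_train, w_test = weight[:20], weight[20:]

# Centre the input so that high-degree polynomial fits stay well conditioned.
centre = h_train.mean()

def mse(coeffs, h, w):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, h - centre) - w) ** 2))

errors = {}
for degree in (0, 1, 9):
    coeffs = np.polyfit(h_train - centre, w_train, degree)
    errors[degree] = (mse(coeffs, h_train, w_train),
                      mse(coeffs, h_test, w_test))
    print(f"degree {degree}: train MSE={errors[degree][0]:.2f}, "
          f"test MSE={errors[degree][1]:.2f}")
```

The training error can only go down as the degree increases, which is why training error alone cannot tell a good fit from an overfit one; the error on the held-out points is the one to watch.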

Model tuning

Preparing and evaluating a model is as essential as tuning one. Working with different ML frameworks/libraries that provide us with the standard set of algorithms, we hardly ever use them straight out of the box.

ML algorithms have different parameters or knobs, which can be tuned based on the project requirements and different evaluation results. Model tuning works by iterating over different settings of hyperparameters or metaparameters to achieve better results. Hyperparameters are knobs at a high-level abstraction, which are set before the learning process begins.

This is different from model level parameters, which are learned during the training phase. Hence, model tuning is also termed hyperparameter optimization.

Grid search, randomized hyperparameter search, Bayesian optimization, and so on are some of the popular ways of performing model tuning. Though model tuning is very important, overdoing it might impact the learning process adversely. Some of the issues related to overdoing the tuning process were discussed in the section bias-variance trade-off.
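
As a minimal sketch of grid search combined with cross-validation, the following snippet treats the degree of a polynomial as the hyperparameter and selects it using 5-fold cross-validation. The data, the grid, and the helper cv_error are all hypothetical and written only for this illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: a noisy quadratic relationship.
x = rng.uniform(-1.0, 1.0, size=60)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(0.0, 0.3, size=60)

def cv_error(degree, n_folds=5):
    """Mean validation MSE of a polynomial of the given degree."""
    folds = np.array_split(np.arange(len(x)), n_folds)
    fold_errors = []
    for val_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[val_idx] = False
        # Model parameters (the coefficients) are learned on the training folds.
        coeffs = np.polyfit(x[train_mask], y[train_mask], degree)
        residuals = np.polyval(coeffs, x[val_idx]) - y[val_idx]
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))

# The hyperparameter grid: each degree is fixed before training starts.
grid = [0, 1, 2, 3, 5, 8]
scores = {degree: cv_error(degree) for degree in grid}
best_degree = min(scores, key=scores.get)

print(scores)
print('best degree:', best_degree)
```

Note the distinction the text draws: the degree is set before learning begins (a hyperparameter), whereas the polynomial coefficients are learned during training (model parameters).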

Deployment and monitoring

Once model development, evaluation, and tuning are complete, along with multiple iterations of improving the results, the final stage of model deployment comes into the picture. Model deployment takes care of aspects such as model persistence, exposing models to other applications through different mechanisms such as API endpoints, and so on, along with developing monitoring strategies.

We live in a dynamic world where everything changes every so often, and the same is true about data and other factors related to our use cases. It is imperative that we put in place monitoring strategies such as regular reports, logs, and tests to keep a check on the performance of our solutions and make changes as and when required.

ML pipelines are as much about software engineering as they are about data science and ML. We outlined and discussed the different components of a typical pipeline in brief. Depending upon specific use cases, we modify the standard pipeline to suit the needs yet make sure we do not overlook known pitfalls. In the coming sections, let's understand a couple of the components of a typical ML pipeline in a bit more detail, with actual examples and code snippets.

Exploratory data analysis

EDA is among the first few tasks we perform when we get started on any ML project. As discussed in the section on CRISP-DM, data understanding is an important step to uncover various insights about the data and better understand the business requirements and context.

In this section, we will take up an actual dataset and perform EDA using pandas as our data manipulation library, coupled with seaborn for visualization. Complete code snippets and details for this analysis are available in the Python Notebook game_of_thrones_eda.ipynb.

We first begin by importing the required libraries and setting up the configurations as shown in the following snippet:

In [1]: import numpy as np 
   ...: import pandas as pd 
   ...: from collections import Counter 
   ...:  
   ...: # plotting 
   ...: import seaborn as sns 
   ...: import matplotlib.pyplot as plt 
   ...:  
   ...: # setting params 
   ...: params = {'legend.fontsize': 'x-large', 
   ...:           'figure.figsize': (30, 10), 
   ...:           'axes.labelsize': 'x-large', 
   ...:           'axes.titlesize':'x-large', 
   ...:           'xtick.labelsize':'x-large', 
   ...:           'ytick.labelsize':'x-large'} 
   ...:  
   ...: sns.set_style('whitegrid') 
   ...: sns.set_context('talk') 
   ...:  
   ...: plt.rcParams.update(params)

Once the settings and requirements are in place, we can begin concentrating on the data. The dataset in consideration for exploratory analysis is the battles.csv file, which contains all major battles from the world of Game of Thrones (up to season 5).

One of the most popular television series of all time, Game of Thrones is a fantasy drama set in the fictional continents of Westeros and Essos, filled with multiple plots and a huge number of characters all battling for the Iron Throne! It is an adaptation of the A Song of Ice and Fire novel series by George R. R. Martin. Being a popular series, it has caught the attention of many, and data scientists aren't to be excluded. This notebook presents EDA on the Kaggle dataset enhanced by Myles O'Neill (more details: https://www.kaggle.com/mylesoneill/game-of-thrones). This dataset is based on a combination of multiple datasets collected and contributed to by multiple people. We utilize the battles.csv in this analysis. The original battles data was presented by Chris Albon; more details are available at https://github.com/chrisalbon/war_of_the_five_kings_dataset.

The following snippet loads the battles.csv file using pandas:

In [2]: battles_df = pd.read_csv('battles.csv')

The dataset is as shown in the following screenshot:

Sample rows from battles.csv of Game of Thrones

We can view the total number of rows, data types of each of the attributes, and general statistics of numerical attributes using the pandas utilities shape, dtypes, and describe() respectively. We have data about 38 battles, with 25 attributes describing each one of them.
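
The three utilities mentioned above can be tried on any dataframe. The following sketch uses a tiny hypothetical stand-in (sample_df) instead of the actual battles.csv data, so the values shown are made up:

```python
import pandas as pd

# Hypothetical stand-in for battles_df with a handful of columns.
sample_df = pd.DataFrame({
    'name': ['Battle A', 'Battle B', 'Battle C'],
    'year': [298, 299, 299],
    'attacker_size': [15000.0, None, 6000.0],
})

print(sample_df.shape)       # (number of rows, number of columns)
print(sample_df.dtypes)      # data type of each attribute
print(sample_df.describe())  # summary statistics of numerical attributes
```

Note that describe() reports a count of non-missing values per attribute, which is a quick way to spot missing data early on.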

Let's understand the distribution of battles across years from the fantasy land. The following snippet plots a bar graph of this distribution:

In [3]: sns.countplot(y='year',data=battles_df) 
   ...: plt.title('Battle Distribution over Years') 
   ...: plt.show()

The following plot shows that the highest number of battles were fought in the year 299, followed by 300 and 298 respectively:

Battle distribution over years

There are different regions in this fantasy land, with battles taking place at every place imaginable. Yet, it would be interesting to see if there were any preferred regions. The following snippet helps us answer this question precisely:

In [4]: sns.countplot(x='region',data=battles_df)
   ...: plt.title('Battles by Regions')
   ...: plt.show()

The following plot helps us identify that The Riverlands have seen the most battles, followed by The North and The Westerlands:

Battles by regions

Another interesting thing to notice is that there has been only one battle Beyond the Wall (spoiler alert: stay tuned for later seasons).

We can perform similar analysis using different group-by variations to understand, for instance, the number of major deaths, or captures per region, and so on.
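
As a sketch of one such group-by variation, the following snippet totals major deaths and captures per region. It runs on a few hypothetical rows; the column names region, major_death, and major_capture are assumed to match those in battles.csv:

```python
import pandas as pd

# Hypothetical rows using column names assumed to match battles.csv.
sample_df = pd.DataFrame({
    'region': ['The Riverlands', 'The North', 'The Riverlands', 'The North'],
    'major_death': [1, 0, 1, 1],
    'major_capture': [0, 1, 1, 0],
})

# Total major deaths and captures per region.
per_region = (sample_df
              .groupby('region')[['major_death', 'major_capture']]
              .sum())
print(per_region)
```

The same pattern should carry over to the full dataframe by replacing sample_df with battles_df.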

We move on to see which king attacked the most. We visualize this using a pie chart to understand the percentage share of battles fought by each of the kings involved. Please note that we perform this analysis based on attacking kings. Similar analysis can be performed using defending kings as well. The following snippet prepares a pie chart to display each attacking king's share of battles:

In [5]: attacker_king = battles_df.attacker_king.value_counts()
   ...: attacker_king.name = ''  # suppress the y-axis label
   ...: attacker_king.plot.pie(figsize=(6, 6), autopct='%.2f')

Each attacking king's share of battles is displayed in the following pie chart:

Battle share per attacking king

The lands of Westeros and Essos are dangerous, with enemies and threats all across. Let's analyze the data a bit to understand on how many occasions each of the kings was a winner. Since a king can be either defending his land or attacking for power, it would be interesting to see the defending and attacking wins as well. The following snippet helps us prepare a grouped bar chart to analyze each king's attacking and defending wins:

In [6]: # count wins per king while attacking
   ...: attack_winners = (
   ...:     battles_df.loc[battles_df.attacker_outcome == 'win',
   ...:                    'attacker_king']
   ...:     .value_counts()
   ...:     .rename_axis('king')
   ...:     .reset_index(name='wins'))
   ...: attack_winners['win_type'] = 'attack'
   ...: 
   ...: # count wins per king while defending (the attacker lost)
   ...: defend_winners = (
   ...:     battles_df.loc[battles_df.attacker_outcome == 'loss',
   ...:                    'defender_king']
   ...:     .value_counts()
   ...:     .rename_axis('king')
   ...:     .reset_index(name='wins'))
   ...: defend_winners['win_type'] = 'defend'
   ...: 
   ...: # plot attacking and defending wins side by side for each king
   ...: sns.barplot(x='king',
   ...:             y='wins',
   ...:             hue='win_type',
   ...:             data=pd.concat([attack_winners,
   ...:                             defend_winners],
   ...:                            ignore_index=True))
   ...: plt.title('Kings and Their Wins')
   ...: plt.ylabel('wins')
   ...: plt.xlabel('king')
   ...: plt.show()

The preceding snippet calculates the number of wins per king while attacking and then calculates the number of wins per king while defending. We then concatenate the two results and plot them as a grouped bar plot, with attacking and defending wins shown side by side for each king. The results are shown in the following graph:

Number of wins per king

The preceding graph clearly shows that the Baratheon boys have the most wins, both while attacking and while defending. Seems like they have luck on their side so far. Robb Stark was the second most successful king, until of course the Red Wedding happened.

The dataset also contains attributes describing the number of houses involved, battle commanders, and army sizes. We can perform similar and more in-depth analysis to better understand the battles. We encourage the readers to try out a few of these as exercises and check the Python Notebook for more pointers.

Before we close the section, let's try to identify archenemies in the fight for the Iron Throne. Though the fans will already have a gut feeling about this, let's see what the data has to say about it. The following snippet helps us answer this question:

In [7]: # keep only battles where both kings are known
   ...: temp_df = battles_df.dropna(
   ...:     subset=['attacker_king', 'defender_king'])[
   ...:         ['attacker_king', 'defender_king']]
   ...: 
   ...: # count battles per unordered pair of kings; sorting each
   ...: # pair makes (A, B) and (B, A) map to the same key
   ...: pair_counts = Counter(
   ...:     tuple(sorted(set(king_pair)))
   ...:     for king_pair in temp_df.values
   ...:     if len(set(king_pair))>1)
   ...: 
   ...: archenemy_df = pd.DataFrame(
   ...:     list(pair_counts.items()),
   ...:     columns=['king_pair', 'battle_count'])
   ...: 
   ...: # readable label for each pair
   ...: archenemy_df['versus_text'] = archenemy_df['king_pair'].apply(
   ...:     lambda pair: '{} Vs {}'.format(pair[0], pair[1]))
   ...: archenemy_df.sort_values('battle_count',
   ...:                          inplace=True,
   ...:                          ascending=False)
   ...: 
   ...: sns.barplot(data=archenemy_df,
   ...:             x='versus_text',
   ...:             y='battle_count')
   ...: plt.xticks(rotation=45)
   ...: plt.xlabel('Archenemies')
   ...: plt.ylabel('Number of Battles')
   ...: plt.title('Archenemies')
   ...: plt.show()

We first prepare a temporary dataframe and remove any battles that do not have either the attacking or defending king's name listed. Once we have a clean dataframe, we iterate over each of the rows and count the number of battles every pair has fought. We ignore cases where the battle was among the king's own army (if len(set(king_pair))>1). We then simply plot the results in a bar graph, shown as follows:

Archenemies from Game of Thrones

We see that the dataset confirms the gut feelings. Robb Stark and Joffrey Baratheon have fought a total of 19 battles already, with other pairs having fought five or fewer battles.

The analysis and visualizations shared in this section were a glimpse of what can be done on a dataset. There could be many more patterns and insights that could be extracted from this dataset alone.

EDA is a very powerful mechanism for understanding the dataset in detail before jumping into other stages of ML. In the coming chapters, we will regularly perform EDA to assist us in understanding the business problem along with the dataset before we go into modeling, tuning, evaluation, and deployment stages.

Feature extraction and engineering

Data preparation is the longest and the most complex phase of any ML project. The same was emphasized while discussing the CRISP-DM model, where we mentioned how the data preparation phase takes up about 60-70% of the overall time spent in an ML project.

Once we have our raw dataset preprocessed and wrangled, the next step is to make it usable for ML algorithms. Feature extraction is the process of deriving features from raw attributes. For instance, feature extraction while working with image data refers to the extraction of red, blue, and green channel information as features from raw pixel-level data.

Along the same lines, feature engineering refers to the process of deriving additional features from existing ones using mathematical transformations. For instance, feature engineering would help us in deriving a feature such as annual income from a person's monthly income (based on use case requirements). Since both feature extraction and engineering help us transform raw datasets into usable forms, the terms are used interchangeably by ML practitioners.

Feature engineering strategies

The process of transforming raw datasets (post clean up and wrangling) into features that can be utilized by ML algorithms is a combination of domain knowledge, use case requirements, and specific techniques. Features thus depict various representations of the underlying data and are the outcome of the feature engineering process.

Since feature engineering transforms raw data into a useful representation of itself, there are various standard techniques and strategies that can be utilized, based on the type of data at hand. In this section we will discuss a few of those strategies, briefly covering both structured and unstructured data.

Working with numerical data

Numerical data, commonly available in datasets in the form of integers or floating-point numbers and popularly known as continuous numerical data, is usually an ML-friendly data type. By friendly, we refer to the fact that numeric data can be ingested by most ML algorithms directly. This, however, does not mean that numeric data does not require additional processing and feature engineering steps.

There are various techniques for extracting and engineering features from numerical data. Let's look at some of those techniques in this section:

Raw measures: These data attributes or features can be used directly in their raw or native format as they occur in the dataset without any additional processing. Examples can be age, height, or weight (as long as data distributions are not too skewed!).
Counts: Numeric features such as counts and frequencies are also useful in certain scenarios to depict important details. Examples can be the number of credit card fraud occurrences, song listen counts, device event occurrences, and so on.
Binarization: Often we might want to binarize occurrences or features, especially to just indicate if a specific item or attribute was present (usually denoted with a 1) or absent (denoted with a 0). This is useful in scenarios like building recommendation systems.
Binning: This technique typically bins or groups continuous numeric values from any feature or attribute under analysis to discrete bins, such that each bin covers a specific numeric range of values. Once we get these discrete bins, we can choose to further apply categorical data-based feature engineering on the same. Various binning strategies exist, such as fixed-width binning and adaptive binning.
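
Binarization and binning from the preceding list can be sketched with pandas as follows. The listen counts, ages, bin edges, and labels are hypothetical choices made for illustration; pd.qcut is used here as one simple form of adaptive (quantile-based) binning:

```python
import pandas as pd

# Hypothetical song listen counts and user ages.
df = pd.DataFrame({'listen_count': [0, 3, 0, 12, 1],
                   'age': [7, 23, 35, 61, 48]})

# Binarization: 1 if the song was listened to at all, 0 otherwise.
df['listened'] = (df['listen_count'] > 0).astype(int)

# Fixed-width binning: every bin spans 20 years.
df['age_bin_fixed'] = pd.cut(df['age'], bins=[0, 20, 40, 60, 80],
                             labels=['0-20', '20-40', '40-60', '60-80'])

# Adaptive binning: quantile-based bins with roughly equal counts per bin.
df['age_bin_quantile'] = pd.qcut(df['age'], q=2, labels=['lower', 'upper'])

print(df)
```

The binned columns are categorical, so the categorical feature engineering techniques discussed next can be applied to them.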

Code snippets to better understand feature engineering for numeric data are available in the notebook feature_engineering_numerical_and_categorical_data.ipynb.

Working with categorical data

Another important class of data commonly encountered is categorical data. Categorical features have discrete values that belong to a finite set of classes. These classes may be represented as text or numbers. Depending upon whether there is any order to the classes or not, categorical features are termed as ordinal and nominal respectively.

Nominal features are those categorical features that have a finite set of values but do not have any natural ordering to them. For instance, weather seasons, movie genres, and so on are all nominal features. Categorical features that have a finite set of classes with a natural ordering to them are termed as ordinal features. For instance, days of the week, dress sizes, and so on are ordinals.

Typically, any standard workflow in feature engineering involves some form of transformation of these categorical values into numeric labels and then the application of some encoding scheme on these values. Popular encoding schemes are briefly mentioned as follows:

One-hot encoding: This strategy creates n binary valued columns for a categorical attribute assuming there are n number of distinct categories
Dummy coding: This strategy creates n-1 binary valued columns for a categorical attribute assuming there are n number of distinct categories
Feature hashing: This strategy is leveraged where we use a hash function to add several features into a single bin or bucket (new feature), which is popularly used when we have a large number of features
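
The encoding schemes above can be sketched as follows. The genre values are hypothetical. One-hot and dummy coding use pd.get_dummies, while hashed_bucket is a simplified, hand-written illustration of the hashing idea and not the implementation used by any particular library:

```python
import hashlib

import pandas as pd

# Hypothetical nominal feature with n = 3 distinct categories.
genres = pd.Series(['drama', 'comedy', 'action', 'drama'], name='genre')

# One-hot encoding: n binary columns.
one_hot = pd.get_dummies(genres, prefix='genre')

# Dummy coding: n - 1 binary columns (the first category is dropped).
dummy = pd.get_dummies(genres, prefix='genre', drop_first=True)

def hashed_bucket(value, n_buckets=4):
    """Map a category to one of n_buckets using a hash function."""
    digest = hashlib.sha256(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

# Feature hashing: several categories may share a bucket.
buckets = genres.map(hashed_bucket)

print(one_hot)
print(dummy)
print(buckets.tolist())
```

With dummy coding, the dropped category is represented implicitly by zeros in all remaining columns.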

Code snippets to better understand feature engineering for categorical data are available in the notebook feature_engineering_numerical_and_categorical_data.ipynb.

Working with image data

Image or visual data is a rich source of data, with several use cases that can be solved using ML algorithms and deep learning. Image data poses a lot of challenges and requires careful preprocessing and transformation before it can be utilized by any of the algorithms. Some of the most common ways of performing feature engineering on image data are as follows:

Utilize metadata information or EXIF data: Attributes such as image creation date, modification date, dimensions, compression format, device used to capture the image, resolution, focal length, and so on.
Pixel and channel information: Every image can be considered as a matrix of pixel values or an (m, n, c) matrix where m represents the number of rows, n represents the number of columns, and c points to color channels (for instance R, G, and B). Such a matrix can be then transformed into different shapes as per requirements of the algorithm and use case.
Pixel intensity: Sometimes it is difficult to work with colored images that have multiple channels across colors. Pixel intensity-based feature extraction relies on binning pixels based on intensities rather than utilizing raw pixel-level values.
Edge detection: Sharp changes in contrast and brightness between neighboring pixels can be utilized to identify object edges. There are different algorithms available for edge detection.
Object detection: We take the concept of edge detection and extend it to object detection and then utilize identified object boundaries as useful features. Again, different algorithms may be utilized based on the type of image data available.
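
The pixel, channel, and intensity-based features from the preceding list can be sketched with NumPy. The image below is a small array of random values standing in for a real picture, and the choice of eight intensity bins is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x6 RGB image with integer pixel values in [0, 255].
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
m, n, c = image.shape

# Pixel and channel information: flatten into one feature vector.
flat_features = image.reshape(-1)            # length m * n * c

# Pixel intensity: average the channels, then bin the intensities.
intensity = image.mean(axis=2)               # shape (m, n)
hist, bin_edges = np.histogram(intensity, bins=8, range=(0, 256))

print(flat_features.shape)
print(hist)
```

The histogram has a fixed length regardless of the image size, which makes it a convenient feature vector for algorithms that expect inputs of equal length.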

Deep learning based automated feature extraction

The feature extraction methods for image data and other types discussed so far require a lot of time, effort, and domain understanding. This kind of feature extraction has its merits along with its limitations.

Lately, deep learning models, specifically Convolutional Neural Networks (CNNs), have been studied and utilized as automated feature extractors. A CNN is a special case of a deep neural network optimized for image data. At the core of any CNN are convolutional layers, which basically apply sliding filters across the height and width of the image. The dot product of pixel values and these filters results in activation maps, and the filter weights are learned across multiple epochs. At every level, these convolutional layers help in extracting specific features such as edges, textures, corners, and so on.
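
The sliding filter operation can be sketched in a few lines of NumPy. This is a simplified illustration with no padding and a stride of 1; the image and the vertical-edge filter are hand-written for this example, whereas a CNN would learn its filter weights from data:

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    activation = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Dot product of the patch and the filter weights.
            activation[i, j] = np.sum(patch * kernel)
    return activation

# Hypothetical 5x5 grayscale image: dark left columns, bright right columns.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written vertical-edge filter; a CNN would learn such weights.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

activation_map = slide_filter(image, kernel)
print(activation_map)
```

The activation map responds strongly where the filter overlaps the dark-to-bright boundary and is zero over the uniformly bright area, which is the sense in which a filter extracts an edge feature.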

There is more to deep learning and CNNs, but, to keep things simple, let's assume that at every layer, CNNs help us extract different low and high-level features automatically. This in turn saves us from manually performing feature extraction. We will study CNNs in more detail in coming chapters and see how they help us extract features automatically.

Working with text data

Numerical and categorical features are what we call structured data types. They are easier to process and utilize in an ML workflow. Textual data is one major source of unstructured information that is equally important. Textual data presents multiple challenges related to syntactical understanding, semantics, format, and content. Textual data also presents issues of transformation into numeric form before it can be utilized by ML algorithms. Thus, feature engineering for textual data is preceded by rigorous preprocessing and clean up steps.

Text preprocessing

Textual data requires careful and diligent preprocessing before any feature extraction/engineering can be performed. There are various steps involved in preprocessing textual data. The following is a list of some of the most widely used preprocessing steps for textual data:

Tokenization
Lowercasing
Removal of special characters
Contraction expansions
Stopword removal
Spell corrections
Stemming and lemmatization
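
A few of these steps can be sketched using only the Python standard library. The stopword list and the contraction map below are tiny hypothetical examples; real projects typically rely on the much larger resources shipped with NLP libraries, and steps such as spell correction, stemming, and lemmatization are left out of this sketch:

```python
import re

# Tiny hypothetical stopword list and contraction map, for illustration only.
STOPWORDS = {'the', 'is', 'a', 'and', 'it'}
CONTRACTIONS = {"isn't": 'is not', "it's": 'it is'}

def preprocess(text):
    text = text.lower()                               # lowercasing
    for short, full in CONTRACTIONS.items():          # contraction expansion
        text = text.replace(short, full)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)          # special character removal
    tokens = text.split()                             # whitespace tokenization
    return [tok for tok in tokens
            if tok not in STOPWORDS]                  # stopword removal

tokens = preprocess("It's a GREAT movie, and the plot isn't boring!")
print(tokens)
```

Note that the order of the steps matters: contractions are expanded before special characters are removed, since removing the apostrophe first would break the lookup.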

We will be covering most techniques in detail in the chapters related to use cases. For a better understanding, readers may refer to Chapter 4 and Chapter 7 of Practical Machine Learning with Python, Sarkar and their co-authors, Springer, 2017.

Feature engineering

Once we have our textual data properly processed via the methods mentioned in the previous section, we can utilize some of the following techniques for feature extraction and transformation into numerical form. Code snippets to better understand feature engineering for textual data are available in the Jupyter Notebook feature_engineering_text_data.ipynb:

Bag-of-words model: This is by far the simplest vectorization technique for textual data. In this technique, each document is represented as a vector on N dimensions, where N indicates all possible words across the preprocessed corpus, and each component of the vector either denotes the presence of the word or its frequency.
TF-IDF model: The bag-of-words model works under very simplistic assumptions and at certain times leads to various issues. One of the most common issues is related to some words overshadowing the rest of the words due to very high frequency, as the bag-of-words model utilizes absolute frequencies to vectorize. The Term Frequency-Inverse Document Frequency (TF-IDF) model mitigates this issue by scaling/normalizing the absolute frequencies. Mathematically, the model is defined as follows:
tfidf (w, D) = tf (w, D) * idf (w, D)

Here, tfidf (w, D) denotes the TF-IDF score of word w in document D, tf (w, D) is the frequency of word w in document D, and idf (w, D) denotes the inverse document frequency, calculated as the logarithm of the total number of documents in corpus C divided by the number of documents in which w occurs.
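To make the two models concrete, the following minimal sketch computes bag-of-words and TF-IDF vectors for a toy corpus using only the standard library. It follows the unsmoothed definition given in the text; the helper names (build_vocabulary, bag_of_words, idf, tfidf) and the toy corpus are assumptions for illustration, and library implementations commonly add smoothing or normalization on top of this.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of distinct words across the (already tokenized) corpus."""
    return sorted({tok for doc in docs for tok in doc})


def bag_of_words(doc, vocab):
    """Absolute frequency of each vocabulary word in one document."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]


def idf(word, docs):
    """log(total documents / documents containing the word)."""
    doc_freq = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / doc_freq)


def tfidf(doc, docs, vocab):
    """Term frequency scaled by inverse document frequency."""
    counts = Counter(doc)
    return [counts[w] * idf(w, docs) for w in vocab]


corpus = [
    ["sky", "is", "blue"],
    ["sun", "is", "bright"],
    ["sky", "is", "bright", "bright"],
]
vocab = build_vocabulary(corpus)  # ['blue', 'bright', 'is', 'sky', 'sun']

print(bag_of_words(corpus[2], vocab))
# [0, 2, 1, 1, 0]
scores = tfidf(corpus[2], corpus, vocab)
print([round(s, 3) for s in scores])
# [0.0, 0.811, 0.0, 0.405, 0.0]
```

Note how the word "is", which appears in every document, receives a TF-IDF score of zero even though its raw count is nonzero: this is the scaling effect that keeps very frequent words from overshadowing the rest.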

Apart from bag of words and TF-IDF, there are other transformations, such as bag of N-grams, and word embeddings such as Word2vec, GloVe, and many more. We will cover several of them in detail in subsequent chapters.
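A bag of N-grams extends the bag-of-words idea from single words to runs of N consecutive tokens. The short sketch below shows one way to generate them; the function name ngrams is an assumption for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["the", "sky", "is", "blue"]
bigrams = ngrams(tokens, 2)
print(bigrams)
# ['the sky', 'sky is', 'is blue']

# Counting the n-grams gives the bag-of-N-grams representation.
bigram_counts = Counter(bigrams)
```

Each distinct N-gram then plays the role that a single word plays in the bag-of-words vector.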

Feature selection

The process of feature extraction and engineering helps us extract as well as generate features from underlying datasets. There are cases where this leads to large inputs to an algorithm for processing. In such cases, many of the features in the input are likely to be redundant and may lead to complex models and even overfitting. Feature selection is the process of identifying representative features from the complete feature set that is available/generated. The selected set of features is expected to contain the required information such that the algorithm is able to solve the given task without running into processing, complexity, and overfitting issues. Feature selection also helps us better understand the data used for modeling and makes processing quicker.

Feature selection methods can be broadly classified into the following three categories:

Filter methods: As the name suggests, these methods help us rank features based on a statistical score. We then select a subset of these features. These methods are usually not concerned with model outputs; instead, they evaluate features independently. Threshold-based techniques and statistical tests such as correlation coefficients and chi-squared tests are some popular choices.
Wrapper methods: These methods perform a comparative search on the performance of different subsets of features, and then help us select the best performing subset. Forward selection and backward elimination are two popular wrapper methods for feature selection.
Embedded methods: These methods provide the best of the preceding two methods by learning, as part of model training, which subset of features would be the best. Regularization and tree-based methods are popular choices.
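As an illustration of the filter category, the sketch below applies a variance threshold and then ranks features by their absolute Pearson correlation with the target, using only the standard library. The function names, the toy columns, and the choice of Pearson correlation as the score are assumptions made for this example; other statistical scores, such as chi-squared, follow the same rank-then-select pattern.

```python
import math


def variance(xs):
    """Population variance of a numeric column."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def variance_threshold(columns, min_variance):
    """Threshold-based filter: keep features whose variance exceeds the limit."""
    return [name for name, xs in columns.items() if variance(xs) > min_variance]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def filter_select(columns, target, k):
    """Rank features by |correlation| with the target and keep the top k."""
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy data: feature name -> column of values.
columns = {
    "size": [1, 2, 3, 4, 5],
    "noise": [3, 1, 4, 1, 5],
    "constant": [7, 7, 7, 7, 7],
}
target = [2, 4, 6, 8, 10]

print(variance_threshold(columns, 0.0))
# ['size', 'noise']
print(filter_select(columns, target, 2))
# ['size', 'noise']
```

Each feature is scored on its own, without training any model, which is what distinguishes a filter method from the wrapper and embedded approaches.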

Feature selection is an important aspect of building an ML system. It is also one of the major sources of bias that can get into the system if not handled with care. Readers should note that feature selection should be done using a dataset separate from your training dataset. Utilizing the training dataset for feature selection can lead to overfitting, while utilizing the test set for feature selection would overestimate the model's performance.

Popular libraries such as scikit-learn provide a wide array of feature selection techniques out of the box. We will see and utilize many of them in subsequent sections/chapters.