The Deep Learning with Keras Workshop

By Matthew Moocarme, Mahla Abdolahnejad, Ritesh Bhagwat

About this book

New experiences can be intimidating, but not this one! This beginner’s guide to deep learning is here to help you explore deep learning from scratch with Keras, and be on your way to training your first ever neural networks.

What sets Keras apart from other deep learning frameworks is its simplicity. With over two hundred thousand users, Keras has a stronger adoption in industry and the research community than any other deep learning framework.

The Deep Learning with Keras Workshop starts by introducing you to the fundamental concepts of machine learning using the scikit-learn package. After learning how to perform the linear transformations that are necessary for building neural networks, you'll build your first neural network with the Keras library. As you advance, you'll learn how to build multi-layer neural networks and recognize when your model is underfitting or overfitting to the training data. With the help of practical exercises, you’ll learn to use cross-validation techniques to evaluate your models and then choose the optimal hyperparameters to fine-tune their performance. Finally, you’ll explore recurrent neural networks and learn how to train them to predict values in sequential data.

By the end of this book, you'll have developed the skills you need to confidently train your own neural network models.

Publication date:
July 2020
Publisher
Packt
Pages
496
ISBN
9781800562967

 

1. Introduction to Machine Learning with Keras

Overview

This chapter introduces machine learning with Python. We will use real-life datasets to demonstrate the basics of machine learning, which include preprocessing data for machine learning models and building a classification model using the logistic regression model with scikit-learn. We will then advance our model-building skills by incorporating regularization into our models and evaluating their performance with model evaluation metrics. By the end of this chapter, you will be able to confidently create models to solve classification tasks using the scikit-learn library in Python and evaluate the performance of those models effectively.

 

Introduction

Machine learning is the science of utilizing machines to emulate human tasks and to have the machine improve its performance of that task over time. By feeding machines data in the form of observations of real-world events, they can develop patterns and relationships that will optimize an objective function, such as the accuracy of a binary classification task or the error in a regression task.
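
As a rough illustration of what an objective function measures, the following sketch computes the accuracy of a binary classifier and the mean squared error of a regression. The label and price arrays are made-up values chosen for the example, not data from this book.

```python
import numpy as np

# Made-up true labels and predictions for a binary classification task.
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])

# Accuracy: the fraction of predictions that match the true labels.
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 0.75

# Made-up true and predicted values for a regression task.
prices_true = np.array([200.0, 350.0])
prices_pred = np.array([210.0, 330.0])

# Mean squared error: the average of the squared differences.
mse = np.mean((prices_true - prices_pred) ** 2)
print(mse)  # 250.0
```

A model that learns well pushes the accuracy up or the error down as it sees more data.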

In general, the usefulness of machine learning is in the machine's ability to learn highly complex and non-linear relationships in large datasets and to replicate the results of that learning many times. One branch of machine learning algorithms has shown a lot of promise in learning highly complex and non-linear relationships associated with large, often unstructured datasets such as images, audio, and text data—Artificial Neural Networks (ANNs). ANNs, however, can be complicated to program, train, and evaluate, and this can be intimidating for beginners in the field. Keras is a Python library that presents a facile introduction to building, training, and evaluating ANNs that is incredibly useful to those studying machine learning.

Take, for example, the classification of a dataset of pictures of either dogs or cats into classes of their respective type. For a human, this is simple, and the accuracy would likely be very high. However, it may take around a second to categorize each picture and scaling the task can only be achieved by increasing the number of humans, which may not be feasible. While it may be difficult, though certainly not impossible, for machines to reach the same level of accuracy as humans for this task, machines can classify many images per second, and scaling can be easily done by increasing the processing power of a single machine or making the algorithm more efficient:

Figure 1.1: The classification of images as either dog or cat is a simple task for humans, but quite difficult for machines

While the trivial task of classifying dogs and cats may be simple for us humans, the same principles that are used to create a machine learning model that classifies dogs and cats can be applied to other classification tasks that humans may struggle with. An example of this is identifying tumors in Magnetic Resonance Images (MRIs). For humans, this task requires a medical professional with years of experience, whereas a machine may only need a dataset of labeled images. The following image shows MRI images of the brain, some of which include tumors:

Figure 1.2: A non-trivial classification task for humans – MRIs of brains, some of which include the presence of tumors

 

Data Representation

We build models so that we can learn something about the data we are training on and about the relationships between the features of the dataset. This learning can inform us when we encounter new observations. However, we must realize that the observations we interact with in the real world and the format of the data that's needed to train machine learning models are very different. Working with text data is a prime example of this. When we read text, we are able to understand each word and apply the context that's given by each word in relation to the surrounding words, which is not a trivial task. However, machines are unable to interpret this contextual information. Unless it is specifically encoded, they have no idea how to convert text into something that can be a numerical input. Therefore, we must represent the data appropriately, often by converting non-numerical data types (for example, text, dates, and categorical variables) into numerical ones.
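
A minimal sketch of this kind of conversion is shown below. The toy DataFrame and its column names (signup_date, plan, is_active) are hypothetical and serve only to illustrate the idea; the exercises later in this chapter use the real dataset.

```python
import pandas as pd

# Hypothetical toy records; the column names are illustrative only.
raw = pd.DataFrame({
    "signup_date": ["2020-01-15", "2020-07-01"],
    "plan": ["basic", "premium"],
    "is_active": ["yes", "no"],
})

numeric = pd.DataFrame({
    # A date string becomes an integer month number.
    "signup_month": pd.to_datetime(raw["signup_date"]).dt.month,
    # A categorical string becomes an integer code (alphabetical order here).
    "plan_code": raw["plan"].astype("category").cat.codes,
    # A yes/no string becomes 1/0.
    "is_active": (raw["is_active"] == "yes").astype(int),
})
print(numeric)
```

Every column of the result holds numbers, which is the form a model can consume. Whether integer codes are an appropriate encoding for a categorical variable depends on the model, a point this chapter returns to later.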

Tables of Data

Much of the data that's fed into machine learning problems is two-dimensional and can be represented as rows and columns. Images, by contrast, are a good example of a dataset that may be three- or even four-dimensional. The shape of each image will be two-dimensional (a height and a width), the number of images together will add a third dimension, and a color channel (red, green, and blue) will add a fourth:

Figure 1.3: A color image and its representation as red, green, and blue images
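
The four dimensions described above can be sketched with a NumPy array. The batch size and image size below are arbitrary, and the axis order (images, height, width, channels) is one common convention that is assumed here for illustration.

```python
import numpy as np

# Hypothetical batch: 10 color images, each 32 pixels high and 64 pixels wide.
# Assumed axis order: (number of images, height, width, color channels).
images = np.zeros((10, 32, 64, 3), dtype=np.uint8)

print(images.ndim)        # 4
print(images[0].shape)    # (32, 64, 3)

# Selecting a single color channel of one image leaves a two-dimensional array.
red_channel = images[0][:, :, 0]
print(red_channel.shape)  # (32, 64)
```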

The following figure shows a few rows from a dataset that has been taken from the UCI repository, which documents the online session activity of various users of a shopping website. The columns of the dataset represent various attributes of the session activity and general attributes of the page, while the rows represent the various sessions, all corresponding to different users. The column named Revenue represents whether the user ended the session by purchasing products from the website.

Note

The dataset that documents the online session activity of various users of a shopping website can be found here: https://packt.live/39rdA7S.

One objective of analyzing the dataset could be to try and use the information given to predict whether a given user will purchase any products from the website. We can then check whether we were correct by comparing our predictions to the column named Revenue. The long-term benefit of this is that we could then use our model to identify important attributes of a session or web page that may predict purchase intent:

Figure 1.4: An image showing the first 20 instances of the online shoppers purchasing intention dataset

Loading Data

Data can be in different forms and can be available in many places. Datasets for beginners are often given in a flat format, which means that they are two-dimensional, with rows and columns. Other common forms of data may include images, JSON objects, and text documents. Each type of data format has to be loaded in specific ways. For example, numerical data can be loaded into memory using the NumPy library, which is an efficient library for working with matrices in Python.

However, we would not be able to load our online shoppers .csv file into memory using the NumPy library because the dataset contains string values. For our dataset, we will use the pandas library because of its ability to easily work with various data types, such as strings, integers, floats, and binary values. In fact, pandas depends on NumPy for operations on numerical data types. pandas is also able to read JSON, Excel documents, and databases using SQL queries, which makes the library common among practitioners for loading and manipulating data in Python.
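
The difference can be seen on a tiny in-memory CSV. The two-row file below is made up for the example (its column names are borrowed from the dataset), and io.StringIO is used only so that the sketch runs without a file on disk.

```python
import io

import numpy as np
import pandas as pd

# A made-up CSV with one string column and one numeric column.
csv_text = "Month,PageValues\nFeb,0.0\nMar,12.5\n"

# np.loadtxt tries to convert every field to a float and fails on "Feb".
try:
    np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
    numpy_loaded = True
except ValueError:
    numpy_loaded = False

# pandas infers a suitable type for each column separately.
data = pd.read_csv(io.StringIO(csv_text))

print(numpy_loaded)  # False
print(data.dtypes)
```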

Here is an example of how to load a CSV file using the NumPy library. We use the skiprows argument in case there is a header, which usually contains column names:

import numpy as np
data = np.loadtxt(filename, delimiter=",", skiprows=1)

Here's an example of loading data using the pandas library:

import pandas as pd
data = pd.read_csv(filename, delimiter=",")

Here, we are loading in a .csv file. The default delimiter is a comma, so passing this as an argument is not necessary but is useful to see. The pandas library can also handle non-numeric datatypes, which makes the library more flexible. For example, it can read JSON files:

import pandas as pd
data = pd.read_json(filename)

The pandas library will flatten out the JSON and return a DataFrame.
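
As a hedged sketch of what this looks like in practice, the example below reads a made-up list of JSON records from an in-memory string. For records that contain nested objects, pd.json_normalize is one option for spreading the nested keys across columns; the field names used here are hypothetical.

```python
import io

import pandas as pd

# Made-up JSON: a list of flat records becomes one row per record.
records = '[{"session": 1, "month": "Feb"}, {"session": 2, "month": "Mar"}]'
flat = pd.read_json(io.StringIO(records))
print(flat.shape)  # (2, 2)

# Nested records (hypothetical fields): json_normalize expands the inner keys
# into their own columns, named with a dot separator.
nested = [{"session": 1, "page": {"type": "product", "duration": 12.5}}]
normalized = pd.json_normalize(nested)
print(sorted(normalized.columns))
```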

The library can even connect to a database, queries can be fed directly into the function, and the table that's returned will be loaded as a pandas DataFrame:

import pandas as pd
data = pd.read_sql("SELECT * FROM table", con)

We have to pass a database connection to the function in order for this to work. There is a myriad of ways for this to be achieved, depending on the database flavor.
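
One way to try this without setting up a database server is an in-memory SQLite database from Python's standard library. The table name and columns below are invented for the sketch; a real project would connect to its own database instead.

```python
import sqlite3

import pandas as pd

# Create a throwaway in-memory database with a small, made-up table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sessions (session_id INTEGER, revenue INTEGER)")
con.executemany("INSERT INTO sessions VALUES (?, ?)",
                [(1, 0), (2, 1), (3, 0)])
con.commit()

# The query comes first and the connection second.
data = pd.read_sql("SELECT * FROM sessions", con)
con.close()

print(data.shape)  # (3, 2)
```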

Other forms of data that are common in deep learning, such as images and text, can also be loaded in and will be discussed later in this course.

Note

You can find all the documentation for pandas here: https://pandas.pydata.org/pandas-docs/stable/. The documentation for NumPy can be found here: https://docs.scipy.org/doc/.

Exercise 1.01: Loading a Dataset from the UCI Machine Learning Repository

Note

For all the exercises and activities in this chapter, you will need to have Python 3.7, Jupyter, and pandas installed on your system. Refer to the Preface for installation instructions. The exercises and activities are performed in Jupyter notebooks. It is recommended to keep a separate notebook for different assignments. You can download all the notebooks from this book's GitHub repository, which can be found here: https://packt.live/2OL5E9t.

In this exercise, we will be loading the online shoppers purchasing intention dataset from the UCI Machine Learning Repository. The goal of this exercise will be to load in the CSV data and identify a target variable to predict and the feature variables to use to model the target variable. Finally, we will separate the feature and target columns and save them to .CSV files so that we can use them in subsequent activities and exercises.

The dataset is related to the online behavior and activity of customers of an online store and indicates whether the user purchased any products from the website. You can find the dataset in the GitHub repository at: https://packt.live/39rdA7S.

Follow these steps to complete this exercise:

  1. Open a new Jupyter Notebook and load the data into memory using the pandas library with the read_csv function. Import the pandas library and read in the data file:
    import pandas as pd
    data = pd.read_csv('../data/online_shoppers_intention.csv')

    Note

    The code above assumes that you are using the same folder and file structure as in the GitHub repository. If you get an error that the file cannot be found, then check to make sure your working directory is correctly structured. Alternatively, you can edit the file path in the code so that it points to the correct file location on your system, though you will need to ensure you are consistent with this when saving and loading files in later exercises.

  2. To verify that we have loaded the data into the memory correctly, we can print the first few rows. Let's print out the first 20 rows of the DataFrame:
    data.head(20)

    The printed output should look like this:

    Figure 1.5: The first 20 rows and first 8 columns of the pandas DataFrame

  3. We can also print the shape of the DataFrame:
    data.shape

    The printed output should look as follows, showing that the DataFrame has 12330 rows and 18 columns:

    (12330, 18)

    We have successfully loaded the data into memory, so now we can manipulate and clean the data so that a model can be trained using this data. Remember that machine learning models require data to be represented as numerical data types in order to be trained. We can see from the first few rows of the dataset that some of the columns are string types, so we will have to convert them into numerical data types later in this chapter.

  4. We can see that there is a given output variable for the dataset, known as Revenue, which indicates whether or not the user purchased a product from the website. This seems like an appropriate target to predict, since the design of the website and the choice of the products featured may be informed by the users' behavior and whether it resulted in a purchase. Create feature and target datasets as follows, providing the axis=1 argument:
    feats = data.drop('Revenue', axis=1)
    target = data['Revenue']

    Note

    The axis=1 argument tells the function to drop columns rather than rows.

  5. To verify that the shapes of the datasets are as expected, print out the number of rows and columns of each:

    Note

    The code snippet shown here uses a backslash ( \ ) to split the logic across multiple lines. When the code is executed, Python will ignore the backslash, and treat the code on the next line as a direct continuation of the current line.

    print(f'Features table has {feats.shape[0]} \
    rows and {feats.shape[1]} columns')
    print(f'Target table has {target.shape[0]} rows')

    This preceding code produces the following output:

    Features table has 12330 rows and 17 columns
    Target table has 12330 rows

    We can see two important things here that we should always verify before continuing: first, the number of rows of the feature DataFrame and the target are the same. Here, we can see that both have 12330 rows. Second, the feature DataFrame should have one fewer column than the original DataFrame, and the target, which is a pandas Series, should hold exactly one column of values.

    Regarding the second point, we have to verify that the target is not contained in the feature dataset; otherwise, the model will quickly find that this is the only column needed to minimize the total error, all the way down to zero. The target column doesn't necessarily have to be one column, but for binary classification, as in our case, it will be. Remember that these machine learning models are trying to minimize some cost function in which the target variable will be part of that cost function, usually a difference function between the predicted value and target variable.

  6. Finally, save the DataFrames as CSV files so that we can use them later:
    feats.to_csv('../data/OSI_feats.csv', index=False)
    target.to_csv('../data/OSI_target.csv', \
                  header='Revenue', index=False)

    Note

    The header='Revenue' parameter is used to provide a column name. We will do this to reduce confusion later on. The index=False parameter is used so that the index column is not saved.

In this section, we have successfully demonstrated how to load data into Python using the pandas library. This will form the basis of loading data into memory for most tabular data. Images and large documents, both of which are other common forms of data for machine learning applications, have to be loaded in using other methods, all of which will be discussed later in this book.
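
The checks from this exercise can be bundled into a short, self-contained sketch. The three-row DataFrame below is a made-up stand-in for the real dataset, and an in-memory buffer replaces the CSV file so that the example runs anywhere.

```python
import io

import pandas as pd

# Made-up stand-in for the shoppers data, using three of its column names.
toy = pd.DataFrame({
    "PageValues": [0.0, 12.5, 3.1],
    "Weekend": [False, True, False],
    "Revenue": [False, True, False],
})

feats = toy.drop("Revenue", axis=1)
target = toy["Revenue"]

# Same number of rows, one fewer column, and no target among the features.
assert feats.shape[0] == target.shape[0]
assert feats.shape[1] == toy.shape[1] - 1
assert "Revenue" not in feats.columns

# Round trip through CSV text to confirm the column name survives.
buffer = io.StringIO()
target.to_csv(buffer, header=True, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))  # ['Revenue']
```

Turning the checks into assert statements means a notebook stops immediately if the target accidentally remains in the feature table.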

Note

To access the source code for this specific section, please refer to https://packt.live/2YZRAyB.

You can also run this example online at https://packt.live/3dVR0pF.

 

Data Preprocessing

To fit models to the data, it must be represented in numerical format, since the mathematics used in machine learning algorithms only works on matrices of numbers (you cannot perform linear algebra on an image until it has been converted into an array of pixel values). This will be one goal of this section: to learn how to encode all the features into numerical representations. For example, a binary text column, which contains one of two possible values, may be represented as zeros and ones. An example of this can be seen in the following diagram. Since there are only two possible values, one value (cat) is encoded as 0 and the other (dog) is encoded as 1.

We can also rename the column for interpretation:

Figure 1.6: A numerical encoding of binary text values
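
A minimal sketch of this binary encoding with pandas follows. The animal column and the is_dog name are assumptions made for the example; they mirror the cat and dog values in the figure.

```python
import pandas as pd

# Made-up binary text column.
pets = pd.DataFrame({"animal": ["cat", "dog", "dog", "cat"]})

# Map each of the two text values to a number, using a column name that
# states what the value 1 means.
pets["is_dog"] = pets["animal"].map({"cat": 0, "dog": 1})
pets = pets.drop("animal", axis=1)

print(pets["is_dog"].tolist())  # [0, 1, 1, 0]
```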

Another goal will be to appropriately represent the data in numerical format. By appropriately, we mean that we want to encode relevant information numerically through the distribution of numbers. For example, one method to encode the months of the year would be to use the number of the month in the year: January would be encoded as 1, since it is the first month, and December would be 12. Here's an example of how this would look in practice:

Figure 1.7: A numerical encoding of months
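
The month encoding can be sketched with an explicit lookup table. The three-letter abbreviations are an assumption about how the months are written; a dataset that spells them differently would need a different table.

```python
import pandas as pd

# Assumed three-letter month abbreviations, in calendar order.
month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_to_number = {name: i for i, name in enumerate(month_names, start=1)}

months = pd.Series(["Jan", "Feb", "Dec"])
encoded = months.map(month_to_number)

print(encoded.tolist())  # [1, 2, 12]
```

This encoding preserves the order of the months, which is the relevant information here. It also implies that December is far from January, which may or may not suit a given model.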

Not encoding information appropriately into numerical features can lead to machine learning models learning unintuitive representations, as well as relationships between the feature data and target variables that will prove useless for human interpretation.

An understanding of the machine learning algorithms you are looking to use will also help encode features into numerical representations appropriately. For example, algorithms for classification tasks such as Artificial Neural Networks (ANNs) and logistic regression are susceptible to large variations in the scale between the features that may hamper their model-fitting ability.

Take, for example, a regression problem attempting to fit house attributes, such as the area in square feet and the number of bedrooms, to the house price. The bounds of the area may be anywhere from 0 to 5000, whereas the number of bedrooms may only vary from 0 to 6, so there is a large difference between the scale of the variables.

An effective way to combat the large variation in scale between the features is to normalize the data. Normalizing the data will scale the data appropriately so that it is all of a similar magnitude. This ensures that any model coefficients or weights can be compared correctly. Algorithms such as decision trees are unaffected by data scaling, so this step can be omitted for models using tree-based algorithms.
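
Two common ways to bring features onto a similar scale are min-max scaling and standardization. The sketch below applies both to made-up house data; the column names and values are illustrative and not part of the dataset used in this chapter.

```python
import pandas as pd

# Made-up house attributes with very different scales.
houses = pd.DataFrame({
    "area_sqft": [800.0, 1500.0, 5000.0],
    "bedrooms": [1.0, 3.0, 6.0],
})

# Min-max scaling: every column ends up between 0 and 1.
min_max = (houses - houses.min()) / (houses.max() - houses.min())

# Standardization: every column ends up with mean 0 and standard deviation 1.
standardized = (houses - houses.mean()) / houses.std(ddof=0)

print(min_max)
print(standardized)
```

After either transformation, a change of one unit means something comparable in both columns, so the fitted coefficients are easier to compare.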

In this section, we will demonstrate a number of different ways to encode information numerically. There is a myriad of alternative techniques that can be explored elsewhere; here, we will show some simple and popular methods that can be used to tackle common data formats.

Exercise 1.02: Cleaning the Data

It is important that we clean the data appropriately so that it can be used for training models. This often includes converting non-numerical datatypes into numerical datatypes. This will be the focus of this exercise: to convert all the columns in the feature dataset into numerical columns. To complete the exercise, perform the following steps:

  1. First, we load the feature dataset into memory:
    %matplotlib inline
    import pandas as pd
    data = pd.read_csv('../data/OSI_feats.csv')
  2. Again, look at the first 20 rows to check out the data:
    data.head(20)

    The following screenshot shows the output of the preceding code:

    Figure 1.8: First 20 rows and 8 columns of the pandas feature DataFrame

    Here, we can see that there are a number of columns that need to be converted into the numerical format. The numerical columns we may not need to modify are the columns named Administrative, Administrative_Duration, Informational, Informational_Duration, ProductRelated, ProductRelated_Duration, BounceRates, ExitRates, PageValues, SpecialDay, OperatingSystems, Browser, Region, and TrafficType.

    There is also a binary column that has either one of two possible values. This is the column named Weekend.

    Finally, there are also categorical columns that are string types, but there are a limited number of choices (>2) that the column can take. These are the columns named Month and VisitorType.

  3. For the numerical columns, use the describe function to get a quick indication of the bounds of the numerical columns:
    data.describe()

    The following screenshot shows the output of the preceding code:

    Figure 1.9: Output of the describe function in the feature DataFrame

  4. Convert the binary column, Weekend, into a numerical column. To do so, we will examine the possible values by printing the count of each value and plotting the result, and then convert one of the values into 1 and the other into 0. If appropriate, rename the column for interpretability.

    For context, it is helpful to see the distribution of each value. We can do that using the value_counts function. We can try this out on the Weekend column:

    data['Weekend'].value_counts()

    We can also look at these values as a bar graph by calling the plot method of the resulting Series and passing the kind='bar' argument:

    data['Weekend'].value_counts().plot(kind='bar')

    Note

    The kind='bar' argument will plot the data as a bar graph. The default is a line graph. When plotting in Jupyter notebooks, in order to make the plots within the notebook, the following command may need to be run: %matplotlib inline.

    The following figure shows the output of the preceding code:

    Figure 1.10: A plot of the distribution of values of the Weekend column

  5. Here, we can see that this distribution is skewed toward false values. This column represents whether the visit to the website occurred on a weekend, corresponding to a true value, or a weekday, corresponding to a false value. Since there are more weekdays than weekend days, this skewed distribution makes sense. Convert the column into a numerical value by converting the True values into 1 and the False values into 0. We can also change the name of the column from Weekend to is_weekend. This makes it a bit more obvious what the column means:
    data['is_weekend'] = data['Weekend'].apply(lambda \
                         row: 1 if row == True else 0)

    Note

    The apply function iterates through each element in the column and applies the function provided as the argument. A function has to be supplied as the argument. Here, a lambda function is supplied.

  6. Take a look at the original and converted columns side by side. Take a sample of the last few rows to see examples of both values being manipulated so that they're numerical data types:
    data[['Weekend','is_weekend']].tail()

    Note

    The tail function is identical to the head function, except the function returns the bottom n values of the DataFrame instead of the top n.

    The following figure shows the output of the preceding code:

    Figure 1.11: The original and manipulated column

    Here, we can see that True is converted into 1 and False is converted into 0.

  7. Now we can drop the Weekend column, as only the is_weekend column is needed:
    data.drop('Weekend', axis=1, inplace=True)
  8. Next, we have to deal with categorical columns. We will approach the conversion of categorical columns into numerical values slightly differently than with binary text columns, but the concept will be the same. Convert each categorical column into a set of dummy columns. With dummy columns, each categorical column will be converted into n columns, where n is the number of unique values in the category. The columns will be 0 or 1, depending on the value of the categorical column.

    This is achieved with the get_dummies function. If we need any help understanding this function, we can call the help function on it, as we can with any function:

    help(pd.get_dummies)

    The following figure shows the output of the preceding code:

    Figure 1.12: The output of the help command being applied to the pd.get_dummies function

  9. Let's demonstrate how to manipulate categorical columns with the VisitorType column. Again, it is helpful to see the distribution of values, so look at the value counts and plot them:
    data['VisitorType'].value_counts()
    data['VisitorType'].value_counts().plot(kind='bar')

    The following figure shows the output of the preceding code:

    Figure 1.13: A plot of the distribution of values of the VisitorType column

  10. Call the get_dummies function on the VisitorType column and take a look at the rows alongside the original:
    colname = 'VisitorType'
    visitor_type_dummies = pd.get_dummies(data[colname], \
                                          prefix=colname)
    pd.concat([data[colname], \
               visitor_type_dummies], axis=1).tail(n=10)

    The following figure shows the output of the preceding code:

    Figure 1.14: Dummy columns from the VisitorType column

    Here, we can see that each row contains exactly one value of 1, which is in the column corresponding to the value in the VisitorType column.

    In fact, when using dummy columns, there is some redundant information. Because we know there are three values, if two of the values in the dummy columns are 0 for a particular row, then the remaining column must be equal to 1. It is important to eliminate any redundancy and correlations in features, as they make it difficult to determine which feature is the most important in minimizing the total error.

  11. To remove the interdependency, drop the VisitorType_Other column because it occurs with the lowest frequency:
    visitor_type_dummies.drop('VisitorType_Other', \
                              axis=1, inplace=True)
    visitor_type_dummies.head()

    Note

    In the drop function, the inplace argument will apply the function in place, so a new variable does not have to be declared.

    Looking at the first few rows, we can see what remains of our dummy columns for the original VisitorType column:

    Figure 1.15: Final dummy columns from the VisitorType column

  12. Finally, add these dummy columns to the original feature data by concatenating the two DataFrames column-wise and dropping the original column:
    data = pd.concat([data, visitor_type_dummies], axis=1)
    data.drop('VisitorType', axis=1, inplace=True) 
  13. Repeat the exact same steps with the remaining categorical column, Month. First, examine the distribution of column values, which is an optional step. Second, create dummy columns. Third, drop one of the columns to remove redundancy. Fourth, concatenate the dummy columns into a feature dataset. Finally, drop the original column if it remains in the dataset. You can do this using the following code:
    colname = 'Month'
    month_dummies = pd.get_dummies(data[colname], prefix=colname)
    month_dummies.drop(colname+'_Feb', axis=1, inplace=True)
    data = pd.concat([data, month_dummies], axis=1)
    data.drop('Month', axis=1, inplace=True) 
  14. Now, we should have our entire dataset as numerical columns. Check the types of each column to verify this:
    data.dtypes

    The following figure shows the output of the preceding code:

    Figure 1.16: The datatypes of the processed feature dataset

  15. Now that we have verified the datatypes, we have a dataset we can use to train a model, so let's save this for later:
    data.to_csv('../data/OSI_feats_e2.csv', index=False)
  16. Let's do the same for the target variable. First, load the data in, convert the column into a numerical datatype, and save the column as a CSV file:
    target = pd.read_csv('../data/OSI_target.csv')
    target.head(n=10)

    The following figure shows the output of the preceding code:

    Figure 1.17: First 10 rows of the target dataset

    Here, we can see that this is a Boolean datatype and that there are two unique values.

  17. Convert this into a binary numerical column, much like we did with the binary columns in the feature dataset:
    target['Revenue'] = target['Revenue'].apply(lambda row: 1 \
                        if row==True else 0)
    target.head(n=10)

    The following figure shows the output of the preceding code:

    Figure 1.18: First 10 rows of the target dataset when converted into integers

  18. Finally, save the target dataset to a CSV file:
    target.to_csv('../data/OSI_target_e2.csv', index=False)

In this exercise, we learned how to clean the data appropriately so that it can be used to train models. We converted the non-numerical datatypes into numerical datatypes; that is, we converted all the columns in the feature dataset into numerical columns. Lastly, we saved the target dataset as a CSV file so that we can use it in the following exercises and activities.
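
The cleaning steps from this exercise can be condensed into a compact sketch on made-up data. It uses astype(int) as an alternative to the apply and lambda approach, and it passes dtype=int to get_dummies so that the dummy columns hold 0 and 1, since some pandas versions return Boolean dummy columns by default. The six-row DataFrame is a hypothetical stand-in for the real feature dataset.

```python
import pandas as pd

# Made-up stand-in for two of the feature columns.
toy = pd.DataFrame({
    "Weekend": [False, True, False, False, True, False],
    "VisitorType": ["Returning_Visitor", "New_Visitor", "Returning_Visitor",
                    "Other", "New_Visitor", "Returning_Visitor"],
})

# Binary column: True/False becomes 1/0 under a more descriptive name.
toy["is_weekend"] = toy["Weekend"].astype(int)
toy = toy.drop("Weekend", axis=1)

# Categorical column: one dummy column per unique value.
dummies = pd.get_dummies(toy["VisitorType"], prefix="VisitorType", dtype=int)

# Drop the dummy column of the least frequent value to remove the redundancy.
rarest = toy["VisitorType"].value_counts().idxmin()
dummies = dummies.drop(f"VisitorType_{rarest}", axis=1)

# Join the dummy columns and remove the original text column.
toy = pd.concat([toy, dummies], axis=1).drop("VisitorType", axis=1)

print(toy.dtypes)
```

Looking up the rarest value with value_counts avoids hard-coding the name of the column to drop, which helps when the same steps are repeated for several categorical columns.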

Note

To access the source code for this specific section, please refer to https://packt.live/2YW1DVi.

You can also run this example online at https://packt.live/2BpO4EI.

Appropriate Representation of the Data

In our online shoppers purchase intention dataset, some columns are defined as numerical variables when, upon closer inspection, they are actually categorical variables that have been given numerical labels. These columns are OperatingSystems, Browser, TrafficType, and Region. So far, we have treated them as numerical variables even though they are categorical. This categorical nature should be encoded into the features if we want the models we build to learn the relationships between the features and the target.

This matters because treating the labels as numbers may encode misleading relationships in the features. For example, if the value of the OperatingSystems field is equal to 2, does that mean it is worth twice as much as a value of 1? Probably not, since the number only identifies the operating system. For this reason, we will convert the field into a categorical variable. The same applies to the Browser, TrafficType, and Region columns.
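
A minimal sketch of this conversion on made-up values is shown below. The integer labels and the PageValues numbers are invented for the example; only the column names come from the dataset.

```python
import pandas as pd

# Made-up data: OperatingSystems holds integer labels, not quantities.
data = pd.DataFrame({
    "OperatingSystems": [1, 2, 2, 3, 3, 3],
    "PageValues": [0.0, 5.0, 1.5, 0.0, 2.0, 7.5],
})

# One dummy column per label, so no ordering or magnitude is implied.
colname = "OperatingSystems"
dummies = pd.get_dummies(data[colname], prefix=colname, dtype=int)

# Drop the dummy column of the least frequent label to avoid redundancy.
rarest = data[colname].value_counts().idxmin()
dummies = dummies.drop(f"{colname}_{rarest}", axis=1)

# Replace the original integer column with the dummy columns.
data = pd.concat([data.drop(colname, axis=1), dummies], axis=1)

print(list(data.columns))
```

After this step, the model sees each operating system as its own on/off feature rather than as a point on a numeric scale.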

Exercise 1.03: Appropriate Representation of the Data

In this exercise, we will convert the OperatingSystems, Browser, TrafficType, and Region columns into categorical types to accurately reflect the information. To do this, we will create dummy variables from the column in a similar manner to what we did in Exercise 1.02, Cleaning the Data. To do so, perform the following steps:

  1. Open a Jupyter Notebook.
  2. Load the dataset into memory. We can use the same feature dataset that was the output from Exercise 1.02, Cleaning the Data, which contains the original numerical versions of the OperatingSystems, Browser, TrafficType, and Region columns:
    import pandas as pd
    data = pd.read_csv('../data/OSI_feats_e2.csv')
  3. Look at the distribution of values in the OperatingSystems column:
    data['OperatingSystems'].value_counts()

    The following figure shows the output of the preceding code:

    Figure 1.19: The distribution of values in the OperatingSystems column


  4. Create dummy variables from the OperatingSystems column:
    colname = 'OperatingSystems'
    operation_system_dummies = pd.get_dummies(data[colname], \
                               prefix=colname)
  5. Drop the dummy variable representing the value with the lowest occurring frequency and join back with the original data:
    operation_system_dummies.drop(colname+'_5', axis=1, \
                                  inplace=True)
    data = pd.concat([data, operation_system_dummies], axis=1)
  6. Repeat this for the Browser column:
    data['Browser'].value_counts()

    The following figure shows the output of the preceding code:

    Figure 1.20: The distribution of values in the Browser column


  7. Create dummy variables, drop the dummy variable with the lowest occurring frequency, and join back with the original data:
    colname = 'Browser'
    browser_dummies = pd.get_dummies(data[colname], \
                      prefix=colname)
    browser_dummies.drop(colname+'_9', axis=1, inplace=True)
    data = pd.concat([data, browser_dummies], axis=1)
  8. Repeat this for the TrafficType and Region columns:

    Note

    The # symbol in the code snippet below denotes a code comment. Comments are added into code to help explain specific bits of logic.

    colname = 'TrafficType'
    data[colname].value_counts()
    traffic_dummies = pd.get_dummies(data[colname], prefix=colname)
    # value 17 occurs with lowest frequency
    traffic_dummies.drop(colname+'_17', axis=1, inplace=True)
    data = pd.concat([data, traffic_dummies], axis=1)
    colname = 'Region'
    data[colname].value_counts()
    region_dummies = pd.get_dummies(data[colname], \
                     prefix=colname)
    # value 5 occurs with lowest frequency
    region_dummies.drop(colname+'_5', axis=1, inplace=True)
    data = pd.concat([data, region_dummies], axis=1)
  9. Check the column types to verify they are all numerical:
    data.dtypes

    The following figure shows the output of the preceding code:

    Figure 1.21: The datatypes of the processed feature dataset


  10. Finally, save the dataset to a CSV file for later use:
    data.to_csv('../data/OSI_feats_e3.csv', index=False)

Now, we can accurately test whether the browser type, operating system, traffic type, or region will affect the target variable. This exercise has demonstrated how to appropriately represent data for use in machine learning algorithms. We have presented some techniques that we can use to convert data into numerical datatypes that cover many situations that may be encountered when working with tabular data.
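The repeated steps in this exercise could also be wrapped in a small helper. The following is a minimal sketch rather than the code used in the exercise: the function name add_dummies_drop_rarest and the toy DataFrame are hypothetical, and the helper looks up the least frequent value with value_counts instead of hardcoding it.

```python
import pandas as pd

def add_dummies_drop_rarest(df, colname):
    """Append dummy columns for `colname`, omitting the rarest value's dummy."""
    # value_counts() sorts by descending frequency, so the last index entry
    # is the least frequent value
    rarest = df[colname].value_counts().index[-1]
    dummies = pd.get_dummies(df[colname], prefix=colname)
    dummies = dummies.drop(f'{colname}_{rarest}', axis=1)
    return pd.concat([df, dummies], axis=1)

# Toy data standing in for the real feature dataset (hypothetical values)
toy = pd.DataFrame({'OperatingSystems': [1, 2, 2, 3, 2, 1],
                    'Browser': [1, 1, 2, 1, 2, 1]})
for col in ['OperatingSystems', 'Browser']:
    toy = add_dummies_drop_rarest(toy, col)
print(toy.columns.tolist())
```

If two values tie for the lowest frequency, which one is dropped depends on the ordering returned by value_counts, so it is worth inspecting the counts first, as the exercise does.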

Note

To access the source code for this specific section, please refer to https://packt.live/3dXOTBy.

You can also run this example online at https://packt.live/3iBvDxw.

 

Life Cycle of Model Creation

In this section, we will cover the life cycle of creating performant machine learning models, from engineering features to fitting models to training data, and evaluating our models using various metrics. The following diagram demonstrates the iterative process of building machine learning models. Features are engineered to capture potential relationships with the target, the model is fit, and then the model is evaluated.

Depending on how the model scores on its evaluation metrics, the features are engineered further, and the process continues. Many of the steps that are implemented to create models are highly transferable between machine learning libraries. We'll start with scikit-learn, which has the advantage of being widely used, and as such, there are many tutorials, learning materials, and pieces of documentation to be found across the internet:

Figure 1.22: The life cycle of model creation


Machine Learning Libraries

While this book is an introduction to deep learning with Keras, as we mentioned earlier, we will start by utilizing scikit-learn. This will help us establish the fundamentals of building a machine learning model using the Python programming language.

Similar to scikit-learn, Keras makes it easy to create models in the Python programming language through an easy-to-use API. However, the goal of Keras is the creation and training of neural networks, rather than machine learning models in general. Artificial neural networks (ANNs) represent a large class of machine learning algorithms, and they are so called because their architecture resembles the way neurons are connected in the human brain. The Keras library has many general-purpose functions built in, such as optimizers, activation functions, and layer properties, so that users, like in scikit-learn, do not have to code these algorithms from scratch.

 

scikit-learn

Scikit-learn was initially created by David Cournapeau in 2007 as a way to easily create machine learning models in the Python programming language. Since its inception, the library has grown immensely in popularity because of its ease of use, wide adoption within the machine learning community, and flexibility of use. scikit-learn is usually the first machine learning package that's implemented by practitioners using Python because of the large number of algorithms available for classification, regression, and clustering tasks and the speed with which results can be obtained.

For example, scikit-learn's LinearRegression class is an excellent choice if you wish to quickly train a simple regression model. If a more complex algorithm is required that's capable of learning nonlinear relationships, scikit-learn's GradientBoostingRegressor or any one of the support vector machine algorithms is a great choice. Likewise, for classification or clustering tasks, scikit-learn offers a wide variety of algorithms to choose from.
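To make the contrast concrete, the following is a minimal sketch that fits both estimators to the same data. The synthetic sine-shaped dataset, the random seeds, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data (assumed for illustration): y = sin(x) + noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
boosted = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# score() returns the coefficient of determination (R^2) on the held-out data
linear_r2 = linear.score(X_test, y_test)
boosted_r2 = boosted.score(X_test, y_test)
print(f'LinearRegression R^2: {linear_r2:.3f}')
print(f'GradientBoostingRegressor R^2: {boosted_r2:.3f}')
```

Both estimators share the same fit and score interface, which is what makes it quick to swap one algorithm for another in scikit-learn.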

The following are a few of the advantages and disadvantages of using scikit-learn for machine learning purposes.

The advantages of scikit-learn are as follows:

  • Mature: Scikit-learn is well-established within the community and used by members of the community of all skill levels. The package includes most of the common machine learning algorithms for classification, regression, and clustering tasks.
  • User-friendly: Scikit-learn features an easy-to-use API that allows beginners to efficiently prototype without needing a deep understanding of, or having to code, each specific model.
  • Open source: There is an active open source community working to improve the library, add documentation, and release regular updates, which ensures that the package is stable and up to date.

The disadvantage of scikit-learn is as follows:

  • Neural network support is lacking: Estimators with ANN algorithms are minimal.

Note

You can find all the documentation for the scikit-learn library here: https://scikit-learn.org/stable/documentation.html.

The estimators in scikit-learn can generally be classified into supervised learning and unsupervised learning techniques. Supervised learning occurs when a target variable is present. A target variable is a variable of the dataset that you are trying to predict, given the other variables. Supervised learning requires the target variable to be known and models are trained to correctly predict this variable. Binary classification using logistic regression is a good example of a supervised learning technique.

In unsupervised learning, no target variable is given in the training data, but models aim to assign a target variable. An example of an unsupervised learning technique is k-means clustering. This algorithm partitions data into a given number of clusters by assigning each data point to the cluster whose center is nearest. The target variable that's assigned may be either the cluster number or cluster center.

An example of using clustering in practice may look as follows. Imagine that you are a jacket manufacturer and your goal is to develop dimensions for various jacket sizes. You cannot create a custom-fit jacket for each customer, so one option for determining the jacket dimensions is to sample the population of customers for various parameters that may be correlated to fit, such as height and weight. Then, you can group the population into clusters using scikit-learn's k-means clustering algorithm with a cluster number that matches the number of jacket sizes you wish to produce. The cluster centers that are created by the clustering algorithm become the parameters that the jacket sizes are based on.

This is visualized in the following figure:

Figure 1.23: An unsupervised learning example of grouping customer parameters into clusters
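The jacket example can be sketched in a few lines with scikit-learn's KMeans class. This is only an illustration: the customer measurements are synthetic, and the three size groups, their means, and the random seed are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic customer measurements (hypothetical): height in cm, weight in kg
rng = np.random.RandomState(42)
small = rng.normal(loc=[160, 55], scale=[4, 4], size=(100, 2))
medium = rng.normal(loc=[175, 72], scale=[4, 4], size=(100, 2))
large = rng.normal(loc=[190, 92], scale=[4, 4], size=(100, 2))
customers = np.vstack([small, medium, large])

# One cluster per jacket size we want to produce
n_sizes = 3
kmeans = KMeans(n_clusters=n_sizes, n_init=10, random_state=42).fit(customers)

# Each cluster center is a (height, weight) pair that a size could be based on
for center in sorted(kmeans.cluster_centers_.tolist()):
    print(f'height: {center[0]:.1f} cm, weight: {center[1]:.1f} kg')
```

With real measurements, it is usually worth scaling the features first, since k-means relies on distances and a feature with a larger numeric range would otherwise dominate.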


There are even semi-supervised learning techniques, in which both labeled and unlabeled data are used in the training of machine learning models. This technique may be used if there is only a small amount of labeled data and a copious amount of unlabeled data. In practice, semi-supervised learning can produce a significant improvement in model performance compared to unsupervised learning.

The scikit-learn library is ideal for beginners as the general concepts for building machine learning pipelines can be learned easily. Concepts such as data preprocessing (the preparation of data for use in machine learning models), hyperparameter tuning (the process of selecting the appropriate model parameters), model evaluation (the quantitative evaluation of a model's performance), and many more are all included in the library. Even experienced users find the library easy to use in order to rapidly prototype models before using a more specialized machine learning library.

Indeed, the various machine learning techniques we've discussed, such as supervised and unsupervised learning, can be applied with Keras using neural networks with different architectures, all of which will be discussed throughout this book.

 

Keras

Keras is designed to be a high-level neural network API that is built on top of frameworks such as TensorFlow, CNTK, and Theano. One of the great benefits of using Keras as an introduction to deep learning for beginners is that it is very user-friendly; advanced functions such as optimizers and layers are already built into the library and do not have to be written from scratch. This is why Keras is popular not only among beginners but also seasoned experts. Also, the library allows the rapid prototyping of neural networks, supports a wide variety of network architectures, and can be run on both CPUs and GPUs.

Note

You can find the library and all the documentation for Keras here: https://Keras.io/.

Keras is used to create and train neural networks and does not offer much in terms of other machine learning algorithms, including supervised algorithms such as support vector machines and unsupervised algorithms such as k-means clustering. What Keras does offer, though, is a well-designed API that can be used to create and train neural networks, which takes away much of the effort that's required to apply linear algebra and multivariate calculus accurately.

The specific modules that are available from the Keras library, such as neural layers, cost functions, optimizers, initialization schemes, activation functions, and regularization schemes, will be explained thoroughly throughout this book. All these modules have relevant functions that can be used to optimize performance for training neural networks for specific tasks.

Advantages of Keras

Here are a few of the main advantages of using Keras for machine learning purposes:

  • User-friendly: Much like scikit-learn, Keras features an easy-to-use API that allows users to focus on model-building rather than the specifics of the algorithms.
  • Modular: The API consists of fully configurable modules that can all be plugged together and work seamlessly.
  • Extensible: It is relatively simple to add new modules to the library. This allows users to take advantage of the many robust modules within the library while providing them the flexibility to create their own.
  • Open source: Keras is an open source library and is constantly improving and adding modules to its code base thanks to the work of many collaborators working in conjunction to build improvements and help create a robust library for all.
  • Works with Python: Keras models are declared directly in Python rather than in separate configuration files, which allows Keras to take advantage of working with Python, such as ease of debugging and extensibility.

Disadvantages of Keras

Here are a few of the main disadvantages of using Keras for machine learning purposes:

  • Advanced customization: While surface-level customization, such as creating simple custom loss functions or neural layers, is straightforward, it can be difficult to change how the underlying architecture works.
  • Lack of examples: Beginners often rely on examples to kick-start their learning. Advanced examples can be lacking in the Keras documentation, which can prevent beginners from advancing in their learning.

Keras offers those familiar with the Python programming language and machine learning the ability to create neural network architectures easily. Since neural networks are quite complicated, we will use scikit-learn to introduce many machine learning concepts before applying them to the Keras library.

More Than Building Models

While machine learning libraries such as scikit-learn and Keras were created to help build and train predictive models, their practicality extends much further. One common use case of building models is that they can be utilized to perform predictions on new data. Once a model has been trained, new observations can be fed into the model to generate predictions. Models may even be used as intermediate steps. For example, neural network models can be used as feature extractors, classifying objects in an image that can then be fed into a subsequent model, as illustrated in the following image:

Figure 1.24: Classifying objects using deep learning


Another common use case for models is that they can be used to summarize datasets by learning representations of the data. Such models are known as auto-encoders, a type of neural network architecture that can be used to learn such representations of a given dataset. The dataset can thus be represented in a reduced dimension with minimal loss of information:

Figure 1.25: An example of using deep learning for text summarization


 

Model Training

In this section, we will begin fitting our model to the datasets that we have created. In this chapter, we will review the minimum steps that are required to create a machine learning model that can be applied when building models with any machine learning library, including scikit-learn and Keras.

Classifiers and Regression Models

This book is concerned with applications of deep learning. The vast majority of deep learning tasks are supervised learning, in which there is a given target, and we want to fit a model so that we can understand the relationship between the features and the target.

An example of supervised learning is identifying whether a picture contains a dog or a cat. We want to determine the relationship between the input (a matrix of pixel values) and the target variable, that is, whether the image is of a dog or a cat:

Figure 1.26: A simple supervised learning task to classify images as dogs and cats


Of course, we may need many more images in our training dataset to robustly classify new images, but models that are trained on such a dataset are able to identify the various relationships that differentiate cats and dogs, which can then be used to predict labels for new data.

Supervised learning models are generally used for either classification or regression tasks.

Classification Tasks

The goal of classification tasks is to fit models to data with discrete categories so that the models can be used to label unlabeled data. For example, these types of models can be used to classify images as dogs or cats. But it doesn't stop at binary classification; multi-class classification, in which there are more than two categories, is also possible. Another example of a classification task would be to predict the existence of dogs within the images. A positive prediction would indicate the presence of dogs within the images, while a negative prediction would indicate no presence of dogs. Note that this could easily be converted into a regression task, that is, the estimation of a continuous variable rather than a discrete one, by predicting the number of dogs within the images.

Most classification models output a probability for each unique class. The prediction is then the class with the highest probability, as can be seen in the following figure:

Figure 1.27: An illustration of a classification model labeling an image
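As a minimal sketch of this step, the snippet below picks the predicted label from a set of class probabilities. The class names and probability values are hypothetical and do not come from a trained model.

```python
# Hypothetical class probabilities output by a classifier for one image
class_probabilities = {'dog': 0.85, 'cat': 0.10, 'rabbit': 0.05}

# The predicted label is the class with the highest probability
predicted_label = max(class_probabilities, key=class_probabilities.get)
print(predicted_label)
```

Many scikit-learn classifiers, including LogisticRegression, expose the per-class probabilities through a predict_proba method, while predict returns the label directly.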


Some of the most common classification algorithms are as follows:

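  • Logistic regression: This algorithm is similar to linear regression in that feature coefficients are learned and the sum of the products of the feature coefficients and features is calculated. The sum is then passed through the logistic (sigmoid) function to produce a probability between 0 and 1, from which the class is predicted.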
  • Decision trees: This algorithm follows a tree-like structure. Decisions are made at each node and branches represent possible options at the node, terminating in the predicted result.
  • ANNs: ANNs replicate the structure and performance of a biological neural network to perform pattern recognition tasks. An ANN consists of interconnected neurons, laid out with a set architecture, that pass information to each other until a result is achieved.
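To illustrate how a logistic regression prediction is formed, here is a minimal sketch in plain Python. The coefficients, intercept, and observation are hypothetical values chosen for illustration, and the 0.5 threshold is the conventional default rather than something specified above.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, coefficients, intercept):
    # Weighted sum of the features, as in linear regression
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    # The logistic function turns the sum into a probability
    # of the positive class
    return sigmoid(z)

# Hypothetical coefficients and a single observation with two features
coefficients = [0.8, -0.5]
intercept = 0.1
observation = [2.0, 1.0]

probability = predict_probability(observation, coefficients, intercept)
label = int(probability >= 0.5)  # threshold at 0.5 to get a class label
print(f'probability: {probability:.4f}, label: {label}')
```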

Regression Tasks

While the aim of classification tasks is to label data with discrete variables, the aim of regression tasks is to predict a continuous numerical value from the input data. For example, if you have a dataset of stock market prices, a classification task may predict whether to buy, sell, or hold, whereas a regression task will predict what the stock market price will be.

A simple yet very popular algorithm for regression tasks is linear regression. In its simplest form, it consists of only one independent feature (x), whose relationship with the dependent variable (y) is linear. Due to its simplicity, it is often overlooked, even though it performs very well for simple data problems.

Some of the most common regression algorithms are as follows:

  • Linear regression: This algorithm learns feature coefficients and predictions are made by taking the sum of the product of the feature coefficients and features.
  • Support Vector Machines: This algorithm uses kernels to map input data into a multi-dimensional feature space to understand relationships between features and the target.
  • ANNs: ANNs replicate the structure and performance of a biological neural network to perform pattern recognition tasks. An ANN consists of interconnected neurons, laid out with a set architecture, that pass information to each other until a result is achieved.
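The linear regression description above can be checked with a short sketch using scikit-learn's LinearRegression class. The toy data is hypothetical and follows y = 3x + 2 exactly, so the learned coefficient and intercept should recover those values; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical) following y = 3x + 2 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3 * X[:, 0] + 2

lin_model = LinearRegression().fit(X, y)
print(f'coefficient: {lin_model.coef_[0]:.2f}, '
      f'intercept: {lin_model.intercept_:.2f}')

# A prediction is the intercept plus the sum of coefficient * feature products
x_new = np.array([[5.0]])
manual = lin_model.intercept_ + np.dot(x_new, lin_model.coef_)
print(lin_model.predict(x_new), manual)
```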

Training Datasets and Test Datasets

Whenever we create machine learning models, we separate the data into training and test datasets. The training data is the set of data that's used to train the model. Typically, it is a large proportion—around 80%—of the total dataset. The test dataset is a sample of the dataset that is held out from the beginning and is used to provide an unbiased evaluation of the model. The test dataset should represent real-world data as accurately as possible. Any model evaluation metrics that are reported should be applied to the test dataset unless it's explicitly stated that the metrics have been evaluated on the training dataset. The reason for this is that models will typically perform better on the data they are trained on.

Furthermore, models can overfit the training dataset. A model is said to be overfitted to the data if its performance is very good when evaluated on the training dataset, but it performs poorly on the test dataset. Conversely, a model can be underfitted to the data. In this case, the model will fail to learn relationships between the features and the target, which will lead to poor performance when evaluated on both the training and test datasets.

We aim for a balance of the two, not relying so heavily on the training dataset that we overfit but allowing the model to learn the relationships between the features and the target so that the model generalizes well to new data. This concept is illustrated in the following figure:

Figure 1.28: An example of underfitting and overfitting a dataset
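One way to see this behavior is to fit polynomials of increasing degree to a small, noisy sample and compare the training and test errors. The sketch below uses NumPy's polyfit; the sine-shaped data, noise level, sample sizes, and degrees are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

def make_data(n):
    # Synthetic data (assumed for illustration): a noisy sine curve
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with coefficients `coeffs`."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))
    print(f'degree {degree}: train MSE {results[degree][0]:.3f}, '
          f'test MSE {results[degree][1]:.3f}')
```

The training error can only go down as the degree increases, because each higher-degree polynomial contains the lower-degree ones as special cases. The test error is the number to watch: a low-degree fit that scores poorly on both sets suggests underfitting, while a growing gap between training and test error suggests overfitting.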


There are many ways to split the dataset via sampling methods. One way to split a dataset into training and test datasets is to simply randomly sample the data until you have the desired number of data points. This is often the default method in functions such as the scikit-learn train_test_split function.

Another method is to stratify the sampling. In stratified sampling, each subpopulation is sampled independently. Each subpopulation is determined by the target variable. This can be advantageous in examples such as binary classification, where the target variable is highly skewed toward one value or another, and random sampling may not provide data points of both values in the training and test datasets. There are also validation datasets, which we will address later in this chapter.
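In scikit-learn, stratified sampling can be requested through the stratify argument of train_test_split. The following minimal sketch uses a hypothetical, highly imbalanced target to show that the class proportions are preserved in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced binary target: 90 zeros and 10 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(f'Positive share in train: {y_train.mean():.2f}')
print(f'Positive share in test:  {y_test.mean():.2f}')
```

Without stratify, a purely random split of such a dataset could place very few, or even none, of the positive examples in the test dataset.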

Model Evaluation Metrics

It is important to be able to evaluate our models effectively, not just in terms of the model's performance but also in the context of the problem we are trying to solve. For example, let's say we built a classification model to predict whether to buy, sell, or hold stock based on historical stock market prices. If our model only predicted to buy every time, this would not be a useful result because we may not have infinite resources to buy stock. It may be better to be less accurate yet also include some sell predictions.

Common evaluation metrics for classification tasks include accuracy, precision, recall, and F1 score. Accuracy is defined as the number of correct predictions divided by the total number of predictions. Accuracy is easy to interpret and works well when the classes are balanced. When the classes are highly skewed, however, accuracy can be misleading:

Figure 1.29: Formula to calculate accuracy


Precision is another useful metric. It's defined as the number of true positive results divided by the total number of positive results (true and false) predicted by the model:

Figure 1.30: Formula to calculate precision


Recall is defined as the number of correct positive results divided by all the positive results from the ground truth:

Figure 1.31: Formula to calculate recall


Both precision and recall are scored between zero and one, but scoring well on one may mean scoring poorly on the other. For example, a model may have high precision but low recall, which indicates that its positive predictions are usually correct but that it misses a large number of positive instances. It is useful to have a metric that combines recall and precision. Enter the F1 score, the harmonic mean of precision and recall, which is high only when both precision and recall are high:

Figure 1.32: Formula to calculate the F1 score
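The four metrics defined above can be computed directly from the counts of true positives, false positives, false negatives, and true negatives. The following is a minimal sketch in plain Python; the function name and the counts are hypothetical, and the F1 score is written in its standard form as the harmonic mean of precision and recall.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, and 30 true negatives
accuracy, precision, recall, f1 = classification_metrics(
    tp=40, fp=10, fn=20, tn=30)
print(f'accuracy={accuracy:.3f}, precision={precision:.3f}, '
      f'recall={recall:.3f}, f1={f1:.3f}')
```

This sketch does not guard against division by zero, which occurs, for example, when a model makes no positive predictions at all.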


When evaluating models, it is helpful to look at a range of different evaluation metrics. They will help determine the most appropriate model and evaluate where the model is misclassifying predictions.

For example, take a model that helps doctors predict the presence of a rare disease in their patients. By predicting a negative result for every instance, the model might provide a highly accurate evaluation, but this would not help the doctors or patients very much. Instead, examining the precision or recall may be much more informative.

A high precision model is very picky and will likely ensure that all predictions labeled positive are indeed positive. A high recall model is likely to recall many of the true positive instances, at the cost of incurring many false positives.

A high precision model is desired when you want to be sure the predictions labeled as true have a high likelihood of being true. In our example, this may be desired if the cost of treating a rare disease or risk of treatment complications is high. A high recall model is desired if you want to make sure your model recalls as many true positives as possible. In our example, this may be the case if the rare disease is highly contagious and we want to be sure all cases of the disease are treated.
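The rare disease scenario can be made concrete with a few lines of plain Python. The numbers are hypothetical: 1,000 patients, of whom 10 have the disease, and a trivial model that predicts a negative result for everyone.

```python
# Hypothetical screening data: 1,000 patients, 10 of whom have the disease
y_true = [1] * 10 + [0] * 990
# A trivial model that predicts "no disease" for every patient
y_pred = [0] * 1000

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)

accuracy = correct / len(y_true)
recall = true_positives / actual_positives
print(f'accuracy: {accuracy:.2%}, recall: {recall:.2%}')
```

The accuracy looks excellent even though the model finds none of the patients who actually have the disease, which is exactly what the recall exposes.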

Exercise 1.04: Creating a Simple Model

In this exercise, we will create a simple logistic regression model from the scikit-learn package. Then, we will create some model evaluation metrics and test the predictions against those model evaluation metrics.

We should always approach training any machine learning model as an iterative approach, beginning with a simple model and using model evaluation metrics to evaluate the performance of the models. In this model, our goal is to classify the users in the online shoppers purchasing intention dataset into those that will purchase during their session and those that will not. Follow these steps to complete this exercise:

  1. Load in the data:
    import pandas as pd
    feats = pd.read_csv('../data/OSI_feats_e3.csv')
    target = pd.read_csv('../data/OSI_target_e2.csv')
  2. Begin by creating a test and training dataset. We will train the model using the training dataset and evaluate the performance of the model on the test dataset.

    We will use test_size = 0.2, which means that 20% of the data will be reserved for testing, and we will set a number for the random_state parameter:

    from sklearn.model_selection import train_test_split
    test_size = 0.2
    random_state = 42
    X_train, X_test, \
    y_train, y_test = train_test_split(feats, target, \
                                       test_size=test_size, \
                                       random_state=random_state)
  3. Print out the shape of each DataFrame to verify that the dimensions are correct:
    print(f'Shape of X_train: {X_train.shape}')
    print(f'Shape of y_train: {y_train.shape}')
    print(f'Shape of X_test: {X_test.shape}')
    print(f'Shape of y_test: {y_test.shape}')

    The preceding code produces the following output:

    Shape of X_train: (9864, 68)
    Shape of y_train: (9864, 1)
    Shape of X_test: (2466, 68)
    Shape of y_test: (2466, 1)

    These dimensions look correct; each of the target datasets has a single column, the training feature and target DataFrames have the same number of rows, the same applies to the test feature and target DataFrames, and the test DataFrames are 20% of the total dataset.

  4. Next, instantiate the model:
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(random_state=42)

    While there are many arguments we can pass to scikit-learn's logistic regression model (such as the type and strength of the regularization, the type of solver, and the maximum number of iterations the solver is allowed to run), we will only pass random_state.

  5. Then, fit the model to the training data:
    model.fit(X_train, y_train['Revenue'])
  6. To test the performance of the model, compare the predictions of the model with the true values:
    y_pred = model.predict(X_test)

    There are many types of model evaluation metrics that we can use. Let's start with the accuracy, which is defined as the proportion of predicted values that equal the true values:

    from sklearn import metrics
    accuracy = metrics.accuracy_score(y_pred=y_pred, \
                                      y_true=y_test)
    print(f'Accuracy of the model is {accuracy*100:.4f}%')

    The preceding code produces the following output:

    Accuracy of the model is 87.0641%
  7. Other common evaluation metrics for classification models include precision, recall, and fscore. Use the scikit-learn precision_recall_fscore_support function, which can calculate all three:
    precision, recall, fscore, _ = \
    metrics.precision_recall_fscore_support(y_pred=y_pred, \
                                            y_true=y_test, \
                                            average='binary')
    print(f'Precision: {precision:.4f}\nRecall: \
    {recall:.4f}\nfscore: {fscore:.4f}')

    Note

    The underscore is used in Python for many reasons. It can be used to recall the value of the last expression in the interpreter, but in this case, we're using it to ignore specific values that are output by the function.

    The preceding code produces the following output:

    Precision: 0.7347
    Recall: 0.3504
    fscore: 0.4745

    Since all of these metrics are scored between 0 and 1, the recall (0.3504) and fscore (0.4745) are far less impressive than the accuracy. Looking at all of these metrics together can help us find where our model is doing well and where it could be improved, for example, by examining the observations for which the model's predictions are incorrect.

  8. Look at the coefficients that the model outputs to observe which features have a greater impact on the overall result of the prediction:
    coef_list = [f'{feature}: {coef}' for coef, \
                 feature in sorted(zip(model.coef_[0], \
                 X_train.columns.values.tolist()))]
    for item in coef_list:
        print(item)

    The following figure shows the output of the preceding code:

    Figure 1.33: The sorted important features of the model with their respective coefficients


This exercise has taught us how to create and train a predictive model to predict a target variable when given feature variables. We split the feature and target dataset into training and test datasets. Then, we trained our model on the training dataset and evaluated our model on the test dataset. Finally, we observed the trained coefficients for this model.

Note

To access the source code for this specific section, please refer to https://packt.live/2Aq3ZCc.

You can also run this example online at https://packt.live/2VIRSaL.

 

Model Tuning

In this section, we will delve further into evaluating model performance and examine techniques that we can use to generalize models to new data using regularization. Providing the context of a model's performance is extremely important. Our aim is to determine whether our model is performing well compared to trivial or obvious approaches. We do this by creating a baseline model against which machine learning models we train are compared. It is important to stress that all model evaluation metrics are evaluated and reported via the test dataset since that will give us an understanding of how the model will perform on new data.

Baseline Models

A baseline model should be a simple and well-understood procedure, and its performance should be the lowest acceptable performance for any model we build. For classification models, a useful and easy baseline is to always predict the modal outcome value, that is, the most frequent class. For example, if 60% of the values are false, our baseline model would predict false for every observation, which would give us an accuracy of 60%. For regression models, the mean or median of the target can be used as the baseline.
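
The 60% example above can be sketched with scikit-learn's DummyClassifier, which is one possible way to build such a baseline and not the approach taken in the exercise that follows. The target values here are synthetic and chosen only to match the example:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic target for illustration: 60% false (0) and 40% true (1).
y = np.array([0] * 60 + [1] * 40)
X = np.zeros((len(y), 1))  # the features are ignored by this baseline

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(f'Baseline accuracy: {baseline_accuracy:.2f}')  # prints 0.60
```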

Exercise 1.05: Determining a Baseline Model

In this exercise, we will put the model performance into context. The accuracy we attained from our model seemed good, but we need something to compare it to. Since machine learning model performance is relative, it is important to develop a robust baseline with which to compare models. Once again, we are using the online shoppers purchasing intention dataset, and our target variable is whether or not each user will purchase a product in their session. Follow these steps to complete this exercise:

  1. Import the pandas library and load in the target dataset:
    import pandas as pd
    target = pd.read_csv('../data/OSI_target_e2.csv')
  2. Next, calculate the relative proportion of each value of the target variable:
    target['Revenue'].value_counts()/target.shape[0]*100

    The following figure shows the output of the preceding code:

    Figure 1.34: Relative proportion of each value

  3. Here, we can see that 0 is represented 84.525547% of the time—that is, there is no purchase by the user, and this is our baseline accuracy. Now, for the other model evaluation metrics:
    from sklearn import metrics
    y_baseline = pd.Series(data=[0]*target.shape[0])
    precision, recall, fscore, _ = \
        metrics.precision_recall_fscore_support(
            y_pred=y_baseline, y_true=target['Revenue'],
            average='macro')

    Here, we've set the baseline model to predict 0 and have repeated the value so that it's the same as the number of rows in the target dataset.

    Note

    The average parameter in the precision_recall_fscore_support function has to be set to macro because when it is set to binary, as it was previously, the function scores only the true (positive) class, and our baseline model predicts only false values.

  4. Print the final output for precision, recall, and fscore:
    print(f'Precision: {precision:.4f}\n'
          f'Recall: {recall:.4f}\n'
          f'fscore: {fscore:.4f}')

    The preceding code produces the following output:

    Precision: 0.4226
    Recall: 0.5000
    fscore: 0.4581

Now, we have a baseline model that we can compare to our previous model, as well as any subsequent models. By doing this, we can tell that while the accuracy of our previous model seemed high, it did not score much better than this baseline model.
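
To see how macro averaging treats a class that is never predicted, the following minimal sketch repeats the calculation on a tiny, made-up target of four values. The zero_division argument is an addition that is not used in the exercise; it scores the undefined precision of the never-predicted class as 0 without raising a warning:

```python
from sklearn import metrics

# Tiny illustrative target: three 0s and one 1, with a baseline that
# always predicts 0.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]

# Macro averaging takes the unweighted mean of the per-class scores.
precision, recall, fscore, _ = metrics.precision_recall_fscore_support(
    y_true=y_true, y_pred=y_pred, average='macro', zero_division=0)
print(f'Precision: {precision:.4f}')  # (0.75 + 0) / 2 = 0.3750
print(f'Recall: {recall:.4f}')        # (1.00 + 0) / 2 = 0.5000
print(f'fscore: {fscore:.4f}')        # (0.8571 + 0) / 2 = 0.4286
```

Because class 1 contributes 0 to every average, the macro recall of an always-0 baseline is exactly 0.5, and its macro precision is half the proportion of 0 values in the target.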

Note

To access the source code for this specific section, please refer to https://packt.live/31MD1jH.

You can also run this example online at https://packt.live/2VFFSXO.

Regularization

Earlier in this chapter, we learned about overfitting and what it looks like. The hallmark of overfitting is a model that performs extremely well on the training data yet performs poorly on the test data. One reason for this could be that the model relies too heavily on certain features that lead to good performance on the training dataset but do not generalize well to new observations or to the test dataset.

One technique that can be used to avoid this is called regularization. Regularization constrains the values of the coefficients toward zero, which discourages a complex model. There are many different types of regularization techniques. For example, in linear and logistic regression, ridge and lasso regularization are most common. In tree-based models, limiting the maximum depth of the trees acts as regularization.

Two of the most common types of regularization are L1 and L2. Both add a penalty term to the loss function that the model minimizes. This term is either the squared L2 norm (the sum of the squared values) of the weights or the L1 norm (the sum of the absolute values) of the weights. L1 regularization, which is used in lasso regression, is able to reduce the coefficients of some features to exactly zero, so it acts as a feature selector. We can use the output of such a model to observe which features do not contribute much to the performance and remove them entirely if desired. L2 regularization, which is used in ridge regression, shrinks the coefficients toward zero but generally does not reduce them to exactly zero, so we will observe that they all have non-zero values.
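
As a small worked example, the two penalty terms can be computed directly with NumPy. The weight vector below is hypothetical and serves only to illustrate the arithmetic:

```python
import numpy as np

# Hypothetical weight vector, used only to illustrate the two penalties.
w = np.array([0.5, -2.0, 0.0, 1.5])

l1_penalty = np.sum(np.abs(w))  # 0.5 + 2.0 + 0.0 + 1.5 = 4.0
l2_penalty = np.sum(w ** 2)     # 0.25 + 4.0 + 0.0 + 2.25 = 6.5
print(l1_penalty, l2_penalty)
```

In a regularized model, the chosen penalty is scaled by a regularization strength and added to the loss, so larger weights make the total loss larger.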

The following code shows how to instantiate the models using these regularization techniques:

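from sklearn.linear_model import LogisticRegressionCV

# Cs holds the candidate inverse regularization strengths to search over
# and must be defined before this point
model_l1 = LogisticRegressionCV(Cs=Cs, penalty='l1',
                                cv=10, solver='liblinear',
                                random_state=42)
model_l2 = LogisticRegressionCV(Cs=Cs, penalty='l2',
                                cv=10, random_state=42)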

The following code shows how to fit the models:

model_l1.fit(X_train, y_train['Revenue'])
model_l2.fit(X_train, y_train['Revenue'])
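
Once the models are fitted, the value of C selected by cross-validation and the number of coefficients shrunk to zero can be inspected. The following is a minimal sketch under stated assumptions: the data is synthetic (standing in for X_train and y_train), the candidate values in Cs are hypothetical, and 5 folds are used instead of 10 to keep the run short:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data standing in for X_train and y_train:
# 20 features, of which 3 are informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=42)

# Hypothetical candidates; C is the inverse of the regularization strength.
Cs = [0.01, 0.1, 1.0, 10.0]

model_l1 = LogisticRegressionCV(Cs=Cs, penalty='l1', cv=5,
                                solver='liblinear', random_state=42)
model_l2 = LogisticRegressionCV(Cs=Cs, penalty='l2', cv=5, random_state=42)
model_l1.fit(X, y)
model_l2.fit(X, y)

for name, model in [('l1', model_l1), ('l2', model_l2)]:
    chosen_C = float(np.ravel(model.C_)[0])  # value selected by cross-validation
    n_zero = int(np.sum(model.coef_ == 0))   # coefficients that are exactly zero
    print(f'{name}: C={chosen_C}, '
          f'zero coefficients={n_zero} of {model.coef_.size}')
```

How many coefficients the l1 model sets to zero depends on the value of C that cross-validation selects: smaller values of C mean stronger regularization and typically more zeroed coefficients.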

The same concepts in lasso and ridge regularization can be applied to ANNs. However, penalization occurs on the weight matrices rather than the coefficients. Dropout is another form of regularization that's used to prevent overfitting in ANNs. During training, dropout randomly selects nodes at each iteration and temporarily removes them, along with their connections, as shown in the following figure:

Figure 1.35: Dropout regularization in ANNs
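
The idea behind dropout can be sketched in a few lines of NumPy. This is an illustration only, not how a deep learning library implements it internally: the dropout function is a hypothetical helper, and it uses the common "inverted dropout" variant, which rescales the surviving activations so that their expected sum is unchanged. In practice, Keras provides a Dropout layer for this purpose.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`, rescaling the survivors."""
    keep_prob = 1.0 - rate
    # The mask is True where a node is kept for this iteration.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

# Hypothetical layer output: a batch of 4 observations and 5 nodes.
activations = np.ones((4, 5))
dropped = dropout(activations, rate=0.4, rng=rng)
print(dropped)
```

Each entry of the result is either 0 (the node was dropped) or 1 / 0.6 (the node was kept and rescaled). A new random mask is drawn at every training iteration, and no nodes are dropped when making predictions.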

Cross-Validation

Cross-validation is often used in conjunction with regularization to help tune hyperparameters. Take, for example, the penalization parameter in ridge and lasso regression, or the proportion of nodes to drop out at each iteration using the dropout technique with ANNs. How will you determine which parameter value to use? One way is to run models for each value of the regularization parameter and evaluate them on the test set; however, repeatedly using the test set to make modeling decisions can bias our estimate of how the model will perform on new data.

One popular example of cross-validation is called k-fold cross-validation. This technique gives us the ability to test our model on unseen data while retaining a test set that we will use to test at the end. Using this method, the data is divided into k subsets. In each of the k iterations, k-1 of the subsets are used as training data and the remaining subset is used as a validation set. This is repeated k times until all k subsets have been used as validation sets.

By using this technique, there is a significant reduction in bias, since most of the data is used for fitting. There is also a reduction in variance, since every observation is eventually used for validation. Typically, there are between 5 and 10 folds, and the technique can even be stratified, which is useful when there is a large imbalance of classes.
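
As a hedged sketch of how stratified k-fold cross-validation can be used to compare candidate hyperparameter values, the following example scores a logistic regression model for several values of C. The data is synthetic and imbalanced, and the candidate values are hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced training data for illustration (about 85% class 0).
X_train, y_train = make_classification(n_samples=400, n_features=10,
                                       weights=[0.85], random_state=42)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

mean_scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:  # hypothetical candidate values
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X_train, y_train,
                             cv=cv, scoring='accuracy')
    mean_scores[C] = scores.mean()

# Pick the candidate with the highest mean validation accuracy.
best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_C)
```

Only the training data is used here, so the test set remains untouched until the final evaluation.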

The following example shows 5-fold cross-validation with 20% of the data being held out as a test set. The remaining 80% is separated into 5 folds. Four of those folds comprise the training data, and the remaining fold is the validation data. This is repeated a total of five times until every fold has been used once for validation:

Figure 1.36: A figure demonstrating how 5-fold cross-validation works
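
The split described above can be reproduced with scikit-learn. This is a minimal sketch using 100 placeholder observations, chosen so that the sizes are easy to check:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(100).reshape(100, 1)  # 100 placeholder observations

# Hold out 20% of the data as the final test set.
X_rest, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Split the remaining 80% into 5 folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X_rest), start=1):
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f'Fold {fold}: {len(train_idx)} training rows, '
          f'{len(val_idx)} validation rows')
```

Each of the five iterations trains on 64 rows and validates on the remaining 16, while the 20 test rows are never used during cross-validation.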

Activity 1.01: Adding Regularization to the Model

In this activity, we will utilize the same logistic regression model from the scikit-learn package. This time, however, we will add regularization to the model and search for the optimum regularization parameter—a process often called hyperparameter tuning. After training the models, we will test the predictions and compare the model evaluation metrics to those produced by the baseline model and the model without regularization.

The steps we will take are as follows:

  1. Load in the feature and target datasets of the online shoppers purchasing intention dataset from '../data/OSI_feats_e3.csv' and '../data/OSI_target_e2.csv'.
  2. Create training and test datasets for each of the feature and target datasets. The training datasets will be used to train on, and the models will be evaluated using the test datasets.
  3. Instantiate a model instance of the LogisticRegressionCV class of scikit-learn's linear_model package.
  4. Fit the model to the training data.
  5. Make predictions on the test dataset using the trained model.
  6. Evaluate the models by comparing how they scored against the true values using the evaluation metrics.

After implementing these steps, you should get the following expected output:

l1
Precision: 0.7300
Recall: 0.4078
fscore: 0.5233
l2
Precision: 0.7350
Recall: 0.4106
fscore: 0.5269

Note

The solution for this activity can be found via this link.

This activity has taught us how to use regularization in conjunction with cross-validation to appropriately score a model. We have learned how to fit a model to data using regularization and cross-validation. Regularization is an important technique for ensuring that models don't overfit the training data. Models that have been trained with regularization tend to generalize better to new data, which is generally the goal of machine learning models: to predict a target when given new observations of the input data. Choosing the optimal regularization parameter may require iterating over a number of different choices.

Cross-validation is a technique that's used to determine which set of regularization parameters fits the data best. Cross-validation will train multiple models with different values for the regularization parameters on different cuts of the data. This technique helps us choose a good set of regularization parameters without using the test set, which limits both the bias and the variance of the performance estimate.

 

Summary

In this chapter, we covered how to prepare data and construct machine learning models. We achieved this by utilizing Python and libraries such as pandas and scikit-learn. We also used the algorithms in scikit-learn to build our machine learning models.

Then, we learned how to load data into Python, as well as how to manipulate data so that a machine learning model can be trained on the data. This involved converting all the columns into numerical data types. We also created a basic logistic regression classification model using scikit-learn algorithms. We divided the dataset into training and test datasets and fit the model to the training dataset. We evaluated the performance of the model on the test dataset using the model evaluation metrics, that is, accuracy, precision, recall, and fscore.

Finally, we iterated on this basic model by creating two models with different types of regularization. We utilized cross-validation to determine the optimal value of the regularization parameter.

In the next chapter, we will use these same concepts to create the model using the Keras library. We will use the same dataset and attempt to predict the same target value for the same classification task. By doing so, we will learn how to use regularization, cross-validation, and model evaluation metrics when fitting our neural network to the data.

About the Authors
  • Matthew Moocarme

    Matthew Moocarme is an accomplished data scientist with more than eight years of experience in creating and utilizing machine learning models. He comes from a background in the physical sciences, in which he holds a Ph.D. in physics from the Graduate Center of CUNY. Currently, he leads a team of data scientists and engineers in the media and advertising space to build and integrate machine learning models for a variety of applications. In his spare time, Matthew enjoys sharing his knowledge with the data science community through published works, conference presentations, and workshops.

  • Mahla Abdolahnejad

    Mahla Abdolahnejad is a Ph.D. candidate in systems and computer engineering with Carleton University, Canada. She also holds a bachelor's degree and a master's degree in biomedical engineering, which first exposed her to the field of artificial intelligence and artificial neural networks, in particular. Her Ph.D. research is focused on deep unsupervised learning for computer vision applications. She is particularly interested in exploring the differences between a human's way of learning from the visual world and a machine's way of learning from the visual world, and how to push machine learning algorithms toward learning and thinking like humans.

  • Ritesh Bhagwat

    Ritesh Bhagwat has a master's degree in applied mathematics with a specialization in computer science. He has over 14 years of experience in data-driven technologies and has led and been a part of complex projects ranging from data warehousing and business intelligence to machine learning and artificial intelligence. He has worked with top-tier global consulting firms as well as large multinational financial institutions. Currently, he works as a data scientist. Besides work, he enjoys playing and watching cricket and loves to travel. He is also deeply interested in Bayesian statistics.
