You're reading from Applied Supervised Learning with Python

Product type: Book
Published in: Apr 2019
Reading level: Intermediate
ISBN-13: 9781789954920
Edition: 1st Edition
Authors (2):
Benjamin Johnston

Benjamin Johnston is a senior data scientist for one of the world's leading data-driven MedTech companies and is involved in the development of innovative digital solutions throughout the entire product development pathway, from problem definition to solution research and development, through to final deployment. He is currently completing his Ph.D. in machine learning, specializing in image processing and deep convolutional neural networks. He has more than 10 years of experience in medical device design and development, working in a variety of technical roles, and holds first-class honors bachelor's degrees in both engineering and medical science from the University of Sydney, Australia.

Ishita Mathur

Ishita Mathur has worked as a data scientist for 2.5 years with product-based start-ups, working with business concerns in various domains and formulating them as technical problems that can be solved using data and machine learning. Her current work at GO-JEK involves the end-to-end development of machine learning projects, working as part of a product team on defining, prototyping, and implementing data science models within the product. She completed her master's degree in high-performance computing with data science at the University of Edinburgh, UK, and her bachelor's degree with honors in physics at St. Stephen's College, Delhi.

Chapter 4: Classification


Activity 11: Linear Regression Classifier – Two-Class Classifier

Solution

  1. Import the required dependencies:

    import struct
    import numpy as np
    import gzip
    import urllib.request
    import matplotlib.pyplot as plt
    from array import array
    from sklearn.linear_model import LinearRegression
  2. Load the MNIST data into memory:

    with gzip.open('train-images-idx3-ubyte.gz', 'rb') as f:
        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    
        img = np.array(array("B", f.read())).reshape((size, rows, cols))
    
    
    with gzip.open('train-labels-idx1-ubyte.gz', 'rb') as f:
        magic, size = struct.unpack(">II", f.read(8))
        labels = np.array(array("B", f.read()))
    
    
    with gzip.open('t10k-images-idx3-ubyte.gz', 'rb') as f:
        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    
        img_test = np.array(array("B", f.read())).reshape((size, rows, cols))
    
    
    with gzip.open('t10k-labels-idx1-ubyte.gz', 'rb') as f:
        magic, size = struct.unpack(">II", f.read(8))
        labels_test = np.array(array("B", f.read()))
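
The loaders above rely on the IDX layout: a big-endian header (a magic number, the item count and, for image files, the row and column counts) followed by raw unsigned bytes. The following is a minimal sketch of that parsing step on a tiny synthetic file built in memory; the helper name `read_idx_images` and the fake 2 x 3 "images" are assumptions for illustration and are not part of the book's solution.

```python
import gzip
import io
import struct


def read_idx_images(fileobj):
    """Parse an IDX3 image stream into its header fields and raw pixel bytes."""
    # ">IIII" = four big-endian unsigned 32-bit integers (16 bytes in total)
    magic, size, rows, cols = struct.unpack(">IIII", fileobj.read(16))
    pixels = fileobj.read()
    if len(pixels) != size * rows * cols:
        raise ValueError("pixel payload does not match the header")
    return magic, size, rows, cols, pixels


# Build a fake file holding two 2 x 3 images (hypothetical data).
# 2051 is the magic number used by the MNIST image files.
header = struct.pack(">IIII", 2051, 2, 2, 3)
payload = bytes(range(12))
buffer = io.BytesIO(gzip.compress(header + payload))

with gzip.open(buffer, "rb") as f:
    magic, size, rows, cols, pixels = read_idx_images(f)

print(magic, size, rows, cols, len(pixels))  # 2051 2 2 3 12
```

Checking the payload length against the header is an optional safeguard; it catches a truncated download before the `reshape` call fails with a less obvious error.
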
  3. Visualize a sample of the data:

    for i in range(10):
        plt.subplot(2, 5, i + 1)
        plt.imshow(img[i], cmap='gray');
        plt.title(f'{labels[i]}');
        plt.axis('off')

    We'll get the following output:

    Figure 4.76: Sample data

  4. Construct a linear classifier model to classify the digits zero and one. The model will determine whether a sample is the digit zero or the digit one. To do this, we first need to select only those samples:

    samples_0_1 = np.where((labels == 0) | (labels == 1))[0]
    images_0_1 = img[samples_0_1]
    labels_0_1 = labels[samples_0_1]
    
    samples_0_1_test = np.where((labels_test == 0) | (labels_test == 1))[0]
    images_0_1_test = img_test[samples_0_1_test].reshape((-1, rows * cols))
    labels_0_1_test = labels_test[samples_0_1_test]
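
`np.where(condition)` returns a tuple of index arrays, which is why the training selection takes element `[0]`. The same rows can also be selected with the Boolean mask directly. A small sketch on toy data (the label values and the 2 x 2 "images" here are made up for illustration):

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9, 1, 0])                       # toy labels
images = np.arange(len(labels) * 4).reshape((len(labels), 2, 2))  # toy 2 x 2 images

mask = (labels == 0) | (labels == 1)   # True where the label is 0 or 1
idx = np.where(mask)[0]                # integer positions of the True entries

print(idx)           # [1 3 5 6]
print(labels[mask])  # [0 1 1 0]

# Indexing with the positions or with the mask selects the same samples.
assert np.array_equal(images[idx], images[mask])
```
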
  5. Visualize the selected information. Here's the code for zero:

    sample_0 = np.where((labels == 0))[0][0]
    plt.imshow(img[sample_0], cmap='gray');

    The output will be as follows:

    Figure 4.77: First sample data

    Here's the code for one:

    sample_1 = np.where((labels == 1))[0][0]
    plt.imshow(img[sample_1], cmap='gray');

    The output will be:

    Figure 4.78: Second sample data

  6. In order to provide the image information to the model, we must first flatten the data out so that each image is 1 x 784 pixels in shape:

    images_0_1 = images_0_1.reshape((-1, rows * cols))
    images_0_1.shape

    The output will be:

    (12665, 784)
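
As a rough illustration of what the reshape does, the sketch below flattens three blank stand-in images (not MNIST data) and checks where a single marked pixel ends up. Passing `-1` lets NumPy infer the number of samples.

```python
import numpy as np

rows, cols = 28, 28
batch = np.zeros((3, rows, cols), dtype=np.uint8)  # three blank stand-in images
batch[1, 0, 5] = 255                               # mark one pixel in image 1

flat = batch.reshape((-1, rows * cols))            # -1 infers the sample count
print(flat.shape)                                  # (3, 784)

# Row-major order: pixel (r, c) lands in column r * cols + c.
assert flat[1, 0 * cols + 5] == 255

# The operation is reversible, which is handy for plotting a flattened image.
restored = flat.reshape((-1, rows, cols))
assert np.array_equal(restored, batch)
```
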
  7. Let's construct the model; use the LinearRegression API and call the fit function:

    model = LinearRegression()
    model.fit(X=images_0_1, y=labels_0_1)

    The output will be:

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
             normalize=False)
  8. Determine the R² score (the coefficient of determination) against the training set:

    model.score(X=images_0_1, y=labels_0_1)

    The output will be:

    0.9705320567708795
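
For a regressor, `score` reports the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on four made-up values; the helper name `r2_score_manual` is hypothetical and the numbers are not taken from the MNIST model.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # variation around the mean
    return 1.0 - ss_res / ss_tot


y_true = [0, 0, 1, 1]
y_pred = [0.1, -0.1, 0.9, 1.1]   # continuous outputs, as a regressor would give

r2 = r2_score_manual(y_true, y_pred)
print(round(r2, 4))  # 0.96
```

Note that R² measures how closely the continuous outputs track the 0/1 labels. It is not a classification accuracy, which is why the next steps threshold the predictions first.
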
  9. Determine the label predictions for each of the training samples, using a threshold of 0.5. Values greater than 0.5 classify as one; values less than or equal to 0.5 classify as zero:

    y_pred = model.predict(images_0_1) > 0.5
    y_pred = y_pred.astype(int)
    y_pred

    The output will be:

    array([0, 1, 1, ..., 1, 0, 1])
  10. Compute the classification accuracy of the predicted training values versus the ground truth:

    np.sum(y_pred == labels_0_1) / len(labels_0_1)

    The output will be:

    0.9947887879984209
  11. Compare the performance against the test set:

    y_pred = model.predict(images_0_1_test) > 0.5
    y_pred = y_pred.astype(int)
    np.sum(y_pred == labels_0_1_test) / len(labels_0_1_test)

    The output will be:

    0.9938534278959811
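
The whole pipeline (fit a least-squares model to 0/1 labels, threshold the output at 0.5, then count matches) can be sketched without the MNIST files. The example below is an illustration under stated assumptions: it uses two synthetic 2-D clusters in place of the digit images and `np.linalg.lstsq` as a stand-in for `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters standing in for "zero" and "one" images.
X0 = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
X1 = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

# Ordinary least squares with an explicit intercept column.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

scores = A @ coef                      # continuous regression outputs
y_pred = (scores > 0.5).astype(int)    # threshold at 0.5 to get class labels
accuracy = np.mean(y_pred == y)        # fraction of correct predictions
print(accuracy)
```

`np.mean(y_pred == y)` is equivalent to the `np.sum(...) / len(...)` expression used in the solution.
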

Activity 12: Iris Classification Using Logistic Regression

Solution

  1. Import the required packages. For this activity, we will require the pandas package for loading the data, the Matplotlib package for plotting, and scikit-learn for creating the logistic regression model. Import all the required packages and relevant modules for these tasks:

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
  2. Load the Iris dataset using pandas and examine the first five rows:

    df = pd.read_csv('iris-data.csv')
    df.head()

    The output will be:

    Figure 4.79: The first five rows of the Iris dataset

  3. The next step is feature engineering. We need to select the features that will give the most powerful classification model. Plot a number of different feature pairs against the allocated species classifications, for example, sepal width versus petal length, with a different marker for each species. Visually inspect the plots and look for any patterns that could indicate separation between the species:

    markers = {
        'Iris-setosa': {'marker': 'x'},
        'Iris-versicolor': {'marker': '*'},
        'Iris-virginica': {'marker': 'o'},
    }
    plt.figure(figsize=(10, 7))
    for name, group in df.groupby('Species'):
        plt.scatter(group['Sepal Width'], group['Petal Length'],
                    label=name,
                    marker=markers[name]['marker'])
        
    plt.title('Species Classification Sepal Width vs Petal Length');
    plt.xlabel('Sepal Width (cm)');
    plt.ylabel('Petal Length (cm)');
    plt.legend();

    The output will be:

    Figure 4.80: Species classification plot

  4. Select the features by writing the column names in the following list:

    selected_features = [
        'Sepal Width', # List features here
        'Petal Length'
    ]
  5. Before we can construct the model, we must first convert the species values into labels that can be used within the model. Replace the Iris-setosa species string with the value 0, the Iris-versicolor species string with the value 1, and the Iris-virginica species string with the value 2:

    species = [
        'Iris-setosa', # 0
        'Iris-versicolor', # 1
        'Iris-virginica', # 2
    ]
    output = [species.index(spec) for spec in df.Species]
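
`species.index(spec)` scans the list for every row, which is fine for three classes. A dictionary lookup is a common alternative and makes decoding predictions back to names straightforward. A minimal sketch, using a made-up four-row column in place of `df.Species`:

```python
species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Map each species name to its integer label: setosa 0, versicolor 1, virginica 2.
label_of = {name: i for i, name in enumerate(species)}

# Toy stand-in for the Species column (hypothetical values).
observed = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']
output = [label_of[name] for name in observed]
print(output)  # [2, 0, 1, 0]

# Decoding: turn integer labels (for example, model predictions) back into names.
decoded = [species[i] for i in output]
assert decoded == observed
```
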
  6. Create the model using the selected_features and the assigned species labels:

    model = LogisticRegression(multi_class='auto', solver='lbfgs')
    model.fit(df[selected_features], output)

    The output will be:

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='auto',
              n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
              tol=0.0001, verbose=0, warm_start=False)
  7. Compute the accuracy of the model against the training set:

    model.score(df[selected_features], output)

    The output will be:

    0.9533333333333334
  8. Construct another model using your second-choice selected_features and compare the performance:

    selected_features = [
        'Sepal Length', # List features here
        'Petal Width'
    ]
    model.fit(df[selected_features], output)
    model.score(df[selected_features], output)

    The output will be:

    0.96
  9. Construct another model using a different pair of features, this time sepal length and sepal width, and compare the performance:

    selected_features = [
        'Sepal Length', # List features here
        'Sepal Width'
    ]
    model.fit(df[selected_features], output)
    model.score(df[selected_features], output)

    The output will be:

    0.82
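
Rather than editing `selected_features` by hand, the comparison can be looped over every pair of features. The sketch below assumes scikit-learn's bundled copy of the Iris data (`load_iris`) as a stand-in for `iris-data.csv`, so the column names differ from the book's and the accuracies may not match the values shown above.

```python
from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target
names = iris.feature_names

results = {}
for i, j in combinations(range(X.shape[1]), 2):
    # A fresh model per feature pair; max_iter is raised to help convergence.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[:, [i, j]], y)
    # For classifiers, score() returns the mean accuracy on the given data.
    results[(names[i], names[j])] = model.score(X[:, [i, j]], y)

# Print the pairs from most to least accurate on the training data.
for pair, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{acc:.3f}  {pair[0]} + {pair[1]}")
```

These are training-set accuracies, as in the solution. Ranking features this way says nothing about how well each model generalizes to unseen samples.
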

Activity 13: K-NN Multiclass Classifier

Solution

  1. Import the following packages:

    import struct
    import numpy as np
    import gzip
    import urllib.request
    import matplotlib.pyplot as plt
    from array import array
    from sklearn.neighbors import KNeighborsClassifier as KNN
  2. Load the MNIST data into memory.

    Training images:

    with gzip.open('train-images-idx3-ubyte.gz', 'rb') as f:
        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    
        img = np.array(array("B", f.read())).reshape((size, rows, cols))

    Training labels:

    with gzip.open('train-labels-idx1-ubyte.gz', 'rb') as f:
        magic, size = struct.unpack(">II", f.read(8))
        labels = np.array(array("B", f.read()))

    Test images:

    with gzip.open('t10k-images-idx3-ubyte.gz', 'rb') as f:
        magic, size, rows, cols = struct.unpack(">IIII", f.read(16))
    
        img_test = np.array(array("B", f.read())).reshape((size, rows, cols))

    Test labels:

    with gzip.open('t10k-labels-idx1-ubyte.gz', 'rb') as f:
        magic, size = struct.unpack(">II", f.read(8))
        labels_test = np.array(array("B", f.read()))
  3. Visualize a sample of the data:

    for i in range(10):
        plt.subplot(2, 5, i + 1)
        plt.imshow(img[i], cmap='gray');
        plt.title(f'{labels[i]}');
        plt.axis('off')

    The output will be:

    Figure 4.81: Sample images

  4. Construct a K-NN classifier with three nearest neighbors to classify the MNIST dataset. Again, to save processing power, randomly sample 5,000 images for use in training:

    selection = np.random.choice(len(img), 5000)
    selected_images = img[selection]
    selected_labels = labels[selection]
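
By default, `np.random.choice` samples with replacement, so the 5,000 selected indices can contain a few duplicates. If unique training images are wanted, `replace=False` is one option. The sketch below uses a seeded `Generator`, which is an assumption made for repeatability and is not part of the book's solution.

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded generator (assumed choice)
n_total, n_keep = 60000, 5000     # MNIST training-set size and sample size

with_repl = rng.choice(n_total, n_keep)                     # default: replace=True
without_repl = rng.choice(n_total, n_keep, replace=False)   # every index is unique

# Count the distinct indices in each sample.
print(len(np.unique(with_repl)), len(np.unique(without_repl)))
```
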
  5. In order to provide the image information to the model, we must first flatten the data out so that each image is 1 x 784 pixels in shape:

    selected_images = selected_images.reshape((-1, rows * cols))
    selected_images.shape

    The output will be:

    (5000, 784)
  6. Build the three-neighbor KNN model and fit the model to the data. Note that, in this activity, we are providing 784 features or dimensions to the model, not simply 2:

    model = KNN(n_neighbors=3)
    model.fit(X=selected_images, y=selected_labels)

    The output will be:

    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
               metric_params=None, n_jobs=None, n_neighbors=3, p=2,
               weights='uniform')
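
In the printed settings, `metric='minkowski'` with `p=2` is the Euclidean distance, and `weights='uniform'` gives each neighbor one vote. The following is a simplified sketch of that idea on six toy 2-D points; the function `knn_predict` is hypothetical, and scikit-learn's search structures and tie-breaking may differ.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]               # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())        # one vote per neighbor
    return votes.most_common(1)[0][0]


X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.3, 0.3])))  # 0
print(knn_predict(X_train, y_train, np.array([4.9, 5.1])))  # 1
```

With 784-dimensional MNIST vectors the same distance calculation applies; only the length of each row changes.
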
  7. Determine the score against the training set:

    model.score(X=selected_images, y=selected_labels)

    The output will be:

    0.9692
  8. Display the first two predictions for the model against the training data:

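    # Print the predictions explicitly; a bare expression is only displayed
    # when it is the last statement in a notebook cell.
    print(model.predict(selected_images[:2]))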
    
    plt.subplot(1, 2, 1)
    plt.imshow(selected_images[0].reshape((28, 28)), cmap='gray');
    plt.axis('off');
    plt.subplot(1, 2, 2)
    plt.imshow(selected_images[1].reshape((28, 28)), cmap='gray');
    plt.axis('off');

    The output will be as follows:

    Figure 4.82: First predicted values

  9. Compare the performance against the test set:

    model.score(X=img_test.reshape((-1, rows * cols)), y=labels_test)

    The output will be:

    0.9376