
You're reading from  The Machine Learning Workshop - Second Edition

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839219061
Edition: 2nd Edition
Author: Hyatt Saleh

Hyatt Saleh discovered the importance of data analysis for understanding and solving real-life problems after graduating from college as a business administrator. Since then, as a self-taught person, she not only works as a machine learning freelancer for many companies globally, but has also founded an artificial intelligence company that aims to optimize everyday processes. She has also authored Machine Learning Fundamentals, by Packt Publishing.

1. Introduction to Scikit-Learn

Activity 1.01: Selecting a Target Feature and Creating a Target Matrix

Solution:

  1. Load the titanic dataset using the seaborn library:
    import seaborn as sns
    titanic = sns.load_dataset('titanic')
    titanic.head(10)

    The first 10 rows should look as follows:

    Figure 1.22: An image showing the first 10 instances of the Titanic dataset

  2. Select your preferred target feature for the goal of this activity.

    The preferred target feature could be either survived or alive, because both label whether a person survived the sinking. The following steps use survived. Choosing alive instead would not affect the final shape of the variables.

  3. Create both the features matrix and the target matrix. Make sure that you store the data from the features matrix in a variable, X, and the data from the target matrix in another variable, Y:
    X = titanic.drop('survived', axis=1)
    Y = titanic...
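
As a minimal sketch of step 3, the same split can be tried on a small stand-in DataFrame. The rows below are invented for illustration and only mimic a few Titanic columns; the names toy and X_alt are hypothetical. With the real dataset, the DataFrame would come from sns.load_dataset('titanic').

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data: all values are made up.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 1],
    "age": [22.0, 38.0, 26.0, 35.0],
    "alive": ["no", "yes", "yes", "no"],
})

# Features matrix: every column except the chosen target.
X = toy.drop("survived", axis=1)
# Target matrix: only the chosen target column.
Y = toy["survived"]
print(X.shape, Y.shape)

# Choosing alive as the target instead removes a different column,
# so the shape of the features matrix stays the same.
X_alt = toy.drop("alive", axis=1)
Y_alt = toy["alive"]
print(X_alt.shape, Y_alt.shape)
```

Dropping one column leaves the row count untouched and reduces the column count by one, whichever of the two labels is chosen.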

2. Unsupervised Learning – Real-Life Applications

Activity 2.01: Using Data Visualization to Aid the Pre-processing Process

Solution:

  1. Import all the required elements to load the dataset and pre-process it:
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
  2. Load the previously downloaded dataset by using pandas' read_csv() function. Store the dataset in a pandas DataFrame named data:
    data = pd.read_csv("wholesale_customers_data.csv")
  3. Check for missing values in your DataFrame. Using the isnull() function plus the sum() function, count the missing values of the entire dataset at once:
    data.isnull().sum()

    The output is as follows:

    Channel             0
    Region              0
    Fresh               0
    Milk                0
    Grocery             0
    Frozen              0
    Detergents_Paper    0
    Delicassen          0
    dtype: int64

    As the preceding output shows, there are no missing values in the dataset.

  4. Check for outliers...
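
To see what isnull().sum() reports when values are actually missing, here is a minimal sketch on a hypothetical DataFrame. It reuses two column names from the output above, but the numbers are invented and one value is removed on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; one Milk value is deliberately missing.
toy = pd.DataFrame({
    "Fresh": [100, 200, 300],
    "Milk": [10, np.nan, 30],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A second sum() collapses the per-column counts into one total.
total_missing = int(missing_per_column.sum())
print(total_missing)
```

A non-zero count in any row of the first result points to the column that needs attention before modeling.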

3. Supervised Learning – Key Steps

Activity 3.01: Data Partitioning on a Handwritten Digit Dataset

Solution:

  1. Import all the required elements to split a dataset, as well as the load_digits function from scikit-learn to load the digits dataset. Use the following code to do so:
    from sklearn.datasets import load_digits
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import KFold
  2. Load the digits dataset and create pandas DataFrames containing the features and target matrices:
    digits = load_digits()
    X = pd.DataFrame(digits.data)
    Y = pd.DataFrame(digits.target)
    print(X.shape, Y.shape)

    The shape of your features and target matrices should be as follows, respectively:

    (1797, 64) (1797, 1)
  3. Split the data using the conventional approach, with a 60/20/20% ratio for the training, validation, and testing sets.

    Using the train_test_split function, split the data into an initial train set and a test set:

    X_new, X_test, \
    Y_new, Y_test = train_test_split(X, Y, test_size...
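
One common way to reach a 60/20/20 split with train_test_split is to call it twice: first hold out 20% for testing, then hold out 25% of the remainder (0.25 × 0.8 = 0.2 of the full data) for validation. The sketch below assumes that approach. The names X_train, X_dev, Y_train, and Y_dev, as well as the random_state value, are illustrative choices and are not taken from the truncated code above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import pandas as pd

digits = load_digits()
X = pd.DataFrame(digits.data)
Y = pd.DataFrame(digits.target)

# First split: hold out 20% of all instances as the test set.
X_new, X_test, Y_new, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Second split: 25% of the remaining 80% equals 20% of the full dataset.
X_train, X_dev, Y_train, Y_dev = train_test_split(
    X_new, Y_new, test_size=0.25, random_state=0
)

print(X_train.shape, X_dev.shape, X_test.shape)
```

The three row counts should add up to the 1,797 instances in the digits dataset, in roughly a 60/20/20 proportion.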

4. Supervised Learning Algorithms: Predicting Annual Income

Activity 4.01: Training a Naïve Bayes Model for Our Census Income Dataset

Solution:

  1. In a Jupyter Notebook, import all the required elements to load and split the dataset, as well as to train a Naïve Bayes algorithm:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
  2. Load the pre-processed Census Income dataset. Next, separate the features from the target by creating two variables, X and Y:
    data = pd.read_csv("census_income_dataset_preprocessed.csv")
    X = data.drop("target", axis=1)
    Y = data["target"]

    Note that there are several ways to achieve the separation of X and Y. Use the one that you feel most comfortable with. However, take into account that X should contain the features of all instances, while Y should contain the class labels of all instances.

  3. Divide the dataset into training, validation, and...
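
The note in step 2 mentions several ways to separate X and Y. As a hedged illustration, the sketch below shows three equivalent options on a tiny hypothetical DataFrame, then fits GaussianNB on the result. The column names age and hours and all values are invented; only the target column name matches the code above.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical, well-separated data standing in for the pre-processed dataset.
data = pd.DataFrame({
    "age": [20, 22, 25, 50, 55, 60],
    "hours": [10, 12, 15, 40, 45, 50],
    "target": [0, 0, 0, 1, 1, 1],
})

# Option 1: drop the target column (as in the solution above).
X = data.drop("target", axis=1)
Y = data["target"]

# Option 2: select the feature columns by name.
X_by_name = data[["age", "hours"]]

# Option 3: select by position, assuming the target is the last column.
X_by_pos = data.iloc[:, :-1]
Y_by_pos = data.iloc[:, -1]

# All three options produce the same features and labels.
print(X.equals(X_by_name), X.equals(X_by_pos), Y.equals(Y_by_pos))

# Fit a Gaussian Naive Bayes model and classify two new, made-up instances.
model = GaussianNB()
model.fit(X, Y)
new_instances = pd.DataFrame({"age": [21, 58], "hours": [11, 48]})
predictions = model.predict(new_instances)
print(predictions)
```

Whichever option is used, X keeps one row per instance with only feature columns, and Y keeps one label per instance.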

5. Artificial Neural Networks: Predicting Annual Income

Activity 5.01: Training an MLP for Our Census Income Dataset

Solution:

  1. Import all the elements required to load and split a dataset, to train an MLP, and to measure accuracy:
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score
  2. Using the pre-processed Census Income dataset, separate the features from the target, creating the variables X and Y:
    data = pd.read_csv("census_income_dataset_preprocessed.csv")
    X = data.drop("target", axis=1)
    Y = data["target"]

    As explained previously, there are several ways to separate X and Y. The main thing to consider is that X should contain the features of all instances, while Y should contain the class labels of all instances.

  3. Divide the dataset into training, validation, and testing sets, using a split ratio of 10...
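
The following is a minimal sketch of the train-and-measure flow that the imports in step 1 point to, using MLPClassifier and accuracy_score. Because the pre-processed CSV is not reproduced here, the data comes from scikit-learn's make_classification. The sample count, the 80/20 split, the random_state values, and max_iter are all illustrative assumptions rather than the settings used in the activity.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Synthetic binary-classification data standing in for the census dataset.
features, labels = make_classification(
    n_samples=500, n_features=10, random_state=0
)
X = pd.DataFrame(features)
Y = pd.Series(labels, name="target")

# Illustrative 80/20 hold-out split (not the ratio used in the activity).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Train an MLP with the default architecture; max_iter is raised so the
# optimizer has more room to converge on this small dataset.
model = MLPClassifier(random_state=0, max_iter=1000)
model.fit(X_train, Y_train)

# Accuracy is the fraction of test instances classified correctly.
test_accuracy = accuracy_score(Y_test, model.predict(X_test))
print(round(test_accuracy, 3))
```

accuracy_score takes the true labels first and the predicted labels second, and returns a value between 0 and 1.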

6. Building Your Own Program

Activity 6.01: Performing the Preparation and Creation Stages for the Bank Marketing Dataset

Solution:

Note

To ensure the reproducibility of the results available at https://packt.live/2RpIhn9, make sure that you use a random_state of 0 when splitting the datasets and a random_state of 2 when training the models.

  1. Open a Jupyter Notebook and import all the required elements:
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import precision_score
  2. Load the dataset into the notebook. Make sure that you load the one that was edited previously, named bank-full-dataset.csv, which is also available at https://packt.live/2wnJyny:
    data = pd.read_csv("bank-full-dataset.csv")
    data.head(10)

    The output is as follows:

    Figure 6.8: A screenshot showing...
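
The note at the start of this solution asks for fixed random_state values. The sketch below illustrates why, under stated assumptions: the data is a small hypothetical DataFrame rather than bank-full-dataset.csv, and the 80/20 split ratio is an arbitrary choice. Only the random_state values of 0 (splitting) and 2 (training) come from the note.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data; the real activity uses bank-full-dataset.csv.
X = pd.DataFrame({"a": range(20), "b": range(20, 40)})
Y = pd.Series([0, 1] * 10, name="target")

# Two splits with the same random_state select exactly the same rows.
X_train_1, X_test_1, Y_train_1, Y_test_1 = train_test_split(
    X, Y, test_size=0.2, random_state=0
)
X_train_2, X_test_2, Y_train_2, Y_test_2 = train_test_split(
    X, Y, test_size=0.2, random_state=0
)
same_split = X_test_1.index.equals(X_test_2.index)
print(same_split)

# Fixing random_state on the model makes training repeatable as well.
tree_1 = DecisionTreeClassifier(random_state=2).fit(X_train_1, Y_train_1)
tree_2 = DecisionTreeClassifier(random_state=2).fit(X_train_2, Y_train_2)
same_predictions = bool(
    (tree_1.predict(X_test_1) == tree_2.predict(X_test_2)).all()
)
print(same_predictions)
```

With both seeds fixed, rerunning the notebook yields the same partitions and the same fitted model, which is what makes results comparable across runs.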
