You're reading from AWS Certified Machine Learning Specialty: MLS-C01 Certification Guide

Product type: Book
Published in: Mar 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800569003
Edition: 1st
Authors (2):

Somanath Nanda

Somanath has 10 years of experience in the IT industry, including product development, DevOps, and designing and architecting products from end to end. He also worked at AWS as a big data engineer for about 2 years.

Weslley Moura

Weslley Moura has been developing data products for the past decade. In his recent roles, he has been influencing data strategy and leading data teams in the urban logistics and blockchain industries.

Chapter 3: Data Preparation and Transformation

You have probably heard that data scientists spend most of their time working on data preparation-related activities. It is now time to explain why that happens and which types of activities we are talking about.

In this chapter, you will learn how to deal with categorical and numerical features, as well as how to apply different techniques to transform your data, such as one-hot encoding, binary encoding, ordinal encoding, binning, and text transformations. You will also learn how to handle missing values and outliers in your data, two important tasks for building good machine learning models.

In this chapter, we will cover the following topics:

  • Identifying types of features
  • Dealing with categorical features
  • Dealing with numerical features
  • Understanding data distributions
  • Handling missing values
  • Dealing with outliers
  • Dealing with unbalanced datasets
  • Dealing with text data...

Identifying types of features

We cannot start modeling without knowing what a feature is and which type of information it might store. You have already read about different processes that deal with features. For example, you know that feature engineering is related to the task of building and preparing features for your models; you also know that feature selection is related to the task of choosing the best set of features to feed a particular algorithm. These two tasks have one behavior in common: they may vary according to the types of features they are processing.

It is very important to understand this behavior (feature type versus applicable transformations) because it will help you eliminate invalid answers during your exam (and, most importantly, you will become a better data scientist).

When we refer to types of features, we are talking about the data type that a particular feature is supposed to store. The following diagram shows how we could potentially describe the...

Dealing with categorical features

Data transformation methods for categorical features will vary according to the sub-type of your variable. In the upcoming sections, we will understand how to transform nominal and ordinal features.

Transforming nominal features

You may have to create numerical representations of your categorical features before applying ML algorithms to them. Some libraries may have embedded logic to handle that transformation for you, but most of them do not.

The first transformation we will cover is known as label encoding. A label encoder is suitable for categorical/nominal variables and it will just associate a number with each distinct label of your variable. The following table shows how a label encoder works:

Figure 3.3 – Label encoder in action

A label encoder will always ensure that a unique number is associated with each distinct label. In the preceding table, although "India" appears twice, the same number...

Dealing with numerical features

In terms of numerical features (discrete and continuous), we can think of transformations that rely on the training data and others that rely purely on the observation being transformed.

Those that rely on the training data will use the train set to learn the necessary parameters during fit, and then use them to transform any test or new data. The logic is pretty much the same as what we just reviewed for categorical features; however, this time, the encoder will learn different parameters.

On the other hand, those that rely purely on observations do not care about train or test sets. They will simply perform a mathematical computation on top of an individual value. For example, we could apply an exponential transformation to a particular variable by squaring its value. There is no dependency on learned parameters from anywhere – just get the value and square it.
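The contrast between the two families can be sketched in a few lines; the min-max scaling below stands in for any transformation that must learn parameters from the train set (the values are hypothetical):

```python
# Transformation that relies on the TRAIN data: min-max scaling.
# The parameters (min and max) are learned during "fit" on the train
# set and then reused to transform test or new data.
train = [2.0, 4.0, 6.0, 10.0]
test = [5.0]

lo, hi = min(train), max(train)  # learned from the train set only
scaled_test = [(x - lo) / (hi - lo) for x in test]
print(scaled_test)   # → [0.375]

# Transformation that relies purely on the observation: squaring.
# No learned parameters anywhere - just get the value and square it.
squared_test = [x ** 2 for x in test]
print(squared_test)  # → [25.0]
```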

At this point, you might be thinking about dozens of available transformations...

Understanding data distributions

Although the Gaussian distribution is probably the most common distribution for statistical and machine learning models, you should be aware that it is not the only one. There are other types of data distributions, such as the Bernoulli, Binomial, and Poisson distributions.

The Bernoulli distribution is a very simple one, as there are only two types of possible events: success or failure. The success event has a probability "p" of happening, while the failure one has a probability of "1-p".

Some examples that follow a Bernoulli distribution are rolling a six-sided die or flipping a coin. In both cases, you must define the event of success and the event of failure. For example, suppose our events for success and failure in the die example are as follows:

  • Success: Getting a number 6
  • Failure: Getting any other number

We can then say that we have a p probability of success (1/6 = 0.16 = 16%) and a 1-p probability...
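The die example above can be simulated as a series of Bernoulli trials; this sketch just illustrates the definitions, using Python's standard `random` module with a fixed seed:

```python
# The die example as a Bernoulli trial: success = rolling a 6.
import random

p = 1 / 6  # probability of success (~0.16, i.e., about 16%)
q = 1 - p  # probability of failure (~0.83)

def bernoulli_trial(rng):
    # Map one six-sided die roll to success (1) or failure (0).
    return 1 if rng.randint(1, 6) == 6 else 0

rng = random.Random(42)
trials = [bernoulli_trial(rng) for _ in range(60_000)]

# Over many trials, the observed success rate approaches p.
print(sum(trials) / len(trials))
```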

Handling missing values

As the name suggests, missing values refer to the absence of data. Such absences are usually represented by tokens, which may or may not be implemented in a standard way.

Although using tokens is standard, the way those tokens are displayed may vary across different platforms. For example, relational databases represent missing data with NULL, core Python code will use None, and some Python libraries will represent missing numbers as NaN (Not a Number).

Important note

For numerical fields, don't replace those standard missing tokens with zeros. By default, zero is not a missing value, but another number. I said "by default" because, in data science, we may face some data quality issues, which we will cover next.

However, in real business scenarios, you may or may not find those standard tokens. For example, a software engineering team might have designed the system to automatically fill missing data with specific tokens, such as ...
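The point made in the preceding note, that None and NaN are missing tokens while zero is just another number, can be checked with a small pure-Python sketch (pandas users would typically rely on `isna()` and `fillna()` instead; the values here are hypothetical):

```python
# Counting standard missing tokens in a numerical column.
import math

values = [10.0, None, float("nan"), 0.0, 7.5]

def is_missing(v):
    # None and NaN are missing tokens; zero is a legitimate number
    # and must NOT be treated as missing.
    return v is None or (isinstance(v, float) and math.isnan(v))

missing_count = sum(is_missing(v) for v in values)
print(missing_count)  # → 2
```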

Dealing with outliers

We are not on this studying journey just to pass the AWS Machine Learning Specialty exam, but also to become better data scientists. There are many different ways to look at the outlier problem purely from a mathematical perspective; however, the datasets we use are derived from the underlying business process, so we must include a business perspective during an outlier analysis.

An outlier is an atypical data point in a set of data. For example, the following chart shows some data points that have been plotted on a two-dimensional plane; that is, x and y. The red point is an outlier, since it is an atypical value in this series of data:

Figure 3.19 – Identifying an outlier

We want to treat outlier values because some statistical methods are impacted by them. In the preceding chart, we can see this behavior in action. On the left-hand side, we drew a line that best fits those data points, ignoring the red point. On the right...
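One common way to flag such atypical points is the 1.5 * IQR (interquartile range) rule. The sketch below uses a simplified quartile calculation and hypothetical data; it is one of several detection methods, not the only one you may see in the exam:

```python
# Flagging outliers with the 1.5 * IQR rule (simplified quartiles).
data = [10, 12, 11, 13, 12, 11, 95]  # 95 is the atypical point

def iqr_outliers(values):
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]          # rough first quartile
    q3 = s[(3 * n) // 4]    # rough third quartile
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Anything outside [lower, upper] is flagged as an outlier.
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers(data))  # → [95]
```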

Dealing with unbalanced datasets

At this point, I hope you have realized why data preparation is probably the longest part of our work. We have learned about data transformation, missing data values, and outliers, but the list of problems goes on. Don't worry – bear with me and let's master this topic together!

Another well-known problem with ML models, specifically with binary classification problems, is unbalanced classes. In a binary classification model, we say that a dataset is unbalanced when most of its observations belong to the same class (target variable).

This is very common in fraud identification systems, for example, where most of the events belong to a regular operation, while a very small number of events belong to a fraudulent operation. In this case, we can also say that fraud is a rare event.

There is no strong rule for defining whether a dataset is unbalanced or not, in the sense of it being necessary to worry about it. Most challenge problems...
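One simple remedy for an unbalanced dataset is random oversampling of the minority class; the sketch below works on labels only to keep the idea visible (in practice you would resample whole observations, and libraries such as imbalanced-learn offer more sophisticated techniques like SMOTE):

```python
# Measuring imbalance and randomly oversampling the minority class.
import random

labels = [0] * 990 + [1] * 10  # 1 = fraud, a rare event (1%)

minority = [y for y in labels if y == 1]
majority = [y for y in labels if y == 0]
print(len(minority) / len(labels))  # → 0.01

# Randomly duplicate minority observations until the classes match.
rng = random.Random(0)
extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
balanced = majority + minority + extra

print(sum(balanced) / len(balanced))  # → 0.5
```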

Dealing with text data

We have already learned how to transform categorical features into numerical representations, either using label encoders, ordinal encoders, or one-hot encoding. However, what if we have fields containing long pieces of text in our dataset? How are we supposed to provide a mathematical representation for them in order to properly feed ML algorithms? This is a common issue in natural language processing (NLP), a subfield of AI.

NLP models aim to extract knowledge from texts; for example, translating text between languages, identifying entities in a corpus of text (also known as Named Entity Recognition (NER)), classifying sentiments from a user review, and many other applications.
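The most basic way to give text a mathematical representation is a bag-of-words count vector; the pure-Python sketch below (with made-up review snippets) shows the idea that scikit-learn's `CountVectorizer` implements:

```python
# A minimal bag-of-words sketch: each text becomes a vector of word
# counts over a shared vocabulary.
texts = ["the food was good", "the service was not good"]

# Build the vocabulary across all documents.
vocab = sorted({word for t in texts for word in t.split()})

def vectorize(text):
    words = text.split()
    return [words.count(term) for term in vocab]

vectors = [vectorize(t) for t in texts]
print(vocab)    # → ['food', 'good', 'not', 'service', 'the', 'was']
print(vectors)  # → [[1, 1, 0, 0, 1, 1], [0, 1, 1, 1, 1, 1]]
```

Real NLP pipelines would add lowercasing, punctuation handling, stop-word removal, and often weighting schemes such as TF-IDF on top of these raw counts.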

Important note

In Chapter 2, AWS Application Services for AI/ML, you learned about some AWS application services that apply NLP to their solutions, such as Amazon Translate and Amazon Comprehend. During the exam, you might be asked to think about the fastest or easiest way (with...

Summary

First, you were introduced to the different types of features that you might have to work with. Identifying the type of variable you'll be working with is very important for defining the types of transformations and techniques that can be applied to each case.

Then, we learned how to deal with categorical features. We saw that, sometimes, categorical variables do have an order (such as the ordinal ones), while other times, they don't (such as the nominal ones). You learned that one-hot encoding (or dummy variables) is probably the most common type of transformation for nominal features; however, depending on the number of unique categories, after applying one-hot encoding, your data might suffer from sparsity issues. Regarding ordinal features, you shouldn't create dummy variables on top of them, since you would be losing the information of order that's been incorporated into the variable. In those cases, ordinal encoding is the most appropriate transformation...
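As a quick recap of the one-hot encoding idea mentioned above, here is a minimal pure-Python sketch with a hypothetical nominal feature (pandas' `get_dummies` does this in one call):

```python
# One-hot encoding (dummy variables) for a nominal feature: one
# binary column per distinct category.
colors = ["red", "blue", "red", "green"]
categories = sorted(set(colors))  # ["blue", "green", "red"]

one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot)  # → [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

With many unique categories, most entries in these vectors are zeros, which is exactly the sparsity issue the summary warns about.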

Questions

  1. You are working as a data scientist for a healthcare company and are creating a machine learning model to predict fraud, waste, and abuse across the company's claims. One of the features of this model is the number of times a particular drug has been prescribed, to the same patient of the claim, in a period of 2 years. Which type of feature is this?

    a) Discrete

    b) Continuous

    c) Nominal

    d) Ordinal

    Answer

a) The feature counts the number of times a particular drug has been prescribed. Individual and countable items are classified as discrete data.

  2. You are building an ML model for an educational group that owns schools and universities across the globe. Your model aims to predict how likely a particular student is to leave his/her studies. Many factors may contribute to school dropout, but one of your features is the current academic stage of each student: preschool, elementary school, middle school, or high school. Which type of feature is this?

    a) Discrete...

