
You're reading from AWS Certified Machine Learning - Specialty (MLS-C01) Certification Guide - Second Edition

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781835082201
Edition: 2nd Edition

Authors (2):

Somanath Nanda

Somanath has 10 years of experience in the IT industry, including product development, DevOps, and designing and architecting products end to end. He also worked at AWS as a Big Data Engineer for about 2 years.

Weslley Moura

Weslley Moura has been developing data products for the past decade. In his recent roles, he has influenced data strategy and led data teams in the urban logistics and blockchain industries.


Data Preparation and Transformation

You have probably heard that data scientists spend most of their time working on data-preparation-related activities. It is now time to explain why that happens and what types of activities they work on.

In this chapter, you will learn how to deal with categorical and numerical features, as well as how to apply different techniques to transform your data, such as one-hot encoding, binary encoders, ordinal encoding, binning, and text transformations. You will also learn how to handle missing values and outliers in your data, which are two important tasks you can implement to build good machine learning (ML) models.

This chapter covers the following topics:

  • Identifying types of features
  • Dealing with categorical features
  • Dealing with numerical features
  • Understanding data distributions
  • Handling missing values
  • Dealing with outliers
  • Dealing with unbalanced datasets
  • Dealing with text data

This...

Identifying types of features

You cannot start modeling without knowing what a feature is and what type of information it can store. You have already read about the different processes that deal with features. For example, you know that feature engineering is related to the task of building and preparing features for your models; you also know that feature selection is related to the task of choosing the best set of features to feed a particular algorithm. These two tasks have one behavior in common: they may vary according to the types of features they are processing.

It is very important to understand this behavior (feature type versus applicable transformations) because it will help you eliminate invalid answers during your exam (and, most importantly, you will become a better data scientist).

The type of a feature refers to the data type that the feature is supposed to store. Figure 4.1 shows how you could potentially describe the different types of features of...

Dealing with categorical features

Data transformation methods for categorical features will vary according to the sub-type of your variable. In the upcoming sections, you will understand how to transform nominal and ordinal features.

Transforming nominal features

You may have to create numerical representations of your categorical features before applying ML algorithms to them. Some libraries may have embedded logic to handle that transformation for you, but most of them do not.

The first transformation you will learn is known as label encoding. A label encoder is suitable for categorical/nominal variables, and it will just associate a number with each distinct label of your variables. Table 4.2 shows how a label encoder works:

Country    Label encoding
India      1
Canada     2
...
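A minimal sketch of this idea in plain Python follows (scikit-learn's LabelEncoder behaves similarly; note that the exact integers assigned here are sorted alphabetically, so they may differ from those shown in Table 4.2):

```python
# A minimal label-encoding sketch: associate an integer with each
# distinct label of a categorical/nominal variable
def label_encode(values):
    # Map each distinct label to an integer (sorted for reproducibility)
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["India", "Canada", "India"])
```

The learned mapping is what the encoder "fits" on the training data and then reuses to transform any new data.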

Dealing with numerical features

In terms of numerical features (discrete and continuous), you can think of transformations that rely on the training data and others that rely purely on the (individual) observation being transformed.

Those that rely on the training data will use the training set to learn the necessary parameters during fit and then use them to transform any test or new data. The logic is pretty much the same as what you just learned for categorical features; however, this time, the encoder will learn different parameters.

On the other hand, those that rely purely on (individual) observations do not depend on the training or testing sets. They simply perform a mathematical computation on top of an individual value. For example, you could apply a power transformation to a particular variable by squaring its value. There is no dependency on learned parameters from anywhere – just get the value and square it.
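The contrast can be sketched as follows, using min-max scaling as an illustrative example of a learned transformation (the chapter's own examples may differ):

```python
# Learned transformation: min-max scaling "fits" its parameters on the
# training data only, then reuses them on any test or new data
train = [2.0, 4.0, 6.0, 10.0]
lo, hi = min(train), max(train)      # parameters learned at fit time

def min_max_scale(x):
    return (x - lo) / (hi - lo)      # applied to train, test, or new data

# Stateless transformation: depends only on the individual value
def square(x):
    return x ** 2
```

Note that new data scaled with the learned parameters can fall outside the [0, 1] range the training set produced.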

At this point, you might be thinking...

Understanding data distributions

Although the Gaussian distribution is probably the most common distribution for statistical and machine learning models, you should be aware that it is not the only one. There are other types of data distributions, such as the Bernoulli, binomial, and Poisson distributions.

The Bernoulli distribution is a very simple one, as there are only two types of possible events: success or failure. The success event has a probability p of happening, while the failure one has a probability of 1-p.

Some examples of experiments that follow a Bernoulli distribution are rolling a six-sided die and flipping a coin. In both cases, you must define the event of success and the event of failure. For example, assume the following success and failure events when rolling a die:

  • Success: Getting a number 6
  • Failure: Getting any other number

You can then say that there is a probability p of success (1/6 ≈ 0.17 = 17%) and a probability 1-p of failure (1 - 0.17 = 0.83...
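The die-rolling example above boils down to two numbers that must sum to 1:

```python
# Success/failure probabilities for the die-rolling example:
# success = rolling a 6, failure = any other number
p_success = 1 / 6
p_failure = 1 - p_success
```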

Handling missing values

As the name suggests, missing values refer to the absence of data. Such absences are usually represented by tokens, which may or may not be implemented in a standard way.

Although using tokens is standard, the way those tokens are displayed may vary across different platforms. For example, relational databases represent missing data with NULL, core Python code will use None, and some Python libraries will represent missing numbers as Not a Number (NaN).

Important note

For numerical fields, don’t replace those standard missing tokens with zeros. Zero is not a missing value; it is just another number.

However, in real business scenarios, you may or may not find those standard tokens. For example, a software engineering team might have designed the system to automatically fill missing data with specific tokens, such as “unknown” for strings or “-1” for numbers. In that case, you would have to search for those two...
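A small sketch of such a check follows, covering both the standard tokens (None, NaN) and the hypothetical business-specific tokens from the example above ("unknown" for strings, -1 for numbers):

```python
import math

# Sentinel values that this (hypothetical) system uses for missing data
SENTINELS = {"unknown", -1}

def is_missing(value):
    # Standard tokens: None and NaN
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # Business-specific tokens
    return value in SENTINELS
```

In practice, you would confirm with the system's owners which sentinel values are actually in use before treating them as missing.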

Dealing with outliers

You are not on this learning journey just to pass the AWS Machine Learning Specialty exam but also to become a better data scientist. There are many different ways to look at the outlier problem purely from a mathematical perspective; however, the datasets used in real life are derived from an underlying business process, so you must include a business perspective during outlier analysis.

An outlier is an atypical data point in a set of data. For example, Figure 4.8 shows some data points plotted on a two-dimensional plane; that is, x and y. The red point is an outlier since it is an atypical value in this series of data.

Figure 4.8 – Identifying an outlier

It is important to treat outlier values because some statistical methods are heavily impacted by them. Again, Figure 4.8 shows this behavior in action. On the left-hand side, a line has been drawn that best fits those data points, ignoring...
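One common rule of thumb for flagging outliers is the 1.5 × IQR rule (an illustrative choice here, not necessarily the exact method the chapter goes on to use):

```python
import statistics

def iqr_outliers(values):
    # Flag points more than 1.5 * IQR beyond the first or third quartile
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]
```

Whatever rule you pick, remember to sanity-check the flagged points against the underlying business process before discarding them.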

Dealing with unbalanced datasets

At this point, you might have realized why data preparation is probably the longest part of the data scientist’s work. You have learned about data transformation, missing data values, and outliers, but the list of problems goes on. Don’t worry – you are on the right journey to master this topic!

Another well-known problem with ML models, specifically with classification problems, is unbalanced classes. In a classification model, you can say that a dataset is unbalanced when most of its observations belong to one (or some) of the classes (target variable).

This is very common in fraud identification systems, for example, where most of the events belong to a regular operation, while a very small number of events belong to a fraudulent operation. In this case, you can also say that fraud is a rare event.

There is no strong rule for defining whether a dataset is unbalanced or not; it really depends on the context of your business...
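One simple remedy is naive random oversampling of the minority class, sketched below (class weights or synthetic sampling such as SMOTE are common alternatives; this is only an illustration of the idea):

```python
import random
from collections import Counter

def oversample(rows, labels, seed=42):
    # Randomly duplicate minority-class rows until every class
    # matches the majority class count
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, lbl in zip(rows, labels) if lbl == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels
```

Oversampling should be applied to the training set only, never before splitting off the test set, or the evaluation will be optimistic.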

Dealing with text data

You have already learned how to transform categorical features into numerical representations, either using label encoders, ordinal encoders, or one-hot encoding. However, what if you have fields containing long pieces of text in your dataset? How are you supposed to provide a mathematical representation for them in order to properly feed ML algorithms? This is a common issue in Natural Language Processing (NLP), a subfield of AI.

NLP models aim to extract knowledge from text; for example, translating text between languages, identifying entities in a corpus of text (also known as Named Entity Recognition, or NER for short), classifying the sentiment of a user review, and many other applications.
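A minimal bag-of-words sketch shows one common way to turn free text into numbers (the chapter's own NLP techniques go further than this):

```python
from collections import Counter

def bag_of_words(docs):
    # Build a sorted vocabulary, then count each word per document
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]
    return vocab, vectors
```

Each document becomes a fixed-length vector of word counts, which any standard ML algorithm can consume.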

Important note

In Chapter 8, AWS Application Services for AI/ML, you will learn about some AWS application services that apply NLP to their solutions, such as Amazon Translate and Amazon Comprehend. During the exam, you might be asked to think about the fastest...

Summary

First, you were introduced to the different types of features that you might have to work with. Identifying the type of variable you’ll be working with is very important for defining the types of transformations and techniques that can be applied to each case.

Then, you learned how to deal with categorical features. You saw that, sometimes, categorical variables do have an order (such as the ordinal ones), while other times, they don’t (such as the nominal ones). You learned that one-hot encoding (or dummy variables) is probably the most common type of transformation for nominal features; however, depending on the number of unique categories, after applying one-hot encoding, your data might suffer from sparsity issues. Regarding ordinal features, you shouldn’t create dummy variables on top of them, since you would be losing the information about the order that has been incorporated into the variable. In those cases, ordinal encoding is the most appropriate...
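The contrast above between one-hot and ordinal encoding can be sketched in plain Python (illustrative only; library implementations such as scikit-learn's OneHotEncoder and OrdinalEncoder provide the same ideas with more features):

```python
# One-hot encoding for nominal features: one binary column per category
def one_hot(values):
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Ordinal encoding for ordered features: integers that preserve the order
def ordinal_encode(values, order):
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]
```

Note how the one-hot output grows one column per unique category, which is exactly the sparsity issue mentioned above.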

Exam Readiness Drill – Chapter Review Questions

Apart from a solid understanding of key concepts, being able to think quickly under time pressure is a skill that will help you ace your certification exam. That is why working on these skills early on in your learning journey is key.

Chapter review questions are designed to progressively improve your test-taking skills with each chapter while reviewing your understanding of its key concepts. You’ll find them at the end of each chapter.

How To Access These Resources

To learn how to access these resources, head over to the chapter titled Chapter 11, Accessing the Online Practice Resources.

To open the Chapter Review Questions for this chapter, perform the following steps:

  1. Click the link – https://packt.link/MLSC01E2_CH04.

    Alternatively, you can scan the following QR code (Figure 4.15):

Figure 4.15 – QR code that opens Chapter Review Questions for logged-in users

Working On Timing

Target: Your aim is to keep the score the same while trying to answer these questions as quickly as possible. Here’s an example of how your next attempts should look:

Attempt      Score    Time Taken
Attempt 5    77%      21 mins 30 seconds
Attempt 6    78%      18 mins 34 seconds
Attempt 7    76%      14 mins 44 seconds

Table 4.14 – Sample timing practice drills on the online platform

Note

The time limits shown in the preceding table are just examples. Set your own time limit for each attempt based on the time limit of the quiz on the website.

With each new attempt, your score should stay above 75% while your “time taken...

