
You're reading from AWS Certified Machine Learning - Specialty (MLS-C01) Certification Guide - Second Edition

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781835082201
Edition: 2nd Edition

Authors (2):

Somanath Nanda

Somanath has 10 years of experience in the IT industry, including product development, DevOps, and designing and architecting products end to end. He also worked at AWS as a Big Data Engineer for about 2 years.

Weslley Moura

Weslley Moura has been developing data products for the past decade. In his recent roles, he has influenced data strategy and led data teams in the urban logistics and blockchain industries.


Data Preparation and Transformation

You have probably heard that data scientists spend most of their time working on data-preparation-related activities. It is now time to explain why that happens and what types of activities they work on.

In this chapter, you will learn how to deal with categorical and numerical features, as well as how to apply different techniques to transform your data, such as one-hot encoding, binary encoders, ordinal encoding, binning, and text transformations. You will also learn how to handle missing values and outliers in your data, which are two important tasks you can implement to build good machine learning (ML) models.

This chapter covers the following topics:

  • Identifying types of features
  • Dealing with categorical features
  • Dealing with numerical features
  • Understanding data distributions
  • Handling missing values
  • Dealing with outliers
  • Dealing with unbalanced datasets
  • Dealing with text data

This...

Identifying types of features

You cannot start modeling without knowing what a feature is and what type of information it can store. You have already read about the different processes that deal with features. For example, you know that feature engineering is related to the task of building and preparing features for your models; you also know that feature selection is related to the task of choosing the best set of features to feed a particular algorithm. These two tasks have one behavior in common: they may vary according to the types of features they are processing.

It is very important to understand this behavior (feature type versus applicable transformations) because it will help you eliminate invalid answers during your exam (and, most importantly, you will become a better data scientist).

The type of a feature refers to the data type that the feature is supposed to store. Figure 4.1 shows how you could potentially describe the different types of features of...

Dealing with categorical features

Data transformation methods for categorical features will vary according to the sub-type of your variable. In the upcoming sections, you will understand how to transform nominal and ordinal features.

Transforming nominal features

You may have to create numerical representations of your categorical features before applying ML algorithms to them. Some libraries may have embedded logic to handle that transformation for you, but most of them do not.

The first transformation you will learn is known as label encoding. A label encoder is suitable for categorical/nominal variables, and it will just associate a number with each distinct label of your variables. Table 4.2 shows how a label encoder works:

Country    Label encoding
India      1
Canada     2
...
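A minimal sketch of this idea in plain Python follows (scikit-learn's LabelEncoder behaves similarly; note that the exact integers assigned here are sorted alphabetically, so they may differ from those shown in Table 4.2):

```python
# A minimal label-encoding sketch: associate an integer with each
# distinct label of a categorical/nominal variable
def label_encode(values):
    # Map each distinct label to an integer (sorted for reproducibility)
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["India", "Canada", "India"])
```

The learned mapping is what the encoder "fits" on the training data and then reuses to transform any new data.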

Dealing with numerical features

In terms of numerical features (discrete and continuous), you can think of transformations that rely on the training data and others that rely purely on the (individual) observation being transformed.

Those that rely on the training data will use the training set to learn the necessary parameters during fit and then use them to transform any test or new data. The logic is pretty much the same as what you just learned for categorical features; however, this time, the encoder will learn different parameters.

On the other hand, those that rely purely on (individual) observations do not depend on the training or testing sets. They simply perform a mathematical computation on top of an individual value. For example, you could apply a power transformation to a particular variable by squaring its value. There is no dependency on learned parameters from anywhere – just get the value and square it.
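The contrast can be sketched as follows, using min-max scaling as an illustrative example of a learned transformation (the chapter's own examples may differ):

```python
# Learned transformation: min-max scaling "fits" its parameters on the
# training data only, then reuses them on any test or new data
train = [2.0, 4.0, 6.0, 10.0]
lo, hi = min(train), max(train)      # parameters learned at fit time

def min_max_scale(x):
    return (x - lo) / (hi - lo)      # applied to train, test, or new data

# Stateless transformation: depends only on the individual value
def square(x):
    return x ** 2
```

Note that new data scaled with the learned parameters can fall outside the [0, 1] range the training set produced.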

At this point, you might be thinking...

Understanding data distributions

Although the Gaussian distribution is probably the most common distribution for statistical and machine learning models, you should be aware that it is not the only one. There are other types of data distributions, such as the Bernoulli, binomial, and Poisson distributions.

The Bernoulli distribution is a very simple one, as there are only two types of possible events: success or failure. The success event has a probability p of happening, while the failure one has a probability of 1-p.

Some examples of experiments that follow a Bernoulli distribution are rolling a six-sided die and flipping a coin. In both cases, you must define the event of success and the event of failure. For example, assume the following success and failure events when rolling a die:

  • Success: Getting a number 6
  • Failure: Getting any other number

You can then say that there is a probability p of success (1/6 ≈ 0.17 = 17%) and a probability 1-p of failure (1 - 0.17 = 0.83...
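The die-rolling example above boils down to two numbers that must sum to 1:

```python
# Success/failure probabilities for the die-rolling example:
# success = rolling a 6, failure = any other number
p_success = 1 / 6
p_failure = 1 - p_success
```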

Handling missing values

As the name suggests, missing values refer to the absence of data. Such absences are usually represented by tokens, which may or may not be implemented in a standard way.

Although using tokens is standard, the way those tokens are displayed may vary across different platforms. For example, relational databases represent missing data with NULL, core Python code will use None, and some Python libraries will represent missing numbers as Not a Number (NaN).

Important note

For numerical fields, don’t replace those standard missing tokens with zeros. Zero is not a missing value; it is just another number.

However, in real business scenarios, you may or may not find those standard tokens. For example, a software engineering team might have designed the system to automatically fill missing data with specific tokens, such as “unknown” for strings or “-1” for numbers. In that case, you would have to search for those two...
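A small sketch of such a check follows, covering both the standard tokens (None, NaN) and the hypothetical business-specific tokens from the example above ("unknown" for strings, -1 for numbers):

```python
import math

# Sentinel values that this (hypothetical) system uses for missing data
SENTINELS = {"unknown", -1}

def is_missing(value):
    # Standard tokens: None and NaN
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # Business-specific tokens
    return value in SENTINELS
```

In practice, you would confirm with the system's owners which sentinel values are actually in use before treating them as missing.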

Dealing with outliers

You are not on this learning journey just to pass the AWS Machine Learning Specialty exam but also to become a better data scientist. There are many different ways to look at the outlier problem purely from a mathematical perspective; however, the datasets used in real life are derived from an underlying business process, so you must include a business perspective during outlier analysis.

An outlier is an atypical data point in a set of data. For example, Figure 4.8 shows some data points plotted on a two-dimensional plane; that is, x and y. The red point is an outlier since it is an atypical value in this series of data.

Figure 4.8 – Identifying an outlier

It is important to treat outlier values because some statistical methods are heavily impacted by them. Again, Figure 4.8 shows this behavior in action. On the left-hand side, a line has been drawn that best fits those data points, ignoring...
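One common rule of thumb for flagging outliers is the 1.5 × IQR rule (an illustrative choice here, not necessarily the exact method the chapter goes on to use):

```python
import statistics

def iqr_outliers(values):
    # Flag points more than 1.5 * IQR beyond the first or third quartile
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]
```

Whatever rule you pick, remember to sanity-check the flagged points against the underlying business process before discarding them.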

Dealing with unbalanced datasets

At this point, you might have realized why data preparation is probably the longest part of the data scientist’s work. You have learned about data transformation, missing data values, and outliers, but the list of problems goes on. Don’t worry – you are on the right journey to master this topic!

Another well-known problem with ML models, specifically with classification problems, is unbalanced classes. In a classification model, you can say that a dataset is unbalanced when most of its observations belong to one (or some) of the classes (target variable).

This is very common in fraud identification systems, for example, where most of the events belong to a regular operation, while a very small number of events belong to a fraudulent operation. In this case, you can also say that fraud is a rare event.

There is no strong rule for defining whether a dataset is unbalanced or not; it really depends on the context of your business...
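One simple remedy is naive random oversampling of the minority class, sketched below (class weights or synthetic sampling such as SMOTE are common alternatives; this is only an illustration of the idea):

```python
import random
from collections import Counter

def oversample(rows, labels, seed=42):
    # Randomly duplicate minority-class rows until every class
    # matches the majority class count
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, lbl in zip(rows, labels) if lbl == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels
```

Oversampling should be applied to the training set only, never before splitting off the test set, or the evaluation will be optimistic.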

Dealing with text data

You have already learned how to transform categorical features into numerical representations, either using label encoders, ordinal encoders, or one-hot encoding. However, what if you have fields containing long pieces of text in your dataset? How are you supposed to provide a mathematical representation for them in order to properly feed ML algorithms? This is a common issue in Natural Language Processing (NLP), a subfield of AI.

NLP models aim to extract knowledge from text; for example, translating text between languages, identifying entities in a corpus of text (also known as Named Entity Recognition, or NER for short), classifying the sentiment of a user review, and many other applications.
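A minimal bag-of-words sketch shows one common way to turn free text into numbers (the chapter's own NLP techniques go further than this):

```python
from collections import Counter

def bag_of_words(docs):
    # Build a sorted vocabulary, then count each word per document
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]
    return vocab, vectors
```

Each document becomes a fixed-length vector of word counts, which any standard ML algorithm can consume.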

Important note

In Chapter 8, AWS Application Services for AI/ML, you will learn about some AWS application services that apply NLP to their solutions, such as Amazon Translate and Amazon Comprehend. During the exam, you might be asked to think about the fastest...

Summary

First, you were introduced to the different types of features that you might have to work with. Identifying the type of variable you’ll be working with is very important for defining the types of transformations and techniques that can be applied to each case.

Then, you learned how to deal with categorical features. You saw that, sometimes, categorical variables do have an order (such as the ordinal ones), while other times, they don’t (such as the nominal ones). You learned that one-hot encoding (or dummy variables) is probably the most common type of transformation for nominal features; however, depending on the number of unique categories, after applying one-hot encoding, your data might suffer from sparsity issues. Regarding ordinal features, you shouldn’t create dummy variables on top of them, since you would be losing the information about the order that has been incorporated into the variable. In those cases, ordinal encoding is the most appropriate...
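The contrast above between one-hot and ordinal encoding can be sketched in plain Python (illustrative only; library implementations such as scikit-learn's OneHotEncoder and OrdinalEncoder provide the same ideas with more features):

```python
# One-hot encoding for nominal features: one binary column per category
def one_hot(values):
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Ordinal encoding for ordered features: integers that preserve the order
def ordinal_encode(values, order):
    rank = {category: i for i, category in enumerate(order)}
    return [rank[v] for v in values]
```

Note how the one-hot output grows one column per unique category, which is exactly the sparsity issue mentioned above.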

Exam Readiness Drill – Chapter Review Questions

Apart from a solid understanding of key concepts, being able to think quickly under time pressure is a skill that will help you ace your certification exam. That is why working on these skills early on in your learning journey is key.

Chapter review questions are designed to progressively improve your test-taking skills with each chapter while reviewing your understanding of its key concepts. You’ll find them at the end of each chapter.

How To Access These Resources

To learn how to access these resources, head over to the chapter titled Chapter 11, Accessing the Online Practice Resources.

To open the Chapter Review Questions for this chapter, perform the following steps:

  1. Click the link – https://packt.link/MLSC01E2_CH04.

    Alternatively, you can scan the following QR code (Figure 4.15):

Figure 4.15 – QR code that opens Chapter Review Questions for logged-in users

Working On Timing

Target: Your aim is to keep the score the same while trying to answer these questions as quickly as possible. Here’s an example of how your next attempts should look:

Attempt      Score    Time Taken
Attempt 5    77%      21 mins 30 seconds
Attempt 6    78%      18 mins 34 seconds
Attempt 7    76%      14 mins 44 seconds

Table 4.14 – Sample timing practice drills on the online platform

Note

The time limits shown in the preceding table are just examples. Set your own time limit for each attempt based on the time limit of the quiz on the website.

With each new attempt, your score should stay above 75% while your “time taken...

