You're reading from AWS Certified Machine Learning Specialty: MLS-C01 Certification Guide

Product type: Book
Published in: Mar 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800569003
Edition: 1st
Authors (2):

Somanath Nanda

Somanath has 10 years of experience in the IT industry, including product development, DevOps, and designing and architecting products from end to end. He also worked at AWS as a big data engineer for about 2 years.

Weslley Moura

Weslley Moura has been developing data products for the past decade. In his recent roles, he has been influencing data strategy and leading data teams in the urban logistics and blockchain industries.

Chapter 3: Data Preparation and Transformation

You have probably heard that data scientists spend most of their time working on data preparation-related activities. It is now time to explain why that happens and which types of activities we are talking about.

In this chapter, you will learn how to deal with categorical and numerical features, as well as how to apply different techniques to transform your data, such as one-hot encoding, binary encoding, ordinal encoding, binning, and text transformations. You will also learn how to handle missing values and outliers in your data, two important tasks for building good machine learning models.

In this chapter, we will cover the following topics:

  • Identifying types of features
  • Dealing with categorical features
  • Dealing with numerical features
  • Understanding data distributions
  • Handling missing values
  • Dealing with outliers
  • Dealing with unbalanced datasets
  • Dealing with text data...

Identifying types of features

We cannot start modeling without knowing what a feature is and which type of information it might store. You have already read about different processes that deal with features. For example, you know that feature engineering is related to the task of building and preparing features for your models; you also know that feature selection is related to the task of choosing the best set of features to feed a particular algorithm. These two tasks have one behavior in common: they may vary according to the types of features they are processing.

It is very important to understand this behavior (feature type versus applicable transformations) because it will help you eliminate invalid answers during your exam (and, most importantly, you will become a better data scientist).

When we refer to types of features, we are talking about the data type that a particular feature is supposed to store. The following diagram shows how we could potentially describe the...

Dealing with categorical features

Data transformation methods for categorical features will vary according to the sub-type of your variable. In the upcoming sections, we will understand how to transform nominal and ordinal features.

Transforming nominal features

You may have to create numerical representations of your categorical features before applying ML algorithms to them. Some libraries may have embedded logic to handle that transformation for you, but most of them do not.

The first transformation we will cover is known as label encoding. A label encoder is suitable for categorical/nominal variables and it will just associate a number with each distinct label of your variable. The following table shows how a label encoder works:

Figure 3.3 – Label encoder in action

A label encoder will always ensure that a unique number is associated with each distinct label. In the preceding table, although "India" appears twice, the same number...

Dealing with numerical features

In terms of numerical features (discrete and continuous), we can think of transformations that rely on the training data and others that rely purely on the observation being transformed.

Those that rely on the training data will use the train set to learn the necessary parameters during fit, and then use them to transform any test or new data. The logic is pretty much the same as what we just reviewed for categorical features; however, this time, the encoder will learn different parameters.

On the other hand, those that rely purely on observations do not care about train or test sets. They will simply perform a mathematical computation on top of an individual value. For example, we could apply an exponential transformation to a particular variable by squaring its value. There is no dependency on learned parameters from anywhere – just get the value and square it.
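The contrast between the two families can be sketched in a few lines; the min-max scaling below stands in for any transformation that must learn parameters from the train set (the values are hypothetical):

```python
# Transformation that relies on the TRAIN data: min-max scaling.
# The parameters (min and max) are learned during "fit" on the train
# set and then reused to transform test or new data.
train = [2.0, 4.0, 6.0, 10.0]
test = [5.0]

lo, hi = min(train), max(train)  # learned from the train set only
scaled_test = [(x - lo) / (hi - lo) for x in test]
print(scaled_test)   # → [0.375]

# Transformation that relies purely on the observation: squaring.
# No learned parameters anywhere - just get the value and square it.
squared_test = [x ** 2 for x in test]
print(squared_test)  # → [25.0]
```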

At this point, you might be thinking about dozens of available transformations...

Understanding data distributions

Although the Gaussian distribution is probably the most common distribution for statistical and machine learning models, you should be aware that it is not the only one. There are other types of data distributions, such as the Bernoulli, Binomial, and Poisson distributions.

The Bernoulli distribution is a very simple one, as there are only two types of possible events: success or failure. The success event has a probability "p" of happening, while the failure one has a probability of "1-p".

Some examples that follow a Bernoulli distribution are rolling a six-sided die or flipping a coin. In both cases, you must define the event of success and the event of failure. For example, suppose our events for success and failure in the die example are as follows:

  • Success: Getting a number 6
  • Failure: Getting any other number

We can then say that we have a p probability of success (1/6 = 0.16 = 16%) and a 1-p probability...
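The die example above can be simulated as a series of Bernoulli trials; this sketch just illustrates the definitions, using Python's standard `random` module with a fixed seed:

```python
# The die example as a Bernoulli trial: success = rolling a 6.
import random

p = 1 / 6  # probability of success (~0.16, i.e., about 16%)
q = 1 - p  # probability of failure (~0.83)

def bernoulli_trial(rng):
    # Map one six-sided die roll to success (1) or failure (0).
    return 1 if rng.randint(1, 6) == 6 else 0

rng = random.Random(42)
trials = [bernoulli_trial(rng) for _ in range(60_000)]

# Over many trials, the observed success rate approaches p.
print(sum(trials) / len(trials))
```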

Handling missing values

As the name suggests, missing values refer to the absence of data. Such absences are usually represented by tokens, which may or may not be implemented in a standard way.

Although using tokens is standard, the way those tokens are displayed may vary across different platforms. For example, relational databases represent missing data with NULL, core Python code will use None, and some Python libraries will represent missing numbers as NaN (Not a Number).

Important note

For numerical fields, don't replace those standard missing tokens with zeros. By default, zero is not a missing value, but another number. I said "by default" because, in data science, we may face some data quality issues, which we will cover next.

However, in real business scenarios, you may or may not find those standard tokens. For example, a software engineering team might have designed the system to automatically fill missing data with specific tokens, such as ...
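The point made in the preceding note, that None and NaN are missing tokens while zero is just another number, can be checked with a small pure-Python sketch (pandas users would typically rely on `isna()` and `fillna()` instead; the values here are hypothetical):

```python
# Counting standard missing tokens in a numerical column.
import math

values = [10.0, None, float("nan"), 0.0, 7.5]

def is_missing(v):
    # None and NaN are missing tokens; zero is a legitimate number
    # and must NOT be treated as missing.
    return v is None or (isinstance(v, float) and math.isnan(v))

missing_count = sum(is_missing(v) for v in values)
print(missing_count)  # → 2
```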

Dealing with outliers

We are not on this studying journey just to pass the AWS Machine Learning Specialty exam, but also to become better data scientists. There are many different ways to look at the outlier problem purely from a mathematical perspective; however, the datasets we use are derived from the underlying business process, so we must include a business perspective during an outlier analysis.

An outlier is an atypical data point in a set of data. For example, the following chart shows some data points that have been plotted on a two-dimensional plane; that is, x and y. The red point is an outlier, since it is an atypical value in this series of data:

Figure 3.19 – Identifying an outlier

We want to treat outlier values because some statistical methods are impacted by them. In the preceding chart, we can see this behavior in action. On the left-hand side, we drew a line that best fits those data points, ignoring the red point. On the right...
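One common way to flag such atypical points is the 1.5 * IQR (interquartile range) rule. The sketch below uses a simplified quartile calculation and hypothetical data; it is one of several detection methods, not the only one you may see in the exam:

```python
# Flagging outliers with the 1.5 * IQR rule (simplified quartiles).
data = [10, 12, 11, 13, 12, 11, 95]  # 95 is the atypical point

def iqr_outliers(values):
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]          # rough first quartile
    q3 = s[(3 * n) // 4]    # rough third quartile
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Anything outside [lower, upper] is flagged as an outlier.
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers(data))  # → [95]
```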

Dealing with unbalanced datasets

At this point, I hope you have realized why data preparation is probably the longest part of our work. We have learned about data transformation, missing data values, and outliers, but the list of problems goes on. Don't worry – bear with me and let's master this topic together!

Another well-known problem with ML models, specifically with binary classification problems, is unbalanced classes. In a binary classification model, we say that a dataset is unbalanced when most of its observations belong to the same class (target variable).

This is very common in fraud identification systems, for example, where most of the events belong to a regular operation, while a very small number of events belong to a fraudulent operation. In this case, we can also say that fraud is a rare event.

There is no strong rule for defining whether a dataset is unbalanced or not, in the sense of it being necessary to worry about it. Most challenge problems...
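One simple remedy for an unbalanced dataset is random oversampling of the minority class; the sketch below works on labels only to keep the idea visible (in practice you would resample whole observations, and libraries such as imbalanced-learn offer more sophisticated techniques like SMOTE):

```python
# Measuring imbalance and randomly oversampling the minority class.
import random

labels = [0] * 990 + [1] * 10  # 1 = fraud, a rare event (1%)

minority = [y for y in labels if y == 1]
majority = [y for y in labels if y == 0]
print(len(minority) / len(labels))  # → 0.01

# Randomly duplicate minority observations until the classes match.
rng = random.Random(0)
extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
balanced = majority + minority + extra

print(sum(balanced) / len(balanced))  # → 0.5
```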

Dealing with text data

We have already learned how to transform categorical features into numerical representations, either using label encoders, ordinal encoders, or one-hot encoding. However, what if we have fields containing long pieces of text in our dataset? How are we supposed to provide a mathematical representation for them in order to properly feed ML algorithms? This is a common issue in natural language processing (NLP), a subfield of AI.

NLP models aim to extract knowledge from texts; for example, translating text between languages, identifying entities in a corpus of text (also known as Named Entity Recognition (NER)), classifying sentiments from a user review, and many other applications.
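The most basic way to give text a mathematical representation is a bag-of-words count vector; the pure-Python sketch below (with made-up review snippets) shows the idea that scikit-learn's `CountVectorizer` implements:

```python
# A minimal bag-of-words sketch: each text becomes a vector of word
# counts over a shared vocabulary.
texts = ["the food was good", "the service was not good"]

# Build the vocabulary across all documents.
vocab = sorted({word for t in texts for word in t.split()})

def vectorize(text):
    words = text.split()
    return [words.count(term) for term in vocab]

vectors = [vectorize(t) for t in texts]
print(vocab)    # → ['food', 'good', 'not', 'service', 'the', 'was']
print(vectors)  # → [[1, 1, 0, 0, 1, 1], [0, 1, 1, 1, 1, 1]]
```

Real NLP pipelines would add lowercasing, punctuation handling, stop-word removal, and often weighting schemes such as TF-IDF on top of these raw counts.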

Important note

In Chapter 2, AWS Application Services for AI/ML, you learned about some AWS application services that apply NLP to their solutions, such as Amazon Translate and Amazon Comprehend. During the exam, you might be asked to think about the fastest or easiest way (with...

Summary

First, you were introduced to the different types of features that you might have to work with. Identifying the type of variable you'll be working with is very important for defining the types of transformations and techniques that can be applied to each case.

Then, we learned how to deal with categorical features. We saw that, sometimes, categorical variables do have an order (such as the ordinal ones), while other times, they don't (such as the nominal ones). You learned that one-hot encoding (or dummy variables) is probably the most common type of transformation for nominal features; however, depending on the number of unique categories, after applying one-hot encoding, your data might suffer from sparsity issues. Regarding ordinal features, you shouldn't create dummy variables on top of them, since you would be losing the information of order that's been incorporated into the variable. In those cases, ordinal encoding is the most appropriate transformation...
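As a quick recap of the one-hot encoding idea mentioned above, here is a minimal pure-Python sketch with a hypothetical nominal feature (pandas' `get_dummies` does this in one call):

```python
# One-hot encoding (dummy variables) for a nominal feature: one
# binary column per distinct category.
colors = ["red", "blue", "red", "green"]
categories = sorted(set(colors))  # ["blue", "green", "red"]

one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot)  # → [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

With many unique categories, most entries in these vectors are zeros, which is exactly the sparsity issue the summary warns about.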

Questions

  1. You are working as a data scientist for a healthcare company and are creating a machine learning model to predict fraud, waste, and abuse across the company's claims. One of the features of this model is the number of times a particular drug has been prescribed, to the same patient of the claim, in a period of 2 years. Which type of feature is this?

    a) Discrete

    b) Continuous

    c) Nominal

    d) Ordinal

    Answer

a) The feature counts the number of times a particular drug has been prescribed. Individual and countable items are classified as discrete data.

  2. You are building an ML model for an educational group that owns schools and universities across the globe. Your model aims to predict how likely a particular student is to leave his/her studies. Many factors may contribute to school dropout, but one of your features is the current academic stage of each student: preschool, elementary school, middle school, or high school. Which type of feature is this?

    a) Discrete...

