You're reading from AWS Certified Machine Learning Specialty: MLS-C01 Certification Guide
1st Edition, published by Packt in March 2021. ISBN-13: 9781800569003.

Authors (2): Somanath Nanda and Weslley Moura

Somanath Nanda has 10 years of experience in the IT industry, spanning product development, DevOps, and designing and architecting products from end to end. He has also worked at AWS as a big data engineer for about two years.

Weslley Moura has been developing data products for the past decade. In his recent roles, he has been influencing data strategy and leading data teams in the urban logistics and blockchain industries.
Chapter 7: Applying Machine Learning Algorithms

In the previous chapter, we studied AWS services for data processing, including Glue, Athena, and Kinesis. It is now time to move on to the modeling phase and study machine learning algorithms. I am sure that, over the earlier chapters, you have realized that building machine learning models requires a lot of knowledge about AWS services, data engineering, exploratory data analysis, data architecture, and much more. This time, we will go deeper into the algorithms we have been talking about so far, as well as many others.

Having a good sense of the different types of algorithms and machine learning approaches will put you in a very good position to make decisions during your projects. Of course, this type of knowledge is also crucial for the AWS Machine Learning Specialty exam.

Bear in mind that there are thousands of algorithms out there, and you can even propose your own algorithm for a particular problem. Furthermore, we will...

Introducing this chapter

In this chapter, we will talk about several algorithms, modeling concepts, and learning strategies. We think all of these topics will benefit you both in the exam and in your career as a data scientist.

We have structured this chapter so that it not only covers the topics necessary for the exam but also gives you a good sense of the most important learning strategies out there. For example, the exam will check your knowledge of the basic concepts of K-means; however, we will cover it at a much deeper level, since this is an important topic for your career as a data scientist.

We will follow this approach, looking deeper into the logic of the algorithm, for the types of models that we feel every data scientist should master. So, keep that in mind: sometimes, we might go deeper than the exam requires, but that will be extremely valuable for you.

Many times, during this chapter, we will use the term built-in algorithms. We will use...

Storing the training data

First of all, you can use multiple AWS services to prepare data for machine learning, such as EMR, Redshift, Glue, and so on. After preprocessing the training data, you should store it in S3, in a format expected by the algorithm you are using. The following table shows the list of acceptable data formats per algorithm:

Figure 7.1 – Data formats that are acceptable per AWS algorithm

As we can see, many algorithms accept the text/csv format. Keep in mind that you should follow these rules if you want to use that format:

  • Your CSV file can't have a header record.
  • For supervised learning, the target variable must be in the first column.
  • While configuring the training pipeline, set the content_type of the input data channel to text/csv.
  • For unsupervised learning, specify the absence of labels by setting the content_type to 'text/csv;label_size=0', as shown in the sketch after this list.
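Here is a minimal sketch of how these content types might be declared with the SageMaker Python SDK; the bucket names and prefixes are hypothetical placeholders:

```python
# A minimal sketch, assuming the SageMaker Python SDK (sagemaker >= 2.x).
# Bucket names and prefixes are hypothetical placeholders.
from sagemaker.inputs import TrainingInput

# Supervised learning: no header row, target variable in the first column
train_input = TrainingInput(
    s3_data="s3://my-bucket/prepared/train/",  # hypothetical S3 location
    content_type="text/csv",
)

# Unsupervised learning (for example, K-means): declare that there is no label column
unsupervised_input = TrainingInput(
    s3_data="s3://my-bucket/prepared/train/",
    content_type="text/csv;label_size=0",
)

# These inputs are later passed to an estimator, for example:
# estimator.fit({"train": train_input})
```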

Although the text/csv format is fine for many use...

A word about ensemble models

Before we start diving into the algorithms, there is an important modeling concept that you should be aware of, known as ensemble learning. The term ensemble is used to describe methods that combine multiple algorithms to create a model.

For example, instead of creating just one model to predict fraudulent transactions, you could create multiple models that do the same thing and, using a voting system, select the predicted outcome. The following table illustrates this simple example:

Figure 7.2 – An example of a voting system on ensemble methods

The same approach works for regression problems, where, instead of voting, we could average the results of each model and use that as the final outcome.
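To make the voting idea concrete, here is a minimal sketch using scikit-learn (which is not part of the AWS built-in algorithms; it is used here purely for illustration):

```python
# A minimal sketch of a voting ensemble, using scikit-learn purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a fraud dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Three different models "vote" on each prediction; the majority wins
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=5)),
        ("rf", RandomForestClassifier(n_estimators=100)),
    ],
    voting="hard",  # "soft" would average predicted probabilities instead
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```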

Voting and averaging are just two examples of ensemble approaches. Other powerful techniques include blending and stacking, where you can create multiple models and use the outcome of each model as features for a main model. Looking...

Supervised learning

AWS provides supervised learning algorithms for general purposes (regression and classification tasks) and for more specific purposes (forecasting and vectorization). The list of built-in algorithms that can be found in these sub-categories is as follows:

  • Linear learner algorithm
  • Factorization machines algorithm
  • XGBoost algorithm
  • K-Nearest Neighbors (KNN) algorithm
  • Object2Vec algorithm
  • DeepAR Forecasting algorithm

Let's start with regression models and the linear learner algorithm.

Working with regression models

Okay, I know that real problems usually aren't linear or simple. However, looking into linear regression models is a nice way to figure out what's going on inside regression models in general (yes, regression models can be linear or non-linear). This is mandatory knowledge for every data scientist and can help you solve real challenges as well. We'll take a closer look at this in the following subsections...
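As a tiny illustration of what a linear model actually does, here is a sketch that fits a straight line with ordinary least squares using NumPy; the synthetic data and coefficients are made up for the example:

```python
# A minimal sketch of fitting a simple linear regression (ordinary least squares)
# with NumPy; the data is synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 * x + 2 + rng.normal(0, 1, size=100)  # true relationship: y = 3x + 2 plus noise

# Solve least squares for the slope and intercept
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")  # close to 3 and 2
```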

Unsupervised learning

AWS provides several unsupervised learning algorithms for the following tasks:

  • Clustering:
    • K-means algorithm
  • Dimensionality reduction:
    • Principal Component Analysis (PCA)
  • Pattern recognition:
    • IP Insights
  • Anomaly detection:
    • Random Cut Forest (RCF) algorithm

Let's start by talking about clustering and how the most popular clustering algorithm works: K-means.

Clustering

Clustering algorithms are very popular in data science. Basically, they aim to identify groups in a given dataset. Technically, we call these groups clusters. Clustering algorithms belong to the field of unsupervised learning, which means that they don't need a label or response variable to be trained.

This is just fantastic because labeled data is often scarce. However, it comes with some limitations. The main one is that clustering algorithms provide clusters for you, but not the meaning of each cluster. Thus, someone, as a...
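To make this concrete, here is a minimal K-means sketch using scikit-learn (illustrative only; the exam focuses on the SageMaker built-in K-means algorithm). Notice that the algorithm returns cluster assignments and centroids, not their meaning:

```python
# A minimal K-means sketch using scikit-learn, purely for illustration.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset with three natural groups and no labels
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# We get cluster assignments and centroids, but interpreting
# what each cluster represents is still up to us.
print(labels[:10])
print(kmeans.cluster_centers_)
```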

Textual analysis

Modern applications use Natural Language Processing (NLP) for several purposes, such as text translation, document classification, web search, named entity recognition (NER), and many others.

AWS offers a suite of algorithms for most NLP use cases. In the next few subsections, we will have a look at these built-in algorithms for textual analysis.

Blazing Text algorithm

Blazing Text performs two different types of tasks: text classification, which is a supervised learning approach that extends the fastText text classifier, and word2vec, which is an unsupervised learning algorithm.

Blazing Text's implementations of these two algorithms are optimized to run on large datasets. For example, you can train a model on billions of words in a few minutes.

This scalability aspect of Blazing Text is possible due to the following:

  • Its ability to use multi-core CPUs and a single GPU to accelerate text classification
  • Its ability to use multi...
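As a rough sketch of how a Blazing Text training job might be configured through the SageMaker Python SDK, the example below switches between the supervised (text classification) and word2vec modes via the mode hyperparameter; the role, S3 paths, and instance type are hypothetical placeholders:

```python
# A rough sketch, assuming the SageMaker Python SDK; the role, S3 paths,
# and hyperparameter values are hypothetical placeholders.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

# Retrieve the Blazing Text container image for the current region
container = image_uris.retrieve("blazingtext", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/blazingtext/output/",  # hypothetical
    sagemaker_session=session,
)

# mode="supervised" selects the fastText-style text classifier;
# mode="skipgram" or mode="cbow" trains word2vec embeddings instead
estimator.set_hyperparameters(mode="supervised", epochs=10)

estimator.fit({"train": "s3://my-bucket/blazingtext/train/"})  # hypothetical path
```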

Image processing

Image processing is a very popular topic in machine learning. The idea is pretty self-explanatory: creating models that can analyze images and make inferences from them. Here, inference means things such as detecting objects in an image, classifying images, and so on.

AWS offers a set of built-in algorithms we can use to train image processing models. In the next few sections, we will have a look at those algorithms.

Image classification algorithm

As the name suggests, the image classification algorithm is used to classify images using supervised learning. In other words, it needs a label for each image. It supports multi-label classification.

The way it operates is simple: during training, it receives an image and its associated labels. During inference, it receives an image and returns all the predicted labels. The image classification algorithm uses a CNN (ResNet) for training. It can either train the model from scratch or take advantage...
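A minimal sketch of configuring the built-in image classification algorithm through the SageMaker Python SDK might look like the following; the hyperparameter values, role, and S3 paths are hypothetical placeholders:

```python
# A minimal sketch, assuming the SageMaker Python SDK; values are hypothetical.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

# Retrieve the built-in image classification container image
container = image_uris.retrieve("image-classification", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/image-classification/output/",  # hypothetical
    sagemaker_session=session,
)

# use_pretrained_model=1 starts from a pre-trained ResNet rather than training
# from scratch; the remaining values are illustrative only
estimator.set_hyperparameters(
    num_classes=2,
    num_training_samples=1000,
    use_pretrained_model=1,
    epochs=10,
)
```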

Summary

That was such a journey! Let's take a moment to highlight what we have just learned. We broke this chapter into four main sections: supervised learning, unsupervised learning, textual analysis, and image processing. Everything that we have learned fits into those subfields of machine learning.

The list of supervised learning algorithms that we have studied includes the following:

  • Linear learner algorithm
  • Factorization machines algorithm
  • XGBoost algorithm
  • K-Nearest Neighbors algorithm
  • Object2Vec algorithm
  • DeepAR forecasting algorithm

Remember that you can use linear learner, factorization machines, XGBoost, and KNN for multiple purposes, including solving regression and classification problems. Linear learner is probably the simplest algorithm of the four; factorization machines extend linear learner and are good for sparse datasets; XGBoost uses an ensemble method based on decision trees; and KNN is an index-based algorithm.

The...

Questions

  1. You are working as a lead data scientist for a retail company. Your team is building a regression model and using the linear learner built-in algorithm to predict the optimal price of a particular product. The model is clearly overfitting to the training data and you suspect that this is due to the excessive number of variables being used. Which of the following approaches would best suit a solution that addresses your suspicion?

    a) Implementing a cross-validation process to reduce overfitting during the training process.

    b) Applying L1 regularization and changing the wd hyperparameter of the linear learner algorithm.

    c) Applying L2 regularization and changing the wd hyperparameter of the linear learner algorithm.

    d) Applying L1 and L2 regularization.

    Answers

    c) This question points to the problem of overfitting due to an excessive number of features being used. L2 regularization, which is available in linear learner through the wd hyperparameter, will work as a feature...
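For reference, here is a minimal sketch of where these regularization hyperparameters appear on the SageMaker Python SDK's LinearLearner estimator; the values, role, and instance type are hypothetical placeholders:

```python
# A minimal sketch, assuming the SageMaker Python SDK's LinearLearner estimator.
# The role, instance type, and hyperparameter values are hypothetical placeholders.
from sagemaker import LinearLearner

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

linear = LinearLearner(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    predictor_type="regressor",  # regression model, as in the question
    wd=0.01,  # L2 regularization (weight decay)
    l1=0.0,   # L1 regularization; raising it pushes some weights to exactly zero
)
```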
