
You're reading from  AWS Certified Machine Learning - Specialty (MLS-C01) Certification Guide - Second Edition

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781835082201
Edition: 2nd Edition
Authors (2):

Somanath Nanda

Somanath has 10 years of experience in the IT industry, including product development, DevOps, and designing and architecting products end to end. He has also worked at AWS as a Big Data Engineer for about 2 years.

Weslley Moura

Weslley Moura has been developing data products for the past decade. In his recent roles, he has influenced data strategy and led data teams in the urban logistics and blockchain industries.

Applying Machine Learning Algorithms

In the previous chapter, you learned about understanding data and visualization. It is now time to move on to the modeling phase and study machine learning algorithms! In the earlier chapters, you learned that building machine learning models requires a lot of knowledge about AWS services, data engineering, data exploration, data architecture, and much more. This time, you will delve deeper into the algorithms that have been introduced and more.

Having a good sense of the different types of algorithms and machine learning approaches will put you in a very good position to make decisions during your projects. Of course, this type of knowledge is also crucial to the AWS Certified Machine Learning Specialty exam.

Bear in mind that there are thousands of algorithms out there. You can even propose your own algorithm for a particular problem. In this chapter, you will learn about the most relevant ones and, hopefully, the ones that you will probably...

Introducing this chapter

During this chapter, you will read about several algorithms, modeling concepts, and learning strategies. All these topics are beneficial for you to know for the exam and throughout your career as a data scientist.

This chapter has been structured in such a way that it not only covers the necessary topics of the exam but also gives you a good sense of the most important learning strategies out there. For example, the exam will check your knowledge regarding the basic concepts of K-Means. However, this chapter will cover it on a much deeper level, since this is an important topic for your career as a data scientist.

The chapter will follow this approach of looking deeper into the algorithms’ logic for some types of models that every data scientist should master. Furthermore, keep this in mind: sometimes you may go deeper than what is expected of you in the exam, but that will be extremely important for you in your career.

Many times during this...

Storing the training data

First of all, you can use multiple AWS services to prepare data for machine learning, such as Elastic MapReduce (EMR), Redshift, Glue, and so on. After preprocessing the training data, you should store it in S3, in a format expected by the algorithm you are using. Table 6.1 shows the list of acceptable data formats per algorithm.

...

A word about ensemble models

Before you start diving into the algorithms, there is an important modeling concept that you should be aware of: ensembles. The term ensemble describes methods that combine multiple models to produce a single prediction.

A regular algorithm that does not implement ensemble methods will rely on a single model to train and predict the target variable. That is what happens when you create a decision tree or regression model. On the other hand, algorithms that do implement ensemble methods will rely on multiple models to predict the target variable. In that case, since each of these models might come up with a different prediction for the target variable, ensemble algorithms implement either a voting (for classification models) or averaging (for regression models) system to output the final results. Table 6.2 illustrates a very simple voting system for an ensemble algorithm composed of three models.
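The voting and averaging idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the way any particular AWS algorithm is implemented: the function names majority_vote and average_prediction, and the sample predictions, are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Return the class label predicted by the largest number of models."""
    # most_common(1) returns a one-element list: [(label, count)].
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Return the mean of the numeric predictions made by all models."""
    return mean(predictions)


# Hypothetical outputs of three models for a single observation.
class_votes = ["fraud", "not fraud", "fraud"]
regression_outputs = [10.0, 12.0, 14.0]

print(majority_vote(class_votes))              # fraud
print(average_prediction(regression_outputs))  # 12.0
```

With an odd number of models and two classes, a tie cannot happen. In other setups, a tie-breaking rule has to be chosen; in this sketch the label that was seen first wins.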

Data format                        Algorithm
Application/x-image                Object detection algorithm, semantic segmentation
Application/x-recordio             Object detection algorithm
Application/x-recordio-protobuf    Factorization machines, K-Means, KNN, latent Dirichlet allocation, linear learner, NTM, PCA, RCF, sequence-to-sequence
Application/jsonlines              BlazingText, DeepAR

...
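To make the JSON Lines row of the data format table more concrete, the following sketch serializes a few records as one JSON object per line, which is what the jsonlines format means. The helper name to_json_lines and the sample values are hypothetical, and the "start" and "target" field names are assumed here to follow the layout DeepAR expects; check the algorithm documentation before relying on them.

```python
import json


def to_json_lines(records):
    """Serialize an iterable of dicts as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records) + "\n"


# Hypothetical time series; the field names are an assumption (see the lead-in).
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.5, 6.1]},
    {"start": "2024-01-01 00:00:00", "target": [1.0, 0.0, 2.3]},
]

payload = to_json_lines(series)
print(payload)
```

The resulting text is what would be written to a file and stored in S3 for training.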

Supervised learning

AWS provides supervised learning algorithms for general purposes (regression and classification tasks) and more specific purposes (forecasting and vectorization). The list of built-in algorithms that can be found in these sub-categories is as follows:

  • Linear learner algorithm
  • Factorization machines algorithm
  • XGBoost algorithm
  • KNN algorithm
  • Object2Vec algorithm
  • DeepAR forecasting algorithm

You will start by learning about regression models and the linear learner algorithm.

Working with regression models

Looking at linear regression models is a nice way to understand what is going on inside regression models in general (linear and non-linear regression models). This is mandatory knowledge for every data scientist and can help you solve real challenges as well. You will now take a closer look at this in the following subsections.

Introducing regression algorithms

Linear regression models aim to predict a numeric value...
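As a small illustration of predicting a numeric value, the sketch below fits a one-variable linear regression with the closed-form least squares formulas. It is not the SageMaker linear learner algorithm, only a textbook fit; the function names and the sample data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations divided by the sum of squared x deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the numeric target for a single input value."""
    return slope * x + intercept


# Hypothetical data that follows y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
print(predict(5.0, slope, intercept))
```

Because the sample data lies on a straight line, the fitted slope and intercept recover the line exactly; with real data, the fit minimizes the sum of squared errors instead.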

Unsupervised learning

AWS provides several unsupervised learning algorithms for the following tasks:

  • Clustering: K-Means algorithm
  • Dimension reduction: Principal Component Analysis (PCA)
  • Pattern recognition: IP Insights
  • Anomaly detection: The Random Cut Forest (RCF) algorithm

Let us start by talking about clustering and how the most popular clustering algorithm works: K-Means.

Clustering

Clustering algorithms are very popular in data science. They aim to identify groups of similar observations, known as clusters, in a given dataset. Clustering algorithms belong to the field of unsupervised learning, which means that they do not need a label or response variable to be trained.

This is just fantastic since labeled data is very scarce! However, it comes with some limitations. The main one is that clustering algorithms provide clusters for you, but not the meaning of each cluster. Thus, someone, as a subject matter expert, has to analyze the properties...
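The following sketch shows the classic K-Means iteration (often called Lloyd's algorithm) in plain Python. It is an illustration only and is not necessarily what the SageMaker built-in K-Means does internally; the function names, the sample points, and the initial centroids are hypothetical.

```python
import math


def nearest_index(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [nearest_index(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Final labels computed against the final centroids.
    labels = [nearest_index(p, centroids) for p in points]
    return centroids, labels


# Hypothetical 2D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(labels)
print(centroids)
```

Note that the output is only a cluster index per point. Nothing in it says what cluster 0 or cluster 1 means, which is why someone still has to interpret the clusters.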

Textual analysis

Modern applications use Natural Language Processing (NLP) for several purposes, such as text translation, document classification, web search, Named Entity Recognition (NER), and many others.

AWS offers a suite of algorithms for most NLP use cases. In the next few subsections, you will have a look at these built-in algorithms for textual analysis.

BlazingText algorithm

BlazingText supports two different types of tasks: text classification, which is a supervised learning approach that extends the fastText text classifier, and Word2Vec, which is an unsupervised learning algorithm.

BlazingText’s implementations of these two algorithms are optimized to run on large datasets. For example, you can train a model on billions of words in a few minutes.

This scalability aspect of BlazingText is possible due to the following:

  • Its ability to use multi-core CPUs and a single GPU to accelerate text classification
  • Its ability to use multi-core...
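For the text classification mode, training data follows the fastText convention: one example per line, with each label prefixed by __label__ and followed by space-separated tokens. The sketch below builds such lines. The helper name to_fasttext_line and the sample reviews are hypothetical, and lowercasing plus whitespace tokenization is an assumed preprocessing choice, not a requirement stated in the chapter.

```python
def to_fasttext_line(label, text):
    """Format one labeled example as '__label__<label> <space-separated tokens>'."""
    # Assumed preprocessing: lowercase and split on whitespace.
    tokens = text.lower().split()
    return f"__label__{label} " + " ".join(tokens)


# Hypothetical labeled reviews.
examples = [
    ("positive", "Great  product, works well"),
    ("negative", "Stopped working after a week"),
]

training_lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(training_lines))
```

Each printed line is one training example; the joined lines form the text file that would be uploaded for training.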

Image processing

Image processing is a very popular topic in machine learning. The idea is pretty self-explanatory: creating models that can analyze images and make inferences on top of them. Inference here means tasks such as detecting objects in an image, classifying images, and so on.

AWS offers a set of built-in algorithms you can use to train image processing models. In the next few sections, you will have a look at those algorithms.

Image classification algorithm

As the name suggests, the image classification algorithm is used to classify images using supervised learning. In other words, each training image needs an associated label. It supports multi-label classification.

The way it operates is simple: during training, it receives an image and its associated labels. During inference, it receives an image and returns all the predicted labels. The image classification algorithm uses a CNN (ResNet) for training. It can either train the model from scratch or take advantage...

Summary

That was such a journey! Take a moment to recap what you have just learned. This chapter had four main topics: supervised learning, unsupervised learning, textual analysis, and image processing. Everything that you have learned fits into those subfields of machine learning.

The list of supervised learning algorithms that you have studied includes the following:

  • Linear learner
  • Factorization machines
  • XGBoost
  • KNN
  • Object2Vec
  • DeepAR forecasting

Remember that you can use linear learner, factorization machines, XGBoost, and KNN for multiple purposes, including solving regression and classification problems. Linear learner is probably the simplest algorithm out of these four. Factorization machines extends linear learner and is good for sparse datasets, XGBoost uses an ensemble method based on decision trees, and KNN is an index-based algorithm.

The other two algorithms, Object2Vec and DeepAR, are used for specific purposes. Object2Vec is used...

Exam Readiness Drill – Chapter Review Questions

Apart from a solid understanding of key concepts, being able to think quickly under time pressure is a skill that will help you ace your certification exam. That is why working on these skills early on in your learning journey is key.

Chapter review questions are designed to improve your test-taking skills progressively with each chapter you learn and review your understanding of key concepts in the chapter at the same time. You’ll find these at the end of each chapter.

How To Access These Resources

To learn how to access these resources, head over to Chapter 11, Accessing the Online Practice Resources.

To open the Chapter Review Questions for this chapter, perform the following steps:

  1. Click the link – https://packt.link/MLSC01E2_CH06.

    Alternatively, you can scan the following QR code (Figure 6.19):

Figure 6.19 – QR code that opens Chapter Review Questions for logged-in users


Working On Timing

Target: Your aim is to keep the score the same while trying to answer these questions as quickly as possible. Here’s an example of what your next attempts should look like:

Attempt      Score    Time Taken
Attempt 5    77%      21 mins 30 seconds
Attempt 6    78%      18 mins 34 seconds
Attempt 7    76%      14 mins 44 seconds

Table 6.11 – Sample timing practice drills on the online platform

Note

The time limits shown in the above table are just examples. Set your own time limits with each attempt based on the time limit of the quiz on the website.

With each new attempt, your score should stay above 75% while your “time taken...
