
You're reading from  AWS Certified Machine Learning - Specialty (MLS-C01) Certification Guide - Second Edition

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781835082201
Edition: 2nd Edition
Authors (2):

Somanath Nanda

Somanath has 10 years of experience in the IT industry, including product development, DevOps, and designing and architecting products end to end. He also worked at AWS as a Big Data Engineer for about 2 years.

Weslley Moura

Weslley Moura has been developing data products for the past decade. In his recent roles, he has influenced data strategy and led data teams in the urban logistics and blockchain industries.

Data Understanding and Visualization

Data visualization is an art! No matter how much effort you and your team put into data preparation and preliminary analysis for modeling, if you don’t know how to show your findings effectively, your audience may not understand the point you are trying to make.

The stakes are even higher when you are dealing with decision-makers. For example, if you choose the wrong set of charts to tell a particular story, people can misinterpret your analysis and make bad decisions.

Understanding the different types of data visualizations, and knowing how they fit with each type of analysis, will put you in a very good position in terms of engaging your audience and transmitting the information you want.

In this chapter, you will learn about some data visualization techniques. The chapter covers the following topics:

  • Visualizing relationships in your data
  • Visualizing comparisons in your data
  • Visualizing compositions...

Visualizing relationships in your data

When you need to show relationships in your data, you are usually talking about plotting two or more variables in a chart to visualize their level of dependency. A scatter plot is probably the most common type of chart to show the relationship between two variables. Figure 5.1 shows a scatter plot for two variables, X and Y.

Figure 5.1 – Plotting relationships with a scatter plot


Figure 5.1 shows a clear relationship between X and Y: as X increases, Y also increases. In this particular case, you can say that there is a linear relationship between the two variables. Keep in mind that scatter plots can also reveal other types of relationships, not only linear ones. For example, you might find an exponential relationship between the two variables.
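The difference between a linear and an exponential relationship can be sketched in a few lines of Python. This is a minimal illustration, not the code behind Figure 5.1: the data is synthetic, the slope, noise level, and growth rate are arbitrary assumptions, and the plotting helper assumes matplotlib is installed. The Pearson coefficient only measures linear dependency, so it is close to 1 for the linear data and noticeably lower for the raw exponential data, although taking the logarithm of Y makes that relationship linear again.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the synthetic data is reproducible
x = [i / 10 for i in range(1, 101)]

# Synthetic data (assumed parameters): a noisy linear trend and a clean exponential one.
y_linear = [2.0 * xi + 1.0 + rng.gauss(0, 0.5) for xi in x]
y_exp = [math.exp(0.5 * xi) for xi in x]

r_linear = pearson(x, y_linear)
r_exp_raw = pearson(x, y_exp)
r_exp_log = pearson(x, [math.log(v) for v in y_exp])  # log(Y) is linear in X


def plot_relationships():
    """Draw both scatter plots side by side (matplotlib is imported lazily)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(x, y_linear, s=10)
    axes[0].set_title(f"Linear (r = {r_linear:.2f})")
    axes[1].scatter(x, y_exp, s=10)
    axes[1].set_title(f"Exponential (r = {r_exp_raw:.2f})")
    for ax in axes:
        ax.set_xlabel("X")
        ax.set_ylabel("Y")
    plt.show()
```

Calling `plot_relationships()` shows the two shapes; the correlation values alone would not tell you that the second relationship is exponential, which is why the chart is worth drawing.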

Another useful chart for showing relationships is the bubble chart. Just like a scatter plot, it will also show the relationship between variables...

Visualizing comparisons in your data

Comparisons are very common in data analysis, and there are different ways to present them. Start with the bar chart: you have probably seen many reports that use this type of visualization.

Bar charts can be used to compare one variable among different classes – for example, a car’s price across different models or population size per country. In Figure 5.3, the bar chart is used to analyze the percentage of positive tests for COVID-19 in a range of regions of India as of April 7th, 2020.
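As a rough sketch of comparing one variable across classes, the snippet below ranks a few car models by price and renders the result as bars. The model names and prices are made up for illustration, `text_bar_chart` is a hypothetical helper, and the graphical version assumes matplotlib is available.

```python
# Hypothetical prices (in USD) for made-up car models; not real data.
prices = {"Model A": 24000, "Model B": 31500, "Model C": 18900, "Model D": 42000}

# Sorting the classes by value makes the ranking readable at a glance.
ranked = sorted(prices.items(), key=lambda item: item[1], reverse=True)
labels = [name for name, _ in ranked]
values = [price for _, price in ranked]


def text_bar_chart(labels, values, width=40):
    """Return a plain-text bar chart; the largest value spans `width` characters."""
    largest = max(values)
    lines = []
    for label, value in zip(labels, values):
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:<8} {bar} {value}")
    return "\n".join(lines)


def plot_bar_chart():
    """Draw the same comparison with matplotlib (imported lazily)."""
    import matplotlib.pyplot as plt

    plt.bar(labels, values)
    plt.ylabel("Price (USD)")
    plt.title("Price per car model (hypothetical data)")
    plt.show()


print(text_bar_chart(labels, values))
```

Ordering the bars by value is a design choice rather than a requirement; it suits unordered classes such as models or countries, while naturally ordered classes (age groups, months) are usually better left in their own order.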

Figure 5.3 – Plotting comparisons with a bar chart (source: State Health Department of India)


Sometimes, you can also use stacked column charts to add another dimension to the data that is being analyzed. For example, Figure 5.4 uses a stacked bar chart to show how many people were on board the Titanic by sex. Additionally, it breaks down the number of people who survived (positive class) and those who...
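To illustrate how a stacked chart adds a second dimension, the sketch below counts a tiny, invented passenger sample by sex and by survival outcome. These records are not the real Titanic dataset and the numbers do not match Figure 5.4; the plotting helper assumes matplotlib is installed.

```python
from collections import Counter

# Tiny hypothetical sample of (sex, survived) records; not the real Titanic data.
passengers = [
    ("female", True), ("female", True), ("female", False),
    ("male", False), ("male", False), ("male", True), ("male", False),
]

counts = Counter(passengers)
sexes = sorted({sex for sex, _ in passengers})

# One series per outcome, aligned with the `sexes` axis.
survived = [counts[(sex, True)] for sex in sexes]
not_survived = [counts[(sex, False)] for sex in sexes]


def plot_stacked_bars():
    """Stack the two outcomes on top of each other (matplotlib imported lazily)."""
    import matplotlib.pyplot as plt

    plt.bar(sexes, survived, label="Survived")
    # `bottom` shifts the second series so it sits on top of the first one.
    plt.bar(sexes, not_survived, bottom=survived, label="Did not survive")
    plt.ylabel("Passengers")
    plt.legend()
    plt.show()
```

The total height of each bar still shows how many people fall in each class, while the segments show the breakdown inside it.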

Visualizing distributions in your data

Exploring the distribution of a feature helps you understand some of its key characteristics, such as its skewness, mean, median, and quantiles. You can easily visualize skewness by plotting a histogram. This type of chart groups your data into bins (or buckets) and counts the number of observations in each one. For example, Figure 5.7 shows a histogram for the age variable:

Figure 5.7 – Plotting distributions with a histogram


Looking at the histogram, you could conclude that most of the people are between 20 and 50 years old. You can also see a few people older than 60. Another example of a histogram is shown in Figure 5.8, which plots the distribution of payments for a particular event that has different ticket prices. It aims to analyze how much money people are paying per ticket.

Figure 5.8 – Checking skewness with a histogram


Here, you can see that most of the...
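The binning and counting that a histogram performs, and a simple skewness check, can be sketched as follows. The ticket payments are hypothetical values chosen to be right-skewed and are not the data behind Figure 5.8; the bin width of 50 is an arbitrary assumption, and `histogram_counts` and `skewness` are illustrative helpers.

```python
import statistics


def histogram_counts(values, bin_width, start=0):
    """Count how many values fall into each bin [left, left + bin_width)."""
    counts = {}
    for value in values:
        index = int((value - start) // bin_width)
        left = start + index * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


def skewness(values):
    """Moment-based skewness: positive means a longer tail on the right."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Hypothetical ticket payments: many cheap tickets and a few expensive ones.
payments = [20, 20, 25, 25, 25, 30, 30, 35, 40, 45, 60, 90, 150, 300]

counts = histogram_counts(payments, bin_width=50)
mean_payment = statistics.mean(payments)
median_payment = statistics.median(payments)

for left, count in counts.items():
    print(f"{left:>3}-{left + 50:<3} {'#' * count}")
print(f"mean={mean_payment:.2f} median={median_payment} skew={skewness(payments):.2f}")
```

When the mean sits well above the median, as it does here, the few large values are pulling the distribution to the right, which is the same pattern a right-skewed histogram shows visually.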

Visualizing compositions in your data

Sometimes, you want to analyze the various elements that compose a feature: for example, the percentage of sales per region or the percentage of queries per channel. Neither example considers a time dimension; both look at the dataset as a whole. For these types of compositions, where you don’t have a time dimension, you could show your data using pie charts, stacked 100% bar charts, and tree maps.
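A static composition boils down to each element's share of the total. The sketch below computes those shares for some invented query counts per channel; the channel names and numbers are assumptions for illustration, and the pie chart helper assumes matplotlib is installed.

```python
# Hypothetical number of customer queries per channel; not real data.
queries = {"Phone": 120, "Email": 300, "Chat": 450, "Social media": 130}

total = sum(queries.values())

# Share of each channel as a percentage of all queries.
shares = {channel: 100 * count / total for channel, count in queries.items()}


def plot_pie_chart():
    """Draw the composition as a pie chart (matplotlib imported lazily)."""
    import matplotlib.pyplot as plt

    plt.pie(list(queries.values()), labels=list(queries.keys()), autopct="%1.1f%%")
    plt.title("Queries per channel (hypothetical data)")
    plt.show()


for channel, share in shares.items():
    print(f"{channel:<13} {share:5.1f}%")
```

Because the shares always sum to 100%, the same numbers can feed a stacked 100% bar chart or a tree map without any further transformation.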

Figure 5.11 is a pie chart showing the number of queries per customer channel for a given company over a pre-defined period of time.

Figure 5.11 – Plotting compositions with a pie chart


If you want to show compositions while considering a time dimension, then your most common options are a stacked area chart, a stacked 100% area chart, a stacked column chart, or a stacked 100% column chart. For reference, take a look at Figure 5.12, which shows the sales per...
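When a time dimension is involved, a stacked 100% chart normalizes each period separately, so that every column (or slice of the area) adds up to 100%. Here is a minimal sketch of that normalization step; the regions, quarters, and sales figures are invented for illustration and are unrelated to Figure 5.12.

```python
# Hypothetical sales per region for each quarter; not real data.
sales = {
    "Q1": {"North": 50, "South": 50},
    "Q2": {"North": 30, "South": 90},
    "Q3": {"North": 80, "South": 120},
}


def to_percentages(period_values):
    """Convert absolute values for one period into shares that sum to 100."""
    total = sum(period_values.values())
    return {key: 100 * value / total for key, value in period_values.items()}


# Normalize every period independently: this is what a stacked 100% chart plots.
shares_over_time = {period: to_percentages(values) for period, values in sales.items()}

for period, shares in shares_over_time.items():
    row = ", ".join(f"{region}: {share:.0f}%" for region, share in shares.items())
    print(f"{period}  {row}")
```

The trade-off is that a 100% chart hides growth in absolute volume (Q3 has the largest total here), so a regular stacked chart is the better choice when totals matter as much as proportions.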

Building key performance indicators

Before you wrap up these data visualization sections, you need to be introduced to key performance indicators, or KPIs for short.

A KPI is usually a single value that describes the results of a business indicator, such as the churn rate, net promoter score (NPS), return on investment (ROI), and so on. Although there are some standard indicators across different industries, you usually need to build custom metrics based on the company’s needs.
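The three indicators mentioned above can each be reduced to a single number. The functions below use common textbook definitions, which are assumptions here rather than rules stated in this chapter: churn rate as customers lost divided by customers at the start of the period, NPS as the percentage of promoters (scores 9 to 10) minus the percentage of detractors (scores 0 to 6), and ROI as net gain divided by cost. Your company may define them differently, and the input values are made up.

```python
def churn_rate(customers_at_start, customers_lost):
    """Fraction of the starting customer base that left during the period."""
    return customers_lost / customers_at_start


def net_promoter_score(scores):
    """NPS from 0-10 survey answers: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def return_on_investment(gain, cost):
    """Net gain relative to the amount invested."""
    return (gain - cost) / cost


# Hypothetical inputs, for illustration only.
survey_answers = [10, 9, 8, 7, 6, 0, 10, 9, 3, 10]

print(f"Churn rate: {churn_rate(1000, 50):.1%}")
print(f"NPS: {net_promoter_score(survey_answers):.0f}")
print(f"ROI: {return_on_investment(15000, 10000):.0%}")
```

Writing the rule down as code makes the definition explicit, which helps when the same indicator has to be communicated consistently across teams.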

To be honest, the most complex challenge associated with indicators is not in their visualization aspect itself, but in the way they have been built (the rules used) and the way they will be communicated and used across different levels of the company.

From a visualization perspective, just like any other single value, you can use all those charts that you have learned about to analyze your indicator, depending on your need. However, if you just want to show your KPI, with no time dimension,...

Introducing QuickSight

Amazon QuickSight is a cloud-based analytics service that allows you to build data visualizations and ad hoc analysis. QuickSight supports a variety of data sources, such as Redshift, Aurora, Athena, RDS, and your on-premises database solution.

Other sources of data include S3, where you can retrieve data from Excel, CSV, or log files, and Software-as-a-Service (SaaS) solutions, where you can retrieve data from Salesforce entities.
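Once data sources have been registered as QuickSight datasets, you can inspect them programmatically. The sketch below pages through the `list_data_sets` operation of the QuickSight API; it accepts any client object that behaves like `boto3.client("quicksight")`. The `FakeQuickSightClient` class, the dataset names, and the account ID are hypothetical stand-ins so that the example runs without AWS credentials, and the chapter itself does not cover this API.

```python
def list_dataset_names(client, account_id):
    """Return (name, import mode) pairs for the QuickSight datasets in one account.

    `client` is expected to behave like boto3.client("quicksight").
    """
    names = []
    kwargs = {"AwsAccountId": account_id}
    while True:
        page = client.list_data_sets(**kwargs)
        for summary in page.get("DataSetSummaries", []):
            names.append((summary["Name"], summary.get("ImportMode")))
        token = page.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token  # request the next page of results


class FakeQuickSightClient:
    """Hypothetical stand-in that mimics two pages of list_data_sets results."""

    def list_data_sets(self, **kwargs):
        if "NextToken" not in kwargs:
            return {
                "DataSetSummaries": [{"Name": "sales", "ImportMode": "SPICE"}],
                "NextToken": "page-2",
            }
        return {"DataSetSummaries": [{"Name": "web_logs", "ImportMode": "DIRECT_QUERY"}]}


# The account ID below is a placeholder, not a real AWS account.
datasets = list_dataset_names(FakeQuickSightClient(), "111122223333")
print(datasets)
```

With real credentials, you would pass `boto3.client("quicksight")` instead of the fake client, assuming boto3 is installed and the caller has permission to list datasets.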

Amazon QuickSight has two versions:

  • Standard edition
  • Enterprise edition

The most important differences between these two versions are integration with Microsoft Active Directory (AD) and encryption at rest. Both features are only provided in the Enterprise edition.

Important note

Keep in mind that AWS services are constantly evolving, so more differences between the Standard and Enterprise versions may crop up in the future. You should always consult the latest documentation of AWS services...

Summary

You started this chapter by learning how to visualize relationships in the data. Scatter plots and bubble charts are the most important charts in this category to show relationships between two or three variables, respectively.

Then, you moved to another category of data visualization, which aimed to make comparisons in the data. The most common charts that you can use to show comparisons are bar charts, column charts, and line charts. Tables are also useful to show comparisons.

The next use case that you learned about was visualizing data distributions. The most common types of charts that are used to show distributions are histograms and box plots.

Then, you moved to compositions. You can use this set of charts when you want to show the different elements that make up the data. While showing compositions, you must be aware of whether you want to present static data or data that changes over time. For static data, you should use a pie chart, a stacked 100% bar chart, or...

Exam Readiness Drill – Chapter Review Questions

Apart from a solid understanding of key concepts, being able to think quickly under time pressure is a skill that will help you ace your certification exam. That is why working on these skills early on in your learning journey is key.

Chapter review questions are designed to improve your test-taking skills progressively with each chapter while also checking your understanding of the chapter's key concepts. You’ll find these at the end of each chapter.

How To Access These Resources

To learn how to access these resources, head over to Chapter 11, Accessing the Online Practice Resources.

To open the Chapter Review Questions for this chapter, perform the following steps:

  1. Click the link – https://packt.link/MLSC01E2_CH05.

    Alternatively, you can scan the following QR code (Figure 5.13):

Figure 5.13 – QR code that opens Chapter Review Questions for logged-in users


Working On Timing

Target: Your aim is to keep the score the same while trying to answer these questions as quickly as possible. Here’s an example of what your next attempts should look like:

Attempt      Score   Time Taken
Attempt 5    77%     21 mins 30 seconds
Attempt 6    78%     18 mins 34 seconds
Attempt 7    76%     14 mins 44 seconds

Table 5.1 – Sample timing practice drills on the online platform

Note

The time limits shown in the above table are just examples. Set your own time limits with each attempt based on the time limit of the quiz on the website.

With each new attempt, your score should stay above 75% while your “time taken...
