AWS Certified Machine Learning Specialty: MLS-C01 Certification Guide

Product type Book
Published in Mar 2021
Publisher Packt
ISBN-13 9781800569003
Pages 338 pages
Edition 1st Edition
Authors (2): Somanath Nanda, Weslley Moura

Table of Contents (14 chapters)

Preface
Section 1: Introduction to Machine Learning
  • Chapter 1: Machine Learning Fundamentals
  • Chapter 2: AWS Application Services for AI/ML
Section 2: Data Engineering and Exploratory Data Analysis
  • Chapter 3: Data Preparation and Transformation
  • Chapter 4: Understanding and Visualizing Data
  • Chapter 5: AWS Services for Data Storing
  • Chapter 6: AWS Services for Data Processing
Section 3: Data Modeling
  • Chapter 7: Applying Machine Learning Algorithms
  • Chapter 8: Evaluating and Optimizing Models
  • Chapter 9: Amazon SageMaker Modeling
Other Books You May Enjoy

Chapter 4: Understanding and Visualizing Data

Data visualization is an art! No matter how much effort you and your team put into data preparation and preliminary analysis for modeling, if you don't know how to show your findings effectively, your audience may not understand the point you are trying to make.

The situation can be even worse when you are dealing with decision-makers. For example, if you choose the wrong set of charts to tell a particular story, people can misinterpret your analysis and make bad decisions.

Understanding the different types of data visualizations, and knowing which type suits each kind of analysis, will put you in a strong position to engage your audience and convey the information you intend.

In this chapter, you will learn about some data visualization techniques. We will be covering the following topics:

  • Visualizing relationships in your data
  • Visualizing comparisons in your data
  • Visualizing...

Visualizing relationships in your data

When we need to show relationships in our data, we are usually talking about plotting two or more variables in a chart to visualize their level of dependency. A scatter plot is probably the most common type of chart to show the relationship between two variables. The following is a scatter plot for two variables, x and y:

Figure 4.1 – Plotting relationships with a scatter plot

The preceding plot shows a clear relationship between x and y: as x increases, y also increases. In this particular case, we can say that there is a linear relationship between the two variables. Keep in mind that scatter plots can also reveal other types of relationships, not only linear ones. For example, it would also be possible to find an exponential relationship between the two variables.
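To make the difference between a linear and an exponential relationship concrete, the following is a minimal sketch that computes the Pearson correlation coefficient by hand. The data and the `pearson` helper are hypothetical and are not taken from the book's figures. A perfectly linear relationship gives a coefficient of 1, while an exponential one scores lower until the y values are log-transformed.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: one linear and one exponential relationship.
x = list(range(1, 11))
y_linear = [2 * v + 1 for v in x]
y_exponential = [math.exp(0.8 * v) for v in x]

r_linear = pearson(x, y_linear)            # straight line
r_exponential = pearson(x, y_exponential)  # curved, so the linear fit is weaker
r_log = pearson(x, [math.log(v) for v in y_exponential])  # linear again after log

print(round(r_linear, 3), round(r_exponential, 3), round(r_log, 3))
```

A low linear correlation does not mean there is no relationship, which is one reason to look at the scatter plot itself rather than at a single coefficient.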

Another useful chart for showing relationships is the bubble chart. Just like a scatter plot, it will also show the relationship between variables...

Visualizing comparisons in your data

Comparisons are very common in data analysis and we have different ways to present them. Let's start with the bar chart. I am sure you have seen many reports that have used this type of visualization.

Bar charts can be used to compare one variable among different classes; for example, a car's price across different models or population size per country. In the following graph, we have used a bar chart to present the number of COVID-19 cases per state in India, up to June 2020:

Figure 4.3 – Plotting comparisons with a bar chart
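As a rough sketch of the idea behind a bar chart, the snippet below compares one variable (price) across classes (car models) and renders the result as text. The prices and the `text_bar_chart` helper are hypothetical and are used only for illustration; in practice, the same values would be passed to a plotting tool.

```python
# Hypothetical prices (in thousands) for three car models.
prices = {"Model A": 20, "Model B": 35, "Model C": 28}

def text_bar_chart(values, unit=5):
    """Return one line per class, longest bar first, one '#' per `unit`."""
    width = max(len(label) for label in values)
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    return [
        f"{label:<{width}} | {'#' * (value // unit)} {value}"
        for label, value in ordered
    ]

chart = text_bar_chart(prices)
print("\n".join(chart))
```

Sorting the bars by value is a design choice: it makes the ranking of the classes readable at a glance.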

Sometimes, we can also use stacked bar charts to add another dimension to the data that is being analyzed. For example, in the following graph, we are using a stacked bar chart to show how many people were on board the Titanic, per gender. Additionally, we are breaking down the number of people who survived (positive class) and those who did not (negative class):

...
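The breakdown behind a stacked bar chart like the one described above can be sketched as a two-level count: one bar per gender, split into segments by outcome. The passenger records below are hypothetical and do not reproduce the real Titanic dataset.

```python
from collections import Counter

# Hypothetical passenger records: (gender, survived).
passengers = [
    ("female", True), ("female", True), ("female", False),
    ("male", True), ("male", False), ("male", False), ("male", False),
]

# Count each (gender, survived) pair; each pair is one segment of a stacked bar.
segments = Counter(passengers)

# Regroup the segments per bar: one bar per gender, stacked by outcome.
stacked = {}
for (gender, survived), count in segments.items():
    label = "survived" if survived else "did not survive"
    stacked.setdefault(gender, {})[label] = count

for gender, parts in stacked.items():
    print(gender, parts, "total:", sum(parts.values()))
```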

Visualizing distributions in your data

Exploring the distribution of a feature is important for understanding some of its key characteristics, such as its skewness, mean, median, and quantiles. You can easily visualize skewness by plotting a histogram. This type of chart groups your data into bins (or buckets) and counts the observations in each one. For example, the following chart shows a histogram for the age variable:

Figure 4.7 – Plotting distributions with a histogram

Looking at the histogram, we can conclude that most of the people are between 20 and 50 years old. We can also see a few people older than 60. Another example of a histogram is shown in the following chart, where we are plotting the distribution of payments from a particular event that has different ticket prices. We want to see how much money people are paying per ticket:

Figure 4.8 – Checking skewness with a histogram

Here, we can see that the...
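The binning step that a histogram performs can be sketched in a few lines. The ticket prices and the `histogram` helper below are hypothetical and are not the data behind the book's figures. Comparing the mean with the median is a common heuristic for skewness: a mean well above the median usually points to a long right tail.

```python
import statistics
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ticket prices: most are cheap, a few are very expensive.
prices = [20, 25, 25, 30, 30, 30, 35, 40, 45, 150, 300]

bins = histogram(prices, bin_width=50)
mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

print(bins)
print("mean:", round(mean_price, 2), "median:", median_price)
```

The choice of bin width is an assumption here; different widths can make the same data look more or less skewed, so it is worth trying a few.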

Visualizing compositions in your data

Sometimes, you want to analyze the various elements that compose your feature; for example, the percentage of sales per region or percentage of queries per channel. In both examples, we are not considering any time dimension; instead, we are just looking at the data as a whole. For these types of compositions, where you don't have the time dimension, you could show your data using pie charts, stacked 100% bar charts, and treemaps.
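A composition without a time dimension boils down to each element's share of the total, which is what a pie chart or a stacked 100% bar chart displays. The following is a minimal sketch with hypothetical sales figures per region.

```python
# Hypothetical sales totals per region.
sales = {"North": 120, "South": 80, "East": 150, "West": 50}

total = sum(sales.values())

# Each region's share of the whole, as a percentage.
shares = {
    region: round(100 * amount / total, 1)
    for region, amount in sales.items()
}

for region, share in shares.items():
    print(f"{region}: {share}%")
```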

The following is a pie chart showing the number of queries per customer channel, for a given company, during a pre-defined period of time:

Figure 4.11 – Plotting compositions with a pie chart

If you want to show compositions while considering a time dimension, then your most common options would be a stacked area chart, a stacked 100% area chart, a stacked column chart, or a stacked 100% column chart. For reference, take a look at the following chart, which shows the sales per region...

Building key performance indicators

Before we wrap up these data visualization sections, I want to introduce key performance indicators, or KPIs for short.

A KPI is usually a single value that describes the results of a business indicator, such as the churn rate, net promoter score (NPS), return on investment (ROI), and so on. Although some indicators are commonly used across different industries, you are free to define your own, based on your company's needs.

To be honest, the most complex challenge associated with indicators is not in their visualization aspect itself, but in the way they have been built (the rules used) and the way they will be communicated and used across different levels of the company.
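As a rough illustration of how the rules behind an indicator turn into a single number, the sketch below computes a churn rate and an NPS. The function names, the input figures, and the churn definition (customers lost divided by customers at the start of the period) are assumptions; your company may define churn differently. The NPS rule used here is the widely used one: the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 to 6).

```python
def churn_rate(customers_at_start, customers_lost):
    """Percentage of customers lost during the period (assumed definition)."""
    return 100 * customers_lost / customers_at_start

def net_promoter_score(scores):
    """Percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# Hypothetical inputs.
churn = churn_rate(customers_at_start=1000, customers_lost=50)
survey = [10, 9, 9, 8, 7, 6, 3, 10, 9, 5]
nps = net_promoter_score(survey)

print(f"Churn rate: {churn}%  NPS: {nps}")
```

Writing the rule down as code makes the definition explicit, which helps when the same KPI has to be communicated across different levels of the company.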

From a visualization perspective, just like any other single value, you can use all those charts that we have learned about to analyze your indicator, depending on your need. However, if you just want to show your KPI, with no time dimension, you can...

Introducing QuickSight

Amazon QuickSight is a cloud-based analytics service that allows you to build data visualizations and perform ad hoc analysis. QuickSight supports a variety of data sources, such as Redshift, Aurora, Athena, RDS, and your on-premises database solution.

Other sources of data include S3, where you can retrieve data from Excel, CSV, or log files, and Software as a Service (SaaS) solutions, where you can retrieve data from Salesforce entities.

Amazon QuickSight has two versions:

  • Standard Edition
  • Enterprise Edition

The most important differences between these two versions are integration with Microsoft Active Directory (AD) and encryption at rest. Both features are only provided in the Enterprise Edition.

Important note

Keep in mind that AWS services are constantly evolving, so more differences between the standard and enterprise versions may crop up in the future. You should always consult the latest documentation of...

Summary

We have reached the end of this chapter about data visualization. Let's take this opportunity to provide a quick recap of what we have learned. We started this chapter by showing you how to visualize relationships in your data. Scatter plots and bubble charts are the most important charts in this category, showing relationships between two or three variables, respectively.

Then, we moved on to another category of data visualization, which aimed to make comparisons in your data. The most common charts that you can use to show comparisons are bar charts, column charts, and line charts. Tables are also useful to show comparisons.

The next use case that we covered was visualizing data distributions. The most common types of charts that are used to show distributions are histograms and box plots.

Then, we moved on to compositions. We use this set of charts when we want to show the different elements that make up the data. While showing compositions, you must...

Questions

  1. You are working as a data scientist for a fintech company. At the moment, you are working on a regression model that predicts how much money customers will spend on their credit card transactions in the next month. You believe you have created a good model; however, you want to complete your residual analysis to confirm that the model errors are randomly distributed around zero. What is the best chart for performing this residual analysis?

    a) Line chart

    b) Bubble chart

    c) Scatter plot

    d) Stacked bar chart

    Answer

    c) In this case, you want to show how the model errors are distributed. A scatter plot of the residuals is a good way to present such an analysis. Residuals scattered randomly around zero, with no visible pattern, are evidence that the model is not making systematic errors. Histograms are also useful for error analysis.

  2. Although you believe that two particular variables are highly correlated, you think this is not a linear correlation. Knowing the type of correlation...
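For the residual analysis described in question 1, the points of the scatter plot are simply pairs of predicted values and residuals. The following is a minimal sketch with hypothetical actual and predicted values; it also checks that the residuals are centered on zero and fall on both sides of it.

```python
import statistics

# Hypothetical actual and predicted monthly spend for six customers.
actual = [120.0, 80.0, 200.0, 150.0, 95.0, 60.0]
predicted = [115.0, 86.0, 195.0, 158.0, 90.0, 61.0]

# Residual = actual value minus predicted value.
residuals = [a - p for a, p in zip(actual, predicted)]

# These (predicted, residual) pairs are the points of a residual scatter plot.
points = list(zip(predicted, residuals))

mean_residual = statistics.mean(residuals)
positive = sum(1 for r in residuals if r > 0)
negative = sum(1 for r in residuals if r < 0)

print(points)
print("mean residual:", mean_residual, "above zero:", positive, "below zero:", negative)
```

A mean close to zero is necessary but not sufficient: the plot is still needed to spot patterns, such as residuals that grow as the predicted value grows.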