
You're reading from  The Pandas Workshop

Product type: Book
Published in: Jun 2022
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800208933
Edition: 1st Edition
Authors (4):
Blaine Bateman

Blaine Bateman has more than 35 years of experience working with various industries from government R&D to startups to $1B public companies. His experience focuses on analytics including machine learning and forecasting. His hands-on abilities include Python and R coding, Keras/Tensorflow, and AWS & Azure machine learning services. As a machine learning consultant, he has developed and deployed actual ML models in industry.

Saikat Basak

Saikat Basak is a data scientist and a passionate programmer. Having worked with multiple industry leaders, he has a good understanding of problem areas that can potentially be solved using data. Apart from being a data guy, he is also a science geek and loves to explore new ideas in the frontiers of science and technology.

Thomas V. Joseph

Thomas V. Joseph is a data science practitioner, researcher, trainer, mentor, and writer with more than 19 years of experience. He has extensive experience in solving business problems using machine learning toolsets across multiple industry segments.

William So

William So is a data scientist with both a strong academic background and extensive professional experience. He is currently the Head of Data Science at Douugh and a lecturer for the Master of Data Science and Innovation at the University of Technology Sydney. During his career, he has covered the end-to-end spectrum of data analytics, from ML to business intelligence, helping stakeholders derive valuable insights and achieve results that benefit the business. William is a co-author of "The Applied Artificial Intelligence Workshop", published by Packt.

Chapter 8: Understanding Data Visualization

In the previous chapter, you were introduced to data transformation methods in pandas. In this chapter, you will learn about data visualization in pandas and use different chart types, such as line, bar, pie, scatter, and box plots, to perform exploratory data analysis. We will also look at the two ways to draw these charts: the plot() function provided by pandas, and matplotlib used directly. We will learn how the two approaches differ and which one to use, depending on the desired outcome. The plots covered in this chapter will help us analyze our data and find useful insights, such as using histograms to see how a feature is distributed across a population and using box plots to find outliers. By the end of this chapter, you will know how to select the best chart type for your data, build it, and customize it for the purpose of your analysis.

This chapter consists of the following...

Introduction to data visualization

Humans can process a large amount of information through their sense of vision. Data visualization draws on this innate skill to make data processing and organization more efficient. A classic visualization process starts by filtering data, then transforms it into visual forms, and finally displays it interactively to end users. With data visualization, users find it easier to understand and interpret the underlying data. Good data visualization helps identify patterns, trends, and extreme values in a concise presentation. This matters most when the data is large in volume or highly complex. Making sense of a large amount of data in a short time has substantial business value.

pandas offers various options for visualizing data. To ensure your visualizations are accurate and that they correctly convey the insights gained from the underlying data, it is critical to identify and clean...

Understanding the basics of pandas visualization

pandas has built-in plotting capabilities that can be used to visualize both DataFrames and Series. Its plot function acts as a wrapper on top of the matplotlib plot function, which means that pandas actually uses the matplotlib library but exposes a simplified syntax. The advantage is that it is much easier to use (less code and simpler syntax) than matplotlib, while still providing a wide range of functionality and flexibility for plotting charts from your data.

To start using the built-in pandas visualizations, you need to know several key parameters of the .plot() function, which can be called on a DataFrame. Some of these are as follows:

  • kind: This is the type of plot (bar, barh, pie, scatter, kde, and so on).
  • color: This is the color of the plot.
  • linestyle: This is the style of the line used in the plot (solid, dotted, and dashed).
  • ...
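As a minimal sketch of the parameters listed above, the following snippet calls .plot() on a small DataFrame twice, once as a dashed line and once as a bar chart. The DataFrame, its column names (year, units), and its values are made up for illustration and do not come from the book. The Agg backend is selected only so that the snippet runs without a display; in a Jupyter notebook, you can omit those two lines.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical data, invented for this illustration
sales = pd.DataFrame({
    "year": [2018, 2019, 2020, 2021],
    "units": [120, 135, 128, 160],
})

# kind selects the chart type; color and linestyle are passed through to matplotlib
ax = sales.plot(x="year", y="units", kind="line", color="green", linestyle="--")
ax.set_ylabel("units")

# The same data as a bar chart: only kind (and the color) changes
bar_ax = sales.plot(x="year", y="units", kind="bar", color="steelblue")
```

Both calls return a matplotlib Axes object, which is what makes it possible to keep customizing a pandas plot with matplotlib methods such as set_ylabel().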

Exploring matplotlib

Matplotlib is one of the most frequently used Python libraries. It can generate plots with great flexibility. The pandas plot() function is a wrapper on top of matplotlib that exposes only a basic subset of its functionality. While it does simplify the syntax, it also restricts the numerous possibilities of matplotlib. If you want to build complex visualizations, matplotlib will be your best choice, as it gives you control over all kinds of properties, such as the figure size and type, the markers, the line width, the colors, and the styles. We will see some of the customizations that are easier to do with matplotlib than with pandas:

  1. Let's start with an example. Consider the following snippet:
    # Importing libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
     
    # Defining a DataFrame
    data_frame = pd.DataFrame({
    'Year':['2010','2011','2012','2013','2014','...
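
The kind of fine-grained control described in this section can be sketched as follows. This is not the book's example, which is cut off above, but a separate illustration on made-up data: the column names and values are hypothetical. It calls matplotlib directly to set the figure size, line width, line style, marker, and axis labels in one place.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({
    "year": [2018, 2019, 2020, 2021],
    "units": [120, 135, 128, 160],
})

# Create the figure and axes explicitly so that the size is under our control
fig, ax = plt.subplots(figsize=(8, 4))

# Line width, line style, and marker settings are all set in the plot call
ax.plot(df["year"], df["units"], color="tab:red", linewidth=2.5,
        linestyle=":", marker="o", markersize=8)

# Titles, labels, and tick positions are set through the Axes object
ax.set_title("Units sold per year")
ax.set_xlabel("Year")
ax.set_ylabel("Units")
ax.set_xticks(df["year"].tolist())
fig.tight_layout()
```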

Visualizing data of different types

In the previous section, we saw how to use pandas and matplotlib to create charts for data visualization. In a data analytics project, data visualization can be used either for data analysis or to communicate insights. Presenting results in a visual way that stakeholders can easily understand and interpret is a must-have skill for any good data analyst. However, you cannot pick a chart at random and expect it to suit all of the different types of data that an analyst may encounter. Different chart types are suited to communicating insights from different types of data. For example, when communicating the reach of social media across different age groups, a pie chart is preferable to a bar chart or a box plot. On the other hand, line plots are more suitable for visualizing gradual change. The trick of data visualization is to know exactly which type of plot is appropriate for each data type you will encounter. This is...
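
To illustrate how the chart type follows the data type, here is a small sketch that draws three pandas plots side by side: a pie chart for shares across categories, a line plot for gradual change, and a box plot for the spread of a numerical feature. All three Series (reach, monthly, prices) contain invented values and are assumptions made for this illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, invented for this illustration
reach = pd.Series([45, 30, 15, 10],
                  index=["18-24", "25-34", "35-44", "45+"], name="reach")
monthly = pd.Series([10, 12, 15, 19, 24], index=[1, 2, 3, 4, 5], name="signups")
prices = pd.Series([52, 55, 58, 60, 61, 63, 140], name="price")

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Categorical shares of a whole: pie chart
reach.plot(kind="pie", ax=axes[0], title="Share by age group")

# Ordered values that change gradually: line plot
monthly.plot(kind="line", ax=axes[1], title="Gradual change")

# Numerical feature with a possible outlier: box plot
prices.plot(kind="box", ax=axes[2], title="Spread and outliers")

fig.tight_layout()
```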

Activity 8.01 – Using data visualization for exploratory data analysis

In this activity, we will apply what we have learned in this chapter by building different types of plots to perform an exploratory data analysis of sale prices. We will work on the Manufactured Housing Survey dataset, published by the United States Census Bureau, which can be found in the GitHub repository at https://raw.githubusercontent.com/PacktWorkshops/The-pandas-Workshop/master/Chapter08/Data/PUF2020final_v1coll.csv.

Note

More details about the Manufactured Housing Survey dataset can be found at https://www.census.gov/data/datasets/2020/econ/mhs/puf.html.

The goal of this activity is to analyze the different factors contributing to a sale price in the housing market. We will use different types of plots to achieve this.

Your tasks will be as follows:

  1. Open a Jupyter notebook.
  2. Import the pandas, numpy, and matplotlib packages.
  3. Load the CSV file as a DataFrame.
  4. For the...
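
The first three tasks might look roughly like the sketch below. To keep it runnable offline, it reads a tiny in-memory CSV instead of the real file. The column names PRICE and REGION and all of the values are placeholders invented for this illustration, not the dataset's actual schema. In the activity itself, you would pass the dataset URL given above to pd.read_csv() instead of the StringIO object.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the real CSV file: hypothetical columns and invented values
csv_text = """PRICE,REGION
52000,1
58000,2
61000,2
75000,3
140000,4
"""

# Load the CSV as a DataFrame (pass the dataset URL here in the real activity)
df = pd.read_csv(io.StringIO(csv_text))

# A quick numerical summary before plotting
median_price = np.median(df["PRICE"])

# A first look at how the price is distributed
ax = df["PRICE"].plot(kind="hist", bins=5, title="Distribution of price")
ax.set_xlabel("PRICE")
```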

Summary

In this chapter, we learned the fundamentals of pandas visualization and how to create charts. After going through the basics of creating charts in pandas, we looked at how to customize them further with the matplotlib package. Then, we learned which charts suit each type of data, such as numerical, categorical, and statistical data, before learning how to handle multiple data plots.

Finally, we applied what we learned in this chapter to a business case in an activity, where the goal was to determine how different factors affect a price. In the next chapter, you will learn how to model data to derive insights.
