
You're reading from Hands-On Data Analysis with Pandas - Second Edition

Product type: Book
Published in: Apr 2021
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781800563452
Edition: 2nd Edition
Author: Stefanie Molin

Stefanie Molin is a data scientist and software engineer at Bloomberg LP in NYC, tackling tough problems in information security, particularly revolving around anomaly detection, building tools for gathering data, and knowledge sharing. She has extensive experience in data science, designing anomaly detection solutions, and utilizing machine learning in both R and Python in the AdTech and FinTech industries. She holds a B.S. in operations research from Columbia University's Fu Foundation School of Engineering and Applied Science, with minors in economics, and entrepreneurship and innovation. In her free time, she enjoys traveling the world, inventing new recipes, and learning new languages spoken among both people and computers.

Chapter 5: Visualizing Data with Pandas and Matplotlib

So far, we have been working with data strictly in a tabular format. However, the human brain excels at picking out visual patterns, so our natural next step is learning how to visualize our data. Visualizations make it much easier to spot aberrations in our data and explain our findings to others. That said, we should not reserve data visualizations exclusively for those we present our conclusions to: they are just as crucial in helping us understand our data quickly and more completely during our exploratory data analysis.

There are numerous types of visualizations that go far beyond what we may have seen in the past. In this chapter, we will cover the most common plot types, such as line plots, histograms, scatter plots, and bar plots, along with several other plot types that build upon these. We won't be covering pie charts: they are notorious for being difficult to read properly, and there are better ways...

Chapter materials

The materials for this chapter can be found on GitHub at https://github.com/stefmolin/Hands-On-Data-Analysis-with-Pandas-2nd-edition/tree/master/ch_05. We will be working with three datasets, all of which can be found in the data/ directory. In the fb_stock_prices_2018.csv file, we have the daily opening, high, low, and closing prices of Facebook stock from January through December 2018, along with the volume traded. This was obtained using the stock_analysis package, which we will build in Chapter 7, Financial Analysis – Bitcoin and the Stock Market. The stock market is closed on the weekends, so we only have data for the trading days.

The earthquakes.csv file contains earthquake data collected from the United States Geological Survey (USGS) API (https://earthquake.usgs.gov/fdsnws/event/1/) for September 18, 2018 through October 13, 2018. For each earthquake, we have the value of the magnitude (the mag column), the scale it was measured on (the magType...

An introduction to matplotlib

The plotting capabilities in pandas and seaborn are powered by matplotlib: both of these packages provide wrappers around the lower-level functionality in matplotlib. Consequently, we have many visualization options at our fingertips with minimal code to write; the price we pay is reduced flexibility in what we can create.

We may find that the pandas or seaborn implementation isn't quite meeting our needs, and, indeed, it may be impossible to override a particular setting after creating the plot with them, meaning we will have to do some of the legwork with matplotlib. Additionally, many of the tweaks that will be made to the final appearance of the visualization will be handled with matplotlib commands, which we will discuss in the next chapter. Therefore, it would greatly benefit us to have some understanding of how matplotlib works.
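To make the wrapper relationship concrete, here is a minimal sketch. It assumes synthetic data (the `value` column and the dates are made up for illustration and are not one of the chapter's datasets) and a non-interactive backend so that it runs outside a notebook. The point is only that the object pandas hands back is a regular matplotlib Axes, which we can keep adjusting with matplotlib calls:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (hypothetical; not one of the chapter's datasets)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {'value': rng.normal(size=50).cumsum()},
    index=pd.date_range('2018-01-01', periods=50, freq='D'),
)

# pandas builds the plot with matplotlib and returns the Axes object
ax = df.plot(title='Synthetic series')

# From here on, plain matplotlib calls adjust the result
ax.set_ylabel('cumulative sum')
ax.axhline(0, color='gray', linestyle='--')
```

In a Jupyter notebook, the `matplotlib.use('Agg')` line can be dropped in favor of `%matplotlib inline`.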

The basics

The matplotlib package is rather large since it encompasses quite a bit of functionality...

Plotting with pandas

Both Series and DataFrame objects have a plot() method that allows us to create several different plots and control some aspects of their formatting, such as subplot layout, figure size, titles, and whether to share an axis across subplots. This makes plotting our data much more convenient, as the bulk of the work to create presentable plots is achieved with a single method call. Under the hood, pandas is making several calls to matplotlib to produce our plot. Some of the most frequently used arguments to the plot() method include the following:

Figure 5.10 – Frequently used pandas plotting arguments

Rather than having separate functions for each plot type, as we saw during our discussion of matplotlib, the plot() method from pandas allows us to specify the type of plot we want using the kind argument. The choice of plot will determine which other arguments are required. We can use the Axes object that's returned by the plot...
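As a rough illustration of the `kind` argument and a few of the formatting arguments mentioned above, consider the following sketch. The data is synthetic and the column names `open` and `close` are stand-ins chosen for this example; they are not read from the chapter's CSV file:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical column names, not the chapter's CSV)
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    {
        'open': 100 + rng.normal(size=60).cumsum(),
        'close': 100 + rng.normal(size=60).cumsum(),
    },
    index=pd.date_range('2018-01-01', periods=60, freq='D'),
)

# Same method, different plot types selected with `kind`
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
prices['close'].plot(kind='line', ax=axes[0], title='line')
prices['close'].plot(kind='hist', ax=axes[1], title='hist', bins=10)
prices['close'].plot(kind='box', ax=axes[2], title='box')

# Formatting arguments: one subplot per column, laid out side by side
grid = prices.plot(
    subplots=True, layout=(1, 2), figsize=(10, 3),
    sharex=True, title='Synthetic open and close',
)
```

Note that some arguments only make sense for certain plot types; `bins`, for example, is passed here only to the histogram.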

The pandas.plotting module

In the Plotting with pandas section, we covered standard plots for which pandas provides easier implementations. However, pandas also has a module (appropriately named plotting) with special plots that we can use on our data. Note that the customization options for these may be more limited because of how they are composed and returned to us.

We will be working in the 3-pandas_plotting_module.ipynb notebook for this section. As usual, we will begin with our imports and reading in the data; we will only be using the Facebook data here:

>>> %matplotlib inline
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd
>>> fb = pd.read_csv(
...     'data/fb_stock_prices_2018.csv', 
...     index_col='date', 
...     parse_dates=True
... )

Now, let's take a tour of some of the...
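As one example of what this module offers, the sketch below calls `scatter_matrix()` from `pandas.plotting`. It uses synthetic data with made-up column names (`a`, `b`, and `c`) rather than the Facebook data, so treat it as an illustration of the call pattern only, not as the notebook's own code:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data with hypothetical column names
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])

# One scatter plot per pair of columns, with histograms on the diagonal
axes = scatter_matrix(data, figsize=(6, 6))
```

The function returns a grid of Axes objects (one per pair of columns) rather than a single Axes, which is part of why customizing these plots can take more work.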

Summary

Now that we've completed this chapter, we are well-equipped to quickly create a variety of visualizations in Python using pandas and matplotlib. We now understand the basics of how matplotlib works and the main components of a plot. Additionally, we discussed various plot types and the situations in which to use them—a crucial component of data visualization is choosing the appropriate plot. Be sure to check out the Choosing the appropriate visualization section in the Appendix for future reference.

Note that the best practices for visualization don't just apply to the plot type, but also to the formatting of the plot, which we will discuss in the next chapter. In addition to this, we will build upon the foundation we laid here to discuss additional plots using seaborn and how to customize our plots using matplotlib. Be sure to complete the end-of-chapter exercises to practice plotting before moving on, as we will be building on this chapter's material...

Exercises

Create the following visualizations using what you have learned up to this point in this book. Use the data from this chapter's data/ directory:

  1. Plot the rolling 20-day minimum of the Facebook closing price using pandas.
  2. Create a histogram and KDE of the change from open to close in the price of Facebook stock.
  3. Using the earthquake data, create box plots for the magnitudes of each magType used in Indonesia.
  4. Make a line plot of the difference between the weekly maximum high price and the weekly minimum low price for Facebook. This should be a single line.
  5. Plot the 14-day moving average of the daily change in new COVID-19 cases in Brazil, China, India, Italy, Spain, and the USA:

    a) First, use the diff() method that was introduced in the Working with time series data section of Chapter 4, Aggregating Pandas DataFrames, to calculate the day-over-day change in new cases. Then, use rolling() to calculate the 14-day moving average.

    b) Make three subplots...
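The `diff()` and `rolling()` pattern from exercise 5 can be tried out on a toy series first. The values below are made up, and the window is 3 rather than 14 so that the result is easy to check by hand:

```python
import pandas as pd

# Toy daily series (hypothetical values, not the COVID-19 data)
cases = pd.Series(
    [1, 3, 6, 10, 15, 21],
    index=pd.date_range('2020-01-01', periods=6, freq='D'),
)

# Day-over-day change; the first value is NaN since there is no prior day
change = cases.diff()

# Moving average of the change over a 3-row window
smoothed = change.rolling(3).mean()
```

Here, `change` holds 2, 3, 4, 5, and 6 after the initial NaN, so `smoothed` is NaN until a full window of non-missing changes is available and then takes the values 3.0, 4.0, and 5.0. For the exercise itself, the window would be 14.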

Further reading

Take a look at the following resources for additional information on the concepts that were discussed in this chapter:
