You're reading from Big Data Analysis with Python

Product type: Book
Published in: Apr 2019
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781789955286
Edition: 1st Edition
Authors (3):
Ivan Marin

Ivan Marin is a systems architect and data scientist working at Daitan Group, a Campinas-based software company. He designs big data systems for large volumes of data and implements machine learning pipelines end to end using Python and Spark. He is also an active organizer of data science, machine learning, and Python in São Paulo, and has given Python for data science courses at university level.

Ankit Shukla

Ankit Shukla is a data scientist working with World Wide Technology, a leading US-based technology solution provider, where he develops and deploys machine learning and artificial intelligence solutions to solve business problems and create actual dollar value for clients. He is also part of the company's R&D initiative, which is responsible for producing intellectual property, building capabilities in new areas, and publishing cutting-edge research in corporate white papers. Besides tinkering with AI/ML models, he likes to read and is a big-time foodie.

Sarang VK

Sarang VK is a lead data scientist at StraitsBridge Advisors, where his responsibilities include requirement gathering, solutioning, development, and productization of scalable machine learning, artificial intelligence, and analytical solutions using open source technologies. Alongside this, he supports pre-sales and competency.

Chapter 2. Statistical Visualizations

Note

Learning Objectives

We will start our journey by understanding the power of Python to manipulate and visualize data and to produce useful analyses.

By the end of this chapter, you will be able to:

  • Use graphs for data analysis

  • Create graphs of various types

  • Change graph parameters such as color, title, and axis

  • Export graphs for presentation, printing, and other uses

Note

In this chapter, we will illustrate how you can generate visualizations with Matplotlib and Seaborn.

Introduction


In the last chapter, we learned about the libraries that are most commonly used for data science work with Python. Although they are not big data libraries per se, the libraries of the Python Data Science Stack (NumPy, Jupyter, IPython, Pandas, and Matplotlib) are important in big data analysis.

As we will demonstrate in this chapter, no analysis is complete without visualizations, even with big datasets, so knowing how to generate images and graphs from data in Python is relevant for our goal of big data analysis. In the subsequent chapters, we will demonstrate how to process large volumes of data and aggregate it to visualize it using Python tools.

There are several visualization libraries for Python, such as Plotly and Bokeh, but one of the oldest, most flexible, and most used is Matplotlib. Before going through the details of creating a graph with Matplotlib, let's first understand what kinds of graphs are relevant for analysis.

Types of Graphs and When to Use Them


Every analysis, whether on small or large datasets, involves a descriptive statistics step, where the data is summarized and described by statistics such as mean, median, percentages, and correlation. This step is commonly the first step in the analysis workflow, allowing a preliminary understanding of the data and its general patterns and behaviors, providing grounds for the analyst to formulate hypotheses, and directing the next steps in the analysis. Graphs are powerful tools to aid in this step, enabling the analyst to visualize the data, create new views and concepts, and communicate them to a larger audience.

There is a vast statistical literature on visualizing information. The classic book, Envisioning Information, by Edward Tufte, demonstrates beautiful and useful examples of how to present information in graphical form. In another book, The Visual Display of Quantitative Information, Tufte enumerates a few qualities that a graph...

Components of a Graph


Each graph has a set of common components that can be adjusted. The names that Matplotlib uses for these components are demonstrated in the following graph:

Figure 2.3: Components of a graph

The components of a graph are as follows:

  • Figure: The base of the graph, where all the other components are drawn.

  • Axes: Contains the plot elements and sets the coordinate system. (Matplotlib calls this object Axes; an Axis is a single x or y scale within it.)

  • Title: The title gives the graph its name.

  • X-axis label: The name of the x-axis, usually named with the units.

  • Y-axis label: The name of the y-axis, usually named with the units.

  • Legend: A description of the data plotted in the graph, allowing you to identify the curves and points in the graph.

  • Ticks and tick labels: They indicate the points of reference on a scale for the graph, where the values of the data are. The labels indicate the values themselves.

  • Line plots: These are the lines that are plotted with the data.

  • Markers: Markers are the pictograms that mark the point data.

  • Spines: The lines that delimit the...
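As a rough illustration of the components listed above, the following sketch touches each of them once through Matplotlib's object-oriented interface. The data, labels, and title are made up for the example, and the choice of the Agg backend is only an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Made-up data, for illustration only
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

fig, ax = plt.subplots()                    # the Figure and one Axes inside it
(line,) = ax.plot(x, y, marker="o",         # a line plot with a marker at each point
                  label="y = x squared")
ax.set_title("Components of a graph")       # title
ax.set_xlabel("x (units)")                  # x-axis label
ax.set_ylabel("y (units)")                  # y-axis label
ax.set_xticks(x)                            # tick positions; tick labels show the values
ax.legend()                                 # legend, built from the label given to plot()
ax.spines["top"].set_visible(False)         # spines are the border lines of the plot area
ax.spines["right"].set_visible(False)
```

Hiding the top and right spines is only one possible styling choice; every component can be left at its default.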

Seaborn


Seaborn (https://seaborn.pydata.org/) is part of the PyData family of tools and is a visualization library based on Matplotlib with the goal of creating statistical graphs more easily. It can operate directly on DataFrames and series, doing aggregations and mapping internally. Seaborn uses color palettes and styles to make visualizations consistent and more informative. It also has functions that can calculate some statistics, such as regression, estimation, and errors. Some specialized plots, such as violin plots and multi-facet plots, are also easy to create with Seaborn.
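To make the point about operating directly on DataFrames concrete, here is a minimal sketch that passes a small tidy DataFrame to Seaborn's boxplot function. The column names group and value and the numbers are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up tidy data: one observation per row, one variable per column
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

fig, ax = plt.subplots()
# Seaborn splits the rows by "group" and computes the box statistics itself
sns.boxplot(data=df, x="group", y="value", ax=ax)
```

Seaborn also labels both axes with the column names here, so no separate call is needed for the axis labels.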

Which Tool Should Be Used?


Seaborn tries to make the creation of some common analysis graphs easier than using Matplotlib directly. Matplotlib can be considered more low-level than Seaborn, and although this makes it a bit more cumbersome and verbose, it gives analysts much more flexibility. Some graphs, which with Seaborn are created with one function call, would take several lines of code to achieve using Matplotlib.

There is no rule to determine whether an analyst should use only the pandas plotting interface, Matplotlib directly, or Seaborn. Analysts should keep in mind the visualization requirements and the level of configuration required to create the desired graph.

Pandas' plotting interface is easier to use but is more constrained and limited. Seaborn has several graph patterns ready to use, including common statistical graphs such as pair plots and boxplots, but requires that the data is formatted into a tidy format and is more opinionated on how the graphs should look. Matplotlib...

Types of Graphs


The first type of graph that we will present is the line graph or line chart. A line graph displays data as a series of interconnected points on two axes (x and y), usually Cartesian, commonly ordered by the x-axis. Line charts are useful for demonstrating trends in data, such as in time series.

A graph related to the line graph is the scatter plot. A scatter plot represents the data as points in Cartesian coordinates. Usually, two variables are shown in this graph, although more information can be conveyed if the data is color-coded or size-coded by category, for example. Scatter plots are useful for showing the relationship and possible correlation between variables.

Histograms are useful for representing the distribution of data. Unlike the two previous examples, histograms show only one variable, usually on the x-axis, while the y-axis shows the frequency of occurrence of the data. The process of creating a histogram is a bit more involved than the line...
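The three graph types described in this section can be sketched side by side with Matplotlib. The data below is randomly generated for illustration, and the choice of 20 bins for the histogram is an arbitrary assumption.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: a noisy upward trend and a sample from a normal distribution
rng = np.random.default_rng(0)
x = np.arange(50)
trend = x + rng.normal(scale=3, size=50)
values = rng.normal(size=500)

fig, (ax_line, ax_scatter, ax_hist) = plt.subplots(1, 3, figsize=(12, 3))
ax_line.plot(x, trend)           # line graph: points joined in x order
ax_scatter.scatter(x, trend)     # scatter plot: the same two variables as points
# histogram: one variable on the x-axis, frequency per bin on the y-axis
counts, bin_edges, _ = ax_hist.hist(values, bins=20)
```

The hist call returns the per-bin counts and the bin edges, which shows the extra binning step that a histogram performs before drawing.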

Pandas DataFrames and Grouped Data


As we learned in the previous chapter, when analyzing data with pandas, we can use the plot functions from pandas or use Matplotlib directly. Pandas uses Matplotlib under the hood, so the two integrate well. Depending on the situation, we can either plot directly from pandas or create a figure and an axes with Matplotlib and pass the axes to pandas to plot on. For example, when doing a GroupBy, we separate the data into groups by a GroupBy key. But how can we plot the results of a GroupBy? We have a few approaches at our disposal. We can, for example, use pandas directly, if the DataFrame is already in the right format:

Note

The following code is a sample and will not get executed.

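import matplotlib.pyplot as plt
import pandas as pd
# Plot the volume of three selected stocks, one line per stock, on the same axes
fig, ax = plt.subplots()
df = pd.read_csv('data/dow_jones_index.data')
df[df.stock.isin(['MSFT', 'GE', 'PG'])].groupby('stock')['volume'].plot(ax=ax)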

Or we can just plot each GroupBy key on the same plot:

fig, ax = plt.subplots()
df.groupby('stock').volume.plot(ax=ax)
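Because the sample above depends on a data file, the following self-contained sketch shows the same idea on a tiny made-up DataFrame. The stock and volume values are invented, and looping over the groups is just one possible way to draw each GroupBy key on shared axes with explicit labels.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the stock data; not the contents of the real file
df = pd.DataFrame({
    "stock": ["MSFT", "MSFT", "MSFT", "GE", "GE", "GE"],
    "volume": [10, 12, 11, 20, 18, 25],
})

fig, ax = plt.subplots()
# groupby yields (key, sub-DataFrame) pairs, sorted by key by default
for name, group in df.groupby("stock"):
    ax.plot(group["volume"].to_numpy(), label=name)  # one line per stock
ax.set_ylabel("volume")
ax.legend(title="stock")
```

Iterating explicitly is more verbose than calling plot on the GroupBy object, but it gives direct control over the label and style of each line.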

For the following...

Changing Plot Design: Modifying Graph Components


So far, we've looked at the main graphs used in analyzing data, either directly or grouped, for comparison and trend visualization. But one thing that we can see is that the design of each graph is different from the others, and we don't have basic things such as a title and legends.

We've learned that a graph is composed of several components, such as a graph title, x and y labels, and so on. When using Seaborn, the graphs already have x and y labels, with the names of the columns. With Matplotlib, we don't have this. These changes are not only cosmetic.

The understanding of a graph can be greatly improved when we adjust things such as line width, color, and point size, in addition to labels and titles. A graph must be able to stand on its own, so title, legends, and units are paramount. How can we apply the concepts that we described previously to make good, informative graphs with Matplotlib and Seaborn?

The possible number of ways that plots can...
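As one possible starting point, the sketch below adjusts line width, color, and marker size on a single Matplotlib line. The data and the specific style values are arbitrary choices made for the example, not recommendations from the text.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Made-up data; the keyword arguments control the look of the line and its points
(line,) = ax.plot(
    [1, 2, 3, 4], [2, 3, 5, 8],
    color="tab:red",   # line and marker color
    linewidth=2.5,     # line width in points
    linestyle="--",    # dashed line
    marker="s",        # square markers
    markersize=8,      # marker size in points
)
ax.grid(True, alpha=0.3)  # light grid to help read values
```

The same keyword arguments can usually be passed through the pandas plot method, since pandas forwards them to Matplotlib.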

Exporting Graphs


After generating our visualizations and configuring the details, we can export our graphs to an image file format, such as PNG, JPEG, or SVG. If we are using the interactive API in the notebook, we can just call the savefig function from the pyplot interface, and the current figure (the last generated graph) will be exported to the file:

df.plot(kind='scatter', x='weight', y='horsepower', figsize=(20,10))
plt.savefig('horsepower_weight_scatter.png')

Figure 2.26: Exporting the graphs

All plot configurations will be carried over to the exported file. To export a graph when using the object-oriented API, we can call savefig from the figure:

fig, ax = plt.subplots()
df.plot(kind='scatter', x='weight', y='horsepower', figsize=(20,10), ax=ax)
fig.savefig('horsepower_weight_scatter.jpg')

Figure 2.27: Saving the graph

We can change some parameters for the saved image:

  • dpi: Adjust the saved image resolution.

  • facecolor: The face color of the figure.

  • edgecolor: The edge color of the figure, around the graph.

  • format: Usually PNG...
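The parameters above can be combined in a single savefig call, as in the following sketch. It writes to an in-memory buffer instead of a file so that it leaves nothing on disk; the figure size, the dpi value, and the colors are arbitrary assumptions for the example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 3))  # size in inches
ax.plot([0, 1, 2], [0, 1, 4])           # made-up data

buffer = io.BytesIO()  # in-memory target; a filename string works the same way
fig.savefig(
    buffer,
    format="png",       # explicit format, needed here because there is no file extension
    dpi=100,            # resolution in dots per inch
    facecolor="white",  # background color of the figure
    edgecolor="black",  # color of the figure edge
)
png_bytes = buffer.getvalue()
```

With these settings the image is 400 by 300 pixels, since the pixel size is the figure size in inches multiplied by dpi.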

Summary


In this chapter, we have seen the importance of creating meaningful and interesting visualizations when analyzing data. A good data visualization can immensely help the analyst's job, representing data in a way that can reach larger audiences and explain concepts that could be hard to translate into words or to represent with tables.

A graph, to be effective as a data visualization tool, must show the data, avoid distortions, make understanding large datasets easy, and have a clear purpose, such as description or exploration. The main goal of a graph is to communicate data, so the analyst must keep that in mind when creating a graph. A useful graph is more desirable than a beautiful one.

We demonstrated some kinds of graphs commonly used in analysis: the line graph, the scatter plot, the histogram, and the boxplot. Each graph has its purpose and application, depending on the data and the goal. We have also shown how to create graphs directly from Matplotlib, from pandas, or a combination...
