2. Static Visualization – Global Patterns and Summary Statistics
Learning Objectives
By the end of this chapter, you will be able to:
- Explain various visualization techniques for different contexts
- Identify global patterns of one or more features in a dataset
- Create plots to represent global patterns in data: scatter plots, hexbin plots, contour plots, and heatmaps
- Create plots that present summary statistics of data: histograms (revisited), box plots, and violin plots
In this chapter, we'll explore different visualization techniques for presenting global patterns and summary statistics of data.
Introduction
In the previous chapter, we learned how to handle pandas DataFrames as inputs for data visualization, how to plot with pandas and seaborn, and how to refine plots to increase their aesthetic appeal. The intent of this chapter is to build practical knowledge of the strengths and limitations of various visualization techniques. We'll practice creating plots for a variety of contexts. The range of existing plot types and visualization techniques is huge, however, and choosing an appropriate visualization can be confusing. Sometimes a plot shows too much information for the reader to grasp, or too little for the reader to build the necessary intuition about the data. Sometimes a visualization is too esoteric for the reader to appreciate properly, and at other times an oversimplified visualization just doesn't have the right impact. All these scenarios can be avoided by being armed with practical knowledge...
Creating Plots that Present Global Patterns in Data
In this section, we will study the context of plots that present global patterns in data, such as:
- Plots that show the variation in individual features in data, such as histograms
- Plots that show how different features present in data vary with respect to each other, such as scatter plots, line plots, and heatmaps
Most data scientists prefer to see such plots because they give an idea of the entire spectrum of values taken by the features of interest. Plots depicting global patterns are also useful because they make it easier to spot anomalies in data.
We will work with a dataset called mpg. It was published by the StatLib library, maintained at Carnegie Mellon University, and is available in the seaborn library. It was originally used to study the relationship of mileage – Miles Per Gallon (MPG) – with other features in the dataset; hence the name mpg. Since the dataset contains 3 discrete features...
Creating Plots That Present Summary Statistics of Your Data
Let's now switch to our next section. When datasets are huge, it is sometimes useful to look at the summary statistics of a range of different features to get a preliminary idea of the dataset. For example, the summary statistics for any numerical feature include measures of central tendency, such as the mean, and measures of dispersion, such as the standard deviation.
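As a minimal sketch of these two kinds of summary statistics, the snippet below computes the mean and the standard deviation of a single numerical column with pandas. The column name and the six values are made up for illustration and are not taken from the mpg dataset.

```python
import pandas as pd

# Hypothetical mileage values, invented for this illustration only.
df = pd.DataFrame({"mileage": [18.0, 15.0, 22.0, 30.0, 25.0, 22.0]})

# Measure of central tendency: the arithmetic mean.
mean_mileage = df["mileage"].mean()

# Measure of dispersion: the sample standard deviation
# (pandas divides by n - 1 by default).
std_mileage = df["mileage"].std()

# describe() bundles count, mean, std, min, quartiles, and max in one call.
summary = df["mileage"].describe()

print(f"mean = {mean_mileage:.2f}, std = {std_mileage:.2f}")
print(summary)
```

The same describe() call works on a whole DataFrame, which is a quick way to get a first impression of every numerical feature at once.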
When a dataset is too small, plots presenting summary statistics may actually be misleading, because summary statistics are meaningful only when the dataset is large enough to support statistical conclusions. For example, if somebody reports the variance of a feature using five data points, we cannot draw any concrete conclusions regarding the dispersion of the feature.
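The five-data-point caveat can be checked with a small simulation. The sketch below uses only the Python standard library and assumes, purely for illustration, that the feature follows a standard normal distribution (true variance 1). It repeatedly draws samples of size 5 and size 200 and compares how much the estimated variance fluctuates from one sample to the next.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable


def variance_estimates(sample_size, repeats=200):
    """Return sample variances from repeated draws of a standard normal."""
    return [
        statistics.variance([rng.gauss(0, 1) for _ in range(sample_size)])
        for _ in range(repeats)
    ]


small = variance_estimates(5)    # variance estimated from 5 points each time
large = variance_estimates(200)  # variance estimated from 200 points each time

# How much the variance estimate itself moves around between samples.
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)

print(f"n=5:   estimates range from {min(small):.2f} to {max(small):.2f}")
print(f"n=200: estimates range from {min(large):.2f} to {max(large):.2f}")
print(f"spread of estimates: n=5 -> {spread_small:.2f}, n=200 -> {spread_large:.2f}")
```

With five points, the estimates scatter widely around the true value, so a single reported variance says little about the real dispersion of the feature.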
Histogram Revisited
Let's revisit histograms from Chapter 1, Introduction to Visualization with Python – Basic and Customized Plotting. Although histograms show...
Summary
In this chapter, we learned how choosing the most appropriate visualization(s) depends on four key elements:
- The nature of the features in a dataset: categorical/discrete or numerical/continuous
- The size of the dataset: small/medium/large
- The density of the data points in the chosen feature space: whether too many or too few data points fall on certain feature values
- The context of the visualization: the source of the dataset and frequently used visualizations for the given application
For the purpose of explaining the concepts clearly and defining certain general guidelines, we classified visualizations into two categories:
- Plots representing the global patterns of the chosen features (for example, histograms, scatter plots, hexbin plots, contour plots, line plots, and heatmaps)
- Plots representing the summary statistics of the specific features (box plots and violin plots)
We are not implying that a single best visualization...