Python Real-World Projects

You're reading from Python Real-World Projects

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781803246765
Pages: 478
Edition: 1st
Author: Steven F. Lott

Table of Contents (20 chapters)

Preface
Chapter 1: Project Zero: A Template for Other Projects
Chapter 2: Overview of the Projects
Chapter 3: Project 1.1: Data Acquisition Base Application
Chapter 4: Data Acquisition Features: Web APIs and Scraping
Chapter 5: Data Acquisition Features: SQL Database
Chapter 6: Project 2.1: Data Inspection Notebook
Chapter 7: Data Inspection Features
Chapter 8: Project 2.5: Schema and Metadata
Chapter 9: Project 3.1: Data Cleaning Base Application
Chapter 10: Data Cleaning Features
Chapter 11: Project 3.7: Interim Data Persistence
Chapter 12: Project 3.8: Integrated Data Acquisition Web Service
Chapter 13: Project 4.1: Visual Analysis Techniques
Chapter 14: Project 4.2: Creating Reports
Chapter 15: Project 5.1: Modeling Base Application
Chapter 16: Project 5.2: Simple Multivariate Statistics
Chapter 17: Next Steps
Other Books You Might Enjoy
Index

Chapter 13
Project 4.1: Visual Analysis Techniques

When doing exploratory data analysis (EDA), one common practice is to use graphical techniques to help understand the nature of data distribution. The US National Institute of Standards and Technology (NIST) has an Engineering Statistics Handbook that strongly emphasizes the need for graphic techniques. See https://doi.org/10.18434/M32189.

This chapter will create some additional Jupyter notebooks to present a few techniques for displaying univariate and multivariate distributions.

In this chapter, we’ll focus on some important skills for creating diagrams for the cleaned data:

  • Additional Jupyter Notebook techniques

  • Using PyPlot to present data

  • Unit testing for Jupyter Notebook functions

This chapter has one project: building the start of a more complete analysis notebook. A notebook can be saved and exported as a PDF file, allowing an analyst to share preliminary results for early conversations. In the next chapter, we...

13.1 Description

In the previous chapters, the sequence of projects created a pipeline to acquire and then clean the raw data. The intent is to build automated data gathering as Python applications.

We noted that ad hoc data inspection is best done with a notebook, not an automated CLI tool. Similarly, creating command-line applications for analysis and presentation can be challenging. Analytical work seems to be essentially exploratory, making it helpful to have immediate feedback from looking at results.

Additionally, analytical work transforms raw data into information, and possibly even insight. Analytical results need to be shared to create significant value. A Jupyter notebook is an exploratory environment that can create readable, helpful presentations.

One of the first things to do with raw data is to create diagrams to illustrate the distribution of univariate data and the relationships among variables in multivariate data. We’ll emphasize the following common kinds of...
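A minimal sketch of the two kinds of diagrams named above, using matplotlib's pyplot. The data values are the x and y of Anscombe's first series; the non-interactive Agg backend is selected so the sketch also runs outside JupyterLab, where the notebook machinery would normally pick a backend.

```python
# Sketch: a histogram for a univariate distribution and a scatter plot
# for a multivariate relationship, side by side in one figure.
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; JupyterLab chooses its own
import matplotlib.pyplot as plt

# Anscombe's quartet, series I (published values).
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.hist(y, bins=5)                 # univariate: distribution of y
ax1.set_title("Distribution of y")
ax2.scatter(x, y)                   # multivariate: y against x
ax2.set_title("y vs. x")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")   # in a notebook, the figure displays inline
```

In a notebook cell, the `fig` object renders inline automatically; the `savefig()` call here stands in for that display step.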

13.2 Overall approach

We’ll take some guidance from the C4 model ( https://c4model.com) when looking at our approach:

  • Context: For this project, the context diagram has two use cases: the acquire-to-clean process and this analysis notebook.

  • Containers: There’s one container for this analysis application: the user’s personal computer.

  • Components: The software components include the existing analysis models that provide handy definitions for the Python objects.

  • Code: The code is scattered in two places: supporting modules as well as the notebook itself.

A context diagram for this application is shown in Figure 13.1.

Figure 13.1: Context diagram

The analyst will often need to share their analytical results with stakeholders. An initial notebook might provide confirmation that some data does not conform to the null hypothesis, suggesting an interesting relationship that deserves deeper exploration. This could be part of justifying a budget allocation to do more...

13.3 Deliverables

This project has the following deliverables:

  • A requirements-dev.txt file that identifies the tools used, usually jupyterlab==3.5.3 and matplotlib==3.7.0.

  • Documentation in the docs folder.

  • Unit tests for any new application modules in the tests folder.

  • Any new application modules in the src folder with code to be used by the inspection notebook.

  • A notebook to summarize the clean data. In the case of Anscombe’s quartet, it’s essential to show the means and variances are nearly identical, but the scatter plots are dramatically different.
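The Anscombe property mentioned in the last deliverable can be checked with the standard library alone. The values below are the published y-values of series I and II of the quartet:

```python
# Means and variances of two Anscombe series are nearly identical,
# even though their scatter plots look dramatically different.
import statistics

y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for label, y in (("I", y1), ("II", y2)):
    print(label, round(statistics.mean(y), 2), round(statistics.variance(y), 3))
```

Both series have a mean of about 7.50 and a sample variance of about 4.13, which is exactly why the notebook must also show the scatter plots.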

We’ll look at a few of these deliverables in a little more detail.

13.3.1 Unit test

There are two distinct kinds of modules that can require testing:

  • The notebook with any function or class definitions. All of these definitions require unit tests.

  • If functions are factored from the notebook into a supporting module, this module will need unit tests. Many previous projects have emphasized these tests.
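As a sketch of the second case, here is a hypothetical function factored out of the notebook into a supporting module, with a pytest-style unit test beside it. The function name and behavior are illustrative, not taken from the book's projects:

```python
# A small function that might be factored from a notebook cell into
# a module under src/, so it can be imported and unit tested.
def value_range(values: list[float]) -> tuple[float, float]:
    """Return (min, max) of a non-empty sequence of samples."""
    return (min(values), max(values))


# A pytest-style test that would live in the tests/ folder.
def test_value_range() -> None:
    assert value_range([3.0, 1.0, 2.0]) == (1.0, 3.0)
```

Because the function lives in an importable module rather than only in a notebook cell, the test runner can exercise it without executing the notebook.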

A notebook...

13.4 Summary

This project begins the deeper analysis work on clean data. It emphasizes several key skills, including:

  • More advanced Jupyter Notebook techniques. This includes setting the PYTHONPATH to import modules and creating figures with plots to visualize data.

  • Using PyPlot to present data. The project uses popular types of visualizations: histograms and scatter plots.

  • Unit testing for Jupyter Notebook functions.
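The path-setting detail from the first bullet is often handled in the notebook's first cell. A common sketch, assuming the notebook lives in a directory beside the project's src folder (the layout here is an assumption, not the book's stated structure):

```python
# First notebook cell: make the project's src/ modules importable
# without installing the package. Adjust the relative path to match
# the actual project layout.
import sys
from pathlib import Path

src = Path.cwd().parent / "src"
if str(src) not in sys.path:
    sys.path.insert(0, str(src))
```

Setting the PYTHONPATH environment variable before launching JupyterLab achieves the same effect without a per-notebook cell.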

In the next chapter, we’ll formalize the notebook into a presentation “slide deck” that can be shown to a group of stakeholders.

13.5 Extras

Here are some ideas for the reader to add to these projects.

13.5.1 Use Seaborn for plotting

An alternative to the pyplot package is the Seaborn package. This package also provides statistical plotting functions. It provides a wider variety of styling options, permitting more colorful (and perhaps more informative) plots.

See https://seaborn.pydata.org for more information.

This module is based on matplotlib, making it compatible with JupyterLab.

Note that the Seaborn package can work directly with a list-of-dictionary structure. This matches the ND JSON format used for acquiring and cleaning the data.
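A sketch of that match, using only the standard library to parse ND JSON (one JSON document per line) into a list of dictionaries; the field names and values are invented for illustration:

```python
# Parse ND JSON text into the list-of-dictionary structure that
# Seaborn's plotting functions can consume as long-form data.
import io
import json

ndjson = io.StringIO('{"x": 10.0, "y": 8.04}\n{"x": 8.0, "y": 6.95}\n')
data = [json.loads(line) for line in ndjson if line.strip()]

# With seaborn installed, this structure can be plotted directly, e.g.:
#   import seaborn as sns
#   sns.scatterplot(data=data, x="x", y="y")
```

In practice the `io.StringIO` stand-in would be an open file over the clean data produced by the earlier projects.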

Using a list-of-dictionary type suggests it might be better to avoid the analysis model structure, and stick with dictionaries created by the clean application. Doing this might sacrifice some model-specific processing and validation functionality.

On the other hand, the pydantic package offers a built-in dict() method that converts a sophisticated analysis model object into...
