Python Real-World Projects

You're reading from Python Real-World Projects

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781803246765
Pages: 478
Edition: 1st
Author: Steven F. Lott

Table of Contents (20 chapters)

Preface
Chapter 1: Project Zero: A Template for Other Projects
Chapter 2: Overview of the Projects
Chapter 3: Project 1.1: Data Acquisition Base Application
Chapter 4: Data Acquisition Features: Web APIs and Scraping
Chapter 5: Data Acquisition Features: SQL Database
Chapter 6: Project 2.1: Data Inspection Notebook
Chapter 7: Data Inspection Features
Chapter 8: Project 2.5: Schema and Metadata
Chapter 9: Project 3.1: Data Cleaning Base Application
Chapter 10: Data Cleaning Features
Chapter 11: Project 3.7: Interim Data Persistence
Chapter 12: Project 3.8: Integrated Data Acquisition Web Service
Chapter 13: Project 4.1: Visual Analysis Techniques
Chapter 14: Project 4.2: Creating Reports
Chapter 15: Project 5.1: Modeling Base Application
Chapter 16: Project 5.2: Simple Multivariate Statistics
Chapter 17: Next Steps
Other Books You Might Enjoy
Index

Chapter 13
Project 4.1: Visual Analysis Techniques

When doing exploratory data analysis (EDA), one common practice is to use graphical techniques to help understand the nature of data distribution. The US National Institute of Standards and Technology (NIST) has an Engineering Statistics Handbook that strongly emphasizes the need for graphic techniques. See https://doi.org/10.18434/M32189.

This chapter will create some additional Jupyter notebooks to present a few techniques for displaying univariate and multivariate distributions.

In this chapter, we’ll focus on some important skills for creating diagrams for the cleaned data:

  • Additional Jupyter Notebook techniques

  • Using PyPlot to present data

  • Unit testing for Jupyter Notebook functions

This chapter has one project: building the start of a more complete analysis notebook. A notebook can be saved and exported as a PDF file, allowing an analyst to share preliminary results for early conversations. In the next chapter, we...

13.1 Description

In the previous chapters, the sequence of projects created a pipeline to acquire and then clean the raw data. The intent is to build automated data gathering as Python applications.

We noted that ad hoc data inspection is best done with a notebook, not an automated CLI tool. Similarly, creating command-line applications for analysis and presentation can be challenging. Analytical work seems to be essentially exploratory, making it helpful to have immediate feedback from looking at results.

Additionally, analytical work transforms raw data into information, and possibly even insight. Analytical results need to be shared to create significant value. A Jupyter notebook is an exploratory environment that can create readable, helpful presentations.

One of the first things to do with raw data is to create diagrams to illustrate the distribution of univariate data and the relationships among variables in multivariate data. We’ll emphasize the following common kinds of...
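A minimal sketch of the two kinds of diagrams named above, using matplotlib's pyplot. The data values are the x and y of Anscombe's first series; the non-interactive Agg backend is selected so the sketch also runs outside JupyterLab, where the notebook machinery would normally pick a backend.

```python
# Sketch: a histogram for a univariate distribution and a scatter plot
# for a multivariate relationship, side by side in one figure.
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; JupyterLab chooses its own
import matplotlib.pyplot as plt

# Anscombe's quartet, series I (published values).
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.hist(y, bins=5)                 # univariate: distribution of y
ax1.set_title("Distribution of y")
ax2.scatter(x, y)                   # multivariate: y against x
ax2.set_title("y vs. x")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")   # in a notebook, the figure displays inline
```

In a notebook cell, the `fig` object renders inline automatically; the `savefig()` call here stands in for that display step.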

13.2 Overall approach

We’ll take some guidance from the C4 model ( https://c4model.com) when looking at our approach:

  • Context: For this project, the context diagram has two use cases: the acquire-to-clean process and this analysis notebook.

  • Containers: There’s one container for this analysis application: the user’s personal computer.

  • Components: The software components include the existing analysis models that provide handy definitions for the Python objects.

  • Code: The code is scattered in two places: supporting modules as well as the notebook itself.

A context diagram for this application is shown in Figure 13.1.

Figure 13.1: Context diagram

The analyst will often need to share their analytical results with stakeholders. An initial notebook might provide confirmation that some data does not conform to the null hypothesis, suggesting an interesting relationship that deserves deeper exploration. This could be part of justifying a budget allocation to do more...

13.3 Deliverables

This project has the following deliverables:

  • A requirements-dev.txt file that identifies the tools used, usually jupyterlab==3.5.3 and matplotlib==3.7.0.

  • Documentation in the docs folder.

  • Unit tests for any new application modules in the tests folder.

  • Any new application modules in the src folder with code to be used by the inspection notebook.

  • A notebook to summarize the clean data. In the case of Anscombe’s quartet, it’s essential to show the means and variances are nearly identical, but the scatter plots are dramatically different.
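The Anscombe property mentioned in the last deliverable can be checked with the standard library alone. The values below are the published y-values of series I and II of the quartet:

```python
# Means and variances of two Anscombe series are nearly identical,
# even though their scatter plots look dramatically different.
import statistics

y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

for label, y in (("I", y1), ("II", y2)):
    print(label, round(statistics.mean(y), 2), round(statistics.variance(y), 3))
```

Both series have a mean of about 7.50 and a sample variance of about 4.13, which is exactly why the notebook must also show the scatter plots.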

We’ll look at a few of these deliverables in a little more detail.

13.3.1 Unit test

There are two distinct kinds of modules that can require testing:

  • The notebook with any function or class definitions. All of these definitions require unit tests.

  • If functions are factored from the notebook into a supporting module, this module will need unit tests. Many previous projects have emphasized these tests.
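As a sketch of the second case, here is a hypothetical function factored out of the notebook into a supporting module, with a pytest-style unit test beside it. The function name and behavior are illustrative, not taken from the book's projects:

```python
# A small function that might be factored from a notebook cell into
# a module under src/, so it can be imported and unit tested.
def value_range(values: list[float]) -> tuple[float, float]:
    """Return (min, max) of a non-empty sequence of samples."""
    return (min(values), max(values))


# A pytest-style test that would live in the tests/ folder.
def test_value_range() -> None:
    assert value_range([3.0, 1.0, 2.0]) == (1.0, 3.0)
```

Because the function lives in an importable module rather than only in a notebook cell, the test runner can exercise it without executing the notebook.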

A notebook...

13.4 Summary

This project begins the deeper analysis work on clean data. It emphasizes several key skills, including:

  • More advanced Jupyter Notebook techniques. This includes setting the PYTHONPATH to import modules and creating figures with plots to visualize data.

  • Using PyPlot to present data. The project uses popular types of visualizations: histograms and scatter plots.

  • Unit testing for Jupyter Notebook functions.
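The path-setting detail from the first bullet is often handled in the notebook's first cell. A common sketch, assuming the notebook lives in a directory beside the project's src folder (the layout here is an assumption, not the book's stated structure):

```python
# First notebook cell: make the project's src/ modules importable
# without installing the package. Adjust the relative path to match
# the actual project layout.
import sys
from pathlib import Path

src = Path.cwd().parent / "src"
if str(src) not in sys.path:
    sys.path.insert(0, str(src))
```

Setting the PYTHONPATH environment variable before launching JupyterLab achieves the same effect without a per-notebook cell.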

In the next chapter, we’ll formalize the notebook into a presentation “slide deck” that can be shown to a group of stakeholders.

13.5 Extras

Here are some ideas for the reader to add to these projects.

13.5.1 Use Seaborn for plotting

An alternative to the pyplot package is the Seaborn package. This package also provides statistical plotting functions. It provides a wider variety of styling options, permitting more colorful (and perhaps more informative) plots.

See https://seaborn.pydata.org for more information.

This module is based on matplotlib, making it compatible with JupyterLab.

Note that the Seaborn package can work directly with a list-of-dictionary structure. This matches the ND JSON format used for acquiring and cleaning the data.
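A sketch of that match, using only the standard library to parse ND JSON (one JSON document per line) into a list of dictionaries; the field names and values are invented for illustration:

```python
# Parse ND JSON text into the list-of-dictionary structure that
# Seaborn's plotting functions can consume as long-form data.
import io
import json

ndjson = io.StringIO('{"x": 10.0, "y": 8.04}\n{"x": 8.0, "y": 6.95}\n')
data = [json.loads(line) for line in ndjson if line.strip()]

# With seaborn installed, this structure can be plotted directly, e.g.:
#   import seaborn as sns
#   sns.scatterplot(data=data, x="x", y="y")
```

In practice the `io.StringIO` stand-in would be an open file over the clean data produced by the earlier projects.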

Using a list-of-dictionary type suggests it might be better to avoid the analysis model structure, and stick with dictionaries created by the clean application. Doing this might sacrifice some model-specific processing and validation functionality.

On the other hand, the pydantic package offers a built-in dict() method that converts a sophisticated analysis model object into...
