
You're reading from Python Real-World Projects

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781803246765
Edition: 1st Edition
Author: Steven F. Lott

Steven Lott has been programming since computers were large, expensive, and rare. Working for decades in high tech has given him exposure to a lot of ideas and techniques, some bad, but most are helpful to others. Since the 1990s, Steven has been engaged with Python, crafting an array of indispensable tools and applications. His profound expertise has led him to contribute significantly to Packt Publishing, penning notable titles like "Mastering Object-Oriented," "The Modern Python Cookbook," and "Functional Python Programming." A self-proclaimed technomad, Steven's unconventional lifestyle sees him residing on a boat, often anchored along the vibrant east coast of the US. He tries to live by the words “Don't come home until you have a story.”

Chapter 6
Project 2.1: Data Inspection Notebook

We often need to do an ad hoc inspection of source data. In particular, the very first time we acquire new data, we need to see the file to be sure it meets expectations. Additionally, debugging and problem-solving also benefit from ad hoc data inspections. This chapter will guide you through using a Jupyter notebook to survey data and find the structure and domains of the attributes.

The previous chapters have focused on a simple dataset where the data types look like obvious floating-point values. For such a trivial dataset, the inspection isn’t going to be very complicated.

It can help to start with a trivial dataset and focus on the tools and how they work together. For this reason, we’ll continue using relatively small datasets to let you learn about the tools without having the burden of also trying to understand the data.

This chapter’s projects cover how to create and use a Jupyter notebook for data inspection...

6.1 Description

When confronted with raw data acquired from a source application, database, or web API, it’s prudent to inspect the data to be sure it really can be used for the desired analysis. It’s common to find that data doesn’t precisely match the given descriptions. It’s also possible to discover that the metadata is out of date or incomplete.
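One way to check whether data matches its description is to tally what kind of value shows up in each attribute. The following is a minimal sketch, not the book's code: the `SAMPLE` text, the column names `x` and `y`, and the `classify()` helper are all hypothetical, chosen only to illustrate the idea with the standard library.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for an acquired CSV file.
# One value is missing and one is a non-numeric code.
SAMPLE = """x,y
10.0,8.04
8.0,6.95
13.0,
N/A,7.58
"""


def classify(text: str) -> str:
    """Label a raw string as 'empty', 'float', or 'text'."""
    if text == "":
        return "empty"
    try:
        float(text)
    except ValueError:
        return "text"
    return "float"


# Read every row as a dictionary of raw strings.
with io.StringIO(SAMPLE) as source:
    rows = list(csv.DictReader(source))

# Count the kinds of values found in each column.
kinds_by_column = {
    column: Counter(classify(row[column]) for row in rows)
    for column in ("x", "y")
}

for column, kinds in kinds_by_column.items():
    print(column, dict(kinds))
```

A column described as "floating-point" that reports any `text` or `empty` values is a sign that the description and the data disagree, and deserves a closer look.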

The foundational principle behind this project is the following:

We don’t always know what the actual data looks like.

Data may have errors because source applications have bugs. There could be “undocumented features,” which are similar to bugs but have better explanations. Users may have taken actions that introduced new codes or status flags. For example, an application may have a “comments” field on an accounts-payable record, and accounting clerks may have invented their own set of coded values, which they put in the last few characters of this field. This...

6.2 Approach

We’ll take some guidance from the C4 model (https://c4model.com) when looking at our approach.

  • Context: For this project, the context diagram has two use cases: acquire and inspect

  • Containers: There’s one container for the various applications: the user’s personal computer

  • Components: There are two significantly different collections of software components: the acquisition program and inspection notebooks

  • Code: We’ll touch on this to provide some suggested directions

A context diagram for this application is shown in Figure 6.1.

Figure 6.1: Context Diagram

The data analyst will use the CLI to run the data acquisition program. Then, the analyst will use the CLI to start a Jupyter Lab server. Using a browser, the analyst can then use Jupyter Lab to inspect the data.

The components fall into two overall categories. The component diagram is shown in Figure 6.2.

Figure 6.2: Component diagram

The diagram shows the interfaces...

6.3 Deliverables

This project has the following deliverables:

  • A pyproject.toml file that identifies the tools used. For this book, we used jupyterlab==3.5.3. Note that while the book was being prepared for publication, version 4.0 was released. This ongoing evolution of components makes it important for you to find the latest version, not the version quoted here.

  • Documentation in the docs folder.

  • Unit tests for any new application modules in the tests folder.

  • Any new application modules in the src folder with code to be used by the inspection notebook.

  • A notebook to inspect the raw data acquired from any of the sources.

The project directory structure suggested in Chapter 1, Project Zero: A Template for Other Projects, mentions a notebooks directory. See List of deliverables for more information. Previous chapters haven’t used any notebooks, so this directory might not have been created in the first place. For this project, the notebooks directory is needed.

Let’...

6.4 Summary

This chapter’s project covered the basics of creating and using a Jupyter Lab notebook for data inspection. This permits tremendous flexibility, something often required when looking at new data for the first time.

We also looked at adding doctest examples to functions and running the doctest tool in the last cell of a notebook. This lets us validate that the code in the notebook is very likely to work properly.
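As a reminder of what this can look like, here is a minimal sketch of a function with doctest examples, followed by the kind of statement that could sit in the last cell of a notebook. The `to_float()` function is a hypothetical example, not code taken from the chapter.

```python
import doctest
from typing import Optional


def to_float(text: str) -> Optional[float]:
    """
    Convert source text to a float, or None when the text is not numeric.

    >>> to_float("10.0")
    10.0
    >>> to_float(" 8.04 ")
    8.04
    >>> to_float("N/A") is None
    True
    """
    try:
        return float(text)
    except ValueError:
        return None


# Last cell of the notebook: run the doctest examples found in the
# notebook's namespace and report how many failed.
results = doctest.testmod(verbose=False)
print(results)
```

When no examples fail, the reported failure count is zero; any failure is printed with the expected and actual values, which makes this a quick check after restarting the kernel and running all cells.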

Now that we’ve got an initial inspection notebook, we can start to consider the specific kinds of data being acquired. In the next chapter, we’ll add features to this notebook.

6.5 Extras

Here are some ideas for you to add to this project.

6.5.1 Use pandas to examine data

A common tool for interactive data exploration is the pandas package.

See https://pandas.pydata.org for more information.

Also, see https://www.packtpub.com/product/learning-pandas/9781783985128 for resources for learning more about pandas.

The value of using pandas for examining text may be limited. The real value of pandas is for doing more sophisticated statistical and graphical analysis of the data.

We encourage you to load NDJSON documents using pandas and do some preliminary investigation of the data values.
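A possible starting point is sketched below. It assumes pandas is installed, and it uses a small in-memory NDJSON sample with hypothetical `x` and `y` attributes instead of a real acquired file; in practice, a file path can be passed to `pd.read_json()` in place of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical NDJSON sample: one JSON document per line.
NDJSON_TEXT = """\
{"x": 10.0, "y": 8.04}
{"x": 8.0, "y": 6.95}
{"x": 13.0, "y": 7.58}
"""

# lines=True tells pandas to treat each line as a separate JSON document.
df = pd.read_json(io.StringIO(NDJSON_TEXT), lines=True)

# Preliminary investigation: column data types and summary statistics.
print(df.dtypes)
print(df.describe())
```

Note that pandas infers a data type for each column while loading, so the raw text is no longer visible. This is one reason the value of pandas for examining source text may be limited.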
