You're reading from Python Real-World Projects

Product type: Book

Published in: Sep 2023

Publisher: Packt

ISBN-13: 9781803246765

Edition: 1st Edition

Concepts

Programming Language

Author (1)

Steven F. Lott

Chapter 7
Data Inspection Features

There are three broad kinds of data domains: cardinal, ordinal, and nominal. The first project in this chapter will guide you through the inspection of cardinal data; values like weights, measures, and durations where the data is continuous, as well as counts where the data is discrete. The second project will guide you through the inspection of ordinal data involving things like dates, where order matters, but the data isn’t a proper measurement; it’s more of a code or designator. Nominal data is a code that happens to use digits but doesn’t represent numeric values. The third project will cover the more complex case of matching keys between separate data sources.

An inspection notebook is required when looking at new data. It’s a great place to keep notes and lessons learned. It’s helpful when diagnosing problems that arise in a more mature analysis pipeline.

This chapter will cover a number of skills...

7.1.1 Description

This project’s intent is to inspect raw data to understand if it is actually cardinal data. In some cases, floating-point values may have been used to represent nominal data; the data appears to be a measurement but is actually a code.

Spreadsheet software tends to transform all data into floating-point numbers; many data items may look like cardinal data.

One example is US Postal Codes, which are strings of digits, but may be transformed into numeric values by a spreadsheet.
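The effect on postal codes can be sketched in a few lines. The value below is a hypothetical ZIP code chosen because it has a leading zero; the point is that the original text can't be recovered from the number without outside knowledge of the expected width.

```python
# Hypothetical postal code with a leading zero, as it appears in the source text.
raw_zip = "02134"

# What a spreadsheet-style numeric conversion does to it.
as_number = float(raw_zip)

# Turning the number back into text loses the leading zero.
round_trip = str(int(as_number))
print(repr(raw_zip), as_number, repr(round_trip))

# Restoring it requires knowing that US ZIP codes are five digits wide.
restored = f"{int(as_number):05d}"
print(repr(restored))
```

This is one reason to look at the text values first: a string comparison between `raw_zip` and `round_trip` exposes the damage immediately.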

Another example is bank account numbers, which — while very long — can be converted into floating-point numbers. A floating-point value uses 8 bytes of storage, but will comfortably represent about 15 decimal digits. While this is a net saving in storage, it is a potential confusion of data types and there is a (small) possibility of having an account number altered by floating-point truncation rules.
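As a rough illustration of the truncation risk, the following sketch pushes a made-up 17-digit account number through `float()` and back. The account number is hypothetical; the behavior shown is that of Python's standard 8-byte float.

```python
# Hypothetical 17-digit account number; not a real account.
account_text = "12345678901234567"

as_float = float(account_text)
round_trip = f"{as_float:.0f}"

# The float can't hold all 17 digits, so the low-order digits change.
print(account_text)
print(round_trip)
print(round_trip == account_text)

# Integers up to 2**53 are exact; just beyond that, neighbors collapse together.
print(float(2**53) == float(2**53 + 1))
```

Account numbers of 15 digits or fewer survive this round trip, which is why the problem can go unnoticed until a longer value shows up.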

The user experience is a Jupyter Lab notebook that can be used to examine...

7.1.2 Approach

This project is based on the initial inspection notebook from Chapter 6, Project 2.1: Data Inspection Notebook. Some of the essential cell content will be reused in this notebook. We’ll add new components to those shown in the earlier chapter, building specifically on the samples_iter() function, which iterates over samples in an open file. This feature will be central to working with the raw data.

In the previous chapter, we suggested avoiding conversion functions. When starting down the path of inspecting data, it’s best to assume nothing and look at the text values first.

There are some common patterns in the source data values:

The values appear to be all numeric values. The int() or float() function works on all of the values. There are two sub-cases here:
- All of the values seem to be proper counts or measures in some expected range. This is ideal.
- A few “outlier” values are present. These are values that seem to be outside the expected...
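One way to see which pattern a column follows is to tally how each raw text value responds to `int()` and `float()`. This is a minimal sketch, not the book's code: `conversion_kind()` and the sample values are hypothetical names and data introduced for illustration.

```python
from collections import Counter


def conversion_kind(text: str) -> str:
    """Return 'int', 'float', or 'non-numeric' for one raw text value."""
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "non-numeric"


# Hypothetical raw column values, kept as text.
raw_values = ["10", "8.04", "9", "N/A", "13", ""]

kinds = Counter(conversion_kind(v) for v in raw_values)
print(kinds)
```

Note that `float()` also accepts strings like `"nan"` and `"inf"`, so a column that converts cleanly may still deserve a closer look at the text.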

7.1.3 Deliverables

This project has the following deliverables:

- A requirements-dev.txt file that identifies the tools used, usually jupyterlab==3.5.3.
- Documentation in the docs folder.
- Unit tests for any new changes to the modules in use.
- Any new application modules with code to be used by the inspection notebook.
- A notebook to inspect the attributes that appear to have cardinal data.

This project will require a notebooks directory. See List of deliverables for some more information on this structure.

We’ll look at a few of these deliverables in a little more detail.

Inspection module

You are encouraged to refactor functions like samples_iter(), non_numeric(), and numeric_filter() into a separate module. Additionally, the AttrSummary class and the closely related summary_iter() function are also good candidates for being moved to a separate module with useful inspection classes and functions.

Notebooks can be refactored to import these classes and functions from a separate...

7.2.1 Description

In the previous project (Project 2.2: Validating cardinal domains — measures, counts, and durations), we looked at attributes that contained cardinal data – measures and counts. We also need to look at ordinal and nominal data. Ordinal data is generally used to provide ranks and ordering. Nominal data is best thought of as codes made up of strings of digits. Values like US postal codes and bank account numbers are nominal data.

When we look at the CO2 PPM — Trends in Atmospheric Carbon Dioxide data set, available at https://datahub.io/core/co2-ppm, it has dates that are provided in two forms: as a year-month-day string and as a decimal number. The decimal number positions the first day of the month within the year as a whole.

It’s instructive to use ordinal day numbers to compute unique values for each date and compare these with the supplied “Decimal Date” value. An integer day number may be more useful than the decimal date...
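A minimal sketch of the comparison, using only the standard library. The `year_fraction()` helper is hypothetical, and its formula is an assumption: it divides the days elapsed since January 1 by the number of days in the year. The data set's own decimal values may be computed differently, which is exactly what the inspection is meant to reveal.

```python
from datetime import date


def year_fraction(d: date) -> float:
    """Position of a date within its year, as year + fraction.

    Assumption: fraction = (days since January 1) / (days in the year).
    The data set's "Decimal Date" may use a different rule.
    """
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    return d.year + (d - start).days / (end - start).days


sample = date.fromisoformat("1958-03-01")  # hypothetical row value

# Integer day number, where 0001-01-01 is day 1.
print(sample.toordinal())

# Computed year fraction, to compare against the supplied decimal value.
print(round(year_fraction(sample), 3))
```

The ordinal number is unique per date and reversible with `date.fromordinal()`, which makes it convenient for checking ordering and gaps between rows.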

7.2.2 Approach

Dates and times often have bewildering formats. This is particularly true in the US, where dates are often written as numbers in month/day/year format. Using year/month/day puts the values in order of significance. Using day/month/year is the reverse order of significance. The US ordering is simply strange.

This makes it difficult to do inspections on completely unknown data without any metadata to explain the serialization format. A date like 01/02/03 could mean almost anything.
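To make the ambiguity concrete, this sketch parses the same text with three plausible layouts using `datetime.strptime()`. The format labels are illustrative; the format codes are the standard ones.

```python
from datetime import datetime

ambiguous = "01/02/03"

# Three plausible layouts for the same text.
candidate_formats = {
    "month/day/year": "%m/%d/%y",
    "day/month/year": "%d/%m/%y",
    "year/month/day": "%y/%m/%d",
}

parsed = {
    name: datetime.strptime(ambiguous, fmt).date()
    for name, fmt in candidate_formats.items()
}
for name, value in parsed.items():
    print(f"{name:15s} {value.isoformat()}")
```

All three conversions succeed, and each yields a different date, so a successful parse alone is not evidence that the format is right.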

In some cases, a survey of many date-like values will reveal a field with a range of 1-12 and another field with a range of 1-31, permitting analysts to distinguish between the month and day. The remaining field can be taken as a truncated year.
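The survey described above might look like the following sketch. The sample strings and the `/` separator are assumptions made for illustration; the thresholds simply encode the 1-12 and 1-31 ranges.

```python
from collections import defaultdict

# Hypothetical date-like strings with an unknown field order.
samples = ["03/25/99", "11/02/01", "07/31/98", "12/14/00"]

# Collect the integer values seen in each field position.
ranges = defaultdict(set)
for text in samples:
    for position, field in enumerate(text.split("/")):
        ranges[position].add(int(field))

guesses = {}
for position in sorted(ranges):
    low, high = min(ranges[position]), max(ranges[position])
    if high > 31:
        guess = "year"
    elif high > 12:
        guess = "day"
    else:
        guess = "month or day"  # not enough evidence on its own
    guesses[position] = guess
    print(position, low, high, guess)
```

With only a few rows, or with truncated years that all fall below 13, the ranges won't separate the fields, which is when other clues or metadata become necessary.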

In cases where there is not enough data to make a positive identification of month or day, other clues will be needed. Ideally, there’s metadata to define the date format.

The datetime.strptime() function can be used to parse dates when the format...

7.2.3 Deliverables

This project has the following deliverables:

- A requirements-dev.txt file that identifies the tools used, usually jupyterlab==3.5.3.
- Documentation in the docs folder.
- Unit tests for any new changes to the modules in use.
- Any new application modules with code to be used by the inspection notebook.
- A notebook to inspect the attributes that appear to have ordinal or nominal data.

The project directory structure suggested in Chapter 1, Project Zero: A Template for Other Projects, mentions a notebooks directory. See List of deliverables for some more information. For this project, the notebooks directory is needed.

We’ll look at a few of these deliverables in a little more detail.

Revised inspection module

Functions for date conversions and cleaning up nominal data can be written in a separate module. Or they can be developed in a notebook, and then moved to the inspection module. As we noted in the Description section, this project’s objective is to...

7.4 Summary

This chapter expanded on the core features of the inspection notebook. We looked at handling cardinal data (measures and counts), ordinal data (dates and ranks), and nominal data (codes like account numbers).

Our primary objective was to get a complete view of the data, prior to formalizing our analysis pipeline. A secondary objective was to leave notes for ourselves on outliers, anomalies, data formatting problems and other complications. A pleasant consequence of this effort is to be able to write some functions that can be used downstream to clean and normalize the data we’ve found.

Starting in Chapter 9, Project 3.1: Data Cleaning Base Application, we’ll look at refactoring these inspection functions to create a complete and automated data cleaning and normalization application. That application will be based on the lessons learned while creating inspection notebooks.

In the next chapter, we’ll look at one more lesson that’s often learned...

7.5 Extras

Here are some ideas for you to add to the projects in this chapter.

7.5.1 Markdown cells with dates and data source information

A minor feature of an inspection notebook is some identification of the date, time, and source of the data. It’s sometimes clear from the context what the data source is; there may, for example, be an obvious path to the data.

However, in many cases, it’s not perfectly clear what file is being inspected or how it was acquired. As a general solution, any processing application should produce a log. In some cases, a metadata file can include the details of the processing steps.

This additional metadata on the source and processing steps can be helpful when reviewing a data inspection notebook or sharing a preliminary inspection of data with others. In many cases, this extra data is pasted into ordinary markdown cells. In other cases, this data may be the result of scanning a log file for key INFO lines that summarize processing.
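Scanning a log for its INFO lines can be sketched as follows. The log content and its layout (timestamp, level, message) are hypothetical; a real notebook would open the log file written by the acquisition step instead of the in-memory text used here.

```python
import io

# Hypothetical log content standing in for an open log file.
log_text = io.StringIO(
    "2023-09-01 10:00:01 INFO acquire: reading source.csv\n"
    "2023-09-01 10:00:01 DEBUG acquire: dialect=excel\n"
    "2023-09-01 10:00:02 INFO acquire: wrote 44 rows\n"
)

# Keep only lines whose level token is INFO (assumes space-delimited fields).
info_lines = [line.rstrip() for line in log_text if " INFO " in line]

# Format as a markdown bullet list for pasting into a notebook cell.
markdown = "\n".join(f"- `{line}`" for line in info_lines)
print(markdown)
```

In a Jupyter notebook, the resulting string can be pasted into a markdown cell, or rendered directly from a code cell with `IPython.display.Markdown`.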

...

Author (1)

Steven F. Lott

Steven Lott has been programming since computers were large, expensive, and rare. Working for decades in high tech has given him exposure to a lot of ideas and techniques, some bad, but most helpful to others. Since the 1990s, Steven has been engaged with Python, crafting an array of indispensable tools and applications. His profound expertise has led him to contribute significantly to Packt Publishing, penning notable titles like "Mastering Object-Oriented Python," "The Modern Python Cookbook," and "Functional Python Programming." A self-proclaimed technomad, Steven's unconventional lifestyle sees him residing on a boat, often anchored along the vibrant east coast of the US. He tries to live by the words “Don't come home until you have a story.”