You're reading from Python Real-World Projects

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781803246765
Edition: 1st Edition
Author: Steven F. Lott

Steven Lott has been programming since computers were large, expensive, and rare. Working for decades in high tech has given him exposure to a lot of ideas and techniques, some bad, but most helpful to others. Since the 1990s, Steven has been engaged with Python, crafting an array of indispensable tools and applications. His profound expertise has led him to contribute significantly to Packt Publishing, penning notable titles like "Mastering Object-Oriented Python," "The Modern Python Cookbook," and "Functional Python Programming." A self-proclaimed technomad, Steven's unconventional lifestyle sees him residing on a boat, often anchored along the vibrant east coast of the US. He tries to live by the words “Don't come home until you have a story.”

Chapter 10
Data Cleaning Features

There are a number of techniques for validating and converting data to native Python objects for subsequent analysis. This chapter guides you through three of these techniques, each appropriate for different kinds of data. The chapter moves on to the idea of standardization to transform unusual or atypical values into a more useful form. The chapter concludes with the integration of acquisition and cleansing into a composite pipeline.

This chapter will expand on the project in Chapter 9, Project 3.1: Data Cleaning Base Application. The following additional skills will be emphasized:

  • CLI application extension and refactoring to add features.

  • Pythonic approaches to validation and conversion.

  • Techniques for uncovering key relationships.

  • Pipeline architectures. This can be seen as a first step toward a processing DAG (Directed Acyclic Graph) in which various stages are connected.

We’ll start with a description of the first project to expand...

10.6 Summary

This chapter expanded in several ways on the project in Chapter 9, Project 3.1: Data Cleaning Base Application. The following additional processing features were added:

  • Pythonic approaches to validation and conversion of cardinal values.

  • Approaches to validation and conversion of nominal and ordinal values.

  • Techniques for uncovering key relationships and validating data that must properly reference a foreign key.

  • Pipeline architectures using the shell pipeline.
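As a rough illustration of the last point, a cleaning stage in a shell pipeline can be a small program that reads one document per line and writes cleaned documents for the next stage. The sketch below is not the chapter's code. The newline-delimited JSON format, the field names x and y, and the function names clean_rows and run_stage are all assumptions made for the example.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def clean_rows(lines: Iterable[str]) -> Iterator[dict[str, float]]:
    """Yield rows whose hypothetical "x" and "y" fields convert to float."""
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        try:
            raw = json.loads(line)
            row = {"x": float(raw["x"]), "y": float(raw["y"])}
        except (KeyError, TypeError, ValueError):
            # Invalid JSON, a missing field, or a non-numeric value: skip the row.
            continue
        yield row


def run_stage(source: TextIO, target: TextIO) -> None:
    """Copy cleaned rows from source to target, one JSON document per line."""
    for row in clean_rows(source):
        target.write(json.dumps(row) + "\n")


# Demonstration with in-memory files standing in for stdin and stdout.
source = io.StringIO('{"x": "1.5", "y": "2"}\n{"x": "bad", "y": "3"}\n')
target = io.StringIO()
run_stage(source, target)
print(target.getvalue(), end="")
```

Only the first row survives in this demonstration, because the second row's x value cannot be converted. A script that calls run_stage(sys.stdin, sys.stdout) could sit between two other programs joined by the shell's | operator.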

10.7 Extras

Here are some ideas for you to add to these projects.

10.7.1 Hypothesis testing

The computations for mean, variance, standard deviation, and standardized Z-scores involve floating-point values. In some cases, the ordinary truncation errors of float values can introduce significant numeric instability. For the most part, the choice of a proper algorithm can ensure results are useful.
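To make the instability concrete, the following sketch compares two ways of computing a sample variance. Neither is taken from the book: naive_variance uses the textbook sum-of-squares formula, and welford_variance uses Welford's running-update algorithm, which is one well-known stable alternative. The function names and sample data are invented for this illustration.

```python
import math
import statistics


def naive_variance(values: list[float]) -> float:
    """Sample variance from the sum and the sum of squares (can be unstable)."""
    n = len(values)
    s = sum(values)
    s2 = sum(v * v for v in values)
    return (s2 - s * s / n) / (n - 1)


def welford_variance(values: list[float]) -> float:
    """Sample variance using Welford's running update of the mean."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return m2 / (count - 1)


# Large values with a small spread: the squares exceed float precision.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("reference:", statistics.variance(data))  # exact arithmetic internally
print("welford:  ", welford_variance(data))
print("naive:    ", naive_variance(data))
```

The correct sample variance for this data is 30. The Welford result agrees with the statistics module, while the naive formula subtracts two nearly equal numbers of about 4e18 and loses the answer entirely.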

Beyond basic algorithm design, additional testing is sometimes warranted. For numeric algorithms, the Hypothesis package is particularly helpful. See https://hypothesis.readthedocs.io/en/latest/.

Looking specifically at Project 3.5: Standardize data to common codes and ranges, the Approach section suggests a way to compute the variance. The class definition given there is a good example of a design that the Hypothesis package can test effectively, confirming that providing a sequence of three known values produces the expected results for the count, sum, mean, variance...
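A test along these lines might look like the sketch below. The Variance class here is a hypothetical stand-in, not the class from the book's Approach section, and check_summary and test_three_values are invented names. Hypothesis is a third-party package, so the import is guarded and the plain check still runs without it. The strategy is limited to small integers on purpose: with wide-ranging floats, this simple sum-of-squares design would likely fail the property.

```python
import math
import statistics
from dataclasses import dataclass

try:
    from hypothesis import given, strategies as st
except ImportError:  # Hypothesis not installed: pip install hypothesis
    given = None


@dataclass
class Variance:
    """Hypothetical accumulator of count, sum, and sum of squares."""

    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.total_sq += value * value

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def variance(self) -> float:
        return (self.total_sq - self.total**2 / self.count) / (self.count - 1)


def check_summary(values: list[int]) -> None:
    """Compare the accumulator with the standard library's results."""
    summary = Variance()
    for v in values:
        summary.add(v)
    assert summary.count == len(values)
    assert math.isclose(summary.total, math.fsum(values))
    assert math.isclose(summary.mean, statistics.mean(values))
    assert math.isclose(
        summary.variance, statistics.variance(values), rel_tol=1e-9, abs_tol=1e-9
    )


# One fixed example: count 3, sum 12, mean 4, variance 4.
check_summary([2, 4, 6])

if given is not None:

    @given(st.lists(st.integers(-1000, 1000), min_size=3, max_size=3))
    def test_three_values(values: list[int]) -> None:
        # Hypothesis generates many three-value lists and reports a minimal failure.
        check_summary(values)
```

Under pytest, test_three_values is collected like any other test. It can also be called directly with no arguments to run the generated examples.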
