Chapter 9
Project 3.1: Data Cleaning Base Application

Data validation, cleaning, converting, and standardizing are steps required to transform raw data acquired from source applications into something that can be used for analytical purposes. Since we started with a small data set of very clean data, we may need to improvise a bit to create some "dirty" raw data. A good alternative is to search for more complicated raw data.

This chapter will guide you through the design of a data cleaning application, separate from the raw data acquisition. Many details of cleaning, converting, and standardizing will be left for subsequent projects. This initial project creates a foundation that will be extended by adding features. The idea is to prepare for the goal of a complete data pipeline that starts with acquisition and passes the data through a separate cleaning stage. We want to exploit the Linux principle of having applications connected by a shared buffer, often referred to as a pipeline.
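
To make the pipe-and-filter idea concrete, here is a minimal sketch of a cleaning stage written as a stdin-to-stdout filter. The file names and the pass-through body are assumptions for illustration, not the book's actual design.

```python
# A minimal sketch of a cleaning stage as a stdin-to-stdout filter.
# The file names and pass-through body are illustrative assumptions.
#
# Typical shell usage:
#   python acquire.py ... | python clean.py >analysis/clean.ndjson
import json
import sys


def main() -> None:
    for line in sys.stdin:
        sample = json.loads(line)
        # Validation, cleaning, conversion, and standardizing steps
        # would go here; this sketch simply passes the sample through.
        sys.stdout.write(json.dumps(sample) + "\n")


if __name__ == "__main__":
    main()
```

Because each stage reads and writes newline-delimited JSON on the standard streams, the shell's pipe buffer connects the two programs without any intermediate file.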

9.1 Description

We need to build a data validating, cleaning, and standardizing application. A data inspection notebook is a handy starting point for this design work. The goal is a fully automated application that reflects the lessons learned from inspecting the data.

A data preparation pipeline has the following conceptual tasks (a code sketch follows the list):

  • Validate the acquired source text to be sure it’s usable and to mark invalid data for remediation.

  • Clean any invalid raw data where necessary; this expands the available data in those cases where sensible cleaning can be defined.

  • Convert the validated and cleaned source data from text (or bytes) to usable Python objects.

  • Where necessary, standardize the codes or ranges of source data. The requirements here vary with the problem domain.
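
The four tasks can be sketched as separate functions applied to one raw NDJSON row. The field names "x" and "y" and the rules shown are illustrative assumptions, not the book's actual design.

```python
# A sketch of the four conceptual tasks over one raw row of strings.
# Field names and rules are illustrative assumptions.
import json


def validate(row: dict[str, str]) -> bool:
    # Mark rows with missing required fields as invalid.
    return all(row.get(name) for name in ("x", "y"))


def clean(row: dict[str, str]) -> dict[str, str]:
    # Repair recoverable problems, for example, stray whitespace.
    return {name: value.strip() for name, value in row.items()}


def convert(row: dict[str, str]) -> dict[str, float]:
    # Transform validated and cleaned text into Python objects.
    return {name: float(value) for name, value in row.items()}


def standardize(row: dict[str, float]) -> dict[str, float]:
    # Apply domain-specific recoding or rescaling; identity here.
    return row


def process(line: str) -> dict[str, float] | None:
    raw = json.loads(line)
    if not validate(raw):
        return None  # A reject, to be reported or saved for remediation.
    return standardize(convert(clean(raw)))
```

The nesting in process() fixes the order of operations: cleaning precedes conversion, and standardizing comes last.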

The goal is to create clean, standardized data for subsequent analysis. Surprises occur all the time. There are several sources:

  • Technical problems with file formats of the upstream software. The intent of the acquisition...

9.2 Approach

We’ll take some guidance from the C4 model (https://c4model.com) when looking at our approach.

  • Context: For this project, the context diagram has expanded to three use cases: acquire, inspect, and clean.

  • Containers: There’s one container for the various applications: the user’s personal computer.

  • Components: There are two significantly different collections of software components: the acquisition program and the cleaning program.

  • Code: We’ll touch on this to provide some suggested directions.

A context diagram for this application is shown in Figure 9.1.

Figure 9.1: Context Diagram

A component diagram for the conversion application isn’t going to be as complicated as the component diagrams for acquisition applications. One reason is that there are no choices to make about reading, extracting, or downloading raw data files. The source files are the ND JSON files created by the acquisition application.
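
Since the input format is fixed, reading the acquired data reduces to a one-line-per-sample loop. A minimal sketch, where the generator name and path are assumptions:

```python
# Reading an ND JSON source file one sample at a time.
# Each line is one complete JSON document.
import json
from pathlib import Path
from typing import Iterator


def raw_samples(source: Path) -> Iterator[dict[str, str]]:
    with source.open() as ndjson_file:
        for line in ndjson_file:
            yield json.loads(line)
```

This one generator covers the entire input side of the component diagram; no alternative readers, extractors, or downloaders are needed.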

The second reason the conversion...

9.3 Deliverables

This project has the following deliverables:

  • Documentation in the docs folder.

  • Acceptance tests in the tests/features and tests/steps folders.

  • Unit tests for the application modules in the tests folder.

  • Application to clean some acquired data and apply simple conversions to a few fields. Later projects will add more complex validation rules.

We’ll look at a few of these deliverables in a little more detail.

When starting a new kind of application, it often makes sense to start with acceptance tests. Later, when adding features, the new acceptance tests may be less important than new unit tests for the features. We’ll start by looking at a new scenario for this new application.

9.3.1 Acceptance tests

As we noted in Chapter 4, Data Acquisition Features: Web APIs and Scraping, we can provide a large block of text as part of a Gherkin scenario. This can be the contents of an input file. We can consider something like the following scenario.

Scenario...
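
The scenario text itself is elided in this excerpt. As a sketch of how such a text block can be consumed, behave exposes a step's triple-quoted docstring as context.text; the step wording, the working_dir attribute (which would be set up in environment.py), and the file name below are assumptions.

```python
# A behave step that writes a scenario's docstring block to an input
# file. Step wording, working_dir, and the file name are assumptions;
# context.text is behave's standard name for a step's docstring.
from pathlib import Path

from behave import given


@given("an ND JSON input file with the following content")
def step_create_input(context) -> None:
    source = Path(context.working_dir) / "source.ndjson"
    source.write_text(context.text)
    context.source_path = source
```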

9.4 Summary

This chapter has covered a number of aspects of data validation and cleaning applications:

  • CLI architecture and how to design a simple pipeline of processes.

  • The core concepts of validating, cleaning, converting, and standardizing raw data.

In the next chapter, we’ll dive more deeply into a number of data cleaning and standardizing features. Those projects will all build on this base application framework. After those projects, the next two chapters will look a little more closely at the analytical data persistence choices and provide an integrated web service that supplies cleaned data to other stakeholders.

9.5 Extras

Here are some ideas for you to add to this project.

9.5.1 Create an output file with rejected samples

In the Error reports section, we suggested there are times when it’s appropriate to create a file of rejected samples. For the examples in this book — many of which are drawn from well-curated, carefully managed data sets — it can feel a bit odd to design an application that will reject data.

For enterprise applications, data rejection is a common need.

It can help to look at a data set like this: https://datahub.io/core/co2-ppm. It contains samples with measurements of CO2 levels, in units of ppm (parts per million).

Some samples have an invalid number of days in the month; others lack a recorded monthly CO2 level.

It can be insightful to use a rejection file to divide this data set into clearly usable records, and records that are not as clearly usable.
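
A minimal sketch of such a split follows. The field names and sentinel values are assumptions about the co2-ppm data set, not confirmed details.

```python
# Splitting one ND JSON source into accepted and rejected outputs.
# Field names and sentinel values are assumptions for illustration.
import json
from pathlib import Path


def usable(sample: dict[str, str]) -> bool:
    # Assumed rules: a plausible number of days in the month, and a
    # recorded monthly average (missing values appear as sentinels).
    try:
        return (
            1 <= int(sample["number_of_days"]) <= 31
            and float(sample["average"]) > 0.0
        )
    except (KeyError, ValueError):
        return False


def partition(source: Path, good: Path, rejects: Path) -> None:
    with source.open() as src, good.open("w") as ok, rejects.open("w") as bad:
        for line in src:
            target = ok if usable(json.loads(line)) else bad
            target.write(line)
```

Writing rejects as unmodified source lines preserves the evidence needed for remediation.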

The output will not reflect the analysis model. These objects...
