You're reading from Python Real-World Projects

Product type: Book
Published in: Sep 2023
Publisher: Packt
ISBN-13: 9781803246765
Edition: 1st Edition
Steven F. Lott

Steven Lott has been programming since computers were large, expensive, and rare. Working for decades in high tech has given him exposure to a lot of ideas and techniques, some bad, but most helpful to others. Since the 1990s, Steven has been engaged with Python, crafting an array of indispensable tools and applications. His profound expertise has led him to contribute significantly to Packt Publishing, penning notable titles like "Mastering Object-Oriented Python," "Modern Python Cookbook," and "Functional Python Programming." A self-proclaimed technomad, Steven's unconventional lifestyle sees him residing on a boat, often anchored along the vibrant east coast of the US. He tries to live by the words “Don't come home until you have a story.”

Chapter 8
Project 2.5: Schema and Metadata

It helps to keep the data schema separate from the various applications that share the schema. One way to do this is to have a separate module with class definitions that all of the applications in a suite can share. While this is helpful for a simple project, it can be awkward when sharing a data schema more widely. A Python module is a particularly poor vehicle for sharing a schema outside the Python environment.

This project will define a schema in JSON Schema notation, first by building Pydantic class definitions, then by extracting the JSON from the class definitions. This will allow you to publish a formal definition of the data being created. The schema can be used by a variety of tools to validate data files and ensure that the data is suitable for further analytical use.
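The two steps can be sketched as follows. This is a minimal illustration, assuming a hypothetical `SeriesSample` model with `x` and `y` fields; the names are not taken from the project. Pydantic 1.x exposes the schema through `schema()`, while 2.x uses `model_json_schema()`, so the hypothetical helper `json_schema_of()` tries both.

```python
import json

from pydantic import BaseModel


class SeriesSample(BaseModel):
    """One (x, y) pair from a hypothetical acquired series."""

    x: float
    y: float


def json_schema_of(model_class) -> dict:
    """Return the JSON Schema for a Pydantic model class (hypothetical helper)."""
    if hasattr(model_class, "model_json_schema"):  # Pydantic 2.x
        return model_class.model_json_schema()
    return model_class.schema()  # Pydantic 1.x


# Step 1 is the class definition above; step 2 extracts the schema from it.
schema = json_schema_of(SeriesSample)

# The schema is an ordinary dictionary, so it can be serialized for publication.
schema_text = json.dumps(schema, indent=2)
```

The resulting `schema_text` is plain JSON, which is what makes the definition usable outside Python.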

The schema is also useful for diagnosing problems with data sources. Validator tools like jsonschema can provide detailed error reports that can help identify changes...

8.1 Description

Data validation is a common requirement when moving data between applications. It is extremely helpful to have a clear definition of what constitutes valid data. It helps even more when the definition exists outside a particular programming language or platform.

We can use JSON Schema (https://json-schema.org) to define a schema that applies to the intermediate documents created by the acquisition process. Using JSON Schema enables the confident and reliable use of the JSON data format.

The JSON Schema definition can be shared and reused within separate Python projects and with non-Python environments as well. It allows us to build data quality checks into the acquisition pipeline to confirm that the data really fits the requirements for analysis and processing.

Additional metadata provided with a schema often includes the provenance of the data and details on how attribute values are derived. This isn’t a formal part of a JSON Schema, but we can add some...

8.2 Approach

First, we’ll need some additional modules. The jsonschema module defines a validator that can be used to confirm a document matches the defined schema.

Additionally, the Pydantic module provides a way to create class definitions that can emit JSON Schema definitions, saving us from having to create the schema manually. In most cases, manual schema creation is not terribly difficult. For some cases, though, the schema and the validation rules might be challenging to write directly, and having Python class definitions available can simplify the process.

Both packages need to be added to the requirements-dev.txt file so other developers know to install them.

When using conda to manage virtual environments, the command might look like the following:

% conda install jsonschema pydantic

When using other tools to manage virtual environments, the command might look like the following:

% python -m pip install jsonschema pydantic

The JSON Schema package requires some supplemental type stubs...

8.3 Deliverables

This project has the following deliverables:

  • A requirements.txt file that identifies the tools used, usually pydantic==1.10.2 and jsonschema==4.16.0.

  • Documentation in the docs folder.

  • The JSON-format files with the source and analysis schemas. A separate schema directory is the suggested location for these files.

  • An acceptance test for the schemas.

We’ll look at the schema acceptance test in some detail. Then we’ll look at using the schema to extend other acceptance tests.

8.3.1 Schema acceptance tests

To know if the schema is useful, it is essential to have acceptance test cases. As new sources of data are integrated into an application, and old sources of data mutate through ordinary bug fixes and upgrades, files will change. The new files will often cause problems, and the root cause of the problem will be the unexpected file format change.

Once a file format change is identified, the smallest relevant example needs to be transformed into an acceptance...
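As a rough sketch of such a test, the functions below check one small document in the expected format and one showing a format change. The schema, the sample documents, and the test names are all hypothetical; in the project, the schema would be read from the schema directory. The functions are written as plain test functions of the kind pytest collects.

```python
from jsonschema import ValidationError, validate

# Hypothetical schema, written inline to keep the sketch self-contained.
SCHEMA = {
    "type": "object",
    "properties": {"x": {"type": "number"}, "y": {"type": "number"}},
    "required": ["x", "y"],
    "additionalProperties": False,
}

# Smallest relevant examples: the expected format and a changed format.
EXPECTED_FORMAT = {"x": 10.0, "y": 8.04}
CHANGED_FORMAT = {"x": 10.0, "Y": 8.04}  # attribute renamed by the source


def test_expected_format_is_accepted() -> None:
    # validate() returns None on success and raises ValidationError otherwise.
    validate(instance=EXPECTED_FORMAT, schema=SCHEMA)


def test_changed_format_is_rejected() -> None:
    try:
        validate(instance=CHANGED_FORMAT, schema=SCHEMA)
    except ValidationError:
        return  # the schema caught the format change, as intended
    raise AssertionError("schema accepted a document in the changed format")
```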

8.4 Summary

This chapter’s projects have shown examples of the following features of a data acquisition application:

  • Using the Pydantic module for crisp, complete definitions

  • Using JSON Schema to create an exportable language-independent definition that anyone can use

  • Creating test scenarios to use the formal schema definition

Having formalized schema definitions permits recording additional details about the data processing applications and the transformations applied to the data.

The docstrings for the class definitions become the descriptions in the schema. This permits writing details on data provenance and transformation that are exposed to all users of the data.

The JSON Schema standard permits recording examples of values. The Pydantic package has ways to include this metadata in field definitions and class configuration objects. This can be helpful when explaining odd or unusual data encodings.

Further, for text fields, JSON Schema permits including a format attribute...
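The way docstrings and field metadata surface in the schema can be sketched as follows. The `Reading` model, its field, and the descriptive text are hypothetical. The `description` and `examples` arguments to `Field()` are expected to appear in the schema under both Pydantic 1.x and 2.x, and the helper `json_schema_of()` is an assumption used to cover both versions.

```python
from pydantic import BaseModel, Field


class Reading(BaseModel):
    """Hypothetical reading; values converted from source text to float."""

    temperature: float = Field(
        ...,
        description="Degrees Celsius, derived from the raw sensor string.",
        examples=[21.5],
    )


def json_schema_of(model_class) -> dict:
    """Return the JSON Schema for a Pydantic model class (hypothetical helper)."""
    if hasattr(model_class, "model_json_schema"):  # Pydantic 2.x
        return model_class.model_json_schema()
    return model_class.schema()  # Pydantic 1.x


schema = json_schema_of(Reading)

# The class docstring becomes the top-level description.
model_description = schema["description"]

# The Field() metadata lands on the property itself.
field_schema = schema["properties"]["temperature"]
```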

8.5 Extras

Here are some ideas for you to add to this project.

8.5.1 Revise all previous chapter models to use Pydantic

The previous chapters used dataclass definitions from the dataclasses module. These can be shifted to use the pydantic.dataclasses module. This should have minimal impact on the previous projects.

We can also shift all of the previous acceptance test suites to use a formal schema definition for the source data.
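A minimal sketch of the dataclass change, assuming a hypothetical `XYPair` class standing in for one of the earlier models. Only the import changes; the class body keeps its dataclass form, but field values are now validated and converted when an instance is created.

```python
from pydantic import ValidationError
from pydantic.dataclasses import dataclass  # replaces: from dataclasses import dataclass


@dataclass
class XYPair:
    """Hypothetical stand-in for a dataclass from an earlier project."""

    x: float
    y: float


# Source text is converted to float as the instance is built.
pair = XYPair(x="10.0", y="8.04")

# Text that cannot be converted raises a ValidationError.
try:
    XYPair(x="ten", y="8.04")
except ValidationError as error:
    problem = str(error)
else:
    problem = ""
```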

8.5.2 Use the ORM layer

For SQL extracts, an ORM can be helpful. The Pydantic module lets an application create Python objects from intermediate ORM objects. This two-layer processing seems complex, but it permits detailed validation in the Pydantic objects that isn’t handled by the database.

For example, a database may have a numeric column without any range provided. A Pydantic class definition can provide a field definition with ge and le attributes to define a range. Further, Pydantic permits the definition of a unique data type with unique validation...
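The range check can be sketched as follows. The `Measurement` model, the `humidity_pct` column, and the `load_row()` function are hypothetical, and the rows are assumed to arrive as plain dictionaries instead of ORM objects to keep the example independent of any particular ORM.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Measurement(BaseModel):
    """Hypothetical row whose database column has no range constraint."""

    # ge and le bound the value to the closed interval [0, 100].
    humidity_pct: float = Field(..., ge=0, le=100)


def load_row(row: dict) -> Optional[Measurement]:
    """Build a Measurement from a row mapping; return None if it is invalid."""
    try:
        return Measurement(**row)
    except ValidationError:
        return None


in_range = load_row({"humidity_pct": 42.0})
out_of_range = load_row({"humidity_pct": 140.0})
```

Returning `None` is only one possible design; an application might instead log the error details or collect the rejected rows for later review.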
