You're reading from Driving Data Quality with Data Contracts

Product type: Book

Published in: Jun 2023

Publisher: Packt

ISBN-13: 9781837635009

Edition: 1st Edition

Concepts: Data Engineering

Author: Andrew Jones

What Makes Up a Data Contract

In this chapter, we’re going to look at what exactly makes up a data contract. This includes the schema, which describes and documents the structure of the data. We’ll discuss why this is important and show how we can define the schema in several open source schema formats.

A schema can only describe data at a point in time. However, as the needs of the organization change, so too does our data. We’ll explore how we can support the evolution of our data, while still providing data consumers the stability they need to build on this data with confidence.

Data contracts are more than just a schema, though. As we’ve discussed in previous chapters, we need our data contracts to capture metadata that describes how the data can be used, how it is governed, and the controls around the data. We’ll show how we do that, and how we can use that metadata to drive tooling and integrate with other services.

By the end of this...

The schema of a data contract

We’ll start this section by looking at the schema of a data contract, what to put in it, and why. Then we’ll look at how to make these schemas accessible to both data generators and consumers, by storing them in a system (or a registry) that is recognized as the source of truth.

We’ll cover these topics in the following subsections:

Defining a schema
Using a schema registry as the source of truth
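
To make the idea of a registry acting as the source of truth more concrete, here is a minimal sketch in Python. It is not the API of any real schema registry: the InMemoryRegistry class, its register and latest methods, and the field-name-to-type dictionary used as a schema are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryRegistry:
    """Hypothetical registry: maps a schema name to its ordered list of versions."""

    _schemas: dict[str, list[dict[str, str]]] = field(default_factory=dict)

    def register(self, name: str, schema: dict[str, str]) -> int:
        """Store a new version of the schema and return its 1-based version number."""
        versions = self._schemas.setdefault(name, [])
        versions.append(dict(schema))  # copy so later edits by the caller are not stored
        return len(versions)

    def latest(self, name: str) -> dict[str, str]:
        """Return a copy of the most recently registered version of the schema."""
        return dict(self._schemas[name][-1])


# A data generator registers the schema once; consumers read it from the same place.
registry = InMemoryRegistry()
version = registry.register("Customer", {"id": "string", "name": "string"})
```

Because generators and consumers both go through the same register and latest calls, neither side needs to keep its own copy of the schema, which is the property a real registry is meant to provide.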

Defining a schema

The schema defines the structure of the data. At a minimum, it will hold the complete list of the fields available and their data types.

The following code block shows an example schema, written in Protocol Buffers (https://protobuf.dev), that defines a Customer record. Each field has a type and, as Protocol Buffers requires, a unique field number:

message Customer {
  string id       = 1;
  string name    ...
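
As a rough illustration of what a list of fields and their types gives us, the following Python sketch checks a record against a schema. It assumes only the two fields visible in the excerpt above (id and name); the CUSTOMER_SCHEMA dictionary and the validate function are hypothetical and are not part of Protocol Buffers or any other schema library.

```python
# Hypothetical schema: field name -> expected Python type.
# Only the two fields visible in the excerpt are included.
CUSTOMER_SCHEMA = {"id": str, "name": str}


def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of problems found in the record; an empty list means it conforms."""
    errors = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Fields that the schema does not know about are reported as well.
    for name in record:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors


good = validate({"id": "c1", "name": "Ada"}, CUSTOMER_SCHEMA)
bad = validate({"id": 1}, CUSTOMER_SCHEMA)
```

Here good is an empty list, while bad reports both the wrong type for id and the missing name field. A real schema format would generate this kind of check for us rather than require it to be written by hand.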

Evolving your data over time

In this section, we’ll discuss how we can manage the evolution of our data, and the schemas that define it, while still giving the data consumers the stability they need to build on the data with confidence.

We spoke in detail about how data evolves in an organization, and why managing that evolution well is important for consumers, in Chapter 4, Bringing Data Consumers and Generators Closer Together, in the Managing the evolution of data section. We also discussed the difference between a breaking change and a non-breaking change, and how, for a breaking change, we deliberately introduce some friction so that the migration to the new version is managed and the impact on downstream consumers is reduced.

It’s this concept of versions that allows us to evolve schemas. We use versioning to track and manage the changes to a schema over time. The previous versions of the schema are used to validate whether the new version introduces...
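
The comparison between a previous version and a new version can be sketched in a few lines of Python. This is a simplified illustration under a stated assumption: removing a field or changing its type counts as breaking, while adding a field does not. Real schema formats and registries apply their own, more detailed compatibility rules, and the breaking_changes and next_major_version functions here are hypothetical.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare two schema versions (field name -> type name) and list breaking changes.

    Assumption for this sketch: removed fields and changed types are breaking,
    while newly added fields are not.
    """
    problems = []
    for name, old_type in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name] != old_type:
            problems.append(f"changed type of {name}: {old_type} -> {new[name]}")
    return problems


def next_major_version(current: int, old: dict[str, str], new: dict[str, str]) -> int:
    """Bump the major version only when the new schema breaks the old one."""
    return current + 1 if breaking_changes(old, new) else current


v1 = {"id": "string", "name": "string"}
v1_with_email = {"id": "string", "name": "string", "email": "string"}
v2_without_name = {"id": "string"}
```

Under these rules, moving from v1 to v1_with_email keeps the same major version, while moving from v1 to v2_without_name requires a new one, which is the point at which a managed migration for consumers would begin.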

Defining the governance and controls

In Chapter 5, Embedding Data Governance, we discussed the importance of data governance and how we embed those controls alongside the data. We also spoke about how responsibility for those controls is assigned to the data generators, who are supported by a central data governance committee through policies, standards, and tooling.

In this section, we’ll look at exactly how we can define the governance and controls in the data contract.

Every data contract must have an owner. This is the data generator, and it is they who take on the responsibilities and accountabilities we discussed in Chapter 4, Bringing Data Consumers and Generators Closer Together.

Depending on your requirements, you might want to embed some of the following in your data contract:

The version number of the contract
The service-level agreements (SLAs)
How to access the data (for example, is the interface a table in a data warehouse, a topic on a stream...
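
One way to picture this metadata sitting alongside the schema is the Python sketch below. It covers only the owner and the items visible in the list above; the DataContract class, its field names, the example values, and the check_contract function are all hypothetical and do not represent a standard data contract format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: a schema plus the metadata that governs it."""

    name: str
    owner: str                 # the data generator responsible for the contract
    version: int               # version number of the contract
    sla_freshness_hours: int   # example SLA: maximum age of the data in hours
    access: str                # how consumers reach the data
    schema: dict               # field name -> type name


def check_contract(contract: DataContract) -> list[str]:
    """Return a list of governance problems; an empty list means the checks pass."""
    problems = []
    if not contract.owner.strip():
        problems.append("contract has no owner")
    if contract.version < 1:
        problems.append("version must be 1 or greater")
    return problems


customer_contract = DataContract(
    name="Customer",
    owner="customer-team",  # assumed team name, for illustration only
    version=1,
    sla_freshness_hours=24,
    access="table in a data warehouse",
    schema={"id": "string", "name": "string"},
)
```

Calling check_contract(customer_contract) returns an empty list, while a contract with an empty owner would be rejected. Checks like this are one way tooling could enforce the rule that every data contract must have an owner.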

Summary

In this chapter, we’ve started to see exactly what makes up a data contract. A large part of the data contract is the schema. We explored various open source schema formats to understand how we can use them to define schemas and the different functionality we can add to those schemas. We also looked at how we can make schemas accessible by using a schema registry to act as the source of truth for them.

However, schemas can only define how the data looks at a set point in time. Data will evolve, and so will the schema. So, we then discussed how to evolve your data over time and how to migrate your consumers to a new version without causing major disruption or breaking existing applications unexpectedly.

We finished the chapter by looking at how we can use data contracts to manage the governance and controls of data through the specification of metadata that describes the data.

We can then use that metadata to integrate with any tool or service. This can be done...

Further reading

Protocol Buffers: https://protobuf.dev/
Apache Avro: https://avro.apache.org/
JSON Schema: https://json-schema.org/
YAML: https://yaml.org/
Jsonnet: https://jsonnet.org/
Schemata: https://github.com/ananthdurai/schemata
Protocol Buffers Best Practices for Backward and Forward Compatibility by John Gramila: https://earthly.dev/blog/backward-and-forward-compatibility/
Understanding Avro Compatibility by Kyle Carter: https://medium.com/codex/understanding-avro-compatibility-e2f9afa48dd1
Understanding JSON Schema Compatibility by Robert Yokota: https://yokota.blog/2021/03/29/understanding-json-schema-compatibility/
Data contracts: The missing foundation by Tom Baeyens: https://medium.com/@tombaeyens/data-contracts-the-missing-foundation-3c7a98544d2a
Template for a data contract used in a data mesh: https://github.com/paypal/data...

Andrew Jones is a principal engineer at GoCardless, one of Europe's leading fintechs. He has over 15 years' experience in the industry, with the first half primarily as a software engineer, before he moved into the data infrastructure and data engineering space. Joining GoCardless as its first data engineer, he led his team to build their data platform from scratch. After initially following a typical data architecture and getting frustrated with facing the same old challenges he'd faced for years, he started thinking there must be a better way, which led to him coining and defining the ideas around data contracts. Andrew is a regular speaker and writer, and he is passionate about helping organizations get maximum value from data.