You're reading from Driving Data Quality with Data Contracts

Product type: Book
Published in: Jun 2023
Publisher: Packt
ISBN-13: 9781837635009
Edition: 1st Edition
Author (1)
Andrew Jones

Andrew Jones is a principal engineer at GoCardless, one of Europe's leading fintechs. He has over 15 years of experience in the industry, with the first half primarily as a software engineer, before he moved into the data infrastructure and data engineering space. Joining GoCardless as its first data engineer, he led his team to build their data platform from scratch. After initially following a typical data architecture and getting frustrated with facing the same old challenges he'd faced for years, he started thinking there must be a better way, which led to him coining and defining the ideas around data contracts. Andrew is a regular speaker and writer, and he is passionate about helping organizations get maximum value from data.

What Makes Up a Data Contract

In this chapter, we’re going to look at what exactly makes up a data contract. This includes the schema, which describes and documents the structure of the data. We’ll discuss why this is important and show how we can define the schema in several open source schema formats.

A schema can only describe data at a point in time. However, as the needs of the organization change, so too does our data. We’ll explore how we can support the evolution of our data, while still providing data consumers the stability they need to build on this data with confidence.

However, data contracts are more than just a schema. As we’ve discussed in previous chapters, we need our data contracts to capture metadata that describes how the data can be used, how it is governed, and the controls around the data. We’ll show how we do that, and how we can use that metadata to drive tooling and integrate with other services.

By the end of this...

The schema of a data contract

We’ll start this section by looking at the schema of a data contract, what to put in it, and why. Then we’ll look at how to make these schemas accessible to both data generators and consumers, by storing them in a system (or a registry) that is recognized as the source of truth.

We’ll cover these topics in the following subsections:

  • Defining a schema
  • Using a schema registry as the source of truth
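
To give a feel for what "source of truth" means in practice, here is a minimal Python sketch of a registry that stores every version of a named schema and hands back the latest one by default. It is an in-memory stand-in only, not the API of any real schema registry; the class name `InMemorySchemaRegistry`, its methods, and the representation of a schema as a dictionary of field name to type name are all assumptions made for illustration.

```python
from typing import Optional


class InMemorySchemaRegistry:
    """Hypothetical stand-in for a schema registry; keeps every version."""

    def __init__(self) -> None:
        # schema name -> list of schema versions, oldest first
        self._versions: dict[str, list[dict[str, str]]] = {}

    def register(self, name: str, schema: dict[str, str]) -> int:
        """Store a new version of the schema and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(dict(schema))  # copy so later edits by the caller don't leak in
        return len(versions)

    def get(self, name: str, version: Optional[int] = None) -> dict[str, str]:
        """Return the requested version, or the latest one if no version is given."""
        versions = self._versions[name]  # KeyError if the name was never registered
        if version is None:
            return dict(versions[-1])
        if not 1 <= version <= len(versions):
            raise LookupError(f"{name} has no version {version}")
        return dict(versions[version - 1])


if __name__ == "__main__":
    registry = InMemorySchemaRegistry()
    registry.register("customer", {"id": "string"})
    registry.register("customer", {"id": "string", "name": "string"})
    print(registry.get("customer"))      # latest version
    print(registry.get("customer", 1))   # a specific earlier version
```

Because both data generators and consumers read from the same place, neither side needs to keep its own copy of the schema.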

Defining a schema

The schema defines the structure of the data. At a minimum, it will hold the complete list of the fields available and their data type.

The following code block shows an example schema, written using Protocol Buffers (https://protobuf.dev), that defines a Customer record. Each field has a name, a type, and a unique field number, which Protocol Buffers requires:

message Customer {
  string id       = 1;
  string name    ...
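
To make concrete what a schema holding "the complete list of the fields available and their data type" gives us, the following is a minimal Python sketch rather than the book's implementation. It mirrors only the two fields visible in the excerpt above (`id` and `name`, both strings); the `CUSTOMER_SCHEMA` dictionary and the `validate_record` helper are hypothetical names introduced for illustration.

```python
# Hypothetical, minimal schema: field name -> expected Python type.
# Only the two fields visible in the Protocol Buffers excerpt are mirrored here.
CUSTOMER_SCHEMA = {
    "id": str,
    "name": str,
}


def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Fields not listed in the schema are reported too, because the schema
    # is meant to be the complete list of available fields.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


if __name__ == "__main__":
    print(validate_record({"id": "c-1", "name": "Ada"}, CUSTOMER_SCHEMA))
    print(validate_record({"id": 42}, CUSTOMER_SCHEMA))
```

A data generator could run a check like this before publishing a record, and a consumer could rely on the same definition when reading it.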

Evolving your data over time

In this section, we’ll discuss how we can manage the evolution of our data, and the schemas that define it, while still giving the data consumers the stability they need to build on the data with confidence.

We spoke in detail about how data evolves in an organization, and why managing that evolution well matters to consumers, in the Managing the evolution of data section of Chapter 4, Bringing Data Consumers and Generators Closer Together. We also discussed the difference between a breaking change and a non-breaking change. For a breaking change, we want to deliberately introduce some friction so that the migration to the new version is managed and the impact on downstream consumers is reduced.

It’s this concept of versions that allows us to evolve schemas. We use versioning to track and manage the changes to a schema over time. The previous versions of the schema are used to validate whether the new version introduces...
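
The role of earlier versions in validating a new one can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the mechanism of any particular schema format or registry: schemas are plain dictionaries of field name to type name, and a change is treated as breaking if it removes a field or changes a field's type, while adding a field is treated as non-breaking. The function name `breaking_changes`, the example schemas, and these rules are all assumptions; real formats and registries define their own compatibility rules.

```python
def breaking_changes(previous: dict[str, str], proposed: dict[str, str]) -> list[str]:
    """Compare a proposed schema against the previous version.

    Assumed rules (illustrative only): removing a field or changing its
    type is breaking; adding a new field is non-breaking.
    """
    problems = []
    for field, old_type in previous.items():
        if field not in proposed:
            problems.append(f"field removed: {field}")
        elif proposed[field] != old_type:
            problems.append(
                f"type changed for {field}: {old_type} -> {proposed[field]}"
            )
    return problems


# Hypothetical schema versions, expressed as field name -> type name.
v1 = {"id": "string", "name": "string"}
v2_additive = {"id": "string", "name": "string", "email": "string"}
v2_breaking = {"id": "int64"}

if __name__ == "__main__":
    print(breaking_changes(v1, v2_additive))  # nothing reported for an added field
    print(breaking_changes(v1, v2_breaking))  # reports the type change and the removal
```

An empty result means the change can be accepted under these assumed rules. A non-empty result is the point at which deliberate friction could be applied, for example by requiring a new version and a managed migration.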

Defining the governance and controls

In Chapter 5, Embedding Data Governance, we discussed the importance of data governance and how we embed those controls alongside the data. We also spoke about how responsibility for those controls is assigned to the data generators, who are supported by a central data governance committee through policies, standards, and tooling.

In this section, we’ll look at exactly how we can define the governance and controls in the data contract.

Every data contract must have an owner. This is the data generator, and it is they who take on the responsibilities and accountabilities we discussed in Chapter 4, Bringing Data Consumers and Generators Closer Together.

Depending on your requirements, you might want to embed some of the following in your data contract:

  • The version number of the contract
  • The service-level agreements (SLAs)
  • How to access the data (for example, is the interface a table in a data warehouse, a topic on a stream...
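
As a rough illustration of how such metadata might sit alongside the schema, here is a Python sketch. Every name and value in it (`DataContract`, `sla_hours`, `access`, `missing_metadata`, the example team and table) is hypothetical; the text above only says that a contract must have an owner and may carry items such as a version number, SLAs, and how to access the data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class DataContract:
    """Hypothetical container for a contract's schema and metadata."""

    name: str
    owner: str                        # the data generator accountable for the data
    version: str                      # version number of the contract
    schema: dict[str, str]            # field name -> type name
    sla_hours: Optional[int] = None   # assumed SLA shape: maximum hours of staleness
    access: dict[str, str] = field(default_factory=dict)  # how to reach the data


def missing_metadata(contract: DataContract) -> list[str]:
    """List the metadata items that an assumed policy requires but that are empty."""
    missing = []
    if not contract.owner:
        missing.append("owner")
    if not contract.version:
        missing.append("version")
    if contract.sla_hours is None:
        missing.append("sla_hours")
    if not contract.access:
        missing.append("access")
    return missing


# Example contract with made-up values; no SLA has been set yet.
contract = DataContract(
    name="customer",
    owner="payments-team",
    version="1",
    schema={"id": "string", "name": "string"},
    access={"type": "table", "location": "warehouse.customer"},
)

if __name__ == "__main__":
    print(missing_metadata(contract))
```

Keeping the metadata in a structured form like this is what lets tooling act on it, for instance by refusing to publish a contract whose required items are missing.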

Summary

In this chapter, we’ve started to see exactly what makes up a data contract. A large part of the data contract is the schema. We explored various open source schema formats to understand how we can use them to define schemas and the different functionality we can add to those schemas. We also looked at how we can make schemas accessible by using a schema registry to act as the source of truth for them.

However, schemas can only define how the data looks at a set point in time. Data will evolve, and so will the schema. So, we then discussed how to evolve your data over time and how to migrate your consumers to a new version without causing major disruption or breaking existing applications unexpectedly.

We finished the chapter by looking at how we can use data contracts to manage the governance and controls of data through the specification of metadata that describes the data.

We can then use that metadata to integrate with any tool or service. This can be done...

Further reading

For more information on the topics covered in this chapter, please see the following resources:
