Engineering Data Mesh in Azure Cloud

Product type: Book
Published: Mar 2024
Publisher: Packt
ISBN-13: 9781805120780
Pages: 314
Edition: 1st
Author: Aniruddha Deswandikar

Table of Contents (23 chapters)

Preface
Part 1: Rolling Out the Data Mesh in the Azure Cloud
Chapter 1: Introducing Data Meshes
Chapter 2: Building a Data Mesh Strategy
Chapter 3: Deploying a Data Mesh Using the Azure Cloud-Scale Analytics Framework
Chapter 4: Building a Data Mesh Governance Framework Using Microsoft Azure Services
Chapter 5: Security Architecture for Data Meshes
Chapter 6: Automating Deployment through Azure Resource Manager and Azure DevOps
Chapter 7: Building a Self-Service Portal for Common Data Mesh Operations
Part 2: Practical Challenges of Implementing a Data Mesh
Chapter 8: How to Design, Build, and Manage Data Contracts
Chapter 9: Data Quality Management
Chapter 10: Master Data Management
Chapter 11: Monitoring and Data Observability
Chapter 12: Monitoring Data Mesh Costs and Building a Cross-Charging Model
Chapter 13: Understanding Data-Sharing Topologies in a Data Mesh
Part 3: Popular Data Product Architectures
Chapter 14: Advanced Analytics Using Azure Machine Learning, Databricks, and the Lakehouse Architecture
Chapter 15: Big Data Analytics Using Azure Synapse Analytics
Chapter 16: Event-Driven Analytics Using Azure Event Hubs, Azure Stream Analytics, and Azure Machine Learning
Chapter 17: AI Using Azure Cognitive Services and Azure OpenAI
Index
Other Books You May Enjoy

How to Design, Build, and Manage Data Contracts

Data contracts are essential for collaboration across a data mesh. However, they are still a new concept in the industry, and there are no out-of-the-box solutions or products for building and maintaining them. As a result, they become one of the most challenging parts of data mesh design and implementation: you know what they are, but you don't know where to start.

In this chapter, we will discuss how you can design, plan, and implement data contracts for your data mesh, covering the following questions:

  • What are data contracts?
  • What are the contents of a data contract?
  • Who creates and owns a data contract?
  • Who consumes the data contract?
  • How do we store and access data contracts?
  • How do we link data contracts to data consumption or pipelines?

What are data contracts?

When building multi-tiered applications that integrate with multiple external systems, the most common mode of communication is the application programming interface (API). These APIs are interfaces to the functionality of the application. APIs have the following elements bundled into a contract:

  • The protocol used
  • The URL of the API
  • The request format
  • The response format
  • Any special security information

Various API standards exist; Simple Object Access Protocol (SOAP), representational state transfer (REST), GraphQL, and WebSocket are some examples.
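The contract elements listed above can be captured in a small structure. The following is a minimal sketch, not from the book; the class, field names, and example values are all illustrative assumptions.

```python
from dataclasses import dataclass, field

# Illustrative container for the five API contract elements listed above.
@dataclass
class ApiContract:
    protocol: str          # e.g. "HTTPS"
    url: str               # endpoint of the API
    request_format: dict   # expected request schema
    response_format: dict  # expected response schema
    security: dict = field(default_factory=dict)  # any special security information

# Hypothetical example values for demonstration only.
contract = ApiContract(
    protocol="HTTPS",
    url="https://api.example.com/v1/orders",
    request_format={"order_id": "string"},
    response_format={"order_id": "string", "status": "string"},
    security={"auth": "OAuth2 bearer token"},
)
print(contract.protocol, contract.url)
```

Bundling these elements in one place is the essence of a contract: a consumer needs nothing beyond this record to call the API correctly.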

Once these APIs are built, they keep changing over time. New functionality, new data elements, and many other factors can force teams to change, deprecate, or build new APIs.

In a small project, these changes are easy to manage. It’s a small team that communicates very effectively because the API builder and consumer are probably sitting across the...

What are the contents of a data contract?

The contents of a data contract will depend on a company’s requirements. The content can be segmented into a few buckets:

  • Identification: This is a way to uniquely identify a contract and associate it with the data it represents. Ideally, this should be a string formed by combining the department, the project, and the store that the data belongs to.
  • Basic information: This bucket has the obvious attributes that any user would want to know about the data, such as its name, description, version, and owners.
  • Schema: The schema defines the structure of the data. For structured data, this can be the table schema. For semi-structured data, it could be a JSON document explaining the schema. For unstructured data, such as an image, it could be the image details, such as size, resolution, or format. A schema definition can be provided as part of the contract. This helps users to ensure that the...
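Putting the three buckets together, a contract might be serialized as a JSON document. The sketch below is a hypothetical example; every field name and value is an assumption, not a prescribed format.

```python
import json

# Illustrative data contract covering the buckets described above:
# identification, basic information, and schema.
data_contract = {
    # Identification: department.project.store, as suggested above
    "id": "sales.orders.lakehouse",
    # Basic information
    "name": "Orders",
    "description": "Curated order records from the sales domain",
    "version": "1.2.0",
    "owners": ["sales-data-team@example.com"],
    # Schema: table schema for structured data
    "schema": {
        "order_id": "string",
        "order_date": "date",
        "amount": "decimal(10,2)",
    },
}
print(json.dumps(data_contract, indent=2))
```

A JSON representation like this keeps the contract both human-readable and easy to store in a document database.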

Who creates and owns a data contract?

This should be a short discussion because the answer is fairly obvious. The data owners create and maintain the data contract, because they are the only ones who know everything about their data. They know how to check its quality, and they know the availability of their sources well enough to justify the service levels promised in the contract.

While the data contract is created by the owners, the responsibility of keeping it up to date is divided between the data owners and the pipeline builders/developers. Typically, any data made available at a storage location is brought there by some program or pipeline. Every time this program or pipeline runs, as a last step, it should update the contract fields such as Last updated. Similarly, every time data owners see a change in the source systems or need to change the data format or structure, they should communicate this to the developers. The developers should then develop a new schema and update the Schema...
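The "last step of the pipeline" idea above can be sketched as a small helper. This is an assumption about how such a step might look; the field name `last_updated` and the in-memory contract are illustrative stand-ins for a real contract store.

```python
from datetime import datetime, timezone

# Sketch: as the final step of a pipeline run, stamp the contract's
# last-updated field so consumers can see how fresh the data is.
def update_contract_after_run(contract: dict) -> dict:
    contract["last_updated"] = datetime.now(timezone.utc).isoformat()
    return contract

# Hypothetical contract record; in practice this would be read from
# and written back to the contract store.
contract = {"id": "sales.orders.lakehouse", "version": "1.2.0"}
update_contract_after_run(contract)
print(contract["last_updated"])
```

Making the update an explicit final pipeline step keeps the contract's freshness field trustworthy without any manual bookkeeping.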

Who consumes the data contract?

The data contract is consumed by data teams who are looking for data to build their data products and by the pipeline managers who are consuming data based on the contract.

The users typically look at the contract to ensure that the data has the quality and reliability they need for their project. Once they decide to use the data, they typically build the code to consume it. This could be a pipeline that pulls data from the source and lands it in their landing zone, or it could be direct access to the data source through in-memory data frames, such as Python code running in a notebook. Either way, they should first confirm that the data contract has not changed and that the schema and the version are still valid. Figure 8.1 is a simple version of how a contract can be maintained and consumed:

Figure 8.1 – Maintaining data contracts

In reality, it might be more complex than that. Depending on how...
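The consumer-side check described above, confirming that the version and schema are still what the pipeline was built against, might look like the following sketch. The function name and contract fields are assumptions for illustration.

```python
# Before consuming data, verify the contract still matches what the
# consumer's code was built against.
def contract_is_compatible(contract: dict,
                           expected_version: str,
                           expected_schema: dict) -> bool:
    return (contract.get("version") == expected_version
            and contract.get("schema") == expected_schema)

# Hypothetical contract as read from the contract store at run time.
contract = {
    "version": "1.2.0",
    "schema": {"order_id": "string", "amount": "decimal(10,2)"},
}
ok = contract_is_compatible(
    contract, "1.2.0", {"order_id": "string", "amount": "decimal(10,2)"}
)
print(ok)  # True only if nothing changed since the pipeline was built
```

Running this check at the start of every pipeline execution turns a silent schema drift into an explicit, actionable failure.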

How do we store and access data contracts?

We cannot help but observe that a lot of the content of a data contract is also part of a typical data catalog, such as Microsoft Purview. Would it make sense to maintain the contract information as added attributes in Microsoft Purview? While that might look like a tempting option, since it eliminates the need for additional storage, some features, such as data versioning, are not yet available in Microsoft Purview. The Microsoft Purview product team might eventually add this feature, but you need it now.

Considering this situation, you have two options:

  • Spread the information across different data catalogs including Microsoft Purview and maintain the missing attributes as JSON files in a separate store
  • Implement a data contract as a completely separate system decoupled from the catalog

The choice between these two depends on your roadmap. If you think you will eventually switch the data...
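For the first option, the attributes the catalog cannot hold could be kept as JSON documents in a separate store. The sketch below uses a local folder as a stand-in for a real store such as Azure Blob Storage or Azure Cosmos DB; the paths, file layout, and function names are all assumptions.

```python
import json
from pathlib import Path

# A local folder stands in for the separate contract store mentioned above.
store = Path("contract_store")
store.mkdir(exist_ok=True)

def save_contract(contract: dict) -> Path:
    """Persist one contract as a JSON document keyed by its id."""
    path = store / f"{contract['id']}.json"
    path.write_text(json.dumps(contract, indent=2))
    return path

def load_contract(contract_id: str) -> dict:
    """Read a contract back from the store by id."""
    return json.loads((store / f"{contract_id}.json").read_text())

# Hypothetical contract holding only the attributes missing from the catalog.
save_contract({"id": "sales.orders.lakehouse", "version": "1.2.0"})
print(load_contract("sales.orders.lakehouse")["version"])
```

Keying each document by the contract's unique identifier makes it straightforward to join this store back to the matching catalog entries.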

How do we link data contracts to data consumption or pipelines?

Hosting contracts in a central repository that data owners and data consumers can access and maintain is only half of the solution. Sticking to these contracts, and ensuring that data pipelines do not fail because they are using the wrong version of the data, completes the end-to-end data contract implementation. We also need this consistency check to run in an automated fashion, so that pipelines or programs detect any inconsistency and take the necessary actions if they observe a mismatch.

The first step in this process is to ensure that you have programmatic access to the data contracts. Besides providing read and write access to the contracts, you also have to allow users to browse and search them by keyword.

As mentioned in the What are the contents of a data contract? section, certain attributes in the data catalog might overlap with the attributes that we are...
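The keyword search mentioned above could, in its simplest form, match against the contract's name and description fields. In practice this would sit behind the contract API and likely use an indexed search service; the in-memory version below is purely illustrative, and all names are assumptions.

```python
# Minimal in-memory keyword search over a list of contract documents.
def search_contracts(contracts: list[dict], keyword: str) -> list[dict]:
    kw = keyword.lower()
    return [
        c for c in contracts
        if kw in c.get("name", "").lower()
        or kw in c.get("description", "").lower()
    ]

# Hypothetical contracts for demonstration.
contracts = [
    {"id": "sales.orders.lakehouse", "name": "Orders",
     "description": "Curated order records"},
    {"id": "hr.people.lakehouse", "name": "People",
     "description": "Employee master data"},
]
print([c["id"] for c in search_contracts(contracts, "order")])
```

Exposing search through the same API that serves reads and writes keeps the contract repository the single entry point for both humans browsing the data mesh portal and pipelines validating their inputs.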

Summary

In this chapter, we first learned what data contracts are and why they are an important element for transforming data into data products. We looked at the contents of a data contract and explored both a standalone data contract system and a hybrid system, where some components are maintained in a data catalog and others in a custom contract store. These contract documents can then be stored in a data store such as Azure Cosmos DB or even a SQL database. We learned about the components required to build, write, and read these contracts through an API, and to search them through the data mesh portal.

In the next chapter, we will look at the next important feature of a data mesh, which is data quality management.
