Turning data quality into SLAs

Trust is an essential element for the consumer. The producer is not always fully aware of the data they are handling. Indeed, the producer can be seen as an executor building an application on behalf of the consumer. They are not a domain expert and work in a black-box environment: on the one hand, the producer doesn't fully understand the objectives of their work; on the other hand, the consumer has no control over the supply chain.

Psychological barriers can also come into play. The consumer and the producer may never have met. For instance, the head of sales waiting for their weekly sales report doesn't know the data engineer in charge of the Spark ingestion into the data lake. The level of trust between the different stakeholders can be low from the start, and the relationship can become strained and compromised. This kind of trust issue is difficult to solve, and confidence, once lost, is extremely hard to restore. It triggers a negative spiral that can have significant consequences for the company.

As stated in the Identifying information bias in the data section, there is a fundamental asymmetry of information and responsibilities between producers and consumers. To address this, data quality must be formalized as a service-level agreement (SLA) between the parties. Let's delve into this key component of data quality.

An agreement as a starting point

An SLA serves as a contract between the producer and the consumer, establishing the expected data quality level of a data asset. Concretely, with an SLA, the data producer commits to delivering the level of quality the consumer expects from the data source. The goals of the SLA are manifold:

  • Firstly, it makes the producer aware of the level of quality they are expected to deliver.
  • Secondly, the contractual engagement requires producers to take responsibility for the quality of the processes they deliver.
  • Thirdly, it strengthens the consumer's trust in the outcomes of the pipeline, easing the burden of cumbersome double-checks. The consumer puts their confidence in the contract.

In essence, data quality can be viewed as an SLA between the data producer and the data consumer. An SLA is not something you measure directly, but it drives the team's ability to define proper objectives, as we will see in the next section.
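To make this concrete, here is a minimal sketch, in Python, of how such an agreement could be captured in code. The DataSLA class, its fields, and the example values are illustrative assumptions, not a construct defined in this book:

from dataclasses import dataclass, field

@dataclass
class DataSLA:
    # A data quality agreement between one producer and one consumer
    producer: str
    consumer: str
    dataset: str
    expectations: dict = field(default_factory=dict)

# Example: the churn prediction agreement discussed later in this section
churn_sla = DataSLA(
    producer="data_engineering",
    consumer="churn_prediction_team",
    dataset="crm.transactions",
    expectations={
        "freshness": "all transactions from the last 2 weeks",
        "completeness": "all transactions of customers aged 18 and over",
    },
)

Writing the agreement down in a structured form like this also supports the documentation and traceability needs discussed later in this section.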

The incumbent responsibilities of producers

To support those SLAs, the producer must establish service-level objectives (SLOs). These objectives are targets the producer sets in order to meet the requirements of the SLA.

However, it is important to stress that those quality targets can differ for each contract. Let's imagine a department store running several data science projects. The marketing team oversees two data products based on the central data lake of all store sales. The first is a machine learning model that forecasts sales of the latest children's toy, while the second is a report on all cosmetic products. The number of kids in the household may be an important feature for the first model, while it is of no importance to the second report. The SLAs linked to the first project therefore differ from those of the second. However, the producer is responsible for providing both project teams with a good set of data. That said, the producer has to consolidate all agreements to establish SLOs that fulfill all their commitments.

This leads to the notion of vertical SLAs versus transversal SLOs. An SLA is considered vertical because it involves a one-to-one relationship between the producer and the consumer. In contrast, an SLO is transversal because, for a specific dataset, one SLO can be used to fulfill one or several SLAs. Figure 1.11 illustrates these principles. The producer is engaged in a contractual relationship with the consumer, while the SLO directs the producer's monitoring plan. We can say that a data producer, for a specific dataset, has a unique set of SLOs that fits all the SLAs:

Figure 1.11 – SLAs and SLOs in a data pipeline

The agreement concerns a bilateral relationship between a producer and their consumer. An easy example is as follows: "To run my churn prediction model, I need all the transactions performed by 18+ people in the last 2 weeks." In this configuration, the data producer knows that the CRM data must contain all transactions from the last 2 weeks made by people aged 18 and over. They can now create the following objectives:

  • The data needs to contain transactions from the last 2 weeks
  • The data needs to contain transactions of customers who are 18 and over
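As a quick preview of how such objectives could translate into executable checks, the following Python snippet uses pandas on a hypothetical transactions table; the column names, the sample data, and the proxy checks are assumptions made for illustration:

import pandas as pd

# Hypothetical extract of the CRM transaction data
transactions = pd.DataFrame({
    "customer_age": [25, 42, 31, 58],
    "transaction_date": pd.to_datetime(
        ["2023-12-01", "2023-12-05", "2023-12-08", "2023-12-10"]
    ),
})

# Reference date fixed for reproducibility of the example
today = pd.Timestamp("2023-12-12")

# Objective 1 (proxy check): the most recent record falls within the last 2 weeks
cutoff = today - pd.Timedelta(weeks=2)
freshness_ok = transactions["transaction_date"].max() >= cutoff

# Objective 2: every record belongs to a customer aged 18 or over
age_ok = (transactions["customer_age"] >= 18).all()

print(f"2-week freshness objective met: {freshness_ok}")
print(f"18+ age objective met: {age_ok}")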

We’ll see how indicators can be set to validate the SLO later.

The SLA is a contract between both parties, whereas the SLO is set by the producer to ensure they can respect all SLAs. This means that an SLO can serve several agreements. If another project requires the transaction data of 18+ customers over the last 3 weeks, the data team can simply adjust its target and set the limit to 3 weeks.
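A minimal sketch of this consolidation, assuming each SLA differs only in its lookback window (the project names and values below are illustrative):

# Lookback windows, in weeks, required by each SLA on the same dataset
sla_windows_weeks = {
    "churn_prediction": 2,
    "new_project": 3,
}

# One transversal SLO must satisfy every vertical SLA,
# so the producer keeps the most demanding requirement
slo_window_weeks = max(sla_windows_weeks.values())

print(f"SLO lookback window: {slo_window_weeks} weeks")  # 3 weeks

The design choice here is that the SLO is derived from the SLAs, not the other way around: whenever an agreement is added or revised, the producer recomputes the most demanding target.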

The advantage of thinking of data quality as a service is that, by knowing which dimensions of quality your consumers need, you can provide the right level of service to achieve the objectives. As we'll see later, this helps a lot with cost and resource optimization.

Considerations for SLOs and SLAs

To ensure SLAs and SLOs remain effective and relevant, it is important to pay attention to several aspects.

First, as business requirements change over time, it is essential for both parties to regularly review the existing agreements and adjust objectives accordingly. Collaboration and communication are key to ensuring the objectives are always well aligned with consumer expectations.

Documentation and traceability are also important when it comes to agreements and objectives. By documenting them, all parties can maintain transparency and track changes over time. This can be helpful in the case of misalignment or misunderstandings.

Also, to encourage producers to respect their agreements, incentives and penalties can be introduced, depending on whether the agreements are met.

What is important, and at the core of the SLA/SLO framework, is the ability to measure performance against the established SLOs. It is crucial to set up metrics that assess the data producer's performance.

To be implemented, objectives have to be aligned with indicators. In the next section, we will see which indicators are important for measuring data quality.
