
Data Observability Elements

In the previous chapter, we covered the methods that can be used to collect observability metrics in the context of a data application. We will now focus on the observations themselves. What do you need to collect to keep the data application under control?

In the general observability paradigm described in Chapter 2, Fundamentals of Data Observability, which covers the data, the application, and the application's infrastructure, we saw that observability metrics can be gathered from diverse sources. In Chapter 3, Data Observability Techniques, we learned how to extract information directly from data applications. In this chapter, we will focus on which metrics can be collected from the data application itself. We will list and describe all the elements that can be used as service-level indicators (SLIs) of the data, and we will learn how to add these SLIs as we go through the chapter.

Using an open source library, based on the monkey patching methods presented...

Technical requirements

To be able to run the example provided in this chapter, you will need a Python environment with the pip package manager installed. The example has been tested with Python 3.7+. You can find this book's GitHub repository at https://github.com/PacktPublishing/Data-Observability-for-Data-Engineering.

Prerequisites and installation requirements

This chapter introduces many concepts that are used in the kensu-py open source library. If you want to follow how the log file was generated, please refer to the notebook in this book's GitHub repository, in the Chapter 4 section.

If you are familiar with Python and want to run the example by yourself, we advise you to create a virtual environment; see https://docs.python.org/3/library/venv.html for more details.

To install the necessary libraries for this chapter, run pip install -r requirements.txt from the directory of this book's repository.

Kensu – a data observability framework

In this chapter, we will be working with an open source library called Kensu. It is a Python library that interacts with data transformation libraries to generate observations and send them to the Kensu platform.

Kensu allows you to collect and process data observations through the Kensu platform. You can try out the...
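Before diving into the elements themselves, here is a minimal sketch of the monkey patching technique that such a library relies on, assuming only plain pandas and the standard logging module; the wrapper and the observation fields are illustrative assumptions, not Kensu's actual API:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_observability")

# Keep a reference to the original function before patching it
_original_read_csv = pd.read_csv

def observed_read_csv(path, *args, **kwargs):
    """Wrapper that records a data source observation on every CSV read."""
    df = _original_read_csv(path, *args, **kwargs)
    # Illustrative observation payload: location, schema, and row count
    logger.info(
        "data source read: location=%s columns=%s rows=%d",
        path, list(df.columns), len(df),
    )
    return df

# Monkey patch: every subsequent pd.read_csv call is now observed
pd.read_csv = observed_read_csv
```

Once patched, existing application code keeps calling pd.read_csv unchanged, which is what makes this collection approach non-intrusive.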

Static and dynamic elements

First, let's focus on what we consider a data observability element in this case. A data observability element is a piece of data you can retrieve from the running application that aims to make the pipeline observable. If it can be monitored, the same element can then become an SLI.

It’s important to make a clear distinction between two categories of observations: static and dynamic.

The set of static elements represents the assets, whereas the set of dynamic elements represents the usages of those assets. For instance, the application itself falls into the static category, while the application run falls into the dynamic one.

The static elements correspond to all the observations that can be manually reported by a human documenting their data usage because they represent assets that are located, (virtually) accessible, and can be used or reused. Dynamic observations are often linked to the execution or usage of static elements and...
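To make the distinction concrete, here is a minimal sketch of how the two categories could be modeled; the entity names and fields are illustrative assumptions, not a schema prescribed by the book:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Static element: an asset that exists independently of any execution
@dataclass(frozen=True)
class Application:
    name: str
    version: str
    code_repository: str

# Dynamic element: a usage of that asset, tied to a point in time
@dataclass
class ApplicationRun:
    application: Application  # link back to the static asset
    executed_by: str
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

app = Application("orders_pipeline", "1.2.0", "github.com/acme/orders")
run = ApplicationRun(application=app, executed_by="sammy")
```

The same static Application can be referenced by many ApplicationRun records, one per execution.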

Defining the data observability context

Following the data observability principles, the context of data manipulation is important. Now is a good time to define what we mean by context in data observability. We can define the context as the set of circumstances of the data transformations – in other words, the metadata that can help you understand how and where the data transformation or manipulation happened. The context will tell you which application manipulated the data, when it was manipulated, who executed the manipulation, what triggered it, and so on. This context should give you all the necessary pieces of information while you're debugging the code or the data issue, both upstream (root cause analysis) and downstream (impact analysis).

Long story short, the context is the background of the application. It starts at the beginning of the script or program execution and lasts until all the data transformations the application was supposed to perform...
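Such a context can be assembled from the runtime itself. The following sketch gathers a few of the who, where, and when pieces mentioned above using only the Python standard library; the field names are our own:

```python
import getpass
import socket
import sys
from datetime import datetime, timezone

def capture_context(application_name: str) -> dict:
    """Collect the circumstances of a run: which app, who, where, when."""
    return {
        "application": application_name,
        "python_version": sys.version.split()[0],
        "executed_by": getpass.getuser(),
        "host": socket.gethostname(),
        "started_at": datetime.now(timezone.utc).isoformat(),
    }

context = capture_context("orders_pipeline")
print(context)
```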

Getting the metadata of the data sources

The fuel for the data application is the data itself. The data sources that are used in the application have to be correctly identified in the logs. If an issue occurs at a data source and you need to perform deeper analyses, you would expect information that will help you retrieve the data. In this section, we will see how a data source can be identified.

Data source

To identify the data that's used by an application, we need to define the metadata of the data source. Metadata is the data about the data.

The metadata of the data source is all the elements that will allow you to recognize the data source. Let's explore them (a short collection sketch follows this list):

  • The file’s location: This gives you the address of the data source and helps you retrieve the data in case you need it. The file location can be the path on your local filesystem or the filesystem of the company. It can also be a connection string if the data is in a table located...
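As a minimal sketch, here is how such metadata could be collected for a local CSV file, assuming pandas is available; the chosen fields (location, format, size, and schema) are illustrative:

```python
import os

import pandas as pd

def data_source_metadata(path: str) -> dict:
    """Extract identifying metadata for a CSV data source."""
    df = pd.read_csv(path)
    return {
        "location": os.path.abspath(path),  # where to retrieve the data
        "format": "csv",
        "size_bytes": os.path.getsize(path),
        # The schema: field names and their inferred types
        "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }
```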

Mastering lineage

Lineage, or process lineage, is the action performed by a data application on the data sources' schemas. Lineage is a link between inputs and outputs, often one or several input schemas and an output schema.

It expresses what happens to the data inside a specific application. By extension, the lineage of a data source is the set of all the transformations that resulted in creating that data source, together with all the computations or manipulations that are based on it.

As we stated previously, lineage is a link between schemas. These schemas can come from the same data source. For instance, creating a new column inside a SQL table creates a new schema inside the table that is fed by data coming from another schema of the data source.

Lineage is a unique combination of data flows – a data flow being a one-to-one relationship between an input schema and an output schema that occurs inside the application. Without the application, there cannot be any lineage...
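As an illustration, a lineage could be recorded as a set of schema references linked by the application that connects them; the structures below are a sketch of the idea, not a formal model from the book:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaRef:
    """A schema identified by its data source and its field names."""
    data_source: str
    fields: tuple

@dataclass(frozen=True)
class Lineage:
    """Data flows linking one or several input schemas to an output schema."""
    application: str
    inputs: tuple
    output: SchemaRef

orders = SchemaRef("orders.csv", ("order_id", "amount", "date"))
customers = SchemaRef("customers.csv", ("customer_id", "country"))
report = SchemaRef("report.csv", ("country", "total_amount"))

lineage = Lineage(
    application="orders_pipeline",
    inputs=(orders, customers),
    output=report,
)
```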

Computing observability metrics

The following data observability elements are known as data quality metrics. This category groups everything we consider to be observability metrics: statistics related to the data you manipulate (a short computation sketch follows this list):

  • Distribution observations: Minimum, maximum, mean, standard deviation, skewness and kurtosis, quantiles, and so on
  • Categorical stats: Number of categories, percentage of each category, and so on
  • Completeness observations: Number of rows and number of missing values
  • Freshness information: Timestamp of the data itself
  • KPIs: Key performance indicators and other custom metrics worth checking, for technical or business purposes
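The following sketch computes a few of these metrics with pandas on a small illustrative DataFrame; the metric names are our own:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [10.0, 25.5, None, 40.0],
    "country": ["BE", "IT", "BE", "FR"],
    "updated_at": pd.to_datetime(["2023-11-01", "2023-11-02",
                                  "2023-11-02", "2023-11-03"]),
})

metrics = {
    # Distribution observations
    "amount.min": df["amount"].min(),
    "amount.max": df["amount"].max(),
    "amount.mean": df["amount"].mean(),
    "amount.stddev": df["amount"].std(),
    # Categorical stats: percentage of each category
    "country.distribution": df["country"].value_counts(normalize=True).to_dict(),
    # Completeness observations
    "rows": len(df),
    "amount.missing": int(df["amount"].isna().sum()),
    # Freshness: timestamp of the data itself
    "freshness": df["updated_at"].max().isoformat(),
}
print(metrics)
```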

The metrics you compute depend on the circumstances and need to be linked to the context in which they were computed. These metrics can change depending on how the data is used, the filters you applied, and the application run. Figure 4.7 shows an example of multiple contexts for...

Data observability for AI models

Here, we would like to mention some specific elements of data observability that focus on AI and ML methods. You can see the model as a particular case of a data source. The model is the output of a lineage. For this, you can follow the second notebook for this chapter, Orders_predict.

Let’s look at the different components of ML observability.

Model method

The model method is the name of the method that we use to apply the transformation that creates the model. It will be, for instance, the name of the scikit-learn class you use or a more generic method name:

  • Example of a library method: Scikit::LinearRegression()
  • Example of a generic method: Random forest

The method is the ingredient you use to create the model data source. Inside the same application, you can try several methods and compare them. At this point, you must link the method to the right lineage. To do so, you must use the model training entity.
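As a sketch of that linkage, the model method name can be derived from the estimator class and attached to the lineage that produced the model; the record structure below is an illustrative assumption, with scikit-learn used only because it is the example library mentioned above:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# Derive the model method name from the estimator class itself
model_method = f"{type(model).__module__}.{type(model).__qualname__}"

# Attach the method to the lineage that produces the model data source
model_training = {
    "method": model_method,
    "lineage": {
        "inputs": ["orders.csv"],
        "output": "orders_model.pkl",  # the model as a data source
    },
}
print(model_training)
```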

...

Summary

In this chapter, we covered the important elements that we need to collect to implement observability at the data level from within the application. This observability was exposed in a data model, where we distinguished several categories of observations.

First are the elements related to the context – that is, which application is running, what version it is, who created it and who runs it, and where and when it was run. These elements are important for creating a structure around the data transformations. Second is the data itself. We saw that the metadata can be defined by some attributes of the data source and its schema. Third are the data transformations and operations, which we have described as lineages. These lineages are also the link between the data sources, their schemas, and their applications. Finally, once we have associated the lineage with the right execution, some observation metrics can be computed.

We also looked at some specific elements related...
