
Data Observability Elements

In the previous chapter, we covered the methods that can be used to collect observability metrics in the context of a data application. We will now focus on the observations themselves. What do you need to collect to keep the data application under control?

In the general observability paradigm described in Chapter 2, Fundamentals of Data Observability, which covers the data, the application, and the application's infrastructure, we saw that observability metrics can be gathered from diverse sources. In Chapter 3, Data Observability Techniques, we learned how to extract information directly from data applications. In this chapter, we will focus on which metrics can be collected from the data application itself. We will list and describe all the elements that can be used as service-level indicators (SLIs) of the data, and we will learn how to add these SLIs as we go through the chapter.

Using an open source library, based on the monkey patching methods presented...

Technical requirements

To be able to run the example provided in this chapter, you will need a Python environment with the pip package manager installed. The example has been tested with Python 3.7+. You can find this book's GitHub repository at https://github.com/PacktPublishing/Data-Observability-for-Data-Engineering.

Prerequisites and installation requirements

This chapter introduces many concepts that are used in the kensu-py open source library. If you want to follow how the log file was generated, please refer to the notebook in this book's GitHub repository, in the Chapter 4 section.

If you are familiar with Python and want to run the example by yourself, we advise you to create a virtual environment; see https://docs.python.org/3/library/venv.html for more details.

To install the necessary libraries for this chapter, run pip install -r requirements.txt from the directory of this book's repository.

Kensu – a data observability framework

In this chapter, we will be working with an open source library called Kensu. It is a Python library that interacts with data transformation libraries to generate observations and send them to the Kensu platform.

Kensu allows you to collect and process data observations through the Kensu platform. You can try out the...
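Before diving into the elements themselves, here is a minimal sketch of the monkey patching technique that such a library relies on, assuming only plain pandas and the standard logging module; the wrapper and the observation fields are illustrative assumptions, not Kensu's actual API:

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_observability")

# Keep a reference to the original function before patching it
_original_read_csv = pd.read_csv

def observed_read_csv(path, *args, **kwargs):
    """Wrapper that records a data source observation on every CSV read."""
    df = _original_read_csv(path, *args, **kwargs)
    # Illustrative observation payload: location, schema, and row count
    logger.info(
        "data source read: location=%s columns=%s rows=%d",
        path, list(df.columns), len(df),
    )
    return df

# Monkey patch: every subsequent pd.read_csv call is now observed
pd.read_csv = observed_read_csv
```

Once patched, existing application code keeps calling pd.read_csv unchanged, which is what makes this collection approach non-intrusive.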

Static and dynamic elements

First, let's focus on what we consider a data observability element in this case. A data observability element is a piece of data you can retrieve from the running application that aims to make the pipeline observable. If it can be monitored, the same element can then become an SLI.

It’s important to make a clear distinction between two categories of observations: static and dynamic.

The set of static elements represents the assets, whereas the set of dynamic elements represents the usages of those assets. For instance, the application itself falls into the static category, while the application run falls into the dynamic one.

The static elements correspond to all the observations that can be manually reported by a human documenting their data usage because they represent assets that are located, (virtually) accessible, and can be used or reused. Dynamic observations are often linked to the execution or usage of static elements and...
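To make the distinction concrete, here is a minimal sketch of how the two categories could be modeled; the entity names and fields are illustrative assumptions, not a schema prescribed by the book:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Static element: an asset that exists independently of any execution
@dataclass(frozen=True)
class Application:
    name: str
    version: str
    code_repository: str

# Dynamic element: a usage of that asset, tied to a point in time
@dataclass
class ApplicationRun:
    application: Application  # link back to the static asset
    executed_by: str
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

app = Application("orders_pipeline", "1.2.0", "github.com/acme/orders")
run = ApplicationRun(application=app, executed_by="sammy")
```

The same static Application can be referenced by many ApplicationRun records, one per execution.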

Defining the data observability context

Following the data observability principles, the context of data manipulation is important. Now is a good time to define what we mean by context in data observability. We can define the context as the set of circumstances of the data transformations – in other words, the metadata that can help you understand how and where the data transformation or manipulation happened. The context will tell you which application manipulated the data, when it was manipulated, who executed the manipulation, what triggered it, and so on. This context should give you all the necessary pieces of information while you're debugging the code or the data issue, both upstream (root cause analysis) and downstream (impact analysis).

Long story short, the context is the background of the application. It starts at the beginning of the script or program execution and lasts until all the data transformations the application was supposed to perform...
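Such a context can be assembled from the runtime itself. The following sketch gathers a few of the who, where, and when pieces mentioned above using only the Python standard library; the field names are our own:

```python
import getpass
import socket
import sys
from datetime import datetime, timezone

def capture_context(application_name: str) -> dict:
    """Collect the circumstances of a run: which app, who, where, when."""
    return {
        "application": application_name,
        "python_version": sys.version.split()[0],
        "executed_by": getpass.getuser(),
        "host": socket.gethostname(),
        "started_at": datetime.now(timezone.utc).isoformat(),
    }

context = capture_context("orders_pipeline")
print(context)
```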

Getting the metadata of the data sources

The fuel for the data application is the data itself. The data sources that are used in the application have to be correctly identified in the logs. If an issue occurs at a data source and you need to perform deeper analyses, you would expect information that will help you retrieve the data. In this section, we will see how a data source can be identified.

Data source

To identify the data that's used by an application, we need to define the metadata of the data source. Metadata is the data about the data.

The metadata of the data source is all the elements that will allow you to recognize the data source. Let's explore them (a short collection sketch follows this list):

  • The file’s location: This gives you the address of the data source and helps you retrieve the data in case you need it. The file location can be the path on your local filesystem or the filesystem of the company. It can also be a connection string if the data is in a table located...
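As a minimal sketch, here is how such metadata could be collected for a local CSV file, assuming pandas is available; the chosen fields (location, format, size, and schema) are illustrative:

```python
import os

import pandas as pd

def data_source_metadata(path: str) -> dict:
    """Extract identifying metadata for a CSV data source."""
    df = pd.read_csv(path)
    return {
        "location": os.path.abspath(path),  # where to retrieve the data
        "format": "csv",
        "size_bytes": os.path.getsize(path),
        # The schema: field names and their inferred types
        "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }
```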

Mastering lineage

Lineage, or process lineage, is the action performed by a data application on the data sources' schemas. Lineage is a link between inputs and outputs, often one or several input schemas and an output schema.

It expresses what happens to the data inside a specific application. By extension, the lineage of a data source is the set of all the transformations that resulted in creating that data source, together with all the computations or manipulations that are based on it.

As we stated previously, lineage is a link between schemas. These schemas can come from the same data source. For instance, creating a new column inside a SQL table creates a new schema inside the table that is fed by data coming from another schema of the data source.

Lineage is a unique combination of data flows – a data flow being a one-to-one relationship between an input schema and an output schema that occurs inside the application. Without the application, there cannot be any lineage...
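As an illustration, a lineage could be recorded as a set of schema references linked by the application that connects them; the structures below are a sketch of the idea, not a formal model from the book:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SchemaRef:
    """A schema identified by its data source and its field names."""
    data_source: str
    fields: tuple

@dataclass(frozen=True)
class Lineage:
    """Data flows linking one or several input schemas to an output schema."""
    application: str
    inputs: tuple
    output: SchemaRef

orders = SchemaRef("orders.csv", ("order_id", "amount", "date"))
customers = SchemaRef("customers.csv", ("customer_id", "country"))
report = SchemaRef("report.csv", ("country", "total_amount"))

lineage = Lineage(
    application="orders_pipeline",
    inputs=(orders, customers),
    output=report,
)
```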

Computing observability metrics

The following data observability elements are known as data quality metrics. This category groups everything we consider to be observability metrics: statistics related to the data you manipulate (a short computation sketch follows this list):

  • Distribution observations: Minimum, maximum, mean, standard deviation, skewness and kurtosis, quantiles, and so on
  • Categorical stats: Number of categories, percentage of each category, and so on
  • Completeness observations: Number of rows and number of missing values
  • Freshness information: Timestamp of the data itself
  • KPIs: Key performance indicators and other custom metrics worth checking, for technical or business purposes
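The following sketch computes a few of these metrics with pandas on a small illustrative DataFrame; the metric names are our own:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [10.0, 25.5, None, 40.0],
    "country": ["BE", "IT", "BE", "FR"],
    "updated_at": pd.to_datetime(["2023-11-01", "2023-11-02",
                                  "2023-11-02", "2023-11-03"]),
})

metrics = {
    # Distribution observations
    "amount.min": df["amount"].min(),
    "amount.max": df["amount"].max(),
    "amount.mean": df["amount"].mean(),
    "amount.stddev": df["amount"].std(),
    # Categorical stats: percentage of each category
    "country.distribution": df["country"].value_counts(normalize=True).to_dict(),
    # Completeness observations
    "rows": len(df),
    "amount.missing": int(df["amount"].isna().sum()),
    # Freshness: timestamp of the data itself
    "freshness": df["updated_at"].max().isoformat(),
}
print(metrics)
```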

The metrics you compute depend on the circumstances and need to be linked to the context in which they were computed. These metrics can change depending on how the data is used, the filters you applied, and the application run. Figure 4.7 shows an example of multiple contexts for...

Data observability for AI models

Here, we would like to mention some specific elements of data observability that focus on AI and ML methods. You can see the model as a particular case of a data source. The model is the output of a lineage. For this, you can follow the second notebook for this chapter, Orders_predict.

Let’s look at the different components of ML observability.

Model method

The model method is the name of the method that we use to apply the transformation that creates the model. It will be, for instance, the name of the scikit-learn class you use or a more generic method name:

  • Example of a library method: Scikit::LinearRegression()
  • Example of a generic method: Random forest

The method is the ingredient you use to create the model data source. Inside the same application, you can try several methods and compare them. At this point, you must link the method to the right lineage. To do so, you must use the model training entity.
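As a sketch of that linkage, the model method name can be derived from the estimator class and attached to the lineage that produced the model; the record structure below is an illustrative assumption, with scikit-learn used only because it is the example library mentioned above:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()

# Derive the model method name from the estimator class itself
model_method = f"{type(model).__module__}.{type(model).__qualname__}"

# Attach the method to the lineage that produces the model data source
model_training = {
    "method": model_method,
    "lineage": {
        "inputs": ["orders.csv"],
        "output": "orders_model.pkl",  # the model as a data source
    },
}
print(model_training)
```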

...

Summary

In this chapter, we covered the important elements that we need to collect to implement observability at the data level from within the application. This observability was exposed in a data model, where we distinguished several categories of observations.

First are the elements related to the context – that is, which application is running, what version it is, who created it and who runs it, and where and when it was run. These elements are important for creating a structure around the data transformations. Second is the data itself. We saw that the metadata can be defined by some attributes of the data source and its schema. Third are the data transformations and operations, which we have described as lineages. These lineages are also the link between the data sources, their schemas, and their applications. Finally, once we have associated the lineage with the right execution, some observation metrics can be computed.

We also looked at some specific elements related...
