Moving Beyond the Basics

In the previous chapters, we discussed the basic tenets of data engineering and our opinionated approach to the Pragmatic Data Platform (PDP), and we used basic and advanced dbt functionality to implement it in its basic form.

In this chapter, you will review best practices for applying modularity in your pipelines to simplify their evolution and maintenance.

Next, you will learn how to manage the identity of your entities, which is central both to storing changes to them and to applying master data management to combine data from different systems.

We will also use macros, dbt's most powerful feature, to implement the first pattern for storing and retrieving changes in our data, in line with our discussion of identity management. Packaging patterns as macros lets every developer apply the best practices that senior colleagues have developed.

In this chapter, you will learn about the following topics:

  • Building for modularity
  • Managing identity
  • Master data management...

Technical requirements

This chapter builds on the concepts from the previous chapters, including the basic dbt functions, the description of our target architecture, and the sample project.

All the code samples for this chapter are available on GitHub at https://github.com/PacktPublishing/Data-engineering-with-dbt/tree/main/Chapter_13.

Building for modularity

By now, you should be familiar with the layers of our Pragmatic Data Platform (PDP) and the core principles of each layer:

Figure 13.1: The layers of the Pragmatic Data Platform


Let’s quickly recap them:

  • Storage layer: Here, we adapt incoming data to how we want to use it, without changing its semantics, and we store all the data: the good, the bad, and the ugly.

The core principle is to isolate the platform state here – that is, the models that depend on previous runs, such as snapshots or incremental models – so that the next layers can be rebuilt from scratch on top of the storage layer.

The perspective is source-centric, with one load pipeline for each source table (see the sketch after this list).

  • Refined layer: Here, we apply master data and implement business rules.

The core principle here is to apply modularity while building more abstract and general-use concepts on top of the simpler, source-centric ones from...
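
To make the layering concrete, here is a minimal sketch of how the two layers might translate into dbt models: a source-centric staging model in the storage layer that only adapts and annotates the incoming data, and a refined model built on top of it through ref(). The source name ('sales'), the model names (STG_ORDERS, REF_ORDERS), and the columns are illustrative assumptions, not the book's sample project.

-- models/staging/STG_ORDERS.sql  (storage layer; all names here are illustrative)
SELECT
    ORDER_ID,
    SOURCE_SYSTEM,
    CUSTOMER_CODE,
    PRODUCT_CODE,
    ORDER_DATE,
    AMOUNT,
    '{{ invocation_id }}'   AS LOAD_ID,       -- lineage metadata; the semantics of the data are unchanged
    CURRENT_TIMESTAMP()     AS LOAD_TS_UTC
FROM {{ source('sales', 'orders') }}

-- models/refined/REF_ORDERS.sql  (refined layer; depends only on storage-layer models)
SELECT *
FROM {{ ref('STG_ORDERS') }}

Because REF_ORDERS reads only from the storage layer through ref(), dropping and rebuilding the refined and delivery layers never touches the stored state.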

Managing identity

Identity is probably the single most important concept in data management, even if, at its core, it is extremely simple: being able to identify which instance of an entity some data refers to.

The problems with identity arise from the fact that we humans are very flexible in using the available information: we can easily put a piece of information in the right bucket, even when it is presented to us in the wrong way.

We are not good at being consistent in the real world, but that is not a problem for us because our own “data processing” is very flexible. We can easily recognize that two names refer to the same person, whether they are written in uppercase or lowercase, with or without the middle initial, or with the name and surname inverted.

Machines are fast, but not as good at coping with the variability of information as we are, so indicating how to identify instances of an entity, whether they are people, products, documents...
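
In dbt, making identity explicit usually means computing a hash (surrogate) key from the normalized business key columns, so that trivial variations map to the same instance. Here is a minimal sketch, assuming the dbt_utils package is installed; the source and column names are illustrative.

-- Illustrative only: build a consistent hash key for the customer entity.
-- UPPER(TRIM(...)) normalizes the raw values so that 'acme ' and 'ACME' produce the same key.
SELECT
    {{ dbt_utils.generate_surrogate_key([
        "UPPER(TRIM(SOURCE_SYSTEM))",
        "UPPER(TRIM(CUSTOMER_CODE))"
    ]) }}                   AS CUSTOMER_HKEY,
    s.*
FROM {{ source('crm', 'customers') }} AS s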

Master data management

When we refer to Master Data, we are talking about the descriptive data at the core of an organization, together with the processes that ensure we can tell when different units are referring to the same instance of a concept.

To a large extent, Master Data in the data platform realm overlaps with the dimensions that describe the organization's core concepts, such as customer, product, and employee.

In the rare cases where we have a Master Data dimension, it adds the semantics of containing the “golden records” selected to represent the instances of the business concept (that is, the entity) described by the dimension.

As an example, the product MD dimension (MDD_PRODUCT for us in the REF layer) contains the golden records of the products, starting with the codes used as the primary key (PK) of the entity and continuing with the values of the other columns.

Quite often, we will have only a list of MD codes, possibly with names, and mapping tables that...
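
As a hedged illustration of how such a mapping table is typically used, the sketch below conforms a source system's product code to the master data code before the MD dimension is joined; the MAP_PRODUCT_CODES model and all column names are assumptions for the example, not the book's actual models.

-- Illustrative only: conform a local product code to the master data code.
SELECT
    o.ORDER_ID,
    COALESCE(m.MASTER_PRODUCT_CODE, o.PRODUCT_CODE) AS PRODUCT_CODE,  -- fall back to the local code when no mapping exists
    o.AMOUNT
FROM {{ ref('STG_ORDERS') }} AS o
LEFT JOIN {{ ref('MAP_PRODUCT_CODES') }} AS m
    ON  m.SOURCE_SYSTEM       = o.SOURCE_SYSTEM
    AND m.SOURCE_PRODUCT_CODE = o.PRODUCT_CODE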

Saving history at scale

In Chapter 6, we saw that storing the data we work on gives us many benefits, the biggest being the ability to build a simple platform where state is confined to the storage layer and the refined and delivery layers are stateless.

Back then, we introduced dbt's snapshot feature and showed you a second way of storing change history, based on incremental models, that is simple, quick, and free of snapshots' architectural limitation of being shared across environments:

Figure 13.6: Storage layer highlighted in the context of the Pragmatic Data Platform


In this section, while looking at the three-layer architecture of our Pragmatic Data Platform, we are going to discuss our preferred way to store incoming data: HIST models, which store all the versions of your data using the most efficient features of Snowflake.

We will create HIST models as they are simple, efficient, flexible, and resilient, and they are the best solution...
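
To give a feel for the idea before the pattern is developed in full, here is a simplified sketch of a HIST model as an incremental model: it appends a row only when the hash of an entity's payload (often called HDIFF) is not already stored. The model and column names are assumptions, and the change test is deliberately simplified compared to the pattern built later with macros.

-- Illustrative only: a simplified incremental HIST model.
{{ config(materialized='incremental') }}

WITH src AS (
    SELECT
        CUSTOMER_HKEY,
        {{ dbt_utils.generate_surrogate_key(['CUSTOMER_NAME', 'COUNTRY']) }} AS CUSTOMER_HDIFF,  -- payload hash
        CUSTOMER_NAME,
        COUNTRY,
        LOAD_TS_UTC
    FROM {{ ref('STG_CUSTOMERS') }}
)
SELECT * FROM src
{% if is_incremental() %}
WHERE NOT EXISTS (    -- keep only key/payload combinations not stored yet (simplified change test)
    SELECT 1
    FROM {{ this }} AS h
    WHERE h.CUSTOMER_HKEY  = src.CUSTOMER_HKEY
      AND h.CUSTOMER_HDIFF = src.CUSTOMER_HDIFF
)
{% endif %}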

Summary

In this chapter, you learned how to build modular pipelines based on well-thought-out keys that reflect the desired semantics of entities.

You learned how to combine entities from different systems and how to store their changes simply and effectively, according to their semantics.

With the knowledge you’ve gained in this chapter, you are now able to build modern, maintainable, and scalable data platforms with hundreds of entities and hundreds of millions, or even billions, of rows in the major tables.

In the next chapter, we will wrap up our sample project and use it to discuss a few more topics that come in handy when delivering a real-life project.
