You're reading from Data Engineering with dbt

Product type: Book

Published in: Jun 2023

Publisher: Packt

ISBN-13: 9781803246284

Edition: 1st Edition

Concepts: Data Streaming

Author: Roberto Zagni

Data Modeling for Data Engineering

In this chapter, we will introduce what a data model is and why we need data modeling.

At the base of a relational database, there is the Entity-Relationship (E-R) model. Therefore, you will learn how you can use E-R models to represent data models that describe the data you have or want to collect.

We will present the E-R notation, cardinality, optionality, and the different levels of abstraction and types of keys that you can have in a data model. Throughout the examples, we will also introduce two notations commonly used in the industry.

We will explain a few special use cases of data models, such as weak entities or hierarchical relations, discussing their peculiarities or how they are usually implemented.

We will also introduce you to some common problems that you will face with your data, how to avoid them if possible, and how to recognize them if you cannot avoid them.

By the end of the chapter...

Technical requirements

This chapter does not require any previous knowledge of E-R models.

All code samples of this chapter are available on GitHub at https://github.com/PacktPublishing/Data-engineering-with-dbt/tree/main/Chapter_03.

What is and why do we need data modeling?

Data does not exist in a vacuum. Pure data without any surrounding knowledge rarely has any value. Data has a lot of value when you can put it into context and transform it into information.

Understanding data

The pure number 1.75, as you find it in a column of a database, by itself does not say much.

What do you think it represents?

It could be 1.75 meters, kilograms, gallons, seconds, or whatever unit you want to attach to it.

If instead of the pure number 1.75, you have 1.75 meters or 1.75 seconds, you already understand it much better, but you can’t really say that you know what this data is about yet.

If you have 1.75 meters in a column called width, then you know a bit more, but good luck guessing what that number really represents. If you also know it is in a table called car, product, or road, you can probably understand much better what it really represents.
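As a small illustration of the point above, the following Python sketch contrasts the bare number with the same number wrapped in its context. The Measurement class and its field names are hypothetical, chosen only to mirror the unit, column, and table mentioned in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A value together with the context that turns it into information."""

    value: float
    unit: str       # the unit of measure, e.g. "m" for meters
    attribute: str  # the column name, e.g. "width"
    entity: str     # the table name, e.g. "car"

    def describe(self) -> str:
        return f"{self.entity}.{self.attribute} = {self.value} {self.unit}"


bare = 1.75  # on its own, this could be meters, kilograms, seconds, ...
in_context = Measurement(value=1.75, unit="m", attribute="width", entity="car")

print(in_context.describe())  # car.width = 1.75 m
```

The value is identical in both cases; only the surrounding names make it possible to say what it measures.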

By following through this very simple example...

Conceptual, logical, and physical data models

Data models can be designed with slightly different notations, but no matter how you design them, a model that describes everything in your data project would be as complex as your database and become too big to be useful as a communication tool.

Furthermore, when working on a data engineering project, you have discussions with different people, and these discussions focus on different levels of detail with respect to the data, business, and technical aspects of the project.

It is common to refer to the following three types of data models, which differ in the level of detail:

Conceptual data model: This is the most abstract model, defining what will be in the domain of the project, providing the general scope
Logical data model: This model provides much greater detail, defining what the data will look like
Physical data model: This is the most detailed model, describing exactly how the data will be stored in the database...
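To make the difference in detail more concrete, here is a minimal sketch that expresses the same tiny domain at the three levels. The customer and order entities, their attributes, and the choice of SQLite as the target database are assumptions made for illustration only.

```python
import sqlite3

# Conceptual: only the entities in scope and how they relate.
conceptual = {
    "entities": ["customer", "order"],
    "relationships": [("customer", "places", "order")],
}

# Logical: attributes and keys, still independent of any specific database.
logical = {
    "customer": {"attributes": ["customer_code", "name"], "key": ["customer_code"]},
    "order": {
        "attributes": ["order_id", "customer_code", "order_date"],
        "key": ["order_id"],
    },
}

# Physical: the exact tables, column types, and constraints for one database.
physical_ddl = """
CREATE TABLE customer (
    customer_code TEXT PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id      INTEGER PRIMARY KEY,
    customer_code TEXT NOT NULL REFERENCES customer (customer_code),
    order_date    TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(physical_ddl)
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)  # ['customer', 'customer_order']
```

Note how a physical concern appears only at the last level: the order entity is stored as customer_order because ORDER is a reserved word in SQL.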

Entity-Relationship modeling

In the previous sections, we briefly defined the three components of an E-R model: entity, attribute, and relationship. We have also seen a few E-R diagrams, drawn with different notations such as UML or crow’s foot.

In this section, we will define in a bit more detail the E-R models and show how different cases can be represented in these two common notations (UML and crow’s foot).

Main notation

We have already introduced the three components of an E-R model. In this section, we explain how they are represented visually in E-R models:

Entity: This is represented as a box with the name of the entity inside

If attributes are shown, the entity name is at the top and often in bold or bigger and visually separated from the attributes

Attribute: This is represented by the attribute’s name inside the box of the entity it belongs to, with one name per line
Relationship: This is represented by a line joining...

Modeling use cases and patterns

This chapter introduces the use of E-R diagrams to represent conceptual and logical data models in data engineering projects, but we have seen that the same modeling concepts are also used to design data models for applications and everywhere we want to describe how different pieces of data are related.

In the next sections, we will present a few common use cases of data structures and their models, then some known problematic cases, and finally, we will point to the whole area of data model libraries.

Header-detail use case

One of the ubiquitous cases when working with data is the header-detail data structure. It is so common that I am pretty sure you have already seen it somewhere.

As an example, it is used in invoices and orders, and almost anywhere else where there is one document with multiple lines of detail inside. The common document info goes into the header entity with the detail lines represented by one or more weak entities depending on...
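A header-detail structure might be sketched as follows, using Python's built-in sqlite3 module. The table and column names (order_header, order_line, line_number, and so on) are hypothetical; the point is only that the detail line is a weak entity whose key includes the key of its header.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE order_header (
    order_id   INTEGER PRIMARY KEY,
    customer   TEXT NOT NULL,
    order_date TEXT NOT NULL
);
-- Weak entity: a line is identified only together with its header.
CREATE TABLE order_line (
    order_id    INTEGER NOT NULL REFERENCES order_header (order_id),
    line_number INTEGER NOT NULL,
    product     TEXT NOT NULL,
    quantity    INTEGER NOT NULL,
    PRIMARY KEY (order_id, line_number)
);
""")

# One document (header) with two lines of detail.
conn.execute("INSERT INTO order_header VALUES (1, 'ACME', '2023-06-01')")
conn.executemany(
    "INSERT INTO order_line VALUES (?, ?, ?, ?)",
    [(1, 1, "widget", 2), (1, 2, "gadget", 5)],
)

lines_per_order = conn.execute("""
    SELECT h.order_id, h.customer, COUNT(*) AS line_count
    FROM order_header AS h
    JOIN order_line AS l ON l.order_id = h.order_id
    GROUP BY h.order_id, h.customer
""").fetchall()
print(lines_per_order)  # [(1, 'ACME', 2)]

# A detail line cannot exist without its header.
try:
    conn.execute("INSERT INTO order_line VALUES (99, 1, 'orphan', 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True
```

The common document information is stored once in the header, while each line carries only what is specific to it plus the reference back to its header.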

Common problems in data models

Creating a correct data model is only the beginning of providing a good data platform, as users of these data models must also be aware of problems that can arise because of the nature of operations on relational models.

In the following section, we present the most common problems you will encounter.

Fan trap

The fan trap is a very common problem that can happen every time you have a join in a one-to-many relationship. It is not a problem of the relationship, but of how you might use it.

The fan trap problem causes the calculations done on measures joined from the one side of the one-to-many relationship to be wrong.

This is only a problem if you use a measure that is on the one side of a one-to-many relationship, when grouping/working at the granularity of the entities on the many side. This happens because the join will duplicate the measures to match the cardinality of the many side.
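The duplication described above can be reproduced with a minimal sketch using Python's built-in sqlite3 module. The tables, the shipping_cost measure, and the sample values are assumptions made for illustration, and the pre-aggregation shown at the end is one common way to avoid the problem, not necessarily the one the chapter goes on to use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_header (
    order_id      INTEGER PRIMARY KEY,
    shipping_cost REAL NOT NULL            -- measure on the "one" side
);
CREATE TABLE order_line (
    order_id    INTEGER NOT NULL,
    line_number INTEGER NOT NULL,
    amount      REAL NOT NULL,             -- measure on the "many" side
    PRIMARY KEY (order_id, line_number)
);
INSERT INTO order_header VALUES (1, 10.0), (2, 5.0);
INSERT INTO order_line VALUES (1, 1, 100.0), (1, 2, 50.0), (1, 3, 25.0), (2, 1, 40.0);
""")

# Correct: the measure is summed at its own granularity.
true_total = conn.execute(
    "SELECT SUM(shipping_cost) FROM order_header"
).fetchone()[0]

# Fan trap: the join repeats each header once per line before summing.
fanned_total = conn.execute("""
    SELECT SUM(h.shipping_cost)
    FROM order_header AS h
    JOIN order_line AS l ON l.order_id = h.order_id
""").fetchone()[0]

print(true_total, fanned_total)  # 15.0 35.0

# One way around it: bring the "many" side to header granularity before joining.
safe_rows = conn.execute("""
    SELECT h.order_id, h.shipping_cost, l.lines_amount
    FROM order_header AS h
    JOIN (
        SELECT order_id, SUM(amount) AS lines_amount
        FROM order_line
        GROUP BY order_id
    ) AS l ON l.order_id = h.order_id
    ORDER BY h.order_id
""").fetchall()
print(safe_rows)  # [(1, 10.0, 175.0), (2, 5.0, 40.0)]
```

Order 1 has three lines, so its shipping cost of 10.0 is counted three times in the joined query (30.0 + 5.0 = 35.0 instead of 15.0). Measures from the many side, such as amount, are not affected.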

Let’s look at a simple example based on...

Modeling styles and architectures

In this chapter, we have introduced data modeling and seen how it is used to describe the data we have or how we want it to become.

Data modeling is the basic tool to describe the data, but building a data platform takes more than describing the data hosted in the platform and the relationships between the various entities. As an example, how you load new data into the tables and how you update all the intermediate tables to keep your data marts up to date are key points that are not captured by a data model.

In the next chapter, we will look in more detail at the overall data life cycle, but in this section, we want to introduce a few design styles that are heavily connected to how you develop data models for your data platform.

The great news is that with dbt, there are no preferences or limitations, and you can adopt any of these different paradigms and implement whichever will work best.

We will start with the Kimball method, which introduces...

Summary

Congratulations, you are now able to draw and understand simple data models!

In this chapter, you have learned the bread and butter of data modeling using the E-R model at different levels, as well as how to express key ideas about your data.

You now know how to look at some of the most common data model patterns and you should not be tricked anymore by fan and chasm traps.

You have gone through the architectures and modeling styles in use today, learning about their pros and cons, and you have a better idea of the approach we will use in this book.

In the next chapter, Analytics Engineering as the New Core of Data Engineering, you will get a full picture of data engineering and then dive into the core of what we do with dbt: analytics engineering.

Roberto Zagni is a senior leader with extensive hands-on experience in data architecture, software development, and agile methodologies. Roberto is an Electronic Engineer by training with a special interest in bringing software engineering best practices to cloud data platforms and growing great teams that enjoy what they do. He has been helping companies to make better use of their data, and now to transition to cloud-based Data Automation with an agile mindset and proper software engineering tools and processes, also known as DataOps. Roberto also coaches data teams hands-on about practical data architecture and the use of patterns, testing, version control, and agile collaboration. Since 2019, his go-to tools have been dbt, dbt Cloud, and Snowflake or BigQuery.

