Data Engineering with dbt: A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL

By Roberto Zagni

Jun 2023 | 578 pages | 1st Edition

The Basics of SQL to Transform Data

This book is about engineering data platforms with dbt. When working with dbt, you write code in Structured Query Language (normally referred to as SQL, pronounced sequel or as its letters: es-qu-el), with a spritz of Jinja templating.

This means that when we are not discussing architectural and design topics, we will be reading and writing code in SQL, so you should get up to speed with SQL to work proficiently with dbt.

In this chapter, you will review, or learn from scratch, the basics of SQL, becoming familiar with this powerful language that you find everywhere when working with data.

First, we will introduce SQL and walk you through its core concepts, so that even if you are new to this language, you can start building your competence and follow the rest of this chapter and the book.

To complement this introduction, you will quickly be presented with all the main SQL commands, organized by category of use. This will ensure that you know the most relevant parts of SQL and you can use this chapter as a quick reference when you start writing your code.

You will then be guided through how to create an account in Snowflake, the leading cloud database that is very often used with dbt. You will use Snowflake in the cloud, without the need to install anything, and it is free for a trial period.

Once you have a Snowflake account, you will be guided through creating your first role, user, and database.

With your cloud database ready and the basic SQL commands presented, we will look deeper into the SELECT statement and the JOIN clause, as they will be crucial in working with dbt.

Analytical and window functions are presented at the end of the chapter in a simple way, suitable for beginners as well as mid- to senior-experienced SQL users; these advanced functionalities will power up your SQL skills, vastly widening your abilities.

To summarize, in this chapter, we are going to cover these topics:

  • Introducing SQL
  • SQL basics – core concepts and commands
  • Setting up a Snowflake database with users and roles
  • Querying data in SQL – syntax and operators
  • Combining data in SQL – the JOIN clause
  • Advanced – introducing window functions

Technical requirements

This chapter does not assume any prior SQL knowledge and presents information from the basics to intermediate level. If you know SQL well, you can skip this chapter or just browse the more advanced topics.

All code samples of this chapter are available on GitHub at https://github.com/PacktPublishing/Data-engineering-with-dbt/tree/main/Chapter_01.

To run the samples, later in this chapter, we will guide you through setting up a Snowflake account and roles and creating your first database.

Introducing SQL

SQL was created in the 70s and, by the end of the 80s, had become the de facto standard for interacting with Relational Databases (RDBs); it now powers most of the data management industry in the world.

Given its huge footprint and powerful abstractions, SQL has become a standard that anyone working with database systems eventually becomes familiar with. The expressive power of SQL is well understood, and knowledge of it is so widespread that it has been adopted beyond RDBs, with Database Management Systems (DBMSs) of all sorts providing a SQL interface even on top of many non-RDB systems.

Some of the great advantages of SQL are as follows:

  • The core SQL functionality was standardized in the 80s and yet SQL is still very much alive and well, evolving and adding new powerful functionalities as data management evolves while maintaining compatibility with previous versions.

Every database has its SQL quirks, but the logic is the same and most SQL code will work on multiple databases with little or no change.

Learn it now and use it forever, and with (almost) every database.

  • At its core, it has a simple, rigorous, and powerful syntax that reads like English sentences, so even non-tech people can grasp the basic idea, while professionals can express exactly what they want in a precise and concise way.

Most people can probably get a sense of what the following SQL does:

SELECT ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT
FROM ORDERS
WHERE YEAR(ORDER_DATE) = 2021;
  • With SQL, you work at the logical level, so you do not have to deal with implementation details, and it is a declarative language; you describe in a rigorous way what you want to achieve, not how to do it. The database engine has the freedom to store data, be implemented, and perform the request in its own way, as long as it produces the correct result according to SQL specifications.
  • With a single SQL statement, you can process one piece of data or billions, leaving the burden of finding the most effective way to the database and giving you some freedom from scale.

SQL basics – core concepts and commands

The SQL language is certainly big and complex and takes time to master fully, but surprisingly, you can get productive with the limited set of features that we will introduce in this book, and you will pick up the nuances when you need them while working on real use cases.

The goal of this chapter is not to make you a SQL expert but to get you started with it, getting deeper into the few commands that we will use often to make you quickly productive with dbt.

In this intro, we start with the most basic concepts of SQL and then provide an overview of the SQL commands by their categories of use. In the rest of this chapter, we will dig deeper into the commands that are mostly used when working with dbt: SELECT, JOIN, and analytical and window functions.

SQL core concepts

The main goal of SQL, as its name implies, is to allow users to query data that is contained in a database; SQL also provides all the commands to fully manage the database, allowing you to add, transform, delete, organize, and manage data and other database objects, such as users and roles.

The core concepts in SQL come from how an RDB is organized:

  • A database contains tables, organized in schemata (plural of schema)
  • Tables store data in rows that have one value for each column defined in the table
  • Columns have a name and can contain only data of the declared data type
  • To regulate access to the data, privileges are assigned to users and roles

You can see a database with schemata and tables in the following screenshot, which shows part of the sample database available in any Snowflake account:

Figure 1.1: Example database with schemata and tables

Let’s go through the core concepts of SQL, starting with the table.

What is a table?

The table is the most central concept in SQL, as it is the object that contains data.

A table in SQL is very close to the layman’s concept of a table, with data organized in columns and rows.

Columns represent the attributes that can be stored in a table, such as a customer’s name or the currency of an order, with each column defining the type of data it can store, such as text, a number, or a date.

In a table, you can have as many rows as you want, and you can keep adding more whenever you need. Each row stores one instance of the concept that the table represents, such as a specific order in the following example of a table for orders:

Order_ID | Customer_CODE | Total_amount | Currency
---------|---------------|--------------|---------
123      | ABC           | 100          | EUR
166      | C099          | 125.15       | USD

Table 1.1: Example order table

Looking at the previous example order table, we can see that the table has four columns that allow storing the four attributes for each row (order ID, customer code, total amount, and currency of the order). The table, as represented, has two rows of data, representing one order each.

The first row is not data; it is just a header that shows the column names and makes the table easier to read when printed out.

In SQL, a table is both the definition of its content (columns and their types) and the content itself (the data, organized by rows):

  • Table definition: It lists the columns that make up the table, and each column provides the data type and other optional details. The data type is mandatory and declares what values will be accepted in the column.
  • Table content: It is organized in rows, each row containing one value for each column defined for the table or null if no value is provided.
  • Data value: All the values in a column must be compatible with the type declared for the column. null is a special value that corresponds to the absence of data and is compatible with all data types.

When creating a table, we must provide a name for the table and its definition, which consists of at least column names and a type for each column; the data can be added at a different time.

In the following code block, we have a sample definition for an order table:

CREATE TABLE ORDERS (
   ORDER_ID NUMBER,
   CUSTOMER_CODE TEXT,
   TOTAL_AMOUNT FLOAT,
   ORDER_DATE DATE,
   CURRENCY TEXT DEFAULT 'EUR'
);

Tip

When we write some SQL, we will use Snowflake’s SQL syntax and commands. We will guide you on how to create a free Snowflake account where you can run the code shown here.

In this example, we see that the command to create a table reads pretty much like English and provides the name of the table and a list of columns with the type of data each column is going to contain.

You can see that for the CURRENCY column, of type text, this code also provides the default value EUR. Single quotes are used in SQL to delimit a piece of text, aka a string.
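
Since the data can be added at a different time, here is a minimal sketch of how a couple of rows could be added with the INSERT command (covered later in the DML section); the values come from the example order table, while the order dates are made up for illustration:

INSERT INTO ORDERS (ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT, ORDER_DATE)
VALUES (123, 'ABC', 100, '2021-07-15');   -- CURRENCY falls back to the default 'EUR'

INSERT INTO ORDERS (ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT, ORDER_DATE, CURRENCY)
VALUES (166, 'C099', 125.15, '2021-08-02', 'USD');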

View – a special way to look at some data

If you already have data and you want to make it available with some transformation or filtering, you can create a view. You can think of a view as a table whose column definitions and data both come from a query that reads one or more tables.

As an example, if you would like a short list of orders with an amount greater than 1,000, you could write the following query to create a BIG_ORDERS view:

CREATE VIEW BIG_ORDERS AS
SELECT * FROM ORDERS
WHERE TOTAL_AMOUNT > 1000;

In this example, we see that this simple create view statement provides the name for the view and uses a query, which is a SELECT statement, to define what data is made available by the view.

The query provides both the data, all the orders with a total amount greater than 1,000, and the column definitions. The * character – called star – is a shortcut for all columns in the tables read by the SELECT statement.

This is, of course, a naïve example, but throughout this book, you will see that combining tables that store data and views that filter and transform the data coming from tables and views is the bread and butter of working with dbt. Building one object on top of the previous allows us to take raw data as input and provide as output refined information that our users can easily access and understand.
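
To give a flavor of this layering, the following sketch queries the view just like a table and then defines a further view on top of it; the year filter and the BIG_ORDERS_2021 name are made up for illustration:

SELECT CUSTOMER_CODE, TOTAL_AMOUNT
FROM BIG_ORDERS
WHERE CURRENCY = 'EUR';

CREATE VIEW BIG_ORDERS_2021 AS
SELECT * FROM BIG_ORDERS
WHERE YEAR(ORDER_DATE) = 2021;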

Tip

When working with dbt, you will not need to write create table or create view statements, as dbt will create them for you. It is nevertheless good to become familiar with these basic SQL commands, as they are the commands executed in the database, and you will see them if you look at the logs.

Database and schema

We have seen that a database is a container for tables and views.

A DB can be further divided and organized using schema objects.

In real-life DBs, the number of tables can range from a few units to many thousands. Schemata act pretty much like folders, but cannot be nested, so you can always identify a table by its database, schema, and name.

In the following screenshot, we see part of the contents of the SNOWFLAKE_SAMPLE_DATA database, which is available in all Snowflake accounts:

Figure 1.2: Some schemata and tables of the SNOWFLAKE_SAMPLE_DATA database

The database and schemata, as table containers, are the main ways information is organized, but they are also used to apply security, access limitations, and other features in a hierarchical way, simplifying the management of big systems.

To create the TEST database and the SOME_DATA schema in it, we can use the following commands:

CREATE DATABASE TEST;
CREATE SCHEMA TEST.SOME_DATA;

The database.schema notation, also known as a fully qualified name, allows us to precisely describe in which database to create the schema and after its creation, uniquely identifies the schema.
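
As an illustration, the following sketch creates a table using its fully qualified name and then another one after setting the context with USE commands; the COUNTRIES and CURRENCIES tables are made up for this example:

CREATE TABLE TEST.SOME_DATA.COUNTRIES (COUNTRY_CODE TEXT, COUNTRY_NAME TEXT);

USE DATABASE TEST;
USE SCHEMA SOME_DATA;
CREATE TABLE CURRENCIES (CURRENCY_CODE TEXT, CURRENCY_NAME TEXT);  -- created in TEST.SOME_DATA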

Tip

While working with dbt, you will create a database or use an existing one for your project; dbt will create the required schema objects for you if you have not created them already.

A best practice in Snowflake is to have one database for each project or set of data that you want to keep separate for administration purposes. Databases and schemata in Snowflake are soft boundaries, as all the data from all the databases and schemata can be accessed if the user has the appropriate privileges.

In some other database systems, such as PostgreSQL, a database is a stronger boundary.

Users and roles

To control access to your data in SQL, you GRANT access and other privileges to both users and roles.

A user represents one individual user or service that can access the database, while a role represents a named entity that can be granted privileges.

A role can be granted to a user, providing them all the privileges associated with the role.

A role can also be granted to another role, building hierarchies that use simple basic roles to build more complex roles that are assigned to a user.

Using roles instead of granting privileges directly to users allows you to manage even a large number of users simply and consistently. In this case, roles act as labels for a set of privileges and become a way to manage groups of users to which we want to grant the same privileges. Changing the privileges granted to a role at any moment changes the privileges that the users receive from that role.

A typical pattern when working with dbt is to create a role for the dbt users and then assign it to the developers and the service user that the dbt program will use.

The following is an example of a simple setup with one role and a couple of users:

CREATE ROLE DBT_SAMPLE_ROLE;
CREATE USER MY_NAME;           -- Personal user
CREATE USER SAMPLE_SERVICE;    -- Service user
GRANT ROLE DBT_SAMPLE_ROLE TO USER MY_NAME;
GRANT ROLE DBT_SAMPLE_ROLE TO USER SAMPLE_SERVICE;

A more complex setup could have one role to read and one to write for each source system (represented by a schema with the data from the system), for the data warehouse (one or more schemata where the data is processed), and for each data mart (one schema for each data mart).

You could then control in much more detail who can read and write what, at the cost of more effort.
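
As a sketch of what such grants could look like in Snowflake, reusing the DBT_SAMPLE_ROLE role and the TEST.SOME_DATA schema from the previous examples (the exact set of privileges to grant depends on your project):

GRANT USAGE ON DATABASE TEST TO ROLE DBT_SAMPLE_ROLE;
GRANT USAGE ON SCHEMA TEST.SOME_DATA TO ROLE DBT_SAMPLE_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA TEST.SOME_DATA TO ROLE DBT_SAMPLE_ROLE;
GRANT CREATE TABLE, CREATE VIEW ON SCHEMA TEST.SOME_DATA TO ROLE DBT_SAMPLE_ROLE;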

Understanding the categories of SQL commands

SQL commands can be organized into categories according to their usage in the language:

  • Data Definition Language (DDL): DDL contains the commands that are used to manage the structure and organization of a database
  • Data Manipulation Language (DML): DML contains the commands that are used to manipulate data, for example, INSERT, DELETE, and UPDATE
  • Data Query Language (DQL): DQL contains the SELECT command and is the central part of SQL that allows querying and transforming the data stored in a database
  • Data Control Language (DCL): DCL contains the GRANT and REVOKE commands, which are used to manage the privileges that control the access to database resources and objects
  • Transaction Control Language (TCL): TCL contains the commands to manage transactions

In the upcoming sections, we provide more details about these by looking at Snowflake-specific commands, but the ideas and names are of general use in all database systems, with little or no change.

Data Definition Language – DDL

DDL commands do not deal directly with the data but are used to create and maintain the structure and organization of the database, including creating the tables where the data is stored.

They operate on the following objects:

  • Account/session objects, which contain and operate on the data, such as user, role, database, and warehouse
  • Database/schema objects, which store and manipulate the data, such as schema, table, view, function, and stored procedure

The main commands are as follows:

  • CREATE: Used to create the database itself and other objects
  • DROP: Used to delete the specified object
  • ALTER: Used to modify some attribute of the specified object
  • DESC: Used to describe the details of the specified object
  • SHOW: Used to list the existing objects of the specified object type, with metadata
  • USE: Used to select the database, schema, and object to use when fully specified names are not used
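
The following sketch applies a few of these DDL commands to the objects created earlier in the chapter; the ORDER_STATUS column is made up for illustration:

ALTER TABLE ORDERS ADD COLUMN ORDER_STATUS TEXT;  -- add a column to an existing table
DESC TABLE ORDERS;                                -- describe the columns of the table
SHOW TABLES IN SCHEMA TEST.SOME_DATA;             -- list the tables in a schema, with metadata
USE SCHEMA TEST.SOME_DATA;                        -- set the default schema for the session
DROP VIEW BIG_ORDERS;                             -- delete the view created earlier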

Tip

When working with dbt, we use the DDL and DML commands only in macros.

We do not use the DDL and DML commands in models because dbt will generate the required commands for our models based on the metadata attached to the model.

Data Manipulation Language – DML

DML provides the commands to manipulate data in a database and carry out bulk data loading.

Snowflake also provides specific commands to stage files, that is, to load files into a Snowflake-managed location called a stage.

The main commands are as follows:

  • INSERT: Inserts rows into a table
  • DELETE: Removes specified rows from a table
  • UPDATE: Updates some values of specified rows in a table
  • MERGE: Inserts, updates, or deletes rows in a table
  • TRUNCATE TABLE: Empties a table, preserving the definition and privileges
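
The following sketch exercises these commands on the ORDERS table from earlier; the NEW_ORDERS table used as the MERGE source is hypothetical:

UPDATE ORDERS SET CURRENCY = 'USD' WHERE ORDER_ID = 123;
DELETE FROM ORDERS WHERE TOTAL_AMOUNT = 0;

MERGE INTO ORDERS tgt
USING NEW_ORDERS src
  ON tgt.ORDER_ID = src.ORDER_ID
WHEN MATCHED THEN UPDATE SET TOTAL_AMOUNT = src.TOTAL_AMOUNT
WHEN NOT MATCHED THEN INSERT (ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT)
  VALUES (src.ORDER_ID, src.CUSTOMER_CODE, src.TOTAL_AMOUNT);

TRUNCATE TABLE ORDERS;  -- removes all rows, keeps the definition and privileges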

For bulk data loading, Snowflake provides the following command:

  • COPY INTO: Loads data from files in a stage into a table or unloads data from a table into one or more files in a stage

To manage files in stages, Snowflake provides these file-staging commands:

  • PUT: Uploads a local file to a Snowflake stage
  • GET: Downloads a file from a stage to the local machine
  • LIST: Lists the files existing in a Snowflake stage
  • REMOVE: Removes a file from a Snowflake stage
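
A sketch of a typical loading flow, assuming an internal stage named MY_STAGE and a local orders.csv file, both made up for this example (PUT and GET run from a client such as SnowSQL, not from the web UI):

CREATE STAGE MY_STAGE;                 -- a Snowflake-managed (internal) stage
PUT file:///tmp/orders.csv @MY_STAGE;  -- upload the local file (compressed automatically)
LIST @MY_STAGE;                        -- check what is in the stage
COPY INTO ORDERS
  FROM @MY_STAGE
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
REMOVE @MY_STAGE;                      -- clean up the staged files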

Important note

In dbt, we can use macros with the COPY INTO and file-staging commands to manage the data-loading part of a data pipeline, when source data is in a file storage service such as AWS S3, Google Cloud Storage, or Microsoft Azure Data Lake file storage.

Data Query Language – DQL

DQL is the reason why SQL exists: to query and transform data.

The command that is used to query data is SELECT, which is without any doubt the most important and versatile command in all of SQL.

For the moment, consider that a SELECT statement, aka a query, can do all these things:

  • Read data from one or more tables
  • Apply functions and transformations to the data retrieved
  • Filter the data retrieved from each of these tables
  • Group the retrieved data on a set of columns (or transformed values), producing one row for each group, calculating functions on the grouped rows, and filtering on the results of those functions
  • Calculate for each row one or more functions based on groups of rows identified by a window expression and filter the results based on the results of these functions
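
As a quick taste before the dedicated sections, the following sketch runs on the ORDERS table from earlier: the first query groups rows and filters on an aggregate, and the second adds a window function; the thresholds are just for illustration:

SELECT CUSTOMER_CODE, SUM(TOTAL_AMOUNT) AS TOTAL_SPENT
FROM ORDERS
WHERE YEAR(ORDER_DATE) = 2021
GROUP BY CUSTOMER_CODE
HAVING SUM(TOTAL_AMOUNT) > 1000
ORDER BY TOTAL_SPENT DESC;

SELECT ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT,
       SUM(TOTAL_AMOUNT) OVER (PARTITION BY CUSTOMER_CODE) AS CUSTOMER_TOTAL
FROM ORDERS;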

Important note

We have dedicated the Query syntax and operators section later in this chapter to analyzing the SELECT command in Snowflake in detail, as you will use the SELECT command in every dbt model.

Data Control Language – DCL

DCL contains the GRANT and REVOKE commands, which are used to manage the privileges and roles that control access to and use of database resources and objects.

Together with the DDL commands to create roles, users, and other database objects, the DCL commands are used to manage users and security:

  • GRANT: Assigns a privilege or a role to a role (or user)
  • REVOKE: Removes a privilege or a role from a role (or user)
  • SHOW GRANTS: Lists access privileges granted to a role or object
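
A small sketch reusing the sample role from earlier; the specific grant is just an example:

GRANT SELECT ON TABLE ORDERS TO ROLE DBT_SAMPLE_ROLE;     -- assign a privilege to a role
SHOW GRANTS TO ROLE DBT_SAMPLE_ROLE;                      -- list what the role has been granted
REVOKE SELECT ON TABLE ORDERS FROM ROLE DBT_SAMPLE_ROLE;  -- remove the privilege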

Transaction Control Language – TCL

The TCL commands are used to manage transactions in a database.

A transaction groups a set of SQL commands into a single execution unit and guarantees that either all the effects of all commands are applied, if the transaction completes with success, or no effect at all is applied if the transaction fails. This can also be described with the ACID acronym, which stands for atomic, consistent, isolated, and durable.

A transaction succeeds and ends only when all the commands it groups have finished with success; in any other case, the transaction fails, and its effects are rolled back like they never happened.

The TCL commands are as follows:

  • BEGIN: Starts a transaction in the current session
  • COMMIT: Commits an open transaction in the current session and makes the transaction effects visible to all sessions
  • ROLLBACK: Rolls back an open transaction in the current session and undoes all the effects applied since the BEGIN statement
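
A minimal sketch of a transaction on the ORDERS table; the specific changes are just for illustration:

BEGIN;                                                    -- start the transaction
UPDATE ORDERS SET CURRENCY = 'USD' WHERE ORDER_ID = 166;
DELETE FROM ORDERS WHERE ORDER_ID = 123;
COMMIT;  -- make both changes visible; ROLLBACK instead would undo both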

Now that we have covered the basic concepts and commands in SQL, it is time to set up a database to run them. The next section will provide you with access to a Snowflake DB.


Key benefits

  • Build a solid dbt base and learn data modeling and the modern data stack to become an analytics engineer
  • Build automated and reliable pipelines to deploy, test, run, and monitor ELTs with dbt Cloud
  • Guided dbt + Snowflake project to build a pattern-based architecture that delivers reliable datasets

Description

dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps. This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free version of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployment, ELT run, scheduling, and monitoring, solving practical cases you encounter in your daily work. By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations, including master data and the desired business rules, and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.

What you will learn

  • Create a dbt Cloud account and understand the ELT workflow
  • Combine Snowflake and dbt for building modern data engineering pipelines
  • Use SQL to transform raw data into usable data, and test its accuracy
  • Write dbt macros and use Jinja to apply software engineering principles
  • Test data and transformations to ensure reliability and data quality
  • Build a lightweight pragmatic data platform using proven patterns
  • Write easy-to-maintain idempotent code using dbt materialization


Product Details

Publication date: Jun 30, 2023
Length: 578 pages
Edition: 1st Edition
Language: English
ISBN-13: 9781803246284

Table of Contents

Preface
Part 1: The Foundations of Data Engineering
Chapter 1: The Basics of SQL to Transform Data
Chapter 2: Setting Up Your dbt Cloud Development Environment
Chapter 3: Data Modeling for Data Engineering
Chapter 4: Analytics Engineering as the New Core of Data Engineering
Chapter 5: Transforming Data with dbt
Part 2: Agile Data Engineering with dbt
Chapter 6: Writing Maintainable Code
Chapter 7: Working with Dimensional Data
Chapter 8: Delivering Consistency in Your Data
Chapter 9: Delivering Reliability in Your Data
Chapter 10: Agile Development
Chapter 11: Team Collaboration
Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms
Chapter 12: Deployment, Execution, and Documentation Automation
Chapter 13: Moving Beyond the Basics
Chapter 14: Enhancing Software Quality
Chapter 15: Patterns for Frequent Use Cases
Index
Other Books You May Enjoy

