You're reading from Data Engineering with dbt

Product type Book

Published in Jun 2023

Publisher Packt

ISBN-13 9781803246284

Pages 578 pages

Edition 1st Edition

Languages

Concepts

Data Streaming

Author (1):

Roberto Zagni

Table of Contents (21) Chapters

Preface

1. Part 1: The Foundations of Data Engineering

2. Chapter 1: The Basics of SQL to Transform Data

3. Chapter 2: Setting Up Your dbt Cloud Development Environment

4. Chapter 3: Data Modeling for Data Engineering

5. Chapter 4: Analytics Engineering as the New Core of Data Engineering

6. Chapter 5: Transforming Data with dbt

7. Part 2: Agile Data Engineering with dbt

8. Chapter 6: Writing Maintainable Code

9. Chapter 7: Working with Dimensional Data

10. Chapter 8: Delivering Consistency in Your Data

11. Chapter 9: Delivering Reliability in Your Data

12. Chapter 10: Agile Development

13. Chapter 11: Team Collaboration

14. Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms

15. Chapter 12: Deployment, Execution, and Documentation Automation

16. Chapter 13: Moving Beyond the Basics

17. Chapter 14: Enhancing Software Quality

18. Chapter 15: Patterns for Frequent Use Cases

19. Index

Why subscribe?

20. Other Books You May Enjoy

Combining data in SQL – the JOIN clause

The JOIN clause, and the equivalent forms of the FROM clause with multiple tables, is used to combine the rows from two tables to create a row with the columns (that you select) from both tables.

Joins are useful when the tables to be combined are related, that is, when the two tables have some columns that represent the same thing, and we want to combine data from both tables.

We express how to combine the rows by providing a join clause, usually with the ON subclause, which compares the rows from one table to the rows of the other table. Most of the time, the relation is that the values of corresponding columns are the same, but any predicate is fine in the ON subclause.

Combining orders and customers

One example of how to combine data might be a web_order table and a customer table.

In both tables, you normally have a column with the customer ID information even if the columns might not have the same name. Let’s assume that in the order table, there is the ID of the customer who placed the order in the ORDER_PLACED_BY column, and in the customer table, there is the ID of the customer in the CUSTOMER_ID column. Then, we could write the following query:

SELECT *
FROM web_order
JOIN customer ON ORDER_PLACED_BY = CUSTOMER_ID;

This query, using *, returns rows that have all columns from both tables, in all the cases when there is a row that satisfies the ON condition.

Let’s look at the relevant rows of the input and output tables in the case where we have one order with ORDER_PLACED_BY = 123 in the order table and one customer with CUSTOMER_ID = 123.

Say we have one row with ORDER_PLACED_BY = 123 in the web_order table, as follows:

Order_ID	ORDER_PLACED_BY	ORDER_VALUE
WEB_0001	123	225.52

Table 1.2: Sample web_order table

And we have one row with CUSTOMER_ID = 123 in the customer table, as follows:

Customer_ID	Customer_Name	Address
123	Big Buyer LLP	Nice place road, 00100 SOMEWHERE

Table 1.3: Sample customer table

Then, we get the following row in the result table:

Order_ID	ORDER_PLACED_BY	ORDER_VALUE	Customer_ID	Customer_Name	Address
WEB_0001	123	225.52	123	Big Buyer LLP	Nice …

Table 1.4: Sample result of the previous query

If we do not have any customer with CUSTOMER_ID = 123, then we will have no row returned (for that order) in the result table.

Say we have the same order table as before, but three rows with CUSTOMER_ID = 123 in the customer table:

Customer_ID	Customer_Name	Address
123	Big Buyer LLP	Nice place road, 00100 SOMEWHERE
123	Another Customer	Some road, 10250 SOME PLACE
123	A third customer	No way road, 20100 NOWHERE

Table 1.5: Alternative example of a customer table with three rows with CUSTOMER_ID = 123

Then, we will have three rows returned, each having the same order information combined with one specific customer per row, as you see in the following table:

Order_ID	ORDER_PLACED_BY	ORDER_VALUE	Customer_ID	Customer_Name	Address
WEB_0001	123	225.52	123	Big Buyer LLP	Nice …
WEB_0001	123	225.52	123	Another Customer	Some …
WEB_0001	123	225.52	123	A third customer	No …

Table 1.6: Table resulting from the previous sample JOIN query, with three customer matches

This last situation is probably not what you want, as it will “duplicate” the order, returning one row with the same order information for each customer that matches the condition. Later in the book, when we talk about identity, we will see how to make sure that this does not happen and how with dbt, you can also easily test that it really does not happen.

Another question that you might have is how do we keep the order information in the results, even if we do not have a match in the customer table, so that we get all the orders, with the customer information when available? That’s a good question, and the next topic on join types will enlighten you.

JOIN types

In the previous section about the query syntax, we introduced the simplified syntax of a JOIN clause. Let’s recall it here with shorter table names and no aliases:

SELECT …
FROM t1
[<join type>] JOIN t2 ON <condition_A>
[[<join type>] JOIN t3 ON <condition_B>]
…

In the most common cases, the join type is one of [INNER] or { LEFT | RIGHT | FULL } [ OUTER ].

This gives us the following possible joins with a join condition using the ON subclause:

INNER JOIN: This is the default and most common type of join, returning only the rows that have a match on the join condition. The INNER keyword is optional, so you can write the following:
```
SELECT … FROM t1 JOIN t2 ON <some condition>
```

Note that the preceding INNER JOIN is equivalent to the following query that uses only FROM and WHERE:

SELECT … FROM t1, t2 WHERE <some condition>

It is preferable, anyway, to use the JOIN syntax, which clearly shows, especially to humans, which conditions are for the join and which are for filtering the incoming data.

LEFT OUTER JOIN: This is the second most used type of join, as it returns all the rows from the left table, which is the first to be named, combined with the matching rows from the other table, padding with NULL the values where the right table has no matches

Of course, you will have one row of the left table for each matching row of the right table.

RIGHT OUTER JOIN: This is similar to LEFT OUTER JOIN, but it keeps all the columns from the right table and the matching ones from the left table. It is less used than the left as you can reverse the table order and use the left expression.

The query t1 RIGHT OUTER JOIN t2 is the same as t2 LEFT OUTER JOIN t1.

FULL OUTER JOIN: This type of join combines the left and right behavior to keep all the rows from left and right, padding with nulls the columns where there is no match.

There are also two other types of join where you do not specify a join condition:

CROSS JOIN: This type of join produces a Cartesian product, with all possible combinations of rows from both tables. This is also what you obtain if you do not use an ON subclause when using the previous types of joins. The cross join does not have an ON subclause:
```
SELECT … FROM t1 CROSS JOIN t2
```

This is equivalent to what we have seen in the FROM clause:

SELECT … FROM t1, t2

The difference is just how obvious it is for a human reader that you really want to have a cross join, or that you forgot about the ON subclause or some join-related condition in the WHERE clause. It is not so common to use cross joins, because of the Cartesian explosion we talked about; it is, therefore, a much better style to explicitly indicate that you really want a cross join, the few times when you will actually want it.

NATURAL <type> JOIN: The NATURAL join is identical to the various types of JOINs that we have seen so far, but instead of using the ON subclause to find the matches between the two tables, it uses the columns that have the same name with an equality condition. Another small difference is that the columns with the same name in the two tables are returned only once in the results as they always have the same values on both sides, because of the equality condition.

Here are a couple of examples of how to write queries with this type of join:

SELECT … FROM t1 NATURAL INNER JOIN t2

The preceding query is like an INNER JOIN query on columns with the same name in t1 and t2.

SELECT … FROM t1 NATURAL FULL OUTER JOIN t2

This one is like a FULL OUTER JOIN query on columns with the same name in t1 and t2.

Tip

When talking about JOIN, we use LEFT and RIGHT, but with respect to what?

It is a reference to the order in which the tables appear in a chain of joins.

The FROM table is the leftmost one and any other table that is joined is added to the right in the order in which the join appears.

Writing SELECT … FROM t1 JOIN t2 ON … JOIN t3 ON … JOIN t4 ON … makes clear that the tables will be stacked from left to right like this: t1-t2-t3-t4.

You could rewrite the same example as it is normally written on multiple lines, as follows:

SELECT …

FROM t1

JOIN t2 ON …

JOIN t3 ON …

JOIN t4 ON …

The result is the same, even if it is not so immediate to see left and right as mentioned in the previous statement.

Visual representation of join types

We have defined how joins work through examples and explanations, but I think that for some people, an image is worth a thousand explanations, so I propose two ways to graphically look at joins:

One that tries to show how the matching and non-matching rows are treated in different kinds of joins
One that uses a set notation and compares the different types of joins

Visual representation of returned rows in JOIN queries

The following figure visually describes how two rows of tables A and B are aligned to form the rows resulting from the different types of joins:

Figure 1.7: Visual representation of different JOIN types between tables A and B

Each table is divided into two: one part where the rows have one or more matches on the other table that satisfy the join condition and another part where each row has no match on the other table.

The rows from the two tables that have a match, in the center of the figure, are properly aligned according to the matching condition so that each resulting row contains the values from A’s columns and B’s columns where the condition is met. This central part is always returned by all joins, unless explicitly excluded with a WHERE clause requesting one of the two keys to be NULL.

The rows where there is no match, shown at the top for table A and at the bottom for table B, are aligned with columns from the other table padded with NULL values. This produces the somewhat unintuitive result that a query with an ON A.key = B.key clause might produce rows where one of the two keys is NULL and the other is not.

Tip

Please remember that NULL is a special value and not all things work out as expected. As an example, the expression NULL = NULL produces NULL and not TRUE as you might expect.

Try it out yourself with this query: SELECT NULL = NULL as unexpected;.

That is why you test for NULL values using <field> IS NULL and not using equality.

Full outer join

The following figure depicts the result of a FULL OUTER JOIN in terms of the rows of the two original tables:

Figure 1.8: Visual representation of a FULL OUTER JOIN

You can clearly identify the central part of the previous picture, where rows from both tables satisfy the join constraint and the two parts where one of the tables has no matching rows for the other table; in these rows, the columns from the other table are filled with NULL values.

Left outer join

The following figure depicts the result of a LEFT OUTER JOIN in terms of the rows from the two original tables:

Figure 1.9: Visual representation of a LEFT OUTER JOIN

You can clearly see in the picture that the result consists of the rows from both sides that satisfy the join constraints, plus only the rows from table A that do not have a match in table B, with the columns from B filled with NULL values.

The rows from table B without a match in table A are not returned.

Another way to express this is that we have all rows from table A, plus the rows from B where there is a match in the join condition, and NULL for the other rows.

Important Note

When we join two tables, and we write a condition such as ON A.key = B.key, we are expressing interest in rows where this condition is true, and INNER JOIN just gives us these rows.

However, OUTER joins also return rows where the join clause is not true; in these rows, either the A.key or B.key column will be filled with NULL as there is no match on the other table.

Visual representation of JOIN results using sets

The following figure visually represents the join types that use an ON clause, representing, as sets, the rows from tables A and B that match or do not match the join condition in the ON clause.

The overlapping area is where the condition is matched by rows in both tables:

Figure 1.10: Visual representation of SQL JOIN types with an ON clause, as set operations

The preceding figure represents the join types that we have seen in two forms:

Using only the ON subclause, showing the results of the pure join
Using a WHERE clause on the column used in the ON subclause

In the figure, this information is used to exclude from the outer joins the rows where the match happens, where A and B overlap, which are the rows returned by an inner join.

This type of query is useful, and often used, for example, to see whether we have orders where the referenced customer does not exist in the customer table. This would be called an orphan key in the order table.

Let’s see an example using Snowflake sample data:

SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
LEFT OUTER JOIN "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
  ON C_CUSTKEY = O_CUSTKEY
WHERE C_CUSTKEY is NULL;

This query should return no rows, as all the customers referenced by the orders should exist in the customer table. Now that we have covered all the basic functions in SQL, let us check out an advanced feature: windows functions.