Data Engineering with dbt

The Basics of SQL to Transform Data

This book is about engineering data platforms with dbt. When working with dbt, you write code in Structured Query Language (normally referred to as SQL, pronounced sequel or as its letters: es-qu-el), with a spritz of Jinja templating.

This means that when we are not discussing architectural and design topics, we will be reading and writing code in SQL, so you should get up to speed with SQL to work proficiently with dbt.

In this chapter, you will review, or learn from scratch, the basics of SQL, becoming familiar with this powerful language that you find everywhere when working with data.

First, we will introduce SQL and walk you through its core concepts, so that even if you are new to this language, you can start building your competence and follow the rest of this chapter and the book.

To complement this introduction, you will quickly be presented with all the main SQL commands, organized by category of use. This will ensure that you know the most relevant parts of SQL and you can use this chapter as a quick reference when you start writing your code.

You will then be guided through how to create an account in Snowflake, the leading cloud database that is very often used with dbt. You will use Snowflake in the cloud, without the need to install anything, and it is free for a trial period.

Once you have a Snowflake account, you will be guided through creating your first role, user, and database.

With your cloud database ready and the basic SQL commands presented, we will look deeper into the SELECT statement and the JOIN clause, as they will be crucial in working with dbt.

Analytical and window functions are presented at the end of the chapter in a simple way, suitable for beginners as well as mid- to senior-experienced SQL users; these advanced functionalities will power up your SQL skills, vastly widening your abilities.

To summarize, in this chapter, we are going to cover these topics:

Introducing SQL
SQL basics – core concepts and commands
Setting up a Snowflake database with users and roles
Querying data in SQL – syntax and operators
Combining data in SQL – the JOIN clause
Advanced – introducing window functions

Introducing SQL

SQL was created in the 70s and by the end of the 80s, had become the de facto standard to interact with Relational Databases (RDBs), and it now powers most of the data management industry in the world.

Given its huge footprint and powerful abstractions, SQL has become a standard that anyone working with database systems eventually becomes familiar with. The expressive power of SQL is well understood and its knowledge is so ubiquitous that it has been taken into use beyond RDBs, with Database Management Systems (DBMSs) of all sorts providing a SQL interface even on top of many non-RDB systems.

Some of the great advantages of SQL are as follows:

The core SQL functionality was standardized in the 80s and yet SQL is still very much alive and well, evolving and adding new powerful functionalities as data management evolves while maintaining compatibility with previous versions.

Every database has its SQL quirks, but the logic is the same and most SQL code will work on multiple databases with little or no change.

Learn it now and use it forever, and with (almost) every database.

At its core, it has a simple, rigorous, and powerful syntax that reads like English sentences, so even non-tech people can grasp the basic idea, while professionals can express exactly what they want in a precise and concise way.

Most people can probably get a sense of what the following SQL does:

SELECT ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT
FROM ORDERS
WHERE YEAR(ORDER_DATE) = 2021;

With SQL, you work at the logical level, so you do not have to deal with implementation details, and it is a declarative language; you describe in a rigorous way what you want to achieve, not how to do it. The database engine has the freedom to store data, be implemented, and perform the request in its own way, as long as it produces the correct result according to SQL specifications.
With a single SQL statement, you can process one piece of data or billions, leaving the burden of finding the most effective way to the database and giving you some freedom from scale.

SQL basics – core concepts and commands

The SQL language is certainly big and complex, taking time to be fully mastered, but surprisingly, you can get productive with the limited set of features that we will introduce in this book, and you will pick up the nuances when you’ll need them while working on real use cases.

The goal of this chapter is not to make you a SQL expert but to get you started with it, getting deeper into the few commands that we will use often to make you quickly productive with dbt.

In this intro, we start with the most basic concepts of SQL and then provide an overview of the SQL commands by their categories of use. In the rest of this chapter, we will dig deeper into the commands that are mostly used when working with dbt: SELECT, JOIN, and analytical and window functions.

SQL core concepts

The main goal of SQL, as its name implies, is to allow users to query data that is contained in a database; SQL also provides all the commands to fully manage the database, allowing you to add, transform, delete, organize, and manage data and other database objects, such as users and roles.

The core concepts in SQL come from how an RDB is organized:

A database contains tables, organized in schemata (plural of schema)
Tables store data in rows that have one value for each column defined in the table
Columns have a name and can contain only data of the declared data type
To regulate access to the data, privileges are assigned to users and roles

You can see a database with schemata and tables in the following screenshot, which shows part of the sample database available in any Snowflake account:

Figure 1.1: Example database with schemata and tables

Let’s go through the core concepts of SQL, starting with the table.

What is a table?

The table is the most central concept in SQL, as it is the object that contains data.

A table in SQL is very close to the layman’s concept of a table, with data organized in columns and rows.

Columns represent the attributes that can be stored in a table, such as a customer’s name or the currency of an order, with each column defining the type of data it can store, such as text, a number, or a date.

In a table, you can have as many rows as you want, and you can keep adding more whenever you need. Each row stores in the table one instance of the concept that the table represents, such as a specific order in the following example of a table for orders:

Order_ID	Customer_CODE	Total_amount	Currency
123	ABC	100	EUR
166	C099	125,15	USD

Table 1.1: Example order table

Looking at the previous example order table, we can see that the table has four columns that allow storing the four attributes for each row (order ID, customer code, total amount, and currency of the order). The table, as represented, has two rows of data, representing one order each.

The first row, in bold, is not data but is just a header to represent the column names and make the table easier to read when printed out.

In SQL, a table is both the definition of its content (columns and their types) and the content itself (the data, organized by rows):

Table definition: It lists the columns that make up the table, and each column provides the data type and other optional details. The data type is mandatory and declares what values will be accepted in the column.
Table content: It is organized in rows, each row containing one value for each column defined for the table or null if no value is provided.
Data value: All the values in a column must be compatible with the type declared for the column. null is a special value that corresponds to the absence of data and is compatible with all data types.

When creating a table, we must provide a name for the table and its definition, which consists of at least column names and a type for each column; the data can be added at a different time.

In the following code block, we have a sample definition for an order table:

CREATE TABLE ORDERS (
   ORDER_ID NUMBER,
   CUSTOMER_CODE TEXT,
   TOTAL_AMOUNT FLOAT,
   ORDER_DATE DATE,
   CURRENCY TEXT DEFAULT 'EUR'
);

Tip

When we write some SQL, we will use Snowflake’s SQL syntax and commands. We will guide you on how to create a free Snowflake account where you can run the code shown here.

In this example, we see that the command to create a table reads pretty much like English and provides the name of the table and a list of columns with the type of data each column is going to contain.

You can see that for the CURRENCY column, of type text, this code also provides the default value EUR. Single quotes are used in SQL to delimit a piece of text, aka a string.

View – a special way to look at some data

If you already have data and you want to make it available with some transformation or filtering, you can create a view. You can think of a view like a table, but with the column definition and data both coming from a query, reading from one or more tables.

As an example, if you would like to have a shortlist of orders with amount greater than 1,000 you could write the following query to create a BIG_ORDERS view:

CREATE VIEW BIG_ORDERS AS
SELECT * FROM ORDERS
WHERE TOTAL_AMOUNT > 1000;

In this example, we see that this simple create view statement provides the name for the view and uses a query, which is a SELECT statement, to define what data is made available by the view.

The query provides both the data, all the orders with a total amount greater than 1,000, and the column definitions. The * character – called star – is a shortcut for all columns in the tables read by the SELECT statement.

This is, of course, a naïve example, but throughout this book, you will see that combining tables that store data and views that filter and transform the data coming from tables and views is the bread and butter of working with dbt. Building one object on top of the previous allows us to take raw data as input and provide as output refined information that our users can easily access and understand.

Tip

When working with dbt, you will not need to write create table or create view statements, as dbt will create them for us. It is nevertheless good to get familiar with these basic SQL commands as these are the commands executed in the database and you will see them if you look in the logs.

Database and schema

We have seen that a database is a container for tables and views.

A DB can be further divided and organized using schema objects.

In real-life DBs, the number of tables can range from a few units to many thousands. Schemata act pretty much like folders, but cannot be nested, so you can always identify a table by its database, schema, and name.

In the following screenshot, we see part of the contents of the SNOWFLAKE_SAMPLE_DATA database, which is available in all Snowflake accounts:

Figure 1.2: Some schemata and tables of the SNOWFLAKE_SAMPLE_DATA database

The database and schemata, as table containers, are the main ways information is organized, but they are also used to apply security, access limitations, and other features in a hierarchical way, simplifying the management of big systems.

To create the TEST database and the SOME_DATA schema in it, we can use the following commands:

CREATE DATABASE TEST;
CREATE SCHEMA TEST.SOME_DATA;

The database.schema notation, also known as a fully qualified name, allows us to precisely describe in which database to create the schema and after its creation, uniquely identifies the schema.

Tip

While working with dbt, you will create a database or use an existing one for your project; dbt will create the required schema objects for you if you have not created them already.

A best practice in Snowflake is to have one database for each project or set of data that you want to keep separate for administration purposes. Databases and schemata in Snowflake are soft boundaries, as all the data from all the databases and schemata can be accessed if the user has the appropriate privileges.

In some other database systems, such as PostgreSQL, a database is a stronger boundary.

Users and roles

To control access to your data in SQL, you GRANT access and other privileges to both users and roles.

A user represent one individual user or service that can access the database, while a role represents a named entity that can be granted privileges.

A role can be granted to a user, providing them all the privileges associated with the role.

A role can also be granted to another role, building hierarchies that use simple basic roles to build more complex roles that are assigned to a user.

Using roles instead of granting privileges directly to users allows you to manage even a large number of users simply and consistently. In this case, roles are labels for a set of privileges and become a way to manage groups of users that we want to grant the same privileges. Changing the privileges granted to a role at any moment will change the privileges that the users receive from that role.

A typical pattern when working with dbt is to create a role for the dbt users and then assign it to the developers and the service user that the dbt program will use.

The following is an example of a simple setup with one role and a couple of users:

CREATE ROLE DBT_SAMPLE_ROLE;
CREATE USER MY_NAME;           -- Personal user
CREATE USER SAMPLE_SERVICE;    -- Service user
GRANT ROLE DBT_SAMPLE_ROLE TO USER MY_NAME;
GRANT ROLE DBT_SAMPLE_ROLE TO USER SAMPLE_SERVICE;

A more complex setup could have one role to read and one to write for each source system (represented by a schema with the data from the system), for the data warehouse (one or more schemata where the data is processed), and for each data mart (one schema for each data mart).

You could then control in much more detail who can read and write what, at the cost of more effort.

Understanding the categories of SQL commands

SQL commands can be organized into categories according to their usage in the language:

Data Definition Language (DDL): DDL contains the commands that are used to manage the structure and organization of a database
Data Manipulation Language (DML): DML contains the commands that are used to manipulate data, for example, INSERT, DELETE, and UPDATE
Data Query Language (DQL): DQL contains the SELECT command and is the central part of SQL that allows querying and transforming the data stored in a database
Data Control Language (DCL): DCL contains the GRANT and REVOKE commands, which are used to manage the privileges that control the access to database resources and objects
Transaction Control Language (TCL): TCL contains the commands to manage transactions

In the upcoming sections, we provide more details about these by looking at Snowflake-specific commands, but the ideas and names are of general use in all database systems, with little or no change.

Data Definition Language – DDL

DDL commands do not deal directly with the data but are used to create and maintain the structure and organization of the database, including creating the tables where the data is stored.

They operate on the following objects:

Account/session objects, which contain and operate on the data, such as user, role, database, and warehouse
Database/schema objects, which store and manipulate the data, such as schema, table, view, function, and store procedure

The main commands are as follows:

CREATE: Used to create the database itself and other objects
DROP: Used to delete the specified object
ALTER: Used to modify some attribute of the specified object
DESC: Used to describe the details of the specified object
SHOW: Used to list the existing objects of the specified object type, with metadata
USE: Used to select the database, schema, and object to use when fully specified names are not used

Tip

When working with dbt, we use the DDL and DML commands only in macros.

We do not use the DDL and DML commands in models because dbt will generate the required commands for our models based on the metadata attached to the model.

Data Manipulation Language – DML

DML provides the commands to manipulate data in a database and carry out bulk data loading.

Snowflake also provides specific commands to stage files, such as loading files in a Snowflake-managed location, called a stage.

The main commands are as follows:

INSERT: Inserts rows into a table
DELETE: Removes specified rows from a table
UPDATE: Updates some values of specified rows in a table
MERGE: Inserts, updates, or deletes rows in a table
TRUNCATE TABLE: Empties a table, preserving the definition and privileges

For bulk data loading, Snowflake provides the following command:

COPY INTO: Loads data from files in a stage into a table or unloads data from a table into one or more files in a stage

To manage files in stages, Snowflake provides these file-staging commands:

PUT: Uploads a local file to a Snowflake stage
GET: Downloads a file from a stage to the local machine
LIST: Lists the files existing in a Snowflake stage
REMOVE: Removes a file from a Snowflake stage

Important note

In dbt, we can use macros with the COPY INTO and file-staging commands to manage the data-loading part of a data pipeline, when source data is in a file storage service such as AWS S3, Google Cloud Storage, or Microsoft Azure Data Lake file storage.

Data Query Language – DQL

DQL is the reason why SQL exists: to query and transform data.

The command that is used to query data is SELECT, which is without any doubt the most important and versatile command in all of SQL.

For the moment, consider that a SELECT statement, aka a query, can do all these things:

Read data from one or more tables
Apply functions and transformations to the data retrieved
Filter the data retrieved from each of these tables
Group the retrieved data on a set of columns (or transformed values), producing one row for each group, calculating functions on the grouped rows, and filtering on the results of those functions
Calculate for each row one or more functions based on groups of rows identified by a window expression and filter the results based on the results of these functions

Important note

We have dedicated the Query syntax and operators section later in this chapter to analyzing the SELECT command in Snowflake in detail, as you will use the SELECT command in every dbt model.

Data Control Language – DCL

DCL contains the GRANT and REVOKE commands, which are used to manage privileges and roles that control access to or use database resources and objects.

Together with the DDL commands to create roles, users, and other database objects, the DCL commands are used to manage users and security:

GRANT: Assigns a privilege or a role to a role (or user)
REVOKE: Removes a privilege or a role from a role (or user)
SHOW GRANTS: Lists access privileges granted to a role or object

Transaction Control Language – TCL

The TCL commands are used to manage transactions in a database.

A transaction groups a set of SQL commands into a single execution unit and guarantees that either all the effects of all commands are applied, if the transaction completes with success, or no effect at all is applied if the transaction fails. This can also be described with the ACID acronym, which stands for atomic, consistent, isolated, and durable.

A transaction succeeds and ends only when all the commands it groups have finished with success; in any other case, the transaction fails, and its effects are rolled back like they never happened.

The TCL commands are as follows:

BEGIN: Starts a transaction in the current session
COMMIT: Commits an open transaction in the current session and makes the transaction effects visible to all sessions
ROLLBACK: Rolls back an open transaction in the current session and undoes all the effects applied since the BEGIN statement

Now that we have covered the basic concepts and commands in SQL, it is time to set up a database to run them. The next section will provide you with access to a Snowflake DB.

Setting up a Snowflake database with users and roles

In this section, you will be guided through creating your own Snowflake account so that you can experiment with the SQL commands from the previous section and use it with dbt in the next chapters.

Once you have created your account, we will help you through a quick setup of users and roles that will be useful throughout this book and in your later projects.

Creating your Snowflake account

To sign up for a 30-day free trial of Snowflake, you can go to https://signup.snowflake.com/ or look for the Start for free button on the Snowflake home page.

On the Snowflake site, go through the following steps:

Fill in the registration form with your data and click CONTINUE.

Figure 1.3: Snowflake free trial registration form

Select the Snowflake edition that you want to use, pick your preferred cloud provider, and click GET STARTED and you will reach a confirmation page.

Figure 1.4: Left: Snowflake edition and cloud provider selection form. Right: sign-up confirmation

Go to your email client and follow the link from the email that you receive to confirm your email address. It will take you back to the Snowflake website.
On the Welcome to Snowflake! page, enter the username and password for the user that will become your account administrator and then click Get started.

Figure 1.5: Welcome to Snowflake! page

Tip

Later, we will teach you how to create all the roles and users that you want.

Anyway, it makes sense to pick a good name and keep the password safe, as this user has all privileges and can do pretty much everything on your account.

After clicking the Get started button, you will land on the Snowflake user interface, with an empty worksheet open and an introductory menu with a few topics to get started with Snowflake. If you are new to Snowflake, it is a good idea to go through those topics.
After dismissing this introductory menu, you are ready to explore and use your Snowflake account, which in the classical console interface will look like this:

Figure 1.6: The user interface of Snowflake worksheets

If you end up with the new Snowflake interface (called Snowsight) you can work with it or use the Classic Console entry in the menu to switch to the older interface.

Setting up initial users, roles, and a database in Snowflake

Now you have a Snowflake account with a user that you can log in with, and your user has been granted the omnipotent role of ACCOUNTADMIN.

That’s great, but it is not advised that you use this role for normal work. The best practice is to set up a few users and roles to provide the desired access level and use the lowest level that can accomplish a task.

Snowflake allows you to pick the role you impersonate to execute commands in a worksheet between the ones you have been granted so that you do not have to log out and back in to change the role.

Tip

You can change your role, the warehouse to use, and the current database and schema with the selector in the top-right corner of the worksheet editing area.

You might have noticed that in the navigation panel on the left, you have two databases:

The SNOWFLAKE database provides you with information about your Snowflake account
The SNOWFLAKE_SAMPLE_DATA DB provides, as the name implies, some sample data

These databases are shared with you by Snowflake, and you can only read from them.

To do something meaningful, you will need at least a database you can write in.

In this section, we will create a database and some roles and users to use it.

Overview of initial roles

In a database a user needs the relevant privilege to be able to perform an action.

A user can have the privilege directly, that is, assigned to the user, or indirectly, through a role that the user can impersonate.

In practice, privileges are assigned to roles, which are assigned to other roles and users, to build the desired security hierarchy, because assigning privileges to users is not scalable.

When you log in to a newly created Snowflake account, you are impersonating the ACCOUNTADMIN role, which has all privileges, but other, more specialized roles are already available.

Snowflake has already set up a few self-explanatory basic roles: ACCOUNTADMIN, SECURITYADMIN, USERADMIN, SYSADMIN, and PUBLIC.

Important note

To create users and roles, Snowflake provides the USERADMIN role by default, which has the CREATE ROLE and CREATE USER privileges directly assigned. This role is already assigned to the initial user that you created.

The initial user could create new users and roles using the ACCOUNTADMIN or SECURITYADMIN roles, because they have been granted the USERADMIN role; SECURITYADMIN had it granted directly and ACCOUNTADMIN indirectly, having been granted the SECURITYADMIN role.

However, this would not be a great idea, as you will see in the following paragraphs on ownership.

The following is an example of how to create a hierarchy of roles.

You can explore the existing roles and their privileges with these commands:

SHOW ROLES;
SHOW GRANTS TO ROLE <role_name>;

Ownership is needed for some operations, such as deletions, which cannot be granted.

It is therefore advisable that a user with the role that we want to use to manage an object is the one who creates it, while impersonating such role.

In this way, the user, and all others with the same role, will have the required privileges.

An alternative is to use the GRANT command to explicitly provide the required privileges to a role.

Important note

An object is owned by the role that created it.

When a user has multiple roles, the role impersonated by the user at the moment of the creation will own the object.

Before creating an object, make sure you are impersonating the role you want to own the object.

Creating and granting your first role

Let’s switch to USERADMIN to create our users and roles, because this is the role that you will probably grant to people delegated to perform user and role administration.

Being the owner will make all operations, including deletions, simpler:

Switch to the USERADMIN role:
```
USE ROLE USERADMIN;
```

Let’s create the role that we will use for all users that have to fully manage a dbt project, humans, or applications.

I usually call this role DBT_EXECUTOR_ROLE because I call the user for the dbt application DBT_EXECUTOR and I like to be crystal clear with my names.

You can of course pick a name that you prefer for both the role and user.

Create a role for users running dbt models:

CREATE ROLE DBT_EXECUTOR_ROLE

  COMMENT = 'Role for the users running DBT models';

We now have a role that we can shape and use to manage a set of database objects, for example, corresponding to one or more dbt projects.

The dbt executor user will be used by dbt to run dbt models in shared environments, while dbt will use each developer’s own user to run dbt models in their own development environment.

The simplest setup, giving all developers the ability to manage all environments, is to assign this role to all developers. An alternative is to have a different role for production and other environments that you want to keep protected.

Let’s keep it simple and assign the executor role to ourselves:
```
GRANT ROLE DBT_EXECUTOR_ROLE TO USER <your_user>;
```

Now, our user has both the account admin and the executor roles. To be able to see and select the role in the user interface dropdown, you might need to refresh the page.

Granting the privileges to create a database

We could create a database while impersonating the account admin role, but this will not help with our plan to use the executor role to manage this database. We must then give our executor role the ability to create a database. To have our SQL commands executed on a database, we need the ability to use or create an existing warehouse.

Important Note

To manage the structure of databases and the warehouse settings, Snowflake provides the default SYSADMIN role, which has the required privileges.

We can achieve our goal of providing the executor role with those abilities by doing one of the following:

Granting the required privileges directly to the executor role
Granting the SYSADMIN role to the executor role, which has the required privileges

What would be the difference?

If you grant the privileges directly, the executor role is totally isolated from other roles, and users with that role will just be able to access what the role allows them to access.

If you grant the SYSADMIN role, users with the executor role will also be able to impersonate the SYSADMIN role and therefore also access any object it can access. As an example, they could access a database created by another user that has created it with the sysadmin role.

Let’s assign the desired privileges directly, to keep our dbt objects away from SYSADMIN.

To grant the privileges to the executor role, we need to switch to a role that can grant these privileges and then grant them:

Switch to the SYSADMIN role:
```
USE ROLE SYSADMIN;
```

Grant the privilege to create a new database:

GRANT CREATE DATABASE ON ACCOUNT

  TO ROLE DBT_EXECUTOR_ROLE;

Grant the privilege to use the default warehouse, named COMPUTE_WH:

GRANT USAGE ON WAREHOUSE COMPUTE_WH

  TO ROLE DBT_EXECUTOR_ROLE;

Creating your first database

Now that our executor role can create a database, let’s do it:

Let’s impersonate the executor role so that it will own what we create:
```
USE ROLE DBT_EXECUTOR_ROLE;
```

Now, let’s create our database:

CREATE DATABASE DATA_ENG_DBT;

       -- or pick any name you like for the DB

To see the new database in Snowflake, you might need to refresh the browsing panel on the left of the user interface, with the Refresh button, shaped as a circular arrow, in the top right of the panel.

Tip

Clicking on the new database will show that it is not empty.

The new database has two schemata: INFORMATION_SCHEMA, which contains views that you can use to collect information on the database itself, and PUBLIC, which is the default schema of the new database and is empty.

To reduce Snowflake credit consumption, let’s finish by configuring the default warehouse to suspend after 1 minute instead of 10:

USE ROLE SYSADMIN;
ALTER WAREHOUSE "COMPUTE_WH" SET
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    COMMENT = 'Default Warehouse';

Tip

To alter the warehouse, you must switch to the SYSADMIN user as it is the role that owns the warehouse; we have granted our executor role only the privilege to use it.

Congratulations! You have now created a role, assigned it to your personal user, created a database owned by that role, and configured the default warehouse. You are almost done setting up.

Creating a user for the dbt application

Let’s complete our initial setup by creating a new user that we will use with dbt and grant it the executor role so that it will be able to manage, as an owner, that database.

To create a user, we need to switch to a role with that ability:

Switch to the USERADMIN role:
```
USE ROLE USERADMIN;
```

Create the new user for dbt; I have named it DBT_EXECUTOR, though you can pick any name:

CREATE USER IF NOT EXISTS DBT_EXECUTOR

    COMMENT = 'User running DBT commands'

    PASSWORD = 'pick_a_password'

    DEFAULT_WAREHOUSE = 'COMPUTE_WH'

    DEFAULT_ROLE = 'DBT_EXECUTOR_ROLE'

Let’s assign the executor role to the dbt user:

GRANT ROLE DBT_EXECUTOR_ROLE TO USER DBT_EXECUTOR;

Switch back to the operational role, which we should always be working with:
```
USE ROLE DBT_EXECUTOR_ROLE;
```

Great! You have now performed a basic setup of your new Snowflake account; you have learned the basics of user and role management and you are ready to learn more SQL to query data.

Querying data in SQL – syntax and operators

In this section, you will learn how to query existing data using SQL and the Snowflake example database.

In this chapter, and more generally in this book, you will get familiar with Snowflake syntax, which is modern and very standard. We will also use some proprietary extensions that make your work easier.

Snowflake query syntax

Snowflake supports querying data with the standard SELECT statement, which has the following basic syntax:

WITH …
SELECT …
FROM …
JOIN …
WHERE …
GROUP BY …
HAVING …
QUALIFY …
ORDER BY …
LIMIT …

The only mandatory part is select, so SELECT 1 is a valid query that just returns the value 1.

If you are familiar with SQL from other database systems, you will wonder what the QUALIFY clause is. It is an optional SQL clause that is very well suited to the analytical kind of work that Snowflake is used for and that not all database engines implement. It is described later in this section.

We often use the terms query, command, and statement interchangeably when referring to some piece of SQL that you can execute.

Properly speaking, a command is a generic command such as SELECT or CREATE <object>, while a statement is one specific and complete instance of a command that can be run, such as SELECT 1 or CREATE TABLE my_table …;.

The term query should really only refer to SELECT statements, as SELECT statements are used to query data from the database, but query is often used with any statement that has to do with data.

You will also often hear the term clause used, such as the FROM clause or GROUP BY clause. Informally, you can think about it as a piece of a statement that follows the syntax and rules of that specific keyword.

The WITH clause

The WITH clause is optional and can only precede the SELECT command to define one or more Common Table Expressions (CTEs). A CTE associates a name with the results of another SELECT statement, which can be used later in the main SELECT statement as any other table-like object.

Defining a CTE is useful for the following:

Clarity: You can provide an informative name to a piece of SQL
Reuse, maintenance, and efficiency: You can define a supporting query, whose results you might use more than once in the main query, and the database engine will execute the supporting query once
Creating recursive queries: Defining a CTE that queries itself is the only way to use recursion in SQL
Break a complex transformation into smaller, simpler steps that are easier to code and understand and limit the scope of future maintenance

The simplified syntax is as follows:

WITH [RECURSIVE]
  <cte1_name> AS (SELECT …)
  [, <cte2_name> AS (SELECT …)]
SELECT …

We will check out some examples here:

Check whether we still have high-priority orders pending:

WITH

high_prio_orders as (

  SELECT *

  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"

  WHERE O_ORDERPRIORITY IN ('1-URGENT', '2-HIGH')

SELECT count(*)

FROM high_prio_orders

WHERE O_ORDERDATE < '1998-01-01'

  and O_ORDERSTATUS = 'O';

Here, you can see that the first expression encapsulates the business definition of high-priority order to be any order with the priority set to urgent or high priority. Without the CTE, you have to mix the business definition and other filtering logic in the WITH clause. Then, it would be unclear whether the status is part of the definition or is just a filter that we are applying now.

Calculate some metrics for customers in the auto industry:

  WITH

  auto_customer_key as (

    SELECT C_CUSTKEY

  FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER

  WHERE C_MKTSEGMENT = 'AUTOMOBILE'

),

orders_by_auto_customer as (

  SELECT O_ORDERKEY

  FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS

  WHERE O_CUSTKEY in (SELECT * FROM auto_customer_key)

),

metrics as (

  SELECT 'customers' as metric, count(*) as value

  FROM auto_customer

  UNION ALL

  SELECT 'orders by these customers', count(*)

  FROM orders_by_auto_customer

SELECT * FROM metrics;

In this example, the CTEs and the final query remain short and very simple. You can start to appreciate that the clear labeling of intentions and the simple pieces of SQL make the full query easy to understand for anyone.

The SELECT clause

The SELECT command can appear in two possible forms: as the central clause of a SELECT statement or used as a clause in other statements. In both cases, it defines the set of columns and calculated values returned by the statement.

The simplified syntax of a SELECT clause in Snowflake is a list of column definitions separated by commas, with the optional distinct keyword to omit duplicates:

SELECT [DISTINCT]
  <column_definition_1> [, <column_definition_2> …]

The SELECT clause is used to provide the list of columns that the query will return.

For each column definition that is provided, it is also possible to provide a column alias, which will be used to identify that column definition in the results.

When two column definitions have the same name, an alias must be provided for at least one of the two to avoid a name clash in the results.

A column definition can be one of these four expressions:

<column_definition> =
   [object.*]         -- all columns in the object
  |[object.col_name]  -- the named column in the object
  |[object.$n]        -- the n-th column in the object
  |[<expression>]     -- the value of the expression
 AS <col_alias>

Let’s describe in detail these four expressions:

The star symbol, *, indicates all possible columns; it means “all columns from all tables” if it is applied without an alias or “all the columns from the aliased object” if applied with an alias.
- Star without an alias will return all columns from both tables:
```
SELECT * FROM table_1, table_2
```
- Star with an alias (or table name) will return all columns from the object with the alias. In this case, it will return all columns from table_1, as it is applied to its alias:
```
SELECT t1.* FROM table_1 as t1, table_2 as t2
```
- Snowflake has introduced two powerful extensions to the Star Syntax: EXCLUDE to remove some columns from being returned and RENAME to rename a column while selecting it, as shown in this example:
```
SELECT * EXCLUDE (c3, c2) RENAME (c4 as cx, c5 as cy)
```
```
FROM table_1
```

A column name, optionally with an alias indicating the object the column comes from and/or an alias to use for the column in the results:

SELECT

    O_ORDERKEY,

    ord.O_CUSTKEY,

    cust.C_NAME as CUSTOMER_NAME

FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS as ord

JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER as cust

    ON cust.C_CUSTKEY = ord.O_CUSTKEY;

Note that here, the designers of the TPCH database have gone a long way to keep column names unique between all tables, by adding the table initials as a prefix to column names. This is a pretty old style as it makes it more difficult to recognize fields that contain the same values as C_CUSTKEY and O_CUSTKEY.

Looking at our example query, we can see the following:

Object aliases are optional if the name of the column is unique in the tables referenced in the query, as it is for all columns in the TPCH database.
Aliases can in any case be used for clarity, like with cust.C_NAME.
Object aliases are mandatory when referencing a column name that appears in more than one referenced table. This would have been the case if both fields had been just CUSTKEY, without the C_ and O_ table prefixes.
Lastly, a column alias can always be used, even just to rename a column with an initial name that we do not like or is not clear, as in the case of a name that would not be clear, so we rename it to CUSTOMER_NAME.

A column number after a dollar sign, such as $1 or $3.

This allows us to reference columns that do not have a name, such as when reading out of a CSV file without a header. This is also useful, especially in scripts, to reference columns that we do not know the name of, but we know their position:

SELECT $1 as ORDER_KEY, $2 as CUST_KEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;

In this case, we read from a table that has column names, and we can always use column numbers instead of column names. But this is rarely done when reading from a table or a view because by using the numbers, the result will depend on the order of the columns in the object, which can change over time if the object is recreated.

An expression, like a mathematical expression or a function call, that evaluates to some value for each row:

SELECT

    P_PARTKEY

    , UPPER(P_NAME) as P_NAME

    , P_RETAILPRICE

    , P_RETAILPRICE * 0.9 as P_DISCOUNTED_PRICE

FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART";

The preceding example shows two expressions: a mathematical expression that multiplies a column by a constant to produce a new column and a function call that converts the p_name column to uppercase, keeping the same column name.

Important note

The ability highlighted here, to write arbitrary expressions and assign a name to the result, is at the core of how SQL is used to transform data and is one of the most common things that we will do with dbt.

In this section, we have seen that the SELECT clause allows us four great abilities:

To decide what data from the source tables to keep in the result of a query
To calculate new data from source data using arbitrary complex expressions
To provide a new name for the data that we want in the query result
To keep only one copy of each distinct row in the result, using the DISTINCT keyword

The FROM clause

The FROM clause introduces the table objects used in a SQL statement as the source of data or target of the command.

In the case of a SELECT statement, the FROM clause can list none, one, or more table objects, each with its own alias. The rows in the result of the SELECT statement will potentially have all the columns from all the tables referenced in the FROM clause.

The simplified syntax looks like this:

SELECT …
FROM [tableObject1 [AS alias1]]
   [, tableObject2 [AS alias2] …]

Table objects, in the context of a SELECT statement, are a combination of the following:

Tables and views: Tables and views are the main data objects SQL works with, so most of the time, this is what you find in the FROM clause.
Table functions: Table functions are a category of functions that return a set of rows, with one or more columns. They can be predefined functions or user-defined ones.
A VALUES clause: This clause allows us to build a set of rows, using constant values. This allows us to create an inline table that can be suitable as a reference table, mapping table, or test input.
A few other objects that can be read from, such as a LATERAL subquery or a staged file.

When no table is listed, then the columns in the SELECT clause must use constants and global objects such as the current_date function, as in the following example:

SELECT 1 + 1 as sum, current_date as today;

When we use the VALUES clause, we can define an inline table, like in this example:

SELECT * FROM ( VALUES
  ('IT', 'ITA', 'Italy')
 ,('US', 'USA', 'United States of America')
 ,('SF', 'FIN', 'Finland (Suomi)')
  as inline_table (code_2, code_3, country_name)
);

When the FROM clause lists more than one table object, the result is the Cartesian product of the elements in these tables. To avoid a Cartesian explosion, it is important to include a WHERE clause that restricts the combinations, keeping only the ones we want.

Important note

The Cartesian product is the result of pairing the rows of all the objects in the product in all possible ways. This means that the number of rows returned by the Cartesian product is the product of the number of rows in each object in the product.

The expression Cartesian explosion is often used to refer to the fact that the number of rows returned by the Cartesian product increases very quickly and can easily create performance issues.

The following query, thanks to the WHERE clause, will generate a more meaningful and compact set of rows, ideally equal to the number of rows that exist in the LINEITEM table:

SELECT count(*)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM" as l
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"   as o
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" as c
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART"     as p
WHERE o.O_ORDERKEY = l.L_ORDERKEY
  and c.C_CUSTKEY  = o.O_CUSTKEY
  and p.P_PARTKEY  = l.L_PARTKEY
;

In the previous query, we have used a WHERE clause to keep in the results only the rows that are really related. We achieve this by using equivalence constraints between the fields of the related tables.

The previous query generates a result of exactly 6,001,215 rows, which is the number of rows in the LINEITEM table, instead of the astronomical number resulting from the multiplication of the number of rows: 6,001,215 x 1,500,000 x 150,000 x 200,000, which is pretty much 2,7 x 10^23.

I’ll let you calculate the exact result…

The consequence is that the full query processes 6 million rows and runs in a few seconds, while the query without the WHERE clause will try to process 10^18 times more rows, taking very many hours or days to complete, in the process using up a lot of resources for nothing, as the resulting data provides nothing of interest.

The exact correspondence in the count derives from the fact that for each line item, there is only one order, for each order only one customer, and so on.

Providing the relevant WHERE clause will help with reducing the result by many orders of magnitude even if the keys don’t match so precisely. We will deal with this topic in more detail in Chapter 3, where we talk about data modeling.

This example illustrates well the power of SQL and why it is important to express what we want correctly.

Tip

As a basic rule, never do an unrestricted Cartesian product unless you know what you are doing and you are fine with the exploded number of rows that you will generate.

The JOIN clause

JOIN is a subclause of the FROM clause and is used to describe in a more precise way how to combine the rows of two tables. JOIN cannot exist outside of a FROM clause.

The simplified syntax of JOIN looks like this:

SELECT …
FROM tableObject1 AS to1
[<join type>] JOIN tableObject2 as to2 ON <condition_A>
[[<join type>] JOIN tableObject3 as to3 ON <condition_B>]
…

The condition in each join is a Boolean expression, often using columns from the base to1 table and the joined table. It does not need to be an equality check and can use columns from multiple tables or even constants.

It is normal to chain more than one JOIN inside a FROM clause to pick data from multiple tables at once, and you generally do not need to think too much about the order of the joins, but it is good to know that the order might matter. In the case of Snowflake, the order is from left to right (think of the statement written on one row).

We will look in more detail into the JOIN clause in the next section, where we will introduce the different types of joins and some examples.

We have seen that we can combine data from multiple tables, listing them in the FROM clause, and that we can use the WHERE clause to restrict the resulting Cartesian product to the combined rows that we actually want to keep. This use of FROM and WHERE is equivalent to the INNER JOIN construct, which we will see in the Combining data in SQL – the JOIN clause section.

The WHERE clause

The WHERE clause specifies the conditions that the rows involved in the command need to match.

The rows for which the expression evaluates to true are kept/processed.

In the context of a SELECT statement, the WHERE clause limits the result of the query to the subset of rows that verify the overall condition set by the WHERE clause.

The simplified syntax is as follows:

SELECT …
WHERE <predicate>

<predicate> can be as simple as the literal true or false or a complex expression, including logical operators and database functions that evaluate to a Boolean value (true, false, or NULL) for every row that the query processes.

Tip

An expression that returns a Boolean value is called a predicate.

The following example returns only the rows where the order total is greater than 500,000:

SELECT O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE O_TOTALPRICE > 500000;

The following example does not return any row, but defines two columns with the same name and type as the columns in the source table and two with the name and type that we provided:

SELECT O_ORDERKEY,O_CUSTKEY, 1 as an_int, null::number as a_num
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE false;

This may be a good trick when you need to define some column names and types. Remember that you do not need a FROM clause if you do not want to use columns from an existing table…

Important note

Be careful when NULL is involved as it might not behave as you expect.

As an example, the NULL = NULL expression evaluates to NULL.

In a WHERE clause, it means that the corresponding row is discarded.

We have seen in a previous example using the FROM clause that the WHERE clause can be used to specify some types of JOIN. We suggest using the JOIN … ON … syntax for joins and the WHERE clause to filter rows. We will look at more examples in the Combining data in SQL - the JOIN clause section.

The GROUP BY clause

A GROUP BY clause is used to calculate aggregate functions on groups of rows that produce the same value for the group by expression.

The simplified syntax is as follows:

SELECT …
GROUP BY groupExpr1 [, groupExpr2 …]

The group by expression can be one of the following:

Column name: The result of the expression is the value of the column with the given name. All rows with the same value in the column are grouped.

The following query calculates the sum of all orders by each customer (customer key):

SELECT O_CUSTKEY, sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY O_CUSTKEY;

Number: The value to group by is the value of the column in the given position in the result of the SELECT statement. The value is evaluated at the end of the statement, after applying any function or expression.

The following query is the same as the previous one, using the ordinal 1 instead of the O_CUSTKEY column name to indicate the column to group by:

SELECT O_CUSTKEY, sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY 1;

SQL expression: Any expression using any combination of data from the query. The value to group by is the result of the expression.

The following query calculates the total orders by year, using the year() function, which returns the year from a date:

SELECT YEAR(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE);

We have also added the ORDER BY clause with the same expression as the GROUP BY so that we get the output in a nice order. This is a pretty common pattern.

When more than one group by expression is provided, the rows are grouped together according to the values of all the group by expressions.

In the previous example, you could have used GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE) to have the order total split by month (and year), like in the following:

SELECT YEAR(O_ORDERDATE),MONTH(O_ORDERDATE),sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);

The HAVING clause

The HAVING clause filters the rows resulting from a GROUP BY clause according to a predicate:

SELECT …
GROUP BY …
HAVING <predicate>

The predicate is an expression that returns a Boolean value and references:

Constants
Expressions that appear in the GROUP BY clause
Aggregated functions that can be calculated according to the GROUP BY clause

As an example, we could extend the example from the GROUP BY topic to look at the months where we have less than 10,000 orders.

The following query does it:

SELECT YEAR(O_ORDERDATE), MONTH(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
HAVING count(*) < 10000
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);

In this query, we have used the count() aggregate function to count how many lines, one for each order, there are in one group.

Tip

The HAVING clause is for GROUP BY what the WHERE clause is for the FROM clause.

We will look at the third filtering clause, QUALIFY, in the next topic.

The QUALIFY clause

The QUALIFY clause filters on the results of the window functions, which is the third type of data calculation that we can have in a query after expressions and group by.

We will dedicate the last section of this chapter to window functions.

The QUALIFY clause is an optional SQL clause that proves extremely useful in analytical workloads, so it is implemented in Snowflake.

To recap a bit of what we have seen, these are the three clauses that are used to filter data:

WHERE applies to the data being read from the sources specified in the FROM / JOIN clauses
HAVING applies to the data that has been grouped by a GROUP BY clause
QUALIFY applies to the data that has been calculated by a window function

One of the most common usages of the QUALIFY clause together with the simple row_number() window function is to defend against undesired duplicates or select one specific row with respect to other rows that represent different versions of the same object:

SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM"
QUALIFY row_number()
            over(partition by L_ORDERKEY, L_LINENUMBER
                 order by L_COMMITDATE desc ) = 1;

This query selects all fields from a table, just filtering out undesired rows for which the row_number window function returns a value different from 1.

The row_number window function, assigns a progressive number from 1 onward to all the rows in the same window, following the row order, and then restarts from 1 with the next window.

Picking the rows where it is equal to 1 means keeping only the first row for each window.

The windows are defined by the over(…) clause, which comprises a partition by part that defines the windows, in a way similar to group by, and an order by that provides the ordering in the window. We will look in more detail at this in the section devoted to windows functions.

Tip

When using the QUALIFY clause, as shown in the previous query, we avoid the need to create a column with the result of the row_number window function and then use a subquery to filter on it. That is the normal way of doing this kind of filtering in a database that does not support the QUALIFY clause. QUALIFY is much simpler to read and use.

Now, let’s extend the order example from the previous section to select out the “good months,” when we have higher total sales than the average sales for the year they belong to:

WITH
monthly_totals as (
  SELECT
    YEAR(O_ORDERDATE) as year,
    MONTH(O_ORDERDATE) as month,
    sum(O_TOTALPRICE) as month_tot
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
  GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
)
SELECT year, month, month_tot
       ,avg(month_tot) over(partition by YEAR) as year_avg
FROM monthly_totals
QUALIFY month_tot > year_avg
ORDER BY YEAR, MONTH;

Here, we have put together a few of the clauses that we have seen so far, using a WITH clause to define a CTE named monthly_totals with our previous query to calculate monthly totals.

We have then defined a query that reads from the CTE and uses the avg window function to calculate the average monthly sales for each month of the full year that a particular month belongs to. Then, we use the QUALIFY clause to keep only the rows where the monthly total is greater than the yearly average.

Tip

The previous example illustrates why the window functions are also called analytical functions.

Important note

Please note that the window function calculates a value for each row using the defined windows, not changing the number of rows. In contrast, using the avg aggregate function with a GROUP BY clause would have reduced the number of rows to one per group.

The ORDER BY clause

The ORDER BY clause specifies in which order the query or window function lays out the rows.

The simplified syntax is as follows:

SELECT …
ORDER BY orderExpr1 [ASC|DESC] [NULLS FIRST|LAST] [, orderExpr2 …]

Each order by expression can be made out of three parts:

One expression that identifies what to order on. It can be either a column alias, a position ordinal, or an expression, as we have seen for the GROUP BY clause. Please refer to that section for more details.
An optional direction for the sorting: ASC for ascending or DESC for descending sort.
An optional specification of how to sort null values: NULLS FIRST or NULLS LAST, which are self-explanatory.

Please look at previous sections for examples of ORDER BY in action, in both queries and window function definitions with over().

The LIMIT/FETCH clause

The LIMIT clause is used to restrict the number of rows returned by the query.

LIMIT and FETCH are synonyms, with slightly different syntax.

Let’s look at the LIMIT syntax:

SELECT …
[ORDER BY …]
LIMIT <count> [OFFSET <start>]

The count parameter is a number that specifies the maximum number of rows to return.

If the OFFSET part is present, the returned rows are the ones after the start position. This allows tools connecting to a database to retrieve all the results in chunks of the desired size. As an example, a LIMIT 10 OFFSET 10 clause would retrieve rows from 11 to 20.

Important note

If no ORDER BY clause is present, the order of the rows is undefined and could differ in each execution of the same query. In this case, the result of a LIMIT clause is non-deterministic because what rows are returned depends on the order in which the rows happen in the result set.

Query clause order of evaluation

In the previous sections, we have seen all the clauses that can appear in a SELECT statement.

Now is a good time to bring your attention to the fact that these clauses are generally evaluated in the following specific order, as well as what it is important to pay attention to for each clause:

FROM and its JOIN subclause, which are used to identify the source data for the query.
The WHERE clause, which is used to filter out the source data that we do not want.

This is probably the most important clause for performance, because the less data a query works on, the quicker it is. Use WHERE whenever possible to just bring in the data you need.

The GROUP BY clause, which groups the source data left after applying the WHERE clause and calculates the aggregate functions on the grouped data.
The HAVING clause, which filters on the results of GROUP BY.
Partitioning of the windows and calculation of the window functions.
The QUALIFY clause, which filters on the results of the window functions.
The DISTINCT keyword, if applied to the SELECT clause, which removes duplicated rows.
The ORDER BY clause, which puts the resulting rows in the desired order.
The LIMIT clause, which caps the rows returned by the query to the desired amount.

SQL operators

When writing queries, we can perform operations on the data handled by the query.

We do so by building expressions that return the desired value, using functions and operators.

We can perform an operation pretty much everywhere a value is expected: in the SELECT clause to provide the desired outputs by transforming the inputs, in the WHERE clause or ON part of a JOIN clause, HAVING clause, or QUALIFY clause to identify what should or should not be returned by the query, in GROUP BY to decide how to aggregate, and so on.

Let’s go through the categories of operators and how they are used:

Arithmetic operators: These are the traditional +, -, *, /, and % (modulo).

They expect one or more numeric (or convertible to numeric) inputs to provide a numeric result, with the usual arithmetic precedence and meaning, like in the following example:

SELECT 1 + '2' as three, (3+2) * 4 as twenty
WHERE twenty % 2 = 0;

Note that '2' is a string but can be automatically converted to the number 2. Also note that implicit conversions happen, but explicit conversions are better.

The modulo operator returns the remainder of the division of the first operator by the second, and the val % 2 = 0 pattern is often used to identify even numbers.

Comparison operators: These are used to test two values for equality or other comparisons, and are = (equal), != (not equal), <> (also not equal), < (less than), <= (less than or equal), > (greater than), and >= (greater than or equal).

They are typically used in the WHERE clause, but can be used anywhere a Boolean result is desired, such as in the following example:

SELECT 2 < 1 as nope, '3' != 'three' as yep
WHERE 1 != 2;

Note that the operators can compare all types where the operation is defined.

I have seen WHERE 1 = 2 used in many places to avoid returning any row, because 1=2 always returns FALSE; it would be clearer to write WHERE false directly and if you do not need to filter any row out, you can just leave the WHERE clause off, or use the WHERE true expression.

Logical operators: These are the traditional AND, OR, and NOT Boolean operators.

They operate only on Boolean predicates and values and return Booleans.

They are generally used in the WHERE clause, but can appear anywhere an expression is allowed, such as in the following example:

SELECT *,
  (C_ACCTBAL > 7500) AND (C_NATIONKEY = 24) as IS_TOP_US_CUST
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE (C_NAME IS NOT null) AND IS_TOP_US_CUST;

In the previous example, we have defined an IS_TOP_US_CUST column using a logical operator to combine two predicates, as we have done in the WHERE clause, also reusing the new Boolean column we created.

Set operators: These operators are used to put together the results from multiple queries, and are INTERSECT, MINUS or EXCEPT, and UNION [ALL].

The simplified syntax is as follows:

query_1 <set_operator> query_2

The queries must be compatible, having the same number of columns and of the same type.

It is also important that the semantics of the columns in the same position are correct, as combining people’s names with state codes is possible, as they are both strings, but in general, it does not make much sense.

Let’s describe the set operators:

INTERSECT returns the rows that appear in both queries, checking all columns to have the same value
MINUS or EXCEPT returns the rows from the first query that do not appear in the second
UNION [ALL] returns the rows from both queries, with ALL keeping duplicates

As an example, the following query returns data for the customers from India (8) and the US (24) and the customers in the AUTOMOBILE segment without duplicates:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (8, 24)
UNION
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_MKTSEGMENT = 'AUTOMOBILE';

In this special case, as both the queries are on the same table, we could have just used a single query with a slightly more complex WHERE clause using an OR operator to compose the two individual clauses. In real cases, you might want to combine similar data from different tables and the set operators are here for you.

Subquery operators: These operators allow us to use subqueries in WHERE clauses.

A subquery is a query defined inside another query.

A subquery can be used without any operator as a table-like object, such as selecting a subset of a table we want to use some data, or an expression if it returns a single value, as in the following example:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY = (
  SELECT N_NATIONKEY
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
  WHERE N_NAME = 'JAPAN'
);

In this example, the subquery returns only one value, so we can use the equal operator.

The subquery operators extend their use to other cases, as per their definitions:

ALL/ANY: Allows you to apply a comparison to all/any rows of the subquery
[NOT] EXISTS: Returns true if the subquery returns at least one row, false with NOT in front
[NOT] IN: Returns true if the expression is not included in the results of the subquery

The following example extends the previous example to the case when we want customers from more than one country, picking the country by name:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (
  SELECT N_NATIONKEY
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
  WHERE N_NAME IN ('JAPAN', 'CANADA')
);

We could easily rewrite the same query using = ANY instead of IN, and probably with some changes in the subquery also using EXISTS.

Now that we have been acquainted with the basics of querying data in SQL, let’s dive deeper into the JOIN clause, which allows us to put together data from multiple sources, which is crucial to turning raw data into useful information.

Combining data in SQL – the JOIN clause

The JOIN clause, and the equivalent forms of the FROM clause with multiple tables, is used to combine the rows from two tables to create a row with the columns (that you select) from both tables.

Joins are useful when the tables to be combined are related, that is, when the two tables have some columns that represent the same thing, and we want to combine data from both tables.

We express how to combine the rows by providing a join clause, usually with the ON subclause, which compares the rows from one table to the rows of the other table. Most of the time, the relation is that the values of corresponding columns are the same, but any predicate is fine in the ON subclause.

Combining orders and customers

One example of how to combine data might be a web_order table and a customer table.

In both tables, you normally have a column with the customer ID information even if the columns might not have the same name. Let’s assume that in the order table, there is the ID of the customer who placed the order in the ORDER_PLACED_BY column, and in the customer table, there is the ID of the customer in the CUSTOMER_ID column. Then, we could write the following query:

SELECT *
FROM web_order
JOIN customer ON ORDER_PLACED_BY = CUSTOMER_ID;

This query, using *, returns rows that have all columns from both tables, in all the cases when there is a row that satisfies the ON condition.

Let’s look at the relevant rows of the input and output tables in the case where we have one order with ORDER_PLACED_BY = 123 in the order table and one customer with CUSTOMER_ID = 123.

Say we have one row with ORDER_PLACED_BY = 123 in the web_order table, as follows:

Order_ID	ORDER_PLACED_BY	ORDER_VALUE
WEB_0001	123	225.52

Table 1.2: Sample web_order table

And we have one row with CUSTOMER_ID = 123 in the customer table, as follows:

Customer_ID	Customer_Name	Address
123	Big Buyer LLP	Nice place road, 00100 SOMEWHERE

Table 1.3: Sample customer table

Then, we get the following row in the result table:

Order_ID	ORDER_PLACED_BY	ORDER_VALUE	Customer_ID	Customer_Name	Address
WEB_0001	123	225.52	123	Big Buyer LLP	Nice …

Table 1.4: Sample result of the previous query

If we do not have any customer with CUSTOMER_ID = 123, then we will have no row returned (for that order) in the result table.

Say we have the same order table as before, but three rows with CUSTOMER_ID = 123 in the customer table:

Customer_ID	Customer_Name	Address
123	Big Buyer LLP	Nice place road, 00100 SOMEWHERE
123	Another Customer	Some road, 10250 SOME PLACE
123	A third customer	No way road, 20100 NOWHERE

Table 1.5: Alternative example of a customer table with three rows with CUSTOMER_ID = 123

Then, we will have three rows returned, each having the same order information combined with one specific customer per row, as you see in the following table:

Order_ID	ORDER_PLACED_BY	ORDER_VALUE	Customer_ID	Customer_Name	Address
WEB_0001	123	225.52	123	Big Buyer LLP	Nice …
WEB_0001	123	225.52	123	Another Customer	Some …
WEB_0001	123	225.52	123	A third customer	No …

Table 1.6: Table resulting from the previous sample JOIN query, with three customer matches

This last situation is probably not what you want, as it will “duplicate” the order, returning one row with the same order information for each customer that matches the condition. Later in the book, when we talk about identity, we will see how to make sure that this does not happen and how with dbt, you can also easily test that it really does not happen.

Another question that you might have is how do we keep the order information in the results, even if we do not have a match in the customer table, so that we get all the orders, with the customer information when available? That’s a good question, and the next topic on join types will enlighten you.

JOIN types

In the previous section about the query syntax, we introduced the simplified syntax of a JOIN clause. Let’s recall it here with shorter table names and no aliases:

SELECT …
FROM t1
[<join type>] JOIN t2 ON <condition_A>
[[<join type>] JOIN t3 ON <condition_B>]
…

In the most common cases, the join type is one of [INNER] or { LEFT | RIGHT | FULL } [ OUTER ].

This gives us the following possible joins with a join condition using the ON subclause:

INNER JOIN: This is the default and most common type of join, returning only the rows that have a match on the join condition. The INNER keyword is optional, so you can write the following:
```
SELECT … FROM t1 JOIN t2 ON <some condition>
```

Note that the preceding INNER JOIN is equivalent to the following query that uses only FROM and WHERE:

SELECT … FROM t1, t2 WHERE <some condition>

It is preferable, anyway, to use the JOIN syntax, which clearly shows, especially to humans, which conditions are for the join and which are for filtering the incoming data.

LEFT OUTER JOIN: This is the second most used type of join, as it returns all the rows from the left table, which is the first to be named, combined with the matching rows from the other table, padding with NULL the values where the right table has no matches

Of course, you will have one row of the left table for each matching row of the right table.

RIGHT OUTER JOIN: This is similar to LEFT OUTER JOIN, but it keeps all the columns from the right table and the matching ones from the left table. It is less used than the left as you can reverse the table order and use the left expression.

The query t1 RIGHT OUTER JOIN t2 is the same as t2 LEFT OUTER JOIN t1.

FULL OUTER JOIN: This type of join combines the left and right behavior to keep all the rows from left and right, padding with nulls the columns where there is no match.

There are also two other types of join where you do not specify a join condition:

CROSS JOIN: This type of join produces a Cartesian product, with all possible combinations of rows from both tables. This is also what you obtain if you do not use an ON subclause when using the previous types of joins. The cross join does not have an ON subclause:
```
SELECT … FROM t1 CROSS JOIN t2
```

This is equivalent to what we have seen in the FROM clause:

SELECT … FROM t1, t2

The difference is just how obvious it is for a human reader that you really want to have a cross join, or that you forgot about the ON subclause or some join-related condition in the WHERE clause. It is not so common to use cross joins, because of the Cartesian explosion we talked about; it is, therefore, a much better style to explicitly indicate that you really want a cross join, the few times when you will actually want it.

NATURAL <type> JOIN: The NATURAL join is identical to the various types of JOINs that we have seen so far, but instead of using the ON subclause to find the matches between the two tables, it uses the columns that have the same name with an equality condition. Another small difference is that the columns with the same name in the two tables are returned only once in the results as they always have the same values on both sides, because of the equality condition.

Here are a couple of examples of how to write queries with this type of join:

SELECT … FROM t1 NATURAL INNER JOIN t2

The preceding query is like an INNER JOIN query on columns with the same name in t1 and t2.

SELECT … FROM t1 NATURAL FULL OUTER JOIN t2

This one is like a FULL OUTER JOIN query on columns with the same name in t1 and t2.

Tip

When talking about JOIN, we use LEFT and RIGHT, but with respect to what?

It is a reference to the order in which the tables appear in a chain of joins.

The FROM table is the leftmost one and any other table that is joined is added to the right in the order in which the join appears.

Writing SELECT … FROM t1 JOIN t2 ON … JOIN t3 ON … JOIN t4 ON … makes clear that the tables will be stacked from left to right like this: t1-t2-t3-t4.

You could rewrite the same example as it is normally written on multiple lines, as follows:

SELECT …

FROM t1

JOIN t2 ON …

JOIN t3 ON …

JOIN t4 ON …

The result is the same, even if it is not so immediate to see left and right as mentioned in the previous statement.

Visual representation of join types

We have defined how joins work through examples and explanations, but I think that for some people, an image is worth a thousand explanations, so I propose two ways to graphically look at joins:

One that tries to show how the matching and non-matching rows are treated in different kinds of joins
One that uses a set notation and compares the different types of joins

Visual representation of returned rows in JOIN queries

The following figure visually describes how two rows of tables A and B are aligned to form the rows resulting from the different types of joins:

Figure 1.7: Visual representation of different JOIN types between tables A and B

Each table is divided into two: one part where the rows have one or more matches on the other table that satisfy the join condition and another part where each row has no match on the other table.

The rows from the two tables that have a match, in the center of the figure, are properly aligned according to the matching condition so that each resulting row contains the values from A’s columns and B’s columns where the condition is met. This central part is always returned by all joins, unless explicitly excluded with a WHERE clause requesting one of the two keys to be NULL.

The rows where there is no match, shown at the top for table A and at the bottom for table B, are aligned with columns from the other table padded with NULL values. This produces the somewhat unintuitive result that a query with an ON A.key = B.key clause might produce rows where one of the two keys is NULL and the other is not.

Tip

Please remember that NULL is a special value and not all things work out as expected. As an example, the expression NULL = NULL produces NULL and not TRUE as you might expect.

Try it out yourself with this query: SELECT NULL = NULL as unexpected;.

That is why you test for NULL values using <field> IS NULL and not using equality.

Full outer join

The following figure depicts the result of a FULL OUTER JOIN in terms of the rows of the two original tables:

Figure 1.8: Visual representation of a FULL OUTER JOIN

You can clearly identify the central part of the previous picture, where rows from both tables satisfy the join constraint and the two parts where one of the tables has no matching rows for the other table; in these rows, the columns from the other table are filled with NULL values.

Left outer join

The following figure depicts the result of a LEFT OUTER JOIN in terms of the rows from the two original tables:

Figure 1.9: Visual representation of a LEFT OUTER JOIN

You can clearly see in the picture that the result consists of the rows from both sides that satisfy the join constraints, plus only the rows from table A that do not have a match in table B, with the columns from B filled with NULL values.

The rows from table B without a match in table A are not returned.

Another way to express this is that we have all rows from table A, plus the rows from B where there is a match in the join condition, and NULL for the other rows.

Important Note

When we join two tables, and we write a condition such as ON A.key = B.key, we are expressing interest in rows where this condition is true, and INNER JOIN just gives us these rows.

However, OUTER joins also return rows where the join clause is not true; in these rows, either the A.key or B.key column will be filled with NULL as there is no match on the other table.

Visual representation of JOIN results using sets

The following figure visually represents the join types that use an ON clause, representing, as sets, the rows from tables A and B that match or do not match the join condition in the ON clause.

The overlapping area is where the condition is matched by rows in both tables:

Figure 1.10: Visual representation of SQL JOIN types with an ON clause, as set operations

The preceding figure represents the join types that we have seen in two forms:

Using only the ON subclause, showing the results of the pure join
Using a WHERE clause on the column used in the ON subclause

In the figure, this information is used to exclude from the outer joins the rows where the match happens, where A and B overlap, which are the rows returned by an inner join.

This type of query is useful, and often used, for example, to see whether we have orders where the referenced customer does not exist in the customer table. This would be called an orphan key in the order table.

Let’s see an example using Snowflake sample data:

SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
LEFT OUTER JOIN "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
  ON C_CUSTKEY = O_CUSTKEY
WHERE C_CUSTKEY is NULL;

This query should return no rows, as all the customers referenced by the orders should exist in the customer table. Now that we have covered all the basic functions in SQL, let us check out an advanced feature: windows functions.

Advanced – introducing window functions

A window function is a SQL function that operates on a set of rows, called a window of rows.

These functions are also called analytic functions, as they are mostly used in analytic queries.

A window is a set of correlated rows, containing any amount from 0 to all the rows resulting from a query.

Generally, the windows divide the resulting rows into disjointed partitions.

Some window functions can also operate on a window frame, which is a dynamic subset of the rows defined by the window, based on the current row.

You can define a cumulative window frame, taking the rows of the window before or after the current row, or a sliding window frame, going from some rows before the current row to some rows after the current row.

Window functions have a lot in common with aggregated functions, the ones that use GROUP BY, as they both divide the results of a query into groups, that is, partitions, and then calculate the function on the rows in each group.

The big difference is that with GROUP BY, you get only one row per group, while with window functions, all the rows are returned, with the function calculated on the desired subset of rows.

Let’s show some examples to make the concept more real.

Assume we have a table with orders, and we would like to know how much each order contributes to the total by day or month. We can calculate the percentage with the following formula:

monthly_pct = order_total / monthly_total * 100

Using group by, you could calculate the order total for each day or month, in separate queries, and then you would need to join those results with the original table to be able to apply the formula.

Window functions are here to help, as you can calculate the daily or monthly measures in a single step by defining a window that puts the rows together in different ways, as in the following query:

SELECT
  O_ORDERKEY,
  O_CUSTKEY,
  O_ORDERDATE,
  O_TOTALPRICE,
  avg(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_avg,
  sum(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_total,
  sum(O_TOTALPRICE) over(partition by
          DATE_TRUNC(MONTH, O_ORDERDATE)) as monthly_total,
  O_TOTALPRICE / daily_avg * 100 as avg_pct,
  O_TOTALPRICE / daily_total * 100 as day_pct,
  O_TOTALPRICE / monthly_total * 100 as month_pct
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
QUALIFY row_number() over(partition by O_ORDERDATE
                          order by O_TOTALPRICE DESC) <= 5
order by O_ORDERDATE, O_TOTALPRICE desc;

This example uses window functions to calculate totals and averages over different periods and also to limit the returned rows to the five biggest orders for each day.

This query is pretty efficient, executing in a few seconds even after processing million of rows. By changing the schema from TPCH_SF1 to TPCH_SF10, TPCH_SF100, or TPCH_SF1000, you can try out bigger tables in the Snowflake sample data, to see how it works with bigger datasets.

Window definition

A window is defined using the OVER clause, which specifies how to partition and order the rows of the incoming data to apply the window function.

To use a window function, you use this syntax:

<function_name> ([<arguments>])
    OVER ( [PARTITION BY <part_expr>] [ORDER BY <order_expr>])

The presence of the OVER clause specifies that the function is applied to the windows defined inside the OVER clause. Of course, the function can also take its own arguments.

The PARTITION BY clause defines how the rows are partitioned and <part_expr> is a comma-separated list of columns or expressions calculated on the query data, in a similar way to what we have seen for the group by expressions and shown in the previous example.

The ORDER BY clause specifies the ordering of the rows to be used to calculate the function and it is not visible outside of the window function. The order by clause and <order_expr> work pretty much as they do in the SELECT statement to order the results of the query.

At the syntax level, both the PARTITION BY and ORDER BY clauses are optional, even if some functions require or forbid one of the two. Look at each function’s documentation for more details.

The ORDER BY clause has an effect on the order-sensitive window function, which comes in two types:

Rank-related functions: These functions calculate values based on the rank of the row, which is the order of the row in the window. They are used to make explicit, that is, crystallized into a value that can be used later, one of many possible orderings of a row in a window.

One example is the ordering of teams in a championship based on points or medals in a sport: you can have different styles of ordering based on what is decided when you have the same value.

As an example, if two teams (A and B) have the same score, after another team (C) with an higher score, you can say that A and B are both second or you can say that one is second and the other one third.

Window frame functions: These functions work on a window frame, which is a dynamic subset of rows of a window. As we saw previously, they can use cumulative or sliding window frames. The next section will delve into these.

Window frame definition

Some functions can use a dynamic window frame instead of a fixed window.

You specify the window frame inside the OVER clause, after the ORDER BY clause of the window:

<function_name> ([<arguments>])
    OVER ( [PARTITION BY <part_expr>]
           ORDER BY <order_expr>
           <cumulative_frame_def> | <sliding_frame_def>
         )

A cumulative window frame is a subset of the window that is either growing or shrinking based on the current row inside the window. The frame contains the rows of the window either before or after the current row. The syntax of the cumulative frame definition is the two following lines:

  {ROWS | RANGE} BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
| {ROWS | RANGE} BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING

The first expression takes all the rows in the partition or all the rows in the same order range up to the current one, while the second takes from the current row to the end of the partition or range. A better description of ROWS and RANGE follows.

A sliding window frame is a subset of rows that extends from a number of N rows before the current row to a number of M rows after the current row. One or both sides can be fixed to the start or end of the partition.

The following are the three possible forms of the syntax:

  ROWS BETWEEN <N> { PRECEDING | FOLLOWING }
       AND <M> { PRECEDING | FOLLOWING }
| ROWS BETWEEN UNBOUNDED PRECEDING AND <M> { PRECEDING | FOLLOWING }
| ROWS BETWEEN <N> { PRECEDING | FOLLOWING } AND UNBOUNDED FOLLOWING

The first syntax is when you want to create a frame that extends from N rows before to M rows after the current rows, while the other two are the cases when one of the two extremities is fixed to be the start or end of the window.

ROWS/RANGE in a cumulative window definition

When specifying a cumulative window frame, we can use either ROWS or RANGE, but what is the difference?

ROWS: Indicates that the fixed side of the frame extends to the limit of the partition: from the start of the partition to the current row or from the current row to the end of the partition.
RANGE: Indicates that the fixed side of the frame extends to the previous/following rows that have the same value for the ORDER BY value. The frame operates on the preceding/following rows that are in the same position in the ordering. As an example, you can think of stepping through many teams that have the same number of points in a championship list.

Important note

Providing extensive information on window functions is beyond the scope of this introduction to SQL used in analytical operations. It is also one area where SQL is not always the same in all databases. We suggest searching for “window functions” in the documentation of your database.

Data Engineering with dbt: A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL

What do you get with eBook?

Product Details

Key benefits

Description

What you will learn

What do you get with eBook?

Product Details

Packt Subscriptions

Table of Contents

Recommendations for you

Customer reviews

Filter reviews by

People who bought this also bought

Authors (1)

FAQs