The Basics of SQL to Transform Data
This book is about engineering data platforms with dbt. When working with dbt, you write code in Structured Query Language (normally referred to as SQL, pronounced sequel or as its letters: es-qu-el), with a spritz of Jinja templating.
This means that when we are not discussing architectural and design topics, we will be reading and writing code in SQL, so you should get up to speed with SQL to work proficiently with dbt.
In this chapter, you will review, or learn from scratch, the basics of SQL, becoming familiar with this powerful language that you find everywhere when working with data.
First, we will introduce SQL and walk you through its core concepts, so that even if you are new to this language, you can start building your competence and follow the rest of this chapter and the book.
To complement this introduction, you will quickly be presented with all the main SQL commands, organized by category of use. This will ensure that you know the most relevant parts of SQL and you can use this chapter as a quick reference when you start writing your code.
You will then be guided through how to create an account in Snowflake, the leading cloud database that is very often used with dbt. You will use Snowflake in the cloud, without the need to install anything, and it is free for a trial period.
Once you have a Snowflake account, you will be guided through creating your first role, user, and database.
With your cloud database ready and the basic SQL commands presented, we will look deeper into the SELECT statement and the JOIN clause, as they will be crucial in working with dbt.
Analytical and window functions are presented at the end of the chapter in a simple way, suitable for beginners as well as mid- to senior-experienced SQL users; these advanced functionalities will power up your SQL skills, vastly widening your abilities.
To summarize, in this chapter, we are going to cover these topics:
- Introducing SQL
- SQL basics – core concepts and commands
- Setting up a Snowflake database with users and roles
- Querying data in SQL – syntax and operators
- Combining data in SQL – the JOIN clause
- Advanced – introducing window functions
Technical requirements
This chapter does not assume any prior SQL knowledge and presents information from the basics to intermediate level. If you know SQL well, you can skip this chapter or just browse the more advanced topics.
All code samples of this chapter are available on GitHub at https://github.com/PacktPublishing/Data-engineering-with-dbt/tree/main/Chapter_01.
To run the samples, later in this chapter, we will guide you through setting up a Snowflake account and roles and creating your first database.
Introducing SQL
SQL was created in the 70s, and by the end of the 80s it had become the de facto standard for interacting with Relational Databases (RDBs); it now powers most of the data management industry in the world.
Given its huge footprint and powerful abstractions, SQL has become a standard that anyone working with database systems eventually becomes familiar with. The expressive power of SQL is well understood and its knowledge is so ubiquitous that it has been taken into use beyond RDBs, with Database Management Systems (DBMSs) of all sorts providing a SQL interface even on top of many non-RDB systems.
Some of the great advantages of SQL are as follows:
- The core SQL functionality was standardized in the 80s and yet SQL is still very much alive and well, evolving and adding new powerful functionalities as data management evolves while maintaining compatibility with previous versions.
Every database has its SQL quirks, but the logic is the same and most SQL code will work on multiple databases with little or no change.
Learn it now and use it forever, and with (almost) every database.
- At its core, it has a simple, rigorous, and powerful syntax that reads like English sentences, so even non-tech people can grasp the basic idea, while professionals can express exactly what they want in a precise and concise way.
Most people can probably get a sense of what the following SQL does:
SELECT ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT
FROM ORDERS
WHERE YEAR(ORDER_DATE) = 2021;
- With SQL, you work at the logical level, so you do not have to deal with implementation details, and it is a declarative language; you describe in a rigorous way what you want to achieve, not how to do it. The database engine has the freedom to store data, be implemented, and perform the request in its own way, as long as it produces the correct result according to SQL specifications.
- With a single SQL statement, you can process one piece of data or billions, leaving the burden of finding the most effective way to the database and giving you some freedom from scale.
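As a sketch of this declarative, scale-free style, the following query (using the same hypothetical ORDERS table as in the query above) computes per-customer totals in the same way whether ORDERS holds ten rows or ten billion; the engine decides how to execute it:

```sql
-- Summarize orders per customer; we say what we want, not how to compute it
SELECT CUSTOMER_CODE,
       COUNT(*)          AS NUM_ORDERS,
       SUM(TOTAL_AMOUNT) AS TOTAL_SPENT
FROM ORDERS
GROUP BY CUSTOMER_CODE;
```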
SQL basics – core concepts and commands
The SQL language is certainly big and complex, taking time to be fully mastered, but surprisingly, you can get productive with the limited set of features that we will introduce in this book, and you will pick up the nuances when you need them while working on real use cases.
The goal of this chapter is not to make you a SQL expert but to get you started with it, getting deeper into the few commands that we will use often to make you quickly productive with dbt.
In this intro, we start with the most basic concepts of SQL and then provide an overview of the SQL commands by their categories of use. In the rest of this chapter, we will dig deeper into the commands that are most often used when working with dbt: SELECT, JOIN, and analytical and window functions.
SQL core concepts
The main goal of SQL, as its name implies, is to allow users to query data that is contained in a database; SQL also provides all the commands to fully manage the database, allowing you to add, transform, delete, organize, and manage data and other database objects, such as users and roles.
The core concepts in SQL come from how an RDB is organized:
- A database contains tables, organized in schemata (plural of schema)
- Tables store data in rows that have one value for each column defined in the table
- Columns have a name and can contain only data of the declared data type
- To regulate access to the data, privileges are assigned to users and roles
You can see a database with schemata and tables in the following screenshot, which shows part of the sample database available in any Snowflake account:

Figure 1.1: Example database with schemata and tables
Let’s go through the core concepts of SQL, starting with the table.
What is a table?
The table is the most central concept in SQL, as it is the object that contains data.
A table in SQL is very close to the layman’s concept of a table, with data organized in columns and rows.
Columns represent the attributes that can be stored in a table, such as a customer’s name or the currency of an order, with each column defining the type of data it can store, such as text, a number, or a date.
In a table, you can have as many rows as you want, and you can keep adding more whenever you need. Each row stores in the table one instance of the concept that the table represents, such as a specific order in the following example of a table for orders:
| Order_ID | Customer_CODE | Total_amount | Currency |
| --- | --- | --- | --- |
| 123 | ABC | 100 | EUR |
| 166 | C099 | 125.15 | USD |
Table 1.1: Example order table
Looking at the previous example order table, we can see that the table has four columns that allow storing the four attributes for each row (order ID, customer code, total amount, and currency of the order). The table, as represented, has two rows of data, representing one order each.
The first row is not data; it is just a header showing the column names to make the table easier to read when printed out.
In SQL, a table is both the definition of its content (columns and their types) and the content itself (the data, organized by rows):

- Table definition: It lists the columns that make up the table, and each column provides the data type and other optional details. The data type is mandatory and declares what values will be accepted in the column.
- Table content: It is organized in rows, each row containing one value for each column defined for the table, or null if no value is provided.
- Data value: All the values in a column must be compatible with the type declared for the column. null is a special value that corresponds to the absence of data and is compatible with all data types.
When creating a table, we must provide a name for the table and its definition, which consists of at least column names and a type for each column; the data can be added at a different time.
In the following code block, we have a sample definition for an order table:
CREATE TABLE ORDERS (
    ORDER_ID NUMBER,
    CUSTOMER_CODE TEXT,
    TOTAL_AMOUNT FLOAT,
    ORDER_DATE DATE,
    CURRENCY TEXT DEFAULT 'EUR'
);
Tip
When we write some SQL, we will use Snowflake’s SQL syntax and commands. We will guide you on how to create a free Snowflake account where you can run the code shown here.
In this example, we see that the command to create a table reads pretty much like English and provides the name of the table and a list of columns with the type of data each column is going to contain.
You can see that for the CURRENCY column, of type text, this code also provides the default value EUR. Single quotes are used in SQL to delimit a piece of text, aka a string.
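Once the table exists, data can be added at any time with INSERT statements, which we will meet again in the DML section. As a sketch (the dates are made up for the example), note that the second row omits CURRENCY, so it would receive the declared default of EUR:

```sql
INSERT INTO ORDERS (ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT, ORDER_DATE, CURRENCY)
VALUES (123, 'ABC', 100, '2021-01-15', 'EUR');

-- CURRENCY is omitted here, so the default 'EUR' declared on the column is applied
INSERT INTO ORDERS (ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT, ORDER_DATE)
VALUES (166, 'C099', 125.15, '2021-02-20');
```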
View – a special way to look at some data
If you already have data and you want to make it available with some transformation or filtering, you can create a view. You can think of a view as a table whose column definitions and data both come from a query reading from one or more tables.
As an example, if you would like to have a shortlist of orders with an amount greater than 1,000, you could write the following query to create a BIG_ORDERS view:
CREATE VIEW BIG_ORDERS AS
SELECT *
FROM ORDERS
WHERE TOTAL_AMOUNT > 1000;
In this example, we see that this simple create view statement provides the name for the view and uses a query, which is a SELECT statement, to define what data is made available by the view.
The query provides both the data, all the orders with a total amount greater than 1,000, and the column definitions. The * character – called star – is a shortcut for all columns in the tables read by the SELECT statement.
This is, of course, a naïve example, but throughout this book, you will see that combining tables that store data and views that filter and transform the data coming from tables and views is the bread and butter of working with dbt. Building one object on top of the previous allows us to take raw data as input and provide as output refined information that our users can easily access and understand.
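Once created, a view is queried exactly like a table; for example, this sketch reads through the BIG_ORDERS view, and the filter on the total amount is applied automatically:

```sql
-- The view applies TOTAL_AMOUNT > 1000 for us; no need to repeat the filter
SELECT ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT
FROM BIG_ORDERS;
```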
Tip
When working with dbt, you will not need to write create table or create view statements, as dbt will create them for us. It is nevertheless good to get familiar with these basic SQL commands, as these are the commands executed in the database and you will see them if you look in the logs.
Database and schema
We have seen that a database is a container for tables and views.
A DB can be further divided and organized using schema objects.
In real-life DBs, the number of tables can range from a few units to many thousands. Schemata act pretty much like folders, but cannot be nested, so you can always identify a table by its database, schema, and name.
In the following screenshot, we see part of the contents of the SNOWFLAKE_SAMPLE_DATA database, which is available in all Snowflake accounts:

Figure 1.2: Some schemata and tables of the SNOWFLAKE_SAMPLE_DATA database
The database and schemata, as table containers, are the main ways information is organized, but they are also used to apply security, access limitations, and other features in a hierarchical way, simplifying the management of big systems.
To create the TEST database and the SOME_DATA schema in it, we can use the following commands:

CREATE DATABASE TEST;
CREATE SCHEMA TEST.SOME_DATA;
The database.schema notation, also known as a fully qualified name, allows us to precisely describe in which database to create the schema and, after its creation, uniquely identifies the schema.
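Fully qualified names extend to tables as database.schema.table, so you can reference an object from anywhere without first selecting a database or schema with USE. As a sketch, with a hypothetical COUNTRIES table:

```sql
-- COUNTRIES is a hypothetical example table
CREATE TABLE TEST.SOME_DATA.COUNTRIES (
    CODE TEXT,
    NAME TEXT
);

-- The fully qualified name works regardless of the current database and schema
SELECT * FROM TEST.SOME_DATA.COUNTRIES;
```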
Tip
While working with dbt, you will create a database or use an existing one for your project; dbt will create the required schema objects for you if you have not created them already.
A best practice in Snowflake is to have one database for each project or set of data that you want to keep separate for administration purposes. Databases and schemata in Snowflake are soft boundaries, as all the data from all the databases and schemata can be accessed if the user has the appropriate privileges.
In some other database systems, such as PostgreSQL, a database is a stronger boundary.
Users and roles
To control access to your data in SQL, you GRANT access and other privileges to both users and roles.
A user represents one individual person or service that can access the database, while a role represents a named entity that can be granted privileges.
A role can be granted to a user, providing them all the privileges associated with the role.
A role can also be granted to another role, building hierarchies that use simple basic roles to build more complex roles that are assigned to a user.
Using roles instead of granting privileges directly to users allows you to manage even a large number of users simply and consistently. In this case, roles are labels for a set of privileges and become a way to manage groups of users that we want to grant the same privileges. Changing the privileges granted to a role at any moment will change the privileges that the users receive from that role.
A typical pattern when working with dbt is to create a role for the dbt users and then assign it to the developers and the service user that the dbt program will use.
The following is an example of a simple setup with one role and a couple of users:
CREATE ROLE DBT_SAMPLE_ROLE;
CREATE USER MY_NAME;        -- Personal user
CREATE USER SAMPLE_SERVICE; -- Service user
GRANT ROLE DBT_SAMPLE_ROLE TO USER MY_NAME;
GRANT ROLE DBT_SAMPLE_ROLE TO USER SAMPLE_SERVICE;
A more complex setup could have one role to read and one to write for each source system (represented by a schema with the data from the system), for the data warehouse (one or more schemata where the data is processed), and for each data mart (one schema for each data mart).
You could then control in much more detail who can read and write what, at the cost of more effort.
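As a sketch of such a finer-grained setup, with hypothetical database, schema, and role names, a read role gets only SELECT on a source schema, while a write role builds on it and adds modification privileges:

```sql
-- Hypothetical names: SOURCE_DB.SALES_SYSTEM schema, reader and writer roles
CREATE ROLE SALES_READ_ROLE;
CREATE ROLE SALES_WRITE_ROLE;

GRANT USAGE ON SCHEMA SOURCE_DB.SALES_SYSTEM TO ROLE SALES_READ_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA SOURCE_DB.SALES_SYSTEM TO ROLE SALES_READ_ROLE;

-- The write role inherits reading via the read role and adds write privileges
GRANT ROLE SALES_READ_ROLE TO ROLE SALES_WRITE_ROLE;
GRANT INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA SOURCE_DB.SALES_SYSTEM
    TO ROLE SALES_WRITE_ROLE;
```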
Understanding the categories of SQL commands
SQL commands can be organized into categories according to their usage in the language:
- Data Definition Language (DDL): DDL contains the commands that are used to manage the structure and organization of a database
- Data Manipulation Language (DML): DML contains the commands that are used to manipulate data, for example, INSERT, DELETE, and UPDATE
- Data Query Language (DQL): DQL contains the SELECT command and is the central part of SQL that allows querying and transforming the data stored in a database
- Data Control Language (DCL): DCL contains the GRANT and REVOKE commands, which are used to manage the privileges that control access to database resources and objects
- Transaction Control Language (TCL): TCL contains the commands to manage transactions
In the upcoming sections, we provide more details about these by looking at Snowflake-specific commands, but the ideas and names are of general use in all database systems, with little or no change.
Data Definition Language – DDL
DDL commands do not deal directly with the data but are used to create and maintain the structure and organization of the database, including creating the tables where the data is stored.
They operate on the following objects:
- Account/session objects, which contain and operate on the data, such as user, role, database, and warehouse
- Database/schema objects, which store and manipulate the data, such as schema, table, view, function, and stored procedure
The main commands are as follows:

- CREATE: Used to create the database itself and other objects
- DROP: Used to delete the specified object
- ALTER: Used to modify some attribute of the specified object
- DESC: Used to describe the details of the specified object
- SHOW: Used to list the existing objects of the specified object type, with metadata
- USE: Used to select the database, schema, and object to use when fully specified names are not used
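As a quick sketch of these commands in action, using a hypothetical throwaway table so nothing important is dropped:

```sql
CREATE TABLE TEMP_DEMO (ID NUMBER);           -- create a hypothetical table
ALTER TABLE TEMP_DEMO ADD COLUMN NOTES TEXT;  -- modify it
DESC TABLE TEMP_DEMO;                         -- describe its columns and types
SHOW TABLES LIKE 'TEMP_DEMO';                 -- list matching tables, with metadata
DROP TABLE TEMP_DEMO;                         -- delete it
```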
Tip
When working with dbt, we use the DDL and DML commands only in macros.
We do not use the DDL and DML commands in models because dbt will generate the required commands for our models based on the metadata attached to the model.
Data Manipulation Language – DML
DML provides the commands to manipulate data in a database and carry out bulk data loading.
Snowflake also provides specific commands to stage files, that is, to load files into a Snowflake-managed location called a stage.
The main commands are as follows:

- INSERT: Inserts rows into a table
- DELETE: Removes specified rows from a table
- UPDATE: Updates some values of specified rows in a table
- MERGE: Inserts, updates, or deletes rows in a table
- TRUNCATE TABLE: Empties a table, preserving the definition and privileges
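MERGE is the least obvious of these commands. As a sketch, assuming a hypothetical STAGING_ORDERS table with the same columns as ORDERS, a single MERGE statement updates the orders that already exist and inserts the ones that do not:

```sql
MERGE INTO ORDERS AS tgt
USING STAGING_ORDERS AS src          -- hypothetical staging table
  ON tgt.ORDER_ID = src.ORDER_ID
WHEN MATCHED THEN
  UPDATE SET tgt.TOTAL_AMOUNT = src.TOTAL_AMOUNT
WHEN NOT MATCHED THEN
  INSERT (ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT)
  VALUES (src.ORDER_ID, src.CUSTOMER_CODE, src.TOTAL_AMOUNT);
```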
For bulk data loading, Snowflake provides the following command:

- COPY INTO: Loads data from files in a stage into a table or unloads data from a table into one or more files in a stage
To manage files in stages, Snowflake provides these file-staging commands:

- PUT: Uploads a local file to a Snowflake stage
- GET: Downloads a file from a stage to the local machine
- LIST: Lists the files existing in a Snowflake stage
- REMOVE: Removes a file from a Snowflake stage
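As a sketch of the staging flow with a hypothetical local CSV file (note that PUT runs from a local client such as SnowSQL, not from the web UI), the file is first uploaded to the table stage of ORDERS and then loaded into the table:

```sql
-- Upload a local file to the internal table stage of ORDERS (@% denotes a table stage)
PUT file:///tmp/orders.csv @%ORDERS;

-- Load the staged file into the table, skipping the header row
COPY INTO ORDERS
FROM @%ORDERS
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```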
Important note
In dbt, we can use macros with the COPY INTO and file-staging commands to manage the data-loading part of a data pipeline, when source data is in a file storage service such as AWS S3, Google Cloud Storage, or Microsoft Azure Data Lake file storage.
Data Query Language – DQL
DQL is the reason why SQL exists: to query and transform data.
The command that is used to query data is SELECT, which is without any doubt the most important and versatile command in all of SQL.
For the moment, consider that a SELECT statement, aka a query, can do all these things:
- Read data from one or more tables
- Apply functions and transformations to the data retrieved
- Filter the data retrieved from each of these tables
- Group the retrieved data on a set of columns (or transformed values), producing one row for each group, calculating functions on the grouped rows, and filtering on the results of those functions
- Calculate for each row one or more functions based on groups of rows identified by a window expression and filter the results based on the results of these functions
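A single SELECT can combine several of these capabilities; as a sketch on the ORDERS table, this query filters rows, groups them, aggregates, filters on the aggregate, and sorts the result:

```sql
-- Customers whose 2021 orders total more than 1,000, highest spenders first
SELECT CUSTOMER_CODE,
       SUM(TOTAL_AMOUNT) AS TOTAL_SPENT
FROM ORDERS
WHERE YEAR(ORDER_DATE) = 2021      -- filter the rows read
GROUP BY CUSTOMER_CODE             -- group and aggregate
HAVING SUM(TOTAL_AMOUNT) > 1000    -- filter on the aggregate result
ORDER BY TOTAL_SPENT DESC;
```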
Important note
We have dedicated the Query syntax and operators section later in this chapter to analyzing the SELECT command in Snowflake in detail, as you will use the SELECT command in every dbt model.
Data Control Language – DCL
DCL contains the GRANT and REVOKE commands, which are used to manage the privileges and roles that control access to, and use of, database resources and objects.
Together with the DDL commands to create roles, users, and other database objects, the DCL commands are used to manage users and security:
- GRANT: Assigns a privilege or a role to a role (or user)
- REVOKE: Removes a privilege or a role from a role (or user)
- SHOW GRANTS: Lists access privileges granted to a role or object
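As a sketch with a hypothetical reporting role, granting, inspecting, and revoking a privilege looks like this:

```sql
-- REPORTING_ROLE is a hypothetical role name
GRANT SELECT ON TABLE ORDERS TO ROLE REPORTING_ROLE;
SHOW GRANTS TO ROLE REPORTING_ROLE;
REVOKE SELECT ON TABLE ORDERS FROM ROLE REPORTING_ROLE;
```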
Transaction Control Language – TCL
The TCL commands are used to manage transactions in a database.
A transaction groups a set of SQL commands into a single execution unit and guarantees that either the effects of all commands are applied, if the transaction completes with success, or no effect at all is applied, if the transaction fails. This all-or-nothing behavior is the atomicity in the ACID acronym, which stands for atomic, consistent, isolated, and durable.
A transaction succeeds and ends only when all the commands it groups have finished with success; in any other case, the transaction fails, and its effects are rolled back like they never happened.
The TCL commands are as follows:

- BEGIN: Starts a transaction in the current session
- COMMIT: Commits an open transaction in the current session and makes the transaction effects visible to all sessions
- ROLLBACK: Rolls back an open transaction in the current session and undoes all the effects applied since the BEGIN statement
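The classic example is moving an amount between two rows: either both updates become visible or neither does. As a sketch, with a hypothetical ACCOUNTS table:

```sql
BEGIN;
UPDATE ACCOUNTS SET BALANCE = BALANCE - 100 WHERE ACCOUNT_ID = 1;
UPDATE ACCOUNTS SET BALANCE = BALANCE + 100 WHERE ACCOUNT_ID = 2;
COMMIT;  -- or ROLLBACK; to undo both updates as if they never happened
```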
Now that we have covered the basic concepts and commands in SQL, it is time to set up a database to run them. The next section will provide you with access to a Snowflake DB.
Setting up a Snowflake database with users and roles
In this section, you will be guided through creating your own Snowflake account so that you can experiment with the SQL commands from the previous section and use it with dbt in the next chapters.
Once you have created your account, we will help you through a quick setup of users and roles that will be useful throughout this book and in your later projects.
Creating your Snowflake account
To sign up for a 30-day free trial of Snowflake, you can go to https://signup.snowflake.com/ or look for the Start for free button on the Snowflake home page.
On the Snowflake site, go through the following steps:
- Fill in the registration form with your data and click CONTINUE.

Figure 1.3: Snowflake free trial registration form
- Select the Snowflake edition that you want to use, pick your preferred cloud provider, and click GET STARTED and you will reach a confirmation page.

Figure 1.4: Left: Snowflake edition and cloud provider selection form. Right: sign-up confirmation
- Go to your email client and follow the link from the email that you receive to confirm your email address. It will take you back to the Snowflake website.
- On the Welcome to Snowflake! page, enter the username and password for the user that will become your account administrator and then click Get started.

Figure 1.5: Welcome to Snowflake! page
Tip
Later, we will teach you how to create all the roles and users that you want.
Anyway, it makes sense to pick a good name and keep the password safe, as this user has all privileges and can do pretty much everything on your account.
- After clicking the Get started button, you will land on the Snowflake user interface, with an empty worksheet open and an introductory menu with a few topics to get started with Snowflake. If you are new to Snowflake, it is a good idea to go through those topics.
- After dismissing this introductory menu, you are ready to explore and use your Snowflake account, which in the classical console interface will look like this:

Figure 1.6: The user interface of Snowflake worksheets
If you end up with the new Snowflake interface (called Snowsight), you can work with it or use the Classic Console entry in the menu to switch to the older interface.
Setting up initial users, roles, and a database in Snowflake
Now you have a Snowflake account with a user that you can log in with, and your user has been granted the omnipotent ACCOUNTADMIN role.
That’s great, but it is not advised that you use this role for normal work. The best practice is to set up a few users and roles to provide the desired access level and use the lowest level that can accomplish a task.
Snowflake allows you to pick the role you impersonate to execute commands in a worksheet from among the ones you have been granted, so that you do not have to log out and back in to change roles.
Tip
You can change your role, the warehouse to use, and the current database and schema with the selector in the top-right corner of the worksheet editing area.
You might have noticed that in the navigation panel on the left, you have two databases:
- The SNOWFLAKE database provides you with information about your Snowflake account
- The SNOWFLAKE_SAMPLE_DATA DB provides, as the name implies, some sample data
These databases are shared with you by Snowflake, and you can only read from them.
To do something meaningful, you will need at least a database you can write in.
In this section, we will create a database and some roles and users to use it.
Overview of initial roles
In a database, a user needs the relevant privilege to be able to perform an action.
A user can have the privilege directly, that is, assigned to the user, or indirectly, through a role that the user can impersonate.
In practice, privileges are assigned to roles, which are assigned to other roles and users, to build the desired security hierarchy, because assigning privileges to users is not scalable.
When you log in to a newly created Snowflake account, you are impersonating the ACCOUNTADMIN role, which has all privileges, but other, more specialized roles are already available.
Snowflake has already set up a few self-explanatory basic roles: ACCOUNTADMIN, SECURITYADMIN, USERADMIN, SYSADMIN, and PUBLIC.
Important note
To create users and roles, Snowflake provides the USERADMIN role by default, which has the CREATE ROLE and CREATE USER privileges directly assigned. This role is already assigned to the initial user that you created.
The initial user could create new users and roles using the ACCOUNTADMIN or SECURITYADMIN roles, because they have been granted the USERADMIN role; SECURITYADMIN had it granted directly and ACCOUNTADMIN indirectly, having been granted the SECURITYADMIN role.
However, this would not be a great idea, as you will see in the following paragraphs on ownership.
This chain of grants, from USERADMIN to SECURITYADMIN to ACCOUNTADMIN, is an example of a hierarchy of roles.
You can explore the existing roles and their privileges with these commands:
SHOW ROLES;
SHOW GRANTS TO ROLE <role_name>;
Ownership is needed for some operations, such as deletions, for which there is no privilege that can be granted.
It is therefore advisable that an object is created by a user impersonating the role that we want to manage the object with.
In this way, that user, and all other users with the same role, will have the required privileges.
An alternative is to use the GRANT command to explicitly provide the required privileges to a role.
Important note
An object is owned by the role that created it.
When a user has multiple roles, the role impersonated by the user at the moment of the creation will own the object.
Before creating an object, make sure you are impersonating the role you want to own the object.
Creating and granting your first role
Let’s switch to USERADMIN to create our users and roles, because this is the role that you will probably grant to the people delegated to perform user and role administration.
Being the owner will make all operations, including deletions, simpler:

- Switch to the USERADMIN role:

USE ROLE USERADMIN;
Let’s create the role that we will use for all users that have to fully manage a dbt project, be they humans or applications.
I usually call this role DBT_EXECUTOR_ROLE because I call the user for the dbt application DBT_EXECUTOR and I like to be crystal clear with my names.
You can of course pick a name that you prefer for both the role and the user.
- Create a role for users running dbt models:
CREATE ROLE DBT_EXECUTOR_ROLE
COMMENT = 'Role for the users running DBT models';
We now have a role that we can shape and use to manage a set of database objects, for example, corresponding to one or more dbt projects.
The dbt executor user will be used by dbt to run dbt models in shared environments, while dbt will use each developer’s own user to run dbt models in their own development environment.
The simplest setup, giving all developers the ability to manage all environments, is to assign this role to all developers. An alternative is to have a different role for production and other environments that you want to keep protected.
- Let’s keep it simple and assign the executor role to ourselves:
GRANT ROLE DBT_EXECUTOR_ROLE TO USER <your_user>;
Now, our user has both the account admin and the executor roles. To be able to see and select the role in the user interface dropdown, you might need to refresh the page.
Granting the privileges to create a database
We could create a database while impersonating the account admin role, but this would not help with our plan to use the executor role to manage this database. We must then give our executor role the ability to create a database. To have our SQL commands executed, we also need the ability to use an existing warehouse or create a new one.
Important note
To manage the structure of databases and the warehouse settings, Snowflake provides the default SYSADMIN role, which has the required privileges.
We can achieve our goal of providing the executor role with those abilities by doing one of the following:
- Granting the required privileges directly to the executor role
- Granting the SYSADMIN role, which has the required privileges, to the executor role
What would be the difference?
If you grant the privileges directly, the executor role is totally isolated from other roles, and users with that role will just be able to access what the role allows them to access.
If you grant the SYSADMIN role, users with the executor role will also be able to impersonate the SYSADMIN role and therefore access any object it can access. As an example, they could access a database that another user created with the SYSADMIN role.
Let’s assign the desired privileges directly, to keep our dbt objects away from SYSADMIN.
To grant the privileges to the executor role, we need to switch to a role that can grant these privileges and then grant them:

- Switch to the SYSADMIN role:

USE ROLE SYSADMIN;

- Grant the privilege to create a new database:

GRANT CREATE DATABASE ON ACCOUNT
TO ROLE DBT_EXECUTOR_ROLE;

- Grant the privilege to use the default warehouse, named COMPUTE_WH:

GRANT USAGE ON WAREHOUSE COMPUTE_WH
TO ROLE DBT_EXECUTOR_ROLE;
Creating your first database
Now that our executor role can create a database, let’s do it:
- Let’s impersonate the executor role so that it will own what we create:
USE ROLE DBT_EXECUTOR_ROLE;
- Now, let’s create our database:
CREATE DATABASE DATA_ENG_DBT;
-- or pick any name you like for the DB
To see the new database in Snowflake, you might need to refresh the browsing panel on the left of the user interface, with the Refresh button, shaped as a circular arrow, in the top right of the panel.
Tip
Clicking on the new database will show that it is not empty.
The new database has two schemata: INFORMATION_SCHEMA, which contains views that you can use to collect information on the database itself, and PUBLIC, which is the default schema of the new database and is empty.
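For example, you can list the tables and views of the new database by querying one of those INFORMATION_SCHEMA views:

```sql
-- Lists the tables and views of the DATA_ENG_DBT database, schema by schema
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE
FROM DATA_ENG_DBT.INFORMATION_SCHEMA.TABLES;
```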
To reduce Snowflake credit consumption, let’s finish by configuring the default warehouse to suspend after 1 minute instead of 10:
USE ROLE SYSADMIN;
ALTER WAREHOUSE "COMPUTE_WH" SET
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    COMMENT = 'Default Warehouse';
Tip
To alter the warehouse, you must switch to the SYSADMIN role, as it is the role that owns the warehouse; we have granted our executor role only the privilege to use it.
Congratulations! You have now created a role, assigned it to your personal user, created a database owned by that role, and configured the default warehouse. You are almost done setting up.
Creating a user for the dbt application
Let’s complete our initial setup by creating a new user that we will use with dbt and grant it the executor role so that it will be able to manage, as an owner, that database.
To create a user, we need to switch to a role with that ability:
- Switch to the USERADMIN role:

USE ROLE USERADMIN;

- Create the new user for dbt; I have named it DBT_EXECUTOR, though you can pick any name:

CREATE USER IF NOT EXISTS DBT_EXECUTOR
    COMMENT = 'User running DBT commands'
    PASSWORD = 'pick_a_password'
    DEFAULT_WAREHOUSE = 'COMPUTE_WH'
    DEFAULT_ROLE = 'DBT_EXECUTOR_ROLE'
;

- Let’s assign the executor role to the dbt user:

GRANT ROLE DBT_EXECUTOR_ROLE TO USER DBT_EXECUTOR;

- Switch back to the operational role, which we should always be working with:

USE ROLE DBT_EXECUTOR_ROLE;
Great! You have now performed a basic setup of your new Snowflake account; you have learned the basics of user and role management and you are ready to learn more SQL to query data.
Querying data in SQL – syntax and operators
In this section, you will learn how to query existing data using SQL and the Snowflake example database.
In this chapter, and more generally in this book, you will get familiar with Snowflake syntax, which is modern and very standard. We will also use some proprietary extensions that make your work easier.
Snowflake query syntax
Snowflake supports querying data with the standard SELECT
statement, which has the following basic syntax:
WITH … SELECT … FROM … JOIN … WHERE … GROUP BY … HAVING … QUALIFY … ORDER BY … LIMIT …
The only mandatory part is the SELECT clause itself, so SELECT 1 is a valid query that just returns the value 1.
If you are familiar with SQL from other database systems, you will wonder what the QUALIFY
clause is. It is an optional SQL clause that is very well suited to the analytical kind of work that Snowflake is used for and that not all database engines implement. It is described later in this section.
We often use the terms query, command, and statement interchangeably when referring to some piece of SQL that you can execute.
Properly speaking, a command is a generic command such as SELECT
or CREATE <object>
, while a statement is one specific and complete instance of a command that can be run, such as SELECT 1
or CREATE TABLE
my_table …;
.
The term query should really only refer to SELECT
statements, as SELECT
statements are used to query data from the database, but query is often used with any statement that has to do with data.
You will also often hear the term clause used, such as the FROM
clause or GROUP BY
clause. Informally, you can think about it as a piece of a statement that follows the syntax and rules of that specific keyword.
The WITH clause
The WITH
clause is optional and can only precede the SELECT
command to define one or more Common Table Expressions (CTEs). A CTE associates a name with the results of another SELECT
statement, which can be used later in the main SELECT
statement as any other table-like object.
Defining a CTE is useful for the following:
- Clarity: You can provide an informative name to a piece of SQL
- Reuse, maintenance, and efficiency: You can define a supporting query, whose results you might use more than once in the main query, and the database engine will execute the supporting query once
- Creating recursive queries: Defining a CTE that queries itself is the only way to use recursion in SQL
- Simplification: You can break a complex transformation into smaller, simpler steps that are easier to code and understand, limiting the scope of future maintenance
The simplified syntax is as follows:
WITH [RECURSIVE] <cte1_name> AS (SELECT …) [, <cte2_name> AS (SELECT …)] SELECT …
We will check out some examples here:
- Check whether we still have high-priority orders pending:
WITH
high_prio_orders as (
SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE O_ORDERPRIORITY IN ('1-URGENT', '2-HIGH')
)
SELECT count(*)
FROM high_prio_orders
WHERE O_ORDERDATE < '1998-01-01'
and O_ORDERSTATUS = 'O';
Here, you can see that the first expression encapsulates the business definition of a high-priority order: any order with its priority set to urgent or high. Without the CTE, you would have to mix the business definition and the other filtering logic in the WHERE clause. Then, it would be unclear whether the status is part of the definition or just a filter that we are applying now.
- Calculate some metrics for customers in the auto industry:
WITH
auto_customer_key as (
SELECT C_CUSTKEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
WHERE C_MKTSEGMENT = 'AUTOMOBILE'
),
orders_by_auto_customer as (
SELECT O_ORDERKEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
WHERE O_CUSTKEY in (SELECT * FROM auto_customer_key)
),
metrics as (
SELECT 'customers' as metric, count(*) as value
FROM auto_customer_key
UNION ALL
SELECT 'orders by these customers', count(*)
FROM orders_by_auto_customer
)
SELECT * FROM metrics;
In this example, the CTEs and the final query remain short and very simple. You can start to appreciate that the clear labeling of intentions and the simple pieces of SQL make the full query easy to understand for anyone.
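If you want to experiment with CTEs outside Snowflake, any SQL engine will do. The following sketch uses Python's built-in sqlite3 module as a stand-in engine, with a tiny invented orders table, to show how a CTE isolates the business definition of "high-priority order" from the rest of the filtering logic:

```python
import sqlite3

# In-memory database with a tiny, made-up orders table
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (order_id INTEGER, priority TEXT, status TEXT);
    INSERT INTO orders VALUES
        (1, '1-URGENT', 'O'),
        (2, '3-MEDIUM', 'O'),
        (3, '2-HIGH',   'F');
""")

# The CTE encapsulates the business definition of "high priority";
# the outer query then applies only the extra status filter
query = """
    WITH high_prio_orders AS (
        SELECT * FROM orders
        WHERE priority IN ('1-URGENT', '2-HIGH')
    )
    SELECT count(*) FROM high_prio_orders WHERE status = 'O'
"""
pending_high_prio = con.execute(query).fetchone()[0]
print(pending_high_prio)  # 1 -- only order 1 is both high priority and open
```

The same separation of "definition" and "filter" is what the Snowflake example above achieves on the TPCH sample data.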
The SELECT clause
The SELECT
command can appear in two possible forms: as the central clause of a SELECT
statement or used as a clause in other statements. In both cases, it defines the set of columns and calculated values returned by the statement.
The simplified syntax of a SELECT
clause in Snowflake is a list of column definitions separated by commas, with the optional distinct
keyword to omit duplicates:
SELECT [DISTINCT] <column_definition_1> [, <column_definition_2> …]
The SELECT
clause is used to provide the list of columns that the query will return.
For each column definition that is provided, it is also possible to provide a column alias, which will be used to identify that column definition in the results.
When two column definitions have the same name, an alias must be provided for at least one of the two to avoid a name clash in the results.
A column definition can be one of these four expressions:
<column_definition> =
    object.*            -- all columns in the object
  | object.col_name     -- the named column in the object
  | object.$n           -- the n-th column in the object
  | <expression>        -- the value of the expression
  [ AS <col_alias> ]
Let’s describe in detail these four expressions:
- The star symbol, *, indicates all possible columns; it means “all columns from all tables” if it is applied without an alias or “all the columns from the aliased object” if applied with an alias.
- Star without an alias will return all columns from both tables:
SELECT * FROM table_1, table_2
- Star with an alias (or table name) will return all columns from the object with the alias. In this case, it will return all columns from table_1, as it is applied to its alias:
SELECT t1.* FROM table_1 as t1, table_2 as t2
- Snowflake has introduced two powerful extensions to the star syntax: EXCLUDE, to remove some columns from being returned, and RENAME, to rename a column while selecting it, as shown in this example:
SELECT * EXCLUDE (c3, c2) RENAME (c4 as cx, c5 as cy)
FROM table_1
- A column name, optionally with an alias indicating the object the column comes from and/or an alias to use for the column in the results:
SELECT
O_ORDERKEY,
ord.O_CUSTKEY,
cust.C_NAME as CUSTOMER_NAME
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS as ord
JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER as cust
ON cust.C_CUSTKEY = ord.O_CUSTKEY;
Note that here, the designers of the TPCH database have gone a long way to keep column names unique across all tables by adding the table's initials as a prefix to column names. This is a pretty old style, as it makes it harder to recognize fields that contain the same values, such as C_CUSTKEY and O_CUSTKEY.
Looking at our example query, we can see the following:
- Object aliases are optional if the name of the column is unique in the tables referenced in the query, as it is for all columns in the TPCH database.
- Aliases can in any case be used for clarity, like with
cust.C_NAME
. - Object aliases are mandatory when referencing a column name that appears in more than one referenced table. This would have been the case if both fields had been just
CUSTKEY
, without theC_
andO_
table prefixes. - Lastly, a column alias can always be used, even just to rename a column with an initial name that we do not like or is not clear, as in the case of a name that would not be clear, so we rename it to
CUSTOMER_NAME
.
- A column number after a dollar sign, such as $1 or $3.
This allows us to reference columns that do not have a name, such as when reading out of a CSV file without a header. This is also useful, especially in scripts, to reference columns that we do not know the name of, but we know their position:
SELECT $1 as ORDER_KEY, $2 as CUST_KEY FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;
In this case, we read from a table that has column names, but we can always use column numbers instead of column names. However, this is rarely done when reading from a table or a view, because by using the numbers, the result will depend on the order of the columns in the object, which can change over time if the object is recreated.
- An expression, like a mathematical expression or a function call, that evaluates to some value for each row:
SELECT
P_PARTKEY
, UPPER(P_NAME) as P_NAME
, P_RETAILPRICE
, P_RETAILPRICE * 0.9 as P_DISCOUNTED_PRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART";
The preceding example shows two expressions: a mathematical expression that multiplies a column by a constant to produce a new column and a function call that converts the p_name
column to uppercase, keeping the same column name.
Important note
The ability highlighted here, to write arbitrary expressions and assign a name to the result, is at the core of how SQL is used to transform data and is one of the most common things that we will do with dbt.
In this section, we have seen that the SELECT
clause allows us four great abilities:
- To decide what data from the source tables to keep in the result of a query
- To calculate new data from source data using arbitrary complex expressions
- To provide a new name for the data that we want in the query result
- To keep only one copy of each distinct row in the result, using the
DISTINCT
keyword
The FROM clause
The FROM
clause introduces the table objects used in a SQL statement as the source of data or target of the command.
In the case of a SELECT
statement, the FROM
clause can list none, one, or more table objects, each with its own alias. The rows in the result of the SELECT
statement will potentially have all the columns from all the tables referenced in the FROM
clause.
The simplified syntax looks like this:
SELECT … FROM [tableObject1 [AS alias1]] [, tableObject2 [AS alias2] …]
Table objects, in the context of a SELECT
statement, are a combination of the following:
- Tables and views: Tables and views are the main data objects SQL works with, so most of the time, this is what you find in the
FROM
clause. - Table functions: Table functions are a category of functions that return a set of rows, with one or more columns. They can be predefined functions or user-defined ones.
- A
VALUES
clause: This clause allows us to build a set of rows, using constant values. This allows us to create an inline table that can be suitable as a reference table, mapping table, or test input. - A few other objects that can be read from, such as a
LATERAL
subquery or a staged file.
When no table is listed, then the columns in the SELECT
clause must use constants and global objects such as the current_date
function, as in the following example:
SELECT 1 + 1 as sum, current_date as today;
When we use the VALUES
clause, we can define an inline table, like in this example:
SELECT *
FROM ( VALUES
         ('IT', 'ITA', 'Italy')
        ,('US', 'USA', 'United States of America')
        ,('FI', 'FIN', 'Finland (Suomi)')
     ) as inline_table (code_2, code_3, country_name);
When the FROM
clause lists more than one table object, the result is the Cartesian product of the elements in these tables. To avoid a Cartesian explosion, it is important to include a WHERE
clause that restricts the combinations, keeping only the ones we want.
Important note
The Cartesian product is the result of pairing the rows of all the objects in the product in all possible ways. This means that the number of rows returned by the Cartesian product is the product of the number of rows in each object in the product.
The expression Cartesian explosion is often used to refer to the fact that the number of rows returned by the Cartesian product increases very quickly and can easily create performance issues.
The following query, thanks to the WHERE
clause, will generate a more meaningful and compact set of rows, ideally equal to the number of rows that exist in the LINEITEM
table:
SELECT count(*)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM" as l
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" as o
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" as c
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART" as p
WHERE o.O_ORDERKEY = l.L_ORDERKEY
  and c.C_CUSTKEY = o.O_CUSTKEY
  and p.P_PARTKEY = l.L_PARTKEY;
In the previous query, we have used a WHERE
clause to keep in the results only the rows that are really related. We achieve this by using equivalence constraints between the fields of the related tables.
The previous query generates a result of exactly 6,001,215 rows, which is the number of rows in the LINEITEM table, instead of the astronomical number resulting from multiplying the row counts: 6,001,215 x 1,500,000 x 150,000 x 200,000, which is roughly 2.7 x 10^23.
I’ll let you calculate the exact result…
The consequence is that the full query processes 6 million rows and runs in a few seconds, while the query without the WHERE clause would try to process about 10^16 times more rows, taking hours or days to complete and using up a lot of resources for nothing, as the resulting data provides nothing of interest.
The exact correspondence in the count derives from the fact that for each line item, there is only one order, for each order only one customer, and so on.
Providing the relevant WHERE
clause will help with reducing the result by many orders of magnitude even if the keys don’t match so precisely. We will deal with this topic in more detail in Chapter 3, where we talk about data modeling.
This example illustrates well the power of SQL and why it is important to express what we want correctly.
Tip
As a basic rule, never do an unrestricted Cartesian product unless you know what you are doing and you are fine with the exploded number of rows that you will generate.
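The row-count arithmetic of the Cartesian product is easy to verify on any engine. This sketch uses Python's sqlite3 with two tiny invented tables, contrasting the unrestricted product with the WHERE-restricted version:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (a_id INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (2);
""")

# Unrestricted Cartesian product: every pairing, 3 x 3 = 9 rows
cartesian = con.execute("SELECT count(*) FROM a, b").fetchone()[0]

# An equivalence constraint in WHERE keeps only the related pairs:
# each of the 3 rows in b matches exactly one row in a
restricted = con.execute(
    "SELECT count(*) FROM a, b WHERE a.id = b.a_id"
).fetchone()[0]
print(cartesian, restricted)  # 9 3
```

At 3 rows per table the difference is harmless; at millions of rows per table, as in the TPCH example above, it is the difference between seconds and days.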
The JOIN clause
JOIN
is a subclause of the FROM
clause and is used to describe in a more precise way how to combine the rows of two tables. JOIN
cannot exist outside of a FROM
clause.
The simplified syntax of JOIN
looks like this:
SELECT … FROM tableObject1 AS to1 [<join type>] JOIN tableObject2 as to2 ON <condition_A> [[<join type>] JOIN tableObject3 as to3 ON <condition_B>] …
The condition in each join is a Boolean expression, often using columns from the base to1
table and the joined table. It does not need to be an equality check and can use columns from multiple tables or even constants.
It is normal to chain more than one JOIN
inside a FROM
clause to pick data from multiple tables at once, and you generally do not need to think too much about the order of the joins, but it is good to know that the order might matter. In the case of Snowflake, the order is from left to right (think of the statement written on one row).
We will look in more detail into the JOIN
clause in the next section, where we will introduce the different types of joins and some examples.
We have seen that we can combine data from multiple tables, listing them in the FROM
clause, and that we can use the WHERE
clause to restrict the resulting Cartesian product to the combined rows that we actually want to keep. This use of FROM
and WHERE
is equivalent to the INNER JOIN
construct, which we will see in the Combining data in SQL – the JOIN clause section.
The WHERE clause
The WHERE
clause specifies the conditions that the rows involved in the command need to match.
The rows for which the expression evaluates to true
are kept/processed.
In the context of a SELECT
statement, the WHERE
clause limits the result of the query to the subset of rows that verify the overall condition set by the WHERE
clause.
The simplified syntax is as follows:
SELECT … WHERE <predicate>
<predicate>
can be as simple as the literal true
or false
or a complex expression, including logical operators and database functions that evaluate to a Boolean value (true
, false
, or NULL
) for every row that the query processes.
Tip
An expression that returns a Boolean value is called a predicate.
The following example returns only the rows where the order total is greater than 500,000:
SELECT O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" WHERE O_TOTALPRICE > 500000;
The following example does not return any row, but defines two columns with the same name and type as the columns in the source table and two with the name and type that we provided:
SELECT O_ORDERKEY,O_CUSTKEY, 1 as an_int, null::number as a_num FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" WHERE false;
This may be a good trick when you need to define some column names and types. Remember that you do not need a FROM
clause if you do not want to use columns from an existing table…
Important note
Be careful when NULL
is involved as it might not behave as you expect.
As an example, the NULL = NULL
expression evaluates to NULL
.
In a WHERE
clause, it means that the corresponding row is discarded.
We have seen in a previous example using the FROM
clause that the WHERE
clause can be used to specify some types of JOIN
. We suggest using the JOIN … ON …
syntax for joins and the WHERE
clause to filter rows. We will look at more examples in the Combining data in SQL - the JOIN clause section.
The GROUP BY clause
A GROUP BY
clause is used to calculate aggregate functions on groups of rows that produce the same value for the group by expression.
The simplified syntax is as follows:
SELECT … GROUP BY groupExpr1 [, groupExpr2 …]
The group by expression can be one of the following:
- Column name: The result of the expression is the value of the column with the given name. All rows with the same value in the column are grouped.
The following query calculates the sum of all orders by each customer (customer key):
SELECT O_CUSTKEY, sum(O_TOTALPRICE) FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" GROUP BY O_CUSTKEY;
- Number: The value to group by is the value of the column in the given position in the result of the
SELECT
statement. The value is evaluated at the end of the statement, after applying any function or expression.
The following query is the same as the previous one, using the ordinal 1
instead of the O_CUSTKEY
column name to indicate the column to group by:
SELECT O_CUSTKEY, sum(O_TOTALPRICE) FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" GROUP BY 1;
- SQL expression: Any expression using any combination of data from the query. The value to group by is the result of the expression.
The following query calculates the total orders by year, using the year()
function, which returns the year from a date:
SELECT YEAR(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE);
We have also added the ORDER BY
clause with the same expression as the GROUP BY
so that we get the output in a nice order. This is a pretty common pattern.
When more than one group by expression is provided, the rows are grouped together according to the values of all the group by expressions.
In the previous example, you could have used GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
to have the order total split by month (and year), like in the following:
SELECT YEAR(O_ORDERDATE), MONTH(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);
The HAVING clause
The HAVING
clause filters the rows resulting from a GROUP BY
clause according to a predicate
:
SELECT … GROUP BY … HAVING <predicate>
The predicate is an expression that returns a Boolean value and references:
- Constants
- Expressions that appear in the
GROUP
BY
clause - Aggregated functions that can be calculated according to the
GROUP
BY
clause
As an example, we could extend the example from the GROUP BY
topic to look at the months where we have less than 10,000 orders.
The following query does it:
SELECT YEAR(O_ORDERDATE), MONTH(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
HAVING count(*) < 10000
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);
In this query, we have used the count()
aggregate function to count how many lines, one for each order, there are in one group.
Tip
The HAVING
clause is for GROUP BY
what the WHERE
clause is for the FROM
clause.
We will look at the third filtering clause, QUALIFY
, in the next topic.
The QUALIFY clause
The QUALIFY
clause filters on the results of the window functions, which is the third type of data calculation that we can have in a query after expressions and group by.
We will dedicate the last section of this chapter to window functions.
The QUALIFY
clause is an optional SQL clause that proves extremely useful in analytical workloads, so it is implemented in Snowflake.
To recap a bit of what we have seen, these are the three clauses that are used to filter data:
- WHERE applies to the data being read from the sources specified in the FROM/JOIN clauses
- HAVING applies to the data that has been grouped by a GROUP BY clause
- QUALIFY applies to the data that has been calculated by a window function
One of the most common usages of the QUALIFY
clause together with the simple row_number()
window function is to defend against undesired duplicates or select one specific row with respect to other rows that represent different versions of the same object:
SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM"
QUALIFY row_number()
        over(partition by L_ORDERKEY, L_LINENUMBER
             order by L_COMMITDATE desc) = 1;
This query selects all fields from a table, just filtering out undesired rows for which the row_number
window function returns a value different from 1
.
The row_number window function assigns a progressive number, from 1 onward, to all the rows in the same window, following the row order, and then restarts from 1 with the next window.
Picking the rows where it is equal to 1 means keeping only the first row for each window.
The windows are defined by the over(…)
clause, which comprises a partition by
part that defines the windows, in a way similar to group by
, and an order by
that provides the ordering in the window. We will look in more detail at this in the section devoted to windows functions.
Tip
When using the QUALIFY
clause, as shown in the previous query, we avoid the need to create a column with the result of the row_number
window function and then use a subquery to filter on it. That is the normal way of doing this kind of filtering in a database that does not support the QUALIFY
clause. QUALIFY
is much simpler to read and use.
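SQLite has window functions but no QUALIFY, so it is a convenient place to see the subquery workaround just described. This sketch, with invented table and column names, deduplicates rows exactly as the QUALIFY example above does, keeping the latest version of each key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE lineitem (order_key INTEGER, line_no INTEGER, commit_date TEXT);
    INSERT INTO lineitem VALUES
        (1, 1, '1998-01-10'),
        (1, 1, '1998-02-20'),
        (1, 2, '1998-01-15');
""")

# Without QUALIFY: compute row_number in a subquery, then filter on it
rows = con.execute("""
    SELECT order_key, line_no, commit_date
    FROM (
        SELECT *, row_number() OVER (
                      PARTITION BY order_key, line_no
                      ORDER BY commit_date DESC) AS rn
        FROM lineitem
    )
    WHERE rn = 1
    ORDER BY order_key, line_no
""").fetchall()
print(rows)  # keeps only the latest version of each (order_key, line_no)
```

In Snowflake, the subquery and the rn column disappear and the filter moves into a single QUALIFY line.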
Now, let’s extend the order example from the previous section to select out the “good months,” when we have higher total sales than the average sales for the year they belong to:
WITH monthly_totals as ( SELECT YEAR(O_ORDERDATE) as year, MONTH(O_ORDERDATE) as month, sum(O_TOTALPRICE) as month_tot FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE) ) SELECT year, month, month_tot ,avg(month_tot) over(partition by YEAR) as year_avg FROM monthly_totals QUALIFY month_tot > year_avg ORDER BY YEAR, MONTH;
Here, we have put together a few of the clauses that we have seen so far, using a WITH
clause to define a CTE named monthly_totals
with our previous query to calculate monthly totals.
We have then defined a query that reads from the CTE and uses the avg
window function to calculate the average monthly sales for each month of the full year that a particular month belongs to. Then, we use the QUALIFY
clause to keep only the rows where the monthly total is greater than the yearly average.
Tip
The previous example illustrates well why window functions are also called analytical functions.
Important note
Please note that the window function calculates a value for each row using the defined windows, not changing the number of rows. In contrast, using the avg
aggregate function with a GROUP BY
clause would have reduced the number of rows to one per group.
The ORDER BY clause
The ORDER BY
clause specifies in which order the query or window function lays out the rows.
The simplified syntax is as follows:
SELECT … ORDER BY orderExpr1 [ASC|DESC] [NULLS FIRST|LAST] [, orderExpr2 …]
Each order by expression can be made out of three parts:
- One expression that identifies what to order on. It can be either a column alias, a position ordinal, or an expression, as we have seen for the
GROUP BY
clause. Please refer to that section for more details. - An optional direction for the sorting:
ASC
for ascending orDESC
for descending sort. - An optional specification of how to sort
null
values:NULLS FIRST
orNULLS LAST
, which are self-explanatory.
Please look at previous sections for examples of ORDER BY
in action, in both queries and window function definitions with over()
.
The LIMIT/FETCH clause
The LIMIT
clause is used to restrict the number of rows returned by the query.
LIMIT
and FETCH
are synonyms, with slightly different syntax.
Let’s look at the LIMIT
syntax:
SELECT … [ORDER BY …] LIMIT <count> [OFFSET <start>]
The count
parameter is a number that specifies the maximum number of rows to return.
If the OFFSET
part is present, the returned rows are the ones after the start
position. This allows tools connecting to a database to retrieve all the results in chunks of the desired size. As an example, a LIMIT 10 OFFSET 10
clause would retrieve rows from 11 to 20.
Important note
If no ORDER BY
clause is present, the order of the rows is undefined and could differ in each execution of the same query. In this case, the result of a LIMIT
clause is non-deterministic, because which rows are returned depends on the order in which the rows happen to appear in the result set.
Query clause order of evaluation
In the previous sections, we have seen all the clauses that can appear in a SELECT
statement.
Now is a good time to bring your attention to the fact that these clauses are generally evaluated in the following specific order, as well as what it is important to pay attention to for each clause:
FROM
and itsJOIN
subclause, which are used to identify the source data for the query.- The
WHERE
clause, which is used to filter out the source data that we do not want.
This is probably the most important clause for performance, because the less data a query works on, the quicker it is. Use WHERE
whenever possible to just bring in the data you need.
- The
GROUP BY
clause, which groups the source data left after applying theWHERE
clause and calculates the aggregate functions on the grouped data. - The
HAVING
clause, which filters on the results ofGROUP BY
. - Partitioning of the windows and calculation of the window functions.
- The
QUALIFY
clause, which filters on the results of the window functions. - The
DISTINCT
keyword, if applied to theSELECT
clause, which removes duplicated rows. - The
ORDER BY
clause, which puts the resulting rows in the desired order. - The
LIMIT
clause, which caps the rows returned by the query to the desired amount.
SQL operators
When writing queries, we can perform operations on the data handled by the query.
We do so by building expressions that return the desired value, using functions and operators.
We can perform an operation pretty much everywhere a value is expected: in the SELECT
clause to provide the desired outputs by transforming the inputs, in the WHERE
clause or ON
part of a JOIN
clause, HAVING
clause, or QUALIFY
clause to identify what should or should not be returned by the query, in GROUP BY
to decide how to aggregate, and so on.
Let’s go through the categories of operators and how they are used:
- Arithmetic operators: These are +, -, *, /, and % (modulo). They expect one or more numeric (or convertible to numeric) inputs and produce a numeric result, with the usual arithmetic precedence and meaning, like in the following example:
SELECT 1 + '2' as three, (3+2) * 4 as twenty WHERE twenty % 2 = 0;
Note that '2'
is a string but can be automatically converted to the number 2. Also note that implicit conversions happen, but explicit conversions are better.
The modulo operator returns the remainder of the division of the first operator by the second, and the val % 2 = 0
pattern is often used to identify even numbers.
- Comparison operators: These are used to test two values for equality or other comparisons, and are
=
(equal),!=
(not equal),<>
(also not equal),<
(less than),<=
(less than or equal),>
(greater than), and>=
(greater than or equal).
They are typically used in the WHERE
clause, but can be used anywhere a Boolean result is desired, such as in the following example:
SELECT 2 < 1 as nope, '3' != 'three' as yep WHERE 1 != 2;
Note that the operators can compare all types where the operation is defined.
I have seen WHERE 1 = 2 used in many places to avoid returning any row, because 1 = 2 always returns FALSE; it would be clearer to write WHERE false directly. Conversely, if you do not need to filter any rows out, you can just leave the WHERE clause off, or use the WHERE true expression.
- Logical operators: These are AND, OR, and NOT. They operate only on Boolean predicates and values and return Booleans.
They are generally used in the WHERE clause, but can appear anywhere an expression is allowed, such as in the following example:
SELECT *, (C_ACCTBAL > 7500) AND (C_NATIONKEY = 24) as IS_TOP_US_CUST
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE (C_NAME IS NOT null) AND IS_TOP_US_CUST;
In the previous example, we have defined an IS_TOP_US_CUST
column using a logical operator to combine two predicates, as we have done in the WHERE
clause, also reusing the new Boolean column we created.
- Set operators: These operators are used to put together the results from multiple queries, and are
INTERSECT
,MINUS
orEXCEPT
, andUNION [ALL]
.
The simplified syntax is as follows:
query_1 <set_operator> query_2
The queries must be compatible, having the same number of columns with matching types. It is also important that the semantics of the columns in the same position are correct: combining people's names with state codes is possible, as they are both strings, but in general, it does not make much sense.
Let’s describe the set operators:
INTERSECT
returns the rows that appear in both queries, checking all columns to have the same valueMINUS
orEXCEPT
returns the rows from the first query that do not appear in the secondUNION [ALL]
returns the rows from both queries, withALL
keeping duplicates
As an example, the following query returns data for the customers from India (8) and the US (24) and the customers in the AUTOMOBILE
segment without duplicates:
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (8, 24)
UNION
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_MKTSEGMENT = 'AUTOMOBILE';
In this special case, as both the queries are on the same table, we could have just used a single query with a slightly more complex WHERE
clause using an OR
operator to compose the two individual clauses. In real cases, you might want to combine similar data from different tables and the set operators are here for you.
- Subquery operators: These operators apply tests or comparisons to the results of a subquery, which is a query defined inside another query.
A subquery can be used without any operator as a table-like object, such as when selecting the subset of a table whose data we want to use, or as an expression if it returns a single value, as in the following example:
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY = (
    SELECT N_NATIONKEY
    FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
    WHERE N_NAME = 'JAPAN'
);
In this example, the subquery returns only one value, so we can use the equal operator.
The subquery operators extend their use to other cases, as per their definitions:

- ALL / ANY: Allows you to apply a comparison to all/any rows of the subquery
- [NOT] EXISTS: Returns true if the subquery returns at least one row; false with NOT in front
- [NOT] IN: Returns true if the expression is included (or, with NOT, not included) in the results of the subquery
The following example extends the previous example to the case where we want customers from more than one country, picking the countries by name:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (
    SELECT N_NATIONKEY
    FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
    WHERE N_NAME IN ('JAPAN', 'CANADA')
);
We could easily rewrite the same query using = ANY instead of IN, or, with some changes to the subquery, using EXISTS.
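As a sketch of the EXISTS variant just mentioned, the following correlated subquery returns the same customers; note that the subquery now references the outer table, so we add table aliases:

```sql
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" AS C
WHERE EXISTS (
    SELECT 1
    FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION" AS N
    WHERE N.N_NATIONKEY = C.C_NATIONKEY   -- correlation with the outer query
      AND N.N_NAME IN ('JAPAN', 'CANADA')
);
```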
Now that we have become acquainted with the basics of querying data in SQL, let’s dive deeper into the JOIN clause, which allows us to put together data from multiple sources and is crucial for turning raw data into useful information.
Combining data in SQL – the JOIN clause
The JOIN clause, and the equivalent forms of the FROM clause with multiple tables, is used to combine the rows from two tables to create a row with the columns (that you select) from both tables.
Joins are useful when the tables to be combined are related, that is, when the two tables have some columns that represent the same thing, and we want to combine data from both tables.
We express how to combine the rows by providing a join clause, usually with the ON subclause, which compares the rows of one table to the rows of the other. Most of the time, the relation is that the values of corresponding columns are equal, but any predicate is fine in the ON subclause.
Combining orders and customers
One example of how to combine data might be a web_order table and a customer table.
In both tables, you normally have a column with the customer ID information, even if the columns might not have the same name. Let’s assume that in the order table, the ID of the customer who placed the order is in the ORDER_PLACED_BY column, and in the customer table, the ID of the customer is in the CUSTOMER_ID column. Then, we could write the following query:

SELECT *
FROM web_order
JOIN customer ON ORDER_PLACED_BY = CUSTOMER_ID;
This query, using *, returns rows that have all the columns from both tables, in all the cases when there is a pair of rows that satisfies the ON condition.
Let’s look at the relevant rows of the input and output tables in the case where we have one order with ORDER_PLACED_BY = 123 in the order table and one customer with CUSTOMER_ID = 123.
Say we have one row with ORDER_PLACED_BY = 123 in the web_order table, as follows:
| Order_ID | ORDER_PLACED_BY | ORDER_VALUE |
|----------|-----------------|-------------|
| WEB_0001 | 123 | 225.52 |

Table 1.2: Sample web_order table
And we have one row with CUSTOMER_ID = 123 in the customer table, as follows:

| Customer_ID | Customer_Name | Address |
|-------------|---------------|---------|
| 123 | Big Buyer LLP | Nice place road, 00100 SOMEWHERE |

Table 1.3: Sample customer table
Then, we get the following row in the result table:

| Order_ID | ORDER_PLACED_BY | ORDER_VALUE | Customer_ID | Customer_Name | Address |
|----------|-----------------|-------------|-------------|---------------|---------|
| WEB_0001 | 123 | 225.52 | 123 | Big Buyer LLP | Nice … |

Table 1.4: Sample result of the previous query
If we do not have any customer with CUSTOMER_ID = 123, then no row is returned (for that order) in the result table.
Say we have the same order table as before, but three rows with CUSTOMER_ID = 123 in the customer table:

| Customer_ID | Customer_Name | Address |
|-------------|---------------|---------|
| 123 | Big Buyer LLP | Nice place road, 00100 SOMEWHERE |
| 123 | Another Customer | Some road, 10250 SOME PLACE |
| 123 | A third customer | No way road, 20100 NOWHERE |

Table 1.5: Alternative example of a customer table with three rows with CUSTOMER_ID = 123
Then, we will have three rows returned, each having the same order information combined with one specific customer per row, as you can see in the following table:

| Order_ID | ORDER_PLACED_BY | ORDER_VALUE | Customer_ID | Customer_Name | Address |
|----------|-----------------|-------------|-------------|---------------|---------|
| WEB_0001 | 123 | 225.52 | 123 | Big Buyer LLP | Nice … |
| WEB_0001 | 123 | 225.52 | 123 | Another Customer | Some … |
| WEB_0001 | 123 | 225.52 | 123 | A third customer | No … |

Table 1.6: Table resulting from the previous sample JOIN query, with three customer matches
This last situation is probably not what you want, as it will “duplicate” the order, returning one row with the same order information for each customer that matches the condition. Later in the book, when we talk about identity, we will see how to make sure that this does not happen and how with dbt, you can also easily test that it really does not happen.
Another question that you might have is how do we keep the order information in the results, even if we do not have a match in the customer table, so that we get all the orders, with the customer information when available? That’s a good question, and the next topic on join types will enlighten you.
JOIN types
In the previous section about the query syntax, we introduced the simplified syntax of a JOIN clause. Let’s recall it here with shorter table names and no aliases:

SELECT …
FROM t1
[<join type>] JOIN t2 ON <condition_A>
[[<join type>] JOIN t3 ON <condition_B>]
…

In the most common cases, the join type is one of [INNER] or { LEFT | RIGHT | FULL } [OUTER].
This gives us the following possible joins with a join condition using the ON subclause:

- INNER JOIN: This is the default and most common type of join, returning only the rows that have a match on the join condition. The INNER keyword is optional, so you can write the following:

  SELECT … FROM t1 JOIN t2 ON <some condition>

  Note that the preceding INNER JOIN is equivalent to the following query that uses only FROM and WHERE:

  SELECT … FROM t1, t2 WHERE <some condition>

  It is preferable, anyway, to use the JOIN syntax, which clearly shows, especially to humans, which conditions are for the join and which are for filtering the incoming data.

- LEFT OUTER JOIN: This is the second most used type of join, as it returns all the rows from the left table, which is the first to be named, combined with the matching rows from the other table, padding with NULL the values where the right table has no match. Of course, you will have one row of the left table for each matching row of the right table.

- RIGHT OUTER JOIN: This is similar to LEFT OUTER JOIN, but it keeps all the rows from the right table and the matching ones from the left table. It is less used than the left join, as you can reverse the table order and use the left form: the query t1 RIGHT OUTER JOIN t2 is the same as t2 LEFT OUTER JOIN t1.

- FULL OUTER JOIN: This type of join combines the left and right behavior to keep all the rows from left and right, padding with NULL the columns where there is no match.
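As a quick answer to the earlier question about keeping all the orders even when there is no matching customer, here is how the hypothetical web_order and customer example could be rewritten with a left outer join; every order is returned, and the customer columns are padded with NULL when there is no match:

```sql
SELECT *
FROM web_order
LEFT OUTER JOIN customer   -- keep all orders, matched or not
    ON ORDER_PLACED_BY = CUSTOMER_ID;
```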
There are also two other types of join where you do not specify a join condition:

- CROSS JOIN: This type of join produces a Cartesian product, with all possible combinations of rows from both tables. This is also what you obtain if you omit the ON subclause when using the previous types of joins. The cross join does not have an ON subclause:

  SELECT … FROM t1 CROSS JOIN t2

  This is equivalent to what we have seen in the FROM clause:

  SELECT … FROM t1, t2

  The difference is just how obvious it is to a human reader that you really want a cross join, rather than having forgotten the ON subclause or some join-related condition in the WHERE clause. Cross joins are not commonly used, because of the Cartesian explosion we talked about; it is, therefore, much better style to explicitly indicate that you really want a cross join on the few occasions when you actually do.

- NATURAL <type> JOIN: The NATURAL join is identical to the various types of JOIN that we have seen so far, but instead of using the ON subclause to find the matches between the two tables, it uses an equality condition on the columns that have the same name in both tables. Another small difference is that the columns with the same name in the two tables are returned only once in the results, as they always have the same values on both sides because of the equality condition.
Here are a couple of examples of how to write queries with this type of join:

SELECT … FROM t1 NATURAL INNER JOIN t2

The preceding query is like an INNER JOIN query on the columns with the same name in t1 and t2.

SELECT … FROM t1 NATURAL FULL OUTER JOIN t2

This one is like a FULL OUTER JOIN query on the columns with the same name in t1 and t2.
Tip
When talking about JOIN, we use LEFT and RIGHT, but with respect to what? It is a reference to the order in which the tables appear in a chain of joins. The FROM table is the leftmost one, and any other table that is joined is added to the right in the order in which the join appears.
Writing SELECT … FROM t1 JOIN t2 ON … JOIN t3 ON … JOIN t4 ON … makes it clear that the tables will be stacked from left to right like this: t1-t2-t3-t4.
You could rewrite the same example as it is normally written on multiple lines, as follows:
SELECT …
FROM t1
JOIN t2
ON …
JOIN t3
ON …
JOIN t4
ON …
The result is the same, even if left and right are less immediately visible than in the single-line form.
Visual representation of join types
We have defined how joins work through examples and explanations, but I think that for some people, an image is worth a thousand explanations, so I propose two ways to graphically look at joins:
- One that tries to show how the matching and non-matching rows are treated in different kinds of joins
- One that uses a set notation and compares the different types of joins
Visual representation of returned rows in JOIN queries
The following figure visually describes how two rows of tables A and B are aligned to form the rows resulting from the different types of joins:

Figure 1.7: Visual representation of different JOIN types between tables A and B
Each table is divided into two: one part where the rows have one or more matches on the other table that satisfy the join condition and another part where each row has no match on the other table.
The rows from the two tables that have a match, in the center of the figure, are properly aligned according to the matching condition so that each resulting row contains the values from A’s columns and B’s columns where the condition is met. This central part is always returned by all joins, unless explicitly excluded with a WHERE clause requesting one of the two keys to be NULL.
The rows where there is no match, shown at the top for table A and at the bottom for table B, are aligned with the columns from the other table padded with NULL values. This produces the somewhat unintuitive result that a query with an ON A.key = B.key clause might produce rows where one of the two keys is NULL and the other is not.
Tip
Please remember that NULL is a special value and not everything works out as you might expect. As an example, the expression NULL = NULL produces NULL, and not TRUE.
Try it out yourself with this query: SELECT NULL = NULL as unexpected;
That is why you test for NULL values using <field> IS NULL and not using equality.
Full outer join
The following figure depicts the result of a FULL OUTER JOIN in terms of the rows of the two original tables:

Figure 1.8: Visual representation of a FULL OUTER JOIN
You can clearly identify the central part of the previous picture, where rows from both tables satisfy the join constraint, and the two parts where one of the tables has no matching rows in the other table; in these rows, the columns from the other table are filled with NULL values.
Left outer join
The following figure depicts the result of a LEFT OUTER JOIN in terms of the rows from the two original tables:

Figure 1.9: Visual representation of a LEFT OUTER JOIN
You can clearly see in the picture that the result consists of the rows from both sides that satisfy the join constraint, plus only the rows from table A that do not have a match in table B, with the columns from B filled with NULL values.
The rows from table B without a match in table A are not returned.
Another way to express this is that we have all rows from table A, plus the rows from B where there is a match on the join condition, and NULL for the other rows.
Important Note
When we join two tables and write a condition such as ON A.key = B.key, we are expressing interest in the rows where this condition is true, and INNER JOIN gives us exactly these rows.
However, OUTER joins also return rows where the join clause is not true; in these rows, either the A.key or B.key column will be filled with NULL, as there is no match in the other table.
Visual representation of JOIN results using sets
The following figure visually represents the join types that use an ON clause, showing, as sets, the rows from tables A and B that match or do not match the join condition in the ON clause.
The overlapping area is where the condition is matched by rows in both tables:

Figure 1.10: Visual representation of SQL JOIN types with an ON clause, as set operations
The preceding figure represents the join types that we have seen in two forms:

- Using only the ON subclause, showing the results of the pure join
- Using a WHERE clause on the column used in the ON subclause

In the figure, the WHERE clause is used to exclude from the outer joins the rows where the match happens, where A and B overlap, which are the rows returned by an inner join.
This type of query is useful, and often used, for example, to see whether we have orders where the referenced customer does not exist in the customer table. This would be called an orphan key in the order table.
Let’s see an example using the Snowflake sample data:

SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
LEFT OUTER JOIN "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
    ON C_CUSTKEY = O_CUSTKEY
WHERE C_CUSTKEY IS NULL;

This query should return no rows, as all the customers referenced by the orders should exist in the customer table. Now that we have covered the basic functionality of SQL, let’s check out an advanced feature: window functions.
Advanced – introducing window functions
A window function is a SQL function that operates on a set of rows, called a window of rows.
These functions are also called analytic functions, as they are mostly used in analytic queries.
A window is a set of correlated rows, containing anywhere from zero to all of the rows resulting from a query.
Generally, the windows divide the resulting rows into disjoint partitions.
Some window functions can also operate on a window frame, which is a dynamic subset of the rows defined by the window, based on the current row.
You can define a cumulative window frame, taking the rows of the window before or after the current row, or a sliding window frame, going from some rows before the current row to some rows after the current row.
Window functions have a lot in common with aggregate functions, the ones used with GROUP BY, as they both divide the results of a query into groups, that is, partitions, and then calculate the function on the rows of each group.
The big difference is that with GROUP BY, you get only one row per group, while with window functions, all the rows are returned, with the function calculated on the desired subset of rows.
Let’s show some examples to make the concept more real.
Assume we have a table with orders, and we would like to know how much each order contributes to the total by day or month. We can calculate the percentage with the following formula:
monthly_pct = order_total / monthly_total * 100
Using GROUP BY, you could calculate the order total for each day or month in separate queries, and then you would need to join those results with the original table to be able to apply the formula.
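As a sketch of that two-step approach, limited to the daily percentage, you could first aggregate the daily totals in a common table expression and then join them back to the orders (using the Snowflake sample data seen before):

```sql
WITH daily AS (
    -- step 1: one row per day with the daily total
    SELECT O_ORDERDATE, SUM(O_TOTALPRICE) AS daily_total
    FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
    GROUP BY O_ORDERDATE
)
-- step 2: join the totals back to the individual orders
SELECT O.O_ORDERKEY, O.O_ORDERDATE, O.O_TOTALPRICE,
       O.O_TOTALPRICE / D.daily_total * 100 AS day_pct
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" AS O
JOIN daily AS D ON D.O_ORDERDATE = O.O_ORDERDATE;
```

You would need a similar aggregation and join for each period of interest, which is exactly the boilerplate that window functions remove.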
Window functions are here to help, as you can calculate the daily or monthly measures in a single step by defining a window that puts the rows together in different ways, as in the following query:
SELECT
    O_ORDERKEY, O_CUSTKEY, O_ORDERDATE, O_TOTALPRICE,
    avg(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_avg,
    sum(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_total,
    sum(O_TOTALPRICE) over(partition by DATE_TRUNC(MONTH, O_ORDERDATE)) as monthly_total,
    O_TOTALPRICE / daily_avg * 100 as avg_pct,
    O_TOTALPRICE / daily_total * 100 as day_pct,
    O_TOTALPRICE / monthly_total * 100 as month_pct
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
QUALIFY row_number() over(partition by O_ORDERDATE order by O_TOTALPRICE DESC) <= 5
ORDER BY O_ORDERDATE, O_TOTALPRICE desc;
This example uses window functions to calculate totals and averages over different periods and also to limit the returned rows to the five biggest orders for each day.
This query is pretty efficient, executing in a few seconds even when processing millions of rows. By changing the schema from TPCH_SF1 to TPCH_SF10, TPCH_SF100, or TPCH_SF1000, you can try out bigger tables in the Snowflake sample data to see how it works with bigger datasets.
Window definition
A window is defined using the OVER clause, which specifies how to partition and order the rows of the incoming data to apply the window function.
To use a window function, you use this syntax:
<function_name> ([<arguments>]) OVER ( [PARTITION BY <part_expr>] [ORDER BY <order_expr>])
The presence of the OVER clause specifies that the function is applied to the windows defined inside the OVER clause. Of course, the function can also take its own arguments.
The PARTITION BY clause defines how the rows are partitioned; <part_expr> is a comma-separated list of columns or expressions calculated on the query data, in a similar way to what we have seen for the GROUP BY expressions and shown in the previous example.
The ORDER BY clause specifies the ordering of the rows to be used to calculate the function; this ordering is not visible outside of the window function. The ORDER BY clause and <order_expr> work pretty much as they do in the SELECT statement to order the results of the query.
At the syntax level, both the PARTITION BY and ORDER BY clauses are optional, even if some functions require or forbid one of the two. Look at each function’s documentation for more details.
The ORDER BY clause has an effect on order-sensitive window functions, which come in two types:

- Rank-related functions: These functions calculate values based on the rank of the row, which is the position of the row in the ordering of the window. They are used to make explicit, that is, crystallized into a value that can be used later, one of the many possible orderings of a row in a window. One example is the ordering of teams in a championship based on points or medals: you can have different styles of ordering based on what is decided when two contenders have the same value. As an example, if two teams (A and B) have the same score, behind another team (C) with a higher score, you can say that A and B are both second, or you can say that one is second and the other third.
- Window frame functions: These functions work on a window frame, which is a dynamic subset of rows of a window. As we saw previously, they can use cumulative or sliding window frames. The next section will delve into these.
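A minimal illustration of the rank-related functions, using a made-up inline list of teams (Snowflake supports selecting from a VALUES list):

```sql
SELECT team, points,
       ROW_NUMBER() OVER (ORDER BY points DESC) AS row_num,
       RANK()       OVER (ORDER BY points DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY points DESC) AS dense_rnk
FROM (VALUES ('C', 30), ('A', 25), ('B', 25)) AS t (team, points)
ORDER BY points DESC;
```

Team C is first in all three columns; for the tied teams A and B, RANK and DENSE_RANK both report 2, while ROW_NUMBER arbitrarily assigns 2 and 3, which is exactly the difference in tie-handling styles discussed above.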
Window frame definition
Some functions can use a dynamic window frame instead of a fixed window. You specify the window frame inside the OVER clause, after the ORDER BY clause of the window:

<function_name> ([<arguments>]) OVER (
    [PARTITION BY <part_expr>]
    ORDER BY <order_expr>
    <cumulative_frame_def> | <sliding_frame_def>
)
A cumulative window frame is a subset of the window that either grows or shrinks based on the position of the current row inside the window. The frame contains the rows of the window either before or after the current row. The syntax of the cumulative frame definition is one of the two following lines:

{ROWS | RANGE} BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
{ROWS | RANGE} BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING

The first expression takes all the rows in the partition, or all the rows in the same order range, up to the current one, while the second takes from the current row to the end of the partition or range. A better description of ROWS and RANGE follows.
A sliding window frame is a subset of rows that extends from N rows before the current row to M rows after the current row. One or both sides can be fixed to the start or end of the partition.
The following are the three possible forms of the syntax:

ROWS BETWEEN <N> { PRECEDING | FOLLOWING } AND <M> { PRECEDING | FOLLOWING }
ROWS BETWEEN UNBOUNDED PRECEDING AND <M> { PRECEDING | FOLLOWING }
ROWS BETWEEN <N> { PRECEDING | FOLLOWING } AND UNBOUNDED FOLLOWING

The first syntax is for when you want a frame that extends from N rows before to M rows after the current row, while the other two are the cases where one of the two extremities is fixed to the start or end of the window.
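For example, a sliding frame of the current row and the two preceding ones gives a simple three-row moving average of the order values; this is a sketch on the sample data, where the second ORDER BY column only makes the tie-breaking deterministic:

```sql
SELECT O_ORDERKEY, O_ORDERDATE, O_TOTALPRICE,
       AVG(O_TOTALPRICE) OVER (
           ORDER BY O_ORDERDATE, O_ORDERKEY
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW  -- sliding 3-row frame
       ) AS moving_avg_3
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS";
```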
ROWS/RANGE in a cumulative window definition
When specifying a cumulative window frame, we can use either ROWS or RANGE, but what is the difference?

- ROWS: Indicates that the frame extends from the limit of the partition to exactly the current row: from the start of the partition to the current row, or from the current row to the end of the partition, counting physical rows.
- RANGE: Indicates that the moving side of the frame is extended to also include the preceding/following rows that have the same value for the ORDER BY expression as the current row, that is, its peers in the ordering. As an example, you can think of stepping through many teams that have the same number of points in a championship list: with RANGE, they all enter the frame together.
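The difference is easy to see on a running total over a made-up inline list of teams: the tied teams get the same running total with RANGE, as peers enter the frame together, but different running totals with ROWS:

```sql
SELECT team, points,
       SUM(points) OVER (ORDER BY points DESC
                         ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_rows,
       SUM(points) OVER (ORDER BY points DESC
                         RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_range
FROM (VALUES ('C', 30), ('A', 25), ('B', 25)) AS t (team, points);
```

Here, A and B (tied at 25 points) both get running_range = 80, while their running_rows values differ, depending on which of the two the engine happens to place first.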
Important note
Providing extensive information on window functions is beyond the scope of this introduction to SQL used in analytical operations. It is also one area where SQL is not always the same in all databases. We suggest searching for “window functions” in the documentation of your database.
Summary
In this chapter, you were introduced to SQL, starting with the basics and working up to some quite advanced topics. You also got hands-on with working with Snowflake in the cloud.
You learned about the different SQL commands and categories, opened an account in Snowflake, and created your first database with users and roles. You then got into the details of the clauses and operators used to write SELECT queries and tried them out on the Snowflake sample data. You then analyzed in more detail how to use the JOIN clause to blend data from multiple tables. We closed the chapter by presenting window functions, which open up new avenues for our queries and make many complex calculations easy and efficient.
In the next chapter, Setting Up Your dbt Cloud Development Environment, we will create our first dbt models, setting up dbt and GitHub in the cloud.
Further reading
In this chapter, we have described the most important SQL topics, but there is much more that you can read if you want to look deeper and become proficient with SQL.
The Snowflake documentation is free to use and provides good explanations and examples, and of course, is the reference source of knowledge for the Snowflake dialect of SQL.
You can start with these pages:
- SQL command reference: https://docs.snowflake.com/en/sql-reference-commands.html
- Summary of commands: https://docs.snowflake.com/en/sql-reference/intro-summary-sql.html
- Queries: https://docs.snowflake.com/en/user-guide/querying.html