Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Data Engineering with dbt
Data Engineering with dbt

Data Engineering with dbt: A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL

eBook
AU$38.99 AU$55.99
Paperback
AU$68.99
Subscription
Free Trial
Renews at AU$24.99p/m

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Table of content icon View table of contents Preview book icon Preview Book

Data Engineering with dbt

The Basics of SQL to Transform Data

This book is about engineering data platforms with dbt. When working with dbt, you write code in Structured Query Language (normally referred to as SQL, pronounced sequel or as its letters: es-qu-el), with a spritz of Jinja templating.

This means that when we are not discussing architectural and design topics, we will be reading and writing code in SQL, so you should get up to speed with SQL to work proficiently with dbt.

In this chapter, you will review, or learn from scratch, the basics of SQL, becoming familiar with this powerful language that you find everywhere when working with data.

First, we will introduce SQL and walk you through its core concepts, so that even if you are new to this language, you can start building your competence and follow the rest of this chapter and the book.

To complement this introduction, you will quickly be presented with all the main SQL commands, organized by category of use. This will ensure that you know the most relevant parts of SQL and you can use this chapter as a quick reference when you start writing your code.

You will then be guided through how to create an account in Snowflake, the leading cloud database that is very often used with dbt. You will use Snowflake in the cloud, without the need to install anything, and it is free for a trial period.

Once you have a Snowflake account, you will be guided through creating your first role, user, and database.

With your cloud database ready and the basic SQL commands presented, we will look deeper into the SELECT statement and the JOIN clause, as they will be crucial in working with dbt.

Analytical and window functions are presented at the end of the chapter in a simple way, suitable for beginners as well as mid- to senior-experienced SQL users; these advanced functionalities will power up your SQL skills, vastly widening your abilities.

To summarize, in this chapter, we are going to cover these topics:

  • Introducing SQL
  • SQL basics – core concepts and commands
  • Setting up a Snowflake database with users and roles
  • Querying data in SQL – syntax and operators
  • Combining data in SQL – the JOIN clause
  • Advanced – introducing window functions

Technical requirements

This chapter does not assume any prior SQL knowledge and presents information from the basics to intermediate level. If you know SQL well, you can skip this chapter or just browse the more advanced topics.

All code samples of this chapter are available on GitHub at https://github.com/PacktPublishing/Data-engineering-with-dbt/tree/main/Chapter_01.

To run the samples, later in this chapter, we will guide you through setting up a Snowflake account and roles and creating your first database.

Introducing SQL

SQL was created in the 70s and by the end of the 80s, had become the de facto standard to interact with Relational Databases (RDBs), and it now powers most of the data management industry in the world.

Given its huge footprint and powerful abstractions, SQL has become a standard that anyone working with database systems eventually becomes familiar with. The expressive power of SQL is well understood and its knowledge is so ubiquitous that it has been taken into use beyond RDBs, with Database Management Systems (DBMSs) of all sorts providing a SQL interface even on top of many non-RDB systems.

Some of the great advantages of SQL are as follows:

  • The core SQL functionality was standardized in the 80s and yet SQL is still very much alive and well, evolving and adding new powerful functionalities as data management evolves while maintaining compatibility with previous versions.

Every database has its SQL quirks, but the logic is the same and most SQL code will work on multiple databases with little or no change.

Learn it now and use it forever, and with (almost) every database.

  • At its core, it has a simple, rigorous, and powerful syntax that reads like English sentences, so even non-tech people can grasp the basic idea, while professionals can express exactly what they want in a precise and concise way.

Most people can probably get a sense of what the following SQL does:

SELECT ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT
FROM ORDERS
WHERE YEAR(ORDER_DATE) = 2021;
  • With SQL, you work at the logical level, so you do not have to deal with implementation details, and it is a declarative language; you describe in a rigorous way what you want to achieve, not how to do it. The database engine has the freedom to store data, be implemented, and perform the request in its own way, as long as it produces the correct result according to SQL specifications.
  • With a single SQL statement, you can process one piece of data or billions, leaving the burden of finding the most effective way to the database and giving you some freedom from scale.

SQL basics – core concepts and commands

The SQL language is certainly big and complex, taking time to be fully mastered, but surprisingly, you can get productive with the limited set of features that we will introduce in this book, and you will pick up the nuances when you’ll need them while working on real use cases.

The goal of this chapter is not to make you a SQL expert but to get you started with it, getting deeper into the few commands that we will use often to make you quickly productive with dbt.

In this intro, we start with the most basic concepts of SQL and then provide an overview of the SQL commands by their categories of use. In the rest of this chapter, we will dig deeper into the commands that are mostly used when working with dbt: SELECT, JOIN, and analytical and window functions.

SQL core concepts

The main goal of SQL, as its name implies, is to allow users to query data that is contained in a database; SQL also provides all the commands to fully manage the database, allowing you to add, transform, delete, organize, and manage data and other database objects, such as users and roles.

The core concepts in SQL come from how an RDB is organized:

  • A database contains tables, organized in schemata (plural of schema)
  • Tables store data in rows that have one value for each column defined in the table
  • Columns have a name and can contain only data of the declared data type
  • To regulate access to the data, privileges are assigned to users and roles

You can see a database with schemata and tables in the following screenshot, which shows part of the sample database available in any Snowflake account:

Figure 1.1: Example database with schemata and tables

Figure 1.1: Example database with schemata and tables

Let’s go through the core concepts of SQL, starting with the table.

What is a table?

The table is the most central concept in SQL, as it is the object that contains data.

A table in SQL is very close to the layman’s concept of a table, with data organized in columns and rows.

Columns represent the attributes that can be stored in a table, such as a customer’s name or the currency of an order, with each column defining the type of data it can store, such as text, a number, or a date.

In a table, you can have as many rows as you want, and you can keep adding more whenever you need. Each row stores in the table one instance of the concept that the table represents, such as a specific order in the following example of a table for orders:

Order_ID

Customer_CODE

Total_amount

Currency

123

ABC

100

EUR

166

C099

125,15

USD

Table 1.1: Example order table

Looking at the previous example order table, we can see that the table has four columns that allow storing the four attributes for each row (order ID, customer code, total amount, and currency of the order). The table, as represented, has two rows of data, representing one order each.

The first row, in bold, is not data but is just a header to represent the column names and make the table easier to read when printed out.

In SQL, a table is both the definition of its content (columns and their types) and the content itself (the data, organized by rows):

  • Table definition: It lists the columns that make up the table, and each column provides the data type and other optional details. The data type is mandatory and declares what values will be accepted in the column.
  • Table content: It is organized in rows, each row containing one value for each column defined for the table or null if no value is provided.
  • Data value: All the values in a column must be compatible with the type declared for the column. null is a special value that corresponds to the absence of data and is compatible with all data types.

When creating a table, we must provide a name for the table and its definition, which consists of at least column names and a type for each column; the data can be added at a different time.

In the following code block, we have a sample definition for an order table:

CREATE TABLE ORDERS (
   ORDER_ID NUMBER,
   CUSTOMER_CODE TEXT,
   TOTAL_AMOUNT FLOAT,
   ORDER_DATE DATE,
   CURRENCY TEXT DEFAULT 'EUR'
);

Tip

When we write some SQL, we will use Snowflake’s SQL syntax and commands. We will guide you on how to create a free Snowflake account where you can run the code shown here.

In this example, we see that the command to create a table reads pretty much like English and provides the name of the table and a list of columns with the type of data each column is going to contain.

You can see that for the CURRENCY column, of type text, this code also provides the default value EUR. Single quotes are used in SQL to delimit a piece of text, aka a string.

View – a special way to look at some data

If you already have data and you want to make it available with some transformation or filtering, you can create a view. You can think of a view like a table, but with the column definition and data both coming from a query, reading from one or more tables.

As an example, if you would like to have a shortlist of orders with amount greater than 1,000 you could write the following query to create a BIG_ORDERS view:

CREATE VIEW BIG_ORDERS AS
SELECT * FROM ORDERS
WHERE TOTAL_AMOUNT > 1000;

In this example, we see that this simple create view statement provides the name for the view and uses a query, which is a SELECT statement, to define what data is made available by the view.

The query provides both the data, all the orders with a total amount greater than 1,000, and the column definitions. The * character – called star – is a shortcut for all columns in the tables read by the SELECT statement.

This is, of course, a naïve example, but throughout this book, you will see that combining tables that store data and views that filter and transform the data coming from tables and views is the bread and butter of working with dbt. Building one object on top of the previous allows us to take raw data as input and provide as output refined information that our users can easily access and understand.

Tip

When working with dbt, you will not need to write create table or create view statements, as dbt will create them for us. It is nevertheless good to get familiar with these basic SQL commands as these are the commands executed in the database and you will see them if you look in the logs.

Database and schema

We have seen that a database is a container for tables and views.

A DB can be further divided and organized using schema objects.

In real-life DBs, the number of tables can range from a few units to many thousands. Schemata act pretty much like folders, but cannot be nested, so you can always identify a table by its database, schema, and name.

In the following screenshot, we see part of the contents of the SNOWFLAKE_SAMPLE_DATA database, which is available in all Snowflake accounts:

Figure 1.2: Some schemata and tables of the SNOWFLAKE_SAMPLE_DATA database

Figure 1.2: Some schemata and tables of the SNOWFLAKE_SAMPLE_DATA database

The database and schemata, as table containers, are the main ways information is organized, but they are also used to apply security, access limitations, and other features in a hierarchical way, simplifying the management of big systems.

To create the TEST database and the SOME_DATA schema in it, we can use the following commands:

CREATE DATABASE TEST;
CREATE SCHEMA TEST.SOME_DATA;

The database.schema notation, also known as a fully qualified name, allows us to precisely describe in which database to create the schema and after its creation, uniquely identifies the schema.

Tip

While working with dbt, you will create a database or use an existing one for your project; dbt will create the required schema objects for you if you have not created them already.

A best practice in Snowflake is to have one database for each project or set of data that you want to keep separate for administration purposes. Databases and schemata in Snowflake are soft boundaries, as all the data from all the databases and schemata can be accessed if the user has the appropriate privileges.

In some other database systems, such as PostgreSQL, a database is a stronger boundary.

Users and roles

To control access to your data in SQL, you GRANT access and other privileges to both users and roles.

A user represent one individual user or service that can access the database, while a role represents a named entity that can be granted privileges.

A role can be granted to a user, providing them all the privileges associated with the role.

A role can also be granted to another role, building hierarchies that use simple basic roles to build more complex roles that are assigned to a user.

Using roles instead of granting privileges directly to users allows you to manage even a large number of users simply and consistently. In this case, roles are labels for a set of privileges and become a way to manage groups of users that we want to grant the same privileges. Changing the privileges granted to a role at any moment will change the privileges that the users receive from that role.

A typical pattern when working with dbt is to create a role for the dbt users and then assign it to the developers and the service user that the dbt program will use.

The following is an example of a simple setup with one role and a couple of users:

CREATE ROLE DBT_SAMPLE_ROLE;
CREATE USER MY_NAME;           -- Personal user
CREATE USER SAMPLE_SERVICE;    -- Service user
GRANT ROLE DBT_SAMPLE_ROLE TO USER MY_NAME;
GRANT ROLE DBT_SAMPLE_ROLE TO USER SAMPLE_SERVICE;

A more complex setup could have one role to read and one to write for each source system (represented by a schema with the data from the system), for the data warehouse (one or more schemata where the data is processed), and for each data mart (one schema for each data mart).

You could then control in much more detail who can read and write what, at the cost of more effort.

Understanding the categories of SQL commands

SQL commands can be organized into categories according to their usage in the language:

  • Data Definition Language (DDL): DDL contains the commands that are used to manage the structure and organization of a database
  • Data Manipulation Language (DML): DML contains the commands that are used to manipulate data, for example, INSERT, DELETE, and UPDATE
  • Data Query Language (DQL): DQL contains the SELECT command and is the central part of SQL that allows querying and transforming the data stored in a database
  • Data Control Language (DCL): DCL contains the GRANT and REVOKE commands, which are used to manage the privileges that control the access to database resources and objects
  • Transaction Control Language (TCL): TCL contains the commands to manage transactions

In the upcoming sections, we provide more details about these by looking at Snowflake-specific commands, but the ideas and names are of general use in all database systems, with little or no change.

Data Definition Language – DDL

DDL commands do not deal directly with the data but are used to create and maintain the structure and organization of the database, including creating the tables where the data is stored.

They operate on the following objects:

  • Account/session objects, which contain and operate on the data, such as user, role, database, and warehouse
  • Database/schema objects, which store and manipulate the data, such as schema, table, view, function, and store procedure

The main commands are as follows:

  • CREATE: Used to create the database itself and other objects
  • DROP: Used to delete the specified object
  • ALTER: Used to modify some attribute of the specified object
  • DESC: Used to describe the details of the specified object
  • SHOW: Used to list the existing objects of the specified object type, with metadata
  • USE: Used to select the database, schema, and object to use when fully specified names are not used

Tip

When working with dbt, we use the DDL and DML commands only in macros.

We do not use the DDL and DML commands in models because dbt will generate the required commands for our models based on the metadata attached to the model.

Data Manipulation Language – DML

DML provides the commands to manipulate data in a database and carry out bulk data loading.

Snowflake also provides specific commands to stage files, such as loading files in a Snowflake-managed location, called a stage.

The main commands are as follows:

  • INSERT: Inserts rows into a table
  • DELETE: Removes specified rows from a table
  • UPDATE: Updates some values of specified rows in a table
  • MERGE: Inserts, updates, or deletes rows in a table
  • TRUNCATE TABLE: Empties a table, preserving the definition and privileges

For bulk data loading, Snowflake provides the following command:

  • COPY INTO: Loads data from files in a stage into a table or unloads data from a table into one or more files in a stage

To manage files in stages, Snowflake provides these file-staging commands:

  • PUT: Uploads a local file to a Snowflake stage
  • GET: Downloads a file from a stage to the local machine
  • LIST: Lists the files existing in a Snowflake stage
  • REMOVE: Removes a file from a Snowflake stage

Important note

In dbt, we can use macros with the COPY INTO and file-staging commands to manage the data-loading part of a data pipeline, when source data is in a file storage service such as AWS S3, Google Cloud Storage, or Microsoft Azure Data Lake file storage.

Data Query Language – DQL

DQL is the reason why SQL exists: to query and transform data.

The command that is used to query data is SELECT, which is without any doubt the most important and versatile command in all of SQL.

For the moment, consider that a SELECT statement, aka a query, can do all these things:

  • Read data from one or more tables
  • Apply functions and transformations to the data retrieved
  • Filter the data retrieved from each of these tables
  • Group the retrieved data on a set of columns (or transformed values), producing one row for each group, calculating functions on the grouped rows, and filtering on the results of those functions
  • Calculate for each row one or more functions based on groups of rows identified by a window expression and filter the results based on the results of these functions

Important note

We have dedicated the Query syntax and operators section later in this chapter to analyzing the SELECT command in Snowflake in detail, as you will use the SELECT command in every dbt model.

Data Control Language – DCL

DCL contains the GRANT and REVOKE commands, which are used to manage privileges and roles that control access to or use database resources and objects.

Together with the DDL commands to create roles, users, and other database objects, the DCL commands are used to manage users and security:

  • GRANT: Assigns a privilege or a role to a role (or user)
  • REVOKE: Removes a privilege or a role from a role (or user)
  • SHOW GRANTS: Lists access privileges granted to a role or object

Transaction Control Language – TCL

The TCL commands are used to manage transactions in a database.

A transaction groups a set of SQL commands into a single execution unit and guarantees that either all the effects of all commands are applied, if the transaction completes with success, or no effect at all is applied if the transaction fails. This can also be described with the ACID acronym, which stands for atomic, consistent, isolated, and durable.

A transaction succeeds and ends only when all the commands it groups have finished with success; in any other case, the transaction fails, and its effects are rolled back like they never happened.

The TCL commands are as follows:

  • BEGIN: Starts a transaction in the current session
  • COMMIT: Commits an open transaction in the current session and makes the transaction effects visible to all sessions
  • ROLLBACK: Rolls back an open transaction in the current session and undoes all the effects applied since the BEGIN statement

Now that we have covered the basic concepts and commands in SQL, it is time to set up a database to run them. The next section will provide you with access to a Snowflake DB.

Setting up a Snowflake database with users and roles

In this section, you will be guided through creating your own Snowflake account so that you can experiment with the SQL commands from the previous section and use it with dbt in the next chapters.

Once you have created your account, we will help you through a quick setup of users and roles that will be useful throughout this book and in your later projects.

Creating your Snowflake account

To sign up for a 30-day free trial of Snowflake, you can go to https://signup.snowflake.com/ or look for the Start for free button on the Snowflake home page.

On the Snowflake site, go through the following steps:

  1. Fill in the registration form with your data and click CONTINUE.
Figure 1.3: Snowflake free trial registration form

Figure 1.3: Snowflake free trial registration form

  1. Select the Snowflake edition that you want to use, pick your preferred cloud provider, and click GET STARTED and you will reach a confirmation page.
Figure 1.4: Left: Snowflake edition and cloud provider selection form. Right: sign-up confirmation

Figure 1.4: Left: Snowflake edition and cloud provider selection form. Right: sign-up confirmation

  1. Go to your email client and follow the link from the email that you receive to confirm your email address. It will take you back to the Snowflake website.
  2. On the Welcome to Snowflake! page, enter the username and password for the user that will become your account administrator and then click Get started.
Figure 1.5: Welcome to Snowflake! page

Figure 1.5: Welcome to Snowflake! page

Tip

Later, we will teach you how to create all the roles and users that you want.

Anyway, it makes sense to pick a good name and keep the password safe, as this user has all privileges and can do pretty much everything on your account.

  1. After clicking the Get started button, you will land on the Snowflake user interface, with an empty worksheet open and an introductory menu with a few topics to get started with Snowflake. If you are new to Snowflake, it is a good idea to go through those topics.
  2. After dismissing this introductory menu, you are ready to explore and use your Snowflake account, which in the classical console interface will look like this:
Figure 1.6: The user interface of Snowflake worksheets

Figure 1.6: The user interface of Snowflake worksheets

If you end up with the new Snowflake interface (called Snowsight) you can work with it or use the Classic Console entry in the menu to switch to the older interface.

Setting up initial users, roles, and a database in Snowflake

Now you have a Snowflake account with a user that you can log in with, and your user has been granted the omnipotent role of ACCOUNTADMIN.

That’s great, but it is not advised that you use this role for normal work. The best practice is to set up a few users and roles to provide the desired access level and use the lowest level that can accomplish a task.

Snowflake allows you to pick the role you impersonate to execute commands in a worksheet between the ones you have been granted so that you do not have to log out and back in to change the role.

Tip

You can change your role, the warehouse to use, and the current database and schema with the selector in the top-right corner of the worksheet editing area.

You might have noticed that in the navigation panel on the left, you have two databases:

  • The SNOWFLAKE database provides you with information about your Snowflake account
  • The SNOWFLAKE_SAMPLE_DATA DB provides, as the name implies, some sample data

These databases are shared with you by Snowflake, and you can only read from them.

To do something meaningful, you will need at least a database you can write in.

In this section, we will create a database and some roles and users to use it.

Overview of initial roles

In a database a user needs the relevant privilege to be able to perform an action.

A user can have the privilege directly, that is, assigned to the user, or indirectly, through a role that the user can impersonate.

In practice, privileges are assigned to roles, which are assigned to other roles and users, to build the desired security hierarchy, because assigning privileges to users is not scalable.

When you log in to a newly created Snowflake account, you are impersonating the ACCOUNTADMIN role, which has all privileges, but other, more specialized roles are already available.

Snowflake has already set up a few self-explanatory basic roles: ACCOUNTADMIN, SECURITYADMIN, USERADMIN, SYSADMIN, and PUBLIC.

Important note

To create users and roles, Snowflake provides the USERADMIN role by default, which has the CREATE ROLE and CREATE USER privileges directly assigned. This role is already assigned to the initial user that you created.

The initial user could create new users and roles using the ACCOUNTADMIN or SECURITYADMIN roles, because they have been granted the USERADMIN role; SECURITYADMIN had it granted directly and ACCOUNTADMIN indirectly, having been granted the SECURITYADMIN role.

However, this would not be a great idea, as you will see in the following paragraphs on ownership.

The following is an example of how to create a hierarchy of roles.

You can explore the existing roles and their privileges with these commands:

SHOW ROLES;
SHOW GRANTS TO ROLE <role_name>;

Ownership is needed for some operations, such as deletions, which cannot be granted.

It is therefore advisable that a user with the role that we want to use to manage an object is the one who creates it, while impersonating such role.

In this way, the user, and all others with the same role, will have the required privileges.

An alternative is to use the GRANT command to explicitly provide the required privileges to a role.

Important note

An object is owned by the role that created it.

When a user has multiple roles, the role impersonated by the user at the moment of the creation will own the object.

Before creating an object, make sure you are impersonating the role you want to own the object.

Creating and granting your first role

Let’s switch to USERADMIN to create our users and roles, because this is the role that you will probably grant to people delegated to perform user and role administration.

Being the owner will make all operations, including deletions, simpler:

  1. Switch to the USERADMIN role:
    USE ROLE USERADMIN;

Let’s create the role that we will use for all users that have to fully manage a dbt project, humans, or applications.

I usually call this role DBT_EXECUTOR_ROLE because I call the user for the dbt application DBT_EXECUTOR and I like to be crystal clear with my names.

You can of course pick a name that you prefer for both the role and user.

  1. Create a role for users running dbt models:
    CREATE ROLE DBT_EXECUTOR_ROLE
      COMMENT = 'Role for the users running DBT models';

We now have a role that we can shape and use to manage a set of database objects, for example, corresponding to one or more dbt projects.

The dbt executor user will be used by dbt to run dbt models in shared environments, while dbt will use each developer’s own user to run dbt models in their own development environment.

The simplest setup, giving all developers the ability to manage all environments, is to assign this role to all developers. An alternative is to have a different role for production and other environments that you want to keep protected.

  1. Let’s keep it simple and assign the executor role to ourselves:
    GRANT ROLE DBT_EXECUTOR_ROLE TO USER <your_user>;

Now, our user has both the account admin and the executor roles. To be able to see and select the role in the user interface dropdown, you might need to refresh the page.

Granting the privileges to create a database

We could create a database while impersonating the account admin role, but this will not help with our plan to use the executor role to manage this database. We must then give our executor role the ability to create a database. To have our SQL commands executed on a database, we need the ability to use or create an existing warehouse.

Important Note

To manage the structure of databases and the warehouse settings, Snowflake provides the default SYSADMIN role, which has the required privileges.

We can achieve our goal of providing the executor role with those abilities by doing one of the following:

  • Granting the required privileges directly to the executor role
  • Granting the SYSADMIN role to the executor role, which has the required privileges

What would be the difference?

If you grant the privileges directly, the executor role is totally isolated from other roles, and users with that role will just be able to access what the role allows them to access.

If you grant the SYSADMIN role, users with the executor role will also be able to impersonate the SYSADMIN role and therefore also access any object it can access. As an example, they could access a database created by another user that has created it with the sysadmin role.

Let’s assign the desired privileges directly, to keep our dbt objects away from SYSADMIN.

To grant the privileges to the executor role, we need to switch to a role that can grant these privileges and then grant them:

  1. Switch to the SYSADMIN role:
    USE ROLE SYSADMIN;
  2. Grant the privilege to create a new database:
    GRANT CREATE DATABASE ON ACCOUNT
      TO ROLE DBT_EXECUTOR_ROLE;
  3. Grant the privilege to use the default warehouse, named COMPUTE_WH:
    GRANT USAGE ON WAREHOUSE COMPUTE_WH
      TO ROLE DBT_EXECUTOR_ROLE;

Creating your first database

Now that our executor role can create a database, let’s do it:

  1. Let’s impersonate the executor role so that it will own what we create:
    USE ROLE DBT_EXECUTOR_ROLE;
  2. Now, let’s create our database:
    CREATE DATABASE DATA_ENG_DBT;
           -- or pick any name you like for the DB

To see the new database in Snowflake, you might need to refresh the browsing panel on the left of the user interface, with the Refresh button, shaped as a circular arrow, in the top right of the panel.

Tip

Clicking on the new database will show that it is not empty.

The new database has two schemata: INFORMATION_SCHEMA, which contains views that you can use to collect information on the database itself, and PUBLIC, which is the default schema of the new database and is empty.

To reduce Snowflake credit consumption, let’s finish by configuring the default warehouse to suspend after 1 minute instead of 10:

USE ROLE SYSADMIN;
ALTER WAREHOUSE "COMPUTE_WH" SET
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    COMMENT = 'Default Warehouse';

Tip

To alter the warehouse, you must switch to the SYSADMIN user as it is the role that owns the warehouse; we have granted our executor role only the privilege to use it.

Congratulations! You have now created a role, assigned it to your personal user, created a database owned by that role, and configured the default warehouse. You are almost done setting up.

Creating a user for the dbt application

Let’s complete our initial setup by creating a new user that we will use with dbt and grant it the executor role so that it will be able to manage, as an owner, that database.

To create a user, we need to switch to a role with that ability:

  1. Switch to the USERADMIN role:
    USE ROLE USERADMIN;
  2. Create the new user for dbt; I have named it DBT_EXECUTOR, though you can pick any name:
    CREATE USER IF NOT EXISTS DBT_EXECUTOR
        COMMENT = 'User running DBT commands'
        PASSWORD = 'pick_a_password'
        DEFAULT_WAREHOUSE = 'COMPUTE_WH'
        DEFAULT_ROLE = 'DBT_EXECUTOR_ROLE'
    ;
  3. Let’s assign the executor role to the dbt user:
    GRANT ROLE DBT_EXECUTOR_ROLE TO USER DBT_EXECUTOR;
  4. Switch back to the operational role, which we should always be working with:
    USE ROLE DBT_EXECUTOR_ROLE;

Great! You have now performed a basic setup of your new Snowflake account; you have learned the basics of user and role management and you are ready to learn more SQL to query data.

Querying data in SQL – syntax and operators

In this section, you will learn how to query existing data using SQL and the Snowflake example database.

In this chapter, and more generally in this book, you will get familiar with Snowflake syntax, which is modern and very standard. We will also use some proprietary extensions that make your work easier.

Snowflake query syntax

Snowflake supports querying data with the standard SELECT statement, which has the following basic syntax:

WITH …
SELECT …
FROM …
JOIN …
WHERE …
GROUP BY …
HAVING …
QUALIFY …
ORDER BY …
LIMIT …

The only mandatory part is select, so SELECT 1 is a valid query that just returns the value 1.

If you are familiar with SQL from other database systems, you will wonder what the QUALIFY clause is. It is an optional SQL clause that is very well suited to the analytical kind of work that Snowflake is used for and that not all database engines implement. It is described later in this section.

We often use the terms query, command, and statement interchangeably when referring to some piece of SQL that you can execute.

Properly speaking, a command is a generic command such as SELECT or CREATE <object>, while a statement is one specific and complete instance of a command that can be run, such as SELECT 1 or CREATE TABLE my_table …;.

The term query should really only refer to SELECT statements, as SELECT statements are used to query data from the database, but query is often used with any statement that has to do with data.

You will also often hear the term clause used, such as the FROM clause or GROUP BY clause. Informally, you can think about it as a piece of a statement that follows the syntax and rules of that specific keyword.

The WITH clause

The WITH clause is optional and can only precede the SELECT command to define one or more Common Table Expressions (CTEs). A CTE associates a name with the results of another SELECT statement, which can be used later in the main SELECT statement as any other table-like object.

Defining a CTE is useful for the following:

  • Clarity: You can provide an informative name to a piece of SQL
  • Reuse, maintenance, and efficiency: You can define a supporting query, whose results you might use more than once in the main query, and the database engine will execute the supporting query once
  • Creating recursive queries: Defining a CTE that queries itself is the only way to use recursion in SQL
  • Break a complex transformation into smaller, simpler steps that are easier to code and understand and limit the scope of future maintenance

The simplified syntax is as follows:

WITH [RECURSIVE]
  <cte1_name> AS (SELECT …)
  [, <cte2_name> AS (SELECT …)]
SELECT …

We will check out some examples here:

  1. Check whether we still have high-priority orders pending:
    WITH
    high_prio_orders as (
      SELECT *
      FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
      WHERE O_ORDERPRIORITY IN ('1-URGENT', '2-HIGH')
    )
    SELECT count(*)
    FROM high_prio_orders
    WHERE O_ORDERDATE < '1998-01-01'
      and O_ORDERSTATUS = 'O';

Here, you can see that the first expression encapsulates the business definition of high-priority order to be any order with the priority set to urgent or high priority. Without the CTE, you have to mix the business definition and other filtering logic in the WITH clause. Then, it would be unclear whether the status is part of the definition or is just a filter that we are applying now.

  1. Calculate some metrics for customers in the auto industry:
      WITH
      auto_customer_key as (
        SELECT C_CUSTKEY 
      FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER
      WHERE C_MKTSEGMENT = 'AUTOMOBILE'
    ),
    orders_by_auto_customer as (
      SELECT O_ORDERKEY
      FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS
      WHERE O_CUSTKEY in (SELECT * FROM auto_customer_key)
    ),
    metrics as (
      SELECT 'customers' as metric, count(*) as value
      FROM auto_customer
      UNION ALL
      SELECT 'orders by these customers', count(*)
      FROM orders_by_auto_customer
    )
    SELECT * FROM metrics;

In this example, the CTEs and the final query remain short and very simple. You can start to appreciate that the clear labeling of intentions and the simple pieces of SQL make the full query easy to understand for anyone.

The SELECT clause

The SELECT command can appear in two possible forms: as the central clause of a SELECT statement or used as a clause in other statements. In both cases, it defines the set of columns and calculated values returned by the statement.

The simplified syntax of a SELECT clause in Snowflake is a list of column definitions separated by commas, with the optional distinct keyword to omit duplicates:

SELECT [DISTINCT]
  <column_definition_1> [, <column_definition_2> …]

The SELECT clause is used to provide the list of columns that the query will return.

For each column definition that is provided, it is also possible to provide a column alias, which will be used to identify that column definition in the results.

When two column definitions have the same name, an alias must be provided for at least one of the two to avoid a name clash in the results.

A column definition can be one of these four expressions:

<column_definition> =
   [object.*]         -- all columns in the object
  |[object.col_name]  -- the named column in the object
  |[object.$n]        -- the n-th column in the object
  |[<expression>]     -- the value of the expression
 AS <col_alias>

Let’s describe in detail these four expressions:

  1. The star symbol, *, indicates all possible columns; it means “all columns from all tables” if it is applied without an alias or “all the columns from the aliased object” if applied with an alias.
    • Star without an alias will return all columns from both tables:
      SELECT * FROM table_1, table_2
    • Star with an alias (or table name) will return all columns from the object with the alias. In this case, it will return all columns from table_1, as it is applied to its alias:
      SELECT t1.* FROM table_1 as t1, table_2 as t2
    • Snowflake has introduced two powerful extensions to the Star Syntax: EXCLUDE to remove some columns from being returned and RENAME to rename a column while selecting it, as shown in this example:
      SELECT * EXCLUDE (c3, c2) RENAME (c4 as cx, c5 as cy)
      FROM table_1
  2. A column name, optionally with an alias indicating the object the column comes from and/or an alias to use for the column in the results:
    SELECT
        O_ORDERKEY,
        ord.O_CUSTKEY,
        cust.C_NAME as CUSTOMER_NAME
    FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS as ord
    JOIN SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER as cust
        ON cust.C_CUSTKEY = ord.O_CUSTKEY;

Note that here, the designers of the TPCH database have gone a long way to keep column names unique between all tables, by adding the table initials as a prefix to column names. This is a pretty old style as it makes it more difficult to recognize fields that contain the same values as C_CUSTKEY and O_CUSTKEY.

Looking at our example query, we can see the following:

  • Object aliases are optional if the name of the column is unique in the tables referenced in the query, as it is for all columns in the TPCH database.
  • Aliases can in any case be used for clarity, like with cust.C_NAME.
  • Object aliases are mandatory when referencing a column name that appears in more than one referenced table. This would have been the case if both fields had been just CUSTKEY, without the C_ and O_ table prefixes.
  • Lastly, a column alias can always be used, even just to rename a column with an initial name that we do not like or is not clear, as in the case of a name that would not be clear, so we rename it to CUSTOMER_NAME.
  1. A column number after a dollar sign, such as $1 or $3.

This allows us to reference columns that do not have a name, such as when reading out of a CSV file without a header. This is also useful, especially in scripts, to reference columns that we do not know the name of, but we know their position:

SELECT $1 as ORDER_KEY, $2 as CUST_KEY
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;

In this case, we read from a table that has column names, and we can always use column numbers instead of column names. But this is rarely done when reading from a table or a view because by using the numbers, the result will depend on the order of the columns in the object, which can change over time if the object is recreated.

  1. An expression, like a mathematical expression or a function call, that evaluates to some value for each row:
    SELECT
        P_PARTKEY
        , UPPER(P_NAME) as P_NAME
        , P_RETAILPRICE
        , P_RETAILPRICE * 0.9 as P_DISCOUNTED_PRICE
    FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART";

The preceding example shows two expressions: a mathematical expression that multiplies a column by a constant to produce a new column and a function call that converts the p_name column to uppercase, keeping the same column name.

Important note

The ability highlighted here, to write arbitrary expressions and assign a name to the result, is at the core of how SQL is used to transform data and is one of the most common things that we will do with dbt.

In this section, we have seen that the SELECT clause allows us four great abilities:

  1. To decide what data from the source tables to keep in the result of a query
  2. To calculate new data from source data using arbitrary complex expressions
  3. To provide a new name for the data that we want in the query result
  4. To keep only one copy of each distinct row in the result, using the DISTINCT keyword

The FROM clause

The FROM clause introduces the table objects used in a SQL statement as the source of data or target of the command.

In the case of a SELECT statement, the FROM clause can list none, one, or more table objects, each with its own alias. The rows in the result of the SELECT statement will potentially have all the columns from all the tables referenced in the FROM clause.

The simplified syntax looks like this:

SELECT …
FROM [tableObject1 [AS alias1]]
   [, tableObject2 [AS alias2] …]

Table objects, in the context of a SELECT statement, are a combination of the following:

  • Tables and views: Tables and views are the main data objects SQL works with, so most of the time, this is what you find in the FROM clause.
  • Table functions: Table functions are a category of functions that return a set of rows, with one or more columns. They can be predefined functions or user-defined ones.
  • A VALUES clause: This clause allows us to build a set of rows, using constant values. This allows us to create an inline table that can be suitable as a reference table, mapping table, or test input.
  • A few other objects that can be read from, such as a LATERAL subquery or a staged file.

When no table is listed, then the columns in the SELECT clause must use constants and global objects such as the current_date function, as in the following example:

SELECT 1 + 1 as sum, current_date as today;

When we use the VALUES clause, we can define an inline table, like in this example:

SELECT * FROM ( VALUES
  ('IT', 'ITA', 'Italy')
 ,('US', 'USA', 'United States of America')
 ,('SF', 'FIN', 'Finland (Suomi)')
  as inline_table (code_2, code_3, country_name)
);

When the FROM clause lists more than one table object, the result is the Cartesian product of the elements in these tables. To avoid a Cartesian explosion, it is important to include a WHERE clause that restricts the combinations, keeping only the ones we want.

Important note

The Cartesian product is the result of pairing the rows of all the objects in the product in all possible ways. This means that the number of rows returned by the Cartesian product is the product of the number of rows in each object in the product.

The expression Cartesian explosion is often used to refer to the fact that the number of rows returned by the Cartesian product increases very quickly and can easily create performance issues.

The following query, thanks to the WHERE clause, will generate a more meaningful and compact set of rows, ideally equal to the number of rows that exist in the LINEITEM table:

SELECT count(*)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM" as l
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"   as o
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" as c
    ,"SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."PART"     as p
WHERE o.O_ORDERKEY = l.L_ORDERKEY
  and c.C_CUSTKEY  = o.O_CUSTKEY
  and p.P_PARTKEY  = l.L_PARTKEY
;

In the previous query, we have used a WHERE clause to keep in the results only the rows that are really related. We achieve this by using equivalence constraints between the fields of the related tables.

The previous query generates a result of exactly 6,001,215 rows, which is the number of rows in the LINEITEM table, instead of the astronomical number resulting from the multiplication of the number of rows: 6,001,215 x 1,500,000 x 150,000 x 200,000, which is pretty much 2,7 x 10^23.

I’ll let you calculate the exact result…

The consequence is that the full query processes 6 million rows and runs in a few seconds, while the query without the WHERE clause will try to process 10^18 times more rows, taking very many hours or days to complete, in the process using up a lot of resources for nothing, as the resulting data provides nothing of interest.

The exact correspondence in the count derives from the fact that for each line item, there is only one order, for each order only one customer, and so on.

Providing the relevant WHERE clause will help with reducing the result by many orders of magnitude even if the keys don’t match so precisely. We will deal with this topic in more detail in Chapter 3, where we talk about data modeling.

This example illustrates well the power of SQL and why it is important to express what we want correctly.

Tip

As a basic rule, never do an unrestricted Cartesian product unless you know what you are doing and you are fine with the exploded number of rows that you will generate.

The JOIN clause

JOIN is a subclause of the FROM clause and is used to describe in a more precise way how to combine the rows of two tables. JOIN cannot exist outside of a FROM clause.

The simplified syntax of JOIN looks like this:

SELECT …
FROM tableObject1 AS to1
[<join type>] JOIN tableObject2 as to2 ON <condition_A>
[[<join type>] JOIN tableObject3 as to3 ON <condition_B>]
…

The condition in each join is a Boolean expression, often using columns from the base to1 table and the joined table. It does not need to be an equality check and can use columns from multiple tables or even constants.

It is normal to chain more than one JOIN inside a FROM clause to pick data from multiple tables at once, and you generally do not need to think too much about the order of the joins, but it is good to know that the order might matter. In the case of Snowflake, the order is from left to right (think of the statement written on one row).

We will look in more detail into the JOIN clause in the next section, where we will introduce the different types of joins and some examples.

We have seen that we can combine data from multiple tables, listing them in the FROM clause, and that we can use the WHERE clause to restrict the resulting Cartesian product to the combined rows that we actually want to keep. This use of FROM and WHERE is equivalent to the INNER JOIN construct, which we will see in the Combining data in SQL – the JOIN clause section.

The WHERE clause

The WHERE clause specifies the conditions that the rows involved in the command need to match.

The rows for which the expression evaluates to true are kept/processed.

In the context of a SELECT statement, the WHERE clause limits the result of the query to the subset of rows that verify the overall condition set by the WHERE clause.

The simplified syntax is as follows:

SELECT …
WHERE <predicate>

<predicate> can be as simple as the literal true or false or a complex expression, including logical operators and database functions that evaluate to a Boolean value (true, false, or NULL) for every row that the query processes.

Tip

An expression that returns a Boolean value is called a predicate.

The following example returns only the rows where the order total is greater than 500,000:

SELECT O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE O_TOTALPRICE > 500000;

The following example does not return any row, but defines two columns with the same name and type as the columns in the source table and two with the name and type that we provided:

SELECT O_ORDERKEY,O_CUSTKEY, 1 as an_int, null::number as a_num
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE false;

This may be a good trick when you need to define some column names and types. Remember that you do not need a FROM clause if you do not want to use columns from an existing table…

Important note

Be careful when NULL is involved as it might not behave as you expect.

As an example, the NULL = NULL expression evaluates to NULL.

In a WHERE clause, it means that the corresponding row is discarded.

We have seen in a previous example using the FROM clause that the WHERE clause can be used to specify some types of JOIN. We suggest using the JOIN … ON … syntax for joins and the WHERE clause to filter rows. We will look at more examples in the Combining data in SQL - the JOIN clause section.

The GROUP BY clause

A GROUP BY clause is used to calculate aggregate functions on groups of rows that produce the same value for the group by expression.

The simplified syntax is as follows:

SELECT …
GROUP BY groupExpr1 [, groupExpr2 …]

The group by expression can be one of the following:

  • Column name: The result of the expression is the value of the column with the given name. All rows with the same value in the column are grouped.

The following query calculates the sum of all orders by each customer (customer key):

SELECT O_CUSTKEY, sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY O_CUSTKEY;
  • Number: The value to group by is the value of the column in the given position in the result of the SELECT statement. The value is evaluated at the end of the statement, after applying any function or expression.

The following query is the same as the previous one, using the ordinal 1 instead of the O_CUSTKEY column name to indicate the column to group by:

SELECT O_CUSTKEY, sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY 1;
  • SQL expression: Any expression using any combination of data from the query. The value to group by is the result of the expression.

The following query calculates the total orders by year, using the year() function, which returns the year from a date:

SELECT YEAR(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE);

We have also added the ORDER BY clause with the same expression as the GROUP BY so that we get the output in a nice order. This is a pretty common pattern.

When more than one group by expression is provided, the rows are grouped together according to the values of all the group by expressions.

In the previous example, you could have used GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE) to have the order total split by month (and year), like in the following:

SELECT YEAR(O_ORDERDATE),MONTH(O_ORDERDATE),sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);

The HAVING clause

The HAVING clause filters the rows resulting from a GROUP BY clause according to a predicate:

SELECT …
GROUP BY …
HAVING <predicate>

The predicate is an expression that returns a Boolean value and references:

  • Constants
  • Expressions that appear in the GROUP BY clause
  • Aggregated functions that can be calculated according to the GROUP BY clause

As an example, we could extend the example from the GROUP BY topic to look at the months where we have less than 10,000 orders.

The following query does it:

SELECT YEAR(O_ORDERDATE), MONTH(O_ORDERDATE), sum(O_TOTALPRICE)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
HAVING count(*) < 10000
ORDER BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE);

In this query, we have used the count() aggregate function to count how many lines, one for each order, there are in one group.

Tip

The HAVING clause is for GROUP BY what the WHERE clause is for the FROM clause.

We will look at the third filtering clause, QUALIFY, in the next topic.

The QUALIFY clause

The QUALIFY clause filters on the results of the window functions, which is the third type of data calculation that we can have in a query after expressions and group by.

We will dedicate the last section of this chapter to window functions.

The QUALIFY clause is an optional SQL clause that proves extremely useful in analytical workloads, so it is implemented in Snowflake.

To recap a bit of what we have seen, these are the three clauses that are used to filter data:

  • WHERE applies to the data being read from the sources specified in the FROM / JOIN clauses
  • HAVING applies to the data that has been grouped by a GROUP BY clause
  • QUALIFY applies to the data that has been calculated by a window function

One of the most common usages of the QUALIFY clause together with the simple row_number() window function is to defend against undesired duplicates or select one specific row with respect to other rows that represent different versions of the same object:

SELECT *
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."LINEITEM"
QUALIFY row_number()
            over(partition by L_ORDERKEY, L_LINENUMBER
                 order by L_COMMITDATE desc ) = 1;

This query selects all fields from a table, just filtering out undesired rows for which the row_number window function returns a value different from 1.

The row_number window function, assigns a progressive number from 1 onward to all the rows in the same window, following the row order, and then restarts from 1 with the next window.

Picking the rows where it is equal to 1 means keeping only the first row for each window.

The windows are defined by the over(…) clause, which comprises a partition by part that defines the windows, in a way similar to group by, and an order by that provides the ordering in the window. We will look in more detail at this in the section devoted to windows functions.

Tip

When using the QUALIFY clause, as shown in the previous query, we avoid the need to create a column with the result of the row_number window function and then use a subquery to filter on it. That is the normal way of doing this kind of filtering in a database that does not support the QUALIFY clause. QUALIFY is much simpler to read and use.

Now, let’s extend the order example from the previous section to select out the “good months,” when we have higher total sales than the average sales for the year they belong to:

WITH
monthly_totals as (
  SELECT
    YEAR(O_ORDERDATE) as year,
    MONTH(O_ORDERDATE) as month,
    sum(O_TOTALPRICE) as month_tot
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
  GROUP BY YEAR(O_ORDERDATE), MONTH(O_ORDERDATE)
)
SELECT year, month, month_tot
       ,avg(month_tot) over(partition by YEAR) as year_avg
FROM monthly_totals
QUALIFY month_tot > year_avg
ORDER BY YEAR, MONTH;

Here, we have put together a few of the clauses that we have seen so far, using a WITH clause to define a CTE named monthly_totals with our previous query to calculate monthly totals.

We have then defined a query that reads from the CTE and uses the avg window function to calculate the average monthly sales for each month of the full year that a particular month belongs to. Then, we use the QUALIFY clause to keep only the rows where the monthly total is greater than the yearly average.

Tip

The previous example illustrates why the window functions are also called analytical functions.

Important note

Please note that the window function calculates a value for each row using the defined windows, not changing the number of rows. In contrast, using the avg aggregate function with a GROUP BY clause would have reduced the number of rows to one per group.

The ORDER BY clause

The ORDER BY clause specifies in which order the query or window function lays out the rows.

The simplified syntax is as follows:

SELECT …
ORDER BY orderExpr1 [ASC|DESC] [NULLS FIRST|LAST] [, orderExpr2 …]

Each order by expression can be made out of three parts:

  1. One expression that identifies what to order on. It can be either a column alias, a position ordinal, or an expression, as we have seen for the GROUP BY clause. Please refer to that section for more details.
  2. An optional direction for the sorting: ASC for ascending or DESC for descending sort.
  3. An optional specification of how to sort null values: NULLS FIRST or NULLS LAST, which are self-explanatory.

Please look at previous sections for examples of ORDER BY in action, in both queries and window function definitions with over().

The LIMIT/FETCH clause

The LIMIT clause is used to restrict the number of rows returned by the query.

LIMIT and FETCH are synonyms, with slightly different syntax.

Let’s look at the LIMIT syntax:

SELECT …
[ORDER BY …]
LIMIT <count> [OFFSET <start>]

The count parameter is a number that specifies the maximum number of rows to return.

If the OFFSET part is present, the returned rows are the ones after the start position. This allows tools connecting to a database to retrieve all the results in chunks of the desired size. As an example, a LIMIT 10 OFFSET 10 clause would retrieve rows from 11 to 20.

Important note

If no ORDER BY clause is present, the order of the rows is undefined and could differ in each execution of the same query. In this case, the result of a LIMIT clause is non-deterministic because what rows are returned depends on the order in which the rows happen in the result set.

Query clause order of evaluation

In the previous sections, we have seen all the clauses that can appear in a SELECT statement.

Now is a good time to bring your attention to the fact that these clauses are generally evaluated in the following specific order, as well as what it is important to pay attention to for each clause:

  1. FROM and its JOIN subclause, which are used to identify the source data for the query.
  2. The WHERE clause, which is used to filter out the source data that we do not want.

This is probably the most important clause for performance, because the less data a query works on, the quicker it is. Use WHERE whenever possible to just bring in the data you need.

  1. The GROUP BY clause, which groups the source data left after applying the WHERE clause and calculates the aggregate functions on the grouped data.
  2. The HAVING clause, which filters on the results of GROUP BY.
  3. Partitioning of the windows and calculation of the window functions.
  4. The QUALIFY clause, which filters on the results of the window functions.
  5. The DISTINCT keyword, if applied to the SELECT clause, which removes duplicated rows.
  6. The ORDER BY clause, which puts the resulting rows in the desired order.
  7. The LIMIT clause, which caps the rows returned by the query to the desired amount.

SQL operators

When writing queries, we can perform operations on the data handled by the query.

We do so by building expressions that return the desired value, using functions and operators.

We can perform an operation pretty much everywhere a value is expected: in the SELECT clause to provide the desired outputs by transforming the inputs, in the WHERE clause or ON part of a JOIN clause, HAVING clause, or QUALIFY clause to identify what should or should not be returned by the query, in GROUP BY to decide how to aggregate, and so on.

Let’s go through the categories of operators and how they are used:

  • Arithmetic operators: These are the traditional +, -, *, /, and % (modulo).

They expect one or more numeric (or convertible to numeric) inputs to provide a numeric result, with the usual arithmetic precedence and meaning, like in the following example:

SELECT 1 + '2' as three, (3+2) * 4 as twenty
WHERE twenty % 2 = 0;

Note that '2' is a string but can be automatically converted to the number 2. Also note that implicit conversions happen, but explicit conversions are better.

The modulo operator returns the remainder of the division of the first operator by the second, and the val % 2 = 0 pattern is often used to identify even numbers.

  • Comparison operators: These are used to test two values for equality or other comparisons, and are = (equal), != (not equal), <> (also not equal), < (less than), <= (less than or equal), > (greater than), and >= (greater than or equal).

They are typically used in the WHERE clause, but can be used anywhere a Boolean result is desired, such as in the following example:

SELECT 2 < 1 as nope, '3' != 'three' as yep
WHERE 1 != 2;

Note that the operators can compare all types where the operation is defined.

I have seen WHERE 1 = 2 used in many places to avoid returning any row, because 1=2 always returns FALSE; it would be clearer to write WHERE false directly and if you do not need to filter any row out, you can just leave the WHERE clause off, or use the WHERE true expression.

  • Logical operators: These are the traditional AND, OR, and NOT Boolean operators.

They operate only on Boolean predicates and values and return Booleans.

They are generally used in the WHERE clause, but can appear anywhere an expression is allowed, such as in the following example:

SELECT *,
  (C_ACCTBAL > 7500) AND (C_NATIONKEY = 24) as IS_TOP_US_CUST
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE (C_NAME IS NOT null) AND IS_TOP_US_CUST;

In the previous example, we have defined an IS_TOP_US_CUST column using a logical operator to combine two predicates, as we have done in the WHERE clause, also reusing the new Boolean column we created.

  • Set operators: These operators are used to put together the results from multiple queries, and are INTERSECT, MINUS or EXCEPT, and UNION [ALL].

The simplified syntax is as follows:

query_1 <set_operator> query_2

The queries must be compatible, having the same number of columns and of the same type.

It is also important that the semantics of the columns in the same position are correct, as combining people’s names with state codes is possible, as they are both strings, but in general, it does not make much sense.

Let’s describe the set operators:

  • INTERSECT returns the rows that appear in both queries, checking all columns to have the same value
  • MINUS or EXCEPT returns the rows from the first query that do not appear in the second
  • UNION [ALL] returns the rows from both queries, with ALL keeping duplicates

As an example, the following query returns data for the customers from India (8) and the US (24) and the customers in the AUTOMOBILE segment without duplicates:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (8, 24)
UNION
SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_MKTSEGMENT = 'AUTOMOBILE';

In this special case, as both the queries are on the same table, we could have just used a single query with a slightly more complex WHERE clause using an OR operator to compose the two individual clauses. In real cases, you might want to combine similar data from different tables and the set operators are here for you.

  • Subquery operators: These operators allow us to use subqueries in WHERE clauses.

A subquery is a query defined inside another query.

A subquery can be used without any operator as a table-like object, such as selecting a subset of a table we want to use some data, or an expression if it returns a single value, as in the following example:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY = (
  SELECT N_NATIONKEY
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
  WHERE N_NAME = 'JAPAN'
);

In this example, the subquery returns only one value, so we can use the equal operator.

The subquery operators extend their use to other cases, as per their definitions:

  • ALL/ANY: Allows you to apply a comparison to all/any rows of the subquery
  • [NOT] EXISTS: Returns true if the subquery returns at least one row, false with NOT in front
  • [NOT] IN: Returns true if the expression is not included in the results of the subquery

The following example extends the previous example to the case when we want customers from more than one country, picking the country by name:

SELECT C_NAME, C_ADDRESS, C_PHONE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER"
WHERE C_NATIONKEY IN (
  SELECT N_NATIONKEY
  FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."NATION"
  WHERE N_NAME IN ('JAPAN', 'CANADA')
);

We could easily rewrite the same query using = ANY instead of IN, and probably with some changes in the subquery also using EXISTS.

Now that we have been acquainted with the basics of querying data in SQL, let’s dive deeper into the JOIN clause, which allows us to put together data from multiple sources, which is crucial to turning raw data into useful information.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Build a solid dbt base and learn data modeling and the modern data stack to become an analytics engineer
  • Build automated and reliable pipelines to deploy, test, run, and monitor ELTs with dbt Cloud
  • Guided dbt + Snowflake project to build a pattern-based architecture that delivers reliable datasets

Description

dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps. This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free version of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployment, ELT run, scheduling, and monitoring, solving practical cases you encounter in your daily work. By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations, including master data and the desired business rules, and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.

Who is this book for?

This book is for data engineers, analytics engineers, BI professionals, and data analysts who want to learn how to build simple, futureproof, and maintainable data platforms in an agile way. Project managers, data team managers, and decision makers looking to understand the importance of building a data platform and foster a culture of high-performing data teams will also find this book useful. Basic knowledge of SQL and data modeling will help you get the most out of the many layers of this book. The book also includes primers on many data-related subjects to help juniors get started.

What you will learn

  • Create a dbt Cloud account and understand the ELT workflow
  • Combine Snowflake and dbt for building modern data engineering pipelines
  • Use SQL to transform raw data into usable data, and test its accuracy
  • Write dbt macros and use Jinja to apply software engineering principles
  • Test data and transformations to ensure reliability and data quality
  • Build a lightweight pragmatic data platform using proven patterns
  • Write easy-to-maintain idempotent code using dbt materialization
Estimated delivery fee Deliver to Australia

Economy delivery 7 - 10 business days

AU$19.95

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jun 30, 2023
Length: 578 pages
Edition : 1st
Language : English
ISBN-13 : 9781803246284
Category :
Languages :
Tools :

What do you get with Print?

Product feature icon Instant access to your digital eBook copy whilst your Print order is Shipped
Product feature icon Paperback book shipped to your preferred address
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
Product feature icon AI Assistant (beta) to help accelerate your learning
Estimated delivery fee Deliver to Australia

Economy delivery 7 - 10 business days

AU$19.95

Product Details

Publication date : Jun 30, 2023
Length: 578 pages
Edition : 1st
Language : English
ISBN-13 : 9781803246284
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
AU$24.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
AU$249.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just AU$5 each
Feature tick icon Exclusive print discounts
AU$349.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just AU$5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total AU$ 195.97 209.97 14.00 saved
Data Engineering with dbt
AU$68.99
Data Modeling with Snowflake
AU$54.99 AU$68.99
Data Engineering with AWS
AU$71.99
Total AU$ 195.97 209.97 14.00 saved Stars icon

Table of Contents

20 Chapters
Part 1: The Foundations of Data Engineering Chevron down icon Chevron up icon
Chapter 1: The Basics of SQL to Transform Data Chevron down icon Chevron up icon
Chapter 2: Setting Up Your dbt Cloud Development Environment Chevron down icon Chevron up icon
Chapter 3: Data Modeling for Data Engineering Chevron down icon Chevron up icon
Chapter 4: Analytics Engineering as the New Core of Data Engineering Chevron down icon Chevron up icon
Chapter 5: Transforming Data with dbt Chevron down icon Chevron up icon
Part 2: Agile Data Engineering with dbt Chevron down icon Chevron up icon
Chapter 6: Writing Maintainable Code Chevron down icon Chevron up icon
Chapter 7: Working with Dimensional Data Chevron down icon Chevron up icon
Chapter 8: Delivering Consistency in Your Data Chevron down icon Chevron up icon
Chapter 9: Delivering Reliability in Your Data Chevron down icon Chevron up icon
Chapter 10: Agile Development Chevron down icon Chevron up icon
Chapter 11: Team Collaboration Chevron down icon Chevron up icon
Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms Chevron down icon Chevron up icon
Chapter 12: Deployment, Execution, and Documentation Automation Chevron down icon Chevron up icon
Chapter 13: Moving Beyond the Basics Chevron down icon Chevron up icon
Chapter 14: Enhancing Software Quality Chevron down icon Chevron up icon
Chapter 15: Patterns for Frequent Use Cases Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon