Data Engineering with dbt: A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL

By Roberto Zagni

Jun 2023 | 578 pages | 1st Edition

The Basics of SQL to Transform Data

This book is about engineering data platforms with dbt. When working with dbt, you write code in Structured Query Language (normally referred to as SQL, pronounced sequel or as its letters: es-qu-el), with a spritz of Jinja templating.

This means that when we are not discussing architectural and design topics, we will be reading and writing code in SQL, so you should get up to speed with SQL to work proficiently with dbt.

In this chapter, you will review, or learn from scratch, the basics of SQL, becoming familiar with this powerful language that you find everywhere when working with data.

First, we will introduce SQL and walk you through its core concepts, so that even if you are new to this language, you can start building your competence and follow the rest of this chapter and the book.

To complement this introduction, you will quickly be presented with all the main SQL commands, organized by category of use. This will ensure that you know the most relevant parts of SQL and you can use this chapter as a quick reference when you start writing your code.

You will then be guided through how to create an account in Snowflake, the leading cloud database that is very often used with dbt. You will use Snowflake in the cloud, without the need to install anything, and it is free for a trial period.

Once you have a Snowflake account, you will be guided through creating your first role, user, and database.

With your cloud database ready and the basic SQL commands presented, we will look deeper into the SELECT statement and the JOIN clause, as they will be crucial in working with dbt.

Analytical and window functions are presented at the end of the chapter in a simple way, suitable for beginners as well as mid- to senior-experienced SQL users; these advanced functionalities will power up your SQL skills, vastly widening your abilities.

To summarize, in this chapter, we are going to cover these topics:

  • Introducing SQL
  • SQL basics – core concepts and commands
  • Setting up a Snowflake database with users and roles
  • Querying data in SQL – syntax and operators
  • Combining data in SQL – the JOIN clause
  • Advanced – introducing window functions

Technical requirements

This chapter does not assume any prior SQL knowledge and presents information from the basics to intermediate level. If you know SQL well, you can skip this chapter or just browse the more advanced topics.

All code samples of this chapter are available on GitHub at https://github.com/PacktPublishing/Data-engineering-with-dbt/tree/main/Chapter_01.

To run the samples, later in this chapter, we will guide you through setting up a Snowflake account and roles and creating your first database.

Introducing SQL

SQL was created in the 70s and, by the end of the 80s, had become the de facto standard for interacting with Relational Databases (RDBs); it now powers most of the data management industry in the world.

Given its huge footprint and powerful abstractions, SQL has become a standard that anyone working with database systems eventually becomes familiar with. The expressive power of SQL is well understood, and knowledge of it is so widespread that it has been adopted beyond RDBs, with Database Management Systems (DBMSs) of all sorts providing a SQL interface even on top of many non-RDB systems.

Some of the great advantages of SQL are as follows:

  • The core SQL functionality was standardized in the 80s and yet SQL is still very much alive and well, evolving and adding new powerful functionalities as data management evolves while maintaining compatibility with previous versions.

Every database has its SQL quirks, but the logic is the same and most SQL code will work on multiple databases with little or no change.

Learn it now and use it forever, and with (almost) every database.

  • At its core, it has a simple, rigorous, and powerful syntax that reads like English sentences, so even non-tech people can grasp the basic idea, while professionals can express exactly what they want in a precise and concise way.

Most people can probably get a sense of what the following SQL does:

SELECT ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT
FROM ORDERS
WHERE YEAR(ORDER_DATE) = 2021;
  • With SQL, you work at the logical level, so you do not have to deal with implementation details, and it is a declarative language; you describe in a rigorous way what you want to achieve, not how to do it. The database engine has the freedom to store data, be implemented, and perform the request in its own way, as long as it produces the correct result according to SQL specifications.
  • With a single SQL statement, you can process one piece of data or billions, leaving the burden of finding the most effective way to the database and giving you some freedom from scale.

SQL basics – core concepts and commands

The SQL language is certainly big and complex and takes time to master fully, but surprisingly, you can get productive with the limited set of features that we will introduce in this book, and you will pick up the nuances when you need them while working on real use cases.

The goal of this chapter is not to make you a SQL expert but to get you started with it, getting deeper into the few commands that we will use often to make you quickly productive with dbt.

In this intro, we start with the most basic concepts of SQL and then provide an overview of the SQL commands by their categories of use. In the rest of this chapter, we will dig deeper into the commands that are mostly used when working with dbt: SELECT, JOIN, and analytical and window functions.

SQL core concepts

The main goal of SQL, as its name implies, is to allow users to query data that is contained in a database; SQL also provides all the commands to fully manage the database, allowing you to add, transform, delete, organize, and manage data and other database objects, such as users and roles.

The core concepts in SQL come from how an RDB is organized:

  • A database contains tables, organized in schemata (plural of schema)
  • Tables store data in rows that have one value for each column defined in the table
  • Columns have a name and can contain only data of the declared data type
  • To regulate access to the data, privileges are assigned to users and roles

You can see a database with schemata and tables in the following screenshot, which shows part of the sample database available in any Snowflake account:

Figure 1.1: Example database with schemata and tables

Let’s go through the core concepts of SQL, starting with the table.

What is a table?

The table is the most central concept in SQL, as it is the object that contains data.

A table in SQL is very close to the layman’s concept of a table, with data organized in columns and rows.

Columns represent the attributes that can be stored in a table, such as a customer’s name or the currency of an order, with each column defining the type of data it can store, such as text, a number, or a date.

In a table, you can have as many rows as you want, and you can keep adding more whenever you need. Each row stores one instance of the concept that the table represents, such as a specific order in the following example of a table for orders:

Order_ID | Customer_CODE | Total_amount | Currency
---------|---------------|--------------|---------
123      | ABC           | 100          | EUR
166      | C099          | 125.15       | USD

Table 1.1: Example order table

Looking at the previous example order table, we can see that the table has four columns that allow storing the four attributes for each row (order ID, customer code, total amount, and currency of the order). The table, as represented, has two rows of data, representing one order each.

The first row is not data; it is just a header that shows the column names and makes the table easier to read when printed out.

In SQL, a table is both the definition of its content (columns and their types) and the content itself (the data, organized by rows):

  • Table definition: It lists the columns that make up the table, and each column provides the data type and other optional details. The data type is mandatory and declares what values will be accepted in the column.
  • Table content: It is organized in rows, each row containing one value for each column defined for the table or null if no value is provided.
  • Data value: All the values in a column must be compatible with the type declared for the column. null is a special value that corresponds to the absence of data and is compatible with all data types.

When creating a table, we must provide a name for the table and its definition, which consists of at least column names and a type for each column; the data can be added at a different time.

In the following code block, we have a sample definition for an order table:

CREATE TABLE ORDERS (
   ORDER_ID NUMBER,
   CUSTOMER_CODE TEXT,
   TOTAL_AMOUNT FLOAT,
   ORDER_DATE DATE,
   CURRENCY TEXT DEFAULT 'EUR'
);

Tip

When we write some SQL, we will use Snowflake’s SQL syntax and commands. We will guide you on how to create a free Snowflake account where you can run the code shown here.

In this example, we see that the command to create a table reads pretty much like English and provides the name of the table and a list of columns with the type of data each column is going to contain.

You can see that for the CURRENCY column, of type text, this code also provides the default value EUR. Single quotes are used in SQL to delimit a piece of text, aka a string.
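
Since the data can be added at a different time, here is a minimal sketch of how a couple of rows could be added with the INSERT command (covered later in the DML section); the values come from the example order table, while the order dates are made up for illustration:

INSERT INTO ORDERS (ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT, ORDER_DATE)
VALUES (123, 'ABC', 100, '2021-07-15');   -- CURRENCY falls back to the default 'EUR'

INSERT INTO ORDERS (ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT, ORDER_DATE, CURRENCY)
VALUES (166, 'C099', 125.15, '2021-08-02', 'USD');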

View – a special way to look at some data

If you already have data and you want to make it available with some transformation or filtering, you can create a view. You can think of a view as a table whose column definitions and data both come from a query that reads one or more tables.

As an example, if you would like a short list of orders with an amount greater than 1,000, you could write the following query to create a BIG_ORDERS view:

CREATE VIEW BIG_ORDERS AS
SELECT * FROM ORDERS
WHERE TOTAL_AMOUNT > 1000;

In this example, we see that this simple create view statement provides the name for the view and uses a query, which is a SELECT statement, to define what data is made available by the view.

The query provides both the data, all the orders with a total amount greater than 1,000, and the column definitions. The * character – called star – is a shortcut for all columns in the tables read by the SELECT statement.

This is, of course, a naïve example, but throughout this book, you will see that combining tables that store data and views that filter and transform the data coming from tables and views is the bread and butter of working with dbt. Building one object on top of the previous allows us to take raw data as input and provide as output refined information that our users can easily access and understand.
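
To give a flavor of this layering, the following sketch queries the view just like a table and then defines a further view on top of it; the year filter and the BIG_ORDERS_2021 name are made up for illustration:

SELECT CUSTOMER_CODE, TOTAL_AMOUNT
FROM BIG_ORDERS
WHERE CURRENCY = 'EUR';

CREATE VIEW BIG_ORDERS_2021 AS
SELECT * FROM BIG_ORDERS
WHERE YEAR(ORDER_DATE) = 2021;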

Tip

When working with dbt, you will not need to write create table or create view statements, as dbt will create them for you. It is nevertheless good to become familiar with these basic SQL commands, as they are the commands executed in the database, and you will see them if you look at the logs.

Database and schema

We have seen that a database is a container for tables and views.

A DB can be further divided and organized using schema objects.

In real-life DBs, the number of tables can range from a few units to many thousands. Schemata act pretty much like folders, but cannot be nested, so you can always identify a table by its database, schema, and name.

In the following screenshot, we see part of the contents of the SNOWFLAKE_SAMPLE_DATA database, which is available in all Snowflake accounts:

Figure 1.2: Some schemata and tables of the SNOWFLAKE_SAMPLE_DATA database

The database and schemata, as table containers, are the main ways information is organized, but they are also used to apply security, access limitations, and other features in a hierarchical way, simplifying the management of big systems.

To create the TEST database and the SOME_DATA schema in it, we can use the following commands:

CREATE DATABASE TEST;
CREATE SCHEMA TEST.SOME_DATA;

The database.schema notation, also known as a fully qualified name, allows us to precisely describe in which database to create the schema and after its creation, uniquely identifies the schema.
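
As an illustration, the following sketch creates a table using its fully qualified name and then another one after setting the context with USE commands; the COUNTRIES and CURRENCIES tables are made up for this example:

CREATE TABLE TEST.SOME_DATA.COUNTRIES (COUNTRY_CODE TEXT, COUNTRY_NAME TEXT);

USE DATABASE TEST;
USE SCHEMA SOME_DATA;
CREATE TABLE CURRENCIES (CURRENCY_CODE TEXT, CURRENCY_NAME TEXT);  -- created in TEST.SOME_DATA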

Tip

While working with dbt, you will create a database or use an existing one for your project; dbt will create the required schema objects for you if you have not created them already.

A best practice in Snowflake is to have one database for each project or set of data that you want to keep separate for administration purposes. Databases and schemata in Snowflake are soft boundaries, as all the data from all the databases and schemata can be accessed if the user has the appropriate privileges.

In some other database systems, such as PostgreSQL, a database is a stronger boundary.

Users and roles

To control access to your data in SQL, you GRANT access and other privileges to both users and roles.

A user represents one individual user or service that can access the database, while a role represents a named entity that can be granted privileges.

A role can be granted to a user, providing them all the privileges associated with the role.

A role can also be granted to another role, building hierarchies that use simple basic roles to build more complex roles that are assigned to a user.

Using roles instead of granting privileges directly to users allows you to manage even a large number of users simply and consistently. In this case, roles act as labels for a set of privileges and become a way to manage groups of users to which we want to grant the same privileges. Changing the privileges granted to a role at any moment changes the privileges that the users receive from that role.

A typical pattern when working with dbt is to create a role for the dbt users and then assign it to the developers and the service user that the dbt program will use.

The following is an example of a simple setup with one role and a couple of users:

CREATE ROLE DBT_SAMPLE_ROLE;
CREATE USER MY_NAME;           -- Personal user
CREATE USER SAMPLE_SERVICE;    -- Service user
GRANT ROLE DBT_SAMPLE_ROLE TO USER MY_NAME;
GRANT ROLE DBT_SAMPLE_ROLE TO USER SAMPLE_SERVICE;

A more complex setup could have one role to read and one to write for each source system (represented by a schema with the data from the system), for the data warehouse (one or more schemata where the data is processed), and for each data mart (one schema for each data mart).

You could then control in much more detail who can read and write what, at the cost of more effort.
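
As a sketch of what such grants could look like in Snowflake, reusing the DBT_SAMPLE_ROLE role and the TEST.SOME_DATA schema from the previous examples (the exact set of privileges to grant depends on your project):

GRANT USAGE ON DATABASE TEST TO ROLE DBT_SAMPLE_ROLE;
GRANT USAGE ON SCHEMA TEST.SOME_DATA TO ROLE DBT_SAMPLE_ROLE;
GRANT SELECT ON ALL TABLES IN SCHEMA TEST.SOME_DATA TO ROLE DBT_SAMPLE_ROLE;
GRANT CREATE TABLE, CREATE VIEW ON SCHEMA TEST.SOME_DATA TO ROLE DBT_SAMPLE_ROLE;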

Understanding the categories of SQL commands

SQL commands can be organized into categories according to their usage in the language:

  • Data Definition Language (DDL): DDL contains the commands that are used to manage the structure and organization of a database
  • Data Manipulation Language (DML): DML contains the commands that are used to manipulate data, for example, INSERT, DELETE, and UPDATE
  • Data Query Language (DQL): DQL contains the SELECT command and is the central part of SQL that allows querying and transforming the data stored in a database
  • Data Control Language (DCL): DCL contains the GRANT and REVOKE commands, which are used to manage the privileges that control the access to database resources and objects
  • Transaction Control Language (TCL): TCL contains the commands to manage transactions

In the upcoming sections, we provide more details about these by looking at Snowflake-specific commands, but the ideas and names are of general use in all database systems, with little or no change.

Data Definition Language – DDL

DDL commands do not deal directly with the data but are used to create and maintain the structure and organization of the database, including creating the tables where the data is stored.

They operate on the following objects:

  • Account/session objects, which contain and operate on the data, such as user, role, database, and warehouse
  • Database/schema objects, which store and manipulate the data, such as schema, table, view, function, and stored procedure

The main commands are as follows:

  • CREATE: Used to create the database itself and other objects
  • DROP: Used to delete the specified object
  • ALTER: Used to modify some attribute of the specified object
  • DESC: Used to describe the details of the specified object
  • SHOW: Used to list the existing objects of the specified object type, with metadata
  • USE: Used to select the database, schema, and object to use when fully specified names are not used
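
The following sketch applies a few of these DDL commands to the objects created earlier in the chapter; the ORDER_STATUS column is made up for illustration:

ALTER TABLE ORDERS ADD COLUMN ORDER_STATUS TEXT;  -- add a column to an existing table
DESC TABLE ORDERS;                                -- describe the columns of the table
SHOW TABLES IN SCHEMA TEST.SOME_DATA;             -- list the tables in a schema, with metadata
USE SCHEMA TEST.SOME_DATA;                        -- set the default schema for the session
DROP VIEW BIG_ORDERS;                             -- delete the view created earlier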

Tip

When working with dbt, we use the DDL and DML commands only in macros.

We do not use the DDL and DML commands in models because dbt will generate the required commands for our models based on the metadata attached to the model.

Data Manipulation Language – DML

DML provides the commands to manipulate data in a database and carry out bulk data loading.

Snowflake also provides specific commands to stage files, that is, to load files into a Snowflake-managed location called a stage.

The main commands are as follows:

  • INSERT: Inserts rows into a table
  • DELETE: Removes specified rows from a table
  • UPDATE: Updates some values of specified rows in a table
  • MERGE: Inserts, updates, or deletes rows in a table
  • TRUNCATE TABLE: Empties a table, preserving the definition and privileges
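
The following sketch exercises these commands on the ORDERS table from earlier; the NEW_ORDERS table used as the MERGE source is hypothetical:

UPDATE ORDERS SET CURRENCY = 'USD' WHERE ORDER_ID = 123;
DELETE FROM ORDERS WHERE TOTAL_AMOUNT = 0;

MERGE INTO ORDERS tgt
USING NEW_ORDERS src
  ON tgt.ORDER_ID = src.ORDER_ID
WHEN MATCHED THEN UPDATE SET TOTAL_AMOUNT = src.TOTAL_AMOUNT
WHEN NOT MATCHED THEN INSERT (ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT)
  VALUES (src.ORDER_ID, src.CUSTOMER_CODE, src.TOTAL_AMOUNT);

TRUNCATE TABLE ORDERS;  -- removes all rows, keeps the definition and privileges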

For bulk data loading, Snowflake provides the following command:

  • COPY INTO: Loads data from files in a stage into a table or unloads data from a table into one or more files in a stage

To manage files in stages, Snowflake provides these file-staging commands:

  • PUT: Uploads a local file to a Snowflake stage
  • GET: Downloads a file from a stage to the local machine
  • LIST: Lists the files existing in a Snowflake stage
  • REMOVE: Removes a file from a Snowflake stage
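
A sketch of a typical loading flow, assuming an internal stage named MY_STAGE and a local orders.csv file, both made up for this example (PUT and GET run from a client such as SnowSQL, not from the web UI):

CREATE STAGE MY_STAGE;                 -- a Snowflake-managed (internal) stage
PUT file:///tmp/orders.csv @MY_STAGE;  -- upload the local file (compressed automatically)
LIST @MY_STAGE;                        -- check what is in the stage
COPY INTO ORDERS
  FROM @MY_STAGE
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
REMOVE @MY_STAGE;                      -- clean up the staged files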

Important note

In dbt, we can use macros with the COPY INTO and file-staging commands to manage the data-loading part of a data pipeline, when source data is in a file storage service such as AWS S3, Google Cloud Storage, or Microsoft Azure Data Lake file storage.

Data Query Language – DQL

DQL is the reason why SQL exists: to query and transform data.

The command that is used to query data is SELECT, which is without any doubt the most important and versatile command in all of SQL.

For the moment, consider that a SELECT statement, aka a query, can do all these things:

  • Read data from one or more tables
  • Apply functions and transformations to the data retrieved
  • Filter the data retrieved from each of these tables
  • Group the retrieved data on a set of columns (or transformed values), producing one row for each group, calculating functions on the grouped rows, and filtering on the results of those functions
  • Calculate for each row one or more functions based on groups of rows identified by a window expression and filter the results based on the results of these functions
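
As a quick taste before the dedicated sections, the following sketch runs on the ORDERS table from earlier: the first query groups rows and filters on an aggregate, and the second adds a window function; the thresholds are just for illustration:

SELECT CUSTOMER_CODE, SUM(TOTAL_AMOUNT) AS TOTAL_SPENT
FROM ORDERS
WHERE YEAR(ORDER_DATE) = 2021
GROUP BY CUSTOMER_CODE
HAVING SUM(TOTAL_AMOUNT) > 1000
ORDER BY TOTAL_SPENT DESC;

SELECT ORDER_ID, CUSTOMER_CODE, TOTAL_AMOUNT,
       SUM(TOTAL_AMOUNT) OVER (PARTITION BY CUSTOMER_CODE) AS CUSTOMER_TOTAL
FROM ORDERS;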

Important note

We have dedicated the Query syntax and operators section later in this chapter to analyzing the SELECT command in Snowflake in detail, as you will use the SELECT command in every dbt model.

Data Control Language – DCL

DCL contains the GRANT and REVOKE commands, which are used to manage the privileges and roles that control access to and use of database resources and objects.

Together with the DDL commands to create roles, users, and other database objects, the DCL commands are used to manage users and security:

  • GRANT: Assigns a privilege or a role to a role (or user)
  • REVOKE: Removes a privilege or a role from a role (or user)
  • SHOW GRANTS: Lists access privileges granted to a role or object
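
A small sketch reusing the sample role from earlier; the specific grant is just an example:

GRANT SELECT ON TABLE ORDERS TO ROLE DBT_SAMPLE_ROLE;     -- assign a privilege to a role
SHOW GRANTS TO ROLE DBT_SAMPLE_ROLE;                      -- list what the role has been granted
REVOKE SELECT ON TABLE ORDERS FROM ROLE DBT_SAMPLE_ROLE;  -- remove the privilege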

Transaction Control Language – TCL

The TCL commands are used to manage transactions in a database.

A transaction groups a set of SQL commands into a single execution unit and guarantees that either all the effects of all commands are applied, if the transaction completes with success, or no effect at all is applied if the transaction fails. This can also be described with the ACID acronym, which stands for atomic, consistent, isolated, and durable.

A transaction succeeds and ends only when all the commands it groups have finished with success; in any other case, the transaction fails, and its effects are rolled back like they never happened.

The TCL commands are as follows:

  • BEGIN: Starts a transaction in the current session
  • COMMIT: Commits an open transaction in the current session and makes the transaction effects visible to all sessions
  • ROLLBACK: Rolls back an open transaction in the current session and undoes all the effects applied since the BEGIN statement
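
A minimal sketch of a transaction on the ORDERS table; the specific changes are just for illustration:

BEGIN;                                                    -- start the transaction
UPDATE ORDERS SET CURRENCY = 'USD' WHERE ORDER_ID = 166;
DELETE FROM ORDERS WHERE ORDER_ID = 123;
COMMIT;  -- make both changes visible; ROLLBACK instead would undo both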

Now that we have covered the basic concepts and commands in SQL, it is time to set up a database to run them. The next section will provide you with access to a Snowflake DB.


Key benefits

  • Build a solid dbt base and learn data modeling and the modern data stack to become an analytics engineer
  • Build automated and reliable pipelines to deploy, test, run, and monitor ELTs with dbt Cloud
  • Guided dbt + Snowflake project to build a pattern-based architecture that delivers reliable datasets

Description

dbt Cloud helps professional analytics engineers automate the application of powerful and proven patterns to transform data from ingestion to delivery, enabling real DataOps. This book begins by introducing you to dbt and its role in the data stack, along with how it uses simple SQL to build your data platform, helping you and your team work better together. You’ll find out how to leverage data modeling, data quality, master data management, and more to build a simple-to-understand and future-proof solution. As you advance, you’ll explore the modern data stack, understand how data-related careers are changing, and see how dbt enables this transition into the emerging role of an analytics engineer. The chapters help you build a sample project using the free version of dbt Cloud, Snowflake, and GitHub to create a professional DevOps setup with continuous integration, automated deployment, ELT run, scheduling, and monitoring, solving practical cases you encounter in your daily work. By the end of this dbt book, you’ll be able to build an end-to-end pragmatic data platform by ingesting data exported from your source systems, coding the needed transformations, including master data and the desired business rules, and building well-formed dimensional models or wide tables that’ll enable you to build reports with the BI tool of your choice.

What you will learn

  • Create a dbt Cloud account and understand the ELT workflow
  • Combine Snowflake and dbt for building modern data engineering pipelines
  • Use SQL to transform raw data into usable data, and test its accuracy
  • Write dbt macros and use Jinja to apply software engineering principles
  • Test data and transformations to ensure reliability and data quality
  • Build a lightweight pragmatic data platform using proven patterns
  • Write easy-to-maintain idempotent code using dbt materialization


Product Details

Publication date: Jun 30, 2023
Length: 578 pages
Edition: 1st Edition
Language: English
ISBN-13: 9781803246284

Table of Contents

Preface
Part 1: The Foundations of Data Engineering
Chapter 1: The Basics of SQL to Transform Data
Chapter 2: Setting Up Your dbt Cloud Development Environment
Chapter 3: Data Modeling for Data Engineering
Chapter 4: Analytics Engineering as the New Core of Data Engineering
Chapter 5: Transforming Data with dbt
Part 2: Agile Data Engineering with dbt
Chapter 6: Writing Maintainable Code
Chapter 7: Working with Dimensional Data
Chapter 8: Delivering Consistency in Your Data
Chapter 9: Delivering Reliability in Your Data
Chapter 10: Agile Development
Chapter 11: Team Collaboration
Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms
Chapter 12: Deployment, Execution, and Documentation Automation
Chapter 13: Moving Beyond the Basics
Chapter 14: Enhancing Software Quality
Chapter 15: Patterns for Frequent Use Cases
Index
Other Books You May Enjoy

