This chapter provides a set of recipes that introduce you to data management in Snowflake. It covers common database concepts and introduces nuances specific to Snowflake. We'll look at common operations that are required to manage and structure data in a database. Snowflake is not very different from traditional databases and provides similar capabilities, but because it has been designed for the cloud from the ground up, it offers configuration options that control how data is managed in a database or a table and how temporary data is maintained and destroyed when no longer required. These capabilities are important when designing an ETL system or structuring data according to a data model.
The following recipes are given in this chapter:
Managing a database
Managing a schema
Managing tables
Managing external tables and stages
Managing views in Snowflake
The chapter assumes that you have a Snowflake account already set up. It also requires access to an Amazon S3 bucket, which is used to access external data from within Snowflake.
The code for this chapter can be found at the following GitHub URL:
https://github.com/PacktPublishing/Snowflake-Cookbook/tree/master/Chapter02
In this recipe, we will create a new database with default settings and walk through several variations on the database creation process. The recipe provides details such as how to minimize storage usage when creating databases and how to set up the replication of databases across regions and when to do so.
This recipe describes the various ways to create a new database in Snowflake. These steps can be run either in the Snowflake web UI or the SnowSQL command-line client.
Let's start with the creation of a database in Snowflake:
Let's create a database called our_first_database, assuming that it doesn't exist already:
CREATE DATABASE our_first_database COMMENT = 'Our first database';
The command should successfully execute with the following message:
Verify that the database has been created by running the following command:
SHOW DATABASES LIKE 'our_first_database';
The query should return one row showing information about the newly created database, such as the database name, owner, comments, and retention time. Notice that retention_time is set to 1 and the options column is blank:
Next, let's create a database with a custom data retention period of 15 days:
CREATE DATABASE production_database
DATA_RETENTION_TIME_IN_DAYS = 15
COMMENT = 'Critical production database';
SHOW DATABASES LIKE 'production_database';
The output of SHOW DATABASES should now show retention_time as 15, indicating that the time travel duration for the database is 15 days:
Now let's create a transient database with time travel disabled altogether:
CREATE TRANSIENT DATABASE temporary_database
DATA_RETENTION_TIME_IN_DAYS = 0
COMMENT = 'Temporary database for ETL processing';
SHOW DATABASES LIKE 'temporary_database';
The output of SHOW DATABASES would show retention_time as zero, indicating that there is no time travel storage for this database. The options column would show TRANSIENT, which essentially means that there will be no fail-safe storage for this database.
The data retention period can be changed after creation by using ALTER:
ALTER DATABASE temporary_database SET DATA_RETENTION_TIME_IN_DAYS = 1;
SHOW DATABASES LIKE 'temporary_database';
The basic CREATE DATABASE command creates a database with the defaults set at the account level. If you have not changed the defaults, the default for time travel is 1 day, which is the value that appears in retention_time when you run the SHOW DATABASES command. The database will also have fail-safe enabled automatically. Both of these options incur storage costs, and in certain cases, you might want to reduce them. For example, databases used for temporary ETL processing can easily be configured to avoid these costs.
A key thing to know about databases and tables used for ETL processing is that the data in those tables will be repeatedly inserted and deleted. If such tables are not specifically configured, you will unnecessarily incur time travel and fail-safe storage costs with every data change that happens on those tables. We therefore create such databases as transient (with the TRANSIENT keyword) so that fail-safe is not the default for the tables in that database. Setting this option does mean that such databases are not protected by fail-safe if a data loss event occurs, but for temporary databases and tables, this should not be an issue. We have also set time travel to zero so that there is no time travel storage either.
Do note that although we have set the database to have no time travel, we can still set individual tables within the database to retain time travel by overriding the retention period at the table level; setting the option at the database level only changes the default for the objects created within that database. Fail-safe, however, cannot be re-enabled for individual tables, since tables created in a transient database are themselves transient by definition.
Note that there is also the ALTER DATABASE command, which can be used to change some of the properties after the database has been created. It is a powerful command that allows renaming a database, swapping a database with another database, and resetting custom properties back to their defaults.
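As a quick sketch of these variations (reusing the databases created in this recipe; the new name is illustrative):
-- Rename a database
ALTER DATABASE temporary_database RENAME TO scratch_database;
-- Swap a database with another database
ALTER DATABASE scratch_database SWAP WITH production_database;
-- Reset a custom property back to its default
ALTER DATABASE production_database UNSET DATA_RETENTION_TIME_IN_DAYS;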
It is important to note that creating a database sets the current database of the session to the newly created database, which means that any subsequent data definition language (DDL) commands, such as CREATE TABLE, would create objects under that new database. This is like running the USE DATABASE command.
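You can confirm this behavior with the CURRENT_DATABASE context function; a minimal illustration using the databases created earlier in this recipe:
SELECT CURRENT_DATABASE(); -- returns OUR_FIRST_DATABASE right after creating it
USE DATABASE production_database;
SELECT CURRENT_DATABASE(); -- now returns PRODUCTION_DATABASE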
We will cover time travel and fail-safes in much more detail in subsequent chapters. We will also cover in depth how to create databases from shares and databases that clone other databases.
In this recipe, you will be introduced to the concept of a schema and its uses. A schema lives within a database, and together the two define a namespace. There can be multiple schemas in a database, but a given schema belongs to a single database. Schemas help in grouping together tables and views that are logically related. We will see how a schema is created and used. Apart from user-created schemas, we will learn about schemas that are automatically available with a database, including the information schema provided by Snowflake.
The following examples can be run either via the Snowflake web UI or the SnowSQL command-line client.
Let's start with the creation of a user-defined schema in a user-defined database, followed by the listing of schemas:
First, create a database that will contain the schemas:
CREATE DATABASE testing_schema_creation;
You should see a message stating that the database was successfully created. Next, list the schemas available in the new database:
SHOW SCHEMAS IN DATABASE testing_schema_creation;
You should see two rows in the result set, corresponding to the two schemas that are automatically available with every new database, namely INFORMATION_SCHEMA and PUBLIC:
Now create a custom schema:
CREATE SCHEMA a_custom_schema COMMENT = 'A new custom schema';
The command should successfully execute with the following message:
Verify the creation of the schema by running the following command:
SHOW SCHEMAS LIKE 'a_custom_schema' IN DATABASE testing_schema_creation;
The query should return one row displaying information such as the schema name, the database name in which the schema resides, and the time travel retention duration in days:
Next, let's create a schema with the data retention period set to 0 days and the type of the schema set to transient. It is common to use these settings for data for which Snowflake data protection features are not required, such as temporary data:
CREATE TRANSIENT SCHEMA temporary_data
DATA_RETENTION_TIME_IN_DAYS = 0
COMMENT = 'Schema containing temporary data used by ETL processes';
The schema creation should succeed with the following message:
Verify the configuration of the new schema by using the SHOW SCHEMAS command as shown:
SHOW SCHEMAS LIKE 'temporary_data' IN DATABASE testing_schema_creation;
The output of SHOW SCHEMAS is shown here:
The output should show retention_time as zero or a blank value, indicating that there is no time travel storage for this schema. The options column should show TRANSIENT, which essentially means that there will be no fail-safe storage for this schema.
The standard CREATE SCHEMA command uses the defaults set at the database level. If you have not changed any of the configuration, the created schema will have fail-safe enabled by default. Time travel is also enabled automatically and set to 1 day by default. Both these features have storage costs associated with them, and in certain cases, you may choose to turn them off.
For schemas that will store temporary tables, such as tables used for ETL processing, a schema can be created as a transient schema, which means that there is no fail-safe storage associated with the tables created in the schema, and therefore it would cost less. Similarly, such schemas can also be set to have time travel set to zero to reduce costs further. By default, the time travel for transient schemas is 1 day.
Do note that although we have set the schema to have no time travel, we can still set individual tables within the schema to retain time travel by overriding the retention period at the table level; setting the option at the schema level only sets the default for tables created inside that schema. Fail-safe, however, cannot be re-enabled for individual tables, since tables created in a transient schema are themselves transient by definition.
It is important to note that creating a new schema sets the current schema of the session to the newly created schema. The implication of this behavior is that any subsequent DDL commands, such as CREATE TABLE, would create the table under that new schema. This is like issuing the USE SCHEMA command to change the current schema.
Every database in Snowflake will always have a public schema, which is automatically created upon database creation. Under every database, you will also find an additional schema called the information schema. The information schema implements the SQL-92 standard information schema and adds additional information specific to Snowflake. The purpose of the information schema is to act as a data dictionary containing metadata that you can query to find information such as all the tables in the system, all columns along with their data types, and more. You can add many additional schemas under a given database to help organize your tables in a meaningful structure.
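For example, to list the tables in the database along with their types, you can query the information schema directly; a minimal sketch:
SELECT table_schema, table_name, table_type
FROM testing_schema_creation.INFORMATION_SCHEMA.TABLES
WHERE table_schema = 'PUBLIC';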
A good example of using databases and schemas to organize your data and your environment is setting up separate production, testing, and development databases with identical schema structures. This approach is especially useful if your organization has a single Snowflake account that is used for development, testing, and production purposes at the same time. The approach is shown in the following diagram:
In this approach, a database is created for each environment, for example, the PROD database for production data, the DEV database for development data, and so on. Within each database, the schema structures are identical; for example, each database has an SRC schema, which contains the source data. The purpose of this approach is to segregate the various environments but keep the structures identical enough to facilitate the development, testing, and productionization of data.
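A minimal sketch of this layout in SQL (the database and schema names follow the diagram; extend with additional schemas as needed):
-- One database per environment
CREATE DATABASE dev;
CREATE DATABASE test;
CREATE DATABASE prod;
-- Identical schema structures within each environment, for example, the source data schema
CREATE SCHEMA dev.src;
CREATE SCHEMA test.src;
CREATE SCHEMA prod.src;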
This recipe shows you how to create a table and insert data to explain the different behaviors of stored data. Here you will be introduced to the different options that are available from a life cycle perspective, such as tables being permanent, temporary, transient, and so on. Most of the concepts are not new, so the focus is going to be on the specifics related to Snowflake. We will start with a simple example that creates a table. We shall insert some sample data into it and then try out different variations of creating tables in Snowflake.
The following examples can be run either via the Snowflake web UI or the SnowSQL command-line client.
Let's start by creating a table for storing customer data, beginning with the DDL statement that creates a table called CUSTOMERS in Snowflake:
CREATE TABLE customers (
  id INT NOT NULL,
  last_name VARCHAR(100),
  first_name VARCHAR(100),
  email VARCHAR(100),
  company VARCHAR(100),
  phone VARCHAR(100),
  address1 VARCHAR(150),
  address2 VARCHAR(150),
  city VARCHAR(100),
  state VARCHAR(100),
  postal_code VARCHAR(15),
  country VARCHAR(50)
);
The command should successfully execute, generating the following message:
View the structure of the table by running the DESCRIBE TABLE statement as shown:
DESCRIBE TABLE customers;
It should generate the following results in the Snowflake web UI:
Note that ADDRESS1 and ADDRESS2 were supposed to be stored as the STRING data type. This can be addressed by using the OR REPLACE clause along with CREATE TABLE, which will overwrite the existing CUSTOMERS table:
CREATE OR REPLACE TABLE customers (
  id INT NOT NULL,
  last_name VARCHAR(100),
  first_name VARCHAR(100),
  email VARCHAR(100),
  company VARCHAR(100),
  phone VARCHAR(100),
  address1 STRING,
  address2 STRING,
  city VARCHAR(100),
  state VARCHAR(100),
  postal_code VARCHAR(15),
  country VARCHAR(50)
);
Run the DESCRIBE TABLE statement again as shown:
DESCRIBE TABLE customers;
It should generate the following result in the Snowflake web UI:
Please note the data types. We shall discuss this in the How it works… section, along with how OR REPLACE works and how it is shorthand for a two-step process that is typically required for managing the life cycle of a table.
Next, load sample data into the table from a file staged in S3:
COPY INTO customers
FROM 's3://snowflake-cookbook/Chapter02/r3/customer.csv'
FILE_FORMAT = (TYPE = csv SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');
This should complete with the following message:
A table can also be created and populated in a single statement by using CTAS, or CREATE TABLE ... AS SELECT. To explain the use of CTAS, we have some records inserted into the CUSTOMERS table. The dataset has 100 rows, as shown in the following screenshot (showing the first six columns of the table). Let's create a deep copy of the CUSTOMERS table. The statement to be executed is as follows:
CREATE OR REPLACE TABLE customers_deep_copy
AS
SELECT *
FROM customers;
This should succeed with the following message:
What we have now created is a deep copy of the CUSTOMERS table: a new table called CUSTOMERS_DEEP_COPY that has received all the data from CUSTOMERS. A deep copy means that both the table structure and the data have been copied, as opposed to a shallow copy, which copies the table structure only. This copied table is now an independent copy that can have a life cycle of its own, with no changes or side effects originating from any operations performed on the CUSTOMERS table. A SELECT query on this new table would generate the same results as shown in the previous table.
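To convince yourself that the two tables are now independent, you can modify the copy and compare row counts; a minimal sketch (recreate the copy afterward if you want to follow the rest of the recipe exactly):
-- Deleting rows from the copy leaves the original untouched
DELETE FROM customers_deep_copy WHERE id <= 10;
SELECT (SELECT COUNT(*) FROM customers) AS original_count,
       (SELECT COUNT(*) FROM customers_deep_copy) AS copy_count;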
A shallow copy can be created by using the LIKE keyword as provided by Snowflake:
CREATE OR REPLACE TABLE customers_shallow_copy
LIKE customers;
This should succeed with the following message:
This creates a new table that has the same structure as the CUSTOMERS table but with no data in it:
SELECT COUNT(*) FROM customers_shallow_copy;
The count query returns 0, as shown:
Tables can also be created as temporary or transient tables:
CREATE TEMPORARY TABLE customers_temp
AS
SELECT * FROM customers
WHERE TRY_TO_NUMBER(postal_code) IS NOT NULL;
CREATE TRANSIENT TABLE customers_trans
AS
SELECT * FROM customers
WHERE TRY_TO_NUMBER(postal_code) IS NULL;
The preceding statements create two tables, customers_temp and customers_trans. Neither table is permanent, and each has its limitations. If you end the web UI session at this point, the customers_temp table will not be recoverable after a re-login. Transient tables, on the other hand, are available after a session has been closed and will retain their data in a subsequent session created by user login; however, they don't consume fail-safe storage. This is an important mechanism for retaining data across sessions and can have applications in scenarios that require state management or in ETL jobs.
The OR REPLACE clause is actually shorthand for a two-step process that is required when a table has to be deleted (dropped) and then recreated. That would typically be done by executing two statements in sequence, as follows:
DROP TABLE IF EXISTS customers;
CREATE TABLE customers …
The deep and shallow copies can be explained by the following query and the result generated by Snowflake:
show tables like 'customers%';
This shows a table with three rows, each summarizing one of the three tables we created previously and highlighting their differences and similarities. The output shows that the deep copy contains exactly the same data as the original, while the shallow copy holds no data, though its metadata is the same:
A thing to note in the preceding table is the kind column. It shows that the tables created have the kind attribute set to TABLE, which is the default type of table: a permanent table structure to be populated with data rows. (Please note that a select set of columns is being shown here for the sake of clarity.)
A local temporary table (also known as a volatile table) persists for the duration of the user session in which it was created and is not visible to other users. A temporary table's definition and contents are dropped at the end of the user session.
Transient tables are non-permanent tables, but unlike temporary tables, transient tables exist until explicitly dropped and are visible to any user with the appropriate privileges. Transient tables have a lower level of data protection than permanent tables. Data in a transient table may be lost in the event of a system failure. Transient tables should only be used for data that can be recreated in the event that the data is lost.
A SHOW TABLES command for the customers_temp and customers_trans tables will show output similar to the following table (please note that a limited set of columns is shown here for the sake of clarity):
Please note the content of the kind column; it shows that the tables are not permanent.
One aspect of CREATE TABLE statements, CLONE, has been left for discussion in Chapter 8, Back to the Future with Time Travel. Temporary tables, however, cannot be cloned. Only permanent tables are fail-safe. Fail-safe will be discussed further in later chapters.
An important aspect of ETL applications is managing the loading of data. This recipe introduces you to managing incoming data by creating a stage and querying that data for loading into native Snowflake tables. The process is very different from traditional data warehouses, as it mixes in concepts from modern big data systems. Details around ETL are not covered here but are deferred to later chapters, which explain how an ETL pipeline can be managed.
The following example requires SnowSQL to run the different steps. Apart from that, you will need to have access to an AWS S3 bucket where data can be placed and made available as files.
The following steps start with the creation of a stage, which is used to temporarily store data before it can be copied into Snowflake:
Create a stage called sfuser_ext_stage. The stage should be accessible to Snowflake:
CREATE OR REPLACE STAGE sfuser_ext_stage
URL='s3://snowflake-cookbook/Chapter02/r4/';
The response should say that a stage has been created successfully, as shown:
Now list the files in the SFUSER_EXT_STAGE stage pointing to the snowflake-cookbook S3 bucket:
LIST @SFUSER_EXT_STAGE;
This statement should generate the following output in the web UI:
We can see that there are two types of files in the preceding listing: csv and parquet. In the case of csv, the electronic-card-transactions-may-2020-headless.csv file is a header-less version of the electronic-card-transactions-may-2020.csv file. There is a parquet format file as well, called userdata1.parquet. We shall create external tables on both files. An external table is different from the usual database tables because, unlike tables that point to data inside a database, external tables provide a view on top of files stored in a stage. These are read-only tables that maintain metadata that is helpful in interpreting the contents of a file, which could be formatted as parquet, csv, and so on.
Let's see how the parquet file can be loaded into an external table. We shall create an external table called ext_tbl_userdata1. The creation of the table requires a location from which data can be read into the table, as well as a file format. In this case, the file type is parquet:
create or replace external table ext_tbl_userdata1
with location = @sfuser_ext_stage
file_format = (type = parquet);
Query the newly created external table:
select * from ext_tbl_userdata1;
The following screenshot is only showing some of the key-value pairs due to size constraints:
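Incidentally, external tables also expose the METADATA$FILENAME pseudocolumn, which records the staged file that each row was read from; a minimal sketch:
select metadata$filename, value
from ext_tbl_userdata1
limit 5;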
Similarly, an ext_card_data table is created with its location pointing to the stage. In this case, the file is located in a csv subfolder under the stage location, so appending /csv to the stage reference gets us to the folder where the file resides. file_format, in this case, provides the information that the file is a CSV, and finally, a pattern is provided to constrain the match to CSVs with headless in their names:
create or replace external table ext_card_data
with location = @sfuser_ext_stage/csv
file_format = (type = csv)
pattern = '.*headless[.]csv';
Query the external table:
select * from ext_card_data;
The following screenshot of the resulting records shows some of the rows generated by the query. Please note the difference here: there are no meaningful column names in this case, unlike in the previous case of the Parquet file:
The following query selects the columns c3 and c2, creating the card_sum and period aliases, respectively. Moreover, to use the columns effectively, a cast has been applied to each column value:
select top 5
  value:c3::float as card_sum,
  value:c2::string as period
from ext_card_data;
The result of the query is shown:
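To avoid repeating these casts in every query, one option is to wrap them in a view over the external table; a minimal sketch (the view name is illustrative):
create or replace view card_data_vw as
select value:c2::string as period,
       value:c3::float as card_sum
from ext_card_data;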
Finally, clean up by dropping the two external tables:
drop table ext_card_data;
drop table ext_tbl_userdata1;
The stage created in step 1 can be thought of as a reference to a storage location. It is treated as a read-only location that can only be accessed using the appropriate access rights. The S3 bucket in step 1 is a public bucket and does not need credentials to be accessed. In later chapters, when we look at staging in more detail, we shall start delving into securing the staging locations.
Step 2 is dereferencing the specified S3 bucket to list all the files that are available along with the last modified dates.
In step 3, when the Parquet file is loaded into the external table, the table rows have all the field names captured, as shown in step 4. But in step 5, when the CSV-formatted file is loaded into the external table, there are dummy column names created by Snowflake, as can be seen in step 6. This is because a Parquet file has metadata stored inside the file, while a CSV file does not have that metadata embedded in it. This is a major difference and would usually require additional steps as shown in step 7 to generate meaningful column names, plus the casting of data types.
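When meaningful names are needed for a CSV-backed external table, the column expressions can also be defined at creation time instead; a hedged sketch (the table name, column names, and types are assumptions based on step 7):
create or replace external table ext_card_data_named (
  period   string as (value:c2::string),
  card_sum float  as (value:c3::float)
)
with location = @sfuser_ext_stage/csv
file_format = (type = csv)
pattern = '.*headless[.]csv';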
In this recipe, we did not look at how data can be loaded into a stage. It is possible to load data into a stage from a local filesystem; this method will be discussed in later chapters.
Looking at the metadata limitations of external tables, it can be argued that Avro is the best format for staging data, as Avro files can specify field names and data types as well. We shall look at an example in a later chapter, where we discuss ETL processing.
Please note that the last modified date for each file in a stage can be a useful mechanism for triggering updates to the data in an external table and can be used for running the ETL process.
This recipe will introduce you to different variations of views that are specific to Snowflake and in what scenario a variant of a view should be used. The recipe will cover simple views and materialized views and will provide guidance on when to use what type of view.
The following examples can be run either via the Snowflake web UI or the SnowSQL command-line client. Please make sure that you have access to the SNOWFLAKE_SAMPLE_DATA database in your Snowflake instance. The SNOWFLAKE_SAMPLE_DATA database is shared by Snowflake automatically and provides sample data for testing and benchmarking purposes.
Let's start with the creation of views in Snowflake. We shall look into the creation of simple views on tables and then talk about materialized views:
The SNOWFLAKE_SAMPLE_DATA database contains a number of schemas. We will be making use of the schema called TPCH_SF1000. Within this schema, there are multiple tables, and our view will make use of the LINEITEM table to produce an output that shows order quantities and prices against order dates. Before we create our first view, let's create a database in which the views will live:
CREATE DATABASE test_view_creation;
Next, create a view called date_wise_orders that, as the name suggests, aggregates orders by date:
CREATE VIEW test_view_creation.public.date_wise_orders
AS
SELECT L_COMMITDATE AS ORDER_DATE,
       SUM(L_QUANTITY) AS TOT_QTY,
       SUM(L_EXTENDEDPRICE) AS TOT_PRICE
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.LINEITEM
GROUP BY L_COMMITDATE;
The view is successfully created with the following message:
Query the view as you would query a table:
SELECT * FROM test_view_creation.public.date_wise_orders;
The view will take some time (2-3 minutes) to execute as there is a large amount of data in the underlying tables. This latency in execution can be managed by opting for a larger warehouse. An extra-small warehouse has been used in this case. After some time, you should see the result set returned (as shown in the following screenshot), which will be approximately 2,500 rows:
Now let's create the same aggregation as a materialized view:
CREATE MATERIALIZED VIEW test_view_creation.public.date_wise_orders_fast
AS
SELECT L_COMMITDATE AS ORDER_DATE,
       SUM(L_QUANTITY) AS TOT_QTY,
       SUM(L_EXTENDEDPRICE) AS TOT_PRICE
FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1000.LINEITEM
GROUP BY L_COMMITDATE;
The first thing that you will notice when creating the materialized view is that it will not be immediate:
It will take a fair bit of time to create the view, as opposed to the immediate creation that we saw in step 2, mainly because materialized views store data, unlike normal views, which just store the query definition and fetch data on the fly when the view is referenced.
Now query the materialized view:
SELECT * FROM test_view_creation.public.date_wise_orders_fast;
The results are returned almost immediately as we are selecting from a materialized view, which performs much better than a simple view.
A standard view in Snowflake is a way to treat the result of a query as if it were a table. The query itself is part of the view definition. When data is selected from a standard view, the query in the view definition is executed and the results are presented back as a table to the user. Since the view appears as a table, it can be joined with other tables as well and used in queries in most places where tables can be used. Views are a powerful method to abstract complex logic from the users of data; that is, a reusable query with complex logic can be created as a view. As such, this takes the burden off the end users to know the logic. Views can also be used to provide access control on data, so for various departments in an organization, different views can be created, each of which provides a subset of the data.
Since a standard view executes its definition at runtime, it can take some time to execute. If a complex query is commonly used, it can be created as a materialized view instead. A materialized view looks similar to a standard view, but it doesn't run the query in its definition at runtime. Rather, when a materialized view is created, it runs the query right away and stores the results. The advantage is that when the materialized view is queried, it does not need to execute the query but can retrieve the stored results immediately, providing a performance boost. A materialized view will, however, incur additional maintenance and storage costs, since every time the underlying table changes, the view recalculates the results and updates the stored data.
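To see which materialized views exist in a database, along with how far their stored results lag behind the base table (the behind_by column), you can use the SHOW command; a minimal sketch:
SHOW MATERIALIZED VIEWS IN DATABASE test_view_creation;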
In addition to standard views and materialized views, Snowflake also provides the concepts of secure views and recursive views. We will explore the application of secure views in Chapter 5, Data Protection and Security in Snowflake.