Mastering Snowflake Objects

The previous chapter introduced the innovative architecture that powers the Data Cloud and unlocks possibilities unavailable in other databases. In this chapter, we will explore the database objects that Snowflake customers will use in their modeling journey. Objects such as tables and views will undoubtedly be familiar to most people who have previously worked with SQL. Still, even here, Snowflake's unique features unlock potential efficiencies in designing physical models, so users should be well acquainted with them.

Due to its variable-spend pricing, Snowflake data modeling requires users to understand the cost/performance trade-offs of its supported objects and their parameters. For those who have never worked with Snowflake, exclusive features such as streams and external tables may still be unfamiliar. But making good design decisions demands a thorough understanding of Snowflake...

Stages

Before data can make its way into tables, it must be loaded into Snowflake using a stage. Stages are logical objects that abstract cloud filesystems so they can be used in a standard manner to load data into Snowflake. Two types of stages can be defined: external and internal. External stages are created on top of a cloud location (in one of the supported cloud storage services) outside Snowflake and are used to load data from external source systems. Internal stages are created within the Snowflake account and therefore use the storage type of the hosting provider. Internal stages are used to stage files that originate from within the Snowflake account and cannot be used to load external data.

Supported cloud storage services for external stages are as follows:

  • Amazon S3 buckets
  • Google Cloud Storage buckets
  • Microsoft Azure containers
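
As a minimal sketch, here is how each stage type might be created (the stage names, bucket URL, and credentials are hypothetical placeholders):

-- External stage over an Amazon S3 bucket (hypothetical URL and credentials)
CREATE STAGE my_ext_stage
  URL = 's3://my-bucket/data/'
  CREDENTIALS = (AWS_KEY_ID = '<key_id>' AWS_SECRET_KEY = '<secret_key>');
-- Internal stage backed by the account's own storage
CREATE STAGE my_int_stage;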

The following figure shows the layout of external stages in relation to internal stage types.

...

File formats

File formats can be thought of as templates that describe a file's type and properties for reading or writing data. A file format defines properties such as the delimiter type (e.g., comma, tab, or other), date/time formats, and encoding (for example, Hex, Base64, and UTF-8).

Snowflake provides several structured and semi-structured file formats, such as CSV, JSON, and Parquet, with commonly used defaults. However, users are free to create their own named file formats when loading or unloading files of a specified type. File format properties may also be specified at runtime when using the COPY INTO command.
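
As a minimal sketch, a named file format might be created and then referenced during a load as follows (the format, table, and stage names are hypothetical):

-- Named file format for comma-delimited files with a header row
CREATE FILE FORMAT my_csv_format
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;
-- Reference the named format when loading from a stage
COPY INTO my_table
  FROM @my_stage
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');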

A description of supported file types and file formats for data loading is available in the Snowflake documentation: https://docs.snowflake.com/en/user-guide/data-load-prepare.html.

For a complete overview of data loading of all kinds, including streaming, see the guide on the Snowflake website: https://docs.snowflake.com/en/user-guide-data-load.html.

Having seen...

Tables

Data in Snowflake is stored in tables, which, as discussed, are one of the fundamental components of data modeling. However, before exploring them in a modeling context, we should understand the various table types that exist in Snowflake and their costs.

The previous chapter described Snowflake’s Time Travel, a feature that allows restoring dropped objects or querying data at a prior point in time. However, Time Travel comes with associated storage costs, and the number of available Time Travel days—known as the retention period—depends on the table type, as we’ll shortly review in detail.
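
For example, the retention period is set per table; the following sketch (with hypothetical names) contrasts a permanent table with a transient one, which supports at most one day of Time Travel:

-- Permanent table: retention of up to 90 days on Enterprise edition
CREATE TABLE sales_history (id INT, amount NUMBER)
  DATA_RETENTION_TIME_IN_DAYS = 30;
-- Transient table: retention of 0 or 1 day only
CREATE TRANSIENT TABLE sales_staging (id INT, amount NUMBER)
  DATA_RETENTION_TIME_IN_DAYS = 0;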

Snowflake also offers a managed type of Time Travel, known as Fail-safe. All permanent tables have a Fail-safe period of seven days. Unlike Time Travel, which the user can access, Fail-safe is managed by and accessible only to Snowflake to protect user data from disasters such as system failures and data breaches. To recover data stored in Fail-safe, users...

Snowflake views

Snowflake views behave as views do in most relational databases: they store a SELECT statement over physical objects as an object in a schema. Storing a SELECT statement as a shareable database object offers several advantages to users of the system (a short example follows the list):

  • The SELECT statement does not need to be written from scratch each time it is required, saving time and improving maintainability through reusable, modular code for data pipelines and analytics.
  • Filters and business logic remain consistent for everyone with access to the view.
  • Views are database objects separate from the data sources they reference and can therefore carry different permissions that do not expose all the underlying data.
  • Views can select data from multiple sources at once.
  • Views always show the latest data from source tables without having to refresh it (as opposed to materializing the results as a physical table).
  • Zero storage cost, as data is not physically copied but read in real time when...
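
For instance, a view can consolidate logic from multiple tables behind a single, permission-controlled object; a minimal sketch with hypothetical table and column names:

-- Reusable, access-controlled business logic over two sources
CREATE VIEW current_eu_orders AS
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE c.region = 'EU';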

Materialized views

Not exactly views, yet not quite tables, materialized views sit at the intersection of these two object types and offer interesting benefits with one important caveat.

As the name suggests, materialized views are actually physical tables that store the results of a view. When the underlying data changes, the result is re-materialized automatically. This means that materialized views combine the performance of a table and a cached query result with all the security benefits of a regular view.

The only trade-off is the cost. Since materialized views store query results in physical tables, they incur related storage costs. Additionally, materialized views will use credits when refreshing the data.
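
A materialized view is defined much like a regular view; here is a minimal sketch with hypothetical table and column names (note that materialized views require Enterprise edition or higher):

-- Pre-aggregated results that Snowflake keeps up to date automatically
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT sale_date, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date;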

Materialized views have certain limitations in the logic they permit in the SELECT statement compared to regular views. Chief among them is that materialized views can only be built over a single data source (stages are allowed), so they cannot use...

Streams

Streams are logical objects that capture data changes in underlying sources, including the previously mentioned objects (physical tables, views, and external and directory tables). Whenever a DML operation occurs in the source object, a stream tracks the changes (inserts, deletions, and the before/after images of updates). Streams achieve this through an offset storage technique: logically taking an initial snapshot of the data and then tracking changes through metadata columns. Although a stream can be queried like a table, it does not store a separate copy of the table data.

When a stream is created, metadata columns are appended to the source object and begin tracking changes. The following table describes the metadata fields and their contents:

Figure 4.4 – Stream metadata columns

The following command creates a stream on a table:

CREATE STREAM <stream_name> ON TABLE <table_name>

For every subsequent DML...
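
To illustrate, here is a hypothetical sketch of how a stream surfaces a change (the table and stream names are placeholders):

-- A DML operation on the source table...
INSERT INTO src_table VALUES ('A1', CURRENT_DATE);
-- ...appears in the stream with METADATA$ACTION = 'INSERT'
SELECT * FROM src_stream;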

Change tracking

While a stream is created as a standalone object in a schema, change tracking is enabled directly on tables, allowing Snowflake users to query CDC metadata. Change tracking uses the same metadata fields found in streams but appends them directly to a table. Unlike streams, the changes are not eliminated if they are used to update downstream objects; instead, the change tracking persists for the data retention time of the table.

Here is an example of how change tracking is enabled and queried for a table. In this example, three records are inserted into a change tracking-enabled table. Subsequently, one record is updated and another is deleted:

create or replace table myTable (
   myTable_id varchar(10),
   my_date date
);
-- Enable change tracking
alter table myTable set change_tracking = true;
-- Initialize session variables for the current
-- timestamp and date
set cDts  = (select current_timestamp());
set cDate = (select...
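
The tracked changes can then be queried with the CHANGES clause; a minimal sketch that reuses the cDts variable set above (the cast is included as a precaution):

-- Read all change rows recorded since the saved timestamp
SELECT * FROM myTable
  CHANGES (INFORMATION => DEFAULT)
  AT (TIMESTAMP => $cDts::timestamp_ltz);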

Tasks

Snowflake uses tasks to schedule and automate data loading and transformation. Although data movement is not tracked in relational modeling, it is an integral part of transformational modeling and is covered here for completeness.

Tasks automate data pipelines by executing SQL in serial or parallel steps. Tasks can be combined with streams for continuous ELT workflows to process recently changed table rows. This can be done serverlessly (using auto-scalable Snowflake-managed compute clusters that do not require an active warehouse) or using a dedicated user-defined warehouse.

The code for creating a task is as follows:

CREATE TASK <task_name>
...
[ AFTER <parent_task_1> [ , <parent_task_2> , ... ] ]
[ WHEN <boolean_expr> ]
AS <sql>

Tasks are simple to understand—they run a SQL command (or execute a stored procedure) on a schedule or when called as part of a parent task. The following figure shows how tasks can be chained serially...
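
To make this concrete, the following sketch chains two tasks and gates the root task on a stream; the warehouse, stream, table, column, and procedure names are all hypothetical:

-- Root task: checks every 5 minutes, runs only when the stream has data
CREATE TASK load_changes
  WAREHOUSE = transform_wh
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('src_stream')
AS
  INSERT INTO tgt_table (id, load_date)
  SELECT id, load_date FROM src_stream;

-- Child task: runs only after the root task completes
CREATE TASK refresh_aggregates
  WAREHOUSE = transform_wh
  AFTER load_changes
AS
  CALL refresh_agg_proc();

Note that tasks are created in a suspended state and must be resumed with ALTER TASK ... RESUME before they begin running.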

Summary

In this chapter, we demonstrated that Snowflake objects pack a lot of features, even familiar ones like tables and views. A table in Snowflake can store more than just the data: depending on its settings, it can also hold months of historical and disaster recovery backups and offer offset change tracking for CDC. Views, likewise, exceed expectations by providing change tracking and automated re-materialization.

We saw how stages mark the entry point for data to make its way from external sources to Snowflake tables. Stages also provide helpful features, such as external table access for reading file contents without copying them to a table beforehand.

Finally, to coordinate incoming data, establish automated ELT pipelines, and streamline CDC, Snowflake pairs tasks with streams to give its users full serverless or managed control—tying stages, tables, views, and all the connective transformational logic together.

Having understood the strengths and...

References

You can refer to the following resources for additional information:
