Reader small image

You're reading from  Data Modeling with Snowflake

Product typeBook
Published inMay 2023
PublisherPackt
ISBN-139781837634453
Edition1st Edition
Right arrow
Author (1)
Serge Gershkovich
Serge Gershkovich
author image
Serge Gershkovich

Serge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero. Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.
Read more about Serge Gershkovich

Right arrow

Modeling Semi-Structured Data

So far, this book has focused on modeling structured data, the kind used in relational databases since the early 70s. However, with the rise of the internet, a different style of data became prevalent: semi-structured. Semi-structured data, such as website traffic and social media feeds, contain some organizational structure but do not conform to the formal structure of a relational database.

New file formats also emerged to support this new type of data, starting with the advent of Extensible Markup Language (XML) in the early 2000s, followed by JavaScript Object Notation (JSON), and, with the rise of distributed computing, formats such as Avro, ORC, and Parquet. These formats offered a lightweight and flexible way to structure data, making them ideal for web-based and mobile app data.

The popularity of semi-structured data can be attributed to its flexibility, adaptability, and ability to handle data sources that do not fit neatly into traditional...

Technical requirements

The scripts used to instantiate and load the examples in this chapter are available in the following GitHub repo: https://github.com/PacktPublishing/Data-Modeling-with-Snowflake/tree/main/ch15. While key sections of this script will be highlighted in this chapter, please refer to the ch_15_semistruct.sql file for the complete code required to load, query, and transform semi-structured data, as it is too long to reprint here in full.

The benefits of semi-structured data in Snowflake

Semi-structured data formats are popular due to their flexibility when working with dynamically varying information. Unlike relational schemas, in which a precise entity structure must be known and fixed ahead of time, semi-structured data is free to include or omit attributes as needed, as long as they are properly nested within corresponding parent objects.

Think of the contact list on your phone. It contains a list of people and their contact details but does not capture those details uniformly. For example, some contacts may contain multiple phone numbers while others have one. Some entries contain information such as an email address and street address, while others have only a number and a vague description in lieu of a name (seriously, who is Green Vespa Laura Friend, and who is Laura?).

To handle this type of data, Snowflake uses the VARIANT data type, which allows semi-structured data to be stored as a column in a relational...

Getting hands-on with semi-structured data

Although we will query semi-structured JSON data as part of this exercise, its storage still conforms to modeling best practices such as naming and standard columns. In this example, we will use semi-structured data containing information about pirates – such as details about the crew, weapons, and their ship – all stored in a single VARIANT data type. With relational data, a row represents a single entity; in semi-structured data, a row is an entire file (although the file itself can contain single or countless entities). For this reason, metadata columns to mark individual loads and source filenames are stored alongside VARIANT.

Figure 15.1 – A table with ELT meta columns and VARIANT for storing semi-structured data

Figure 15.1 – A table with ELT meta columns and VARIANT for storing semi-structured data

This example uses AUTOINCREMENT (a.k.a. IDENTITY) as the default to generate a sequential unique ID for each load/record inserted.

In a real-world scenario, semi-structured...

Schema-on-read != schema-no-need

With the rising popularity of semi-structured data, schema-on-read also entered the lexicon of big data. Schema-on-read is the idea that, unlike in relational modeling, the schema definition for semi-structured data can be delayed until long after the data has been loaded into the data platform. Delaying this task means there are no bottlenecks within the ETL process for generating and ingesting semi-structured data. However, implicit in the design is that a knowable schema exists underneath the flexible semi-structured form.

In this section, we will learn how to query JSON data and infer details about its contents using SQL and Snowflake-native functions. Let’s begin by extracting some basic attributes for our pirate:

SELECT * FROM pirate_json;

Although we can query a table containing semi-structured data in a VARIANT column, a simple SELECT * statement does not return meaningful results, as you can see in the following figure:

...

Converting semi-structured data into relational data

As we saw in the previous exercise, semi-structured data is flexible and can accommodate any amount of densely or sparsely nested elements. However, in nested objects, it can be inferred that lower-level elements are attributes of their immediate parents.

Observe the following simplified example of semi-structured data with three levels of nesting and use the indentation to count the depth.

Figure 15.5 – Observing the number of nested levels in a semi-structured object

Figure 15.5 – Observing the number of nested levels in a semi-structured object

Here, we can see that a (ship) type, of depth 2, is an attribute of ship, which is a level-one attribute of the root object. By this logic, if we follow the levels of a semi-structured object to its maximum depth N, those elements become attributes of an entity N-1. Then, N-1 entities become attributes of N-2, repeating recursively until arriving at the root.

In the current example, boarding axe and blunderbuss become instances...

Summary

With the rising popularity of web applications and IoT data, semi-structured data has gained prominence for its flexibility in creating and loading dynamically changing objects without affecting ELT pipelines. Semi-structured formats, such as JSON, can handle any amount of variable nested data, which doesn’t need to conform to a pre-defined structure. Snowflake makes working with semi-structured formats easy thanks to its VARIANT data type – optimized for storage and analytical queries using easy-to-learn extensions to ANSI-standard SQL.

Querying a VARIANT data type provides the same performance as standard relational data types without needing to analyze the structure ahead of time – an approach known as schema-on-read. This means Snowflake users can work with semi-structured and relational data on the same platform using familiar SQL commands. However, although Snowflake gives users all the tools necessary for analyzing semi-structured data, schema...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Data Modeling with Snowflake
Published in: May 2023Publisher: PacktISBN-13: 9781837634453
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime

Author (1)

author image
Serge Gershkovich

Serge Gershkovich is a seasoned data architect with decades of experience designing and maintaining enterprise-scale data warehouse platforms and reporting solutions. He is a leading subject matter expert, speaker, content creator, and Snowflake Data Superhero. Serge earned a bachelor of science degree in information systems from the State University of New York (SUNY) Stony Brook. Throughout his career, Serge has worked in model-driven development from SAP BW/HANA to dashboard design to cost-effective cloud analytics with Snowflake. He currently serves as product success lead at SqlDBM, an online database modeling tool.
Read more about Serge Gershkovich