You're reading from Data Modeling with Snowflake

Product type Book

Published in May 2023

Publisher Packt

ISBN-13 9781837634453

Pages 324 pages

Edition 1st Edition

Languages

Concepts

Data Engineering

Author (1):

Serge Gershkovich

Table of Contents (24) Chapters

Preface

1. Part 1: Core Concepts in Data Modeling and Snowflake Architecture

2. Chapter 1: Unlocking the Power of Modeling

3. Chapter 2: An Introduction to the Four Modeling Types

4. Chapter 3: Mastering Snowflake’s Architecture

5. Chapter 4: Mastering Snowflake Objects

6. Chapter 5: Speaking Modeling through Snowflake Objects

7. Chapter 6: Seeing Snowflake’s Architecture through Modeling Notation

8. Part 2: Applied Modeling from Idea to Deployment

9. Chapter 7: Putting Conceptual Modeling into Practice

10. Chapter 8: Putting Logical Modeling into Practice

11. Chapter 9: Database Normalization

12. Chapter 10: Database Naming and Structure

13. Chapter 11: Putting Physical Modeling into Practice

14. Part 3: Solving Real-World Problems with Transformational Modeling

15. Chapter 12: Putting Transformational Modeling into Practice

16. Chapter 13: Modeling Slowly Changing Dimensions

17. Chapter 14: Modeling Facts for Rapid Analysis

18. Chapter 15: Modeling Semi-Structured Data

19. Chapter 16: Modeling Hierarchies

20. Chapter 17: Scaling Data Models through Modern Techniques

21. Index

Why subscribe?

22. Other Books You May Enjoy

Appendix

Modeling Semi-Structured Data

So far, this book has focused on modeling structured data, the kind used in relational databases since the early 70s. However, with the rise of the internet, a different style of data became prevalent: semi-structured. Semi-structured data, such as website traffic and social media feeds, contain some organizational structure but do not conform to the formal structure of a relational database.

New file formats also emerged to support this new type of data, starting with the advent of Extensible Markup Language (XML) in the early 2000s, followed by JavaScript Object Notation (JSON), and, with the rise of distributed computing, formats such as Avro, ORC, and Parquet. These formats offered a lightweight and flexible way to structure data, making them ideal for web-based and mobile app data.

The popularity of semi-structured data can be attributed to its flexibility, adaptability, and ability to handle data sources that do not fit neatly into traditional...

Technical requirements

The scripts used to instantiate and load the examples in this chapter are available in the following GitHub repo: https://github.com/PacktPublishing/Data-Modeling-with-Snowflake/tree/main/ch15. While key sections of this script will be highlighted in this chapter, please refer to the ch_15_semistruct.sql file for the complete code required to load, query, and transform semi-structured data, as it is too long to reprint here in full.

The benefits of semi-structured data in Snowflake

Semi-structured data formats are popular due to their flexibility when working with dynamically varying information. Unlike relational schemas, in which a precise entity structure must be known and fixed ahead of time, semi-structured data is free to include or omit attributes as needed, as long as they are properly nested within corresponding parent objects.

Think of the contact list on your phone. It contains a list of people and their contact details but does not capture those details uniformly. For example, some contacts may contain multiple phone numbers while others have one. Some entries contain information such as an email address and street address, while others have only a number and a vague description in lieu of a name (seriously, who is Green Vespa Laura Friend, and who is Laura?).

To handle this type of data, Snowflake uses the VARIANT data type, which allows semi-structured data to be stored as a column in a relational...

Getting hands-on with semi-structured data

Although we will query semi-structured JSON data as part of this exercise, its storage still conforms to modeling best practices such as naming and standard columns. In this example, we will use semi-structured data containing information about pirates – such as details about the crew, weapons, and their ship – all stored in a single VARIANT data type. With relational data, a row represents a single entity; in semi-structured data, a row is an entire file (although the file itself can contain single or countless entities). For this reason, metadata columns to mark individual loads and source filenames are stored alongside VARIANT.

Figure 15.1 – A table with ELT meta columns and VARIANT for storing semi-structured data

This example uses AUTOINCREMENT (a.k.a. IDENTITY) as the default to generate a sequential unique ID for each load/record inserted.

In a real-world scenario, semi-structured...

Schema-on-read != schema-no-need

With the rising popularity of semi-structured data, schema-on-read also entered the lexicon of big data. Schema-on-read is the idea that, unlike in relational modeling, the schema definition for semi-structured data can be delayed until long after the data has been loaded into the data platform. Delaying this task means there are no bottlenecks within the ETL process for generating and ingesting semi-structured data. However, implicit in the design is that a knowable schema exists underneath the flexible semi-structured form.

In this section, we will learn how to query JSON data and infer details about its contents using SQL and Snowflake-native functions. Let’s begin by extracting some basic attributes for our pirate:

SELECT * FROM pirate_json;

Although we can query a table containing semi-structured data in a VARIANT column, a simple SELECT * statement does not return meaningful results, as you can see in the following figure:

...

Converting semi-structured data into relational data

As we saw in the previous exercise, semi-structured data is flexible and can accommodate any amount of densely or sparsely nested elements. However, in nested objects, it can be inferred that lower-level elements are attributes of their immediate parents.

Observe the following simplified example of semi-structured data with three levels of nesting and use the indentation to count the depth.

Figure 15.5 – Observing the number of nested levels in a semi-structured object

Here, we can see that a (ship) type, of depth 2, is an attribute of ship, which is a level-one attribute of the root object. By this logic, if we follow the levels of a semi-structured object to its maximum depth N, those elements become attributes of an entity N-1. Then, N-1 entities become attributes of N-2, repeating recursively until arriving at the root.

In the current example, boarding axe and blunderbuss become instances...

Summary

With the rising popularity of web applications and IoT data, semi-structured data has gained prominence for its flexibility in creating and loading dynamically changing objects without affecting ELT pipelines. Semi-structured formats, such as JSON, can handle any amount of variable nested data, which doesn’t need to conform to a pre-defined structure. Snowflake makes working with semi-structured formats easy thanks to its VARIANT data type – optimized for storage and analytical queries using easy-to-learn extensions to ANSI-standard SQL.

Querying a VARIANT data type provides the same performance as standard relational data types without needing to analyze the structure ahead of time – an approach known as schema-on-read. This means Snowflake users can work with semi-structured and relational data on the same platform using familiar SQL commands. However, although Snowflake gives users all the tools necessary for analyzing semi-structured data, schema...