Unlocking the Power of Modeling
The word modeling has come to mean very different things in the half a century that it has been practiced in database systems. This opening chapter prefaces the book’s overall aim to demystify modeling, along with its applications, methodologies, and benefits. Throughout this journey, the concept of modeling will unfold into a set of methods and terms that help organizations design and manage data and, more importantly, help them understand themselves.
In its broadest interpretation, modeling is a selective simplification that aids in navigating or designing something more complex. Any system can be broken down into smaller, more manageable pieces. Manipulating any piece individually may be straightforward, but doing so without regard to an overall strategy is a tenuous proposal that is sure to encumber scalability and maintenance down the line.
While modeling is generally considered database-agnostic, modern cloud data platforms, such as Snowflake, present their users with many unique features thanks to their innovative architecture and consumption-based pricing. A clear and forward-looking design that takes advantage of the native features of the platform that supports it is the key to building cost-effective solutions capable of meeting and anticipating business needs.
As the analytical requirements of a data-driven organization are notoriously complex and constantly evolving, modeling must keep pace and accompany data teams from idea to execution. To achieve this, modeling must go beyond the structure and relationships of database tables and embrace the transformational logic that moves and shapes the underlying data. Only by leaning into the specifics of Snowflake features and architecture can a model be built efficiently from beginning to end.
In this chapter, we’re going to cover the following main topics:
- Recognizing the utility of models in our daily lives
- Getting a glimpse of modeling conventions in action
- Getting acquainted with the tools in the modeling toolkit
- Uncovering the benefits of modeling for enterprise teams
- Incorporating modeling into strategic planning
- Understanding modeling applications for transactional and analytical systems
This book focuses on data modeling specifically for the Snowflake Data Cloud. While modeling includes many system-agnostic terms and conventions, this book will leverage unique features of Snowflake architecture, data types, and functions when building physical models and Structured Query Language (SQL) transformations.
To follow along with the exercises in the following chapters, you will need a Snowflake account with access to a sandbox area for creating schemas, objects, and loading data.
You can sign up for a 30-day free trial of Snowflake (https://signup.snowflake.com/) if you do not already have access.
This book will frequently use visual modeling diagrams as part of the modeling process. While a diagram can be drawn by hand and constructed in PowerPoint or Lucidchart, a tool that supports common database modeling features is recommended. The exercises in this book will take the reader from conceptual database-agnostic diagrams to deployable and runnable Snowflake code. For this reason, a tool that supports various modeling types and can forward engineer Snowflake syntax is recommended.
The diagrams in this book were generated using the SqlDBM online database modeling tool (https://sqldbm.com/Home/), which supports the previously mentioned features and offers a 2-week free trial.
Modeling with purpose
Models are used to simplify complex systems. Take a modern city as an example, and you will see that it consists of intricately linked systems such as highways, electrical grids, and transit systems. While these systems operate in the same physical territory, they require very different models to help us understand them. For example, a subway system snakes and curves below a city’s varied terrain, but our model of it—a subway map—uses straight lines and places stations at nearly equidistant intervals. The subway map is not the city—it is a selective simplification of the city that makes it easier for passengers to visualize their journey. The transit map is a model so ubiquitous that it’s hard to imagine doing it any other way—yet it took time to evolve.
The subway map, as we know it today, was invented by Harry Beck in 1931 while re-designing the map used by the London Underground. The old design was confusing to riders because it focused on the wrong goal—geographical exactness. Here’s what it looked like before Beck:
Figure 1.1 – London tube map, before Beck (Legacy Tube map)
Thankfully, Beck was not a cartographer—he was an engineer. By sacrificing topographical detail, Beck’s design allowed passengers to quickly count the number of stops required for their journey while retaining their overall sense of direction. This story reminds us (quite literally) of the refrain, the map is not the territory.
As with maps, various kinds of modeling exist to help teams within an organization make sense of the many layers that make up its operational landscape. Also, like maps, models help organizations prepare for the journey ahead. But how does one use a model to navigate a database, let alone plan its future?
Leveraging the modeling toolkit
Before we continue, we need to formally delineate three distinct concepts often used together in the service of modeling to make it simpler to refer to a specific tool in the modeling toolkit in later sections. By understanding where each piece fits in the broader domain of database design and management, diving into deeper technical concepts later in the book will become more meaningful and easier to digest.
The three components are listed here:
- Natural language semantics—words
- Technical semantics—SQL
- Visual semantics—diagrams
Let’s discuss each of these in detail, as follows:
- Natural language semantics: Terminology employed in communicating details of a model between people. These are agreed-upon words that employ pre-defined conventions to encapsulate more complex concepts in simpler terms. For example, when both parties involved in a verbal exchange understand the concept of a surrogate key, it saves them from having to explain that it is a unique identifier for a table record that holds no intrinsic business meaning, such as an integer sequence or a hash value.
To ensure effective technical conversations, it helps to be fluent in the semantics of modeling. Not only does it save time by succinctly communicating a complex concept, but it also saves even more time by not miscommunicating it. A waiter would return different foods when ordering chips in London rather than in Los Angeles. A properly modeled database would never return different records for the same surrogate key.
- Technical semantics: SQL is a domain-specific language used to manage data in a Relational Database Management System (RDBMS). Unlike a general-purpose language (for example, YAML or Python), domain-specific languages have a much smaller application but offer much richer nuance and precision. While it can’t format a website or send an email, SQL allows us to create the structure of our database and manipulate its contents.
SQL bridges modeling concepts (expressed in words or images) and what is physically defined in the database. Snowflake uses an American National Standards Institute (ANSI)-compliant SQL syntax, meaning its basic commands (such as
WHERE) are compatible with other database vendors who use this standard. Snowflake also offers many extra functions, clauses, and conventions that go beyond ANSI-standard SQL and give users added flexibility to manage the database.
Unfortunately, due to its domain-specific nature, SQL presents a significant limitation: it can only express what the database explicitly understands. While SQL can define table structure and precisely manipulate data, it is too detailed to easily articulate the underlying business requirements.
- Visual semantics: Through their simplicity, images can convey a density of information that other forms of language simply cannot. In modeling, diagrams combine the domain-specific precision of SQL with the nuance of natural language. This gives diagrams a lot to work with to capture a data model’s business meaning and technical specifics.
To start, diagrams vary in the level of detail they present—giving the observer exactly what they’re looking for without overwhelming (or underwhelming) them with information. Most importantly, the semantic conventions used in diagrams are universal and can be understood by people besides data analysts and engineers. Yes—modeling diagrams are considered technical drawings; they represent strict technical concepts through agreed-upon visual conventions. However, in their simplest form, models can be understood almost intuitively with no prior knowledge. Even at the more advanced levels, such as logical and physical, learning to read a model is much simpler than learning SQL.
When all these semantics come together and are understood by the entire organization, they form a ubiquitous language, a concept first described by Eric Evans in Domain-Driven Design. Modeling then forms a part of the vocabulary that is understood universally throughout the organization to describe its business and store the data assets that support it. But that is just one of the many benefits that modeling provides.
The benefits of database modeling
For many people, database modeling brings to mind stale diagrams, arcane symbols, or extra work at the end of a project. Only a decade ago, fueled by the rise of distributed computing in the early 2000s—which popularized the concept of big data—the notion that modeling is dead gained notoriety. More precisely, it was thought that cheap and near-limitless computing power had made planning and designing a thing of the past. It was said that flexible semi-structured data formats and the ability to parse them on the fly—known as schema-on-read—had made modeling obsolete.
Eventually, operating and maintenance costs caught up with reality and revealed two great shortcomings of the schema-on-read approach. One is that no matter how data is structured, it must be functionally bound to the business that it helps support. In other words, semi-structured formats are neither a panacea nor an excuse to forgo the process of business validation. The second—and most important—is that a model is not simply the shape that data takes once uploaded to a database, but rather, the blueprint for business operations, without which it is impossible to build sustainable architectures.
Sustainable solutions require a long-term strategy to ensure their design matches the underlying business model. Without this, schema-on-read (discussed in Chapter 15, Modeling Semi-Structured Data), star schema (discussed in Chapter 17, Scaling Data Models through Modern Techniques), or any other schema are narrow-sighted tactics that lead nowhere. But done right, modeling makes developing database architectures more agile and helps the project evolve from the idea phase to implementation. At every stage of development, the model serves as a guide for supporting the conversations necessary to propel the design into the next phase and provide additional business context. Once implemented, the model becomes a living document that helps users understand, navigate, and evolve the system it helped create.
While every organization models in the technical sense—creating tables and transforming data—not everyone models strategically, end to end, in the broad sense of the word—thereby foregoing the long-term benefits. Some of these benefits include the following:
- Consensus and visibility of the broader business model
- More productive conversations with business teams
- Better quality of requirements
- Higher signal, lower noise in technical conversations
- Cross-platform, cross-domain, and widely understood conventions
- Big-picture visual overview of the business and its database footprint
- Preliminary designs become implementation blueprints
- Accelerating onboarding of new team members
- Making data more accessible and unlocking self-service within organizations
- Keeping the database landscape manageable at scale
- Getting a handle on complex data pipelines
To demonstrate the difficulties of working without formal modeling, we can take a simple schema based on Snowflake’s shared TPC-H dataset (available in the shared database called
SNOWFLAKE_SAMPLE_DATA), which, at first glance, looks like this:
Figure 1.2 – A list of tables in the Snowsight UI
While these tables have been modeled in the strict sense of the word and even contain data, we get very little information on what that data represents, how it relates to data in other tables, or where it fits in the broad context of business operations.
Intuition suggests that
CUSTOMER share a relationship, but this assertion needs to be tested. Even in this trivial example of only eight tables, it will take considerable time to thoroughly sift through the data to understand its context.
The irony is that many of the details we’re looking for are already baked into the design of the physical tables, having been modeled at some point in the past. We just can’t see them. Without a map, the terrain is lost from view.
Figure 1.3 – A conceptual model using crow’s foot notation
At a glance, the big picture comes into focus. Diagrams such as this one allow us to understand the business concepts behind the data and ensure they are aligned. Having a visual model also lets us zoom out from individual tables and understand the semantics of our business: What are the individual pieces involved and how do they interact? This global perspective gives everyone in the organization a means of finding and making sense of data assets without requiring a technical background—thus, business analysts or new hires can unlock the value of the information without any help from the data team.
As the organization grows, expanding in personnel and data assets, it will inevitably become too big for any person, or even a team of people, to coordinate. Here, organizations that have embraced data modeling will stand out from those that did not. Modeling can be the thing that helps organizations scale their data landscape, or it can be the technical debt that holds them back.
Yet, for all its benefits, modeling is not a cookie-cutter solution that guarantees success. There are many approaches to modeling and various modeling methodologies that are suited for different workloads. Throughout this book, we will tackle the fundamentals of modeling that will allow you to understand these differences and apply the best solution using a first-principles approach. First, we will begin by breaking down the two main database use cases and observing the role modeling plays in each of them.
Operational and analytical modeling scenarios
The relational database as we know it today emerged in the 1970s—allowing organizations to store their data in a centralized repository instead of on individual tapes. Later that decade, Online Transaction Processing (OLTP) emerged, enabling faster access to data and unlocking new uses for databases such as booking and bank teller systems. This was a paradigm shift for databases, which evolved from data archives to operational systems.
Due to limited resources, data analysis could not be performed on the same database that ran operational processes. The need to analyze operational data gave rise, in the 1980s, to Management Information Systems (MIS), or Decision Support Systems (DSS) as they later became known. Data would be extracted from the operational database to the DSS, where it could be analyzed according to business needs. OLTP architecture is not best suited for the latter case, so Online Analytical Processing (OLAP) emerged to enable users to analyze multidimensional data from multiple perspectives using complex queries. This is the same paradigm used today by modern data platforms such as Snowflake.
The approach to storing and managing data in OLAP systems fundamentally differs from the operational or transactional database. Data in OLAP systems is generally stored in data warehouses (also known as DWs or DWHs)—centralized repositories that store structured data from various sources for the purpose of analysis and decision-making. While the transactional system keeps the up-to-date version of the truth and is generally concerned with individual records, the data warehouse snapshots many historical versions and aggregates volumes of data to satisfy various analytical needs.
Data originates in the transactional database when daily business operations (for example, bookings, sales, withdrawals) are recorded. In contrast, the warehouse does not create but rather loads extracted information from one or various source systems. The functional differences between transactional databases and warehouses present different modeling challenges.
A transactional system must be modeled to fit the nature of the data it is expected to process. This means knowing the format, relationships, and attributes required for a transaction.
The main concern of a transactional database model is the structure and relationships between its tables.
By contrast, the data warehouse loads existing data from the source system. A data warehouse isn’t concerned with defining a single transaction but with analyzing multitudes of transactions across various dimensions to answer business questions. To do this, a data warehouse must transform the source data to satisfy multiple business analyses, which often means creating copies with varying granularity and detail.
Modeling in a data warehouse builds upon the relational models of its source systems by conforming common elements and transforming the data using logic.
Wait—if transformational logic is a core concept in data warehouse modeling, why is it so consistently absent from modeling discussions? Because in order to do transformational modeling justice, one must forgo the universality of general modeling principles and venture into the realm of platform specifics (that is, syntax, storage, and memory utilization). This book, in contrast, will embrace Snowflake specifics and go beyond physical modeling by diving into the transformation logic behind the physical tables. This approach provides a fuller understanding of the underlying modeling concepts and equips the reader with the required SQL recipes to not only build models but to load and automate them in the most efficient way possible. As we’ll see in later chapters, this is where Snowflake truly shines and confers performance and cost-saving benefits.
Is Snowflake limited to OLAP?
Snowflake’s primary use case is that of a data warehouse—with all the OLAP properties to enable multidimensional analysis at scale over massive datasets. However, at the 2022 Snowflake Summit, the company announced a new table type called Hybrid Unistore, which features both an OLTP-storage table and an OLAP analysis table under one semantic object. This announcement means Snowflake users can now design transactional OLTP database schemas while leveraging the analytical performance that Snowflake is known for. Hybrid Unistore tables are discussed in more detail in later chapters.
Although OLAP and OLTP systems are optimized for different kinds of database operations, they are still databases at heart and operate on the same set of objects (such as tables, constraints, and views) using SQL. However, each use case requires very different approaches to modeling the data within. The following section demonstrates what modeling will typically look like in each scenario.
A look at relational and transformational modeling
The previous section describes how modeling varies between operational and data warehouse scenarios. Before exploring the modeling process in detail, it’s helpful to understand the look and feel of relational and transformational modeling and what we’re working toward. Before proceeding, it would help to summarize the main differences between transactional databases and data warehouses. You can see what these are in the following table:
Supports daily operations
Provides operational insight
Operates on single records
Summarizes many records
Accurate as of the present instant
Historical snapshots over time
Single source of truth (SSOT), non-redundant
Redundant to support different analyses
Data models defined by business operations
Data models generated by business questions
Static and structured data model
Inherited structure and dynamically transformed data model
Multiple sources of converging data
Figure 1.4 – Common differences between transactional databases and warehouses
Given these differences, the following sections demonstrate what modeling looks like in each system and what it aims to achieve.
What modeling looks like in operational systems
Completely ignoring the modeling workflow that got us here, which will be covered in later chapters, we can observe an example of the type of modeling most commonly seen in transactional systems. The physical diagram in Figure 1.5 serves both as a blueprint for declaring the required tables and a guide to understanding their business context.
Following modeling conventions (don’t worry if they are still unfamiliar—they will be covered thoroughly in the coming chapters), we can infer a lot of information from this simple diagram. For example, a person is uniquely identified by an eight-digit identifier (the primary key) and must have a Social Security number (SSN), driver’s license, name, and birth date.
The one-to-many relationship between the two tables establishes that while a person does not necessarily need to have an account created, an account must belong to just one person:
Figure 1.5 – A physical model using crow’s foot notation
These details, combined with the list of attributes, data types, and constraints, not only dictate what kinds of data can be written to these tables but also provide an idea of how the business operates. So, how does this differ in analytical databases?
What modeling looks like in analytical systems
In a data warehouse scenario, the
ACCOUNT tables would not be defined from scratch—they would be extracted from the source in which they exist and loaded—bringing both structure and data into the process. Then, the analytical transformations begin in answer to the organization’s business questions. This is a process known as Extract Transform Load (ETL). (Although ELT has become the preferred processing order, the original term stuck.)
Suppose the management team wanted to analyze which age groups (by decade) were opening which account types and they wanted to store the result in a separate table for an independent analysis.
The following diagram shows the resulting relational model of an object obtained through transformational analysis but provides no business context:
Figure 1.6 – A relational model of a transformational requirement
Although physical modeling could describe such a table (as seen in Figure 1.6)—containing the account type with age and count of accounts as integers—such a model would fail to communicate the most relevant details, presented here:
- The logic used to perform the analysis
- The relationship between the source tables and the output
The business requirement for
ACCOUNT_TYPE_AGE_ANALYSIS in this example purposely excludes the source key fields from the target table, preventing the possibility of establishing any relational links. However, the relational model still serves a vital role: it tells us how the sources are related and how to join them correctly to produce the required analysis.
The logic could then be constructed by joining
ACCOUNT, as shown here:
CREATE TABLE account_types_age_analysis AS SELECT a.account_type, ROUND(DATEDIFF(years, p.birth_date, CURRENT_DATE()), -1 ) AS age_decade, COUNT(a.account_id) AS total_accounts FROM account AS a INNER JOIN person AS p ON a.person_id = p.person_id GROUP BY 1, 2;
Although there is no relational connection between
ACCOUNT_TYPE_AGE_ANALYSIS and its sources, there is still a clear dependency on them and their columns. Instead of using ERDs, which convey entities and relationships, transformational pipelines are visualized through a lineage diagram. This type of diagram gives a column-level mapping from source to target, including all intermediate steps, as shown here:
Figure 1.7 – Transformational modeling seen visually
Paired with the SQL logic used to construct it, the lineage graph gives a complete picture of the transformational relationship between sources and targets in an analytical/warehousing scenario.
Having witnessed both relational and analytical approaches to modeling, it is clear that both play a vital role in navigating the complex dynamic environments that one is liable to encounter in an enterprise-scale Snowflake environment.
Although we have only skimmed the surface of what modeling entails and the unique features of the Snowflake platform that can be leveraged to this end, this chapter has hopefully given you an idea of the vital role that modeling plays in building, maintaining, and documenting database systems. Before diving into the specifics of verbal, technical, and visual modeling semantics of modeling in the chapters to come, let’s review what we learned.
There is no escaping modeling. We use it in our everyday lives to plan and navigate the complexities of the world around us—databases are no exception. For some readers, the modeling styles presented in this chapter may be a new way of conceptualizing their database landscape, while others may be getting reacquainted with its notation and use cases. Whether thinking about a company business model or sharing a finalized design with team members, we all engage in modeling to varying degrees.
Embracing database modeling and learning to speak in a commonly understood language unlocks many time-saving and collaborative benefits for the entire organization. Thinking long-term and modeling strategically, as opposed to reacting tactically, aligns database designs to the business that they underpin, ensuring their viability. Having seen the advantages that modeling uncovers and where it can be implemented, we can begin to analyze its components to understand precisely where they should be used and how they form a natural design progression.
In the next chapter, we will explore the four modeling types used in database design and discuss where they excel and how they build on one another to help take an idea and evolve it into a technical system design while generating living project artifacts to navigate and maintain the final product.
Eric Evans’ book, mentioned earlier in this chapter, explores how to create effective models by going beyond the surface and getting to the intention of the system itself. It is a recommended read for those wishing to go deeper into the realm of effective communication through models, unrestricted by specific technical domains, methods, or conventions:
Evans, Eric. Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley Professional, 2004.
- Legacy Tube map Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Tube_map_1908-2.jpg. Accessed October 2, 2022.