Chapter 4: Designing the Serving Layer

In this chapter, we will be learning about the techniques and technologies involved in designing a data serving layer. As we have seen previously in the data lake design, data passes through several zones. It starts from a Landing Zone, from where it gets processed into more useful formats in the Transformation Zone, and finally, the derived data insights land in the Serving Zone (also called the Serving layer). The Serving Zone serves the processed data and insights to the end users. The Landing Zone and Transformation Zone of a data lake focus on aspects such as efficient storage of data, processing huge volumes of data, optimizing queries, and so on. The Serving layer, on the other hand, focuses mainly on how to serve the data in a fast and efficient way to the business intelligence (BI) tools.

Serving layers are usually built using relational data stores (Structured Query Language (SQL)-based stores). This is done for two reasons: relational...

Technical requirements

For this chapter, you will need an Azure account (free or paid).

Learning the basics of data modeling and schemas

Data modeling is the process of designing how data will be represented in data stores. Many data modeling techniques were originally designed for databases and warehouses. Since Serving layers are usually built with relational data stores such as data warehouses, some of these data modeling techniques can be applied to Serving layer design too. But do remember that the Serving layer could be built using other storage technologies, such as document databases, key-value stores, and so on, based on customer requirements.

Unlike data lakes, databases and data warehouses don't give us the luxury of storing huge volumes of data in whatever format we like. Databases and data warehouses can perform queries exceptionally fast, provided the data is stored in predetermined formats and is limited in size. Hence, while designing the Serving layer, we need to identify the specifics of which data needs to be stored, which format to store...

Designing Star and Snowflake schemas

Schemas are guidelines for arranging data entities such as SQL tables in a data store. Designing a schema refers to the process of designing the various tables and the relationships among them. Star and Snowflake schemas are two of the most commonly used schemas in the data analytics and BI world. In fact, Star schemas are used more frequently than Snowflake schemas. Both have their own advantages and disadvantages, so let's explore them in detail.

Star schemas

A Star schema is the simplest of the data warehouse schemas. It has two sets of tables: one that stores quantitative information, such as transactions happening at a retail outlet or trips happening at a cab company, and another that stores the context or descriptions of the events recorded in the quantitative tables.

The quantitative tables are called fact tables and the descriptive or context tables are called dimension tables.
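For instance, a minimal sketch of a Star schema for a retail scenario could look like the following. The table and column names here are illustrative assumptions, not taken from this book:

CREATE TABLE FactSales (
    [salesId]   INT NOT NULL,
    [productId] INT NOT NULL,   -- refers to DimProduct (a dimension table)
    [storeId]   INT NOT NULL,   -- refers to a hypothetical DimStore table
    [quantity]  INT,            -- quantitative measures (the "facts")
    [amount]    DECIMAL(10, 2)
);

CREATE TABLE DimProduct (
    [productId] INT NOT NULL,   -- joins back to FactSales
    [name]      VARCHAR(100),   -- descriptive context for the facts
    [category]  VARCHAR(50)
);

Every dimension table joins directly to the fact table, which is what gives the schema its star shape.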

The following diagram shows an example of...

Designing SCDs

SCDs refer to data in dimension tables that changes slowly over time, rather than at a regular cadence. A common example is customer profile data: an email address or phone number doesn't change that often, which makes such fields perfect candidates for SCDs. In this section, we will look at how to design for such changes.

Services such as Azure SQL provide built-in support for SCDs, but in data warehouses such as Synapse dedicated pools, we will have to implement them ourselves.

Here are some of the main aspects we will need to consider while designing an SCD:

  • Should we keep track of the changes? If yes, how much of the history should we maintain?
  • Or, should we just overwrite the changes and ignore the history?

Based on our requirements for maintaining history, there are seven common ways in which we can keep track of changes. They are named SCD1, SCD2, SCD3, and so on, up to SCD7.

Among these...
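To make the most common variant concrete, here is a hedged sketch of an SCD Type 2 (SCD2) dimension, which preserves history by inserting a new row for every change; the table and column names are illustrative assumptions:

CREATE TABLE DimCustomer (
    [surrogateKey] INT NOT NULL,  -- new key generated for each version of a customer
    [customerId]   INT NOT NULL,  -- business key, stable across versions
    [email]        VARCHAR(100),
    [startDate]    DATE,          -- when this version became effective
    [endDate]      DATE,          -- NULL (or a sentinel date) for the current version
    [isCurrent]    BIT            -- convenience flag for querying the latest version
);

When a customer's email changes, the current row's endDate and isCurrent are updated, and a new row with a fresh surrogateKey is inserted.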

Designing a solution for temporal data

Temporal data refers to data at specific points in time. Storing temporal data is required in situations such as data auditing, forensic investigations, maintaining SCDs, point-in-time recovery, and so on. Azure SQL and SQL Server provide a mechanism called temporal tables to store temporal data.

Temporal tables are specialized tables that keep track of data changes over time. They track the history of data changes just as we saw with SCD tables, but in this case the system takes care of managing the validity period of each row, instead of us having to do it manually. Hence, these tables are also called system-versioned temporal tables.

Note

Temporal tables are a feature of Azure SQL Database and SQL Server. They were not available in Azure Synapse pools at the time of writing this book.

Let us look at an example of how to create a temporal table in Azure SQL:

CREATE TABLE Customer
(
  [customerId] INT NOT NULL PRIMARY...
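The statement above is cut off, but a complete system-versioned temporal table generally follows the shape below. This is a minimal sketch: the name column and the CustomerHistory history table name are assumptions, while the period columns and the SYSTEM_VERSIONING clause follow the standard Azure SQL syntax:

CREATE TABLE Customer
(
    [customerId] INT NOT NULL PRIMARY KEY CLUSTERED,
    [name]       VARCHAR(100),
    [validFrom]  DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    [validTo]    DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME ([validFrom], [validTo])
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));

With SYSTEM_VERSIONING turned on, the engine automatically stamps each row's validity period and moves superseded row versions into the history table.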

Designing a dimensional hierarchy

Dimensional hierarchy refers to the way we group and organize dimensional data at multiple levels. In a hierarchical structure, there are usually one-to-many or many-to-many parent-child relationships. Examples of hierarchical structures include organizational structures, product categories in an online store, a file system, and so on. The main characteristic of a hierarchical structure is that all the nodes are identical in shape, and each includes a reference to its parent or child nodes.

In order to achieve a dimensional hierarchy, we can use a technique called a self-referencing relationship (or self-join) within the dimension table. Let us take the example of an Employee dimension:

CREATE TABLE DimEmployee (
	[employeeId] VARCHAR(20) NOT NULL,
	[name] VARCHAR(100),
	[department] VARCHAR(50),
	[title] VARCHAR(50),
	[parentEmployeeId] VARCHAR(20) -- self-reference to the manager's employeeId
)

Here, we have a column called parentEmployeeID referencing the employeeID column....
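As a hedged illustration, a recursive common table expression (CTE) can walk this self-referencing relationship and materialize the hierarchy, assuming top-level employees carry a NULL parentEmployeeId:

WITH EmployeeHierarchy AS (
    -- Anchor: start from the top-level employees (no manager)
    SELECT [employeeId], [name], [parentEmployeeId], 1 AS [level]
    FROM DimEmployee
    WHERE [parentEmployeeId] IS NULL
    UNION ALL
    -- Recursive step: attach each employee one level below their manager
    SELECT e.[employeeId], e.[name], e.[parentEmployeeId], h.[level] + 1
    FROM DimEmployee e
    INNER JOIN EmployeeHierarchy h
        ON e.[parentEmployeeId] = h.[employeeId]
)
SELECT * FROM EmployeeHierarchy;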

Designing for incremental loading

Incremental loading, or delta loading, refers to the process of loading smaller increments of data into a storage solution. For example, we could have daily data being loaded into a data lake, or hourly data flowing into an extract, transform, load (ETL) pipeline, and so on. In data-ingestion scenarios, it is very common to do an initial bulk load followed by scheduled incremental loads; a typical incremental load query is sketched at the end of this section.

Azure has a very versatile service called Azure Data Factory (ADF) that can help with incremental loading. Since this is the first time we are using ADF in this book, let's learn a little more about it now, as the information will come in handy in future chapters.

ADF is a managed cloud service that can be used to coordinate and orchestrate complex cloud-based or hybrid (cloud plus on-premises) pipelines. ADF provides the ability to build both ETL and extract, load, transform (ELT) pipelines. With ADF, you can do the following:

  • Ingest data from a wide...
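To give a flavor of how incremental loads are commonly expressed, here is a hedged sketch of the watermark pattern frequently used with ADF copy pipelines; the table, column, and variable names are illustrative assumptions:

-- Fetch only the rows that changed since the previous pipeline run,
-- using a last-modified timestamp column as the watermark.
DECLARE @lastWatermark    DATETIME2 = '2022-01-01';      -- stored from the previous run
DECLARE @currentWatermark DATETIME2 = SYSUTCDATETIME();  -- captured at the start of this run

SELECT *
FROM SalesOrders
WHERE [lastModifiedTime] >  @lastWatermark
  AND [lastModifiedTime] <= @currentWatermark;

In a real pipeline, ADF would persist the new watermark after each successful run so that the next run picks up exactly where this one left off.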

Designing analytical stores

Analytical stores can be either SQL or NoSQL data stores deployed in the data lake Serving Zone. The main job of an analytical data store is to serve the data generated by the data transformation pipelines to the BI tools in a fast and simple manner. Analytical stores are usually subjected to ad hoc querying from business analysts and other end users. As such, these stores need to perform really well on random reads. Azure provides a variety of storage technologies that can cater to these requirements. Here are some of the most important ones:

  • Azure Synapse Analytics (serverless/dedicated SQL pools and Spark pools): Synapse Analytics provides both SQL pools and Spark pools. Among these, the dedicated SQL pools are massively parallel processing (MPP) data warehouses and are usually ideal for the majority of analytical store situations. Serverless SQL pools can be used for ad hoc querying. Spark pools, on the other hand, can support analytical...

Designing metastores in Azure Synapse Analytics and Azure Databricks

Metastores store the metadata of data in services such as Spark or Hive. Think of a metastore as a data catalog that can tell you which tables you have, what the table schemas are, what the relationships among the tables are, where they are stored, and so on. Spark supports two metastore options: an in-memory version and an external version.

In-memory metastores are limited in accessibility and scale. They can help jobs running on the same Java virtual machine (JVM) but not much further than this. Also, the metadata is lost once the cluster is shut down.

For all practical purposes, Spark uses an external metastore, and the only supported external metastore at the time of writing this book was Hive Metastore. Hive's metastore is mature and provides generic application programming interfaces (APIs) to access it. Hence, instead of building a new metastore from scratch, Spark just uses the mature and well-designed Hive...
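As a quick illustration of what the metastore holds, the following standard Spark SQL commands read metadata rather than data; the sales database and orders table are made-up names:

SHOW DATABASES;                         -- databases registered in the metastore
SHOW TABLES IN sales;                   -- tables the metastore knows about in that database
DESCRIBE TABLE EXTENDED sales.orders;   -- schema, storage location, and file format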

Summary

That brings our fourth chapter to a close. Congratulations on making it this far.

Just to recap, we started off with the basics of data modeling and learned about Star and Snowflake schemas. We then learned about designing for SCDs and their different sub-types, dimensional hierarchies, handling temporal data using temporal tables, loading data incrementally using ADF, and selecting the right analytical store based on the customer's requirements. Finally, we learned about creating metastores for Synapse and Azure Databricks. All these topics complete the syllabus for DP-203 – Design the Serving Layer. You have now learned how to design your own Serving layer in Azure.

We have now come to the end of our design chapters. We will be focusing on the implementation details from the next chapter onward. Yay!
