Chapter 4: Designing the Serving Layer

In this chapter, we will be learning about the techniques and technologies involved in designing a data serving layer. As we have seen previously in the data lake design, data passes through several zones. It starts from a Landing Zone, from where it gets processed into more useful formats in the Transformation Zone, and finally, the derived data insights land in the Serving Zone (also called the Serving layer). The Serving Zone serves the processed data and insights to the end users. The Landing Zone and Transformation Zone of a data lake focus on aspects such as efficient storage of data, processing huge volumes of data, optimizing queries, and so on. The Serving layer, on the other hand, focuses mainly on how to serve the data in a fast and efficient way to the business intelligence (BI) tools.

Serving layers are usually built using relational data stores (Structured Query Language (SQL)-based stores). This is done for two reasons: relational...

Technical requirements

For this chapter, you will need an Azure account (free or paid).

Learning the basics of data modeling and schemas

Data modeling is the process of designing how data will be represented in data stores. Many data modeling techniques were originally designed for databases and warehouses. Since Serving layers are usually built with relational data stores such as data warehouses, some of these data modeling techniques can be applied to Serving layer design too. But do remember that the Serving layer could be built using other storage technologies, such as document databases, key-value stores, and so on, based on customer requirements.

Unlike data lakes, databases and data warehouses don't give us the luxury of storing huge volumes of data in whatever format we like. Databases and data warehouses can perform queries exceptionally fast, provided the data is stored in predetermined formats and is limited in size. Hence, while designing the Serving layer, we need to identify the specifics of which data needs to be stored, which format to store...

Designing Star and Snowflake schemas

Schemas are guidelines for arranging data entities such as SQL tables in a data store. Designing a schema refers to the process of designing the various tables and the relationships among them. Star and Snowflake schemas are two of the most commonly used schemas in the data analytics and BI world. In fact, Star schemas are used more frequently than Snowflake schemas. Both have their own advantages and disadvantages, so let's explore them in detail.

Star schemas

A Star schema is the simplest of the data warehouse schemas. It has two sets of tables: one that stores quantitative information, such as transactions happening at a retail outlet or trips happening at a cab company, and another that stores the context or descriptions of the events recorded in the quantitative tables.

The quantitative tables are called fact tables and the descriptive or context tables are called dimension tables.
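For instance, a minimal sketch of a Star schema for a retail scenario could look like the following. The table and column names here are illustrative assumptions, not taken from this book:

CREATE TABLE FactSales (
    [salesId]   INT NOT NULL,
    [productId] INT NOT NULL,   -- refers to DimProduct (a dimension table)
    [storeId]   INT NOT NULL,   -- refers to a hypothetical DimStore table
    [quantity]  INT,            -- quantitative measures (the "facts")
    [amount]    DECIMAL(10, 2)
);

CREATE TABLE DimProduct (
    [productId] INT NOT NULL,   -- joins back to FactSales
    [name]      VARCHAR(100),   -- descriptive context for the facts
    [category]  VARCHAR(50)
);

Every dimension table joins directly to the fact table, which is what gives the schema its star shape.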

The following diagram shows an example of...

Designing SCDs

SCDs refer to data in dimension tables that changes slowly over time, rather than at a regular cadence. A common example is customer profile data: an email address or phone number doesn't change that often, which makes such fields perfect candidates for SCDs. In this section, we will look at how to design for such changes.

Services such as Azure SQL provide built-in support for SCDs, but in data warehouses such as Synapse dedicated pools, we will have to implement them ourselves.

Here are some of the main aspects we will need to consider while designing an SCD:

  • Should we keep track of the changes? If yes, how much of the history should we maintain?
  • Or, should we just overwrite the changes and ignore the history?

Based on our requirements for maintaining history, there are seven common ways in which we can keep track of changes. They are named SCD1, SCD2, SCD3, and so on, up to SCD7.

Among these...
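To make the most common variant concrete, here is a hedged sketch of an SCD Type 2 (SCD2) dimension, which preserves history by inserting a new row for every change; the table and column names are illustrative assumptions:

CREATE TABLE DimCustomer (
    [surrogateKey] INT NOT NULL,  -- new key generated for each version of a customer
    [customerId]   INT NOT NULL,  -- business key, stable across versions
    [email]        VARCHAR(100),
    [startDate]    DATE,          -- when this version became effective
    [endDate]      DATE,          -- NULL (or a sentinel date) for the current version
    [isCurrent]    BIT            -- convenience flag for querying the latest version
);

When a customer's email changes, the current row's endDate and isCurrent are updated, and a new row with a fresh surrogateKey is inserted.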

Designing a solution for temporal data

Temporal data refers to data at specific points in time. Storing temporal data is required in situations such as data auditing, forensic investigations, maintaining SCDs, point-in-time recovery, and so on. Azure SQL and SQL Server provide a mechanism called temporal tables to store temporal data.

Temporal tables are specialized tables that keep track of data changes over time. They track the history of data changes just as we saw with SCD tables, but in this case the system takes care of managing the validity period of each row, instead of us having to do it manually. Hence, these tables are also called system-versioned temporal tables.

Note

Temporal tables are a feature of Azure SQL Database and SQL Server. They were not available in Azure Synapse pools at the time of writing this book.

Let us look at an example of how to create a temporal table in Azure SQL:

CREATE TABLE Customer
(
  [customerId] INT NOT NULL PRIMARY...
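The statement above is cut off, but a complete system-versioned temporal table generally follows the shape below. This is a minimal sketch: the name column and the CustomerHistory history table name are assumptions, while the period columns and the SYSTEM_VERSIONING clause follow the standard Azure SQL syntax:

CREATE TABLE Customer
(
    [customerId] INT NOT NULL PRIMARY KEY CLUSTERED,
    [name]       VARCHAR(100),
    [validFrom]  DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
    [validTo]    DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
    PERIOD FOR SYSTEM_TIME ([validFrom], [validTo])
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.CustomerHistory));

With SYSTEM_VERSIONING turned on, the engine automatically stamps each row's validity period and moves superseded row versions into the history table.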

Designing a dimensional hierarchy

Dimensional hierarchy refers to the way we group and organize dimensional data at multiple levels. In a hierarchical structure, there are usually one-to-many or many-to-many parent-child relationships. Examples of hierarchical structures include organizational structures, product categories in an online store, a file system, and so on. The main characteristic of a hierarchical structure is that all the nodes are identical in shape, and each includes a reference to its parent or child nodes.

In order to achieve a dimensional hierarchy, we can use a technique called a self-referencing relationship (or self-join) within the dimension table. Let us take the example of an Employee dimension:

CREATE TABLE DimEmployee (
	[employeeId] VARCHAR(20) NOT NULL,
	[name] VARCHAR(100),
	[department] VARCHAR(50),
	[title] VARCHAR(50),
	[parentEmployeeId] VARCHAR(20) -- self-reference to the manager's employeeId
)

Here, we have a column called parentEmployeeID referencing the employeeID column....
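As a hedged illustration, a recursive common table expression (CTE) can walk this self-referencing relationship and materialize the hierarchy, assuming top-level employees carry a NULL parentEmployeeId:

WITH EmployeeHierarchy AS (
    -- Anchor: start from the top-level employees (no manager)
    SELECT [employeeId], [name], [parentEmployeeId], 1 AS [level]
    FROM DimEmployee
    WHERE [parentEmployeeId] IS NULL
    UNION ALL
    -- Recursive step: attach each employee one level below their manager
    SELECT e.[employeeId], e.[name], e.[parentEmployeeId], h.[level] + 1
    FROM DimEmployee e
    INNER JOIN EmployeeHierarchy h
        ON e.[parentEmployeeId] = h.[employeeId]
)
SELECT * FROM EmployeeHierarchy;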

Designing for incremental loading

Incremental loading, or delta loading, refers to the process of loading smaller increments of data into a storage solution. For example, we could have daily data being loaded into a data lake, or hourly data flowing into an extract, transform, load (ETL) pipeline, and so on. In data-ingestion scenarios, it is very common to do an initial bulk load followed by scheduled incremental loads; a typical incremental load query is sketched at the end of this section.

Azure has a very versatile service called Azure Data Factory (ADF) that can help with incremental loading. Since this is the first time we are using ADF in this book, let's learn a little more about it now, as the information will come in handy in future chapters.

ADF is a managed cloud service that can be used to coordinate and orchestrate complex cloud-based or hybrid (cloud plus on-premises) pipelines. ADF provides the ability to build both ETL and extract, load, transform (ELT) pipelines. With ADF, you can do the following:

  • Ingest data from a wide...
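To give a flavor of how incremental loads are commonly expressed, here is a hedged sketch of the watermark pattern frequently used with ADF copy pipelines; the table, column, and variable names are illustrative assumptions:

-- Fetch only the rows that changed since the previous pipeline run,
-- using a last-modified timestamp column as the watermark.
DECLARE @lastWatermark    DATETIME2 = '2022-01-01';      -- stored from the previous run
DECLARE @currentWatermark DATETIME2 = SYSUTCDATETIME();  -- captured at the start of this run

SELECT *
FROM SalesOrders
WHERE [lastModifiedTime] >  @lastWatermark
  AND [lastModifiedTime] <= @currentWatermark;

In a real pipeline, ADF would persist the new watermark after each successful run so that the next run picks up exactly where this one left off.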

Designing analytical stores

Analytical stores can be either SQL or NoSQL data stores deployed in the data lake Serving Zone. The main job of an analytical data store is to serve the data generated by the data transformation pipelines to the BI tools in a fast and simple manner. Analytical stores are usually subjected to ad hoc querying from business analysts and other end users. As such, these stores need to perform really well on random reads. Azure provides a variety of storage technologies that can cater to these requirements. Here are some of the most important ones:

  • Azure Synapse Analytics (serverless/dedicated SQL pools and Spark pools): Synapse Analytics provides both SQL pools and Spark pools. Among these, the dedicated SQL pools are massively parallel processing (MPP) data warehouses and are usually ideal for the majority of analytical store situations. Serverless SQL pools can be used for ad hoc querying. Spark pools, on the other hand, can support analytical...

Designing metastores in Azure Synapse Analytics and Azure Databricks

Metastores store the metadata of data in services such as Spark or Hive. Think of a metastore as a data catalog that can tell you which tables you have, what the table schemas are, what the relationships among the tables are, where they are stored, and so on. Spark supports two metastore options: an in-memory version and an external version.

In-memory metastores are limited in accessibility and scale. They can help jobs running on the same Java virtual machine (JVM) but not much further than this. Also, the metadata is lost once the cluster is shut down.

For all practical purposes, Spark uses an external metastore, and the only supported external metastore at the time of writing this book was Hive Metastore. Hive's metastore is mature and provides generic application programming interfaces (APIs) to access it. Hence, instead of building a new metastore from scratch, Spark just uses the mature and well-designed Hive...
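As a quick illustration of what the metastore holds, the following standard Spark SQL commands read metadata rather than data; the sales database and orders table are made-up names:

SHOW DATABASES;                         -- databases registered in the metastore
SHOW TABLES IN sales;                   -- tables the metastore knows about in that database
DESCRIBE TABLE EXTENDED sales.orders;   -- schema, storage location, and file format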

Summary

That brings our fourth chapter to a close. Congratulations on making it this far.

Just to recap, we started off with the basics of data modeling and learned about Star and Snowflake schemas. We then learned about designing for SCDs and their different sub-types, dimensional hierarchies, handling temporal data using temporal tables, loading data incrementally using ADF, and selecting the right analytical store based on the customer's requirements. Finally, we learned about creating metastores for Synapse and Azure Databricks. All these topics complete the syllabus for DP-203 – Design the Serving Layer. You have now learned how to design your own Serving layer in Azure.

We have now come to the end of our design chapters. We will be focusing on the implementation details from the next chapter onward. Yay!
