You're reading from Azure Data Engineer Associate Certification Guide

Product type Book

Published in Feb 2022

Publisher Packt

ISBN-13 9781801816069

Pages 574 pages

Edition 1st Edition

Languages

Concepts

Big Data

Author (1):

Newton Alex

Table of Contents (23) Chapters

Preface

Part 1: Azure Basics

Chapter 1: Introducing Azure Basics

Part 2: Data Storage

Chapter 2: Designing a Data Storage Structure

Chapter 3: Designing a Partition Strategy

Chapter 4: Designing the Serving Layer

Chapter 5: Implementing Physical Data Storage Structures

Chapter 6: Implementing Logical Data Structures

Chapter 7: Implementing the Serving Layer

Part 3: Design and Develop Data Processing (25-30%)

Chapter 8: Ingesting and Transforming Data

Chapter 9: Designing and Developing a Batch Processing Solution

Chapter 10: Designing and Developing a Stream Processing Solution

Chapter 11: Managing Batches and Pipelines

Part 4: Design and Implement Data Security (10-15%)

Chapter 12: Designing Security for Data Policies and Standards

Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)

Chapter 13: Monitoring Data Storage and Data Processing

Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing

Part 6: Practice Exercises

Chapter 15: Sample Questions with Solutions

Other Books You May Enjoy

Chapter 6: Implementing Logical Data Structures

I hope you enjoyed learning about the implementation details of storage structures. In this chapter, we will be covering the implementation details of logical data structures. You will learn to implement advanced data loading concepts like slowly changing dimensions, storage based solutions for optimizing query performance and techniques to read external data without having to copy them over to local storage.

We will cover the following topics in this chapter:

Building a temporal data solution
Building a slowly changing dimension
Building a logical folder structure
Implementing file and folder structures for efficient querying and data pruning
Building external tables

Technical requirements

For this chapter, you will need the following:

An Azure account (free or paid)
The Azure CLI installed on your workstation
An active Synapse workspace

Let's get started!

Building a temporal data solution

Temporal data refers to data at specific points in time. Building a temporal solution deals with building systems that can handle and store such time based data. We have already seen how to use the Azure SQL Temporal tables feature to build such a system in Chapter 4, Designing the Serving Layer under the section: Designing a solution for temporal data. Since we have already explored this topic in detail there, we will not be repeating it again. Please refer to Chapter 4 to refresh your memory about Temporal tables.

Let us next learn how to implement a slowly changing dimension.

Building a slowly changing dimension

In Chapter 4, Designing the Serving Layer, we learned about the different ways to build Slowly Changing Dimensions (SCDs). In this section, we will learn to implement a few of them using ADF/Synapse Pipelines Mapping flows. We will implement the Type 2 SCD as it involves a slightly more complicated workflow. Once you know how to implement one of the SCDs, implementing the others will be similar.

Let's consider the following example scenario:

We have a DimDriver dimension table in a Synapse SQL dedicated pool that contains the driver's data. This data doesn't change very often, so it is a good choice for an SCD.
Let us assume that the changes to the driver data appears periodically as a CSV file in a folder in Azure Data Lake Gen2.
We have to build an ADF pipeline to take the data from the CSV file and apply it to the DimDriver table while maintaining its history.
We will build the Flag-based SCD option in this...

Building a logical folder structure

We learned about efficient folder structures in Chapter 2, Designing a Data Storage Structure, where we explored the best practices for storing data for batch processing and streaming scenarios. The rule of thumb is to store the data in a hierarchical date folder structure, with the date part added toward the end, as shown here:

{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/

We can have more intermediate folders in the folder path based on our business requirements. Please refer to Chapter 2, Designing a Data Storage Structure, to refresh your memory on designing efficient folder structures.

To create a container on Azure Data Lake Gen 2, you can use the following command via the Azure CLI:

az storage fs create -n <container name> --account-name <account name> --auth-mode login

Once you have created a container, you can easily create folders in it by using the following command:

az storage fs directory create -n...

Implementing file and folder structures for efficient querying and data pruning

The concepts we explored in the previous section applies here too. Once we have a date-based hierarchical folder structure, query performance can be improved via data partitioning. If we divide the data into partitions and if we ensure that the partitions are stored in different folder structures, then the queries can skip scanning the irrelevant partitions altogether. This concept, as we already know, is called data pruning.

Another benefit of partitioning is the increased efficiency of data loading and deletion by performing partition switching and partition deletion. Here, instead of reading each row and updating it, huge partitions of data can be added or deleted with simple metadata operations. Chapter 2, Designing a Data Storage Structure, already covered examples of how queries can benefit from data pruning by skipping reading from unnecessary partitions. In this section, we'll learn how...

Building external tables

External tables are similar to regular tables except that the data is stored in external storage locations such as Azure Data Lake, Azure Blob storage, and HDFS. With external tables, you don't need to copy data into internal tables for processing. They can directly read the data from external sources, which saves on the data transfer cost. In Synapse, both dedicated SQL and serverless SQL support external tables. We have to define the following three constructs to access data via external tables:

EXTERNAL DATA SOURCE

Here is an example of how we can create an external data source named users_iacstoreacct:

CREATE EXTERNAL DATA SOURCE [Dp203DataSource] 
	WITH ( LOCATION  = 'abfss://path/to/data')

EXTERNAL FILE FORMAT

Here is an example of how to create an external file format named SynapseParquetFormat:

CREATE EXTERNAL FILE FORMAT [Dp203ParquetFormat] 
	WITH ( FORMAT_TYPE = PARQUET )

EXTERNAL...

Summary

That brings us to the end of this chapter. This was a smaller chapter but it had some very important concepts from the certification's perspective. We started by looking at building SCDs, which was a lengthy but relatively easy process as we mostly had to just drag and drop the components into Synapse pipelines and configure them. Then, we revisited the general rule of thumb for building an efficient folder structure. We also learned how to quickly switch partitions in and out using Synapse SQL. Finally, we learned how easy it is to create external tables in Synapse.

All the topics that were covered in this chapter should cover the syllabus for DP203 – Implementing the logical data structures. In the next chapter, we will focus on implementing the serving layer.