Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Azure Data Engineer Associate Certification Guide

You're reading from  Azure Data Engineer Associate Certification Guide

Product type Book
Published in Feb 2022
Publisher Packt
ISBN-13 9781801816069
Pages 574 pages
Edition 1st Edition
Languages
Concepts
Author (1):
Newton Alex Newton Alex
Profile icon Newton Alex

Table of Contents (23) Chapters

Preface Part 1: Azure Basics
Chapter 1: Introducing Azure Basics Part 2: Data Storage
Chapter 2: Designing a Data Storage Structure Chapter 3: Designing a Partition Strategy Chapter 4: Designing the Serving Layer Chapter 5: Implementing Physical Data Storage Structures Chapter 6: Implementing Logical Data Structures Chapter 7: Implementing the Serving Layer Part 3: Design and Develop Data Processing (25-30%)
Chapter 8: Ingesting and Transforming Data Chapter 9: Designing and Developing a Batch Processing Solution Chapter 10: Designing and Developing a Stream Processing Solution Chapter 11: Managing Batches and Pipelines Part 4: Design and Implement Data Security (10-15%)
Chapter 12: Designing Security for Data Policies and Standards Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)
Chapter 13: Monitoring Data Storage and Data Processing Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing Part 6: Practice Exercises
Chapter 15: Sample Questions with Solutions Other Books You May Enjoy

Chapter 6: Implementing Logical Data Structures

I hope you enjoyed learning about the implementation details of storage structures. In this chapter, we will be covering the implementation details of logical data structures. You will learn to implement advanced data loading concepts like slowly changing dimensions, storage based solutions for optimizing query performance and techniques to read external data without having to copy them over to local storage.

We will cover the following topics in this chapter:

  • Building a temporal data solution
  • Building a slowly changing dimension
  • Building a logical folder structure
  • Implementing file and folder structures for efficient querying and data pruning
  • Building external tables

Technical requirements

For this chapter, you will need the following:

  • An Azure account (free or paid)
  • The Azure CLI installed on your workstation
  • An active Synapse workspace

Let's get started!

Building a temporal data solution

Temporal data refers to data at specific points in time. Building a temporal solution deals with building systems that can handle and store such time based data. We have already seen how to use the Azure SQL Temporal tables feature to build such a system in Chapter 4, Designing the Serving Layer under the section: Designing a solution for temporal data. Since we have already explored this topic in detail there, we will not be repeating it again. Please refer to Chapter 4 to refresh your memory about Temporal tables.

Let us next learn how to implement a slowly changing dimension.

Building a slowly changing dimension

In Chapter 4, Designing the Serving Layer, we learned about the different ways to build Slowly Changing Dimensions (SCDs). In this section, we will learn to implement a few of them using ADF/Synapse Pipelines Mapping flows. We will implement the Type 2 SCD as it involves a slightly more complicated workflow. Once you know how to implement one of the SCDs, implementing the others will be similar.

Let's consider the following example scenario:

  • We have a DimDriver dimension table in a Synapse SQL dedicated pool that contains the driver's data. This data doesn't change very often, so it is a good choice for an SCD.
  • Let us assume that the changes to the driver data appears periodically as a CSV file in a folder in Azure Data Lake Gen2.
  • We have to build an ADF pipeline to take the data from the CSV file and apply it to the DimDriver table while maintaining its history.
  • We will build the Flag-based SCD option in this...

Building a logical folder structure

We learned about efficient folder structures in Chapter 2, Designing a Data Storage Structure, where we explored the best practices for storing data for batch processing and streaming scenarios. The rule of thumb is to store the data in a hierarchical date folder structure, with the date part added toward the end, as shown here:

{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/

We can have more intermediate folders in the folder path based on our business requirements. Please refer to Chapter 2, Designing a Data Storage Structure, to refresh your memory on designing efficient folder structures.

To create a container on Azure Data Lake Gen 2, you can use the following command via the Azure CLI:

az storage fs create -n <container name> --account-name <account name> --auth-mode login

Once you have created a container, you can easily create folders in it by using the following command:

az storage fs directory create -n...

Implementing file and folder structures for efficient querying and data pruning

The concepts we explored in the previous section applies here too. Once we have a date-based hierarchical folder structure, query performance can be improved via data partitioning. If we divide the data into partitions and if we ensure that the partitions are stored in different folder structures, then the queries can skip scanning the irrelevant partitions altogether. This concept, as we already know, is called data pruning.

Another benefit of partitioning is the increased efficiency of data loading and deletion by performing partition switching and partition deletion. Here, instead of reading each row and updating it, huge partitions of data can be added or deleted with simple metadata operations. Chapter 2, Designing a Data Storage Structure, already covered examples of how queries can benefit from data pruning by skipping reading from unnecessary partitions. In this section, we'll learn how...

Building external tables

External tables are similar to regular tables except that the data is stored in external storage locations such as Azure Data Lake, Azure Blob storage, and HDFS. With external tables, you don't need to copy data into internal tables for processing. They can directly read the data from external sources, which saves on the data transfer cost. In Synapse, both dedicated SQL and serverless SQL support external tables. We have to define the following three constructs to access data via external tables:

  • EXTERNAL DATA SOURCE

Here is an example of how we can create an external data source named users_iacstoreacct:

CREATE EXTERNAL DATA SOURCE [Dp203DataSource] 
	WITH ( LOCATION  = 'abfss://path/to/data')
  • EXTERNAL FILE FORMAT

Here is an example of how to create an external file format named SynapseParquetFormat:

CREATE EXTERNAL FILE FORMAT [Dp203ParquetFormat] 
	WITH ( FORMAT_TYPE = PARQUET )
  • EXTERNAL...

Summary

That brings us to the end of this chapter. This was a smaller chapter but it had some very important concepts from the certification's perspective. We started by looking at building SCDs, which was a lengthy but relatively easy process as we mostly had to just drag and drop the components into Synapse pipelines and configure them. Then, we revisited the general rule of thumb for building an efficient folder structure. We also learned how to quickly switch partitions in and out using Synapse SQL. Finally, we learned how easy it is to create external tables in Synapse.

All the topics that were covered in this chapter should cover the syllabus for DP203 – Implementing the logical data structures. In the next chapter, we will focus on implementing the serving layer.

lock icon The rest of the chapter is locked
You have been reading a chapter from
Azure Data Engineer Associate Certification Guide
Published in: Feb 2022 Publisher: Packt ISBN-13: 9781801816069
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}