You're reading from Azure Data Engineer Associate Certification Guide

Product type Book

Published in Feb 2022

Publisher Packt

ISBN-13 9781801816069

Pages 574 pages

Edition 1st Edition

Languages

Concepts

Big Data

Author (1):

Newton Alex

Table of Contents (23) Chapters

Preface

Part 1: Azure Basics

Chapter 1: Introducing Azure Basics

Part 2: Data Storage

Chapter 2: Designing a Data Storage Structure

Chapter 3: Designing a Partition Strategy

Chapter 4: Designing the Serving Layer

Chapter 5: Implementing Physical Data Storage Structures

Chapter 6: Implementing Logical Data Structures

Chapter 7: Implementing the Serving Layer

Part 3: Design and Develop Data Processing (25-30%)

Chapter 8: Ingesting and Transforming Data

Chapter 9: Designing and Developing a Batch Processing Solution

Chapter 10: Designing and Developing a Stream Processing Solution

Chapter 11: Managing Batches and Pipelines

Part 4: Design and Implement Data Security (10-15%)

Chapter 12: Designing Security for Data Policies and Standards

Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)

Chapter 13: Monitoring Data Storage and Data Processing

Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing

Part 6: Practice Exercises

Chapter 15: Sample Questions with Solutions

Other Books You May Enjoy

Chapter 3: Designing a Partition Strategy

Data partitioning refers to the process of dividing data and storing it in physically different locations. We partition data mainly for performance, scalability, manageability, and security reasons. Partitioning itself is a generic term, but the methods and techniques of partitioning vary from service to service—for example, the partitioning techniques used for Azure Blob storage might not be the same as those applied for database services such as Azure SQL or Azure Synapse Dedicated SQL pool. Similarly, document databases such as Cosmos DB have different partitioning techniques from Azure Queues or Azure Tables. In this chapter, we will explore some of the important partitioning techniques and when to use them.

As in the previous chapter, we will again be focusing more on the design aspects, as per the syllabus. The implementation details will be covered in Chapter 5, Implementing Physical Data Storage Structures.

This chapter...

Understanding the basics of partitioning

In the previous chapter, we briefly introduced the concept of partitioning as part of the Designing storage for efficient querying section. We explored storage-side partitioning concepts such as replicating data, reducing cross-partition operations such as joins, and eventual consistency to improve query performance. In this chapter, we will take a deeper and more systematic look at both storage and analytical partitioning techniques. Let's start with the benefits of partitioning.

Benefits of partitioning

Partitioning has several benefits apart from just query performance. Let's take a look at a few important ones.

Improving performance

As we discussed in the previous chapter, partitioning helps improve the parallelization of queries by splitting massive monolithic data into smaller, easily consumable chunks.
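As a rough illustration of the parallelization idea, the following Python sketch splits a dataset into chunks, computes a partial result per chunk, and then combines the partial results. The `trips` rows, the `fare` column, and all function names are hypothetical and exist only for this example; a real engine would spread the chunks over separate nodes rather than threads in one process.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size."""
    size, remainder = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` chunks take one extra row each.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def total_fare(partition):
    """Per-partition work: sum the (hypothetical) fare column of one chunk."""
    return sum(row["fare"] for row in partition)


def parallel_total_fare(rows, num_partitions=4):
    """Compute one partial sum per partition, then combine the partial sums."""
    partitions = split_into_partitions(rows, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(total_fare, partitions))
    return sum(partial_sums)


# Hypothetical sample data: 100 trips with integer fares.
trips = [{"trip_id": i, "fare": 10 + i} for i in range(100)]
print(parallel_total_fare(trips))
```

The point of the sketch is the shape of the computation: each chunk is small enough to be handled independently, and only the small partial results need to be brought together at the end.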

Apart from parallelization, partitioning also improves performance via data pruning, another concept that we already...

Designing a partition strategy for files

In this section, we will look at the partitioning techniques available for Azure Storage, which also apply to files. The Azure Storage services are generic and very flexible when it comes to partitioning. We can implement whatever partition logic we want using the same publicly available Create, Read, Update, Delete (CRUD) application programming interfaces (APIs); there are no special APIs or features dedicated to partitioning. With that background, let's now explore the partitioning options available in Azure Blob storage and ADLS Gen2.
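Because the partition logic lives in our own code rather than in a dedicated API, one common approach is to encode the partition in the blob or file name itself. The sketch below is one possible convention, not a prescribed one: the `year=/month=/day=` layout, the dataset name, and the function name are all assumptions made for illustration. The resulting name would simply be passed to the usual upload call.

```python
from datetime import datetime, timezone


def partitioned_blob_name(dataset, event_time, file_name):
    """Build a blob name whose prefix encodes a date-based partition.

    The year=/month=/day= layout is a hypothetical convention chosen for
    this example; any scheme works as long as writers and readers agree.
    """
    return (
        f"{dataset}/"
        f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/"
        f"{file_name}"
    )


# Hypothetical usage: a file of trips recorded on 14 February 2022.
name = partitioned_blob_name(
    "trips", datetime(2022, 2, 14, tzinfo=timezone.utc), "part-0001.csv"
)
print(name)
```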

Azure Blob storage

In Azure Blob storage, we first create Azure Storage accounts; then, within accounts, we create containers; and within containers, we create the actual blobs. These containers are logical entities, so even when we create data blobs within the same container, there is no guarantee that the data will land within the same partition. But there is a trick to enhance our chances of...

Designing a partition strategy for analytical workloads

There are three main types of partition strategies for analytical workloads. These are listed here:

Horizontal partitioning, which is also known as sharding
Vertical partitioning
Functional partitioning

Let's explore each of them in detail.

Horizontal partitioning

In a horizontal partition, we divide the table data horizontally, and subsets of rows are stored in different data stores. Each of these subsets of rows (with the same schema as the parent table) is called a shard. Essentially, each shard is stored in a different database instance.

You can see an example of a horizontal partition here:

Figure 3.1 – Example of a horizontal partition

In the preceding example, you can see that the data in the top table is distributed horizontally based on the Trip ID range.
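A range-based split like this can be sketched in a few lines of Python. The boundaries and shard names below are hypothetical (the figure's actual ranges are not reproduced here); the sketch only shows how a Trip ID could be routed to a shard by range.

```python
from bisect import bisect_right

# Hypothetical range boundaries: shard_0 holds Trip IDs below 1000,
# shard_1 holds 1000-1999, shard_2 holds 2000-2999, shard_3 the rest.
SHARD_UPPER_BOUNDS = [1000, 2000, 3000]
SHARD_NAMES = ["shard_0", "shard_1", "shard_2", "shard_3"]


def shard_for_trip(trip_id):
    """Return the name of the shard whose range contains trip_id."""
    return SHARD_NAMES[bisect_right(SHARD_UPPER_BOUNDS, trip_id)]


def split_rows(rows):
    """Group rows by shard; every shard keeps the parent table's schema."""
    shards = {name: [] for name in SHARD_NAMES}
    for row in rows:
        shards[shard_for_trip(row["trip_id"])].append(row)
    return shards


# Hypothetical rows from the parent table.
rows = [{"trip_id": t, "fare": 10} for t in (5, 999, 1000, 2500, 4200)]
for name, shard_rows in split_rows(rows).items():
    print(name, [r["trip_id"] for r in shard_rows])
```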

Selecting the right shard key

It is very important that we select the right shard key...

Designing a partition strategy for efficiency/performance

In the last few sections, we explored the various storage and analytical partitioning options and learned about how partitioning helps with performance, scale, security, availability, and so on. In this section, we will recap the points we learned about performance and efficiency and learn about some additional performance patterns.

Here are some strategies to keep in mind while designing for efficiency and performance:

Partition datasets into smaller chunks that can be run with optimal parallelism for multiple queries.
Partition the data such that queries don't end up requiring too much data from other partitions. Minimize cross-partition data transfers.
Design effective folder structures to improve the efficiency of data reads and writes.
Partition data such that a significant amount of data can be pruned while running queries.
Partition in units of data that can be easily added, deleted, swapped...
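The pruning strategy from the list above can be illustrated with a small Python sketch. It assumes hypothetical daily partitions keyed by trip date; a query over a date range then touches only the partitions whose key falls inside that range and skips the rest entirely.

```python
from datetime import date

# Hypothetical daily partitions: partition key (trip date) -> fares in it.
PARTITIONS = {
    date(2022, 1, 1): [10, 20],
    date(2022, 1, 2): [30],
    date(2022, 1, 3): [40, 50],
    date(2022, 1, 4): [60],
}


def prune(partitions, start, end):
    """Keep only the partitions whose key lies in the query's date range."""
    return {day: rows for day, rows in partitions.items() if start <= day <= end}


def total_fare(partitions, start, end):
    """Return (total fare, number of partitions actually scanned)."""
    selected = prune(partitions, start, end)
    total = sum(sum(rows) for rows in selected.values())
    return total, len(selected)


total, scanned = total_fare(PARTITIONS, date(2022, 1, 2), date(2022, 1, 3))
print(total, scanned)
```

Here, only two of the four partitions are read. The benefit grows with the number of partitions that the filter lets us skip.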

Designing a partition strategy for Azure Synapse Analytics

We learned about Azure Synapse Analytics in Chapter 2, Designing a Data Storage Structure. Synapse Analytics contains two compute engines, outlined here:

A Structured Query Language (SQL) engine, available as serverless SQL pools and dedicated SQL pools (the latter previously known as SQL Data Warehouse)
A Spark engine, available as Synapse Spark pools

But when people refer to Azure Synapse Analytics, they usually mean the Dedicated SQL pool option. In this section, we will look at the partition strategies available for a Synapse Dedicated SQL pool.

Note

We have already briefly covered partitioning in Spark as part of the Data pruning section in the previous chapter. The same concepts apply to Synapse Spark, too.

Before we explore partitioning options, let's recap the data distribution techniques of a Synapse dedicated pool from the previous chapter as this will play an important role in our partition strategy...
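To make the idea of distribution concrete, here is a rough Python sketch that compares hash and round-robin placement. A dedicated SQL pool spreads table rows across 60 distributions; everything else in the sketch is an assumption for illustration. In particular, `zlib.crc32` is only a stand-in for the service's internal hash function, which is not exposed.

```python
import zlib
from collections import Counter

# A dedicated SQL pool spreads each table's rows over 60 distributions.
NUM_DISTRIBUTIONS = 60


def hash_distribution(key):
    """Map a distribution-column value to a distribution.

    crc32 is a stand-in chosen for this sketch, not the service's own hash.
    """
    return zlib.crc32(str(key).encode("utf-8")) % NUM_DISTRIBUTIONS


def round_robin_distribution(row_number):
    """Assign rows to distributions in turn, ignoring their contents."""
    return row_number % NUM_DISTRIBUTIONS


# Round-robin spreads 120 rows evenly: 2 rows per distribution.
even = Counter(round_robin_distribution(n) for n in range(120))

# Hash placement keeps equal keys together, so a low-cardinality key
# (here, a single hypothetical customer) piles up in one distribution.
skewed = Counter(hash_distribution("customer-42") for _ in range(120))

print(len(even), set(even.values()))
print(len(skewed), set(skewed.values()))
```

The sketch shows the trade-off that matters for the partition strategy: hash placement keeps rows with the same key together, but a poorly chosen key causes skew, while round-robin placement is even but scatters related rows.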

Identifying when partitioning is needed in ADLS Gen2

As we have learned in the previous chapter, we can partition data according to our requirements—such as performance, scalability, security, operational overhead, and so on—but there is another reason why we might end up partitioning our data, and that is the various I/O bandwidth limits that are imposed at subscription levels by Azure. These limits apply to both Blob storage and ADLS Gen2.

The rate at which we ingest data into an Azure Storage system is called the ingress rate, and the rate at which we move the data out of the Azure Storage system is called the egress rate.
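A simple back-of-the-envelope calculation shows how such a limit turns into a partitioning decision. In the sketch below, the numbers are hypothetical and are not actual Azure limits, and "unit" stands for whatever scope the limit is enforced on. If the required ingress rate exceeds what one unit allows, the data has to be spread over several units.

```python
import math


def units_needed(required_rate_gbps, limit_per_unit_gbps):
    """Return how many units are needed to sustain the required rate.

    A "unit" is whatever scope the limit applies to; both rates are in
    gigabits per second. All values used below are hypothetical.
    """
    if limit_per_unit_gbps <= 0:
        raise ValueError("limit_per_unit_gbps must be positive")
    if required_rate_gbps <= 0:
        return 0
    return math.ceil(required_rate_gbps / limit_per_unit_gbps)


# Hypothetical figures: we need 150 Gbps of ingress, one unit allows 60 Gbps.
print(units_needed(150, 60))
```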

The following table shows a snapshot of some of the limits enforced by Azure Blob storage. This table is just to give you an idea of the limits that Azure Storage imposes. When we design our data lake applications, we need to take care of such restrictions as part of our design itself:

Figure 3.4 – Some of the...

Summary

With that, we have come to the end of our third chapter. I hope you enjoyed learning about the different partitioning techniques available in Azure! We started with the basics of partitioning, where you learned about its benefits; we then moved on to partitioning techniques for storage and analytical workloads. We explored best practices for improving partitioning efficiency and performance. We looked at the concept of distributed tables and how they affect partitioning in Azure Synapse Analytics, and finally, we learned about storage limits, which play an important role in deciding when to partition in ADLS Gen2. This covers the Designing a Partition Strategy section of the DP-203 exam syllabus. We will reinforce what we learned in this chapter through implementation details and tips in the following chapters.

Let's explore the serving layer in the next chapter.
