
Chapter 5: Implementing Physical Data Storage Structures

I hope you have had a good learning experience so far. Let's continue our journey toward certification with more interesting topics in this chapter. Up to the previous chapter, we focused on design aspects; from now on, we will focus on implementation details. We will learn how to implement the storage-level concepts covered in the previous chapters. Once you complete this chapter, you should be able to decide on and implement the following: what kind of data sharding is required, when to compress your data, how many partitions to create, what kind of data redundancy to maintain, and so on.

We will cover the following topics in this chapter:

  • Getting started with Azure Synapse Analytics
  • Implementing compression
  • Implementing partitioning
  • Implementing horizontal partitioning or sharding
  • Implementing distributions
  • Implementing different table geometries with Azure Synapse Analytics pools
  • Implementing data redundancy
  • Implementing data archiving

Technical requirements

For this chapter, you will need the following:

  • An Azure account (free or paid)
  • An active Synapse workspace

Let's start implementing!

Getting started with Azure Synapse Analytics

Most of the examples in this chapter will be executed in an Azure Synapse Analytics workspace, so let's look at how to create one. Proceed as follows:

  1. From the Azure portal, search for Synapse. From the results, select Azure Synapse Analytics.
  2. Once inside, click on + Create and enter the details for a new workspace. The Create Synapse workspace screen is shown in the following screenshot:

Figure 5.1 – Azure Synapse workspace creation

  3. Once done, click on the Review + create button and, finally, the Create button on that screen to create a new Synapse workspace.

This workspace will provide us with Structured Query Language (SQL) and Spark pools, required to experiment with the various concepts that we learned in the previous chapters. Let's start by implementing compression in the next section.

Implementing compression

In the previous chapters, we learned about the importance of data compression in data lakes. As the size of the data in a data lake grows, it becomes important to store it in a compressed format in order to save on cost. There are numerous ways to implement compression in Azure. Plenty of compression libraries are available, and technically all we need to do is write some scripts that call those libraries to compress the data. But ad hoc scripts bring their own maintenance complexity, so let's look at some easy-to-maintain ways of implementing compression in Azure using Azure Synapse Pipelines. The same can be achieved using Azure Data Factory (ADF) too; Azure Synapse Pipelines is essentially the same ADF implementation embedded within Synapse Analytics.
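As a quick aside before the pipeline approach: compression can also be declared directly from a Synapse SQL pool when writing files out, by attaching a compression codec to an external file format. The following is a hedged sketch, not from the book; the format name and options are illustrative:

CREATE EXTERNAL FILE FORMAT GzipCsvFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ','),
    -- Files written through this format (for example, via CREATE EXTERNAL
    -- TABLE AS SELECT) are gzip-compressed on write
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);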

Compressing files using Synapse Pipelines or ADF

In the previous chapter, we learned about ADF. Like ADF, Synapse Pipelines can be used to create pipelines to...

Implementing partitioning

In Chapter 3, Designing a Partition Strategy, we covered the basics of partitioning. In this section, we will be learning how to implement the different types of partitioning. We will start with partitioning on Azure data stores and then look into partitioning for analytical workloads.

For storage-based partitioning, the main technique is to partition the data into a well-planned folder structure. In the previous chapters, we learned how to store data in a date-based format. The recommendation from Azure is to use the following pattern:

{Region}/{SubjectMatter(s)}/{yyyy}/{mm}/{dd}/{hh}/
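For example, with this pattern, trip data for the North America region landing on February 15, 2022, at 10:00 would sit under a path such as northamerica/trips/2022/02/15/10/ (the region and subject-matter names here are illustrative).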

Let's learn how to implement folder creation in an automated manner using ADF.

Using ADF/Synapse pipelines to create data partitions

You can use either ADF or Synapse Pipelines, as both use the same underlying technology. In this example, I'm using Synapse Pipelines. Let's look at the steps to partition data in an automated fashion (a sketch of declaring partitions directly in a dedicated SQL pool table follows these steps):

  1. In your Synapse...
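Separately from the pipeline-driven folder approach, partitioning for analytical workloads in a dedicated SQL pool can be declared directly in the table definition. The following is a hedged sketch; the table, column, and boundary values are illustrative, not from the book:

CREATE TABLE dbo.TripsByMonth
(
    [tripId] INT NOT NULL,
    [tripDate] DATE NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    -- One partition per month boundary; rows with tripDate >= a boundary
    -- fall into that boundary's partition (RANGE RIGHT)
    PARTITION ([tripDate] RANGE RIGHT FOR VALUES
        ('2022-01-01', '2022-02-01', '2022-03-01'))
);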

Implementing horizontal partitioning or sharding

Let's explore sharding from two different perspectives: a dedicated SQL pool and Spark. Note that we will use the terms horizontal partitioning and sharding interchangeably throughout the book; they refer to the same concept.

Sharding in Synapse dedicated pools

Synapse dedicated SQL pools have three different types of tables based on how the data is stored, outlined as follows:

  • Clustered columnstore
  • Clustered index
  • Heap

We will be learning more about these table types later in this chapter. Synapse dedicated pools support sharding for all these table types. They provide three different ways to shard the data, as follows:

  • Hash
  • Round-robin
  • Replicated

The methods through which a dedicated SQL pool distributes data among its tables are also called distribution techniques. Sharding and distribution techniques are overlapping technologies that are always specified together...

Implementing distributions

A dedicated SQL pool's massively parallel processing (MPP) engine splits the data into 60 partitions and processes them in parallel. Each of these smaller partitions, along with the compute resources to run queries against it, is called a distribution. A distribution is the basic unit of processing and storage for a dedicated SQL pool.
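As a quick sanity check, the fixed set of 60 distributions can be counted from the catalog views. This is a hedged sketch; it assumes you are connected to a dedicated SQL pool:

-- Counts the distributions backing the pool; returns 60
SELECT COUNT(*) AS distribution_count
FROM sys.pdw_distributions;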

Dedicated SQL pools provide three options for distribution. Let's look at each of them in detail.

Hash distribution

This type of distribution assigns rows based on a hash function. Rows with the same value in the hash column always land in the same distribution. This can be implemented by providing the DISTRIBUTION = HASH (COLUMN_ID) value in the WITH clause of CREATE TABLE. Here is an example:

CREATE TABLE dbo.TripTable
(
    [tripId] INT NOT NULL,
    [driverId] INT NOT NULL,
    [customerID] INT NOT NULL,
    [tripDate] DATE NOT NULL  -- completed column type is illustrative; the original listing is truncated here
)
WITH
(
    DISTRIBUTION = HASH ([driverId])  -- hash column choice is illustrative
);
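The remaining two options use analogous syntax in the WITH clause. The following are hedged sketches; the table definitions are illustrative, not from the book:

-- Round-robin: rows are spread evenly across the 60 distributions
CREATE TABLE dbo.StagingTrips
(
    [tripId] INT NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN);

-- Replicated: a full copy of the table is kept on each compute node
CREATE TABLE dbo.DimDriver
(
    [driverId] INT NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);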

Implementing different table geometries with Azure Synapse Analytics pools

The term table geometry is not standard database terminology, so I'm taking an educated guess that the certification team meant the different features of Azure Synapse dedicated pool tables.

The main features of Synapse dedicated pool tables are partitions, indexes, and distributions. We have already covered partitions and distributions in detail, so we will focus on the remaining feature, which is indexing.

Dedicated SQL pools provide the following three types of indexing:

  • Clustered columnstore indexing
  • Heap indexing
  • Clustered indexing

Let's look at them in detail.

Clustered columnstore indexing

Clustered columnstore is the default indexing option for a dedicated SQL pool table. This type of indexing works best for large fact tables. It stores data in a columnar format and provides very high levels of compression and better query performance than row-based indexes.
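As a hedged sketch (the table definition is illustrative, not from the book), the indexing option is chosen in the WITH clause of CREATE TABLE:

CREATE TABLE dbo.FactTrips
(
    [tripId] INT NOT NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX);

-- The heap and clustered index variants swap the option:
--   WITH (HEAP)
--   WITH (CLUSTERED INDEX ([tripId]))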

...

Implementing data redundancy

Data redundancy is the practice of storing multiple copies of data at different locations in order to protect the data from events such as power failures, disk failures, network failures, and even major catastrophes such as an entire data center outage. Azure Storage provides multiple options for data redundancy, both within a local data center and across data centers. We can broadly group the options into these two categories:

  • Primary region redundancy
  • Secondary region redundancy

Let's look at these options in detail.

Azure storage redundancy in the primary region

This type of redundancy covers localized failures such as disk failures, machine failures, and rack failures. There are two types of primary region redundancy, as detailed next.

Locally redundant storage (LRS)

With this type of redundancy, the data is copied three times across multiple machines within a single data center in the primary region. This option provides...

Implementing data archiving

In Chapter 2, Designing a Data Storage Structure, we already explored how to design and implement data archiving. We learned about the Hot, Cool, and Archive tiers of data storage and how to build data life cycle management policies. As we have covered the details already, we will not repeat them here. Please refer to the Designing a data archiving solution section of Chapter 2, Designing a Data Storage Structure, if you need a refresher on archiving.

Summary

That brings us to the end of this chapter. I hope you enjoyed it as much as I enjoyed writing it. We started by learning how to compress data in a clean way using Synapse Pipelines and ADF, and natively using Spark. Then, we focused on implementing partitioning, sharding, and distributions in dedicated SQL pools and Spark. After that, we learned about the different types of tables and indexing available in dedicated SQL pools, and finally, we concluded by learning how to set up data redundancy and archiving.

All the preceding topics should cover the DP-203 syllabus section Implement Physical Data Storage Structures. We will focus on implementing logical data storage structures in the next chapter.
