Chapter 11: Managing Batches and Pipelines

Welcome to Chapter 11! This is one of the smaller and easier chapters and will be a breeze to read through. In this chapter, we will be focusing on four broad categories: triggering Batch jobs, handling failures in Batch jobs, managing pipelines, and configuring version control for our pipelines. Once you have completed this chapter, you should be able to comfortably set up and manage Batch pipelines using Azure Batch, Azure Data Factory (ADF), or Synapse pipelines.

In this chapter, we will cover the following topics:

  • Triggering Batches
  • Handling failed Batch loads
  • Validating Batch loads
  • Managing data pipelines in Data Factory/Synapse pipelines
  • Scheduling data pipelines in Data Factory/Synapse pipelines
  • Managing Spark jobs in a pipeline
  • Implementing version control for pipeline artifacts

Technical requirements

For this chapter, you will need the following:

  • An Azure account (free or paid)
  • An active Synapse workspace
  • An active Azure Data Factory workspace

Let's get started!

Triggering Batches

We learned about Azure Batch in Chapter 9, Designing and Developing a Batch Processing Solution, in the Introducing Azure Batch section. In this section, we will learn how to trigger those Batch jobs using Azure Functions. Azure Functions is a serverless service provided by Azure that helps you build cloud applications with minimal code, without having to worry about hosting and maintaining the technology that runs the code. Azure takes care of all the hosting complexities, such as deployments, upgrades, security patches, scaling, and more. Even though the name says serverless, there are still servers running in the background; it just means that you don't have to maintain those servers, as Azure does it for you.

For our requirement of triggering a Batch job, we will be using the Trigger functionality of Azure Functions. A Trigger defines when and how to invoke an Azure function. Azure Functions supports a wide variety of triggers, such as timer trigger, HTTP...
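
To make this idea concrete, the following is a minimal sketch (not the book's exact code) of a timer-triggered Azure function, written in Python using the v2 programming model, that submits a task to an existing Azure Batch job. The app settings, the job ID (my-batch-job), and the task's command line are placeholder assumptions:

```python
import os
import uuid

import azure.functions as func
from azure.batch import BatchServiceClient
from azure.batch import models as batchmodels
from azure.batch.batch_auth import SharedKeyCredentials

app = func.FunctionApp()

# Fires at 6 AM UTC every day; the CRON expression is just an example.
@app.timer_trigger(arg_name="timer", schedule="0 0 6 * * *")
def trigger_batch_job(timer: func.TimerRequest) -> None:
    # Placeholder app settings; store real values in the Function App config.
    credentials = SharedKeyCredentials(
        os.environ["BATCH_ACCOUNT_NAME"],
        os.environ["BATCH_ACCOUNT_KEY"],
    )
    client = BatchServiceClient(
        credentials, batch_url=os.environ["BATCH_ACCOUNT_URL"]
    )

    # Add a task to a pre-created Batch job; Batch schedules it on the pool.
    client.task.add(
        job_id="my-batch-job",  # hypothetical job ID
        task=batchmodels.TaskAddParameter(
            id=f"task-{uuid.uuid4().hex[:8]}",
            command_line="/bin/bash -c 'python process_load.py'",
        ),
    )
```

Once deployed, the function runs on the schedule you define, and Azure Batch takes care of placing the task on a node in the pool.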

Handling failed Batch loads

An Azure Batch job can fail due to four types of errors:

  • Pool errors
  • Node errors
  • Job errors
  • Task errors

Let's look at some of the common errors in each group and ways to handle them.

Pool errors

Pool errors occur mostly due to infrastructure issues, quota issues, or timeout issues. Here are some sample pool errors:

  • Insufficient quota: If your Batch account doesn't have sufficient quota, pool creation could fail. The mitigation is to request an increase in quota. You can check the quota limits here: https://docs.microsoft.com/en-us/azure/batch/batch-quota-limit.
  • Insufficient resources in your VNet: If your virtual network (VNet) doesn't have enough resources, such as available IP addresses, Network Security Groups (NSGs), VMs, and so on, the pool creation process may fail. The mitigation is to look for these errors and request higher resource allocation or move to a different VNet that has enough...
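
Pool-level failures such as these surface on the pool object itself, so they can be detected programmatically. Here is a hedged sketch using the azure-batch SDK; the pool ID and credential values are placeholders:

```python
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("<account-name>", "<account-key>")
client = BatchServiceClient(credentials, batch_url="<account-url>")

pool = client.pool.get("my-pool")  # hypothetical pool ID
print(f"Allocation state: {pool.allocation_state}")

# resize_errors is populated when pool allocation or resizing fails,
# for example, due to quota limits or VNet/IP address exhaustion.
if pool.resize_errors:
    for err in pool.resize_errors:
        print(f"{err.code}: {err.message}")
```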

Validating Batch loads

Batch jobs are usually run as part of Azure Data Factory (ADF). ADF provides functionalities for validating the outcome of jobs. Let's learn how to use the Validation activity in ADF to check the correctness of Batch loads:

  1. The Validation activity of ADF can be used to check for a file's existence before proceeding with the rest of the activities in the pipeline. The validation pipeline will look similar to the following:

Figure 11.4 – ADF Validation activity

  2. Once we have validated that the files exist, we can use the Get Metadata activity to get more information about the output files. In the following screenshot, we output Column count, which we'll check later using an If Condition activity to decide whether the output files are valid:

Figure 11.5 – Configuring the Get Metadata activity to publish the Column count

  3. Once we get the metadata, we must use the...
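
Although these steps are configured in the ADF UI, the finished validation pipeline can also be triggered and checked programmatically. The following is a minimal sketch using the azure-mgmt-datafactory SDK, assuming a hypothetical pipeline named ValidateBatchLoad and placeholder resource names:

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off the validation pipeline built in the steps above.
run = adf.pipelines.create_run("my-rg", "my-factory", "ValidateBatchLoad")

# Poll until the run leaves the in-progress states.
status = "InProgress"
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = adf.pipeline_runs.get("my-rg", "my-factory", run.run_id).status

print(f"Validation pipeline finished with status: {status}")
```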

Scheduling data pipelines in Data Factory/Synapse pipelines

Scheduling pipelines refers to the process of defining when and how a pipeline needs to be started. The process is the same between ADF and Synapse pipelines. ADF and Synapse pipelines have a button named Add Trigger in the Pipelines tab that can be used to schedule the pipelines, as shown in the following screenshot:

Figure 11.7 – Adding a trigger from ADF/Synapse pipelines

The following screenshot shows the details that are required to configure a Schedule trigger:

Figure 11.8 – Defining the trigger in ADF

ADF and Synapse pipeline services support four types of triggers:

  • Schedule trigger: This triggers a pipeline once or regularly based on the wall clock time.
  • Tumbling window trigger: This triggers a pipeline based on periodic intervals while maintaining state; that is, the trigger understands which window of data was processed last and restarts...
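
Whichever trigger type you pick, the same configuration shown in Figure 11.8 can also be created through the management SDK instead of the portal. The following is a minimal sketch of defining and starting an hourly Schedule trigger; all resource names are placeholders, and the method names follow recent versions of the azure-mgmt-datafactory SDK:

```python
from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    RecurrenceFrequency,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = ScheduleTrigger(
    recurrence=ScheduleTriggerRecurrence(
        frequency=RecurrenceFrequency.HOUR,  # wall clock based, every hour
        interval=1,
        start_time=datetime(2022, 2, 1, tzinfo=timezone.utc),
    ),
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                type="PipelineReference",
                reference_name="MyBatchPipeline",  # hypothetical pipeline
            )
        )
    ],
)

adf.triggers.create_or_update(
    "my-rg", "my-factory", "HourlyTrigger", TriggerResource(properties=trigger)
)
# A trigger is created in a stopped state; start it to activate the schedule.
adf.triggers.begin_start("my-rg", "my-factory", "HourlyTrigger").result()
```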

Managing data pipelines in Data Factory/Synapse pipelines

ADF and Synapse pipelines provide two tabs called Manage and Monitor, which can help us manage and monitor the pipelines, respectively.

In the Manage tab, you can add, edit, and delete linked services, integration runtimes, and triggers, configure Git integration, and more, as shown in the following screenshot:

Figure 11.9 – The Manage screen of ADF

We have already learned about linked services throughout this book. Now, let's explore the topic of integration runtimes in ADF and Synapse pipelines.

Integration runtimes

An integration runtime (IR) refers to the compute infrastructure that's used by ADF and Synapse pipelines to run data pipelines and data flows. These are the actual machines or VMs that run the job behind the scenes.

The IR takes care of running data flows, copying data across public and private networks, dispatching activities to services such as Azure HDInsight and Azure...
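
As a quick illustration, a self-hosted IR can be registered through the management SDK as follows. This is a hedged sketch with placeholder names; after registering, you would still install the IR agent on your machine and join it using one of the authentication keys:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Register a self-hosted IR entry in the factory.
adf.integration_runtimes.create_or_update(
    "my-rg",
    "my-factory",
    "MySelfHostedIR",
    IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(
            description="Connectivity to private networks"
        )
    ),
)

# Retrieve the keys used to join the on-premises IR agent to this entry.
keys = adf.integration_runtimes.list_auth_keys(
    "my-rg", "my-factory", "MySelfHostedIR"
)
print(keys.auth_key1)
```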

Managing Spark jobs in a pipeline

Managing Spark jobs in a pipeline involves two aspects:

  • Managing the attributes of the pipeline's runtime that launches the Spark activity: Managing the pipeline attributes of a Spark activity is no different from managing those of any other activity in a pipeline. The Manage and Monitor pages we saw in Figure 11.9, Figure 11.11, and Figure 11.12 are the same for any Spark activity as well. You can use the options provided on these screens to manage your Spark activity.
  • Managing Spark jobs and configurations: This involves understanding how Spark works, being able to tune the jobs, and so on. We have a complete chapter dedicated to optimizing Synapse SQL and Spark jobs towards the end of this book. You can refer to Chapter 14, Optimizing and Troubleshooting Data Storage and Data Processing, to learn more about managing and tuning Spark jobs.

In this section, we'll learn how to add an Apache Spark job (via HDInsight) to our pipeline...
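
As a preview of what that looks like programmatically, the following is a hedged sketch of a pipeline containing an HDInsight Spark activity, built with the azure-mgmt-datafactory SDK. The linked service, container, and script path are placeholder assumptions for resources you would have already defined:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    HDInsightSparkActivity,
    LinkedServiceReference,
    PipelineResource,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

spark_activity = HDInsightSparkActivity(
    name="RunSparkJob",
    root_path="adfspark",                    # storage container with the script
    entry_file_path="scripts/transform.py",  # PySpark entry point
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference",
        reference_name="HDInsightLinkedService",  # hypothetical linked service
    ),
)

# Create (or update) a pipeline whose only activity is the Spark job.
adf.pipelines.create_or_update(
    "my-rg",
    "my-factory",
    "SparkPipeline",
    PipelineResource(activities=[spark_activity]),
)
```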

Implementing version control for pipeline artifacts

By default, ADF and Synapse pipelines save pipeline details in their internal stores. These internal stores don't provide options for collaboration, version control, or any of the other benefits provided by source control systems. Every time you click on the Publish All button, your latest changes are saved within the service. To overcome this shortcoming, both ADF and Synapse pipelines provide options to integrate with source control systems such as Git. Let's explore how to configure version control for our pipeline artifacts.

Configuring source control in ADF

ADF provides a Set up code repository button at the top of the home screen, as shown in the following screenshot. You can use this button to start the Git configuration process:

Figure 11.16 – The Set up code repository button on the ADF home screen

You can also reach the Git Configuration page from the Manage tab (the toolkit icon...
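
For reference, the same Git configuration can be applied through the management SDK, mirroring the portal's Set up code repository flow. This is a hedged sketch; the resource ID, repository details, and region are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    FactoryGitHubConfiguration,
    FactoryRepoUpdate,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

repo_update = FactoryRepoUpdate(
    factory_resource_id=(
        "/subscriptions/<subscription-id>/resourceGroups/my-rg"
        "/providers/Microsoft.DataFactory/factories/my-factory"
    ),
    repo_configuration=FactoryGitHubConfiguration(
        account_name="my-github-account",   # hypothetical GitHub org/user
        repository_name="adf-pipelines",
        collaboration_branch="main",
        root_folder="/",
    ),
)

# The repo configuration is applied per Azure region (location ID).
adf.factories.configure_factory_repo("eastus", repo_update)
```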

Summary

With that, we have come to the end of this small chapter. We started by learning how to trigger Batch loads, how to handle errors and validate Batch jobs, and then moved on to ADF and Synapse pipelines. We learned about setting up triggers, managing and monitoring pipelines, running Spark pipelines, and configuring version control in ADF and Synapse pipelines. With all this knowledge, you should now be confident in creating and managing pipelines using ADF, Synapse pipelines, and Azure Batch.

This chapter marks the end of the Designing and Developing Data Processing section, which accounts for about 25-30% of the certification goals. From the next chapter onward, we will move on to the Designing and Implementing Data Security section, where we will be focusing on the security aspects of data processing.
