Chapter 9: Designing and Developing a Batch Processing Solution

Welcome to the next chapter in the data transformation series. If you have come this far, then you are really serious about the certification. Good job! You have already crossed the halfway mark, with only a few more chapters to go.

In the previous chapter, we learned about a lot of technologies, such as Spark, Azure Data Factory (ADF), and Synapse SQL. We will continue the streak here and learn about a few more batch processing-related technologies. We will learn how to build end-to-end batch pipelines, how to use Spark Notebooks in data pipelines, how to use technologies such as PolyBase to speed up data copy, and more. We will also learn techniques for handling late-arriving data, scaling clusters, debugging pipeline issues, and handling the security and compliance of pipelines. After completing this chapter, you should be able to design and implement ADF-based end-to-end batch pipelines using technologies such as Synapse...

Technical requirements

For this chapter, you will need the following:

  • An Azure account (free or paid)
  • An active Synapse workspace
  • An active Azure Data Factory workspace

Let's get started!

Designing a batch processing solution

In Chapter 2, Designing a Data Storage Structure, we learned about the data lake architecture. I've presented the diagram here again for convenience. In the following diagram, there are two branches, one for batch processing and the other for real-time processing. The part highlighted in green is the batch processing solution for a data lake. Batch processing usually deals with larger amounts of data and takes more time to process compared to stream processing.

Figure 9.1 – Batch processing architecture

A batch processing solution typically consists of five major components:

  • Storage systems such as Azure Blob storage, ADLS Gen2, HDFS, or similar
  • Transformation/batch processing systems such as Spark, SQL, or Hive (via Azure HDInsight)
  • Analytical data stores such as Synapse Dedicated SQL pool, Cosmos DB, and HBase (via Azure HDInsight)
  • Orchestration systems such as ADF and Oozie (via Azure...

Developing batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse Pipelines, PolyBase, and Azure Databricks

Let's try to build an end-to-end batch pipeline using all the technologies listed in the topic header. We will use our Imaginary Airport Cab (IAC) example from the previous chapters to create a sample requirement for our batch processing pipeline. Let's assume that we are continuously getting trip data from different regions (zip codes), which is stored in Azure Blob storage, and the trip fares are stored in an Azure SQL Server. We have a requirement to merge these two datasets and generate daily revenue reports for each region.
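Before wiring this into ADF, it helps to see what the core transformation looks like in code. The following is a minimal PySpark sketch of the merge-and-aggregate step; the storage paths and column names (tripId, zipCode, tripDate, fareAmount) are assumptions made purely for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("IACDailyRevenue").getOrCreate()

# Trip data that has landed in Azure Blob storage/ADLS Gen2 (hypothetical path and layout).
trips = spark.read.option("header", True).csv("abfss://iac@datalake.dfs.core.windows.net/raw/trips/")

# Trip fares extracted from the Azure SQL source by an earlier copy step (hypothetical path and layout).
fares = spark.read.option("header", True).csv("abfss://iac@datalake.dfs.core.windows.net/raw/fares/")

# Merge the two datasets on the trip identifier and aggregate revenue per region per day.
daily_revenue = (
    trips.join(fares, on="tripId", how="inner")
         .groupBy("zipCode", "tripDate")
         .agg(F.sum("fareAmount").alias("dailyRevenue"))
)

# Write the daily revenue report to the serving zone for downstream reporting.
daily_revenue.write.mode("overwrite").parquet("abfss://iac@datalake.dfs.core.windows.net/curated/daily_revenue/")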

In order to take care of this requirement, we can build a pipeline as shown in the following diagram:

Figure 9.2 – High-level architecture of the batch use case

The preceding pipeline, when translated into an ADF pipeline, would look like the following figure:

...

Creating data pipelines

A data pipeline is a collection of data processing activities arranged in a particular sequence to produce the desired insights from raw data. We have already seen many examples in Azure Data Factory where we chain activities together to produce a final desired outcome. ADF is not the only such technology available in Azure. Azure also supports Synapse pipelines (an implementation of ADF within Synapse) and open source technologies such as Oozie (available via Azure HDInsight), which can help orchestrate pipelines. If your workload only uses open source software, then Oozie could fit the bill. But if the pipeline uses other Azure or external third-party services, then ADF might be a better fit, as ADF provides readily available source and sink plugins for a huge list of technologies.

You can create a pipeline from the Pipeline tab of Azure Data Factory. All you need to do is to select the activities for your pipeline from the Activities tab...
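Pipelines can also be created programmatically. As a rough sketch, the azure-mgmt-datafactory Python SDK can define a pipeline with a single Copy activity; the dataset names, resource group, factory name, and subscription ID below are placeholders, and the exact model names can vary slightly between SDK versions:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A single Copy activity that moves data between two datasets already defined in the factory.
copy_activity = CopyActivity(
    name="CopyTripData",
    inputs=[DatasetReference(type="DatasetReference", reference_name="TripSourceDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="TripStagingDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Publish the pipeline into the factory.
adf_client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "IACBatchPipeline",
    PipelineResource(activities=[copy_activity]),
)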

Integrating Jupyter/Python notebooks into a data pipeline

Integrating Jupyter/Python notebooks into our ADF data pipeline can be done using the Spark activity in ADF. You will need an Azure HDInsight Spark cluster for this exercise.

The prerequisite for integrating Jupyter notebooks is to create linked services to Azure Storage and HDInsight from ADF and have an HDInsight Spark cluster running.

You have already seen how to create linked services, in the Developing batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse Pipelines, PolyBase, and Azure Databricks section earlier in this chapter, so I'll not repeat the steps here.

Select the Spark activity from ADF and specify the HDInsight linked service that you created in the HDInsight linked service field under the HDI Cluster tab as shown in the following screenshot.

Figure 9.26 – Configuring a Spark activity in ADF
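The entry file that the Spark activity submits to the cluster can be a plain PySpark script stored in the linked storage account. Here is a minimal, purely illustrative sketch; the file name, storage paths, and column names are hypothetical:

# dedupe_trips.py - submitted by the ADF Spark activity to the HDInsight cluster
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DedupeTrips").getOrCreate()

# Read the raw trip files from the storage account linked to the cluster.
trips = spark.read.option("header", True).csv("wasbs://iac@storageaccount.blob.core.windows.net/raw/trips/")

# Drop duplicate trips and rows with missing region information, then write the clean copy.
clean = trips.dropDuplicates(["tripId"]).na.drop(subset=["zipCode"])
clean.write.mode("overwrite").parquet("wasbs://iac@storageaccount.blob.core.windows.net/clean/trips/")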

Now, start the Jupyter notebook by going to...

Designing and implementing incremental data loads

We covered incremental data loading in Chapter 4, Designing the Serving Layer. Please refer to that chapter to refresh your knowledge of incremental data loads.

Let's next see how to implement slowly changing dimensions.

Designing and developing slowly changing dimensions

We also covered slowly changing dimensions (SCDs) in detail in Chapter 4, Designing the Serving Layer. Please refer to that chapter to refresh your knowledge of the concepts.

Handling duplicate data

We already explored this topic in Chapter 8, Ingesting and Transforming Data. Please refer to that chapter to refresh your understanding of handling duplicate data.

Let's next look at how to handle missing data.

Handling missing data

We already explored this topic in Chapter 8, Ingesting and Transforming Data. Please refer to that chapter to refresh your understanding of handling missing data.

Let's next look at how to handle late-arriving data.

Handling late-arriving data

We haven't yet covered this scenario, so let's dive deeper into handling late-arriving data.

A late-arriving data scenario can be considered at three different stages in a data pipeline – during the data ingestion phase, the transformation phase, and the serving phase.

Handling late-arriving data in the ingestion/transformation stage

During the ingestion and transformation phases, the activities usually include copying data into the data lake and performing data transformations using engines such as Spark, Hive, and so on. In such scenarios, the following two methods can be used:

  • Drop the data, if your application can handle some amount of data loss. This is the easiest option: keep a record of the last timestamp that has been processed, and if newly arriving data carries an older timestamp, simply ignore that record and move on (see the sketch after this list).
  • Rerun the pipeline from the ADF Monitoring tab, if your application cannot handle...
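Here is a minimal PySpark sketch of the first option, dropping records that arrive with a timestamp older than the last processed watermark. The eventTime column, the watermark value, and the paths are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DropLateRecords").getOrCreate()

# The last timestamp that has already been processed, for example read from a control table.
last_processed_ts = "2022-01-31 23:59:59"

incoming = spark.read.parquet("abfss://iac@datalake.dfs.core.windows.net/staging/trips/")

# Keep only records newer than the watermark; late (older) records are simply ignored.
on_time = incoming.filter(F.col("eventTime") > F.lit(last_processed_ts))
on_time.write.mode("append").parquet("abfss://iac@datalake.dfs.core.windows.net/processed/trips/")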

Upserting data

Upsert refers to UPDATE or INSERT transactions in data stores. The data stores could be relational, key-value, or any other store that supports the concept of updating rows or blobs.
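Conceptually, an upsert against a relational store boils down to a MERGE statement. The following is a small illustrative sketch using pyodbc; the table, columns, and connection details are hypothetical:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=<server>;DATABASE=<db>;UID=<user>;PWD=<password>"
)
cursor = conn.cursor()

# Update the row if the key already exists, otherwise insert it.
cursor.execute(
    """
    MERGE dbo.DailyRevenue AS target
    USING (SELECT ? AS zipCode, ? AS tripDate, ? AS revenue) AS source
        ON target.zipCode = source.zipCode AND target.tripDate = source.tripDate
    WHEN MATCHED THEN
        UPDATE SET target.revenue = source.revenue
    WHEN NOT MATCHED THEN
        INSERT (zipCode, tripDate, revenue) VALUES (source.zipCode, source.tripDate, source.revenue);
    """,
    ("10001", "2022-01-31", 4520.75),
)
conn.commit()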

ADF supports upsert operations if the sink is a SQL-based store. The only additional requirement is that the sink activity must be preceded by an Alter Row operation. Here is an example screenshot of an ADF sink with Allow upsert enabled.

Figure 9.29 – Upsert operation in ADF

Once you have saved the preceding setup, ADF will automatically do an upsert if a row already exists in the configured sink. Let's next look at how to regress to a previous state.

Regressing to a previous state

Regressing to a previous state or rolling back to a stable state is a very commonly used technique in databases and OLTP scenarios. In OLTP scenarios, the transformation instructions are grouped together into a transaction and if any of the instructions fail or reach an inconsistent state then the entire transaction rolls back. Although databases provide such functionality, we don't have such ready-made support in Azure Data Factory or Oozie (HDInsight) today. We will have to build our own rollback stages depending on the activity. Let's look at an example of how to do a rollback of a data copy activity in ADF.

ADF provides options for checking consistency and setting limits for fault tolerance. You can enable them in the Settings options of a copy activity as shown in the following screenshot.

Figure 9.30 – Enabling consistency verification and fault tolerance in an ADF copy activity

If the activity fails...
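When building your own rollback stage, one simple approach is a cleanup step that deletes the partially written output so the pipeline can be rerun from a clean state. Here is a rough sketch using the Azure Data Lake Storage Gen2 SDK for Python; the storage account, file system, and folder names are placeholders:

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
file_system = service.get_file_system_client("iac")

# Remove the partially written output folder for the failed run.
file_system.delete_directory("curated/daily_revenue/2022-01-31")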

Introducing Azure Batch

Azure Batch is an Azure service that can be used to perform large-scale parallel batch processing. It is typically used for high-performance computing applications such as image analysis, 3D rendering, genome sequencing, optical character recognition, and so on.

Azure Batch consists of three main components:

  • Resource management: This takes care of node management (things such as VMs and Docker containers), autoscaling, low-priority VM management, and application management. Applications are just ZIP files containing all the executables, libraries, and config files required to run the batch job.
  • Process management: This takes care of the job and task scheduling, retrying failed jobs, enforcing constraints on jobs, and so on. A job is a logical unit of work. A job is split into tasks that can run in parallel on the nodes from the VM or container pool.
  • Resource and process monitoring: This takes care of all the monitoring aspects. There are several...
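To make these components a bit more concrete, here is a minimal sketch using the azure-batch Python SDK that creates a small pool, a job, and a task. The account details, VM size, image, and IDs are placeholders chosen for illustration:

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials
import azure.batch.models as batchmodels

credentials = SharedKeyCredentials("<batch-account>", "<batch-account-key>")
batch_client = BatchServiceClient(credentials, batch_url="https://<batch-account>.<region>.batch.azure.com")

# Resource management: a small pool of Ubuntu VMs.
pool = batchmodels.PoolAddParameter(
    id="iac-pool",
    vm_size="standard_d2s_v3",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="0001-com-ubuntu-server-focal", sku="20_04-lts", version="latest"
        ),
        node_agent_sku_id="batch.node.ubuntu 20.04",
    ),
)
batch_client.pool.add(pool)

# Process management: a job and a task that runs on the pool.
batch_client.job.add(batchmodels.JobAddParameter(id="iac-job", pool_info=batchmodels.PoolInformation(pool_id="iac-pool")))
batch_client.task.add("iac-job", batchmodels.TaskAddParameter(id="task-1", command_line="/bin/bash -c 'echo hello'"))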

Configuring the batch size

Batch size in Azure Batch refers to both the size of the Batch pool (the number of nodes) and the size of the VMs in that pool. In this section, we will explore how to determine the right batch size. The following guidelines are generic enough that they can be applied to other services, such as Spark and Hive, too.

Here are some of the points to consider while deciding on the batch size:

  • Application requirements: Based on whether the application is CPU-intensive, memory-intensive, storage-intensive, or network-intensive, you will have to choose the right types of VMs and the right sizes. You can find all the supported VM sizes using the following Azure CLI command (here, centralus is an example):
    az batch location list-skus --location centralus
  • Data profile: If you know how your input data is spread, it will help in deciding the VM sizes that will be required. We will have to plan for the highest amount of data that will be processed by each of the VMs (a rough sizing sketch follows this list).
  • ...
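As a back-of-the-envelope illustration of the data profile point, the pool size can be estimated from the peak data volume and how much data one node of the chosen VM size can comfortably process within the batch window (all numbers below are made up):

import math

total_input_gb = 500      # expected peak daily input volume
gb_per_node = 64          # what one node of the chosen VM size can process within the batch window
nodes_needed = math.ceil(total_input_gb / gb_per_node)
print(f"Target pool size: {nodes_needed} nodes")   # Target pool size: 8 nodes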

Scaling resources

Scaling refers to the process of increasing or decreasing the compute, storage, or network resources to improve the performance of jobs or reduce expenses. There are two types of scaling: Manual and Automatic. As might be obvious, with manual scaling, we decide on the size beforehand. With automatic scaling, the service dynamically decides on the size of the resources based on various factors, such as the load on the cluster, the cost of running the cluster, time constraints, and more.

Let's explore the scaling options available in Azure Batch and then quickly glance at the options available in Spark and SQL too.

Azure Batch

Azure Batch provides one of the most flexible autoscale options. It lets you specify your own autoscale formula. Azure Batch will then use your formula to decide how many resources to scale up or down to.

A scaling formula can be written based on the following:

  • Time metrics: Using application stats collected at 5-minute...
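As a hedged illustration, the following applies a custom autoscale formula through the azure-batch Python SDK. The pool name and thresholds are hypothetical; $PendingTasks and $TargetDedicatedNodes are variables from the Batch autoscale formula language:

import datetime
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

batch_client = BatchServiceClient(
    SharedKeyCredentials("<batch-account>", "<batch-account-key>"),
    batch_url="https://<batch-account>.<region>.batch.azure.com",
)

# Scale the pool toward the average number of pending tasks, capped at 20 dedicated nodes.
formula = """
pending = avg($PendingTasks.GetSample(TimeInterval_Minute * 5));
$TargetDedicatedNodes = min(20, pending);
$NodeDeallocationOption = taskcompletion;
"""

batch_client.pool.enable_auto_scale(
    pool_id="iac-pool",
    auto_scale_formula=formula,
    auto_scale_evaluation_interval=datetime.timedelta(minutes=5),
)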

Configuring batch retention

The default retention time for tasks in Azure Batch is 7 days unless the compute node is removed or lost. We can, however, set the required retention time while adding a job.

Here is an example using REST APIs. The retentionTime needs to be set in the request body as shown:

POST account.region.batch.azure.com/jobs?api-version=2021-06-01.14.0

Examine the following request body:

{
  "id": "jobId",
  "priority": 100,
  "jobManagerTask": {
    "id": "taskId",
    "commandLine": "test.exe",
    "constraints": {
      "retentionTime": "PT1H"
    }
  }
}

PT1H specifies 1 hour and uses the ISO_8601 format. You can learn more about the format here: https://en.wikipedia.org/wiki/ISO_8601#Durations.
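If you use the azure-batch Python SDK instead of raw REST, the same constraint can be expressed with a timedelta. Here is a small sketch mirroring the request body above; the pool ID is a placeholder, and the job would be submitted with a client created as shown earlier:

import datetime
import azure.batch.models as batchmodels

job = batchmodels.JobAddParameter(
    id="jobId",
    priority=100,
    pool_info=batchmodels.PoolInformation(pool_id="<pool-id>"),
    job_manager_task=batchmodels.JobManagerTask(
        id="taskId",
        command_line="test.exe",
        constraints=batchmodels.TaskConstraints(retention_time=datetime.timedelta(hours=1)),  # PT1H
    ),
)
# batch_client.job.add(job)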

...

Designing and configuring exception handling

Azure Batch provides error codes, logs, and monitoring events to identify and handle errors. Once the errors are identified, we can handle them programmatically via APIs and .NET code.

Here are some examples of error codes returned by Batch:

Figure 9.38 – Sample Batch error codes

You can get the complete list of error codes here: https://docs.microsoft.com/en-us/rest/api/batchservice/batch-status-and-error-codes.

Next, let's look at some common error types in Azure Batch.

Types of errors

There are four common groups of errors:

  • Application errors: Azure Batch writes a task's standard output and standard error to the stdout.txt and stderr.txt files in the task directory on the compute node. We can parse these files to identify the issue and take remedial measures (see the retrieval sketch after this list).
  • Task errors: A task is considered failed if it returns a non-zero exit code. The failure could happen due...
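As a small illustration of the first bullet, the stderr.txt of a failed task can be pulled back with the azure-batch Python SDK; the account details and the job/task IDs below are placeholders:

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

batch_client = BatchServiceClient(
    SharedKeyCredentials("<batch-account>", "<batch-account-key>"),
    batch_url="https://<batch-account>.<region>.batch.azure.com",
)

# Stream the error output of the failed task and print it for inspection.
chunks = batch_client.file.get_from_task("iac-job", "task-1", "stderr.txt")
print(b"".join(chunks).decode("utf-8", errors="replace"))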

Handling security and compliance requirements

Security and compliance will always remain one of the core requirements for any cloud-based system. Azure provides a service called Azure Policy to enable and enforce compliance and security policies in any of the Azure services. In our case, it could be Azure Synapse, Azure Batch, VMs, VNets, and so on. Azure Policy helps enforce policies and remedial actions at scale.

Azure Policy contains pre-defined policy rules called built-ins. For example, one of the rules could be Allow only VMs of a particular type to be created in my subscription. When this policy is applied, if anyone tries to create a VM of a different SKU, the policy will fail the VM creation. It will show an error saying Not allowed by policy at the validation screen for the resource creation.

Azure Policy has a huge list of predetermined policies and remedial actions for different compliance use cases. You can choose the policies that are relevant to your application...

Summary

I hope you now have a good idea about both batch pipelines and the Azure Batch service. We learned about creating end-to-end batch pipelines by diving deep into each of the stages, such as ingestion, transformations, BI integrations, and so on. We then learned about a new service called Azure Batch and learned about batch retention, handling errors, handling autoscale, building data pipelines using Batch, and more. We also learned about some of the critical security and compliance aspects. That is a lot of information to chew on. Just try to glance through the chapter once again if you have any doubts.

We will next be focusing on how to design and develop a stream processing solution.
