Azure Data Engineer Associate Certification Guide
Author: Newton Alex
Published in Feb 2022 by Packt | ISBN-13: 9781801816069 | 574 pages | 1st Edition

Table of Contents (23 chapters)

  • Preface
  • Part 1: Azure Basics
    • Chapter 1: Introducing Azure Basics
  • Part 2: Data Storage
    • Chapter 2: Designing a Data Storage Structure
    • Chapter 3: Designing a Partition Strategy
    • Chapter 4: Designing the Serving Layer
    • Chapter 5: Implementing Physical Data Storage Structures
    • Chapter 6: Implementing Logical Data Structures
    • Chapter 7: Implementing the Serving Layer
  • Part 3: Design and Develop Data Processing (25-30%)
    • Chapter 8: Ingesting and Transforming Data
    • Chapter 9: Designing and Developing a Batch Processing Solution
    • Chapter 10: Designing and Developing a Stream Processing Solution
    • Chapter 11: Managing Batches and Pipelines
  • Part 4: Design and Implement Data Security (10-15%)
    • Chapter 12: Designing Security for Data Policies and Standards
  • Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)
    • Chapter 13: Monitoring Data Storage and Data Processing
    • Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing
  • Part 6: Practice Exercises
    • Chapter 15: Sample Questions with Solutions
  • Other Books You May Enjoy

Chapter 13: Monitoring Data Storage and Data Processing

Welcome to the next chapter. We are now in the final leg of our certification training. This is the last section of the certification: Monitoring and Optimizing Data Storage and Data Processing. This section contains two chapters—the current one, Monitoring Data Storage and Data Processing, and the next chapter, Optimizing and Troubleshooting Data Storage and Data Processing. As this chapter's title suggests, we will be focusing on the monitoring aspect of data storage and pipelines. Once you complete this chapter, you should be able to set up monitoring for any of your Azure data services, set up custom logs, process logs using tools such as Azure Log Analytics, and understand how to read Spark directed acyclic graphs (DAGs). As with the previous chapters, I've taken the liberty of reordering the topic sequence to make reading more comfortable, without too many context switches.

In this chapter, we will cover...

Technical requirements

For this chapter, you will need the following:

  • An Azure account (free or paid)
  • An active Azure Data Factory (ADF) workspace

Let's get started!

Implementing logging used by Azure Monitor

Azure Monitor is the service we use to monitor infrastructure, services, and applications. Azure Monitor records two types of data: metrics and logs. Metrics are numerical values that describe an entity or an aspect of a system at different instances of time—for example, the number of gigabytes (GBs) of data stored in a storage account at any point in time, the current number of active pipelines in ADF, and so on. Metrics are stored in time-series databases and can be easily aggregated for alerting, reporting, and auditing purposes.
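To make this concrete, if you route a storage account's platform metrics to a Log Analytics workspace through a diagnostic setting, every metric sample is stored as a timestamped row that you can query with Kusto Query Language (KQL). The following is only a minimal sketch; the AzureMetrics table and the UsedCapacity metric are standard, but the exact columns you see depend on the resource and on how diagnostics are configured:

    // Raw metric samples for storage capacity over the last day
    AzureMetrics
    | where TimeGenerated > ago(1d)
    | where ResourceProvider == "MICROSOFT.STORAGE" and MetricName == "UsedCapacity"
    | project TimeGenerated, Resource, MetricName, Average, UnitName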

Logs, on the other hand, are usually text details of what is happening in the system. Unlike metrics, which are recorded at regular intervals, logs are usually event-driven. For example, a user logging in to a system, a web app receiving a REpresentational State Transfer (REST) request, or a pipeline being triggered in ADF could all generate logs.
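To illustrate this event-driven nature, the following KQL sketch queries the AzureActivity table, which records control-plane events such as a user or service performing an operation on a resource. Treat it as illustrative; column names can vary slightly across schema versions:

    // Who did what, and when; each row is an event, not a sampled value
    AzureActivity
    | where TimeGenerated > ago(1d)
    | project TimeGenerated, OperationNameValue, Caller, ActivityStatusValue
    | order by TimeGenerated desc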

Since Azure Monitor is an independent service, it can aggregate...

Configuring monitoring services

Azure Monitor is enabled as soon as we create an Azure resource. By default, the basic metrics and logs are recorded without requiring any configuration changes from the user side, but we can perform additional configurations such as sending the logs to Log Analytics, as we saw in the previous section. We can configure monitoring at multiple levels, as outlined here:

  • Application monitoring—Metrics and logs about the applications that you have written on top of Azure services.
  • Operating system (OS) monitoring—OS-level metrics and logs, such as CPU usage, memory usage, disk usage, and so on.
  • Azure resource monitoring—Metrics and logs from Azure services such as Azure Storage, Synapse Analytics, Event Hubs, and more.
  • Subscription-level monitoring—Metrics and logs of Azure subscriptions, such as how many people are using a particular account, what the account usage looks like, and so on.
  • Tenant-level monitoring...

Understanding custom logging options

The Custom logs option in Azure Monitor helps to collect text-based logs that are not part of the standard logs collected by Azure Monitor, such as the system and event logs in Windows and their equivalents in Linux. In order to configure custom logs, the host machine must have the Log Analytics agent or the newer Azure Monitor Agent (AMA) installed on it. We just saw how to install the agents in the previous section.

Once we have ensured that the agents are in place, it is a very easy process to set up custom logs. Here are the steps:

  1. In the Log Analytics workspace, select the Custom logs section and click on the + Add custom log option, as illustrated in the following screenshot:

Figure 13.9 – Setting up a new custom log using Log Analytics

  2. In the wizard that follows, upload a sample log file so that the tool can parse it and understand the log format. Here is an example of such a log file:
...

Interpreting Azure Monitor metrics and logs

As we have seen in the introduction to Azure Monitoring, metrics and logs form the two main sources of data for monitoring. Let's explore how to view, interpret, and experiment with these two types of monitoring data.

Interpreting Azure Monitor metrics

The metrics data collected from Azure resources is usually displayed on the overview page of the resource itself, and more details are available under the Metrics tab. Here, again, is an example of how this looks for a storage account:

Figure 13.15 – Metrics data for a storage account

For each of the metrics, you can aggregate based on Sum, Avg, Min, and Max. The tool also provides the flexibility to overlay with additional metrics using the Add metric option, filter out unwanted data using the Add filter option, and so on. You can access the data for up to 30 days using this metrics explorer.
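If the same metrics are also being sent to Log Analytics, you can reproduce the explorer's Sum, Avg, Min, and Max views in KQL and retain the results beyond the explorer's 30-day window. Here is a minimal sketch, assuming the AzureMetrics table is populated for the storage account:

    // Hourly average and peak of the storage capacity metric
    AzureMetrics
    | where MetricName == "UsedCapacity"
    | summarize avg(Average), max(Maximum) by bin(TimeGenerated, 1h)
    | order by TimeGenerated asc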

Let's next see how to interpret logs.

Interpreting...

Measuring the performance of data movement

ADF provides a rich set of performance metrics under its Monitoring tab. In the following example, we have a sample Copy Data activity as part of a pipeline called FetchDataFromBlob, which copies data from Blob storage into Azure Data Lake Storage Gen2 (ADLS Gen2). If you click on the Pipeline runs tab under the Monitoring tab, you will see the details of each pipeline run. If you click on any of the activities in a run, you will see its diagnostic details:

Figure 13.18 – Data movement performance details

This is how you can monitor the performance of data movement. You can learn more about Copy Data monitoring here: https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-monitoring.
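If you have also configured a diagnostic setting that sends ADF logs to a Log Analytics workspace in resource-specific mode, the same Copy activity details can be queried there. The following is a hedged sketch; the ADFActivityRun table and its Start, End, and Status columns assume the resource-specific schema:

    // Recent successful Copy activity runs and their duration in minutes
    ADFActivityRun
    | where ActivityType == "Copy" and Status == "Succeeded"
    | extend DurationMin = datetime_diff('minute', End, Start)
    | project TimeGenerated, PipelineName, ActivityName, DurationMin
    | order by TimeGenerated desc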

Let's next look at how to monitor overall pipeline performance.

Monitoring data pipeline performance

Similar to the data movement metrics we saw in the previous section, ADF provides metrics for overall pipelines too. In the Pipeline runs page under the Monitoring tab, if you hover over the pipeline runs, a small Consumption icon appears, as shown in the following screenshot:

Figure 13.19 – Consumption icon in the Pipeline runs screen

If you click on that icon, ADF shows the pipeline consumption details. Here is a sample screen:

Figure 13.20 – Pipeline resource consumption details screen

You can also get additional metrics about each of the runs from the Gantt chart section. You can change the view from List to Gantt, as shown in the following screenshot:

Figure 13.21 – Additional pipeline details in the Gantt chart page
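The same run history can also be analyzed outside the portal if ADF diagnostic logs are routed to Log Analytics, which is useful when you want to aggregate across many runs. A small sketch, assuming the resource-specific ADFPipelineRun table is available:

    // Average and slowest successful run duration per pipeline over the last 7 days
    ADFPipelineRun
    | where TimeGenerated > ago(7d) and Status == "Succeeded"
    | extend DurationMin = datetime_diff('minute', End, Start)
    | summarize avg(DurationMin), max(DurationMin) by PipelineName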

Note

ADF only maintains pipeline execution details and metrics for 45 days. If you need to analyze the pipeline data for more than...

Monitoring and updating statistics about data across a system

Statistics is an important concept in query optimization. Generating statistics is the process of collecting metadata about your data—such as the number of rows, the size of tables, and so on—which can be used as additional input by the SQL engine to optimize query plans. For example, if two tables have to be joined and one of them is very small, the SQL engine can use this statistical information to pick a query plan that works best for such unevenly sized tables. The Synapse SQL pool engine uses a cost-based optimizer (CBO), which chooses the least expensive query plan from the set of plans that can be generated for a given SQL script.
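To make this concrete, here is a hedged T-SQL sketch of creating and refreshing statistics on a dedicated SQL pool table; the table and column names (dbo.FactSales, CustomerId) are hypothetical:

    -- Create single-column statistics so the optimizer knows the value distribution
    CREATE STATISTICS stats_FactSales_CustomerId
    ON dbo.FactSales (CustomerId) WITH FULLSCAN;

    -- Refresh all statistics on the table after a large data load
    UPDATE STATISTICS dbo.FactSales;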

Let's look at how to create statistics for both Synapse dedicated and serverless pools.

Creating statistics for Synapse dedicated pools

You can enable statistics in Synapse SQL dedicated pools using the following command:

ALTER...

Measuring query performance

Query performance is a very interesting topic in databases and analytical engines such as Spark and Hive. You will find tons of books and articles written on these topics. In this section, I'll try to give an overview of how to monitor query performance in Synapse dedicated SQL pools and Spark. In the next chapter, we will focus on how to actually optimize the queries. I've provided links for further reading in each section so that you can learn more about the techniques if you wish.
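For Synapse dedicated SQL pools, one quick way to measure individual query performance is through the request DMVs. The following is a minimal sketch using the sys.dm_pdw_exec_requests view, which is available in dedicated SQL pools and reports elapsed time in milliseconds:

    -- Ten longest-running recent requests
    SELECT TOP 10
        request_id,
        status,
        submit_time,
        total_elapsed_time,
        command
    FROM sys.dm_pdw_exec_requests
    ORDER BY total_elapsed_time DESC;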

For measuring the performance of any SQL-based queries, it is recommended to set up the Transaction Processing Performance Council's TPC-H or TPC-DS benchmarking suites and run them on a regular basis to identify any regressions in the platform. TPC-H and TPC-DS are industry-standard benchmarking test suites. If you are interested in learning more about them, please follow these links:

Interpreting a Spark DAG

A DAG is just a regular graph with nodes and edges but with no cycles or loops. In order to understand a Spark DAG, we first have to understand where a DAG comes into the picture during the execution of a Spark job.

When a user submits a Spark job, the Spark driver first identifies all the tasks involved in accomplishing the job. It then figures out which of these tasks can be run in parallel and which tasks depend on other tasks. Based on this information, it converts the Spark job into a graph of tasks. Nodes at the same level indicate tasks that can be run in parallel, and nodes at different levels indicate tasks that need to run after the previous ones have finished. This graph is acyclic, as denoted by the A in DAG. The DAG is then converted into a physical execution plan, in which nodes at the same level are grouped into stages. Once all the tasks and stages are complete, the Spark job is considered complete.
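To connect this to something you can run, here is a small, hedged PySpark sketch with a hypothetical input path and column names. The filter is a narrow transformation that stays inside a stage, while the groupBy forces a shuffle and therefore a stage boundary, which is exactly what shows up as separate stages in the DAG view of the Spark UI:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dag-demo").getOrCreate()

    # Hypothetical input path and schema
    df = spark.read.parquet("abfss://data@mystorageaccount.dfs.core.windows.net/trips/")

    # Narrow transformation: no shuffle, stays within the same stage
    filtered = df.filter(F.col("distance_km") > 1)

    # Wide transformation: groupBy triggers a shuffle, creating a new stage
    summary = filtered.groupBy("city").agg(F.avg("distance_km").alias("avg_distance"))

    # The action triggers the job; the resulting DAG can be inspected in the Spark UI
    summary.show()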

Let's...

Monitoring cluster performance

Services such as Synapse and ADF are platform-as-a-service (PaaS) offerings, so you do not have explicit control over their clusters. The one place where you can control each and every aspect of a cluster is the Azure HDInsight service. In HDInsight, you can create your own Hadoop, Spark, Hive, HBase, and other clusters and control every aspect of them. You can use Log Analytics to monitor cluster performance, as with the other examples we saw earlier in the chapter. You can learn more about using Log Analytics in HDInsight here: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-oms-log-analytics-tutorial.
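As one hedged example of the Log Analytics route, if the cluster nodes report to a workspace through the Log Analytics agent, the standard Perf table captures node-level counters such as CPU usage, which you can then chart per node:

    // Average CPU usage per cluster node in 15-minute buckets
    Perf
    | where ObjectName == "Processor" and CounterName == "% Processor Time"
    | summarize avg(CounterValue) by Computer, bin(TimeGenerated, 15m)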

Apart from the Log Analytics approach, there are four main areas of the HDInsight portal that help monitor cluster performance. Let's look at them.

Monitoring overall cluster performance

The HDInsight Ambari dashboard is the first place to check for cluster health. If you see very high heap usage, disk usage...

Scheduling and monitoring pipeline tests

In Chapter 11, Managing Batches and Pipelines, we briefly introduced Azure DevOps for version control. Azure DevOps provides another feature called Azure Pipelines, which can be used to create continuous integration/continuous deployment (CI/CD) pipelines to deploy ADF. If you are not familiar with CI/CD, it is a method of continuously testing and deploying applications to the production environment in an automated manner. In this section, we will look at how to create, schedule, and monitor a CI/CD pipeline.

Note

At the time of writing this book, Azure DevOps Pipelines support for Synapse pipelines was not available; it was only available for ADF.

Here are the high-level steps to create a CI/CD pipeline using Azure Pipelines:

  1. Select Azure DevOps from the Azure portal. On the Azure DevOps page, select Releases under Pipelines and click the New Pipeline button. This will take you to a new screen, shown in the following screenshot. Choose the Empty job option:
...

Summary

This chapter introduced a lot of new technologies and techniques, and I hope you got a good grasp of them. Even though the number of technologies involved is high, the weightage of this chapter with respect to the certification is relatively low, so you may have noticed that I've kept the topics at a slightly higher level and provided further links for you to read more on them.

In this chapter, we started by introducing Azure Monitor and Log Analytics. We learned how to send log data to Log Analytics, how to define custom logging options, and how to interpret metrics and log data. After that, we focused on measuring the performance of data movements, pipelines, SQL queries, and Spark queries. We also learned how to interpret Spark DAGs, before moving on to monitoring cluster performance and pipeline tests. You should now be able to set up a monitoring solution for your data pipelines and be able to tell if your data movement, pipeline setups, cluster...
