Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing

Welcome to the final chapter in the Monitoring and Optimizing Data Storage and Data Processing section of the syllabus. The only chapter left after this is the revision and sample questions for the certification. Congratulations on making it this far; you are now just a hop away from earning your certification.

In this chapter, we will focus on optimization and troubleshooting techniques for data storage and data processing technologies. We will start with the topics for optimizing Spark and Synapse SQL queries using techniques such as compacting small files, handling UDFs, data skews, shuffles, indexing, cache management, and more. We will also look at techniques for troubleshooting Spark and Synapse pipelines, along with general guidelines for optimizing any analytical pipeline. Once you complete this chapter, you will have the knowledge to debug performance issues or troubleshoot failures in pipelines...

Technical requirements

For this chapter, you will need the following:

  • An Azure account (free or paid)
  • An active Azure Data Factory or Synapse workspace
  • An active Azure Databricks workspace

Let's get started!

Compacting small files

Small files are the nightmare of big data processing systems. Analytical engines such as Spark, Synapse SQL, and Hive, and cloud storage systems such as Blob storage and ADLS Gen2, are all inherently optimized for big files. Hence, to make our data pipelines efficient, it is better to merge or compact the small files into bigger ones. This can be achieved in Azure using Azure Data Factory or Synapse Pipelines. Let's look at an example using Azure Data Factory to concatenate a bunch of small CSV files in a directory into one big file. The steps for Synapse pipelines will be very similar:

  1. From the Azure Data Factory portal, select the Copy Data activity as shown in the following screenshot. In the Source tab, either choose an existing source dataset or create a new one, pointing to the data storage where the small files are present. Next, choose the Wildcard file path option for File Path type. In the Wildcard Paths field, provide a folder path ending with...
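While the chapter demonstrates compaction with the Copy Data activity in the portal, the same idea can also be sketched in a few lines of PySpark from a Synapse or Databricks notebook. The following is a minimal, hedged sketch; the storage account, container, and folder paths are placeholders, not paths from the book's examples:

# Assumes a SparkSession named `spark` is available, as in a Synapse or
# Databricks notebook; otherwise, create one as shown below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CompactSmallFiles").getOrCreate()

# Placeholder paths -- replace with your own storage account and folders.
input_path = "abfss://data@yourstorageaccount.dfs.core.windows.net/trips/small-files/*.csv"
output_path = "abfss://data@yourstorageaccount.dfs.core.windows.net/trips/compacted/"

# Read all the small CSV files in one go.
df = spark.read.option("header", "true").csv(input_path)

# coalesce(1) merges everything into a single output file. For very large
# datasets, use a higher partition count so each output file stays a few
# hundred MB in size.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_path)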

Rewriting user-defined functions (UDFs)

User-defined functions are custom functions that can be defined in databases and in certain analytical and streaming engines such as Spark and Azure Stream Analytics. An example of a UDF could be a custom function to check whether a given value is a valid email address.
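To make this concrete, here is a minimal sketch of such an email-validation UDF written for PySpark. The function, column, and table names are illustrative and not taken from the book's examples:

# Assumes a SparkSession named `spark` is available, as in a notebook session.
import re

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def is_valid_email(value):
    # Deliberately simple pattern; production code may need a stricter check.
    return bool(value) and re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", value) is not None

is_valid_email_udf = udf(is_valid_email, BooleanType())

# Illustrative usage: flag invalid email addresses in a driver table.
# drivers_df = spark.read.table("DimDriver")
# drivers_df.withColumn("validEmail", is_valid_email_udf("email")).show()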

The DP-203 syllabus just mentions the topic as Rewriting user-defined functions. In a literal sense, this is just dropping a UDF and recreating it, as we do for SQL or Spark tables. However, I believe the syllabus committee was probably referring to rewriting normal repetitive scripts as UDFs to make them efficient from a development perspective. Note that UDFs can also decrease runtime performance if not designed correctly. Let's look at the ways to create UDFs in SQL, Spark, and Streaming.

Writing UDFs in Synapse SQL Pool

You can create a user-defined function in Synapse SQL using the CREATE FUNCTION command. Here is an example of using the...

Handling skews in data

A data skew refers to an extreme, uneven distribution of data in a dataset. Let's take the number of trips per month from our Imaginary Airport Cab (IAC) example. Let's assume the data distribution shown in the following graph:

Figure 14.8 – An example of skewed data

As you can see from the graph, the trip numbers for November and December are quite high compared to the other months. Such an uneven distribution of data is referred to as a data skew. Now, if we were to distribute the monthly data to individual compute nodes, the nodes processing the data for November and December would take a lot longer than the ones processing the other months. And if we were generating an annual report, all the other stages would have to wait for the November and December stages to complete. Such wait times are inefficient for job performance. To make the processing more efficient, we will have...
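One common Spark technique for spreading such hot keys across more tasks is salting: adding an artificial sub-key so the skewed values no longer land on a single partition. The following is a hedged sketch of a two-step (salted) aggregation; the FactTrips table and tripMonth column are illustrative names, not the book's datasets:

# Assumes a SparkSession named `spark` is available.
from pyspark.sql import functions as F

SALT_BUCKETS = 8  # number of sub-partitions to split each hot key into

trips_df = spark.read.table("FactTrips")

# Add a random salt so that November/December rows are spread across
# several partitions instead of one.
salted = trips_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Aggregate in two steps: first per (month, salt), then per month.
partial = salted.groupBy("tripMonth", "salt").agg(F.count("*").alias("partialCount"))
monthly = partial.groupBy("tripMonth").agg(F.sum("partialCount").alias("tripCount"))

Salting is most valuable when the skewed key feeds a join or another wide operation; for plain aggregations, Spark's built-in partial aggregation already reduces the impact of skew.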

Handling data spills

A data spill refers to the situation where a compute engine such as SQL or Spark, while executing a query, is unable to hold the required data in memory and writes (spills) it to disk. This results in increased query execution time due to expensive disk reads and writes. Spills can occur for any of the following reasons:

  • The data partition size is too big.
  • The compute resource size is small, especially the memory.
  • The exploded data size during merges, unions, and so on exceeds the memory limits of the compute node.

The solutions for handling data spills are as follows:

  • Increase the compute capacity, especially the memory, if possible. This will incur higher costs but is the easiest of the options.
  • Reduce the data partition sizes, and repartition if necessary (see the sketch after this list). This is more effort-intensive, as repartitioning takes time and effort. If you are not able to afford the higher compute resources, then reducing the data partition sizes is...
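As an illustration of the second option, here is a minimal PySpark sketch that repartitions a large DataFrame before a memory-heavy aggregation so that each partition stays small enough to fit in memory. The path, column names, and target partition count are placeholders to be tuned for your own workload:

# Assumes a SparkSession named `spark` is available.
from pyspark.sql import functions as F

big_df = spark.read.parquet("abfss://data@yourstorageaccount.dfs.core.windows.net/trips/")

# Check how the data is currently split.
print("Current partitions:", big_df.rdd.getNumPartitions())

# Repartition into more, smaller partitions. If the downstream operation is a
# join or groupBy, partition on that key so the shuffle does useful work.
repartitioned = big_df.repartition(400, "tripMonth")

result = repartitioned.groupBy("tripMonth").agg(F.sum("fare").alias("totalFare"))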

Tuning shuffle partitions

Spark uses a technique called shuffle to move data between its executors or nodes while performing operations such as join, union, groupBy, and reduceByKey. The shuffle operation is very expensive as it involves the movement of data between nodes. Hence, it is usually preferable to reduce the amount of shuffling involved in a Spark query. The number of partition splits that Spark performs while shuffling data is determined by the following configuration:

spark.conf.set("spark.sql.shuffle.partitions", 200)

200 is the default value, and you can tune it to the number that suits your query best. If you have too much data and too few partitions, this might result in fewer but longer-running tasks. On the other hand, if you have too little data and too many shuffle partitions, the overhead of the many small shuffle tasks will degrade performance. So, you will have to run your query multiple times with different shuffle partition numbers to arrive at an optimum number.
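One rough but practical way to do this is to run the same query under a few different settings and compare the wall-clock times. The table and column names in this sketch are illustrative:

# Assumes a SparkSession named `spark` is available.
import time

from pyspark.sql import functions as F

trips_df = spark.read.table("FactTrips")

for num_partitions in [50, 200, 800]:
    spark.conf.set("spark.sql.shuffle.partitions", num_partitions)
    start = time.time()
    # count() forces the aggregation (and its shuffle) to actually execute.
    trips_df.groupBy("tripMonth").agg(F.sum("fare")).count()
    print(f"shuffle.partitions={num_partitions}: {time.time() - start:.1f}s")

Keep in mind that caching and cluster warm-up can distort a single run, so repeat the measurements a few times before settling on a value.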

You can...

Finding shuffling in a pipeline

As we learned in the previous section, shuffling data is a very expensive operation and we should try to reduce it as much as possible. In this section, we will learn how to identify shuffles in the query execution path for both Synapse SQL and Spark.

Identifying shuffles in a SQL query plan

To identify shuffles, print the query plan using the EXPLAIN statement. Here is an example.

Consider a Synapse SQL table, DimDriver, as shown in the following screenshot:

Figure 14.11 – Sample DimDriver table

Here is a sample EXPLAIN statement:

EXPLAIN WITH_RECOMMENDATIONS
SELECT
    [gender], SUM([salary]) AS Totalsalary
FROM
    dbo.DimDriver
GROUP BY
    [gender]

This will generate a plan similar to the one shown in the following screenshot....
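On the Spark side, shuffles appear as Exchange operators in the physical plan, which you can print with DataFrame.explain(). Here is a hedged sketch that mirrors the SQL query above; the table name assumes DimDriver has been made available to Spark and is purely illustrative:

# Assumes a SparkSession named `spark` is available.
from pyspark.sql import functions as F

driver_df = spark.read.table("DimDriver")

plan_df = driver_df.groupBy("gender").agg(F.sum("salary").alias("TotalSalary"))

# Prints the physical plan; look for "Exchange hashpartitioning(...)" lines,
# which indicate that data will be shuffled between executors.
plan_df.explain()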

Optimizing resource management

Optimizing resource management in this context refers to reducing your billing expenses while using Azure analytics services. Here are some general techniques that can help.

Optimizing Synapse SQL pools

Here are a few suggestions for Synapse dedicated SQL pools:

  • Since the storage and compute are decoupled, you can pause your SQL pool compute when not in use. This will not impact your data but will save you costs.
  • Use the right size of compute units. In the SQL pool, the compute units are defined in terms of Data Warehouse Units (DWUs). You can start with the smallest DWU and then gradually increase to higher DWUs to strike the right balance between cost and performance.
  • Manually scale out or scale in the compute resources based on the load. You can also automate scaling out and in using Azure Functions.

You can learn more about resource management optimizations for the SQL pool here: https://docs.microsoft...

Tuning queries by using indexers

Indexing is another common optimization technique used in database systems, data warehouses, and analytical engines such as Spark. Let's look at the indexing options and tuning guidelines for both Synapse SQL and Spark.

Indexing in Synapse SQL

If you remember, we learned about the different types of indexing in Chapter 5, Implementing Physical Data Storage Structures, in the Implementing different table geometries with Azure Synapse Analytics pools section. I'll recap the different types of indexers here, along with tips for tuning Synapse SQL.

There are three primary types of indexing available in Synapse SQL:

  • Clustered Columnstore Index: This is the default index option for Synapse SQL. If you don't specify any indexing options, the table will automatically get indexed using this method. Use this index for large tables (more than 100 million rows). It provides very high levels of data compression and good overall...

Tuning queries by using cache

Caching is a well-known method for improving read performance in databases. Synapse SQL supports a feature called Result set caching. As the name implies, this enables the results of a query to be cached and reused if the query doesn't change. Once result set caching is enabled, subsequent query executions fetch the results directly from the cache instead of recomputing them. The result set cache is only used under the following conditions:

  • The query being considered is an exact match of a previously cached query.
  • There are no changes to the underlying data or schema.
  • The user has the right set of permissions to the tables referenced in the query.

You can enable result set caching at the database level in Synapse SQL using the following SQL statement:

ALTER DATABASE [database_name]
SET RESULT_SET_CACHING ON;

You can also turn result set caching on from within a session by using the following command:

SET RESULT_SET_CACHING { ON | OFF };

Note...

Optimizing pipelines for analytical or transactional purposes

You have surely heard the terms OLAP and OLTP if you have been working in the data domain. Cloud data systems can be broadly classified as either Online Transaction Processing (OLTP) or Online Analytical Processing (OLAP) systems. Let's understand each of these at a high level.

OLTP systems

OLTP systems, as the name suggests, are built to efficiently process, store, and query transactions. They usually have transaction data flowing into a central ACID-compliant database. The databases contain normalized data that adheres to strict schemas. The data sizes are usually smaller, in the range of gigabytes or terabytes. Predominantly, RDBMS-based systems such as Azure SQL and MySQL are used for the main database.

OLAP systems

On the other hand, OLAP systems are usually big data systems that typically have a warehouse or a key-value-based store as the central technology to perform analytical processing. The tasks...

Optimizing pipelines for descriptive versus analytical workloads

Data analytics is categorized into four different types:

  • Descriptive Analytics: The type of analytics that deals with the analysis of what happened and when it happened. Most BI reports, such as sales reports and trip reports, that display current and historical data points fall under this category. The analytics tasks would usually be counts, aggregates, filters, and so on.
  • Diagnostic Analytics: This type of analytics also covers the why, along with the what and when. Examples include Root Cause Analysis (RCA). Apart from identifying what happened, we also delve deeper into the logs or metrics to identify why it happened. For example, you could be looking at why a certain cab route is seeing a dip in revenue, or why a particular type of VM is failing sooner than others by looking into the load on those machines.
  • Predictive Analytics: As the name suggests, this type of analytics refers to the...

Troubleshooting a failed Spark job

There are two aspects to troubleshooting a failed Spark job in a cloud environment: environmental issues and job issues. Let's look at both of these aspects in detail.

Debugging environmental issues

Here are some of the steps involved in checking environmental issues:

  1. Check the health of Azure services in the region where your Spark clusters are running by using this link: https://status.azure.com/en-us/status.
  2. Next, check whether your Spark cluster itself is fine. You can do this for your HDInsight clusters by checking the Ambari home page. We saw how to check the status in Ambari in Chapter 13, Monitoring Data Storage and Data Processing, in the Monitoring overall cluster performance section. Here is the Ambari home page again for your reference:

Figure 14.23 – Ambari home page showing the status of the cluster

  3. Check whether any service is down or whether any of the resources...

Troubleshooting a failed pipeline run

Azure Data Factory and Synapse pipelines provide detailed error messages when pipelines fail. Here are three easy steps to debugging a failed pipeline:

  • Check Datasets: Click on Linked Services and then click on the Test connection link to ensure that the linked services are working fine and that nothing has changed on the source. Here is an example of how to use Test connection on the Edit linked service page.

Figure 14.27 – Using Test connection for linked services

  • Use data previews to check your transformations: Turn the Data flow debug mode on and check the data previews for each of your pipeline activities, starting from the data source. This will help narrow down the issue. Here is an example of how to use Data preview:

Figure 14.28 – Using the Data preview option to see whether the data is getting populated correctly

  • If the issues persist, run the pipeline...

Summary

Like the last chapter, this chapter also introduced a lot of new concepts. Some of these concepts, such as Spark debugging, optimizing shuffle partitions, and identifying and reducing data spills, will take a long time to master. These topics could be separate books on their own. I've tried my best to give you an overview of these topics with follow-up links. Please go through the links to learn more about them.

Let's recap what we learned in this chapter. We started with data compaction, as small files are very inefficient in big data analytics. We then learned about UDFs, and how to handle data skews and data spills in both SQL and Spark. We then explored shuffle partitions in Spark. We learned about using indexes and caching to speed up query performance. We also learned about HTAP, which is a new concept that merges OLAP and OLTP processing. We then explored general resource management tips for descriptive and analytical platforms. And finally, we wrapped...
