Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing

Welcome to the final chapter in the Monitoring and Optimizing Data Storage and Data Processing section of the syllabus. The only chapter left after this is the revision and sample questions for the certification. Congratulations on making it this far; you are now just a hop away from earning your certification.

In this chapter, we will focus on optimization and troubleshooting techniques for data storage and data processing technologies. We will start with the topics for optimizing Spark and Synapse SQL queries using techniques such as compacting small files, handling UDFs, data skews, shuffles, indexing, cache management, and more. We will also look at techniques for troubleshooting Spark and Synapse pipelines, along with general guidelines for optimizing any analytical pipeline. Once you complete this chapter, you will have the knowledge to debug performance issues or troubleshoot failures in pipelines...

Technical requirements

For this chapter, you will need the following:

  • An Azure account (free or paid)
  • An active Azure Data Factory or Synapse workspace
  • An active Azure Databricks workspace

Let's get started!

Compacting small files

Small files are the nightmare of big data processing systems. Analytical engines such as Spark, Synapse SQL, and Hive, and cloud storage systems such as Blob storage and ADLS Gen2, are all inherently optimized for big files. Hence, to make our data pipelines efficient, it is better to merge or compact the small files into bigger ones. This can be achieved in Azure using Azure Data Factory or Synapse Pipelines. Let's look at an example using Azure Data Factory to concatenate a bunch of small CSV files in a directory into one big file. The steps for Synapse pipelines will be very similar:

  1. From the Azure Data Factory portal, select the Copy Data activity as shown in the following screenshot. In the Source tab, either choose an existing source dataset or create a new one, pointing to the data storage where the small files are present. Next, choose the Wildcard file path option for File Path type. In the Wildcard Paths field, provide a folder path ending with...
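While the chapter demonstrates compaction with the Copy Data activity in the portal, the same idea can also be sketched in a few lines of PySpark from a Synapse or Databricks notebook. The following is a minimal, hedged sketch; the storage account, container, and folder paths are placeholders, not paths from the book's examples:

# Assumes a SparkSession named `spark` is available, as in a Synapse or
# Databricks notebook; otherwise, create one as shown below.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CompactSmallFiles").getOrCreate()

# Placeholder paths -- replace with your own storage account and folders.
input_path = "abfss://data@yourstorageaccount.dfs.core.windows.net/trips/small-files/*.csv"
output_path = "abfss://data@yourstorageaccount.dfs.core.windows.net/trips/compacted/"

# Read all the small CSV files in one go.
df = spark.read.option("header", "true").csv(input_path)

# coalesce(1) merges everything into a single output file. For very large
# datasets, use a higher partition count so each output file stays a few
# hundred MB in size.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(output_path)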

Rewriting user-defined functions (UDFs)

User-defined functions are custom functions that can be defined in databases and in certain analytical and streaming engines such as Spark and Azure Stream Analytics. An example of a UDF could be a custom function to check whether a given value is a valid email address.
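To make this concrete, here is a minimal sketch of such an email-validation UDF written for PySpark. The function, column, and table names are illustrative and not taken from the book's examples:

# Assumes a SparkSession named `spark` is available, as in a notebook session.
import re

from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def is_valid_email(value):
    # Deliberately simple pattern; production code may need a stricter check.
    return bool(value) and re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", value) is not None

is_valid_email_udf = udf(is_valid_email, BooleanType())

# Illustrative usage: flag invalid email addresses in a driver table.
# drivers_df = spark.read.table("DimDriver")
# drivers_df.withColumn("validEmail", is_valid_email_udf("email")).show()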

The DP-203 syllabus just mentions the topic as Rewriting user-defined functions. In a literal sense, this is just dropping a UDF and recreating it, as we do for SQL or Spark tables. However, I believe the syllabus committee was probably referring to rewriting normal repetitive scripts as UDFs to make them efficient from a development perspective. Note that UDFs can also decrease runtime performance if not designed correctly. Let's look at the ways to create UDFs in SQL, Spark, and Streaming.

Writing UDFs in Synapse SQL Pool

You can create a user-defined function in Synapse SQL using the CREATE FUNCTION command. Here is an example of using the...

Handling skews in data

A data skew refers to an extreme, uneven distribution of data in a dataset. Let's take the number of trips per month from our Imaginary Airport Cab (IAC) example. Let's assume the data distribution shown in the following graph:

Figure 14.8 – An example of skewed data

As you can see from the graph, the trip numbers for November and December are quite high compared to the other months. Such an uneven distribution of data is referred to as a data skew. Now, if we were to distribute the monthly data to individual compute nodes, the nodes processing the data for November and December would take a lot longer than the ones processing the other months. And if we were generating an annual report, all the other stages would have to wait for the November and December stages to complete. Such wait times are inefficient for job performance. To make the processing more efficient, we will have...
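One common Spark technique for spreading such hot keys across more tasks is salting: adding an artificial sub-key so the skewed values no longer land on a single partition. The following is a hedged sketch of a two-step (salted) aggregation; the FactTrips table and tripMonth column are illustrative names, not the book's datasets:

# Assumes a SparkSession named `spark` is available.
from pyspark.sql import functions as F

SALT_BUCKETS = 8  # number of sub-partitions to split each hot key into

trips_df = spark.read.table("FactTrips")

# Add a random salt so that November/December rows are spread across
# several partitions instead of one.
salted = trips_df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Aggregate in two steps: first per (month, salt), then per month.
partial = salted.groupBy("tripMonth", "salt").agg(F.count("*").alias("partialCount"))
monthly = partial.groupBy("tripMonth").agg(F.sum("partialCount").alias("tripCount"))

Salting is most valuable when the skewed key feeds a join or another wide operation; for plain aggregations, Spark's built-in partial aggregation already reduces the impact of skew.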

Handling data spills

A data spill refers to the situation where a compute engine such as SQL or Spark, while executing a query, is unable to hold the required data in memory and writes (spills) it to disk. This results in increased query execution time due to expensive disk reads and writes. Spills can occur for any of the following reasons:

  • The data partition size is too big.
  • The compute resource size is small, especially the memory.
  • The exploded data size during merges, unions, and so on exceeds the memory limits of the compute node.

The solutions for handling data spills are as follows:

  • Increase the compute capacity, especially the memory, if possible. This will incur higher costs but is the easiest of the options.
  • Reduce the data partition sizes, and repartition if necessary (see the sketch after this list). This is more effort-intensive, as repartitioning takes time and effort. If you are not able to afford the higher compute resources, then reducing the data partition sizes is...
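As an illustration of the second option, here is a minimal PySpark sketch that repartitions a large DataFrame before a memory-heavy aggregation so that each partition stays small enough to fit in memory. The path, column names, and target partition count are placeholders to be tuned for your own workload:

# Assumes a SparkSession named `spark` is available.
from pyspark.sql import functions as F

big_df = spark.read.parquet("abfss://data@yourstorageaccount.dfs.core.windows.net/trips/")

# Check how the data is currently split.
print("Current partitions:", big_df.rdd.getNumPartitions())

# Repartition into more, smaller partitions. If the downstream operation is a
# join or groupBy, partition on that key so the shuffle does useful work.
repartitioned = big_df.repartition(400, "tripMonth")

result = repartitioned.groupBy("tripMonth").agg(F.sum("fare").alias("totalFare"))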

Tuning shuffle partitions

Spark uses a technique called shuffle to move data between its executors or nodes while performing operations such as join, union, groupBy, and reduceByKey. The shuffle operation is very expensive as it involves the movement of data between nodes. Hence, it is usually preferable to reduce the amount of shuffling involved in a Spark query. The number of partition splits that Spark performs while shuffling data is determined by the following configuration:

spark.conf.set("spark.sql.shuffle.partitions", 200)

200 is the default value, and you can tune it to the number that suits your query best. If you have too much data and too few partitions, this might result in fewer but longer-running tasks. On the other hand, if you have too little data and too many shuffle partitions, the overhead of the many small shuffle tasks will degrade performance. So, you will have to run your query multiple times with different shuffle partition numbers to arrive at an optimum number.
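One rough but practical way to do this is to run the same query under a few different settings and compare the wall-clock times. The table and column names in this sketch are illustrative:

# Assumes a SparkSession named `spark` is available.
import time

from pyspark.sql import functions as F

trips_df = spark.read.table("FactTrips")

for num_partitions in [50, 200, 800]:
    spark.conf.set("spark.sql.shuffle.partitions", num_partitions)
    start = time.time()
    # count() forces the aggregation (and its shuffle) to actually execute.
    trips_df.groupBy("tripMonth").agg(F.sum("fare")).count()
    print(f"shuffle.partitions={num_partitions}: {time.time() - start:.1f}s")

Keep in mind that caching and cluster warm-up can distort a single run, so repeat the measurements a few times before settling on a value.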

You can...

Finding shuffling in a pipeline

As we learned in the previous section, shuffling data is a very expensive operation and we should try to reduce it as much as possible. In this section, we will learn how to identify shuffles in the query execution path for both Synapse SQL and Spark.

Identifying shuffles in a SQL query plan

To identify shuffles, print the query plan using the EXPLAIN statement. Here is an example.

Consider a Synapse SQL table, DimDriver, as shown in the following screenshot:

Figure 14.11 – Sample DimDriver table

Here is a sample EXPLAIN statement:

EXPLAIN WITH_RECOMMENDATIONS
SELECT
    [gender], SUM([salary]) AS Totalsalary
FROM
    dbo.DimDriver
GROUP BY
    [gender]

This will generate a plan similar to the one shown in the following screenshot....
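On the Spark side, shuffles appear as Exchange operators in the physical plan, which you can print with DataFrame.explain(). Here is a hedged sketch that mirrors the SQL query above; the table name assumes DimDriver has been made available to Spark and is purely illustrative:

# Assumes a SparkSession named `spark` is available.
from pyspark.sql import functions as F

driver_df = spark.read.table("DimDriver")

plan_df = driver_df.groupBy("gender").agg(F.sum("salary").alias("TotalSalary"))

# Prints the physical plan; look for "Exchange hashpartitioning(...)" lines,
# which indicate that data will be shuffled between executors.
plan_df.explain()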

Optimizing resource management

Optimizing resource management in this context refers to reducing your billing expenses while using Azure analytics services. Here are some general techniques that can help.

Optimizing Synapse SQL pools

Here are a few suggestions for Synapse dedicated SQL pools:

  • Since the storage and compute are decoupled, you can pause your SQL pool compute when not in use. This will not impact your data but will save you costs.
  • Use the right size of compute units. In the SQL pool, the compute units are defined in terms of Data Warehouse Units (DWUs). You can start with the smallest DWU and then gradually increase to higher DWUs to strike the right balance between cost and performance.
  • Manually scale out or scale in the compute resources based on the load. You can also automate scaling out and in using Azure Functions.

You can learn more about resource management optimizations for the SQL pool here: https://docs.microsoft...

Tuning queries by using indexers

Indexing is another common optimization technique used in database systems, data warehouses, and analytical engines such as Spark. Let's look at the indexing options and tuning guidelines for both Synapse SQL and Spark.

Indexing in Synapse SQL

If you remember, we learned about the different types of indexing in Chapter 5, Implementing Physical Data Storage Structures, in the Implementing different table geometries with Azure Synapse Analytics pools section. I'll recap the different types of indexers here, along with tips for tuning Synapse SQL.

There are three primary types of indexing available in Synapse SQL:

  • Clustered Columnstore Index: This is the default index option for Synapse SQL. If you don't specify any indexing options, the table will automatically get indexed using this method. Use this index for large tables (more than 100 million rows). It provides very high levels of data compression and good overall...

Tuning queries by using cache

Caching is a well-known method for improving read performance in databases. Synapse SQL supports a feature called Result set caching. As the name implies, this enables the results of a query to be cached and reused if the query doesn't change. Once result set caching is enabled, subsequent query executions fetch the results directly from the cache instead of recomputing them. The result set cache is only used under the following conditions:

  • The query being considered is an exact match of a previously cached query.
  • There are no changes to the underlying data or schema.
  • The user has the right set of permissions to the tables referenced in the query.

You can enable result set caching at the database level in Synapse SQL using the following SQL statement:

ALTER DATABASE [database_name]
SET RESULT_SET_CACHING ON;

You can also turn result set caching on from within a session by using the following command:

SET RESULT_SET_CACHING { ON | OFF };

Note...

Optimizing pipelines for analytical or transactional purposes

You have surely heard the terms OLAP and OLTP if you have been working in the data domain. Cloud data systems can be broadly classified as either Online Transaction Processing (OLTP) or Online Analytical Processing (OLAP) systems. Let's understand each of these at a high level.

OLTP systems

OLTP systems, as the name suggests, are built to efficiently process, store, and query transactions. They usually have transaction data flowing into a central ACID-compliant database. The databases contain normalized data that adheres to strict schemas. The data sizes are usually smaller, in the range of gigabytes or terabytes. Predominantly, RDBMS-based systems such as Azure SQL and MySQL are used for the main database.

OLAP systems

On the other hand, OLAP systems are usually big data systems that typically have a warehouse or a key-value-based store as the central technology to perform analytical processing. The tasks...

Optimizing pipelines for descriptive versus analytical workloads

Data analytics is categorized into four different types:

  • Descriptive Analytics: The type of analytics that deals with the analysis of what happened and when it happened. Most BI reports, such as sales reports and trip reports, that display current and historical data points fall under this category. The analytics tasks would usually be counts, aggregates, filters, and so on.
  • Diagnostic Analytics: This type of analytics also covers the why, along with the what and when. Examples include Root Cause Analysis (RCA). Apart from identifying what happened, we also delve deeper into the logs or metrics to identify why it happened. For example, you could be looking at why a certain cab route is seeing a dip in revenue, or why a particular type of VM is failing sooner than others by looking into the load on those machines.
  • Predictive Analytics: As the name suggests, this type of analytics refers to the...

Troubleshooting a failed Spark job

There are two aspects to troubleshooting a failed Spark job in a cloud environment: environmental issues and job issues. Let's look at both of these aspects in detail.

Debugging environmental issues

Here are some of the steps involved in checking environmental issues:

  1. Check the health of Azure services in the region where your Spark clusters are running by using this link: https://status.azure.com/en-us/status.
  2. Next, check whether your Spark cluster itself is fine. You can do this for your HDInsight clusters by checking the Ambari home page. We saw how to check the status in Ambari in Chapter 13, Monitoring Data Storage and Data Processing, in the Monitoring overall cluster performance section. Here is the Ambari home page again for your reference:

Figure 14.23 – Ambari home page showing the status of the cluster

  3. Check whether any service is down or whether any of the resources...

Troubleshooting a failed pipeline run

Azure Data Factory and Synapse pipelines provide detailed error messages when pipelines fail. Here are three easy steps to debugging a failed pipeline:

  • Check Datasets: Click on Linked Services and then click on the Test connection link to ensure that the linked services are working fine and that nothing has changed on the source. Here is an example of how to use Test connection on the Edit linked service page.

Figure 14.27 – Using Test connection for linked services

  • Use data previews to check your transformations: Turn the Data flow debug mode on and check the data previews for each of your pipeline activities, starting from the data source. This will help narrow down the issue. Here is an example of how to use Data preview:

Figure 14.28 – Using the Data preview option to see whether the data is getting populated correctly

  • If the issues persist, run the pipeline...

Summary

Like the last chapter, this chapter also introduced a lot of new concepts. Some of these concepts, such as Spark debugging, optimizing shuffle partitions, and identifying and reducing data spills, will take a long time to master. These topics could be separate books on their own. I've tried my best to give you an overview of these topics with follow-up links. Please go through the links to learn more about them.

Let's recap what we learned in this chapter. We started with data compaction, as small files are very inefficient in big data analytics. We then learned about UDFs, and how to handle data skews and data spills in both SQL and Spark. We then explored shuffle partitions in Spark. We learned about using indexes and caching to speed up query performance. We also learned about HTAP, which is a new concept that merges OLAP and OLTP processing. We then explored general resource management tips for descriptive and analytical platforms. And finally, we wrapped...
