You're reading from Azure Data Engineer Associate Certification Guide

Product type Book

Published in Feb 2022

Publisher Packt

ISBN-13 9781801816069

Pages 574 pages

Edition 1st Edition

Languages

Concepts

Big Data

Author (1):

Newton Alex

Table of Contents (23) Chapters

Preface

Part 1: Azure Basics

Chapter 1: Introducing Azure Basics

Part 2: Data Storage

Chapter 2: Designing a Data Storage Structure

Chapter 3: Designing a Partition Strategy

Chapter 4: Designing the Serving Layer

Chapter 5: Implementing Physical Data Storage Structures

Chapter 6: Implementing Logical Data Structures

Chapter 7: Implementing the Serving Layer

Part 3: Design and Develop Data Processing (25-30%)

Chapter 8: Ingesting and Transforming Data

Chapter 9: Designing and Developing a Batch Processing Solution

Chapter 10: Designing and Developing a Stream Processing Solution

Chapter 11: Managing Batches and Pipelines

Part 4: Design and Implement Data Security (10-15%)

Chapter 12: Designing Security for Data Policies and Standards

Part 5: Monitor and Optimize Data Storage and Data Processing (10-15%)

Chapter 13: Monitoring Data Storage and Data Processing

Chapter 14: Optimizing and Troubleshooting Data Storage and Data Processing

Part 6: Practice Exercises

Chapter 15: Sample Questions with Solutions

Other Books You May Enjoy

Designing storage for data pruning

Data pruning, as the name suggests, refers to pruning or snipping out the unnecessary data so that the queries need not read the entire input dataset. I/O is a major bottleneck for any analytical engine, so the idea here is that by reducing the amount of data read, we can improve the query performance. Data pruning usually requires some kind of user input to the analytical engine so that it can decide on which data can be safely ignored for a particular query.

Technologies such as Synapse Dedicated Pools, Azure SQL, Spark, and Hive provide the ability to partition data based on user-defined criteria. If we can organize the input data into physical folders that correspond to the partitions, we can effectively skip reading entire folders of data that are not required for such queries.

Let's consider the examples of Synapse Dedicated Pool and Spark as they are important from a certification point of view.