You're reading from Azure Data Engineer Associate Certification Guide

Product type Book

Published in Feb 2022

Publisher Packt

ISBN-13 9781801816069

Pages 574 pages

Edition 1st Edition

Concepts

Big Data

Author (1):

Newton Alex

Selecting the right file types for storage

Now that we understand the components required to build a data lake in Azure, we need to decide on the file formats that allow efficient storage and retrieval of data from the data lake. Data often arrives in formats such as plain text files, log files, comma-separated values (CSV), JSON, and XML. Though these formats are easy for humans to read and understand, they are often not the best choice for data analytics. A file format that does not compress well will quickly fill up storage capacity; a format that is not optimized for read operations can slow down analytics queries or ETL jobs; and a file that cannot be split efficiently cannot be processed in parallel. To overcome these deficiencies, the big data community recommends three important file formats: Avro, Parquet, and Optimized Row Columnar (ORC). These file formats are also important from a certification perspective, so we will be exploring these...
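To make the compression and read-efficiency concerns above more concrete, here is a minimal sketch using only the Python standard library. It is a toy illustration, not an implementation of Avro, Parquet, or ORC: the trip records, the field names (`trip_id`, `city`, `fare_cents`), and the JSON-based "columnar" layout are all assumptions made up for this example. It serializes the same records in a row-oriented CSV layout and in a column-oriented layout, compares raw and zlib-compressed sizes, and shows that the column-oriented layout lets a reader aggregate one field without parsing every record.

```python
import csv
import io
import json
import zlib


def make_rows(n=5000):
    """Build hypothetical trip records (illustrative data only)."""
    cities = ["Seattle", "Austin", "Boston", "Denver"]
    return [
        {"trip_id": i, "city": cities[i % 4], "fare_cents": 500 + (i % 50) * 50}
        for i in range(n)
    ]


def to_csv_bytes(rows):
    """Row-oriented layout: one text line per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def to_columnar_bytes(rows):
    """Toy column-oriented layout: all values of a field stored together."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    return json.dumps(columns).encode("utf-8")


def sum_fares_csv(data):
    """Aggregating one field from CSV still parses every full record."""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    return sum(int(record["fare_cents"]) for record in reader)


def sum_fares_columnar(data):
    """With a column layout, only the needed column is touched after decoding."""
    columns = json.loads(data.decode("utf-8"))
    return sum(columns["fare_cents"])


if __name__ == "__main__":
    rows = make_rows()
    for label, payload in (("row/CSV", to_csv_bytes(rows)),
                           ("columnar", to_columnar_bytes(rows))):
        compressed = zlib.compress(payload)
        print(f"{label}: raw={len(payload)} bytes, compressed={len(compressed)} bytes")
    print("fare total (CSV):     ", sum_fares_csv(to_csv_bytes(rows)))
    print("fare total (columnar):", sum_fares_columnar(to_columnar_bytes(rows)))
```

Real columnar formats go much further than this sketch (typed binary encodings, per-column compression, and metadata that allows files to be split and partially read), so the numbers printed here should be read only as a rough intuition for why layout and compression matter.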