
You're reading from Azure Data Engineer Associate Certification Guide (1st Edition, published by Packt in Feb 2022, ISBN-13: 9781801816069).
Author: Newton Alex

Newton Alex leads several Azure Data Analytics teams in Microsoft, India. His team contributes to technologies including Azure Synapse, Azure Databricks, Azure HDInsight, and many open source technologies, including Apache YARN, Apache Spark, and Apache Hive. He started using Hadoop while at Yahoo, USA, where he helped build the first batch processing pipelines for Yahoo's ad serving team. After Yahoo, he became the leader of the big data team at Pivotal Inc., USA, where he was responsible for the entire open source stack of Pivotal Inc. He later moved to Microsoft and started the Azure Data team in India. He has worked with several Fortune 500 companies to help build their data systems on Azure.

Chapter 2: Designing a Data Storage Structure

Welcome to Chapter 2. We will be focusing on Azure data storage technologies in this chapter. Azure provides several storage technologies that can cater to a wide range of cloud and hybrid use cases. Some of the important Azure storage technologies include Blobs, Files, Queues, Tables, SQL Database, Cosmos DB, Synapse SQL pools (formerly SQL Data Warehouse), and Azure Data Lake Storage (ADLS). Azure bundles the four fundamental storage technologies, namely Blobs, Files, Queues, and Tables, as Azure Storage. Other advanced services, such as Cosmos DB, SQL Database, and ADLS, are provided as independent Azure services.

From this chapter onward, we will be following the exact sequence of the DP-203 syllabus.

We will mostly be focusing on the design aspects of storage structures in this chapter. The corresponding implementation details will be covered in the later chapters.

This chapter will cover the following topics:

  • Designing...

Technical requirements

For this chapter, you will need the following:

  • An Azure account (free or paid)
  • The Azure command-line interface (Azure CLI) installed on your workstation. Please refer to the Creating a VM using Azure CLI section in Chapter 1, Introducing Azure Basics, for instructions on installing the Azure CLI.

Designing an Azure data lake

If you have been following the big data technology domain, you will definitely have come across the term data lake. Data lakes are distributed data stores that can hold very large volumes of diverse data. They can be used to store different types of data, such as structured, semi-structured, unstructured, and streaming data.

A data lake solution usually comprises a storage layer, a compute layer, and a serving layer. The compute layer could include Extract, Transform, Load (ETL), batch, or stream processing. There are no fixed templates for creating data lakes; every data lake could be unique and optimized for the owning organization's requirements. However, there are a few general guidelines for building effective data lakes, and we will be learning about them in this chapter.

How is a data lake different from a data warehouse?

The main difference between a data lake and a data warehouse is that a data warehouse stores structured...

Selecting the right file types for storage

Now that we understand the components required to build a data lake in Azure, we need to decide on the file formats that will be required for efficient storage and retrieval of data from the data lake. Data often arrives in formats such as text files, log files, comma-separated values (CSV), JSON, XML, and so on. Though these file formats are easy for humans to read and understand, they might not be the best formats for data analytics. A file format that cannot be compressed will quickly fill up the available storage; a format that is not optimized for read operations will slow down analytics or ETL jobs; and a file that cannot be split easily cannot be processed in parallel. In order to overcome such deficiencies, the big data community recommends three important data formats: Avro, Parquet, and Optimized Row Columnar (ORC). These file formats are also important from a certification perspective, so we will be exploring these...
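
To give a feel for how a raw file is converted into one of these formats, here is a minimal PySpark sketch. The storage account, container, and file paths are hypothetical; writing ORC works the same way via .orc(), while Avro requires the external spark-avro package.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw, human-readable CSV file (hypothetical ADLS path).
raw_df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/2021/sales.csv"))

# Rewrite it as compressed, splittable Parquet files.
(raw_df.write
       .mode("overwrite")
       .option("compression", "snappy")
       .parquet("abfss://curated@mydatalake.dfs.core.windows.net/sales/2021/"))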

Choosing the right file types for analytical queries

In the previous section, we discussed the three file formats—Avro, Parquet, and ORC—in detail. Any format that supports fast reads is better for analytics workloads. So, naturally, column-based formats such as Parquet and ORC fit the bill.

Based on the five core areas that we compared before (read, write, compression, schema evolution, and the ability to split files for parallel processing) and your choice of processing technologies, such as Hive or Spark, you could select ORC or Parquet.

For example, consider the following (a short Spark sketch follows this list):

  • If you have Hive- or Presto-based workloads, go with ORC.
  • If you have Spark- or Drill-based workloads, go with Parquet.
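
As a rough illustration of why columnar formats suit analytical reads (the dataset and column names below are made up), Spark only needs to fetch the columns that a query actually touches from a Parquet file:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("columnar-read").getOrCreate()

# Hypothetical curated Parquet dataset.
sales_df = spark.read.parquet("abfss://curated@mydatalake.dfs.core.windows.net/sales/2021/")

# Only the region and amount columns are read from storage;
# Parquet's columnar layout lets Spark skip every other column.
(sales_df.select("region", "amount")
         .groupBy("region")
         .agg(F.sum("amount").alias("total_sales"))
         .show())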

Now that you understand the different types of file formats available and the ones to use for analytical workloads, let's move on to the next topic: designing for efficient querying.

Designing storage for efficient querying

In order to design for efficient querying, we will have to understand the different querying technologies that are available in Azure. Even though this chapter is primarily focused on storage technologies, a small diversion into the world of querying will help us better understand how to design for efficient querying. There are two types of querying services available: SQL-based engines, such as Azure SQL and Synapse serverless/dedicated SQL pools (previously known as Azure SQL Data Warehouse), and big data analytical engines, such as Spark and Hive. Let's explore the important design techniques for both of these groups of query engines.

We can broadly group the techniques available for efficient querying into the following three layers:

  • Storage layer—Using techniques such as partitioning, data pruning, and eventual consistency
  • Application layer—Using techniques such as data caching and application tuning (such as varying...

Designing storage for data pruning

Data pruning, as the name suggests, refers to pruning or snipping out unnecessary data so that queries need not read the entire input dataset. I/O is a major bottleneck for any analytical engine, so the idea is that by reducing the amount of data read, we can improve query performance. Data pruning usually requires some kind of user input to the analytical engine so that it can decide which data can be safely ignored for a particular query.

Technologies such as Synapse dedicated SQL pools, Azure SQL, Spark, and Hive provide the ability to partition data based on user-defined criteria. If we can organize the input data into physical folders that correspond to the partitions, we can effectively skip reading entire folders of data that are not required for a given query.
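
As a hedged Spark sketch of this idea (the paths and the trip_date column are hypothetical), partitioning a dataset on a column creates one folder per value, and a filter on that column lets the engine skip every other folder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

trips_df = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/trips/")

# One physical folder per trip_date value, for example
# .../trips_partitioned/trip_date=2021-06-01/.
(trips_df.write
         .mode("overwrite")
         .partitionBy("trip_date")
         .parquet("abfss://curated@mydatalake.dfs.core.windows.net/trips_partitioned/"))

# A filter on the partition column reads only the matching folder;
# all other trip_date folders are pruned.
june_first = (spark.read
              .parquet("abfss://curated@mydatalake.dfs.core.windows.net/trips_partitioned/")
              .filter("trip_date = '2021-06-01'"))
print(june_first.count())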

Let's consider the examples of Synapse Dedicated Pool and Spark as they are important from a certification point of view.

Dedicated SQL pool example with...

Designing folder structures for data transformation

Considering the variety and volume of data that will land in a data lake, it is very important to design a flexible but maintainable folder structure. Badly designed or ad hoc folder structures become a management nightmare and render the data lake unusable. Some points to keep in mind while designing the folder structure are given here, with a short illustrative sketch after the list:

  • Human readability—Human-readable folder structures will help improve data exploration and navigation.
  • Representation of organizational structure—Aligning the folder structure according to the organizational structure helps segregate the data for billing and access control. Such a folder structure will help restrict cross-team data access.
  • Distinguish sensitive data—The folder structure should be such that it can separate sensitive data from general data. Sensitive data will require higher levels of audit, privacy, and security policies, so keeping...
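
The following sketch is purely illustrative; the zone names, team names, and date layout are assumptions rather than a required convention. It shows one way of building human-readable paths that mirror the organizational structure and keep sensitive data under its own branch:

from datetime import date

def lake_path(zone: str, team: str, sensitivity: str,
              ingest_date: date, file_name: str) -> str:
    """Build a human-readable folder path that mirrors the org structure
    and keeps sensitive data under a separate branch."""
    return f"/{zone}/{team}/{sensitivity}/{ingest_date:%Y/%m/%d}/{file_name}"

# For example: /raw/sales/sensitive/2021/12/25/customers.parquet
print(lake_path("raw", "sales", "sensitive", date(2021, 12, 25), "customers.parquet"))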

Designing a distribution strategy

Distribution strategies are techniques used in Synapse dedicated SQL pools. Synapse dedicated SQL pools are massively parallel processing (MPP) systems that split each query into 60 smaller queries and execute them in parallel. Each of these smaller queries runs on something called a distribution. A distribution is the basic unit of processing and storage for a dedicated SQL pool.

A dedicated SQL pool uses Azure Storage to store the data, and it provides three different ways to distribute (shard) the data among the distributions. They are listed as follows:

  • Round-robin distributed tables
  • Hash-distributed tables
  • Replicated tables

Based on our requirements, we need to decide on which of these distribution techniques should be used for creating our tables. To choose the right distribution strategy, you should understand your application, the data layout, and the data access patterns by using query plans. We will be learning how to generate and read...
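
As a hedged sketch of how these options look in practice (the server, credentials, and table below are hypothetical, not the book's worked example), the distribution is declared when a table is created in the dedicated SQL pool; here the DDL is submitted from Python over ODBC:

import pyodbc

# Hypothetical connection details for a Synapse dedicated SQL pool.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=mysynapse.sql.azuresynapse.net;"
    "DATABASE=salesdw;UID=sqladmin;PWD=<password>"
)
cursor = conn.cursor()

# Hash-distribute the large fact table on a column with many distinct values,
# so its rows are spread evenly across the 60 distributions.
cursor.execute("""
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT NOT NULL,
    CustomerId INT    NOT NULL,
    Amount     DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);
""")
conn.commit()

# Small dimension tables are typically created with DISTRIBUTION = REPLICATE,
# and staging tables often use DISTRIBUTION = ROUND_ROBIN.

As a rule of thumb, hash distribution suits large fact tables, replication suits small dimension tables, and round-robin is a reasonable default for staging tables.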

Designing a data archiving solution

Now that we have learned how to design a data lake and optimize the storage for our analytical queries, there is one final component that remains to be designed. How do we archive or clean up the old data? Without a proper archiving and/or deletion solution, the data will grow and fill up the storage very soon.

Azure provides three access tiers for storage: the Hot Access Tier, the Cool Access Tier, and the Archive Access Tier.

Hot Access Tier

The Hot Access Tier is ideal for data that is accessed frequently. It provides the lowest access cost but at a higher storage cost.

Cool Access Tier

The Cool Access Tier is ideal for data that is accessed only occasionally, such as slightly older data that is used for backups or monthly reports. It provides lower storage costs but higher access costs. Azure expects data in the Cool Access Tier to be stored for at least 30 days; deleting it or changing its tier earlier than that might result in extra charges.
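
As a rough illustration (the connection string, container, and blob name below are placeholders), a blob that has aged out of frequent use can be moved down a tier with the azure-storage-blob library:

from azure.storage.blob import BlobServiceClient, StandardBlobTier

# Hypothetical storage account connection string and blob.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="reports", blob="2021/06/monthly-report.parquet")

# Move the blob from the Hot tier to the Cool tier: storage becomes cheaper,
# but each subsequent read costs more.
blob.set_standard_blob_tier(StandardBlobTier.COOL)

In practice, such moves are usually automated with a lifecycle management policy on the storage account rather than being applied blob by blob.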

Archive Access Tier...

Summary

With that, we have come to the end of our second chapter. We explored various data lake designs in detail and learned good practices for designing one. You should now be comfortable answering questions about data lake architectures and the storage, compute, and other technologies involved in creating a data lake. You should also be familiar with common file formats such as Avro, Parquet, and ORC, and know when to choose each of them. We explored different optimization techniques such as data pruning, partitioning, caching, and indexing, and we learned about folder structures, data distribution, and, finally, designing a data life cycle by using policies to archive or delete data. This covers the Design a Data Storage Structure section of the DP-203 syllabus. We will reinforce what we have learned here with implementation details and tips in the following chapters.

Let's explore the concept of partitioning in more detail in the next...

