
You're reading from Azure Data Engineer Associate Certification Guide (1st Edition, published by Packt in Feb 2022, ISBN-13: 9781801816069).
Author: Newton Alex

Newton Alex leads several Azure Data Analytics teams in Microsoft, India. His team contributes to technologies including Azure Synapse, Azure Databricks, Azure HDInsight, and many open source technologies, including Apache YARN, Apache Spark, and Apache Hive. He started using Hadoop while at Yahoo, USA, where he helped build the first batch processing pipelines for Yahoo's ad serving team. After Yahoo, he became the leader of the big data team at Pivotal Inc., USA, where he was responsible for the entire open source stack of Pivotal Inc. He later moved to Microsoft and started the Azure Data team in India. He has worked with several Fortune 500 companies to help build their data systems on Azure.

Chapter 2: Designing a Data Storage Structure

Welcome to Chapter 2. We will be focusing on Azure data storage technologies in this chapter. Azure provides several storage technologies that can cater to a wide range of cloud and hybrid use cases. Some of the important Azure storage technologies include Blobs, Files, Queues, Tables, SQL Database, Cosmos DB, Synapse SQL pools (formerly SQL Data Warehouse), and Azure Data Lake Storage (ADLS). Azure bundles the four fundamental storage technologies, namely Blobs, Files, Queues, and Tables, as Azure Storage. Other advanced services, such as Cosmos DB, SQL Database, and ADLS, are provided as independent Azure services.

From this chapter onward, we will be following the exact sequence of the DP-203 syllabus.

We will mostly be focusing on the design aspects of storage structures in this chapter. The corresponding implementation details will be covered in the later chapters.

This chapter will cover the following topics:

  • Designing...

Technical requirements

For this chapter, you will need the following:

  • An Azure account (free or paid)
  • The Azure command-line interface (Azure CLI) installed on your workstation. Please refer to the Creating a VM using Azure CLI section in Chapter 1, Introducing Azure Basics, for instructions on installing the Azure CLI.

Designing an Azure data lake

If you have been following the big data technology domain, you will definitely have come across the term data lake. Data lakes are distributed data stores that can hold very large volumes of diverse data. They can be used to store different types of data, such as structured, semi-structured, unstructured, and streaming data.

A data lake solution usually comprises a storage layer, a compute layer, and a serving layer. The compute layer could include Extract, Transform, Load (ETL), batch, or stream processing. There are no fixed templates for creating data lakes; every data lake could be unique and optimized for the owning organization's requirements. However, there are a few general guidelines for building effective data lakes, and we will be learning about them in this chapter.

How is a data lake different from a data warehouse?

The main difference between a data lake and a data warehouse is that a data warehouse stores structured...

Selecting the right file types for storage

Now that we understand the components required to build a data lake in Azure, we need to decide on the file formats that will be required for efficient storage and retrieval of data from the data lake. Data often arrives in formats such as text files, log files, comma-separated values (CSV), JSON, XML, and so on. Though these file formats are easy for humans to read and understand, they might not be the best formats for data analytics. A file format that cannot be compressed will quickly fill up the available storage; a format that is not optimized for read operations will slow down analytics or ETL jobs; and a file that cannot be split easily cannot be processed in parallel. In order to overcome such deficiencies, the big data community recommends three important data formats: Avro, Parquet, and Optimized Row Columnar (ORC). These file formats are also important from a certification perspective, so we will be exploring these...
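
To give a feel for how a raw file is converted into one of these formats, here is a minimal PySpark sketch. The storage account, container, and file paths are hypothetical; writing ORC works the same way via .orc(), while Avro requires the external spark-avro package.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw, human-readable CSV file (hypothetical ADLS path).
raw_df = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/2021/sales.csv"))

# Rewrite it as compressed, splittable Parquet files.
(raw_df.write
       .mode("overwrite")
       .option("compression", "snappy")
       .parquet("abfss://curated@mydatalake.dfs.core.windows.net/sales/2021/"))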

Choosing the right file types for analytical queries

In the previous section, we discussed the three file formats—Avro, Parquet, and ORC—in detail. Any format that supports fast reads is better for analytics workloads. So, naturally, column-based formats such as Parquet and ORC fit the bill.

Based on the five core areas that we compared before (read, write, compression, schema evolution, and the ability to split files for parallel processing) and your choice of processing technologies, such as Hive or Spark, you could select ORC or Parquet.

For example, consider the following (a short Spark sketch follows this list):

  • If you have Hive- or Presto-based workloads, go with ORC.
  • If you have Spark- or Drill-based workloads, go with Parquet.
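
As a rough illustration of why columnar formats suit analytical reads (the dataset and column names below are made up), Spark only needs to fetch the columns that a query actually touches from a Parquet file:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("columnar-read").getOrCreate()

# Hypothetical curated Parquet dataset.
sales_df = spark.read.parquet("abfss://curated@mydatalake.dfs.core.windows.net/sales/2021/")

# Only the region and amount columns are read from storage;
# Parquet's columnar layout lets Spark skip every other column.
(sales_df.select("region", "amount")
         .groupBy("region")
         .agg(F.sum("amount").alias("total_sales"))
         .show())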

Now that you understand the different types of file formats available and the ones to use for analytical workloads, let's move on to the next topic: designing for efficient querying.

Designing storage for efficient querying

In order to design for efficient querying, we will have to understand the different querying technologies that are available in Azure. Even though this chapter is primarily focused on storage technologies, a small diversion into the world of querying will help us better understand how to design for efficient querying. There are two types of querying services available: SQL-based engines, such as Azure SQL and Synapse serverless/dedicated SQL pools (previously known as Azure SQL Data Warehouse), and big data analytical engines, such as Spark and Hive. Let's explore the important design techniques for both of these groups of query engines.

We can broadly group the techniques available for efficient querying into the following three layers:

  • Storage layer—Using techniques such as partitioning, data pruning, and eventual consistency
  • Application layer—Using techniques such as data caching and application tuning (such as varying...

Designing storage for data pruning

Data pruning, as the name suggests, refers to pruning or snipping out unnecessary data so that queries need not read the entire input dataset. I/O is a major bottleneck for any analytical engine, so the idea is that by reducing the amount of data read, we can improve query performance. Data pruning usually requires some kind of user input to the analytical engine so that it can decide which data can be safely ignored for a particular query.

Technologies such as Synapse dedicated SQL pools, Azure SQL, Spark, and Hive provide the ability to partition data based on user-defined criteria. If we can organize the input data into physical folders that correspond to the partitions, we can effectively skip reading entire folders of data that are not required for a given query.
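
As a hedged Spark sketch of this idea (the paths and the trip_date column are hypothetical), partitioning a dataset on a column creates one folder per value, and a filter on that column lets the engine skip every other folder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning").getOrCreate()

trips_df = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/trips/")

# One physical folder per trip_date value, for example
# .../trips_partitioned/trip_date=2021-06-01/.
(trips_df.write
         .mode("overwrite")
         .partitionBy("trip_date")
         .parquet("abfss://curated@mydatalake.dfs.core.windows.net/trips_partitioned/"))

# A filter on the partition column reads only the matching folder;
# all other trip_date folders are pruned.
june_first = (spark.read
              .parquet("abfss://curated@mydatalake.dfs.core.windows.net/trips_partitioned/")
              .filter("trip_date = '2021-06-01'"))
print(june_first.count())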

Let's consider the examples of Synapse Dedicated Pool and Spark as they are important from a certification point of view.

Dedicated SQL pool example with...

Designing folder structures for data transformation

Considering the variety and volume of data that will land in a data lake, it is very important to design a flexible but maintainable folder structure. Badly designed or ad hoc folder structures become a management nightmare and render the data lake unusable. Some points to keep in mind while designing the folder structure are given here, with a short illustrative sketch after the list:

  • Human readability—Human-readable folder structures will help improve data exploration and navigation.
  • Representation of organizational structure—Aligning the folder structure according to the organizational structure helps segregate the data for billing and access control. Such a folder structure will help restrict cross-team data access.
  • Distinguish sensitive data—The folder structure should be such that it can separate sensitive data from general data. Sensitive data will require higher levels of audit, privacy, and security policies, so keeping...
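
The following sketch is purely illustrative; the zone names, team names, and date layout are assumptions rather than a required convention. It shows one way of building human-readable paths that mirror the organizational structure and keep sensitive data under its own branch:

from datetime import date

def lake_path(zone: str, team: str, sensitivity: str,
              ingest_date: date, file_name: str) -> str:
    """Build a human-readable folder path that mirrors the org structure
    and keeps sensitive data under a separate branch."""
    return f"/{zone}/{team}/{sensitivity}/{ingest_date:%Y/%m/%d}/{file_name}"

# For example: /raw/sales/sensitive/2021/12/25/customers.parquet
print(lake_path("raw", "sales", "sensitive", date(2021, 12, 25), "customers.parquet"))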

Designing a distribution strategy

Distribution strategies are techniques used in Synapse dedicated SQL pools. Synapse dedicated SQL pools are massively parallel processing (MPP) systems that split each query into 60 smaller queries and execute them in parallel. Each of these smaller queries runs on something called a distribution. A distribution is the basic unit of processing and storage for a dedicated SQL pool.

A dedicated SQL pool uses Azure Storage to store the data, and it provides three different ways to distribute (shard) the data among the distributions. They are listed as follows:

  • Round-robin distributed tables
  • Hash-distributed tables
  • Replicated tables

Based on our requirements, we need to decide on which of these distribution techniques should be used for creating our tables. To choose the right distribution strategy, you should understand your application, the data layout, and the data access patterns by using query plans. We will be learning how to generate and read...
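
As a hedged sketch of how these options look in practice (the server, credentials, and table below are hypothetical, not the book's worked example), the distribution is declared when a table is created in the dedicated SQL pool; here the DDL is submitted from Python over ODBC:

import pyodbc

# Hypothetical connection details for a Synapse dedicated SQL pool.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=mysynapse.sql.azuresynapse.net;"
    "DATABASE=salesdw;UID=sqladmin;PWD=<password>"
)
cursor = conn.cursor()

# Hash-distribute the large fact table on a column with many distinct values,
# so its rows are spread evenly across the 60 distributions.
cursor.execute("""
CREATE TABLE dbo.FactSales
(
    SaleId     BIGINT NOT NULL,
    CustomerId INT    NOT NULL,
    Amount     DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);
""")
conn.commit()

# Small dimension tables are typically created with DISTRIBUTION = REPLICATE,
# and staging tables often use DISTRIBUTION = ROUND_ROBIN.

As a rule of thumb, hash distribution suits large fact tables, replication suits small dimension tables, and round-robin is a reasonable default for staging tables.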

Designing a data archiving solution

Now that we have learned how to design a data lake and optimize the storage for our analytical queries, there is one final component that remains to be designed. How do we archive or clean up the old data? Without a proper archiving and/or deletion solution, the data will grow and fill up the storage very soon.

Azure provides three access tiers for storage: the Hot Access Tier, the Cool Access Tier, and the Archive Access Tier.

Hot Access Tier

The Hot Access Tier is ideal for data that is accessed frequently. It provides the lowest access cost but at a higher storage cost.

Cool Access Tier

The Cool Access Tier is ideal for data that is accessed only occasionally, such as slightly older data that is used for backups or monthly reports. It provides lower storage costs but higher access costs. Azure expects data in the Cool Access Tier to be stored for at least 30 days; deleting it or changing its tier earlier than that might result in extra charges.
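
As a rough illustration (the connection string, container, and blob name below are placeholders), a blob that has aged out of frequent use can be moved down a tier with the azure-storage-blob library:

from azure.storage.blob import BlobServiceClient, StandardBlobTier

# Hypothetical storage account connection string and blob.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="reports", blob="2021/06/monthly-report.parquet")

# Move the blob from the Hot tier to the Cool tier: storage becomes cheaper,
# but each subsequent read costs more.
blob.set_standard_blob_tier(StandardBlobTier.COOL)

In practice, such moves are usually automated with a lifecycle management policy on the storage account rather than being applied blob by blob.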

Archive Access Tier...

Summary

With that, we have come to the end of our second chapter. We explored various data lake designs in detail and learned good practices for designing one. You should now be comfortable answering questions about data lake architectures and the storage, compute, and other technologies involved in creating a data lake. You should also be familiar with common file formats such as Avro, Parquet, and ORC, and know when to choose each of them. We explored different optimization techniques such as data pruning, partitioning, caching, and indexing, and we learned about folder structures, data distribution, and, finally, designing a data life cycle by using policies to archive or delete data. This covers the Design a Data Storage Structure section of the DP-203 syllabus. We will reinforce what we have learned here with implementation details and tips in the following chapters.

Let's explore the concept of partitioning in more detail in the next...

