Chapter 3: Understanding the Data Lake Storage Layer

One of the components in our modern data warehouse architecture that spans all the processing steps and is available to all the participating services is our Data Lake Storage layer.

It will serve as our landing zone, as transient storage while we're transforming and cleaning our data, and as a source of queries for the analytical components. Users will establish their sandboxes here, processes will store their logs here, and Data Scientists will use it as their main playground, together with their tools.

In this chapter, you will learn how to set up Azure Data Lake Storage Gen2 and find suggestions on how to organize it so that you can flexibly react to any challenge. You will learn how to access data in Data Lake Storage, and you will look at some approaches to monitoring your Storage Account. We will discuss options for backup and disaster recovery (DR) and examine the security and networking options...

Technical requirements

There are not too many technical requirements for this chapter, but there are a couple! All you will need to follow this chapter is the following:

  • An Azure subscription where you have at least contributor rights, or you are the owner
  • The right to provision an Azure Storage Account with Hierarchical Namespace enabled

Setting up your Cloud Big Data Storage

When you search for Azure Data Lake to provision a service within your Azure subscription, you will find a service with that exact name. We are not going to use that service, as it is the Generation 1 version of Data Lake Storage and is only there for continuity reasons.

Microsoft added Data Lake Storage Gen2 in a somewhat hidden fashion. As this new version of the Data Lake is based on the standard Azure storage account, you will provision it exactly like that: as an Azure storage account with the Azure Data Lake Gen2 option set or, put more precisely, with the Hierarchical Namespace enabled. We will look at this option shortly.
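If you prefer scripting over the portal, the account can also be provisioned programmatically. Here is a minimal sketch using the azure-mgmt-storage and azure-identity Python packages; the subscription ID, resource group, account name, and region are placeholder assumptions:

    # A sketch of provisioning a Data Lake Gen2-capable storage account.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    subscription_id = "<your-subscription-id>"
    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

    poller = client.storage_accounts.begin_create(
        resource_group_name="rg-datalake",       # hypothetical resource group
        account_name="mydatalakeaccount01",      # must be globally unique
        parameters={
            "location": "westeurope",
            "kind": "StorageV2",                 # Gen2 builds on the standard account
            "sku": {"name": "Standard_LRS"},     # choose your redundancy level here
            "is_hns_enabled": True,              # the Hierarchical Namespace switch
        },
    )
    account = poller.result()
    print(account.primary_endpoints.dfs)         # the Data Lake (dfs) endpoint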

Provisioning a standard storage account instead

When you examine the Azure storage account limits, you might ask yourself: why should I go for Data Lake Storage when the standard Storage Account already offers such generous volume limits and capabilities?

Well, the standard storage account has it...

Organizing your data lake

A well-structured system of zones/layers and folders will help you keep control of your data lake. On the one hand, a canonical approach makes the structures and the semantics behind these zones and folders easier to understand. On the other hand, generator approaches will enable you to automate processes in your modern data warehouse.

Many Big Data projects suffer from poorly organized folder structures, and it becomes a challenge to find the right data for the right analysis at the right time. The resulting so-called data swamp can be nearly impossible to use and will even discourage users, wasting the effort that was put into it.
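To make this concrete, here is a minimal sketch that lays down a zone and folder skeleton with the azure-storage-file-datalake Python package. The zone names (raw, cleansed, curated) and the source folders are illustrative assumptions, not a prescribed standard:

    # A sketch of an initial zone/folder skeleton; names are illustrative.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://mydatalakeaccount01.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )

    # One filesystem (container) per zone is a common layout.
    for zone in ["raw", "cleansed", "curated"]:
        fs = service.get_file_system_client(zone)
        if not fs.exists():
            fs.create_file_system()

    # Encoding source system and ingestion date in the path keeps automated
    # loads predictable; intermediate directories are created implicitly.
    raw = service.get_file_system_client("raw")
    raw.get_directory_client("erp/sales/2021/07/15").create_directory()
    raw.get_directory_client("crm/contacts/2021/07/15").create_directory()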

Talking about zones in your data lake

In Chapter 1, Balancing the Benefits of Data Lakes over Data Warehouses, we addressed the question of zones in a data lake. We compared them to the layers in a data warehouse and found that they are pretty similar and mostly follow the same semantics:

...

Implementing a data model in your Data Lake

You can argue that this should already be done in your Data Lake Storage. But if you examine the capabilities of the services when it comes to interconnectivity and the options to query data directly from the Data Lake without loading it into a database, you can set up a hybrid data model and not only use these functionalities to save a lot of money, but also reduce complexity and shorten the loading time windows in your data warehouse.

The rising data volumes that you might face can heavily impact your loading and query strategy. Even with the Azure backbone network, you will come to a point where the latency for loading your data into your data warehouse will no longer be acceptable.
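As a small-scale illustration of querying data in place, the following sketch downloads a Parquet file straight from the lake and filters it with pandas, without any database load in between. The filesystem, path, and column name are hypothetical, and pandas needs pyarrow (or fastparquet) installed; at real Big Data volumes, you would let an engine such as the Synapse SQL services do this in place, as discussed next:

    # A sketch of reading lake data directly; paths and columns are hypothetical.
    import io

    import pandas as pd
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://mydatalakeaccount01.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    file_client = service.get_file_system_client("curated").get_file_client(
        "sales/facts/2021/part-0000.parquet"
    )

    data = file_client.download_file().readall()  # raw bytes of the Parquet file
    df = pd.read_parquet(io.BytesIO(data))        # requires pyarrow or fastparquet
    print(df[df["amount"] > 1000].head())         # "amount" is a hypothetical column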

Understanding interconnectivity between your data lake and the presentation layer

Later in this book, we will dive into the Synapse SQL pools (Chapter 4 and Chapter 11) and Synapse SQL on-demand (Chapter 11). These databases – or better, SQL services...

Monitoring your storage account

There are different ways to keep track of your Data Lake Store on Azure. When you navigate to your storage account in the Azure portal, you will find overview charts about ingress and egress, latency, and requests directly on the Overview blade. When you scroll down the Navigation blade on the left and select Insights from the Monitoring section, you will find more detailed views: subsections for Alerts and Metrics, and predefined (but configurable) Workbooks with ready-made visuals for your storage account.

However, if you need deeper insights, or if you want to track who is doing what on your data lake, this information is not displayed here. You might want to integrate your data lake with Azure Monitor and dive deep into the events.
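What follows is a hedged sketch of that integration using the azure-mgmt-monitor package: it creates a diagnostic setting that ships read, write, and delete logs from the blob endpoint to a Log Analytics workspace (see the preview note below). All resource IDs and names are placeholder assumptions:

    # A sketch of routing Data Lake access logs to Log Analytics.
    # All IDs and names are placeholders; see the preview note below.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.monitor import MonitorManagementClient

    subscription_id = "<your-subscription-id>"
    monitor = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

    storage_account_id = (
        "/subscriptions/<your-subscription-id>/resourceGroups/rg-datalake"
        "/providers/Microsoft.Storage/storageAccounts/mydatalakeaccount01"
    )
    workspace_id = (
        "/subscriptions/<your-subscription-id>/resourceGroups/rg-monitoring"
        "/providers/Microsoft.OperationalInsights/workspaces/law-datalake"
    )

    monitor.diagnostic_settings.create_or_update(
        resource_uri=f"{storage_account_id}/blobServices/default",  # logs sit on the blob sub-resource
        name="datalake-audit",
        parameters={
            "workspace_id": workspace_id,
            "logs": [
                {"category": c, "enabled": True}
                for c in ["StorageRead", "StorageWrite", "StorageDelete"]
            ],
        },
    )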

Note

This feature is in preview at the time of writing. You can enroll in the preview on the Data Lake documentation page (https://docs.microsoft.com/en-us/azure/storage/blobs/monitor-blob-storage) if it...

Talking about backups

In every well-designed system, you want to have your data lake secured against data losses. As we saw at the beginning of this chapter, while creating our Data Lake Storage Gen2 account, there are different levels of redundancy that you can configure for your storage. But replication is only one facet of preventing data loss.

Configuring delete locks for the storage service

To prevent your storage service from being deleted by accident (yes, this can happen!), you can place a delete lock on your storage account. This will cause Azure to block any delete action and prompt the user to remove the delete lock first. Follow these steps to set this up:

  1. Navigate to the Locks entry in the Settings section of the Navigation blade of your Data Lake storage account.
  2. Here, you can add a lock of the Delete type to your account.

You can also set the whole account to Read-only if you need to. The same lock can be created programmatically, as sketched below.
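Here is a minimal sketch of that delete lock created with the azure-mgmt-resource Python package; the names are placeholder assumptions:

    # A sketch of placing a CanNotDelete lock on the storage account.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource.locks import ManagementLockClient

    subscription_id = "<your-subscription-id>"
    locks = ManagementLockClient(DefaultAzureCredential(), subscription_id)

    locks.management_locks.create_or_update_at_resource_level(
        resource_group_name="rg-datalake",
        resource_provider_namespace="Microsoft.Storage",
        parent_resource_path="",
        resource_type="storageAccounts",
        resource_name="mydatalakeaccount01",
        lock_name="do-not-delete",
        parameters={"level": "CanNotDelete"},  # "ReadOnly" would freeze the account instead
    )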

Backing up your data

At the time of writing, there...

Implementing access control in your Data Lake

Azure storage accounts implement different ways to control access to content that is stored there:

  • Role-based access control (RBAC)
  • Access control lists (ACLs)
  • Shared Key authorization
  • Shared Access Signature (SAS) authorization
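The last two mechanisms in this list are secret-based. As a small sketch, this is how a short-lived, read-only SAS for a filesystem could be generated with the azure-storage-file-datalake package, signed with the account's Shared Key; the account name and key are placeholders:

    # A sketch of issuing a short-lived, read-only SAS for a filesystem.
    from datetime import datetime, timedelta

    from azure.storage.filedatalake import (
        FileSystemSasPermissions,
        generate_file_system_sas,
    )

    sas_token = generate_file_system_sas(
        account_name="mydatalakeaccount01",
        file_system_name="curated",
        credential="<account-key>",                     # the account's Shared Key
        permission=FileSystemSasPermissions(read=True, list=True),
        expiry=datetime.utcnow() + timedelta(hours=1),  # keep lifetimes short
    )
    print(f"https://mydatalakeaccount01.dfs.core.windows.net/curated?{sas_token}")

Anybody holding this token can read and list the filesystem until it expires, which is why short lifetimes and careful distribution matter.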

Understanding RBAC

To give access to a user, group, service principal, or managed identity using RBAC, the user or the application needs to be managed by Azure Active Directory (AAD). RBAC works with so-called permission sets that are bundled into roles that a security principal can be assigned to.

When RBAC is assigned to Data Lake Storage, this always happens at the top level of the account or the filesystem. This means that the user or the application will have access to everything that is stored in the account or in the container that access has been granted to.
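Where RBAC stops at the account or container level, POSIX-style ACLs pick up: they can grant access to individual directories and files. Here is a sketch, assuming a hypothetical directory and the AAD object ID of the principal to be granted:

    # A sketch of a directory-level ACL; the directory and object ID are
    # hypothetical. set_access_control replaces the full ACL, so the base
    # user/group/other entries are included alongside the named entry.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://mydatalakeaccount01.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    directory = service.get_file_system_client("curated").get_directory_client("sales")

    principal_oid = "<aad-object-id>"  # the user, group, or service principal to grant
    directory.set_access_control(
        acl=(
            "user::rwx,group::r-x,other::---,"  # base entries for owner/group/other
            f"user:{principal_oid}:r-x"         # named entry: read + traverse
        )
    )
    print(directory.get_access_control()["acl"])

For everything at container scope and above, though, the RBAC roles remain the tool of choice.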

The following roles can be used to grant access to data in a data lake:

  • Storage Blob Data Owner: This role will give you unlimited access...

Setting the networking options

While provisioning your Data Lake Storage account, you must set the networking options. You have three options there and, depending on your choice, you can implement Azure storage firewalls, virtual networks, and private endpoints with your Data Lake Storage.

Allowing access from all networks will make the Data Lake Storage "visible" to everybody. You won't limit any network addresses, so you will need to secure the data lake with other measures, such as RBAC and ACLs. And don't forget: anybody with a Shared Key or a valid SAS will be able to reach your data lake as well.

Understanding storage account firewalls

You might want to consider setting up firewall rules to limit traffic to your data lake so that only IP ranges and addresses that you know are allowed through. Let's take a look:

  1. When you examine the Navigation blade of your Data Lake Storage, you will find the entry for Firewalls and virtual networks in...
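The same restriction can also be scripted. Here is a sketch using the azure-mgmt-storage package that denies all traffic by default and then allows one known range; the names are placeholders and 203.0.113.0/24 is a documentation-only IP range:

    # A sketch of a storage firewall: deny by default, allow a known range.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    subscription_id = "<your-subscription-id>"
    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

    client.storage_accounts.update(
        resource_group_name="rg-datalake",
        account_name="mydatalakeaccount01",
        parameters={
            "network_rule_set": {
                "default_action": "Deny",          # block all networks first
                "ip_rules": [{"ip_address_or_range": "203.0.113.0/24"}],
                "bypass": "AzureServices",         # keep trusted Azure services working
            }
        },
    )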

Discovering additional knowledge

The following is some advice that you might find useful.

Do:

  • Plan for security from day one: Where are your trade-offs between security and usability?
  • Enforce as much discipline as needed, but not more than is really necessary. Your data lake needs to serve your Data Scientists, as well as other communities in your company. Your modern data warehouse needs some agility.
  • Structure your zones clearly and stick to the plan. If you need to redesign, do it in a new structure rather than in the one you have already started.
  • Implement a Data Catalog (we will talk about this in Chapter 14, Establishing Data Governance) to enable easy data discovery.
  • Integrate with DevOps for a controlled and repeatable system.

Don't:

  • Don't mix different formats. Always stick to one single file format per folder. You will often want to read all the files in a folder in one go.
  • Don't forget naming conventions!

Summary

In this chapter, we talked about one of the main components of any modern data warehouse architecture: Data Lake Storage. You learned how to provision Data Lake Storage Gen2 on the Azure portal.

We discussed how to organize a data lake from different angles and examined the zones and folder structures that you will need to implement for efficient usage. We also learned how to implement a data model in Data Lake Storage.

After that, we looked at the administrative side of the Data Lake storage account and talked about monitoring, backups, access control, and networking.

The Data Lake Storage account is not the only component in your modern data warehouse architecture that will hold data for analysis. Stay tuned for the next chapter, where we will shed some light on the relational database components that we can add to the architecture.

Further reading

For more information about the topics that were covered in this chapter, consult the official Azure documentation.
