Chapter 3: Understanding the Data Lake Storage Layer

One of the components in our modern data warehouse architecture that spans all the processing steps and is available to all the participating services is our Data Lake Storage layer.

It will serve as our landing zone, as transient storage while we're transforming and cleaning our data, and as a source of queries for the analytical components. Users will establish their sandboxes here, processes will store their logs here, and Data Scientists will use it as their main playground, together with their tools.

In this chapter, you will learn how to set up Azure Data Lake Storage Gen2 and find suggestions on how to organize it so that you can flexibly react to any challenge. You will learn how to access data in Data Lake Storage, and you will look at some approaches to monitoring your Storage Account. We will discuss options for backup and disaster recovery (DR) and examine the security and networking options...

Technical requirements

There are not too many technical requirements for this chapter, but there are a couple! All you will need to follow this chapter is the following:

  • An Azure subscription where you have at least contributor rights, or you are the owner
  • The right to provision an Azure Storage Account with Hierarchical Namespace enabled

Setting up your Cloud Big Data Storage

When you search for Azure Data Lake to provision a service within your Azure subscription, you will find a service with that exact name. We are not going to use that service, as it is the Generation 1 version of Data Lake Storage and is only there for continuity reasons.

Microsoft added Data Lake Storage Gen2 in a somewhat hidden fashion. As this new version of the Data Lake is based on the standard Azure storage account, you will provision it exactly like that: as an Azure storage account with the Azure Data Lake Gen2 option set or, put more precisely, with the Hierarchical Namespace enabled. We will look at this option shortly.
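If you prefer scripting over the portal, the account can also be provisioned programmatically. Here is a minimal sketch using the azure-mgmt-storage and azure-identity Python packages; the subscription ID, resource group, account name, and region are placeholder assumptions:

    # A sketch of provisioning a Data Lake Gen2-capable storage account.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    subscription_id = "<your-subscription-id>"
    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

    poller = client.storage_accounts.begin_create(
        resource_group_name="rg-datalake",       # hypothetical resource group
        account_name="mydatalakeaccount01",      # must be globally unique
        parameters={
            "location": "westeurope",
            "kind": "StorageV2",                 # Gen2 builds on the standard account
            "sku": {"name": "Standard_LRS"},     # choose your redundancy level here
            "is_hns_enabled": True,              # the Hierarchical Namespace switch
        },
    )
    account = poller.result()
    print(account.primary_endpoints.dfs)         # the Data Lake (dfs) endpoint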

Provisioning a standard storage account instead

When you examine the Azure storage account limits, you might ask yourself: why should I go for Data Lake Storage when the standard Storage Account already offers such generous volume limits and capabilities?

Well, the standard storage account has it...

Organizing your data lake

A well-structured system of zones/layers and folders will help you keep control of your data lake. On the one hand, a canonical approach makes the structures and the semantics behind these zones and folders easier to understand. On the other hand, generator approaches will enable you to automate processes in your modern data warehouse.

Many Big Data projects suffer from poorly organized folder structures, and it becomes a challenge to find the right data for the right analysis at the right time. The resulting so-called data swamp can be nearly impossible to use and will even discourage users, wasting the effort that was put into it.
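To make this concrete, here is a minimal sketch that lays down a zone and folder skeleton with the azure-storage-file-datalake Python package. The zone names (raw, cleansed, curated) and the source folders are illustrative assumptions, not a prescribed standard:

    # A sketch of an initial zone/folder skeleton; names are illustrative.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://mydatalakeaccount01.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )

    # One filesystem (container) per zone is a common layout.
    for zone in ["raw", "cleansed", "curated"]:
        fs = service.get_file_system_client(zone)
        if not fs.exists():
            fs.create_file_system()

    # Encoding source system and ingestion date in the path keeps automated
    # loads predictable; intermediate directories are created implicitly.
    raw = service.get_file_system_client("raw")
    raw.get_directory_client("erp/sales/2021/07/15").create_directory()
    raw.get_directory_client("crm/contacts/2021/07/15").create_directory()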

Talking about zones in your data lake

In Chapter 1, Balancing the Benefits of Data Lakes over Data Warehouses, we addressed the question of zones in a data lake. We compared them to the layers in a data warehouse and found that they are pretty similar and mostly follow the same semantics:

...

Implementing a data model in your Data Lake

You can argue that this should already be done in your Data Lake Storage. But if you examine the capabilities of the services when it comes to interconnectivity and the options to query data directly from the Data Lake without loading it into a database, you can set up a hybrid data model and not only use these functionalities to save a lot of money, but also reduce complexity and shorten the loading time windows in your data warehouse.

The rising data volumes that you might face can heavily impact your loading and query strategy. Even with the Azure backbone network, you will come to a point where the latency for loading your data into your data warehouse will no longer be acceptable.
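As a small-scale illustration of querying data in place, the following sketch downloads a Parquet file straight from the lake and filters it with pandas, without any database load in between. The filesystem, path, and column name are hypothetical, and pandas needs pyarrow (or fastparquet) installed; at real Big Data volumes, you would let an engine such as the Synapse SQL services do this in place, as discussed next:

    # A sketch of reading lake data directly; paths and columns are hypothetical.
    import io

    import pandas as pd
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://mydatalakeaccount01.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    file_client = service.get_file_system_client("curated").get_file_client(
        "sales/facts/2021/part-0000.parquet"
    )

    data = file_client.download_file().readall()  # raw bytes of the Parquet file
    df = pd.read_parquet(io.BytesIO(data))        # requires pyarrow or fastparquet
    print(df[df["amount"] > 1000].head())         # "amount" is a hypothetical column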

Understanding interconnectivity between your data lake and the presentation layer

Later in this book, we will dive into the Synapse SQL pools (Chapter 4 and Chapter 11) and Synapse SQL on-demand (Chapter 11). These databases – or better, SQL services...

Monitoring your storage account

There are different ways to keep track of your Data Lake Store on Azure. When you navigate to your storage account in the Azure portal, you will find overview charts about ingress and egress, latency, and requests directly on the Overview blade. When you scroll down the Navigation blade on the left and select Insights from the Monitoring section, you will find more detailed views: subsections for Alerts and Metrics, and predefined (but configurable) Workbooks with ready-made visuals for your storage account.

However, if you need deeper insights, or if you want to track who is doing what on your data lake, this information is not displayed here. You might want to integrate your data lake with Azure Monitor and dive deep into the events.
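What follows is a hedged sketch of that integration using the azure-mgmt-monitor package: it creates a diagnostic setting that ships read, write, and delete logs from the blob endpoint to a Log Analytics workspace (see the preview note below). All resource IDs and names are placeholder assumptions:

    # A sketch of routing Data Lake access logs to Log Analytics.
    # All IDs and names are placeholders; see the preview note below.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.monitor import MonitorManagementClient

    subscription_id = "<your-subscription-id>"
    monitor = MonitorManagementClient(DefaultAzureCredential(), subscription_id)

    storage_account_id = (
        "/subscriptions/<your-subscription-id>/resourceGroups/rg-datalake"
        "/providers/Microsoft.Storage/storageAccounts/mydatalakeaccount01"
    )
    workspace_id = (
        "/subscriptions/<your-subscription-id>/resourceGroups/rg-monitoring"
        "/providers/Microsoft.OperationalInsights/workspaces/law-datalake"
    )

    monitor.diagnostic_settings.create_or_update(
        resource_uri=f"{storage_account_id}/blobServices/default",  # logs sit on the blob sub-resource
        name="datalake-audit",
        parameters={
            "workspace_id": workspace_id,
            "logs": [
                {"category": c, "enabled": True}
                for c in ["StorageRead", "StorageWrite", "StorageDelete"]
            ],
        },
    )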

Note

This feature is in preview at the time of writing. You can enroll in the preview on the Data Lake documentation page (https://docs.microsoft.com/en-us/azure/storage/blobs/monitor-blob-storage) if it...

Talking about backups

In every well-designed system, you want to have your data lake secured against data losses. As we saw at the beginning of this chapter, while creating our Data Lake Storage Gen2 account, there are different levels of redundancy that you can configure for your storage. But replication is only one facet of preventing data loss.

Configuring delete locks for the storage service

To prevent your storage service from being deleted by accident (yes, this can happen!), you can place a delete lock on your storage account. This will cause Azure to block any delete action and prompt the user to remove the delete lock first. Follow these steps to set this up:

  1. Navigate to the Locks entry in the Settings section of the Navigation blade of your Data Lake storage account.
  2. Here, you can add a lock of the Delete type to your account.

You can also set the whole account to Read-only if you need to. The same lock can be created programmatically, as sketched below.
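Here is a minimal sketch of that delete lock created with the azure-mgmt-resource Python package; the names are placeholder assumptions:

    # A sketch of placing a CanNotDelete lock on the storage account.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.resource.locks import ManagementLockClient

    subscription_id = "<your-subscription-id>"
    locks = ManagementLockClient(DefaultAzureCredential(), subscription_id)

    locks.management_locks.create_or_update_at_resource_level(
        resource_group_name="rg-datalake",
        resource_provider_namespace="Microsoft.Storage",
        parent_resource_path="",
        resource_type="storageAccounts",
        resource_name="mydatalakeaccount01",
        lock_name="do-not-delete",
        parameters={"level": "CanNotDelete"},  # "ReadOnly" would freeze the account instead
    )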

Backing up your data

At the time of writing, there...

Implementing access control in your Data Lake

Azure storage accounts implement different ways to control access to content that is stored there:

  • Role-based access control (RBAC)
  • Access control lists (ACLs)
  • Shared Key authorization
  • Shared Access Signature (SAS) authorization
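The last two mechanisms in this list are secret-based. As a small sketch, this is how a short-lived, read-only SAS for a filesystem could be generated with the azure-storage-file-datalake package, signed with the account's Shared Key; the account name and key are placeholders:

    # A sketch of issuing a short-lived, read-only SAS for a filesystem.
    from datetime import datetime, timedelta

    from azure.storage.filedatalake import (
        FileSystemSasPermissions,
        generate_file_system_sas,
    )

    sas_token = generate_file_system_sas(
        account_name="mydatalakeaccount01",
        file_system_name="curated",
        credential="<account-key>",                     # the account's Shared Key
        permission=FileSystemSasPermissions(read=True, list=True),
        expiry=datetime.utcnow() + timedelta(hours=1),  # keep lifetimes short
    )
    print(f"https://mydatalakeaccount01.dfs.core.windows.net/curated?{sas_token}")

Anybody holding this token can read and list the filesystem until it expires, which is why short lifetimes and careful distribution matter.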

Understanding RBAC

To give access to a user, group, service principal, or managed identity using RBAC, the user or the application needs to be managed by Azure Active Directory (AAD). RBAC works with so-called permission sets that are bundled into roles that a security principal can be assigned to.

When RBAC is assigned to Data Lake Storage, this always happens at the top level of the account or the filesystem. This means that the user or the application will have access to everything that is stored in the account or in the container that access has been granted to.
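Where RBAC stops at the account or container level, POSIX-style ACLs pick up: they can grant access to individual directories and files. Here is a sketch, assuming a hypothetical directory and the AAD object ID of the principal to be granted:

    # A sketch of a directory-level ACL; the directory and object ID are
    # hypothetical. set_access_control replaces the full ACL, so the base
    # user/group/other entries are included alongside the named entry.
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://mydatalakeaccount01.dfs.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    directory = service.get_file_system_client("curated").get_directory_client("sales")

    principal_oid = "<aad-object-id>"  # the user, group, or service principal to grant
    directory.set_access_control(
        acl=(
            "user::rwx,group::r-x,other::---,"  # base entries for owner/group/other
            f"user:{principal_oid}:r-x"         # named entry: read + traverse
        )
    )
    print(directory.get_access_control()["acl"])

For everything at container scope and above, though, the RBAC roles remain the tool of choice.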

The following roles can be used to grant access to data in a data lake:

  • Storage Blob Data Owner: This role will give you unlimited access...

Setting the networking options

While provisioning your Data Lake Storage account, you must set the networking options. You have three options there and, depending on your choice, you can implement Azure storage firewalls, virtual networks, and private endpoints with your Data Lake Storage.

Allowing access from all networks will make the Data Lake Storage "visible" to everybody. You won't limit any network addresses, so you will need to secure the data lake with other measures, such as RBAC and ACLs. And don't forget: anybody with a Shared Key or a valid SAS will be able to reach your data lake as well.

Understanding storage account firewalls

You might want to consider setting up firewall rules to limit traffic to your data lake so that only IP ranges and addresses that you know are allowed through. Let's take a look:

  1. When you examine the Navigation blade of your Data Lake Storage, you will find the entry for Firewalls and virtual networks in...
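The same restriction can also be scripted. Here is a sketch using the azure-mgmt-storage package that denies all traffic by default and then allows one known range; the names are placeholders and 203.0.113.0/24 is a documentation-only IP range:

    # A sketch of a storage firewall: deny by default, allow a known range.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient

    subscription_id = "<your-subscription-id>"
    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

    client.storage_accounts.update(
        resource_group_name="rg-datalake",
        account_name="mydatalakeaccount01",
        parameters={
            "network_rule_set": {
                "default_action": "Deny",          # block all networks first
                "ip_rules": [{"ip_address_or_range": "203.0.113.0/24"}],
                "bypass": "AzureServices",         # keep trusted Azure services working
            }
        },
    )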

Discovering additional knowledge

The following is some advice that you might find useful.

Do:

  • Plan for security from day one: Where are your trade-offs between security and usability?
  • Enforce as much discipline as needed, but not more than is really necessary. Your data lake needs to serve your Data Scientists, as well as other communities in your company. Your modern data warehouse needs some agility.
  • Structure your zones clearly and stick to the plan. If you need to redesign, do it in a new structure rather than in the one you have already started.
  • Implement a Data Catalog (we will talk about this in Chapter 14, Establishing Data Governance) to enable easy data discovery.
  • Integrate with DevOps for a controlled and repeatable system.

Don't:

  • Don't mix different formats. Always stick to one single file format per folder. You will often want to read all the files in a folder in one go.
  • Don't forget naming conventions!

Summary

In this chapter, we talked about one of the main components of any modern data warehouse architecture: Data Lake Storage. You learned how to provision Data Lake Storage Gen2 on the Azure portal.

We discussed how to organize a data lake from different angles and examined the zones and folder structures that you will need to implement for efficient usage. We also learned how to implement a data model in Data Lake Storage.

After that, we looked at the administrative side of the Data Lake storage account and talked about monitoring, backups, access control, and networking.

The Data Lake Storage account is not the only component in your modern data warehouse architecture that will hold data for analysis. Stay tuned for the next chapter, where we will shed some light on the relational database components that we can add to the architecture.

Further reading

For more information about the topics that were covered in this chapter, consult the official Azure documentation.
