You're reading from Azure Data and AI Architect Handbook

Product type Book

Published in Jul 2023

Publisher Packt

ISBN-13 9781803234861

Pages 284 pages

Edition 1st Edition

Concepts

Data Science

Authors (2):

Olivier Mertens

Breght Van Baelen

Table of Contents (18) Chapters

Preface

Part 1: Introduction to Azure Data Architect

Chapter 1: Introduction to Data Architectures

Chapter 2: Preparing for Cloud Adoption

Part 2: Data Engineering on Azure

Chapter 3: Ingesting Data into the Cloud

Chapter 4: Transforming Data on Azure

Chapter 5: Storing Data for Consumption

Part 3: Data Warehousing and Analytics

Chapter 6: Data Warehousing

Chapter 7: The Semantic Layer

Chapter 8: Visualizing Data Using Power BI

Chapter 9: Advanced Analytics Using AI

Part 4: Data Security, Governance, and Compliance

Chapter 10: Enterprise-Level Data Governance and Compliance

Chapter 11: Introduction to Data Security

Index

Why subscribe?

Other Books You May Enjoy

Ingesting Data into the Cloud

Ingesting data into the cloud is a key step in any data pipeline and can greatly impact the efficiency and scalability of your data processing and analysis. This is why it is critical for any cloud data architect to have a deep understanding of data ingestion techniques and architectures on Azure. In this chapter, we will dive into the world of data ingestion on Azure, focusing on the key concepts of batch ingestion and data streaming, as well as various ingestion architectures.

We will begin by discussing the differences between batch ingestion and data streaming, and when to use each method. We will also explore the benefits and limitations of each approach and provide examples of use cases for each method.

Next, we will explore data ingestion architectures on Azure. We will introduce Azure Data Factory and Azure Synapse pipelines for designing and implementing data pipelines on Azure. Azure Data Lake Storage (ADLS) is introduced for permanently...

Batch and streaming ingestion

Regardless of the type (batch or streaming), data ingestion is located in the first layer of the data architecture, as seen in Figure 3.1:

Figure 3.1 – Reference diagram for cloud data architectures: the ingestion layer (on the left) forms the first layer of the architecture

The ingestion layer forms the front door for the solution. Here, we pull in data using data pipelines and, in enterprise-level solutions, commonly have it land in a massive-scale, unstructured storage service such as a data lake.

The type of ingestion plays a key role in the design of a cloud data architecture. Batch ingestion was, and in most cases still is, the norm for ingesting data into the cloud. A batch approach refers to the periodic ingestion or processing of (usually large) volumes of data. Streaming ingestion, as the name suggests, involves continuous streams of data.

In general, batch ingestion and processing have long been the...

ADLS for raw data ingestion

Before diving deeper into ingestion architectures, we need to introduce the fundamentals of data lakes, where the ingested data will land in the majority of cases.

A data lake can be seen as mass storage with support for all kinds of data. It does not enforce specific file types or data types, which makes it a remarkably good landing zone for ingestion. The more rules that are enforced (as is the case in structured databases, for example), the likelier it becomes that data ingestion pipelines will break if the file type or schema changes.
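To make the point about enforced rules concrete, here is a minimal sketch in plain Python. It does not use any Azure service: `load_into_strict_table`, `land_in_lake`, and the example columns are hypothetical stand-ins for a schema-enforcing sink and a schema-free landing zone.

```python
import json

# Hypothetical fixed schema that a structured sink would enforce.
EXPECTED_COLUMNS = {"device_id", "temperature"}


def load_into_strict_table(record: dict) -> dict:
    """Mimic a structured sink: reject records whose columns differ."""
    if set(record) != EXPECTED_COLUMNS:
        raise ValueError(f"schema mismatch: {sorted(record)}")
    return record


def land_in_lake(record: dict) -> bytes:
    """Mimic a data lake landing zone: store the payload as-is."""
    return json.dumps(record).encode("utf-8")


old_record = {"device_id": "a1", "temperature": 21.5}
# The source system starts sending an extra column.
new_record = {"device_id": "a1", "temperature": 21.5, "humidity": 40}

# The landing zone accepts both shapes without complaint.
raw_files = [land_in_lake(r) for r in (old_record, new_record)]

# The strict sink accepts only the original shape.
load_into_strict_table(old_record)
try:
    load_into_strict_table(new_record)
    strict_accepted_new = True
except ValueError:
    strict_accepted_new = False
```

In this sketch the schema change breaks only the strict sink. The landing zone keeps the new payload, so the change can be handled later in a downstream step.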

On the Azure cloud, a data lake is a specific type of Azure Storage account. Therefore, we will first introduce this service and its features.

Azure storage accounts

Azure storage accounts can be used to store all kinds of data objects. They provide four distinct types of storage, as follows:

Binary Large Object (Blob) storage
File storage
Queue storage
Table storage
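As a small illustration of how the four storage types sit side by side in one account, the sketch below builds the default public-cloud endpoint for each service. The account name `contosodata` and the helper `service_endpoint` are hypothetical; custom domains, private endpoints, and sovereign clouds use different hostnames.

```python
# Default public-cloud endpoint suffixes for the four storage services.
SERVICE_SUFFIXES = {
    "blob": "blob.core.windows.net",
    "file": "file.core.windows.net",
    "queue": "queue.core.windows.net",
    "table": "table.core.windows.net",
}


def service_endpoint(account_name: str, service: str) -> str:
    """Return the default HTTPS endpoint of one service in a storage account."""
    if service not in SERVICE_SUFFIXES:
        raise ValueError(f"unknown storage service: {service}")
    return f"https://{account_name}.{SERVICE_SUFFIXES[service]}"


# Hypothetical account name, used for illustration only.
for name in SERVICE_SUFFIXES:
    print(name, service_endpoint("contosodata", name))
```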

Batch ingestion architectures

The simplest form of ingestion architecture is a use case where data is only ingested in batches from other cloud-based sources (no sources residing on-premises). In this case, we will use data pipelines to periodically fetch large amounts of data and write them to the bronze layer in the data lake. Note that we refrain from performing any kind of transformation in this initial pipeline.
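The idea of landing data untransformed in the bronze layer can be sketched as follows. This is not a Data Factory pipeline: it uses the local filesystem as a stand-in for the data lake, and the function `land_in_bronze`, the folder layout, and the source name `webshop` are assumptions made for illustration.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def land_in_bronze(
    source_file: Path, lake_root: Path, source_name: str, ingestion_date: date
) -> Path:
    """Copy a source file byte-for-byte into a dated bronze folder."""
    target_dir = (
        lake_root
        / "bronze"
        / source_name
        / f"ingestion_date={ingestion_date.isoformat()}"
    )
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    # No parsing, cleaning, or type conversion: the raw bytes are kept as-is.
    shutil.copyfile(source_file, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    export = tmp_path / "orders.csv"
    export.write_bytes(b"id,amount\n1,9.99\n")
    landed = land_in_bronze(export, tmp_path / "lake", "webshop", date(2023, 7, 1))
    print(landed.relative_to(tmp_path / "lake"))
```

Partitioning the landing path by ingestion date is one common convention, not a requirement; it keeps each periodic run separate so that a failed batch can be re-ingested without touching earlier ones.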

We will look at ingesting data from the following sources:

Cloud sources
On-premises sources

Let’s first look at how to ingest data from cloud-based sources.

Ingesting data from cloud sources

When ingesting data from other cloud sources, the connection is often more convenient. Also, we can make use of Azure-hosted integration runtimes (IRs). These will serve as the compute for the pipeline orchestration in either Azure Data Factory or Azure Synapse pipelines. Other Data Factory components will be discussed in more detail in the next...

Streaming ingestion architectures

While batch ingestion architectures are designed to receive a collection of data at once, streaming ingestion architectures receive data in real time, as soon as a new event occurs in the streaming data sources. Examples of streaming data sources are given here:

IoT sensors in a manufacturing process
Server and security logs
Click-stream data from apps and websites
Stock values
Live sport updates
Real-time traffic updates

Having a real-time data source does not necessarily mean you need a streaming ingestion architecture to ingest the data. Data can also be buffered at the source and ingested in batches. This could be more cost-effective as streaming ingestion architectures tend to be more expensive. Streaming ingestion architectures are recommended when the volume and velocity of data are too great to handle at the source or in use cases where decisions need to be made in real time. Examples of such use cases are given...
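Buffering at the source and ingesting in batches, as described above, could look roughly like the following sketch. The class `SourceBuffer`, its `batch_size` parameter, and the `send_batch` callback are hypothetical; a real source would also need to deal with time-based flushing, retries, and durable storage.

```python
from typing import Callable, List


class SourceBuffer:
    """Collect events at the source and hand them over in fixed-size batches."""

    def __init__(
        self, batch_size: int, send_batch: Callable[[List[dict]], None]
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._batch_size = batch_size
        self._send_batch = send_batch
        self._events: List[dict] = []

    def add(self, event: dict) -> None:
        """Buffer one event and flush once the batch is full."""
        self._events.append(event)
        if len(self._events) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, if anything, as one batch."""
        if self._events:
            batch, self._events = self._events, []
            self._send_batch(batch)


# Stand-in for a batch ingestion pipeline: it just records each batch.
received: List[List[dict]] = []
buffer = SourceBuffer(batch_size=3, send_batch=received.append)

for reading in range(7):
    buffer.add({"sensor": "s1", "value": reading})
buffer.flush()  # send the final, partially filled batch

print([len(batch) for batch in received])
```

The trade-off is latency: an event may wait in the buffer until the batch fills or is flushed, which is acceptable only when decisions do not need to be made in real time.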

Summary

In this chapter, we provided a comprehensive overview of the various methods and tools available for getting data into the cloud. The chapter started by discussing the differences between batch ingestion and streaming ingestion and when to use each method. It explained the benefits and limitations of each approach and provided examples of use cases for each method.

One of the key tools introduced in this chapter is ADLS. This is a powerful storage solution for big data and allows for efficient and flexible storage of large datasets in the cloud. The chapter explained how ADLS can store data in a variety of formats, including structured and unstructured data. We also discussed access tiers, redundancy, and data lake tiers.

We delved into architectures for both batch ingestion from cloud sources and on-premises sources. Next, we explained streaming architectures, such as lambda and kappa architectures, which are becoming increasingly popular for real-time data ingestion...