Azure Data and AI Architect Handbook


Product type Book
Published in Jul 2023
Publisher Packt
ISBN-13 9781803234861
Pages 284 pages
Edition 1st Edition
Authors (2): Olivier Mertens, Breght Van Baelen

Table of Contents (18) Chapters

Preface
Part 1: Introduction to Azure Data Architect
  • Chapter 1: Introduction to Data Architectures
  • Chapter 2: Preparing for Cloud Adoption
Part 2: Data Engineering on Azure
  • Chapter 3: Ingesting Data into the Cloud
  • Chapter 4: Transforming Data on Azure
  • Chapter 5: Storing Data for Consumption
Part 3: Data Warehousing and Analytics
  • Chapter 6: Data Warehousing
  • Chapter 7: The Semantic Layer
  • Chapter 8: Visualizing Data Using Power BI
  • Chapter 9: Advanced Analytics Using AI
Part 4: Data Security, Governance, and Compliance
  • Chapter 10: Enterprise-Level Data Governance and Compliance
  • Chapter 11: Introduction to Data Security
Index
Other Books You May Enjoy

Ingesting Data into the Cloud

Ingesting data into the cloud is a key step in any data pipeline and can greatly impact the efficiency and scalability of your data processing and analysis. This is why it is critical for any cloud data architect to have a deep understanding of data ingestion techniques and architectures on Azure. In this chapter, we will dive into the world of data ingestion on Azure, focusing on the key concepts of batch ingestion and data streaming, as well as various ingestion architectures.

We will begin by discussing the differences between batch ingestion and data streaming, and when to use each method. We will also explore the benefits and limitations of each approach and provide examples of use cases for each method.

Next, we will explore data ingestion architectures on Azure. We will introduce Azure Data Factory and Azure Synapse pipelines for designing and implementing data pipelines on Azure. Azure Data Lake Storage (ADLS) is introduced for permanently...

Batch and streaming ingestion

Regardless of the type (batch or streaming), data ingestion is located in the first layer of the data architecture, as seen in Figure 3.1:

Figure 3.1 – Reference diagram for cloud data architectures: the ingestion layer (on the left) forms the first layer of the architecture


The ingestion layer forms the front door for the solution. Here, we pull in data using data pipelines and, in enterprise-level solutions, commonly have it land in a massive-scale, unstructured storage service such as a data lake.

The type of ingestion plays a key role in the design of a cloud data architecture. Batch ingestion was, and in most cases still is, the norm for ingesting data into the cloud. A batch approach refers to the periodic ingestion or processing of (usually large) volumes of data. Streaming ingestion, as the name suggests, involves continuous streams of data.
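To make the contrast concrete, here is a minimal Python sketch. It is not taken from the book: `ListSink`, `ingest_batch`, and `ingest_stream` are hypothetical names, and the in-memory sink merely stands in for a real landing zone. The batch function writes one accumulated bulk per run, while the streaming function writes each event as it arrives.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class ListSink:
    """Hypothetical in-memory stand-in for a landing zone such as a data lake."""
    writes: list = field(default_factory=list)

    def write(self, records: list) -> None:
        # Each call represents one write operation against the landing zone.
        self.writes.append(list(records))


def ingest_batch(source_records: Iterable[dict], sink: ListSink) -> None:
    """Batch: collect everything the source accumulated, then write once."""
    sink.write(list(source_records))


def ingest_stream(events: Iterator[dict], sink: ListSink) -> None:
    """Streaming: write every event individually, as soon as it is received."""
    for event in events:
        sink.write([event])
```

With three input records, the batch sink ends up with a single write containing all three, whereas the streaming sink ends up with three writes of one record each.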

In general, batch ingestion and processing have long been the...

ADLS for raw data ingestion

Before diving deeper into ingestion architectures, we need to introduce the fundamentals of data lakes, where the ingested data will land in the majority of cases.

A data lake can be seen as a mass storage service with support for all kinds of data. It does not enforce specific file types or data types, which makes it a remarkably good landing zone for ingestion. The more rules that are enforced (as is the case in structured databases, for example), the likelier it becomes that data ingestion pipelines will break if the file type or schema changes.
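The following Python sketch illustrates this point under simplifying assumptions. The `EXPECTED_COLUMNS` schema, the function names, and the dictionary used as a "lake" are all hypothetical and only serve the illustration. A loader that enforces a fixed schema rejects a record once the source adds a field, while landing the raw bytes succeeds regardless of their shape.

```python
import json

# Hypothetical schema enforced by a structured target (assumption for illustration).
EXPECTED_COLUMNS = ("id", "temperature")


def load_into_structured_table(raw: bytes) -> tuple:
    """Parse a JSON record and fail if it does not match the fixed schema."""
    record = json.loads(raw)
    if set(record) != set(EXPECTED_COLUMNS):
        raise ValueError(f"unexpected schema: {sorted(record)}")
    return tuple(record[column] for column in EXPECTED_COLUMNS)


def land_in_lake(raw: bytes, lake: dict, path: str) -> None:
    """Store the payload as-is; no file type or schema is enforced."""
    lake[path] = raw
```

If the source starts sending an extra `humidity` field, `load_into_structured_table` raises a `ValueError`, whereas `land_in_lake` keeps working and the schema change can be handled later, downstream of ingestion.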

On the Azure cloud, a data lake is a specific version of the Azure Storage account. Therefore, we will first introduce this service and its features.

Azure storage accounts

Azure storage accounts can be used to store all kinds of data objects. They provide four distinct types of storage, as follows:

  • Binary Large Object (Blob) storage
  • File storage
  • Queue storage
  • Table storage
  • ...

Batch ingestion architectures

The simplest form of ingestion architecture is a use case where data is only ingested in batches from other cloud-based sources (no sources residing on-premises). In this case, we will use data pipelines to periodically fetch large amounts of data and write them to the bronze layer in the data lake. Note that we refrain from performing any kind of transformation in this initial pipeline.
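As a rough illustration of such an untransformed copy, the Python sketch below uses the local filesystem as a stand-in for the data lake. It is not how Azure Data Factory or Azure Synapse pipelines are defined; the function names and the `bronze/<source>/<year>/<month>/<day>/` folder layout are assumptions chosen for the example.

```python
import shutil
from datetime import date
from pathlib import Path


def bronze_path(lake_root: Path, source: str, run_date: date, filename: str) -> Path:
    """Build a date-partitioned target path in the bronze layer (assumed layout)."""
    return (
        lake_root / "bronze" / source
        / f"{run_date:%Y}" / f"{run_date:%m}" / f"{run_date:%d}"
        / filename
    )


def ingest_file_to_bronze(src_file: Path, lake_root: Path, source: str, run_date: date) -> Path:
    """Copy the source file byte for byte; no parsing, cleaning, or reshaping."""
    target = bronze_path(lake_root, source, run_date, src_file.name)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src_file, target)
    return target
```

Because the file is copied as-is, the bronze layer keeps an exact record of what the source delivered on each run, and any transformation logic can change later without re-fetching the data.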

We will look at ingesting data from the following sources:

  • Cloud sources
  • On-premises sources

Let’s first look at how to ingest data from cloud-based sources.

Ingesting data from cloud sources

When ingesting data from other cloud sources, the connection is often more convenient. We can also make use of Azure-hosted integration runtimes (IRs), which serve as the compute for the pipeline orchestration in either Azure Data Factory or Azure Synapse pipelines. Other Data Factory components will be discussed in more detail in the next...

Streaming ingestion architectures

While batch ingestion architectures are designed to receive a collection of data at once, streaming ingestion architectures receive data in real time, as soon as a new event occurs in the streaming data sources. Examples of streaming data sources are given here:

  • IoT sensors in a manufacturing process
  • Server and security logs
  • Click-stream data from apps and websites
  • Stock values
  • Live sport updates
  • Real-time traffic updates

Having a real-time data source does not necessarily mean you need a streaming ingestion architecture to ingest the data. Data can also be buffered at the source and ingested in batches. This can be more cost-effective, as streaming ingestion architectures tend to be more expensive. Streaming ingestion architectures are recommended when the volume and velocity of data are too high to handle at the source or in use cases where decisions need to be made in real time. Examples of such use cases are given...
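The option of buffering events at the source and shipping them in batches can be sketched in Python as follows. This is an illustrative sketch only: `SourceBuffer`, its thresholds, and the `flush` callback are hypothetical, and a real source would also need to handle failed flushes and shutdown.

```python
import time
from typing import Callable, List


class SourceBuffer:
    """Collect events at the source and hand them over in batches."""

    def __init__(
        self,
        flush: Callable[[List[dict]], None],
        max_events: int = 100,           # assumed size threshold
        max_age_seconds: float = 60.0,   # assumed age threshold
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._clock = clock
        self._events: List[dict] = []
        self._first_event_at = 0.0

    def add(self, event: dict) -> None:
        # Remember when the current batch started so its age can be checked.
        if not self._events:
            self._first_event_at = self._clock()
        self._events.append(event)
        too_many = len(self._events) >= self._max_events
        too_old = self._clock() - self._first_event_at >= self._max_age
        if too_many or too_old:
            self.flush_now()

    def flush_now(self) -> None:
        """Send whatever is buffered as one batch and start a new buffer."""
        if self._events:
            self._flush(self._events)
            self._events = []
```

The age check here only runs when a new event arrives, so a quiet source would need a periodic call to `flush_now` as well. The trade-off is latency: events wait in the buffer until a threshold is reached, which is acceptable only when decisions do not have to be made in real time.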

Summary

In this chapter, we provided a comprehensive overview of the various methods and tools available for getting data into the cloud. The chapter started by discussing the differences between batch ingestion and streaming ingestion and when to use each method. It explained the benefits and limitations of each approach and provided examples of use cases for each method.

One of the key tools introduced in this chapter is ADLS. This is a powerful storage solution for big data and allows for efficient and flexible storage of large datasets in the cloud. The chapter explained how ADLS can store data in a variety of formats, including structured and unstructured data. We also discussed access tiers, redundancy, and data lake tiers.

We delved into architectures for batch ingestion from both cloud and on-premises sources. Next, we explained streaming architectures, such as lambda and kappa architectures, which are becoming increasingly popular for real-time data ingestion...
