You're reading from Azure Data and AI Architect Handbook

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781803234861
Edition: 1st Edition
Authors (2):

Olivier Mertens

Olivier Mertens is a cloud solution architect for Azure data and AI at Microsoft, based in Dublin, Ireland. In this role, he assists organizations in designing their enterprise-scale data platforms and analytical workloads. Alongside his role as an architect, Olivier leads the technical AI expertise for Microsoft EMEA in the corporate market. This includes leading knowledge sharing and internal upskilling, as well as solving highly complex or strategic customer AI cases. Before his time at Microsoft, he worked as a data scientist at a Microsoft partner in Belgium. Olivier is a lecturer for generative AI and AI solution architectures, a keynote speaker for AI, and holds a master's degree in information management, a postgraduate degree as an AI business architect, and a bachelor's degree in business management.

Breght Van Baelen

Breght Van Baelen is a Microsoft employee based in Dublin, Ireland, and works as a cloud solution architect for the data and AI pillar in Azure. He provides guidance to organizations building large-scale analytical platforms and data solutions. In addition, Breght was chosen as an advanced cloud expert for Power BI and is responsible for providing technical expertise in Europe, the Middle East, and Africa. Before his time at Microsoft, he worked as a data consultant at Microsoft Gold Partners in Belgium. Breght led a team of eight data and AI consultants as a data science lead. Breght holds a master's degree in computer science from KU Leuven, specializing in AI. He also holds a bachelor's degree in computer science from the University of Hasselt.


Transforming Data on Azure

Azure offers a wide range of services for data processing. One of the key features of Azure is its ability to easily transform data from various sources into a format that is suitable for further analysis and reporting.

In this chapter, we will discuss the following:

  • Designing data pipelines on Azure
  • Transforming data on Azure
  • Data transformation architectures
  • Data transformations in data lake tiers
  • Operationalizing data pipelines on Azure

This chapter will introduce the various tools and services available on Azure for data transformation, including Azure Data Factory (ADF), Azure Stream Analytics, and Azure Databricks. We will explore the core features and capabilities of each service and show in which scenarios each works best. In line with the previous chapter, the focus will be on both batch processing and real-time processing.

Next, we will look at some example architectures and provide a quick guide on how to...

Designing data pipelines on Azure

In the previous chapter, we discussed how ADF and Azure Synapse Analytics fit into a data architecture by providing data pipelines for batch ingestion.

Here, we will look at how Azure Data Factory and Azure Synapse Analytics are used for transformation pipelines. These pipelines will read data from one data lake tier, process it in some way, and write the resulting dataset to the next data lake tier.
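Conceptually, every transformation pipeline follows the same read-transform-write pattern between tiers. The following plain-Python sketch illustrates that pattern; it is not ADF code. Local folders stand in for ADLS Gen2 containers, and the folder names, field names, and functions are invented for illustration:

```python
import csv
from pathlib import Path

# Illustrative stand-ins for data lake tiers; in Azure these would be
# ADLS Gen2 containers or folders such as "bronze/" and "silver/".
BRONZE = Path("bronze")
SILVER = Path("silver")

def transform(rows):
    """Example transformation step: drop exact duplicate records and
    normalize the 'country' column to upper case."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            continue  # skip duplicates
        seen.add(key)
        row["country"] = row["country"].upper()
        out.append(row)
    return out

def run_pipeline(name):
    """Read a dataset from the bronze tier, transform it, and write the
    result to the silver tier -- the pattern an ADF or Synapse pipeline
    activity orchestrates. Returns (rows read, rows written)."""
    with open(BRONZE / f"{name}.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    cleaned = transform(rows)
    SILVER.mkdir(exist_ok=True)
    with open(SILVER / f"{name}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=cleaned[0].keys())
        writer.writeheader()
        writer.writerows(cleaned)
    return len(rows), len(cleaned)
```

In a real solution, the read, transform, and write steps would run inside a mapping data flow or a Spark notebook against the data lake, with the ADF or Synapse pipeline only orchestrating the call.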

Types of pipelines on Azure

Across all Azure services, we can find many different pipelines. However, we can classify these pipelines into three categories: data pipelines (also referred to as ETL or ELT pipelines), machine learning pipelines (also referred to as MLOps pipelines), and release pipelines (also referred to as CI/CD pipelines).

Data pipelines are used for data movements and data transformations, machine learning pipelines are used to (re)train and (re)deploy machine learning models, and release pipelines are used to push code through...

Transforming data on Azure

As datasets continue to grow in size and complexity, it is increasingly important to have efficient ways of manipulating and processing this data. We will cover both batch and real-time transformation options.

For batch transformations, we will discuss the use of the following:

  • Mapping data flows
  • Spark notebooks
  • SQL scripts
  • SSIS

These tools can be used for shaping and cleaning large datasets and allow you to define complex data transformations using a visual interface or programming language, making it easy to handle even the most challenging data manipulation tasks.

For real-time transformations, we will look at the following:

  • Azure Stream Analytics
  • Azure Databricks

Both technologies allow you to process data with remarkably low latency, enabling real-time insights and decision-making. With these tools, you can process data streams from various sources in real time, transforming and analyzing the data as...
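To make the idea of stream processing concrete, the following plain-Python sketch groups a stream of events into fixed-size tumbling windows and counts events per sensor per window. This is a conceptual illustration of what a Stream Analytics query expresses with a GROUP BY over a tumbling window, not Stream Analytics code; the event data and names are made up:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Assign each (timestamp_seconds, sensor_id) event to the tumbling
    window containing it, then count events per sensor per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, sensor in events:
        # A tumbling window of size N covers [k*N, (k+1)*N); find its start.
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][sensor] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Six events spread over two 10-second windows.
events = [(0, "a"), (3, "a"), (4, "b"), (11, "a"), (12, "b"), (13, "b")]
result = tumbling_window_counts(events, 10)
```

A real streaming engine would evaluate such windows incrementally over an unbounded stream and emit each window's result as it closes, rather than batching all events in memory as this sketch does.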

Data transformation architectures

We have explored and discussed the different tools for data transformation. Next, it is time to indicate where they fit in the overall architecture of an Azure data solution. We will look at batch transformation and stream transformation architectures separately.

Batch transformation architecture

For a solution only making use of batch processing, this is straightforward. The transformation is performed in the ETL pipelines, which push the data through the different data lake tiers. The following figure shows an example architecture of batch processing:

Figure 4.3 – Batch transformations are orchestrated by data pipelines between data lake tiers in modern cloud architectures

The ADF or Synapse pipeline will call upon the transformation workflow in the form of a pipeline activity. Both ADF and Azure Synapse Analytics have built-in activities for calling mapping data flows, Synapse notebooks, and Azure Databricks...

Data transformations in data lake tiers

As we saw in Chapter 3, we dump raw data into the bronze layer. It serves as the primary source of raw, unrefined data for the warehouse. This layer contains all the original data as it is received from various sources, including transactional systems, log files, and external data feeds. The purpose of the bronze layer is to provide a centralized location for raw data to be stored and to make it available for further processing in the higher layers of the warehouse. From there, data is transformed as it moves into the silver and gold layers.

Bronze-to-silver transformations

When moving from the bronze layer to the silver layer, a series of transformations is applied to make the data more usable for analysis. Some examples of transformations typically applied when producing the silver layer include the following:

  • Data cleansing: Removing any duplicates and correcting errors and inconsistencies in the data.
  • Data integration: Combining data from multiple...
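As a simplified sketch of what bronze-to-silver cleansing and integration can look like in code, the following plain-Python functions deduplicate and normalize customer records and then join them with order records from a second source. The record shapes and field names are invented for illustration; in practice, these steps would typically run in a Spark notebook or mapping data flow:

```python
def cleanse(rows):
    """Data cleansing: trim whitespace, normalize email casing, and
    drop duplicate records keyed on the customer id."""
    seen, clean = set(), []
    for row in rows:
        row = {k: v.strip() for k, v in row.items()}
        row["email"] = row["email"].lower()
        if row["id"] in seen:
            continue  # keep only the first record per customer id
        seen.add(row["id"])
        clean.append(row)
    return clean

def integrate(customers, orders):
    """Data integration: join cleansed customer records with order
    records from a second source on the customer id."""
    by_id = {c["id"]: c for c in customers}
    return [
        {**by_id[o["customer_id"]], "order_total": o["total"]}
        for o in orders
        if o["customer_id"] in by_id
    ]
```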

Operationalizing data pipelines on Azure

Operationalizing data pipelines on Azure is the process of creating, managing, and maintaining data workflows in the Azure cloud. It involves several key steps, including scheduling data pipelines, monitoring data pipelines, and implementing a Continuous Integration/Continuous Deployment (CI/CD) process for data pipelines.

Scheduling data pipelines on Azure

Scheduling data pipelines on Azure is a crucial step in the operationalization process. It ensures that data pipelines run at the appropriate times and frequencies, and that data is updated and available when needed. ADF offers the flexibility to schedule data pipelines based on time, events, or conditions.

ADF provides two different ways to schedule data pipelines:

  • Triggers: A trigger is a way to start a pipeline on a schedule or in response to an event. ADF...
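Conceptually, a time-based trigger is just a recurrence that fires pipeline runs at a fixed interval. The following sketch models that idea in plain Python; it illustrates the concept only and is not ADF's trigger API (the function name and parameters are our own):

```python
from datetime import datetime, timedelta

def next_runs(start, interval_minutes, count):
    """Model of a schedule trigger's recurrence: given a start time and
    an interval in minutes, return the next `count` pipeline run times."""
    return [start + timedelta(minutes=interval_minutes * i) for i in range(count)]
```

In ADF itself, this recurrence is declared on the trigger, and the service fires the associated pipeline runs for you.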

Summary

To recap, data pipelines in Azure are a set of tools and services that allow for the efficient movement and transformation of data. One of the concepts covered in the chapter is the difference between ETL and ELT pipelines. In this book, we will focus mostly on ETL. The chapter also covered the differences between data pipelines in ADF and data pipelines in Azure Synapse Analytics.

We described various tools and technologies available for data transformation in Azure, including mapping data flows, Spark notebooks, SQL scripts, and SSIS packages for batch processing, and Azure Stream Analytics and Azure Databricks for real-time processing.

Next, we looked at an example architecture for both batch and stream processing, providing a high-level overview of the components and technologies involved. Later parts of the architecture remain abstract for now. We introduced a holistic flowchart to map the decision-making process when choosing one of the transformation tools discussed...


Authors (2)

author image
Olivier Mertens

Olivier Mertens is a cloud solution architect for Azure data and AI at Microsoft, based in Dublin, Ireland. In this role, he assisted organizations in designing their enterprise-scale data platforms and analytical workloads. Next to his role as an architect, Olivier leads the technical AI expertise for Microsoft EMEA in the corporate market. This includes leading knowledge sharing and internal upskilling, as well as solving highly complex or strategic customer AI cases. Before his time at Microsoft, he worked as a data scientist at a Microsoft partner in Belgium. Olivier is a lecturer for generative AI and AI solution architectures, a keynote speaker for AI, and holds a master's degree in information management, a postgraduate degree as an AI business architect, and a bachelor's degree in business management.
Read more about Olivier Mertens

author image
Breght Van Baelen

Breght Van Baelen is a Microsoft employee based in Dublin, Ireland, and works as a cloud solution architect for the data and AI pillar in Azure. He provides guidance to organizations building large-scale analytical platforms and data solutions. In addition, Breght was chosen as an advanced cloud expert for Power BI and is responsible for providing technical expertise in Europe, the Middle East, and Africa. Before his time at Microsoft, he worked as a data consultant at Microsoft Gold Partners in Belgium. Breght led a team of eight data and AI consultants as a data science lead. Breght holds a master's degree in computer science from KU Leuven, specializing in AI. He also holds a bachelor's degree in computer science from the University of Hasselt.
Read more about Breght Van Baelen