You're reading from Azure Data and AI Architect Handbook

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781803234861
Edition: 1st Edition
Authors (2):

Olivier Mertens

Olivier Mertens is a cloud solution architect for Azure data and AI at Microsoft, based in Dublin, Ireland. In this role, he assists organizations in designing their enterprise-scale data platforms and analytical workloads. Alongside his role as an architect, Olivier leads the technical AI expertise for Microsoft EMEA in the corporate market. This includes leading knowledge sharing and internal upskilling, as well as solving highly complex or strategic customer AI cases. Before his time at Microsoft, he worked as a data scientist at a Microsoft partner in Belgium. Olivier is a lecturer for generative AI and AI solution architectures, a keynote speaker for AI, and holds a master's degree in information management, a postgraduate degree as an AI business architect, and a bachelor's degree in business management.

Breght Van Baelen

Breght Van Baelen is a Microsoft employee based in Dublin, Ireland, and works as a cloud solution architect for the data and AI pillar in Azure. He provides guidance to organizations building large-scale analytical platforms and data solutions. In addition, Breght was chosen as an advanced cloud expert for Power BI and is responsible for providing technical expertise in Europe, the Middle East, and Africa. Before his time at Microsoft, he worked as a data consultant at Microsoft Gold Partners in Belgium. Breght led a team of eight data and AI consultants as a data science lead. Breght holds a master's degree in computer science from KU Leuven, specializing in AI. He also holds a bachelor's degree in computer science from the University of Hasselt.


Transforming Data on Azure

Azure offers a wide range of services for data processing. One of the key features of Azure is its ability to easily transform data from various sources into a format that is suitable for further analysis and reporting.

In this chapter, we will discuss the following:

  • Designing data pipelines on Azure
  • Transforming data on Azure
  • Data transformation architectures
  • Data transformations in data lake tiers
  • Operationalizing data pipelines on Azure

This chapter will introduce the various tools and services available on Azure for data transformation, including Azure Data Factory (ADF), Azure Stream Analytics, and Azure Databricks. We will explore the core features and capabilities of each service and show in which scenarios each works best. In line with the previous chapter, the focus will be on both batch processing and real-time processing.

Next, we will look at some example architectures and provide a quick guide on how to...

Designing data pipelines on Azure

In the previous chapter, we discussed how ADF and Azure Synapse Analytics fit into a data architecture by providing data pipelines for batch ingestion.

Here, we will look at how Azure Data Factory and Azure Synapse Analytics are used for transformation pipelines. These pipelines will read data from one data lake tier, process it in some way, and write the resulting dataset to the next data lake tier.
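Conceptually, every transformation pipeline follows the same read-transform-write pattern between tiers. The following plain-Python sketch illustrates that pattern; it is not ADF code. Local folders stand in for ADLS Gen2 containers, and the folder names, field names, and functions are invented for illustration:

```python
import csv
from pathlib import Path

# Illustrative stand-ins for data lake tiers; in Azure these would be
# ADLS Gen2 containers or folders such as "bronze/" and "silver/".
BRONZE = Path("bronze")
SILVER = Path("silver")

def transform(rows):
    """Example transformation step: drop exact duplicate records and
    normalize the 'country' column to upper case."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row.items())
        if key in seen:
            continue  # skip duplicates
        seen.add(key)
        row["country"] = row["country"].upper()
        out.append(row)
    return out

def run_pipeline(name):
    """Read a dataset from the bronze tier, transform it, and write the
    result to the silver tier -- the pattern an ADF or Synapse pipeline
    activity orchestrates. Returns (rows read, rows written)."""
    with open(BRONZE / f"{name}.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    cleaned = transform(rows)
    SILVER.mkdir(exist_ok=True)
    with open(SILVER / f"{name}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=cleaned[0].keys())
        writer.writeheader()
        writer.writerows(cleaned)
    return len(rows), len(cleaned)
```

In a real solution, the read, transform, and write steps would run inside a mapping data flow or a Spark notebook against the data lake, with the ADF or Synapse pipeline only orchestrating the call.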

Types of pipelines on Azure

Across all Azure services, we can find many different pipelines. However, we can classify these pipelines into three categories: data pipelines (also referred to as ETL or ELT pipelines), machine learning pipelines (also referred to as MLOps pipelines), and release pipelines (also referred to as CI/CD pipelines).

Data pipelines are used for data movements and data transformations, machine learning pipelines are used to (re)train and (re)deploy machine learning models, and release pipelines are used to push code through...

Transforming data on Azure

As datasets continue to grow in size and complexity, it is increasingly important to have efficient ways of manipulating and processing this data. We will cover both batch and real-time transformation options.

For batch transformations, we will discuss the use of the following:

  • Mapping data flows
  • Spark notebooks
  • SQL scripts
  • SSIS

These tools can be used for shaping and cleaning large datasets and allow you to define complex data transformations using a visual interface or programming language, making it easy to handle even the most challenging data manipulation tasks.

For real-time transformations, we will look at the following:

  • Azure Stream Analytics
  • Azure Databricks

Both technologies allow you to process data with remarkably low latency, enabling real-time insights and decision-making. With these tools, you can process data streams from various sources in real time, transforming and analyzing the data as...
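To make the idea of stream processing concrete, the following plain-Python sketch groups a stream of events into fixed-size tumbling windows and counts events per sensor per window. This is a conceptual illustration of what a Stream Analytics query expresses with a GROUP BY over a tumbling window, not Stream Analytics code; the event data and names are made up:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Assign each (timestamp_seconds, sensor_id) event to the tumbling
    window containing it, then count events per sensor per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, sensor in events:
        # A tumbling window of size N covers [k*N, (k+1)*N); find its start.
        window_start = (ts // window_seconds) * window_seconds
        windows[window_start][sensor] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Six events spread over two 10-second windows.
events = [(0, "a"), (3, "a"), (4, "b"), (11, "a"), (12, "b"), (13, "b")]
result = tumbling_window_counts(events, 10)
```

A real streaming engine would evaluate such windows incrementally over an unbounded stream and emit each window's result as it closes, rather than batching all events in memory as this sketch does.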

Data transformation architectures

We have explored and discussed the different tools for data transformation. Next, it is time to indicate where they fit in the overall architecture of an Azure data solution. We will look at batch transformation and stream transformation architectures separately.

Batch transformation architecture

For a solution only making use of batch processing, this is straightforward. The transformation is performed in the ETL pipelines, which push the data through the different data lake tiers. The following figure shows an example architecture of batch processing:

Figure 4.3 – Batch transformations are orchestrated by data pipelines between data lake tiers in modern cloud architectures

The ADF or Synapse pipeline will call upon the transformation workflow in the form of a pipeline activity. Both ADF and Azure Synapse Analytics have built-in activities for calling mapping data flows, Synapse notebooks, and Azure Databricks...

Data transformations in data lake tiers

As we saw in Chapter 3, we dump raw data into the bronze layer. It serves as the primary source of raw, unrefined data for the warehouse. This layer contains all the original data as it is received from various sources, including transactional systems, log files, and external data feeds. The purpose of the bronze layer is to provide a centralized location for raw data to be stored and to make it available for further processing in the higher layers of the warehouse. From there, data is transformed as it moves into the silver and gold layers.

Bronze-to-silver transformations

When moving from the bronze layer to the silver layer, a series of transformations is applied to make the data more usable for analysis. Some examples of transformations typically applied when producing the silver layer include the following:

  • Data cleansing: Removing any duplicates and correcting errors and inconsistencies in the data.
  • Data integration: Combining data from multiple...
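As a simplified sketch of what bronze-to-silver cleansing and integration can look like in code, the following plain-Python functions deduplicate and normalize customer records and then join them with order records from a second source. The record shapes and field names are invented for illustration; in practice, these steps would typically run in a Spark notebook or mapping data flow:

```python
def cleanse(rows):
    """Data cleansing: trim whitespace, normalize email casing, and
    drop duplicate records keyed on the customer id."""
    seen, clean = set(), []
    for row in rows:
        row = {k: v.strip() for k, v in row.items()}
        row["email"] = row["email"].lower()
        if row["id"] in seen:
            continue  # keep only the first record per customer id
        seen.add(row["id"])
        clean.append(row)
    return clean

def integrate(customers, orders):
    """Data integration: join cleansed customer records with order
    records from a second source on the customer id."""
    by_id = {c["id"]: c for c in customers}
    return [
        {**by_id[o["customer_id"]], "order_total": o["total"]}
        for o in orders
        if o["customer_id"] in by_id
    ]
```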

Operationalizing data pipelines on Azure

Operationalizing data pipelines on Azure is the process of creating, managing, and maintaining data workflows in the Azure cloud. It involves several key steps, including scheduling data pipelines, monitoring data pipelines, and implementing a Continuous Integration/Continuous Deployment (CI/CD) process for data pipelines.

Scheduling data pipelines on Azure

Scheduling data pipelines on Azure is a crucial step in the operationalization process. It ensures that data pipelines run at the appropriate times and frequencies, and that data is updated and available when needed. ADF offers the flexibility to schedule data pipelines based on time, events, or conditions.

ADF provides two different ways to schedule data pipelines:

  • Triggers: A trigger is a way to start a pipeline on a schedule or in response to an event. ADF...
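Conceptually, a time-based trigger is just a recurrence that fires pipeline runs at a fixed interval. The following sketch models that idea in plain Python; it illustrates the concept only and is not ADF's trigger API (the function name and parameters are our own):

```python
from datetime import datetime, timedelta

def next_runs(start, interval_minutes, count):
    """Model of a schedule trigger's recurrence: given a start time and
    an interval in minutes, return the next `count` pipeline run times."""
    return [start + timedelta(minutes=interval_minutes * i) for i in range(count)]
```

In ADF itself, this recurrence is declared on the trigger, and the service fires the associated pipeline runs for you.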

Summary

To recap, data pipelines in Azure are a set of tools and services that allow for the efficient movement and transformation of data. One of the concepts covered in the chapter is the difference between ETL and ELT pipelines. In this book, we will focus mostly on ETL. The chapter also covered the differences between data pipelines in ADF and data pipelines in Azure Synapse Analytics.

We described various tools and technologies available for data transformation in Azure, including mapping data flows, Spark notebooks, SQL scripts, and SSIS packages for batch processing, and Azure Stream Analytics and Azure Databricks for real-time processing.

Next, we looked at an example architecture for both batch and stream processing, providing a high-level overview of the components and technologies involved. Later parts of the architecture remain abstract for now. We introduced a holistic flowchart to map the decision-making process when choosing one of the transformation tools discussed...


Authors (2)

author image
Olivier Mertens

Olivier Mertens is a cloud solution architect for Azure data and AI at Microsoft, based in Dublin, Ireland. In this role, he assisted organizations in designing their enterprise-scale data platforms and analytical workloads. Next to his role as an architect, Olivier leads the technical AI expertise for Microsoft EMEA in the corporate market. This includes leading knowledge sharing and internal upskilling, as well as solving highly complex or strategic customer AI cases. Before his time at Microsoft, he worked as a data scientist at a Microsoft partner in Belgium. Olivier is a lecturer for generative AI and AI solution architectures, a keynote speaker for AI, and holds a master's degree in information management, a postgraduate degree as an AI business architect, and a bachelor's degree in business management.
Read more about Olivier Mertens

author image
Breght Van Baelen

Breght Van Baelen is a Microsoft employee based in Dublin, Ireland, and works as a cloud solution architect for the data and AI pillar in Azure. He provides guidance to organizations building large-scale analytical platforms and data solutions. In addition, Breght was chosen as an advanced cloud expert for Power BI and is responsible for providing technical expertise in Europe, the Middle East, and Africa. Before his time at Microsoft, he worked as a data consultant at Microsoft Gold Partners in Belgium. Breght led a team of eight data and AI consultants as a data science lead. Breght holds a master's degree in computer science from KU Leuven, specializing in AI. He also holds a bachelor's degree in computer science from the University of Hasselt.
Read more about Breght Van Baelen