
You're reading from Azure Data Factory Cookbook - Second Edition

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781803246598
Edition: 2nd Edition
Authors (4):
Dmitry Foshin

Dmitry Foshin is a business intelligence team leader whose main goals are delivering business insights to the management team through data engineering, analytics, and visualization. He has led and executed complex full-stack BI solutions (from ETL processes to building DWHs and reporting) using Azure technologies, Data Lake, Data Factory, Databricks, MS Office 365, Power BI, and Tableau. He has also successfully launched numerous data analytics projects – both on-premises and cloud – that help achieve corporate goals in international FMCG companies, banking, and manufacturing industries.

Tonya Chernyshova

Tonya Chernyshova is an experienced Data Engineer with over 10 years in the field, including time at Amazon. Specializing in Data Modeling, Automation, Cloud Computing (AWS and Azure), and Data Visualization, she has a strong track record of delivering scalable, maintainable data products. Her expertise drives data-driven insights and business growth, showcasing her proficiency in leveraging cloud technologies to enhance data capabilities.

Dmitry Anoshin

Dmitry Anoshin is a data-centric technologist and a recognized expert in building and implementing big data and analytics solutions. He has a successful track record of implementing business and digital intelligence projects in numerous industries, including retail, finance, marketing, and e-commerce. Dmitry possesses in-depth knowledge of digital/business intelligence, ETL, data warehousing, and big data technologies. He has extensive experience in the data integration process and is proficient in using various data warehousing methodologies. Dmitry has consistently exceeded project expectations working in the financial, machine tool, and retail industries. He has completed a number of multinational full BI/DI solution life cycle implementation projects. With expertise in data modeling, Dmitry also has a background and business experience in multiple relational databases, OLAP systems, and NoSQL databases. He is also an active speaker at data conferences and helps people adopt cloud analytics.

Xenia Ireton

Xenia Ireton is a Senior Software Engineer at Microsoft. She has extensive knowledge in building distributed services, data pipelines, and data warehouses.


Azure Data Factory Cookbook, Second Edition: A data engineer's guide to building and managing ETL and ELT pipelines with data integration


  1. Chapter 1: Getting Started with ADF
  2. Chapter 2: Orchestration and Control Flow
  3. Chapter 3: Setting up Synapse Analytics
  4. Chapter 4: Working with Data Lake and Spark Pools
  5. Chapter 5: Working with Big Data: Databricks
  6. ...

Introduction to the Azure data platform

The Azure data platform provides us with a number of data services for databases, data storage, and analytics. In Figure 1.1, you can find a list of these services and their purpose:

Figure 1.1: Azure data platform services

Using Azure data platform services can help you build a modern analytics solution that is secure and scalable. The following diagram shows an example of a typical modern cloud analytics architecture:

Figure 1.2: Modern analytics solution architecture

You can find most of the Azure data platform services in this diagram. ADF is the core service for data movement and transformation.

Let’s learn more about the reference architecture in Figure 1.2. It starts with the source systems: we can collect data from files, databases, APIs, IoT devices, and so on. Then, we can use Event Hubs for streaming data and ADF for batch operations. ADF will push data into Azure Data Lake as a staging area, and then we can prepare data for...

Creating and executing our first job in ADF

ADF allows us to create workflows for transforming and orchestrating data movement. You may think of ADF as an Extract, Transform, Load (ETL) tool for the Azure cloud and the Azure data platform. ADF is a fully managed, serverless cloud service: we don’t need to deploy any hardware or software, and we pay only for what we use. Often, ADF is referred to as code-free ETL as a service, or a managed service. The key operations of ADF are listed here:

  • Ingest: Allows us to collect data and load it into Azure data platform storage or any other target location. ADF has 90+ data connectors.
  • Control flow: Allows us to design code-free extract-and-load workflows.
  • Data flow: Allows us to design code-free data transformations.
  • Schedule: Allows us to schedule ETL jobs.
  • Monitor: Allows us to monitor ETL jobs.

We have learned about the key operations of ADF. Next, we should try them.
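Before we do, note that behind the scenes, each of these operations is backed by JSON: every pipeline, dataset, and linked service we create in ADF is stored as a JSON definition (we will meet these raw JSON files again in the Azure Bicep recipe). As a minimal sketch only – the pipeline and dataset names below are hypothetical placeholders, not objects we have created yet – a pipeline with a single copy activity looks roughly like this:

    {
        "name": "CopyBlobPipeline",
        "properties": {
            "activities": [
                {
                    "name": "CopyFromInputToOutput",
                    "type": "Copy",
                    "inputs": [
                        { "referenceName": "InputDataset", "type": "DatasetReference" }
                    ],
                    "outputs": [
                        { "referenceName": "OutputDataset", "type": "DatasetReference" }
                    ],
                    "typeProperties": {
                        "source": { "type": "BlobSource" },
                        "sink": { "type": "BlobSink" }
                    }
                }
            ]
        }
    }

You rarely have to write this JSON by hand – the ADF UI generates it for you – but knowing it is there helps when debugging pipelines or putting them under version control.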

Getting ready

...

Creating an ADF pipeline using the Copy Data tool

We just reviewed how to create an ADF job using the UI. However, we can also use the Copy Data tool (CDT). The CDT allows us to load data into Azure storage faster: we don’t need to set up linked services, pipelines, and datasets as we did in the previous recipe. In other words, depending on your task, you can use the ADF UI or the CDT. Usually, we will use the CDT for simple load operations, when we have lots of data files and would like to ingest them into Data Lake as fast as possible.

Getting ready

In this recipe, we will use the CDT to perform the same task: copying data from one folder to another.

How to do it...

We have already created an ADF job with the UI. Now, let’s review the CDT:

  1. In the previous recipe, we created the Azure Blob storage account and container. We will use the same file and the same container. However, we have to delete the file from the output location.
  2. ...

Creating an ADF pipeline using Python

We can use PowerShell, .NET, and Python for ADF deployment and data integration automation. Here is an extract from the Microsoft documentation:

“Azure Automation delivers a cloud-based automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources.”

In this recipe, we want to cover the Python scenario because Python is one of the most popular languages for analytics and data engineering. We will use Jupyter Notebook with example code.

You can use Jupyter notebooks or Visual Studio Code notebooks.

Getting ready

For this exercise, we will use Python to create a data pipeline and copy our file from one folder to another. We need to use the azure...
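The remainder of this recipe is truncated in this preview, but as a minimal sketch of the approach, the following code creates a data factory using the azure-identity and azure-mgmt-datafactory packages (our assumption of suitable libraries; the subscription ID, resource group, and factory names are hypothetical placeholders, and the resource group is assumed to already exist):

    # A minimal sketch, assuming the azure-identity and azure-mgmt-datafactory
    # packages are installed; all names below are hypothetical placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import Factory

    subscription_id = "<your-subscription-id>"
    rg_name = "ADFCookbook"          # an existing resource group
    df_name = "ADFCookbookFactory"   # data factory names must be globally unique

    # Picks up whatever credential is available: az login, environment variables, etc.
    credential = DefaultAzureCredential()
    adf_client = DataFactoryManagementClient(credential, subscription_id)

    # Create (or update) the data factory itself
    df = adf_client.factories.create_or_update(
        rg_name, df_name, Factory(location="eastus")
    )
    print(df.provisioning_state)  # "Succeeded" once the factory is ready

From here, the same client exposes methods for linked services, datasets, and pipelines, so a whole copy pipeline can be defined and triggered from a notebook.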

Creating a data factory using PowerShell

Often, we don’t have access to the UI, and we want to create our infrastructure as code. Infrastructure as code is easy to maintain and deploy, and it allows us to track versions and manage changes through code commits and change requests. In this recipe, we will use PowerShell to create a data factory. If you have never used PowerShell before, you can find information on how to install it on your machine at the end of this recipe.

Getting ready

For this exercise, we will use PowerShell to create a data pipeline and copy our file from one folder to another.

How to do it…

Let’s create an ADF job using PowerShell:

  1. In the case of macOS, we can run the following command to install PowerShell:
    brew install powershell/tap/powershell
    
  2. Check that it is working:
    pwsh
    

    Optionally, we can download PowerShell for our OS from https://github.com/PowerShell/PowerShell/.

    ...
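The remaining steps are truncated in this preview, but as a rough sketch of where they lead – assuming the Az PowerShell module (including Az.DataFactory) is installed, and using hypothetical resource group and factory names – creating the factory itself looks like this:

    # Sign in and create a resource group to hold the factory
    Connect-AzAccount
    New-AzResourceGroup -Name "ADFCookbook" -Location "EastUS"

    # Create the data factory; the name must be globally unique
    Set-AzDataFactoryV2 -ResourceGroupName "ADFCookbook" `
        -Name "ADFCookbookFactory" -Location "EastUS"

Because these commands are just text, they can live in a script under version control, which is exactly the infrastructure-as-code benefit described above.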

Using templates to create ADF pipelines

Modern organizations operate in a fast-paced environment. It is important to deliver insights faster and have shorter analytics iterations. Moreover, Microsoft found that many organizations have similar use cases for their modern cloud analytics deployments. As a result, Azure provides a number of predefined ADF templates. For example, if you have data in Amazon S3 and you want to copy it into Azure Data Lake, you can find a specific template for this operation; or, say you want to move an on-premises Oracle data warehouse to the Azure Synapse Analytics data warehouse – you are covered with ADF templates.

Getting ready

ADF provides us with templates in order to accelerate data engineering development. In this recipe, we will review the common templates and see how to use them.

How to do it...

We will find and review an existing template using Data Factories:

  1. In the Azure portal, choose Data Factories.
  2. Open our...

Creating an Azure Data Factory using Azure Bicep

Azure Bicep is a domain-specific language that offers a more readable and maintainable approach to creating and managing Azure resources. It simplifies the process of creating, deploying, and managing ADF resources, reducing the complexity and tediousness of managing raw JSON files. In this recipe, we will create an Azure Data Factory using Azure Bicep and the Visual Studio Code Azure Bicep extension. The Azure Bicep extension for Visual Studio Code provides syntax highlighting, code snippets, and IntelliSense to make working with Azure Bicep files more efficient.

Getting ready

Before diving into the creation of an Azure Data Factory using Azure Bicep and Visual Studio Code, ensure that you have the necessary prerequisites in place:

  • An active Azure subscription
  • Visual Studio Code installed on your local machine
  • Azure CLI installed on your local machine
  • Azure Bicep CLI extension installed on your local machine
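Once these are in place, a minimal sketch of a Bicep file for a data factory needs only one resource block (the factory name here is a hypothetical placeholder and must be globally unique):

    // main.bicep – a minimal sketch; the factory name is a hypothetical placeholder
    resource dataFactory 'Microsoft.DataFactory/factories@2018-06-01' = {
      name: 'adfcookbookfactory'
      location: resourceGroup().location
      identity: {
        type: 'SystemAssigned'   // lets the factory authenticate to other Azure services
      }
    }

It can then be deployed into an existing resource group with the Azure CLI, for example: az deployment group create --resource-group ADFCookbook --template-file main.bicep.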
