
You're reading from Azure Data Factory Cookbook

Product type: Book
Published in: Dec 2020
Publisher: Packt
ISBN-13: 9781800565296
Edition: 1st
Authors (4):
Dmitry Anoshin

Dmitry Anoshin is a data-centric technologist and a recognized expert in building and implementing big data and analytics solutions. He has a successful track record of implementing business and digital intelligence projects in numerous industries, including retail, finance, marketing, and e-commerce. Dmitry possesses in-depth knowledge of digital/business intelligence, ETL, data warehousing, and big data technologies. He has extensive experience in the data integration process and is proficient in using various data warehousing methodologies. Dmitry has consistently exceeded project expectations while working in the financial, machine tool, and retail industries. He has completed a number of multinational full BI/DI solution life cycle implementation projects. With expertise in data modeling, Dmitry also has a background and business experience in multiple relational databases, OLAP systems, and NoSQL databases. He is also an active speaker at data conferences and helps people to adopt cloud analytics.

Dmitry Foshin

Dmitry Foshin is a business intelligence team leader whose main goals are delivering business insights to the management team through data engineering, analytics, and visualization. He has led and executed complex full-stack BI solutions (from ETL processes to building the DWH and reporting) using Azure technologies, Data Lake, Data Factory, Databricks, MS Office 365, Power BI, and Tableau. He has also successfully launched numerous data analytics projects – both on-premises and in the cloud – that help achieve corporate goals in international FMCG companies, banking, and manufacturing industries.

Roman Storchak

Roman Storchak holds a PhD and is a chief data officer whose main interest lies in building data-driven cultures by making analytics easy. He has led teams that built ETL-heavy products in AdTech and retail, often using Azure Stack, Power BI, and Data Factory.

Xenia Ireton

Xenia Ireton is a Senior Software Engineer at Microsoft. She has extensive knowledge of building distributed services, data pipelines, and data warehouses.


Chapter 1: Getting Started with ADF

Microsoft Azure is a public cloud platform that offers a wide range of services for modern organizations. The Azure cloud has several key components, such as compute, storage, databases, and networking, which serve as building blocks for any organization that wants to reap the benefits of cloud computing. Those benefits include pay-for-what-you-use pricing, usage metrics, elasticity, and security. Many organizations across the world already benefit from cloud deployment and have fully moved to the Azure cloud: they deploy and run their business applications there, and as a result, their data is stored in cloud storage and cloud applications.

Microsoft Azure offers a cloud analytics stack that helps us build modern analytics solutions: extract data from on-premises systems and the cloud, use that data for decision-making, search for patterns in data, and deploy machine learning applications.

In this chapter we will meet Azure Data...

Introduction to the Azure data platform

The Azure data platform provides us with a number of data services for databases, data storage, and analytics. In Table 1.1, you can find a list of services and their purpose:

Table 1.1 – Azure data platform services

Using the Azure data platform services can help you build a modern analytics solution that is secure and scalable. The following diagram shows an example of a typical modern cloud analytics architecture:

Figure 1.1 – Modern analytics solution architecture

Most of the Azure data platform services appear in this diagram. ADF is a core service for data movement and transformation.

Let's learn more about the reference architecture in Figure 1.1. It starts with source systems. We can collect data from files, databases, APIs, IoT, and so on. Then, we can use Event Hubs for streaming data and ADF for batch operations. ADF will push data into Azure Data Lake as a staging...

Creating and executing our first job in ADF

ADF allows us to create workflows for transforming and orchestrating data movement. You may think of ADF as an ETL (short for Extract, Transform, Load) tool for the Azure cloud and the Azure data platform. ADF is Software as a Service (SaaS). This means that we don't need to deploy any hardware or software. We pay for what we use. Often, ADF is referred to as a code-free ETL as a service. The key operations of ADF are listed here:

  • Ingest: Allows us to collect data and load it into Azure data platform storage or any other target location. ADF has 90+ data connectors.
  • Control flow: Allows us to design code-free extract and load logic.
  • Data flow: Allows us to design code-free data transformations.
  • Schedule: Allows us to schedule ETL jobs.
  • Monitor: Allows us to monitor ETL jobs.
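To make the ingest and control flow operations concrete, here is a rough sketch of the JSON that a minimal copy pipeline boils down to. This is not taken from the book's recipes; the names (CopyBlobPipeline, InputDataset, OutputDataset) are illustrative placeholders, built here as a plain Python dict in the shape of the ADF pipeline schema:

```python
import json

# A minimal ADF pipeline definition with a single Copy activity.
# All names below are hypothetical placeholders, not values from the book.
pipeline = {
    "name": "CopyBlobPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromInputToOutput",
                "type": "Copy",
                # Datasets describe the shape/location of the data being moved
                "inputs": [{"referenceName": "InputDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "OutputDataset",
                             "type": "DatasetReference"}],
                # Source and sink types depend on the connected data stores
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "BlobSink"},
                },
            }
        ]
    },
}

print(json.dumps(pipeline, indent=2))
```

Whether you author this through the UI, the Copy Data tool, or an SDK, a definition of roughly this shape is what ADF stores and executes.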

We have learned about the key operations in ADF. Next, we should try them.

Getting ready

In this recipe, we will continue on...

Creating an ADF pipeline by using the Copy Data tool

We just reviewed how to create an ADF job using the UI. However, we can also use the Copy Data tool (CDT). The CDT allows us to load data into Azure storage faster: we don't need to set up linked services, pipelines, and datasets as we did in the previous recipe. In other words, depending on your task, you can use the ADF UI or the CDT. Usually, we use the CDT for simple load operations, when we have lots of data files that we would like to ingest into Data Lake as fast as possible.

Getting ready

In this recipe, we will use the CDT to perform the same task: copying data from one folder to another.

How to do it...

We created the ADF job with the UI. Let's review the CDT:

  1. In the previous recipe, we created an Azure Blob storage account and a container. We will use the same file and the same container. However, we first have to delete the file from the output location.
  2. Go to Azure Storage...

Creating an ADF pipeline using Python

We can use PowerShell, .NET, and Python for ADF deployment and data integration automation. Here is an extract from the Microsoft documentation:

Azure Automation delivers a cloud-based automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources.

In this recipe, we want to cover the Python scenario because Python is one of the most popular languages for analytics and data engineering. We will use Jupyter Notebook with example code.

Getting ready

For this exercise, we will use Python in order to create a data pipeline and copy our file from one folder to another. We need to use the azure-mgmt-datafactory and azure-mgmt-resource Python packages as well as some others.
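As a rough sketch of how such a script hangs together (this is not the book's notebook code), the azure-mgmt-datafactory package exposes a DataFactoryManagementClient whose `pipelines.create_or_update` and `pipelines.create_run` methods deploy and trigger a pipeline. The helper below takes an already-constructed client as a parameter, so the snippet itself imports nothing Azure-specific; the function and variable names are assumptions for illustration:

```python
# Hypothetical helper, not the book's notebook code. It assumes a client
# built elsewhere with the azure-mgmt-datafactory package, e.g.:
#   from azure.identity import ClientSecretCredential
#   from azure.mgmt.datafactory import DataFactoryManagementClient
#   client = DataFactoryManagementClient(credential, subscription_id)

def deploy_and_run_pipeline(adf_client, resource_group, factory_name,
                            pipeline_name, pipeline_definition):
    """Create or update a pipeline in the given factory, then trigger a run.

    Returns the run ID, which can later be polled (for example with
    adf_client.pipeline_runs.get) to monitor the job's progress.
    """
    # Deploy (or update) the pipeline definition in the data factory
    adf_client.pipelines.create_or_update(
        resource_group, factory_name, pipeline_name, pipeline_definition)
    # Kick off an on-demand run of the pipeline
    run = adf_client.pipelines.create_run(
        resource_group, factory_name, pipeline_name)
    return run.run_id
```

Injecting the client also makes the helper easy to exercise with a stub in tests, without touching a live subscription.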

...

Creating a data factory using PowerShell

Often, we don't have access to the UI and want to define our infrastructure as code instead. Infrastructure as code is easy to maintain and deploy, and it allows us to track versions, commit code, and review change requests. In this recipe, we will use PowerShell to create a data factory. If you have never used PowerShell before, the steps below explain how to get it and install it onto your machine.

Getting ready

For this exercise, we will use PowerShell in order to create a data pipeline and copy our file from one folder to another.

How to do it…

Let's create an ADF job using PowerShell.

  1. In the case of macOS, we can run the following command to install PowerShell:
    brew install powershell/tap/powershell
  2. Check that it is working:
    pwsh

    Optionally, we can download PowerShell for our OS from https://github.com/PowerShell/PowerShell/releases/.

  3. Next, we have to install the Azure...

Using templates to create ADF pipelines

Modern organizations operate in a fast-paced environment. It is important to deliver insights faster and have shorter analytics iterations. Moreover, Microsoft found that many organizations have similar use cases for their modern cloud analytics deployments. As a result, Azure built a number of predefined templates. For example, if you have data in Amazon S3 and you want to copy it into Azure Data Lake, you can find a specific template for this operation; or say you want to move an on-premises Oracle data warehouse to the Azure Synapse Analytics data warehouse – you are covered with ADF templates as well.

Getting ready

ADF provides us with templates in order to accelerate data engineering development. In this recipe, we will review the common templates and see how to use them.

How to do it...

We will find and review an existing template using Data Factories.

  1. In the Azure portal, choose Data Factories.
  2. Open our existing...