Chapter 5: Control Flow Transformation and the Copy Data Activity in Azure Data Factory

In this chapter, we'll look at the transformation activities available in Azure Data Factory control flows. Transformation activities allow us to perform data transformations within the pipeline before the data is loaded into the destination.

In this chapter, we'll cover the following recipes:

  • Implementing HDInsight Hive and Pig activities
  • Implementing an Azure Functions activity
  • Implementing a Data Lake Analytics U-SQL activity
  • Copying data from Azure Data Lake Gen2 to an Azure Synapse SQL pool using the copy activity
  • Copying data from Azure Data Lake Gen2 to Azure Cosmos DB using the copy activity

Technical requirements

For this chapter, the following are required:

  • A Microsoft Azure subscription
  • PowerShell 7
  • Microsoft Azure PowerShell

Implementing HDInsight Hive and Pig activities

Azure HDInsight is a managed cloud service that lets you create big data clusters running Apache Hadoop, Spark, and Kafka to process big data. We can also scale the clusters up or down as required.

Apache Hive, built on top of Apache Hadoop, facilitates querying big data on Hadoop clusters using a SQL-like syntax (HiveQL). Using Hive, we can read files stored in the Hadoop Distributed File System (HDFS) as external tables. We can then apply transformations to the tables and write the data back to HDFS as files.

Apache Pig, also built on top of Apache Hadoop, provides a scripting language (Pig Latin) for performing Extract, Transform, and Load (ETL) operations on big data. Using Pig, we can read, transform, and write data stored in HDFS.

In this recipe, we'll use Azure Data Factory, HDInsight Hive, and Pig activities to read data from Azure Blob storage, aggregate the data, and write it back to Azure Blob storage.
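
The recipe's pipeline is built in the steps that follow. Purely as a hedged illustration of the kind of Hive aggregation involved, the following PowerShell sketch submits a HiveQL query directly to an existing HDInsight cluster using the Az.HDInsight module; the cluster name, credentials, and the orders table are placeholder assumptions, not objects created in this recipe:

    # Hypothetical cluster name and credentials - replace with your own values
    $clusterName = "packtadehdinsight"
    $clusterCred = Get-Credential -Message "HDInsight cluster (HTTP) login"

    # HiveQL aggregation over an assumed external table named orders
    $hiveQuery = "SELECT country, SUM(amount) AS totalsales FROM orders GROUP BY country;"

    # Define the Hive job, submit it to the cluster, and wait for it to finish
    $hiveJob = New-AzHDInsightHiveJobDefinition -Query $hiveQuery
    $job = Start-AzHDInsightJob -ClusterName $clusterName -JobDefinition $hiveJob -HttpCredential $clusterCred
    Wait-AzHDInsightJob -ClusterName $clusterName -JobId $job.JobId -HttpCredential $clusterCred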

Getting ready

...

Implementing an Azure Functions activity

Azure Functions is a serverless compute service that lets us run code without provisioning or managing virtual machines or containers. In this recipe, we'll implement an Azure Functions activity to run an Azure function that resumes an Azure Synapse SQL database.
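
The Azure function itself is created in the steps that follow. As a rough sketch of what such a function's body could do, the following PowerShell resumes a paused Azure Synapse SQL database (a dedicated SQL pool, formerly SQL Data Warehouse) using the Az.Sql module; the resource names are placeholder assumptions, not the ones used in this recipe:

    # Hypothetical resource names - replace with your own values
    $resourceGroupName = "packtade"
    $serverName        = "packtadesqlserver"
    $databaseName      = "packtadesqlpool"

    # Resume the Synapse SQL database only if it is currently paused
    $pool = Get-AzSqlDatabase -ResourceGroupName $resourceGroupName -ServerName $serverName -DatabaseName $databaseName
    if ($pool.Status -eq "Paused") {
        Resume-AzSqlDatabase -ResourceGroupName $resourceGroupName -ServerName $serverName -DatabaseName $databaseName
    }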

Getting ready

To get started, do the following:

  1. Log in to https://portal.azure.com using your Azure credentials.
  2. Open a new PowerShell prompt. Execute the Connect-AzAccount command to log in to your Azure account from PowerShell.
  3. You will need an existing Data Factory account. If you don't have one, create one by executing the ~/azure-data-engineering-cookbook\Chapter04\3_CreatingAzureDataFactory.ps1 PowerShell script.

How to do it…

Let's start by creating an Azure function to resume an Azure Synapse SQL database:

  1. In the Azure portal, type functions in the Search box and select Function App from the search results:

    Figure 5.13 –...

Implementing a Data Lake Analytics U-SQL activity

Azure Data Lake Analytics is an on-demand analytics service that allows you to process data using U-SQL, R, and Python without provisioning any infrastructure. All we need to do is upload the data to the Data Lake store, provision a Data Lake Analytics account, and run U-SQL jobs to process the data.

In this recipe, we'll implement a Data Lake Analytics U-SQL activity to calculate total sales by country from the orders data stored in the Data Lake store.
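
The Data Factory pipeline for this recipe is built in the steps that follow. As a hedged sketch of what the same job could look like when submitted from PowerShell instead, the following uses the Az.DataLakeAnalytics module to run a small U-SQL script that aggregates sales by country; the account name, file paths, and column names are assumptions for illustration only:

    # Hypothetical Data Lake Analytics account name - replace with your own value
    $adlaAccountName = "packtadeadla"

    # U-SQL script (assumed schema and paths): read the orders file,
    # aggregate sales by country, and write the result back to the store
    $usqlScript = '
        @orders =
            EXTRACT Country string, Amount decimal
            FROM "/orders/orders.csv"
            USING Extractors.Csv(skipFirstNRows:1);

        @totals =
            SELECT Country, SUM(Amount) AS TotalSales
            FROM @orders
            GROUP BY Country;

        OUTPUT @totals
        TO "/output/totalsalesbycountry.csv"
        USING Outputters.Csv(outputHeader:true);
    '

    # Submit the job and wait for it to complete
    $job = Submit-AzDataLakeAnalyticsJob -Account $adlaAccountName -Name "TotalSalesByCountry" -Script $usqlScript -DegreeOfParallelism 1
    Wait-AzDataLakeAnalyticsJob -Account $adlaAccountName -JobId $job.JobId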

Getting ready

To get started, do the following:

  1. Log in to https://portal.azure.com using your Azure credentials.
  2. Open a new PowerShell prompt. Execute the Connect-AzAccount command to log in to your Azure account from PowerShell.
  3. You will need an existing Data Factory account. If you don't have one, create one by executing the ~/azure-data-engineering-cookbook\Chapter04\3_CreatingAzureDataFactory.ps1 PowerShell script.

How to do it…

Let's...

Copying data from Azure Data Lake Gen2 to an Azure Synapse SQL pool using the copy activity

The copy activity, as the name suggests, is used to copy data quickly from a source to a destination. In this recipe, we'll learn how to use the copy activity to copy data from Azure Data Lake Gen2 to an Azure Synapse SQL pool.
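
Once the pipeline has been created, it can be triggered and monitored from PowerShell as well as from the Data Factory monitoring UI. The following is a minimal sketch using the Az.DataFactory module; the resource group, data factory, and pipeline names are placeholder assumptions, not the ones created in this recipe:

    # Hypothetical names - replace with your own values
    $resourceGroupName = "packtade"
    $dataFactoryName   = "packtadedatafactory"
    $pipelineName      = "CopyOrdersToSynapse"

    # Trigger the pipeline and poll the run until it finishes
    $runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineName $pipelineName
    do {
        Start-Sleep -Seconds 30
        $run = Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName -DataFactoryName $dataFactoryName -PipelineRunId $runId
        Write-Output "Pipeline run status: $($run.Status)"
    } while ($run.Status -in "Queued", "InProgress")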

Getting ready

Before you start, do the following:

  1. Log in to Azure from PowerShell. To do this, execute the Connect-AzAccount command and follow the instructions to log in to Azure.
  2. Open https://portal.azure.com and log in using your Azure credentials.

How to do it…

Follow these steps to perform the activity:

  1. The first step is to create a new Azure Data Lake Gen2 storage account and upload the data. To create the storage account and upload the data, execute the following PowerShell command:
    .\ADE\azure-data-engineering-cookbook\Chapter04\1_UploadOrderstoDataLake.ps1 -resourcegroupname packtade -storageaccountname packtdatalakestore...

Copying data from Azure Data Lake Gen2 to Azure Cosmos DB using the copy activity

In this recipe, we'll copy data from Azure Data Lake Gen2 to an Azure Cosmos DB SQL API database. Azure Cosmos DB is a managed NoSQL database service that offers multiple APIs for storing data, such as SQL (formerly DocumentDB), MongoDB, Gremlin (graph), Cassandra, and Table.
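
The Cosmos DB target for this recipe is created in the steps that follow. As a hedged alternative, a SQL API account, database, and container can also be provisioned from PowerShell with the Az.CosmosDB module; the names, partition key, and throughput below are assumptions for illustration, not the values used in this recipe:

    # Hypothetical names - replace with your own values
    $resourceGroupName = "packtade"
    $accountName       = "packtadecosmosdb"
    $location          = "East US"

    # Create a Cosmos DB account with the SQL (Core) API,
    # then a database and a container partitioned by an assumed /country key
    New-AzCosmosDBAccount -ResourceGroupName $resourceGroupName -Name $accountName -Location $location -ApiKind Sql
    New-AzCosmosDBSqlDatabase -ResourceGroupName $resourceGroupName -AccountName $accountName -Name "orders"
    New-AzCosmosDBSqlContainer -ResourceGroupName $resourceGroupName -AccountName $accountName -DatabaseName "orders" -Name "orderdetails" -PartitionKeyKind Hash -PartitionKeyPath "/country" -Throughput 400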

Getting ready

Before you start, do the following:

  1. Log in to Azure from PowerShell. To do this, execute the following command and follow the instructions to log in to Azure:
    Connect-AzAccount
  2. Open https://portal.azure.com and log in using your Azure credentials.
  3. Follow step 1 of the Copying data from Azure Data Lake Gen2 to an Azure Synapse SQL pool using the copy activity recipe to create and upload files to Azure Data Lake Storage Gen2.

To copy data from Azure Data Lake Storage Gen2 to a Cosmos DB SQL API, we'll do the following:

  1. Create and upload data to the Azure Data Lake Storage...