Azure Data Factory Cookbook

By Dmitry Anoshin, Dmitry Foshin, Roman Storchak, and Xenia Ireton

About this book

Azure Data Factory (ADF) is a modern data integration tool available on Microsoft Azure. This Azure Data Factory Cookbook helps you get up and running by showing you how to create and execute your first job in ADF. You’ll learn how to branch and chain activities, create custom activities, and schedule pipelines. This book will help you to discover the benefits of cloud data warehousing, Azure Synapse Analytics, and Azure Data Lake Gen2 Storage, which are frequently used for big data analytics. With practical recipes, you’ll learn how to actively engage with analytical tools from Azure Data Services and leverage your on-premise infrastructure with cloud-native tools to get relevant business insights. As you advance, you’ll be able to integrate the most commonly used Azure Services into ADF and understand how Azure services can be useful in designing ETL pipelines. The book will take you through the common errors that you may encounter while working with ADF and show you how to use the Azure portal to monitor pipelines. You’ll also understand error messages and resolve problems in connectors and data flows with the debugging capabilities of ADF.

By the end of this book, you’ll be able to use ADF as the main ETL and orchestration tool for your data warehouse or data platform projects.

Publication date: December 2020
Publisher: Packt
Pages: 382
ISBN: 9781800565296

 

Chapter 1: Getting Started with ADF

Microsoft Azure is a public cloud vendor. It offers different services for modern organizations. The Azure cloud has several key components, such as compute, storage, databases, and networks, which serve as building blocks for any organization that wants to reap the benefits of cloud computing. The cloud brings many benefits, including utility-based pricing, elasticity, and security. Many organizations across the world already benefit from cloud deployment and have fully moved to the Azure cloud. They deploy business applications and run their business on the cloud. As a result, their data is stored in cloud storage and cloud applications.

Microsoft Azure offers a cloud analytics stack that helps us build modern analytics solutions, extract data from on-premises systems and the cloud, and use data for decision-making processes, finding patterns in data, and deploying machine learning applications.

In this chapter, we will meet the Azure data platform services and the main cloud data integration service, Azure Data Factory (ADF). We will log in to Azure and navigate to the Data Factories service in order to create our first data pipeline and run a Copy activity. Then, we will do the same exercise using different methods of managing and controlling data factories: Python, PowerShell, and the Copy Data tool.

If you don't have an Azure account, we will also cover how you can get a free Azure account.

In this chapter, we will cover the following recipes:

  • Introduction to the Azure data platform
  • Creating and executing our first job in ADF
  • Creating an ADF pipeline by using the Copy Data tool
  • Creating an ADF pipeline using Python
  • Creating a data factory using PowerShell
  • Using templates to create ADF pipelines
 

Introduction to the Azure data platform

The Azure data platform provides us with a number of data services for databases, data storage, and analytics. In Table 1.1, you can find a list of services and their purpose:

Table 1.1 – Azure data platform services

Using the Azure data platform services can help you build a modern analytics solution that is secure and scalable. The following diagram shows an example of a typical modern cloud analytics architecture:

Figure 1.1 – Modern analytics solution architecture

You can find most of the Azure data platform services in this diagram. ADF is a core service for data movement and transformation.

Let's learn more about the reference architecture in Figure 1.1. It starts with the source systems. We can collect data from files, databases, APIs, IoT devices, and so on. Then, we can use Event Hubs for streaming data and ADF for batch operations. ADF will push data into Azure Data Lake as a staging area, and then we can prepare the data for analytics and reporting in Azure Synapse Analytics. Moreover, we can use Databricks for big data processing and machine learning models. Power BI is the data visualization service. Finally, we can push data into Azure Cosmos DB if we want to use the data in business applications.

Getting ready

In this recipe, we will create a free Azure account, log in to the Azure portal, and locate ADF services. If you have an Azure account already, you can skip the creation of the account and log straight into the portal.

How to do it...

Open https://azure.microsoft.com/free/, then take the following steps:

  1. Click Start Free.
  2. You can sign in to your existing Microsoft account or create a new one. Let's create one as an example.
  3. Enter an email address and click Next.
  4. Enter a password of your choice.
  5. Verify your email by entering the code, and click Next.
  6. Fill in the information for your profile (Country, Name, and so on). It will also require your credit card information.
  7. After you have finished the account creation, it will bring you to the Microsoft Azure portal, as shown in the following screenshot:
    Figure 1.2 – Azure portal

  8. Now, we can explore the Azure portal and find Azure data services. Let's find Azure Synapse Analytics. In the search bar, enter Azure Synapse Analytics and choose Azure Synapse Analytics (formerly SQL DW). It will open the Synapse control panel, as shown in the following screenshot:
Figure 1.3 – Azure Synapse Analytics menu

Here, we can launch a new instance of a Synapse data warehouse.

Let's also locate the Data Factories service. In the next recipe, we will use it to create a new data factory.

Before doing anything with ADF, though, let's review what we have covered about an Azure account.

How it works...

Now that we have created a free Azure account, it gives us the following benefits:

  • 12 months of free access to popular products
  • $200 worth of credit
  • 25+ always-free products

The Azure account we created is free and you won't be charged unless you choose to upgrade.

Moreover, we discovered the Azure data platform products, which we will use over the course of the book. The Azure portal has a friendly UI where we can easily locate, launch, pause, or terminate the service. Aside from the UI, Azure offers us other ways of communicating with Azure services, using the command-line interface (CLI), APIs, SDKs, and so on.
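
To illustrate the SDK route, here is a minimal sketch that simply lists the resource groups in a subscription. It uses the azure-identity and azure-mgmt-resource Python packages that we will install later in this chapter, and it assumes an identity that DefaultAzureCredential can pick up (for example, after az login); the subscription ID is a placeholder.

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Authenticates via whatever DefaultAzureCredential finds (Azure CLI login,
# environment variables, managed identity, and so on). Replace the placeholder
# with your own subscription ID.
client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# List every resource group in the subscription, much like browsing them in the portal
for resource_group in client.resource_groups.list():
    print(resource_group.name, resource_group.location)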

Using the Microsoft Azure portal, you can choose the Analytics category and it will show you all the analytics services, as shown in the following screenshot:

Figure 1.4 – Azure analytics services

We just located Azure Synapse Analytics in the Azure portal. Next, we should be able to create an ADF job.

 

Creating and executing our first job in ADF

ADF allows us to create workflows for transforming and orchestrating data movement. You may think of ADF as an ETL (short for Extract, Transform, Load) tool for the Azure cloud and the Azure data platform. ADF is Software as a Service (SaaS). This means that we don't need to deploy any hardware or software. We pay for what we use. Often, ADF is referred to as a code-free ETL as a service. The key operations of ADF are listed here:

  • Ingest: Allows us to collect data and load it into Azure data platform storage or any other target location. ADF has 90+ data connectors.
  • Control flow: Allows us to design code-free extracting and loading.
  • Data flow: Allows us to design code-free data transformations.
  • Schedule: Allows us to schedule ETL jobs.
  • Monitor: Allows us to monitor ETL jobs.

We have learned about the key operations in ADF. Next, we should try them.

Getting ready

In this recipe, we will continue on from the previous recipe, where we found Azure Synapse Analytics in the Azure portal. We will create a data factory using a straightforward method – through the ADF UI, accessed via the Azure portal. It is important to have the correct permissions in order to create a new data factory. In our example, we are a super admin, and so we should be good to go.

During the exercise, we will create a new resource group. A resource group is a collection of resources that share the same life cycle, permissions, and policies.

How to do it...

Let's get back to our data factory:

  1. If you have closed the Data Factory console, you should open it again. Search for Data factories and press Enter.
  2. Click Create data factory, or Add if you are on the Data factories screen, and it will open the project details, where we will choose a subscription (in our case, Free Trial).
  3. We haven't created a resource group yet. Click Create new and type the name ADFCookbook. Choose East US for Region, give the name as ADFcookbookJob1-<YOUR NAME> (in my case, ADFcookbookJob1-Dmitry), and leave the version as V2. Then, click Next: Git Configuration.
  4. We can use GitHub or Azure DevOps. We won't configure anything yet and so we will select Configure Git later. Then, click Next: Networking.
  5. We have an option to increase the security of our pipelines using Managed Virtual Network and Private endpoint. For this recipe, we will use the default settings. Click Next.
  6. Optionally, you can specify tags. Then, click Next: Review + Create. ADF will validate your settings and will allow you to click Create.
  7. Azure will deploy the data factory. We can choose our data factory and click Author and Monitor. This will open the ADF UI home page, where we can find lots of useful tutorials and webinars.
  8. From the left panel, choose the blue pencil icon, as shown in the following screenshot, and it will open a window where we will start the creation of the pipeline. Choose New pipeline and it will open the pipeline1 window, where we have to provide the following information: input, output, and compute. Add the name ADF-cookbook-pipeline1 and click Validate All:
    Figure 1.5 – ADF resources

  9. When executing Step 8, you will find out that you can't save the pipeline without an activity. For our new data pipeline, we will do a simple copy data activity: we will copy a file from one blob folder to another. In this chapter, we won't spend time on spinning up resources such as databases, Synapse, or Databricks. Later in this book, you will learn about using ADF with other data platform services. In order to copy data from Blob storage, we should create an Azure storage account and a blob container.
  10. Let's create the Azure storage account. Go to All Services | Storage | Storage Accounts.
  11. Click + Add.
  12. Use our Free Trial subscription. For the resource group, we will use ADFCookbook. Give a name for the storage account, such as adfcookbookstorage, then click Review and Create. The name should be unique to you.
  13. Click Go to Resource and select Containers:
    Figure 1.6 – Azure storage account UI

  14. Click + Container and enter the name adfcookbook.
  15. Now, we want to upload a data file, SalesOrders.txt, into the container. You can get this file from the book's GitHub account. Go to the adfcookbook container and click Upload. We will specify the folder name as input. We just uploaded the file to the cloud! You can find it at the container/folder/file path: adfcookbook/input/SalesOrders.txt.
  16. Next, we can go back to ADF. In order to finish the pipeline, we should add an input dataset and create a new linked service.
  17. In the ADF studio, click the Manage icon in the left sidebar. This will open the linked services page. Click + New and choose Azure Blob Storage, then click Continue.
  18. We can optionally change the name or leave it as the default, but we have to specify the subscription and choose the storage account that we just created.
  19. Click Test Connection and if all is good, click Create.
  20. Next, we will add a dataset. Go to our pipeline and click New dataset, as shown in the following screenshot:
     Figure 1.7 – ADF resources

  21. Choose Azure Blob Storage and click Continue. Choose the Binary format type for our text file and click Continue.
  22. Now, we can select the AzureBlobStorage1 linked service, specify the path to the adfcookbook/input/SalesOrders.txt file, and click Create.
  23. We can give the dataset a name in Properties. Type in SalesOrdersDataset and click Validate all. We shouldn't encounter any validation issues.
  24. We should add one more dataset as the output for our job. Let's create a new dataset with the name SalesOrdersDatasetOutput.
  25. Now, we can go back to our data pipeline. We couldn't save it earlier without a proper activity, but now we have everything we need to finish it. Add the new pipeline and give it the name ADF-cookbook-pipeline1. Then, from the activity list, expand Move & transform and drag and drop the Copy data step onto the canvas.
  26. We have to specify the parameters of the step – the source and sink information. Click the Source tab and choose our dataset, SalesOrdersDataset.
  27. Click the Sink tab and choose SalesOrdersDatasetOutput. This will be our output folder.
  28. Now, we can publish two datasets and one pipeline.
  29. Then, we can trigger our pipeline manually. Click Add trigger, as shown in the following screenshot:
    Figure 1.8 – ADF canvas with the Copy data activity

  30. Select Trigger Now. It will launch our job.
  31. We can click on Monitor from the left sidebar and find the pipeline runs. In the case of failure, we can pick up the logs here and find the root cause. In our case, the ADF-cookbook-pipeline1 pipeline succeeds. In order to see the outcome, we should go to Azure Storage and open our container. You can find the additional Output folder and a file named SalesOrders.txt there.

We just created our first job using the UI. Let's learn more about ADF.

How it works...

Using the ADF UI, we created a new pipeline – an ETL job. We specified input and output datasets and used Azure Blob storage as a linked service. The linked service itself is a kind of connection string. ADF uses linked services to connect to external resources. On the other hand, we have datasets. They represent the data structure within the data stores. We performed the simple activity of copying data from one folder to another. After the job ran, we reviewed the Monitor section with the job run logs.

There's more...

An ADF pipeline is a set of JSON config files. You can view the JSON for each pipeline, dataset, and so on in the portal by clicking the three dots in the top-right corner. We used the UI to create the configuration files and run the job. You can also review the config files by downloading them, as shown in the following figure:

Figure 1.9 – Downloading the pipeline config files

This will save the archive file. Extract it and you will find a folder with the following subfolders:

  • Dataset
  • LinkedService
  • Pipeline

Each folder has a corresponding JSON config file.
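
If you would rather fetch the same JSON programmatically, here is a minimal sketch using the azure-mgmt-datafactory Python SDK, which we will install in the Creating an ADF pipeline using Python recipe later in this chapter. It assumes azure-mgmt-datafactory 2.0 or later, which accepts azure.identity credentials directly; the tenant, client, secret, and subscription values are placeholders, while the resource group, factory, and pipeline names are the ones used in this recipe.

import json

from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Service principal and subscription values are placeholders
credential = ClientSecretCredential(tenant_id="<tenant-id>",
                                    client_id="<client-id>",
                                    client_secret="<client-secret>")
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Fetch the pipeline built in this recipe and print its JSON definition
pipeline = adf_client.pipelines.get("ADFCookbook", "ADFcookbookJob1-Dmitry",
                                    "ADF-cookbook-pipeline1")
print(json.dumps(pipeline.as_dict(), indent=4))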

See also

You can find more information about ADF from this Microsoft video, Introduction to Azure Data Factory: https://azure.microsoft.com/en-us/resources/videos/detailed-introduction-to-azure-data-factory/.

 

Creating an ADF pipeline by using the Copy Data tool

We just reviewed how to create an ADF job using the UI. However, we can also use the Copy Data tool (CDT). The CDT allows us to load data into Azure storage faster. We don't need to set up linked services, pipelines, and datasets as we did in the previous recipe. In other words, depending on your activity, you can use the ADF UI or the CDT. Usually, we will use the CDT for simple load operations, when we have lots of data files and we would like to ingest them into the data lake as fast as possible.

Getting ready

In this recipe, we will use the CDT in order to do the same task of copying data from one folder to another.

How to do it...

We created the ADF job with the UI. Let's review the CDT:

  1. In the previous recipe, we created the Azure Blob storage instance and container. We will use the same file and the same container. However, we have to delete the file from the output location.
  2. Go to Azure Storage Accounts, choose adfcookbookstorage, and click Containers. Choose adfcookbook. Go to the Output folder and delete the SalesOrders.txt file.
  3. Now, we can go back to the Data Factories portal. On the home page, we can see the Copy data icon. Click on it. It will open the CDT wizard. Give the task the name CDT-copy-job and choose Run once now. Click Next.
  4. Click Create a new connection. Choose the free trial subscription and adfcookbookstorage as the account name. It will create the AzureBlobStorage2 connection. Click Next.
  5. You can browse the blob storage and you will find the file. The path should look like adfcookbook/input/SalesOrders.txt. Check Binary copy. When we choose the binary option, the file is treated as binary and no schema is enforced. This is a great option for copying a file as is. Click Next.
  6. Next, we will choose the destination. Choose AzureBlobStorage2 and click Next. Enter adfcookbook/output as the output path and click Next until you reach the end. As a result, you should get output similar to the following screenshot:
    Figure 1.10 – CDT UI

  7. If we go to the storage account, we will find that the CDT copied the data into the output folder.

We have created a copy job using CDT.

How it works...

The CDT basically created the data pipeline for us. If you go to the Author section of ADF, you will find a new job and new datasets.

There's more...

You can learn more about the CDT at the Microsoft documentation page: https://docs.microsoft.com/en-us/azure/data-factory/copy-data-tool.

 

Creating an ADF pipeline using Python

We can use PowerShell, .NET, and Python for ADF deployment and data integration automation. Here is an extract from the Microsoft documentation:

Azure Automation delivers a cloud-based automation and configuration service that provides consistent management across your Azure and non-Azure environments. It consists of process automation, update management, and configuration features. Azure Automation provides complete control during deployment, operations, and decommissioning of workloads and resources.

In this recipe, we want to cover the Python scenario because Python is one of the most popular languages for analytics and data engineering. We will use Jupyter Notebook with example code.

Getting ready

For this exercise, we will use Python in order to create a data pipeline and copy our file from one folder to another. We need to use the azure-mgmt-datafactory and azure-mgmt-resource Python packages as well as some others.

How to do it...

We will create a data factory pipeline using Python, starting with the preparation steps.

  1. We will start with the deletion of our file in the output directory. Go to Azure Storage Accounts, choose adfcookbookstorage, and click Containers. Choose adfcookbook. Go to the Output folder and delete the SalesOrders.txt file.
  2. We will install the Azure management resources Python package by running this command from the CLI. In my example, I used Terminal on macOS:
    pip install azure-mgmt-resource
  3. Next, we will install the ADF Python package by running this command from the CLI:
    pip install azure-mgmt-datafactory
  4. Also, I installed these packages in order to run code from Jupyter:
    pip install msrestazure
    pip install azure.mgmt.datafactory
    pip install azure.identity
  5. When we have finished installing the Python packages, we can use them to create the data pipeline, datasets, and linked service, as well as to run the pipeline. Python gives us flexibility, and we could embed this into our analytics application or into Spark/Databricks.

    The code itself is quite big, and you can find it in the attachment to this chapter, ADF_Python_Run.ipynb. This is a Jupyter notebook organized into numbered sections that you can run one by one to see the output.

  6. In order to control Azure resources from the Python code, we have to register an app with Azure Active Directory and assign the Contributor role to this app in Identity and Access Management (IAM) under our subscription. We need to get the tenant_id, client_id, and client_secret values.
  7. Go to Azure Active Directory and click App registrations. Click + New registration. Enter the name ADFcookbookapp and click Register. From the app properties, you have to copy the Application (client) ID and the Directory (tenant) ID.
  8. Still in ADFcookbookapp, go to Certificates & secrets in the left sidebar. Click + New client secret, add a new client secret, and copy its value.
  9. Next, we should give permissions to our app. Go to the subscriptions. Choose Free Trial. Click on Access control (IAM). Click on Add role assignments. Select the Contributor role. Assign access to a user, group, or service principal. Finally, search for our app, ADFcookbookapp, and click Save. As a result, we just granted access to the app and we can use these credentials in our Python code.
  10. Open ADF_Python_Run.ipynb and make sure that you have all the libraries in place by executing the first code block. You can open the file in Jupyter Notebook:
    from azure.identity import ClientSecretCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import *
    from msrest.authentication import BasicTokenAuthentication
    from azure.core.pipeline.policies import BearerTokenCredentialPolicy
    from azure.core.pipeline import PipelineRequest, PipelineContext
    from azure.core.pipeline.transport import HttpRequest
    from azure.identity import DefaultAzureCredential
  11. You should run this piece without any problems. If you encounter an issue, it means you are missing a Python package. Make sure that you have installed all of the packages. Run sections 2 and 3 in the notebook. You can find the notebook in the GitHub repository with the book files.
  12. In section 4, Authenticate Azure, you have to enter the tenant_id, client_id, and client_secret values. We can leave the resource group and data factory name as they are. Then, run sections 4 and 5.
  13. The Python code will also interact with the Azure storage account, so we should provide the storage account name and key. For this chapter, we are using the adfcookbookstorage storage account, and you can find the key under the Access keys section of this storage account's menu. Copy the key value, paste it into section 6, Create a Linked Service, and run it.
  14. In sections 7 and 8, we are creating input and output datasets. You can run the code as is. In section 9, we will create the data pipeline and specify the CopyActivity activity.
  15. Finally, we will run the pipeline in section 10, Create a pipeline run.
  16. In the final section, Monitor a pipeline run, we will check the output of the run. We should get the following:
    Pipeline run status: Succeeded

We just created an ADF job with Python. Let's add more details.

How it works...

We used Azure Python packages in order to control Azure resources. We registered an app in order to authenticate the Python code and granted contributor permissions. Using Jupyter Notebook, we ran the code step by step and created a data factory, as well as executed the copy command.
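
For reference, here is a condensed, standalone sketch of the same flow, closely following Microsoft's Python quickstart for Data Factory. It assumes azure-mgmt-datafactory 2.0 or later, which accepts an azure.identity credential directly (the notebook wraps the credential for older releases). It reuses the ADFCookbook resource group and the adfcookbookstorage account from the earlier recipes; the data factory and pipeline names are illustrative, and the tenant, client, secret, subscription, and storage key values are placeholders.

import time

from azure.identity import ClientSecretCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    Factory, SecureString, AzureStorageLinkedService, LinkedServiceResource,
    LinkedServiceReference, AzureBlobDataset, DatasetResource, DatasetReference,
    BlobSource, BlobSink, CopyActivity, PipelineResource)

subscription_id = "<subscription-id>"
rg_name = "ADFCookbook"            # resource group from the earlier recipes
df_name = "ADFCookbook-Python"     # illustrative data factory name

credential = ClientSecretCredential(tenant_id="<tenant-id>",
                                    client_id="<client-id>",
                                    client_secret="<client-secret>")
adf_client = DataFactoryManagementClient(credential, subscription_id)

# 1. Create (or update) the data factory
adf_client.factories.create_or_update(rg_name, df_name, Factory(location="eastus"))

# 2. Linked service pointing at the adfcookbookstorage account
conn_str = SecureString(value="DefaultEndpointsProtocol=https;"
                              "AccountName=adfcookbookstorage;AccountKey=<account-key>")
ls = LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=conn_str))
adf_client.linked_services.create_or_update(rg_name, df_name, "AzureStorageLS", ls)

# 3. Input and output datasets: the uploaded file and the output folder
ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="AzureStorageLS")
ds_in = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref, folder_path="adfcookbook/input", file_name="SalesOrders.txt"))
adf_client.datasets.create_or_update(rg_name, df_name, "InputDataset", ds_in)
ds_out = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=ls_ref, folder_path="adfcookbook/output"))
adf_client.datasets.create_or_update(rg_name, df_name, "OutputDataset", ds_out)

# 4. Pipeline with a single copy activity
copy_activity = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputDataset")],
    source=BlobSource(), sink=BlobSink())
adf_client.pipelines.create_or_update(rg_name, df_name, "ADF-cookbook-pipeline-python",
                                      PipelineResource(activities=[copy_activity]))

# 5. Trigger the pipeline and check the run status
run = adf_client.pipelines.create_run(rg_name, df_name, "ADF-cookbook-pipeline-python",
                                      parameters={})
time.sleep(30)
pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
print("Pipeline run status: {}".format(pipeline_run.status))

The short sleep before checking the status keeps the sketch simple; in practice, you would poll until the run reaches a terminal state.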

There's more...

We used a notebook in order to demonstrate the sequence of steps and their output. We can also put the same code into a Python file and run it as a script.

See also

There are lots of useful resources available online about the use of Python with ADF.

 

Creating a data factory using PowerShell

Often, we don't have access to the UI and we want to create our infrastructure as code. Infrastructure as code is easily maintainable and deployable and allows us to track versions and manage changes through code commits and change requests. In this recipe, we will use PowerShell in order to create a data factory. If you have never used PowerShell before, you can find information about how to get PowerShell and install it onto your machine at the end of this recipe.

Getting ready

For this exercise, we will use PowerShell in order to create a data pipeline and copy our file from one folder to another.

How to do it...

Let's create an ADF job using PowerShell.

  1. In the case of macOS, we can run the following command to install PowerShell:
    brew install powershell/tap/powershell
  2. Check that it is working:
    pwsh

    Optionally, we can download PowerShell for our OS from https://github.com/PowerShell/PowerShell/releases/.

  3. Next, we have to install the Azure module. Run the following command:
    Install-Module -Name Az -AllowClobber
  4. Next, we should connect to the Azure account by running this command:
    Connect-AzAccount

    It will ask us to open the https://microsoft.com/devicelogin page and enter the code for authentication, and will tell us something like this:

    Account                 SubscriptionName  TenantId                              Environment
    -------                 ----------------  --------                              -----------
    [email protected]     Free Trial        1c204124-0ceb-41de-b366-1983c14c1628  AzureCloud
  5. Run the command in order to check the Azure subscription:
    Get-AzSubscription
  6. Now, we can create a data factory. As usual, we should specify the resource group:
    $resourceGroupName = "ADFCookbook"

    Then, run the code that will create or update the existing resource group:

    $ResGrp = New-AzResourceGroup $resourceGroupName -location 'East US'

    You can choose your region, then specify the ADF name:

    $dataFactoryName = "ADFCookbook-PowerShell"

    Now, we can run the command that will create a data factory under our resource group:

    $DataFactory = Set-AzDataFactoryV2 -ResourceGroupName $ResGrp.ResourceGroupName `
        -Location $ResGrp.Location -Name $dataFactoryName

    As a result, PowerShell will create a new data factory for us.

  7. The next steps would be the same as in the Python recipe – creating a linked service, datasets, and a pipeline. In the case of PowerShell, we should use JSON config files in which we specify the parameters.

We used PowerShell in order to create an ADF job. Let's add more details.

How it works...

We used PowerShell in order to connect to Azure and control Azure resources. We created a new data factory using the PowerShell command. In the same way, we can create datasets, data flows, linked services, and pipelines using JSON files for configuration, and then execute the command with PowerShell. For example, we can define a JSON file for the input dataset using the following code block:

{
    "name": "InputDataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "Binary",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": "emp.txt",
                "folderPath": "input",
                "container": "adftutorial"
            }
        }
    }
}

Save it as Input.json and then execute the following PowerShell command:

Set-AzDataFactoryV2Dataset -DataFactoryName $DataFactory.DataFactoryName `
    -ResourceGroupName $ResGrp.ResourceGroupName -Name "InputDataset" `
    -DefinitionFile ".\Input.json"

This command will create a dataset for our data factory.

There's more...

You can learn about the use of PowerShell with ADF by reviewing the available samples from Microsoft at https://docs.microsoft.com/en-us/azure/data-factory/samples-powershell.

See also

You can find more information about the use of PowerShell with Azure and ADF online.

 

Using templates to create ADF pipelines

Modern organizations operate in a fast-paced environment. It is important to deliver insights faster and have shorter analytics iterations. Moreover, Azure found that many organizations have similar use cases for their modern cloud analytics deployments. As a result, Azure built a number of predefined templates. For example, if you have data in Amazon S3 and you want to copy it into Azure Data Lake, you can find a specific template for this operation; or say you want to move an on-premises Oracle data warehouse to the Azure Synapse Analytics data warehouse – you are covered with ADF templates.

Getting ready

ADF provides us with templates in order to accelerate data engineering development. In this recipe, we will review the common templates and see how to use them.

How to do it...

We will find and review an existing template using Data Factories.

  1. In the Azure portal, choose Data Factories.
  2. Open our existing data factory, ADFcookbookJob1-Dmitry.
  3. Click Author and Monitor and it will open the ADF portal.
  4. From the home page, click on Create pipeline from template. It will open the page to the list of templates.
  5. Let's open Slowly Changing Dimension Type 2. This is one of the most popular techniques for building a data warehouse and dimensional modeling. From the description page, we can review the documentation, examples, and user input. For this particular example, we have Delimited Text as input and an Azure SQL database as output. If you would like to proceed and use this template, you have to fill in the user input and click Use this template. It will import this template into ADF and you can review the steps in detail as well as modify them.

    Let's review one more template.

  6. Let's choose the Distinct Rows template. For the user input, let's choose the existing AzureBlobStorage1 and click Use this template.
  7. It will import the pipeline, datasets, and data flows, as shown in the following screenshot:
    Figure 1.11 – ADF data flow activity

  8. We should review the datasets and update the information about the file path for the input dataset and output location. We won't run this job.
  9. You can also review the DistinctRows data flow, where you can see all the logic, as shown in the following screenshot:
Figure 1.12 – ADF data flow

You can review other templates and see many examples of ADF design.

How it works...

We learned that an ADF pipeline is a set of JSON configuration files. As a result, it is relatively easy to create new components and share them as a template. We can deploy each template right to ADF, or we can download the template bundle and modify the JSON files. These templates help us learn best practices and avoid reinventing the wheel.

See also

There are useful materials about the use of templates available online.

About the Authors

  • Dmitry Anoshin

    Dmitry Anoshin is an expert in analytics with 10 years of experience. He started using Tableau as a primary BI tool in 2011 as a BI consultant at Teradata. He is certified in both Tableau Desktop and Tableau Server. He leads probably the biggest Tableau user community, with more than 2,000 active users. This community has two to three Tableau talks every month led by top Tableau experts, Tableau Zen Masters, Viz Champions, and more. In addition, Dmitry has previously written three books with Packt and reviewed more than seven books. Finally, he is an active speaker at data conferences and helps people to adopt cloud analytics.

  • Dmitry Foshin

    Dmitry Foshin is a business intelligence team leader, whose main goals are delivering business insights to the management team through data engineering, analytics, and visualization. He has led and executed complex full-stack BI solutions (from ETL processes to building a DWH and reporting) using Azure technologies, Data Lake, Data Factory, Databricks, MS Office 365, Power BI, and Tableau. He has also successfully launched numerous data analytics projects – both on-premises and cloud – that help achieve corporate goals in international FMCG companies, banking, and manufacturing industries.

  • Roman Storchak

    Roman Storchak holds a PhD and is a chief data officer whose main interest lies in building data-driven cultures by making analytics easy. He has led teams that have built ETL-heavy products in AdTech and retail and often uses Azure Stack, Power BI, and Data Factory.

  • Xenia Ireton

    Xenia Ireton is a software engineer at Microsoft and has extensive knowledge in the field of data engineering, big data pipelines, data warehousing, and systems architecture.
