
You're reading from Azure Data Factory Cookbook - Second Edition

Product type: Book
Published in: Feb 2024
Publisher: Packt
ISBN-13: 9781803246598
Edition: 2nd Edition
Authors (4):

Dmitry Foshin

Dmitry Foshin is a business intelligence team leader whose main goal is delivering business insights to the management team through data engineering, analytics, and visualization. He has led and executed complex full-stack BI solutions (from ETL processes to building DWH and reporting) using Azure technologies, Data Lake, Data Factory, Databricks, MS Office 365, Power BI, and Tableau. He has also successfully launched numerous data analytics projects – both on-premises and cloud – that help achieve corporate goals in international FMCG companies, banking, and manufacturing industries.

Tonya Chernyshova

Tonya Chernyshova is an experienced Data Engineer with over 10 years in the field, including time at Amazon. Specializing in Data Modeling, Automation, Cloud Computing (AWS and Azure), and Data Visualization, she has a strong track record of delivering scalable, maintainable data products. Her expertise drives data-driven insights and business growth, showcasing her proficiency in leveraging cloud technologies to enhance data capabilities.

Dmitry Anoshin

Dmitry Anoshin is a data-centric technologist and a recognized expert in building and implementing big data and analytics solutions. He has a successful track record of implementing business and digital intelligence projects in numerous industries, including retail, finance, marketing, and e-commerce. Dmitry possesses in-depth knowledge of digital/business intelligence, ETL, data warehousing, and big data technologies. He has extensive experience in the data integration process and is proficient in using various data warehousing methodologies. Dmitry has constantly exceeded project expectations when he has worked in the financial, machine tool, and retail industries. He has completed a number of multinational full BI/DI solution life cycle implementation projects. With expertise in data modeling, Dmitry also has a background and business experience in multiple relational databases, OLAP systems, and NoSQL databases. He is also an active speaker at data conferences and helps people adopt cloud analytics.

Xenia Ireton

Xenia Ireton is a Senior Software Engineer at Microsoft. She has extensive knowledge of building distributed services, data pipelines, and data warehouses.

The Best Practices of Working with ADF

Welcome to the final chapter of Azure Data Factory Cookbook, where we delve into the best practices for working with Azure Data Factory (ADF) and Azure Synapse. Throughout this cookbook, we’ve explored a multitude of recipes and techniques to help you harness the power of ADF for your data integration and transformation needs. In this closing chapter, we’ll guide you through essential considerations, strategies, and practical recipes that will elevate your ADF projects to new heights of efficiency, security, and scalability.

We will cover the following list of recipes in this chapter:

  • Setting up roles and permissions with access levels in ADF
  • Setting up Meta ETL with ADF
  • Leveraging ADF scalability: Performance tuning of an ADF pipeline
  • Using ADF disaster recovery built-in features
  • Change Data Capture
  • Managing Data Factory costs with FinOps

Technical requirements

For this chapter, you will need the following:

Setting up roles and permissions with access levels in ADF

ADF is built on principles of collaboration, and to work effectively you will need to grant access privileges to other users and teams. By its very nature, ADF relies on integration with other services; therefore, entities such as users, service principals, and managed identities will require access to resources within your ADF instance. User access management is a pivotal feature of ADF.

Similar to many Azure services, ADF relies on Role-Based Access Control (RBAC). RBAC enables fine-grained definitions of roles that can be granted, or assigned, to users, groups, service principals, or managed identities. These role assignments determine who can perform specific actions, such as viewing or making changes to pipelines, datasets, linked services, and other components, and ultimately govern access to your data workflows.
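If you script your environment setup, role assignments can be created programmatically as well. The following is a minimal sketch using the Azure SDK for Python (azure-identity plus azure-mgmt-authorization 1.0+); the subscription, resource group, factory name, and principal object ID are placeholders to substitute, and the GUID is the definition ID of the built-in Data Factory Contributor role at the time of writing.

import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

# Placeholder identifiers -- substitute your own values.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
principal_id = "<object-id-of-user-group-or-managed-identity>"

# Built-in "Data Factory Contributor" role definition ID.
DATA_FACTORY_CONTRIBUTOR = "673868aa-7521-48a0-acc6-0f60742d39f5"

# Scope the assignment to a single factory so the principal is granted
# no more access than it needs.
scope = (
    f"/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}"
    f"/providers/Microsoft.DataFactory/factories/{factory_name}"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

client.role_assignments.create(
    scope=scope,
    role_assignment_name=str(uuid.uuid4()),  # role assignment names are GUIDs
    parameters=RoleAssignmentCreateParameters(
        role_definition_id=(
            f"/subscriptions/{subscription_id}/providers"
            f"/Microsoft.Authorization/roleDefinitions/{DATA_FACTORY_CONTRIBUTOR}"
        ),
        principal_id=principal_id,
    ),
)

Scoping the assignment to a single factory, rather than to the resource group or subscription, keeps each grant as narrow as possible.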

Imagine a scenario where a company is using ADF to orchestrate its data pipelines, which involves...

Setting up Meta ETL with ADF

When faced with the task of copying a vast number of objects, such as thousands of tables, or loading data from a diverse range of sources, an effective approach is to leverage a control table that contains a list of object names along with their required copy behaviors. By employing parameterized pipelines, these object names and behaviors can be read from the control table and applied to the jobs accordingly. “Copy behaviors” refer to the specific actions or configurations associated with copying each object. These behaviors can include parameters such as source and destination locations, data transformation requirements, scheduling preferences, error-handling strategies, and any other settings relevant to the copying process.
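To make the pattern concrete, here is a minimal sketch of the driver logic. It assumes a hypothetical control table dbo.etl_control (object_name, source_path, sink_path, is_active) in Azure SQL Database, pyodbc to read it, and a parameterized pipeline named CopyObjectPipeline triggered through azure-mgmt-datafactory; all of these names are illustrative. Inside ADF itself, the same fan-out is typically built with a Lookup activity feeding a ForEach.

import pyodbc
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Illustrative names -- substitute your own.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
conn_str = "<odbc-connection-string-to-the-control-database>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Read the control table: one row per object, with its copy behavior.
with pyodbc.connect(conn_str) as conn:
    rows = conn.execute(
        "SELECT object_name, source_path, sink_path "
        "FROM dbo.etl_control WHERE is_active = 1"
    ).fetchall()

# Fan each row out to a run of the parameterized pipeline. The pipeline
# itself never changes when objects are added to or removed from the
# control table.
for object_name, source_path, sink_path in rows:
    run = adf.pipelines.create_run(
        resource_group,
        factory_name,
        "CopyObjectPipeline",  # hypothetical parameterized pipeline
        parameters={
            "objectName": object_name,
            "sourcePath": source_path,
            "sinkPath": sink_path,
        },
    )
    print(f"{object_name}: started run {run.run_id}")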

Unlike traditional methods that require redeploying pipelines whenever the object list needs modification (e.g., adding or removing objects), utilizing a control table allows for swift and straightforward updates...

Leveraging ADF scalability: Performance tuning of an ADF pipeline

Due to its serverless architecture, ADF is inherently scalable, dynamically adjusting its resource allocation to meet workload demands without the need for users to manage physical servers. This flexible architecture offers users various techniques to enhance the performance of their data pipelines.

One approach for improving performance involves harnessing the power of parallelism, such as incorporating a ForEach activity into your pipelines. The ForEach activity allows for the parallel processing of data by iterating over a collection of items, executing a specified set of activities for each item in parallel. This can significantly reduce overall execution time, especially when dealing with large datasets or when multiple independent tasks can be processed concurrently.
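The sketch below shows one way to express this, using the azure-mgmt-datafactory model classes to define a pipeline whose ForEach activity fans a file list out to a hypothetical child pipeline named ProcessFile; all names here are assumptions, and the same definition is more commonly authored in ADF Studio. Setting is_sequential=False together with batch_count caps how many iterations run concurrently (ADF allows up to 50).

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ExecutePipelineActivity,
    Expression,
    ForEachActivity,
    ParameterSpecification,
    PipelineReference,
    PipelineResource,
)

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# For each file name, invoke a (hypothetical) child pipeline that
# processes a single file.
per_item = ExecutePipelineActivity(
    name="ProcessOneFile",
    pipeline=PipelineReference(reference_name="ProcessFile"),
    parameters={"fileName": "@item()"},
    wait_on_completion=True,
)

for_each = ForEachActivity(
    name="ForEachFile",
    items=Expression(value="@pipeline().parameters.fileList"),
    is_sequential=False,  # run iterations in parallel...
    batch_count=10,       # ...at most 10 at a time
    activities=[per_item],
)

adf.pipelines.create_or_update(
    resource_group,
    factory_name,
    "ParallelFileProcessing",
    PipelineResource(
        parameters={"fileList": ParameterSpecification(type="Array")},
        activities=[for_each],
    ),
)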

For example, suppose you have a pipeline that needs to process data from multiple files stored in Azure Blob Storage. By using a ForEach...

Using ADF disaster recovery built-in features

ADF provides organizations with the tools they need to effortlessly create, schedule, and oversee data pipelines, facilitating the seamless movement and transformation of data. Maintaining data availability and keeping downtime to a minimum are pivotal aspects of preserving business operations. In this recipe, we’ll guide you through the process of designing a disaster recovery solution for your ADF as the ETL/ELT engine for data movement and transformation.

Getting ready

Before we start, please ensure that you have an Azure subscription and are familiar with the basics of Azure resources such as the Azure portal, creating and deleting Azure resources, and creating pipelines in ADF.

How to do it...

Before diving into disaster recovery planning, it’s crucial to understand that ADF is a Platform-as-a-Service (PaaS) offering by Azure. Azure PaaS provides a ready-to-develop and deploy infrastructure, including...
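One practical building block for such a plan is keeping a redeployable copy of your factory definition. If Git integration is not already providing this, the following minimal sketch (assuming azure-identity and azure-mgmt-resource) exports the resource group's ARM template so the factory can be recreated in a secondary region:

import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import ExportTemplateRequest

subscription_id = "<subscription-id>"
resource_group = "<resource-group-containing-the-factory>"

client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# Export an ARM template for everything in the resource group;
# redeploying it to a paired region recreates the factory and its
# pipelines, datasets, and linked services.
poller = client.resource_groups.begin_export_template(
    resource_group,
    ExportTemplateRequest(
        resources=["*"],
        options="IncludeParameterDefaultValues",
    ),
)
template = poller.result().template

with open("adf_dr_template.json", "w") as f:
    json.dump(template, f, indent=2)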

Change Data Capture

The Change Data Capture (CDC) tool in Azure Data Factory enables real-time data synchronization by efficiently tracking and capturing only the changed data. It optimizes data integration workflows, reduces processing time, and ensures data consistency across systems. With built-in connectors and support for hybrid environments, CDC empowers organizations to stay up to date with analytics and reporting.

Getting ready

Before getting started with the recipe, log in to your Microsoft Azure account.

We assume you have a pre-configured resource group and storage account with Azure Data Lake Gen2, Azure Data Factory, and Azure SQL Database. To set these up, please refer to Chapter 1, Getting Started with ADF, and the Creating and executing our first job in ADF recipe.

  • In Azure SQL Database, you will need the movielens CSV files loaded into the dbo schema with the following table name: dbo.movielens_ratings (enabling SQL-side CDC on this table is sketched after this list).
  • In Azure Data Lake Gen2...
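For an Azure SQL Database source, native SQL-side CDC is enabled per database and then per table with the sys.sp_cdc_* system procedures. Here is a minimal sketch using pyodbc against the ratings table above; the connection string is a placeholder, enabling CDC requires db_owner rights, and the feature is not available on the smallest service tiers.

import pyodbc

# Placeholder connection string to the Azure SQL Database that holds
# the movielens tables.
conn_str = "<odbc-connection-string>"

with pyodbc.connect(conn_str, autocommit=True) as conn:
    cursor = conn.cursor()

    # Enable CDC at the database level (requires db_owner).
    cursor.execute("EXEC sys.sp_cdc_enable_db")

    # Track changes on the ratings table used in this recipe.
    cursor.execute(
        """
        EXEC sys.sp_cdc_enable_table
            @source_schema = N'dbo',
            @source_name   = N'movielens_ratings',
            @role_name     = NULL
        """
    )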

Managing Data Factory costs with FinOps

Data Factory is a crucial service for data processing in Azure, but managing its costs effectively is essential to avoid unexpected expenses.

FinOps is a set of practices and principles that help organizations manage their cloud costs efficiently. It involves collaboration between finance, IT, and business teams to optimize cloud spending, allocate costs accurately, and drive accountability. The goal of FinOps is to strike a balance between cost optimization and enabling cloud innovation.

Examples of applying FinOps principles to ADF include:

  • Resource Right-sizing: Analyze the compute resources used by your Data Factory pipelines and adjust them based on actual workload requirements. For instance, if certain pipelines consistently underutilize resources, consider downsizing the compute instances to save costs (a back-of-the-envelope illustration follows this list).
  • Schedule Optimization: Leverage Data Factory’s scheduling capabilities to run pipelines during off-peak...
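To put arithmetic behind the right-sizing bullet above: Copy activity data movement is billed per DIU-hour, so running with fewer DIUs can cost less even when the copy takes longer. The rate below is purely illustrative; check the Azure pricing page for your region.

# Back-of-the-envelope Copy activity cost model. The rate is purely
# illustrative -- look up the data movement price for your region.
DIU_HOUR_RATE = 0.25  # assumed $/DIU-hour for Azure IR data movement

def copy_cost(dius: int, duration_minutes: float) -> float:
    """Estimated cost of one Copy activity run."""
    return dius * (duration_minutes / 60) * DIU_HOUR_RATE

# A copy that auto-scales to 16 DIUs and finishes in 20 minutes...
print(f"16 DIUs, 20 min: ${copy_cost(16, 20):.2f}")   # ~$1.33
# ...versus right-sizing to 4 DIUs, even if it then takes 45 minutes.
print(f" 4 DIUs, 45 min: ${copy_cost(4, 45):.2f}")    # $0.75

Here the slower, right-sized run costs roughly half as much, which is the trade-off FinOps asks you to weigh consciously rather than by default.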