Hands-On Data Warehousing with Azure Data Factory

Hands-On Data Warehousing with Azure Data Factory: ETL techniques to load and transform data from various sources, both on-premises and on cloud

By Christian Cote, Michelle Gutzait, Giuseppe Ciaburro
$48.99
Book | May 2018 | 284 pages | 1st Edition

eBook: $39.99 $27.98
Print: $48.99
Subscription: $15.99 Monthly

What do you get with Print?

  • Instant access to your digital eBook copy whilst your print order is shipped
  • Black & white paperback book shipped to your address
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want

Product Details


Publication date : May 31, 2018
Length : 284 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781789137620

Hands-On Data Warehousing with Azure Data Factory

Chapter 1. The Modern Data Warehouse

Azure Data Factory (ADF) is a service available in the Microsoft Azure ecosystem. It allows the orchestration of different data loads and transfers in Azure.

Back in 2014, there were hardly any easy ways to schedule data transfers in Azure. There were a few open source solutions available, such as Apache Falcon and Oozie, but nothing was easily available as a service in Azure. Microsoft introduced ADF in public preview in October 2014, and the service went to general availability in July 2015.

The service allows the following actions:

  • Copying data from various sources and destinations
  • Calling various compute services, such as HDInsight and Azure SQL Data Warehouse, to transform data
  • Orchestrating the preceding activities using time slices and retrying the activities when there is an error

All these activities were available via the Azure portal at first, and in Visual Studio 2013 before general availability (GA).

The need for a data warehouse


A data warehouse is a repository of enterprise data used for reporting and analysis. There have been three waves of data warehouses so far, which we will cover in the upcoming subsections.

Driven by IT

This is the first wave of business intelligence (BI). IT needed to separate operational data and databases from their source systems for the following reasons:

  • To keep a history of data changes, since some operational applications purge their data after a while.
  • When users wanted to report on an application's data, they often affected the performance of the system. IT replicated the operational data to another server to avoid any performance impact on the applications.
  • Things got more complex when users wanted to analyze and report on databases from multiple enterprise applications. IT had to replicate all the needed systems and make them work together. This implied that new structures had to be built, and new patterns emerged from there: star schemas, decision support systems (DSS), OLAP cubes, and so on.

Self-service BI

Analysts and users always need data warehouses to evolve at a faster pace. This is the second wave of BI, and it happened when major BI players such as Microsoft and Qlik came out with tools that enabled users to merge data with or without a data warehouse. In many enterprises, this is used as a temporary source of analytics or as a proof of concept. On the other hand, not all data could fit into data warehouses at that time. Many ad hoc reports were, and still are, using self-service BI tools. Here is a short list of such tools:

  • Microsoft Power Pivot
  • Microsoft Power BI
  • Qlik

Cloud-based BI – big data and artificial intelligence

This is the third wave of BI. Cloud capabilities enable enterprises to do more accurate analysis. Big data technologies allow users to base their analysis on much bigger data volumes. This helps them derive patterns from the data and use technologies that incorporate and refine these patterns. This leads to artificial intelligence, or AI.

The technologies used in big data are not that new. They were used by many search engines in the early 21st century, such as Yahoo! and Google, and have also been used quite a lot by research departments in various enterprises. The third wave of BI broadened the usage of these technologies. Vendors such as Microsoft, Amazon, and Google make them available to almost everyone with their cloud offerings.

The modern data warehouse


Microsoft, as well as many other service providers, has outlined the concept of the modern data warehouse. Here are some of the many features a modern data warehouse should have:

  • Integration of relational as well as non-relational sources: The data warehouse should be able to ingest data that is not easily integrable in the traditional data warehouse, such as big data, non-relational crunched data, and so on.
  • Hybrid deployment: The data warehouse should be able to extend the data warehouse from on-premises storage to the cloud.
  • Advanced analytics: The data warehouse should be able to analyze the data from all kinds of datasets using different modern machine learning tools.
  • In-database analytics: The data warehouse should be able to use Microsoft software that integrates very powerful open analytics tools, such as R and Python, directly in the database. Also, with PolyBase integration, the data warehouse can integrate more data sources when it's based on SQL Server.

Main components of a data warehouse

This section will discuss the various parts of a data warehouse.

Staging area

In a classic data warehouse, this zone is usually a database, or a schema within one, that is used to hold a copy of the data from the source systems. The staging area is necessary because, most of the time, data sources are not stored on the same server as the data warehouse. Even if they are on the same server, we prefer a copy of them for the following reasons:

  • To preserve data integrity: all data is copied over as of a specific point in time, which ensures consistency between tables.
  • We might need specific indexes that we could not create in the source system. When we query the data, we're not necessarily using the same links (joins) as the source system does. Therefore, we might have to create indexes to increase query performance.
  • Querying the source might have an impact on the performance of the source application. Usually, the staging area is used to bring over just the changes from the source systems. This prevents processing too much data from the data source.

Not to mention that the data sources might be files: CSV, XML, and so on. It's much easier to bring their content into relational tables. From a modern data warehouse perspective, this means storing the files in HDFS and partitioning them by date.

In a modern data warehouse, if we're in the cloud only, relational data can still be stored in databases. The only difference might be in the location of the databases. In Azure, we can use Azure SQL Database tables or Azure SQL Data Warehouse.

Data warehouse

This is where the data is copied over from the staging area. There are several schools of thought that define the data warehouse:

  • Kimball group data warehouse bus: Ralph Kimball was a pioneer in data warehousing. He and his colleagues wrote many books and articles on their method. It consists of conformed dimensions that can be used by many business processes. For example, if we have a dimension named DimCustomer, we should link it to all fact tables that store customers. We should not create another dimension that redefines our customers. The following link gives more information on the Kimball group method: https://www.kimballgroup.com.
  • Inmon CIF: Bill Inmon and his colleagues defined the corporate information factory (CIF) at the end of the 1990s. It consisted of modeling the source systems, commonly using the third normal form. All the data in the tables was dated, which means that any changes in the data sources were inserted into the data warehouse tables. The following link gives more information on CIF: http://www.inmoncif.com.
  • Data Vault: Created by Dan Linstedt in the 21st century, this is the latest and arguably most efficient modeling method in data warehousing. It consists of breaking down the source data into many different entities. This gives a lot of flexibility when the data is consumed: we reconstruct the data and use only the pieces necessary for our analysis. Here is a link that gives more information on Data Vault: http://learndatavault.com.

Cubes

In addition to the relational data warehouse, we might have a cube, such as one built with SQL Server Analysis Services. Cubes don't replace the relational data warehouse; they extend it. They can also connect to the parts of the warehouse that are not necessarily stored in a relational database. By doing this, they become a semantic layer that can be used by the consumption layer described next.

Consumption layer – BI and analytics

This area is where the data is consumed from the data warehouse and/or the data lake. This book has a chapter dedicated to the data lake. In short, the data lake is composed of several areas (data ponds) that classify the data inside it. The data warehouse is a part of the data lake; it contains the certified data. The data outside the data warehouse in the data lake is, most of the time, non-certified. It is used for ad hoc analysis or data discovery.

The BI part can be stored in relational databases, analytic cubes, or models. It can also consist of views on top of the data warehouse when the data is suitable for it.

What is Azure Data Factory?

Azure data factories are composed of the following components:

  • Linked services: Connectors to the various storage and compute services. For example, we can have a pipeline that will use the following artifacts:
    • HDInsight cluster on demand: Access to the HDInsight compute service to run a Hive script that uses HDFS external storage
    • Azure Blob storage/SQL Azure: As the Hive job runs, this will retrieve the data from Azure Blob storage and copy it to a SQL Azure database
  • Datasets: These are layers over the data used in pipelines. A dataset uses a linked service.
  • Pipeline: The pipeline is the link between all datasets. It contains activities that initiate data movements and transformations. It is the engine of the factory; without pipelines, nothing will move in the factory.
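
As a rough illustration of how these three artifact types fit together, here is a minimal JSON sketch based on the Blob-to-SQL scenario above. The names (LS_AzureBlobStorage, DS_CustomersCsv, DS_CustomersSqlTable, PL_CopyCustomers) are hypothetical, the connection string is a placeholder, and the exact type names and properties depend on the connector and the ADF version; in practice each definition lives in its own JSON document:

```json
{
  "name": "LS_AzureBlobStorage",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>"
    }
  }
}

{
  "name": "DS_CustomersCsv",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": { "referenceName": "LS_AzureBlobStorage", "type": "LinkedServiceReference" },
    "typeProperties": {
      "folderPath": "staging/customers",
      "format": { "type": "TextFormat" }
    }
  }
}

{
  "name": "PL_CopyCustomers",
  "properties": {
    "activities": [
      {
        "name": "CopyBlobToSqlAzure",
        "type": "Copy",
        "inputs": [ { "referenceName": "DS_CustomersCsv", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "DS_CustomersSqlTable", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink" }
        }
      }
    ]
  }
}
```

The pattern is the one the bullets describe: the pipeline's copy activity wires a source dataset to a sink dataset, and each dataset points back to a linked service.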

Limitations of ADF V1.0

As good as ADF was, and although a lot of features were added to it after its GA in 2015, there were a few limitations. At first, we relied on JSON quite a lot to define the various ADF artifacts. The number of supported data stores and compute capabilities was also quite limited.

The development experience was also very different from V2.0. In the portal, we could use the Author and Deploy capability, but it only gave us JSON templates to edit.

As we will see later in this book, the new V2.0 factory has a much better development experience.

When it came to source control, we had to rely on Visual Studio integration. From Visual Studio, we could create or import an existing factory and therefore, use the source control of our choice to version it.

What's new in V2.0?


With V2, ADF has now been overhauled. This section will describe the main novelties of ADF V2.

Integration runtime

This is one of the main features of version 2.0. It represents the compute infrastructure and performs data integration across networks. Here are some enhancements it can provide:

  • Data movement between public and private networks, either on-premises or using a virtual private network (VPN). This capability was known as the data management gateway in V1 and in Power BI.
    • Public: These are used for Azure and other cloud connections. There's a default integration runtime that comes with ADF.
    • Private: These are used to connect private compute resources, such as an on-premises SQL Server, to ADF. We need to install a service on a Windows machine in the private network. That machine can then connect to the enterprise resources and send the data to ADF via the service installed on it.
  • SSIS package execution—managing SSIS packages in Azure. This is one of the main topics of this book. Chapter 3, SSIS Lift and Shift, is completely dedicated to this feature.
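
As a sketch, a private (self-hosted) integration runtime is itself just another factory artifact. Something along these lines would be defined before installing the runtime service on the on-premises Windows machine (the name SelfHostedIR is hypothetical):

```json
{
  "name": "SelfHostedIR",
  "properties": {
    "type": "SelfHosted",
    "description": "Runtime hosted on a Windows machine inside the private network"
  }
}
```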

Linked services

Linked services now have a connectVia property, which lets them use the integration runtimes mentioned earlier in this chapter. They can also connect to many more data stores than was possible before.
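
For example, an on-premises SQL Server linked service might reference the self-hosted runtime through connectVia roughly as follows (LS_OnPremSqlServer and SelfHostedIR are hypothetical names, and the connection string is a placeholder):

```json
{
  "name": "LS_OnPremSqlServer",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=<server>;Database=<db>;Integrated Security=True"
    },
    "connectVia": {
      "referenceName": "SelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
```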

Datasets

Datasets are the same as they were in V1, but we no longer need to define any availability schedules in them. This gives them more flexibility in their usage. In conjunction with linked services, datasets now have access to a whole lot of new data stores, both as sources and as destinations.
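
A V2 dataset therefore only describes the data and the linked service it sits on; there is no availability section as there was in V1. A minimal sketch, reusing the hypothetical linked service from the previous section (the dataset and table names are equally hypothetical):

```json
{
  "name": "DS_CustomersSqlTable",
  "properties": {
    "type": "SqlServerTable",
    "linkedServiceName": { "referenceName": "LS_OnPremSqlServer", "type": "LinkedServiceReference" },
    "typeProperties": { "tableName": "dbo.Customers" }
  }
}
```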

Pipelines

Pipelines have been modified quite a lot in V2. They no longer have windows of execution with start times and end times. Pipelines can now be executed using the following techniques:

  • On demand via .NET, PowerShell, REST API, or Python
  • Trigger:
    • Schedule trigger: This trigger uses a wall clock kind of schedule, for example, a pipeline can be executed on a weekly basis every Tuesday and Thursday at 10:00 AM
    • Tumbling window trigger: This works on a periodic interval, for example, every 15 minutes between two specific dates
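
As a hedged sketch, the two trigger types from the list above might look like this in JSON (trigger names, the referenced pipeline, and the dates are hypothetical placeholders):

```json
{
  "name": "TR_TueThu10am",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Week",
        "interval": 1,
        "startTime": "2018-06-01T00:00:00Z",
        "timeZone": "UTC",
        "schedule": { "weekDays": [ "Tuesday", "Thursday" ], "hours": [ 10 ], "minutes": [ 0 ] }
      }
    },
    "pipelines": [
      { "pipelineReference": { "referenceName": "PL_CopyCustomers", "type": "PipelineReference" } }
    ]
  }
}

{
  "name": "TR_Every15Min",
  "properties": {
    "type": "TumblingWindowTrigger",
    "typeProperties": {
      "frequency": "Minute",
      "interval": 15,
      "startTime": "2018-06-01T00:00:00Z",
      "endTime": "2018-06-30T00:00:00Z",
      "maxConcurrency": 1
    },
    "pipeline": {
      "pipelineReference": { "referenceName": "PL_CopyCustomers", "type": "PipelineReference" }
    }
  }
}
```

Note the subtle difference in shape: a schedule trigger can reference several pipelines, while a tumbling window trigger is tied to a single pipeline.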

Activities

Pipelines now have the following control activities:

  • Execute pipeline: Calls another pipeline in the same factory.
  • For each activity: Executes activities in a loop similar to any for each loop in structured programming languages.
  • Web activity: Used to call custom REST endpoints.
  • Lookup activity: Gets a record from any external data. The output can later be used by subsequent activities.
  • Get metadata activity: Gets the metadata of any data referenced by an ADF dataset.
  • Until activity: Loops the execution of a set of activities until a condition evaluates to true.
  • If condition activity: This is like any if statement in standard programming languages.
  • Wait activity: Pauses the pipeline for a time before resuming other activities.
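
To illustrate how these control activities chain together, here is a hedged sketch of a pipeline's activities array in which a lookup activity feeds an if condition activity, which in turn wraps a wait activity. The activity names, the query, and the rowCount field are hypothetical, and the nested activity is kept deliberately trivial:

```json
[
  {
    "name": "LookupRowCount",
    "type": "Lookup",
    "typeProperties": {
      "source": { "type": "SqlSource", "sqlReaderQuery": "SELECT COUNT(*) AS rowCount FROM dbo.Customers" },
      "dataset": { "referenceName": "DS_CustomersSqlTable", "type": "DatasetReference" }
    }
  },
  {
    "name": "IfNewRows",
    "type": "IfCondition",
    "dependsOn": [ { "activity": "LookupRowCount", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
      "expression": {
        "value": "@greater(activity('LookupRowCount').output.firstRow.rowCount, 0)",
        "type": "Expression"
      },
      "ifTrueActivities": [
        { "name": "WaitBeforeLoad", "type": "Wait", "typeProperties": { "waitTimeInSeconds": 30 } }
      ]
    }
  }
]
```

The dependsOn element is what expresses the control flow: the if condition only runs once the lookup has succeeded, and its expression reads the lookup's output.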

Parameters

Parameters can now be used in pipelines. They are read-only values that are passed in when the pipeline is executed manually or started by a trigger.
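
For instance, a pipeline might declare its parameters like this, with values supplied by the caller or the trigger at run time (the pipeline and parameter names are hypothetical, and the activities array is left empty for brevity):

```json
{
  "name": "PL_LoadSales",
  "properties": {
    "parameters": {
      "sourceFolder": { "type": "String", "defaultValue": "staging/sales" },
      "loadDate": { "type": "String" }
    },
    "activities": [ ]
  }
}
```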

Expressions

In V1, functions could be used to filter out dataset queries. In V2, expressions can be used anywhere in JSON-defined factory objects.
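
As an example, an expression can now be embedded in almost any property of a dataset or activity, not just in a dataset query. A sketch using the hypothetical parameters declared above:

```json
{
  "folderPath": {
    "value": "@concat(pipeline().parameters.sourceFolder, '/', pipeline().parameters.loadDate)",
    "type": "Expression"
  }
}
```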

Controlling the flow of activities

Calling activities is more flexible in V2 than it was in V1. As listed in the Activities section above, there are many new control activities, such as ForEach, If Condition, Until, Lookup, and so on.

SSIS package deployment in Azure

There is now a new Azure-SSIS integration runtime that completely manages clusters of Azure VMs dedicated to running SSIS in the cloud. When using the Azure-SSIS integration runtime, packages are deployed in the same manner as they are deployed on-premises. SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) can be used to deploy SSIS packages.

Spark cluster data store

There are many more data stores available now.

Spark clusters are now available in V2. Since Spark is very performant and integrates more and more functionality, it has become an almost essential player in the big data world. In the previous version of ADF, Spark clusters were only available via MapReduce custom activities. In this version, Spark is a first-class citizen, so there will be no more headaches when it comes to integrating it into our data flows.
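
As a rough sketch, a Spark job can now be called directly from a pipeline with a dedicated Spark activity instead of a custom activity. The cluster linked service, storage linked service, paths, and script name below are hypothetical:

```json
{
  "name": "TransformWithSpark",
  "type": "HDInsightSpark",
  "linkedServiceName": { "referenceName": "LS_HDInsightCluster", "type": "LinkedServiceReference" },
  "typeProperties": {
    "rootPath": "adfspark",
    "entryFilePath": "scripts/transform.py",
    "sparkJobLinkedService": { "referenceName": "LS_AzureBlobStorage", "type": "LinkedServiceReference" }
  }
}
```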

Summary


In this chapter, we saw the features of a modern data warehouse. We also saw the new features added in version 2.0 of ADF.

In the next chapter, we will use the data factory to move data from Azure SQL to Azure storage.

 


Key benefits

  • Combine the power of Azure Data Factory v2 and SQL Server Integration Services
  • Design and enhance performance and scalability of a modern ETL hybrid solution
  • Interact with the loaded data in the data warehouse and data lake using Power BI

Description

ETL is one of the essential techniques in data processing. Given that data is everywhere, ETL will always be the vital process to handle data from different sources. Hands-On Data Warehousing with Azure Data Factory starts with the basic concepts of data warehousing and the ETL process. You will learn how Azure Data Factory and SSIS can be used to understand the key components of an ETL solution. You will go through the different services offered by Azure that can be used by ADF and SSIS, such as Azure Data Lake Analytics, Machine Learning, and Databricks Spark, with the help of practical examples. You will explore how to design and implement ETL hybrid solutions using different integration services with a step-by-step approach. Once you get to grips with all this, you will use Power BI to interact with data coming from different sources in order to reveal valuable insights. By the end of this book, you will not only know how to build your own ETL solutions but also be able to address the key challenges that are faced while building them.

What you will learn

  • Understand the key components of an ETL solution using Azure Data Factory and Integration Services
  • Design the architecture of a modern ETL hybrid solution
  • Implement ETL solutions for both on-premises and Azure data
  • Improve the performance and scalability of your ETL solution
  • Gain thorough knowledge of new capabilities and features added to Azure Data Factory and Integration Services



Table of Contents

12 Chapters
Title Page
Packt Upsell
Contributors
Preface
The Modern Data Warehouse
Getting Started with Our First Data Factory
SSIS Lift and Shift
Azure Data Lake
Machine Learning on the Cloud
Introduction to Azure Databricks
Reporting on the Modern Data Warehouse
Index

