You're reading from Hands-On Data Warehousing with Azure Data Factory

Product type: Book
Published in: May 2018
Publisher: Packt
ISBN-13: 9781789137620
Edition: 1st Edition
Authors (3):

Christian Cote

Christian Cote is an IT professional with more than 15 years of experience working on data warehouse, Big Data, and business intelligence projects. Christian has developed expertise in data warehousing and data lakes over the years and has designed many ETL/BI processes using a range of tools on multiple platforms. He has presented at several conferences and code camps. He currently co-leads a SQL Server PASS chapter. He is also a Microsoft Data Platform Most Valuable Professional (MVP).

Michelle Gutzait

Michelle Gutzait has been in IT for 30 years as a developer, business analyst, and database...

Giuseppe Ciaburro

Giuseppe Ciaburro holds a PhD and two master's degrees. He works at the Built Environment Control Laboratory, Università degli Studi della Campania "Luigi Vanvitelli". He has over 25 years of work experience in programming, first in the field of combustion and later in acoustics and noise control. His core programming knowledge is in MATLAB, Python, and R. As an expert in AI applications to acoustics and noise-control problems, Giuseppe has wide experience in research and teaching. He has several publications to his credit: monographs, scientific journals, and thematic conferences. He was recently included in the world's top 2% scientists list by Stanford University (2022).

Chapter 4. Azure Data Lake

One of the biggest problems that mid-sized and enterprise organizations face is that data resides everywhere. Over the years, data has usually been accumulated by different systems and by third-party or in-house-developed applications. Many vendors have required that their database servers be segregated in order to guarantee the performance, security, and manageability of their systems. In addition, third-party vendors did not, or do not, want to take responsibility for their systems in a shared environment.

Organizations are starting to realize, or are already in the process of realizing, that consolidation is a must, both from a cost perspective and for easier manageability. However, in many cases, the original vendors or developers are no longer to be found, which makes it very hard to decide whether to upgrade and/or migrate to the cloud. What can complicate things even further is the fact that shared or centralized data may be replicated everywhere, and there may not even be one...

Creating and configuring Data Lake Store


We will first create and configure the Data Lake Store:

  1. Open the Azure portal. If you are just starting, you will not see any resources configured under the All resources and ALL SUBSCRIPTIONS section.
  2. On the top left, click on Create a resource, then enter the words Data Lake in Search the Marketplace.
  3. Select Data Lake Store from the list (the third option in the image) if you have no Data Lake Stores yet; the following screen will open up.
  4. Select Create.
  5. Enter the details of the Data Lake Store. Note that the name has to be all lowercase, with no special characters; you will get a message as you type if you enter an invalid character. In this case, we are not using any encryption, for simplicity. Note that the default is encryption enabled. For more information about the encryption options, see Encryption of data in Azure Data Lake Store (https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-portal).
  6. Select Create.
  7. Once the Data...

Creating a Data Lake Analytics resource


In order to be able to run a U-SQL task or job, we need to create the Data Lake Analytics resource. In the Azure dashboard, click on New to create a new resource and look for the Data Lake Analytics resource in the new window:

Press Enter, and in the new window, click on Create:

Data Lake Analytics blade

Enter the name of the new resource (note that the resource name should contain only lowercase letters and numbers) and the rest of the information:

Click on the Data Lake Store section and choose the Data Lake Store we created previously:

Then click on Create:

Find the new resource to ensure it was created:

All resources blade

We have created the Data Lake Analytics resource, and now we can run U-SQL to manipulate or summarize data. We can run U-SQL either directly from the Data Lake Analytics resource, via a job, or from Data Factory in a pipeline.
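To give a feel for the language before we walk through the tooling, here is a minimal U-SQL sketch of the manipulate-and-output pattern. The file paths and column names (/input/SalesOrders.csv, OrderId, Customer, Amount) are hypothetical placeholders for illustration, not objects created earlier in this chapter:

    // Read a CSV file from the default Data Lake Store account.
    // The file path and columns are placeholders, not real objects.
    @orders =
        EXTRACT OrderId int,
                Customer string,
                Amount decimal
        FROM "/input/SalesOrders.csv"
        USING Extractors.Csv(skipFirstNRows: 1);

    // Keep only the larger orders.
    @bigOrders =
        SELECT OrderId, Customer, Amount
        FROM @orders
        WHERE Amount > 1000;

    // Write the result back to the store as a new CSV file.
    OUTPUT @bigOrders
    TO "/output/BigOrders.csv"
    USING Outputters.Csv(outputHeader: true);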

The next two sections will show you how to do the following:

  • Run U-SQL via a job in Data Lake Analytics...

Using the Data Factory to manipulate data in the Data Lake


In the previous section, we created the Data Lake Analytics resource for the U-SQL task:

  • Even though it is possible, it is not at all straightforward to run U-SQL that connects directly to a SQL database; it involves tweaking firewalls and permissions. This is why we do not cover that approach in the next section, which describes how to run a U-SQL job directly from the Data Lake Analytics resource.
  • It is much simpler to copy data from a SQL Server database to a file on Azure Blob Storage via the Azure Data Factory.
  • In this section, we show how to do this, and then how to manipulate the copied data with U-SQL using the Azure Data Factory.

We will now create a pipeline in Azure Data Factory that will do the following:

  • Task 1: Import data from SQL Server (from a view) into a file on blob storage
  • Task 2: Use U-SQL to export summary data to a file on blob storage

Task 1 – copy/import data from SQL Server to a blob storage file using Data Factory

Let's create...

Run U-SQL from a job in Data Lake Analytics


In this section, we will learn how to create a Data Lake Analytics job that will debug and run a U-SQL script. This job will summarize data from the file created by Task 1 in the preceding Data Factory pipeline (the task that imports SQL Server data into a blob file). The summary data will be copied to a new file on the blob storage.

With U-SQL, we can join different blob files and manipulate or summarize the data. We can also import data from different data sources. However, in this section, we will only provide a very basic U-SQL script as an example.
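As a sketch of what such a summarization script could look like: the storage account (myblobstore), container (data), file names, and columns below are assumptions for illustration, not the exact names used in the pipeline:

    // Read the file that the Data Factory copy task produced on blob storage.
    // The wasb:// path, container, account, and columns are placeholders.
    @orders =
        EXTRACT Customer string,
                Amount decimal
        FROM "wasb://data@myblobstore.blob.core.windows.net/orders.csv"
        USING Extractors.Csv(skipFirstNRows: 1);

    // Summarize: total order amount per customer.
    @summary =
        SELECT Customer,
               SUM(Amount) AS TotalAmount
        FROM @orders
        GROUP BY Customer;

    // Write the summary to a new file in the same blob container.
    OUTPUT @summary
    TO "wasb://data@myblobstore.blob.core.windows.net/orders-summary.csv"
    USING Outputters.Csv(outputHeader: true);

Note that the blob storage account must first be registered as a data source of the Data Lake Analytics account, which is exactly what the following steps do.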

Let's get started...

First, open the Data Lake Analytics resource from the dashboard. We need to add the Blob Storage account here; open Data sources:

Click on Add data source:

Fill in the details:

You should see the added blob storage in the list:

You can explore the containers and files in the blob storage from Data Lake Analytics | Data explorer:

Click on Data explorer:

In order to get the path...

Summary


In this chapter, we saw the components of Azure Data Lake and a basic implementation of those components.
