Chapter 9. On-Premises and Azure Big Data Integration

This chapter will cover the following recipes:

  • Azure Blob storage data management
  • Installing a Hortonworks cluster
  • Copying data to an on-premises cluster
  • Using Hive – creating a database
  • Transforming the data with Hive
  • Transferring data between Hadoop and Azure
  • Leveraging an HDInsight big data cluster
  • Managing data with Pig Latin
  • Importing Azure Blob storage data

Introduction


Data warehouse architects are facing the need to integrate many types of data. Cloud data integration can be a real challenge for on-premises data warehouses for the following reasons:

  • The data sources are obviously not stored on-premises and the data stores differ a lot from what ETL tools such as SSIS are usually made for. As we saw earlier, the out-of-the-box SSIS toolbox has sources, destinations, and transformation tools that deal with on-premises data only.
  • The data transformation toolset in the cloud is quite different from the on-premises one. In the cloud, we don't necessarily use SSIS to transform data. There are specific data transformation languages, such as Hive and Pig, that cloud developers use. The reason for this is that the volume of data may be huge, and these languages run on clusters, as opposed to SSIS, which runs on a single machine.

While there are many cloud-based solutions on the market, the recipes in this chapter will talk about the Microsoft Azure...

Azure Blob storage data management


This recipe will cover the following topics:

  • Creating a Blob storage in Azure
  • Using SSIS to connect to a Blob storage in Azure
  • Using SSIS to upload and download files
  • Using SSIS to loop through the files using a Foreach Loop task

Getting ready

This recipe assumes that you have a Microsoft Azure account. You can always create a trial account by registering at https://azure.microsoft.com.

How to do it...

  1. In the Azure portal, create a new storage account and name it ssiscookbook.
  2. Add a new package in the ETL.Staging project and call it AggregatedSalesFromCloudDW.
  3. Right-click in the Connection Manager pane and select New file connection from the contextual menu that appears.
  4. The Add SSIS Connection Manager window appears. Select Azure Storage and click on the Add... button.

  5. Fill in the Storage account name textbox, as shown in the following screenshot:

  6. Rename the connection manager to cmgr_AzureStorage_ssiscookbook.
  7. Right-click on the newly created connection manager and select...

Installing a Hortonworks cluster


In the previous recipe, we created and managed files using an Azure Blob storage. This recipe will do similar actions but this time using an on-premises Hadoop cluster.

Getting ready

This recipe assumes that you can download and install a virtual machine on your PC.

How to do it...

  1. You will need to download and install a Hortonworks sandbox for this recipe. Go to https://hortonworks.com/downloads/ to download a Docker version of the sandbox. You can choose the sandbox you want, as shown in the following screenshot:

  2. Download the VM you want; in our case, we used the last one, DOWNLOAD FOR DOCKER. Once done, follow the instructions to configure it and make sure you have added the following entry to the %systemroot%\system32\drivers\etc\hosts file:
127.0.0.1 sandbox.hortonworks.com 

This is shown in the following screenshot:

  3. Open your browser and navigate to http://sandbox.hortonworks.com:8888. Your browser screen should look like the following screenshot:

  4. Click...

Copying data to an on-premises cluster


In this recipe, we'll add a package that will copy local data to the local cluster.

Getting ready

This recipe assumes that you have access to an on-premises cluster and have created a folder in it to hold the files, as described in the previous recipe.

How to do it...

  1. In the Solution Explorer, open (expand) the ETL.DW project and right-click on it to add a new package. Name it FactOrdersToHDPCluster.dtsx.
  2. Go to the Parameters tab and add a new parameter:
    • Name: LoadExecutionId
    • Data type: Int64
    • Value: Leave the default value 0
    • Sensitive: Leave the default value False
    • Required: True
  3. Add a data flow task to the control flow and name it dft_FactOrders.
  4. In the data flow task, drag and drop an OLE DB source. Name it ole_src_DW_vwFactOrders.
  5. Double-click on it to open the OLE DB source editor.
  6. Set the OLE DB connection manager to cmgr_DW.
  7. Set the data access mode to SQL command.
  8. Set the SQL command text to the following:
SELECT        OrderDate, FirstName, LastName, CompanyName, Category...
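As an illustration only, a query against such a view might look like the following sketch; the view name [dbo].[vwFactOrders], the OrderAmount column, and anything beyond the columns listed above are assumptions, not the book's actual (truncated) query:

-- Hypothetical sketch; [dbo].[vwFactOrders] and OrderAmount are assumed names
SELECT OrderDate, FirstName, LastName, CompanyName, Category, OrderAmount
FROM [dbo].[vwFactOrders];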

Using Hive – creating a database


Hive is one of the languages used in Hadoop to interact with large volumes of data. It is very easy to learn since it uses SQL-like commands. This recipe will show you how we can use Hive to transform data from our source. Although we have only 542 lines of data in our file, we can still use it to learn how to call Hadoop services.

In this recipe, we're going to create a database in Hive.

Getting ready

This recipe assumes that you have access to a Hortonworks sandbox on-premises or in Azure. It is also assumed that you have executed the previous recipe.

How to do it...

  1. Open Ambari by navigating to http://Sandbox.Hortonworks.com:8080 in your browser. Use raj_ops for both the username and password to log in.
  2. Click on the more icon (nine-squares button near raj_ops) in the toolbar and select Hive View 2.0, as shown in the following screenshot:

  3. Type create database SSISCookBook in Worksheet1 and click on Execute, as shown in the following screenshot (a script version of this statement is sketched after these steps):

  4. Refresh your browser and click on Browse...
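The create database statement from step 3 can also be run as a HiveQL script. A minimal sketch, assuming you want the script to be re-runnable:

-- Create the database used by the following recipes; IF NOT EXISTS makes the script re-runnable
CREATE DATABASE IF NOT EXISTS SSISCookBook;
-- Verify that the database is now listed
SHOW DATABASES;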

Transforming the data with Hive


The data is now in HDFS on the cluster. We'll now transform it using a SQL script. The program we're using is Hive, which interacts with the data using SQL-like statements.

With most Hadoop programs (Hive, Pig, Spark, and so on), the source is read-only. This means that we cannot modify the data in the file that we transferred in the previous recipe. Some tools, such as HBase, do allow us to modify the source data, though. But for our purposes, we'll use Hive, a well-known program in the Hadoop ecosystem.

Getting ready

This recipe assumes that you have access to a Hortonworks cluster and that you have transferred data to it following the previous recipe.

How to do it...

  1. If not already done, open the package created in the previous recipe, FactOrdersToHDPCluster.dtsx.
  2. Add a Hadoop Hive task and rename it hht_HDPDWHiveTable.
  3. Double-click on it to open the Hadoop Hive Task Editor, as shown in the following screenshot:

Update the following parameters:

HadoopConnection: cmgr_Hadoop_Sandbox...
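The Hive script itself is not shown in this excerpt. As a hedged sketch of the kind of HiveQL such a task typically runs, the following creates an external table over a pipe-delimited file and aggregates it into a new table; the table names, columns, delimiter, and HDFS path are all assumptions, not the book's actual script:

-- Assumed external table over the file transferred in the previous recipe
CREATE EXTERNAL TABLE IF NOT EXISTS SSISCookBook.FactOrders (
    OrderDate   STRING,
    FirstName   STRING,
    LastName    STRING,
    CompanyName STRING,
    Category    STRING,
    OrderAmount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'   -- assumed delimiter
LOCATION '/Import/';                            -- assumed HDFS folder

-- Aggregate into a table that later recipes can export
CREATE TABLE SSISCookBook.FactOrdersAggregated AS
SELECT OrderDate, CompanyName, Category, SUM(OrderAmount) AS TotalAmount
FROM SSISCookBook.FactOrders
GROUP BY OrderDate, CompanyName, Category;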

Transferring data between Hadoop and Azure


Now that we have some data created by Hadoop Hive on-premises, we're going to transfer this data to a cloud storage on Azure. Then, we'll do several transformations to it using Hadoop Pig Latin. Once done, we'll transfer the data to an on-premises table in the staging schema of our AdventureWorksLTDW2016 database.

In this recipe, we're going to copy the data processed by the local Hortonworks cluster to an Azure Blob storage. Once the data is copied over, we can transform it using Azure compute resources, as we'll see in the following recipes.

Getting ready

This recipe assumes that you have created a storage space in Azure as described in the previous recipe.

How to do it...

  1. Open the ETL.Staging SSIS project and add a new package to it. Rename it StgAggregateSalesFromCloud.dtsx.
  2. Add a Hadoop connection manager called cmgr_Hadoop_Sandbox like we did in the previous recipe.
  3. Add another connection manager, which will connect to the Azure storage like the...

Leveraging an HDInsight big data cluster


So far, we've managed Blob data using SSIS. In this case, the data was at rest and SSIS was used to manipulate it; in Azure parlance, SSIS was the orchestration service. As stated in the introduction, SSIS can only be used on-premises and, so far, on a single machine.

The goal of this recipe is to use Azure HDInsight computation services. These services allow us to use (rent) powerful resources in the form of a cluster of machines. These machines can run Linux or Windows, depending on the user's choice, but be aware that Windows is being deprecated for the newest versions of HDInsight. Such clusters, as fast and powerful as they may be, are very expensive to use. This is quite normal, in fact; we're talking about a potentially large amount of hardware here.

For this reason, unless we want to keep these computing resources running continuously, SSIS has a way to create and drop a cluster on demand. The following recipe will show you how to do it.

Getting ready

You...

Managing data with Pig Latin


Pig Latin is the scripting language of Pig, one of the programs available in big data clusters. The purpose of this program is to run scripts that can accept any type of data; "Pig can eat everything," as its creators' mantra states.

This recipe is just meant to show you how to call a simple Pig script; no transformations are done. The purpose of the script is simply to show how we can use an Azure Pig task with SSIS.

Getting ready

This recipe assumes that you have created an HDInsight cluster successfully.

How to do it...

  1. In the StgAggregatedSales.dtsx SSIS package, drag and drop an Azure Pig Task onto the control flow. Rename it apt_AggregateData.
  2. Double-click on it to open the Azure HDInsight Pig Task Editor and set the properties as shown in the following screenshot:

  3. In the script property, insert the following code:
-- Load the aggregated sales extract from the blob storage Import folder
SalesExtractsSource = LOAD 'wasbs:///Import/FactOrdersAggregated.txt';
-- Remove any previous output so the STORE statement does not fail on rerun
rmf wasbs:///Export/;
-- Write the data back out as pipe-delimited files under the Export folder
STORE SalesExtractsSource INTO 'wasbs:///Export/' USING PigStorage('|');
  4. The first...

Importing Azure Blob storage data


So far, we've created and dropped an HDInsight cluster and called a Pig script using the Azure Pig task. This recipe will demonstrate how to import data from Azure Blob storage into a table in the staging schema.

Getting ready

This recipe assumes that you have completed the previous one.

How to do it...

  1. From the SSIS toolbox, drag and drop an Execute SQL Task onto the control flow, and rename it sql_truncate_Staging_StgCloudSales.
  2. Double-click on it to open the Execute SQL Task Editor. Set the properties as follows and click on OK (a sketch of the staging table referenced here follows these steps):
    • Connection: cmgr_DW
    • SQL Statement: TRUNCATE TABLE [Staging].[StgCloudSales];
  3. From the SSIS toolbox, drag a Foreach Loop Container and rename it felc_StgCloudSales.
  4. Double-click on it to open the Foreach Loop Editor, and assign the properties in the Collection pane, as shown in the following screenshot:

  5. Now go to the Variable Mappings pane and add a string variable called User::AzureAggregatedData. Make sure the scope is at the package level...
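A minimal T-SQL sketch of the staging table that this truncate-and-load pattern targets; the column list is an assumption, since the actual [Staging].[StgCloudSales] definition is not shown here:

-- Assumed definition; the real column list may differ
CREATE TABLE [Staging].[StgCloudSales] (
    OrderDate   DATE           NULL,
    CompanyName NVARCHAR(100)  NULL,
    Category    NVARCHAR(50)   NULL,
    TotalAmount DECIMAL(18, 2) NULL
);

-- Issued by sql_truncate_Staging_StgCloudSales before each load
TRUNCATE TABLE [Staging].[StgCloudSales];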