Chapter 9. On-Premises and Azure Big Data Integration

This chapter will cover the following recipes:

  • Azure Blob storage data management
  • Installing a Hortonworks cluster
  • Copying data to an on-premises cluster
  • Using Hive – creating a database
  • Transforming the data with Hive
  • Transferring data between Hadoop and Azure
  • Leveraging an HDInsight big data cluster
  • Managing data with Pig Latin
  • Importing Azure Blob storage data

Introduction


Data warehouse architects are facing the need to integrate many types of data. Cloud data integration can be a real challenge for on-premises data warehouses for the following reasons:

  • The data sources are obviously not stored on-premises and the data stores differ a lot from what ETL tools such as SSIS are usually made for. As we saw earlier, the out-of-the-box SSIS toolbox has sources, destinations, and transformation tools that deal with on-premises data only.
  • The data transformation toolset in the cloud is quite different from the on-premises one. In the cloud, we don't necessarily use SSIS to transform data. There are specific data transformation languages, such as Hive and Pig, that cloud developers use. The reason for this is that the volume of data may be huge, and these languages run on clusters, as opposed to SSIS, which runs on a single machine.

While there are many cloud-based solutions on the market, the recipes in this chapter will talk about the Microsoft Azure...

Azure Blob storage data management


This recipe will cover the following topics:

  • Creating a Blob storage in Azure
  • Using SSIS to connect to a Blob storage in Azure
  • Using SSIS to upload and download files
  • Using SSIS to loop through the files using a Foreach Loop task

Getting ready

This recipe assumes that you have a Microsoft Azure account. You can always create a trial account by registering at https://azure.microsoft.com.

How to do it...

  1. In the Azure portal, create a new storage account and name it ssiscookbook.
  2. Add a new package in the ETL.Staging project and call it AggregatedSalesFromCloudDW.
  3. Right-click in the Connection Manager pane and select New file connection from the contextual menu that appears.
  4. The Add SSIS Connection Manager window appears. Select Azure Storage and click on the Add... button.

  5. Fill in the Storage account name textbox, as shown in the following screenshot:

  6. Rename the connection manager to cmgr_AzureStorage_ssiscookbook.
  7. Right-click on the newly created connection manager and select...

Installing a Hortonworks cluster


In the previous recipe, we created and managed files using an Azure Blob storage. This recipe will do similar actions but this time using an on-premises Hadoop cluster.

Getting ready

This recipe assumes that you can download and install a virtual machine on your PC.

How to do it...

  1. You will need to download and install a Hortonworks sandbox for this recipe. Go to https://hortonworks.com/downloads/ to download a Docker version of the sandbox. You can choose the sandbox you want, as shown in the following screenshot:

  2. Download the VM you want; in our case, we used the last one, DOWNLOAD FOR DOCKER. Once done, follow the instructions to configure it and make sure you have added the following entry to the %systemroot%\system32\drivers\etc\hosts file:
127.0.0.1 sandbox.hortonworks.com 

This is shown in the following screenshot:

  3. Open your browser and navigate to http://sandbox.hortonworks.com:8888. Your browser screen should look like the following screenshot:

  4. Click...

Copying data to an on-premises cluster


In this recipe, we'll add a package that will copy local data to the local cluster.

Getting ready

This recipe assumes that you have access to an on-premises cluster and have created a folder in it to hold the files, as described in the previous recipe.

How to do it...

  1. In the Solution Explorer, open (expand) the ETL.DW project and right-click on it to add a new package. Name it FactOrdersToHDPCluster.dtsx.
  2. Go to the Parameters tab and add a new parameter:
    • Name: LoadExecutionId
    • Data type: Int64
    • Value: Leave the default value 0
    • Sensitive: Leave the default value False
    • Required: True
  3. Add a data flow task to the control flow and name it dft_FactOrders.
  4. In the data flow task, drag and drop an OLE DB source. Name it ole_src_DW_vwFactOrders.
  5. Double-click on it to open the OLE DB source editor.
  6. Set the OLE DB connection manager to cmgr_DW.
  7. Set the data access mode to SQL command.
  8. Set the SQL command text to the following:
SELECT        OrderDate, FirstName, LastName, CompanyName, Category...
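As an illustration only, a query against such a view might look like the following sketch; the view name [dbo].[vwFactOrders], the OrderAmount column, and anything beyond the columns listed above are assumptions, not the book's actual (truncated) query:

-- Hypothetical sketch; [dbo].[vwFactOrders] and OrderAmount are assumed names
SELECT OrderDate, FirstName, LastName, CompanyName, Category, OrderAmount
FROM [dbo].[vwFactOrders];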

Using Hive – creating a database


Hive is one of the languages used in Hadoop to interact with large volumes of data. It is very easy to learn since it uses SQL-like commands. This recipe will show you how we can use Hive to transform data from our source. Although we have only 542 lines of data in our file, we can still use it to learn how to call Hadoop services.

In this recipe, we're going to create a database in Hive.

Getting ready

This recipe assumes that you have access to a Hortonworks sandbox on-premises or in Azure. It is also assumed that you have executed the previous recipe.

How to do it...

  1. Open Ambari by navigating to http://Sandbox.Hortonworks.com:8080 in your browser. Use raj_ops for both the username and password to log in.
  2. Click on the more icon (nine-squares button near raj_ops) in the toolbar and select Hive View 2.0, as shown in the following screenshot:

  3. Type create database SSISCookBook in Worksheet1 and click on Execute, as shown in the following screenshot (a script version of this statement is sketched after these steps):

  4. Refresh your browser and click on Browse...
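The create database statement from step 3 can also be run as a HiveQL script. A minimal sketch, assuming you want the script to be re-runnable:

-- Create the database used by the following recipes; IF NOT EXISTS makes the script re-runnable
CREATE DATABASE IF NOT EXISTS SSISCookBook;
-- Verify that the database is now listed
SHOW DATABASES;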

Transforming the data with Hive


The data is now in HDFS on the cluster. We'll now transform it using a SQL script. The program we're using is Hive, which interacts with the data using SQL-like statements.

With most Hadoop programs (Hive, Pig, Spark, and so on), the source is read-only. This means that we cannot modify the data in the file that we transferred in the previous recipe. Some tools, such as HBase, do allow us to modify the source data, though. But for our purposes, we'll use Hive, a well-known program in the Hadoop ecosystem.

Getting ready

This recipe assumes that you have access to a Hortonworks cluster and that you have transferred data to it following the previous recipe.

How to do it...

  1. If not already done, open the package created in the previous recipe, FactOrdersToHDPCluster.dtsx.
  2. Add a Hadoop Hive task and rename it hht_HDPDWHiveTable.
  3. Double-click on it to open the Hadoop Hive Task Editor, as shown in the following screenshot:

Update the following parameters:

HadoopConnection: cmgr_Hadoop_Sandbox...
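The Hive script itself is not shown in this excerpt. As a hedged sketch of the kind of HiveQL such a task typically runs, the following creates an external table over a pipe-delimited file and aggregates it into a new table; the table names, columns, delimiter, and HDFS path are all assumptions, not the book's actual script:

-- Assumed external table over the file transferred in the previous recipe
CREATE EXTERNAL TABLE IF NOT EXISTS SSISCookBook.FactOrders (
    OrderDate   STRING,
    FirstName   STRING,
    LastName    STRING,
    CompanyName STRING,
    Category    STRING,
    OrderAmount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'   -- assumed delimiter
LOCATION '/Import/';                            -- assumed HDFS folder

-- Aggregate into a table that later recipes can export
CREATE TABLE SSISCookBook.FactOrdersAggregated AS
SELECT OrderDate, CompanyName, Category, SUM(OrderAmount) AS TotalAmount
FROM SSISCookBook.FactOrders
GROUP BY OrderDate, CompanyName, Category;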

Transferring data between Hadoop and Azure


Now that we have some data created by Hadoop Hive on-premises, we're going to transfer this data to a cloud storage on Azure. Then, we'll do several transformations to it using Hadoop Pig Latin. Once done, we'll transfer the data to an on-premises table in the staging schema of our AdventureWorksLTDW2016 database.

In this recipe, we're going to copy the data processed by the local Hortonworks cluster to an Azure Blob storage. Once the data is copied over, we can transform it using Azure compute resources, as we'll see in the following recipes.

Getting ready

This recipe assumes that you have created a storage space in Azure as described in the previous recipe.

How to do it...

  1. Open the ETL.Staging SSIS project and add a new package to it. Rename it StgAggregateSalesFromCloud.dtsx.
  2. Add a Hadoop connection manager called cmgr_Hadoop_Sandbox like we did in the previous recipe.
  3. Add another connection manager, which will connect to the Azure storage like the...

Leveraging an HDInsight big data cluster


So far, we've managed Blob data using SSIS. In this case, the data was at rest and SSIS was used to manipulate it; in Azure parlance, SSIS was the orchestration service. As stated in the introduction, SSIS can only be used on-premises and, so far, on a single machine.

The goal of this recipe is to use Azure HDInsight computation services. These services allow us to use (rent) powerful resources in the form of a cluster of machines. These machines can run Linux or Windows, depending on the user's choice, but be aware that Windows is being deprecated for the newest versions of HDInsight. Such clusters, as fast and powerful as they may be, are very expensive to use. This is quite normal, in fact; we're talking about a potentially large amount of hardware here.

For this reason, unless we want to keep these computing resources running continuously, SSIS has a way to create and drop a cluster on demand. The following recipe will show you how to do it.

Getting ready

You...

Managing data with Pig Latin


Pig Latin is the scripting language of Pig, one of the programs available in big data clusters. The purpose of this program is to run scripts that can accept any type of data; "Pig can eat everything," as its creators' mantra states.

This recipe is just meant to show you how to call a simple Pig script; no transformations are done. The purpose of the script is simply to show how we can use an Azure Pig task with SSIS.

Getting ready

This recipe assumes that you have created an HDInsight cluster successfully.

How to do it...

  1. In the StgAggregatedSales.dtsx SSIS package, drag and drop an Azure Pig Task onto the control flow. Rename it apt_AggregateData.
  2. Double-click on it to open the Azure HDInsight Pig Task Editor and set the properties as shown in the following screenshot:

  3. In the script property, insert the following code:
-- Load the aggregated sales extract from the blob storage Import folder
SalesExtractsSource = LOAD 'wasbs:///Import/FactOrdersAggregated.txt';
-- Remove any previous output so the STORE statement does not fail on rerun
rmf wasbs:///Export/;
-- Write the data back out as pipe-delimited files under the Export folder
STORE SalesExtractsSource INTO 'wasbs:///Export/' USING PigStorage('|');
  4. The first...

Importing Azure Blob storage data


So far, we've created and dropped an HDInsight cluster and called a Pig script using the Azure Pig task. This recipe will demonstrate how to import data from Azure Blob storage into a table in the staging schema.

Getting ready

This recipe assumes that you have completed the previous one.

How to do it...

  1. From the SSIS toolbox, drag and drop an Execute SQL Task onto the control flow, and rename it sql_truncate_Staging_StgCloudSales.
  2. Double-click on it to open the Execute SQL Task Editor. Set the properties as follows and click on OK (a sketch of the staging table referenced here follows these steps):
    • Connection: cmgr_DW
    • SQL Statement: TRUNCATE TABLE [Staging].[StgCloudSales];
  3. From the SSIS toolbox, drag a Foreach Loop Container and rename it felc_StgCloudSales.
  4. Double-click on it to open the Foreach Loop Editor, and assign the properties in the Collection pane, as shown in the following screenshot:

  5. Now go to the Variable Mappings pane and add a string variable called User::AzureAggregatedData. Make sure the scope is at the package level...
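A minimal T-SQL sketch of the staging table that this truncate-and-load pattern targets; the column list is an assumption, since the actual [Staging].[StgCloudSales] definition is not shown here:

-- Assumed definition; the real column list may differ
CREATE TABLE [Staging].[StgCloudSales] (
    OrderDate   DATE           NULL,
    CompanyName NVARCHAR(100)  NULL,
    Category    NVARCHAR(50)   NULL,
    TotalAmount DECIMAL(18, 2) NULL
);

-- Issued by sql_truncate_Staging_StgCloudSales before each load
TRUNCATE TABLE [Staging].[StgCloudSales];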