Home

Data

SAP Data Services 4.x Cookbook

Book + AI Assistant

eBook + AI Assistant $51.99 $35.99

Print $65.99

Subscription $15.99 $10 p/m for three months

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

Gain access to our AI Assistant (beta) for an exclusive selection of 500 books, available during your subscription period. Enjoy a personalized, interactive, and narrative experience to engage with the book content on a deeper level.

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Along with your eBook purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Along with your Print book purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook + AI Assistant $51.99 $35.99

Print $65.99

Subscription $15.99 $10 p/m for three months

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Along with your eBook purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Along with your Print book purchase, enjoy AI Assistant (beta) access in our online reader for a personalized, interactive reading experience.

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

Want to cost effectively deliver trusted information to all of your crucial business functions? SAP Data Services delivers one enterprise-class solution for data integration, data quality, data profiling, and text data processing. It boosts productivity with a single solution for data quality and data integration. SAP Data Services also enables you to move, improve, govern, and unlock big data. This book will lead you through the SAP Data Services environment to efficiently develop ETL processes. To begin with, you’ll learn to install, configure, and prepare the ETL development environment. You will get familiarized with the concepts of developing ETL processes with SAP Data Services. Starting from smallest unit of work- the data flow, the chapters will lead you to the highest organizational unit—the Data Services job, revealing the advanced techniques of ETL design. You will learn to import XML files by creating and implementing real-time jobs. It will then guide you through the ETL development patterns that enable the most effective performance when extracting, transforming, and loading data. You will also find out how to create validation functions and transforms. Finally, the book will show you the benefits of data quality management with the help of another SAP solution—Information Steward.

Publication date:: December 2015
Publisher: Packt
Pages: 430
ISBN: 9781782176565
Download code from GitHub

Chapter 1. Introduction to ETL Development

In this chapter, we will cover:

Preparing a database environment
Creating a source system database
Defining and creating staging area structures
Creating a target data warehouse

Introduction

Simply put, Extract-Transform-Load (ETL) is an engine of any data warehouse. The nature of the ETL system is straightforward:

Extract data from operational databases/systems
Transform data according to the requirements of your data warehouse so that the different pieces of data can be used together
Apply data quality transformation methods in order to cleanse data and ensure that it is reliable before it gets loaded into a data warehouse
Load conformed data into a data warehouse so that end users can access it via reporting tools, using client applications directly, or with the help of SQL-based query tools

While your data warehouse delivery structures or data marts represent the frontend or, in other words, what users see when they access the data, the ETL system itself is a backbone backend solution that does all the work of moving data and getting it ready in time for users to use. Building the ETL system can be a really challenging task, and though it is not part of the data warehouse data structures, it is definitely the key factor in defining the success of the data warehouse solution as a whole. In the end, who wants to use a data warehouse where the data is unreliable, corrupted, or sometimes even missing? This is exactly what ETL is responsible for getting right.

The following data structure types most often used in ETL development to move data between sources and targets are flat files, XML datasets, and DBMS tables, both in normalized schemas and dimensional data models. When choosing an ETL solution, you might face two simple choices: building a handcoded ETL solution or using a commercial one.

The following are some advantages of a handcoded ETL solution:

A programming language allows you to build your own sophisticated transformations
You are more flexible in building the ETL architecture as you are not limited by the vendor's ETL abilities
Sometimes, it can be a cheap way of building a few simplistic ETL processes, whereas buying an ETL solution from a vendor can be overkill
You do not have to spend time learning the commercial ETL solution's architecture and functionality

Here are some advantages of a commercial ETL solution:

This is more often a simpler, faster, and cheaper development option as a variety of existing tools allow you to build a very sophisticated ETL architecture quickly
You do not have to be a professional programmer to use the tool
It automatically manages ETL metadata by collecting, storing, and presenting it to the ETL developer, which is another important aspect of any ETL solution
It has a huge range of additional ready-to-use functionality, from built-in schedulers to various connectors to existing systems, built-in data lineages, impact analysis reports, and many others

In the majority of DWH projects, the commercial ETL solution from a specific vendor, in spite of the higher immediate cost, eventually saves you a significant amount of money on the development and maintenance of ETL code.

SAP Data Services is an ETL solution provided by SAP and is part of the Enterprise Information Management product stack, which also includes SAP Information Steward; we will review this in one of the last chapters of this book.

Preparing a database environment

This recipe will lead you through the further steps of preparing the working environment, such as preparing a database environment to be utilized by ETL processes as a source and staging and targeting systems for the migrated and transformed data.

Getting ready

To start the ETL development, we need to think about three things: the system that we will source the data from, our staging area (for initial extracts and as a preliminary storage for data during subsequent transformation steps), and finally, the data warehouse itself, to which the data will be eventually delivered.

How to do it…

Throughout the book, we will use a 64-bit environment, so ensure that you download and install the 64-bit versions of software components. Perform the following steps:

Let's start by preparing our source system. For quick deployment, we will choose the Microsoft SQL Server 2012 Express database, which is available for download at http://www.microsoft.com/en-nz/download/details.aspx?id=29062.
Click on the Download button and select the SQLEXPRWT_x64_ENU.exe file in the list of files that are available for download. This package contains everything required for the installation and configuration of the database server: the SQL Server Express database engine and the SQL Server Management Studio tool.
After the download is complete, run the executable file and follow the instructions on the screen. The installation of SQL Server 2012 Express is extremely straightforward, and all options can be set to their default values. There is no need to create any default databases during or after the installation as we will do it a bit later.

How it works…

After you have completed the installation, you should be able to run the SQL Server Management Studio application and connect to your database engine using the settings provided during the installation process.

If you have done everything correctly, you should see the "green" state of your Database Engine connection in the Object Explorer window of SQL Server Management Studio, as shown in the following screenshot:

We need an "empty" installation of MS SQL Server 2012 Express because we will create all the databases we need manually in the next steps of this chapter. This database engine installation will host all our source, stage, and target relational data structures. This option allows us to easily build a test environment that is perfect for learning purposes in order to become familiar with ETL development using SAP Data Services.

In a real-life scenario, your source databases, staging area database, and DWH database/appliance will most likely reside on separate server hosts, and they may sometimes be from different vendors. So, the role of SAP Data Services is to link them together in order to migrate data from one system to another.

Creating a source system database

In this section, we will create our source database, which will play the role of an operational database that we will pull data from with the help of Data Services in order to transform the data and deliver it to a data warehouse.

How to do it…

Luckily for us, there are plenty of different flavors of ready-to-use databases on the Web nowadays. Let's pick one of the most popular ones: Adventure Works OLTP for SQL Server 2012, which is available for download on the CodePlex website. Perform the following steps:

Use the following link to see the list of the files available for download:
https://msftdbprodsamples.codeplex.com/releases/view/55330
Click on the AdventureWorks2012 Data File link, which should download the AdventureWorks2012_Data.mdf data file.
When the download is complete, copy the file into the C:\AdventureWorks\ directory (create it before copying if necessary).

The next step is to map this database file to our database engine, which will create our source database. To do this, perform the following steps:

Start SQL Server Management Studio.
Click on the New Query button, which will open a new session connection to a master database.

In the SQL Query window, type the following command and press F5 to execute it:

CREATE DATABASE AdventureWorks_OLTP ON 
(FILENAME = 'C:\AdventureWorks\AdventureWorks2012_Data.mdf') 
FOR ATTACH_REBUILD_LOG;

After a successful command execution and upon refreshing the database list (using F5), you should be able to see the AdventureWorks_OLTP database in the list of the available databases in the Object Explorer window of SQL Server Management Studio.
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

How it works…

In a typical scenario, every SQL Server database consists of two data files: a database file and a transaction log file. A database file contains actual data structures and data, while a transaction log file keeps the transactional changes applied to the data.

As we only downloaded the data file, we had to execute the CREATE DATABASE command with a special ATTACH_REBUILD_LOG clause, which automatically creates a missing transaction log file so that the database could be successfully deployed and opened.

Now, our source database is ready to be used by Data Services in order to access, browse, and extract data from it.

There's more…

There are different ways to deploy test databases. This mainly depends on which RDBMS system you use. Sometimes, you may find a package of SQL scripts that contains the commands required to create all the database structures and commands used to insert data into these structures. This option may be useful if you have problems with attaching the downloaded mdf data file to your database engine or, for example, if you find the SQL scripts created for SQL Server RDBMS but have to apply them to the Oracle DB. With slight modifications to the command, you can run them in order to create an Oracle database.

Explaining RDBMS technologies lies beyond the scope of this book. So, if you are looking for more information regarding how a specific RDBMS system works, refer to the official documentation.

What has to be said here is that from the perspective of using Data Services, it does not matter which source system or target systems you use. Data Services not only supports the majority of them, but it also creates its own representation of the source and target objects; this way, they all look the same to Data Services users and abide by the same rules within the Data Services environment. So, you really do not have to be a DBA or database developer to easily connect to any RDBMS from Data Services. All that is required is a knowledge of the SQL language to understand the principle of methods that Data Services uses when extracting and loading data or creating database objects for you.

Defining and creating staging area structures

In this recipe, we will talk about ETL data structures that will be used in this book. Staging structures are important storage areas where extracted data is kept before it gets transformed or stored between the transformation steps. The staging area in general can be used to create backup copies of data or to run analytical queries on the data in order to validate the transformations made or the extract processes. Staging data structures can be quite different, as you will see. Which one to use depends on the tasks you are trying to accomplish, your project requirements, and the architecture of the environment used.

How to do it…

The most popular data structures that could be used in the staging area are flat files and RDBMS tables.

Flat files

One of the perks of using Data Services against the handcoded ETL solution is that Data Services allows you to easily read from and write information to a flat file.

Create the C:\AW\ folder, which will be used throughout this book to store flat files.

Note

Inserting data into a flat file is faster than inserting data into an RDBMS table. So, during ETL development, flat files are often used to reach two goals simultaneously: creating a backup copy of the data snapshot and providing you with the storage location for your preliminary data before you apply the next set of transformation rules.

Another common use of flat files is the ability to exchange data between systems that cannot communicate with each other in any other way.

Lastly, it is very cost-effective to store flat files (OS disk storage space is cheaper than DB storage space).

The main disadvantage of the flat files storage method is that the modification of data in a flat file can sometimes be a real pain, not to mention that it is much slower than modifying data in a relational DB table.

RDBMS tables

These ETL data structures will be used more often than others to stage the data that is going through the ETL transformation process.

Let's create two separate databases for relational tables, which will play the role of the ETL staging area in our future examples:

Open SQL Server Management Studio.
Right-click on the Databases icon and select the New Database… option.
On the next screen, input ODS as the database name, and specify 100 MB as the initial size value of the database file and 10 MB as that of the transactional log file:
Repeat the last two steps to create another dataset called STAGE.

How it works…

Let's recap. The ETL staging area is a location to store the preliminary results of our ETL transformations and also a landing zone for the extracts from the source system.

Yes, Data Services allows you to extract data and perform all transformations in the memory before loading to the target system. However, as you will see in later chapters, the ETL process, which does everything in one "go", can be complex and difficult to maintain. Plus, if something goes wrong along the way, all the changes that the process has already performed will be lost and you may have to start the extraction/transformation process again. This obviously creates extra workload on a source system because you have to query it again in order to get the data. Finally, big does not mean effective. We will show you how splitting your ETL process into smaller pieces helps you to create a well-performing sequence of dataflow.

The ODS database will be used as a landing zone for the data coming from source systems. The structure of the tables here will be identical to the structure of the source system tables.

The STAGE database will hold the relational tables used to store data between the data transformation steps.

We will also store some data extracted from a source database in a flat file format to demonstrate the ability of Data Services to work with them and show the convenience of this data storage method in the ETL system.

Creating a target data warehouse

Finally, this is the time to create our target data warehouse system. The data warehouse structures and tables will be used by end users with the help of various reporting tools to make sense of the data and analyze it. As a result, it should help business users to make strategic decisions, which will hopefully lead to business growth.

We should not forget that the main purpose of a data warehouse, and hence that of our ETL system, is to serve business needs.

Getting ready

The data warehouse created in this recipe will be used as a target database populated by the ETL processes developed in SAP Data Services. This is where the data modified and cleansed by ETL processes will be inserted in the end. Plus, this is the database that will mainly be accessed by business users and reporting tools.

How to do it…

Perform the following steps:

AdventureWorks comes to the rescue again. Use another link to download the AdventureWorks data warehouse data file, which will be mapped in the same manner to our SQL Server Express database engine in order to create a local data warehouse for our own learning purposes. Go to the following URL and click on the AdventureWorksDW for SQL Server 2012 link:
https://msftdbprodsamples.codeplex.com/releases/view/105902
After you have successfully downloaded the AdventureWorksDW2012.zip file, unpack its contents into the same directory as the previous file:
C:\AdventureWorks\
There should be two files in the archive:
- AdventureWorksDW2012_Data.mdf—the database data file
- AdventureWorksDW2012_Log.ldf—the database transaction log file
Open SQL Server Management Studio and click on the New Query… button in the uppermost tool bar.

Enter and execute the following command in the SQL Query window:

CREATE DATABASE AdventureWorks_DWH ON 
(FILENAME = 'C:\AdventureWorks\AdventureWorksDW2012_Data.mdf'), (FILENAME = 'C:\AdventureWorks\AdventureWorksDW2012_Log.ldf') FOR ATTACH;

After a successful command execution, right-click on the Databases icon and choose the Refresh option in the opened menu list. This should refresh the contents of your object library, and you should see the following list of databases:
- ODS
- STAGE
- AdventureWorks_OLTP
- AdventureWorks_DWH

How it works…

Get yourself familiar with the tables of the created data warehouse. Throughout the whole book, you will be using them in order to insert, update, and delete data using Data Services.

There are also some diagrams available that could help you see the visual data warehouse structure. To get access to them, open SQL Server Management Studio, expand the Databases list in the Object Explorer window, then expand the AdventureWorks_DWH database object list, and finally open the Diagrams tree. Double-clicking on any diagram in the list opens a new window within Management Studio with the graphical presentation of tables, key columns, and links between the tables, which shows you the relationships between them.

There's more…

In the next recipe, we will have an overview of the knowledge resources that exist on the Web. We highly recommend that you get familiar with them in order to improve your data warehousing skills, learn about the data warehouse life cycle, and understand what makes a successful data warehouse project. In the meantime, feel free to open New Query in SQL Server Management Studio and start running the SELECT commands to explore the contents of the tables in your AdventureWorks_DWH database.

Note

The most important asset of any DWH architect or ETL developer is not the knowledge of a programming language or the available tools but the ability to understand the data that is, or will be, populating the data warehouse and the business needs and requirements for this data.

Great deal and good service

Still busy working through the Cookbook but it looks like it will be very useful for our Data Services project.