Extract data from operational databases/systems
Transform data according to the requirements of your data warehouse so that the different pieces of data can be used together
Apply data quality transformation methods in order to cleanse data and ensure that it is reliable before it gets loaded into a data warehouse
Load conformed data into a data warehouse so that end users can access it via reporting tools, using client applications directly, or with the help of SQL-based query tools
While your data warehouse delivery structures or data marts represent the frontend or, in other words, what users see when they access the data, the ETL system itself is a backbone backend solution that does all the work of moving data and getting it ready in time for users to use. Building the ETL system can be a really challenging task, and though it is not part of the data warehouse data structures, it is definitely the key factor in defining the success of the data warehouse solution as a whole. In the end, who wants to use a data warehouse where the data is unreliable, corrupted, or sometimes even missing? This is exactly what ETL is responsible for getting right.
The following data structure types most often used in ETL development to move data between sources and targets are flat files, XML datasets, and DBMS tables, both in normalized schemas and dimensional data models. When choosing an ETL solution, you might face two simple choices: building a handcoded ETL solution or using a commercial one.
A programming language allows you to build your own sophisticated transformations
You are more flexible in building the ETL architecture as you are not limited by the vendor's ETL abilities
Sometimes, it can be a cheap way of building a few simplistic ETL processes, whereas buying an ETL solution from a vendor can be overkill
You do not have to spend time learning the commercial ETL solution's architecture and functionality
This is more often a simpler, faster, and cheaper development option as a variety of existing tools allow you to build a very sophisticated ETL architecture quickly
You do not have to be a professional programmer to use the tool
It automatically manages ETL metadata by collecting, storing, and presenting it to the ETL developer, which is another important aspect of any ETL solution
It has a huge range of additional ready-to-use functionality, from built-in schedulers to various connectors to existing systems, built-in data lineages, impact analysis reports, and many others
In the majority of DWH projects, the commercial ETL solution from a specific vendor, in spite of the higher immediate cost, eventually saves you a significant amount of money on the development and maintenance of ETL code.
SAP Data Services is an ETL solution provided by SAP and is part of the Enterprise Information Management product stack, which also includes SAP Information Steward; we will review this in one of the last chapters of this book.
This recipe will lead you through the further steps of preparing the working environment, such as preparing a database environment to be utilized by ETL processes as a source and staging and targeting systems for the migrated and transformed data.
To start the ETL development, we need to think about three things: the system that we will source the data from, our staging area (for initial extracts and as a preliminary storage for data during subsequent transformation steps), and finally, the data warehouse itself, to which the data will be eventually delivered.
Throughout the book, we will use a 64-bit environment, so ensure that you download and install the 64-bit versions of software components. Perform the following steps:
Let's start by preparing our source system. For quick deployment, we will choose the Microsoft SQL Server 2012 Express database, which is available for download at http://www.microsoft.com/en-nz/download/details.aspx?id=29062.
Click on the Download button and select the SQLEXPRWT_x64_ENU.exe file in the list of files that are available for download. This package contains everything required for the installation and configuration of the database server: the SQL Server Express database engine and the SQL Server Management Studio tool.
After the download is complete, run the executable file and follow the instructions on the screen. The installation of SQL Server 2012 Express is extremely straightforward, and all options can be set to their default values. There is no need to create any default databases during or after the installation as we will do it a bit later.
After you have completed the installation, you should be able to run the SQL Server Management Studio application and connect to your database engine using the settings provided during the installation process.
If you have done everything correctly, you should see the "green" state of your Database Engine connection in the Object Explorer window of SQL Server Management Studio, as shown in the following screenshot:
We need an "empty" installation of MS SQL Server 2012 Express because we will create all the databases we need manually in the next steps of this chapter. This database engine installation will host all our source, stage, and target relational data structures. This option allows us to easily build a test environment that is perfect for learning purposes in order to become familiar with ETL development using SAP Data Services.
In a real-life scenario, your source databases, staging area database, and DWH database/appliance will most likely reside on separate server hosts, and they may sometimes be from different vendors. So, the role of SAP Data Services is to link them together in order to migrate data from one system to another.
In this section, we will create our source database, which will play the role of an operational database that we will pull data from with the help of Data Services in order to transform the data and deliver it to a data warehouse.
Luckily for us, there are plenty of different flavors of ready-to-use databases on the Web nowadays. Let's pick one of the most popular ones: Adventure Works OLTP for SQL Server 2012, which is available for download on the CodePlex website. Perform the following steps:
Use the following link to see the list of the files available for download:
Click on the AdventureWorks2012 Data File link, which should download the
When the download is complete, copy the file into the
C:\AdventureWorks\directory (create it before copying if necessary).
The next step is to map this database file to our database engine, which will create our source database. To do this, perform the following steps:
Start SQL Server Management Studio.
Click on the New Query button, which will open a new session connection to a master database.
In the SQL Query window, type the following command and press F5 to execute it:
CREATE DATABASE AdventureWorks_OLTP ON (FILENAME = 'C:\AdventureWorks\AdventureWorks2012_Data.mdf') FOR ATTACH_REBUILD_LOG;
After a successful command execution and upon refreshing the database list (using F5), you should be able to see the
AdventureWorks_OLTPdatabase in the list of the available databases in the Object Explorer window of SQL Server Management Studio.
In a typical scenario, every SQL Server database consists of two data files: a database file and a transaction log file. A database file contains actual data structures and data, while a transaction log file keeps the transactional changes applied to the data.
As we only downloaded the data file, we had to execute the
CREATE DATABASE command with a special
ATTACH_REBUILD_LOG clause, which automatically creates a missing transaction log file so that the database could be successfully deployed and opened.
Now, our source database is ready to be used by Data Services in order to access, browse, and extract data from it.
There are different ways to deploy test databases. This mainly depends on which RDBMS system you use. Sometimes, you may find a package of SQL scripts that contains the commands required to create all the database structures and commands used to insert data into these structures. This option may be useful if you have problems with attaching the downloaded
mdf data file to your database engine or, for example, if you find the SQL scripts created for SQL Server RDBMS but have to apply them to the Oracle DB. With slight modifications to the command, you can run them in order to create an Oracle database.
Explaining RDBMS technologies lies beyond the scope of this book. So, if you are looking for more information regarding how a specific RDBMS system works, refer to the official documentation.
What has to be said here is that from the perspective of using Data Services, it does not matter which source system or target systems you use. Data Services not only supports the majority of them, but it also creates its own representation of the source and target objects; this way, they all look the same to Data Services users and abide by the same rules within the Data Services environment. So, you really do not have to be a DBA or database developer to easily connect to any RDBMS from Data Services. All that is required is a knowledge of the SQL language to understand the principle of methods that Data Services uses when extracting and loading data or creating database objects for you.
In this recipe, we will talk about ETL data structures that will be used in this book. Staging structures are important storage areas where extracted data is kept before it gets transformed or stored between the transformation steps. The staging area in general can be used to create backup copies of data or to run analytical queries on the data in order to validate the transformations made or the extract processes. Staging data structures can be quite different, as you will see. Which one to use depends on the tasks you are trying to accomplish, your project requirements, and the architecture of the environment used.
The most popular data structures that could be used in the staging area are flat files and RDBMS tables.
C:\AW\ folder, which will be used throughout this book to store flat files.
Inserting data into a flat file is faster than inserting data into an RDBMS table. So, during ETL development, flat files are often used to reach two goals simultaneously: creating a backup copy of the data snapshot and providing you with the storage location for your preliminary data before you apply the next set of transformation rules.
Another common use of flat files is the ability to exchange data between systems that cannot communicate with each other in any other way.
Lastly, it is very cost-effective to store flat files (OS disk storage space is cheaper than DB storage space).
The main disadvantage of the flat files storage method is that the modification of data in a flat file can sometimes be a real pain, not to mention that it is much slower than modifying data in a relational DB table.
Let's create two separate databases for relational tables, which will play the role of the ETL staging area in our future examples:
Open SQL Server Management Studio.
Right-click on the Databases icon and select the New Database… option.
Repeat the last two steps to create another dataset called
Let's recap. The ETL staging area is a location to store the preliminary results of our ETL transformations and also a landing zone for the extracts from the source system.
Yes, Data Services allows you to extract data and perform all transformations in the memory before loading to the target system. However, as you will see in later chapters, the ETL process, which does everything in one "go", can be complex and difficult to maintain. Plus, if something goes wrong along the way, all the changes that the process has already performed will be lost and you may have to start the extraction/transformation process again. This obviously creates extra workload on a source system because you have to query it again in order to get the data. Finally, big does not mean effective. We will show you how splitting your ETL process into smaller pieces helps you to create a well-performing sequence of dataflow.
ODS database will be used as a landing zone for the data coming from source systems. The structure of the tables here will be identical to the structure of the source system tables.
STAGE database will hold the relational tables used to store data between the data transformation steps.
We will also store some data extracted from a source database in a flat file format to demonstrate the ability of Data Services to work with them and show the convenience of this data storage method in the ETL system.
Finally, this is the time to create our target data warehouse system. The data warehouse structures and tables will be used by end users with the help of various reporting tools to make sense of the data and analyze it. As a result, it should help business users to make strategic decisions, which will hopefully lead to business growth.
We should not forget that the main purpose of a data warehouse, and hence that of our ETL system, is to serve business needs.
The data warehouse created in this recipe will be used as a target database populated by the ETL processes developed in SAP Data Services. This is where the data modified and cleansed by ETL processes will be inserted in the end. Plus, this is the database that will mainly be accessed by business users and reporting tools.
Perform the following steps:
AdventureWorkscomes to the rescue again. Use another link to download the
AdventureWorksdata warehouse data file, which will be mapped in the same manner to our SQL Server Express database engine in order to create a local data warehouse for our own learning purposes. Go to the following URL and click on the AdventureWorksDW for SQL Server 2012 link:
After you have successfully downloaded the
AdventureWorksDW2012.zipfile, unpack its contents into the same directory as the previous file:
There should be two files in the archive:
AdventureWorksDW2012_Data.mdf—the database data file
AdventureWorksDW2012_Log.ldf—the database transaction log file
Open SQL Server Management Studio and click on the New Query… button in the uppermost tool bar.
Enter and execute the following command in the SQL Query window:
CREATE DATABASE AdventureWorks_DWH ON (FILENAME = 'C:\AdventureWorks\AdventureWorksDW2012_Data.mdf'), (FILENAME = 'C:\AdventureWorks\AdventureWorksDW2012_Log.ldf') FOR ATTACH;
After a successful command execution, right-click on the Databases icon and choose the Refresh option in the opened menu list. This should refresh the contents of your object library, and you should see the following list of databases:
There are also some diagrams available that could help you see the visual data warehouse structure. To get access to them, open SQL Server Management Studio, expand the
Databases list in the Object Explorer window, then expand the
AdventureWorks_DWH database object list, and finally open the
Diagrams tree. Double-clicking on any diagram in the list opens a new window within Management Studio with the graphical presentation of tables, key columns, and links between the tables, which shows you the relationships between them.
In the next recipe, we will have an overview of the knowledge resources that exist on the Web. We highly recommend that you get familiar with them in order to improve your data warehousing skills, learn about the data warehouse life cycle, and understand what makes a successful data warehouse project. In the meantime, feel free to open New Query in SQL Server Management Studio and start running the
SELECT commands to explore the contents of the tables in your