Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Pentaho 3.2 Data Integration: Beginner's Guide

You're reading from  Pentaho 3.2 Data Integration: Beginner's Guide

Product type Book
Published in Apr 2010
Publisher Packt
ISBN-13 9781847199546
Pages 492 pages
Edition 1st Edition
Languages

Table of Contents (27) Chapters

Pentaho 3.2 Data Integration Beginner's Guide
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
Getting Started with Pentaho Data Integration Getting Started with Transformations Basic Data Manipulation Controlling the Flow of Data Transforming Your Data with JavaScript Code and the JavaScript Step Transforming the Row Set Validating Data and Handling Errors Working with Databases Performing Advanced Operations with Databases Creating Basic Task Flows Creating Advanced Transformations and Jobs Developing and Implementing a Simple Datamart Taking it Further Working with Repositories Pan and Kitchen: Launching Transformations and Jobs from the Command Line Quick Reference: Steps and Job Entries Spoon Shortcuts Introducing PDI 4 Features Pop Quiz Answers Index

Chapter 9. Performing Advanced Operations with Databases

In this chapter you will learn about advanced operations with databases. The first part of the chapter includes:

  • Populating the Jigsaw puzzle database so that it is prepared for the rest of the activities in the chapter

  • Doing simple lookups in a database

  • Doing complex lookups

The second part of the chapter is fully devoted to datawarehouse-related concepts. The list of the topics that will be covered includes:

  • Introducing dimensional modeling

  • Loading dimensions

Preparing the environment


In order to learn the concepts of this chapter, a database with little or no data is useless. Therefore, the first thing you'll do is populating your Jigsaw puzzle database.

Time for action – populating the Jigsaw database


To load data massively into your Jigsaw database, you must have the Jigsaw database created and the MySQL server running. You already know how to do this. If not, please refer to Chapter 1 for the installation of MySQL and Chapter 8 for the creation of the Jigsaw database.

Note

This tutorial will overwrite all your data in the js database. If you don't want to overwrite the data in your js database, you could simply create a new database with a different name and run the js.sql script to create the tables in your new database.

After checking that everything is in order, follow these instructions:

  1. From Packt's website download the js_data.sql script file.

  2. Launch the MySQL query browser.

  3. From the File menu select Open Script....

  4. Locate the downloaded file and open it.

  5. At the beginning of the script file you will see this line:

    USE js;

    If you created a new database, replace the name js by the name of your new database.

  6. Click on the Execute button.

  7. At the...

Looking up data in a database


You already know how to create, update, and delete data from a database. It's now time to learn to look up data. Lookup is the act of searching for information in a database. You can look up a column of a single table or you can do more complex lookups. Let's begin with the simplest way of looking up.

Doing simple lookups

Sometimes you need to get information from a database table based on the data you have in your main stream. Let's see how you can do it.

Time for action – using a Database lookup step to create a list of products to buy


Suppose you have an online system for your customers to order products. On a daily basis, the system creates a file with the orders information. Now you will check if you have stock for the ordered products and make a list of the products you'll have to buy.

  1. Create a new transformation.

  2. From the Input category of steps, drag a Get data from XML step to the canvas.

  3. Use it to read the orders.xml file. In the Content tab, fill the Loop XPath option with the /orders/order string. In the Fields tab get the fields.

  4. Do a preview. You will see the following:

    Note

    To keep this exercise simple, the file contains a single product by order.

  5. Add a Sort rows step and use it to sort the data by man_code, prod_code.

  6. Add a Group by step and double-click it.

  7. Use the upper grid for grouping by man_code and prod_code.

  8. Use the lower grid for adding a field with the number of orders in each group. As Name write quantity, as Subject ordernumber...

Time for action – using a Database join step to create a list of suggested products to buy


If your customers ordered a product that is out of stock and you don't want to let them down, you will suggest them some alternative puzzles to buy.

  1. Open the transformation of the previous tutorial and save it under a new name.

  2. Delete the Text file output step.

  3. Double-click the Group by step and add an aggregated field named customers with the list of customers separated by (,). Under Subject, select idcus and as Type, select Concatenate strings separated by ,.

  4. Double-click the Database lookup step. In the Values to return from the lookup table grid, add pro_theme as value in the String field.

  5. Add a Select values step. Use it to select the fields customers, quantity, pro_theme, and pro_name. Also rename quantity as quantity_param and pro_theme as theme_param. From the Lookup category, drag a Database join step to the canvas. Create a hop from the Select values step to this step.

  6. Double-click the Database...

Introducing dimensional modeling


So far you have dealt with the Jigsaw puzzles database, a database used for daily operational work. In the real-world, a database like this is maintained by an On-Line Transaction Processing (OLTP) system. The users of an OLTP system perform operational tasks—sell products, process orders, control stock, and so on.

As a counterpart, a datawarehouse is a nonoperational database; it is a specialized database designed for decision support purposes. Users of a datawarehouse analyze the data, and they do it from different points of view.

The most used technique for delivering data to datawarehouse users is dimensional modeling. This technique makes databases simple and understandable.

The primary table in a dimensional model is the fact table. A fact table stores numerical measurements of the business such as quantity of products sold, amount represented by the sold products, discounts, taxes, number of invoices, number of claims, and anything that can be measured...

Loading dimensions with data


A dimension is an entity that describes your business—customers and products are examples of dimensions. A very special dimension is the time dimension that you already know. A dimension table (no surprises here) is a table that contains information about a dimension. In this section you will learn to load dimension tables, that is, fill dimension tables with data.

Time for action – loading a region dimension with a Combination lookup/update step


In this tutorial you will load a dimension that stores geographical information.

  1. Launch Spoon.

  2. Create a new transformation.

  3. Drag a Table input step to the canvas and double-click it.

  4. As connection select js.

  5. In the SQL area type the following query:

    SELECT ci.city_id, city_name, country_name
    FROM cities ci, countries co
    WHERE ci.cou_id = co.cou_id
  6. Click on OK.

  7. Expand the Data Warehouse category of steps.

  8. Select the Combination lookup/update step and drag it to the canvas.

  9. Create a hop from the Table input step to this new step.

  10. Double-click the Combination lookup/update step.

  11. As Connection select dw.

  12. As Target table browse and select lk_regions or simply type it.

  13. Enter id as Technical key field and lastupdate as Date of last update field.

  14. Click OK.

  15. After the Combination lookup/update step, add an Update step.

  16. Double-click the Update step.

  17. Select dw as Connection and lk_regions as Target table.

  18. Fill the upper grid adding the...

Time for action – testing the transformation that loads the region dimension


  1. In the previous tutorial you loaded a dimension that stores geographical information. You ran it once, causing the insertion of one record for each city and a special record with values n/a for the descriptive fields. Let's apply some changes in the operational database, and run the transformation again to see what happens.

  2. Launch MySQL Query Browser.

  3. Type the following sentence to change the names of the countries to upper case:

    UPDATE countries SET country_name = UCASE(country_name)
  4. Execute it.

  5. If the transformation created in the last tutorial is not open, open it again.

  6. Run the transformation.

  7. The Step Metrics looks like this:

  8. Explore the js_dw database again and do a preview of the lk_regions table. This time you will see the following:

What just happened?

After changing the letter case for the names of the countries in the transactional database js, you again ran the transformation that updates the Regions dimension...

Time for action – keeping a history of product changes with the Dimension lookup/update step


Let's load a puzzles dimension along with the history of the changes in puzzle attributes:

  1. Create a new transformation.

  2. Drag a Table input step to the work area and double-click it.

  3. Select js as Connection.

  4. Type the following query in the SQL area:

    SELECT pro_code
         , man_code
         , pro_name
         , pro_theme
    FROM  products
    WHERE pro_type LIKE 'PUZZLE'
  5. Click on OK.

  6. Add an Add constants step, and create a hop from the Table input, step toward it.

  7. Use the step to add a Date field named changedate. As Format type dd/MM/yyyy, and as Value, type 01/10/2009.

  8. Expand the Data Warehouse category of steps.

  9. Select the Dimension lookup/update step and drag it to the canvas.

  10. Create a hop from the Add constants step to this new step.

  11. Double-click the Dimension lookup/update step.

  12. As Connection select dw.

  13. As Target table type lk_puzzles.

  14. Fill the Key fields as shown:

  15. Select id as Technical key field.

  16. In the frame Creation...

Time for action – testing the transformation that keeps a historyof product changes


  1. In the previous tutorial you loaded a dimension with products by using a Dimension lookup/update step. You ran the transformation once, causing the insertion of one record for each product and a special record with values n/a for the descriptive fields. Let's apply some changes in the operational database, and run the transformation again to see how the Dimension lookup/update step keeps history.

  2. In MySQL Query Browser, open the script update_jumbo_products.sql and run it.

  3. Switch to Spoon.

  4. If the transformation created in the last tutorial is not open, open it again.

  5. Run the transformation. Explore the js_dw database again. Press Open SQL for [lk_puzzles] and type the following sentence:

    SELECT   *
    FROM     lk_puzzles
    WHERE    id_js_man = 'JUM' 
    ORDER BY id_js_prod
           , version
  6. You will see this:

What just happened?

After making some changes in the operational database, you ran the transformation for a second...

Summary


In this chapter you learned to perform some advanced operations on databases.

First, you populated the Jigsaw database in order to have data for the activities in the chapter. Then, you learned to do simple and complex searches in a database.

Then you were introduced to dimensional concepts and learned what dimensions are and how to load them with PDI. You learned about Type I, Type II, Type III SCDs and mini-dimensions. You still have to learn when and how to use those dimensions. You will do so in Chapter 12.

The steps you learned in this and the preceding chapter are far from being the full list of steps that PDI offers to work with databases. However, taking into account all you learned, you are now ready to use PDI for implementing most of your database requirements. In the next chapter, you will switch to a totally different yet core subject needed to work with PDI—jobs.

lock icon The rest of the chapter is locked
You have been reading a chapter from
Pentaho 3.2 Data Integration: Beginner's Guide
Published in: Apr 2010 Publisher: Packt ISBN-13: 9781847199546
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime}