Chapter 12. Developing and Implementing a Simple Datamart

In this chapter, you will develop a simple but complete process for loading a datamart, while reviewing all the concepts you learned throughout the book.

The chapter will cover the following:

  • Introduction to a sales datamart based on the Jigsaw puzzles database

  • Loading the dimensions of the sales datamart

  • Loading the fact table for the sales datamart

  • Automating what has been done

Exploring the sales datamart


In Chapter 9, you were introduced to star schemas. In short, a star schema consists of a central table known as the fact table, surrounded by dimension tables. While the fact table holds indicators of your business, such as sales in dollars, the dimensions hold descriptive information about the attributes of your business, such as time, customers, and products.
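
If it helps to see the structure in SQL terms, here is a minimal sketch of a star: one dimension table and one fact table. The table and column definitions below are illustrative only; they are not the exact definitions used in the book's material.

    -- Illustrative sketch only; these are not the book's exact table definitions.
    -- A dimension table: descriptive attributes plus a surrogate key.
    CREATE TABLE lk_customers_example (
      customer_tk   INT PRIMARY KEY,   -- surrogate key
      customer_name VARCHAR(60),
      city          VARCHAR(40),
      country       VARCHAR(40)
    );

    -- The fact table: one key column per dimension plus the measures.
    CREATE TABLE ft_sales_example (
      date_tk     INT,                 -- points to a time dimension
      customer_tk INT,                 -- points to lk_customers_example
      quantity    INT,                 -- measure: units sold
      amount      DECIMAL(10,2)        -- measure: sales in dollars
    );

Each row of the fact table points to one row in each dimension and carries the measures for that combination of attributes.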

A star that addresses a specific department's needs or that is built for use by a particular group of users is called a datamart. You can have datamarts focused on customer relationship management, inventory, human resources management, budget, and more. In this chapter, you will load a datamart focused on sales.

Sometimes the term datamart is confused with datawarehouse. However, datamarts and datawarehouses are not the same.

Note

The main difference between datamarts and datawarehouses is that datawarehouses address the needs of the whole organization, whereas a datamart addresses the needs of a particular department...

Loading the dimensions


As you saw, the sales star model consists of a fact table surrounded by dimension tables. In order to load the star, you first have to load the dimensions. You already learned how to load dimension tables; here you will load the dimensions for the sales star.

Time for action – loading dimensions for the sales datamart


In this tutorial, you will load each dimension of the sales datamart and then combine the individual loads in a single job. Before you start, check the following:

  • Check that the database engine is up and that both the js and the js_dw databases are accessible from PDI.

  • If your time dimension table, lk_time, has data, truncate it. You can do this with the Truncate table [lk_time] option in the database explorer, or with a plain SQL statement like the one below.
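
If you prefer plain SQL over the database explorer, a statement like the following, run against the js_dw database, has the same effect:

    -- Empty the time dimension before reloading it.
    TRUNCATE TABLE lk_time;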

Tip

You may reuse the js_dw database in which you have been loading data in previous chapters. There is no problem with that. However, creating a whole new database is preferred so that you can see how the entire process works.

The explanation focuses on the general process. For details on creating a transformation that loads a particular type of dimension, please refer to Chapter 9. You can also download the full material for this chapter, where the transformations and jobs are ready to browse and try.

  1. Create a new transformation...

Extending the sales datamart model


You may have, and usually do have, more than one fact table sharing some of the dimensions. Look at the following diagram:

It shows two stars sharing three dimensions: Regions, Manufacturers, and Time. The star model on the left is the sales star model you already know. The star model on the right doesn't have data for accessories, but it does have more detail for puzzles, such as the number of pieces they have or the category or theme they belong to. When you have more than one fact table sharing dimensions, as here, you have what is called a constellation.
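
In SQL terms, a constellation simply means that more than one fact table keeps a foreign key to the same physical dimension table. The following sketch shows the idea with a shared time dimension; as before, the names are illustrative and are not the book's exact definitions.

    -- Illustrative sketch only: two fact tables sharing one dimension.
    CREATE TABLE lk_time_example (
      date_tk  INT PRIMARY KEY,
      the_date DATE
    );

    CREATE TABLE ft_sales_sketch (
      date_tk INT,              -- references lk_time_example
      amount  DECIMAL(10,2)
    );

    CREATE TABLE ft_puzzles_sketch (
      date_tk  INT,             -- references the very same lk_time_example
      quantity INT
    );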

The following table summarizes the dimensions added to the datamart:

  Dimension          Description
  Pieces             Number of pieces of the puzzle, grouped in the following ranges: 0-25, 26-100, and so on
  Theme              Classification of the puzzle in any of the following categories: Fantasy, Castles, Landscapes, and so on
  Glows in the dark  Yes/No
  3D puzzle          Yes/No
  Wooden puzzle      Yes/No
  Panoramic puzzle   Yes/No
  Packaging          ...

Loading a fact table with aggregated data


Now that you have data in your dimensions, you are ready to load the sales fact table. In this section, you will learn how to do it.
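
In the tutorial that follows, you will load the fact table by looking up dimensions: the business keys coming from the operational database (for example, man_code) are translated into the keys stored in the dimension tables. In PDI you do this with lookup steps rather than with SQL, but expressed as a query the idea looks roughly like this; the dimension table and column names are assumptions used only for illustration:

    -- Conceptual sketch only; in the transformation this translation is done
    -- with lookup steps, not with a query. Dimension names are assumed.
    SELECT d.man_code,                 -- business key from the operational data
           m.man_tk                    -- key stored in the manufacturers dimension
    FROM   invoices_detail d
           JOIN lk_manufacturers_example m
             ON m.man_code = d.man_code;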

Time for action – loading the sales fact table by looking up dimensions


Let's load the sales fact table, ft_sales, with sales information for a given range of dates. Before doing this exercise, be sure that you have already loaded the dimensions. You did it in the previous tutorial.

Also check that the database engine is up and that both the js and the js_dw databases are accessible from PDI. If everything is in order, you are ready to start:

  1. Create a new transformation.

  2. Drag a Table input step to the canvas.

  3. Double-click the step. Select js as Connection—the connection to the operational database.

  4. In the SQL frame type the following query:

    SELECT i.inv_date
          ,d.man_code
          ,cu.city_id
          ,pr.pro_type      product_type
          ,b.buy_desc
          ,p.pay_desc
          ,sum(d.cant_prod) quantity
          ,sum(d.price)     amount
    FROM   invoices         i
          ,invoices_detail  d
          ,customers        cu
          ,buy_methods      b
          ,payment_methods  p
          ,products         pr
    WHERE i.invoice_number...

Getting facts and dimensions together


Loading the star involves loading both the dimensions and the fact table. So far, you have loaded the dimensions and the fact separately. In the following two tutorials, you will put it all together:

Time for action – loading the fact table using a range of dates obtained from the command line


Now you will get the range of dates from the command line and load the fact table using that range:

  1. Create a new transformation.

  2. With a Get system info step, get the first two arguments from the command line and name them date_from and date_to.

  3. By using a couple of steps, check that the arguments are not null, have the proper format (yyyy-mm-dd), and are valid dates.

  4. If something is wrong with the arguments, abort.

  5. If the arguments are valid, use a Set variables step to set two variables named DATE_FROM and DATE_TO.

  6. Save the transformation in the same folder you saved the transformation that loads the fact table.

  7. Test the transformation by providing valid and invalid arguments to see that it works as expected.

  8. Create a job and save it in the same folder you saved the job that loads the dimensions.

  9. Drag to the canvas a START and two transformation job entries, and link them one after the other.

  10. Use the first...

Time for action – loading the sales star


You already created a job for loading the dimensions and another job for loading the fact.

In this tutorial, you will put them together in a single main job:

  1. Create a new job in the same folder in which you saved those jobs. Name this job load_dm_sales.kjb.

  2. Drag to the canvas a START and two job entries, and link them one after the other.

  3. Use the first job entry to execute the job that loads the dimensions.

  4. Use the second Job entry to execute the job you just created for loading the fact table.

  5. Save the job. This is how it looks:

  6. Press F9 to run the job.

  7. As arguments, provide a new range of dates: 2009-09-01, 2009-09-30. Then press Launch.

  8. The dimensions will be loaded first, followed by the loading of the fact table.

  9. The Job metrics tab in the Execution results window shows you the whole process running:

  10. Exploring the database, you'll see the data updated once again. A quick way to check this from a SQL client is shown below:
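
The query below, run against js_dw, simply counts the rows in the fact table; the count should grow every time you load a new range of dates:

    -- Quick sanity check after running the job.
    SELECT COUNT(*) AS fact_rows
    FROM   ft_sales;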

What just happened?

You built a main job that loads the sales datamart. First, it loads the...

Getting rid of administrative tasks


The solution you built during the chapter loads both the dimensions and the fact table of a star model for a given range of dates. Now suppose that you want to keep your datamart always up to date. Would you sit in front of your computer every day and run the same job over and over again? You probably would, but you know that it wouldn't be a good idea. There are better ways to do this. Let's see how you can get rid of that task.

Time for action – automating the loading of the sales datamart


Suppose that every day you want to update your sales datamart by adding the information about the sales for the day before. Let's make some modifications to the jobs and transformations you built so that the job can run automatically.

In order to test the changes, you'll have to change your system date. Set the current date to 2009-10-02.

  1. Create a new transformation.

  2. Drag a Get system data step to the canvas and fill it in as shown here:

  3. With a Select values step, change the metadata of both fields: as Type, put String, and as Format, yyyy-MM-dd.

  4. Add a Set variables step and use the two fields to create two variables named START_DATE and END_DATE.

  5. Save the transformation in the same folder you saved the transformation that loads the fact.

  6. Modify the job that loads the fact so that, instead of executing the transformation that takes the range of dates from the command line, it executes this one. The job looks like this:

  7. Save it.

Now let's create...

Summary


In this chapter you created a set of jobs and transformations that loads a sales datamart. Specifically, you learned how to load a fact table and to embed that process into a bigger one—the process that loads a full datamart.

You also learned to automate PDI processes, which is useful for getting rid of tedious and repetitive manual tasks. In particular, you automated the loading of your sales datamart.

Beyond that, you must have found this chapter useful for reviewing all you learned since the first chapter. If you can't wait for more, read the next chapter. There you will find useful information for going further.

