
Chapter 6. Transforming the Row Set

So far, you have been working with simple datasets, that is, datasets where each row represented a different entity (for example, a student) and each column represented a different attribute of that entity (for example, the student name). There are occasions when your dataset doesn't resemble such a simple format, and working with it as is may be complicated or even impossible. On other occasions your data simply doesn't have the structure you like or the structure you need.

Whatever your situation, you have to transform the dataset into an appropriate format, and the solution is not always about changing or adding fields, or about filtering or adding rows. Sometimes it has to do with twisting the whole dataset. In this chapter you will learn how to:

  • Convert rows to columns

  • Convert columns to rows

  • Operate on sets of rows

You will also be introduced to a core subject in data warehousing: Time dimensions.

Converting rows to columns


In most datasets each row belongs to a different element such as a different match or a different student. However, there are datasets where a single row doesn't completely describe one element. Take, for example, the real-estate file from Chapter 5. Every house was described through several rows. A single row gave incomplete information about the house. The ideal situation would be one in which all the attributes for the house were in a single row. With PDI you can convert the data into this alternative format. You will learn how to do it in this section.
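If you prefer to see the idea in code, here is a tiny, hypothetical Python illustration of the difference between the two layouts. The field names and values are made up for illustration only; the real file and the PDI steps come in the tutorials below.

# Hypothetical data: one house described through several rows,
# each row holding a single attribute of the house.
rows = [
    {"house_id": 1, "feature": "price",    "value": "250000"},
    {"house_id": 1, "feature": "bedrooms", "value": "3"},
    {"house_id": 1, "feature": "city",     "value": "Denver"},
]

# The desired layout: all attributes of the house collapsed into one row.
house = {"house_id": 1}
for r in rows:
    house[r["feature"]] = r["value"]

print(house)
# {'house_id': 1, 'price': '250000', 'bedrooms': '3', 'city': 'Denver'}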

Time for action – enhancing a films file by converting rows to columns


In this tutorial we will work with a file that contains a list of all French movies ever made. Each movie is described through several rows. This is how the file looks:

...
Caché
Year: 2005
Director: Michael Haneke
Cast: Daniel Auteuil, Juliette Binoche, Maurice Bénichou

Jean de Florette
Year: 1986
Genre: Historical drama
Director: Claude Berri
Produced by: Pierre Grunstein
Cast: Yves Montand, Gérard Depardieu, Daniel Auteuil

Le Ballon rouge
Year: 1956
Genre: Fantasy | Comedy | Drama
...

In order to process the information of the file, it would be better if the rows belonging to each movie were merged into a single row. Let's work on that.

  1. Download the file from the Packt website.

  2. Create a transformation and read the file with a Text file input step.

  3. In the Content tab of the Text file input step, set : as the separator. Also uncheck the Header and No empty rows options.

  4. In the Fields tab enter two string fields—feature and description...
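To get a feeling for what the finished transformation produces, here is a rough Python sketch, not the PDI transformation itself, that merges the rows of the sample shown above into one record per movie. The file name french_films.txt is an assumption for illustration.

# Rough sketch only: collapse the rows belonging to one movie into a single
# record, the same idea the transformation implements with PDI steps.
movies, current = [], {}

with open("french_films.txt", encoding="utf-8") as f:   # hypothetical file name
    for line in f:
        line = line.strip()
        if not line:                       # a blank line ends the current movie
            if current:
                movies.append(current)
                current = {}
        elif ":" in line:                  # "feature: description" rows
            feature, description = line.split(":", 1)
            current[feature.strip()] = description.strip()
        else:                              # a line without ":" is the movie title
            current["Title"] = line

if current:                                # flush the last movie
    movies.append(current)

for m in movies:
    print(m)                               # one merged "row" per movie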

Time for action – calculating total scores by performances by country


Let's work now with the contest file from Chapter 5; you will need the output file for that chapter's Hero exercise, Fill gaps in the contest file. If you don't have it, you can download it from the Packt website.

In this tutorial, we will calculate the total score for each performance by country.

  1. Create a new transformation.

  2. Read the file with a Text file input step and do a preview to see that the step is well configured. You should see this:

  3. With a Select values step, keep only the following columns: Country, Performance, and totalScore.

  4. With a Sort Rows step, sort the data by Country in ascending order.

  5. After the Sort Rows step, put a Row denormalizer step.

  6. Double-click this last step to configure it.

  7. As the key field, put Performance, and as the group field, put Country.

  8. Fill the target fields grid as shown:

  9. Close the window.

  10. With the Row denormalizer step selected, do a preview. You will see this:

What just happened?

You read the...
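The effect of the Row denormalizer configured in steps 5 to 9 can be sketched in plain Python as follows. The country and performance values below are invented; only the column names Country, Performance, and totalScore come from the tutorial.

# Illustrative sketch: Country is the group field, Performance is the key
# field whose values become new columns, and totalScore fills those columns.
rows = [
    {"Country": "France", "Performance": "duet", "totalScore": 9.1},
    {"Country": "France", "Performance": "solo", "totalScore": 8.5},
    {"Country": "Italy",  "Performance": "duet", "totalScore": 8.9},
    {"Country": "Italy",  "Performance": "solo", "totalScore": 9.3},
]

denormalized = {}
for r in rows:
    group = denormalized.setdefault(r["Country"], {"Country": r["Country"]})
    group[r["Performance"]] = r["totalScore"]

for row in denormalized.values():
    print(row)
# {'Country': 'France', 'duet': 9.1, 'solo': 8.5}
# {'Country': 'Italy', 'duet': 8.9, 'solo': 9.3}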

Normalizing data


Some datasets are nice to look at but complicated to process further. Take a look at the matches file we saw in Chapter 3:

Match Date;Home Team;Away Team;Result
02/06;Italy;France;2-1
02/06;Argentina;Hungary;2-1
06/06;Italy;Hungary;3-1
06/06;Argentina;France;2-1
10/06;France;Hungary;3-1
10/06;Italy;Argentina;1-0
...

Imagine you want to answer these questions:

  1. How many teams played?

  2. Which team scored the most goals?

  3. Which team won all matches it played?

The dataset is not prepared to answer those questions, at least not in an easy way. If you want to answer them simply, you will first have to normalize the data, that is, convert it to a suitable format before proceeding. Let's work on it.

Time for action – enhancing the matches file by normalizing the dataset


Now you will convert the matches file you generated in Chapter 2 to a format suitable for answering the proposed questions.

  1. Search on your disk for the file you created in Chapter 2, or download it from the Packt website.

  2. Create a new transformation and read the file by using a Text file input step.

  3. With a Split Fields step, split the Result field into two fields: home_t_goals and away_t_goals. (Do you remember having done this in Chapter 3?)

  4. From the Transform category of steps, drag a Row Normalizer step to the canvas.

  5. Create a hop from the last step to this new one.

  6. Double-click the Row Normalizer step to edit it and fill the window as follows:

  7. With the Row Normalizer selected, do a preview. You should see this:

What just happened?

You read the matches file and converted the dataset to a new one where both the home team and the away team appeared under a new column named team, together with another new column named goals holding...
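As a plain-Python sketch of the same normalization (not the PDI job itself, and using only the first few matches from the sample above), you can see how the new layout makes the earlier questions trivial to answer:

# Split each match into two rows, one per team, with the goals that team scored.
matches = [
    {"date": "02/06", "home": "Italy",     "away": "France",  "result": "2-1"},
    {"date": "02/06", "home": "Argentina", "away": "Hungary", "result": "2-1"},
    {"date": "06/06", "home": "Italy",     "away": "Hungary", "result": "3-1"},
]

normalized = []
for m in matches:
    home_goals, away_goals = (int(g) for g in m["result"].split("-"))
    normalized.append({"date": m["date"], "team": m["home"], "goals": home_goals})
    normalized.append({"date": m["date"], "team": m["away"], "goals": away_goals})

# With one row per team and match, the questions become simple aggregations:
teams = {r["team"] for r in normalized}
goals_by_team = {}
for r in normalized:
    goals_by_team[r["team"]] = goals_by_team.get(r["team"], 0) + r["goals"]

print(len(teams), "teams played")
print("Most goals scored by:", max(goals_by_team, key=goals_by_team.get))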

Generating a custom time dimension dataset by using Kettle variables


Dimensions are sets of attributes useful for describing a business. A list of products along with their shape, color, or size is a typical example of a dimension. The time dimension is a special dimension used for describing a business in terms of when things happened. Just think of a time dimension as a list of dates along with attributes describing those dates. For example, given the date 05/08/2009, you know that it is a day in August, that it belongs to the third quarter, and that it is a Wednesday. These are some of the attributes for that date.

In the following tutorial you will create a transformation that generates the dataset for a time dimension. The dataset for a time dimension has one row for every date in a given range of dates and one column for each attribute of the date.
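To make the idea concrete, here is a minimal Python sketch of such a dataset, one row per date with a few descriptive attributes. The attribute names and the date range are assumptions for illustration, not the exact columns the transformation will build.

# Generate one row per date in a small, hypothetical range of dates.
from datetime import date, timedelta

start, end = date(2000, 1, 1), date(2000, 1, 5)     # hypothetical range

day = start
while day <= end:
    print({
        "date": day.strftime("%d/%m/%Y"),
        "year": day.year,
        "month": day.month,
        "quarter": (day.month - 1) // 3 + 1,
        "week_day": day.strftime("%A"),
    })
    day += timedelta(days=1)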

Time for action – creating the time dimension dataset


In this tutorial we will create a simple dataset for a time dimension.

First we will create a stream with the days of the week:

  1. Create a new transformation.

  2. Press Ctrl+T to access the Transformation settings window.

  3. Select the Parameters tab and fill it as shown in the next screenshot:

  4. Expand the Job category of steps.

  5. Drag a Get Variables step to the canvas, double-click the step, and fill the window as shown here:

  6. After the Get Variables step, add a Split Fields step and use it to split the field week_days into seven String fields named sun, mon, tue, wed, thu, fri, and sat. As Delimiter, set a comma (,).

  7. Add one more Split Fields step and use it to split the field week_days_short into seven String fields named sun_sh, mon_sh, tue_sh, wed_sh, thu_sh, fri_sh, and sat_sh. As Delimiter, set a comma (,).

  8. After this last step, add a Row Normalizer step.

  9. Double-click the Row Normalizer step and fill it as follows:

  10. Keep the Row Normalizer step selected...
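Although these steps are configured in Spoon, what they do to the week-day parameters can be sketched in a few lines of Python. The sample values below are assumptions about the parameter defaults.

# Get Variables reads the parameters; Split Fields breaks them into seven
# columns; Row Normalizer turns those columns back into one row per day.
week_days = "Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday"
week_days_short = "Sun,Mon,Tue,Wed,Thu,Fri,Sat"

long_names = week_days.split(",")          # Split Fields: sun, mon, ... sat
short_names = week_days_short.split(",")   # Split Fields: sun_sh, ... sat_sh

for position, (name, short) in enumerate(zip(long_names, short_names), start=1):
    print({"week_day": position, "name": name, "short_name": short})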

Time for action – getting variables for setting the default starting date


Let's modify the transformation so that the starting date depends on a parameter.

  1. Press Ctrl+T to open the Transformation settings window.

  2. Add a parameter named START_DATE with default value 01/12/1999.

  3. Add a Get Variables step between the Calculator step and the Filter rows step.

  4. Edit the Get Variables step and add a new field named start_date. Under Variable write ${START_DATE}. As Type select Date, and under Format select or type dd/MM/yyyy.

  5. Modify the filter step so the condition is now: date>=start_date and date<=31/12/2020.

  6. Modify the Select values step to remove the start_date field.

  7. With the Select values step selected, do a preview. You will see this:

What just happened?

You added a starting date as a named parameter. Then you read that variable into a new field and used it to keep only the dates that are greater than or equal to its value.
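The filter you built can be sketched outside PDI like this. The list of dates is made up; only the START_DATE default, the dd/MM/yyyy format, and the 31/12/2020 upper bound come from the steps above.

# Parse the parameter with the dd/MM/yyyy format and keep only the dates
# between it and the fixed upper bound, as the Filter rows step does.
from datetime import datetime

START_DATE = "01/12/1999"                         # the named parameter's default
start_date = datetime.strptime(START_DATE, "%d/%m/%Y")
upper_bound = datetime.strptime("31/12/2020", "%d/%m/%Y")

dates = [datetime(1999, 11, 30), datetime(2000, 1, 1), datetime(2021, 6, 1)]
kept = [d for d in dates if start_date <= d <= upper_bound]

print([d.strftime("%d/%m/%Y") for d in kept])     # only 01/01/2000 survives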

Using the Get Variables step

As you just saw, the Get Variables step allows you...

Summary


In this chapter, you learned to transform your dataset by applying two magical steps: Row Normalizer and Row denormalizer. These two steps aren't the kind of steps you use every day, such as the Filter rows or Select values steps, but when you need to do the kind of task they accomplish, you will be really grateful that they exist. They do a complex task in quite a simple way. You also learned what a time dimension is and how to create a dataset for a time dimension.

So far, you've been learning to transform data. In the next chapter, you will set that kind of learning aside for a while. The chapter will be devoted to an essential subject when it comes to working in production environments and dealing with real data: data validation and error handling.
