Chapter 6. Transforming the Row Set
So far, you have been working with simple datasets, that is, datasets where each row represents a different entity (for example, a student) and each column represents a different attribute of that entity (for example, the student name). There are occasions when your dataset doesn't follow such a simple format, and working with it as is may be complicated or even impossible. On other occasions your data simply does not have the structure you like or the structure you need.
Whatever your situation, you have to transform the dataset into an appropriate format, and the solution is not always about changing or adding fields, or about filtering or adding rows. Sometimes it has to do with twisting the whole dataset. In this chapter you will learn how to:
Convert rows to columns
Convert columns to rows
Operate on sets of rows
You will also be introduced to a core subject in data warehousing: Time dimensions.
Converting rows to columns
In most datasets each row belongs to a different element such as a different match or a different student. However, there are datasets where a single row doesn't completely describe one element. Take, for example, the real-estate
file from Chapter 5. Every house was described through several rows. A single row gave incomplete information about the house. The ideal situation would be one in which all the attributes for the house were in a single row. With PDI you can convert the data into this alternative format. You will learn how to do it in this section.
Time for action – enhancing a films file by converting rows to columns
In this tutorial we will work with a file that contains a list of all French movies ever made. Each movie is described through several rows. This is what the file looks like:
In order to process the information of the file, it would be better if the rows belonging to each movie were merged into a single row. Let's work on that.
Download the file from the Packt website.
Create a transformation and read the file with a Text file input step.
In the Content tab of the Text file input step, type : as the separator. Also uncheck the Header and the No empty rows options.
In the Fields tab enter two string fields—feature and description...
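Conceptually, what we are about to do with the Row denormalizer can be sketched in a few lines of Python. This is only an illustration, not part of the PDI transformation; it assumes a hypothetical layout where each line holds a feature:description pair and a blank row separates one movie from the next (which is why the No empty rows option was unchecked).

```python
# Illustrative sketch only: merge the per-movie attribute rows into
# one record per movie. The sample data below is invented; the real
# file layout is the one shown in the screenshot above.

def merge_movies(lines):
    movies, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:              # a blank row ends the current movie
            if current:
                movies.append(current)
                current = {}
            continue
        feature, _, description = line.partition(":")
        current[feature.strip()] = description.strip()
    if current:                   # flush the last movie
        movies.append(current)
    return movies

sample = [
    "Name:Breathless",
    "Year:1960",
    "",
    "Name:Amelie",
    "Year:2001",
]
print(merge_movies(sample))
# [{'Name': 'Breathless', 'Year': '1960'}, {'Name': 'Amelie', 'Year': '2001'}]
```

Each movie ends up as a single record with one entry per attribute, which is exactly the shape we want the rows of the output dataset to have.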
Time for action – calculating total scores by performance by country
Let's now work with the contest file from Chapter 5. You will need the output file of the Hero exercise from that chapter, Fill gaps in the contest file. If you don't have it, you can download it from the Packt website.
In this tutorial, we will calculate the total score for each performance by country.
Create a new transformation.
Read the file with a Text file input step and do a preview to see that the step is well configured. You should see this:
With a Select values step, keep only the following columns: Country, Performance, and totalScore.
With a Sort Rows step, sort the data by Country in ascending order.
After the Sort Rows step, put a Row denormalizer step.
Double-click this last step to configure it.
As the key field put Performance, and as the group field put Country.
Fill the target fields grid as shown:
Close the window.
With the Row denormalizer step selected, do a preview. You will see this:
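The logic the Row denormalizer applies here can be sketched in plain Python: group the rows by Country, and turn each distinct Performance value into a column holding its totalScore. The sample values below are invented for illustration; note that, as in PDI, the input is expected to be sorted by the group field.

```python
# Sketch of the Row denormalizer behavior for this tutorial:
# one output row per Country, one column per Performance value.
# Field names match the tutorial; the data values are made up.

def denormalize(rows, group="Country", key="Performance", value="totalScore"):
    out = {}
    for row in rows:
        out.setdefault(row[group], {group: row[group]})[row[key]] = row[value]
    return list(out.values())

rows = [
    {"Country": "France", "Performance": "duet", "totalScore": 85},
    {"Country": "France", "Performance": "solo", "totalScore": 90},
    {"Country": "Italy",  "Performance": "duet", "totalScore": 78},
]
print(denormalize(rows))
# [{'Country': 'France', 'duet': 85, 'solo': 90}, {'Country': 'Italy', 'duet': 78}]
```

Countries missing a given performance simply get no value for that column, just as PDI leaves the corresponding target field empty.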
Some datasets are nice to look at but complicated to process further. Take a look at the matches file we saw in Chapter 3:
Imagine you want to answer these questions:
How many teams played?
Which team converted most goals?
Which team won all matches it played?
The dataset is not prepared to answer those questions, at least not in an easy way. If you want to answer them simply, you will first have to normalize the data, that is, convert it to a suitable format before proceeding. Let's work on it.
Time for action – enhancing the matches file by normalizing the dataset
Now you will convert the matches file you generated in Chapter 2 to a format suitable for answering the proposed questions.
Search on your disk for the file you created in Chapter 2, or download it from the Packt website.
Create a new transformation and read the file by using a Text file input step.
With a Split Fields step, split the Result field in two: home_t_goals and away_t_goals. (Do you remember having done this in Chapter 3?)
From the Transform category of steps, drag a Row Normalizer step to the canvas.
Create a hop from the last step to this new one.
Double-click the Row Normalizer step to edit it and fill the window as follows:
With the Row Normalizer selected, do a preview. You should see this:
You read the matches file and converted the dataset to a new one where both the home team and the away team appear under a new column named team, together with another new column named goals holding...
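The normalization done by the Row Normalizer step can be sketched as follows: every match row is split into two rows, one for the home team and one for the away team, each carrying the team name and the goals it converted. The match data below is invented; the field names follow the tutorial.

```python
# Sketch of the Row Normalizer behavior for the matches file:
# each match produces two rows, one per team, with that team's goals.
# Team names below are made up for illustration.

def normalize(matches):
    rows = []
    for m in matches:
        rows.append({"team": m["home_team"], "goals": m["home_t_goals"]})
        rows.append({"team": m["away_team"], "goals": m["away_t_goals"]})
    return rows

matches = [
    {"home_team": "PSG",  "away_team": "Lyon",  "home_t_goals": 2, "away_t_goals": 1},
    {"home_team": "Lyon", "away_team": "Lille", "home_t_goals": 0, "away_t_goals": 0},
]
rows = normalize(matches)
print(len({r["team"] for r in rows}))   # how many teams played: 3
```

With the data in this shape, the proposed questions become simple aggregations: counting distinct values of team, or summing goals per team.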
Generating a custom time dimension dataset by using Kettle variables
Dimensions are sets of attributes useful for describing a business. A list of products along with their shape, color, or size is a typical example of a dimension. The time dimension is a special dimension used for describing a business in terms of when things happened. Just think of a time dimension as a list of dates along with attributes describing those dates. For example, given the date 05/08/2009, you know that it is a day in August, that it belongs to the third quarter, and that it is a Wednesday. These are some of the attributes of that date.
In the following tutorial you will create a transformation that generates the dataset for a time dimension. The dataset for a time dimension has one row for every date in a given range of dates and one column for each attribute of the date.
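Before building the transformation in PDI, the idea can be sketched in a few lines of Python: walk through every date in a range and compute some attributes for each one. The attribute set here (month name, quarter, weekday) just mirrors the examples above; a real time dimension usually carries many more columns.

```python
# Sketch: one row per date in the range, with a few date attributes.
# A real time dimension would add many more columns (year, week of
# year, holiday flags, and so on).
from datetime import date, timedelta

def time_dimension(start, end):
    rows, d = [], start
    while d <= end:
        rows.append({
            "date": d.isoformat(),
            "month": d.strftime("%B"),
            "quarter": (d.month - 1) // 3 + 1,
            "week_day": d.strftime("%A"),
        })
        d += timedelta(days=1)
    return rows

dim = time_dimension(date(2009, 8, 5), date(2009, 8, 7))
print(dim[0])
# {'date': '2009-08-05', 'month': 'August', 'quarter': 3, 'week_day': 'Wednesday'}
```

The PDI transformation in the next tutorial builds the same kind of dataset, but using steps and variables instead of code.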
Time for action – creating the time dimension dataset
In this tutorial we will create a simple dataset for a time dimension.
First we will create a stream with the days of the week:
Create a new transformation.
Press Ctrl+T to access the Transformation settings window.
Select the Parameters tab and fill it as shown in the next screenshot:
Expand the Job category of steps.
Drag a Get Variables step to the canvas, double-click the step, and fill the window as follows:
After the Get Variables step, add a Split Fields step and use it to split the field week_days into seven String fields named sun, mon, tue, wed, thu, fri, and sat. As Delimiter, set a comma (,).
Add one more Split Fields step and use it to split the field week_days_short into seven String fields named sun_sh, mon_sh, tue_sh, wed_sh, thu_sh, fri_sh, and sat_sh. As Delimiter, set a comma (,).
After this last step, add a Row Normalizer step.
Double-click the Row Normalizer step and fill it as follows:
Keep the Row Normalizer step selected...
Time for action – getting variables for setting the default starting date
Let's modify the transformation so that the starting date depends on a parameter.
Press Ctrl+T to open the transformation settings window.
Add a parameter named START_DATE with default value 01/12/1999.
Add a Get variables step between the Calculator step and the Filter rows step.
Edit the Get variables step and add a new field named start_date. Under Variable, write ${START_DATE}. As Type, select Date, and under Format, select or type dd/MM/yyyy.
Modify the filter step so that the condition is now: date>=start_date and date<=31/12/2020.
Modify the Select values step to remove the start_date field.
With the Select values step selected, do a preview. You will see this:
You added a starting date as a named parameter. Then you read that variable into a new field and used it to keep only the dates that are greater than or equal to its value.
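The filtering logic you just built can be sketched in Python: a start date parameter with the same default as in the transformation, parsed with the dd/MM/yyyy format, keeps only the dates in the wanted range. This is just an illustration of the logic, not part of the PDI transformation.

```python
# Sketch of the parameter-driven filter: START_DATE defaults to
# 01/12/1999 (as in the transformation) and is parsed with the
# dd/MM/yyyy format; only dates up to 31/12/2020 are kept.
from datetime import datetime

def keep_dates(dates, start_date="01/12/1999"):
    start = datetime.strptime(start_date, "%d/%m/%Y")
    end = datetime.strptime("31/12/2020", "%d/%m/%Y")
    return [d for d in dates if start <= d <= end]

dates = [datetime(1995, 1, 1), datetime(2000, 6, 15), datetime(2021, 1, 1)]
print(keep_dates(dates))
# keeps only the 2000-06-15 date
```

Passing a different start_date value plays the same role as overriding the named parameter when you run the transformation.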
Using the Get Variables step
As you just
saw, the Get Variables step allows you...
In this chapter, you learned to transform your dataset by applying two magical steps: Row Normalizer and Row denormalizer. These aren't the kind of steps you use every day, such as a Filter rows or a Select values step, but when you need to do the kind of task they achieve, you are really grateful that they exist. They do a complex task in quite a simple way. You also learned what a time dimension is and how to create a dataset for one.
So far, you've been learning to transform data. In the next chapter, you will set that kind of learning aside for a while. The chapter will be devoted to an essential subject when it comes to working in productive environments and dealing with real data—data validation and error handling.