Search icon
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletters
Free Learning
Arrow right icon
Pentaho 3.2 Data Integration: Beginner's Guide

You're reading from  Pentaho 3.2 Data Integration: Beginner's Guide

Product type Book
Published in Apr 2010
Publisher Packt
ISBN-13 9781847199546
Pages 492 pages
Edition 1st Edition
Languages

Table of Contents (27) Chapters

Pentaho 3.2 Data Integration Beginner's Guide
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
1. Getting Started with Pentaho Data Integration 2. Getting Started with Transformations 3. Basic Data Manipulation 4. Controlling the Flow of Data 5. Transforming Your Data with JavaScript Code and the JavaScript Step 6. Transforming the Row Set 7. Validating Data and Handling Errors 8. Working with Databases 9. Performing Advanced Operations with Databases 10. Creating Basic Task Flows 11. Creating Advanced Transformations and Jobs 12. Developing and Implementing a Simple Datamart 13. Taking it Further Working with Repositories Pan and Kitchen: Launching Transformations and Jobs from the Command Line Quick Reference: Steps and Job Entries Spoon Shortcuts Introducing PDI 4 Features Pop Quiz Answers Index

Chapter 3. Basic Data Manipulation

In the previous chapter, you learned how to get data into PDI. Now you're ready to begin transforming that data. This chapter explains the simplest and most used ways of transforming data. We will cover the following:

  • Executing basic operations

  • Filtering and sorting of data

  • Looking up data outside the main stream of data

By the end of this chapter, you will be able to do simple but meaningful transformations on different types of data.

Basic calculations


You already know how to create a transformation and read data from an external source. Now, taking that data as a starting point, you will begin to do basic calculations.

Time for action – reviewing examinations by using the Calculator step


Can you recollect the exercise about examinations you did in the previous chapter? You created an incremental file with examination results. The final file looked like the following:

---------------------------------------------------------
Annual Language Examinations
Testing writing, reading, speaking and listening skills
---------------------------------------------------------
student_code;name;writing;reading;speaking;listening;file_processed;process_date
80711-85;William Miller; 81;83;80;90;C:\pdi_files\input\first_turn.txt;28-05-2009
20362-34;Jennifer Martin; 87;76;70;80;C:\pdi_files\input\first_turn.txt;28-05-2009
75283-17;Margaret Wilson; 99;94;90;80;C:\pdi_files\input\first_turn.txt;28-05-2009
83714-28;Helen Thomas; 89;97;80;80;C:\pdi_files\input\first_turn.txt;28-05-2009
61666-55;Maria Thomas; 88;77;70;80;C:\pdi_files\input\first_turn.txt;28-05-2009
...

Now you want to convert all grades in the scale 0-100 to...

Time for action – reviewing examinations by using the Formula step


In this tutorial you will redo the previous exercise, but this time you will do the calculations with the Formula step.

  1. Open the transformation you just finished.

  2. Delete from the transformation the Calculator step, and put in its place a Formula step. You will find it under the Scripting category of steps.

  3. Add a field named writing.

  4. When you click the cell under the Formula column, a window appears to edit the formula for the new field.

  5. In the upper area of the window, type [writing]/20. You will notice that the sentence is red if it is incomplete or the syntax is incorrect. In that case, the error is shown below the editing area, like in the following example:

  6. As soon as the formula is complete and correct, the red color disappears.

  7. Click OK.

  8. The formula you typed will be displayed in the cell you clicked.

  9. Set Number as the type for the new field, and type writing in the Replace value column.

  10. Add three more fields to the grid in the...

Calculations on groups of rows


You just learned to do simple operations for every row of a dataset. Now you are ready to go beyond. Suppose you have a list of daily temperatures of a given country over a year. You may want to know the overall average temperature, the average temperature by region, or the coldest day of the year. When you work with data, these types of calculations are a common requirement. In this section you will learn to address those requirements with PDI.

Time for action – calculating World Cup statistics by grouping data


Let's forget the examinations for a while, and retake the World Cup tutorial from the previous chapter. The file you obtained from that tutorial was a list of results of football matches. These are sample rows of the final file:

Match Date;Home Team;Away Team;Result
02/06;Italy;France;2-1
02/06;Argentina;Hungary;2-1
06/06;Italy;Hungary;3-1
06/06;Argentina;France;2-1
10/06;France;Hungary;3-1
10/06;Italy;Argentina;1-0
...

Now you want to take that information to obtain some statistics such as the maximum number of goals per match in a given day. To do it, follow these instructions:

  1. Create a new transformation, give it a name and description, and save it.

  2. By using a Text file input step, read the wcup_first_round.txt file you generated in Chapter 2. Give the name and location of the file, check the Content tab to see that everything matches your file, and fill the Fields tab.

  3. Do a preview just to confirm that the step is well configured...

Filtering


Until now you learned how to accomplish several kinds of calculations that enriched the set of data. There is still another kind of operation that is frequently used, and does not have to do with enriching the data but with discarding data. It is filtering unwanted data. Now you will learn how to discard rows under given conditions.

Time for action – counting frequent words by filtering


Let's suppose, you have some plain text files, and you want to know what is said in them. You don't want to read them, so you decide to count the times that words appear in the text, and see the most frequent ones to get an idea of what the files are about.

Note

Before starting, you'll need at least one text file to play with. The text file used in this tutorial is named smcng10.txt and is available for you to download from the Packt website.

Let's work:

  1. Create a new transformation.

  2. By using a Text file input step, read your file. The trick here is to put as a separator a sign you are not expecting in the file, for example |. By doing so, the entire line would be recognized as a single field. Configure the Fields tab by defining a single string field named line.

  3. From the Transform category of step, drag to the canvas a Split field to rows step, and create a hop from Text file input step to this new step.

  4. Configure the step like this:

  5. With...

Looking up data


Until now, you have been working with a single stream of data. When you did calculations or created conditions to compare fields, you only involved fields of your stream. Usually, this is not enough, and you need data from other sources. In this section you will learn to look up data outside your stream.

Time for action – finding out which language people speak


An International Musical Contest will take place and 24 countries will participate, each presenting a duet. Your task is to hire interpreters so the contestants can communicate in their native language. In order to do that, you need to find out the language they speak:

  1. Create a new transformation.

  2. By using a Get Data From XML step, read the countries.xml file that contains information about countries that you used in Chapter 2.

    Tip

    To avoid configuring the step again, you can open the transformation that reads this file, copy the Get data from XML step, and paste it here.

  3. Drag a Filter rows step to the canvas.

  4. Create a hop from the Get data from XML step to the Filter rows step.

  5. Edit the Filter rows step and create the condition- isofficial= T.

  6. Click the Filter rows step and do a preview. The list of previewed rows will show the countries along with the official languages:

    Now let's create the main flow of data:

  7. From the book website download...

Summary


This chapter covered the simplest and most common ways of transforming data. Specifically, it covered how to:

  • Use different transformation steps to calculate new fields

  • Use the Calculator and the Formula steps

  • Filter and sort data

  • Calculate statistics on groups of rows

  • Look up data

After learning basic manipulation of data, you may now create more complex transformations, where the streams begin to split and merge. That is the core subject of the next chapter.

lock icon The rest of the chapter is locked
You have been reading a chapter from
Pentaho 3.2 Data Integration: Beginner's Guide
Published in: Apr 2010 Publisher: Packt ISBN-13: 9781847199546
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €14.99/month. Cancel anytime}