
You're reading from Pentaho 3.2 Data Integration: Beginner's Guide

Product type Book

Published in Apr 2010

Publisher Packt

ISBN-13 9781847199546

Pages 492 pages

Edition 1st Edition

Languages

Java

Concepts

Business Intelligence

Table of Contents (27) Chapters

Pentaho 3.2 Data Integration Beginner's Guide

Credits

Foreword

The Kettle Project

About the Author

About the Reviewers

Preface

1. Getting Started with Pentaho Data Integration

2. Getting Started with Transformations

3. Basic Data Manipulation

4. Controlling the Flow of Data

5. Transforming Your Data with JavaScript Code and the JavaScript Step

6. Transforming the Row Set

7. Validating Data and Handling Errors

8. Working with Databases

9. Performing Advanced Operations with Databases

10. Creating Basic Task Flows

11. Creating Advanced Transformations and Jobs

12. Developing and Implementing a Simple Datamart

13. Taking it Further

Working with Repositories

Pan and Kitchen: Launching Transformations and Jobs from the Command Line

Quick Reference: Steps and Job Entries

Spoon Shortcuts

Introducing PDI 4 Features

Pop Quiz Answers

Index

Chapter 2. Getting Started with Transformations

In the previous chapter you used the graphical designer Spoon to create your first transformation: Hello world. Now you will start creating your own transformations to explore data from the real world. Data is everywhere; in particular you will find data in files. Product lists, logs, survey results, and statistical information are just a sample of the different kinds of information usually stored in files. In this chapter you will create transformations to get data from files, and also to send data back to files. This in turn will allow you to learn the basic PDI terminology related to data.

Reading data from files

Despite being the most primitive format used to store data, files are widely used and come in several flavors, such as fixed-width, comma-separated values, spreadsheet, or even free-format files. PDI can read data from all of these types of files; in this first tutorial let's see how to use PDI to get data from text files.

Time for action – reading results of football matches from files

Suppose you have collected several football statistics in plain files. Your files look like this:

Group|Date|Home Team |Results|Away Team|Notes
Group 1|02/June|Italy|2-1|France|
Group 1|02/June|Argentina|2-1|Hungary
Group 1|06/June|Italy|3-1|Hungary
Group 1|06/June|Argentina|2-1|France
Group 1|10/June|France|3-1|Hungary
Group 1|10/June|Italy|1-0|Argentina
-------------------------------------------
World Cup 78
Group 1

You have not just one file but many, all with the same structure. You now want to unify all the information in a single file. Let's begin by reading the files.

Create the folder named pdi_files. Inside it, create the input and output subfolders.
By using any text editor, type the file shown and save it under the name group1.txt in the folder named input, which you just created. You can also download the file from Packt's official website.
Start Spoon.
From the main menu select File | New Transformation.
Expand the...

Time for action – reading all your files at a time using a single Text file input step

To read all your files, follow these steps:

Open the transformation, double-click the input step, and add the other files in the same way you added the first.
Click the Preview rows button. The preview shows the rows of all the files you added.

What just happened?

You read several files at once. By listing the names of all the input files in the grid, you got the content of every specified file, one after the other.
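Outside Spoon, the same idea can be sketched in a few lines of Python. This is only an illustration of what "reading several files one after the other" means, not how the Text file input step is implemented. The function names are made up for this example, and the rule used to drop the footer lines (fewer than five fields) is an assumption based on the sample file shown earlier.

```python
import csv
from pathlib import Path

# Field names taken from the header of the sample file (stray spaces trimmed).
FIELDS = ["Group", "Date", "Home Team", "Results", "Away Team", "Notes"]


def parse_matches(text):
    """Parse the text of one pipe-delimited results file into dict rows."""
    rows = []
    reader = csv.reader(text.splitlines(), delimiter="|")
    next(reader, None)  # skip the header line
    for record in reader:
        # Assumption: footer lines (dashes, title, group name) and blank
        # lines have fewer than 5 fields, so they are not match rows.
        if len(record) < 5:
            continue
        # Trim spaces and pad a missing Notes field with an empty string.
        record = [value.strip() for value in record] + [""]
        rows.append(dict(zip(FIELDS, record)))
    return rows


def read_all(paths):
    """Read every file in `paths`, one after the other, into one row list."""
    rows = []
    for path in paths:
        rows.extend(parse_matches(Path(path).read_text(encoding="utf-8")))
    return rows


SAMPLE = """Group|Date|Home Team |Results|Away Team|Notes
Group 1|02/June|Italy|2-1|France|
Group 1|02/June|Argentina|2-1|Hungary
-------------------------------------------
World Cup 78
Group 1
"""

sample_rows = parse_matches(SAMPLE)
for row in sample_rows:
    print(row["Home Team"], row["Results"], row["Away Team"])
```

Calling `read_all` with a list of paths returns the rows of the first file followed by the rows of the second, and so on, which mirrors the order in which the names appear in the grid.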

Time for action – reading all your files at a time using a single Text file input step and regular expressions

You could do the same thing you did above by using a different notation. Follow these instructions:

Open the transformation and open the configuration window of the input step.
Delete the lines with the names of the files.
In the first row of the grid, type C:\pdi_files\input\ under the File/Directory column, and group[1-4]\.txt under the Wildcard (Reg.Exp.) column.
Click the Show filename(s)... button. You'll see the list of files that match the expression.
Close the tiny window and click Preview rows to confirm that the rows shown belong to the four files that match the expression you typed.

What just happened?

In this particular case, all filenames follow a pattern—group1.txt, group2.txt, and so on. In order to specify the names of the files, you used a regular expression. In the column File/Directory you put the static part of the names, while in the Wildcard (Reg.Exp.) column...
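To get a feel for what the expression group[1-4]\.txt selects, here is a small Python sketch. It assumes that the wildcard has to match the whole file name, and it uses Python's `re` module, whose syntax agrees with Java regular expressions for a simple pattern like this one but is not identical in general. The list of file names is invented for the example.

```python
import re


def matching_files(names, wildcard):
    """Return the names that the regular expression matches completely."""
    pattern = re.compile(wildcard)
    return sorted(name for name in names if pattern.fullmatch(name))


# Hypothetical directory listing used only for illustration.
names = [
    "group1.txt", "group2.txt", "group3.txt", "group4.txt",
    "group5.txt", "group1.txt.bak", "group1xtxt", "notes.txt",
]

selected = matching_files(names, r"group[1-4]\.txt")
print(selected)

# Without the backslash, the dot matches any character, not just a period.
loose = matching_files(names, r"group[1-4].txt")
print(loose)
```

The second call shows why the dot is escaped: the unescaped version also accepts a name such as group1xtxt.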

Sending data to files

Now you know how to bring data into Kettle. You didn't bring the data in just to preview it; you probably want to transform it and then send it to a final destination, such as another plain file. Let's learn how to do this last task.

Time for action – sending the results of matches to a plain file

In the previous tutorial, you read all your "results of matches" files. Now you want to send the data coming from all files to a single output file.

Create a new transformation.
Drag a Text file input step to the canvas and configure it just as you did in the previous tutorial.
Drag a Select values step to the canvas and create a hop from the Text file input step to the Select values step.
Double-click the Select values step.
Click the Get fields to select button.
Modify the fields as follows:
Expand the Output branch of the steps tree.
Drag the Text file output icon to the canvas.
Create a hop from the Select values step to the Text file output step.
Double-click the Text file output step and give it a name.
In the file name type: C:/pdi_files/output/wcup_first_round.
Tip
Note that the path contains forward slashes. If your system is Windows, you may use back or forward slashes. PDI will recognize both notations.
In the Content tab, leave...
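As a rough mental model of this small pipeline, the following Python sketch selects, reorders, and renames fields and then writes the rows to a delimited text stream. It is not the implementation of the Select values or Text file output steps. The field selection used here is hypothetical, because the settings from the original grid are not reproduced in this text, and the function names are invented for the example.

```python
import csv
import io


def select_values(rows, selection):
    """Keep, reorder, and rename fields.

    `selection` maps each original field name to its output name,
    in the order the fields should appear in the output.
    """
    return [{new: row[old] for old, new in selection.items()} for row in rows]


def write_text_file(rows, stream, separator="|"):
    """Write dict rows to `stream` as delimited text with a header line."""
    rows = list(rows)
    if not rows:
        return
    writer = csv.DictWriter(
        stream,
        fieldnames=list(rows[0].keys()),
        delimiter=separator,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


# One input row, shaped like the rows read from the results files.
rows = [{
    "Group": "Group 1", "Date": "02/June", "Home Team": "Italy",
    "Results": "2-1", "Away Team": "France", "Notes": "",
}]

# Hypothetical selection: drop Group and Notes, rename the other fields.
selection = {
    "Date": "Match Date",
    "Home Team": "Home",
    "Results": "Results",
    "Away Team": "Away",
}

buffer = io.StringIO()
write_text_file(select_values(rows, selection), buffer)
print(buffer.getvalue(), end="")
```

Writing to an in-memory buffer keeps the example self-contained; passing an open file object instead would produce a file on disk.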

Getting system information

Until now, you have learned how to read data from known files, and send data back to files. What if you don't know beforehand the name of the file to process? There are several ways to handle this with Kettle. Let's learn the simplest.

Time for action – updating a file with news about examinations

Imagine you are responsible for collecting the results of an annual examination given at a language school. The examination evaluates writing, reading, speaking, and listening skills. Every professor gives the exam to the students, the students take the examination, and the professors grade each skill on a scale of 0-100 and write the results in a text file like the following:

student_code;name;writing;reading;speaking;listening
80711-85;William Miller;81;83;80;90
20362-34;Jennifer Martin;87;76;70;80
75283-17;Margaret Wilson;99;94;90;80
83714-28;Helen Thomas;89;97;80;80
61666-55;Maria Thomas;88;77;70;80

All the files follow that pattern.

When a professor has the file ready, he/she sends it to you, and you have to integrate the results into a global list. Let's do it with Kettle.

Before starting, be sure to have a file ready to read. Type it or download the sample files from Packt's official website.
Create...

Time for action – running the examination transformation from a terminal window

Before executing the transformation from a terminal window, make sure that you have a new examination file to process, for example exam3.txt. Then follow these instructions:

Open a terminal window and go to the directory where Kettle is installed.
- On Windows systems type:
```
	C:\pdi-ce>pan.bat /file:c:\pdi_labs\examinations.ktr c:\pdi_files\input\exam3.txt
```
- On Unix, Linux, and other Unix-based systems type:
```
/home/yourself/pdi-ce/pan.sh /file:/home/yourself/pdi_labs/examinations.ktr /home/yourself/pdi_files/input/exam3.txt
```
- If your transformation is in another folder, modify the command accordingly.
You will see how the transformation runs, showing you the log in the terminal.
Check the output file. The contents of exam3.txt should be at the end of the file.

What just happened?

You executed a transformation with Pan, the program that runs transformations from terminal windows. As part of the command, you specified the name of...
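The pattern at work here, taking the name of the input file from the command line and appending its rows to a global file, can be sketched in Python as follows. This is an analogy, not what Pan or the transformation does internally. The output file name examination_global.txt, the function names, and the idea of stamping each row with a processing date are assumptions made for the example.

```python
import csv
from datetime import date


def append_exam_file(input_path, output_path, processed_on=None):
    """Append the data rows of one examination file to the global file.

    Each appended row gets one extra field with the processing date.
    Returns the number of rows appended.
    """
    processed_on = processed_on or date.today().isoformat()
    with open(input_path, newline="", encoding="utf-8") as source:
        records = list(csv.reader(source, delimiter=";"))
    data_rows = [row for row in records[1:] if row]  # drop header and blanks
    with open(output_path, "a", newline="", encoding="utf-8") as target:
        writer = csv.writer(target, delimiter=";", lineterminator="\n")
        for row in data_rows:
            writer.writerow(row + [processed_on])
    return len(data_rows)


def main(argv, output_path="examination_global.txt"):
    """Use the first command-line argument as the file to process."""
    if len(argv) < 2:
        print("usage: append_exam.py <examination file>")
        return 2
    count = append_exam_file(argv[1], output_path)
    print(f"appended {count} rows from {argv[1]}")
    return 0
```

In a script, `main(sys.argv)` would be called at the end of the file, so that a command such as `python append_exam.py exam3.txt` processes exam3.txt. Because the output file is opened in append mode, running it again with another file adds the new rows at the end.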

XML files

Even if you're not a system developer, you must have heard about XML files. XML files or documents are not only used to store data, but also to exchange data between heterogeneous systems over the Internet. PDI has many features that enable you to manipulate XML files. In this section you will learn to get data from those files.

Time for action – getting data from an XML file with information about countries

In this tutorial you will build an Excel file with basic information about countries. The source will be an XML file that you can download from the Packt website.

If you work under Windows, open the kettle.properties file located in the C:/Documents and Settings/yourself/.kettle folder and add the following line:
```
LABSOUTPUT=c:/pdi_files/output
```
On the other hand, if you work under Linux (or similar), open the kettle.properties file located in the /home/yourself/.kettle folder and add the following line:
```
LABSOUTPUT=/home/yourself/pdi_files/output
```
Make sure that the directory specified in kettle.properties exists.
Save the file.
Restart Spoon.
Create a new transformation.
Give the transformation a name and save it in the same directory where you keep all your other transformations.
From the Packt website, download the resources folder containing a file named countries.xml. Save the folder in your working directory. For example...
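A variable defined in kettle.properties, such as LABSOUTPUT, can then be referenced in Kettle with the ${LABSOUTPUT} notation. The following Python sketch illustrates the substitution idea only: it reads simple NAME=value lines and replaces ${NAME} references in a string. It ignores the finer points of the Java properties format, and the output file name countries.xls is hypothetical.

```python
import re


def load_properties(text):
    """Parse simple NAME=value lines, skipping blanks and # comments."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            properties[key.strip()] = value.strip()
    return properties


def substitute(value, variables):
    """Replace each ${NAME} with its value; leave unknown names untouched."""
    return re.sub(
        r"\$\{([^}]+)\}",
        lambda match: variables.get(match.group(1), match.group(0)),
        value,
    )


# Same line as the one added to kettle.properties on Linux.
properties = load_properties(
    "# sample kettle.properties content\n"
    "LABSOUTPUT=/home/yourself/pdi_files/output\n"
)

# Hypothetical output file name, used only to show the substitution.
resolved = substitute("${LABSOUTPUT}/countries.xls", properties)
print(resolved)
```

Keeping the directory in one variable means that a transformation can be moved to another machine by editing a single line in kettle.properties instead of every step that uses the path.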

Summary

In this chapter you learned how to get data from files and put data back into files. Specifically, you learned how to:

Get data from plain files and also from XML files
Put data into text files and Excel files
Get information from the operating system such as command-line arguments and system date

We also discussed the following:

The main PDI terminology related to data, for example datasets, data types, and streams
The Select values step, a commonly used step for selecting, reordering, removing and changing data
How and when to use Kettle variables
How to run transformations from a terminal with the Pan command

Now that you know how to get data into a transformation, you are ready to start manipulating data. This is going to happen in the next chapter.
