Chapter 2. Getting Started with Transformations
In the previous chapter you used the graphical designer Spoon to create your first transformation: Hello world. Now you will start creating your own transformations to explore data from the real world. Data is everywhere; in particular you will find data in files. Product lists, logs, survey results, and statistical information are just a sample of the different kinds of information usually stored in files. In this chapter you will create transformations to get data from files, and also to send data back to files. This in turn will allow you to learn the basic PDI terminology related to data.
Despite being the most primitive format used to store data, files are widely used and come in several flavors, such as fixed-width, comma-separated values, spreadsheet, or even free-format files. PDI can read data from all of these types of files; in this first tutorial, let's see how to use PDI to get data from text files.
Time for action – reading results of football matches from files
Suppose you have collected several football statistics in plain files. Your files look like this:
You have not one file but many, all with the same structure. You now want to unify all the information in a single file. Let's begin by reading the files.
Create the folder named pdi_files. Inside it, create the input and output subfolders.
By using any text editor, type the file shown and save it under the name group1.txt in the folder named input, which you just created. You can also download the file from Packt's official website.
Start Spoon.
From the main menu select File | New Transformation.
Expand the...
Time for action – reading all your files at a time using a single Text file input step
To read all your files, follow these steps:
Open the transformation, double-click the input step, and add the other files in the same way you added the first.
After clicking the Preview rows button, you will see this:
You read several files at once. By putting the names of all the input files in the grid, you got the content of every specified file, one after the other.
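The idea of reading every listed file in order can be illustrated outside PDI. The following is a minimal Python sketch, not how the Text file input step is implemented; the function name read_files_in_order and the sample rows are hypothetical, and the two temporary files merely stand in for group1.txt and group2.txt.

```python
import tempfile
from pathlib import Path


def read_files_in_order(paths):
    """Return the lines of every file, one file after the other."""
    rows = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            rows.extend(line.rstrip("\n") for line in handle)
    return rows


# Hypothetical sample data standing in for the real input files.
with tempfile.TemporaryDirectory() as folder:
    first = Path(folder) / "group1.txt"
    second = Path(folder) / "group2.txt"
    first.write_text("row A1\nrow A2\n", encoding="utf-8")
    second.write_text("row B1\n", encoding="utf-8")
    all_rows = read_files_in_order([first, second])

print(all_rows)
```

The rows keep the order in which the files were listed, which mirrors what the preview shows when several files appear in the grid.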
Time for action – reading all your files at a time using a single Text file input step and regular expressions
You could do the same thing you did above by using a different notation. Follow these instructions:
Open the transformation and edit the configuration window of the input step.
Delete the lines with the names of the files.
In the first row of the grid, type C:\pdi_files\input\ under the File/Directory column, and group[1-4]\.txt under the Wildcard (Reg.Exp.) column.
Click the Show filename(s)... button. You'll see the list of files that match the expression.
Close the tiny window and click Preview rows to confirm that the rows shown belong to the four files that match the expression you typed.
In this particular case, all filenames follow a pattern: group1.txt, group2.txt, and so on. In order to specify the names of the files, you used a regular expression. In the column File/Directory you put the static part of the names, while in the Wildcard (Reg.Exp.) column...
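To see which names the expression group[1-4]\.txt accepts, you can try it in any regular expression engine. The sketch below uses Python's re module rather than PDI itself, and it assumes the expression has to match the whole filename; the candidate names are made up for illustration.

```python
import re

# The same expression typed in the Wildcard (Reg.Exp.) column.
pattern = re.compile(r"group[1-4]\.txt")

# Hypothetical filenames used only to exercise the expression.
candidates = [
    "group1.txt",
    "group4.txt",
    "group5.txt",    # 5 is outside the range [1-4]
    "group1xtxt",    # \. only matches a literal dot
    "mygroup1.txt",  # extra leading text, so no full match
]

# fullmatch requires the whole name to match the expression.
matching = [name for name in candidates if pattern.fullmatch(name)]
print(matching)
```

The escaped dot matters: written as group[1-4].txt, the unescaped dot would accept any character in that position.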
Now you know how to bring data into Kettle. You didn't bring the data just to preview it; you probably want to transform it and then send it to a final destination such as another plain file. Let's learn how to do this last task.
Time for action – sending the results of matches to a plain file
In the previous tutorial, you read all your "results of matches" files. Now you want to send the data coming from all files to a single output file.
Create a new transformation.
Drag a Text file input step to the canvas and configure it just as you did in the previous tutorial.
Drag a Select values step to the canvas and create a hop from the Text file input step to the Select values step.
Double-click the Select values step.
Click the Get fields to select button.
Modify the fields as follows:
Expand the Output branch of the steps tree.
Drag the Text file output icon to the canvas.
Create a hop from the Select values step to the Text file output step.
Double-click the Text file output step and give it a name.
In the file name type: C:/pdi_files/output/wcup_first_round.
Tip
Note that the path contains forward slashes. If your system is Windows, you may use back or forward slashes. PDI will recognize both notations.
In the Content tab, leave...
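The tip about slashes reflects a general property of Windows paths rather than something specific to PDI. As a small illustration, and not a demonstration of PDI's own path handling, the following Python sketch uses the standard pathlib module to show that both notations name the same location.

```python
from pathlib import PureWindowsPath

# The output file name from the tutorial, written in both notations.
forward = PureWindowsPath("C:/pdi_files/output/wcup_first_round")
backward = PureWindowsPath(r"C:\pdi_files\output\wcup_first_round")

# PureWindowsPath applies Windows path rules on any operating system.
same_location = forward == backward
print(same_location)
print(forward.parts)
```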
Getting system information
Until now, you have learned how to read data from known files, and send data back to files. What if you don't know beforehand the name of the file to process? There are several ways to handle this with Kettle. Let's learn the simplest.
Time for action – updating a file with news about examinations
Imagine you are responsible for collecting the results of an annual examination held at a language school. The examination evaluates writing, reading, speaking, and listening skills. Every professor gives the exam to the students, grades each skill on a scale of 0 to 100, and writes the results in a text file, like the following:
All the files follow that pattern.
When a professor has the file ready, he or she sends it to you, and you have to integrate the results into a global list. Let's do it with Kettle.
Before starting, be sure to have a file ready to read. Type it or download the sample files from Packt's official website.
Create...
Time for action – running the examination transformation from a terminal window
Before executing the transformation from a terminal window, make sure that you have a new examination file to process, for example exam3.txt. Then follow these instructions:
Open a terminal window and go to the directory where Kettle is installed.
On Windows systems type:
On Unix, Linux, and other Unix-based systems type:
If your transformation is in another folder, modify the command accordingly.
You will see how the transformation runs, showing you the log in the terminal.
Check the output file. The contents of exam3.txt should be at the end of the file.
You executed a transformation with Pan, the program that runs transformations from terminal windows. As part of the command, you specified the name of...
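The general pattern here, taking a filename from the command line and appending its contents to a growing output file, can be sketched in a few lines of Python. This is only an analogy to the transformation, not its actual logic: the function name append_file, the output name exam_results.txt, the semicolon separator, and the choice to tag each row with the system date are all assumptions made for illustration.

```python
import sys
from datetime import date


def append_file(source_path, target_path, processed_on):
    """Append every line of source_path to target_path, tagged with a date."""
    with open(source_path, encoding="utf-8") as source, \
            open(target_path, "a", encoding="utf-8") as target:
        for line in source:
            # Hypothetical layout: original row, separator, processing date.
            row = line.rstrip("\n")
            target.write(f"{row};{processed_on.isoformat()}\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python append_exam.py exam3.txt
    append_file(sys.argv[1], "exam_results.txt", date.today())
```

Opening the target in append mode ("a") is what keeps earlier results intact, so each new examination file ends up at the end of the list.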
Even if you're not a system developer, you must have heard about XML files. XML files or documents are not only used to store data, but also to exchange data between heterogeneous systems over the Internet. PDI has many features that enable you to manipulate XML files. In this section you will learn to get data from those files.
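Getting data from an XML file amounts to turning nested elements into flat rows. The following Python sketch shows that idea with the standard xml.etree.ElementTree module; it does not show how PDI reads XML, and the sample document, including the element names world, country, name, and capital, is hypothetical and may not match the real countries.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the real countries.xml may be structured differently.
SAMPLE = """
<world>
  <country><name>Argentina</name><capital>Buenos Aires</capital></country>
  <country><name>Japan</name><capital>Tokyo</capital></country>
</world>
""".strip()

root = ET.fromstring(SAMPLE)

# Each <country> element becomes one row with two fields.
rows = [
    {"name": country.findtext("name"), "capital": country.findtext("capital")}
    for country in root.iter("country")
]
print(rows)
```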
Time for action – getting data from an XML file with information about countries
In this tutorial you will build an Excel file with basic information about countries. The source will be an XML file that you can download from the Packt website.
If you work under Windows, open the kettle.properties file located in the C:/Documents and Settings/yourself/.kettle folder and add the following line:
On the other hand, if you work under Linux (or similar), open the kettle.properties file located in the /home/yourself/.kettle folder and add the following line:
Make sure that the directory specified in kettle.properties exists.
Save the file.
Restart Spoon.
Create a new transformation.
Give the transformation a name and save it in the same directory where you keep all the other transformations.
From the Packt website, download the resources folder containing a file named countries.xml. Save the folder in your working directory. For example...
In this chapter you learned how to get data from files and put data back into files. Specifically, you learned how to:
Get data from plain files and also from XML files
Put data into text files and Excel files
Get information from the operating system such as command-line arguments and system date
We also discussed the following:
The main PDI terminology related to data, for example datasets, data types, and streams
The Select values step, a commonly used step for selecting, reordering, removing and changing data
How and when to use Kettle variables
How to run transformations from a terminal with the Pan command
Now that you know how to get data into a transformation, you are ready to start manipulating it. You will do that in the next chapter.