Pentaho 3.2 Data Integration: Beginner's Guide

Product type Book
Published in Apr 2010
Publisher Packt
ISBN-13 9781847199546
Pages 492 pages
Edition 1st Edition
Table of Contents (27) Chapters

Pentaho 3.2 Data Integration Beginner's Guide
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
1. Getting Started with Pentaho Data Integration
2. Getting Started with Transformations
3. Basic Data Manipulation
4. Controlling the Flow of Data
5. Transforming Your Data with JavaScript Code and the JavaScript Step
6. Transforming the Row Set
7. Validating Data and Handling Errors
8. Working with Databases
9. Performing Advanced Operations with Databases
10. Creating Basic Task Flows
11. Creating Advanced Transformations and Jobs
12. Developing and Implementing a Simple Datamart
13. Taking it Further
Working with Repositories
Pan and Kitchen: Launching Transformations and Jobs from the Command Line
Quick Reference: Steps and Job Entries
Spoon Shortcuts
Introducing PDI 4 Features
Pop Quiz Answers
Index

Chapter 2. Getting Started with Transformations

In the previous chapter you used the graphical designer Spoon to create your first transformation: Hello world. Now you will start creating your own transformations to explore data from the real world. Data is everywhere; in particular you will find data in files. Product lists, logs, survey results, and statistical information are just a sample of the different kinds of information usually stored in files. In this chapter you will create transformations to get data from files, and also to send data back to files. This in turn will allow you to learn the basic PDI terminology related to data.

Reading data from files


Despite being the most primitive format used to store data, files are widely used, and they come in several flavors, such as fixed width, comma-separated values, spreadsheets, or even free-format files. PDI can read data from all of these kinds of files; in this first tutorial, let's see how to use PDI to get data from text files.

Time for action – reading results of football matches from files


Suppose you have collected several football statistics in plain files. Your files look like this:

Group|Date|Home Team |Results|Away Team|Notes
Group 1|02/June|Italy|2-1|France|
Group 1|02/June|Argentina|2-1|Hungary
Group 1|06/June|Italy|3-1|Hungary
Group 1|06/June|Argentina|2-1|France
Group 1|10/June|France|3-1|Hungary
Group 1|10/June|Italy|1-0|Argentina
-------------------------------------------
World Cup 78
Group 1

You have not one file but many, all with the same structure. You now want to unify all the information in a single file. Let's begin by reading the files.

  1. Create the folder named pdi_files. Inside it, create the input and output subfolders.

  2. By using any text editor, type the file shown and save it under the name group1.txt in the folder named input, which you just created. You can also download the file from Packt's official website.

  3. Start Spoon.

  4. From the main menu select File | New Transformation.

  5. Expand the...
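
To get a feeling for what reading this layout involves, here is a minimal Python sketch, not PDI code, that parses a shortened copy of the sample file shown earlier. It assumes that the line of dashes marks the start of the footer and that rows without a trailing Notes value should get an empty one; both are assumptions of this sketch, and the name read_matches is made up for the illustration.

```python
import csv
import io

# Shortened copy of the sample file from this tutorial.
SAMPLE = """\
Group|Date|Home Team |Results|Away Team|Notes
Group 1|02/June|Italy|2-1|France|
Group 1|02/June|Argentina|2-1|Hungary
-------------------------------------------
World Cup 78
Group 1
"""


def read_matches(lines):
    """Return one dict per match row from a pipe-delimited results file."""
    reader = csv.reader(lines, delimiter="|")
    # The first line holds the field names; trim stray spaces around them.
    header = [name.strip() for name in next(reader)]
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        # Assumption: a line of dashes starts the footer, so stop there.
        if fields[0].startswith("---"):
            break
        # Pad short rows so every row has a value for each field.
        fields += [""] * (len(header) - len(fields))
        rows.append(dict(zip(header, fields)))
    return rows


matches = read_matches(io.StringIO(SAMPLE))
for match in matches:
    print(match["Home Team"], match["Results"], match["Away Team"])
```

The header line gives the field names, the delimiter splits each row into fields, and the footer lines are left out of the data.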

Time for action – reading all your files at a time using a single Text file input step


To read all your files, follow these steps:

  1. Open the transformation, double-click the input step, and add the other files in the same way you added the first.

  2. After clicking the Preview rows button, you will see this:

What just happened?

You read several files at once. By listing the names of all the input files in the grid, you got the content of every specified file, one after the other.
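
The idea of getting the content of every listed file, one after the other, can be sketched in Python as follows. This is only an illustration of the concept, not what PDI does internally. The helper read_rows is a made-up name, and the two tiny files written to a temporary folder contain placeholder rows rather than real results.

```python
import tempfile
from pathlib import Path


def read_rows(paths):
    """Yield (filename, line) pairs from each file, in the order listed."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            next(handle, None)  # skip the header line of each file
            for line in handle:
                yield Path(path).name, line.rstrip("\n")


# Placeholder files, created only to make the sketch runnable.
with tempfile.TemporaryDirectory() as folder:
    first = Path(folder, "group1.txt")
    second = Path(folder, "group2.txt")
    first.write_text("Group|Date\nGroup 1|placeholder\n", encoding="utf-8")
    second.write_text("Group|Date\nGroup 2|placeholder\n", encoding="utf-8")
    rows = list(read_rows([first, second]))

for name, line in rows:
    print(name, line)
```

All the rows of the first file come out before any row of the second, which is the same ordering you saw in the preview.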

Time for action – reading all your files at a time using a single Text file input step and regular expressions


You could do the same thing you did above by using a different notation. Follow these instructions:

  1. Open the transformation and edit the configuration window of the input step.

  2. Delete the lines with the names of the files.

  3. In the first row of the grid, type C:\pdi_files\input\ under the File/Directory column, and group[1-4]\.txt under the Wildcard (Reg.Exp.) column.

  4. Click the Show filename(s)... button. You'll see the list of files that match the expression.

  5. Close the tiny window and click Preview rows to confirm that the rows shown belong to the four files that match the expression you typed.

What just happened?

In this particular case, all filenames follow a pattern—group1.txt, group2.txt, and so on. In order to specify the names of the files, you used a regular expression. In the column File/Directory you put the static part of the names, while in the Wildcard (Reg.Exp.) column...
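
If you want to try the expression outside Spoon, the following Python sketch applies group[1-4]\.txt to a few candidate names. PDI is a Java tool and evaluates the expression with its own regular expression engine; Python's re module is used here only as a stand-in, and requiring the whole filename to match is an assumption of the sketch. The candidate names other than group1.txt and group4.txt are invented for the test.

```python
import re

# The expression typed in the Wildcard (Reg.Exp.) column.
WILDCARD = re.compile(r"group[1-4]\.txt")


def matching_files(names, pattern=WILDCARD):
    """Keep only the names that match the pattern from start to end."""
    return [name for name in names if pattern.fullmatch(name)]


candidates = [
    "group1.txt",
    "group4.txt",
    "group5.txt",      # 5 is outside the range [1-4]
    "group1xtxt",      # the escaped dot only matches a literal dot
    "group1.txt.bak",  # extra characters after the pattern
]
selected = matching_files(candidates)
print(selected)
```

Note the backslash before the dot. Without it, the dot would match any character, and a name such as group1xtxt would be accepted as well.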

Sending data to files


Now you know how to bring data into Kettle. You didn't bring the data in just to preview it; you probably want to transform it and then send it to a final destination, such as another plain file. Let's learn how to do this last task.

Time for action – sending the results of matches to a plain file


In the previous tutorial, you read all your "results of matches" files. Now you want to send the data coming from all files to a single output file.

  1. Create a new transformation.

  2. Drag a Text file input step to the canvas and configure it just as you did in the previous tutorial.

  3. Drag a Select values step to the canvas and create a hop from the Text file input step to the Select values step.

  4. Double-click the Select values step.

  5. Click the Get fields to select button.

  6. Modify the fields as follows:

  7. Expand the Output branch of the steps tree.

  8. Drag the Text file output icon to the canvas.

  9. Create a hop from the Select values step to the Text file output step.

  10. Double-click the Text file output step and give it a name.

  11. In the file name type: C:/pdi_files/output/wcup_first_round.

    Tip

    Note that the path contains forward slashes. If your system is Windows, you may use either backslashes or forward slashes. PDI will recognize both notations.

  12. In the Content tab, leave...
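
The tip about slashes is not specific to PDI. As a small illustration, Python's standard pathlib module also treats both notations as the same Windows path. This sketch only compares the two spellings of the file name used in this tutorial; it says nothing about how PDI itself parses paths.

```python
from pathlib import PureWindowsPath

# The same output file, written with forward slashes and with backslashes.
forward = PureWindowsPath("C:/pdi_files/output/wcup_first_round")
backward = PureWindowsPath(r"C:\pdi_files\output\wcup_first_round")

same = forward == backward
print(same)
print(forward.as_posix())
```

PureWindowsPath does not touch the disk, so the comparison can be run on any operating system.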

Getting system information


So far, you have learned how to read data from known files and send data back to files. What if you don't know the name of the file to process beforehand? There are several ways to handle this with Kettle. Let's learn the simplest.

Time for action – updating a file with news about examinations


Imagine you are responsible for collecting the results of an annual examination taken at a language school. The examination evaluates writing, reading, speaking, and listening skills. Every professor gives the exam to the students, the students take the examination, and the professors grade each skill on a scale of 0 to 100 and write the results in a text file like the following:

student_code;name;writing;reading;speaking;listening
80711-85;William Miller;81;83;80;90
20362-34;Jennifer Martin;87;76;70;80
75283-17;Margaret Wilson;99;94;90;80
83714-28;Helen Thomas;89;97;80;80
61666-55;Maria Thomas;88;77;70;80

All the files follow that pattern.

When a professor has the file ready, he/she sends it to you, and you have to integrate the results in a global list. Let's do it with Kettle.

  1. Before starting, be sure to have a file ready to read. Type it or download the sample files from Packt's official website.

  2. Create...
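
Before building the transformation, it may help to see the structure of an examination file expressed in code. The following Python sketch, which is independent of Kettle, reads the first rows of the sample shown above, converts the four grades to integers, and rejects values outside the 0-100 scale. The function name read_results and the range check are assumptions of the sketch.

```python
import csv
import io

# First rows of the sample examination file from this tutorial.
SAMPLE = """\
student_code;name;writing;reading;speaking;listening
80711-85;William Miller;81;83;80;90
20362-34;Jennifer Martin;87;76;70;80
"""

SKILLS = ("writing", "reading", "speaking", "listening")


def read_results(lines):
    """Return one dict per student, with the four grades as integers."""
    results = []
    for row in csv.DictReader(lines, delimiter=";"):
        for skill in SKILLS:
            score = int(row[skill])
            # Assumption: grades outside the 0-100 scale are input errors.
            if not 0 <= score <= 100:
                raise ValueError(f"{skill} out of range for {row['student_code']}")
            row[skill] = score
        results.append(row)
    return results


results = read_results(io.StringIO(SAMPLE))
for student in results:
    print(student["name"], [student[skill] for skill in SKILLS])
```

The student code and name stay as text, while the grades become numbers that can be compared or averaged later.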

Time for action – running the examination transformation from a terminal window


Before executing the transformation from a terminal window, make sure that you have a new examination file to process, let's say exam3.txt. Then follow these instructions:

  1. Open a terminal window and go to the directory where Kettle is installed.

    • On Windows systems type:

      C:\pdi-ce>pan.bat /file:c:\pdi_labs\examinations.ktr c:\pdi_files\input\exam3.txt

    • On Unix, Linux, and other Unix-based systems type:

      /home/yourself/pdi-ce/pan.sh /file:/home/yourself/pdi_labs/examinations.ktr /home/yourself/pdi_files/input/exam3.txt
      
    • If your transformation is in another folder, modify the command accordingly.

  2. You will see how the transformation runs, showing you the log in the terminal.

  3. Check the output file. The contents of exam3.txt should be at the end of the file.

What just happened?

You executed a transformation with Pan, the program that runs transformations from terminal windows. As part of the command, you specified the name of...
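
The pattern used here, taking the name of the file to process from the command line and appending its rows to a global file, can be sketched in plain Python. This is an analogy, not a replacement for Pan or for the transformation. The script name append_exam.py and the global file all_results.txt are hypothetical, and passing the global file as a second argument is an assumption made to keep the sketch self-contained.

```python
import sys
import tempfile
from pathlib import Path

HEADER = "student_code;name;writing;reading;speaking;listening"


def append_exam(exam_path, global_path):
    """Append the data rows of one examination file to the global file."""
    lines = Path(exam_path).read_text(encoding="utf-8").splitlines()
    # Drop the header line and any blank lines.
    data_lines = [line for line in lines[1:] if line.strip()]
    with open(global_path, "a", encoding="utf-8") as out:
        for line in data_lines:
            out.write(line + "\n")
    return len(data_lines)


def main(argv):
    """argv[1] is the file to process; argv[2] is the (hypothetical) global file."""
    if len(argv) != 3:
        print("usage: append_exam.py EXAM_FILE GLOBAL_FILE", file=sys.stderr)
        return 2
    count = append_exam(argv[1], argv[2])
    print(f"appended {count} rows from {argv[1]}")
    return 0


# Demonstration in a temporary folder, using rows from the sample file.
with tempfile.TemporaryDirectory() as folder:
    exam = Path(folder, "exam3.txt")
    global_file = Path(folder, "all_results.txt")
    exam.write_text(HEADER + "\n61666-55;Maria Thomas;88;77;70;80\n", encoding="utf-8")
    global_file.write_text("83714-28;Helen Thomas;89;97;80;80\n", encoding="utf-8")
    status = main(["append_exam.py", str(exam), str(global_file)])
    combined = global_file.read_text(encoding="utf-8")

print(combined)
```

In a real script you would call main(sys.argv) instead of the demonstration block, so that the file name comes from whatever was typed in the terminal.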

XML files


Even if you're not a system developer, you must have heard about XML files. XML files or documents are not only used to store data, but also to exchange data between heterogeneous systems over the Internet. PDI has many features that enable you to manipulate XML files. In this section you will learn to get data from those files.

Time for action – getting data from an XML file with information about countries


In this tutorial you will build an Excel file with basic information about countries. The source will be an XML file that you can download from the Packt website.

  1. If you work under Windows, open the kettle.properties file located in the C:/Documents and Settings/yourself/.kettle folder and add the following line:

    LABSOUTPUT=c:/pdi_files/output

    On the other hand, if you work under Linux (or similar), open the kettle.properties file located in the /home/yourself/.kettle folder and add the following line:

    LABSOUTPUT=/home/yourself/pdi_files/output

  2. Make sure that the directory specified in kettle.properties exists.

  3. Save the file.

  4. Restart Spoon.

  5. Create a new transformation.

  6. Give the transformation a name and save it in the same directory where you keep all your other transformations.

  7. From the Packt website, download the resources folder containing a file named countries.xml. Save the folder in your working directory. For example...
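
A line such as LABSOUTPUT=c:/pdi_files/output defines a Kettle variable, which you can later reference as ${LABSOUTPUT}. The following Python sketch imitates the idea: it reads key=value lines and replaces ${NAME} references in a string. It is a simplified illustration, not Kettle's actual parser. The file name countries_info is hypothetical, and leaving unknown variables untouched is an assumption of the sketch.

```python
import re


def load_properties(text):
    """Parse simple key=value lines, ignoring blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def substitute(template, props):
    """Replace each ${NAME} with its value; unknown names are left as-is."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: props.get(match.group(1), match.group(0)),
        template,
    )


props = load_properties("LABSOUTPUT=c:/pdi_files/output\n")
path = substitute("${LABSOUTPUT}/countries_info", props)
print(path)
```

Because the folder lives in one place, changing the value in the properties file changes every path that refers to the variable.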

Summary


In this chapter you learned how to get data from files and put data back into files. Specifically, you learned how to:

  • Get data from plain files and also from XML files

  • Put data into text files and Excel files

  • Get information from the operating system such as command-line arguments and system date

We also discussed the following:

  • The main PDI terminology related to data, for example datasets, data types, and streams

  • The Select values step, a commonly used step for selecting, reordering, removing and changing data

  • How and when to use Kettle variables

  • How to run transformations from a terminal with the Pan command

Now that you know how to get data into a transformation, you are ready to start manipulating it. You will do that in the next chapter.
