Pentaho Data Integration (PDI) is a popular business intelligence tool, used for exploring, transforming, validating, and migrating data, along with other useful operations. PDI allows you to perform all of the preceding tasks thanks to its friendly user interface, modern architecture, and rich functionality. This book will introduce you to the tool, giving you a quick understanding of the daily tasks that you can perform with it.
We will cover the following topics in this chapter:
- Introducing PDI
- Installing PDI
- Configuring the graphical designer tool
- Creating a simple transformation
- Understanding the Kettle home directory
PDI, also known as Kettle, is a very powerful tool. It can be used for performing typical Extract, Transform, and Load (ETL) processes. PDI gets data from different sources and manipulates it in many ways (deduplicating, filtering, cleaning, and formatting, among others), saving the data in different formats and destinations. The following diagram illustrates a very simple example of an ETL process designed with PDI:
Aside from the preceding processes, PDI serves to migrate data between applications, access and manipulate real-time data, access data in the cloud, orchestrate administrative tasks, and more.
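To make the ETL idea concrete outside of PDI, here is a minimal Python sketch of the three stages described above. It is not how PDI works internally (PDI is designed graphically, without code). The sample data, the function names, and the column names are all hypothetical and only illustrate the extract, transform (cleaning, formatting, filtering, deduplicating), and load sequence.

```python
import csv
import io

# Hypothetical sample input; a real process would read a file or a database.
RAW_CSV = """name,city
 alice ,Lisbon
BOB,Porto
 alice ,Lisbon
,Faro
"""


def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean, format, filter, and deduplicate the rows."""
    seen = set()
    result = []
    for row in rows:
        name = row["name"].strip().title()  # cleaning and formatting
        city = row["city"].strip()
        if not name:  # filtering: drop rows without a name
            continue
        key = (name, city)
        if key in seen:  # deduplicating
            continue
        seen.add(key)
        result.append({"name": name, "city": city})
    return result


def load(rows):
    """Load: write the rows to a destination (here, an in-memory CSV)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "city"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(load(transform(extract(RAW_CSV))))
```

In PDI, each of these functions would roughly correspond to one or more steps connected in a transformation, rather than to hand-written code.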
- Make sure that you have JRE 8.0 installed.
If you don't have JRE 8.0 installed, download it from http://www.java.com and install it before proceeding. Make sure that the JAVA_HOME system variable is set.
- Go to the download page at: https://sourceforge.net/projects/pentaho/files/Data%20Integration/.
- Choose the latest stable release. At the time of writing this book, it is 8.1, as shown in the following screenshot:
PDI on SourceForge.net
- Download the available ZIP file, which will serve you for all platforms.
- Unzip the downloaded file in a folder of your choice (for example,
- Browse your disk and look for the PDI folder that was just created. You will see a folder named data-integration, with several subfolders (samples, and more) and a bunch of scripts (pan.bat, and others), which we will soon learn how to use.
- Start Spoon: If your system is Windows, run Spoon.bat from within the PDI installation directory. On other platforms, such as Unix, Linux, and so on, open a Terminal window and type
- The main window will show up, with a Welcome! window already open, as shown in the following screenshot:
The Welcome! page includes some links to web resources, forums, and more, as well as some shortcuts for working with PDI. You can reach that window at any time by navigating to the Welcome Screen option.
- Click on
Tools menu. A window appears, where you can change various general characteristics, as follows:
- Many of the options in this tab will not make sense to you yet. Instead of doing anything here, select the Look & Feel tab:
Look & Feel options
- Feel free to change any of the options in this tab (for example, the font color or size). Click on the
- Restart Spoon to apply the changes.
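Before launching Spoon on a new machine, the Java prerequisite from the installation steps can be verified with a short script. The following is a minimal sketch that only checks whether the JAVA_HOME variable is set and points to an existing directory. The function name `check_java` is hypothetical, and the script does not verify the Java version.

```python
import os


def check_java(env=None):
    """Return a list of problems found with JAVA_HOME (empty if none)."""
    env = os.environ if env is None else env
    problems = []
    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append(f"JAVA_HOME points to a missing directory: {java_home}")
    return problems


if __name__ == "__main__":
    issues = check_java()
    if issues:
        for issue in issues:
            print("Problem:", issue)
    else:
        print("JAVA_HOME looks fine:", os.environ["JAVA_HOME"])
```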
Transformations and jobs are the main PDI artifacts. Transformations are data-flow oriented entities, while jobs are task-oriented. In this book, we will start by learning all about transformations, focusing on jobs later. To get a quick idea of what, exactly, a transformation is, we will start by creating a simple one. This will also allow you to see what it's like to work with Spoon.
Our first transformation will find out the current version of PDI (Kettle), and will print the value to the log. Proceed as follows:
- On the Welcome page, click on the New transformation link, located under the WORK link group. Alternatively, press Ctrl + N.
- A new tab will appear, with the title Transformation 1. It's in this tab that you will create your work.
- To the left of the screen, under the Design tab, you'll see a tree of folders. Expand the Input folder by double-clicking on it.
- Then, left-click on the Get System Info icon, and, without releasing the button, drag and drop the selected icon to the work area (that is, the blank area that occupies almost all of the screen). You should see something like this:
Dragging and dropping a step
- Double-click on the Get System Info icon. A configuration window will show up. Fill in the first row in the grid, as shown in the following screenshot. Note that you don't have to type the text Kettle version; instead, you can choose it from a list of available options:
Configuring the Get System Info step
- In the Design tab, double-click on the Utility folder, click on the Write to log icon, and drag and drop it to the work area.
- Put the mouse cursor over the Get System Info icon and wait until a tiny toolbar shows up, as shown in the following screenshot:
Mouseover assistance toolbar
- Click on the output connector (the icon highlighted in the preceding image) and drag it towards the Write to log icon. A greyed hop is displayed.
- When the mouse cursor is over the Write to log step, release the button. A link (a hop, from now on) is created, from the first step to the second one. The screen should look as follows:
Connecting steps with a hop
- Right-click anywhere in the work area to bring up a contextual menu.
- In the menu, select the New Note... option. A note editor will appear.
- Type a description, such as My first transformation. Select the Font style tab and choose a nice font and some colors for your note, and then click on OK. The following should be the final result:
My first transformation
- Save the transformation by pressing Ctrl + S. PDI will ask for a destination folder. Select the folder of your choice, and give the transformation a name. PDI will save the transformation as a file with a ktr extension (for example,
Finally, let's run the transformation to see what happens:
- Click on the Run icon, located in the transformation toolbar:
Run icon in the transformation toolbar
When you run Spoon for the first time, a folder named .kettle is created in your home directory by default. This folder is referred to as the Kettle home directory.
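As a quick way to see where that folder ends up on your machine, the following Python sketch builds the default path and reports whether it already exists. The function name `default_kettle_home` is hypothetical, and the sketch assumes the default location described above (a .kettle folder directly under the user's home directory).

```python
from pathlib import Path


def default_kettle_home(home=None):
    """Return the default Kettle home directory: <user home>/.kettle."""
    base = Path(home) if home is not None else Path.home()
    return base / ".kettle"


if __name__ == "__main__":
    kettle_home = default_kettle_home()
    status = "exists" if kettle_home.is_dir() else "not created yet (run Spoon once)"
    print(f"{kettle_home}: {status}")
```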
The kettle.properties file, created along with the .kettle folder the first time you run Spoon, holds variable definitions with a broad scope: the whole Java Virtual Machine. Therefore, it's the perfect place to define general settings; some examples are as follows:
- Database connection settings: host, database name, and so on
- SMTP settings: SMTP server, port, and so on
- Common input and output folders
- Directory to send log files to
Before continuing, let's add some variables to the file. Suppose that you have two folders, named
C:/PDI/OUTPUT, which you will use for storing files. The objective will be to add two variables, named
OUTPUT_FOLDER, containing those values:
- Locate the Kettle home directory. If you work in Windows, the folder could be C:\Documents and Settings\<your_name> or C:\Users\<your_name>, depending on which Windows version you have. If you work in Linux (or similar) or macOS, the folder will most likely be
- Edit the kettle.properties file. You will see that it contains only commented sample lines.
- You can safely remove the contents of the file and define your own variables by typing the following lines:
Save the file and restart Spoon, so that it can recognize the variables defined in the file. We will learn how to use these variables in Chapter 2, Getting Familiar with Spoon.
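The same edit can be sketched in Python, assuming the file uses plain NAME=VALUE lines with # comments. The helper names `define_variable` and `read_properties` are hypothetical, the parser is deliberately simplified (it ignores the escaping and line-continuation rules of Java properties files), and a temporary directory stands in for the Kettle home directory so that nothing in your real setup is touched. Only the OUTPUT_FOLDER variable and its value from the text above are used.

```python
import tempfile
from pathlib import Path


def read_properties(path):
    """Parse simple NAME=VALUE lines, skipping blank lines and # comments."""
    variables = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' separator
            variables[key.strip()] = value.strip()
    return variables


def define_variable(path, name, value):
    """Append a NAME=VALUE definition to the properties file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(f"{name}={value}\n")


if __name__ == "__main__":
    # A temporary directory stands in for the Kettle home directory.
    with tempfile.TemporaryDirectory() as kettle_home:
        props = Path(kettle_home) / "kettle.properties"
        props.write_text("# commented sample line\n", encoding="utf-8")
        define_variable(props, "OUTPUT_FOLDER", "C:/PDI/OUTPUT")
        print(read_properties(props))
```

Forward slashes are used in the path on purpose: in a properties file, a backslash is treated as an escape character, so C:/PDI/OUTPUT is the safer spelling.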
In this chapter, you were introduced to Pentaho Data Integration. Specifically, you learned what PDI is, and you installed the tool. You were introduced to Spoon, PDI's graphical designer tool, and you created your first transformation. You were also introduced to the Kettle home directory and the kettle.properties file, which will be used throughout the rest of the book.
In Chapter 2, Getting Familiar with Spoon, you will learn much more about the process of creating, testing, and running transformations in Spoon.