Pentaho Data Integration (PDI) is a popular business intelligence tool, used for exploring, transforming, validating, and migrating data, along with other useful operations. PDI allows you to perform all of the preceding tasks thanks to its friendly user interface, modern architecture, and rich functionality. This book will introduce you to the tool, giving you a quick understanding of the daily tasks that you can perform with it.
We will cover the following topics in this chapter:
- Introducing PDI
- Installing PDI
- Configuring the graphical designer tool
- Creating a simple transformation
- Understanding the Kettle home directory
PDI, also known as Kettle, is a very powerful tool. It can be used for performing typical Extract, Transform, and Load (ETL) processes. PDI gets data from different sources and manipulates it in many ways (deduplicating, filtering, cleaning, and formatting, among others), saving the data in different formats and destinations. The following diagram illustrates a very simple example of an ETL process designed with PDI:
ETL process
Aside from the preceding processes, PDI serves to migrate data between applications, access and manipulate real-time data, access data in the cloud, orchestrate administrative tasks, and more.
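To make the idea of an ETL process concrete, the following is a minimal sketch in plain Python. It is not PDI code; it only mirrors the kind of work described above (cleaning, filtering, deduplicating, and formatting). The column names (name, amount) and the sample data are hypothetical and were chosen for illustration only.

```python
import csv
import io


def extract(csv_text):
    """Extract: read rows from CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: clean, filter, deduplicate, and format the rows."""
    seen = set()
    result = []
    for row in rows:
        name = row["name"].strip()      # cleaning: trim surrounding spaces
        if not name:                    # filtering: skip rows without a name
            continue
        key = name.lower()
        if key in seen:                 # deduplicating: keep the first occurrence
            continue
        seen.add(key)
        amount = f"{float(row['amount']):.2f}"   # formatting: two decimals
        result.append({"name": name.title(), "amount": amount})
    return result


def load(rows):
    """Load: write the rows out as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "amount"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical input: one duplicate (alice) and one row with an empty name.
SAMPLE = "name,amount\n alice ,10\nBOB,2.5\nalice,99\n,7\n"

if __name__ == "__main__":
    print(load(transform(extract(SAMPLE))))
```

In PDI, each of these functions would roughly correspond to one or more steps connected in a transformation, designed visually rather than written by hand.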
The following are the instructions to install the PDI Community Edition (CE), irrespective of the operating system that you may be using:
- Make sure that you have JRE 8.0 installed.
Note
If you don't have JRE 8.0 installed, download it from http://www.java.com and install it before proceeding. Make sure that the JAVA_HOME system variable is set.
- Go to the download page at: https://sourceforge.net/projects/pentaho/files/Data%20Integration/.
- Choose the latest stable release. At the time of writing this book, it is 8.1, as shown in the following screenshot:
PDI on SourceForge.net
- Download the available ZIP file, which will serve you for all platforms.
- Unzip the downloaded file in a folder of your choice (for example, c:/software/pdi or /home/pdi_user/pdi).
- Browse your disk and look for the PDI folder that was just created. You will see a folder named data-integration, with several subfolders (lib, plugins, samples, and more) and a bunch of scripts (spoon.bat, pan.bat, and others), which we will soon learn how to use.
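If you prefer to script the last installation steps, they can be sketched as follows. This is only an illustration, not part of PDI: the function names are hypothetical, and the ZIP filename and target folder in the usage example are placeholders that you should replace with your own.

```python
import os
import zipfile
from pathlib import Path


def java_home_is_set(env=None):
    """Return True if JAVA_HOME is defined and not empty."""
    env = os.environ if env is None else env
    return bool(env.get("JAVA_HOME", "").strip())


def unzip_pdi(zip_path, target_dir):
    """Unzip the downloaded PDI archive and return the expected PDI folder."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    # The archive is expected to contain a top-level data-integration folder.
    return target / "data-integration"


if __name__ == "__main__":
    if not java_home_is_set():
        print("JAVA_HOME is not set; install the JRE and set it first.")
    # Placeholder names: adjust to the file you downloaded and your folder.
    archive_path = Path("pdi-ce.zip")
    if archive_path.exists():
        print("PDI folder:", unzip_pdi(archive_path, "/home/pdi_user/pdi"))
```

Note that Python's zipfile module does not restore Unix execute permissions, so on Linux or macOS you may still need to make the .sh scripts executable after extracting them this way.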
Spoon is PDI's desktop designer tool. With Spoon, you can design, preview, and test all of your work (that is, transformations and jobs).
Before starting to work with PDI, it's advisable to take a look at the Spoon interface and do some minimal configuration. The instructions are as follows:
- Start Spoon: If your system is Windows, run Spoon.bat from within the PDI installation directory. On other platforms, such as Unix, Linux, and so on, open a Terminal window and type spoon.sh.
- The main window will show up, with a Welcome! window already open, as shown in the following screenshot:
Welcome page
Note
The Welcome! page includes some links to web resources, forums, and more, as well as some shortcuts for working with PDI. You can reach that window at any time by navigating to the Help | Welcome Screen option.
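The choice of launcher script described above depends only on the operating system. As a small sketch, assuming PDI was unzipped to a folder such as /home/pdi_user/pdi/data-integration (a placeholder), the script to run can be picked as follows. The function name spoon_launcher is hypothetical.

```python
import platform
from pathlib import Path


def spoon_launcher(pdi_dir, system=None):
    """Return the path of the Spoon launcher script for the given system."""
    system = platform.system() if system is None else system
    # Spoon.bat on Windows; spoon.sh on Unix, Linux, and other platforms.
    script = "Spoon.bat" if system == "Windows" else "spoon.sh"
    return Path(pdi_dir) / script


if __name__ == "__main__":
    # Placeholder installation folder; replace it with your own.
    print(spoon_launcher("/home/pdi_user/pdi/data-integration"))
```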
In order to customize Spoon, do the following:
- Click on Options... in the Tools menu. A window appears, where you can change various general characteristics, as follows:
Options
- Many of the options in this tab will not make sense to you yet. Instead of doing anything here, select the Look & Feel tab:
Look & Feel options
- Feel free to change any of the options in this tab (for example, the font color or size). Click on the OK button.
- Restart Spoon to apply the changes.
Transformations and jobs are the main PDI artifacts. Transformations are data-flow oriented entities, while jobs are task-oriented. In this book, we will start by learning all about transformations, focusing on jobs later. To get a quick idea of what, exactly, a transformation is, we will start by creating a simple one. This will also allow you to see what it's like to work with Spoon.
Our first transformation will find out the current version of PDI (Kettle), and will print the value to the log. Proceed as follows:
- On the Welcome page, click on the New transformation link, located under the WORK link group. Alternatively, press Ctrl + N.
- A new tab will appear, with the title Transformation 1. It's in this tab that you will create your work.
- To the left of the screen, under the Design tab, you'll see a tree of folders. Expand the Input folder by double-clicking on it.
- Then, left-click on the Get System Info icon, and, without releasing the button, drag and drop the selected icon to the work area (that is, the blank area that occupies almost all of the screen). You should see something like this:
Dragging and dropping a step
- Double-click on the Get System Info icon. A configuration window will show up. Fill in the first row in the grid, as shown in the following screenshot. Note that you don't have to type Kettle version. Instead, you can choose it from a list of available options:
Configuring the Get System Info step
- In the Design tab, double-click on the Utility folder, click on the Write to log icon, and drag and drop it to the work area.
- Put the mouse cursor over the Get System Info icon and wait until a tiny toolbar shows up, as shown in the following screenshot:
Mouseover assistance toolbar
- Click on the output connector (the icon highlighted in the preceding image) and drag it towards the Write to log icon. A greyed hop is displayed.
- When the mouse cursor is over the Write to log step, release the button. A link (a hop, from now on) is created, from the first step to the second one. The screen should look as follows:
Connecting steps with a hop
Let's add a colored note to our work, as follows:
- Right-click anywhere in the work area to bring up a contextual menu.
- In the menu, select the New Note... option. A note editor will appear.
- Type a description, such as My first transformation. Select the Font style tab and choose a nice font and some colors for your note, and then click on OK. The following should be the final result:
My first transformation
- Save the transformation by pressing Ctrl + S. PDI will ask for a destination folder. Select the folder of your choice, and give the transformation a name. PDI will save the transformation as a file with a ktr extension (for example, sample_transformation.ktr).
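A ktr file is an XML document, so the saved transformation can also be inspected outside Spoon. The following is a minimal sketch that lists the steps and hops of a transformation. The embedded XML is a heavily simplified, hand-written stand-in for a real file, and the element names it uses (step, name, order, hop, from, to) are assumptions that you should check against the file you just saved.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified stand-in for a saved .ktr file (assumed layout).
SAMPLE_KTR = """<transformation>
  <info><name>sample_transformation</name></info>
  <order>
    <hop><from>Get System Info</from><to>Write to log</to></hop>
  </order>
  <step><name>Get System Info</name></step>
  <step><name>Write to log</name></step>
</transformation>"""


def describe(ktr_xml):
    """Return the step names and the (from, to) hops found in the XML text."""
    root = ET.fromstring(ktr_xml)
    steps = [step.findtext("name") for step in root.findall("step")]
    hops = [(hop.findtext("from"), hop.findtext("to"))
            for hop in root.findall("order/hop")]
    return steps, hops


if __name__ == "__main__":
    steps, hops = describe(SAMPLE_KTR)
    print("Steps:", steps)
    print("Hops:", hops)
```

To try it with your own transformation, read the contents of the saved file and pass the text to describe instead of SAMPLE_KTR.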
Finally, let's run the transformation to see what happens:
- Click on the Run icon, located in the transformation toolbar:
Run icon in the transformation toolbar
Execution Results
When you run Spoon for the first time, a folder named .kettle is created in your home directory by default. This folder is referred to as the Kettle home directory.
The folder contains several configuration files, mainly created and updated by the different PDI tools. Among these files, there is the kettle.properties file.
The purpose of the kettle.properties file, which is created along with the .kettle folder the first time you run Spoon, is to contain variable definitions with a broad scope: the Java Virtual Machine. Therefore, it's the perfect place to define general settings; some examples are as follows:
- Database connection settings: host, database name, and so on
- SMTP settings: SMTP server, port, and so on
- Common input and output folders
- Directory to send log files to
Before continuing, let's add some variables to the file. Suppose that you have two folders, named C:/PDI/INPUT and C:/PDI/OUTPUT, which you will use for storing files. The objective will be to add two variables, named INPUT_FOLDER and OUTPUT_FOLDER, containing those values:
- Locate the Kettle home directory. If you work in Windows, the .kettle folder will be located in C:\Documents and Settings\<your_name> or C:\Users\<your_name>, depending on which Windows version you have. If you work in Linux (or similar), it will most likely be located in /home/<your_name>/; in macOS, in /Users/<your_name>/.
- Edit the kettle.properties file. You will see that it only contains commented sample lines.
- You can safely remove the contents of the file and define your own variables by typing the following lines:
INPUT_FOLDER=C:/PDI/INPUT
OUTPUT_FOLDER=C:/PDI/OUTPUT
Save the file and restart Spoon, so that it can recognize the variables defined in the file. We will learn how to use these variables in Chapter 2, Getting Familiar with Spoon.
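To double-check what you typed, the file can be read back with a few lines of Python. This is a minimal sketch, not a PDI tool: it assumes the default location of the Kettle home directory (a .kettle folder in your home directory) and only understands simple KEY=VALUE lines and # comments, which is a small subset of the Java properties format. The function names are hypothetical.

```python
from pathlib import Path


def kettle_properties_path(home=None):
    """Return the default location of kettle.properties for a home directory."""
    home = Path.home() if home is None else Path(home)
    return home / ".kettle" / "kettle.properties"


def parse_properties(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:                   # ignore lines without an equals sign
            variables[key.strip()] = value.strip()
    return variables


if __name__ == "__main__":
    path = kettle_properties_path()
    if path.exists():
        for name, value in parse_properties(path.read_text()).items():
            print(f"{name} = {value}")
    else:
        print(f"{path} not found; run Spoon once to create it.")
```

With the two lines added above, the script should list INPUT_FOLDER and OUTPUT_FOLDER along with their values.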
In this chapter, you were introduced to Pentaho Data Integration. Specifically, you learned what PDI is, and you installed the tool. You were introduced to Spoon, PDI's graphical designer tool, and you created your first transformation. You were also introduced to the Kettle home directory and the kettle.properties file, which will be used throughout the rest of the book.
In Chapter 2, Getting Familiar with Spoon, you will learn much more about the process of creating, testing, and running transformations in Spoon.