Integrating Kettle and the Pentaho Suite

by Adrián Sergio Pulvirenti and María Carina Roldán | July 2011

PDI, also known as Kettle, is part of the Pentaho Business Intelligence Suite. As such, it can interact with other components of the suite, for example, as the datasource for reporting, or as part of a bigger process. This article by Adrián Sergio Pulvirenti and María Carina Roldán, authors of Pentaho Data Integration 4 Cookbook, shows you how to run Kettle jobs and transformations in that context.

In this article, we will cover:

  • Creating a Pentaho report with data coming from PDI
  • Configuring the Pentaho BI Server for running PDI jobs and transformations
  • Executing a PDI transformation as part of a Pentaho process
  • Executing a PDI job from the PUC (Pentaho User Console)
  • Generating files from the PUC with PDI and the CDA (Community Data Access) plugin
  • Populating a CDF (Community Dashboard Framework) dashboard with data coming from a PDI transformation

 


Introduction

Kettle, also known as PDI, is mostly used as a stand-alone application. However, it is not an isolated tool, but part of the Pentaho Business Intelligence Suite. As such, it can also interact with other components of the suite; for example, as the datasource for a report, or as part of a bigger process. This article shows you how to run Kettle jobs and transformations in that context.

The article assumes a basic knowledge of the Pentaho BI platform and the tools that make up the Pentaho Suite. If you are not familiar with these tools, it is recommended that you visit the wiki page (wiki.pentaho.com) or the Pentaho BI Suite Community Edition (CE) site: http://community.pentaho.com/.

As another option, you can get the book Pentaho Solutions (Wiley) by Roland Bouman and Jos van Dongen, which gives you a good introduction to the whole suite.

A sample transformation

The different recipes in this article show you how to run Kettle transformations and jobs integrated with several components of the Pentaho BI suite. In order to focus on the integration itself rather than on Kettle development, we have created a sample transformation named weather.ktr that will be used throughout the recipes.

The transformation receives the name of a city as the first command-line argument, for example Madrid, Spain. Then, it consumes a web service to get the current weather conditions and the forecast for the next five days for that city. The transformation also defines two named parameters: TEMP, for the temperature scale, and SPEED, for the wind speed scale.


It receives the command-line argument and the named parameters, calls the service, and retrieves the information in the desired scales for temperature and wind speed.
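
If you want to try the transformation on its own first, you can run it with Pan, the Kettle command-line tool for transformations. The following is just a sketch: the script name varies by platform, the transformation is assumed to be in the current directory, and Kmph is only an example value for the wind speed parameter:

sh pan.sh -file=weather.ktr -param:TEMP=F -param:SPEED=Kmph "Madrid, Spain"

The quoted value at the end is taken as the first command-line argument (the city), while each -param: option supplies a named parameter.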

You can download the transformation from the book's site and test it. Do a preview on the next_days, current_conditions, and current_conditions_normalized steps to see what the results look like.

A preview of the next_days step shows one row per forecast day. A preview of current_conditions returns a single row with one column for each measure: city, observation_time, weatherDesc, and so on. A preview of current_conditions_normalized returns the same information as one FEATURE/VALUE pair per row.

There is also another transformation named weather_np.ktr. This transformation does exactly the same, but it reads the city as a named parameter instead of reading it from the command line. The Getting ready section of each recipe will tell you which of these transformations to use.

Avoiding consuming the web service
It may happen that you do not want to consume the web service (for example, to avoid delays), or that you cannot do it (for example, because you do not have Internet access). Besides, if you call a free web service like this too often, your IP might be banned from the service. Don't worry. Along with the sample transformations on the book's site, you will find another version of the transformations that, instead of calling the web service, reads fictional sample data from a file containing the forecast for over 250 cities. The transformations are weather (file version).ktr and weather_np (file version).ktr. Feel free to use these transformations instead. You should not have any trouble, as the parameters and the metadata of the retrieved data are exactly the same as in the transformations explained earlier.

If you use transformations that do not call the web service, remember that they rely on the file with the fictional data (weatheroffline.txt). Wherever you copy the transformations, do not forget to copy that file as well.

 

Creating a Pentaho report with data coming from PDI

The Pentaho Reporting Engine allows designing, creating, and distributing reports in various popular formats (HTML, PDF, and so on) from different kinds of sources (JDBC, OLAP, XML, and so on).

There are occasions where you need other kinds of sources such as text files or Excel files, or situations where you must process the information before using it in a report. In those cases, you can use the output of a Kettle transformation as the source of your report. This recipe shows you this capability of the Pentaho Reporting Engine.

For this recipe, you will develop a very simple report: The report will ask for a city and a temperature scale and will report the current conditions in that city. The temperature will be expressed in the selected scale.

Getting ready

A basic understanding of the Pentaho Report Designer tool is required in order to follow this recipe. You should be able to create a report, add parameters, build a simple report, and preview the final result.

Regarding the software, you will need the Pentaho Report Designer. You can download the latest version from the following URL:

http://sourceforge.net/projects/pentaho/files/Report%20Designer/

You will also need the sample transformation weather.ktr.

The sample transformation has a couple of UDJE (User Defined Java Expression) steps. These steps rely on the Janino library. In order to be able to run the transformation from Report Designer, you have to copy the janino.jar file from the Kettle libext directory into the Report Designer lib directory.
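
For example, on a Linux or Mac installation the copy could look like the following; the folder names are assumptions based on the default installation layout, and the jar may carry a version suffix in your distribution:

cp data-integration/libext/janino*.jar report-designer/lib/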

How to do it...

In the first part of the recipe, you will create the report and define the parameters for the report: the city and the temperature scale.

  1. Launch Pentaho Report Designer and create a new blank report.
  2. Add two mandatory parameters: a parameter named city_param with Lisbon, Portugal as its Default Value, and a parameter named scale_param that accepts two possible values: C, meaning Celsius, or F, meaning Fahrenheit.

Now, you will define the data source for the report:

  1. In the Data menu, select Add Data Source and then Pentaho Data Integration.
  2. Click on the Add a new query button. A new query named Query 1 will be added. Give the query a proper name, for example, forecast.
  3. Click on the Browse button. Browse to the sample transformation and select it. The Steps listbox will be populated with the names of the steps in the transformation.
  4. Select the step current_conditions.

    The specification of the transformation file name with the complete path will work only inside Report Designer. Before publishing the report, you should edit the file name (for example, C:\Pentaho\reporting\weather.ktr) and leave just a path relative to the directory where the report is to be published (for example, reports\weather.ktr).

  5. Click on Preview; you will see an empty resultset. The important thing here is that the headers should be the same as the output fields of the current_conditions step: city, observation_time, weatherDesc, and so on.
  6. Now, close that window and click on Edit Parameters.
  7. You will see two grids: Transformation Parameter and Transformation Arguments. In the Transformation Parameter grid, map the transformation parameter TEMP to the report parameter scale_param; in the Transformation Arguments grid, add city_param. You can type the values or select them from the available drop-down lists.

  8. Close the Pentaho Data Integration Data Source window.

    The data coming from Kettle is ready to be used in your report.

  9. Build the report layout: Drag and drop some fields into the canvas and arrange them as you please. Provide a title as well.

  10. Now, you can do a Print Preview to see the final result: the current weather conditions for the selected city, with the temperature expressed in the chosen scale.

Note that the output of the current_conditions step has just one row.

If, as the data source, you choose the next_days or the current_conditions_normalized step instead, the result will have several rows. In that case, you could design a report by columns: one column for each field.

How it works...

Using the output of a Kettle transformation as the data source of a report is very useful because you can take advantage of all the functionality of the PDI tool. For instance, in this case you built a report based on the result of consuming a web service. You could not have done this with Pentaho Report Designer alone.

In order to use the output of your Kettle transformation, you just added a Pentaho Data Integration datasource. You selected the transformation to run and the step that would deliver your data.

In order to be executed, your transformation needs a command-line argument: the name of the city. The transformation also defines two named parameters: the temperature scale and the wind speed scale. From the Pentaho Report Designer, you provided both: a value for the city and a value for the temperature scale. You did it by filling in the Edit Parameters window inside the Pentaho Data Integration Data Source window. Note that you did not supply a value for the SPEED named parameter; that is not necessary, because Kettle uses the default value.

As you can see in the recipe, the data source created by the report engine has the same structure as the data coming from the selected step: the same fields with the same names, same data types, and in the same order.

Once you configured this data source, you were able to design your report as you would have done with any other kind of data source.

Finally, when you are done and want to publish your report on the server, do not forget to fix the path as explained in the recipe: the File should be specified with a path relative to the solution folder. For example, suppose that your report will be published in my_solution/reports and you put the transformation file in my_solution/reports/resources. In that case, for File, you should type resources/ plus the name of the transformation.

There's more...

Pentaho Reporting is a suite of Java projects built for report generation. The suite is made up of the Pentaho Reporting Engine and a set of tools such as the Report Designer (the tool used in this recipe), Report Design Wizard, and Pentaho's web-based Ad Hoc Reporting user interface.

In order to be able to run transformations, the Pentaho Reporting software includes the Kettle libraries. To avoid any inconvenience, be sure that the versions of the libraries included are the same or newer than the version of Kettle you are using. For instance, Pentaho Reporting 3.8 includes Kettle 4.1.2 libraries. If you are using a different version of Pentaho Reporting, then you can verify the Kettle version by looking in the lib folder inside the reporting installation folder. You should look for files named kettle-core-<version>.jar, kettle-db-<version>.jar, and kettle-engine-<version>.jar.

Besides, if the transformations you want to use as data sources rely on external libraries, then you have to copy the proper jar files from the Kettle libext directory into the Report Designer lib folder, just as you did with the janino.jar file in the recipe.

For more information about Pentaho Reporting, just visit the following wiki website:

http://wiki.pentaho.com/display/Reporting/Pentaho+Reporting+Community+Documentation

Alternatively, you can get the book Pentaho Reporting 3.5 for Java Developers (Packt Publishing) by Will Gorman.

Configuring the Pentaho BI Server for running PDI jobs and transformations

The Pentaho BI Server is a collection of software components that provide the architecture and infrastructure required to build business intelligence solutions. With the Pentaho BI Server, you are able to run reports, visualize dashboards, schedule tasks, and more. Among these tasks, there is the ability to run Kettle jobs and transformations. This recipe shows you the minor changes you might have to make in order to be able to run Kettle jobs and transformations.

Getting ready

In order to follow this recipe, you will need some experience with the Pentaho BI Server.

For configuring the Pentaho BI server, you obviously need the software. You can download the latest version of the Pentaho BI Server from the following URL:

http://sourceforge.net/projects/pentaho/files/Business%20Intelligence%20Server/

Make sure you download the distribution that matches your platform.

If you intend to run jobs and transformations from a Kettle repository, then make sure you have the name of the repository and proper credentials (user and password).

How to do it...

Carry out the following steps:

  1. If you intend to run a transformation or a job from a file, skip to the How it works section.
  2. Edit the settings.xml file located in the \biserver-ce\pentaho-solutions\system\kettle folder inside the Pentaho BI Server installation folder.
  3. In the repository.type tag, replace the default value files with rdbms. Provide the name of your Kettle repository and the user and password, as shown in the following example:

    <kettle-repository>
      <!-- The values within <properties> are passed directly to the Kettle Pentaho components. -->
      <!-- This is the location of the Kettle repositories.xml file;
           leave empty if the default $HOME/.kettle/repositories.xml is used -->
      <repositories.xml.file></repositories.xml.file>
      <repository.type>rdbms</repository.type>
      <!-- The name of the repository to use -->
      <repository.name>pdirepo</repository.name>
      <!-- The name of the repository user -->
      <repository.userid>dev</repository.userid>
      <!-- The password -->
      <repository.password>1234</repository.password>
    </kettle-repository>

  4. Start the server. It will be ready to run jobs and transformations from your Kettle repository.

How it works...

The Pentaho BI Server already includes the Kettle libraries needed to run Kettle transformations and jobs, so it is ready to run both from files out of the box. If you intend to use a repository, then you have to provide the repository settings. In order to do this, you just have to edit the settings.xml file, as you did in the recipe.

There's more...

To avoid any inconvenience, be sure that the versions of the libraries included are the same as or newer than the version of Kettle you are using. For instance, Pentaho BI Server 3.7 includes Kettle 4.1 libraries. If you are using a different version of the server, then you can verify the Kettle version by looking in the following folder:

\biserver-ce\tomcat\webapps\pentaho\WEB-INF\lib

This folder is inside the server installation folder. You should look for files named kettle-core-TRUNK-SNAPSHOT.jar, kettle-db-TRUNK-SNAPSHOT.jar, and kettle-engine-TRUNK-SNAPSHOT.jar.

Unzip any of them and look for the META-INF\MANIFEST.MF file. There, you will find the Kettle version. You will see a line like this: Implementation-Version: 4.1.0.
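
If you prefer the command line, you can read the manifest without fully extracting the jar. This is only a sketch and assumes a Linux or Mac system with the unzip utility available:

cd biserver-ce/tomcat/webapps/pentaho/WEB-INF/lib
unzip -p kettle-core-TRUNK-SNAPSHOT.jar META-INF/MANIFEST.MF | grep Implementation-Version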

There is even an easier way: In the Pentaho User Console (PUC), look for the option 2. Get Environment Information inside the Data Integration with Kettle folder of the BI Developer Examples solution; run it and you will get detailed information about the Kettle environment.

For your information, the transformation that is run behind the scenes is GetPDIEnvironment.ktr, located in the biserver-ce\pentaho-solutions\bi-developers\etl folder.


Executing a PDI transformation as part of a Pentaho process

Everything in the Pentaho platform is made of action sequences. An action sequence is, as its name suggests, a sequence of atomic actions that together accomplish small processes. Those atomic actions cover a broad spectrum of tasks, for example, getting data from a table in a database, running a piece of JavaScript code, launching a report, sending e-mails, or running a Kettle transformation.

For this recipe, suppose that you want to run the sample transformation to get the current weather conditions for some cities. Instead of running this from the command line, you want to interact with this service from the PUC. You will do it with an action sequence.

Getting ready

In order to follow this recipe, you will need a basic understanding of action sequences and at least some experience with the Pentaho BI Server and Pentaho Design Studio, the action sequences editor.

Before proceeding, make sure that you have a Pentaho BI Server running. You will also need Pentaho Design Studio. You can download the latest version from the following URL:

http://sourceforge.net/projects/pentaho/files/Design%20Studio/

Finally, you will need the sample transformation weather.ktr.

How to do it...

This recipe is split into two parts: First, you will create the action sequence, and then you will test it from the PUC. So carry out the following steps:

  1. Launch Design Studio. If this is your first use, then create the solution project where you will save your work.
  2. Copy the sample transformation to the solution folder.
  3. Create a new action sequence and save it in your solution project with the name weather.xaction.
  4. Define two inputs that will be used as the parameters for your transformation: city_name and temperature_scale.
  5. Add two Prompt/Secure Filter actions and configure them to prompt for the name of the city and the temperature scale.
  6. Add a new process action by selecting Get Data From | Pentaho Data Integration.
  7. Now, you will fill in the Input Section of the process action configuration. Give the process action a name.
  8. For Transformation File, type solution:weather.ktr. For Transformation Step, type current_conditions_normalized and for Kettle Logging Level, type or select basic.
  9. In the Transformation Inputs, add the inputs city_name and temperature_scale.
  10. Select the XML source tab.
  11. Search for the <action-definition> tag that contains the following line:

    <component-name>KettleComponent</component-name>

  12. You will find something like this:

    <action-definition>
      <component-name>KettleComponent</component-name>
      <action-type>looking for the current weather</action-type>
      <action-inputs>
        <city_name type="string"/>
        <temperature_scale type="string"/>
      </action-inputs>
      <action-resources>
        <transformation-file type="resource"/>
      </action-resources>
      <action-outputs/>
      <component-definition>
        <monitor-step><![CDATA[current_conditions]]></monitor-step>
        <kettle-logging-level><![CDATA[basic]]></kettle-logging-level>
      </component-definition>
    </action-definition>

  13. Below <component-definition>, type the following:

    <set-parameter>
    <name>TEMP</name>
    <mapping>temperature_scale</mapping>
    </set-parameter>
    <set-argument>
    <name>1</name>
    <mapping>city_name</mapping>
    </set-argument>

    In fact, you can type this anywhere between <component-definition> and </component-definition>. The order of the internal tags is not important.

  14. Go back to the tab named 2. Define Process.
  15. Now, fill in the Output Section of the Process Data Integration process action. For Output Rows Name, type weather_result and for Output Rows Count Name, type number_of_rows.
  16. Below the Process Data Integration process action, add an If Statement. As the condition, type number_of_rows==0.
  17. Within the If Statement, add a Message Template process action.
  18. In the Text frame, type No results for the city {city_name}. For Output Name, type weather_result.
  19. Finally, in the Process Outputs section of the action sequence, add weather_result as the only output.
  20. Review the action sequence: it should contain the two prompt actions, the Pentaho Data Integration process action followed by the If Statement with the Message Template, and weather_result as the process output.

  21. Save the file.

Now, it is time to test the action sequence that you just created.

  1. Log in to the PUC and refresh the repository, so that the weather.xaction that you just created shows up.
  2. Browse the solution folders, look for the xaction, and double-click on it.
  3. Provide a name of a city and change the temperature scale, if you wish.
  4. Click on Run; you will see the current weather conditions for the city you entered.

  5. You can take a look at the Pentaho console to see the log of the transformation running behind the scenes.
  6. Run the action sequence again, but this time type the name of a fictional city, for example, my_invented_city. You will see the following message:

    Action Successful
    weather_result=No results for the city my_invented_city

How it works...

You can run Kettle transformations as part of an action sequence by using the Pentaho Data Integration process action located within the Get Data From category of process actions.

The main task of a PDI process action is to run a Kettle transformation. In order to do that, it has a list of checks and textboxes where you specify everything you need to run the transformation and everything you want to receive back after having run it.

The most important setting in the PDI process action is the name and location of the transformation to be executed. In this example, you had a .ktr file in the same location as the action sequence, so you simply typed solution: followed by the name of the file.

Then, in the Transformation Step textbox, you specified the name of the step in the transformation that would give you the results you needed. The PDI process action (just as any regular process action) is capable of receiving input from the action sequence and returning data to be used later in the sequence. Therefore, in the drop-down list in the Transformation Step textbox, you could see the list of available action sequence inputs. In this case, you just typed the name of the step.

If you are not familiar with action sequences, note that the drop-down list in the Transformation Step textbox is not the list of available steps. It is the list of available action sequence inputs.

You have the option of specifying the Kettle log level. In this case, you selected Basic. This was the level of log that Kettle wrote to the Pentaho console. Note that in this case, you also have the option of selecting an action sequence input instead of one of the log levels in the list.

As said earlier, the process action can use any inputs from the action sequence. In this case, you used two inputs: city_name and temperature_scale. Then you passed them to the transformation in the XML code:

  • By putting temperature_scale between <set-parameter></set-parameter>, you passed the temperature_scale input to the transformation as the value for the named parameter TEMP.
  • By putting city_name between <set-argument></set-argument>, you passed the city_name input as the first command-line argument.
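
Putting both mappings together with the tags that Design Studio had already generated, the <component-definition> element from the recipe ends up looking like this:

<component-definition>
  <monitor-step><![CDATA[current_conditions]]></monitor-step>
  <kettle-logging-level><![CDATA[basic]]></kettle-logging-level>
  <set-parameter>
    <name>TEMP</name>
    <mapping>temperature_scale</mapping>
  </set-parameter>
  <set-argument>
    <name>1</name>
    <mapping>city_name</mapping>
  </set-argument>
</component-definition>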

As mentioned, the process can return data to be used later in the sequence. The textboxes in the Output Section are meant to do that. Each textbox you fill in will be a new data field to be sent to the next process action. In this case, you defined two outputs: weather_result and number_of_rows. The first contains the dataset that comes out of the step you defined in Transformation Step; in this case, current_conditions_normalized. The second has the number of rows in that dataset.

You used those outputs in the next process action. If number_of_rows was equal to zero, then you would overwrite the weather_result data with a message to be displayed to the user.

Finally, you added the weather_result as the output of the action sequence, so that the user either sees the current conditions for the required city, or the custom message indicating that the city was not found.

There's more...

The following are some variants in the use of the Pentaho Data Integration process action:

Specifying the location of the transformation

When your transformation is in a file, you specify the location by typing or browsing for the name of the file. You have to provide the name relative to the solution folder. In the recipe, the transformation was in the same folder as the action sequence, so you simply typed solution: followed by the name of the transformation including the extension ktr.

If instead of having the transformation in a file it is located in a repository, then you should check the Use Kettle Repository option. The Transformation File textbox will be replaced with two textboxes named Directory and Transformation File. In these textboxes, you should type the name of the folder and the transformation exactly as they are in the repository. Alternatively, you can select the names from the available drop-down lists.

In these drop-down lists, you will not see the available directories and transformations in the repository. The lists are populated with the available action sequence inputs. This also applies to specifying the location of a job in an action sequence.

Supplying values for named parameters, variables, and arguments

If your transformation defines or needs named parameters, Kettle variables, or command-line arguments, you can pass them from the action sequence by mapping KettleComponent inputs.

First of all, you need to include them in the Transformation Inputs section. This is equivalent to typing them inside the KettleComponent action-definition XML element.

Then, depending on the kind of data to pass (a named parameter, a variable, or a command-line argument), you have to define a different element inside <component-definition>, as described next.

In the recipe, you mapped one command-line argument and one named parameter.

With the following lines, you mapped the input named temperature_scale with the named parameter TEMP:

<set-parameter>
<name>TEMP</name>
<mapping>temperature_scale</mapping>
</set-parameter>

In the case of a variable, the syntax is exactly the same.

In the case of arguments, instead of a name you have to provide the position of the argument: 1, 2, and so on.

Design Studio does not implement the capability of mapping inputs with variables or named parameters. Therefore, you have to type the mappings in the XML code. If you just want to pass command-line arguments, then you can skip this task because by default, it is assumed that the inputs you enter are command-line arguments.
This way of providing values for named parameters, variables, and command-line arguments also applies to jobs executed from an action sequence.

Keeping things simple when it's time to deliver a plain file

Reporting is a classic way of delivering data. In the PUC, you can publish not only Pentaho reports, but also third-party ones, for example, Jasper reports. However, what if the final user simply wants a plain file with some numbers in it? You can avoid the effort of creating it with a reporting tool. Just create a Kettle transformation that does it and call it from an action sequence, in the same way you did in the recipe. This practical example is clearly explained by Nicholas Goodman in his blog post Self Service Data Export using Pentaho. The following is the link to that post, which also includes sample code for downloading:

http://www.nicholasgoodman.com/bt/blog/2009/02/09/self-service-dataexport-using-pentaho/

Executing a PDI job from the Pentaho User Console

The Pentaho User Console (PUC) is a web application included with the Pentaho Server, conveniently built for you to generate reports, browse cubes, explore dashboards, and more. Among the tasks you can do is running Kettle jobs. As said in the previous recipe, everything in the Pentaho platform is made up of action sequences. Therefore, if you intend to run a job from the PUC, you have to create an action sequence that does it.

For this recipe, you will use a job which simply deletes all files with extension tmp found in a given folder. The objective is to run the job from the PUC through an action sequence.

Getting ready

In order to follow this recipe, you will need a basic understanding of action sequences and at least some experience with the Pentaho BI Server and Pentaho Design Studio, the action sequences editor.

Before proceeding, make sure you have a Pentaho BI Server running. You will also need Pentaho Design Studio; you can download the latest version from the following URL:

http://sourceforge.net/projects/pentaho/files/Design%20Studio/

Besides, you will need a job like the one described in the introduction of the recipe. The job should have a named parameter called TMP_FOLDER and simply delete all files with extension .tmp found in that folder.

You can develop the job before proceeding (call it delete_files.kjb), or download it from the book's site.

Finally, pick a directory on your computer (or create one) containing some .tmp files to delete.
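
Before embedding the job in an action sequence, you may want to check it on its own by running it with Kitchen, the Kettle command-line tool for jobs. This is only a sketch; the script name depends on your platform and the job file is assumed to be in the current directory:

sh kitchen.sh -file=delete_files.kjb -param:TMP_FOLDER=/tmp/myfolder

On Windows, you would use Kitchen.bat and a path such as c:\myfolder instead.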

How to do it...

This recipe is split into two parts: First, you will create the action sequence, and then you will test it from the PUC.

  1. Launch Design Studio. If it is the first time you do it, create the solution project where you will save your work.
  2. Copy the sample job to the solution folder.
  3. Create a new action sequence and save it in your solution project with the name delete_files.xaction.
  4. Define an input that will be used as the parameter for your job: folder. As Default Value, type the name of the folder with the .tmp files, for example, c:\myfolder.
  5. Add a process action by selecting Execute | Pentaho Data Integration Job.
  6. Now, you will fill in the Input Section of the process action configuration. Give the process action a name.
  7. As Job File, type solution:delete_files.kjb.
  8. In the Job Inputs, add the only input you have: folder.
  9. Select the XML source tab.
  10. Search for the <action-definition> tag that contains the following line:

    <component-name>KettleComponent</component-name>

  11. You will find something similar to the following:

    <action-definition>
    <component-name>KettleComponent</component-name>
    <action-type>Pentaho Data Integration Job</action-type>
    <action-inputs>
    <folder type="string"/>
    </action-inputs>
    <action-resources>
    <job-file type="resource"/>
    </action-resources>
    <action-outputs/>
    <component-definition/>
    </action-definition>

  12. Replace the <component-definition/> tag with the following:

    <component-definition>
    <set-parameter>
    <name>TMP_FOLDER</name>
    <mapping>folder</mapping>
    </set-parameter>
    </component-definition>

  13. Save the file.

Now, it is time to test the action sequence that you just created.

  1. Log in to the Pentaho BI Server and refresh the repository.
  2. Browse the solution folders and look for the delete_files action you just created. Double-click on it.
  3. You should see a window with the message Action Successful.
  4. You can take a look at the Pentaho console to see the log of the job.
  5. Take a look at the folder defined in the input of your action sequence. There should be no .tmp files left.

How it works...

You can run Kettle jobs as part of an action sequence by using the Pentaho Data Integration Job process action located within the Execute category of process actions.

The main task of a PDI Job process action is to run a Kettle job. In order to do that, it has a series of checks and textboxes where you specify everything you need to run the job, and everything you want to receive back after having run it.

The most important setting in the PDI process action is the name and location of the job to be executed. In this example, you had a .kjb file in the same location as the action sequence, so you simply typed solution: followed by the name of the file.

You can specify the Kettle log level, but it is not mandatory. In this case, you left the log level empty. The log level you select here (or Basic, by default) is the level of log that Kettle writes to the Pentaho console when the job runs.

Besides the name and location of the job, you had to provide the name of the folder needed by the job. In order to do that, you created an input named folder and then passed it to the job. You did it in the XML code by putting the name of the input enclosed between <set-parameter> and </set-parameter>.
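
After the replacement done in step 12, the complete KettleComponent definition in the XML source looks like this:

<action-definition>
  <component-name>KettleComponent</component-name>
  <action-type>Pentaho Data Integration Job</action-type>
  <action-inputs>
    <folder type="string"/>
  </action-inputs>
  <action-resources>
    <job-file type="resource"/>
  </action-resources>
  <action-outputs/>
  <component-definition>
    <set-parameter>
      <name>TMP_FOLDER</name>
      <mapping>folder</mapping>
    </set-parameter>
  </component-definition>
</action-definition>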

When you ran the action sequence, the job was executed, deleting all .tmp files in the given folder.

Note that the action sequence in this recipe has just one process action (the PDI Job). This was done on purpose to keep the recipe simple, but it could have had other actions as well, just like any action sequence.

There's more...

The main reason for embedding a job in an action sequence is for scheduling its execution with the Pentaho scheduling services. This is an alternative approach to the use of a system utility such as cron in Unix-based operating systems or the Task Scheduler in Windows.


Generating files from the PUC with PDI and the CDA plugin

As you know, PDI allows you to generate Excel, CSV, XML, and, starting with the latest version, JSON files. You do it with a transformation that has to be executed from Spoon or the command line. There is a quicker way to generate those kinds of files in an interactive fashion from the Pentaho User Console (PUC). This recipe teaches you how to do it by using the Community Data Access (CDA) plugin.

You will experiment with the CDA Editor and the CDA Previewer, querying the current weather conditions in a given city. Then, you will learn how to export the results to different formats. You will do all of that from the PUC.

Getting ready

In order to follow this recipe, you will need some experience with the Pentaho BI Server.

Regarding the software, you will need a Pentaho BI Server running. You will also need the CDA plugin. You can download the installer from http://cda.webdetails.org or the source code from http://code.google.com/p/pentaho-cda/

The Community Dashboard Editor (CDE) includes CDA. Therefore, if you have CDE installed, just skip the CDA installation.

Finally, you will need the sample transformation weather_np.ktr.

How to do it...

This recipe is split into two parts: In the first part, you will create a CDA file for getting the data you need. In the second part, you will export the results.

So, carry out the following steps in order to complete the first part:

  1. Create the solution project where you will save your work.
  2. Copy the weather_np.ktr transformation into your project folder.
  3. Also inside that folder, create an empty file with the .cda extension. Name it weather.cda.
  4. Log in to the Pentaho User Console and refresh the repository. You should see the solution folder with the file that you just created.
  5. Right-click on the file and select Edit. A new tab window should appear with the CDA Editor ready to edit your CDA file.

  6. The black area is where you will type the CDA file content. Type the skeleton of the file:

    <?xml version="1.0" encoding="UTF-8"?>
    <CDADescriptor>
    <DataSources>
    </DataSources>
    </CDADescriptor>

  7. Inside the <DataSources> tag, type the connection to the Kettle transformation:

    <Connection id="weather" type="kettle.TransFromFile">
    <KtrFile>weather_np.ktr</KtrFile>
    <variables datarow-name="CITY"/>
    <variables datarow-name="Scale" variable-name="TEMP"/>
    </Connection>

  8. Now you will define a data access to that datasource. In CDA terminology, this is a query over the preceding connection. Below the closing tag </DataSources>, type the following:

    <DataAccess access="public"
    cache="true"
    cacheDuration="3600"
    connection="weather"
    id="current"
    type="kettle">
    <Columns/>
    <Parameters>
    <Parameter default="Lisbon, Portugal"
    name="CITY"
    type="String"/>
    <Parameter default="C"
    name="Scale"
    type="String"/>
    </Parameters>
    <Query>current_conditions_normalized</Query>
    <Output indexes="0,1"/>
    </DataAccess>

  9. Click on the Save button located at the top of the editor window.
  10. Click on Preview and a new window is displayed with the CDA Previewer.
  11. In the drop-down list, select the data access that you just defined.
  12. Take a look at the Pentaho server console. You should see how the weather_np transformation is being executed. The following is an excerpt of that log:

    ... - weather_np - Dispatching started for transformation [weather_np]
    ... - Get ${TEMP}, ${SPEED} and ${CITY} - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
    ... - key & days & format - Finished processing (I=0, O=0, R=1, W=1, U=0, E=0)
    ... - worldweatheronline - Finished processing (I=0, O=0, R=1, W=2, U=0, E=0)
    ... - ...
    ... - current_conditions_normalized - Finished processing (I=0, O=0, R=11, W=11, U=0, E=0)

  13. In the previewer, you will see the results of the CDA query displayed as a table: the current conditions for the selected city.

  14. Try changing the values for the parameters: city and temperature scale. Click on Refresh and the data should be refreshed accordingly.

    Now that you have a CDA file with a connection and a data access, let's export some results to .csv format.

  15. Copy the URL of the previewer and paste it into a new tab window of your browser. Assuming that you are running the server on localhost, and your solution is in the folder pdi_cookbook/CDA, the complete URL should look like the following:

    http://localhost:8080/pentaho/content/cda/previewQuery?path=pdi_cookbook/CDA/weather.cda

    When you double-click on the CDA file, the editor opens in a tab inside the PUC, which prevents you from copying the URL.

    In order to be able to copy the URL, double-click on the tab that contains the CDA Editor. Alternatively, you can right-click on the CDA file and select Open in New Window.

  16. In the URL, replace previewQuery with doQuery.
  17. At the end of the URL, add the following:

    &dataAccessId=current
    &paramCITY=Buenos Aires, Argentina
    &paramScale=F
    &outputType=csv

    These parameters are written on four lines for clarity. You should type them all on a single line, one after the other (the complete URL is shown after these steps).

  18. Press Enter. A CSV file will be generated that should look like the following:

    FEATURE;VALUE
    City;Buenos Aires, Argentina
    Observation time;06:25 PM
    Weather description;Moderate or heavy rain in area
    Temperature;91
    Wind speed;24
    Wind direction;W
    Precipitation;0.2
    Humidity;53
    Visibility;3
    Pressure;9982
    Cloud Cover;100
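
Putting the pieces from steps 15 to 17 together, the complete doQuery URL looks like the following; it goes on a single line, and the browser will normally take care of encoding the spaces and the comma in the city name:

http://localhost:8080/pentaho/content/cda/doQuery?path=pdi_cookbook/CDA/weather.cda&dataAccessId=current&paramCITY=Buenos Aires, Argentina&paramScale=F&outputType=csv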

How it works...

In this recipe, you exported the results of a Kettle transformation to a .csv file from the PUC. In order to do that, you used CDA.

First, you created a CDA file. You edited it with the CDA Editor. The purpose of this file was to firstly define a connection to the Kettle transformation and then a query over that connection. Let's explain them in detail.

The connection or CDA DataSource is the definition of the source of your data, in this case a Kettle transformation. The following is the definition of your connection:

<Connection id="weather" type="kettle.TransFromFile">
<KtrFile>weather_np.ktr</KtrFile>
<variables datarow-name="CITY"/>
<variables datarow-name="Scale" variable-name="TEMP"/>
</Connection>

The id must be unique in the CDA file. type="kettle.TransFromFile" identifies this as a connection to a Kettle transformation. You provide the name of the Kettle transformation inside the tags <KtrFile></KtrFile>.

Then you have the variables. The variables are the means for passing parameters from CDA to Kettle; datarow-name is the name you use in the CDA file, while variable-name is the name of the Kettle named parameter. For example, the named parameter TEMP defined in the transformation is referred to as Scale inside the CDA file.

If both names coincide, then you can just put the datarow-name and omit the variable-name, as in the case of the CITY variable.

Each variable you define in a CDA connection of type kettle.TransFromFile must be defined as a named parameter in the transformation. That was the reason for using the transformation weather_np.ktr instead of weather.ktr.

Now, let's see the CDA query. A CDA DataAccess is a query over a CDA Datasource. In the case of Kettle, a CDA DataAccess has the details of a Kettle transformation execution. A CDA query is enclosed inside a tag named DataAccess:

<DataAccess access="public"
cache="true"
cacheDuration="3600"
connection="weather"
id="current"
type="kettle">
...
</DataAccess>

Here you define a unique data access id (id="current"), the data access type (type="kettle"), and the connection (connection="weather"). The connection must be declared earlier in the same file.

The <Columns/> tag is useful if you want to rename the columns or perform operations between them, which was not the case here. Therefore, you left it empty.

Then you have a parameters section:

<Parameters>
<Parameter default="Lisbon, Portugal"
name="CITY"
type="String"/>
<Parameter default="C"
name="Scale"
type="String"/>
</Parameters>

Here you define one <Parameter> tag for each parameter you want to pass to the transformation. The parameters you type here will be available in the CDA Previewer for you to change.

In order to be able to pass the parameters, they have to be defined in the DataSource as explained earlier.

Inside the <Query></Query> tag, you type the name of the transformation step that returns the data you need. The sample transformation has three steps that return data: current_conditions, current_conditions_normalized, and next_days. You typed the second of these steps.

Finally, the <Output> tag is meant to indicate which columns you want and in which order. The output fields of the current_conditions_normalized step are FEATURE and VALUE. You wanted both fields and in the same order, therefore, you typed indexes="0,1".
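
For reference, this is what the complete weather.cda file looks like once the connection and the data access shown above are put together:

<?xml version="1.0" encoding="UTF-8"?>
<CDADescriptor>
  <DataSources>
    <Connection id="weather" type="kettle.TransFromFile">
      <KtrFile>weather_np.ktr</KtrFile>
      <variables datarow-name="CITY"/>
      <variables datarow-name="Scale" variable-name="TEMP"/>
    </Connection>
  </DataSources>
  <DataAccess access="public" cache="true" cacheDuration="3600"
              connection="weather" id="current" type="kettle">
    <Columns/>
    <Parameters>
      <Parameter default="Lisbon, Portugal" name="CITY" type="String"/>
      <Parameter default="C" name="Scale" type="String"/>
    </Parameters>
    <Query>current_conditions_normalized</Query>
    <Output indexes="0,1"/>
  </DataAccess>
</CDADescriptor>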

You can edit the CDA files in any text editor. Just remember that if you do not use the CDA Editor, then you should periodically refresh the solution repository in order to be able to preview them.

Once you created the contents of the file, you saved it, and previewed the results with the CDA Previewer. This previewer shows the results as a table.

After previewing the results, you experimented with the doQuery feature of the CDA API.

The doQuery function allows you to export the results of a DataAccess to different formats. In order to run a doQuery, you have to provide the following parameters:

  • path: the location of the .cda file inside the solution (pdi_cookbook/CDA/weather.cda in the recipe)
  • dataAccessId: the id of the DataAccess to run (current in the recipe)
  • param followed by the parameter name, one for each parameter you want to pass (paramCITY and paramScale in the recipe)
  • outputType: the format you want for the results (csv in the recipe)

The parameters you provided in the recipe meant: Run the DataAccess with id current, supplying the parameter CITY with value Buenos Aires, Argentina and Scale with value F, and give me a csv file with the results.

There's more...

CDA is a plugin for the Pentaho BI Suite developed by Webdetails, one of the main Pentaho Community Contributors.

CDA was designed as an abstraction tool between sources of information, for example, Kettle transformations, databases, or Mondrian cubes, and the CDF. As such, it is mainly used in the context of Pentaho Dashboards.

However, it also serves for exporting data to different formats from the PUC. That was exactly what you did in the recipe.

If you are interested in knowing more about CDA, then you will find the complete documentation at the following URL: http://cda.webdetails.org.

Populating a CDF dashboard with data coming from a PDI transformation

A dashboard, in its broad sense, is an application that shows you visual indicators, for example, bar charts, traffic lights, or dials. A CDF dashboard is a dashboard created with the Community Dashboard Framework. CDF accepts many kinds of data sources, the output of a Kettle transformation being one of them.

In this recipe, you will create a very simple dashboard that shows this capability. You will type the name of a city and the dashboard will graphically display the 5-day forecast for that city. The forecast information will be obtained with the sample transformation explained in the introduction.

Getting ready

In order to follow this recipe, you will need minimal experience with the Pentaho BI Server. Some experience with the Community Dashboard Editor (CDE) is also desirable.

Before proceeding, make sure you have a Pentaho BI Server running. You will also need the CDE. You can download it from http://cde.webdetails.org. To install it, simply unzip the downloaded material and follow the instructions in the INSTALL.txt file.

Finally, you will need the sample transformation weather_np.ktr.

How to do it...

Carry out the following steps:

  1. Log into the Pentaho User Console.
  2. Create the solution folder where you will save your work.
  3. Copy the sample transformation to the solution folder and refresh the repository.
  4. From the File menu, select New | CDE Dashboard or click on the CDE icon in the toolbar.
  5. Save the dashboard in the solution folder that you just created, close the dashboard window, and refresh the repository. A new file with extension wcdf will appear in the solution folder.
  6. Go to the solution folder, right-click on the dashboard file and select Edit. The dashboard editor will appear. Maximize the window, so that you can work more comfortably.
  7. Define the dashboard layout by adding rows and columns from the layout toolbar. Make sure the layout includes two named panels, filter_panel and chart_panel, to hold the text input and the chart respectively.

Now, let's add the visual elements of the dashboard.

  1. Click on Components from the menu at the top-right area of the editor.
  2. From the Generic category, add a Simple parameter. Name it city_param and type Lisbon, Portugal for Property value.
  3. From the Selects category, add a TextInput Component. Name it city_textbox. For Parameter, select city_param and for HtmlObject, select filter_panel.
  4. Click on Save on the menu at the top of the editor.
  5. Click on Preview; you will see the dashboard with a text box prompting for the city, showing the default value Lisbon, Portugal.
  6. Close the preview window.
  7. Now, you will add the chart that will display the forecast. From the Charts category, add a CCC Bar Chart.
  8. Fill in the properties as follows:
    • For Name, type forecast_bars
    • For Width, type 350
    • For Height, type 250
    • For Datasource, type forecast
    • For Crosstab mode, select True
    • For Title, type 5-days forecast
    • For HtmlObject, select chart_panel
    • For Listeners, select city_param
  9. Click on the Parameters property and, in the window that displays, add one parameter. For Arg0, type CITY, for Val0, type city_param, and then click on Ok.

Finally, you have to create the datasource for that chart: forecast. The following steps will do it:

  1. Click on Data Sources from the menu at the top-right area of the editor. In the list of available data sources, click on KETTLE Queries and select kettle over kettleTransFromFile. A new datasource will be added.
  2. Fill in the list of properties as explained in the following steps:
  3. For Name, type forecast.
  4. For Kettle Transformation File, type weather_np.ktr.
  5. Click on Variables and in the window that displays, click on Add. For Arg0, type CITY and click on Ok.
  6. Click on Parameters and in the window that displays, click on Add.
    • For Arg0, type CITY
    • For Val0 type Lisbon, Portugal
    • For Type0 leave the default String and click on Ok.
  7. Click on Output Options and in the window that shows up, click on Add three times. For Arg0, Arg1, and Arg2, type 1, 2, and 3 respectively and click on Ok.
  8. Click on Column Configurations and in the window that displays, click on Add twice. In the first row, type 2 for Arg0 and MIN for Val0. In the second row, type 3 for Arg1 and MAX for Val1.
  9. Click on the little square next to the Query property. The Sql Wizard shows up. In the editing area, type next_days and click on Ok.
  10. Save the dashboard by clicking on Save, and click on Preview to see the dashboard. You should see the text box for the city and a bar chart with the 5-day forecast for the default city.

  11. If you take a look at the Pentaho console, then you will see the log of the Kettle transformation that is executed.
  12. Try changing the value for the city and press Enter. The chart should be refreshed accordingly.

How it works...

In this recipe, you created a very simple dashboard. The dashboard allows you to enter the name of a city and then refreshes a bar chart displaying the 5-day forecast for that city. The special feature of this dashboard is that it gets data from a web service through the execution of a Kettle transformation.

In order to use the output of your Kettle transformation as data source, you just have to add a new datasource from KETTLE Queries | kettle over kettleTransFromFile and configure it properly. This configuration involves providing the following properties:

  • Name: the name of the datasource, as it will be referenced by the dashboard components (forecast in the recipe)
  • Kettle Transformation File: the transformation to run (weather_np.ktr)
  • Variables: the named parameters of the transformation that the dashboard will supply (CITY)
  • Parameters: the parameters of the datasource, along with their default values (CITY, with Lisbon, Portugal as default)
  • Output Options: the indexes of the columns to include in the result
  • Column Configurations: optional names for those columns (MIN and MAX for columns 2 and 3 in the recipe)
  • Query: the name of the step that delivers the data (next_days)

Once you configured your Kettle transformation as a datasource, it was ready to be used in the components of your dashboard.

There's more...

CDF is a community project whose main objective is to integrate dashboards into the Pentaho solution repository structure. In the recipe, you used the CDE, which is a graphical editor that complements the power of the CDF engine. With CDE, you can create dashboards without having to get involved in the low-level details of the CDF files, thus focusing on the business requirements.

Kettle is just one of several kinds of data sources accepted by CDF. Behind the scenes, most of the data source definitions are saved in a CDA file.

If you already have a CDA file that has a data access for your transformation, then you can avoid configuring the data source twice and use the Community Data Access | CDA Datasource instead.

CDF is bundled with the Pentaho BI Suite, but maintained by Webdetails with the help of the community. For more information about CDF, see the full documentation here: http://cdf.webdetails.org. For more on CDE visit http://cde.webdetails.org.

Summary

This article covered integrating Kettle and the Pentaho Suite and showed how to run Kettle jobs and transformations in that context.


About the Authors


Adrián Sergio Pulvirenti

Adrián Sergio Pulvirenti was born in Buenos Aires, Argentina, in 1972. He earned his Bachelor's degree in Computer Sciences at UBA, one of the most prestigious universities in South America.

He has dedicated more than 15 years to developing desktop and web-based software solutions. Over the last few years he has been leading integration projects and development of BI solutions.

María Carina Roldán

María Carina, born in Esquel, Argentina, earned her Bachelor's degree in Computer Science at UNLP in La Plata and then moved to Buenos Aires where she has been living since 1994.

She has worked as a BI consultant for almost 15 years. Over the last six, she has been dedicated full time to developing BI solutions using the Pentaho Suite. Currently, she works for Webdetails—a Pentaho company—as an ETL specialist.

She is the author of Pentaho 3.2 Data Integration Beginner’s Guide book published by Packt Publishing in April 2009 and co-author of Pentaho Data Integration 4 Cookbook, also published by Packt Publishing in June 2011.

