Packt+ | Advance your knowledge in tech

You're reading from Instant Pentaho Data Integration Kitchen

Product type Book

Published in Jul 2013

Publisher Packt

ISBN-13 9781849696906

Pages 68 pages

Edition 1st Edition

Languages

Java

Concepts

Business Intelligence

Author (1):

Sergio Ramazzina

Table of Contents (7) Chapters

Instant Pentaho Data Integration Kitchen

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

1. Instant Pentaho Data Integration Kitchen

Executing PDI jobs from the repository (Simple)

This recipe guides you through starting PDI jobs using Kitchen by assuming that PDI jobs are stored in the PDI repository. We will learn how to start simple jobs both with and without a set of input parameters previously defined in the job.

There are different versions of repositories that you can use depending on your needs: filesystem, database, and Pentaho repository.

The filesystem repository is a very easy form of repository; it lets you store your files at a specified location in the filesystem.
The database repository is another simple form repository that uses database tables as a way to have a unique central objects store. It is usually an authenticated repository and does not have any kind of versioning or locking mechanism on the object, so it is not recommended for team development.
The last type of repository, the Pentaho repository, is a sophisticated repository environment for safely storing our BI object (and so our ETL processes) available only for the Enterprise Edition (EE) of PDI.

Getting ready

To play with this recipe, I prepared a filesystem repository in the <book_samples>/sample3 directory with all the related transformations and the job in it. The repository is located in the samples/sample3 directory. To have our example working fully, we need to add the filesystem repository (see the How to define a filesystem repository section under the There's more… section of this recipe for details about it) to PDI by specifying the complete path to the <book_samples>/sample3 directory in the base directory field.

Check if the JAVA_HOME environment variable is set properly and configure your environment variables so that the Kitchen script can start from anywhere without specifying the complete path to your PDI home directory. For details about these checks, refer to the Executing PDI jobs from a filesystem (Simple) recipe.

How to do it...

For starting a PDI job from a filesystem repository in Linux or Mac, perform the following steps:

Let's start our sample job from the repository; to do this, we need to use two new arguments:
- The –rep argument identifies the repository we are going to connect to; the syntax is as follows:
```
–rep: <id_of_the_repository>
```
- The –job argument specifies the name of the job that needs to be started; the syntax is as follows:
```
–job: <name_of_the_job>
```
So, because the ID of our repository is sample3, we'll start the sample job with the following syntax:
```
$ kitchen.sh –rep:sample3 –job: export-job
```
Suppose that the job is not located in the root directory of our repository, but is stored in the pdi_book directory; to start it, we must specify the correct path to the job in the repository with the following syntax using the –dir argument:
```
$ kitchen.sh –rep:sample3 –dir:/pdi_book  -job:export-job
```
Parameter usage from within the command-line tool is exactly the same as we saw in the previous recipes. So, if we want to extract all the customers from the country U.S.A, a sample of the command to use is as follows:
```
kitchen.sh -param:p_country=USA ––rep:sample3 –job: export-job
```

For starting a PDI job from a filesystem repository on Windows, perform the following steps:

Starting a PDI job from within a filesystem repository on Windows requires for the same rules we saw previously to be followed with the exception that, as usual, we need to specify the arguments using the / character instead of the – character that we used for Linux or Mac. So, the –rep argument will change to /rep:
```
/rep: < id_of_the_repository >
```
And the –job argument will change to /job:
```
/job: < name_of_the_job >
```
Let's suppose that the job is located in the root directory of our repository; we can call the Kitchen script using the following syntax:
```
C:\temp\samples>Kitchen.bat /rep:sample3 /job:export-job
```
About the use of PDI parameters through the command-line arguments: let's suppose we run our job by extracting all the customers for the country U.S.A; we can call our job using the following syntax:
```
C:\temp\samples>Kitchen.bat /rep:sample3 /job:export-job /param:p_country:USA
```

For starting a PDI job from a database repository, perform the following steps:

The process of starting a PDI job in a database repository is the same as in a filesystem repository. The only difference in this case is that a database repository is an authenticated source, so it requires a username and password.
The –user parameter lets you specify a valid username to connect to the repository. If you are on Linux/Mac, the syntax will be as follows:
```
–user: <repository_username>
```
For Windows, you need to change the – character to the / character; the syntax is as follows:
```
/user: <repository_username>
```
The –pass argument lets you specify a valid password for the user you are using to connect to the database repository; the syntax in this case for Linux/Mac is as follows:
```
–pass: <repository_user_password>
```
And for Windows, the syntax is as follows:
```
/pass: <repository_username>
```
Suppose that our job, called export-job.kjb, is located in a database repository instead of in a filesystem repository, and that we are connecting to the user pdiuser with the password password. The example command to start the job in Linux/Mac is as follows:
```
$ kitchen.sh –user:pdiuser –pass:password –rep:sample3 –dir:/pdi_book  -job:export-job
```
Whereas in Windows, the syntax is as follows:
```
C:\temp\samples>Kitchen.bat /user:pdiuser /pass:password /rep:sample3 /job:export-job
```

There's more...

Up until now, we worked with jobs and transformations saved as XML files in the plain filesystem. Let's discuss some important things in the following sections that will help them work properly if we were to store them in a repository. Then, we'll talk about how to define a filesystem and a database repository.

Changes in job and transformation design

When using jobs and transformations from within repositories (either database or files), things change slightly. In this case, the ETL process files relate to the root of the repository, and the internal variable Internal.Job.Filename.Directory does not make any sense. Because of this, we need to change the way our example job links jobs or transformations or any kind of file. The major changes are as follows:

The job links the transformation using the system variable Internal.Job. Repository.Directory to dynamically get the path to the job file in the repository. This system variable gives you the path to the job file being executed in the repository at runtime.
To specify the input and output path for all the files read or written from or to the outside world, we added two new parameters, p_input_directory and p_target_directory, to specify the absolute or relative path (it depends on your needs) to the input and output directories. In our examples, these parameters default to the local directory where the job or transformation starts.

How to define a filesystem repository

Creating a filesystem repository is a fairly easy task. Let's follow the ensuing summarized steps and try to define a new one:

Start Spoon and navigate to Tools | Repository | Connect. The repository connection dialog opens.
Click on the new repository button (the green circular button with the + icon in it). The Select repository type dialog box opens.
Spoon lets you decide between the two repository types. Choose the filesystem repository type and click on OK. The File repository settings dialog box opens in front of you.
Select the base directory that will contain the repository objects (in our case, it is the <book_samples>/sample3 directory).
Fill in the ID field with the ID of the repository; in our case, the ID of the repository is sample3.
Set a name for the repository. You can even use spaces in your repository name because it is just a sort of label.
Click on OK and you are done. The newly created repository will appear in the list of available repositories.

Defining a database repository

The process to define a database repository is almost identical to what we saw in the previous paragraph regarding the filesystem repository. The difference here is that PDI requires a connection definition where it can create the related repository's database tables. Let's follow the ensuing summarized steps to create a new one:

Start Spoon and navigate to Tools | Repository | Connect. The repository connection dialog opens.
Click on the new repository button (the green circular button with the + icon in it). The Select repository type dialog box opens.
Spoon lets you decide between the two repository types. Choose the database repository type and click on OK. The Repository information dialog box opens in front of you.
Choose the connection name from a list of connections. In case there is no valid connection to use, you can define it by clicking on the New button (we assume you already know how to define a connection and that you already have a connection available). Fill in the repository ID and repository name.
If you don't have an existing repository or if you are accessing an old repository version, you can respectively create the necessary tables or upgrade the existing ones by clicking on the Create or Upgrade button.
Click on OK and you are done. The newly created repository will appear in the list of available repositories.