Executing PDI jobs packaged in archive files (Intermediate)


This recipe guides you through starting a PDI job packed in an archive file (.zip or .tar.gz) using Kitchen. We will assume that the PDI job to be launched (we will call it the main job) and all the related jobs and transformations called during its execution are stored inside a .zip or .tar.gz archive file on the local filesystem; execution happens by accessing the files directly inside the archive, without unpacking them to a local directory. This is a good approach whenever you need a quick and painless way to move your ETL processes between systems: by packing everything into a single archive file, you move one file instead of a whole set of files and directories.
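
A quick way to build such an archive on Linux or Mac is sketched below; the commands assume the job and transformation files used throughout this recipe and the sample path used later on, so adapt both to your environment:

    # Pack the main job and all related files at the root of the archive
    $ cd /home/sramazzina/tmp/samples
    $ zip samples.zip export-job1.kjb read-customers1.ktr

    # Or build the equivalent .tar.gz archive
    $ tar czf samples.tar.gz export-job1.kjb read-customers1.ktr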

Getting ready

To get ready for this recipe, you need to check that the JAVA_HOME environment variable is properly set and then configure your environment variables so that the Kitchen script can start from anywhere without specifying the complete path to your PDI home directory. For details about these checks, refer to the recipe Executing PDI jobs from a filesystem (Simple).
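
As a quick reference, a minimal setup on Linux or Mac might look like the following sketch; both installation paths are assumptions and must be adapted to your machine:

    # Point JAVA_HOME at your Java installation (path is an example)
    $ export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

    # Add the PDI home directory to the PATH so that kitchen.sh and pan.sh
    # can be started from anywhere (path is an example)
    $ export PATH=$PATH:/opt/pentaho/data-integration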

To play with this recipe, you can use the samples in the directory <book_samples>/sample2; here, <book_samples> is the directory where you unpacked all the samples of the book.

How to do it...

For starting a PDI job from within a .zip or .tar.gz archive file in Linux or Mac, you can perform the following steps:

  1. Open a command-line window and go to the <book_samples>/sample2 directory.

  2. Kettle uses Apache VFS to let you access files inside an archive and use them directly in your processes; the archive's content is treated as a virtual filesystem (VFS). To access a file in the archive, you need to give its complete path as a specific URI. To access files contained in a .zip archive, the URI syntax is as follows:

    zip://arch-file-uri[!absolute-path]
    

    On the other hand, if we want to access files contained in a .tar.gz archive, we need to use the following syntax:

    tgz://arch-file-uri[!absolute-path]
    
  3. To identify which job file needs to be started, we use the -file argument as before; the difference is that its value is now a URI written with the syntax shown in the preceding step:

    -file:<complete_URI_to_job_file>
    

    Remember that because this is a URI to the file, you must always give the absolute path to the archive file, followed by the path, inside the archive, of the file you are going to start as a job.

  4. To start the sample job named export-job1.kjb from within a .zip archive, use the following syntax:

    $ kitchen.sh -file:'zip:///home/sramazzina/tmp/samples/samples.zip!export-job1.kjb'
    
  5. Parameters are passed on the command line exactly as shown in the Executing PDI jobs from a filesystem (Simple) recipe. So, if we want to extract all the customers from the country USA, the command to use is as follows:

    $ kitchen.sh -param:p_country=USA -file:'zip:///home/sramazzina/tmp/samples/samples.zip!export-job1.kjb'
    
  6. The command to start a job from within a .tar.gz archive is the same, except for the scheme used in the URI. Here is a sample of a simple job started without parameters:

    $ kitchen.sh -file:'tgz:///home/sramazzina/tmp/samples/samples.tar.gz!export-job1.kjb'
    

    The following syntax is for a simple job that has been started with parameters:

    $ kitchen.sh -param:p_country=USA -file:'tgz:///home/sramazzina/tmp/samples/samples.tar.gz!export-job1.kjb'
    
  7. Unlike the Executing PDI jobs from a filesystem (Simple) recipe, the -dir argument makes no sense here, because the archive must always be accessed through its complete URI in the -file argument.
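
Before running the preceding commands, it can be useful to check that the job file really sits where the URI's !-path expects it, that is, at the root of the archive in our samples:

    # List the entries of the .zip archive
    $ unzip -l /home/sramazzina/tmp/samples/samples.zip

    # List the entries of the .tar.gz archive
    $ tar tzf /home/sramazzina/tmp/samples/samples.tar.gz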

For starting a PDI job from within a .zip archive file in Windows, perform the following steps:

  1. Starting a PDI job from within an archive file on Windows follows the same rules we saw previously, using the same arguments in the same way.

  2. Any time we start PDI jobs on Windows, we need to prefix the arguments with the / character instead of the - character used on Linux or Mac. This means that the -file argument changes from:

    -file:<complete_URI_to_job_file>
    

    To:

    /file:<complete_URI_to_job_file>
    
  3. Go to the directory <book_samples>/sample2; to start your sample job from within the .zip archive, you can start the Kitchen script using the following syntax:

    C:\temp\samples>Kitchen.bat /file:"zip:///C:/temp/samples/samples.zip!export-job1.kjb"
    
  4. Let's try another example using parameters in the command. Go to the directory <book_samples>/sample2; to start the job so that it extracts all the customers for the country USA, you can use the following syntax:

    C:\temp\samples>Kitchen.bat /param:p_country:USA /file:"zip:///C:/temp/samples/samples.zip!export-job1.kjb"
    

For starting PDI transformations from within archive files, perform the following steps:

  1. PDI transformations are always started using Pan scripts. On Linux or Mac, you can find the pan.sh script in the PDI home directory. To start a simple transformation from within an archive file, go to the <book_samples>/sample2 directory and type the following command:

    $ pan.sh -file:'tgz:///home/sramazzina/tmp/samples/samples.tar.gz!read-customers1.ktr'
    

    Or, if you need to specify some parameters, type the following command:

    $ pan.sh -param:p_country=USA -file:'tgz:///home/sramazzina/tmp/samples/samples.tar.gz!read-customers1.ktr'
    
  2. On Windows, you can use the Pan.bat script; the sample commands to start our transformation are as follows:

    C:\temp\samples>Pan.bat /file:"zip:///C:/temp/samples/samples.zip!read-customers1.ktr"
    

    Or, if you need to specify some parameters through the command line, type the following command:

    C:\temp\samples>Pan.bat /param:p_country:USA /file:"zip:///C:/temp/samples/samples.zip!read-customers1.ktr"
    

How it works...

This way of starting jobs and transformations is possible because PDI uses the Apache VFS library to accomplish this task. Apache VFS is a library that lets you access files inside many types of archives directly, by exposing them as a virtual filesystem through a uniform set of APIs. You can find more details about the library and how it works on the Apache website at http://commons.apache.org/proper/commons-vfs.
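
To illustrate the optional !absolute-path component of the URI syntax shown earlier, here is a sketch of what a launch would look like if the main job were stored in a subdirectory inside the archive; the archive layout here is hypothetical, not one of the book samples:

    # Hypothetical archive with the main job under an internal jobs/ directory
    $ kitchen.sh -file:'zip:///home/sramazzina/tmp/samples/etl.zip!/jobs/main-job.kjb'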

There's more...

Running jobs and transformations from within archive files slightly changes the way we design them. An interesting consequence is that you can directly reference resource files packed together with your ETL process files, which lets you distribute configuration files or other kinds of resources in a uniform way. This is an easy way to have a single file containing everything our ETL process needs, making it more portable and easier to manage. The following section details the main changes applied to this new version of our sample.
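
As a sketch, packing a hypothetical etl.properties configuration file next to the ETL files only takes one more argument; because everything inside the archive is addressed relative to its root, the job would reference the resource simply by its root-relative name:

    # Add a configuration file next to the job and transformation
    $ cd /home/sramazzina/tmp/samples
    $ zip samples.zip export-job1.kjb read-customers1.ktr etl.properties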

Changes in job and transformation design

When jobs or transformations are used from inside an archive, file references are resolved relative to the root of the archive, and the internal variable ${Internal.Job.Filename.Directory} no longer makes sense. Because of this, we need to change the way our example process links any kind of file.

Look at the samples located in the directory <book_samples>/sample2; this directory contains the same transformations and jobs, modified as needed to work in this case. The changes are as follows:

  • The job links the transformation without using the internal variable ${Internal.Job.Filename.Directory} to dynamically obtain the path to the transformation file. This is because, inside the archive, the transformation sits in the root of the virtual filesystem, so its filename alone is enough.

  • Instead of using the ${Internal.Job.Filename.Directory} variable to specify the input and output paths for the files, we added two new parameters, p_input_directory and p_target_directory, that let the user specify the input and output directories. When no value is given, they default to a location local to the directory where the job starts; a sample launch command using them is shown below.
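
A minimal sketch of a launch that overrides both parameters follows; the two directory values are examples and must exist on your machine:

    $ kitchen.sh -param:p_input_directory=/home/sramazzina/tmp/input -param:p_target_directory=/home/sramazzina/tmp/output -file:'zip:///home/sramazzina/tmp/samples/samples.zip!export-job1.kjb'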
