Dealing with the execution log (Simple)


This recipe guides you through managing the PDI execution log in terms of the following aspects:

  • Setting up the verbosity level

  • Redirecting it to an output file for future reference

This recipe works the same for both Kitchen and Pan; the only difference is the name of the script used to start the process.

Getting ready

To get ready for this recipe, you need to check that the JAVA_HOME environment variable is properly set and then configure your environment variables so that the Kitchen script can start from anywhere without specifying the complete path to your PDI home directory. For details about these checks, refer to the recipe Executing PDI jobs from a filesystem (Simple).
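
For illustration only, the following is a minimal example of this setup on Linux/Mac; the paths are hypothetical and depend on where Java and PDI are installed on your machine:

    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
    export PATH=$PATH:/opt/pentaho/data-integration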

How to do it...

For changing the log's verbosity level, perform the following steps:

  1. Open a command-line window and go to the <book_samples>/sample1 directory.

  2. Any time we start a job or a transformation, we can manually set the verbosity of the log output. The higher the verbosity level, the more log output is produced. To do this, we use the -level argument, specified for Linux/Mac as follows:

    -level:<logging_level>


    For Windows, the argument is specified as follows:

    /level:<logging_level>

  3. The -level argument lets you set the desired logging level by choosing one of the following seven values:

    • Error: Shows errors only

    • Nothing: Produces no logging output at all

    • Minimal: Minimal logging, with very low verbosity

    • Basic: The default logging level

    • Detailed: Use this level when you need a detailed logging output

    • Debug: Very detailed output, intended for debugging purposes

    • Rowlevel: The maximum verbosity; logging at row level can generate a lot of data

  4. To start the job on Linux/Mac with the log level set to Error, run the following command:

    $ kitchen.sh -file:/home/sramazzina/tmp/samples/export-job.kjb -level:Error

  5. To start the job on Windows with the log level set to Error, run the following command:

    C:\temp\samples>Kitchen.bat /file:C:\temp\samples\export-job.kjb /level:Error
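
As noted at the beginning of this recipe, the same arguments work for Pan; only the script name changes. For illustration (the transformation file named here is hypothetical), running a transformation on Linux/Mac at the Error level would look like this:

    $ pan.sh -file:/home/sramazzina/tmp/samples/export-transformation.ktr -level:Error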

For saving an ETL process log to output files for future reference, use the following steps:

  1. The log produced by our Kettle processes is an invaluable resource for diagnosing problems and solving them quickly, so it is a good rule of thumb to save the logs and, if needed, archive them for future reference.

  2. When you launch your jobs from the command line, there are several ways to save the log.

  3. The first thing we can do is save the log to a file using the logfile argument. This argument lets you specify the complete path to the logfile name.

  4. To set the logfile name on Linux/Mac, use the following syntax:

    -logfile:<complete_logfilename>
    
  5. To set the logfile name on Windows, use the following syntax:

    /logfile:<complete_logfilename>
    
  6. Let's suppose that we are going to start the export-job.kjb Kettle job with a Debug log level, and that we want to save the output to a logfile called pdilog_debug_output.log. To do this on Linux/Mac, type the following command:

    $ kitchen.sh -file:/home/sramazzina/tmp/samples/export-job.kjb -level:Debug -logfile:./pdilog_debug_output.log
    
  7. To set the logfile on Windows, type the following command:

    C:\temp\samples>Kitchen.bat /file:C:\temp\samples\export-job.kjb /level:Debug /logfile:.\pdilog_debug_output.log
    
  8. As soon as PDI starts, it begins filling a buffer that contains all the rows produced by your log. This is useful because it lets you keep memory usage under control. By default, PDI keeps the first 5,000 log rows produced by your job in this buffer; this means that if your ETL process produces more than 5,000 log rows, the log output is truncated.

  9. To change the length of the log buffer, you need to use the maxloglines argument; this argument lets you specify the maximum number of log lines.

  10. To set the maximum number of log lines that are kept by Kettle on Linux/Mac, use the following argument:

    -maxloglines:<number_of_log_lines_to_keep>
    
  11. To set the maximum number of log lines that are kept by Kettle on Windows, use the following argument:

    /maxloglines:<number_of_log_lines_to_keep>
    
  12. If you specify 0 as the value for this argument, PDI will maintain all of the log lines produced.

  13. Another method to limit the number of log lines kept internally by the PDI logging system is to filter log lines by age. The maxlogtimeout argument lets you specify the maximum age of a log line, in minutes, before it is removed from the log buffer.

  14. To set the maximum age of a log line on Linux/Mac in minutes, use the following argument:

    -maxlogtimeout:<age_of_a_logline_in_minutes>
    
  15. To set the maximum age of a log line on Windows in minutes, use the following argument:

    /maxlogtimeout:<age_of_a_logline_in_minutes>
    
  16. If you specify 0 as the value for this argument, PDI will maintain all the log lines indefinitely.

  17. Let's suppose, for example, that we're going to start the export-job.kjb Kettle job and that we want to keep 10,000 rows in our log buffer. In this case, the command we need to use on Linux/Mac is as follows:

    $ kitchen.sh -file:/home/sramazzina/tmp/samples/export-job.kjb -level:Debug -logfile:./pdilog_debug_output.log -maxloglines:10000
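
The command above sets only maxloglines. For illustration (the values are purely indicative), the maxlogtimeout argument described earlier can be added in the same way, for example to discard buffered log lines older than 60 minutes:

    $ kitchen.sh -file:/home/sramazzina/tmp/samples/export-job.kjb -level:Debug -logfile:./pdilog_debug_output.log -maxloglines:10000 -maxlogtimeout:60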

There's more...

The log is an invaluable source of information for understanding what is not working and where. This is the topic of the first part of this section. After that, we will look at a brief example that shows how to produce logfiles with a parametric name.

Understanding the log to identify where our process fails

The log produced by our ETL process contains valuable information that helps us understand where the process fails. The first case of failure is a system exception. Here it is very easy to identify why the process fails, because the exception message is clearly visible in the logfile. As an example, suppose that we start our job from the wrong directory, or that the job file is not found at the path we supply; we will find a detailed message in the log, as follows:

INFO  17-03 22:15:40,312 - Kitchen - Start of run.
ERROR: Kitchen can't continue because the job couldn't be loaded.

A very different situation arises when our process does not fail explicitly with an exception, but the results differ from what we expect. For example, we expected 1,000 rows to be written to our file, but only 900 were written. What can we do to understand what is going wrong? A simple but effective approach is to analyze the part of the log that summarizes what happened in each of our steps. Let's consider the following section, taken from the log of one of our sample processes:

INFO  17-03 22:31:54,712 - Read customers from file - Finished processing (I=123, O=0, R=0, W=122, U=1, E=0)
INFO  17-03 22:31:54,720 - Get parameters - Finished processing (I=0, O=0, R=122, W=122, U=0, E=0)
INFO  17-03 22:31:54,730 - Filter rows with different countries - Finished processing (I=0, O=0, R=122, W=122, U=0, E=0)
INFO  17-03 22:31:54,914 - Write selected country customers - Finished processing (I=0, O=122, R=122, W=122, U=0, E=0)

As you can see, this section, which always appears near the end of the log of any transformation called by the job, summarizes what happens at the boundaries of every step of our transformation: the counters show the lines input (I) from and output (O) to external sources, the lines read (R) from and written (W) to other steps, the lines updated (U), and the errors (E). Keeping an eye on this log fragment is key to understanding where our business rules are failing and where we are getting fewer records than expected. On the other hand, remember that because jobs are mainly orchestrators, they do not carry any data, so there is no such log section for them.
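
If you saved the log to a file as described earlier, a quick way to extract just this per-step summary is to filter the logfile for the Finished processing lines; the logfile name below is the one used in the previous examples:

    $ grep "Finished processing" pdilog_debug_output.log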

Separating execution logfiles by date and time

It can be useful to separate our execution logs by appending the execution date and time to the logfile name. The simplest way to do this is to wrap the Kitchen script in another script that builds the proper logfile name using some shell functions.

As an example, I wrote a sample script for Linux/Mac that starts PDI jobs by filename and writes a logfile whose name is made up of a base name, a text string containing the date and time when the job was submitted, and the extension (conventionally .log). Have a look at the following code:

# Build a timestamp string made of year, month, day, hour, minutes, and seconds
now=$(date +"%Y%m%d_%H%M%S")
# Start the job ($1 = job base name) and write the log to <base>_<timestamp>.log ($2 = logfile base name)
kitchen.sh -file:"$1.kjb" -logfile:"${2}_${now}.log"

As you can see, it's a fairly trivial script; the first command builds a string composed of year, month, day, hour, minutes, and seconds, and the second concatenates that string with the logfile's base name and extension. To start the script, use the following syntax:

startKitchen.sh <job_to_start_base_name> <logfile_base_name>

So, for example, the following command starts Kitchen by calling the job export-job.kjb and produces a logfile with a name such as logfile_20130319_132640.log, where the timestamp reflects when the job was submitted:

$./startKitchen.sh ./export-job ./logfile

You can write a similar batch script for the Windows platform.
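
For illustration only, a Windows batch equivalent might look like the following sketch; note that the %date% and %time% substrings depend on your regional settings (the offsets below assume a DD/MM/YYYY date format), so adjust them to match your locale:

    @echo off
    REM Build a timestamp from the current date and time
    set now=%date:~6,4%%date:~3,2%%date:~0,2%_%time:~0,2%%time:~3,2%%time:~6,2%
    REM Start the job (%1 = job base name) and write the log to <base>_<timestamp>.log (%2 = logfile base name)
    Kitchen.bat /file:%1.kjb /logfile:%2_%now%.log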
