You're reading from Learning Pentaho Data Integration 8 CE - Third Edition

Product type Book

Published in Dec 2017

Publisher Packt

ISBN-13 9781788292436

Pages 500 pages

Edition 3rd Edition

Chapter 13. Implementing Metadata Injection

This chapter covers a powerful feature of Pentaho Data Integration (PDI): metadata injection, which consists of injecting metadata into a template Transformation at runtime. We will explain the motivation behind this feature and then work through a couple of practical examples that show how to implement it.

We will be covering the following topics in this chapter:

Introducing metadata injection
Discovering metadata and injecting it
Identifying use cases to implement metadata injection

Introducing metadata injection

Throughout the book, we have been talking about PDI metadata, the data that describes the PDI datasets. Metadata includes field names and data types, among other attributes. Inside PDI, metadata refers not only to datasets but also to other entities. For example, the definition of an input file (name, description, columns) is also considered metadata.

You usually define the metadata in the configuration windows of the different steps. You do this manually while you are developing or modifying a Transformation in Spoon. This works perfectly when you know exactly what the data looks like (for example, when you are reading a file) or how you want it to be (for example, when you are creating new fields). There are situations, however, where this is not the case and you don't know the metadata until runtime. This is the kind of situation where metadata injection can help.

Explaining how metadata injection works

Let's see how metadata injection works through a very simple example...

Discovering metadata and injecting it

Let's move to a slightly more elaborate use case than the previous one. We will continue working with sales data. In this case, we will work with an Excel file named sales_data.xls, which has a single sheet. There are several fields in this file, but we are only interested in the following: PRODUCTLINE, PRODUCTCODE, and QUANTITYORDERED. The problem is that the fields can be in any order in the Excel file. We will only know the order when we read the file.

In the same way as before, we need to create a template with missing data and then a Transformation that injects that data.
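Before building this in Spoon, the division of work between the two pieces can be sketched in plain Python. This is only an illustration of the idea, not PDI code: the function names, the sample header order, and the sample row are all made up for the sketch. One function plays the role of the template (it knows how to select and rename, but not which fields), and the other discovers the real field positions at runtime and produces the missing metadata.

```python
# Plain-Python illustration only; none of these names are PDI APIs.
WANTED = ["PRODUCTLINE", "PRODUCTCODE", "QUANTITYORDERED"]


def discover_mapping(header_row, wanted=WANTED):
    """Map generic names (HEADER1, HEADER2, ...) to the real names read at runtime."""
    mapping = {}
    for position, real_name in enumerate(header_row, start=1):
        if real_name in wanted:
            mapping[f"HEADER{position}"] = real_name
    missing = [name for name in wanted if name not in mapping.values()]
    if missing:
        raise ValueError(f"Fields not found in header: {missing}")
    return mapping


def run_template(rows, select_and_rename):
    """The 'template': selects and renames fields, using metadata supplied from outside."""
    output = []
    for row in rows:
        # Every incoming column gets a generic name, as in the template's Fields grid.
        generic = {f"HEADER{i}": value for i, value in enumerate(row, start=1)}
        # Keep only the injected fields, renamed to their real names.
        output.append({new: generic[old] for old, new in select_and_rename.items()})
    return output


# Hypothetical header order and a made-up sample row.
header = ["ORDERNUMBER", "QUANTITYORDERED", "PRODUCTCODE", "PRODUCTLINE"]
data = [[10107, 30, "S10_1678", "Motorcycles"]]

mapping = discover_mapping(header)    # the metadata discovered at runtime
result = run_template(data, mapping)  # the metadata injected into the template
print(result)
```

If the columns arrive in a different order on the next run, only the discovered mapping changes; the template itself stays untouched.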

Let's start with the template. As we don't have the list of fields, we will fill the Fields grid with generic names—HEADER1, HEADER2, and so on. We have to select and keep only three fields. For this, we will use a Select Values step and leave the task of filling it to the Transformation that injects the missing data:

Create a Transformation.

Add a Microsoft Excel input step, and configure...

Identifying use cases to implement metadata injection

So far, we have used injection to deal with dynamic sources. The opposite case is dealing with dynamic targets, for example, generating files with a variable number of fields.
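As a rough analogy for a dynamic target, the following plain-Python sketch writes delimited text where the list of output fields is only supplied at call time. It is not PDI code; the function name and the sample rows are assumptions made for the illustration.

```python
import csv
import io


def write_rows(rows, field_names):
    """Write rows as CSV text, using a field list that is only known at runtime."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=field_names,
        extrasaction="ignore",  # silently drop fields that were not requested
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample rows; the same writer produces files with different layouts.
rows = [{"PRODUCTLINE": "Motorcycles", "PRODUCTCODE": "S10_1678", "QUANTITYORDERED": 30}]
two_fields = write_rows(rows, ["PRODUCTCODE", "QUANTITYORDERED"])
three_fields = write_rows(rows, ["PRODUCTLINE", "PRODUCTCODE", "QUANTITYORDERED"])
print(two_fields)
print(three_fields)
```

The writing logic is defined once, and the number and order of output fields is decided by whoever calls it, which is the role the injecting Transformation plays for a template.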

Metadata injection can also be used to reduce repetitive tasks. A typical example is the loading of text files into staging tables. Suppose that you have a text file that you want to load into a staging table. Besides the specific task of loading the table, you also want to perform other tasks, for example, checking for non-null values, storing audit information such as the user and timestamp of the execution, and counting the processed rows and logging the count in a result table.

Now suppose that you have to do this for a large number of different files. You could take this process as the base and start copying and pasting, adapting the process for each file. This is, however, not a good idea, for several reasons:

It is time-consuming
It...
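The alternative to copying and pasting, a single generic process driven by a small description of each file, can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than PDI code: the spec layout, the function name, the audit field names, and the sample rows are all hypothetical.

```python
from datetime import datetime, timezone


def load_to_staging(rows, spec, user):
    """Generic load: validate non-null fields, add audit fields, count rows.

    `spec` is the per-file metadata; the logic itself never changes.
    """
    started = datetime.now(timezone.utc).isoformat()
    accepted, rejected = [], []
    for row in rows:
        # Reject rows where any mandatory field is missing or empty.
        if any(row.get(field) in (None, "") for field in spec["not_null"]):
            rejected.append(row)
            continue
        accepted.append({**row, "audit_user": user, "audit_timestamp": started})
    log_entry = {
        "table": spec["table"],
        "rows_read": len(rows),
        "rows_loaded": len(accepted),
        "rows_rejected": len(rejected),
    }
    return accepted, log_entry


# One hypothetical spec per file; adding a new file means adding a new spec.
customers_spec = {"table": "stg_customers", "not_null": ["name"]}
customer_rows = [{"id": 1, "name": "Ana"}, {"id": 2, "name": None}]

accepted, log_entry = load_to_staging(customer_rows, customers_spec, user="etl_user")
print(log_entry)
```

With this shape, a change to the validation or audit logic is made in one place and applies to every file.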

Summary

In this chapter, you learned the basics of metadata injection. You learned what metadata injection is and how it works. After that, you developed a couple of examples with PDI, which will serve as patterns for implementing your own solutions. Finally, you were introduced to use cases where injection can be useful.

Having learned metadata injection, you now have all the knowledge needed to create advanced transformations. In the next chapter, we will switch back to jobs to continue learning advanced concepts.