Pentaho Data Integration 4: Understanding Data Flows

by Adrián Sergio Pulvirenti and María Carina Roldán | June 2011

This article by Adrián Sergio Pulvirenti and María Carina Roldán, authors of Pentaho Data Integration 4 Cookbook, focuses on the different ways of combining, splitting, or manipulating streams or flows of data using Kettle transformations.

In this article, we will cover:

  • Splitting a stream into two or more streams based on a condition
  • Merging rows from two streams with the same or different structure
  • Comparing two streams and generating differences
  • Generating all possible pairs formed from two datasets


Introduction

The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation.

When a transformation is launched, all its steps are started. During the execution, the steps work simultaneously reading rows from the incoming hops, processing them, and delivering them to the outgoing hops. When there are no more rows left, the execution of the transformation ends.

The dataset that flows from step to step is nothing more than a set of rows, all having the same structure or metadata. This means that all rows have the same number of columns, and the columns in all rows have the same type and name.

Suppose that you have a single stream of data and that you apply the same transformations to all rows, that is, all steps are connected one after the other. In other words, you have the simplest transformation from a structural point of view. In this case, you don't have to worry much about the structure of your data stream, nor about the origin or destination of the rows. The interesting part comes when you face other situations, for example:

  • You want a step to start processing rows only after another given step has processed all rows
  • You have more than one stream and you have to combine them into a single stream
  • You have to inject rows in the middle of your stream and those rows don't have the same structure as the rows in your dataset

With Kettle, you can actually do all of this, but you have to be careful: it's easy to end up doing the wrong thing and getting unexpected results or, even worse, undesirable errors.

The first example does not happen by default, due to the parallel nature of transformations explained earlier. There are, however, two steps that might help:

  • Blocking Step: This step blocks processing until all incoming rows have been processed.
  • Block this step until steps finish: This step blocks processing until the selected steps finish.

Both these steps are in the Flow category.

This article and the next one, on Working with Complex Data Flows, focus on the other two examples and some similar use cases, explaining the different ways of combining, splitting, or manipulating streams of data.

Splitting a stream into two or more streams based on a condition

In this recipe, you will learn to use the Filter rows step in order to split a single stream into different smaller streams. In the There's more... section, you will also see alternative and more efficient ways of doing the same thing in different scenarios.

Let's assume that you have a set of outdoor products in a text file, and you want to differentiate the tents from the other kinds of products, and also create a subclassification of the tents depending on their prices.

Let's see a sample of this data:

id_product,desc_product,price,category
1,"Swedish Firesteel - Army Model",19,"kitchen"
2,"Mountain House #10 Can Freeze-Dried Food",53,"kitchen"
3,"Lodge Logic L9OG3 Pre-Seasoned 10-1/2-Inch Round
Griddle",14,"kitchen"
...

Getting ready

To run this recipe, you will need a text file named outdoorProducts.txt with information about outdoor products. The file contains information about the category and price of each product.

How to do it...

Carry out the following steps:

  1. Create a transformation.
  2. Drag into the canvas a Text file input step and fill in the File tab to read the file named outdoorProducts.txt. If you are using the sample text file, type a comma (,) as the Separator.
  3. Under the Fields tab, use the Get Fields button to populate the grid. Adjust the entries so that the grid looks like the one shown in the following screenshot:

    Pentaho Data Integration 4 tutorial on Data Flows

  4. Now, let's add the steps to manage the flow of the rows. To do this, drag two Filter rows steps from the Flow category. Also, drag three Dummy steps that will represent the three resulting streams.
  5. Create the hops, as shown in the following screenshot. When you create the hops, make sure that you choose the options according to the image: Result is TRUE for creating a hop with a green icon, and Result is FALSE for creating a hop with a red icon in it.

    Pentaho Data Integration 4 tutorial on Data Flows

  6. Double-click on the first Filter rows step and complete the condition, as shown in the following screenshot:

    Pentaho Data Integration 4 tutorial on Data Flows

  7. Double-click on the second Filter rows step and complete the condition with price < 100.
  8. You have just split the original dataset into three groups. You can verify this by previewing each Dummy step. The first one has the products whose category is not tents; the second one, the tents under 100 US$; and the last one, the expensive tents, that is, those priced at 100 US$ or more.
  9. The preview of the last Dummy step will show the following:

    Pentaho Data Integration 4 tutorial on Data Flows

How it works...

The main objective in the recipe is to split a dataset with products depending on their category and price. To do this, you used the Filter rows step.

In the Filter rows setting window, you tell Kettle where the data flows to depending on the result of evaluating a condition for each row. In order to do that, you have two list boxes: Send 'true' data to step and Send 'false' data to step. The destination steps can be set by using the hop properties as you did in the recipe. Alternatively, you can set them in the Filter rows setting dialog by selecting the name of the destination steps from the available drop-down lists.

You also have to enter the condition. The condition has the following different parts:

Pentaho Data Integration 4 tutorial on Data Flows

  • The upper textbox on the left is meant to negate the condition.
  • The left textbox is meant to select the field that will be used for comparison.
  • Then, you have a list of possible comparators to choose from.
  • On the right, you have two textboxes: The upper textbox for comparing against a field and the bottom textbox for comparing against a constant value.

Also, you can include more conditions by clicking on the Add Condition button on the right. If you right-click on a condition, a contextual menu appears to let you delete, edit, or move it.

In the first Filter rows step of the recipe, you typed a simple condition: You compared a field (category) with a fixed value (tents) by using the equal (=) operator. You did this to separate the tent products from the others.

The second filter had the purpose of differentiating the expensive and the cheap tents.

There's more...

You will find more filter features in the following subsections.

Avoiding the use of Dummy steps

In the recipe, we assumed that you wanted all three groups of products for further processing. Now, suppose that you only want the cheapest tents and you don't care about the rest. You could use just one Filter rows step with the condition category = tents AND price < 100, and send the 'false' data to a Dummy step, as shown in the following diagram:

Pentaho Data Integration 4 tutorial on Data Flows

The rows that don't meet the condition will end at the Dummy step. Although this is a very commonly used solution for keeping just the rows that meet the conditions, there is a simpler way to implement it. When you create the hop from the Filter rows toward the next step, you are asked for the kind of hop. If you choose Main output of step, the two options Send 'true' data to step and Send 'false' data to step will remain empty. This will cause two things:

  1. Only the rows that meet the condition will pass.
  2. The rest will be discarded.
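
Incidentally, if you preferred the Java Filter step (described later in this article) over the Filter rows step, the combined condition above could be written as a single Java expression. This is just a sketch, assuming the field names used in the recipe:

    category.equals("tents") && price < 100

Only the rows for which the expression evaluates to true would be sent to the destination step for matching rows.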

Comparing against the value of a Kettle variable

The recipe above shows you how to configure the condition in the Filter rows step to compare a field against another field or a constant value, but what if you want to compare against the value of a Kettle variable?

Let's assume, for example, you have a named parameter called categ with kitchen as Default Value. As you might know, named parameters are a particular kind of Kettle variable.

You create the named parameters under the Parameter tab from the Settings option of the Edit menu.

To use this variable in a condition, you must add it to your dataset in advance. You do this as follows:

  1. Add a Get Variables step from the Job category. Put it in the stream after the Text file input step and before the Filter Rows step; use it to create a new field named categ of String type with the value ${categ} in the Variable column.
  2. Now, the transformation looks like the one shown in the following screenshot:

    Pentaho Data Integration 4 tutorial on Data Flows

  3. After this, you can set the condition of the first Filter rows step to category = categ, selecting categ from the listbox of fields to the right. This way, you will be filtering the kitchen products.
  4. If you run the transformation and set the parameter to tents, you will obtain similar results to those that were obtained in the main recipe.
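
If you want to run the transformation from the command line rather than from Spoon, named parameters can also be passed to Pan. The following is only a sketch; the transformation file name splitting_streams.ktr is hypothetical:

    pan.sh -file=splitting_streams.ktr -param:categ=tents

On Windows, Pan.bat accepts the equivalent /param:"categ=tents" syntax.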

Avoiding the use of nested Filter Rows steps

Suppose that you want to compare a single field against a discrete and short list of possible values and do different things for each value in that list. In this case, you can use the Switch / Case step instead of nested Filter rows steps.

Let's assume that you have to send the rows to different steps depending on the category. The best way to do this is with the Switch / Case step. This way you avoid adding one Filter row step for each category.

In this step, you have to select the field to be used for comparing. You do it in the Field name to switch listbox. In the Case values grid, you set the Value—Target step pairs. The following screenshot shows how to fill in the grid for our particular problem:

Pentaho Data Integration 4 tutorial on Data Flows

The following are some considerations about this step:

  • You can have multiple values directed to the same target step
  • You can leave the value column blank to specify a target step for empty values
  • You have a listbox named Default target step to specify the target step for rows that do not match any of the case values
  • You can only compare with an equal operator
  • If you want to compare against a substring of the field, you can enable the Use string contains option and, as Case Value, type the substring you are interested in. For example, if for Case Value you type tent, then all categories containing tent, such as tent_large, tent_small, or best_tents, will be redirected to the same target step.

Overcoming the difficulties of complex conditions

There will be situations where the condition is too complex to be expressed in a single Filter rows step. You can nest them and create temporary fields in order to solve the problem, but it would be more efficient if you used the Java Filter or User Defined Java Expression step as explained next.

You can find the Java Filter step in the Flow category. The difference compared to the Filter Rows step is that in this step, you write the condition using a Java expression.

The names of the listboxes—Destination step for matching rows (optional) and Destination step for non-matching rows (optional)—differ from the names in the Filter rows step, but their purpose is the same.

As an example, the following are the conditions you used in the recipe rewritten as Java expressions: category.equals("tents") and price < 100. These are extremely simple, but you can write any Java expression as long as it evaluates to a Boolean result.

If you can't guarantee that the category will not be null, you'd better invert the first expression and put "tents".equals(category) instead. By doing this, whenever you have to check if a field is equal to a constant, you avoid an unexpected Java error.
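
If you want to see why, the following trivial snippet (not part of the recipe) illustrates the difference when the field is null:

    public class NullSafeCheck {
        public static void main(String[] args) {
            String category = null;                          // a row where the field is missing
            // category.equals("tents") would throw a NullPointerException here
            System.out.println("tents".equals(category));    // prints false, no error
        }
    }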

Finally, suppose that you have to split the streams simply to set some fields and then join the streams again. For example, assume that you want to change the category as follows:

Pentaho Data Integration 4 tutorial on Data Flows

Doing this with nested Filter rows steps leads to a transformation like the following:

Pentaho Data Integration 4 tutorial on Data Flows

You can do the same thing in a simpler way:

  1. Replace all the steps but the Text file input with a User Defined Java Expression step located in the Scripting category.
  2. In the setting window of this step, add a row in order to replace the value of the category field: As New field and Replace value, type category. As Value type, select String. As Java expression, type the following:

    (category.equals("tents"))?(price<100?"cheap_tents":"expensive_
    tents"):category

    The preceding expression uses the Java ternary operator ?:. If you're not familiar with the syntax, think of it as shorthand for the if-then-else statement. For example, the inner expression price<100?"cheap_tents":"expensive_tents" means if (price<100) then return "cheap_tents" else return "expensive_tents". An expanded plain-Java version of the whole expression is shown after this list.

  3. Do a preview on this step. You will see something similar to the following:

Pentaho Data Integration 4 tutorial on Data Flows
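
For reference, the ternary expression used in step 2 expands to ordinary Java roughly as follows. This is only an illustrative sketch; the newCategory variable is hypothetical and simply stands in for the replaced field value:

    String newCategory;
    if (category.equals("tents")) {
        if (price < 100) {
            newCategory = "cheap_tents";
        } else {
            newCategory = "expensive_tents";
        }
    } else {
        newCategory = category;
    }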


Merging rows of two streams with the same or different structures

It's a common requirement to combine two or more streams into a single stream that includes the union of all their rows. In these cases, the streams come from different sources and don't always have the same structure. Consequently, combining the streams is not as easy as simply adding a step that joins them; you have to take a couple of things into account. This recipe gives you the tips to make it easier.

Suppose that you received data about roller coasters from two different sources. The data in one of those sources looks like the following:

roller_coaster|speed|park|location|country|Year
Top Thrill Dragster|120 mph|Cedar Point|Sandusky, Ohio||2003
Dodonpa|106.8 mph|Fuji-Q Highland|FujiYoshida-shi|Japan|2001
Steel Dragon 2000|95 mph|Nagashima Spa Land|Mie|Japan|2000
Millennium Force|93 mph|Cedar Point|Sandusky, Ohio||2000
Intimidator 305|90 mph|Kings Dominion|Doswell, Virginia||2010
Titan|85 mph|Six Flags Over Texas|Arlington, Texas||2001
Furious Baco|84 mph|PortAventura|Spain||2007
...

The other source data looks like the following:

attraction|park_name|top_speed|trains_qt|ride_time
Expedition Everest|Disney's Animal Kingdom|50 mph|6 - 34 passenger|
Goofy'S Barnstormer|Disney's Magic Kingdom|25 mph|2 - 16 passenger|
Gwazi|Busch Gardens Tampa|50 mph|4 - 24 passenger|2 minutes, 30 seconds
Journey To Atlantis|SeaWorld Orlando||8 passenger boats|
Kraken|SeaWorld Orlando|65 mph|3 - 32 passenger|2 minutes, 2 seconds
...

You want to merge those rows into a single dataset with the following columns:

  • attraction
  • park_name
  • speed
  • trains_qt
  • ride_time

Getting ready

Download the files roller_coasters_I.txt and roller_coasters_II.txt from the Pentaho Data Integration 4 Cookbook site. These files represent the two sources mentioned in the introduction.

How to do it...

Carry out the following steps:

  1. Create a transformation and drag two Text file input steps into the canvas.
  2. Use one of the steps to read the file roller_coasters_I.txt. Set the data types as follows: The speed as a Number with Format #0.### mph, and the rest of the fields as String. Do a preview to make sure that you are reading the file properly.
  3. Drag the cursor over the step and press Space to see the output fields:

    Pentaho Data Integration 4 tutorial on Data Flows

  4. Use the other step to read the file roller_coasters_II.txt. Set the data type of top_speed to Integer and the rest of the fields to String. Do a preview to make sure that you are reading the file properly.
  5. Drag the cursor over the step and press Space to see the output fields:

    Pentaho Data Integration 4 tutorial on Data Flows

    As you can see, the outputs of the streams are different. You have to insert the necessary steps to make them alike. That's the purpose of the next steps.

  6. After the first Text file input step add an Add constants step. Use it to add two fields of type String. Name the fields trains_qt and ride_time and as Value, type Not available.
  7. After it, add a Select values step. Fill in the Select & Alter tab as shown in the following screenshot:

    Pentaho Data Integration 4 tutorial on Data Flows

  8. After the second Text file input step, add another Select values step. Select the Meta-data tab and fill it in as shown in the following screenshot:

    Pentaho Data Integration 4 tutorial on Data Flows

  9. Repeat the procedure to see the output fields of the streams: Drag the cursor over the last step of each stream and press Space. Now, both streams should have the same layout.

    You can keep both windows open at the same time: the one showing the output fields of the upper stream, and the one showing the output fields of the lower one. If you put one next to the other, you can immediately see if they are equal or not.

  10. Finally, join the streams with a Dummy step as depicted in the following diagram:

    Pentaho Data Integration 4 tutorial on Data Flows

  11. Do a preview on the Dummy step. You will see something similar to the result shown in the following screenshot:

Pentaho Data Integration 4 tutorial on Data Flows

How it works...

When you need to merge the rows of two streams into a single stream, you have to do all you can to make the metadata of the streams alike. That's what you did in this recipe. In the first stream, you added the fields you needed that were absent. You also selected and reordered the desired fields to resemble the second stream. After that you changed the metadata of the top_speed field in the second stream. You converted the field from Integer to Number, which was the type of the analogous field in the first stream.

When you did the preview, you could see the rows from both incoming steps.

There's more...

In the recipe, you merged two streams with data coming from different sources. However, that is not the only situation in which you may need to merge streams. It is common to split a stream into two or more streams to perform some particular manipulations, and then merge them back together, as depicted in the following diagram:

Pentaho Data Integration 4 tutorial on Data Flows

Whatever the case, when you have to merge two or more streams, there are two things that you should remember:

  • Make sure that the metadata of all streams is exactly the same
  • Decide and tell Kettle how you want to merge the streams

The following subsections explain these things in detail.

Making sure that the metadata of the streams is the same

In order to merge two or more streams their metadata has to coincide. This basically means that all streams have to have the same layout: same number of columns, same names, same types, and the columns must be in the same order.

If you try to merge streams that don't meet that requirement, you will receive a warning: a window will show up with the title This hop causes the target step to receive rows with mixed layout! and a text explaining the differences found. This means that you have to find a way to fix the situation. Here is a quick checklist that will help you make the metadata of several streams alike:

  • Identify the fields that you want in the final dataset that are not present in all streams. In the streams that don't have those fields, add them. You can get the value from any source (a text file, the command line, and so on), or simply think of a default value and add the field with an Add constants step from the Transform category. This is what you did with the trains_qt and ride_time fields.
  • Identify the fields that you want to keep that are present in all streams, but with a different structure. Change the metadata of those fields in the streams where the metadata is not as desired. You can do this with a Select values step by using the Meta-data tab. This is what you did for the top_speed field.
  • Verify the layouts of the streams. If they still differ, for each stream that does not have the proper layout, add a Select values step at the end. With this step, select the fields you want to keep (implicitly deleting the others), rename and reorder them, in order to match the desired layout. This was what you did with the first Select values step.

Now, you are free to merge the streams as explained in the next subsection.

Telling Kettle how to merge the rows of your streams

Once your streams are ready to be merged, you can then proceed in the following ways:

  • Suppose that you want to put all the rows of one of the streams below all the rows of the other. If you don't care about the order of the streams, you can use any step to merge them. This was what you did in the recipe with the Dummy step.
  • If you care about the order of the streams, you should use the Append Streams step from the Flow category, in order to merge the streams. By selecting a Head hop and a Tail hop, you can tell Kettle which stream goes first.

    This works for only two streams at a time. If you need to merge several streams, you need to nest Append Streams steps.

  • Now, suppose that you really want to merge the rows of the streams and leave them ordered by certain fields. You do it with a Sorted Merge step from the Join category. The step assumes that each incoming stream is already sorted on those same fields.
  • Note that Kettle warns you, but it doesn't prevent you from mixing row layouts when you merge streams. If you see a warning of this kind, refer to the tips in the previous subsection.

If you want Kettle to prevent you from running a transformation with mixed layouts, check the Enable safe mode option located in the window that shows up when you run the transformation. Note that the Enable safe mode option causes a drop in performance and should only be used while you are debugging a transformation.


Comparing two streams and generating differences

Suppose that you have two streams with the same structure and want to find out the differences in the data. Kettle has a step meant specifically for that purpose: the Merge Rows (diff) step. In this recipe, you will see how it works.

Suppose that you have a file with information about the fastest roller coasters around the world. Now, you get an updated file and want to find out the differences between the files: There can be new roller coasters in the list; maybe some roller coasters are no longer among the fastest. Besides, you were told that in the old file, there were some errors about the location, country, and year information, so you are also interested in knowing if some of these have changed.

Getting ready

For this recipe, you will need two files with information about roller coasters. You can download them from the book's site.

Both files have the same structure and look like the following:

Roller_Coaster|Speed|park|location|country|Year
Kingda Ka|128 mph|Six Flags Great Adventure|Jackson, New Jersey||2005
Top Thrill Dragster|120 mph|Cedar Point|Sandusky, Ohio||2003
Dodonpa|106.8 mph|Fuji-Q Highland|FujiYoshida-shi|Japan|2001
Steel Dragon 2000|95 mph|Nagashima Spa Land|Mie|Japan|2000
Millennium Force|93 mph|Cedar Point|Sandusky, Ohio||2000
...

For the There's more... section, you will also need a database with the first file already loaded in a table. A script for creating and loading it is also available for download.

How to do it...

Carry out the following steps:

  1. Create a transformation.
  2. Drag a Text file input step into the canvas and use it to read the file top_roller_coasters.txt. As a separator, type |.
  3. Do a preview to make sure that you are reading the file as expected.
  4. Add a Sort rows step to sort the rows by roller_coaster and park.
  5. Repeat steps 2 to 4 to read the file named top_roller_coasters_updates.txt, sorting the rows also by roller_coaster and park.
  6. From the Join category, add a Merge Rows (diff) step and use it to join both streams as depicted in the following diagram:

    Pentaho Data Integration 4 tutorial on Data Flows

  7. Double-click on the step you just added. In Reference rows origin:, select the last step of the stream that reads the top_roller_coasters.txt file.
  8. In Compare rows origin:, select the last step of the stream that reads the top_roller_coasters_updates.txt file.
  9. As Flag fieldname, type flag.
  10. Fill the Keys to match: and Values to compare: grids as shown in the following screenshot:

    Pentaho Data Integration 4 tutorial on Data Flows

    You can save time by clicking on the Get key fields and Get value fields buttons to fill each grid respectively. Then just delete the fields that you don't need.

  11. Close the window and do a preview; you should see the following:

Pentaho Data Integration 4 tutorial on Data Flows

How it works...

The Merge Rows (diff) step is used for comparing two streams and finding out the differences between them. The output of the step is a single stream that contains a new field acting as a flag, which indicates the kind of difference found, as explained next.

When you use the Merge Rows (diff) step, the two streams you are merging must have the same metadata, that is, the name, order, and type of the fields must be the same.

Let's call the streams being merged the reference stream and the compare stream. The first holds the old data, while the second holds the new data. In the recipe, the old data is the data coming from the top_roller_coasters.txt file and the new data is the data coming from the top_roller_coasters_updates.txt file.

Both streams must be sorted on the specified keys.

In order to perform the comparison, you have to tell Kettle how to detect that a row is the same in both streams, that is, you have to specify the key fields. You do it by entering them in the first grid. In the recipe, the key was made up by the roller coaster name and the park name (roller_coaster and park fields).

If your data comes from a database, instead of using a Sort rows step for sorting the rows, you can sort them in the Table input. That will give you better performance.

Given the two streams, Kettle tries to match rows of both streams based on the key fields provided. Depending on the result, it sets one of four values for the flag (identical, changed, new, or deleted), as explained in the following table:

Pentaho Data Integration 4 tutorial on Data Flows

Note that if a row is found in both streams with identical key fields and compare fields, it is marked as identical even if there are differences in other fields. For example, the Dodonpa roller coaster has a speed of 106.8 mph in the reference stream, but a speed of 106 mph in the compare stream. As you didn't put the speed in the values to compare list, the rows are marked as identical.

As a final remark, note that for the rows marked as new or changed, the values that pass to the output stream are those coming from the compare stream.

For the rows marked as identical or deleted, the values that are passed are those coming from the reference stream.
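
Putting the last few paragraphs together, the flag decision can be summarized with the following conceptual sketch. This is not Kettle's actual code; the row arrays and the compareFieldsAreEqual() helper are hypothetical and only illustrate the rule:

    // Conceptual sketch of how Merge Rows (diff) assigns the flag once the rows
    // have been matched on the key fields (null means the key is absent in that stream).
    // compareFieldsAreEqual() is a hypothetical helper that checks the "values to compare".
    static String diffFlag(Object[] referenceRow, Object[] compareRow) {
        if (referenceRow == null) return "new";      // key exists only in the compare stream
        if (compareRow == null)   return "deleted";  // key exists only in the reference stream
        return compareFieldsAreEqual(referenceRow, compareRow) ? "identical" : "changed";
    }

The row that travels to the output is the compare row for new and changed, and the reference row for identical and deleted, exactly as described above.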

There's more...

The Merge Rows (diff) step is commonly used together with the Synchronize after merge step to keep a database table updated. The following section shows an example of how to do this.

Using the differences to keep a table up to date

Suppose that you have a table in a database with information about roller coasters, and that you have already loaded the data from the top_roller_coasters.txt file into that table.

Now, you have the top_roller_coasters_updates.txt file and want to update the table based on the differences.

The table is totally de-normalized on purpose to keep the exercise simple.

Try the following:

  1. After running the script mentioned in the introduction, modify the transformation in the recipe by replacing the first stream with a Table Input step, in order to read the table rollercoasters. Use the following statement:

    SELECT roller_coaster
    , speed
    , park
    , location
    , country
    , year
    FROM rollercoasters
    ORDER BY roller_coaster
    , park

  2. You will have something like the following:

    Pentaho Data Integration 4 tutorial on Data Flows

  3. Do a preview on the last step, the Merge Rows (diff). The output should be exactly the same as the output in the recipe.
  4. Now, add a Synchronize after merge step. Select the connection to the database or create it if it doesn't exist and as Target table, type rollercoasters. Fill the grids as shown in the following screenshot:

    Pentaho Data Integration 4 tutorial on Data Flows

  5. Select the Advanced tab and fill in the Operation frame as shown in the following screenshot:

    Pentaho Data Integration 4 tutorial on Data Flows

  6. Close the window, save the transformation, and run it.
  7. Execute a SELECT statement to see the data in the rollercoasters table. The roller coasters with the flag deleted should have been deleted from the table, the rows with the flag new should have been inserted into the table, and the rows with the flag changed should have been updated.

Generating all possible pairs formed from two datasets

This is a quick recipe that teaches you how to do a Cartesian product between datasets. By Cartesian product, we mean taking all rows from one dataset and all rows from the other, and generating a new dataset with all the possible combinations of rows.

This particular recipe is in fact the implementation of the CAG or Community Acronym Generator as proposed by Nicholas Goodman (@nagoodman) on Twitter:

@webdetails @pmalves @josvandongen How about CAG? Community Acronym Generator? A project to generate new acronyms for community projects?!

There are already several community projects around Pentaho such as CDF, CDE, or CDA. Why don't we follow Nicholas's suggestion and develop the CAG as follows:

Given two lists of words, the Kettle transformation will generate all combinations of words that lead to potential community projects.

How to do it...

Carry out the following steps:

  1. Create a new transformation and add two Data Grid steps from the Input category.
  2. Use one of the Data Grid steps to create a dataset with a single String field named middle_word. Under the Data tab, enter a set of names for the middle word of the acronym. Here you have a sample list: Dashboard, Data, Chart, Acronym, Cube, Report.
  3. Use the other Data Grid step to create a dataset with a single String field named last_word. Under the Data tab, enter a set of names for the last word of the acronym. Here you have a sample list: Framework, Editor, Translator, Access, Generator, Integrator, Component.
  4. From the Join category, add a Join Rows (Cartesian product) step.
  5. Create hops from the Data Grid steps toward this step. You will have something like the following:

    Pentaho Data Integration 4 tutorial on Data Flows

  6. From the Scripting category, add a User Defined Java Expression step (UDJE for short).
  7. Use the UDJE to add two String fields. Name the first new_component, and as Java Expression type "Community "+middle_word+" "+last_word. Name the second field acronym and as Java Expression type "C"+middle_word.substring(0,1)+last_word.substring(0,1).
  8. Do a preview on this last step. You will see a list of candidate community projects as shown in the following screenshot:

Pentaho Data Integration 4 tutorial on Data Flows

How it works...

The Join Rows (Cartesian product) step has the task of performing the Cartesian product of all streams coming to it. In this case, you had two streams but you could have had more. The step received those two streams and created all combinations of rows.

Then, with the UDJE, you simply built the strings with the names of the candidate community projects and their acronyms, for example, Community Chart Framework (CCF). The real purpose of the generated potential projects is up to your imagination and out of the scope of this article.
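
As a plain-Java illustration of what the two expressions produce for a single combination of words, consider the following sketch (the variable names simply mirror the fields used in the recipe):

    public class AcronymExample {
        public static void main(String[] args) {
            String middle_word = "Chart";
            String last_word = "Framework";
            // Same logic as the two UDJE expressions from the recipe
            String new_component = "Community " + middle_word + " " + last_word;
            String acronym = "C" + middle_word.substring(0, 1) + last_word.substring(0, 1);
            System.out.println(new_component + " (" + acronym + ")");  // Community Chart Framework (CCF)
        }
    }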

There's more...

In the recipe, you used the Join Rows (Cartesian product) step for joining two datasets. You could join more datasets if you want to; however that is not a common requirement.

There are a couple of settings in the step that you didn't use in the recipe. They are explained in the following subsections.

Getting variables in the middle of the stream

This section describes one of the most common situations in which you may see the Join Rows (Cartesian product) step in action. Back to the recipe: suppose that you have a named parameter called VERSION, which can be CE (representing Community Edition) or EE (representing Enterprise Edition). After generating the names of the candidate projects and their acronyms, you want to add the version. You can add the version to your stream by using a Get Variables step from the Job category. However, instead of getting the variable for each row, it is recommended to get it outside the main stream and then join both streams, as shown in the following screenshot:

Pentaho Data Integration 4 tutorial on Data Flows

As the stream coming out of the Get Variables step has a single row, the Cartesian product will have all the possible combinations of the N rows of the main stream with that single row, that is, N rows. In this case, it is important that in the Main step to read from option, you select the main stream, that is, the stream coming from the UDJE. By doing so, you tell Kettle that most of the data will come from this step, and Kettle will cache or spool to disk the data coming from the Get Variables step.

Limiting the number of output rows

With a Join Rows (Cartesian product) step, you can limit the number of output rows by entering a simple or complex condition in its setting window. The rows that don't meet the condition are discarded.

Back to the recipe: as you might have noticed, it is possible for the transformation to generate acronyms that already exist, for example, CDF. In the previous subsection, you added a second Join Rows (Cartesian product) step. In this step, you could add a condition to discard the rows with acronyms that already exist, except when the product is Enterprise Edition. The condition area in the setting window of the step would look like the one shown in the following screenshot (except for the exact list of acronyms, which might have changed by the time you're reading this):

Pentaho Data Integration 4 tutorial on Data Flows

If you do a preview on this step, you will see something like the following:

Pentaho Data Integration 4 tutorial on Data Flows

If you take a look at the Step Metrics tab of the Execution Results window, you will notice that the number of written rows is less than the Cartesian product of the incoming rows. Note that the GUI for entering the condition is the same as the one in the Filter rows step.

Because you may pick fields from more than one stream in the condition of the Join Rows (Cartesian product) step, the picked fields must have unique names across the streams.

Summary

We saw different ways of combining, splitting, or manipulating streams or flows of data in simple situations using Kettle transformations. In the next article, we will focus on Working with Complex Data Flows.


About the Authors


Adrián Sergio Pulvirenti

Adrián Sergio Pulvirenti was born in Buenos Aires, Argentina, in 1972. He earned his Bachelor's degree in Computer Sciences at UBA, one of the most prestigious universities in South America.

He has dedicated more than 15 years to developing desktop and web-based software solutions. Over the last few years he has been leading integration projects and development of BI solutions.

María Carina Roldán

María Carina, born in Esquel, Argentina, earned her Bachelor's degree in Computer Science at UNLP in La Plata and then moved to Buenos Aires where she has been living since 1994.

She has worked as a BI consultant for almost 15 years. Over the last six, she has been dedicated full time to developing BI solutions using the Pentaho Suite. Currently, she works for Webdetails—a Pentaho company—as an ETL specialist.

She is the author of Pentaho 3.2 Data Integration Beginner’s Guide book published by Packt Publishing in April 2009 and co-author of Pentaho Data Integration 4 Cookbook, also published by Packt Publishing in June 2011.
