
You're reading from Pentaho Data Integration Quick Start Guide (1st Edition), published by Packt in August 2018. ISBN-13: 9781789343328.
Author: María Carina Roldán

María Carina Roldán was born in Argentina and has a bachelor's degree in computer science. She started working with Pentaho back in 2006. She spent all these years developing BI solutions, mainly as an ETL specialist, and working for different companies around the world. Currently, she lives in Buenos Aires and works as an independent consultant. Carina is the author of Learning Pentaho Data Integration 8 CE, published by Packt in December 2017. She has also authored other books on Pentaho, all of them published by Packt.

Chapter 4. Transforming Data

Transforming data is about manipulating the data that flows from step to step in a PDI transformation. There are many ways to do this: we can modify incoming data, change its datatype, add new fields, fix erroneous data, sort, group, filter out unwanted information, aggregate data in several ways, and more. In this chapter, we will explain some of the possibilities.

The following is the list of topics that we will cover:

  • Transforming data in different ways
  • Sorting and aggregating data
  • Filtering rows
  • Looking up data

Transforming data in different ways


So far, we have seen how to create a PDI dataset mainly using data coming from files or databases. Once you have the data, there are many things you can do with it, depending on your particular needs. One very common requirement is to create new fields whose values are based on the values of existing fields.
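In PDI you would build such fields graphically, with steps like Calculator or User Defined Java Expression, rather than with code. The row-level idea, however, can be sketched in Python; the field names here are hypothetical examples, not ones from the exercise files:

```python
# Each row flows through a PDI transformation as a set of named fields.
def add_price_per_room(row):
    """Return a copy of the row with a derived field added,
    much like a Calculator step computes a new field per row."""
    new_row = dict(row)
    new_row["price_per_room"] = row["price"] / row["rooms"]
    return new_row

rows = [{"price": 300, "rooms": 3}, {"price": 120, "rooms": 2}]
transformed = [add_price_per_room(r) for r in rows]
```

Note that the original fields are preserved and a new one is appended, which is exactly how PDI steps enrich the stream instead of replacing it.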

The set of operations covered in this section is not a full list of the available options, but includes the most common ones, and will inspire you when you come to implement others. 

Note

The files that we will use in this section were built with data downloaded from www.numbeo.com, a site containing information about living conditions in cities and countries worldwide.

Note

For learning the topics in this chapter, you are free to create your own data. However, if you want to reproduce the exercises exactly as they are explained, you will need the aforementioned files from www.numbeo.com.

Note

Before continuing, make sure you download the set of data...

Sorting and aggregating data


In the previous section, we learned how to work with individual fields, for example, by creating new ones or modifying existing ones. The operations were applied row by row. In this section, we will not look at individual rows; instead, we will learn to observe and work on the dataset as a unit.

Sorting data

Sorting the dataset is a very useful and common task. Sorting is really easy to do in PDI, and we will demonstrate it with a simple transformation. We will take the survey files that we used in the previous chapter and sort the data by the neighborhood and room_type columns, and then by the reviews column in descending order. To do this, go through the following steps:

  1. Open any of the transformations created in the last chapter that read files with surveys. Save the transformation with a different name.
  2. Drag a Sort rows step from the Transform folder and create a hop from the Text file input toward this new step.
  3. Double-click the step and...
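The Sort rows step is configured in a dialog rather than in code, but the multi-key ordering it produces can be sketched in Python. The rows below are hypothetical sample data, not records from the actual survey files:

```python
# Hypothetical survey rows; in PDI these would flow in
# from the Text file input step.
rows = [
    {"neighborhood": "Centro", "room_type": "Private room", "reviews": 12},
    {"neighborhood": "Centro", "room_type": "Private room", "reviews": 40},
    {"neighborhood": "Belgrano", "room_type": "Entire home", "reviews": 5},
]

# Sort by neighborhood and room_type ascending, then by reviews
# descending, mirroring the Sort rows configuration described above.
rows.sort(key=lambda r: (r["neighborhood"], r["room_type"], -r["reviews"]))
```

Negating the numeric key gives the descending order for reviews while the text keys stay ascending, which is the same mix of directions you select per field in the Sort rows dialog.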

Filtering rows


Until now, we have been enriching our dataset with new data. Now we will do the exact opposite: we will discard unwanted information. We already know how to keep a subset of fields and discard the rest by using the Select values step. Now it's time to keep only the rows that we are interested in.

Filtering rows upon conditions

To demonstrate how to filter rows with PDI, we will work again with the survey files. This time, we will read a set of files, and will keep only the locations with more than three rooms. The main step we will be using is the Filter rows step. Go through the following steps:

  1. Create a transformation and use a Text file input step to read the files containing the surveys carried out in 2015.

Note

You are free to read a different set of files, but if you read this set, you will be able to compare your results with the results shown in the following screenshots.

  2. After the Text file input step, add a Filter rows step. You will find it in the Flow folder.
  3. In...
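The Filter rows step evaluates a condition against each row and lets only the matching rows pass. Conceptually, with hypothetical field names and sample values, it behaves like this sketch:

```python
# Hypothetical rows read from the survey files.
rows = [
    {"location": "Palermo", "rooms": 5},
    {"location": "Recoleta", "rooms": 2},
    {"location": "Almagro", "rooms": 4},
]

# Keep only the locations with more than three rooms,
# as the Filter rows condition in this exercise does.
kept = [r for r in rows if r["rooms"] > 3]
```

In PDI, the rows that fail the condition are not necessarily lost: the step can also send them down a second hop, so you can redirect rejected rows to an alternative flow instead of discarding them.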

Looking up data


In all the transformations that we have created so far, we had single streams of data. We could, however, create more than one stream, with data coming from different sources. The streams can eventually be merged together—as was the case when we merged data coming from different files in the previous chapters—or they can also be used for looking up data, as we will learn in this section.

Looking for data in a secondary stream

Looking for data in a secondary stream is a common requirement when the data you need comes from a source that is different from your main data; for example, your data may come from a database while you need to look up related data in an XML file. In this section, you will learn how to implement this kind of lookup through a simple exercise: we will have a list of European cities, and we will look up their cost of living indexes, which come from a different source. To do this, go through the following steps:

Note

For this exercise, we will use a file...
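In PDI, this lookup is typically configured with the Stream lookup step, but the underlying idea is a keyed match between the main stream and the secondary one. A minimal sketch, using made-up cities and index values rather than the real figures from www.numbeo.com:

```python
# Main stream: the list of European cities (hypothetical sample).
main_stream = [{"city": "Madrid"}, {"city": "Oslo"}]

# Secondary stream, keyed by city; these index values are made up.
lookup = {"Madrid": 53.0, "Oslo": 100.5}

# For each row in the main stream, fetch the matching value and
# add it as a new field, as a Stream lookup step would.
enriched = [
    dict(row, cost_of_living_index=lookup.get(row["city"]))
    for row in main_stream
]
```

Using `.get()` means a city missing from the secondary source yields `None` for the new field; in PDI you configure an equivalent default value for non-matching rows in the step dialog.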

Summary


In this chapter, we transformed PDI datasets in several ways. First, we learned to transform data at row level by combining values, extracting pieces of a value, and creating new fields, to mention just some of the different operations. For each operation, we saw that PDI offers several ways of doing the same thing, and we encouraged you to experiment with different steps and adopt the ones that best fit your needs.

Then, we learned how to sort data and then aggregate it by adding values and calculating averages, among other common aggregate operations.

After having transformed the dataset, we learned how to filter unwanted data, either discarding it or redirecting it to alternative flows.

At the end of the chapter, we enriched the datasets by looking up external data—both in databases and in secondary streams—and adding it to our main flow.

Now that we have seen the main ways of transforming data coming in from different sources, we are ready to load that data into multiple...

