
You're reading from Pentaho Data Integration Quick Start Guide (1st Edition), published by Packt in August 2018. ISBN-13: 9781789343328.
Author: María Carina Roldán

María Carina Roldán was born in Argentina and has a bachelor's degree in computer science. She started working with Pentaho back in 2006. She spent all these years developing BI solutions, mainly as an ETL specialist, and working for different companies around the world. Currently, she lives in Buenos Aires and works as an independent consultant. Carina is the author of Learning Pentaho Data Integration 8 CE, published by Packt in December 2017. She has also authored other books on Pentaho, all of them published by Packt.

Chapter 4. Transforming Data

Transforming data is about manipulating the data that flows from step to step in a PDI transformation. There are many ways to do this: we can modify incoming data, change its datatype, add new fields, fix erroneous data, sort, group, filter out unwanted information, aggregate data in several ways, and more. In this chapter, we will explain some of the possibilities.

The following is the list of topics that we will cover:

  • Transforming data in different ways
  • Sorting and aggregating data
  • Filtering rows
  • Looking up data

Transforming data in different ways


So far, we have seen how to create a PDI dataset mainly using data coming from files or databases. Once you have the data, there are many things you can do with it, depending on your particular needs. One very common requirement is to create new fields whose values are based on the values of existing fields.
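In PDI you would build such fields graphically, with steps like Calculator or User Defined Java Expression, rather than with code. The row-level idea, however, can be sketched in Python; the field names here are hypothetical examples, not ones from the exercise files:

```python
# Each row flows through a PDI transformation as a set of named fields.
def add_price_per_room(row):
    """Return a copy of the row with a derived field added,
    much like a Calculator step computes a new field per row."""
    new_row = dict(row)
    new_row["price_per_room"] = row["price"] / row["rooms"]
    return new_row

rows = [{"price": 300, "rooms": 3}, {"price": 120, "rooms": 2}]
transformed = [add_price_per_room(r) for r in rows]
```

Note that the original fields are preserved and a new one is appended, which is exactly how PDI steps enrich the stream instead of replacing it.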

The set of operations covered in this section is not a full list of the available options, but includes the most common ones, and will inspire you when you come to implement others. 

Note

The files that we will use in this section were built with data downloaded from www.numbeo.com, a site containing information about living conditions in cities and countries worldwide.

Note

For learning the topics in this chapter, you are free to create your own data. However, if you want to reproduce the exercises exactly as they are explained, you will need the aforementioned files from www.numbeo.com.

Note

Before continuing, make sure you download the set of data...

Sorting and aggregating data


In the previous section, we learned how to work with individual fields, for example, by creating new ones or modifying existing ones. The operations were applied row by row. In this section, we will not look at individual rows; instead, we will learn to observe and work on the dataset as a unit.

Sorting data

Sorting the dataset is a very useful and common task. Sorting is really easy to do in PDI, and we will demonstrate it with a simple transformation. We will take the survey files that we used in the previous chapter and sort the data by the neighborhood and room_type columns, and then by the reviews column in descending order. To do this, go through the following steps:

  1. Open any of the transformations created in the last chapter that read files with surveys. Save the transformation with a different name.
  2. Drag a Sort rows step from the Transform folder and create a hop from the Text file input toward this new step.
  3. Double-click the step and...
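The Sort rows step is configured in a dialog rather than in code, but the multi-key ordering it produces can be sketched in Python. The rows below are hypothetical sample data, not records from the actual survey files:

```python
# Hypothetical survey rows; in PDI these would flow in
# from the Text file input step.
rows = [
    {"neighborhood": "Centro", "room_type": "Private room", "reviews": 12},
    {"neighborhood": "Centro", "room_type": "Private room", "reviews": 40},
    {"neighborhood": "Belgrano", "room_type": "Entire home", "reviews": 5},
]

# Sort by neighborhood and room_type ascending, then by reviews
# descending, mirroring the Sort rows configuration described above.
rows.sort(key=lambda r: (r["neighborhood"], r["room_type"], -r["reviews"]))
```

Negating the numeric key gives the descending order for reviews while the text keys stay ascending, which is the same mix of directions you select per field in the Sort rows dialog.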

Filtering rows


Until now, we have been enriching our dataset with new data. Now we will do the exact opposite: we will discard unwanted information. We already know how to keep a subset of fields and discard the rest by using the Select values step. Now it's time to keep only the rows that we are interested in.

Filtering rows upon conditions

To demonstrate how to filter rows with PDI, we will work again with the survey files. This time, we will read a set of files, and will keep only the locations with more than three rooms. The main step we will be using is the Filter rows step. Go through the following steps:

  1. Create a transformation and use a Text file input step to read the files containing the surveys carried out in 2015.

Note

You are free to read a different set of files, but if you read this set, you will be able to compare your results with the results shown in the following screenshots.

  2. After the Text file input step, add a Filter rows step. You will find it in the Flow folder.
  3. In...
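The Filter rows step evaluates a condition against each row and lets only the matching rows pass. Conceptually, with hypothetical field names and sample values, it behaves like this sketch:

```python
# Hypothetical rows read from the survey files.
rows = [
    {"location": "Palermo", "rooms": 5},
    {"location": "Recoleta", "rooms": 2},
    {"location": "Almagro", "rooms": 4},
]

# Keep only the locations with more than three rooms,
# as the Filter rows condition in this exercise does.
kept = [r for r in rows if r["rooms"] > 3]
```

In PDI, the rows that fail the condition are not necessarily lost: the step can also send them down a second hop, so you can redirect rejected rows to an alternative flow instead of discarding them.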

Looking up data


In all the transformations that we have created so far, we had single streams of data. We could, however, create more than one stream, with data coming from different sources. The streams can eventually be merged together—as was the case when we merged data coming from different files in the previous chapters—or they can also be used for looking up data, as we will learn in this section.

Looking for data in a secondary stream

Looking for data in a secondary stream is a common requirement when the data you need comes from a source that is different from your main data; for example, your data may come from a database while you need to look up related data in an XML file. In this section, you will learn how to implement this kind of lookup through a simple exercise: we will have a list of European cities, and we will look up their cost of living indexes, which come from a different source. To do this, go through the following steps:

Note

For this exercise, we will use a file...
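In PDI, this lookup is typically configured with the Stream lookup step, but the underlying idea is a keyed match between the main stream and the secondary one. A minimal sketch, using made-up cities and index values rather than the real figures from www.numbeo.com:

```python
# Main stream: the list of European cities (hypothetical sample).
main_stream = [{"city": "Madrid"}, {"city": "Oslo"}]

# Secondary stream, keyed by city; these index values are made up.
lookup = {"Madrid": 53.0, "Oslo": 100.5}

# For each row in the main stream, fetch the matching value and
# add it as a new field, as a Stream lookup step would.
enriched = [
    dict(row, cost_of_living_index=lookup.get(row["city"]))
    for row in main_stream
]
```

Using `.get()` means a city missing from the secondary source yields `None` for the new field; in PDI you configure an equivalent default value for non-matching rows in the step dialog.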

Summary


In this chapter, we transformed PDI datasets in several ways. First, we learned to transform data at row level by combining values, extracting pieces of a value, and creating new fields, to mention just some of the different operations. For each operation, we saw that PDI offers several ways of doing the same thing, and we encouraged you to experiment with different steps and adopt the ones that best fit your needs.

Then, we learned how to sort data and then aggregate it by adding values and calculating averages, among other common aggregate operations.

After having transformed the dataset, we learned how to filter unwanted data, either discarding it or redirecting it to alternative flows.

At the end of the chapter, we enriched the datasets by looking up external data—both in databases and in secondary streams—and adding it to our main flow.

Now that we have seen the main ways of transforming data coming in from different sources, we are ready to load that data into multiple...

