Chapter 6. Transforming the Row Set
So far, you have been working with simple datasets, that is, datasets where each row represents a different entity (for example, a student) and each column represents a different attribute of that entity (for example, the student name). There are occasions when your dataset doesn't follow such a simple format, and working with it as is may be complicated or even impossible. On other occasions your data simply does not have the structure you like or the structure you need.
Whatever your situation, you have to transform the dataset into an appropriate format, and the solution is not always about changing or adding fields, or about filtering or adding rows. Sometimes it has to do with twisting the whole dataset. In this chapter you will learn how to:
Convert rows to columns
Convert columns to rows
Operate on sets of rows
You will also be introduced to a core subject in data warehousing: Time dimensions.
Converting rows to columns
In most datasets each row belongs to a different element such as a different match or a different student. However, there are datasets where a single row doesn't completely describe one element. Take, for example, the real-estate
file from Chapter 5. Every house was described through several rows. A single row gave incomplete information about the house. The ideal situation would be one in which all the attributes for the house were in a single row. With PDI you can convert the data into this alternative format. You will learn how to do it in this section.
Time for action – enhancing a films file by converting rows to columns
In this tutorial we will work with a file that contains a list of all French movies ever made. Each movie is described through several rows. This is what the file looks like:
In order to process the information of the file, it would be better if the rows belonging to each movie were merged into a single row. Let's work on that.
Download the file from the Packt website.
Create a transformation and read the file with a Text file input step.
In the Content tab of the Text file input step, type : as the separator. Also uncheck the Header and the No empty rows options.
In the Fields tab enter two string fields—feature and description...
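Conceptually, what we are about to do with the Row denormalizer can be sketched in a few lines of Python. This is only an illustration, not part of the PDI transformation; it assumes a hypothetical layout where each line holds a feature:description pair and a blank row separates one movie from the next (which is why the No empty rows option was unchecked).

```python
# Illustrative sketch only: merge the per-movie attribute rows into
# one record per movie. The sample data below is invented; the real
# file layout is the one shown in the screenshot above.

def merge_movies(lines):
    movies, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:              # a blank row ends the current movie
            if current:
                movies.append(current)
                current = {}
            continue
        feature, _, description = line.partition(":")
        current[feature.strip()] = description.strip()
    if current:                   # flush the last movie
        movies.append(current)
    return movies

sample = [
    "Name:Breathless",
    "Year:1960",
    "",
    "Name:Amelie",
    "Year:2001",
]
print(merge_movies(sample))
# [{'Name': 'Breathless', 'Year': '1960'}, {'Name': 'Amelie', 'Year': '2001'}]
```

Each movie ends up as a single record with one entry per attribute, which is exactly the shape we want the rows of the output dataset to have.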
Time for action – calculating total scores by performance by country
Let's now work with the contest file from Chapter 5. You will need the output file of the Hero exercise from that chapter, Fill gaps in the contest file. If you don't have it, you can download it from the Packt website.
In this tutorial, we will calculate the total score for each performance by country.
Create a new transformation.
Read the file with a Text file input step and do a preview to see that the step is well configured. You should see this:
With a Select values step, keep only the following columns: Country, Performance, and totalScore.
With a Sort Rows step, sort the data by Country in ascending order.
After the Sort Rows step, put a Row denormalizer step.
Double-click this last step to configure it.
As the key field put Performance, and as the group field put Country.
Fill the target fields grid as shown:
Close the window.
With the Row denormalizer step selected, do a preview. You will see this:
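The logic the Row denormalizer applies here can be sketched in plain Python: group the rows by Country, and turn each distinct Performance value into a column holding its totalScore. The sample values below are invented for illustration; note that, as in PDI, the input is expected to be sorted by the group field.

```python
# Sketch of the Row denormalizer behavior for this tutorial:
# one output row per Country, one column per Performance value.
# Field names match the tutorial; the data values are made up.

def denormalize(rows, group="Country", key="Performance", value="totalScore"):
    out = {}
    for row in rows:
        out.setdefault(row[group], {group: row[group]})[row[key]] = row[value]
    return list(out.values())

rows = [
    {"Country": "France", "Performance": "duet", "totalScore": 85},
    {"Country": "France", "Performance": "solo", "totalScore": 90},
    {"Country": "Italy",  "Performance": "duet", "totalScore": 78},
]
print(denormalize(rows))
# [{'Country': 'France', 'duet': 85, 'solo': 90}, {'Country': 'Italy', 'duet': 78}]
```

Countries missing a given performance simply get no value for that column, just as PDI leaves the corresponding target field empty.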
Some datasets are nice to look at but complicated to process further. Take a look at the matches file we saw in Chapter 3:
Imagine you want to answer these questions:
How many teams played?
Which team converted most goals?
Which team won all matches it played?
The dataset is not prepared to answer those questions, at least not in an easy way. If you want to answer them simply, you will first have to normalize the data, that is, convert it to a suitable format before proceeding. Let's work on it.
Time for action – enhancing the matches file by normalizing the dataset
Now you will convert the matches file you generated in Chapter 2 to a format suitable for answering the proposed questions.
Search on your disk for the file you created in Chapter 2, or download it from the Packt website.
Create a new transformation and read the file by using a Text file input step.
With a Split Fields step, split the Result field in two: home_t_goals and away_t_goals. (Do you remember having done this in Chapter 3?)
From the Transform category of steps, drag a Row Normalizer step to the canvas.
Create a hop from the last step to this new one.
Double-click the Row Normalizer step to edit it and fill the window as follows:
With the Row Normalizer selected, do a preview. You should see this:
You read the matches file and converted the dataset to a new one where both the home team and the away team appear under a new column named team, together with another new column named goals holding...
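The normalization done by the Row Normalizer step can be sketched as follows: every match row is split into two rows, one for the home team and one for the away team, each carrying the team name and the goals it converted. The match data below is invented; the field names follow the tutorial.

```python
# Sketch of the Row Normalizer behavior for the matches file:
# each match produces two rows, one per team, with that team's goals.
# Team names below are made up for illustration.

def normalize(matches):
    rows = []
    for m in matches:
        rows.append({"team": m["home_team"], "goals": m["home_t_goals"]})
        rows.append({"team": m["away_team"], "goals": m["away_t_goals"]})
    return rows

matches = [
    {"home_team": "PSG",  "away_team": "Lyon",  "home_t_goals": 2, "away_t_goals": 1},
    {"home_team": "Lyon", "away_team": "Lille", "home_t_goals": 0, "away_t_goals": 0},
]
rows = normalize(matches)
print(len({r["team"] for r in rows}))   # how many teams played: 3
```

With the data in this shape, the proposed questions become simple aggregations: counting distinct values of team, or summing goals per team.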
Generating a custom time dimension dataset by using Kettle variables
Dimensions are sets of attributes useful for describing a business. A list of products along with their shape, color, or size is a typical example of a dimension. The time dimension is a special dimension used for describing a business in terms of when things happened. Just think of a time dimension as a list of dates along with attributes describing those dates. For example, given the date 05/08/2009, you know that it is a day in August, that it belongs to the third quarter, and that it is a Wednesday. These are some of the attributes of that date.
In the following tutorial you will create a transformation that generates the dataset for a time dimension. The dataset for a time dimension has one row for every date in a given range of dates and one column for each attribute of the date.
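Before building the transformation in PDI, the idea can be sketched in a few lines of Python: walk through every date in a range and compute some attributes for each one. The attribute set here (month name, quarter, weekday) just mirrors the examples above; a real time dimension usually carries many more columns.

```python
# Sketch: one row per date in the range, with a few date attributes.
# A real time dimension would add many more columns (year, week of
# year, holiday flags, and so on).
from datetime import date, timedelta

def time_dimension(start, end):
    rows, d = [], start
    while d <= end:
        rows.append({
            "date": d.isoformat(),
            "month": d.strftime("%B"),
            "quarter": (d.month - 1) // 3 + 1,
            "week_day": d.strftime("%A"),
        })
        d += timedelta(days=1)
    return rows

dim = time_dimension(date(2009, 8, 5), date(2009, 8, 7))
print(dim[0])
# {'date': '2009-08-05', 'month': 'August', 'quarter': 3, 'week_day': 'Wednesday'}
```

The PDI transformation in the next tutorial builds the same kind of dataset, but using steps and variables instead of code.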
Time for action – creating the time dimension dataset
In this tutorial we will create a simple dataset for a time dimension.
First we will create a stream with the days of the week:
Create a new transformation.
Press Ctrl+T to access the Transformation settings window.
Select the Parameters tab and fill it as shown in the next screenshot:
Expand the Job category of steps.
Drag a Get Variables step to the canvas, double-click the step, and fill the window as follows:
After the Get Variables step, add a Split Fields step and use it to split the field week_days into seven String fields named sun, mon, tue, wed, thu, fri, and sat. As Delimiter, set a comma (,).
Add one more Split Fields step and use it to split the field week_days_short into seven String fields named sun_sh, mon_sh, tue_sh, wed_sh, thu_sh, fri_sh, and sat_sh. As Delimiter, set a comma (,).
After this last step, add a Row Normalizer step.
Double-click the Row Normalizer step and fill it as follows:
Keep the Row Normalizer step selected...
Time for action – getting variables for setting the default starting date
Let's modify the transformation so that the starting date depends on a parameter.
Press Ctrl+T to open the transformation settings window.
Add a parameter named START_DATE with default value 01/12/1999.
Add a Get variables step between the Calculator step and the Filter rows step.
Edit the Get variables step and add a new field named start_date. Under Variable, write ${START_DATE}. As Type, select Date, and under Format, select or type dd/MM/yyyy.
Modify the filter step so that the condition is now: date>=start_date and date<=31/12/2020.
Modify the Select values step to remove the start_date field.
With the Select values step selected, do a preview. You will see this:
You added a starting date as a named parameter. Then you read that variable into a new field and used it to keep only the dates that are greater than or equal to its value.
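The filtering logic you just built can be sketched in Python: a start date parameter with the same default as in the transformation, parsed with the dd/MM/yyyy format, keeps only the dates in the wanted range. This is just an illustration of the logic, not part of the PDI transformation.

```python
# Sketch of the parameter-driven filter: START_DATE defaults to
# 01/12/1999 (as in the transformation) and is parsed with the
# dd/MM/yyyy format; only dates up to 31/12/2020 are kept.
from datetime import datetime

def keep_dates(dates, start_date="01/12/1999"):
    start = datetime.strptime(start_date, "%d/%m/%Y")
    end = datetime.strptime("31/12/2020", "%d/%m/%Y")
    return [d for d in dates if start <= d <= end]

dates = [datetime(1995, 1, 1), datetime(2000, 6, 15), datetime(2021, 1, 1)]
print(keep_dates(dates))
# keeps only the 2000-06-15 date
```

Passing a different start_date value plays the same role as overriding the named parameter when you run the transformation.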
Using the Get Variables step
As you just
saw, the Get Variables step allows you...
In this chapter, you learned to transform your dataset by applying two magical steps: Row Normalizer and Row denormalizer. These aren't the kind of steps you use every day, such as a Filter rows or a Select values step, but when you need to do the kind of task they achieve, you are really grateful that they exist. They do a complex task in quite a simple way. You also learned what a time dimension is and how to create a dataset for one.
So far, you've been learning to transform data. In the next chapter, you will set that kind of learning aside for a while. The chapter will be devoted to an essential subject when it comes to working in productive environments and dealing with real data—data validation and error handling.