Appendix F. Pop Quiz Answers

Chapter 1


PDI data sources

1: 5

PDI prerequisites

1: 1 and 3

PDI basics

1: False (Spoon is the only graphical tool)
2: True
3: False (Spoon doesn't generate code; it interprets transformations and jobs)
4: False (The grid size is intended to line up steps on the screen)
5: False (As an example, the transformation in this chapter created the rows of data from scratch; it didn't use external data)

Chapter 2


formatting data

1: (a) and (b). The field is already a Number, so you may define the output field as a Number, taking care of the format you apply. If you define the output field as a String and you don't set a format, Kettle will send the field to the output as 1.0, 2.0, 3.0, and so on, which clearly is not what you want. Just to confirm this, create a sample file and a transformation and see the results for yourself.

Chapter 3


concatenating strings

1: (a) and (c). The Calculator step allows you to use the + operator both for adding numbers and for concatenating text. The Formula step makes a distinction: to add numbers you use +; to concatenate text you have to use & instead.

Chapter 4


data movement (copying and distributing)

1: (b). In the second transformation the rows are copied, so all the unassigned rows reach the Dummy step. In the first transformation the rows are distributed, so only half of the rows arrive at the Filter step. When you do the preview, you see only the unassigned tasks for this half; you don't see the unassigned tasks that went to the other stream.

splitting a stream

1: (c). Both (a) and (b) solve the situation.

Chapter 5


finding the seven errors

1: These are the seven errors (a corrected sketch of the script follows this list):

1. The log type a doesn't exist. Look at the sample provided for the function to see the valid options.
2. The variable uSkill is not defined. Its definition is required if you want to add it to the list of new fields.
3. setValue() causes an error without compatibility mode. To change the value of the Country field, a new variable should be used instead.
4. A variable named average is calculated, but wAverage is used as the new field.
5. It is not trans_status; it is trans_Status.
6. No data type was specified for the totalScore field.
7. The sentence writeToLog('Ready to calculate averages...') will be written for every row. To write it at the beginning, you have to put it in a Start script, not in the main script.
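
The following is a minimal sketch of how the corrected script could look in the Modified Java Script Value step. It is not the exact script from the tutorial: the skill fields (writing, reading, speaking, listening) and the expressions assigned to uSkill and uCountry are assumptions used only to illustrate the fixes listed above.

// 7) Put one-time messages in a Start script, not in the main script:
//      writeToLog('Ready to calculate averages...');
// 1) If you pass a log type as the first argument of writeToLog(),
//    use one of the valid options shown in the sample for the function.

// 2) Define every variable that you list as a new field.
var uSkill = upper(Skill);          // Skill is an assumed input field

// 3) Don't call setValue() outside compatibility mode; create a new
//    variable and add it to the fields grid instead.
var uCountry = upper(Country);      // instead of changing Country in place

// 4) and 6) Name the variable exactly like the new field (wAverage) and
//    remember to pick a data type (for example, Number) for totalScore.
var totalScore = writing + reading + speaking + listening;
var wAverage = totalScore / 4;

// 5) The predefined variable is trans_Status, not trans_status.
trans_Status = CONTINUE_TRANSFORMATION;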

Chapter 6


using Kettle variables inside transformations

1: (a). You don't need a Get Variables step in this case. As the name of the file, you simply type hello_${user.name} or hello_%%user.name%%.

In (b) and (c) you need to add the variables ${user.name} and ${user.language} respectively as fields of your dataset. You do that with a Get Variables step.

Chapter 7


PDI error handling

1: (c). With PDI you cannot avoid unexpected errors, but you can capture them, avoiding the crash of the transformation. After that, discarding or treating the bad rows is up to you.

Chapter 8


defining database connections

1: (c)

database datatypes versus PDI datatypes

1: (b)

Insert/Update step versus Table Output/Update steps

1: (a). If an incoming row belongs to a product that doesn't exist in the products table, both the Insert/Update step and the Table output step will insert the record.

If an incoming row belongs to a product that already exists in the products table, the Insert/Update step updates it. In the alternative version, the Table output step fails (there cannot be two products with the same value for the primary key), but the failing row goes to the Update step, which updates the record.

If an incoming row contains invalid data (for example, a price with a non-numeric value), neither the Insert/Update step, the Table output step, nor the Update step would insert or update the table with this product.

filtering the first 10 rows

1: (c). To limit the number of rows in MySQL you use the LIMIT clause. Options (a) and (b) are dialects: (a) is valid in HSQLDB and (b) is valid in Oracle. If you put either of those options in a Table input step for querying the js database, the transformation would fail.
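
For reference, this is the general shape of a query that the Table input step could run against a MySQL database such as js; the table and column names below are placeholders, not the actual schema used in the book:

-- MySQL dialect: LIMIT returns only the first 10 rows of the result.
SELECT *
FROM products
ORDER BY product_code
LIMIT 10;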

Chapter 9


loading slowly changing dimensions

1: (a). The decision about the kind of dimension is not related to the data you have; you just have to know your business, so the last option is out. You don't need to keep history for the name of the film. If the name changes, it is because it was misspelled, or because you want to change the name to uppercase, or something like that. It doesn't make sense to keep the old value, so you create a Type I SCD.

2: (c). You can use any of these steps for loading a Type I SCD. In the tutorial for loading a Type I SCD you used a Combination L/U step, but you could have used the other one too, as explained above.

loading type III slowly changing dimensions

1: (b). You use a Database lookup step to get the current value stored in the dimension. If there is no data in the dimension table, the lookup fails and returns null; that is not a problem. After that, you compare the found data with the new data and set the proper values for the dimension columns. Then you load the dimension either with a Combination L/U step or with a Dimension lookup step, just as you do for a regular Type I SCD.

Chapter 10


defining PDI jobs

1: (b)

2: All the given options are True. Simply explore the Job entries tree and you'll find the answers.

Chapter 11


using the Add sequence step

1: (e). None of the proposed solutions gives you the same results you obtained in the tutorial. The Add sequence step gives you the next value in a sequence, which can be a database sequence or a transformation counter. In the tutorial you used a transformation counter. In options (b) and (c), instead of four sequences from 1 to 10, a single sequence from 1 to 40 would have been generated. No matter which method you use for generating the sequence, if you use the same sequence name in more than one Add sequence step, the sequence is the same and is shared by all those steps. Therefore, option (a) would also have generated a single sequence from 1 to 40 shared by the four streams.

Besides these details about the generation of sequences, option (b) introduces an extra inconvenience: by distributing rows, you cannot be sure that the rows will go to the proper stream. PDI would have distributed them in its own fashion.

deciding the scope of variables

1: All the options are valid. In the tutorial you had just a transformation and its parent job, which is also the root job, so (a) is valid. The grandparent job scope includes the parent job, so option (b) is valid too. Option (c) includes all the other options, so it is a valid option too.

Chapter 12


modifying a star model and loading the star with PDI

1a: iii. As mentioned in Chapter 9, despite being designed for building Type II SCDs, the Dimension L/U step can be used for building Type I SCDs as well. So you have two options: reuse the table (modifying the transformation that loads it) and get the surrogate key with a Dimension L/U step, or use another table without the fields specific to Type II dimensions and, for getting the surrogate key, use a DB Lookup step. In either case, you may reuse the id_region field, as it is an integer and serves in either situation.

1b: i

1c: iii

2a: ii. The dimension table has to have one record per month, so a different table is needed. For the key you could use a string with the format yyyymm. If you don't want to change the fact table, you may reuse the dt field, leaving the last two characters blank, but it would be more appropriate to have a string field with just six positions. For getting the surrogate key you use a Select values step to change the metadata, but this time you apply the new format mask yyyymm.

2b: ii

2c: i

3a: ii. The product_type field is a string; it's not the proper field for referencing a surrogate key from a fact table, so you have to define a new field for that purpose. For getting the right key you use a Database lookup step.

3b: iii

Chapter 13


remote execution and clustering

1: None of the sentences are true.
