Appendix F. Pop Quiz Answers

Chapter 1


PDI data sources

1: 5

PDI prerequisites

1: 1 and 3

PDI basics

1: False (Spoon is the only graphical tool)
2: True
3: False (Spoon doesn't generate code; it interprets transformations and jobs)
4: False (The grid size is intended to line up steps on the screen)
5: False (As an example, the transformation in this chapter created the rows of data from scratch; it didn't use external data)

Chapter 2


formatting data

1: (a) and (b). The field is already a Number, so you may define the output field as a Number, taking care of the format you apply. If you define the output field as a String and you don't set a format, Kettle will send the field to the output as 1.0, 2.0, 3.0, and so on, which clearly is not what you want. Just to confirm this, create a sample file and a transformation and see the results for yourself.

Chapter 3


concatenating strings

1: (a) and (c). The Calculator step allows you to use the + operator both for adding numbers and for concatenating text. The Formula step makes a distinction: to add numbers you use +; to concatenate text you have to use & instead.

Chapter 4


data movement (copying and distributing)

1: (b). In the second transformation the rows are copied, so all the unassigned rows reach the Dummy step. In the first transformation the rows are distributed, so only half of the rows arrive at the Filter step. When you do the preview, you see only the unassigned tasks for this half; you don't see the unassigned tasks that went to the other stream.

splitting a stream

1: (c). Both (a) and (b) solve the situation.

Chapter 5


finding the seven errors

1: These are the seven errors (a corrected sketch of the script follows this list):

1. The log type a doesn't exist. Look at the sample provided for the function to see the valid options.
2. The variable uSkill is not defined. Its definition is required if you want to add it to the list of new fields.
3. setValue() causes an error without compatibility mode. To change the value of the Country field, a new variable should be used instead.
4. A variable named average is calculated, but wAverage is used as the new field.
5. It is not trans_status; it is trans_Status.
6. No data type was specified for the totalScore field.
7. The sentence writeToLog('Ready to calculate averages...') will be written for every row. To write it at the beginning, you have to put it in a Start script, not in the main script.
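
The following is a minimal sketch of how the corrected script could look in the Modified Java Script Value step. It is not the exact script from the tutorial: the skill fields (writing, reading, speaking, listening) and the expressions assigned to uSkill and uCountry are assumptions used only to illustrate the fixes listed above.

// 7) Put one-time messages in a Start script, not in the main script:
//      writeToLog('Ready to calculate averages...');
// 1) If you pass a log type as the first argument of writeToLog(),
//    use one of the valid options shown in the sample for the function.

// 2) Define every variable that you list as a new field.
var uSkill = upper(Skill);          // Skill is an assumed input field

// 3) Don't call setValue() outside compatibility mode; create a new
//    variable and add it to the fields grid instead.
var uCountry = upper(Country);      // instead of changing Country in place

// 4) and 6) Name the variable exactly like the new field (wAverage) and
//    remember to pick a data type (for example, Number) for totalScore.
var totalScore = writing + reading + speaking + listening;
var wAverage = totalScore / 4;

// 5) The predefined variable is trans_Status, not trans_status.
trans_Status = CONTINUE_TRANSFORMATION;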

Chapter 6


using Kettle variables inside transformations

1: (a). You don't need a Get Variables step in this case. As the name of the file, you simply type hello_${user.name} or hello_%%user.name%%.

In (b) and (c) you need to add the variables ${user.name} and ${user.language} respectively as fields of your dataset. You do that with a Get Variables step.

Chapter 7


PDI error handling

1: (c). With PDI you cannot avoid unexpected errors, but you can capture them, avoiding the crash of the transformation. After that, discarding or treating the bad rows is up to you.

Chapter 8


defining database connections

1: (c)

database datatypes versus PDI datatypes

1: (b)

Insert/Update step versus Table Output/Update steps

1: (a). If an incoming row belongs to a product that doesn't exist in the products table, both the Insert/Update step and the Table output step will insert the record.

If an incoming row belongs to a product that already exists in the products table, the Insert/Update step updates it. In the alternative version, the Table output step fails (there cannot be two products with the same value for the primary key), but the failing row goes to the Update step, which updates the record.

If an incoming row contains invalid data (for example, a price with a non-numeric value), neither the Insert/Update step, the Table output step, nor the Update step would insert or update the table with this product.

filtering the first 10 rows

1: (c). To limit the number of rows in MySQL you use the LIMIT clause. Options (a) and (b) are dialects: (a) is valid in HSQLDB and (b) is valid in Oracle. If you put either of those options in a Table input step for querying the js database, the transformation would fail.
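
For reference, this is the general shape of a query that the Table input step could run against a MySQL database such as js; the table and column names below are placeholders, not the actual schema used in the book:

-- MySQL dialect: LIMIT returns only the first 10 rows of the result.
SELECT *
FROM products
ORDER BY product_code
LIMIT 10;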

Chapter 9


loading slowly changing dimensions

1: (a). The decision about the kind of dimension is not related to the data you have; you just have to know your business, so the last option is out. You don't need to keep history for the name of the film. If the name changes, it is because it was misspelled, or because you want to change the name to uppercase, or something like that. It doesn't make sense to keep the old value, so you create a Type I SCD.

2: (c). You can use any of these steps for loading a Type I SCD. In the tutorial for loading a Type I SCD you used a Combination L/U step, but you could have used the other one too, as explained above.

loading type III slowly changing dimensions

1: (b). You use a Database lookup step to get the current value stored in the dimension. If there is no data in the dimension table, the lookup fails and returns null; that is not a problem. After that, you compare the found data with the new data and set the proper values for the dimension columns. Then you load the dimension either with a Combination L/U step or with a Dimension lookup step, just as you do for a regular Type I SCD.

Chapter 10


defining PDI jobs

1: (b)

2: All the given options are True. Simply explore the Job entries tree and you'll find the answers.

Chapter 11


using the Add sequence step

1: (e). None of the proposed solutions gives you the same results you obtained in the tutorial. The Add sequence step gives you the next value in a sequence, which can be a database sequence or a transformation counter. In the tutorial you used a transformation counter. In options (b) and (c), instead of four sequences from 1 to 10, a single sequence from 1 to 40 would have been generated. No matter which method you use for generating the sequence, if you use the same sequence name in more than one Add sequence step, the sequence is the same and is shared by all those steps. Therefore, option (a) would also have generated a single sequence from 1 to 40 shared by the four streams.

Besides these details about the generation of sequences, option (b) introduces an extra inconvenience: by distributing rows, you cannot be sure that the rows will go to the proper stream. PDI would have distributed them in its own fashion.

deciding the scope of variables

1: All the options are valid. In the tutorial you had just a transformation and its parent job, which is also the root job, so (a) is valid. The grandparent job scope includes the parent job, so option (b) is valid too. Option (c) includes all the other options, so it is a valid option too.

Chapter 12


modifying a star model and loading the star with PDI

1a: iii. As mentioned in Chapter 9, despite being designed for building Type II SCDs, the Dimension L/U step can be used for building Type I SCDs as well. So you have two options: reuse the table (modifying the transformation that loads it) and get the surrogate key with a Dimension L/U step, or use another table without the fields specific to Type II dimensions and, for getting the surrogate key, use a DB Lookup step. In either case, you may reuse the id_region field, as it is an integer and serves in either situation.

1b: i

1c: iii

2a: ii. The dimension table has to have one record per month, so a different table is needed. For the key you could use a string with the format yyyymm. If you don't want to change the fact table, you may reuse the dt field, leaving the last two characters blank, but it would be more appropriate to have a string field with just six positions. For getting the surrogate key you use a Select values step to change the metadata, but this time you apply the new format mask yyyymm.

2b: ii

2c: i

3a: ii. The product_type field is a string; it's not the proper field for referencing a surrogate key from a fact table, so you have to define a new field for that purpose. For getting the right key you use a Database lookup step.

3b: iii

Chapter 13


remote execution and clustering

1: None of the sentences are true.
