Pentaho 3.2 Data Integration: Beginner's Guide


There is a newer version of this book available - Pentaho Data Integration Beginner's Guide - Second Edition
eBook (PDF, PacktLib, ePub, and Mobi formats): $29.99, now $25.49 (save 15%)
Print + free eBook + free PacktLib access to the book: $79.98, now $49.99 (save 37%)
Print cover price: $49.99
Free shipping to the UK, US, Europe, and selected countries in Asia.
  • Get started with Pentaho Data Integration from scratch.
  • Enrich your data transformation operations by embedding Java and JavaScript code in PDI transformations.
  • Create a simple but complete Datamart Project that will cover all key features of PDI.
  • Part of Packt's Beginner's Guide series: each task builds on the previous one through an organized sequence of instructions, accompanied by relevant explanations and a set of challenging tasks for the reader to accomplish.

Book Details

Language : English
Paperback : 492 pages [ 235mm x 191mm ]
Release Date : April 2010
ISBN : 1847199542
ISBN 13 : 9781847199546
Author(s) : María Carina Roldán
Topics and Technologies : All Books, Big Data and Business Intelligence, Data, Beginner's Guides, Open Source


Table of Contents

Preface
Chapter 1: Getting Started with Pentaho Data Integration
Chapter 2: Getting Started with Transformations
Chapter 3: Basic Data Manipulation
Chapter 4: Controlling the Flow of Data
Chapter 5: Transforming Your Data with JavaScript Code and the JavaScript Step
Chapter 6: Transforming the Row Set
Chapter 7: Validating Data and Handling Errors
Chapter 8: Working with Databases
Chapter 9: Performing Advanced Operations with Databases
Chapter 10: Creating Basic Task Flows
Chapter 11: Creating Advanced Transformations and Jobs
Chapter 12: Developing and Implementing a Simple Datamart
Chapter 13: Taking it Further
Appendix A: Working with Repositories
Appendix B: Pan and Kitchen: Launching Transformations and Jobs from the Command Line
Appendix C: Quick Reference: Steps and Job Entries
Appendix D: Spoon Shortcuts
Appendix E: Introducing PDI 4 Features
Appendix F: Pop Quiz Answers
Index
  • Chapter 1: Getting Started with Pentaho Data Integration
    • Pentaho Data Integration and Pentaho BI Suite
      • Exploring the Pentaho Demo
    • Pentaho Data Integration
      • Using PDI in real world scenarios
        • Loading data warehouses or data marts
        • Integrating data
        • Data cleansing
        • Migrating information
        • Exporting data
        • Integrating PDI using Pentaho BI
    • Installing PDI
    • Time for action – installing PDI
    • Launching the PDI graphical designer: Spoon
    • Time for action – starting and customizing Spoon
      • Spoon
        • Setting preferences in the Options window
        • Storing transformations and jobs in a repository
      • Creating your first transformation
    • Time for action – creating a hello world transformation
      • Directing the Kettle engine with transformations
      • Exploring the Spoon interface
      • Running and previewing the transformation
  • Time for action – running and previewing the hello_world transformation
  • Installing MySQL
  • Time for action – installing MySQL on Windows
  • Time for action – installing MySQL on Ubuntu
  • Summary
  • Chapter 2: Getting Started with Transformations
    • Reading data from files
    • Time for action – reading results of football matches from files
      • Input files
        • Input steps
      • Reading several files at once
    • Time for action – reading all your files at a time using a single Text file input step
    • Time for action – reading all your files at a time using a single Text file input step and regular expressions
      • Regular expressions
    • Grids
  • Sending data to files
  • Time for action – sending the results of matches to a plain file
    • Output files
      • Output steps
    • Some data definitions
      • Rowset
      • Streams
    • The Select values step
  • Getting system information
  • Time for action – updating a file with news about examinations
    • Getting information by using Get System Info step
    • Data types
      • Date fields
      • Numeric fields
    • Running transformations from a terminal window
  • Time for action – running the examination transformation from a terminal window
  • XML files
  • Time for action – getting data from an XML file with information about countries
    • What is XML
      • PDI transformation files
    • Getting data from XML files
      • XPath
      • Configuring the Get data from XML step
    • Kettle variables
      • How and when you can use variables
  • Summary
  • Chapter 3: Basic Data Manipulation
    • Basic calculations
    • Time for action – reviewing examinations by using the Calculator step
      • Adding or modifying fields by using different PDI steps
        • The Calculator step
        • The Formula step
    • Time for action – reviewing examinations by using the Formula step
    • Calculations on groups of rows
    • Time for action – calculating World Cup statistics by grouping data
      • Group by step
    • Filtering
    • Time for action – counting frequent words by filtering
      • Filtering rows using the Filter rows step
    • Looking up data
    • Time for action – finding out which language people speak
      • The Stream lookup step
    • Summary
  • Chapter 4: Controlling the Flow of Data
    • Splitting streams
    • Time for action – browsing new PDI features by copying a dataset
      • Copying rows
      • Distributing rows
    • Time for action – assigning tasks by distributing
    • Splitting the stream based on conditions
    • Time for action – assigning tasks by filtering priorities with the Filter rows step
      • PDI steps for splitting the stream based on conditions
    • Time for action – assigning tasks by filtering priorities with the Switch/Case step
    • Merging streams
    • Time for action – gathering progress and merging all together
      • PDI options for merging streams
    • Time for action – giving priority to Bouchard by using Append Stream
    • Summary
  • Chapter 5: Transforming Your Data with JavaScript Code and the JavaScript Step
    • Doing simple tasks with the JavaScript step
    • Time for action – calculating scores with JavaScript
      • Using the JavaScript language in PDI
      • Inserting JavaScript code using the Modified Java Script Value step
        • Adding fields
        • Modifying fields
        • Turning on the compatibility switch
      • Testing your code
    • Time for action – testing the calculation of averages
      • Testing the script using the Test script button
  • Enriching the code
  • Time for action – calculating flexible scores by using variables
    • Using named parameters
    • Using the special Start, Main, and End scripts
    • Using transformation predefined constants
  • Reading and parsing unstructured files
  • Time for action – changing a list of house descriptions with JavaScript
    • Looking at previous rows
  • Avoiding coding by using purpose-built steps
  • Summary
  • Chapter 6: Transforming the Row Set
    • Converting rows to columns
    • Time for action – enhancing a films file by converting rows to columns
      • Converting row data to column data by using the Row denormalizer step
      • Aggregating data with a Row denormalizer step
    • Time for action – calculating total scores by performances by country
      • Using Row denormalizer for aggregating data
    • Normalizing data
    • Time for action – enhancing the matches file by normalizing the dataset
      • Modifying the dataset with a Row Normalizer step
      • Summarizing the PDI steps that operate on sets of rows
    • Generating a custom time dimension dataset by using Kettle variables
    • Time for action – creating the time dimension dataset
      • Getting variables
    • Time for action – getting variables for setting the default starting date
      • Using the Get Variables step
  • Summary
  • Chapter 7: Validating Data and Handling Errors
    • Capturing errors
    • Time for action – capturing errors while calculating the age of a film
      • Using PDI error handling functionality
      • Aborting a transformation
    • Time for action – aborting when there are too many errors
      • Aborting a transformation using the Abort step
    • Fixing captured errors
  • Time for action – treating errors that may appear
    • Treating rows coming to the error stream
  • Avoiding unexpected errors by validating data
  • Time for action – validating genres with a Regex Evaluation step
    • Validating data
  • Time for action – checking films file with the Data Validator
    • Defining simple validation rules using the Data Validator
  • Cleansing data
  • Summary
  • Chapter 8: Working with Databases
    • Introducing the Steel Wheels sample database
      • Connecting to the Steel Wheels database
    • Time for action – creating a connection with the Steel Wheels database
      • Connecting with Relational Database Management Systems
    • Exploring the Steel Wheels database
  • Time for action – exploring the sample database
    • A brief word about SQL
    • Exploring any configured database with the PDI Database explorer
  • Querying a database
  • Time for action – getting data about shipped orders
    • Getting data from the database with the Table input step
    • Using the SELECT statement for generating a new dataset
      • Making flexible queries by using parameters
  • Time for action – getting orders in a range of dates by using parameters
    • Making flexible queries by using Kettle variables
  • Time for action – getting orders in a range of dates by using variables
  • Sending data to a database
  • Time for action – loading a table with a list of manufacturers
    • Inserting new data into a database table with the Table output step
    • Inserting or updating data by using other PDI steps
  • Time for action – inserting new products or updating existent ones
  • Time for action – testing the update of existing products
    • Inserting or updating data with the Insert/Update step
  • Eliminating data from a database
  • Time for action – deleting data about discontinued items
    • Deleting records of a database table with the Delete step
  • Summary
  • Chapter 9: Performing Advanced Operations with Databases
    • Preparing the environment
    • Time for action – populating the Jigsaw database
      • Exploring the Jigsaw database model
    • Looking up data in a database
      • Doing simple lookups
    • Time for action – using a Database lookup step to create a list of products to buy
      • Looking up values in a database with the Database lookup step
    • Doing complex lookups
  • Time for action – using a Database join step to create a list of suggested products to buy
    • Joining data from the database to the stream data by using a Database join step
  • Introducing dimensional modeling
  • Loading dimensions with data
  • Time for action – loading a region dimension with a Combination lookup/update step
  • Time for action – testing the transformation that loads the region dimension
    • Describing data with dimensions
      • Loading Type I SCD with a Combination lookup/update step
    • Keeping a history of changes
  • Time for action – keeping a history of product changes with the Dimension lookup/update step
  • Time for action – testing the transformation that keeps a history of product changes
    • Keeping an entire history of data with a Type II slowly changing dimension
  • Loading Type II SCDs with the Dimension lookup/update step
  • Summary
  • Chapter 10: Creating Basic Task Flows
    • Introducing PDI jobs
    • Time for action – creating a simple hello world job
      • Executing processes with PDI jobs
        • Using Spoon to design and run jobs
      • Using the transformation job entry
    • Receiving arguments and parameters in a job
    • Time for action – customizing the hello world file with arguments and parameters
      • Using named parameters in jobs
    • Running jobs from a terminal window
    • Time for action – executing the hello world job from a terminal window
    • Using named parameters and command-line arguments in transformations
    • Time for action – calling the hello world transformation with fixed arguments and parameters
    • Deciding between the use of a command-line argument and a named parameter
    • Running job entries under conditions
    • Time for action – sending a sales report and warning the administrator if something is wrong
      • Changing the flow of execution on the basis of conditions
      • Creating and using a file results list
    • Summary
  • Chapter 11: Creating Advanced Transformations and Jobs
    • Enhancing your processes with the use of variables
    • Time for action – updating a file with news about examinations by setting a variable with the name of the file
      • Setting variables inside a transformation
    • Enhancing the design of your processes
    • Time for action – generating files with top scores
      • Reusing part of your transformations
    • Time for action – calculating the top scores with a subtransformation
      • Creating and using subtransformations
    • Creating a job as a process flow
  • Time for action – splitting the generation of top scores by copying and getting rows
    • Transferring data between transformations by using the copy/get rows mechanism
  • Nesting jobs
  • Time for action – generating the files with top scores by nesting jobs
    • Running a job inside another job with a job entry
    • Understanding the scope of variables
  • Iterating jobs and transformations
  • Time for action – generating custom files by executing a transformation for every input row
    • Executing for each row
  • Summary
  • Chapter 12: Developing and Implementing a Simple Datamart
    • Exploring the sales datamart
      • Deciding the level of granularity
    • Loading the dimensions
    • Time for action – loading dimensions for the sales datamart
    • Extending the sales datamart model
    • Loading a fact table with aggregated data
    • Time for action – loading the sales fact table by looking up dimensions
      • Getting the information from the source with SQL queries
      • Translating the business keys into surrogate keys
        • Obtaining the surrogate key for a Type I SCD
        • Obtaining the surrogate key for a Type II SCD
        • Obtaining the surrogate key for the Junk dimension
        • Obtaining the surrogate key for the Time dimension
    • Getting facts and dimensions together
    • Time for action – loading the fact table using a range of dates obtained from the command line
    • Time for action – loading the sales star
    • Getting rid of administrative tasks
    • Time for action – automating the loading of the sales datamart
    • Summary
  • Chapter 13: Taking it Further
    • PDI best practices
    • Getting the most out of PDI
      • Extending Kettle with plugins
      • Overcoming real world risks with some remote execution
      • Scaling out to overcome bigger risks
    • Integrating PDI and the Pentaho BI suite
      • PDI as a process action
      • PDI as a datasource
      • More about the Pentaho suite
    • PDI Enterprise Edition and Kettle Developer Support
    • Summary
  • Appendix A: Working with Repositories
    • Creating a repository
    • Time for action – creating a PDI repository
      • Creating repositories to store your transformations and jobs
    • Working with the repository storage system
    • Time for action – logging into a repository
      • Logging into a repository by using credentials
        • Defining repository user accounts
      • Creating transformations and jobs in repository folders
      • Creating database connections, partitions, servers, and clusters
      • Backing up and restoring a repository
    • Examining and modifying the contents of a repository with the Repository explorer
    • Migrating from a file-based system to a repository-based system and vice-versa
    • Summary
  • Appendix E: Introducing PDI 4 Features
    • Agile BI
    • Visual improvements for designing transformations and jobs
      • Experiencing the mouse-over assistance
    • Time for action – creating a hop with the mouse-over assistance
      • Using the mouse-over assistance toolbar
    • Experiencing the sniff-testing feature
    • Experiencing the job drill-down feature
    • Experiencing even more visual changes
  • Enterprise features
  • Summary
  • Appendix F: Pop Quiz Answers
    • Chapter 1
      • PDI data sources
      • PDI prerequisites
      • PDI basics
    • Chapter 2
      • formatting data
    • Chapter 3
      • concatenating strings
    • Chapter 4
      • data movement (copying and distributing)
      • splitting a stream
    • Chapter 5
      • finding the seven errors
    • Chapter 6
      • using Kettle variables inside transformations
    • Chapter 7
      • PDI error handling
    • Chapter 8
      • defining database connections
      • database datatypes versus PDI datatypes
      • Insert/Update step versus Table Output/Update steps
      • filtering the first 10 rows
    • Chapter 9
      • loading slowly changing dimensions
      • loading type III slowly changing dimensions
    • Chapter 10
      • defining PDI jobs
    • Chapter 11
      • using the Add sequence step
      • deciding the scope of variables
    • Chapter 12
      • modifying a star model and loading the star with PDI
    • Chapter 13
      • remote execution and clustering
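Several parts of the book above ("Running transformations from a terminal window" in Chapter 2, "Running jobs from a terminal window" in Chapter 10, and Appendix B) cover launching work outside Spoon with the Pan and Kitchen tools. As a rough sketch of what that looks like (the install path and the `.ktr` file name below are hypothetical examples, not files shipped with the book):

```shell
#!/bin/sh
# Sketch of launching PDI from a terminal, as covered in Appendix B:
# Pan runs transformations (.ktr files); Kitchen runs jobs (.kjb files).
# PDI_HOME and hello_world.ktr are assumed names for illustration only.

PDI_HOME="${PDI_HOME:-/opt/pdi}"

run_ktr() {
  # Run a transformation with Pan if PDI is installed; otherwise explain.
  # -file names the transformation; -level sets the logging verbosity.
  if [ -x "$PDI_HOME/pan.sh" ]; then
    "$PDI_HOME/pan.sh" -file="$1" -level=Basic
  else
    echo "pan.sh not found under $PDI_HOME"
  fi
}

result="$(run_ktr hello_world.ktr)"
echo "$result"
```

Kitchen follows the same pattern with `kitchen.sh -file=job.kjb`; Appendix B documents the full option list and exit codes.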

María Carina Roldán

María Carina was born in Esquel, Argentina. She earned her Bachelor's degree in Computer Science at UNLP in La Plata and then moved to Buenos Aires, where she has lived since 1994. She has worked as a BI consultant for almost 15 years; over the last six years she has been dedicated full time to developing BI solutions using the Pentaho Suite. Currently, she works for Webdetails, a Pentaho company, as an ETL specialist. She is the author of Pentaho 3.2 Data Integration: Beginner's Guide, published by Packt Publishing in April 2010, and co-author of Pentaho Data Integration 4 Cookbook, also published by Packt Publishing, in June 2011.

Code Downloads

Download the code and support files for this book.


Submit Errata

If you have found an error that is not listed here, please let us know by completing our errata submission form. Our editors will verify it and add it to the list. Thank you.


Errata

- 6 submitted: last submission 25 Oct 2013

Errata type: Technical | Page number: 55 |

In Chapter 2, page 55, an instruction is missing at step 19 (Double-click the Text file output step icon and give it a name): before this, you have to go to the Content tab and check the Append option. After step 20, the following should be added: "In the Content tab, check the Append option and uncheck the Header option."

 

Errata type: Technical | Page number: 77 |

In Chapter 3, page 77, step 9 is wrong (Create a hop from the Text file input step to the Calculator step). It should read: Create a hop from the Select values step to the Calculator step.

 

Errata type: Technical | Page number: 91 |

In Chapter 3, page 91, step 12 (Select the Dummy and the Group by steps) is wrong. It should read: Select the Dummy and the Sort rows steps.

 

Errata type: Technical | Page number: 181 |

In Chapter 6, page 181, the figure in step 6 does not match what you should actually see. The error is the text entered in the "Type field" option: instead of "type", it should read "class".

 

Errata type: Technical | Page number: 210 |

In Chapter 7, page 211, step 14 is missing an instruction: you should also uncheck the "Null Allowed?" checkbox; otherwise the transformation does not work.

 

Errata type: Typo | Page number: 53 |

Step 5, "drag the Get system info icon to the canvas", should read: drag the Get system info step into the flow.

 

Errata type: Technical | Page number: 55 |

In Chapter 2, step 15 states to click the Content tab and then the Get fields button. However, since the file input was defined as a command-line argument in step 9, this causes an error.

To solve it: when you don't specify the name and location of the file, the Get Fields button cannot be used. The trick (also described in a tip on page 76 of the PDI Cookbook) is to uncheck the "Accept filenames from previous steps" options, fill in the File tab with the name of a real sample file, and then click the Get Fields button. Finally, restore the original settings in the File tab.

Sample Chapters

You can view our sample chapters and prefaces of this title on PacktLib or download sample chapters in PDF format.

Frequently bought together

Pentaho 3.2 Data Integration: Beginner's Guide + ADempiere 3.4 ERP Solutions = 50% off the second eBook
Price for both: $43.05

Buy both these recommended eBooks together and get 50% off the cheapest eBook.

What you will learn from this book

  • Install Pentaho Data Integration and get familiar with Spoon, the graphical designer
  • Work with files to get data from simple input sources, preview it, and send it back in any of the common output formats
  • Perform basic transformation operations on data such as performing calculations, adding constants, counting, filtering, ordering, and searching data
  • Solve real-world problems by manipulating the flow of data by combining or splitting it
  • Solve sophisticated problems such as normalizing data from pivoted tables with ease
  • Explore the various PDI options to validate data and to handle errors
  • Connect to a database engine of your choice to get and store information coming from any source
  • Perform advanced operations with databases such as loading data dimensions
  • Create advanced processes such as nesting jobs, iterating on jobs and transformations, and creating subtransformations
  • Implement business processes by scheduling tasks, checking conditions, organizing files and folders, sending emails with alerts, and treating errors in a way that meets your requirements

In Detail

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.

This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes.
Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.

A practical, easy-to-read guide that gives you a full understanding of the Pentaho Data Integration tool and shows you how to use it to your advantage to manipulate data.

Approach

As part of Packt's Beginner's Guide series, this book focuses on teaching by example. It walks you through every aspect of PDI, giving step-by-step instructions in a friendly style and allowing you to learn in front of your computer, playing with the tool. The extensive use of drawings and screenshots makes the process of learning PDI easy. Throughout the book, numerous tips and helpful hints are provided that you will not find anywhere else.

The book provides short, practical examples and also builds from scratch a small datamart intended to reinforce the learned concepts and to teach you the basics of data warehousing.

Who this book is for

This book is for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions, or, more generally, doing any kind of data manipulation. If you have never used PDI before, this will be a perfect book to start with.

You will find this book a good starting point if you are a database administrator, data warehouse designer, architect, or anyone else who is responsible for data warehouse projects and needs to load data into them.

You don't need to have any prior data warehouse or database experience to read this book. Fundamental database and data warehouse technical terms and concepts are explained in easy-to-understand language.
