Pentaho Data Integration Beginner's Guide - Second Edition

Beginner's Guide
María Carina Roldán

Get up and running with the Pentaho Data Integration tool using this hands-on, easy-to-read guide
eBook: $29.99 (RRP $29.99)
Print + eBook: $49.99 (RRP $49.99)

Book Details

ISBN 13: 9781782165040
Paperback: 502 pages

About This Book

  • Manipulate your data by exploring, transforming, validating, and integrating it
  • Learn to migrate data between applications
  • Explore several features of Pentaho Data Integration 5.0
  • Connect to any database engine, explore the databases, and perform all kinds of operations on them

Who This Book Is For

This book is a must-have for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions or, more generally, doing any kind of data manipulation. Those who have never used Pentaho Data Integration will benefit most from the book, but those who have will also find it useful.

This book is also a good starting point for database administrators, data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into a data warehouse.

Table of Contents

Chapter 1: Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Exploring the Pentaho Demo
Installing PDI
Time for action – installing PDI
Launching the PDI graphical designer – Spoon
Time for action – starting and customizing Spoon
Time for action – creating a hello world transformation
Installing MySQL
Time for action – installing MySQL on Windows
Time for action – installing MySQL on Ubuntu
Summary
Chapter 2: Getting Started with Transformations
Designing and previewing transformations
Time for action – creating a simple transformation and getting familiar with the design process
Running transformations in an interactive fashion
Time for action – generating a range of dates and inspecting the data as it is being created
Handling errors
Time for action – avoiding errors while converting the estimated time from string to integer
Time for action – configuring the error handling to see the description of the errors
Summary
Chapter 3: Manipulating Real-world Data
Reading data from files
Time for action – reading results of football matches from files
Time for action – reading all your files at a time using a single text file input step
Time for action – reading all your files at a time using a single text file input step and regular expressions
Sending data to files
Time for action – sending the results of matches to a plain file
Getting system information
Time for action – reading and writing matches files with flexibility
Time for action – running the matches transformation from a terminal window
XML files
Time for action – getting data from an XML file with information about countries
Summary
Chapter 4: Filtering, Searching, and Performing Other Useful Operations with Data
Sorting data
Time for action – sorting information about matches with the Sort rows step
Calculations on groups of rows
Time for action – calculating football match statistics by grouping data
Filtering
Time for action – counting frequent words by filtering
Time for action – refining the counting task by filtering even more
Looking up data
Time for action – finding out which language people speak
Chapter 5: Controlling the Flow of Data
Splitting streams
Time for action – browsing new features of PDI by copying a dataset
Time for action – assigning tasks by distributing
Splitting the stream based on conditions
Time for action – assigning tasks by filtering priorities with the Filter rows step
Time for action – assigning tasks by filtering priorities with the Switch/Case step
Merging streams
Time for action – gathering progress and merging it all together
Time for action – giving priority to Bouchard by using the Append Stream
Treating invalid data by splitting and merging streams
Time for action – treating errors in the estimated time to avoid discarding rows
Summary
Chapter 6: Transforming Your Data by Coding
Doing simple tasks with the JavaScript step
Time for action – counting frequent words by coding in JavaScript
Reading and parsing unstructured files with JavaScript
Time for action – changing a list of house descriptions with JavaScript
Doing simple tasks with the Java Class step
Time for action – counting frequent words by coding in Java
Transforming the dataset with Java
Time for action – splitting the field to rows using Java
Avoiding coding by using purpose-built steps
Summary
Chapter 7: Transforming the Rowset
Converting rows to columns
Time for action – enhancing the films file by converting rows to columns
Aggregating data with a Row Denormaliser step
Time for action – aggregating football matches data with the Row Denormaliser step
Normalizing data
Time for action – enhancing the matches file by normalizing the dataset
Generating a custom time dimension dataset by using Kettle variables
Time for action – creating the time dimension dataset
Time for action – parameterizing the start and end date of the time dimension dataset
Summary
Chapter 8: Working with Databases
Introducing the Steel Wheels sample database
Time for action – creating a connection to the Steel Wheels database
Time for action – exploring the sample database
Querying a database
Time for action – getting data about shipped orders
Time for action – getting orders in a range of dates using parameters
Time for action – getting orders in a range of dates by using Kettle variables
Sending data to a database
Time for action – loading a table with a list of manufacturers
Time for action – inserting new products or updating existing ones
Time for action – testing the update of existing products
Eliminating data from a database
Time for action – deleting data about discontinued items
Summary
Chapter 9: Performing Advanced Operations with Databases
Preparing the environment
Time for action – populating the Jigsaw database
Looking up data in a database
Time for action – using a Database lookup step to create a list of products to buy
Time for action – using a Database join step to create a list of suggested products to buy
Introducing dimensional modeling
Loading dimensions with data
Time for action – loading a region dimension with a Combination lookup/update step
Time for action – testing the transformation that loads the region dimension
Time for action – keeping a history of changes in products by using the Dimension lookup/update step
Time for action – testing the transformation that keeps history of product changes
Summary
Chapter 10: Creating Basic Task Flows
Introducing PDI jobs
Time for action – creating a folder with a Kettle job
Designing and running jobs
Time for action – creating a simple job and getting familiar with the design process
Running transformations from jobs
Time for action – generating a range of dates and inspecting how things are running
Receiving arguments and parameters in a job
Time for action – generating a hello world file by using arguments and parameters
Running jobs from a terminal window
Time for action – executing the hello world job from a terminal window
Using named parameters and command-line arguments in transformations
Time for action – calling the hello world transformation with fixed arguments and parameters
Deciding between the use of a command-line argument and a named parameter
Summary
Chapter 11: Creating Advanced Transformations and Jobs
Re-using part of your transformations
Time for action – calculating statistics with the use of a subtransformation
Time for action – generating top average scores by copying and getting rows
Iterating jobs and transformations
Time for action – generating custom files by executing a transformation for every input row
Enhancing your processes with the use of variables
Time for action – generating custom messages by setting a variable with the name of the examination file
Summary
Chapter 12: Developing and Implementing a Simple Datamart
Exploring the sales datamart
Loading the dimensions
Time for action – loading the dimensions for the sales datamart
Extending the sales datamart model
Loading a fact table with aggregated data
Time for action – loading the sales fact table by looking up dimensions
Getting facts and dimensions together
Time for action – loading the fact table using a range of dates obtained from the command line
Time for action – loading the SALES star
Automating the administrative tasks
Time for action – automating the loading of the sales datamart
Summary

What You Will Learn

  • Install and get started with Pentaho Data Integration
  • Get started with MySQL
  • Learn the ins and outs of Spoon, the graphical designer tool
  • Transform data in several ways, such as performing simple and complex calculations, cleaning, counting, de-duplicating, filtering, and ordering
  • Learn to get data from all kinds of data sources, such as plain files, Excel spreadsheets, databases, XML files, and more, then preview it and send it back to the same or different destinations
  • Discover how to read and parse unstructured files
  • Embed Java and JavaScript code in your Pentaho Data Integration transformations to enrich the treatment of data
  • Use Pentaho Data Integration to perform CRUD (create, read, update, and delete) operations on databases
  • Learn the basic concepts of data warehousing
  • Populate a data warehouse with Pentaho Data Integration, including loading slowly changing dimensions, junk dimensions, time dimensions, and more
  • Implement business processes that meet your requirements by scheduling tasks, checking conditions, organizing files and folders, running daily processes, treating errors, and so on

In Detail

Capturing, manipulating, cleansing, transferring, and loading data effectively are prime requirements in every IT organization. Achieving these tasks requires either people devoted to developing extensive software programs or an investment in ETL or data integration tools that can simplify this work.

Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing.

"Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration.

"Pentaho Data Integration Beginner's Guide - Second Edition" starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Next, you will be introduced to data warehouse concepts and learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce all the concepts you have learned by implementing a simple datamart.

With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.
