
Pentaho Data Integration Cookbook - Second Edition

Alex Meadows, Adrián Sergio Pulvirenti, María Carina Roldán

The premier open source ETL tool is at your command with this recipe-packed cookbook. Learn to use data sources in Kettle, avoid pitfalls, and dig out the advanced features of Pentaho Data Integration the easy way.
RRP $29.99 eBook
RRP $49.99 Print + eBook


Book Details

ISBN 13: 9781783280674
Paperback: 462 pages

About This Book

  • Integrate Kettle with other components of the Pentaho Business Intelligence Suite to build and publish Mondrian schemas, create reports, and populate dashboards
  • This book contains an organized sequence of recipes packed with screenshots, tables, and tips so you can complete the tasks as efficiently as possible
  • Manipulate your data by exploring, transforming, validating, integrating, and performing data analysis

Who This Book Is For

Pentaho Data Integration Cookbook Second Edition is designed for developers who are familiar with the basics of Kettle but who wish to move up to the next level. It is also aimed at advanced users who want to learn how to use the new features of PDI, as well as best practices for working with Kettle.

Table of Contents

Chapter 1: Working with Databases
Connecting to a database
Getting data from a database
Getting data from a database by providing parameters
Getting data from a database by running a query built at runtime
Inserting or updating rows in a table
Inserting new rows where a simple primary key has to be generated
Inserting new rows where the primary key has to be generated based on stored values
Deleting data from a table
Creating or altering a database table from PDI (design time)
Creating or altering a database table from PDI (runtime)
Inserting, deleting, or updating a table depending on a field
Changing the database connection at runtime
Loading a parent-child table
Building SQL queries via database metadata
Performing repetitive database design tasks from PDI
Chapter 2: Reading and Writing Files
Reading a simple file
Reading several files at the same time
Reading semi-structured files
Reading files having one field per row
Reading files with some fields occupying two or more rows
Writing a simple file
Writing a semi-structured file
Providing the name of a file (for reading or writing) dynamically
Using the name of a file (or part of it) as a field
Reading an Excel file
Getting the value of specific cells in an Excel file
Writing an Excel file with several sheets
Writing an Excel file with a dynamic number of sheets
Reading data from an AWS S3 Instance
Chapter 3: Working with Big Data and Cloud Sources
Loading data into
Getting data from
Loading data into Hadoop
Getting data from Hadoop
Loading data into HBase
Getting data from HBase
Loading data into MongoDB
Getting data from MongoDB
Chapter 4: Manipulating XML Structures
Reading simple XML files
Specifying fields by using the Path notation
Validating well-formed XML files
Validating an XML file against DTD definitions
Validating an XML file against an XSD schema
Generating a simple XML document
Generating complex XML structures
Generating an HTML page using XML and XSL transformations
Reading an RSS Feed
Generating an RSS Feed
Chapter 5: File Management
Copying or moving one or more files
Deleting one or more files
Getting files from a remote server
Putting files on a remote server
Copying or moving a custom list of files
Deleting a custom list of files
Comparing files and folders
Working with ZIP files
Encrypting and decrypting files
Chapter 6: Looking for Data
Looking for values in a database table
Looking for values in a database with complex conditions
Looking for values in a database with dynamic queries
Looking for values in a variety of sources
Looking for values by proximity
Looking for values by using a web service
Looking for values over intranet or the Internet
Validating data at runtime
Chapter 7: Understanding and Optimizing Data Flows
Splitting a stream into two or more streams based on a condition
Merging rows of two streams with the same or different structures
Adding checksums to verify datasets
Comparing two streams and generating differences
Generating all possible pairs formed from two datasets
Joining two or more streams based on given conditions
Interspersing new rows between existent rows
Executing steps even when your stream is empty
Processing rows differently based on the row number
Processing data into shared transformations via filter criteria and subtransformations
Altering a data stream with Select values
Processing multiple jobs or transformations in parallel
Chapter 8: Executing and Re-using Jobs and Transformations
Launching jobs and transformations
Executing a job or a transformation by setting static arguments and parameters
Executing a job or a transformation from a job by setting arguments and parameters dynamically
Executing a job or a transformation whose name is determined at runtime
Executing part of a job once for every row in a dataset
Executing part of a job several times until a condition is true
Moving part of a transformation to a subtransformation
Using Metadata Injection to re-use transformations
Chapter 9: Integrating Kettle and the Pentaho Suite
Creating a Pentaho report with data coming from PDI
Creating a Pentaho report directly from PDI
Configuring the Pentaho BI Server for running PDI jobs and transformations
Executing a PDI transformation as part of a Pentaho process
Executing a PDI job from the Pentaho User Console
Generating files from the PUC with PDI and the CDA plugin
Populating a CDF dashboard with data coming from a PDI transformation
Chapter 10: Getting the Most Out of Kettle
Sending e-mails with attached files
Generating a custom logfile
Running commands on another server
Programming custom functionality
Generating sample data for testing purposes
Working with JSON files
Getting information about transformations and jobs (file-based)
Getting information about transformations and jobs (repository-based)
Using Spoon's built-in optimization tools
Chapter 11: Utilizing Visualization Tools in Kettle
Managing plugins with the Marketplace
Data profiling with DataCleaner
Visualizing data with AgileBI
Using Instaview to analyze and visualize data
Chapter 12: Data Analytics
Reading data from a SAS datafile
Studying data via stream statistics
Building a random data sample for Weka

What You Will Learn

  • Configure Kettle to connect to relational and NoSQL databases and web applications like Salesforce, explore them, and perform CRUD operations
  • Utilize plugins to get even more functionality into your Kettle jobs
  • Embed Java code in your transformations to gain performance and flexibility
  • Execute and reuse transformations and jobs in different ways
  • Integrate Kettle with Pentaho Reporting, Pentaho Dashboards, Community Data Access, and the Pentaho BI Platform
  • Interface Kettle with cloud-based applications
  • Learn how to control and manipulate data flows
  • Utilize Kettle to create datasets for analytics

In Detail

Pentaho Data Integration is the premier open source ETL tool, providing easy, fast, and effective ways to move and transform data. While PDI is relatively easy to pick up, it can take time to learn the best practices so you can design your transformations to process data faster and more efficiently. If you are looking for clear and practical recipes that will advance your skills in Kettle, then this is the book for you.

Pentaho Data Integration Cookbook Second Edition explains the features of Kettle in detail and provides easy-to-follow recipes on file management and databases that can throw a curveball to even the most experienced developers.

Pentaho Data Integration Cookbook Second Edition provides updates to the material covered in the first edition as well as new recipes that show you how to use some of the key features of PDI that have been released since the publication of the first edition. You will learn how to work with various data sources – from relational and NoSQL databases, flat files, XML files, and more. The book will also cover best practices that you can take advantage of immediately within your own solutions, like building reusable code, data quality, and plugins that can add even more functionality.

Pentaho Data Integration Cookbook Second Edition provides recipes that cover the common pitfalls even seasoned developers can find themselves facing. You will also learn how to use various data sources in Kettle, as well as its advanced features.
