Pentaho Data Integration 4 Cookbook

Over 70 recipes to solve ETL problems using Pentaho Kettle

By Adrián Sergio Pulvirenti, María Carina Roldán

Book Details

ISBN 13: 9781849515245
Paperback: 352 pages

About This Book

  • Manipulate your data by exploring, transforming, validating, integrating, and more
  • Work with all kinds of data sources, such as databases, plain files, and XML structures
  • Integrate Kettle with other components of the Pentaho Business Intelligence Suite
  • Each recipe is a carefully organized sequence of instructions packed with screenshots, tables, and tips to complete the task as efficiently as possible

Who This Book Is For

If you are a software developer, or anyone else involved in or interested in developing ETL solutions or, more generally, in doing any kind of data manipulation, this book is for you. It does not cover PDI basics, SQL basics, or database concepts; you are expected to have a basic understanding of the PDI tool, the SQL language, and databases.

Table of Contents

Chapter 1: Working with Databases
Introduction
Connecting to a database
Getting data from a database
Getting data from a database by providing parameters
Getting data from a database by running a query built at runtime
Inserting or updating rows in a table
Inserting new rows where a simple primary key has to be generated
Inserting new rows where the primary key has to be generated based on stored values
Deleting data from a table
Creating or altering a database table from PDI (design time)
Creating or altering a database table from PDI (runtime)
Inserting, deleting, or updating a table depending on a field
Changing the database connection at runtime
Loading a parent-child table
Chapter 2: Reading and Writing Files
Introduction
Reading a simple file
Reading several files at the same time
Reading unstructured files
Reading files having one field per row
Reading files with some fields occupying two or more rows
Writing a simple file
Writing an unstructured file
Providing the name of a file (for reading or writing) dynamically
Using the name of a file (or part of it) as a field
Reading an Excel file
Getting the value of specific cells in an Excel file
Writing an Excel file with several sheets
Writing an Excel file with a dynamic number of sheets
Chapter 3: Manipulating XML Structures
Introduction
Reading simple XML files
Specifying fields by using XPath notation
Validating well-formed XML files
Validating an XML file against DTD definitions
Validating an XML file against an XSD schema
Generating a simple XML document
Generating complex XML structures
Generating an HTML page using XML and XSL transformations
Chapter 4: File Management
Introduction
Copying or moving one or more files
Deleting one or more files
Getting files from a remote server
Putting files on a remote server
Copying or moving a custom list of files
Deleting a custom list of files
Comparing files and folders
Working with ZIP files
Chapter 5: Looking for Data
Introduction
Looking for values in a database table
Looking for values in a database (with complex conditions or multiple tables involved)
Looking for values in a database with extreme flexibility
Looking for values in a variety of sources
Looking for values by proximity
Looking for values consuming a web service
Looking for values over an intranet or Internet
Chapter 6: Understanding Data Flows
Introduction
Splitting a stream into two or more streams based on a condition
Merging rows of two streams with the same or different structures
Comparing two streams and generating differences
Generating all possible pairs formed from two datasets
Joining two or more streams based on given conditions
Interspersing new rows between existing rows
Executing steps even when your stream is empty
Processing rows differently based on the row number
Chapter 7: Executing and Reusing Jobs and Transformations
Introduction
Executing a job or a transformation by setting static arguments and parameters
Executing a job or a transformation from a job by setting arguments and parameters dynamically
Executing a job or a transformation whose name is determined at runtime
Executing part of a job once for every row in a dataset
Executing part of a job several times until a condition is true
Creating a process flow
Moving part of a transformation to a subtransformation
Chapter 8: Integrating Kettle and the Pentaho Suite
Introduction
Creating a Pentaho report with data coming from PDI
Configuring the Pentaho BI Server for running PDI jobs and transformations
Executing a PDI transformation as part of a Pentaho process
Executing a PDI job from the Pentaho User Console
Generating files from the PUC with PDI and the CDA plugin
Populating a CDF dashboard with data coming from a PDI transformation
Chapter 9: Getting the Most Out of Kettle
Introduction
Sending e-mails with attached files
Generating a custom log file
Programming custom functionality
Generating sample data for testing purposes
Working with JSON files
Getting information about transformations and jobs (file-based)
Getting information about transformations and jobs (repository-based)

What You Will Learn

  • Configure Kettle to connect to databases, explore them, and perform CRUD operations
  • Read, write, and parse simple and unstructured files
  • Solve common Excel needs such as reading from a particular cell or generating several sheets at a time
  • Read, validate, and generate simple and complex XML structures
  • Manipulate files by copying, deleting, compressing, or transferring to remote servers
  • Look up information from different sources, such as databases, web services, and spreadsheets
  • Work with data flows performing operations such as joining, merging, or filtering rows
  • Customize the Kettle logs to your needs
  • Embed Java code in your transformations to gain performance and flexibility
  • Execute and reuse transformations and jobs in different ways (a minimal Java sketch follows this list)
  • Integrate Kettle with Pentaho Reporting, Pentaho Dashboards, Community Data Access, and Pentaho BI Platform
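
The bullet on executing and reusing transformations and jobs refers to capabilities that can also be driven from the Kettle Java API. Purely as an illustrative sketch (not code taken from the book), the snippet below shows how a saved transformation might be run from Java with the PDI 4 classes; it assumes the Kettle libraries are on the classpath, and the file name sample_transformation.ktr is only a placeholder.

  // Minimal sketch: running an existing transformation through the Kettle 4 Java API.
  import org.pentaho.di.core.KettleEnvironment;
  import org.pentaho.di.trans.Trans;
  import org.pentaho.di.trans.TransMeta;

  public class RunTransformation {
      public static void main(String[] args) throws Exception {
          // Initialize the Kettle environment (plugins, logging, and so on)
          KettleEnvironment.init();

          // Load the transformation definition from a .ktr file
          // ("sample_transformation.ktr" is a placeholder name)
          TransMeta transMeta = new TransMeta("sample_transformation.ktr");

          // Create the runtime object, execute it, and wait until it finishes
          Trans trans = new Trans(transMeta);
          trans.execute(null);        // null = no command-line arguments
          trans.waitUntilFinished();

          if (trans.getErrors() > 0) {
              System.err.println("The transformation finished with errors.");
          }
      }
  }

Jobs can be launched in a similar way through the JobMeta and Job classes.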

In Detail

Pentaho Data Integration (PDI, also known as Kettle), one of the leading data integration tools, is widely used for all kinds of data manipulation, such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. Do you need quick solutions to the problems you face while using Kettle?

Pentaho Data Integration 4 Cookbook explains Kettle features in detail through clear and practical recipes that you can quickly apply to your solutions. The recipes cover a broad range of topics, including processing files, working with databases, understanding XML structures, integrating with the Pentaho BI Suite, and more.

Pentaho Data Integration 4 Cookbook shows you how to take advantage of all the aspects of Kettle through a set of practical recipes organized to provide quick solutions to your needs. The initial chapters explain the details of working with databases, files, and XML structures. Then you will see different ways of searching for data, executing and reusing jobs and transformations, and manipulating streams. Further, you will learn all the available options for integrating Kettle with other Pentaho tools.

Pentaho Data Integration 4 Cookbook has plenty of recipes with easy step-by-step instructions to accomplish specific tasks. There are examples and code that are ready for adaptation to individual needs.
