Pentaho 3.2 Data Integration: Beginner's Guide


Product type Book
Published in Apr 2010
Publisher Packt
ISBN-13 9781847199546
Pages 492 pages
Edition 1st Edition

Table of Contents (27) Chapters

Pentaho 3.2 Data Integration Beginner's Guide
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
Getting Started with Pentaho Data Integration
Getting Started with Transformations
Basic Data Manipulation
Controlling the Flow of Data
Transforming Your Data with JavaScript Code and the JavaScript Step
Transforming the Row Set
Validating Data and Handling Errors
Working with Databases
Performing Advanced Operations with Databases
Creating Basic Task Flows
Creating Advanced Transformations and Jobs
Developing and Implementing a Simple Datamart
Taking it Further
Working with Repositories
Pan and Kitchen: Launching Transformations and Jobs from the Command Line
Quick Reference: Steps and Job Entries
Spoon Shortcuts
Introducing PDI 4 Features
Pop Quiz Answers
Index

Preface

Pentaho Data Integration (aka Kettle) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading—better known as the ETL processes. PDI not only serves as an ETL tool, but it's also used for other purposes such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with PDI can be difficult or confusing. This book provides the guidance needed to overcome that difficulty, covering the key features of PDI. Each chapter introduces new features, allowing you to gradually get involved with the tool.

By the end of the book, you will have not only experimented with all kinds of examples, but will also have built a basic but complete datamart with the help of PDI.

How to read this book

Although it is recommended that you read all the chapters, you don't need to. The book allows you to tailor the PDI learning process according to your particular needs.

The first four chapters, along with Chapter 7 and Chapter 10, cover the core concepts. If you don't know PDI and want to learn just the basics, reading those chapters will suffice. If you also need to work with databases, add Chapter 8 to the roadmap.

If you already know the basics, you can improve your PDI knowledge by reading chapters 5, 6, and 11.

Finally, if you already know PDI and want to learn how to use it to load or maintain a datawarehouse or datamart, you will find all that you need in chapters 9 and 12.

While Chapter 13 is useful for anyone who is willing to take it further, all the appendices are valuable resources for anyone who reads this book.

What this book covers

Chapter 1, Getting Started with Pentaho Data Integration serves as the most basic introduction to PDI, presenting the tool. The chapter includes instructions for installing PDI and gives you the opportunity to play with the graphical designer (Spoon). The chapter also includes instructions for installing a MySQL server.

Chapter 2, Getting Started with Transformations introduces one of the basic components of PDI—transformations. Then, it focuses on the explanation of how to work with files. It explains how to get data from simple input sources such as txt, csv, xml, and so on, do a preview of the data, and send the data back to any of these common output formats. The chapter also explains how to read command-line parameters and system information.

Chapter 3, Basic Data Manipulation explains the simplest and most commonly used ways of transforming data, including performing calculations, adding constants, counting, filtering, ordering, and looking for data.

Chapter 4, Controlling the Flow of Data explains the different options that PDI offers to combine or split flows of data.

Chapter 5, Transforming Your Data with JavaScript Code and the JavaScript Step explains how JavaScript coding can help in the treatment of data. It shows why you need to code inside PDI, and explains in detail how to do it.

Chapter 6, Transforming the Row Set explains the ability of PDI to deal with some sophisticated problems, such as normalizing data from pivoted tables, in a simple fashion.

Chapter 7, Validating Data and Handling Errors explains the different options that PDI has to validate data, and how to treat the errors that may appear.

Chapter 8, Working with Databases explains how to use PDI to work with databases. The list of topics covered includes connecting to a database, previewing and getting data, and inserting, updating, and deleting data. As database knowledge is not presumed, the chapter also covers fundamental concepts of databases and the SQL language.

Chapter 9, Performing Advanced Operations with Databases explains how to perform advanced operations with databases, including those specially designed to load datawarehouses. A primer on datawarehouse concepts is also given in case you are not familiar with the subject.

Chapter 10, Creating Basic Task Flows serves as an introduction to processes in PDI. Through the creation of simple jobs, you will learn what jobs are and what they are used for.

Chapter 11, Creating Advanced Transformations and Jobs deals with advanced concepts that will allow you to build complex PDI projects. The list of covered topics includes nesting jobs, iterating on jobs and transformations, and creating subtransformations.

Chapter 12, Developing and Implementing a Simple Datamart presents a simple datamart project, and guides you in building the datamart by using all the concepts learned throughout the book.

Chapter 13, Taking it Further gives a list of PDI best practices and recommendations for going further.

Appendix A, Working with Repositories guides you step by step through the creation of a PDI database repository and then gives instructions for working with it.

Appendix B, Pan and Kitchen: Launching Transformations and Jobs from the Command Line is a quick reference for running transformations and jobs from the command line.

Appendix C, Quick Reference: Steps and Job Entries serves as a quick reference to steps and job entries used throughout the book.

Appendix D, Spoon Shortcuts is an extensive list of Spoon shortcuts useful for saving time when designing and running PDI jobs and transformations.

Appendix E, Introducing PDI 4 Features quickly introduces you to the architectural and functional features included in Kettle 4, the version that was under development while this book was being written.

Appendix F, Pop Quiz Answers, contains answers to pop quiz questions.

What you need for this book

PDI is a multiplatform tool. This means no matter what your operating system is, you will be able to work with the tool. The only prerequisite is to have JVM 1.5 or a higher version installed. It is also useful to have Excel or Calc along with a nice text editor.
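If you want to confirm the Java prerequisite before installing PDI, the check can be scripted. The following is only a minimal sketch in Python, not something taken from the book or from PDI itself: the function names `parse_java_version` and `meets_minimum` are illustrative, and the script assumes a `java` executable on your PATH that prints the usual `version "x.y..."` banner.

```python
import re
import subprocess

# Minimum JVM version named in the text above (JVM 1.5).
MINIMUM = (1, 5)


def parse_java_version(banner):
    """Return (major, minor) from the first quoted version in a
    `java -version` banner, or None if no version is found."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    major = int(match.group(1))
    minor = int(match.group(2) or 0)  # a bare "11" has no minor part
    return (major, minor)


def meets_minimum(version, minimum=MINIMUM):
    """Compare version tuples; a missing version never qualifies."""
    return version is not None and version >= minimum


if __name__ == "__main__":
    try:
        # `java -version` writes its banner to stderr, not stdout.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
        found = parse_java_version(result.stderr + result.stdout)
    except OSError:
        # No `java` executable could be started.
        found = None

    if meets_minimum(found):
        print("Java %d.%d found: prerequisite met." % found)
    else:
        print("No suitable Java found: install JVM 1.5 or higher.")
```

The check only compares version numbers. It says nothing about whether a particular PDI release has been tested against a given Java release.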

Having an Internet connection while reading is extremely useful as well. Several links are provided throughout the book that complement what is explained. In addition, there is the PDI forum, where you can search for answers or post questions if you are stuck with something.

Who this book is for

This book is for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions or, more generally, doing any kind of data manipulation. If you have never used PDI before, this will be a perfect book to start with.

You will find this book to be a good starting point if you are a database administrator, a data warehouse designer, an architect, or any person who is responsible for data warehouse projects and needs to load data into them.

You don't need to have any prior data warehouse or database experience to read this book. Fundamental database and data warehouse technical terms and concepts are explained in an easy-to-understand language.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "You read the examination.txt file, and did some calculations to see how the students did."

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in our text like this: "Edit the Sort rows step by double-clicking it, click the Get Fields button, and adjust the grid."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send us an email and mention the book title in the subject of your message.

If there is a book that you need and would like to see us publish, please send us a note through the SUGGEST A TITLE form on www.packtpub.com or by email.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Tip

Downloading the example code for the book

Visit http://www.packtpub.com/files/code/9546_Code.zip to directly download the example code.

The downloadable files contain instructions on how to use them.

Errata

Although we have taken every care to ensure the accuracy of our contents, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in text or code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration, and help us to improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the let us know link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata added to any list of existing errata. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us if you are having a problem with any aspect of the book, and we will do our best to address it.
